Managing annotations

Warning

Never run two of the following commands in parallel. They must all be run sequentially; otherwise, the index may get corrupted.

If you need to parallelize the processing to speed it up, use the --threads option, which is built into all of our tools that might require it.
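As an illustration, the sequential rule can be enforced from a small Python wrapper. This is only a sketch: the helper name run_sequentially and the paths are placeholders, and the commands use only options listed in the help messages on this page.

```python
import subprocess


def run_sequentially(commands, runner=subprocess.run):
    """Run each command only after the previous one has finished."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so the remaining commands are skipped after a failure.
        runner(command, check=True)


if __name__ == "__main__":
    dataset = "/path/to/dataset"  # placeholder path
    run_sequentially([
        # Parallelism comes from --threads, not from concurrent processes.
        ["child-project", "import-annotations", dataset,
         "--annotations", "/path/to/dataframe.csv", "--threads", "4"],
        ["child-project", "rename-annotations", dataset,
         "--set", "vtc", "--new-set", "vtc_1"],
    ])
```

Each call blocks until the command exits, so two commands never touch the index at the same time.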

Importation

Single annotation importation

Annotations can be imported one by one or in bulk. Annotation importation does the following:

Converts all input annotations from their original format (e.g. rttm, eaf, textgrid) into the CSV format defined at format-input-annotations and stores them in annotations/.
Registers them in the annotation index at metadata/annotations.csv.

Use child-project import-annotations to import a single annotation.

$ child-project import-annotations /path/to/dataset --help
usage: child-project import-annotations [-h] [--annotations ANNOTATIONS]
                                        [--threads THREADS] [--set SET]
                                        [--recording_filename RECORDING_FILENAME]
                                        [--time_seek TIME_SEEK]
                                        [--range_onset RANGE_ONSET]
                                        [--range_offset RANGE_OFFSET]
                                        [--raw_filename RAW_FILENAME]
                                        [--format {csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA}]
                                        [--filter FILTER]
                                        source

convert and import a set of annotations

positional arguments:
  source                project path

optional arguments:
  -h, --help            show this help message and exit
  --annotations ANNOTATIONS
                        path to input annotations dataframe (csv) [only for
                        bulk importation]
  --threads THREADS     amount of threads to run on
  --set SET             name of the annotation set (e.g. VTC, annotator1,
                        etc.)
  --recording_filename RECORDING_FILENAME
                        recording filename as specified in the recordings
                        index
  --time_seek TIME_SEEK
                        shift between the timestamps in the raw input
                        annotations and the actual corresponding timestamps in
                        the recordings (in milliseconds)
  --range_onset RANGE_ONSET
                        covered range onset timestamp in milliseconds (since
                        the start of the recording)
  --range_offset RANGE_OFFSET
                        covered range offset timestamp in milliseconds (since
                        the start of the recording)
  --raw_filename RAW_FILENAME
                        annotation input filename location, relative to
                        `annotations/<set>/raw`
  --format {csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA}
                        input annotation format
  --filter FILTER       source file to filter in (for rttm and alice only)

Example:

child-project import-annotations /path/to/dataset \
   --set eaf \
   --recording_filename sound.wav \
   --time_seek 0 \
   --raw_filename example.eaf \
   --range_onset 0 \
   --range_offset 300 \
   --format eaf

For more information about the allowed values for each parameter, see format-input-annotations.

Bulk importation

To import many annotation files at once, pass an input dataframe with the --annotations option:

child-project import-annotations /path/to/dataset --annotations /path/to/dataframe.csv

The input dataframe /path/to/dataframe.csv must have one entry per annotation to import, according to the format specified at format-input-annotations.
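A minimal sketch of how such a dataframe could be generated with Python's standard csv module follows. The column names are an assumption: they mirror the options of the single-annotation command above, and the authoritative list is the one given at format-input-annotations. The helper name bulk_dataframe_csv and the second row's file names are hypothetical.

```python
import csv
import io

# Assumed column names, mirroring the single-import options shown above.
COLUMNS = ["set", "recording_filename", "time_seek", "range_onset",
           "range_offset", "raw_filename", "format"]


def bulk_dataframe_csv(rows):
    """Return CSV text with one row per annotation file to import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    # Same values as the single-annotation example above.
    {"set": "eaf", "recording_filename": "sound.wav", "time_seek": 0,
     "range_onset": 0, "range_offset": 300,
     "raw_filename": "example.eaf", "format": "eaf"},
    # Hypothetical second recording and annotation file.
    {"set": "eaf", "recording_filename": "sound2.wav", "time_seek": 0,
     "range_onset": 0, "range_offset": 300,
     "raw_filename": "example2.eaf", "format": "eaf"},
]

if __name__ == "__main__":
    # newline="" lets the csv module control line endings.
    with open("dataframe.csv", "w", newline="") as handle:
        handle.write(bulk_dataframe_csv(rows))
```

The resulting file can then be passed to child-project import-annotations through --annotations.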

Rename a set of annotations

This moves the annotation files themselves and updates the index (metadata/annotations.csv) accordingly.

$ child-project rename-annotations /path/to/dataset --help
usage: child-project rename-annotations [-h] --set SET --new-set NEW_SET
                                        [--recursive] [--ignore-errors]
                                        source

rename a set of annotations by moving the files and updating the index
accordingly

positional arguments:
  source             project path

optional arguments:
  -h, --help         show this help message and exit
  --set SET          set to rename
  --new-set NEW_SET  new name for the set
  --recursive        enable recursive mode
  --ignore-errors    proceed despite errors

Example:

child-project rename-annotations /path/to/dataset --set vtc --new-set vtc_1

Remove a set of annotations

This will delete the converted annotations associated with a given set and remove them from the index.

$ child-project remove-annotations /path/to/dataset --help
usage: child-project remove-annotations [-h] --set SET [--recursive] source

remove converted annotations of a given set and their entries in the index

positional arguments:
  source       project path

optional arguments:
  -h, --help   show this help message and exit
  --set SET    set to remove
  --recursive  enable recursive mode

Example:

child-project remove-annotations /path/to/dataset --set vtc

ITS annotations anonymization

LENA .its files might contain information that could help identify the participants, which may be undesirable. This command anonymizes .its files, based on a routine by HomeBank.

$ child-project anonymize /path/to/dataset --help
usage: child-project anonymize [-h] --input-set INPUT_SET --output-set
                               OUTPUT_SET
                               [--replacements-json-dict REPLACEMENTS_JSON_DICT]
                               path

Anonymize a set of its annotations (`input_set`) and saves it as `output_set`.

positional arguments:
  path                  project path

optional arguments:
  -h, --help            show this help message and exit
  --input-set INPUT_SET
                        input annotation set
  --output-set OUTPUT_SET
                        output annotation set
  --replacements-json-dict REPLACEMENTS_JSON_DICT
                        path to the replacements configuration (json dict)

Example:

child-project anonymize /path/to/dataset --input-set lena --output-set lena/anonymous

Merge annotation sets

Some processing tools use pre-existing annotations as input and label the original segments with more information. This is typically the case for ALICE, which labels segments generated by the VTC. In this case, one might want to merge the ALICE and VTC annotations together. This can be done with child-project merge-annotations.

$ child-project merge-annotations /path/to/dataset --help
usage: child-project merge-annotations [-h] --left-set LEFT_SET --right-set
                                       RIGHT_SET --left-columns LEFT_COLUMNS
                                       --right-columns RIGHT_COLUMNS
                                       --output-set OUTPUT_SET
                                       [--threads THREADS]
                                       source

merge segments sharing identical onset and offset from two sets of annotations

positional arguments:
  source                project path

optional arguments:
  -h, --help            show this help message and exit
  --left-set LEFT_SET   left set
  --right-set RIGHT_SET
                        right set
  --left-columns LEFT_COLUMNS
                        comma-separated columns to merge from the left set
  --right-columns RIGHT_COLUMNS
                        comma-separated columns to merge from the right set
  --output-set OUTPUT_SET
                        name of the output set
  --threads THREADS     amount of threads to run on (default: 1)

Example:

child-project merge-annotations /path/to/dataset \
--left-set vtc \
--right-set alice \
--left-columns speaker_id,ling_type,speaker_type,vcm_type,lex_type,mwu_type,addressee,transcription \
--right-columns phonemes,syllables,words \
--output-set alice_vtc
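To make the merge rule concrete, the sketch below pairs segments that share the same onset and offset and keeps the requested columns from each side. It is an illustration in plain Python, not the code behind child-project merge-annotations: the segment_onset and segment_offset keys, the merge_segments helper and the sample values are assumptions, and dropping unmatched segments is a choice made for this sketch only.

```python
def merge_segments(left, right, left_columns, right_columns):
    """Join two lists of segments on identical (onset, offset) pairs."""
    # Index the right-hand segments by their time span for direct lookup.
    right_by_span = {
        (seg["segment_onset"], seg["segment_offset"]): seg for seg in right
    }
    merged = []
    for seg in left:
        span = (seg["segment_onset"], seg["segment_offset"])
        match = right_by_span.get(span)
        if match is None:
            continue  # no segment with the same span in the right set
        row = {"segment_onset": span[0], "segment_offset": span[1]}
        row.update({column: seg[column] for column in left_columns})
        row.update({column: match[column] for column in right_columns})
        merged.append(row)
    return merged


# Made-up sample data standing in for the vtc and alice sets.
vtc = [
    {"segment_onset": 0, "segment_offset": 1000, "speaker_type": "FEM"},
    {"segment_onset": 1000, "segment_offset": 2000, "speaker_type": "CHI"},
]
alice = [
    {"segment_onset": 0, "segment_offset": 1000, "words": 3},
]

merged = merge_segments(vtc, alice, ["speaker_type"], ["words"])
```

Here only the first VTC segment has an ALICE counterpart with the same onset and offset, so merged contains a single row carrying both speaker_type and words.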

Intersect annotations

In order to combine annotations from different annotators, or to compare them, it is necessary to calculate which portions of the audio have been annotated by all of them. This can be done from the command-line interface:

$ child-project intersect-annotations /path/to/dataset --help
usage: child-project intersect-annotations [-h] --destination DESTINATION
                                           --sets SETS [SETS ...]
                                           [--annotations ANNOTATIONS]
                                           source

calculate the intersection of the annotations belonging to the given sets

positional arguments:
  source                project path

optional arguments:
  -h, --help            show this help message and exit
  --destination DESTINATION
                        output CSV dataframe destination
  --sets SETS [SETS ...]
                        annotation sets to intersect
  --annotations ANNOTATIONS
                        path a custom input CSV dataframe of annotations to
                        intersect. By default, the whole index of the project
                        will be used.

Example:

child-project intersect-annotations /path/to/dataset \
--sets its textgrid/annotator1 textgrid/annotator2 textgrid/annotator3 \
--destination intersection.csv

The output dataframe has the same format as the annotations index (see Annotations index).
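The idea behind the intersection can be sketched for a single recording as follows. This is a simplified illustration, not the implementation used by child-project intersect-annotations: the helper names, the (onset, offset) tuples in milliseconds and the sample ranges are all assumptions.

```python
def intersect_two(first, second):
    """Return the overlapping parts of two lists of (onset, offset) ranges."""
    overlaps = []
    for first_onset, first_offset in first:
        for second_onset, second_offset in second:
            onset = max(first_onset, second_onset)
            offset = min(first_offset, second_offset)
            if onset < offset:  # keep non-empty overlaps only
                overlaps.append((onset, offset))
    return sorted(overlaps)


def intersect_ranges(ranges_by_set):
    """Return the ranges covered by every annotation set."""
    all_ranges = list(ranges_by_set.values())
    if not all_ranges:
        return []
    common = all_ranges[0]
    for other in all_ranges[1:]:
        common = intersect_two(common, other)
    return common


# Made-up covered ranges (in milliseconds) for one recording.
covered = {
    "its": [(0, 3600000)],
    "textgrid/annotator1": [(0, 300000), (600000, 900000)],
    "textgrid/annotator2": [(150000, 700000)],
}

common = intersect_ranges(covered)
```

With these sample ranges, only 150000 to 300000 and 600000 to 700000 are covered by all three sets.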