Managing annotations

Warning

You should never run two of the following commands in parallel. All of them need to be run sequentially, otherwise the index may get corrupted.

If you need to parallelize the processing to speed it up, use the --threads option, which is built into every tool that may need it.

Importation

Importing annotations into a dataset means taking annotation files in various formats, storing them in the raw folder of an annotation set (annotations/<setname>/raw) and providing the metadata of the set in the metannots.yml file (annotations/<setname>/metannots.yml). The importation then links those annotation files to stretches of the recordings of the dataset and creates a standardized csv of the annotations in the converted folder (annotations/<setname>/converted).

For more information on the metadata of annotation sets, read the Annotation sets metadata description. The metadata file metannots.yml can be created afterwards; it is taken into account without a new importation.
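The folder layout described above can be summarized with a short Python sketch. The helper name annotation_set_paths is hypothetical and is not part of the package; only the paths themselves come from this page.

```python
from pathlib import PurePosixPath


def annotation_set_paths(dataset, set_name):
    """Return the locations used by one annotation set (hypothetical helper)."""
    set_dir = PurePosixPath(dataset) / "annotations" / set_name
    return {
        "raw": set_dir / "raw",                 # original annotation files
        "converted": set_dir / "converted",     # standardized csv files
        "metadata": set_dir / "metannots.yml",  # metadata of the set
    }


paths = annotation_set_paths("/path/to/dataset", "vtc")
for role, path in paths.items():
    print(f"{role}: {path}")
```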

Single annotation importation

Annotations can be imported one by one, in bulk or through the automated command. Annotation importation does the following:

Converts all input annotations from their original format (e.g. rttm, eaf, textgrid) into the CSV format defined at Annotation importation input format and stores them in annotations/.
Registers them in the annotation index at metadata/annotations.csv.

Use child-project import-annotations to import a single annotation.

$ child-project import-annotations /path/to/dataset --help
usage: child-project import-annotations [-h] [--annotations ANNOTATIONS]
                                        [--set SET]
                                        [--recording_filename RECORDING_FILENAME]
                                        [--time_seek TIME_SEEK]
                                        [--range_onset RANGE_ONSET]
                                        [--range_offset RANGE_OFFSET]
                                        [--raw_filename RAW_FILENAME]
                                        [--format {csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA}]
                                        [--filter FILTER] [--threads THREADS]
                                        [--overwrite-existing]
                                        source

convert and import a set of annotations

positional arguments:
  source                project path

optional arguments:
  -h, --help            show this help message and exit
  --annotations ANNOTATIONS
                        path to input annotations dataframe (csv) [only for
                        bulk importation]
  --set SET             name of the annotation set (e.g. VTC, annotator1,
                        etc.)
  --recording_filename RECORDING_FILENAME
                        recording filename as specified in the recordings
                        index
  --time_seek TIME_SEEK
                        shift between the timestamps in the raw input
                        annotations and the actual corresponding timestamps in
                        the recordings (in milliseconds)
  --range_onset RANGE_ONSET
                        covered range onset timestamp in milliseconds (since
                        the start of the recording)
  --range_offset RANGE_OFFSET
                        covered range offset timestamp in milliseconds (since
                        the start of the recording)
  --raw_filename RAW_FILENAME
                        annotation input filename location, relative to
                        `annotations/<set>/raw`
  --format {csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA}
                        input annotation format
  --filter FILTER       source file to target. this field is dedicated to rttm
                        and ALICE annotations that may combine annotations
                        from several recordings into one same text file.
  --threads THREADS     amount of threads to run on
  --overwrite-existing, --ow
                        overwrites existing annotation file if should generate
                        the same output file (useful when reimporting

Example:

child-project import-annotations /path/to/dataset \
   --set eaf \
   --recording_filename sound.wav \
   --time_seek 0 \
   --raw_filename example.eaf \
   --range_onset 0 \
   --range_offset 300 \
   --format eaf

For more information about the allowed values for each parameter, see Annotation importation input format.
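To make the timing options more concrete, here is a minimal Python sketch of one plausible reading of their descriptions: time_seek is added to the raw timestamps, and the result is clipped to the covered range. The function place_segment is hypothetical, and the tool's exact behaviour may differ from this reading.

```python
def place_segment(onset, offset, time_seek, range_onset, range_offset):
    """Shift a raw segment by time_seek, then clip it to the covered range.

    All values are in milliseconds. Returns None when nothing of the
    segment falls inside the covered range. This is an assumed reading
    of the option descriptions, not the tool's implementation.
    """
    onset += time_seek
    offset += time_seek
    onset = max(onset, range_onset)
    offset = min(offset, range_offset)
    if onset >= offset:
        return None
    return onset, offset


print(place_segment(100, 250, time_seek=0, range_onset=0, range_offset=300))
print(place_segment(250, 400, time_seek=0, range_onset=0, range_offset=300))
print(place_segment(100, 250, time_seek=1000, range_onset=0, range_offset=300))
```

With the values of the example above (time_seek 0, range 0 to 300), a raw segment from 250 to 400 would be cut at 300 under this reading.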

Bulk importation

To import many annotation files at once, pass a dataframe describing them with the --annotations option:

child-project import-annotations /path/to/dataset --annotations /path/to/dataframe.csv

The input dataframe /path/to/dataframe.csv must have one entry per annotation to import, according to the format specified at Annotation importation input format.
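Such a dataframe can be produced with any tool that writes CSV. The Python sketch below assumes that the column names mirror the command-line options of import-annotations; check Annotation importation input format for the authoritative list. The second row (sound.rttm, range_offset 60000) is made up for illustration.

```python
import csv
import io

# Assumed column names: they mirror the options of import-annotations.
COLUMNS = [
    "set", "recording_filename", "time_seek",
    "range_onset", "range_offset", "raw_filename", "format",
]

rows = [
    # Same values as the single-importation example above.
    {"set": "eaf", "recording_filename": "sound.wav", "time_seek": 0,
     "range_onset": 0, "range_offset": 300,
     "raw_filename": "example.eaf", "format": "eaf"},
    # Illustrative values only.
    {"set": "vtc", "recording_filename": "sound.wav", "time_seek": 0,
     "range_onset": 0, "range_offset": 60000,
     "raw_filename": "sound.rttm", "format": "vtc_rttm"},
]


def write_dataframe(rows, stream):
    """Write one line per annotation to import, preceded by a header."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_dataframe(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the destination with open(path, "w", newline="") and pass the file object as stream.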

Automated importation

The automated method is mostly used for automated annotations. It assumes a number of parameters at importation, so that the usual importations can be performed without additional input. The command assumes the following:

- the annotation files cover the entirety of the audio they annotate (equivalent to range_onset 0 and range_offset <duration of rec>)
- the timestamps in the annotation files are not offset compared to the recording (equivalent to time_seek 0)
- the annotation files are named exactly like the recording they annotate (including the folder they are in) except for the extension, which depends on the format (equivalent to recording_filename = annotation_filename + extension)
- the format is the same for all files and must be given in the call; it determines the extension of all the annotation files
- the set to import is the same for all files and must be given in the call
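The naming assumption can be illustrated with a small Python sketch that derives, for a given recording, where its annotation file would be expected. The helper expected_annotation and the EXTENSIONS table are hypothetical. Only the .rttm extension for vtc_rttm is stated on this page; the .eaf entry is an assumption.

```python
from pathlib import PurePosixPath

# Extension implied by each format.
# vtc_rttm -> .rttm is documented on this page; eaf -> .eaf is assumed.
EXTENSIONS = {"vtc_rttm": ".rttm", "eaf": ".eaf"}


def expected_annotation(recording_filename, set_name, fmt):
    """Return the raw annotation path expected for one recording."""
    # Same name and folder as the recording, with the format's extension.
    raw_filename = PurePosixPath(recording_filename).with_suffix(EXTENSIONS[fmt])
    return PurePosixPath("annotations") / set_name / "raw" / raw_filename


print(expected_annotation("session1/sound.wav", "vtc", "vtc_rttm"))
# annotations/vtc/raw/session1/sound.rttm
```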

$ child-project automated-import . --help
usage: child-project automated-import [-h] --set SET --format
                                      {csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA}
                                      [--threads THREADS]
                                      [--overwrite-existing]
                                      source

convert and import a set of automated annotations covering the entire
recording

positional arguments:
  source                project path

optional arguments:
  -h, --help            show this help message and exit
  --set SET             set of annotations to import
  --format {csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA}
                        input annotation format
  --threads THREADS     amount of threads to run on
  --overwrite-existing, --ow
                        overwrites existing annotation file if should generate
                        the same output file (useful when reimporting

# import the vtc set using the vtc_rttm format; all annotation files must have the .rttm extension
child-project automated-import . --set vtc --format vtc_rttm

Rename a set of annotations

This moves the annotation files themselves and updates the index (metadata/annotations.csv) accordingly.

$ child-project rename-annotations /path/to/dataset --help
usage: child-project rename-annotations [-h] --set SET --new-set NEW_SET
                                        [--recursive] [--ignore-errors]
                                        source

rename a set of annotations by moving the files and updating the index
accordingly

positional arguments:
  source             project path

optional arguments:
  -h, --help         show this help message and exit
  --set SET          set to rename
  --new-set NEW_SET  new name for the set
  --recursive        enable recursive mode
  --ignore-errors    proceed despite errors

Example:

child-project rename-annotations /path/to/dataset --set vtc --new-set vtc_1

Remove a set of annotations

This deletes the converted annotations associated with a given set and removes them from the index.

$ child-project remove-annotations /path/to/dataset --help
usage: child-project remove-annotations [-h] --set SET [--recursive] source

remove converted annotations of a given set and their entries in the index

positional arguments:
  source       project path

optional arguments:
  -h, --help   show this help message and exit
  --set SET    set to remove
  --recursive  enable recursive mode

child-project remove-annotations /path/to/dataset --set vtc

Derive annotations

This command derives a new set of annotations (or adds new lines) by extracting information from an existing set of annotations. A number of derivations are available in the package; other derivations can be defined by the user through the Python API.

$ child-project derive-annotations /path/to/dataset --help
usage: child-project derive-annotations [-h] --input-set INPUT_SET
                                        --output-set OUTPUT_SET
                                        [--threads THREADS]
                                        [--overwrite-existing]
                                        source
                                        {acoustics,conversations,remove-overlaps,cva}

derive a set of annotations

positional arguments:
  source                project path
  {acoustics,conversations,remove-overlaps,cva}
                        Type of derivation

optional arguments:
  -h, --help            show this help message and exit
  --input-set INPUT_SET, -i INPUT_SET
                        input set
  --output-set OUTPUT_SET, -o OUTPUT_SET
                        output set
  --threads THREADS     amount of threads to run on
  --overwrite-existing, --ow
                        overwrites existing annotation file when deriving
                        (useful when reimporting), False by default

child-project derive-annotations . conversations --input-set vtc --output-set vtc/conversations

The following derivations exist:

Derivation	Description
acoustics	Based on the existing segmentation, extracts acoustic features of each identified vocalization: mean pitch semitone, median pitch semitone, and the 5th and 95th percentiles of pitch semitone.
conversations	Based on the given interval (iti, maximum time elapsed after the end of an utterance for the next one to be considered an interaction) and delay (minimum time elapsed after the start of an utterance for the next one to be considered an interaction), classifies whether each segment is an interaction with the previous one (column is_CT, i.e. is conversational turn). Then adds a column grouping the vocalizations that belong to the same conversation (conv_count).
remove-overlaps	Cuts the segments to discard any part that has overlapping speech, resulting in a segmentation with no overlap of speech. Parts that contained overlapping speech therefore appear empty of any speech.
cva	Takes a dataframe of annotation segments as input and, based on the given iti (inter turn interval) and scenario (permissive or restrictive), classifies whether each annotation is targeted to the key child or overheard. The result is stored in the column cva (child vocalization adjacent): Y means the vocalization is in an interaction with the key child, N means it is not in direct interaction with the key child.
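As an illustration of the remove-overlaps row, here is a minimal Python sketch of the idea: every stretch of time covered by exactly one segment is kept, and stretches covered by two or more segments are discarded. It is not the package's implementation, and the (onset, offset, speaker) tuples are a simplified stand-in for the real annotation columns.

```python
def remove_overlaps(segments):
    """Keep only the parts of each segment that no other segment overlaps.

    segments is a list of (onset, offset, speaker) tuples, in milliseconds.
    """
    # Every onset and offset is a point where the set of active segments changes.
    bounds = sorted({t for onset, offset, _ in segments for t in (onset, offset)})
    pieces = []  # (onset, offset, index of the source segment)
    for start, end in zip(bounds, bounds[1:]):
        covering = [
            i for i, (onset, offset, _) in enumerate(segments)
            if onset <= start and end <= offset
        ]
        if len(covering) != 1:
            continue  # silence (0 segments) or overlapping speech (2 or more)
        index = covering[0]
        if pieces and pieces[-1][1] == start and pieces[-1][2] == index:
            # Contiguous piece of the same segment: extend it.
            pieces[-1] = (pieces[-1][0], end, index)
        else:
            pieces.append((start, end, index))
    return [(onset, offset, segments[index][2]) for onset, offset, index in pieces]


segments = [(0, 1000, "CHI"), (800, 1500, "FEM"), (2000, 2500, "CHI")]
print(remove_overlaps(segments))
# [(0, 800, 'CHI'), (1000, 1500, 'FEM'), (2000, 2500, 'CHI')]
```

The stretch from 800 to 1000, where both speakers are active, no longer belongs to any segment.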

ITS annotations anonymization

LENA .its files might contain information that can help recover the identity of the participants, which may be undesired. This command anonymizes .its files, based on a routine by HomeBank.

$ child-project anonymize /path/to/dataset --help
usage: child-project anonymize [-h] --input-set INPUT_SET --output-set
                               OUTPUT_SET
                               [--replacements-json-dict REPLACEMENTS_JSON_DICT]
                               path

Anonymize a set of its annotations (`input_set`) and saves it as `output_set`.

positional arguments:
  path                  project path

optional arguments:
  -h, --help            show this help message and exit
  --input-set INPUT_SET
                        input annotation set
  --output-set OUTPUT_SET
                        output annotation set
  --replacements-json-dict REPLACEMENTS_JSON_DICT
                        path to the replacements configuration (json dict)

child-project anonymize /path/to/dataset --input-set lena --output-set lena/anonymous

Merge annotation sets

Some processing tools use pre-existing annotations as an input, and label the original segments with more information. This is typically the case for ALICE, which labels segments generated by the VTC. In this case, one might want to merge the ALICE and VTC annotations together. This can be done with child-project merge-annotations.

$ child-project merge-annotations /path/to/dataset --help
usage: child-project merge-annotations [-h] --left-set LEFT_SET --right-set
                                       RIGHT_SET --left-columns LEFT_COLUMNS
                                       --right-columns RIGHT_COLUMNS
                                       --output-set OUTPUT_SET
                                       [--threads THREADS]
                                       source

merge segments sharing identical onset and offset from two sets of annotations

positional arguments:
  source                project path

optional arguments:
  -h, --help            show this help message and exit
  --left-set LEFT_SET   left set
  --right-set RIGHT_SET
                        right set
  --left-columns LEFT_COLUMNS
                        comma-separated columns to merge from the left set
  --right-columns RIGHT_COLUMNS
                        comma-separated columns to merge from the right set
  --output-set OUTPUT_SET
                        name of the output set
  --threads THREADS     amount of threads to run on (default: 1)

child-project merge-annotations /path/to/dataset \
--left-set vtc \
--right-set alice/output \
--left-columns speaker_type \
--right-columns phonemes,syllables,words \
--output-set alice
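The principle of the merge (pairing segments that share identical onset and offset) can be sketched in Python as follows. This is an illustration under assumptions, not the command's implementation: the column names segment_onset and segment_offset are assumed, the values are made up, and this sketch keeps only the segments found in both sets, which may differ from what the command does.

```python
def merge_segments(left, right, left_columns, right_columns):
    """Merge rows of two sets that share identical onset and offset."""
    right_by_span = {
        (row["segment_onset"], row["segment_offset"]): row for row in right
    }
    merged = []
    for row in left:
        span = (row["segment_onset"], row["segment_offset"])
        match = right_by_span.get(span)
        if match is None:
            continue  # assumption: segments without a counterpart are skipped
        out = {"segment_onset": span[0], "segment_offset": span[1]}
        out.update({column: row[column] for column in left_columns})
        out.update({column: match[column] for column in right_columns})
        merged.append(out)
    return merged


vtc = [
    {"segment_onset": 0, "segment_offset": 1200, "speaker_type": "FEM"},
    {"segment_onset": 1500, "segment_offset": 2000, "speaker_type": "CHI"},
]
alice = [
    {"segment_onset": 0, "segment_offset": 1200,
     "phonemes": 12, "syllables": 5, "words": 3},
]
print(merge_segments(vtc, alice, ["speaker_type"], ["phonemes", "syllables", "words"]))
```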

Intersect annotations

In order to combine annotations from different annotators, or to compare them, it is necessary to calculate which portions of the audio have been annotated by all of them. This can be done from the command-line interface:

$ child-project intersect-annotations /path/to/dataset --help
usage: child-project intersect-annotations [-h] --destination DESTINATION
                                           --sets SETS [SETS ...]
                                           [--annotations ANNOTATIONS]
                                           source

calculate the intersection of the annotations belonging to the given sets

positional arguments:
  source                project path

optional arguments:
  -h, --help            show this help message and exit
  --destination DESTINATION
                        output CSV dataframe destination
  --sets SETS [SETS ...]
                        annotation sets to intersect
  --annotations ANNOTATIONS
                        path a custom input CSV dataframe of annotations to
                        intersect. By default, the whole index of the project
                        will be used.

Example:

child-project intersect-annotations /path/to/dataset \
--sets its textgrid/annotator1 textgrid/annotator2 textgrid/annotator3 \
--destination intersection.csv

The output dataframe has the same format as the annotations index (see Annotations index).
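The underlying computation can be sketched in Python for a single recording: given the ranges covered by each set, keep the portions covered by all of them. The function intersect_ranges and the values below are hypothetical. The sketch assumes at least one set is given and that the ranges within a set do not overlap each other.

```python
def intersect_ranges(ranges_by_set):
    """Return the portions of one recording covered by every set.

    ranges_by_set maps a set name to a list of (range_onset, range_offset)
    tuples, in milliseconds.
    """
    all_ranges = list(ranges_by_set.values())
    common = sorted(all_ranges[0])
    for ranges in all_ranges[1:]:
        overlap = []
        for a_onset, a_offset in common:
            for b_onset, b_offset in ranges:
                onset = max(a_onset, b_onset)
                offset = min(a_offset, b_offset)
                if onset < offset:  # keep non-empty overlaps only
                    overlap.append((onset, offset))
        common = sorted(overlap)
    return common


covered = {
    "its": [(0, 3600000)],
    "textgrid/annotator1": [(0, 300000), (600000, 900000)],
    "textgrid/annotator2": [(150000, 700000)],
}
print(intersect_ranges(covered))
# [(150000, 300000), (600000, 700000)]
```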