Managing annotations
Warning
Never run two of the following commands in parallel. They must be run sequentially, otherwise the index may get corrupted.
If you need to parallelize the processing to speed it up, use the --threads option, which is built into every tool that may need it.
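As a minimal sketch of this rule, the helper below runs a list of commands strictly one after another and stops at the first failure. The name run_sequentially and its injectable runner argument are illustrative and not part of the package; the two commands are taken from the examples further down this page, and the thread count is an arbitrary value.

```python
import subprocess
from typing import Callable, List, Sequence


def run_sequentially(
    commands: Sequence[List[str]],
    runner: Callable[..., object] = subprocess.run,
) -> int:
    """Run each command only after the previous one has finished.

    subprocess.run blocks until the command exits, and check=True makes it
    raise on a non-zero exit status, so no later command touches the index
    after a failure. Returns the number of commands that completed.
    """
    completed = 0
    for command in commands:
        runner(command, check=True)
        completed += 1
    return completed


# Hypothetical pipeline: import a set, then derive a new set from it.
# Any parallelism goes through --threads inside a single command.
PIPELINE = [
    ["child-project", "automated-import", ".", "--set", "vtc",
     "--format", "vtc_rttm", "--threads", "4"],
    ["child-project", "derive-annotations", ".", "conversations",
     "--input-set", "vtc", "--output-set", "vtc/conversations"],
]
```

Calling run_sequentially(PIPELINE) from the dataset directory would execute both steps in order; a shell script with one command per line achieves the same thing.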
Importation
Single annotation importation
Annotations can be imported one by one, in bulk, or through the automated command. Annotation importation does the following:
- Converts all input annotations from their original format (e.g. rttm, eaf, textgrid) into the CSV format defined at Annotation importation input format, and stores them into annotations/.
- Registers them in the annotation index at metadata/annotations.csv.
Use child-project import-annotations to import a single annotation.
$ child-project import-annotations /path/to/dataset --help
usage: child-project import-annotations [-h] [--annotations ANNOTATIONS]
[--set SET]
[--recording_filename RECORDING_FILENAME]
[--time_seek TIME_SEEK]
[--range_onset RANGE_ONSET]
[--range_offset RANGE_OFFSET]
[--raw_filename RAW_FILENAME]
[--format {csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA}]
[--filter FILTER] [--threads THREADS]
[--overwrite-existing]
source
convert and import a set of annotations
positional arguments:
source project path
optional arguments:
-h, --help show this help message and exit
--annotations ANNOTATIONS
path to input annotations dataframe (csv) [only for
bulk importation]
--set SET name of the annotation set (e.g. VTC, annotator1,
etc.)
--recording_filename RECORDING_FILENAME
recording filename as specified in the recordings
index
--time_seek TIME_SEEK
shift between the timestamps in the raw input
annotations and the actual corresponding timestamps in
the recordings (in milliseconds)
--range_onset RANGE_ONSET
covered range onset timestamp in milliseconds (since
the start of the recording)
--range_offset RANGE_OFFSET
covered range offset timestamp in milliseconds (since
the start of the recording)
--raw_filename RAW_FILENAME
annotation input filename location, relative to
`annotations/<set>/raw`
--format {csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA}
input annotation format
--filter FILTER source file to target. this field is dedicated to rttm
and ALICE annotations that may combine annotations
from several recordings into one same text file.
--threads THREADS amount of threads to run on
--overwrite-existing, --ow
overwrites existing annotation file if should generate
the same output file (useful when reimporting
Example:
child-project import-annotations /path/to/dataset \
--set eaf \
--recording_filename sound.wav \
--time_seek 0 \
--raw_filename example.eaf \
--range_onset 0 \
--range_offset 300 \
--format eaf
For more information about the allowed values for each parameter, see Annotation importation input format.
Bulk importation
Use the --annotations option to import many annotation files in a single call.
child-project import-annotations /path/to/dataset --annotations /path/to/dataframe.csv
The input dataframe /path/to/dataframe.csv must have one entry per annotation to import, according to the format specified at Annotation importation input format.
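The sketch below shows one way such a dataframe could be generated with the Python standard library. It assumes that the column names mirror the import-annotations options listed above; the authoritative list of columns is in Annotation importation input format. The first row reuses the values of the single-import example, and the second row (sound2.wav, example2.eaf) is hypothetical.

```python
import csv
import io

# Assumption: the dataframe columns mirror the import-annotations options;
# the authoritative list is in "Annotation importation input format".
COLUMNS = [
    "set", "recording_filename", "time_seek",
    "range_onset", "range_offset", "raw_filename", "format",
]


def dataframe_csv(rows):
    """Serialize one dict per annotation into CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One entry per annotation file to import.
ROWS = [
    {"set": "eaf", "recording_filename": "sound.wav", "time_seek": 0,
     "range_onset": 0, "range_offset": 300,
     "raw_filename": "example.eaf", "format": "eaf"},
    {"set": "eaf", "recording_filename": "sound2.wav", "time_seek": 0,
     "range_onset": 0, "range_offset": 300,
     "raw_filename": "example2.eaf", "format": "eaf"},
]
```

Writing the text returned by dataframe_csv(ROWS) to a file and passing that file's path to --annotations would then import both annotations in one call.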
Automated importation
The automated method is mostly used for automated annotations. It assumes a number of parameters on importation, which makes the usual importations possible without additional input. The command assumes the following:
- the annotation files cover the entirety of the audio they annotate (equivalent to range_onset 0 and range_offset <duration of rec>)
- the annotation files have timestamps that are not offset compared to the recording (equivalent to time_seek 0)
- the annotation files are named exactly like the recording they annotate (including the folder they are in), except for the extension, which depends on the format (equivalent to deriving raw_filename from recording_filename by swapping the extension)
- the format is the same for all the files and must be given in the call; it determines the extension of all the annotation files
- the set to import is the same for all the files and must be given in the call
$ child-project automated-import . --help
usage: child-project automated-import [-h] --set SET --format
{csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA}
[--threads THREADS]
[--overwrite-existing]
source
convert and import a set of automated annotations covering the entire
recording
positional arguments:
source project path
optional arguments:
-h, --help show this help message and exit
--set SET set of annotations to import
--format {csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA}
input annotation format
--threads THREADS amount of threads to run on
--overwrite-existing, --ow
overwrites existing annotation file if should generate
the same output file (useful when reimporting
# import the vtc set using the vtc_rttm format; all annotation files must have the .rttm extension
child-project automated-import . --set vtc --format vtc_rttm
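To make the assumptions of the automated importation concrete, the sketch below computes the parameters that would be implied for one recording. The function name implied_parameters and the EXTENSIONS table are illustrative and not part of the package; only the vtc_rttm to .rttm pairing is stated on this page, and the recording duration must be supplied by the caller.

```python
from pathlib import PurePosixPath

# Only the vtc_rttm -> .rttm pairing is stated in this page; extensions for
# other formats would need to be checked before adding them here.
EXTENSIONS = {"vtc_rttm": ".rttm"}


def implied_parameters(recording_filename, duration_ms, annotation_set, fmt):
    """Return the import parameters implied by the automated assumptions."""
    extension = EXTENSIONS[fmt]
    # Same name and folder as the recording, only the extension changes.
    raw_filename = str(PurePosixPath(recording_filename).with_suffix(extension))
    return {
        "set": annotation_set,
        "format": fmt,
        "recording_filename": recording_filename,
        "raw_filename": raw_filename,
        "time_seek": 0,               # timestamps are not shifted
        "range_onset": 0,             # the whole recording is covered
        "range_offset": duration_ms,
    }


# Hypothetical recording stored in a sub-folder, lasting one minute.
params = implied_parameters("day1/rec.wav", 60000, "vtc", "vtc_rttm")
```

Here params["raw_filename"] is "day1/rec.rttm", which is the file the command would expect to find for that recording under these assumptions.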
Rename a set of annotations
This command renames a set of annotations: it moves the annotation files themselves and updates the index (metadata/annotations.csv) accordingly.
$ child-project rename-annotations /path/to/dataset --help
usage: child-project rename-annotations [-h] --set SET --new-set NEW_SET
[--recursive] [--ignore-errors]
source
rename a set of annotations by moving the files and updating the index
accordingly
positional arguments:
source project path
optional arguments:
-h, --help show this help message and exit
--set SET set to rename
--new-set NEW_SET new name for the set
--recursive enable recursive mode
--ignore-errors proceed despite errors
Example:
child-project rename-annotations /path/to/dataset --set vtc --new-set vtc_1
Remove a set of annotations
This will delete the converted annotations associated with a given set and remove them from the index.
$ child-project remove-annotations /path/to/dataset --help
usage: child-project remove-annotations [-h] --set SET [--recursive] source
remove converted annotations of a given set and their entries in the index
positional arguments:
source project path
optional arguments:
-h, --help show this help message and exit
--set SET set to remove
--recursive enable recursive mode
child-project remove-annotations /path/to/dataset --set vtc
Derive annotations
This command derives a new set of annotations (or adds new lines to an existing one) by extracting information from an existing set of annotations. A number of derivations are available in the package; other derivations can be defined by the user when using the Python API directly.
$ child-project derive-annotations /path/to/dataset --help
usage: child-project derive-annotations [-h] --input-set INPUT_SET
--output-set OUTPUT_SET
[--threads THREADS]
[--overwrite-existing]
source
{acoustics,conversations,remove-overlaps}
derive a set of annotations
positional arguments:
source project path
{acoustics,conversations,remove-overlaps}
Type of derivation
optional arguments:
-h, --help show this help message and exit
--input-set INPUT_SET, -i INPUT_SET
input set
--output-set OUTPUT_SET, -o OUTPUT_SET
output set
--threads THREADS amount of threads to run on
--overwrite-existing, --ow
overwrites existing annotation file when deriving
(useful when reimporting), False by default
child-project derive-annotations . conversations --input-set vtc --output-set vtc/conversations
ITS annotations anonymization
LENA .its files might contain information that can help recover the identity of the participants, which may be undesired. This command anonymizes .its files, based on a routine by HomeBank.
$ child-project anonymize /path/to/dataset --help
usage: child-project anonymize [-h] --input-set INPUT_SET --output-set
OUTPUT_SET
[--replacements-json-dict REPLACEMENTS_JSON_DICT]
path
Anonymize a set of its annotations (`input_set`) and saves it as `output_set`.
positional arguments:
path project path
optional arguments:
-h, --help show this help message and exit
--input-set INPUT_SET
input annotation set
--output-set OUTPUT_SET
output annotation set
--replacements-json-dict REPLACEMENTS_JSON_DICT
path to the replacements configuration (json dict)
child-project anonymize /path/to/dataset --input-set lena --output-set lena/anonymous
Merge annotation sets
Some processing tools use pre-existing annotations as an input,
and label the original segments with more information. This is
typically the case of ALICE, which labels segments generated
by the VTC. In this case, one might want to merge the ALICE
and VTC annotations together. This can be done with child-project merge-annotations.
$ child-project merge-annotations /path/to/dataset --help
usage: child-project merge-annotations [-h] --left-set LEFT_SET --right-set
RIGHT_SET --left-columns LEFT_COLUMNS
--right-columns RIGHT_COLUMNS
--output-set OUTPUT_SET
[--threads THREADS]
source
merge segments sharing identical onset and offset from two sets of annotations
positional arguments:
source project path
optional arguments:
-h, --help show this help message and exit
--left-set LEFT_SET left set
--right-set RIGHT_SET
right set
--left-columns LEFT_COLUMNS
comma-separated columns to merge from the left set
--right-columns RIGHT_COLUMNS
comma-separated columns to merge from the right set
--output-set OUTPUT_SET
name of the output set
--threads THREADS amount of threads to run on (default: 1)
child-project merge-annotations /path/to/dataset \
--left-set vtc \
--right-set alice/output \
--left-columns speaker_type \
--right-columns phonemes,syllables,words \
--output-set alice
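The idea of merging segments that share identical onset and offset can be sketched in plain Python as follows. This is not the package's implementation: the column names segment_onset and segment_offset are assumptions, the sample values are made up, and the sketch keeps only segments found in both sets, whereas the command may treat unmatched segments differently.

```python
def merge_segments(left, right, left_columns, right_columns):
    """Join two lists of segment dicts on identical (onset, offset)."""
    # Index the right-hand segments by their time span for direct lookup.
    right_by_span = {
        (seg["segment_onset"], seg["segment_offset"]): seg for seg in right
    }
    merged = []
    for seg in left:
        span = (seg["segment_onset"], seg["segment_offset"])
        match = right_by_span.get(span)
        if match is None:
            continue  # simplifying assumption: drop unmatched segments
        row = {"segment_onset": span[0], "segment_offset": span[1]}
        row.update({col: seg[col] for col in left_columns})
        row.update({col: match[col] for col in right_columns})
        merged.append(row)
    return merged


# Made-up segments standing in for the vtc and alice/output sets.
vtc = [
    {"segment_onset": 0, "segment_offset": 1000, "speaker_type": "FEM"},
    {"segment_onset": 1500, "segment_offset": 2000, "speaker_type": "CHI"},
]
alice = [
    {"segment_onset": 0, "segment_offset": 1000,
     "phonemes": 10, "syllables": 4, "words": 3},
]

merged = merge_segments(
    vtc, alice, ["speaker_type"], ["phonemes", "syllables", "words"]
)
```

The single row in merged carries speaker_type from the left set and the three counts from the right set, mirroring the roles of --left-columns and --right-columns.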
Intersect annotations
In order to combine annotations from different annotators, or to compare them, it is necessary to calculate which portions of the audio have been annotated by all of them. This can be done from the command-line interface:
$ child-project intersect-annotations /path/to/dataset --help
usage: child-project intersect-annotations [-h] --destination DESTINATION
--sets SETS [SETS ...]
[--annotations ANNOTATIONS]
source
calculate the intersection of the annotations belonging to the given sets
positional arguments:
source project path
optional arguments:
-h, --help show this help message and exit
--destination DESTINATION
output CSV dataframe destination
--sets SETS [SETS ...]
annotation sets to intersect
--annotations ANNOTATIONS
path a custom input CSV dataframe of annotations to
intersect. By default, the whole index of the project
will be used.
Example:
child-project intersect-annotations /path/to/dataset \
--sets its textgrid/annotator1 textgrid/annotator2 textgrid/annotator3 \
--destination intersection.csv
The output dataframe has the same format as the annotations index (see Annotations index).
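The underlying computation can be illustrated with a short sketch that finds the portions of one recording covered by every set. It is not the package's implementation: the function name intersect_ranges is hypothetical, ranges are given as (range_onset, range_offset) pairs in milliseconds, the sample values are made up, and ranges within a single set are assumed not to overlap each other.

```python
def intersect_ranges(ranges_by_set):
    """Return the (onset, offset) spans covered by every set.

    `ranges_by_set` maps a set name to a list of (range_onset, range_offset)
    tuples in milliseconds, all for the same recording.
    """
    all_ranges = list(ranges_by_set.values())
    if not all_ranges:
        return []
    common = sorted(all_ranges[0])
    for ranges in all_ranges[1:]:
        narrowed = []
        for a_onset, a_offset in common:
            for b_onset, b_offset in ranges:
                # The overlap of two spans starts at the later onset and
                # ends at the earlier offset; it is empty if they cross.
                onset, offset = max(a_onset, b_onset), min(a_offset, b_offset)
                if onset < offset:
                    narrowed.append((onset, offset))
        common = sorted(narrowed)
    return common


# Made-up coverage for three sets annotating the same recording.
coverage = {
    "its": [(0, 10000)],
    "textgrid/annotator1": [(2000, 5000), (7000, 9000)],
    "textgrid/annotator2": [(4000, 8000)],
}
common = intersect_ranges(coverage)
```

With these values, common is [(4000, 5000), (7000, 8000)]: the only two spans that all three sets have annotated.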