Basic tools

Dataset initialization

This command creates a new, empty dataset with the correct structure.

$ child-project init /path/to/dataset --help
usage: child-project init [-h] [--force] source

positional arguments:
  source       project path

optional arguments:
  -h, --help   show this help message and exit
  --force, -f  ignore existing files and create structure anyway

Example:

# create a dataset in a folder named mydataset
child-project init mydataset

Data validation

This command looks for errors and inconsistencies in the metadata, and for missing audio files. Validation passes if the formatting instructions are met (see Datasets structure).

Validation is typically run repeatedly while importing your data into our format for the first time, but you should also run it whenever you make a change to the dataset.

$ child-project validate /path/to/dataset --help
usage: child-project validate [-h] [--ignore-recordings] [--profile PROFILE]
                              [--annotations ANNOTATIONS [ANNOTATIONS ...]]
                              [--threads THREADS]
                              source

validate the consistency of the dataset returning detailed errors and warnings

positional arguments:
  source                project path

optional arguments:
  -h, --help            show this help message and exit
  --ignore-recordings   ignore missing audio files
  --profile PROFILE     which recording profile to validate
  --annotations ANNOTATIONS [ANNOTATIONS ...]
                        path to or name of each annotation set(s) to check
                        (e.g. 'vtc' or '/path/to/dataset/annotations/vtc')
  --threads THREADS     amount of threads to run on (only applies to
                        --annotations)

Example:

# validate the metadata and raw recordings
child-project validate /path/to/dataset

# validate the metadata only
child-project validate /path/to/dataset --ignore-recordings

# validate the metadata and the recordings of the 'standard' profile
# (in recordings/converted/standard)
child-project validate /path/to/dataset --profile standard

# validate the metadata and all annotations within /path/to/dataset/annotations
child-project validate /path/to/dataset --ignore-recordings --annotations /path/to/dataset/annotations/*

# validate the metadata and annotations from the 'textgrid' set
child-project validate /path/to/dataset --ignore-recordings --annotations /path/to/dataset/annotations/textgrid/*
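When validation is launched from a script, the command line can be assembled programmatically. The sketch below is only an illustration: build_validate_command is a hypothetical helper, not part of ChildProject, and it relies solely on the flags listed in the help output above.

```python
def build_validate_command(source, ignore_recordings=False, profile=None,
                           annotations=None, threads=None):
    """Assemble the argument list for `child-project validate`.

    Hypothetical helper (not part of ChildProject); it only uses the
    flags documented in the help output above.
    """
    command = ["child-project", "validate", source]
    if ignore_recordings:
        command.append("--ignore-recordings")
    if profile is not None:
        command += ["--profile", profile]
    if annotations:
        command += ["--annotations", *annotations]
    if threads is not None:
        # According to the help text, --threads only applies to --annotations
        command += ["--threads", str(threads)]
    return command


# Validate the metadata and the 'vtc' annotation set on 4 threads
command = build_validate_command(
    "/path/to/dataset",
    ignore_recordings=True,
    annotations=["vtc"],
    threads=4,
)
print(" ".join(command))
# To run it: import subprocess; subprocess.run(command)
```

Building the arguments as a list rather than a single string avoids quoting problems when the dataset path contains spaces.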

Dataset overview

An overview of the contents of a dataset can be obtained with the child-project overview command.

$ child-project overview --help
usage: child-project overview [-h] [--format {snapshot,json}]
                              source [source ...]

prints an overview of the contents of a given dataset

positional arguments:
  source                source data path

optional arguments:
  -h, --help            show this help message and exit
  --format {snapshot,json}
                        format to output to

Example:

$ child-project overview ../examples/valid_raw_data

Annotation sets metadata

An overview of the annotation sets of a dataset can be obtained with the child-project sets-metadata command. The output can be formatted to be human readable or parsable (csv).

$ child-project sets-metadata --help
usage: child-project sets-metadata [--help] [--format {snapshot,csv}]
                                   [--human-readable]
                                   [--sort-by {set,duration,segmentation,segmentation_type,method,sampling_method,sampling_target,sampling_count,sampling_unit_duration,recording_selection,participant_selection,annotator_name,annotator_experience,annotation_algorithm_name,annotation_algorithm_publication,annotation_algorithm_version,annotation_algorithm_repo,date_annotation,has_speaker_type,has_transcription,has_interactions,has_acoustics,has_addressee,has_vcm_type,has_words,notes} [{set,duration,segmentation,segmentation_type,method,sampling_method,sampling_target,sampling_count,sampling_unit_duration,recording_selection,participant_selection,annotator_name,annotator_experience,annotation_algorithm_name,annotation_algorithm_publication,annotation_algorithm_version,annotation_algorithm_repo,date_annotation,has_speaker_type,has_transcription,has_interactions,has_acoustics,has_addressee,has_vcm_type,has_words,notes} ...]]
                                   [--sort-descending]
                                   source [source ...]

get the metadata on all the annotation sets in the dataset

positional arguments:
  source                project_path

optional arguments:
  --help                show this help message and exit
  --format {snapshot,csv}
                        format to output to
  --human-readable, -h  convert units to be more human readable
  --sort-by {set,duration,segmentation,segmentation_type,method,sampling_method,sampling_target,sampling_count,sampling_unit_duration,recording_selection,participant_selection,annotator_name,annotator_experience,annotation_algorithm_name,annotation_algorithm_publication,annotation_algorithm_version,annotation_algorithm_repo,date_annotation,has_speaker_type,has_transcription,has_interactions,has_acoustics,has_addressee,has_vcm_type,has_words,notes} [{set,duration,segmentation,segmentation_type,method,sampling_method,sampling_target,sampling_count,sampling_unit_duration,recording_selection,participant_selection,annotator_name,annotator_experience,annotation_algorithm_name,annotation_algorithm_publication,annotation_algorithm_version,annotation_algorithm_repo,date_annotation,has_speaker_type,has_transcription,has_interactions,has_acoustics,has_addressee,has_vcm_type,has_words,notes} ...]
                        sort the table by the given column name(s)
  --sort-descending     sort the table descending instead of ascending

Example:

$ child-project sets-metadata ../examples/valid_raw_data

Compute recordings duration

Computes the duration of each recording in milliseconds and stores it in a column named ‘duration’ in the recordings metadata.

$ child-project compute-durations /path/to/dataset --help
usage: child-project compute-durations [-h] [--profile PROFILE] [--force]
                                       source

creates a 'duration' column into metadata/recordings. duration is in ms

positional arguments:
  source             source data path

optional arguments:
  -h, --help         show this help message and exit
  --profile PROFILE  which audio profile to use
  --force            overwrite if column exists
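Once the column exists, the durations can be read back from the recordings metadata. The following sketch assumes that the metadata is stored in metadata/recordings.csv with recording_filename and duration columns; total_hours is a hypothetical helper, and the rows are made up for illustration.

```python
import csv
import io


def total_hours(csv_file):
    """Sum the 'duration' column (milliseconds) and convert it to hours.

    Rows with an empty duration are skipped.
    """
    total_ms = 0.0
    for row in csv.DictReader(csv_file):
        value = (row.get("duration") or "").strip()
        if value:
            total_ms += float(value)
    return total_ms / 3_600_000  # 1 hour = 3,600,000 ms


# Made-up rows standing in for metadata/recordings.csv
sample = io.StringIO(
    "recording_filename,duration\n"
    "rec1.wav,3600000\n"
    "rec2.wav,1800000\n"
    "rec3.wav,\n"
)
print(total_hours(sample))  # 1.5
```

With a real dataset, the same function could be given an open file instead, for example `open("/path/to/dataset/metadata/recordings.csv", newline="")`.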

Compute the correlation between audio files

This command computes the correlation between two audio files and prints a divergence score. The divergence is computed over a given duration (5 minutes by default), which can be changed with the --interval option. One segment of that duration is picked at random, and the difference between the two audio signals is calculated and averaged over that segment.

The closer the score is to 0, the more likely it is that the two files are identical. Scores below 0.1 indicate a very high probability that the files are the same. At the other end of the spectrum, scores above 1 almost certainly mean that the recordings are different. Between these two values the result is inconclusive, and additional correlation runs or manual checks are needed. Running the comparison several times is useful, because the score varies widely between runs when the files are different, whereas it stays much more consistent when they are similar.

A higher --interval value may take longer to compute.
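To make the procedure concrete, here is a minimal sketch of a windowed divergence score on plain Python lists. It is not ChildProject's implementation: the mean absolute difference, the absence of any normalization, and the names divergence and interpret are assumptions chosen for illustration. Only the randomly placed window and the 0.1 and 1 thresholds come from the description above.

```python
import random


def divergence(signal1, signal2, window, rng=random):
    """Mean absolute difference over one randomly placed window.

    `window` is a number of samples; both signals are assumed to share
    the same sampling rate.
    """
    length = min(len(signal1), len(signal2))
    if window <= 0 or window > length:
        raise ValueError("window must be between 1 and the shortest signal length")
    start = rng.randrange(length - window + 1)
    segment1 = signal1[start:start + window]
    segment2 = signal2[start:start + window]
    return sum(abs(a - b) for a, b in zip(segment1, segment2)) / window


def interpret(score):
    """Map a score to the three zones described above."""
    if score < 0.1:
        return "very likely the same recording"
    if score > 1:
        return "almost certainly different recordings"
    return "uncertain: rerun or check manually"


# Synthetic signals, for illustration only
rng = random.Random(0)
original = [rng.uniform(-1, 1) for _ in range(10_000)]
copy = list(original)
other = [rng.uniform(-1, 1) for _ in range(10_000)]

# Repeat each comparison several times and look at the spread of the scores
for name, candidate in (("copy", copy), ("other", other)):
    scores = [divergence(original, candidate, window=1_000, rng=rng) for _ in range(5)]
    print(name, [round(s, 3) for s in scores])
```

The absolute values produced by this toy metric are not comparable to those printed by compare-recordings; the point is only that an exact copy yields the same score on every run, while unrelated signals yield scores that change from one window to the next.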

$ child-project compare-recordings /path/to/dataset --help
usage: child-project compare-recordings [-h] [--profile PROFILE]
                                        [--interval INTERVAL]
                                        source audio1 audio2

computes the difference between 2 given audio files of the dataset. A
divergence score is outputted, it is the average difference of audio signal
over the considered sample (random point in the audio, fixed duration).
Divergence scores lower than 0.1 indicate a strong proximity

positional arguments:
  source               project path
  audio1               name of the first audio file as it is indexed in
                       recordings.csv in column <recording_filename>
  audio2               name of the second audio file as it is indexed in
                       recordings.csv in column <recording_filename>

optional arguments:
  -h, --help           show this help message and exit
  --profile PROFILE    which audio profile to use
  --interval INTERVAL  duration in minutes of the window used to build the
                       correlation score