Basic tools

Data validation

Validation is typically done (repeatedly!) while importing your data into our format for the first time, but you should also run it whenever you make a change to the dataset.

The validate command looks for errors and inconsistencies in the metadata, and for missing audio files. Validation passes if the formatting instructions are met (see Datasets structure).

$ child-project validate /path/to/dataset --help
usage: child-project validate [-h] [--ignore-recordings] [--profile PROFILE]
                              [--annotations ANNOTATIONS [ANNOTATIONS ...]]
                              [--threads THREADS]
                              source

validate the consistency of the dataset returning detailed errors and warnings

positional arguments:
  source                project path

optional arguments:
  -h, --help            show this help message and exit
  --ignore-recordings   ignore missing audio files
  --profile PROFILE     which recording profile to validate
  --annotations ANNOTATIONS [ANNOTATIONS ...]
                        path to or name of each annotation set(s) to check
                        (e.g. 'vtc' or '/path/to/dataset/annotations/vtc')
  --threads THREADS     amount of threads to run on (only applies to
                        --annotations)

Examples:

# validate the metadata and raw recordings
child-project validate /path/to/dataset

# validate the metadata only
child-project validate /path/to/dataset --ignore-recordings

# validate the metadata and the recordings of the 'standard' profile
# (in recordings/converted/standard)
child-project validate /path/to/dataset --profile standard

# validate the metadata and all annotations within /path/to/dataset/annotations
child-project validate /path/to/dataset --ignore-recordings --annotations /path/to/dataset/annotations/*

# validate the metadata and annotations from the 'textgrid' set
child-project validate /path/to/dataset --ignore-recordings --annotations /path/to/dataset/annotations/textgrid/*
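When validation is run from a script (for instance after each change to the dataset), it can help to assemble the command line from the options listed in the help text. The following is only a minimal sketch: build_validate_command and run_validation are hypothetical helper names, not part of the package, and the meaning of the exit status is not documented here, so the sketch simply returns it to the caller.

```python
import subprocess
from typing import List, Optional, Sequence


def build_validate_command(
    source: str,
    ignore_recordings: bool = False,
    profile: Optional[str] = None,
    annotations: Sequence[str] = (),
    threads: Optional[int] = None,
) -> List[str]:
    """Assemble a `child-project validate` call using only the documented options."""
    # The positional dataset path comes first so that it cannot be
    # mistaken for one of the values passed to --annotations.
    command = ["child-project", "validate", source]
    if ignore_recordings:
        command.append("--ignore-recordings")
    if profile is not None:
        command += ["--profile", profile]
    if annotations:
        command += ["--annotations", *annotations]
    if threads is not None:
        command += ["--threads", str(threads)]
    return command


def run_validation(source: str, **options) -> int:
    """Run the validation and return the exit status of the process.

    How to interpret that status is an assumption left to the caller:
    this section does not state which value signals a failed validation.
    """
    return subprocess.run(build_validate_command(source, **options)).returncode


# Print the command that would be run, without running it.
print(" ".join(build_validate_command(
    "/path/to/dataset", ignore_recordings=True, annotations=["vtc"]
)))
```

Building the command as a list rather than a single string avoids shell quoting problems when dataset paths contain spaces.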

Dataset overview

An overview of the contents of a dataset can be obtained with the child-project overview command.

$ child-project overview --help
usage: child-project overview [-h] source

prints an overview of the contents of a given dataset

positional arguments:
  source      source data path

optional arguments:
  -h, --help  show this help message and exit

Example:

$ child-project overview .

recordings:
lena: 288.00 hours, 0/18 files locally available
olympus: 49.57 hours, 0/3 files locally available
usb: 223.42 hours, 0/20 files locally available

annotations:
alice: 560.99 hours, 0/40 files locally available
alice_vtc: 560.99 hours, 0/40 files locally available
eaf/nk: 1.47 hours, 0/88 files locally available
lena: 272.00 hours, 0/17 files locally available
textgrid/mm: 8.75 hours, 0/525 files locally available
vtc: 560.99 hours, 40/40 files locally available
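The overview output can also be checked programmatically, for example to list the sets whose files are not all available locally. The sketch below assumes the output keeps exactly the layout shown in the example above, which may differ between versions; parse_overview, missing_locally and SetSummary are hypothetical names introduced for illustration.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class SetSummary(NamedTuple):
    hours: float
    available: int
    total: int


# Matches lines such as "vtc: 560.99 hours, 40/40 files locally available".
LINE_PATTERN = re.compile(
    r"^(?P<name>\S+): (?P<hours>\d+(?:\.\d+)?) hours, "
    r"(?P<available>\d+)/(?P<total>\d+) files locally available$"
)


def parse_overview(text: str) -> Dict[str, Dict[str, SetSummary]]:
    """Group the summary lines under their 'recordings' or 'annotations' header."""
    sections: Dict[str, Dict[str, SetSummary]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line in ("recordings:", "annotations:"):
            current = sections.setdefault(line[:-1], {})
            continue
        match = LINE_PATTERN.match(line)
        if match and current is not None:
            current[match["name"]] = SetSummary(
                float(match["hours"]), int(match["available"]), int(match["total"])
            )
    return sections


def missing_locally(sections: Dict[str, Dict[str, SetSummary]]) -> List[Tuple[str, str]]:
    """Return (section, name) pairs for which some files are not available locally."""
    return [
        (section, name)
        for section, sets in sections.items()
        for name, summary in sets.items()
        if summary.available < summary.total
    ]


# A few lines copied from the example output above.
SAMPLE = """recordings:
lena: 288.00 hours, 0/18 files locally available

annotations:
vtc: 560.99 hours, 40/40 files locally available
eaf/nk: 1.47 hours, 0/88 files locally available
"""

print(missing_locally(parse_overview(SAMPLE)))
```

Lines that do not match the expected layout are skipped silently, so a change in the output format yields an empty result rather than an exception.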

Compute recordings duration

Compute the duration of each recording and store it in a column named ‘duration’ in the recordings metadata.

$ child-project compute-durations /path/to/dataset --help
usage: child-project compute-durations [-h] [--profile PROFILE] [--force]
                                       source

creates a 'duration' column into metadata/recordings

positional arguments:
  source             source data path

optional arguments:
  -h, --help         show this help message and exit
  --profile PROFILE  which audio profile to use
  --force            overwrite if column exists
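Since --force only matters when the column already exists, it can be useful to inspect the recordings metadata before running the command. The sketch below assumes the metadata is stored as a CSV file (for instance metadata/recordings.csv, a path not stated in this section); duration_status is a hypothetical helper, and the column name recording_filename and the values in the sample are placeholders.

```python
import csv
import io
from typing import TextIO


def duration_status(metadata: TextIO) -> str:
    """Report whether the 'duration' column is 'absent', 'incomplete' or 'complete'."""
    reader = csv.DictReader(metadata)
    if "duration" not in (reader.fieldnames or []):
        return "absent"
    for row in reader:
        # Short rows yield None for missing cells, hence the `or ""`.
        if not (row["duration"] or "").strip():
            return "incomplete"
    return "complete"


# Placeholder metadata: the second recording has no duration yet.
sample = io.StringIO("recording_filename,duration\nrec1.wav,1000\nrec2.wav,\n")
print(duration_status(sample))
```

With a real dataset, the same function can be called on a file opened with open(path, newline="") in place of the in-memory sample.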