This is typically done (repeatedly!) in the process of importing your data into our format for the first time, but you should also do this whenever you make a change to the dataset.
The validation looks for errors and inconsistencies in the metadata, and for missing audio files. It passes if the formatting instructions are met (see Datasets structure).
$ child-project validate /path/to/dataset --help
usage: child-project validate [-h] [--ignore-recordings] [--profile PROFILE]
                              [--annotations ANNOTATIONS [ANNOTATIONS ...]]
                              [--threads THREADS]
                              source

validate the consistency of the dataset returning detailed errors and warnings

positional arguments:
  source                project path

optional arguments:
  -h, --help            show this help message and exit
  --ignore-recordings   ignore missing audio files
  --profile PROFILE     which recording profile to validate
  --annotations ANNOTATIONS [ANNOTATIONS ...]
                        path to or name of each annotation set(s) to check
                        (e.g. 'vtc' or '/path/to/dataset/annotations/vtc')
  --threads THREADS     amount of threads to run on (only applies to
                        --annotations)
# validate the metadata and raw recordings
child-project validate /path/to/dataset

# validate the metadata only
child-project validate /path/to/dataset --ignore-recordings

# validate the metadata and the recordings of the 'standard' profile
# (in recordings/converted/standard)
child-project validate /path/to/dataset --profile standard

# validate the metadata and all annotations within /path/to/dataset/annotations
child-project validate /path/to/dataset --ignore-recordings --annotations /path/to/dataset/annotations/*

# validate the metadata and annotations from the 'textgrid' set
child-project validate /path/to/dataset --ignore-recordings --annotations /path/to/dataset/annotations/textgrid/*
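Since validation is meant to be rerun after every change to the dataset, it can be convenient to script it. The following is a minimal sketch and not part of the package: `build_validate_command` and `run_validation` are hypothetical helper names, and they only assemble the flags listed in the help output above before handing them to `subprocess`. It assumes the `child-project` executable is on the PATH.

```python
import subprocess


def build_validate_command(source, ignore_recordings=False, profile=None,
                           annotations=None, threads=None):
    """Assemble the argument list for `child-project validate`.

    Hypothetical helper: it only mirrors the flags shown in the help output.
    """
    cmd = ["child-project", "validate", str(source)]
    if ignore_recordings:
        cmd.append("--ignore-recordings")
    if profile is not None:
        cmd += ["--profile", str(profile)]
    if annotations:
        # --annotations accepts one or more set names or paths
        cmd.append("--annotations")
        cmd += [str(a) for a in annotations]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


def run_validation(source, **options):
    """Run the validation and return the exit code of the process."""
    completed = subprocess.run(build_validate_command(source, **options))
    return completed.returncode
```

For instance, `run_validation("/path/to/dataset", ignore_recordings=True)` would launch the "metadata only" check from the examples above. How the exit code relates to validation errors is not documented here, so the command's printed errors and warnings remain the reference.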
An overview of the contents of a dataset can be obtained with the child-project overview command.
$ child-project overview --help
usage: child-project overview [-h] source

prints an overview of the contents of a given dataset

positional arguments:
  source      source data path

optional arguments:
  -h, --help  show this help message and exit
$ child-project overview .

recordings:
    lena: 288.00 hours, 0/18 files locally available
    olympus: 49.57 hours, 0/3 files locally available
    usb: 223.42 hours, 0/20 files locally available

annotations:
    alice: 560.99 hours, 0/40 files locally available
    alice_vtc: 560.99 hours, 0/40 files locally available
    eaf/nk: 1.47 hours, 0/88 files locally available
    lena: 272.00 hours, 0/17 files locally available
    textgrid/mm: 8.75 hours, 0/525 files locally available
    vtc: 560.99 hours, 40/40 files locally available
Compute recordings duration
Computes the duration of each recording in milliseconds and stores it in a column named 'duration' in the recordings metadata.
$ child-project compute-durations /path/to/dataset --help
usage: child-project compute-durations [-h] [--profile PROFILE] [--force]
                                       source

creates a 'duration' column into metadata/recordings. duration is in ms

positional arguments:
  source             source data path

optional arguments:
  -h, --help         show this help message and exit
  --profile PROFILE  which audio profile to use
  --force            overwrite if column exists
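To make the unit concrete, here is a minimal sketch of how a duration in milliseconds can be derived from an audio file. It is an illustration only, not the method used by `compute-durations`: it relies on Python's standard `wave` module, so it assumes uncompressed PCM WAV input, and `wav_duration_ms` is a hypothetical name.

```python
import wave


def wav_duration_ms(path):
    """Return the duration of a PCM WAV file in milliseconds (rounded)."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()   # number of audio frames in the file
        rate = wav.getframerate()   # frames per second
    return round(1000 * frames / rate)
```

A file holding 4000 frames sampled at 8000 Hz lasts half a second, so the function would return 500 for it.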
Compute the correlation between audio files
Computes the correlation between two audio files and prints a divergence score. The divergence is computed over a window of a given duration (5 minutes by default), which can be changed with the --interval option. One segment of that duration is picked at random, and the difference between the two audio signals is calculated and averaged over that segment. The closer the score is to 0, the more likely it is that the two files are identical. Scores below 0.1 indicate a very high probability that the files are the same. At the other end of the spectrum, scores higher than 1 almost certainly mean that the recordings are different. Between 0.1 and 1 the result is inconclusive and calls for additional correlation computations or manual checks. Running the comparison several times is useful, because the score varies a lot from one run to the next when the files are different, whereas it stays much more consistent when they are similar.
A higher --interval value may take more time to compute.
$ child-project compare-recordings /path/to/dataset --help
usage: child-project compare-recordings [-h] [--profile PROFILE]
                                        [--interval INTERVAL]
                                        source audio1 audio2

computes the difference between 2 given audio files of the dataset. A
divergence score is outputted, it is the average difference of audio signal
over the considered sample (random point in the audio, fixed duration).
Divergence scores lower than 0.1 indicate a strong proximity

positional arguments:
  source               project path
  audio1               name of the first audio file as it is indexed in
                       recordings.csv in column <recording_filename>
  audio2               name of the second audio file as it is indexed in
                       recordings.csv in column <recording_filename>

optional arguments:
  -h, --help           show this help message and exit
  --profile PROFILE    which audio profile to use
  --interval INTERVAL  duration in minutes of the window used to build the
                       correlation score
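The idea behind the score can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the signals are plain sequences of samples at the same sampling rate, the divergence is taken to be the mean absolute difference over one random window, and the function names are hypothetical. The 0.1 and 1 thresholds come from the description above and apply to the tool's own score, so they may not carry over to a differently scaled signal.

```python
import random
import statistics


def divergence(signal1, signal2, window, rng=random):
    """Mean absolute difference between two signals over one random window.

    `window` is the window length in samples (an assumption of this sketch;
    the command line takes a duration in minutes instead).
    """
    usable = min(len(signal1), len(signal2))
    if window <= 0 or window > usable:
        raise ValueError("window must be between 1 and the shortest signal length")
    # Pick the start of the window at random, as the description states.
    start = rng.randrange(usable - window + 1)
    total = sum(abs(a - b) for a, b in zip(signal1[start:start + window],
                                            signal2[start:start + window]))
    return total / window


def interpret(score, same_below=0.1, different_above=1.0):
    """Map a score to the three zones described in the text."""
    if score < same_below:
        return "likely identical"
    if score > different_above:
        return "likely different"
    return "uncertain"


def repeated_divergence(signal1, signal2, window, runs=5, seed=0):
    """Repeat the measurement and return the mean and spread of the scores."""
    rng = random.Random(seed)
    scores = [divergence(signal1, signal2, window, rng) for _ in range(runs)]
    return statistics.mean(scores), statistics.pstdev(scores)
```

Repeating the measurement, as `repeated_divergence` does, reflects the advice to run the comparison several times: a large spread between runs hints at different recordings, while a small one hints at similar files.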