Datasets structure

ChildProject assumes your data is structured in a specific way. This structure is necessary to check, for instance, that there are no unreferenced files, and no referenced files that are actually missing. The data curator therefore needs to organize their data in a specific way (respecting the dataset tree, with all specified metadata files, and all specified columns within the metadata files) before their data can be imported.

To be imported, datasets must pass the validation routine (see Data validation) with no errors. We also recommend that you pay attention to the warnings and resolve as many of them as possible before submission.

An example of a dataset structured according to ChildProject’s format can be found here.

A set of procedures exists for new datasets, handling the creation of the base structure and folders as well as linkage to an online repository. These procedures are used with datalad and can be accessed here. To create an entirely new dataset, consider following the guide in our lab handbook.

Dataset tree

All datasets should have this structure before import (so you need to organize your files into this structure):

project
│
│
└───metadata
│   │   children.csv
│   │   recordings.csv
│   │   annotations.csv
│
└───recordings
│   └───raw
│   │   │   recording1.wav
│
└───annotations
│   └───vtc
│   │   │   metannots.yml
│   │   └───raw
│   │   │   │   child1.rttm
│   └───annotator1
│   │   │   metannots.yml
│   │   └───raw
│   │   │   │   child1_3600.TextGrid
│
└───docs (*)
│   │   children.csv
│   │   recordings.csv
└───extra
    │   notes.txt

The children and recordings notebooks should be CSV dataframes formatted according to the standards detailed right below.

(*) The docs folder is optional.
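The folder layout above can be sketched with the Python standard library alone. This is only an illustration: `create_skeleton` and `missing_metadata` are hypothetical helper names, not part of ChildProject, and the datalad procedures mentioned earlier remain the documented way to create a dataset.

```python
from pathlib import Path
import tempfile

# Folders copied from the tree above ("docs" is the optional one).
FOLDERS = ["metadata", "recordings/raw", "annotations", "docs", "extra"]
# Notebooks the curator has to provide under metadata/.
NOTEBOOKS = ["metadata/children.csv", "metadata/recordings.csv"]


def create_skeleton(root):
    """Create the empty folder layout under `root` (hypothetical helper)."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    return root


def missing_metadata(root):
    """List the expected notebooks that do not exist yet."""
    return [name for name in NOTEBOOKS if not (Path(root) / name).is_file()]


# Try it in a throwaway directory so nothing is written to the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    project = create_skeleton(Path(tmp) / "project")
    print(sorted(p.name for p in project.iterdir()))
    print(missing_metadata(project))
```

Running `missing_metadata` before validation gives a quick reminder of which notebooks still have to be written; it does not replace the validation routine.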

Metadata

Children notebook

The children metadata dataframe needs to be saved at metadata/children.csv. It should be formatted as instructed below; you can add more fields beyond those that are standardized, but make sure to document them.

Children metadata
Name	Description	Required?	Format
experiment	one word to capture the unique ID of the data collection effort; for instance Tsimane_2018, solis-intervention-pre	required
child_id	unique child ID – unique within the experiment (Id could be repeated across experiments to refer to different children)	required
child_dob	child’s date of birth	required	`%Y-%m-%d`
location_id	Unique location ID – only specify here if children never change locations in this culture; otherwise, specify in the recordings metadata	optional
child_sex	f= female, m=male	optional	m, M, f, F, NA
language	main language the child is exposed to; small caps; eg “french”; “english”	optional
languages	list languages child is exposed to separating them with ; and indicating the percentage if one is available; eg: “french 35%; english 65%”	optional
mat_ed	maternal years of education	optional
fat_ed	paternal years of education	optional
car_ed	years of education of main caregiver (if not mother or father)	optional
monoling	whether the child is monolingual (Y) or not (N)	optional	Y, N, NA
monoling_criterion	how monoling was decided; eg “we asked families which languages they spoke in the home”	optional
normative	whether the child is normative (Y) or not (N)	optional	Y, N, NA
normative_criterion	how normative was decided; eg “unless the caregivers volunteered information whereby the child had a problem, we consider them normative by default”	optional
mother_id	unique ID of the mother	optional
father_id	unique ID of the father	optional
order_of_birth	child order of birth	optional	`(\d+(\.\d+)?)`
n_of_siblings	amount of siblings	optional	`(\d+(\.\d+)?)`
household_size	number of people living in the household (adults+children)	optional	`(\d+(\.\d+)?)`
dob_criterion	determines whether the date of birth is known exactly or extrapolated e.g. from the age. Dates of birth are assumed to be known exactly if this column is NA or unspecified.	optional	extrapolated, exact, reported, innacurate
dob_accuracy	date of birth accuracy	optional	day, week, month, year, other, innacurate, NA
discard	set to 1 if item should be discarded in analyses	optional	0, 1
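As a rough pre-check before running the package's validation, the required columns of the table above can be tested with the standard library. The sketch below is an assumption-laden illustration, not the package's validator: `check_children` is a hypothetical name, and only the three required columns and the `%Y-%m-%d` date format are checked.

```python
import csv
import io
from datetime import datetime

REQUIRED = ["experiment", "child_id", "child_dob"]


def check_children(csv_text):
    """Return a list of human-readable problems found in a children notebook."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing required column(s): " + ", ".join(missing)]

    problems = []
    seen = set()
    # The header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        try:
            datetime.strptime(row["child_dob"], "%Y-%m-%d")
        except ValueError:
            problems.append(f"line {line_number}: child_dob is not %Y-%m-%d")
        # child_id must be unique within an experiment.
        key = (row["experiment"], row["child_id"])
        if key in seen:
            problems.append(f"line {line_number}: duplicate child_id {row['child_id']}")
        seen.add(key)
    return problems


sample = """experiment,child_id,child_dob,child_sex
Tsimane_2018,C01,2017-04-12,f
Tsimane_2018,C02,12/04/2017,m
Tsimane_2018,C01,2016-01-30,m
"""
for problem in check_children(sample):
    print(problem)
```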

Recordings notebook

The recordings metadata dataframe needs to be saved at metadata/recordings.csv. It should be formatted as instructed below; you can add more fields beyond those that are standardized, but make sure to document them.

Recordings metadata
Name	Description	Required?	Format
experiment	one word to capture the unique ID of the data collection effort; for instance Tsimane_2018, solis-intervention-pre	required
child_id	unique child ID – unique within the experiment (Id could be repeated across experiments to refer to different children)	required
date_iso	date on which the recording was started, in ISO format (eg 2020-09-17)	required	`%Y-%m-%d`
start_time	local time at which the recording was started, in 24-hour format (H)H:MM:SS or (H)H:MM; if minutes or seconds are unknown, use 00. Use ‘NA’ if unknown; this raises a warning during validation, as analyses that rely on times will not consider these recordings.	required	`%H:%M / %H:%M:%S`
recording_device_type	lena, usb, olympus, babylogger (lowercase), izyrec	required	lena, usb, olympus, babylogger, izyrec, unknown
recording_filename	the path to the file from the root of “recordings”. It MUST be unique (two recordings cannot point towards the same file).	required	True
duration	duration of the audio, in milliseconds	optional	`([0-9]+)`
session_id	identifier of the recording session.	optional
session_offset	offset (in milliseconds) of the recording with respect to other recordings that are part of the same session. Each recording session is identified by their session_id.	optional	`[0-9]+`
recording_device_id	unique ID of the recording device	optional
experimenter	who collected the data (could be anonymized ID)	optional
location_id	unique location ID – can be specified at the level of the child (if children do not change locations)	optional
its_filename	its_filename	optional
upl_filename	upl_filename	optional
trs_filename	trs_filename	optional
lena_id		optional
lena_recording_num	value of the num attribute of the corresponding <Recording> element, for LENA recordings that have been split into contiguous parts	optional
might_feature_gaps	1 if the audio cannot be guaranteed to be a continuous block with no time jumps, 0 or NA or undefined otherwise.	optional	is_boolean
start_time_accuracy	Accuracy of start_time for this recording. If not specified, second-accuracy is assumed.	optional	second, minute, hour, reliable, NA
noisy_setting	1 if the audio may be noisier than the child’s usual day, 0 or undefined otherwise	optional	is_boolean
notes	free-style notes about individual recordings (avoid tabs and newlines)	optional
discard	set to 1 if item should be discarded in analyses	optional	0, 1
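Two constraints from the table above are easy to get wrong: the `start_time` format and the uniqueness of `recording_filename`. The following sketch checks both with the standard library. The function names are hypothetical and the logic is an approximation of the rules as described here, not the package's own validation code.

```python
from collections import Counter
from datetime import datetime


def classify_start_time(value):
    """Return 'ok', 'warning' (value is NA) or 'error' (unparseable time)."""
    if value == "NA":
        return "warning"
    for fmt in ("%H:%M:%S", "%H:%M"):
        try:
            datetime.strptime(value, fmt)
            return "ok"
        except ValueError:
            continue
    return "error"


def duplicated_filenames(rows):
    """Recording filenames that are referenced by more than one row."""
    counts = Counter(row["recording_filename"] for row in rows)
    return sorted(name for name, n in counts.items() if n > 1)


recordings = [
    {"recording_filename": "recording1.wav", "start_time": "9:05"},
    {"recording_filename": "recording2.wav", "start_time": "NA"},
    {"recording_filename": "recording1.wav", "start_time": "25:00"},
]
for row in recordings:
    print(row["recording_filename"], classify_start_time(row["start_time"]))
print(duplicated_filenames(recordings))
```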

Splitting the metadata across several files

Sometimes, access to parts of the metadata should be limited to a list of authorized users. This can be achieved by moving confidential information out of the main notebook to a separate CSV file to be only delivered to authorized users. These additional files should be placed according to the table below:

Additional metadata
data	main notebook	location of additional notebooks
children	`metadata/children.csv`	`metadata/children/`
recordings	`metadata/recordings.csv`	`metadata/recordings/`

There can be as many additional notebooks as necessary, and recursion is permitted.

This is also useful if your metadata includes many columns and you’d like to spread it across several dataframes. This can also be used to deliver survey data in a separate file.

Note

In case two or more notebooks contain the same column, the files whose names come first in alphabetical order will prevail while loading the dataset with our package. For instance, if child_dob is specified in both metadata/recordings/0_private.csv and metadata/recordings/1_public.csv, the values in the former file will prevail if it is available. This is useful when anonymized values for a certain parameter still need to be shared, but should be replaced with the true values for those who have access to the full dataset.
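The precedence rule described in this note can be illustrated with plain dictionaries. This is a simplified sketch, assuming rows are matched on `child_id`; `merge_notebooks` is a hypothetical function and does not reproduce how the package actually loads the files.

```python
def merge_notebooks(notebooks, key="child_id"):
    """Merge rows from several notebooks, keyed by `key`.

    `notebooks` maps a file name to a list of row dictionaries. Files are
    visited in alphabetical order, and a column that is already filled is
    never overwritten, so the alphabetically-first file prevails.
    """
    merged = {}
    for filename in sorted(notebooks):
        for row in notebooks[filename]:
            target = merged.setdefault(row[key], {})
            for column, value in row.items():
                target.setdefault(column, value)
    return merged


private = [{"child_id": "C01", "child_dob": "2017-04-12"}]
public = [{"child_id": "C01", "child_dob": "2017-01-01", "child_sex": "f"}]

# A user with full access sees the true date of birth...
full = merge_notebooks({"1_public.csv": public, "0_private.csv": private})
print(full["C01"])
# ...while a user without the private file falls back to the public value.
restricted = merge_notebooks({"1_public.csv": public})
print(restricted["C01"])
```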

Warning

For recursive metadata, two dataframes cannot share the same basename. For instance, if one dataframe is located at metadata/children/dates-of-birth.csv, an error will be thrown if another dataframe exists at metadata/children/private/dates-of-birth.csv.
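A basename clash of this kind can be spotted before loading the dataset. The sketch below works on a list of relative paths; `basename_collisions` is a hypothetical helper written for illustration, not a function of the package.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def basename_collisions(paths):
    """Group paths by file name and keep the names used more than once."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: sorted(found) for name, found in by_name.items() if len(found) > 1}


paths = [
    "metadata/children/dates-of-birth.csv",
    "metadata/children/private/dates-of-birth.csv",
    "metadata/children/survey.csv",
]
print(basename_collisions(paths))
```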

Annotations

Upon importation, annotations are converted to standardized CSV dataframes (using built-in or custom ingestors) and registered into an index. The index of annotations stores the list of each interval that has been annotated for each annotator. This allows a number of functionalities such as the quick computation of the intersection of the portions of audio covered by a given set of annotators.

Annotation sets metadata

Additionally, information about a set of annotations (an ensemble grouping annotations from the same source) should be stored inside the set folder in a file named metannots.yml. This is a YAML-formatted text file made of key: value fields. The file is not mandatory, but creating it whenever you add a set of annotations is very strongly encouraged.

Here is an example of the content of metannots.yml for the annotations from an automated tool (VTC):

segmentation: 'vtc'
segmentation_type: 'permissive'
method: 'automated'
annotation_algorithm_name: 'VTC'
annotation_algorithm_publication: 'Lavechin, M., Bousbib, R., Bredin, H., Dupoux, E., & Cristia, A. (2020). An open-source voice type classifier for child-centered daylong recordings. Interspeech. Online open access: https://www.isca-archive.org/interspeech_2020/lavechin20_interspeech.pdf'
annotation_algorithm_version: '1'
annotation_algorithm_repo: 'https://github.com/MarvinLvn/voice-type-classifier/tree/e443d8cfc40f7076eea903958d9344d4aa427cc2'
date_annotation: '2024-04-07'
has_speaker_type: 'Y'

And another example for a human annotated set of annotations:

segmentation: 'textgrid2'
segmentation_type: 'permissive'
method: 'manual'
sampling_method: 'high-volubility'
sampling_target: 'fem'
sampling_count: 17
sampling_unit_duration: 50000
recording_selection: 'all recordings'
participant_selection: '1 to 2 yo'
annotator_name: 'Ivan Cliao'
annotator_experience: 5
date_annotation: '2019-07-16'
has_speaker_type: 'Y'
has_transcription: 'Y'
has_vcm_type: 'Y'
has_addressee: 'N'

All the supported fields are listed below with their description. Custom fields may be used but won’t be checked.

Metadata fields supported for annotation sets
Name	Description	Required?	Format
segmentation	source of the segmentation: repeat the set name if the set uses its own segmentation, or give the comma-separated name(s) of the other set(s) whose segmentation is used	optional
segmentation_type	permissivity of the segmentation: permissive if annotation segments may overlap each other, restrictive if only one speaker is allowed at a time	optional	permissive, restrictive
method	method used for the annotations: automated, human or a mix of both	optional	automated, manual, mixed, derivation, citizen-scientists
sampling_method	method used for sampling the annotated parts (none means the whole recording)	optional	none, manual, periodic, random, high-volubility, high-energy
sampling_target	targeted speaker type in the sampling	optional	chi, fem, mal, och
sampling_count	total count of sampled segments for this set. Other metrics like amount per child or recording can be derived from this number and the annotations dataframe.	optional
sampling_unit_duration	Target duration of each sampled segment in milliseconds. this does not mean that all segments are exactly this long	optional
recording_selection	how the recordings used for sampling were selected or excluded; be exhaustive	optional
participant_selection	how the participants used for sampling were selected or excluded; be exhaustive	optional
annotator_name	unique name for human annotators	optional
annotator_experience	Estimation of annotator’s experience from 1 to 5. 1 being ‘new to annotation’ and 5 ‘Expert’.	optional	1, 2, 3, 4, 5
annotation_algorithm_name	name of the algorithm	optional	VTC, ALICE, VCM, ITS
annotation_algorithm_publication	scientific publication citation for the algorithm used	optional
annotation_algorithm_version	version of the algorithm	optional
annotation_algorithm_repo	link to repository where the algorithm is stored. Ideally along with a commit hash for more reproducibility.	optional
date_annotation	date when the annotation was produced, best practice is to give the day the annotation was finished. This is meant to be a broad time label and does not need to be very precise	optional	`%Y-%m-%d`
has_speaker_type	Does the set contain the type of speakers. Yes(Y) / No(N or empty)	optional	Y, N,
has_transcription	Does the set contain transcriptions. Yes(Y) / No(N or empty)	optional	Y, N,
has_interactions	Does the set contain information about interactions between speakers. Yes(Y) / No(N or empty)	optional	Y, N,
has_acoustics	Does the set contain information about acoustic features of speakers. Yes(Y) / No(N or empty)	optional	Y, N,
has_addressee	Does the set contain the information of who the vocalization is addressed to. Yes(Y) / No(N or empty)	optional	Y, N,
has_vcm_type	Does the set contain information about vocal maturity of vocalizations. Yes(Y) / No(N or empty)	optional	Y, N,
has_words	Does the set contain information about number of words contained. Yes(Y) / No(N or empty)	optional	Y, N,
notes	Various notes about the set of annotations	optional
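The allowed values in the table above can be checked mechanically once the YAML file has been loaded into a dictionary. The sketch below skips the YAML parsing step (which would need a third-party parser) and starts from a plain dictionary; `check_metannots` is a hypothetical helper, and the value lists are copied from the table rather than read from the package.

```python
from datetime import datetime

ALLOWED = {
    "segmentation_type": {"permissive", "restrictive"},
    "method": {"automated", "manual", "mixed", "derivation", "citizen-scientists"},
    "sampling_method": {"none", "manual", "periodic", "random",
                        "high-volubility", "high-energy"},
    "sampling_target": {"chi", "fem", "mal", "och"},
    "annotator_experience": {"1", "2", "3", "4", "5"},
    "annotation_algorithm_name": {"VTC", "ALICE", "VCM", "ITS"},
}
# All has_* flags accept Y, N or an empty value.
for flag in ("speaker_type", "transcription", "interactions", "acoustics",
             "addressee", "vcm_type", "words"):
    ALLOWED["has_" + flag] = {"Y", "N", ""}


def check_metannots(fields):
    """Return a list of problems for the supported fields; custom fields are ignored."""
    problems = []
    for name, value in fields.items():
        if name in ALLOWED and str(value) not in ALLOWED[name]:
            problems.append(f"{name}: unexpected value {value!r}")
    if "date_annotation" in fields:
        try:
            datetime.strptime(str(fields["date_annotation"]), "%Y-%m-%d")
        except ValueError:
            problems.append("date_annotation: expected %Y-%m-%d")
    return problems


human_set = {
    "segmentation": "textgrid2",
    "segmentation_type": "permissive",
    "method": "manual",
    "sampling_method": "high-volubility",
    "sampling_target": "fem",
    "annotator_experience": 5,
    "date_annotation": "2019-07-16",
    "has_speaker_type": "Y",
    "has_addressee": "N",
}
print(check_metannots(human_set))
print(check_metannots({"method": "human", "date_annotation": "16/07/2019"}))
```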

Annotations format

The package provides functions to convert any annotation into the following CSV format, with one row per segment (e.g. per vocalization event):

Annotations format
Name	Description	Required?	Format
raw_filename	raw annotation path, relative to annotations/<set>/raw	required
segment_onset	segment onset timestamp in milliseconds (since the start of the recording)	required	`([0-9]+)`
segment_offset	segment end time in milliseconds (since the start of the recording)	required	`([0-9]+)`
speaker_id	identity of speaker in the annotation	optional
speaker_type	class of speaker (FEM = female adult, MAL = male adult, CHI = key child, OCH = other child)	optional	FEM, MAL, CHI, OCH, NA
ling_type	1 if the vocalization contains at least a vowel (ie canonical or non-canonical), 0 if crying or laughing	optional	1, 0, NA
vcm_type	vocal maturity defined as: C (canonical), N (non-canonical), Y (crying), L (laughing), J (junk), U (uncertain)	optional	C, N, Y, L, J, U, NA
lex_type	W if meaningful, 0 otherwise	optional	W, 0, NA
mwu_type	M if multiword, 1 if single word – only filled if lex_type==W	optional	M, 1, NA
msc_type	morphosyntactical complexity of the utterances defined as: 0 (0 meaningful word), 1 (1 meaningful word), 2 (2 meaningful words), S (simple utterance), C (complex utterance), U (uncertain)	optional	0, 1, 2, S, C, U
gra_type	grammaticality of the utterances defined as: G (grammatical), J (ungrammatical), U (uncertain)	optional	G, J, U
addressee	T if target-child-directed, C if other-child-directed, A if adult-directed, O if addressed to other, P if addressed to a pet, U if uncertain or other. Multiple values should be sorted and separated by commas	optional	T, C, A, O, P, U, NA
transcription	orthographic transcription of the speech	optional
phonemes	amount of phonemes	optional	`(\d+(\.\d+)?)`
syllables	amount of syllables	optional	`(\d+(\.\d+)?)`
words	amount of words	optional	`(\d+(\.\d+)?)`
lena_block_type	whether regarded as part of a pause or a conversation by LENA	optional	pause, CM, CIC, CIOCX, CIOCAX, AMF, AICF, AIOCF, AIOCCXF, AMM, AICM, AIOCM, AIOCCXM, XM, XIOCC, XIOCA, XIC, XIOCAC
lena_block_number	number of the LENA pause/conversation the segment belongs to	optional	`(\d+(\.\d+)?)`
lena_conv_status	LENA conversation status	optional	BC, RC, EC
lena_response_count	LENA turn count within block	optional	`(\d+(\.\d+)?)`
lena_conv_floor_type	(FI): Floor Initiation, (FH): Floor Holding	optional	FI, FH
lena_conv_turn_type	LENA turn type	optional	TIFI, TIMI, TIFR, TIMR, TIFE, TIME, NT
lena_speaker	LENA speaker type	optional	TVF, FAN, OLN, SIL, NOF, CXF, OLF, CHF, MAF, TVN, NON, CXN, CHN, MAN, FAF
utterances_count	utterances count	optional	`(\d+(\.\d+)?)`
utterances_length	utterances length	optional	`([0-9]+)`
non_speech_length	non-speech length	optional	`([0-9]+)`
average_db	average dB level	optional	`(\-?)(\d+(\.\d+)?)`
peak_db	peak dB level	optional	`(\-?)(\d+(\.\d+)?)`
child_cry_vfx_len	childCryVfxLen	optional	`([0-9]+)`
utterances	LENA utterances details (json)	optional
cries	cries (json)	optional
vfxs	Vfx (json)	optional

Custom columns may be used, although they should be documented somewhere in your dataset.

Annotations index

Warning

The index is maintained through the package functions only; it should never be updated by hand.

Annotations are indexed in one unique dataframe located at /metadata/annotations.csv, with the following format:

Annotations metadata
Name	Description	Required?	Format
set	name of the annotation set (e.g. VTC, annotator1, etc.)	required
recording_filename	recording filename as specified in the recordings index	required
time_seek	shift between the timestamps in the raw input annotations and the actual corresponding timestamps in the recordings (in milliseconds)	required	`(\-?)([0-9]+)`
range_onset	covered range onset timestamp in milliseconds (since the start of the recording)	required	`[0-9]+`
range_offset	covered range offset timestamp in milliseconds (since the start of the recording)	required	`[0-9]+`
raw_filename	annotation input filename location, relative to annotations/<set>/raw	required	True
annotation_filename	output formatted annotation location, relative to annotations/<set>/converted (automatic column, don’t specify)	optional	True
imported_at	importation date (automatic column, don’t specify)	optional	`%Y-%m-%d %H:%M:%S`
package_version	version of the package used when the importation was performed	optional	`[0-9]+\.[0-9]+\.[0-9]+`
error	error message in case the annotation could not be imported	optional
merged_from	sets used to generate this annotation by merging (comma separated)	optional

Below is an example of an index file (some uninformative columns are hidden for clarity). In this case, one recording has been fully annotated using the Voice Type Classifier (vtc), and partially annotated by two humans (LM and SP). Both humans annotated the same seven 15-second clips.

set	recording_filename	range_onset	range_offset	raw_filename	format	annotation_filename
vtc	A730/A730_001105.wav	0	42764250	A730/A730_001105.rttm	vtc_rttm	A730/A730_001105_0_42764250.csv
eaf_2021/SP	A730/A730_001105.wav	2910000	2925000	A730_001105.eaf	eaf	A730/A730_001105_2910000_2925000.csv
eaf_2021/SP	A730/A730_001105.wav	4680000	4695000	A730_001105.eaf	eaf	A730/A730_001105_4680000_4695000.csv
eaf_2021/SP	A730/A730_001105.wav	4695000	4710000	A730_001105.eaf	eaf	A730/A730_001105_4695000_4710000.csv
eaf_2021/SP	A730/A730_001105.wav	14055000	14070000	A730_001105.eaf	eaf	A730/A730_001105_14055000_14070000.csv
eaf_2021/SP	A730/A730_001105.wav	15030000	15045000	A730_001105.eaf	eaf	A730/A730_001105_15030000_15045000.csv
eaf_2021/SP	A730/A730_001105.wav	36465000	36480000	A730_001105.eaf	eaf	A730/A730_001105_36465000_36480000.csv
eaf_2021/SP	A730/A730_001105.wav	39450000	39465000	A730_001105.eaf	eaf	A730/A730_001105_39450000_39465000.csv
eaf_2021/LM	A730/A730_001105.wav	2910000	2925000	A730_001105.eaf	eaf	A730/A730_001105_2910000_2925000.csv
eaf_2021/LM	A730/A730_001105.wav	4680000	4695000	A730_001105.eaf	eaf	A730/A730_001105_4680000_4695000.csv
eaf_2021/LM	A730/A730_001105.wav	4695000	4710000	A730_001105.eaf	eaf	A730/A730_001105_4695000_4710000.csv
eaf_2021/LM	A730/A730_001105.wav	14055000	14070000	A730_001105.eaf	eaf	A730/A730_001105_14055000_14070000.csv
eaf_2021/LM	A730/A730_001105.wav	15030000	15045000	A730_001105.eaf	eaf	A730/A730_001105_15030000_15045000.csv
eaf_2021/LM	A730/A730_001105.wav	36465000	36480000	A730_001105.eaf	eaf	A730/A730_001105_36465000_36480000.csv
eaf_2021/LM	A730/A730_001105.wav	39450000	39465000	A730_001105.eaf	eaf	A730/A730_001105_39450000_39465000.csv
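Because the index stores the covered range of every annotation, the portion of audio covered by several sets can be found with a simple interval intersection. The sketch below applies it to the `vtc` and `eaf_2021/SP` rows of the example above. It is a standalone illustration with a hypothetical `intersect` function, not the package's own implementation.

```python
def intersect(a, b):
    """Intersect two lists of (onset, offset) ranges, in milliseconds.

    Each list is assumed to contain non-overlapping ranges.
    """
    a, b = sorted(a), sorted(b)
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            common.append((start, end))
        # Advance whichever range finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return common


vtc = [(0, 42764250)]
sp = [
    (2910000, 2925000), (4680000, 4695000), (4695000, 4710000),
    (14055000, 14070000), (15030000, 15045000),
    (36465000, 36480000), (39450000, 39465000),
]

common = intersect(vtc, sp)
total_ms = sum(end - start for start, end in common)
print(len(common), "ranges,", total_ms, "ms covered by both sets")
```

Since the automated set covers the whole recording, the intersection is simply the seven human-annotated clips, i.e. 7 × 15000 ms.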

Annotation importation input format

The annotations importation script (Bulk importation) and python method (ChildProject.annotations.AnnotationManager.import_annotations()) take a dataframe of the following format as an input:

Input annotations
Name	Description	Required?	Format
set	name of the annotation set (e.g. VTC, annotator1, etc.)	required
recording_filename	recording filename as specified in the recordings index	required
time_seek	shift between the timestamps in the raw input annotations and the actual corresponding timestamps in the recordings (in milliseconds)	required	`(\-?)([0-9]+)`
range_onset	covered range onset timestamp in milliseconds (since the start of the recording)	required	`[0-9]+`
range_offset	covered range offset timestamp in milliseconds (since the start of the recording)	required	`[0-9]+`
raw_filename	annotation input filename location, relative to annotations/<set>/raw	required	True
format	input annotation format	optional	csv, vtc_rttm, vcm_rttm, alice, its, TextGrid, eaf, cha, w2v2-sm, NA, custom
filter	source file to target. this field is dedicated to rttm and ALICE annotations that may combine annotations from several recordings into one same text file.	optional
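As an illustration of this input format, the sketch below builds one row per annotated clip and serializes the result to CSV with the standard library. `clip_rows` is a hypothetical helper, and `time_seek` is set to 0 on the assumption that the raw timestamps are already aligned with the recording. With the package installed, an equivalent dataframe would be handed to the bulk importation script or to `import_annotations()`.

```python
import csv
import io

COLUMNS = ["set", "recording_filename", "time_seek", "range_onset",
           "range_offset", "raw_filename", "format"]


def clip_rows(set_name, recording, raw_filename, fmt, onsets, clip_ms):
    """One input row per annotated clip; all times are integer milliseconds."""
    return [
        {
            "set": set_name,
            "recording_filename": recording,
            "time_seek": 0,  # assumption: raw timestamps already match the recording
            "range_onset": onset,
            "range_offset": onset + clip_ms,
            "raw_filename": raw_filename,
            "format": fmt,
        }
        for onset in onsets
    ]


rows = clip_rows("eaf_2021/SP", "A730/A730_001105.wav", "A730_001105.eaf", "eaf",
                 onsets=[2910000, 4680000, 4695000], clip_ms=15000)

# Serialize to CSV text; write it to a file of your choice instead if needed.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```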

Note

In order to avoid rounding errors, all timestamps are integers, expressed in milliseconds.
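When raw annotations express time in seconds, the conversion to integer milliseconds is best done by rounding rather than truncating, since floating-point products can fall just below the intended value. A minimal sketch, with hypothetical function names:

```python
def to_milliseconds(seconds):
    """Convert a time in seconds to the nearest whole millisecond."""
    return int(round(seconds * 1000))


def segment_row(onset_s, duration_s):
    """Build the two required timestamp columns from an onset and a duration in seconds."""
    onset = to_milliseconds(onset_s)
    return {
        "segment_onset": onset,
        "segment_offset": onset + to_milliseconds(duration_s),
    }


# Truncating with int() alone could land one millisecond short here,
# because 1.001 * 1000 is not guaranteed to be exactly 1001 in floating point.
print(segment_row(1.001, 0.5))
```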

Documentation

An important aspect of a dataset is its documentation. Documentation includes:

authorship, references, contact information

a description of the corpus (population, collection process, etc.)

instructions to re-use the data

description of the data itself (e.g. a definition of each metadata field)

We currently do not provide a format for all of this documentation. It is up to you to decide how to provide users with each of these pieces of information.

However, we suggest several options below.

Metadata and annotations

The ChildProject package supports a machine-readable format to describe the contents of the metadata and the annotations.

This format consists of a CSV dataframe structured according to the following table:

Machine-readable documentation
Name	Description	Required?
variable	name of the variable	required
description	a definition of this field	required
values	a summary of authorized values	optional
scope	which group of users has access to it	optional
annotation_set	for annotations: which set(s) contain this variable	optional

Documentation for the children metadata should be stored in docs/children.csv
Documentation for the recordings metadata should be stored in docs/recordings.csv
Documentation for annotations should be stored in docs/annotations.csv
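A documentation file in this format can be produced with the standard library. The sketch below is illustrative only: `write_docs` is a hypothetical helper, and the two entries reuse descriptions from the children metadata table. The resulting text would be saved as docs/children.csv.

```python
import csv
import io

DOC_COLUMNS = ["variable", "description", "values", "scope", "annotation_set"]


def write_docs(entries):
    """Serialize documentation entries to CSV text; optional columns default to empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DOC_COLUMNS, restval="")
    writer.writeheader()
    for entry in entries:
        # variable and description are the two required columns.
        if not entry.get("variable") or not entry.get("description"):
            raise ValueError("each entry needs a variable and a description")
        writer.writerow(entry)
    return buffer.getvalue()


children_docs = write_docs([
    {"variable": "child_dob", "description": "child’s date of birth",
     "values": "%Y-%m-%d"},
    {"variable": "child_sex", "description": "f= female, m=male",
     "values": "m, M, f, F, NA"},
])
print(children_docs)
```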

Authorship

We recommend DataCite’s .yaml format (see here)