Datasets structure

ChildProject assumes your data is structured in a specific way before it is imported. This structure is necessary to check, for instance, that there are no unreferenced files, and no referenced files that are actually missing. The data curator therefore needs to organize their data in a specific way (respecting the dataset tree, with all specified metadata files, and all specified columns within the metadata files) before their data can be imported.

To be imported, datasets must pass the validation routine (see Data validation) with no errors. We also recommend that you pay attention to the warnings and resolve as many of them as possible before submission.

Dataset tree

All datasets should have this structure before import (so you need to organize your files into this structure):

project
│
│
└───metadata
│   │   children.csv
│   │   recordings.csv
│   │   annotations.csv
│
└───recordings
│   └───raw
│   │   │   recording1.wav
│
└───annotations
│   └───vtc
│   │   └───raw
│   │   │   │   child1.rttm
│   └───annotator1
│   │   └───raw
│   │   │   │   child1_3600.TextGrid
│
└───extra
    │   notes.txt
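
Before running the validation routine, it can help to check that the basic skeleton is in place. The following is a minimal sketch, not part of the package: the helper name `missing_entries` and the choice of which entries to test are assumptions made for illustration, based only on the tree above.

```python
from pathlib import Path
import tempfile

# Entries taken from the tree above. Which of them are strictly mandatory
# is an assumption of this sketch; the validation routine is authoritative.
EXPECTED_DIRS = ["metadata", "recordings/raw"]
EXPECTED_FILES = ["metadata/children.csv", "metadata/recordings.csv"]


def missing_entries(root):
    """Return the expected relative paths that are absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


def demo():
    """Build an incomplete project in a temporary folder and check it."""
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"
        (project / "metadata").mkdir(parents=True)
        (project / "recordings" / "raw").mkdir(parents=True)
        (project / "metadata" / "children.csv").write_text(
            "experiment,child_id,child_dob\n"
        )
        return missing_entries(project)


if __name__ == "__main__":
    print(demo())  # metadata/recordings.csv was deliberately left out
```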

The children and recordings notebooks should be CSV dataframes formatted according to the standards detailed right below.

Metadata

Children notebook

The children metadata dataframe needs to be saved at metadata/children.csv. It should be formatted as instructed below; you can add more fields beyond those that are standardized, but make sure to document them.

Children metadata
Name	Description	Required?	Format
experiment	one word to capture the unique ID of the data collection effort; for instance Tsimane_2018, solis-intervention-pre	required
child_id	unique child ID – unique within the experiment (Id could be repeated across experiments to refer to different children)	required
child_dob	child’s date of birth	required	`%Y-%m-%d`
location_id	Unique location ID – only specify here if children never change locations in this culture; otherwise, specify in the recordings metadata	optional
child_sex	f= female, m=male	optional	m, M, f, F
language	language the child is exposed to if child is monolingual; lowercase; indicate dialect by name or location if available; e.g. “france french”; “paris french”	optional
languages	list languages child is exposed to separating them with ; and indicating the percentage if one is available; eg: “french 35%; english 65%”	optional
mat_ed	maternal years of education	optional
fat_ed	paternal years of education	optional
car_ed	years of education of main caregiver (if not mother or father)	optional
monoling	whether the child is monolingual (Y) or not (N)	optional	Y, N
monoling_criterion	how monoling was decided; eg “we asked families which languages they spoke in the home”	optional
normative	whether the child is normative (Y) or not (N)	optional	Y, N
normative_criterion	how normative was decided; eg “unless the caregivers volunteered information whereby the child had a problem, we consider them normative by default”	optional
mother_id	unique ID of the mother	optional
father_id	unique ID of the father	optional
order_of_birth	child order of birth	optional	`(\d+(\.\d+)?)`
n_of_siblings	number of siblings	optional	`(\d+(\.\d+)?)`
household_size	number of people living in the household (adults+children)	optional	`(\d+(\.\d+)?)`
dob_criterion	determines whether the date of birth is known exactly or extrapolated, e.g. from the age. Dates of birth are assumed to be known exactly if this column is NA or unspecified.	optional	extrapolated, exact
dob_accuracy	date of birth accuracy	optional	day, week, month, year, other
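
The required columns and formats in the table above can be checked with a few lines of standard-library Python before submitting a dataset. This is only an illustrative sketch: the function `check_children`, the sample child IDs and the comma delimiter are assumptions, and the package's own validation routine remains the reference.

```python
import csv
import io
from datetime import datetime

REQUIRED = ["experiment", "child_id", "child_dob"]
SEX_VALUES = {"m", "M", "f", "F"}

# Made-up sample data, for illustration only.
SAMPLE = """experiment,child_id,child_dob,child_sex
Tsimane_2018,C01,2017-03-21,f
Tsimane_2018,C02,21/03/2017,m
Tsimane_2018,C01,2016-11-02,x
"""


def check_children(csv_text):
    """Return a list of human-readable problems found in a children notebook."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED if c not in header]
    if problems:
        return problems

    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        key = (row["experiment"], row["child_id"])
        if key in seen:
            problems.append(f"line {line_no}: duplicate child_id {row['child_id']}")
        seen.add(key)

        try:
            datetime.strptime(row["child_dob"], "%Y-%m-%d")
        except ValueError:
            problems.append(f"line {line_no}: child_dob is not in %Y-%m-%d format")

        sex = row.get("child_sex")
        if sex and sex not in SEX_VALUES:
            problems.append(f"line {line_no}: unexpected child_sex {sex}")
    return problems


if __name__ == "__main__":
    for problem in check_children(SAMPLE):
        print(problem)
```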

Recordings notebook

The recordings metadata dataframe needs to be saved at metadata/recordings.csv. It should be formatted as instructed below; you can add more fields beyond those that are standardized, but make sure to document them.

Recordings metadata
Name	Description	Required?	Format
experiment	one word to capture the unique ID of the data collection effort; for instance Tsimane_2018, solis-intervention-pre	required
child_id	unique child ID – unique within the experiment (Id could be repeated across experiments to refer to different children)	required
date_iso	date on which the recording was started, in ISO format (e.g. 2020-09-17)	required	`%Y-%m-%d`
start_time	local time at which the recording was started, in 24-hour format (H)H:MM; if minutes are unknown, use 00. Set as ‘NA’ if unknown.	required	`%H:%M`
recording_device_type	lena, usb, olympus, babylogger (lowercase)	required	lena, usb, olympus, babylogger
recording_filename	the path to the file from the root of “recordings”. It MUST be unique (two recordings cannot point towards the same file).	required	True
duration	duration of the audio, in milliseconds	optional	`([0-9]+)`
session_id	identifier of the recording session.	optional
session_offset	offset (in milliseconds) of the recording with respect to other recordings that are part of the same session. Each recording session is identified by its session_id.	optional	`[0-9]+`
recording_device_id	unique ID of the recording device	optional
experimenter	who collected the data (could be anonymized ID)	optional
location_id	unique location ID – can be specified at the level of the child (if children do not change locations)	optional
its_filename	its_filename	optional
upl_filename	upl_filename	optional
trs_filename	trs_filename	optional
lena_id		optional
might_feature_gaps	1 if the audio cannot be guaranteed to be a continuous block with no time jumps, 0 or NA or undefined otherwise.	optional	is_boolean
start_time_accuracy	Accuracy of start_time for this recording. If not specified, minute accuracy is assumed.	optional	minute, hour, reliable
noisy_setting	1 if the audio may be noisier than the child’s usual day, 0 or undefined otherwise	optional	is_boolean
notes	free-style notes about individual recordings (avoid tabs and newlines)	optional
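
Three of the constraints above are easy to get wrong by hand: `recording_filename` must be unique, `start_time` must be (H)H:MM or NA, and `recording_device_type` must be one of the lowercase values listed. The sketch below checks them with the standard library; `check_recordings` and the sample rows are hypothetical and only meant to illustrate the rules.

```python
import csv
import io
import re
from collections import Counter

DEVICE_TYPES = {"lena", "usb", "olympus", "babylogger"}
START_TIME = re.compile(r"^([01]?\d|2[0-3]):[0-5]\d$")  # (H)H:MM, 24-hour

# Made-up sample data, for illustration only.
SAMPLE = """experiment,child_id,date_iso,start_time,recording_device_type,recording_filename
Tsimane_2018,C01,2018-06-02,9:30,lena,recording1.wav
Tsimane_2018,C01,2018-06-03,NA,usb,recording1.wav
Tsimane_2018,C02,2018-06-04,25:00,LENA,recording3.wav
"""


def check_recordings(csv_text):
    """Return a list of problems found in a recordings notebook."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []

    # Two recordings cannot point towards the same file.
    counts = Counter(row["recording_filename"] for row in rows)
    for name, n in counts.items():
        if n > 1:
            problems.append(f"recording_filename used {n} times: {name}")

    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        start = row["start_time"]
        if start != "NA" and not START_TIME.match(start):
            problems.append(f"line {line_no}: bad start_time {start}")
        device = row["recording_device_type"]
        if device not in DEVICE_TYPES:
            problems.append(f"line {line_no}: unknown recording_device_type {device}")
    return problems


if __name__ == "__main__":
    for problem in check_recordings(SAMPLE):
        print(problem)
```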

Splitting the metadata across several files

Sometimes, access to parts of the metadata should be limited to a list of authorized users. This can be achieved by moving confidential information out of the main notebook to a separate CSV file to be only delivered to authorized users. These additional files should be placed according to the table below:

Additional metadata
data	main notebook	location of additional notebooks
children	`metadata/children.csv`	`metadata/children/`
recordings	`metadata/recordings.csv`	`metadata/recordings/`

There can be as many additional notebooks as necessary, and recursion is permitted.

This is also useful if your metadata includes many columns and you’d like to spread it across several dataframes. This can also be used to deliver survey data in a separate file.

Note

If two or more notebooks contain the same column, the file whose name comes first in alphabetical order prevails when the dataset is loaded with our package. For instance, if child_dob is specified in both metadata/children/0_private.csv and metadata/children/1_public.csv, the values in the former file will prevail if it is available. This is useful when anonymized values for a certain parameter still need to be shared, but should be replaced with the true values for those who have access to the full dataset.
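
The precedence rule can be pictured with a small sketch. It is not the package's loading code: `resolve_columns` is a hypothetical helper, notebooks are represented as plain dictionaries, and the dates are made up. It only demonstrates that, for a column present in several additional notebooks, the file that sorts first supplies the values.

```python
def resolve_columns(notebooks):
    """Merge additional notebooks, giving each column to the first file.

    notebooks maps a file name to {child_id: {column: value}}.
    For every column, only the file whose name sorts first alphabetically
    among those containing it contributes values.
    """
    owner = {}   # column -> name of the file that provides it
    merged = {}  # child_id -> {column: value}
    for name in sorted(notebooks):
        for child_id, row in notebooks[name].items():
            for column, value in row.items():
                if owner.setdefault(column, name) == name:
                    merged.setdefault(child_id, {})[column] = value
    return merged


# Made-up values: the public file carries an anonymized date of birth.
NOTEBOOKS = {
    "1_public.csv": {"C01": {"child_dob": "2017-01-01", "child_sex": "f"}},
    "0_private.csv": {"C01": {"child_dob": "2017-03-21"}},
}

if __name__ == "__main__":
    # With both files available, child_dob comes from 0_private.csv.
    print(resolve_columns(NOTEBOOKS))
    # Without the private file, the public values are used.
    print(resolve_columns({"1_public.csv": NOTEBOOKS["1_public.csv"]}))
```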

Warning

For recursive metadata, two dataframes cannot share the same basename. For instance, if one dataframe is located at metadata/children/dates-of-birth.csv, an error will be thrown if another dataframe exists at metadata/children/private/dates-of-birth.csv.
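
A quick way to spot such clashes before importing is to group the notebook paths by basename. The helper `basename_collisions` below is a hypothetical illustration, not a package function.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def basename_collisions(paths):
    """Return {basename: [paths]} for basenames used by more than one file."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: sorted(group) for name, group in by_name.items() if len(group) > 1}


PATHS = [
    "metadata/children/dates-of-birth.csv",
    "metadata/children/private/dates-of-birth.csv",
    "metadata/children/survey.csv",  # made-up file name with no clash
]

if __name__ == "__main__":
    print(basename_collisions(PATHS))
```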

Annotations

Upon importation, annotations are converted to standardized CSV dataframes (using built-in or custom ingestors) and registered into an index. The index of annotations stores the list of each interval that has been annotated for each annotator. This allows a number of functionalities such as the quick computation of the intersection of the portions of audio covered by a given set of annotators.
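
To make the idea of intersecting annotated portions concrete, here is a minimal sketch of the underlying interval arithmetic. It is not the package's implementation: `intersect` is a hypothetical function, and it assumes each annotator's ranges are sorted and non-overlapping. The ranges are in milliseconds and borrowed from the index example shown later in this section.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping (onset, offset) ranges."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        onset = max(a[i][0], b[j][0])
        offset = min(a[i][1], b[j][1])
        if onset < offset:
            result.append((onset, offset))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


VTC = [(0, 42764250)]  # the whole recording
SP = [(2910000, 2925000), (4680000, 4695000), (4695000, 4710000)]

if __name__ == "__main__":
    # Every clip annotated by SP falls inside the range covered by vtc.
    print(intersect(VTC, SP))
```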

Annotations format

The package provides functions to convert any annotation into the following CSV format, with one row per segment (e.g. per vocalization event):

Annotations format
Name	Description	Required?	Format
raw_filename	raw annotation path, relative to annotations/<set>/raw	required
segment_onset	segment onset timestamp in milliseconds (since the start of the recording)	required	`([0-9]+)`
segment_offset	segment end time in milliseconds (since the start of the recording)	required	`([0-9]+)`
speaker_id	identity of speaker in the annotation	optional
speaker_type	class of speaker (FEM = female adult, MAL = male adult, CHI = key child, OCH = other child)	optional	FEM, MAL, CHI, OCH, NA
ling_type	1 if the vocalization contains at least a vowel (ie canonical or non-canonical), 0 if crying or laughing	optional	1, 0, NA
vcm_type	vocal maturity defined as: C (canonical), N (non-canonical), Y (crying), L (laughing), J (junk)	optional	C, N, Y, L, J, NA
lex_type	W if meaningful, 0 otherwise	optional	W, 0, NA
mwu_type	M if multiword, 1 if single word – only filled if lex_type==W	optional	M, 1, NA
addressee	T if target-child-directed, C if other-child-directed, A if adult-directed, U if uncertain or other. Multiple values should be sorted and separated by commas	optional	T, C, A, U, NA
transcription	orthographic transcription of the speech	optional
phonemes	number of phonemes	optional	`(\d+(\.\d+)?)`
syllables	number of syllables	optional	`(\d+(\.\d+)?)`
words	number of words	optional	`(\d+(\.\d+)?)`
lena_block_type	whether regarded as part of a pause or a conversation by LENA	optional	pause, CM, CIC, CIOCX, CIOCAX, AMF, AICF, AIOCF, AIOCCXF, AMM, AICM, AIOCM, AIOCCXM, XM, XIOCC, XIOCA, XIC, XIOCAC
lena_block_number	number of the LENA pause/conversation the segment belongs to	optional	`(\d+(\.\d+)?)`
lena_conv_status	LENA conversation status	optional	BC, RC, EC
lena_response_count	LENA turn count within block	optional	`(\d+(\.\d+)?)`
lena_conv_floor_type	(FI): Floor Initiation, (FH): Floor Holding	optional	FI, FH
lena_conv_turn_type	LENA turn type	optional	TIFI, TIMI, TIFR, TIMR, TIFE, TIME, NT
utterances_count	utterances count	optional	`(\d+(\.\d+)?)`
utterances_length	utterances length	optional	`([0-9]+)`
non_speech_length	non-speech length	optional	`([0-9]+)`
average_db	average dB level	optional	`(\-?)(\d+(\.\d+)?)`
peak_db	peak dB level	optional	`(\-?)(\d+(\.\d+)?)`
child_cry_vfx_len	childCryVfxLen	optional	`([0-9]+)`
utterances	LENA utterances details (json)	optional
cries	cries (json)	optional
vfxs	Vfx (json)	optional

Custom columns may be used, although they should be documented somewhere in your dataset.

Annotations index

Warning

The index is maintained through the package functions only; it should never be updated by hand.

Annotations are indexed in a single dataframe located at metadata/annotations.csv, with the following format:

Annotations metadata
Name	Description	Required?	Format
set	name of the annotation set (e.g. VTC, annotator1, etc.)	required
recording_filename	recording filename as specified in the recordings index	required
time_seek	shift between the timestamps in the raw input annotations and the actual corresponding timestamps in the recordings (in milliseconds)	required	`(\-?)([0-9]+)`
range_onset	covered range onset timestamp in milliseconds (since the start of the recording)	required	`[0-9]+`
range_offset	covered range offset timestamp in milliseconds (since the start of the recording)	required	`[0-9]+`
raw_filename	annotation input filename location, relative to annotations/<set>/raw	required	True
annotation_filename	output formatted annotation location, relative to annotations/<set>/converted (automatic column, don’t specify)	optional	True
imported_at	importation date (automatic column, don’t specify)	optional	`%Y-%m-%d %H:%M:%S`
package_version	version of the package used when the importation was performed	optional	`[0-9]+\.[0-9]+\.[0-9]+`
error	error message in case the annotation could not be imported	optional

Below is an example of an index file (some uninformative columns are hidden for clarity). In this case, one recording has been fully annotated using the Voice Type Classifier (vtc), and partially annotated by two humans (LM and SP). Both humans annotated the same seven 15-second clips.

set	recording_filename	range_onset	range_offset	raw_filename	format	annotation_filename
vtc	A730/A730_001105.wav	0	42764250	A730/A730_001105.rttm	vtc_rttm	A730/A730_001105_0_42764250.csv
eaf_2021/SP	A730/A730_001105.wav	2910000	2925000	A730_001105.eaf	eaf	A730/A730_001105_2910000_2925000.csv
eaf_2021/SP	A730/A730_001105.wav	4680000	4695000	A730_001105.eaf	eaf	A730/A730_001105_4680000_4695000.csv
eaf_2021/SP	A730/A730_001105.wav	4695000	4710000	A730_001105.eaf	eaf	A730/A730_001105_4695000_4710000.csv
eaf_2021/SP	A730/A730_001105.wav	14055000	14070000	A730_001105.eaf	eaf	A730/A730_001105_14055000_14070000.csv
eaf_2021/SP	A730/A730_001105.wav	15030000	15045000	A730_001105.eaf	eaf	A730/A730_001105_15030000_15045000.csv
eaf_2021/SP	A730/A730_001105.wav	36465000	36480000	A730_001105.eaf	eaf	A730/A730_001105_36465000_36480000.csv
eaf_2021/SP	A730/A730_001105.wav	39450000	39465000	A730_001105.eaf	eaf	A730/A730_001105_39450000_39465000.csv
eaf_2021/LM	A730/A730_001105.wav	2910000	2925000	A730_001105.eaf	eaf	A730/A730_001105_2910000_2925000.csv
eaf_2021/LM	A730/A730_001105.wav	4680000	4695000	A730_001105.eaf	eaf	A730/A730_001105_4680000_4695000.csv
eaf_2021/LM	A730/A730_001105.wav	4695000	4710000	A730_001105.eaf	eaf	A730/A730_001105_4695000_4710000.csv
eaf_2021/LM	A730/A730_001105.wav	14055000	14070000	A730_001105.eaf	eaf	A730/A730_001105_14055000_14070000.csv
eaf_2021/LM	A730/A730_001105.wav	15030000	15045000	A730_001105.eaf	eaf	A730/A730_001105_15030000_15045000.csv
eaf_2021/LM	A730/A730_001105.wav	36465000	36480000	A730_001105.eaf	eaf	A730/A730_001105_36465000_36480000.csv
eaf_2021/LM	A730/A730_001105.wav	39450000	39465000	A730_001105.eaf	eaf	A730/A730_001105_39450000_39465000.csv
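
Since each index row records the range it covers, the amount of audio annotated by each set follows from a simple sum. The sketch below uses the vtc and eaf_2021/SP rows of the example above; `covered_ms` is a hypothetical helper and assumes that ranges within a set do not overlap.

```python
from collections import defaultdict

# (set, range_onset, range_offset) taken from the example index above.
INDEX_ROWS = [
    ("vtc", 0, 42764250),
    ("eaf_2021/SP", 2910000, 2925000),
    ("eaf_2021/SP", 4680000, 4695000),
    ("eaf_2021/SP", 4695000, 4710000),
    ("eaf_2021/SP", 14055000, 14070000),
    ("eaf_2021/SP", 15030000, 15045000),
    ("eaf_2021/SP", 36465000, 36480000),
    ("eaf_2021/SP", 39450000, 39465000),
]


def covered_ms(rows):
    """Sum the covered duration (in milliseconds) for each annotation set."""
    totals = defaultdict(int)
    for set_name, onset, offset in rows:
        totals[set_name] += offset - onset
    return dict(totals)


if __name__ == "__main__":
    # Seven 15-second clips amount to 105000 ms for the human annotator.
    print(covered_ms(INDEX_ROWS))
```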

Annotation importation input format

The annotations importation script (Bulk importation) and the Python method (ChildProject.annotations.AnnotationManager.import_annotations()) take a dataframe of the following format as an input:

Input annotations
Name	Description	Required?	Format
set	name of the annotation set (e.g. VTC, annotator1, etc.)	required
recording_filename	recording filename as specified in the recordings index	required
time_seek	shift between the timestamps in the raw input annotations and the actual corresponding timestamps in the recordings (in milliseconds)	required	`(\-?)([0-9]+)`
range_onset	covered range onset timestamp in milliseconds (since the start of the recording)	required	`[0-9]+`
range_offset	covered range offset timestamp in milliseconds (since the start of the recording)	required	`[0-9]+`
raw_filename	annotation input filename location, relative to annotations/<set>/raw	required	True
format	input annotation format	optional	csv, vtc_rttm, vcm_rttm, alice, its, TextGrid, eaf, cha, NA
filter	source file to filter in (for rttm and alice only)	optional
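
As an illustration, the table below could be prepared as a CSV file with the standard library. This is a sketch under assumptions: `build_import_table` is a hypothetical helper, the row reuses the vtc entry of the index example, and `time_seek` is set to 0 because that column was hidden in the example. Loading the result as a dataframe and passing it to the import method is left out.

```python
import csv
import io

REQUIRED = ["set", "recording_filename", "time_seek",
            "range_onset", "range_offset", "raw_filename"]
FORMATS = {"csv", "vtc_rttm", "vcm_rttm", "alice", "its",
           "TextGrid", "eaf", "cha", "NA"}


def build_import_table(rows):
    """Check the rows against the table above and return them as CSV text."""
    for row in rows:
        missing = [c for c in REQUIRED if c not in row]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        if row.get("format", "NA") not in FORMATS:
            raise ValueError(f"unknown format: {row['format']}")

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REQUIRED + ["format"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


ROWS = [{
    "set": "vtc",
    "recording_filename": "A730/A730_001105.wav",
    "time_seek": 0,  # assumption: not shown in the example index
    "range_onset": 0,
    "range_offset": 42764250,
    "raw_filename": "A730/A730_001105.rttm",
    "format": "vtc_rttm",
}]

if __name__ == "__main__":
    print(build_import_table(ROWS))
```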

Note

In order to avoid rounding errors, all timestamps are integers, expressed in milliseconds.
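
Some raw formats express times in seconds (RTTM files, for instance, give an onset and a duration in seconds), so a conversion is needed at some point. The sketch below shows one way to obtain integer milliseconds; `to_ms` and `segment_bounds` are hypothetical names and do not describe how the package's converters work internally.

```python
def to_ms(seconds):
    """Convert a time in seconds (number or string) to integer milliseconds."""
    # round() rather than int(): truncation could drop a millisecond when
    # the floating-point product falls just below a whole number.
    return int(round(float(seconds) * 1000))


def segment_bounds(onset_s, duration_s):
    """Return (segment_onset, segment_offset) in integer milliseconds."""
    onset = to_ms(onset_s)
    return onset, onset + to_ms(duration_s)


if __name__ == "__main__":
    print(to_ms("3600"))              # one hour, given as a string
    print(segment_bounds(1.25, 0.5))  # onset and duration in seconds
```

Once the values are integers, sums and comparisons of timestamps are exact, which is not guaranteed with floating-point seconds (in Python, `0.1 + 0.2 == 0.3` is false).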