Datasets structure
ChildProject assumes your data is structured in a specific way. This structure is necessary to check, for instance, that there are no unreferenced files, and no referenced files that are actually missing. The data curator therefore needs to organize their data in a specific way (respecting the dataset tree, with all specified metadata files, and all specified columns within the metadata files) before their data can be imported.
To be imported, datasets must pass the validation routine (see Data validation) with no errors. We also recommend that you pay attention to the warnings and resolve as many of them as possible before submission.
An example of a dataset structured according to ChildProject’s format can be found here.
A set of procedures exists for new datasets, handling the creation of the base structure and folders as well as linkage to an online repository. These procedures are used with datalad and can be accessed here. To create an entirely new dataset, consider following the guide in our lab handbook.
Dataset tree
All datasets should have this structure before import (so you need to organize your files into this structure):
project
│
│
└───metadata
│   │   children.csv
│   │   recordings.csv
│   │   annotations.csv
│
└───recordings
│   └───raw
│   │   │   recording1.wav
│
└───annotations
│   └───vtc
│   │   └───raw
│   │   │   │   child1.rttm
│   └───annotator1
│   │   └───raw
│   │   │   │   child1_3600.TextGrid
│
└───docs (*)
│   │   children.csv
│   │   recordings.csv
│
└───extra
    │   notes.txt
The children and recordings notebooks should be CSV dataframes formatted according to the standards detailed right below.
(*) The docs folder is optional.
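As an illustration of the tree above, the following minimal sketch creates the empty folder skeleton with the Python standard library. It is not the datalad procedure mentioned earlier; the function name `create_skeleton` and the `SUBFOLDERS` list are hypothetical names chosen for this example.

```python
from pathlib import Path
import tempfile

# Folders taken from the tree above ("docs" is optional, see the note above).
SUBFOLDERS = [
    "metadata",
    "recordings/raw",
    "annotations",
    "docs",
    "extra",
]


def create_skeleton(root: Path) -> list:
    """Create the empty folder skeleton under `root` and return the created paths."""
    created = []
    for sub in SUBFOLDERS:
        folder = root / sub
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demonstration in a temporary directory, so no real dataset is touched.
with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp) / "project"
    folders = create_skeleton(project_root)
    names = sorted(f.relative_to(project_root).as_posix() for f in folders)
    all_exist = all(f.is_dir() for f in folders)
```

The metadata files and the per-set annotation folders (such as annotations/vtc/raw) still have to be added by the curator.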
Metadata
Children notebook
The children metadata dataframe needs to be saved at metadata/children.csv.
It should be formatted as instructed below; you can add more fields beyond those that are standardized, but make sure to document them.
Name | Description | Required? | Format |
---|---|---|---|
experiment | one word to capture the unique ID of the data collection effort; for instance Tsimane_2018, solis-intervention-pre | required | |
child_id | unique child ID – unique within the experiment (ID could be repeated across experiments to refer to different children) | required | |
child_dob | child’s date of birth | required | |
location_id | unique location ID – only specify here if children never change locations in this culture; otherwise, specify in the recordings metadata | optional | |
child_sex | f = female, m = male | optional | m, M, f, F |
language | language the child is exposed to if child is monolingual; in lowercase, indicate dialect by name or location if available; e.g. “france french”; “paris french” | optional | |
languages | list of languages the child is exposed to, separated with ; and indicating the percentage if one is available; e.g. “french 35%; english 65%” | optional | |
mat_ed | maternal years of education | optional | |
fat_ed | paternal years of education | optional | |
car_ed | years of education of main caregiver (if not mother or father) | optional | |
monoling | whether the child is monolingual (Y) or not (N) | optional | Y, N |
monoling_criterion | how monoling was decided; e.g. “we asked families which languages they spoke in the home” | optional | |
normative | whether the child is normative (Y) or not (N) | optional | Y, N |
normative_criterion | how normative was decided; e.g. “unless the caregivers volunteered information whereby the child had a problem, we consider them normative by default” | optional | |
mother_id | unique ID of the mother | optional | |
father_id | unique ID of the father | optional | |
order_of_birth | child order of birth | optional | |
n_of_siblings | number of siblings | optional | |
household_size | number of people living in the household (adults+children) | optional | |
dob_criterion | determines whether the date of birth is known exactly or extrapolated, e.g. from the age. Dates of birth are assumed to be known exactly if this column is NA or unspecified. | optional | extrapolated, exact |
dob_accuracy | date of birth accuracy | optional | day, week, month, year, other |
discard | set to 1 if item should be discarded in analyses | optional | 0, 1 |
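The table above can be turned into a quick pre-check before running the real validation routine. The sketch below is only an illustration using the standard library, not the package's validator; `check_children`, the sample rows and the decision to treat empty cells and "NA" as unspecified are assumptions of this example.

```python
import csv
import io

# Required columns and allowed values copied from the table above.
REQUIRED_CHILDREN_COLUMNS = ["experiment", "child_id", "child_dob"]
ALLOWED_VALUES = {
    "child_sex": {"m", "M", "f", "F"},
    "monoling": {"Y", "N"},
    "normative": {"Y", "N"},
}


def check_children(rows, fieldnames):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = [
        f"missing required column: {column}"
        for column in REQUIRED_CHILDREN_COLUMNS
        if column not in fieldnames
    ]
    for i, row in enumerate(rows, start=1):
        for column, allowed in ALLOWED_VALUES.items():
            value = row.get(column, "")
            # Assumption: empty cells and "NA" mean "not specified".
            if value not in ("", "NA") and value not in allowed:
                problems.append(f"row {i}: unexpected {column} value {value!r}")
    return problems


# Made-up sample standing in for metadata/children.csv.
sample = io.StringIO(
    "experiment,child_id,child_dob,child_sex\n"
    "demo_2020,C01,2019-03-14,f\n"
    "demo_2020,C02,2019-07-02,x\n"
)
reader = csv.DictReader(sample)
rows = list(reader)
problems = check_children(rows, reader.fieldnames)
```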
Recordings notebook
The recordings metadata dataframe needs to be saved at metadata/recordings.csv.
It should be formatted as instructed below; you can add more fields beyond those that are standardized, but make sure to document them.
Name | Description | Required? | Format |
---|---|---|---|
experiment | one word to capture the unique ID of the data collection effort; for instance Tsimane_2018, solis-intervention-pre | required | |
child_id | unique child ID – unique within the experiment (ID could be repeated across experiments to refer to different children) | required | |
date_iso | date on which the recording was started, in ISO format (e.g. 2020-09-17) | required | |
start_time | local time at which the recording was started, in 24-hour format (H)H:MM:SS or (H)H:MM; if minutes or seconds are unknown, use 00. Use ‘NA’ if unknown; this will raise a warning during validation, as analyses that rely on times will not consider this recording. | required | |
recording_device_type | lena, usb, olympus, babylogger (lowercase), izyrec | required | lena, usb, olympus, babylogger, izyrec, unknown |
recording_filename | the path to the file from the root of “recordings”. It MUST be unique (two recordings cannot point towards the same file). | required | True |
duration | duration of the audio, in milliseconds | optional | |
session_id | identifier of the recording session. | optional | |
session_offset | offset (in milliseconds) of the recording with respect to other recordings that are part of the same session. Each recording session is identified by its session_id. | optional | |
recording_device_id | unique ID of the recording device | optional | |
experimenter | who collected the data (could be anonymized ID) | optional | |
location_id | unique location ID – can be specified at the level of the child (if children do not change locations) | optional | |
its_filename | its_filename | optional | |
upl_filename | upl_filename | optional | |
trs_filename | trs_filename | optional | |
lena_id | | optional | |
lena_recording_num | value of the num attribute of the corresponding <Recording> element, for LENA recordings that have been split into contiguous parts | optional | |
might_feature_gaps | 1 if the audio cannot be guaranteed to be a continuous block with no time jumps, 0 or NA or undefined otherwise. | optional | is_boolean |
start_time_accuracy | accuracy of start_time for this recording. If not specified, second-level accuracy is assumed. | optional | second, minute, hour, reliable |
noisy_setting | 1 if the audio may be noisier than the child’s usual day, 0 or undefined otherwise | optional | is_boolean |
notes | free-style notes about individual recordings (avoid tabs and newlines) | optional | |
discard | set to 1 if item should be discarded in analyses | optional | 0, 1 |
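The date_iso and start_time formats described above lend themselves to a small format check. This is a rough sketch rather than the package's own validation; the function names and the regular expression are assumptions made for this example.

```python
import re
from datetime import date

# (H)H:MM:SS or (H)H:MM on a 24-hour clock, as described in the table above.
START_TIME_PATTERN = re.compile(r"^([01]?\d|2[0-3]):[0-5]\d(:[0-5]\d)?$")


def is_valid_start_time(value: str) -> bool:
    """Accept (H)H:MM, (H)H:MM:SS, or the literal 'NA' for unknown times."""
    return value == "NA" or START_TIME_PATTERN.match(value) is not None


def is_valid_date_iso(value: str) -> bool:
    """Accept dates such as 2020-09-17 (newer Python versions also accept other ISO 8601 forms)."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


checks = {
    "7:05": is_valid_start_time("7:05"),
    "07:05:30": is_valid_start_time("07:05:30"),
    "24:00": is_valid_start_time("24:00"),
    "2020-09-17": is_valid_date_iso("2020-09-17"),
    "17/09/2020": is_valid_date_iso("17/09/2020"),
}
```

Note that 'NA' passes this check even though, as stated above, it triggers a warning during validation.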
Splitting the metadata across several files
Sometimes, access to parts of the metadata should be limited to a list of authorized users. This can be achieved by moving confidential information out of the main notebook to a separate CSV file to be only delivered to authorized users. These additional files should be placed according to the table below:
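data | main notebook | location of additional notebooks |
---|---|---|
children | metadata/children.csv | metadata/children/ (CSV files, including in subfolders) |
recordings | metadata/recordings.csv | metadata/recordings/ (CSV files, including in subfolders) |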
There can be as many additional notebooks as necessary, and recursion is permitted.
This is also useful if your metadata includes many columns and you’d like to spread it across several dataframes. This can also be used to deliver survey data in a separate file.
Note
In case two or more notebooks contain the same column, the files whose names come first in alphabetical order will prevail while loading the dataset with our package. For instance, if child_dob is specified in both metadata/recordings/0_private.csv and metadata/recordings/1_public.csv, the values in the former file will prevail if it is available. This is useful when anonymized values for a certain parameter still need to be shared, but should be replaced with the true values for those who have access to the full dataset.
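The precedence rule in the note above can be pictured with a few lines of plain Python. This is a simplified sketch, not the package's loading code; `merge_notebooks`, the join key, the sample values and the choice to sort on the full path are assumptions of this example.

```python
def merge_notebooks(notebooks, key):
    """Merge rows from several notebooks, joined on `key`.

    `notebooks` maps a file path to its rows (a list of dicts).
    Files are visited in alphabetical order, and a column that has
    already been filled by an earlier file is never overwritten.
    """
    merged = {}
    for path in sorted(notebooks):
        for row in notebooks[path]:
            target = merged.setdefault(row[key], {})
            for column, value in row.items():
                target.setdefault(column, value)  # earlier file wins
    return merged


# Made-up contents for the two files named in the note above.
notebooks = {
    "metadata/recordings/1_public.csv": [
        {"recording_filename": "rec1.wav", "child_dob": "2019-01-01"},
    ],
    "metadata/recordings/0_private.csv": [
        {"recording_filename": "rec1.wav", "child_dob": "2019-03-14"},
    ],
}
merged = merge_notebooks(notebooks, key="recording_filename")

# Without the private file, the public (anonymized) value is used instead.
public_only = merge_notebooks(
    {"metadata/recordings/1_public.csv": notebooks["metadata/recordings/1_public.csv"]},
    key="recording_filename",
)
```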
Warning
For recursive metadata, two dataframes cannot share the same basename. For instance, if one dataframe is located at metadata/children/dates-of-birth.csv, an error will be thrown if another dataframe exists at metadata/children/private/dates-of-birth.csv.
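A clash of this kind can be spotted before import by grouping files by basename. The sketch below uses only the standard library; `find_basename_clashes` and the survey.csv path are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_basename_clashes(paths):
    """Return {basename: [paths]} for every basename used by more than one file."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: sorted(group) for name, group in by_name.items() if len(group) > 1}


clashes = find_basename_clashes(
    [
        "metadata/children/dates-of-birth.csv",
        "metadata/children/private/dates-of-birth.csv",
        "metadata/children/survey.csv",  # hypothetical extra notebook
    ]
)
```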
Annotations
Upon importation, annotations are converted to standardized CSV dataframes (using built-in or custom ingestors) and registered into an index. The index of annotations stores the list of each interval that has been annotated for each annotator. This allows a number of functionalities such as the quick computation of the intersection of the portions of audio covered by a given set of annotators.
Annotations format
The package provides functions to convert any annotation into the following CSV format, with one row per segment (e.g. per vocalization event):
Name | Description | Required? | Format |
---|---|---|---|
raw_filename | raw annotation path, relative to annotations/<set>/raw | required | |
segment_onset | segment onset timestamp in milliseconds (since the start of the recording) | required | |
segment_offset | segment end time in milliseconds (since the start of the recording) | required | |
speaker_id | identity of speaker in the annotation | optional | |
speaker_type | class of speaker (FEM = female adult, MAL = male adult, CHI = key child, OCH = other child) | optional | FEM, MAL, CHI, OCH, NA |
ling_type | 1 if the vocalization contains at least a vowel (i.e. canonical or non-canonical), 0 if crying or laughing | optional | 1, 0, NA |
vcm_type | vocal maturity defined as: C (canonical), N (non-canonical), Y (crying), L (laughing), J (junk), U (uncertain) | optional | C, N, Y, L, J, U, NA |
lex_type | W if meaningful, 0 otherwise | optional | W, 0, NA |
mwu_type | M if multiword, 1 if single word – only filled if lex_type==W | optional | M, 1, NA |
msc_type | morphosyntactical complexity of the utterances defined as: 0 (0 meaningful word), 1 (1 meaningful word), 2 (2 meaningful words), S (simple utterance), C (complex utterance), U (uncertain) | optional | 0, 1, 2, S, C, U |
gra_type | grammaticality of the utterances defined as: G (grammatical), J (ungrammatical), U (uncertain) | optional | G, J, U |
addressee | T if target-child-directed, C if other-child-directed, A if adult-directed, O if addressed to other, P if addressed to a pet, U if uncertain or other. Multiple values should be sorted and separated by commas | optional | T, C, A, O, P, U, NA |
transcription | orthographic transcription of the speech | optional | |
phonemes | number of phonemes | optional | |
syllables | number of syllables | optional | |
words | number of words | optional | |
lena_block_type | whether regarded as part of a pause or a conversation by LENA | optional | pause, CM, CIC, CIOCX, CIOCAX, AMF, AICF, AIOCF, AIOCCXF, AMM, AICM, AIOCM, AIOCCXM, XM, XIOCC, XIOCA, XIC, XIOCAC |
lena_block_number | number of the LENA pause/conversation the segment belongs to | optional | |
lena_conv_status | LENA conversation status | optional | BC, RC, EC |
lena_response_count | LENA turn count within block | optional | |
lena_conv_floor_type | (FI): Floor Initiation, (FH): Floor Holding | optional | FI, FH |
lena_conv_turn_type | LENA turn type | optional | TIFI, TIMI, TIFR, TIMR, TIFE, TIME, NT |
lena_speaker | LENA speaker type | optional | TVF, FAN, OLN, SIL, NOF, CXF, OLF, CHF, MAF, TVN, NON, CXN, CHN, MAN, FAF |
utterances_count | utterances count | optional | |
utterances_length | utterances length | optional | |
non_speech_length | non-speech length | optional | |
average_db | average dB level | optional | |
peak_db | peak dB level | optional | |
child_cry_vfx_len | childCryVfxLen | optional | |
utterances | LENA utterances details (json) | optional | |
cries | cries (json) | optional | |
vfxs | Vfx (json) | optional | |
Custom columns may be used, although they should be documented somewhere in your dataset.
Annotations index
Warning
The index is maintained through the package functions only; it should never be updated by hand.
Annotations are indexed in one unique dataframe located at /metadata/annotations.csv, with the following format:
Name | Description | Required? | Format |
---|---|---|---|
set | name of the annotation set (e.g. VTC, annotator1, etc.) | required | |
recording_filename | recording filename as specified in the recordings index | required | |
time_seek | shift between the timestamps in the raw input annotations and the actual corresponding timestamps in the recordings (in milliseconds) | required | |
range_onset | covered range onset timestamp in milliseconds (since the start of the recording) | required | |
range_offset | covered range offset timestamp in milliseconds (since the start of the recording) | required | |
raw_filename | annotation input filename location, relative to annotations/<set>/raw | required | True |
annotation_filename | output formatted annotation location, relative to annotations/<set>/converted (automatic column, don’t specify) | optional | True |
imported_at | importation date (automatic column, don’t specify) | optional | |
package_version | version of the package used when the importation was performed | optional | |
error | error message in case the annotation could not be imported | optional | |
merged_from | sets used to generate this annotation by merging (comma separated) | optional | |
An example of an index file is shown below (some uninformative columns were hidden for clarity). In this case, one recording has been fully annotated using the Voice Type Classifier (vtc), and partially annotated by two humans (LM and SP). Both humans annotated the same seven 15-second clips.
set | recording_filename | time_seek | range_onset | range_offset | raw_filename | format | annotation_filename |
---|---|---|---|---|---|---|---|
vtc | A730/A730_001105.wav | 0 | 0 | 42764250 | A730/A730_001105.rttm | vtc_rttm | A730/A730_001105_0_42764250.csv |
eaf_2021/SP | A730/A730_001105.wav | 0 | 2910000 | 2925000 | A730_001105.eaf | eaf | A730/A730_001105_2910000_2925000.csv |
eaf_2021/SP | A730/A730_001105.wav | 0 | 4680000 | 4695000 | A730_001105.eaf | eaf | A730/A730_001105_4680000_4695000.csv |
eaf_2021/SP | A730/A730_001105.wav | 0 | 4695000 | 4710000 | A730_001105.eaf | eaf | A730/A730_001105_4695000_4710000.csv |
eaf_2021/SP | A730/A730_001105.wav | 0 | 14055000 | 14070000 | A730_001105.eaf | eaf | A730/A730_001105_14055000_14070000.csv |
eaf_2021/SP | A730/A730_001105.wav | 0 | 15030000 | 15045000 | A730_001105.eaf | eaf | A730/A730_001105_15030000_15045000.csv |
eaf_2021/SP | A730/A730_001105.wav | 0 | 36465000 | 36480000 | A730_001105.eaf | eaf | A730/A730_001105_36465000_36480000.csv |
eaf_2021/SP | A730/A730_001105.wav | 0 | 39450000 | 39465000 | A730_001105.eaf | eaf | A730/A730_001105_39450000_39465000.csv |
eaf_2021/LM | A730/A730_001105.wav | 0 | 2910000 | 2925000 | A730_001105.eaf | eaf | A730/A730_001105_2910000_2925000.csv |
eaf_2021/LM | A730/A730_001105.wav | 0 | 4680000 | 4695000 | A730_001105.eaf | eaf | A730/A730_001105_4680000_4695000.csv |
eaf_2021/LM | A730/A730_001105.wav | 0 | 4695000 | 4710000 | A730_001105.eaf | eaf | A730/A730_001105_4695000_4710000.csv |
eaf_2021/LM | A730/A730_001105.wav | 0 | 14055000 | 14070000 | A730_001105.eaf | eaf | A730/A730_001105_14055000_14070000.csv |
eaf_2021/LM | A730/A730_001105.wav | 0 | 15030000 | 15045000 | A730_001105.eaf | eaf | A730/A730_001105_15030000_15045000.csv |
eaf_2021/LM | A730/A730_001105.wav | 0 | 36465000 | 36480000 | A730_001105.eaf | eaf | A730/A730_001105_36465000_36480000.csv |
eaf_2021/LM | A730/A730_001105.wav | 0 | 39450000 | 39465000 | A730_001105.eaf | eaf | A730/A730_001105_39450000_39465000.csv |
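To make the idea of intersecting the portions of audio covered by several annotators concrete, here is a small sketch that works on the ranges of the example index above. It is a generic interval-intersection routine written for illustration, not the package's implementation; `intersect` and the variable names are hypothetical.

```python
def intersect(ranges_a, ranges_b):
    """Intersect two lists of (onset, offset) ranges in milliseconds.

    Each list must be sorted and must not contain overlapping ranges.
    """
    result = []
    i = j = 0
    while i < len(ranges_a) and j < len(ranges_b):
        onset = max(ranges_a[i][0], ranges_b[j][0])
        offset = min(ranges_a[i][1], ranges_b[j][1])
        if onset < offset:
            result.append((onset, offset))
        # Advance whichever range ends first.
        if ranges_a[i][1] < ranges_b[j][1]:
            i += 1
        else:
            j += 1
    return result


# (range_onset, range_offset) pairs taken from the example index above.
vtc = [(0, 42764250)]
sp = [
    (2910000, 2925000),
    (4680000, 4695000),
    (4695000, 4710000),
    (14055000, 14070000),
    (15030000, 15045000),
    (36465000, 36480000),
    (39450000, 39465000),
]
lm = list(sp)  # LM annotated the same clips as SP

common = intersect(intersect(vtc, sp), lm)
covered_ms = sum(offset - onset for onset, offset in common)
```

Here the three sets overlap on the seven human-annotated clips, that is 105000 ms (7 × 15 s) of audio.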
Annotation importation input format
The annotations importation script (Bulk importation) and the Python method ChildProject.annotations.AnnotationManager.import_annotations() take a dataframe of the following format as input:
Name | Description | Required? | Format |
---|---|---|---|
set | name of the annotation set (e.g. VTC, annotator1, etc.) | required | |
recording_filename | recording filename as specified in the recordings index | required | |
time_seek | shift between the timestamps in the raw input annotations and the actual corresponding timestamps in the recordings (in milliseconds) | required | |
range_onset | covered range onset timestamp in milliseconds (since the start of the recording) | required | |
range_offset | covered range offset timestamp in milliseconds (since the start of the recording) | required | |
raw_filename | annotation input filename location, relative to annotations/<set>/raw | required | True |
format | input annotation format | optional | csv, vtc_rttm, vcm_rttm, alice, its, TextGrid, eaf, cha, NA |
filter | source file to target. This field is dedicated to rttm and ALICE annotations that may combine annotations from several recordings into one same text file. | optional | |
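As a rough illustration of this input format, the sketch below prepares a one-row table as CSV text using the standard library. The row reuses the vtc entry of the example index shown earlier; turning the table into a dataframe and passing it to the import method is deliberately left out, and the names `INPUT_COLUMNS` and `csv_text` are assumptions of this example.

```python
import csv
import io

# Column names taken from the table above (the optional "filter" column is omitted).
INPUT_COLUMNS = [
    "set",
    "recording_filename",
    "time_seek",
    "range_onset",
    "range_offset",
    "raw_filename",
    "format",
]

rows = [
    {
        "set": "vtc",
        "recording_filename": "A730/A730_001105.wav",
        "time_seek": 0,
        "range_onset": 0,
        "range_offset": 42764250,  # integer milliseconds
        "raw_filename": "A730/A730_001105.rttm",
        "format": "vtc_rttm",
    }
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=INPUT_COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
lines = csv_text.splitlines()
```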
Note
In order to avoid rounding errors, all timestamps are integers, expressed in milliseconds.
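When raw annotations express times in seconds with decimals, they need to be converted to integer milliseconds. One possible way to do this without going through binary floating point is sketched below with the standard `decimal` module; `seconds_to_ms` and the rounding mode are choices made for this example, not something the package prescribes.

```python
from decimal import Decimal, ROUND_HALF_UP


def seconds_to_ms(seconds: str) -> int:
    """Convert a decimal string in seconds to an integer number of milliseconds."""
    milliseconds = Decimal(seconds) * 1000
    # Round to the nearest millisecond; ties are rounded up.
    return int(milliseconds.to_integral_value(rounding=ROUND_HALF_UP))


onset_ms = seconds_to_ms("1.001")
offset_ms = seconds_to_ms("12.3456")
```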
Documentation
An important aspect of a dataset is its documentation. Documentation includes:
authorship, references, contact information
a description of the corpus (population, collection process, etc.)
instructions to re-use the data
description of the data itself (e.g. a definition of each metadata field)
We currently do not provide a format for all these annotations. It is up to you to decide how to provide users with each of these pieces of information.
However, we suggest several options below.
Metadata and annotations
The ChildProject package supports a machine-readable format to describe the contents of the metadata and the annotations.
This format consists of a CSV dataframe structured according to the following table:
Name | Description | Required? | Format |
---|---|---|---|
variable | name of the variable | required | |
description | a definition of this field | required | |
values | a summary of authorized values | optional | |
scope | which group of users has access to it | optional | |
annotation_set | for annotations: which set(s) contain this variable | optional | |
Documentation for the children metadata should be stored in docs/children.csv
Documentation for the recordings metadata should be stored in docs/recordings.csv
Documentation for annotations should be stored in docs/annotations.csv
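A documentation file in this format can be produced with any CSV writer. The following sketch builds the text of a small docs/children.csv with the standard library; the two documented variables are taken from the children table earlier in this page, while the names `DOC_COLUMNS`, `entries` and `doc_text` are hypothetical.

```python
import csv
import io

# Columns of the documentation format described above.
DOC_COLUMNS = ["variable", "description", "values", "scope", "annotation_set"]

entries = [
    {
        "variable": "child_id",
        "description": "unique child ID, unique within the experiment",
        "values": "",
    },
    {
        "variable": "child_sex",
        "description": "f = female, m = male",
        "values": "m, M, f, F",
    },
]

buffer = io.StringIO()
# restval="" leaves the optional columns empty when an entry does not set them.
writer = csv.DictWriter(buffer, fieldnames=DOC_COLUMNS, restval="")
writer.writeheader()
writer.writerows(entries)
doc_text = buffer.getvalue()

# Reading the text back shows that cells containing commas survive the round trip.
parsed = list(csv.DictReader(io.StringIO(doc_text)))
```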