Metrics extraction

Overview

This package extracts commonly used metrics from annotations produced by the LENA or other pipelines. It produces a csv file containing the metrics, along with a YML parameter file storing all the options used.

$ child-project metrics --help
usage: child-project metrics [-h] [--recordings RECORDINGS]
                             [--by {recording_filename,session_id,child_id,experiment,segments}]
                             [--segments SEGMENTS] [--period PERIOD]
                             [-f FROM_TIME] [-t TO_TIME] [--rec-cols REC_COLS]
                             [--child-cols CHILD_COLS] [--threads THREADS]
                             path destination {custom,lena,aclew} ...

positional arguments:
  path                  path to the dataset
  destination           segments destination
  {custom,lena,aclew}   pipeline
    custom              metrics from a csv file
    lena                LENA metrics
    aclew               ACLEW metrics

optional arguments:
  -h, --help            show this help message and exit
  --recordings RECORDINGS
                        path to a CSV dataframe containing the list of
                        recordings to sample from (by default, all recordings
                        will be sampled). The CSV should have one column named
                        recording_filename.
  --by {recording_filename,session_id,child_id,experiment,segments}
                        units to sample from (default behavior is to sample by
                        recording)
  --segments SEGMENTS   path to a CSV dataframe containing the list of
                        segments to sample from. The CSV should have 3 columns
                        named recording_filename, segment_onset,
                        segment_offset. --by must be set to 'segments'. Cannot
                        be used along with options [--period,--recordings,
                        --from-time,--to-time]
  --period PERIOD       time units to aggregate (optional); equivalent to
                        ``pandas.Grouper`` freq argument. The resulting
                        metrics will be split for each unit across all the
                        resulting periods.
  -f FROM_TIME, --from-time FROM_TIME
                        time range start in HH:MM:SS format (optional)
  -t TO_TIME, --to-time TO_TIME
                        time range end in HH:MM:SS format (optional)
  --rec-cols REC_COLS   comma separated columns from recordings.csv to include
                        in the outputted metrics (optional), NA if ambiguous
  --child-cols CHILD_COLS
                        comma separated columns from children.csv to include
                        in the outputted metrics (optional), NA if ambiguous
  --threads THREADS     amount of threads to run on
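
As a quick sanity check before passing a file to --segments, the sketch below verifies that a CSV has the three columns named in the help text above. The helper name missing_columns, the recording file names and the onset/offset values are hypothetical and only serve the illustration; this is not part of the package.

```python
import csv
import io

# Column names required by --segments, as listed in the help text.
REQUIRED = ("recording_filename", "segment_onset", "segment_offset")


def missing_columns(handle):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    return [column for column in REQUIRED if column not in header]


# Hypothetical content of a segments file (values are made up).
sample = io.StringIO(
    "recording_filename,segment_onset,segment_offset\n"
    "rec1.wav,0,60000\n"
    "rec1.wav,120000,180000\n"
)
print(missing_columns(sample))
```

For a file on disk, the same function can be called on open("segments.csv", newline=""), where segments.csv is a placeholder name.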

The --period option aggregates vocalizations into time-of-day units whose length is specified by the user. For instance, if the period is set to 15Min (i.e. 15 minutes), vocalization rates will be reported for each recording and time unit (e.g. 09:00 to 09:15, 09:15 to 09:30, etc.).

The output dataframe has \(r \times p\) rows, where \(r\) is the number of recordings (or of children if the --by option is set to child_id, etc.), and \(p\) is the number of time bins per day (i.e. \(24 \times 4 = 96\) for a 15-minute period).

The output dataframe includes period_start and period_end columns that contain the onset and offset of each time unit in HH:MM:SS format. The duration_<set> columns contain the total amount of annotated time covering each time bin for each set, in milliseconds.

If --by is set to e.g. child_id, then the values for each time bin are the average rates across all the recordings of each child.
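
The bin layout described above can be sketched with the standard library alone. This is an illustration of the arithmetic, not the package's implementation: the function name period_bins and the number of recordings are assumptions, and the period is assumed to divide 24 hours evenly.

```python
from datetime import datetime, timedelta


def period_bins(minutes):
    """Return (period_start, period_end) strings for every time-of-day bin.

    Assumes the period divides 24 hours evenly (e.g. 15, 30 or 60 minutes).
    """
    step = timedelta(minutes=minutes)
    day_start = datetime(2000, 1, 1)  # arbitrary date; only the clock time matters
    day_end = day_start + timedelta(days=1)
    bins = []
    current = day_start
    while current < day_end:
        end = current + step
        bins.append((current.strftime("%H:%M:%S"), end.strftime("%H:%M:%S")))
        current = end
    return bins


bins = period_bins(15)
recordings = 3  # hypothetical number of recordings (r)

print(len(bins))               # p: 96 bins for a 15-minute period
print(recordings * len(bins))  # r x p: 288 rows in the output
print(bins[0])                 # ('00:00:00', '00:15:00')
print(bins[-1])                # ('23:45:00', '00:00:00')
```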

The list of supported metrics is shown below:

Warning

Be aware that many metrics are rates rather than absolute counts (every metric whose name ends with ‘ph’ is a rate). Values may therefore differ from those obtained with other extraction methods (e.g. LENA metrics). Rates are expressed in counts/hour (for events) or in milliseconds/hour (for durations).
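
To make these units concrete, here is a minimal sketch of how an absolute value relates to a 'per hour' rate. It assumes the rate is obtained by normalising with the annotated duration in milliseconds; the helper name per_hour and the sample numbers are illustrative and not taken from the package.

```python
MS_PER_HOUR = 3_600_000  # milliseconds in one hour


def per_hour(value, annotated_ms):
    """Scale an absolute count (or a duration in ms) to a per-hour rate."""
    return value * MS_PER_HOUR / annotated_ms


# Hypothetical example: 30 minutes of annotated audio.
annotated_ms = 30 * 60 * 1000

# 120 vocalizations in 30 minutes correspond to 240 vocalizations/hour.
print(per_hour(120, annotated_ms))

# 90 seconds of vocalization in 30 minutes correspond to 180000 ms/hour.
print(per_hour(90_000, annotated_ms))
```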

Callable	Description	Required arguments
avg_can_voc_dur_speaker	average duration of canonical vocalizations for a given speaker type (based on vcm_type)	- speaker : speaker_type to use
avg_cry_voc_dur_speaker	average duration of cry vocalizations by a given speaker type (based on vcm_type or lena cries)	- speaker : speaker_type to use
avg_non_can_voc_dur_speaker	average duration of non-canonical vocalizations for a given speaker type (based on vcm_type)	- speaker : speaker_type to use
avg_voc_dur_speaker	average duration in milliseconds of vocalizations for a given speaker type	- speaker : speaker_type to use
can_voc_dur_speaker	total duration of canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)	- speaker : speaker_type to use
can_voc_dur_speaker_ph	total duration of canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)	- speaker : speaker_type to use This value is a ‘per hour’ value.
can_voc_speaker	number of canonical vocalizations for a given speaker type (based on vcm_type)	- speaker : speaker_type to use
can_voc_speaker_ph	number of canonical vocalizations for a given speaker type (based on vcm_type)	- speaker : speaker_type to use This value is a ‘per hour’ value.
cp_dur	canonical proportion on the duration of vocalizations for CHI (based on vcm_type)
cp_n	canonical proportion on the number of vocalizations for CHI (based on vcm_type)
cry_voc_dur_speaker	total duration of cry vocalizations by a given speaker type in milliseconds (based on vcm_type or lena cry)	- speaker : speaker_type to use
cry_voc_dur_speaker_ph	total duration of cry vocalizations by a given speaker type in milliseconds (based on vcm_type or lena cry)	- speaker : speaker_type to use This value is a ‘per hour’ value.
cry_voc_speaker	number of cry vocalizations for a given speaker (based on vcm_type or lena cries)	- speaker : speaker_type to use
cry_voc_speaker_ph	number of cry vocalizations for a given speaker (based on vcm_type or lena cries)	- speaker : speaker_type to use This value is a ‘per hour’ value.
lena_CTC	number of conversational turn counts according to LENA’s extraction
lena_CTC_ph	number of conversational turn counts according to LENA’s extraction	This value is a ‘per hour’ value.
lena_CVC	number of child vocalizations according to LENA’s extraction
lena_CVC_ph	number of child vocalizations according to LENA’s extraction	This value is a ‘per hour’ value.
lp_dur	linguistic proportion on the duration of vocalizations for CHI (based on vcm_type or [child_cry_vfxs_len,utterances_length] if vcm_type does not exist)
lp_n	linguistic proportion on the number of vocalizations for CHI (based on vcm_type or [cries,vfxs,utterances_count] if vcm_type does not exist)
non_can_voc_dur_speaker	total duration of non-canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)	- speaker : speaker_type to use
non_can_voc_dur_speaker_ph	total duration of non-canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)	- speaker : speaker_type to use This value is a ‘per hour’ value.
non_can_voc_speaker	number of non-canonical vocalizations for a given speaker type (based on vcm_type)	- speaker : speaker_type to use
non_can_voc_speaker_ph	number of non-canonical vocalizations for a given speaker type (based on vcm_type)	- speaker : speaker_type to use This value is a ‘per hour’ value.
pc_adu	number of phonemes for all speakers
pc_adu_ph	number of phonemes for all speakers	This value is a ‘per hour’ value.
pc_speaker	number of phonemes for a given speaker type	- speaker : speaker_type to use
pc_speaker_ph	number of phonemes for a given speaker type	- speaker : speaker_type to use This value is a ‘per hour’ value.
peak_can_voc_dur_speaker	Computing the peak for 1h for the following metric: total duration of canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)	- speaker : speaker_type to use
peak_can_voc_speaker	Computing the peak for 1h for the following metric: number of canonical vocalizations for a given speaker type (based on vcm_type)	- speaker : speaker_type to use
peak_cry_voc_dur_speaker	Computing the peak for 1h for the following metric: total duration of cry vocalizations by a given speaker type in milliseconds (based on vcm_type or lena cry)	- speaker : speaker_type to use
peak_cry_voc_speaker	Computing the peak for 1h for the following metric: number of cry vocalizations for a given speaker (based on vcm_type or lena cries)	- speaker : speaker_type to use
peak_lena_CTC	Computing the peak for 1h for the following metric: number of conversational turn counts according to LENA’s extraction
peak_lena_CVC	Computing the peak for 1h for the following metric: number of child vocalizations according to LENA’s extraction
peak_non_can_voc_dur_speaker	Computing the peak for 1h for the following metric: total duration of non-canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)	- speaker : speaker_type to use
peak_non_can_voc_speaker	Computing the peak for 1h for the following metric: number of non-canonical vocalizations for a given speaker type (based on vcm_type)	- speaker : speaker_type to use
peak_pc_adu	Computing the peak for 1h for the following metric: number of phonemes for all speakers
peak_pc_speaker	Computing the peak for 1h for the following metric: number of phonemes for a given speaker type	- speaker : speaker_type to use
peak_sc_adu	Computing the peak for 1h for the following metric: number of syllables for all speakers
peak_sc_speaker	Computing the peak for 1h for the following metric: number of syllables for a given speaker type	- speaker : speaker_type to use
peak_simple_CTC	Computing the peak for 1h for the following metric: number of conversational turn counts based on vocalizations occurring in a given interval of one another keyword arguments: - interlocutors_1 : first group of interlocutors, default = ['CHI'] - interlocutors_2 : second group of interlocutors, default = ['FEM','MAL','OCH'] - max_interval : maximum interval in ms for it to be considered a turn, default = 1000 - min_delay : minimum delay between somebody starting speaking
peak_voc_dur_speaker	Computing the peak for 1h for the following metric: total duration of vocalizations by a given speaker type in milliseconds	- speaker : speaker_type to use
peak_voc_speaker	Computing the peak for 1h for the following metric: number of vocalizations for a given speaker type	- speaker : speaker_type to use
peak_wc_adu	Computing the peak for 1h for the following metric: number of words for all speakers
peak_wc_speaker	Computing the peak for 1h for the following metric: number of words for a given speaker type	- speaker : speaker_type to use
sc_adu	number of syllables for all speakers
sc_adu_ph	number of syllables for all speakers	This value is a ‘per hour’ value.
sc_speaker	number of syllables for a given speaker type	- speaker : speaker_type to use
sc_speaker_ph	number of syllables for a given speaker type	- speaker : speaker_type to use This value is a ‘per hour’ value.
simple_CTC	number of conversational turn counts based on vocalizations occurring in a given interval of one another keyword arguments: - interlocutors_1 : first group of interlocutors, default = ['CHI'] - interlocutors_2 : second group of interlocutors, default = ['FEM','MAL','OCH'] - max_interval : maximum interval in ms for it to be considered a turn, default = 1000 - min_delay : minimum delay between somebody starting speaking
simple_CTC_ph	number of conversational turn counts based on vocalizations occurring in a given interval of one another keyword arguments: - interlocutors_1 : first group of interlocutors, default = ['CHI'] - interlocutors_2 : second group of interlocutors, default = ['FEM','MAL','OCH'] - max_interval : maximum interval in ms for it to be considered a turn, default = 1000 - min_delay : minimum delay between somebody starting speaking This value is a ‘per hour’ value.
voc_dur_speaker	total duration of vocalizations by a given speaker type in milliseconds	- speaker : speaker_type to use
voc_dur_speaker_ph	total duration of vocalizations by a given speaker type in milliseconds	- speaker : speaker_type to use This value is a ‘per hour’ value.
voc_speaker	number of vocalizations for a given speaker type	- speaker : speaker_type to use
voc_speaker_ph	number of vocalizations for a given speaker type	- speaker : speaker_type to use This value is a ‘per hour’ value.
wc_adu	number of words for all speakers
wc_adu_ph	number of words for all speakers	This value is a ‘per hour’ value.
wc_speaker	number of words for a given speaker type	- speaker : speaker_type to use
wc_speaker_ph	number of words for a given speaker type	- speaker : speaker_type to use This value is a ‘per hour’ value.
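
To give an intuition for the simple_CTC family of metrics, the sketch below counts speaker changes between two groups of interlocutors that occur within max_interval milliseconds of one another. It is a simplified illustration under stated assumptions, not the package's algorithm: the function name count_turns is hypothetical, the min_delay argument is ignored, overlapping vocalizations are treated as a gap of zero or less, and the segment values are made up.

```python
def count_turns(segments, group_1=("CHI",), group_2=("FEM", "MAL", "OCH"),
                max_interval=1000):
    """Count changes of speaker between the two groups that are separated
    by at most max_interval milliseconds (simplified illustration)."""
    relevant = sorted(
        (s for s in segments if s["speaker_type"] in group_1 + group_2),
        key=lambda s: s["segment_onset"],
    )
    turns = 0
    for previous, current in zip(relevant, relevant[1:]):
        # A turn requires one vocalization from each group.
        crosses_groups = (previous["speaker_type"] in group_1) != (
            current["speaker_type"] in group_1
        )
        gap = current["segment_onset"] - previous["segment_offset"]
        if crosses_groups and gap <= max_interval:
            turns += 1
    return turns


# Hypothetical segments, with onsets and offsets in milliseconds.
segments = [
    {"speaker_type": "CHI", "segment_onset": 0, "segment_offset": 500},
    {"speaker_type": "FEM", "segment_onset": 900, "segment_offset": 1500},
    {"speaker_type": "CHI", "segment_onset": 4000, "segment_offset": 4500},
    {"speaker_type": "MAL", "segment_onset": 4600, "segment_offset": 5000},
    {"speaker_type": "FEM", "segment_onset": 5100, "segment_offset": 5600},
]

# CHI -> FEM (400 ms gap) and CHI -> MAL (100 ms gap) count as turns;
# FEM -> CHI is too far apart and MAL -> FEM stays within the same group.
print(count_turns(segments))  # 2
```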

LENA Metrics

The LENA pipeline extracts a list of standard metrics that can be obtained from LENA's automated annotations (.its files). Using this pipeline with a set of .its annotations will extract the following metrics:

metric / speaker	FEM	MAL	OCH	CHI	All speakers	CHI + MAL + FEM
voc_speaker_ph	voc_fem_ph	voc_mal_ph	voc_och_ph	voc_chi_ph
voc_dur_speaker_ph	voc_dur_fem_ph	voc_dur_mal_ph	voc_dur_och_ph	voc_dur_chi_ph
avg_voc_dur_speaker	avg_voc_dur_fem	avg_voc_dur_mal	avg_voc_dur_och	avg_voc_dur_chi
wc_speaker_ph	wc_fem_ph	wc_mal_ph			wc_adu_ph
lp_n				lp_n
lp_dur				lp_dur
lena_CVC				lena_CVC
lena_CTC						lena_CTC

$ child-project metrics /path/to/dataset output.csv lena --help
usage: child-project metrics path destination lena [-h] set

positional arguments:
  set         name of the LENA its annotations set

optional arguments:
  -h, --help  show this help message and exit

ACLEW Metrics

The ACLEW pipeline extracts a list of standard metrics that can be obtained from the automated annotations produced by the VTC, ALICE and VCM models. Only the VTC set is required to run the pipeline; the ALICE and VCM sets are optional and allow more metrics to be extracted. Using this pipeline with a set of vtc annotations, and optionally alice and vcm sets, will extract:

From VTC:

metric / speaker	FEM	MAL	OCH	CHI
voc_speaker_ph	voc_fem_ph	voc_mal_ph	voc_och_ph	voc_chi_ph
voc_dur_speaker_ph	voc_dur_fem_ph	voc_dur_mal_ph	voc_dur_och_ph	voc_dur_chi_ph
avg_voc_dur_speaker	avg_voc_dur_fem	avg_voc_dur_mal	avg_voc_dur_och	avg_voc_dur_chi

From ALICE:

metric / speaker	FEM	MAL	All speakers
wc_speaker_ph	wc_fem_ph	wc_mal_ph
sc_speaker_ph	sc_fem_ph	sc_mal_ph
pc_speaker_ph	pc_fem_ph	pc_mal_ph
wc_adu_ph			wc_adu_ph
sc_adu_ph			sc_adu_ph
pc_adu_ph			pc_adu_ph

From VCM:

metric / speaker	CHI
cry_voc_speaker_ph	cry_voc_chi_ph
cry_voc_dur_speaker_ph	cry_voc_dur_chi_ph
avg_cry_voc_dur_speaker	avg_cry_voc_dur_chi
can_voc_speaker_ph	can_voc_chi_ph
can_voc_dur_speaker_ph	can_voc_dur_chi_ph
avg_can_voc_dur_speaker	avg_can_voc_dur_chi
non_can_voc_speaker_ph	non_can_voc_chi_ph
non_can_voc_dur_speaker_ph	non_can_voc_dur_chi_ph
avg_non_can_voc_dur_speaker	avg_non_can_voc_dur_chi
lp_n	lp_n
lp_dur	lp_dur
cp_n	cp_n
cp_dur	cp_dur

$ child-project metrics /path/to/dataset output.csv aclew --help
usage: child-project metrics path destination aclew [-h] [--vtc VTC]
                                                    [--alice ALICE]
                                                    [--vcm VCM]

optional arguments:
  -h, --help     show this help message and exit
  --vtc VTC      vtc set
  --alice ALICE  alice set
  --vcm VCM      vcm set

Custom metrics

The Custom metrics pipeline lets you provide your own list of metrics to extract. The list must be in a csv file containing the following columns:

callable (required) : name of the metric to extract, see the list of supported metrics above
set (required) : name of the set to extract from; make sure this annotation set contains the information required to extract this specific metric
name (optional) : name to use in the resulting metrics. If none is given, a default name will be used. Use this to extract the same metric for different sets and avoid name clashes.
<argument> (depending on the requirements of the metric you chose) : For each required argument of a metric, add a column of that argument’s name.

This is an example of a csv file we use to extract metrics. We want to extract the number of vocalizations per hour of the key child (CHI), male adult (MAL) and female adult (FEM) on 2 different sets in order to compare their results. So we write 3 lines per set (vtc and its), each with a different speaker, and we give each metric an explicit name because the default names voc_chi_ph, voc_mal_ph and voc_fem_ph would have clashed between the 2 sets. Additionally, we extract the linguistic proportion on the number of vocalizations and on their duration separately from the vcm set. The default names won’t clash and no speaker is needed (linguistic proportion is computed on CHI), so we leave those columns empty.

callable	set	name	speaker
voc_speaker_ph	vtc	voc_chi_ph_vtc	CHI
voc_speaker_ph	vtc	voc_mal_ph_vtc	MAL
voc_speaker_ph	vtc	voc_fem_ph_vtc	FEM
voc_speaker_ph	its	voc_chi_ph_its	CHI
voc_speaker_ph	its	voc_mal_ph_its	MAL
voc_speaker_ph	its	voc_fem_ph_its	FEM
lp_n	vcm
lp_dur	vcm
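
The table above can also be generated programmatically, which helps when the same metric is requested for many sets or speakers. The following is a minimal sketch using only Python's csv module; the helper name write_metrics_list is hypothetical, and the rows simply reproduce the example above.

```python
import csv
import io

# One voc_speaker_ph row per (set, speaker) pair, with an explicit name
# so that the metrics from the two sets do not clash.
rows = [
    {
        "callable": "voc_speaker_ph",
        "set": set_name,
        "name": f"voc_{speaker.lower()}_ph_{set_name}",
        "speaker": speaker,
    }
    for set_name in ("vtc", "its")
    for speaker in ("CHI", "MAL", "FEM")
]
# Linguistic proportions need neither a custom name nor a speaker.
rows += [{"callable": "lp_n", "set": "vcm"}, {"callable": "lp_dur", "set": "vcm"}]


def write_metrics_list(rows, handle):
    """Write the metrics list as CSV; missing fields are left empty."""
    writer = csv.DictWriter(
        handle, fieldnames=["callable", "set", "name", "speaker"], restval=""
    )
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_metrics_list(rows, buffer)
print(buffer.getvalue())
```

To write an actual file, pass a handle such as open("metrics.csv", "w", newline="") instead of the in-memory buffer; metrics.csv is a placeholder name.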

$ child-project metrics /path/to/dataset output.csv custom --help
usage: child-project metrics path destination custom [-h] metrics

positional arguments:
  metrics     name of the csv file containing the list of metrics

optional arguments:
  -h, --help  show this help message and exit

Metrics from parameter file

To facilitate the extraction of metrics, you can launch a new extraction from a yml parameter file that lists all the options. This file has exactly the same structure as the one produced by the pipeline, so an output parameter file can be used to rerun the same analysis.

$ child-project metrics-specification --help
usage: child-project metrics-specification [-h] parameters_input

positional arguments:
  parameters_input  path to the yml file with all parameters

optional arguments:
  -h, --help        show this help message and exit