Metrics extraction

Overview

This package extracts commonly used metrics from annotations produced by LENA or other pipelines. It produces a csv file containing the metrics, along with a YML parameter file storing all the options used.

$ child-project metrics --help
usage: child-project metrics [-h] [--recordings RECORDINGS]
                             [--by {recording_filename,session_id,child_id,experiment,segments}]
                             [--segments SEGMENTS] [--period PERIOD]
                             [-f FROM_TIME] [-t TO_TIME]
                             [--rec-cols REC_COLS] [--child-cols CHILD_COLS]
                             [--threads THREADS]
                             path destination {custom,lena,aclew} ...

positional arguments:
path                  path to the dataset
destination           segments destination
{custom,lena,aclew}   pipeline
  custom              custom metrics from a csv file
  lena                LENA metrics
  aclew               LENA metrics

optional arguments:
-h, --help            show this help message and exit
--recordings RECORDINGS
                      path to a CSV dataframe containing the list of
                      recordings to sample from (by default, all recordings
                      will be sampled). The CSV should have one column named
                      recording_filename.
--by {recording_filename,session_id,child_id,experiment,segments}
                      units to sample from (default behavior is to sample by
                      recording)
--segments SEGMENTS   path to a CSV dataframe containing the list of segments
                      to sample from. The CSV should have 3 columns named
                      recording_filename, segment_onset, segment_offset.
                      --by must be set to 'segments', Can not be used along
                      with options [--period,--recordings,
                      --from-tim,--to-time]
--period PERIOD       time units to aggregate (optional); equivalent to
                      pandas.Grouper freq argument. The resulting metrics
                      will be split for each unit across all the resulting
                      periods.
-f FROM_TIME, --from-time FROM_TIME
                      time range start in HH:MM:SS format (optional)
-t TO_TIME, --to-time TO_TIME
                      time range end in HH:MM:SS format (optional)
--rec-cols REC_COLS   comma separated columns from recordings.csv to include
                      in the outputted metrics (optional), NA if ambiguous
--child-cols CHILD_COLS
                      comma separated columns from children.csv to include
                      in the outputted metrics (optional), NA if ambiguous
--threads THREADS     amount of threads to run on

The Period option aggregates vocalizations for each time-of-the-day unit based on a period specified by the user.
For instance, if the period is set to 15Min (i.e. 15 minutes), vocalization rates will be reported for each recording and time-unit (e.g. 09:00 to 09:15, 09:15 to 09:30, etc.).

The output dataframe has $$r \times p$$ rows, where $$r$$ is the number of recordings (or children if the --by option is set to child_id, etc.), and $$p$$ is the number of time-bins per day (i.e. $$24 \times 4=96$$ for a 15-minute period). The output dataframe includes period_start and period_end columns that contain the onset and offset of each time-unit in HH:MM:SS format. The duration_<set> columns contain the total amount of annotated time covering each time-bin and each set, in milliseconds.

If --by is set to e.g. child_id, then the values for each time-bin will be the average rates across all the recordings of each child.

The list of supported metrics is shown below:

Warning: Be aware that numerous metrics are rates (every metric ending with ‘ph’ is) and not absolute counts! Results can therefore differ from those of other extraction methods (e.g. LENA metrics). Rates are expressed in counts/hour (for events) or in milliseconds/hour (for durations).
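
The row count described above can be illustrated with a short Python sketch. It only enumerates the time-of-day bins for a period given in minutes; it is not the package's implementation, and the helper name day_bins, the number of recordings and the treatment of the last bin (ending at 00:00:00) are assumptions made for the example.

```python
from datetime import datetime, timedelta


def day_bins(period_minutes):
    """Return (period_start, period_end) pairs covering one day.

    Hypothetical helper: it mimics the period_start / period_end columns
    in HH:MM:SS format, not the actual code of the package.
    """
    step = timedelta(minutes=period_minutes)
    n_bins = timedelta(days=1) // step  # p, e.g. 24 * 4 = 96 for 15 minutes
    midnight = datetime(2000, 1, 1)     # arbitrary date, only the time is used
    return [
        (
            (midnight + i * step).strftime("%H:%M:%S"),
            (midnight + (i + 1) * step).strftime("%H:%M:%S"),
        )
        for i in range(n_bins)
    ]


bins = day_bins(15)
n_recordings = 3  # hypothetical value of r

print(len(bins))                 # p: number of time-bins per day
print(n_recordings * len(bins))  # r x p: expected number of output rows
print(bins[:2])                  # first two (period_start, period_end) pairs
```

With --by child_id, r would be the number of children instead of the number of recordings.
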
| Callable | Description | Required arguments |
|---|---|---|
| avg_can_voc_dur_speaker | average duration of canonical vocalizations for a given speaker type (based on vcm_type) | speaker : speaker_type to use |
| avg_cry_voc_dur_speaker | average duration of cry vocalizations by a given speaker type (based on vcm_type) | speaker : speaker_type to use |
| avg_non_can_voc_dur_speaker | average duration of non canonical vocalizations for a given speaker type (based on vcm_type) | speaker : speaker_type to use |
| avg_voc_dur_speaker | average duration in milliseconds of vocalizations for a given speaker type | speaker : speaker_type to use |
| can_voc_dur_speaker_ph | total duration of canonical vocalizations by a given speaker type in milliseconds per hour (based on vcm_type) | speaker : speaker_type to use |
| can_voc_speaker_ph | number of canonical vocalizations per hour for a given speaker type (based on vcm_type) | speaker : speaker_type to use |
| cp_dur | canonical proportion on the duration of vocalizations for CHI (based on vcm_type) | |
| cp_n | canonical proportion on the number of vocalizations for CHI (based on vcm_type) | |
| cry_voc_dur_speaker_ph | total duration of cry vocalizations by a given speaker type in milliseconds per hour (based on vcm_type) | speaker : speaker_type to use |
| cry_voc_speaker_ph | number of cry vocalizations per hour for a given speaker type (based on vcm_type) | speaker : speaker_type to use |
| lena_CTC | conversational turn count according to LENA’s extraction | |
| lena_CVC | number of child vocalizations according to LENA’s extraction | |
| lp_dur | linguistic proportion on the duration of vocalizations for CHI (based on vcm_type or [child_cry_vfxs_len,utterances_length] if vcm_type does not exist) | |
| lp_n | linguistic proportion on the number of vocalizations for CHI (based on vcm_type or [cries,vfxs,utterances_count] if vcm_type does not exist) | |
| non_can_voc_dur_speaker_ph | total duration of non canonical vocalizations by a given speaker type in milliseconds per hour (based on vcm_type) | speaker : speaker_type to use |
| non_can_voc_speaker_ph | number of non canonical vocalizations per hour for a given speaker type (based on vcm_type) | speaker : speaker_type to use |
| pc_adu_ph | number of phonemes per hour for all speakers | |
| pc_speaker_ph | number of phonemes per hour for a given speaker type | speaker : speaker_type to use |
| sc_adu_ph | number of syllables per hour for all speakers | |
| sc_speaker_ph | number of syllables per hour for a given speaker type | speaker : speaker_type to use |
| voc_dur_speaker_ph | total duration of vocalizations by a given speaker type in milliseconds per hour | speaker : speaker_type to use |
| voc_speaker | number of vocalizations for a given speaker type | speaker : speaker_type to use |
| voc_speaker_ph | number of vocalizations per hour for a given speaker type | speaker : speaker_type to use |
| wc_adu_ph | number of words per hour for all speakers | |
| wc_speaker_ph | number of words per hour for a given speaker type | speaker : speaker_type to use |

LENA Metrics

The LENA pipeline extracts a list of usual metrics that can be obtained from the LENA automated annotations (its files). Using this pipeline with a set of its annotations will extract the following metrics:

| metric / speaker | FEM | MAL | OCH | CHI | All speakers | CHI + MAL + FEM |
|---|---|---|---|---|---|---|
| voc_speaker_ph | voc_fem_ph | voc_mal_ph | voc_och_ph | voc_chi_ph | | |
| voc_dur_speaker_ph | voc_dur_fem_ph | voc_dur_mal_ph | voc_dur_och_ph | voc_dur_chi_ph | | |
| avg_voc_dur_speaker | avg_voc_dur_fem | avg_voc_dur_mal | avg_voc_dur_och | avg_voc_dur_chi | | |
| wc_speaker_ph | wc_fem_ph | wc_mal_ph | | | wc_adu_ph | |
| lp_n | | | | lp_n | | |
| lp_dur | | | | lp_dur | | |
| lena_CVC | | | | lena_CVC | | |
| lena_CTC | | | | | | lena_CTC |

$ child-project metrics /path/to/dataset output.csv lena --help
usage: child-project metrics path destination lena [-h] set

positional arguments:
set         name of the LENA its annotations set

optional arguments:
-h, --help  show this help message and exit
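
Since most of the metrics above are rates, comparing them with absolute counts such as lena_CVC may require converting them back. The following Python sketch shows the arithmetic, assuming the rate was computed over the annotated time reported in the matching duration_<set> column (in milliseconds). The function name, the column name duration_its and the values are hypothetical.

```python
MS_PER_HOUR = 3_600_000  # milliseconds in one hour


def rate_to_count(rate_per_hour, duration_ms):
    """Convert a per-hour rate into an absolute count (or total duration).

    Assumes the rate was obtained by dividing the count by the annotated
    duration expressed in hours.
    """
    return rate_per_hour * duration_ms / MS_PER_HOUR


# Hypothetical output row: 120 vocalizations per hour over 2 hours of annotations
row = {"voc_chi_ph": 120.0, "duration_its": 7_200_000}

n_vocalizations = rate_to_count(row["voc_chi_ph"], row["duration_its"])
print(n_vocalizations)  # 240.0
```

The same conversion applies to duration rates (milliseconds/hour), in which case the result is a total duration in milliseconds.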


ACLEW Metrics

The ACLEW pipeline extracts a list of usual metrics that can be obtained from the automated annotations produced by the VTC, ALICE and VCM models. VTC is the only set required to run the pipeline; the other sets are optional, but providing them allows more metrics to be extracted. Using this pipeline with a set of vtc annotations, and optionally alice and vcm sets, will extract:

• From VTC:

| metric / speaker | FEM | MAL | OCH | CHI |
|---|---|---|---|---|
| voc_speaker_ph | voc_fem_ph | voc_mal_ph | voc_och_ph | voc_chi_ph |
| voc_dur_speaker_ph | voc_dur_fem_ph | voc_dur_mal_ph | voc_dur_och_ph | voc_dur_chi_ph |
| avg_voc_dur_speaker | avg_voc_dur_fem | avg_voc_dur_mal | avg_voc_dur_och | avg_voc_dur_chi |

• From ALICE:

| metric / speaker | FEM | MAL | All speakers |
|---|---|---|---|
| wc_speaker_ph | wc_fem_ph | wc_mal_ph | |
| sc_speaker_ph | sc_fem_ph | sc_mal_ph | |
| pc_speaker_ph | pc_fem_ph | pc_mal_ph | |

• From VCM:

| metric / speaker | CHI |
|---|---|
| cry_voc_speaker_ph | cry_voc_chi_ph |
| cry_voc_dur_speaker_ph | cry_voc_dur_chi_ph |
| avg_cry_voc_dur_speaker | avg_cry_voc_dur_chi |
| can_voc_speaker_ph | can_voc_chi_ph |
| can_voc_dur_speaker_ph | can_voc_dur_chi_ph |
| avg_can_voc_dur_speaker | avg_can_voc_dur_chi |
| non_can_voc_speaker_ph | non_can_voc_chi_ph |
| non_can_voc_dur_speaker_ph | non_can_voc_dur_chi_ph |
| avg_non_can_voc_dur_speaker | avg_non_can_voc_dur_chi |
| lp_n | lp_n |
| lp_dur | lp_dur |
| cp_n | cp_n |
| cp_dur | cp_dur |
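
As an illustration of how the general options combine with the aclew pipeline, here is a Python sketch that assembles the command line as a list of arguments. The helper name aclew_command and the set names vtc and alice are assumptions for the example. The general options are placed before the positional arguments, following the usage line of child-project metrics.

```python
import shlex


def aclew_command(dataset, destination, vtc, alice=None, vcm=None,
                  by=None, period=None):
    """Build the argument list for an ACLEW metrics extraction.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    cmd = ["child-project", "metrics"]
    # General options of the metrics command come before the positionals
    if by:
        cmd += ["--by", by]
    if period:
        cmd += ["--period", period]
    cmd += [dataset, destination, "aclew", "--vtc", vtc]
    # ALICE and VCM sets are optional
    if alice:
        cmd += ["--alice", alice]
    if vcm:
        cmd += ["--vcm", vcm]
    return cmd


cmd = aclew_command("/path/to/dataset", "output.csv", vtc="vtc",
                    alice="alice", by="child_id", period="15Min")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the extraction, provided child-project is installed and the dataset exists.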

$ child-project metrics /path/to/dataset output.csv aclew --help
usage: child-project metrics path destination aclew [-h] [--vtc VTC] [--alice ALICE] [--vcm VCM]

optional arguments:
-h, --help     show this help message and exit
--vtc VTC      vtc set
--alice ALICE  alice set
--vcm VCM      vcm set


Custom metrics

The Custom metrics pipeline lets you provide your own list of metrics to extract. The list must be a csv file containing the following columns:

• callable (required): name of the metric to extract, see the list of supported metrics above
• set (required): name of the set to extract from; make sure this annotation set contains the information required to extract this specific metric
• name (optional): name to use in the resulting metrics. If none is given, a default name will be used. Use this to extract the same metric for different sets and avoid name clashes.
• <argument> (depending on the requirements of the metric you chose): for each required argument of a metric, add a column named after that argument.

Below is an example of a csv file we use to extract metrics. We want to extract the number of vocalizations per hour of the key child (CHI), male adult (MAL) and female adult (FEM) on 2 different sets to compare their results. So we write 3 lines per set (vtc and its), each with a different speaker, and we also give each metric an explicit name because the default names voc_chi_ph, voc_mal_ph and voc_fem_ph would have clashed between the 2 sets. Additionally, we extract the linguistic proportion on the number of vocalizations and on their duration separately from the vcm set. The default names won’t clash and no speaker is needed (linguistic proportion is computed on CHI), so we leave those columns empty.
| callable | set | name | speaker |
|---|---|---|---|
| voc_speaker_ph | vtc | voc_chi_ph_vtc | CHI |
| voc_speaker_ph | vtc | voc_mal_ph_vtc | MAL |
| voc_speaker_ph | vtc | voc_fem_ph_vtc | FEM |
| voc_speaker_ph | its | voc_chi_ph_its | CHI |
| voc_speaker_ph | its | voc_mal_ph_its | MAL |
| voc_speaker_ph | its | voc_fem_ph_its | FEM |
| lp_n | vcm | | |
| lp_dur | vcm | | |

$ child-project metrics /path/to/dataset output.csv custom --help
usage: child-project metrics path destination custom [-h] metrics

positional arguments:
metrics     name if the csv file containing the list of metrics

optional arguments:
-h, --help  show this help message and exit
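
The metrics list from the example above can also be generated programmatically. The following Python sketch builds the same csv content with the standard csv module. The function name and the commented-out output file name metrics.csv are assumptions; the rows reproduce the example table.

```python
import csv
import io

FIELDS = ["callable", "set", "name", "speaker"]


def build_metrics_csv():
    """Return the csv text describing the metrics to extract."""
    rows = []
    # Same metric on two sets: explicit names avoid clashes between them
    for set_name in ("vtc", "its"):
        for speaker in ("CHI", "MAL", "FEM"):
            rows.append({
                "callable": "voc_speaker_ph",
                "set": set_name,
                "name": f"voc_{speaker.lower()}_ph_{set_name}",
                "speaker": speaker,
            })
    # Linguistic proportions: default name, no speaker argument needed
    for metric in ("lp_n", "lp_dur"):
        rows.append({"callable": metric, "set": "vcm", "name": "", "speaker": ""})

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


metrics_csv = build_metrics_csv()
print(metrics_csv)

# To use it with the custom pipeline, write it to a file (hypothetical name):
# with open("metrics.csv", "w", newline="") as f:
#     f.write(metrics_csv)
```

The resulting file would then be given as the metrics argument of the custom pipeline.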


Metrics from parameter file

To facilitate the extraction of metrics, a new extraction can be launched from a single yml parameter file listing all the options. This file has exactly the same structure as the one produced by the pipeline, so the parameter file output by a previous run can be used to rerun the same analysis.

$ child-project metrics-specification --help
usage: child-project metrics-specification [-h] path

positional arguments:
path        path to the yml file with all parameters

optional arguments:
-h, --help  show this help message and exit