Metrics extraction
Overview
This package extracts commonly used metrics from annotations produced by LENA or other pipelines. It produces a CSV file containing the metrics, along with a YML parameter file that stores all the options used.
$ child-project metrics --help
usage: child-project metrics [-h] [--recordings RECORDINGS]
[--by {recording_filename,session_id,child_id,experiment,segments}]
[--segments SEGMENTS] [--period PERIOD]
[-f FROM_TIME] [-t TO_TIME] [--rec-cols REC_COLS]
[--child-cols CHILD_COLS] [--threads THREADS]
path destination {custom,lena,aclew} ...
positional arguments:
path path to the dataset
destination segments destination
{custom,lena,aclew} pipeline
custom metrics from a csv file
lena LENA metrics
aclew ACLEW metrics
optional arguments:
-h, --help show this help message and exit
--recordings RECORDINGS
path to a CSV dataframe containing the list of
recordings to sample from (by default, all recordings
will be sampled). The CSV should have one column named
recording_filename.
--by {recording_filename,session_id,child_id,experiment,segments}
units to sample from (default behavior is to sample by
recording)
--segments SEGMENTS path to a CSV dataframe containing the list of
segments to sample from. The CSV should have 3 columns
named recording_filename, segment_onset,
segment_offset. --by must be set to 'segments', Can
not be used along with options [--period,--recordings,
--from-tim,--to-time]
--period PERIOD time units to aggregate (optional); equivalent to
``pandas.Grouper`` freq argument. The resulting
metrics will be split for each unit across all the
resulting periods.
-f FROM_TIME, --from-time FROM_TIME
time range start in HH:MM:SS format (optional)
-t TO_TIME, --to-time TO_TIME
time range end in HH:MM:SS format (optional)
--rec-cols REC_COLS comma separated columns from recordings.csv to include
in the outputted metrics (optional), NA if ambiguous
--child-cols CHILD_COLS
comma separated columns from children.csv to include
in the outputted metrics (optional), NA if ambiguous
--threads THREADS amount of threads to run on
The --period option aggregates vocalizations for each time-of-day unit, based on a period specified by the user. For instance, if the period is set to 15Min (i.e. 15 minutes), vocalization rates will be reported for each recording and time unit (e.g. 09:00 to 09:15, 09:15 to 09:30, etc.).
The output dataframe has \(r \times p\) rows, where \(r\) is the number of recordings (or children if the --by option is set to child_id, etc.), and \(p\) is the number of time bins per day (i.e. \(24 \times 4=96\) for a 15-minute period).
The output dataframe includes period_start and period_end columns that contain the onset and offset of each time unit in HH:MM:SS format. The duration_<set> columns contain the total amount of annotated time covering each time bin and each set, in milliseconds.
If --by is set to e.g. child_id, then the values for each time bin will be the average rates across all the recordings of each child.
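To make the row count concrete, the sketch below enumerates the time-of-day bins for a 15-minute period and derives the expected number of output rows. It is only an illustration: the helper period_bins and the figure of 3 recordings are hypothetical, and the pipeline itself builds its bins through pandas rather than this code.

```python
from datetime import datetime, timedelta


def period_bins(minutes=15):
    """List (period_start, period_end) pairs in HH:MM:SS covering one day."""
    day_start = datetime(2000, 1, 1)  # arbitrary reference day; only the time of day matters
    step = timedelta(minutes=minutes)
    n_bins = (24 * 60) // minutes
    return [
        (
            (day_start + i * step).strftime("%H:%M:%S"),
            (day_start + (i + 1) * step).strftime("%H:%M:%S"),
        )
        for i in range(n_bins)
    ]


bins = period_bins(15)
n_recordings = 3  # hypothetical value of r
expected_rows = n_recordings * len(bins)  # r x p rows in the output dataframe

print(len(bins))      # number of time bins per day (p)
print(bins[36])       # the bin starting at 09:00
print(expected_rows)
```

The last bin ends at midnight, so its period_end is rendered as 00:00:00 in this sketch.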
The list of supported metrics is shown below:
Warning
Be aware that many metrics are rates (every metric ending with ‘ph’ is), not absolute counts! Results can therefore differ from those of other extraction methods (e.g. LENA metrics). Rates are expressed in counts/hour (for events) or in milliseconds/hour (for durations).
Callable | Description | Required arguments |
---|---|---|
avg_can_voc_dur_speaker | average duration of canonical vocalizations for a given speaker type (based on vcm_type) | speaker : speaker_type to use |
avg_cry_voc_dur_speaker | average duration of cry vocalizations by a given speaker type (based on vcm_type or lena cries) | speaker : speaker_type to use |
avg_non_can_voc_dur_speaker | average duration of non-canonical vocalizations for a given speaker type (based on vcm_type) | speaker : speaker_type to use |
avg_voc_dur_speaker | average duration in milliseconds of vocalizations for a given speaker type | speaker : speaker_type to use |
can_voc_dur_speaker | total duration of canonical vocalizations by a given speaker type in milliseconds (based on vcm_type) | speaker : speaker_type to use |
can_voc_dur_speaker_ph | total duration of canonical vocalizations by a given speaker type in milliseconds (based on vcm_type). This value is a ‘per hour’ value. | speaker : speaker_type to use |
can_voc_speaker | number of canonical vocalizations for a given speaker type (based on vcm_type) | speaker : speaker_type to use |
can_voc_speaker_ph | number of canonical vocalizations for a given speaker type (based on vcm_type). This value is a ‘per hour’ value. | speaker : speaker_type to use |
cp_dur | canonical proportion on the duration of vocalizations for CHI (based on vcm_type) | |
cp_n | canonical proportion on the number of vocalizations for CHI (based on vcm_type) | |
cry_voc_dur_speaker | total duration of cry vocalizations by a given speaker type in milliseconds (based on vcm_type or lena cry) | speaker : speaker_type to use |
cry_voc_dur_speaker_ph | total duration of cry vocalizations by a given speaker type in milliseconds (based on vcm_type or lena cry). This value is a ‘per hour’ value. | speaker : speaker_type to use |
cry_voc_speaker | number of cry vocalizations for a given speaker (based on vcm_type or lena cries) | speaker : speaker_type to use |
cry_voc_speaker_ph | number of cry vocalizations for a given speaker (based on vcm_type or lena cries). This value is a ‘per hour’ value. | speaker : speaker_type to use |
lena_CTC | number of conversational turn counts according to LENA’s extraction | |
lena_CTC_ph | number of conversational turn counts according to LENA’s extraction. This value is a ‘per hour’ value. | |
lena_CVC | number of child vocalizations according to LENA’s extraction | |
lena_CVC_ph | number of child vocalizations according to LENA’s extraction. This value is a ‘per hour’ value. | |
lp_dur | linguistic proportion on the duration of vocalizations for CHI (based on vcm_type or [child_cry_vfxs_len,utterances_length] if vcm_type does not exist) | |
lp_n | linguistic proportion on the number of vocalizations for CHI (based on vcm_type or [cries,vfxs,utterances_count] if vcm_type does not exist) | |
non_can_voc_dur_speaker | total duration of non-canonical vocalizations by a given speaker type in milliseconds (based on vcm_type) | speaker : speaker_type to use |
non_can_voc_dur_speaker_ph | total duration of non-canonical vocalizations by a given speaker type in milliseconds (based on vcm_type). This value is a ‘per hour’ value. | speaker : speaker_type to use |
non_can_voc_speaker | number of non-canonical vocalizations for a given speaker type (based on vcm_type) | speaker : speaker_type to use |
non_can_voc_speaker_ph | number of non-canonical vocalizations for a given speaker type (based on vcm_type). This value is a ‘per hour’ value. | speaker : speaker_type to use |
pc_adu | number of phonemes for all speakers | |
pc_adu_ph | number of phonemes for all speakers. This value is a ‘per hour’ value. | |
pc_speaker | number of phonemes for a given speaker type | speaker : speaker_type to use |
pc_speaker_ph | number of phonemes for a given speaker type. This value is a ‘per hour’ value. | speaker : speaker_type to use |
peak_can_voc_dur_speaker | peak for 1h of the following metric: total duration of canonical vocalizations by a given speaker type in milliseconds (based on vcm_type) | speaker : speaker_type to use |
peak_can_voc_speaker | peak for 1h of the following metric: number of canonical vocalizations for a given speaker type (based on vcm_type) | speaker : speaker_type to use |
peak_cry_voc_dur_speaker | peak for 1h of the following metric: total duration of cry vocalizations by a given speaker type in milliseconds (based on vcm_type or lena cry) | speaker : speaker_type to use |
peak_cry_voc_speaker | peak for 1h of the following metric: number of cry vocalizations for a given speaker (based on vcm_type or lena cries) | speaker : speaker_type to use |
peak_hour_metric | empty_value : should repeat the empty value of the metric function wrapper (as this will be used for empty periods) | |
peak_lena_CTC | peak for 1h of the following metric: number of conversational turn counts according to LENA’s extraction | |
peak_lena_CVC | peak for 1h of the following metric: number of child vocalizations according to LENA’s extraction | |
peak_non_can_voc_dur_speaker | peak for 1h of the following metric: total duration of non-canonical vocalizations by a given speaker type in milliseconds (based on vcm_type) | speaker : speaker_type to use |
peak_non_can_voc_speaker | peak for 1h of the following metric: number of non-canonical vocalizations for a given speaker type (based on vcm_type) | speaker : speaker_type to use |
peak_pc_adu | peak for 1h of the following metric: number of phonemes for all speakers | |
peak_pc_speaker | peak for 1h of the following metric: number of phonemes for a given speaker type | speaker : speaker_type to use |
peak_sc_adu | peak for 1h of the following metric: number of syllables for all speakers | |
peak_sc_speaker | peak for 1h of the following metric: number of syllables for a given speaker type | speaker : speaker_type to use |
peak_simple_CTC | peak for 1h of the following metric: number of conversational turn counts based on vocalizations occurring in a given interval of one another. Keyword arguments: interlocutors_1 : first group of interlocutors, default = ['CHI']; interlocutors_2 : second group of interlocutors, default = ['FEM','MAL','OCH']; max_interval : maximum interval in ms for it to be considered a turn, default = 1000; min_delay : minimum delay between somebody starting speaking | |
peak_voc_dur_speaker | peak for 1h of the following metric: total duration of vocalizations by a given speaker type in milliseconds | speaker : speaker_type to use |
peak_voc_speaker | peak for 1h of the following metric: number of vocalizations for a given speaker type | speaker : speaker_type to use |
peak_wc_adu | peak for 1h of the following metric: number of words for all speakers | |
peak_wc_speaker | peak for 1h of the following metric: number of words for a given speaker type | speaker : speaker_type to use |
per_hour_metric | | |
sc_adu | number of syllables for all speakers | |
sc_adu_ph | number of syllables for all speakers. This value is a ‘per hour’ value. | |
sc_speaker | number of syllables for a given speaker type | speaker : speaker_type to use |
sc_speaker_ph | number of syllables for a given speaker type. This value is a ‘per hour’ value. | speaker : speaker_type to use |
simple_CTC | number of conversational turn counts based on vocalizations occurring in a given interval of one another. Keyword arguments: interlocutors_1 : first group of interlocutors, default = ['CHI']; interlocutors_2 : second group of interlocutors, default = ['FEM','MAL','OCH']; max_interval : maximum interval in ms for it to be considered a turn, default = 1000; min_delay : minimum delay between somebody starting speaking | |
simple_CTC_ph | number of conversational turn counts based on vocalizations occurring in a given interval of one another. Keyword arguments: interlocutors_1 : first group of interlocutors, default = ['CHI']; interlocutors_2 : second group of interlocutors, default = ['FEM','MAL','OCH']; max_interval : maximum interval in ms for it to be considered a turn, default = 1000; min_delay : minimum delay between somebody starting speaking. This value is a ‘per hour’ value. | |
voc_dur_speaker | total duration of vocalizations by a given speaker type in milliseconds | speaker : speaker_type to use |
voc_dur_speaker_ph | total duration of vocalizations by a given speaker type in milliseconds. This value is a ‘per hour’ value. | speaker : speaker_type to use |
voc_speaker | number of vocalizations for a given speaker type | speaker : speaker_type to use |
voc_speaker_ph | number of vocalizations for a given speaker type. This value is a ‘per hour’ value. | speaker : speaker_type to use |
wc_adu | number of words for all speakers | |
wc_adu_ph | number of words for all speakers. This value is a ‘per hour’ value. | |
wc_speaker | number of words for a given speaker type | speaker : speaker_type to use |
wc_speaker_ph | number of words for a given speaker type. This value is a ‘per hour’ value. | speaker : speaker_type to use |
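As the warning above notes, metrics ending in ph are rates rather than absolute counts. The sketch below shows one plausible way such a rate relates to a raw value and the annotated time reported in the duration_<set> columns (in milliseconds). The helper per_hour and the sample numbers are hypothetical, and the normalisation shown is an assumption rather than a copy of the package's implementation.

```python
MS_PER_HOUR = 3_600_000


def per_hour(value, annotated_ms):
    """Scale a count (or a duration in ms) to a per-hour rate.

    annotated_ms is the annotated time covered by the set, in milliseconds.
    """
    if annotated_ms <= 0:
        # No annotated time: the rate is undefined (NaN is a choice made for this sketch).
        return float("nan")
    return value * MS_PER_HOUR / annotated_ms


# Hypothetical example: 120 CHI vocalizations lasting 90 s in total,
# found in 1.5 hours of annotated audio.
annotated_ms = 5_400_000
voc_chi_ph = per_hour(120, annotated_ms)         # counts per hour
voc_dur_chi_ph = per_hour(90_000, annotated_ms)  # milliseconds per hour

print(voc_chi_ph, voc_dur_chi_ph)
```

Under this assumption, multiplying a rate by the annotated duration in hours recovers the absolute value, which helps when comparing against tools that report raw counts.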
LENA Metrics
The LENA pipeline extracts a list of usual metrics that can be obtained from LENA's automated annotations (its files). Running this pipeline on an annotation set built from its files will extract the following metrics:
metric / speaker | FEM | MAL | OCH | CHI | All speakers | CHI + MAL + FEM |
---|---|---|---|---|---|---|
voc_speaker_ph | voc_fem_ph | voc_mal_ph | voc_och_ph | voc_chi_ph | | |
voc_dur_speaker_ph | voc_dur_fem_ph | voc_dur_mal_ph | voc_dur_och_ph | voc_dur_chi_ph | | |
avg_voc_dur_speaker | avg_voc_dur_fem | avg_voc_dur_mal | avg_voc_dur_och | avg_voc_dur_chi | | |
wc_speaker_ph | wc_fem_ph | wc_mal_ph | | | wc_adu_ph | |
lp_n | | | | lp_n | | |
lp_dur | | | | lp_dur | | |
lena_CVC | | | | lena_CVC | | |
lena_CTC | | | | | | lena_CTC |
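The output names in the table above appear to follow a simple pattern: the word speaker in the callable name is replaced by the lowercase speaker label, and metrics without a speaker keep their name. The sketch below encodes that observation. The function default_metric_name is hypothetical and inferred only from the names listed in this document; the package's actual naming logic may differ.

```python
def default_metric_name(callable_name, speaker=None):
    """Guess the default output name from a callable name and an optional speaker."""
    if speaker is None:
        # Metrics such as lp_n or lena_CVC take no speaker and keep their name.
        return callable_name
    return callable_name.replace("speaker", speaker.lower())


for callable_name, speaker in [
    ("voc_speaker_ph", "CHI"),
    ("avg_voc_dur_speaker", "FEM"),
    ("lp_n", None),
]:
    print(callable_name, "->", default_metric_name(callable_name, speaker))
```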
$ child-project metrics /path/to/dataset output.csv lena --help
usage: child-project metrics path destination lena [-h] set
positional arguments:
set name of the LENA its annotations set
optional arguments:
-h, --help show this help message and exit
ACLEW Metrics
The ACLEW pipeline extracts a list of usual metrics that can be obtained from the automated annotations produced by the VTC, ALICE and VCM models. VTC is the only set required to run the pipeline; the other sets enable more metrics, but their presence is not mandatory. Using this pipeline with a set of vtc annotations, and optionally alice and vcm sets, will extract:
From VTC:
metric / speaker | FEM | MAL | OCH | CHI |
---|---|---|---|---|
voc_speaker_ph | voc_fem_ph | voc_mal_ph | voc_och_ph | voc_chi_ph |
voc_dur_speaker_ph | voc_dur_fem_ph | voc_dur_mal_ph | voc_dur_och_ph | voc_dur_chi_ph |
avg_voc_dur_speaker | avg_voc_dur_fem | avg_voc_dur_mal | avg_voc_dur_och | avg_voc_dur_chi |
From ALICE:
metric / speaker | FEM | MAL | All speakers |
---|---|---|---|
wc_speaker_ph | wc_fem_ph | wc_mal_ph | |
sc_speaker_ph | sc_fem_ph | sc_mal_ph | |
pc_speaker_ph | pc_fem_ph | pc_mal_ph | |
wc_adu_ph | | | wc_adu_ph |
sc_adu_ph | | | sc_adu_ph |
pc_adu_ph | | | pc_adu_ph |
From VCM:
metric / speaker | CHI |
---|---|
cry_voc_speaker_ph | cry_voc_chi_ph |
cry_voc_dur_speaker_ph | cry_voc_dur_chi_ph |
avg_cry_voc_dur_speaker | avg_cry_voc_dur_chi |
can_voc_speaker_ph | can_voc_chi_ph |
can_voc_dur_speaker_ph | can_voc_dur_chi_ph |
avg_can_voc_dur_speaker | avg_can_voc_dur_chi |
non_can_voc_speaker_ph | non_can_voc_chi_ph |
non_can_voc_dur_speaker_ph | non_can_voc_dur_chi_ph |
avg_non_can_voc_dur_speaker | avg_non_can_voc_dur_chi |
lp_n | lp_n |
lp_dur | lp_dur |
cp_n | cp_n |
cp_dur | cp_dur |
$ child-project metrics /path/to/dataset output.csv aclew --help
usage: child-project metrics path destination aclew [-h] [--vtc VTC]
[--alice ALICE]
[--vcm VCM]
optional arguments:
-h, --help show this help message and exit
--vtc VTC vtc set
--alice ALICE alice set
--vcm VCM vcm set
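When the pipeline is launched from a script, it can help to assemble the command line programmatically. The sketch below builds an argument list from the options shown in the help output above. The function aclew_command is hypothetical, the set names vtc and vcm are placeholders for whatever the sets are called in your dataset, and nothing is executed here.

```python
import shlex


def aclew_command(path, destination, vtc=None, alice=None, vcm=None, period=None):
    """Build the argument list for the ACLEW metrics pipeline."""
    cmd = ["child-project", "metrics"]
    if period is not None:
        # Options of the metrics command itself go before the pipeline name.
        cmd += ["--period", period]
    cmd += [path, destination, "aclew"]
    # Pipeline-specific options follow the pipeline name; unset ones are skipped.
    for flag, value in (("--vtc", vtc), ("--alice", alice), ("--vcm", vcm)):
        if value is not None:
            cmd += [flag, value]
    return cmd


cmd = aclew_command(
    "/path/to/dataset", "output.csv", vtc="vtc", vcm="vcm", period="15Min"
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run if you want to launch the extraction from Python.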
Custom metrics
The Custom metrics pipeline lets you provide your own list of metrics to extract. The list must be in a csv file containing the following columns:
- callable (required) : name of the metric to extract, see the list of supported metrics above
- set (required) : name of the set to extract from; make sure this annotation set has the information required to extract this specific metric
- name (optional) : name to use in the resulting metrics. If none is given, a default name will be used. Use this to extract the same metric for different sets and avoid name clashes.
- <argument> (depending on the requirements of the metric you chose) : for each required argument of a metric, add a column with that argument’s name.
This is an example of a csv file we use to extract metrics. We want to extract the number of vocalizations per hour of the key child (CHI), male adult (MAL) and female adult (FEM) on 2 different sets to compare their results. So we write 3 lines per set (vtc and its), each with a different speaker, and we give each metric an explicit name because the default names voc_chi_ph, voc_mal_ph and voc_fem_ph would have clashed between the 2 sets. Additionally, we extract the linguistic proportion on the number of vocalizations and on the duration separately from the vcm set. The default names won’t clash and no speaker is needed (linguistic proportion is computed on CHI), so we leave those columns empty.
callable | set | name | speaker |
---|---|---|---|
voc_speaker_ph | vtc | voc_chi_ph_vtc | CHI |
voc_speaker_ph | vtc | voc_mal_ph_vtc | MAL |
voc_speaker_ph | vtc | voc_fem_ph_vtc | FEM |
voc_speaker_ph | its | voc_chi_ph_its | CHI |
voc_speaker_ph | its | voc_mal_ph_its | MAL |
voc_speaker_ph | its | voc_fem_ph_its | FEM |
lp_n | vcm | | |
lp_dur | vcm | | |
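The example table above can be generated with a few lines of Python. This is a minimal sketch using only the standard library; the function write_metrics_csv is hypothetical, and the in-memory buffer stands in for a real file.

```python
import csv
import io


def write_metrics_csv(handle):
    """Write the example list of custom metrics to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["callable", "set", "name", "speaker"])
    # Same metric on two sets: explicit names avoid clashes between the sets.
    for set_name in ("vtc", "its"):
        for speaker in ("CHI", "MAL", "FEM"):
            name = f"voc_{speaker.lower()}_ph_{set_name}"
            writer.writerow(["voc_speaker_ph", set_name, name, speaker])
    # Linguistic proportion needs no speaker and keeps its default name.
    writer.writerow(["lp_n", "vcm", "", ""])
    writer.writerow(["lp_dur", "vcm", "", ""])


buffer = io.StringIO()
write_metrics_csv(buffer)
print(buffer.getvalue())
```

To write to disk instead, open a file with open("metrics.csv", "w", newline="") and pass the handle to the function; metrics.csv is a placeholder name. Following the usage line below, that file is then given as the metrics positional argument of the custom pipeline.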
$ child-project metrics /path/to/dataset output.csv custom --help
usage: child-project metrics path destination custom [-h] metrics
positional arguments:
metrics name of the csv file containing the list of metrics
optional arguments:
-h, --help show this help message and exit
Metrics from parameter file
To facilitate the extraction of metrics, you can use an exhaustive YML parameter file to launch a new extraction. This file has exactly the same structure as the one produced by the pipeline, so you can use an output parameter file to rerun the same analysis.
$ child-project metrics-specification --help
usage: child-project metrics-specification [-h] parameters_input
positional arguments:
parameters_input path to the yml file with all parameters
optional arguments:
-h, --help show this help message and exit