Metrics python extraction
Metrics can be extracted both from the command-line interface and from the python API. You will find here instructions on how to use the API to customize your metrics extraction.
To extract metrics, you can choose to use the pipelines that are defined in the command-line or define all the parameters of the extraction yourself.
Use the existing pipelines
We first need to initialize the project and import the necessary functions. Here we use
the example project vandam-data.
We then initialize our 2 types of metrics LenaMetrics
and AclewMetrics
with the desired parameters.
Here we choose to do a very simple LenaMetrics extraction using all the default values and the set named “its”. For AclewMetrics, we initialize the class to extract on the set “vtc” only between 8am and 5pm on periods of 6 hours, grouped by child_id values and adding the values of date_iso, child_dob and child_id to the resulting output.
>>> from ChildProject.projects import ChildProject
>>> from ChildProject.pipelines.metrics import LenaMetrics, AclewMetrics
>>> project = ChildProject('vandam-data')
>>> lmetrics = LenaMetrics(project,"its")
>>> ametrics = AclewMetrics(
... project,
... vtc='vtc',
... from_time='8:00:00',
... to_time='17:00:00',
... rec_cols='date_iso',
... child_cols='child_dob,child_id',
... period='6h',
... by='child_id',
... )
The ALICE set ('alice') was not found in the index.
The vcm set ('vcm') was not found in the index.
The programm warns us that the alice and vcm sets are not present which is expected given that the vandam-data corpus does not have vcm and alice annotations. So the output will not contain the metrics extracted from those.
We then launch the extraction for each pipeline. The function populates the .metrics
attribute
and returns the resulting metrics. Here we save the resulting metrics in csv files with the
to_csv function
from pandas.
>>> lmetrics.extract()
recording_filename child_id duration_its ... voc_mal_ph voc_dur_chi_ph lp_dur
0 BN32_010007.mp3 1 50464512 ... 103.86705 178674.867598 NaN
[1 rows x 20 columns]
>>> lmetrics.metrics.to_csv('LenaMetrics.csv', index=False)
>>> ametrics.extract()
child_id period_start period_end ... avg_voc_dur_mal avg_voc_dur_och avg_voc_dur_chi
0 1 00:00:00 06:00:00 ... NaN NaN NaN
1 1 06:00:00 12:00:00 ... 1373.208247 935.654378 1159.420822
2 1 12:00:00 18:00:00 ... 1099.721472 808.012712 1011.550502
3 1 18:00:00 00:00:00 ... NaN NaN NaN
[4 rows x 18 columns]
>>> ametrics.metrics.to_csv('AclewMetrics.csv', index=False)
Define you own metrics
You can also create your own metrics by defining your python function calculating the output value. To do so, define a function taking as arguments:
annotations : pandas DataFrame, this is the actual segments of the converted set
duration : int, the represents the length that was annotated, use this value to calculate rates per hour for example
**kwargs : keyword arguments, this allows the user to give whatever arguments he likes to the metric function (such as ‘speaker’ for example). The value used will have to be given in the wanted metrics_list dataframe
and returning, in that order:
a default name for the metric to take, it will be used when no specific name was explicitly required by the user
the value of the metric, should be a number or np.nan (a distinction is made between 0 and np.nan as np.nan indicates that the value can not be calculated).
The function should check the presence of the required columns in the annotations and of the required keyword arguments.
To make this easier, use the function ChildProject.pipelines.metricsFunctions.metricFunction()
as a decorator
to perform those checks as well as giving a default name based on the function’s name.
The decorator should be called along with the parameters :
args : a set of the names of the required keyword arguments
columns : a set of the names of the required columns in the annotations
emptyValue : the value to return when no annotations segments are found
name : the default name to use the designate this metric. If empty, uses the function name. Be aware that keyword
arguments found in the name will be replaced by their value (e.g. voc_speaker_ph with speaker='CHI'
will return voc_chi_ph).
The only remaining task of the function is the calculation and return of the value.
Here we define a function that only requires the keyword argument ‘speaker’ and is calculated only based on the ‘speaker_type’ column. When no annotation is found, its value will be 0 and by default it will take the name ‘num_of_voc_speaker’ with <speaker> being replaced with the value of the ‘speaker’ keyword argument. The returned value is the number of lines belonging to the speaker_type (i.e. its number of vocalizations as an absolute value).
>>> from ChildProject.projects import ChildProject
>>> from ChildProject.pipelines.metricsFunctions import metricFunction
>>> import pandas as pd
>>> @metricFunction({'speaker'},{'speaker_type'}, 0, 'num_of_voc_speaker')
... def voc_speaker(annotations: pd.DataFrame, duration: int, **kwargs):
... return annotations[annotations["speaker_type"]== kwargs["speaker"]].shape[0]
...
We defined our custom metric, now we will create our list of wanted metrics. It must be a pandas DataFrame compatible with the metrics listing format. The callable function is used for both names of the default available metrics and callables functions that we defined ourselves. Here we only use the vtc set, we want to extract the number of vocalizations produced by the key child and the mother in absolute values (using our newly defined function) but also in values per hour (using the default metric <voc_speaker_ph>).
>>> input = pd.DataFrame([{
... 'set': 'vtc',
... 'callable': 'voc_speaker_ph',
... 'speaker': 'CHI',
... },{
... 'set': 'vtc',
... 'callable': 'voc_speaker_ph',
... 'speaker': 'FEM',
... },{
... 'set': 'vtc',
... 'callable': voc_speaker,
... 'speaker': 'CHI',
... }{
... 'set': 'vtc',
... 'callable': voc_speaker,
... 'speaker': 'FEM',
... }])
Last thing left to do is initialize our ChildProject.pipelines.metrics.Metrics
with the correct
parameters and launch the extraction
>>> from ChildProject.pipelines.metrics import Metrics
>>> project = ChildProject('vandam-data')
>>> m = Metrics(
... project,
... metrics_list= input,
... from_time='8:00:00',
... to_time='17:00:00',
... rec_cols='date_iso',
... child_cols='child_dob,child_id',
... period='6h',
... by='child_id',
... )
>>> m.extract()
child_id period_start period_end ... voc_fem_ph num_of_voc_chi num_of_voc_fem
0 1 00:00:00 06:00:00 ... NaN NaN NaN
1 1 06:00:00 12:00:00 ... 244.5 1143.0 978.0
2 1 12:00:00 18:00:00 ... 253.4 1495.0 1267.0
3 1 18:00:00 00:00:00 ... NaN NaN NaN
[4 rows x 10 columns]
>>> m.metrics.to_csv('Metrics.csv', index=False)