Reliability metrics
ChildProject implements several metrics for evaluating annotations and their reliability. This section demonstrates how to use the Python API for these purposes.
Note
In order to reproduce the following examples, you will need to install the public VanDam corpus and its annotations using datalad:
datalad install git@gin.g-node.org:/LAAC-LSCP/vandam-data.git
datalad get vandam-data/annotations
Comparing two annotators
The performance of automated annotations is usually assessed by comparing them to a ground truth provided by experts. The ChildProject package provides several tools for such comparisons.
Confusion matrix
Confusion matrices are widely used to assess the performance of classification algorithms; they give an accurate visual description of the behavior of a classifier, preserving most of the relevant information while remaining easy to interpret.
We show how to compute confusion matrices with the ChildProject package, using data from the VanDam public corpus. In this example, we will compare annotations from the LENA and the Voice Type Classifier (VTC).
The first step is to get all annotations common to the LENA and the VTC. This can be done with the intersection() static method of AnnotationManager:
>>> from ChildProject.projects import ChildProject
>>> from ChildProject.annotations import AnnotationManager
>>> from ChildProject.metrics import segments_to_grid, conf_matrix
>>> speakers = ['CHI', 'OCH', 'FEM', 'MAL']
>>> project = ChildProject('vandam-data')
>>> am = AnnotationManager(project)
>>> am.read()
([], ["vandam-data/metadata/annotations.csv: 'chat' is not a permitted value for column 'format' on line 4, should be any of [TextGrid,eaf,vtc_rttm,vcm_rttm,alice,its]", "vandam-data/metadata/annotations.csv: 'custom_rttm' is not a permitted value for column 'format' on line 6, should be any of [TextGrid,eaf,vtc_rttm,vcm_rttm,alice,its]"])
>>> intersection = AnnotationManager.intersection(am.annotations, ['vtc', 'its'])
>>> intersection
set recording_filename time_seek range_onset range_offset raw_filename format filter annotation_filename imported_at error package_version
2 its BN32_010007.mp3 0 0 50464512 BN32_010007.its its NaN BN32_010007_0_0.csv 2021-03-06 22:55:06 NaN 0.0.1
3 vtc BN32_010007.mp3 0 0 50464512 BN32_010007.rttm vtc_rttm NaN BN32_010007_0_0.csv 2021-05-12 19:28:25 NaN 0.0.1
The next step is to retrieve the contents of the annotations that correspond to the intersection of the two sets. This is done with get_collapsed_segments(). This method from AnnotationManager does the following:
Read the contents of all annotations provided into one pandas dataframe.
Align them annotator by annotator, allowing cross-comparison or combination.
In case these annotations come from non-consecutive portions of audio, or from distinct audio files, they are aligned end-to-end into one virtual timeline.
In the case of the VanDam corpus, there is only one audio file, and it has been entirely annotated by all annotators. But the following will work even for sparse annotations covering several recordings.
>>> segments = am.get_collapsed_segments(intersection)
>>> segments = segments[segments['speaker_type'].isin(speakers)]
>>> segments
segment_onset segment_offset speaker_id ling_type speaker_type vcm_type lex_type ... imported_at error package_version abs_range_onset abs_range_offset duration position
1 9730.0 10540.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0
15 35820.0 36930.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0
21 67020.0 67620.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0
25 71640.0 72240.0 NaN NaN FEM NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0
29 87370.0 88170.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22342 50122992.0 50123518.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0
22344 50152103.0 50153510.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0
22348 50233080.0 50234492.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0
22350 50325867.0 50325989.0 NaN NaN CHI NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0
22352 50356380.0 50357011.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0
[20887 rows x 44 columns]
For an efficient computation of the confusion matrix, the timeline is then split into chunks of a given length (in our case, we will set the time step to 100 milliseconds). This is done with ChildProject.metrics.segments_to_grid(), which transforms a dataframe of segments into a matrix of the indicator functions of each classification category at each time unit.
>>> vtc = segments_to_grid(segments[segments['set'] == 'vtc'], 0, segments['segment_offset'].max(), 100, 'speaker_type', speakers)
>>> its = segments_to_grid(segments[segments['set'] == 'its'], 0, segments['segment_offset'].max(), 100, 'speaker_type', speakers)
>>> vtc.shape
(503571, 5)
>>> vtc
array([[0, 0, 0, 0, 1],
[0, 0, 0, 0, 1],
[0, 0, 0, 0, 1],
...,
[0, 0, 1, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 0, 1]])
Note that this matrix has 5 columns, even though there are only 4 categories (CHI, OCH, FEM and MAL). This is because segments_to_grid() appends a 'none' column to the matrix, which is set to 1 when all classes are inactive. It can be turned off by setting none=False. It is also possible to append an 'overlap' column by setting overlap=True; this column is set to 1 when at least 2 classes are active.
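As an illustration (a sketch that is not part of the original walkthrough, reusing the segments dataframe and speakers list defined above), the grid could be recomputed with these options:
>>> vtc_overlap = segments_to_grid(segments[segments['set'] == 'vtc'], 0, segments['segment_offset'].max(), 100, 'speaker_type', speakers, none=False, overlap=True)
>>> vtc_overlap.shape[1]  # len(speakers) + 1 columns: the four speaker classes plus 'overlap'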
We can now compute the confusion matrix:
>>> confusion_counts = conf_matrix(vtc, its)
>>> confusion_counts
array([[ 20503, 7285, 4296, 1191, 21062],
[ 1435, 3354, 704, 136, 4105],
[ 2700, 1414, 18442, 4649, 19080],
[ 323, 229, 4600, 17654, 12415],
[ 3053, 2158, 3674, 2464, 365000]])
This means that 20503 of the 100 ms chunks were labelled as containing CHI speech by both the VTC and the LENA, while 7285 chunks were labelled as containing CHI speech by the VTC but as OCH speech by the LENA.
It is sometimes more useful to normalize confusion matrices:
>>> import numpy as np
>>> normalized = confusion_counts/(np.sum(vtc, axis = 0)[:,None])
>>> normalized
array([[0.37733036, 0.13407071, 0.07906215, 0.02191877, 0.38761801],
[0.14742141, 0.34456544, 0.07232381, 0.01397165, 0.42171769],
[0.05833423, 0.03054985, 0.39844442, 0.10044291, 0.41222858],
[0.00917067, 0.0065018 , 0.1306039 , 0.50123506, 0.35248857],
[0.00811215, 0.00573404, 0.00976222, 0.00654711, 0.96984448]])
The top-left cell now reads as: 37.7% of the 100 ms chunks labelled as CHI by the VTC are also labelled as CHI by the LENA.
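For readability, the normalized matrix can be wrapped into a labelled pandas DataFrame (a small convenience sketch, not part of the original walkthrough; the label order matches the columns produced by segments_to_grid() with none=True):
>>> import pandas as pd
>>> labels = speakers + ['none']
>>> pd.DataFrame(normalized, index=labels, columns=labels)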
Using pyannote.metrics
Confusion matrices are still high-dimensional data (with n × n components for n labels), which makes it difficult to compare the performance of several annotators: it is hard to tell which of two classifiers is closer to the ground truth from their confusion matrices alone.
As a result, in Machine Learning, many scalar measures are used in order to assess the overall performance of a classifier. These include recall, precision, accuracy, etc.
The pyannote-metrics package implements many of the metrics that are typically used in speech processing. ChildProject interfaces well with pyannote-metrics. Below, we show how to use both packages to compute recall and precision.
The first step is to convert the dataframe of segments into one pyannote.core.Annotation object per annotator:
>>> from ChildProject.metrics import segments_to_annotation
>>> ref = segments_to_annotation(segments[segments['set'] == 'vtc'], 'speaker_type')
>>> hyp = segments_to_annotation(segments[segments['set'] == 'its'], 'speaker_type')
Now, any pyannote metric can be instantiated and used with these annotations:
>>> from pyannote.metrics.detection import DetectionPrecisionRecallFMeasure
>>> metric = DetectionPrecisionRecallFMeasure()
>>> detail = metric.compute_components(ref, hyp)
>>> precision, recall, f = metric.compute_metrics(detail)
>>> print(f'{precision:.2f}/{recall:.2f}/{f:.2f}')
0.87/0.60/0.71
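Other pyannote metrics can be used in the same way. As an illustration (not part of the original walkthrough), the identification error rate can be computed from the same annotation objects:
>>> from pyannote.metrics.identification import IdentificationErrorRate
>>> ier = IdentificationErrorRate()
>>> ier(ref, hyp)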
Reliability evaluations
Module reference
- ChildProject.metrics.conf_matrix(rows_grid, columns_grid)[source]
Compute the confusion matrix (as counts) from grids of active classes. See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters:
rows_grid (numpy.array) – the grid corresponding to the rows of the confusion matrix.
columns_grid (numpy.array) – the grid corresponding to the columns of the confusion matrix.
categories (list of strings) – the labels corresponding to each class
- Returns:
a square numpy array of counts
- Return type:
numpy.array
- ChildProject.metrics.gamma(segments: DataFrame, column: str, alpha: float = 1, beta: float = 1, precision_level: float = 0.05) float [source]
Compute Mathet et al.'s gamma agreement on segments.
The gamma measure evaluates the reliability of both the segmentation and the categorization simultaneously; an extensive description of the method and its parameters can be found in Mathet et al., 2015 (doi:10.1162/COLI_a_00227).
This function uses the pygamma-agreement package by Titeux et al.
- Parameters:
segments (pd.DataFrame) – input segments dataframe (see Annotations format for the dataframe format)
column (str) – name of the categorical column of the segments to consider, e.g. ‘speaker_type’
alpha (float, optional) – gamma agreement time alignment weight, defaults to 1
beta (float, optional) – gamma agreement categorical weight, defaults to 1
precision_level (float, optional) – level of precision (see pygamma-agreement’s documentation), defaults to 0.05
- Returns:
gamma agreement
- Return type:
float
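A usage sketch (not taken from the original walkthrough): assuming the segments dataframe built above, which contains both the 'vtc' and 'its' sets, gamma can be computed on a slice of the timeline (the measure is expensive on day-long recordings, so this example restricts it to the first ten minutes):
>>> from ChildProject.metrics import gamma
>>> sample = segments[segments['segment_offset'] < 10 * 60 * 1000]  # first 10 minutes only
>>> gamma(sample, 'speaker_type')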
- ChildProject.metrics.grid_to_vector(grid, categories)[source]
Transform a grid of active classes into a vector of labels. In case several classes are active at time i, the label is set to 'overlap'. See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters:
grid (numpy.array) – a NumPy array of shape (n, len(categories))
categories (list) – the list of categories
- Returns:
the vector of labels of length n (e.g. np.array([none FEM FEM FEM overlap overlap CHI]))
- Return type:
numpy.array
- ChildProject.metrics.segments_to_annotation(segments: DataFrame, column: str)[source]
Transform a dataframe of annotation segments into a pyannote.core.Annotation object
- Parameters:
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: segment_onset, segment_offset and column.
column (str) – the name of the column in segments that should be used for the values of the annotations (e.g. speaker_type).
- Returns:
the pyannote.core.Annotation object.
- Return type:
pyannote.core.Annotation
- ChildProject.metrics.segments_to_grid(segments: DataFrame, range_onset: int, range_offset: int, timescale: int, column: str, categories: list, none=True, overlap=False) float [source]
Transform a dataframe of annotation segments into a 2d matrix representing the indicator function of each of the categories across time.
Each row of the matrix corresponds to a unit of time of length timescale (in milliseconds), ranging from range_onset to range_offset; each column corresponds to one of the categories provided, plus two special columns (overlap and none).
The value of the cell ij of the output matrix is set to 1 if the class j is active at time i, 0 otherwise.
If overlap is True, an additional column is appended to the grid, which is set to 1 if at least two classes are active at time i.
If none is set to True, an additional column is appended to the grid, which is set to 1 if none of the classes are active at time i.
The shape of the output matrix is therefore ((range_offset-range_onset)/timescale, len(categories) + n), where n = 2 if both overlap and none are True, 1 if one of them is True, and 0 otherwise.
The fraction of time a class j is active can therefore be calculated as np.mean(grid, axis=0)[j].
- Parameters:
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: segment_onset, segment_offset and column.
range_onset (int) – timestamp of the beginning of the range to consider (in milliseconds)
range_offset (int) – timestamp of the end of the range to consider (in milliseconds)
timescale (int) – length of each time unit (in milliseconds)
column (str) – the name of the column in segments that should be used for the values of the annotations (e.g. speaker_type).
categories (list) – the list of categories
none (bool) – append a ‘none’ column, default True
overlap (bool) – append an overlap column, default False
- Returns:
the output grid
- Return type:
numpy.array
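For instance (an illustrative sketch reusing the vtc grid computed earlier in this section), the fraction of 100 ms chunks in which each class is active can be obtained as follows:
>>> import numpy as np
>>> np.mean(vtc, axis=0)  # one value per column: CHI, OCH, FEM, MAL and 'none'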
- ChildProject.metrics.vectors_to_annotation_task(*args, drop: List[str] = [])[source]
Transform vectors of labels into an nltk AnnotationTask object.
- Parameters:
args (1d np.array() of labels) – vector of labels for each annotator; add one argument per annotator.
drop (List[str]) – list of labels that should be ignored
- Returns:
the AnnotationTask object
- Return type:
nltk.metrics.agreement.AnnotationTask
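As an illustration (a sketch that is not part of the original walkthrough, reusing the vtc and its grids and the speakers list from the examples above), grid_to_vector() and vectors_to_annotation_task() can be chained to compute chance-corrected agreement with nltk:
>>> from ChildProject.metrics import grid_to_vector, vectors_to_annotation_task
>>> vtc_vector = grid_to_vector(vtc, speakers + ['none'])
>>> its_vector = grid_to_vector(its, speakers + ['none'])
>>> task = vectors_to_annotation_task(vtc_vector, its_vector)
>>> task.kappa()  # Cohen's kappa; task.alpha() gives Krippendorff's alpha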