Reliability metrics
ChildProject implements several metrics for evaluating annotations and their reliability. This section demonstrates how to use the Python API for these purposes.
Note
In order to reproduce the following examples, you will need to install the public VanDam corpus and its annotations using datalad:
datalad install git@gin.g-node.org:/LAAC-LSCP/vandam-data.git
datalad get vandam-data/annotations
Comparing two annotators
The performance of automated annotations is usually assessed by comparing them to a ground truth provided by experts. The ChildProject package provides several tools for such comparisons.
Confusion matrix
Confusion matrices are widely used to assess the performance of classification algorithms; they give an accurate visual description of the behavior of a classifier, preserving most of the relevant information while remaining easy to interpret.
We show how to compute confusion matrices with the ChildProject package, using data from the VanDam public corpus. In this example, we will compare annotations from the LENA and the Voice Type Classifier (VTC).
The first step is to get all annotations common to the LENA and the VTC. This can be done with the intersection() static method of AnnotationManager:
>>> from ChildProject.projects import ChildProject
>>> from ChildProject.annotations import AnnotationManager
>>> from ChildProject.metrics import segments_to_grid, conf_matrix
>>> speakers = ['CHI', 'OCH', 'FEM', 'MAL']
>>> project = ChildProject('vandam-data')
>>> am = AnnotationManager(project)
>>> am.read()
([], ["vandam-data/metadata/annotations.csv: 'chat' is not a permitted value for column 'format' on line 4, should be any of [TextGrid,eaf,vtc_rttm,vcm_rttm,alice,its]", "vandam-data/metadata/annotations.csv: 'custom_rttm' is not a permitted value for column 'format' on line 6, should be any of [TextGrid,eaf,vtc_rttm,vcm_rttm,alice,its]"])
>>> intersection = AnnotationManager.intersection(am.annotations, ['vtc', 'its'])
>>> intersection
set recording_filename time_seek range_onset range_offset raw_filename format filter annotation_filename imported_at error package_version
2 its BN32_010007.mp3 0 0 50464512 BN32_010007.its its NaN BN32_010007_0_0.csv 2021-03-06 22:55:06 NaN 0.0.1
3 vtc BN32_010007.mp3 0 0 50464512 BN32_010007.rttm vtc_rttm NaN BN32_010007_0_0.csv 2021-05-12 19:28:25 NaN 0.0.1
The next step is to retrieve the contents of the annotations that correspond to the intersection of the two sets. This is done with get_collapsed_segments(). This method from AnnotationManager does the following:
Read the contents of all annotations provided into one pandas dataframe.
Align them annotator by annotator, allowing cross-comparison or combination.
In case these annotations come from non-consecutive portions of audio, or from distinct audio files, they are aligned end-to-end into one virtual timeline.
In the case of the VanDam corpus, there is only one audio file, and it has been entirely annotated by all annotators. But the following will work even for sparse annotations covering several recordings.
>>> segments = am.get_collapsed_segments(intersection)
>>> segments = segments[segments['speaker_type'].isin(speakers)]
>>> segments
segment_onset segment_offset speaker_id ling_type speaker_type vcm_type lex_type ... imported_at error package_version abs_range_onset abs_range_offset duration position
1 9730.0 10540.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0
15 35820.0 36930.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0
21 67020.0 67620.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0
25 71640.0 72240.0 NaN NaN FEM NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0
29 87370.0 88170.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22342 50122992.0 50123518.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0
22344 50152103.0 50153510.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0
22348 50233080.0 50234492.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0
22350 50325867.0 50325989.0 NaN NaN CHI NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0
22352 50356380.0 50357011.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0
[20887 rows x 44 columns]
For an efficient computation of the confusion matrix, the timeline is then split into chunks of a given length (in our case, we will set the time step to 100 milliseconds). This is done with ChildProject.metrics.segments_to_grid(), which transforms a dataframe of segments into a matrix of the indicator functions of each classification category at each time unit.
>>> vtc = segments_to_grid(segments[segments['set'] == 'vtc'], 0, segments['segment_offset'].max(), 100, 'speaker_type', speakers)
>>> its = segments_to_grid(segments[segments['set'] == 'its'], 0, segments['segment_offset'].max(), 100, 'speaker_type', speakers)
>>> vtc.shape
(503571, 5)
>>> vtc
array([[0, 0, 0, 0, 1],
[0, 0, 0, 0, 1],
[0, 0, 0, 0, 1],
...,
[0, 0, 1, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 0, 1]])
Note that this matrix has 5 columns, even though there are only 4 categories (CHI, OCH, FEM and MAL). This is because segments_to_grid() appends a 'none' column to the matrix, which is set to 1 when all classes are inactive. It can be turned off by setting none=False. It is also possible to append an 'overlap' column by setting overlap=True; this column is set to 1 when at least 2 classes are active.
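As an illustration (a sketch that is not part of the original walkthrough, reusing the segments dataframe and speakers list defined above), the grid could be recomputed with these options:
>>> vtc_overlap = segments_to_grid(segments[segments['set'] == 'vtc'], 0, segments['segment_offset'].max(), 100, 'speaker_type', speakers, none=False, overlap=True)
>>> vtc_overlap.shape[1]  # len(speakers) + 1 columns: the four speaker classes plus 'overlap'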
We can now compute the confusion matrix:
>>> confusion_counts = conf_matrix(vtc, its)
>>> confusion_counts
array([[ 20503, 7285, 4296, 1191, 21062],
[ 1435, 3354, 704, 136, 4105],
[ 2700, 1414, 18442, 4649, 19080],
[ 323, 229, 4600, 17654, 12415],
[ 3053, 2158, 3674, 2464, 365000]])
This means that 20503 of the 100 ms chunks were labelled as containing CHI speech by both the VTC and the LENA, while 7285 chunks were labelled as containing CHI speech by the VTC but as OCH speech by the LENA.
It is sometimes more useful to normalize confusion matrices:
>>> import numpy as np
>>> normalized = confusion_counts/(np.sum(vtc, axis = 0)[:,None])
>>> normalized
array([[0.37733036, 0.13407071, 0.07906215, 0.02191877, 0.38761801],
[0.14742141, 0.34456544, 0.07232381, 0.01397165, 0.42171769],
[0.05833423, 0.03054985, 0.39844442, 0.10044291, 0.41222858],
[0.00917067, 0.0065018 , 0.1306039 , 0.50123506, 0.35248857],
[0.00811215, 0.00573404, 0.00976222, 0.00654711, 0.96984448]])
The top-left cell now reads as: 37.7% of the 100 ms chunks labelled as CHI by the VTC are also labelled as CHI by the LENA.
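For readability, the normalized matrix can be wrapped into a labelled pandas DataFrame (a small convenience sketch, not part of the original walkthrough; the label order matches the columns produced by segments_to_grid() with none=True):
>>> import pandas as pd
>>> labels = speakers + ['none']
>>> pd.DataFrame(normalized, index=labels, columns=labels)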
Using pyannote.metrics
Confusion matrices are still high-dimensional data (with n × n components for n labels), which makes it difficult to compare the performance of several annotators: it is hard to tell which of two classifiers is closer to the ground truth from their confusion matrices alone.
As a result, in Machine Learning, many scalar measures are used in order to assess the overall performance of a classifier. These include recall, precision, accuracy, etc.
The pyannote-metrics package implements many of the metrics that are typically used in speech processing. ChildProject interfaces well with pyannote-metrics. Below, we show how to use both packages to compute recall and precision.
The first step is to convert the dataframe of segments into one pyannote.core.Annotation object per annotator:
>>> from ChildProject.metrics import segments_to_annotation
>>> ref = segments_to_annotation(segments[segments['set'] == 'vtc'], 'speaker_type')
>>> hyp = segments_to_annotation(segments[segments['set'] == 'its'], 'speaker_type')
Now, any pyannote metric can be instantiated and used with these annotations:
>>> from pyannote.metrics.detection import DetectionPrecisionRecallFMeasure
>>> metric = DetectionPrecisionRecallFMeasure()
>>> detail = metric.compute_components(ref, hyp)
>>> precision, recall, f = metric.compute_metrics(detail)
>>> print(f'{precision:.2f}/{recall:.2f}/{f:.2f}')
0.87/0.60/0.71
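Other pyannote metrics can be used in the same way. As an illustration (not part of the original walkthrough), the identification error rate can be computed from the same annotation objects:
>>> from pyannote.metrics.identification import IdentificationErrorRate
>>> ier = IdentificationErrorRate()
>>> ier(ref, hyp)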
Reliability evaluations
Module reference
- ChildProject.metrics.conf_matrix(rows_grid, columns_grid)[source]
Compute the confusion matrix (as counts) from grids of active classes. See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters:
rows_grid (numpy.array) – the grid corresponding to the rows of the confusion matrix.
columns_grid (numpy.array) – the grid corresponding to the columns of the confusion matrix.
categories (list of strings) – the labels corresponding to each class
- Returns:
a square numpy array of counts
- Return type:
numpy.array
- ChildProject.metrics.gamma(segments: DataFrame, column: str, alpha: float = 1, beta: float = 1, precision_level: float = 0.05) float [source]
Compute Mathet et al.'s gamma agreement on segments.
The gamma measure evaluates the reliability of both the segmentation and the categorization simultaneously; an extensive description of the method and its parameters can be found in Mathet et al., 2015 (doi:10.1162/COLI_a_00227).
This function uses the pygamma-agreement package by Titeux et al.
- Parameters:
segments (pd.DataFrame) – input segments dataframe (see Annotations format for the dataframe format)
column (str) – name of the categorical column of the segments to consider, e.g. ‘speaker_type’
alpha (float, optional) – gamma agreement time alignment weight, defaults to 1
beta (float, optional) – gamma agreement categorical weight, defaults to 1
precision_level (float, optional) – level of precision (see pygamma-agreement’s documentation), defaults to 0.05
- Returns:
gamma agreement
- Return type:
float
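A usage sketch (not taken from the original walkthrough): assuming the segments dataframe built above, which contains both the 'vtc' and 'its' sets, gamma can be computed on a slice of the timeline (the measure is expensive on day-long recordings, so this example restricts it to the first ten minutes):
>>> from ChildProject.metrics import gamma
>>> sample = segments[segments['segment_offset'] < 10 * 60 * 1000]  # first 10 minutes only
>>> gamma(sample, 'speaker_type')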
- ChildProject.metrics.grid_to_vector(grid, categories)[source]
Transform a grid of active classes into a vector of labels. In case several classes are active at time i, the label is set to 'overlap'. See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters:
grid (numpy.array) – a NumPy array of shape (n, len(categories))
categories (list) – the list of categories
- Returns:
the vector of labels of length n (e.g. np.array([none FEM FEM FEM overlap overlap CHI]))
- Return type:
numpy.array
- ChildProject.metrics.segments_to_annotation(segments: DataFrame, column: str)[source]
Transform a dataframe of annotation segments into a pyannote.core.Annotation object
- Parameters:
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: segment_onset, segment_offset and column.
column (str) – the name of the column in segments that should be used for the values of the annotations (e.g. speaker_type).
- Returns:
the pyannote.core.Annotation object.
- Return type:
pyannote.core.Annotation
- ChildProject.metrics.segments_to_grid(segments: DataFrame, range_onset: int, range_offset: int, timescale: int, column: str, categories: list, none=True, overlap=False) float [source]
Transform a dataframe of annotation segments into a 2d matrix representing the indicator function of each of the categories across time.
Each row of the matrix corresponds to a unit of time of length timescale (in milliseconds), ranging from range_onset to range_offset; each column corresponds to one of the categories provided, plus two special columns (overlap and none).
The value of the cell ij of the output matrix is set to 1 if the class j is active at time i, 0 otherwise.
If overlap is True, an additional column is appended to the grid, which is set to 1 if at least two classes are active at time i.
If none is set to True, an additional column is appended to the grid, which is set to 1 if none of the classes are active at time i.
The shape of the output matrix is therefore ((range_offset-range_onset)/timescale, len(categories) + n), where n = 2 if both overlap and none are True, 1 if one of them is True, and 0 otherwise.
The fraction of time a class j is active can therefore be calculated as np.mean(grid, axis=0)[j].
- Parameters:
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: segment_onset, segment_offset and column.
range_onset (int) – timestamp of the beginning of the range to consider (in milliseconds)
range_offset (int) – timestamp of the end of the range to consider (in milliseconds)
timescale (int) – length of each time unit (in milliseconds)
column (str) – the name of the column in segments that should be used for the values of the annotations (e.g. speaker_type).
categories (list) – the list of categories
none (bool) – append a ‘none’ column, default True
overlap (bool) – append an overlap column, default False
- Returns:
the output grid
- Return type:
numpy.array
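For instance (an illustrative sketch reusing the vtc grid computed earlier in this section), the fraction of 100 ms chunks in which each class is active can be obtained as follows:
>>> import numpy as np
>>> np.mean(vtc, axis=0)  # one value per column: CHI, OCH, FEM, MAL and 'none'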
- ChildProject.metrics.vectors_to_annotation_task(*args, drop: List[str] = [])[source]
Transform vectors of labels into an nltk AnnotationTask object.
- Parameters:
args (1d np.array() of labels) – vector of labels for each annotator; add one argument per annotator.
drop (List[str]) – list of labels that should be ignored
- Returns:
the AnnotationTask object
- Return type:
nltk.metrics.agreement.AnnotationTask
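As an illustration (a sketch that is not part of the original walkthrough, reusing the vtc and its grids and the speakers list from the examples above), grid_to_vector() and vectors_to_annotation_task() can be chained to compute chance-corrected agreement with nltk:
>>> from ChildProject.metrics import grid_to_vector, vectors_to_annotation_task
>>> vtc_vector = grid_to_vector(vtc, speakers + ['none'])
>>> its_vector = grid_to_vector(its, speakers + ['none'])
>>> task = vectors_to_annotation_task(vtc_vector, its_vector)
>>> task.kappa()  # Cohen's kappa; task.alpha() gives Krippendorff's alpha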