Reliability metrics =================== ChildProject implements several metrics for evaluating annotations and their reliability. This section demonstrates how to use the python API for these purposes. .. note:: In order to reproduce the following examples, you will need to install the public VanDam corpus and its annotations using datalad: .. code-block:: bash datalad install git@gin.g-node.org:/LAAC-LSCP/vandam-data.git datalad get vandam-data/annotations Comparing two annotators ~~~~~~~~~~~~~~~~~~~~~~~~ The performance of automated annotations is usually assessed by comparing them to a ground truth provided by experts. The ChildProject package provides several tools for such comparisons. Confusion matrix ---------------- Confusion matrices are widely used to assess the performance of classification algorithms; they give an accurate visual description of the behavior of a classifier, preserving most relevant information while still being easy to interpret. We show how to compute confusion matrices with the ChildProject package, using data from the VanDam public corpus. In this example, we will compare annotations from the LENA and the Voice Type Classifier. To access the console in the dataset directory: .. code-block:: bash cd vandam-data python The first step is to get all annotations common to the LENA and the VTC. This can be done with the :meth:`~ChildProject.annotations.AnnotationManager.intersection` static method of :class:`~ChildProject.annotations.AnnotationManager`: .. code-block:: python >>> from ChildProject.projects import ChildProject >>> from ChildProject.annotations import AnnotationManager >>> from ChildProject.metrics import segments_to_grid, conf_matrix >>> speakers = ['CHI', 'OCH', 'FEM', 'MAL'] >>> project = ChildProject('.') >>> am = AnnotationManager(project) >>> am.read() ([], ["vandam-data/metadata/annotations.csv: 'chat' is not a permitted value for column 'format' on line 4, should be any of [TextGrid,eaf,vtc_rttm,vcm_rttm,alice,its]", "vandam-data/metadata/annotations.csv: 'custom_rttm' is not a permitted value for column 'format' on line 6, should be any of [TextGrid,eaf,vtc_rttm,vcm_rttm,alice,its]"]) >>> intersection = AnnotationManager.intersection(am.annotations, ['vtc', 'its']) >>> intersection set recording_filename time_seek range_onset range_offset raw_filename format filter annotation_filename imported_at error package_version 2 its BN32_010007.mp3 0 0 50464512 BN32_010007.its its NaN BN32_010007_0_0.csv 2021-03-06 22:55:06 NaN 0.0.1 3 vtc BN32_010007.mp3 0 0 50464512 BN32_010007.rttm vtc_rttm NaN BN32_010007_0_0.csv 2021-05-12 19:28:25 NaN 0.0.1 The next step is to retrieve the contents of the annotations that correspond to the intersection of the two sets. This is done with :meth:`~ChildProject.annotations.AnnotationManager.get_collapsed_segments`. This method from :class:`~ChildProject.annotations.AnnotationManager` does the following: 1. Read the contents of all annotations provided into one pandas dataframe. 2. Align them annotator by annotator, allowing cross-comparisons or combination 3. In case these annotations come from non-consecutive portions of audio, or from distinct audio files, they are aligned end-to-end into one virtual timeline. In the case of the VanDam corpus, there is only one audio file, and it has been entirely annotated by all annotators. But the following will work even for sparse annotations covering several recordings. .. code-block:: python >>> segments = am.get_collapsed_segments(intersection) >>> segments = segments[segments['speaker_type'].isin(speakers)] >>> segments segment_onset segment_offset speaker_id ling_type speaker_type vcm_type lex_type ... imported_at error package_version abs_range_onset abs_range_offset duration position 1 9730.0 10540.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0 15 35820.0 36930.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0 21 67020.0 67620.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0 25 71640.0 72240.0 NaN NaN FEM NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0 29 87370.0 88170.0 NaN NaN OCH NaN NaN ... 2021-03-06 22:55:06 NaN 0.0.1 0 50464512 50464512.0 0.0 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 22342 50122992.0 50123518.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0 22344 50152103.0 50153510.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0 22348 50233080.0 50234492.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0 22350 50325867.0 50325989.0 NaN NaN CHI NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0 22352 50356380.0 50357011.0 NaN NaN FEM NaN NaN ... 2021-05-12 19:28:25 NaN 0.0.1 0 50464512 50464512.0 0.0 [20887 rows x 44 columns] For an efficient computation of the confusion matrix, the timeline is then split into chunks of a given length (in our case, we will set the time steps to 100 milliseconds). This is done with :func:`ChildProject.metrics.segments_to_grid`, which transforms a dataframe of segments into a matrix of the indicator functions of each classification category at each time unit. .. code-block:: python >>> vtc = segments_to_grid(segments[segments['set'] == 'vtc'], 0, segments['segment_offset'].max(), 100, 'speaker_type', speakers) /Users/acristia/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:1676: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_single_column(ilocs[0], value, pi) /Users/acristia/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:1597: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self.obj[key] = value >>> its = segments_to_grid(segments[segments['set'] == 'its'], 0, segments['segment_offset'].max(), 100, 'speaker_type', speakers) >>> vtc.shape (503571, 5) >>> vtc array([[0, 0, 0, 0, 1], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1], ..., [0, 0, 1, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1]]) Note that this matrix has 5 columns, even though there are only 4 categories (CHI, OCH, FEM and MAL). This is because :func:`~ChildProject.metrics.segments_to_grid` appends the matrix with a 'none' column, which is set to 1 when all classes are inactive. It can be turned off by setting `none = False`. It is also possible to append an 'overlap' column by setting `overlap=True`; this column is set to 1 when at least 2 classes are active. We can now compute the confusion matrix: .. code-block:: python >>> confusion_counts = conf_matrix(vtc, its) >>> confusion_counts array([[ 20503, 7285, 4296, 1191, 21062], [ 1435, 3354, 704, 136, 4105], [ 2700, 1414, 18442, 4649, 19080], [ 323, 229, 4600, 17654, 12415], [ 3053, 2158, 3674, 2464, 365000]]) This means that 20503 of the 100 ms chunks were labelled as containing CHI speech by both the VTC and the LENA; 7285 chunks have been labelled as containing CHI speech by the VTC while being labelled as OCH by the LENA. It is sometimes more useful to normalize confusion matrices: >>> import numpy as np >>> normalized = confusion_counts/(np.sum(vtc, axis = 0)[:,None]) >>> rel array([[0.37733036, 0.13407071, 0.07906215, 0.02191877, 0.38761801], [0.14742141, 0.34456544, 0.07232381, 0.01397165, 0.42171769], [0.05833423, 0.03054985, 0.39844442, 0.10044291, 0.41222858], [0.00917067, 0.0065018 , 0.1306039 , 0.50123506, 0.35248857], [0.00811215, 0.00573404, 0.00976222, 0.00654711, 0.96984448]]) The top-left cell now reads as: 37,8% of the 100 ms chunks labelled as CHI by the VTC are also labelled as CHI by the LENA. Using pyannote.metrics ---------------------- Confusion matrices are still dimensional data (with :math:`n \times n` components for :math:`n` labels), which renders performance comparisons of several annotators difficult: it is hard to tell which one of two classifiers is the closest to the ground truth using confusion matrices. As a result, in Machine Learning, many scalar measures are used in order to assess the overall performance of a classifier. These include recall, precision, accuracy, etc. The `pyannote-metrics package `_ implements many of the metrics that are typically used in speech processing. ChildProject interfaces well with pyannote-metrics. Below, we show how to use both packages to compute recall and precision. The first step is to convert the dataframe of segments into one :meth:`pyannote.core.Annotation` object per annotator: .. code-block:: python >>> from ChildProject.metrics import segments_to_annotation >>> ref = segments_to_annotation(segments[segments['set'] == 'vtc'], 'speaker_type') >>> hyp = segments_to_annotation(segments[segments['set'] == 'its'], 'speaker_type') Now, any pyannote metric can be instantianted and used with these annotations: .. code-block:: python >>> from pyannote.metrics.detection import DetectionPrecisionRecallFMeasure >>> metric = DetectionPrecisionRecallFMeasure() >>> detail = metric.compute_components(ref, hyp) >>> precision, recall, f = metric.compute_metrics(detail) >>> print(f'{precision:.2f}/{recall:.2f}/{f:.2f}') 0.87/0.60/0.71 Reliability evaluations ~~~~~~~~~~~~~~~~~~~~~~~ Module reference ~~~~~~~~~~~~~~~~ .. automodule:: ChildProject.metrics :members: :noindex: