Annotations
Annotations can be managed through both the command-line interface and the Python API. This section documents the principal features of the API for managing annotations.
Note
In order to reproduce the following examples, you will need to install the public VanDam corpus and its annotations using datalad:
datalad install git@gin.g-node.org:/LAAC-LSCP/vandam-data.git
datalad get vandam-data/annotations
Reading annotations
Annotations are managed with the ChildProject.annotations.AnnotationManager class.
The first step is to create an instance of it based on the target project.
The read() method reads the index of annotations from metadata/annotations.csv and stores it in the annotations attribute:
>>> from ChildProject.projects import ChildProject
>>> from ChildProject.annotations import AnnotationManager
>>> project = ChildProject('vandam-data')
>>> am = AnnotationManager(project)
>>> am.read()
([], ["vandam-data/metadata/annotations.csv: 'chat' is not a permitted value for column 'format' on line 4, should be any of [csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA]"])
>>> am.annotations
set recording_filename time_seek range_onset range_offset raw_filename format filter annotation_filename imported_at error package_version
2 its BN32_010007.mp3 0 0 50464512 BN32_010007.its its NaN BN32_010007_0_0.csv 2021-03-06 22:55:06 NaN 0.0.1
3 vtc BN32_010007.mp3 0 0 50464512 BN32_010007.rttm vtc_rttm NaN BN32_010007_0_0.csv 2021-05-12 19:28:25 NaN 0.0.1
4 cha BN32_010007.mp3 0 0 50464512 BN32_010007.cha chat NaN BN32_010007_0_0.csv 2021-05-12 19:39:05 NaN 0.0.1
5 eaf BN32_010007.mp3 0 4138389 4199976 BN32_010007.eaf eaf NaN BN32_010007_4138389_4199976.csv 2021-07-14 17:39:50 NaN 0.0.1
6 eaf BN32_010007.mp3 0 4438842 4499995 BN32_010007.eaf eaf NaN BN32_010007_4438842_4499995.csv 2021-07-14 17:39:50 NaN 0.0.1
7 eaf BN32_010007.mp3 0 13199449 13256801 BN32_010007.eaf eaf NaN BN32_010007_13199449_13256801.csv 2021-07-14 17:39:50 NaN 0.0.1
8 eaf BN32_010007.mp3 0 37496002 37558424 BN32_010007.eaf eaf NaN BN32_010007_37496002_37558424.csv 2021-07-14 17:39:50 NaN 0.0.1
9 eaf BN32_010007.mp3 0 37616206 37679577 BN32_010007.eaf eaf NaN BN32_010007_37616206_37679577.csv 2021-07-14 17:39:50 NaN 0.0.1
10 cha/aligned BN32_010007.mp3 0 0 47725356 BN32_010007-aligned.csv csv NaN BN32_010007_0_47725356.csv 2021-07-15 16:15:48 NaN 0.0.1
As seen in this example, annotations only contains the index of annotations, not their contents. To retrieve the actual annotations, use get_segments():
>>> selection = am.annotations[am.annotations['set'].isin(['cha', 'vtc'])]
>>> segments = am.get_segments(selection)
>>> segments
segment_onset segment_offset speaker_type raw_filename set annotation_filename participant ... range_onset range_offset format filter imported_at error package_version
0 9992 10839 SPEECH BN32_010007.rttm vtc BN32_010007_0_0.csv NaN ... 0 50464512 vtc_rttm NaN 2021-05-12 19:28:25 NaN 0.0.1
1 10004 10814 CHI BN32_010007.rttm vtc BN32_010007_0_0.csv NaN ... 0 50464512 vtc_rttm NaN 2021-05-12 19:28:25 NaN 0.0.1
2 11298 11953 SPEECH BN32_010007.rttm vtc BN32_010007_0_0.csv NaN ... 0 50464512 vtc_rttm NaN 2021-05-12 19:28:25 NaN 0.0.1
3 11345 11828 CHI BN32_010007.rttm vtc BN32_010007_0_0.csv NaN ... 0 50464512 vtc_rttm NaN 2021-05-12 19:28:25 NaN 0.0.1
4 12113 12749 FEM BN32_010007.rttm vtc BN32_010007_0_0.csv NaN ... 0 50464512 vtc_rttm NaN 2021-05-12 19:28:25 NaN 0.0.1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
31875 49705416 49952432 CHI BN32_010007.cha cha BN32_010007_0_0.csv CHI ... 0 50464512 chat NaN 2021-05-12 19:39:05 NaN 0.0.1
31876 49952432 50057166 CHI BN32_010007.cha cha BN32_010007_0_0.csv CHI ... 0 50464512 chat NaN 2021-05-12 19:39:05 NaN 0.0.1
31877 50057166 50173260 CHI BN32_010007.cha cha BN32_010007_0_0.csv CHI ... 0 50464512 chat NaN 2021-05-12 19:39:05 NaN 0.0.1
31878 50173260 50330885 CHI BN32_010007.cha cha BN32_010007_0_0.csv CHI ... 0 50464512 chat NaN 2021-05-12 19:39:05 NaN 0.0.1
31879 50330885 50397134 CHI BN32_010007.cha cha BN32_010007_0_0.csv CHI ... 0 50464512 chat NaN 2021-05-12 19:39:05 NaN 0.0.1
[31880 rows x 22 columns]
Warning
Trying to load all annotations at once may quickly lead to out-of-memory errors, especially with automated annotators covering thousands of hours of audio. Memory issues can be alleviated by processing the data sequentially, e.g. one recording at a time.
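One way to process the data sequentially, as suggested in the warning above, is to iterate over the index one recording at a time, so that only the segments of a single recording are held in memory. The sketch below relies only on the annotations attribute and the get_segments() method shown earlier. The helper name iter_segments_by_recording is hypothetical and not part of the package.

```python
def iter_segments_by_recording(am):
    """Yield (recording_filename, segments) pairs, one recording at a time.

    `am` is assumed to be an AnnotationManager whose index has already
    been loaded with am.read(). Illustrative helper, not a package function.
    """
    index = am.annotations
    for recording, rows in index.groupby("recording_filename"):
        # Only the segments of the current recording are loaded here.
        yield recording, am.get_segments(rows)
```

A loop such as `for recording, segments in iter_segments_by_recording(am): ...` can then compute per-recording statistics and discard each dataframe before loading the next one.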
Importing annotations
Although annotations can be imported with the command-line tool, it is sometimes more efficient to do so directly with the Python API. This even becomes necessary when custom converters (the functions that transform any kind of annotations into the CSV format used by the package) are required.
Two examples are given below (one using built-in converters, one using a custom converter). In order to reproduce them, please make a copy of the original annotations:
mkdir vandam-data/annotations/playground
cp -r vandam-data/annotations/its vandam-data/annotations/playground
Built-in formats
The following code imports only the annotations from the LENA that correspond to the second hour of the audio. The package natively supports LENA’s .its annotations.
Annotations are imported using import_annotations().
The first input argument of this method must be a pandas dataframe listing all the annotations to be imported. This dataframe should be structured according to the format defined at Annotation importation input format.
>>> import pandas as pd
>>> input = pd.DataFrame([{
... 'set': 'playground/its',
... 'recording_filename': 'BN32_010007.mp3',
... 'time_seek': 0,
... 'range_onset': 3600*1000,
... 'range_offset': 7200*1000,
... 'raw_filename': 'BN32_010007.its',
... 'format': 'its'
... }])
>>> am.import_annotations(input, threads = 1)
set recording_filename time_seek range_onset range_offset raw_filename format annotation_filename imported_at package_version
0 playground/its BN32_010007.mp3 0 3600000 7200000 BN32_010007.its its BN32_010007_0_3600000.csv 2021-05-12 20:37:43 0.0.1
After reloading the index of annotations, the newly inserted entry now appears:
>>> am.read()
([], [])
>>> am.annotations
set recording_filename time_seek range_onset range_offset raw_filename format filter annotation_filename imported_at error package_version
2 its BN32_010007.mp3 0 0 50464512 BN32_010007.its its NaN BN32_010007_0_0.csv 2021-03-06 22:55:06 NaN 0.0.1
3 vtc BN32_010007.mp3 0 0 50464512 BN32_010007.rttm vtc_rttm NaN BN32_010007_0_0.csv 2021-05-12 19:28:25 NaN 0.0.1
4 cha BN32_010007.mp3 0 0 50464512 BN32_010007.cha chat NaN BN32_010007_0_0.csv 2021-05-12 19:39:05 NaN 0.0.1
5 playground/its BN32_010007.mp3 0 3600000 7200000 BN32_010007.its its NaN BN32_010007_0_3600000.csv 2021-05-12 20:37:43 NaN 0.0.1
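The single-row input used above generalizes to several time ranges. The sketch below builds one row per hour of audio, under the assumption that the recording lasts 50464512 ms (the range_offset of the full-file entries in the index). The helper name hourly_ranges and the constant HOUR_MS are hypothetical and are not provided by the package.

```python
import math

HOUR_MS = 3600 * 1000  # one hour, in milliseconds


def hourly_ranges(duration_ms, set_name, recording_filename, raw_filename, fmt):
    """Build one import row per hour of audio; the last one may be shorter."""
    rows = []
    for i in range(math.ceil(duration_ms / HOUR_MS)):
        rows.append({
            "set": set_name,
            "recording_filename": recording_filename,
            "time_seek": 0,
            "range_onset": i * HOUR_MS,
            # Clip the last range to the (assumed) end of the recording.
            "range_offset": min((i + 1) * HOUR_MS, duration_ms),
            "raw_filename": raw_filename,
            "format": fmt,
        })
    return rows


rows = hourly_ranges(
    50464512, "playground/its", "BN32_010007.mp3", "BN32_010007.its", "its"
)
print(len(rows), rows[1]["range_onset"], rows[1]["range_offset"])
```

Here rows[1] corresponds to the second hour imported above. Wrapping the list with pd.DataFrame(rows) should give a dataframe with the same columns as the input passed to import_annotations() in that example.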
Built-in converters include: LENA’s its, VTC’s and VCM’s rttms, ALICE, ACLEW DAS eaf files. To import annotations under other formats, custom converters are needed.
Custom converter
A converter is a function that takes a filename as its only input and returns a dataframe complying with the specifications defined in Annotations format.
The output dataframe must contain at least a segment_onset and a segment_offset column, expressing the onset and offset of each segment in milliseconds as integers.
You are free to add as many extra columns as you want. It is however preferable to follow the standards listed in Annotations format when possible.
In our case, we’ll write a very simple converter that extracts only the onset and offset of each segment from rttm files:
>>> def convert_rttm(filename: str):
...     # rttm files are space-separated and have no header row
...     df = pd.read_csv(filename, sep = " ", names = ['type', 'file', 'chnl', 'tbeg', 'tdur', 'ortho', 'stype', 'name', 'conf', 'unk'])
...     # tbeg and tdur are expressed in seconds; convert to integer milliseconds
...     df['segment_onset'] = df['tbeg'].mul(1000).round().astype(int)
...     df['segment_offset'] = (df['tbeg']+df['tdur']).mul(1000).round().astype(int)
...     # keep only the two segment columns
...     df.drop(['type', 'file', 'chnl', 'tbeg', 'tdur', 'ortho', 'stype', 'name', 'conf', 'unk'], axis = 1, inplace = True)
...     return df
...
>>>
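Before importing, it can help to check that a converter's output meets the requirements stated above: a segment_onset and a segment_offset column holding integer milliseconds. The following is a minimal sketch. The function name check_converter_output is hypothetical, and the final ordering check (offset not before onset) is an additional sanity check rather than a documented requirement.

```python
import pandas as pd


def check_converter_output(df):
    """Raise ValueError if df lacks integer segment_onset/segment_offset columns."""
    for column in ("segment_onset", "segment_offset"):
        if column not in df.columns:
            raise ValueError(f"missing column: {column}")
        if not pd.api.types.is_integer_dtype(df[column]):
            raise ValueError(f"{column} must contain integers (milliseconds)")
    # Extra sanity check (an assumption, not a documented requirement).
    if (df["segment_offset"] < df["segment_onset"]).any():
        raise ValueError("segment_offset precedes segment_onset")
    return df


# Two segments copied from the vtc output shown earlier in this section.
sample = pd.DataFrame({
    "segment_onset": [9992, 10004],
    "segment_offset": [10839, 10814],
})
check_converter_output(sample)
```

The same check can be applied to the dataframe returned by a custom converter on a real file before it is passed to import_annotations().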
The converter can now be used with import_annotations():
>>> input = pd.DataFrame([{
... 'set': 'playground/vtc',
... 'recording_filename': 'BN32_010007.mp3',
... 'time_seek': 0,
... 'range_onset': 3600*1000,
... 'range_offset': 7200*1000,
... 'raw_filename': 'BN32_010007.rttm',
... 'format': 'custom_rttm'
... }])
>>> am.import_annotations(input, threads = 1, import_function = convert_rttm)
set recording_filename time_seek range_onset range_offset raw_filename format annotation_filename imported_at package_version
0 playground/vtc BN32_010007.mp3 0 3600000 7200000 BN32_010007.rttm custom_rttm BN32_010007_0_3600000.csv 2021-05-13 17:25:20 0.0.1
The contents of the output CSV file can be checked:
>>> rttm = pd.read_csv('vandam-data/annotations/playground/vtc/converted/BN32_010007_0_3600000.csv')
>>> rttm
segment_onset segment_offset raw_filename
0 3600401 3601370 BN32_010007.rttm
1 3600403 3601464 BN32_010007.rttm
2 3601503 3602843 BN32_010007.rttm
3 3601527 3602833 BN32_010007.rttm
4 3604075 3605570 BN32_010007.rttm
... ... ... ...
1622 7010992 7011243 BN32_010007.rttm
1623 7011495 7011615 BN32_010007.rttm
1624 7033826 7034142 BN32_010007.rttm
1625 7036539 7037008 BN32_010007.rttm
1626 7036556 7036996 BN32_010007.rttm
[1627 rows x 3 columns]
Warning
Do not import the same file twice, as duplicates in the index might cause issues.
If you need to import an annotation again, remove it from the index beforehand.
This can be done with remove_set(), which removes a set of annotations from the index while preserving the raw annotations.
Users are advised to check the consistency and validity of the annotations and their index using the validation procedure.
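To avoid the duplicates mentioned in the warning above, the rows about to be imported can first be compared with the existing index. The sketch below treats two entries as duplicates when they share the same set, recording and time range. This criterion is an assumption made for illustration (overlapping but non-identical ranges are not detected), and the names KEYS and already_indexed are hypothetical. In practice, the index would be am.annotations; a small hand-built dataframe stands in for it here.

```python
import pandas as pd

# Columns used to decide whether two entries are the same (an assumption).
KEYS = ["set", "recording_filename", "range_onset", "range_offset"]


def already_indexed(index, new_rows):
    """Return the rows of new_rows whose KEYS match an entry of index."""
    existing = index[KEYS].drop_duplicates()
    return new_rows.merge(existing, on=KEYS, how="inner")


# Stand-in for am.annotations, holding the entry imported above.
index = pd.DataFrame([{
    "set": "playground/vtc", "recording_filename": "BN32_010007.mp3",
    "range_onset": 3600000, "range_offset": 7200000,
}])
new_rows = pd.DataFrame([
    {"set": "playground/vtc", "recording_filename": "BN32_010007.mp3",
     "range_onset": 3600000, "range_offset": 7200000,
     "raw_filename": "BN32_010007.rttm"},
    {"set": "playground/vtc", "recording_filename": "BN32_010007.mp3",
     "range_onset": 7200000, "range_offset": 10800000,
     "raw_filename": "BN32_010007.rttm"},
])
duplicates = already_indexed(index, new_rows)
print(len(duplicates))
```

Only the first of the two candidate rows is reported, since its range is already present in the index.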
Importing any EAF tier
When importing EAF annotation files, some tiers are supported by ChildProject, such as vcm_type or lex_type.
To import a tier that is not supported by ChildProject, use import_annotations() as follows:
>>> am.import_annotations(input, new_tier = ['name_of_tier'])
Validating annotations
The contents of annotations can be searched for errors using the validate() function:
>>> errors, warnings = am.validate()
validating BN32_010007_0_0.csv...
validating BN32_010007_0_0.csv...
validating BN32_010007_0_0.csv...
validating BN32_010007_0_3600000.csv...
validating BN32_010007_0_3600000.csv...
>>> errors
[]
>>> warnings
[]
errors and warnings are both empty, indicating that no issues were found.
To gather the errors and warnings raised while validating the index of annotations, use read():
>>> errors, warnings = am.read()
>>> errors
[]
>>> warnings
[]
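Since read() and validate() both return a pair of lists (errors, warnings), the two checks can be combined into a single report. This is a minimal sketch that relies only on the return values shown above. The helper name collect_problems is hypothetical.

```python
def collect_problems(am):
    """Run am.read() and am.validate() and merge what they report.

    `am` is assumed to be an AnnotationManager. Returns (errors, warnings),
    with index-level problems listed before content-level ones.
    """
    index_errors, index_warnings = am.read()
    content_errors, content_warnings = am.validate()
    errors = list(index_errors) + list(content_errors)
    warnings = list(index_warnings) + list(content_warnings)
    return errors, warnings
```

A script could then stop early, for example by raising an exception, whenever the returned errors list is not empty.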
Time-of-the-day
For a number of purposes, it may be convenient to retrieve the timestamp of each vocalization, or to filter out annotations outside some specific time-range.
Both tasks can be performed through the Python API of the package.
Annotations within a specific time-range
A given set of annotations may be clipped to a given time range using get_within_time_range().
For instance, annotations of audio recorded between 9am and 12pm (noon) may be retrieved with the following code:
>>> morning = am.get_within_time_range(am.annotations, start_time='09:00', end_time='12:00')
>>> morning
set recording_filename time_seek range_onset range_offset raw_filename ... imported_at error package_version start_time range_onset_time range_offset_time
0 its BN32_010007.mp3 0 7320000.0 18120000.0 BN32_010007.its ... 2021-03-06 22:55:06 NaN 0.0.1 1900-01-01 06:58:00 09:00 12:00
1 vtc BN32_010007.mp3 0 7320000.0 18120000.0 BN32_010007.rttm ... 2021-05-12 19:28:25 NaN 0.0.1 1900-01-01 06:58:00 09:00 12:00
2 cha BN32_010007.mp3 0 7320000.0 18120000.0 BN32_010007.cha ... 2021-05-12 19:39:05 NaN 0.0.1 1900-01-01 06:58:00 09:00 12:00
3 eaf BN32_010007.mp3 0 13199449.0 13256801.0 BN32_010007.eaf ... 2021-07-14 17:39:50 NaN 0.0.1 1900-01-01 06:58:00 10:37 10:38:56.352
4 cha/aligned BN32_010007.mp3 0 7320000.0 18120000.0 BN32_010007-aligned.csv ... 2021-07-15 16:15:48 NaN 0.0.1 1900-01-01 06:58:00 09:00 12:00
[5 rows x 15 columns]
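The clipped range_onset and range_offset values in this output can be cross-checked by hand: they are the number of milliseconds between the start of the recording (06:58:00 in the start_time column) and the requested clock times. The sketch below reproduces this arithmetic with the standard library only. It is not the package's implementation, it assumes the recording does not cross midnight, and the function name clock_to_offset_ms is hypothetical.

```python
from datetime import datetime, timedelta


def clock_to_offset_ms(recording_start, clock_time):
    """Milliseconds elapsed between recording_start ('HH:MM:SS')
    and clock_time ('HH:MM'), assuming both fall on the same day."""
    start = datetime.strptime(recording_start, "%H:%M:%S")
    target = datetime.strptime(clock_time, "%H:%M")
    # Integer division by one millisecond yields an exact integer count.
    return (target - start) // timedelta(milliseconds=1)


print(clock_to_offset_ms("06:58:00", "09:00"))  # 7320000
print(clock_to_offset_ms("06:58:00", "12:00"))  # 18120000
```

Both values match the range_onset and range_offset of the its, vtc, cha and cha/aligned rows above.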
The onset and offset timestamps of each segment can be calculated with get_segments_timestamps():
>>> segments = am.get_segments(morning)
>>> segments = am.get_segments_timestamps(segments)
>>> segments[['speaker_type', 'onset_time', 'offset_time']]
speaker_type onset_time offset_time
0 CHI 2010-07-24 09:00:00.000 2010-07-24 09:20:39.793
1 CHI 2010-07-24 09:20:39.793 2010-07-24 09:21:43.496
2 CHI 2010-07-24 09:21:43.496 2010-07-24 09:23:45.168
3 CHI 2010-07-24 09:23:45.168 2010-07-24 09:24:12.371
4 CHI 2010-07-24 09:24:12.371 2010-07-24 09:27:27.019
... ... ... ...
11801 CHI 2010-07-24 11:56:50.584 2010-07-24 11:56:51.011
11802 FEM 2010-07-24 11:57:15.749 2010-07-24 11:57:15.992
11803 MAL 2010-07-24 11:57:24.637 2010-07-24 11:57:25.010
11804 SPEECH 2010-07-24 11:57:35.237 2010-07-24 11:57:35.666
11805 CHI 2010-07-24 11:57:35.314 2010-07-24 11:57:35.511
[11806 rows x 3 columns]
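Conceptually, each timestamp is the start of the recording shifted by the position of the segment, expressed in milliseconds. The sketch below illustrates this conversion with the standard library. It assumes that the recording started on 2010-07-24 at 06:58:00, as the outputs above suggest. It is not the package's implementation, and the names recording_start and segment_timestamp are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed start of the recording, inferred from the outputs above.
recording_start = datetime(2010, 7, 24, 6, 58, 0)


def segment_timestamp(position_ms):
    """Clock time of a position given in milliseconds from the recording start."""
    return recording_start + timedelta(milliseconds=position_ms)


# 7320000 ms is the clipped range_onset of the morning annotations.
print(segment_timestamp(7320000))  # 2010-07-24 09:00:00
```

The result agrees with the onset_time of the first segment listed above.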