Annotations

Annotations can be managed through both the command-line interface and the Python API. This section documents the principal features of the API for the management of annotations.

Note

In order to reproduce the following examples, you will need to install the public VanDam corpus and its annotations using datalad:

datalad install git@gin.g-node.org:/LAAC-LSCP/vandam-data.git
datalad get vandam-data/annotations

Reading annotations

Annotations are managed with the ChildProject.annotations.AnnotationManager class. The first step is to create an instance of it based on the target project.

The read() method reads the index of annotations from metadata/annotations.csv and stores it in the annotations attribute:

>>> from ChildProject.projects import ChildProject
>>> from ChildProject.annotations import AnnotationManager
>>> project = ChildProject('vandam-data')
>>> am = AnnotationManager(project)
>>> am.read()
([], ["vandam-data/metadata/annotations.csv: 'chat' is not a permitted value for column 'format' on line 4, should be any of [csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA]"])
>>> am.annotations
            set recording_filename  time_seek  range_onset  range_offset             raw_filename    format  filter                annotation_filename          imported_at  error package_version
2           its    BN32_010007.mp3          0            0      50464512          BN32_010007.its       its     NaN                BN32_010007_0_0.csv  2021-03-06 22:55:06    NaN           0.0.1
3           vtc    BN32_010007.mp3          0            0      50464512         BN32_010007.rttm  vtc_rttm     NaN                BN32_010007_0_0.csv  2021-05-12 19:28:25    NaN           0.0.1
4           cha    BN32_010007.mp3          0            0      50464512          BN32_010007.cha      chat     NaN                BN32_010007_0_0.csv  2021-05-12 19:39:05    NaN           0.0.1
5           eaf    BN32_010007.mp3          0      4138389       4199976          BN32_010007.eaf       eaf     NaN    BN32_010007_4138389_4199976.csv  2021-07-14 17:39:50    NaN           0.0.1
6           eaf    BN32_010007.mp3          0      4438842       4499995          BN32_010007.eaf       eaf     NaN    BN32_010007_4438842_4499995.csv  2021-07-14 17:39:50    NaN           0.0.1
7           eaf    BN32_010007.mp3          0     13199449      13256801          BN32_010007.eaf       eaf     NaN  BN32_010007_13199449_13256801.csv  2021-07-14 17:39:50    NaN           0.0.1
8           eaf    BN32_010007.mp3          0     37496002      37558424          BN32_010007.eaf       eaf     NaN  BN32_010007_37496002_37558424.csv  2021-07-14 17:39:50    NaN           0.0.1
9           eaf    BN32_010007.mp3          0     37616206      37679577          BN32_010007.eaf       eaf     NaN  BN32_010007_37616206_37679577.csv  2021-07-14 17:39:50    NaN           0.0.1
10  cha/aligned    BN32_010007.mp3          0            0      47725356  BN32_010007-aligned.csv       csv     NaN         BN32_010007_0_47725356.csv  2021-07-15 16:15:48    NaN           0.0.1

As seen in this example, the annotations attribute only contains the index of annotations, not their contents. To retrieve the actual annotations, use get_segments():

>>> selection = am.annotations[am.annotations['set'].isin(['cha', 'vtc'])]
>>> segments = am.get_segments(selection)
>>> segments

    segment_onset  segment_offset speaker_type      raw_filename  set  annotation_filename participant  ... range_onset range_offset    format filter          imported_at error package_version
0               9992           10839       SPEECH  BN32_010007.rttm  vtc  BN32_010007_0_0.csv         NaN  ...           0     50464512  vtc_rttm    NaN  2021-05-12 19:28:25   NaN           0.0.1
1              10004           10814          CHI  BN32_010007.rttm  vtc  BN32_010007_0_0.csv         NaN  ...           0     50464512  vtc_rttm    NaN  2021-05-12 19:28:25   NaN           0.0.1
2              11298           11953       SPEECH  BN32_010007.rttm  vtc  BN32_010007_0_0.csv         NaN  ...           0     50464512  vtc_rttm    NaN  2021-05-12 19:28:25   NaN           0.0.1
3              11345           11828          CHI  BN32_010007.rttm  vtc  BN32_010007_0_0.csv         NaN  ...           0     50464512  vtc_rttm    NaN  2021-05-12 19:28:25   NaN           0.0.1
4              12113           12749          FEM  BN32_010007.rttm  vtc  BN32_010007_0_0.csv         NaN  ...           0     50464512  vtc_rttm    NaN  2021-05-12 19:28:25   NaN           0.0.1
            ...             ...          ...               ...  ...                  ...         ...  ...         ...          ...       ...    ...                  ...   ...             ...
31875       49705416        49952432          CHI   BN32_010007.cha  cha  BN32_010007_0_0.csv         CHI  ...           0     50464512      chat    NaN  2021-05-12 19:39:05   NaN           0.0.1
31876       49952432        50057166          CHI   BN32_010007.cha  cha  BN32_010007_0_0.csv         CHI  ...           0     50464512      chat    NaN  2021-05-12 19:39:05   NaN           0.0.1
31877       50057166        50173260          CHI   BN32_010007.cha  cha  BN32_010007_0_0.csv         CHI  ...           0     50464512      chat    NaN  2021-05-12 19:39:05   NaN           0.0.1
31878       50173260        50330885          CHI   BN32_010007.cha  cha  BN32_010007_0_0.csv         CHI  ...           0     50464512      chat    NaN  2021-05-12 19:39:05   NaN           0.0.1
31879       50330885        50397134          CHI   BN32_010007.cha  cha  BN32_010007_0_0.csv         CHI  ...           0     50464512      chat    NaN  2021-05-12 19:39:05   NaN           0.0.1

[31880 rows x 22 columns]

Warning

Trying to load all annotations at once may quickly lead to out-of-memory errors, especially with automated annotators covering thousands of hours of audio. Memory issues can be alleviated by processing the data sequentially, e.g. by treating one recording after another.
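
A minimal sketch of this sequential strategy is given below. It assumes that am is an AnnotationManager whose index has already been loaded with read(); the helper name iter_segments_by_recording is hypothetical and not part of the package.

```python
def iter_segments_by_recording(am):
    """Yield (recording_filename, segments) pairs, one recording at a time.

    `am` is assumed to be an AnnotationManager whose index was loaded with read().
    This helper is illustrative and not part of ChildProject.
    """
    index = am.annotations
    # Split the index by recording, then load the segments of one recording only.
    for recording, selection in index.groupby("recording_filename"):
        yield recording, am.get_segments(selection)
```

Because the helper is a generator, the segments of a recording can be released as soon as the loop moves on to the next one, provided the caller keeps no reference to them.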

Importing annotations

Although annotations can be imported with the command-line tool, it is sometimes more efficient to do so directly with the Python API. The API even becomes necessary when custom converters (the functions that transform any kind of annotations into the CSV format used by the package) are needed.

Two examples are given below (one using built-in converters, one using a custom converter). In order to reproduce them, please make a copy of the original annotations:

mkdir vandam-data/annotations/playground
cp -r vandam-data/annotations/its vandam-data/annotations/playground

Built-in formats

The following code imports only the annotations from the LENA that correspond to the second hour of the audio. The package natively supports LENA’s .its annotations.

Annotations are imported using import_annotations(). The first input argument of this method must be a pandas dataframe listing all the annotations that need to be imported. This dataframe should be structured according to the format defined at Annotation importation input format.

>>> import pandas as pd
>>> input = pd.DataFrame([{
...     'set': 'playground/its',
...     'recording_filename': 'BN32_010007.mp3',
...     'time_seek': 0,
...     'range_onset': 3600*1000,
...     'range_offset': 7200*1000,
...     'raw_filename': 'BN32_010007.its',
...     'format': 'its'
... }])
>>> am.import_annotations(input, threads = 1)
            set recording_filename  time_seek  range_onset  range_offset     raw_filename format        annotation_filename          imported_at package_version
0  playground/its    BN32_010007.mp3          0      3600000       7200000  BN32_010007.its    its  BN32_010007_0_3600000.csv  2021-05-12 20:37:43           0.0.1

After reloading the index of annotations, the newly inserted entry now appears:

>>> am.read()
([], [])
>>> am.annotations
            set recording_filename  time_seek  range_onset  range_offset      raw_filename    format  filter        annotation_filename          imported_at  error package_version
2             its    BN32_010007.mp3          0            0      50464512   BN32_010007.its       its     NaN        BN32_010007_0_0.csv  2021-03-06 22:55:06    NaN           0.0.1
3             vtc    BN32_010007.mp3          0            0      50464512  BN32_010007.rttm  vtc_rttm     NaN        BN32_010007_0_0.csv  2021-05-12 19:28:25    NaN           0.0.1
4             cha    BN32_010007.mp3          0            0      50464512   BN32_010007.cha      chat     NaN        BN32_010007_0_0.csv  2021-05-12 19:39:05    NaN           0.0.1
5  playground/its    BN32_010007.mp3          0      3600000       7200000   BN32_010007.its       its     NaN  BN32_010007_0_3600000.csv  2021-05-12 20:37:43    NaN           0.0.1
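
The input dataframe has one row per time range to import, with range_onset and range_offset expressed in milliseconds. As a sketch, the rows for several consecutive one-hour ranges could be generated as follows. The helper hourly_import_rows and its parameters are hypothetical; only the column names come from the example above.

```python
HOUR_MS = 3600 * 1000  # one hour, in milliseconds


def hourly_import_rows(recording_filename, raw_filename, annotation_set, fmt,
                       first_hour, last_hour):
    """Build one input row per hour in [first_hour, last_hour) (hypothetical helper)."""
    rows = []
    for hour in range(first_hour, last_hour):
        rows.append({
            "set": annotation_set,
            "recording_filename": recording_filename,
            "time_seek": 0,
            "range_onset": hour * HOUR_MS,         # start of the hour
            "range_offset": (hour + 1) * HOUR_MS,  # end of the hour
            "raw_filename": raw_filename,
            "format": fmt,
        })
    return rows


# Second and third hours of the recording (hours are counted from 0).
rows = hourly_import_rows(
    "BN32_010007.mp3", "BN32_010007.its", "playground/its", "its", 1, 3
)
```

The resulting list can be turned into a dataframe with pd.DataFrame(rows) before being passed to import_annotations().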

Built-in converters include LENA’s its, VTC’s and VCM’s rttm files, ALICE, and ACLEW DAS eaf files. To import annotations in other formats, custom converters are needed.

Custom converter

A converter is a function that takes a filename as its only input and returns a dataframe complying with the specifications defined in Annotations format.

The output dataframe must contain at least the segment_onset and segment_offset columns, which express the onset and offset of each segment in milliseconds as integers.

You are free to add as many extra columns as you want. It is however preferable to follow the standards listed in Annotations format when possible.
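
These requirements can be checked on the output of a converter before importing anything. The function below is only a sketch: check_converter_output is a hypothetical name, and the test on segment order is an extra sanity check that is not stated in the requirements above.

```python
import pandas as pd


def check_converter_output(df: pd.DataFrame) -> list:
    """Return a list of problems found in a converter's output (hypothetical helper)."""
    problems = []
    for column in ("segment_onset", "segment_offset"):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif not pd.api.types.is_integer_dtype(df[column]):
            problems.append(f"{column} is not an integer column")
    # Extra sanity check (an assumption): a segment should not end before it starts.
    if not problems and (df["segment_offset"] < df["segment_onset"]).any():
        problems.append("some segments end before they start")
    return problems
```

An empty list means that the dataframe has both mandatory columns with integer values.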

In our case, we’ll write a very simple converter that extracts only the onset and offset of each segment from rttm files:

>>> def convert_rttm(filename: str):
...     df = pd.read_csv(filename, sep = " ", names = ['type', 'file', 'chnl', 'tbeg', 'tdur', 'ortho', 'stype', 'name', 'conf', 'unk'])
...     df['segment_onset'] = df['tbeg'].mul(1000).round().astype(int)
...     df['segment_offset'] = (df['tbeg']+df['tdur']).mul(1000).round().astype(int)
...     df.drop(['type', 'file', 'chnl', 'tbeg', 'tdur', 'ortho', 'stype', 'name', 'conf', 'unk'], axis = 1, inplace = True)
...     return df
...
>>>

The converter can now be used with import_annotations():

>>> input = pd.DataFrame([{
...     'set': 'playground/vtc',
...     'recording_filename': 'BN32_010007.mp3',
...     'time_seek': 0,
...     'range_onset': 3600*1000,
...     'range_offset': 7200*1000,
...     'raw_filename': 'BN32_010007.rttm',
...     'format': 'custom_rttm'
... }])
>>> am.import_annotations(input, threads = 1, import_function = convert_rttm)
            set recording_filename  time_seek  range_onset  range_offset      raw_filename       format        annotation_filename          imported_at package_version
0  playground/vtc    BN32_010007.mp3          0      3600000       7200000  BN32_010007.rttm  custom_rttm  BN32_010007_0_3600000.csv  2021-05-13 17:25:20           0.0.1

The contents of the output CSV file can be checked:

>>> rttm = pd.read_csv('vandam-data/annotations/playground/vtc/converted/BN32_010007_0_3600000.csv')
>>> rttm
    segment_onset  segment_offset      raw_filename
0           3600401         3601370  BN32_010007.rttm
1           3600403         3601464  BN32_010007.rttm
2           3601503         3602843  BN32_010007.rttm
3           3601527         3602833  BN32_010007.rttm
4           3604075         3605570  BN32_010007.rttm
...             ...             ...               ...
1622        7010992         7011243  BN32_010007.rttm
1623        7011495         7011615  BN32_010007.rttm
1624        7033826         7034142  BN32_010007.rttm
1625        7036539         7037008  BN32_010007.rttm
1626        7036556         7036996  BN32_010007.rttm

[1627 rows x 3 columns]

Warning

Do not import the same file twice, as duplicates in the index might cause issues. If you need to import an annotation again, remove it from the index first. This can be done with remove_set(), which removes a set of annotations from the index while preserving the raw annotations.

Users are advised to check the consistency and validity of the annotations and their index using the validation procedure.
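
The remove-then-import sequence recommended in the warning above can be wrapped in a small helper. This is a sketch under assumptions: reimport_set is a hypothetical name, am is an AnnotationManager, and every row of input_df is assumed to belong to set_name.

```python
def reimport_set(am, set_name, input_df, **import_kwargs):
    """Remove `set_name` from the index, then import it again (hypothetical helper).

    Assumes every row of `input_df` belongs to `set_name`.
    """
    # Remove the previous entries first, so that the index holds no duplicates.
    am.remove_set(set_name)
    # Extra keyword arguments (e.g. threads, import_function) are passed through.
    return am.import_annotations(input_df, **import_kwargs)
```

For instance, reimport_set(am, 'playground/vtc', input, threads = 1, import_function = convert_rttm) would repeat the import of the custom converter example without leaving a duplicate entry in the index.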

Importing any EAF tier

When importing EAF annotation files, some tiers are supported by ChildProject, such as vcm_type or lex_type.

If you want to import a tier that is not supported by ChildProject, you can use import_annotations() as follows:

>>> am.import_annotations(input, new_tier = ['name_of_tier'])

Validating annotations

The contents of annotations can be checked for errors using the validate() method.

>>> errors, warnings = am.validate()
validating BN32_010007_0_0.csv...
validating BN32_010007_0_0.csv...
validating BN32_010007_0_0.csv...
validating BN32_010007_0_3600000.csv...
validating BN32_010007_0_3600000.csv...
>>> errors
[]
>>> warnings
[]

Both errors and warnings are empty, indicating that no issues were found.

To gather the errors and warnings raised while validating the index of annotations, use read():

>>> errors, warnings = am.read()
>>> errors
[]
>>> warnings
[]
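
Both checks can be combined so that a script stops early when something is wrong. The sketch below relies only on the fact, shown above, that read() and validate() each return a pair of lists (errors, warnings); the helper names collect_issues and require_valid are hypothetical.

```python
def collect_issues(am):
    """Gather errors and warnings from both the index and the annotation files."""
    index_errors, index_warnings = am.read()
    content_errors, content_warnings = am.validate()
    errors = list(index_errors) + list(content_errors)
    warnings = list(index_warnings) + list(content_warnings)
    return errors, warnings


def require_valid(am):
    """Raise ValueError if any error is found; return the warnings otherwise."""
    errors, warnings = collect_issues(am)
    if errors:
        raise ValueError(f"{len(errors)} error(s) found: {errors}")
    return warnings
```

Warnings are returned rather than raised, so that the caller can decide whether they are acceptable.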

Time-of-the-day

For a number of purposes, it may be convenient to retrieve the timestamp of each vocalization, or to filter out annotations outside some specific time-range.

Both tasks can be performed through the Python API of the package.

Annotations within a specific time-range

A given set of annotations may be clipped to a given time range using get_within_time_range(). For instance, annotations of audio recorded between 9am and 12pm (noon) may be retrieved with the following code:

>>> morning = am.get_within_time_range(am.annotations, start_time='09:00', end_time='12:00')
>>> morning
        set recording_filename  time_seek  range_onset  range_offset             raw_filename  ...          imported_at  error package_version          start_time  range_onset_time range_offset_time
0          its    BN32_010007.mp3          0    7320000.0    18120000.0          BN32_010007.its  ...  2021-03-06 22:55:06    NaN           0.0.1 1900-01-01 06:58:00             09:00             12:00
1          vtc    BN32_010007.mp3          0    7320000.0    18120000.0         BN32_010007.rttm  ...  2021-05-12 19:28:25    NaN           0.0.1 1900-01-01 06:58:00             09:00             12:00
2          cha    BN32_010007.mp3          0    7320000.0    18120000.0          BN32_010007.cha  ...  2021-05-12 19:39:05    NaN           0.0.1 1900-01-01 06:58:00             09:00             12:00
3          eaf    BN32_010007.mp3          0   13199449.0    13256801.0          BN32_010007.eaf  ...  2021-07-14 17:39:50    NaN           0.0.1 1900-01-01 06:58:00             10:37      10:38:56.352
4  cha/aligned    BN32_010007.mp3          0    7320000.0    18120000.0  BN32_010007-aligned.csv  ...  2021-07-15 16:15:48    NaN           0.0.1 1900-01-01 06:58:00             09:00             12:00

[5 rows x 15 columns]
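
The clipped range_onset and range_offset values are still expressed in milliseconds since the beginning of the recording. The arithmetic can be reproduced by hand, assuming the recording started at 06:58 as the start_time column indicates. The helper clock_to_offset_ms is hypothetical, serves only as an illustration, and does not handle recordings that span midnight.

```python
from datetime import datetime, timedelta


def clock_to_offset_ms(recording_start: str, clock_time: str) -> int:
    """Milliseconds elapsed between the start of the recording and a clock time.

    Both arguments use the HH:MM format and are assumed to fall on the same day.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(clock_time, fmt) - datetime.strptime(recording_start, fmt)
    # Integer division of two timedeltas gives an exact number of milliseconds.
    return delta // timedelta(milliseconds=1)


range_onset = clock_to_offset_ms("06:58", "09:00")   # 2 h 02 min after the start
range_offset = clock_to_offset_ms("06:58", "12:00")  # 5 h 02 min after the start
```

These two values are 7320000 and 18120000, which match the range_onset and range_offset of the sets that cover the whole recording in the table above.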

The onset and offset timestamps of each segment can be calculated with get_segments_timestamps():

>>> segments = am.get_segments(morning)
>>> segments = am.get_segments_timestamps(segments)
>>> segments[['speaker_type', 'onset_time', 'offset_time']]
    speaker_type              onset_time             offset_time
0              CHI 2010-07-24 09:00:00.000 2010-07-24 09:20:39.793
1              CHI 2010-07-24 09:20:39.793 2010-07-24 09:21:43.496
2              CHI 2010-07-24 09:21:43.496 2010-07-24 09:23:45.168
3              CHI 2010-07-24 09:23:45.168 2010-07-24 09:24:12.371
4              CHI 2010-07-24 09:24:12.371 2010-07-24 09:27:27.019
...            ...                     ...                     ...
11801          CHI 2010-07-24 11:56:50.584 2010-07-24 11:56:51.011
11802          FEM 2010-07-24 11:57:15.749 2010-07-24 11:57:15.992
11803          MAL 2010-07-24 11:57:24.637 2010-07-24 11:57:25.010
11804       SPEECH 2010-07-24 11:57:35.237 2010-07-24 11:57:35.666
11805          CHI 2010-07-24 11:57:35.314 2010-07-24 11:57:35.511

[11806 rows x 3 columns]
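
Conceptually, each timestamp is the start of the recording plus the position of the segment in milliseconds. The sketch below illustrates this relationship; it is not the implementation used by the package. The recording start (2010-07-24 06:58) and the two offsets are inferred from the outputs above, and segment_timestamp is a hypothetical helper.

```python
from datetime import datetime, timedelta


def segment_timestamp(recording_start: datetime, offset_ms: int) -> datetime:
    """Clock time of a position given in milliseconds since the recording start."""
    return recording_start + timedelta(milliseconds=offset_ms)


# Recording start inferred from the start_time and onset_time columns (assumption).
start = datetime(2010, 7, 24, 6, 58)

onset_time = segment_timestamp(start, 7_320_000)   # 2010-07-24 09:00:00
offset_time = segment_timestamp(start, 8_559_793)  # 2010-07-24 09:20:39.793
```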