Annotations
===========

Annotations can be managed through both the command-line interface and the Python
API. This section documents the principal features of the API for managing
annotations.

.. note:: 

    In order to reproduce the following examples, you will need to install
    the public VanDam corpus and its annotations using datalad:
    
    .. code-block:: bash
    
        datalad install git@gin.g-node.org:/LAAC-LSCP/vandam-data.git
        datalad get vandam-data/annotations


Reading annotations
~~~~~~~~~~~~~~~~~~~

Annotations are managed with the :class:`ChildProject.annotations.AnnotationManager` class.
The first step is to create an instance of it based on the target project.

The :meth:`~ChildProject.annotations.AnnotationManager.read` method reads the index of annotations
from ``metadata/annotations.csv`` and stores it in the
:attr:`~ChildProject.annotations.AnnotationManager.annotations` attribute:


.. code-block:: python

    >>> from ChildProject.projects import ChildProject
    >>> from ChildProject.annotations import AnnotationManager
    >>> project = ChildProject('vandam-data')
    >>> am = AnnotationManager(project)
    >>> am.read()
    ([], ["vandam-data/metadata/annotations.csv: 'chat' is not a permitted value for column 'format' on line 4, should be any of [csv,vtc_rttm,vcm_rttm,alice,its,TextGrid,eaf,cha,NA]"])
    >>> am.annotations
                set recording_filename  time_seek  range_onset  range_offset             raw_filename    format  filter                annotation_filename          imported_at  error package_version
    2           its    BN32_010007.mp3          0            0      50464512          BN32_010007.its       its     NaN                BN32_010007_0_0.csv  2021-03-06 22:55:06    NaN           0.0.1
    3           vtc    BN32_010007.mp3          0            0      50464512         BN32_010007.rttm  vtc_rttm     NaN                BN32_010007_0_0.csv  2021-05-12 19:28:25    NaN           0.0.1
    4           cha    BN32_010007.mp3          0            0      50464512          BN32_010007.cha      chat     NaN                BN32_010007_0_0.csv  2021-05-12 19:39:05    NaN           0.0.1
    5           eaf    BN32_010007.mp3          0      4138389       4199976          BN32_010007.eaf       eaf     NaN    BN32_010007_4138389_4199976.csv  2021-07-14 17:39:50    NaN           0.0.1
    6           eaf    BN32_010007.mp3          0      4438842       4499995          BN32_010007.eaf       eaf     NaN    BN32_010007_4438842_4499995.csv  2021-07-14 17:39:50    NaN           0.0.1
    7           eaf    BN32_010007.mp3          0     13199449      13256801          BN32_010007.eaf       eaf     NaN  BN32_010007_13199449_13256801.csv  2021-07-14 17:39:50    NaN           0.0.1
    8           eaf    BN32_010007.mp3          0     37496002      37558424          BN32_010007.eaf       eaf     NaN  BN32_010007_37496002_37558424.csv  2021-07-14 17:39:50    NaN           0.0.1
    9           eaf    BN32_010007.mp3          0     37616206      37679577          BN32_010007.eaf       eaf     NaN  BN32_010007_37616206_37679577.csv  2021-07-14 17:39:50    NaN           0.0.1
    10  cha/aligned    BN32_010007.mp3          0            0      47725356  BN32_010007-aligned.csv       csv     NaN         BN32_010007_0_47725356.csv  2021-07-15 16:15:48    NaN           0.0.1

As seen in this example, :attr:`~ChildProject.annotations.AnnotationManager.annotations` only
contains the index of annotations, not their contents. To retrieve the actual annotations,
use :meth:`~ChildProject.annotations.AnnotationManager.get_segments`:

.. code-block:: python

    >>> selection = am.annotations[am.annotations['set'].isin(['cha', 'vtc'])]
    >>> segments = am.get_segments(selection)
    >>> segments

        segment_onset  segment_offset speaker_type      raw_filename  set  annotation_filename participant  ... range_onset range_offset    format filter          imported_at error package_version
    0               9992           10839       SPEECH  BN32_010007.rttm  vtc  BN32_010007_0_0.csv         NaN  ...           0     50464512  vtc_rttm    NaN  2021-05-12 19:28:25   NaN           0.0.1
    1              10004           10814          CHI  BN32_010007.rttm  vtc  BN32_010007_0_0.csv         NaN  ...           0     50464512  vtc_rttm    NaN  2021-05-12 19:28:25   NaN           0.0.1
    2              11298           11953       SPEECH  BN32_010007.rttm  vtc  BN32_010007_0_0.csv         NaN  ...           0     50464512  vtc_rttm    NaN  2021-05-12 19:28:25   NaN           0.0.1
    3              11345           11828          CHI  BN32_010007.rttm  vtc  BN32_010007_0_0.csv         NaN  ...           0     50464512  vtc_rttm    NaN  2021-05-12 19:28:25   NaN           0.0.1
    4              12113           12749          FEM  BN32_010007.rttm  vtc  BN32_010007_0_0.csv         NaN  ...           0     50464512  vtc_rttm    NaN  2021-05-12 19:28:25   NaN           0.0.1
                ...             ...          ...               ...  ...                  ...         ...  ...         ...          ...       ...    ...                  ...   ...             ...
    31875       49705416        49952432          CHI   BN32_010007.cha  cha  BN32_010007_0_0.csv         CHI  ...           0     50464512      chat    NaN  2021-05-12 19:39:05   NaN           0.0.1
    31876       49952432        50057166          CHI   BN32_010007.cha  cha  BN32_010007_0_0.csv         CHI  ...           0     50464512      chat    NaN  2021-05-12 19:39:05   NaN           0.0.1
    31877       50057166        50173260          CHI   BN32_010007.cha  cha  BN32_010007_0_0.csv         CHI  ...           0     50464512      chat    NaN  2021-05-12 19:39:05   NaN           0.0.1
    31878       50173260        50330885          CHI   BN32_010007.cha  cha  BN32_010007_0_0.csv         CHI  ...           0     50464512      chat    NaN  2021-05-12 19:39:05   NaN           0.0.1
    31879       50330885        50397134          CHI   BN32_010007.cha  cha  BN32_010007_0_0.csv         CHI  ...           0     50464512      chat    NaN  2021-05-12 19:39:05   NaN           0.0.1

    [31880 rows x 22 columns]
    
.. warning::

    Trying to load all annotations at once may quickly lead to out-of-memory errors,
    especially with automated annotators covering thousands of hours of audio.
    Memory issues can be alleviated by processing the data sequentially, e.g.
    by treating one recording after another.
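
The sequential approach can be sketched as follows. This is only an illustration:
``count_segments_per_recording`` is a hypothetical helper, not part of the package, and it
assumes that ``am`` is an :class:`~ChildProject.annotations.AnnotationManager` whose index
has already been loaded with :meth:`~ChildProject.annotations.AnnotationManager.read`.

.. code-block:: python

    def count_segments_per_recording(am):
        """Count segments one recording at a time (hypothetical helper)."""
        counts = {}
        # Group the index (not the segments) by recording; the index is small.
        for recording, index_rows in am.annotations.groupby('recording_filename'):
            # Only the segments of the current recording are held in memory.
            segments = am.get_segments(index_rows)
            counts[recording] = len(segments)
        return counts

Calling ``count_segments_per_recording(am)`` returns a dictionary mapping each recording
to its number of segments. Any other per-recording reduction can replace ``len(segments)``.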

Importing annotations
~~~~~~~~~~~~~~~~~~~~~

Although annotations can be imported with the command-line tool,
it is sometimes more efficient to do it directly with the Python API. This
even becomes necessary when custom converters (functions that transform
any kind of annotations into the CSV format used by the package) are needed.

Two examples are given below (one using built-in converters, one using a custom converter).
In order to reproduce them, please make a copy of the original annotations:

.. code:: bash

    mkdir vandam-data/annotations/playground
    cp -r vandam-data/annotations/its vandam-data/annotations/playground

Built-in formats
----------------

The following code imports only the annotations from the LENA that
correspond to the second hour of the audio. The package natively
supports LENA's .its annotations.

Annotations are imported using :meth:`~ChildProject.annotations.AnnotationManager.import_annotations`.
The first input argument of this method must be a pandas dataframe listing all the annotations that need to
be imported. This dataframe should be structured according to the format defined at :ref:`format-input-annotations`.

.. code-block:: python

    >>> import pandas as pd
    >>> input = pd.DataFrame([{
    ...     'set': 'playground/its',
    ...     'recording_filename': 'BN32_010007.mp3',
    ...     'time_seek': 0,
    ...     'range_onset': 3600*1000,
    ...     'range_offset': 7200*1000,
    ...     'raw_filename': 'BN32_010007.its',
    ...     'format': 'its'
    ... }])
    >>> am.import_annotations(input, threads = 1)
                set recording_filename  time_seek  range_onset  range_offset     raw_filename format        annotation_filename          imported_at package_version
    0  playground/its    BN32_010007.mp3          0      3600000       7200000  BN32_010007.its    its  BN32_010007_0_3600000.csv  2021-05-12 20:37:43           0.0.1


After reloading the index of annotations, the newly inserted entry now appears:

.. code-block:: python

    >>> am.read()
    ([], [])
    >>> am.annotations
                set recording_filename  time_seek  range_onset  range_offset      raw_filename    format  filter        annotation_filename          imported_at  error package_version
    2             its    BN32_010007.mp3          0            0      50464512   BN32_010007.its       its     NaN        BN32_010007_0_0.csv  2021-03-06 22:55:06    NaN           0.0.1
    3             vtc    BN32_010007.mp3          0            0      50464512  BN32_010007.rttm  vtc_rttm     NaN        BN32_010007_0_0.csv  2021-05-12 19:28:25    NaN           0.0.1
    4             cha    BN32_010007.mp3          0            0      50464512   BN32_010007.cha      chat     NaN        BN32_010007_0_0.csv  2021-05-12 19:39:05    NaN           0.0.1
    5  playground/its    BN32_010007.mp3          0      3600000       7200000   BN32_010007.its       its     NaN  BN32_010007_0_3600000.csv  2021-05-12 20:37:43    NaN           0.0.1
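
When several ranges of the same raw file need to be imported, the rows of the input dataframe
can be generated rather than typed by hand. The sketch below builds one row per hour of audio
using plain Python. ``hourly_ranges`` and ``build_index_rows`` are hypothetical helpers, and the
total duration (50464512 ms) is the ``range_offset`` listed in the index above for the full ``its`` file.

.. code-block:: python

    def hourly_ranges(duration_ms, hour_ms=3600 * 1000):
        """Split [0, duration_ms) into consecutive (onset, offset) ranges of at most one hour."""
        return [
            (onset, min(onset + hour_ms, duration_ms))
            for onset in range(0, duration_ms, hour_ms)
        ]

    def build_index_rows(annotation_set, recording, raw_filename, fmt, ranges):
        """Build one input row per (range_onset, range_offset) pair, in milliseconds."""
        return [
            {
                'set': annotation_set,
                'recording_filename': recording,
                'time_seek': 0,
                'range_onset': onset,
                'range_offset': offset,
                'raw_filename': raw_filename,
                'format': fmt,
            }
            for onset, offset in ranges
        ]

    rows = build_index_rows(
        'playground/its', 'BN32_010007.mp3', 'BN32_010007.its', 'its',
        hourly_ranges(50464512),
    )

The resulting list can be turned into the expected dataframe with ``pd.DataFrame(rows)``
before being passed to :meth:`~ChildProject.annotations.AnnotationManager.import_annotations`.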

Built-in converters include LENA's its, VTC's and VCM's rttm, ALICE, and ACLEW DAS eaf files.
To import annotations in other formats, custom converters are needed.

Custom converter
----------------

A converter is a function that takes a filename as its only input and returns a dataframe
complying with the specifications defined in :ref:`format-annotations-segments`.

The output dataframe *must* contain at least the ``segment_onset`` and ``segment_offset`` columns,
expressing the onset and offset of each segment in milliseconds as
integers.

You are free to add as many extra columns as you want. It is however preferable to follow the
standards listed in :ref:`format-annotations-segments` when possible.

In our case, we'll write a very simple converter that extracts only the onset and offset
of each segment from rttm files:

.. code-block:: python

    >>> def convert_rttm(filename: str):
    ...     df = pd.read_csv(filename, sep = " ", names = ['type', 'file', 'chnl', 'tbeg', 'tdur', 'ortho', 'stype', 'name', 'conf', 'unk'])
    ...     df['segment_onset'] = df['tbeg'].mul(1000).round().astype(int)
    ...     df['segment_offset'] = (df['tbeg']+df['tdur']).mul(1000).round().astype(int)
    ...     df.drop(['type', 'file', 'chnl', 'tbeg', 'tdur', 'ortho', 'stype', 'name', 'conf', 'unk'], axis = 1, inplace = True)
    ...     return df
    ... 
    >>> 
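
To make the unit conversion explicit, the sketch below applies the same arithmetic to a single
RTTM line using only the standard library. The sample line is made up for illustration, and
``rttm_line_to_segment`` is a hypothetical helper rather than a function of the package.

.. code-block:: python

    def rttm_line_to_segment(line):
        """Convert one space-separated RTTM line into integer millisecond bounds."""
        fields = line.split()
        tbeg = float(fields[3])  # onset of the segment, in seconds
        tdur = float(fields[4])  # duration of the segment, in seconds
        return {
            'segment_onset': round(tbeg * 1000),
            'segment_offset': round((tbeg + tdur) * 1000),
        }

    # Made-up line: a segment starting at 12.5 s and lasting 0.75 s.
    segment = rttm_line_to_segment('SPEAKER BN32_010007 1 12.5 0.75 <NA> <NA> CHI <NA> <NA>')
    # segment == {'segment_onset': 12500, 'segment_offset': 13250}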

The converter can now be used with :meth:`~ChildProject.annotations.AnnotationManager.import_annotations`:

.. code-block:: python

    >>> input = pd.DataFrame([{
    ...     'set': 'playground/vtc',
    ...     'recording_filename': 'BN32_010007.mp3',
    ...     'time_seek': 0,
    ...     'range_onset': 3600*1000,
    ...     'range_offset': 7200*1000,
    ...     'raw_filename': 'BN32_010007.rttm',
    ...     'format': 'custom_rttm'
    ... }])
    >>> am.import_annotations(input, threads = 1, import_function = convert_rttm)
                set recording_filename  time_seek  range_onset  range_offset      raw_filename       format        annotation_filename          imported_at package_version
    0  playground/vtc    BN32_010007.mp3          0      3600000       7200000  BN32_010007.rttm  custom_rttm  BN32_010007_0_3600000.csv  2021-05-13 17:25:20           0.0.1

The contents of the output CSV file can be checked:

.. code-block:: python

    >>> rttm = pd.read_csv('vandam-data/annotations/playground/vtc/converted/BN32_010007_0_3600000.csv')
    >>> rttm
        segment_onset  segment_offset      raw_filename
    0           3600401         3601370  BN32_010007.rttm
    1           3600403         3601464  BN32_010007.rttm
    2           3601503         3602843  BN32_010007.rttm
    3           3601527         3602833  BN32_010007.rttm
    4           3604075         3605570  BN32_010007.rttm
    ...             ...             ...               ...
    1622        7010992         7011243  BN32_010007.rttm
    1623        7011495         7011615  BN32_010007.rttm
    1624        7033826         7034142  BN32_010007.rttm
    1625        7036539         7037008  BN32_010007.rttm
    1626        7036556         7036996  BN32_010007.rttm

    [1627 rows x 3 columns]

.. warning::

    Do not import the same file twice, as duplicates in the index might cause issues.
    Make sure to remove an annotation from an index beforehand if you need to import it again.
    This can be done with :meth:`~ChildProject.annotations.AnnotationManager.remove_set` to
    remove a set of annotations from the index while preserving raw annotations.
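
One way to guard against accidental re-imports is to compare the rows about to be imported
with the existing index first. The sketch below is an assumption-laden illustration:
``find_duplicates`` is a hypothetical helper, and treating the combination of ``set``,
``recording_filename``, ``range_onset`` and ``range_offset`` as the identity of an entry
is a choice made for this example, not a rule stated by the package.

.. code-block:: python

    # Columns assumed to identify an index entry (an assumption of this sketch).
    KEY = ('set', 'recording_filename', 'range_onset', 'range_offset')

    def find_duplicates(existing_rows, new_rows, key=KEY):
        """Return the rows of new_rows whose key already appears in existing_rows."""
        seen = {tuple(row[k] for k in key) for row in existing_rows}
        return [row for row in new_rows if tuple(row[k] for k in key) in seen]

Both arguments are lists of dictionaries, which can be obtained from dataframes with
``am.annotations.to_dict('records')`` and ``input.to_dict('records')``. If the result is not
empty, the corresponding set should be removed from the index before importing again.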

Users are advised to check the consistency and validity of the annotations and their index
using the validation procedure.

Importing any EAF tier
----------------------

When importing EAF annotation files, some tiers are supported by ChildProject, such as ``vcm_type`` or
``lex_type``.

If you want to import a tier that is not supported by ChildProject, you can use
:meth:`~ChildProject.annotations.AnnotationManager.import_annotations` as follows:

.. code-block:: python

    >>> am.import_annotations(input, new_tier = ['name_of_tier'])

Validating annotations
~~~~~~~~~~~~~~~~~~~~~~

The contents of annotations can be checked for errors
using the :meth:`~ChildProject.annotations.AnnotationManager.validate` method.


.. code-block:: python

    >>> errors, warnings = am.validate()
    validating BN32_010007_0_0.csv...
    validating BN32_010007_0_0.csv...
    validating BN32_010007_0_0.csv...
    validating BN32_010007_0_3600000.csv...
    validating BN32_010007_0_3600000.csv...
    >>> errors
    []
    >>> warnings
    []

``errors`` and ``warnings`` are both empty, indicating that no issues were found.

To gather the errors and warnings raised while validating the index of annotations,
use :meth:`~ChildProject.annotations.AnnotationManager.read`:

.. code-block:: python

    >>> errors, warnings = am.read()
    >>> errors
    []
    >>> warnings
    []

Time-of-the-day
~~~~~~~~~~~~~~~

For a number of purposes, it may be convenient to retrieve the timestamp of each vocalization, or to filter out annotations outside
a specific time range.

Both tasks can be performed through the Python API of the package.

Annotations within a specific time-range
----------------------------------------

A given set of annotations may be clipped to a given time range using :meth:`~ChildProject.annotations.AnnotationManager.get_within_time_range`.
For instance, annotations of audio recorded between 9am and 12pm (noon) may be retrieved with the following code:

.. code-block:: python

    >>> morning = am.get_within_time_range(am.annotations, start_time='09:00', end_time='12:00')
    >>> morning
            set recording_filename  time_seek  range_onset  range_offset             raw_filename  ...          imported_at  error package_version          start_time  range_onset_time range_offset_time
    0          its    BN32_010007.mp3          0    7320000.0    18120000.0          BN32_010007.its  ...  2021-03-06 22:55:06    NaN           0.0.1 1900-01-01 06:58:00             09:00             12:00
    1          vtc    BN32_010007.mp3          0    7320000.0    18120000.0         BN32_010007.rttm  ...  2021-05-12 19:28:25    NaN           0.0.1 1900-01-01 06:58:00             09:00             12:00
    2          cha    BN32_010007.mp3          0    7320000.0    18120000.0          BN32_010007.cha  ...  2021-05-12 19:39:05    NaN           0.0.1 1900-01-01 06:58:00             09:00             12:00
    3          eaf    BN32_010007.mp3          0   13199449.0    13256801.0          BN32_010007.eaf  ...  2021-07-14 17:39:50    NaN           0.0.1 1900-01-01 06:58:00             10:37      10:38:56.352
    4  cha/aligned    BN32_010007.mp3          0    7320000.0    18120000.0  BN32_010007-aligned.csv  ...  2021-07-15 16:15:48    NaN           0.0.1 1900-01-01 06:58:00             09:00             12:00

    [5 rows x 15 columns]
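
The clipped ``range_onset`` and ``range_offset`` values follow from the start time of the recording,
shown as 06:58 in the ``start_time`` column above. The sketch below reproduces that arithmetic with the
standard library only. It is not the package's implementation; ``clock_to_offset_ms`` is a hypothetical
helper, and it assumes that the requested clock time falls on the same day as the start of the recording.

.. code-block:: python

    from datetime import datetime, timedelta

    def clock_to_offset_ms(start_time, clock_time, fmt='%H:%M'):
        """Milliseconds elapsed between the start of the recording and a clock time."""
        delta = datetime.strptime(clock_time, fmt) - datetime.strptime(start_time, fmt)
        return delta // timedelta(milliseconds=1)

    onset = clock_to_offset_ms('06:58', '09:00')   # 2 h 02 min after the start
    offset = clock_to_offset_ms('06:58', '12:00')  # 5 h 02 min after the start
    # onset == 7320000 and offset == 18120000, as in the output above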

The onset and offset timestamps of each segment can be calculated with :meth:`~ChildProject.annotations.AnnotationManager.get_segments_timestamps`:

.. code-block:: python

    >>> segments = am.get_segments(morning)
    >>> segments = am.get_segments_timestamps(segments)
    >>> segments[['speaker_type', 'onset_time', 'offset_time']]
        speaker_type              onset_time             offset_time
    0              CHI 2010-07-24 09:00:00.000 2010-07-24 09:20:39.793
    1              CHI 2010-07-24 09:20:39.793 2010-07-24 09:21:43.496
    2              CHI 2010-07-24 09:21:43.496 2010-07-24 09:23:45.168
    3              CHI 2010-07-24 09:23:45.168 2010-07-24 09:24:12.371
    4              CHI 2010-07-24 09:24:12.371 2010-07-24 09:27:27.019
    ...            ...                     ...                     ...
    11801          CHI 2010-07-24 11:56:50.584 2010-07-24 11:56:51.011
    11802          FEM 2010-07-24 11:57:15.749 2010-07-24 11:57:15.992
    11803          MAL 2010-07-24 11:57:24.637 2010-07-24 11:57:25.010
    11804       SPEECH 2010-07-24 11:57:35.237 2010-07-24 11:57:35.666
    11805          CHI 2010-07-24 11:57:35.314 2010-07-24 11:57:35.511

    [11806 rows x 3 columns]
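
Conceptually, each timestamp is the date and start time of the recording plus the offset of the
segment in milliseconds. The sketch below illustrates this with the values shown above (a recording
started on 2010-07-24 at 06:58). ``segment_timestamp`` is a hypothetical helper and does not reflect
how the package computes these columns internally.

.. code-block:: python

    from datetime import datetime, timedelta

    def segment_timestamp(recording_start, offset_ms):
        """Absolute timestamp of a position given in milliseconds since the recording started."""
        return recording_start + timedelta(milliseconds=offset_ms)

    start = datetime(2010, 7, 24, 6, 58)
    onset_time = segment_timestamp(start, 7320000)
    # onset_time == datetime(2010, 7, 24, 9, 0), i.e. 2010-07-24 09:00:00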


Module reference
~~~~~~~~~~~~~~~~

.. automodule:: ChildProject.annotations
   :members:
   :noindex: