.. _samplers:

Samplers
--------

Overview
~~~~~~~~

.. figure:: images/sampler_diagram.png
   :alt: Sampling recordings

   Sampling audio segments to be annotated with ChildProject.

A sampler draws segments from the recordings, according to the algorithm and the parameters defined by the user. The sampler produces two files in the ``destination`` folder:

- ``segments_YYYYMMDD_HHMMSS.csv``, a CSV dataframe of all sampled segments, with three columns: ``recording_filename``, ``segment_onset`` and ``segment_offset``.
- ``parameters_YYYYMMDD_HHMMSS.yml``, a YAML file with all the parameters that were used to generate the samples.

If the ``destination`` folder does not exist, it is created automatically.

Several samplers are implemented in our package; they are listed below.

The samples can then feed downstream pipelines such as the :ref:`zooniverse` pipeline or the :ref:`eaf-builder`.

.. clidoc::

   child-project sampler --help

All samplers have a few parameters in common:

- ``--recordings``, which sets the whitelist of recordings to sample from
- ``--exclude``, which defines the portions of audio to exclude from the samples *after* sampling.

Periodic sampler
~~~~~~~~~~~~~~~~

Draw segments from the recordings periodically.

The ``--period`` argument sets the time between the end of one segment and the start of the next, in milliseconds. For example, a length of 60000 (1 min) with a period of 3540000 (59 min) samples the first minute of every hour, whereas a length of 60000 (1 min) with a period of 3600000 (1 h) samples the 1st minute of the 1st hour, the 2nd minute of the 2nd hour, and so on.

The ``--by`` argument groups recordings to form a single timeline in which the periodicity defines the parts to annotate; those parts are then extracted from the recordings of the group. This means that consecutive recordings keep a continuous sampling period if they belong to the same session and sampling is done by ``session_id``.
It also means that concurrent recordings in the same session keep the same samples in terms of time and date, regardless of shifts in their start times. The default is to sample by ``recording_filename``, which simply samples each recording periodically and independently.

.. clidoc::

   child-project sampler /path/to/dataset /path/to/destination periodic --help

Vocalization sampler
~~~~~~~~~~~~~~~~~~~~

Draw segments from the recordings, targeting vocalizations from specific speaker type(s).

.. clidoc::

   child-project sampler /path/to/dataset /path/to/destination random-vocalizations --help

Energy-based sampler
~~~~~~~~~~~~~~~~~~~~

Draw segments from the recordings, targeting windows with energies above some threshold.

This algorithm proceeds by segmenting the recordings into windows; the energy of the signal is computed for each window (users have the option to apply a band-pass filter to calculate the energy in some frequency band).

Then, the algorithm samples as many windows as requested by the user from the windows that have energies above some threshold.

The energy threshold is defined in terms of an energy quantile. By default, it is set to 0.8, i.e., only the windows with the 20% highest energies are sampled from.

The sampling is performed unit by unit, where the unit is set through the ``--by`` option and can be either ``recording_filename`` (to sample an equal number of windows from each recording), ``session_id`` (to sample equally from each observing day), or ``child_id`` (to sample equally from each child).

.. clidoc::

   child-project sampler /path/to/dataset /path/to/destination energy-detection --help

High-Volubility sampler
~~~~~~~~~~~~~~~~~~~~~~~

Return the top ``windows_count`` windows (of length ``windows_length``) with the highest volubility from each recording, as calculated from the metric ``metric``.

``metric`` can be any of three values: words, turns, and vocs.

- The **words** metric sums the number of words within each window. For LENA annotations, it is equivalent to **awc**.
- The **turns** metric (aka ctc) sums conversational turns within each window. It relies on **lena_conv_turn_type** for LENA annotations. For other annotations, turns are estimated as adult/child speech switches in close temporal proximity.
- The **vocs** metric sums utterances (for LENA annotations) or vocalizations (for other annotations) within each window. If ``metric="vocs"`` and ``speakers=['CHI']``, it is equivalent to the usual cvc metric (child vocalization counts).

.. clidoc::

   child-project sampler /path/to/dataset /path/to/destination high-volubility --help

Conversation sampler
~~~~~~~~~~~~~~~~~~~~

The conversation sampler returns the conversational blocks with the highest number of turns (between adults and the key child). The first step is the detection of conversational blocks. Two consecutive vocalizations are considered part of the same conversational block if they are not separated by an interval longer than a certain duration, which is set to 1000 milliseconds by default.

Then, the number of conversational turns (by default, between the key child and female/male adults) is calculated for each conversational block. The sampler returns, for each unit, the requested number of conversations with the most turns.

This sampler, unlike the High-Volubility sampler, returns portions of audio with variable durations. Fixed duration can still be achieved by clipping or splitting each conversational block.

.. clidoc::

   child-project sampler /path/to/dataset /path/to/destination conversations --help

.. note::

    This sampler ignores LENA's conversational turn types.
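
The two steps of the conversation sampler described above (block detection, then turn counting) can be sketched in plain Python. This is only an illustration and not the package's implementation: the ``Vocalization`` record, the function names and the speaker labels ``CHI``, ``FEM`` and ``MAL`` are assumptions made for this example, and a turn is approximated here as two adjacent vocalizations that switch between the key child and an adult.

.. code-block:: python

   from collections import namedtuple

   # Hypothetical record type for this illustration (times in milliseconds).
   Vocalization = namedtuple("Vocalization", ["onset", "offset", "speaker"])


   def group_into_blocks(vocalizations, max_gap=1000):
       """Group vocalizations separated by at most ``max_gap`` ms of silence."""
       blocks = []
       block_end = None  # latest offset seen in the current block
       for voc in sorted(vocalizations, key=lambda v: v.onset):
           if blocks and voc.onset - block_end <= max_gap:
               blocks[-1].append(voc)
               block_end = max(block_end, voc.offset)
           else:
               blocks.append([voc])
               block_end = voc.offset
       return blocks


   def count_turns(block, child="CHI", adults=("FEM", "MAL")):
       """Count adjacent pairs that switch between the child and an adult."""
       adults = set(adults)
       turns = 0
       for previous, current in zip(block, block[1:]):
           speakers = {previous.speaker, current.speaker}
           if child in speakers and speakers & adults:
               turns += 1
       return turns


   def top_blocks(vocalizations, count, max_gap=1000):
       """Return the ``count`` blocks with the most turns, as segment dicts."""
       segments = [
           {
               "segment_onset": min(v.onset for v in block),
               "segment_offset": max(v.offset for v in block),
               "turns": count_turns(block),
           }
           for block in group_into_blocks(vocalizations, max_gap)
       ]
       return sorted(segments, key=lambda s: s["turns"], reverse=True)[:count]


   # Made-up vocalizations forming three blocks.
   vocalizations = [
       Vocalization(0, 500, "CHI"),
       Vocalization(800, 1500, "FEM"),
       Vocalization(1900, 2400, "CHI"),
       Vocalization(10000, 10600, "MAL"),
       Vocalization(11000, 11500, "CHI"),
       Vocalization(30000, 30400, "CHI"),
   ]
   print(top_blocks(vocalizations, count=2))

With this made-up input, the two selected blocks last 2400 ms and 1500 ms, which illustrates why the sampled segments have variable durations.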