ChildProject.pipelines package

Submodules

ChildProject.pipelines.anonymize module

class ChildProject.pipelines.anonymize.AnonymizationPipeline[source]

Bases: Pipeline

Anonymizes a set of .its annotations (input_set) and saves the result as output_set.

DEFAULT_REPLACEMENTS = {'Bar': {'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'BarSummary': {'leftBoundaryClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'rightBoundaryClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'Child': {'DOB': '1000-01-01', 'EnrollDate': '1000-01-01', 'id': 'A999'}, 'ChildInfo': {'dob': '1000-01-01'}, 'FiveMinuteSection': {'endClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'ITS': {'fileName': 'new_filename_1001', 'timeCreated': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'Item': {'timeStamp': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'PrimaryChild': {'DOB': '1000-01-01'}, 'ProcessingJob': {'logfile': 'exec10001010T100010Z_job00000001-10001010_101010_100100.upl.log'}, 'Recording': {'endClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'ResourceSnapshot': {'timegmt': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'timelocal': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'TransferTime': {'LocalTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'UTCTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}}
run(path: str, input_set: str, output_set: str, replacements_json_dict: str = '', **kwargs)[source]

Anonymizes a set of .its annotations (input_set) and saves the result as output_set.
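The shape of DEFAULT_REPLACEMENTS suggests that each top-level key names an XML node, each nested key names an attribute, and the value is either a literal replacement or a list of rule fragments. The sketch below illustrates that reading only; it is not the pipeline's implementation. The helper apply_replacements is hypothetical, and the interpretation of only_time (replace the date, keep the time of day) is a guess from the key name.

```python
import xml.etree.ElementTree as ET

# A small subset of DEFAULT_REPLACEMENTS, copied from the class attribute.
REPLACEMENTS = {
    "Child": {"DOB": "1000-01-01", "EnrollDate": "1000-01-01", "id": "A999"},
    "Recording": {
        "startClockTime": [{"replace_value": "1000-01-01"}, {"only_time": "true"}],
    },
}


def apply_replacements(root, replacements):
    """Hypothetical helper: rewrite attributes in place following the rules."""
    for node in root.iter():
        rules = replacements.get(node.tag, {})
        for attribute, rule in rules.items():
            if attribute not in node.attrib:
                continue
            if isinstance(rule, str):
                # Literal replacement of the attribute value.
                node.set(attribute, rule)
                continue
            # List of rule fragments: merge them into a single dict.
            options = {k: v for fragment in rule for k, v in fragment.items()}
            value = options["replace_value"]
            if options.get("only_time") == "true" and "T" in node.get(attribute):
                # Assumption: replace the date and keep the time of day.
                value += "T" + node.get(attribute).split("T", 1)[1]
            node.set(attribute, value)
    return root


# Made-up document; real .its files are much richer.
its = ET.fromstring(
    '<ITS><Child DOB="2019-05-04" id="C123"/>'
    '<Recording startClockTime="2020-02-03T09:15:00Z"/></ITS>'
)
apply_replacements(its, REPLACEMENTS)
child = its.find("Child")
recording = its.find("Recording")
print(child.attrib, recording.attrib)
```

Attributes that are absent from a node (EnrollDate here) are left alone rather than created.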

static setup_parser(parser)[source]

ChildProject.pipelines.eafbuilder module

class ChildProject.pipelines.eafbuilder.EafBuilderPipeline[source]

Bases: Pipeline

run(destination: str, segments: str, eaf_type: str, template: str, context_onset: int = 0, context_offset: int = 0, path: str | None = None, import_speech_from: str | None = None, **kwargs)[source]

Generates .eaf templates based on the intervals to code.

Parameters:
  • path (str) – project path

  • destination (str) – eaf destination

  • segments (str) – path to the input segments dataframe

  • eaf_type (str) – eaf type (random or periodic)

  • template (str) – name of the template to use (basic, native, or non-native)

  • context_onset (int) – difference between the segment onset and the context onset, in milliseconds; 0 for no introductory context

  • context_offset (int) – difference between the context offset and the segment offset, in milliseconds; 0 for no outro context
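As a rough illustration of the two context parameters, the sketch below computes the region shown around a segment. It assumes that context_onset is subtracted from the segment onset and context_offset is added to the segment offset; the helper name context_window and the clamping at 0 are choices made for this example, not taken from the pipeline.

```python
def context_window(segment_onset, segment_offset, context_onset=0, context_offset=0):
    """Return the (onset, offset) of the region including context, in milliseconds."""
    # Assumption: the context cannot start before the beginning of the recording.
    start = max(0, segment_onset - context_onset)
    end = segment_offset + context_offset
    return start, end


# No context: the window is the segment itself.
print(context_window(60_000, 75_000))
# 5 s of introductory context and 2 s of outro context.
print(context_window(60_000, 75_000, context_onset=5_000, context_offset=2_000))
```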

static setup_parser(parser)[source]
ChildProject.pipelines.eafbuilder.create_eaf(etf_path: str, id: str, output_dir: str, recording_filename: str, timestamps_list: list, eaf_type: str, contxt_on: int, contxt_off: int, template: str, speech_segments: DataFrame | None = None, imported_set: str | None = None, imported_format: str | None = None)[source]

ChildProject.pipelines.metrics module

class ChildProject.pipelines.metrics.AclewMetrics(project: ChildProject, vtc: str = 'vtc', alice: str = 'alice', vcm: str = 'vcm', recordings: str | List[str] | DataFrame | None = None, from_time: str | None = None, to_time: str | None = None, rec_cols: str | None = None, child_cols: str | None = None, period: str | None = None, segments: str | DataFrame | None = None, by: str = 'recording_filename', threads: int = 1)[source]

Bases: Metrics

ACLEW metrics extractor. Extracts a number of metrics from the ACLEW pipeline annotations, which include:

  • The Voice Type Classifier by Lavechin et al. (arXiv:2005.12656)

  • The Automatic LInguistic Unit Count Estimator (ALICE) by Räsänen et al. (doi:10.3758/s13428-020-01460-x)

  • The VoCalisation Maturity model (VCMNet) by Al Futaisi et al. (doi:10.1145/3340555.3353751)

Parameters:
  • project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.

  • vtc (str) – name of the set associated to the VTC annotations

  • alice (str) – name of the set associated to the ALICE annotations

  • vcm (str) – name of the set associated to the VCM annotations

  • recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None

  • from_time (str, optional) – If specified (in HH:MM:SS format), ignore annotations outside of the given time-range, defaults to None

  • to_time (str, optional) – If specified (in HH:MM:SS format), ignore annotations outside of the given time-range, defaults to None

  • rec_cols (str, optional) – comma-separated columns from recordings.csv to include in the output metrics; recording_filename, session_id, child_id and duration are always included if possible and don't need to be specified. Any column that is not unique for a given unit (e.g. date_iso for a child_id recorded on multiple days) will output a <NA> value

  • child_cols (str, optional) – comma-separated columns from children.csv to include in the output metrics, defaults to None

  • by (str, optional) – unit to extract metric from (recording_filename, experiment, child_id, session_id, segments), defaults to ‘recording_filename’, ‘segments’ is mandatory if passing the segments argument

  • period (str, optional) – time units to aggregate (optional); equivalent to pandas.Grouper freq argument.

  • segments (Union[str, pd.DataFrame], optional) – DataFrame or path to a csv file of the segments to extract from, containing ‘recording_filename’, ‘segment_onset’ and ‘segment_offset’ columns. To use this option, the by option must be set to ‘segments’. This option cannot be combined with the recordings, period, from_time or to_time options.

  • threads (int, optional) – number of threads to run on, defaults to 1
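The from_time and to_time options restrict the analysis to a time-of-day range. The sketch below shows one way such a filter could work; it is not the library's code. It assumes that the clock time of a segment is the recording start time plus the segment onset, that the range is half-open, and that it does not span midnight. The helper names clock_time and in_time_range are hypothetical.

```python
from datetime import datetime, timedelta


def clock_time(recording_start, onset_ms):
    """Time of day of a segment, given the recording start (HH:MM:SS) and an onset in ms."""
    start = datetime.strptime(recording_start, "%H:%M:%S")
    return (start + timedelta(milliseconds=onset_ms)).time()


def in_time_range(clock, from_time=None, to_time=None):
    """True when `clock` lies in [from_time, to_time); a None bound is ignored."""
    if from_time is not None and clock < datetime.strptime(from_time, "%H:%M:%S").time():
        return False
    if to_time is not None and clock >= datetime.strptime(to_time, "%H:%M:%S").time():
        return False
    return True


# Hypothetical recording starting at 08:30:00 with three segment onsets (ms):
# they fall at 08:30, 09:15 and 12:30 respectively.
onsets = [0, 45 * 60 * 1000, 4 * 60 * 60 * 1000]
kept = [
    onset
    for onset in onsets
    if in_time_range(clock_time("08:30:00", onset), "09:00:00", "12:00:00")
]
print(kept)  # only the 09:15 segment remains
```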

SUBCOMMAND = 'aclew'
static add_parser(subparsers, subcommand)[source]
class ChildProject.pipelines.metrics.CustomMetrics(project: ChildProject, metrics: str, recordings: str | List[str] | DataFrame | None = None, from_time: str | None = None, to_time: str | None = None, rec_cols: str | None = None, child_cols: str | None = None, by: str = 'recording_filename', period: str | None = None, segments: str | DataFrame | None = None, threads: int = 1)[source]

Bases: Metrics

Metrics extraction from a csv file. Extracts a number of metrics listed in a csv file as a dataframe. The csv file must contain the following columns:

  • ‘callable’: the name of the wanted metric, from the list of available metrics

  • ‘set’: the set of annotations to use for that specific metric (make sure this set has the required columns for that metric)

  • ‘name’ (optional): the name to give to that metric; if not given, a default name is attributed

  • any other argument required by the given metric (e.g. the voc_speaker_ph metric requires the ‘speaker’ argument: add a ‘speaker’ column to the csv file and fill its cells for this metric with the wanted value (CHI|FEM|MAL|OCH))
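A metrics file following the column layout described above might look like the example below. The set names (vtc, alice) and the custom name female_voc_ph are placeholders, and the parsing code only illustrates how empty cells can be read as "argument not given"; it does not reproduce how CustomMetrics reads the file.

```python
import csv
import io

# Hypothetical content of a metrics csv file.
METRICS_CSV = """callable,set,name,speaker
voc_speaker_ph,vtc,,CHI
voc_speaker_ph,vtc,female_voc_ph,FEM
wc_adu_ph,alice,,
"""

rows = list(csv.DictReader(io.StringIO(METRICS_CSV)))

# Drop empty cells so that each metric keeps only the arguments it was given.
metrics = [{key: value for key, value in row.items() if value != ""} for row in rows]
for metric in metrics:
    print(metric)
```

The first row leaves ‘name’ empty, so a default name would be attributed; the third row needs no ‘speaker’ because wc_adu_ph takes no required keyword argument.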

Parameters:
  • project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.

  • metrics (str) – name of the csv file listing the metrics to extract

  • recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None

  • from_time (str, optional) – If specified (in HH:MM:SS format), ignore annotations outside of the given time-range, defaults to None

  • to_time (str, optional) – If specified (in HH:MM:SS format), ignore annotations outside of the given time-range, defaults to None

  • rec_cols (str, optional) – comma-separated columns from recordings.csv to include in the output metrics; recording_filename, session_id, child_id and duration are always included if possible and don't need to be specified. Any column that is not unique for a given unit (e.g. date_iso for a child_id recorded on multiple days) will output a <NA> value

  • child_cols (str, optional) – comma-separated columns from children.csv to include in the output metrics, defaults to None

  • by (str, optional) – unit to extract metric from (recording_filename, experiment, child_id, session_id, segments), defaults to ‘recording_filename’, ‘segments’ is mandatory if passing the segments argument

  • period (str, optional) – time units to aggregate (optional); equivalent to pandas.Grouper freq argument.

  • segments (Union[str, pd.DataFrame], optional) – DataFrame or path to a csv file of the segments to extract from, containing ‘recording_filename’, ‘segment_onset’ and ‘segment_offset’ columns. To use this option, the by option must be set to ‘segments’. This option cannot be combined with the recordings, period, from_time or to_time options.

  • threads (int, optional) – number of threads to run on, defaults to 1

SUBCOMMAND = 'custom'
static add_parser(subparsers, subcommand)[source]
class ChildProject.pipelines.metrics.LenaMetrics(project: ChildProject, set: str, recordings: str | List[str] | DataFrame | None = None, from_time: str | None = None, to_time: str | None = None, rec_cols: str | None = None, child_cols: str | None = None, by: str = 'recording_filename', period: str | None = None, segments: str | DataFrame | None = None, threads: int = 1)[source]

Bases: Metrics

LENA metrics extractor. Extracts a number of metrics from the LENA .its annotations.

Parameters:
  • project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.

  • set (str) – name of the set associated to the .its annotations

  • recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None

  • from_time (str, optional) – If specified (in HH:MM:SS format), ignore annotations outside of the given time-range, defaults to None

  • to_time (str, optional) – If specified (in HH:MM:SS format), ignore annotations outside of the given time-range, defaults to None

  • rec_cols (str, optional) – comma-separated columns from recordings.csv to include in the output metrics; recording_filename, session_id, child_id and duration are always included if possible and don't need to be specified. Any column that is not unique for a given unit (e.g. date_iso for a child_id recorded on multiple days) will output a <NA> value

  • child_cols (str, optional) – comma-separated columns from children.csv to include in the output metrics, defaults to None

  • by (str, optional) – unit to extract metric from (recording_filename, experiment, child_id, session_id, segments), defaults to ‘recording_filename’, ‘segments’ is mandatory if passing the segments argument

  • period (str, optional) – time units to aggregate (optional); equivalent to pandas.Grouper freq argument.

  • segments (Union[str, pd.DataFrame], optional) – DataFrame or path to a csv file of the segments to extract from, containing ‘recording_filename’, ‘segment_onset’ and ‘segment_offset’ columns. To use this option, the by option must be set to ‘segments’. This option cannot be combined with the recordings, period, from_time or to_time options.

  • threads (int, optional) – number of threads to run on, defaults to 1
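The constraints on the segments option can be summarised as a small validation routine. This is a sketch written from the parameter descriptions only; option_errors is a hypothetical helper, and the library may report such problems differently.

```python
def option_errors(by="recording_filename", segments=None, recordings=None,
                  period=None, from_time=None, to_time=None):
    """Hypothetical check: return a list of problems, empty when the options are valid."""
    errors = []
    if segments is None:
        return errors
    # Passing segments requires by='segments'.
    if by != "segments":
        errors.append("by must be 'segments' when segments is given")
    # segments cannot be combined with these options.
    incompatible = {
        "recordings": recordings,
        "period": period,
        "from_time": from_time,
        "to_time": to_time,
    }
    for name, value in incompatible.items():
        if value is not None:
            errors.append(f"segments cannot be combined with {name}")
    return errors


print(option_errors(by="segments", segments="segments.csv"))
print(option_errors(segments="segments.csv", period="1h"))
```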

SUBCOMMAND = 'lena'
static add_parser(subparsers, subcommand)[source]
class ChildProject.pipelines.metrics.Metrics(project: ChildProject, metrics_list: DataFrame, by: str = 'recording_filename', recordings: str | List[str] | DataFrame | None = None, from_time: str | None = None, to_time: str | None = None, rec_cols: str | None = None, child_cols: str | None = None, period: str | None = None, segments: str | DataFrame | None = None, threads: int = 1)[source]

Bases: ABC

Main class for generating metrics from a project object and a list of desired metrics

Parameters:
  • project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.

  • metrics_list (pd.DataFrame) – pandas DataFrame containing the desired metrics (metrics functions are in metricsFunctions.py)

  • by (str, optional) – unit to extract metric from (recording_filename, experiment, child_id, session_id, segments), defaults to ‘recording_filename’, ‘segments’ is mandatory if passing the segments argument

  • recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None

  • from_time (str, optional) – If specified (in HH:MM:SS format), ignore annotations outside of the given time-range, defaults to None

  • to_time (str, optional) – If specified (in HH:MM:SS format), ignore annotations outside of the given time-range, defaults to None

  • rec_cols (str, optional) – comma-separated columns from recordings.csv to include in the output metrics; recording_filename, session_id, child_id and duration are always included if possible and don't need to be specified. Any column that is not unique for a given unit (e.g. date_iso for a child_id recorded on multiple days) will output a <NA> value

  • child_cols (str, optional) – comma-separated columns from children.csv to include in the output metrics, defaults to None

  • period (str, optional) – time units to aggregate (optional); equivalent to pandas.Grouper freq argument.

  • segments (Union[str, pd.DataFrame], optional) – DataFrame or path to a csv file of the segments to extract from, containing ‘recording_filename’, ‘segment_onset’ and ‘segment_offset’ columns. To use this option, the by option must be set to ‘segments’. This option cannot be combined with the recordings, period, from_time or to_time options.

  • threads (int, optional) – number of threads to run on, defaults to 1
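The rule stated for rec_cols (a column that is not unique within a unit yields <NA>) can be illustrated with plain Python. In this sketch None stands in for <NA>, the rows are made-up recordings metadata, and aggregate_column is a hypothetical helper rather than part of the Metrics class.

```python
def aggregate_column(rows, by, column):
    """One value per unit, or None when the unit has several distinct values."""
    values = {}
    for row in rows:
        values.setdefault(row[by], set()).add(row[column])
    return {
        unit: (next(iter(found)) if len(found) == 1 else None)
        for unit, found in values.items()
    }


recordings = [
    {"recording_filename": "rec1.wav", "child_id": "A", "date_iso": "2020-01-01"},
    {"recording_filename": "rec2.wav", "child_id": "A", "date_iso": "2020-01-02"},
    {"recording_filename": "rec3.wav", "child_id": "B", "date_iso": "2020-03-05"},
]

# Child A was recorded on two different days, so date_iso is ambiguous for that unit.
print(aggregate_column(recordings, by="child_id", column="date_iso"))
```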

extract()[source]

Computes the metric of each row of the initialized self.metrics (handles threading). Once the Metrics class is initialized, call this function to extract the metrics and populate self.metrics.

Returns:

DataFrame of computed metrics

Return type:

pandas.DataFrame

retrieve_segments(sets: List[str], row: str)[source]

from a list of sets and a row identifying the unit computed, return the relevant annotation segments

Parameters:
  • sets (List[str]) – List of annotation sets to keep

  • row (pandas.Series) – Series storing the information of the unit to compute

Returns:

relevant annotation DataFrame and index DataFrame

Return type:

(pandas.DataFrame , pandas.DataFrame)

class ChildProject.pipelines.metrics.MetricsPipeline[source]

Bases: Pipeline

run(path, destination, pipeline, func=None, **kwargs)[source]
static setup_parser(parser)[source]
class ChildProject.pipelines.metrics.MetricsSpecificationPipeline[source]

Bases: Pipeline

run(parameters_input, func=None)[source]
static setup_parser(parser)[source]

ChildProject.pipelines.metricsFunctions module

ChildProject.pipelines.metricsFunctions.avg_can_voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

average duration of canonical vocalizations for a given speaker type (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.avg_cry_voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

average duration of cry vocalizations by a given speaker type (based on vcm_type or lena cries)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.avg_non_can_voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

average duration of non-canonical vocalizations for a given speaker type (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.avg_voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

average duration in milliseconds of vocalizations for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use
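As an illustration of what an average-duration metric computes, the sketch below uses a list of dictionaries as a stand-in for the annotations DataFrame. The column names segment_onset, segment_offset and speaker_type are assumed, and returning None for an empty selection is a choice made for this example.

```python
def avg_voc_dur(annotations, speaker):
    """Mean duration (ms) of the vocalizations of one speaker type, None if there are none."""
    durations = [
        row["segment_offset"] - row["segment_onset"]
        for row in annotations
        if row["speaker_type"] == speaker
    ]
    return sum(durations) / len(durations) if durations else None


# Made-up annotations; onsets and offsets are in milliseconds.
annotations = [
    {"speaker_type": "CHI", "segment_onset": 1000, "segment_offset": 1500},
    {"speaker_type": "FEM", "segment_onset": 2000, "segment_offset": 3200},
    {"speaker_type": "CHI", "segment_onset": 4000, "segment_offset": 5500},
]
print(avg_voc_dur(annotations, "CHI"))  # (500 + 1500) / 2
```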

ChildProject.pipelines.metricsFunctions.can_voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

total duration of canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.can_voc_dur_speaker_ph(annotations: DataFrame, duration: int, **kwargs)

total duration of canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

This value is a ‘per hour’ value.
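A ‘per hour’ value is the raw value scaled to one hour of audio. The sketch below assumes that the duration argument is expressed in milliseconds; per_hour is a hypothetical helper, not the per_hour_metric decorator itself.

```python
MS_PER_HOUR = 3_600_000


def per_hour(value, duration_ms):
    """Scale a raw count or duration to a one-hour rate."""
    return value * MS_PER_HOUR / duration_ms


# 120 vocalizations over 2 hours of audio correspond to 60 per hour.
print(per_hour(120, 2 * MS_PER_HOUR))
```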

ChildProject.pipelines.metricsFunctions.can_voc_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

number of canonical vocalizations for a given speaker type (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.can_voc_speaker_ph(annotations: DataFrame, duration: int, **kwargs)

number of canonical vocalizations for a given speaker type (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.cp_dur(annotations: DataFrame, duration: int, **kwargs)[source]

canonical proportion on the duration of vocalizations for CHI (based on vcm_type)

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.cp_n(annotations: DataFrame, duration: int, **kwargs)[source]

canonical proportion on the number of vocalizations for CHI (based on vcm_type)

Required keyword arguments: none
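The canonical proportion can be sketched as canonical / (canonical + non-canonical) over the key child's vocalizations. The example assumes that vcm_type uses the codes "C" (canonical) and "N" (non-canonical) and that, by analogy with lp_dur and lp_n, the duration-based variant weights each vocalization by its length. Neither assumption is confirmed by the text above, and canonical_proportion is a hypothetical helper.

```python
def canonical_proportion(annotations, weight_by_duration=False):
    """Share of canonical vocalizations among CHI canonical + non-canonical ones."""
    canonical = non_canonical = 0
    for row in annotations:
        if row["speaker_type"] != "CHI":
            continue
        # Weight by duration (ms) for a cp_dur-like value, by 1 for a cp_n-like value.
        weight = row["segment_offset"] - row["segment_onset"] if weight_by_duration else 1
        if row["vcm_type"] == "C":  # assumed code for canonical
            canonical += weight
        elif row["vcm_type"] == "N":  # assumed code for non-canonical
            non_canonical += weight
    total = canonical + non_canonical
    return canonical / total if total else None


annotations = [
    {"speaker_type": "CHI", "vcm_type": "C", "segment_onset": 0, "segment_offset": 1000},
    {"speaker_type": "CHI", "vcm_type": "N", "segment_onset": 2000, "segment_offset": 2500},
    {"speaker_type": "CHI", "vcm_type": "N", "segment_onset": 3000, "segment_offset": 3500},
    {"speaker_type": "FEM", "vcm_type": None, "segment_onset": 4000, "segment_offset": 5000},
]
print(canonical_proportion(annotations))
print(canonical_proportion(annotations, weight_by_duration=True))
```

With these made-up annotations, one of three CHI vocalizations is canonical, but it accounts for half of the vocalization time.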

ChildProject.pipelines.metricsFunctions.cry_voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

total duration of cry vocalizations by a given speaker type in milliseconds (based on vcm_type or lena cry)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.cry_voc_dur_speaker_ph(annotations: DataFrame, duration: int, **kwargs)

total duration of cry vocalizations by a given speaker type in milliseconds (based on vcm_type or lena cry)

Required keyword arguments:
  • speaker : speaker_type to use

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.cry_voc_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

number of cry vocalizations for a given speaker (based on vcm_type or lena cries)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.cry_voc_speaker_ph(annotations: DataFrame, duration: int, **kwargs)

number of cry vocalizations for a given speaker (based on vcm_type or lena cries)

Required keyword arguments:
  • speaker : speaker_type to use

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.lena_CTC(annotations: DataFrame, duration: int, **kwargs)[source]

number of conversational turn counts according to LENA’s extraction

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.lena_CTC_ph(annotations: DataFrame, duration: int, **kwargs)

number of conversational turn counts according to LENA’s extraction

Required keyword arguments: none

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.lena_CVC(annotations: DataFrame, duration: int, **kwargs)[source]

number of child vocalizations according to LENA’s extraction

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.lena_CVC_ph(annotations: DataFrame, duration: int, **kwargs)

number of child vocalizations according to LENA’s extraction

Required keyword arguments: none

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.lp_dur(annotations: DataFrame, duration: int, **kwargs)[source]

linguistic proportion on the duration of vocalizations for CHI (based on vcm_type or [child_cry_vfxs_len,utterances_length] if vcm_type does not exist)

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.lp_n(annotations: DataFrame, duration: int, **kwargs)[source]

linguistic proportion on the number of vocalizations for CHI (based on vcm_type or [cries,vfxs,utterances_count] if vcm_type does not exist)

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.metricFunction(args: set, columns: Set[str] | Tuple[Set[str], ...], empty_value=0, default_name: str | None = None)[source]

Decorator for all metrics functions to make them ready to be called by the pipeline.

Parameters:
  • args (set) – set of required keyword arguments for that function; a ValueError is raised if any of them is not given. The keywords name, callable and set are reserved and cannot be used

  • columns (set) – required columns in the dataframe given, missing columns raise ValueError

  • default_name (str) – default name to use for the metric in the resulting dataframe. Every keyword argument found in the name will be replaced by its value (e.g. ‘voc_speaker_ph’ uses kwarg ‘speaker’ so if speaker = ‘CHI’, name will be ‘voc_chi_ph’). if no name is given, the __name__ of the function is used

  • empty_value (float|int) – value to return when annotations are empty but the unit was annotated (e.g. 0 for counts like voc_speaker_ph , None for proportions like lp_n)

Returns:

new function to substitute the metric function

Return type:

Callable
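The checks described for this decorator can be sketched in a few lines. The version below is a stand-in written from the parameter descriptions only: metric_function is a hypothetical name, annotations are a list of dictionaries rather than a DataFrame, and returning a (name, value) pair is an assumption about how the pipeline collects results.

```python
import functools

RESERVED_KEYWORDS = {"name", "callable", "set"}


def metric_function(args, columns, empty_value=0, default_name=None):
    """Hypothetical stand-in for metricFunction, following its documented checks."""
    if RESERVED_KEYWORDS & set(args):
        raise ValueError("name, callable and set are reserved keywords")

    def decorator(function):
        @functools.wraps(function)
        def wrapper(annotations, duration, **kwargs):
            missing = set(args) - set(kwargs)
            if missing:
                raise ValueError(f"missing keyword arguments: {sorted(missing)}")
            # Substitute each keyword found in the name by its (lowercased) value.
            name = default_name or function.__name__
            for keyword in args:
                name = name.replace(keyword, str(kwargs[keyword]).lower())
            if not annotations:
                # The unit was annotated but holds no annotation.
                return name, empty_value
            missing_columns = set(columns) - set(annotations[0])
            if missing_columns:
                raise ValueError(f"missing columns: {sorted(missing_columns)}")
            return name, function(annotations, duration, **kwargs)

        return wrapper

    return decorator


@metric_function(args={"speaker"}, columns={"speaker_type"})
def voc_speaker(annotations, duration, **kwargs):
    """Number of vocalizations for a given speaker type."""
    return sum(1 for row in annotations if row["speaker_type"] == kwargs["speaker"])


rows = [{"speaker_type": "CHI"}, {"speaker_type": "FEM"}, {"speaker_type": "CHI"}]
print(voc_speaker(rows, 3_600_000, speaker="CHI"))
print(voc_speaker([], 3_600_000, speaker="CHI"))
```

Because no default_name is passed, the function's __name__ is used, and the ‘speaker’ keyword in it is replaced by ‘chi’.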

ChildProject.pipelines.metricsFunctions.non_can_voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

total duration of non-canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.non_can_voc_dur_speaker_ph(annotations: DataFrame, duration: int, **kwargs)

total duration of non-canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.non_can_voc_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

number of non-canonical vocalizations for a given speaker type (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.non_can_voc_speaker_ph(annotations: DataFrame, duration: int, **kwargs)

number of non-canonical vocalizations for a given speaker type (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.pc_adu(annotations: DataFrame, duration: int, **kwargs)[source]

number of phonemes for all speakers

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.pc_adu_ph(annotations: DataFrame, duration: int, **kwargs)

number of phonemes for all speakers

Required keyword arguments: none

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.pc_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

number of phonemes for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.pc_speaker_ph(annotations: DataFrame, duration: int, **kwargs)

number of phonemes for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.peak_can_voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: total duration of canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.peak_can_voc_speaker(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of canonical vocalizations for a given speaker type (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.peak_cry_voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: total duration of cry vocalizations by a given speaker type in milliseconds (based on vcm_type or lena cry)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.peak_cry_voc_speaker(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of cry vocalizations for a given speaker (based on vcm_type or lena cries)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.peak_hour_metric(empty_value=0)[source]

empty_value : should repeat the empty value of the metric function wrapper (as this will be used for empty periods)
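A peak-hour metric reports the highest value reached by a metric over any one-hour window. The sketch below counts vocalizations in consecutive one-hour windows measured from the start of the recording and returns the maximum. How the real wrapper delimits its windows (for instance clock hours versus recording-relative hours) is not stated here, so treat the binning as an assumption; peak_hour_count is a hypothetical helper.

```python
from collections import Counter

MS_PER_HOUR = 3_600_000


def peak_hour_count(annotations, speaker, empty_value=0):
    """Largest number of vocalizations by `speaker` within a single one-hour window."""
    per_window = Counter(
        row["segment_onset"] // MS_PER_HOUR
        for row in annotations
        if row["speaker_type"] == speaker
    )
    # empty_value is returned when no window contains a matching vocalization.
    return max(per_window.values(), default=empty_value)


MINUTE = 60_000
# Made-up onsets: two CHI vocalizations in the first hour, three in the second.
annotations = [
    {"speaker_type": "CHI", "segment_onset": onset * MINUTE}
    for onset in (10, 20, 70, 80, 90)
]
print(peak_hour_count(annotations, "CHI"))
```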

ChildProject.pipelines.metricsFunctions.peak_lena_CTC(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of conversational turn counts according to LENA’s extraction

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.peak_lena_CVC(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of child vocalizations according to LENA’s extraction

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.peak_non_can_voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: total duration of non-canonical vocalizations by a given speaker type in milliseconds (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.peak_non_can_voc_speaker(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of non-canonical vocalizations for a given speaker type (based on vcm_type)

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.peak_pc_adu(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of phonemes for all speakers

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.peak_pc_speaker(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of phonemes for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.peak_sc_adu(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of syllables for all speakers

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.peak_sc_speaker(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of syllables for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.peak_simple_CTC(annotations: DataFrame, duration: int, interlocutors_1=('CHI',), interlocutors_2=('FEM', 'MAL', 'OCH'), max_interval=1000, min_delay=0, **kwargs)

Computing the peak for 1h for the following metric: number of conversational turn counts based on vocalizations occurring in a given interval of one another

keyword arguments:
  • interlocutors_1 : first group of interlocutors, default = ('CHI',)

  • interlocutors_2 : second group of interlocutors, default = ('FEM', 'MAL', 'OCH')

  • max_interval : maximum interval in ms for it to be considered a turn, default = 1000

  • min_delay : minimum delay between somebody starting speaking, default = 0

ChildProject.pipelines.metricsFunctions.peak_voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: total duration of vocalizations by a given speaker type in milliseconds per hour

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.peak_voc_speaker(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of vocalizations for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.peak_wc_adu(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of words for all speakers

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.peak_wc_speaker(annotations: DataFrame, duration: int, **kwargs)

Computing the peak for 1h for the following metric: number of words for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.per_hour_metric()[source]

ChildProject.pipelines.metricsFunctions.sc_adu(annotations: DataFrame, duration: int, **kwargs)[source]

number of syllables for all speakers

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.sc_adu_ph(annotations: DataFrame, duration: int, **kwargs)

number of syllables for all speakers

Required keyword arguments: none

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.sc_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

number of syllables for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.sc_speaker_ph(annotations: DataFrame, duration: int, **kwargs)

number of syllables for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.simple_CTC(annotations: DataFrame, duration: int, interlocutors_1=('CHI',), interlocutors_2=('FEM', 'MAL', 'OCH'), max_interval=1000, min_delay=0, **kwargs)[source]

number of conversational turn counts based on vocalizations occurring in a given interval of one another

keyword arguments:
  • interlocutors_1 : first group of interlocutors, default = ('CHI',)

  • interlocutors_2 : second group of interlocutors, default = ('FEM', 'MAL', 'OCH')

  • max_interval : maximum interval in ms for it to be considered a turn, default = 1000

  • min_delay : minimum delay between somebody starting speaking, default = 0
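One plausible reading of these parameters is sketched below: a turn is counted when a vocalization from one group is followed by a vocalization from the other group that starts at most max_interval ms after the first one ends, and at least min_delay ms after the first one starts. This is an interpretation for illustration only. The function count_turns is hypothetical, uses a list of dictionaries instead of a DataFrame, and ignores subtleties such as overlapping speech that the real metric may handle differently.

```python
def count_turns(annotations, interlocutors_1=("CHI",),
                interlocutors_2=("FEM", "MAL", "OCH"),
                max_interval=1000, min_delay=0):
    """Count alternations between the two groups of interlocutors."""
    groups = {speaker: 1 for speaker in interlocutors_1}
    groups.update({speaker: 2 for speaker in interlocutors_2})
    segments = sorted(
        (row for row in annotations if row["speaker_type"] in groups),
        key=lambda row: row["segment_onset"],
    )
    turns = 0
    for previous, current in zip(segments, segments[1:]):
        different_group = groups[previous["speaker_type"]] != groups[current["speaker_type"]]
        gap = current["segment_onset"] - previous["segment_offset"]
        delay = current["segment_onset"] - previous["segment_onset"]
        if different_group and gap <= max_interval and delay >= min_delay:
            turns += 1
    return turns


# Made-up annotations; onsets and offsets are in milliseconds.
annotations = [
    {"speaker_type": "CHI", "segment_onset": 0, "segment_offset": 500},
    {"speaker_type": "FEM", "segment_onset": 800, "segment_offset": 1500},
    {"speaker_type": "CHI", "segment_onset": 4000, "segment_offset": 4500},
    {"speaker_type": "MAL", "segment_onset": 4600, "segment_offset": 5000},
    {"speaker_type": "MAL", "segment_onset": 5200, "segment_offset": 5600},
]
print(count_turns(annotations))
print(count_turns(annotations, max_interval=50))
```

With the default settings the CHI to FEM and CHI to MAL exchanges count as turns; the FEM to CHI gap of 2500 ms is too long, and MAL to MAL stays within one group.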

ChildProject.pipelines.metricsFunctions.simple_CTC_ph(annotations: DataFrame, duration: int, interlocutors_1=('CHI',), interlocutors_2=('FEM', 'MAL', 'OCH'), max_interval=1000, min_delay=0, **kwargs)

number of conversational turn counts based on vocalizations occurring in a given interval of one another

keyword arguments:
  • interlocutors_1 : first group of interlocutors, default = ('CHI',)

  • interlocutors_2 : second group of interlocutors, default = ('FEM', 'MAL', 'OCH')

  • max_interval : maximum interval in ms for it to be considered a turn, default = 1000

  • min_delay : minimum delay between somebody starting speaking, default = 0

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.voc_dur_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

total duration of vocalizations by a given speaker type in milliseconds

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.voc_dur_speaker_ph(annotations: DataFrame, duration: int, **kwargs)

total duration of vocalizations by a given speaker type in milliseconds per hour

Required keyword arguments:
  • speaker : speaker_type to use

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.voc_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

number of vocalizations for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.voc_speaker_ph(annotations: DataFrame, duration: int, **kwargs)

number of vocalizations for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.wc_adu(annotations: DataFrame, duration: int, **kwargs)[source]

number of words for all speakers

Required keyword arguments: none

ChildProject.pipelines.metricsFunctions.wc_adu_ph(annotations: DataFrame, duration: int, **kwargs)

number of words for all speakers

Required keyword arguments: none

This value is a ‘per hour’ value.

ChildProject.pipelines.metricsFunctions.wc_speaker(annotations: DataFrame, duration: int, **kwargs)[source]

number of words for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

ChildProject.pipelines.metricsFunctions.wc_speaker_ph(annotations: DataFrame, duration: int, **kwargs)

number of words for a given speaker type

Required keyword arguments:
  • speaker : speaker_type to use

This value is a ‘per hour’ value.

ChildProject.pipelines.pipeline module

class ChildProject.pipelines.pipeline.Pipeline[source]

Bases: ABC

check_setup()[source]
static recordings_from_list(recordings)[source]
abstract run(**kwargs)[source]
setup()[source]
static setup_pipeline(parser)[source]

ChildProject.pipelines.processors module

class ChildProject.pipelines.processors.AudioProcessingPipeline[source]

Bases: Pipeline

run(path: str, processor: str, threads: int = 1, func=None, **kwargs)[source]
static setup_parser(parser)[source]
class ChildProject.pipelines.processors.AudioProcessor(project: ChildProject, name: str, input_profile: str | None = None, threads: int = 1, recordings: str | List[str] | DataFrame | None = None)[source]

Bases: ABC

static add_parser(parsers)[source]
export_metadata()[source]
output_directory()[source]
process(parameters)[source]
abstract process_recording(recording)[source]
read_metadata()[source]
class ChildProject.pipelines.processors.AudioStandard(project: ChildProject, threads: int = 1, recordings: str | List[str] | DataFrame | None = None, skip_existing: bool = False, input_profile: str | None = None)[source]

Bases: AudioProcessor

SUBCOMMAND = 'standard'
static add_parser(subparsers, subcommand)[source]
process_recording(recording)[source]
class ChildProject.pipelines.processors.BasicProcessor(project: ChildProject, name: str, format: str, codec: str, sampling: int, threads: int = 1, recordings: str | List[str] | DataFrame | None = None, skip_existing: bool = False, input_profile: str | None = None)[source]

Bases: AudioProcessor

SUBCOMMAND = 'basic'
static add_parser(subparsers, subcommand)[source]
process_recording(recording)[source]
class ChildProject.pipelines.processors.ChannelMapper(project: ChildProject, name: str, channels: list, threads: int = 1, recordings: str | List[str] | DataFrame | None = None, input_profile: str | None = None)[source]

Bases: AudioProcessor

SUBCOMMAND = 'channel-mapping'
static add_parser(subparsers, subcommand)[source]
process_recording(recording)[source]
class ChildProject.pipelines.processors.VettingProcessor(project: ChildProject, name: str, segments_path: str, threads: int = 1, recordings: str | List[str] | DataFrame | None = None, input_profile: str | None = None)[source]

Bases: AudioProcessor

SUBCOMMAND = 'vetting'
static add_parser(subparsers, subcommand)[source]
process_recording(recording)[source]

ChildProject.pipelines.samplers module

class ChildProject.pipelines.samplers.ConversationSampler(project: ChildProject, annotation_set: str, count: int, interval: int = 1000, speakers: List[str] = ['FEM', 'MAL', 'CHI'], threads: int = 1, by: str = 'recording_filename', recordings: str | List[str] | DataFrame | None = None, exclude: str | DataFrame | None = None)[source]

Bases: Sampler

Conversation sampler.

Parameters:
  • project (ChildProject.projects.ChildProject) – ChildProject instance

  • annotation_set (str) – set of annotations to derive conversations from

  • count (int) – amount of conversations to sample

  • interval (int, optional) – maximum time-interval between two consecutive vocalizations (in milliseconds) to consider them part of the same conversational block, defaults to 1000

  • speakers (List[str], optional) – list of speakers to target, defaults to [“FEM”, “MAL”, “CHI”]

  • threads (int, optional) – threads to run on, defaults to 1

  • by (str, optional) – units to sample from, defaults to “recording_filename”

  • recordings (Union[str, List[str], pd.DataFrame], optional) – whitelist of recordings, defaults to None

  • exclude (Union[str, pd.DataFrame], optional) – portions to exclude, defaults to None

SUBCOMMAND = 'conversations'
static add_parser(subparsers, subcommand)[source]
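The role of the interval parameter of ConversationSampler can be illustrated with a small sketch that groups vocalizations into conversational blocks. It is not the sampler's actual code: the tuple layout and the helper name are assumptions, and how the sampler then picks count conversations among the blocks is not covered here.

```python
def conversational_blocks(vocalizations, interval=1000,
                          speakers=("FEM", "MAL", "CHI")):
    """Group vocalizations into blocks (hypothetical helper).

    vocalizations: list of (speaker_type, onset_ms, offset_ms) tuples.
    A vocalization joins the current block when it starts at most
    `interval` ms after the latest offset seen in that block.
    """
    selected = sorted(
        (v for v in vocalizations if v[0] in speakers),
        key=lambda v: v[1],
    )
    blocks = []
    block_end = None
    for voc in selected:
        if blocks and voc[1] - block_end <= interval:
            blocks[-1].append(voc)
            block_end = max(block_end, voc[2])
        else:
            blocks.append([voc])
            block_end = voc[2]
    return blocks


example_vocs = [
    ("CHI", 0, 400),
    ("FEM", 900, 1500),
    ("MAL", 3000, 3600),
    ("OCH", 3700, 3900),  # ignored: not in the default speakers
    ("CHI", 4200, 4500),
]
blocks = conversational_blocks(example_vocs)
```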
class ChildProject.pipelines.samplers.CustomSampler(project: ChildProject, segments_path: str, recordings: str | List[str] | DataFrame | None = None, exclude: str | DataFrame | None = None)[source]

Bases: Sampler

SUBCOMMAND = 'custom'
static add_parser(subparsers, subcommand)[source]
class ChildProject.pipelines.samplers.EnergyDetectionSampler(project: ChildProject, windows_length: int, windows_spacing: int, windows_count: int, windows_offset: int = 0, threshold: float = 0.8, low_freq: int = 0, high_freq: int = 100000, threads: int = 1, profile: str = '', by: str = 'recording_filename', recordings: str | List[str] | DataFrame | None = None, exclude: str | DataFrame | None = None)[source]

Bases: Sampler

Sample windows within each recording, targeting those that have a signal energy higher than some threshold.

Parameters:
  • project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.

  • windows_length (int) – Length of each window, in milliseconds.

  • windows_spacing (int) – Spacing between the start of each window, in milliseconds.

  • windows_count (int) – How many windows to retain per recording.

  • windows_offset (int, optional) – start of the first window, in milliseconds, defaults to 0

  • threshold (float, optional) – lowest energy quantile to sample from, defaults to 0.8

  • low_freq (int, optional) – if > 0, frequencies below will be filtered before calculating the energy, defaults to 0

  • high_freq (int, optional) – if < 100000, frequencies above will be filtered before calculating the energy, defaults to 100000

  • by (str, optional) – units to sample from, defaults to ‘recording_filename’

  • recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None

  • threads (int, optional) – amount of threads to run on, defaults to 1

SUBCOMMAND = 'energy-detection'
static add_parser(subparsers, subcommand)[source]
compute_energy_loudness(chunk, sampling_frequency: int)[source]
get_recording_windows(recording)[source]
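The way the EnergyDetectionSampler parameters interact can be sketched as follows. This is a simplified illustration under assumptions, not the sampler's implementation: window energies are supplied directly instead of being computed from audio, the quantile cutoff is a basic rank lookup, and both helper names are hypothetical.

```python
import random


def window_onsets(duration, windows_length, windows_spacing, windows_offset=0):
    """Onsets (ms) of all full windows that fit within the recording."""
    return list(range(windows_offset, duration - windows_length + 1,
                      windows_spacing))


def sample_energetic_windows(energies, windows_count, threshold=0.8, seed=0):
    """Pick windows whose energy is at or above the `threshold` quantile.

    energies: dict mapping window onset (ms) to an energy value.
    """
    ranked = sorted(energies.values())
    # Simple rank-based quantile; real code would likely interpolate.
    cutoff = ranked[min(int(threshold * len(ranked)), len(ranked) - 1)]
    candidates = [onset for onset, e in energies.items() if e >= cutoff]
    rng = random.Random(seed)
    picked = rng.sample(candidates, min(windows_count, len(candidates)))
    return sorted(picked)


onsets = window_onsets(duration=10_000, windows_length=1000,
                       windows_spacing=1000)
# Toy energies: the later the window, the louder it is.
energies = {onset: float(i) for i, onset in enumerate(onsets)}
selected = sample_energetic_windows(energies, windows_count=2)
```

With ten windows and threshold = 0.8, only the two loudest windows are eligible, so both are returned.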
class ChildProject.pipelines.samplers.HighVolubilitySampler(project: ChildProject, annotation_set: str, metric: str, windows_length: int, windows_count: int, speakers: List[str] = ['FEM', 'MAL', 'CHI'], threads: int = 1, by: str = 'recording_filename', recordings: str | List[str] | DataFrame | None = None, exclude: str | DataFrame | None = None)[source]

Bases: Sampler

Return the top windows_count windows (of length windows_length) with the highest volubility from each recording, as calculated from the chosen metric.

metric can be any of three values: words, turns, and vocs.

  • The words metric sums the amount of words within each window. For LENA annotations, it is equivalent to awc.

  • The turns metric (aka ctc) sums conversational turns within each window. It relies on lena_conv_turn_type for LENA annotations. For other annotations, turns are estimated as adult/child speech switches in close temporal proximity.

  • The vocs metric sums vocalizations within each window. If metric="vocs" and speakers=['CHI'], it is equivalent to the usual cvc metric (child vocalization counts).

Parameters:
  • project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.

  • annotation_set (str) – set of annotations to calculate volubility from.

  • metric (str) – the metric used to evaluate volubility. Should be one of ‘words’, ‘turns’, ‘vocs’.

  • windows_length (int) – length of the windows, in milliseconds

  • windows_count (int) – amount of top regions to extract per recording

  • by (str, optional) – units to sample from, defaults to ‘recording_filename’

  • recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None

  • threads (int, optional) – amount of threads to run the sampler on, defaults to 1

SUBCOMMAND = 'high-volubility'
static add_parser(subparsers, subcommand)[source]
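For the vocs metric of HighVolubilitySampler, the selection of the top windows can be sketched in a few lines. This is only an illustration: assigning each vocalization to a window by its onset, the tie-breaking rule, and the helper name are assumptions rather than the sampler's documented behaviour.

```python
from collections import Counter


def top_windows(vocalizations, windows_length, windows_count,
                speakers=("FEM", "MAL", "CHI")):
    """Return the busiest windows as (onset_ms, offset_ms, count) tuples.

    vocalizations: list of (speaker_type, onset_ms, offset_ms) tuples.
    """
    # Count vocalizations per window index, keeping only target speakers.
    counts = Counter(
        onset // windows_length
        for speaker, onset, _offset in vocalizations
        if speaker in speakers
    )
    # Highest counts first; earlier windows win ties.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (index * windows_length, (index + 1) * windows_length, n)
        for index, n in best[:windows_count]
    ]


example_vocs = [
    ("CHI", 100, 300),
    ("CHI", 1200, 1300),
    ("FEM", 1500, 1900),
    ("MAL", 1800, 2100),
    ("OCH", 2500, 2600),  # ignored: not in the default speakers
    ("CHI", 3100, 3200),
    ("CHI", 3300, 3400),
]
best = top_windows(example_vocs, windows_length=1000, windows_count=2)
```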
class ChildProject.pipelines.samplers.PeriodicSampler(project: ChildProject, length: int, period: int, offset: int = 0, profile: str | None = None, recordings: str | List[str] | DataFrame | None = None, exclude: str | DataFrame | None = None)[source]

Bases: Sampler

Periodic sampling of a recording.

Parameters:
  • project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.

  • length (int) – length of each segment, in milliseconds

  • period (int) – spacing between two consecutive segments, in milliseconds

  • offset (int, optional) – offset of the first segment, in milliseconds, defaults to 0

  • recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None

SUBCOMMAND = 'periodic'
static add_parser(subparsers, subcommand)[source]
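The PeriodicSampler parameters can be pictured with a short sketch. It assumes that period is the gap between the end of one segment and the start of the next, and that only segments fitting entirely within the recording are kept; both points, along with the helper name, are assumptions rather than confirmed behaviour.

```python
def periodic_segments(duration, length, period, offset=0):
    """List (onset_ms, offset_ms) pairs sampled periodically.

    duration: recording length in milliseconds.
    """
    segments = []
    onset = offset
    while onset + length <= duration:
        segments.append((onset, onset + length))
        # Skip the segment itself, then the spacing before the next one.
        onset += length + period
    return segments


segments = periodic_segments(duration=10_000, length=1000,
                             period=2000, offset=500)
```

For a 10-second recording this yields three one-second segments starting at 500, 3500 and 6500 ms.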
class ChildProject.pipelines.samplers.RandomVocalizationSampler(project: ChildProject, annotation_set: str, target_speaker_type: list, sample_size: int, threads: int = 1, by: str = 'recording_filename', recordings: str | List[str] | DataFrame | None = None, exclude: str | DataFrame | None = None)[source]

Bases: Sampler

Sample vocalizations based on some input annotation set.

Parameters:
  • project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.

  • annotation_set (str) – Set of annotations to get vocalizations from.

  • target_speaker_type (list) – List of speaker types to sample vocalizations from.

  • sample_size (int) – Amount of vocalizations to sample, per recording.

  • by (str, optional) – units to sample from, defaults to ‘recording_filename’

  • recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None

  • threads (int, optional) – amount of threads to run on, defaults to 1

SUBCOMMAND = 'random-vocalizations'
static add_parser(subparsers, subcommand)[source]
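The sampling performed by RandomVocalizationSampler can be approximated by the sketch below, which draws up to sample_size vocalizations per recording among the targeted speaker types. The list-of-dicts layout, the seed argument, and the helper name are assumptions for illustration; the real sampler works on the project's annotations.

```python
import random


def sample_vocalizations(vocalizations, target_speaker_type, sample_size,
                         seed=0):
    """Randomly pick vocalizations per recording (hypothetical helper).

    vocalizations: list of dicts with 'recording_filename' and
    'speaker_type' keys.
    """
    by_recording = {}
    for voc in vocalizations:
        if voc["speaker_type"] in target_speaker_type:
            by_recording.setdefault(voc["recording_filename"], []).append(voc)

    rng = random.Random(seed)
    return {
        recording: rng.sample(vocs, min(sample_size, len(vocs)))
        for recording, vocs in by_recording.items()
    }


example_vocs = [
    {"recording_filename": "rec1.wav", "speaker_type": "CHI", "onset": 0},
    {"recording_filename": "rec1.wav", "speaker_type": "FEM", "onset": 500},
    {"recording_filename": "rec1.wav", "speaker_type": "CHI", "onset": 900},
    {"recording_filename": "rec1.wav", "speaker_type": "CHI", "onset": 1500},
    {"recording_filename": "rec2.wav", "speaker_type": "CHI", "onset": 100},
]
samples = sample_vocalizations(example_vocs, ["CHI"], sample_size=2)
```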
class ChildProject.pipelines.samplers.Sampler(project: ChildProject, recordings: str | List[str] | DataFrame | None = None, exclude: str | DataFrame | None = None)[source]

Bases: ABC

abstract static add_parser(parsers)[source]
assert_valid()[source]
export_audio(destination, profile=None, **kwargs)[source]
remove_excluded()[source]
retrieve_segments(recording_filename=None)[source]
sample()[source]
class ChildProject.pipelines.samplers.SamplerPipeline[source]

Bases: Pipeline

run(path, destination, sampler, func=None, **kwargs)[source]
static setup_parser(parser)[source]

ChildProject.pipelines.zooniverse module

class ChildProject.pipelines.zooniverse.Chunk(recording_filename, onset, offset, segment_onset, segment_offset)[source]

Bases: object

getbasename(extension)[source]
class ChildProject.pipelines.zooniverse.ZooniversePipeline[source]

Bases: Pipeline

exit_upload(*args, rec_orphan, sub_set)[source]
extract_chunks(path: str, destination: str, keyword: str, segments: str, chunks_length: int = -1, chunks_min_amount: int = 1, spectrogram: bool = False, profile: str = '', threads: int = 1, **kwargs)[source]

Extract audio chunks based on a list of segments and prepare them for upload to zooniverse.

Parameters:
  • path (str) – dataset path

  • destination (str) – path to the folder where to store the metadata and audio chunks

  • segments (str) – path to the input segments csv dataframe

  • keyword (str) – keyword to insert in the output metadata

  • chunks_length (int, optional) – length of the chunks, in milliseconds, defaults to -1

  • chunks_min_amount (int, optional) – minimum amount of chunk per segment, defaults to 1

  • spectrogram (bool, optional) – the extraction generates a png spectrogram, defaults to False

  • profile (str, optional) – recording profile to extract from. If undefined, raw recordings will be used. Defaults to ‘’

  • threads (int, optional) – amount of threads to run on, defaults to 1

get_credentials(login: str = '', pwd: str = '')[source]

returns input credentials if provided or attempts to read them from the environment variables.

Parameters:
  • login (str, optional) – input login, defaults to ‘’

  • pwd (str, optional) – input password, defaults to ‘’

Returns:

(login, pwd)

Return type:

(str, str)
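The fallback described for get_credentials can be sketched as below. The environment variable names ZOONIVERSE_LOGIN and ZOONIVERSE_PWD are the ones mentioned elsewhere in this module's documentation; the helper name, the environ argument, and the behaviour when nothing is set (empty strings are returned) are assumptions made for this example.

```python
import os


def resolve_credentials(login="", pwd="", environ=None):
    """Return (login, pwd), falling back to environment variables.

    `environ` defaults to os.environ; it is injectable for testing.
    """
    env = os.environ if environ is None else environ
    # Explicit arguments take precedence over the environment.
    login = login or env.get("ZOONIVERSE_LOGIN", "")
    pwd = pwd or env.get("ZOONIVERSE_PWD", "")
    return login, pwd


fake_env = {"ZOONIVERSE_LOGIN": "env_user", "ZOONIVERSE_PWD": "env_secret"}
from_args = resolve_credentials("alice", "s3cret", environ=fake_env)
from_env = resolve_credentials(environ=fake_env)
```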

Attempt to link subjects that have been uploaded but not linked to a subject set in zooniverse, from the CSV dataframe chunks to a zooniverse project. Attempts are made on chunks that have a zooniverse_id and a project_id, with uploaded set to True, but no subject_set.

Parameters:
  • chunks (str) – path to the chunk CSV dataframe

  • project_id (int) – zooniverse project id

  • set_name (str) – name of the subject set

  • zooniverse_login (str, optional) – zooniverse login. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_LOGIN instead, defaults to ‘’

  • zooniverse_pwd (str, optional) – zooniverse password. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_PWD instead, defaults to ‘’

  • amount (int, optional) – amount of chunks to upload, defaults to 0

  • ignore_errors (bool, optional) – carry on with the upload even if a clip fails, the csv will be updated accordingly

  • test_endpoint (bool, optional) – run this command for tests; operations with zooniverse are faked and considered successful

reset_orphan_subjects(chunks: str, **kwargs)[source]

Look for orphan subjects and consider them not uploaded. This should be done either if the orphan subjects were deleted from zooniverse or if they are no longer usable. The next upload will try to push them to zooniverse as new subjects.

Parameters:

chunks (str) – path to the chunk CSV dataframe

retrieve_classifications(destination: str, project_id: int, zooniverse_login: str = '', zooniverse_pwd: str = '', chunks: List[str] = [], test_endpoint: bool = False, **kwargs)[source]

Retrieve classifications from Zooniverse as a CSV dataframe. They will be matched with the original chunks metadata if the path to one or more chunk metadata files is provided.

Parameters:
  • destination (str) – output CSV dataframe destination

  • project_id (int) – zooniverse project id

  • zooniverse_login (str, optional) – zooniverse login. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_LOGIN instead, defaults to ‘’

  • zooniverse_pwd (str, optional) – zooniverse password. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_PWD instead, defaults to ‘’

  • chunks (List[str], optional) – the list of chunk metadata files to match the classifications to. If provided, only the classifications that have a match will be returned.

run(action, **kwargs)[source]
static setup_parser(parser)[source]
upload_chunks(chunks: str, project_id: int, set_name: str, zooniverse_login='', zooniverse_pwd='', amount: int = 1000, ignore_errors: bool = False, record_orphan: bool = False, test_endpoint: bool = False, **kwargs)[source]

Uploads amount audio chunks from the CSV dataframe chunks to a zooniverse project.

Parameters:
  • chunks (str) – path to the chunk CSV dataframe

  • project_id (int) – zooniverse project id

  • set_name (str) – name of the subject set

  • zooniverse_login (str, optional) – zooniverse login. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_LOGIN instead, defaults to ‘’

  • zooniverse_pwd (str, optional) – zooniverse password. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_PWD instead, defaults to ‘’

  • amount (int, optional) – amount of chunks to upload, defaults to 1000

  • ignore_errors (bool, optional) – carry on with the upload even if a clip fails; the csv will be updated accordingly. Single clip errors are ignored, but errors that will repeat (e.g. maximum number of subjects uploaded) will still exit.

  • record_orphan (bool, optional) – when true, chunks that are correctly uploaded but not linked to a subject set (orphan) have their line updated with the subject id, project id and Uploaded flag at True, but subject_set empty. link_orphan_subjects can be used to reattempt it. If false, the chunk is considered not uploaded.

  • test_endpoint (bool, optional) – run this command for tests; operations with zooniverse are faked and considered successful

ChildProject.pipelines.zooniverse.pad_interval(onset: int, offset: int, chunks_length: int, chunks_min_amount: int = 1) Tuple[int, int][source]

Module contents