ChildProject.pipelines package
Submodules
ChildProject.pipelines.anonymize module
- class ChildProject.pipelines.anonymize.AnonymizationPipeline[source]
Bases:
ChildProject.pipelines.pipeline.Pipeline
Anonymize a set of .its annotations (input_set) and save it as output_set.
- DEFAULT_REPLACEMENTS = {'Bar': {'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'BarSummary': {'leftBoundaryClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'rightBoundaryClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'Child': {'DOB': '1000-01-01', 'EnrollDate': '1000-01-01', 'id': 'A999'}, 'ChildInfo': {'dob': '1000-01-01'}, 'FiveMinuteSection': {'endClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'ITS': {'fileName': 'new_filename_1001', 'timeCreated': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'Item': {'timeStamp': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'PrimaryChild': {'DOB': '1000-01-01'}, 'ProcessingJob': {'logfile': 'exec10001010T100010Z_job00000001-10001010_101010_100100.upl.log'}, 'Recording': {'endClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'ResourceSnapshot': {'timegmt': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'timelocal': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'TransferTime': {'LocalTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'UTCTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}}
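The `{'replace_value': ...}, {'only_time': 'true'}` rules above can be illustrated with a small standalone sketch (a hypothetical helper, not part of the library): the date component of a timestamp is overwritten while the time of day is preserved.

```python
# Hypothetical illustration of a replace_value + only_time rule:
# overwrite the date part of an ISO timestamp, keep the time of day.
def anonymize_timestamp(value: str, replace_value: str = "1000-01-01") -> str:
    date, sep, time = value.partition("T")
    # the date is replaced wholesale; only the time of day survives
    return replace_value + sep + time

print(anonymize_timestamp("2021-05-12T10:31:02"))  # 1000-01-01T10:31:02
```

A bare date (no time component) collapses entirely to the replacement value.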
ChildProject.pipelines.eafbuilder module
- class ChildProject.pipelines.eafbuilder.EafBuilderPipeline[source]
Bases:
ChildProject.pipelines.pipeline.Pipeline
- run(destination: str, segments: str, eaf_type: str, template: str, context_onset: int = 0, context_offset: int = 0, **kwargs)[source]
Generate .eaf templates based on intervals to code.
- Parameters
path (str) – project path
destination (str) – eaf destination
segments (str) – path to the input segments dataframe
eaf_type (str) – eaf-type [random, periodic]
template (str) – name of the template to use (basic, native, or non-native)
context_onset (int) – difference between the context onset and the segment onset, in milliseconds; 0 for no introductory context
context_offset (int) – difference between the context offset and the segment offset, in milliseconds; 0 for no outro context
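Assuming the context parameters simply widen each coded interval on either side (a sketch of the documented semantics, all values in milliseconds):

```python
def with_context(segment_onset: int, segment_offset: int,
                 context_onset: int = 0, context_offset: int = 0):
    """Widen a segment [onset, offset] (ms) by the requested context."""
    # clamp at 0 so the context never runs before the start of the recording
    return max(0, segment_onset - context_onset), segment_offset + context_offset

print(with_context(10_000, 12_000, context_onset=500, context_offset=500))
# (9500, 12500)
```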
ChildProject.pipelines.metrics module
- class ChildProject.pipelines.metrics.AclewMetrics(project: ChildProject.projects.ChildProject, vtc: str = 'vtc', alice: str = 'alice', vcm: str = 'vcm', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None, by: str = 'recording_filename', threads: int = 1)[source]
Bases:
ChildProject.pipelines.metrics.Metrics
ACLEW metrics extractor. Extracts a number of metrics from the ACLEW pipeline annotations, which includes:
The Voice Type Classifier by Lavechin et al. (arXiv:2005.12656)
The Automatic LInguistic Unit Count Estimator (ALICE) by Räsänen et al. (doi:10.3758/s13428-020-01460-x)
The VoCalisation Maturity model (VCMNet) by Al Futaisi et al. (doi:10.1145/3340555.3353751)
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
vtc (str) – name of the set associated to the VTC annotations
alice (str) – name of the set associated to the ALICE annotations
vcm (str) – name of the set associated to the VCM annotations
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
from_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
to_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
by (str, optional) – units to sample from, defaults to ‘recording_filename’
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'aclew'
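The from_time/to_time parameters above restrict extraction to a clock-time range. A hedged, standalone sketch of that semantics (the library's actual filtering may differ in edge cases):

```python
# Keep only annotations whose clock time falls inside a HH:MM range.
def in_range(hhmm: str, from_time: str, to_time: str) -> bool:
    to_min = lambda t: int(t[:2]) * 60 + int(t[3:])  # "HH:MM" -> minutes
    return to_min(from_time) <= to_min(hhmm) < to_min(to_time)

kept = [t for t in ["08:30", "09:15", "17:50", "23:10"]
        if in_range(t, "09:00", "18:00")]
print(kept)  # ['09:15', '17:50']
```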
- class ChildProject.pipelines.metrics.LenaMetrics(project: ChildProject.projects.ChildProject, set: str, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None, by: str = 'recording_filename', threads: int = 1)[source]
Bases:
ChildProject.pipelines.metrics.Metrics
LENA metrics extractor. Extracts a number of metrics from the LENA .its annotations.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
set (str) – name of the set associated to the .its annotations
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
from_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
to_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
by (str, optional) – units to sample from, defaults to ‘recording_filename’
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'lena'
- class ChildProject.pipelines.metrics.Metrics(project: ChildProject.projects.ChildProject, by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None)[source]
Bases:
abc.ABC
- class ChildProject.pipelines.metrics.PeriodMetrics(project: ChildProject.projects.ChildProject, set: str, period: str, period_origin: Optional[str] = None, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None, by: str = 'recording_filename', threads: int = 1)[source]
Bases:
ChildProject.pipelines.metrics.Metrics
Time-aggregated metrics extractor.
Aggregates vocalizations for each time-of-the-day unit, based on a period specified by the user. For instance, if the period is set to 15Min (i.e. 15 minutes), vocalization rates will be reported for each recording and time-unit (e.g. 09:00 to 09:15, 09:15 to 09:30, etc.).
The output dataframe has r × p rows, where r is the amount of recordings (or children, if the --by option is set to child_id) and p is the amount of time-bins per day (i.e. 24 × 4 = 96 for a 15-minute period).
The output dataframe includes a period column that contains the onset of each time-unit in HH:MM:SS format. The duration column contains the total amount of annotations covering each time-bin, in milliseconds.
If --by is set to e.g. child_id, then the values for each time-bin will be the average rates across all the recordings of every child.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset
set (str) – name of the set of annotations to derive the metrics from
period (str) – Time-period. Values should be formatted as pandas offset aliases. For instance, 15Min corresponds to a 15 minute period; 2H corresponds to a 2 hour period.
period_origin (str, optional) – NotImplemented, defaults to None
recordings (Union[str, List[str], pd.DataFrame], optional) – white-list of recordings to process, defaults to None
from_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
to_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
by (str, optional) – units to sample from, defaults to ‘recording_filename’
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'period'
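The binning described above can be sketched with pandas offset aliases (here the lowercase "15min" spelling, which recent pandas versions prefer): a 15-minute period yields 24 × 4 = 96 time-bins per day, and each annotation timestamp maps to the onset of its bin.

```python
import pandas as pd

# 96 bins per day for a 15-minute period
bins = pd.date_range("2021-01-01", periods=96, freq="15min")

# each timestamp floors to its bin onset (the `period` column analogue)
stamps = pd.Series(pd.to_datetime(["2021-01-01 09:07", "2021-01-01 09:22"]))
onsets = stamps.dt.floor("15min").dt.strftime("%H:%M:%S")
print(len(bins), list(onsets))  # 96 ['09:00:00', '09:15:00']
```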
ChildProject.pipelines.pipeline module
ChildProject.pipelines.processors module
- class ChildProject.pipelines.processors.AudioProcessor(project: ChildProject.projects.ChildProject, name: str, input_profile: Optional[str] = None, threads: int = 1, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None)[source]
Bases:
abc.ABC
- class ChildProject.pipelines.processors.BasicProcessor(project: ChildProject.projects.ChildProject, name: str, format: str, codec: str, sampling: int, split: Optional[str] = None, threads: Optional[int] = None, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, skip_existing: bool = False, input_profile: Optional[str] = None)[source]
Bases:
ChildProject.pipelines.processors.AudioProcessor
- SUBCOMMAND = 'basic'
- class ChildProject.pipelines.processors.ChannelMapper(project: ChildProject.projects.ChildProject, name: str, channels: list, threads: int = 1, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, input_profile: Optional[str] = None)[source]
Bases:
ChildProject.pipelines.processors.AudioProcessor
- SUBCOMMAND = 'channel-mapping'
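A minimal sketch of channel mapping on a (samples, channels) buffer, assuming `channels` lists the input channel index to keep for each output channel (the processor's actual audio-backend semantics may differ):

```python
import numpy as np

def map_channels(audio: np.ndarray, channels: list) -> np.ndarray:
    """Reorder/select channels of a (samples, channels) buffer."""
    return audio[:, channels]

stereo = np.array([[1, 2], [3, 4], [5, 6]])
print(map_channels(stereo, [1, 0]).tolist())  # [[2, 1], [4, 3], [6, 5]]
```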
- class ChildProject.pipelines.processors.VettingProcessor(project: ChildProject.projects.ChildProject, name: str, segments_path: str, threads: Optional[int] = None, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, input_profile: Optional[str] = None)[source]
Bases:
ChildProject.pipelines.processors.AudioProcessor
- SUBCOMMAND = 'vetting'
ChildProject.pipelines.samplers module
- class ChildProject.pipelines.samplers.CustomSampler(project: ChildProject.projects.ChildProject, segments_path: str, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
- SUBCOMMAND = 'custom'
- class ChildProject.pipelines.samplers.EnergyDetectionSampler(project: ChildProject.projects.ChildProject, windows_length: int, windows_spacing: int, windows_count: int, windows_offset: int = 0, threshold: float = 0.8, low_freq: int = 0, high_freq: int = 100000, threads: int = 1, profile: str = '', by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Sample windows within each recording, targeting those that have a signal energy higher than some threshold.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
windows_length (int) – Length of each window, in milliseconds.
windows_spacing (int) – Spacing between the start of each window, in milliseconds.
windows_count (int) – How many windows to retain per recording.
windows_offset (int, optional) – start of the first window, in milliseconds, defaults to 0
threshold (float, optional) – lowest energy quantile to sample from, defaults to 0.8
low_freq (int, optional) – if > 0, frequencies below will be filtered before calculating the energy, defaults to 0
high_freq (int, optional) – if < 100000, frequencies above will be filtered before calculating the energy, defaults to 100000
by (str, optional) – units to sample from, defaults to ‘recording_filename’
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'energy-detection'
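The selection rule can be sketched standalone: compute each window's energy (sum of squared samples) and keep the windows above the requested quantile. This is a simplified illustration of the technique; the actual sampler reads recordings and can band-pass filter first.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=16000)
signal[4000:6000] *= 5          # a louder region the sampler should pick up

window = 2000                   # samples per window (windows_length analogue)
energies = np.array([np.sum(signal[i:i + window] ** 2)
                     for i in range(0, len(signal) - window + 1, window)])
threshold = np.quantile(energies, 0.8)     # `threshold` parameter analogue
selected = np.flatnonzero(energies >= threshold)
print(selected)                 # includes window index 2, the boosted region
```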
- class ChildProject.pipelines.samplers.HighVolubilitySampler(project: ChildProject.projects.ChildProject, annotation_set: str, metric: str, windows_length: int, windows_count: int, threads: int = 1, by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Return the top windows_count windows (of length windows_length) with the highest volubility from each recording, as calculated from the metric metric.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
annotation_set (str) – set of annotations to calculate volubility from.
metric (str) – the metric to evaluate high-volubility. should be any of ‘awc’, ‘ctc’, ‘cvc’.
windows_length (int) – length of the windows, in milliseconds
windows_count (int) – amount of top regions to extract per recording
by (str, optional) – units to sample from, defaults to ‘recording_filename’
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
threads (int) – amount of threads to run the sampler on
- SUBCOMMAND = 'high-volubility'
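The top-windows selection amounts to a per-recording top-k on the chosen metric. A sketch with a hypothetical 'ctc' (conversational turn count) column:

```python
import pandas as pd

windows = pd.DataFrame({
    "recording_filename": ["rec1"] * 4 + ["rec2"] * 4,
    "window_onset": [0, 60, 120, 180] * 2,
    "ctc": [3, 9, 1, 7, 2, 2, 8, 5],
})
top = (windows.sort_values("ctc", ascending=False)
              .groupby("recording_filename")
              .head(2)                      # windows_count = 2
              .sort_values(["recording_filename", "window_onset"]))
print(top["window_onset"].tolist())  # [60, 180, 120, 180]
```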
- class ChildProject.pipelines.samplers.PeriodicSampler(project: ChildProject.projects.ChildProject, length: int, period: int, offset: int = 0, profile: Optional[str] = None, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Periodic sampling of a recording.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
length (int) – length of each segment, in milliseconds
period (int) – spacing between two consecutive segments, in milliseconds
offset (int) – offset of the first segment, in milliseconds, defaults to 0
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
- SUBCOMMAND = 'periodic'
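Periodic placement can be sketched as follows, assuming `period` is the gap between the end of one segment and the onset of the next (an interpretation of the documented "spacing between two consecutive segments"; all values in milliseconds):

```python
def periodic_segments(duration: int, length: int, period: int, offset: int = 0):
    """Onset/offset pairs (ms) of a periodic sample over one recording."""
    segments, onset = [], offset
    while onset + length <= duration:
        segments.append((onset, onset + length))
        onset += length + period   # assumed: period = gap between segments
    return segments

# 60 s of audio every 5 minutes over a 10-minute recording:
print(periodic_segments(600_000, 60_000, 240_000))
# [(0, 60000), (300000, 360000)]
```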
- class ChildProject.pipelines.samplers.RandomVocalizationSampler(project: ChildProject.projects.ChildProject, annotation_set: str, target_speaker_type: list, sample_size: int, threads: int = 1, by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Sample vocalizations based on some input annotation set.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
annotation_set (str) – Set of annotations to get vocalizations from.
target_speaker_type (list) – List of speaker types to sample vocalizations from.
sample_size (int) – Amount of vocalizations to sample, per recording.
by (str, optional) – units to sample from, defaults to ‘recording_filename’
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'random-vocalizations'
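A sketch of the sampling rule: draw sample_size vocalizations of the target speaker types per recording (hypothetical column names; a fixed seed makes the draw reproducible).

```python
import pandas as pd

segments = pd.DataFrame({
    "recording_filename": ["rec1"] * 5 + ["rec2"] * 5,
    "speaker_type": ["CHI", "FEM", "CHI", "CHI", "MAL"] * 2,
    "segment_onset": range(10),
})
# keep the target speaker type, then sample 2 segments per recording
target = segments[segments["speaker_type"] == "CHI"]
sampled = target.groupby("recording_filename").sample(2, random_state=0)
print(sampled["recording_filename"].value_counts().to_dict())
# {'rec1': 2, 'rec2': 2}
```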
- class ChildProject.pipelines.samplers.Sampler(project: ChildProject.projects.ChildProject, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
abc.ABC
ChildProject.pipelines.zooniverse module
- class ChildProject.pipelines.zooniverse.Chunk(recording_filename, onset, offset, segment_onset, segment_offset)[source]
Bases:
object
- class ChildProject.pipelines.zooniverse.ZooniversePipeline[source]
Bases:
ChildProject.pipelines.pipeline.Pipeline
- extract_chunks(path: str, destination: str, keyword: str, segments: str, chunks_length: int = - 1, chunks_min_amount: int = 1, profile: str = '', threads: int = 0, **kwargs)[source]
Extract audio chunks based on a list of segments and prepare them for upload to Zooniverse.
- Parameters
path (str) – dataset path
destination (str) – path to the folder where to store the metadata and audio chunks
segments (str) – path to the input segments csv dataframe, defaults to None
keyword (str) – keyword to insert in the output metadata
chunks_length (int, optional) – length of the chunks, in milliseconds, defaults to -1
chunks_min_amount (int, optional) – minimum amount of chunks per segment, defaults to 1
profile (str) – recording profile to extract from. If undefined, raw recordings will be used.
threads (int, optional) – amount of threads to run on, defaults to 0
- get_credentials(login: str = '', pwd: str = '')[source]
Return the input credentials if provided, or attempt to read them from the environment variables.
- Parameters
login (str, optional) – input login, defaults to ‘’
pwd (str, optional) – input password, defaults to ‘’
- Returns
(login, pwd)
- Return type
(str, str)
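The documented fallback can be sketched as follows (environment-variable names assumed from the retrieve_classifications/upload_chunks docs below): use the provided credentials if non-empty, otherwise read them from the environment.

```python
import os

def get_credentials(login: str = "", pwd: str = ""):
    """Prefer explicit credentials; fall back to environment variables."""
    login = login or os.environ.get("ZOONIVERSE_LOGIN", "")
    pwd = pwd or os.environ.get("ZOONIVERSE_PWD", "")
    return login, pwd

os.environ["ZOONIVERSE_LOGIN"] = "babylab"   # hypothetical login
print(get_credentials(pwd="secret"))  # ('babylab', 'secret')
```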
- retrieve_classifications(destination: str, project_id: int, zooniverse_login: str = '', zooniverse_pwd: str = '', chunks: List[str] = [], **kwargs)[source]
Retrieve classifications from Zooniverse as a CSV dataframe. They will be matched with the original chunk metadata if the path to one or more chunk metadata files is provided.
- Parameters
destination (str) – output CSV dataframe destination
project_id (int) – zooniverse project id
zooniverse_login (str, optional) – Zooniverse login. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_LOGIN instead, defaults to ‘’
zooniverse_pwd (str, optional) – Zooniverse password. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_PWD instead, defaults to ‘’
chunks (List[str], optional) – the list of chunk metadata files to match the classifications to. If provided, only the classifications that have a match will be returned.
- upload_chunks(chunks: str, project_id: int, set_name: str, zooniverse_login='', zooniverse_pwd='', amount: int = 1000, **kwargs)[source]
Upload amount audio chunks from the chunks CSV dataframe to a Zooniverse project.
- Parameters
chunks (str) – path to the chunk CSV dataframe
project_id (int) – zooniverse project id
set_name (str) – name of the subject set
zooniverse_login (str, optional) – Zooniverse login. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_LOGIN instead, defaults to ‘’
zooniverse_pwd (str, optional) – Zooniverse password. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_PWD instead, defaults to ‘’
amount (int, optional) – amount of chunks to upload, defaults to 1000