ChildProject.pipelines package
Submodules
ChildProject.pipelines.anonymize module
- class ChildProject.pipelines.anonymize.AnonymizationPipeline[source]
Bases:
ChildProject.pipelines.pipeline.Pipeline
Anonymize a set of .its annotations (input_set) and save it as output_set.
- DEFAULT_REPLACEMENTS = {'Bar': {'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'BarSummary': {'leftBoundaryClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'rightBoundaryClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'Child': {'DOB': '1000-01-01', 'EnrollDate': '1000-01-01', 'id': 'A999'}, 'ChildInfo': {'dob': '1000-01-01'}, 'FiveMinuteSection': {'endClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'ITS': {'fileName': 'new_filename_1001', 'timeCreated': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'Item': {'timeStamp': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'PrimaryChild': {'DOB': '1000-01-01'}, 'ProcessingJob': {'logfile': 'exec10001010T100010Z_job00000001-10001010_101010_100100.upl.log'}, 'Recording': {'endClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'ResourceSnapshot': {'timegmt': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'timelocal': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}, 'TransferTime': {'LocalTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}], 'UTCTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}}
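The `{'replace_value': ...}, {'only_time': 'true'}` rules above can be illustrated with a small standalone sketch (a hypothetical helper, not part of the library): the date component of a timestamp is overwritten while the time of day is preserved.

```python
# Hypothetical illustration of a replace_value + only_time rule:
# overwrite the date part of an ISO timestamp, keep the time of day.
def anonymize_timestamp(value: str, replace_value: str = "1000-01-01") -> str:
    date, sep, time = value.partition("T")
    # the date is replaced wholesale; only the time of day survives
    return replace_value + sep + time

print(anonymize_timestamp("2021-05-12T10:31:02"))  # 1000-01-01T10:31:02
```

A bare date (no time component) collapses entirely to the replacement value.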
ChildProject.pipelines.eafbuilder module
- class ChildProject.pipelines.eafbuilder.EafBuilderPipeline[source]
Bases:
ChildProject.pipelines.pipeline.Pipeline
- run(destination: str, segments: str, eaf_type: str, template: str, context_onset: int = 0, context_offset: int = 0, **kwargs)[source]
Generate .eaf templates based on intervals to code.
- Parameters
path (str) – project path
destination (str) – eaf destination
segments (str) – path to the input segments dataframe
eaf_type (str) – eaf-type [random, periodic]
template (str) – name of the template to use (basic, native, or non-native)
context_onset (int) – difference between the context onset and the segment onset, in milliseconds; 0 for no introductory context
context_offset (int) – difference between the context offset and the segment offset, in milliseconds; 0 for no outro context
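Assuming the context parameters simply widen each coded interval on either side (a sketch of the documented semantics, all values in milliseconds):

```python
def with_context(segment_onset: int, segment_offset: int,
                 context_onset: int = 0, context_offset: int = 0):
    """Widen a segment [onset, offset] (ms) by the requested context."""
    # clamp at 0 so the context never runs before the start of the recording
    return max(0, segment_onset - context_onset), segment_offset + context_offset

print(with_context(10_000, 12_000, context_onset=500, context_offset=500))
# (9500, 12500)
```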
ChildProject.pipelines.metrics module
- class ChildProject.pipelines.metrics.AclewMetrics(project: ChildProject.projects.ChildProject, vtc: str = 'vtc', alice: str = 'alice', vcm: str = 'vcm', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None, by: str = 'recording_filename', threads: int = 1)[source]
Bases:
ChildProject.pipelines.metrics.Metrics
ACLEW metrics extractor. Extracts a number of metrics from the ACLEW pipeline annotations, which includes:
The Voice Type Classifier by Lavechin et al. (arXiv:2005.12656)
The Automatic LInguistic Unit Count Estimator (ALICE) by Räsänen et al. (doi:10.3758/s13428-020-01460-x)
The VoCalisation Maturity model (VCMNet) by Al Futaisi et al. (doi:10.1145/3340555.3353751)
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
vtc (str) – name of the set associated to the VTC annotations
alice (str) – name of the set associated to the ALICE annotations
vcm (str) – name of the set associated to the VCM annotations
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
from_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
to_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
by (str, optional) – units to sample from, defaults to ‘recording_filename’
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'aclew'
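The from_time/to_time parameters above restrict extraction to a clock-time range. A hedged, standalone sketch of that semantics (the library's actual filtering may differ in edge cases):

```python
# Keep only annotations whose clock time falls inside a HH:MM range.
def in_range(hhmm: str, from_time: str, to_time: str) -> bool:
    to_min = lambda t: int(t[:2]) * 60 + int(t[3:])  # "HH:MM" -> minutes
    return to_min(from_time) <= to_min(hhmm) < to_min(to_time)

kept = [t for t in ["08:30", "09:15", "17:50", "23:10"]
        if in_range(t, "09:00", "18:00")]
print(kept)  # ['09:15', '17:50']
```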
- class ChildProject.pipelines.metrics.LenaMetrics(project: ChildProject.projects.ChildProject, set: str, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None, by: str = 'recording_filename', threads: int = 1)[source]
Bases:
ChildProject.pipelines.metrics.Metrics
LENA metrics extractor. Extracts a number of metrics from the LENA .its annotations.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
set (str) – name of the set associated to the .its annotations
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
from_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
to_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
by (str, optional) – units to sample from, defaults to ‘recording_filename’
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'lena'
- class ChildProject.pipelines.metrics.Metrics(project: ChildProject.projects.ChildProject, by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None)[source]
Bases:
abc.ABC
- class ChildProject.pipelines.metrics.PeriodMetrics(project: ChildProject.projects.ChildProject, set: str, period: str, period_origin: Optional[str] = None, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None, by: str = 'recording_filename', threads: int = 1)[source]
Bases:
ChildProject.pipelines.metrics.Metrics
Time-aggregated metrics extractor.
Aggregates vocalizations for each time-of-the-day unit, based on a period specified by the user. For instance, if the period is set to 15Min (i.e. 15 minutes), vocalization rates will be reported for each recording and time-unit (e.g. 09:00 to 09:15, 09:15 to 09:30, etc.).
The output dataframe has r × p rows, where r is the amount of recordings (or children, if the --by option is set to child_id) and p is the amount of time-bins per day (i.e. 24 × 4 = 96 for a 15-minute period).
The output dataframe includes a period column that contains the onset of each time-unit in HH:MM:SS format. The duration column contains the total amount of annotations covering each time-bin, in milliseconds.
If --by is set to e.g. child_id, then the values for each time-bin will be the average rates across all the recordings of every child.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset
set (str) – name of the set of annotations to derive the metrics from
period (str) – Time-period. Values should be formatted as pandas offset aliases. For instance, 15Min corresponds to a 15 minute period; 2H corresponds to a 2 hour period.
period_origin (str, optional) – NotImplemented, defaults to None
recordings (Union[str, List[str], pd.DataFrame], optional) – white-list of recordings to process, defaults to None
from_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
to_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
by (str, optional) – units to sample from, defaults to ‘recording_filename’
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'period'
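The binning described above can be sketched with pandas offset aliases (here the lowercase "15min" spelling, which recent pandas versions prefer): a 15-minute period yields 24 × 4 = 96 time-bins per day, and each annotation timestamp maps to the onset of its bin.

```python
import pandas as pd

# 96 bins per day for a 15-minute period
bins = pd.date_range("2021-01-01", periods=96, freq="15min")

# each timestamp floors to its bin onset (the `period` column analogue)
stamps = pd.Series(pd.to_datetime(["2021-01-01 09:07", "2021-01-01 09:22"]))
onsets = stamps.dt.floor("15min").dt.strftime("%H:%M:%S")
print(len(bins), list(onsets))  # 96 ['09:00:00', '09:15:00']
```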
ChildProject.pipelines.pipeline module
ChildProject.pipelines.processors module
- class ChildProject.pipelines.processors.AudioProcessor(project: ChildProject.projects.ChildProject, name: str, input_profile: Optional[str] = None, threads: int = 1, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None)[source]
Bases:
abc.ABC
- class ChildProject.pipelines.processors.BasicProcessor(project: ChildProject.projects.ChildProject, name: str, format: str, codec: str, sampling: int, split: Optional[str] = None, threads: Optional[int] = None, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, skip_existing: bool = False, input_profile: Optional[str] = None)[source]
Bases:
ChildProject.pipelines.processors.AudioProcessor
- SUBCOMMAND = 'basic'
- class ChildProject.pipelines.processors.ChannelMapper(project: ChildProject.projects.ChildProject, name: str, channels: list, threads: int = 1, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, input_profile: Optional[str] = None)[source]
Bases:
ChildProject.pipelines.processors.AudioProcessor
- SUBCOMMAND = 'channel-mapping'
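A minimal sketch of channel mapping on a (samples, channels) buffer, assuming `channels` lists the input channel index to keep for each output channel (the processor's actual audio-backend semantics may differ):

```python
import numpy as np

def map_channels(audio: np.ndarray, channels: list) -> np.ndarray:
    """Reorder/select channels of a (samples, channels) buffer."""
    return audio[:, channels]

stereo = np.array([[1, 2], [3, 4], [5, 6]])
print(map_channels(stereo, [1, 0]).tolist())  # [[2, 1], [4, 3], [6, 5]]
```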
- class ChildProject.pipelines.processors.VettingProcessor(project: ChildProject.projects.ChildProject, name: str, segments_path: str, threads: Optional[int] = None, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, input_profile: Optional[str] = None)[source]
Bases:
ChildProject.pipelines.processors.AudioProcessor
- SUBCOMMAND = 'vetting'
ChildProject.pipelines.samplers module
- class ChildProject.pipelines.samplers.CustomSampler(project: ChildProject.projects.ChildProject, segments_path: str, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
- SUBCOMMAND = 'custom'
- class ChildProject.pipelines.samplers.EnergyDetectionSampler(project: ChildProject.projects.ChildProject, windows_length: int, windows_spacing: int, windows_count: int, windows_offset: int = 0, threshold: float = 0.8, low_freq: int = 0, high_freq: int = 100000, threads: int = 1, profile: str = '', by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Sample windows within each recording, targeting those that have a signal energy higher than some threshold.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
windows_length (int) – Length of each window, in milliseconds.
windows_spacing (int) – Spacing between the start of each window, in milliseconds.
windows_count (int) – How many windows to retain per recording.
windows_offset (int, optional) – start of the first window, in milliseconds, defaults to 0
threshold (float, optional) – lowest energy quantile to sample from, defaults to 0.8
low_freq (int, optional) – if > 0, frequencies below will be filtered before calculating the energy, defaults to 0
high_freq (int, optional) – if < 100000, frequencies above will be filtered before calculating the energy, defaults to 100000
by (str, optional) – units to sample from, defaults to ‘recording_filename’
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'energy-detection'
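The selection rule can be sketched standalone: compute each window's energy (sum of squared samples) and keep the windows above the requested quantile. This is a simplified illustration of the technique; the actual sampler reads recordings and can band-pass filter first.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=16000)
signal[4000:6000] *= 5          # a louder region the sampler should pick up

window = 2000                   # samples per window (windows_length analogue)
energies = np.array([np.sum(signal[i:i + window] ** 2)
                     for i in range(0, len(signal) - window + 1, window)])
threshold = np.quantile(energies, 0.8)     # `threshold` parameter analogue
selected = np.flatnonzero(energies >= threshold)
print(selected)                 # includes window index 2, the boosted region
```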
- class ChildProject.pipelines.samplers.HighVolubilitySampler(project: ChildProject.projects.ChildProject, annotation_set: str, metric: str, windows_length: int, windows_count: int, threads: int = 1, by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Return the top windows_count windows (of length windows_length) with the highest volubility from each recording, as calculated from the metric metric.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
annotation_set (str) – set of annotations to calculate volubility from.
metric (str) – the metric to evaluate high-volubility. should be any of ‘awc’, ‘ctc’, ‘cvc’.
windows_length (int) – length of the windows, in milliseconds
windows_count (int) – amount of top regions to extract per recording
by (str, optional) – units to sample from, defaults to ‘recording_filename’
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
threads (int) – amount of threads to run the sampler on
- SUBCOMMAND = 'high-volubility'
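The top-windows selection amounts to a per-recording top-k on the chosen metric. A sketch with a hypothetical 'ctc' (conversational turn count) column:

```python
import pandas as pd

windows = pd.DataFrame({
    "recording_filename": ["rec1"] * 4 + ["rec2"] * 4,
    "window_onset": [0, 60, 120, 180] * 2,
    "ctc": [3, 9, 1, 7, 2, 2, 8, 5],
})
top = (windows.sort_values("ctc", ascending=False)
              .groupby("recording_filename")
              .head(2)                      # windows_count = 2
              .sort_values(["recording_filename", "window_onset"]))
print(top["window_onset"].tolist())  # [60, 180, 120, 180]
```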
- class ChildProject.pipelines.samplers.PeriodicSampler(project: ChildProject.projects.ChildProject, length: int, period: int, offset: int = 0, profile: Optional[str] = None, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Periodic sampling of a recording.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
length (int) – length of each segment, in milliseconds
period (int) – spacing between two consecutive segments, in milliseconds
offset (int) – offset of the first segment, in milliseconds, defaults to 0
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
- SUBCOMMAND = 'periodic'
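Periodic placement can be sketched as follows, assuming `period` is the gap between the end of one segment and the onset of the next (an interpretation of the documented "spacing between two consecutive segments"; all values in milliseconds):

```python
def periodic_segments(duration: int, length: int, period: int, offset: int = 0):
    """Onset/offset pairs (ms) of a periodic sample over one recording."""
    segments, onset = [], offset
    while onset + length <= duration:
        segments.append((onset, onset + length))
        onset += length + period   # assumed: period = gap between segments
    return segments

# 60 s of audio every 5 minutes over a 10-minute recording:
print(periodic_segments(600_000, 60_000, 240_000))
# [(0, 60000), (300000, 360000)]
```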
- class ChildProject.pipelines.samplers.RandomVocalizationSampler(project: ChildProject.projects.ChildProject, annotation_set: str, target_speaker_type: list, sample_size: int, threads: int = 1, by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Sample vocalizations based on some input annotation set.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
annotation_set (str) – Set of annotations to get vocalizations from.
target_speaker_type (list) – List of speaker types to sample vocalizations from.
sample_size (int) – Amount of vocalizations to sample, per recording.
by (str, optional) – units to sample from, defaults to ‘recording_filename’
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'random-vocalizations'
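A sketch of the sampling rule: draw sample_size vocalizations of the target speaker types per recording (hypothetical column names; a fixed seed makes the draw reproducible).

```python
import pandas as pd

segments = pd.DataFrame({
    "recording_filename": ["rec1"] * 5 + ["rec2"] * 5,
    "speaker_type": ["CHI", "FEM", "CHI", "CHI", "MAL"] * 2,
    "segment_onset": range(10),
})
# keep the target speaker type, then sample 2 segments per recording
target = segments[segments["speaker_type"] == "CHI"]
sampled = target.groupby("recording_filename").sample(2, random_state=0)
print(sampled["recording_filename"].value_counts().to_dict())
# {'rec1': 2, 'rec2': 2}
```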
- class ChildProject.pipelines.samplers.Sampler(project: ChildProject.projects.ChildProject, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
abc.ABC
ChildProject.pipelines.zooniverse module
- class ChildProject.pipelines.zooniverse.Chunk(recording_filename, onset, offset, segment_onset, segment_offset)[source]
Bases:
object
- class ChildProject.pipelines.zooniverse.ZooniversePipeline[source]
Bases:
ChildProject.pipelines.pipeline.Pipeline
- extract_chunks(path: str, destination: str, keyword: str, segments: str, chunks_length: int = - 1, chunks_min_amount: int = 1, profile: str = '', threads: int = 0, **kwargs)[source]
Extract audio chunks based on a list of segments and prepare them for upload to Zooniverse.
- Parameters
path (str) – dataset path
destination (str) – path to the folder where to store the metadata and audio chunks
segments (str) – path to the input segments csv dataframe, defaults to None
keyword (str) – keyword to insert in the output metadata
chunks_length (int, optional) – length of the chunks, in milliseconds, defaults to -1
chunks_min_amount (int, optional) – minimum amount of chunks per segment, defaults to 1
profile (str) – recording profile to extract from. If undefined, raw recordings will be used.
threads (int, optional) – amount of threads to run on, defaults to 0
- get_credentials(login: str = '', pwd: str = '')[source]
Return the input credentials if provided, or attempt to read them from the environment variables.
- Parameters
login (str, optional) – input login, defaults to ‘’
pwd (str, optional) – input password, defaults to ‘’
- Returns
(login, pwd)
- Return type
(str, str)
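The documented fallback can be sketched as follows (environment-variable names assumed from the retrieve_classifications/upload_chunks docs below): use the provided credentials if non-empty, otherwise read them from the environment.

```python
import os

def get_credentials(login: str = "", pwd: str = ""):
    """Prefer explicit credentials; fall back to environment variables."""
    login = login or os.environ.get("ZOONIVERSE_LOGIN", "")
    pwd = pwd or os.environ.get("ZOONIVERSE_PWD", "")
    return login, pwd

os.environ["ZOONIVERSE_LOGIN"] = "babylab"   # hypothetical login
print(get_credentials(pwd="secret"))  # ('babylab', 'secret')
```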
- retrieve_classifications(destination: str, project_id: int, zooniverse_login: str = '', zooniverse_pwd: str = '', chunks: List[str] = [], **kwargs)[source]
Retrieve classifications from Zooniverse as a CSV dataframe. They will be matched with the original chunk metadata if the path to one or more chunk metadata files is provided.
- Parameters
destination (str) – output CSV dataframe destination
project_id (int) – zooniverse project id
zooniverse_login (str, optional) – Zooniverse login. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_LOGIN instead, defaults to ‘’
zooniverse_pwd (str, optional) – Zooniverse password. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_PWD instead, defaults to ‘’
chunks (List[str], optional) – the list of chunk metadata files to match the classifications to. If provided, only the classifications that have a match will be returned.
- upload_chunks(chunks: str, project_id: int, set_name: str, zooniverse_login='', zooniverse_pwd='', amount: int = 1000, **kwargs)[source]
Upload amount audio chunks from the chunks CSV dataframe to a Zooniverse project.
- Parameters
chunks (str) – path to the chunk CSV dataframe
project_id (int) – zooniverse project id
set_name (str) – name of the subject set
zooniverse_login (str, optional) – Zooniverse login. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_LOGIN instead, defaults to ‘’
zooniverse_pwd (str, optional) – Zooniverse password. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_PWD instead, defaults to ‘’
amount (int, optional) – amount of chunks to upload, defaults to 1000