ChildProject.pipelines package
Submodules
ChildProject.pipelines.anonymize module
- class ChildProject.pipelines.anonymize.AnonymizationPipeline[source]
Bases:
ChildProject.pipelines.pipeline.Pipeline
Anonymize a set of .its annotations (input_set) and save it as output_set.
- DEFAULT_REPLACEMENTS = {
      'Bar': {'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]},
      'BarSummary': {'leftBoundaryClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}],
                     'rightBoundaryClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]},
      'Child': {'DOB': '1000-01-01', 'EnrollDate': '1000-01-01', 'id': 'A999'},
      'ChildInfo': {'dob': '1000-01-01'},
      'FiveMinuteSection': {'endClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}],
                            'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]},
      'ITS': {'fileName': 'new_filename_1001',
              'timeCreated': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]},
      'Item': {'timeStamp': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]},
      'PrimaryChild': {'DOB': '1000-01-01'},
      'ProcessingJob': {'logfile': 'exec10001010T100010Z_job00000001-10001010_101010_100100.upl.log'},
      'Recording': {'endClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}],
                    'startClockTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]},
      'ResourceSnapshot': {'timegmt': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}],
                           'timelocal': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]},
      'TransferTime': {'LocalTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}],
                       'UTCTime': [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]}}
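The rules above map .its XML elements to attribute replacements: a plain string substitutes the attribute value outright, while a [{'replace_value': ...}, {'only_time': 'true'}] pair replaces the date portion of a timestamp but keeps its time of day. A minimal sketch of that rule semantics (apply_rule and the sample value below are hypothetical, not part of the pipeline's actual implementation):

    # Minimal sketch of the replacement-rule semantics; apply_rule is a
    # hypothetical helper, not part of ChildProject's API.
    def apply_rule(value: str, rule) -> str:
        if isinstance(rule, str):               # plain string: replace outright
            return rule
        replacement, options = rule             # [{'replace_value': ...}, {'only_time': 'true'}]
        if options.get('only_time') == 'true':  # anonymize the date, keep the clock time
            time_part = value.split('T')[-1]
            return replacement['replace_value'] + 'T' + time_part
        return replacement['replace_value']

    # e.g. a Recording startClockTime attribute from an .its file
    print(apply_rule('2021-06-03T09:15:02Z',
                     [{'replace_value': '1000-01-01'}, {'only_time': 'true'}]))
    # -> 1000-01-01T09:15:02Z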
ChildProject.pipelines.eafbuilder module
- class ChildProject.pipelines.eafbuilder.EafBuilderPipeline[source]
Bases:
ChildProject.pipelines.pipeline.Pipeline
- run(destination: str, segments: str, eaf_type: str, template: str, context_onset: int = 0, context_offset: int = 0, **kwargs)[source]
Generate .eaf templates based on intervals to code.
- Parameters
path (str) – project path
destination (str) – eaf destination
segments (str) – path to the input segments dataframe
eaf_type (str) – eaf-type [random, periodic]
template (str) – name of the template to use (basic, native, or non-native)
context_onset (int) – context onset and segment onset difference in milliseconds, 0 for no introductory context
context_offset (int) – context offset and segment offset difference in milliseconds, 0 for no outro context
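For instance, a usage sketch with hypothetical paths (path appears in the parameter list above though not in the explicit signature, so it is passed as a keyword argument here):

    from ChildProject.pipelines.eafbuilder import EafBuilderPipeline

    pipeline = EafBuilderPipeline()
    pipeline.run(
        path=".",                                  # project path
        destination="annotations/eaf",             # hypothetical output folder
        segments="samples/periodic/segments.csv",  # hypothetical segments dataframe
        eaf_type="periodic",
        template="basic",
        context_onset=1000,                        # 1 s of context before each segment
        context_offset=1000,                       # 1 s of context after each segment
    )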
ChildProject.pipelines.metrics module
- class ChildProject.pipelines.metrics.AclewMetrics(project: ChildProject.projects.ChildProject, vtc: str = 'vtc', alice: str = 'alice', vcm: str = 'vcm', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None, by: str = 'recording_filename', threads: int = 1)[source]
Bases:
ChildProject.pipelines.metrics.Metrics
ACLEW metrics extractor. Extracts a number of metrics from the ACLEW pipeline annotations, which includes:
The Voice Type Classifier by Lavechin et al. (arXiv:2005.12656)
The Automatic LInguistic Unit Count Estimator (ALICE) by Räsänen et al. (doi:10.3758/s13428-020-01460-x)
The VoCalisation Maturity model (VCMNet) by Al Futaisi et al. (doi:10.1145/3340555.3353751)
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
vtc (str) – name of the set associated to the VTC annotations
alice (str) – name of the set associated to the ALICE annotations
vcm (str) – name of the set associated to the VCM annotations
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
from_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
to_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
by (str, optional) – units to sample from, defaults to ‘recording_filename’
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'aclew'
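A construction sketch, assuming a dataset at a hypothetical path; the extract() call is an assumption, since the extraction entry point is not documented in this section:

    from ChildProject.projects import ChildProject
    from ChildProject.pipelines.metrics import AclewMetrics

    project = ChildProject("/path/to/dataset")  # hypothetical dataset path
    project.read()                              # load the dataset metadata

    metrics = AclewMetrics(project, vtc="vtc", alice="alice", vcm="vcm",
                           by="child_id", threads=4)
    df = metrics.extract()  # assumed entry point returning the metrics dataframe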
- class ChildProject.pipelines.metrics.LenaMetrics(project: ChildProject.projects.ChildProject, set: str, types: list = [], recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None, by: str = 'recording_filename', threads: int = 1)[source]
Bases:
ChildProject.pipelines.metrics.Metrics
LENA metrics extractor. Extracts a number of metrics from the LENA .its annotations.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
set (str) – name of the set associated to the .its annotations
types (list) – list of LENA vocalization/noise types (e.g. OLN, TVN)
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
from_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
to_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
by (str, optional) – units to sample from, defaults to ‘recording_filename’
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'lena'
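Similarly, a sketch for LENA annotations, reusing the project instance from the previous example (the set name and types are illustrative; extract() is assumed as above):

    from ChildProject.pipelines.metrics import LenaMetrics

    lena = LenaMetrics(project, set="its", types=["OLN", "TVN"],
                       from_time="08:00", to_time="20:00")
    df = lena.extract()  # assumed entry point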
- class ChildProject.pipelines.metrics.Metrics(project: ChildProject.projects.ChildProject, by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None)[source]
Bases:
abc.ABC
- class ChildProject.pipelines.metrics.PeriodMetrics(project: ChildProject.projects.ChildProject, set: str, period: str, period_origin: Optional[str] = None, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, from_time: Optional[str] = None, to_time: Optional[str] = None, by: str = 'recording_filename', threads: int = 1)[source]
Bases:
ChildProject.pipelines.metrics.Metrics
Time-aggregated metrics extractor.
Aggregates vocalizations for each time-of-the-day unit, based on a period specified by the user. For instance, if the period is set to 15Min (i.e. 15 minutes), vocalization rates will be reported for each recording and time-unit (e.g. 09:00 to 09:15, 09:15 to 09:30, etc.).
The output dataframe has r x p rows, where r is the amount of recordings (or children, if the --by option is set to child_id) and p is the amount of time-bins per day (i.e. 24 x 4 = 96 for a 15-minute period).
The output dataframe includes a period column that contains the onset of each time-unit in HH:MM:SS format. The duration column contains the total amount of annotations covering each time-bin, in milliseconds.
If --by is set to e.g. child_id, then the values for each time-bin will be the average rates across all the recordings of every child.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset
set (str) – name of the set of annotations to derive the metrics from
period (str) – Time-period. Values should be formatted as pandas offset aliases. For instance, 15Min corresponds to a 15-minute period and 2H to a 2-hour period.
period_origin (str, optional) – NotImplemented, defaults to None
recordings (Union[str, List[str], pd.DataFrame], optional) – white-list of recordings to process, defaults to None
from_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
to_time (str, optional) – If specified (in HH:MM format), ignore annotations outside of the given time-range, defaults to None
by (str, optional) – units to sample from, defaults to ‘recording_filename’
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'period'
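A sketch with 15-minute bins, reusing the project instance from the first metrics example (extract() is assumed as above):

    from ChildProject.pipelines.metrics import PeriodMetrics

    periods = PeriodMetrics(project, set="its", period="15Min")
    df = periods.extract()  # assumed entry point; expect one row per recording
                            # and 15-minute bin (96 bins per day)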
ChildProject.pipelines.pipeline module
ChildProject.pipelines.processors module
- class ChildProject.pipelines.processors.AudioProcessor(project: ChildProject.projects.ChildProject, name: str, input_profile: Optional[str] = None, threads: int = 1, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None)[source]
Bases:
abc.ABC
- class ChildProject.pipelines.processors.BasicProcessor(project: ChildProject.projects.ChildProject, name: str, format: str, codec: str, sampling: int, split: Optional[str] = None, threads: int = 1, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, skip_existing: bool = False, input_profile: Optional[str] = None)[source]
Bases:
ChildProject.pipelines.processors.AudioProcessor
- SUBCOMMAND = 'basic'
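A sketch converting every recording to 16 kHz, 16-bit WAV under a new profile (the profile name and the process() entry point are assumptions; the other arguments follow the signature above):

    from ChildProject.pipelines.processors import BasicProcessor

    processor = BasicProcessor(
        project,           # reusing the project instance from the metrics examples
        name="standard",   # hypothetical name of the output recording profile
        format="wav",
        codec="pcm_s16le",
        sampling=16000,
        threads=4,
    )
    processor.process()    # assumed entry point, not documented in this section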
- class ChildProject.pipelines.processors.ChannelMapper(project: ChildProject.projects.ChildProject, name: str, channels: list, threads: int = 1, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, input_profile: Optional[str] = None)[source]
Bases:
ChildProject.pipelines.processors.AudioProcessor
- SUBCOMMAND = 'channel-mapping'
- class ChildProject.pipelines.processors.VettingProcessor(project: ChildProject.projects.ChildProject, name: str, segments_path: str, threads: int = 1, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, input_profile: Optional[str] = None)[source]
Bases:
ChildProject.pipelines.processors.AudioProcessor
- SUBCOMMAND = 'vetting'
ChildProject.pipelines.samplers module
- class ChildProject.pipelines.samplers.ConversationSampler(project: ChildProject.projects.ChildProject, annotation_set: str, count: int, interval: int = 1000, speakers: List[str] = ['FEM', 'MAL', 'CHI'], threads: int = 1, by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Conversation sampler.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance
annotation_set (str) – set of annotations to derive conversations from
count (int) – amount of conversations to sample
interval (int, optional) – maximum time-interval between two consecutive vocalizations (in milliseconds) to consider them part of the same conversational block, defaults to 1000
speakers (List[str], optional) – list of speakers to target, defaults to [“FEM”, “MAL”, “CHI”]
threads (int, optional) – threads to run on, defaults to 1
by (str, optional) – units to sample from, defaults to “recording_filename”
recordings (Union[str, List[str], pd.DataFrame], optional) – whitelist of recordings, defaults to None
exclude (Union[str, pd.DataFrame], optional) – portions to exclude, defaults to None
- SUBCOMMAND = 'conversations'
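A usage sketch (the annotation set name is illustrative, and the sample() entry point is an assumption, as it is not documented in this section):

    from ChildProject.pipelines.samplers import ConversationSampler

    sampler = ConversationSampler(
        project,               # reusing the project instance from the examples above
        annotation_set="vtc",  # hypothetical set to derive conversations from
        count=10,              # number of conversational blocks to sample
        interval=1000,         # vocalizations <= 1 s apart form one block
    )
    sampler.sample()           # assumed entry point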
- class ChildProject.pipelines.samplers.CustomSampler(project: ChildProject.projects.ChildProject, segments_path: str, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
- SUBCOMMAND = 'custom'
- class ChildProject.pipelines.samplers.EnergyDetectionSampler(project: ChildProject.projects.ChildProject, windows_length: int, windows_spacing: int, windows_count: int, windows_offset: int = 0, threshold: float = 0.8, low_freq: int = 0, high_freq: int = 100000, threads: int = 1, profile: str = '', by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Sample windows within each recording, targeting those that have a signal energy higher than some threshold.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
windows_length (int) – Length of each window, in milliseconds.
windows_spacing (int) – Spacing between the start of each window, in milliseconds.
windows_count (int) – How many windows to retain per recording.
windows_offset (int, optional) – start of the first window, in milliseconds, defaults to 0
threshold (float, optional) – lowest energy quantile to sample from, defaults to 0.8
low_freq (int, optional) – if > 0, frequencies below will be filtered before calculating the energy, defaults to 0
high_freq (int, optional) – if < 100000, frequencies above will be filtered before calculating the energy, defaults to 100000
by (str, optional) – units to sample from, defaults to ‘recording_filename’
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'energy-detection'
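The threshold parameter selects windows by energy quantile rather than by absolute level. A minimal numpy sketch of that selection principle, assuming per-window energies have already been computed (this is not the sampler's actual implementation):

    import numpy as np

    energies = np.array([0.2, 1.5, 0.9, 3.1, 0.1, 2.4])  # hypothetical per-window energies
    threshold = 0.8                                      # keep windows above the 0.8 energy quantile

    cutoff = np.quantile(energies, threshold)
    selected = np.flatnonzero(energies >= cutoff)
    print(selected)  # indices of the retained windows -> [3 5]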
- class ChildProject.pipelines.samplers.HighVolubilitySampler(project: ChildProject.projects.ChildProject, annotation_set: str, metric: str, windows_length: int, windows_count: int, speakers: List[str] = ['FEM', 'MAL', 'CHI'], threads: int = 1, by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Return the top windows_count windows (of length windows_length) with the highest volubility from each recording, as calculated from the metric metric.
metric can be any of three values: words, turns, and vocs.
The words metric sums the amount of words within each window. For LENA annotations, it is equivalent to awc.
The turns metric (aka ctc) sums conversational turns within each window. It relies on lena_conv_turn_type for LENA annotations. For other annotations, turns are estimated as adult/child speech switches in close temporal proximity.
The vocs metric sums vocalizations within each window. If metric="vocs" and speakers=['CHI'], it is equivalent to the usual cvc metric (child vocalization counts).
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
annotation_set (str) – set of annotations to calculate volubility from.
metric (str) – the metric to evaluate high-volubility. should be any of ‘words’, ‘turns’, ‘vocs’.
windows_length (int) – length of the windows, in milliseconds
windows_count (int) – amount of top regions to extract per recording
by (str, optional) – units to sample from, defaults to ‘recording_filename’
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
threads (int) – amount of threads to run the sampler on
- SUBCOMMAND = 'high-volubility'
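A pandas sketch of the underlying idea, counting vocalizations per fixed-length window and keeping the most voluble windows (illustrative only, not the sampler's implementation):

    import pandas as pd

    # hypothetical vocalization onsets (in milliseconds) for one recording
    segments = pd.DataFrame({"segment_onset": [500, 1200, 1900, 61000, 62000, 125000]})
    windows_length = 60000  # 1-minute windows
    windows_count = 2       # keep the 2 most voluble windows

    windows = segments["segment_onset"] // windows_length
    top = windows.value_counts().head(windows_count)
    print(top)  # window index -> vocalization count, most voluble first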
- class ChildProject.pipelines.samplers.PeriodicSampler(project: ChildProject.projects.ChildProject, length: int, period: int, offset: int = 0, profile: Optional[str] = None, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Periodic sampling of a recording.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
length (int) – length of each segment, in milliseconds
period (int) – spacing between two consecutive segments, in milliseconds
offset (int) – offset of the first segment, in milliseconds, defaults to 0
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
- SUBCOMMAND = 'periodic'
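Segment onsets follow a simple arithmetic progression. An illustrative sketch, assuming period measures the gap between the end of one segment and the start of the next, as the description above suggests:

    # 10-minute recording, 1-minute segments spaced 2 minutes apart
    duration, length, period, offset = 600000, 60000, 120000, 0
    onsets = range(offset, duration - length + 1, length + period)
    segments = [(onset, onset + length) for onset in onsets]
    print(segments)  # [(0, 60000), (180000, 240000), (360000, 420000), (540000, 600000)]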
- class ChildProject.pipelines.samplers.RandomVocalizationSampler(project: ChildProject.projects.ChildProject, annotation_set: str, target_speaker_type: list, sample_size: int, threads: int = 1, by: str = 'recording_filename', recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
ChildProject.pipelines.samplers.Sampler
Sample vocalizations based on some input annotation set.
- Parameters
project (ChildProject.projects.ChildProject) – ChildProject instance of the target dataset.
annotation_set (str) – Set of annotations to get vocalizations from.
target_speaker_type (list) – List of speaker types to sample vocalizations from.
sample_size (int) – Amount of vocalizations to sample, per recording.
by (str, optional) – units to sample from, defaults to ‘recording_filename’
recordings (Union[str, List[str], pd.DataFrame], optional) – recordings to sample from; if None, all recordings will be sampled, defaults to None
threads (int, optional) – amount of threads to run on, defaults to 1
- SUBCOMMAND = 'random-vocalizations'
- class ChildProject.pipelines.samplers.Sampler(project: ChildProject.projects.ChildProject, recordings: Optional[Union[str, List[str], pandas.core.frame.DataFrame]] = None, exclude: Optional[Union[str, pandas.core.frame.DataFrame]] = None)[source]
Bases:
abc.ABC
ChildProject.pipelines.zooniverse module
- class ChildProject.pipelines.zooniverse.Chunk(recording_filename, onset, offset, segment_onset, segment_offset)[source]
Bases:
object
- class ChildProject.pipelines.zooniverse.ZooniversePipeline[source]
Bases:
ChildProject.pipelines.pipeline.Pipeline
- extract_chunks(path: str, destination: str, keyword: str, segments: str, chunks_length: int = -1, chunks_min_amount: int = 1, profile: str = '', threads: int = 1, **kwargs)[source]
Extract audio chunks based on a list of segments and prepare them for upload to Zooniverse.
- Parameters
path (str) – dataset path
destination (str) – path to the folder where to store the metadata and audio chunks
segments (str) – path to the input segments CSV dataframe
keyword (str) – keyword to insert in the output metadata
chunks_length (int, optional) – length of the chunks, in milliseconds, defaults to -1
chunks_min_amount (int, optional) – minimum amount of chunk per segment, defaults to 1
profile (str) – recording profile to extract from. If undefined, raw recordings will be used.
threads (int, optional) – amount of threads to run on, defaults to 1
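For instance (all paths and the keyword are hypothetical; the arguments follow the signature above):

    from ChildProject.pipelines.zooniverse import ZooniversePipeline

    pipeline = ZooniversePipeline()
    pipeline.extract_chunks(
        path=".",                         # dataset path
        destination="zooniverse_chunks",  # output folder for metadata and audio
        keyword="batch1",                 # keyword stored in the output metadata
        segments="samples/segments.csv",  # input segments dataframe
        chunks_length=500,                # 500 ms chunks
        threads=4,
    )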
- get_credentials(login: str = '', pwd: str = '')[source]
Returns input credentials if provided, or attempts to read them from the environment variables.
- Parameters
login (str, optional) – input login, defaults to ‘’
pwd (str, optional) – input password, defaults to ‘’
- Returns
(login, pwd)
- Return type
(str, str)
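A sketch of the fallback behavior described above, using the environment variables named later in this section (ZOONIVERSE_LOGIN and ZOONIVERSE_PWD); this mirrors the documented behavior rather than reproducing the actual method:

    import os

    def get_credentials(login: str = "", pwd: str = ""):
        # explicit arguments take precedence; otherwise fall back to the environment
        login = login or os.environ.get("ZOONIVERSE_LOGIN", "")
        pwd = pwd or os.environ.get("ZOONIVERSE_PWD", "")
        return login, pwd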
- retrieve_classifications(destination: str, project_id: int, zooniverse_login: str = '', zooniverse_pwd: str = '', chunks: List[str] = [], **kwargs)[source]
Retrieve classifications from Zooniverse as a CSV dataframe. They will be matched with the original chunks metadata if the path to one or more chunk metadata files is provided.
- Parameters
destination (str) – output CSV dataframe destination
project_id (int) – zooniverse project id
zooniverse_login (str, optional) – zooniverse login. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_LOGIN instead, defaults to ''
zooniverse_pwd (str, optional) – zooniverse password. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_PWD instead, defaults to ''
chunks (List[str], optional) – the list of chunk metadata files to match the classifications to. If provided, only the classifications that have a match will be returned.
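For instance, reusing the pipeline instance from the extract_chunks sketch (the project id and paths are hypothetical; credentials are read from the environment):

    pipeline.retrieve_classifications(
        destination="classifications.csv",        # output CSV
        project_id=12345,                         # hypothetical Zooniverse project id
        chunks=["zooniverse_chunks/chunks.csv"],  # hypothetical chunk metadata to match against
    )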
- upload_chunks(chunks: str, project_id: int, set_name: str, zooniverse_login='', zooniverse_pwd='', amount: int = 1000, ignore_errors: bool = False, **kwargs)[source]
Uploads amount audio chunks from the CSV dataframe chunks to a Zooniverse project.
- Parameters
chunks (str) – path to the chunk CSV dataframe
project_id (int) – zooniverse project id
set_name (str) – name of the subject set
zooniverse_login (str, optional) – zooniverse login. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_LOGIN instead, defaults to ''
zooniverse_pwd (str, optional) – zooniverse password. If not specified, the program attempts to get it from the environment variable ZOONIVERSE_PWD instead, defaults to ''
amount (int, optional) – amount of chunks to upload, defaults to 1000
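For instance, again with hypothetical values and the pipeline instance from the earlier sketch:

    pipeline.upload_chunks(
        chunks="zooniverse_chunks/chunks.csv",  # chunk metadata produced by extract_chunks
        project_id=12345,                       # hypothetical Zooniverse project id
        set_name="batch1",                      # hypothetical subject set name
        amount=200,                             # upload at most 200 chunks
    )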