ChildProject package
Subpackages
- ChildProject.pipelines package
- Submodules
- ChildProject.pipelines.anonymize module
- ChildProject.pipelines.eafbuilder module
- ChildProject.pipelines.metrics module
- ChildProject.pipelines.metricsFunctions module
avg_can_voc_dur_speaker()
avg_cry_voc_dur_speaker()
avg_non_can_voc_dur_speaker()
avg_voc_dur_speaker()
can_voc_dur_speaker()
can_voc_dur_speaker_ph()
can_voc_speaker()
can_voc_speaker_ph()
cp_dur()
cp_n()
cry_voc_dur_speaker()
cry_voc_dur_speaker_ph()
cry_voc_speaker()
cry_voc_speaker_ph()
lena_CTC()
lena_CTC_ph()
lena_CVC()
lena_CVC_ph()
lp_dur()
lp_n()
metricFunction()
non_can_voc_dur_speaker()
non_can_voc_dur_speaker_ph()
non_can_voc_speaker()
non_can_voc_speaker_ph()
pc_adu()
pc_adu_ph()
pc_speaker()
pc_speaker_ph()
peak_can_voc_dur_speaker()
peak_can_voc_speaker()
peak_cry_voc_dur_speaker()
peak_cry_voc_speaker()
peak_hour_metric()
peak_lena_CTC()
peak_lena_CVC()
peak_non_can_voc_dur_speaker()
peak_non_can_voc_speaker()
peak_pc_adu()
peak_pc_speaker()
peak_sc_adu()
peak_sc_speaker()
peak_simple_CTC()
peak_voc_dur_speaker()
peak_voc_speaker()
peak_wc_adu()
peak_wc_speaker()
per_hour_metric()
sc_adu()
sc_adu_ph()
sc_speaker()
sc_speaker_ph()
simple_CTC()
simple_CTC_ph()
voc_dur_speaker()
voc_dur_speaker_ph()
voc_speaker()
voc_speaker_ph()
wc_adu()
wc_adu_ph()
wc_speaker()
wc_speaker_ph()
- ChildProject.pipelines.pipeline module
- ChildProject.pipelines.processors module
- ChildProject.pipelines.samplers module
- ChildProject.pipelines.zooniverse module
Chunk
ZooniversePipeline
ZooniversePipeline.exit_upload()
ZooniversePipeline.extract_chunks()
ZooniversePipeline.get_credentials()
ZooniversePipeline.link_orphan_subjects()
ZooniversePipeline.reset_orphan_subjects()
ZooniversePipeline.retrieve_classifications()
ZooniversePipeline.run()
ZooniversePipeline.setup_parser()
ZooniversePipeline.upload_chunks()
pad_interval()
- Module contents
- ChildProject.templates package
Submodules
ChildProject.annotations module
- class ChildProject.annotations.AnnotationManager(project: ChildProject)[source]
Bases:
object
- INDEX_COLUMNS = [IndexColumn(name = set), IndexColumn(name = recording_filename), IndexColumn(name = time_seek), IndexColumn(name = range_onset), IndexColumn(name = range_offset), IndexColumn(name = raw_filename), IndexColumn(name = format), IndexColumn(name = filter), IndexColumn(name = annotation_filename), IndexColumn(name = imported_at), IndexColumn(name = package_version), IndexColumn(name = error), IndexColumn(name = merged_from)]
- SEGMENTS_COLUMNS = [IndexColumn(name = raw_filename), IndexColumn(name = segment_onset), IndexColumn(name = segment_offset), IndexColumn(name = speaker_id), IndexColumn(name = speaker_type), IndexColumn(name = ling_type), IndexColumn(name = vcm_type), IndexColumn(name = lex_type), IndexColumn(name = mwu_type), IndexColumn(name = msc_type), IndexColumn(name = gra_type), IndexColumn(name = addressee), IndexColumn(name = transcription), IndexColumn(name = phonemes), IndexColumn(name = syllables), IndexColumn(name = words), IndexColumn(name = lena_block_type), IndexColumn(name = lena_block_number), IndexColumn(name = lena_conv_status), IndexColumn(name = lena_response_count), IndexColumn(name = lena_conv_floor_type), IndexColumn(name = lena_conv_turn_type), IndexColumn(name = lena_speaker), IndexColumn(name = utterances_count), IndexColumn(name = utterances_length), IndexColumn(name = non_speech_length), IndexColumn(name = average_db), IndexColumn(name = peak_db), IndexColumn(name = child_cry_vfx_len), IndexColumn(name = utterances), IndexColumn(name = cries), IndexColumn(name = vfxs)]
- static clip_segments(segments: DataFrame, start: int, stop: int) DataFrame [source]
Clip all segment onsets and offsets within ``start`` and ``stop``. Segments entirely outside of the range [``start``, ``stop``] will be removed.
- Parameters:
segments (pd.DataFrame) – Dataframe of the segments to clip
start (int) – range start (in milliseconds)
stop (int) – range end (in milliseconds)
- Returns:
Dataframe of the clipped segments
- Return type:
pd.DataFrame
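The clipping behavior described above can be sketched in plain pandas. This is a minimal illustration of the documented semantics, not the library's actual implementation; column names follow the Annotations format:

```python
import pandas as pd

def clip_segments_sketch(segments: pd.DataFrame, start: int, stop: int) -> pd.DataFrame:
    # Clamp every onset/offset into [start, stop] (milliseconds)
    out = segments.copy()
    out["segment_onset"] = out["segment_onset"].clip(lower=start, upper=stop)
    out["segment_offset"] = out["segment_offset"].clip(lower=start, upper=stop)
    # Segments entirely outside the range collapse to zero duration; drop them
    return out[out["segment_offset"] > out["segment_onset"]].reset_index(drop=True)

segments = pd.DataFrame({
    "segment_onset":  [0, 500, 2500],
    "segment_offset": [400, 1500, 3000],
})
clipped = clip_segments_sketch(segments, start=300, stop=2000)
# The third segment (2500-3000 ms) lies outside [300, 2000] and is removed
```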
- derive_annotations(input_set: str, output_set: str, derivation_function: Union[str, Callable], threads: int = -1, overwrite_existing: bool = False) Tuple[DataFrame, DataFrame] [source]
Derive annotations: from an existing set of annotations, create a new set whose contents are derived from the original set.
- Parameters:
input_set (str) – name of the set of annotations to derive from
output_set (str) – name of the new set of derived annotations
derivation_function (Union[str, Callable]) – name of the derivation type to be performed
threads (int, optional) – if > 1, derivations will be run on ``threads`` threads, defaults to -1
overwrite_existing (bool, optional) – whether lines with the same set and annotation_filename should be overwritten, defaults to False
- Returns:
tuple of a dataframe of derived annotations (as in Annotations index) and a dataframe of errors
- Return type:
tuple (pd.DataFrame, pd.DataFrame)
- get_collapsed_segments(annotations: DataFrame) DataFrame [source]
Get all segments associated with the annotations referenced in ``annotations``, and collapse them into one virtual timeline.
- Parameters:
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns:
dataframe of all the segments (as specified in Annotations format), merged with ``annotations``
- Return type:
pd.DataFrame
- get_segments(annotations: DataFrame) DataFrame [source]
Get all segments associated with the annotations referenced in ``annotations``.
- Parameters:
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns:
dataframe of all the segments (as specified in Annotations format), merged with ``annotations``.
- Return type:
pd.DataFrame
- get_segments_timestamps(segments: DataFrame, ignore_date: bool = False, onset: str = 'segment_onset', offset: str = 'segment_offset') DataFrame [source]
Calculate the onset and offset clock-time of each segment
- Parameters:
segments (pd.DataFrame) – DataFrame of segments (as returned by get_segments()).
ignore_date (bool, optional) – discard date information and use time data only, defaults to False
onset (str, optional) – column storing the onset timestamp in milliseconds, defaults to “segment_onset”
offset (str, optional) – column storing the offset timestamp in milliseconds, defaults to “segment_offset”
- Returns:
the input dataframe with two new columns, ``onset_time`` and ``offset_time``. ``onset_time`` is a datetime object corresponding to the onset of the segment; ``offset_time`` is a datetime object corresponding to the offset of the segment. If either ``start_time`` or ``date_iso`` is not specified for the corresponding recording, both values will be set to NaT.
- Return type:
pd.DataFrame
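The clock-time computation can be illustrated with plain pandas. This is a hedged sketch: the real method derives the start timestamp from the recording's ``date_iso`` and ``start_time`` metadata, whereas the timestamp below is hypothetical:

```python
import pandas as pd

# Hypothetical recording start (the real method reads date_iso/start_time
# from the recordings metadata)
recording_start = pd.Timestamp("2020-04-20 08:00:00")

segments = pd.DataFrame({
    "segment_onset":  [0, 60000],     # milliseconds since recording start
    "segment_offset": [30000, 90000],
})
segments["onset_time"] = recording_start + pd.to_timedelta(segments["segment_onset"], unit="ms")
segments["offset_time"] = recording_start + pd.to_timedelta(segments["segment_offset"], unit="ms")
```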
- get_subsets(annotation_set: str, recursive: bool = False) List[str] [source]
Retrieve the list of subsets belonging to a given set of annotations.
- Parameters:
annotation_set (str) – input set
recursive (bool, optional) – If True, get subsets recursively, defaults to False
- Returns:
the list of subsets names
- Return type:
list
- get_within_ranges(ranges: DataFrame, sets: Optional[Union[Set, List]] = None, missing_data: str = 'ignore')[source]
Retrieve and clip annotations that cover specific portions of recordings (``ranges``).
The desired ranges are defined by an input dataframe with three columns: ``recording_filename``, ``range_onset``, and ``range_offset``. The function returns a dataframe of annotations under the same format as the index of annotations (Annotations index).
This output can then be provided to get_segments() in order to retrieve segments of annotations that match the desired ranges.
For instance, the code below prints all the segments of annotations corresponding to the first hour of each recording:
>>> from ChildProject.projects import ChildProject
>>> from ChildProject.annotations import AnnotationManager
>>> project = ChildProject('.')
>>> am = AnnotationManager(project)
>>> am.read()
>>> ranges = project.recordings
>>> ranges['range_onset'] = 0
>>> ranges['range_offset'] = 60*60*1000
>>> matches = am.get_within_ranges(ranges)
>>> am.get_segments(matches)
- Parameters:
ranges (pd.DataFrame) – pandas dataframe with one row per range to be considered and three columns: ``recording_filename``, ``range_onset``, ``range_offset``.
sets (Union[Set, List]) – optional list of annotation sets to retrieve. If None, all annotations from all sets will be retrieved.
missing_data (str, defaults to ignore) – how to handle missing annotations (“ignore”, “warn” or “raise”)
- Return type:
pd.DataFrame
- get_within_time_range(annotations: DataFrame, interval: Optional[TimeInterval] = None, start_time: Optional[str] = None, end_time: Optional[str] = None)[source]
Clip all input annotations within a given HH:MM:SS clock-time range. Those that do not intersect the input time range at all are filtered out.
- Parameters:
annotations (pd.DataFrame) – DataFrame of input annotations to filter. The only required columns are ``recording_filename``, ``range_onset``, and ``range_offset``.
interval (TimeInterval) – interval of hours to consider, containing the start hour and end hour
start_time (str) – start_time to use in a HH:MM format, only used if interval is None, replaces the first value of interval
end_time (str) – end_time to use in a HH:MM format, only used if interval is None, replaces the second value of interval
- Returns:
a DataFrame of annotations. For each row, ``range_onset`` and ``range_offset`` are clipped within the desired clock-time range. The clock-time corresponding to the onset and offset of each annotation is stored in two newly created columns named ``range_onset_time`` and ``range_offset_time``. If an input annotation exceeds 24 hours, one row per matching interval is returned.
- Return type:
pd.DataFrame
- import_annotations(input: DataFrame, threads: int = -1, import_function: Optional[Callable[[str], DataFrame]] = None, new_tiers: Optional[list] = None, overwrite_existing: bool = False) DataFrame [source]
Import and convert annotations.
- Parameters:
input (pd.DataFrame) – dataframe of all annotations to import, as described in Annotation importation input format.
threads (int, optional) – if > 1, conversions will be run on ``threads`` threads, defaults to -1
import_function (Callable[[str], pd.DataFrame], optional) – if specified, the custom ``import_function`` function will be used to convert all ``input`` annotations, defaults to None
new_tiers (list[str], optional) – list of EAF tier names. If specified, the corresponding EAF tiers will be imported.
overwrite_existing (bool, optional) – choose if lines with the same set and annotation_filename should be overwritten
- Returns:
dataframe of imported annotations, as in Annotations index.
- Return type:
pd.DataFrame
- static intersection(annotations: DataFrame, sets: Optional[list] = None) DataFrame [source]
Compute the intersection of all annotations for all sets and recordings, based on their ``recording_filename``, ``range_onset`` and ``range_offset`` attributes. (Only these columns are required, but more can be passed and they will be preserved.)
- Parameters:
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns:
dataframe of annotations, according to Annotations index
- Return type:
pd.DataFrame
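The per-recording intersection of coverage ranges can be sketched as follows. This is an illustration of the documented idea with pandas (simplified to one contiguous range per set), not the library's code:

```python
import pandas as pd

annotations = pd.DataFrame({
    "recording_filename": ["rec1.wav", "rec1.wav"],
    "set": ["vtc", "its"],
    "range_onset":  [0,    1000],
    "range_offset": [5000, 8000],
})
# The range covered by *every* set is [max(onsets), min(offsets)] per recording
common = annotations.assign(
    range_onset=annotations.groupby("recording_filename")["range_onset"].transform("max"),
    range_offset=annotations.groupby("recording_filename")["range_offset"].transform("min"),
)
```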
- merge_annotations(left_columns, right_columns, columns, output_set, input, skip_existing: bool = False)[source]
Given two DataFrames listing the annotation indexes to merge (these indexes should come from the intersection of the left_set and right_set indexes), the list of columns to merge, and the name of the output set, create the resulting csv files containing the converted merged segments, and return the new indexes to add to annotations.csv.
- Parameters:
left_columns (list[str]) – list of the columns to include from the left set
right_columns (list[str]) – list of the columns to include from the right set
columns (dict) – additional columns to add to the segments, key is the column name
output_set (str) – name of the set to save the new merged files into
input (dict) – annotation indexes to use for the merge; contains keys ‘left_annotations’ and ‘right_annotations’ to separate indexes from the left and right sets
- Returns:
annotation indexes created by the merge, should be added to annotations.csv
- Return type:
pandas.DataFrame
- merge_sets(left_set: str, right_set: str, left_columns: List[str], right_columns: List[str], output_set: str, full_set_merge: bool = True, skip_existing: bool = False, columns: dict = {}, recording_filter: Optional[str] = None, threads=-1)[source]
Merge columns from ``left_set`` and ``right_set`` annotations, for all matching segments, into a new set of annotations named ``output_set`` that will be saved in the dataset. ``output_set`` must not already exist if full_set_merge is True.
- Parameters:
left_set (str) – Left set of annotations.
right_set (str) – Right set of annotations.
left_columns (List) – Columns which values will be based on the left set.
right_columns (List) – Columns which values will be based on the right set.
output_set (str) – Name of the output annotations set.
full_set_merge (bool) – if True, the merge is meant to create the entire merged set, so the output set must not already exist. defaults to True
skip_existing (bool) – if True, the merge skips lines that already exist in the merged set, leaving both the annotation index and the resulting converted csv unchanged for those lines
columns (dict) – Additional columns to add to the resulting converted annotations.
recording_filter (set[str]) – set of recording_filenames to merge.
threads (int) – number of threads
- Returns:
annotation indexes of the merged set
- Return type:
pd.DataFrame
- read() Tuple[List[str], List[str]] [source]
Read the index of annotations from ``metadata/annotations.csv`` and store it into self.annotations.
- Returns:
a tuple containing the list of errors and the list of warnings generated while reading the index
- Return type:
Tuple[List[str],List[str]]
- remove_set(annotation_set: str, recursive: bool = False)[source]
Remove a set of annotations, deleting every converted file and removing them from the index. This preserves raw annotations.
- Parameters:
annotation_set (str) – set of annotations to remove
recursive (bool, optional) – remove subsets as well, defaults to False
- rename_set(annotation_set: str, new_set: str, recursive: bool = False, ignore_errors: bool = False)[source]
Rename a set of annotations, moving all related files and updating the index accordingly.
- Parameters:
annotation_set (str) – name of the set to rename
new_set (str) – new set name
recursive (bool, optional) – rename subsets as well, defaults to False
ignore_errors (bool, optional) – If True, keep going even if unindexed files are detected, defaults to False
- validate(annotations: Optional[DataFrame] = None, threads: int = 0) Tuple[List[str], List[str]] [source]
check all indexed annotations for errors
- Parameters:
annotations (pd.DataFrame, optional) – annotations to validate, defaults to None. If None, the whole index will be scanned.
threads (int, optional) – how many threads to run the tests with, defaults to 0. If <= 0, all available CPU cores will be used.
- Returns:
a tuple containing the list of errors and the list of warnings detected
- Return type:
Tuple[List[str], List[str]]
ChildProject.cmdline module
ChildProject.converters module
- class ChildProject.converters.AliceConverter[source]
Bases:
AnnotationConverter
- FORMAT = 'alice'
- class ChildProject.converters.AnnotationConverter[source]
Bases:
object
- SPEAKER_ID_TO_TYPE = {'C1': 'OCH', 'C2': 'OCH', 'CHI': 'CHI', 'CHI*': 'CHI', 'EE1': 'NA', 'EE2': 'NA', 'FA0': 'FEM', 'FA1': 'FEM', 'FA2': 'FEM', 'FA3': 'FEM', 'FA4': 'FEM', 'FA5': 'FEM', 'FA6': 'FEM', 'FA7': 'FEM', 'FA8': 'FEM', 'FAE': 'NA', 'FC1': 'OCH', 'FC2': 'OCH', 'FC3': 'OCH', 'FCE': 'NA', 'MA0': 'MAL', 'MA1': 'MAL', 'MA2': 'MAL', 'MA3': 'MAL', 'MA4': 'MAL', 'MA5': 'MAL', 'MAE': 'NA', 'MC1': 'OCH', 'MC2': 'OCH', 'MC3': 'OCH', 'MC4': 'OCH', 'MC5': 'OCH', 'MCE': 'NA', 'MI1': 'OCH', 'MOT*': 'FEM', 'OC0': 'OCH', 'UA1': 'NA', 'UA2': 'NA', 'UA3': 'NA', 'UA4': 'NA', 'UA5': 'NA', 'UA6': 'NA', 'UC1': 'OCH', 'UC2': 'OCH', 'UC3': 'OCH', 'UC4': 'OCH', 'UC5': 'OCH', 'UC6': 'OCH'}
- THREAD_SAFE = True
- class ChildProject.converters.ChatConverter[source]
Bases:
AnnotationConverter
- ADDRESSEE_TABLE = {'CHI': 'T', 'FEM': 'A', 'MAL': 'A', 'OCH': 'C'}
- FORMAT = 'cha'
- SPEAKER_ROLE_TO_TYPE = {'Adult': 'NA', 'Attorney': 'NA', 'Audience': 'NA', 'Boy': 'OCH', 'Brother': 'OCH', 'Caretaker': 'NA', 'Child': 'OCH', 'Doctor': 'NA', 'Environment': 'NA', 'Father': 'MAL', 'Female': 'FEM', 'Friend': 'OCH', 'Girl': 'OCH', 'Grandfather': 'MAL', 'Grandmother': 'FEM', 'Group': 'NA', 'Guest': 'NA', 'Host': 'NA', 'Investigator': 'NA', 'Justice': 'NA', 'LENA': 'NA', 'Leader': 'NA', 'Male': 'MAL', 'Media': 'NA', 'Member': 'NA', 'Mother': 'FEM', 'Narrator': 'NA', 'Nurse': 'NA', 'Other': 'NA', 'Participant': 'CHI', 'Partner': 'NA', 'PlayRole': 'NA', 'Playmate': 'OCH', 'Relative': 'NA', 'Sibling': 'OCH', 'Sister': 'OCH', 'Speaker': 'NA', 'Student': 'NA', 'Target_Adult': 'NA', 'Target_Child': 'CHI', 'Teacher': 'NA', 'Teenager': 'NA', 'Text': 'NA', 'Uncertain': 'NA', 'Unidentified': 'NA', 'Visitor': 'NA'}
- THREAD_SAFE = False
- class ChildProject.converters.CsvConverter[source]
Bases:
AnnotationConverter
- FORMAT = 'csv'
- class ChildProject.converters.EafConverter[source]
Bases:
AnnotationConverter
- FORMAT = 'eaf'
- class ChildProject.converters.Formats(value)[source]
Bases:
Enum
An enumeration.
- ALICE = 'alice'
- CHA = 'cha'
- CSV = 'csv'
- EAF = 'eaf'
- ITS = 'its'
- TEXTGRID = 'TextGrid'
- VCM = 'vcm_rttm'
- VTC = 'vtc_rttm'
- class ChildProject.converters.ItsConverter[source]
Bases:
AnnotationConverter
- FORMAT = 'its'
- SPEAKER_TYPE_TRANSLATION = {'CHN': 'CHI', 'CXN': 'OCH', 'FAN': 'FEM', 'MAN': 'MAL'}
- class ChildProject.converters.TextGridConverter[source]
Bases:
AnnotationConverter
- FORMAT = 'TextGrid'
ChildProject.metrics module
- ChildProject.metrics.conf_matrix(rows_grid, columns_grid)[source]
Compute the confusion matrix (as counts) from grids of active classes. See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters:
rows_grid (numpy.array) – the grid corresponding to the rows of the confusion matrix.
columns_grid (numpy.array) – the grid corresponding to the columns of the confusion matrix.
- Returns:
a square numpy array of counts
- Return type:
numpy.array
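One plausible way to obtain such counts from two grids (see segments_to_grid()) is a matrix product. This is a sketch of the idea, not necessarily the package's implementation:

```python
import numpy as np

# Two grids over 3 time units and 2 classes (1 = class active at that time)
rows_grid = np.array([[1, 0],
                      [1, 0],
                      [0, 1]])
columns_grid = np.array([[1, 0],
                         [0, 1],
                         [0, 1]])
# Cell (i, j) counts time units where class i is active in rows_grid
# and class j is active in columns_grid
confusion = rows_grid.T @ columns_grid
```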
- ChildProject.metrics.gamma(segments: DataFrame, column: str, alpha: float = 1, beta: float = 1, precision_level: float = 0.05) float [source]
Compute Mathet et al. gamma agreement on segments.
The gamma measure evaluates the reliability of both the segmentation and the categorization simultaneously; an extensive description of the method and its parameters can be found in Mathet et al., 2015 (doi:10.1162/COLI_a_00227).
This function uses the pygamma-agreement package by Titeux et al.
- Parameters:
segments (pd.DataFrame) – input segments dataframe (see Annotations format for the dataframe format)
column (str) – name of the categorical column of the segments to consider, e.g. ‘speaker_type’
alpha (float, optional) – gamma agreement time alignment weight, defaults to 1
beta (float, optional) – gamma agreement categorical weight, defaults to 1
precision_level (float, optional) – level of precision (see pygamma-agreement’s documentation), defaults to 0.05
- Returns:
gamma agreement
- Return type:
float
- ChildProject.metrics.grid_to_vector(grid, categories)[source]
Transform a grid of active classes into a vector of labels. In case several classes are active at time i, the label is set to ‘overlap’.
See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters:
grid (numpy.array) – a NumPy array of shape ``(n, len(categories))``
categories (list) – the list of categories
- Returns:
the vector of labels of length ``n`` (e.g. ``np.array([none FEM FEM FEM overlap overlap CHI])``)
- Return type:
numpy.array
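A minimal sketch of the labelling rule described above (assumed behavior based on the docstring; not the library's implementation):

```python
import numpy as np

def grid_to_vector_sketch(grid: np.ndarray, categories: list) -> np.ndarray:
    labels = []
    for row in grid:
        active = [c for c, on in zip(categories, row) if on]
        if not active:
            labels.append("none")       # no class active at this time unit
        elif len(active) > 1:
            labels.append("overlap")    # several classes active simultaneously
        else:
            labels.append(active[0])
    return np.array(labels)

grid = np.array([[0, 0], [1, 0], [1, 1]])
vector = grid_to_vector_sketch(grid, ["FEM", "CHI"])
```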
- ChildProject.metrics.pyannote_metric(segments: DataFrame, reference: str, hypothesis: str, metric, column: str)[source]
- ChildProject.metrics.segments_to_annotation(segments: DataFrame, column: str)[source]
Transform a dataframe of annotation segments into a pyannote.core.Annotation object
- Parameters:
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: ``segment_onset``, ``segment_offset`` and ``column``.
column (str) – the name of the column in ``segments`` that should be used for the values of the annotations (e.g. speaker_type).
- Returns:
the pyannote.core.Annotation object.
- Return type:
pyannote.core.Annotation
- ChildProject.metrics.segments_to_grid(segments: DataFrame, range_onset: int, range_offset: int, timescale: int, column: str, categories: list, none=True, overlap=False) float [source]
Transform a dataframe of annotation segments into a 2d matrix representing the indicator function of each of the ``categories`` across time.
Each row of the matrix corresponds to a unit of time of length ``timescale`` (in milliseconds), ranging from ``range_onset`` to ``range_offset``; each column corresponds to one of the ``categories`` provided, plus two special columns (overlap and none).
The value of the cell ``ij`` of the output matrix is set to 1 if the class ``j`` is active at time ``i``, 0 otherwise.
If overlap is True, an additional column is appended to the grid, which is set to 1 if more than one class is active at time ``i``.
If none is set to True, an additional column is appended to the grid, which is set to 1 if none of the classes are active at time ``i``.
The shape of the output matrix is therefore ``((range_offset-range_onset)/timescale, len(categories) + n)``, where n = 2 if both overlap and none are True, 1 if one of them is True, and 0 otherwise.
The fraction of time a class ``j`` is active can therefore be calculated as ``np.mean(grid, axis = 0)[j]``.
- Parameters:
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: ``segment_onset``, ``segment_offset`` and ``column``.
range_onset (int) – timestamp of the beginning of the range to consider (in milliseconds)
range_offset (int) – timestamp of the end of the range to consider (in milliseconds)
timescale (int) – length of each time unit (in milliseconds)
column (str) – the name of the column in ``segments`` that should be used for the values of the annotations (e.g. speaker_type)
categories (list) – the list of categories
none (bool) – append a ‘none’ column, default True
overlap (bool) – append an overlap column, default False
- Returns:
the output grid
- Return type:
numpy.array
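The indicator-grid construction can be sketched as follows. This is a simplified version without the none/overlap columns; the time-unit arithmetic is an assumption consistent with the shape formula above:

```python
import numpy as np
import pandas as pd

def segments_to_grid_sketch(segments, range_onset, range_offset, timescale, column, categories):
    n_units = (range_offset - range_onset) // timescale
    grid = np.zeros((n_units, len(categories)), dtype=int)
    for _, seg in segments.iterrows():
        j = categories.index(seg[column])
        first = max(0, (seg["segment_onset"] - range_onset) // timescale)
        last = min(n_units, -(-(seg["segment_offset"] - range_onset) // timescale))  # ceil division
        grid[first:last, j] = 1  # class j is active over these time units
    return grid

segments = pd.DataFrame({
    "segment_onset":  [0, 150],
    "segment_offset": [100, 300],
    "speaker_type":   ["FEM", "CHI"],
})
grid = segments_to_grid_sketch(segments, 0, 300, 100, "speaker_type", ["FEM", "CHI"])
```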
- ChildProject.metrics.vectors_to_annotation_task(*args, drop: List[str] = [])[source]
Transform vectors of labels into an nltk AnnotationTask object.
- Parameters:
args (1d np.array() of labels) – vector of labels for each annotator; add one argument per annotator.
drop (List[str]) – list of labels that should be ignored
- Returns:
the AnnotationTask object
- Return type:
nltk.metrics.agreement.AnnotationTask
ChildProject.projects module
- class ChildProject.projects.ChildProject(path: str, enforce_dtypes: bool = False, ignore_discarded: bool = True)[source]
Bases:
object
ChildProject instance. This class is a representation of a ChildProject dataset.
Constructor parameters:
- Parameters:
path (str) – path to the root of the dataset.
enforce_dtypes (bool, optional) – enforce dtypes on children/recordings dataframes, defaults to False
ignore_discarded (bool, optional) – ignore entries such that discard=1, defaults to True
Attributes:
- Parameters:
path (str) – path to the root of the dataset.
recordings (pd.DataFrame) – pandas dataframe representation of this dataset's metadata/recordings.csv
children (pd.DataFrame) – pandas dataframe representation of this dataset's metadata/children.csv
- CHILDREN_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = child_dob), IndexColumn(name = location_id), IndexColumn(name = child_sex), IndexColumn(name = language), IndexColumn(name = languages), IndexColumn(name = mat_ed), IndexColumn(name = fat_ed), IndexColumn(name = car_ed), IndexColumn(name = monoling), IndexColumn(name = monoling_criterion), IndexColumn(name = normative), IndexColumn(name = normative_criterion), IndexColumn(name = mother_id), IndexColumn(name = father_id), IndexColumn(name = order_of_birth), IndexColumn(name = n_of_siblings), IndexColumn(name = household_size), IndexColumn(name = dob_criterion), IndexColumn(name = dob_accuracy), IndexColumn(name = discard)]
- DOCUMENTATION_COLUMNS = [IndexColumn(name = variable), IndexColumn(name = description), IndexColumn(name = values), IndexColumn(name = scope), IndexColumn(name = annotation_set)]
- RECORDINGS_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = date_iso), IndexColumn(name = start_time), IndexColumn(name = recording_device_type), IndexColumn(name = recording_filename), IndexColumn(name = duration), IndexColumn(name = session_id), IndexColumn(name = session_offset), IndexColumn(name = recording_device_id), IndexColumn(name = experimenter), IndexColumn(name = location_id), IndexColumn(name = its_filename), IndexColumn(name = upl_filename), IndexColumn(name = trs_filename), IndexColumn(name = lena_id), IndexColumn(name = lena_recording_num), IndexColumn(name = might_feature_gaps), IndexColumn(name = start_time_accuracy), IndexColumn(name = noisy_setting), IndexColumn(name = notes), IndexColumn(name = discard)]
- REC_COL_REF = {'child_id': IndexColumn(name = child_id), 'date_iso': IndexColumn(name = date_iso), 'discard': IndexColumn(name = discard), 'duration': IndexColumn(name = duration), 'experiment': IndexColumn(name = experiment), 'experimenter': IndexColumn(name = experimenter), 'its_filename': IndexColumn(name = its_filename), 'lena_id': IndexColumn(name = lena_id), 'lena_recording_num': IndexColumn(name = lena_recording_num), 'location_id': IndexColumn(name = location_id), 'might_feature_gaps': IndexColumn(name = might_feature_gaps), 'noisy_setting': IndexColumn(name = noisy_setting), 'notes': IndexColumn(name = notes), 'recording_device_id': IndexColumn(name = recording_device_id), 'recording_device_type': IndexColumn(name = recording_device_type), 'recording_filename': IndexColumn(name = recording_filename), 'session_id': IndexColumn(name = session_id), 'session_offset': IndexColumn(name = session_offset), 'start_time': IndexColumn(name = start_time), 'start_time_accuracy': IndexColumn(name = start_time_accuracy), 'trs_filename': IndexColumn(name = trs_filename), 'upl_filename': IndexColumn(name = upl_filename)}
- REQUIRED_DIRECTORIES = ['recordings', 'extra']
- accumulate_metadata(table: str, df: DataFrame, columns: list, merge_column: str, verbose=False) DataFrame [source]
- compute_ages(recordings: Optional[DataFrame] = None, children: Optional[DataFrame] = None, age_format: str = 'months') Series [source]
Compute the age of the subject child for each recording (in months, as a float) and return it as a pandas Series object.
Example:
>>> from ChildProject.projects import ChildProject
>>> project = ChildProject("examples/valid_raw_data")
>>> project.read()
>>> project.recordings["age"] = project.compute_ages()
>>> project.recordings[["child_id", "date_iso", "age"]]
      child_id    date_iso       age
line
2            1  2020-04-20  3.613963
3            1  2020-04-21  3.646817
- Parameters:
recordings (pd.DataFrame, optional) – custom recordings DataFrame (see Metadata), otherwise use all project recordings, defaults to None
children (pd.DataFrame, optional) – custom children DataFrame (see Metadata), otherwise use all project children data, defaults to None
age_format (str, optional) – format to use for the output age, defaults to months; choose between [‘months’, ‘days’, ‘weeks’, ‘years’]
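The age computation can be reproduced with plain pandas. This sketch assumes the documented convention of a month equal to 365.25/12 days (an assumption that reproduces the example output above); column names follow the children/recordings metadata:

```python
import pandas as pd

children = pd.DataFrame({"child_id": [1], "child_dob": ["2020-01-01"]})
recordings = pd.DataFrame({"child_id": [1, 1], "date_iso": ["2020-04-20", "2020-04-21"]})

# Join each recording with its child's date of birth, then count days
merged = recordings.merge(children, on="child_id")
days = (pd.to_datetime(merged["date_iso"]) - pd.to_datetime(merged["child_dob"])).dt.days
merged["age"] = days / (365.25 / 12)   # age in months, as a float
```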
- compute_recordings_duration(profile: Optional[str] = None) DataFrame [source]
compute recordings duration
- Parameters:
profile (str, optional) – name of the profile of recordings to compute the duration from. If None, raw recordings are used. defaults to None
- Returns:
dataframe of the recordings, with an additional/updated duration columns.
- Return type:
pd.DataFrame
- get_converted_recording_filename(profile: str, recording_filename: str) str [source]
Retrieve the converted filename of a recording under a given ``profile``, from its original filename.
- Parameters:
profile (str) – recording profile
recording_filename (str) – original recording filename, as indexed in the metadata
- Returns:
corresponding converted filename of the recording under this profile
- Return type:
str
- get_recording_path(recording_filename: str, profile: Optional[str] = None) str [source]
return the path to a recording
- Parameters:
recording_filename (str) – recording filename, as in the metadata
profile (str, optional) – name of the conversion profile, defaults to None
- Returns:
path to the recording
- Return type:
str
- get_recordings_from_list(recordings: list, profile: Optional[str] = None) DataFrame [source]
Recover recordings metadata from a list of recordings or paths to recordings.
- Parameters:
recordings (list) – list of recording names or paths
- Returns:
matching recordings
- Return type:
pd.DataFrame
- read(verbose=False, accumulate=True)[source]
Read the metadata from the project and store it in the recordings and children attributes.
- Parameters:
verbose (bool) – read with additional output
accumulate (bool) – add metadata from subfolders (usually confidential metadata)
- validate(ignore_recordings: bool = False, profile: Optional[str] = None, accumulate: bool = True) tuple [source]
Validate a dataset, returning all errors and warnings.
- Parameters:
ignore_recordings (bool, optional) – if True, no errors will be returned for missing recordings.
profile (str, optional) – profile of recordings to use
accumulate – use accumulated metadata (usually confidential metadata if present)
- Returns:
A tuple containing the list of errors, and the list of warnings.
- Return type:
a tuple of two lists
- write_recordings(keep_discarded: bool = True, keep_original_columns: bool = True)[source]
Write self.recordings to the recordings csv file of the dataset. Warning: if read() was done with accumulate, you may write confidential information into recordings.csv.
- Parameters:
keep_discarded (bool, optional) – if True, the lines in the csv that are discarded by the dataset are kept when writing. defaults to True (when False, discarded lines disappear from the dataset)
keep_original_columns (bool, optional) – if True, deleting columns in the recordings dataframe will not result in them disappearing from the csv file (if False, only the current columns are kept)
- Returns:
dataframe that was written to the csv file
- Return type:
pandas.DataFrame
ChildProject.tables module
- exception ChildProject.tables.IncorrectDtypeException[source]
Bases:
Exception
Exception raised when an unexpected dtype is found in a pandas DataFrame
- class ChildProject.tables.IndexColumn(name='', description='', required=False, regex=None, filename=False, datetime=None, function=None, choices=None, dtype=None, unique=False, generated=False)[source]
Bases:
object
- class ChildProject.tables.IndexTable(name, path=None, columns=[], enforce_dtypes: bool = False)[source]
Bases:
object
- exception ChildProject.tables.MissingColumnsException(name: str, missing: Set)[source]
Bases:
Exception
ChildProject.utils module
- ChildProject.utils.calculate_shift(file1, file2, start1, start2, interval)[source]
Take 2 audio files, a starting point for each, and a length to compare (in seconds); return a divergence score representing the average difference in audio signal.
- Parameters:
file1 (str) – path to the first wav file to compare
file2 (str) – path to the second wav file to compare
start1 (int) – starting point for the comparison in seconds for the first audio
start2 (int) – starting point for the comparison in seconds for the second audio
interval (int) – length to compare between the 2 audios, in seconds
- Returns:
tuple of divergence score and number of values used
- Return type:
(float, int)
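A hedged sketch of what such a divergence score could look like: the mean absolute difference between two aligned signal windows. The real function additionally reads and aligns the wav files; the arrays below are hypothetical stand-ins for the extracted windows:

```python
import numpy as np

# Hypothetical aligned signal windows (the real function would extract these
# from the two wav files at start1/start2 for `interval` seconds)
sig1 = np.array([0.0, 0.5, 1.0, 0.5])
sig2 = np.array([0.0, 0.4, 0.9, 0.5])

n = min(len(sig1), len(sig2))            # number of values actually compared
divergence = float(np.mean(np.abs(sig1[:n] - sig2[:n])))
score = (divergence, n)                  # tuple of divergence score and count
```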
- ChildProject.utils.find_lines_involved_in_overlap(df: DataFrame, onset_label: str = 'range_onset', offset_label: str = 'range_offset', labels=[])[source]
Takes a dataframe as input. The dataframe is supposed to have a column for the onset of a timeline and one for the offset. The function returns a boolean series where indexes set to ‘True’ are lines involved in overlaps and ‘False’ when not. E.g. to select all lines involved in overlaps, use:
``ovl_segments = df[find_lines_involved_in_overlap(df, 'segment_onset', 'segment_offset')]``
and to select lines that never overlap, use:
``ovl_segments = df[~find_lines_involved_in_overlap(df, 'segment_onset', 'segment_offset')]``
- Parameters:
df (pd.DataFrame) – pandas DataFrame where we want to find overlaps, having some time segments described by 2 columns (onset and offset)
onset_label (str) – column label for the onset of time segments
offset_label (str) – columns label for the offset of time segments
labels (list[str]) – list of column labels that are required to match to be involved in overlap.
- Returns:
pandas Series of boolean values where ‘True’ are indexes where overlaps exist
- Return type:
pd.Series
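The overlap test can be sketched with the standard pairwise interval condition (two intervals overlap iff each starts before the other ends). A quadratic illustration of the documented behavior, not the library's implementation:

```python
import pandas as pd

def overlap_mask_sketch(df, onset="segment_onset", offset="segment_offset"):
    involved = pd.Series(False, index=df.index)
    for i in df.index:
        for j in df.index:
            if i == j:
                continue
            # intervals i and j overlap iff each starts before the other ends
            if df.loc[i, onset] < df.loc[j, offset] and df.loc[j, onset] < df.loc[i, offset]:
                involved.loc[i] = True
    return involved

df = pd.DataFrame({"segment_onset": [0, 50, 200], "segment_offset": [100, 150, 300]})
mask = overlap_mask_sketch(df)
ovl_segments = df[mask]   # the two overlapping lines
```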
- ChildProject.utils.series_to_datetime(time_series, time_index_list, time_column_name: str, date_series=None, date_index_list=None, date_column_name=None)[source]
Returns a series of datetimes from a series of str, using pd.to_datetime with the formats listed for a specific column name in an index consisting of IndexColumn items. To have the date included (and not only the time), one can use a second series for the date, along with its corresponding index and column.
- Parameters:
time_series (pandas.Series) – pandas series of strings to transform into datetime (can contain NA value => NaT datetime), if date_series is given, time_series should only have the time
time_index_list (List[IndexColumn]) – list of index to use where the column wanted is present
time_column_name (str) – name of the IndexColumn to use (IndexColumn.name value) for accepted formats
date_series (pandas.Series) – pandas series of strings to transform into the date component of datetime (can contain NA value)
date_index_list (List[IndexColumn]) – list of index to use where the column wanted is present
date_column_name (str) – name of the IndexColumn to use (IndexColumn.name value) for accepted formats for dates
- Returns:
series with dtype datetime containing the converted datetimes
- Return type:
pandas.Series
- ChildProject.utils.time_intervals_intersect(ti1: TimeInterval, ti2: TimeInterval)[source]
Given 2 time intervals (which do not take days into consideration, only the time of day), return an array of new interval(s) representing the intersections of the original ones.
Examples:
1. time_intervals_intersect(TimeInterval(datetime(1900,1,1,8,57), datetime(1900,1,1,21,4)), TimeInterval(datetime(1900,1,1,10,36), datetime(1900,1,1,22,1))) => [TimeInterval(10:36, 21:04)]
2. time_intervals_intersect(TimeInterval(datetime(1900,1,1,8,57), datetime(1900,1,1,22,1)), TimeInterval(datetime(1900,1,1,21,4), datetime(1900,1,1,10,36))) => [TimeInterval(08:57, 10:36), TimeInterval(21:04, 22:01)]
- Parameters:
ti1 (TimeInterval) – first interval
ti2 (TimeInterval) – second interval
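The behavior in the examples above, including intervals that wrap past midnight, can be sketched with the standard library; plain ``datetime.time`` tuples stand in for TimeInterval:

```python
from datetime import time

def _unwrap(start, end):
    # split an interval wrapping past midnight into plain same-day ranges
    return [(start, end)] if start <= end else [(time.min, end), (start, time.max)]

def intersect_sketch(a, b):
    result = []
    for s1, e1 in _unwrap(*a):
        for s2, e2 in _unwrap(*b):
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                result.append((s, e))
    return sorted(result)

# Second example above: the second interval wraps past midnight
out = intersect_sketch((time(8, 57), time(22, 1)), (time(21, 4), time(10, 36)))
# -> [(08:57, 10:36), (21:04, 22:01)]
```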