ChildProject package
Subpackages
- ChildProject.pipelines package
- Submodules
- ChildProject.pipelines.anonymize module
- ChildProject.pipelines.eafbuilder module
- ChildProject.pipelines.metrics module
- ChildProject.pipelines.metricsFunctions module
- ChildProject.pipelines.pipeline module
- ChildProject.pipelines.processors module
- ChildProject.pipelines.samplers module
- ChildProject.pipelines.zooniverse module
- Module contents
- ChildProject.templates package
Submodules
ChildProject.annotations module
- class ChildProject.annotations.AnnotationManager(project: ChildProject.projects.ChildProject)[source]
Bases:
object
- INDEX_COLUMNS = [IndexColumn(name = set), IndexColumn(name = recording_filename), IndexColumn(name = time_seek), IndexColumn(name = range_onset), IndexColumn(name = range_offset), IndexColumn(name = raw_filename), IndexColumn(name = format), IndexColumn(name = filter), IndexColumn(name = annotation_filename), IndexColumn(name = imported_at), IndexColumn(name = package_version), IndexColumn(name = error), IndexColumn(name = merged_from)]
- SEGMENTS_COLUMNS = [IndexColumn(name = raw_filename), IndexColumn(name = segment_onset), IndexColumn(name = segment_offset), IndexColumn(name = speaker_id), IndexColumn(name = speaker_type), IndexColumn(name = ling_type), IndexColumn(name = vcm_type), IndexColumn(name = lex_type), IndexColumn(name = mwu_type), IndexColumn(name = msc_type), IndexColumn(name = gra_type), IndexColumn(name = addressee), IndexColumn(name = transcription), IndexColumn(name = phonemes), IndexColumn(name = syllables), IndexColumn(name = words), IndexColumn(name = lena_block_type), IndexColumn(name = lena_block_number), IndexColumn(name = lena_conv_status), IndexColumn(name = lena_response_count), IndexColumn(name = lena_conv_floor_type), IndexColumn(name = lena_conv_turn_type), IndexColumn(name = lena_speaker), IndexColumn(name = utterances_count), IndexColumn(name = utterances_length), IndexColumn(name = non_speech_length), IndexColumn(name = average_db), IndexColumn(name = peak_db), IndexColumn(name = child_cry_vfx_len), IndexColumn(name = utterances), IndexColumn(name = cries), IndexColumn(name = vfxs)]
- static clip_segments(segments: pandas.core.frame.DataFrame, start: int, stop: int) pandas.core.frame.DataFrame [source]
Clip all segment onsets and offsets within ``start`` and ``stop``. Segments entirely outside the range [``start``, ``stop``] will be removed.
- Parameters
segments (pd.DataFrame) – Dataframe of the segments to clip
start (int) – range start (in milliseconds)
stop (int) – range end (in milliseconds)
- Returns
Dataframe of the clipped segments
- Return type
pd.DataFrame
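The clipping rule can be sketched in plain pandas (an illustrative re-implementation under assumptions, not the package's own code; column names follow the Annotations format):

```python
import pandas as pd

# Illustrative re-implementation of the clipping rule (an assumption,
# not the package's own code); column names follow the Annotations format.
def clip(segments: pd.DataFrame, start: int, stop: int) -> pd.DataFrame:
    segments = segments.copy()
    # clamp onsets and offsets to the [start, stop] range
    segments["segment_onset"] = segments["segment_onset"].clip(start, stop)
    segments["segment_offset"] = segments["segment_offset"].clip(start, stop)
    # segments clipped to zero length were entirely outside the range
    return segments[segments["segment_offset"] > segments["segment_onset"]]

df = pd.DataFrame({
    "segment_onset": [0, 4000, 12000],
    "segment_offset": [2000, 6000, 13000],
})
clipped = clip(df, 1000, 10000)  # third segment is dropped entirely
```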
- get_collapsed_segments(annotations: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]
Get all segments associated with the annotations referenced in ``annotations``, and collapse them into one virtual timeline.
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of all the segments (as specified in Annotations format), merged with ``annotations``
- Return type
pd.DataFrame
- get_segments(annotations: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]
Get all segments associated with the annotations referenced in ``annotations``.
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of all the segments (as specified in Annotations format), merged with ``annotations``.
- Return type
pd.DataFrame
- get_segments_timestamps(segments: pandas.core.frame.DataFrame, ignore_date: bool = False, onset: str = 'segment_onset', offset: str = 'segment_offset') pandas.core.frame.DataFrame [source]
Calculate the onset and offset clock-time of each segment
- Parameters
segments (pd.DataFrame) – DataFrame of segments (as returned by get_segments()).
ignore_date (bool, optional) – drop the date information and use time data only, defaults to False
onset (str, optional) – column storing the onset timestamp in milliseconds, defaults to “segment_onset”
offset (str, optional) – column storing the offset timestamp in milliseconds, defaults to “segment_offset”
- Returns
Returns the input dataframe with two new columns, ``onset_time`` and ``offset_time``. ``onset_time`` is a datetime object corresponding to the onset of the segment, and ``offset_time`` to its offset. If either ``start_time`` or ``date_iso`` is not specified for the corresponding recording, both values are set to NaT.
- Return type
pd.DataFrame
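The underlying conversion can be illustrated with plain pandas: given a recording start timestamp (hypothetical here, built in practice from ``date_iso`` and ``start_time`` in the metadata), millisecond offsets become clock-times. This is a minimal sketch, not the method's actual code:

```python
import pandas as pd

# Hypothetical segments (onset/offset in milliseconds) and a hypothetical
# recording start timestamp; the real method reads these from the metadata.
segments = pd.DataFrame({"segment_onset": [0, 90000],
                         "segment_offset": [60000, 150000]})
start = pd.Timestamp("2020-04-20 08:00:00")

# clock-time = recording start + millisecond offset
segments["onset_time"] = start + pd.to_timedelta(segments["segment_onset"], unit="ms")
segments["offset_time"] = start + pd.to_timedelta(segments["segment_offset"], unit="ms")
```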
- get_subsets(annotation_set: str, recursive: bool = False) List[str] [source]
Retrieve the list of subsets belonging to a given set of annotations.
- Parameters
annotation_set (str) – input set
recursive (bool, optional) – If True, get subsets recursively, defaults to False
- Returns
the list of subsets names
- Return type
list
- get_within_ranges(ranges: pandas.core.frame.DataFrame, sets: Optional[Union[Set, List]] = None, missing_data: str = 'ignore')[source]
Retrieve and clip annotations that cover specific portions of recordings (``ranges``).
The desired ranges are defined by an input dataframe with three columns: ``recording_filename``, ``range_onset``, and ``range_offset``. The function returns a dataframe of annotations under the same format as the index of annotations (Annotations index).
This output can then be provided to get_segments() in order to retrieve segments of annotations that match the desired ranges.
For instance, the code below prints all the segments of annotations corresponding to the first hour of each recording:
>>> from ChildProject.projects import ChildProject
>>> from ChildProject.annotations import AnnotationManager
>>> project = ChildProject('.')
>>> am = AnnotationManager(project)
>>> am.read()
>>> ranges = project.recordings
>>> ranges['range_onset'] = 0
>>> ranges['range_offset'] = 60*60*1000
>>> matches = am.get_within_ranges(ranges)
>>> am.get_segments(matches)
- Parameters
ranges (pd.DataFrame) – pandas dataframe with one row per range to be considered and three columns: ``recording_filename``, ``range_onset``, ``range_offset``.
sets (Union[Set, List]) – optional list of annotation sets to retrieve. If None, annotations from all sets will be retrieved.
missing_data (str, defaults to ignore) – how to handle missing annotations (“ignore”, “warn” or “raise”)
- Return type
pd.DataFrame
- get_within_time_range(annotations: pandas.core.frame.DataFrame, interval: ChildProject.utils.TimeInterval, errors='raise')[source]
Clip all input annotations within a given HH:MM:SS clock-time range. Those that do not intersect the input time range at all are filtered out.
- Parameters
annotations (pd.DataFrame) – DataFrame of input annotations to filter. The only required columns are ``recording_filename``, ``range_onset``, and ``range_offset``.
interval (TimeInterval) – interval of hours to consider, containing the start hour and end hour
errors (str) – how to deal with invalid start_time values for the recordings. Takes the same values as ``pandas.to_datetime``.
- Returns
a DataFrame of annotations. For each row, ``range_onset`` and ``range_offset`` are clipped within the desired clock-time range. The clock-times corresponding to the onset and offset of each annotation are stored in two newly created columns, ``range_onset_time`` and ``range_offset_time``. If an input annotation exceeds 24 hours, one row per matching interval is returned.
- Return type
pd.DataFrame
- import_annotations(input: pandas.core.frame.DataFrame, threads: int = - 1, import_function: Optional[Callable[[str], pandas.core.frame.DataFrame]] = None, new_tiers: Optional[list] = None, overwrite_existing: bool = False) pandas.core.frame.DataFrame [source]
Import and convert annotations.
- Parameters
input (pd.DataFrame) – dataframe of all annotations to import, as described in Annotation importation input format.
threads (int, optional) – if > 1, conversions will be run on ``threads`` threads, defaults to -1
import_function (Callable[[str], pd.DataFrame], optional) – if specified, the custom ``import_function`` function will be used to convert all ``input`` annotations, defaults to None
new_tiers (list[str], optional) – list of EAF tier names. If specified, the corresponding EAF tiers will be imported.
overwrite_existing (bool, optional) – choose if lines with the same set and annotation_filename should be overwritten
- Returns
dataframe of imported annotations, as in Annotations index.
- Return type
pd.DataFrame
- static intersection(annotations: pandas.core.frame.DataFrame, sets: Optional[list] = None) pandas.core.frame.DataFrame [source]
Compute the intersection of all annotations for all sets and recordings, based on their ``recording_filename``, ``range_onset`` and ``range_offset`` attributes. (Only these columns are required, but more can be passed and they will be preserved.)
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of annotations, according to Annotations index
- Return type
pd.DataFrame
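The idea can be sketched with plain pandas under a strong simplifying assumption (one annotation per set for a single recording, which the real method does not require): the portion covered by every set is bounded by the largest onset and the smallest offset.

```python
import pandas as pd

# Hypothetical index: one annotation per set for the same recording.
# (Simplified illustration; the real method handles many annotations
# per set and many recordings.)
annotations = pd.DataFrame({
    "recording_filename": ["rec.wav", "rec.wav"],
    "set": ["vtc", "its"],
    "range_onset": [0, 2000],
    "range_offset": [10000, 8000],
})

# the time covered by *all* sets starts at the latest onset
# and ends at the earliest offset
onset = annotations["range_onset"].max()
offset = annotations["range_offset"].min()
intersection = annotations.assign(range_onset=onset, range_offset=offset)
```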
- merge_annotations(left_columns, right_columns, columns, output_set, input, skip_existing: bool = False)[source]
From two DataFrames listing the annotation indexes to merge (these indexes should come from the intersection of the left_set and right_set indexes), the list of columns to merge, and the name of the output set, create the resulting csv files containing the converted merged segments and return the new indexes to add to annotations.csv.
- Parameters
left_columns (list[str]) – list of the columns to include from the left set
right_columns (list[str]) – list of the columns to include from the right set
columns (dict) – additional columns to add to the segments, key is the column name
output_set (str) – name of the set to save the new merged files into
input (dict) – annotation indexes to use for the merge; contains keys ‘left_annotations’ and ‘right_annotations’ to separate indexes from the left and right sets
- Returns
annotation indexes created by the merge, should be added to annotations.csv
- Return type
pandas.DataFrame
- merge_sets(left_set: str, right_set: str, left_columns: List[str], right_columns: List[str], output_set: str, full_set_merge: bool = True, skip_existing: bool = False, columns: dict = {}, recording_filter: Optional[str] = None, threads=- 1)[source]
Merge columns from ``left_set`` and ``right_set`` annotations, for all matching segments, into a new set of annotations named ``output_set`` that will be saved in the dataset. ``output_set`` must not already exist if full_set_merge is True.
- Parameters
left_set (str) – Left set of annotations.
right_set (str) – Right set of annotations.
left_columns (List) – columns whose values will be based on the left set.
right_columns (List) – columns whose values will be based on the right set.
output_set (str) – Name of the output annotations set.
full_set_merge (bool) – the merge is meant to create the entire merged set, so the set must not already exist; defaults to True
skip_existing (bool) – the merge will skip lines already existing in the merged set, so neither the annotation index nor the resulting converted csv will change for those lines
columns (dict) – Additional columns to add to the resulting converted annotations.
recording_filter (set[str]) – set of recording_filenames to merge.
threads (int) – number of threads
- Returns
[description]
- Return type
[type]
- read() Tuple[List[str], List[str]] [source]
Read the index of annotations from ``metadata/annotations.csv`` and store it into self.annotations.
- Returns
a tuple containing the list of errors and the list of warnings generated while reading the index
- Return type
Tuple[List[str],List[str]]
- remove_set(annotation_set: str, recursive: bool = False)[source]
Remove a set of annotations, deleting every converted file and removing them from the index. This preserves raw annotations.
- Parameters
annotation_set (str) – set of annotations to remove
recursive (bool, optional) – remove subsets as well, defaults to False
- rename_set(annotation_set: str, new_set: str, recursive: bool = False, ignore_errors: bool = False)[source]
Rename a set of annotations, moving all related files and updating the index accordingly.
- Parameters
annotation_set (str) – name of the set to rename
new_set (str) – new set name
recursive (bool, optional) – rename subsets as well, defaults to False
ignore_errors (bool, optional) – If True, keep going even if unindexed files are detected, defaults to False
- validate(annotations: Optional[pandas.core.frame.DataFrame] = None, threads: int = 0) Tuple[List[str], List[str]] [source]
check all indexed annotations for errors
- Parameters
annotations (pd.DataFrame, optional) – annotations to validate, defaults to None. If None, the whole index will be scanned.
threads (int, optional) – how many threads to run the tests with, defaults to 0. If <= 0, all available CPU cores will be used.
- Returns
a tuple containing the list of errors and the list of warnings detected
- Return type
Tuple[List[str], List[str]]
ChildProject.cmdline module
ChildProject.converters module
- class ChildProject.converters.AliceConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'alice'
- class ChildProject.converters.AnnotationConverter[source]
Bases:
object
- SPEAKER_ID_TO_TYPE = {'C1': 'OCH', 'C2': 'OCH', 'CHI': 'CHI', 'CHI*': 'CHI', 'EE1': 'NA', 'EE2': 'NA', 'FA0': 'FEM', 'FA1': 'FEM', 'FA2': 'FEM', 'FA3': 'FEM', 'FA4': 'FEM', 'FA5': 'FEM', 'FA6': 'FEM', 'FA7': 'FEM', 'FA8': 'FEM', 'FAE': 'NA', 'FC1': 'OCH', 'FC2': 'OCH', 'FC3': 'OCH', 'FCE': 'NA', 'MA0': 'MAL', 'MA1': 'MAL', 'MA2': 'MAL', 'MA3': 'MAL', 'MA4': 'MAL', 'MA5': 'MAL', 'MAE': 'NA', 'MC1': 'OCH', 'MC2': 'OCH', 'MC3': 'OCH', 'MC4': 'OCH', 'MC5': 'OCH', 'MCE': 'NA', 'MI1': 'OCH', 'MOT*': 'FEM', 'OC0': 'OCH', 'UA1': 'NA', 'UA2': 'NA', 'UA3': 'NA', 'UA4': 'NA', 'UA5': 'NA', 'UA6': 'NA', 'UC1': 'OCH', 'UC2': 'OCH', 'UC3': 'OCH', 'UC4': 'OCH', 'UC5': 'OCH', 'UC6': 'OCH'}
- THREAD_SAFE = True
- class ChildProject.converters.ChatConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- ADDRESSEE_TABLE = {'CHI': 'T', 'FEM': 'A', 'MAL': 'A', 'OCH': 'C'}
- FORMAT = 'cha'
- SPEAKER_ROLE_TO_TYPE = {'Adult': 'NA', 'Attorney': 'NA', 'Audience': 'NA', 'Boy': 'OCH', 'Brother': 'OCH', 'Caretaker': 'NA', 'Child': 'OCH', 'Doctor': 'NA', 'Environment': 'NA', 'Father': 'MAL', 'Female': 'FEM', 'Friend': 'OCH', 'Girl': 'OCH', 'Grandfather': 'MAL', 'Grandmother': 'FEM', 'Group': 'NA', 'Guest': 'NA', 'Host': 'NA', 'Investigator': 'NA', 'Justice': 'NA', 'LENA': 'NA', 'Leader': 'NA', 'Male': 'MAL', 'Media': 'NA', 'Member': 'NA', 'Mother': 'FEM', 'Narrator': 'NA', 'Nurse': 'NA', 'Other': 'NA', 'Participant': 'CHI', 'Partner': 'NA', 'PlayRole': 'NA', 'Playmate': 'OCH', 'Relative': 'NA', 'Sibling': 'OCH', 'Sister': 'OCH', 'Speaker': 'NA', 'Student': 'NA', 'Target_Adult': 'NA', 'Target_Child': 'CHI', 'Teacher': 'NA', 'Teenager': 'NA', 'Text': 'NA', 'Uncertain': 'NA', 'Unidentified': 'NA', 'Visitor': 'NA'}
- THREAD_SAFE = False
- class ChildProject.converters.CsvConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'csv'
- class ChildProject.converters.EafConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'eaf'
- class ChildProject.converters.Formats(value)[source]
Bases:
enum.Enum
An enumeration.
- ALICE = 'alice'
- CHA = 'cha'
- CSV = 'csv'
- EAF = 'eaf'
- ITS = 'its'
- TEXTGRID = 'TextGrid'
- VCM = 'vcm_rttm'
- VTC = 'vtc_rttm'
- class ChildProject.converters.ItsConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'its'
- SPEAKER_TYPE_TRANSLATION = {'CHN': 'CHI', 'CXN': 'OCH', 'FAN': 'FEM', 'MAN': 'MAL'}
- class ChildProject.converters.TextGridConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'TextGrid'
- class ChildProject.converters.VcmConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'vcm_rttm'
- SPEAKER_TYPE_TRANSLATION = {'CHI': 'OCH', 'CNS': 'CHI', 'CRY': 'CHI', 'FEM': 'FEM', 'MAL': 'MAL', 'NCS': 'CHI'}
- VCM_TRANSLATION = {'CNS': 'C', 'CRY': 'Y', 'NCS': 'N', 'OTH': 'J'}
- class ChildProject.converters.VtcConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'vtc_rttm'
- SPEAKER_TYPE_TRANSLATION = {'CHI': 'OCH', 'FEM': 'FEM', 'KCHI': 'CHI', 'MAL': 'MAL'}
ChildProject.metrics module
- ChildProject.metrics.conf_matrix(rows_grid, columns_grid)[source]
compute the confusion matrix (as counts) from grids of active classes.
See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters
rows_grid (numpy.array) – the grid corresponding to the rows of the confusion matrix.
columns_grid (numpy.array) – the grid corresponding to the columns of the confusion matrix.
- Returns
a square numpy array of counts
- Return type
numpy.array
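One plausible way to derive such counts from two grids (an illustration, not necessarily the package's exact implementation) is a matrix product of the transposed row grid with the column grid, which counts the time units where each pair of classes is jointly active:

```python
import numpy as np

def conf_matrix(rows_grid: np.ndarray, columns_grid: np.ndarray) -> np.ndarray:
    """Count, for each pair (i, j), the time units where class i is active
    in rows_grid and class j is active in columns_grid (sketch)."""
    return rows_grid.T.dot(columns_grid)

# two 4-unit grids over 2 classes (one row per time unit, one column per class)
rows = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
cols = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])
cm = conf_matrix(rows, cols)  # cm[i, j] = co-activation counts
```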
- ChildProject.metrics.gamma(segments: pandas.core.frame.DataFrame, column: str, alpha: float = 1, beta: float = 1, precision_level: float = 0.05) float [source]
Compute Mathet et al. gamma agreement on segments.
The gamma measure evaluates the reliability of both the segmentation and the categorization simultaneously; an extensive description of the method and its parameters can be found in Mathet et al., 2015 (doi:10.1162/COLI_a_00227).
This function uses the pygamma-agreement package by Titeux et al.
- Parameters
segments (pd.DataFrame) – input segments dataframe (see Annotations format for the dataframe format)
column (str) – name of the categorical column of the segments to consider, e.g. ‘speaker_type’
alpha (float, optional) – gamma agreement time alignment weight, defaults to 1
beta (float, optional) – gamma agreement categorical weight, defaults to 1
precision_level (float, optional) – level of precision (see pygamma-agreement’s documentation), defaults to 0.05
- Returns
gamma agreement
- Return type
float
- ChildProject.metrics.grid_to_vector(grid, categories)[source]
Transform a grid of active classes into a vector of labels. In case several classes are active at time i, the label is set to ‘overlap’.
See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters
grid (numpy.array) – a NumPy array of shape ``(n, len(categories))``
categories (list) – the list of categories
- Returns
the vector of labels of length ``n`` (e.g. ``np.array([none FEM FEM FEM overlap overlap CHI])``)
- Return type
numpy.array
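A minimal NumPy sketch of this mapping (illustrative only; the real function may differ in details such as how inactive rows are labeled):

```python
import numpy as np

def grid_to_vector(grid: np.ndarray, categories: list) -> np.ndarray:
    """Map each grid row to one label: 'overlap' when several classes are
    active, 'none' when no class is active (sketch)."""
    labels = np.array(categories, dtype=object)
    active = grid.sum(axis=1)
    vector = labels[grid.argmax(axis=1)]   # label of the (first) active class
    vector[active == 0] = "none"           # no class active at time i
    vector[active > 1] = "overlap"         # several classes active at time i
    return vector

grid = np.array([[0, 0], [1, 0], [1, 1]])
v = grid_to_vector(grid, ["FEM", "CHI"])
```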
- ChildProject.metrics.pyannote_metric(segments: pandas.core.frame.DataFrame, reference: str, hypothesis: str, metric, column: str)[source]
- ChildProject.metrics.segments_to_annotation(segments: pandas.core.frame.DataFrame, column: str)[source]
Transform a dataframe of annotation segments into a pyannote.core.Annotation object
- Parameters
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: ``segment_onset``, ``segment_offset`` and ``column``.
column (str) – the name of the column in ``segments`` that should be used for the values of the annotations (e.g. speaker_type).
- Returns
the pyannote.core.Annotation object.
- Return type
pyannote.core.Annotation
- ChildProject.metrics.segments_to_grid(segments: pandas.core.frame.DataFrame, range_onset: int, range_offset: int, timescale: int, column: str, categories: list, none=True, overlap=False) float [source]
Transform a dataframe of annotation segments into a 2d matrix representing the indicator function of each of the ``categories`` across time.
Each row of the matrix corresponds to a unit of time of length ``timescale`` (in milliseconds), ranging from ``range_onset`` to ``range_offset``; each column corresponds to one of the ``categories`` provided, plus two special columns (overlap and none).
The value of the cell ``ij`` of the output matrix is set to 1 if the class ``j`` is active at time ``i``, 0 otherwise.
If overlap is True, an additional column is appended to the grid, which is set to 1 if more than one class is active at time ``i``.
If none is set to True, an additional column is appended to the grid, which is set to 1 if none of the classes are active at time ``i``.
The shape of the output matrix is therefore ``((range_offset-range_onset)/timescale, len(categories) + n)``, where n = 2 if both overlap and none are True, 1 if one of them is True, and 0 otherwise.
The fraction of time a class ``j`` is active can therefore be calculated as ``np.mean(grid, axis = 0)[j]``.
- Parameters
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: ``segment_onset``, ``segment_offset`` and ``column``.
range_onset (int) – timestamp of the beginning of the range to consider (in milliseconds)
range_offset (int) – timestamp of the end of the range to consider (in milliseconds)
timescale (int) – length of each time unit (in milliseconds)
column (str) – the name of the column in ``segments`` that should be used for the values of the annotations (e.g. speaker_type).
categories (list) – the list of categories
none (bool) – append a ‘none’ column, default True
overlap (bool) – append an overlap column, default False
- Returns
the output grid
- Return type
numpy.array
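The construction can be sketched as follows (a simplified version without the 'none' and 'overlap' columns, not the package's own code):

```python
import numpy as np
import pandas as pd

def segments_to_grid(segments, range_onset, range_offset,
                     timescale, column, categories):
    """Indicator matrix: one row per time unit, one column per category
    (simplified sketch without the 'none'/'overlap' columns)."""
    units = (range_offset - range_onset) // timescale
    grid = np.zeros((units, len(categories)), dtype=int)
    for _, seg in segments.iterrows():
        j = categories.index(seg[column])
        # convert milliseconds to time-unit indices, clamped to the range
        a = max(0, (seg["segment_onset"] - range_onset) // timescale)
        b = min(units, -(-(seg["segment_offset"] - range_onset) // timescale))
        grid[a:b, j] = 1
    return grid

segments = pd.DataFrame({
    "segment_onset": [0, 1000],
    "segment_offset": [1000, 3000],
    "speaker_type": ["FEM", "CHI"],
})
grid = segments_to_grid(segments, 0, 3000, 1000, "speaker_type", ["FEM", "CHI"])
```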
- ChildProject.metrics.vectors_to_annotation_task(*args, drop: List[str] = [])[source]
Transform vectors of labels into an nltk AnnotationTask object.
- Parameters
args (1d np.array() of labels) – vector of labels for each annotator; add one argument per annotator.
drop (List[str]) – list of labels that should be ignored
- Returns
the AnnotationTask object
- Return type
nltk.metrics.agreement.AnnotationTask
ChildProject.projects module
- class ChildProject.projects.ChildProject(path: str, enforce_dtypes: bool = False, ignore_discarded: bool = True)[source]
Bases:
object
This class is a representation of a ChildProject dataset.
Constructor parameters:
- Parameters
path (str) – path to the root of the dataset.
enforce_dtypes (bool, optional) – enforce dtypes on children/recordings dataframes, defaults to False
ignore_discarded (bool, optional) – ignore entries such that discard=1, defaults to True
Attributes:
path (str) – path to the root of the dataset.
recordings (pd.DataFrame) – pandas dataframe representation of this dataset’s metadata/recordings.csv
children (pd.DataFrame) – pandas dataframe representation of this dataset’s metadata/children.csv
- CHILDREN_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = child_dob), IndexColumn(name = location_id), IndexColumn(name = child_sex), IndexColumn(name = language), IndexColumn(name = languages), IndexColumn(name = mat_ed), IndexColumn(name = fat_ed), IndexColumn(name = car_ed), IndexColumn(name = monoling), IndexColumn(name = monoling_criterion), IndexColumn(name = normative), IndexColumn(name = normative_criterion), IndexColumn(name = mother_id), IndexColumn(name = father_id), IndexColumn(name = order_of_birth), IndexColumn(name = n_of_siblings), IndexColumn(name = household_size), IndexColumn(name = dob_criterion), IndexColumn(name = dob_accuracy), IndexColumn(name = discard)]
- DOCUMENTATION_COLUMNS = [IndexColumn(name = variable), IndexColumn(name = description), IndexColumn(name = values), IndexColumn(name = scope), IndexColumn(name = annotation_set)]
- RECORDINGS_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = date_iso), IndexColumn(name = start_time), IndexColumn(name = recording_device_type), IndexColumn(name = recording_filename), IndexColumn(name = duration), IndexColumn(name = session_id), IndexColumn(name = session_offset), IndexColumn(name = recording_device_id), IndexColumn(name = experimenter), IndexColumn(name = location_id), IndexColumn(name = its_filename), IndexColumn(name = upl_filename), IndexColumn(name = trs_filename), IndexColumn(name = lena_id), IndexColumn(name = lena_recording_num), IndexColumn(name = might_feature_gaps), IndexColumn(name = start_time_accuracy), IndexColumn(name = noisy_setting), IndexColumn(name = notes), IndexColumn(name = discard)]
- REC_COL_REF = {'child_id': IndexColumn(name = child_id), 'date_iso': IndexColumn(name = date_iso), 'discard': IndexColumn(name = discard), 'duration': IndexColumn(name = duration), 'experiment': IndexColumn(name = experiment), 'experimenter': IndexColumn(name = experimenter), 'its_filename': IndexColumn(name = its_filename), 'lena_id': IndexColumn(name = lena_id), 'lena_recording_num': IndexColumn(name = lena_recording_num), 'location_id': IndexColumn(name = location_id), 'might_feature_gaps': IndexColumn(name = might_feature_gaps), 'noisy_setting': IndexColumn(name = noisy_setting), 'notes': IndexColumn(name = notes), 'recording_device_id': IndexColumn(name = recording_device_id), 'recording_device_type': IndexColumn(name = recording_device_type), 'recording_filename': IndexColumn(name = recording_filename), 'session_id': IndexColumn(name = session_id), 'session_offset': IndexColumn(name = session_offset), 'start_time': IndexColumn(name = start_time), 'start_time_accuracy': IndexColumn(name = start_time_accuracy), 'trs_filename': IndexColumn(name = trs_filename), 'upl_filename': IndexColumn(name = upl_filename)}
- REQUIRED_DIRECTORIES = ['recordings', 'extra']
- accumulate_metadata(table: str, df: pandas.core.frame.DataFrame, columns: list, merge_column: str, verbose=False) pandas.core.frame.DataFrame [source]
- compute_ages(recordings: Optional[pandas.core.frame.DataFrame] = None, children: Optional[pandas.core.frame.DataFrame] = None) pandas.core.series.Series [source]
Compute the age of the subject child for each recording (in months, as a float) and return it as a pandas Series object.
Example:
>>> from ChildProject.projects import ChildProject
>>> project = ChildProject("examples/valid_raw_data")
>>> project.read()
>>> project.recordings["age"] = project.compute_ages()
>>> project.recordings[["child_id", "date_iso", "age"]]
      child_id    date_iso       age
line
2            1  2020-04-20  3.613963
3            1  2020-04-21  3.646817
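The computation can be sketched with plain pandas. The dates below are hypothetical, and dividing by the mean month length of 365.25/12 days is an assumption for illustration, not necessarily the package's exact formula:

```python
import pandas as pd

# Hypothetical metadata: one recording date and the child's date of birth.
recordings = pd.DataFrame({"date_iso": ["2020-04-20"],
                           "child_dob": ["2020-01-01"]})

# age in days between recording and birth
days = (pd.to_datetime(recordings["date_iso"])
        - pd.to_datetime(recordings["child_dob"])).dt.days

# age in months as a float, assuming a mean month of 365.25/12 days
age_months = days / (365.25 / 12)
```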
- compute_recordings_duration(profile: Optional[str] = None) pandas.core.frame.DataFrame [source]
compute recordings duration
- Parameters
profile (str, optional) – name of the profile of recordings to compute the duration from. If None, raw recordings are used. defaults to None
- Returns
dataframe of the recordings, with an additional/updated duration columns.
- Return type
pd.DataFrame
- get_converted_recording_filename(profile: str, recording_filename: str) str [source]
Retrieve the converted filename of a recording under a given ``profile``, from its original filename.
- Parameters
profile (str) – recording profile
recording_filename (str) – original recording filename, as indexed in the metadata
- Returns
corresponding converted filename of the recording under this profile
- Return type
str
- get_recording_path(recording_filename: str, profile: Optional[str] = None) str [source]
return the path to a recording
- Parameters
recording_filename (str) – recording filename, as in the metadata
profile (str, optional) – name of the conversion profile, defaults to None
- Returns
path to the recording
- Return type
str
- get_recordings_from_list(recordings: list, profile: Optional[str] = None) pandas.core.frame.DataFrame [source]
Recover recordings metadata from a list of recording names or paths to recordings.
- Parameters
recordings (list) – list of recording names or paths
- Returns
matching recordings
- Return type
pd.DataFrame
- read(verbose=False, accumulate=True)[source]
Read the metadata from the project and store it in the recordings and children attributes.
- Parameters
verbose (bool) – read with additional output
accumulate (bool) – add metadata from subfolders (usually confidential metadata)
- validate(ignore_recordings: bool = False, profile: Optional[str] = None, accumulate: bool = True) tuple [source]
Validate a dataset, returning all errors and warnings.
- Parameters
ignore_recordings (bool, optional) – if True, no errors will be returned for missing recordings.
profile (str, optional) – profile of recordings to use
accumulate – use accumulated metadata (usually confidential metadata if present)
- Returns
A tuple containing the list of errors, and the list of warnings.
- Return type
a tuple of two lists
- write_recordings(keep_discarded: bool = True, keep_original_columns: bool = True)[source]
Write self.recordings to the recordings csv file of the dataset. Warning: if read() was called with accumulate, you may write confidential information into recordings.csv!
- Parameters
keep_discarded (bool, optional) – if True, the lines in the csv that are discarded by the dataset are kept when writing. defaults to True (when False, discarded lines disappear from the dataset)
keep_original_columns (bool, optional) – if True, deleting columns in the recordings dataframe will not result in them disappearing from the csv file (if false, only the current columns are kept)
- Returns
dataframe that was written to the csv file
- Return type
pandas.DataFrame
ChildProject.tables module
- exception ChildProject.tables.IncorrectDtypeException[source]
Bases:
Exception
Exception raised when an unexpected dtype is found in a pandas DataFrame
- class ChildProject.tables.IndexColumn(name='', description='', required=False, regex=None, filename=False, datetime=None, function=None, choices=None, dtype=None, unique=False, generated=False)[source]
Bases:
object
- class ChildProject.tables.IndexTable(name, path=None, columns=[], enforce_dtypes: bool = False)[source]
Bases:
object
- exception ChildProject.tables.MissingColumnsException(name: str, missing: Set)[source]
Bases:
Exception
- ChildProject.tables.assert_columns_presence(name: str, df: pandas.core.frame.DataFrame, columns: Union[Set, List])[source]
ChildProject.utils module
- class ChildProject.utils.TimeInterval(start: datetime.datetime, stop: datetime.datetime)[source]
Bases:
object
- ChildProject.utils.calculate_shift(file1, file2, start1, start2, interval)[source]
Take two audio files, a starting point for each, and a length to compare (in seconds); return a divergence score representing the average difference in audio signal.
- Parameters
file1 (str) – path to the first wav file to compare
file2 (str) – path to the second wav file to compare
start1 (int) – starting point for the comparison in seconds for the first audio
start2 (int) – starting point for the comparison in seconds for the second audio
interval (int) – length of the comparison between the two audio files, in seconds
- Returns
tuple of divergence score and number of values used
- Return type
(float, int)
- ChildProject.utils.find_lines_involved_in_overlap(df: pandas.core.frame.DataFrame, onset_label: str = 'range_onset', offset_label: str = 'range_offset', labels=[])[source]
Takes a dataframe as input. The dataframe is expected to have one column for the onset of a time segment and one for the offset. The function returns a boolean Series where indexes set to ‘True’ are lines involved in overlaps and ‘False’ ones that are not. E.g. to select all lines involved in overlaps, use:
``ovl_segments = df[find_lines_involved_in_overlap(df, 'segment_onset', 'segment_offset')]``
and to select lines that never overlap, use:
``ovl_segments = df[~find_lines_involved_in_overlap(df, 'segment_onset', 'segment_offset')]``
- Parameters
df (pd.DataFrame) – pandas DataFrame where we want to find overlaps, having some time segments described by 2 columns (onset and offset)
onset_label (str) – column label for the onset of time segments
offset_label (str) – column label for the offset of time segments
labels (list[str]) – list of column labels that are required to match to be involved in overlap.
- Returns
pandas Series of boolean values where ‘True’ are indexes where overlaps exist
- Return type
pd.Series
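The overlap test can be sketched with a quadratic pairwise comparison (an illustration that is fine for small dataframes; the package may use a more efficient approach):

```python
import pandas as pd

def overlapping(df: pd.DataFrame,
                onset: str = "range_onset",
                offset: str = "range_offset") -> pd.Series:
    """Boolean Series: True for rows whose interval intersects
    at least one other row's interval (O(n^2) sketch)."""
    mask = pd.Series(False, index=df.index)
    for i in df.index:
        others = df.drop(i)
        # two intervals intersect iff each starts before the other ends
        mask.at[i] = ((others[onset] < df.at[i, offset]) &
                      (others[offset] > df.at[i, onset])).any()
    return mask

df = pd.DataFrame({"range_onset": [0, 500, 2000],
                   "range_offset": [1000, 1500, 3000]})
m = overlapping(df)  # first two rows overlap each other, third does not
```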
- ChildProject.utils.series_to_datetime(time_series, time_index_list, time_column_name: str, date_series=None, date_index_list=None, date_column_name=None)[source]
Return a series of datetimes from a series of strings, using pd.to_datetime with all the formats listed for a specific column name in an index consisting of IndexColumn items. To have the date included (and not only the time), one can pass a second series for the date, with its corresponding index and column name.
- Parameters
time_series (pandas.Series) – pandas series of strings to transform into datetime (can contain NA value => NaT datetime), if date_series is given, time_series should only have the time
time_index_list (List[IndexColumn]) – list of index to use where the column wanted is present
time_column_name (str) – name of the IndexColumn to use (IndexColumn.name value) for accepted formats
date_series (pandas.Series) – pandas series of strings to transform into the date component of datetime (can contain NA value)
date_index_list (List[IndexColumn]) – list of index to use where the column wanted is present
date_column_name (str) – name of the IndexColumn to use (IndexColumn.name value) for accepted formats for dates
- Returns
series with dtype datetime containing the converted datetimes
- Return type
pandas.Series
- ChildProject.utils.time_intervals_intersect(ti1: ChildProject.utils.TimeInterval, ti2: ChildProject.utils.TimeInterval)[source]
Given two time intervals (these do not take days into consideration, only the time of day), return an array of new interval(s) representing the intersections of the original ones.
Examples:
1. time_intervals_intersect(TimeInterval(datetime(1900,1,1,8,57), datetime(1900,1,1,21,4)), TimeInterval(datetime(1900,1,1,10,36), datetime(1900,1,1,22,1))) => [TimeInterval(10:36 , 21:04)]
2. time_intervals_intersect(TimeInterval(datetime(1900,1,1,8,57), datetime(1900,1,1,22,1)), TimeInterval(datetime(1900,1,1,21,4), datetime(1900,1,1,10,36))) => [TimeInterval(08:57 , 10:36), TimeInterval(21:04 , 22:01)]
- Parameters
ti1 (TimeInterval) – first interval
ti2 (TimeInterval) – second interval
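The non-wrapping case (example 1 above) can be sketched with plain datetime.time comparisons; handling intervals that wrap around midnight (example 2) would require splitting them first, which this simplified sketch deliberately omits:

```python
from datetime import time

# Sketch of the non-wrapping case only; wrap-around-midnight intervals
# (example 2 above) would need to be split first.
def intersect(a: tuple, b: tuple) -> list:
    start = max(a[0], b[0])   # intersection starts at the later onset
    stop = min(a[1], b[1])    # and ends at the earlier offset
    return [(start, stop)] if start < stop else []

r = intersect((time(8, 57), time(21, 4)), (time(10, 36), time(22, 1)))
# r == [(time(10, 36), time(21, 4))], matching example 1
```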