ChildProject package

Subpackages

Submodules

ChildProject.annotations module

class ChildProject.annotations.AnnotationManager(project: ChildProject)[source]

Bases: object

INDEX_COLUMNS = [IndexColumn(name = set), IndexColumn(name = recording_filename), IndexColumn(name = time_seek), IndexColumn(name = range_onset), IndexColumn(name = range_offset), IndexColumn(name = raw_filename), IndexColumn(name = format), IndexColumn(name = filter), IndexColumn(name = annotation_filename), IndexColumn(name = imported_at), IndexColumn(name = package_version), IndexColumn(name = error), IndexColumn(name = merged_from)]
SEGMENTS_COLUMNS = [IndexColumn(name = raw_filename), IndexColumn(name = segment_onset), IndexColumn(name = segment_offset), IndexColumn(name = speaker_id), IndexColumn(name = speaker_type), IndexColumn(name = ling_type), IndexColumn(name = vcm_type), IndexColumn(name = lex_type), IndexColumn(name = mwu_type), IndexColumn(name = msc_type), IndexColumn(name = gra_type), IndexColumn(name = addressee), IndexColumn(name = transcription), IndexColumn(name = phonemes), IndexColumn(name = syllables), IndexColumn(name = words), IndexColumn(name = lena_block_type), IndexColumn(name = lena_block_number), IndexColumn(name = lena_conv_status), IndexColumn(name = lena_response_count), IndexColumn(name = lena_conv_floor_type), IndexColumn(name = lena_conv_turn_type), IndexColumn(name = lena_speaker), IndexColumn(name = utterances_count), IndexColumn(name = utterances_length), IndexColumn(name = non_speech_length), IndexColumn(name = average_db), IndexColumn(name = peak_db), IndexColumn(name = child_cry_vfx_len), IndexColumn(name = utterances), IndexColumn(name = cries), IndexColumn(name = vfxs)]
static clip_segments(segments: DataFrame, start: int, stop: int) DataFrame[source]

Clip the onsets and offsets of all segments to the range between start and stop. Segments outside of the range [start, stop] will be removed.

Parameters:
  • segments (pd.DataFrame) – Dataframe of the segments to clip

  • start (int) – range start (in milliseconds)

  • stop (int) – range end (in milliseconds)

Returns:

Dataframe of the clipped segments

Return type:

pd.DataFrame
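The clipping rule described above can be illustrated with a minimal sketch. It uses plain (onset, offset) tuples instead of a DataFrame to stay dependency-free, and the helper name clip_segments_sketch is hypothetical. Treating a segment that ends up with no duration as "outside of the range" is an assumption, not something stated in this reference.

```python
def clip_segments_sketch(segments, start, stop):
    """Clip (onset, offset) pairs, in milliseconds, to the range [start, stop]."""
    clipped = []
    for onset, offset in segments:
        new_onset = min(max(onset, start), stop)
        new_offset = min(max(offset, start), stop)
        # Assumption: a segment left with no duration lies outside the range.
        if new_offset > new_onset:
            clipped.append((new_onset, new_offset))
    return clipped


segments = [(0, 500), (800, 1500), (1900, 2600), (3000, 3500)]
clipped = clip_segments_sketch(segments, start=1000, stop=2000)
print(clipped)  # [(1000, 1500), (1900, 2000)]
```

The first and last segments are dropped because they do not intersect the range; the two others are trimmed to its boundaries.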

derive_annotations(input_set: str, output_set: str, derivation_function: ~typing.Union[str, ~typing.Callable], threads: int = -1, overwrite_existing: bool = False) -> (<class 'pandas.core.frame.DataFrame'>, <class 'pandas.core.frame.DataFrame'>)[source]

Derive annotations. From an existing set of annotations, create a new set whose results are derived from the original set.

Parameters:
  • input_set (str) – name of the set of annotations to be derived

  • output_set (str) – name of the new set of derived annotations

  • derivation_function (Union[str, Callable]) – name of the derivation type to be performed, or a callable

  • threads (int, optional) – If > 1, conversions will be run on that many threads, defaults to -1

  • overwrite_existing (bool, optional) – whether lines with the same set and annotation_filename should be overwritten, defaults to False

Returns:

a tuple containing the dataframe of derived annotations (as in Annotations index) and a dataframe of errors

Return type:

tuple (pd.DataFrame, pd.DataFrame)

get_collapsed_segments(annotations: DataFrame) DataFrame[source]

Get all segments associated with the annotations referenced in annotations, and collapse them into one virtual timeline.

Parameters:

annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index

Returns:

dataframe of all the segments merged (as specified in Annotations format), merged with annotations

Return type:

pd.DataFrame

get_segments(annotations: DataFrame) DataFrame[source]

Get all segments associated with the annotations referenced in annotations.

Parameters:

annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index

Returns:

dataframe of all the segments merged (as specified in Annotations format), merged with annotations.

Return type:

pd.DataFrame

get_segments_timestamps(segments: DataFrame, ignore_date: bool = False, onset: str = 'segment_onset', offset: str = 'segment_offset') DataFrame[source]

Calculate the onset and offset clock-time of each segment

Parameters:
  • segments (pd.DataFrame) – DataFrame of segments (as returned by get_segments()).

  • ignore_date (bool, optional) – ignore date information and use time data only, defaults to False

  • onset (str, optional) – column storing the onset timestamp in milliseconds, defaults to “segment_onset”

  • offset (str, optional) – column storing the offset timestamp in milliseconds, defaults to “segment_offset”

Returns:

Returns the input dataframe with two new columns onset_time and offset_time. onset_time is a datetime object corresponding to the onset of the segment. offset_time is a datetime object corresponding to the offset of the segment. In case either start_time or date_iso is not specified for the corresponding recording, both values will be set to NaT.

Return type:

pd.DataFrame
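As a rough illustration of the computation described above, the sketch below derives clock times for a single segment from the recording's date_iso and start_time. The helper name segment_clock_times, the "%Y-%m-%d" and "%H:%M" formats, and the use of None in place of NaT are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def segment_clock_times(date_iso, start_time, onset_ms, offset_ms):
    """Return (onset_time, offset_time), or (None, None) if metadata is missing."""
    if not date_iso or not start_time:
        return None, None  # stands in for NaT in this sketch
    # Assumed formats: date_iso as YYYY-MM-DD, start_time as HH:MM.
    recording_start = datetime.strptime(f"{date_iso} {start_time}", "%Y-%m-%d %H:%M")
    onset_time = recording_start + timedelta(milliseconds=onset_ms)
    offset_time = recording_start + timedelta(milliseconds=offset_ms)
    return onset_time, offset_time


onset_time, offset_time = segment_clock_times("2020-04-20", "09:30", 90_000, 95_500)
print(onset_time, offset_time)
```

A segment starting 90 seconds into a recording that began at 09:30 is thus placed at 09:31:30 on the recording date.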

get_subsets(annotation_set: str, recursive: bool = False) List[str][source]

Retrieve the list of subsets belonging to a given set of annotations.

Parameters:
  • annotation_set (str) – input set

  • recursive (bool, optional) – If True, get subsets recursively, defaults to False

Returns:

the list of subset names

Return type:

list
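The difference between the recursive and non-recursive modes can be sketched as follows. This example assumes that subsets are identified by slash-separated set names (for example eaf/an1 being a subset of eaf); that naming convention, the helper name get_subsets_sketch, and the set names used are assumptions for illustration and are not confirmed by this reference.

```python
def get_subsets_sketch(all_sets, annotation_set, recursive=False):
    """List the sets nested under annotation_set (assumed slash-separated names)."""
    prefix = annotation_set + "/"
    subsets = []
    for name in all_sets:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        # Without recursion, keep only the direct children of the set.
        if recursive or "/" not in remainder:
            subsets.append(name)
    return subsets


all_sets = ["eaf", "eaf/an1", "eaf/an1/reviewed", "eaf/an2", "vtc"]
direct = get_subsets_sketch(all_sets, "eaf")
nested = get_subsets_sketch(all_sets, "eaf", recursive=True)
print(direct)
print(nested)
```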

get_within_ranges(ranges: DataFrame, sets: Optional[Union[Set, List]] = None, missing_data: str = 'ignore')[source]

Retrieve and clip annotations that cover specific portions of recordings (ranges).

The desired ranges are defined by an input dataframe with three columns: recording_filename, range_onset, and range_offset. The function returns a dataframe of annotations under the same format as the index of annotations (Annotations index).

This output can then be provided to get_segments() in order to retrieve segments of annotations that match the desired range.

For instance, the code below prints all the segments of annotations corresponding to the first hour of each recording:

>>> from ChildProject.projects import ChildProject
>>> from ChildProject.annotations import AnnotationManager
>>> project = ChildProject('.')
>>> am = AnnotationManager(project)
>>> am.read()
>>> ranges = project.recordings
>>> ranges['range_onset'] = 0
>>> ranges['range_offset'] = 60*60*1000
>>> matches = am.get_within_ranges(ranges)
>>> am.get_segments(matches)
Parameters:
  • ranges (pd.DataFrame) – pandas dataframe with one row per range to be considered and three columns: recording_filename, range_onset, range_offset.

  • sets (Union[Set, List]) – optional list of annotation sets to retrieve. If None, all annotations from all sets will be retrieved.

  • missing_data (str, defaults to ignore) – how to handle missing annotations (“ignore”, “warn” or “raise”)

Return type:

pd.DataFrame

get_within_time_range(annotations: DataFrame, interval: Optional[TimeInterval] = None, start_time: Optional[str] = None, end_time: Optional[str] = None)[source]

Clip all input annotations within a given HH:MM:SS clock-time range. Those that do not intersect the input time range at all are filtered out.

Parameters:
  • annotations (pd.DataFrame) – DataFrame of input annotations to filter. The only columns that are required are: recording_filename, range_onset, and range_offset.

  • interval (TimeInterval) – Interval of hours to consider, contains the start hour and end hour

  • start_time (str) – start_time to use in a HH:MM format, only used if interval is None, replaces the first value of interval

  • end_time (str) – end_time to use in a HH:MM format, only used if interval is None, replaces the second value of interval

Returns:

a DataFrame of annotations. For each row, range_onset and range_offset are clipped within the desired clock-time range. The clock-time corresponding to the onset and offset of each annotation is stored in two newly created columns named range_onset_time and range_offset_time. If the input annotation exceeds 24 hours, one row per matching interval is returned.

Return type:

pd.DataFrame
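The behaviour for annotations longer than 24 hours can be sketched with the standard library alone. The helper clip_to_clock_window is hypothetical; it handles a single annotation whose recording start is already known, and it assumes that the clock-time window does not cross midnight.

```python
from datetime import datetime, timedelta

ONE_MS = timedelta(milliseconds=1)


def clip_to_clock_window(recording_start, range_onset, range_offset, start_time, end_time):
    """Return one (range_onset, range_offset) pair per day matching the window."""
    window_start = datetime.strptime(start_time, "%H:%M").time()
    window_end = datetime.strptime(end_time, "%H:%M").time()
    onset_dt = recording_start + range_onset * ONE_MS
    offset_dt = recording_start + range_offset * ONE_MS
    rows = []
    day = onset_dt.date()
    while day <= offset_dt.date():
        # Intersect the annotation with this day's clock-time window.
        lo = max(onset_dt, datetime.combine(day, window_start))
        hi = min(offset_dt, datetime.combine(day, window_end))
        if hi > lo:
            rows.append(((lo - recording_start) // ONE_MS, (hi - recording_start) // ONE_MS))
        day += timedelta(days=1)
    return rows


# A 30-hour annotation starting at 08:00, restricted to 09:00-12:00.
rows = clip_to_clock_window(datetime(2020, 4, 20, 8, 0), 0, 30 * 3_600_000, "09:00", "12:00")
print(rows)
```

Because the annotation spans two mornings, two clipped ranges are returned, one per matching interval.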

import_annotations(input: DataFrame, threads: int = -1, import_function: Optional[Callable[[str], DataFrame]] = None, new_tiers: Optional[list] = None, overwrite_existing: bool = False) DataFrame[source]

Import and convert annotations.

Parameters:
  • input (pd.DataFrame) – dataframe of all annotations to import, as described in Annotation importation input format.

  • threads (int, optional) – If > 1, conversions will be run on that many threads, defaults to -1

  • import_function (Callable[[str], pd.DataFrame], optional) – If specified, the custom import_function function will be used to convert all input annotations, defaults to None

  • new_tiers (list[str], optional) – List of EAF tier names. If specified, the corresponding EAF tiers will be imported.

  • overwrite_existing (bool, optional) – choose if lines with the same set and annotation_filename should be overwritten

Returns:

dataframe of imported annotations, as in Annotations index.

Return type:

pd.DataFrame

static intersection(annotations: DataFrame, sets: Optional[list] = None) DataFrame[source]

Compute the intersection of all annotations for all sets and recordings, based on their recording_filename, range_onset and range_offset attributes. (Only these columns are required, but more can be passed and they will be preserved).

Parameters:

annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index

Returns:

dataframe of annotations, according to Annotations index

Return type:

pd.DataFrame
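The idea of intersecting annotated ranges can be sketched for the simplest case: one recording and two sets, each given as a sorted list of non-overlapping (range_onset, range_offset) pairs. The helper intersect_ranges and the example values are hypothetical; the method documented above works on DataFrames and handles every set and recording at once.

```python
def intersect_ranges(a, b):
    """Intersect two sorted lists of non-overlapping (onset, offset) ranges."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        onset = max(a[i][0], b[j][0])
        offset = min(a[i][1], b[j][1])
        if offset > onset:
            result.append((onset, offset))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


vtc = [(0, 3_600_000), (7_200_000, 10_800_000)]
eaf = [(1_800_000, 2_400_000), (3_000_000, 8_000_000)]
common = intersect_ranges(vtc, eaf)
print(common)
```

Only the portions of the recording covered by both sets remain, which is what makes the result suitable for comparing annotators or algorithms.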

merge_annotations(left_columns, right_columns, columns, output_set, input, skip_existing: bool = False)[source]

Given two DataFrames listing the annotation indexes to merge (these indexes should come from the intersection of the left_set and right_set indexes), the list of columns to merge, and the name of the output_set, create the resulting csv files containing the converted merged segments and return the new indexes to add to annotations.csv.

Parameters:
  • left_columns (list[str]) – list of the columns to include from the left set

  • right_columns (list[str]) – list of the columns to include from the right set

  • columns (dict) – additional columns to add to the segments, key is the column name

  • output_set (str) – name of the set to save the new merged files into

  • input (dict) – annotation indexes to use for the merge; contains the keys ‘left_annotations’ and ‘right_annotations’ to separate the indexes of the left and right sets

Returns:

annotation indexes created by the merge, should be added to annotations.csv

Return type:

pandas.DataFrame

merge_sets(left_set: str, right_set: str, left_columns: List[str], right_columns: List[str], output_set: str, full_set_merge: bool = True, skip_existing: bool = False, columns: dict = {}, recording_filter: Optional[str] = None, threads=-1)[source]

Merge columns from left_set and right_set annotations, for all matching segments, into a new set of annotations named output_set that will be saved in the dataset. output_set must not already exist if full_set_merge is True.

Parameters:
  • left_set (str) – Left set of annotations.

  • right_set (str) – Right set of annotations.

  • left_columns (List) – Columns whose values will be based on the left set.

  • right_columns (List) – Columns whose values will be based on the right set.

  • output_set (str) – Name of the output annotations set.

  • full_set_merge (bool) – The merge is meant to create the entire merged set. Therefore, the set should not already exist. Defaults to True

  • skip_existing (bool) – The merge will skip already existing lines in the merged set, so both the annotation index and the resulting converted csv will not change for those lines

  • columns (dict) – Additional columns to add to the resulting converted annotations.

  • recording_filter (set[str]) – set of recording_filenames to merge.

  • threads (int) – number of threads

read() Tuple[List[str], List[str]][source]

Read the index of annotations from metadata/annotations.csv and store it into self.annotations.

Returns:

a tuple containing the list of errors and the list of warnings generated while reading the index

Return type:

Tuple[List[str],List[str]]

remove_set(annotation_set: str, recursive: bool = False)[source]

Remove a set of annotations, deleting every converted file and removing them from the index. This preserves raw annotations.

Parameters:
  • annotation_set (str) – set of annotations to remove

  • recursive (bool, optional) – remove subsets as well, defaults to False

rename_set(annotation_set: str, new_set: str, recursive: bool = False, ignore_errors: bool = False)[source]

Rename a set of annotations, moving all related files and updating the index accordingly.

Parameters:
  • annotation_set (str) – name of the set to rename

  • new_set (str) – new set name

  • recursive (bool, optional) – rename subsets as well, defaults to False

  • ignore_errors (bool, optional) – If True, keep going even if unindexed files are detected, defaults to False

set_from_path(path: str) str[source]
validate(annotations: Optional[DataFrame] = None, threads: int = 0) Tuple[List[str], List[str]][source]

Check all indexed annotations for errors.

Parameters:
  • annotations (pd.DataFrame, optional) – annotations to validate, defaults to None. If None, the whole index will be scanned.

  • threads (int, optional) – how many threads to run the tests with, defaults to 0. If <= 0, all available CPU cores will be used.

Returns:

a tuple containing the list of errors and the list of warnings detected

Return type:

Tuple[List[str], List[str]]

validate_annotation(annotation: dict) Tuple[List[str], List[str]][source]
write()[source]

Update the annotations index, while enforcing its good shape.

ChildProject.cmdline module

ChildProject.converters module

class ChildProject.converters.AliceConverter[source]

Bases: AnnotationConverter

FORMAT = 'alice'
static convert(filename: str, source_file: str = '', **kwargs) DataFrame[source]
class ChildProject.converters.AnnotationConverter[source]

Bases: object

SPEAKER_ID_TO_TYPE = {'C1': 'OCH', 'C2': 'OCH', 'CHI': 'CHI', 'CHI*': 'CHI', 'EE1': 'NA', 'EE2': 'NA', 'FA0': 'FEM', 'FA1': 'FEM', 'FA2': 'FEM', 'FA3': 'FEM', 'FA4': 'FEM', 'FA5': 'FEM', 'FA6': 'FEM', 'FA7': 'FEM', 'FA8': 'FEM', 'FAE': 'NA', 'FC1': 'OCH', 'FC2': 'OCH', 'FC3': 'OCH', 'FCE': 'NA', 'MA0': 'MAL', 'MA1': 'MAL', 'MA2': 'MAL', 'MA3': 'MAL', 'MA4': 'MAL', 'MA5': 'MAL', 'MAE': 'NA', 'MC1': 'OCH', 'MC2': 'OCH', 'MC3': 'OCH', 'MC4': 'OCH', 'MC5': 'OCH', 'MCE': 'NA', 'MI1': 'OCH', 'MOT*': 'FEM', 'OC0': 'OCH', 'UA1': 'NA', 'UA2': 'NA', 'UA3': 'NA', 'UA4': 'NA', 'UA5': 'NA', 'UA6': 'NA', 'UC1': 'OCH', 'UC2': 'OCH', 'UC3': 'OCH', 'UC4': 'OCH', 'UC5': 'OCH', 'UC6': 'OCH'}
THREAD_SAFE = True
class ChildProject.converters.ChatConverter[source]

Bases: AnnotationConverter

ADDRESSEE_TABLE = {'CHI': 'T', 'FEM': 'A', 'MAL': 'A', 'OCH': 'C'}
FORMAT = 'cha'
SPEAKER_ROLE_TO_TYPE = {'Adult': 'NA', 'Attorney': 'NA', 'Audience': 'NA', 'Boy': 'OCH', 'Brother': 'OCH', 'Caretaker': 'NA', 'Child': 'OCH', 'Doctor': 'NA', 'Environment': 'NA', 'Father': 'MAL', 'Female': 'FEM', 'Friend': 'OCH', 'Girl': 'OCH', 'Grandfather': 'MAL', 'Grandmother': 'FEM', 'Group': 'NA', 'Guest': 'NA', 'Host': 'NA', 'Investigator': 'NA', 'Justice': 'NA', 'LENA': 'NA', 'Leader': 'NA', 'Male': 'MAL', 'Media': 'NA', 'Member': 'NA', 'Mother': 'FEM', 'Narrator': 'NA', 'Nurse': 'NA', 'Other': 'NA', 'Participant': 'CHI', 'Partner': 'NA', 'PlayRole': 'NA', 'Playmate': 'OCH', 'Relative': 'NA', 'Sibling': 'OCH', 'Sister': 'OCH', 'Speaker': 'NA', 'Student': 'NA', 'Target_Adult': 'NA', 'Target_Child': 'CHI', 'Teacher': 'NA', 'Teenager': 'NA', 'Text': 'NA', 'Uncertain': 'NA', 'Unidentified': 'NA', 'Visitor': 'NA'}
THREAD_SAFE = False
static convert(filename: str, filter=None, **kwargs) DataFrame[source]
static role_to_addressee(role)[source]
class ChildProject.converters.CsvConverter[source]

Bases: AnnotationConverter

FORMAT = 'csv'
static convert(filename: str, filter: str = '', **kwargs) DataFrame[source]
class ChildProject.converters.EafConverter[source]

Bases: AnnotationConverter

FORMAT = 'eaf'
static convert(filename: str, filter=None, **kwargs) DataFrame[source]
class ChildProject.converters.Formats(value)[source]

Bases: Enum

An enumeration.

ALICE = 'alice'
CHA = 'cha'
CSV = 'csv'
EAF = 'eaf'
ITS = 'its'
TEXTGRID = 'TextGrid'
VCM = 'vcm_rttm'
VTC = 'vtc_rttm'
class ChildProject.converters.ItsConverter[source]

Bases: AnnotationConverter

FORMAT = 'its'
SPEAKER_TYPE_TRANSLATION = {'CHN': 'CHI', 'CXN': 'OCH', 'FAN': 'FEM', 'MAN': 'MAL'}
static convert(filename: str, recording_num: Optional[int] = None, **kwargs) DataFrame[source]
class ChildProject.converters.TextGridConverter[source]

Bases: AnnotationConverter

FORMAT = 'TextGrid'
static convert(filename: str, filter=None, **kwargs) DataFrame[source]
class ChildProject.converters.VcmConverter[source]

Bases: AnnotationConverter

FORMAT = 'vcm_rttm'
SPEAKER_TYPE_TRANSLATION = {'CHI': 'OCH', 'CNS': 'CHI', 'CRY': 'CHI', 'FEM': 'FEM', 'MAL': 'MAL', 'NCS': 'CHI'}
VCM_TRANSLATION = {'CNS': 'C', 'CRY': 'Y', 'NCS': 'N', 'OTH': 'J'}
static convert(filename: str, source_file: str = '', **kwargs) DataFrame[source]
class ChildProject.converters.VtcConverter[source]

Bases: AnnotationConverter

FORMAT = 'vtc_rttm'
SPEAKER_TYPE_TRANSLATION = {'CHI': 'OCH', 'FEM': 'FEM', 'KCHI': 'CHI', 'MAL': 'MAL'}
static convert(filename: str, source_file: str = '', **kwargs) DataFrame[source]
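To show how a translation table such as VtcConverter.SPEAKER_TYPE_TRANSLATION might be applied, here is a minimal sketch that parses one RTTM line into the segment columns used in this package. The helper parse_rttm_line, the example line, and the fallback to 'NA' for unknown labels are assumptions for illustration; this is not the converter's actual code.

```python
# Copied from VtcConverter.SPEAKER_TYPE_TRANSLATION as listed above.
SPEAKER_TYPE_TRANSLATION = {"CHI": "OCH", "FEM": "FEM", "KCHI": "CHI", "MAL": "MAL"}


def parse_rttm_line(line):
    """Convert one RTTM 'SPEAKER' line into a segment dictionary."""
    fields = line.split()
    # RTTM stores the onset and the duration in seconds (fields 4 and 5),
    # and the speaker label in field 8.
    onset_s, duration_s, label = float(fields[3]), float(fields[4]), fields[7]
    return {
        "segment_onset": int(round(onset_s * 1000)),
        "segment_offset": int(round((onset_s + duration_s) * 1000)),
        # Assumption: labels missing from the table are mapped to 'NA'.
        "speaker_type": SPEAKER_TYPE_TRANSLATION.get(label, "NA"),
    }


segment = parse_rttm_line("SPEAKER rec1 1 12.5 1.25 <NA> <NA> KCHI <NA> <NA>")
print(segment)
```

The key child label KCHI is translated to CHI, and the timestamps are converted from seconds to milliseconds.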

ChildProject.metrics module

ChildProject.metrics.conf_matrix(rows_grid, columns_grid)[source]

compute the confusion matrix (as counts) from grids of active classes.

See ChildProject.metrics.segments_to_grid() for a description of grids.

Parameters:
  • rows_grid (numpy.array) – the grid corresponding to the rows of the confusion matrix.

  • columns_grid (numpy.array) – the grid corresponding to the columns of the confusion matrix.

  • categories (list of strings) – the labels corresponding to each class

Returns:

a square numpy array of counts

Return type:

numpy.array
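A confusion matrix of counts can be sketched from two small grids using nested lists. The sketch assumes that cell (i, j) counts the time units during which class i is active in rows_grid and class j is active in columns_grid; this interpretation, the helper name conf_matrix_sketch, and the example grids are assumptions for illustration.

```python
def conf_matrix_sketch(rows_grid, columns_grid):
    """Count co-activations between the classes of two grids of equal length."""
    n_rows = len(rows_grid[0])
    n_cols = len(columns_grid[0])
    counts = [[0] * n_cols for _ in range(n_rows)]
    for row_units, col_units in zip(rows_grid, columns_grid):
        for i, row_active in enumerate(row_units):
            for j, col_active in enumerate(col_units):
                counts[i][j] += row_active * col_active
    return counts


# Columns of each grid: CHI, FEM, none (one row per time unit).
rows_grid = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
columns_grid = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
counts = conf_matrix_sketch(rows_grid, columns_grid)
print(counts)
```

The off-diagonal count in the first row reflects the single time unit labelled CHI in the first grid and FEM in the second.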

ChildProject.metrics.gamma(segments: DataFrame, column: str, alpha: float = 1, beta: float = 1, precision_level: float = 0.05) float[source]

Compute Mathet et al. gamma agreement on segments.

The gamma measure evaluates the reliability of both the segmentation and the categorization simultaneously; an extensive description of the method and its parameters can be found in Mathet et al., 2015 (doi:10.1162/COLI_a_00227)

This function uses the pygamma-agreement package by Titeux et al.

Parameters:
  • segments (pd.DataFrame) – input segments dataframe (see Annotations format for the dataframe format)

  • column (str) – name of the categorical column of the segments to consider, e.g. ‘speaker_type’

  • alpha (float, optional) – gamma agreement time alignment weight, defaults to 1

  • beta (float, optional) – gamma agreement categorical weight, defaults to 1

  • precision_level (float, optional) – level of precision (see pygamma-agreement’s documentation), defaults to 0.05

Returns:

gamma agreement

Return type:

float

ChildProject.metrics.grid_to_vector(grid, categories)[source]

Transform a grid of active classes into a vector of labels. In case several classes are active at time i, the label is set to ‘overlap’.

See ChildProject.metrics.segments_to_grid() for a description of grids.

Parameters:
  • grid (numpy.array) – a NumPy array of shape (n, len(categories))

  • categories (list) – the list of categories

Returns:

the vector of labels of length n (e.g. np.array([none FEM FEM FEM overlap overlap CHI]))

Return type:

numpy.array
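The mapping from a grid to a vector of labels can be sketched with plain lists. The helper grid_to_vector_sketch is hypothetical, and labelling a time unit with no active class as 'none' is an assumption; grids built with a 'none' column never reach that branch.

```python
def grid_to_vector_sketch(grid, categories):
    """Return one label per time unit of the grid."""
    labels = []
    for row in grid:
        active = [category for category, value in zip(categories, row) if value]
        if len(active) > 1:
            labels.append("overlap")
        elif active:
            labels.append(active[0])
        else:
            # Assumption: only reached if the grid has no 'none' column.
            labels.append("none")
    return labels


categories = ["CHI", "FEM", "none"]
grid = [
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
]
vector = grid_to_vector_sketch(grid, categories)
print(vector)
```

This reproduces the sequence of labels given as an example in the description of the return value.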

ChildProject.metrics.pyannote_metric(segments: DataFrame, reference: str, hypothesis: str, metric, column: str)[source]
ChildProject.metrics.segments_to_annotation(segments: DataFrame, column: str)[source]

Transform a dataframe of annotation segments into a pyannote.core.Annotation object

Parameters:
  • segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: segment_onset, segment_offset and column.

  • column (str) – the name of the column in segments that should be used for the values of the annotations (e.g. speaker_type).

Returns:

the pyannote.core.Annotation object.

Return type:

pyannote.core.Annotation

ChildProject.metrics.segments_to_grid(segments: DataFrame, range_onset: int, range_offset: int, timescale: int, column: str, categories: list, none=True, overlap=False) float[source]

Transform a dataframe of annotation segments into a 2d matrix representing the indicator function of each of the categories across time.

Each row of the matrix corresponds to a unit of time of length timescale (in milliseconds), ranging from range_onset to range_offset; each column corresponds to one of the categories provided, plus up to two optional special columns (overlap and none).

The value of the cell ij of the output matrix is set to 1 if the class j is active at time i, 0 otherwise.

If overlap is True, an additional column is appended to the grid, which is set to 1 if more than two classes are active at time i.

If none is set to True, an additional column is appended to the grid, which is set to one if none of the classes are active at time i.

The shape of the output matrix is therefore ((range_offset-range_onset)/timescale, len(categories) + n), where n = 2 if both overlap and none are True, 1 if one of them is True, and 0 otherwise.

The fraction of time a class j is active can therefore be calculated as np.mean(grid, axis = 0)[j]

Parameters:
  • segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: segment_onset, segment_offset and column.

  • range_onset (int) – timestamp of the beginning of the range to consider (in milliseconds)

  • range_offset (int) – timestamp of the end of the range to consider (in milliseconds)

  • timescale (int) – length of each time unit (in milliseconds)

  • column (str) – the name of the column in segments that should be used for the values of the annotations (e.g. speaker_type).

  • categories (list) – the list of categories

  • none (bool) – append a ‘none’ column, default True

  • overlap (bool) – append an overlap column, default False

Returns:

the output grid

Return type:

numpy.array
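The construction of a grid can be sketched with nested lists for the default options (none=True, overlap=False). The helper segments_to_grid_sketch is hypothetical, segments are given as (onset, offset, label) tuples rather than a DataFrame, and the rule that a class is active in a time unit as soon as a segment overlaps it is an assumption.

```python
def segments_to_grid_sketch(segments, range_onset, range_offset, timescale, categories):
    """Build a grid with one row per time unit and one column per category, plus 'none'."""
    n_units = (range_offset - range_onset) // timescale
    grid = [[0] * len(categories) for _ in range(n_units)]
    for onset, offset, label in segments:
        if label not in categories:
            continue
        j = categories.index(label)
        for i in range(n_units):
            unit_start = range_onset + i * timescale
            unit_end = unit_start + timescale
            # Assumption: any overlap with the time unit marks the class as active.
            if onset < unit_end and offset > unit_start:
                grid[i][j] = 1
    for row in grid:
        # Append the 'none' column: 1 if no class is active in this unit.
        row.append(1 if sum(row) == 0 else 0)
    return grid


segments = [(0, 2000, "CHI"), (1000, 3000, "FEM")]
grid = segments_to_grid_sketch(segments, 0, 4000, 1000, ["CHI", "FEM"])
fractions = [sum(column) / len(grid) for column in zip(*grid)]
print(grid)
print(fractions)
```

The grid has (4000 - 0) / 1000 = 4 rows and len(categories) + 1 = 3 columns, and the column means give the fraction of time each class is active.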

ChildProject.metrics.vectors_to_annotation_task(*args, drop: List[str] = [])[source]

transform vectors of labels into a nltk AnnotationTask object.

Parameters:
  • args (1d np.array() of labels) – vector of labels for each annotator; add one argument per annotator.

  • drop (List[str]) – list of labels that should be ignored

Returns:

the AnnotationTask object

Return type:

nltk.metrics.agreement.AnnotationTask
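nltk's AnnotationTask is built from (coder, item, label) triples, so the conversion can be sketched without importing nltk. The helper vectors_to_triples and the coder naming scheme are hypothetical, and dropping individual triples whose label appears in drop is an assumption about how ignored labels are handled.

```python
def vectors_to_triples(*vectors, drop=()):
    """Turn one vector of labels per annotator into (coder, item, label) triples."""
    triples = []
    for coder, vector in enumerate(vectors):
        for item, label in enumerate(vector):
            if label in drop:
                continue  # assumption: ignored labels are simply skipped
            triples.append((f"coder_{coder}", str(item), label))
    return triples


annotator_a = ["CHI", "FEM", "none"]
annotator_b = ["CHI", "none", "none"]
triples = vectors_to_triples(annotator_a, annotator_b, drop=["none"])
print(triples)
```

Such a list could then be passed as the data argument of nltk.metrics.agreement.AnnotationTask to compute agreement coefficients.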

ChildProject.projects module

class ChildProject.projects.ChildProject(path: str, enforce_dtypes: bool = False, ignore_discarded: bool = True)[source]

Bases: object

ChildProject instance. This class is a representation of a ChildProject dataset.

Constructor parameters:

Parameters:
  • path (str) – path to the root of the dataset.

  • enforce_dtypes (bool, optional) – enforce dtypes on children/recordings dataframes, defaults to False

  • ignore_discarded (bool, optional) – ignore entries such that discard=1, defaults to True

Attributes:
  • path (str) – path to the root of the dataset.

  • recordings (pd.DataFrame) – pandas dataframe representation of this dataset's metadata/recordings.csv

  • children (pd.DataFrame) – pandas dataframe representation of this dataset's metadata/children.csv

CHILDREN_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = child_dob), IndexColumn(name = location_id), IndexColumn(name = child_sex), IndexColumn(name = language), IndexColumn(name = languages), IndexColumn(name = mat_ed), IndexColumn(name = fat_ed), IndexColumn(name = car_ed), IndexColumn(name = monoling), IndexColumn(name = monoling_criterion), IndexColumn(name = normative), IndexColumn(name = normative_criterion), IndexColumn(name = mother_id), IndexColumn(name = father_id), IndexColumn(name = order_of_birth), IndexColumn(name = n_of_siblings), IndexColumn(name = household_size), IndexColumn(name = dob_criterion), IndexColumn(name = dob_accuracy), IndexColumn(name = discard)]
DOCUMENTATION_COLUMNS = [IndexColumn(name = variable), IndexColumn(name = description), IndexColumn(name = values), IndexColumn(name = scope), IndexColumn(name = annotation_set)]
RECORDINGS_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = date_iso), IndexColumn(name = start_time), IndexColumn(name = recording_device_type), IndexColumn(name = recording_filename), IndexColumn(name = duration), IndexColumn(name = session_id), IndexColumn(name = session_offset), IndexColumn(name = recording_device_id), IndexColumn(name = experimenter), IndexColumn(name = location_id), IndexColumn(name = its_filename), IndexColumn(name = upl_filename), IndexColumn(name = trs_filename), IndexColumn(name = lena_id), IndexColumn(name = lena_recording_num), IndexColumn(name = might_feature_gaps), IndexColumn(name = start_time_accuracy), IndexColumn(name = noisy_setting), IndexColumn(name = notes), IndexColumn(name = discard)]
REC_COL_REF = {'child_id': IndexColumn(name = child_id), 'date_iso': IndexColumn(name = date_iso), 'discard': IndexColumn(name = discard), 'duration': IndexColumn(name = duration), 'experiment': IndexColumn(name = experiment), 'experimenter': IndexColumn(name = experimenter), 'its_filename': IndexColumn(name = its_filename), 'lena_id': IndexColumn(name = lena_id), 'lena_recording_num': IndexColumn(name = lena_recording_num), 'location_id': IndexColumn(name = location_id), 'might_feature_gaps': IndexColumn(name = might_feature_gaps), 'noisy_setting': IndexColumn(name = noisy_setting), 'notes': IndexColumn(name = notes), 'recording_device_id': IndexColumn(name = recording_device_id), 'recording_device_type': IndexColumn(name = recording_device_type), 'recording_filename': IndexColumn(name = recording_filename), 'session_id': IndexColumn(name = session_id), 'session_offset': IndexColumn(name = session_offset), 'start_time': IndexColumn(name = start_time), 'start_time_accuracy': IndexColumn(name = start_time_accuracy), 'trs_filename': IndexColumn(name = trs_filename), 'upl_filename': IndexColumn(name = upl_filename)}
REQUIRED_DIRECTORIES = ['recordings', 'extra']
accumulate_metadata(table: str, df: DataFrame, columns: list, merge_column: str, verbose=False) DataFrame[source]
compute_ages(recordings: Optional[DataFrame] = None, children: Optional[DataFrame] = None, age_format: str = 'months') Series[source]

Compute the age of the subject child for each recording (in months by default, as a float) and return it as a pandas Series object.

Example:

>>> from ChildProject.projects import ChildProject
>>> project = ChildProject("examples/valid_raw_data")
>>> project.read()
>>> project.recordings["age"] = project.compute_ages()
>>> project.recordings[["child_id", "date_iso", "age"]]
    child_id    date_iso       age
line                                
2            1  2020-04-20  3.613963
3            1  2020-04-21  3.646817
Parameters:
  • recordings (pd.DataFrame, optional) – custom recordings DataFrame (see Metadata), otherwise use all project recordings, defaults to None

  • children (pd.DataFrame, optional) – custom children DataFrame (see Metadata), otherwise use all project children data, defaults to None

  • age_format (str, optional) – unit of the output ages, one of ‘months’, ‘days’, ‘weeks’ or ‘years’, defaults to ‘months’
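The ages printed in the example output above are consistent with dividing the number of elapsed days by the average length of a month (365.25 / 12 days). The sketch below reproduces them under the assumption that the child was born on 2020-01-01; both that date of birth and the formula are inferred from the printed values, not taken from the implementation, and the helper age_in_months is hypothetical.

```python
from datetime import date

DAYS_PER_MONTH = 365.25 / 12  # average month length, assumed conversion factor


def age_in_months(child_dob, recording_date):
    """Age at the time of the recording, in months, as a float."""
    days = (recording_date - child_dob).days
    return days / DAYS_PER_MONTH


child_dob = date(2020, 1, 1)  # hypothetical date of birth
first_age = age_in_months(child_dob, date(2020, 4, 20))
second_age = age_in_months(child_dob, date(2020, 4, 21))
print(round(first_age, 6), round(second_age, 6))
```

With 110 and 111 elapsed days, this yields 3.613963 and 3.646817 months, matching the two rows of the example.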

compute_recordings_duration(profile: Optional[str] = None) DataFrame[source]

compute recordings duration

Parameters:

profile (str, optional) – name of the profile of recordings to compute the duration from. If None, raw recordings are used. defaults to None

Returns:

dataframe of the recordings, with an additional/updated duration column.

Return type:

pd.DataFrame

get_converted_recording_filename(profile: str, recording_filename: str) str[source]

retrieve the converted filename of a recording under a given profile, from its original filename.

Parameters:
  • profile (str) – recording profile

  • recording_filename (str) – original recording filename, as indexed in the metadata

Returns:

corresponding converted filename of the recording under this profile

Return type:

str

get_recording_path(recording_filename: str, profile: Optional[str] = None) str[source]

return the path to a recording

Parameters:
  • recording_filename (str) – recording filename, as in the metadata

  • profile (str, optional) – name of the conversion profile, defaults to None

Returns:

path to the recording

Return type:

str

get_recordings_from_list(recordings: list, profile: Optional[str] = None) DataFrame[source]

Recover recordings metadata from a list of recordings or path to recordings.

Parameters:

recordings (list) – list of recording names or paths

Returns:

matching recordings

Return type:

pd.DataFrame

read(verbose=False, accumulate=True)[source]

Read the metadata from the project and store it in the recordings and children attributes.

Parameters:
  • verbose (bool) – read with additional output

  • accumulate (bool) – add metadata from subfolders (usually confidential metadata)

read_documentation() DataFrame[source]
recording_from_path(path: str, profile: Optional[str] = None) str[source]
validate(ignore_recordings: bool = False, profile: Optional[str] = None, accumulate: bool = True) tuple[source]

Validate a dataset, returning all errors and warnings.

Parameters:
  • ignore_recordings (bool, optional) – if True, no errors will be returned for missing recordings.

  • profile (str, optional) – profile of recordings to use

  • accumulate – use accumulated metadata (usually confidential metadata if present)

Returns:

A tuple containing the list of errors, and the list of warnings.

Return type:

a tuple of two lists

write_recordings(keep_discarded: bool = True, keep_original_columns: bool = True)[source]

Write self.recordings to the recordings csv file of the dataset. Warning: if read() was called with accumulate enabled, this may write confidential information to recordings.csv.

Parameters:
  • keep_discarded (bool, optional) – if True, the lines in the csv that are discarded by the dataset are kept when writing. defaults to True (when False, discarded lines disappear from the dataset)

  • keep_original_columns (bool, optional) – if True, deleting columns in the recordings dataframe will not result in them disappearing from the csv file (if false, only the current columns are kept)

Returns:

dataframe that was written to the csv file

Return type:

pandas.DataFrame

ChildProject.tables module

exception ChildProject.tables.IncorrectDtypeException[source]

Bases: Exception

Exception raised when an unexpected dtype is found in a pandas DataFrame

class ChildProject.tables.IndexColumn(name='', description='', required=False, regex=None, filename=False, datetime=None, function=None, choices=None, dtype=None, unique=False, generated=False)[source]

Bases: object

class ChildProject.tables.IndexTable(name, path=None, columns=[], enforce_dtypes: bool = False)[source]

Bases: object

msg(text)[source]
read()[source]
validate()[source]
exception ChildProject.tables.MissingColumnsException(name: str, missing: Set)[source]

Bases: Exception

ChildProject.tables.assert_columns_presence(name: str, df: DataFrame, columns: Union[Set, List])[source]
ChildProject.tables.assert_dataframe(name: str, df: DataFrame, not_empty: bool = False)[source]
ChildProject.tables.is_boolean(x)[source]
ChildProject.tables.read_csv_with_dtype(file: str, dtypes: dict)[source]

ChildProject.utils module

class ChildProject.utils.Segment(start, stop)[source]

Bases: object

length()[source]
class ChildProject.utils.TimeInterval(start: datetime, stop: datetime)[source]

Bases: object

length()[source]
ChildProject.utils.calculate_shift(file1, file2, start1, start2, interval)[source]

Takes 2 audio files, a starting point for each, and a length to compare (in seconds). Returns a divergence score representing the average difference in audio signal.

Parameters:
  • file1 (str) – path to the first wav file to compare

  • file2 (str) – path to the second wav file to compare

  • start1 (int) – starting point for the comparison in seconds for the first audio

  • start2 (int) – starting point for the comparison in seconds for the second audio

  • interval (int) – length, in seconds, of the portion to compare between the 2 audio files

Returns:

tuple of divergence score and number of values used

Return type:

(float, int)
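As a rough illustration of what a "divergence score" of this kind can look like, the sketch below computes the mean absolute difference between two already-loaded sample sequences and returns it with the number of values compared. This is an assumption made for illustration only: the helper name mean_abs_divergence is hypothetical, and the exact metric, normalization, and file reading used by calculate_shift are not specified here.

```python
def mean_abs_divergence(samples1, samples2):
    """Hypothetical helper: average absolute difference between two signals.

    Assumption: both inputs are sequences of numeric samples already read
    from the audio files; only the overlapping length is compared.
    """
    n = min(len(samples1), len(samples2))
    if n == 0:
        raise ValueError("both signals must contain at least one sample")
    total = sum(abs(a - b) for a, b in zip(samples1[:n], samples2[:n]))
    # Return the score together with the number of values used,
    # mirroring the (float, int) return type documented above.
    return total / n, n


score, count = mean_abs_divergence([0, 1, 2, 3], [0, 1, 2, 5, 9])
```

A score close to 0 would suggest the two compared portions carry nearly the same signal, under the assumptions stated above.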

ChildProject.utils.find_lines_involved_in_overlap(df: DataFrame, onset_label: str = 'range_onset', offset_label: str = 'range_offset', labels=[])[source]

Takes a dataframe as input. The dataframe is expected to have one column for the onset of each time segment and one for the offset. The function returns a boolean Series in which ‘True’ marks lines involved in an overlap and ‘False’ marks lines that are not. For example, to select all lines involved in overlaps, use: ` ovl_segments = df[find_lines_involved_in_overlap(df, 'segment_onset', 'segment_offset')] ` and to select lines that never overlap, use: ` ovl_segments = df[~find_lines_involved_in_overlap(df, 'segment_onset', 'segment_offset')] `

Parameters:
  • df (pd.DataFrame) – pandas DataFrame where we want to find overlaps, having some time segments described by 2 columns (onset and offset)

  • onset_label (str) – column label for the onset of time segments

  • offset_label (str) – column label for the offset of time segments

  • labels (list[str]) – list of column labels that are required to match to be involved in overlap.

Returns:

pandas Series of boolean values where ‘True’ are indexes where overlaps exist

Return type:

pd.Series
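To make the shape of the returned mask concrete, here is a minimal standalone sketch of an overlap mask built with pandas and NumPy. It is not the library's implementation: the function name overlap_mask is hypothetical, the labels parameter is not handled, and treating segments that merely touch (one offset equal to the next onset) as non-overlapping is an assumption.

```python
import numpy as np
import pandas as pd


def overlap_mask(df, onset_label="range_onset", offset_label="range_offset"):
    """Hypothetical sketch: True for every row overlapping at least one other row."""
    onsets = df[onset_label].to_numpy()
    offsets = df[offset_label].to_numpy()
    # Row i overlaps row j when i starts before j ends and j starts before i ends.
    # Strict inequalities: touching segments are assumed not to overlap.
    pairwise = (onsets[:, None] < offsets[None, :]) & (onsets[None, :] < offsets[:, None])
    # A row always "overlaps" itself, so ignore the diagonal.
    np.fill_diagonal(pairwise, False)
    return pd.Series(pairwise.any(axis=1), index=df.index)


segments = pd.DataFrame(
    {"segment_onset": [0, 5, 20, 30], "segment_offset": [10, 15, 30, 40]}
)
mask = overlap_mask(segments, "segment_onset", "segment_offset")
overlapping = segments[mask]   # rows involved in an overlap
isolated = segments[~mask]     # rows that never overlap
```

The pairwise comparison is quadratic in the number of rows, which keeps the sketch short but would be slow on very large tables.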

ChildProject.utils.get_audio_duration(filename)[source]
ChildProject.utils.intersect_ranges(xs, ys)[source]
ChildProject.utils.path_is_parent(parent_path: str, child_path: str)[source]
ChildProject.utils.read_wav(filename, start_s, length_s)[source]
ChildProject.utils.retry_func(func: callable, excep: Exception, tries: int = 3, **kwargs)[source]
ChildProject.utils.series_to_datetime(time_series, time_index_list, time_column_name: str, date_series=None, date_index_list=None, date_column_name=None)[source]

Returns a series of datetimes from a series of str, using pd.to_datetime with all the formats listed for a specific column name in an index consisting of IndexColumn items. To include the date (and not only the time), a second series can be passed for the date, along with its corresponding index and column name.

Parameters:
  • time_series (pandas.Series) – pandas series of strings to transform into datetime (NA values are converted to NaT). If date_series is given, time_series should only contain the time

  • time_index_list (List[IndexColumn]) – list of IndexColumn items in which the wanted column is defined

  • time_column_name (str) – name of the IndexColumn to use (IndexColumn.name value) for accepted formats

  • date_series (pandas.Series) – pandas series of strings to transform into the date component of datetime (can contain NA value)

  • date_index_list (List[IndexColumn]) – list of IndexColumn items in which the wanted date column is defined

  • date_column_name (str) – name of the IndexColumn to use (IndexColumn.name value) for accepted formats for dates

Returns:

series with dtype datetime containing the converted datetimes

Return type:

pandas.Series
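The idea of trying several accepted formats in turn can be sketched with pd.to_datetime alone. The helper below is hypothetical and simplified: it takes the list of format strings directly instead of looking them up in IndexColumn items, and it does not combine a separate date series.

```python
import pandas as pd


def parse_with_formats(series, formats):
    """Hypothetical sketch: parse strings with the first format that matches."""
    if not formats:
        raise ValueError("at least one format is required")
    result = None
    for fmt in formats:
        # Values that do not match this format become NaT instead of raising.
        parsed = pd.to_datetime(series, format=fmt, errors="coerce")
        # Keep values already parsed; only fill the ones still missing.
        result = parsed if result is None else result.fillna(parsed)
    return result


times = pd.Series(["08:57", "21:04:30", None])
out = parse_with_formats(times, ["%H:%M", "%H:%M:%S"])
```

When a format has no date component, pd.to_datetime fills in 1900-01-01, so only the time part of the result is meaningful in this example.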

ChildProject.utils.time_intervals_intersect(ti1: TimeInterval, ti2: TimeInterval)[source]

Given 2 time intervals (which only take the time of day into account, not the date), return a list of new interval(s) representing the intersections of the original ones.

Examples:

1. time_intervals_intersect( TimeInterval( datetime(1900,1,1,8,57), datetime(1900,1,1,21,4)), TimeInterval( datetime(1900,1,1,10,36), datetime(1900,1,1,22,1))) => [TimeInterval(10:36 , 21:04)]

2. time_intervals_intersect( TimeInterval( datetime(1900,1,1,8,57), datetime(1900,1,1,22,1)), TimeInterval( datetime(1900,1,1,21,4), datetime(1900,1,1,10,36))) => [TimeInterval(08:57 , 10:36),TimeInterval(21:04 , 22:01)]
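The wrap-around behaviour in the second example (an interval whose stop is earlier in the day than its start) can be reproduced with a small standalone sketch using only the standard library. This is not the library's code: the function names are hypothetical, intervals are plain (start, stop) pairs of datetime.time values rather than TimeInterval objects, and results are returned as "HH:MM" string pairs. One simplification to note: when both inputs cross midnight, this sketch returns the pieces on each side of midnight separately instead of merging them.

```python
from datetime import time

DAY_SECONDS = 24 * 3600


def _seconds(t):
    """Seconds elapsed since midnight for a datetime.time value."""
    return t.hour * 3600 + t.minute * 60 + t.second


def _split_at_midnight(start, stop):
    """Return one or two non-wrapping (start, stop) pieces, in seconds."""
    s, e = _seconds(start), _seconds(stop)
    if s <= e:
        return [(s, e)]
    # The interval crosses midnight: split it into a morning and an evening piece.
    return [(0, e), (s, DAY_SECONDS)]


def _fmt(seconds):
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}"


def intersect_times_of_day(interval1, interval2):
    """Hypothetical sketch: intersections of two time-of-day intervals."""
    pieces = []
    for s1, e1 in _split_at_midnight(*interval1):
        for s2, e2 in _split_at_midnight(*interval2):
            lo, hi = max(s1, s2), min(e1, e2)
            if lo < hi:  # keep only non-empty intersections
                pieces.append((lo, hi))
    return [(_fmt(lo), _fmt(hi)) for lo, hi in sorted(pieces)]


first = intersect_times_of_day((time(8, 57), time(21, 4)), (time(10, 36), time(22, 1)))
second = intersect_times_of_day((time(8, 57), time(22, 1)), (time(21, 4), time(10, 36)))
```

With the inputs of the two documented examples, first holds a single interval and second holds two, matching the results listed above.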

Parameters:

Module contents