ChildProject package
Subpackages
- ChildProject.pipelines package
- Submodules
- ChildProject.pipelines.anonymize module
- ChildProject.pipelines.eafbuilder module
- ChildProject.pipelines.metrics module
- ChildProject.pipelines.metricsFunctions module
- ChildProject.pipelines.pipeline module
- ChildProject.pipelines.processors module
- ChildProject.pipelines.samplers module
- ChildProject.pipelines.zooniverse module
- Module contents
- ChildProject.templates package
Submodules
ChildProject.annotations module
- class ChildProject.annotations.AnnotationManager(project: ChildProject.projects.ChildProject)[source]
Bases:
object
- INDEX_COLUMNS = [IndexColumn(name = set), IndexColumn(name = recording_filename), IndexColumn(name = time_seek), IndexColumn(name = range_onset), IndexColumn(name = range_offset), IndexColumn(name = raw_filename), IndexColumn(name = format), IndexColumn(name = filter), IndexColumn(name = annotation_filename), IndexColumn(name = imported_at), IndexColumn(name = package_version), IndexColumn(name = error), IndexColumn(name = merged_from)]
- SEGMENTS_COLUMNS = [IndexColumn(name = raw_filename), IndexColumn(name = segment_onset), IndexColumn(name = segment_offset), IndexColumn(name = speaker_id), IndexColumn(name = speaker_type), IndexColumn(name = ling_type), IndexColumn(name = vcm_type), IndexColumn(name = lex_type), IndexColumn(name = mwu_type), IndexColumn(name = msc_type), IndexColumn(name = gra_type), IndexColumn(name = addressee), IndexColumn(name = transcription), IndexColumn(name = phonemes), IndexColumn(name = syllables), IndexColumn(name = words), IndexColumn(name = lena_block_type), IndexColumn(name = lena_block_number), IndexColumn(name = lena_conv_status), IndexColumn(name = lena_response_count), IndexColumn(name = lena_conv_floor_type), IndexColumn(name = lena_conv_turn_type), IndexColumn(name = lena_speaker), IndexColumn(name = utterances_count), IndexColumn(name = utterances_length), IndexColumn(name = non_speech_length), IndexColumn(name = average_db), IndexColumn(name = peak_db), IndexColumn(name = child_cry_vfx_len), IndexColumn(name = utterances), IndexColumn(name = cries), IndexColumn(name = vfxs)]
- static clip_segments(segments: pandas.core.frame.DataFrame, start: int, stop: int) pandas.core.frame.DataFrame [source]
Clip all segment onsets and offsets within ``start`` and ``stop``. Segments entirely outside the range [``start``, ``stop``] will be removed.
- Parameters
segments (pd.DataFrame) – Dataframe of the segments to clip
start (int) – range start (in milliseconds)
stop (int) – range end (in milliseconds)
- Returns
Dataframe of the clipped segments
- Return type
pd.DataFrame
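The clipping rule can be sketched in plain pandas (an illustrative re-implementation under assumptions, not the package's own code; column names follow the Annotations format):

```python
import pandas as pd

# Illustrative re-implementation of the clipping rule (an assumption,
# not the package's own code); column names follow the Annotations format.
def clip(segments: pd.DataFrame, start: int, stop: int) -> pd.DataFrame:
    segments = segments.copy()
    # clamp onsets and offsets to the [start, stop] range
    segments["segment_onset"] = segments["segment_onset"].clip(start, stop)
    segments["segment_offset"] = segments["segment_offset"].clip(start, stop)
    # segments clipped to zero length were entirely outside the range
    return segments[segments["segment_offset"] > segments["segment_onset"]]

df = pd.DataFrame({
    "segment_onset": [0, 4000, 12000],
    "segment_offset": [2000, 6000, 13000],
})
clipped = clip(df, 1000, 10000)  # third segment is dropped entirely
```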
- get_collapsed_segments(annotations: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]
Get all segments associated with the annotations referenced in ``annotations``, and collapse them into one virtual timeline.
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of all the segments (as specified in Annotations format), merged with ``annotations``
- Return type
pd.DataFrame
- get_segments(annotations: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]
Get all segments associated with the annotations referenced in ``annotations``.
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of all the segments (as specified in Annotations format), merged with ``annotations``.
- Return type
pd.DataFrame
- get_segments_timestamps(segments: pandas.core.frame.DataFrame, ignore_date: bool = False, onset: str = 'segment_onset', offset: str = 'segment_offset') pandas.core.frame.DataFrame [source]
Calculate the onset and offset clock-time of each segment
- Parameters
segments (pd.DataFrame) – DataFrame of segments (as returned by get_segments()).
ignore_date (bool, optional) – drop the date information and use time data only, defaults to False
onset (str, optional) – column storing the onset timestamp in milliseconds, defaults to “segment_onset”
offset (str, optional) – column storing the offset timestamp in milliseconds, defaults to “segment_offset”
- Returns
Returns the input dataframe with two new columns, ``onset_time`` and ``offset_time``. ``onset_time`` is a datetime object corresponding to the onset of the segment, and ``offset_time`` to its offset. If either ``start_time`` or ``date_iso`` is not specified for the corresponding recording, both values are set to NaT.
- Return type
pd.DataFrame
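The underlying conversion can be illustrated with plain pandas: given a recording start timestamp (hypothetical here, built in practice from ``date_iso`` and ``start_time`` in the metadata), millisecond offsets become clock-times. This is a minimal sketch, not the method's actual code:

```python
import pandas as pd

# Hypothetical segments (onset/offset in milliseconds) and a hypothetical
# recording start timestamp; the real method reads these from the metadata.
segments = pd.DataFrame({"segment_onset": [0, 90000],
                         "segment_offset": [60000, 150000]})
start = pd.Timestamp("2020-04-20 08:00:00")

# clock-time = recording start + millisecond offset
segments["onset_time"] = start + pd.to_timedelta(segments["segment_onset"], unit="ms")
segments["offset_time"] = start + pd.to_timedelta(segments["segment_offset"], unit="ms")
```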
- get_subsets(annotation_set: str, recursive: bool = False) List[str] [source]
Retrieve the list of subsets belonging to a given set of annotations.
- Parameters
annotation_set (str) – input set
recursive (bool, optional) – If True, get subsets recursively, defaults to False
- Returns
the list of subsets names
- Return type
list
- get_within_ranges(ranges: pandas.core.frame.DataFrame, sets: Optional[Union[Set, List]] = None, missing_data: str = 'ignore')[source]
Retrieve and clip annotations that cover specific portions of recordings (``ranges``).
The desired ranges are defined by an input dataframe with three columns: ``recording_filename``, ``range_onset``, and ``range_offset``. The function returns a dataframe of annotations under the same format as the index of annotations (Annotations index).
This output can then be provided to get_segments() in order to retrieve segments of annotations that match the desired ranges.
For instance, the code below prints all the segments of annotations corresponding to the first hour of each recording:
>>> from ChildProject.projects import ChildProject
>>> from ChildProject.annotations import AnnotationManager
>>> project = ChildProject('.')
>>> am = AnnotationManager(project)
>>> am.read()
>>> ranges = project.recordings
>>> ranges['range_onset'] = 0
>>> ranges['range_offset'] = 60*60*1000
>>> matches = am.get_within_ranges(ranges)
>>> am.get_segments(matches)
- Parameters
ranges (pd.DataFrame) – pandas dataframe with one row per range to be considered and three columns: ``recording_filename``, ``range_onset``, ``range_offset``.
sets (Union[Set, List]) – optional list of annotation sets to retrieve. If None, annotations from all sets will be retrieved.
missing_data (str, defaults to ignore) – how to handle missing annotations (“ignore”, “warn” or “raise”)
- Return type
pd.DataFrame
- get_within_time_range(annotations: pandas.core.frame.DataFrame, interval: ChildProject.utils.TimeInterval, errors='raise')[source]
Clip all input annotations within a given HH:MM:SS clock-time range. Those that do not intersect the input time range at all are filtered out.
- Parameters
annotations (pd.DataFrame) – DataFrame of input annotations to filter. The only required columns are ``recording_filename``, ``range_onset``, and ``range_offset``.
interval (TimeInterval) – interval of hours to consider, containing the start hour and end hour
errors (str) – how to deal with invalid start_time values for the recordings. Takes the same values as ``pandas.to_datetime``.
- Returns
a DataFrame of annotations. For each row, ``range_onset`` and ``range_offset`` are clipped within the desired clock-time range. The clock-times corresponding to the onset and offset of each annotation are stored in two newly created columns, ``range_onset_time`` and ``range_offset_time``. If an input annotation exceeds 24 hours, one row per matching interval is returned.
- Return type
pd.DataFrame
- import_annotations(input: pandas.core.frame.DataFrame, threads: int = - 1, import_function: Optional[Callable[[str], pandas.core.frame.DataFrame]] = None, new_tiers: Optional[list] = None, overwrite_existing: bool = False) pandas.core.frame.DataFrame [source]
Import and convert annotations.
- Parameters
input (pd.DataFrame) – dataframe of all annotations to import, as described in Annotation importation input format.
threads (int, optional) – if > 1, conversions will be run on ``threads`` threads, defaults to -1
import_function (Callable[[str], pd.DataFrame], optional) – if specified, the custom ``import_function`` function will be used to convert all ``input`` annotations, defaults to None
new_tiers (list[str], optional) – list of EAF tier names. If specified, the corresponding EAF tiers will be imported.
overwrite_existing (bool, optional) – choose if lines with the same set and annotation_filename should be overwritten
- Returns
dataframe of imported annotations, as in Annotations index.
- Return type
pd.DataFrame
- static intersection(annotations: pandas.core.frame.DataFrame, sets: Optional[list] = None) pandas.core.frame.DataFrame [source]
Compute the intersection of all annotations for all sets and recordings, based on their ``recording_filename``, ``range_onset`` and ``range_offset`` attributes. (Only these columns are required, but more can be passed and they will be preserved.)
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of annotations, according to Annotations index
- Return type
pd.DataFrame
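The idea can be sketched with plain pandas under a strong simplifying assumption (one annotation per set for a single recording, which the real method does not require): the portion covered by every set is bounded by the largest onset and the smallest offset.

```python
import pandas as pd

# Hypothetical index: one annotation per set for the same recording.
# (Simplified illustration; the real method handles many annotations
# per set and many recordings.)
annotations = pd.DataFrame({
    "recording_filename": ["rec.wav", "rec.wav"],
    "set": ["vtc", "its"],
    "range_onset": [0, 2000],
    "range_offset": [10000, 8000],
})

# the time covered by *all* sets starts at the latest onset
# and ends at the earliest offset
onset = annotations["range_onset"].max()
offset = annotations["range_offset"].min()
intersection = annotations.assign(range_onset=onset, range_offset=offset)
```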
- merge_annotations(left_columns, right_columns, columns, output_set, input, skip_existing: bool = False)[source]
From two DataFrames listing the annotation indexes to merge (these indexes should come from the intersection of the left_set and right_set indexes), the list of columns to merge, and the name of the output set, create the resulting csv files containing the converted merged segments and return the new indexes to add to annotations.csv.
- Parameters
left_columns (list[str]) – list of the columns to include from the left set
right_columns (list[str]) – list of the columns to include from the right set
columns (dict) – additional columns to add to the segments, key is the column name
output_set (str) – name of the set to save the new merged files into
input (dict) – annotation indexes to use for the merge; contains keys ‘left_annotations’ and ‘right_annotations’ to separate indexes from the left and right sets
- Returns
annotation indexes created by the merge, should be added to annotations.csv
- Return type
pandas.DataFrame
- merge_sets(left_set: str, right_set: str, left_columns: List[str], right_columns: List[str], output_set: str, full_set_merge: bool = True, skip_existing: bool = False, columns: dict = {}, recording_filter: Optional[str] = None, threads=- 1)[source]
Merge columns from ``left_set`` and ``right_set`` annotations, for all matching segments, into a new set of annotations named ``output_set`` that will be saved in the dataset. ``output_set`` must not already exist if full_set_merge is True.
- Parameters
left_set (str) – Left set of annotations.
right_set (str) – Right set of annotations.
left_columns (List) – columns whose values will be based on the left set.
right_columns (List) – columns whose values will be based on the right set.
output_set (str) – Name of the output annotations set.
full_set_merge (bool) – the merge is meant to create the entire merged set, so the set must not already exist; defaults to True
skip_existing (bool) – the merge will skip lines already existing in the merged set, so neither the annotation index nor the resulting converted csv will change for those lines
columns (dict) – Additional columns to add to the resulting converted annotations.
recording_filter (set[str]) – set of recording_filenames to merge.
threads (int) – number of threads
- Returns
[description]
- Return type
[type]
- read() Tuple[List[str], List[str]] [source]
Read the index of annotations from ``metadata/annotations.csv`` and store it into self.annotations.
- Returns
a tuple containing the list of errors and the list of warnings generated while reading the index
- Return type
Tuple[List[str],List[str]]
- remove_set(annotation_set: str, recursive: bool = False)[source]
Remove a set of annotations, deleting every converted file and removing them from the index. This preserves raw annotations.
- Parameters
annotation_set (str) – set of annotations to remove
recursive (bool, optional) – remove subsets as well, defaults to False
- rename_set(annotation_set: str, new_set: str, recursive: bool = False, ignore_errors: bool = False)[source]
Rename a set of annotations, moving all related files and updating the index accordingly.
- Parameters
annotation_set (str) – name of the set to rename
new_set (str) – new set name
recursive (bool, optional) – rename subsets as well, defaults to False
ignore_errors (bool, optional) – If True, keep going even if unindexed files are detected, defaults to False
- validate(annotations: Optional[pandas.core.frame.DataFrame] = None, threads: int = 0) Tuple[List[str], List[str]] [source]
check all indexed annotations for errors
- Parameters
annotations (pd.DataFrame, optional) – annotations to validate, defaults to None. If None, the whole index will be scanned.
threads (int, optional) – how many threads to run the tests with, defaults to 0. If <= 0, all available CPU cores will be used.
- Returns
a tuple containing the list of errors and the list of warnings detected
- Return type
Tuple[List[str], List[str]]
ChildProject.cmdline module
ChildProject.converters module
- class ChildProject.converters.AliceConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'alice'
- class ChildProject.converters.AnnotationConverter[source]
Bases:
object
- SPEAKER_ID_TO_TYPE = {'C1': 'OCH', 'C2': 'OCH', 'CHI': 'CHI', 'CHI*': 'CHI', 'EE1': 'NA', 'EE2': 'NA', 'FA0': 'FEM', 'FA1': 'FEM', 'FA2': 'FEM', 'FA3': 'FEM', 'FA4': 'FEM', 'FA5': 'FEM', 'FA6': 'FEM', 'FA7': 'FEM', 'FA8': 'FEM', 'FAE': 'NA', 'FC1': 'OCH', 'FC2': 'OCH', 'FC3': 'OCH', 'FCE': 'NA', 'MA0': 'MAL', 'MA1': 'MAL', 'MA2': 'MAL', 'MA3': 'MAL', 'MA4': 'MAL', 'MA5': 'MAL', 'MAE': 'NA', 'MC1': 'OCH', 'MC2': 'OCH', 'MC3': 'OCH', 'MC4': 'OCH', 'MC5': 'OCH', 'MCE': 'NA', 'MI1': 'OCH', 'MOT*': 'FEM', 'OC0': 'OCH', 'UA1': 'NA', 'UA2': 'NA', 'UA3': 'NA', 'UA4': 'NA', 'UA5': 'NA', 'UA6': 'NA', 'UC1': 'OCH', 'UC2': 'OCH', 'UC3': 'OCH', 'UC4': 'OCH', 'UC5': 'OCH', 'UC6': 'OCH'}
- THREAD_SAFE = True
- class ChildProject.converters.ChatConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- ADDRESSEE_TABLE = {'CHI': 'T', 'FEM': 'A', 'MAL': 'A', 'OCH': 'C'}
- FORMAT = 'cha'
- SPEAKER_ROLE_TO_TYPE = {'Adult': 'NA', 'Attorney': 'NA', 'Audience': 'NA', 'Boy': 'OCH', 'Brother': 'OCH', 'Caretaker': 'NA', 'Child': 'OCH', 'Doctor': 'NA', 'Environment': 'NA', 'Father': 'MAL', 'Female': 'FEM', 'Friend': 'OCH', 'Girl': 'OCH', 'Grandfather': 'MAL', 'Grandmother': 'FEM', 'Group': 'NA', 'Guest': 'NA', 'Host': 'NA', 'Investigator': 'NA', 'Justice': 'NA', 'LENA': 'NA', 'Leader': 'NA', 'Male': 'MAL', 'Media': 'NA', 'Member': 'NA', 'Mother': 'FEM', 'Narrator': 'NA', 'Nurse': 'NA', 'Other': 'NA', 'Participant': 'CHI', 'Partner': 'NA', 'PlayRole': 'NA', 'Playmate': 'OCH', 'Relative': 'NA', 'Sibling': 'OCH', 'Sister': 'OCH', 'Speaker': 'NA', 'Student': 'NA', 'Target_Adult': 'NA', 'Target_Child': 'CHI', 'Teacher': 'NA', 'Teenager': 'NA', 'Text': 'NA', 'Uncertain': 'NA', 'Unidentified': 'NA', 'Visitor': 'NA'}
- THREAD_SAFE = False
- class ChildProject.converters.CsvConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'csv'
- class ChildProject.converters.EafConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'eaf'
- class ChildProject.converters.Formats(value)[source]
Bases:
enum.Enum
An enumeration.
- ALICE = 'alice'
- CHA = 'cha'
- CSV = 'csv'
- EAF = 'eaf'
- ITS = 'its'
- TEXTGRID = 'TextGrid'
- VCM = 'vcm_rttm'
- VTC = 'vtc_rttm'
- class ChildProject.converters.ItsConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'its'
- SPEAKER_TYPE_TRANSLATION = {'CHN': 'CHI', 'CXN': 'OCH', 'FAN': 'FEM', 'MAN': 'MAL'}
- class ChildProject.converters.TextGridConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'TextGrid'
- class ChildProject.converters.VcmConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'vcm_rttm'
- SPEAKER_TYPE_TRANSLATION = {'CHI': 'OCH', 'CNS': 'CHI', 'CRY': 'CHI', 'FEM': 'FEM', 'MAL': 'MAL', 'NCS': 'CHI'}
- VCM_TRANSLATION = {'CNS': 'C', 'CRY': 'Y', 'NCS': 'N', 'OTH': 'J'}
- class ChildProject.converters.VtcConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'vtc_rttm'
- SPEAKER_TYPE_TRANSLATION = {'CHI': 'OCH', 'FEM': 'FEM', 'KCHI': 'CHI', 'MAL': 'MAL'}
ChildProject.metrics module
- ChildProject.metrics.conf_matrix(rows_grid, columns_grid)[source]
compute the confusion matrix (as counts) from grids of active classes.
See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters
rows_grid (numpy.array) – the grid corresponding to the rows of the confusion matrix.
columns_grid (numpy.array) – the grid corresponding to the columns of the confusion matrix.
- Returns
a square numpy array of counts
- Return type
numpy.array
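One plausible way to derive such counts from two grids (an illustration, not necessarily the package's exact implementation) is a matrix product of the transposed row grid with the column grid, which counts the time units where each pair of classes is jointly active:

```python
import numpy as np

def conf_matrix(rows_grid: np.ndarray, columns_grid: np.ndarray) -> np.ndarray:
    """Count, for each pair (i, j), the time units where class i is active
    in rows_grid and class j is active in columns_grid (sketch)."""
    return rows_grid.T.dot(columns_grid)

# two 4-unit grids over 2 classes (one row per time unit, one column per class)
rows = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
cols = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])
cm = conf_matrix(rows, cols)  # cm[i, j] = co-activation counts
```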
- ChildProject.metrics.gamma(segments: pandas.core.frame.DataFrame, column: str, alpha: float = 1, beta: float = 1, precision_level: float = 0.05) float [source]
Compute Mathet et al. gamma agreement on segments.
The gamma measure evaluates the reliability of both the segmentation and the categorization simultaneously; an extensive description of the method and its parameters can be found in Mathet et al., 2015 (doi:10.1162/COLI_a_00227).
This function uses the pygamma-agreement package by Titeux et al.
- Parameters
segments (pd.DataFrame) – input segments dataframe (see Annotations format for the dataframe format)
column (str) – name of the categorical column of the segments to consider, e.g. ‘speaker_type’
alpha (float, optional) – gamma agreement time alignment weight, defaults to 1
beta (float, optional) – gamma agreement categorical weight, defaults to 1
precision_level (float, optional) – level of precision (see pygamma-agreement’s documentation), defaults to 0.05
- Returns
gamma agreement
- Return type
float
- ChildProject.metrics.grid_to_vector(grid, categories)[source]
Transform a grid of active classes into a vector of labels. In case several classes are active at time i, the label is set to ‘overlap’.
See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters
grid (numpy.array) – a NumPy array of shape ``(n, len(categories))``
categories (list) – the list of categories
- Returns
the vector of labels of length ``n`` (e.g. ``np.array([none FEM FEM FEM overlap overlap CHI])``)
- Return type
numpy.array
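A minimal NumPy sketch of this mapping (illustrative only; the real function may differ in details such as how inactive rows are labeled):

```python
import numpy as np

def grid_to_vector(grid: np.ndarray, categories: list) -> np.ndarray:
    """Map each grid row to one label: 'overlap' when several classes are
    active, 'none' when no class is active (sketch)."""
    labels = np.array(categories, dtype=object)
    active = grid.sum(axis=1)
    vector = labels[grid.argmax(axis=1)]   # label of the (first) active class
    vector[active == 0] = "none"           # no class active at time i
    vector[active > 1] = "overlap"         # several classes active at time i
    return vector

grid = np.array([[0, 0], [1, 0], [1, 1]])
v = grid_to_vector(grid, ["FEM", "CHI"])
```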
- ChildProject.metrics.pyannote_metric(segments: pandas.core.frame.DataFrame, reference: str, hypothesis: str, metric, column: str)[source]
- ChildProject.metrics.segments_to_annotation(segments: pandas.core.frame.DataFrame, column: str)[source]
Transform a dataframe of annotation segments into a pyannote.core.Annotation object
- Parameters
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: ``segment_onset``, ``segment_offset`` and ``column``.
column (str) – the name of the column in ``segments`` that should be used for the values of the annotations (e.g. speaker_type).
- Returns
the pyannote.core.Annotation object.
- Return type
pyannote.core.Annotation
- ChildProject.metrics.segments_to_grid(segments: pandas.core.frame.DataFrame, range_onset: int, range_offset: int, timescale: int, column: str, categories: list, none=True, overlap=False) float [source]
Transform a dataframe of annotation segments into a 2d matrix representing the indicator function of each of the ``categories`` across time.
Each row of the matrix corresponds to a unit of time of length ``timescale`` (in milliseconds), ranging from ``range_onset`` to ``range_offset``; each column corresponds to one of the ``categories`` provided, plus two special columns (overlap and none).
The value of the cell ``ij`` of the output matrix is set to 1 if the class ``j`` is active at time ``i``, 0 otherwise.
If overlap is True, an additional column is appended to the grid, which is set to 1 if more than one class is active at time ``i``.
If none is set to True, an additional column is appended to the grid, which is set to 1 if none of the classes are active at time ``i``.
The shape of the output matrix is therefore ``((range_offset-range_onset)/timescale, len(categories) + n)``, where n = 2 if both overlap and none are True, 1 if one of them is True, and 0 otherwise.
The fraction of time a class ``j`` is active can therefore be calculated as ``np.mean(grid, axis = 0)[j]``.
- Parameters
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: ``segment_onset``, ``segment_offset`` and ``column``.
range_onset (int) – timestamp of the beginning of the range to consider (in milliseconds)
range_offset (int) – timestamp of the end of the range to consider (in milliseconds)
timescale (int) – length of each time unit (in milliseconds)
column (str) – the name of the column in ``segments`` that should be used for the values of the annotations (e.g. speaker_type).
categories (list) – the list of categories
none (bool) – append a ‘none’ column, default True
overlap (bool) – append an overlap column, default False
- Returns
the output grid
- Return type
numpy.array
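The construction can be sketched as follows (a simplified version without the 'none' and 'overlap' columns, not the package's own code):

```python
import numpy as np
import pandas as pd

def segments_to_grid(segments, range_onset, range_offset,
                     timescale, column, categories):
    """Indicator matrix: one row per time unit, one column per category
    (simplified sketch without the 'none'/'overlap' columns)."""
    units = (range_offset - range_onset) // timescale
    grid = np.zeros((units, len(categories)), dtype=int)
    for _, seg in segments.iterrows():
        j = categories.index(seg[column])
        # convert milliseconds to time-unit indices, clamped to the range
        a = max(0, (seg["segment_onset"] - range_onset) // timescale)
        b = min(units, -(-(seg["segment_offset"] - range_onset) // timescale))
        grid[a:b, j] = 1
    return grid

segments = pd.DataFrame({
    "segment_onset": [0, 1000],
    "segment_offset": [1000, 3000],
    "speaker_type": ["FEM", "CHI"],
})
grid = segments_to_grid(segments, 0, 3000, 1000, "speaker_type", ["FEM", "CHI"])
```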
- ChildProject.metrics.vectors_to_annotation_task(*args, drop: List[str] = [])[source]
Transform vectors of labels into an nltk AnnotationTask object.
- Parameters
args (1d np.array() of labels) – vector of labels for each annotator; add one argument per annotator.
drop (List[str]) – list of labels that should be ignored
- Returns
the AnnotationTask object
- Return type
nltk.metrics.agreement.AnnotationTask
ChildProject.projects module
- class ChildProject.projects.ChildProject(path: str, enforce_dtypes: bool = False, ignore_discarded: bool = True)[source]
Bases:
object
This class is a representation of a ChildProject dataset.
Constructor parameters:
- Parameters
path (str) – path to the root of the dataset.
enforce_dtypes (bool, optional) – enforce dtypes on children/recordings dataframes, defaults to False
ignore_discarded (bool, optional) – ignore entries such that discard=1, defaults to True
Attributes:
path (str) – path to the root of the dataset.
recordings (pd.DataFrame) – pandas dataframe representation of this dataset’s metadata/recordings.csv
children (pd.DataFrame) – pandas dataframe representation of this dataset’s metadata/children.csv
- CHILDREN_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = child_dob), IndexColumn(name = location_id), IndexColumn(name = child_sex), IndexColumn(name = language), IndexColumn(name = languages), IndexColumn(name = mat_ed), IndexColumn(name = fat_ed), IndexColumn(name = car_ed), IndexColumn(name = monoling), IndexColumn(name = monoling_criterion), IndexColumn(name = normative), IndexColumn(name = normative_criterion), IndexColumn(name = mother_id), IndexColumn(name = father_id), IndexColumn(name = order_of_birth), IndexColumn(name = n_of_siblings), IndexColumn(name = household_size), IndexColumn(name = dob_criterion), IndexColumn(name = dob_accuracy), IndexColumn(name = discard)]
- DOCUMENTATION_COLUMNS = [IndexColumn(name = variable), IndexColumn(name = description), IndexColumn(name = values), IndexColumn(name = scope), IndexColumn(name = annotation_set)]
- RECORDINGS_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = date_iso), IndexColumn(name = start_time), IndexColumn(name = recording_device_type), IndexColumn(name = recording_filename), IndexColumn(name = duration), IndexColumn(name = session_id), IndexColumn(name = session_offset), IndexColumn(name = recording_device_id), IndexColumn(name = experimenter), IndexColumn(name = location_id), IndexColumn(name = its_filename), IndexColumn(name = upl_filename), IndexColumn(name = trs_filename), IndexColumn(name = lena_id), IndexColumn(name = lena_recording_num), IndexColumn(name = might_feature_gaps), IndexColumn(name = start_time_accuracy), IndexColumn(name = noisy_setting), IndexColumn(name = notes), IndexColumn(name = discard)]
- REC_COL_REF = {'child_id': IndexColumn(name = child_id), 'date_iso': IndexColumn(name = date_iso), 'discard': IndexColumn(name = discard), 'duration': IndexColumn(name = duration), 'experiment': IndexColumn(name = experiment), 'experimenter': IndexColumn(name = experimenter), 'its_filename': IndexColumn(name = its_filename), 'lena_id': IndexColumn(name = lena_id), 'lena_recording_num': IndexColumn(name = lena_recording_num), 'location_id': IndexColumn(name = location_id), 'might_feature_gaps': IndexColumn(name = might_feature_gaps), 'noisy_setting': IndexColumn(name = noisy_setting), 'notes': IndexColumn(name = notes), 'recording_device_id': IndexColumn(name = recording_device_id), 'recording_device_type': IndexColumn(name = recording_device_type), 'recording_filename': IndexColumn(name = recording_filename), 'session_id': IndexColumn(name = session_id), 'session_offset': IndexColumn(name = session_offset), 'start_time': IndexColumn(name = start_time), 'start_time_accuracy': IndexColumn(name = start_time_accuracy), 'trs_filename': IndexColumn(name = trs_filename), 'upl_filename': IndexColumn(name = upl_filename)}
- REQUIRED_DIRECTORIES = ['recordings', 'extra']
- accumulate_metadata(table: str, df: pandas.core.frame.DataFrame, columns: list, merge_column: str, verbose=False) pandas.core.frame.DataFrame [source]
- compute_ages(recordings: Optional[pandas.core.frame.DataFrame] = None, children: Optional[pandas.core.frame.DataFrame] = None) pandas.core.series.Series [source]
Compute the age of the subject child for each recording (in months, as a float) and return it as a pandas Series object.
Example:
>>> from ChildProject.projects import ChildProject
>>> project = ChildProject("examples/valid_raw_data")
>>> project.read()
>>> project.recordings["age"] = project.compute_ages()
>>> project.recordings[["child_id", "date_iso", "age"]]
      child_id    date_iso       age
line
2            1  2020-04-20  3.613963
3            1  2020-04-21  3.646817
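The computation can be sketched with plain pandas. The dates below are hypothetical, and dividing by the mean month length of 365.25/12 days is an assumption for illustration, not necessarily the package's exact formula:

```python
import pandas as pd

# Hypothetical metadata: one recording date and the child's date of birth.
recordings = pd.DataFrame({"date_iso": ["2020-04-20"],
                           "child_dob": ["2020-01-01"]})

# age in days between recording and birth
days = (pd.to_datetime(recordings["date_iso"])
        - pd.to_datetime(recordings["child_dob"])).dt.days

# age in months as a float, assuming a mean month of 365.25/12 days
age_months = days / (365.25 / 12)
```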
- compute_recordings_duration(profile: Optional[str] = None) pandas.core.frame.DataFrame [source]
compute recordings duration
- Parameters
profile (str, optional) – name of the profile of recordings to compute the duration from. If None, raw recordings are used. defaults to None
- Returns
dataframe of the recordings, with an additional/updated duration columns.
- Return type
pd.DataFrame
- get_converted_recording_filename(profile: str, recording_filename: str) str [source]
Retrieve the converted filename of a recording under a given ``profile``, from its original filename.
- Parameters
profile (str) – recording profile
recording_filename (str) – original recording filename, as indexed in the metadata
- Returns
corresponding converted filename of the recording under this profile
- Return type
str
- get_recording_path(recording_filename: str, profile: Optional[str] = None) str [source]
return the path to a recording
- Parameters
recording_filename (str) – recording filename, as in the metadata
profile (str, optional) – name of the conversion profile, defaults to None
- Returns
path to the recording
- Return type
str
- get_recordings_from_list(recordings: list, profile: Optional[str] = None) pandas.core.frame.DataFrame [source]
Recover recordings metadata from a list of recording names or paths to recordings.
- Parameters
recordings (list) – list of recording names or paths
- Returns
matching recordings
- Return type
pd.DataFrame
- read(verbose=False, accumulate=True)[source]
Read the metadata from the project and store it in the recordings and children attributes.
- Parameters
verbose (bool) – read with additional output
accumulate (bool) – add metadata from subfolders (usually confidential metadata)
- validate(ignore_recordings: bool = False, profile: Optional[str] = None, accumulate: bool = True) tuple [source]
Validate a dataset, returning all errors and warnings.
- Parameters
ignore_recordings (bool, optional) – if True, no errors will be returned for missing recordings.
profile (str, optional) – profile of recordings to use
accumulate – use accumulated metadata (usually confidential metadata if present)
- Returns
A tuple containing the list of errors, and the list of warnings.
- Return type
a tuple of two lists
- write_recordings(keep_discarded: bool = True, keep_original_columns: bool = True)[source]
Write self.recordings to the recordings csv file of the dataset. Warning: if read() was called with accumulate, you may write confidential information into recordings.csv!
- Parameters
keep_discarded (bool, optional) – if True, the lines in the csv that are discarded by the dataset are kept when writing. defaults to True (when False, discarded lines disappear from the dataset)
keep_original_columns (bool, optional) – if True, deleting columns in the recordings dataframe will not result in them disappearing from the csv file (if false, only the current columns are kept)
- Returns
dataframe that was written to the csv file
- Return type
pandas.DataFrame
ChildProject.tables module
- exception ChildProject.tables.IncorrectDtypeException[source]
Bases:
Exception
Exception raised when an unexpected dtype is found in a pandas DataFrame
- class ChildProject.tables.IndexColumn(name='', description='', required=False, regex=None, filename=False, datetime=None, function=None, choices=None, dtype=None, unique=False, generated=False)[source]
Bases:
object
- class ChildProject.tables.IndexTable(name, path=None, columns=[], enforce_dtypes: bool = False)[source]
Bases:
object
- exception ChildProject.tables.MissingColumnsException(name: str, missing: Set)[source]
Bases:
Exception
- ChildProject.tables.assert_columns_presence(name: str, df: pandas.core.frame.DataFrame, columns: Union[Set, List])[source]
ChildProject.utils module
- class ChildProject.utils.TimeInterval(start: datetime.datetime, stop: datetime.datetime)[source]
Bases:
object
- ChildProject.utils.calculate_shift(file1, file2, start1, start2, interval)[source]
Take two audio files, a starting point for each, and a length to compare (in seconds); return a divergence score representing the average difference in audio signal.
- Parameters
file1 (str) – path to the first wav file to compare
file2 (str) – path to the second wav file to compare
start1 (int) – starting point for the comparison in seconds for the first audio
start2 (int) – starting point for the comparison in seconds for the second audio
interval (int) – length of the comparison between the two audio files, in seconds
- Returns
tuple of divergence score and number of values used
- Return type
(float, int)
- ChildProject.utils.find_lines_involved_in_overlap(df: pandas.core.frame.DataFrame, onset_label: str = 'range_onset', offset_label: str = 'range_offset', labels=[])[source]
Takes a dataframe as input. The dataframe is expected to have one column for the onset of a time segment and one for the offset. The function returns a boolean Series where indexes set to ‘True’ are lines involved in overlaps and ‘False’ ones that are not. E.g. to select all lines involved in overlaps, use:
``ovl_segments = df[find_lines_involved_in_overlap(df, 'segment_onset', 'segment_offset')]``
and to select lines that never overlap, use:
``ovl_segments = df[~find_lines_involved_in_overlap(df, 'segment_onset', 'segment_offset')]``
- Parameters
df (pd.DataFrame) – pandas DataFrame where we want to find overlaps, having some time segments described by 2 columns (onset and offset)
onset_label (str) – column label for the onset of time segments
offset_label (str) – column label for the offset of time segments
labels (list[str]) – list of column labels that are required to match to be involved in overlap.
- Returns
pandas Series of boolean values where ‘True’ are indexes where overlaps exist
- Return type
pd.Series
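The overlap test can be sketched with a quadratic pairwise comparison (an illustration that is fine for small dataframes; the package may use a more efficient approach):

```python
import pandas as pd

def overlapping(df: pd.DataFrame,
                onset: str = "range_onset",
                offset: str = "range_offset") -> pd.Series:
    """Boolean Series: True for rows whose interval intersects
    at least one other row's interval (O(n^2) sketch)."""
    mask = pd.Series(False, index=df.index)
    for i in df.index:
        others = df.drop(i)
        # two intervals intersect iff each starts before the other ends
        mask.at[i] = ((others[onset] < df.at[i, offset]) &
                      (others[offset] > df.at[i, onset])).any()
    return mask

df = pd.DataFrame({"range_onset": [0, 500, 2000],
                   "range_offset": [1000, 1500, 3000]})
m = overlapping(df)  # first two rows overlap each other, third does not
```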
- ChildProject.utils.series_to_datetime(time_series, time_index_list, time_column_name: str, date_series=None, date_index_list=None, date_column_name=None)[source]
Return a series of datetimes from a series of strings, using pd.to_datetime with all the formats listed for a specific column name in an index consisting of IndexColumn items. To have the date included (and not only the time), one can pass a second series for the date, with its corresponding index and column name.
- Parameters
time_series (pandas.Series) – pandas series of strings to transform into datetime (can contain NA value => NaT datetime), if date_series is given, time_series should only have the time
time_index_list (List[IndexColumn]) – list of index to use where the column wanted is present
time_column_name (str) – name of the IndexColumn to use (IndexColumn.name value) for accepted formats
date_series (pandas.Series) – pandas series of strings to transform into the date component of datetime (can contain NA value)
date_index_list (List[IndexColumn]) – list of index to use where the column wanted is present
date_column_name (str) – name of the IndexColumn to use (IndexColumn.name value) for accepted formats for dates
- Returns
series with dtype datetime containing the converted datetimes
- Return type
pandas.Series
- ChildProject.utils.time_intervals_intersect(ti1: ChildProject.utils.TimeInterval, ti2: ChildProject.utils.TimeInterval)[source]
Given two time intervals (these do not take days into consideration, only the time of day), return an array of new interval(s) representing the intersections of the original ones.
Examples:
1. time_intervals_intersect(TimeInterval(datetime(1900,1,1,8,57), datetime(1900,1,1,21,4)), TimeInterval(datetime(1900,1,1,10,36), datetime(1900,1,1,22,1))) => [TimeInterval(10:36 , 21:04)]
2. time_intervals_intersect(TimeInterval(datetime(1900,1,1,8,57), datetime(1900,1,1,22,1)), TimeInterval(datetime(1900,1,1,21,4), datetime(1900,1,1,10,36))) => [TimeInterval(08:57 , 10:36), TimeInterval(21:04 , 22:01)]
- Parameters
ti1 (TimeInterval) – first interval
ti2 (TimeInterval) – second interval
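The non-wrapping case (example 1 above) can be sketched with plain datetime.time comparisons; handling intervals that wrap around midnight (example 2) would require splitting them first, which this simplified sketch deliberately omits:

```python
from datetime import time

# Sketch of the non-wrapping case only; wrap-around-midnight intervals
# (example 2 above) would need to be split first.
def intersect(a: tuple, b: tuple) -> list:
    start = max(a[0], b[0])   # intersection starts at the later onset
    stop = min(a[1], b[1])    # and ends at the earlier offset
    return [(start, stop)] if start < stop else []

r = intersect((time(8, 57), time(21, 4)), (time(10, 36), time(22, 1)))
# r == [(time(10, 36), time(21, 4))], matching example 1
```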