ChildProject package
Subpackages
- ChildProject.pipelines package
- Submodules
- ChildProject.pipelines.anonymize module
- ChildProject.pipelines.eafbuilder module
- ChildProject.pipelines.metrics module
- ChildProject.pipelines.pipeline module
- ChildProject.pipelines.processors module
- ChildProject.pipelines.samplers module
- ChildProject.pipelines.zooniverse module
- Module contents
- ChildProject.templates package
Submodules
ChildProject.annotations module
- class ChildProject.annotations.AnnotationManager(project: ChildProject.projects.ChildProject)[source]
Bases: object
- INDEX_COLUMNS = [IndexColumn(name = set), IndexColumn(name = recording_filename), IndexColumn(name = time_seek), IndexColumn(name = range_onset), IndexColumn(name = range_offset), IndexColumn(name = raw_filename), IndexColumn(name = format), IndexColumn(name = filter), IndexColumn(name = annotation_filename), IndexColumn(name = imported_at), IndexColumn(name = package_version), IndexColumn(name = error)]
- SEGMENTS_COLUMNS = [IndexColumn(name = raw_filename), IndexColumn(name = segment_onset), IndexColumn(name = segment_offset), IndexColumn(name = speaker_id), IndexColumn(name = speaker_type), IndexColumn(name = ling_type), IndexColumn(name = vcm_type), IndexColumn(name = lex_type), IndexColumn(name = mwu_type), IndexColumn(name = addressee), IndexColumn(name = transcription), IndexColumn(name = phonemes), IndexColumn(name = syllables), IndexColumn(name = words), IndexColumn(name = lena_block_type), IndexColumn(name = lena_block_number), IndexColumn(name = lena_conv_status), IndexColumn(name = lena_response_count), IndexColumn(name = lena_conv_floor_type), IndexColumn(name = lena_conv_turn_type), IndexColumn(name = utterances_count), IndexColumn(name = utterances_length), IndexColumn(name = non_speech_length), IndexColumn(name = average_db), IndexColumn(name = peak_db), IndexColumn(name = child_cry_vfx_len), IndexColumn(name = utterances), IndexColumn(name = cries), IndexColumn(name = vfxs)]
- static clip_segments(segments: pandas.core.frame.DataFrame, start: int, stop: int) → pandas.core.frame.DataFrame[source]
Clip the onsets and offsets of all segments to the range [start, stop]. Segments lying entirely outside this range are removed.
- Parameters
segments (pd.DataFrame) – Dataframe of the segments to clip
start (int) – range start (in milliseconds)
stop (int) – range end (in milliseconds)
- Returns
Dataframe of the clipped segments
- Return type
pd.DataFrame
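The clipping rule described above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's implementation: it works on a list of dicts rather than a DataFrame, the helper name clip_segments_sketch is hypothetical, and dropping segments that merely touch the range boundary is an assumption.

```python
from typing import Dict, List


def clip_segments_sketch(segments: List[Dict], start: int, stop: int) -> List[Dict]:
    """Clip segments to [start, stop] (milliseconds) and drop those outside it."""
    clipped = []
    for seg in segments:
        onset = max(seg["segment_onset"], start)
        offset = min(seg["segment_offset"], stop)
        # Assumption: a segment left with no duration after clipping is removed.
        if onset >= offset:
            continue
        clipped.append({**seg, "segment_onset": onset, "segment_offset": offset})
    return clipped


segments = [
    {"segment_onset": 0, "segment_offset": 500, "speaker_type": "CHI"},
    {"segment_onset": 900, "segment_offset": 1500, "speaker_type": "FEM"},
    {"segment_onset": 3000, "segment_offset": 3500, "speaker_type": "MAL"},
]
clipped = clip_segments_sketch(segments, start=1000, stop=2000)
```

Only the second segment overlaps the range, so it is the only one kept, with its onset moved up to 1000.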
- get_collapsed_segments(annotations: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]
Get all segments associated with the annotations referenced in annotations, and collapse them into one virtual timeline.
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of all the segments merged (as specified in Annotations format), merged with annotations
- Return type
pd.DataFrame
- get_segments(annotations: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]
Get all segments associated with the annotations referenced in annotations.
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of all the segments merged (as specified in Annotations format), merged with annotations.
- Return type
pd.DataFrame
- get_segments_timestamps(segments: pandas.core.frame.DataFrame, ignore_date: bool = False, onset: str = 'segment_onset', offset: str = 'segment_offset') → pandas.core.frame.DataFrame[source]
Calculate the onset and offset clock-time of each segment.
- Parameters
segments (pd.DataFrame) – DataFrame of segments (as returned by get_segments()).
ignore_date (bool, optional) – ignore date information and use time data only, defaults to False
onset (str, optional) – column storing the onset timestamp in milliseconds, defaults to “segment_onset”
offset (str, optional) – column storing the offset timestamp in milliseconds, defaults to “segment_offset”
- Returns
The input dataframe with two new columns, onset_time and offset_time. onset_time is a datetime object corresponding to the onset of the segment. offset_time is a datetime object corresponding to the offset of the segment. If either start_time or date_iso is not specified for the corresponding recording, both values will be set to NaT.
- Return type
pd.DataFrame
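As a rough illustration of the computation described above, the clock-time of a segment boundary is the recording's start plus the boundary's position in milliseconds. The sketch below uses the standard library only. The helper name segment_clock_time is hypothetical, the date and time formats (YYYY-MM-DD and HH:MM) are assumptions, and None stands in for pandas' NaT.

```python
from datetime import datetime, timedelta
from typing import Optional


def segment_clock_time(
    date_iso: Optional[str], start_time: Optional[str], position_ms: int
) -> Optional[datetime]:
    """Clock-time of a position given in milliseconds since the recording start."""
    # If either piece of recording metadata is missing, no clock-time can be
    # derived; None plays the role of NaT in this sketch.
    if not date_iso or not start_time:
        return None
    # Assumed formats: date_iso as YYYY-MM-DD, start_time as HH:MM.
    recording_start = datetime.strptime(f"{date_iso} {start_time}", "%Y-%m-%d %H:%M")
    return recording_start + timedelta(milliseconds=position_ms)


onset_time = segment_clock_time("2020-04-20", "09:30", 90_000)
missing = segment_clock_time("2020-04-20", None, 90_000)
```

Here a segment starting 90 seconds into a recording that began at 09:30 gets an onset_time of 09:31:30 on the same day.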
- get_subsets(annotation_set: str, recursive: bool = False) → List[str][source]
Retrieve the list of subsets belonging to a given set of annotations.
- Parameters
annotation_set (str) – input set
recursive (bool, optional) – If True, get subsets recursively, defaults to False
- Returns
the list of subset names
- Return type
list
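The effect of the recursive flag can be pictured with a small sketch. It assumes that a subset is a set whose name extends its parent's name with a "/" separator (for instance "eaf/annotator1"). That naming convention, the set names and the helper get_subsets_sketch are assumptions made for illustration, not confirmed behaviour of the library.

```python
from typing import List


def get_subsets_sketch(all_sets: List[str], annotation_set: str, recursive: bool = False) -> List[str]:
    """List the sets nested under annotation_set (assumed '/'-separated names)."""
    prefix = annotation_set + "/"
    subsets = []
    for name in all_sets:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        # Without recursion, keep direct children only (no further '/').
        if recursive or "/" not in remainder:
            subsets.append(name)
    return subsets


all_sets = ["eaf", "eaf/annotator1", "eaf/annotator1/pass2", "eaf/annotator2", "its"]
direct = get_subsets_sketch(all_sets, "eaf")
nested = get_subsets_sketch(all_sets, "eaf", recursive=True)
```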
- get_within_time_range(annotations: pandas.core.frame.DataFrame, start_time: str, end_time: str, errors='raise')[source]
Clip all input annotations within a given HH:MM clock-time range. Those that do not intersect the input time range at all are filtered out.
- Parameters
annotations (pd.DataFrame) – DataFrame of input annotations to filter. The only columns that are required are: recording_filename, range_onset, and range_offset.
start_time (str) – onset HH:MM clock-time
end_time (str) – offset HH:MM clock-time
errors (str) – how to deal with invalid start_time values for the recordings. Takes the same values as pandas.to_datetime.
- Returns
a DataFrame of annotations. For each row, range_onset and range_offset are clipped within the desired clock-time range. The clock-time corresponding to the onset and offset of each annotation is stored in two newly created columns named range_onset_time and range_offset_time. If the input annotation exceeds 24 hours, one row per matching interval is returned.
- Return type
pd.DataFrame
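The clipping to a clock-time window, including the case of an annotation longer than 24 hours, can be sketched with the standard library. This is an illustration under stated assumptions rather than the library's code: the helper clip_to_clock_range is hypothetical, it handles a single annotation, and it assumes the window does not cross midnight (start_time earlier than end_time).

```python
from datetime import datetime, timedelta
from typing import List, Tuple

ONE_MS = timedelta(milliseconds=1)


def clip_to_clock_range(
    recording_start: datetime,
    range_onset: int,
    range_offset: int,
    start_time: str,
    end_time: str,
) -> List[Tuple[int, int]]:
    """Clip [range_onset, range_offset] (ms since recording start) to a daily
    HH:MM window; returns one (onset, offset) pair per matching day."""
    onset_dt = recording_start + range_onset * ONE_MS
    offset_dt = recording_start + range_offset * ONE_MS
    window_start = datetime.strptime(start_time, "%H:%M").time()
    window_end = datetime.strptime(end_time, "%H:%M").time()

    matches = []
    day = onset_dt.date()
    while day <= offset_dt.date():
        lo = max(onset_dt, datetime.combine(day, window_start))
        hi = min(offset_dt, datetime.combine(day, window_end))
        if lo < hi:  # the annotation intersects the window on this day
            matches.append(((lo - recording_start) // ONE_MS, (hi - recording_start) // ONE_MS))
        day += timedelta(days=1)
    return matches


recording_start = datetime(2020, 4, 20, 8, 0)
hour = 3600 * 1000
inside = clip_to_clock_range(recording_start, 0, 4 * hour, "09:00", "10:30")
outside = clip_to_clock_range(recording_start, 0, hour // 2, "09:00", "10:30")
two_days = clip_to_clock_range(recording_start, 0, 30 * hour, "09:00", "10:30")
```

The 30-hour annotation yields two intervals, one per day on which it crosses the 09:00 to 10:30 window, which mirrors the "one row per matching interval" rule.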
- import_annotations(input: pandas.core.frame.DataFrame, threads: int = -1, import_function: Optional[Callable[[str], pandas.core.frame.DataFrame]] = None) → pandas.core.frame.DataFrame[source]
Import and convert annotations.
- Parameters
input (pd.DataFrame) – dataframe of all annotations to import, as described in format-input-annotations.
threads (int, optional) – If > 1, conversions will be run on that many threads, defaults to -1
import_function (Callable[[str], pd.DataFrame], optional) – If specified, the custom import_function will be used to convert all input annotations, defaults to None
- Returns
dataframe of imported annotations, as in Annotations index.
- Return type
pd.DataFrame
- static intersection(annotations: pandas.core.frame.DataFrame, sets: Optional[list] = None) → pandas.core.frame.DataFrame[source]
Compute the intersection of all annotations for all sets and recordings, based on their recording_filename, range_onset and range_offset attributes. (Only these columns are required, but more can be passed and they will be preserved.)
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of annotations, according to Annotations index
- Return type
pd.DataFrame
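To make the notion of intersection concrete, the sketch below intersects the annotated ranges of two sets for a single recording. It is a simplified illustration, not the library's algorithm: the helper intersect_ranges and the set names are hypothetical, and each input is assumed to be a sorted list of non-overlapping (range_onset, range_offset) pairs in milliseconds.

```python
from typing import List, Tuple

Range = Tuple[int, int]


def intersect_ranges(a: List[Range], b: List[Range]) -> List[Range]:
    """Intersect two sorted lists of non-overlapping (onset, offset) ranges."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        onset = max(a[i][0], b[j][0])
        offset = min(a[i][1], b[j][1])
        if onset < offset:
            result.append((onset, offset))
        # Advance whichever range ends first; the other may still overlap
        # with the next range of the opposite list.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


vtc = [(0, 60_000), (120_000, 180_000)]  # ranges covered by a hypothetical "vtc" set
eaf = [(30_000, 150_000)]                # ranges covered by a hypothetical "eaf" set
common = intersect_ranges(vtc, eaf)
```

Only the portions covered by both sets remain: 30 s to 60 s and 120 s to 150 s.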
- merge_sets(left_set: str, right_set: str, left_columns: List[str], right_columns: List[str], output_set: str, columns: dict = {}, threads=-1)[source]
Merge columns from left_set and right_set annotations, for all matching segments, into a new set of annotations named output_set.
- Parameters
left_set (str) – Left set of annotations.
right_set (str) – Right set of annotations.
left_columns (List) – Columns whose values will be taken from the left set.
right_columns (List) – Columns whose values will be taken from the right set.
output_set (str) – Name of the output annotations set.
- read() → Tuple[List[str], List[str]][source]
Read the index of annotations from metadata/annotations.csv and store it into self.annotations.
- Returns
a tuple containing the list of errors and the list of warnings generated while reading the index
- Return type
Tuple[List[str],List[str]]
- remove_set(annotation_set: str, recursive: bool = False)[source]
Remove a set of annotations, deleting every converted file and removing them from the index. This preserves raw annotations.
- Parameters
annotation_set (str) – set of annotations to remove
recursive (bool, optional) – remove subsets as well, defaults to False
- rename_set(annotation_set: str, new_set: str, recursive: bool = False, ignore_errors: bool = False)[source]
Rename a set of annotations, moving all related files and updating the index accordingly.
- Parameters
annotation_set (str) – name of the set to rename
new_set (str) – new set name
recursive (bool, optional) – rename subsets as well, defaults to False
ignore_errors (bool, optional) – If True, keep going even if unindexed files are detected, defaults to False
- validate(annotations: Optional[pandas.core.frame.DataFrame] = None, threads: int = 0) → Tuple[List[str], List[str]][source]
Check all indexed annotations for errors.
- Parameters
annotations (pd.DataFrame, optional) – annotations to validate, defaults to None. If None, the whole index will be scanned.
threads (int, optional) – how many threads to run the tests with, defaults to 0. If <= 0, all available CPU cores will be used.
- Returns
a tuple containing the list of errors and the list of warnings detected
- Return type
Tuple[List[str], List[str]]
ChildProject.cmdline module
- ChildProject.cmdline.perform_validation(project: ChildProject.projects.ChildProject, require_success: bool = True, **args)[source]
- ChildProject.cmdline.subcommand(args=[], parent=_SubParsersAction(...))[source]
The default value of parent is the argparse sub-parsers action holding the following subcommands:
validate – validate the consistency of the dataset returning detailed errors and warnings
import-annotations – convert and import a set of annotations
merge-annotations – merge segments sharing identical onset and offset from two sets of annotations
intersect-annotations – calculate the intersection of the annotations belonging to the given sets
remove-annotations – remove converted annotations of a given set and their entries in the index
rename-annotations – rename a set of annotations by moving the files and updating the index accordingly
overview – prints an overview of the contents of a given dataset
compute-durations – creates a 'duration' column into metadata/recordings
ChildProject.converters module
- class ChildProject.converters.AliceConverter[source]
Bases: ChildProject.converters.AnnotationConverter
- FORMAT = 'alice'
- class ChildProject.converters.AnnotationConverter[source]
Bases: object
- SPEAKER_ID_TO_TYPE = {'C1': 'OCH', 'C2': 'OCH', 'CHI': 'CHI', 'CHI*': 'CHI', 'EE1': 'NA', 'EE2': 'NA', 'FA0': 'FEM', 'FA1': 'FEM', 'FA2': 'FEM', 'FA3': 'FEM', 'FA4': 'FEM', 'FA5': 'FEM', 'FA6': 'FEM', 'FA7': 'FEM', 'FA8': 'FEM', 'FAE': 'NA', 'FC1': 'OCH', 'FC2': 'OCH', 'FC3': 'OCH', 'FCE': 'NA', 'MA0': 'MAL', 'MA1': 'MAL', 'MA2': 'MAL', 'MA3': 'MAL', 'MA4': 'MAL', 'MA5': 'MAL', 'MAE': 'NA', 'MC1': 'OCH', 'MC2': 'OCH', 'MC3': 'OCH', 'MC4': 'OCH', 'MC5': 'OCH', 'MCE': 'NA', 'MI1': 'OCH', 'MOT*': 'FEM', 'OC0': 'OCH', 'UA1': 'NA', 'UA2': 'NA', 'UA3': 'NA', 'UA4': 'NA', 'UA5': 'NA', 'UA6': 'NA', 'UC1': 'OCH', 'UC2': 'OCH', 'UC3': 'OCH', 'UC4': 'OCH', 'UC5': 'OCH', 'UC6': 'OCH'}
- THREAD_SAFE = True
- class ChildProject.converters.ChatConverter[source]
Bases: ChildProject.converters.AnnotationConverter
- ADDRESSEE_TABLE = {'CHI': 'T', 'FEM': 'A', 'MAL': 'A', 'OCH': 'C'}
- FORMAT = 'cha'
- SPEAKER_ROLE_TO_TYPE = {'Adult': 'NA', 'Attorney': 'NA', 'Audience': 'NA', 'Boy': 'OCH', 'Brother': 'OCH', 'Caretaker': 'NA', 'Child': 'OCH', 'Doctor': 'NA', 'Environment': 'NA', 'Father': 'MAL', 'Female': 'FEM', 'Friend': 'OCH', 'Girl': 'OCH', 'Grandfather': 'MAL', 'Grandmother': 'FEM', 'Group': 'NA', 'Guest': 'NA', 'Host': 'NA', 'Investigator': 'NA', 'Justice': 'NA', 'LENA': 'NA', 'Leader': 'NA', 'Male': 'MAL', 'Media': 'NA', 'Member': 'NA', 'Mother': 'FEM', 'Narrator': 'NA', 'Nurse': 'NA', 'Other': 'NA', 'Participant': 'CHI', 'Partner': 'NA', 'PlayRole': 'NA', 'Playmate': 'OCH', 'Relative': 'NA', 'Sibling': 'OCH', 'Sister': 'OCH', 'Speaker': 'NA', 'Student': 'NA', 'Target_Adult': 'NA', 'Target_Child': 'CHI', 'Teacher': 'NA', 'Teenager': 'NA', 'Text': 'NA', 'Uncertain': 'NA', 'Unidentified': 'NA', 'Visitor': 'NA'}
- THREAD_SAFE = False
- class ChildProject.converters.CsvConverter[source]
Bases: ChildProject.converters.AnnotationConverter
- FORMAT = 'csv'
- class ChildProject.converters.EafConverter[source]
Bases: ChildProject.converters.AnnotationConverter
- FORMAT = 'eaf'
- class ChildProject.converters.ItsConverter[source]
Bases: ChildProject.converters.AnnotationConverter
- FORMAT = 'its'
- SPEAKER_TYPE_TRANSLATION = {'CHN': 'CHI', 'CXN': 'OCH', 'FAN': 'FEM', 'MAN': 'MAL'}
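A translation table such as the one above is typically applied as a plain dictionary lookup. The sketch below copies the table shown for ItsConverter. The helper translate_speaker is hypothetical, and falling back to 'NA' for labels absent from the table is an assumption of this example, not documented behaviour.

```python
# Table copied from ItsConverter.SPEAKER_TYPE_TRANSLATION as listed above.
SPEAKER_TYPE_TRANSLATION = {"CHN": "CHI", "CXN": "OCH", "FAN": "FEM", "MAN": "MAL"}


def translate_speaker(source_label: str) -> str:
    """Map a source speaker label to a speaker_type value."""
    # Assumption: labels missing from the table are reported as 'NA'.
    return SPEAKER_TYPE_TRANSLATION.get(source_label, "NA")


labels = ["FAN", "CHN", "TVN", "MAN"]
speaker_types = [translate_speaker(label) for label in labels]
```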
- class ChildProject.converters.TextGridConverter[source]
Bases: ChildProject.converters.AnnotationConverter
- FORMAT = 'TextGrid'
- class ChildProject.converters.VcmConverter[source]
Bases: ChildProject.converters.AnnotationConverter
- FORMAT = 'vcm_rttm'
- SPEAKER_TYPE_TRANSLATION = {'CHI': 'OCH', 'CNS': 'CHI', 'CRY': 'CHI', 'FEM': 'FEM', 'MAL': 'MAL', 'NCS': 'CHI'}
- VCM_TRANSLATION = {'CNS': 'C', 'CRY': 'Y', 'NCS': 'N', 'OTH': 'J'}
ChildProject.metrics module
- ChildProject.metrics.conf_matrix(rows_grid, columns_grid)[source]
Compute the confusion matrix (as counts) from grids of active classes.
See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters
rows_grid (numpy.array) – the grid corresponding to the rows of the confusion matrix.
columns_grid (numpy.array) – the grid corresponding to the columns of the confusion matrix.
categories (list of strings) – the labels corresponding to each class
- Returns
a square numpy array of counts
- Return type
numpy.array
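One way to picture a confusion matrix built from two grids is to count, for each pair of classes (i, j), the time units in which class i is active in the first grid and class j in the second. The pure-Python sketch below does exactly that. It is an assumption about what "counts" means here, not the library's implementation, and it presumes exactly one active class per time unit in each grid (for example when the none and overlap columns are included). The helper name conf_matrix_sketch is hypothetical.

```python
from typing import List


def conf_matrix_sketch(rows_grid: List[List[int]], columns_grid: List[List[int]]) -> List[List[int]]:
    """Count co-activations of class i (rows grid) and class j (columns grid)."""
    n_classes = len(rows_grid[0])
    counts = [[0] * n_classes for _ in range(n_classes)]
    for row_unit, col_unit in zip(rows_grid, columns_grid):
        for i, row_active in enumerate(row_unit):
            for j, col_active in enumerate(col_unit):
                counts[i][j] += row_active * col_active
    return counts


# Columns of each grid: CHI, FEM, none. One row per time unit.
rows_grid = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
columns_grid = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
matrix = conf_matrix_sketch(rows_grid, columns_grid)
```

The off-diagonal count in the first row records the one time unit labelled CHI in the first grid and FEM in the second.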
- ChildProject.metrics.gamma(segments: pandas.core.frame.DataFrame, column: str, alpha: float = 1, beta: float = 1, precision_level: float = 0.05) → float[source]
Compute Mathet et al. gamma agreement on segments.
The gamma measure evaluates the reliability of both the segmentation and the categorization simultaneously; an extensive description of the method and its parameters can be found in Mathet et al., 2015 (doi:10.1162/COLI_a_00227)
This function uses the pygamma-agreement package by Titeux et al.
- Parameters
segments (pd.DataFrame) – input segments dataframe (see Annotations format for the dataframe format)
column (str) – name of the categorical column of the segments to consider, e.g. ‘speaker_type’
alpha (float, optional) – gamma agreement time alignment weight, defaults to 1
beta (float, optional) – gamma agreement categorical weight, defaults to 1
precision_level (float, optional) – level of precision (see pygamma-agreement’s documentation), defaults to 0.05
- Returns
gamma agreement
- Return type
float
- ChildProject.metrics.grid_to_vector(grid, categories)[source]
Transform a grid of active classes into a vector of labels. In case several classes are active at time i, the label is set to ‘overlap’.
See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters
grid (numpy.array) – a NumPy array of shape (n, len(categories))
categories (list) – the list of categories
- Returns
the vector of labels of length n (e.g. np.array(['none', 'FEM', 'FEM', 'FEM', 'overlap', 'overlap', 'CHI']))
- Return type
numpy.array
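The labelling rule described above can be sketched without NumPy. This is an illustration only: the helper grid_to_vector_sketch is hypothetical, the grid is a list of lists, and labelling a time unit with no active class as 'none' is an assumption drawn from the example output.

```python
from typing import List


def grid_to_vector_sketch(grid: List[List[int]], categories: List[str]) -> List[str]:
    """Turn each row of a grid of active classes into a single label."""
    labels = []
    for unit in grid:
        active = [category for category, flag in zip(categories, unit) if flag]
        if len(active) > 1:
            labels.append("overlap")  # several classes active at once
        elif len(active) == 1:
            labels.append(active[0])
        else:
            labels.append("none")  # assumption: nothing active maps to 'none'
    return labels


categories = ["CHI", "FEM"]
grid = [[0, 0], [0, 1], [1, 1], [1, 0]]
vector = grid_to_vector_sketch(grid, categories)
```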
- ChildProject.metrics.pyannote_metric(segments: pandas.core.frame.DataFrame, reference: str, hypothesis: str, metric, column: str)[source]
- ChildProject.metrics.segments_to_annotation(segments: pandas.core.frame.DataFrame, column: str)[source]
Transform a dataframe of annotation segments into a pyannote.core.Annotation object
- Parameters
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: segment_onset, segment_offset and column.
column (str) – the name of the column in segments that should be used for the values of the annotations (e.g. speaker_type).
- Returns
the pyannote.core.Annotation object.
- Return type
pyannote.core.Annotation
- ChildProject.metrics.segments_to_grid(segments: pandas.core.frame.DataFrame, range_onset: int, range_offset: int, timescale: int, column: str, categories: list, none=True, overlap=False) → float[source]
Transform a dataframe of annotation segments into a 2d matrix representing the indicator function of each of the categories across time.
Each row of the matrix corresponds to a unit of time of length timescale (in milliseconds), ranging from range_onset to range_offset; each column corresponds to one of the categories provided, plus up to two special columns (overlap and none).
The value of the cell ij of the output matrix is set to 1 if the class j is active at time i, 0 otherwise.
If overlap is True, an additional column is appended to the grid, which is set to 1 if several classes are active at time i.
If none is True, an additional column is appended to the grid, which is set to 1 if none of the classes are active at time i.
The shape of the output matrix is therefore ((range_offset-range_onset)/timescale, len(categories) + n), where n = 2 if both overlap and none are True, 1 if one of them is True, and 0 otherwise.
The fraction of time a class j is active can therefore be calculated as np.mean(grid, axis = 0)[j]
- Parameters
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: segment_onset, segment_offset and column.
range_onset (int) – timestamp of the beginning of the range to consider (in milliseconds)
range_offset (int) – timestamp of the end of the range to consider (in milliseconds)
timescale (int) – length of each time unit (in milliseconds)
column (str) – the name of the column in segments that should be used for the values of the annotations (e.g. speaker_type).
categories (list) – the list of categories
none (bool) – append a ‘none’ column, default True
overlap (bool) – append an overlap column, default False
- Returns
the output grid
- Return type
numpy.array
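The grid construction described above can be sketched in plain Python to show how the shape and the special columns come about. This is not the library's code: the helper segments_to_grid_sketch is hypothetical, segments are dicts rather than DataFrame rows, and both the order of the special columns (overlap before none) and the rule that a time unit counts as active when a segment covers any part of it are assumptions.

```python
from typing import Dict, List


def segments_to_grid_sketch(
    segments: List[Dict],
    range_onset: int,
    range_offset: int,
    timescale: int,
    column: str,
    categories: List[str],
    none: bool = True,
    overlap: bool = False,
) -> List[List[int]]:
    """Build a (time units x classes) grid of 0/1 activity indicators."""
    n_units = (range_offset - range_onset) // timescale
    n_columns = len(categories) + int(overlap) + int(none)
    grid = [[0] * n_columns for _ in range(n_units)]

    for seg in segments:
        if seg[column] not in categories:
            continue
        j = categories.index(seg[column])
        first = max((seg["segment_onset"] - range_onset) // timescale, 0)
        # Ceiling division, so a partially covered time unit counts as active.
        last = min(-(-(seg["segment_offset"] - range_onset) // timescale), n_units)
        for i in range(first, last):
            grid[i][j] = 1

    for row in grid:
        n_active = sum(row[: len(categories)])
        k = len(categories)
        if overlap:  # assumed position: right after the categories
            row[k] = int(n_active > 1)
            k += 1
        if none:
            row[k] = int(n_active == 0)
    return grid


segments = [
    {"segment_onset": 0, "segment_offset": 200, "speaker_type": "CHI"},
    {"segment_onset": 100, "segment_offset": 300, "speaker_type": "FEM"},
]
grid = segments_to_grid_sketch(
    segments, 0, 400, 100, "speaker_type", ["CHI", "FEM"], none=True, overlap=True
)
# Fraction of time each column is active (column-wise mean of the grid).
fractions = [sum(col) / len(grid) for col in zip(*grid)]
```

With four time units of 100 ms, two categories and both special columns, the grid has shape (4, 4), in line with the shape formula given above.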
- ChildProject.metrics.vectors_to_annotation_task(*args, drop: List[str] = [])[source]
Transform vectors of labels into an nltk AnnotationTask object.
- Parameters
*args – vector of labels for each annotator; add one argument per annotator.
drop (List[str]) – list of labels that should be ignored
- Returns
the AnnotationTask object
- Return type
nltk.metrics.agreement.AnnotationTask
ChildProject.projects module
- class ChildProject.projects.ChildProject(path: str, enforce_dtypes: bool = False)[source]
Bases: object
This class is a representation of a ChildProject dataset.
- Attributes:
path (str) – path to the root of the dataset.
recordings (pd.DataFrame) – pandas dataframe representation of this dataset's metadata/recordings.csv
children (pd.DataFrame) – pandas dataframe representation of this dataset's metadata/children.csv
- CHILDREN_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = child_dob), IndexColumn(name = location_id), IndexColumn(name = child_sex), IndexColumn(name = language), IndexColumn(name = languages), IndexColumn(name = mat_ed), IndexColumn(name = fat_ed), IndexColumn(name = car_ed), IndexColumn(name = monoling), IndexColumn(name = monoling_criterion), IndexColumn(name = normative), IndexColumn(name = normative_criterion), IndexColumn(name = mother_id), IndexColumn(name = father_id), IndexColumn(name = order_of_birth), IndexColumn(name = n_of_siblings), IndexColumn(name = household_size), IndexColumn(name = dob_criterion), IndexColumn(name = dob_accuracy)]
- CONVERTED_RECORDINGS = 'recordings/converted'
- PROJECT_FOLDERS = ['recordings', 'annotations', 'metadata', 'doc', 'scripts']
- RAW_RECORDINGS = 'recordings/raw'
- RECORDINGS_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = date_iso), IndexColumn(name = start_time), IndexColumn(name = recording_device_type), IndexColumn(name = recording_filename), IndexColumn(name = duration), IndexColumn(name = session_id), IndexColumn(name = session_offset), IndexColumn(name = recording_device_id), IndexColumn(name = experimenter), IndexColumn(name = location_id), IndexColumn(name = its_filename), IndexColumn(name = upl_filename), IndexColumn(name = trs_filename), IndexColumn(name = lena_id), IndexColumn(name = might_feature_gaps), IndexColumn(name = start_time_accuracy), IndexColumn(name = noisy_setting), IndexColumn(name = notes)]
- REQUIRED_DIRECTORIES = ['recordings', 'extra']
- accumulate_metadata(table: str, df: pandas.core.frame.DataFrame, columns: list, merge_column: str, verbose=False) pandas.core.frame.DataFrame[source]
- compute_recordings_duration(profile: Optional[str] = None) → pandas.core.frame.DataFrame[source]
Compute the duration of the recordings.
- Parameters
profile (str, optional) – name of the profile of recordings to compute the duration from. If None, raw recordings are used. Defaults to None
- Returns
dataframe of the recordings, with an additional or updated duration column.
- Return type
pd.DataFrame
- get_converted_recording_filename(profile: str, recording_filename: str) → str[source]
Retrieve the converted filename of a recording under a given profile, from its original filename.
- Parameters
profile (str) – recording profile
recording_filename (str) – original recording filename, as indexed in the metadata
- Returns
corresponding converted filename of the recording under this profile
- Return type
str
- get_recording_path(recording_filename: str, profile: Optional[str] = None) → str[source]
return the path to a recording
- Parameters
recording_filename (str) – recording filename, as in the metadata
profile (str, optional) – name of the conversion profile, defaults to None
- Returns
path to the recording
- Return type
str
- get_recordings_from_list(recordings: list, profile: Optional[str] = None) → pandas.core.frame.DataFrame[source]
Recover the metadata of recordings from a list of recording names or paths to recordings.
- Parameters
recordings (list) – list of recording names or paths
- Returns
matching recordings
- Return type
pd.DataFrame
- validate(ignore_recordings: bool = False, profile: Optional[str] = None) → tuple[source]
Validate a dataset, returning all errors and warnings.
- Parameters
ignore_recordings (bool, optional) – if True, no errors will be returned for missing recordings.
- Returns
A tuple containing the list of errors, and the list of warnings.
- Return type
a tuple of two lists
ChildProject.tables module
- class ChildProject.tables.IndexColumn(name='', description='', required=False, regex=None, filename=False, datetime=None, function=None, choices=None, dtype=None, unique=False, generated=False)[source]
Bases: object