ChildProject package
Subpackages
- ChildProject.pipelines package
- Submodules
- ChildProject.pipelines.anonymize module
- ChildProject.pipelines.eafbuilder module
- ChildProject.pipelines.metrics module
- ChildProject.pipelines.pipeline module
- ChildProject.pipelines.processors module
- ChildProject.pipelines.samplers module
- ChildProject.pipelines.zooniverse module
- Module contents
- ChildProject.templates package
Submodules
ChildProject.annotations module
- class ChildProject.annotations.AnnotationManager(project: ChildProject.projects.ChildProject)[source]
Bases:
object
- INDEX_COLUMNS = [IndexColumn(name = set), IndexColumn(name = recording_filename), IndexColumn(name = time_seek), IndexColumn(name = range_onset), IndexColumn(name = range_offset), IndexColumn(name = raw_filename), IndexColumn(name = format), IndexColumn(name = filter), IndexColumn(name = annotation_filename), IndexColumn(name = imported_at), IndexColumn(name = package_version), IndexColumn(name = error)]
- SEGMENTS_COLUMNS = [IndexColumn(name = raw_filename), IndexColumn(name = segment_onset), IndexColumn(name = segment_offset), IndexColumn(name = speaker_id), IndexColumn(name = speaker_type), IndexColumn(name = ling_type), IndexColumn(name = vcm_type), IndexColumn(name = lex_type), IndexColumn(name = mwu_type), IndexColumn(name = addressee), IndexColumn(name = transcription), IndexColumn(name = phonemes), IndexColumn(name = syllables), IndexColumn(name = words), IndexColumn(name = lena_block_type), IndexColumn(name = lena_block_number), IndexColumn(name = lena_conv_status), IndexColumn(name = lena_response_count), IndexColumn(name = lena_conv_floor_type), IndexColumn(name = lena_conv_turn_type), IndexColumn(name = utterances_count), IndexColumn(name = utterances_length), IndexColumn(name = non_speech_length), IndexColumn(name = average_db), IndexColumn(name = peak_db), IndexColumn(name = child_cry_vfx_len), IndexColumn(name = utterances), IndexColumn(name = cries), IndexColumn(name = vfxs)]
- static clip_segments(segments: pandas.core.frame.DataFrame, start: int, stop: int) pandas.core.frame.DataFrame [source]
Clip all segments' onsets and offsets within ``start`` and ``stop``. Segments outside of the range [``start``, ``stop``] will be removed.
- Parameters
segments (pd.DataFrame) – Dataframe of the segments to clip
start (int) – range start (in milliseconds)
stop (int) – range end (in milliseconds)
- Returns
Dataframe of the clipped segments
- Return type
pd.DataFrame
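The clipping behavior described above can be sketched with plain pandas (a minimal hypothetical reimplementation for illustration, not the package's actual code):

```python
import pandas as pd

def clip_segments(segments: pd.DataFrame, start: int, stop: int) -> pd.DataFrame:
    # keep only segments that intersect [start, stop]
    segments = segments[
        (segments["segment_offset"] > start) & (segments["segment_onset"] < stop)
    ].copy()
    # clamp the remaining onsets and offsets to the range
    segments["segment_onset"] = segments["segment_onset"].clip(start, stop)
    segments["segment_offset"] = segments["segment_offset"].clip(start, stop)
    return segments

segments = pd.DataFrame({
    "segment_onset": [0, 500, 2500],
    "segment_offset": [400, 1500, 3000],
})
# clip to [300, 2000]: the first segment is clamped to [300, 400],
# the second kept as-is, the third dropped entirely
clipped = clip_segments(segments, 300, 2000)
```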
- get_collapsed_segments(annotations: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]
Get all segments associated with the annotations referenced in ``annotations``, and collapse them into one virtual timeline.
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of all the segments (as specified in Annotations format), merged with ``annotations``
- Return type
pd.DataFrame
- get_segments(annotations: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]
Get all segments associated with the annotations referenced in ``annotations``.
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of all the segments (as specified in Annotations format), merged with ``annotations``.
- Return type
pd.DataFrame
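The "merged with ``annotations``" part amounts to joining index rows onto segment rows; a hedged sketch with hypothetical column values (the real index and segment files have more columns):

```python
import pandas as pd

# hypothetical annotation index rows (Annotations index)
index = pd.DataFrame({
    "annotation_filename": ["a.csv", "b.csv"],
    "set": ["vtc", "its"],
    "recording_filename": ["rec1.wav", "rec1.wav"],
})
# hypothetical converted segment rows (Annotations format)
segments = pd.DataFrame({
    "annotation_filename": ["a.csv", "a.csv", "b.csv"],
    "segment_onset": [0, 100, 50],
    "segment_offset": [90, 200, 80],
})
# each segment row inherits the metadata of its annotation
merged = segments.merge(index, on="annotation_filename", how="left")
```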
- get_segments_timestamps(segments: pandas.core.frame.DataFrame, ignore_date: bool = False, onset: str = 'segment_onset', offset: str = 'segment_offset') pandas.core.frame.DataFrame [source]
Calculate the onset and offset clock-time of each segment
- Parameters
segments (pd.DataFrame) – DataFrame of segments (as returned by ``get_segments()``).
ignore_date (bool, optional) – ignore the date and use time data only, defaults to False
onset (str, optional) – column storing the onset timestamp in milliseconds, defaults to “segment_onset”
offset (str, optional) – column storing the offset timestamp in milliseconds, defaults to “segment_offset”
- Returns
Returns the input dataframe with two new columns, ``onset_time`` and ``offset_time``. ``onset_time`` is a datetime object corresponding to the onset of the segment; ``offset_time`` is a datetime object corresponding to the offset of the segment. In case either ``start_time`` or ``date_iso`` is not specified for the corresponding recording, both values will be set to NaT.
- Return type
pd.DataFrame
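The underlying arithmetic can be sketched as follows (hypothetical metadata values, not the package's implementation): the recording's ``date_iso`` and ``start_time`` combine into a reference datetime, and each segment's onset/offset in milliseconds is added to it.

```python
import pandas as pd

segments = pd.DataFrame({
    "segment_onset": [0, 60000],      # milliseconds from recording start
    "segment_offset": [5000, 65000],
})
# hypothetical recording metadata: date_iso = 2021-03-01, start_time = 09:30
reference = pd.Timestamp("2021-03-01 09:30:00")

segments["onset_time"] = reference + pd.to_timedelta(segments["segment_onset"], unit="ms")
segments["offset_time"] = reference + pd.to_timedelta(segments["segment_offset"], unit="ms")
```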
- get_subsets(annotation_set: str, recursive: bool = False) List[str] [source]
Retrieve the list of subsets belonging to a given set of annotations.
- Parameters
annotation_set (str) – input set
recursive (bool, optional) – If True, get subsets recursively, defaults to False
- Returns
the list of subsets names
- Return type
list
- get_within_time_range(annotations: pandas.core.frame.DataFrame, start_time: str, end_time: str, errors='raise')[source]
Clip all input annotations within a given HH:MM clock-time range. Those that do not intersect the input time range at all are filtered out.
- Parameters
annotations (pd.DataFrame) – DataFrame of input annotations to filter. The only columns that are required are ``recording_filename``, ``range_onset``, and ``range_offset``.
start_time (str) – onset HH:MM clock-time
end_time (str) – offset HH:MM clock-time
errors (str) – how to deal with invalid start_time values for the recordings. Takes the same values as ``pandas.to_datetime``.
- Returns
a DataFrame of annotations. For each row, ``range_onset`` and ``range_offset`` are clipped within the desired clock-time range. The clock-times corresponding to the onset and offset of each annotation are stored in two newly created columns named ``range_onset_time`` and ``range_offset_time``. If the input annotation exceeds 24 hours, one row per matching interval is returned.
- Return type
pd.DataFrame
- import_annotations(input: pandas.core.frame.DataFrame, threads: int = - 1, import_function: Optional[Callable[[str], pandas.core.frame.DataFrame]] = None) pandas.core.frame.DataFrame [source]
Import and convert annotations.
- Parameters
input (pd.DataFrame) – dataframe of all annotations to import, as described in format-input-annotations.
threads (int, optional) – If > 1, conversions will be run on ``threads`` threads, defaults to -1
import_function (Callable[[str], pd.DataFrame], optional) – If specified, the custom ``import_function`` function will be used to convert all ``input`` annotations, defaults to None
- Returns
dataframe of imported annotations, as in Annotations index.
- Return type
pd.DataFrame
- static intersection(annotations: pandas.core.frame.DataFrame, sets: Optional[list] = None) pandas.core.frame.DataFrame [source]
Compute the intersection of all annotations for all sets and recordings, based on their ``recording_filename``, ``range_onset`` and ``range_offset`` attributes. (Only these columns are required, but more can be passed and they will be preserved.)
- Parameters
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns
dataframe of annotations, according to Annotations index
- Return type
pd.DataFrame
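In the simplified case where each set covers each recording with a single annotation, the intersection reduces to taking the latest onset and the earliest offset per recording; a hedged sketch with toy data (the real function handles arbitrary numbers of annotations per set):

```python
import pandas as pd

# toy index: one annotation per set per recording (a simplification)
annotations = pd.DataFrame({
    "set": ["vtc", "its", "vtc", "its"],
    "recording_filename": ["rec1.wav", "rec1.wav", "rec2.wav", "rec2.wav"],
    "range_onset": [0, 1000, 0, 500],
    "range_offset": [10000, 8000, 4000, 6000],
})

# the portion covered by every set = [max(onsets), min(offsets)]
intersection = (
    annotations.groupby("recording_filename")
    .agg(range_onset=("range_onset", "max"), range_offset=("range_offset", "min"))
    .reset_index()
)
```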
- merge_sets(left_set: str, right_set: str, left_columns: List[str], right_columns: List[str], output_set: str, columns: dict = {}, threads=- 1)[source]
Merge columns from ``left_set`` and ``right_set`` annotations, for all matching segments, into a new set of annotations named ``output_set``.
- Parameters
left_set (str) – Left set of annotations.
right_set (str) – Right set of annotations.
left_columns (List) – Columns whose values will be taken from the left set.
right_columns (List) – Columns whose values will be taken from the right set.
output_set (str) – Name of the output annotations set.
- read() Tuple[List[str], List[str]] [source]
Read the index of annotations from ``metadata/annotations.csv`` and store it into ``self.annotations``.
- Returns
a tuple containing the list of errors and the list of warnings generated while reading the index
- Return type
Tuple[List[str],List[str]]
- remove_set(annotation_set: str, recursive: bool = False)[source]
Remove a set of annotations, deleting every converted file and removing them from the index. This preserves raw annotations.
- Parameters
annotation_set (str) – set of annotations to remove
recursive (bool, optional) – remove subsets as well, defaults to False
- rename_set(annotation_set: str, new_set: str, recursive: bool = False, ignore_errors: bool = False)[source]
Rename a set of annotations, moving all related files and updating the index accordingly.
- Parameters
annotation_set (str) – name of the set to rename
new_set (str) – new set name
recursive (bool, optional) – rename subsets as well, defaults to False
ignore_errors (bool, optional) – If True, keep going even if unindexed files are detected, defaults to False
- validate(annotations: Optional[pandas.core.frame.DataFrame] = None, threads: int = 0) Tuple[List[str], List[str]] [source]
Check all indexed annotations for errors.
- Parameters
annotations (pd.DataFrame, optional) – annotations to validate, defaults to None. If None, the whole index will be scanned.
threads (int, optional) – how many threads to run the tests with, defaults to 0. If <= 0, all available CPU cores will be used.
- Returns
a tuple containing the list of errors and the list of warnings detected
- Return type
Tuple[List[str], List[str]]
ChildProject.cmdline module
- ChildProject.cmdline.perform_validation(project: ChildProject.projects.ChildProject, require_success: bool = True, **args)[source]
- ChildProject.cmdline.subcommand(args=[], parent=_SubParsersAction(option_strings=[], dest='==SUPPRESS==', nargs='A...', const=None, default=None, type=None, choices={'validate': ArgumentParser(prog='__main__.py validate', usage=None, description='validate the consistency of the dataset returning detailed errors and warnings', formatter_class=<class 'argparse.HelpFormatter'>, conflict_handler='error', add_help=True), 'import-annotations': ArgumentParser(prog='__main__.py import-annotations', usage=None, description='convert and import a set of annotations', formatter_class=<class 'argparse.HelpFormatter'>, conflict_handler='error', add_help=True), 'merge-annotations': ArgumentParser(prog='__main__.py merge-annotations', usage=None, description='merge segments sharing identical onset and offset from two sets of annotations', formatter_class=<class 'argparse.HelpFormatter'>, conflict_handler='error', add_help=True), 'intersect-annotations': ArgumentParser(prog='__main__.py intersect-annotations', usage=None, description='calculate the intersection of the annotations belonging to the given sets', formatter_class=<class 'argparse.HelpFormatter'>, conflict_handler='error', add_help=True), 'remove-annotations': ArgumentParser(prog='__main__.py remove-annotations', usage=None, description='remove converted annotations of a given set and their entries in the index', formatter_class=<class 'argparse.HelpFormatter'>, conflict_handler='error', add_help=True), 'rename-annotations': ArgumentParser(prog='__main__.py rename-annotations', usage=None, description='rename a set of annotations by moving the files and updating the index accordingly', formatter_class=<class 'argparse.HelpFormatter'>, conflict_handler='error', add_help=True), 'overview': ArgumentParser(prog='__main__.py overview', usage=None, description='prints an overview of the contents of a given dataset', formatter_class=<class 'argparse.HelpFormatter'>, conflict_handler='error', add_help=True), 
'compute-durations': ArgumentParser(prog='__main__.py compute-durations', usage=None, description="creates a 'duration' column into metadata/recordings", formatter_class=<class 'argparse.HelpFormatter'>, conflict_handler='error', add_help=True)}, help=None, metavar=None))[source]
ChildProject.converters module
- class ChildProject.converters.AliceConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'alice'
- class ChildProject.converters.AnnotationConverter[source]
Bases:
object
- SPEAKER_ID_TO_TYPE = {'C1': 'OCH', 'C2': 'OCH', 'CHI': 'CHI', 'CHI*': 'CHI', 'EE1': 'NA', 'EE2': 'NA', 'FA0': 'FEM', 'FA1': 'FEM', 'FA2': 'FEM', 'FA3': 'FEM', 'FA4': 'FEM', 'FA5': 'FEM', 'FA6': 'FEM', 'FA7': 'FEM', 'FA8': 'FEM', 'FAE': 'NA', 'FC1': 'OCH', 'FC2': 'OCH', 'FC3': 'OCH', 'FCE': 'NA', 'MA0': 'MAL', 'MA1': 'MAL', 'MA2': 'MAL', 'MA3': 'MAL', 'MA4': 'MAL', 'MA5': 'MAL', 'MAE': 'NA', 'MC1': 'OCH', 'MC2': 'OCH', 'MC3': 'OCH', 'MC4': 'OCH', 'MC5': 'OCH', 'MCE': 'NA', 'MI1': 'OCH', 'MOT*': 'FEM', 'OC0': 'OCH', 'UA1': 'NA', 'UA2': 'NA', 'UA3': 'NA', 'UA4': 'NA', 'UA5': 'NA', 'UA6': 'NA', 'UC1': 'OCH', 'UC2': 'OCH', 'UC3': 'OCH', 'UC4': 'OCH', 'UC5': 'OCH', 'UC6': 'OCH'}
- THREAD_SAFE = True
- class ChildProject.converters.ChatConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- ADDRESSEE_TABLE = {'CHI': 'T', 'FEM': 'A', 'MAL': 'A', 'OCH': 'C'}
- FORMAT = 'cha'
- SPEAKER_ROLE_TO_TYPE = {'Adult': 'NA', 'Attorney': 'NA', 'Audience': 'NA', 'Boy': 'OCH', 'Brother': 'OCH', 'Caretaker': 'NA', 'Child': 'OCH', 'Doctor': 'NA', 'Environment': 'NA', 'Father': 'MAL', 'Female': 'FEM', 'Friend': 'OCH', 'Girl': 'OCH', 'Grandfather': 'MAL', 'Grandmother': 'FEM', 'Group': 'NA', 'Guest': 'NA', 'Host': 'NA', 'Investigator': 'NA', 'Justice': 'NA', 'LENA': 'NA', 'Leader': 'NA', 'Male': 'MAL', 'Media': 'NA', 'Member': 'NA', 'Mother': 'FEM', 'Narrator': 'NA', 'Nurse': 'NA', 'Other': 'NA', 'Participant': 'CHI', 'Partner': 'NA', 'PlayRole': 'NA', 'Playmate': 'OCH', 'Relative': 'NA', 'Sibling': 'OCH', 'Sister': 'OCH', 'Speaker': 'NA', 'Student': 'NA', 'Target_Adult': 'NA', 'Target_Child': 'CHI', 'Teacher': 'NA', 'Teenager': 'NA', 'Text': 'NA', 'Uncertain': 'NA', 'Unidentified': 'NA', 'Visitor': 'NA'}
- THREAD_SAFE = False
- class ChildProject.converters.CsvConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'csv'
- class ChildProject.converters.EafConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'eaf'
- class ChildProject.converters.ItsConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'its'
- SPEAKER_TYPE_TRANSLATION = {'CHN': 'CHI', 'CXN': 'OCH', 'FAN': 'FEM', 'MAN': 'MAL'}
- class ChildProject.converters.TextGridConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'TextGrid'
- class ChildProject.converters.VcmConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'vcm_rttm'
- SPEAKER_TYPE_TRANSLATION = {'CHI': 'OCH', 'CNS': 'CHI', 'CRY': 'CHI', 'FEM': 'FEM', 'MAL': 'MAL', 'NCS': 'CHI'}
- VCM_TRANSLATION = {'CNS': 'C', 'CRY': 'Y', 'NCS': 'N', 'OTH': 'J'}
- class ChildProject.converters.VtcConverter[source]
Bases:
ChildProject.converters.AnnotationConverter
- FORMAT = 'vtc_rttm'
- SPEAKER_TYPE_TRANSLATION = {'CHI': 'OCH', 'FEM': 'FEM', 'KCHI': 'CHI', 'MAL': 'MAL'}
ChildProject.metrics module
- ChildProject.metrics.conf_matrix(rows_grid, columns_grid)[source]
Compute the confusion matrix (as counts) from grids of active classes. See ``ChildProject.metrics.segments_to_grid()`` for a description of grids.
- Parameters
rows_grid (numpy.array) – the grid corresponding to the rows of the confusion matrix.
columns_grid (numpy.array) – the grid corresponding to the columns of the confusion matrix.
- Returns
a square numpy array of counts
- Return type
numpy.array
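One way to compute such counts (a hedged sketch, not necessarily the package's implementation): since each grid is an indicator matrix over time units, the co-occurrence counts are the matrix product of the transposed row grid with the column grid.

```python
import numpy as np

def conf_matrix(rows_grid: np.ndarray, columns_grid: np.ndarray) -> np.ndarray:
    # cell (i, j) counts the time units where class i is active in rows_grid
    # and class j is active in columns_grid
    return rows_grid.T @ columns_grid

# two grids over 4 time units and 2 classes
rows = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
cols = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])
m = conf_matrix(rows, cols)
```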
- ChildProject.metrics.gamma(segments: pandas.core.frame.DataFrame, column: str, alpha: float = 1, beta: float = 1, precision_level: float = 0.05) float [source]
Compute Mathet et al. gamma agreement on segments.
The gamma measure evaluates the reliability of both the segmentation and the categorization simultaneously; an extensive description of the method and its parameters can be found in Mathet et al., 2015 (doi:10.1162/COLI_a_00227)
This function uses the pygamma-agreement package by Titeux et al.
- Parameters
segments (pd.DataFrame) – input segments dataframe (see Annotations format for the dataframe format)
column (str) – name of the categorical column of the segments to consider, e.g. ‘speaker_type’
alpha (float, optional) – gamma agreement time alignment weight, defaults to 1
beta (float, optional) – gamma agreement categorical weight, defaults to 1
precision_level (float, optional) – level of precision (see pygamma-agreement’s documentation), defaults to 0.05
- Returns
gamma agreement
- Return type
float
- ChildProject.metrics.grid_to_vector(grid, categories)[source]
Transform a grid of active classes into a vector of labels. In case several classes are active at time i, the label is set to ‘overlap’.
See ``ChildProject.metrics.segments_to_grid()`` for a description of grids.
- Parameters
grid (numpy.array) – a NumPy array of shape ``(n, len(categories))``
categories (list) – the list of categories
- Returns
the vector of labels of length ``n`` (e.g. ``np.array([none FEM FEM FEM overlap overlap CHI])``)
- Return type
numpy.array
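The transformation can be sketched as follows (a minimal hypothetical reimplementation, assuming 'none' and 'overlap' labels as described above, not the package's actual code):

```python
import numpy as np

def grid_to_vector(grid: np.ndarray, categories: list) -> np.ndarray:
    # label each time unit with its active class; 'overlap' when several
    # classes are active at once, 'none' when no class is active
    labels = np.array(categories + ["none", "overlap"])
    active = grid.sum(axis=1)
    indices = grid.argmax(axis=1)               # index of the active class
    indices[active == 0] = len(categories)      # 'none'
    indices[active > 1] = len(categories) + 1   # 'overlap'
    return labels[indices]

grid = np.array([
    [0, 0],  # nothing active
    [1, 0],  # FEM only
    [1, 1],  # both active -> overlap
])
vector = grid_to_vector(grid, ["FEM", "CHI"])
```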
- ChildProject.metrics.pyannote_metric(segments: pandas.core.frame.DataFrame, reference: str, hypothesis: str, metric, column: str)[source]
- ChildProject.metrics.segments_to_annotation(segments: pandas.core.frame.DataFrame, column: str)[source]
Transform a dataframe of annotation segments into a pyannote.core.Annotation object
- Parameters
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: ``segment_onset``, ``segment_offset`` and ``column``.
column (str) – the name of the column in ``segments`` that should be used for the values of the annotations (e.g. speaker_type).
- Returns
the pyannote.core.Annotation object.
- Return type
pyannote.core.Annotation
- ChildProject.metrics.segments_to_grid(segments: pandas.core.frame.DataFrame, range_onset: int, range_offset: int, timescale: int, column: str, categories: list, none=True, overlap=False) float [source]
Transform a dataframe of annotation segments into a 2d matrix representing the indicator function of each of the ``categories`` across time.
Each row of the matrix corresponds to a unit of time of length ``timescale`` (in milliseconds), ranging from ``range_onset`` to ``range_offset``; each column corresponds to one of the ``categories`` provided, plus two special columns (overlap and none).
The value of the cell ``ij`` of the output matrix is set to 1 if the class ``j`` is active at time ``i``, 0 otherwise.
If overlap is True, an additional column is appended to the grid, which is set to 1 if more than one class is active at time ``i``.
If none is set to True, an additional column is appended to the grid, which is set to 1 if none of the classes are active at time ``i``.
The shape of the output matrix is therefore ``((range_offset-range_onset)/timescale, len(categories) + n)``, where n = 2 if both overlap and none are True, 1 if one of them is True, and 0 otherwise.
The fraction of time a class ``j`` is active can therefore be calculated as ``np.mean(grid, axis = 0)[j]``.
- Parameters
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: ``segment_onset``, ``segment_offset`` and ``column``.
range_onset (int) – timestamp of the beginning of the range to consider (in milliseconds)
range_offset (int) – timestamp of the end of the range to consider (in milliseconds)
timescale (int) – length of each time unit (in milliseconds)
column (str) – the name of the column in ``segments`` that should be used for the values of the annotations (e.g. speaker_type).
categories (list) – the list of categories
none (bool) – append a ‘none’ column, default True
overlap (bool) – append an overlap column, default False
- Returns
the output grid
- Return type
numpy.array
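The grid construction described above can be sketched as follows (a minimal hypothetical version without the 'none'/'overlap' columns, not the package's actual code):

```python
import numpy as np
import pandas as pd

def segments_to_grid(segments, range_onset, range_offset, timescale, column, categories):
    # one row per time unit, one column per category
    n_units = (range_offset - range_onset) // timescale
    grid = np.zeros((n_units, len(categories)), dtype=int)
    for _, row in segments.iterrows():
        if row[column] not in categories:
            continue
        j = categories.index(row[column])
        onset = max(int((row["segment_onset"] - range_onset) // timescale), 0)
        offset = min(int(np.ceil((row["segment_offset"] - range_onset) / timescale)), n_units)
        grid[onset:offset, j] = 1  # mark class j active over the covered units
    return grid

segments = pd.DataFrame({
    "segment_onset": [0, 2000],
    "segment_offset": [1000, 4000],
    "speaker_type": ["FEM", "CHI"],
})
# 4 time units of 1000 ms: FEM active in unit 0, CHI in units 2-3
grid = segments_to_grid(segments, 0, 4000, 1000, "speaker_type", ["FEM", "CHI"])
```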
- ChildProject.metrics.vectors_to_annotation_task(*args, drop: List[str] = [])[source]
Transform vectors of labels into an nltk AnnotationTask object.
- Parameters
*args – vector of labels for each annotator; add one argument per annotator.
drop (List[str]) – list of labels that should be ignored
- Returns
the AnnotationTask object
- Return type
nltk.metrics.agreement.AnnotationTask
ChildProject.projects module
- class ChildProject.projects.ChildProject(path: str, enforce_dtypes: bool = False)[source]
Bases:
object
This class is a representation of a ChildProject dataset.
- Attributes:
- param path
path to the root of the dataset.
- type path
str
- param recordings
pandas dataframe representation of this dataset's metadata/recordings.csv
- type recordings
pd.DataFrame
- param children
pandas dataframe representation of this dataset's metadata/children.csv
- type children
pd.DataFrame
- CHILDREN_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = child_dob), IndexColumn(name = location_id), IndexColumn(name = child_sex), IndexColumn(name = language), IndexColumn(name = languages), IndexColumn(name = mat_ed), IndexColumn(name = fat_ed), IndexColumn(name = car_ed), IndexColumn(name = monoling), IndexColumn(name = monoling_criterion), IndexColumn(name = normative), IndexColumn(name = normative_criterion), IndexColumn(name = mother_id), IndexColumn(name = father_id), IndexColumn(name = order_of_birth), IndexColumn(name = n_of_siblings), IndexColumn(name = household_size), IndexColumn(name = dob_criterion), IndexColumn(name = dob_accuracy)]
- CONVERTED_RECORDINGS = 'recordings/converted'
- PROJECT_FOLDERS = ['recordings', 'annotations', 'metadata', 'doc', 'scripts']
- RAW_RECORDINGS = 'recordings/raw'
- RECORDINGS_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = date_iso), IndexColumn(name = start_time), IndexColumn(name = recording_device_type), IndexColumn(name = recording_filename), IndexColumn(name = duration), IndexColumn(name = session_id), IndexColumn(name = session_offset), IndexColumn(name = recording_device_id), IndexColumn(name = experimenter), IndexColumn(name = location_id), IndexColumn(name = its_filename), IndexColumn(name = upl_filename), IndexColumn(name = trs_filename), IndexColumn(name = lena_id), IndexColumn(name = might_feature_gaps), IndexColumn(name = start_time_accuracy), IndexColumn(name = noisy_setting), IndexColumn(name = notes)]
- REQUIRED_DIRECTORIES = ['recordings', 'extra']
- accumulate_metadata(table: str, df: pandas.core.frame.DataFrame, columns: list, merge_column: str, verbose=False) pandas.core.frame.DataFrame [source]
- compute_recordings_duration(profile: Optional[str] = None) pandas.core.frame.DataFrame [source]
Compute the duration of recordings.
- Parameters
profile (str, optional) – name of the profile of recordings to compute the duration from. If None, raw recordings are used; defaults to None
- Returns
dataframe of the recordings, with an additional/updated duration column.
- Return type
pd.DataFrame
- get_converted_recording_filename(profile: str, recording_filename: str) str [source]
Retrieve the converted filename of a recording under a given ``profile``, from its original filename.
- Parameters
profile (str) – recording profile
recording_filename (str) – original recording filename, as indexed in the metadata
- Returns
corresponding converted filename of the recording under this profile
- Return type
str
- get_recording_path(recording_filename: str, profile: Optional[str] = None) str [source]
Return the path to a recording.
- Parameters
recording_filename (str) – recording filename, as in the metadata
profile (str, optional) – name of the conversion profile, defaults to None
- Returns
path to the recording
- Return type
str
- get_recordings_from_list(recordings: list, profile: Optional[str] = None) pandas.core.frame.DataFrame [source]
Recover recordings metadata from a list of recording names or paths to recordings.
- Parameters
recordings (list) – list of recording names or paths
- Returns
matching recordings
- Return type
pd.DataFrame
- validate(ignore_recordings: bool = False, profile: Optional[str] = None) tuple [source]
Validate a dataset, returning all errors and warnings.
- Parameters
ignore_recordings (bool, optional) – if True, no errors will be returned for missing recordings.
- Returns
A tuple containing the list of errors, and the list of warnings.
- Return type
a tuple of two lists
ChildProject.tables module
- class ChildProject.tables.IndexColumn(name='', description='', required=False, regex=None, filename=False, datetime=None, function=None, choices=None, dtype=None, unique=False, generated=False)[source]
Bases:
object