ChildProject package
Subpackages
- ChildProject.pipelines package
- Submodules
- ChildProject.pipelines.anonymize module
- ChildProject.pipelines.eafbuilder module
- ChildProject.pipelines.metrics module
- ChildProject.pipelines.metricsFunctions module
avg_can_voc_dur_speaker()
avg_cry_voc_dur_speaker()
avg_non_can_voc_dur_speaker()
avg_voc_dur_speaker()
can_voc_dur_speaker()
can_voc_dur_speaker_ph()
can_voc_speaker()
can_voc_speaker_ph()
cp_dur()
cp_n()
cry_voc_dur_speaker()
cry_voc_dur_speaker_ph()
cry_voc_speaker()
cry_voc_speaker_ph()
lena_CTC()
lena_CTC_ph()
lena_CVC()
lena_CVC_ph()
lp_dur()
lp_n()
metricFunction()
non_can_voc_dur_speaker()
non_can_voc_dur_speaker_ph()
non_can_voc_speaker()
non_can_voc_speaker_ph()
pc_adu()
pc_adu_ph()
pc_speaker()
pc_speaker_ph()
peak_can_voc_dur_speaker()
peak_can_voc_speaker()
peak_cry_voc_dur_speaker()
peak_cry_voc_speaker()
peak_hour_metric()
peak_lena_CTC()
peak_lena_CVC()
peak_non_can_voc_dur_speaker()
peak_non_can_voc_speaker()
peak_pc_adu()
peak_pc_speaker()
peak_sc_adu()
peak_sc_speaker()
peak_simple_CTC()
peak_voc_dur_speaker()
peak_voc_speaker()
peak_wc_adu()
peak_wc_speaker()
per_hour_metric()
sc_adu()
sc_adu_ph()
sc_speaker()
sc_speaker_ph()
simple_CTC()
simple_CTC_ph()
voc_dur_speaker()
voc_dur_speaker_ph()
voc_speaker()
voc_speaker_ph()
wc_adu()
wc_adu_ph()
wc_speaker()
wc_speaker_ph()
- ChildProject.pipelines.pipeline module
- ChildProject.pipelines.processors module
- ChildProject.pipelines.samplers module
- ChildProject.pipelines.zooniverse module
Chunk
ZooniversePipeline
ZooniversePipeline.exit_upload()
ZooniversePipeline.extract_chunks()
ZooniversePipeline.get_credentials()
ZooniversePipeline.link_orphan_subjects()
ZooniversePipeline.reset_orphan_subjects()
ZooniversePipeline.retrieve_classifications()
ZooniversePipeline.run()
ZooniversePipeline.setup_parser()
ZooniversePipeline.upload_chunks()
pad_interval()
- Module contents
- ChildProject.templates package
Submodules
ChildProject.annotations module
- class ChildProject.annotations.AnnotationManager(project: ChildProject)[source]
Bases:
object
- INDEX_COLUMNS = [IndexColumn(name = set), IndexColumn(name = recording_filename), IndexColumn(name = time_seek), IndexColumn(name = range_onset), IndexColumn(name = range_offset), IndexColumn(name = raw_filename), IndexColumn(name = format), IndexColumn(name = filter), IndexColumn(name = annotation_filename), IndexColumn(name = imported_at), IndexColumn(name = package_version), IndexColumn(name = error), IndexColumn(name = merged_from)]
- SEGMENTS_COLUMNS = [IndexColumn(name = raw_filename), IndexColumn(name = segment_onset), IndexColumn(name = segment_offset), IndexColumn(name = speaker_id), IndexColumn(name = speaker_type), IndexColumn(name = ling_type), IndexColumn(name = vcm_type), IndexColumn(name = lex_type), IndexColumn(name = mwu_type), IndexColumn(name = msc_type), IndexColumn(name = gra_type), IndexColumn(name = addressee), IndexColumn(name = transcription), IndexColumn(name = phonemes), IndexColumn(name = syllables), IndexColumn(name = words), IndexColumn(name = lena_block_type), IndexColumn(name = lena_block_number), IndexColumn(name = lena_conv_status), IndexColumn(name = lena_response_count), IndexColumn(name = lena_conv_floor_type), IndexColumn(name = lena_conv_turn_type), IndexColumn(name = lena_speaker), IndexColumn(name = utterances_count), IndexColumn(name = utterances_length), IndexColumn(name = non_speech_length), IndexColumn(name = average_db), IndexColumn(name = peak_db), IndexColumn(name = child_cry_vfx_len), IndexColumn(name = utterances), IndexColumn(name = cries), IndexColumn(name = vfxs)]
- SETS_COLUMNS = [IndexColumn(name = segmentation), IndexColumn(name = segmentation_type), IndexColumn(name = method), IndexColumn(name = sampling_method), IndexColumn(name = sampling_target), IndexColumn(name = sampling_count), IndexColumn(name = sampling_unit_duration), IndexColumn(name = recording_selection), IndexColumn(name = participant_selection), IndexColumn(name = annotator_name), IndexColumn(name = annotator_experience), IndexColumn(name = annotation_algorithm_name), IndexColumn(name = annotation_algorithm_publication), IndexColumn(name = annotation_algorithm_version), IndexColumn(name = annotation_algorithm_repo), IndexColumn(name = date_annotation), IndexColumn(name = has_speaker_type), IndexColumn(name = has_transcription), IndexColumn(name = has_interactions), IndexColumn(name = has_acoustics), IndexColumn(name = has_addressee), IndexColumn(name = has_vcm_type), IndexColumn(name = has_words), IndexColumn(name = notes)]
- SETS_CONTENT_COLUMNS = {}
- add_annotation_file(src_path, dst_file, set: str, overwrite)[source]
Add an annotation file to the dataset. This function takes the path to a file and copies it into the correct spot inside the dataset, given the set it belongs to. The destination file can contain parent folders, which will be preserved in the copy (e.g. src_path="/home/user/tmp/myrec.rttm", dst_file="loc1/RA5/rec001.rttm", set='vtc' will copy the file into the annotations/vtc/raw/loc1/RA5 folder of the dataset, under the name rec001.rttm).
- Parameters:
src_path (Path | str) – path on the system to the annotation file to add to the dataset
dst_file (Path | str) – filename as it will be stored in the dataset, with possible parent folders (e.g. 'location1/RA5/rec004.rttm' will copy the original file as rec004.rttm inside folders location1 -> RA5)
set (str) – annotation set the annotation file belongs to
overwrite (bool, optional) – overwrite the existing destination if it already exists
- static clip_segments(segments: DataFrame, start: int, stop: int) DataFrame [source]
Clip all segment onsets and offsets within start and stop. Segments outside of the range [start, stop] will be removed.
- Parameters:
segments (pd.DataFrame) – Dataframe of the segments to clip
start (int) – range start (in milliseconds)
stop (int) – range end (in milliseconds)
- Returns:
Dataframe of the clipped segments
- Return type:
pd.DataFrame
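The clipping rule can be sketched in plain Python (a hypothetical re-implementation over dicts rather than a pandas DataFrame; the real static method operates on the segment columns shown above):

```python
# Hypothetical, minimal re-implementation of the clipping logic of
# AnnotationManager.clip_segments(), using plain dicts instead of a
# pandas DataFrame. Times are in milliseconds, as in the real API.
def clip_segments(segments, start, stop):
    clipped = []
    for seg in segments:
        onset = max(seg["segment_onset"], start)
        offset = min(seg["segment_offset"], stop)
        if onset < offset:  # drop segments fully outside [start, stop]
            clipped.append({**seg, "segment_onset": onset, "segment_offset": offset})
    return clipped

segments = [
    {"segment_onset": 0, "segment_offset": 2000},     # clipped to [1000, 2000]
    {"segment_onset": 2500, "segment_offset": 6000},  # clipped to [2500, 5000]
    {"segment_onset": 7000, "segment_offset": 8000},  # removed (outside range)
]
print(clip_segments(segments, 1000, 5000))
```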
- derive_annotations(input_set: str, output_set: str, derivation_function: str | ~typing.Callable, derivation_metadata=None, threads: int = -1, overwrite_existing: bool = False) -> (<class 'pandas.core.frame.DataFrame'>, <class 'pandas.core.frame.DataFrame'>)[source]
Derive annotations. From an existing set of annotations, create a new set that derive its result from the original set
- Parameters:
input_set (str) – name of the set of annotations to be derived
output_set (str) – name of the new set of derived annotations
derivation_function (Union[str, Callable]) – name of the derivation type to be performed
derivation_metadata (dict) – metadata to be used for the set created by the derivation, if none and derivation is internal to the package (using str label), use the internally stored metadata
threads (int, optional) – if > 1, conversions will be run on that many threads; defaults to -1
overwrite_existing (bool, optional) – choose whether lines with the same set and annotation_filename should be overwritten
- Returns:
tuple of dataframe of derived annotations, as in Annotations index and dataframe of errors
- Return type:
tuple(pd.DataFrame, pd.DataFrame)
- get_collapsed_segments(annotations: DataFrame) DataFrame [source]
get all segments associated with the annotations referenced in annotations, and collapse them into one virtual timeline.
- Parameters:
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns:
dataframe of all the segments (as specified in Annotations format), merged with annotations
- Return type:
pd.DataFrame
- static get_printable_sets_metadata(sets, delimiter, header=True, human_readable: bool = False)[source]
- get_segments(annotations: DataFrame) DataFrame [source]
get all segments associated with the annotations referenced in annotations.
- Parameters:
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns:
dataframe of all the segments (as specified in Annotations format), merged with annotations.
- Return type:
pd.DataFrame
- get_segments_timestamps(segments: DataFrame, ignore_date: bool = False, onset: str = 'segment_onset', offset: str = 'segment_offset') DataFrame [source]
Calculate the onset and offset clock-time of each segment
- Parameters:
segments (pd.DataFrame) – DataFrame of segments (as returned by get_segments()).
ignore_date (bool, optional) – ignore the date information and use time of day only, defaults to False
onset (str, optional) – column storing the onset timestamp in milliseconds, defaults to “segment_onset”
offset (str, optional) – column storing the offset timestamp in milliseconds, defaults to “segment_offset”
- Returns:
Returns the input dataframe with two new columns, onset_time and offset_time. onset_time is a datetime object corresponding to the onset of the segment; offset_time is a datetime object corresponding to the offset of the segment. If either start_time or date_iso is not specified for the corresponding recording, both values will be set to NaT.
- Return type:
pd.DataFrame
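As a sketch of the underlying computation, a segment's clock-time can be obtained by adding its millisecond offsets to the recording's date_iso and start_time metadata (the segment_times helper below is hypothetical and not part of the package):

```python
from datetime import datetime, timedelta

# Hedged sketch of how onset_time/offset_time can be derived from a
# recording's date_iso and start_time metadata plus millisecond
# offsets; the real get_segments_timestamps() does this per row on a
# pandas DataFrame and emits NaT when the metadata is missing.
def segment_times(date_iso, start_time, onset_ms, offset_ms):
    start = datetime.strptime(f"{date_iso} {start_time}", "%Y-%m-%d %H:%M")
    return (start + timedelta(milliseconds=onset_ms),
            start + timedelta(milliseconds=offset_ms))

onset_time, offset_time = segment_times("2020-04-20", "09:30", 60_000, 75_000)
print(onset_time.isoformat())   # 2020-04-20T09:31:00
```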
- get_sets_metadata(format: str = 'dataframe', delimiter=None, escapechar='"', header=True, human=False, sort_by='set', sort_ascending=True)[source]
return metadata about the sets
- get_subsets(annotation_set: str, recursive: bool = False) List[str] [source]
Retrieve the list of subsets belonging to a given set of annotations.
- Parameters:
annotation_set (str) – input set
recursive (bool, optional) – If True, get subsets recursively, defaults to False
- Returns:
the list of subsets names
- Return type:
list
- get_within_ranges(ranges: DataFrame, sets: Set | List | None = None, missing_data: str = 'ignore')[source]
Retrieve and clip annotations that cover specific portions of recordings (ranges).
The desired ranges are defined by an input dataframe with three columns: recording_filename, range_onset, and range_offset. The function returns a dataframe of annotations under the same format as the index of annotations (Annotations index).
This output can then be provided to get_segments() in order to retrieve segments of annotations that match the desired range.
For instance, the code below will print all the segments of annotations corresponding to the first hour of each recording:
>>> from ChildProject.projects import ChildProject
>>> from ChildProject.annotations import AnnotationManager
>>> project = ChildProject('.')
>>> am = AnnotationManager(project)
>>> am.read()
>>> ranges = project.recordings
>>> ranges['range_onset'] = 0
>>> ranges['range_offset'] = 60*60*1000
>>> matches = am.get_within_ranges(ranges)
>>> am.get_segments(matches)
- Parameters:
ranges (pd.DataFrame) – pandas dataframe with one row per range to be considered and three columns: recording_filename, range_onset, range_offset.
sets (Union[Set, List]) – optional list of annotation sets to retrieve. If None, all annotations from all sets will be retrieved.
missing_data (str, defaults to ignore) – how to handle missing annotations (“ignore”, “warn” or “raise”)
- Return type:
pd.DataFrame
- get_within_time_range(annotations: DataFrame, interval: TimeInterval | None = None, start_time: str | None = None, end_time: str | None = None)[source]
Clip all input annotations within a given HH:MM:SS clock-time range. Those that do not intersect the input time range at all are filtered out.
- Parameters:
annotations (pd.DataFrame) – DataFrame of input annotations to filter. The only required columns are recording_filename, range_onset, and range_offset.
interval (TimeInterval) – interval of hours to consider, containing the start hour and end hour
start_time (str) – start time to use, in HH:MM format; only used if interval is None, replaces the first value of interval
end_time (str) – end time to use, in HH:MM format; only used if interval is None, replaces the second value of interval
- Returns:
a DataFrame of annotations. For each row, range_onset and range_offset are clipped within the desired clock-time range. The clock-times corresponding to the onset and offset of each annotation are stored in two newly created columns named range_onset_time and range_offset_time. If the input annotation exceeds 24 hours, one row per matching interval is returned.
- Return type:
pd.DataFrame
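A simplified sketch of the clock-time clipping, ignoring the 24-hour wrap-around that the real method handles (the clip_to_clock_range helper below is hypothetical; offsets are in milliseconds relative to the recording's start_time):

```python
# Hypothetical sketch of the clipping done by get_within_time_range():
# keep only the part of [range_onset, range_offset] (ms since the
# recording started) that falls inside a HH:MM clock window on the
# same day. The real method also handles annotations spanning days.
def clip_to_clock_range(range_onset, range_offset, start_time, window_start, window_end):
    def to_ms(hhmm):
        h, m = map(int, hhmm.split(":"))
        return (h * 60 + m) * 60_000
    lo = to_ms(window_start) - to_ms(start_time)  # window start, recording-relative
    hi = to_ms(window_end) - to_ms(start_time)    # window end, recording-relative
    onset, offset = max(range_onset, lo), min(range_offset, hi)
    return (onset, offset) if onset < offset else None

# recording starts at 08:30; keep only the 09:00-10:00 portion
print(clip_to_clock_range(0, 7_200_000, "08:30", "09:00", "10:00"))
# (1800000, 5400000)
```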
- import_annotations(input: DataFrame, threads: int = -1, import_function: Callable[[str], DataFrame] | None = None, new_tiers: list | None = None, overwrite_existing: bool = False) DataFrame [source]
Import and convert annotations.
- Parameters:
input (pd.DataFrame) – dataframe of all annotations to import, as described in Annotation importation input format.
threads (int, optional) – if > 1, conversions will be run on that many threads; defaults to -1
import_function (Callable[[str], pd.DataFrame], optional) – if specified, the custom import_function will be used to convert all input annotations, defaults to None
new_tiers (list[str], optional) – list of EAF tier names. If specified, the corresponding EAF tiers will be imported.
overwrite_existing (bool, optional) – choose whether lines with the same set and annotation_filename should be overwritten
- Returns:
dataframe of imported annotations, as in Annotations index.
- Return type:
pd.DataFrame
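As an illustration of what a custom import_function does, here is a hypothetical parser for one RTTM-style line. The real callable receives a raw filename and must return a pd.DataFrame in the Annotations format, so this pure-Python dict version is only a sketch of the conversion step:

```python
# Hypothetical parser for one RTTM-style line, illustrating the kind
# of conversion an import_function passed to import_annotations()
# performs: raw annotation fields become segment_onset/segment_offset
# in milliseconds plus a speaker_type value. (The real callable must
# return a pd.DataFrame covering the whole file.)
def parse_rttm_line(line):
    fields = line.split()
    onset = int(float(fields[3]) * 1000)      # seconds -> milliseconds
    duration = int(float(fields[4]) * 1000)
    return {
        "segment_onset": onset,
        "segment_offset": onset + duration,
        "speaker_type": fields[7],
    }

line = "SPEAKER rec001 1 12.5 2.0 <NA> <NA> FEM <NA> <NA>"
print(parse_rttm_line(line))
```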
- static infer_set_content_based_on_column_names(columns) dict [source]
From a list of columns present in annotations, predict what content will be present in the set metadata. It takes the defined metadata fields and determines, based on their annotation_columns field, whether a combination of the right columns is present.
- Parameters:
columns (List[str]) – list of columns in the annotation
- Returns:
dictionary with inferred metadata to add to the set
- Return type:
dict
- static intersection(annotations: DataFrame, sets: list | None = None) DataFrame [source]
Compute the intersection of all annotations for all sets and recordings, based on their recording_filename, range_onset and range_offset attributes. (Only these columns are required, but more can be passed and they will be preserved.)
- Parameters:
annotations (pd.DataFrame) – dataframe of annotations, according to Annotations index
- Returns:
dataframe of annotations, according to Annotations index
- Return type:
pd.DataFrame
- merge_annotations(left_columns, right_columns, columns, output_set, input, skip_existing: bool = False)[source]
From two DataFrames listing the annotation indexes to merge (those indexes should come from the intersection of the left_set and right_set indexes), the list of columns to merge, and the name of the output_set, create the resulting csv files containing the converted merged segments and return the new indexes to add to annotations.csv.
- Parameters:
left_columns (list[str]) – list of the columns to include from the left set
right_columns (list[str]) – list of the columns to include from the right set
columns (dict) – additional columns to add to the segments, key is the column name
output_set (str) – name of the set to save the new merged files into
input (dict) – annotation indexes to use for the merge, containing keys 'left_annotations' and 'right_annotations' to separate indexes from the left and right sets
- Returns:
annotation indexes created by the merge, should be added to annotations.csv
- Return type:
pandas.DataFrame
- merge_sets(left_set: str, right_set: str, left_columns: List[str], right_columns: List[str], output_set: str, full_set_merge: bool = True, skip_existing: bool = False, columns: dict = {}, recording_filter: str | None = None, metadata: str | None = None, threads=-1)[source]
Merge columns from left_set and right_set annotations, for all matching segments, into a new set of annotations named output_set that will be saved in the dataset. output_set must not already exist if full_set_merge is True.
- Parameters:
left_set (str) – Left set of annotations.
right_set (str) – Right set of annotations.
left_columns (List) – columns whose values will be taken from the left set.
right_columns (List) – columns whose values will be taken from the right set.
output_set (str) – Name of the output annotations set.
full_set_merge (bool) – the merge is meant to create the entire merged set; therefore, the set should not already exist. Defaults to True
skip_existing (bool) – the merge will skip already existing lines in the merged set, so neither the annotation index nor the resulting converted csv will change for those lines
columns (dict) – Additional columns to add to the resulting converted annotations.
recording_filter (set[str]) – set of recording_filenames to merge.
metadata (None | str) – set metadata to keep in the merged set: 'left' or 'right' to keep the metadata of the left or right set (except for content fields), None to keep no metadata. Defaults to None
threads (int) – number of threads
- Returns:
[description]
- Return type:
[type]
- read() Tuple[List[str], List[str]] [source]
Read the index of annotations from metadata/annotations.csv and store it into self.annotations.
- Returns:
a tuple containing the list of errors and the list of warnings generated while reading the index
- Return type:
Tuple[List[str],List[str]]
- remove_annotation_file(file, set: str)[source]
remove a raw annotation file from the dataset. This function takes the path to a file and removes it from the dataset annotations at the file system level (not from the index). The file may be located under subfolders, which must be included in the file name as a posix path (i.e. subfolder/file). The set parameter defines what annotation set the raw file is stored in.
- Parameters:
file (Path | str) – filename as it is stored in the dataset annotations, in the raw folder of the annotation set (e.g. set=vtc will be evaluated inside the annotations/vtc/raw folder of the dataset)
set (str) – name of the annotation set the file is stored in.
- remove_set(annotation_set: str, recursive: bool = False)[source]
Remove a set of annotations, deleting every converted file and removing them from the index. This preserves raw annotations.
- Parameters:
annotation_set (str) – set of annotations to remove
recursive (bool, optional) – remove subsets as well, defaults to False
- rename_set(annotation_set: str, new_set: str, recursive: bool = False, ignore_errors: bool = False)[source]
Rename a set of annotations, moving all related files and updating the index accordingly.
- Parameters:
annotation_set (str) – name of the set to rename
new_set (str) – new set name
recursive (bool, optional) – rename subsets as well, defaults to False
ignore_errors (bool, optional) – If True, keep going even if unindexed files are detected, defaults to False
- validate(annotations: DataFrame | None = None, threads: int = 0) Tuple[List[str], List[str]] [source]
check all indexed annotations for errors
- Parameters:
annotations (pd.DataFrame, optional) – annotations to validate, defaults to None. If None, the whole index will be scanned.
threads (int, optional) – how many threads to run the tests with, defaults to 0. If <= 0, all available CPU cores will be used.
- Returns:
a tuple containing the list of errors and the list of warnings detected
- Return type:
Tuple[List[str], List[str]]
ChildProject.cmdline module
ChildProject.converters module
- class ChildProject.converters.AliceConverter[source]
Bases:
AnnotationConverter
- FORMAT = 'alice'
- class ChildProject.converters.AnnotationConverter[source]
Bases:
object
- SPEAKER_ID_TO_TYPE = {'C1': 'OCH', 'C2': 'OCH', 'CHI': 'CHI', 'CHI*': 'CHI', 'EE1': 'NA', 'EE2': 'NA', 'FA0': 'FEM', 'FA1': 'FEM', 'FA2': 'FEM', 'FA3': 'FEM', 'FA4': 'FEM', 'FA5': 'FEM', 'FA6': 'FEM', 'FA7': 'FEM', 'FA8': 'FEM', 'FAE': 'NA', 'FC1': 'OCH', 'FC2': 'OCH', 'FC3': 'OCH', 'FCE': 'NA', 'MA0': 'MAL', 'MA1': 'MAL', 'MA2': 'MAL', 'MA3': 'MAL', 'MA4': 'MAL', 'MA5': 'MAL', 'MAE': 'NA', 'MC1': 'OCH', 'MC2': 'OCH', 'MC3': 'OCH', 'MC4': 'OCH', 'MC5': 'OCH', 'MCE': 'NA', 'MI1': 'OCH', 'MOT*': 'FEM', 'OC0': 'OCH', 'UA1': 'NA', 'UA2': 'NA', 'UA3': 'NA', 'UA4': 'NA', 'UA5': 'NA', 'UA6': 'NA', 'UC1': 'OCH', 'UC2': 'OCH', 'UC3': 'OCH', 'UC4': 'OCH', 'UC5': 'OCH', 'UC6': 'OCH'}
- THREAD_SAFE = True
- class ChildProject.converters.ChatConverter[source]
Bases:
AnnotationConverter
- ADDRESSEE_TABLE = {'CHI': 'T', 'FEM': 'A', 'MAL': 'A', 'OCH': 'C'}
- FORMAT = 'cha'
- SPEAKER_ROLE_TO_TYPE = {'Adult': 'NA', 'Attorney': 'NA', 'Audience': 'NA', 'Boy': 'OCH', 'Brother': 'OCH', 'Caretaker': 'NA', 'Child': 'OCH', 'Doctor': 'NA', 'Environment': 'NA', 'Father': 'MAL', 'Female': 'FEM', 'Friend': 'OCH', 'Girl': 'OCH', 'Grandfather': 'MAL', 'Grandmother': 'FEM', 'Group': 'NA', 'Guest': 'NA', 'Host': 'NA', 'Investigator': 'NA', 'Justice': 'NA', 'LENA': 'NA', 'Leader': 'NA', 'Male': 'MAL', 'Media': 'NA', 'Member': 'NA', 'Mother': 'FEM', 'Narrator': 'NA', 'Nurse': 'NA', 'Other': 'NA', 'Participant': 'CHI', 'Partner': 'NA', 'PlayRole': 'NA', 'Playmate': 'OCH', 'Relative': 'NA', 'Sibling': 'OCH', 'Sister': 'OCH', 'Speaker': 'NA', 'Student': 'NA', 'Target_Adult': 'NA', 'Target_Child': 'CHI', 'Teacher': 'NA', 'Teenager': 'NA', 'Text': 'NA', 'Uncertain': 'NA', 'Unidentified': 'NA', 'Visitor': 'NA'}
- THREAD_SAFE = False
- class ChildProject.converters.CsvConverter[source]
Bases:
AnnotationConverter
- FORMAT = 'csv'
- class ChildProject.converters.EafConverter[source]
Bases:
AnnotationConverter
- FORMAT = 'eaf'
- class ChildProject.converters.Formats(value)[source]
Bases:
Enum
An enumeration.
- ALICE = 'alice'
- CHA = 'cha'
- CSV = 'csv'
- EAF = 'eaf'
- ITS = 'its'
- TEXTGRID = 'TextGrid'
- VCM = 'vcm_rttm'
- VTC = 'vtc_rttm'
- class ChildProject.converters.ItsConverter[source]
Bases:
AnnotationConverter
- FORMAT = 'its'
- SPEAKER_TYPE_TRANSLATION = {'CHN': 'CHI', 'CXN': 'OCH', 'FAN': 'FEM', 'MAN': 'MAL'}
- class ChildProject.converters.TextGridConverter[source]
Bases:
AnnotationConverter
- FORMAT = 'TextGrid'
ChildProject.metrics module
- ChildProject.metrics.conf_matrix(rows_grid, columns_grid)[source]
compute the confusion matrix (as counts) from grids of active classes.
See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters:
rows_grid (numpy.array) – the grid corresponding to the rows of the confusion matrix.
columns_grid (numpy.array) – the grid corresponding to the columns of the confusion matrix.
- Returns:
a square numpy array of counts
- Return type:
numpy.array
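The counting logic can be sketched in plain Python, assuming one-hot grids (exactly one active class per time unit); the real function works on numpy arrays produced by segments_to_grid():

```python
# Hedged sketch of the confusion-matrix counting over grids of active
# classes, assuming one-hot rows; the real conf_matrix() operates on
# numpy arrays and returns a square numpy array of counts.
def conf_matrix(rows_grid, columns_grid):
    n = len(rows_grid[0])
    counts = [[0] * n for _ in range(n)]
    for row, col in zip(rows_grid, columns_grid):
        i, j = row.index(1), col.index(1)  # active class per annotator
        counts[i][j] += 1
    return counts

# two annotators, 4 time units, classes [CHI, FEM]
a = [[1, 0], [1, 0], [0, 1], [0, 1]]
b = [[1, 0], [0, 1], [0, 1], [0, 1]]
print(conf_matrix(a, b))   # [[1, 1], [0, 2]]
```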
- ChildProject.metrics.gamma(segments: DataFrame, column: str, alpha: float = 1, beta: float = 1, precision_level: float = 0.05) float [source]
Compute Mathet et al. gamma agreement on segments.
The gamma measure evaluates the reliability of both the segmentation and the categorization simultaneously; an extensive description of the method and its parameters can be found in Mathet et al., 2015 (doi:10.1162/COLI_a_00227).
This function uses the pygamma-agreement package by Titeux et al.
- Parameters:
segments (pd.DataFrame) – input segments dataframe (see Annotations format for the dataframe format)
column (str) – name of the categorical column of the segments to consider, e.g. ‘speaker_type’
alpha (float, optional) – gamma agreement time alignment weight, defaults to 1
beta (float, optional) – gamma agreement categorical weight, defaults to 1
precision_level (float, optional) – level of precision (see pygamma-agreement’s documentation), defaults to 0.05
- Returns:
gamma agreement
- Return type:
float
- ChildProject.metrics.grid_to_vector(grid, categories)[source]
Transform a grid of active classes into a vector of labels. In case several classes are active at time i, the label is set to ‘overlap’.
See ChildProject.metrics.segments_to_grid() for a description of grids.
- Parameters:
grid (numpy.array) – a NumPy array of shape (n, len(categories))
categories (list) – the list of categories
- Returns:
the vector of labels of length n (e.g. np.array([none FEM FEM FEM overlap overlap CHI]))
- Return type:
numpy.array
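The labelling rule can be sketched in plain Python (the real function takes and returns numpy arrays):

```python
# Hedged sketch of grid_to_vector()'s labelling rule: one label per
# time unit, 'overlap' when several classes are active and 'none'
# when no class is.
def grid_to_vector(grid, categories):
    labels = []
    for row in grid:
        active = [c for c, v in zip(categories, row) if v]
        if len(active) == 0:
            labels.append("none")
        elif len(active) == 1:
            labels.append(active[0])
        else:
            labels.append("overlap")
    return labels

grid = [[0, 0], [1, 0], [1, 1], [0, 1]]
print(grid_to_vector(grid, ["CHI", "FEM"]))   # ['none', 'CHI', 'overlap', 'FEM']
```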
- ChildProject.metrics.pyannote_metric(segments: DataFrame, reference: str, hypothesis: str, metric, column: str)[source]
- ChildProject.metrics.segments_to_annotation(segments: DataFrame, column: str)[source]
Transform a dataframe of annotation segments into a pyannote.core.Annotation object
- Parameters:
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: segment_onset, segment_offset and column.
column (str) – the name of the column in segments that should be used for the values of the annotations (e.g. speaker_type).
- Returns:
the pyannote.core.Annotation object.
- Return type:
pyannote.core.Annotation
- ChildProject.metrics.segments_to_grid(segments: DataFrame, range_onset: int, range_offset: int, timescale: int, column: str, categories: list, none=True, overlap=False) float [source]
Transform a dataframe of annotation segments into a 2d matrix representing the indicator function of each of the categories across time.
Each row of the matrix corresponds to a unit of time of length timescale (in milliseconds), ranging from range_onset to range_offset; each column corresponds to one of the categories provided, plus two special columns (overlap and none).
The value of the cell ij of the output matrix is set to 1 if the class j is active at time i, 0 otherwise.
If overlap is True, an additional column is appended to the grid, which is set to 1 if two or more classes are active at time i.
If none is set to True, an additional column is appended to the grid, which is set to 1 if none of the classes are active at time i.
The shape of the output matrix is therefore ((range_offset-range_onset)/timescale, len(categories) + n), where n = 2 if both overlap and none are True, 1 if one of them is True, and 0 otherwise.
The fraction of time a class j is active can therefore be calculated as np.mean(grid, axis = 0)[j]
- Parameters:
segments (pd.DataFrame) – a dataframe of input segments. It should at least have the following columns: segment_onset, segment_offset and column.
range_onset (int) – timestamp of the beginning of the range to consider (in milliseconds)
range_offset (int) – timestamp of the end of the range to consider (in milliseconds)
timescale (int) – length of each time unit (in milliseconds)
column (str) – the name of the column in segments that should be used for the values of the annotations (e.g. speaker_type).
categories (list) – the list of categories
none (bool) – append a ‘none’ column, default True
overlap (bool) – append an overlap column, default False
- Returns:
the output grid
- Return type:
numpy.array
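The construction can be sketched in plain Python with nested lists (the real function returns a numpy array and can append the none/overlap columns described above; this hypothetical version hard-codes speaker_type as the category column and omits them):

```python
# Hedged sketch of the indicator grid built by segments_to_grid():
# one row per timescale unit between range_onset and range_offset,
# one column per category. Segment boundaries are in milliseconds.
def segments_to_grid(segments, range_onset, range_offset, timescale, categories):
    n_units = (range_offset - range_onset) // timescale
    grid = [[0] * len(categories) for _ in range(n_units)]
    for seg in segments:
        j = categories.index(seg["speaker_type"])
        first = max((seg["segment_onset"] - range_onset) // timescale, 0)
        # ceiling division for the last covered unit
        last = min(-(-(seg["segment_offset"] - range_onset) // timescale), n_units)
        for i in range(first, last):
            grid[i][j] = 1
    return grid

segments = [{"segment_onset": 0, "segment_offset": 250, "speaker_type": "CHI"}]
print(segments_to_grid(segments, 0, 400, 100, ["CHI", "FEM"]))
# [[1, 0], [1, 0], [1, 0], [0, 0]]
```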
- ChildProject.metrics.vectors_to_annotation_task(*args, drop: List[str] = [])[source]
transform vectors of labels into an nltk AnnotationTask object.
- Parameters:
args (1d np.array() of labels) – vector of labels for each annotator; add one argument per annotator.
drop (List[str]) – list of labels that should be ignored
- Returns:
the AnnotationTask object
- Return type:
nltk.metrics.agreement.AnnotationTask
ChildProject.projects module
- class ChildProject.projects.ChildProject(path: Path | str, enforce_dtypes: bool = True, ignore_discarded: bool = True)[source]
Bases:
object
This class is a representation of a ChildProject dataset.
Constructor parameters:
- Parameters:
path (str) – path to the root of the dataset.
enforce_dtypes (bool, optional) – enforce dtypes on children/recordings dataframes, defaults to True
ignore_discarded (bool, optional) – ignore entries such that discard=1, defaults to True
Attributes:
path (str) – path to the root of the dataset.
recordings (pd.DataFrame) – pandas dataframe representation of this dataset's metadata/recordings.csv
children (pd.DataFrame) – pandas dataframe representation of this dataset's metadata/children.csv
- CHILDREN_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = child_dob), IndexColumn(name = location_id), IndexColumn(name = child_sex), IndexColumn(name = language), IndexColumn(name = languages), IndexColumn(name = mat_ed), IndexColumn(name = fat_ed), IndexColumn(name = car_ed), IndexColumn(name = monoling), IndexColumn(name = monoling_criterion), IndexColumn(name = normative), IndexColumn(name = normative_criterion), IndexColumn(name = mother_id), IndexColumn(name = father_id), IndexColumn(name = order_of_birth), IndexColumn(name = n_of_siblings), IndexColumn(name = household_size), IndexColumn(name = dob_criterion), IndexColumn(name = dob_accuracy), IndexColumn(name = discard)]
- DOCUMENTATION_COLUMNS = [IndexColumn(name = variable), IndexColumn(name = description), IndexColumn(name = values), IndexColumn(name = scope), IndexColumn(name = annotation_set)]
- RECORDINGS_COLUMNS = [IndexColumn(name = experiment), IndexColumn(name = child_id), IndexColumn(name = date_iso), IndexColumn(name = start_time), IndexColumn(name = recording_device_type), IndexColumn(name = recording_filename), IndexColumn(name = duration), IndexColumn(name = session_id), IndexColumn(name = session_offset), IndexColumn(name = recording_device_id), IndexColumn(name = experimenter), IndexColumn(name = location_id), IndexColumn(name = its_filename), IndexColumn(name = upl_filename), IndexColumn(name = trs_filename), IndexColumn(name = lena_id), IndexColumn(name = lena_recording_num), IndexColumn(name = might_feature_gaps), IndexColumn(name = start_time_accuracy), IndexColumn(name = noisy_setting), IndexColumn(name = notes), IndexColumn(name = discard)]
- REC_COL_REF = {'child_id': IndexColumn(name = child_id), 'date_iso': IndexColumn(name = date_iso), 'discard': IndexColumn(name = discard), 'duration': IndexColumn(name = duration), 'experiment': IndexColumn(name = experiment), 'experimenter': IndexColumn(name = experimenter), 'its_filename': IndexColumn(name = its_filename), 'lena_id': IndexColumn(name = lena_id), 'lena_recording_num': IndexColumn(name = lena_recording_num), 'location_id': IndexColumn(name = location_id), 'might_feature_gaps': IndexColumn(name = might_feature_gaps), 'noisy_setting': IndexColumn(name = noisy_setting), 'notes': IndexColumn(name = notes), 'recording_device_id': IndexColumn(name = recording_device_id), 'recording_device_type': IndexColumn(name = recording_device_type), 'recording_filename': IndexColumn(name = recording_filename), 'session_id': IndexColumn(name = session_id), 'session_offset': IndexColumn(name = session_offset), 'start_time': IndexColumn(name = start_time), 'start_time_accuracy': IndexColumn(name = start_time_accuracy), 'trs_filename': IndexColumn(name = trs_filename), 'upl_filename': IndexColumn(name = upl_filename)}
- accumulate_metadata(table: str, df: DataFrame, columns: list, merge_column: str, verbose=False) DataFrame [source]
- add_project_file(src_path, dst_file, file_type: str, overwrite=False)[source]
Add a file to the dataset. This function takes the path to a file and copies it into the correct spot inside the dataset depending on the file type. The destination file can contain parent folders, which will be preserved in the copy (e.g. src_path="/home/user/tmp/myrec.wav", dst_file="loc1/RA5/rec001.wav", file_type='recording' will copy the file into the recordings/raw/loc1/RA5 folder of the dataset, under the name rec001.wav).
- Parameters:
src_path (Path | str) – path on the system to the file to add to the dataset
dst_file (Path | str) – filename as it will be stored in the dataset, with possible subfolder(s) (e.g. 'location1/RA5/rec004.wav' will copy the original file as rec004.wav inside folders location1 -> RA5)
file_type (str) – type of the file to copy, used to determine where it should be stored in the dataset; choose any of 'recording', 'metadata', 'extra' or 'raw' ('raw' is just copied from the root of the dataset into any folder)
overwrite (bool, optional) – overwrite the existing destination if it already exists
- compute_ages(recordings: DataFrame | None = None, children: DataFrame | None = None, age_format: str = 'months') Series [source]
Compute the age of the subject child for each recording (in months, as a float) and return it as a pandas Series object.
Example:
>>> from ChildProject.projects import ChildProject
>>> project = ChildProject("examples/valid_raw_data")
>>> project.read()
>>> project.recordings["age"] = project.compute_ages()
>>> project.recordings[["child_id", "date_iso", "age"]]
      child_id    date_iso       age
line
2            1  2020-04-20  3.613963
3            1  2020-04-21  3.646817
- Parameters:
recordings (pd.DataFrame, optional) – custom recordings DataFrame (see Metadata), otherwise use all project recordings, defaults to None
children (pd.DataFrame, optional) – custom children DataFrame (see Metadata), otherwise use all project children data, defaults to None
age_format (str, optional) – format to use for the output age, defaults to months; choose between ['months', 'days', 'weeks', 'years']
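The underlying age computation can be illustrated with a minimal self-contained sketch, assuming the (hypothetical) convention of one month = 365.25/12 days and a child born on 2020-01-01; the figures line up with the doctest above:

```python
from datetime import date

DAYS_PER_MONTH = 365.25 / 12  # assumed convention: mean month length in days

def age_in_months(birth: date, recording: date) -> float:
    """Age of the child at recording time, in months (as a float)."""
    return (recording - birth).days / DAYS_PER_MONTH

# a child born 2020-01-01 (assumed), recorded 2020-04-20: 110 days
print(round(age_in_months(date(2020, 1, 1), date(2020, 4, 20)), 6))  # 3.613963
```

The `age_format` options would only change the divisor (1 for days, 7 for weeks, 365.25 for years).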
- compute_recordings_duration(profile: str | None = None) DataFrame [source]
Compute the duration of the recordings.
- Parameters:
profile (str, optional) – name of the profile of recordings to compute the duration from. If None, raw recordings are used. defaults to None
- Returns:
dataframe of the recordings, with an additional/updated duration column.
- Return type:
pd.DataFrame
- get_converted_recording_filename(profile: str, recording_filename: str) str [source]
retrieve the converted filename of a recording under a given profile, from its original filename.
- Parameters:
profile (str) – recording profile
recording_filename (str) – original recording filename, as indexed in the metadata
- Returns:
corresponding converted filename of the recording under this profile
- Return type:
str
- get_recording_path(recording_filename: str, profile: str | None = None) Path [source]
return the path to a recording
- Parameters:
recording_filename (str) – recording filename, as in the metadata
profile (str, optional) – name of the conversion profile, defaults to None
- Returns:
path to the recording
- Return type:
Path
- get_recordings_from_list(recordings: list, profile: str | None = None) DataFrame [source]
Recover recordings metadata from a list of recording names or paths to recordings.
- Parameters:
recordings (list) – list of recording names or paths
- Returns:
matching recordings
- Return type:
pd.DataFrame
- read(verbose=False, accumulate=True)[source]
Read the metadata from the project and store it in the recordings and children attributes.
- Parameters:
verbose (bool) – read with additional output
accumulate (bool) – add metadata from subfolders (usually confidential metadata)
- remove_project_file(file, file_type: str)[source]
Remove a file from the dataset. This function takes the path to a file and removes it from the dataset at the file-system level (not in the index). The file may be inside a subfolder, which must be included in the file name as a POSIX path (i.e. subfolder/file). file_type defines the category of the file in the dataset; each category corresponds to a subfolder path.
- Parameters:
file (Path | str) – filename as it is stored in the dataset, within the tree of its category (e.g. recording names are evaluated inside the recordings/raw folder of the dataset)
file_type (str) – type of the file to remove, used to determine where it is stored in the dataset; choose any of 'recording', 'metadata', 'extra' or 'raw'
- validate(ignore_recordings: bool = False, profile: str | None = None, accumulate: bool = True, current_metadata=False, custom_metadata=None) tuple [source]
Validate a dataset, returning all errors and warnings.
- Parameters:
ignore_recordings (bool, optional) – if True, no errors will be returned for missing recordings.
profile (str, optional) – profile of recordings to use
accumulate – use accumulated metadata (usually confidential metadata if present)
current_metadata (bool, optional) – validate the currently set metadata, without reacquiring it from the files
- Returns:
A tuple containing the list of errors, and the list of warnings.
- Return type:
a tuple of two lists
- write_children(keep_discarded: bool = True, skip_validation=False, keep_original_columns: bool = True)[source]
Write self.children to the children csv file of the dataset. !! If read() was done with accumulate, you may write confidential information to children.csv !!
- Parameters:
keep_discarded (bool, optional) – if True, the lines in the csv that are discarded by the dataset are kept when writing. Defaults to True (when False, discarded lines disappear from the dataset)
skip_validation (bool, optional) – if True, writes the children without checking if the dataset is valid
keep_original_columns (bool, optional) – if True, deleting columns in the children dataframe will not result in them disappearing from the csv file (if False, only the current columns are kept)
- Returns:
dataframe that was written to the csv file
- Return type:
pandas.DataFrame
- write_recordings(keep_discarded: bool = True, skip_validation=False, keep_original_columns: bool = True)[source]
Write self.recordings to the recordings csv file of the dataset. !! If read() was done with accumulate, you may write confidential information to recordings.csv !!
- Parameters:
keep_discarded (bool, optional) – if True, the lines in the csv that are discarded by the dataset are kept when writing. defaults to True (when False, discarded lines disappear from the dataset)
skip_validation (bool, optional) – if True, writes the recordings without checking if the dataset is valid
keep_original_columns (bool, optional) – NOT IMPLEMENTED, if True, deleting columns in the recordings dataframe will not result in them disappearing from the csv file (if false, only the current columns are kept)
- Returns:
dataframe that was written to the csv file
- Return type:
pandas.DataFrame
ChildProject.tables module
- exception ChildProject.tables.IncorrectDtypeException[source]
Bases:
Exception
Exception raised when an unexpected dtype is found in a pandas DataFrame.
- class ChildProject.tables.IndexColumn(name='', description='', required=False, regex=None, filename=False, datetime=None, function=None, choices=None, dtype=None, unique=False, generated=False, annotation_columns=None)[source]
Bases:
object
- class ChildProject.tables.IndexTable(name, path=None, columns=[], enforce_dtypes: bool = False)[source]
Bases:
object
ChildProject.utils module
- ChildProject.utils.calculate_shift(file1, file2, start1, start2, interval)[source]
Take 2 audio files, a starting point for each, and a length to compare (in seconds); return a divergence score representing the average difference in audio signal.
- Parameters:
file1 (str) – path to the first wav file to compare
file2 (str) – path to the second wav file to compare
start1 (int) – starting point for the comparison in seconds for the first audio
start2 (int) – starting point for the comparison in seconds for the second audio
interval (int) – length to compare between the 2 audios on in seconds
- Returns:
tuple of divergence score and number of values used
- Return type:
(float, int)
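The divergence idea can be sketched with plain numpy, assuming already-decoded signals as arrays and a hypothetical `divergence` helper (the real function reads the wav files itself; the sample rate and the mean-absolute-difference metric here are assumptions):

```python
import numpy as np

def divergence(sig1: np.ndarray, sig2: np.ndarray, start1: int, start2: int,
               interval: int, rate: int = 16000):
    """Hypothetical sketch: average absolute sample difference between two
    signals over a window of `interval` seconds, plus the sample count used."""
    a = sig1[start1 * rate : (start1 + interval) * rate]
    b = sig2[start2 * rate : (start2 + interval) * rate]
    n = min(len(a), len(b))  # compare only the overlapping portion
    if n == 0:
        return float("nan"), 0
    return float(np.mean(np.abs(a[:n] - b[:n]))), n

# two identical one-second signals diverge by 0
t = np.linspace(0, 1, 16000)
sig = np.sin(2 * np.pi * 440 * t)
score, n = divergence(sig, sig, 0, 0, 1)
print(score, n)  # 0.0 16000
```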
- ChildProject.utils.df_to_printable(df: DataFrame, delimiter: str = ' ', header: bool = False) str [source]
Take a DataFrame and create a terminal-printable string representing the output within a reasonable window, with options to make the output either aligned (like ls -l) or parsable (with a defined delimiter), in the order given.
- Parameters:
df (pd.DataFrame) – pandas DataFrame containing the data to print
delimiter (str) – Character delimiting fields, when char is in the fields, escape those with the escape char
visual (bool) – Whether to align the columns of the output and ignores escaping characters (output is not parsable)
escape_char (str) – Character escaping fields when those contain the delimiting char
header (bool) – First line of the output is the header, containing the name of the columns
- Returns:
representation of the dataframe
- Return type:
str
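A minimal sketch of the aligned (ls -l style) variant, using a hypothetical `printable` helper restricted to the `delimiter` and `header` parameters (escaping is omitted):

```python
import pandas as pd

def printable(df: pd.DataFrame, delimiter: str = " ", header: bool = False) -> str:
    """Pad every column to its widest cell so the output lines up (a sketch)."""
    rows = df.astype(str).values.tolist()
    if header:
        rows.insert(0, [str(c) for c in df.columns])
    widths = [max(len(r[i]) for r in rows) for i in range(len(rows[0]))]
    return "\n".join(
        delimiter.join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )

df = pd.DataFrame({"name": ["rec001.wav", "rec2.wav"], "dur": [3600, 45]})
print(printable(df, header=True))
```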
- ChildProject.utils.find_lines_involved_in_overlap(df: DataFrame, onset_label: str = 'range_onset', offset_label: str = 'range_offset', labels=[])[source]
takes a dataframe as input. The dataframe is expected to have a column for the onset of a timeline and one for the offset. The function returns a boolean Series where indexes set to 'True' are lines involved in overlaps and 'False' otherwise. E.g. to select all lines involved in overlaps, use:
` ovl_segments = df[find_lines_involved_in_overlap(df, 'segment_onset', 'segment_offset')] `
and to select lines that never overlap, use: ` ovl_segments = df[~find_lines_involved_in_overlap(df, 'segment_onset', 'segment_offset')] `
- Parameters:
df (pd.DataFrame) – pandas DataFrame where we want to find overlaps, having some time segments described by 2 columns (onset and offset)
onset_label (str) – column label for the onset of time segments
offset_label (str) – columns label for the offset of time segments
labels (list[str]) – list of column labels that are required to match to be involved in overlap.
- Returns:
pandas Series of boolean values where ‘True’ are indexes where overlaps exist
- Return type:
pd.Series
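The overlap test can be sketched as a quadratic pass over the dataframe, using a hypothetical `lines_in_overlap` helper (the real function may be vectorized differently):

```python
import pandas as pd

def lines_in_overlap(df: pd.DataFrame, onset: str = "range_onset",
                     offset: str = "range_offset") -> pd.Series:
    """Boolean Series: True for rows whose [onset, offset) interval overlaps
    any other row's interval (a quadratic sketch of the idea)."""
    def overlaps_any(row):
        others = df.drop(row.name)  # compare against every other segment
        return bool(((others[onset] < row[offset]) &
                     (others[offset] > row[onset])).any())
    return df.apply(overlaps_any, axis=1)

df = pd.DataFrame({"segment_onset": [0, 5, 20], "segment_offset": [10, 15, 30]})
mask = lines_in_overlap(df, "segment_onset", "segment_offset")
print(mask.tolist())  # [True, True, False]
```

`df[mask]` then selects the overlapping segments, `df[~mask]` the isolated ones.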
- ChildProject.utils.printable_unit_duration(duration)[source]
From a duration in milliseconds, return a string with an appropriate unit among ms, seconds, minutes and hours.
- Parameters:
duration (int) – duration in milliseconds
- Returns:
converted duration with unit letter
- Return type:
str
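A sketch of such a unit-picking conversion, with hypothetical thresholds and formatting (the actual cutoffs and output format of `printable_unit_duration` may differ):

```python
def printable_duration(duration_ms: int) -> str:
    """Pick the largest unit that keeps the value readable (an illustrative sketch)."""
    if duration_ms < 1000:
        return f"{duration_ms}ms"
    seconds = duration_ms / 1000
    if seconds < 60:
        return f"{seconds:.1f}s"
    minutes = seconds / 60
    if minutes < 60:
        return f"{minutes:.1f}m"
    return f"{minutes / 60:.1f}h"

print(printable_duration(500))      # 500ms
print(printable_duration(90000))    # 1.5m
print(printable_duration(7200000))  # 2.0h
```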
- ChildProject.utils.series_to_datetime(time_series, time_index_list, time_column_name: str, date_series=None, date_index_list=None, date_column_name=None)[source]
returns a series of datetimes from a series of str, using pd.to_datetime with all the formats listed for a specific column name in an index consisting of IndexColumn items. To have the date included (and not only the time), one can use a second series for the date, with its corresponding index and column name.
- Parameters:
time_series (pandas.Series) – pandas series of strings to transform into datetime (can contain NA value => NaT datetime), if date_series is given, time_series should only have the time
time_index_list (List[IndexColumn]) – list of index to use where the column wanted is present
time_column_name (str) – name of the IndexColumn to use (IndexColumn.name value) for accepted formats
date_series (pandas.Series) – pandas series of strings to transform into the date component of datetime (can contain NA value)
date_index_list (List[IndexColumn]) – list of index to use where the column wanted is present
date_column_name (str) – name of the IndexColumn to use (IndexColumn.name value) for accepted formats for dates
- Returns:
series with dtype datetime containing the converted datetimes
- Return type:
pandas.Series
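The multi-format parsing strategy can be sketched with `pd.to_datetime` and a hypothetical list of accepted formats (the real function reads the formats from IndexColumn definitions):

```python
import pandas as pd

# hypothetical list of accepted formats, as an IndexColumn might declare them
TIME_FORMATS = ["%H:%M:%S", "%H:%M"]

def parse_times(series: pd.Series, formats) -> pd.Series:
    """Try each format in turn; values matching none stay NaT (a sketch)."""
    result = pd.Series(pd.NaT, index=series.index, dtype="datetime64[ns]")
    for fmt in formats:
        # errors="coerce" turns non-matching values into NaT instead of raising
        parsed = pd.to_datetime(series, format=fmt, errors="coerce")
        result = result.fillna(parsed)
    return result

times = pd.Series(["08:57:00", "21:04", "not a time"])
print(parse_times(times, TIME_FORMATS))
```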
- ChildProject.utils.time_intervals_intersect(ti1: TimeInterval, ti2: TimeInterval)[source]
given 2 time intervals (these do not take days into consideration, only the time of day), return an array of new interval(s) representing the intersections of the original ones.
Examples:
1. time_intervals_intersect(TimeInterval(datetime(1900,1,1,8,57), datetime(1900,1,1,21,4)), TimeInterval(datetime(1900,1,1,10,36), datetime(1900,1,1,22,1))) => [TimeInterval(10:36, 21:04)]
2. time_intervals_intersect(TimeInterval(datetime(1900,1,1,8,57), datetime(1900,1,1,22,1)), TimeInterval(datetime(1900,1,1,21,4), datetime(1900,1,1,10,36))) => [TimeInterval(08:57, 10:36), TimeInterval(21:04, 22:01)]
- Parameters:
ti1 (TimeInterval) – first interval
ti2 (TimeInterval) – second interval
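The handling of wrap-around intervals (example 2 above) can be sketched in minutes-of-day with a hypothetical `intersect` helper; splitting a wrapped interval into two same-day segments is what produces the two-interval result shown:

```python
from datetime import time

def to_minutes(t: time) -> int:
    return t.hour * 60 + t.minute

def split(start: int, end: int):
    """A wrapped interval (end before start) becomes two same-day segments."""
    return [(start, end)] if start <= end else [(start, 1440), (0, end)]

def intersect(i1, i2):
    """Intersect two time-of-day intervals, given as (time, time) pairs;
    returns a sorted list of (start, end) minute pairs (an illustrative sketch)."""
    out = []
    for a0, a1 in split(*map(to_minutes, i1)):
        for b0, b1 in split(*map(to_minutes, i2)):
            lo, hi = max(a0, b0), min(a1, b1)
            if lo < hi:
                out.append((lo, hi))
    return sorted(out)

# example 2 from the docstring: the second interval wraps past midnight
res = intersect((time(8, 57), time(22, 1)), (time(21, 4), time(10, 36)))
print([(f"{a//60:02d}:{a%60:02d}", f"{b//60:02d}:{b%60:02d}") for a, b in res])
# [('08:57', '10:36'), ('21:04', '22:01')]
```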