A project from scratch

This page walks through an example of creating a dataset from scratch. You may want to revisit it once you have more experience. If you just want a whirlwind tour, follow the steps even if you do not fully understand each one.

We will try to show you a little bit of everything:

1. How to create a dataset
2. How to load data
3. How to run our models on these files and move outputs into the right folders

We assume familiarity with the bash shell. I think what matters most is that you know the following commands: mv, ls, cd, mkdir.

We assume you have ChildProject and git-annex installed, alongside miniconda, as per the installation instructions. You may want to install DataLad if you follow the steps of getting data. For this make sure you’re in the conda environment and run pip install datalad. If you already have data, however, you won’t need to do this.

Getting data (in case you don’t have any)

Let’s get some data. Ignore this if you already have data. Let us start with the Vandam-Daylong dataset. This is a miniature example dataset.

This tutorial intends to show you how to start from scratch. But Vandam-Daylong already comes packed as a complete dataset. We only want the recordings. Unfortunately, but for good reasons, Vandam-Daylong requires DataLad to fetch the recordings.

Without going into the details of DataLad or the GIN platform, go into a clean folder somewhere and type:

datalad clone https://gin.g-node.org/LAAC-LSCP/vandam-data

Now cd into your vandam-data folder with cd vandam-data. Run datalad get recordings/raw/** to get the raw recordings. There is only one.

Note

While we’re jumping in and across datasets, I will always make sure that the present working directory (pwd) is no deeper than the root of the dataset. Otherwise there’s lots of jumping around folders, and it’s easy to get lost.

The raw recordings are actually stored on the GIN servers or elsewhere. This would not typically be the case for your own datasets. For this reason, the recording file itself is actually a symbolic link, a sort of pointer to a file somewhere else on your computer, inside an “annex”. To mimic a more realistic setup, let’s rip the recording out of the annex with git annex unlock recordings/raw/BN32_010007.mp3. If this is all confusing (and it surely once was for me), just run the commands and trust the process.
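If you want to see the difference for yourself, you can check whether the recording is a symbolic link before and after unlocking. The following is a minimal sketch: the function name describe_file is made up for this tutorial, and it relies only on the standard shell test command.

```shell
# Hypothetical helper (not part of git-annex or ChildProject):
# report whether a path is a symbolic link, as annexed files usually are,
# or a regular file, as expected after unlocking.
describe_file() {
    if [ -L "$1" ]; then
        echo "symlink"
    elif [ -f "$1" ]; then
        echo "regular file"
    else
        echo "missing"
    fi
}
```

On a typical Linux or macOS setup, describe_file recordings/raw/BN32_010007.mp3 should report a symlink before the unlock and a regular file afterwards.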

At this point we’re still in the vandam-data folder. Let’s step out of it and into the parent folder by running cd .. (two dots). The next step will be to actually make our dataset.

Create the dataset

Create the dataset with mkdir dataset-from-scratch. At this point it’s just an empty folder. Step into it with cd dataset-from-scratch.

Now let’s set up the boilerplate using ChildProject. Run child-project init . (the single dot refers to the current folder). If you run tree, you will see that a few folders and files were created. In particular, we have our annotations, extra, metadata, recordings and scripts folders. There are also children.csv and recordings.csv files, both empty except for their column headers. Running child-project validate . verifies that our dataset is in a clean, albeit pretty empty, state.
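If you don’t have tree installed, a quick loop does the job of confirming that the folders named above are there. This is just a sketch: check_layout is a made-up name, and the folder list is copied from the paragraph above rather than from ChildProject itself.

```shell
# Hypothetical helper: check that the folders listed in this tutorial
# exist in the current directory, and report any that are missing.
check_layout() {
    missing=0
    for dir in annotations extra metadata recordings scripts; do
        if [ ! -d "$dir" ]; then
            echo "missing folder: $dir" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Run check_layout from the root of the dataset. It prints nothing when all five folders are present.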

Note

If you’re using DataLad, as we are, for version control and (large) file management, it’s recommended to run datalad create before you run any other commands. This turns your dataset into a DataLad repository (sort of a super-powered git repository).

Adding raw recordings

The next step is to put all your raw recordings under the recordings/raw folder. If you’ve followed the steps for the Vandam-Daylong data to a tee, you’ll be able to run

mv ../vandam-data/recordings/raw/BN32_010007.mp3 ./recordings/raw/BN32_010007.mp3

to move the vandam-data raw recording into the raw recordings folder. If you have other recordings use those instead. Feel free to drag and drop instead of using mv.

Next, ChildProject needs to be made aware of these recordings. This awareness is achieved entirely through the metadata, so we will add a row to recordings.csv. Currently, recordings.csv has the following columns: experiment,child_id,date_iso,start_time,recording_device_type,recording_filename. Let us call this experiment dataset_from_scratch, and use child_id CHI_01, date_iso 2025-09-20, start_time 08:00:00, recording_device_type unknown, and recording_filename BN32_010007.mp3.

Note

That is assuming you’re using the Vandam-Daylong data. Otherwise you will need to add a row for each of your recordings, with the corresponding filenames.

To add a single row to this file run

echo dataset_from_scratch,CHI_01,2025-09-20,08:00:00,unknown,BN32_010007.mp3 >> metadata/recordings.csv

Or use your favorite text editor.
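If you have several recordings to register, it can help to wrap the echo in a small function that refuses to add a row when the audio file is not actually in recordings/raw. This is only a sketch: add_recording is a made-up name, it assumes you run it from the root of the dataset, and it takes its arguments in the column order shown above.

```shell
# Hypothetical helper: append one row to metadata/recordings.csv,
# but only if the audio file exists under recordings/raw.
# Arguments: experiment child_id date_iso start_time recording_device_type recording_filename
add_recording() {
    if [ "$#" -ne 6 ]; then
        echo "usage: add_recording EXPERIMENT CHILD_ID DATE START_TIME DEVICE FILENAME" >&2
        return 2
    fi
    if [ ! -e "recordings/raw/$6" ]; then
        echo "recordings/raw/$6 not found; row not added" >&2
        return 1
    fi
    printf '%s,%s,%s,%s,%s,%s\n' "$1" "$2" "$3" "$4" "$5" "$6" >> metadata/recordings.csv
}
```

For example, add_recording dataset_from_scratch CHI_01 2025-09-20 08:00:00 unknown BN32_010007.mp3 writes the same row as the echo command above.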

To inspect the contents of the file, run cat metadata/recordings.csv and check that all is correct. The recording now references a child with id CHI_01 in experiment dataset_from_scratch. To ChildProject this makes no sense yet, as no such child is registered in the children metadata. children.csv contains the fields experiment,child_id,child_dob. So run, for example,

echo dataset_from_scratch,CHI_01,2020-10-08 >> metadata/children.csv

Now run child-project validate . to check if everything is tied up correctly. As an odd side effect, running this command creates the metadata/annotations.csv file as well, which we will need moving forward.
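The link between the two metadata files can be pictured with a few lines of awk. This is a rough sketch of the idea only, not how ChildProject implements its validation. The name missing_children is made up, and the function assumes child_id is the second column of both files (as in the headers above) and that no field contains a comma.

```shell
# Hypothetical helper: list child_id values used in metadata/recordings.csv
# that have no matching row in metadata/children.csv.
# Assumes child_id is column 2 in both files and fields contain no commas.
missing_children() {
    awk -F, 'NR == FNR { if (FNR > 1) known[$2] = 1; next }
             FNR > 1 && !($2 in known) { print $2 }' \
        metadata/children.csv metadata/recordings.csv | sort -u
}
```

Run from the root of the dataset, missing_children prints nothing when every recording points at a registered child.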

Converting recordings

The models we run, such as VTC or ALICE, are trained on audio sampled at 16,000 Hz. As a general rule, we convert all audio, even if the sampling rate is already 16,000 Hz.

First create the converted folder with mkdir recordings/converted. Now run

child-project process . basic standard --format=wav --sampling=16000 --codec=pcm_s16le

Note

You may need to change the --format flag if you’re using anything other than .wav files.

Now run tree. You will see that we’ve created the converted/standard folder, which contains the converted recordings. Inside this folder we keep track of the parameters used to run conversion, and some updated metadata in recordings/converted/standard/recordings.csv. Notably, the original metadata/recordings.csv remains untouched. These files together point ChildProject at the converted audio, but also inform it that some audio has in fact been converted, and to always use that in favor of raw audio.
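If you want to double-check that the converted files really are at 16,000 Hz, you can read the sampling rate straight from the WAV header. The sketch below is an assumption-laden shortcut, not a ChildProject feature: wav_rate is a made-up name, it assumes a canonical WAV header (sampling rate stored in bytes 24 to 27) and a little-endian machine. If sox is installed, soxi -r reports the same number more robustly.

```shell
# Hypothetical helper: print the sampling rate stored in a WAV header.
# Assumes a canonical header (rate in bytes 24-27) and a little-endian
# machine, which covers typical x86 and ARM computers.
wav_rate() {
    rate=$(od -An -t u4 -j 24 -N 4 "$1" | tr -d '[:space:]')
    echo "$rate"
}

# Example use from the root of the dataset:
#   for f in recordings/converted/standard/*.wav; do echo "$f: $(wav_rate "$f")"; done
```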

Running models

Note

A more complete treatment of running models is found here, but this assumes tooling and infrastructure you likely won’t have

We will run a few models. These steps can be skipped if they have already been run for you. Technically they are outside the scope of ChildProject, but it is useful for anyone working with it to know how it is done. The model papers can be found in the references section of the repository README files.

I should warn you, though, this section is by far the most advanced and prone to errors. Hopefully you have someone around to help you run and debug things.

Running VTC

VTC is the Voice Type Classifier, which diarizes our audio into segments with different speakers.

This is not a tutorial on VTC, but it’s important that we know how to run it. We will momentarily step out of the dataset with cd .. (two dots).

At this point, we follow the steps here and here. Make sure sox is installed.

git clone --recurse-submodules https://github.com/MarvinLvn/voice_type_classifier.git
cd voice_type_classifier
conda env create -f vtc.yml

Now run conda env list, and you’ll see a new environment. Let’s use that one while we work with VTC.

conda deactivate; conda activate pyannote

The model uses a bash script, apply.sh. Let’s run it on what we have

Note

This may take a very long time(!!!), and extremely long if no GPU is available. I highly, highly recommend running it on a per-file basis. For anything above a few hours, and without a GPU in general, it’s best to use your institution’s resources.

The command will spit out .rttm annotation files in the output_voice_type_classifier folder.
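A quick sanity check on an .rttm file is to add up how much speech each speaker type was given. The sketch below assumes the usual RTTM layout, in which each line starts with SPEAKER, the fifth field is the segment duration in seconds and the eighth field is the speaker label. The name rttm_totals is made up for this tutorial.

```shell
# Hypothetical helper: sum the segment durations (in seconds) per speaker
# label in the given .rttm file.
# Assumes the usual RTTM layout: field 5 = duration, field 8 = label.
rttm_totals() {
    awk '$1 == "SPEAKER" { total[$8] += $5 }
         END { for (label in total) printf "%s %.2f\n", label, total[label] }' "$@" | sort
}
```

Call it on one .rttm file at a time, so that segments stored in more than one file are not counted twice.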

Running ALICE

Before we get back to our dataset, let’s step away from VTC and now run ALICE. ALICE will get unit counts, in particular phoneme, syllable and word counts over segments derived from VTC.

Note

Fun fact: ALICE actually does transfer learning on VTC, using embeddings derived from the model we just saw.

Assuming you’re still in the VTC repository folder, step out with cd .. and run

git clone --recurse-submodules https://github.com/orasanen/ALICE/
cd ALICE

We also need to make a virtual environment for this model. Run conda env create -f [ALICE_Linux.yml | ALICE_macOS.yml] depending on your operating system.

Note

On ARM processors use

CONDA_SUBDIR=osx-64 conda env create -f ALICE_macOS.yml
conda config --env --set subdir osx-64

To activate run conda deactivate; conda activate ALICE. To process your audio files, run either

./run_ALICE.sh ../dataset-from-scratch/recordings/converted/standard/

or

./run_ALICE.sh ../dataset-from-scratch/recordings/converted/standard/ gpu

if you have a CUDA-compatible GPU available. Note that this will take at least as long as the earlier VTC command.

Running VCM

This vocalisation maturity model lets us estimate occurrences of cries and of canonical or non-canonical vocalisations.

Assuming you’re in the ALICE folder, step out first with cd .. (two dots). Then run

git clone https://github.com/LAAC-LSCP/vcm.git
cd vcm
conda create -p ./conda-vcm-env pip python=3.9
conda deactivate
conda activate ./conda-vcm-env

For vcm you’ll need the SMILExtract (audio feature extractor) binary file, which must also be made executable.

wget "https://github.com/georgepar/opensmile/blob/master/bin/linux_x64_standalone_static/SMILExtract?raw=true" -O SMILExtract
chmod u+x SMILExtract

(If wget didn’t work, paste the link into your browser instead. Remember to chmod as above.) To run the model, run

./vcm.sh -a ../dataset-from-scratch/recordings/converted/standard/ -r [path to VTC .rttm output files] -s [path to SMILExtract file] -o ./outputs -j 8

As always, consult the corresponding GitHub repository if you get stuck, as there is a lot more documentation there.

Adding model outputs

Now we want to add our model outputs into our dataset. Depending on the models you’ve run, you want to create the sets (folders) for the output annotations.

Make sure you’re in the root directory of your dataset-from-scratch dataset. If you were in the VCM/VTC/ALICE folder, go there with cd ../dataset-from-scratch.

Now make the directories for the annotation sets that you’ll be working with

mkdir -p annotations/vtc/raw
mkdir -p annotations/vcm/raw
mkdir -p annotations/alice/raw

And then move the files you have

mv [path to dir with vtc output] annotations/vtc/raw
mv [path to dir with vcm output] annotations/vcm/raw
mv [path to dir with alice output] annotations/alice/raw

For the rest of the tutorial, we will focus only on vtc annotations, though vcm, alice and even lena annotations are handled similarly.

Our next aim is to populate our annotations file, which means we need to import our annotations.

child-project automated-import . --set vtc --format vtc_rttm --threads 4

You will likely, like me, get an error saying you need recording durations to be stored. If so, run child-project compute-durations . (again with a single dot for the current folder). This will add a durations column to the recordings metadata. Now run the import command above again. Do the same for vcm and alice, changing the --set and --format flags accordingly. You can run cat metadata/annotations.csv to see that some annotations have been added. We also find that a vtc/converted folder has been created.
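Once several sets have been imported, cat gets hard to read. A small awk function can count the rows per set instead. This is a sketch under assumptions: count_by_column is a made-up name, it assumes the file has a header row containing a column called set, and that no field contains a comma.

```shell
# Hypothetical helper: count rows per value of a named column in a CSV file.
# Assumes a header row and no commas inside fields (no quoted fields).
count_by_column() {
    awk -F, -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        idx     { count[$idx]++ }
        END     { for (value in count) print value, count[value] }' "$1" | sort
}
```

For example, count_by_column metadata/annotations.csv set prints one line per annotation set with the number of imported files. It prints nothing if the column name is not found.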

Getting Standard Metrics

At this point we have enough to get some metrics over our annotations. Run

child-project metrics . ACLEW.csv aclew --vtc vtc

Then run cat ACLEW.csv to inspect the output. Add in the --vcm and --alice flags if you have those data available.

Getting Conversational Information

We have gotten some metrics using the outputted segments from our models. What we can also do is post-process these segments, and transform them once more into something useful. We have pipelines for that, and one of the most useful ones is the conversations pipeline.

child-project derive-annotations . conversations --input-set vtc --output-set vtc/conversations

This will create the annotations/vtc/conversations folder with the conversational segmentation.

We can also post-process once more, getting a summary of conversational data

child-project conversations-summary --set vtc/conversations . conversations.csv standard