How to reuse GIN datasets
=========================

Our datasets are managed with `datalad `__. Datalad allows the
versioning and distribution of large datasets. It relies on another
tool called `git-annex `__, an extension of git that provides flexible
support for versioning large files.

We host the data on `GIN `__. GIN's interface is very similar to
GitHub's, but unlike GitHub, GIN can handle our large files.

Installing datalad
------------------

The DataLad `handbook `__ provides extensive installation instructions.
If you have admin rights and you are working on Linux or Mac, the
following should work:

1. Install git-annex using ``apt install git-annex`` (linux) or
   ``brew install git-annex`` (mac). Git-annex is available by default
   on Oberon.
2. Install datalad with pip: ``pip3 install datalad``

.. note::

   If you are having permission issues, consider using python virtual
   environments or conda (see `DataLad's handbook `__). Otherwise,
   refer to your system administrator.

Set up your GIN account
-----------------------

Most repositories are private, and thus require authentication. We
recommend that you always use SSH authentication, and we only provide
instructions for this case.

First, you will need to create an account on `GIN `_ and link your
`SSH public key `_ to it:

1. Create an account on GIN
2. Copy your SSH public key (usually located in ``~/.ssh/id_rsa.pub``)
3. Go to `GIN > Settings > SSH Keys `__
4. Click on the blue button 'Add a key' and paste your public key where
   requested.

.. note::

   Remember to communicate your username to the data administrator
   before you try to access the data, so that they can grant you
   permissions.

.. note::

   You can configure as many keys as necessary. This is useful when you
   need to access GIN from different locations with different SSH keys
   (e.g. from your lab cluster, or from your own laptop).

.. note::

   If you are prompted for your SSH passphrase every time, you may
   consider enabling the Keychain on macOS by adding ``UseKeychain yes``
   to ``~/.ssh/config``.

Installing a dataset
~~~~~~~~~~~~~~~~~~~~

A dataset can be installed with the ``datalad install`` command. Its
argument is the SSH location of the dataset, which can be found on the
page of the repository on GIN:

.. figure:: images/gin.png
   :alt: Where to find the SSH url of a dataset on GIN

   A GIN dataset.

For instance, the VanDam public dataset (available on `GIN `__) can be
installed with the following command:

.. code:: bash

    datalad install git@gin.g-node.org:/LAAC-LSCP/vandam-data.git
    cd vandam-data

Datasets that contain subdatasets can be installed recursively using
the ``-r`` switch. This is the case for the EL1000 dataset:

.. code:: bash

    datalad install -r git@gin.g-node.org:/EL1000/EL1000.git
    cd EL1000

.. warning::

   Some datasets may require additional configuration steps. Pay
   attention to the README before you start using a dataset.

That's it! Your dataset is ready to go. By default, large files do not
get downloaded automatically. See the next section for help with
downloading those files.

Downloading annexed files
-------------------------

By default, files whose content does not need to be versioned are
stored as pointers only. This is usually the case for files that are
too large for us to keep every change made to them.

Files can be retrieved using ``datalad get [path]``. For instance,
``datalad get recordings`` will download all recordings.

.. note::

   Technically speaking, the annexed files in your repository are
   symbolic links pointing to their actual location, somewhere under
   ``.git``. You can ignore that and read or copy the content of these
   files as if they were regular files.

.. warning::

   If you want to *edit* the content of an annexed file, you will need
   to unlock it beforehand using ``datalad unlock``, e.g.
   ``datalad unlock annotations/vtc/converted``.
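
The install and download steps described above can be combined into a
small shell function, sketched below. The name ``gin_fetch`` is
hypothetical and not part of datalad. The sketch assumes that your SSH
key is registered on GIN and that ``datalad install`` clones into a
directory named after the repository, as in the ``vandam-data`` example.

.. code:: bash

    # Hypothetical helper (not part of datalad): install a GIN dataset
    # over SSH, then download the content of the given paths.
    # Usage: gin_fetch <organization>/<repository> [path ...]
    # The body runs in a subshell, so the caller's working directory
    # is left unchanged.
    gin_fetch() (
        repo="$1"
        shift
        datalad install "git@gin.g-node.org:/${repo}.git" || exit 1
        # Assumption: the clone lands in a directory named after the repository
        cd "${repo##*/}" || exit 1
        if [ "$#" -gt 0 ]; then
            datalad get "$@"
        fi
    )

For example, ``gin_fetch LAAC-LSCP/vandam-data recordings`` would
install the VanDam dataset and then download its recordings.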

Updating a dataset
------------------

A dataset can be updated from the sources using ``git pull`` together
with ``datalad update``.

Contributing
------------

Pushing changes to a dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can save local changes to a dataset with
``datalad save [path] -m "commit message"``. For instance:

::

   datalad save annotations/vtc/raw -m "adding vtc rttms"

``datalad save`` is analogous to a combination of ``git add`` and
``git commit``.

These changes still have to be pushed, which can be done with:

::

   datalad push
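
As a rough sketch, the save and push steps above can be chained in a
small shell function. The name ``gin_publish`` is hypothetical and not
part of datalad. The function only runs the two commands shown above
and skips the push if saving fails.

.. code:: bash

    # Hypothetical helper (not part of datalad): save the changes under
    # a path with a commit message, then push them.
    # Usage: gin_publish <path> <commit message>
    gin_publish() {
        datalad save "$1" -m "$2" && datalad push
    }

For example, ``gin_publish annotations/vtc/raw "adding vtc rttms"``
would run the two commands from this section in sequence.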