We will present the tutorial “Large-Scale Data Version Control for HPC and HTC with Git and DataLad” at the ISC-HPC 2026 in Hamburg on Monday, June 22.
Presenters:
Dr. Adina S. Wagner,
Dr. Andreas Knüpfer
Imagine you work on a large AI training project where the training data is produced by extensive HPC simulations. You can easily track the exact version of all software involved with Git, from the parallel simulation code to the PyTorch DNN training scripts. But can you reliably say which version of the AI model was trained with which exact version of the huge training data? While the HPC simulations continue to produce more training data? After nine faulty data files had to be replaced last January? When the health of patients may depend on it? Quite a challenge, isn’t it?
The tutorial will teach you how to solve this in much the same way as you already do for code. Using Git (git-scm.com), git-annex (https://git-annex.branchable.com), DataLad (https://handbook.datalad.org), and many of your familiar tools, you get full version control, branching, merging, and collaboration through Git forges like GitLab or GitHub.
Once code and data are in connected repositories, the next logical step is to record which code does what with which data files, each identified by its exact version via a Git commit hash. This brings full machine-actionable reproducibility. Finally, all of this needs to be adapted for HPC environments.
In this half-day tutorial you will learn about the open-source tools and practice using them on HPC clusters (we’ll provide tutorial HPC accounts at JSC Jülich). You will use the ESA Hubble data (https://esahubble.org/images/potw/, under a Creative Commons license) and work with Slurm batch jobs. Your results will become part of the shared, collaborative tutorial repository to which all attendees contribute. Feel free to also try it on your existing HPC accounts.
DataLad is a domain-agnostic data management system based on the version control tools Git and git-annex. Its core data structure, the DataLad Dataset, is a joint Git/git-annex repository that provides version control for data, code, and software containers. Unlike default Git, this combination is suitable for large and binary files. In addition, DataLad offers computational reproducibility by capturing the outcome of process executions in a machine-actionable reproducibility record.
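To make the idea of a machine-actionable reproducibility record more concrete, here is a minimal Python sketch of the general concept. It is an illustration only and is not DataLad's implementation or record format: the helper names (sha256_of, run_with_record, demo), the file names, and the record layout are all hypothetical, and plain SHA-256 content hashes stand in for the version identifiers that Git and git-annex would provide.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_with_record(cmd, inputs, outputs):
    """Run cmd and return a record linking it to exact input/output versions.

    Hypothetical record layout, chosen for illustration only.
    """
    record = {
        "cmd": cmd,
        # Identify each input by content hash before the command runs.
        "inputs": {str(p): sha256_of(p) for p in inputs},
    }
    subprocess.run(cmd, check=True)
    # Identify each output by content hash after the command has finished.
    record["outputs"] = {str(p): sha256_of(p) for p in outputs}
    return record


def demo():
    """Run a toy 'processing step' and return its record and its result."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "raw.txt"
        dst = Path(tmp) / "upper.txt"
        src.write_text("hubble\n")
        # Toy processing step: upper-case the input file into the output file.
        code = (
            "import sys; "
            "open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
        )
        cmd = [sys.executable, "-c", code, str(src), str(dst)]
        record = run_with_record(cmd, [src], [dst])
        return record, dst.read_text()
```

Because the record stores the command together with content-based identifiers for every input and output, a later re-execution can be compared against it: if the recomputed output hashes match, the result was reproduced.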
In high performance and high throughput computing, version control and reproducibility management conflict with efficient and highly concurrent processing.
[1] developed a large-scale processing framework centered around DataLad, and prototyped it on different HPC systems.
With [2], this work has been extended by a direct integration with the SLURM job scheduler and by measures that avoid further inefficient behavior patterns which may emerge on parallel file systems.
This tutorial aims to help participants understand the importance and difficulties of version control and reproducibility management in HPC and to introduce them, in a hands-on fashion, to DataLad and the DataLad-SLURM extension, so that they can bring these valuable concepts to their own HPC systems and use cases.
[1] https://www.nature.com/articles/s41597-022-01163-2
[2] https://arxiv.org/abs/2505.06558
Introduction (30 min)
DataLad version control (45 min)
DataLad reproducibility (45 min)
Coffee Break
DataLad in HPC with SLURM (45 min)
Outlook on additional and advanced features (30 min)
Wrap Up (15 min)
To work on the tutorial’s hands-on parts, please bring: