for all fields of science relying on data

Tutorial at ISC-HPC 2026

We will present the tutorial “Large-Scale Data Version Control for HPC and HTC with Git and DataLad” at ISC-HPC 2026 in Hamburg on Monday, June 22.

Presenters:

Dr. Adina S. Wagner,

Dr. Andreas Knüpfer

Motivation

Imagine you work on a large AI training project where the training data is produced by extensive HPC simulations. With Git, you can easily track the exact version of all software involved, from the parallel simulation code to the PyTorch DNN training scripts. But can you reliably answer which version of the AI model was trained with which exact version of the huge training data? While the HPC simulations continue to produce more training data? After nine faulty data files had to be replaced last January? When the health of patients may depend on it? Quite a challenge, isn’t it?

The tutorial will teach how to solve this in much the same way as you already do it for code. Using Git (git-scm.com), git-annex (https://git-annex.branchable.com), DataLad (https://handbook.datalad.org), and many of your familiar tools, you get full version control, branching, merging, and collaboration through Git forges like GitLab or GitHub.
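To give a taste of that workflow, here is a minimal sketch (dataset and file names are invented for illustration) of putting data under version control with DataLad:

```shell
# Create a new DataLad dataset: a joint Git/git-annex repository.
# The text2git configuration keeps small text files directly in Git
# and routes large/binary files to git-annex.
datalad create -c text2git my-project
cd my-project

# Add a (here: tiny, stand-in) data file and record it in a commit
echo "simulation output placeholder" > result.dat
datalad save -m "Add first simulation result"

# The history is plain Git, so familiar tooling keeps working
git log --oneline
```

Because the result is an ordinary Git repository underneath, pushing it to a forge and collaborating via branches and merges works as usual.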

Once code and data live in connected repositories, the next logical step is to record which code does what with which data files, pinning the exact version of each via its Git commit hash. This brings full machine-actionable reproducibility. Finally, all of this needs to be adapted to HPC environments.
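With DataLad, such a record is created by wrapping the command execution in `datalad run`. A minimal sketch (the script and file names are invented):

```shell
# Wrap a command execution so that DataLad records the command, its
# inputs, and its outputs in a machine-actionable commit record.
datalad run \
  -m "Train model on current data" \
  --input  "data/train.csv" \
  --output "model.pt" \
  "python train.py data/train.csv model.pt"

# The resulting commit embeds a structured record that can later be
# replayed against exactly the same input versions
git log -1
```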

In this half-day tutorial you will learn about the open-source tools and practice using them on HPC clusters (we will provide tutorial HPC accounts at JSC Jülich). You will work with ESA Hubble image data (https://esahubble.org/images/potw/, available under a Creative Commons license) and with Slurm batch jobs. Your results will become part of the shared, collaborative tutorial repository to which all attendees contribute. Feel free to also try everything on your existing HPC accounts.

Abstract

DataLad is a domain-agnostic data management system based on the version control tools Git and git-annex. Its core data structure, the DataLad Dataset, is a joint Git/git-annex repository that provides version control for data, code, and software containers. Unlike plain Git, this combination is suitable for large and binary files. In addition, DataLad offers computational reproducibility by capturing the outcome of process executions in a machine-actionable reproducibility record.

In high performance and high throughput computing, version control and reproducibility management conflict with efficient and highly concurrent processing.
[1] developed a large-scale processing framework centered around DataLad, and prototyped it on different HPC systems.
With [2], this work has been extended to integrate directly with the SLURM job scheduler and to avoid further inefficient behavior patterns that may emerge on parallel file systems.

This tutorial shall enable participants to understand the importance and the difficulties of version control and reproducibility management in HPC and, in a hands-on fashion, introduce them to DataLad and the DataLad-SLURM extension, so that they can bring these valuable concepts to their own HPC systems and use cases.

[1] https://www.nature.com/articles/s41597-022-01163-2
[2] https://arxiv.org/abs/2505.06558

Outline

Introduction (30 min)

  • The Git ecosystem including git forges
  • Why is standard Git not good for binary files?
  • F.A.I.R. research data management and reproducibility in science
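Why standard Git struggles with binary files is easy to demonstrate: Git keeps every committed version of a file in the repository forever, and binary data neither diffs nor compresses well. A small self-contained demo (the directory name is invented):

```shell
# Plain Git keeps every version of a binary blob in .git forever.
git init -q bloat-demo && cd bloat-demo
git config user.name "demo" && git config user.email "demo@example.com"

# Simulate a 1 MB binary artifact and commit three revisions of it
for i in 1 2 3; do
  head -c 1048576 /dev/urandom > data.bin
  git add data.bin
  git commit -q -m "data version $i"
done

# The object store now holds all three incompressible versions,
# so the repository weighs roughly 3 MB for a 1 MB working file
du -sh .git
```

git-annex avoids this by storing only lightweight pointers in Git and keeping the file content in a separate, optionally remote, object store.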

DataLad version control (45 min)

  • The git-annex extension and external storages for large data
  • The DataLad tool on top of git and its sub-commands
  • Hands-on: Get to know the tutorial repository
  • Hands-on: Add new data to the tutorial repository
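The hands-on steps above follow a pattern like the following sketch (the URL and file name are placeholders, not the actual tutorial repository):

```shell
# Cloning is fast: only lightweight metadata is transferred,
# not the annexed file content itself.
datalad clone https://example.org/tutorial-dataset.git tutorial
cd tutorial

# Fetch the content of just the files you need, on demand
datalad get images/potw2301a.jpg

# Drop local content again to free space; the version record stays
datalad drop images/potw2301a.jpg
```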

DataLad reproducibility (45 min)

  • The DataLad subcommands for machine-actionable reproducibility
  • The YODA principles for data repositories
  • Hands-on: Use the DataLad run subcommand
  • Hands-on: Reproduce somebody else’s result with DataLad rerun
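Replaying a recorded computation then looks roughly like this sketch:

```shell
# 'datalad rerun' reads the reproducibility record from the given
# commit, fetches the recorded inputs, and re-executes the command.
datalad rerun HEAD          # repeat the most recent recorded run

# Or replay a whole range of recorded computations
datalad rerun --since=HEAD~3
```

This is what makes the records machine-actionable: anyone with access to the repository can re-execute somebody else’s computation without knowing its details.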

Coffee Break

DataLad in HPC with SLURM (45 min)

  • The complication with DataLad run and SLURM batch processing
  • The DataLad batch scheduling extension
  • Hands-on: Run many reproducible batch jobs at a time with DataLad
  • Hands-on: Migrate results to another HPC cluster and continue there
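To illustrate the complication: the naive combination of the two wraps each batch job’s payload in `datalad run`, as in the hypothetical sketch below (all names invented). With many concurrent jobs, the commits these runs create contend for the same shared repository, which is the kind of problem the batch scheduling extension addresses.

```shell
# Hypothetical sketch of the *naive* approach: every Slurm job wraps
# its payload in 'datalad run'. Concurrent jobs then compete to
# commit into the same repository.
cat > job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=repro-demo
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
datalad run -m "process chunk ${SLURM_ARRAY_TASK_ID}" \
  --input  "raw/chunk_${SLURM_ARRAY_TASK_ID}.dat" \
  --output "out/chunk_${SLURM_ARRAY_TASK_ID}.res" \
  "./process.sh raw/chunk_${SLURM_ARRAY_TASK_ID}.dat out/chunk_${SLURM_ARRAY_TASK_ID}.res"
EOF

# Submitted, e.g., as an array of 100 jobs:
# sbatch --array=0-99 job.sbatch
```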

Outlook on additional and advanced features (30 min)

  • Considerations for parallel HPC filesystems
  • DataLad simplifies hierarchical git submodules
  • Containerized computations with DataLad
  • Outlook on integrated metadata management

Wrap Up (15 min)

  • Summary and pointers to further resources

Preparation

To work on the tutorial’s hands-on parts, please bring: 

  • Your laptop, on which you can install software (at least into a Python venv)
  • Optionally: existing HPC accounts where you want to test the hands-on steps in addition to the local steps on the laptop