RDataLoader

Feed ROOT data directly into models for machine learning training.

RDataLoader streams ROOT data into machine learning frameworks as batches that are ready for training. It takes any RDataFrame as input, so the full ROOT ecosystem is available for filtering, defining new variables and applying selections, and it delivers batches of your dataset to NumPy, TensorFlow and PyTorch through a simple iteration interface.

Note
RDataLoader is part of ROOT.Experimental.ML and is currently experimental. The API may change between ROOT releases.

Cheat Sheet

A one-page quick reference covering the API.


⬇ Download cheat sheet (PDF)

Getting your data ready

RDataLoader takes an RDataFrame as input. This means your data preparation (selecting events, computing new variables, applying cuts, etc.) all happens before the loader is created, using the full power of RDataFrame:

import math
import ROOT

# Open a ROOT file and create an RDataFrame
rdf = ROOT.RDataFrame("events", "file.root")

# Define a Python callback to compute a new variable
def invariant_mass(E: float, p: float) -> float:
    return math.sqrt(E**2 - p**2)

# Apply selections and compute derived features
rdf = rdf.Filter("nMuons >= 2") \
         .Define("inv_mass", invariant_mass, ["E", "p"])

Then pass your RDataFrame to RDataLoader:

from ROOT.Experimental.ML import RDataLoader

dl = RDataLoader(rdf,
                 columns=["inv_mass", "label"],
                 batch_size=64,
                 batches_in_memory=1000,
                 target="label")

# Iterate your batches as PyTorch tensors: X contains inv_mass, y contains label
for X, y in dl.as_torch():
    ...

The sections below explain how to configure the loader and get the most out of it.

Configuring the RDataLoader

Selecting columns and target

columns selects which branches to load. target names the label column; it is returned separately as y when you iterate, so you don't need to split it off manually:

dl = RDataLoader(
    rdf,
    columns=["inv_mass", "pt", "eta", "label"],
    target="label",
    batch_size=256,
    batches_in_memory=1000
)

You can also pass multiple targets:

dl = RDataLoader(rdf,
                 columns=["x1", "x2", "x3", "y1", "y2"],
                 target=["y1", "y2"],
                 batch_size=256)

for X, y in dl.as_torch():
    # X.shape: (256, 3)
    # y.shape: (256, 2)
    ...
Warning
target must appear in the columns list.

Batch size and memory

batch_size controls how many events are in each batch. batches_in_memory controls how many batches are held in the shuffle buffer at any time:

dl = RDataLoader(rdf,
                 batch_size=256,
                 batches_in_memory=20)  # default: 10
  • Higher batches_in_memory - larger shuffle buffer, better randomisation, higher memory use
  • Lower batches_in_memory - lower memory use, more limited shuffling
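
As a rough rule of thumb, the shuffle buffer holds about batch_size × batches_in_memory events. The sketch below estimates its footprint before you tune the setting; it assumes every loaded (and RVec-expanded) column is stored as a 4-byte float, which may not match the actual internal representation:

batch_size = 256
batches_in_memory = 20
n_columns = 12                       # number of loaded (expanded) columns
bytes_per_value = 4                  # assumption: float32 storage
buffer_bytes = batch_size * batches_in_memory * n_columns * bytes_per_value
print(f"approx. shuffle buffer: {buffer_bytes / 1024**2:.1f} MiB")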

Shuffling and reproducibility

Shuffling is enabled by default. To make runs reproducible, fix the seed:

dl = RDataLoader(rdf, batch_size=256,
                 shuffle=True,   # enabled by default
                 set_seed=42)    # same order every run

RVec / variable-length branches

ROOT branches that store variable-length arrays (RVec) must be given a maximum size. Shorter entries are padded with vec_padding (0.0 by default) and each branch is expanded into numbered columns:

dl = RDataLoader(
    rdf,
    columns=["jets_pt", "jets_eta", "label"],
    max_vec_sizes={"jets_pt": 10, "jets_eta": 10},
    vec_padding=0.0,
    target="label",
)
# jets_pt expands to jets_pt_0, jets_pt_1, … jets_pt_9
# events with fewer than 10 jets are zero-padded
Warning
Every RVec column in columns must appear in max_vec_sizes.
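
Because each RVec column is expanded into numbered scalar columns, a model that wants per-jet features can reshape the flat feature block back into a (batch, jets, features) array. The sketch below assumes the expanded columns keep the order given in columns (jets_pt_0 … jets_pt_9 followed by jets_eta_0 … jets_eta_9); verify the ordering for your own configuration:

import numpy as np

MAX_JETS = 10
for X, y in dl.as_numpy():
    jets_pt  = X[:, :MAX_JETS]               # assumed: first 10 columns are jets_pt_*
    jets_eta = X[:, MAX_JETS:2 * MAX_JETS]   # assumed: next 10 columns are jets_eta_*
    jets = np.stack([jets_pt, jets_eta], axis=-1)  # shape: (batch, 10, 2)
    ...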

Iterating Batches

as_torch()

Yields torch.Tensor batches:

import torch

# model and num_epochs are assumed to be defined elsewhere
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

for epoch in range(num_epochs):
    for X, y in dl.as_torch():
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()

Move tensors to GPU by passing a device:

for X, y in dl.as_torch(device="cuda"):
    ...
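
If you request a device, make sure the model lives on the same device (a minimal sketch):

model = model.to("cuda")
for X, y in dl.as_torch(device="cuda"):
    pred = model(X)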

as_tensorflow()

Returns a tf.data.Dataset of tf.Tensor batches:

model.fit(dl.as_tensorflow(), epochs=10)
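
For example, a minimal Keras classifier can consume the dataset directly. This is only a sketch: the layer sizes and the assumption of three input features are illustrative:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),                       # assumes 3 feature columns
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(dl.as_tensorflow(), epochs=10)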

as_numpy()

Yields np.ndarray batches:

from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
for X, y in dl.as_numpy():
    clf.partial_fit(X, y, classes=[0, 1])

Train / Validation Split

Call train_test_split with test_size to split the dataset into two loaders, each covering a disjoint fraction of the original dataset (no data is duplicated):

train, val = dl.train_test_split(test_size=0.2)

for epoch in range(num_epochs):
    model.train()
    for X, y in train.as_torch(device):
        ...
    model.eval()
    for X, y in val.as_torch(device):
        ...
Note
Need a three-way train / val / test split? Call train_test_split twice:
train_val, test = dl.train_test_split(test_size=0.15)
train, val = train_val.train_test_split(test_size=0.176)
# 0.176 × 0.85 ≈ 15% of the total

Advanced Features

Resampling

Correct class imbalance by oversampling the minority class or undersampling the majority class. Enable resampling by passing one RDataFrame per class:

dl = RDataLoader(
    [rdf_signal, rdf_background],
    columns=["inv_mass", "label"],
    target="label",
    batch_size=256,
    batches_in_memory=1000,
    load_eager=True,
    sampling_type="oversampling",   # or "undersampling"
    sampling_ratio=1.0,
)
Warning
This feature is only available in eager loading mode (load_eager=True).
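
If signal and background live in a single dataset, the two RDataFrames can be derived by filtering on the label column before constructing the loader. A sketch, assuming a 0/1 label branch:

rdf = ROOT.RDataFrame("events", "file.root")
rdf_signal     = rdf.Filter("label == 1")
rdf_background = rdf.Filter("label == 0")

dl = RDataLoader([rdf_signal, rdf_background],
                 columns=["inv_mass", "label"],
                 target="label",
                 load_eager=True,
                 sampling_type="oversampling")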

Event weights

If your dataset has a weight column, pass its name to weights. It is returned as a third value w alongside X and y:

dl = RDataLoader(rdf,
                 columns=["inv_mass", "label", "weight"],
                 target="label",
                 weights="weight")

# use a per-event loss (e.g. reduction="none") so the weights apply event by event
for X, y, w in dl.as_torch():
    loss = (loss_fn(model(X), y) * w).mean()

Eager loading

By default the loader reads data lazily, one chunk of data at a time. For small datasets that fit in memory and will be iterated many times, eager loading pays a one-time cost at construction and then serves every epoch from memory:

dl = RDataLoader(rdf, batch_size=256, load_eager=True)
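
A simple way to see the effect is to time successive epochs: the full read happens once when the loader is constructed, so every epoch afterwards is served from memory (a sketch; the timings are only illustrative):

import time

dl = RDataLoader(rdf, batch_size=256, load_eager=True)  # one-time read cost paid here

for epoch in range(3):
    start = time.time()
    for X, y in dl.as_numpy():
        pass  # training step goes here
    print(f"epoch {epoch}: {time.time() - start:.2f} s")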

API Reference

RDataLoader(rdataframes, ...)

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| rdataframes | RDF \| list | - | One or more RDataFrames to load from |
| batch_size | int | 64 | Number of events per batch |
| batches_in_memory | int | 10 | Shuffle buffer size, in batches |
| columns | list[str] | None | Branches to load; all branches if not given |
| max_vec_sizes | dict | None | Maximum size per RVec column |
| vec_padding | float | 0.0 | Pad value for short RVec entries |
| target | str \| list | None | Label column(s), returned as y |
| weights | str | "" | Event weight column, returned as w |
| shuffle | bool | True | Randomise event order |
| drop_remainder | bool | True | Drop the last incomplete batch |
| set_seed | int | 0 | RNG seed; 0 means random |
| load_eager | bool | False | Load the full dataset into RAM |
| sampling_type | str | "" | "oversampling" or "undersampling" |
| sampling_ratio | float | 1.0 | Minority/majority ratio after resampling |
| replacement | bool | False | Undersample with replacement |