With RDataFrame, ROOT offers a modern, high-level interface for analysis of data stored in TTree, CSV and other data formats, in C++ or Python.
The RDataFrame reference guide contains detailed information on ROOT dataframes. Keep reading for a brief introduction to the main concepts. The RDataFrame tutorials provide many code examples.
Data analysis with RDataFrame
RDataFrame provides the necessary methods to perform all operations required by your analysis.
Every RDataFrame program follows this workflow:
Construct a dataframe object by specifying a dataset. RDataFrame supports single TTrees as well as multiple TTrees (i.e., TChain), CSV files, SQLite files, RNTuples, and it can be extended to custom data formats. From Python, NumPy arrays can be imported into RDataFrame as well.
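Transform the dataframe by applying filters, which select only specific rows of the dataset, and by defining new columns, which hold the result of a computation performed for every row.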
Produce results. Actions are used to aggregate data into results. Most actions are lazy, i.e. they are not executed on the spot, but registered with RDataFrame and executed only when a result is accessed for the first time. The most typical result produced by ROOT analyses is a histogram, but RDataFrame supports any kind of data aggregation operation, including writing out new ROOT files.
How does it look in code?
This is a simple cut-and-fill with RDataFrame:
The lazy triggering of the event loop (i.e. the loop over all data) makes it easy to generate multiple results while reading the data only once:
Define expressions can consist of any callable type (e.g. C++11 lambda expressions). Strings containing valid C++ code are also supported, and usually save some typing at a little cost in performance.
As a last example, let’s filter the events, define a new quantity, produce a control plot and write out the filtered dataset, all in the same multi-threaded event loop:
Python usage looks very similar. Note that in Python, Defines require C++ code strings as expressions:
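A minimal Python sketch of this pattern follows. It is an illustration under stated assumptions rather than the original snippet: the function name `cut_and_fill`, the toy columns `x`, `y`, `z` and the histogram name `h_x` are made up for the example, and the dataset is generated in memory instead of being read from a file. It also shows lazy booking: two results are registered before any data is processed, and a single event loop fills both.

```python
def cut_and_fill(n_entries=100):
    """Book a sum and a histogram on a toy dataset, then run one event loop."""
    import ROOT  # requires a ROOT installation with Python bindings

    # Toy dataset (an assumption for illustration): no input file, just
    # n_entries rows with columns derived from the entry number.
    df = (
        ROOT.RDataFrame(n_entries)
        .Define("x", "static_cast<double>(rdfentry_)")
        .Define("y", "2.0 * x")
    )

    # Filter and Define take C++ expressions written as strings.
    selected = df.Filter("x > 10").Define("z", "sqrt(x*x + y*y)")

    # Both actions are lazy: nothing has been computed at this point.
    z_sum = selected.Sum("z")
    h_x = selected.Histo1D(("h_x", "x after the cut;x;entries", 50, 0.0, 100.0), "x")

    # Accessing the first result runs the event loop once and fills both.
    return z_sum.GetValue(), h_x
```

Calling `cut_and_fill()` returns the value of the sum together with the result object for the histogram; the histogram can then be drawn with `h_x.Draw()`. To read a real dataset instead, the toy construction would be replaced by something like `ROOT.RDataFrame("myTree", "file.root")`, where both names are placeholders.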
For more examples see the RDataFrame tutorials.
Working with collections
RDataFrame reads collections as the special type RVec: for example, a branch containing an array of floating point numbers can be read as an RVecF. C-style arrays (with variable or static size), STL vectors and most other collection types can be read this way.
RVec is a container similar to std::vector (and can be used just like a std::vector) but it also offers a rich interface to operate on the array elements in a vectorised fashion, similarly to Python’s NumPy arrays.
For example, to fill a histogram with the pt of selected particles for each event, Define can be used to create a column that contains the desired array elements as follows:
And in Python:
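The sketch below assumes that each event carries three array columns named `px`, `py` and `E`; these names, the cut value of 100 and the helper `toy_particles` are illustrative assumptions, not part of any real dataset. The expression computes the pt of every particle in a vectorised way and keeps only the elements whose energy passes the cut.

```python
def toy_particles(n_events=10):
    """Build a toy dataframe with RVec columns (illustrative data only)."""
    import ROOT  # requires a ROOT installation with Python bindings

    # Every event gets the same three particles; a real analysis would read
    # these arrays from the input dataset instead.
    return (
        ROOT.RDataFrame(n_events)
        .Define("px", "ROOT::RVecD{3., 30., 6.}")
        .Define("py", "ROOT::RVecD{4., 40., 8.}")
        .Define("E", "ROOT::RVecD{50., 150., 200.}")
    )


def good_pt_histogram(df):
    """Histogram the pt of the particles with E > 100 in each event."""
    # sqrt(...) acts element-wise on the RVecs, and [E > 100] selects the
    # elements for which the mask is true, much like NumPy boolean indexing.
    with_good_pt = df.Define("good_pt", "sqrt(px*px + py*py)[E > 100]")

    # Filling a histogram from an RVec column adds every element of the array.
    return with_good_pt.Histo1D("good_pt")
```

With the toy data, `good_pt_histogram(toy_particles())` books a histogram that receives two values per event, since only the second and third particle pass the energy cut.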
RDataFrame can perform multi-threaded event loops to speed up the execution of its actions. Each thread will process part of the dataset, and RDataFrame will transparently merge results into the full objects returned to users.
To enable parallel data processing, call the ROOT::EnableImplicitMT() function before constructing an RDataFrame object.
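As a rough Python sketch of that ordering, assuming a toy in-memory dataset: the function name `parallel_sum`, the column `w` and the default of four threads are choices made for this example. Calling `ROOT.EnableImplicitMT()` without an argument lets ROOT choose the number of threads.

```python
def parallel_sum(n_entries=1_000_000, n_threads=4):
    """Sum a constant column using a multi-threaded event loop."""
    import ROOT  # requires a ROOT installation with Python bindings

    # Enable implicit multi-threading first; the dataframe is built afterwards.
    ROOT.EnableImplicitMT(n_threads)

    # Toy dataset (an assumption for illustration): every row holds w = 0.5.
    df = ROOT.RDataFrame(n_entries).Define("w", "0.5")

    # Each thread processes part of the entries; the partial sums are merged
    # by RDataFrame before the value is returned.
    total = df.Sum("w")
    return total.GetValue()
```

The rest of the analysis code stays the same as in the sequential case; only the call that enables multi-threading is added.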
For more information about multi-threading in ROOT, please see Multi-threading.
Experimental distributed execution
It is possible to schedule execution of an RDataFrame application on a computing cluster or other distributed computing resources thanks to the experimental Python package for distributed RDataFrame.
In most cases, no change to existing Python RDataFrame analysis code is required. For example, the following snippet schedules a simple cut-and-fill task on a Dask cluster:
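A hedged sketch of such a task is shown here. It assumes the Dask backend is exposed as `ROOT.RDF.Experimental.Distributed.Dask.RDataFrame` and accepts a `daskclient` argument, which matches the experimental package at the time of writing but may change between ROOT versions. The function name, the scheduler address, the tree and file names and the columns `x` and `y` are all placeholders.

```python
def distributed_cut_and_fill(scheduler_address, tree_name, file_name):
    """Run a cut-and-fill task on a Dask cluster (experimental API)."""
    import ROOT  # requires ROOT with the distributed RDataFrame package
    from dask.distributed import Client

    # Use the Dask-specific RDataFrame instead of the local one.
    # Assumption: this module path is the one provided by the installed ROOT.
    RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame

    # The client connects to an existing Dask scheduler, for example
    # "dask_scheduler.domain.com:8786" (a placeholder address).
    client = Client(scheduler_address)

    # Same constructor arguments as a local RDataFrame, plus the client.
    df = RDataFrame(tree_name, file_name, daskclient=client)

    # From here on, the usual RDataFrame API applies.
    total = df.Filter("x > 10").Sum("y")
    hist = df.Histo1D(("h_x", "x;x;entries", 10, 0.0, 10.0), "x")

    # Accessing a result triggers the distributed computation.
    return total.GetValue(), hist
```

Apart from the constructor and the client, the body is ordinary RDataFrame code, which is why an existing analysis usually needs few changes to run distributed.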
Through Dask, computation can be scheduled on a variety of systems, e.g. HTCondor clusters, SLURM clusters or by connecting to computing resources via SSH. Spark clusters are also supported.
For more details, see the distributed RDataFrame documentation.