Data frames

With RDataFrame, ROOT offers a modern, high-level interface for analysis of data stored in TTrees, CSV files and other data formats, in C++ or Python.

→ Data frame tutorials

The following is a brief introduction to ROOT data frames. For detailed information on ROOT data frames, see → RDataFrame’s reference guide.

Data analysis with RDataFrame

RDataFrame provides the necessary methods to perform all operations required by your analysis.

Every RDataFrame program follows this workflow:

  1. Construct a data frame object by specifying a data set. RDataFrame supports single TTrees, chains of multiple TTrees (i.e., TChains), CSV files and SQLite files, and it can be extended to custom data formats.

  2. Transform the data frame by:

    • applying filters. This selects only specific rows of the data set.

    • creating custom columns. Custom columns can, for example, contain the results of a computation that must be performed for every row of the data set.

  3. Produce results. Actions are used to aggregate data into results. Most actions are lazy, i.e. they are not executed on the spot, but registered with RDataFrame and executed only when a result is accessed for the first time. The most typical result produced by ROOT analyses is a histogram, but RDataFrame supports any kind of data aggregation operation, including writing out new ROOT files.

How does it look in code?

This is a simple cut-and-fill with RDataFrame:

ROOT::RDataFrame df("mytree", {"f1.root", "f2.root"});
auto h = df.Filter("x > 0").Histo1D("x");
h->Draw(); // the event loop is run here, upon first access to one of the results

The lazy triggering of the event loop (i.e. the loop over all data) makes it easy to generate multiple results while reading the data only once:

// C++11 lambda expressions and C++ functions are also supported as filter expressions
auto filtered_df = df.Filter([](float x) { return x > 0; }, {"x"});
auto hx = filtered_df.Histo1D("x");
auto hy = filtered_df.Histo1D("y");
hx->Draw(); // event loop is run here, both hx and hy are filled
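
The single-pass behavior described above can be illustrated with a small toy model that does not depend on ROOT. This is only a sketch of the idea, not how RDataFrame is implemented: the class `ToyFrame` and its methods `BookCount`, `GetValue` and `NLoops` are hypothetical names introduced for illustration. Results are booked first, and the loop over the rows runs once, on the first access to any of them.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy model only: ToyFrame is a hypothetical class, not part of ROOT.
class ToyFrame {
public:
   explicit ToyFrame(std::vector<double> rows) : fRows(std::move(rows)) {}

   // Book a "count the rows passing this cut" result. Nothing is computed yet.
   // Returns a handle that identifies the booked result.
   std::size_t BookCount(std::function<bool(double)> cut)
   {
      fCuts.push_back(std::move(cut));
      fCounts.push_back(0);
      fUpToDate = false; // a newly booked result requires a (new) loop
      return fCuts.size() - 1;
   }

   // Access a result. The loop over the data runs here, and only if needed.
   std::size_t GetValue(std::size_t handle)
   {
      assert(handle < fCounts.size());
      if (!fUpToDate)
         RunLoop();
      return fCounts[handle];
   }

   // Number of times the data has been looped over so far.
   int NLoops() const { return fNLoops; }

private:
   // One pass over the rows fills every booked result.
   void RunLoop()
   {
      fCounts.assign(fCounts.size(), 0);
      for (double x : fRows)
         for (std::size_t i = 0; i < fCuts.size(); ++i)
            if (fCuts[i](x))
               ++fCounts[i];
      fUpToDate = true;
      ++fNLoops;
   }

   std::vector<double> fRows;
   std::vector<std::function<bool(double)>> fCuts;
   std::vector<std::size_t> fCounts;
   bool fUpToDate = false;
   int fNLoops = 0;
};
```

In this toy, booking two counts and then reading both values leaves `NLoops()` at 1: the first `GetValue` call fills every booked result, and the second call only returns the cached number.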

As a last example, let’s filter the events, define a new quantity, produce a control plot and write out the filtered dataset, all in the same multi-threaded event loop:

ROOT::EnableImplicitMT(); // enable multi-threading (see below)
ROOT::RDataFrame df(treename, filenames); // create dataframe
auto df2 = df.Filter("x > 0").Define("y", "x*x"); // filter and define new column
auto control_h = df2.Histo1D("y"); // book filling of a control plot
// write out the new dataset; this triggers the event loop and also fills the booked control plot
df2.Snapshot("newtree", "newfile.root", {"x","y"});

For more examples, including ones in Python, see the tutorials.

Parallel execution

RDataFrame can perform multi-threaded event loops to speed up the execution of its actions. Each thread will process part of the data set, and RDataFrame will then merge the thread-local partial results before returning the final result to the user.
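
The "partial results, then merge" pattern can be sketched in plain C++ with `std::thread`. This is a simplified illustration under stated assumptions, not RDataFrame's actual scheduling or merging code: the function `FillHistParallel` is a hypothetical helper, the data is split into equal contiguous chunks, and values outside `[lo, hi)` are simply ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not a ROOT API: fill a fixed-range histogram using
// one private histogram per thread, then merge the partial histograms.
std::vector<long> FillHistParallel(const std::vector<double> &data, std::size_t nBins,
                                   double lo, double hi, unsigned nThreads)
{
   assert(nBins > 0 && hi > lo && nThreads > 0);

   // Thread-local partial results: no locking is needed while filling.
   std::vector<std::vector<long>> partial(nThreads, std::vector<long>(nBins, 0));
   std::vector<std::thread> workers;
   const std::size_t chunk = (data.size() + nThreads - 1) / nThreads;

   for (unsigned t = 0; t < nThreads; ++t) {
      // Each thread processes its own contiguous slice of the data set.
      const std::size_t begin = std::min(data.size(), t * chunk);
      const std::size_t end = std::min(data.size(), begin + chunk);
      workers.emplace_back([&, t, begin, end] {
         for (std::size_t i = begin; i < end; ++i) {
            const double x = data[i];
            if (x < lo || x >= hi)
               continue; // out-of-range values are ignored in this sketch
            const auto bin = static_cast<std::size_t>((x - lo) / (hi - lo) * nBins);
            ++partial[t][std::min(bin, nBins - 1)];
         }
      });
   }
   for (auto &w : workers)
      w.join();

   // Merge step: sum the per-thread histograms bin by bin.
   std::vector<long> merged(nBins, 0);
   for (const auto &p : partial)
      for (std::size_t b = 0; b < nBins; ++b)
         merged[b] += p[b];
   return merged;
}
```

Because every thread writes only to its own histogram, the merged result does not depend on the number of threads; only the final summation touches all partial results.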

  • To enable parallel data processing, call the ROOT::EnableImplicitMT() function before constructing an RDataFrame object.

This enables ROOT’s implicit multi-threading for all objects and methods that provide an internal parallelization mechanism.

In addition to RDataFrame, the following objects and methods also automatically take advantage of multi-threading:

  • TTree::GetEntry: Reads multiple branches in parallel.

  • TTree::FlushBaskets: Writes multiple baskets to disk in parallel.

  • TTreeCacheUnzip: Decompresses the baskets contained in a TTreeCache in parallel.

  • THx::Fit: Evaluates the objective function over the data in parallel.

  • TMVA::DNN: Trains a deep neural network in parallel.

  • TMVA::BDT: Trains classifiers in parallel and evaluates multi-class BDTs in parallel.