#include <nlohmann/json.hpp>
   void Exec(unsigned int slot)
   {
      fPerThreadResults[slot]++;
   }

   // Called at the end of the event loop.
   void Finalize()
   {
      *fFinalResult = std::accumulate(fPerThreadResults.begin(), fPerThreadResults.end(), 0);
   }

   // Called by RDataFrame to retrieve the name of this action.
   std::string GetActionName() const { return "MyCounter"; }
   ROOT::RDataFrame df(10);
   ROOT::RDF::RResultPtr<int> resultPtr = df.Book<>(MyCounter{df.GetNSlots()}, {});
   // The GetValue call triggers the event loop
   std::cout << "Number of processed entries: " << resultPtr.GetValue() << std::endl;
See the Book() method for more information and [this tutorial](https://root.cern/doc/master/df018__customActions_8C.html)
for a more complete example.
#### Injecting arbitrary code in the event loop with Foreach() and ForeachSlot()

Foreach() takes a callable (lambda expression, free function, functor...) and a list of columns and
executes the callable on the values of those columns for each event that passes all upstream selections.
It can be used to perform actions that are not already available in the interface. For example, the following snippet
evaluates the root mean square of column "x":
// Single-thread evaluation of RMS of column "x" using Foreach
double sumSq = 0.;
unsigned int n = 0;
df.Foreach([&sumSq, &n](double x) { ++n; sumSq += x*x; }, {"x"});
std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;
In multi-thread runs, users are responsible for the thread-safety of the expression passed to Foreach():
threads will execute the expression concurrently.
The code above would need some resource protection mechanism to ensure non-concurrent writing of `sumSq` and `n`, but
that is arguably too much effort for such a simple operation.
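
To make the "resource protection mechanism" concrete, here is a minimal sketch in plain C++ that does not use ROOT at all. It assumes a hypothetical helper, `LockedRms`, in which several `std::thread` workers stand in for the concurrent invocations of the callable and a `std::mutex` serializes every update of the shared accumulators. It only illustrates the locking pattern and is not how RDataFrame schedules work.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not part of ROOT): computes the RMS of `xs` with
// `nThreads` workers that all update the same two accumulators.
double LockedRms(const std::vector<double> &xs, unsigned int nThreads)
{
   assert(nThreads > 0 && "need at least one worker thread");

   double sumSq = 0.;
   unsigned long long n = 0;
   std::mutex m; // protects both sumSq and n

   auto work = [&](std::size_t begin, std::size_t end) {
      for (std::size_t i = begin; i < end; ++i) {
         std::lock_guard<std::mutex> lock(m); // one writer at a time
         ++n;
         sumSq += xs[i] * xs[i];
      }
   };

   // Split the entries into contiguous chunks, one per worker.
   const std::size_t chunk = (xs.size() + nThreads - 1) / nThreads;
   std::vector<std::thread> threads;
   for (unsigned int t = 0; t < nThreads; ++t) {
      const std::size_t begin = std::min(xs.size(), t * chunk);
      const std::size_t end = std::min(xs.size(), begin + chunk);
      threads.emplace_back(work, begin, end);
   }
   for (auto &th : threads)
      th.join();

   return n == 0 ? 0. : std::sqrt(sumSq / n);
}
```

Taking the lock for every entry is correct but serializes the hot loop, which is the cost that the per-slot approach avoids.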
ForeachSlot() can help in this situation. It is an alternative version of Foreach() for which the function takes an
additional "processing slot" parameter besides the columns it should be applied to. RDataFrame
guarantees that ForeachSlot() will invoke the user expression with different `slot` parameters for different concurrent
executions (see [Special helper columns: rdfentry_ and rdfslot_](\ref helper-cols) for more information on the slot parameter).
We can take advantage of ForeachSlot() to evaluate a thread-safe root mean square of column "x":
// Thread-safe evaluation of RMS of column "x" using ForeachSlot
ROOT::EnableImplicitMT();
const unsigned int nSlots = df.GetNSlots();
std::vector<double> sumSqs(nSlots, 0.);
std::vector<unsigned int> ns(nSlots, 0);

df.ForeachSlot([&sumSqs, &ns](unsigned int slot, double x) { sumSqs[slot] += x*x; ns[slot] += 1; }, {"x"});
double sumSq = std::accumulate(sumSqs.begin(), sumSqs.end(), 0.); // sum all squares
unsigned int n = std::accumulate(ns.begin(), ns.end(), 0u); // sum all counts
std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;
Notice how we created one `double` accumulator and one counter for each processing slot and later merged their results via `std::accumulate`.
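
The per-slot pattern can also be tried outside ROOT. The sketch below assumes a hypothetical function, `SlotwiseRms`, in which each `std::thread` plays the role of one processing slot and writes only to its own element of the accumulator vectors, so no locking is needed. The strided split of the entries is an arbitrary choice made for the illustration and says nothing about how RDataFrame distributes entries.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for an event loop: each worker owns one slot index
// and only ever writes to the accumulators at that index.
double SlotwiseRms(const std::vector<double> &xs, unsigned int nSlots)
{
   assert(nSlots > 0 && "need at least one slot");

   std::vector<double> sumSqs(nSlots, 0.);
   std::vector<unsigned long long> ns(nSlots, 0);

   std::vector<std::thread> workers;
   for (unsigned int slot = 0; slot < nSlots; ++slot) {
      workers.emplace_back([&xs, &sumSqs, &ns, slot, nSlots] {
         // Strided split of the entries; no two workers share a slot.
         for (std::size_t i = slot; i < xs.size(); i += nSlots) {
            sumSqs[slot] += xs[i] * xs[i];
            ns[slot] += 1;
         }
      });
   }
   for (auto &w : workers)
      w.join();

   // Merge the per-slot partial results only after all workers are done.
   const double sumSq = std::accumulate(sumSqs.begin(), sumSqs.end(), 0.);
   const auto n = std::accumulate(ns.begin(), ns.end(), 0ull);
   return n == 0 ? 0. : std::sqrt(sumSq / n);
}
```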
### Dataset joins with friend trees

Vertically concatenating multiple trees that have the same columns (creating a logical dataset with the same columns and
more rows) is trivial in RDataFrame: just pass the tree name and a list of file names to RDataFrame's constructor, or create a TChain
out of the desired trees and pass that to RDataFrame.

Horizontal concatenations of trees or chains (creating a logical dataset with the same number of rows and the union of the
columns of multiple trees) leverage TTree's "friend" mechanism.

Simple joins of trees that do not have the same number of rows are also possible with indexed friend trees (see below).

To use friend trees in RDataFrame, set up trees with the appropriate relationships and then instantiate an RDataFrame
with the main tree:
main.AddFriend(&friend, "myFriend");
auto df2 = df.Filter("myFriend.MyCol == 42");
The same applies for TChains. Columns coming from the friend trees can be referred to by their full name, like in the example above,
or the friend tree name can be omitted in case the column name is not ambiguous (e.g. "MyCol" could be used instead of
"myFriend.MyCol" in the example above if there is no column "MyCol" in the main tree).

\note A common source of confusion is that trees that are written out from a multi-thread Snapshot() call will have their
      entries (block-wise) shuffled with respect to the original tree. Such trees cannot be used as friends of the original
      one: rows will be mismatched.
Indexed friend trees provide a way to perform simple joins of multiple trees over a common column.
When a certain entry in the main tree (or chain) is loaded, the friend trees (or chains) will then load an entry where the
"index" columns have a value identical to the one in the main one. For example, in Python:
# If a friend tree has an index on `commonColumn`, when the main tree loads
# a given row, it also loads the row of the friend tree that has the same
# value of `commonColumn`
aux_tree.BuildIndex("commonColumn")

mainTree.AddFriend(aux_tree)

df = ROOT.RDataFrame(mainTree)
RDataFrame supports indexed friend TTrees from ROOT v6.24 in single-thread mode and from v6.28/02 in multi-thread mode.
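
Conceptually, an indexed friend behaves like a lookup table keyed on the common column. The following sketch shows that idea in plain C++ with hypothetical row types (`MainRow`, `AuxRow`) and a hypothetical function `IndexedJoin`. It makes no claim about how TTree indices are implemented, and representing an unmatched row with `std::nullopt` is an assumption made for the example only.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row types, for illustration only.
struct MainRow { int commonColumn; double x; };
struct AuxRow { int commonColumn; double y; };

// For each main row, find the aux row whose `commonColumn` has the same value.
// Main rows without a match get std::nullopt in place of `y`.
std::vector<std::pair<double, std::optional<double>>>
IndexedJoin(const std::vector<MainRow> &mainRows, const std::vector<AuxRow> &auxRows)
{
   // The "index": value of commonColumn -> position of the row in auxRows.
   std::unordered_map<int, std::size_t> index;
   for (std::size_t i = 0; i < auxRows.size(); ++i)
      index.emplace(auxRows[i].commonColumn, i); // keeps the first row for duplicate values

   std::vector<std::pair<double, std::optional<double>>> joined;
   for (const auto &row : mainRows) {
      const auto it = index.find(row.commonColumn);
      if (it == index.end())
         joined.emplace_back(row.x, std::nullopt);
      else
         joined.emplace_back(row.x, auxRows[it->second].y);
   }
   assert(joined.size() == mainRows.size()); // one output row per main row
   return joined;
}
```

Because the lookup goes through the index, the two datasets neither need the same number of rows nor the same row order.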
\anchor other-file-formats
### Reading data formats other than ROOT trees
RDataFrame can be interfaced with RDataSources. The ROOT::RDF::RDataSource interface defines an API that RDataFrame can use to read arbitrary columnar data formats.

RDataFrame calls into concrete RDataSource implementations to retrieve information about the data, retrieve (thread-local) readers or "cursors" for selected columns
and to advance the readers to the desired data entry.
Some predefined RDataSources are natively provided by ROOT, such as ROOT::RDF::RCsvDS, which reads comma-separated files:
auto tdf = ROOT::RDF::FromCSV("MuRun2010B.csv");
auto filteredEvents =
   tdf.Filter("Q1 * Q2 == -1")
      .Define("m", "sqrt(pow(E1 + E2, 2) - (pow(px1 + px2, 2) + pow(py1 + py2, 2) + pow(pz1 + pz2, 2)))");
auto h = filteredEvents.Histo1D("m");
See also FromNumpy (Python-only), FromRNTuple(), FromArrow(), FromSqlite().
### Computation graphs (storing and reusing sets of transformations)

As we saw, transformed dataframes can be stored as variables and reused multiple times to create modified versions of the dataset. This implicitly defines a **computation graph** in which
several paths of filtering/creation of columns are executed simultaneously, and finally aggregated results are produced.

RDataFrame detects when several actions use the same filter or the same defined column, and **only evaluates each
filter or defined column once per event**, regardless of how many times that result is used down the computation graph.
Objects read from each column are **built once and never copied**, for maximum efficiency.
When "upstream" filters are not passed, subsequent filters, temporary column expressions and actions are not evaluated,
so it might be advisable to put the strictest filters first in the graph.
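
The "evaluate once, skip what is downstream of a failed filter" behaviour can be pictured with a toy loop. The sketch below is plain C++ and not RDataFrame code: `GraphResult` and `RunGraph` are hypothetical names, and the graph is hard-wired as one shared filter feeding two actions (a count and a sum).

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical miniature of a two-branch graph:
//   source -> cut (shared) -> { count, sum }
struct GraphResult {
   unsigned int nCutCalls = 0; // how many times the filter was evaluated
   unsigned int count = 0;     // entries passing the filter
   double sum = 0.;            // sum of the passing entries
};

GraphResult RunGraph(const std::vector<double> &entries, const std::function<bool(double)> &cut)
{
   assert(cut && "a filter callable is required");

   GraphResult r;
   for (double x : entries) {
      ++r.nCutCalls;
      const bool pass = cut(x); // evaluated once, result shared by both actions
      if (!pass)
         continue; // nothing downstream runs for this entry
      ++r.count;
      r.sum += x;
   }
   return r;
}
```

In this toy model the filter runs exactly once per entry although two results depend on it, and a strict filter reduces the work done by everything that follows it.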
\anchor representgraph
### Visualizing the computation graph
It is possible to print the computation graph from any node to obtain a [DOT (graphviz)](https://en.wikipedia.org/wiki/DOT_(graph_description_language)) representation either on the standard output
or in a file.

Invoking ROOT::RDF::SaveGraph() on any node that is not the head node prints the computation graph of the branch
the node belongs to. Invoking it on the head node prints the entire computation graph.

The following is a usage example:
// First, a sample computational graph is built
ROOT::RDataFrame df("tree", "f.root");

auto df2 = df.Define("x", []() { return 1; })
             .Filter("col0 % 1 == col0")
             .Filter([](int b1) { return b1 <2; }, {"cut1"})
             .Define("y", []() { return 1; });

auto count = df2.Count();

// Prints the graph to the mydot.dot file in the current directory
ROOT::RDF::SaveGraph(df, "./mydot.dot");
// Prints the graph to standard output
ROOT::RDF::SaveGraph(df);
The generated graph can be rendered using one of the graphviz filters, e.g. `dot`. For instance, the image below can be generated with the following command:
$ dot -Tpng computation_graph.dot -ocomputation_graph.png

\image html RDF_Graph2.png
### Activating RDataFrame execution logs

RDataFrame has experimental support for verbose logging of the event loop runtimes and other interesting related information. It is activated as follows:
#include <ROOT/RLogger.hxx>

// this increases RDF's verbosity level as long as the `verbosity` variable is in scope
auto verbosity = ROOT::Experimental::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kInfo);
verbosity = ROOT.Experimental.RLogScopedVerbosity(ROOT.Detail.RDF.RDFLogChannel(), ROOT.Experimental.ELogLevel.kInfo)

More information (e.g. start and end of each multi-thread task) is printed using `ELogLevel.kDebug` and even more
(e.g. a full dump of the generated code that RDataFrame just-in-time-compiles) using `ELogLevel.kDebug+10`.
\anchor rdf-from-spec
### Creating an RDataFrame from a dataset specification file

An RDataFrame can be created from a dataset specification JSON file:
df = ROOT.RDF.Experimental.FromSpec("spec.json")

The input dataset specification JSON file must be provided by the user and describes all necessary samples and
their associated metadata. The main required key is "samples" (at least one sample is needed), and the
required sub-keys for each sample are "trees" and "files". Additionally, one can specify a metadata dictionary for each
sample in the "metadata" key.

A simple example of the format of the specification in the JSON file is the following:
         "trees": ["tree1", "tree2"],
         "files": ["file1.root", "file2.root"],
            "sample_category": "data"
         "trees": ["tree3", "tree4"],
         "files": ["file3.root", "file4.root"],
            "sample_category": "MC_background"
The metadata information from the specification file can then be accessed using the DefinePerSample function.
For example, to access luminosity information (stored as a double):
df.DefinePerSample("lumi", 'rdfsampleinfo_.GetD("lumi")')
or sample_category information (stored as a string):
df.DefinePerSample("sample_category", 'rdfsampleinfo_.GetS("sample_category")')
or directly the sample name:
df.DefinePerSample("name", "rdfsampleinfo_.GetSampleName()")
An example use of FromSpec is available in the tutorial df106_HiggsToFourLeptons.py, which also
provides a corresponding example JSON file for the dataset specification.
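
Since metadata values may be strings, integers or doubles, a typed getter has to agree with the stored type. The sketch below models that with a hypothetical `SampleMetadata` class built on `std::variant`. The method names `Add`, `GetD`, `GetI` and `GetS` are chosen to echo the snippets above, but the class is an illustration under those assumptions, not ROOT's implementation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <variant>

// Hypothetical container, loosely modelled on per-sample metadata: each key
// maps to a string, an int or a double.
class SampleMetadata {
   std::map<std::string, std::variant<std::string, int, double>> fValues;

public:
   void Add(const std::string &key, std::string v) { fValues[key] = std::move(v); }
   void Add(const std::string &key, int v) { fValues[key] = v; }
   void Add(const std::string &key, double v) { fValues[key] = v; }

   // Typed getters: std::get throws std::bad_variant_access on a type
   // mismatch and std::map::at throws std::out_of_range on a missing key.
   double GetD(const std::string &key) const { return std::get<double>(fValues.at(key)); }
   int GetI(const std::string &key) const { return std::get<int>(fValues.at(key)); }
   const std::string &GetS(const std::string &key) const { return std::get<std::string>(fValues.at(key)); }
};

// Usage sketch: store a luminosity as a double and read it back.
inline void MetadataExample()
{
   SampleMetadata m;
   m.Add("lumi", 0.5);
   assert(m.GetD("lumi") == 0.5);
}
```

In this model, a value written as `1` in the JSON would be stored as an integer and would have to be read with `GetI`, while `1.0` would be read with `GetD`.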
### Adding a progress bar

A progress bar showing the processed event statistics can be added to any RDataFrame program.
The event statistics include elapsed time, currently processed file, currently processed events, the rate of event processing
and an estimated remaining time (per file being processed). The statistics are recorded and printed in the terminal every m events and every
n seconds (by default m = 1000 and n = 1). The progress bar can also be added when the multi-thread (MT) mode is enabled.

The progress bar is added after creating the dataframe object (df):
ROOT::RDataFrame df("tree", "file.root");
ROOT::RDF::Experimental::AddProgressBar(df);
Alternatively, RDataFrame can be cast to an RNode first, giving the user more flexibility.
For example, the progress bar can then be added at any computational node, such as Filter or Define, not only the head node,
with no change to the ProgressBar function itself (please see the [Efficient analysis in Python](#python)
section for appropriate usage in Python):
ROOT::RDataFrame df("tree", "file.root");
auto df_1 = ROOT::RDF::RNode(df.Filter("x>1"));
ROOT::RDF::Experimental::AddProgressBar(df_1);
Examples of implemented progress bars can be seen by running the [Higgs to Four Lepton tutorial](https://root.cern/doc/master/df106__HiggsToFourLeptons_8py_source.html) and the [Dimuon tutorial](https://root.cern/doc/master/df102__NanoAODDimuonAnalysis_8C.html).
   : RInterface(std::make_shared<RDFDetail::RLoopManager>(nullptr, defaultColumns))
      auto msg = "Invalid TDirectory!";
      throw std::runtime_error(msg);
   const std::string treeNameInt(treeName);
   auto tree = static_cast<TTree *>(dirPtr->Get(treeNameInt.c_str()));
      auto msg = "Tree \"" + treeNameInt + "\" cannot be found!";
      throw std::runtime_error(msg);
   GetProxiedPtr()->SetTree(std::shared_ptr<TTree>(tree, [](TTree *) {}));
RDataFrame::RDataFrame(std::string_view treeName, std::string_view fileNameGlob, const ColumnNames_t &defaultColumns)
   : RInterface(ROOT::Detail::RDF::CreateLMFromFile(treeName, fileNameGlob, defaultColumns))
RDataFrame::RDataFrame(std::string_view treeName, std::string_view fileNameGlob, const ColumnNames_t &defaultColumns)
   : RInterface(ROOT::Detail::RDF::CreateLMFromTTree(treeName, fileNameGlob, defaultColumns))
                       const ColumnNames_t &defaultColumns)
   : RInterface(ROOT::Detail::RDF::CreateLMFromFile(datasetName, fileNameGlobs, defaultColumns))
   : RInterface(ROOT::Detail::RDF::CreateLMFromTTree(datasetName, fileNameGlobs, defaultColumns))
namespace Experimental {
   const nlohmann::ordered_json fullData = nlohmann::ordered_json::parse(std::ifstream(jsonFile));
   if (!fullData.contains("samples") || fullData["samples"].empty()) {
      throw std::runtime_error(
         R"(The input specification does not contain any samples. Please provide the samples in the specification like:
         "trees": ["tree1", "tree2"],
         "files": ["file1.root", "file2.root"],
         "metadata": {"lumi": 1.0, }
         "trees": ["tree3", "tree4"],
         "files": ["file3.root", "file4.root"],
         "metadata": {"lumi": 0.5, }
   for (const auto &keyValue : fullData["samples"].items()) {
      const std::string &sampleName = keyValue.key();
      const auto &sample = keyValue.value();
      if (!sample.contains("trees")) {
         throw std::runtime_error("A list of tree names must be provided for sample " + sampleName + ".");
      std::vector<std::string> trees = sample["trees"];
      if (!sample.contains("files")) {
         throw std::runtime_error("A list of files must be provided for sample " + sampleName + ".");
      std::vector<std::string> files = sample["files"];
      if (!sample.contains("metadata")) {
         for (const auto &metadata : sample["metadata"].items()) {
            const auto &val = metadata.value();
            if (val.is_string())
               m.Add(metadata.key(), val.get<std::string>());
            else if (val.is_number_integer())
               m.Add(metadata.key(), val.get<int>());
            else if (val.is_number_float())
               m.Add(metadata.key(), val.get<double>());
               throw std::logic_error("The metadata keys can only be of type [string|int|double].");
   if (fullData.contains("friends")) {
      for (const auto &friends : fullData["friends"].items()) {
         std::string alias = friends.key();
         std::vector<std::string> trees = friends.value()["trees"];
         std::vector<std::string> files = friends.value()["files"];
         if (files.size() != trees.size() && trees.size() > 1)
            throw std::runtime_error("Mismatch between trees and files in a friend.");
   if (fullData.contains("range")) {
      std::vector<int> range = fullData["range"];
      if (range.size() == 1)
      else if (range.size() == 2)
      throw std::runtime_error("Cannot print information about this RDataFrame, "
                               "it was not properly created. It must be discarded.");
   auto *tree = lm->GetTree();
   auto defCols = lm->GetDefaultColumnNames();
   std::ostringstream ret;
      ret << "A data frame built on top of the " << tree->GetName() << " dataset.";
      if (!defCols.empty()) {
         if (defCols.size() == 1)
            ret << "\nDefault column: " << defCols[0];
            ret << "\nDefault columns:\n";
            for (auto &&col : defCols) {
               ret << " - " << col << "\n";
      ret << "A data frame associated to the data source \"" << cling::printValue(ds) << "\"";
      ret << "An empty data frame that will create " << lm->GetNEmptyEntries() << " entries\n";