#include <nlohmann/json.hpp>
   void Exec(unsigned int slot)
      fPerThreadResults[slot]++;
   // Called at the end of the event loop.
      *fFinalResult = std::accumulate(fPerThreadResults.begin(), fPerThreadResults.end(), 0);
   // Called by RDataFrame to retrieve the name of this action.
   std::string GetActionName() const { return "MyCounter"; }
   ROOT::RDataFrame df(10);
   ROOT::RDF::RResultPtr<int> resultPtr = df.Book<>(MyCounter{df.GetNSlots()}, {});
   // The GetValue call triggers the event loop
   std::cout << "Number of processed entries: " << resultPtr.GetValue() << std::endl;
See the Book() method for more information and [this tutorial](https://root.cern/doc/master/df018__customActions_8C.html)
for a more complete example.

#### Injecting arbitrary code in the event loop with Foreach() and ForeachSlot()

Foreach() takes a callable (lambda expression, free function, functor...) and a list of columns and
executes the callable on the values of those columns for each event that passes all upstream selections.
It can be used to perform actions that are not already available in the interface. For example, the following snippet
evaluates the root mean square of column "x":
// Single-thread evaluation of RMS of column "x" using Foreach
double sumSq = 0.;
unsigned int n = 0;
df.Foreach([&sumSq, &n](double x) { ++n; sumSq += x*x; }, {"x"});
std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;
In multi-thread runs, users are responsible for the thread-safety of the expression passed to Foreach():
several threads will execute the expression concurrently.
The code above would need some resource protection mechanism to prevent concurrent writes to `sumSq` and `n`, but
that is probably too much effort for such a simple operation.
ForeachSlot() can help in this situation. It is an alternative version of Foreach() for which the function takes an
additional "processing slot" parameter besides the columns it should be applied to. RDataFrame
guarantees that ForeachSlot() will invoke the user expression with different `slot` parameters for different concurrent
executions (see [Special helper columns: rdfentry_ and rdfslot_](\ref helper-cols) for more information on the slot parameter).
We can take advantage of ForeachSlot() to evaluate a thread-safe root mean square of column "x":
// Thread-safe evaluation of RMS of column "x" using ForeachSlot
ROOT::EnableImplicitMT();
const unsigned int nSlots = df.GetNSlots();
std::vector<double> sumSqs(nSlots, 0.);
std::vector<unsigned int> ns(nSlots, 0);

df.ForeachSlot([&sumSqs, &ns](unsigned int slot, double x) { sumSqs[slot] += x*x; ns[slot] += 1; }, {"x"});
double sumSq = std::accumulate(sumSqs.begin(), sumSqs.end(), 0.); // sum all squares
unsigned int n = std::accumulate(ns.begin(), ns.end(), 0u); // sum all counts (unsigned accumulator)
std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;
Notice how we created one partial sum and one counter for each processing slot and later merged their results via `std::accumulate`.
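
The per-slot pattern can also be tried out without ROOT. The following is a minimal sketch in plain C++, assuming a hypothetical helper `RmsWithSlots` (not part of RDataFrame) that assigns entries to slots sequentially instead of from concurrent threads. It only illustrates that the merged result does not depend on how entries are distributed over the slots.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of ROOT: computes the RMS of `values` by
// accumulating into `nSlots` independent partial results and merging them.
// Slots are filled sequentially here; in a real multi-thread run each slot
// would be written by one thread at a time.
double RmsWithSlots(const std::vector<double> &values, unsigned int nSlots)
{
   assert(nSlots > 0); // at least one processing slot is required
   std::vector<double> sumSqs(nSlots, 0.);
   std::vector<unsigned long long> ns(nSlots, 0ULL);

   for (std::size_t i = 0; i < values.size(); ++i) {
      const std::size_t slot = i % nSlots; // arbitrary assignment of entries to slots
      sumSqs[slot] += values[i] * values[i];
      ns[slot] += 1;
   }

   // Merge step: the init value fixes the accumulator type of std::accumulate.
   const double sumSq = std::accumulate(sumSqs.begin(), sumSqs.end(), 0.);
   const unsigned long long n = std::accumulate(ns.begin(), ns.end(), 0ULL);
   return n == 0 ? 0. : std::sqrt(sumSq / static_cast<double>(n));
}
```

Calling `RmsWithSlots({3., 4.}, 1)` and `RmsWithSlots({3., 4.}, 2)` gives the same value up to floating-point rounding, since only the grouping of the partial sums changes.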
### Dataset joins with friend trees

Vertically concatenating multiple trees that have the same columns (creating a logical dataset with the same columns and
more rows) is trivial in RDataFrame: just pass the tree name and a list of file names to RDataFrame's constructor, or create a TChain
out of the desired trees and pass that to RDataFrame.

Horizontal concatenations of trees or chains (creating a logical dataset with the same number of rows and the union of the
columns of multiple trees) leverage TTree's "friend" mechanism.

Simple joins of trees that do not have the same number of rows are also possible with indexed friend trees (see below).
To use friend trees in RDataFrame, set up trees with the appropriate relationships and then instantiate an RDataFrame
with the main tree:

main.AddFriend(&friend, "myFriend");

ROOT::RDataFrame df(main);
auto df2 = df.Filter("myFriend.MyCol == 42");
The same applies for TChains. Columns coming from the friend trees can be referred to by their full name, like in the example above,
or the friend tree name can be omitted in case the column name is not ambiguous (e.g. "MyCol" could be used instead of
"myFriend.MyCol" in the example above if there is no column "MyCol" in the main tree).

\note A common source of confusion is that trees that are written out from a multi-thread Snapshot() call will have their
 entries (block-wise) shuffled with respect to the original tree. Such trees cannot be used as friends of the original
 one: rows will be mismatched.

Indexed friend trees provide a way to perform simple joins of multiple trees over a common column.
When a certain entry in the main tree (or chain) is loaded, the friend trees (or chains) will then load an entry where the
"index" columns have a value identical to the one in the main one. For example, in Python:
# If a friend tree has an index on `commonColumn`, when the main tree loads
# a given row, it also loads the row of the friend tree that has the same
# value of `commonColumn`
aux_tree.BuildIndex("commonColumn")

mainTree.AddFriend(aux_tree)

df = ROOT.RDataFrame(mainTree)

RDataFrame supports indexed friend TTrees from ROOT v6.24 in single-thread mode and from v6.28/02 in multi-thread mode.
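
To make the idea of an index-based join concrete, here is a minimal sketch in plain C++ that does not use ROOT. The row types `MainRow` and `AuxRow` and the functions `BuildIndex` and `JoinY` are hypothetical names chosen for this illustration. The sketch mimics the lookup conceptually (the index maps a value of the common column to a row of the auxiliary dataset) and does not reflect how TTree implements its indices.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical row types, used only for this illustration.
struct MainRow {
   long long key; // value of the common column
   double x;
};
struct AuxRow {
   long long key; // value of the common column
   double y;
};

// Build an index: value of the common column -> row number in the auxiliary dataset.
std::unordered_map<long long, std::size_t> BuildIndex(const std::vector<AuxRow> &aux)
{
   std::unordered_map<long long, std::size_t> index;
   for (std::size_t i = 0; i < aux.size(); ++i)
      index.emplace(aux[i].key, i); // emplace keeps the first row seen for a duplicated key
   return index;
}

// For each main row, return the matching `y`, or std::nullopt if the key has no match.
std::vector<std::optional<double>> JoinY(const std::vector<MainRow> &mainRows, const std::vector<AuxRow> &aux)
{
   const auto index = BuildIndex(aux);
   std::vector<std::optional<double>> out;
   out.reserve(mainRows.size());
   for (const auto &row : mainRows) {
      const auto it = index.find(row.key);
      if (it == index.end()) {
         out.emplace_back(std::nullopt); // no auxiliary row carries this key
      } else {
         assert(it->second < aux.size()); // the index only stores valid row numbers
         out.emplace_back(aux[it->second].y);
      }
   }
   return out;
}
```

The auxiliary rows may be stored in any order and may be fewer than the main rows; entries without a match come back empty, which is the situation discussed in the section on missing values.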
\anchor other-file-formats
### Reading data formats other than ROOT trees
RDataFrame can be interfaced with RDataSources. The ROOT::RDF::RDataSource interface defines an API that RDataFrame can use to read arbitrary columnar data formats.

RDataFrame calls into concrete RDataSource implementations to retrieve information about the data, retrieve (thread-local) readers or "cursors" for selected columns
and to advance the readers to the desired data entry.
Some predefined RDataSources are natively provided by ROOT, such as ROOT::RDF::RCsvDS, which reads comma-separated files:
auto tdf = ROOT::RDF::FromCSV("MuRun2010B.csv");
auto filteredEvents =
   tdf.Filter("Q1 * Q2 == -1")
      .Define("m", "sqrt(pow(E1 + E2, 2) - (pow(px1 + px2, 2) + pow(py1 + py2, 2) + pow(pz1 + pz2, 2)))");
auto h = filteredEvents.Histo1D("m");

See also FromNumpy (Python-only), FromRNTuple(), FromArrow(), FromSqlite().
### Computation graphs (storing and reusing sets of transformations)

As we saw, transformed dataframes can be stored as variables and reused multiple times to create modified versions of the dataset. This implicitly defines a **computation graph** in which
several paths of filtering/creation of columns are executed simultaneously, and finally aggregated results are produced.

RDataFrame detects when several actions use the same filter or the same defined column, and **only evaluates each
filter or defined column once per event**, regardless of how many times that result is used down the computation graph.
Objects read from each column are **built once and never copied**, for maximum efficiency.
When "upstream" filters are not passed, subsequent filters, temporary column expressions and actions are not evaluated,
so it might be advisable to put the strictest filters first in the graph.
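
The effect of filter ordering can be illustrated with a small model in plain C++. This is only a sketch under simplifying assumptions: `FilterChain` and `CountEvaluations` are hypothetical names, entries are plain integers, and the chain is linear. It is not RDataFrame code; it just counts how many predicate calls are made when evaluation stops at the first failing filter.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a linear chain of filters; not RDataFrame code.
struct FilterChain {
   std::vector<std::function<bool(int)>> filters;
   std::size_t nEvaluations = 0; // total number of predicate calls made so far

   // Returns true if the entry passes all filters; stops at the first failing one.
   bool Passes(int entry)
   {
      for (const auto &f : filters) {
         ++nEvaluations;
         if (!f(entry))
            return false; // downstream filters are skipped for this entry
      }
      return true;
   }
};

// Run the chain over entries [0, nEntries) and return the number of predicate calls.
std::size_t CountEvaluations(std::vector<std::function<bool(int)>> filters, int nEntries)
{
   assert(nEntries >= 0);
   FilterChain chain{std::move(filters)};
   for (int i = 0; i < nEntries; ++i)
      chain.Passes(i);
   return chain.nEvaluations;
}
```

With a strict filter that keeps 10 out of 100 entries and a loose one that keeps 90, putting the strict filter first costs 100 + 10 predicate calls, while the opposite order costs 100 + 90.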
\anchor representgraph
### Visualizing the computation graph
It is possible to print the computation graph from any node to obtain a [DOT (graphviz)](https://en.wikipedia.org/wiki/DOT_(graph_description_language)) representation either on the standard output
or in a file.

Invoking the function ROOT::RDF::SaveGraph() on any node that is not the head node prints the computation graph of the branch
the node belongs to. Invoking it on the head node prints the entire computation graph.

The following is an example of usage:
// First, a sample computational graph is built
ROOT::RDataFrame df("tree", "f.root");

auto df2 = df.Define("x", []() { return 1; })
             .Filter("col0 % 1 == col0")
             .Filter([](int b1) { return b1 <2; }, {"cut1"})
             .Define("y", []() { return 1; });

auto count = df2.Count();

// Prints the graph to the mydot.dot file in the current directory
ROOT::RDF::SaveGraph(df, "./mydot.dot");
// Prints the graph to standard output
ROOT::RDF::SaveGraph(df);
The generated graph can be rendered using one of the graphviz filters, e.g. `dot`. For instance, the image below can be generated with the following command:

$ dot -Tpng computation_graph.dot -ocomputation_graph.png

\image html RDF_Graph2.png
### Activating RDataFrame execution logs

RDataFrame has experimental support for verbose logging of the event loop runtimes and other interesting related information. It is activated as follows:

#include <ROOT/RLogger.hxx>

// this increases RDF's verbosity level as long as the `verbosity` variable is in scope
auto verbosity = ROOT::Experimental::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kInfo);

verbosity = ROOT.Experimental.RLogScopedVerbosity(ROOT.Detail.RDF.RDFLogChannel(), ROOT.Experimental.ELogLevel.kInfo)

More information (e.g. start and end of each multi-thread task) is printed using `ELogLevel.kDebug` and even more
(e.g. a full dump of the generated code that RDataFrame just-in-time-compiles) using `ELogLevel.kDebug+10`.
\anchor rdf-from-spec
### Creating an RDataFrame from a dataset specification file

An RDataFrame can be created from a dataset specification JSON file:

df = ROOT.RDF.Experimental.FromSpec("spec.json")

The input dataset specification JSON file must be provided by the user and describes all necessary samples and
their associated metadata. The main required key is "samples" (at least one sample is needed) and the
required sub-keys for each sample are "trees" and "files". Additionally, one can specify a metadata dictionary for each
sample in the "metadata" key.

A simple example of the formatting of the specification in the JSON file is the following:
         "trees": ["tree1", "tree2"],
         "files": ["file1.root", "file2.root"],
            "sample_category": "data"
         "trees": ["tree3", "tree4"],
         "files": ["file3.root", "file4.root"],
            "sample_category": "MC_background"
The metadata information from the specification file can then be accessed using the DefinePerSample function.
For example, to access luminosity information (stored as a double):

df.DefinePerSample("lumi", 'rdfsampleinfo_.GetD("lumi")')

or sample_category information (stored as a string):

df.DefinePerSample("sample_category", 'rdfsampleinfo_.GetS("sample_category")')

or directly the sample name:

df.DefinePerSample("name", "rdfsampleinfo_.GetSampleName()")

An example usage of the "FromSpec" method is available in the tutorial df106_HiggstoFourLeptons.py, which also
provides a corresponding example JSON file for the dataset specification.
### Adding a progress bar

A progress bar showing the processed event statistics can be added to any RDataFrame program.
The event statistics include elapsed time, currently processed file, currently processed events, the rate of event processing
and an estimated remaining time (per file being processed). It is recorded and printed in the terminal every m events and every
n seconds (by default m = 1000 and n = 1). The ProgressBar can also be added when the multithread (MT) mode is enabled.

ProgressBar is added after creating the dataframe object (df):

ROOT::RDataFrame df("tree", "file.root");
ROOT::RDF::Experimental::AddProgressBar(df);

Alternatively, RDataFrame can be cast to an RNode first, giving the user more flexibility.
For example, it can be called at any computational node, such as Filter or Define, not only the head node,
with no change to the ProgressBar function itself (please see the [Efficient analysis in Python](#python)
section for appropriate usage in Python):

ROOT::RDataFrame df("tree", "file.root");
auto df_1 = ROOT::RDF::RNode(df.Filter("x>1"));
ROOT::RDF::Experimental::AddProgressBar(df_1);

Examples of implemented progress bars can be seen by running the [Higgs to Four Lepton tutorial](https://root.cern/doc/master/df106__HiggsToFourLeptons_8py_source.html) and the [Dimuon tutorial](https://root.cern/doc/master/df102__NanoAODDimuonAnalysis_8C.html).
\anchor missing-values
### Working with missing values in the dataset

In certain situations a dataset might be missing one or more values at one or
more of its entries. For example:

- If the dataset is composed of multiple files and one or more files are
  missing one or more columns required by the analysis.
- When joining different datasets horizontally according to some index value
  (e.g. the event number), if the index does not find a match in one or more
  other datasets for a certain entry.

For example, suppose that column "y" does not have a value for entry 42:

If the RDataFrame application reads that column, for example if a Take() action
was requested, the default behaviour is to throw an exception indicating
that the column is missing an entry.

The following paragraphs discuss the functionalities provided by RDataFrame to
work with missing values in the dataset.
#### FilterAvailable and FilterMissing

FilterAvailable and FilterMissing are specialized RDataFrame Filter operations.
They take as input argument the name of a column of the dataset to watch for
missing values. Like Filter, they will either keep or discard an entire entry
based on whether a condition returns true or false. Specifically:

- FilterAvailable: the condition is whether the value of the column is present.
  If so, the entry is kept. Otherwise, if the value is missing, the entry is
  discarded.
- FilterMissing: the condition is whether the value of the column is missing. If
  so, the entry is kept. Otherwise, if the value is present, the entry is
  discarded.
df = ROOT.RDataFrame(dataset)

# Anytime an entry from "col" is missing, the entire entry will be filtered out
df_available = df.FilterAvailable("col")
df_available = df_available.Define("twice", "col * 2")

# Conversely, if we want to select the entries for which the column has missing
# values, we do the following
df_missingcol = df.FilterMissing("col")
# Following operations in the same branch of the computation graph clearly
# cannot access that same column, since there would be no value to read
df_missingcol = df_missingcol.Define("observable", "othercolumn * 2")

ROOT::RDataFrame df{dataset};

// Anytime an entry from "col" is missing, the entire entry will be filtered out
auto df_available = df.FilterAvailable("col");
auto df_twicecol = df_available.Define("twice", "col * 2");

// Conversely, if we want to select the entries for which the column has missing
// values, we do the following
auto df_missingcol = df.FilterMissing("col");
// Following operations in the same branch of the computation graph clearly
// cannot access that same column, since there would be no value to read
auto df_observable = df_missingcol.Define("observable", "othercolumn * 2");
DefaultValueFor creates a node of the computation graph which just forwards the
values of the columns necessary for other downstream nodes, when they are
available. In case a value of the input column passed to this function is not
available, the node will provide the default value passed to this function call
instead.
df = ROOT.RDataFrame(dataset)
# Anytime an entry from "col" is missing, the value will be the default one
default_value = ... # Some sensible default value here
df = df.DefaultValueFor("col", default_value)
df = df.Define("twice", "col * 2")

ROOT::RDataFrame df{dataset};
// Anytime an entry from "col" is missing, the value will be the default one
constexpr auto default_value = ... // Some sensible default value here
auto df_default = df.DefaultValueFor("col", default_value);
auto df_col = df_default.Define("twice", "col * 2");
#### Mixing different strategies to work with missing values in the same RDataFrame

All the operations presented above only act on the particular branch of the
computation graph where they are called, so that different results can be
obtained by mixing and matching the filtering and default-value
strategies:
df = ROOT.RDataFrame(dataset)
# Anytime an entry from "col" is missing, the value will be the default one
default_value = ... # Some sensible default value here
df_default = df.DefaultValueFor("col", default_value).Define("twice", "col * 2")
df_filtered = df.FilterAvailable("col").Define("twice", "col * 2")

# Same number of total entries as the input dataset, with defaulted values
df_default.Display(["twice"]).Print()
# Only keep the entries where "col" has values
df_filtered.Display(["twice"]).Print()

ROOT::RDataFrame df{dataset};

// Anytime an entry from "col" is missing, the value will be the default one
constexpr auto default_value = ... // Some sensible default value here
auto df_default = df.DefaultValueFor("col", default_value).Define("twice", "col * 2");
auto df_filtered = df.FilterAvailable("col").Define("twice", "col * 2");

// Same number of total entries as the input dataset, with defaulted values
df_default.Display({"twice"})->Print();
// Only keep the entries where "col" has values
df_filtered.Display({"twice"})->Print();
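
The semantics of the three operations can be modelled in plain C++ with `std::optional`. The sketch below is an analogy only, with hypothetical names (`Column`, `KeepAvailable`, `CountMissing`, `WithDefault`) and a column stored as an in-memory vector; it says nothing about how RDataFrame implements these operations.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical in-memory column: std::nullopt marks a missing value.
using Column = std::vector<std::optional<double>>;

// Analogous to FilterAvailable: keep only the entries whose value is present.
std::vector<double> KeepAvailable(const Column &col)
{
   std::vector<double> out;
   for (const auto &v : col)
      if (v.has_value())
         out.push_back(*v);
   return out;
}

// Analogous to FilterMissing: count the entries that would be kept,
// i.e. those whose value is missing (there is no value to return for them).
std::size_t CountMissing(const Column &col)
{
   std::size_t n = 0;
   for (const auto &v : col)
      if (!v.has_value())
         ++n;
   return n;
}

// Analogous to DefaultValueFor: forward present values, substitute the default otherwise.
std::vector<double> WithDefault(const Column &col, double defaultValue)
{
   std::vector<double> out;
   out.reserve(col.size());
   for (const auto &v : col)
      out.push_back(v.value_or(defaultValue));
   assert(out.size() == col.size()); // same number of entries as the input
   return out;
}
```

For a column holding `1.`, a missing value and `3.`, `KeepAvailable` returns two entries, `CountMissing` returns 1, and `WithDefault(col, -1.)` returns three entries with `-1.` in the middle.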
#### Further considerations

Note that working with missing values is currently supported with a TTree-based
data source. Support of this functionality for other data sources may come in
the future.
   : RInterface(std::make_shared<RDFDetail::RLoopManager>(nullptr, defaultColumns))
      auto msg = "Invalid TDirectory!";
      throw std::runtime_error(msg);
   const std::string treeNameInt(treeName);
   auto tree = static_cast<TTree *>(dirPtr->Get(treeNameInt.c_str()));
      auto msg = "Tree \"" + treeNameInt + "\" cannot be found!";
      throw std::runtime_error(msg);
   GetProxiedPtr()->SetTree(std::shared_ptr<TTree>(tree, [](TTree *) {}));
RDataFrame::RDataFrame(std::string_view treeName, std::string_view fileNameGlob, const ColumnNames_t &defaultColumns)
   : RInterface(ROOT::Detail::RDF::CreateLMFromFile(treeName, fileNameGlob, defaultColumns))
RDataFrame::RDataFrame(std::string_view treeName, std::string_view fileNameGlob, const ColumnNames_t &defaultColumns)
   : RInterface(ROOT::Detail::RDF::CreateLMFromTTree(treeName, fileNameGlob, defaultColumns))
                       const ColumnNames_t &defaultColumns)
   : RInterface(ROOT::Detail::RDF::CreateLMFromFile(datasetName, fileNameGlobs, defaultColumns))
   : RInterface(ROOT::Detail::RDF::CreateLMFromTTree(datasetName, fileNameGlobs, defaultColumns))
namespace Experimental {
   const nlohmann::ordered_json fullData = nlohmann::ordered_json::parse(std::ifstream(jsonFile));
   if (!fullData.contains("samples") || fullData["samples"].empty()) {
      throw std::runtime_error(
         R"(The input specification does not contain any samples. Please provide the samples in the specification like:
         "trees": ["tree1", "tree2"],
         "files": ["file1.root", "file2.root"],
         "metadata": {"lumi": 1.0, }
         "trees": ["tree3", "tree4"],
         "files": ["file3.root", "file4.root"],
         "metadata": {"lumi": 0.5, }
   for (const auto &keyValue : fullData["samples"].items()) {
      const std::string &sampleName = keyValue.key();
      const auto &sample = keyValue.value();
      if (!sample.contains("trees")) {
         throw std::runtime_error("A list of tree names must be provided for sample " + sampleName + ".");
      std::vector<std::string> trees = sample["trees"];
      if (!sample.contains("files")) {
         throw std::runtime_error("A list of files must be provided for sample " + sampleName + ".");
      std::vector<std::string> files = sample["files"];
      if (!sample.contains("metadata")) {
         for (const auto &metadata : sample["metadata"].items()) {
            const auto &val = metadata.value();
            if (val.is_string())
               m.Add(metadata.key(), val.get<std::string>());
            else if (val.is_number_integer())
               m.Add(metadata.key(), val.get<int>());
            else if (val.is_number_float())
               m.Add(metadata.key(), val.get<double>());
               throw std::logic_error("The metadata keys can only be of type [string|int|double].");
   if (fullData.contains("friends")) {
      for (const auto &friends : fullData["friends"].items()) {
         std::string alias = friends.key();
         std::vector<std::string> trees = friends.value()["trees"];
         std::vector<std::string> files = friends.value()["files"];
         if (files.size() != trees.size() && trees.size() > 1)
            throw std::runtime_error("Mismatch between trees and files in a friend.");
   if (fullData.contains("range")) {
      std::vector<int> range = fullData["range"];
      if (range.size() == 1)
      else if (range.size() == 2)
      throw std::runtime_error("Cannot print information about this RDataFrame, "
                               "it was not properly created. It must be discarded.");
   auto *tree = lm->GetTree();
   auto defCols = lm->GetDefaultColumnNames();
   std::ostringstream ret;
      ret << "A data frame built on top of the " << tree->GetName() << " dataset.";
      if (!defCols.empty()) {
         if (defCols.size() == 1)
            ret << "\nDefault column: " << defCols[0];
            ret << "\nDefault columns:\n";
            for (auto &&col : defCols) {
               ret << " - " << col << "\n";
      ret << "A data frame associated to the data source \"" << cling::printValue(ds) << "\"";
      ret << "An empty data frame that will create " << lm->GetNEmptyEntries() << " entries\n";