#include <nlohmann/json.hpp>
void Exec(unsigned int slot)
fPerThreadResults[slot]++;
// Called at the end of the event loop.
*fFinalResult = std::accumulate(fPerThreadResults.begin(), fPerThreadResults.end(), 0);
// Called by RDataFrame to retrieve the name of this action.
std::string GetActionName() const { return "MyCounter"; }
ROOT::RDataFrame df(10);
ROOT::RDF::RResultPtr<int> resultPtr = df.Book<>(MyCounter{df.GetNSlots()}, {});
// The GetValue call triggers the event loop
std::cout << "Number of processed entries: " << resultPtr.GetValue() << std::endl;

See the Book() method for more information and [this tutorial](https://root.cern/doc/master/df018__customActions_8C.html)
for a more complete example.

#### Injecting arbitrary code in the event loop with Foreach() and ForeachSlot()

Foreach() takes a callable (lambda expression, free function, functor...) and a list of columns and
executes the callable on the values of those columns for each event that passes all upstream selections.
It can be used to perform actions that are not already available in the interface. For example, the following snippet
evaluates the root mean square of column "x":

// Single-thread evaluation of RMS of column "x" using Foreach
df.Foreach([&sumSq, &n](double x) { ++n; sumSq += x*x; }, {"x"});
std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;

In multi-thread runs, users are responsible for the thread-safety of the expression passed to Foreach():
multiple threads will execute the expression concurrently.
The code above would need some resource protection mechanism (for example a mutex) to prevent concurrent writes
to `sumSq` and `n`, which is a disproportionate amount of work for such a simple operation.
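
To make the cost of such a protection mechanism concrete, here is a minimal sketch in plain standard C++. It does not use RDataFrame: `MutexProtectedRms` is a hypothetical helper, and the `std::thread` workers only stand in for the threads of a multi-thread event loop. The `fill` lambda plays the role of the expression passed to Foreach().

```cpp
#include <cmath>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper, not part of RDataFrame: computes the RMS of `values`
// by splitting them across `nThreads` plain std::threads (nThreads >= 1).
double MutexProtectedRms(const std::vector<double> &values, unsigned int nThreads)
{
   double sumSq = 0.;
   std::size_t n = 0;
   std::mutex m; // guards both sumSq and n

   // Stand-in for the expression passed to Foreach(): every call takes the lock
   auto fill = [&](double x) {
      std::lock_guard<std::mutex> lock(m);
      ++n;
      sumSq += x * x;
   };

   std::vector<std::thread> workers;
   for (unsigned int t = 0; t < nThreads; ++t) {
      workers.emplace_back([&, t] {
         // Each thread processes every nThreads-th value, starting at index t
         for (std::size_t i = t; i < values.size(); i += nThreads)
            fill(values[i]);
      });
   }
   for (auto &w : workers)
      w.join();

   return n > 0 ? std::sqrt(sumSq / n) : 0.;
}
```

Every call to `fill` acquires the lock, so the threads serialize on each entry. That contention is the price of sharing a single accumulator.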

ForeachSlot() can help in this situation. It is an alternative version of Foreach() for which the function takes an
additional "processing slot" parameter besides the columns it should be applied to. RDataFrame
guarantees that ForeachSlot() will invoke the user expression with different `slot` parameters for different concurrent
executions (see [Special helper columns: rdfentry_ and rdfslot_](\ref helper-cols) for more information on the slot parameter).
We can take advantage of ForeachSlot() to evaluate a thread-safe root mean square of column "x":

// Thread-safe evaluation of RMS of column "x" using ForeachSlot
ROOT::EnableImplicitMT();
const unsigned int nSlots = df.GetNSlots();
std::vector<double> sumSqs(nSlots, 0.);
std::vector<unsigned int> ns(nSlots, 0);
df.ForeachSlot([&sumSqs, &ns](unsigned int slot, double x) { sumSqs[slot] += x*x; ns[slot] += 1; }, {"x"});
double sumSq = std::accumulate(sumSqs.begin(), sumSqs.end(), 0.); // sum all squares
unsigned int n = std::accumulate(ns.begin(), ns.end(), 0); // sum all counts
std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;

Notice how we created one `double` accumulator and one counter for each processing slot and later merged their results via `std::accumulate`.

### Dataset joins with friend trees

Vertically concatenating multiple trees that have the same columns (creating a logical dataset with the same columns and
more rows) is trivial in RDataFrame: just pass the tree name and a list of file names to RDataFrame's constructor, or create a TChain
out of the desired trees and pass that to RDataFrame.

Horizontal concatenations of trees or chains (creating a logical dataset with the same number of rows and the union of the
columns of multiple trees) leverage TTree's "friend" mechanism.

Simple joins of trees that do not have the same number of rows are also possible with indexed friend trees (see below).

To use friend trees in RDataFrame, set up trees with the appropriate relationships and then instantiate an RDataFrame
with the main tree:

main.AddFriend(&friend, "myFriend");
auto df2 = df.Filter("myFriend.MyCol == 42");

The same applies for TChains. Columns coming from the friend trees can be referred to by their full name, like in the example above,
or the friend tree name can be omitted in case the column name is not ambiguous (e.g. "MyCol" could be used instead of
"myFriend.MyCol" in the example above if there is no column "MyCol" in the main tree).

\note A common source of confusion is that trees that are written out from a multi-thread Snapshot() call will have their
 entries (block-wise) shuffled with respect to the original tree. Such trees cannot be used as friends of the original
 one: rows will be mismatched.

Indexed friend trees provide a way to perform simple joins of multiple trees over a common column.
When a certain entry in the main tree (or chain) is loaded, the friend trees (or chains) will then load an entry where the
"index" columns have a value identical to the one in the main one. For example, in Python:

# If a friend tree has an index on `commonColumn`, when the main tree loads
# a given row, it also loads the row of the friend tree that has the same
# value of `commonColumn`
aux_tree.BuildIndex("commonColumn")
mainTree.AddFriend(aux_tree)
df = ROOT.RDataFrame(mainTree)

RDataFrame supports indexed friend TTrees from ROOT v6.24 in single-thread mode and from v6.28/02 in multi-thread mode.
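
The matching rule of indexed friends can be pictured as a key lookup. The sketch below is a plain C++ illustration of that idea only, not how TTree or RDataFrame implement it. `MainRow`, `AuxRow` and `JoinOnIndex` are hypothetical names, and the `std::unordered_map` stands in for the index built on `commonColumn`.

```cpp
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row types standing in for entries of the main and friend trees
struct MainRow { int commonColumn; double x; };
struct AuxRow { int commonColumn; double y; };

// For each main row, look up the auxiliary row that has the same key.
// The map plays the role of the index built on `commonColumn`.
std::vector<std::pair<MainRow, std::optional<AuxRow>>>
JoinOnIndex(const std::vector<MainRow> &mainRows, const std::vector<AuxRow> &auxRows)
{
   std::unordered_map<int, AuxRow> index;
   for (const auto &r : auxRows)
      index.emplace(r.commonColumn, r); // keeps the first row seen for a given key

   std::vector<std::pair<MainRow, std::optional<AuxRow>>> joined;
   for (const auto &r : mainRows) {
      auto it = index.find(r.commonColumn);
      if (it != index.end())
         joined.emplace_back(r, it->second);
      else
         joined.emplace_back(r, std::nullopt); // no match: the friend value is missing
   }
   return joined;
}
```

The result always has as many rows as the main dataset. A main row without a matching key ends up with a missing friend value.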

\anchor other-file-formats
### Reading data formats other than ROOT trees
RDataFrame can be interfaced with RDataSources. The ROOT::RDF::RDataSource interface defines an API that RDataFrame can use to read arbitrary columnar data formats.

RDataFrame calls into concrete RDataSource implementations to retrieve information about the data, retrieve (thread-local) readers or "cursors" for selected columns
and to advance the readers to the desired data entry.
Some predefined RDataSources are natively provided by ROOT, such as ROOT::RDF::RCsvDS, which allows reading comma-separated files:
auto tdf = ROOT::RDF::FromCSV("MuRun2010B.csv");
auto filteredEvents =
   tdf.Filter("Q1 * Q2 == -1")
      .Define("m", "sqrt(pow(E1 + E2, 2) - (pow(px1 + px2, 2) + pow(py1 + py2, 2) + pow(pz1 + pz2, 2)))");
auto h = filteredEvents.Histo1D("m");

See also FromNumpy (Python-only), FromRNTuple(), FromArrow(), FromSqlite().

### Computation graphs (storing and reusing sets of transformations)

As we saw, transformed dataframes can be stored as variables and reused multiple times to create modified versions of the dataset. This implicitly defines a **computation graph** in which
several paths of filtering/creation of columns are executed simultaneously, and finally aggregated results are produced.

RDataFrame detects when several actions use the same filter or the same defined column, and **only evaluates each
filter or defined column once per event**, regardless of how many times that result is used down the computation graph.
Objects read from each column are **built once and never copied**, for maximum efficiency.
When "upstream" filters are not passed, subsequent filters, temporary column expressions and actions are not evaluated,
so it might be advisable to put the strictest filters first in the graph.
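
The effect of filter ordering can be illustrated with a small model in plain C++. This is a sketch of the short-circuit idea only, not RDataFrame code: `CountFilterEvaluations` is a hypothetical helper that applies a chain of filters to each entry and stops at the first one that fails.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: applies `filters` in order to every entry, skipping the
// remaining filters for an entry as soon as one fails, and returns how many
// filter evaluations were performed in total.
std::size_t CountFilterEvaluations(const std::vector<int> &entries,
                                   const std::vector<std::function<bool(int)>> &filters)
{
   std::size_t nEvaluations = 0;
   for (int e : entries) {
      for (const auto &f : filters) {
         ++nEvaluations;
         if (!f(e))
            break; // downstream filters are not evaluated for this entry
      }
   }
   return nEvaluations;
}
```

With ten entries, a filter that only one entry passes and a filter that every entry passes, putting the strict filter first costs 11 evaluations. The opposite order costs 20.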

\anchor representgraph
### Visualizing the computation graph
It is possible to print the computation graph from any node to obtain a [DOT (graphviz)](https://en.wikipedia.org/wiki/DOT_(graph_description_language)) representation either on the standard output
or in a file.

Invoking ROOT::RDF::SaveGraph() on a node that is not the head node prints the computation graph of the branch
that node belongs to. Invoking it on the head node prints the entire computation graph.

The following is an example of usage:
// First, a sample computational graph is built
ROOT::RDataFrame df("tree", "f.root");

auto df2 = df.Define("x", []() { return 1; })
             .Filter("col0 % 1 == col0")
             .Filter([](int b1) { return b1 <2; }, {"cut1"})
             .Define("y", []() { return 1; });

auto count = df2.Count();

// Prints the graph to the mydot.dot file in the current directory
ROOT::RDF::SaveGraph(df, "./mydot.dot");
// Prints the graph to standard output
ROOT::RDF::SaveGraph(df);

The generated graph can be rendered using one of the graphviz filters, e.g. `dot`. For instance, the image below can be generated with the following command:
$ dot -Tpng computation_graph.dot -ocomputation_graph.png

\image html RDF_Graph2.png

### Activating RDataFrame execution logs

RDataFrame has experimental support for verbose logging of the event loop runtimes and other interesting related information. It is activated as follows:
#include <ROOT/RLogger.hxx>

// this increases RDF's verbosity level as long as the `verbosity` variable is in scope
auto verbosity = ROOT::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::ELogLevel::kInfo);

verbosity = ROOT.RLogScopedVerbosity(ROOT.Detail.RDF.RDFLogChannel(), ROOT.ELogLevel.kInfo)

More information (e.g. start and end of each multi-thread task) is printed using `ELogLevel.kDebug` and even more
(e.g. a full dump of the generated code that RDataFrame just-in-time-compiles) using `ELogLevel.kDebug+10`.

\anchor rdf-from-spec
### Creating an RDataFrame from a dataset specification file

An RDataFrame can be created using a dataset specification JSON file:
df = ROOT.RDF.Experimental.FromSpec("spec.json")

The input dataset specification JSON file needs to be provided by the user and it describes all necessary samples and
their associated metadata information. The main required key is "samples" (at least one sample is needed) and the
required sub-keys for each sample are "trees" and "files". Additionally, one can specify a metadata dictionary for each
sample in the "metadata" key.

A simple example for the formatting of the specification in the JSON file is the following:
         "trees": ["tree1", "tree2"],
         "files": ["file1.root", "file2.root"],
            "sample_category": "data"
         "trees": ["tree3", "tree4"],
         "files": ["file3.root", "file4.root"],
            "sample_category": "MC_background"

The metadata information from the specification file can then be accessed using the DefinePerSample function.
For example, to access luminosity information (stored as a double):
df.DefinePerSample("lumi", 'rdfsampleinfo_.GetD("lumi")')

or sample_category information (stored as a string):
df.DefinePerSample("sample_category", 'rdfsampleinfo_.GetS("sample_category")')

or directly the filename:
df.DefinePerSample("name", "rdfsampleinfo_.GetSampleName()")

An example use of FromSpec is available in the tutorial df106_HiggsToFourLeptons.py, which also
provides a corresponding example JSON file for the dataset specification.

### Adding a progress bar

A progress bar showing the processed event statistics can be added to any RDataFrame program.
The event statistics include elapsed time, currently processed file, currently processed events, the rate of event processing
and an estimated remaining time (per file being processed). These statistics are recorded and printed in the terminal every m events and every
n seconds (by default m = 1000 and n = 1). The ProgressBar can also be added when the multithread (MT) mode is enabled.

ProgressBar is added after creating the dataframe object (df):
ROOT::RDataFrame df("tree", "file.root");
ROOT::RDF::Experimental::AddProgressBar(df);

Alternatively, RDataFrame can be cast to an RNode first, giving the user more flexibility.
For example, AddProgressBar can then be called on any computation node, such as a Filter or a Define, not only the head node,
with no change to the ProgressBar function itself (please see the [Python interface](classROOT_1_1RDataFrame.html#python)
section for appropriate usage in Python):
ROOT::RDataFrame df("tree", "file.root");
auto df_1 = ROOT::RDF::RNode(df.Filter("x>1"));
ROOT::RDF::Experimental::AddProgressBar(df_1);

Examples of implemented progress bars can be seen by running [Higgs to Four Lepton tutorial](https://root.cern/doc/master/df106__HiggsToFourLeptons_8py_source.html) and [Dimuon tutorial](https://root.cern/doc/master/df102__NanoAODDimuonAnalysis_8C.html).

\anchor missing-values
### Working with missing values in the dataset

In certain situations a dataset might be missing one or more values at one or
more of its entries. For example:

- If the dataset is composed of multiple files and one or more files are
  missing one or more columns required by the analysis.
- When joining different datasets horizontally according to some index value
  (e.g. the event number), if the index does not find a match in one or more
  other datasets for a certain entry.
- If, for a certain event, a column is invalid because it results from a Snapshot
  with systematic variations, and that variation didn't pass its filters. For
  more details, see \ref snapshot-with-variations.

For example, suppose that column "y" does not have a value for entry 42:

If the RDataFrame application reads that column, for example if a Take() action
was requested, the default behaviour is to throw an exception indicating
that the column is missing an entry.

The following paragraphs discuss the functionalities provided by RDataFrame to
work with missing values in the dataset.

#### FilterAvailable and FilterMissing

FilterAvailable and FilterMissing are specialized RDataFrame Filter operations.
They take as input argument the name of a column of the dataset to watch for
missing values. Like Filter, they will either keep or discard an entire entry
based on whether a condition returns true or false. Specifically:

- FilterAvailable: the condition is whether the value of the column is present.
  If so, the entry is kept. Otherwise if the value is missing the entry is
  discarded.
- FilterMissing: the condition is whether the value of the column is missing. If
  so, the entry is kept. Otherwise if the value is present the entry is
  discarded.

df = ROOT.RDataFrame(dataset)

# Anytime an entry from "col" is missing, the entire entry will be filtered out
df_available = df.FilterAvailable("col")
df_available = df_available.Define("twice", "col * 2")

# Conversely, if we want to select the entries for which the column has missing
# values, we do the following
df_missingcol = df.FilterMissing("col")
# Following operations in the same branch of the computation graph clearly
# cannot access that same column, since there would be no value to read
df_missingcol = df_missingcol.Define("observable", "othercolumn * 2")

ROOT::RDataFrame df{dataset};

// Anytime an entry from "col" is missing, the entire entry will be filtered out
auto df_available = df.FilterAvailable("col");
auto df_twicecol = df_available.Define("twice", "col * 2");

// Conversely, if we want to select the entries for which the column has missing
// values, we do the following
auto df_missingcol = df.FilterMissing("col");
// Following operations in the same branch of the computation graph clearly
// cannot access that same column, since there would be no value to read
auto df_observable = df_missingcol.Define("observable", "othercolumn * 2");

#### DefaultValueFor

DefaultValueFor creates a node of the computation graph which just forwards the
values of the columns necessary for other downstream nodes, when they are
available. In case a value of the input column passed to this function is not
available, the node will provide the default value passed to this function call
instead.

df = ROOT.RDataFrame(dataset)
# Anytime an entry from "col" is missing, the value will be the default one
default_value = ... # Some sensible default value here
df = df.DefaultValueFor("col", default_value)
df = df.Define("twice", "col * 2")

ROOT::RDataFrame df{dataset};
// Anytime an entry from "col" is missing, the value will be the default one
constexpr auto default_value = ... // Some sensible default value here
auto df_default = df.DefaultValueFor("col", default_value);
auto df_col = df_default.Define("twice", "col * 2");

#### Mixing different strategies to work with missing values in the same RDataFrame

All the operations presented above only act on the particular branch of the
computation graph where they are called, so that different results can be
obtained by mixing and matching the filtering and default-value strategies:

df = ROOT.RDataFrame(dataset)
# Anytime an entry from "col" is missing, the value will be the default one
default_value = ... # Some sensible default value here
df_default = df.DefaultValueFor("col", default_value).Define("twice", "col * 2")
df_filtered = df.FilterAvailable("col").Define("twice", "col * 2")

# Same number of total entries as the input dataset, with defaulted values
df_default.Display(["twice"]).Print()
# Only keep the entries where "col" has values
df_filtered.Display(["twice"]).Print()

ROOT::RDataFrame df{dataset};

// Anytime an entry from "col" is missing, the value will be the default one
constexpr auto default_value = ... // Some sensible default value here
auto df_default = df.DefaultValueFor("col", default_value).Define("twice", "col * 2");
auto df_filtered = df.FilterAvailable("col").Define("twice", "col * 2");

// Same number of total entries as the input dataset, with defaulted values
df_default.Display({"twice"})->Print();
// Only keep the entries where "col" has values
df_filtered.Display({"twice"})->Print();
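
The difference between the two branches can be modelled in plain C++ with `std::optional`. This is only an analogy, under the assumption that an empty optional represents a missing value; it says nothing about how RDataFrame stores or detects missing entries. `TwiceOfAvailable` and `TwiceWithDefault` are hypothetical names.

```cpp
#include <optional>
#include <vector>

// Keeps only the entries that have a value and doubles each of them
// (analogous in spirit to FilterAvailable followed by Define).
std::vector<double> TwiceOfAvailable(const std::vector<std::optional<double>> &col)
{
   std::vector<double> out;
   for (const auto &v : col) {
      if (v)
         out.push_back(*v * 2);
   }
   return out;
}

// Keeps every entry, substituting `defaultValue` for missing values first
// (analogous in spirit to DefaultValueFor followed by Define).
std::vector<double> TwiceWithDefault(const std::vector<std::optional<double>> &col, double defaultValue)
{
   std::vector<double> out;
   for (const auto &v : col)
      out.push_back(v.value_or(defaultValue) * 2);
   return out;
}
```

The filtering variant returns fewer rows than the input when values are missing. The defaulting variant always returns one row per input entry.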

#### Further considerations

Note that working with missing values is currently supported with a TTree-based
data source. Support of this functionality for other data sources may come in
the future.

\anchor special-values
### Dealing with NaN or Inf values in the dataset

RDataFrame does not treat NaNs or infinities beyond what the floating-point standards require, i.e. they will
propagate to the final result.
Non-finite numbers can be suppressed using Filter(), e.g.:
df.Filter("std::isfinite(x)").Mean("x")
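
The propagation behaviour is ordinary floating-point arithmetic and can be reproduced without RDataFrame. In the sketch below, `MeanOfAll` and `MeanOfFinite` are hypothetical helpers that compare a plain mean with one that skips non-finite inputs, mirroring the `std::isfinite` filter above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Plain mean: a single NaN or infinity in the input contaminates the result.
// Returns NaN for an empty input.
double MeanOfAll(const std::vector<double> &xs)
{
   if (xs.empty())
      return std::nan("");
   double sum = 0.;
   for (double x : xs)
      sum += x;
   return sum / xs.size();
}

// Mean of the finite values only; returns NaN if there are none.
double MeanOfFinite(const std::vector<double> &xs)
{
   double sum = 0.;
   std::size_t n = 0;
   for (double x : xs) {
      if (std::isfinite(x)) {
         sum += x;
         ++n;
      }
   }
   return n > 0 ? sum / n : std::nan("");
}
```

For the input `{1, NaN, 3}`, `MeanOfAll` yields NaN while `MeanOfFinite` yields 2.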

\anchor rosetta-stone
### Translating TTree commands to RDataFrame
 <b>ROOT::RDataFrame</b>
// Get the tree and Draw a histogram of x for selected y values
auto *tree = file->Get<TTree>("myTree");
tree->Draw("x", "y > 2");
ROOT::RDataFrame df("myTree", file);
df.Filter("y > 2").Histo1D("x")->Draw();
// Draw a histogram of "jet_eta" with the desired weight
tree->Draw("jet_eta", "weight*(event == 1)");
df.Filter("event == 1").Histo1D("jet_eta", "weight")->Draw();
// Draw a histogram filled with values resulting from calling a method of the class of the `event` branch in the TTree.
tree->Draw("event.GetNtrack()");
df.Define("NTrack","event.GetNtrack()").Histo1D("NTrack")->Draw();
// Draw only every 10th event
tree->Draw("fNtrack","fEvtHdr.fEvtNum%10 == 0");
// Use the Filter operation together with the special RDF column: `rdfentry_`
df.Filter("rdfentry_ % 10 == 0").Histo1D("fNtrack")->Draw();
// object selection: for each event, fill histogram with array of selected pts
tree->Draw("Muon_pt", "Muon_pt > 100");
// with RDF, arrays are read as ROOT::VecOps::RVec objects
df.Define("good_pt", "Muon_pt[Muon_pt > 100]").Histo1D("good_pt")->Draw();
// Draw the histogram and fill hnew with it
tree->Draw("sqrt(x)>>hnew","y>0");

// Retrieve hnew from the current directory
auto hnew = gDirectory->Get<TH1F>("hnew");
// We pass histogram constructor arguments to the Histo1D operation, to easily give the histogram a name
auto hist = df.Define("sqrt_x", "sqrt(x)").Filter("y>0").Histo1D({"hnew","hnew", 10, 0, 10}, "sqrt_x");
// Draw a 1D Profile histogram instead of TH2F
tree->Draw("y:x","","prof");

// Draw a 2D Profile histogram instead of TH3F
tree->Draw("z:y:x","","prof");
// Draw a 1D Profile histogram
df.Profile1D("x", "y")->Draw();

// Draw a 2D Profile histogram
df.Profile2D("x", "y", "z")->Draw();
// This command draws 2 entries starting with entry 5
tree->Draw("x", "","", 2, 5);
// Range function with arguments begin, end
df.Range(5,7).Histo1D("x")->Draw();
// Draw the X() component of the
// ROOT::Math::DisplacementVector3D in vec_list
tree->Draw("vec_list.X()");
df.Define("x", "ROOT::RVecD out; for(const auto &el: vec_list) out.push_back(el.X()); return out;").Histo1D("x")->Draw();
// Gather all values from a branch holding a collection per event, `pt`,
// and fill a histogram so that we can count the total number of values across all events
tree->Draw("pt>>histo");
auto histo = gDirectory->Get<TH1D>("histo");
df.Histo1D("pt")->GetEntries();
 <b>TTree::Scan()</b>
 <b>ROOT::RDataFrame</b>
// Print a table of the first 10 entries for all variables in the Tree
// if the first entry in the Muon_pt collection is > 10.
tree->Scan("*", "Muon_pt[0] > 10.", "", 10);
// Selecting columns using a regular expression
df.Filter("Muon_pt[0] > 10.").Display(".*", 10)->Print();
// For 10 events, print Muon_pt and Muon_eta, starting at entry 100
tree->Scan("Muon_pt:Muon_eta", "", "", 10, 100);
// Selecting columns using a collection of names
df.Range(100, 0).Display({"Muon_pt", "Muon_eta"}, 10)->Print();
namespace Experimental {
   auto *lm = df->GetLoopManager();
      throw std::runtime_error("Cannot print information about this RDataFrame, "
                               "it was not properly created. It must be discarded.");
   auto defCols = lm->GetDefaultColumnNames();
   std::ostringstream ret;
   if (auto ds = df->GetDataSource()) {
      ret << "A data frame associated to the data source \"" << cling::printValue(ds) << "\"";
      ret << "An empty data frame that will create " << lm->GetNEmptyEntries() << " entries\n";