   void Exec(unsigned int slot)
   {
      fPerThreadResults[slot]++;
   }
   // Called at the end of the event loop.
   void Finalize()
   {
      *fFinalResult = std::accumulate(fPerThreadResults.begin(), fPerThreadResults.end(), 0);
   }
   // Called by RDataFrame to retrieve the name of this action.
   std::string GetActionName() const { return "MyCounter"; }
};

int main() {
   ROOT::RDataFrame df(10);
   ROOT::RDF::RResultPtr<int> resultPtr = df.Book<>(MyCounter{df.GetNSlots()}, {});
   // The GetValue call triggers the event loop
   std::cout << "Number of processed entries: " << resultPtr.GetValue() << std::endl;
}

See the Book() method for more information and [this tutorial](https://root.cern/doc/master/df018__customActions_8C.html)
for a more complete example.
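
The per-slot pattern used by `MyCounter` can be tried out without ROOT. The sketch below is a stand-in, not RDataFrame code: `SlotCounter` and `CountEntries` are hypothetical names, the class does not derive from any ROOT base class, and plain `std::thread`s play the role of the processing slots. It only illustrates why giving each slot its own counter and merging in `Finalize()` needs no locking.

```cpp
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for a slot-aware action helper: one counter per slot,
// merged once at the end. It does not derive from any ROOT class.
class SlotCounter {
   std::vector<int> fPerSlotCounts;
   int fTotal = 0;

public:
   explicit SlotCounter(unsigned int nSlots) : fPerSlotCounts(nSlots, 0) {}

   // Called once per entry. Each slot only touches its own element, so no locking is needed.
   void Exec(unsigned int slot)
   {
      assert(slot < fPerSlotCounts.size());
      ++fPerSlotCounts[slot];
   }

   // Called once after all threads have finished: merges the partial counts.
   void Finalize() { fTotal = std::accumulate(fPerSlotCounts.begin(), fPerSlotCounts.end(), 0); }

   int GetTotal() const { return fTotal; }
};

// Hypothetical driver: processes `entriesPerSlot` entries on each of `nSlots` threads.
inline int CountEntries(unsigned int nSlots, unsigned int entriesPerSlot)
{
   SlotCounter counter(nSlots);
   std::vector<std::thread> workers;
   for (unsigned int slot = 0; slot < nSlots; ++slot)
      workers.emplace_back([&counter, slot, entriesPerSlot] {
         for (unsigned int i = 0; i < entriesPerSlot; ++i)
            counter.Exec(slot);
      });
   for (auto &w : workers)
      w.join();
   counter.Finalize();
   return counter.GetTotal();
}
```

Calling `CountEntries(4, 1000)` runs four threads that each call `Exec` with a distinct slot number, and the merged total is the number of processed entries.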

#### Injecting arbitrary code in the event loop with Foreach() and ForeachSlot()

Foreach() takes a callable (lambda expression, free function, functor...) and a list of columns and
executes the callable on the values of those columns for each event that passes all upstream selections.
It can be used to perform actions that are not already available in the interface. For example, the following snippet
evaluates the root mean square of column "x":

// Single-thread evaluation of RMS of column "x" using Foreach
double sumSq = 0.;
unsigned int n = 0;
df.Foreach([&sumSq, &n](double x) { ++n; sumSq += x*x; }, {"x"});
std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;

In multi-thread runs, users are responsible for the thread-safety of the expression passed to Foreach():
several threads will execute the expression concurrently.
The code above would need some resource protection mechanism to prevent concurrent writes to `sumSq` and `n`, but
that is a lot of machinery for such a simple operation.
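
One such protection mechanism is a mutex held for the duration of each update. The sketch below shows the idea with the standard library only: `RmsAccumulator` and `ParallelRms` are hypothetical names, and `std::thread`s stand in for the event-loop threads, so this illustrates the locking pattern rather than RDataFrame's own API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical accumulator whose call operator is safe to invoke from several threads.
struct RmsAccumulator {
   std::mutex fMutex;
   double fSumSq = 0.;
   unsigned long long fN = 0;

   void operator()(double x)
   {
      std::lock_guard<std::mutex> lock(fMutex); // serializes the two writes below
      ++fN;
      fSumSq += x * x;
   }

   // Meant to be called only after all writer threads have finished.
   double Rms() const { return fN ? std::sqrt(fSumSq / fN) : 0.; }
};

// Hypothetical driver: splits `values` across `nThreads` threads that share one accumulator.
inline double ParallelRms(const std::vector<double> &values, unsigned int nThreads)
{
   assert(nThreads > 0);
   RmsAccumulator acc;
   std::vector<std::thread> workers;
   for (unsigned int t = 0; t < nThreads; ++t)
      workers.emplace_back([&acc, &values, t, nThreads] {
         // Thread t handles elements t, t + nThreads, t + 2 * nThreads, ...
         for (std::size_t i = t; i < values.size(); i += nThreads)
            acc(values[i]);
      });
   for (auto &w : workers)
      w.join();
   return acc.Rms();
}
```

Every call contends for the same lock, so the threads spend part of their time waiting for each other. That cost is the reason to prefer an approach that avoids shared state altogether.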

ForeachSlot() can help in this situation. It is an alternative version of Foreach() for which the function takes an
additional "processing slot" parameter besides the columns it should be applied to. RDataFrame
guarantees that ForeachSlot() will invoke the user expression with different `slot` parameters for different concurrent
executions (see [Special helper columns: rdfentry_ and rdfslot_](\ref helper-cols) for more information on the slot parameter).
We can take advantage of ForeachSlot() to evaluate a thread-safe root mean square of column "x":

// Thread-safe evaluation of RMS of column "x" using ForeachSlot
ROOT::EnableImplicitMT();
const unsigned int nSlots = df.GetNSlots();
std::vector<double> sumSqs(nSlots, 0.);
std::vector<unsigned int> ns(nSlots, 0);

df.ForeachSlot([&sumSqs, &ns](unsigned int slot, double x) { sumSqs[slot] += x*x; ns[slot] += 1; }, {"x"});
double sumSq = std::accumulate(sumSqs.begin(), sumSqs.end(), 0.); // sum all squares
unsigned int n = std::accumulate(ns.begin(), ns.end(), 0u); // sum all counts (0u keeps the sum unsigned)
std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;

Notice how we created one `double` accumulator and one counter for each processing slot and later merged the per-slot results via `std::accumulate`.

Friend TTrees are supported by RDataFrame.
Friend TTrees with a TTreeIndex are supported starting from ROOT v6.24.

To use friend trees in RDataFrame, it is necessary to add the friends directly to
the tree and instantiate an RDataFrame with the main tree:

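// `t` is the main TTree and `ft` the friend TTree (their construction is omitted here)
t.AddFriend(&ft, "myFriend");

ROOT::RDataFrame d(t);
auto f = d.Filter("myFriend.MyCol == 42");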

Columns coming from the friend trees can be referred to by their full name, like in the example above,
or the friend tree name can be omitted in case the column name is not ambiguous (e.g. "MyCol" could be used instead of
"myFriend.MyCol" in the example above).

\anchor other-file-formats
### Reading data formats other than ROOT trees
RDataFrame can be interfaced with RDataSources. The ROOT::RDF::RDataSource interface defines an API that RDataFrame can use to read arbitrary columnar data formats.

RDataFrame calls into concrete RDataSource implementations to retrieve information about the data, to retrieve (thread-local) readers or "cursors" for selected columns,
and to advance the readers to the desired data entry.
Some predefined RDataSources are natively provided by ROOT, such as ROOT::RDF::RCsvDS, which reads comma-separated files:

auto tdf = ROOT::RDF::MakeCsvDataFrame("MuRun2010B.csv");
auto filteredEvents =
   tdf.Filter("Q1 * Q2 == -1")
      .Define("m", "sqrt(pow(E1 + E2, 2) - (pow(px1 + px2, 2) + pow(py1 + py2, 2) + pow(pz1 + pz2, 2)))");
auto h = filteredEvents.Histo1D("m");

See also MakeNumpyDataFrame (Python-only), MakeNTupleDataFrame(), MakeArrowDataFrame(), MakeSqliteDataFrame().
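
The expression passed to Define() in the CSV example is the invariant mass of a two-particle system, sqrt((E1 + E2)^2 - |p1 + p2|^2), in units where c = 1. As a minimal sketch, the same formula can be written as a plain C++ function. The `Particle` struct and the `InvariantMass` name are illustrative and not part of ROOT, and unlike the string expression this version clamps slightly negative values to zero before taking the square root.

```cpp
#include <cassert>
#include <cmath>

// Illustrative four-momentum holder: energy and momentum components in natural units.
struct Particle {
   double E, px, py, pz;
};

// Invariant mass of the two-particle system: sqrt((E1 + E2)^2 - |p1 + p2|^2).
inline double InvariantMass(const Particle &a, const Particle &b)
{
   const double E = a.E + b.E;
   const double px = a.px + b.px;
   const double py = a.py + b.py;
   const double pz = a.pz + b.pz;
   const double m2 = E * E - (px * px + py * py + pz * pz);
   assert(!std::isnan(m2));
   // Rounding can push m2 slightly below zero for (nearly) massless systems: clamp before sqrt.
   return std::sqrt(m2 > 0. ? m2 : 0.);
}
```

For two massless particles flying back to back, e.g. `InvariantMass({1., 1., 0., 0.}, {1., -1., 0., 0.})`, the momenta cancel and the result equals the summed energy.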

### Computation graphs (storing and reusing sets of transformations)

As we saw, transformed dataframes can be stored as variables and reused multiple times to create modified versions of the dataset. This implicitly defines a **computation graph** in which
several paths of filtering/creation of columns are executed simultaneously, and finally aggregated results are produced.

RDataFrame detects when several actions use the same filter or the same defined column, and **only evaluates each
filter or defined column once per event**, regardless of how many times that result is used down the computation graph.
Objects read from each column are **built once and never copied**, for maximum efficiency.
When "upstream" filters are not passed, subsequent filters, temporary column expressions and actions are not evaluated,
so it might be advisable to put the strictest filters first in the graph.
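
The advice on filter ordering can be made concrete with a toy model. The sketch below is not RDataFrame code: `CountFilterCalls`, `LooseCut` and `StrictCut` are hypothetical names. The helper applies a chain of predicates to each entry, stops at the first one that fails (mirroring the behaviour described above), and counts how many predicate evaluations were needed in total.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Filter = std::function<bool(int)>;

// Applies `filters` in order to each entry in [0, nEntries), stopping at the first failing filter.
// Returns the total number of filter evaluations performed.
inline std::size_t CountFilterCalls(const std::vector<Filter> &filters, int nEntries)
{
   assert(nEntries >= 0);
   std::size_t nCalls = 0;
   for (int entry = 0; entry < nEntries; ++entry) {
      for (const auto &filter : filters) {
         ++nCalls;
         if (!filter(entry))
            break; // downstream filters are skipped for this entry
      }
   }
   return nCalls;
}

// Two example cuts: a loose one (passes 9 entries out of 10) and a strict one (passes 1 out of 10).
inline bool LooseCut(int entry) { return entry % 10 != 0; }
inline bool StrictCut(int entry) { return entry % 10 == 0; }
```

With 100 entries, `CountFilterCalls({LooseCut, StrictCut}, 100)` performs 190 evaluations, while `CountFilterCalls({StrictCut, LooseCut}, 100)` performs only 110, because the strict cut rejects most entries before the second filter is reached.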

\anchor representgraph
### Visualizing the computation graph
It is possible to print the computation graph from any node to obtain a [DOT (graphviz)](https://en.wikipedia.org/wiki/DOT_(graph_description_language)) representation, either on the standard output
or in a file.

Invoking the function ROOT::RDF::SaveGraph() on any node that is not the head node prints the computation graph of the branch
the node belongs to. Using the head node prints the entire computation graph.

The following is an example of usage:

// First, a sample computation graph is built
ROOT::RDataFrame df("tree", "f.root");

auto df2 = df.Define("x", []() { return 1; })
             .Filter("col0 % 1 == col0")
             .Filter([](int b1) { return b1 < 2; }, {"cut1"})
             .Define("y", []() { return 1; });

auto count = df2.Count();

// Writes the graph to the mydot.dot file in the current directory
ROOT::RDF::SaveGraph(df, "./mydot.dot");
// Returns the graph as a DOT string, printed here to standard output
std::cout << ROOT::RDF::SaveGraph(df) << std::endl;

The generated graph can be rendered using one of the graphviz filters, e.g. `dot`. For instance, the image below can be generated with the following command:

$ dot -Tpng computation_graph.dot -ocomputation_graph.png

\image html RDF_Graph2.png
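
For readers unfamiliar with the DOT format, the sketch below builds a DOT description of a simple linear chain of nodes using only the standard library. `ChainToDot` is a hypothetical helper, and its output only illustrates the general shape of a DOT file: the text produced by SaveGraph() carries additional attributes and is laid out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: emits a DOT digraph in which each label becomes a node
// connected to the next one, e.g. Define -> Filter -> Count.
// Labels are written verbatim, so they must not contain double quotes.
inline std::string ChainToDot(const std::vector<std::string> &labels)
{
   assert(!labels.empty());
   std::ostringstream dot;
   dot << "digraph {\n";
   // One line per node, identified by its index in the chain.
   for (std::size_t i = 0; i < labels.size(); ++i)
      dot << "   " << i << " [label=\"" << labels[i] << "\"];\n";
   // One directed edge from each node to its successor.
   for (std::size_t i = 0; i + 1 < labels.size(); ++i)
      dot << "   " << i << " -> " << i + 1 << ";\n";
   dot << "}\n";
   return dot.str();
}
```

Writing the returned string to a file and passing that file to `dot -Tpng` renders the chain as an image.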

### Activating RDataFrame execution logs

RDataFrame has experimental support for verbose logging of the event loop runtimes and other interesting related information. It is activated as follows:

#include <ROOT/RLogger.hxx>

// this increases RDF's verbosity level as long as the `verbosity` variable is in scope
auto verbosity = ROOT::Experimental::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kInfo);

or, in Python:

import ROOT

verbosity = ROOT.Experimental.RLogScopedVerbosity(ROOT.Detail.RDF.RDFLogChannel(), ROOT.Experimental.ELogLevel.kInfo)

   : RInterface(std::make_shared<RDFDetail::RLoopManager>(nullptr, defaultColumns))
   if (!dirPtr) {
      auto msg = "Invalid TDirectory!";
      throw std::runtime_error(msg);
   }
   const std::string treeNameInt(treeName);
   auto tree = static_cast<TTree *>(dirPtr->Get(treeNameInt.c_str()));
   if (!tree) {
      auto msg = "Tree \"" + treeNameInt + "\" cannot be found!";
      throw std::runtime_error(msg);
   }
   // The no-op deleter makes this shared_ptr non-owning: the tree is not deleted through it
   GetProxiedPtr()->SetTree(std::shared_ptr<TTree>(tree, [](TTree *) {}));

   const std::string treeNameInt(treeName);
   const std::string filenameglobInt(filenameglob);
   auto chain = std::make_shared<TChain>(treeNameInt.c_str());
   chain->Add(filenameglobInt.c_str());

   std::string treeNameInt(treeName);
   auto chain = std::make_shared<TChain>(treeNameInt.c_str());
   for (auto &f : fileglobs)
      chain->Add(f.c_str());

   auto *tree = df.GetTree();
   auto defCols = df.GetDefaultColumnNames();

   std::ostringstream ret;
   if (tree) {
      ret << "A data frame built on top of the " << tree->GetName() << " dataset.";
      if (!defCols.empty()) {
         if (defCols.size() == 1)
            ret << "\nDefault column: " << defCols[0];
         else {
            ret << "\nDefault columns:\n";
            for (auto &&col : defCols) {
               ret << " - " << col << "\n";
            }
         }
      }
   } else if (auto ds = df.fDataSource) {
      ret << "A data frame associated to the data source \"" << cling::printValue(ds) << "\"";
   } else {
      ret << "An empty data frame that will create " << df.GetNEmptyEntries() << " entries\n";
   }