24#include <nlohmann/json.hpp>
1187 void Exec(unsigned int slot)
1189 fPerThreadResults[slot]++;
1192 // Called at the end of the event loop.
1195 *fFinalResult = std::accumulate(fPerThreadResults.begin(), fPerThreadResults.end(), 0);
1198 // Called by RDataFrame to retrieve the name of this action.
1199 std::string GetActionName() const { return "MyCounter"; }
1203 ROOT::RDataFrame df(10);
1204 ROOT::RDF::RResultPtr<int> resultPtr = df.Book<>(MyCounter{df.GetNSlots()}, {});
1205 // The GetValue call triggers the event loop
1206 std::cout << "Number of processed entries: " << resultPtr.GetValue() << std::endl;
1210See the Book() method for more information and [this tutorial](https://root.cern/doc/master/df018__customActions_8C.html)
1211for a more complete example.
1213#### Injecting arbitrary code in the event loop with Foreach() and ForeachSlot()
1215Foreach() takes a callable (lambda expression, free function, functor...) and a list of columns and
1216executes the callable on the values of those columns for each event that passes all upstream selections.
1217It can be used to perform actions that are not already available in the interface. For example, the following snippet
1218evaluates the root mean square of column "x":
1220// Single-thread evaluation of RMS of column "x" using Foreach
1223df.Foreach([&sumSq, &n](double x) { ++n; sumSq += x*x; }, {"x"});
1224std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;
1226In multi-thread runs, users are responsible for the thread-safety of the expression passed to Foreach():
1227thread will execute the expression concurrently.
1228The code above would need to employ some resource protection mechanism to ensure non-concurrent writing of `rms`; but
1229this is probably too much head-scratch for such a simple operation.
1231ForeachSlot() can help in this situation. It is an alternative version of Foreach() for which the function takes an
1232additional "processing slot" parameter besides the columns it should be applied to. RDataFrame
1233guarantees that ForeachSlot() will invoke the user expression with different `slot` parameters for different concurrent
1234executions (see [Special helper columns: rdfentry_ and rdfslot_](\ref helper-cols) for more information on the slot parameter).
1235We can take advantage of ForeachSlot() to evaluate a thread-safe root mean square of column "x":
1237// Thread-safe evaluation of RMS of column "x" using ForeachSlot
1238ROOT::EnableImplicitMT();
1239const unsigned int nSlots = df.GetNSlots();
1240std::vector<double> sumSqs(nSlots, 0.);
1241std::vector<unsigned int> ns(nSlots, 0);
1243df.ForeachSlot([&sumSqs, &ns](unsigned int slot, double x) { sumSqs[slot] += x*x; ns[slot] += 1; }, {"x"});
1244double sumSq = std::accumulate(sumSqs.begin(), sumSqs.end(), 0.); // sum all squares
1245unsigned int n = std::accumulate(ns.begin(), ns.end(), 0); // sum all counts
1246std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;
1248Notice how we created one `double` variable for each processing slot and later merged their results via `std::accumulate`.
1253Friend TTrees are supported by RDataFrame.
1254Friend TTrees with a TTreeIndex are supported starting from ROOT v6.24.
1256To use friend trees in RDataFrame, it is necessary to add the friends directly to
1257the tree and instantiate an RDataFrame with the main tree:
1262t.AddFriend(&ft, "myFriend");
1265auto f = d.Filter("myFriend.MyCol == 42");
1268Columns coming from the friend trees can be referred to by their full name, like in the example above,
1269or the friend tree name can be omitted in case the column name is not ambiguous (e.g. "MyCol" could be used instead of
1270 "myFriend.MyCol" in the example above).
1273\anchor other-file-formats
1274### Reading data formats other than ROOT trees
1275RDataFrame can be interfaced with RDataSources. The ROOT::RDF::RDataSource interface defines an API that RDataFrame can use to read arbitrary columnar data formats.
1277RDataFrame calls into concrete RDataSource implementations to retrieve information about the data, retrieve (thread-local) readers or "cursors" for selected columns
1278and to advance the readers to the desired data entry.
1279Some predefined RDataSources are natively provided by ROOT such as the ROOT::RDF::RCsvDS which allows to read comma separated files:
1281auto tdf = ROOT::RDF::MakeCsvDataFrame("MuRun2010B.csv");
1282auto filteredEvents =
1283 tdf.Filter("Q1 * Q2 == -1")
1284 .Define("m", "sqrt(pow(E1 + E2, 2) - (pow(px1 + px2, 2) + pow(py1 + py2, 2) + pow(pz1 + pz2, 2)))");
1285auto h = filteredEvents.Histo1D("m");
1289See also FromNumpy (Python-only), FromRNTuple(), FromArrow(), FromSqlite().
1292### Computation graphs (storing and reusing sets of transformations)
1294As we saw, transformed dataframes can be stored as variables and reused multiple times to create modified versions of the dataset. This implicitly defines a **computation graph** in which
1295several paths of filtering/creation of columns are executed simultaneously, and finally aggregated results are produced.
1297RDataFrame detects when several actions use the same filter or the same defined column, and **only evaluates each
1298filter or defined column once per event**, regardless of how many times that result is used down the computation graph.
1299Objects read from each column are **built once and never copied**, for maximum efficiency.
1300When "upstream" filters are not passed, subsequent filters, temporary column expressions and actions are not evaluated,
1301so it might be advisable to put the strictest filters first in the graph.
1303\anchor representgraph
1304### Visualizing the computation graph
1305It is possible to print the computation graph from any node to obtain a [DOT (graphviz)](https://en.wikipedia.org/wiki/DOT_(graph_description_language)) representation either on the standard output
1308Invoking the function ROOT::RDF::SaveGraph() on any node that is not the head node, the computation graph of the branch
1309the node belongs to is printed. By using the head node, the entire computation graph is printed.
1311Following there is an example of usage:
1313// First, a sample computational graph is built
1314ROOT::RDataFrame df("tree", "f.root");
1316auto df2 = df.Define("x", []() { return 1; })
1317 .Filter("col0 % 1 == col0")
1318 .Filter([](int b1) { return b1 <2; }, {"cut1"})
1319 .Define("y", []() { return 1; });
1321auto count = df2.Count();
1323// Prints the graph to the rd1.dot file in the current directory
1324ROOT::RDF::SaveGraph(df, "./mydot.dot");
1325// Prints the graph to standard output
1326ROOT::RDF::SaveGraph(df);
1329The generated graph can be rendered using one of the graphviz filters, e.g. `dot`. For instance, the image below can be generated with the following command:
1331$ dot -Tpng computation_graph.dot -ocomputation_graph.png
1334\image html RDF_Graph2.png
1337### Activating RDataFrame execution logs
1339RDataFrame has experimental support for verbose logging of the event loop runtimes and other interesting related information. It is activated as follows:
1341#include <ROOT/RLogger.hxx>
1343// this increases RDF's verbosity level as long as the `verbosity` variable is in scope
1344auto verbosity = ROOT::Experimental::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kInfo);
1351verbosity = ROOT.Experimental.RLogScopedVerbosity(ROOT.Detail.RDF.RDFLogChannel(), ROOT.Experimental.ELogLevel.kInfo)
1354More information (e.g. start and end of each multi-thread task) is printed using `ELogLevel.kDebug` and even more
1355(e.g. a full dump of the generated code that RDataFrame just-in-time-compiles) using `ELogLevel.kDebug+10`.
1374 : RInterface(std::make_shared<
RDFDetail::RLoopManager>(nullptr, defaultColumns))
1377 auto msg =
"Invalid TDirectory!";
1378 throw std::runtime_error(msg);
1380 const std::string treeNameInt(treeName);
1381 auto tree =
static_cast<TTree *
>(dirPtr->
Get(treeNameInt.c_str()));
1383 auto msg =
"Tree \"" + treeNameInt +
"\" cannot be found!";
1384 throw std::runtime_error(msg);
1386 GetProxiedPtr()->SetTree(std::shared_ptr<TTree>(
tree, [](
TTree *) {}));
1401RDataFrame::RDataFrame(std::string_view treeName, std::string_view filenameglob,
const ColumnNames_t &defaultColumns)
1404 const std::string treeNameInt(treeName);
1405 const std::string filenameglobInt(filenameglob);
1407 chain->Add(filenameglobInt.c_str());
1426 std::string treeNameInt(treeName);
1428 for (
auto &
f : fileglobs)
1429 chain->Add(
f.c_str());
1491namespace Experimental {
1495 const nlohmann::json fullData = nlohmann::json::parse(std::ifstream(jsonFile));
1496 if (!fullData.contains(
"samples") || fullData[
"samples"].size() == 0) {
1497 throw std::runtime_error(
1498 R
"(The input specification does not contain any samples. Please provide the samples in the specification like:
1502 "trees": ["tree1", "tree2"],
1503 "files": ["file1.root", "file2.root"],
1504 "metadata": {"lumi": 1.0, }
1507 "trees": ["tree3", "tree4"],
1508 "files": ["file3.root", "file4.root"],
1509 "metadata": {"lumi": 0.5, }
1517 for (
const auto &keyValue : fullData[
"samples"].items()) {
1518 const std::string &sampleName = keyValue.key();
1519 const auto &sample = keyValue.value();
1522 if (!sample.contains(
"trees")) {
1523 throw std::runtime_error(
"A list of tree names must be provided for sample " + sampleName +
".");
1525 std::vector<std::string> trees = sample[
"trees"];
1526 if (!sample.contains(
"files")) {
1527 throw std::runtime_error(
"A list of files must be provided for sample " + sampleName +
".");
1529 std::vector<std::string> files = sample[
"files"];
1530 if (!sample.contains(
"metadata")) {
1534 for (
const auto &metadata : sample[
"metadata"].items()) {
1535 const auto &val = metadata.value();
1536 if (val.is_string())
1537 m.Add(metadata.key(), val.get<std::string>());
1538 else if (val.is_number_integer())
1539 m.Add(metadata.key(), val.get<
int>());
1540 else if (val.is_number_float())
1541 m.Add(metadata.key(), val.get<
double>());
1543 throw std::logic_error(
"The metadata keys can only be of type [string|int|double].");
1548 if (fullData.contains(
"friends")) {
1549 for (
const auto &friends : fullData[
"friends"].items()) {
1550 std::string alias = friends.key();
1551 std::vector<std::string> trees = friends.value()[
"trees"];
1552 std::vector<std::string> files = friends.value()[
"files"];
1553 if (files.size() != trees.size() && trees.size() > 1)
1554 throw std::runtime_error(
"Mismatch between trees and files in a friend.");
1559 if (fullData.contains(
"range")) {
1560 std::vector<int> range = fullData[
"range"];
1562 if (range.size() == 1)
1564 else if (range.size() == 2)
1581 auto *
tree = df.GetTree();
1582 auto defCols = df.GetDefaultColumnNames();
1584 std::ostringstream ret;
1586 ret <<
"A data frame built on top of the " <<
tree->GetName() <<
" dataset.";
1587 if (!defCols.empty()) {
1588 if (defCols.size() == 1)
1589 ret <<
"\nDefault column: " << defCols[0];
1591 ret <<
"\nDefault columns:\n";
1592 for (
auto &&col : defCols) {
1593 ret <<
" - " << col <<
"\n";
1598 ret <<
"A data frame associated to the data source \"" << cling::printValue(ds) <<
"\"";
1600 ret <<
"An empty data frame that will create " << df.GetNEmptyEntries() <<
" entries\n";
unsigned long long ULong64_t
The head node of a RDF computation graph.
A dataset specification for RDataFrame.
RDatasetSpec & WithGlobalFriends(const std::string &treeName, const std::string &fileNameGlob, const std::string &alias="")
RDatasetSpec & AddSample(RSample sample)
RDatasetSpec & WithGlobalRange(const RDatasetSpec::REntryRange &entryRange={})
Class representing a sample (grouping of trees (and their fileglobs) and (optional) metadata)
RDataSource * fDataSource
Non-owning pointer to a data-source object. Null if no data-source. RLoopManager has ownership of the...
RDFDetail::RLoopManager * GetLoopManager() const
const std::shared_ptr< RDFDetail::RLoopManager > & GetProxiedPtr() const
ROOT's RDataFrame offers a modern, high-level interface for analysis of data stored in TTree ,...
RDataFrame(std::string_view treeName, std::string_view filenameglob, const ColumnNames_t &defaultColumns={})
Build the dataframe.
ROOT::RDF::ColumnNames_t ColumnNames_t
@ kWithoutGlobalRegistration
Describe directory structure in memory.
virtual TObject * Get(const char *namecycle)
Return pointer to object identified by namecycle.
A TTree represents a columnar dataset.
ROOT::RDataFrame FromSpec(const std::string &jsonFile)
Factory method to create an RDataFrame from a JSON specification file.
std::vector< std::string > ColumnNames_t
This file contains a specialised ROOT message handler to test for diagnostic in unit tests.
std::shared_ptr< const ColumnNames_t > ColumnNamesPtr_t