{ "cells": [ { "cell_type": "markdown", "id": "cf59e8c0", "metadata": {}, "source": [ "# df001_introduction\n", "Basic RDataFrame usage.\n", "\n", "This tutorial illustrates the basic features of the RDataFrame class,\n", "a utility for interacting with data stored in TTrees through\n", "a functional-chain-like approach.\n", "\n", "\n", "\n", "\n", "**Author:** Enrico Guiraud (CERN) \n", "This notebook tutorial was automatically generated with ROOTBOOK-izer from the macro found in the ROOT repository on Sunday, December 21, 2025 at 02:39 PM." ] }, { "cell_type": "markdown", "id": "7df15658", "metadata": {}, "source": [ " ## Preparation\n", "A simple helper function to fill a test tree: this makes the example\n", "stand-alone.\n", " " ] }, { "cell_type": "code", "execution_count": 1, "id": "ef718f59", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:09.049647Z", "iopub.status.busy": "2025-12-21T13:39:09.049523Z", "iopub.status.idle": "2025-12-21T13:39:09.789109Z", "shell.execute_reply": "2025-12-21T13:39:09.788458Z" } }, "outputs": [], "source": [ "%%cpp -d\n", "void fill_tree(const char *treeName, const char *fileName)\n", "{\n", " ROOT::RDataFrame d(10);\n", " d.Define(\"b1\", [](ULong64_t entry) -> double { return entry; }, {\"rdfentry_\"})\n", " .Define(\"b2\", [](ULong64_t entry) -> int { return entry * entry; }, {\"rdfentry_\"})\n", " .Snapshot(treeName, fileName);\n", "}" ] }, { "cell_type": "markdown", "id": "ba3c7fc2", "metadata": {}, "source": [ "We prepare an input tree to run on" ] }, { "cell_type": "code", "execution_count": 2, "id": "d52c403b", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:09.798778Z", "iopub.status.busy": "2025-12-21T13:39:09.798626Z", "iopub.status.idle": "2025-12-21T13:39:11.245143Z", "shell.execute_reply": "2025-12-21T13:39:11.244497Z" } }, "outputs": [], "source": [ "auto fileName = \"df001_introduction.root\";\n", "auto treeName = 
\"myTree\";\n", "fill_tree(treeName, fileName);" ] }, { "cell_type": "markdown", "id": "f755b9dc", "metadata": {}, "source": [ "We read the tree from the file and create a RDataFrame, a class that\n", "allows us to interact with the data contained in the tree.\n", "We select a default column, a *branch* to adopt ROOT jargon, which will\n", "be looked at if none is specified by the user when dealing with filters\n", "and actions." ] }, { "cell_type": "code", "execution_count": 3, "id": "d3753b07", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:11.253836Z", "iopub.status.busy": "2025-12-21T13:39:11.253693Z", "iopub.status.idle": "2025-12-21T13:39:11.464684Z", "shell.execute_reply": "2025-12-21T13:39:11.464095Z" } }, "outputs": [], "source": [ "ROOT::RDataFrame d(treeName, fileName, {\"b1\"});" ] }, { "cell_type": "markdown", "id": "97a603a1", "metadata": {}, "source": [ "## Operations on the dataframe\n", "We now review some *actions* which can be performed on the data frame.\n", "Actions can be divided into instant actions (e. g. Foreach()) and lazy\n", "actions (e. g. Count()), depending on whether they trigger the event\n", "loop immediately or only when one of the results is accessed for the\n", "first time. Actions that return \"something\" either return their result\n", "wrapped in a RResultPtr or in a RDataFrame.\n", "But first of all, let us define our cut-flow with two lambda\n", "functions. We can use free functions too." 
] }, { "cell_type": "code", "execution_count": 4, "id": "e9592ecb", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:11.477737Z", "iopub.status.busy": "2025-12-21T13:39:11.477577Z", "iopub.status.idle": "2025-12-21T13:39:11.720420Z", "shell.execute_reply": "2025-12-21T13:39:11.703901Z" } }, "outputs": [], "source": [ "auto cutb1 = [](double b1) { return b1 < 5.; };\n", "auto cutb1b2 = [](int b2, double b1) { return b2 % 2 && b1 < 4.; };" ] }, { "cell_type": "markdown", "id": "f1a55ad5", "metadata": {}, "source": [ "### `Count` action\n", "The `Count` action retrieves the number of entries that passed the\n", "filters. Here, we show how the default column is selected automatically\n", "when the user specifies none." ] }, { "cell_type": "code", "execution_count": 5, "id": "c8fadfee", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:11.742908Z", "iopub.status.busy": "2025-12-21T13:39:11.742680Z", "iopub.status.idle": "2025-12-21T13:39:12.836460Z", "shell.execute_reply": "2025-12-21T13:39:12.835959Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 entries passed all filters\n" ] } ], "source": [ "auto entries1 = d.Filter(cutb1) // <- no column name specified here!\n", " .Filter(cutb1b2, {\"b2\", \"b1\"})\n", " .Count();\n", "\n", "std::cout << *entries1 << \" entries passed all filters\" << std::endl;" ] }, { "cell_type": "markdown", "id": "9f004fe3", "metadata": {}, "source": [ "Filters can also be expressed as strings. The content must be C++ code, and the\n", "variable names must match the branch names. The code is\n", "just-in-time compiled." 
] }, { "cell_type": "code", "execution_count": 6, "id": "4f96f57d", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:12.838545Z", "iopub.status.busy": "2025-12-21T13:39:12.838413Z", "iopub.status.idle": "2025-12-21T13:39:13.531828Z", "shell.execute_reply": "2025-12-21T13:39:13.531275Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5 entries passed the string filter\n" ] } ], "source": [ "auto entries2 = d.Filter(\"b1 < 5.\").Count();\n", "std::cout << *entries2 << \" entries passed the string filter\" << std::endl;" ] }, { "cell_type": "markdown", "id": "44dafd69", "metadata": {}, "source": [ "### `Min`, `Max` and `Mean` actions\n", "These actions retrieve statistical information about the entries\n", "passing the cuts, if any." ] }, { "cell_type": "code", "execution_count": 7, "id": "fedf6c7b", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:13.537245Z", "iopub.status.busy": "2025-12-21T13:39:13.537084Z", "iopub.status.idle": "2025-12-21T13:39:15.025503Z", "shell.execute_reply": "2025-12-21T13:39:15.004089Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mean is always included between the min and the max: 1 <= 2 <= 3\n" ] } ], "source": [ "auto b1b2_cut = d.Filter(cutb1b2, {\"b2\", \"b1\"});\n", "auto minVal = b1b2_cut.Min();\n", "auto maxVal = b1b2_cut.Max();\n", "auto meanVal = b1b2_cut.Mean();\n", "auto nonDefmeanVal = b1b2_cut.Mean(\"b2\"); // <- Column is not the default\n", "std::cout << \"The mean is always included between the min and the max: \" << *minVal << \" <= \" << *meanVal\n", " << \" <= \" << *maxVal << std::endl;" ] }, { "cell_type": "markdown", "id": "4415b9e3", "metadata": {}, "source": [ "### `Take` action\n", "The `Take` action retrieves all values of the variable stored in a\n", "particular column that passed the filters we specified. 
The values are stored\n", "in a vector by default, but other collections can be chosen." ] }, { "cell_type": "code", "execution_count": 8, "id": "d85ece1d", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:15.029146Z", "iopub.status.busy": "2025-12-21T13:39:15.028988Z", "iopub.status.idle": "2025-12-21T13:39:16.209973Z", "shell.execute_reply": "2025-12-21T13:39:16.209560Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected b1 entries\n", "0 1 2 3 4 \n", "The type of b1Vec is vector<double>\n" ] } ], "source": [ "auto b1_cut = d.Filter(cutb1);\n", "auto b1Vec = b1_cut.Take<double>();\n", "auto b1List = b1_cut.Take<double, std::list<double>>();\n", "\n", "std::cout << \"Selected b1 entries\" << std::endl;\n", "for (auto b1_entry : *b1List)\n", " std::cout << b1_entry << \" \";\n", "std::cout << std::endl;\n", "auto b1VecCl = ROOT::GetClass(b1Vec.GetPtr());\n", "std::cout << \"The type of b1Vec is \" << b1VecCl->GetName() << std::endl;" ] }, { "cell_type": "markdown", "id": "28fdd2f3", "metadata": {}, "source": [ "### `Histo1D` action\n", "The `Histo1D` action fills a histogram. It returns a TH1D filled\n", "with the values of the column that passed the filters. For the most common\n", "types, the type of the values stored in the column is automatically\n", "inferred." 
] }, { "cell_type": "code", "execution_count": 9, "id": "25539f0c", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:16.221864Z", "iopub.status.busy": "2025-12-21T13:39:16.221662Z", "iopub.status.idle": "2025-12-21T13:39:17.207306Z", "shell.execute_reply": "2025-12-21T13:39:17.191513Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Filled h 5 times, mean: 2\n" ] } ], "source": [ "auto hist = d.Filter(cutb1).Histo1D();\n", "std::cout << \"Filled h \" << hist->GetEntries() << \" times, mean: \" << hist->GetMean() << std::endl;" ] }, { "cell_type": "markdown", "id": "1b652bde", "metadata": {}, "source": [ "### `Foreach` action\n", "The most generic action of all: an operation is applied to all entries.\n", "In this case we fill a histogram. In some sense this is a violation of a\n", "purely functional paradigm, but C++ allows it." ] }, { "cell_type": "code", "execution_count": 10, "id": "f0979b99", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:17.227722Z", "iopub.status.busy": "2025-12-21T13:39:17.227546Z", "iopub.status.idle": "2025-12-21T13:39:17.967269Z", "shell.execute_reply": "2025-12-21T13:39:17.945266Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Filled h with 5 entries\n" ] } ], "source": [ "TH1F h(\"h\", \"h\", 12, -1, 11);\n", "d.Filter([](int b2) { return b2 % 2 == 0; }, {\"b2\"}).Foreach([&h](double b1) { h.Fill(b1); });\n", "\n", "std::cout << \"Filled h with \" << h.GetEntries() << \" entries\" << std::endl;" ] }, { "cell_type": "markdown", "id": "9636b6c7", "metadata": {}, "source": [ "## Express your chain of operations with clarity!\n", "We are discussing an example here but it is not hard to imagine much more\n", "complex pipelines of actions acting on data. 
Those might require code\n", "that is well organised, for example to add filters conditionally\n", "or to clearly separate filters and actions without\n", "writing the entire pipeline on one line. This can be easily achieved.\n", "We'll show this by reworking the `Count` example:" ] }, { "cell_type": "code", "execution_count": 11, "id": "0aae637d", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:17.995756Z", "iopub.status.busy": "2025-12-21T13:39:17.995537Z", "iopub.status.idle": "2025-12-21T13:39:18.409700Z", "shell.execute_reply": "2025-12-21T13:39:18.409246Z" } }, "outputs": [], "source": [ "auto cutb1_result = d.Filter(cutb1);\n", "auto cutb1b2_result = d.Filter(cutb1b2, {\"b2\", \"b1\"});\n", "auto cutb1_cutb1b2_result = cutb1_result.Filter(cutb1b2, {\"b2\", \"b1\"});" ] }, { "cell_type": "markdown", "id": "9fb27112", "metadata": {}, "source": [ "Now we want to count:" ] }, { "cell_type": "code", "execution_count": 12, "id": "236da6bd", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:18.425623Z", "iopub.status.busy": "2025-12-21T13:39:18.425470Z", "iopub.status.idle": "2025-12-21T13:39:19.232817Z", "shell.execute_reply": "2025-12-21T13:39:19.232404Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Events passing cutb1: 5\n", "Events passing cutb1b2: 2\n", "Events passing both: 2\n" ] } ], "source": [ "auto evts_cutb1_result = cutb1_result.Count();\n", "auto evts_cutb1b2_result = cutb1b2_result.Count();\n", "auto evts_cutb1_cutb1b2_result = cutb1_cutb1b2_result.Count();\n", "\n", "std::cout << \"Events passing cutb1: \" << *evts_cutb1_result << std::endl\n", " << \"Events passing cutb1b2: \" << *evts_cutb1b2_result << std::endl\n", " << \"Events passing both: \" << *evts_cutb1_cutb1b2_result << std::endl;" ] }, { "cell_type": "markdown", "id": "2d3b88f0", "metadata": {}, "source": [ "## Calculating quantities starting from 
existing columns\n", "Often, operations need to be carried out on quantities calculated\n", "from the ones present in the columns. In this example we create a third\n", "column whose values are the sum of the *b1* and *b2* ones, entry by\n", "entry. The new quantity is defined via a callable.\n", "It is important to note three aspects at this point:\n", "- The value is created on the fly only if the entry passed the existing\n", "filters.\n", "- The newly created column behaves like the ones present in the file on disk.\n", "- The operation creates a new value, without modifying anything. De facto,\n", "this is like having a general container at our disposal, able to accommodate\n", "any value of any type.\n", "\n", "Let's dive into an example:" ] }, { "cell_type": "code", "execution_count": 13, "id": "3def3ae2", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:19.247718Z", "iopub.status.busy": "2025-12-21T13:39:19.247561Z", "iopub.status.idle": "2025-12-21T13:39:20.016373Z", "shell.execute_reply": "2025-12-21T13:39:20.014910Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8\n" ] } ], "source": [ "auto entries_sum = d.Define(\"sum\", [](double b1, int b2) { return b2 + b1; }, {\"b1\", \"b2\"})\n", " .Filter([](double sum) { return sum > 4.2; }, {\"sum\"})\n", " .Count();\n", "std::cout << *entries_sum << std::endl;" ] }, { "cell_type": "markdown", "id": "de069781", "metadata": {}, "source": [ "Additional columns can also be defined with strings. The content must be C++\n", "code, and the variable names must match the branch names. The code\n", "is just-in-time compiled." 
] }, { "cell_type": "code", "execution_count": 14, "id": "0746837b", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:20.028788Z", "iopub.status.busy": "2025-12-21T13:39:20.028617Z", "iopub.status.idle": "2025-12-21T13:39:20.503401Z", "shell.execute_reply": "2025-12-21T13:39:20.502922Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8\n" ] } ], "source": [ "auto entries_sum2 = d.Define(\"sum2\", \"b1 + b2\").Filter(\"sum2 > 4.2\").Count();\n", "std::cout << *entries_sum2 << std::endl;" ] }, { "cell_type": "markdown", "id": "9bfe9de6", "metadata": {}, "source": [ "It is possible at any moment to read the entry number and the processing\n", "slot number. The latter may change when implicit multithreading is active.\n", "The special columns which provide the entry number and the slot index are\n", "called \"rdfentry_\" and \"rdfslot_\" respectively. Their types are an unsigned\n", "64 bit integer and an unsigned integer." ] }, { "cell_type": "code", "execution_count": 15, "id": "d25b8b38", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2025-12-21T13:39:20.516425Z", "iopub.status.busy": "2025-12-21T13:39:20.516248Z", "iopub.status.idle": "2025-12-21T13:39:21.178547Z", "shell.execute_reply": "2025-12-21T13:39:21.169306Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Entry: 0 Slot: 0\n", "Entry: 1 Slot: 0\n", "Entry: 2 Slot: 0\n", "Entry: 3 Slot: 0\n", "Entry: 4 Slot: 0\n", "Entry: 5 Slot: 0\n", "Entry: 6 Slot: 0\n", "Entry: 7 Slot: 0\n", "Entry: 8 Slot: 0\n", "Entry: 9 Slot: 0\n" ] } ], "source": [ "auto printEntrySlot = [](ULong64_t iEntry, unsigned int slot) {\n", " std::cout << \"Entry: \" << iEntry << \" Slot: \" << slot << std::endl;\n", "};\n", "d.Foreach(printEntrySlot, {\"rdfentry_\", \"rdfslot_\"});\n", "\n", "return 0;" ] } ], "metadata": { "kernelspec": { "display_name": "ROOT C++", "language": "c++", "name": "root" }, "language_info": { 
"codemirror_mode": "text/x-c++src", "file_extension": ".C", "mimetype": " text/x-c++src", "name": "c++" } }, "nbformat": 4, "nbformat_minor": 5 }