As introduced in → Storing columnar data in a ROOT file and reading it back,
ROOT can handle large columnar datasets.
In the aforementioned section, we made use of RDataFrame to write and
read back a simple dataset.
RDataFrame traditionally relies on
TTree for columnar data storage, used for example
by all LHC (Large Hadron Collider) experiments.
Trees are optimized for reduced disk space and for selective, high-throughput columnar access with reduced memory usage.
In addition to the documentation in this manual, we recommend taking a look at the TTree tutorials: → Tree tutorials
To access TTree data, please use RDataFrame.
TTree provides interfaces for low-level, expert usage.
The tree and its data
TTree behaves like an array of a data structure that resides on storage - except for one entry (or row, in database language).
That entry is accessible in memory: you can load any tree entry, ideally sequentially.
You can provide your own storage for the values of the columns of the current entry, in the form of variables.
In this case you have to tell the
TTree about the addresses of these variables; either by calling
TTree::SetBranchAddress(), or by passing the variable when creating the branch for writing.
When “filling” (writing) the
TTree, it will read the values out of these variables;
when reading back a
TTree entry, it will write the values it read from storage into your variables.
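The two directions of data flow described above can be sketched with a toy model in plain Python. This is not the ROOT API: the class `ToyTree` and its methods `branch`, `fill` and `get_entry` are hypothetical names chosen for illustration, and one-element `array` buffers stand in for the variables whose addresses a real tree would be told about.

```python
from array import array


class ToyTree:
    """Plain-Python stand-in for a tree; NOT the ROOT TTree API."""

    def __init__(self):
        self.buffers = {}  # branch name -> caller-owned one-element buffer
        self.columns = {}  # branch name -> values stored so far ("storage")

    def branch(self, name, buffer):
        # Remember where the caller keeps the current entry's value.
        self.buffers[name] = buffer
        self.columns[name] = []

    def fill(self):
        # Writing: copy the current values out of the caller's buffers.
        for name, buffer in self.buffers.items():
            self.columns[name].append(buffer[0])

    def get_entry(self, i):
        # Reading: copy entry i from storage back into the caller's buffers.
        for name, buffer in self.buffers.items():
            buffer[0] = self.columns[name][i]


energy = array("d", [0.0])
hits = array("i", [0])

tree = ToyTree()
tree.branch("energy", energy)
tree.branch("hits", hits)

for i in range(3):
    energy[0] = 1.5 * i
    hits[0] = 10 + i
    tree.fill()

tree.get_entry(1)
print(energy[0], hits[0])
```

The buffers must stay alive for as long as the toy tree uses them, which mirrors the lifetime requirement on variables bound to real branches.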
Branches and leaves
A tree consists of a list of independent columns, called branches. A branch can contain values of any fundamental type, C++ objects known to ROOT’s type system, or collections of those. When reading a tree, you can select which subset of branches should be read. This allows you to optimize read throughput for a given analysis, and is one of the main motivations for storing data in columnar format.
Branches are represented by
TBranch and its derived classes.
TBranch represents structure; objects inheriting from
TLeaf give access to the actual data.
Originally, any columnar data was accessible through a
TLeaf; these days, some of the
TBranch-derived classes provide data access themselves, such as
Baskets, clusters and the tree header
Every branch or leaf stores the data for its entries in buffers of a size that can be specified during branch creation (default: 32000 bytes). Once the buffer is full, it gets compressed; the compressed buffer is called a basket. These baskets are written into the ROOT file. Branches with more data per tree entry will fill more baskets than branches with less data per tree entry. Conversely, baskets can hold many tree entries if their branch stores only a few bytes per tree entry. This means that, in general, baskets - including those of different branches - will cover different tree entry ranges.
To allow more efficient pre-fetching and better chunking of tree data stored in ROOT files, TTree groups baskets into clusters. A cluster contains all the data of a given entry range. Trees will close baskets that are not yet full when reaching the tree entry at a cluster boundary.
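As a rough illustration of the two paragraphs above, the following plain-Python sketch computes which entry ranges end up in which basket for a single branch. It is a simplified model, not ROOT code: it assumes a fixed number of bytes per entry, ignores compression and any per-basket overhead, and the function name `basket_ranges` and the cluster size used are hypothetical.

```python
def basket_ranges(bytes_per_entry, n_entries, cluster_entries, basket_bytes=32000):
    """Return half-open (first, last) entry ranges, one per toy basket."""
    per_basket = basket_bytes // bytes_per_entry
    ranges = []
    start = 0
    while start < n_entries:
        # A basket is closed early when the next cluster boundary is reached.
        cluster_end = (start // cluster_entries + 1) * cluster_entries
        stop = min(start + per_basket, cluster_end, n_entries)
        ranges.append((start, stop))
        start = stop
    return ranges


# 20000 entries, clusters of 10000 entries (assumed value for the example).
small = basket_ranges(4, 20000, 10000)    # 4 bytes per entry
large = basket_ranges(400, 20000, 10000)  # 400 bytes per entry

print(small)
print(len(large))
```

In this model the branch with few bytes per entry needs only a handful of baskets, two of which are closed before they are full because a cluster boundary is reached, while the larger branch fills many more baskets for the same entries.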
TTree finds the baskets for a given entry for a given branch by means of a header stored in the file.
This header also contains other auxiliary metadata.
When reading a
TTree object, only this header is actually deserialized, until the tree’s entries are loaded.
Multiple updates of these headers can often be found in files (
treename;2 etc, called cycles, see → Opening and inspecting a ROOT file).
Only the last one (also accessible as
treename) knows about all written baskets.
TNtuple, the high-performance spread-sheet
For convenience, ROOT also provides the
TNtuple class which is a tree whose branches contain only numbers of type
float, one per tree entry.
It derives from
TTree and is constructed with a list of column names separated by
Writing a tree
When writing a
TTree you first want to create a
TFile (see → ROOT files).
Then construct the
TTree to be stored in the file; we will later add branches to the tree.
There are multiple ways to add branches to a
TTree; the most commonly used ones are covered here.
More extensive documentation can be found in the reference manual.
Do not use the
TBranch constructor to add a branch to a tree.
The objects and variables used to create branches must not be destroyed until the
TTree is deleted or
TTree::ResetBranchAddress() is called. If the address of the data to be filled changes with each tree entry, you have to inform the branch about the new address with TBranch::SetAddress() before filling the tree again.
1. Branches holding basic types
If you have a variable of type
bool, or any other basic type, you can create a branch (and a leaf) from it.
For fundamental datatypes, the type can be deduced from the variable and the name of the leaf will be set to the name of the branch.
In Python, that type information is not available, so the leaf name and data type must be specified as the third argument.
Further details are explained in the reference guide.
2. Branches holding class type
You can create a branch holding one of ROOT’s classes, or your own type for which you have provided a dictionary (see → I/O).
If told, TTree will create (sub-) branches for each member of a class and its base classes. If such a member is a class itself, that member’s type can also be split. The recursion level of nested splitting is called the “split level”; it can be configured during branch creation.
If the split level is set to 0, there is no splitting: all data members are stored in the same branch. Data members can also be configured to be non-split as part of the dictionary; see → I/O. The default split level of 99 means to split all members at any recursion level.
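The effect of the split level can be sketched with a small recursive function in plain Python. This is a toy model under stated assumptions: nested dictionaries stand in for class members, the function `split_branches` is hypothetical, and the dotted names it produces are only illustrative and do not claim to reproduce ROOT's exact branch naming.

```python
def split_branches(name, value, split_level):
    """List the toy branch names created for `value` at a given split level."""
    # Split level 0, or a value without members: everything stays in one branch.
    if split_level == 0 or not isinstance(value, dict):
        return [name]
    names = []
    for member, sub_value in value.items():
        # Each nested level of splitting consumes one unit of split level.
        names += split_branches(f"{name}.{member}", sub_value, split_level - 1)
    return names


event = {"id": 7, "vertex": {"x": 0.1, "y": 0.2, "z": 0.3}}

print(split_branches("event", event, 0))
print(split_branches("event", event, 1))
print(split_branches("event", event, 99))
```

With level 0 the whole object lands in one branch, with level 1 only the direct members are split, and with 99 the nested member is split as well.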
References (X&) are not supported as member types; pointers are.
If the pointer is non-null, ROOT stores the object pointed to (pointee).
If multiple pointers within the same branch point to the same object during one
TBranch::Fill() operation (as invoked by
TTree::Fill()), that pointee will only be stored once; upon reading, all pointers will again point to the same object.
For the general case, indices into object collections could be persistified instead of pointers. This way, the object is only stored once.
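The idea of storing indices into a collection instead of pointers can be sketched in plain Python as follows. The data layout and the names `tracks`, `vertices`, `track_idx` and `tracks_of` are hypothetical and only illustrate the pattern; they are not part of ROOT.

```python
# The collection of objects is stored once per entry.
tracks = [{"pt": 12.0}, {"pt": 48.5}, {"pt": 3.2}]

# Other objects refer to tracks by position instead of by pointer.
vertices = [
    {"z": 0.4, "track_idx": [0, 1]},
    {"z": -1.1, "track_idx": [1, 2]},  # track 1 is shared by both vertices
]


def tracks_of(vertex):
    """Resolve the stored indices back into track objects."""
    return [tracks[i] for i in vertex["track_idx"]]


# Both vertices resolve to the very same track object.
shared = tracks_of(vertices[0])[1] is tracks_of(vertices[1])[0]
print(shared)
```

Because only integers are stored alongside each vertex, the shared track exists once in the data, regardless of which branch the referring objects live in.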
TNamed has the data members
The following requests the tree to create a branch for each of them.
Because TNamed derives from
TObject, branches for
TObject’s data members will also be created.
3. Branches holding
Both top-level branches (those created by a call to
TTree::Branch()) and branches created by splitting data members can hold collections such as
Splitting can traverse through collections:
if a member is a
std::vector<X>, the tree can split
X into sub-branches, too.
Such collections can also contain pointers.
For polymorphic pointees, ROOT will not just stream the base, but determine the actual object type.
If the split level is
TTree::kSplitCollectionOfPointers then the pointees will be written in split mode, possibly adding new branches as new polymorphic derived types are encountered.
Filling a tree
Use TTree::Fill() to add a new entry (or “row”) to the tree, and store the current values of the variables that were provided during branch creation.
Writing the tree header
Use TTree::Write() to write the tree header into a ROOT file.
Earlier entries’ data might already be written as part of
If, due to the data written during
TTree::Fill(), the file’s size increases beyond TTree::GetMaxTreeSize(), the current ROOT file is closed and a new ROOT file is created.
For an original ROOT file named
myfile.root, the subsequent ROOT files are named
The tree can flush its data (i.e. its baskets) to file when reaching a given cluster size, thus closing the cluster. By default this happens approximately every 30 MB of compressed data. The size can be adjusted using TTree::SetAutoFlush().
The tree can write a header update to file after it has collected a certain data size in baskets (by default, 300MB). If your program crashes, you can recover the tree and its baskets written before the last autosave.
You can adjust the threshold (in bytes or entries) using TTree::SetAutoSave().
Reading a tree
Please use RDataFrame to read trees, unless you need to do low-level I/O!
To read a tree, you need to associate your variables with the tree’s branches, as when writing.
When loading a tree entry, the tree will set the variables to the branch’s value as read from the storage.
That is done by calling
In Python you can simply use the branch name as an attribute on the tree:
Selecting a subset of branches to be read
You can select or deselect branches from being read by
GetEntry() by calling
It is strongly recommended to only read the branches actually needed:
TTree is optimized for exactly this use case, and most analyses will only need a fraction of the available branches.
Selecting a subset of entries to be read
To process only a selection of tree entries, you can use a
First you insert the tree entry numbers you want to process into the
You can then re-use the
TEntryList in subsequent processing of the tree, skipping irrelevant entries.
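The two-step pattern described above (first collect entry numbers, then re-use them) can be sketched in plain Python. This is a conceptual model only: lists stand in for branches, a list of integers stands in for the entry list, and the column names and the cut value are made up for the example.

```python
# Two toy columns with one value per tree entry.
pt = [12.0, 48.5, 3.2, 51.0, 7.7]
eta = [0.1, -2.0, 1.3, 0.4, 2.4]

# Step 1: record the numbers of the entries that pass a selection.
entry_list = [i for i, value in enumerate(pt) if value > 10.0]

# Step 2: later passes visit only those entries and skip the rest.
selected_eta = [eta[i] for i in entry_list]

print(entry_list)
print(selected_eta)
```

The selection is evaluated once; every later pass over the data only needs the stored entry numbers.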
TTrees as a
In high energy physics you always want as much data as possible.
But it’s not nice to deal with files of multiple terabytes.
ROOT allows you to split data across multiple files, where you can then access the files’ tree parts as one large tree.
That’s done through
TChain, which inherits from
TTree:
it wants to know the name of the trees in the files (which can be overridden when adding files), and the file names, and will act as if it was a huge, continuous tree:
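How several files can appear as one continuous tree can be sketched with cumulative entry offsets. The following plain-Python model is an illustration, not ROOT code: the file names, the entry counts and the function `locate` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical files and the number of tree entries each one holds.
file_entries = [("run1.root", 1000), ("run2.root", 250), ("run3.root", 4000)]

# offsets[i] is the global entry number of the first entry in file i;
# the last element is the total number of entries.
offsets = [0] + list(accumulate(n for _, n in file_entries))


def locate(global_entry):
    """Map a global entry number to (file name, entry number within that file)."""
    if not 0 <= global_entry < offsets[-1]:
        raise IndexError(f"entry {global_entry} out of range")
    i = bisect_right(offsets, global_entry) - 1
    return file_entries[i][0], global_entry - offsets[i]


print(locate(999))
print(locate(1000))
print(locate(5249))
```

Sequential access stays within one file for long stretches and switches to the next file only at the offset boundaries.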
TTree through friends
Trees are usually written just once.
While updating an existing tree is non-trivial, extending it with additional branches, potentially an “improved” version of an original branch, is straightforward.
“Friend trees” are added by calling TTree::AddFriend().
Adding another tree called
T1 as a friend tree will make the branch
X of T1 available as both
T1.X and - if
X does not exist in the original tree - as
X.
Friend trees are expected to have at least as many entries as the original tree. The order of the friend tree’s entries must preserve the entry order of the original tree.
Care must be taken to ensure that the order of entries in the primary tree matches friends’ entries. This is especially relevant when processing a tree in parallel to generate a friend tree, as the entries might be written out in an undefined order (misaligned entries). This can be mitigated by building an index on the friend tree with TTree::BuildIndex(); see → Indexing a tree.
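The name lookup for friend branches can be sketched in plain Python. This is a toy model, not ROOT's implementation: dictionaries of lists stand in for trees, and the function `column` and the branch contents are hypothetical. It assumes the friend's entries are already aligned with the main tree.

```python
main = {"X": [1, 2, 3], "Y": [10, 20, 30]}
friend_t1 = {"X": [1.5, 2.5, 3.5], "Z": [7, 8, 9]}
friends = {"T1": friend_t1}


def column(name):
    """Resolve a branch name against the main tree and its friends."""
    if "." in name:
        # Qualified name such as "T1.X": look only in that friend.
        tree_name, branch = name.split(".", 1)
        return friends[tree_name][branch]
    if name in main:
        # Unqualified names prefer the main tree.
        return main[name]
    for tree in friends.values():
        # Otherwise fall back to the first friend that has the branch.
        if name in tree:
            return tree[name]
    raise KeyError(name)


print(column("X"))
print(column("T1.X"))
print(column("Z"))
```

An unqualified name reaches the friend only when the main tree has no branch of that name, which is why an "improved" branch with the same name needs the qualified form.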
Examining a tree
ROOT offers different ways to examine tree structure and its contents, from text to graphics.
Printing the summary of a tree
Use TTree::Print() to see a summary of the tree structure.
Showing the content of a tree entry
Use TTree::Show() to display the values of all branches for a given tree entry.
Showing tree data as a table
Use TTree::Scan() to display a paged table of branches’ values for all or some tree entries.
With the Tree Viewer you can examine a tree in a GUI.
You can also use the ROOT object browser to examine a tree that is saved in a ROOT file. See → ROOT object browser.
Figure: Tree Viewer.
The left panel contains the list of trees and their branches. The right panel displays the leaves or variables in the tree.
Drawing correlating variables in a scatterplot
You can show the correlation between the variables, listed in the
TTreeViewer, by drawing a scatterplot.
- Select a variable in the
TTreeViewer and drag it to the
- Select a second variable and drag it to the
Figure: Variables Age and Cost selected for the scatterplot.
Figure: Scatterplot icon.
The scatterplot is drawn.
Figure: Scatterplot of the variables Age and Cost.
Note that not every (x,y) point on a scatterplot represents two values in your N-tuple. In fact, the scatterplot is a grid, and each square in the grid is randomly populated with a density of dots that is proportional to the number of values in that square.
Indexing a tree
Use TTree::BuildIndex() to build an index table over expressions that depend on the value in the leaves. This index is similar to database indexes: it allows you to quickly determine the tree entry number corresponding to the value of an expression. These expressions should be both equality comparable (that is, not use floating point numbers where precision might cause the index lookup to fail) and unique, to make sure you get the tree entry you expect. For high-energy physics, a common example could be a combination of run number and event number: while each one of them might have duplications, their combination is guaranteed to be unique.
To build an index, define a major and optionally a minor expression.
In the example above these might simply be the leaves
They can be expressions using original tree variables, such as
"run - 90000".
TTree::BuildIndex() loops over all entries and builds the lookup table from the expressions to the tree entry number.
The index can then be saved as part of the
TTree object with
This is done most conveniently at the end of the filling process, just before saving the tree header.
An entry can be retrieved using the index with TTree::GetEntryWithIndex().
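The lookup table described above can be modelled in plain Python as a dictionary from (major, minor) values to entry numbers. This is a conceptual sketch, not ROOT's implementation: the function `build_index` and the sample run and event numbers are hypothetical, and the major expression reuses the "run - 90000" example from the text.

```python
runs = [90001, 90001, 90002, 90002]
events = [5, 9, 5, 2]


def build_index(major, minor):
    """Loop over all entries and map (major, minor) to the entry number."""
    index = {}
    for entry, key in enumerate(zip(major, minor)):
        if key in index:
            # The text asks for unique keys; the toy model enforces it.
            raise ValueError(f"duplicate index key {key}")
        index[key] = entry
    return index


# Major expression "run - 90000", minor expression "event".
index = build_index([run - 90000 for run in runs], events)

entry = index[(2, 5)]
print(entry)
```

Integer keys compare exactly, which is the reason the text advises against floating point expressions in an index.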
Tree indexing works as well with a