This tutorial shows how to prepare ROOT datasets so that most machine learning methods can read them easily.
This requires filtering the initial, complex datasets and writing the selected data in a flat format.
 
import ROOT
 
 
def filter_events(df):
    """
    Reduce initial dataset to only events which shall be used for training
    """
    return df.Filter("nElectron>=2 && nMuon>=2", "At least two electrons and two muons")
 
 
def define_variables(df):
    """
    Define the variables which shall be used for training
    """
    return df.Define("Muon_pt_1", "Muon_pt[0]")\
             .Define("Muon_pt_2", "Muon_pt[1]")\
             .Define("Electron_pt_1", "Electron_pt[0]")\
             .Define("Electron_pt_2", "Electron_pt[1]")
 
 
variables = ["Muon_pt_1", "Muon_pt_2", "Electron_pt_1", "Electron_pt_2"]
 
 
if __name__ == "__main__":
    for filename, label in [["SMHiggsToZZTo4L.root", "signal"], ["ZZTo2e2mu.root", "background"]]:
        print(">>> Extract the training and testing events for {} from the {} dataset.".format(
            label, filename))
 
        
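        # Load dataset, filter the required events and define the training variables
        filepath = "root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/" + filename
        df = ROOT.RDataFrame("Events", filepath)
        df = filter_events(df)
        df = define_variables(df)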
 
        
        # Book cutflow report (filled lazily when the event loop runs)
        report = df.Report()
 
        
        # Split dataset by event number for training and testing
        columns = ROOT.std.vector["string"](variables)
        df.Filter("event % 2 == 0", "Select events with even event number for training")\
          .Snapshot("Events", "train_" + label + ".root", columns)
        df.Filter("event % 2 == 1", "Select events with odd event number for testing")\
          .Snapshot("Events", "test_" + label + ".root", columns)
 
        
        # Print cutflow report
        report.Print()
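
The even/odd split on the event number can be illustrated without ROOT. The sketch below is only a stand-in under stated assumptions: `split_by_parity` is a hypothetical helper and the event numbers are made up. It shows that the two filter expressions `event % 2 == 0` and `event % 2 == 1` from the listing above divide the events into two disjoint sets, and that the assignment is reproducible because it depends only on the event number.

```python
def split_by_parity(event_numbers):
    """Hypothetical helper mirroring the two filter expressions in the listing."""
    train = [e for e in event_numbers if e % 2 == 0]  # even event numbers
    test = [e for e in event_numbers if e % 2 == 1]   # odd event numbers
    return train, test


# Made-up event numbers, for illustration only
events = [101, 102, 103, 104, 107, 110]
train, test = split_by_parity(events)

print(train)
print(test)
```

Because every non-negative event number is either even or odd, no event is lost and none ends up in both sets.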
 
 
At least two electrons and two muons: pass=45352      all=299973     -- eff=15.12 % cumulative eff=15.12 %
At least two electrons and two muons: pass=262776     all=1497445    -- eff=17.55 % cumulative eff=17.55 %
>>> Extract the training and testing events for signal from the SMHiggsToZZTo4L.root dataset.
>>> Extract the training and testing events for background from the ZZTo2e2mu.root dataset.
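
The efficiencies in the report lines above are the ratio of passing events to all events, expressed in percent. As a cross-check, the following sketch recomputes them from the printed counts. `efficiency_percent` is a hypothetical helper introduced here for illustration and is not part of ROOT.

```python
def efficiency_percent(n_pass, n_all):
    """Hypothetical helper: selection efficiency in percent."""
    return 100.0 * n_pass / n_all


# Counts copied from the two cutflow report lines in the output above
first_eff = efficiency_percent(45352, 299973)
second_eff = efficiency_percent(262776, 1497445)

print("first report line:  {:.2f} %".format(first_eff))
print("second report line: {:.2f} %".format(second_eff))
```

Since only one filter is booked before the report, the cumulative efficiency equals the efficiency of that single filter.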
- Date
 - August 2019 
 
- Author
 - Stefan Wunsch 
 
Definition in file tmva100_DataPreparation.py.