This tutorial illustrates how to prepare ROOT datasets to be nicely readable by most machine learning methods.
This requires filtering the initial complex datasets and writing the data in a flat format.
import ROOT
def filter_events(df):
"""
Reduce initial dataset to only events which shall be used for training
"""
return df.Filter("nElectron>=2 && nMuon>=2", "At least two electrons and two muons")
def define_variables(df):
"""
Define the variables which shall be used for training
"""
return df.Define("Muon_pt_1", "Muon_pt[0]")\
.Define("Muon_pt_2", "Muon_pt[1]")\
.Define("Electron_pt_1", "Electron_pt[0]")\
.Define("Electron_pt_2", "Electron_pt[1]")
variables = ["Muon_pt_1", "Muon_pt_2", "Electron_pt_1", "Electron_pt_2"]
if __name__ == "__main__":
for filename, label in [["SMHiggsToZZTo4L.root", "signal"], ["ZZTo2e2mu.root", "background"]]:
print(">>> Extract the training and testing events for {} from the {} dataset.".format(
label, filename))
filepath = "root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/" + filename
df = filter_events(df)
df = define_variables(df)
report = df.Report()
columns = ROOT.std.vector["string"](variables)
df.Filter("event % 2 == 0", "Select events with even event number for training")\
.Snapshot("Events", "train_" + label + ".root", columns)
df.Filter("event % 2 == 1", "Select events with odd event number for training")\
.Snapshot("Events", "test_" + label + ".root", columns)
report.Print()
ROOT's RDataFrame offers a high level interface for analyses of data stored in TTree,...
At least two electrons and two muons: pass=45352 all=299973 -- eff=15.12 % cumulative eff=15.12 %
At least two electrons and two muons: pass=262776 all=1497445 -- eff=17.55 % cumulative eff=17.55 %
>>> Extract the training and testing events for signal from the SMHiggsToZZTo4L.root dataset.
>>> Extract the training and testing events for background from the ZZTo2e2mu.root dataset.
- Date
- August 2019
- Author
- Stefan Wunsch
Definition in file tmva100_DataPreparation.py.