Detailed Description

This macro provides an example of how to use TMVA for k-folds cross evaluation.

The input data is a toy-MC sample consisting of two Gaussian distributions.

The output file "TMVA.root" can be analysed with dedicated macros (simply say: root -l <macro.C>), which can be conveniently invoked through a GUI that appears at the end of this macro's run. To launch the GUI separately, use the command:

root -l -e 'TMVA::TMVAGui("TMVA.root")'

Cross Evaluation

Cross evaluation is a special case of k-folds cross validation where the splitting into k folds is computed deterministically. This ensures that a given event will always end up in the same fold.

In addition, all resulting classifiers are saved and can be applied to new data using MethodCrossValidation. One requirement for this to work is a splitting function that is evaluated for each event to determine which fold it goes into (for training/evaluation) or which classifier it is passed to (for application).
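The routing step described above can be sketched in plain C++, without ROOT. This is only an illustration under stated assumptions: `Classifier`, `applyByFold` and `makeLinear` are hypothetical names, not TMVA API, and the sketch assumes that the classifier stored at index k is the one meant for events whose split function evaluates to k (presumably the one trained without that fold, so that each event is scored by a model that never saw it).

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// One trained model per fold; here a model is just a function of (x, y).
using Classifier = std::function<double(double, double)>;

// Hypothetical router (not the TMVA implementation): computes
// int(fabs(eventID)) % numFolds and forwards the event to the classifier
// stored at that index.
double applyByFold(const std::vector<Classifier> &perFold, long eventID,
                   double x, double y)
{
   assert(!perFold.empty());
   const int absId = static_cast<int>(std::fabs(static_cast<double>(eventID)));
   const int fold = absId % static_cast<int>(perFold.size());
   return perFold[fold](x, y);
}

// Builds a linear discriminant cx * x + cy * y + offset.
Classifier makeLinear(double cx, double cy, double offset)
{
   return [=](double x, double y) { return cx * x + cy * y + offset; };
}
```

For instance, two linear models could be stored in a `std::vector<Classifier>` and every event passed through `applyByFold` together with its ID. Because the index depends only on the event ID, the same event is always routed to the same classifier.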

Split Expression

Cross evaluation partitions the data into folds using a deterministic rule called the split expression. The expression can be any valid TFormula as long as all parts used are defined.

For each event the split expression is evaluated to a number and the event is put in the fold corresponding to that number.

It is recommended to always end the expression with a modulo by the number of folds, %int([NumFolds]), so that the result is a valid fold index.

The split expression has access to all spectators and variables defined in the dataloader. Additionally, the number of folds in the split can be accessed with NumFolds (or numFolds).
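To make the fold assignment concrete, the arithmetic of a modulo-based split expression such as int(fabs([eventID]))%int([NumFolds]) can be mimicked in plain C++. This is a minimal sketch, not how TMVA evaluates the TFormula internally; `foldIndex` and `foldSizes` are hypothetical helpers introduced only for illustration, and the IDs are assumed to run from 1 to the number of events.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of TMVA): mirrors the arithmetic of
// "int(fabs([eventID]))%int([NumFolds])" for a single event.
int foldIndex(std::int64_t eventID, unsigned numFolds)
{
   assert(numFolds > 0 && "numFolds must be positive");
   const int absId = static_cast<int>(std::fabs(static_cast<double>(eventID)));
   return absId % static_cast<int>(numFolds);
}

// Counts how many of the IDs 1..nEvents land in each fold.
std::vector<int> foldSizes(int nEvents, unsigned numFolds)
{
   std::vector<int> sizes(numFolds, 0);
   for (int id = 1; id <= nEvents; ++id) {
      ++sizes[foldIndex(id, numFolds)];
   }
   return sizes;
}
```

With two folds, even IDs go to fold 0 and odd IDs to fold 1, so consecutive IDs give folds of equal size. The result depends only on the ID, which is what makes the split reproducible between runs.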

Example

"int(fabs([eventID]))%int([NumFolds])"

Project : TMVA - a ROOT-integrated toolkit for multivariate data analysis
Package : TMVA
Root Macro: TMVACrossValidation

Processing /mnt/build/workspace/root-makedoc-v614/rootspi/rdoc/src/v6-14-00-patches/tutorials/tmva/TMVACrossValidation.C...
DataSetInfo              : [dataset] : Added class "Signal"
                         : Add Tree  of type Signal with 1000 events
DataSetInfo              : [dataset] : Added class "Background"
                         : Add Tree  of type Background with 1000 events
                         : Evaluate method: BDTG
<HEADER> Factory                  : Booking method: BDTG_fold1
                         : 
<HEADER> BDTG_fold1               : #events: (reweighted) sig: 499.5 bkg: 499.5
                         : #events: (unweighted) sig: 496 bkg: 503
                         : Training 100 Decision Trees ... patience please
                         : Elapsed time for training with 999 events: 0.0526 sec         
<HEADER> BDTG_fold1               : [dataset] : Evaluation of BDTG_fold1 on training sample (999 events)
                         : Elapsed time for evaluation of 999 events: 0.00304 sec       
                         : Creating xml weight file: dataset/weights/TMVACrossValidation_BDTG_fold1.weights.xml
                         : Creating standalone class: dataset/weights/TMVACrossValidation_BDTG_fold1.class.C
<HEADER> Factory                  : Test all methods
<HEADER> Factory                  : Test method: BDTG_fold1 for Classification performance
                         : 
<HEADER> BDTG_fold1               : [dataset] : Evaluation of BDTG_fold1 on testing sample (999 events)
                         : Elapsed time for evaluation of 999 events: 0.00294 sec       
<HEADER> Factory                  : Evaluate all methods
<HEADER> Factory                  : Evaluate classifier: BDTG_fold1
                         : 
<HEADER> BDTG_fold1               : [dataset] : Loop over test events and fill histograms with classifier response...
                         : 
<HEADER> Factory                  : Thank you for using TMVA!
                         : For citation information, please visit: http://tmva.sf.net/citeTMVA.html
<HEADER> Factory                  : Booking method: BDTG_fold2
                         : 
<HEADER> BDTG_fold2               : #events: (reweighted) sig: 499.5 bkg: 499.5
                         : #events: (unweighted) sig: 503 bkg: 496
                         : Training 100 Decision Trees ... patience please
                         : Elapsed time for training with 999 events: 0.0532 sec         
<HEADER> BDTG_fold2               : [dataset] : Evaluation of BDTG_fold2 on training sample (999 events)
                         : Elapsed time for evaluation of 999 events: 0.00298 sec       
                         : Creating xml weight file: dataset/weights/TMVACrossValidation_BDTG_fold2.weights.xml
                         : Creating standalone class: dataset/weights/TMVACrossValidation_BDTG_fold2.class.C
<HEADER> Factory                  : Test all methods
<HEADER> Factory                  : Test method: BDTG_fold2 for Classification performance
                         : 
<HEADER> BDTG_fold2               : [dataset] : Evaluation of BDTG_fold2 on testing sample (999 events)
                         : Elapsed time for evaluation of 999 events: 0.00311 sec       
<HEADER> Factory                  : Evaluate all methods
<HEADER> Factory                  : Evaluate classifier: BDTG_fold2
                         : 
<HEADER> BDTG_fold2               : [dataset] : Loop over test events and fill histograms with classifier response...
                         : 
<HEADER> Factory                  : Thank you for using TMVA!
                         : For citation information, please visit: http://tmva.sf.net/citeTMVA.html
<HEADER> Factory                  : Booking method: BDTG
                         : 
                         : Reading weightfile: dataset/weights/TMVACrossValidation_BDTG_fold1.weights.xml
                         : Reading weight file: dataset/weights/TMVACrossValidation_BDTG_fold1.weights.xml
                         : Reading weightfile: dataset/weights/TMVACrossValidation_BDTG_fold2.weights.xml
                         : Reading weight file: dataset/weights/TMVACrossValidation_BDTG_fold2.weights.xml
                         : Evaluate method: Fisher
<HEADER> Factory                  : Booking method: Fisher_fold1
                         : 
<HEADER> Fisher_fold1             : Results for Fisher coefficients:
                         : -----------------------
                         : Variable:  Coefficient:
                         : -----------------------
                         :        x:       +0.505
                         :        y:       +0.430
                         : (offset):       +0.006
                         : -----------------------
                         : Elapsed time for training with 999 events: 0.000249 sec         
<HEADER> Fisher_fold1             : [dataset] : Evaluation of Fisher_fold1 on training sample (999 events)
                         : Elapsed time for evaluation of 999 events: 5.39e-05 sec       
                         : Creating xml weight file: dataset/weights/TMVACrossValidation_Fisher_fold1.weights.xml
                         : Creating standalone class: dataset/weights/TMVACrossValidation_Fisher_fold1.class.C
<HEADER> Factory                  : Test all methods
<HEADER> Factory                  : Test method: Fisher_fold1 for Classification performance
                         : 
<HEADER> Fisher_fold1             : [dataset] : Evaluation of Fisher_fold1 on testing sample (999 events)
                         : Elapsed time for evaluation of 999 events: 5.7e-05 sec       
<HEADER> Factory                  : Evaluate all methods
<HEADER> Factory                  : Evaluate classifier: Fisher_fold1
                         : 
<HEADER> Fisher_fold1             : [dataset] : Loop over test events and fill histograms with classifier response...
                         : 
<HEADER> Factory                  : Thank you for using TMVA!
                         : For citation information, please visit: http://tmva.sf.net/citeTMVA.html
<HEADER> Factory                  : Booking method: Fisher_fold2
                         : 
<HEADER> Fisher_fold2             : Results for Fisher coefficients:
                         : -----------------------
                         : Variable:  Coefficient:
                         : -----------------------
                         :        x:       +0.446
                         :        y:       +0.479
                         : (offset):       +0.011
                         : -----------------------
                         : Elapsed time for training with 999 events: 0.000169 sec         
<HEADER> Fisher_fold2             : [dataset] : Evaluation of Fisher_fold2 on training sample (999 events)
                         : Elapsed time for evaluation of 999 events: 5.1e-05 sec       
                         : Creating xml weight file: dataset/weights/TMVACrossValidation_Fisher_fold2.weights.xml
                         : Creating standalone class: dataset/weights/TMVACrossValidation_Fisher_fold2.class.C
<HEADER> Factory                  : Test all methods
<HEADER> Factory                  : Test method: Fisher_fold2 for Classification performance
                         : 
<HEADER> Fisher_fold2             : [dataset] : Evaluation of Fisher_fold2 on testing sample (999 events)
                         : Elapsed time for evaluation of 999 events: 5.82e-05 sec       
<HEADER> Factory                  : Evaluate all methods
<HEADER> Factory                  : Evaluate classifier: Fisher_fold2
                         : 
<HEADER> Fisher_fold2             : [dataset] : Loop over test events and fill histograms with classifier response...
                         : 
<HEADER> Factory                  : Thank you for using TMVA!
                         : For citation information, please visit: http://tmva.sf.net/citeTMVA.html
<HEADER> Factory                  : Booking method: Fisher
                         : 
                         : Reading weightfile: dataset/weights/TMVACrossValidation_Fisher_fold1.weights.xml
                         : Reading weight file: dataset/weights/TMVACrossValidation_Fisher_fold1.weights.xml
                         : Reading weightfile: dataset/weights/TMVACrossValidation_Fisher_fold2.weights.xml
                         : Reading weight file: dataset/weights/TMVACrossValidation_Fisher_fold2.weights.xml
<HEADER> Factory                  : [dataset] : Create Transformation "I" with events from all classes.
                         : 
<HEADER>                          : Transformation, Variable selection : 
                         : Input : variable 'x' <---> Output : variable 'x'
                         : Input : variable 'y' <---> Output : variable 'y'
<HEADER> TFHandler_Factory        : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :        x:  -0.014734     1.4061   [    -4.1075     4.0969 ]
                         :        y: -0.0050703     1.4200   [    -4.8520     4.0761 ]
                         : -----------------------------------------------------------
                         : Ranking input variables (method unspecific)...
<HEADER> IdTransformation         : Ranking result (top variable is best ranked)
                         : --------------------------
                         : Rank : Variable  : Separation
                         : --------------------------
                         :    1 : x         : 5.452e-01
                         :    2 : y         : 5.244e-01
                         : --------------------------
                         : Elapsed time for training with 1998 events: 2.5e-05 sec         
<HEADER> BDTG                     : [dataset] : Evaluation of BDTG on training sample (1998 events)
                         : Elapsed time for evaluation of 1998 events: 0.00754 sec       
                         : Creating xml weight file: dataset/weights/TMVACrossValidation_BDTG.weights.xml
                         : Creating standalone class: dataset/weights/TMVACrossValidation_BDTG.class.C
<WARNING> <WARNING>                : MakeClassSpecificHeader not implemented for CrossValidation
<WARNING> <WARNING>                : MakeClassSpecific not implemented for CrossValidation
                         : Elapsed time for training with 1998 events: 2.86e-06 sec         
<HEADER> Fisher                   : [dataset] : Evaluation of Fisher on training sample (1998 events)
                         : Elapsed time for evaluation of 1998 events: 0.000456 sec       
                         : Creating xml weight file: dataset/weights/TMVACrossValidation_Fisher.weights.xml
                         : Creating standalone class: dataset/weights/TMVACrossValidation_Fisher.class.C
<WARNING> <WARNING>                : MakeClassSpecificHeader not implemented for CrossValidation
<WARNING> <WARNING>                : MakeClassSpecific not implemented for CrossValidation
<HEADER> Factory                  : Test all methods
<HEADER> Factory                  : Test method: BDTG for Classification performance
                         : 
<HEADER> BDTG                     : [dataset] : Evaluation of BDTG on testing sample (1998 events)
                         : Elapsed time for evaluation of 1998 events: 0.00713 sec       
<HEADER> Factory                  : Test method: Fisher for Classification performance
                         : 
<HEADER> Fisher                   : [dataset] : Evaluation of Fisher on testing sample (1998 events)
                         : Elapsed time for evaluation of 1998 events: 0.000446 sec       
<HEADER> Factory                  : Evaluate all methods
<HEADER> Factory                  : Evaluate classifier: BDTG
                         : 
<HEADER> BDTG                     : [dataset] : Loop over test events and fill histograms with classifier response...
                         : 
<HEADER> TFHandler_BDTG           : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :        x:  -0.014734     1.4061   [    -4.1075     4.0969 ]
                         :        y: -0.0050703     1.4200   [    -4.8520     4.0761 ]
                         : -----------------------------------------------------------
<HEADER> Factory                  : Evaluate classifier: Fisher
                         : 
<HEADER> Fisher                   : [dataset] : Loop over test events and fill histograms with classifier response...
                         : 
<HEADER> TFHandler_Fisher         : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :        x:  -0.014734     1.4061   [    -4.1075     4.0969 ]
                         :        y: -0.0050703     1.4200   [    -4.8520     4.0761 ]
                         : -----------------------------------------------------------
                         : 
                         : Evaluation results ranked by best signal efficiency and purity (area)
                         : -------------------------------------------------------------------------------------------------------------------
                         : DataSet       MVA                       
                         : Name:         Method:          ROC-integ
                         : dataset       Fisher         : 0.971
                         : dataset       BDTG           : 0.967
                         : -------------------------------------------------------------------------------------------------------------------
                         : 
                         : Testing efficiency compared to training efficiency (overtraining check)
                         : -------------------------------------------------------------------------------------------------------------------
                         : DataSet              MVA              Signal efficiency: from test sample (from training sample) 
                         : Name:                Method:          @B=0.01             @B=0.10            @B=0.30   
                         : -------------------------------------------------------------------------------------------------------------------
                         : dataset              Fisher         : 0.645 (0.645)       0.924 (0.924)      0.979 (0.979)
                         : dataset              BDTG           : 0.600 (0.600)       0.923 (0.923)      0.974 (0.974)
                         : -------------------------------------------------------------------------------------------------------------------
                         : 
<HEADER> Dataset:dataset          : Created tree 'TestTree' with 1998 events
                         : 
<HEADER> Dataset:dataset          : Created tree 'TrainTree' with 1998 events
                         : 
<HEADER> Factory                  : Thank you for using TMVA!
                         : For citation information, please visit: http://tmva.sf.net/citeTMVA.html
                         : Evaluation done.
Summary for method BDT
   Fold 0: ROC int: 0.968758, BkgEff@SigEff=0.3: 0.976
   Fold 1: ROC int: 0.965505, BkgEff@SigEff=0.3: 0.975
Summary for method Fisher
   Fold 0: ROC int: 0.970423, BkgEff@SigEff=0.3: 0.978
   Fold 1: ROC int: 0.97145, BkgEff@SigEff=0.3: 0.982
==> Wrote root file: TMVA.root
==> TMVACrossValidation is done!
(int) 0
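The per-fold ROC integrals in the summary are often condensed into a mean and a spread. The following is a small sketch of that aggregation in plain C++; `FoldSummary` and `summarise` are hypothetical names, not part of TMVA, and the choice of the sample standard deviation (n - 1 in the denominator) is an assumption made here for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mean and spread of a per-fold figure of merit.
struct FoldSummary {
   double mean;
   double stdDev; // sample standard deviation (n - 1 in the denominator)
};

// Hypothetical helper: aggregates per-fold values such as ROC integrals.
// Requires at least two folds so that the sample deviation is defined.
FoldSummary summarise(const std::vector<double> &values)
{
   assert(values.size() >= 2);
   double sum = 0.0;
   for (double v : values) sum += v;
   const double mean = sum / values.size();

   double squares = 0.0;
   for (double v : values) squares += (v - mean) * (v - mean);
   return {mean, std::sqrt(squares / (values.size() - 1))};
}
```

Calling `summarise({0.968758, 0.965505})` with the two BDT values printed above gives a mean of about 0.9671. With only two folds the spread is a very rough estimate.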

#include <cstdlib>
#include <iostream>
#include <map>
#include <string>
#include "TChain.h"
#include "TFile.h"
#include "TTree.h"
#include "TString.h"
#include "TObjString.h"
#include "TSystem.h"
#include "TROOT.h"
#include "TRandom3.h"
#include "TMVA/CrossValidation.h"
#include "TMVA/DataLoader.h"
#include "TMVA/Factory.h"
#include "TMVA/Tools.h"
#include "TMVA/TMVAGui.h"
// Helper function that generates toy data and stores it in a TTree.
TTree *genTree(Int_t nPoints, Double_t offset, Double_t scale, UInt_t seed = 100)
{
   TRandom3 rng(seed);
   Float_t x = 0;
   Float_t y = 0;
   Int_t eventID = 0; // signed, to match the "eventID/I" branch type below
   TTree *data = new TTree();
   data->Branch("x", &x, "x/F");
   data->Branch("y", &y, "y/F");
   data->Branch("eventID", &eventID, "eventID/I");
   for (Int_t n = 0; n < nPoints; ++n) {
      x = rng.Gaus(offset, scale);
      y = rng.Gaus(offset, scale);
      // For our simple example it is enough that the IDs are uniformly
      // distributed and independent of the data.
      ++eventID;
      data->Fill();
   }
   // Important: Disconnects the tree from the memory locations of x, y and
   // eventID, which go out of scope when this function returns.
   data->ResetBranchAddresses();
   return data;
}
int TMVACrossValidation()
{
   // This loads the library
   TMVA::Tools::Instance();
   // --------------------------------------------------------------------------
   // Load the data into TTrees. If you load data from file you can use a
   // variant of
   // ```
   // TString filename = "/path/to/file";
   // TFile * input = TFile::Open( filename );
   // TTree * signalTree = (TTree*)input->Get("TreeName");
   // ```
   TTree *sigTree = genTree(1000, 1.0, 1.0, 100);
   TTree *bkgTree = genTree(1000, -1.0, 1.0, 101);
   // Create a ROOT output file where TMVA will store ntuples, histograms, etc.
   TString outfileName("TMVA.root");
   TFile *outputFile = TFile::Open(outfileName, "RECREATE");
   // DataLoader definitions: we declare the variables in the tree so that TMVA
   // can find them. For more information see the TMVAClassification tutorial.
   TMVA::DataLoader *dataloader = new TMVA::DataLoader("dataset");
   // Data variables
   dataloader->AddVariable("x", 'F');
   dataloader->AddVariable("y", 'F');
   // Spectator used for split
   dataloader->AddSpectator("eventID", 'I');
   // Attaches the trees so they can be read from
   dataloader->AddSignalTree(sigTree, 1.0);
   dataloader->AddBackgroundTree(bkgTree, 1.0);
   // The CV mechanism of TMVA splits up the training set into several folds.
   // The test set is currently left unused. Setting `nTest_ClassName=1`
   // assigns one event of each class to the test set and puts the rest in
   // the training set. A value of 0 is special and would split the
   // datasets 50 / 50.
   dataloader->PrepareTrainingAndTestTree("", "",
                                          "nTest_Signal=1"
                                          ":nTest_Background=1"
                                          ":SplitMode=Random"
                                          ":NormMode=NumEvents"
                                          ":!V");
   // --------------------------------------------------------------------------
   //
   // This sets up a CrossValidation class (which wraps a TMVA::Factory
   // internally) for 2-fold cross validation.
   //
   UInt_t numFolds = 2;
   TString analysisType = "Classification";
   TString splitExpr = "";
   //
   // One can also use a custom splitting function for producing the folds.
   // The example uses a dataset spectator `eventID`.
   //
   // The idea here is that eventID should be an event number that is integral,
   // random and independent of the data, generated only once. This last
   // property ensures that if a calibration is changed the same event will
   // still be assigned the same fold.
   // 
   // This makes it possible to use the cross-validated classifiers in
   // application, a technique that can simplify statistical analysis.
   //
   // If you want to run TMVACrossValidationApplication, make sure you have
   // first run this tutorial with the split expression below used in place of
   // the empty one above.
   // 
   // TString splitExpr = "int(fabs([eventID]))%int([NumFolds])";
   TString cvOptions = Form("!V"
                            ":!Silent"
                            ":ModelPersistence"
                            ":AnalysisType=%s"
                            ":NumFolds=%i"
                            ":SplitExpr=%s",
                            analysisType.Data(), numFolds, splitExpr.Data());
   TMVA::CrossValidation cv{"TMVACrossValidation", dataloader, outputFile, cvOptions};
   // --------------------------------------------------------------------------
   //
   // Books a method to use for evaluation
   //
   cv.BookMethod(TMVA::Types::kBDT, "BDTG",
                 "!H:!V:NTrees=100:MinNodeSize=2.5%:BoostType=Grad"
                 ":NegWeightTreatment=Pray:Shrinkage=0.10:nCuts=20"
                 ":MaxDepth=2");
   cv.BookMethod(TMVA::Types::kFisher, "Fisher",
                 "!H:!V:Fisher:VarTransform=None");
   // --------------------------------------------------------------------------
   //
   // Train, test and evaluate the booked methods.
   // Evaluates the booked methods once for each fold and aggregates the result
   // in the specified output file.
   //
   cv.Evaluate();
   // --------------------------------------------------------------------------
   //
   // Process some output programmatically, printing the ROC score for each
   // booked method.
   //
   size_t iMethod = 0;
   for (auto && result : cv.GetResults()) {
      std::cout << "Summary for method " << cv.GetMethods()[iMethod++].GetValue<TString>("MethodName")
                << std::endl;
      for (UInt_t iFold = 0; iFold<cv.GetNumFolds(); ++iFold) {
         std::cout << "\tFold " << iFold << ": "
                   << "ROC int: " << result.GetROCValues()[iFold]
                   << ", "
                   << "BkgEff@SigEff=0.3: " << result.GetEff30Values()[iFold]
                   << std::endl;
      }
   }
   // --------------------------------------------------------------------------
   //
   // Save the output
   //
   outputFile->Close();
   std::cout << "==> Wrote root file: " << outputFile->GetName() << std::endl;
   std::cout << "==> TMVACrossValidation is done!" << std::endl;
   // --------------------------------------------------------------------------
   //
   // Launch the GUI for the root macros
   //
   if (!gROOT->IsBatch()) {
      TMVA::TMVAGui(outfileName);
   }
   return 0;
}
//
// This is used if the macro is compiled. If run through ROOT with
// `root -l -b -q MACRO.C` or similar it is unused.
//
int main(int argc, char **argv)
{
   return TMVACrossValidation();
}

Author: Kim Albertsson (adapted from code originally by Andreas Hoecker)

Definition in file TMVACrossValidation.C.