Logo ROOT   6.16/01
Reference Guide
HybridStandardForm.C
Go to the documentation of this file.
1/// \file
2/// \ingroup tutorial_roostats
3/// \notebook -js
4/// A hypothesis testing example based on number counting with background uncertainty.
5///
6/// A hypothesis testing example based on number counting
7/// with background uncertainty.
8///
9/// NOTE: This example is like HybridInstructional, but the model is more clearly
10/// generalizable to an analysis with shapes. There is a lot of flexibility
11/// for how one models a problem in RooFit/RooStats. Models come in a few
12/// common forms:
13/// - standard form: extended PDF of some discriminating variable m:
14/// eg: P(m) ~ S*fs(m) + B*fb(m), with S+B events expected
15/// in this case the dataset has N rows corresponding to N events
16/// and the extended term is Pois(N|S+B)
17///
18/// - fractional form: non-extended PDF of some discriminating variable m:
19/// eg: P(m) ~ s*fs(m) + (1-s)*fb(m), where s is a signal fraction
20/// in this case the dataset has N rows corresponding to N events
21/// and there is no extended term
22///
23/// - number counting form: in which there is no discriminating variable
24/// and the counts are modeled directly (see HybridInstructional)
25/// eg: P(N) = Pois(N|S+B)
26/// in this case the dataset has 1 row corresponding to N events
27/// and the extended term is the PDF itself.
28///
29/// Here we convert the number counting form into the standard form by
30/// introducing a dummy discriminating variable m with a uniform distribution.
31///
32/// This example:
33/// - demonstrates the usage of the HybridCalcultor (Part 4-6)
34/// - demonstrates the numerical integration of RooFit (Part 2)
35/// - validates the RooStats against an example with a known analytic answer
36/// - demonstrates usage of different test statistics
37/// - explains subtle choices in the prior used for hybrid methods
38/// - demonstrates usage of different priors for the nuisance parameters
39/// - demonstrates usage of PROOF
40///
41/// The basic setup here is that a main measurement has observed x events with an
42/// expectation of s+b. One can choose an ad hoc prior for the uncertainty on b,
43/// or try to base it on an auxiliary measurement. In this case, the auxiliary
44/// measurement (aka control measurement, sideband) is another counting experiment
45/// with measurement y and expectation tau*b. With an 'original prior' on b,
46/// called \f$ \eta(b) \f$ then one can obtain a posterior from the auxiliary measurement
47/// \f$ \pi(b) = \eta(b) * Pois(y|tau*b) \f$. This is a principled choice for a prior
48/// on b in the main measurement of x, which can then be treated in a hybrid
49/// Bayesian/Frequentist way. Additionally, one can try to treat the two
50/// measurements simultaneously, which is detailed in Part 6 of the tutorial.
51///
52/// This tutorial is related to the FourBin.C tutorial in the modeling, but
53/// focuses on hypothesis testing instead of interval estimation.
54///
55/// More background on this 'prototype problem' can be found in the
56/// following papers:
57///
58/// - Evaluation of three methods for calculating statistical significance
59/// when incorporating a systematic uncertainty into a test of the
60/// background-only hypothesis for a Poisson process
61/// Authors: Robert D. Cousins, James T. Linnemann, Jordan Tucker
62/// http://arxiv.org/abs/physics/0702156
63/// NIM A 595 (2008) 480--501
64///
65/// - Statistical Challenges for Searches for New Physics at the LHC
66/// Author: Kyle Cranmer
67/// http://arxiv.org/abs/physics/0511028
68///
69/// - Measures of Significance in HEP and Astrophysics
70/// Author: J. T. Linnemann
71/// http://arxiv.org/abs/physics/0312059
72///
73/// \macro_image
74/// \macro_output
75/// \macro_code
76///
77/// \authors Kyle Cranmer, Wouter Verkerke, and Sven Kreiss
78
79#include "RooGlobalFunc.h"
80#include "RooRealVar.h"
81#include "RooProdPdf.h"
82#include "RooWorkspace.h"
83#include "RooDataSet.h"
84#include "RooDataHist.h"
85#include "TCanvas.h"
86#include "TStopwatch.h"
87#include "TH1.h"
88#include "RooPlot.h"
89#include "RooMsgService.h"
90
92
96
102
103using namespace RooFit;
104using namespace RooStats;
105
106//-------------------------------------------------------
107// A New Test Statistic Class for this example.
108// It simply returns the sum of the values in a particular
109// column of a dataset.
110// You can ignore this class and focus on the macro below
111
112class BinCountTestStat : public TestStatistic {
113public:
114 BinCountTestStat(void) : fColumnName("tmp") {}
115 BinCountTestStat(string columnName) : fColumnName(columnName) {}
116
117 virtual Double_t Evaluate(RooAbsData& data, RooArgSet& /*nullPOI*/) {
118 // This is the main method in the interface
119 Double_t value = 0.0;
120 for(int i=0; i < data.numEntries(); i++) {
121 value += data.get(i)->getRealValue(fColumnName.c_str());
122 }
123 return value;
124 }
125 virtual const TString GetVarName() const { return fColumnName; }
126
127private:
128 string fColumnName;
129
130protected:
131 ClassDef(BinCountTestStat,1)
132};
133
134ClassImp(BinCountTestStat)
135
136//-----------------------------
137// The Actual Tutorial Macro
138//-----------------------------
139
140void HybridStandardForm() {
141
142 // This tutorial has 6 parts
143 // Table of Contents
144 // Setup
145 // 1. Make the model for the 'prototype problem'
146 // Special cases
147 // 2. NOT RELEVANT HERE
148 // 3. Use RooStats analytic solution for this problem
149 // RooStats HybridCalculator -- can be generalized
150 // 4. RooStats ToyMC version of 2. & 3.
151 // 5. RooStats ToyMC with an equivalent test statistic
152 // 6. RooStats ToyMC with simultaneous control & main measurement
153
154 // Part 4 takes ~4 min without PROOF.
155 // Part 5 takes about ~2 min with PROOF on 4 cores.
156 // Of course, everything looks nicer with more toys, which takes longer.
157
158
159 TStopwatch t;
160 t.Start();
161 TCanvas *c = new TCanvas;
162 c->Divide(2,2);
163
164 //-----------------------------------------------------
165 // P A R T 1 : D I R E C T I N T E G R A T I O N
166 // ====================================================
167 // Make model for prototype on/off problem
168 // Pois(x | s+b) * Pois(y | tau b )
169 // for Z_Gamma, use uniform prior on b.
170 RooWorkspace* w = new RooWorkspace("w");
171
172
173 // replace the pdf in 'number counting form'
174 //w->factory("Poisson::px(x[150,0,500],sum::splusb(s[0,0,100],b[100,0,300]))");
175 // with one in standard form. Now x is encoded in event count
176 w->factory("Uniform::f(m[0,1])");//m is a dummy discriminating variable
177 w->factory("ExtendPdf::px(f,sum::splusb(s[0,0,100],b[100,0,300]))");
178 w->factory("Poisson::py(y[100,0,500],prod::taub(tau[1.],b))");
179 w->factory("PROD::model(px,py)");
180 w->factory("Uniform::prior_b(b)");
181
182 // We will control the output level in a few places to avoid
183 // verbose progress messages. We start by keeping track
184 // of the current threshold on messages.
186
187 // Use PROOF-lite on multi-core machines
188 ProofConfig* pc = NULL;
189 // uncomment below if you want to use PROOF
190 pc = new ProofConfig(*w, 4, "workers=4", kFALSE); // machine with 4 cores
191 // pc = new ProofConfig(*w, 2, "workers=2", kFALSE); // machine with 2 cores
192
193 //-----------------------------------------------
194 // P A R T 3 : A N A L Y T I C R E S U L T
195 // ==============================================
196 // In this special case, the integrals are known analytically
197 // and they are implemented in RooStats::NumberCountingUtils
198
199 // analytic Z_Bi
200 double p_Bi = NumberCountingUtils::BinomialWithTauObsP(150, 100, 1);
201 double Z_Bi = NumberCountingUtils::BinomialWithTauObsZ(150, 100, 1);
202 cout << "-----------------------------------------"<<endl;
203 cout << "Part 3" << endl;
204 std::cout << "Z_Bi p-value (analytic): " << p_Bi << std::endl;
205 std::cout << "Z_Bi significance (analytic): " << Z_Bi << std::endl;
206 t.Stop(); t.Print(); t.Reset(); t.Start();
207
208 //--------------------------------------------------------------
209 // P A R T 4 : U S I N G H Y B R I D C A L C U L A T O R
210 // ==============================================================
211 // Now we demonstrate the RooStats HybridCalculator.
212 //
213 // Like all RooStats calculators it needs the data and a ModelConfig
214 // for the relevant hypotheses. Since we are doing hypothesis testing
215 // we need a ModelConfig for the null (background only) and the alternate
216 // (signal+background) hypotheses. We also need to specify the PDF,
217 // the parameters of interest, and the observables. Furthermore, since
218 // the parameter of interest is floating, we need to specify which values
219 // of the parameter corresponds to the null and alternate (eg. s=0 and s=50)
220 //
221 // define some sets of variables obs={x} and poi={s}
222 // note here, x is the only observable in the main measurement
223 // and y is treated as a separate measurement, which is used
224 // to produce the prior that will be used in this calculation
225 // to randomize the nuisance parameters.
226 w->defineSet("obs","m");
227 w->defineSet("poi","s");
228
229 // create a toy dataset with the x=150
230 // RooDataSet *data = new RooDataSet("d", "d", *w->set("obs"));
231 // data->add(*w->set("obs"));
232 RooDataSet* data = w->pdf("px")->generate(*w->set("obs"),150);
233
234 // Part 3a : Setup ModelConfigs
235 //-------------------------------------------------------
236 // create the null (background-only) ModelConfig with s=0
237 ModelConfig b_model("B_model", w);
238 b_model.SetPdf(*w->pdf("px"));
239 b_model.SetObservables(*w->set("obs"));
240 b_model.SetParametersOfInterest(*w->set("poi"));
241 w->var("s")->setVal(0.0); // important!
242 b_model.SetSnapshot(*w->set("poi"));
243
244 // create the alternate (signal+background) ModelConfig with s=50
245 ModelConfig sb_model("S+B_model", w);
246 sb_model.SetPdf(*w->pdf("px"));
247 sb_model.SetObservables(*w->set("obs"));
248 sb_model.SetParametersOfInterest(*w->set("poi"));
249 w->var("s")->setVal(50.0); // important!
250 sb_model.SetSnapshot(*w->set("poi"));
251
252
253 // Part 3b : Choose Test Statistic
254 //--------------------------------------------------------------
255 // To make an equivalent calculation we need to use x as the test
256 // statistic. This is not a built-in test statistic in RooStats
257 // so we define it above. The new class inherits from the
258 // RooStats::TestStatistic interface, and simply returns the value
259 // of x in the dataset.
260
261 NumEventsTestStat eventCount(*w->pdf("px"));
262
263 // Part 3c : Define Prior used to randomize nuisance parameters
264 //-------------------------------------------------------------
265 //
266 // The prior used for the hybrid calculator is the posterior
267 // from the auxiliary measurement y. The model for the aux.
268 // measurement is Pois(y|tau*b), thus the likelihood function
269 // is proportional to (has the form of) a Gamma distribution.
270 // if the 'original prior' $\eta(b)$ is uniform, then from
271 // Bayes's theorem we have the posterior:
272 // $\pi(b) = Pois(y|tau*b) * \eta(b)$
273 // If $\eta(b)$ is flat, then we arrive at a Gamma distribution.
274 // Since RooFit will normalize the PDF we can actually supply
275 // py=Pois(y,tau*b) that will be equivalent to multiplying by a uniform.
276 //
277 // Alternatively, we could explicitly use a gamma distribution:
278 //
279 // `w->factory("Gamma::gamma(b,sum::temp(y,1),1,0)");`
280 //
281 // or we can use some other ad hoc prior that do not naturally
282 // follow from the known form of the auxiliary measurement.
283 // The common choice is the equivalent Gaussian:
284 w->factory("Gaussian::gauss_prior(b,y, expr::sqrty('sqrt(y)',y))");
285 // this corresponds to the "Z_N" calculation.
286 //
287 // or one could use the analogous log-normal prior
288 w->factory("Lognormal::lognorm_prior(b,y, expr::kappa('1+1./sqrt(y)',y))");
289 //
290 // Ideally, the HybridCalculator would be able to inspect the full
291 // model Pois(x | s+b) * Pois(y | tau b ) and be given the original
292 // prior $\eta(b)$ to form $\pi(b) = Pois(y|tau*b) * \eta(b)$.
293 // This is not yet implemented because in the general case
294 // it is not easy to identify the terms in the PDF that correspond
295 // to the auxiliary measurement. So for now, it must be set
296 // explicitly with:
297 // - ForcePriorNuisanceNull()
298 // - ForcePriorNuisanceAlt()
299 // the name "ForcePriorNuisance" was chosen because we anticipate
300 // this to be auto-detected, but will leave the option open
301 // to force to a different prior for the nuisance parameters.
302
303 // Part 3d : Construct and configure the HybridCalculator
304 //-------------------------------------------------------
305
306 HybridCalculator hc1(*data, sb_model, b_model);
307 ToyMCSampler *toymcs1 = (ToyMCSampler*)hc1.GetTestStatSampler();
308 // toymcs1->SetNEventsPerToy(1); // because the model is in number counting form
309 toymcs1->SetTestStatistic(&eventCount); // set the test statistic
310 // toymcs1->SetGenerateBinned();
311 hc1.SetToys(30000,1000);
312 hc1.ForcePriorNuisanceAlt(*w->pdf("py"));
313 hc1.ForcePriorNuisanceNull(*w->pdf("py"));
314 // if you wanted to use the ad hoc Gaussian prior instead
315 // ~~~
316 // hc1.ForcePriorNuisanceAlt(*w->pdf("gauss_prior"));
317 // hc1.ForcePriorNuisanceNull(*w->pdf("gauss_prior"));
318 // ~~~
319 // if you wanted to use the ad hoc log-normal prior instead
320 // ~~~
321 // hc1.ForcePriorNuisanceAlt(*w->pdf("lognorm_prior"));
322 // hc1.ForcePriorNuisanceNull(*w->pdf("lognorm_prior"));
323 // ~~~
324
325 // enable proof
326 // proof not enabled for this test statistic
327 // if(pc) toymcs1->SetProofConfig(pc);
328
329 // these lines save current msg level and then kill any messages below ERROR
331 // Get the result
332 HypoTestResult *r1 = hc1.GetHypoTest();
333 RooMsgService::instance().setGlobalKillBelow(msglevel); // set it back
334 cout << "-----------------------------------------"<<endl;
335 cout << "Part 4" << endl;
336 r1->Print();
337 t.Stop(); t.Print(); t.Reset(); t.Start();
338
339 c->cd(2);
340 HypoTestPlot *p1 = new HypoTestPlot(*r1,30); // 30 bins, TS is discrete
341 p1->Draw();
342
343 return; // keep the running time sort by default
344 //-------------------------------------------------------------------------
345 // # P A R T 5 : U S I N G H Y B R I D C A L C U L A T O R W I T H
346 // # A N A L T E R N A T I V E T E S T S T A T I S T I C
347 //
348 // A likelihood ratio test statistics should be 1-to-1 with the count x
349 // when the value of b is fixed in the likelihood. This is implemented
350 // by the SimpleLikelihoodRatioTestStat
351
352 SimpleLikelihoodRatioTestStat slrts(*b_model.GetPdf(),*sb_model.GetPdf());
353 slrts.SetNullParameters(*b_model.GetSnapshot());
354 slrts.SetAltParameters(*sb_model.GetSnapshot());
355
356 // HYBRID CALCULATOR
357 HybridCalculator hc2(*data, sb_model, b_model);
358 ToyMCSampler *toymcs2 = (ToyMCSampler*)hc2.GetTestStatSampler();
359 // toymcs2->SetNEventsPerToy(1);
360 toymcs2->SetTestStatistic(&slrts);
361 // toymcs2->SetGenerateBinned();
362 hc2.SetToys(20000,1000);
363 hc2.ForcePriorNuisanceAlt(*w->pdf("py"));
364 hc2.ForcePriorNuisanceNull(*w->pdf("py"));
365 // if you wanted to use the ad hoc Gaussian prior instead
366 // ~~~
367 // hc2.ForcePriorNuisanceAlt(*w->pdf("gauss_prior"));
368 // hc2.ForcePriorNuisanceNull(*w->pdf("gauss_prior"));
369 // ~~~
370 // if you wanted to use the ad hoc log-normal prior instead
371 // ~~~
372 // hc2.ForcePriorNuisanceAlt(*w->pdf("lognorm_prior"));
373 // hc2.ForcePriorNuisanceNull(*w->pdf("lognorm_prior"));
374 // ~~~
375
376 // enable proof
377 if(pc) toymcs2->SetProofConfig(pc);
378
379 // these lines save current msg level and then kill any messages below ERROR
381 // Get the result
382 HypoTestResult *r2 = hc2.GetHypoTest();
383 cout << "-----------------------------------------"<<endl;
384 cout << "Part 5" << endl;
385 r2->Print();
386 t.Stop(); t.Print(); t.Reset(); t.Start();
388
389 c->cd(3);
390 HypoTestPlot *p2 = new HypoTestPlot(*r2,30); // 30 bins
391 p2->Draw();
392
393 return; // so standard tutorial runs faster
394
395 //---------------------------------------------
396 // OUTPUT W/O PROOF (2.66 GHz Intel Core i7)
397 // ============================================
398
399 // -----------------------------------------
400 // Part 3
401 // Z_Bi p-value (analytic): 0.00094165
402 // Z_Bi significance (analytic): 3.10804
403 // Real time 0:00:00, CP time 0.610
404
405 // Results HybridCalculator_result:
406 // - Null p-value = 0.00103333 +/- 0.000179406
407 // - Significance = 3.08048 sigma
408 // - Number of S+B toys: 1000
409 // - Number of B toys: 30000
410 // - Test statistic evaluated on data: 150
411 // - CL_b: 0.998967 +/- 0.000185496
412 // - CL_s+b: 0.495 +/- 0.0158106
413 // - CL_s: 0.495512 +/- 0.0158272
414 // Real time 0:04:43, CP time 283.780
415
416 // With PROOF
417 // -----------------------------------------
418 // Part 5
419
420 // Results HybridCalculator_result:
421 // - Null p-value = 0.00105 +/- 0.000206022
422 // - Significance = 3.07571 sigma
423 // - Number of S+B toys: 1000
424 // - Number of B toys: 20000
425 // - Test statistic evaluated on data: 10.8198
426 // - CL_b: 0.99895 +/- 0.000229008
427 // - CL_s+b: 0.491 +/- 0.0158088
428 // - CL_s: 0.491516 +/- 0.0158258
429 // Real time 0:02:22, CP time 0.990
430
431
432 //-------------------------------------------------------
433 // Comparison
434 //-------------------------------------------------------
435 // LEPStatToolsForLHC
436 // https://plone4.fnal.gov:4430/P0/phystat/packages/0703002
437 // Uses Gaussian prior
438 // CL_b = 6.218476e-04, Significance = 3.228665 sigma
439 //
440 //-------------------------------------------------------
441 // Comparison
442 //-------------------------------------------------------
443 // Asymptotic
444 // From the value of the profile likelihood ratio (5.0338)
445 // The significance can be estimated using Wilks's theorem
446 // significance = sqrt(2*profileLR) = 3.1729 sigma
447}
#define c(i)
Definition: RSha256.hxx:101
static double p1(double t, double a, double b)
static double p2(double t, double a, double b, double c)
const Bool_t kFALSE
Definition: RtypesCore.h:88
double Double_t
Definition: RtypesCore.h:55
#define ClassDef(name, id)
Definition: Rtypes.h:324
#define ClassImp(name)
Definition: Rtypes.h:363
RooAbsData is the common abstract base class for binned and unbinned datasets.
Definition: RooAbsData.h:37
RooDataSet * generate(const RooArgSet &whatVars, Int_t nEvents, const RooCmdArg &arg1, const RooCmdArg &arg2=RooCmdArg::none(), const RooCmdArg &arg3=RooCmdArg::none(), const RooCmdArg &arg4=RooCmdArg::none(), const RooCmdArg &arg5=RooCmdArg::none())
See RooAbsPdf::generate(const RooArgSet&,const RooCmdArg&,const RooCmdArg&,const RooCmdArg&,...
Definition: RooAbsPdf.h:56
RooArgSet is a container object that can hold multiple RooAbsArg objects.
Definition: RooArgSet.h:28
RooDataSet is a container class to hold unbinned data.
Definition: RooDataSet.h:31
static RooMsgService & instance()
Return reference to singleton instance.
void setGlobalKillBelow(RooFit::MsgLevel level)
RooFit::MsgLevel globalKillBelow() const
virtual void setVal(Double_t value)
Set value of variable to 'value'.
Definition: RooRealVar.cxx:204
Same purpose as HybridCalculatorOriginal, but different implementation.
This class provides the plots for the result of a study performed with any of the HypoTestCalculatorG...
Definition: HypoTestPlot.h:22
HypoTestResult is a base class for results from hypothesis tests.
void Print(const Option_t *="") const
Print out some information about the results Note: use Alt/Null labels for the hypotheses here as the...
ModelConfig is a simple class that holds configuration information specifying how a model should be u...
Definition: ModelConfig.h:30
NumEventsTestStat is a simple implementation of the TestStatistic interface used for simple number co...
Holds configuration options for proof and proof-lite.
Definition: ProofConfig.h:46
TestStatistic class that returns -log(L[null] / L[alt]) where L is the likelihood.
TestStatistic is an interface class to provide a facility for construction test statistics distributi...
Definition: TestStatistic.h:31
virtual Double_t Evaluate(RooAbsData &data, RooArgSet &nullPOI)=0
Main interface to evaluate the test statistic on a dataset given the values for the Null Parameters O...
virtual const TString GetVarName() const =0
ToyMCSampler is an implementation of the TestStatSampler interface.
Definition: ToyMCSampler.h:71
void SetProofConfig(ProofConfig *pc=NULL)
Definition: ToyMCSampler.h:233
virtual void SetTestStatistic(TestStatistic *testStatistic, unsigned int i)
Definition: ToyMCSampler.h:184
The RooWorkspace is a persistable container for RooFit projects.
Definition: RooWorkspace.h:43
Bool_t defineSet(const char *name, const RooArgSet &aset, Bool_t importMissing=kFALSE)
Define a named RooArgSet with given constituents.
RooRealVar * var(const char *name) const
Retrieve real-valued variable (RooRealVar) with given name. A null pointer is returned if not found.
RooFactoryWSTool & factory()
Return instance to factory tool.
const RooArgSet * set(const char *name)
Return pointer to previously defined named set with given nmame If no such set is found a null pointe...
RooAbsPdf * pdf(const char *name) const
Retrieve p.d.f (RooAbsPdf) with given name. A null pointer is returned if not found.
The Canvas class.
Definition: TCanvas.h:31
Stopwatch class.
Definition: TStopwatch.h:28
void Start(Bool_t reset=kTRUE)
Start the stopwatch.
Definition: TStopwatch.cxx:58
void Stop()
Stop the stopwatch.
Definition: TStopwatch.cxx:77
void Reset()
Definition: TStopwatch.h:52
void Print(Option_t *option="") const
Print the real and cpu time passed between the start and stop events.
Definition: TStopwatch.cxx:219
Double_t BinomialWithTauObsZ(Double_t nObs, Double_t bExp, Double_t tau)
Double_t BinomialWithTauObsP(Double_t nObs, Double_t bExp, Double_t tau)
@(#)root/roostats:$Id$ Author: George Lewis, Kyle Cranmer
Definition: Asimov.h:20
static constexpr double pc