rf614_binned_fit_problems.py File Reference

## Namespaces | |

namespace | rf614_binned_fit_problems |

A tutorial that explains you how to solve problems with binning effects and numerical stability in binned fits.

In this tutorial, you will learn three new things:

- How to reduce the bias in binned fits by changing the definition of the normalization integral
- How to completely get rid of binning effects by integrating the pdf over each bin
- How to improve the numeric stability of fits with a greatly different number of events per bin, using a constant per-bin counterterm

import ROOT

def generateBinnedAsimov(pdf, x, n_events):

"""

Generate binned Asimov dataset for a continuous pdf.

One should in principle be able to use

pdf.generateBinned(x, n_events, RooFit::ExpectedData()).

Unfortunately it has a problem: it also has the bin bias that this tutorial

demostrates, to if we would use it, the biases would cancel out.

"""

data_h = ROOT.RooDataHist("dataH", "dataH", {x})

x_binning = x.getBinning()

for i_bin in range(x.numBins()):

x.setRange("bin", x_binning.binLow(i_bin), x_binning.binHigh(i_bin))

integ = pdf.createIntegral(x, NormSet=x, Range="bin")

ROOT.SetOwnership(integ, True)

integ.getVal()

data_h.set(i_bin, n_events * integ.getVal(), -1)

return data_h

def enableBinIntegrator(func, num_bins):

"""

Force numeric integration and do this numeric integration with the

RooBinIntegrator, which sums the function values at the bin centers.

"""

custom_config = ROOT.RooNumIntConfig(func.getIntegratorConfig())

custom_config.method1D().setLabel("RooBinIntegrator")

custom_config.getConfigSection("RooBinIntegrator").setRealValue("numBins", num_bins)

func.setIntegratorConfig(custom_config)

func.forceNumInt(True)

def disableBinIntegrator(func):

"""

Reset the integrator config to disable the RooBinIntegrator.

"""

func.setIntegratorConfig()

func.forceNumInt(False)

# Silence info output for this tutorial

ROOT.RooMsgService.instance().getStream(1).removeTopic(ROOT.RooFit.Minimization)

ROOT.RooMsgService.instance().getStream(1).removeTopic(ROOT.RooFit.Fitting)

ROOT.RooMsgService.instance().getStream(1).removeTopic(ROOT.RooFit.Generation)

# Exponential example

# -------------------

# Set up the observable

x = ROOT.RooRealVar("x", "x", 0.1, 5.1)

x.setBins(10)

# fewer bins so we have larger binning effects for this demo

# Let's first look at the example of an exponential function

c = ROOT.RooRealVar("c", "c", -1.8, -5, 5)

expo = ROOT.RooExponential("expo", "expo", x, c)

# Generate an Asimov dataset such that the only difference between the fit

# result and the true parameters comes from binning effects.

expo_data = generateBinnedAsimov(expo, x, 10000)

# If you do the fit the usual was in RooFit, you will get a bias in the

# result. This is because the continuous, normalized pdf is evaluated only

# at the bin centers.

fit1 = expo.fitTo(expo_data, Save=True, PrintLevel=-1, SumW2Error=False)

fit1.Print()

# In the case of an exponential function, the bias that you get by

# evaluating the pdf only at the bin centers is a constant scale factor in

# each bin. Here, we can do a trick to get rid of the bias: we also

# evaluate the normalization integral for the pdf the same way, i.e.,

# summing the values of the unnormalized pdf at the bin centers. Like this

# the bias cancels out. You can achieve this by customizing the way how the

# pdf is integrated (see also the rf901_numintconfig tutorial).

enableBinIntegrator(expo, x.numBins())

fit2 = expo.fitTo(expo_data, Save=True, PrintLevel=-1, SumW2Error=False)

fit2.Print()

disableBinIntegrator(expo)

# Power law example

# -----------------

# Let's not look at another example: a power law \f[x^a\f].

a = ROOT.RooRealVar("a", "a", -0.3, -5.0, 5.0)

powerlaw = ROOT.RooPower("powerlaw", "powerlaw", x, ROOT.RooFit.RooConst(1.0), a)

powerlaw_data = generateBinnedAsimov(powerlaw, x, 10000)

# Again, if you do a vanilla fit, you'll get a bias

fit3 = powerlaw.fitTo(powerlaw_data, Save=True, PrintLevel=-1, SumW2Error=False)

fit3.Print()

# This time, the bias is not the same factor in each bin! This means our

# trick by sampling the integral in the same way doesn't cancel out the

# bias completely. The average bias is canceled, but there are per-bin

# biases that remain. Still, this method has some value: it is cheaper than

# rigurously correcting the bias by integrating the pdf in each bin. So if

# you know your per-bin bias variations are small or performance is an

# issue, this approach can be sufficient.

enableBinIntegrator(powerlaw, x.numBins())

fit4 = powerlaw.fitTo(powerlaw_data, Save=True, PrintLevel=-1, SumW2Error=False)

fit4.Print()

disableBinIntegrator(powerlaw)

# To get rid of the binning effects in the general case, one can use the

# IntegrateBins() command argument. Now, the pdf is not evaluated at the

# bin centers, but numerically integrated over each bin and divided by the

# bin width. The parameter for IntegrateBins() is the required precision

# for the numeric integrals. This is computationally expensive, but the

# bias is now not a problem anymore.

fit5 = powerlaw.fitTo(powerlaw_data, IntegrateBins=1e-3, Save=True, PrintLevel=-1, SumW2Error=False)

fit5.Print()

# Improving numerical stability

# -----------------------------

# There is one more problem with binned fits that is related to the binning

# effects because often, a binned fit is affected by both problems.

#

# The issue is numerical stability for fits with a greatly different number

# of events in each bin. For each bin, you have a term \f[n\log(p)\f] in

# the NLL, where \f[n\f] is the number of observations in the bin, and

# \f[p\f] the predicted probability to have an event in that bin. The

# difference in the logarithms for each bin is small, but the difference in

# \f[n\f] can be orders of magnitudes! Therefore, when summing these terms,

# lots of numerical precision is lost for the bins with less events.

# We can study this with the example of an exponential plus a Gaussian. The

# Gaussian is only a faint signal in the tail of the exponential where

# there are not so many events. And we can't afford any precision loss for

# these bins, otherwise we can't fit the Gaussian.

x.setBins(100) # It's not about binning effects anymore, so reset the number of bins.

mu = ROOT.RooRealVar("mu", "mu", 3.0, 0.1, 5.1)

sigma = ROOT.RooRealVar("sigma", "sigma", 0.5, 0.01, 5.0)

gauss = ROOT.RooGaussian("gauss", "gauss", x, mu, sigma)

nsig = ROOT.RooRealVar("nsig", "nsig", 10000, 0, 1e9)

nbkg = ROOT.RooRealVar("nbkg", "nbkg", 10000000, 0, 1e9)

frac = ROOT.RooRealVar("frac", "frac", nsig.getVal() / (nsig.getVal() + nbkg.getVal()), 0.0, 1.0)

model = ROOT.RooAddPdf("model", "model", [gauss, expo], [nsig, nbkg])

model_data = model.generateBinned(x)

# Set the starting values for the Gaussian parameters away from the true

# value such that the fit is not trivial.

mu.setVal(2.0)

sigma.setVal(1.0)

fit6 = model.fitTo(model_data, Save=True, PrintLevel=-1, SumW2Error=False)

fit6.Print()

# You should see in the previous fit result that the fit did not converge:

# the `MINIMIZE` return code should be -1 (a successful fit has status code zero).

# To improve the situation, we can apply a numeric trick: if we subtract in

# each bin a constant counterterm \f[n\log(n/N)\f], we get terms for each

# bin that are closer to each other in order of magnitude as long as the

# initial model is not extremely off. Proving this mathematically is left

# as an excercise to the reader.

# This counterterms can be enabled in RooFit if you use a binned

# RooDataHist to do your fit and pass the Offset("bin") option to

# RooAbsPdf::fitTo() or RooAbsPdf::createNLL().

fit7 = model.fitTo(model_data, Offset="bin", Save=True, PrintLevel=-1, SumW2Error=False)

fit7.Print()

# You should now see in the last fit result that the fit has converged.

[#1] INFO:Eval -- RooRealVar::setRange(x) new range named 'bin' created with bounds [0.1,0.6]

RooFitResult: minimized FCN value: 4754.37, estimated distance to minimum: 6.17653e-09

covariance matrix quality: Full, accurate covariance matrix

Status : MINIMIZE=0 HESSE=0

Floating Parameter FinalValue +/- Error

-------------------- --------------------------

c -1.6862e+00 +/- 1.70e-02

[#0] WARNING:Integration -- RooBinIntegrator::RooBinIntegrator WARNING: integrand provide no binning definition observable #0 substituting default binning of 10 bins

[#1] INFO:NumericIntegration -- RooRealIntegral::init(expo_Int[x]) using numeric integrator RooBinIntegrator to calculate Int(x)

RooFitResult: minimized FCN value: 4440.6, estimated distance to minimum: 1.1198e-06

covariance matrix quality: Full, accurate covariance matrix

Status : MINIMIZE=0 HESSE=0

Floating Parameter FinalValue +/- Error

-------------------- --------------------------

c -1.8000e+00 +/- 1.87e-02

RooFitResult: minimized FCN value: 15816.4, estimated distance to minimum: 9.94074e-07

covariance matrix quality: Full, accurate covariance matrix

Status : MINIMIZE=0 HESSE=0

Floating Parameter FinalValue +/- Error

-------------------- --------------------------

a -2.6106e-01 +/- 1.06e-02

[#0] WARNING:Integration -- RooBinIntegrator::RooBinIntegrator WARNING: integrand provide no binning definition observable #0 substituting default binning of 10 bins

[#1] INFO:NumericIntegration -- RooRealIntegral::init(powerlaw_Int[x]) using numeric integrator RooBinIntegrator to calculate Int(x)

RooFitResult: minimized FCN value: 15739.9, estimated distance to minimum: 9.98948e-07

covariance matrix quality: Full, accurate covariance matrix

Status : MINIMIZE=0 HESSE=0

Floating Parameter FinalValue +/- Error

-------------------- --------------------------

a -3.1481e-01 +/- 1.15e-02

RooFitResult: minimized FCN value: 15739.6, estimated distance to minimum: 2.09956e-09

covariance matrix quality: Full, accurate covariance matrix

Status : MINIMIZE=0 HESSE=0

Floating Parameter FinalValue +/- Error

-------------------- --------------------------

a -3.0000e-01 +/- 1.07e-02

[#0] PROGRESS:Generation -- RooAbsPdf::generateBinned(model) Performing costly accept/reject sampling. If this takes too long, use extended mode to speed up the process.

RooFitResult: minimized FCN value: -1.47174e+08, estimated distance to minimum: 0.324117

covariance matrix quality: Full, accurate covariance matrix

Status : MINIMIZE=-1 HESSE=4

Floating Parameter FinalValue +/- Error

-------------------- --------------------------

c -1.7972e+00 +/- 7.39e-04

mu 2.9756e+00 +/- 3.90e-02

nbkg 1.0001e+07 +/- 3.25e+03

nsig 9.4264e+03 +/- 7.36e+02

sigma 4.6849e-01 +/- 2.75e-02

RooFitResult: minimized FCN value: -0.151899, estimated distance to minimum: 3.26001e-06

covariance matrix quality: Full, accurate covariance matrix

Status : MINIMIZE=0 HESSE=0

Floating Parameter FinalValue +/- Error

-------------------- --------------------------

c -1.7971e+00 +/- 7.26e-04

mu 2.9942e+00 +/- 3.64e-02

nbkg 1.0001e+07 +/- 3.24e+03

nsig 9.2412e+03 +/- 6.94e+02

sigma 4.5770e-01 +/- 2.59e-02

- Date
- January 2023

Definition in file rf614_binned_fit_problems.py.