Use of mixture models in RooFit.
This tutorial shows how to use mixture models for likelihood calculations in ROOT. Instead of calculating the likelihood directly, we use simulation-based inference (SBI) as shown in the tutorial 'rf615_simulation_based_inference.py'. We train a classifier to discriminate between samples from a background hypothesis, here the zz samples, and a target hypothesis, here the higgs samples. The data preparation is based on the tutorial 'df106_HiggsToFourLeptons.py'.
An introduction to mixture models can be found at https://arxiv.org/pdf/1506.02169.
A short summary: we assume that the full probability distribution can be written as a mixture of several components, i.e.
$$p(x|\theta)= \sum_{c}w_{c}(\theta)\,p_{c}(x|\theta).$$
The likelihood ratio can then be written in terms of pairwise classification problems,
\begin{align*} \frac{p(x|\mu)}{p(x|0)} &= \frac{\sum_{c}w_{c}(\mu)\,p_{c}(x|\mu)}{\sum_{c'}w_{c'}(0)\,p_{c'}(x|0)} \\ &= \sum_{c}\Bigg[\sum_{c'}\frac{w_{c'}(0)}{w_{c}(\mu)}\frac{p_{c'}(x|0)}{p_{c}(x|\mu)}\Bigg]^{-1}, \end{align*}
where $\mu$ is the signal strength and $\mu=0$ corresponds to the background-only hypothesis. With this decomposition, the full likelihood ratio is built entirely from pairwise likelihood ratios, each of which can be estimated with a binary classifier.
Since the only free parameter in our case is $\mu$, the component distributions $p_{c}(x)$ themselves do not depend on any parameter; the dependence on the signal strength enters only through the weights $w_{c}(\mu)$. The pairwise ratios therefore simplify considerably: being independent of $\theta$, they can be pre-computed once, and the classifier does not need to be parametrized.
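As a concrete illustration, assume that only the higgs (signal) and zz (background) components enter the ratio, as in the code below, and write $n_{s}$ and $n_{b}$ for the expected signal and background yields. The weights are then $w_{\text{higgs}}(\mu)=\frac{\mu n_{s}}{\mu n_{s}+n_{b}}$ and $w_{\text{zz}}(\mu)=\frac{n_{b}}{\mu n_{s}+n_{b}}$, and the decomposition collapses to
$$\frac{p(x|\mu)}{p(x|0)} = w_{\text{zz}}(\mu) + w_{\text{higgs}}(\mu)\,\frac{p_{\text{higgs}}(x)}{p_{\text{zz}}(x)},$$
so the only quantity that has to be learned is the single ratio $p_{\text{higgs}}(x)/p_{\text{zz}}(x)$, which the classifier trained below provides.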
If you wish to see an analysis done with template histograms, see 'hf001_example.py'.
import ROOT
import os
import numpy as np
import xgboost as xgb
results = {}
data_dict = df.AsNumpy(columns=["m4l", "sample_category", "weight"])
weights_dict = {
    name: data_dict["weight"][data_dict["sample_category"] == name].sum()
    for name in ("data", "zz", "other", "higgs")
}
for sample_category in ["data", "higgs", "zz", "other"]:
    weight_sum = weights_dict[sample_category]
    mask = data_dict["sample_category"] == sample_category
    weights = data_dict["weight"][mask]
    weight_modified = weights / weight_sum
    count = mask.sum()  # number of events in this category (definition inferred; missing in the excerpt)
    results[sample_category] = {
        "weight_sum": weight_sum,
        "weight_modified": weight_modified,
        "count": count,
        "weight": weights,
    }
higgs_data = data_dict["m4l"][data_dict["sample_category"] == "higgs"]
zz_data = data_dict["m4l"][data_dict["sample_category"] == "zz"]
sample_weight_higgs = np.array(results["higgs"]["weight_modified"]).flatten()
sample_weight_zz = np.array(results["zz"]["weight_modified"]).flatten()  # analogous to the higgs weights; definition missing in the excerpt
sample_weight = np.concatenate([sample_weight_higgs, sample_weight_zz])
# XGBoost expects non-negative weights; replace negative event weights by a tiny positive value
sample_weight[sample_weight < 0] = 1e-6
model_xgb = xgb.XGBClassifier(n_estimators=1000, max_depth=5, eta=0.2, min_child_weight=1e-6, nthread=1)
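The training call itself does not appear in the excerpt above. A minimal sketch of how the classifier could be fitted from the quantities prepared so far (the array names X and y are illustrative, not taken from the original) is:
# Sketch (not from the original): stack the higgs and zz events, label them 1 and 0,
# and fit the classifier with the per-event weights prepared above.
X = np.concatenate([higgs_data, zz_data]).reshape(-1, 1)
y = np.concatenate([np.ones(len(higgs_data)), np.zeros(len(zz_data))])
model_xgb.fit(X, y, sample_weight=sample_weight)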
# The surrounding function definitions are missing from the excerpt; the wrappers below are inferred,
# assuming the higgs samples were labeled 1 during training, so that prob approximates p(higgs | x)
# and (1 - prob) / prob is the pairwise ratio p_zz(x) / p_higgs(x).
def pairwise_ratio(x):  # illustrative name
    prob = model_xgb.predict_proba(x.reshape(-1, 1))[:, 1]
    return (1 - prob) / prob

# Expected signal and background yields from the summed event weights
n_signal = results["higgs"]["weight"].sum()
n_back = results["zz"]["weight"].sum()

# Mixture weight of the background component as a function of the signal strength mu
def background_weight(mu):  # illustrative name
    return n_back / (n_back + mu * n_signal)
# Pairwise ratio matrix for the m = 2 components entering the ratio; only one entry has to be
# learned, the mirrored entry is its inverse (container and indices inferred from the excerpt).
m = 2
p = np.ones((m, m))  # p[i, j] ~ p_i(x) / p_j(x)
p[1, 0] = llr  # the learned pairwise ratio, e.g. pairwise_ratio evaluated on the observable
p[0, 1] = 1.0 / p[1, 0]
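The excerpt also does not show how the pairwise ratio and the mu-dependent weights are combined into the RooFit function nll_ratio that is wrapped below, nor how the expected yield n_pred is built. One possible sketch, using the two-component simplification shown earlier together with the helpers sketched above (learned_likelihood_ratio is an illustrative name; ROOT.RooFit.bindFunction and ROOT.RooFormulaVar are one way, not necessarily the original one, to expose these quantities to RooFit):
# Sketch: combine the learned pairwise ratio with the mu-dependent mixture weights,
# following p(x|mu)/p(x|0) = w_zz(mu) + w_higgs(mu) * p_higgs(x)/p_zz(x).
def learned_likelihood_ratio(x, mu):  # illustrative helper, not from the original
    w_bkg = background_weight(mu)
    return w_bkg + (1.0 - w_bkg) / pairwise_ratio(x)

# Expose the Python callable to RooFit as a function of the observable and the signal strength
nll_ratio = ROOT.RooFit.bindFunction("nll_ratio", learned_likelihood_ratio, m4l, mu_var)

# Expected total yield as a function of mu, used to extend the pdf below (construction inferred)
n_pred = ROOT.RooFormulaVar("n_pred", f"{n_back} + @0 * {n_signal}", ROOT.RooArgList(mu_var))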
pdf_learned = ROOT.RooWrapperPdf("learned_pdf", "learned_pdf", nll_ratio, selfNormalized=True)
frame1 = m4l.frame(Title="Likelihood ratio r(m_{4l}|#mu=1);m_{4l};p(#mu=1)/p(#mu=0)", Range=(80, 170))
pdf_learned_extended = ROOT.RooExtendPdf("final_pdf", "final_pdf", pdf_learned, n_pred)
frame2 = mu_var.frame(Title="NLL sum;#mu (signal strength);#Delta NLL", Range=(0.5, 4))

# The creation of the NLL and the minimizer is not part of the excerpt; the lines below are
# inferred, assuming a RooDataSet named obs_data that holds the observed m4l values.
nll = pdf_learned_extended.createNLL(obs_data, Extended=True)
minimizer = ROOT.RooMinimizer(nll)
minimizer.minimize("Minuit2")
minimizer.save().Print()

nll.plotOn(frame2, ShiftToZero=True, LineColor="kP6Blue")
# Draw both frames, either side by side on one canvas or on two separate canvases.
single_canvas = True

c = ROOT.TCanvas("", "", 1200 if single_canvas else 600, 600)

# The drawing statements and branch bodies below are not part of the excerpt; they are inferred.
if single_canvas:
    c.Divide(2)
    c.cd(1)
frame1.Draw()

if single_canvas:
    c.cd(2)
else:
    c = ROOT.TCanvas("", "", 600, 600)
frame2.Draw()

if not single_canvas:
    c.Update()
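# Delete the RooFit objects explicitly, presumably to enforce a well-defined destruction
# order before the Python interpreter shuts down.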
del minimizer
del nll
del pdf_learned_extended
del n_pred
del llh
del nll_ratio
[#1] INFO:Fitting -- RooAbsPdf::fitTo(final_pdf) fixing normalization set for coefficient determination to observables in data
[#1] INFO:Fitting -- using CPU computation library compiled with -mavx512
[#1] INFO:Fitting -- RooAddition::defaultErrorLevel(nll_final_pdf_) Summation contains a RooNLLVar, using its error level
RooFitResult: minimized FCN value: -1538.41, estimated distance to minimum: 1.10146e-09
              covariance matrix quality: Full, accurate covariance matrix
              Status : MINIMIZE=0

    Floating Parameter    FinalValue +/-  Error
  --------------------  --------------------------
                    mu    1.7130e+00 +/-  5.93e-01
- Date: September 2024
- Author: Robin Syring
Definition in file rf618_mixture_models.py.