Multidimensional Fits in ROOT
Overview
A common problem encountered in different fields of applied science is to find an expression for one physical quantity in terms of several others, which are directly measurable.
An example in high energy physics is the evaluation of the momentum of a charged particle from the observation of its trajectory in a magnetic field. The problem is to relate the momentum of the particle to the observations, which may consist of positional measurements at intervals along the particle trajectory.
The exact functional relationship between the measured quantities (e.g., the space-points) and the dependent quantity (e.g., the momentum) is in general not known, but one possible way of solving the problem is to find an expression which reliably approximates the dependence of the momentum on the observations.
This explicit function of the observations can be obtained by a least squares fitting procedure applied to a representative sample of the data, for which the dependent quantity (e.g., momentum) and the independent observations are known. The function can then be used to compute the quantity of interest for new observations of the independent variables.
The class TMultiDimFit implements such a procedure in ROOT. It is largely based on the CERNLIB MUDIFI package [2]. Though the basic concepts are still sound, and therefore kept, a few implementation details have changed, and this class can take advantage of MINUIT [4] to improve the errors of the fit, thanks to the class TMinuit.
In [5] and [6] H. Wind demonstrates the utility of this procedure in the context of tracking, magnetic field parameterisation, and so on. The outline of the method used in this class is based on Wind's discussion, and the reader is referred to these two excellent texts for more information.
An example of usage is given in multidimfit.C.
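The following is a minimal sketch of typical usage (the variable count, powers, cuts, and toy data are illustrative assumptions, not taken from this page; the method names are those referenced throughout the text):

```cpp
// Minimal usage sketch: build a training sample and find a parameterization.
#include "TMultiDimFit.h"
#include "TRandom.h"

void sketch()
{
   const Int_t nVars = 2;
   TMultiDimFit* fit = new TMultiDimFit(nVars, TMultiDimFit::kMonomials);

   Int_t maxPowers[nVars] = {4, 4};   // maximum power of each variable
   fit->SetMaxPowers(maxPowers);
   fit->SetMaxTerms(10);              // upper limit L_max on the number of terms
   fit->SetPowerLimit(2);             // power control limit Q
   fit->SetMinRelativeError(0.01);    // epsilon

   // Fill the training sample with M rows (x_j, D_j, E_j); here a toy
   // dependence D = 1 + 2 x_1 + x_1 x_2 with some Gaussian noise.
   Double_t x[nVars];
   for (Int_t j = 0; j < 500; j++) {
      x[0] = gRandom->Uniform(-1, 1);
      x[1] = gRandom->Uniform(-1, 1);
      Double_t d = 1 + 2 * x[0] + x[0] * x[1] + gRandom->Gaus(0, 0.05);
      fit->AddRow(x, d, 0.05 * 0.05); // E_j is the square error in D_j
   }

   fit->FindParameterization();       // select the F_l and compute the c_l
   fit->Print("p");                   // print parameters of the fit
}
```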
The Method
Let \( D \) be the dependent quantity of interest, which depends smoothly on the observable quantities \( x_1, \ldots, x_N \), which we'll denote by \(\mathbf{x}\). Given a training sample of \( M\) tuples of the form (TMultiDimFit::AddRow)
\[ \left(\mathbf{x}_j, D_j, E_j\right)\quad, \]
where \(\mathbf{x}_j = (x_{1,j},\ldots,x_{N,j})\) are the \( N\) independent variables, \( D_j\) is the known value of the dependent quantity at \(\mathbf{x}_j\), and \( E_j\) is the square error in \( D_j\), the class will try to find the parameterization
\[ D_p(\mathbf{x}) = \sum_{l=1}^{L} c_l \prod_{i=1}^{N} p_{li}\left(x_i\right) = \sum_{l=1}^{L} c_l F_l(\mathbf{x}) \]
such that
\[ S \equiv \sum_{j=1}^{M} \left(D_j - D_p\left(\mathbf{x}_j\right)\right)^2 \]
is minimal. Here \(p_{li}(x_i)\) are monomials, or Chebyshev or Legendre polynomials, labelled \(l = 1, \ldots, L\), in each variable \( x_i\), \( i=1, \ldots, N\).
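For illustration (an example, not from the original text): with \( N=2\) variables and monomials, a parameterization with \( L=3\) terms could read
\[ D_p(x_1,x_2) = c_1 + c_2\, x_1 + c_3\, x_1 x_2^2\quad, \]
i.e. \( F_1 = 1\), \( F_2 = x_1\), and \( F_3 = x_1 x_2^2\), each term being a product of one polynomial per variable.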
What TMultiDimFit does is determine the number of terms \( L\), the \( L\) terms (or functions) \( F_l\), and the \( L\) coefficients \( c_l\), so that \( S\) is minimal (TMultiDimFit::FindParameterization).
Of course it's more than a little unlikely that \( S\) will ever become exactly zero as a result of the procedure outlined below. Therefore, the user is asked to provide a minimum relative error \( \epsilon\) (TMultiDimFit::SetMinRelativeError), and \( S\) will be considered minimized when
\[ R = \frac{S}{\sum_{j=1}^M D_j^2} < \epsilon \]
Optionally, the user may impose a functional expression by specifying the powers of each variable in \( L\) specified functions \( F_1, \ldots,F_L\) (TMultiDimFit::SetPowers). In that case, only the coefficients \( c_l\) are calculated by the class.
Limiting the Number of Terms
As always when dealing with fits, there's a real chance of overfitting. As is well known, it's always possible to fit a polynomial of degree \( N-1\) in \( x\) to \( N\) points \( (x,y)\) with \(\chi^2 = 0\), but the polynomial is not likely to fit new data at all [1]. Therefore, the user is asked to provide an upper limit, \( L_{max}\), to the number of terms in \( D_p\) (TMultiDimFit::SetMaxTerms).
However, since there's an infinite number of \( F_l\) to choose from, the user is also asked to give the maximum power, \( P_{max,i}\), of each variable \( x_i\) to be considered in the minimization of \( S\) (TMultiDimFit::SetMaxPowers).
One way of obtaining values for the maximum power in variable \( i\) is to perform a regular fit to the dependent quantity \( D\), using a polynomial only in \( x_i\). The maximum power \( P_{max,i}\) is then the power that does not significantly improve the one-dimensional least-squares fit over \( x_i\) to \( D\) [5].
There is still a huge number of possible choices for \( F_l\); in fact there are \(\prod_{i=1}^{N} (P_{max,i} + 1)\) possible choices. Obviously we need to limit this. To this end, the user is asked to set a power control limit, \( Q\) (TMultiDimFit::SetPowerLimit), and a function \( F_l\) is only accepted if
\[ Q_l = \sum_{i=1}^{N} \frac{P_{li}}{P_{max,i}} < Q \]
where \( P_{li}\) is the leading power of variable \( x_i\) in function \( F_l\) (TMultiDimFit::MakeCandidates). So the number of functions increases with \( Q\) (a value of 1 or 2 is fine; 5 is way out).
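As a numeric illustration (not from the original text): with \( N=2\) and \( P_{max,1} = P_{max,2} = 4\), the candidate \( F_l = x_1^3 x_2^2\) has
\[ Q_l = \frac{3}{4} + \frac{2}{4} = 1.25\quad, \]
so it enters the candidate list only if the user's power control limit satisfies \( Q > 1.25\).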
Gram-Schmidt Orthogonalisation
To further reduce the number of functions in the final expression, only those functions that significantly reduce \( S\) are chosen. What 'significant' means is chosen by the user and is discussed below (see the two selection tests in the section on function selection).
The functions \( F_l\) are generally not orthogonal, which means one would have to evaluate all possible \( F_l\)'s over all data points before finding the most significant [1]. We can, however, do better than that. By applying the modified Gram-Schmidt orthogonalisation algorithm [5][3] to the functions \( F_l\), we can evaluate the contribution to the reduction of \( S\) from each function in turn, and we may delay the actual inversion of the curvature matrix (TMultiDimFit::MakeGramSchmidt).
So we are led to consider an \( M\times L\) matrix \(\mathsf{F}\), an element of which is given by
\[ f_{jl} = F_l\left(x_{1,j}, x_{2,j}, \ldots, x_{N,j}\right) = F_l(\mathbf{x}_j)\quad\mbox{with}~j=1,2,\ldots,M, \]
where \( j\) labels the \( M\) rows in the training sample and \( l\) labels \( L\) functions of \( N\) variables, and \( L \leq M\). That is, \( f_{jl}\) is the term (or function) numbered \( l\) evaluated at the data point \( j\). We have to normalise \(\mathbf{x}_j\) to \( [-1,1]\) for this to succeed [5] (TMultiDimFit::MakeNormalized). We then define a matrix \(\mathsf{W}\) of which the columns \(\mathbf{w}_j\) are given by
\begin{eqnarray*} \mathbf{w}_1 &=& \mathbf{f}_1 = F_1\left(\mathbf x_1\right)\\ \mathbf{w}_l &=& \mathbf{f}_l - \sum^{l-1}_{k=1} \frac{\mathbf{f}_l \bullet \mathbf{w}_k}{\mathbf{w}_k^2}\mathbf{w}_k\,. \end{eqnarray*}
and \(\mathbf{w}_{l}\) is the component of \(\mathbf{f}_{l} \) orthogonal to \(\mathbf{w}_{1}, \ldots, \mathbf{w}_{l-1}\). Hence we obtain [3],
\[ \mathbf{w}_k\bullet\mathbf{w}_l = 0\quad\mbox{if}~k \neq l\quad. \]
We now take as a new model \(\mathsf{W}\mathbf{a}\). We thus want to minimize
\[ S\equiv \left(\mathbf{D} - \mathsf{W}\mathbf{a}\right)^2\quad, \]
where \(\mathbf{D} = \left(D_1,\ldots,D_M\right)\) is the vector of the dependent quantity over the sample. Differentiation with respect to \( a_l\) gives, using the orthogonality of the \(\mathbf{w}_l\),
\[ \mathbf{D}\bullet\mathbf{w}_l - a_l\mathbf{w}_l^2 = 0 \]
or
\[ a_l = \frac{\mathbf{D}\bullet\mathbf{w}_l}{\mathbf{w}_l^2} \]
Let \( S_l\) be the sum of squares of residuals when taking \( l\) functions into account. Then
\[ S_l = \left[\mathbf{D} - \sum^l_{k=1} a_k\mathbf{w}_k\right]^2 = \mathbf{D}^2 - 2\,\mathbf{D}\bullet\sum^l_{k=1} a_k\mathbf{w}_k + \sum^l_{k=1} a_k^2\mathbf{w}_k^2 \]
Using the expression for \( a_l\) above, we see that
\begin{eqnarray*} S_l &=& \mathbf{D}^2 - 2 \sum^l_{k=1} a_k^2\mathbf{w}_k^2 + \sum^l_{k=1} a_k^2\mathbf{w}_k^2\\ &=& \mathbf{D}^2 - \sum^l_{k=1} a_k^2\mathbf{w}_k^2\\ &=& \mathbf{D}^2 - \sum^l_{k=1} \frac{\left(\mathbf D\bullet \mathbf w_k\right)^2}{\mathbf w_k^2} \end{eqnarray*}
So for each new function \( F_l\) included in the model, we get a reduction of the sum of squares of residuals of \(a_l^2\mathbf{w}_l^2\), where \(\mathbf{w}_l\) is given by the Gram-Schmidt recursion and \( a_l\) by the expression above. Thus, using the Gram-Schmidt orthogonalisation, we can decide whether we want to include a function in the final model before the matrix inversion.
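The following standalone sketch (plain C++, not the class's internal code; the sample numbers are made up) mirrors this bookkeeping: each candidate column \(\mathbf f_l\) is orthogonalised against the previously accepted \(\mathbf w_k\), and the resulting \( a_l\) and reduction \( \Delta S_l = a_l^2\mathbf w_l^2\) are computed before any matrix inversion.

```cpp
// Standalone sketch of the modified Gram-Schmidt step and the reduction
// of S contributed by each new function.
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

double Dot(const Vec &a, const Vec &b)
{
   double s = 0;
   for (size_t j = 0; j < a.size(); ++j) s += a[j] * b[j];
   return s;
}

int main()
{
   // Toy training sample: M = 4 points, two candidate functions evaluated
   // at those points (the columns f_l), and the data vector D.
   std::vector<Vec> f = {{1, 1, 1, 1}, {-1, -0.5, 0.5, 1}};
   Vec D = {0.1, 0.4, 0.9, 1.4};

   std::vector<Vec> w; // orthogonalised columns w_l
   for (size_t l = 0; l < f.size(); ++l) {
      Vec wl = f[l];
      for (size_t k = 0; k < w.size(); ++k) {
         // Modified form: project out the running vector; in exact
         // arithmetic this equals f_l . w_k / w_k^2 from the recursion.
         double proj = Dot(wl, w[k]) / Dot(w[k], w[k]);
         for (size_t j = 0; j < wl.size(); ++j) wl[j] -= proj * w[k][j];
      }
      double wl2 = Dot(wl, wl);
      double al  = Dot(D, wl) / wl2; // a_l = D . w_l / w_l^2
      double dS  = al * al * wl2;    // reduction of S from this function
      std::printf("l=%zu  a_l=%g  dS_l=%g\n", l + 1, al, dS);
      w.push_back(wl);
   }
   return 0;
}
```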
Function Selection Based on Residual
Supposing that \( L-1\) steps of the procedure have been performed, the problem now is to consider the \(L^{\mbox{th}}\) function.
The sum of squares of residuals can be written as
\[ S_L = \textbf{D}^T\bullet\textbf{D} - \sum^L_{l=1}a^2_l\left(\textbf{w}_l^T\bullet\textbf{w}_l\right) \]
where the expression for \( a_l\) above has been taken into account. The contribution of the \(L^{\mbox{th}}\) function to the reduction of \( S\) is given by
\[ \Delta S_L = a^2_L\left(\textbf{w}_L^T\bullet\textbf{w}_L\right) \]
Two tests are now applied to decide whether this \(L^{\mbox{th}}\) function is to be included in the final expression or not.
Test 1
Denoting by \( H_{L-1}\) the subspace spanned by \(\textbf{w}_1,\ldots,\textbf{w}_{L-1}\), the function \(\textbf{w}_L\) is by construction (see the Gram-Schmidt recursion above) the projection of the function \( F_L\) onto the direction perpendicular to \( H_{L-1}\). Now, if the length of \(\textbf{w}_L\) (given by \(\textbf{w}_L\bullet\textbf{w}_L\)) is very small compared to the length of \(\textbf{f}_L\), this new function cannot contribute much to the reduction of the sum of squares of residuals. The test then consists in calculating the angle \( \theta \) between the two vectors \(\textbf{w}_L\) and \(\textbf{f}_L\) (see also figure 1) and requiring that it be greater than a threshold value which the user must set (TMultiDimFit::SetMinAngle).
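For reference (a small derivation, not in the original text): since by construction \(\textbf{f}_L\bullet\textbf{w}_L = \textbf{w}_L\bullet\textbf{w}_L\), the angle is fixed by the ratio of the two lengths,
\[ \cos\theta = \frac{\textbf{f}_L\bullet\textbf{w}_L}{\left|\textbf{f}_L\right|\left|\textbf{w}_L\right|} = \frac{\left|\textbf{w}_L\right|}{\left|\textbf{f}_L\right|}\quad. \]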
Figure 1: (a) angle \(\theta\) between \(\textbf{w}_L\) and \(\textbf{f}_L\); (b) angle \( \phi \) between \(\textbf{w}_L\) and \(\textbf{D}\)
Test 2
Let \(\textbf{D}\) be the data vector to be fitted. As illustrated in figure 1, the \(L^{\mbox{th}}\) function \(\textbf{w}_L\) will contribute significantly to the reduction of \( S\) if the angle \(\phi^\prime\) between \(\textbf{w}_L\) and \(\textbf{D}\) is smaller than an upper limit \( \phi \), defined by the user (TMultiDimFit::SetMaxAngle).
However, the method automatically readjusts the value of this angle while the fit is in progress, in order to make the selection criterion less and less difficult to fulfil. The result is that the functions contributing most to the reduction of \( S\) are chosen first (TMultiDimFit::TestFunction).
In case \( \phi \) isn't defined, an alternative method of performing this second test is used: the \(L^{\mbox{th}}\) function \(\textbf{f}_L\) is accepted if (refer also to the expression for \(\Delta S_L\) above)
\[ \Delta S_L > \frac{S_{L-1}}{L_{max}-L} \]
where \( S_{L-1}\) is the sum of squares of residuals left by the first \( L-1\) accepted functions, and \( L_{max}\) is the total number of functions allowed in the final expression of the fit (defined by the user).
From this we see that by restricting \( L_{max}\) – the number of terms in the final model – the fit is more difficult to perform, since the above selection criterion becomes more limiting.
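As a numeric illustration (not from the original text): with \( L_{max} = 10\), the sixth candidate function is accepted only if
\[ \Delta S_6 > \frac{S_5}{10 - 6} = \frac{S_5}{4}\quad, \]
i.e. it must remove more than a quarter of the residual sum left by the five functions already accepted.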
The more coefficients we evaluate, the more the sum of squares of residuals \( S\) will be reduced. We can evaluate \( S\) before inverting \(\mathsf{B}\) as shown below.
Coefficients and Coefficient Errors
Having found a parameterization, that is, the \( F_l\)'s and \( L\), that minimizes \( S\), we still need to determine the coefficients \( c_l\). However, it's a feature of how we choose the significant functions that the evaluation of the \( c_l\)'s becomes trivial [5]. To derive \(\mathbf{c}\), we first note that the Gram-Schmidt relation above can be written as
\[ \mathsf{F} = \mathsf{W}\mathsf{B} \]
where
\[ b_{ij} = \begin{cases} \frac{\mathbf{f}_j \bullet \mathbf{w}_i}{\mathbf{w}_i^2} & \mbox{if}~i < j\\ 1 & \mbox{if}~i = j\\ 0 & \mbox{if}~i > j \end{cases} \]
Consequently, \(\mathsf{B}\) is an upper triangular matrix, which can be readily inverted. So we now evaluate
\[ \mathsf{F}\mathsf{B}^{-1} = \mathsf{W} \]
The model \(\mathsf{W}\mathbf{a}\) can therefore be written as \((\mathsf{F}\mathsf{B}^{-1})\mathbf{a} = \mathsf{F}(\mathsf{B}^{-1}\mathbf{a})\,.\)
The original model \(\mathsf{F}\mathbf{c}\) is therefore identical with this if
\[ \mathbf{c} = \left(\mathsf{B}^{-1}\mathbf{a}\right) = \left[\mathbf{a}^T\left(\mathsf{B}^{-1}\right)^T\right]^T\,. \]
The reason we use \(\left(\mathsf{B}^{-1}\right)^T\) rather than \(\mathsf{B}^{-1}\) is to save storage, since \(\left(\mathsf{B}^{-1}\right)^T\) can be stored in the same matrix as \(\mathsf{B}\) (TMultiDimFit::MakeCoefficients). The errors in the coefficients are calculated by inverting the curvature matrix of the non-orthogonal functions \( f_{lj}\) [1] (TMultiDimFit::MakeCoefficientErrors).
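Since \(\mathsf{B}\) has unit diagonal, \(\mathbf{c}\) can also be obtained from \(\mathsf{B}\mathbf{c} = \mathbf{a}\) by simple back substitution, as in this standalone sketch (plain C++, not the class's internal code; the numbers are made up):

```cpp
// Recover the coefficients c from the orthogonal-model coefficients a by
// solving B c = a, where B is upper triangular with unit diagonal.
#include <cstdio>
#include <vector>

int main()
{
   const int L = 3;
   // b[i][j] for i <= j; unit diagonal, as in the definition above.
   double b[L][L] = {{1, 0.5, 0.25},
                     {0, 1.0, 0.50},
                     {0, 0.0, 1.00}};
   std::vector<double> a = {2.0, 1.0, 0.5};
   std::vector<double> c(L);

   // Back substitution: c_i = a_i - sum_{j>i} b_ij c_j.
   for (int i = L - 1; i >= 0; --i) {
      c[i] = a[i];
      for (int j = i + 1; j < L; ++j) c[i] -= b[i][j] * c[j];
   }
   for (int i = 0; i < L; ++i) std::printf("c_%d = %g\n", i + 1, c[i]);
   return 0;
}
```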
Considerations
It's important to realize that the training sample should be representative of the problem at hand, in particular along the borders of the region of interest. This is because the algorithm presented here is an interpolation, rather than an extrapolation, procedure [5].
Also, the independent variables \( x_{i}\) need to be linearly independent, since the procedure will perform poorly if they are not [5]. One can find a linear transformation from one's original variables \( \xi_{i}\) to a set of linearly independent variables \( x_{i}\), using a Principal Components Analysis (see TPrincipal), and then use the transformed variables as input to this class [5][6].
H. Wind also outlines a method for parameterising a multidimensional dependence over a multidimensional set of variables. An example of the method from [5] is as follows (please refer to [5] for a full discussion; a schematic code sketch follows the two lists below):
- Let \(\mathbf{P} = (P_1, \ldots, P_5)\) be the five dependent quantities that define a track.
- Compute, for \( M\) different values of \(\mathbf{P}\), the tracks through the magnetic field, and determine the corresponding \(\mathbf{x} = (x_1, \ldots, x_N)\).
- Use the simulated observations to determine, with a simple approximation, the values of \(\mathbf{P}_j\). We call these values \(\mathbf{P}^\prime_j, j = 1, \ldots, M\).
- Determine from \(\mathbf{x}\) a set of at least five relevant coordinates \(\mathbf{x}^\prime\), using constraints, or alternatively:
- Perform a Principal Component Analysis (using TPrincipal), and use it to get a linear transformation \(\mathbf{x} \rightarrow \mathbf{x}^\prime\), so that \(\mathbf{x}^\prime\) are constrained and linearly independent.
- Perform a Principal Component Analysis on \(Q_i = P_i / P^\prime_i\), \( i = 1, \ldots, 5\), to get quantities \(\mathbf{Q}^\prime\) that are linearly independent among themselves (but not independent of \(\mathbf{x}\)).
- For each component \(Q^\prime_i\) make a multidimensional fit, using \(\mathbf{x}^\prime\) as the variables, thus determining a set of coefficients \(\mathbf{c}_i\).
To process data using this parameterisation:
- Test whether the observation \(\mathbf{x}\) is within the domain of the parameterization, using the result from the Principal Component Analysis.
- Determine \(\mathbf{P}^\prime\) as before.
- Determine \(\mathbf{x}^\prime\) as before.
- Use the result of the fit to determine \(\mathbf{Q}^\prime\).
- Transform back to \(\mathbf{P}\) from \(\mathbf{Q}^\prime\), using the result from the Principal Component Analysis.
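A schematic sketch of the decorrelation-and-fit steps above (assuming only the TPrincipal and TMultiDimFit methods named in this text; the dimensions, toy data, and toy dependence are placeholders):

```cpp
// Decorrelate the input variables with TPrincipal, then fit the
// transformed variables with TMultiDimFit.
#include "TPrincipal.h"
#include "TMultiDimFit.h"
#include "TRandom.h"

void pcaFitSketch()
{
   const Int_t nVars = 3;        // illustrative
   TPrincipal pca(nVars, "ND");  // "N": normalize, "D": store data
   TMultiDimFit fit(nVars, TMultiDimFit::kMonomials);
   Int_t maxPowers[nVars] = {3, 3, 3};
   fit.SetMaxPowers(maxPowers);
   fit.SetMaxTerms(8);
   fit.SetPowerLimit(1);

   // First pass: find the principal axes of the (correlated) variables.
   Double_t x[nVars], xp[nVars];
   for (Int_t j = 0; j < 1000; j++) {
      x[0] = gRandom->Gaus(0, 1);
      x[1] = x[0] + 0.1 * gRandom->Gaus(0, 1); // strongly correlated with x[0]
      x[2] = gRandom->Uniform(-1, 1);
      pca.AddRow(x);
   }
   pca.MakePrincipals();

   // Second pass: feed the decorrelated variables x' to the fit.  The
   // dependent quantity D is a toy function standing in for a Q'_i.
   for (Int_t j = 0; j < 1000; j++) {
      x[0] = gRandom->Gaus(0, 1);
      x[1] = x[0] + 0.1 * gRandom->Gaus(0, 1);
      x[2] = gRandom->Uniform(-1, 1);
      pca.X2P(x, xp);            // x -> x' (principal components)
      Double_t d = 1 + xp[0] + 0.5 * xp[0] * xp[2];
      fit.AddRow(xp, d);
   }
   fit.FindParameterization();
}
```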
Testing the parameterization
The class also provides functionality for testing the parameterization found over the training sample (TMultiDimFit::Fit). This is done by passing the class a test sample of \( M_t\) tuples of the form \((\mathbf{x}_{t,j},D_{t,j}, E_{t,j})\), where \(\mathbf{x}_{t,j}\) are the independent variables, \( D_{t,j}\) is the known, dependent quantity, and \( E_{t,j}\) is the square error in \( D_{t,j}\) (TMultiDimFit::AddTestRow).
The parameterization is then evaluated at every \(\mathbf{x}_t\) in the test sample, and
\[ S_t \equiv \sum_{j=1}^{M_t} \left(D_{t,j} - D_p\left(\mathbf{x}_{t,j}\right)\right)^2 \]
is evaluated. The relative error over the test sample
\[ R_t = \frac{S_t}{\sum_{j=1}^{M_t} D_{t,j}^2} \]
should not be too low or too high compared to \( R\) from the training sample. Also, the multiple correlation coefficients from both samples should be fairly close; otherwise, one of the samples is not representative of the problem. A large difference in the reduced \( \chi^2\) over the two samples indicates an overfit, and the maximum number of terms in the parameterisation should be reduced.
It's possible to use MINUIT [4] to further improve the fit, using the test sample.
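Continuing the usage sketch from the overview (inside the same macro; the toy dependence matches the training sample there, and the "M" option is the MINUIT refinement as used in the multidimfit.C tutorial):

```cpp
// Add an independent test sample, then evaluate and refine the fit.
Double_t xt[2];
for (Int_t j = 0; j < 100; j++) {
   xt[0] = gRandom->Uniform(-1, 1);
   xt[1] = gRandom->Uniform(-1, 1);
   Double_t d = 1 + 2 * xt[0] + xt[0] * xt[1] + gRandom->Gaus(0, 0.05);
   fit->AddTestRow(xt, d, 0.05 * 0.05);
}
fit->Fit("M");               // "M": improve the result using MINUIT [4]
Double_t dp = fit->Eval(xt); // evaluate D_p at the last test point
```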
Christian Holm
Bibliography
- Philip R. Bevington and D. Keith Robinson. Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill, 2nd edition, 1992.
- R. Brun et al. MUDIFI. Long writeup DD/75-23, CERN, 1980.
- Gene H. Golub and Charles F. van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 3rd edition, 1996.
- F. James. MINUIT. Long writeup D506, CERN, 1998.
- H. Wind. Function Parameterization. In Proceedings of the 1972 CERN Computing and Data Processing School, volume 72-21 of Yellow Reports. CERN, 1972.
- H. Wind. 1. Principal Component Analysis, 2. Pattern Recognition for Track Finding, 3. Interpolation and Functional Representation. Yellow report EP/81-12, CERN, 1981.
Definition at line 15 of file TMultiDimFit.h.