ROOT logo
ROOT » HIST » HIST » TPrincipal

class TPrincipal: public TNamed


>


Principal Components Analysis (PCA)

The current implementation is based on the LINTRA package from CERNLIB by R. Brun, H. Hansroul, and J. Kubler. The class has been implemented by Christian Holm Christensen in August 2000.


Introduction

In many applications of various fields of research, the treatment of large amounts of data requires powerful techniques capable of rapid data reduction and analysis. Usually, the quantities most conveniently measured by the experimentalist, are not necessarily the most significant for classification and analysis of the data. It is then useful to have a way of selecting an optimal set of variables necessary for the recognition process and reducing the dimensionality of the problem, resulting in an easier classification procedure.

This paper describes the implementation of one such method of feature selection, namely the principal components analysis. This multidimensional technique is well known in the field of pattern recognition and and its use in Particle Physics has been documented elsewhere (cf. H. Wind, Function Parameterization, CERN 72-21).


Overview

Suppose we have prototypes which are trajectories of particles, passing through a spectrometer. If one measures the passage of the particle at say 8 fixed planes, the trajectory is described by an 8-component vector:

\begin{displaymath}
\mathbf{x} = \left(x_0, x_1, \ldots, x_7\right)
\end{displaymath}

in 8-dimensional pattern space.

One proceeds by generating a a representative tracks sample and building up the covariance matrix $\mathsf{C}$. Its eigenvectors and eigenvalues are computed by standard methods, and thus a new basis is obtained for the original 8-dimensional space the expansion of the prototypes,

\begin{displaymath}
\mathbf{x}_m = \sum^7_{i=0} a_{m_i} \mathbf{e}_i
\quad
\mbox{where}
\quad
a_{m_i} = \mathbf{x}^T\bullet\mathbf{e}_i
\end{displaymath}

allows the study of the behavior of the coefficients $a_{m_i}$ for all the tracks of the sample. The eigenvectors which are insignificant for the trajectory description in the expansion will have their corresponding coefficients $a_{m_i}$ close to zero for all the prototypes.

On one hand, a reduction of the dimensionality is then obtained by omitting these least significant vectors in the subsequent analysis.

On the other hand, in the analysis of real data, these least significant variables(?) can be used for the pattern recognition problem of extracting the valid combinations of coordinates describing a true trajectory from the set of all possible wrong combinations.

The program described here performs this principal components analysis on a sample of data provided by the user. It computes the covariance matrix, its eigenvalues ands corresponding eigenvectors and exhibits the behavior of the principal components ($a_{m_i}$), thus providing to the user all the means of understanding his data.


Principal Components Method

Let's consider a sample of $M$ prototypes each being characterized by $P$ variables $x_0, x_1, \ldots, x_{P-1}$. Each prototype is a point, or a column vector, in a $P$-dimensional pattern space.

\begin{displaymath}
\mathbf{x} = \left[\begin{array}{c}
x_0\\ x_1\\ \vdots\\ x_{P-1}\end{array}\right]\,,
\end{displaymath} (1)

where each $x_n$ represents the particular value associated with the $n$-dimension.

Those $P$ variables are the quantities accessible to the experimentalist, but are not necessarily the most significant for the classification purpose.

The Principal Components Method consists of applying a linear transformation to the original variables. This transformation is described by an orthogonal matrix and is equivalent to a rotation of the original pattern space into a new set of coordinate vectors, which hopefully provide easier feature identification and dimensionality reduction.

Let's define the covariance matrix:

\begin{displaymath}
\mathsf{C} = \left\langle\mathbf{y}\mathbf{y}^T\right\rangl...
...athbf{y} = \mathbf{x} - \left\langle\mathbf{x}\right\rangle\,,
\end{displaymath} (2)

and the brackets indicate mean value over the sample of $M$ prototypes.

This matrix $\mathsf{C}$ is real, positive definite, symmetric, and will have all its eigenvalues greater then zero. It will now be show that among the family of all the complete orthonormal bases of the pattern space, the base formed by the eigenvectors of the covariance matrix and belonging to the largest eigenvalues, corresponds to the most significant features of the description of the original prototypes.

let the prototypes be expanded on into a set of $N$ basis vectors $\mathbf{e}_n, n=0,\ldots,N,N+1, \ldots, P-1$,

\begin{displaymath}
\mathbf{y}_i = \sum^N_{i=0} a_{i_n} \mathbf{e}_n,
\quad
i = 0, \ldots, M,
\quad
N < P-1
\end{displaymath} (3)

The `best' feature coordinates $\mathbf{e}_n$, spanning a feature space, will be obtained by minimizing the error due to this truncated expansion, i.e.,

\begin{displaymath}
\min\left(E_N\right) =
\min\left[\left\langle\left(\mathb...
...\sum^N_{i=0} a_{i_n} \mathbf{e}_n\right)^2\right\rangle\right]
\end{displaymath} (4)

with the conditions:
\begin{displaymath}
\mathbf{e}_k\bullet\mathbf{e}_j = \delta_{jk} =
\left\{\b...
...for} & k = j\\
0 & \mbox{for} & k \neq j
\end{array}\right.
\end{displaymath} (5)

Multiplying (3) by $\mathbf{e}^T_n$ using (5), we get

\begin{displaymath}
a_{i_n} = \mathbf{y}_i^T\bullet\mathbf{e}_n\,,
\end{displaymath} (6)

so the error becomes
$\displaystyle E_N$ $\textstyle =$ $\displaystyle \left\langle\left[\sum_{n=N+1}^{P-1} a_{i_n}\mathbf{e}_n\right]^2\right\rangle$  
  $\textstyle =$ $\displaystyle \left\langle\left[\sum_{n=N+1}^{P-1} \mathbf{y}_i^T\bullet\mathbf{e}_n\mathbf{e}_n\right]^2\right\rangle$  
  $\textstyle =$ $\displaystyle \left\langle\sum_{n=N+1}^{P-1} \mathbf{e}_n^T\mathbf{y}_i\mathbf{y}_i^T\mathbf{e}_n\right\rangle$  
  $\textstyle =$ $\displaystyle \sum_{n=N+1}^{P-1} \mathbf{e}_n^T\mathsf{C}\mathbf{e}_n$ (7)

The minimization of the sum in (7) is obtained when each term $\mathbf{e}_n^\mathsf{C}\mathbf{e}_n$ is minimum, since $\mathsf{C}$ is positive definite. By the method of Lagrange multipliers, and the condition (5), we get


\begin{displaymath}
E_N = \sum^{P-1}_{n=N+1} \left(\mathbf{e}_n^T\mathsf{C}\mathbf{e}_n -
l_n\mathbf{e}_n^T\bullet\mathbf{e}_n + l_n\right)
\end{displaymath} (8)

The minimum condition $\frac{dE_N}{d\mathbf{e}^T_n} = 0$ leads to the equation
\begin{displaymath}
\mathsf{C}\mathbf{e}_n = l_n\mathbf{e}_n\,,
\end{displaymath} (9)

which shows that $\mathbf{e}_n$ is an eigenvector of the covariance matrix $\mathsf{C}$ with eigenvalue $l_n$. The estimated minimum error is then given by
\begin{displaymath}
E_N \sim \sum^{P-1}_{n=N+1} \mathbf{e}_n^T\bullet l_n\mathbf{e}_n
= \sum^{P-1}_{n=N+1} l_n\,,
\end{displaymath} (10)

where $l_n,\,n=N+1,\ldots,P-1$ are the eigenvalues associated with the omitted eigenvectors in the expansion (3). Thus, by choosing the $N$ largest eigenvalues, and their associated eigenvectors, the error $E_N$ is minimized.

The transformation matrix to go from the pattern space to the feature space consists of the ordered eigenvectors $\mathbf{e}_0,\ldots,\mathbf{e}_{P-1}$ for its columns

\begin{displaymath}
\mathsf{T} = \left[
\begin{array}{cccc}
\mathbf{e}_0 &
\...
...bf{e}_{1_{P-1}} & \cdots & \mathbf{e}_{{P-1}_{P-1}}\\
\end{array}\right]
\end{displaymath} (11)

This is an orthogonal transformation, or rotation, of the pattern space and feature selection results in ignoring certain coordinates in the transformed space.

Christian Holm
August 2000, CERN

Function Members (Methods)

public:
TPrincipal()
TPrincipal(Int_t nVariables, Option_t* opt = "ND")
virtual~TPrincipal()
voidTObject::AbstractMethod(const char* method) const
virtual voidAddRow(const Double_t* x)
virtual voidTObject::AppendPad(Option_t* option = "")
virtual voidBrowse(TBrowser* b)
static TClass*Class()
virtual const char*TObject::ClassName() const
virtual voidClear(Option_t* option = "")
virtual TObject*TNamed::Clone(const char* newname = "") const
virtual Int_tTNamed::Compare(const TObject* obj) const
virtual voidTNamed::Copy(TObject& named) const
virtual voidTObject::Delete(Option_t* option = "")MENU
virtual Int_tTObject::DistancetoPrimitive(Int_t px, Int_t py)
virtual voidTObject::Draw(Option_t* option = "")
virtual voidTObject::DrawClass() constMENU
virtual TObject*TObject::DrawClone(Option_t* option = "") constMENU
virtual voidTObject::Dump() constMENU
virtual voidTObject::Error(const char* method, const char* msgfmt) const
virtual voidTObject::Execute(const char* method, const char* params, Int_t* error = 0)
virtual voidTObject::Execute(TMethod* method, TObjArray* params, Int_t* error = 0)
virtual voidTObject::ExecuteEvent(Int_t event, Int_t px, Int_t py)
virtual voidTObject::Fatal(const char* method, const char* msgfmt) const
virtual voidTNamed::FillBuffer(char*& buffer)
virtual TObject*TObject::FindObject(const char* name) const
virtual TObject*TObject::FindObject(const TObject* obj) const
const TMatrixD*GetCovarianceMatrix() const
virtual Option_t*TObject::GetDrawOption() const
static Long_tTObject::GetDtorOnly()
const TVectorD*GetEigenValues() const
const TMatrixD*GetEigenVectors() const
TList*GetHistograms() const
virtual const char*TObject::GetIconName() const
const TVectorD*GetMeanValues() const
virtual const char*TNamed::GetName() const
virtual char*TObject::GetObjectInfo(Int_t px, Int_t py) const
static Bool_tTObject::GetObjectStat()
virtual Option_t*TObject::GetOption() const
const Double_t*GetRow(Int_t row)
const TVectorD*GetSigmas() const
virtual const char*TNamed::GetTitle() const
virtual UInt_tTObject::GetUniqueID() const
const TVectorD*GetUserData() const
virtual Bool_tTObject::HandleTimer(TTimer* timer)
virtual ULong_tTNamed::Hash() const
virtual voidTObject::Info(const char* method, const char* msgfmt) const
virtual Bool_tTObject::InheritsFrom(const char* classname) const
virtual Bool_tTObject::InheritsFrom(const TClass* cl) const
virtual voidTObject::Inspect() constMENU
voidTObject::InvertBit(UInt_t f)
virtual TClass*IsA() const
virtual Bool_tTObject::IsEqual(const TObject* obj) const
virtual Bool_tIsFolder() const
Bool_tTObject::IsOnHeap() const
virtual Bool_tTNamed::IsSortable() const
Bool_tTObject::IsZombie() const
virtual voidTNamed::ls(Option_t* option = "") const
virtual voidMakeCode(const char* filename = "pca", Option_t* option = "")MENU
virtual voidMakeHistograms(const char* name = "pca", Option_t* option = "epsdx")MENU
virtual voidMakeMethods(const char* classname = "PCA", Option_t* option = "")MENU
virtual voidMakePrincipals()MENU
voidTObject::MayNotUse(const char* method) const
virtual Bool_tTObject::Notify()
static voidTObject::operator delete(void* ptr)
static voidTObject::operator delete(void* ptr, void* vp)
static voidTObject::operator delete[](void* ptr)
static voidTObject::operator delete[](void* ptr, void* vp)
void*TObject::operator new(size_t sz)
void*TObject::operator new(size_t sz, void* vp)
void*TObject::operator new[](size_t sz)
void*TObject::operator new[](size_t sz, void* vp)
virtual voidP2X(const Double_t* p, Double_t* x, Int_t nTest)
virtual voidTObject::Paint(Option_t* option = "")
virtual voidTObject::Pop()
virtual voidPrint(Option_t* opt = "MSE") constMENU
virtual Int_tTObject::Read(const char* name)
virtual voidTObject::RecursiveRemove(TObject* obj)
voidTObject::ResetBit(UInt_t f)
virtual voidTObject::SaveAs(const char* filename = "", Option_t* option = "") constMENU
virtual voidTObject::SavePrimitive(basic_ostream<char,char_traits<char> >& out, Option_t* option = "")
voidTObject::SetBit(UInt_t f)
voidTObject::SetBit(UInt_t f, Bool_t set)
virtual voidTObject::SetDrawOption(Option_t* option = "")MENU
static voidTObject::SetDtorOnly(void* obj)
virtual voidTNamed::SetName(const char* name)MENU
virtual voidTNamed::SetNameTitle(const char* name, const char* title)
static voidTObject::SetObjectStat(Bool_t stat)
virtual voidTNamed::SetTitle(const char* title = "")MENU
virtual voidTObject::SetUniqueID(UInt_t uid)
virtual voidShowMembers(TMemberInspector& insp)
virtual Int_tTNamed::Sizeof() const
virtual voidStreamer(TBuffer& b)
voidStreamerNVirtual(TBuffer& b)
virtual voidSumOfSquareResiduals(const Double_t* x, Double_t* s)
virtual voidTObject::SysError(const char* method, const char* msgfmt) const
voidTest(Option_t* option = "")MENU
Bool_tTObject::TestBit(UInt_t f) const
Int_tTObject::TestBits(UInt_t f) const
virtual voidTObject::UseCurrentStyle()
virtual voidTObject::Warning(const char* method, const char* msgfmt) const
virtual Int_tTObject::Write(const char* name = 0, Int_t option = 0, Int_t bufsize = 0)
virtual Int_tTObject::Write(const char* name = 0, Int_t option = 0, Int_t bufsize = 0) const
virtual voidX2P(const Double_t* x, Double_t* p)
protected:
TPrincipal(const TPrincipal&)
virtual voidTObject::DoError(int level, const char* location, const char* fmt, va_list va) const
voidMakeNormalised()
voidMakeRealCode(const char* filename, const char* prefix, Option_t* option = "")
voidTObject::MakeZombie()
TPrincipal&operator=(const TPrincipal&)

Data Members

protected:
TMatrixDfCovarianceMatrixCovariance matrix
TVectorDfEigenValuesEigenvalue vector of trans
TMatrixDfEigenVectorsEigenvector matrix of trans
TList*fHistogramsList of histograms
Bool_tfIsNormalisedNormalize matrix?
TVectorDfMeanValuesMean value over all data points
TStringTNamed::fNameobject identifier
Int_tfNumberOfDataPointsNumber of data points
Int_tfNumberOfVariablesNumber of variables
TVectorDfOffDiagonalelements of the tridiagonal
TVectorDfSigmasvector of sigmas
Bool_tfStoreDataShould we store input data?
TStringTNamed::fTitleobject title
Double_tfTraceTrace of covarience matrix
TVectorDfUserDataVector of original data points

Class Charts

Inheritance Inherited Members Includes Libraries
Class Charts

Function documentation

TPrincipal()
 Empty CTOR, Do not use.
TPrincipal(Int_t nVariables, Option_t* opt = "ND")
 Ctor. Argument is number of variables in the sample of data
 Options are:
   N       Normalize the covariance matrix (default)
   D       Store input data (default)

 The created object is  named "principal" by default.
TPrincipal(const TPrincipal& )
copy constructor
TPrincipal& operator=(const TPrincipal& )
assignement operator
~TPrincipal()
 destructor
void AddRow(const Double_t* x)
  /*
     >
Add a data point and update the covariance matrix. The input
array must be fNumberOfVariables long.

The Covariance matrix and mean values of the input data is caculated on the fly by the following equations:

\begin{displaymath}
\left<x_i\right>^{(0)} = x_{i0}
\end{displaymath}


\begin{displaymath}
\left<x_i\right>^{(n)} = \left<x_i\right>^{(n-1)}
+ \frac1n \left(x_{in} - \left<x_i\right>^{(n-1)}\right)
\end{displaymath}


\begin{displaymath}
C_{ij}^{(0)} = 0
\end{displaymath}


\begin{displaymath}
C_{ij}^{(n)} = C_{ij}^{(n-1)}
+ \frac1{n-1}\left[\left(x_{i...
...\left<x_j\right>^{(n)}\right)\right]
- \frac1n C_{ij}^{(n-1)}
\end{displaymath}

since this is a really fast method, with no rounding errors (please refer to CERN 72-21 pp. 54-106).

The data is stored internally in a TVectorD, in the following way:

\begin{displaymath}
\mathbf{x} = \left[\left(x_{0_0},\ldots,x_{{P-1}_0}\right),\ldots,
\left(x_{0_i},\ldots,x_{{P-1}_i}\right), \ldots\right]
\end{displaymath}

With $P$ as defined in the class description.
  */
 
void Browse(TBrowser* b)
 Browse the TPrincipal object in the TBrowser.
void Clear(Option_t* option = "")
 Clear the data in Object. Notice, that's not possible to change
 the dimension of the original data.
const Double_t * GetRow(Int_t row)
 Return a row of the user supplied data.
 If row is out of bounds, 0 is returned.
 It's up to the user to delete the returned array.
 Row 0 is the first row;
void MakeCode(const char* filename = "pca", Option_t* option = "")
 Generates the file <filename>, with .C appended if it does
 argument doesn't end in .cxx or .C.

 The file contains the implementation of two functions

    void X2P(Double_t *x, Double *p)
    void P2X(Double_t *p, Double *x, Int_t nTest)

 which does the same as  TPrincipal::X2P and TPrincipal::P2X
 respectively. Please refer to these methods.

 Further, the static variables:

    Int_t    gNVariables
    Double_t gEigenValues[]
    Double_t gEigenVectors[]
    Double_t gMeanValues[]
    Double_t gSigmaValues[]

 are initialized. The only ROOT header file needed is Rtypes.h

 See TPrincipal::MakeRealCode for a list of options
void MakeHistograms(const char* name = "pca", Option_t* option = "epsdx")
 Make histograms of the result of the analysis.
 The option string say which histograms to create
      X         Histogram original data
      P         Histogram principal components corresponding to
                original data
      D         Histogram the difference between the original data
                and the projection of principal unto a lower
                dimensional subspace (2D histograms)
      E         Histogram the eigenvalues
      S         Histogram the square of the residues
                (see TPrincipal::SumOfSquareResidues)
 The histograms will be named <name>_<type><number>, where <name>
 is the first argument, <type> is one of X,P,D,E,S, and <number>
 is the variable.
void MakeNormalised()
 PRIVATE METHOD: Normalize the covariance matrix
void MakeMethods(const char* classname = "PCA", Option_t* option = "")
 Generate the file <classname>PCA.cxx which contains the
 implementation of two methods:

    void <classname>::X2P(Double_t *x, Double *p)
    void <classname>::P2X(Double_t *p, Double *x, Int_t nTest)

 which does the same as  TPrincipal::X2P and TPrincipal::P2X
 respectivly. Please refer to these methods.

 Further, the public static members:

    Int_t    <classname>::fgNVariables
    Double_t <classname>::fgEigenValues[]
    Double_t <classname>::fgEigenVectors[]
    Double_t <classname>::fgMeanValues[]
    Double_t <classname>::fgSigmaValues[]

 are initialized, and assumed to exist. The class declaration is
 assumed to be in <classname>.h and assumed to be provided by the
 user.

 See TPrincipal::MakeRealCode for a list of options

 The minimal class definition is:

   class <classname> {
   public:
     static Int_t    fgNVariables;
     static Double_t fgEigenVectors[];
     static Double_t fgEigenValues[];
     static Double_t fgMeanValues[];
     static Double_t fgSigmaValues[];

     void X2P(Double_t *x, Double_t *p);
     void P2X(Double_t *p, Double_t *x, Int_t nTest);
   };

 Whether the methods <classname>::X2P and <classname>::P2X should
 be static or not, is up to the user.
void MakePrincipals()
 Perform the principal components analysis.
 This is done in several stages in the TMatrix::EigenVectors method:
 * Transform the covariance matrix into a tridiagonal matrix.
 * Find the eigenvalues and vectors of the tridiagonal matrix.
void MakeRealCode(const char* filename, const char* prefix, Option_t* option = "")
 PRIVATE METHOD:
 This is the method that actually generates the code for the
 transformations to and from feature space and pattern space
 It's called by TPrincipal::MakeCode and TPrincipal::MakeMethods.

 The options are: NONE so far
void P2X(const Double_t* p, Double_t* x, Int_t nTest)
 Calculate x as a function of nTest of the most significant
 principal components p, and return it in x.
 It's the users responsibility to make sure that both x and p are
 of the right size (i.e., memory must be allocated for x).
void Print(Option_t* opt = "MSE") const
 Print the statistics
 Options are
      M            Print mean values of original data
      S            Print sigma values of original data
      E            Print eigenvalues of covariance matrix
      V            Print eigenvectors of covariance matrix
 Default is MSE
void SumOfSquareResiduals(const Double_t* x, Double_t* s)
 PRIVATE METHOD:

/* > Calculates the sum of the square residuals, that is

\begin{displaymath}
    E_N = \sum_{i=0}^{P-1} \left(x_i - x^\prime_i\right)^2
    \end{displaymath}

where $x^\prime_i = \sum_{j=i}^N p_i e_{n_j}$, $p_i$ is the $i^{\mbox{th}}$ component of the principal vector, corresponding to $x_i$, the original data; I.e., the square distance to the space spanned by $N$ eigenvectors.
   */
 
void Test(Option_t* option = "")
 Test the PCA, bye calculating the sum square of residuals
 (see method SumOfSquareResiduals), and display the histogram
void X2P(const Double_t* x, Double_t* p)
 Calculate the principal components from the original data vector
 x, and return it in p.
 It's the users responsibility to make sure that both x and p are
 of the right size (i.e., memory must be allocated for p).
const TMatrixD * GetCovarianceMatrix() const
const TVectorD * GetEigenValues() const
{return &fEigenValues;}
const TMatrixD * GetEigenVectors() const
{return &fEigenVectors;}
TList * GetHistograms() const
{return fHistograms;}
const TVectorD * GetMeanValues() const
{return &fMeanValues;}
const TVectorD * GetSigmas() const
{return &fSigmas;}
const TVectorD * GetUserData() const
{return &fUserData;}
Bool_t IsFolder() const
{ return kTRUE;}