Minimum Covariance Determinant Estimator - a Fast Algorithm invented by Peter J. Rousseeuw and Katrien Van Driessen: "A Fast Algorithm for the Minimum Covariance Determinant Estimator", Technometrics, August 1999, Vol. 41, No. 3
What are robust estimators? "An important property of an estimator is its robustness. An estimator is called robust if it is insensitive to measurements that deviate from the expected behaviour. There are 2 ways to treat such deviating measurements: one may either try to recognise them and then remove them from the data sample; or one may leave them in the sample, taking care that they do not influence the estimate unduly. In both cases robust estimators are needed...Robust procedures compensate for systematic errors as much as possible, and indicate any situation in which a danger of not being able to operate reliably is detected." R.Fruhwirth, M.Regler, R.K.Bock, H.Grote, D.Notz "Data Analysis Techniques for High-Energy Physics", 2nd edition
What does this algorithm do? It computes a highly robust estimator of multivariate location and scatter. Then, it takes those estimates to compute robust distances of all the data vectors. Those with large robust distances are considered outliers. Robust distances can then be plotted for better visualization of the data.
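As a rough illustration of the last two steps, the sketch below computes robust distances for bivariate data from a given location T and scatter S, and flags the points whose squared distance exceeds the 0.975 chi-square quantile with 2 degrees of freedom (about 7.378). It is plain standard C++ and not the TRobustEstimator implementation: the names `Point2`, `Scatter2`, `RobustDistance` and `FlagOutliers` are hypothetical, T and S are assumed to come from some robust fit, and no correction factor is applied to the cutoff.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper types for this sketch (not part of ROOT).
struct Point2 { double x, y; };
struct Scatter2 { double sxx, sxy, syy; };  // symmetric 2x2 scatter matrix

// Robust distance d = sqrt((p-T)^T S^{-1} (p-T)) for a regular 2x2 matrix S.
double RobustDistance(const Point2 &p, const Point2 &T, const Scatter2 &S)
{
   const double det = S.sxx * S.syy - S.sxy * S.sxy;
   assert(det > 0.0);  // S is assumed to be positive definite
   const double dx = p.x - T.x;
   const double dy = p.y - T.y;
   // The inverse of [[sxx,sxy],[sxy,syy]] is (1/det) * [[syy,-sxy],[-sxy,sxx]].
   const double d2 = (S.syy * dx * dx - 2.0 * S.sxy * dx * dy + S.sxx * dy * dy) / det;
   return std::sqrt(d2);
}

// Returns the indices of the points whose squared robust distance exceeds the cutoff.
std::vector<std::size_t> FlagOutliers(const std::vector<Point2> &data, const Point2 &T, const Scatter2 &S)
{
   const double cutoff2 = 7.378;  // approx. 0.975 quantile of chi2 with 2 degrees of freedom
   std::vector<std::size_t> outliers;
   for (std::size_t i = 0; i < data.size(); ++i) {
      const double d = RobustDistance(data[i], T, S);
      if (d * d > cutoff2) outliers.push_back(i);
   }
   return outliers;
}
```

With T at the origin and S the identity matrix, the robust distance reduces to the Euclidean norm, which makes the sketch easy to check by hand.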
How does this algorithm do it? The MCD objective is to find h observations (out of n) whose classical covariance matrix has the lowest determinant. The MCD estimate of location is then the average of those h points, and the MCD estimate of scatter is their covariance matrix. The minimum (and default) value is h = (n+nvariables+1)/2, so the algorithm remains effective as long as no more than n-h, that is roughly (n-nvariables-1)/2, of the observations are outliers. The algorithm also handles exact fit situations, that is, cases where h or more observations lie on a hyperplane. In that case the algorithm still yields the MCD location T and scatter matrix S, the latter being singular as it should be. From (T,S) the program then computes the equation of the hyperplane.
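The arithmetic behind the default subset size can be made concrete with a small sketch. It assumes integer division, as with Int_t operands; the function names `DefaultSubsetSize` and `ObservationsOutsideSubset` are hypothetical and only illustrate the formula quoted above.

```cpp
#include <cassert>

// Default (and minimum) subset size h = (n + nvariables + 1) / 2, using integer division.
int DefaultSubsetSize(int n, int nvariables)
{
   assert(n > nvariables);
   return (n + nvariables + 1) / 2;
}

// Number of observations left outside the h-subset, n - h.
// These are the observations that may be outliers without entering the estimate.
int ObservationsOutsideSubset(int n, int nvariables)
{
   return n - DefaultSubsetSize(n, nvariables);
}
```

For n = 100 observations of nvariables = 3, this gives h = 52, leaving 48 observations outside the subset.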
How can this algorithm be used? It can be used whenever the data are suspected to contain contamination that might influence the classical estimates. Robust estimation of location and scatter is also a tool to robustify other multivariate techniques such as principal-component analysis and discriminant analysis.
Technical details of the algorithm:
Definition at line 23 of file TRobustEstimator.h.
Public Member Functions | |
TRobustEstimator () | |
Default constructor, intended for the univariate case: construct the object, then call EvaluateUni(). | |
TRobustEstimator (Int_t nvectors, Int_t nvariables, Int_t hh=0) | |
Constructor for the multivariate case. | |
~TRobustEstimator () override | |
void | AddColumn (Double_t *col) |
Adds a column to the data matrix. The column is assumed to have size fN; the variable fVarTemp keeps the number of columns already added. | |
void | AddRow (Double_t *row) |
Adds a row (one observation vector) to the data matrix. The vector is assumed to have size fNvar. | |
void | Evaluate () |
Finds the estimate of multivariate mean and variance. | |
void | EvaluateUni (Int_t nvectors, Double_t *data, Double_t &mean, Double_t &sigma, Int_t hh=0) |
For the univariate case, the estimates of location and scatter are returned in the mean and sigma parameters. The algorithm works on the same principle as in the multivariate case: it finds a subset of size hh with the smallest sigma, and then returns the mean and sigma of this subset. | |
Int_t | GetBDPoint () |
returns the breakdown point of the algorithm | |
Double_t | GetChiQuant (Int_t i) const |
returns the chi2 quantiles | |
const TMatrixDSym * | GetCorrelation () const |
void | GetCorrelation (TMatrixDSym &matr) |
returns the correlation matrix | |
const TMatrixDSym * | GetCovariance () const |
void | GetCovariance (TMatrixDSym &matr) |
returns the covariance matrix | |
const TMatrixD & | GetData () |
returns a reference to the data matrix | |
const TVectorD * | GetHyperplane () const |
if the points are on a hyperplane, returns this hyperplane | |
void | GetHyperplane (TVectorD &vec) |
if the points are on a hyperplane, returns this hyperplane | |
const TVectorD * | GetMean () const |
void | GetMean (TVectorD &means) |
return the estimate of the mean | |
Int_t | GetNHyp () |
Int_t | GetNOut () |
returns the number of outliers | |
Int_t | GetNumberObservations () const |
Int_t | GetNvar () const |
const TArrayI * | GetOuliers () const |
const TVectorD * | GetRDistances () const |
void | GetRDistances (TVectorD &rdist) |
returns the robust distances (helps to find outliers) | |
Protected Member Functions | |
void | AddToSscp (TMatrixD &sscp, TVectorD &vec) |
update the sscp matrix with vector vec | |
void | Classic () |
called when h=n. | |
void | ClearSscp (TMatrixD &sscp) |
clear the sscp matrix, used for covariance and mean calculation | |
void | Correl () |
transforms covariance matrix into correlation matrix | |
void | Covar (TMatrixD &sscp, TVectorD &m, TMatrixDSym &cov, TVectorD &sd, Int_t nvec) |
calculates mean and covariance | |
void | CreateOrtSubset (TMatrixD &dat, Int_t *index, Int_t hmerged, Int_t nmerged, TMatrixD &sscp, Double_t *ndist) |
Creates a subset of hmerged vectors with the smallest orthogonal distances to the hyperplane hyp[1]*(x1-mean[1])+...+hyp[nvar]*(xnvar-mean[nvar])=0. This function is called when fewer than fH samples lie on a hyperplane. | |
void | CreateSubset (Int_t ntotal, Int_t htotal, Int_t p, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist) |
Creates a subset of htotal elements from ntotal elements. First, p+1 elements are drawn randomly (without repetitions). If their covariance matrix is singular, more elements are added one by one until the covariance matrix becomes regular or it becomes clear that htotal observations lie on a hyperplane. If the determinant of the covariance matrix is nonzero, the distances of all ntotal elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean and S_inv is the inverse of the covariance matrix. The htotal points with the smallest distances are included in the returned subset. | |
Double_t | CStep (Int_t ntotal, Int_t htotal, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist) |
from the input htotal-subset constructs another htotal subset with lower determinant | |
Int_t | Exact (Double_t *ndist) |
for the exact fit situations returns number of observations on the hyperplane | |
Int_t | Exact2 (TMatrixD &mstockbig, TMatrixD &cstockbig, TMatrixD &hyperplane, Double_t *deti, Int_t nbest, Int_t kgroup, TMatrixD &sscp, Double_t *ndist) |
This function is called if determinant of the covariance matrix of a subset=0. | |
Double_t | KOrdStat (Int_t ntotal, Double_t *arr, Int_t k, Int_t *work) |
Returns the k-th order statistic of the array arr; the class has its own implementation because an Int_t work array is needed. | |
Int_t | Partition (Int_t nmini, Int_t *indsubdat) |
Divides the elements into approximately equal subgroups. The number of elements in each subgroup is stored in indsubdat, and the number of subgroups is returned. | |
Int_t | RDist (TMatrixD &sscp) |
Calculates robust distances. The samples with robust distances greater than a cutoff value (the 0.975 quantile of the chi2 distribution with fNvar degrees of freedom, multiplied by a correction factor) are then given weight=0, and new, reweighted estimates of location and scatter are calculated. The function returns the number of outliers. | |
void | RDraw (Int_t *subdat, Int_t ngroup, Int_t *indsubdat) |
Draws ngroup nonoverlapping subdatasets out of a dataset of size n such that the selected case numbers are uniformly distributed from 1 to n. | |
#include <TRobustEstimator.h>
TRobustEstimator::TRobustEstimator ()
Default constructor, intended for the univariate case: construct the object, then call EvaluateUni().
Definition at line 125 of file TRobustEstimator.cxx.
TRobustEstimator::TRobustEstimator (Int_t nvectors, Int_t nvariables, Int_t hh=0)
Constructor for the multivariate case.
Definition at line 131 of file TRobustEstimator.cxx.
TRobustEstimator::~TRobustEstimator () [inline, override]
Definition at line 78 of file TRobustEstimator.h.
void TRobustEstimator::AddColumn (Double_t *col)
Adds a column to the data matrix. The column is assumed to have size fN; the variable fVarTemp keeps the number of columns already added.
Definition at line 171 of file TRobustEstimator.cxx.
void TRobustEstimator::AddRow (Double_t *row)
Adds a row (one observation vector) to the data matrix. The vector is assumed to have size fNvar.
Definition at line 192 of file TRobustEstimator.cxx.
void TRobustEstimator::AddToSscp (TMatrixD &sscp, TVectorD &vec) [protected]
Updates the sscp matrix with the vector vec.
Definition at line 779 of file TRobustEstimator.cxx.
void TRobustEstimator::Classic () [protected]
Called when h=n. Returns the classical covariance matrix and mean.
Definition at line 809 of file TRobustEstimator.cxx.
void TRobustEstimator::ClearSscp (TMatrixD &sscp) [protected]
Clears the sscp matrix, which is used for the covariance and mean calculation.
Definition at line 796 of file TRobustEstimator.cxx.
void TRobustEstimator::Correl () [protected]
Transforms the covariance matrix into the correlation matrix.
Definition at line 850 of file TRobustEstimator.cxx.
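The transformation from covariance to correlation follows the usual relation corr(i,j) = cov(i,j) / sqrt(cov(i,i) * cov(j,j)). The sketch below shows this on a plain nested vector; the `Matrix` alias and the function name `CovarianceToCorrelation` are hypothetical, and the code is an illustration rather than the body of Correl().

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // hypothetical stand-in for a matrix class

// Converts a covariance matrix into a correlation matrix:
// corr(i,j) = cov(i,j) / sqrt(cov(i,i) * cov(j,j)).
Matrix CovarianceToCorrelation(const Matrix &cov)
{
   const std::size_t n = cov.size();
   // Check that the matrix is square and that all variances are positive.
   for (std::size_t i = 0; i < n; ++i) {
      assert(cov[i].size() == n);
      assert(cov[i][i] > 0.0);
   }
   Matrix corr(n, std::vector<double>(n, 0.0));
   for (std::size_t i = 0; i < n; ++i) {
      for (std::size_t j = 0; j < n; ++j) {
         corr[i][j] = cov[i][j] / std::sqrt(cov[i][i] * cov[j][j]);
      }
   }
   return corr;
}
```

For the covariance matrix {{4, 2}, {2, 9}}, the off-diagonal correlation is 2 / (2 * 3) = 1/3 and the diagonal elements are 1.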
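void TRobustEstimator::Covar (TMatrixD &sscp, TVectorD &m, TMatrixDSym &cov, TVectorD &sd, Int_t nvec) [protected]
Calculates the mean and covariance.
Definition at line 827 of file TRobustEstimator.cxx.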
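One common way to obtain a mean and covariance from accumulated sums of squares and cross-products (SSCP) is sketched below. The storage layout (`SscpSums`), the function names, and the 1/(n-1) normalisation are assumptions made for this example; they are not taken from the TRobustEstimator source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical accumulator: count, sums of x_i, and sums of x_i * x_j.
struct SscpSums {
   std::size_t count = 0;
   std::vector<double> sum;
   std::vector<std::vector<double>> cross;
   explicit SscpSums(std::size_t nvar) : sum(nvar, 0.0), cross(nvar, std::vector<double>(nvar, 0.0)) {}
};

// Adds one observation vector to the running sums.
void AccumulateSscp(SscpSums &s, const std::vector<double> &vec)
{
   assert(vec.size() == s.sum.size());
   ++s.count;
   for (std::size_t i = 0; i < vec.size(); ++i) {
      s.sum[i] += vec[i];
      for (std::size_t j = 0; j < vec.size(); ++j) s.cross[i][j] += vec[i] * vec[j];
   }
}

// Derives the mean and covariance from the sums:
// cov(i,j) = (sum x_i x_j - n * mean_i * mean_j) / (n - 1).
void MeanAndCovariance(const SscpSums &s, std::vector<double> &mean, std::vector<std::vector<double>> &cov)
{
   assert(s.count > 1);
   const std::size_t nvar = s.sum.size();
   const double n = static_cast<double>(s.count);
   mean.assign(nvar, 0.0);
   cov.assign(nvar, std::vector<double>(nvar, 0.0));
   for (std::size_t i = 0; i < nvar; ++i) mean[i] = s.sum[i] / n;
   for (std::size_t i = 0; i < nvar; ++i) {
      for (std::size_t j = 0; j < nvar; ++j) {
         cov[i][j] = (s.cross[i][j] - n * mean[i] * mean[j]) / (n - 1.0);
      }
   }
}
```

Accumulating sums in this way lets observations be added one at a time without storing a second copy of the data, at the cost of some numerical cancellation when the means are large compared with the spread.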
void TRobustEstimator::CreateOrtSubset (TMatrixD &dat, Int_t *index, Int_t hmerged, Int_t nmerged, TMatrixD &sscp, Double_t *ndist) [protected]
Creates a subset of hmerged vectors with the smallest orthogonal distances to the hyperplane hyp[1]*(x1-mean[1])+...+hyp[nvar]*(xnvar-mean[nvar])=0. This function is called when fewer than fH samples lie on a hyperplane.
Definition at line 968 of file TRobustEstimator.cxx.
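The selection by orthogonal distance can be sketched as follows. The orthogonal distance of a point x to the hyperplane hyp . (x - mean) = 0 is |hyp . (x - mean)| / ||hyp||. The function names and the use of nested vectors are hypothetical choices for this example, which does not reproduce the sscp bookkeeping of CreateOrtSubset.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Orthogonal distance of x to the hyperplane hyp . (x - mean) = 0.
double OrthogonalDistance(const std::vector<double> &x, const std::vector<double> &mean,
                          const std::vector<double> &hyp)
{
   assert(x.size() == mean.size() && x.size() == hyp.size());
   double dot = 0.0;
   double norm2 = 0.0;
   for (std::size_t i = 0; i < x.size(); ++i) {
      dot += hyp[i] * (x[i] - mean[i]);
      norm2 += hyp[i] * hyp[i];
   }
   assert(norm2 > 0.0);  // the normal vector must not be zero
   return std::fabs(dot) / std::sqrt(norm2);
}

// Returns the indices of the h points closest to the hyperplane, nearest first.
std::vector<std::size_t> ClosestToHyperplane(const std::vector<std::vector<double>> &data,
                                             const std::vector<double> &mean,
                                             const std::vector<double> &hyp, std::size_t h)
{
   assert(h <= data.size());
   std::vector<double> dist(data.size());
   for (std::size_t i = 0; i < data.size(); ++i) dist[i] = OrthogonalDistance(data[i], mean, hyp);
   std::vector<std::size_t> idx(data.size());
   std::iota(idx.begin(), idx.end(), std::size_t{0});
   std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(h), idx.end(),
                     [&dist](std::size_t a, std::size_t b) { return dist[a] < dist[b]; });
   idx.resize(h);
   return idx;
}
```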
void TRobustEstimator::CreateSubset (Int_t ntotal, Int_t htotal, Int_t p, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist) [protected]
Creates a subset of htotal elements from ntotal elements. First, p+1 elements are drawn randomly (without repetitions). If their covariance matrix is singular, more elements are added one by one until the covariance matrix becomes regular or it becomes clear that htotal observations lie on a hyperplane. If the determinant of the covariance matrix is nonzero, the distances of all ntotal elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean and S_inv is the inverse of the covariance matrix. The htotal points with the smallest distances are included in the returned subset.
Definition at line 878 of file TRobustEstimator.cxx.
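The first step, drawing p+1 elements at random without repetitions, can be illustrated with a partial Fisher-Yates shuffle. This is one possible way to do such a draw and is not claimed to be the method used in CreateSubset; the function name `DrawDistinct` and the choice of `std::mt19937` as generator are assumptions for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Draws k distinct indices from 0..ntotal-1 with a partial Fisher-Yates shuffle.
std::vector<int> DrawDistinct(int ntotal, int k, std::mt19937 &rng)
{
   assert(k >= 0 && k <= ntotal);
   std::vector<int> pool(static_cast<std::size_t>(ntotal));
   std::iota(pool.begin(), pool.end(), 0);
   for (int i = 0; i < k; ++i) {
      // Pick a position among the not yet fixed entries and move it to slot i.
      std::uniform_int_distribution<int> pick(i, ntotal - 1);
      std::swap(pool[static_cast<std::size_t>(i)], pool[static_cast<std::size_t>(pick(rng))]);
   }
   pool.resize(static_cast<std::size_t>(k));
   return pool;
}
```

For an initial subset of p+1 points one would call `DrawDistinct(ntotal, p + 1, rng)`; the singularity check and the extension of the subset described above are left out of this sketch.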
Double_t TRobustEstimator::CStep (Int_t ntotal, Int_t htotal, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist) [protected]
From the input htotal-subset, constructs another htotal-subset with a lower determinant.
As proven by Peter J. Rousseeuw and Katrien Van Driessen, if the distances of all elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean of the input htotal-subset and S_inv is the inverse of its covariance matrix, then the htotal elements with the smallest distances have a covariance matrix whose determinant is less than or equal to the determinant of the covariance matrix of the input subset.
The determinant for this htotal-subset with the smallest distances is returned.
Definition at line 1000 of file TRobustEstimator.cxx.
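A single C-step can be sketched for bivariate data, where the 2x2 covariance matrix can be inverted by hand. The types `Obs2` and `Fit2` and the functions `FitSubset`, `Determinant` and `CStep2D` are hypothetical names for this example. The sketch assumes a regular covariance matrix and normalises by the subset size, which does not affect the ordering of the distances.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical types for this sketch (not part of ROOT).
struct Obs2 { double x, y; };
struct Fit2 { double mx, my, sxx, sxy, syy; };

// Mean and covariance (normalised by the subset size) of the selected points.
Fit2 FitSubset(const std::vector<Obs2> &data, const std::vector<std::size_t> &subset)
{
   Fit2 f{0.0, 0.0, 0.0, 0.0, 0.0};
   const double h = static_cast<double>(subset.size());
   for (std::size_t i : subset) { f.mx += data[i].x; f.my += data[i].y; }
   f.mx /= h;
   f.my /= h;
   for (std::size_t i : subset) {
      const double dx = data[i].x - f.mx;
      const double dy = data[i].y - f.my;
      f.sxx += dx * dx;
      f.sxy += dx * dy;
      f.syy += dy * dy;
   }
   f.sxx /= h;
   f.sxy /= h;
   f.syy /= h;
   return f;
}

double Determinant(const Fit2 &f) { return f.sxx * f.syy - f.sxy * f.sxy; }

// One C-step: keep the h points with the smallest distances to the current fit.
std::vector<std::size_t> CStep2D(const std::vector<Obs2> &data, const std::vector<std::size_t> &subset)
{
   const std::size_t h = subset.size();
   assert(h >= 3 && h <= data.size());
   const Fit2 f = FitSubset(data, subset);
   const double det = Determinant(f);
   assert(det > 0.0);  // the exact-fit (singular) case is not handled here

   // Squared distances d_i^2 = (x_i - M)^T S^{-1} (x_i - M) for all points.
   std::vector<double> d2(data.size());
   for (std::size_t i = 0; i < data.size(); ++i) {
      const double dx = data[i].x - f.mx;
      const double dy = data[i].y - f.my;
      d2[i] = (f.syy * dx * dx - 2.0 * f.sxy * dx * dy + f.sxx * dy * dy) / det;
   }
   std::vector<std::size_t> idx(data.size());
   std::iota(idx.begin(), idx.end(), std::size_t{0});
   std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(h), idx.end(),
                     [&d2](std::size_t a, std::size_t b) { return d2[a] < d2[b]; });
   idx.resize(h);
   return idx;
}
```

Calling `CStep2D` repeatedly on its own output yields a sequence of subsets whose determinants do not increase, which is the property stated in the description above.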
void TRobustEstimator::Evaluate ()
Finds the estimate of multivariate mean and variance.
Definition at line 209 of file TRobustEstimator.cxx.
void TRobustEstimator::EvaluateUni (Int_t nvectors, Double_t *data, Double_t &mean, Double_t &sigma, Int_t hh=0)
For the univariate case, the estimates of location and scatter are returned in the mean and sigma parameters. The algorithm works on the same principle as in the multivariate case: it finds a subset of size hh with the smallest sigma, and then returns the mean and sigma of this subset.
Definition at line 609 of file TRobustEstimator.cxx.
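The principle of the univariate case can be sketched in a few lines. In one dimension the subset of size hh with the smallest variance is contiguous in sorted order, so it is enough to scan all windows of hh consecutive sorted values. The names `UniFit` and `UnivariateMcdSketch` are hypothetical, and the sketch returns the plain standard deviation of the best window without any consistency or correction factor, so its sigma is not expected to match the value returned by EvaluateUni.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct UniFit { double mean; double sigma; };  // hypothetical result type

// Finds the window of hh consecutive sorted values with the smallest spread
// and returns its mean and (uncorrected) standard deviation.
UniFit UnivariateMcdSketch(std::vector<double> data, std::size_t hh)
{
   const std::size_t n = data.size();
   assert(hh >= 1 && hh <= n);
   std::sort(data.begin(), data.end());

   UniFit best{0.0, std::numeric_limits<double>::infinity()};
   for (std::size_t start = 0; start + hh <= n; ++start) {
      double mean = 0.0;
      for (std::size_t i = start; i < start + hh; ++i) mean += data[i];
      mean /= static_cast<double>(hh);
      double ss = 0.0;
      for (std::size_t i = start; i < start + hh; ++i) ss += (data[i] - mean) * (data[i] - mean);
      const double sigma = std::sqrt(ss / static_cast<double>(hh));
      if (sigma < best.sigma) best = UniFit{mean, sigma};
   }
   return best;
}
```

For the values 1.0, 1.1, 1.2, 5.0, 100.0 with hh = 3, the best window is the first three values, so the two large values have no influence on the result.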
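Int_t TRobustEstimator::Exact (Double_t *ndist) [protected]
For exact fit situations, returns the number of observations on the hyperplane.
Definition at line 1037 of file TRobustEstimator.cxx.
Int_t TRobustEstimator::Exact2 (TMatrixD &mstockbig, TMatrixD &cstockbig, TMatrixD &hyperplane, Double_t *deti, Int_t nbest, Int_t kgroup, TMatrixD &sscp, Double_t *ndist) [protected]
This function is called if the determinant of the covariance matrix of a subset is 0.
If more than fH vectors lie on a hyperplane, it returns this hyperplane and stops; otherwise it stores the hyperplane coordinates in the hyperplane matrix.
Definition at line 1071 of file TRobustEstimator.cxx.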
Int_t TRobustEstimator::GetBDPoint ()
returns the breakdown point of the algorithm
Definition at line 675 of file TRobustEstimator.cxx.
Double_t TRobustEstimator::GetChiQuant (Int_t i) const
returns the chi2 quantiles
Definition at line 685 of file TRobustEstimator.cxx.
const TMatrixDSym * TRobustEstimator::GetCorrelation () const [inline]
Definition at line 94 of file TRobustEstimator.h.
void TRobustEstimator::GetCorrelation (TMatrixDSym &matr)
returns the correlation matrix
Definition at line 706 of file TRobustEstimator.cxx.
const TMatrixDSym * TRobustEstimator::GetCovariance () const [inline]
Definition at line 92 of file TRobustEstimator.h.
void TRobustEstimator::GetCovariance (TMatrixDSym &matr)
returns the covariance matrix
Definition at line 694 of file TRobustEstimator.cxx.
const TMatrixD & TRobustEstimator::GetData () [inline]
returns a reference to the data matrix
Definition at line 89 of file TRobustEstimator.h.
const TVectorD * TRobustEstimator::GetHyperplane () const
if the points are on a hyperplane, returns this hyperplane
Definition at line 718 of file TRobustEstimator.cxx.
void TRobustEstimator::GetHyperplane (TVectorD &vec)
if the points are on a hyperplane, returns this hyperplane
Definition at line 731 of file TRobustEstimator.cxx.
const TVectorD * TRobustEstimator::GetMean () const [inline]
Definition at line 99 of file TRobustEstimator.h.
void TRobustEstimator::GetMean (TVectorD &means)
return the estimate of the mean
Definition at line 747 of file TRobustEstimator.cxx.
Int_t TRobustEstimator::GetNHyp () [inline]
Definition at line 97 of file TRobustEstimator.h.
Int_t TRobustEstimator::GetNOut ()
returns the number of outliers
Definition at line 771 of file TRobustEstimator.cxx.
Int_t TRobustEstimator::GetNumberObservations () const [inline]
Definition at line 102 of file TRobustEstimator.h.
Int_t TRobustEstimator::GetNvar () const [inline]
Definition at line 103 of file TRobustEstimator.h.
const TArrayI * TRobustEstimator::GetOuliers () const [inline]
Definition at line 104 of file TRobustEstimator.h.
const TVectorD * TRobustEstimator::GetRDistances () const [inline]
Definition at line 101 of file TRobustEstimator.h.
void TRobustEstimator::GetRDistances (TVectorD &rdist)
returns the robust distances (helps to find outliers)
Definition at line 759 of file TRobustEstimator.cxx.
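Double_t TRobustEstimator::KOrdStat (Int_t ntotal, Double_t *arr, Int_t k, Int_t *work) [protected]
Returns the k-th order statistic of the array arr; the class has its own implementation because an Int_t work array is needed.
Definition at line 1267 of file TRobustEstimator.cxx.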
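For readers unfamiliar with order statistics, the sketch below selects the k-th smallest value with `std::nth_element` from the standard library. It is an illustration of the concept only: the function name `KthOrderStatistic` is hypothetical, the zero-based meaning of k is an assumption of this sketch, and no work array is used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the k-th smallest value (k = 0 is the minimum).
// The vector is taken by value, so the caller's data stay untouched.
double KthOrderStatistic(std::vector<double> values, std::size_t k)
{
   assert(k < values.size());
   std::nth_element(values.begin(), values.begin() + static_cast<std::ptrdiff_t>(k), values.end());
   return values[k];
}
```

Selection of this kind runs in linear time on average, which is why it is preferable to a full sort when only one quantile, such as the h-th smallest distance, is needed.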
Int_t TRobustEstimator::Partition (Int_t nmini, Int_t *indsubdat) [protected]
Divides the elements into approximately equal subgroups. The number of elements in each subgroup is stored in indsubdat, and the number of subgroups is returned.
Definition at line 1118 of file TRobustEstimator.cxx.
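Splitting n elements into approximately equal subgroups can be sketched as below. Here the number of groups is passed in as a parameter, which is an assumption of this example (Partition itself determines and returns the number of subgroups); the function name `PartitionSizes` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits n elements into ngroup groups whose sizes differ by at most one.
// The first n % ngroup groups receive one extra element.
std::vector<int> PartitionSizes(int n, int ngroup)
{
   assert(ngroup > 0 && n >= ngroup);
   std::vector<int> sizes(static_cast<std::size_t>(ngroup), n / ngroup);
   for (int i = 0; i < n % ngroup; ++i) ++sizes[static_cast<std::size_t>(i)];
   return sizes;
}
```

For example, 1503 elements in 5 groups give three groups of 301 and two groups of 300.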
Int_t TRobustEstimator::RDist (TMatrixD &sscp) [protected]
Calculates robust distances. The samples with robust distances greater than a cutoff value (the 0.975 quantile of the chi2 distribution with fNvar degrees of freedom, multiplied by a correction factor) are then given weight=0, and new, reweighted estimates of location and scatter are calculated. The function returns the number of outliers.
Definition at line 1172 of file TRobustEstimator.cxx.
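The reweighting idea can be shown in the univariate case, where the squared robust distance of a value x is ((x - mean) / sigma)^2 and the 0.975 chi2 quantile with 1 degree of freedom is about 5.024. The sketch below gives weight 0 to values beyond the cutoff and recomputes mean and sigma from the rest. The names `ReweightedFit` and `ReweightUnivariate` are hypothetical; the correction factor mentioned above is omitted, and the 1/(n-1) normalisation is an assumption of this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ReweightedFit { double mean; double sigma; int nOutliers; };  // hypothetical result type

// Drops values whose squared robust distance exceeds the cutoff and
// recomputes the mean and standard deviation from the remaining values.
ReweightedFit ReweightUnivariate(const std::vector<double> &data, double rawMean, double rawSigma)
{
   assert(rawSigma > 0.0);
   const double cutoff2 = 5.024;  // approx. 0.975 quantile of chi2 with 1 degree of freedom
   std::vector<double> kept;
   for (double x : data) {
      const double d = (x - rawMean) / rawSigma;
      if (d * d <= cutoff2) kept.push_back(x);  // weight 1; all other values get weight 0
   }
   assert(kept.size() >= 2);

   double mean = 0.0;
   for (double x : kept) mean += x;
   mean /= static_cast<double>(kept.size());
   double ss = 0.0;
   for (double x : kept) ss += (x - mean) * (x - mean);
   const double sigma = std::sqrt(ss / static_cast<double>(kept.size() - 1));
   return ReweightedFit{mean, sigma, static_cast<int>(data.size() - kept.size())};
}
```

The reweighted estimates use more observations than the raw h-subset, which generally improves their statistical efficiency while the flagged values remain excluded.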
void TRobustEstimator::RDraw (Int_t *subdat, Int_t ngroup, Int_t *indsubdat) [protected]
Draws ngroup nonoverlapping subdatasets out of a dataset of size n such that the selected case numbers are uniformly distributed from 1 to n.
Definition at line 1235 of file TRobustEstimator.cxx.
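One simple way to obtain nonoverlapping random subdatasets is to shuffle all case numbers once and cut the shuffled sequence into consecutive pieces. The sketch below does this with `std::shuffle`; it is not claimed to be the procedure used in RDraw. The function name `DrawSubdatasets` is hypothetical, and case numbers are zero-based here, unlike the 1 to n range in the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Draws nonoverlapping groups of the requested sizes from the case numbers 0..n-1.
std::vector<std::vector<int>> DrawSubdatasets(int n, const std::vector<int> &sizes, std::mt19937 &rng)
{
   const int total = std::accumulate(sizes.begin(), sizes.end(), 0);
   assert(n >= 0 && total <= n);

   // A single random permutation guarantees that no case appears twice.
   std::vector<int> perm(static_cast<std::size_t>(n));
   std::iota(perm.begin(), perm.end(), 0);
   std::shuffle(perm.begin(), perm.end(), rng);

   std::vector<std::vector<int>> groups;
   std::size_t pos = 0;
   for (int size : sizes) {
      assert(size >= 0);
      const std::size_t count = static_cast<std::size_t>(size);
      groups.emplace_back(perm.begin() + static_cast<std::ptrdiff_t>(pos),
                          perm.begin() + static_cast<std::ptrdiff_t>(pos + count));
      pos += count;
   }
   return groups;
}
```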