Logo ROOT  
Reference Guide
 
Loading...
Searching...
No Matches
TRobustEstimator.cxx
Go to the documentation of this file.
1// @(#)root/physics:$Id$
2// Author: Anna Kreshuk 08/10/2004
3
4/*************************************************************************
5 * Copyright (C) 1995-2004, Rene Brun and Fons Rademakers. *
6 * All rights reserved. *
7 * *
8 * For the licensing terms see $ROOTSYS/LICENSE. *
9 * For the list of contributors see $ROOTSYS/README/CREDITS. *
10 *************************************************************************/
11
12/** \class TRobustEstimator
13 \note Despite being in the group of Legacy statistics classes, TRobustEstimator is still useful and no drop-in replacement exists for it.
14 \ingroup Physics
15Minimum Covariance Determinant Estimator - a Fast Algorithm
16invented by Peter J.Rousseeuw and Katrien Van Dreissen
17"A Fast Algorithm for the Minimum covariance Determinant Estimator"
18Technometrics, August 1999, Vol.41, NO.3
19
20What are robust estimators?
21"An important property of an estimator is its robustness. An estimator
22is called robust if it is insensitive to measurements that deviate
23from the expected behaviour. There are 2 ways to treat such deviating
24measurements: one may either try to recognise them and then remove
25them from the data sample; or one may leave them in the sample, taking
26care that they do not influence the estimate unduly. In both cases robust
27estimators are needed...Robust procedures compensate for systematic errors
28as much as possible, and indicate any situation in which a danger of not being
29able to operate reliably is detected."
30R.Fruhwirth, M.Regler, R.K.Bock, H.Grote, D.Notz
31"Data Analysis Techniques for High-Energy Physics", 2nd edition
32
33What does this algorithm do?
34It computes a highly robust estimator of multivariate location and scatter.
35Then, it takes those estimates to compute robust distances of all the
36data vectors. Those with large robust distances are considered outliers.
37Robust distances can then be plotted for better visualization of the data.
38
39How does this algorithm do it?
40The MCD objective is to find h observations(out of n) whose classical
41covariance matrix has the lowest determinant. The MCD estimator of location
42is then the average of those h points and the MCD estimate of scatter
43is their covariance matrix. The minimum(and default) h = (n+nvariables+1)/2
44so the algorithm is effective when less than (n+nvar+1)/2 variables are outliers.
45The algorithm also allows for exact fit situations - that is, when h or more
46observations lie on a hyperplane. Then the algorithm still yields the MCD location T
47and scatter matrix S, the latter being singular as it should be. From (T,S) the
48program then computes the equation of the hyperplane.
49
50How can this algorithm be used?
51In any case, when contamination of data is suspected, that might influence
52the classical estimates.
53Also, robust estimation of location and scatter is a tool to robustify
54other multivariate techniques such as, for example, principal-component analysis
55and discriminant analysis.
56
57Technical details of the algorithm:
58
591. The default h = (n+nvariables+1)/2, but the user may choose any integer h with
60 (n+nvariables+1)/2<=h<=n. The program then reports the MCD's breakdown value
61 (n-h+1)/n. If you are sure that the dataset contains less than 25% contamination
62 which is usually the case, a good compromise between breakdown value and
63 efficiency is obtained by putting h=[.75*n].
642. If h=n,the MCD location estimate is the average of the whole dataset, and
65 the MCD scatter estimate is its covariance matrix. Report this and stop
663. If nvariables=1 (univariate data), compute the MCD estimate by the exact
67 algorithm of Rousseeuw and Leroy (1987, pp.171-172) in O(nlogn)time and stop
684. From here on, h<n and nvariables>=2.
69 1. If n is small:
70 - repeat (say) 500 times:
71 - construct an initial h-subset, starting from a random (nvar+1)-subset
72 - carry out 2 C-steps (described in the comments of CStep function)
73 - for the 10 results with lowest det(S):
74 - carry out C-steps until convergence
75 - report the solution (T, S) with the lowest det(S)
76 2. If n is larger (say, n>600), then
77 - construct up to 5 disjoint random subsets of size nsub (say, nsub=300)
78 - inside each subset repeat 500/5 times:
79 - construct an initial subset of size hsub=[nsub*h/n]
80 - carry out 2 C-steps
81 - keep the best 10 results (Tsub, Ssub)
82 - pool the subsets, yielding the merged set (say, of size nmerged=1500)
83 - in the merged set, repeat for each of the 50 solutions (Tsub, Ssub)
84 - carry out 2 C-steps
85 - keep the 10 best results
86 - in the full dataset, repeat for those best results:
87 - take several C-steps, using n and h
88 - report the best final result (T, S)
895. To obtain consistency when the data comes from a multivariate normal
90 distribution, covariance matrix is multiplied by a correction factor
916. Robust distances for all elements, using the final (T, S) are calculated
92 Then the very final mean and covariance estimates are calculated only for
93 values, whose robust distances are less than a cutoff value (0.975 quantile
94 of chi2 distribution with nvariables degrees of freedom)
95*/
96
97#include "TRobustEstimator.h"
98#include "TMatrixDSymEigen.h"
99#include "TRandom.h"
100#include "TMath.h"
101#include "TDecompChol.h"
102
103
105 0.454937, 1.38629, 2.36597, 3.35670, 4.35146, 5.34812, 6.34581, 7.34412, 8.34283,
106 9.34182, 10.34, 11.34, 12.34, 13.34, 14.34, 15.34, 16.34, 17.34, 18.34, 19.34,
107 20.34, 21.34, 22.34, 23.34, 24.34, 25.34, 26.34, 27.34, 28.34, 29.34, 30.34,
108 31.34, 32.34, 33.34, 34.34, 35.34, 36.34, 37.34, 38.34, 39.34, 40.34,
109 41.34, 42.34, 43.34, 44.34, 45.34, 46.34, 47.34, 48.34, 49.33};
110
112 5.02389, 7.3776,9.34840,11.1433,12.8325,
113 14.4494,16.0128,17.5346,19.0228,20.4831,21.920,23.337,
114 24.736,26.119,27.488,28.845,30.191,31.526,32.852,34.170,
115 35.479,36.781,38.076,39.364,40.646,41.923,43.194,44.461,
116 45.722,46.979,48.232,49.481,50.725,51.966,53.203,54.437,
117 55.668,56.896,58.120,59.342,60.561,61.777,62.990,64.201,
118 65.410,66.617,67.821,69.022,70.222,71.420};
119
120////////////////////////////////////////////////////////////////////////////////
121///this constructor should be used in a univariate case:
122///first call this constructor, then - the EvaluateUni(..) function
123
126
127////////////////////////////////////////////////////////////////////////////////
128///constructor
129
131 :fMean(nvariables),
132 fCovariance(nvariables),
133 fInvcovariance(nvariables),
134 fCorrelation(nvariables),
135 fRd(nvectors),
136 fSd(nvectors),
137 fOut(1),
138 fHyperplane(nvariables),
139 fData(nvectors, nvariables)
140{
141 if ((nvectors<=1)||(nvariables<=0)){
142 Error("TRobustEstimator","Not enough vectors or variables");
143 return;
144 }
145 if (nvariables==1){
146 Error("TRobustEstimator","For the univariate case, use the default constructor and EvaluateUni() function");
147 return;
148 }
149
150 fN=nvectors;
152 if (hh<(fN+fNvar+1)/2){
153 if (hh>0)
154 Warning("TRobustEstimator","chosen h is too small, default h is taken instead");
155 fH=(fN+fNvar+1)/2;
156 } else
157 fH=hh;
158
159 fVarTemp=0;
160 fVecTemp=0;
161 fExact=0;
162}
163
164////////////////////////////////////////////////////////////////////////////////
165///adds a column to the data matrix
166///it is assumed that the column has size fN
167///variable fVarTemp keeps the number of columns l
168///already added
169
171{
172 if (fVarTemp==fNvar) {
173 fNvar++;
180 }
181 for (Int_t i=0; i<fN; i++) {
182 fData(i, fVarTemp)=col[i];
183 }
184 fVarTemp++;
185}
186
187////////////////////////////////////////////////////////////////////////////////
188///adds a vector to the data matrix
189///it is supposed that the vector is of size fNvar
190
192{
193 if(fVecTemp==fN) {
194 fN++;
195 fRd.ResizeTo(fN);
196 fSd.ResizeTo(fN);
198 }
199 for (Int_t i=0; i<fNvar; i++)
200 fData(fVecTemp, i)=row[i];
201
202 fVecTemp++;
203}
204
205////////////////////////////////////////////////////////////////////////////////
206///Finds the estimate of multivariate mean and variance
207
209{
210 Double_t kEps=1e-14;
211
212 if (fH==fN){
213 Warning("Evaluate","Chosen h = #observations, so classic estimates of location and scatter will be calculated");
214 Classic();
215 return;
216 }
217
218 Int_t i, j, k;
219 Int_t ii, jj;
220 Int_t nmini = 300;
221 Int_t k1=500;
222 Int_t nbest=10;
223 TMatrixD sscp(fNvar+1, fNvar+1);
225
226 Int_t *index = new Int_t[fN];
227 Double_t *ndist = new Double_t[fN];
230 for (i=0; i<nbest; i++)
231 deti[i]=1e16;
232
233 for (i=0; i<fN; i++)
234 fRd(i)=0;
235 ////////////////////////////
236 //for small n
237 ////////////////////////////
238 if (fN<nmini*2) {
239 //for storing the best fMeans and covariances
240
243
244 for (k=0; k<k1; k++) {
246 //calculate the mean and covariance of the created subset
248 for (i=0; i<fH; i++) {
249 for(j=0; j<fNvar; j++)
250 vec(j)=fData[index[i]][j];
252 }
255 if (det < kEps) {
256 fExact = Exact(ndist);
257 delete [] index;
258 delete [] ndist;
259 delete [] deti;
260 return;
261 }
262 //make 2 CSteps
263 det = CStep(fN, fH, index, fData, sscp, ndist);
264 if (det < kEps) {
265 fExact = Exact(ndist);
266 delete [] index;
267 delete [] ndist;
268 delete [] deti;
269 return;
270 }
271 det = CStep(fN, fH, index, fData, sscp, ndist);
272 if (det < kEps) {
273 fExact = Exact(ndist);
274 delete [] index;
275 delete [] ndist;
276 delete [] deti;
277 return;
278 } else {
280 if(det<deti[maxind]) {
281 deti[maxind]=det;
282 for(ii=0; ii<fNvar; ii++) {
284 for(jj=0; jj<fNvar; jj++)
286 }
287 }
288 }
289 }
290
291 //now for nbest best results perform CSteps until convergence
292
293 for (i=0; i<nbest; i++) {
294 for(ii=0; ii<fNvar; ii++) {
295 fMean(ii)=mstock(i, ii);
296 for (jj=0; jj<fNvar; jj++)
298 }
299
300 det=1;
301 while (det>kEps) {
303 if(TMath::Abs(det-deti[i])<kEps)
304 break;
305 else
306 deti[i]=det;
307 }
308 for(ii=0; ii<fNvar; ii++) {
309 mstock(i,ii)=fMean(ii);
310 for (jj=0; jj<fNvar; jj++)
312 }
313 }
314
316 for(ii=0; ii<fNvar; ii++) {
318
319 for(jj=0; jj<fNvar; jj++)
321 }
322
323 if (deti[detind]!=0) {
324 //calculate robust distances and throw out the bad points
325 Int_t nout = RDist(sscp);
327
328 fOut.Set(nout);
329
330 j=0;
331 for (i=0; i<fN; i++) {
332 if(fRd(i)>cutoff) {
333 fOut[j]=i;
334 j++;
335 }
336 }
337
338 } else {
340 }
341 delete [] index;
342 delete [] ndist;
343 delete [] deti;
344 return;
345
346 }
347 /////////////////////////////////////////////////
348 //if n>nmini, the dataset should be partitioned
349 //partitioning
350 ////////////////////////////////////////////////
351 Int_t indsubdat[5];
352 Int_t nsub;
353 for (ii=0; ii<5; ii++)
354 indsubdat[ii]=0;
355
357
358 Int_t sum=0;
359 for (ii=0; ii<5; ii++)
360 sum+=indsubdat[ii];
361 Int_t *subdat=new Int_t[sum];
362 //printf("allocates subdat[ %d ]\n", sum);
363 // init the subdat matrix
364 for (int iii = 0; iii < sum; ++iii) subdat[iii] = -999;
366 for (int iii = 0; iii < sum; ++iii) {
367 if (subdat[iii] < 0 || subdat[iii] >= fN ) {
368 Error("Evaluate","subdat index is invalid subdat[%d] = %d",iii, subdat[iii] );
369 R__ASSERT(0);
370 }
371 }
372 //now the indexes of selected cases are in the array subdat
373 //matrices to store best means and covariances
378 for (i=0; i<nbestsub; i++) {
379 for(j=0; j<fNvar; j++)
380 hyperplane(i,j)=0;
381 }
386
388 //construct h-subsets and perform 2 CSteps in subgroups
389
390 for (Int_t kgroup=0; kgroup<nsub; kgroup++) {
391 //printf("group #%d\n", kgroup);
393 Int_t temp=0;
394 for (i=0; i<kgroup; i++)
395 temp+=indsubdat[i];
396 Int_t par;
397
398
399 for(i=0; i<ntemp; i++) {
400 for (j=0; j<fNvar; j++) {
401 dattemp(i,j)=fData[subdat[temp+i]][j];
402 }
403 }
405
406 for (i=0; i<nbest; i++)
407 deti[i]=1e16;
408
409 for(k=0; k<k2; k++) {
412 for (i=0; i<htemp; i++) {
413 for(j=0; j<fNvar; j++) {
414 vec(j)=dattemp(index[i],j);
415 }
417 }
420 if (det<kEps) {
422 if(par==nbest+1) {
423
424 delete [] detibig;
425 delete [] deti;
426 delete [] subdat;
427 delete [] ndist;
428 delete [] index;
429 return;
430 } else
431 deti[par]=det;
432 } else {
434 if (det<kEps) {
436 if(par==nbest+1) {
437
438 delete [] detibig;
439 delete [] deti;
440 delete [] subdat;
441 delete [] ndist;
442 delete [] index;
443 return;
444 } else
445 deti[par]=det;
446 } else {
448 if(det<kEps){
450 if(par==nbest+1) {
451
452 delete [] detibig;
453 delete [] deti;
454 delete [] subdat;
455 delete [] ndist;
456 delete [] index;
457 return;
458 } else {
459 deti[par]=det;
460 }
461 } else {
463 if(det<deti[maxind]) {
464 deti[maxind]=det;
465 for(i=0; i<fNvar; i++) {
467 for(j=0; j<fNvar; j++) {
469
470 }
471 }
472 }
473
474 }
475 }
476 }
477
479 if (deti[maxind]<kEps)
480 break;
481 }
482
483
484 for(i=0; i<nbest; i++) {
485 detibig[kgroup*nbest + i]=deti[i];
486
487 }
488
489 }
490
491 //now the arrays mstockbig and cstockbig store nbest*nsub best means and covariances
492 //detibig stores nbest*nsub their determinants
493 //merge the subsets and carry out 2 CSteps on the merged set for all 50 best solutions
494
496 for(i=0; i<sum; i++) {
497 for (j=0; j<fNvar; j++)
498 datmerged(i,j)=fData[subdat[i]][j];
499 }
500 // printf("performing calculations for merged set\n");
502
503 Int_t nh;
504 for(k=0; k<nbestsub; k++) {
505 //for all best solutions perform 2 CSteps and then choose the very best
506 for(ii=0; ii<fNvar; ii++) {
507 fMean(ii)=mstockbig(k,ii);
508 for(jj=0; jj<fNvar; jj++)
510 }
511 if(detibig[k]==0) {
512 for(i=0; i<fNvar; i++)
513 fHyperplane(i)=hyperplane(k,i);
515
516 }
518 if (det<kEps) {
519 nh= Exact(ndist);
520 if (nh>=fH) {
521 fExact = nh;
522
523 delete [] detibig;
524 delete [] deti;
525 delete [] subdat;
526 delete [] ndist;
527 delete [] index;
528 return;
529 } else {
531 }
532 }
533
535 if (det<kEps) {
536 nh=Exact(ndist);
537 if (nh>=fH) {
538 fExact = nh;
539 delete [] detibig;
540 delete [] deti;
541 delete [] subdat;
542 delete [] ndist;
543 delete [] index;
544 return;
545 }
546 }
547 detibig[k]=det;
548 for(i=0; i<fNvar; i++) {
549 mstockbig(k,i)=fMean(i);
550 for(j=0; j<fNvar; j++) {
551 cstockbig(i,k*fNvar+j)=fCovariance(i, j);
552 }
553 }
554 }
555 //now for the subset with the smallest determinant
556 //repeat CSteps until convergence
559 for(i=0; i<fNvar; i++) {
560 fMean(i)=mstockbig(minind,i);
562 for(j=0; j<fNvar; j++)
564 }
565 if(det<kEps)
567 det=1;
568 while (det>kEps) {
571 break;
572 } else {
574 }
575 }
576 if(det<kEps) {
577 Exact(ndist);
579 }
580 Int_t nout = RDist(sscp);
582
583 fOut.Set(nout);
584
585 j=0;
586 for (i=0; i<fN; i++) {
587 if(fRd(i)>cutoff) {
588 fOut[j]=i;
589 j++;
590 }
591 }
592
593 delete [] detibig;
594 delete [] deti;
595 delete [] subdat;
596 delete [] ndist;
597 delete [] index;
598 return;
599}
600
601////////////////////////////////////////////////////////////////////////////////
602///for the univariate case
603///estimates of location and scatter are returned in mean and sigma parameters
604///the algorithm works on the same principle as in multivariate case -
605///it finds a subset of size hh with smallest sigma, and then returns mean and
606///sigma of this subset
607
609{
610 if (hh==0)
611 hh=(nvectors+2)/2;
612 Double_t faclts[]={2.6477,2.5092,2.3826,2.2662,2.1587,2.0589,1.9660,1.879,1.7973,1.7203,1.6473};
615
617 nquant=TMath::Min(Int_t(Double_t(((hh*1./nvectors)-0.5)*40))+1, 11);
618 Double_t factor=faclts[nquant-1];
619
622 Double_t sq=0;
623 Double_t sqmin=0;
624 Int_t ndup=0;
627 for(Int_t i=0; i<len; i++)
628 slutn[i]=0;
629 for(Int_t jint=0; jint<len; jint++) {
630 aw[jint]=0;
631 for (Int_t j=0; j<hh; j++) {
632 aw[jint]+=data[index[j+jint]];
633 if(jint==0)
634 sq+=data[index[j]]*data[index[j]];
635 }
636 aw2[jint]=aw[jint]*aw[jint]/hh;
637
638 if(jint==0) {
639 sq=sq-aw2[jint];
640 sqmin=sq;
641 slutn[ndup]=aw[jint];
642
643 } else {
644 sq=sq - data[index[jint-1]]*data[index[jint-1]]+
646 aw2[jint]+aw2[jint-1];
647 if(sq<sqmin) {
648 ndup=0;
649 sqmin=sq;
650 slutn[ndup]=aw[jint];
651
652 } else {
653 if(sq==sqmin) {
654 ndup++;
655 slutn[ndup]=aw[jint];
656 }
657 }
658 }
659 }
660
661 slutn[0]=slutn[Int_t((ndup)/2)]/hh;
663 mean=slutn[0];
664 sigma=bstd;
665 delete [] aw;
666 delete [] aw2;
667 delete [] slutn;
668 delete [] index;
669}
670
671////////////////////////////////////////////////////////////////////////////////
672///returns the breakdown point of the algorithm
673
675{
676 Int_t n;
677 n=(fN-fH+1)/fN;
678 return n;
679}
680
681////////////////////////////////////////////////////////////////////////////////
682///returns the chi2 quantiles
683
685{
686 if (i < 0 || i >= 50) return 0;
687 return kChiQuant[i];
688}
689
690////////////////////////////////////////////////////////////////////////////////
691///returns the covariance matrix
692
694{
695 if (matr.GetNrows()!=fNvar || matr.GetNcols()!=fNvar){
696 Warning("GetCovariance","provided matrix is of the wrong size, it will be resized");
697 matr.ResizeTo(fNvar, fNvar);
698 }
699 matr=fCovariance;
700}
701
702////////////////////////////////////////////////////////////////////////////////
703///returns the correlation matrix
704
706{
707 if (matr.GetNrows()!=fNvar || matr.GetNcols()!=fNvar) {
708 Warning("GetCorrelation","provided matrix is of the wrong size, it will be resized");
709 matr.ResizeTo(fNvar, fNvar);
710 }
711 matr=fCorrelation;
712}
713
714////////////////////////////////////////////////////////////////////////////////
715///if the points are on a hyperplane, returns this hyperplane
716
718{
719 if (fExact==0) {
720 Error("GetHyperplane","the data doesn't lie on a hyperplane!\n");
721 return nullptr;
722 } else {
723 return &fHyperplane;
724 }
725}
726
727////////////////////////////////////////////////////////////////////////////////
728///if the points are on a hyperplane, returns this hyperplane
729
731{
732 if (fExact==0){
733 Error("GetHyperplane","the data doesn't lie on a hyperplane!\n");
734 return;
735 }
736 if (vec.GetNoElements()!=fNvar) {
737 Warning("GetHyperPlane","provided vector is of the wrong size, it will be resized");
738 vec.ResizeTo(fNvar);
739 }
741}
742
743////////////////////////////////////////////////////////////////////////////////
744///return the estimate of the mean
745
747{
748 if (means.GetNoElements()!=fNvar) {
749 Warning("GetMean","provided vector is of the wrong size, it will be resized");
750 means.ResizeTo(fNvar);
751 }
752 means=fMean;
753}
754
755////////////////////////////////////////////////////////////////////////////////
756///returns the robust distances (helps to find outliers)
757
759{
760 if (rdist.GetNoElements()!=fN) {
761 Warning("GetRDistances","provided vector is of the wrong size, it will be resized");
762 rdist.ResizeTo(fN);
763 }
764 rdist=fRd;
765}
766
767////////////////////////////////////////////////////////////////////////////////
768///returns the number of outliers
769
771{
772 return fOut.GetSize();
773}
774
775////////////////////////////////////////////////////////////////////////////////
776///update the sscp matrix with vector vec
777
779{
780 Int_t i, j;
781 for (j=1; j<fNvar+1; j++) {
782 sscp(0, j) +=vec(j-1);
783 sscp(j, 0) = sscp(0, j);
784 }
785 for (i=1; i<fNvar+1; i++) {
786 for (j=1; j<fNvar+1; j++) {
787 sscp(i, j) += vec(i-1)*vec(j-1);
788 }
789 }
790}
791
792////////////////////////////////////////////////////////////////////////////////
793///clear the sscp matrix, used for covariance and mean calculation
794
796{
797 for (Int_t i=0; i<fNvar+1; i++) {
798 for (Int_t j=0; j<fNvar+1; j++) {
799 sscp(i, j)=0;
800 }
801 }
802}
803
804////////////////////////////////////////////////////////////////////////////////
805///called when h=n. Returns classic covariance matrix
806///and mean
807
809{
810 TMatrixD sscp(fNvar+1, fNvar+1);
811 TVectorD temp(fNvar);
813 for (Int_t i=0; i<fN; i++) {
814 for (Int_t j=0; j<fNvar; j++)
815 temp(j)=fData(i, j);
816 AddToSscp(sscp, temp);
817 }
819 Correl();
820
821}
822
823////////////////////////////////////////////////////////////////////////////////
824///calculates mean and covariance
825
827{
828 Int_t i, j;
829 Double_t f;
830 for (i=0; i<fNvar; i++) {
831 m(i)=sscp(0, i+1);
832 sd[i]=sscp(i+1, i+1);
833 f=(sd[i]-m(i)*m(i)/nvec)/(nvec-1);
834 if (f>1e-14) sd[i]=TMath::Sqrt(f);
835 else sd[i]=0;
836 m(i)/=nvec;
837 }
838 for (i=0; i<fNvar; i++) {
839 for (j=0; j<fNvar; j++) {
840 cov(i, j)=sscp(i+1, j+1)-nvec*m(i)*m(j);
841 cov(i, j)/=nvec-1;
842 }
843 }
844}
845
846////////////////////////////////////////////////////////////////////////////////
847///transforms covariance matrix into correlation matrix
848
850{
851 Int_t i, j;
853 for(j=0; j<fNvar; j++)
854 sd[j]=1./TMath::Sqrt(fCovariance(j, j));
855 for(i=0; i<fNvar; i++) {
856 for (j=0; j<fNvar; j++) {
857 if (i==j)
858 fCorrelation(i, j)=1.;
859 else
860 fCorrelation(i, j)=fCovariance(i, j)*sd[i]*sd[j];
861 }
862 }
863 delete [] sd;
864}
865
866////////////////////////////////////////////////////////////////////////////////
867///creates a subset of htotal elements from ntotal elements
868///first, p+1 elements are drawn randomly(without repetitions)
869///if their covariance matrix is singular, more elements are
870///added one by one, until their covariance matrix becomes regular
871///or it becomes clear that htotal observations lie on a hyperplane
872///If covariance matrix determinant!=0, distances of all ntotal elements
873///are calculated, using formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where
874///M is mean and S_inv is the inverse of the covariance matrix
875///htotal points with smallest distances are included in the returned subset.
876
878{
879 Double_t kEps = 1e-14;
880 Int_t i, j;
882 Int_t nindex=0;
883 Int_t num;
884 for(i=0; i<ntotal; i++)
885 index[i]=ntotal+1;
886
887 for (i=0; i<p+1; i++) {
888 num=Int_t(gRandom->Uniform(0, 1)*(ntotal-1));
889 if (i>0){
890 for(j=0; j<=i-1; j++) {
891 if(index[j]==num)
893 }
894 }
895 if(repeat==kTRUE) {
896 i--;
898 } else {
899 index[i]=num;
900 nindex++;
901 }
902 }
903
905
908 for (i=0; i<p+1; i++) {
909 for (j=0; j<fNvar; j++) {
910 vec[j]=data[index[i]][j];
911
912 }
914 }
915
918 while((det<kEps)&&(nindex < htotal)) {
919 //if covariance matrix is singular,another vector is added until
920 //the matrix becomes regular or it becomes clear that all
921 //vectors of the group lie on a hyperplane
923 do{
924 num=Int_t(gRandom->Uniform(0,1)*(ntotal-1));
926 for(i=0; i<nindex; i++) {
927 if(index[i]==num) {
929 break;
930 }
931 }
932 }while(repeat==kTRUE);
933
934 index[nindex]=num;
935 nindex++;
936 //check if covariance matrix is singular
937 for(j=0; j<fNvar; j++)
938 vec[j]=data[index[nindex-1]][j];
942 }
943
944 if(nindex!=htotal) {
946 fInvcovariance = chol.Invert();
947
948 TVectorD temp(fNvar);
949 for(j=0; j<ntotal; j++) {
950 ndist[j]=0;
951 for(i=0; i<fNvar; i++)
952 temp[i]=data[j][i] - fMean(i);
953 temp*=fInvcovariance;
954 for(i=0; i<fNvar; i++)
955 ndist[j]+=(data[j][i]-fMean(i))*temp[i];
956 }
958 }
959
960}
961
962////////////////////////////////////////////////////////////////////////////////
963///creates a subset of hmerged vectors with smallest orthogonal distances to the hyperplane
964///hyp[1]*(x1-mean[1])+...+hyp[nvar]*(xnvar-mean[nvar])=0
965///This function is called in case when less than fH samples lie on a hyperplane.
966
968{
969 Int_t i, j;
971 for (i=0; i<nmerged; i++) {
972 ndist[i]=0;
973 for(j=0; j<fNvar; j++) {
974 ndist[i]+=fHyperplane[j]*(dat[i][j]-fMean[j]);
975 ndist[i]=TMath::Abs(ndist[i]);
976 }
977 }
980 for (i=0; i<hmerged; i++) {
981 for(j=0; j<fNvar; j++)
982 vec[j]=dat[index[i]][j];
984 }
986}
987
988////////////////////////////////////////////////////////////////////////////////
989///from the input htotal-subset constructs another htotal subset with lower determinant
990///
991///As proven by Peter J.Rousseeuw and Katrien Van Driessen, if distances for all elements
992///are calculated, using the formula:d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean
993///of the input htotal-subset, and S_inv - the inverse of its covariance matrix, then
994///htotal elements with smallest distances will have covariance matrix with determinant
995///less or equal to the determinant of the input subset covariance matrix.
996///
997///determinant for this htotal-subset with smallest distances is returned
998
1000{
1001 Int_t i, j;
1003 Double_t det;
1004
1006 fInvcovariance = chol.Invert();
1007
1008 TVectorD temp(fNvar);
1009 for(j=0; j<ntotal; j++) {
1010 ndist[j]=0;
1011 for(i=0; i<fNvar; i++)
1012 temp[i]=data[j][i]-fMean[i];
1013 temp*=fInvcovariance;
1014 for(i=0; i<fNvar; i++)
1015 ndist[j]+=(data[j][i]-fMean[i])*temp[i];
1016 }
1017
1018 //taking h smallest
1020 //writing their mean and covariance
1021 ClearSscp(sscp);
1022 for (i=0; i<htotal; i++) {
1023 for (j=0; j<fNvar; j++)
1024 temp[j]=data[index[i]][j];
1025 AddToSscp(sscp, temp);
1026 }
1029 return det;
1030}
1031
1032////////////////////////////////////////////////////////////////////////////////
1033///for the exact fit situations
1034///returns number of observations on the hyperplane
1035
1037{
1038 Int_t i, j;
1039
1041 TMatrixD eigenMatrix=eigen.GetEigenVectors();
1042
1043 for (j=0; j<fNvar; j++) {
1045 }
1046 //calculate and return how many observations lie on the hyperplane
1047 for (i=0; i<fN; i++) {
1048 ndist[i]=0;
1049 for(j=0; j<fNvar; j++) {
1050 ndist[i]+=fHyperplane[j]*(fData[i][j]-fMean[j]);
1051 ndist[i]=TMath::Abs(ndist[i]);
1052 }
1053 }
1054 Int_t nhyp=0;
1055
1056 for (i=0; i<fN; i++) {
1057 if(ndist[i] < 1e-14) nhyp++;
1058 }
1059 return nhyp;
1060
1061}
1062
1063////////////////////////////////////////////////////////////////////////////////
1064///This function is called if determinant of the covariance matrix of a subset=0.
1065///
1066///If there are more then fH vectors on a hyperplane,
1067///returns this hyperplane and stops
1068///else stores the hyperplane coordinates in hyperplane matrix
1069
1073{
1074 Int_t i, j;
1075
1079 //now nh is the number of observation on the hyperplane
1080 //ndist stores distances of observation from this hyperplane
1081 if(nh>=fH) {
1082 ClearSscp(sscp);
1083 for (i=0; i<fN; i++) {
1084 if(ndist[i]<1e-14) {
1085 for (j=0; j<fNvar; j++)
1086 vec[j]=fData[i][j];
1087 AddToSscp(sscp, vec);
1088 }
1089 }
1091
1092 fExact=nh;
1093 return nbest+1;
1094
1095 } else {
1096 //if less than fH observations lie on a hyperplane,
1097 //mean and covariance matrix are stored in mstockbig
1098 //and cstockbig in place of the previous maximum determinant
1099 //mean and covariance
1100 for(i=0; i<fNvar; i++) {
1103 for(j=0; j<fNvar; j++) {
1105 }
1106 }
1107 return maxind;
1108 }
1109}
1110
1111
1112////////////////////////////////////////////////////////////////////////////////
1113///divides the elements into approximately equal subgroups
1114///number of elements in each subgroup is stored in indsubdat
1115///number of subgroups is returned
1116
1118{
1119 Int_t nsub;
1120 if ((fN>=2*nmini) && (fN<=(3*nmini-1))) {
1121 if (fN%2==1){
1122 indsubdat[0]=Int_t(fN*0.5);
1123 indsubdat[1]=Int_t(fN*0.5)+1;
1124 } else
1125 indsubdat[0]=indsubdat[1]=Int_t(fN/2);
1126 nsub=2;
1127 }
1128 else{
1129 if((fN>=3*nmini) && (fN<(4*nmini -1))) {
1130 if(fN%3==0){
1131 indsubdat[0]=indsubdat[1]=indsubdat[2]=Int_t(fN/3);
1132 } else {
1133 indsubdat[0]=Int_t(fN/3);
1134 indsubdat[1]=Int_t(fN/3)+1;
1135 if (fN%3==1) indsubdat[2]=Int_t(fN/3);
1136 else indsubdat[2]=Int_t(fN/3)+1;
1137 }
1138 nsub=3;
1139 }
1140 else{
1141 if((fN>=4*nmini)&&(fN<=(5*nmini-1))){
1142 if (fN%4==0) indsubdat[0]=indsubdat[1]=indsubdat[2]=indsubdat[3]=Int_t(fN/4);
1143 else {
1144 indsubdat[0]=Int_t(fN/4);
1145 indsubdat[1]=Int_t(fN/4)+1;
1146 if(fN%4==1) indsubdat[2]=indsubdat[3]=Int_t(fN/4);
1147 if(fN%4==2) {
1148 indsubdat[2]=Int_t(fN/4)+1;
1149 indsubdat[3]=Int_t(fN/4);
1150 }
1151 if(fN%4==3) indsubdat[2]=indsubdat[3]=Int_t(fN/4)+1;
1152 }
1153 nsub=4;
1154 } else {
1155 for(Int_t i=0; i<5; i++)
1156 indsubdat[i]=nmini;
1157 nsub=5;
1158 }
1159 }
1160 }
1161 return nsub;
1162}
1163
1164////////////////////////////////////////////////////////////////////////////////
1165///Calculates robust distances.Then the samples with robust distances
1166///greater than a cutoff value (0.975 quantile of chi2 distribution with
1167///fNvar degrees of freedom, multiplied by a correction factor), are given
1168///weiht=0, and new, reweighted estimates of location and scatter are calculated
1169///The function returns the number of outliers.
1170
1172{
1173 Int_t i, j;
1174 Int_t nout=0;
1175
1176 TVectorD temp(fNvar);
1178 fInvcovariance = chol.Invert();
1179
1180
1181 for (i=0; i<fN; i++) {
1182 fRd[i]=0;
1183 for(j=0; j<fNvar; j++) {
1184 temp[j]=fData[i][j]-fMean[j];
1185 }
1186 temp*=fInvcovariance;
1187 for(j=0; j<fNvar; j++) {
1188 fRd[i]+=(fData[i][j]-fMean[j])*temp[j];
1189 }
1190 }
1191
1192 Double_t med;
1194
1196 med/=chi;
1199 fInvcovariance = chol2.Invert();
1200
1201 for (i=0; i<fN; i++) {
1202 fRd[i]=0;
1203 for(j=0; j<fNvar; j++) {
1204 temp[j]=fData[i][j]-fMean[j];
1205 }
1206
1207 temp*=fInvcovariance;
1208 for(j=0; j<fNvar; j++) {
1209 fRd[i]+=(fData[i][j]-fMean[j])*temp[j];
1210 }
1211 }
1212
1214
1215 ClearSscp(sscp);
1216 for(i=0; i<fN; i++) {
1217 if (fRd[i]<=cutoff) {
1218 for(j=0; j<fNvar; j++)
1219 temp[j]=fData[i][j];
1220 AddToSscp(sscp,temp);
1221 } else {
1222 nout++;
1223 }
1224 }
1225
1227 return nout;
1228}
1229
1230////////////////////////////////////////////////////////////////////////////////
1231///Draws ngroup nonoverlapping subdatasets out of a dataset of size n
1232///such that the selected case numbers are uniformly distributed from 1 to n
1233
1235{
1236 Int_t jndex = 0;
1237 Int_t nrand;
1238 Int_t i, k, m, j;
1239 for (k=1; k<=ngroup; k++) {
1240 for (m=1; m<=indsubdat[k-1]; m++) {
1241 nrand = Int_t(gRandom->Uniform(0, 1) * double(fN-jndex))+1;
1242 //printf("nrand = %d - jndex %d\n",nrand,jndex);
1243 jndex++;
1244 if (jndex==1) {
1245 subdat[0]=nrand-1; // in case nrand is equal to fN
1246 } else {
1247 subdat[jndex-1]=nrand+jndex-2;
1248 for (i=1; i<=jndex-1; i++) {
1249 if(subdat[i-1] > nrand+i-2) {
1250 for(j=jndex; j>=i+1; j--) {
1251 subdat[j-1]=subdat[j-2];
1252 }
1253 //printf("subdata[] i = %d - nrand %d\n",i,nrand);
1254 subdat[i-1]=nrand+i-2;
1255 break; //breaking the loop for(i=1...
1256 }
1257 }
1258 }
1259 }
1260 }
1261}
1262
1263////////////////////////////////////////////////////////////////////////////////
1264///because I need an Int_t work array
1265
1268 const Int_t kWorkMax=100;
1269 Int_t i, ir, j, l, mid;
1270 Int_t arr;
1271 Int_t *ind;
1272 Int_t workLocal[kWorkMax];
1273 Int_t temp;
1274
1275
1276 if (work) {
1277 ind = work;
1278 } else {
1279 ind = workLocal;
1280 if (ntotal > kWorkMax) {
1282 ind = new Int_t[ntotal];
1283 }
1284 }
1285
1286 for (Int_t ii=0; ii<ntotal; ii++) {
1287 ind[ii]=ii;
1288 }
1289 Int_t rk = k;
1290 l=0;
1291 ir = ntotal-1;
1292 for(;;) {
1293 if (ir<=l+1) { //active partition contains 1 or 2 elements
1294 if (ir == l+1 && a[ind[ir]]<a[ind[l]])
1295 {temp = ind[l]; ind[l]=ind[ir]; ind[ir]=temp;}
1296 Double_t tmp = a[ind[rk]];
1297 if (isAllocated)
1298 delete [] ind;
1299 return tmp;
1300 } else {
1301 mid = (l+ir) >> 1; //choose median of left, center and right
1302 {temp = ind[mid]; ind[mid]=ind[l+1]; ind[l+1]=temp;}//elements as partitioning element arr.
1303 if (a[ind[l]]>a[ind[ir]]) //also rearrange so that a[l]<=a[l+1]
1304 {temp = ind[l]; ind[l]=ind[ir]; ind[ir]=temp;}
1305
1306 if (a[ind[l+1]]>a[ind[ir]])
1307 {temp=ind[l+1]; ind[l+1]=ind[ir]; ind[ir]=temp;}
1308
1309 if (a[ind[l]]>a[ind[l+1]])
1310 {temp = ind[l]; ind[l]=ind[l+1]; ind[l+1]=temp;}
1311
1312 i=l+1; //initialize pointers for partitioning
1313 j=ir;
1314 arr = ind[l+1];
1315 for (;;) {
1316 do i++; while (a[ind[i]]<a[arr]);
1317 do j--; while (a[ind[j]]>a[arr]);
1318 if (j<i) break; //pointers crossed, partitioning complete
1319 {temp=ind[i]; ind[i]=ind[j]; ind[j]=temp;}
1320 }
1321 ind[l+1]=ind[j];
1322 ind[j]=arr;
1323 if (j>=rk) ir = j-1; //keep active the partition that
1324 if (j<=rk) l=i; //contains the k_th element
1325 }
1326 }
1327}
#define f(i)
Definition RSha256.hxx:104
#define a(i)
Definition RSha256.hxx:99
#define e(i)
Definition RSha256.hxx:103
int Int_t
Signed integer 4 bytes (int)
Definition RtypesCore.h:59
constexpr Bool_t kFALSE
Definition RtypesCore.h:108
double Double_t
Double 8 bytes.
Definition RtypesCore.h:73
constexpr Bool_t kTRUE
Definition RtypesCore.h:107
ROOT::Detail::TRangeCast< T, true > TRangeDynCast
TRangeDynCast is an adapter class that allows the typed iteration through a TCollection.
#define R__ASSERT(e)
Checks condition e and reports a fatal error if it's false.
Definition TError.h:125
winID h TVirtualViewer3D TVirtualGLPainter p
Option_t Option_t TPoint TPoint const char GetTextMagnitude GetFillStyle GetLineColor GetLineWidth GetMarkerStyle GetTextAlign GetTextColor GetTextSize void data
Option_t Option_t TPoint TPoint const char GetTextMagnitude GetFillStyle GetLineColor GetLineWidth GetMarkerStyle GetTextAlign GetTextColor GetTextSize void char Point_t Rectangle_t WindowAttributes_t index
Option_t Option_t TPoint TPoint const char GetTextMagnitude GetFillStyle GetLineColor GetLineWidth GetMarkerStyle GetTextAlign GetTextColor GetTextSize void char Point_t Rectangle_t WindowAttributes_t Float_t Float_t Float_t Int_t Int_t UInt_t UInt_t Rectangle_t Int_t Int_t Window_t TString Int_t GCValues_t GetPrimarySelectionOwner GetDisplay GetScreen GetColormap GetNativeEvent const char const char dpyName wid window const char font_name cursor keysym reg const char only_if_exist regb h Point_t winding char text const char depth char const char Int_t count const char ColorStruct_t color const char Pixmap_t Pixmap_t PictureAttributes_t attr const char char ret_data h unsigned char height h Atom_t Int_t ULong_t ULong_t unsigned char prop_list Atom_t Atom_t Atom_t Time_t UChar_t len
R__EXTERN TRandom * gRandom
Definition TRandom.h:62
const Double_t kChiMedian[50]
const Double_t kChiQuant[50]
void Set(Int_t n) override
Set size of this array to n ints.
Definition TArrayI.cxx:104
Int_t GetSize() const
Definition TArray.h:47
Cholesky Decomposition class.
Definition TDecompChol.h:25
TMatrixDSymEigen.
Int_t GetNrows() const
Int_t GetNcols() const
TMatrixTBase< Element > & ResizeTo(Int_t nrows, Int_t ncols, Int_t=-1) override
Set size of the matrix to nrows x ncols New dynamic elements are created, the overlapping part of the...
Double_t Determinant() const override
TMatrixTBase< Element > & ResizeTo(Int_t nrows, Int_t ncols, Int_t=-1) override
Set size of the matrix to nrows x ncols New dynamic elements are created, the overlapping part of the...
virtual void Warning(const char *method, const char *msgfmt,...) const
Issue warning message.
Definition TObject.cxx:1057
virtual void Error(const char *method, const char *msgfmt,...) const
Issue error message.
Definition TObject.cxx:1071
virtual Double_t Uniform(Double_t x1=1)
Returns a uniform deviate on the interval (0, x1).
Definition TRandom.cxx:681
Int_t RDist(TMatrixD &sscp)
Calculates robust distances.Then the samples with robust distances greater than a cutoff value (0....
TMatrixDSym fCovariance
Double_t CStep(Int_t ntotal, Int_t htotal, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist)
from the input htotal-subset constructs another htotal subset with lower determinant
const TMatrixDSym * GetCovariance() const
TMatrixDSym fInvcovariance
void CreateSubset(Int_t ntotal, Int_t htotal, Int_t p, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist)
creates a subset of htotal elements from ntotal elements first, p+1 elements are drawn randomly(witho...
void Covar(TMatrixD &sscp, TVectorD &m, TMatrixDSym &cov, TVectorD &sd, Int_t nvec)
calculates mean and covariance
Int_t Exact(Double_t *ndist)
for the exact fit situations returns number of observations on the hyperplane
Double_t KOrdStat(Int_t ntotal, Double_t *arr, Int_t k, Int_t *work)
because I need an Int_t work array
Int_t GetNOut()
returns the number of outliers
TRobustEstimator()
this constructor should be used in a univariate case: first call this constructor,...
void Correl()
transforms covariance matrix into correlation matrix
void AddColumn(Double_t *col)
adds a column to the data matrix it is assumed that the column has size fN variable fVarTemp keeps th...
void RDraw(Int_t *subdat, Int_t ngroup, Int_t *indsubdat)
Draws ngroup nonoverlapping subdatasets out of a dataset of size n such that the selected case number...
void Classic()
called when h=n.
void Evaluate()
Finds the estimate of multivariate mean and variance.
const TVectorD * GetMean() const
Int_t Exact2(TMatrixD &mstockbig, TMatrixD &cstockbig, TMatrixD &hyperplane, Double_t *deti, Int_t nbest, Int_t kgroup, TMatrixD &sscp, Double_t *ndist)
This function is called if determinant of the covariance matrix of a subset=0.
const TVectorD * GetRDistances() const
Int_t GetBDPoint()
returns the breakdown point of the algorithm
void CreateOrtSubset(TMatrixD &dat, Int_t *index, Int_t hmerged, Int_t nmerged, TMatrixD &sscp, Double_t *ndist)
creates a subset of hmerged vectors with smallest orthogonal distances to the hyperplane hyp[1]*(x1-m...
void AddRow(Double_t *row)
adds a vector to the data matrix it is supposed that the vector is of size fNvar
TMatrixDSym fCorrelation
const TVectorD * GetHyperplane() const
if the points are on a hyperplane, returns this hyperplane
Int_t Partition(Int_t nmini, Int_t *indsubdat)
divides the elements into approximately equal subgroups number of elements in each subgroup is stored...
const TMatrixDSym * GetCorrelation() const
Double_t GetChiQuant(Int_t i) const
returns the chi2 quantiles
void EvaluateUni(Int_t nvectors, Double_t *data, Double_t &mean, Double_t &sigma, Int_t hh=0)
for the univariate case estimates of location and scatter are returned in mean and sigma parameters t...
void AddToSscp(TMatrixD &sscp, TVectorD &vec)
update the sscp matrix with vector vec
void ClearSscp(TMatrixD &sscp)
clear the sscp matrix, used for covariance and mean calculation
TVectorT< Element > & ResizeTo(Int_t lwb, Int_t upb)
Resize the vector to [lwb:upb] .
Definition TVectorT.cxx:293
Element * GetMatrixArray()
Definition TVectorT.h:78
const Double_t sigma
const Int_t n
Definition legend1.C:16
Long64_t LocMin(Long64_t n, const T *a)
Returns index of array with the minimum element.
Definition TMath.h:993
Double_t Median(Long64_t n, const T *a, const Double_t *w=nullptr, Long64_t *work=nullptr)
Same as RMS.
Definition TMath.h:1359
Long64_t LocMax(Long64_t n, const T *a)
Returns index of array with the maximum element.
Definition TMath.h:1099
Double_t Sqrt(Double_t x)
Returns the square root of x.
Definition TMath.h:673
Short_t Min(Short_t a, Short_t b)
Returns the smallest of a and b.
Definition TMathBase.h:199
void Sort(Index n, const Element *a, Index *index, Bool_t down=kTRUE)
Sort the n elements of the array a of generic templated type Element.
Definition TMathBase.h:432
Short_t Abs(Short_t d)
Returns the absolute value of parameter Short_t d.
Definition TMathBase.h:124
TMarker m
Definition textangle.C:8
TLine l
Definition textangle.C:4
static uint64_t sum(uint64_t i)
Definition Factory.cxx:2339