A helper class to prune a decision tree using the Cost Complexity method (see Classification and Regression Trees by Leo Breiman et al)
Some definitions:
- \( T_{max} \) - the initial, usually highly overtrained tree, that is to be pruned back
- \( R(T) \) - quality index (Gini, misclassification rate, or other) of a tree \( T \)
- \( \sim T \) - set of terminal nodes in \( T \)
- \( T' \) - the pruned subtree of \( T_max \) that has the best quality index \( R(T') \)
- \( \alpha \) - the prune strength parameter in Cost Complexity pruning \( (R_{\alpha}(T) = R(T) + \alpha*|\sim T|) \)
There are two running modes in CCPruner: (i) one may select a prune strength and prune back the tree \( T_{max}\) until the criterion:
\[ \alpha < \frac{R(T) - R(t)}{|\sim T_t| - 1} \]
is true for all nodes t in \( T \), or (ii) the algorithm finds the sequence of critical points \( \alpha_k < \alpha_{k+1} ... < \alpha_K \) such that \( T_K = root(T_{max}) \) and then selects the optimally-pruned subtree, defined to be the subtree with the best quality index for the validation sample.
Definition at line 61 of file CCPruner.h.