kd-tree and its implementation in TKDTree

Contents:
1. What is kd-tree
2. How to construct kd-tree - Pseudo code
3. Using TKDTree
   a. Creating the kd-tree and setting the data
   b. Navigating the kd-tree
4. TKDTree implementation - technical details
   a. The order of nodes in internal arrays
   b. Division algorithm
   c. The order of nodes in boundary related arrays

1. What is kd-tree? ( http://en.wikipedia.org/wiki/Kd-tree )

In computer science, a kd-tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. kd-trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbour searches). kd-trees are a special case of BSP trees.

A kd-tree uses only splitting planes that are perpendicular to one of the coordinate system axes. This differs from BSP trees, in which arbitrary splitting planes can be used. In addition, in the typical definition every node of a kd-tree, from the root to the leaves, stores a point. This differs from BSP trees, in which leaves are typically the only nodes that contain points (or other geometric primitives). As a consequence, each splitting plane must go through one of the points in the kd-tree. There are also kd-tree variants that store data only in leaf nodes.

2. Constructing a classical kd-tree (Pseudo code)

Since there are many possible ways to choose axis-aligned splitting planes, there are many different ways to construct kd-trees. The canonical method of kd-tree construction has the following constraints:

* As one moves down the tree, one cycles through the axes used to select the splitting planes. (For example, the root would have an x-aligned plane, the root's children would both have y-aligned planes, the root's grandchildren would all have z-aligned planes, and so on.)
* At each step, the point selected to create the splitting plane is the median of the points being put into the kd-tree, with respect to their coordinates in the axis being used. (Note the assumption that we feed the entire set of points into the algorithm up-front.)

This method leads to a balanced kd-tree, in which each leaf node is about the same distance from the root. However, balanced trees are not necessarily optimal for all applications.

The following pseudo-code illustrates this canonical construction procedure (NOTE: the procedure used by the TKDTree class is a bit different; the pseudo-code is given only as a simple illustration of the concept):

    function kdtree (list of points pointList, int depth)
    {
        if pointList is empty
            return nil;
        else
        {
            // Select axis based on depth so that axis cycles through all valid values
            var int axis := depth mod k;

            // Sort point list and choose median as pivot element
            select median from pointList;

            // Create node and construct subtrees
            var tree_node node;
            node.location := median;
            node.leftChild := kdtree(points in pointList before median, depth+1);
            node.rightChild := kdtree(points in pointList after median, depth+1);
            return node;
        }
    }

Our construction method is optimized to save memory, and differs a bit from the constraints above. In particular, the division axis is chosen as the one with the biggest spread, and the splitting point is chosen so that one of the two subtrees contains exactly 2^k terminal nodes and is a perfectly balanced binary tree, while keeping the numbers of terminal nodes in the two subtrees as close as possible. The following section gives more details about our implementation.

3. Using TKDTree

3a. Creating the tree and setting the data

The part of the TKDTree interface that sets the input data was designed to be easy to use together with the TTree::Draw() functions. That is why the data has to be provided column-wise.
For example:

    {
        TTree *datatree = ...
        datatree->Draw("x:y:z", "selection", "goff");
        //now make a kd-tree on the drawn variables
        TKDTreeID *kdtree = new TKDTreeID(npoints, 3, 1);
        kdtree->SetData(0, datatree->GetV1());
        kdtree->SetData(1, datatree->GetV2());
        kdtree->SetData(2, datatree->GetV3());
        kdtree->Build();
    }

NOTE: this implementation of the kd-tree doesn't support adding new points after the tree has been built.

Of course, it's not necessary to use TTree::Draw(). What is important is to have the data column-wise. An example with regular arrays:

    {
        Int_t npoints = 100000;
        Int_t ndim = 3;
        Int_t bsize = 1;
        Double_t xmin = -0.5;
        Double_t xmax = 0.5;
        Double_t *data0 = new Double_t[npoints];
        Double_t *data1 = new Double_t[npoints];
        Double_t *data2 = new Double_t[npoints];
        Double_t *y = new Double_t[npoints];
        for (Int_t i=0; i<npoints; i++){
            data0[i]=gRandom->Uniform(xmin, xmax);
            data1[i]=gRandom->Uniform(xmin, xmax);
            data2[i]=gRandom->Uniform(xmin, xmax);
        }
        TKDTreeID *kdtree = new TKDTreeID(npoints, ndim, bsize);
        kdtree->SetData(0, data0);
        kdtree->SetData(1, data1);
        kdtree->SetData(2, data2);
        kdtree->Build();
    }

By default, the kd-tree doesn't own the data and doesn't delete it when the tree itself is deleted. If you want the data to be deleted together with the kd-tree, call TKDTree::SetOwner(kTRUE).

Most functions of the kd-tree don't require the original data to be present after the tree has been built. Check the documentation of the individual functions for more details.

3b. Navigating the kd-tree

Nodes of the tree are indexed top to bottom, left to right. The root node has index 0. The functions TKDTree::GetLeft(Index inode), TKDTree::GetRight(Index inode) and TKDTree::GetParent(Index inode) return the children and the parent of a given node.

For a given node, one can find the indexes of the original points contained in this node by calling the GetNodePointsIndexes() function.
Additionally, for terminal nodes, there is a function GetPointsIndexes(Index inode) that returns a pointer to the relevant part of the index array. To find the number of points in a node (not only a terminal one), call TKDTree::GetNPointsNode(Index inode).

4. TKDTree implementation details - internal information, not needed to use the kd-tree.

4a. Order of nodes in the node information arrays:

TKDTree is optimized to minimize memory consumption. Nodes of the TKDTree do not store pointers to the left and right children or to the parent node. Instead, there are several 1-d arrays of size fNNodes with information about the nodes. The order of the node information in the arrays is described below. Understanding this order is important if your class needs to store additional per-node information, for example the fit function parameters.

Drawback: Insertion into the TKDTree is not supported.
Advantage: Random access is supported.

As noted above, the construction of the kd-tree involves choosing the axis and the point on that axis to divide the remaining points approximately in half. The exact algorithm for choosing the division point is described in the next section. The sequence of divisions is recorded in the following arrays:

    fAxis[fNNodes]  - Division axis (0,1,2,3 ...)
    fValue[fNNodes] - Division value

Given the index of a node in those arrays, it's easy to find the indices corresponding to the children nodes or the parent node.

Suppose that the parent node is stored under the index inode. Then:

    Left child index  = inode*2+1
    Right child index = (inode+1)*2

Suppose that the child node is stored under the index inode.
Then:

    Parent index = (inode-1)/2   (integer division)

Number of division nodes and number of terminals: fNNodes = (fNPoints/fBucketSize)

The nodes are always filled from the left side to the right side. Let inode be the index of a node, and irow the index of a row. The TKDTree then looks the following way:

Ideal case: number of _terminal_ nodes = 2^N, N=3

               INode
    irow 0     0                            - 1 inode
    irow 1     1  2                         - 2 inodes
    irow 2     3  4  5  6                   - 4 inodes
    irow 3     7  8  9  10 11 12 13 14      - 8 inodes

Non ideal case: number of _terminal_ nodes = 2^N+k, N=3, k=1

               INode
    irow 0     0                            - 1 inode
    irow 1     1  2                         - 2 inodes
    irow 2     3  4  5  6                   - 4 inodes
    irow 3     7  8  9  10 11 12 13 14      - 8 inodes
    irow 4     15 16                        - 2 inodes

4b. The division algorithm:

As described above, the kd-tree is built by repeatedly dividing the given set of points into 2 smaller sets. The cut is made on the axis with the biggest spread, and the value on that axis at which the cut is performed is chosen based on the following formula.

Suppose we want to divide n nodes into 2 groups, left and right. Then the left and right groups will have the following numbers of nodes:

    n = 2^k + rest

    Left  = 2^(k-1) + ((rest > 2^(k-2)) ? 2^(k-2) : rest)
    Right = 2^(k-1) + ((rest > 2^(k-2)) ? rest - 2^(k-2) : 0)

For example, let n_nodes=67. Then the closest 2^k=64, 2^(k-1)=32, 2^(k-2)=16. The left node gets 32+3=35 sub-nodes, and the right node gets 32 sub-nodes.

The division process continues until no node contains more than a predefined number of points.

4c. The order of nodes in boundary-related arrays

Some kd-tree based algorithms need to know the boundaries of each node. This information can be computed by calling the TKDTree::MakeBoundaries() function. It fills the following arrays:

    fRange      : array containing the boundaries of the domain:
                  | 1st dimension (min + max) | 2nd dimension (min + max) | ...
    fBoundaries : nodes boundaries:
                  | 1st node {1st dim * 2 elements | 2nd dim * 2 elements | ...} | 2nd node {...} | ...

The nodes are arranged in the order described in section 4a.
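The index arithmetic behind the node arrays and the boundary arrays can be sketched in a few lines of plain C++. This is only an illustration under stated assumptions: the helper names (LeftChild, RightChild, Parent, BoundaryOffset) are hypothetical and are not TKDTree members, the parent formula is simply the inverse of the two child formulas, and the boundary offset assumes the flat per-node layout shown for fBoundaries (2 values per dimension, nodes stored one after another).

```cpp
#include <cassert>
#include <cstddef>

// Children of a node in the heap-style numbering (root has index 0).
inline int LeftChild(int inode)  { assert(inode >= 0); return 2 * inode + 1; }
inline int RightChild(int inode) { assert(inode >= 0); return 2 * (inode + 1); }

// Inverse of the two formulas above. The root has no parent; this sketch
// reports that as -1 (an assumption of the sketch).
inline int Parent(int inode) { return inode > 0 ? (inode - 1) / 2 : -1; }

// Position of the {min, max} pair of dimension idim for node inode in a flat
// array laid out as | node 0 {dim 0 min, max | dim 1 min, max | ...} | node 1 {...} | ...
inline std::size_t BoundaryOffset(int inode, int idim, int ndim) {
    assert(inode >= 0 && idim >= 0 && idim < ndim);
    return static_cast<std::size_t>(inode) * 2 * ndim
         + 2 * static_cast<std::size_t>(idim);
}
```

With these helpers, Parent(LeftChild(i)) and Parent(RightChild(i)) both give back i, and for ndim = 3 the minimum of dimension 1 of node 2 sits at offset 14, with the maximum right after it.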
Note: storing a TKDTree in a file together with the data it contains is not supported. One must store the data separately in a file (e.g. using a TTree) and then re-create the TKDTree from the data after reading it back from the file.
TKDTree<int,float>() | |
TKDTree<int,float>(int npoints, int ndim, UInt_t bsize) | |
TKDTree<int,float>(int npoints, int ndim, UInt_t bsize, float** data) | |
virtual | ~TKDTree<int,float>() |
void | TObject::AbstractMethod(const char* method) const |
virtual void | TObject::AppendPad(Option_t* option = "") |
virtual void | TObject::Browse(TBrowser* b) |
void | Build() |
static TClass* | Class() |
virtual const char* | TObject::ClassName() const |
virtual void | TObject::Clear(Option_t* = "") |
virtual TObject* | TObject::Clone(const char* newname = "") const |
virtual Int_t | TObject::Compare(const TObject* obj) const |
virtual void | TObject::Copy(TObject& object) const |
virtual void | TObject::Delete(Option_t* option = "") |
Double_t | Distance(const float* point, int ind, Int_t type = 2) const |
void | DistanceToNode(const float* point, int inode, float& min, float& max, Int_t type = 2) |
virtual Int_t | TObject::DistancetoPrimitive(Int_t px, Int_t py) |
virtual void | TObject::Draw(Option_t* option = "") |
virtual void | TObject::DrawClass() const |
virtual TObject* | TObject::DrawClone(Option_t* option = "") const |
virtual void | TObject::Dump() const |
virtual void | TObject::Error(const char* method, const char* msgfmt) const |
virtual void | TObject::Execute(const char* method, const char* params, Int_t* error = 0) |
virtual void | TObject::Execute(TMethod* method, TObjArray* params, Int_t* error = 0) |
virtual void | TObject::ExecuteEvent(Int_t event, Int_t px, Int_t py) |
virtual void | TObject::Fatal(const char* method, const char* msgfmt) const |
void | FindBNodeA(float* point, float* delta, Int_t& inode) |
void | FindInRange(float* point, float range, vector<int>& res) |
void | FindNearestNeighbors(const float* point, Int_t k, int* ind, float* dist) |
int | FindNode(const float* point) const |
virtual TObject* | TObject::FindObject(const char* name) const |
virtual TObject* | TObject::FindObject(const TObject* obj) const |
void | FindPoint(float* point, int& index, Int_t& iter) |
float* | GetBoundaries() |
float* | GetBoundariesExact() |
float* | GetBoundary(const Int_t node) |
float* | GetBoundaryExact(const Int_t node) |
int | GetBucketSize() |
Int_t | GetCrossNode() |
virtual Option_t* | TObject::GetDrawOption() const |
static Long_t | TObject::GetDtorOnly() |
virtual const char* | TObject::GetIconName() const |
int* | GetIndPoints() |
Int_t | GetLeft(Int_t inode) const |
virtual const char* | TObject::GetName() const |
int | GetNDim() |
Int_t | GetNNodes() const |
UChar_t | GetNodeAxis(Int_t id) const |
void | GetNodePointsIndexes(Int_t node, Int_t& first1, Int_t& last1, Int_t& first2, Int_t& last2) const |
float | GetNodeValue(Int_t id) const |
int | GetNPoints() |
int | GetNPointsNode(Int_t node) const |
virtual char* | TObject::GetObjectInfo(Int_t px, Int_t py) const |
static Bool_t | TObject::GetObjectStat() |
Int_t | GetOffset() |
virtual Option_t* | TObject::GetOption() const |
Int_t | GetParent(Int_t inode) const |
int* | GetPointsIndexes(Int_t node) const |
Int_t | GetRight(Int_t inode) const |
Int_t | GetRowT0() |
virtual const char* | TObject::GetTitle() const |
Int_t | GetTotalNodes() const |
virtual UInt_t | TObject::GetUniqueID() const |
virtual Bool_t | TObject::HandleTimer(TTimer* timer) |
virtual ULong_t | TObject::Hash() const |
virtual void | TObject::Info(const char* method, const char* msgfmt) const |
virtual Bool_t | TObject::InheritsFrom(const char* classname) const |
virtual Bool_t | TObject::InheritsFrom(const TClass* cl) const |
virtual void | TObject::Inspect() const |
void | TObject::InvertBit(UInt_t f) |
virtual TClass* | IsA() const |
virtual Bool_t | TObject::IsEqual(const TObject* obj) const |
virtual Bool_t | TObject::IsFolder() const |
Bool_t | TObject::IsOnHeap() const |
Int_t | IsOwner() |
virtual Bool_t | TObject::IsSortable() const |
Bool_t | IsTerminal(int inode) const |
Bool_t | TObject::IsZombie() const |
float | KOrdStat(int ntotal, float* a, int k, int* index) const |
virtual void | TObject::ls(Option_t* option = "") const |
void | MakeBoundaries(float* range = 0x0) |
void | MakeBoundariesExact() |
void | TObject::MayNotUse(const char* method) const |
virtual Bool_t | TObject::Notify() |
void | TObject::Obsolete(const char* method, const char* asOfVers, const char* removedFromVers) const |
static void | TObject::operator delete(void* ptr) |
static void | TObject::operator delete(void* ptr, void* vp) |
static void | TObject::operator delete[](void* ptr) |
static void | TObject::operator delete[](void* ptr, void* vp) |
void* | TObject::operator new(size_t sz) |
void* | TObject::operator new(size_t sz, void* vp) |
void* | TObject::operator new[](size_t sz) |
void* | TObject::operator new[](size_t sz, void* vp) |
virtual void | TObject::Paint(Option_t* option = "") |
virtual void | TObject::Pop() |
virtual void | TObject::Print(Option_t* option = "") const |
virtual Int_t | TObject::Read(const char* name) |
virtual void | TObject::RecursiveRemove(TObject* obj) |
void | TObject::ResetBit(UInt_t f) |
virtual void | TObject::SaveAs(const char* filename = "", Option_t* option = "") const |
virtual void | TObject::SavePrimitive(ostream& out, Option_t* option = "") |
void | TObject::SetBit(UInt_t f) |
void | TObject::SetBit(UInt_t f, Bool_t set) |
Int_t | SetData(int idim, float* data) |
void | SetData(int npoints, int ndim, UInt_t bsize, float** data) |
virtual void | TObject::SetDrawOption(Option_t* option = "") |
static void | TObject::SetDtorOnly(void* obj) |
static void | TObject::SetObjectStat(Bool_t stat) |
void | SetOwner(Int_t owner) |
virtual void | TObject::SetUniqueID(UInt_t uid) |
virtual void | ShowMembers(TMemberInspector&) |
void | Spread(int ntotal, float* a, int* index, float& min, float& max) const |
virtual void | Streamer(TBuffer&) |
void | StreamerNVirtual(TBuffer& ClassDef_StreamerNVirtual_b) |
virtual void | TObject::SysError(const char* method, const char* msgfmt) const |
Bool_t | TObject::TestBit(UInt_t f) const |
Int_t | TObject::TestBits(UInt_t f) const |
virtual void | TObject::UseCurrentStyle() |
virtual void | TObject::Warning(const char* method, const char* msgfmt) const |
virtual Int_t | TObject::Write(const char* name = 0, Int_t option = 0, Int_t bufsize = 0) |
virtual Int_t | TObject::Write(const char* name = 0, Int_t option = 0, Int_t bufsize = 0) const |
virtual void | TObject::DoError(int level, const char* location, const char* fmt, va_list va) const |
void | TObject::MakeZombie() |
TKDTree<int,float>(const TKDTree<int,float>&) | |
void | CookBoundaries(const Int_t node, Bool_t left) |
TKDTree<int,float>& | operator=(const TKDTree<int,float>&) |
void | UpdateNearestNeighbors(int inode, const float* point, Int_t kNN, int* ind, float* dist) |
void | UpdateRange(int inode, float* point, float range, vector<int>& res) |
enum TObject::EStatusBits { kCanDelete, kMustCleanup, kObjInCanvas, kIsReferenced, kHasUUID, kCannotPick, kNoContextMenu, kInvalidObject };
enum TObject::[unnamed] { kIsOnHeap, kNotDeleted, kZombie, kBitMask, kSingleKey, kOverwrite, kWriteDelete };
UChar_t* | fAxis | [fNNodes] nodes cutting axis |
float* | fBoundaries | ! nodes boundaries |
int | fBucketSize | size of the terminal nodes |
Int_t | fCrossNode | ! cross node - node that begins the last row (with terminal nodes only) |
float** | fData | ! data points |
Int_t | fDataOwner | ! 0 - not owner, 2 - owner of the pointer array, 1 - owner of the whole 2-d array |
int* | fIndPoints | ! array of points indexes |
int | fNDim | number of dimensions |
int | fNDimm | dummy 2*fNDim |
Int_t | fNNodes | size of node array |
int | fNPoints | number of multidimensional points |
Int_t | fOffset | ! offset in fIndPoints - if there are 2 rows, that contain terminal nodes |
float* | fRange | [fNDimm] range of data for each dimension |
Int_t | fRowT0 | ! smallest terminal row - first row that contains terminal nodes |
Int_t | fTotalNodes | total number of nodes (fNNodes + terminal nodes) |
float* | fValue | [fNNodes] nodes cutting value |
Build the kd-tree:
1. calculate number of nodes
2. calculate first terminal row
3. initialize index array
4. non recursive building of the binary tree
The tree is divided recursively. See class description, section 4b, for the details of the division algorithm.
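One way to cross-check the split sizes discussed in section 4b is to count terminal nodes directly from the heap-ordered node layout of section 4a. The sketch below is not the code used by Build(); CountTerminals and SplitCounts are hypothetical helpers, and the sketch assumes that a tree with n terminal nodes has 2*n - 1 nodes in total, filled row by row from the left.

```cpp
#include <cassert>

// Number of terminal nodes in the subtree rooted at inode, for a tree of
// totalNodes nodes stored in heap order (children of i are 2*i+1 and 2*i+2).
int CountTerminals(int inode, int totalNodes) {
    if (inode >= totalNodes) return 0;       // index lies outside the tree
    const int left = 2 * inode + 1;
    if (left >= totalNodes) return 1;        // no children: this is a terminal node
    return CountTerminals(left, totalNodes) + CountTerminals(left + 1, totalNodes);
}

// Terminal-node counts of the left and right subtrees of the root
// for a tree with nTerminals terminal nodes.
void SplitCounts(int nTerminals, int &nLeft, int &nRight) {
    assert(nTerminals >= 2);
    const int totalNodes = 2 * nTerminals - 1;  // every division node has two children
    nLeft  = CountTerminals(1, totalNodes);
    nRight = CountTerminals(2, totalNodes);
}
```

For 67 terminal nodes this counting gives 35 on the left and 32 on the right, which matches the worked example in the class description.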
Find kNN nearest neighbors to the point in the first argument. Returns 1 on success, 0 on failure. Arrays ind and dist are provided by the user and are assumed to be at least kNN elements long.
Update the nearest neighbors values by examining the node inode
Find the distance between the point given in the first argument and the point at index value ind. The type argument specifies the metric: type=2 - L2 metric, type=1 - L1 metric.
Find the minimal and maximal distance from a given point to a given node. The type argument specifies the metric: type=2 - L2 metric, type=1 - L1 metric. If the point is inside the node, both min and max are set to 0.
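The two metrics can be illustrated with a small stand-alone function working on column-wise data. This is a sketch, not the TKDTree implementation: PointDistance is a hypothetical name, the data layout data[idim][ind] follows the column-wise convention of section 3a, and returning the square root for the L2 metric (rather than the squared distance) is an assumption of the sketch.

```cpp
#include <cassert>
#include <cmath>

// Distance between a query point and the stored point number ind.
// data is column-wise: data[idim][ind] is coordinate idim of point ind.
// type == 2 selects the L2 (Euclidean) metric, any other value the L1 metric.
double PointDistance(const double *point, const double *const *data,
                     int ndim, int ind, int type = 2) {
    assert(ndim > 0 && ind >= 0);
    double dist = 0.0;
    for (int idim = 0; idim < ndim; ++idim) {
        const double d = point[idim] - data[idim][ind];
        dist += (type == 2) ? d * d : std::fabs(d);   // squared or absolute difference
    }
    return (type == 2) ? std::sqrt(dist) : dist;
}
```

For the query point (0, 0) and a stored point (3, 4), the L2 result is 5 and the L1 result is 7.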
Find the index of a point. Works only if the fData pointers are kept.
Find all points in the sphere of a given radius "range" around the given point
1st argument - the point
2nd argument - radius of the sphere
3rd argument - a vector, in which the results will be returned
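When testing code that relies on a range search, it can help to compare against a brute-force scan over all points. The function below is such a reference, written as a sketch: FindInRangeBruteForce is a hypothetical helper, not part of TKDTree, and it assumes column-wise data, the Euclidean metric, and that points lying exactly on the sphere count as inside.

```cpp
#include <cassert>
#include <vector>

// Indices of all points whose Euclidean distance to point is at most range.
// data is column-wise: data[idim][ipoint] is coordinate idim of point ipoint.
std::vector<int> FindInRangeBruteForce(const double *point, double range,
                                       const double *const *data,
                                       int npoints, int ndim) {
    assert(range >= 0.0);
    std::vector<int> res;
    const double range2 = range * range;   // compare squared distances, no sqrt needed
    for (int ipoint = 0; ipoint < npoints; ++ipoint) {
        double dist2 = 0.0;
        for (int idim = 0; idim < ndim; ++idim) {
            const double d = point[idim] - data[idim][ipoint];
            dist2 += d * d;
        }
        if (dist2 <= range2) res.push_back(ipoint);
    }
    return res;
}
```

The scan visits every point, so it is only meant for cross-checks on small data sets; the order of the returned indices need not match the order produced by a tree traversal.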
Internal recursive function with the implementation of range searches
Return the indices of the points in that terminal node. For all the nodes except the last, the size is fBucketSize; for the last node it's fOffset%fBucketSize.
Return the indices of the points in that node. Indices are returned as the first and last position of the part of the indices array that belongs to this node. Sometimes the points are in 2 intervals; then the first and last position of the second interval are returned in the third and fourth parameter, otherwise first2 is set to 0 and last2 is set to -1. To iterate over all the points of the node #inode, one can do, for example:

    Index *indices = kdtree->GetIndPoints();
    Int_t first1, last1, first2, last2;
    kdtree->GetNodePointsIndexes(inode, first1, last1, first2, last2);
    for (Int_t ipoint=first1; ipoint<=last1; ipoint++){
        point = indices[ipoint];
        //do something with point;
    }
    for (Int_t ipoint=first2; ipoint<=last2; ipoint++){
        point = indices[ipoint];
        //do something with point;
    }
Get the number of points in this node. For all the terminal nodes except the last, the size is fBucketSize; for the last node it's fOffset%fBucketSize, or, if fOffset%fBucketSize==0, it's also fBucketSize.
Set the data array. See the constructor function comments for details.
Calculate spread of the array a
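The class description states that each cut is made on the axis with the biggest spread. A minimal sketch of that selection is shown below. Both function names are hypothetical, and reading the values through an index array (a[index[i]]) is an assumption suggested by the signature of Spread(), not a statement about its internals.

```cpp
#include <cassert>

// Minimum and maximum of the values a[index[0]], ..., a[index[ntotal-1]].
void SpreadOf(int ntotal, const double *a, const int *index,
              double &min, double &max) {
    assert(ntotal > 0);
    min = max = a[index[0]];
    for (int i = 1; i < ntotal; ++i) {
        const double v = a[index[i]];
        if (v < min) min = v;
        if (v > max) max = v;
    }
}

// Axis with the largest spread (max - min) among ndim column-wise arrays,
// looking only at the points listed in index. Ties go to the lowest axis.
int AxisOfBiggestSpread(int ntotal, const double *const *data,
                        const int *index, int ndim) {
    int best = 0;
    double bestSpread = -1.0;
    for (int idim = 0; idim < ndim; ++idim) {
        double min, max;
        SpreadOf(ntotal, data[idim], index, min, max);
        if (max - min > bestSpread) {
            bestSpread = max - min;
            best = idim;
        }
    }
    return best;
}
```

Restricting the scan to an index subset is what allows the same data columns to be reused for every node: each node only needs the list of point indices that fall into it.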
Build boundaries for each node. Note that the boundaries here are built based on the splitting planes of the kd-tree, and don't necessarily pass through the points of the original dataset. For the latter functionality see the function MakeBoundariesExact(). Boundaries can be retrieved by calling the GetBoundary(inode) function, which returns an array of boundaries for the specified node, or the GetBoundaries() function, which returns the complete array.
define index of this terminal node
Build boundaries for each node. Unlike the MakeBoundaries() function, the boundaries built here always pass through a point of the original dataset. So, for example, for a terminal node with just one point, the minimum and maximum for each dimension are the same. Boundaries can be retrieved by calling the GetBoundaryExact(inode) function, which returns an array of boundaries for the specified node, or the GetBoundariesExact() function, which returns the complete array.
find the smallest node covering the full range - start