A BK-tree is a type of metric tree, which is to say it’s a tree structure ideal for storing and retrieving elements of metric spaces.
So what’s a metric space? A metric space is made up of two things:
- A set of elements, and
- A function m that defines a “distance” between any two elements
The distance function m is called the “metric”. For any x and y in the metric space, the metric m(x,y) must:
- produce a non-negative value
- produce zero if and only if x = y
- be equal to m(y, x) (ie, m must be symmetric)
- satisfy the triangle inequality
An example of a metric space is the set of integers plus the metric m(x,y) = abs(x – y). The same set with metric m‘(x,y) = sqrt(x^2 + y^2) forms a different metric space.
You can also form metric spaces over non-numerical sets. The set of strings with a Levenshtein distance (or “edit distance”) function forms a metric space, for example.
BK-trees exploit properties of a metric function to efficiently retrieve elements based on distance. For example, a spelling corrector might want to find all words within a Levenshtein distance of one from some misspelled word. BK-trees are designed for this type of problem.1
Nodes in a BK-tree can have an arbitrary number of children, but the trick is that each node can have at most one child any given distance from it. So, in our String/Levenshtein space, the word “steam” could have a child “stream” (distance 1); but if it did, it could not also have a child of “steal” (also distance 1).
To insert a node ins into a BK-tree, start with the root. Compute dist = m(root, ins). If the root already has a child c that is dist away, recursively insert ins as a child of c. Otherwise, ins becomes the child of the root with distance dist.
A tree constructed this way lets you prune entire branches as you search for elements within a distance. If the maximum distance in your search is max_dist, you can ignore subtrees of the root greater than max_dist away. For children of the root with distance d, you can ignore any of their children not within the range d +- max_dist; this follows from the metric’s adherence to the triangle inequality. The pruning cutoff range increases by the distance from node to node as you recurse down the tree.
For a more detailed walkthrough of the insertion and search algorithms, see this blog post.
Here is my Haskell BK-tree implementation: BKTree.hs
The exported parts of the module:
type (Metric a) = a -> a -> Int mkBkTree :: (Metric a) -> (BKTree a) bkInsert :: (BKTree a) -> a -> (BKTree a) bkLookup :: (BKTree a) -> Int -> a -> [a]
So, using this BKTree module in the String/Levenshtein context:
import BKTree -- From http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Haskell levDistance :: Metric String levDistance sa sb = last $ foldl transform [0..length sa] sb where transform xs@(x:xs') c = scanl compute (x+1) (zip3 sa xs xs') where compute z (c', x, y) = minimum [y+1, z+1, x + fromEnum (c' /= c)] buildFromDictFile :: String -> IO (BKTree String) buildFromDictFile fileName = do dictContents <- readFile fileName return $ foldl bkInsert (mkBkTree levDistance) (lines dictContents)
By building a BKTree from the Levenshtein distance metric and list of know dictionary words, we can write an easy function to suggest words one edit away from any given string:
*SpellCorrect> spellTree <- buildFromDictFile "dictionary.txt" *SpellCorrect> let suggest = bkLookup spellTree 1 *SpellCorrect> suggest "strea" ["strep","strew","streak","stream","stria"]
Know any fun metric spaces to play with?
- There are better ways of tackling the spelling correction problem, though. ↩









Rob Dickerson's blog on things like programming, math, and fencing.