I just posted to Square’s Engineering blog describing the techniques we use to make signatures on the Android client look great. This was an extremely fun project to tackle, especially since it let me break out some mathematics I hadn’t used in a while.
A BK-tree is a type of metric tree, which is to say it’s a tree structure ideal for storing and retrieving elements of metric spaces.
So what’s a metric space? A metric space is made up of two things:
- A set of elements, and
- A function m that defines a “distance” between any two elements
The distance function m is called the “metric”. For any x and y in the metric space, the metric m(x,y) must:
- produce a non-negative value
- produce zero if and only if x = y
- be equal to m(y, x) (ie, m must be symmetric)
- satisfy the triangle inequality
An example of a metric space is the set of integers plus the metric m(x,y) = abs(x – y). The same set with metric m‘(x,y) = sqrt(x^2 + y^2) forms a different metric space.
You can also form metric spaces over non-numerical sets. The set of strings with a Levenshtein distance (or “edit distance”) function forms a metric space, for example.
BK-trees exploit properties of a metric function to efficiently retrieve elements based on distance. For example, a spelling corrector might want to find all words within a Levenshtein distance of one from some misspelled word. BK-trees are designed for this type of problem.1
Nodes in a BK-tree can have an arbitrary number of children, but the trick is that each node can have at most one child any given distance from it. So, in our String/Levenshtein space, the word “steam” could have a child “stream” (distance 1); but if it did, it could not also have a child of “steal” (also distance 1).
To insert a node ins into a BK-tree, start with the root. Compute dist = m(root, ins). If the root already has a child c that is dist away, recursively insert ins as a child of c. Otherwise, ins becomes the child of the root with distance dist.
A tree constructed this way lets you prune entire branches as you search for elements within a distance. If the maximum distance in your search is max_dist, you can ignore subtrees of the root greater than max_dist away. For children of the root with distance d, you can ignore any of their children not within the range d +- max_dist; this follows from the metric’s adherence to the triangle inequality. The pruning cutoff range increases by the distance from node to node as you recurse down the tree.
For a more detailed walkthrough of the insertion and search algorithms, see this blog post.
Here is my Haskell BK-tree implementation: BKTree.hs
The exported parts of the module:
type (Metric a) = a -> a -> Int mkBkTree :: (Metric a) -> (BKTree a) bkInsert :: (BKTree a) -> a -> (BKTree a) bkLookup :: (BKTree a) -> Int -> a -> [a]
So, using this BKTree module in the String/Levenshtein context:
import BKTree -- From http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Haskell levDistance :: Metric String levDistance sa sb = last $ foldl transform [0..length sa] sb where transform xs@(x:xs') c = scanl compute (x+1) (zip3 sa xs xs') where compute z (c', x, y) = minimum [y+1, z+1, x + fromEnum (c' /= c)] buildFromDictFile :: String -> IO (BKTree String) buildFromDictFile fileName = do dictContents <- readFile fileName return $ foldl bkInsert (mkBkTree levDistance) (lines dictContents)
By building a BKTree from the Levenshtein distance metric and list of know dictionary words, we can write an easy function to suggest words one edit away from any given string:
*SpellCorrect> spellTree <- buildFromDictFile "dictionary.txt" *SpellCorrect> let suggest = bkLookup spellTree 1 *SpellCorrect> suggest "strea" ["strep","strew","streak","stream","stria"]
Know any fun metric spaces to play with?
My biggest problem with Haskell right now is package and dependency management.
Cabal along with the Haskell Platform are supposed to address this, but it seems like every time I want to install a new package I run into tons of dependency issues.
Currently I’m playing around with this SDL tutorial. When I tried to:
cabal install base
I got this:
Resolving dependencies... cabal: internal error: impossible
After googling around, I thought maybe the problem was that I’m running ghc 6.10.3 (the version I got by installing the Haskell Platform on Ubuntu) instead of the latest 6.12.1.
I’m still not even sure this is the reason for Cabal’s cryptic error, but trying to install 6.12 on Ubuntu is turning out to be a chore. After more googling, I came across this guide: Installation of GHC 6.12.1 on Ubuntu 9.10 (for Hubris)
If that document doesn’t convince you that Haskell has some installation and dependency management ugliness, I don’t know what will.
Maybe after I get more used to the Haskell ecosystem I’ll understand these things more intuitively. Right now, though, this whole affair is quite frustrating.
Update: Since writing this, I’ve tried installing the same package in OS X. The Haskell Platform contains GHC 6.12.1, but the version of Cabal I had installed doesn’t work with 6.12.x. So, I upgraded Cabal and…
RMBP:bin rdickerson$ ./cabal install base Resolving dependencies... cabal: Distribution/Client/Dependency/TopDown.hs:171:37-73: Non-exhaustive patterns in lambda
This is really awful.
It’s like this in a lot of places throughout the project.
Josh, Bekka, and I made it to Seattle yestarday around noon. PAX starts on Friday, so we’ve spent the last couple days taking in Seattle, which is absolutely a fantastic city.
We’re staying in a hotel in the middle of the downtown area (on 5th street a few blocks from Pike), so there’s a ton of stuff in walking distance.
Right after arriving, we headed over to Pike Place Market for lunch. If you’ve never been, Pike is basically a large open-air seafood and farmer’s market with tons of little shops and restaurants.
We had lunch at a little chowder place — very delicious. It’s easy to spend lots of time just walking around Pike, looking at the fresh seafood and produce, and talking with the artisans selling their products in the small shops around the market; this is exactly what we did.
For dinner, we headed to a local brewery called Rock Bottom. The food was decent, but the service was horribly slow. Bekka and Josh tell me the beer was pretty good, though.
After dinner, we hit up a couple bars around town, ending up at a small venue called The Green Room. Here we took in a set by Johanna Kunin. The show was great, check Kunin out if you’re into indie rock with fantastic vocals.
Josh started the day (before I was awake) by going for a walk, and brought back some delicious cinnamon rolls from a nearby coffee shop. After we were all awake, we headed over to the Seattle Art Museum.
It’s a nice place, but it focuses pretty heavily on modern and contemporary art, with comparatively small spaces reserved for ancient through renaissance work (which interests me more). Also, no photography allowed in the galleries.
After the art museum, it was back to Pike for lunch, then over to the Science Fiction Museum, one of the big places on my to-visit list.
The SciFi Museum was pretty awesome. (Although this museum also disallowed photography. What’s up with the photo-hating, Seattle?). The museum isn’t very large, but we didn’t visit the “Experience Music” portion.
The exhibits are what you’d probably expect from the name of the place: movie memorabilia, interesting books, discussions on the role of science fiction in society. Basically, a cool place to visit if you’re a nerd. Which I am.
We ate dinner at this place called Zeek’s Pizza, which is delicious; fresh ingredients, well-made food. If you’re in Seattle and like pizza, definitely check this place out.
We’re at the hotel for now (everyone was feeling tired — lots of walking around, and Seattle is pretty much built on a giant hill), but I’m sure we’ll find something fun and exciting to do tonight.
And, of course, PAX begins tomorrow!
If you haven’t heard of Mozilla Labs’ Bespin, the website is here.
The question I have is: what does a web-based code editor give us? Why put a code editor in the cloud?
Most of the articles I’ve seen use a phrase like this: “…combine the speed and power of desktop-based development with the collaborative benefits of cloud computing.” A little short on details.
The articles then proceed to list all the ways Bespin will be just like a desktop editor.
Of course, a perfect emulation of a desktop editor doesn’t give us anything we don’t already have, since we already have desktop editors. So, again, what do we actually gain by having that editor in the cloud?
The Mozilla Labs blog post introducing Bespin is here, and contains a bulleted list of things Bespin will do. By my count, all but two of the bullet points are things that desktop editors and IDEs already do and Bespin needs to replicate.
Here’s what I think about the remaining two bullets:
- Real Time Collaboration.
I suppose there are developers out there who do real time remote pair programming or similar (although I am not one of them). I would guess there is already software for this kind of thing, but I see that Bespin could potentially make it easier.
So, I’m willing to chalk up this advantage: easier real time code sharing.
How often do you want to share a code session in real time, though? For “normal” online collaboration, I suspect a DVCS is what you really want.
But maybe I just don’t understand the real time collaboration use case. What specifically would one do with this?
- Accessible from Anywhere.
This one is compelling — set up your coding environment once, and be able to use it anywhere.
But think of the things you do on your desktop environment; scripts you write, OS settings you tweak, your VCS setup, etc. For Bespin to equal a desktop coding environment, these things will need to be reinvented on the server.
Right now you can remotely access a machine via ssh, vnc, or rdp. For me, these things fulfill my “accessible from anywhere” wants. If you’re unable to use these protocols for some reason, perhaps Bespin will be useful.
I can also see how Bespin could be convenient when embedded in other web applications. A coworker suggested embedding it into Wordpress, for example. You could edit your php or whatever online, without the (small?) inconvenience of a manual FTP push to publish.
So, it seems to me that there are a handful of cases where a web-based code editor can be convenient. I doubt Bespin will ever be as flexible or as powerful as a real desktop coding environment, though.
I suppose time will tell.
I just went through a bit of a frustrating time getting cabal (the Haskell package management system) going on a fresh Ubuntu install. (Figuring out I needed to install the zlib package mentioned in step three below was the worst of it).
I thought I’d document the steps I went through in case I ever need to do it again:
- Install GHC.
sudo apt-get install ghc
- Download and extract the cabal installation tool: cabal-install-0.6.2.tar.gz. (This page will probably keep an up-to-date link for future versions).
Make sure this zlib package is installed:
sudo apt-get install zlib1g-dev
- Run the bootstrap.sh shell script extracted in step two.
- Either add ~/.cabal/bin to your $PATH or copy it to a $PATH location.
The Cabal-Install instruction page is located here.
I love this introduction to git: Git for Computer Scientists.
The reason I like it is evident right in the abstract:
Quick introduction to git internals for people who are not scared by words like Directed Acyclic Graph.
Computer Science has such a rich vocabulary for describing data structures and their transformations, yet every other introduction to git I’ve seen uses mostly command line examples and hides technical details behind analogies.
The examples and analogies are helpful, but a little precise discussion of the way a system is put together can be even more helpful. It seems to me that most programmer targeted software introductions and tutorials tend to shy away from directly invoking basic computer science concepts.
It’s a shame, because you can communicate a lot of information in a short amount of time if you use the right vocabulary.
If you’re a programmer writing a document for other programmers, why not leverage this kind of common knowledge to communicate efficiently?
My previous post described how this puzzle doesn’t give enough information to be definitively solved. We need to know the mother’s strategy for revealing the gender of the first child.
In that earlier post, I approached things from the perspective of repeated simulations, giving some code to demonstrate. Here’s a more visual argument.
My thinking is that in order to make a judgment about the likeliness of different outcomes, we need to know the probability tree for the different events: the possible genders of the children and which child is revealed first. We can deduce almost the entire tree:
But we are never told how the mother decides which child to reveal when she has both a boy and a girl.
In my last post, I gave two ways of completing this tree:
- If the mother always reveals the gender of the female first whenever possible:
- If the mother always randomly decides which child she will reveal the gender of first:
In this case, the a priori chance of the second child being a boy is 50%.
In this case, the a priori chance of the second child being a boy is higher.
So, to answer the question using a given a probability tree, we examine the possible states we can be in (states in which the first child revealed is ‘G’). Then we prune off the other states and refigure the odds using the rules of conditional probability.
So, depending on which tree is being used, the answer to the question is different.
- A girl is always chosen first if possible
Probability of a boy second = 2/3
- First child is always chosen randomly
Probability of a boy second = 1/2
Let’s say, hypothetically speaking, you met someone who told you they had two children, and one of them is a girl. What are the odds that person has a boy and a girl?
Commenters argued contentiously between 1:2 and 2:3 (and there were some arguments over semantics as well, but I’m focusing on the math right now). Atwood made this follow-up post claiming the correct answer was 2:3.
I’m not so sure this is necessarily correct.
Atwood’s reasoning is that the possible child combinations are: BB, GB, BG, and GG. We know one child is a girl, so that eliminates BB, and 2:3 remaining options have a boy. The sticking point many people have is: why are GB and BG two separate cases?
Atwood’s answer is to draw analogy with The Unfinished Game, a problem he formulates thusly:
Two players, Harry and Ted, place equal bets on who will win the best of 5 coin tosses. In each round, Harry always chooses heads (H), and Ted always chooses tails (T). Suppose they are forced to abandon the game after 3 coin tosses, with Harry ahead 2 to 1. What is the fairest way to divide the pot?
The “intuitive” answer is that the only possible continuations of the game are: H, TH, or TT (since Harry wins on the next H) so 2/3 of the pot should go to Harry since he wins 2:3 of the expected games. The real answer is that the HT and HH continuations affect the expected games equally “strongly” as TH and TT, so Harry actually has a 3:4 chance of winning. Jeff demonstrates this with a code simulation.
The implied argument is that, just like HT and HH are “equally strong” in the Unfinished Game, so are the GBs and BGs in the child gender problem.
Again, I don’t think this is necessarily the case.
In Atwood’s original problem, there is the complication of the mother deciding to tell you one of her children is female. I think the answer hinges on the behavior of the mother.
Here is my answer to Atwood’s original child gender problem. It’s a two-parter, depending on how the mother acts:
- Scenario 1: When deciding which child’s gender to report, the mother always picks a female if she can.
This yields the 2:3 answer.
If we do a large number of simulations where we randomly generate two genders sequentially, we’ll get BB, GB, BG, or GG equal numbers of times.
We know the mother told us one child is a girl. So 100% of the time we generate a BB, it never makes it to the mother reporting a female gender. So, that gender generation doesn’t count and we throw it out.
We know the mother will always pick G if she can, so 100% of the BGs, BGs, and GGs meet the assumptions of our scenario — they make it to the “mother reported a girl” phase. So, Atwood’s original logic applies and 2:3 times the second child will be a boy.
To demonstrate, here’s a simulation in Python 3.0:
import random genders = ["B","G"] def spawn(n): return [random.choice(genders) for i in range(n)] def run(times): boy = 0 girl = 0 valid = 0 for i in range(times): children = spawn(2) if not "G" in children: # This scenario is was already thrown out # by our initial assumption -- doesn't count continue else: valid += 1 if "B" in children: boy += 1 else: girl += 1 boyRatio = float(boy) / valid girlRatio = float(girl) / valid print("Boy: %f" % boyRatio) print("Girl: %f" % girlRatio)
>>> run(10000) Boy: 0.669774 Girl: 0.330226
- Scenario 2: The mother randomly picks a child whose gender to report.
This yields the 1:2 answer.
Imagine running the simulation as before. Just as in the previous simulation, we have to throw out scenarios where the mother reports B, since we were assuming we had arrived at the point where she reported G. Previously, since she always reported G in the BG and GB cases, we only had to throw out the BBs.
This time, however, when we get a BG, the mother will choose B or G to report at random. 50% of the time, the mother will report B, which means we will have to throw that simulation out — just as we do the BBs. Similarly, 50% of the BGs will not count. The BGs and GBs are both at “half strength” since half of them don’t make it to the “mother reported a girl” stage. So, we’ll get (BG or GB) the same amount of time we get GG, giving us a 50-50 shot at a boy.
Here’s the Python simulation for this case:
import random genders = ["B","G"] def spawn(n): return [random.choice(genders) for i in range(n)] def run(times): boy = 0 girl = 0 valid = 0 for i in range(times): children = spawn(2) random.shuffle(children) childX = children childY = children if childX != "G": # This scenario is was already thrown out # by our initial assumption continue else: valid += 1 if childY != "B": boy += 1 else: girl += 1 boyRatio = float(boy) / valid girlRatio = float(girl) / valid print("Boy: %f" % boyRatio) print("Girl: %f" % girlRatio)
>>> run(10000) Boy: 0.495971 Girl: 0.504029
The crucial difference in the simulations is:
if not "G" in children: # This scenario is was already thrown out # by our initial assumption -- doesn't count continue
if childX != "G": # This scenario is was already thrown out # by our initial assumption -- doesn't count continue
So, as far as I can tell, the real answer is that we don’t know enough about the mother’s behavior to give a definitive answer of how this scenario plays out on average.
Of course, this kind of reasoning is pretty tricky to pull off correctly, so I wouldn’t be surprised if a good counter arises. But I’ve thought about this problem for a while, and, for now at least, I’m pretty thoroughly convinced that this solution is correct.