Monte Carlo your way out of a puzzle: Why events of measure 0 matter.

I was sitting in my office a couple of weeks ago, minding my regular bioinformatics-flavored business with a cup of oolong tea, when Senthil stopped by and told me he had a puzzle he’d like some help with. Now, both Senthil and I enjoy math, and we especially enjoy small, nifty, tricky problems. The puzzle he shared was the following.

Consider a unit grid on a plane. Pick a non-negative real number L. Now, we can place a random segment of length L in the plane by picking a point and an angle uniformly at random, and drawing a segment of length L originating at the chosen point and oriented along the chosen angle. Which value of L maximizes the probability that this segment intersects exactly one line of the grid?

We spent about 5 minutes chatting about different placements of the segment, probabilities, and our lack of desire to compute a nasty triple integral. We had to wrap up quickly since I was short on time and had to finish up some other tasks.

Last Thursday I was co-organizing our department’s happy hour, and Senthil reminded me of the outstanding problem. I made a pinky promise that I’d help on Friday, which brings us to the story I am about to tell you.

Do not integrate, just simulate!

Since I was not sold on the idea of computing integrals first thing on a Friday morning, I decided to quickly write up a simulation that would do exactly what the problem statement described. After about 20 minutes of typing away, I had my tiny C program up and running, spitting out probabilities with 6 sweet significant digits. Senthil came by quite soon, and within 40 minutes of back-and-forth computer-assisted linear search we were quite certain our answer for L was 1. Yay, problem solved: L = 1, the probability is approximately 0.63774, and we can all go drink coffee and do whatever Computer Science PhD students do on a Friday. Well, no.
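For the curious, here is a minimal sketch of what such a simulator can look like. This is not my original code: the names are made up, rand() stands in for the actual generator, and the segment’s starting point is drawn in the unit square, which is enough by periodicity of the grid. Note that the helper counts crossed grid lines, one family at a time; as the rest of this story shows, the exact counting convention is where the devil hides.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define PI 3.14159265358979323846

    /* number of grid lines (vertical + horizontal) crossed by the segment
       from (x0, y0) to (x1, y1); the two families are counted separately */
    static int lines_crossed(double x0, double y0, double x1, double y1) {
        int v = abs((int)floor(x1) - (int)floor(x0)); /* vertical lines x = k   */
        int h = abs((int)floor(y1) - (int)floor(y0)); /* horizontal lines y = k */
        return v + h;
    }

    int main(void) {
        const long trials = 500000000L;
        const double L = 1.0; /* segment length under test */
        long favorable = 0;

        srand(42); /* rand() is a stand-in for the actual RNG used */
        for (long i = 0; i < trials; i++) {
            /* by periodicity, the starting point can live in the unit square */
            double x0 = (double)rand() / RAND_MAX;
            double y0 = (double)rand() / RAND_MAX;
            double phi = 2.0 * PI * ((double)rand() / RAND_MAX);
            if (lines_crossed(x0, y0, x0 + L * cos(phi), y0 + L * sin(phi)) == 1)
                favorable++;
        }
        printf("P(exactly one line) = %.6f\n", (double)favorable / trials);
        return 0;
    }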

Of course we were not happy. I mean, what is 0.63774? Have you heard of this number before? Does it have any meaning? Is this the number of goats you must sacrifice to Belial before you learn the great mysteries of the world? We took to OEIS and found (among other amazing things) that 0, 6, 3, 7, 7 is a subsequence in the decimal expansion of Mills’ constant, assuming the Riemann hypothesis is true. Other than that, the number didn’t give us an immediate answer. We also checked the magic of WolframAlpha, which gave us some neat candidates, like (3 π)/(2 e^2). However, as you can guess, reverse engineering this fraction was 1) stupid; 2) hard; and finally 3) something we didn’t do.

In the meantime, we decided to put our theory hats back on and figure out what was happening, and how we could get an analytical expression for this probability. Senthil recalled that there is in fact a very similar well-known question, called Buffon’s needle problem, which has an amazingly clean solution. However, we could not see how to easily translate the setting of that problem to ours. After some amount of drawing, complaining about integrals, and debating which events have measure zero, we decided to take a coffee break and walk outside for a bit. This is when the major theoretical breakthrough happened.

Senthil noticed that instead of thinking about the grid as a whole, we could consider the two sets of parallel lines independently. The advantage of this view was clear: now we had Buffon’s needle problem embedded directly into ours! We multiplied some probabilities, got wrong numbers, and complained about the futility of theoretical efforts. Then we of course reimagined how stuff should be multiplied and counted things again, and again, and again. Are the events independent? Do we count this in or subtract it out? Do we go back to the integral formula?

Clearly the two intersection events are not independent, since an orientation (based on the angle) that works well for a horizontal intersection is not favorable for a vertical one. Thus we can’t just multiply things out and subtract from one, and so we went back to the integral. This is where things got tangled up. We recomputed the triple integral a few times, but something was not tying out. We computed an integral of a product of integrals, and something still didn’t tie out. We kept running in circles, seeing numbers like 0.202 and 0.464 all over the place. Meanwhile, the answer to the original Buffon’s needle problem, 2/π ≈ 0.636619, was temptingly close to our own. Finally, around 4pm, we threw in the towel and parted ways.
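In hindsight, at least one of those stray numbers is easy to explain (my reconstruction after the fact, not something we verified at the time). For Buffon’s needle with L = 1 and unit spacing, the probability of crossing a line of a single family is p=\frac{2}{\pi}. If the two families were independent, the probability of crossing exactly one line would be 2p(1-p)=\frac{4}{\pi}\left(1-\frac{2}{\pi}\right)\approx 0.4627, suspiciously close to the 0.464 we kept seeing; the gap between this and the true answer is exactly the dependence between the two crossing events.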

Simulation clearly saved the day: even though we were not sure of the theory behind this number, we knew the answer with high confidence, and that was already a win. But…

Back to drawing and integration

I took out a piece of paper, folded it in four, and started sketching. There is a nice video out there that explains how to derive the solution to Buffon’s needle problem. Thus, I just went through the steps, drawing out the circle, and thinking about the exact conditions we need to satisfy for our segment to intersect exactly one grid line.

The shaded region above represents the range of admissible positions for the needle (black arrow) when the angle formed by the needle and the x-axis does not exceed 90 degrees. Note that we only consider intersection with the line A_0A_1, since the intersection with the line B_0B_1 is identical by symmetry.

Now, taking the angle between the needle and the x-axis to be \theta\leq\frac{\pi}{2}, and the distances to the nearest horizontal and vertical lines to be y and x respectively, we can define the conditions for an admissible position as 0\leq\theta\leq\frac{\pi}{2},\,y\leq\frac{L}{2}\sin\theta,\,x\geq\frac{L}{2}\cos\theta. Thus, we can write down our desired integral and evaluate it: \displaystyle\int_0^{\frac{\pi}{2}}\int_0^{\frac{L}{2}\sin\theta}\int_{\frac{L}{2}\cos\theta}^{\frac{1}{2}}dx\,dy\,d\theta=-\frac{1}{8}(L-2)L. Now, dividing through by the total volume of the parameter box and multiplying by 2, by symmetry for the intersection with the second line, we arrive at the coveted \frac{2}{\pi}\approx 0.636619. The case could be rested: we now have a neat theoretical solution to the question, and a beautiful mathematical expression as an answer. However, I was still unhappy, because I could not ascribe the error we saw in the simulation to pure numerical precision at this scale.
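For completeness, here is the evaluation spelled out at L=1. The inner two integrals reduce the triple integral to \displaystyle\int_0^{\frac{\pi}{2}}\frac{\sin\theta}{2}\left(\frac{1}{2}-\frac{\cos\theta}{2}\right)d\theta=\frac{1}{4}-\frac{1}{8}=\frac{1}{8}, while the total volume of the (\theta, y, x) box is \frac{\pi}{2}\cdot\frac{1}{2}\cdot\frac{1}{2}=\frac{\pi}{8}. Hence a single family contributes \frac{1/8}{\pi/8}=\frac{1}{\pi}, and the two families together give the \frac{2}{\pi}.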

The devil is in the details: wording and measure zero events

One of the first things you learn in a probability theory class is that events of measure zero don’t matter. Picking a specific point on a line, or even picking a rational point on a line, is a measure zero event, hence such events don’t really matter when you calculate probabilities. Simulations, however, are not only countable, they are finite! Which means that accidentally dropping in an event of measure zero that shouldn’t be there can trip you up. This is exactly what happened in our case. We want to count the number of times a random segment intersects exactly one line of the grid. I, however, assumed the condition meant something else: the segment intersects the grid at exactly one point! The difference shows up at the points where the two sets of parallel lines forming the grid cross: such a point is a single grid point, but a segment passing through it intersects not one, but two lines!

I adjusted my counting scheme to correctly treat this edge case as intersecting two lines, and therefore not a favorable outcome. Guess what happened next… Exactly: a sweet, shiny 0.6366313 popped out of my simulation as the estimated probability of the event. This number is much closer to \frac{2}{\pi} than our previous candidate, and now all the doubts are gone: the theory and the simulation agree!
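In terms of the floor-based counting from the sketch above, the edge case is easy to check with concrete numbers (a toy illustration, not the actual fix in my code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void) {
        /* the segment from (0.9, 0.9) to (1.1, 1.1) passes straight through
           the lattice point (1, 1): ONE intersection point, TWO grid lines */
        int v = abs((int)floor(1.1) - (int)floor(0.9)); /* crosses x = 1 */
        int h = abs((int)floor(1.1) - (int)floor(0.9)); /* crosses y = 1 */
        printf("lines crossed: %d\n", v + h); /* prints 2: not favorable */
        return 0;
    }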

Discussion and lessons learned

First of all, if my simulation runs in about 2 minutes (going through 500,000,000 random segments), why not run a longer simulation to counter the effect of the nasty accidental event? Well, numbers have their limits on computers, and overflows happen. I tried running the simulation for 2,000,000,000 and 20,000,000,000 segments. The first run gave a slightly smaller number than the original 0.6377, but still not quite 0.6366. The second one gave a whopping probability of -8.648, which even those not too familiar with measure-theoretic probability would call out as absurd. Could I spend more time bugfixing this and get a great simulator that scales to a 1,000,000,000,000-segment trial? Yes. Is it reasonable to pour time into this? No!
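For what it’s worth, one plausible culprit: a 32-bit signed counter wraps around past 2,147,483,647, which is one way a counting loop ends up reporting a negative “probability”. A minimal sketch of the repair, assuming the counters were indeed the only problem:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* 64-bit counters: 2 * 10^10 trials fit comfortably in uint64_t */
        uint64_t trials = 20000000000ULL;
        uint64_t favorable = 0;
        /* ... simulation loop incrementing favorable goes here ... */
        printf("P = %.7f\n", (double)favorable / (double)trials);
        return 0;
    }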

Second, why not do the theory neatly from the start? At the end of the day, the problem really takes about 30 minutes of careful drawing and setup and 3 minutes of WolframAlpha-ing the integral. Well, first, this involves dissecting the problem theoretically, which takes time. Second, writing up the simulator took 20 minutes; for contrast, understanding how Buffon’s needle problem is solved took about the same 15-20 minutes. The latter on its own didn’t give a solution, while the former at least provided a good numerical estimate. Finally, recall that the problem is actually about finding the optimal length L? That was quite a roadblock to developing a nice theoretical answer; once we were quite certain that L is 1, we could actually focus on the theory that mattered.

Third, why not read the problem statement carefully and not jump to wild assumptions? Well, this one goes into the lessons-learned bucket, alongside the entire finiteness-of-simulations shenanigans. You would expect that years of problem solving would have taught me that, and yet again I make the same mistakes.

Overall, I can say that this was an amazing exercise in basic probability, geometry, and mocking up simulations to solve puzzle problems. I am happy to learn that writing a quick program can help in figuring out the theory behind a question. Furthermore, I am also quite delighted to have experienced the nuances of probability first hand, without putting important questions and results on the line.

Final notes and resources

Many thanks to Senthil Rajasekaran for bringing this problem to my attention and guiding me through the crucial steps of getting the theoretical answer into our sights. Looking forward to more puzzles to solve!

My code for the simulator is available in this (zipped) file. Note that the random number generator was lifted from a course by Benoit Roux at UChicago, and the rest of the code is mine. The commented-out portion of the intersect function is exactly what caused the annoying off-by-0.0011 error.

Thanks for the attention, and happy bug catching y’all!

Why any (comprehensive) course on computer programming should cover basics of C: A story.

Good evening world!

Recently I came across a small task that reinforced my belief in the importance of C programming. The task was the following:

  1. Generate 500,000,000 pseudo-random numbers using the Linear Congruential Generator (LCG) algorithm [1, 2].
  2. Use the Box-Muller transform [3] to get normally distributed random variables (RVs).
  3. Compute the next step in the evolution of a deterministic system, and add noise using the generated normally distributed RV.
  4. Write the result of every 1000th step to a file.

In other words, we need to run a for-loop for 500,000,000 steps, doing some calculations (generating an RV + evaluating a deterministic function), and writing to a file once every 1000 steps.
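For concreteness, here is a minimal C sketch of steps 1, 2, and 4, with the Park-Miller “minimal standard” LCG [1, 2] implemented via Schrage’s factorization, and the basic (cosine-branch) Box-Muller transform [3]. The deterministic system of step 3 is a made-up placeholder, as its exact form does not matter for this story.

    #include <stdio.h>
    #include <math.h>

    #define PI 3.14159265358979323846

    /* Park-Miller LCG [1, 2]: x <- 16807 * x mod (2^31 - 1), implemented
       with Schrage's factorization to avoid overflowing 32-bit arithmetic */
    static long lcg_state = 1;

    static double uniform01(void) {
        const long a = 16807, m = 2147483647, q = 127773, r = 2836;
        long hi = lcg_state / q, lo = lcg_state % q;
        lcg_state = a * lo - r * hi;
        if (lcg_state <= 0) lcg_state += m;
        return (double)lcg_state / (double)m; /* uniform in (0, 1) */
    }

    /* Box-Muller transform [3]: two uniforms -> one standard normal deviate */
    static double gaussian(void) {
        double u1 = uniform01(), u2 = uniform01();
        return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
    }

    int main(void) {
        const long steps = 500000000L;
        double x = 0.0; /* state of the placeholder deterministic system */
        FILE *out = fopen("trajectory.dat", "w");
        if (!out) return 1;

        for (long i = 0; i < steps; i++) {
            x = 0.5 * x + gaussian(); /* placeholder evolution + noise */
            if (i % 1000 == 0)
                fprintf(out, "%ld %f\n", i, x);
        }
        fclose(out);
        return 0;
    }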

This doesn’t sound particularly challenging, and the whole thing can be done in less than 80 lines of C code. The same task can also be done in about 45 lines of Python. However, LOC is not the metric I want to look at here. I want to talk about the performance of the code written, and some general educational caveats.

Is the playing field level?

Let’s talk a tiny bit about optimization, under-optimization and over-optimization here.

The moment I shared this story with my friend, he immediately said that the performance comparison doesn’t make sense. The argument provided was the following: if the Python code takes that much longer to run, it clearly was not written well. I agree that when I was coming up with the comparison I was not using any fancy race-track wheels for Python. The script I wrote is as vanilla Python as one can possibly imagine. Does this mean that I am cheating by employing “bad” or “inefficient” Python coding practices? I would say no, or at worst, just a little bit.

In both cases, C and Python, I wrote a vanilla implementation of the given task. Hence, no parallelism, no non-standard libraries, no pre-compiled/hardware-optimized shenanigans. Did I manage to cheat a bit? Yes, of course I did: I compiled my C code with the -O3 optimization flag. This, of course, is not the full story either. I ran my Python script naively, invoking python ./generate.py, rather than trying to compile it into an optimized binary and then running it. However, for all of these “sins” I have quite a simple answer: I don’t do that with Python 99.9% of the time. I do not compile my Python scripts. I do not roam the web for pre-compiled computational libraries for Python. And I do not tend to care that much about performance in the first place when I code in Python.

How is this a conversation about optimization then? Well, I think we need to consider several parameters to be optimized, and then check what we get in terms of relative performance. Hence, I will be thinking about 3 metrics here:

  1. Human minutes spent writing code (including debugging).
  2. Human minutes spent waiting for the program to finish running.
  3. sys runtime of the programs written.

In the context of these 3 parameters I can clearly define what I mean by optimizing, over-optimizing and under-optimizing performance of a task.

Over-optimizing: This is the case when I spend a lot of time writing code that supposedly is great in terms of wall and sys times. Not surprisingly, the majority of the over-optimization in my case does not come from assembly injections leveraging the latest microarchitecture features. When I over-optimize, with probability 0.9 it is due to me finding a paper proposing a fast algorithm that I then try to implement from scratch. Clearly this brings a caveat: asymptotically better performance does not always translate into cycle-count performance on small enough examples.

Optimizing: Once in a blue moon, when working on a one-off personal project, I do hit the right spot: just enough complexity in the implementation to get a good average across the runtimes. Any properly optimized code should be optimized both in terms of human minutes spent writing it and human minutes spent waiting for the results. However, as with anything in the world of computer programming, or life at large, there is a caveat: optimization is context dependent. Spending more development hours on code that has to be reused on a regular basis is worth it, as long as the eventual benefit in runtime pays for it.

Under-optimizing: This is what happens when the deadline is far away. Hacking together 25 lines of your favorite vanilla high-level language and letting it run overnight, because you still have a week left and one run only takes 14 hours. Surprisingly, I think that from a practical perspective this is more justified than over-optimizing. If I had to choose between code that takes 14 hours to run but gets the job done, and code that takes 12 hours to develop and only 2 to run, I might go for the first one, because at least I can sleep or read while the slow version runs. However, the caveat here is simple: if you need to run the code more than once, the unoptimized runtime will cost you a lot.

Performance Analysis

I was compiling and running all code on my personal laptop. The specs are listed below.

MacBook Pro (Retina, 13-inch, Mid 2014)

  • 2.6 GHz Intel Core i5 (4278U)
  • 8 GB 1600 MHz DDR3 RAM

Runtimes were measured with the time utility.

[Table: Runtime comparison for the generator scripts.]

As you can see, across all of the timings the performance differs drastically. This is by no means a shocking or unexpected result, but it matters for the rest of the discussion.

Pedagogical Discussion

This post is ultimately about teaching and learning, so let’s finally talk about why any comprehensive course[F1] on computer programming must cover some basics of the C language.

First, C is a compiled language. While the intricacies of compilation as a process lie beyond the introductory level, acknowledging compilation as a step in the lifecycle of a program is critical. Virtually anything that has to do with computer programming in its broadest definition can benefit from a better understanding of the full picture. As an example, I can bring up a recent workplace story, where, as we discovered, certain business logic scripts were ultimately compiled into SQL statements. When the underlying tables changed, the SQL statements became invalid, while the surface-level logic remained perfectly sound. It took a bit of tinkering around to find out that we in fact had to trigger a re-compile for the SQL to become valid again. Hence, if you have better knowledge of the full picture, then your bug fixing abilities are also better.

Second, C has great performance characteristics. As the first part of this story shows, C does in fact yield quite great performance in its vanilla form. Of course, you have to be mindful of your project scope. In terms of over-optimization failures, C is probably at the top of the list, in close competition with C++. Just think of all the linked list implementations ever written in C. Now, think of all the doubly linked list implementations, and FFT implementations, and Dijkstra’s algorithm implementations, and so on ad nauseam. Writing code in C oftentimes feels like re-inventing the wheel, in fair part because it is. However, when the task at hand boils down to arguably simple arithmetic operations that need to be performed at medium scale, writing it up in C is probably the best bet.

Third, C is ubiquitous (unless you are on Windows). If you have a *nix system, it comes with gcc, clang, or some other form of C compiler. No need to download a myriad of heavy IDEs and weird things. To be fair, the same can be said about vanilla Python, which in part is why I love it so much (but I still use Anaconda).

Fourth, C builds discipline and rigor. I am not talking about the painstaking debugging of memory leaks and segfaults, or the arcane magic of pointer arithmetic. Those things are clearly important, but I am talking about very, very basic C. You need to declare variable types. You need to allocate stuff ahead of time. You need to keep track of what moving parts you’ve got in the game. These things amount to cleaner code and better style. You have to at least minimally organize your thoughts before writing in C. Hopefully, that generalizes to all of the code you will ever write.

Finally, C is just another programming language. I firmly believe that breadth of programming knowledge is equally if not more important than depth for about 80% of the people who will ever write code. In the job market, it is hard to guarantee that your amazing knowledge of C++ will always be in equal demand. In academia, you might realize that the lab you just joined does things in Perl. You can probably still write half of your code in Java, but then you need to interface with the rest of the Perl codebase… You get the general idea. On the other hand, the “jack of all trades, master of none” kind of programmer will be more likely to pick up a new language from the docs when it is needed. In this regard, C serves as a good training ground for general programming principles.

Hence, in the end, we have a performant language that exposes you to some fundamental programming concepts and builds better coding discipline.

References

  1. S. K. Park, K. W. Miller. Random number generators: good ones are hard to find. Comm. ACM (31):10, pp. 1192-1201, 1988. [DOI: 10.1145/63039.63042]
  2. P. A. W. Lewis, A. S. Goodman, J. M. Miller. A pseudo-random number generator for the System/360. IBM Systems Journal (8):2, pp. 136-146, 1969. [DOI:10.1147/sj.82.0136]
  3. G. E. P. Box, M. E. Muller. A Note on the Generation of Random Normal Deviates. Ann. Math. Statist. (29):2, pp. 610-611, 1958. [DOI: 10.1214/aoms/1177706645]

Footnotes

[F1] By a comprehensive course I mean an academic-year to calendar-year long introductory sequence on computer programming. Examples would include any “Intro to Computer Science, n = 1, 2, 3, …” sequences, and any bootcamps that aim to teach you computer programming. I do agree that there are shorter courses that clearly cannot cover C. However, I would also argue that such courses are by no means comprehensive.