It’s Sunday morning and I have some time to myself, so I’m listening to some blues and catching up on my newsfeeds when I come across an interesting article on calculating how large a group of people you would need to have a 50/50 chance that two of them share the same birthday.

General birthday problem

The primary focus of the article is this equation for the probability of uniqueness, given a sample of size r drawn from a group of N things to choose from.

$$p = \frac {N!} {N^r (N-r)!}$$

In the article, the author is concerned about the possibility of overflow given the size of the factorials involved, and since scipy doesn’t have a log factorial he implemented his solution with the gamma log — see the above article for his code. So I say to myself, “I wonder if I can do this with just straight Python?”

A quick google on log factorial turned up this approximation of log factorial by Srinivasa Ramanujan on math.stackexchange. If you have not heard of Ramanujan before — stop and google him immediately. Wow!

$$\log n! \approx n\log n-n+\frac{\log(n(1+4n(1+2n)))}{6}+\frac{\log(\pi)}{2}$$

Which works out to the following Python code:

from math import log, pi, exp

def logn_factorial(n):
    """Return an approximation of log n! using Ramanujan's equation."""
    return n * log(n) - n + log(n * (1 + 4*n * (1 + 2*n))) / 6 + log(pi) / 2

p = exp(logn_factorial(N) - logn_factorial(N - r) - r * log(N))
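To sanity-check the approximation (my own quick test, not from the original article): with the classic birthday numbers N = 365 and r = 23, the probability of uniqueness should come out just under one half.

```python
from math import log, pi, exp

def logn_factorial(n):
    """Return an approximation of log n! using Ramanujan's equation."""
    return n * log(n) - n + log(n * (1 + 4*n * (1 + 2*n))) / 6 + log(pi) / 2

N = 365  # possible birthdays
r = 23   # the classic birthday-problem group size
p = exp(logn_factorial(N) - logn_factorial(N - r) - r * log(N))
print(round(p, 3))      # probability all 23 birthdays differ, ~0.493
print(round(1 - p, 3))  # probability of at least one match, ~0.507
```

So 23 people is indeed the tipping point where a shared birthday becomes more likely than not.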


Ok, so no scipy needed to follow along with this article, plus I get to use some very cool math. I like Sunday morning fun time. However, I then start thinking, “overflow”, hmmmm. What is the upper bound of math.factorial anyway? Since it’s the birthday problem, let’s see what happens with 365.

>>> import math
>>> math.factorial(365)
25104128675558732292929443748812027705165520269876079766872595193901106138220937419666018009000254169376172314360982328660708071123369979853445367910653872383599704355532740937678091491429440864316046925074510134847025546014098005907965541041195496105311886173373435145517193282760847755882291690213539123479186274701519396808504940722607033001246328398800550487427999876690416973437861078185344667966871511049653888130136836199010529180056125844549488648617682915826347564148990984138067809999604687488146734837340699359838791124995957584538873616661533093253551256845056046388738129702951381151861413688922986510005440943943014699244112555755279140760492764253740250410391056421979003289600000000000000000000000000000000000000000000000000000000000000000000000000000000000000000


Ok, let’s dial it up by a factor of 1000.

>>> math.factorial(365000)



Well, it took a bit, but it ran, with over 40 screens of numbers, and Python is still going strong. So what is the upper bound of math.factorial? A search brought me here http://bugs.python.org/issue8692 and specifically this message, which means that the size of the result cannot exceed sys.maxsize - 1 digits, or on a 64-bit platform, 2**63 - 1 digits. Thanks to some dedicated individuals who seemed to be having as much fun as I was, math.factorial is up to the task.
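To get a feel for those sizes without printing the numbers themselves, here is a little sketch of mine that counts the digits of n! using math.lgamma (the log of the gamma function, where lgamma(n + 1) = ln(n!)) instead of computing the factorial:

```python
from math import lgamma, log

def factorial_digits(n):
    """Number of decimal digits in n!, computed from log-gamma."""
    return int(lgamma(n + 1) / log(10)) + 1

print(factorial_digits(365))     # 779 digits
print(factorial_digits(365000))  # roughly 1.9 million digits
```

Both are a long, long way from the 2**63 - 1 digit ceiling.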

Don’t underestimate the power of Python.

Try straight Python before you move on to something more complex. The approximations of $\log n!$ were unnecessary; plain Python was all that was needed, and we get the following implementation of the Probability of Uniqueness:

from math import factorial

p = factorial(N) / ((N**r) * factorial(N-r))


A simple and straightforward implementation of the equation.
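As a quick comparison (my own snippet), the exact version and the Ramanujan approximation agree to many decimal places for the birthday numbers:

```python
from math import factorial, log, pi, exp

def p_unique_exact(N, r):
    """Exact probability that r samples drawn from N values are all distinct."""
    return factorial(N) / (N**r * factorial(N - r))

def logn_factorial(n):
    """Ramanujan's approximation of log n!."""
    return n * log(n) - n + log(n * (1 + 4*n * (1 + 2*n))) / 6 + log(pi) / 2

def p_unique_approx(N, r):
    """Probability of uniqueness via the log-factorial approximation."""
    return exp(logn_factorial(N) - logn_factorial(N - r) - r * log(N))

print(p_unique_exact(365, 23))                                  # ~0.4927
print(abs(p_unique_exact(365, 23) - p_unique_approx(365, 23)))  # tiny
```

Note that in Python 3 the `/` between those huge integers does true division and handles the big ints correctly.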

* A quick note: I’m usually a stickler for good variable names; however, when implementing math equations I stick as close as possible to the notation of the equation I am implementing.

* Also, if you are working with big enough numbers, and I mean huge numbers, then approximations might be needed, but by then you are going to be working at the limits of what a 64-bit platform can do.

Premature optimization and all that: when was the last time you fell into the trap of implementing something to work around a misconceived shortcoming in Python or one of the standard libs?

Explicit mutation

Since Python is call-by-object(*1), a function that mutates a mutable argument changes that object in the caller’s scope. Some code to illustrate:

>>> mobject = {'magic': True}
>>> id(mobject)
140330485577440
>>>
>>> def toggle(mdata):
...    '''flip the state of magic'''
...    mdata['magic'] = not mdata['magic']
...
>>>
>>> toggle(mobject)
>>> mobject
{'magic': False}
>>> id(mobject)
140330485577440


So hopefully this does not surprise you. If it does, please see the two links in the footnotes(*1), as they explain it quite well.

My question deals with the implicit nature of the mutation. It is not that Python is somehow wrong; it is that the function usage does not convey the mutation to the reader as pointedly as I want. Coming from languages that are call by value, a function that wanted to mutate an argument and get it back into the caller’s scope had to return the mutated value.
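A quick sketch of my own to make the contrast concrete: rebinding the parameter name inside a function has no effect on the caller; only in-place mutation does.

```python
def rebind(mdata):
    # the local name now points at a brand-new dict; the caller never sees it
    mdata = {'magic': False}

def mutate(mdata):
    # changes the one shared object in place; the caller sees it
    mdata['magic'] = False

mobject = {'magic': True}
rebind(mobject)
print(mobject)  # {'magic': True}  -- unchanged
mutate(mobject)
print(mobject)  # {'magic': False} -- mutated in place
```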

>>> def toggle_explicit(mdata):
...    '''flip the state of magic and return'''
...    mdata['magic'] = not mdata['magic']
...    return mdata
...
>>> mobject = toggle_explicit(mobject)
>>> mobject
{'magic': True}
>>> id(mobject)
140330485577440
>>>


Now I know that the above code is most definitely NOT call by value, but I do feel that it is more explicit about my intention, even though the assignment is not needed for the actual effect, i.e.:

>>> toggle_explicit(mobject)
{'magic': False}
>>> mobject
{'magic': False}


So why does toggle_explicit() give me the warm fuzzies, whereas toggle() requires the reader to know what is going on? Is it just me shaking off the cruft of call-by-value languages? What do you do when you mutate state within a function? Is the toggle_explicit() form less Pythonic?
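One more option worth mentioning (my own sketch, not something from the discussion above): sidestep the question entirely by returning a new object and leaving the argument untouched, so the call site has to use the return value to see any effect.

```python
def toggled(mdata):
    """Return a copy of mdata with 'magic' flipped; the argument is not mutated."""
    return {**mdata, 'magic': not mdata['magic']}

mobject = {'magic': True}
new = toggled(mobject)
print(new['magic'])      # False
print(mobject['magic'])  # True  -- the original is untouched
print(new is mobject)    # False -- a genuinely new object
```

It costs a copy, but there is no implicit mutation left to surprise the reader.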

— Footnotes —

(*1) Raymond Hettinger, in this SO article, references a post by Fredrik Lundh on “Call By Object”.