Suppose we sample uniformly random elements from a set of cardinality $n$, and save them in a table. We continue doing this process (each sampling is one step) until we get a collision. What is the expected number of steps until we find the first collision?
This is a well-known problem, also called the birthday paradox, since the answer, $\Theta(\sqrt{n})$, is rather unintuitive.
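To make the process concrete, here is a minimal Monte Carlo sketch of it. The function names, the seed and the number of trials are my own choices for illustration, and the estimate is only as good as the sample size.

```python
import random

def first_collision_step(n, rng):
    """Sample uniformly from range(n) until a value repeats.

    Returns the index (1-based) of the step at which the first repeat occurs.
    """
    seen = set()  # the "table" of samples drawn so far
    step = 0
    while True:
        step += 1
        x = rng.randrange(n)
        if x in seen:
            return step
        seen.add(x)

def estimate_mean_steps(n, trials=20000, seed=0):
    """Average the first-collision step over independent trials."""
    rng = random.Random(seed)
    return sum(first_collision_step(n, rng) for _ in range(trials)) / trials
```

Calling `estimate_mean_steps(n)` for a few values of `n` (say `n`, `4n`, `16n`) should show the estimate roughly doubling each time, as expected for growth of order $\sqrt{n}$.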
Suppose we sample $k$ times. Then the probability of having no collisions after $k$ steps is$$\prod_{i=0}^{k-1}\left(\frac{n-i}{n}\right) = \prod_{i=0}^{k-1}\left(1-\frac{i}{n}\right) \leq \prod_{i=0}^{k-1}e^{-\frac{i}{n}} = e^{-\sum_{i=0}^{k-1}\frac{i}{n}}=e^{-\frac{k(k-1)}{2n}}\approx e^{-\frac{k^2}{2n}},$$and so the probability of a collision within $k$ steps is$$1 - \prod_{i=0}^{k-1}\left(\frac{n-i}{n}\right) \geq 1 - e^{-\frac{k(k-1)}{2n}},$$which is $\Omega(1)$, i.e. bounded below by a positive constant, when $k=\Theta(\sqrt{n})$. The probability of no collision after $k$ steps can also be written as $\frac{n!}{(n-k)!\, n^k}$.
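As a quick sanity check of this bound, the following sketch compares the exact no-collision probability with $e^{-k(k-1)/(2n)}$. The function names are mine, and plain floating point is assumed to be accurate enough for moderate $n$ and $k$.

```python
import math

def no_collision_prob(n, k):
    """Exact probability that k uniform samples from n values are all distinct."""
    p = 1.0
    for i in range(k):
        p *= (n - i) / n
    return p

def exp_bound(n, k):
    """Upper bound exp(-k(k-1)/(2n)) obtained from 1 - x <= exp(-x)."""
    return math.exp(-k * (k - 1) / (2 * n))
```

For the classical case $n = 365$, $k = 23$ the two values are close to each other and both near $1/2$, with the exact probability slightly below the bound.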
Now consider the following question:
What is the expected number of steps until the first collision?
I haven't seen an easy solution to this problem. This question and the corresponding Wikipedia page state the answer without a proof.
Let $X$ be the random variable "index of step of first collision". Then$$\mathbb{E}[X] = \sum_{k=1}^{\infty} \mathbb{P}[X \geq k] = 1 + \sum_{k=1}^n \frac{n!}{n^k (n-k)!},$$which follows from the formula above.
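For completeness, here is how I would spell out that step. The event $X \geq k$ means that the first $k-1$ samples are all distinct, so for $1 \leq k \leq n+1$$$\mathbb{P}[X \geq k] = \prod_{i=0}^{k-2}\left(\frac{n-i}{n}\right) = \frac{n!}{n^{k-1}\,(n-k+1)!},$$and $\mathbb{P}[X \geq k] = 0$ for $k > n+1$ by the pigeonhole principle. Substituting $j = k-1$ gives$$\mathbb{E}[X] = \sum_{j=0}^{n} \frac{n!}{n^{j}\,(n-j)!} = 1 + \sum_{j=1}^{n} \frac{n!}{n^{j}\,(n-j)!},$$where the leading $1$ is the $j=0$ term.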
Now comes the non-trivial part. The function$$Q(n) = \sum_{k=1}^n \frac{n!}{n^k (n-k)!}$$is known as the Ramanujan $Q$-function, and has the asymptotic expansion (in $\sqrt{n}$)$$Q(n) = \sqrt{\frac{\pi n}{2}} - \frac{1}{3} + \frac{1}{12}\sqrt{\frac{\pi}{2n}} + O\left(\frac{1}{n}\right),$$and therefore the expected number of steps until first collision is $\sqrt{\frac{\pi n}{2}} + O(1)$.
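The expansion can be checked numerically. Below is a small sketch that evaluates $Q(n)$ term by term and compares it with the three leading terms quoted above; the function names are my own, and the comparison is an illustration rather than a proof.

```python
import math

def ramanujan_q(n):
    """Evaluate Q(n) = sum_{k=1}^n n! / (n^k (n-k)!) without large factorials."""
    total = 0.0
    term = 1.0  # k = 1 term: n! / (n (n-1)!) = 1
    for k in range(1, n + 1):
        total += term
        term *= (n - k) / n  # ratio of the (k+1)-th term to the k-th term
    return total

def q_asymptotic(n):
    """First three terms of the asymptotic expansion of Q(n)."""
    return math.sqrt(math.pi * n / 2) - 1 / 3 + math.sqrt(math.pi / (2 * n)) / 12
```

For $n = 365$ the two functions agree to about three decimal places, and `1 + ramanujan_q(365)` gives the expected number of steps for the classical birthday setting.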
- Is there a simpler proof that the expected number of steps is $\Theta(\sqrt{n})$?
- What is the variance of $X$?
Now suppose we parallelise, i.e., we run the same algorithm on $m$ different machines which do not communicate with each other.
- What is the expected run time until we find the first collision?
The interesting thing here is that parallelisation with $m$ machines only gives a $\sqrt{m}$ improvement in the run-time. A similar argument shows that after $k$ steps the probability that no machine has found a collision is$$\prod_{i=0}^{k-1}\left(\frac{n-i}{n}\right)^m = \cdots \approx e^{-\frac{k^2m}{2n}},$$so we will have a constant probability of collision when $k^2m \approx n$. Since the run-time is $k$, we need $k = \Theta\left(\sqrt{\frac{n}{m}}\right)$, so increasing $m$ only gives a square-root improvement in the run-time $k$.
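A numerical sketch of the parallel case, under the assumption that the run stops as soon as any one machine finds a collision in its own table, so that the run-time is the minimum of $m$ independent copies of $X$. Then $\mathbb{P}[\min \geq k] = \mathbb{P}[X \geq k]^m$ and the expectation is a finite sum. The function name is mine.

```python
import math

def expected_steps_parallel(n, m):
    """Expected first-collision step over m independent machines.

    Assumes the run-time is min(X_1, ..., X_m) with the X_i independent,
    so E[min] = sum_{j=0}^{n} P[no collision after j steps]^m.
    """
    total = 0.0
    p = 1.0  # probability that one machine has no collision after j = 0 steps
    for j in range(n + 1):
        total += p ** m
        p *= (n - j) / n  # extend to j + 1 steps
    return total
```

For $m = 1$ this reduces to $1 + Q(n)$. Comparing `expected_steps_parallel(n, m)` with `math.sqrt(math.pi * n / (2 * m))` for a few values of $n$ and $m$ shows a difference that stays bounded, which is consistent with the claimed $O(1)$ error term but of course does not prove it.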
However, I read here that the expected number of steps until the first collision is $\sqrt{\frac{\pi n}{2m}} + O(1)$, which is a statement that I can't prove, so my final question is:
- How do I prove that for the parallel algorithm the expected number of steps until first collision is $\sqrt{\frac{\pi n}{2m}} + O(1)$?
But I'd also like to ask:
- Is there a simple proof that this expected number of steps is $\Theta(\sqrt{\frac{n}{m}})$?
- What is the variance of the number of steps?
I feel like the parallelisation questions should be easily provable if one knows the variance of $X$.