# Truncated Digest Timing and Collision Analysis¶

The GA4GH Digest uses a truncated SHA-512 digest in order to generate a unique identifier based on data that defines the object. This notebook discusses the choice of SHA-512 over other digest methods and the choice of truncation length.

Note

Please see this Jupyter notebook in Python SeqRepo library for code and updates. A fuller explanation is given in [Hart2020].

## Conclusions¶

• The computational time for SHA-512 is similar to that of other digest methods. Given that it is believed to distribute input bits more uniformly with no increased computational cost, it should be preferred for our use (and likely most uses).

• 24 bytes (192 bits) of digest is ample for VRS uses. Arguably, we could choose much smaller without significant risk of collision.

import hashlib
import math
import timeit

from IPython.display import display, Markdown

from ga4gh.vrs.extras.utils import _format_time

algorithms = {'sha512', 'sha1', 'sha256', 'md5', 'sha224', 'sha384'}


## Digest Timing¶

This section provides a rationale for the selection of SHA-512 as the basis for the Truncated Digest.

def blob(l):
"""return binary blob of length l (POSIX only)"""

def digest(alg, blob):
md = hashlib.new(alg)
md.update(blob)
return md.digest()

def magic_run1(alg, blob):
t = %timeit -o digest(alg, blob)
return t

def magic_tfmt(t):
"""format TimeitResult for table"""
return "{a} ± {s} ([{b}, {w}])".format(
a = _format_time(t.average),
s = _format_time(t.stdev),
b = _format_time(t.best),
w = _format_time(t.worst),
)

blob_lengths = [100, 1000, 10000, 100000, 1000000]
blobs = [blob(l) for l in blob_lengths]

table_rows = []
table_rows += [["algorithm"] + list(map(str,blob_lengths))]
table_rows += [["-"] * len(table_rows)]
for alg in sorted(algorithms):
r = [alg]
for i in range(len(blobs)):
blob = blobs[i]
t = timeit.timeit(stmt='digest(alg, blob)', setup='from __main__ import alg, blob, digest', number=1000)
r += [_format_time(t)]
table_rows += [r]
table = "\n".join(["|".join(map(str,row)) for row in table_rows])
display(Markdown(table))


algorithm

100

1000

10000

100000

1000000

md5

1.02 ms

2.51 ms

23.4 ms

145 ms

1.44 s

sha1

1.02 ms

1.91 ms

11.3 ms

101 ms

1 s

sha224

1.21 ms

3.16 ms

23.1 ms

224 ms

2.2 s

sha256

1.18 ms

3.29 ms

23.3 ms

223 ms

2.2 s

sha384

1.17 ms

2.54 ms

16 ms

150 ms

1.47 s

sha512

1.2 ms

2.55 ms

16.1 ms

148 ms

1.47 s

Conclusion: SHA-512 computational time is comparable to that of other digest methods.

This is result was not expected initially. On further research, there is a clear explanation: The SHA-2 series of digests (which includes SHA-224, SHA-256, SHA-384, and SHA-512) is defined using 64-bit operations. When an implementation is optimized for 64-bit systems (as used for these timings), the number of cycles is essentially halved when compared to 32-bit systems and digests that use 32-bit operations. SHA-2 digests are indeed much slower than SHA-1 and MD5 on 32-bit systems, but such legacy platforms is not relevant to the Truncated Digest.

## Collision Analysis¶

Our question: For a hash function that generates digests of length b (bits) and a corpus of m messages, what is the probability p that there exists at least one collision? This is the so-called Birthday Problem .

Because analyzing digest collision probabilities typically involve choices of mathematical approximations, multiple “answers” appear online. This section provides a quick review of prior work and extends these discussions by focusing the choice of digest length for a desired collision probability and corpus size.

Throughout the following, we’ll use these variables:

• $$P$$ = Probability of collision

• $$P'$$ = Probability of no collision

• $$b$$ = digest size, in bits

• $$s$$ = digest space size, $$s = 2^b$$

• $$m$$ = number of messages in corpus

The length of individual messages is irrelevant.

### Background: The Birthday Problem¶

Directly computing the probability of one or more collisions, $$P$$, in a corpus is difficult. Instead, we first seek to solve for $$P'$$, the probability that a collision does not exist (i.e., that the digests are unique). Because are only two outcomes, $$P + P' = 1$$ or, equivalently, $$P = 1 - P'$$.

For a corpus of size $$m=1$$, the probabability that the digests of all $$m=1$$ messages are unique is (trivially) 1:

$P' = s/s = 1$

because there are $$s$$ ways to choose the first digest from among $$s$$ possible values without a collision.

For a corpus of size $$m=2$$, the probabability that the digests of all $$m=2$$ messages are unique is:

$P' = 1 \times (\frac{s-1}{s})$

because there are $$s-1$$ ways to choose the second digest from among $$s$$ possible values without a collision.

Continuing this logic, we have:

$P' = \prod\nolimits_{i=0}^{m-1} \frac{(s-i)}{s}$

or, equivalently,

$P' = \frac{s!}{s^m \cdot (s-m)!}$

When the size of the corpus becomes greater than the size of the digest space, the probability of uniques is zero by the pigeonhole principle. Formally, the above equation becomes:

$\begin{split}P' = \left\{ \begin{array}{ll} 1 & \text{if }m = 0 \\ \prod\nolimits_{i=0}^{m-1} \frac{(s-i)}{s} & \text{if }1 \le m\le s\\ 0 & \text{if }m \gt s \end{array} \right.\end{split}$

For the remainder of this section, we’ll focus on the case where $$1 \le m \ll s$$. In addition, notice that the brute force computation is not feasible in practice because $$m$$ and $$s$$ will be very large (both $$\gg 2^9$$).

### Approximation #1: Taylor approximation of terms of P’¶

The Taylor series expansion of the exponential function is

$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + ...$

For $$|x| \ll 1$$, the expansion is dominated by the first terms and therecore $$e^x \approx 1 + x$$.

In the above expression for $$P'$$, note that the product term $$(s-i)/s$$ is equivalent to $$1-i/s$$. Combining this with the Taylor expansion, where $$x = -i/s$$ (⇒ $$m \ll s$$):

$\begin{split}\begin{split} P' & \approx \prod\nolimits_{i=0}^{m-1} e^{-i/s} \\ & = e^{-m(m-1)/2s} \end{split}\end{split}$

(The latter equivalence comes from converting the product of exponents to a single exponent of a summation of $$-i/s$$ terms, factoring out $$1/s$$, and using the series sum equivalence $$\sum\nolimits_{j=0}^{n} j = n(n+1)/2$$ for $$n\ge0$$.)

### Approximation #2: Taylor approximation of P’¶

The above result for $$P'$$ is also amenable to Taylor approximation. Setting $$x = -m(m-1)/2s$$, we continue from the previous derivation:

$\begin{split}\begin{split} P' & \approx e^{-(m(m-1)/2s} \\ & \approx 1 + \frac{-m(m-1)}{2s} \end{split}\end{split}$

### Approximation #3: Square approximation¶

For large $$m$$, we can approximate $$m(m-1)$$ as $$m^2$$ to yield

$P' \approx 1-m^2/2s$

### Summary of equations¶

We may now summarize equations to approximate the probability of digest collisions.

Summary of Equations

Method

Probability of uniqueness($$P'$$)

Probability of collision($$P=1-P'$$)

Assumptions

Source/Comparison

exact

$$\prod_\nolimits{i=0}^{m-1} \frac{(s-i)}{s}$$

$$1-P'$$

$$1 \le m\le s$$



Taylor approximation on #1

$$e^{-m(m-1)/2s}$$

$$1-P'$$

$$m \ll s$$



Taylor approximation on #2

$$1 - \frac{m(m-1)}{2s}$$

$$\frac{m(m-1)}{2s}$$

(same)



Large square approximation

$$1 - \frac{m^2}{2s}$$

$$\frac{m^2}{2s}$$

(same)

 (where $$s=2^n$$)