Let's play some computer poker, just you, me and a server we both trust. The server uses a pseudo-random number generator which is initialized with a 32-bit seed right before we play. So there are about four billion possible decks.

I get five cards in my hand -- apparently we are not playing Texas Hold 'Em. Suppose the cards are dealt out one to me, one to you, one to me, one to you, and so on. So I have the first, third, fifth, seventh and ninth cards in the deck.

Earlier I ran the pseudo-random number generator four billion times, once with each seed, and wrote down the first card generated for each into a database. Suppose my first card is the queen of spades. That only shows up one as the first card in one in every 52 of those possible decks, so we've cut down the possible decks from four billion to around 80 million or so.

Suppose my second card is the three of hearts. Now I run my RNG 80 million more times using the 80 million seeds that produce the queen of spades as the first number. This takes me a couple of seconds. I write down all the decks that produce the three of hearts as the third card -- the second card in my hand. That's again only about 2% of the decks, so now we're down to 2 million decks.

Suppose the third card in my hand is the 7 of clubs. I have a database of 2 million seeds that deal out my two cards; I run my RNG another 2 million times to find the 2% of those decks that produce the 7 of clubs as the third card, and we're down to only 40 thousand decks.

You see how this goes. I run my RNG 40000 more times to find all the seeds that produce my fourth card, and that gets us down to 800 decks, and then run it 800 more times to get the ~20 seeds that produce my fifth card, and now I just generate those twenty decks of cards and I know that you have one of twenty possible hands. Moreover, I have a very good idea of what I'm going to draw next.

Now do you see why true randomness is important? The way you describe it, you think that **distribution** is important, but distribution isn't what makes a process random. **Unpredictability** is what makes a process random.

## UPDATE

Based on the (now deleted due to their unconstructive nature) comments, at least 0.3% of people who've read this are confused as to my point. When people argue against points I haven't made, or worse, argue *for* points that I *did* make on the
assumption that I didn't make them, then I know that I need to explain more clearly and carefully.

There seems to be particular confusion around the word *distribution* so I want to call out usages carefully.

The questions at hand are:

- How do pseudorandom numbers and truly random numbers differ?
- Why is the difference important?
- Do the differences have something to do with the distribution of the output of the PRNG?

Let's begin by considering the *perfect* way to generate a random deck of cards with which to play poker. Then we'll see how other techniques for generating decks are different, and if it is possible to take advantage of that difference.

Let's begin by supposing that we have a magic box labeled `TRNG`

. As its input we give it an integer n greater than or equal to one, and as its output it gives us a truly random number between one and n, inclusive. The output of the box is *entirely unpredictable* (when given a number other than one) and any number between one and n is as likely as another; that is to say that the *distribution* is *uniform*. (There are other more advanced statistical checks of randomness that we could perform; I'm ignoring this point as it is not germane to my argument. TRNG is perfectly statistically random by assumption.)

We start with an unshuffled deck of cards. We ask the box for a number between one and 52 -- that is, `TRNG(52)`

. Whatever number it gives back, we count off that many cards from our sorted deck and remove that card. It becomes the first card in the shuffled deck. Then we ask for `TRNG(51)`

and do the same to select the second card, and so on.

Another way to look at it is: there are 52! = 52 x 51 x 50 ... x 2 x 1 possible decks, which is roughly 2^{226}. We have chosen one of them at truly at random.

Now we deal the cards. When I look at my cards I have *no idea whatsoever* what cards you have. (Aside from the obvious fact that you don't have any of the cards I have.) They could be any cards, with equal probability.

So let me make sure that I explain this clearly. We have *uniform distribution* of each individual output of `TRNG(n)`

; each one picks a number between 1 and n with probability 1/n. Also, the result of this process is that we have chosen one of 52! possible decks with a probability of 1/52!, so the distribution *over the set of possible decks* is *also* uniform.

All right.

Now let's suppose that we have a less magic box, labeled `PRNG`

. Before you can use it, it must be *seeded* with a 32-bit unsigned number.

ASIDE: *Why 32*? Couldn't it be seeded with a 64- or 256- or 10000-bit number? Sure. But (1) in practice most off-the-shelf PRNGs are seeded with a 32-bit number, and (2) if you have 10000 bits of randomness to make the seed then why are you using a PRNG at all? You already have a source of 10000 bits of randomness!

Anyway, back to how the PRNG works: after it is seeded, you can use it the same way you use `TRNG`

. That is, you pass it a number, n, and it gives you back a number between 1 and n, inclusive. Moreover, **the distribution of that output is more or less uniform**. That is, when we ask `PRNG`

for a number between 1 and 6, we get 1, 2, 3, 4, 5 or 6 each roughly one sixth of the time, no matter what the seed was.

I want to emphasize this point several times because it seems to be the one that is confusing certain commenters. The distribution of the PRNG is uniform in at least two ways. First, suppose we choose any particular seed. We would expect that the sequence `PRNG(6), PRNG(6), PRNG(6)...`

a million times would produce a uniform distribution of numbers between 1 and 6. And second, if we chose a million different seeds and called `PRNG(6)`

*once* for each seed, again we would expect a uniform distribution of numbers from 1 to 6. *The uniformity of the PRNG across either of these operations is not relevant to the attack I am describing*.

This process is said to be *pseudo-random* because the behaviour of the box is actually fully deterministic; it chooses from one of 2^{32} possible behaviours based on the seed. That is, once it is seeded, `PRNG(6), PRNG(6), PRNG(6), ...`

produces a *sequence* of numbers with a uniform distribution, but that sequence is *entirely* determined by the seed. For a given sequence of calls, say, PRNG(52), PRNG(51) ... and so on, there are only 2^{32} possible sequences. The seed essentially chooses which one we get.

To generate a deck the server now generates a seed. (How? We'll come back to that point.) Then they call `PRNG(52)`

, `PRNG(51)`

and so on to generate the deck, similar to before.

This system is susceptible to the attack I described. To attack the server we first, ahead of time, seed our own copy of the box with 0 and ask for `PRNG(52)`

and write that down. Then we re-seed with 1, ask for `PRNG(52)`

, and write that down, all the way up to 2^{32}-1.

Now, the poker server that is using PRNG to generate decks has to generate a seed somehow. *It does not matter how they do so.* They could call `TRNG(2^32)`

to get a truly random seed. Or they could take the current time as a seed, which is hardly random at all; I know what time it is as much as you do. The point of my attack is that it does not matter, *because I have my database*. When I see my first card I can eliminate 98% of the possible seeds. When I see my second card I can eliminate 98% more, and so on, until eventually I can get down to a handful of possible seeds, and know with high likelihood what is in your hand.

Now, again, I want to emphasize that the assumption here is that *if we called *`PRNG(6)`

a million times we would get each number roughly one sixth of the time. That distribution is (more or less) *uniform*, and *if uniformity of that distribution is all you care about*, that's fine. The point of the question was *are there things other that distribution of *`PRNG(6)`

that we care about? and the answer is *yes*. We care about *unpredictability* as well.

Another way to look at the problem is that even though the distribution of a million calls to `PRNG(6)`

might be fine, *because the PRNG is choosing from only 2*^{32} possible behaviours, it cannot generate every possible deck. It can only generate 2^{32} of the 2^{226} possible decks; a tiny fraction. So the distribution *over the set of all decks* is very bad. But again, the fundamental attack here is based on our being able to successfully *predict* the past and future behaviour of `PRNG`

from a small sample of its output.

Let me say this a third or four time to make sure this sinks in. There are three distributions here. First, the distribution of the process that produces the random 32-bit seed. That can be perfectly random, unpredictable and uniform and *the attack will still work*. Second, the distribution of a million calls to `PRNG(6)`

. That can be perfectly uniform and the attack will still work. Third, the distribution of decks chosen by the pseudo-random process I have described. That distribution is extremely poor; only a tiny fraction of the IRL possible decks can possibly be chosen. The attack depends on the *predictability* of the behaviour of the PRNG *based on partial knowledge of its output*.

ASIDE: This attack requires that the attacker know or be able to guess what the exact algorithm used by the PRNG is. Whether that is realistic or not is an open question. However, **when designing a security system you must design it to be secure against attacks even if the attacker knows all the algorithms in the program**. Put another way: the portion of a security system which must remain secret in order for the system to be secure is called the "key". If your system depends for its security on the algorithms you use being a secret then *your key contains those algorithms*. That is an *extremely* weak position to be in!

Moving on.

Now let's suppose that we have a third magic box labeled `CPRNG`

. It is a crypto-strength version of `PRNG`

. It takes a 256-bit seed rather than a 32-bit seed. It shares with `PRNG`

the property that the seed chooses from one of 2^{256} possible behaviours. And like our other machines, it has the property that a large number of calls to `CPRNG(n)`

produce a uniform distribution of results between 1 and n: each happens 1/n of the time. Can we run our attack against it?

Our original attack requires us to store 2^{32} mappings from seeds to `PRNG(52)`

. But 2^{256} is a much larger number; it is completely infeasible to run `CPRNG(52)`

that many times and store the results.

But suppose there is some *other* way to take the value of `CPRNG(52)`

and from that deduce a fact about the seed? We've been pretty dumb so far, just brute-forcing all the possible combinations. Can we look inside the magic box, figure out how it works, and deduce facts about the seed based on the output?

No. The details are too complicated to explain, but CPRNGs are cleverly designed so that it is infeasible to deduce *any* useful fact about the seed from the first output of `CPRNG(52)`

or from *any* subset of the output, *no matter how large*.

OK, so now let's suppose the server is using `CPRNG`

to generate decks. It needs a 256-bit seed. How does it choose that seed? If it chooses any value that an attacker can predict *then suddenly the attack becomes viable again*. If we can determine that of the 2^{256} possible seeds, only four billion of them are likely to be chosen by the server, then *we are back in business*. We can mount this attack again, only paying attention to the small number of seeds that can possibly be generated.

The server therefore should do work to ensure that the 256-bit number is *uniformly distributed* -- that is, each possible seed is chosen with probability of 1/2^{256}. Basically the server should be calling `TRNG(2^256)-1`

to generate the seed for `CPRNG`

.

What if I can hack the server and peer into it to see what seed was chosen? **In that case, the attacker knows the complete past and future of the CPRNG**. The author of the server needs to guard against this attack! (Of course if I can successfully mount this attack then I can probably also just transfer the money to my bank account directly, so maybe that's not that interesting. Point is: the seed has to be a hard-to-guess secret, and a truly random 256-bit number is pretty darn hard to guess.)

Returning to my earlier point about defense-in-depth: the 256-bit seed is the *key* to this security system. The idea of a CPRNG is that the system is secure *as long as the key is secure*; even if every other fact about the algorithm is known, as long as you can keep the key secret, the opponent's cards are unpredictable.

OK, so the seed should be both secret and uniformly distributed because if it is not, we can mount an attack. We have by assumption that the distribution of outputs of `CPRNG(n)`

is uniform. *What about the distribution over the set of all possible decks?*

You might say: there are 2^{256} possible sequences output by the CPRNG, but there are only 2^{226} possible decks. Therefore there are more possible sequences than decks, so we're fine; every possible-IRL deck is now (with high probability) possible in this system. And that's a good argument except...

2^{226} is only an *approximation* of 52!. Divide it out. 2^{256}/52! cannot possibly be a whole number because for one thing, 52! is divisible by 3 but no power of two is! Since this is not a whole number now we have the situation where all decks are *possible*, but *some decks are more likely than others*.

If that's not clear, consider the situation with smaller numbers. Suppose we have three cards, A, B and C. Suppose we use a PRNG with an 8-bit seed, so there are 256 possible seeds. There are 256 possible outputs of `PRNG(3)`

depending on the seed; there's no way to have one third of them be A, one third of them be B and one third of them be C because 256 isn't evenly divisible by 3. There has to be a small bias towards one of them.

Similarly, 52 doesn't divide evenly into 2^{256}, so there must be some bias towards some cards as the first card chosen and a bias away from others.

In our original system with a 32-bit seed there was a massive bias and the vast majority of possible decks were never produced. In this system all decks can be produced, but **the distribution of decks is still flawed**. Some decks are *very slightly* more likely than others.

Now the question is: *do we have an attack based on this flaw?* and the answer is *in practice, probably not*. CPRNGs are designed so that *if the seed is truly random* then *it is computationally infeasible to tell the difference between *`CPRNG`

and `TRNG`

.

OK, so let's sum up.

How do pseudorandom numbers and truly random numbers differ?

They differ in the level of predictability they exhibit.

- Truly random numbers are not predictable.
- All pseudo-random numbers are predictable if the seed can be determined or guessed.

Why is the difference important?

Because there are applications where the security of the system relies upon *unpredictability*.

- If a TRNG is used to choose each card then the system is unassailable.
- If a CPRNG is used to choose each card then the system is secure if the seed is both unpredictable and unknown.
- If an ordinary PRNG with a small seed space is used then the system is not secure regardless of whether the seed is unpredictable or unknown; a small enough seed space is susceptible to brute-force attacks of the kind I have described.

Does the difference have something to do with the distribution of the output of the PRNG?

The uniformity of distribution or lack thereof for *individual calls* to `RNG(n)`

is not relevant to the attacks I've described.

As we've seen, both a `PRNG`

and `CPRNG`

produce poor distributions of the probability of choosing any individual deck of all the possible decks. The `PRNG`

is considerably worse, but both have problems.

One more question:

If TRNG is so much better than CPRNG, which is in turn so much better than PRNG, why does anyone use CPRNG or PRNG?

Two reasons.

First: expense. TRNG is *expensive*. Generating truly random numbers is difficult. CPRNGs give good results for arbitrarily many calls with only *one* call to TRNG for the seed. The down side is of course that *you have to keep that seed a secret*.

Second: sometimes we *want* predictability and all we care about is good distribution. If you're generating "random" data as program inputs for a test suite, and it shows up a bug, then it would be nice that running the test suite again produces the bug again!

I hope that is now much more clear.

Finally, if you enjoyed this then you might enjoy some further reading on the subject of randomness and permutations:

7more comments