Generating a random PHP identifier

Update 1st June 2009 – Added a note mentioning about case-insensitive comparisons in MySQL.

I’ve been looking at generating random identifiers in PHP, and making sure they are random enough. Looking at the PHP function uniqid(), and its suggested better token, I don’t think this is an adiquate enough way:

$better_token = md5(uniqid(mt_rand(), true));

Md5 is a 128 bit hash algorithm. Looking at the data inside the function, there doesn’t appear to be enough data (i.e. entrophy) for the 128 bits, so at best, the uniqueness of the resulting string will be less than the capacity of the md5.

  • uniqid(), this alone generates 13 hexadecimal characters, which is the equivallent of 52 bits,
  • rand(), generates a random integer number, which would fill 32 bits
  • uniqid(”, true), this appends additional entropy to the string, 9 decimal numbers and a dot, less than 20 bits of randomness

Adding these all up, 52+32+20 = 104 bits.

So what happens when you md5 a value that has 104 bits of randomness? You get at most 2^104 possible random values back in a md5 that has 2^128 values. Hashing algorithms can have multiple input values which generate the same output value, causing collisions, so the actual randomness would be less.

A better way

*nix systems (including Solaris, Linux, OSX and FreeBSD) have special device files usually /dev/random, which can generate as much randomness as required by reading in the number of bytes needed. /dev/urandom is a non-blocking version which sacrifices entrophy in order for the file to not lock.

$uniqueId = bin2hex(file_get_contents('/dev/urandom', 0, null, -1, 16));

This will return a random 32 character hexadecimal (128 bits). You shouldn’t need to md5 this.

Generating a shorter (yet just as random) string suitable for urls

The hexadecimal format consists of the numbers 0-9 and a-f, 16 symbols in total, yet there are many more acceptable acceptable characters available in a url.

You can have a shorter identifier if you encode the binary data as base 64 instead, which as its name suggests, has 64 symbols. However, the standard ascii symbols used in this encoding include + and /, which have special meaning in URLs. You can convert these to – and _ respectively to make them more compatible.

Also, base-64 encoding pads the end with = if there isn’t enough bytes in a block. These are not needed by an identifier, so can be stripped.

If the data string is already in binary, you can just do the following to convert it to base 64:

$base64encodedString = str_replace(
    base64_encode(md5($data, true)));

If however the data was in hexadecimal, you can convert it to binary using the php function pack() beforehand:

$base64encodedString = str_replace(
    base64_encode(pack('H*', md5($data))));

Doing this will shorten a 32 character hexadecimal into a 22 character base-64 encoded string.

Please note that when doing MySQL comparisons, your text field’s collate will most likely be case-insensitive. Comparing a base64 string against one in the database when the collation is case-insensitive (e.g. latin1_ci, utf8_ci) will dramatically increase the chance of collisions as 26 of the 64 characters will match the other 26 case ones, giving a significant reduction in bits.