Back to Top

Tuesday, September 30, 2008

Working with bitstrings in PostgreSQL

When working with short, fixed width binary data in PostgreSQL (for example: MD5/SHA1 hashes), an option to consider is the bitstring type. It has several advantages:

  • Smaller data storage requirements (also - smaller index sizes). Storing an MD5 hash as characters representing hexadecimal numbers for example takes up 32 bytes, while storing it as bit string takes up 16 bytes (exactly the amount of useful data contained).
  • There are no canonisation issues. For example deadbeef, DeadBeef, deaDbeeF, etc all represent the same number in hexadecimal form, however, if there isn't a rule enforced on the table (for example: all characters must be lowercase), you can end up different entries representing the same data.

However it also has some disadvantages for which I'm trying to give workarounds (where I know of such):

  • It is not standard - not really a problem, since everybody should use PG :-)
  • Related to the first problem - database abstraction libraries (like DBI for Perl or PDO for PHP) don't know how to properly interpolate such data (they basically treat it as string) - the workaround - manually build the query string after carefully (!) validating the data. For example "SELECT foo FROM bar WHERE md5 = x'$md5'".
  • It doesn't support prefix queries. So you can't do the following: SELECT foo FROM bar WHERE md5 LIKE x'deadbeef%'. The solution lies in the fact that characters to the left are more important in the lexicographical ordering than characters to the right. So you can rewrite the previous query as this (assuming that the md5 field has a width of 128 bits or 16 bytes):
    SELECT foo FROM bar WHERE 
      md5 >= x'deadbeef000000000000000000000000'
      AND md5 <= x'deadbeefffffffffffffffffffffffff'
    A further advantage of this method is that - being a range query - can be accelerated using existing b-tree indexes on the column.
  • When data is returned from a query, it is in binary form (ie "10111000011..."). However frequently you want it in hexadecimal form. Below are some code samples for Perl, PHP and Python to accomplish this:
    # Perl
    my $row = unpack('H*', pack('B*', $row));
    # PHP - we need to split result because calculation have a limited precision
    # the code also supposes that the length of the bitstring is multiple of 4
    # which it has to be for it to be convertable into hex :-)
    preg_match_all('/[01]{4}/', $row, $nibbles);
    $result = '';
    foreach ($nibbles[0] as $b) {
     $result .= sprintf('%2x', bindec($b));
    # python
    result = hex(int(row, 2))

Update: You can use the examples shown above to create stored procedures in PG (after installing the appropriate language of course) which would concert a bit string to hex. One note: the perl code can not be used as shown, because the pack operation is not permitted by the safe perl environment. Alternatively you can use plperlu (but be aware of the security implications).


Post a Comment

You can use some HTML tags, such as <b>, <i>, <a>. Comments are moderated, so there will be a delay until the comment appears. However if you comment, I follow.