
Tuesday, July 24, 2007

Compressed HTTP

The HTTP standard allows the delivered content to be compressed (to be more precise, it allows it to be encoded in different ways, one of the encodings being compression). Under Apache there are two simple ways to do this: the mod_deflate module, or, for PHP-generated content, PHP's zlib.output_compression setting (both discussed below).
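
For reference, here is a minimal sketch of the two configurations (the MIME type list and the ini fragment are illustrative examples, not prescriptions):

# httpd.conf: compress common text content with mod_deflate (Apache 2.x)
AddOutputFilterByType DEFLATE text/html text/plain text/css

; php.ini: alternatively, let PHP compress its own output
zlib.output_compression = On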

I won't go into much detail on the configuration options; however, I want to describe one little quirk which is logical in hindsight, but which I struggled with a little: you can lose the Content-Length header on files which don't fit into your compression buffer from the start. This is of course logical because:

  • Headers must be sent before the content.
  • If the server must do several read-from-file / compress / output cycles to compress the whole file, it can't possibly predict accurately (to the byte) how large or small the compressed version of the file will be. Getting it wrong is risky because client software might rely on this value and could lock up in a wait cycle or display incomplete data.

Update: if you want to selectively disable mod_deflate for certain files because of this (or other reasons), check out this post about it.

You can observe this effect especially when downloading (large) files, since the absence of a Content-Length header means that the client can't show a progress bar indicating the percentage downloaded; with HTTP/1.1, such responses are typically delivered using Transfer-Encoding: chunked instead (this is what I observed at first, and then went on to investigate the causes).
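
A quick way to check whether a given URL still carries a Content-Length once compression is negotiated is a few lines of LWP (a minimal sketch; the URL is a placeholder):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
# Request a compressed response and inspect the headers we get back
my $response = $ua->get('http://example.com/some-large-file',
                        'Accept-Encoding' => 'gzip');
if (defined $response->header('Content-Length')) {
    print 'Content-Length: ', $response->header('Content-Length'), "\n";
} else {
    print "No Content-Length header - no progress bar possible\n";
}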

One more remark about getting the Content-Length wrong. One fairly common case where this can be an issue is PHP scripts which output a Content-Length header while the compression is done via zlib.output_compression. The problem is that mod_php doesn't remove the Content-Length header, which almost certainly has a larger value than the size of the compressed data. This causes the hanging, incomplete downloads symptom. To make it even more confusing:

  • When using HTTP/1.1 and keep-alive, the problem manifests itself.
  • When keep-alive is inactive, the problem disappears (sort of). What actually happens is that the Content-Length is still wrong, but the connection is reset by the server after all the data has been sent (since no keep-alive means one request per connection). This usually works with clients (both curl and Firefox interpreted it as a complete download), but other client software might choose to interpret the condition as a failed/corrupted download.

The possible solutions would be:

  • Perform the compression inside your PHP script (possibly caching the compressed version on disk if that makes sense) and output the correct Content-Length header (i.e. the one corresponding to the compressed data). This is more work, but you retain the progress bar when downloading files.
  • Use mod_deflate to perform the compression; it removes the Content-Length header if it can't compress the whole data at once (this is not specified in the documentation, but - the beauty of open source - you can take a peek at the source code, the ultimate documentation; just search for apr_table_unset(r->headers_out, "Content-Length"); ). This kills the progress bar (for the reasons discussed before). To get it back, you can increase the DeflateBufferSize configuration parameter (8k by default) so that it is larger than the largest file you wish to serve, or deactivate compression for the files which will be downloaded (rather than displayed), as sketched below.
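
For example, a configuration sketch of the second option (the extension list is illustrative; SetEnvIfNoCase needs mod_setenvif):

# Raise the buffer so more files fit in one compression pass (1 MB here)
DeflateBufferSize 1048576
# Don't compress typical download formats, so they keep their Content-Length
SetEnvIfNoCase Request_URI \.(?:zip|gz|iso|pdf)$ no-gzip dont-vary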

A final remark: the HTTP protocol also allows the uploaded data to be compressed (this can be useful, for example, when uploading larger files), as shown by the following blurb in the mod_deflate documentation:

The mod_deflate module also provides a filter for decompressing a gzip compressed request body. In order to activate this feature you have to insert the DEFLATE filter into the input filter chain using SetInputFilter or AddInputFilter.

...

Now if a request contains a Content-Encoding: gzip header, the body will be automatically decompressed. Few browsers have the ability to gzip request bodies. However, some special applications actually do support request compression, for instance some WebDAV clients.
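
Activating it is, in theory, a one-line configuration change (a sketch; the location is just an example):

<Location "/upload">
    SetInputFilter DEFLATE
</Location>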

When I saw this, I was ecstatic, since I was searching for something like this for some of my projects. If this works, it means that I can:

  • Use a protocol (HTTP) for file upload which has libraries in many programming languages
  • Use a protocol which needs only one port (as opposed to FTP) and can be made secure if necessary (with SSL/TLS)
  • Use compression, just as rsync does (and although it can't create binary diffs on its own, this is not an issue when the uploaded files are not used for synchronization)

Obviously there must be some drawbacks :-)

  • It seems to be an Apache-only feature (I didn't find anything indicating support in IIS, or even a clear RFC documenting how this should work)
  • It can't be negotiated! This is a huge drawback. When server-side compression is used, the process is the following:
    • The client sends an Accept-Encoding: gzip header along with the request
    • The server checks for this header and, if present, compresses the content (except for the cases when the client doesn't really support the compression)
    However, the fact that the client sends first means that there is no way for the server to signal its (in)capability to accept gzip encoding. Even the fact that it's Apache and has previously served up compressed content doesn't guarantee that it can handle it, since the input and output filters are two separate things. So the options available are:
    • Use gzip (possibly preceding it with a heuristic detection like the one described before: is it Apache, and does it serve up gzip-compressed content?), and if the server returns an error code, retry without gzip (see the sketch after this list)
    • The option which I will take: use this only with your own private servers, which you have configured properly.
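
A sketch of the first, optimistic option (assuming $request is the compressed request built below, and $request_plain is an uncompressed copy of it):

my $ua = LWP::UserAgent->new();
my $response = $ua->request($request);
# If the server can't handle the compressed body it returns an error status,
# so we retry with the uncompressed variant ($request_plain is assumed)
$response = $ua->request($request_plain) if $response->is_error;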

So how do you do it? Here is a blurb, again from the mod_deflate source code: only work on main request/no subrequests. This means that the whole body of the request must be gzip compressed if we choose to use this; it is not possible to compress only the part containing the file (for example, in a multipart request). Below is some Perl code I hacked together to use this feature:

#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw/tempfile/;
use Compress::Zlib;
use HTTP::Request::Common;
use LWP;

# Generate the request body through a callback instead of
# slurping the whole file into memory
$HTTP::Request::Common::DYNAMIC_FILE_UPLOAD = 1;

my $request = POST 'http://10.0.0.12/test.php',
    [
        'test1'  => 'test1',
        'test2'  => 'test2',
        'a_file' => ['somedata.dat'],
    ],
    'Content_Type'     => 'form-data',
    'Content_Encoding' => 'gzip';

# Replace the request body with a gzip-compressed copy, streamed
# through a temporary file so that large uploads never have to sit
# in memory all at once
sub transform_upload {
    my $request = shift;

    my ($fh, $filename) = tempfile(UNLINK => 1);
    my $cs = gzopen($fh, 'wb')
        or die "Failed to open compression stream: $gzerrno";

    # Drain the original content callback into the gzip stream
    my $request_c = $request->content();
    while (my $data = $request_c->()) { $cs->gzwrite($data); }
    $cs->gzclose();

    # Serve the compressed temporary file back in 4 KB chunks
    open $fh, '<', $filename or die "Failed to reopen $filename: $!";
    binmode $fh;
    $request->content(sub {
        my $buffer;
        if (0 < read $fh, $buffer, 4096) {
            return $buffer;
        } else {
            close $fh;
            return undef;
        }
    });
    # The Content-Length must describe the compressed body
    $request->content_length(-s $filename);
}

transform_upload($request);

my $browser  = LWP::UserAgent->new();
my $response = $browser->request($request);

print $response->content();

This code is optimized for big files, meaning that it won't read the whole request into memory at once. I hope somebody finds it useful.

7 comments:

  1. Do you have any idea if compressing uploads (POST data) is supported by mod_gzip? I haven't been able to find anything that indicates that it is.

  2. I don't think it will work. As far as I know, mod_gzip is the predecessor of mod_deflate (i.e. it was used for Apache 1.3) and it's no longer available for Apache 2.0 or 2.2.

  3. Anonymous, 10:40 AM

    It can be negotiated. Add "Expect: 100-continue" header to the request that has "Content-Encoding: deflate". If the server responds with a "100 continue" then go ahead and send the compressed version. Otherwise, remove the "Expect" and "Content-Encoding" headers and send the uncompressed version.

  4. Very interesting. Does anybody know if any HTTP libraries / client programs (e.g. Perl's LWP, wget, curl, etc.) implement this approach?

    Also, this shows just how complex "simple" protocols like HTTP or MIME have become.

  5. This post saved me a bit of time, thanks!

    If you are a mod_perl2 (or mod_perl) guy, you'll hit a similar problem to the one you describe.

    Make sure to set the content-length (set_content_length) and *do NOT call rflush!!!!*. If you rflush() before or after you print() your content, mod_deflate will drop the content-length header!!

  6. Anonymous, 5:25 PM

    How can I test this Perl script? (sorry, I am not familiar with Perl)
    Thanks!

  7. Anonymous, 10:53 AM

    Fantastic solution. I used mod_deflate to uncompress the posted data. If you use a Perl CGI handler on the receiving end, the Content-Length header does not match that of the content (why mod_deflate doesn't fix this, I don't know). So I added an additional header:-

    sub transform_upload {
        my $request = shift;

        my $origLen = 0;
        my $fh = new File::Temp;
        my $filename = $fh->filename;
        print "Temp file is $filename\n";
        my $cs = gzopen($fh, "wb");
        my $request_c = $request->content();
        while (my $data = $request_c->()) {
            $cs->gzwrite($data);
            $origLen += length($data);
        }
        $cs->gzclose();

        $fh = new FileHandle("<$filename");
        $request->content(sub {
            my $buffer;
            if (0 < read $fh, $buffer, 4096) {
                return $buffer;
            } else {
                close $fh;
                return undef;
            }
        });
        $request->content_length(-s $filename);
        $request->headers->header('X-Uncompressed-Length' => $origLen);
    }

    Then the CGI script should have the following added before the CGI constructor:-
    #
    # As apache has nicely uncompressed the content
    # it DID NOT change the Content-Length header
    # So we use a frig to get round this done by the client
    #
    $ENV{CONTENT_LENGTH} = $ENV{HTTP_X_UNCOMPRESSED_LENGTH}
        if $ENV{HTTP_X_UNCOMPRESSED_LENGTH};

    my $q = new CGI;

    This ensured the compression is seamless.
