Friday, January 09, 2009

Using a single file to serve up multiple web resources

While trying to set up my GHDB mirror, my first thought was to use googlepages. I quickly found the bulk upload to googlepages how-to by X de Xavier, which is a very cool tool (and also an interesting way to hack your "chrome"), but unfortunately I found that Google Pages has a limit of 500 files (and the mirror contained around 1400 files), so this was a no-go.

My second thought was: the Browser Security Handbook documents several "pseudo-protocols" which can contain other files that can be directly addressed from the browser. Although support for them is rather spotty, I thought that using JAR (supported by Firefox) and MHT (supported by IE) I could cover a large share of users.
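
Both pseudo-protocols follow the same basic pattern: a URL pointing to the container file, a "!" separator, and an identifier for the resource inside it. Roughly:

jar:<URL of the archive>!/<path inside the archive>
mhtml:<URL of the .mht file>!<Content-Location of the part>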

The results are rather disappointing, but I'm documenting the failure causes I managed to isolate; maybe it can help someone out.

First up was JAR. JARs are in fact just zip files, so creating them is very straightforward. After creating and testing the archive locally, I uploaded it and tried to access it like this (if you have NoScript, you must add the site to the whitelist for this to work):

jar:http://ghdb.mirror.googlepages.com/ghdb.jar!/_0toc.html

Just to get the following error message:

Unsafe File Type

The page you are trying to view cannot be shown because it is contained in a file type that may not be safe to open. Please contact the website owners to inform them of this problem.

After searching for the error message and not coming up with anything useful, I took a stab at looking at the source code; this is one of the reasons open source is great, after all.

From the code:

// We only want to run scripts if the server really intended to
// send us a JAR file.  Check the server-supplied content type for
// a JAR type.
...
mIsUnsafe = !contentType.EqualsLiteral("application/java-archive") &&
            !contentType.EqualsLiteral("application/x-jar");
...
if (prefs) {
    prefs->GetBoolPref("network.jar.open-unsafe-types", &allowUnpack);
}

if (!allowUnpack) {
    status = NS_ERROR_UNSAFE_CONTENT_TYPE;
}

Ignoring the fact that the code uses a negative assertion (i.e. mIsUnsafe) rather than a positive one (i.e. mIsSafe), it tells us that Firefox looks for the correct Content-Type sent by the webserver or, alternatively, for the "network.jar.open-unsafe-types" setting to be enabled. This is probably there to prevent the GIFAR attack. So it seems that the Google Pages server doesn't return the correct Content-Type. We can quickly confirm this with the command:

curl http://ghdb.mirror.googlepages.com/ghdb.jar --output /dev/null --dump-header /dev/stdout

And indeed the result is:

HTTP/1.1 200 OK
Last-Modified: Wed, 31 Dec 2008 11:25:06 GMT
Cache-control: public
Expires: Fri, 09 Jan 2009 10:54:28 GMT
Content-Length: 2700935
Content-Type: application/octet-stream
Date: Fri, 09 Jan 2009 10:54:28 GMT
Server: GFE/1.3
...

So the options would be to (a) tell people to lower their security or (b) not use Google's server, neither of which is particularly attractive.
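
For completeness: option (a) boils down to asking every visitor to flip the pref the Firefox code above checks for, while option (b) means hosting the JAR on a server that sends one of the accepted Content-Types. Something along these lines (the Apache directive assumes you control the server configuration; names and paths are just examples):

// option (a): per visitor, e.g. via a user.js line (or toggled in about:config)
user_pref("network.jar.open-unsafe-types", true);

# option (b): on your own Apache server, serve .jar files with a
# Content-Type the Firefox code accepts
AddType application/java-archive .jar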

Now let's take a look at the MHT format. Like many other MS formats, it is very sparsely documented (all hail our closed-source overlord), although there have been some standardization efforts. Anyway, here is the Perl script I threw together to generate an MHTML file from the mirror:

use strict;
use warnings;
use File::Basename;
use MIME::Lite;
use File::Temp qw/tempfile/;
use MIME::Types;


my $mimetypes = MIME::Types->new;

# The top-level MHTML container is a multipart/related MIME message.
my $msg = MIME::Lite->new(
        From    =>'Saved by Microsoft Internet Explorer 5',
        Subject =>'Google Hacking Data Base',
        Type    =>'multipart/related'
    );

my @tempfiles;

# Walk the mirror directory and attach every regular file in it.
opendir my $d, 'WEB' or die "Cannot open WEB: $!";
while (my $f = readdir $d) {
  $f = "WEB/$f";
  next unless -f $f;

  # Guess the MIME type from the file extension.
  next unless $f =~ /\.([^\.]+)$/;
  my $ext = lc $1;
  my $mime_type = $mimetypes->mimeTypeOf($ext);
  my $path = $f;

  # HTML needs its relative links rewritten to absolute URLs before it
  # goes into the archive, so write the modified markup to a temp file.
  if ('text/html' eq $mime_type) {
    my ($fh, $filename) = tempfile( "tmimeXXXXXXXX" );

    open my $fhtml, '<', $f or die "Cannot read $f: $!";
    my $html = join('', <$fhtml>);
    close $fhtml;
    $html =~ s/(href|src)\s*=\s*"(.*?)"/manipulate_href($1, $2)/ge;
    $html =~ s/(href|src)\s*=\s*'(.*?)'/manipulate_href($1, $2)/ge;
    $html =~ s/(href|src)\s*=\s*([^'"][^\s>]+)/manipulate_href($1, $2)/ge;
    print $fh $html;
    close $fh;

    $path = $filename;
    push @tempfiles, $path;
  }

  my $part = $msg->attach(
      Type        => $mime_type,
      Path        => $path,
      Filename    => basename $f,
  );
  # The Content-Location header is what lets the browser address this
  # part by URL from within the MHT archive.
  $part->attr('Content-Location' => 'http://example.com/' . basename $f);
}
closedir $d;

$msg->print(\*STDOUT);

unlink $_ for (@tempfiles);

# Rewrite a href/src target to an absolute http://example.com/ URL,
# leaving already-absolute URLs untouched.
sub manipulate_href {
  my ($attr, $target) = @_;

  return qq{$attr="$target"} if ($target =~ /^http:\/\//i);
  return qq{$attr="http://example.com/$target"};
}
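
The script writes the assembled MHTML to standard output, so generating the archive is just a matter of redirecting it to a file (the script name here is made up):

perl ghdb2mht.pl > ghdb.mht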

The two important things here are that each element must contain a Content-Location header (OK, this is somewhat of an oversimplification, because there are other ways to identify subcontent, but this is the easiest) and that all URLs must be absolute! This is why all the regex replacement is going on (again, this is a quick hack; if you want to create production code, you should consider using a real HTML parser). Another possibility - which I haven't tried - is to use the BASE tag; you may also want to check out the changes IE7 brings to it, although most probably they wouldn't affect you.
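
To make that concrete, each file ends up as a MIME part that looks roughly like this (headers abbreviated; the boundary string is whatever MIME::Lite generates):

--<boundary>
Content-Type: text/html
Content-Location: http://example.com/_0toc.html

<html> ... with every href/src rewritten to an absolute http://example.com/... URL ... </html>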

Now, with the MHT file created, it's time to try it out (with IE, obviously):

mhtml:http://ghdb.mirror.googlepages.com/ghdb.mht!http://example.com/_0toc.html

The result was IE consuming 100% CPU (or less if you are on a multi-core system :-)) and seemingly doing nothing. I tried this on two different systems with IE6 and IE7. I assume that in the background it was downloading and parsing the file, but I just got bored with waiting. Update: I did manage to get it working after a fair amount of effort; however, it seemed to want to download the entire file on each click, making this solution unusable. It still might be an alternative for smaller files...

Conclusions and future work:

  • Both solutions want to download the entire file before displaying anything, making them very slow for large files.
  • It would be interesting to see whether the MHT file could incorporate compressed resources, i.e. something like Content-Encoding: gzip, base64 (first gzipped, then base64 encoded). This could reduce the size problem.
  • It would also be interesting to know in which security context the content is interpreted: hopefully in the context of the MHT file's URL (i.e. in this case http://ghdb.mirror.googlepages.com/), rather than the URL given in Content-Location (i.e. http://example.com). If not, it could result in some nasty XSS-type scenarios: a malicious individual crafts MHT pages with resources referring to http://powned.com/, hosts them on his own server, convinces a user to click on the link mhtml:http://evil.com/pown.mht!http://powned.com/foo.html, and steals, for example, the cookies from powned.com, even if powned.com has no vulnerabilities per se! I'm too lazy to try this out :-), but hopefully this can't happen.

4 comments:

  1. Thanks for your post. Can you please specify what workaround you used to address the high CPU usage when using MHT in Internet Explorer? I tried to look at http://ghdb.mirror.googlepages.com/ghdb.mht
    but I got a "404 file not found".
    Best regards,
    Amitay

  2. @Amitay: Unfortunately I don't remember what the solution to the CPU problem was. It is possible that simply being more patient (i.e. waiting a little longer) did the trick.

    Also, Google Pages is now defunct (Google replaced it with another service), which is the reason for the 404.

    As I recall, the MHT approach was pretty much unworkable for any somewhat-large collection of pages.

    Regards.

  3. @Amitay: +1 for your blog (http://www.doboism.com/blog/, not the blogspot one). I've just subscribed to it.

  4. Thanks! (both for subscribing and trying to remember something from exactly a year ago).
