Two components usually found in web applications are authentication and static files. In this post I will try to show how the two interact. The post refers to PHP and Apache specifically, since these are the platforms I'm familiar with, but the ideas are generally applicable.
The advantages of static files are: cacheability out of the box (with a dynamically generated result this is very hard to get right) and less overhead when serving up (even more so if something specialized is used like tiny httpd). However you might feel the need to apply authentication to the static files also (that is only users with proper privileges can have access to them). Of course you want to retain the advantages of caching and low overhead as much as possible.
One option (probably the one with the least overhead, and ultimately the simplest to implement) is to use mod_auth_mysql on the directory hosting the static files: generate a random, long (!) username and password for each user session, insert them into the authentication table, and modify the links to the resources to include these credentials. For example, a link in this case might look like this:
http://w7PLTHUDxK:<random-password>@<host>/static/image.jpg
The advantage of this approach is that we get all those wonderful things like content type or cache headers (or even zlib compression if we configured it) for free. The main pitfall is choosing where to do the cleanup (removing the temporary user from the table). The session destroy handler is not good enough, since it won't be called if the user doesn't log out properly. One solution is to run periodic "garbage collections" on the table (in this case take care to set the garbage collection interval equal to or larger than the session timeout interval, since otherwise access might "go away" from under the users' feet while they are still logged on). Another option is to add a user id column to the table and use the "REPLACE INTO" SQL command (which is AFAIK a MySQL extension, not standard SQL) to ensure that the temporary user table has at most as many rows as the main user table.
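As a rough illustration of the credential step, here is a minimal PHP sketch. The function names, the host example.com and the 16-byte length are assumptions made for the example, not part of mod_auth_mysql. Storing the pair in the authentication table is left out, because the table layout and password format depend on how mod_auth_mysql is configured.

```php
<?php
// Hypothetical helper: create throwaway HTTP Basic credentials for one session.
function make_static_credentials(int $bytes = 16): array
{
    // bin2hex() yields only [0-9a-f], so both values are safe inside a URL.
    return [
        'username' => bin2hex(random_bytes($bytes)),
        'password' => bin2hex(random_bytes($bytes)),
    ];
}

// Hypothetical helper: build a link that embeds the credentials.
function make_static_link(array $cred, string $host, string $path): string
{
    return sprintf(
        'http://%s:%s@%s/%s',
        $cred['username'],
        $cred['password'],
        $host,
        ltrim($path, '/')
    );
}

// Assumed host and path, for illustration only.
$cred = make_static_credentials();
$link = make_static_link($cred, 'example.com', 'static/image.jpg');
```

The same credentials would then be inserted into (and later removed from) the table that mod_auth_mysql reads.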
A quick note: all of the above can of course also be done with static authentication (that is, a hardcoded username and password in the .htaccess file). This is a very simple solution (and easier to apply, since mod_auth_mysql might not be installed/enabled on all web servers, but mod_auth is on most of them), but it is insecure: it cannot be used to separate users (i.e. to have files which only certain users can access), and because the credentials never expire, one leaked link is enough for search engines / other crawlers to find the files.
This is all well and good, but what if you don't have control over the server configuration? While I strongly recommend against using shared PHP hosting, some people might be in this situation. The solution is to recreate (at least some of) Apache's functionality.
The first step is to put the actual static files outside your web root (preferably) or to deny access to the folder where the files are placed with .htaccess (less preferable). If the files were to reside in a public folder, this system would provide obfuscation at best and would be equivalent to a 302 or 301 redirect at worst.
The next step is to decide on the method of referencing your static file. You have three options:
- Put the file name directly in as a GET parameter (for example
get_static.php?fn=image.jpg
)
- Use mod_rewrite to simulate a directory structure (
static/image.jpg
which will be rewritten by a rule into the form shown in the previous point)
- Use the fact that Apache walks up the path until it finds the first existing file / directory, so you can do something like
get_static.php/image.jpg
The second and third options are the ones I recommend. The reason is that they give the browser the illusion that it is dealing with different files, which can help it do proper caching without relying on the ETag mechanism discussed later.
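All three styles can be handled by one small piece of code. The sketch below assumes the script is get_static.php and the GET parameter is fn, as in the examples above. The function name is hypothetical, and the rewrite rule for the second option is not shown. The returned value is untrusted input and still has to be validated.

```php
<?php
// Hypothetical helper: find the requested file name for any of the three URL styles.
// $server and $get stand in for $_SERVER and $_GET so the function is easy to test.
function requested_static_name(array $server, array $get): ?string
{
    // get_static.php/image.jpg: Apache passes the trailing part as PATH_INFO.
    if (!empty($server['PATH_INFO'])) {
        return ltrim($server['PATH_INFO'], '/');
    }
    // get_static.php?fn=image.jpg, which is also what a rewrite rule would produce.
    if (isset($get['fn']) && is_string($get['fn'])) {
        return $get['fn'];
    }
    return null; // nothing was requested
}
```

In the real script this would be called as requested_static_name($_SERVER, $_GET).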
I would like to pause for a moment and remind everybody that security is a big concern in the web world, since you are practically putting your code out for everybody, meaning that anybody can come and try to break it. One particular concern with these types of scripts (which read and echo back to you arbitrary files) is path traversal. This attack is easy to demonstrate with the following example:
Let's say that the script works by taking the filename given, concatenating it with the directory (which for this example is /home/abcd/static/
) and echoing back the given file. Now if I supply as the filename something like ../../../etc/passwd
, the resulting path will be /home/abcd/static/../../../etc/passwd
, meaning that I can read any file the web server has access to. And before the Windows guys start jumping up and down saying that this is a *nix problem, the example is very easy to translate to Windows.
Now your first reaction would be to disallow (blacklist) the usage of the .
character in the path, but don't go this way. Rather, define the rule which your files will follow and verify that the supplied parameters follow that rule. For example the filenames will contain one or more alphanumeric, underscore or dash character and will have a png, jpg, css or js extension
. This translates into the regular expression ^[a-z0-9_\-]+\.(png|jpg|css|js)$
. Be sure to include the start and end anchors (otherwise the input only has to contain a substring matching the rule, rather than match it as a whole) and watch out for other regular expression gotchas. As an added security measure, pass the final path through the realpath function (which resolves things like symbolic links or ..
sequences) and verify that the result is still inside your static directory before doing anything else with it.
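The two checks can be combined roughly as follows. This is a minimal sketch: the function name and the base directory parameter are assumptions for the example, and the rule is the one given above (lowercase names only). The D modifier is added because, in PCRE, a plain $ also matches just before a trailing newline, which is one of the gotchas mentioned.

```php
<?php
// Hypothetical helper: return the real path of a requested file, or null if
// the name breaks the naming rule or resolves outside $baseDir.
function resolve_static_file(string $name, string $baseDir): ?string
{
    // Whitelist: the whole name must match. The D modifier stops "$" from
    // also matching before a trailing newline.
    if (preg_match('/^[a-z0-9_\-]+\.(png|jpg|css|js)$/D', $name) !== 1) {
        return null;
    }
    $base = realpath($baseDir);
    $path = realpath($baseDir . DIRECTORY_SEPARATOR . $name);
    if ($base === false || $path === false) {
        return null; // the directory or the file does not exist
    }
    // After symlinks are resolved, the file must still live inside the base directory.
    if (strncmp($path, $base . DIRECTORY_SEPARATOR, strlen($base) + 1) !== 0) {
        return null;
    }
    return is_file($path) ? $path : null;
}
```

A null result should be answered with a plain 404, without revealing which of the checks failed.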
Now we have the file, and need to generate the headers. The important headers are:
- Content-Length - this is very straightforward: it is the size in bytes of the body you send. (HTTP theoretically allows units other than bytes for range requests, but in practice bytes are always used.)
- Content-Type - this can be obtained using the mime_content_type function, however be aware that sometimes it fails to identify the correct type and action must be taken to correct it (for example a CSS file might be identified as
text/plain
, but it must be served as text/css
to work in all the browsers)
- Cache headers - depending on how long you think the clients / intermediate proxies should cache your content, these must be set accordingly.
- ETag - this header is an opaque identifier for one specific version of the content behind a URL. The browser stores it together with the cached copy and sends it back when revalidating, so the server can tell whether that copy is still current. This matters most for script-generated responses such as
http://example.com/image.php?id=1
and http://example.com/image.php?id=2
, where the content behind a URL can change without the URL changing, and where some older caches treat URLs with query strings conservatively. An ETag can be any quoted string, so for example you could use the MD5 hash of the file (and no, there is no information disclosure vulnerability here which would warrant the usage of salted hashes, because the user is already getting the file and can recalculate its MD5 if s/he wishes!)
- Content-Encoding - if you wish to compress your content, and it makes sense to do so, be sure to output the proper Content-Encoding header. Also make sure to adjust the Content-Length header to the compressed size, otherwise you could have some serious breakage.
- Accept-Ranges - if you wish to enable resume support for the file (that is, for the client to be able to start downloading from the middle of the file, for example), you need to send this header and handle the corresponding request header, as described below.
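To make the list above concrete, here is a minimal sketch that builds the basic headers for a file that has already been validated. The function name, the one-hour default lifetime, the choice of "private" caching and the extension override table are assumptions for the example. Compression and range support are left out, so no Content-Encoding or Accept-Ranges header is produced.

```php
<?php
// Hypothetical helper: build the response headers for an already validated file.
function static_file_headers(string $path, int $maxAge = 3600): array
{
    // Override table for types that automatic detection tends to get wrong.
    $byExtension = ['css' => 'text/css', 'js' => 'application/javascript'];
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // mime_content_type() is only consulted when the extension is not in the table.
    $type = $byExtension[$ext] ?? (mime_content_type($path) ?: 'application/octet-stream');

    return [
        'Content-Type'   => $type,
        'Content-Length' => (string) filesize($path),
        // "private" is an assumption: the files are per-user, so shared proxies
        // should not store them.
        'Cache-Control'  => 'private, max-age=' . $maxAge,
        'Last-Modified'  => gmdate('D, d M Y H:i:s', filemtime($path)) . ' GMT',
        // ETags are quoted strings; here the MD5 of the file content is used.
        'ETag'           => '"' . md5_file($path) . '"',
    ];
}
```

The script would then emit each pair with header("$name: $value") and send the body with readfile($path).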
The script also needs to take into account the request headers:
- If-Modified-Since - the browser is checking the validity of its cached object, so the script should respond with a 304 Not Modified status and no content body if the content has not changed since the given date.
- Accept-Encoding - this should be checked before providing compressed (gzipped) content. Also, beware that some older browsers falsely claim to support gzipped content.
- Range - if you advertised that you handle ranges, you must look out for this header and send only what was requested, with a 206 Partial Content status and a Content-Range header. Compression complicates this further: the byte offsets refer to the content as it is transferred (the compressed bytes when Content-Encoding is used), so the simplest approach is to serve ranges uncompressed. Either way, make sure to output the correct Content-Length
- If-None-Match - if you supplied an ETag when serving up the content, the browser will (should) send it back in this header when checking its cache
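The cache validation part of this can be sketched as below. Browsers send a previously received ETag back in the If-None-Match request header, which PHP exposes as $_SERVER['HTTP_IF_NONE_MATCH']. The function name is hypothetical, and the sketch ignores weak validators (the W/ prefix) and range requests.

```php
<?php
// Hypothetical helper: true when the client's cached copy is still valid,
// i.e. the script should answer 304 Not Modified with an empty body.
// $server stands in for $_SERVER; $etag must include its surrounding quotes.
function is_not_modified(array $server, string $etag, int $mtime): bool
{
    // If-None-Match carries the ETag(s) sent earlier; it takes precedence
    // over the date-based check.
    if (isset($server['HTTP_IF_NONE_MATCH'])) {
        $candidates = array_map('trim', explode(',', $server['HTTP_IF_NONE_MATCH']));
        return in_array('*', $candidates, true) || in_array($etag, $candidates, true);
    }
    // Otherwise fall back to comparing the file's modification time.
    if (isset($server['HTTP_IF_MODIFIED_SINCE'])) {
        $since = strtotime($server['HTTP_IF_MODIFIED_SINCE']);
        return $since !== false && $mtime <= $since;
    }
    return false; // unconditional request: send the full content
}
```

When it returns true, the script would call http_response_code(304) and exit without sending a body.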
After writing all this up, I found that there is a PHP extension which provides most of the functions for this: HTTP. Use it. It's much easier than rolling your own, and you have less chance of missing some corner case (like the fact that, as per HTTP/1.1, header names are case-insensitive, meaning that If-Modified-Since and iF-mOdIfIeD-sInCe are the same thing and should be treated the same).
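If you do roll your own, the case-insensitivity corner case only bites when you read the raw header list (for example from apache_request_headers()), since $_SERVER already normalizes names to the HTTP_... form. A minimal sketch of a lookup that ignores case, with a hypothetical function name:

```php
<?php
// Hypothetical helper: look up a header by name regardless of letter case.
// $headers is a name => value array, e.g. the result of apache_request_headers().
function find_header(array $headers, string $name): ?string
{
    foreach ($headers as $key => $value) {
        if (strcasecmp((string) $key, $name) === 0) {
            return $value;
        }
    }
    return null; // header not present
}
```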
PS. I didn't mention it, but this mechanism can also be used to hide the real file names. This might be needed when, for whatever reason, you don't want to divulge them (because file names can provide additional information which you might not want your users to have). This can be achieved by adding a step and giving the user a token which is translated into a file name on the server. These tokens can be:
- Generated from the file name
- Arbitrarily chosen
- Created using a random process
- Created using a deterministic process
For maximum security I recommend going with arbitrarily chosen random tokens for each file (otherwise an attacker might break the security by trying other IDs - for example if the IDs are numeric, s/he can try other numbers - or by guessing file names, applying the generator function to them and checking whether the resulting file exists).
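A minimal sketch of the recommended variant (arbitrary random tokens) follows. The function names are hypothetical, and the mapping is kept in a plain array purely for illustration. In practice it would live in a database table or similar persistent store.

```php
<?php
// Hypothetical helper: give a file a random, unguessable token and remember the mapping.
function issue_token(array &$tokenMap, string $fileName): string
{
    // Reuse an existing token so each file keeps a single stable URL.
    $existing = array_search($fileName, $tokenMap, true);
    if ($existing !== false) {
        return (string) $existing;
    }
    // 128 random bits, not derived from the file name in any way.
    $token = bin2hex(random_bytes(16));
    $tokenMap[$token] = $fileName;
    return $token;
}

// Hypothetical helper: translate a token back to a file name; unknown tokens yield null.
function file_for_token(array $tokenMap, string $token): ?string
{
    return $tokenMap[$token] ?? null;
}
```

Because the token is not computed from the file name, guessing names and running them through a generator function gets an attacker nowhere.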
Update: I've looked at using mod_xsendfile with PHP, however it seems to be a dormant project (the latest posted version is for Apache 2.0, nothing there for 2.2 :-(). Another option which may be worth exploring is the following (if you are using PHP as a loadable module rather than CGI): use virtual to redirect the request to the static files. You can even find a good example in the comments.