hype-free: Avoiding the dogpile effect

Sunday, May 11, 2008

Avoiding the dogpile effect

When using caching to speed up webpages (or other request-response oriented protocols), it is very common to tie the update of the cache to a new request, meaning that every request checks if the cache is too old first. If not, it returns the value from the cache. If it is, it tries to recompute the value, after which it is stored in the cache and returned to the client.

Two advantages of this approach are that first it doesn't require other, special scripts to run on the server, and second of all, if the value is not needed, it's not recalculated.

However, one problem which the naive implementation miss, is the fact that if multiple requests come at roughly the same time, and the cache is expired, all the requests will try to recalculate the value, which can very much bring down the service if this operation is costly. Apparently this effect is known as the dogpile effect, clobbering updates or stampeding requests.

To prevent this, you have basically two (and a half) choices:

Move the update process outside of the request-response cycle (in a cronjob for example). Advantages: simple (if you can arrange for scripts to be run on a periodical basis) and gives a predictable response time (because data is always taken from the cache). Disadvantages: the data recalculation is made even when there are no clients requesting it (this can avoided by creating a signaling mechanism between the requests and the update script - so that the recalculation is done only if there were requests recently for the data - however this has the probability to return outdated data). It requires you to be able to run scripts periodically. Also, the dogpile effect can still manifest itself, if for some reason the recalculation step takes longer than the update interval (so you should use lockfiles or similar synchronization mechanisms to ensure that only one instance of the update script can run at a given moment)
Keep the update code inside of the request-response cycle. When you observe that the cache has expired, you should take a lock on the element before you try to recalculate it, and only do the recalculation if the lock was successfully aquired. Advantage: only one request will recalculate the value, thus avoiding the mini-ddos. This technique can be used even when there is no possibility to run scripts on a scheduled basis. Also unnecessary updates will be avoid (if there are no requests, the recalculation will not occur). Disadvantage: requests will have variable latency (shorter for cache hits, longer for cache misses).
This later technique has two variations (depending on the systems tolerance for delay and data freshness): when a thread observes that the data has expired but it fails to aquire the update lock, it can do two things: return the cached data immediately (this means that it can return outdated data, but only one request - the one doing the acual recalculation - will have a delay) or wait for the update script to finish (to ensures up-to-date data, however all the requests during the update time will be delayed). The first method can be implemented using a try-lock method (the script tries to aquire the lock with a timeout of 0) and the second with a lock mechanism (the script tries to aquire the lock with an infinite timeout). When using the second method, make sure that you check the freshness of the data a second time after you aquired the update lock, because otherwise you might recalculate the data multiple times in the following scenario:
- Request A comes in, it aquires the lock to recalculate the data
- Request B comes in, tries to aquire the lock and enters a waiting state
- Request A finishes
- Request B aquires the lock, but now the cache contains fresh data (written by request A), thus there is no reason to recalculate it.

For locking you can use the filesystem, or databases. For example with MySQL you can use the GET_LOCK function, under PostgreSQL you can use the advisory lock functions and with memcache you can emulate the locks. One important thing though: in PostgreSQL you can trick the locking system (or rather you can trick yourself) if you do the following:

Request A aquires the lock in exclusive mode
Request B aquires the lock in shared mode
Request B re-aquires the lock in exclusive mode

At first glance you would think that step 3 fails, since A already has an exclusive lock. However the logic if you already have the lock, all locking will succeed overrides this, and request B succeeds. My advice would be: choose one locking strategy (exclusive or shared) and stick with it, don't try to intermix the two (or at least not for the same lock).

3 comments:

Anonymous6:51 PM
Great post, congratulations.
ReplyDelete
Replies
John Moylan1:06 PM
Squid deals with this pretty well using the
"collapsed_forwarding on" option

Just stick Squid in front of your app.
ReplyDelete
Replies
Cd-MaN1:39 PM
@John: thanks for the tip. Squid (just like Apache) is a wonderful tool with a myriad of configuration options and it takes time to master them all.

A quick look at the documentation seems to indicate that you have to take great care when using this options - you absolutely need to make sure that the caching headers returned by your application are correct, or you won't get the desired effect.
ReplyDelete
Replies