(updated 3/4 to include the “Serving from clusters” case)
As I see more server scripts implementing conditional GET (a good thing), I also see a tendency to use a hash of the content for the ETag header value. While this doesn’t break anything, it often needlessly reduces the performance of the system.
ETag is often misunderstood to function as a global cache key, i.e., if two URLs give the same ETag, the browser could use the same cache entry for both. This is not the case. An ETag only identifies a version of a given URL. Think of the cache key as (URL + ETag): both must match before the client can use a cached entry in a conditional GET request.
What follows is that, if you have a unique URL and can send a Last-Modified header (e.g. based on mtime), you don’t need ETag at all. The older HTTP/1.0 Last-Modified/If-Modified-Since mechanism works just fine for implementing conditional GETs and will save you a bit of bandwidth. Opening and hashing content to create or validate an ETag is just a waste of resources and bandwidth.
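As a rough illustration, here is a minimal Python sketch of the Last-Modified/If-Modified-Since exchange. It assumes the handler already knows the resource’s mtime as a Unix timestamp; the function name and the (status, headers) return shape are made up for this example and don’t come from any particular framework.

```python
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime


def conditional_get_by_mtime(mtime, if_modified_since=None):
    """Return (status, headers) for a resource last modified at `mtime`.

    `mtime` is a Unix timestamp; `if_modified_since` is the raw request
    header value, or None if the client did not send one.
    """
    last_modified = int(mtime)  # HTTP dates only have one-second resolution
    headers = {"Last-Modified": formatdate(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
            if since.tzinfo is None:
                # Treat a date without a usable zone as UTC.
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since.timestamp():
                return 304, headers  # client copy is still current
        except (TypeError, ValueError):
            pass  # unparseable date: ignore the header and send the full body

    return 200, headers
```

Note that nothing here opens or reads the file; a stat call for the mtime is all the server needs.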
When you actually need ETag
There are only a few situations where Last-Modified won’t suffice.
Multiple versions of a single URL
Let’s say a page outputs different content for logged-in users, and you want to allow conditional GETs for each version. In this case, the ETag needs to change with auth status. In fact, you should assume different users might share a browser, so you’d want to embed something user-specific in the ETag as well. E.g.,
ETag = mtime + userId.
In the case above, make sure to mark private pages with “private” in the Cache-Control header, so any user-specific content will not be kept in shared proxy caches.
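One way this could look in Python, as a sketch only: the helper names, the “anon” marker for logged-out visitors, and the exact tag layout are assumptions for illustration. The only firm requirements are that the tag changes with both mtime and user, and that the header value is a quoted string.

```python
def user_etag(mtime, user_id=None):
    """Build a quoted ETag from the modification time and the viewer."""
    # "anon" is an arbitrary marker for the logged-out version of the page.
    viewer = "anon" if user_id is None else f"u{user_id}"
    return f'"{int(mtime)}-{viewer}"'


def response_headers(mtime, user_id=None):
    """Headers for one version of the page; user versions are marked private."""
    headers = {"ETag": user_etag(mtime, user_id)}
    if user_id is not None:
        # Keep user-specific content out of shared proxy caches.
        headers["Cache-Control"] = "private"
    return headers
```

If exposing raw user ids in a header is a concern, any stable per-user value that isn’t sensitive would serve the same purpose.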
No modification time available
If there’s no way to get (or guess) a Last-Modified time, you’ll have to use ETag if you want to allow conditional GETs at all. You can generate it by hashing the content (or by using any function whose value changes when the content changes).
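A hash-based ETag can be as small as the following Python sketch. SHA-256 and the 32-character truncation are arbitrary choices made for the example; any digest that changes with the content would do.

```python
import hashlib


def content_etag(body: bytes) -> str:
    """Return a quoted ETag derived from the response body."""
    digest = hashlib.sha256(body).hexdigest()[:32]  # shortened to save header bytes
    return f'"{digest}"'
```

The cost is the one described earlier: the server has to produce (or read) the full body before it can tell whether a 304 is possible.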
Serving from clusters
If you serve files from multiple servers, file timestamps may differ between them. The Last-Modified date sent out then shifts depending on which server handles the request, and a client that hits a different server gets a needless 200 response. Basically, if you can’t trust your mtime to stay in sync across servers (I don’t know how often this is an issue), it may be better to place a hash of the content in an ETag.
Whenever you use ETag, keep in mind that a conditional GET request may contain multiple ETag values in the If-None-Match header. Returning the 304 status code alone is not sufficient; you must also include the particular ETag of the content you want the client to use. Most software I’ve seen at least gets this right.
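To make the If-None-Match handling concrete, here is a hedged Python sketch. The function name and return shape are invented for the example, and the regular expression is a simplification of the header grammar. Tags are compared with any W/ weak prefix stripped, which is how If-None-Match comparisons are meant to work.

```python
import re

# Matches one entity tag, optionally weak, capturing the text inside the quotes.
_ETAG_RE = re.compile(r'(?:W/)?"([^"]*)"')


def check_if_none_match(current_etag, if_none_match):
    """Return (status, headers) given the resource's current quoted ETag."""
    # The ETag goes out on the 304 as well, so the client knows which
    # of the tags it offered is the one to use.
    headers = {"ETag": current_etag}

    if if_none_match:
        if if_none_match.strip() == "*":
            return 304, headers  # "*" matches any existing version
        current = _ETAG_RE.fullmatch(current_etag)
        offered = _ETAG_RE.findall(if_none_match)
        if current and current.group(1) in offered:
            return 304, headers

    return 200, headers
```

For a request carrying If-None-Match: "v1", "v2" against a resource whose current tag is "v2", this returns a 304 together with ETag: "v2".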