ETags, Optimistic Concurrency and Azure
January 4, 2015

A fair number of Azure services, like Azure Table, DocumentDb or Azure Cache (In-Role), provide an optimistic concurrency model for data updates. What it means to you as a developer is handling a so-called ETag while doing reads and writes. Every time you read a row from Table Storage or a JSON document from DocumentDb, you get back a small piece of opaque data - an ETag - along with the main payload. You probably haven't even paid attention to it, since the Azure APIs don't force you to use ETags explicitly. If you run parallel update operations, though, ETags give you great power to control concurrency.

The ETag notion comes from the HTTP protocol, where a client can issue a request with an If-Match condition in the request headers; if the server fails to meet this condition, it responds with HTTP status code 412 (Precondition Failed). More on this can be found in this section of the HTTP protocol specification.
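To make the mechanics concrete, here is a minimal sketch of such a conditional request in Python using the requests library. The URL and the JSON shape are made up for illustration only; real Azure endpoints have their own URIs and require authentication headers.

import requests

# Hypothetical REST resource, for illustration only.
url = "https://example.com/api/documents/schedule"

# Read the object; the server returns its current ETag in a response header.
resp = requests.get(url)
etag = resp.headers["ETag"]
schedule = resp.json()

# ... modify the schedule locally ...

# Conditional write: the server applies it only if the ETag still matches.
update = requests.put(url, json=schedule, headers={"If-Match": etag})
if update.status_code == 412:
    # Precondition Failed: the object changed between our read and write.
    print("conflict detected - re-read and retry")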

It's useful to think of an ETag as a one-time passcode: you obtain it when you read an object and then attach it to your write request when you update that object. As long as you provide the right passcode, you are allowed to make the change. However, if someone else got in your way and updated the object on the server between your read and write operations, your passcode becomes invalid, blocking you from a successful write.

OK, let's see where ETags are a must. Imagine we run a sports event website and mobile app covering multiple sport disciplines - each discipline has multiple competitions, and every competition has its own schedule that is updated independently. Users can find the full event schedule on the main page of our website and in the mobile app.

Schedule example

Let's say we store the full schedule as a JSON document in DocumentDb. Every time our REST update service gets a request to update a specific competition, it reads that document from the database, updates the required competition's information, and finally writes the altered document back to the database. Easy, right? But what if our service gets multiple update requests for different competitions at the same time? We can't risk having the schedule lose some of those updates. You got it - we have to deal with concurrency. This is where we should use the document's ETag to ensure the underlying document has not been modified between the read and write operations. If it has been modified, our code retries the whole thing from scratch: read the document, update the competition, write the document back to the database. Consider this pseudo code:

var committed = false;
while (!committed) {
    // read the document together with its current ETag
    var (schedule, etag) = read_document(id: "schedule")
    // ... update the schedule
    // conditional write: the server checks the ETag (If-Match) before applying it
    var http_response_code = update_document(id: "schedule", schedule, if_match: etag)

    if (http_response_code == 200) // Assume 200 means the object has been updated
        committed = true;
    else if (http_response_code == 412) // Underlying document was modified, retry the operation
        committed = false;
    else // some other error
        // ... handle differently
}
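As runnable code, the same retry loop might look like the sketch below. It assumes the same hypothetical REST endpoint as in the earlier example, honoring If-Match, and uses Python's requests library; it is not tied to any particular Azure SDK.

import requests

URL = "https://example.com/api/documents/schedule"    # hypothetical endpoint

def update_schedule(apply_update):
    while True:
        # Read the current document together with its ETag.
        resp = requests.get(URL)
        resp.raise_for_status()
        schedule = resp.json()
        apply_update(schedule)            # e.g. change one competition's times

        # Conditional write: succeeds only if the ETag is still current.
        write = requests.put(URL, json=schedule,
                             headers={"If-Match": resp.headers["ETag"]})
        if write.status_code == 200:
            return schedule               # committed
        if write.status_code == 412:
            continue                      # somebody else won the race; re-read and retry
        write.raise_for_status()          # some other error - handle differently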

Special case: 409

So far we have assumed that we always update an existing object (document) on the server. What if the read operation returns no object (HTTP status code 404)? Since there's no object, there's no ETag either, and nothing to attach to the write operation. But this does not eliminate potential concurrency issues - parallel writers may still compete to create the same object at once. Luckily, this is resolved on the server side (by the Azure services) with HTTP status code 409 (Conflict), which basically states that a resource with such an ID (or URI) already exists. For the client it means using a designated method for object creation, different from the one for object update. Our updated pseudo code could look like this:

var committed = false;
while (!committed) {
    var http_response_code;
    var (schedule, etag) = read_document(id: "schedule")

    if (schedule == null) {
        // the object does not exist yet
        // ... create the schedule
        http_response_code = create_document(id: "schedule", schedule)
    } else {
        // the object already exists
        // ... update the schedule
        http_response_code = update_document(id: "schedule", schedule, if_match: etag)
    }

    if (http_response_code == 200) // Assume 200 means the object has been updated or created
        committed = true;
    else if (http_response_code == 412 or http_response_code == 409)
        // Underlying document was modified or created concurrently, retry the operation
        committed = false;
    else // some other error
        // ... handle differently
}
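The create branch can be sketched in the same style. Below, a 404 on the read means the schedule doesn't exist yet, so the code tries to create it and treats 409 (Conflict) as "another writer beat us to it, go back and retry the loop". As before, the endpoint and its status codes are assumptions about a hypothetical REST service, not a specific Azure API.

import requests

BASE = "https://example.com/api/documents"            # hypothetical endpoint

def create_schedule_if_missing(build_schedule):
    resp = requests.get(f"{BASE}/schedule")
    if resp.status_code != 404:
        return False                      # it already exists; take the update path

    schedule = build_schedule()           # construct a brand new schedule
    write = requests.post(BASE, json={"id": "schedule", **schedule})

    if write.status_code in (200, 201):
        return True                       # we created it
    if write.status_code == 409:
        return False                      # another writer created it first; retry the loop
    write.raise_for_status()              # some other error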

ETags vs Locks

OK, what's the deal, one might say? Why don't we just use locks to allow only one writer at a time? First off, truly scalable services like those in Azure provide no locks, precisely to avoid trading off their scalability. And machine-bound locks like a Mutex are limited to one machine - or one instance in terms of Azure WebSites and Cloud Services.

So here's how we could mimic a shared lock accessible by all instances: upon every update, just before the read operation, the writer tries to create a pseudo document with a well-known ID. If the document did not yet exist, this mimics a successful lock acquisition and the code can continue safely. Otherwise, the writer backs off and retries to acquire the lock after a short timeout. Finally, after the writer has acquired the lock and updated the principal document (the schedule), the code deletes the lock (the pseudo document). Below is pseudo code of how it could look:

// trying to acquire lock
// true if document is created, false if it already exists
while ( create_document(id: "lock-schedule") == false ) 
    sleep ( short_timeout )

var schedule = read_document(id: "schedule")
// ... update the schedule
update_document(id: "schedule", schedule)

// release the lock
delete_document(id: "lock-schedule") 
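As runnable code, the lock approach might look like the sketch below, reusing the hypothetical REST endpoints from the earlier examples (create_document becomes a POST that returns 409 if the document already exists, release is a DELETE). The try/finally releases the lock even if the update throws, but, as the next paragraph points out, it does not help if the whole process dies.

import time
import requests

BASE = "https://example.com/api/documents"            # hypothetical endpoint

def update_schedule_with_lock(apply_update):
    # Try to acquire the "lock" by creating a well-known pseudo document.
    while requests.post(BASE, json={"id": "lock-schedule"}).status_code == 409:
        time.sleep(0.1)                   # somebody else holds the lock; back off

    try:
        schedule = requests.get(f"{BASE}/schedule").json()
        apply_update(schedule)
        requests.put(f"{BASE}/schedule", json=schedule)
    finally:
        # Release the lock. If the process crashes before this line,
        # the lock document is left behind and blocks everyone else.
        requests.delete(f"{BASE}/lock-schedule")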

A tricky part of this approach is the lock's lifetime. How do we handle a possible machine (or process) crash in the middle of the update - right after the lock is acquired and just before it is released? Other instances are doomed to wait forever for the release of the now abandoned lock.

Services like Azure Cache (but not Azure Table, nor DocumentDb) allow creating an object with a specific lifetime. That helps a bit but introduces new challenges: how do we pick the lifetime so that it's not too long (to avoid long blocking after a crash) yet not too short (to avoid the lock releasing itself before the writer completes)? We could pick it small enough and spawn an additional thread that keeps extending the lifetime until the lock is released... However, all such tweaks bring potentially unnecessary complexity.
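To illustrate the lifetime idea, here is a sketch using Redis (discussed next) as an example of a store that supports expiring keys: the lock is created with SET NX EX, so it disappears on its own if the owner crashes. The key names, the 30-second timeout and the ownership token are illustrative choices, not a prescription.

import uuid
import time
import redis

r = redis.Redis()
LOCK_KEY = "lock-schedule"
LOCK_TTL = 30                             # seconds; the lock expires on its own after a crash

def with_schedule_lock(do_update):
    token = str(uuid.uuid4())             # identifies this writer as the lock owner
    # SET with NX (only if missing) and EX (expiry) acquires the lock atomically.
    while not r.set(LOCK_KEY, token, nx=True, ex=LOCK_TTL):
        time.sleep(0.1)
    try:
        do_update()
    finally:
        # Release only if we still own the lock (it may have expired and been
        # re-acquired by someone else). A production version would make this
        # check-and-delete atomic, e.g. with a small Lua script.
        if r.get(LOCK_KEY) == token.encode():
            r.delete(LOCK_KEY)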

Redis

Redis stands out in that it does not provide an ETag-like field, unlike Azure In-Role Cache, which attaches a DataCacheItemVersion field to every stored object. Instead, Redis provides optimistic concurrency by means of the WATCH command. There's a really good explanation of how to achieve this in Redis and the Azure Redis client in the StackExchange.Redis documentation.
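For completeness, here is what the WATCH-based optimistic loop might look like with the redis-py client; the StackExchange.Redis documentation mentioned above covers the equivalent pattern for .NET clients. The key name and the JSON encoding are assumptions for the example.

import json
import redis

r = redis.Redis()

def update_schedule(apply_update, max_retries=10):
    key = "schedule"
    with r.pipeline() as pipe:
        for _ in range(max_retries):
            try:
                pipe.watch(key)                   # start watching for concurrent changes
                raw = pipe.get(key)               # immediate-mode read while watching
                schedule = json.loads(raw) if raw else {}
                apply_update(schedule)
                pipe.multi()                      # switch to transactional mode
                pipe.set(key, json.dumps(schedule))
                pipe.execute()                    # fails if "schedule" changed since WATCH
                return True
            except redis.WatchError:
                continue                          # conflict detected; re-read and retry
    return False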