Tuesday, June 17, 2008

Transactional Services

In the realm of systems that manage a persistent store of some kind (for example a database, a queue, a topic or a file system), a transaction is defined as an atomic (indivisible) unit of work. The transaction manager ensures that all data manipulation operations performed as part of a transaction complete successfully, or not at all.

Consider the classic example of transferring money from one account to another. This operation involves reducing the balance of one account as well as increasing the balance of another. It is critical that either both operations succeed or both fail. For only one to succeed would leave the persistent store in an inconsistent state.
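The all-or-nothing guarantee can be sketched with SQLite, whose connections expose exactly this behaviour when used as a context manager. The `accounts` table and account names below are illustrative assumptions, not a real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    # Both updates run inside one transaction: either both are
    # committed, or an error rolls both back, leaving the store
    # in its previous consistent state.
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
    except sqlite3.Error:
        pass  # the rollback has already restored consistency

transfer(conn, "alice", "bob", 10)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# balances: alice 90, bob 10 — both sides of the transfer applied together
```

The `with conn:` block is what makes the two updates indivisible; without it, a crash between the two `UPDATE` statements could leave money debited but never credited.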

Transaction managers also provide features to manage concurrency - that is, multiple transactions executing in parallel. Without such mechanisms in place, data inconsistency may result from effects such as race conditions.

A race condition describes a flaw in a system where the sequence or timing of two or more concurrent transactions causes the data held in the persistent store to enter an inconsistent state. Considering our money transfer example again, a race condition could occur if two transactions simultaneously read the balance of the first account, deduct the transfer amount and then update the account balance.

So for example, let's say the account has a balance of $100 and each of two transactions wishes to transfer $10 to another account. Both transactions first read the current balance ($100), deduct the transfer amount ($10), and then update the balance to $90. One deduction has been lost: the balance should be $80 after both transfer operations complete.
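The lost update can be reproduced deterministically in a few lines. This is a sketch with hypothetical names, forcing both reads to happen before either write, exactly as in the interleaving described above:

```python
# Shared state standing in for the persistent account balance.
balance = {"amount": 100}

def read_balance():
    return balance["amount"]

def write_balance(value):
    balance["amount"] = value

# Transaction A and transaction B both read $100...
a_seen = read_balance()
b_seen = read_balance()

# ...each deducts $10 and writes $90 back, so one update is lost.
write_balance(a_seen - 10)
write_balance(b_seen - 10)

# balance is now 90, not the correct 80
```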

Transaction managers prevent these conditions from occurring by enforcing isolation between transactions. This is most often achieved through the application of locks. In our money transfer example, the first transaction will apply an "update lock" to the account balance which will prevent the second transaction from reading the balance until the first transaction has completed.
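The effect of such a lock can be sketched with an ordinary mutex (the names are illustrative): holding it across the whole read-deduct-write sequence serialises the two transfers, so neither can observe the balance mid-update:

```python
import threading

balance = {"amount": 100}
lock = threading.Lock()

def withdraw(amount):
    with lock:  # no other transfer can read the balance mid-update
        seen = balance["amount"]
        balance["amount"] = seen - amount

threads = [threading.Thread(target=withdraw, args=(10,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# balance is now 80 — both deductions are applied
```

A database's "update lock" is richer than a mutex (it is scoped to particular rows and released at commit time), but the principle is the same: the second transaction blocks until the first completes.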

The final property enforced by a transaction manager is durability, which ensures that data is not lost or left in an inconsistent state as a result of a system failure. When a transaction manager starts up again after a failure, all incomplete transactions are rolled back. The transaction manager ensures that all successfully completed transactions are committed to durable storage.

These properties of atomicity, consistency, isolation and durability are abbreviated as ACID properties. Sometimes it is necessary for these properties to be enforced across two or more transactional persistent stores. This is achieved by enrolling the transaction managers of these stores in a single distributed transaction.

A two-phase commit algorithm is often used to support distributed transactions. However, as this approach may involve locking resources in the various persistent stores enrolled in the distributed transaction (in order to preserve the ACID properties), it is not appropriate for use across service boundaries.
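The idea behind two-phase commit can be sketched as a toy coordinator - this is a minimal illustration of the voting protocol, not a production implementation, and the participant class is a hypothetical stand-in for a real resource manager:

```python
class Participant:
    """A resource manager that votes in the prepare phase."""
    def __init__(self, name, vote):
        self.name, self.vote, self.state = name, vote, "active"
    def prepare(self):
        # Phase 1: lock resources and vote on whether commit is possible.
        self.state = "prepared" if self.vote else "aborted"
        return self.vote
    def commit(self):
        self.state = "committed"
    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare; any "no" vote aborts.
    if all(p.prepare() for p in participants):
        # Phase 2: all voted yes, so every participant commits.
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False

db, queue = Participant("db", True), Participant("queue", True)
ok = two_phase_commit([db, queue])
# ok is True; db.state and queue.state are both "committed"
```

Note that between prepare and commit every participant sits holding its locks waiting for the coordinator's decision - which is precisely why, as argued below, this protocol should not span service boundaries.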

Services are autonomous and as such cannot be relied upon to complete operations within a reasonable period of time. We cannot allow the resources of one service to be locked while waiting for another service to signal whether it has successfully or unsuccessfully completed its operation.

That being said, distributed transactions are extremely useful within the service boundary. Consider a service that persists its state in a database, receives messages off one or more queues and/or topics, as well as sends and/or publishes messages.

Quite often, a service will perform some updates in one or more databases, and then send or publish one or more messages in response to receiving a message from a queue or topic. If a failure occurs anywhere during this process, we want to ensure that the inbound message is not lost, that all database updates are rolled back, and that no outbound messages escape.

This is achieved by enrolling the queue or topic from which the inbound message was read, any databases in which the service performed updates, and any queues or topics onto which messages were sent or published during the operation into a single distributed transaction.

On failure, the inbound message is placed back onto the queue or topic from which it was read, any outbound messages are erased and all database updates are rolled back.
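This succeed-or-fail-as-a-unit behaviour can be sketched in-process. `FakeQueue` and the handler below are illustrative stand-ins for real transactional resources, not a real messaging API:

```python
class FakeQueue:
    def __init__(self):
        self.messages = []
    def send(self, msg):
        self.messages.append(msg)
    def receive(self):
        return self.messages.pop(0)

def handle(inbound_queue, outbound_queue, db, process):
    msg = inbound_queue.receive()
    db_snapshot = dict(db)     # remember state so we can roll back
    pending_out = []           # outbound messages held until "commit"
    try:
        process(msg, db, pending_out)
    except Exception:
        # Failure: restore the database, put the inbound message back,
        # and let no outbound message escape.
        db.clear()
        db.update(db_snapshot)
        inbound_queue.send(msg)
        return
    # Success: only now do the outbound messages become visible.
    for out in pending_out:
        outbound_queue.send(out)

inbound, outbound = FakeQueue(), FakeQueue()
db = {"orders": 0}
inbound.send("new-order")

def failing_process(msg, db, out):
    db["orders"] += 1
    out.append("order-accepted")
    raise RuntimeError("crash before commit")  # simulated failure

handle(inbound, outbound, db, failing_process)
# db is back to {'orders': 0}, "new-order" is back on the inbound
# queue, and nothing was published to the outbound queue
```

A real transaction manager coordinates this across separate processes via two-phase commit rather than snapshots, but the observable outcome inside the service boundary is the same.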

So this gives us a great deal of robustness when it comes to handling failures that occur as part of a single operation within a service. But what about workflows that occur across services? If one part of a workflow fails, we will very likely need to take appropriate action in the other services involved in the workflow. This is known as compensation logic.

Transaction managers deal with failures by rolling back changes that occur during a failed transaction. At the cross-service level however this action would not always be appropriate. Consider a Shipping service responsible for mailing order items to customers.

If an action performed by another service as part of this workflow fails, we wouldn't want the Shipping service to erase all record of the shipment. The package has already been physically shipped. We can't roll that back!

Consequently, we manually implement logic within services to compensate for failures in other services participating in the same logical workflow. The appropriate compensation logic is more often than not a matter for the business to decide.

The logic will often be different for every service and every scenario, so it must be explicitly defined in the business requirements. Different failure conditions may also call for different compensation logic.
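One common shape for such compensation is to register a compensating action for each completed step and, on failure, run them in reverse order. All of the steps and service actions below are hypothetical examples of business-defined compensations:

```python
def run_workflow(steps):
    """Run (action, compensate) pairs; on failure, compensate
    the already-completed steps in reverse order."""
    compensations = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(compensations):
                undo()
            return False
        compensations.append(compensate)
    return True

log = []

def fail():
    raise RuntimeError("loyalty service down")  # simulated failure

steps = [
    (lambda: log.append("payment taken"),
     lambda: log.append("payment refunded")),
    (lambda: log.append("shipment booked"),
     # The package cannot be "rolled back", so the business-defined
     # compensation is to issue a recall instead.
     lambda: log.append("recall issued")),
    (fail, lambda: None),
]

ok = run_workflow(steps)
# ok is False; log ends with "recall issued" then "payment refunded"
```

Notice that the compensations are not rollbacks: the shipment step's compensation is a new business action ("recall issued"), which is exactly the Shipping example above.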

The need for manual compensation logic is considerably reduced with a self-contained process-centric service model. This flavour of SOA means that services hold all data they need to service any request locally. As such, all data updates are local to the service and can be protected by a distributed transaction inside the service boundary.

So, ACID transactions are a fantastic tool to be leveraged within the service boundary to help us build services that are robust and tolerate failures. They should not however be applied across services. Here, we must rely on manual compensation logic.
