Tuesday, June 17, 2008

Transactional Services

In the realm of systems that manage a persistent store of some kind (for example a database, a queue, a topic or a file system), a transaction is defined as being an atomic (indivisible) unit of work. The transaction manager ensures that all data manipulation operations that occur as part of a transaction complete successfully or not at all.

Consider the classic example of transferring money from one account to another. This operation involves reducing the balance of one account as well as increasing the balance of another. It is critical that either both operations succeed or both fail. For only one to succeed would leave the persistent store in an inconsistent state.

Transaction managers also provide features to manage concurrency - that is, multiple transactions executing in parallel. Without such mechanisms in place, data inconsistency may result from effects such as race conditions.

A race condition describes a flaw in a system where the sequence or timing of two or more concurrent transactions causes the data held in the persistent store to enter an inconsistent state. Considering our money transfer example again, a race condition could occur if two transactions simultaneously read the balance of the first account, deduct the transfer amount and then update the account balance.

So for example, let's say the account has a balance of $100 and we wish to transfer $10 to another account. Both transactions first read the current balance ($100), deduct the transfer amount ($10), and then update the balance to $90. The balance should obviously be $80 after both transfer operations complete, but the second update has silently overwritten the first.

Transaction managers prevent these conditions from occurring by enforcing isolation between transactions. This is most often achieved through the application of locks. In our money transfer example, the first transaction will apply an "update lock" to the account balance which will prevent the second transaction from reading the balance until the first transaction has completed.
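
To make this concrete, here is a minimal sketch (in Python, purely for illustration) of the lost update race and of how serialising access with a lock prevents it:

```python
import threading

balance = 100                      # shared account balance
lock = threading.Lock()

def transfer_unsafe(amount):
    """Lost update: both threads may read $100, so the last write wins."""
    global balance
    current = balance              # read the current balance
    balance = current - amount     # deduct and write back

def transfer_locked(amount):
    """The lock forces the second transfer to wait for the first."""
    global balance
    with lock:                     # our "update lock"
        balance -= amount

# Two simultaneous $10 transfers out of the account.
threads = [threading.Thread(target=transfer_locked, args=(10,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)                     # 80, as it should be
```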

The final property enforced by a transaction manager is durability, which ensures that data is not lost or left in an inconsistent state as a result of a system failure. When a transaction manager starts up again after a failure, all incomplete transactions are rolled back. The transaction manager ensures that all successfully completed transactions are committed to durable storage.

These properties of atomicity, consistency, isolation and durability are collectively abbreviated as the ACID properties. Sometimes it is necessary for these properties to be enforced across two or more transactional persistent stores. This is achieved by enrolling the transaction managers of these stores in a single distributed transaction.

A two-phase commit algorithm is often used to support distributed transactions. However, as this approach may involve locking resources in the various persistent stores involved in the distributed transaction (in order to preserve the ACID properties), it is not appropriate for use across service boundaries.
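
For the curious, the shape of the protocol is sketched below (greatly simplified; a real coordinator would also durably log its decisions for recovery). Note that each participant holds its locks from the prepare call right up until the final decision arrives:

```python
class Participant:
    """Stand-in for the transaction manager of one persistent store."""

    def prepare(self) -> bool:
        # Acquire locks, write the pending changes to a durable log,
        # then vote on whether this store is able to commit.
        return True

    def commit(self):
        pass    # make the prepared changes permanent; release locks

    def rollback(self):
        pass    # discard the prepared changes; release locks

def two_phase_commit(participants):
    # Phase 1 (voting): every participant must vote yes.
    if all(p.prepare() for p in participants):
        # Phase 2 (completion): locks have been held until this point.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote aborts the whole distributed transaction.
    for p in participants:
        p.rollback()
    return False

print(two_phase_commit([Participant(), Participant()]))   # True
```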

Services are autonomous and as such cannot be relied upon to complete operations within a reasonable period of time. We cannot allow the resources of one service to be locked while waiting for another service to signal whether it has successfully or unsuccessfully completed its operation.

That being said, distributed transactions are extremely useful within the service boundary. Consider a service that persists its state in a database, receives messages off one or more queues and/or topics, as well as sends and/or publishes messages.

Quite often, a service will perform some updates in one or more databases, and then send or publish one or more messages in response to receiving a message from a queue or topic. If a failure occurs anywhere during this process, we want to ensure that the inbound message is not lost, all database updates are rolled back, and no outbound messages escape.

This is achieved by way of enrolling the queue or topic from which the inbound message was read, any databases where the service performed updates, as well as any queues or topics onto which messages were sent or published during the operation into a single distributed transaction.

On failure, any message read off a queue or topic is placed back onto it, any outbound messages are erased, and all database updates are rolled back.
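
On the Microsoft platform this is typically achieved by wrapping the message handler in a DTC-coordinated transaction. As a platform-neutral illustration, here is a toy sketch of the shape of such a handler (the Resource and distributed_transaction types are stand-ins invented for the example):

```python
from contextlib import contextmanager

class Resource:
    """Toy stand-in for a transactional queue, topic or database."""
    def __init__(self, name):
        self.name = name
    def commit(self):
        print(f"{self.name}: commit")
    def rollback(self):
        print(f"{self.name}: rollback")

@contextmanager
def distributed_transaction(*resources):
    """Toy scope: commit every enlisted resource, or roll all of them back."""
    try:
        yield
    except Exception:
        for r in resources:
            r.rollback()
        raise
    for r in resources:
        r.commit()

inbound = Resource("inbound queue")
database = Resource("database")
outbound = Resource("outbound topic")

with distributed_transaction(inbound, database, outbound):
    pass    # receive the message, update the database, publish responses
```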

So this gives us a great deal of robustness when it comes to handling failures that occur as part of a single operation within a service. But what about workflows that occur across services? If one part of a workflow fails, we very likely will need to take appropriate action in other services involved in the workflow. This is known as compensation logic.

Transaction managers deal with failures by rolling back changes that occur during a failed transaction. At the cross-service level however this action would not always be appropriate. Consider a Shipping service responsible for mailing order items to customers.

If an action performed by another service as part of this workflow fails, we wouldn't want the Shipping service to erase all record of the shipment. The package has already been physically shipped. We can't roll that back!

As a result of this, we manually implement logic within services to compensate for failures within other services as part of the same logical workflow. The appropriate compensation logic more often than not is a matter for the business to decide.

The logic will often be different for every service and every scenario, so it must be explicitly defined in the business requirements. Different compensation logic may also be necessary as a result of different failure conditions.
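
Purely as an illustration (the states and actions below are invented; a real business would define its own), compensation logic in the Shipping service might branch on how far the shipment has progressed:

```python
from dataclasses import dataclass

@dataclass
class Shipment:
    order_id: str
    status: str     # e.g. "PENDING" or "DISPATCHED"

def compensate_for_failed_payment(shipment: Shipment):
    """Business-defined compensation; there is no generic rollback."""
    if shipment.status == "PENDING":
        print(f"{shipment.order_id}: cancel the shipment")   # nothing physical yet
    elif shipment.status == "DISPATCHED":
        # The package has left the building; compensate instead.
        print(f"{shipment.order_id}: arrange return and notify the customer")

compensate_for_failed_payment(Shipment("1234", "DISPATCHED"))
```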

The need for manual compensation logic is considerably reduced with a self-contained process-centric service model. This flavour of SOA means that services hold all data they need to service any request locally. As such, all data updates are local to the service and can be protected by a distributed transaction inside the service boundary.

So, ACID transactions are a fantastic tool to be leveraged within the service boundary to help us build services that are robust and tolerate failures. They should not however be applied across services. Here, we must rely on manual compensation logic.

Monday, June 16, 2008

Reliable Messaging

Recently, we discussed the use of idempotent messages as a strategy for achieving reliable message delivery over an unreliable transport such as HTTP. The goal of idempotent messages is to eliminate any side effect from receipt of duplicate messages so that the sending party can retransmit messages for which no receipt acknowledgement was received without fear of the retransmitted message causing problems at the receiver.

The problem with relying upon idempotent messages, of course, is the effort involved in writing retransmission logic for every endpoint that sends messages reliably over an unreliable channel. It can also take considerable effort, in some cases, to design and write systems such that operations that are not naturally idempotent become idempotent.

As such, we want to leverage reliable transports where possible so that idempotence and retransmission concerns don't leak into our application logic. Reliable transports handle retries and eliminate duplicate messages for us as part of the communication infrastructure.

Furthermore in situations where it is possible that messages arrive out of order (perhaps as a result of being routed by one or more intermediaries), reliable transports are capable of reordering messages such that they are delivered in the order in which they were sent.
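
Conceptually, the receiving side does something like the following sketch (of the general technique, not of any particular WS-RM stack): it holds out-of-order messages aside, keyed by sequence number, and dispatches them only once the gap is filled.

```python
class InOrderDispatcher:
    """Holds out-of-order messages aside; dispatches them in sequence."""

    def __init__(self, handler):
        self.handler = handler
        self.expected = 1      # next sequence number to dispatch
        self.pending = {}      # held messages, keyed by sequence number

    def receive(self, seq, message):
        self.pending[seq] = message
        # Dispatch every consecutive message we now hold.
        while self.expected in self.pending:
            self.handler(self.pending.pop(self.expected))
            self.expected += 1

dispatcher = InOrderDispatcher(print)
dispatcher.receive(2, "second")    # held back
dispatcher.receive(1, "first")     # dispatches "first", then "second"
```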

A problem with reliable transports however is that they tend to be platform specific (such as MSMQ, available only on the Windows platform). Fortunately a standard reliable messaging specification has been defined, WS-ReliableMessaging (WS-RM). This specification falls under the WS-* group of specifications.

The catch though is that it is left up to the WS-RM implementer to decide what kinds of delivery assurances the WS-RM stack supports and will enforce. For example, the number of attempts a sending party makes to send a message before giving up is a matter of configuration at the sender. What the messaging infrastructure does with messages that fail to be delivered is out of scope for the WS-RM specification.

Whether the receiving party holds out-of-order messages aside such that they can be dispatched to message handlers in order is a matter of how the receiving WS-RM stack is implemented and/or configured. It makes no difference to the messages that are transmitted over the wire.

The same applies for whether messages are placed in a durable store at the sender before being sent, or whether they are placed in a durable store at the receiver before being dispatched to the message handler. This makes sense if you think about it. There is no way that a service provider could enforce that its consumers store messages durably before forwarding them on to the provider.

The best we can achieve is that the service provider and its consumers are able to make claims about delivery assurances. This is achieved with WS-Policy assertions. Although WS-Policy assertions have been defined for some delivery assurances, none have yet been defined to make claims about durable messaging.

So we need to be aware when using WS-RM that either endpoint may or may not be storing messages durably. This means that if a service provider or consumer process crashes, a message could potentially be lost.

Microsoft WCF does not support durable messaging with WS-RM at all. Durable messaging with WCF is achievable only by using the MSMQ transport. In my opinion, this severely limits the usefulness of WCF's WS-RM implementation.

Another limitation of WS-RM is that it is not at present universally supported by all SOAP stacks as it is a relatively new specification. Where it is supported, there are no guarantees of what delivery assurances are enforced by the interacting parties.

That being said, where reliable messaging is required between services on disparate platforms and WS-RM is available, it certainly beats a raw HTTP transport.

Thursday, June 12, 2008

Outsourcing Business Capabilities (continued...)

Continuing my recent post on outsourcing business capabilities to third parties, I wanted to extend the example to include a Shipping service and a Billing service which outsources its billing function to PayPal.

If you recall from last time, our online sales channel (part of the Sales service) was outsourced to eBay. When a customer places an order on eBay, eBay needs to inform us of the order details. We achieve this by setting up a local Web service which is invoked by eBay whenever an order is placed.

eBay, however, does not guarantee delivery of this notification message. They do provide a Web service we can interrogate in order to retrieve order details on demand. This involves a synchronous request-reply message exchange over an HTTP transport.

PayPal also provides a notification mechanism whereby PayPal invokes a Web service hosted by our organisation. A notification is sent whenever a payment is processed. Unlike eBay, PayPal does guarantee notification delivery.

Unfortunately, HTTP is not a guaranteed message delivery transport. Connection failures may occur. As a result, PayPal will keep sending the notification message until it receives confirmation that the message was successfully processed in the response from our Web service operation.

If our response message back to PayPal is lost somewhere along the way, we'll end up receiving duplicate notification messages. So we need to make sure that the service logic that handles the notification is idempotent.

We want to abstract away the details of these Web service interactions from other services in our enterprise. As far as the online sales channel is concerned, this is achieved by way of publishing a NewSaleNotification message within our organisation when we receive a sale notification from eBay via our Web service. The Sales service also stores a record of the sale in its database.

The eBay notification however is not guaranteed to arrive, so we're going to have to find a way of dealing with that. Let's deal with that later though and look to the Billing service. PayPal sends us a notification which I'll assume will contain the order number and a payment number.

When we receive a payment notification from PayPal, we save the payment in the Billing service database and then publish a PaymentReceivedNotification. In order to protect ourselves from duplicate notifications, before publishing our PaymentReceivedNotification we first check to see if we already have a payment with the given payment number in the database. If it is already present, then we disregard the message.
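
Here is a sketch of that handler, with an in-memory set standing in for the Billing database (in practice the duplicate check and the save would occur within the same transaction):

```python
processed_payments = set()    # stand-in for the Billing service database

def handle_payment_notification(order_number, payment_number, publish):
    if payment_number in processed_payments:
        return    # duplicate notification; disregard it
    processed_payments.add(payment_number)    # save the payment
    publish({"event": "PaymentReceivedNotification",
             "order": order_number,
             "payment": payment_number})

handle_payment_notification("O-1001", "P-5001", print)
handle_payment_notification("O-1001", "P-5001", print)   # silently ignored
```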

The Sales service needs to be subscribed to the "payment received" event. When it receives notification of this event, it checks to see if an order with the given order number is in its database. If not, it makes a request to eBay using the eBay Web service to retrieve the order details and then saves the order in the database. The service then stores a record of the payment against the order and raises an OrderPaidNotification.
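
And here is a sketch of the corresponding Sales service handler (the function names are invented for illustration), showing how the on-demand eBay query compensates for the unreliable sale notification:

```python
orders = {}    # stand-in for the Sales service database, keyed by order number

def fetch_order_from_ebay(order_number):
    """Stand-in for the synchronous request-reply call to eBay's Web service."""
    return {"number": order_number, "items": ["widget"]}

def handle_payment_received(order_number, payment_number, publish):
    if order_number not in orders:
        # The eBay sale notification never arrived (or hasn't yet), so
        # retrieve the order details on demand and save them.
        orders[order_number] = fetch_order_from_ebay(order_number)
    orders[order_number]["payment"] = payment_number    # record the payment
    publish({"event": "OrderPaidNotification", "order": orders[order_number]})

handle_payment_received("O-1001", "P-5001", print)
```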

By virtue of the "order paid" event, we have abstracted away all the complexities associated with compensating for inadequate service level agreements of third parties. We can then subscribe the Shipping service to the "order paid" event (which would contain the full order and payment details) so that it can arrange shipment of the order once it has been paid.

Note that with this architecture the contracts of the Sales and Billing services are devoid of any details concerning the third party organisations eBay and PayPal. This means we can replace these suppliers without impacting the remainder of our architecture.

It also means we decouple our other services from the Web service contracts exposed by third party organisations. This is very important as we will have little to no influence on whether or how often these contracts change. The abstraction layer limits the impact of these changes to the boundary of the service interacting with the third party.

Tuesday, June 10, 2008

Idempotent Messages

I know in my last post I said we'd be continuing our outsourcing example. However before doing so I need to explain the concept of idempotent messages (you'll understand why when you read my next post).

Idempotence is actually not so much a property of the message, but a property of how the message is handled by the receiving service. A message is idempotent if the service operation that processes it yields the same result regardless of the number of times the message is received.

Some operations are idempotent by nature, whereas others require special treatment in order to become idempotent. Read only operations by their very nature are idempotent because they don't have any lasting effect. An "update customer" operation is idempotent because no matter how many times you update the customer with the same information, it yields the same result.

Operations such as "transfer $100 from account X to account Y" however are not idempotent. If the same message is replayed 10 times, then $1,000 will be transferred over 10 transactions. In these cases we need a mechanism to detect the duplicate messages and ignore them.
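
The distinction is easy to see in a sketch: setting a value is naturally idempotent, whereas applying a delta is not.

```python
customer = {"email": "old@example.com"}
accounts = {"X": 500, "Y": 0}

def update_customer(email):
    customer["email"] = email     # idempotent: same state however many times

def transfer(amount):
    accounts["X"] -= amount       # not idempotent: every replay moves
    accounts["Y"] += amount       # another $100

for _ in range(10):
    update_customer("new@example.com")
    transfer(100)

print(customer)    # {'email': 'new@example.com'} -- as if run once
print(accounts)    # {'X': -500, 'Y': 1000} -- $1,000 transferred!
```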

In some cases duplicates are easy to detect. For example, if we receive a ShipOrderRequest message containing an order number and store the order number in the Shipping service database, then all we need do when receiving a ShipOrderRequest message is check the Shipping database for the given order number and, if it is found, disregard the request as a duplicate.

Some scenarios require a bit more effort from the service consumer. Consider the account transfer operation described above. In this case, there is nothing in the message to identify that we have already processed that message. We cannot differentiate between a duplicate and another legitimate request to transfer $100 between the same accounts.

In such cases what we do is require that the service consumer place a unique message ID in each request message. A GUID works well for this. The receiving service can then store the message ID against the resultant account transfer transaction record in the database. Before processing a message, the receiving service checks the transaction table to see if the given message ID is already present. If so the request message is discarded.
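
Here is a sketch of the pattern, with an in-memory set standing in for the message IDs stored on the transaction records (in a real service the check and the transfer would occur within the same database transaction):

```python
import uuid

seen_message_ids = set()    # stand-in for IDs stored on transaction records

def handle_transfer(message):
    if message["id"] in seen_message_ids:
        return "duplicate discarded"
    seen_message_ids.add(message["id"])
    return f"transferred ${message['amount']}"

message = {"id": str(uuid.uuid4()), "amount": 100}   # consumer adds the GUID
print(handle_transfer(message))    # transferred $100
print(handle_transfer(message))    # duplicate discarded
```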

So why go to all this effort? Under what circumstances do we need idempotent messages? Well, in our discussions to date I have assumed the use of a transactional guaranteed message delivery transport (such as MSMQ). Such transports handle the detection and removal of duplicate messages as part of the messaging infrastructure.

Furthermore a transactional transport allows us to remove a message from a queue or topic as part of a broader distributed transaction. This means that the message is not lost if the service fails to process it. A failure results in the message being placed back on the queue or topic. I'll cover transactional services in more detail in a future post.

However such transports are not always available. For example when integrating with third party organisations, we generally tend to rely on Web services over an HTTP transport. HTTP does not guarantee delivery.

The problem with this is that when a failure occurs (e.g. the connection fails), the service consumer cannot determine whether or not the request message was actually successfully delivered and processed. Now for some situations, losing a message isn't very important. For example if someone is sending us weather updates every minute, it may not matter if we lose one because there'll be another along shortly.

However for other situations, we require a guaranteed message delivery service level agreement. This is only achievable over an unreliable transport if the consumer resends the message over and over until it receives confirmation from the service provider in the form of a response message that the original message has been successfully processed.
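
The consumer side of this contract is essentially a retransmission loop, sketched below (a production version would add exponential backoff and a strategy for when the retry limit is reached):

```python
import time

def send_reliably(send, message, max_attempts=5, delay=1.0):
    """Resend until the provider confirms the message was processed."""
    for _ in range(max_attempts):
        try:
            if send(message) == "ACK":   # e.g. an HTTP request-reply call
                return True              # confirmed processed
        except ConnectionError:
            pass                         # request (or its response) was lost
        time.sleep(delay)                # wait before retransmitting
    return False                         # give up; escalate the failure

# Usage: send_reliably(post_to_provider, request_message)
```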

Now this is fine if the message is lost en route to the service provider. But what if the message was successfully processed and the confirmation response message is lost on its way back to the service consumer? The consumer will resend the request message and the service provider will receive and process the message twice.

When the operation performed by the service provider in response to receiving this message is not naturally idempotent, the service provider must detect the duplicate message and disregard it.

Of course this is a lot of extra effort to go to when implementing your service logic. So use transactional guaranteed delivery transports where available and appropriate. They'll save you a lot of time.

Friday, June 6, 2008

Outsourcing Business Capabilities

One of the commonly cited benefits of SOA is that it gives organisations greater flexibility in outsourcing business capabilities. This is by virtue of the fact that organisations can leverage Web services as a foundation technology for B2B communications across organisational boundaries.

However one common misconception that exists is that a Web service interface that sits at the organisational boundary coincides with the boundary of a business service. This is in fact often not the case.

Consider an online retail business that sells products via an online store. Let's assume that the business also accepts orders by mail (either by snail mail or email) and telephone. The Sales service would include the online store Web application, as well as some kind of internal application leveraged by call centre operators that process orders by mail and telephone.

A possible Sales service architecture is illustrated below.

[Figure: Sales service architecture]

So what happens if at some point the business decides it can get better value from outsourcing its online sales channel to eBay? Well clearly the entire Sales service has not been outsourced. We end up with an architecture similar to that illustrated below.

[Figure: Sales service architecture with the online sales channel outsourced to eBay]

Here we have a single service spanning organisational boundaries. The interaction between eBay and the components still hosted in house occurs inside the Sales service, but across organisational boundaries. The service contract of the Sales service remains unchanged.

No other service in our business need know that the online sales channel has been outsourced. Just as importantly, no service is dependent specifically on eBay. If at some point we decided to replace eBay with another provider, this would constitute only a change in the implementation of our Sales service.

Moreover, if we decided to branch out and leverage multiple third party online sales channels, this would involve only a change in the implementation of our Sales service.

Just because eBay exposes a Web service interface as an integration point for retail businesses doesn't mean that we should expose that interface directly to our other services in our enterprise. eBay's Web service interface is designed as a point of integration. No more, no less.

Business capabilities are unique to each business. Before outsourcing to eBay, our organisation had its own distinctive sales processes that evolved independently of sales processes in other organisations. Although there may be similarities, there will always be subtle differences.

Furthermore, our organisation will wish to retain the ability to tailor and evolve its sales processes as it sees fit. The business certainly won't appreciate terms being dictated by eBay.

We also want to be able to control the service level agreements (SLAs) (such as performance and reliability) upheld by our Sales service. eBay's Web service interface is based on synchronous request-reply interactions over an unreliable network (the Internet), effectively meaning there are no guarantees of availability or performance. Obviously we cannot expose other services within our enterprise to such poor SLAs.

Something else to consider is that eBay's Web service interface is potentially subject to change. We need to shield our other services from such potential changes. As such, we certainly don't want to couple our services directly to eBay's Web service contract.

Furthermore, what if we wish to outsource our online sales channel to a business that isn't quite so technologically savvy, providing an interface in the form of CSV files transferred via FTP? We certainly can't directly expose that as a service within our enterprise. What if we wish to partner with an organisation that offers only a REST based interface?

The point I am making is that people get tempted to directly expose Web services (whether of COTS applications or of partner organisations) to other services within their organisations, simply because they are Web services. Do not do this. The provision of Web services is entirely coincidental and solely a result of the need for interoperability.

What is needed is a layer of abstraction between the partner organisation's Web service interface and the service contract exposed within our enterprise. This layer of abstraction gives our organisation the flexibility to have control over its sales processes and SLAs.
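
In code terms, the abstraction layer is simply a translation at the boundary. The sketch below (the eBay field names are invented) maps an inbound eBay order onto our internal NewSaleNotification, so nothing eBay-specific leaks past the Sales service:

```python
def to_new_sale_notification(ebay_order: dict) -> dict:
    """Translate eBay's contract into our internal Sales contract."""
    return {
        "event": "NewSaleNotification",
        "orderNumber": ebay_order["OrderID"],
        "items": [{"sku": item["ItemID"], "qty": item["Quantity"]}
                  for item in ebay_order["LineItems"]],
    }

# Replacing eBay with another sales channel means rewriting only this
# translation; the internal NewSaleNotification contract is untouched.
```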

More on this example in my next post, so stay tuned!

Wednesday, June 4, 2008

Business Services

The most important part of a service is its contract and by extension its boundary. That determines the role the service plays in the broader architecture, as well as how a service interacts with other services.

Services are the top level element in any SOA. There is only a service bus making up the fabric between services. The service bus should handle message routing and delivery but should perform no other function outside the boundaries of our services. An ESB may perform other functions such as message transformation and process orchestration, but those functions are performed behind our service boundaries (at least when engaging in a self-contained process-centric service model).

So - the primary concern in SOA is the identification and definition of the architecture's constituent services. When applying SOA in a business context we are concerned with identifying business services. A business service serves a specific business purpose, and is defined strictly in business terms.

We define the responsibilities of a business service in terms of the cohesive business area/capability the service supports. We define its interactions with other services in terms of the business events and operations exposed by the service, the service level agreements it provides, and the policies it imposes on its consumers.

Because this definition involves only business terms, the service's function and place in the enterprise are easily understood by the business folk. Business services can be defined, at least initially, by business architects. The technical people can then translate this business definition into a service contract which is expressed in technical terms such as endpoints, messages, schemas, transports, encryption, authentication etc.

The technical folk can then determine the internal architecture of each service based on the service contract. This includes taking an inventory of existing IT systems and determining which systems support each service.

The business service definition is where the rubber hits the road between the business architecture and the application architecture. Business services are expressed as part of the business architecture, whereas their technical definitions form part of the application architecture. There should be a one-to-one correlation between services defined in the business architecture and those defined in the application architecture.

Business processes will of course vary from business to business. Although many businesses have a customer management function, the way in which that function is performed will be different in each business. Furthermore, the way in which that function interacts with other functions in cross-function business processes will vary from business to business.

The interaction between business services defines how cross-function business processes are performed. As this is organisation specific, we find that the contracts of our business services are expressed in the terms of our specific organisation.

This is why we don't expose COTS application endpoints directly to other services in our organisation. We require a layer of abstraction to translate between the service contract specific to our organisation and the generic contract exposed by the COTS application.

As we have previously discussed, the art form that exists in identifying candidate services is in achieving high cohesion and loose coupling. To some extent, we can look to capability mapping to help guide this process, but more on that in a future post.

It is very rare that an architecture will be correct on the first pass. The architecture will evolve as there is feedback from the application architecture domain to the business architecture domain and vice versa.

If we find symptoms of incorrect service definition (such as chatty, data centric, synchronous request-reply, transactional interactions between services or domain models with confused semantics) then we refine the service model until these symptoms are alleviated.

Sunday, June 1, 2008

Überservices are Bad

Continuing on my recent theme of coupling and cohesion, I'd like to take the opportunity to discuss a somewhat common SOA anti-pattern, the überservice. This is where we put too many concerns into a single service.

So why is this a problem? Why not put everything into a single service? In order to understand this, we need to consider one of the core values of SOA: that it draws boundaries between systems that enforce certain constraints between them, creating zones of autonomy and loose coupling.

Loose coupling between services is in part a result of the architectural style itself mandating that we have dependencies between services limited to the exchange of messages conforming to well-defined service contracts. Services have no visibility of or dependency on the implementation details of other services.

We get additional loose coupling through asynchronous communication between services (preferably publish-subscribe), eliminating data centric interfaces, decentralising our data and designing our services such that we have high cohesion within services. All this is achieved by designing our services such that they are self-contained and process-centric.

Each service controls its own data. Services are unable to directly access or manipulate data held by other services. This gives each service the freedom to represent its data locally in the way most suited to supporting the business processes that execute within that service.

As we expand the scope of a given service, the data model and the corresponding logic that executes on it becomes increasingly confused and complex. If we expand the scope of a service far enough, then we have unfettered access to any data by any component in the entire enterprise. That is, we have chaos.

Service boundaries control this complexity. We must draw them up such that we have appropriate service granularity in order to minimise coupling and maximise cohesion.

Many business processes within an organisation by their very nature are loosely coupled. That is, they do not share data directly. For example, when a customer inquires about an insurance policy, his or her details become known to the Sales department at that time.

However, until they actually proceed with purchasing a policy, their details are not known to the Underwriting department - and nor should they be. The Underwriting department doesn't care about prospective customers.

If we implement both the Sales and Underwriting processes as part of a single service, then we have data being shared between these two loosely coupled processes in an uncontrolled way, which now couples these two processes. If we need to make a change to one of these processes, the other will likely be affected.

So, drawing service boundaries between systems supporting cohesive business areas provides better alignment between the business and the systems that support it. When done properly, we end up with loose coupling between services such that our overall architecture is simplified, and we minimise the risk and impact of changes to systems in our enterprise.