Friday, February 15, 2008

The Data Debate Continues...

A reader has just left a number of very good questions regarding my last post. I thought the answers might interest everyone, so I'm addressing them in a follow-up post. I'll tackle each question in turn:

With a decentralised approach would you double up on data around the place?

Indeed this is the case, although the data is not exactly doubled up. Each service holds a different representation of the data, with some elements in common. This serves our interests very well because we can optimise each representation for the specific business domain handled by its service.

For example, a customer as seen by sales involves opportunities, volumes, revenue, etc. According to billing the customer involves credit history, risk, billing history, etc. Some elements such as name and address however will be in common.
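To make the two representations concrete, here is a minimal sketch in Python. The field names are assumptions for illustration only, not the actual schemas of the sales and billing services:

```python
from dataclasses import dataclass, field

# Illustrative only: field names are assumed, not taken from real schemas.
@dataclass
class SalesCustomer:
    customer_id: str
    name: str                      # shared element
    address: str                   # shared element
    opportunities: list = field(default_factory=list)  # sales-specific
    projected_revenue: float = 0.0                     # sales-specific

@dataclass
class BillingCustomer:
    customer_id: str
    name: str                      # shared element
    address: str                   # shared element
    credit_rating: str = "unrated" # billing-specific
    outstanding_balance: float = 0.0                   # billing-specific
```

Only `name` and `address` overlap; everything else is shaped by the owning service's domain.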

How do you keep data shared between services in sync?

Whenever an event occurs in a service that updates data in that service, an event is published onto the service bus. All other services that are interested in that event are subscribed and as such receive a notification. Each subscribed service then makes a local decision as to how to react to that notification. This may involve saving a new version of the updated data, but only the elements that the subscribing service is interested in.
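The flow above can be sketched with a toy in-memory bus. This is a stand-in for real messaging infrastructure, and the event shape and topic name are assumptions; the point is that the publisher sends one notification, and each subscriber locally decides which elements to keep:

```python
from collections import defaultdict

class ServiceBus:
    """Minimal in-memory stand-in for real service-bus infrastructure."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # One publish; the bus fans the event out to every subscriber.
        for handler in self._subscribers[topic]:
            handler(event)

bus = ServiceBus()
billing_db = {}

def billing_on_customer_updated(event):
    # Billing reacts locally, storing only the elements it cares about.
    record = billing_db.setdefault(event["customer_id"], {})
    record["name"] = event["name"]
    record["address"] = event["address"]

bus.subscribe("customer-updated", billing_on_customer_updated)
bus.publish("customer-updated", {
    "customer_id": "C42",
    "name": "Acme Pty Ltd",
    "address": "1 George St",
    "opportunities": 7,   # a sales detail that billing simply ignores
})
```

After the publish, billing's local store holds the shared elements only; the sales-specific detail never makes it into billing's database.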

When a service updates a bit of shared data does it broadcast to the world it has done this? Or do other services subscribe to that service so it has a list of services to inform of updates?

Only services subscribed to the topic on which that event is published will receive a notification.

How does it work if two or more services try to update the same thing at the same time? Broadcast some kind of lock event to all other interested services?

This question is really a business issue rather than a technology one. Firstly, this can only happen if you have two separate users using two separate systems that update the same data at the same time. The business may not require or even desire this. Usually, specific information is owned by specific groups of people within an organisation and only they should have authority to make changes to it.

However, if the business requires multiple independent users to be able to update the same data from different services, a business decision must be made as to what should occur. The easiest solution is first in, best dressed: the first update to arrive wins. Another solution is to make one service the authoritative source, so that if there is a conflict that service always wins. Another way would be to notify someone of the conflict by email so it can be resolved manually.

In order to achieve this, we need to be able to detect conflicts to begin with. One way of doing this is to place an integer version number on the entity within each service. When any one service performs an update, it places the current version number in the notification and then increments the number locally. Subscribing services perform the update in their respective databases only where the version number in the notification matches the one in their database. If it doesn't match, you know there is a conflict. This is essentially an optimistic locking strategy.
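The version check can be sketched as follows. The record and notification shapes are assumptions for illustration; what matters is that a mismatched version is detected rather than silently overwritten:

```python
def apply_update(local_record, notification):
    """Apply an update only if versions line up; otherwise flag a conflict.

    Sketch of the optimistic-locking scheme: the notification carries the
    version the publisher saw, and the subscriber compares it to its own.
    """
    if notification["version"] != local_record["version"]:
        # Conflict: versions have diverged. Escalate per business policy
        # (authoritative source wins, notify someone, etc.).
        return False
    local_record.update(notification["changes"])
    local_record["version"] += 1
    return True

record = {"version": 3, "address": "1 George St"}
ok = apply_update(record, {"version": 3, "changes": {"address": "2 Pitt St"}})
# First update matches version 3, so it applies and bumps the record to 4.
stale = apply_update(record, {"version": 3, "changes": {"address": "9 Elsewhere"}})
# Second update still carries version 3, so it is rejected as a conflict.
```

No locks are held across services at any point; each subscriber makes the accept-or-flag decision entirely locally.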

We don't want to use pessimistic locking between services. One service should never be able to lock the records of another. It would hurt performance too much and introduce too much coupling.

Is that less chatty than having it centralised?

Indeed. With a decentralised approach, all data needed by a service to service a request is held locally. This means no requests to other services to pick up data. Secondly, all data updates are done locally, meaning no requests to multiple other services to update data. The only message we really need to send out is a notification message once our local operation is complete. Yes, this message is sent to multiple subscribers, but to the publishing service, it is only one message. It is the messaging infrastructure that handles routing it to each subscriber.


Anonymous said...

Thanks for the interesting series of posts Bill. I do like the approach but I have some questions too:

1. Have you had problems dealing with the increased storage requirements the decentralized approach calls for? These could be significant!

2. By putting all the data into separate databases you can no longer share the resources (CPU, RAM) of a single machine over all requests to achieve a kind of "statistical multiplexing". How have you dealt with this? Did you just deploy more hardware, use something like virtualization, or host all the databases on the same physical machine and operating system even though they belong to logically separate services?

3. It seems to me that there must be at least some services that fall in the general class of "sparse query over dense data" for which decentralized data storage may not be appropriate. Imagine you have a database of e.g. all historical stock prices used by some service. This will be very large and frequently updated. Then, you have another, logically and organizationally separate, service which performs ad-hoc queries only on restricted subsets of this database. Does it really make sense to distribute the entire stock price history to this service even though it needs only a fraction of it?

4. What is your opinion on using durable messages to mitigate the dependence on other services being up all the time?

Thanks in advance for your replies!

Andreas Öhlund said...

Hi Bill!
I'm totally on your side, decentralized is definitely the way to go. But I think it's also important to give all "the smaller apps" a way to get the data using request/reply if they don't need the robustness, performance, etc. benefits. So my point is that you should provide both pub/sub and req/reply ways to get at the data.

Keep those great posts coming!

Bill said...

Hi Andreas,

We certainly expose request-reply operations at private endpoints for consumption by application user interfaces.

These applications however sit behind the service boundary.

If we do need to expose request-reply operations for consumption by other services, we should use asynchronous messaging rather than synchronous.

Exposing synchronous request-reply operations at our service boundary is a slippery slope. Other services may start using them and then we will end up in trouble down the road.

Services are very large and coarse grained. If we have many "smaller" services, then we likely have our service granularity incorrect. So using publish-subscribe to distribute data in a decentralised way should be acceptable for all business services in your enterprise.

I'll be doing a post on service granularity sometime in the next couple of weeks.

Bill said...

Hi Max,

All good questions. I'll address each one in turn below:

1. I've not as yet had any issues dealing with the increased storage requirements for this approach. In general, the duplicated data tends to be a small percentage of the overall database size for any given service. Remember, we only want a service to have the data it needs to perform its specific function. This is a small subset of the data held by other services.

2. I'm not sure what it is that you mean here. If we have separate services with separate databases housed on separate servers, this gives us improved performance, as we have scaled out the hardware. You can host a number of services on the same physical machine, but this creates a physical dependency: potentially all those services could go down at once if there is a hardware failure.

3. A service will only persist data from notifications it receives if it will need that data in the future. So for instance, if a service requires only a subset of the stock price data, then although it will receive all stock price notifications, it will only store the stock prices it needs. Something else to consider here is whether the service actually needs the stock price data at all. Any operation it is performing around that stock price information may perhaps belong in the other service.

4. The decision to use durable messaging is a policy decision for the service. If messages contain business information that cannot be lost, then durable messaging is a necessity. However if you are publishing stock price updates for example, it probably doesn't matter if one gets lost as another will be along shortly afterwards anyway.
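The subset filtering described in point 3 can be sketched briefly. The watched symbols and event shape are assumptions; the point is that the subscriber receives every notification but persists only what it needs:

```python
# Assumed subset of symbols this particular service cares about.
WATCHED_SYMBOLS = {"BHP", "RIO"}
local_prices = {}

def on_stock_price(event):
    # Every notification arrives, but only watched symbols are persisted.
    if event["symbol"] in WATCHED_SYMBOLS:
        local_prices[event["symbol"]] = event["price"]

for event in [{"symbol": "BHP", "price": 38.50},
              {"symbol": "CBA", "price": 52.10},
              {"symbol": "RIO", "price": 120.00}]:
    on_stock_price(event)
# local_prices now holds BHP and RIO only; CBA was received but discarded.
```

The service's local store stays a fraction of the publisher's full history, which is what keeps the decentralised approach viable even for dense upstream data.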