
Wednesday, February 20, 2008

CRUD is Bad

We keep hearing that CRUD interfaces (that is, interfaces with create/read/update/delete operations) are considered bad practice (an anti-pattern) in the context of SOA. But why? And how do we go about designing our service interfaces to avoid CRUD?

Let's start with the why. Services have both data and business logic. For reasons of encapsulation and loose coupling between services, we want to keep our business logic near the data upon which it operates. If our service contract permits direct manipulation of the data held within the service, this means that the business logic can leak outside the service boundary. It also means that the business logic inside the service boundary can be bypassed by direct manipulation of the service's data. All bad.

The same holds true in traditional OOP. Other classes cannot directly manipulate the state of an object; state changes are achieved by passing messages to (calling methods on) the object. This helps enforce loose coupling and high cohesion.
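To make the OOP analogy concrete, here is a minimal, purely illustrative sketch (the `Reservation` class and its rules are hypothetical, not taken from any real system). State can only change via a method, so the guarding business logic can never be bypassed:

```python
class Reservation:
    def __init__(self):
        self._status = "pending"  # internal state, never touched from outside

    def confirm(self):
        # The business rule lives with the data it guards.
        if self._status != "pending":
            raise ValueError("only a pending reservation can be confirmed")
        self._status = "confirmed"

    @property
    def status(self):
        return self._status


r = Reservation()
r.confirm()
print(r.status)  # confirmed
```

Because `_status` is only writable through `confirm`, a caller cannot put the object into a state the business logic forbids; the CRUD anti-pattern is the service-level equivalent of exposing `_status` as a public field.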

Even more compelling is the issue of updating multiple entities as part of a single logical atomic operation. CRUD interfaces will usually have create/read/update/delete operations for each entity housed by the service. But what if you want to update two different entities such that either both updates succeed or neither does?

You have the following options:

  • Use a distributed transaction
  • Implement compensation logic to handle failures yourself
  • Create new create/update/delete operations for specific combinations of entities

None of these options is satisfactory. The first option may not even be possible if the service stack doesn't support distributed transactions (e.g. ASMX, WSE). And even if it does, cross-service transactions are incredibly bad practice because services can lock each other's records, which severely hurts service autonomy.

The second option is certainly not an easy task to do properly, and takes a lot of additional effort. And the third option isn't really practical. There are too many combinations of entities that may need to be updated in a single transaction, and it would take a lot of additional effort to implement them all.

Lastly, if a service must go to other services to pick up the data it is going to operate on, this means synchronous request/reply message exchanges between services. These are bad news because they are really slow and introduce temporal coupling between services (the service with the data must be available at the time the service without the data needs it).

So hopefully this is enough to convince you that CRUD is bad. But how do we design our services to avoid CRUD? Well, firstly we decentralise our data! This way all the data a service needs to operate on as part of a single logical operation is held locally within the service. Secondly, we make our service operations task centric, rather than data centric. The operations should be more like "make reservation" and "cancel reservation" rather than "retrieve reservation" and "update reservation". Udi Dahan has recently made a couple of posts discussing this very point.
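A task-centric contract can be sketched as follows. This is a hypothetical illustration (the `ReservationService` class, its operations and its rules are my own invention for this post, not a prescribed API): the operations express business tasks, the data stays local, and the invariants are enforced inside the service where they cannot be bypassed.

```python
class ReservationService:
    """Task-centric operations, not create/read/update/delete on an entity."""

    def __init__(self):
        self._reservations = {}  # local, decentralised data store
        self._next_id = 1

    def make_reservation(self, customer, room):
        # Business rule (no double booking) is enforced inside the boundary.
        if any(r["room"] == room and r["active"]
               for r in self._reservations.values()):
            raise ValueError(f"room {room} is already booked")
        rid = self._next_id
        self._next_id += 1
        self._reservations[rid] = {"customer": customer,
                                   "room": room,
                                   "active": True}
        return rid

    def cancel_reservation(self, reservation_id):
        self._reservations[reservation_id]["active"] = False


svc = ReservationService()
rid = svc.make_reservation("Alice", room=101)
svc.cancel_reservation(rid)
```

Contrast this with a CRUD-style `update_reservation(id, fields)`: the double-booking rule would have to live in every consumer, or be skipped entirely.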

A final point I'll make on this is that CRUD operations are fine inside the service boundary, so for example what you might see between a smart client and the service back end. But this point will be discussed in more detail in future posts. Stay tuned!

Monday, February 18, 2008

SOA and the "Universal Truth"

I was having a discussion with an experienced EA from another organisation the other day. His view was that a decentralised data architecture could never work because there would always be a certain amount of entropy in the data. That is, there would always be delays as the data is published between the interested services and as such there would not be one universal version of the truth.

In his opinion, in the centralised world you could point management at the data in a given service and know the whole truth.

This is however in my opinion merely an illusion of the truth. In order to get at "the truth", we must first define what the truth is. The truth is a matter of business definition. IT systems support a business - it is not the other way around. Truth begins in the real world as a business concept, and eventually ends up being recorded in a database somewhere. The event did not occur when the database was updated, it happened beforehand.

As such, with both approaches you will not be able to look to a single database for the universal truth. There will always be delays between the "truth" and the database. That is, you will always have entropy.

Also worth noting is that information in organisations has natural entropy. This is due to the fact that no one person can know everything at once. A customer first arrives via the sales department and eventually ends up at billing. The view each department has of the customer at various stages during the sales process is quite different. Also, the information does not travel instantly between the people in each department.

By mirroring this behaviour with our services (by decentralising our data), we have a better chance of getting at the truth as the business would define it.

So although in theory there is a "universal truth" to the universe (except at the quantum level perhaps), in reality it is something we cannot practically attain; not due to the selection of a decentralised as opposed to centralised approach, but due to human limitations.

Friday, February 15, 2008

The Data Debate Continues...

A reader has just left a number of very good questions regarding my last post. The answers may interest everyone, so I thought I'd address them in a follow-up post. I'll tackle each question in turn:

With a decentralised approach would you double up on data around the place?

Indeed this is the case. The data is not exactly doubled up however. Each service holds a different representation of the data, although there will be some elements in common. This serves our interests very well because we are able to optimise the representation for the specific business domain handled by each service.

For example, a customer as seen by sales involves opportunities, volumes, revenue, etc. According to billing the customer involves credit history, risk, billing history, etc. Some elements such as name and address however will be in common.

How do you keep data shared between services in sync?

Whenever an event occurs in a service that updates data in that service, an event is published onto the service bus. All other services that are interested in that event are subscribed and as such receive a notification. Each subscribed service then makes a local decision as to how to react to that notification. This may involve saving a new version of the updated data, but only the elements that the subscribing service is interested in.
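The flow above can be sketched in a few lines. This is a minimal in-process illustration only (a real deployment would sit on a service bus product, and the `ServiceBus` and `BillingService` names here are hypothetical): the publisher sends one event, and each subscriber makes its own local decision about which elements to store.

```python
from collections import defaultdict


class ServiceBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # One publish; the bus handles delivery to each subscriber.
        for handler in self._subscribers[topic]:
            handler(event)


class BillingService:
    def __init__(self, bus):
        self.customers = {}
        bus.subscribe("customer.updated", self.on_customer_updated)

    def on_customer_updated(self, event):
        # Local decision: keep only the elements billing cares about.
        local = self.customers.setdefault(event["id"], {})
        local["name"] = event["name"]


bus = ServiceBus()
billing = BillingService(bus)
bus.publish("customer.updated",
            {"id": 7, "name": "Acme Pty Ltd", "revenue": 1_000_000})
```

Note that billing stores the name but ignores the revenue figure: each service holds its own representation, optimised for its own domain.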

When a service updates a bit of shared data does it broadcast to the world it has done this? Or do other services subscribe to that service so it has a list of services to inform of updates?

Only services subscribed to the topic on which that event is published will receive a notification.

How does it work if two or more services try to update the same thing at the same time? Broadcast some kind of lock event to all other interested services?

This question is really a business issue rather than a technology one. Firstly, this can only happen if you have two separate users using two separate systems that update the same data at the same time. The business may not require or even desire this. Usually, specific information is owned by specific groups of people within an organisation and only they should have authority to make changes to it.

However, if the business requires multiple independent users to be able to update the same data from different services, a business decision must be made as to what should occur. The easiest solution is first in, best dressed. Another is to make one service the authoritative source, so that if there is a conflict that service always wins. Yet another is to send the details of the conflict to someone in an email so they can resolve it manually.

In order to achieve this, we need to be able to detect conflicts to begin with. One way of doing this is to place an integer version number on the entity within each service. When any one service performs an update, it places the current version number in the notification and then increments the number locally. Subscribing services perform the update in their respective databases, provided the version number in the notification matches the one in their database. If it doesn't match, you know there is a conflict. This is essentially an optimistic locking strategy.
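The version check itself is tiny. A sketch, assuming each service keeps an integer version alongside its local copy of the entity (the function name and dictionary shape are illustrative, not a prescribed schema):

```python
def apply_update(local_entity, notification):
    """Apply a notified change only if versions line up; else flag a conflict."""
    if notification["version"] != local_entity["version"]:
        return False  # conflict: hand off to the business-chosen resolution
    local_entity.update(notification["data"])
    local_entity["version"] += 1
    return True


customer = {"version": 3, "name": "Acme"}
ok = apply_update(customer, {"version": 3, "data": {"name": "Acme Pty Ltd"}})
stale = apply_update(customer, {"version": 3, "data": {"name": "Old Name"}})
```

The first update succeeds and bumps the local version to 4; the second carries a now-stale version number and is rejected, which is exactly the conflict signal we wanted.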

We don't want to use pessimistic locking between services. One service should never be able to lock the records of another. It would hurt performance too much and introduce too much coupling.

Is that less chatty than having it centralised?

Indeed. With a decentralised approach, all data needed by a service to service a request is held locally. This means no requests to other services to pick up data. Secondly, all data updates are done locally, meaning no requests to multiple other services to update data. The only message we really need to send out is a notification message once our local operation is complete. Yes, this message is sent to multiple subscribers, but to the publishing service, it is only one message. It is the messaging infrastructure that handles routing it to each subscriber.

Centralised vs. Decentralised Data (continued...)

I had a comment from John on my last post, which I'd like to take the opportunity to discuss.

John suggests that the centralised approach is just plain easier to design, implement and most importantly, think about. I can't say I agree. Once making the shift to thinking in an event driven paradigm, the decentralised approach is actually very natural and as such is easier to design, implement and think about.

The decentralised approach produces an architecture with considerably less coupling. As a result of this, each service is easier to design and implement as you are only really concerned with one service at a time. Moreover, the architecture is less sensitive to change, so the overall system becomes easier to implement as changes are more localised to each service.

John also questions whether the customer really cares about and is prepared to pay for the advantages that decentralisation offers. In my experience, the decentralised approach delivers a better solution faster, so it actually turns out to be the cheaper alternative. One reason for this is you have considerably fewer message exchanges to worry about and implement.

Moreover, when looking at the cost of a system over its entire lifetime, the development cost pales in comparison to the maintenance cost. Due to the looser coupling offered by the decentralised approach, maintenance costs are substantially lower.

And finally, I believe that the performance and reliability problems of the centralised approach outlined in my last post are deal breakers. Will a customer be satisfied with a solution where rebooting a single system takes down all other dependent systems?

So I would suggest that yes, the customer will in fact care about the approach taken, as the decentralised approach delivers a system that is cheaper to implement, easier to change and easier to operate.

Thursday, February 14, 2008

Centralised vs. Decentralised Data

I tend to find during my travels that there are two main approaches to SOA design. In one corner we have the people who favour a decentralised data architecture, and in the other corner, we have the people who favour a centralised data architecture.

In the decentralised corner, we have data redundancy between the various services. For instance, customer data may exist in some form or another in every service in your enterprise. When a change occurs in the customer information in one service, an event is published and the services interested in that change receive a notification. They then in turn update their local representations.

In the centralised corner, there is only one centralised source of any given piece of information. Any service that needs that information at any time makes a request to that service to retrieve the information.

The centralised approach has the following disadvantages:

  • It produces far more chatty message exchanges because every service needs to go to other services to pick up the data needed to service a given request.
  • Data must be updated transactionally across multiple services. We never want transactions to span services as it adds too much coupling. Records may be locked in one service for extended periods of time whilst waiting for other services.
  • Synchronous request/reply (which is what is used to pick up the data from the other services) is really slow. You can chew up a lot of threads waiting for responses from other services.
  • If one of these services needs to be rebooted, every service that depends on it for retrieving data will keel over.
  • One data representation must service the needs of all systems in the enterprise, which considerably increases complexity.
  • The risk of changing that representation could impact all systems in the enterprise, thus meaning considerably greater testing efforts.

Due to these shortcomings, and the success I have had with the decentralised approach, I sit squarely in the decentralised corner. However, I still tend to find that most people I come across sit in the centralised corner. Are there some advantages to this approach I am missing? Do we need more readily available guidance in this space?