Kehoe - So I Get This Call...

by John Kehoe - 12 August 2008

It’s Friday, the day before my 40th birthday (well, in fine Irish tradition, my birthday “wake”). I get a call from a customer, a major US-based air carrier. They’ve spent the last two months troubleshooting an online check-in system that powers their departure kiosks. They ask me to look at the problem.

The new check-in system was designed to complete the passenger ticketing process in 30 seconds, cut queue time and reduce counter staffing. Unfortunately, it wasn’t working out that well in production: the check-in process was taking five minutes, ten times longer than expected and longer than it would take to have an agent check in a passenger.

As a result, queues are long, customers are angry, and the customer has to increase the counter staff. Meanwhile, the airline the next counter over is fat, dumb and happy, successfully executing the business plan my customer was trying to implement. How dare they!

Each morning, my customer convenes a meeting of twenty people representing every vendor, owner and tier. Each presents the latest findings. All report acceptable metrics. Nobody can solve the end-to-end problem.

Before going any further, let’s do some math on the cost of these meetings. Sixty-three days, times twenty people, times (purely for round numbers’ sake) $100 per person per meeting, equals $126,000 lost to just this meeting. This doesn’t include troubleshooting time, opportunity cost and the proposed expenses to fix the problem (not to mention the meetings to implement that fix). So much for the returns the customer is trying to achieve with the new system.

This is a multi-million dollar problem. It isn’t a seven-digit problem, it’s an eight-digit problem. The customer has already sunk millions into developing the software and acquiring the hardware and staff to deploy the application. They are past their planned deployment date and are paying dearly for FTEs they want to shift. On top of that, they’re losing the business passengers who are the target of the system (the frequent flyer miles simply aren’t worth the hassle).

To be fair, this application is a bear. There are four databases, three application tiers, a data conversion tier, an application server and a remote data provider (that is, a third-party, external vendor). There is no possible way to understand what is going on by looking at the problem one piece at a time.

Now, back to the situation at hand. I join call number sixty-three. (Did I mention it’s my day off and I’m missing my birthday party?) There are four current fronts of attack: the network load balancer is not cutting the mustard; the web servers are misconfigured; the Java guys think there might be an issue with the application server configuration; and the server pool is being tripled in size. I ask for seventy-two hours. My first – and only – act is to get two fellows from the customer to install some performance management software they had bought a year earlier for a different project.

I sit back and wait.

It turns out that the team was off the mark. The Java guy was right, but for the wrong reasons.

Here is how the wait analysis broke down. Authentication was responsible for 3% of the wait, and the remote vendor response for another 2%. One application component was responsible for 95% of the delay. The issue boiled down to asynchronous message-driven bean (MDB) calls.
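The post doesn’t spell out the exact call pattern, but a common way “async MDB calls” turn into a 95% wait is request/reply layered over JMS: the caller fires a message at a queue, then parks on a temporary reply queue until the MDB (and everything behind it) answers. Here is a deliberately simplified Java sketch of that pattern; every class name, JNDI entry and timeout is invented for illustration, and this is not a description of the customer’s actual code.

import javax.jms.*;
import javax.naming.InitialContext;

// Hypothetical illustration only: queue names, JNDI entries and the timeout are made up.
public class CheckInClient {

    public String requestBoardingPass(String passengerId) throws Exception {
        InitialContext ctx = new InitialContext();
        ConnectionFactory factory =
                (ConnectionFactory) ctx.lookup("jms/ConnectionFactory"); // assumed JNDI name
        Queue requestQueue = (Queue) ctx.lookup("jms/CheckInRequests");  // assumed JNDI name

        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // The MDB consumes from requestQueue and sends its answer to JMSReplyTo.
            TemporaryQueue replyQueue = session.createTemporaryQueue();
            TextMessage request = session.createTextMessage(passengerId);
            request.setJMSReplyTo(replyQueue);
            session.createProducer(requestQueue).send(request);

            // Here is the catch: the "asynchronous" hop is synchronous from the
            // kiosk's point of view. The whole check-in transaction sits on this
            // receive() until the MDB and every tier behind it respond.
            connection.start();
            MessageConsumer consumer = session.createConsumer(replyQueue);
            TextMessage reply = (TextMessage) consumer.receive(30000);   // 30-second guess
            return (reply == null) ? null : reply.getText();
        } finally {
            connection.close();
        }
    }
}

However “asynchronous” the messaging looks on the architecture diagram, the kiosk transaction is stalled at that receive() call, which is exactly the kind of wait that only shows up when you measure the transaction end to end.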

Let’s consider the actual effort it took to isolate the problem.

First, we eliminated 90% of the people from the equation in two days. The network and systems were good. There was no issue with the web servers or system capacity. We could gain some single-digit improvements by tweaking the authentication process (fixing a couple of queries) and enforcing the SLA with our third-party data provider. That left only the middleware team, and it reduced the meeting of twenty people down to three: a customer Java guy, a rep from the Java application server vendor and me.

Second, we eliminated a $1MM hardware “solution” that was being given serious consideration. The web team genuinely believed they were the bottleneck and that if they scaled out and tripled their footprint, all would be better. Management (perhaps in a panic) was about to give them the money. It would have made no difference.

Third, we turned around a fix within seventy-two hours.

So, let’s do the math again. One performance guy, times seventy-two hours (I really wasn’t working the whole time; I found the Scotch the family had set aside for my birthday wake), times $100 an hour (we didn’t charge, so this is a bit inflated), comes to $7,200. Compare that to the (conservatively estimated) $126,000 spent on the daily fire-drill meetings.

We eliminated waste by closing up the time-wasting, money-draining, soul-sucking morning meetings; avoiding a $1MM hardware upgrade that wouldn’t fix the problem; enabling the underlying system to achieve its business operations goals (reduction of counter staff and queue time) so that it could come close to the business impact originally forecast; and providing a standard measurement system across all applications and tiers.

Consider this last point very carefully. We have to have a systematic approach to measuring the performance of applications. The approach must be holistic, i.e., capture transaction data from the desktop to the backend storage and across all the tiers in between. We have to see and understand the relationships and interactions in the technology stack for our transactions. We cannot rely on free, vendor-supplied tools and a “toss the spaghetti at the wall and see what sticks” approach. That gives us only isolated, uncorrelated data points that show no problems, or only symptoms, but never the root cause.
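To make that a little more concrete, here is a toy sketch of what per-tier transaction capture looks like at the web tier: tag each transaction once, carry the tag through every tier, and record how long each tier held it, so the numbers can be correlated into one end-to-end picture instead of twenty disconnected “my tier is fine” reports. This is not the commercial product the customer installed; the header name and the logging target are stand-ins.

import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import java.io.IOException;
import java.util.UUID;

// Toy illustration only. Tags each request with a correlation ID and records the
// wall-clock time this tier held the transaction, keyed by that ID, so per-tier
// timings can later be rolled up into a single end-to-end view.
public class TransactionTimingFilter implements Filter {

    public void init(FilterConfig config) { }

    public void destroy() { }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpReq = (HttpServletRequest) req;
        String txnId = httpReq.getHeader("X-Txn-Id");   // assumed header name
        if (txnId == null) {
            txnId = UUID.randomUUID().toString();       // this tier starts the chain
        }
        long start = System.currentTimeMillis();
        try {
            chain.doFilter(req, res);
        } finally {
            long elapsed = System.currentTimeMillis() - start;
            // In practice this would feed the measurement system, not stdout.
            System.out.println("txn=" + txnId + " tier=web elapsedMs=" + elapsed);
        }
    }
}

The same correlation ID would be propagated onto the JMS messages and database calls downstream; without that common key, each tier’s numbers stay exactly what they were in those morning meetings: individually acceptable and collectively useless.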

From an IT perspective, the cost of the path that led to the solution was negligible: the time and tools for the three days spent actually solving the problem cost little more than the morning meetings did (except for the pounding the IT group was taking from the business owners while the application’s wings were clipped). From a business perspective, that cost was nothing compared with the business impact: reduction of counter staff, faster check-ins, and happy customers. (Well, perhaps not “happy”: this is an airline we’re talking about... let’s say customers who become disgruntled at a later point in the air travel experience.)

For all the panic and worry it causes, a situation like this doesn’t need to be an exercise in “not my problem”; it can actually bring the business and vendors into alignment. But that is true only if vendors bear in mind that a holistic performance approach has real value associated with it, and if customers bear in mind that a holistic performance measurement system will set them back little more than the cost of futile execution.

Holistic performance management is an essential piece of successful business application deployment. Though often viewed as an afterthought, performance management is the least expensive part of application deployment. When used, it releases untapped value in applications. At the very least, it’s a cheap insurance policy for the business when the fire alarm rings.




About John Kehoe: John is a performance technologist who has been plying his dark craft since the early nineties. John has a penchant for parenthetical editorializing, puns and mixed metaphors (sorry).
