Have you walked down the ORM road of death?

A friend of mine asked me a really good question tonight:

Hey Stefan,
It would be great if you could give me a sense of how many development teams get hit by a database bottleneck in JEE / Java / 3-tier / ORM / JPA land. How do they go about addressing it? And what exactly causes the bottleneck?

I think most successful apps – scaling problems are hopefully a sign that people are actually using the stuff, right? – built with Hibernate/JPA hit database contention pretty early on. From what I’ve seen, this is usually caused by doing excessive round trips over the wire or returning data sets that are too large.

And then we spend time fixing all the obviously broken data access patterns: first by using HQL instead of the default eager/lazy fetching, then by tuning the existing HQL, and finally by dropping down to direct SQL if needed.

I believe the next step after this is typically to try to scale vertically, in both the db and app tiers. Throwing more hardware at the problem may get us quite a bit further at this point.

Then we might get to the point where the app is fixed up enough that it actually makes sense to scale horizontally in the app tier. By now we will probably have to add a load balancer to the mix and use sticky sessions.

And then we will perhaps find out that we can’t do that very well without a distributed 2nd level cache, and that all our direct SQL writes to the DB (which bypass the 2nd level cache) won’t allow us to use the 2nd level cache for reads either…

This is where I think there are many options, and I’m not sure how people tend to go from here. We might see some people abandoning ORM, while others try to get the 2nd level cache to work.

Are these the typical steps for scaling up a Java Hibernate/JPA app? What’s your experience?
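
For reference, the broken data access pattern in question usually looks something like the sketch below: a minimal illustration assuming hypothetical Customer and Order entities mapped with Hibernate, and not code from any particular production system.

    import java.math.BigDecimal;
    import java.util.List;
    import org.hibernate.Session;

    // Hypothetical Customer/Order entities, for illustration only.
    public class OrderTotals {

        // Naive version: 1 query for the customers plus 1 query per customer to
        // initialize its lazy orders collection -- the classic "N+1 selects"
        // shape of the excessive round trips mentioned above.
        @SuppressWarnings("unchecked")
        public BigDecimal naiveTotal(Session session, String country) {
            List<Customer> customers = session
                    .createQuery("from Customer c where c.country = :country")
                    .setParameter("country", country)
                    .list();
            BigDecimal total = BigDecimal.ZERO;
            for (Customer c : customers) {
                for (Order o : c.getOrders()) {     // triggers a separate SELECT per customer
                    total = total.add(o.getAmount());
                }
            }
            return total;
        }

        // HQL with a fetch join: the orders come back in the same round trip.
        @SuppressWarnings("unchecked")
        public BigDecimal fetchJoinTotal(Session session, String country) {
            List<Customer> customers = session
                    .createQuery("select distinct c from Customer c "
                            + "left join fetch c.orders where c.country = :country")
                    .setParameter("country", country)
                    .list();
            BigDecimal total = BigDecimal.ZERO;
            for (Customer c : customers) {
                for (Order o : c.getOrders()) {     // already loaded, no extra query
                    total = total.add(o.getAmount());
                }
            }
            return total;
        }
    }

The fetch join trades many small queries for one bigger one, which is usually the right trade until the joined result set itself gets too large (the other problem mentioned above).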
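
The 2nd level cache point is worth making concrete too. Turning the cache on is mostly configuration, but any writes that go around Hibernate leave stale entries behind. A minimal sketch, assuming Hibernate 3 with Ehcache as the cache provider (illustration only, not a recommendation of any particular setup):

    // hibernate.cfg.xml (or persistence.xml) properties for enabling the
    // second-level cache with Ehcache as the provider:
    //
    //   hibernate.cache.use_second_level_cache = true
    //   hibernate.cache.use_query_cache        = true
    //   hibernate.cache.provider_class         = org.hibernate.cache.EhCacheProvider

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.hibernate.annotations.Cache;
    import org.hibernate.annotations.CacheConcurrencyStrategy;

    @Entity
    @Cache(usage = CacheConcurrencyStrategy.READ_WRITE)   // cache reads of this entity
    public class Customer {

        @Id
        private Long id;

        private String country;

        // getters/setters omitted
    }

    // The catch: anything written straight to the database with hand-rolled SQL
    // (plain JDBC, stored procedures, batch jobs) bypasses Hibernate entirely,
    // so the second-level cache never hears about the change and keeps serving
    // stale data unless you evict the affected regions yourself, for example:
    //
    //   sessionFactory.evict(Customer.class);        // drop all cached Customers
    //   sessionFactory.evict(Customer.class, id);    // or just one of them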

Speed sells

This coming week (the first week of February), Unibet launches its revamped website based on the Facelift project I lead. As part of this effort we have worked extremely hard to lower page loading times, and we have invested a substantial amount of time and money in improving performance. Is this really justified?

A 2006 study by Jupiter Research found that the consequences for an online retailer whose site underperforms include diminished goodwill, negative brand perception, and, most important, significant loss in overall sales. Online shopper loyalty is contingent upon quick page loading, especially for high-spending shoppers and those with greater tenure.

The report ranked poor site performance second only to high prices and shipping costs as the main dissatisfaction among online shoppers. Additional findings in the report show that more than one-third of shoppers with a poor experience abandoned the site entirely, while 75 percent were likely not to shop on that site again. These results demonstrate that a poorly performing website can be damaging to a company’s reputation; according to the survey, nearly 30 percent of dissatisfied customers will either develop a negative perception of the company or tell their friends and family about the experience.

+500 ms page load time led to a 20% drop in traffic at Google

Marissa Mayer ran an experiment where Google increased the number of search results from ten to thirty per page. Traffic and revenue from Google searchers in the experimental group dropped by 20%.

After a bit of digging, they found an uncontrolled variable. The page with 10 results took 400 ms to generate; the page with 30 results took 900 ms. A half-second delay caused a 20% drop in traffic. A half-second delay killed user satisfaction.

“It was almost proportional. If you make a product faster, you get that back in terms of increased usage”
-Marissa Mayer, VP Search Products and User Experience at Google

The same effect happened with Google Maps. When Google trimmed the 120 KB page down by about 30 percent, it started getting about 30 percent more map requests.

+100 ms page load time led to a 1% drop in sales at Amazon

Amazon also performed some A/B testing and found that page load times directly impacted revenue:

“In A/B tests, we tried delaying the page in increments of 100 milliseconds and found that even very small delays would result in substantial and costly drops in revenue.”
-Greg Linden, Amazon.com

There are a number of tools and best practices available for improving website performance. I particularly like the work of Steve Souders. Steve was the Chief Performance Yahoo! (at Yahoo!, obviously) and is now at Google working on web performance and open source initiatives.

While at Yahoo!, Steve published a benchmarking tool called YSlow, which is a good indicator of how well the front-end web technology (HTML, JavaScript, images, etc.) of your site is implemented. The front end accounts for almost 90% of the page load time at most e-commerce sites.
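
A big part of a good YSlow score comes down to simple server-side rules, for example putting far-future Expires headers on static content so browsers stop re-requesting it on every page view. Below is a minimal sketch of that one rule as a servlet filter; it is an illustration of the technique, not the actual Facelift code.

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletResponse;

    // Mapped in web.xml to static resources (*.css, *.js, *.gif, ...), this tells
    // the browser (and any CDN in front of us) to cache the response for a year.
    public class FarFutureExpiresFilter implements Filter {

        private static final long ONE_YEAR_MS = 365L * 24 * 60 * 60 * 1000;

        public void init(FilterConfig config) throws ServletException {
        }

        public void doFilter(ServletRequest request, ServletResponse response,
                             FilterChain chain) throws IOException, ServletException {
            HttpServletResponse httpResponse = (HttpServletResponse) response;
            httpResponse.setHeader("Cache-Control", "public, max-age=31536000");
            httpResponse.setDateHeader("Expires", System.currentTimeMillis() + ONE_YEAR_MS);
            chain.doFilter(request, response);
        }

        public void destroy() {
        }
    }

Combined with things like gzip compression and serving static content from a CDN (both also YSlow rules), this is where most of the score, and most of the perceived speed, comes from.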

At Unibet, our old HTML had a YSlow score of 56/100 on average. This is about average in the e-gaming industry. However, the Facelifted version just launched scores 96/100. For comparison, eBay’s start page scores 97/100 and Yahoo!’s start page 95/100. This should translate into reduced waiting times, and based on the research above it will help drive revenue and customer satisfaction.

We have worked extremely hard to lower page loading times, and we have invested a substantial amount of time and money in doing so. Is this really justified? YES! I am confident that our new site will contribute to increased sales and increased customer lifetime value.

More for less

Since I started working in IT, I’ve always focused on helping organizations increase productivity and/or cut costs. Since I joined Unibet I am constantly challenged by my managers to cut operational costs while at the same time making the system and platform more performant, available and scalable. You might think that this would suck, but it’s really great fun. It’s very rewarding and not too difficult, really. My approach is to question everything. Start asking questions! Why? How much does it cost? What value does it provide?

So far I’m looking at a number of different areas where we can cut costs.

Example #1 – KISS = Effectiveness

After a few months in my new position I started to question our current technical setup. We have one site in Malta and another one in Costa Rica. The latter site was set up a few years ago, when some of our markets were moved there for legal reasons. What surprised me was that no one was challenging or reevaluating this decision, even though it seemed to cause major issues in production and increased development costs by quite a bit – obviously time-to-market (TTM) suffered too. So I decided to look at what our competitors were doing and quickly came to the conclusion that they seemed to have a much more straightforward IT infrastructure.

The next step was obviously to try to change this so that we could be more effective, provide a better service and improve TTM – while at the same time cutting costs. The way to do this in any organization is to present a business case that explains the rationale for making the change. With the help of my colleagues in the IT management team, I delivered a business case to my managers, which in turn was presented to legal (who had been advocating for setting up the second site in the first place). They were baffled by what the current setup actually cost, and by the fact that no one had explained to them before what the implications of their requirement were. As it was very hard to justify the direct and indirect costs of having the second data center in production, we are now, four months later, not in Costa Rica anymore.

Example #2 – Bandwidth costs

So, we run our business solely off Malta, a not particularly interesting rock in the Mediterranean. Bandwidth costs are insanely high in Malta due to the lack of competition in this space – and we require quite a lot of bandwidth. Most of our competitors run their systems closer to mainland Europe (London, Vienna, Gibraltar, the Isle of Man and Madrid, to name the most popular hosting locations for e-gaming). Legally they probably take a slightly higher risk by doing so, but they gain better performance by being closer to the customers, and they have lower costs – hosting and bandwidth costs there are 30-50% of what we pay in Malta.

For this reason I was curious whether we were allowed to run parts of the operation outside of Malta under a Maltese e-gaming license. After reading up on the Maltese LGA’s laws and regulations, I found out that it’s allowed to host everything except the very core pieces of the site outside of Malta.

So, we are moving more and more onto the Content Delivery Network. Currently we are diverting more than 50% of the traffic to the CDN, and hence we have been able to reduce our bandwidth costs in Malta by a lot. ROI from day one!

Example #3 – Support and software license costs

Another huge operational expense is the license fees we’re paying to companies such as Oracle and BEA. Oracle has a really good product that I don’t mind paying for, but the issue here was that we paid too much (for too many CPUs). We had database (disaster) replication using Oracle Data Guard to a server on the same site, and we had as many CPUs active in the Data Guard standby as in the production databases. I read up on the fine print of the Oracle license agreement and quickly came to two conclusions:

  1. We shouldn’t use more than one or two CPUs in the replication database.
  2. We can consolidate smaller databases and save license costs.

On top of this it was fairly easy to look at parts of the application and rewrite them to minimize load on the production database. In about four months we managed to reduce the load on the main database by 50% or more, cutting the Oracle licensing costs by the same amount.
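
One typical way to take that kind of load off a database is simply to stop asking it the same question on every page view. A minimal sketch of the idea, using a hypothetical ExchangeRateDao and a crude time-based in-memory cache (made up for illustration, not our actual code):

    import java.math.BigDecimal;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical DAO for rarely-changing reference data.
    interface ExchangeRateDao {
        BigDecimal loadRate(String currency);   // hits the database
    }

    // Read-through cache with a crude time-to-live, so the production database
    // sees one query per currency per minute instead of one per page view.
    public class CachedExchangeRates {

        private static final long TTL_MS = 60 * 1000;

        private final ExchangeRateDao dao;
        private final Map<String, CacheEntry> cache = new ConcurrentHashMap<String, CacheEntry>();

        public CachedExchangeRates(ExchangeRateDao dao) {
            this.dao = dao;
        }

        public BigDecimal getRate(String currency) {
            CacheEntry entry = cache.get(currency);
            long now = System.currentTimeMillis();
            if (entry == null || now - entry.loadedAt > TTL_MS) {
                entry = new CacheEntry(dao.loadRate(currency), now);   // one DB round trip
                cache.put(currency, entry);
            }
            return entry.rate;
        }

        private static class CacheEntry {
            final BigDecimal rate;
            final long loadedAt;

            CacheEntry(BigDecimal rate, long loadedAt) {
                this.rate = rate;
                this.loadedAt = loadedAt;
            }
        }
    }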

As for BEA WebLogic costs, I don’t really see a point in paying them going forward. Application servers are becoming a commodity (as in open, free software), and BEA’s product isn’t really providing the business value to justify its cost. BEA support is infamous for its terrible offshore first line in India, and you get no help from them unless you reproduce the problem yourself, write the test case and submit it. You end up doing what you pay them to do for you. Let me just say, I’ll eat a hat if we’re still paying for BEA’s services eight months from now.

Divide and Conquer

I listened to Randy Shoup at QCon. Randy works in the architecture team at eBay. What impressed me about his presentation was the “just the facts and nothing but the facts” approach and the complete lack of buzzwords and product talk. It was like listening to a really good and concise O’Reilly book. Although I didn’t learn anything new from listening to Randy, it’s always good to get a distilled and well-presented summary of what really works regardless of technology fads.

Partition everything! Partition your system (“functional split”) and your data (“horizontal split”). It doesn’t matter what tool or technology you use: if you can’t split it, you can’t scale it. Simple as that, regardless of whether you’re using a fancy grid solution or just multiple databases.
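
To make the horizontal split concrete: at its simplest it is nothing more than a function from a key, say the user id, to one of N databases. A minimal sketch assuming plain JDBC DataSources; it illustrates the idea only, and is not eBay’s (or anyone else’s) actual implementation.

    import java.util.List;
    import javax.sql.DataSource;

    // Routes each user to one of N partitioned databases by hashing the user id.
    // Everything belonging to the user lives in that one partition, so each
    // database only sees roughly 1/N of the rows and 1/N of the load.
    public class UserShardRouter {

        private final List<DataSource> shards;

        public UserShardRouter(List<DataSource> shards) {
            this.shards = shards;
        }

        public DataSource shardFor(long userId) {
            int index = (int) (Math.abs(userId) % shards.size());
            return shards.get(index);
        }
    }

The hard part is not the routing itself but everything around it: queries that span partitions, and re-balancing the data when you add one.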

Use asynchronous processing everywhere! If you have synchronously coupled systems, they scale together and fail together: the least scalable system limits your scalability and the least available system limits your uptime. If you use asynchronous, decoupled systems, each of them can scale independently of the others.
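
In JEE land the natural building block for this is a message queue: the producer writes a message and moves on, and the downstream system consumes it at its own pace. A minimal JMS sketch; the queue and payload are made up for illustration.

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.JMSException;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;

    // Instead of calling a downstream system synchronously while the user waits,
    // drop a message on a queue and return immediately. If the consumer is slow
    // or down, messages queue up instead of taking the front end down with it.
    public class PlacedBetPublisher {

        private final ConnectionFactory connectionFactory;
        private final Queue placedBetsQueue;   // e.g. looked up from JNDI

        public PlacedBetPublisher(ConnectionFactory connectionFactory, Queue placedBetsQueue) {
            this.connectionFactory = connectionFactory;
            this.placedBetsQueue = placedBetsQueue;
        }

        public void publish(String betId) throws JMSException {
            Connection connection = connectionFactory.createConnection();
            try {
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer producer = session.createProducer(placedBetsQueue);
                TextMessage message = session.createTextMessage(betId);
                producer.send(message);
            } finally {
                connection.close();
            }
        }
    }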

In the Limelight

One of the first things I did when I joined Unibet was to set up a Content Delivery Network. I did some research and ended up with a shortlist.

A few factors were limiting my options:

  1. The CDN provider must do business with e-gaming companies
  2. The CDN provider must have an SSL CDN service

The first point effectively rules out Akamai and a number of other companies. The second point rules out even more companies.

I ended up talking to Limelight, and despite some screw-ups at their London sales office, I must say their CDN service is really awesome. Highly recommended. We currently use them for website acceleration, for hosting downloadable clients, and for banner serving.