On Parallelism: How Spring 2.5.1 Brought A 192-Core Compute Appliance To Its Knees (And How Spring 2.5.2 Brought It Back On Its Feet Again)
By leok | March 14, 2008
We here at Lime Wire are pretty jazzed about our store launch. We’ve not only loaded it up with some 500K tracks (with thousands more to come), but we also managed to add some really nifty web-based user interface features. We’re able to bring this experience to you along with terabytes of media, all with reasonably good performance thanks to some hefty hardware.
Of course, hardware is only as fast as the software it runs. While load testing The Store, we discovered a critical performance issue with Spring, specifically version 2.5.1, which is one of the many open-source frameworks we’re currently using to build the webapp. During our development phase, we used quad-core AMD Opteron boxes which loaded our front page in about 1-2 seconds. Given this, we just couldn’t wait to run it under load using our 192-core compute appliance from Azul Systems.
When we ran the application on the compute appliance, we were shocked to see that the home page took 10+ seconds to load for a single user. We made sure that the application was properly utilizing caching and taking advantage of the proper table indexes in the database. The most befuddling part was how the quad-core x86 box could do in 1-2 seconds what the 192-core compute appliance could not. We knew that the x86 box had a higher per-core clock speed than the compute appliance, but Azul’s ability to apply optimistic thread concurrency (i.e. execute multiple code paths with multiple concurrent threads) and its claim of pauseless garbage collection should have enabled us to scale far better under load.
We took advantage of Azul’s built-in performance profiler and began simulating a load of 40 users simultaneously hitting the appliance each second. After about 10 simulated users, we saw that each request was taking 10+ seconds to process. We were averaging out at a dismal 15-20 users a second, at which point all other requests to the server were blocking.
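For readers who want to reproduce this kind of burst on their own hardware, here is a minimal sketch of the idea, not the load-testing tool we actually used. It releases a fixed number of simulated users at the same instant and records how long each request takes. The names BurstLoadSketch and runBurst are made up for illustration, the Runnable stands in for a real HTTP request, and the lambda syntax assumes a newer JVM than the one this post was written against.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical harness: fires a burst of simultaneous "users" at one request. */
final class BurstLoadSketch {

    /** Starts `users` threads together and returns each request's latency in milliseconds. */
    static List<Long> runBurst(int users, Runnable request) {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        CountDownLatch start = new CountDownLatch(1);
        List<Future<Long>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < users; i++) {
                pending.add(pool.submit(() -> {
                    start.await();                      // hold every user at the gate
                    long begin = System.nanoTime();
                    request.run();                      // stand-in for the real page request
                    return (System.nanoTime() - begin) / 1_000_000;
                }));
            }
            start.countDown();                          // release all users at once
            List<Long> latencies = new ArrayList<>();
            for (Future<Long> f : pending) {
                latencies.add(f.get());
            }
            return latencies;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("burst failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

If every request funnels through one shared lock, the latencies returned by a call such as runBurst(40, request) tend to climb from the first user to the last, which is the pattern we were seeing.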
We then examined thread activity on the appliance. Azul’s profiler has the ability to show where threads are contending for locks and for how long. When we examined the application threads, we saw dozens of threads blocked for 500+ milliseconds on a single lock, which Azul reported thusly:
* org.springframework.beans.factory.support.AbstractBeanFactory.getMergedBeanDefinition(AbstractBeanFactory.java:1043, bci=7, server compiler) * blocked on java.util.concurrent.ConcurrentHashMap (0x000000c9182e5f90)
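For readers without Azul’s profiler, a stock JVM can report something similar, though far more coarsely. The following is a minimal sketch and not the Azul tooling described above: it uses the standard java.lang.management.ThreadMXBean to list threads that are currently blocked waiting for a monitor, along with the lock and the top stack frame. The class name BlockedThreadReport and its method are hypothetical, and the syntax assumes a newer JVM than the one this post was written against.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports threads that are blocked on a monitor right now. */
final class BlockedThreadReport {

    /** Returns one line per BLOCKED thread: thread name, lock, and top stack frame. */
    static List<String> blockedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<>();
        // Ask for up to 8 stack frames per live thread; entries can be null if a thread died.
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds(), 8)) {
            if (info == null || info.getThreadState() != Thread.State.BLOCKED) {
                continue;
            }
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length > 0 ? stack[0].toString() : "(no frames)";
            lines.add(info.getThreadName() + " blocked on " + info.getLockName() + " at " + top);
        }
        return lines;
    }

    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();
        Thread waiter = new Thread(() -> { synchronized (lock) { } }, "waiter");
        synchronized (lock) {
            waiter.start();
            // Wait until the second thread is parked on our monitor, then report.
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(1);
            }
            blockedThreads().forEach(System.out::println);
        }
        waiter.join();
    }
}
```

This is a single snapshot, so it only hints at contention. Sampling it repeatedly under load and counting how often the same lock shows up gives a rough picture of what Azul’s profiler showed us directly.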
A quick look at the source code brought us to this line:
protected RootBeanDefinition getMergedBeanDefinition(
        String beanName, BeanDefinition bd, BeanDefinition containingBd)
        throws BeanDefinitionStoreException {
    synchronized (this.mergedBeanDefinitions) {
        ...
We realized that getMergedBeanDefinition was being called for every injection of a Spring bean into our Wicket components. We also found that a call could avoid this synchronized block entirely if the RootBeanDefinition were already cached. Further along in the method, we came upon these lines of code:
// Only cache the merged bean definition if we're already about to create an
// instance of the bean, or at least have already created an instance before.
if (containingBd == null && isCacheBeanMetadata() && this.alreadyCreated.contains(beanName)) {
    this.mergedBeanDefinitions.put(beanName, mbd);
}
...
return mbd;
Note how the comment says “or at least”. On the surface, the comment seems to conflict with what the following if-statement is saying. Knowing little about the internals of Spring, I went ahead and changed the if-statement to match what the comment seemed to be saying:
// Only cache the merged bean definition if we're already about to create an
// instance of the bean, or at least have already created an instance before.
if (isCacheBeanMetadata() && (containingBd == null || this.alreadyCreated.contains(beanName))) {
    this.mergedBeanDefinitions.put(beanName, mbd);
}
With that change, we no longer experienced the lock contention and reached 40 simultaneous users with the same load test. We filed a JIRA with the Spring people, who turned around a fix in less than 48 hours and released it with Spring 2.5.2 a few days later! The fix made in Spring was not the one I had tested above, and honestly I don’t feel like mine was correct (plus I trust the Spring people’s judgment on this more than mine).
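To make the general shape of the problem concrete, here is a minimal sketch of the two lookup styles. It is neither my patch nor the fix that shipped in Spring 2.5.2, and every name in it (DefinitionCacheSketch, lookupAlwaysLocked, lookupWithFastPath, lockAcquisitions) is hypothetical. The first method takes a shared monitor on every call, even on a cache hit. The second reads the ConcurrentHashMap without locking and only falls back to the lock on a miss.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical cache illustrating locked lookups versus a lock-free fast path. */
final class DefinitionCacheSketch<V> {
    private final ConcurrentMap<String, V> cache = new ConcurrentHashMap<>();
    private final Function<String, V> merger;   // stand-in for the expensive merge; must not return null
    final AtomicInteger lockAcquisitions = new AtomicInteger();

    DefinitionCacheSketch(Function<String, V> merger) {
        this.merger = merger;
    }

    /** Contended shape: every caller serializes on one monitor, even when the value is cached. */
    V lookupAlwaysLocked(String name) {
        synchronized (cache) {
            lockAcquisitions.incrementAndGet();
            V value = cache.get(name);
            if (value == null) {
                value = merger.apply(name);
                cache.put(name, value);
            }
            return value;
        }
    }

    /** Fast path: cache hits never touch the monitor; misses re-check under the lock. */
    V lookupWithFastPath(String name) {
        V value = cache.get(name);
        if (value != null) {
            return value;
        }
        return lookupAlwaysLocked(name);
    }
}
```

With the fast path, a thousand lookups of an already-cached name take the lock only once, for the initial miss. With the always-locked version, each of those lookups queues behind the others, and that cost grows with the number of threads hitting it at once.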
Still, this serves as yet another lesson in latency and throughput, and further demonstrates the myopia of the single-threaded programming mentality that most web developers suffer from (myself included). We never experienced the lock contention on the quad-core box simply because its higher clock rate pushed each thread through the critical section more quickly than the lower-clock-rate Azul appliance could. In fact, if we hadn’t taken into account the Azul appliance’s optimistic thread concurrency, we probably would never have thought to investigate lock contention as a bottleneck, particularly since the contention occurred deep inside an open-source framework. We would have written it off as a load problem and simply bought more hardware. Given the general trend towards multiple cores, developers ought to give a second thought to how parallelism can affect the overall performance of their applications.

Comments and Trackbacks
vivek Says:
March 23rd, 2008 at 11:14 pm
Congrats on the store launch!
Curious though - why not use a cluster of low-cost servers as opposed to a computing appliance… wouldn’t that be cheaper and maybe somewhat flexible as well?
leok Says:
March 25th, 2008 at 10:25 am
Thanks! Certainly a cluster of servers using general-purpose CPUs would be more flexible should we ever decide to jump from Java to, say, PHP, and would give us a higher degree of redundancy. But we also considered the cost of system administration, rackspace, electricity for servers and cooling, component maintenance, monitoring, security, and support. These issues plus real-time profiling tools made the Azul appliances a more compelling option for us.
Scott Says:
April 2nd, 2008 at 3:35 pm
I’m the architect of a large web site that uses Azul appliances, and we initially explored the appliance for the reasons that Leok outlines above (rack-space, electricity, servers, cooling) but then quickly realized that even without those benefits, the Azuls were still less expensive per unit of capacity than any commodity AMD or Sparc based solution.
We also used the appliances to find all sorts of horrific things that we never imagined were wrong with our application, and made a well-performing application much better. We never would have found those “opportunities” on a standard platform, or with “state-of-the-art” profiler tools running on a standard JVM. My developers joke that it’s like a peep-show for Java code. My administrators joke that it’s like getting a raise >:-p