June 28, 2016 Timothy Prickett Morgan
When IBM sold off its System x division to Lenovo Group in the fall of 2014, some big supercomputing centers in the United States and Europe that were long-time customers of Big Blue had to stop and think about what their future systems would look like and who would supply them. It was not a foregone conclusion that the Xeon-based portion of IBM’s HPC business would just move over to Lenovo as part of the sale.
Quite the opposite, in fact. Many believed that Lenovo could not hold onto its HPC business, and Hewlett Packard Enterprise and Dell were quick to capitalize on the confusion that IBM sowed into the market to boost their own HPC divisions. But, as it turns out, the HPC market grew and Lenovo has experience with hyperscalers and a low cost structure – things that IBM did not have – that have allowed the company to stabilize and then grow its HPC business. Even Lenovo seems a bit surprised by how this business has rebounded so quickly after such a tumultuous change.
“We thought back then that there would be concerns that Lenovo would not be able to carry on the HPC mantle that IBM had, and the good news is that it is going much better than we ever were at IBM,” says Scott Tease, executive director of high performance computing at Lenovo. “Not only are we continuing to win the big high profile deals, such as the startup of the Marconi cluster at Cineca in Italy, which is the largest Omni-Path cluster in the world. So we continue to win on those really big ones that IBM used to win. But what is even better is that we have a lot more success at what we call the run-rate HPC at universities and industry that buy clusters in the $500,000 to $1 million range. We were never as IBM to be cost conscious enough to win these types of deals, but as Lenovo, we are. We still win the big deals, but we have this run rate engine going. We never had that in the past.”
Lenovo doesn’t divulge its financial results from its HPC division, but in early 2015, Tease gave us some insight into the overall systems business at Lenovo and the HPC slice of it. The combined Lenovo and IBM System x businesses generated about $4 billion a year when the deal was done in October 2014 for $2.3 billion. Under that deal, Lenovo got all of IBM’s X86 server business as well as a slice of its storage and, importantly, rights to sell the Spectrum Storage (GPFS) parallel file system and Spectrum Computing (Platform Computing) middleware aimed mostly at HPC shops but increasingly deployed for scale-out infrastructure at large enterprises. Tease estimated that HPC proper drove somewhere between $400 million and $500 million a year for the System x division, with about half coming from built-for-purpose iDataPlex and NextScale systems and the remainder from plain vanilla rack servers. Lenovo also had an existing business with hyperscalers in China that generated on the order of hundreds of millions of dollars per year in revenues.
This is a very good foundation from which Lenovo can build a business that supports any organization operating at scale, and this is in fact why the company shelled out $2.3 billion for those System x assets and people from IBM.
While the HPC business at Lenovo grew in 2015 and continues to grow here in 2016, the character of that business is a little bit different, Tease tells The Next Platform.
“Without a doubt, our customer profile in North America has changed quite a bit. At IBM, we were very dependent on a lot of government contracts,” says Tease. “We don’t win as many of those, although we still do NSF grants and things like that, but we are not doing a lot of business directly with the United States government. So the profile has changed. We used to have six big deals that drove our quarter, now we have 25 smaller deals that drive it. The revenue numbers in North America are about the same, but we are doing it on the back of smaller wins. Europe is the surprising one for me, considering that we are based in Hong Kong and the perception of Chinese ownership, it is in no way a deterrent to selling in Europe. In fact, especially in England, Germany, Spain, and Italy, they almost view Lenovo as a gateway to engage with Chinese customers and university and research partners. The profile of customers has not changed in Europe, but it has grown substantially. North America used to be our biggest region, and now Europe is because of the deals like we have done at LRZ, Max Planck, and Cineca and that run rate engine.”
Overall, the Lenovo HPC business is bigger than it was a year ago and it is bigger than the System x-based HPC business was inside of IBM, he says. How much more, Tease is not at liberty to say.
Carrying The Mantle
The largest system that Lenovo has installed is phase two of the SuperMUC system at Leibniz Rechenzentrum in Munich, which is a “Haswell” Xeon E5 v3 cluster based on the NextScale system design with a total of 86,016 cores. The nodes are lashed together by 56 Gb/sec InfiniBand networks, and SuperMUC currently delivers a peak theoretical performance of 3.58 petaflops and 2.8 petaflops on the Linpack Fortran benchmark test. The Max Planck Institute in Munich has an older iDataPlex system with 65,320 cores based on “Ivy Bridge” Xeon E5 v2 processors that is rated at 1.28 petaflops on the Linpack test. But these were systems that IBM sold. The “Marconi” NextScale system, which is based on Intel Broadwell Xeon E5 processors and was fired up just in time to make the June Top 500 supercomputer rankings, is the first big deal that Lenovo closed as Lenovo.
The Marconi system also has the distinction of being the largest system in the world based on Intel’s Omni-Path follow-on to InfiniBand, although the 180 petaflops “Aurora” system at Argonne National Laboratory that is expected in 2018 will be about six times larger than the Marconi system when it is finally completed years hence. The initial phase of the Marconi system, which was announced in April of this year, has 1,512 nodes with a total of 54,432 Broadwell Xeon cores, a peak double precision performance of 2 petaflops, and a sustained Linpack performance of 1.72 petaflops. By the end of the year, a massive chunk of compute based on Intel’s “Knights Landing” Xeon Phi processors will be added, with a total of 250,000 cores and 11 petaflops peak in this section. By July 2017, a third phase of the Marconi project will add another 7 petaflops of compute, almost certainly based on Intel’s “Skylake” Xeon E5 v5 processors, pushing the peak performance of the Marconi system up to the 20 petaflops range. But Tease says there is an outside chance that it could be based on “Knights Hill” Xeon Phi processors, like the Aurora system at Argonne will be.
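For readers who want to check the Marconi arithmetic, here is a minimal Python sketch that uses only the figures quoted above. The function and variable names are our own, chosen for illustration, and the third phase is labeled generically because its processor has not been confirmed.

```python
# Figures as quoted in the article for the Marconi system at Cineca.
NODES = 1_512
BROADWELL_CORES = 54_432
BROADWELL_LINPACK_PFLOPS = 1.72

# Peak double precision petaflops per phase; the third phase processor
# is not confirmed, so it is labeled generically here.
PHASE_PEAK_PFLOPS = {
    "phase 1 (Broadwell Xeon)": 2.0,
    "phase 2 (Knights Landing Xeon Phi)": 11.0,
    "phase 3 (processor not confirmed)": 7.0,
}


def cores_per_node(cores: int, nodes: int) -> float:
    """Average number of cores in each node."""
    return cores / nodes


def linpack_efficiency(sustained_pflops: float, peak_pflops: float) -> float:
    """Fraction of theoretical peak delivered on the Linpack test."""
    return sustained_pflops / peak_pflops


def total_peak(phases: dict) -> float:
    """Sum of the peak petaflops across all phases."""
    return sum(phases.values())


if __name__ == "__main__":
    print("Cores per node:", cores_per_node(BROADWELL_CORES, NODES))
    phase_one_peak = PHASE_PEAK_PFLOPS["phase 1 (Broadwell Xeon)"]
    efficiency = linpack_efficiency(BROADWELL_LINPACK_PFLOPS, phase_one_peak)
    print(f"Phase 1 Linpack efficiency: {efficiency:.0%}")
    print("Total peak petaflops:", total_peak(PHASE_PEAK_PFLOPS))
```

The numbers work out to 36 cores per node, a Linpack efficiency of 86 percent for the Broadwell phase, and 20 petaflops of aggregate peak across the three phases. The 36 cores per node would be consistent with two 18-core processors in each node, although that node configuration is our inference and not something Lenovo stated.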
Over the longer haul, Cineca will have a follow-on system that will see it push the performance of its flagship system to somewhere between 50 petaflops and 60 petaflops by 2020. The combined investment for Marconi and its follow-on is a mere €50 million – and that is for both phases. You can see Moore’s Law in action – and slowing a bit to about a three year cadence for doubling – in those numbers.
“This is the first time we have ever done anything this large,” says Tease, referring to the scale of the Omni-Path fabric. “We had it up and running from the time we shipped it on site in twelve days, and it came from the factory all built out at the rack level. We had a sixteen node cluster running in our Stuttgart design center when Omni-Path was announced last fall, and we were a bit nervous doing an Omni-Path cluster at this size as the first one. We were a bit worried, but it went in really seamlessly and we really needed very little help from Intel.”
That Omni-Path runs at 100 Gb/sec like EDR InfiniBand gives Intel a place at the negotiating table for HPC clusters, but the unknowns, says Tease, are how Omni-Path is going to be managed. “These are items that have not been proven out yet, whereas with FDR and EDR InfiniBand, it has been running for so many generations now that Mellanox has got that down and it all just works.”
The other thing that Tease says is resonating with customers is moving the processing from compute out into the network, as Mellanox is doing with its current Switch-IB 2 switches and its impending ConnectX-5 adapter cards. “I think this is a really brilliant strategy for Mellanox, to maximize the importance of the network and shift the conversation away from bandwidth. They seem to be putting some muscle behind this, and we will fully support the Mellanox roadmap going forward. We will have discussions with customers about which makes sense, and if they have a strong preference one way or another, we are going to be there.”
The good news for customers is that there is competition for HPC networking at scale. Tease says that about 20 percent of the infrastructure cost for an HPC cluster is for networking (including switches and adapters), and with a dense rack of compute plus networking costing around $350,000, that works out to around $70,000 just for networking. (Storage is in separate file systems.)
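The networking figure above is simple to reproduce. The following Python sketch is only an illustration of the arithmetic: the function name is hypothetical, and the compute remainder is derived from the two quoted numbers, not a figure that Tease supplied.

```python
def split_rack_cost(rack_cost_usd: int, network_percent: int) -> tuple:
    """Split a rack's cost into networking and everything else.

    Uses a whole-number percentage so the arithmetic stays exact
    for round dollar amounts.
    """
    networking = rack_cost_usd * network_percent / 100
    remainder = rack_cost_usd - networking
    return networking, remainder


if __name__ == "__main__":
    # Quoted in the article: roughly $350,000 per dense rack,
    # with about 20 percent of that going to switches and adapters.
    networking, remainder = split_rack_cost(350_000, 20)
    print(f"Networking: ${networking:,.0f}")
    print(f"Compute and the rest (derived): ${remainder:,.0f}")
```

With those inputs, networking comes to $70,000 per rack, which leaves about $280,000 for compute and everything else in the rack.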
“We didn’t really have competition before, but there is competition now,” says Tease. “Intel has got a big sales organization in place for selling processors, and they have a big HPC organization, and they are going to be talking about bundling the network onto the processor and they are going to give Mellanox a run for the money in all of these deals. But Mellanox is also a very good engineering company and they are going to make changes on the fly. To be honest with you, they came up with ConnectX-5 and implemented it in an unbelievably fast time, and it won’t be too long before they are at 200 Gb/sec with HDR InfiniBand.”
As we have previously reported, Mellanox believes that it will be able to offload about 60 percent of the computing required by the MPI protocol to the adapters and switches with the combination of ConnectX-5 and Switch-IB 2, and thinks it can eventually get to 100 percent with future products.
While networking is important to HPC systems, Lenovo has to keep an eye out for alternatives when it comes to compute, too, and also keep an open mind. That means keeping an eye on Power, ARM, and maybe even future AMD Opteron processors. It doesn’t sound much like Power is something that Lenovo will pursue, although it could happen and, ironically, put Lenovo in cooperation and contention with Big Blue with its own Power systems.
“We have a lot of friends back at IBM, who stayed and are doing Power stuff,” says Tease. “But until we see clients demanding that we have to move there, we really don’t see a reason to go there. Lenovo is one of Intel’s largest customers, and there is no reason to move off that position unless we are getting pulled by customers to do it. They are a good partner for us.”
ARM is another story.
“We have two regions that are driving us to look at ARM, and one of them is Europe,” explains Tease. “It is fairly public and they want to have an indigenous processor, and they view ARM as a way to do that. There is a lot of interest in research and a lot of opportunities to do joint partnerships. With the Hartree partnership in the United Kingdom, we built an ARM-based node that fits into our NextScale HPC system. A few customers now have these and we have them running in our benchmarking center. China is another area pushing us. We have a lot of connections with the hyperscalers like Alibaba, Baidu, and Tencent, and they are pushing us very hard to look at ARM because they have a lot of workloads that are not very processor heavy but they are very storage heavy and that is an area where ARM does quite well.”
The issue that ARM faces is not so much hardware as software, although there are some hardware availability issues from both Cavium and Applied Micro, the two dominant suppliers of ARM server chips these days. They are shipping their respective ThunderX and X-Gene products now and are working on next-generation follow-ons, which will offer better performance and performance per watt, but Intel has kept the pedal to the metal with Xeons, too. Tease says that operating systems, compilers, and management tools all need to be optimized further to run better on ARM chips. “We are not seeing the performance per watt benefits that we had hoped to see with ARM at this point, but as the community builds out this ecosystem and the compilers get better, in the long term, I think ARM is going to be a player. It might take us a while to get there. It is exciting, even if it is not just around the corner.”
The one thing that Tease worries about, ironically, is that there will be too many options and the companies that make operating systems, compilers, and other parts of the HPC stack will be too distracted to optimize for all of them.
“Intel has such an advantage over Power and over ARM in that the compiler optimization tools are just not as robust,” Tease says. “I think the industry wants to see another vendor appear on the scene to be an alternative to Intel. The danger we see is that if ARM and OpenPower are competing for this investment, and we have GPUs and soon we might have FPGAs competing for it, and all of these technologies will be vying for time from these compiler companies and it is going to make it more difficult for one technology to emerge. Having OpenPower and ARM as alternatives is a great thing for our industry, but I just hope it doesn’t dilute a true secondary alternative.”
We are more hopeful in that we think what companies want is to be able to code once and deploy on Xeon, Opteron, ARM, or Power as they see fit, and fulfill the promise of open systems from decades ago. This is how Google manages its vast software infrastructure, and that allows it to have leverage and options with CPU suppliers.