Creating Balance in HPC on the Piz Daint Supercomputer

Datetime: 2016-08-23 05:27:33          Topic: HPC

In this special guest feature, Robert Roe from Scientific Computing World investigates the motivation behind the architectural changes to Europe's fastest supercomputer, Piz Daint, housed at the Swiss National Supercomputing Centre.

Piz Daint Supercomputer at CSCS

Piz Daint, the flagship supercomputer at the Swiss National Supercomputing Centre (CSCS), is named after a mountain in the Alps. It currently delivers a peak performance of 7.8 petaflops, or 7.8 quadrillion floating-point operations per second. A recently announced upgrade will double that peak performance by refreshing the system with the latest Intel Xeon CPUs and 4,500 Nvidia Tesla P100 GPUs.

Thomas Schulthess, professor of computational physics at ETH Zurich and director of the Swiss National Supercomputing Centre, said: "We will put both systems into a single fabric. It will be one fabric with two different node architectures and we will have DataWarp nodes as well."

During the upgrade, the CPUs and accelerators will be replaced and the system will be combined with the Piz Dora supercomputer, also housed at the Swiss centre, to create a single, unified HPC system containing both CPU/GPU nodes and CPU-only nodes.

This upgrade is key to the future of supercomputing at the Swiss centre, both for high-resolution simulations and for data science, which requires the analysis of enormous volumes of data.

Today, materials science, geophysics, life sciences and climate sciences all use data- and CPU-intensive simulations. With the new hardware, researchers will be able to perform these simulations more efficiently.

Creating a balanced infrastructure

However, the planned upgrade is not purely about increasing performance. As at many HPC centres, CSCS runs many data-intensive applications and, as these continue to scale, they face growing challenges around memory bandwidth.

Thomas Schulthess, Director of the Swiss National Supercomputing Centre

Schulthess explained the reasoning behind the Piz Daint upgrade: "Currently on Piz Daint we have a bottleneck between the GPU and the CPU; that is one of the motivations for changing the configuration of the node. All of our climate codes and the seismic codes are bandwidth bound; we have many applications that are bandwidth bound."
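One common way to reason about what "bandwidth bound" means is the roofline model: a kernel's attainable performance is capped either by the processor's peak compute rate or by its arithmetic intensity (flops per byte moved) multiplied by memory bandwidth, whichever is lower. The sketch below is a minimal illustration of that model only; the device figures and the stencil kernel's flop and byte counts are hypothetical round numbers, not measured Piz Daint or CSCS values.

```python
def attainable_gflops(intensity_flop_per_byte: float,
                      peak_gflops: float,
                      bandwidth_gb_per_s: float) -> float:
    """Roofline bound: performance is limited by compute or by memory traffic."""
    return min(peak_gflops, intensity_flop_per_byte * bandwidth_gb_per_s)


def ridge_point(peak_gflops: float, bandwidth_gb_per_s: float) -> float:
    """Arithmetic intensity (flop/byte) above which a kernel becomes compute bound."""
    return peak_gflops / bandwidth_gb_per_s


def is_bandwidth_bound(intensity_flop_per_byte: float,
                       peak_gflops: float,
                       bandwidth_gb_per_s: float) -> bool:
    """True when the kernel sits on the sloped (memory) part of the roofline."""
    return intensity_flop_per_byte < ridge_point(peak_gflops, bandwidth_gb_per_s)


# Hypothetical device figures, for illustration only.
PEAK_GFLOPS = 4000.0
BANDWIDTH_GB_PER_S = 500.0

# Hypothetical stencil-like kernel: ~8 flops per grid point while moving
# ~16 bytes per point (one double read and one double written).
stencil_intensity = 8 / 16  # 0.5 flop/byte

if __name__ == "__main__":
    perf = attainable_gflops(stencil_intensity, PEAK_GFLOPS, BANDWIDTH_GB_PER_S)
    print(f"ridge point: {ridge_point(PEAK_GFLOPS, BANDWIDTH_GB_PER_S)} flop/byte")
    print(f"attainable: {perf} GFLOP/s of {PEAK_GFLOPS} peak")
    bound = is_bandwidth_bound(stencil_intensity, PEAK_GFLOPS, BANDWIDTH_GB_PER_S)
    print("bandwidth bound" if bound else "compute bound")
```

For a kernel below the ridge point, raising the bandwidth figure raises attainable performance in direct proportion, while raising peak flops alone changes nothing. That is why faster memory and links, rather than just more compute, matter for bandwidth-bound codes.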

To solve this challenge, Schulthess and his colleagues decided to add high-bandwidth memory (HBM) and introduce Cray's DataWarp technology. DataWarp is an I/O accelerator that uses SSDs to increase storage performance, reducing the bottlenecks associated with the system's most data-intensive applications.

"Although slightly reduced in physical size, Piz Daint will become considerably more powerful as a result of the upgrade, particularly because we will be able to increase bandwidth significantly in the most important areas," said Schulthess. "Piz Daint will remain an energy-efficient, balanced system, but will now offer increased flexibility."

Alongside the new Nvidia GPUs and Intel CPUs, Cray's DataWarp technology and, crucially, HBM, the upgrade also moves the system from PCIe Gen 2 to PCIe Gen 3. Both of these improvements, HBM and the faster PCIe links, directly address bandwidth bottlenecks, providing a more balanced system for future users.

"The system will be more balanced in the future because on the K20X and Sandy Bridge they were talking to each other with PCIe Gen 2, but now they will talk on Gen 3," said Schulthess. "If you imagine that the application problem is spread across the GPU memory of many nodes, now the GPUs talk all the way to the network interface circuit using PCIe Gen 3."
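The gap between the two PCIe generations can be put into numbers using the published per-lane parameters: Gen 2 signals at 5 GT/s with 8b/10b encoding, and Gen 3 at 8 GT/s with the more efficient 128b/130b encoding. The minimal sketch below computes theoretical one-direction payload bandwidth; it assumes an x16 link by default and ignores packet and protocol overhead, so real-world transfer rates will be lower.

```python
# Per-lane signalling rate (transfers/s) and line-encoding efficiency.
PCIE_LANE_PARAMS = {
    2: (5.0e9, 8 / 10),      # PCIe Gen 2: 5 GT/s, 8b/10b encoding
    3: (8.0e9, 128 / 130),   # PCIe Gen 3: 8 GT/s, 128b/130b encoding
}


def pcie_bandwidth_gb_per_s(gen: int, lanes: int = 16) -> float:
    """Theoretical payload bandwidth in GB/s, one direction, before protocol overhead."""
    transfers_per_s, encoding_efficiency = PCIE_LANE_PARAMS[gen]
    payload_bits_per_s = transfers_per_s * encoding_efficiency * lanes
    return payload_bits_per_s / 8 / 1e9  # bits -> bytes -> GB


if __name__ == "__main__":
    gen2 = pcie_bandwidth_gb_per_s(2)
    gen3 = pcie_bandwidth_gb_per_s(3)
    print(f"Gen 2 x16: {gen2:.2f} GB/s")
    print(f"Gen 3 x16: {gen3:.2f} GB/s")
    print(f"speed-up: {gen3 / gen2:.2f}x")
```

On paper, then, the move roughly doubles the bandwidth of each x16 link, from about 8 GB/s to about 15.75 GB/s per direction.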

The combination of the new memory, increased performance and increased data bandwidth will help CSCS users run larger simulations in less time. Beyond raw performance, the centre will also use the increased bandwidth and the DataWarp technology to focus on data analytics.

DataWarp features a "Burst Buffer" mode that can quadruple the effective bandwidth to and from long-term storage. This allows the system to move data between storage and compute resources much faster. It therefore enables the Swiss centre to accelerate data analytics and data science projects that require the analysis of millions of small, unstructured files.

Implementing a platform for application development

While GPUs have been steadily increasing in popularity over the last five to 10 years, Schulthess stressed that the decision to adopt new technology was not taken lightly, as it requires a considerable amount of application development to rework or optimize codes so they can be used efficiently on accelerators, such as GPUs.

The original introduction of GPUs at CSCS came in 2013 after a “multiyear study of applications and different node configurations,” said Schulthess.

Schulthess explained that the decision was made because it was the best compromise between providing CPU/GPU performance and increased memory bandwidth. However, he also said another driving force behind the upgrade was motivating application developers to do the right things with their applications: "I think that is one of our key values in Switzerland: we invest heavily in application development."

He stressed that while hardware often steals the headlines, the most important aspect of HPC is application development. As early as 2009, while the Swiss centre was still using its previous supercomputer, Monte Rosa, Schulthess and his colleagues recognised the importance of application development, investing heavily to optimize codes for multicore architectures and then for GPU-accelerated architectures as they were introduced.

While the limitations of transistor technology and energy efficiency will continue to challenge HPC developers in the coming years, they are not the only major challenges facing HPC users. The rise of accelerators and increases in memory bandwidth, which enable a convergence of data-intensive applications and HPC, will test the limits of traditional HPC technology.

More and more users will require data-intensive compute infrastructure that can keep pace with their demands. Although the technology to meet these challenges is in place, using it well requires the community not only to spend money on hardware but also to invest time in updating applications and workflows to make the best use of the data being created.

"I see Europe's opportunity in this whole business in software and application development," affirmed Schulthess. "In Switzerland, Germany and Britain we have a lot of experience in software development and I think that is where Europe should invest. We can stay ahead of the game there and we are doing it already in Switzerland."

The funding for the upgrade comes from ETH Zurich, which invested 40 million Swiss francs (£27 million). The centre was awarded the funding by the ETH Board as part of its dispatch on the promotion of education, research and innovation (ERI Dispatch). The upgrade is expected to be completed in the last quarter of 2016.

This story appears here as part of a cross-publishing agreement with Scientific Computing World.
