The debugging tool is a significant milestone in LLNL's multi-year collaboration with the University of Wisconsin-Madison (UW) and the University of New Mexico (UNM) to ensure supercomputers run more efficiently.
STAT, a 2011 R&D 100 Award winner, has played a significant role in scaling up the Sequoia supercomputer, helping both early access users and system integrators quickly isolate a wide range of errors, including particularly perplexing issues that manifested only at extremely large scales of up to 1,179,648 compute cores. During the Sequoia scale-up, bugs in applications, as well as defects in system software and hardware, have surfaced as application failures. Diagnosing such errors quickly is important so that they can be reported to experts who can analyze them in detail and ultimately solve the problem.
"STAT has been indispensable in this capacity, helping the multi-disciplined integration team keep pace with the aggressive system scale-up schedule," said LLNL computer scientist Greg Lee.
"While testing a subsystem of Blue Gene/Q, my test program consistently failed only when scaled to 1,179,648 MPI processes. Although the test program was simple, the sheer scale at which this program ran made debugging efforts highly challenging. But when I applied STAT, it quickly revealed that one particular rank process was consistently stuck in a system call," said Dong Ahn, a computer scientist in Livermore Computing.
Based on this finding, a system expert took a close look at the compute core on which this rank process was running and discovered a hardware defect. "Replacing the component suddenly got the entire Sequoia system back to life," Ahn said. "Putting this exercise into perspective, this error was due to a defect in a tiny hardware unit, the decrementor, of a single hardware thread out of a total of 4.7 million hardware threads. I felt it was like finding a needle in a haystack over a coffee break."
Sequoia delivers 20 petaflops of peak performance and was ranked No. 1 on the TOP500 list in June of this year. It is currently ranked No. 2, behind Oak Ridge National Laboratory's Titan.
LLNL plans to use Sequoia's impressive computational capability to advance understanding of fundamental physics and engineering questions that arise in the National Nuclear Security Administration's (NNSA) program to ensure the safety, security and effectiveness of the United States' nuclear deterrent without testing. Sequoia also will support NNSA/DOE programs at LLNL that focus on nonproliferation, counterterrorism, energy, security, health and climate change.
As LLNL takes delivery of the Sequoia system and works to move it into production, computer scientists will migrate applications that have been running on earlier systems to this newer architecture. This is a period of intense activity for LLNL's application teams as they gain experience with the new hardware and software environment.
"Having a highly effective debugging tool that scales to the full system is vital to the installation and acceptance process for Sequoia. It is critical that our development teams have a comprehensive parallel debugging tool set as they iron out the inevitable issues that come up with running on a new system like Sequoia," said Kim Cupps, leader of the Livermore Computing Division at LLNL.
STAT is particularly important for LLNL because supercomputer simulations are essential in virtually every mission area of the Laboratory. The tool also has been used at other sites and proved to be effective on a wide range of supercomputer platforms, including Linux clusters and Cray systems.
The team is actively pursuing further optimization of STAT technologies and is exploring commercialization strategies. More information about STAT, including a link to the source code, is available on the Web.
More Information
LLNL news release, Nov. 9, 2012
"Venturing into the heart of high-performance computing simulations"
Anne Stark | EurekAlert!