Embracing New Physics: How One R&D Team Pushed the Boundaries of Large Scale Computing to Design Future Hard Drives
For those of you following High Performance Computing (HPC) in the cloud, you have likely heard about our record-breaking collaboration with AWS™ to run 2.5 million HPC tasks on 1 million vCPUs using 40k EC2™ Spot Instances in 8 hours. I want to share a little more insight into this achievement: what prompted this endeavor from our R&D team, and how our engineers, business stakeholders, IT teams and partners came together to find new and better ways to advance data and data infrastructure.
The R&D of HDDs
Hard disk drives (HDDs) have powered data centers and personal computing devices for decades. We often hear that the ‘end of HDDs’ is coming. But the fact is that while SSD adoption and use cases are growing, high-capacity HDDs will continue to be the mainstay of data centers for years to come as the go-to choice for satisfying our explosive data demand.
For us on the R&D end, we keep up with this growing demand by continuing to evolve HDD technology. Data center HDDs plug into industry standard slots, so the physical size of a drive is fixed. To increase capacity per drive, we can increase both the platter count and the areal recording density (the product of linear and track densities). Western Digital’s pioneering work manufacturing helium-filled HDDs has enabled higher platter counts as well as a higher density of tracks on the platters.
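To make the "product of linear and track densities" concrete, here is a minimal sketch of the arithmetic. The density values are hypothetical round numbers chosen only for illustration; they are not measurements from any shipping or planned drive.

```python
# Areal density is the product of linear density (bits per inch along a track)
# and track density (tracks per inch across the platter).

def areal_density_bits_per_sq_inch(linear_bpi: float, track_tpi: float) -> float:
    """Return areal density in bits per square inch."""
    return linear_bpi * track_tpi

# Hypothetical, illustrative inputs (assumptions, not real drive specifications).
example_linear_bpi = 2_000_000  # bits per inch along a track
example_track_tpi = 1_000_000   # tracks per inch

areal = areal_density_bits_per_sq_inch(example_linear_bpi, example_track_tpi)
areal_tbpsi = areal / 1e12      # convert to terabits per square inch (Tbpsi)

print(f"{areal_tbpsi:.1f} Tbpsi")
```

Because the two densities multiply, a gain in either one raises capacity per platter, which is why both are worth pursuing.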
To push current HDD capacities toward 20TB, 40TB and beyond, we also need to increase linear density. This kind of challenge is not new for the HDD industry; we have a history of expanding the boundaries of applied physics.
Embracing New Physics Through Simulation
With hundreds of millions of HDDs shipped a year, we often take their underlying technology for granted. Yet there are over 2,000 mechanical parts in each HDD. Consequently, advancing HDD technology requires us to overcome inherent challenges of physics, material science and nano-fabrication. The other challenge in HDD technology design is that next-generation devices not only need to be manufactured at extremely high volumes, but also need to be designed and manufactured to continue to deliver a reasonable cost per terabyte.
As we look at how to advance HDD technologies, our teams are experimenting with new materials, physical fields such as quantum tunneling, near-field transducers, and the tribology of small spaces. There are thousands, if not millions, of variables we want to test in order to find the most advantageous combination of characteristics, layers, or chemical compositions.
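To give a feel for how quickly combinations multiply, here is a minimal sketch that enumerates every combination in a small parameter grid. The parameter names and values are hypothetical placeholders, not the variables our teams actually study.

```python
from itertools import product

# Hypothetical parameter names and values, for illustration only.
parameter_grid = {
    "material": ["material_A", "material_B", "material_C"],
    "energy_level": [0.8, 1.0, 1.2],
    "layer_count": [2, 3],
}

def enumerate_tasks(grid):
    """Yield one dict per combination of parameter values (a full grid sweep)."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

tasks = list(enumerate_tasks(parameter_grid))
print(len(tasks))  # 3 materials * 3 energy levels * 2 layer counts = 18
```

Each added variable multiplies the number of combinations, so even a modest number of options per variable produces a very large set of independent simulation tasks.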
Energy-assisted recording is one of the critical technologies for driving greater capacities. Yet unlike the software world, where code can be executed and tested immediately, the production of complex hardware designs, such as recording heads, can take several months. This means we have to get it right the first time, so modeling and simulation are vital tools.
Extreme HPC in the Cloud
The 2.5 million HPC tasks we ran on our AWS cloud cluster simulated combinations of materials, energy levels, and other parameters for energy-assisted recording technology. You can see a visualization of the simulation process below – the top stripe represents the magnetism; the middle one represents the added energy (in this case heat); and the bottom one represents the actual data written to the medium via the combination of magnetism and heat:
Running such an extreme HPC workload is not a straightforward task, and it’s never been done before. It required a massive coordinated effort between application engineers, IT infrastructure engineers, and our partners at AWS and Univa.
On the application end, we had to ensure that all individual simulations were carried out correctly. On the infrastructure end, we needed to coordinate jobs across a vast number of servers and cores and bring all the data back to collate. Our simulations had already been ported to containers, which were required for a cluster of this scale. Even so, it took weeks of preparation, and there were many challenges at this scale of concurrent connections, instances, tasks and availability zones. Jeff Barr shared some great detail on the solutions we found as well as insight into the simulation task itself on his blog.
Internally, too, we had to get many different stakeholders on board to move forward with this – our R&D teams, our IT department and business decision makers. We had to prove this could be accomplished and that it would deliver results that could impact the bottom line.
Technological and Competitive Advantage Through IT Innovation
Our ability to cut the time of simulation from 20-30 days to just 8 hours is a game changer, as we can explore a huge parameter space very quickly. The understanding gained in our “million core run” confirmed that areal densities of 2 Tbpsi (over double the density of currently shipping HDDs) are possible and outlined the design space needed to achieve this. This matters not only for our R&D capabilities, but also for our IT teams’ ability to drive business competitiveness, since we use HPC in many ways across our organization.
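For a rough sense of scale, here is a back-of-the-envelope sketch using only the figures quoted in this post. The per-task number is an upper bound that assumes all 1 million vCPUs were busy for the full 8 hours, which is a simplifying assumption rather than a measured utilization.

```python
HOURS_PER_DAY = 24

# Figures quoted in this post.
baseline_days = (20, 30)      # previous simulation time, low and high estimates
cloud_run_hours = 8           # duration of the cloud run
task_count = 2_500_000        # HPC tasks
vcpu_count = 1_000_000        # vCPUs

# Speedup range: previous wall-clock hours divided by cloud wall-clock hours.
speedups = tuple(days * HOURS_PER_DAY / cloud_run_hours for days in baseline_days)

# Upper bound on compute consumed, ASSUMING every vCPU ran for all 8 hours.
max_vcpu_hours = vcpu_count * cloud_run_hours
max_vcpu_hours_per_task = max_vcpu_hours / task_count

print(f"Speedup: {speedups[0]:.0f}x to {speedups[1]:.0f}x")
print(f"At most {max_vcpu_hours_per_task:.1f} vCPU-hours per task on average")
```

In other words, the turnaround improved by roughly 60 to 90 times, and under the full-utilization assumption each task averaged at most about 3.2 vCPU-hours.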
The success of this simulation showed us what we can do with cloud agility to power strategic technology development. It’s helping to pave the way for new ways of thinking about and approaching technology innovation, storage architecture analysis and materials science explorations. It also raises the expectation that we can make critical decisions faster. We’re smarter, we know where to invest more, and we can further push the limits of physics and science to bring to market innovations that drive data forward.
Learn more about innovations at Western Digital.