Storage Class Memory (SCM) has been proposed as a cost-effective enabling technology for large-scale in-memory compute problems. However, to enable such systems at massive scale, we need a mechanism that keeps very large pools of distributed memory reliable and consistent while still providing low-latency access to remote memory across a network fabric.
I’m proud to announce that today Western Digital Research will help present a breakthrough solution to this problem at the P4 European Workshop (P4EU) in Cambridge, UK. We will demonstrate an in-network, high-performance, fault-tolerant solution for distributed in-memory compute, at latencies that are orders of magnitude lower than those of any existing solution.
High Performance Memory Consistency at Scale
SCM is a new class of emerging memory technology that promises performance close to DRAM, but with higher die capacities and lower cost. This new class of memory will enable in-memory compute problems such as in-memory databases or real-time analytics to be solved at a scale that has not previously been economical. A well-known consequence of designing systems at scale is the emergence of failure mechanisms (e.g. system failure, memory errors) that are less significant at smaller scales. Well-established fault-tolerance mechanisms exist to deal with these problems, but traditional implementations impose a severe performance penalty.
This solution implements fault tolerance inside the network switch, using the emerging P4 network programming language, resulting in an improvement of many orders of magnitude in performance relative to existing solutions. The rich and open P4 ecosystem gives network owners, operators and application developers greater control of the data plane within the network, enabling a new class of distributed system solutions at performance levels well beyond what is conceivable today.
Presenting a New Approach
The live demonstration at P4EU is a P4 program implementation developed by Western Digital and the Università della Svizzera italiana (USI) using a high-performance Barefoot Tofino processor powering a BF6064X switch from STORDIS. USI and Western Digital use the rich, open and empowering P4 programming language to implement a consensus protocol within the Tofino network switch chip, allowing us to retain multiple copies of remote, non-volatile memory and to manage the consistency protocol within the switch itself. The system treats the SCM-based main memory in each server as a distributed storage system and implements data replication across systems, along with a consensus protocol to keep the replicas consistent.
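To make the replication idea concrete, here is a minimal sketch in Python of a coordinator that writes each value to several replicas and treats the write as committed only when a majority acknowledges it. This is a simplified majority-quorum scheme for illustration only: it is not a full consensus protocol and not the P4 program shown at P4EU. The names Replica and Switch, the version counter, and the dictionary standing in for SCM are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One server's SCM region, modeled as a dict of addr -> (version, value)."""
    alive: bool = True
    memory: dict = field(default_factory=dict)

    def write(self, addr, version, value):
        if not self.alive:
            return False  # a failed server sends no acknowledgement
        current_version, _ = self.memory.get(addr, (0, None))
        if version > current_version:
            self.memory[addr] = (version, value)  # keep only the newest write
        return True

    def read(self, addr):
        if not self.alive:
            return None
        return self.memory.get(addr, (0, None))


class Switch:
    """Hypothetical coordinator standing in for the in-switch logic."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.quorum = len(replicas) // 2 + 1  # strict majority
        self.next_version = 1

    def write(self, addr, value):
        # Stamp each write with a fresh version so replicas can order updates.
        version = self.next_version
        self.next_version += 1
        acks = sum(r.write(addr, version, value) for r in self.replicas)
        return acks >= self.quorum  # committed only with a majority of acks

    def read(self, addr):
        replies = [reply for r in self.replicas
                   if (reply := r.read(addr)) is not None]
        if len(replies) < self.quorum:
            raise RuntimeError("quorum unavailable")
        # Return the value carrying the highest version among the replies.
        return max(replies, key=lambda reply: reply[0])[1]
```

With three replicas, this sketch keeps serving reads and writes after one server fails, and refuses to commit once a majority is lost. A real in-switch implementation would also have to handle coordinator failure, message loss and reordering, which this example deliberately leaves out.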
Early implementations such as the one we are presenting at P4EU already operate at time scales shorter than those of traditional replicated storage. We expect an explosion of new system design approaches based on the power of in-network computing, and anticipate that future SCM-based distributed systems will leverage new protocols in the network to enable unprecedented scalability and performance.
Learn More
The P4EU project team consisted of Dejan Vučinić, Director, R&D Engineering, Non-Volatile Memory Systems Architecture Group at Western Digital, Huynh Tu Dang, PhD Candidate at Università della Svizzera italiana in Lugano, Switzerland, and Professors Fernando Pedone and Robert Soulé, both from the Systems Institute of the Università della Svizzera italiana.
To learn more, please read the Consensus for Non-Volatile Main Memory research paper, co-written by Yang Liu, Marjan Radi and Dejan Vučinić of Western Digital, and Jaco Hoffman of Technische Universität Darmstadt in Darmstadt, Germany. This work was performed at Western Digital Research in the Next Generation Platform Technologies department led by Zvonimir Bandić.
For further developments and news from Western Digital Research, follow @WesternDigiCTO on Twitter. We will have much more to share in the coming months.