Power AI: Embed an inference engine in the accelerator

Micron Technology | October 2019

Power AI by embedding an inference engine in your accelerator

There’s no question about it. Artificial Intelligence (AI) and Machine Learning (ML) are driving significant changes in how the world consumes and uses data. ML is accelerating scientific discovery in areas such as particle physics, medical research, and robotics. CERN openlab, for example, is at the cutting edge of applying new ML techniques to high-energy physics to help us understand our universe. Fully autonomous vehicles are in the not-too-distant future, and AI/ML is already being deployed in applications ranging from voice-activated assistants to smart manufacturing.

But ML also presents major challenges to conventional compute architectures. To truly harness the power of AI/ML, new compute architectures that are tightly coupled with high-performance, dense memory are required. In this new world, sophisticated ML algorithms must process large amounts of complex data in near real time with both high accuracy and high speed, which requires high memory bandwidth.

Researchers in science, medicine, and industry need a new approach if they want to harness the power of AI/ML. Memory bandwidth has not scaled with the growth in microprocessor core counts, and server and processor elements have hit their clock speed limits. At the same time, today’s data-intensive science applications have become memory bound.

The Advent of Deep Learning Accelerators

Innovations are coming to address these issues. New and intriguing microprocessors designed to accelerate AI applications in hardware are being deployed. Micron has developed its own line of Deep Learning Accelerators (DLAs). The Micron DLA is a combination of hardware and software designed to offer the acceleration and power savings of field programmable gate arrays (FPGAs), tightly coupled with dense high-bandwidth memory, and an ML software development kit (SDK) that abstracts away the underlying hardware so that no FPGA programming (traditionally done in a hardware description language, or HDL) is required.

Micron is working with researchers at CERN openlab to test our DLA, the Micron SB-852, in two projects at the Compact Muon Solenoid (CMS), one of the four main Large Hadron Collider experiments. Micron’s neural-network-based memory solutions will be tested in the data-acquisition systems of the experiment.


The Micron SB-852 deep learning accelerator, PCIe x16 Gen3

High Performance Accelerator with High Performance Memory

The acceleration of FPGAs can be indispensable when vast amounts of data must be processed quickly. The Micron SB-852 accelerator is enabled by a Xilinx® Virtex UltraScale+ FPGA, which provides the bit-crunching muscle to consume massive amounts of scientific, healthcare, or other data. The SB-852 also has up to 512GB of DDR4 memory, which allows researchers to run inference on large data sets locally, without having to partition the data. The four-channel configuration provides up to 68GB/s of memory bandwidth, allowing researchers to analyze data quickly and return insights that can launch discoveries.
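
To put those memory figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (four channels, up to 68GB/s, up to 512GB). The DDR4-2133 speed grade and the 64-bit channel width are our assumptions for illustration and are not stated in this article, and the function names are purely illustrative.

```python
"""Back-of-the-envelope memory arithmetic for the figures quoted above."""

# Figures quoted in the article for the SB-852.
TOTAL_BANDWIDTH_GB_S = 68.0  # "up to 68GB/s"
CHANNELS = 4                 # "four-channel configuration"
CAPACITY_GB = 512.0          # "up to 512GB of DDR4 memory"

# ASSUMPTION, not stated in the article: each channel is DDR4-2133
# (2133 million transfers per second) on a 64-bit (8-byte) data bus.
ASSUMED_TRANSFERS_PER_S = 2133e6
ASSUMED_BUS_BYTES = 8


def per_channel_bandwidth(total_gb_s: float, channels: int) -> float:
    """Split an aggregate bandwidth figure evenly across channels."""
    return total_gb_s / channels


def theoretical_channel_bandwidth(transfers_per_s: float, bus_bytes: int) -> float:
    """Peak bandwidth of one channel in GB/s (1 GB = 1e9 bytes)."""
    return transfers_per_s * bus_bytes / 1e9


def full_sweep_seconds(capacity_gb: float, bandwidth_gb_s: float) -> float:
    """Lower bound on the time needed to read the whole memory once."""
    return capacity_gb / bandwidth_gb_s


if __name__ == "__main__":
    quoted = per_channel_bandwidth(TOTAL_BANDWIDTH_GB_S, CHANNELS)
    assumed = theoretical_channel_bandwidth(ASSUMED_TRANSFERS_PER_S, ASSUMED_BUS_BYTES)
    sweep = full_sweep_seconds(CAPACITY_GB, TOTAL_BANDWIDTH_GB_S)
    print(f"Quoted bandwidth per channel:  {quoted:.1f} GB/s")
    print(f"Assumed DDR4-2133 channel peak: {assumed:.3f} GB/s")
    print(f"One full pass over {CAPACITY_GB:.0f} GB: {sweep:.1f} s at peak bandwidth")
```

Under these assumptions, the quoted 68GB/s works out to 17GB/s per channel, which is in line with the theoretical peak of a DDR4-2133 channel, and a single pass over a fully populated 512GB board would take roughly 7.5 seconds at peak bandwidth. Real workloads will see lower sustained figures.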


The FWDNXT inference engine works with major deep learning platforms

Pre-loaded Inference Engine for Flexible ML

You may ask: is an inference engine really built into Micron’s DLA? Yes. The FPGA comes pre-programmed with an innovative ML inference engine from FWDNXT, which supports multiple types of neural networks (CNN, RNN, LSTM). With the FWDNXT ML SDK, programming the FPGA is as simple as programming in Python or C++. The SDK takes care of the rest, making easy work of accelerating a neural network. Among the many benefits: low power and high performance accrue not only from the FPGA, but also from FWDNXT’s innovative inference engine, which achieves computational efficiencies nearing 100% on neural network models.

More specifically, the ML SDK supports the major ML frameworks, allowing data scientists to train their neural network in the framework of their choice (TensorFlow, PyTorch, Caffe2, etc.) and then export that network to ONNX (Open Neural Network Exchange), an open standard format for exchanging neural network models. With the SDK, they then compile that ONNX file down to machine code that runs on the pre-loaded inference engine. By changing just a couple of lines of code, researchers can target the Micron accelerator much as they would target a GPU.

The Future is Accelerated

Micron’s DLA family includes a range of accelerator boards and modules, as well as PCIe carrier boards that accommodate up to six modules. Also available are boards that support both PCIe and QSFP interfaces. The low power consumption and small form factor of many of these products enable efficient, fast machine learning everywhere from the data center to smart devices at the edge of the network.

Learn more at and follow us for updates on @MicronTech.