System architecture for network-attached FPGAs in the Cloud using partial reconfiguration

Javier

The system architecture for network-attached FPGAs in the cloud using partial reconfiguration is presented in the form of a paper, slides, and a poster.

The slides and the poster share the same abstract:

Emerging applications such as deep neural networks, bioinformatics, or video encoding impose a high computing pressure on the Cloud. Reconfigurable technologies like Field-Programmable Gate Arrays (FPGAs) can handle such compute-intensive workloads in an efficient and performant way.

To seamlessly incorporate FPGAs into existing Cloud environments and leverage their full power efficiency, FPGAs should be directly attached to the data center network and operate independently of power-hungry CPUs. This raises new questions about resource management, application deployment, and network integrity.

We present a system architecture for managing a large number of network-attached FPGAs in an efficient, flexible, and scalable way. To ensure the integrity of the infrastructure, we use partial reconfiguration to separate the non-privileged user logic from the privileged system logic. To create a truly scalable and agile cloud service, the management of all resources builds on the Representational State Transfer (REST) concept.


This seems to be consistent with the section "Acceleration Service: Management of Cloud FPGAs at scale" on IBM’s cloudFPGA webpage:

Today, the prevailing way to incorporate an FPGA into a server is to connect it to the CPU over a high-speed, point-to-point interconnect such as the PCIe bus, and to treat that FPGA resource as a co-processor worker under the control of the server CPU.

However, because of this master–slave programming paradigm, such an FPGA is typically integrated in the cloud only as an option of the primary host compute resource to which it belongs. As a result, bus-attached FPGAs are usually made available in the cloud indirectly via Virtual Machines (VMs) or Containers.

In our deployment, in contrast, a stand-alone, network-attached FPGA can be requested independently of a host via the cloudFPGA Resource Manager (cFRM, see figure). The cFRM provides a RESTful (Representational State Transfer) API (Application Programming Interface) for integration into the Data Center (DC) management stack (e.g., OpenStack).
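To make the idea of requesting an FPGA via a REST API concrete, here is a minimal client-side sketch. The endpoint path, port, authentication header, and JSON fields are all illustrative assumptions, not the documented cFRM API:

```python
import json
import urllib.request

# Hypothetical cFRM address -- a placeholder, not a real endpoint.
CFRM_URL = "http://cfrm.example.dc:8080"

def build_fpga_request(image_id, user_token):
    """Build (but do not send) a POST request asking the cFRM for
    one stand-alone FPGA instance running the given user image.

    All field names below are assumptions for illustration only.
    """
    body = json.dumps({"image_id": image_id, "count": 1}).encode()
    return urllib.request.Request(
        url=f"{CFRM_URL}/instances",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": user_token,  # assumed auth scheme
        },
    )

# Sending it would then be a one-liner:
#   urllib.request.urlopen(build_fpga_request("my-image", "my-token"))
```

The point of the sketch is the workflow: a DC management stack such as OpenStack would issue exactly this kind of HTTP call, with no host CPU owning the FPGA.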

  1. There is one resource manager per DC to control many Sleds. The cFRM handles the user images and maintains a database of FPGA resources.
  2. There is one sled manager for every 32 FPGAs. The cFSM runs on a service processor that is part of the Sled. It powers the FPGAs on and off, monitors the physical parameters of the FPGAs, and runs the software management stack of the Ethernet switch.
  3. There is one cFMC per FPGA. The cFMC contains a simplified HTTP server that provides support for the REST API calls issued by the cFRM.
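The third level, the cFMC's "simplified HTTP server", can be pictured with a short stand-alone sketch. The `/status` route and the JSON fields are hypothetical, chosen only to illustrate the kind of per-FPGA endpoint the cFRM would query:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class FpgaManagementHandler(BaseHTTPRequestHandler):
    """Toy stand-in for a cFMC-style management endpoint.

    The route and response fields are assumptions for illustration;
    the real cFMC protocol is not documented here.
    """

    def do_GET(self):
        if self.path == "/status":
            body = json.dumps({
                "fpga_state": "configured",
                "partial_bitstream_loaded": True,
            }).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

def start_management_server(port):
    """Run the toy management server on a background thread."""
    server = HTTPServer(("127.0.0.1", port), FpgaManagementHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

An embedded server like this is deliberately small: it only has to answer the handful of REST calls the cFRM issues, which keeps the privileged system logic on the FPGA lean.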
Figure: System architecture for the cloudFPGA platform.

In the figure above, we can see that 32 FPGAs, one Ethernet switch, and a service processor are combined on one carrier board, which is called a Sled. A Sled is half of a 2U chassis. The management tasks are split into three levels: the cloudFPGA Resource Manager (cFRM), the cloudFPGA Sled Manager (cFSM), and the cloudFPGA Manager Core (cFMC). CPU nodes from the OpenStack compute service (Nova) are also available for creating heterogeneous clusters.

In the end, the components of all levels work together to provide the requested FPGA resources in a fast and secure way.