Promoting FPGAs to become 1st class-citizens in datacenters

Javier

This presentation:

[Slide deck: Promoting FPGAs to become 1st class-citizens in datacenters, 3.05 MB, IBM]

was shown at the IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM). Here is a summary of it:

—Goal: Deploy FPGAs at large scale in hyperscale DCs, i.e. 1'000-10'000 FPGAs per DC

—cloudFPGA was developed to fulfill the following requirements:
  • Server commodity and homogeneity
  • Decrease in cost and power
  • Easy to manage and to deploy
  • On-demand acceleration
  • High utilisation + workload migration
  • Security, virtualisation, orchestration
  • Hybrid = public and private
  • Flexible = IaaS, PaaS, FaaS
  • Clusters = # accelerators per server
  • Community = open for support apps and developers
But in the end, everything can be summarized as (everything is fully driven by) the performance-per-cost metric = $$ (it is a scale game).

—XILINX/INTEL solutions will never end up in datacenters... Why is that? A 4U unit holds only 8 FPGAs, so 1'000 FPGAs cost you about 12 racks. At the prices XILINX/INTEL ask for high-end FPGAs, this translates to 5-10 million $.
[Slide: 4U Rack with high-end FPGAs]
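The rack arithmetic above can be reproduced with a quick back-of-the-envelope sketch. The densities are the figures quoted in the talk; the per-FPGA price is a placeholder I chose so that 1'000 FPGAs land in the 5-10 M$ range mentioned, not a vendor quote:

```python
import math

# Figures quoted in the talk:
FPGAS_PER_4U = 8          # one 4U chassis hosts 8 high-end FPGAs
RACK_UNITS = 42           # assumed standard rack height (not stated in the talk)
PRICE_PER_FPGA = 7_500    # placeholder $ price, consistent with "5-10 M$ for 1'000"

def racks_needed(n_fpgas: int) -> int:
    """Racks needed to host n_fpgas using 4U chassis with 8 FPGAs each."""
    chassis = math.ceil(n_fpgas / FPGAS_PER_4U)   # 4U units required
    units = chassis * 4                           # rack units consumed
    return math.ceil(units / RACK_UNITS)

print(racks_needed(1_000))                 # -> 12 racks, as in the talk
print(1_000 * PRICE_PER_FPGA / 1e6)        # -> 7.5 (M$), within the quoted range
```

This makes the talk's point concrete: at 8 FPGAs per 4U, even a modest 1'000-FPGA deployment consumes a dozen racks, which is why density is the central design driver.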

—cloudFPGA summary:
  • FPGAs are the compute nodes in the DC ==> we need to make FPGAs stand-alone by disaggregating them from the CPU: power them on when we need them, power them off when we don’t, etc.
  • To get there, we built a network attachment that provides each FPGA with a hardware Ethernet and TCP/IP stack
  • We put a lot of effort into the hyperscale concept:
    • As we said, this is what matters: "dollars and performance per dollar"
    • So what makes this project different from others is that we focus on hyperscaling mid-range FPGAs instead of high-end FPGAs
    • High-end FPGAs are overpriced. Mid-range FPGAs are only 30% less powerful than high-end FPGAs, but they cost 10 times less.
    • Summary: a hyperscale infrastructure in the sense of cost, energy, density, and scalability, built on mid-range FPGAs
—cloudFPGA is a unique solution: 99% of FPGA-based architectures are CPU-Centric, using FPGAs as dumb co-processors; our solution is FPGA-Centric, using FPGAs as peer processors:
[Slide: FPGA-Centric Deployment with cloudFPGA]

—FPGAs can talk to the CPU or directly to each other, forming clusters of FPGAs

We had this vision: try to come up with the densest platform we can think of:
  • 2U modules with 64 FPGAs
  • 16 2U modules in a rack means 16 × 64 = 1024 FPGAs
—This hardware leads us to the FPGA computing cloud vision:
[Slide: FPGA-Centric Cloud Vision]

(which is divided into four columns)

  1. IaaS (Infrastructure as a Service): the story where you come with your code and have to do everything by yourself; we just provide you with the FPGA and, optionally, our cFDK to develop with, either downloaded to your desktop or provided in a VM. At the end you deploy your FPGA. "This case is very similar to AWS, but your FPGA is now network-attached." This was NOT THE MAIN GOAL of cloudFPGA (i.e. FPGAs for renting), but we can also do that. This is our hello world.
  2. PaaS-1: say you have a company doing genomics, "genome.com", and you have developed your kernel. For this kernel you need, for example, 100 FPGAs. Since they are network-attached, this is very easy and very cheap, as you don’t need any VMs: you only need one or two containers to distribute the load, and that’s it. You pay for these cheap FPGAs and not for a hundred VMs.
  3. PaaS-2: you want to offer pre-trained models. The customer wants to use them but does not want to train them; they are only interested in the inference. You can do smart cities, video, general processing, etc. This was the use case that motivated the project. What happens if you have a model so huge that it does not fit on a single FPGA? We should be able to map it onto multiple FPGAs: since our FPGAs can talk to each other, traffic just flows from one FPGA to the next.
  4. FaaS: with so many FPGAs in your cloud, you can dream up and do a lot of things. What is very much in vogue at the moment is everything related to cloud functions and serverless computing. You use FPGAs as powerful processors, but you don’t even know you are talking to an FPGA.
—It was also important to offer FPGAs in exactly the same way traditional Bare Metal/VM/container servers are offered ("with a GUI", i.e. a dashboard!).

—It is no different from deploying a server!

—The key enabler of everything is the network attachment: (1) it replaces the PCIe I/F with an iNIC to (2) make the FPGA a stand-alone resource. For that, a tiny ARM is also required (always on, to boot the FPGA, power it on/off, program the IP, do monitoring, etc.). (3) It also replaces the very expensive transceivers and cables connecting the FPGA to a custom backend.

—The first node is a KINTEX. Attention: an HBM prototype is shown!
[Slide: FPGA Nodes]

—They also needed to design a switch, which is the one exposed to the datacenter. They took the design of a huge top-of-rack INTEL switch and squeezed it into a small one (the size of an iPhone, a 20x reduction).

—The whole concept also relies on a custom water-cooling system. The original switch would need air dissipation and fans, which could not be integrated into such a dense chassis. Thanks to the water cooling we can cool "that ASIC", which is around 200 W (I guess this refers only to the switch).

—cloudFPGA is a proprietary platform; Facebook and AWS also have their own platforms, but those are not network-attached!! ==> $$$$$$ and not power-efficient!

—The Shell-and-Role architecture is not very different from what everybody else is doing.
—The Shell is there to abstract the hardware and make it easy for users to deploy their applications.
—Maybe one difference is that:
  1. the Shell is static and comes out of the FLASH when the FPGA is started (powered on)
  2. the Role is programmed via partial reconfiguration, and the bitfile is received through the 10 GbE network
—A partial bitstream is produced by the build process, and then "it is stored in the DC and we are gonna deploy it for you, very simple".
[Slide: SRA]

—That is very similar to Vitis platforms:
[Slide: cloudFPGA vs. Vitis Platforms]

[...] but Vitis relies on the PCIe bus (which has all the inconveniences mentioned at the beginning). We came up with this:
[Slide: FPGA Management Core]

Important: POST /routing sends the routing information of a cluster to the FPGA.

—As stated before, there is full equivalence between CPU and FPGA deployment:
[Slide: CPU Deployment]

[...] where VM images means e.g. "Fedora", "Ubuntu", etc. "If you want a cluster of them, you have to do it 10 times."

There is "no difference" in our case for FPGAs:
[Slide: FPGA Deployment]
  • A resource manager that knows which FPGAs are available, i.e. the equivalent of a pool of compute resources is a pool of FPGA resources
  • A database of FPGA bitstreams ("for us, the bitstream is the image")
  • A management system: the DCRM talks to an SM to power on the FPGA(s) (if they are powered off) and deploy the image.
—Everything is based on a RESTful API

—Example 1: if you have a cluster of ten FPGAs, you use POST /images 10 times to deliver 10 images if the kernel for each FPGA is different (or just once if your cluster uses the same kernel); then POST /clusters to request a cluster, and we return the IP addresses of all 10 FPGAs; then, if you do POST /instances, we immediately deploy the bitfiles that "you gave us" on the 10 FPGAs for you, "very simple"
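The workflow of Example 1 can be sketched as client-side pseudo-code. The endpoint names follow the POST /images, POST /clusters, and POST /instances calls mentioned in the talk; the base URL, payload fields, and the helper function itself are illustrative assumptions, not the actual cloudFPGA API:

```python
# Sketch of the Example-1 deployment workflow. Endpoint names come from the
# talk (POST /images, /clusters, /instances); BASE_URL, payload fields, and
# build_deployment_requests() are hypothetical, for illustration only.

BASE_URL = "http://cloudfpga.example.com/api"   # hypothetical base URL

def build_deployment_requests(bitfiles, same_kernel=False):
    """Return the ordered list of REST calls needed to deploy a cluster.

    bitfiles: one partial bitstream per FPGA in the cluster.
    same_kernel: if True, every FPGA runs the same kernel, so a single
    POST /images suffices (as the talk points out).
    """
    calls = []
    uploads = bitfiles[:1] if same_kernel else bitfiles
    for bf in uploads:                                    # 1) upload image(s)
        calls.append(("POST", f"{BASE_URL}/images", {"bitfile": bf}))
    calls.append(("POST", f"{BASE_URL}/clusters",         # 2) request cluster;
                  {"size": len(bitfiles)}))               #    returns the 10 IPs
    calls.append(("POST", f"{BASE_URL}/instances",        # 3) deploy bitfiles
                  {"size": len(bitfiles)}))
    return calls

# Ten FPGAs, each with its own kernel: 10 uploads + cluster + instances = 12 calls
print(len(build_deployment_requests([f"kernel_{i}.bit" for i in range(10)])))  # -> 12

# Ten FPGAs sharing one kernel: 1 upload + cluster + instances = 3 calls
print(len(build_deployment_requests(["kernel.bit"] * 10, same_kernel=True)))   # -> 3
```

The point of the sketch is the shape of the flow, and how sharing one kernel across the cluster collapses ten image uploads into one.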

How-To cloudFPGA @ ZYC2 (on-premise cloud in Zurich)

—Example 2: Hello World!
[Slide: Hello World! with a single FPGA]
—Example 3: "Stencil comp" with 1 HOST + 8 FPGAs
[Slide: Stencil comp. with 1 HOST and 8 FPGAs]

“You don’t need to go to this GUI, you can script everything” (I guess this refers to the following code and to the option of automating things):
[Slide: Script]

"Image Id to be deployed" (I guess in this case it is the same for all the FPGAs).

Future work
[Slide: Future Work (I)]
  1. Open-source the cFDK
  2. Walking up the application stack. For example, they believe it will be easy to support Vitis on cloudFPGA.
  3. Walking up the systems stack (what François likes the most):
    1. Focus on Function as a Service (FaaS, i.e. serverless and microservice computing, the rightmost use case in the picture above).
    2. Composable and disaggregated storage (NVMe-oF). "We are almost there with NVMe-oF. People can experiment with that."
[Slide: Future Work (II)]

4. The "pre-studied" HBM module... maybe also something with Versal! François says that there is a lot of pre- and post calculations before and after the PL logic for CNN/DNN that you can do on a MPSoC hard processor (like in ZYNQ) and that is not possible on the KINTEX node ==> means that this is not an option to use a VM to do those kind of operations and then talk with the FPGA ==> HBM Node with an ARM board made by us?

6. We are not married to XILINX; using INTEL is also possible

7. We could even consider sharing the platform as an Open Compute Platform.
[Slide: SUMMARY]