
Building a Rack-Scale Computer with P4 at the Core: Challenges, Solutions, and Practices in Engineering Systems on Programmable Network Processors


Summary

Ryan Goodfellow discusses lessons learned and open source tooling developed while delivering a product on top of the Tofino 2 switch processor.

Bio

Ryan Goodfellow is an engineer at the Oxide Computer Company, where he works on many aspects of networking, including routing protocol implementations, network programming language compilers, and drivers. Prior to joining Oxide, Ryan worked as a computer scientist at the USC Information Sciences Institute, developing network testbeds for networking research and education.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Goodfellow: First off, I want to pose a few questions. What are programmable network processors? Are they potentially useful to you? If so, what's it like to actually program them and engineer systems around them? These are the questions that I'd like you to have at least partial answers to after watching my talk. I work at a company called Oxide Computer Company. The picture that you're looking at on my title slide here is a picture of our computer. It's a rack-scale computer that we've designed from the ground up, holistically: not building a bunch of computers and putting them into a rack, but designing an entire rack together. At the core of that, the networks that bind everything together inside of that rack are built on programmable processors. I'm going to be talking about some of the challenges in working with programmable processors, some of the solutions that we've come up with, and just trying to give you a sense for: what's the level of effort involved? Is this a path that's going to be interesting for me? Can I leverage this in my own infrastructure or my own code?

Fixed Function Network Hardware

First, I'm going to start out with what programmable processors are not. I'm going to use the term network processing unit, or the acronym NPU, quite a bit. That's what I'm going to use to describe these, both fixed function processors and programmable processors. A fixed function NPU comes with a whole lot of functionality. These come from vendors like Broadcom, from Mellanox, from Cisco, and they're really great. They come with tons of features. You can just buy one and build your network on top of it. They come with very much a batteries-included approach. When you buy a switch from Cisco, or Arista, or Mellanox, it's going to come with an operating system that has drivers for all the hardware that you need. It's going to have implementations of all the network protocols that need to exist on top of that piece of networking hardware for you to be able to build around it. It's a pretty wonderful thing. If you look at the programming manuals for these NPUs, they're massive. They're up to 1000 pages long, some of them from Broadcom, I believe, these days. The reason for that is the level of investment that it requires to actually make one of these networking chips: we're talking in the neighborhood of $100 million to go from no networking chip to a full-blown networking chip that can run at terabit speeds with all the protocols that we expect to be running today. When they design these chips, they try to pack as much functionality as they possibly can so they can get the largest cross section of audience for people that need to actually have this level of performance in their network with all the protocols that they need to have. That clearly comes with a tradeoff. None of these features come for free. You have to spin silicon on absolutely everything that you're going to implement on the chip. If tunneling or routing is really important to you, and something like access control lists, or NAT, is not super important to you, you might not be able to use that full chip, because some of that silicon is allocated to the other features that are on that chip. It's really a tradeoff here.

Programmable Network Hardware

What does this look like for programmable NPUs? When we look at a programmable NPU, it's a blank slate. That's in the name, it's programmable. We write the code that's actually going to run on this network processing unit. That has a lot of up-stack ramifications, because we're defining the foundation of the network, how every single packet that is going through this network processing unit is actually going to be treated. That means everything up stack, from the drivers that are in the operating system to all of the protocol daemons and the management daemons that are up in user space in the operating system, all of that we have to participate in. Some of it, we have to write the complete thing. We might have to write drivers from scratch. We might get a software development kit from the vendor that has produced this network processing unit where we can reuse some of that. Maybe they have a Linux reference implementation and we can build on that for the operating system we're working with, or directly use it in Linux with the modifications that we need. There's a ton of work that needs to be done to actually be able to leverage these effectively, to integrate them with the operating system's networking stack, and make sure all the protocol daemons that we need to function are actually functioning.

Why Take on All the Extra Work?

This begs the question, why take on all the extra work? When we implement whatever network functions are important to us, those functions don't exist in a vacuum. If we want to improve the performance of something, or if we want to implement something new, we still have to have all of those other network functions around the ones that we're concerned with, working well. We're not going to come at this and do better than Cisco or Arista, who have been doing this for 30 years, just because we feel passionate about this and want to do this. The question is, what is the why here? Why would I want to do this? Why would all this extra work actually be worth it?

Flexibility (Line-Rate Control Loops)

The first thing that a lot of people think of with anything that's programmable is flexibility. Think back to the transition from fixed function GPUs to programmable GPUs, where we got things like CUDA and OpenCL. Now with networking, we have a programming language called P4 that we're going to be talking about quite a bit, and the first thing that comes to a lot of people's minds is flexibility: I can now do the things that I want to do. If you're really trying to innovate in the networking space, and your innovation requires that you touch every packet going across a terabit network, you're not doing that in the operating system. You're not using something like DPDK, or Linux XDP, to schlep all these packets up to the CPU, do whatever processing you're going to do, and send them back down to your network card. If you're running at these speeds, if you want to do innovation at that level, then you need specialized hardware to be able to do that. What you're looking at here is a very brief picture, or a high-level overview, of something that we've done at Oxide. One of the things that we really wanted to nail was the routing protocol that interconnects everything in our racks. Something about the physical network inside of our rack, and between racks, is that it's multi-path. For every single packet that comes through our network processors, we have at least two exit routes that it can take to get to its final destination. That has great properties for things like resiliency and robustness and fault tolerance. It also means that we have to make that choice for every single packet that goes across our network. How do we make that choice? Traditional wisdom here is something called ECMP, or equal-cost multi-path, where you basically have a hashing function that looks at the layer 4 source and the layer 4 destination of the packet, performs the hash on those, and the output of the hash is an index into the potential ports that you could send this out on. There are nice statistical properties of this hashing function that basically say it's going to have a uniform distribution. In the limit, you're going to have a nicely load balanced network. For those of us who have run large scale data center style networks, we know that this doesn't always work out in practice. The intersection of the traffic that is going through these hashing functions with how they actually behave is not in fact uniform. The fault tolerance properties of them leave quite a bit to be desired, because they're essentially open loop control. What we wanted to do for our rack is closed loop control. We wanted to take a look at every single packet and put a timestamp on it inside of one of our IPv6 headers (we're all IPv6 in our underlay network). IPv6 has extension headers, and so we stuck some telemetry, essentially, in every single packet going through the network. When that packet would get to its destination, the destination host would look at the telemetry and say, do I need to send an acknowledgment to update what the end-to-end delay looks like from the point at which this telemetry was added to the point at which it was received? It sends that acknowledgment back if necessary. Then our switches are programmed to recognize those packets, and then say, ok, to this destination, going out this particular path that I chose this time, this is the latency. We're going to probabilistically always choose the path of least latency.
We're going to be able to detect faults and react to them inside of the round-trip time of our network, which is actually really cool. It's really advancing the state of the art in multi-path routing. There's no way we could have done that without programmable processors for the network. This is one of the advantages that we have in terms of flexibility. They're completely programmable, so the sky is the limit for what you can do.
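To make the classic ECMP selection described above concrete, here is a minimal sketch in P4_16 against the open-source v1model architecture rather than the Tofino 2 pipeline used in the talk. The l4_t header, the port numbering, and the two-way fan-out are illustrative assumptions, not code from the talk.

    #include <core.p4>
    #include <v1model.p4>

    // Illustrative layer 4 header carrying just the fields the hash looks at.
    header l4_t {
        bit<16> src_port;
        bit<16> dst_port;
    }

    control ecmp_select(inout l4_t l4, inout standard_metadata_t std) {
        bit<16> idx;
        apply {
            // Hash the layer 4 source and destination, then reduce the result
            // to an index over two equal-cost paths out of this switch.
            hash(idx, HashAlgorithm.crc16, 16w0,
                 { l4.src_port, l4.dst_port }, 16w2);
            // A real program would map this index through a next-hop table;
            // here it is used directly as the egress port for brevity.
            std.egress_spec = (bit<9>)idx;
        }
    }

The point of the sketch is that the selection is stateless and open loop: nothing here ever learns whether the chosen path is congested or down, which is exactly what the closed loop approach described above is designed to fix.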

Fit for Purpose (Spending Silicon on what Counts)

The topics that I'm going to go over here, I'm going to go over three major ones, and I view them in increasing levels of importance. The flexibility is important, but we also have a very rich ecosystem of fixed function processors. If you can't get the flexibility that you need, it's not the end of the world, so I view that as the least important thing on this list. Moving up the importance stack, we have fit for purpose resource usage, or what we call spending silicon on what counts. That's what I was talking about at the very beginning of the talk: we have these very nice processors with all these functions, but are these functions taking away silicon from the functions that I really care about? A very interesting, maybe depressing, example of this from earlier in my career is when we were deploying EVPN networks for VXLAN. The early days of VXLAN were tough all around. We had some switch ASICs. One of the difficult things about VXLAN is you're basically sending a layer 2 network over a layer 3 network, and layer 2 networks have broadcast domains in them. When you're sending out packets that can have broadcast properties in them, you need to replicate those packets as they go across certain points in the network. The switches handled this. These days they use multicast; in the early days, it was mostly something called head-end replication. There is silicon in the switch to do head-end replication, but that silicon is also needed for other things in the switch, for things like NAT. While we had switches that implemented that head-end replication feature, we couldn't use the entire space that we knew was there on the chip, so we actually ran into soft limits. When we got to a certain level of load on our networks, our networks just stopped functioning, which was a really bad place to be in. It wasn't clear why they were stopping functioning at the time. Being able to understand what the capacity limits of your silicon are, and exactly where your program reaches into those limits, is a really powerful thing, and so is being able to just do what you need to do. The cycle times for these chips are pretty long. If you look at Mellanox Spectrum or Broadcom Tomahawk, we're on a three-to-five-year cycle for these things, and so they don't come out very quickly.

Comprehensibility (Giving Designers and Operators Executional Semantics)

Then, finally, the most important thing is comprehensibility. If you've operated a large complex network, you've probably been at the place that this diagram is depicting. You have packets coming into your network, or maybe into a particular device on your network. You know where they're supposed to be going. There's a place that they're supposed to be, but they're just not going there. They're getting dropped, and you have no idea why. You look over the configurations of your routers and switches, you look at your routing tables, you look at your forwarding tables, everything lines up, everything looks good. You're running tests. You're calling the vendor. They're asking you to run firmware tests. Everything looks good. Then we get to the solution that is the most unfortunate solution in the world: we restart the thing. We take an even bigger hit in our network availability by restarting this whole thing, even if it's just some specific feature that's not working. Then it starts working again, which is the worst thing in the world, because then we continue to lose sleep for the next three months thinking about, when is this thing going to happen again? There are other variations of this story. There's the one where your network vendor sends you a firmware update that says they fixed it, but you have no insight into why it was fixed. You're just trusting, but not able to verify, that this fix is actually in place. It's a very vexing position to be in.

What does this look like with programmable network processors? The bad news is the initial part of this doesn't get any better. When your network goes down, it's almost certainly going to be 2 a.m. on a Saturday. Support tickets are going to be on fire. People are going to be very angry with you. You have 16 chats that are all on fire trying to figure out what's going on. None of that gets better with programmable networking. 2 a.m. on Saturday just somehow happens to line up with a critical timeframe for one of your business divisions that's losing tons of work. None of that gets better. What does get better is what happens next. What happens next is, when we have this packet that's coming through our network, we have executional semantics for how our network actually works, because we wrote the code that underpins what's running on the silicon itself. The code that I have on the right-hand side of this slide is just something that I've cherry picked from our actual code where we're doing something called reverse path filtering. The basics of reverse path filtering are: anytime we get a packet at a router, we're going to do a routing table lookup on the source address of that packet and see if we actually have a route back to that source. If we don't have a route back to that source, we're just going to drop the packet on the floor, because if we were to get a response downstream from us that we wanted to send a reply to that source address, we wouldn't be able to fulfill that request. This is a security feature that is commonly implemented in a lot of networking equipment. Our customers have asked for it, and so we've implemented it. The important thing about looking at this code is that we have several instances of this egress drop. We're setting egress drop equal to true, meaning that it is our intention to drop this packet on the floor at this point. This is P4 code. If we look at our overall P4 that is running on the switch and look at all the places we're intentionally dropping packets, then we can actually start to walk backwards from all the places that this can happen. When I'm looking at a packet going function by function through my P4 code, and I know what the composition of this packet is, I know what is actually going to happen. We can start to look at this from a logical perspective instead of a calling-and-pleading-with-our-vendor perspective. This is an extremely powerful thing.
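The slide's code isn't reproduced in this transcript, but a reverse path filtering check of the kind described might look roughly like the following in standard P4_16. The table, action, and parameter names are illustrative, not Oxide's actual code, and ipv4_t is assumed to be an IPv4 header type with a src_addr field (like the one sketched in the border example later in this talk).

    // Sketch of a reverse path filter: look up the packet's *source* address in
    // a routing table; if there is no route back to that source, flag the packet
    // to be dropped, mirroring the `egress_drop = true` intent described above.
    control rpf(in ipv4_t ipv4, inout bool egress_drop) {
        action route_exists() { }
        action no_return_route() { egress_drop = true; }

        table rpf_routes {
            key = { ipv4.src_addr: lpm; }
            actions = { route_exists; no_return_route; }
            default_action = no_return_route();
        }

        apply {
            if (ipv4.isValid()) {
                rpf_routes.apply();
            }
        }
    }

Because every intentional drop in the program is an explicit assignment like this, you can enumerate the drop sites and walk backwards from them when a packet goes missing.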

Comprehensibility (Debuggability, Tracing, and Architectural State)

The comprehensibility stuff doesn't end there. In the Twitter performance talk, the speaker was talking about how it's not really scalable to think about performance in terms of just looking at your code and thinking really hard about what is dogging my performance. The same thing is true for robustness and reliability in networking. While we can go through and read our P4 code and things like that, this is not really scalable. We might be working with other people's code that we're not intimately familiar with, or other people might be working with our code. We need to have tracing and debuggability tools for our networks. With fixed function networks, we're fairly limited in what we can do. A lot of the NPUs that are coming from Arista, and Cisco, and folks, allow you to basically take a port on your switch and say, I want to mirror this port to the CPU and get all the packets at the CPU. When we think about how this system is architected from a hardware perspective, these ports are perhaps 100 gigabits. Maybe I need to look at 5 ports at the same time. Of course, when I need to look at things, they're all under 80%, 90% load, so I have several hundred gigabits of traffic that I need to run through a trace tool like tcpdump. When I mirror that port to the CPU on that device, I'm coming through maybe a PCI Express 4-lane link, maybe a 1-lane link, so we're not shoving 100 gigabits through that. There's going to be some buffer on the ASIC side to buffer packets as they come in, but it's not going to keep up with multiple hundreds of gigabits of data coming through. What we're going to get is packet loss. One of the really nice things about programmable NPUs is that we can actually put debugging code on them. We've done this at Oxide on our networking switches. When something is going sideways, we can load a P4 program that basically filters only the things that we're interested in from these very high rate data streams coming through the ports, and then sends that information up through our PCI Express port to the host, where we can get tcpdump-like output. This is a big win for observability when things are really going sideways.
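As a sketch of what such a debugging filter could look like (illustrative only; the CPU-facing port number, the names, and the ipv4_t header type are assumptions, not Oxide's actual tooling):

    // Sketch: steer only the flows under investigation toward the port that
    // faces the host CPU over PCI Express; everything else is left alone.
    control debug_filter(in ipv4_t ipv4, inout standard_metadata_t std) {
        const bit<9> CPU_PORT = 192;  // illustrative CPU/PCIe-facing port

        action to_cpu() { std.egress_spec = CPU_PORT; }
        action pass() { }

        table interesting_flows {
            key = {
                ipv4.src_addr: exact;
                ipv4.dst_addr: exact;
            }
            actions = { to_cpu; pass; }
            default_action = pass();
        }

        apply {
            if (ipv4.isValid()) {
                interesting_flows.apply();
            }
        }
    }

Only the flows named in interesting_flows ever reach the host, so the comparatively narrow PCIe link carries a trickle of interesting traffic instead of hundreds of gigabits of everything.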

Then, the desire to do tracing doesn't end at tracing packets. That's the go-to thing: as network engineers, we want to packet trace from a whole bunch of different places in the network to see what's going on. Because we're actually writing the code on these NPUs now, and we have that level of understanding of what's going on, it would be wonderful to have program traces. A technology that we lean on pretty heavily at Oxide for program traces is something called DTrace, or dynamic tracing, which basically allows you to take a production program, understand how the instructions are laid out inside the binary for that program, and replace some of those instructions with the instruction itself plus some instrumentation to see how your program is actually running. You can do function boundary tracing. If you've never used DTrace before, I'd highly recommend taking a look at it. It's quite a magical piece of software for observability. What we're looking at on the right-hand side of the screen here is a dynamic trace that I took from one of our P4 programs running, which goes through and gives me statistics about every single function in my P4 program, every 5 seconds, to tell me how often this particular function was hit in the execution of the program. When you have this, it shines a really bright spotlight on things that you know aren't supposed to be happening. If I know I'm supposed to be doing layer 3 routing, and I know I'm supposed to have MAC address translation all set up and working for the next hop on that layer 3 route, and I see a lot of MAC rewrite misses, I know something's wrong with my MAC tables, and I need to go look at that. This is a really good high-level tool for going in and being able to see what's wrong.

Then, wrapping up the comprehensibility discussion is a continuation of this tracing discussion. The important part is not the terminal text itself, but the text on the right-hand side of the screen. Something that we have with dynamic tracing that's also really powerful is the ability to fire probes based on predicates: this terrible thing has happened in my network based on this predicate, now give me a dump with all of the architectural state of this processor. This is where we're going to start to dig a little bit deeper into the hardware-software interface and the deep integration that hardware can have with software. Architectural state is the set of things that come from the instruction set architecture for the processor that you're executing on. What do all the registers look like? What are the calling conventions for the stack? What does that stack look like? What does my stack trace look like? What does memory look like? We don't have DRAM on network processors, but we have SRAM and we have a more esoteric type of memory called TCAM. What does all that look like at the time at which my program has gone sideways? I have the intent of my program and I have the execution reality of my program, and the architectural state of the processor is what allows me to identify the delta between that intent and what's actually happening. A lot of times in conventional programming we see this in core dumps: when you have a crash in your operating system, you get a core dump that has all kinds of useful information in it. Architectural state is what allows that to happen. Back to the text in this slide: when this probe triggered, what I got was the packet composition that was running through my network pipeline. When the probe triggered, I got a stack trace of how exactly this packet flowed completely through my program. Then, at the end, the cherry on top is I got a hit on a table rewrite miss that told me exactly, this is where your packet was dropped. This is the culmination of the where-this-gets-better story for observability: when things are going sideways, we have the ability to look in and see exactly what is happening with the mismatch between the intent of how our systems are running and how they're actually running, and take real corrective actions that aren't reboot the router, or ask for new firmware and pray that it's going to actually solve the problem. This, in my view, is the most powerful thing about using programmable network processors: you can understand more about how your system works.

How and Where to Jump into the Deep End?

Say some combination of flexibility, fit for purpose resource usage, or comprehensibility seems compelling to you, and you want to take the plunge into the deep end and start to work on programmable network processors. How and where do we start? I'm going to provide a little bit of insight from our experience at Oxide. One of the surprising things is that writing the actual data plane code is a very small piece of the pie. I sat down and tried to consider what the relative levels of effort were, over the last three years we've been building this system, that we put into all of these things. The data plane code itself, the P4 code, is like 5%, maybe. I think that's probably being generous. It's a couple thousand lines of code, and our network stack is probably over 100,000 lines of code at this point. Around this, we have things like the ASIC driver in the operating system, which takes a ton of time and effort to build. The operating system network stack integration is a huge piece. There are entire companies that are built around just this piece. Remember Cumulus Linux, before they were taken over by NVIDIA: their entire company was based on this, how do we integrate these ASICs from Broadcom and Mellanox, and places like that, into the Linux networking stack in a way that makes sense if you're used to the Linux networking abstractions? This is a huge effort, and you have to decide where you're going to integrate and where you're not going to integrate with the operating system. Do you need to integrate with the operating system bridges? Do you need to take the VLAN devices, the VXLAN devices? Tons of decisions. Decision paralysis kicks in pretty quickly here, when you're trying to define scope. Then there are things like the management daemons and the protocol daemons. Do you want to be able to keep the operating system's protocol daemons that you're using? Do you have to modify them? Do you want to rewrite some of them? Then, finally, integration is a big chunk, twenty percent, and I may be underselling the integration here. It's making sure that this code that you write in P4, which is underpinning your network, actually results in a functional network, end-to-end. It's not good enough that it just works on a particular node. You have to build a whole network around this with all the routing protocols, all the forwarding protocols, everything that needs to work. Is it handling ARP correctly? Is it handling LLDP correctly? All these things need to work. You're not in control of the endpoints, oftentimes, that are connected to your network, so you need to make sure that it's robust.

Data Plane Code

With that, let's dive a little bit into the data plane code. Not necessarily because it's the smallest piece, but because it's the piece that drives the rest of the work. It's because this data plane is a blank slate that everything else has to be modified and becomes such a large amount of work. Over the next few slides in this presentation, we're going to be writing a P4 program together. It's a very simple programming language. We're actually going to write a functional program in less than 100 lines of code that is maybe even useful. I just made up this program for this presentation. It's not a real thing that we actually use. It's cool. Just a quick explanation of what this program is going to do. We're going to make a little border security device. What this border security device is doing is we have four ports. The first port, port 0, is connected to web traffic that's coming into our infrastructure. We all know what a wonderful place the World Wide Web is, and the very pleasant messages that we get from hosts on the internet. Then we have a web server farm that is meant to serve these requests. Because the World Wide Web is not actually a nice place, we have a traffic policing cluster that is looking at suspicious traffic to decide, should this traffic be allowed through to our web servers? Then we have threat modeling for traffic that we've decided is definitely bad, but we want to learn more, so we can do defense in depth when we see this type of threatening traffic moving forward. We're going to write a P4 program that can do this.

The first thing that we have in P4 programs, much like in many other types of programming languages, is data structure definition. P4 is centered around headers. It processes packet headers. It doesn't do a whole lot of packet payload processing. We only need two headers for this program, at least for the first version of the program that we're going to write. We need an Ethernet header and an IPv4 header; we're not going to handle IPv6 here. I'm not going to go over networking 101 here, but just trust me that these are correct representations of Ethernet and IPv4 headers. The next thing that we need to do, now that we have our header definitions, is actually parse packets coming in off the wire in binary format into this header structure. What we have here is a P4 parser function, and this is the function signature. We have a packet_in, which is like our raw packet data. We have our out specifier. You have these inout specifiers that are reminiscent of Pascal. Then we have a headers object, which is a more structured representation of headers. That's what we're going to be filling in as an out parameter to this function. Then we have a little bit of metadata that we'll talk about later. Parsers are state machines. They always start with a special state called start. In this start state, we are just extracting the raw data from the packet and placing it into headers.ethernet, which holds this Ethernet structure here. We're taking the first 14 bytes from this raw packet data and just putting it into that more structured header. We don't have to do any of the details of this; this is all taken care of by P4. Then we're just going to have this conditional that says if the EtherType is the IPv4 EtherType. That little 16w on the leading edge of that, if you've ever programmed in Verilog, is just saying that this is 16 bits wide. We're saying if this 16-bit value is equal to 0x800, which is the EtherType for IPv4, then we're going to transition to the next state. If not, then we're just going to reject the packet. There'll be no further processing on this packet, and we drop it on the floor. Next, similarly, we have an IPv4 state. We're extracting the IPv4 headers. Then we transition to a special accept state, which means we've accepted this packet in terms of parsing, and now we want to go process it.
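The slides themselves aren't included in this transcript, but the header definitions and parser described above have roughly the following shape in standard P4_16. This is a sketch, not the talk's actual code: the x4c dialect differs in detail (for example, it uses a plain conditional where this sketch uses a transition select), and the metadata parameter mentioned above is omitted here.

    #include <core.p4>

    header ethernet_t {
        bit<48> dst;
        bit<48> src;
        bit<16> ether_type;
    }

    header ipv4_t {
        bit<4>  version;
        bit<4>  ihl;
        bit<8>  diffserv;
        bit<16> total_len;
        bit<16> identification;
        bit<3>  flags;
        bit<13> frag_offset;
        bit<8>  ttl;
        bit<8>  protocol;
        bit<16> hdr_checksum;
        bit<32> src_addr;
        bit<32> dst_addr;
    }

    struct headers_t {
        ethernet_t ethernet;
        ipv4_t     ipv4;
    }

    parser parse(packet_in pkt, out headers_t hdr) {
        state start {
            // Pull the first 14 bytes off the wire into the Ethernet header.
            pkt.extract(hdr.ethernet);
            transition select(hdr.ethernet.ether_type) {
                16w0x0800: parse_ipv4;   // IPv4 EtherType: keep parsing
                default:   reject;       // anything else gets no further processing
            }
        }
        state parse_ipv4 {
            pkt.extract(hdr.ipv4);
            transition accept;           // hand the parsed headers to processing
        }
    }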

The last stage for this is processing the packet headers. What we're doing here is we have a new type of function called a control function. We take as input the output of our last function, which is this headers_t that has our Ethernet header and our IPv4 header in it. We have a little bit of metadata that we're going to talk about later. Then we have three actions. In our first action, forward, we are setting the egress port, determining what port this packet is going to go out of, to port number 1, which corresponds in our diagram on the left to our web server traffic. Similarly, for our suspect and our threat actions, we're setting the ports to 2 and 3 to go to the traffic policing and threat modeling respectively. These actions are bound to packets through things called match-action tables in P4. The match part is based on a key structure that we're defining in our triage table here at line 58. We're saying that the key that we want to match on is the IPv4 source address from the headers that we parsed. The possible actions that we're going to take are suspect and threat. Then the default action, if we have no matches in this table, is just going to be to forward this packet on. Finally, we have an apply block, where we're basically saying, if the packet is coming from port 0, which means it's coming from the internet, then we're going to apply this triage table. Otherwise, it's coming from one of our three internal ports, so we just trust it, and we're going to send it back out to the internet. That's it. In about 70 lines of code, we've written a P4 program that can actually implement the picture in the diagram.
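Again as a sketch rather than the talk's actual code, the control block described above has roughly this shape. The port metadata field names follow the open-source v1model architecture, which is an assumption here, since the real program targets a different pipeline.

    control border(inout headers_t hdr, inout standard_metadata_t std) {
        // Port 0 faces the internet; ports 1, 2, and 3 are the web server,
        // traffic policing, and threat modeling clusters from the diagram.
        action forward() { std.egress_spec = 9w1; }
        action suspect() { std.egress_spec = 9w2; }
        action threat()  { std.egress_spec = 9w3; }

        // Match-action table: keyed on the IPv4 source address and populated
        // at runtime by the control plane.
        table triage {
            key = { hdr.ipv4.src_addr: exact; }
            actions = { suspect; threat; forward; }
            default_action = forward();
        }

        apply {
            if (std.ingress_port == 9w0) {
                triage.apply();          // traffic from the internet gets triaged
            } else {
                std.egress_spec = 9w0;   // internal traffic goes back out to the internet
            }
        }
    }

The triage table starts out empty; the control plane (the borderadmin tool in the later demo) is what inserts source addresses with the suspect or threat action at runtime.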

Code Testing

Easy: define, parse, process, 1, 2, 3. Not all that bad. How does this actually work? This is just code on some slides. How do we verify that this thing actually works? What even is the testing model for this? How do I sit down and write tests for this? In general-purpose programming languages, we write the test code in the same programming language we write our actual code in. It makes it very convenient to just sit down and iterate, write a whole bunch of tests, and verify that the code that we've written is actually doing the thing that we want. We don't have that luxury in P4. P4 is very far from a general-purpose programming language. There's no imperative programming. P4 doesn't actively do things; it just reacts to packets. We don't have this luxury on the right here. This is just some Rust code where we are writing some unit tests and then compiling and running our unit tests. We have this nice cycle. This brings me to the first challenge: how do we build effective edit-compile-test workflows around data plane programming, without having to set up a bunch of switches that each cost like 20 grand, 30 grand a pop, and then, every time we want to run a new topology, having to rewire those switches? We don't want to do that, but we want an effective workflow. The solution that we came up with for this is we wrote a compiler for P4. We're a Rust shop, we do most things at Oxide in Rust. We decided that we're going to write a compiler in Rust for P4 that compiles P4 to Rust. That way, we can write our test harnesses for our P4 code in Rust and have the nice toolkit that exists in Rust for doing unit tests and integration tests and things like that on our P4 code.

x4c Compiler Demo

I'm going to get a terminal up here. What we're looking at in the top pane here is just the code that we wrote in the slides. Then our compiler comes in two formats. The first format is just a regular command line compiler. Our compiler is called x4c, so if we just run x4c against this border.p4, it doesn't give anything out. The mantra is, no news is good news. This means that we have successfully compiled. If we change something to IPv7, or something like that: in border.p4, let me go down to the parser, transition to IPv7, which is not a thing, and run our compiler over it. Then we get this nice little error condition saying, IPv7 is not defined, the basic stuff that we would expect from a command line compiler. This is not actually the most interesting thing about what we're doing here. I said we wanted to do testing, so let's actually do some testing. One of the really cool things about Rust is the macro system that we have in Rust, which allows us to basically call other code at compile time, and that code can influence the code that's actually being compiled. Our P4 compiler doesn't just emit Rust code, it actually emits Rust ASTs that can be consumed by the Rust compiler, so we can actively inject code into the Rust code that we are compiling. Here, I just have a library file where I'm using a macro saying, please ingest this P4 code, use our compilation library to create an AST, and just splat all of that generated Rust code right here in this file and give it a pipeline name of main. Then in my test code, which is in the same library, I can now create an instance of this pipeline main, and then just basically run some tests over it. I can start to fill up my tables to define what sources of packet information I view as suspicious, then run some packets through it, and just see if the packets that I get out are the ones that I expect. We have a little piece of infrastructure in here that we call SoftNPU, or the software network processing unit, that we use to execute the P4 code that we generated, because the P4 code needs things like ports and all this infrastructure that an NPU actually provides. We have a software version of that, which we are running here. What I'm going to do is just run a unit test with cargo nextest run. This is the test that exists in this program. Here, it's got this little test adornment on the top. Just to show that there are no tricks up my sleeve here, I'm not doing anything nefarious, let's change part of this P4 code, let's make a mistake. Let's say that our suspect traffic is going to go out to port 3. Then I can just do my nextest run here. Now this code is essentially just going to sit here and hang, because I have a synchronous wait hanging off of that port expecting the packet to come out. This test is just going to hang here. Obviously, if it was a real test, we would want this to do something besides hang, and give us an error after a timeout. We can change this P4 code back, run our test again, and everything should compile and run ok. This gives us a really powerful mechanism for iterating quickly over P4 code without having to set up a whole bunch of hardware and things like that, and just understanding the basic input/output relationships of the code that we've written. It's only 70 lines of code here, so it's nothing big, but when you get to thousands of lines of packet processing, it becomes fairly intense.

Test End-to-End Network Functionality Before Hardware is Available

If I were sitting in the audience of my own talk, I'd be looking at this going, so what? You sent some synthetic packets through a unit test generator; this is not a functional network. This is not real. I would agree with myself, this is very much not real. There are a bunch more things that functional networks actually have to have to work than just sending a packet in one port and having it go out the other. The next challenge is, how do we actually test that this P4 code can go into a real network device and implement a real network, and do this with real packets from real machines with real network stacks? The solution that we came up with there is creating virtual hardware inside of a hypervisor. I said we were a Rust shop at Oxide. We also have a Rust-based hypervisor VMM, called Propolis, that works on top of the bhyve kernel hypervisor, that we can develop emulated devices in. What we did was we developed an emulated network processing unit that exists inside of this hypervisor. The chart that you're looking at here on the right-hand side of the slide is basically how this comes together. Inside the hypervisor, we have another variant of SoftNPU that's able to run precompiled P4 code. We take that Rust code that we compiled P4 code into, and we create a shared library out of it. Then we're able to load that shared library onto this SoftNPU device that exists inside of a hypervisor by exposing a UART driver in the kernel that goes down to that hypervisor device. Then we can load the code onto that processor, and actually run real packets through the ports that are hanging off of that hypervisor.

Demo

In the next demo that I'm going to show, we're going to have four virtual machines that are representing the web traffic, the web servers, the traffic policing, and the threat modeling. We're actually going to send real packets through our P4 code. What we're going to do is, on the left-hand side of the screen, I have my web traffic, where I'm just going to start pinging from a source address of 10.0.0.4, to 10.0.0.1. This ping is not going to work initially, because our switching NPU is empty, there's no program running on it, and so we have to load some P4. In the upper right-hand terminal, I basically have this libsoftborder.so that I have compiled from P4 into a Rust-based shared library object. What I'm going to do is, I'm not going to load zeros; I'm going to load the program, libsoftborder.so, onto my ASIC through this UART device, through a little program that I wrote just to communicate through the operating system's UART interface. It's loading this program that we compiled. Now on the left, we see the pings are starting to flow. Now I'm going to start some tracing sessions on my servers. I see on my web server that I am getting these ICMP packets, which is what we would expect from ping. Then in my policing and my threat modeling, I'm not seeing any packets yet. In this borderadmin program that I created, I added the ability to start populating this table that we have in P4 for the match-action, to start to direct packets in different directions. If I do a borderadmin add-suspect, we're going to do 10.0.0.4. The source address was .4. We're going to load this into the table. Now all of a sudden, we see traffic stopping at our web servers; we're just seeing ARP traffic there now. We're seeing traffic down at our policing servers start to kick in. If I remove this triage entry, then I'm going to see the packets starting to come back to where they were supposed to go originally. Then at some later time, if we decide that this is actually threatening traffic, then we can go here and add it as a threat. Down here, we're starting to see the traffic come in to the threat cluster for further analysis. This shows that this was actually somewhat of a useful program that we wrote and that it can actually direct packets. An interesting thing about this is, this is a little bit of a lie. The extent to which this is a lie is that we didn't implement ARP in the initial program. ARP is the Address Resolution Protocol. It's how one computer says, I have a packet to send to 10.0.0.1; I need to send an ARP request out to know what the layer 2 address is for that machine. We didn't implement that. Because we didn't implement that, nothing is going to work on this network. The moral of that story is that when you're doing P4 programmable stuff, you're on your own. You have to implement and build everything. You have to realize that in order to build functional networks, there's a lot more work than just doing what you want to do. Our 70-line program essentially turned into a 109-line program when I added ARP into it to allow it to work in an actual, real network stack situation.
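To give a sense of the shape of that ARP change, here is a guess at the kind of additions involved, not the actual 109-line program: the parser grows an 0x0806 EtherType arm and an ARP header, and the control block then lets ARP through instead of triaging it.

    // Illustrative ARP header for Ethernet/IPv4 (the real program's layout may differ).
    header arp_t {
        bit<16> hw_type;
        bit<16> proto_type;
        bit<8>  hw_len;
        bit<8>  proto_len;
        bit<16> opcode;
        bit<48> sender_hw;
        bit<32> sender_proto;
        bit<48> target_hw;
        bit<32> target_proto;
    }

    // In the parser's start state, ARP gets its own arm instead of being rejected:
    //     16w0x0800: parse_ipv4;
    //     16w0x0806: parse_arp;
    //     default:   reject;
    state parse_arp {
        pkt.extract(hdr.arp);
        transition accept;
    }

In the apply block, an ARP packet would then bypass the triage table, so that hosts on either side of the device can still resolve each other's layer 2 addresses.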

Closing the Knowledge Gap at Hardware/Software Interface

The final challenge is the fact that all of the instruction set architectures for NPUs today are closed source. We can't do a lot of the things that I was talking about with the SoftNPUs on real hardware-based NPUs, because we don't have the architectural state that we need. We at Oxide are working very hard to change this. We're starting to get traction with vendors and with the P4 community toward making this happen.

Questions and Answers

Participant 1: You're saying that P4 is not general purpose, and you can't just execute anything in there. What's the main limitation compared to general-purpose operational logic? The second question is more about the borders of who's writing P4. Is it you? Is it the people you ship racks to? Who's writing the code that ships?

Goodfellow: There are a lot of differences. There are no loops, for example. There's no code that just runs on its own; everything is in response to a packet coming through the system. Programming for P4 is a lot more like programming for an FPGA than it is for a CPU, within reason, like there are obviously exceptions to this rule. There's no such thing as a slow P4 program. Your program either compiles, and it's going to run at the line rate of the device that it's running on, or it just doesn't compile at all. Because of that, it has to be very constrained. The compiler will say, your program does not fit on the chip that you're trying to fit it on. There are no loops. There are constraints on how much code you can actually fit on these devices, because they're extremely heterogeneous. It's not like a pile of cores and a pile of RAM. It's like five different types of ALUs that all serve different functions. The compiler statically schedules the instructions onto those ALUs. It's not like superscalar, where you have a scheduler on the chip itself. It's a much different model. You can't just write code that autonomously executes and things like that.

Today, when we ship products, we are shipping P4 code, and that P4 code is available to our users. There are some difficulties with that, not technically, but in terms of NDAs and things like that. The compilers that target the chip we're using are closed source, and our customers don't have access to the NDAs that provide them access to those compilers. We are trying to break down those barriers. We want to live in a world where our customers can recompile the code for whatever they want, because a part of our mantra is it's their computer. They bought it. They have the right to do whatever they want. We're trying to realize that world, but we're not quite there yet.

Participant 2: How do you get those general-purpose things to interface with P4, so populating that thing with [inaudible 00:45:39], being able to direct the traffic?

Goodfellow: We defined a very simple low-level interface. If we look at what a table key and what a table value are, we just basically serialize those in binary format, and then send them across a UART. We have some other development environments that have Unix domain sockets that we send these over. We just basically take that IPv4 or IPv6 address and chunk it up into bits. Then if it has other keys in there, like the port, we'll take the big-endian representation of the integer and we'll just basically line all these things up in a binary chunk and send it over to the ASIC, and then the soft ASIC consumes that and populates the tables. There are much fancier runtimes. If you go on to p4.org and look at the specifications, there's a well-defined P4 Runtime that has gRPC interfaces, and Apache Thrift interfaces, and things like that. We do implement some of those things. For this, since it's a development substrate, and we may shift between our high-level runtimes, we wanted to keep things very low level and then map those higher-level runtimes onto just the low-level piles-of-bits interface.

Participant 2: Is the gRPC runtime one of these interfaces?

Goodfellow: We're like a full-solutions vendor. If you're selling a switch that's P4 capable, the P4 Runtime defines a set of gRPC interfaces that can program that switch. Then high-level control planes have something stable to be able to hit without having to reinvent what the control plane looks like for every single P4 switch.

Participant 2: It's a general purpose processor in the switch [inaudible 00:47:23].

Goodfellow: The gRPC is not running on the NPU. It's running on a mezzanine card that probably has an Intel Atom or a Xeon D or something like that on it.

Participant 3: All the tooling and observability stuff you all built is really cool. If another company was trying to also build P4 things, is there other general tooling they would use, or would they have to use the [inaudible 00:47:52], or do you provide that as well?

Goodfellow: To my knowledge, this type of observability tooling is unique in the space. Everything we built is open source and available, so those companies could pick up what we've created. We are somewhat tied in some of the tools we create to the operating system that we use, which is called illumos. We don't actually build our infrastructure on Linux. We're very tied to DTrace for a lot of the observability tooling. If folks wanted to pick that up, it's open source and available and out there, or they could use it as inspiration to build what they want to build.

 


 

Recorded at:

May 31, 2024
