A RedMonk Conversation: AI and Custom Silicon at Annapurna Labs (with AWS)



As generative AI drives demand for training and inference workloads, the tech industry has turned to innovations in chip, board and server design to provide the infrastructure and compute power to meet these demands. Annapurna Labs, which Amazon acquired in 2015, plays a key role in AWS’s ability to innovate in this area, having developed the Nitro, Graviton, Inferentia, and Trainium families of processors. As a follow-up to RedMonk’s tour of Annapurna Labs in Austin, TX, Senior Analyst Kelly Fitzpatrick sits down with Chetan Kapoor (Director of Product Management for the Amazon EC2 Accelerated Computing Portfolio) to discuss how the custom silicon developed at Annapurna Labs is a differentiator for AWS, some of the differences between the infrastructure needs of training and inference workloads, and what AWS is doing to support developers building genAI applications.

This was a RedMonk video, sponsored by AWS; AWS also sponsored analyst T&E to Austin, TX.




Kelly Fitzpatrick: It's the beginning of 2024, everyone's talking about AI, and a lot of the news out of re:Invent in 2023, only a few months ago, was around AI. And some of that news was around the custom silicon that powers some of these workloads, which I thought was very exciting. But in addition to all the news that everybody gets to see, some of my RedMonk colleagues and I got to take a tour of Annapurna Labs in Austin, Texas, recently. Annapurna was acquired by Amazon, I believe, in 2015. We got to see some really cool things and geek out, and we now have questions about custom silicon that we did not have before. Joining me today is Chetan Kapoor from AWS, who's going to answer some of our very geeky questions about custom silicon. To start us off, Chetan, can you just tell us a little bit about who you are, what you do, and what custom silicon has to do with it?

Chetan Kapoor: Yeah, absolutely. Hi, everybody. Chetan Kapoor here. I'm with AWS on the EC2 product management team, and I run our hardware accelerator business. So this basically entails EC2 instances or compute platforms that feature some hardware acceleration. Could be GPUs, could be custom ML chips that we're building, or other forms of accelerators. And I've been running this business for, like, coming up on eight years now. It's been a while.

Kelly: That rounds up to a decade, in case you were wondering.

Chetan: [laughs] It does, yeah, especially in Amazon years. Yeah, things move really, really fast. And yeah, it's been a fascinating space. There's been a lot of coverage around the types of models that companies are building and the type of end user capabilities that we're experiencing on a daily basis these days. But the space that I operate in is all about providing the required computing infrastructure to support these customers building these types of models. So there's a lot that goes into providing infrastructure to support somebody building a foundational model. And that's the space I mostly operate in. There are other businesses that I also support, specifically in the HPC and the graphics and gaming space. As you can imagine, those types of applications also leverage accelerators for a lot of different work.

Kelly: It sounds like you are very busy because you have a lot of things under your supervision or direction. Thank you again for taking the time to chat with me today. I'm going to jump right into questions because I have many of them. In fact, we had to narrow down the list of questions we would ask you so this would not take two hours. But to start off with, one of the core themes that came up throughout our tour of Annapurna Labs was spending engineering resources where it matters most, especially around build versus buy decisions. This theme was visible in multiple aspects, including the strategy of your hardware design and components used, and also the lab operations themselves. When it comes to custom silicon, can you give us some context around the areas where AWS chooses to build and how they are areas of differentiation for AWS? Or in other words, how do these build versus buy decisions relate to specific outcomes in which AWS is investing?

Chetan: Yeah. So this is a super interesting aspect of talking about how we build and deploy these products. So if you look under the covers and take a look at an accelerator for training a large language model as an example, right? A lot of the core function comes down to being able to do large matrix math operations really, really effectively. And that's where the heart of the chip lies: being able to do lots and lots of matrix multiplications really, really quickly. Because essentially for training, or even inferencing, of a large language model, or a deep learning model in general, that's one of the core compute functions you need. Now, when we start to look at a design, we were like, okay, which parts of the design do we think we need to invent IP for and bring a lot of value to? A lot of this is going to be centered around the compute engine. It's going to be centered around how the compute engine talks to the memory that's available, because it's not just about compute, it's about memory also. But there are a whole bunch of other things you need in a chip to actually enable that chip to operate.

For example, power subsystems, high-speed interfaces, out-of-band communication mechanisms. And those are some of the components where we can actually get some help from third parties and leverage their IP instead of reinventing high-speed interfaces or PCIe controllers and things like that. So when it comes to build versus buy, our core focus is around, okay, what is the key capability that we need to build that is not available in the industry? And then how can we impact our time to market? Because with all this craziness in the Gen AI space, everybody wants to get their next application and model up and running quickly. So there's a lot of pressure from a time to market angle. So if we can leverage an IP from a third party, from a partner, then great. That's going to enable us to get to market quicker. And obviously, the third aspect is cost. So we have to be mindful of how much it's going to cost us, or what that translates to in terms of the economics and benefit to the customer. And that's where all these three things have to balance out. It needs to be IP that we can uniquely build and deliver value with.

And if there's the ability for us to leverage third parties, great. But at the same time, we need to be mindful of what it costs us to build in terms of engineering time and effort, and obviously the cost of the silicon itself. Those are the vectors and areas we need to think through when it comes to the buy versus build decision for elements of a particular silicon design.
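To make the matrix-math point above concrete, here's a minimal sketch, with entirely hypothetical layer sizes, of why matmul dominates: a single dense layer of a deep learning model is, at its core, one large matrix multiplication, which is exactly the operation a training or inference chip is built to do quickly.

```python
import numpy as np

# Hypothetical sizes for one dense layer of a deep learning model.
batch, d_in, d_out = 32, 1024, 4096

x = np.random.default_rng(0).standard_normal((batch, d_in))   # activations
w = np.random.default_rng(1).standard_normal((d_in, d_out))   # weights

# The forward pass of the layer is a single matrix multiplication --
# the operation accelerators optimize for.
y = x @ w

# Rough FLOP count for this one layer: 2 * batch * d_in * d_out
# (a multiply and an add per weight per input row).
flops = 2 * batch * d_in * d_out
print(y.shape, flops)
```

Scaling those dimensions up to billions of parameters is what drives the memory-bandwidth and interconnect requirements discussed later in the conversation.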

Kelly: Yeah, and that makes sense. I think the balance has to be tricky at times, especially when you're weighing the desire to make everything as perfect as possible against the time and the cost as well. And saying that from the position of, say, where you are, versus being one of the people in the lab who's trying to navigate these decisions, has to be a little bit different. I want to talk about those build versus buy decisions in the context of the lab. Before we did our tour of Annapurna Labs, it had been described to us as a startup-like environment, and it also had been described as scrappy a number of times. When you walk in, it does feel that way. It's delightful. It was also evident in things like machines that test cards running in part with Raspberry Pis, but also in team members having to pick up all these different skills, like, oh, now you have to go learn how to use a soldering iron, if you didn't know how before, and a microscope and cabling, or understanding the latest wave of compute demands around AI and ML.

How do you build a team that has this scrappy ethos, that fosters this building mentality and creativity, while also ensuring that it stays focused and selective and doesn't unnecessarily reinvent wheels?

Chetan: Yeah. Internally, we have a saying that Amazon is the world's largest startup. For many parts of our business, that's true. In some cases, it's not. But the overall mental model comes down to how we think about teams and their ability to control their own destiny. You might have heard us use the reference of two-pizza teams. For those who are not familiar, it essentially represents a team that is small enough that if you had a happy hour, you'd be able to feed everyone with just two large pizzas. So it essentially means that the team is about 10 to maybe 15 people, and they're super nimble. And at the same time, they have the ability to control their own destiny. So from an org structure perspective, when we think about how we set up teams, we want to make sure that the teams have the ability to work on the things that directly translate to a positive impact in the specific area they're operating in. There are some other cultures out there where teams have a whole bunch of dependencies on other teams.

So if you want to get a product done, you need support from X, Y, and Z teams and things like that. So fundamentally, there's a pretty different culture at Amazon, and it's even more pronounced in the Annapurna team. Annapurna is a startup within a startup, if you think about it. So that's why we have small teams that are focused. And you talked briefly about this: we hire builders that are naturally multifaceted, where the person who is designing the hardware is also responsible for bringing it up and getting all the testing around it done to make sure that it is running. Because the other perspective we have at Amazon, especially in Annapurna, is that if you're building it, you should be testing it. It shouldn't be that there's another team that is responsible for testing, and then you're going back and forth across those two teams, and maybe they don't have a really clear understanding of how the product is supposed to work, and there's some loss in translation and things like that. So organizationally, there's a lot of emphasis at Amazon and Annapurna on making sure that we've got these super nimble teams.

And you mentioned a little bit about the Raspberry Pi. That's a really good example where the team needs a way... This came about in COVID, especially. If you need a way to talk to pre-production silicon that is only available in the lab, ideally, you would just come into the lab and actually work with the silicon in person. But if you don't have that ability, and again, think about COVID, where a lot of people couldn't come into the office or didn't want to come into the office, we empowered them to make decisions to have scrappy environments, but at the same time ensure that it's secure and reliable in the way it's set up. So in the case of the Raspberry Pi, we actually use it to power some of the supporting circuitry around the chip and have an interface to talk with that chip. So that particular Raspberry Pi that you saw was set up in a controlled environment where it wasn't accessible as a device on the Internet. You needed to have restricted access to get to it, but at the same time, it was an off-the-shelf part that the team was able to leverage to set up the test bed in order to make progress on the validation.

Kelly: Yeah, and it’s very cool. The lab, it does have the feel of a startup, the two pizza team, but with the backing and the… Almost like security resources available to…

Chetan: A large enterprise.

Kelly: Which is, I think, exactly what you want. So shifting gears a little bit to talk about training and inference workloads, the chips that power them, and maybe the different needs that these types of workloads have. During our tour, we had the chance to see Trainium and Inferentia chips. Can you tell us a little bit about the difference between the workloads for which these chips are designed, and how and why the infrastructure requirements for training workloads differ from inference workloads? Because one thing I learned is that they are very different.

Chetan: They are. And yeah, they're very different. And a lot of engineers don't have a true appreciation of what the differences are. Generally speaking, if you look at training versus inference, training is a scale-out workload. What I mean by that is, for a deep learning model, and especially all these large language models, you're not training a model on a single chip. You're usually training it across hundreds, if not thousands, of these chips because you want to get the training job done as quickly as possible. Inference, on the other hand, you want to make super cost-effective, because if you're going to be supporting millions of inferences on a per-day or per-minute or per-second basis, you need to make sure it's super optimal. So inference, for the most part, and this is actually changing in the market right now, is a single-chip or single-server type of workload. You don't scale out a single inference job across hundreds or thousands of chips. So because of that, you have unique requirements at the chip level, at the server level, and also at the data center level across training and inference.

So when it comes to training, to look at specific details, you need a lot of high bandwidth memory. You need a lot of chips that are interconnected using high performance networking. And you need really, really fast storage, because you need to feed data into this training compute cluster. Otherwise, your expensive chip resources are going to be starved. So there are differences at the chip level, where you need tons of compute and tons of memory packed into a single chip, and then you need a lot of networking and storage to combine it all together to form very large clusters. On the inferencing side, for the most part, you are going to be sensitive to performance because you want your interactive experience to be super interactive and snappy. If you're talking to a chatbot, you want your responses right away. You don't want to say something to a chatbot and wait 10, 15 seconds for it to respond. Customers end up optimizing for latency and throughput, and also cost. So inference used to be a single-chip workload. If one customer was interacting with a chatbot, that typically used to run on a single chip, and it was good enough for the experience.

But now with LLMs, and especially the GPT-3 and GPT-4 type models, inference is scaling out too, where you actually need multiple chips to host that model to provide a compelling experience for the customers. So to summarize, the big difference is scale-out versus scale-up, right? Where, again, for training jobs, you need hundreds and thousands of accelerators that are tied together. And for inferencing, it's typically limited to a few chips, but at the same time, it needs to be effective, both from a performance and a cost standpoint.
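A back-of-the-envelope sketch, using entirely hypothetical numbers and the commonly cited rule of thumb of roughly 6 FLOPs per parameter per training token (and about 2 per parameter per generated token for inference), shows why training scales out across thousands of chips while inference fits on a few:

```python
# All numbers below are hypothetical, for illustration only.
params = 70e9          # model size: 70B parameters
train_tokens = 1e12    # training corpus: 1 trillion tokens
chip_flops = 200e12    # sustained throughput per chip: 200 TFLOP/s

# Training: ~6 FLOPs per parameter per token (rule of thumb).
train_flops = 6 * params * train_tokens
days_on_one_chip = train_flops / chip_flops / 86400
chips_for_30_days = train_flops / (chip_flops * 30 * 86400)

# Inference: generating one token touches every parameter (~2 FLOPs each).
infer_flops_per_token = 2 * params
tokens_per_sec_one_chip = chip_flops / infer_flops_per_token

print(f"training on one chip: ~{days_on_one_chip:,.0f} days")
print(f"chips to finish in 30 days: ~{chips_for_30_days:,.0f}")
print(f"one chip serves ~{tokens_per_sec_one_chip:,.0f} inference tokens/sec")
```

With these made-up figures, one chip would need tens of thousands of days to train the model, so you need on the order of 800 chips just to finish in a month, while a single chip can already serve over a thousand inference tokens per second. That asymmetry is the scale-out versus scale-up distinction Chetan describes.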

Kelly: Yeah, that makes sense. And for me and my colleagues, I think the visuals of seeing the different board setups for the Trainium and the Inferentia chips were really impressive. What really struck us, too, is that we got to see, I think, a first-generation Trainium and an Inferentia2 chip, and they look very similar. But the boards they're on and the servers they're set up in vary so greatly. Can you speak a little bit about that? I know you touched on the different requirements in my previous question, but can you tell me more about why these things look the same but are situated so differently?

Chetan: Yeah. So that's a really good example, by the way. So Trainium, the first-generation part, is designed, as you can imagine, for training. So we have to max out the compute performance that is available on the chip. And we have to make sure that within the server, we pack in as many chips as we can and interconnect those chips really, really tightly. So a Trainium server has 16 chips running at full power that are all interconnected with each other. So again, it's optimized for training, where each chip needs to be able to talk to the other chips, and we need to max out the performance. Now for inference, as I mentioned earlier, inference is a super cost-sensitive workload, and usually it's bound to a single server. So when we were designing the brand new architecture, our second-generation architecture, we were like, Okay, we've got a single architecture. We've got to support training, but at the same time, we want to look at supporting inference leveraging the same architecture. So on the inference side, we took the core silicon and tweaked it to get a lower cost structure, dialed down the compute performance a bit, because maxing that out was really not required for Inferentia2, and also changed the chip-to-chip interconnects.

Instead of having an all-to-all connection, an Inferentia2 server has a ring that connects all 12 accelerators that we have. So there were pretty fundamental differences at the board level that allowed us to have multiple chips come out of the same architecture to support training and also inference. So your observation was correct. The fundamental silicon is very similar between the two. They leverage the same architecture, but they are packaged in a very different format based on the requirements of training versus inference.

Kelly: And then, if you don't mind speaking about this, Trainium2 was announced at re:Invent. And that chip itself, I think, the benefits were touted and shouted from the rooftops. But going beyond that, in terms of the setup around it, anything we should know about?

Chetan: Yeah, I think it's going to be super exciting for our customers, and we're really pumped about what that chip is going to bring to the market. One of the high-level things we mentioned when we announced that part was that we expect it to be four times as performant as the first-generation Trainium. So that's pretty sizable. We're also going to set up really, really massive clusters of Trainium2 chips. We talked about workloads needing hundreds to thousands of chips. We're starting to see customers today that are running tens of thousands of these chips in a single cluster, and we expect that trend is going to continue. So we're planning on deploying Trainium2 as part of massive clusters. And that's going to be really exciting for some of the leading builders of foundational models. We're going to disclose more about Trainium2 in the coming months as we get closer to preview and eventual GA. We'll share more details around how we expect to package it as an EC2 instance and the kind of clusters we expect to be able to provide our customers to enable them to train these next generation foundation models.

Kelly: Well, I, for one, will be on the lookout for that news as it rolls out, because I do think we're at a point where the stuff that runs the workloads we need to do the things we want to do is becoming more and more important.

Chetan: Yeah.

Kelly: And speaking of that. So at RedMonk, we care about developers very much. Many of the software developers we speak to are not overly conscious about hardware, and that may have more to do with the nature of their training, what they're taught to do and what they actually need to do their jobs, than anything else. But from your perspective, why should developers care about silicon? Because clearly, I do, we do. But why should developers care about it?

Chetan: Yeah. So there are multiple reasons. If you're in the AI/ML space right now, there's a lot of focus on time to market, which is understandable. People want to tap into this new innovation in the market and try to deliver meaningful results and improvements to their businesses. We totally get that. The reason for paying attention to silicon and the underlying infrastructure comes down to performance, cost, and the ability for customers to control their own destiny. What do I mean by that? If you're building on GPUs today, as an example, that's going to be great. That's going to allow you to move quickly. But at some point, the cost of training using the existing methodologies is going to just skyrocket. It's going to get to a point where the business leaders at your customers' companies are going to start asking, Okay, well, you're consuming tens of thousands, hundreds of thousands, or millions of dollars to train these particular models. Are we actually seeing the ROI on this investment or not? If your developers are conscious about how they are training their models, and what software and frameworks they're using, it will give them the ability to try out alternative architectures such as Trainium, maybe, or maybe something else in the future, and enable them to either get to market quickly or potentially save 40 to 50% of the operational costs, which can be substantial when it comes to training these LLMs.

So it is super, super important for customers to be cognizant of how they're building their applications, because what we're seeing in the market right now is that there's a certain set of customers that are writing code directly to a vendor's API. So they're like, okay, I'm going to pick vendor A, and I'm going to use their software libraries, and my code is going to have a dependency on that particular software library. So for the most part, they have a direct coupling to vendor A in this example. The alternative you're also seeing is customers using levels of abstraction, where they're like, oh, okay, you know what? I don't want to have a direct tie-in with the vendor's library. So I'm going to use an open source platform that's going to give me a level of abstraction. What I'm referring to is common machine learning frameworks like PyTorch and JAX, where if you're leveraging these frameworks to actually build and train and eventually deploy your models, you're going to have that flexibility, where down the road, if there's a vendor B providing a different hardware platform that is more aligned with what you're looking for from an economics or performance standpoint, you will have the ability to try vendor B and see how well it works for your business.

And alternatively, you'll have the ability to move back also. So developers need to be super conscious about how they're building their applications, what platforms they're leveraging, and what ability they will have to actually try out different silicon from different providers in order to either optimize their time to market or optimize their costs. So that's the most fundamental thing. There's also a lot of detail around the specific features and capabilities that a particular part might provide. If you're optimizing inference performance, as an example, there are a whole bunch of techniques available at the hardware level to optimize inference, and the same thing on the training side. So being aware of what hardware platform you're leveraging will also enable you to micro-optimize and eke out the most performance from that platform, right?
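The abstraction point above can be sketched in a toy example. All the names here are hypothetical stand-ins: in practice, frameworks like PyTorch or JAX play the role of the abstraction layer, and vendor-specific compilers slot in underneath without the model code changing.

```python
from typing import Protocol

class Backend(Protocol):
    """The abstraction layer: model code depends only on this interface."""
    def matmul(self, a: list[list[float]], b: list[list[float]]) -> list[list[float]]: ...

class VendorA:
    """Hypothetical stand-in for vendor A's accelerated library."""
    def matmul(self, a, b):
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

class VendorB:
    """Hypothetical stand-in for vendor B: same interface, different silicon."""
    def matmul(self, a, b):
        # Identical results, hypothetically better economics or performance.
        return VendorA().matmul(a, b)

def forward(backend: Backend, x, w):
    # The "model" never names a vendor, so swapping backends is a one-line change.
    return backend.matmul(x, w)

x, w = [[1.0, 2.0]], [[3.0], [4.0]]
print(forward(VendorA(), x, w), forward(VendorB(), x, w))  # same answer either way
```

Writing against `Backend` rather than `VendorA` directly is the coupling decision Chetan describes: it is what lets a team try vendor B, or move back, without rewriting the model.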

Kelly: Yeah, you’re definitely speaking my language, especially talking about the time to market and velocity and abstraction, which more and more we’re seeing the developer tools and the different ways they interact with the world being mediated by abstractions more and more. So we spoke a bit about why software developers should care about hardware. How is AWS making it easier for software developers to think about hardware, and where and how does abstraction come into play?

Chetan: Yeah, so there are a couple of elements here. One is around the AI/ML space: when it comes to Trainium and Inferentia, specifically, we have put a lot of effort into taking our compiler, which is called Neuron, and plugging it in under common machine learning frameworks. So again, going back to our earlier conversation, we want to make it easy for developers to try out Trainium and Inferentia, but at the same time, have the ability to move to a different architectural platform if it suits their needs better. So we had a very strong focus on taking our IP and plugging it in under open source frameworks. And helping customers leverage those open source frameworks enables them to actually have that level of abstraction. So that's one key thing. There's a similar concept on the Nitro side. And I know you guys got to see some of our Nitro cards also. So Nitro, for the most part, is actually fully transparent to our customers. Our customers, in many cases, don't directly leverage Nitro. It's a behind-the-scenes technology that provides a lot of value. And that enables our customers to maximize how much hardware they're able to use and essentially the value they're able to get out of AWS.

So there are different ways for us to enable our customers to leverage the capabilities we have on the hardware side. On the Nitro side, it's transparent for the most part. And then specifically on the Trainium and Inferentia side, it's by enabling our customers to use open source frameworks instead of directly building on top of our component technology.

Kelly: I really like that introduction of Nitro into the story because, again, everyone's like, AI, ML, everything. But Annapurna Labs, the first chip that came out of there was Nitro, correct?

Chetan: Yeah, you're absolutely right. So our journey with building silicon for AWS started with Nitro, right? And it goes back to 2015, 2016, as you mentioned earlier. And the way we virtualize our hardware is, at this current point in time, 100% handled by Nitro. So let me rewind a bit. Prior to Nitro, and prior to us building our own silicon, on the hypervisor side, in terms of how we enabled customers to run on slices of servers, a lot of that management was actually running on the same server itself. That's typically how hypervisors work: you would take out 15 to 20% of the compute resources on the server itself and allocate them for management purposes. And then you have the remaining part of the compute available to run customer workloads. So Nitro as a core capability enabled us to offload a lot of the management capabilities, all of them, really, and run them on a dedicated card. And that's where we started building our own chips. There wasn't an off-the-shelf accelerator available for us to run our own management stack on. And that's where we started building our own Nitro chips, our own offload cards, and plugging them into our servers.

Chetan: So Nitro actually laid the foundation when it comes to building expertise around silicon. We built on it to launch our Graviton CPUs about four years back. And that was super exciting because, again, that was the first time in the industry a cloud provider had built a CPU from the ground up and brought it to market. And then we built on that capability of being able to build high-performance CPUs and started building accelerators for AI/ML, and hence Trainium and Inferentia were born. And now we are on our third architecture for Trainium and Inferentia. So it's been a fascinating ride.

Kelly: Yeah, and fascinating for me as well. So thank you so much for taking time to chat with me today. Are there any last thoughts that you want to leave folks with?

Chetan: No, I think it's been super exciting chatting with you. We spent a bunch of time talking about silicon, which is a core part of what sets AWS apart. But at the same time, if you look at everything else we offer, the portfolio spanning storage products, networking products, and all the high-level managed services, we think we have a really compelling portfolio for developers to build their Gen AI applications. And silicon and hardware is going to underpin all of this, right? So again, there's a lot of stuff we're working on, and we're looking forward to sharing some of those details in the coming months with you guys, and obviously with customers and developers.

Kelly: Well, we’re looking forward to hearing those stories. Thank you again.

Chetan: All right. Thanks for the time.

