James Governor sits down with Anush Elangovan, VP of AI at AMD, for a deep dive into the evolving landscape of GPU development—tile-based programming, Pythonic DSLs, Mojo, ThunderKittens, and everything reshaping how developers write high-performance AI code.
They explore:
- ThunderKittens and the shift toward tile-based GPU programming
- How MatMul units and tensor cores are redefining GPU architectures
- The rise of Pythonic DSLs: Mojo, Modular, Triton, Gluon, Helion & more
- Python vs. C++ rewrites and what “speed as the moat” really means
- Where the most surprising GPU/kernel innovation is happening globally
- How AMD thinks about enabling developers—academia, startups, enterprise
- ROCm as an open, inclusive platform for the next generation of GPU programmers
- Why AI today is “the electricity moment”—and we’re only at the frog-leg-twitching stage
If you’re into GPU programming, AI systems, developer tooling, or the future of AI hardware/software co-design, this conversation is packed with insights.
This video was sponsored by AMD.
Transcript
James Governor: Hey, this is James from RedMonk. I’m here with Anush Elangovan, VP of AI for AMD. We’re going to talk about the future of GPU programming in the AI era. So obviously everybody is excited about GPUs. We’ve had a pretty good run with the dominant market player, and now things seem to be opening up. There are some interesting new opportunities and technologies that are probably going to begin to change the game in terms of what we expect around programming GPUs. So Anush, I think we’re going to start with the favorite name from when we did a little bit of prep: you mentioned ThunderKittens. I mean, if we didn’t dive into ThunderKittens, what kind of a tech podcast would this be? Tell me a little bit about ThunderKittens.
Anush Elangovan: So, ThunderKittens is work that came out of research from Stanford, and we at AMD had actually worked with them early on. We had shipped them some MI210s when they were in the early stages of prototyping, and they have a port working on MI300s right now. ThunderKittens is a fundamental shift in how you express programming of the GPU interfaces. So, let’s take a step back. GPUs in general, CPUs versus GPUs, right? The general trend was you had SPMD on CPUs, which is, you know, each process can do its own thing. But then you had SIMT, which is a thread-based architecture for GPUs. So you have a whole slew of threads that work on certain data, leading to massive parallelism but limited instruction sequencing and things like that. That’s the SIMT architecture. Then over time we kind of discovered that, hey, SIMT is good, but we need these GEMMs, and the MatMul units are the key part of AI workloads. So, increasingly, we started to see these, quote-unquote, tensor cores or MatMul units show up. And the MatMul units are used for matrix multiplies. But in the early stages we were trying to use the SIMT programming model to program the tensor cores, and that didn’t really work too well. You had to do a little bit of mental gymnastics to, you know, get the layouts right, et cetera. But what you get with tile-based programming is a native representation that is good for MatMul units. So you can actually represent them in tiles, and you can do M by N, K, et cetera. It’s easy to program in tiles rather than in threads. And that’s the fundamental shift that’s happening right now as MatMul starts to take on the majority of AI workloads. If you profile a model, 70-80% of it is between GEMMs and attention; you have all of the big heavy lifters in that part. So you wanted something that could be easy to program in that space. ThunderKittens was one of those efforts, and I had the privilege to chat with Chris Ré from the Stanford AI Lab and Ben Spector, who worked on it, about expressing a very clean way to program these tensor cores and matrix cores. But ThunderKittens is just one attempt at it, a platform-agnostic attempt. Then you have a similar attempt in CuTe DSL, which does it on the NVIDIA side. And there’s also Gluon, which is like a Triton backend that allows for tile-based exposure, which could, you know, intersect with what CuTe DSL does. But then at AMD we also have some experimental programming languages that we’re playing with, like Wave DSL, which provides the same kind of overlap between thread- and tile-based programming. So there’s a combination of programming languages that are evolving. We still don’t have a clear, like, hey, this is how the future is going to be. But in general, the abstraction is getting higher to make it easier to program. It’s getting more Pythonic. And it’s getting tile-based, if you were to summarize the trends.
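To make the “tiles rather than threads” idea concrete, here is a minimal sketch of a tiled matmul kernel written in Triton, one of the Pythonic DSLs mentioned above. The block sizes, the grid layout, and the assumption that the matrix dimensions divide evenly by the tile sizes are illustrative choices for this sketch, not anything prescribed in the conversation.

```python
# Minimal tile-based matmul sketch in Triton (illustrative, untuned).
# Assumes M, N, K are exact multiples of the block sizes, so no masking is needed.
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program instance owns one BLOCK_M x BLOCK_N tile of C;
    # the unit of work is a tile, not an individual thread's element.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Load one tile of A and one tile of B per iteration.
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + (k + offs_k)[None, :] * stride_ak)
        b = tl.load(b_ptr + (k + offs_k)[:, None] * stride_bk + offs_n[None, :] * stride_bn)
        # tl.dot is the tile-level op that maps onto MatMul / tensor-core units.
        acc += tl.dot(a, b)

    tl.store(c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn, acc)
```

On the host side you would launch this over a two-dimensional grid of (M / BLOCK_M, N / BLOCK_N) program instances. ThunderKittens, Gluon, and Wave express the same tile-level view with different surface syntax.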
James: Okay. So in terms of this Pythonic question and, you know, developer productivity across platforms, do you have any views on that? It’s been interesting to watch. There’s a lot of excitement around Modular and Mojo. Obviously, Chris Lattner has a great track record with Swift, so I think there’s a lot of excitement about what he’s doing, the notion of cross-platform, Python-like syntax, and where that’s going. Is that something you’ve been looking at?
Anush: Yeah, very much so. We work with Chris closely. In fact, I met him last week in Austin when he was there at the ROCm 7 launch event to demonstrate how they had been able to port Modular and Mojo to AMD’s platforms. The way we think of it is we want this innovation to thrive. We want ThunderKittens to be successful. We want Mojo to be successful. We want Triton to be successful. And we want Gluon, and Helion, which is also coming from the PyTorch team. So going back to the original question of what ROCm is and why it’s important: it’s an AI stack for everyone, and it’s open source. All of these components can sit on it, and we will wholeheartedly embrace each one. Whatever gains adoption, we want to be able to enable it, if that’s what customers want to use. So our general philosophy holds: enabling everyone. But Mojo specifically, yeah, we had a big hackathon with them a couple of months ago, and then Chris presented at the ROCm 7 launch, and I think there were some other joint announcements that we were making with them. So we want to be everywhere that is on the forefront of Pythonic DSL evolution. And it’s a trade-off. Usually it’s a trade-off between functionality, flexibility, and performance. But the closer you can get that, and the faster you can get it deployed, the easier it is for actual, you know, consumption.
James: Okay. What about other, more traditional programming languages? Pretty clearly Python has been, I guess, the language of AI. But if we think about the language people are building applications that touch models in, that’s, like, TypeScript everywhere. One assumes... you know, I remember when Big Data came along, everyone was like, oh yeah, Big Data is going to be Java, and then Java comes along. Is it all going to be Python? Or are there some opportunities there? Are you seeing any interesting innovation in and around GPUs and the more traditional programming languages, rather than the things that have just been built, as you said, specifically around tiles, or explicitly designed purely with AI in mind? Does general purpose begin to play a role here, and where does that touch GPUs?
Anush: Yeah, it’s a very good question. So you can skin the problem two ways. One is you start from what is most deployed and try to make that good. That’s incrementalism, which is perfect for certain classes of problems. Then there is the, like, hey, I want to start with a clean slate: what is the perfect way to do it? So Triton and the like are from the first camp: hey, we’ve got Python, don’t try to redo all of its constructs, but try to get GPU programming in. You use some decorators here and there. You still have the GIL you have to work through, and then the Python community tries to work through that issue. So it’s like, yeah, we’ll make good, solid progress incrementally. But then there’s the other end of the spectrum, which is the approach Mojo takes: okay, here’s a clean slate. If I were to build this from the ground up, completely clean, without any shackles, from first principles, how would that look? There are benefits to both. With the incremental approach, the worst case is, oh, you tried something and it didn’t work, but you at least have the fallback. With the clean-slate approach, you aim for something and you start digging a tunnel, and you may end up at exactly the point you want to be, but you may also miss it by a mile. And you’ll be like, okay, you got a great programming language and a great paradigm, but no adoption, right? We can’t really foretell which it’s going to be, but you need both in life. One is the PhD student exploring the art of the possible; the other is, okay, I’m just trying to move the ball forward. The one sets the north star of where you can end up; the other is, okay, I have to be incremental to get to that north star. But sometimes, when it works out, you make this giant leap, and that’s when the bet you put on a clean-slate approach just completely rewrites everything. You’re like, okay, great, all of that incremental work becomes baggage, and you just say, this is the new future. But getting to that new future takes a bit of making sure you’re doing the right thing, and also making sure you’re lucky. You can do everything right and it may just not be what the market adopts, right?
James: 100%. I mean, establishing the —
Anush: — the history of computing. Sorry, go ahead.
James: Well, all technologies, in fact. I mean, the best technology doesn’t always win. And you have to get an awful lot of things right, especially in programming languages and programming terms. You can have something that’s beautifully designed, but without the right frameworks around it, without the right community around it, it’s not going to see adoption. So yeah, that’s exactly right. So, there is this interesting pattern at the moment where, I mentioned Python again, I saw a very interesting conversation actually just this morning on social. Somebody was saying, oh yeah, people say Python isn’t good for production; well, if it wasn’t good for production, try telling that to OpenAI, because pretty clearly it is in production there. And then in the comments somebody pops up and they’re like, I work at OpenAI, and let me tell you, running in production is not the high bar. That’s the floor, not the ceiling. And in fact, I’ve just joined the company to come in and rewrite some subsystems in C++. So this pattern of Python rewrites in C++ that we’re seeing in the Valley, I guess that’s some of the context for some of these efforts that we’re seeing here, going back to Modular again.
Anush: Yeah. I think this Python versus C++ thing... I’ve had this conversation so many times it’s almost formulaic at this point. But my general take on it is: prototype fast with Python, and if you need the last 2% for your MLPerf submission, rewrite it in C++. That’s fine. But in general, if you get 98% of the performance in the first two weeks of doing it, that’s okay. Even when you’re at scale, right? Even when you deploy 5,000 GPUs, or 10,000 GPUs, 20,000 GPUs. If you get to the 95th or 97th percentile... obviously OpenAI and others may be different; they may want the last 2% because that would mean $200 million or $500 million of exposure. But in general, for common deployments, if you’re in the 95th percentile plus, that’s okay to leave on the table for the speed of execution. And that’s why when I say speed is the moat, I always refer not just to how fast your hardware or your software is running, but how fast your entity can adapt to change and deploy change. That is speed too. And that speed favors PyTorch and Python-based approaches. But then you get to the point where you’re at scale and 400,000 GPUs are humming away, and if you make one change... this is like the early days of Google, where they just tweaked how the google.com page loaded, and I think it saved hundreds of CPUs in the backend, because of how you compress the first page to load, things like that. Yes, at that scale it will be important, and those teams will be on the frontier of, okay, should we do a C++ rewrite, or should we do Python with no-GIL support and get GIL-less Python to work, because theoretically GIL-less Python should work. But GIL-less Python is also something you need to get your brain to grok. It’s not just, oh, pthread_create something and magically parallelism happens. Without the GIL, you’ve got to think asynchronously in Python, which is like, oh, I can’t do the normal Pythonic stuff. Sure, it’s got the syntax, but now you’re thinking in asynchronous events and so on, and you’re like, okay, what happened to the Python that I knew? And then it becomes a C++ thread wrapper anyway, and at that point you’re like, okay, what’s the difference between writing it in Python and writing it in C++? Because your application code doesn’t port directly to no-GIL Python anyway; you’ve got to actually architect it well. And at that point it’s as good as C++.
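To illustrate the point that GIL-less Python makes you rethink the architecture rather than just spawning threads, here is a small hypothetical micro-benchmark. The workload, thread counts, and timing expectations are invented for illustration, and the free-threaded-build check assumes a CPython 3.13+ interpreter.

```python
# Hypothetical micro-benchmark: a pure-Python CPU-bound loop run on 1 vs 4 threads.
# With the GIL, the threaded version gives roughly no speedup; on a free-threaded
# (no-GIL) build the threads can run in parallel, but real code usually still needs
# re-architecting (shared state, async boundaries) to actually benefit.
import sys
import time
from concurrent.futures import ThreadPoolExecutor

def busy(n: int) -> int:
    # CPU-bound work that never releases the GIL on a standard build.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(workers: int, n: int = 2_000_000) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(busy, [n] * workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    # sys._is_gil_enabled() exists on CPython 3.13+; assume "enabled" elsewhere.
    gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
    print(f"GIL enabled: {gil_enabled}")
    print(f"1 thread : {timed(1):.2f}s")
    print(f"4 threads: {timed(4):.2f}s")
```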
James: Yeah. Interesting. So you’re still pretty bullish on Python. You don’t think that, uh…
Anush: I think I do not pick sides, and I will put my foot on the gas for every combination in the ecosystem. If it’s Python, great. If it’s Triton and Gluon, great. If it’s Helion, great. If it’s ThunderKittens, great. We’ve just got to invest across the board, because we’re selling shovels in the gold rush. You can’t be like, oh yeah, I’ll only sell you this if you go to this mine. No: take the shovel and go wherever you want. So we will support every combination. But from a technology perspective, I think it’s just speed of execution versus absolute performance, and if you use the data for that, you’ll be able to choose what to do. It’s like, do you want the latest model when GPT-OSS-120B comes out, or some combination, or when a Qwen 480B comes out? How long do you take to get it to production? If you can take it to production in a few days and you’re at the 95th percentile plus in terms of performance, would you choose that, or would you wait a couple of months and get the extra 3%? There’s the other combination too: you deploy with Python first, and then you’re like, okay, now I’m deployed, I’m going to go squeeze out the next 3%. But by then the model changes and you’re like, oh great, now what do I do? I’ve got to go back and implement a new algorithm for this. So, yeah, you’ve just got to be prepared for that change.
James: So you mentioned Qwen. Where are you seeing interesting programming model and kernel innovation, geographically speaking? Obviously you’ve got an international role. We get so used to... Modular is obviously a Silicon Valley phenomenon, or at least a West Coast phenomenon. Where are you seeing interesting pockets of innovation that maybe people aren’t expecting?
Anush: I think China definitely has a big flywheel. It’s very talent-dense and, surprisingly, there’s a big open source push, which is good for everyone. I think it’s great. There are some companies in Korea that have taken on, you know, one or two spots. There’s some traction, or spinning up of interest, in India, but not as much as what I’d say is there in China. And China definitely has a first-principles approach; it’s starting to go down the stack, like how DeepSeek did DPP and was able to hide communication latencies with compute and get creative, rather than just using the stack that was available to them. Those kinds of innovations push the boundary in terms of how we also perceive and interact with the overall ecosystem. Europe has a big presence too. We have some of our folks on the Silo AI team in Finland, and obviously London and the UK have a big presence of models, and France too has a pretty big presence. In all of those places we try to help them achieve their goals with AMD hardware.
James: Okay. So you talk about meeting developers where they are. How do you do that? What does your team look like? There’s only one Anush Elangovan; you can’t be everywhere. What does it look like, engaging with the academic communities, with startup communities? How do you invest and meet people where they are so that you can find these innovations and make sure you’re supporting them? Like the example you gave: who is it that you might send some hardware to? I mean, if AMD wasn’t talking to people at Stanford, that would be absurd. But maybe there’s someone at Carnegie Mellon that you should be talking to, or, you mentioned other geographies, maybe Imperial College in London, or maybe it’s in Paris. Obviously a huge amount of AI innovation is happening in California right now, no doubt, but more broadly too. And of course you also have to support enterprise patterns, not just the web company patterns. So how do you meet developers where they are, not so much in their IDE, but more broadly in the communities they’re in, so that you can meet their needs?
Anush: Yeah, it’s a very good question. The way I think of it is, one, I generally have the philosophy that life should move on and people should be able to do the work whether or not you are there. So I try not to be in the critical path, and I try to be clonable, so that it’s scalable, right? You want the infrastructure behind you to be sustainable, so it’s not, you know, node-locked to one person.
James: By the way, everyone, if you want to schedule some time with Anush, good luck. So yeah, I’m not so sure about that not being node-locked, my friend. But anyway, go ahead.
Anush: Well, my calendar is a little messed up, because a typical day in the morning has four to six conflicts, two or three of which I have to attend. So I try to time-multiplex it and see how I can be in multiple places. But coming back to your point on developers: our goal is to get to 10x the developer outreach for ROCm. Ten times what we have today, in the next year —
James: — what do you have today? What’s your baseline?
Anush: It’s up and coming. The baseline is getting good, but whatever it is, it has to be 10 times more. The good thing is we’re not starting from zero. We have, you know, hundreds of thousands of deployments, and that’s good, but there is a long way to go. So that ties into your earlier question on how we want to reach the developer beyond just code and IDEs, et cetera. We want to leverage ROCm everywhere that a customer has access to an AMD GPU, right? Like, for example, a Strix machine, or your EPYC CPUs. The pervasiveness of AMD hardware is undersold. What we want to do is first enable the software ecosystem on all of AMD’s hardware ecosystem, which is already pretty well deployed, and give a very smooth, common user interface between, you know, using PyTorch on your Windows laptop and running it on an Instinct. Yes, the hardware is different, the power is different, but the basic kernel that you write should generally be able to run in either spot. So this goes back to the other discussion we had on local LLMs and enabling local development. That’s one aspect, where you have your personal AI box or machine or laptop where you can do 80% of everything you want to do. The only thing you can’t do is scale. And then you just say, okay, deploy on the AMD Developer Cloud, and that gives you scale. If we give that experience very cohesively, it starts to create a flywheel in terms of how we can ramp these things up. So the main thing I would like to focus on in terms of metrics is: can we have sticky developers that wake up and go, huh, I have to do this, and the first thing they do is get their ROCm platform running on AMD? And they have to be able to do whatever it is they’re thinking about; they shouldn’t be thinking about anything else. It should be magic down below, and we should enable them to be thinking about the last mile of AI, which is: how do I take this to the end customer? How do I unlock value? I often refer to this as, you know, AI is like the new electricity. And we’re literally in the early phases of, oh, we found electricity, what do we do now? Turning on a light bulb is what we’ve done so far.
James: We’re still making the frog legs twitch.
Anush: Yeah. And so what we want people to be thinking about is not, oh, this is AC, this is DC, it’s transmitting over a thing. We want them to be thinking about, oh, this is how you build a motor, and this is how you put the motor on an axle, put it on a wheel, put four of them together and you get a car. And then you put it into a locomotive and you get an electric train. There are first-order, second-order, third-order innovations that haven’t even been imagined yet, in my opinion, that we want people to be focused on. Not the fact that, hey, I’m using AI. Sure, that’s fine. But we want them to look outward, toward the unknown, in terms of what they can get to and innovate on. We just become the backend. Like I mentioned, it’s building and making the shovels so that everyone —
James: Make those shovels. So, it’s always a pleasure to talk to you, Anush. We’re definitely going to get to that two-hour podcast at some point when we can get it scheduled, because I always find it fascinating talking to you. But today I think we’re going to cap it there: a really interesting conversation about the future of programming on the GPU, where we are as an industry, and where we find innovation. So thanks to you, Anush, for joining us, and thanks to all of you for joining us. Don’t forget to like, subscribe, share this if you found it an interesting conversation, and comment. Once again, Anush, great talking to you, and a great show. Thanks a lot.
This video was sponsored by AMD.