A RedMonk Conversation: AI and Training Data – What You Need to Know


As AI continues to reshape the technology industry, it’s important to understand its foundations. In this RedMonk Conversation, Stephen O’Grady and Poolside co-founder Jason Warner get together to discuss training data. The conversation covers questions that range from ownership and licensing to accessibility and inspection, and anyone interested in better understanding the roots of AI and the vital importance of training data will find it of interest.

This was a RedMonk video, not sponsored by any entity.

Rather listen to this conversation as a podcast?

Transcript

Stephen O’Grady: Good morning, good afternoon, good evening. I am Stephen O’Grady with RedMonk, and I’m here today with Jason to talk about some AI stuff. Jason, you want to introduce yourself?

Jason Warner: Absolutely. Good talking to you again too, Steve. So everyone, I’m Jason Warner. I’m one of the co-founders over at Poolside. We’re doing the world’s most advanced AI for software development. Before Poolside, I was CTO at GitHub for four years, two years pre-acquisition by Microsoft, two years post. Before that, similar type of roles at Heroku, which is a platform as a service now owned by Salesforce, and before that, Canonical, the people who make Ubuntu Linux. So, I am pretty not handy around anything that does not involve a computer.

Steve: And you’ve been around a bit. So you have some opinions, which we’re about to get into. The big reason I want to have you on is that AI, I think for a lot of folks, is just taking over the industry. There’s all this talk and all this attention, all this time spent. So much of it focuses on models or demos or GPUs and so on. One of the things that, at least for me (I won’t speak for you), can get lost in the shuffle is the training data and specifically questions about it. What is that data? Where did it come from? Who owns it, how is it licensed? What are you permitted to do with it, what are you not? And so on and so on. We’re seeing some of those questions surface in the finalization of the definition of open source AI, and that will be one opinion on it, certainly. But given that I know you have opinions on the importance of training data, let’s just start there. For you and for Poolside, what is your position with respect to training data?

Jason: So generally speaking, this is so semantical. Definitions and Twitter debates about what words mean aside, let’s just put all that out there so we don’t get into all these really esoteric discussions. I will say this, that if somebody is not showing you their training data, they are not open by whatever definition we’re going to try to use here. But what we are doing at the moment is saying, hey, here are the weights to a model, therefore it is “open source,” and we’re going to go use that. That’s essentially useless. That is the worst possible definition of open. That is effectively, if we go back generations, and you and I, Steve, fortunately are old enough to go back and do this, equivalent to releasing a .exe and saying that’s open source, or the compiled binary and saying we’re open source, and none of us would be fooled by that, but in this space, we’re being fooled by that.

Steve: Yeah, yeah. And I mean, I think actively fooled and, you know, sort of as you and I have talked about before, and I won’t ascribe any opinions to you, but I have strong ones on the use of open source and definitions and so on here. And it is enormously frustrating, you know, because we’re still working on what the definition is and what that means, and they’re hard questions as we’ll get to in just a minute. But I think it’s pretty easy in a lot of cases to say what is not open source. And there are a majority at this point of entities which are going around, in my opinion, abusing and misusing and misapplying that term. And to date at least, the press has been sort of credulous and gone along with that and said, oh, hey, so and so told me this is open source. Of course it is. And yeah, and then we sort of end up in a world where, to your point, like an .exe is sort of open source, which it’s not, it’s not how this works. Anyway, so I don’t want to get you into trouble there with my own opinion.

Jason: This is not a topic I think is actually going to be getting anybody in trouble. I think this is stating what all of us know. But for various financial reasons or marketing reasons, the entire industry is trying to go down a certain path. And we know the players involved, we know the people who are doing whatnot. And we also know the disadvantages that are put out there by saying, here’s our open training data. But this is the thing. All of us have the same data. We’re all pulling down Common Crawl, we’re all pulling down open source repos. We’re all filtering similarly, not identically and all that. Now, the weighted distributions of the models might be different and all that, but really what it comes down to is where people are trying to infuse gray market, black market, or private data. So some of the larger entities are putting stuff in there. This is why they won’t show the data sets to folks. One thing at Poolside we’ve talked about all the time. We’re selling to enterprises, we’re going to make this available for everybody on the Internet, and it’s going to be available as Poolside cloud.

Everyone’s going to use it. We’re going to power things like the Devins and the Replits and the Codys of the world, people who use other models, etcetera. But we’re also going to sell to enterprises and governments. I’ve gone to every single enterprise and I said, listen, hold me to task. You need to be able to inspect my dataset. A lot of good valid reasons why, and I’ll walk you through, if you’re not asking this question of other vendors, why you should be asking this question of other vendors.

Steve: Which actually was a perfect segue in fact, because that inspection is something I wanted to get to. So one of the things with a lot of these training data sets, as we all know, is that some of them are enormous. They’ve been trained on the Internet or some subset of it. That brings up two questions. One like, practically speaking, logistically speaking, how do you make these things available? And then sort of more specifically, even if you can do that, if I’m a customer and I want to inspect it, practically speaking, how do I do that? How do I introspect this database or data set, I should say, of just enormous size?

Jason: Yeah, there’s an impracticality to what I just said. Now what I’m trying to show these vendors is that I actually am open. I am willing to turn over this manifest view so that you can go and do some work here. I also think that there’s going to have to be some companies, third party vendors, some auditor type of things that enter the conversation and are able to do that for folks. But it’s also similar to how we say, hey, Linux is open source and what, 200 people on the planet actually go and read the source code to this as well. That said, it’s possible to go do this. And so there’s varying degrees of a spectrum of I’m willing to go do it versus I’m not willing to go do this. So there’s impracticality. This is much denser than what even the Linux source code is going to be if you’re training on trillions of tokens and all that sort of stuff. But I still think you got to be able to open it up. I still think they should be able to spot check and sample. I also think that there should be some new pieces of software, some new vendors that pop up and help people navigate this.

And I’ve seen a couple of startups that actually do this now. It’s like we’ll help you understand what training data is coming into your own models if you’re building internally, but also from vendors who are open about their manifests.
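
Editor's sketch of the spot-check-and-sample idea above: a manifest behind trillions of tokens cannot be read end to end, but an auditor could draw a uniform random sample of entries and review those by hand. The example below uses reservoir sampling (Algorithm R), which makes a single pass and keeps only the sample in memory. The manifest format, a stream of dictionaries with a `source` field, is a hypothetical assumption for illustration; the conversation does not specify how Poolside or any other vendor structures a manifest.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(entries: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k entries chosen uniformly at random from a stream.

    Works in one pass without knowing the stream length in advance,
    which matters when the manifest is too large to load into memory.
    """
    rng = random.Random(seed)  # fixed seed so an audit can be reproduced
    sample: List[T] = []
    for i, entry in enumerate(entries):
        if i < k:
            # Fill the reservoir with the first k entries.
            sample.append(entry)
        else:
            # Keep the new entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = entry
    return sample


if __name__ == "__main__":
    # Hypothetical manifest: one record per training document.
    manifest = ({"source": f"repo-{n}"} for n in range(1_000_000))
    for record in reservoir_sample(manifest, k=5, seed=42):
        print(record["source"])
```

Fixing the seed is a deliberate choice here: a vendor and a customer running the same sample over the same manifest would get the same entries, which makes a spot check repeatable.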

Steve: Okay. Yeah. So in other words, I mean, almost, what we’re talking about, to your point around Linux, is that it’s almost the optionality, right. In most cases you’re not going to leverage it, but you need the ability to if possible.

Jason: I think it’s also, it’s a slight orientation difference too. Which is, if you walked over to some of these people who are claiming to be open as well about their, let’s call them their models here. They’re open about their models and you say, show me the manifest. Ooh, hang on a second there. That’s because the way these things work, you can hide a lot of stuff in there. There’s all the conversation around what is actually gray market data out there that other people are using, whether it be copyrighted from something else or a torrent of books or whether or not they’re only using open access or maybe slightly closed, all that sort of stuff. And the fact is that you can’t ever find it without the open manifest, but it’s possible to find it if the manifest is open.

Steve: Yeah. Okay. Yeah, that makes sense to me. And like, again, I think really what we’re talking about is the optionality here. Right. You know, preserving that sort of, you know, to the extent that it’s logistically possible at least. All right, so last question before I get you out. You know, so there’s a lot to this, obviously. There’s the questions around access, availability, the technology to introspect it and so on. What other, you know, when you talk to your customers, others in the field, what are the other things that you think they need to think about with respect to the training data itself? I know you and I have talked about this before, and I know you have pretty strong opinions on benchmarks as an example. But if I’m an enterprise and I’m sitting here thinking about, alright, I’m going to consume XYZ model, what are the other pieces here that I need to be thinking about with respect to what it was trained on?

Jason: Yeah. So I do think that…I think benchmarks are the worst possible thing that we’ve ever done in the industry, like speed bench and that sort of thing, largely because most people train on those. And so we call it training benchmark abuse. People are abusing these things to train them and get them over the line. The one that freaks me out the most, though, is… there’s two. The one that freaks me out the most is one that Andrej Karpathy talked about some from the Anthropic paper, which is the sleeper agent one. That is the one that keeps me up at night the most. Now that is a theoretical vulnerability vector that I don’t know how real it has been, but that being said, you know my history, I’ve been in the open source world for years. I assume you’ve been compromised, I assume all these things so that you act like you are, so that you can build the proper defenses against it. So we should actively assume that those things are happening at the moment so that we can build the defenses against it.

So that’s one. Most people in the world won’t care what I say next. But this is actually critically important for enterprises in particular. But in a weird, roundabout way, as you and I, two old open source heads — the enterprises have to be concerned with the more viral license types that are happening in these models because there’s a whole bunch of tools in last gen that have been put in place to give them safety and security, and you can still use those, but in this case it’s a very different risk factor for them on the commercial and the viral license types. And so you have to start asking questions around that as well.
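
Editor's sketch of the license question raised above: if a vendor's manifest records a license per training source, an enterprise could triage entries into copyleft ("viral"), unknown, and everything else before handing the flagged ones to legal. The manifest fields (`source`, `license`) and the use of SPDX-style identifiers are assumptions for illustration, and the prefix list is deliberately short and not exhaustive.

```python
from typing import Dict, Iterable, List

# Illustrative only: SPDX identifier prefixes for the GPL family.
# A real review would use a fuller list agreed with counsel.
COPYLEFT_PREFIXES = ("GPL-", "AGPL-", "LGPL-")


def triage_licenses(manifest: Iterable[dict]) -> Dict[str, List[dict]]:
    """Bucket manifest entries by license risk.

    Entries with a missing or empty license go to "unknown", since an
    undeclared license is itself something to ask the vendor about.
    """
    buckets: Dict[str, List[dict]] = {"copyleft": [], "unknown": [], "other": []}
    for entry in manifest:
        license_id = (entry.get("license") or "").strip().upper()
        if not license_id:
            buckets["unknown"].append(entry)
        elif license_id.startswith(COPYLEFT_PREFIXES):
            buckets["copyleft"].append(entry)
        else:
            buckets["other"].append(entry)
    return buckets


if __name__ == "__main__":
    # Hypothetical manifest entries.
    sample_manifest = [
        {"source": "repo-a", "license": "MIT"},
        {"source": "repo-b", "license": "GPL-3.0-only"},
        {"source": "repo-c", "license": None},
    ]
    result = triage_licenses(sample_manifest)
    for bucket, entries in result.items():
        print(bucket, [e["source"] for e in entries])
```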

Steve: Yeah man, that’s the thing is that you love to see sort of the collaboration and the energy and so on in some of the communities, like Hugging Face and so on. But oh man, I mean, just go and read some of those licenses and you’re like, oh, oh wow, has anybody ever looked at these, like who wrote this? And so on…

Jason: This is another strongly held opinion I have here, which is that we basically have entered into a generational gap between, let’s call last generation open source and current generation open source. And what we’ve not done is we’ve not understood that they’re pain transfers. So we’ve already been through these lessons, we’ve been through these multiple times. But we’re going to have to learn these again for a generation of folks that are building in a net new domain. It’s kind of like all platform shifts go through the same machination problems. We don’t learn the lesson. We are still funding companies that are effectively building on other people’s platforms, not realizing that those companies are dead already. We just don’t know it.

Steve: Yeah, yeah. Well, we’ll have to have you back on to talk about that. But the licensing thing in particular, it’s so frustrating because we’re just going to go through and have to relearn these things where it’s like, hey, we figured this out. This is a problem. This is how we solve the problem. And it’s like they’re going to speed run essentially all the mistakes that have been made.

Jason: We’re going to speed run them. And this is interesting, and they’re going to smack headlong into GCs. They’re going to run headlong into corporates who are saying, f you. We’re never using your product. There is zero chance we’re going to do this. And people will try to explain away why it’s fine and it’s not going to work because it never works. You have to understand the concern on that side of the fence and adapt to it.

Steve: Yeah. You really have to work backwards with what’s going to clear legal, unfortunately, so. All right, well, this has been great. I’m sure we’ll have you back on to talk about, well, all of the things involved here because this is a space that (a) is moving quickly and (b) has, I don’t know, as far as I can tell, an infinite number of moving pieces. But for today. Yeah. Really appreciate you coming by, Jason. Thanks.

Jason: Thanks for having me.
