A RedMonk Conversation: Why Dr. Scott Stephenson is Building Voice AI


In this RedMonk Conversation, Dr. Scott Stephenson, co-founder and CEO at Deepgram, discusses AI and audio with Kate Holterhoff, senior analyst at RedMonk. They chat about the history of speech recognition as well as current advancements and their potential impact on culture and industry. Kate and Scott speak about voice’s emotional resonance, the future of AI voice agents, and the role synthetic data and active learning will play in training models to understand a range of speech differences and languages.


Transcript

Kate (00:12)
Hello and welcome to this RedMonk Conversation. My name is Kate Holterhoff, Senior Analyst at RedMonk, and with me today is Dr. Scott Stephenson, Co-Founder and CEO at Deepgram. Scott, thanks so much for joining me today.

Scott Stephenson (00:23)
Thanks for having me on.

Kate (00:25)
Awesome. I love having academics on the podcast, current and those who have left, because I feel like we can really vibe about the dream and realities of working in the ivory tower. So I’d love to start off with your history. Can you talk a little bit about your work and educational background?

Scott Stephenson (00:42)
Yeah, so I got a bachelor's and a PhD in physics, and I built deep underground dark matter detectors while I was doing that. About two miles underground. Basically it looks like a James Bond lair, with the yellow railing and cranes and danger and everything.

And there were always all sorts of people working in the background too, because literally a few months before, it was just a tunnel from a tunnel boring machine. Then we blasted it out with dynamite and we had to do everything. We had to do the HVAC, we had to cool it down, because deep underground it's hot. We had to run internet to it. We had to do all sorts of things, but this is what I was doing during my PhD. And so we got that lab going.

And once you actually start taking data, then you have a huge pile of it and you need to use compute resources to see what's inside, essentially. You get many terabytes of data, and only a few of them are going to be interesting, so you have to write these analytics backends to figure out whether anything interesting happened inside. And it's very interesting, actually: the data that you look at and how you get the results is very similar to what you do in audio. You actually look at a waveform. If you slow down the sample rate, then you can play it and listen to it if you wanted to. And all these things started to connect in our brains to audio, thinking, whoa, everything that we're doing here in particle physics could help in speech recognition or speech generation or speech understanding, or finding a needle in a haystack of recorded audio, that type of thing. And that led us to building Deepgram, which we started nine years ago, and we were the first to do end-to-end deep learning for speech.

And now we're the leading speech-to-text API in the world, supporting over 30 languages. Last year we also released our text-to-speech model. And we now support 500 different brands across the world, powering their voice interactions in their applications.

Kate (02:42)
Well, that makes a lot of sense in terms of why you chose to leave academia then. You saw a practical application for some of the work that you were doing that was really exciting, something that no one else had really grappled with, with the efficacy that you were able to, based on this experience of digging deep underground and learning a little bit about how particle physics maps onto the waveforms of sound. Do you miss any parts of academia? Did you teach, or were you purely on the research side?

Scott Stephenson (03:09)
So I was doing a lot of research and a little bit of teaching. I do love to teach. And there's an interesting part to starting companies, actually, where you do get to teach. You have a different kind of audience, though: it's the people that work at the company, or it's your customers, that type of thing. But there are some interesting itches to scratch. When you're trying to do fundamental research like we were doing with particle physics, you're trying to find a new particle that nobody's ever seen before. And it's actually interesting what we're doing with Deepgram. We are a foundational AI company. We build all of our models from scratch. We host our own data centers. We label our own data. We create our own synthetic data, et cetera. If you were to talk to the research team now, a lot of them would say, you know, we're discovering the fundamental laws of intelligence now. So it actually does scratch that itch. And it's a very fun place to be.

So I do miss a lot of the physics side, and I actually feel like I'm missing out a lot. There are a lot of things that have happened in the last few years and I'm just kind of too busy to follow along. The world keeps moving on that side and we're doing things in AI instead of keeping track of what's happening at the cutting edge of physics. But luckily, I think the style of company that we built, a research company that productizes as fast as possible, does scratch a lot of that itch on the discovery side.

Kate (04:31)
Right, yeah, I see what you mean. But I love that you take it to this high-level place of grappling with what intelligence even is. I mean, as an academic, I feel like that is certainly an aspirational goal that all of us are striving towards, right? We're upholding the light of learning, right?

This is the ideal: when I wear my corduroy blazer with elbow patches, this is what I want to be thinking about. So yes. But I also know that running a company is a full-time job. And so I get why, much as you appreciate what they're doing and are excited by it, your path diverged a bit. You're funded by Y Combinator, is that right?

Scott Stephenson (05:06)
Mm-hmm. Yeah, Y Combinator, NVIDIA, Madrona, which is a Pacific Northwest venture capital firm. I don't know, we have so many different investors in Deepgram now. Wing led our Series A. We've raised over $86 million at this point. So we have quite the fun collection of investors, with all sorts of experience. Some are on the strategic investment side, like NVIDIA, obviously a big piece of the AI puzzle, building the hardware for it. They're great partners to Deepgram, but now we have great partners in other areas too that maybe soon will be investors in Deepgram as well, because I think everybody's starting to realize that there's a strategic partnership and infrastructure side to building all of this, and voice is one key component of it.

Kate (05:58)
Right. Yeah, and I definitely want to dig down into that particular component. That was certainly what inspired me to invite you on the show, because I think we're at the point in the AI hype cycle of differentiating the different parts of this. Instead of just saying AI, now it's important for us to think about specifics. And it seems to me that text to speech and speech to text and that voice assistant domain is one that is extremely important.

So I would love your high-level perspective on why this domain is so exciting to you, and maybe a little bit about the history of it. This didn't just come out of nowhere. I'm thinking of two different threads here. One would be IBM and William Dersch investigating speech recognition in the 50s and 60s; natural language processing has been sort of a gold standard for the sort of work that a lot of the folks who built the foundation for AI have been working on forever. But there also just seems to be this broader interest in it. When I think of movies about computers from the 80s or even the 2000s, I'm thinking of not only Tron but Spike Jonze's Her, it's all about talking to your computer. And so it's so far-reaching. I'm just curious about your perspective. What drew you to it? How would you frame the history of this entire domain, this industry?

Scott Stephenson (07:17)
Yeah, it's a very interesting area because it's what we think of as pretty uniquely human: being able to understand and speak and come up with a quick response and all of that. But going back to, I think, the 20s, 30s, 40s, there were already scientists thinking about computing machines, and there were scientists thinking about how to create human speech or how to understand it. And obviously back in those days they had to use very simplistic analog methods to get any result. It's very cool, there are some videos you can look up on the internet of people using these keyboards where you would press certain keys in a specific order, and how hard you pressed them affected it as well, and you could voice digits, like one, two, three. But this is the 30s or something. It goes back a really long time. So people have been thinking about this probably for thousands of years, but certainly for at least 100 years, and about how to produce it with machines for at least 100 years. But it took a long time for everything to get going. We now know that's because of the data scale that you need, and the compute scale and the memory size and all these things that you need.

But I think it's still important to acknowledge that what they were doing back then is still pretty similar to what people do today, just with more memory, more compute, more data, and more know-how to get it working. Those pioneering scientists were really onto something, and now you just extend it with the right technology and the right scale, and then it obviously works.

And to talk about audio versus some other things like text or video: audio is interesting because back in 2012, 2013, '14, '15, when people were thinking about AI, they weren't thinking about text, they were thinking about images at that point in time. So this is self-driving cars, bounding boxes drawn around people's heads, you know, identifying your friend in Facebook, that kind of thing. These were all, wow, look at this, this is amazing, how does it work? And then I think 2017, 2018, 2019, that's when text started to come into play, right? And then obviously 2021, 2022, et cetera, with ChatGPT, the volume was cranked to 11 on text. So images had had their moment, text had had its moment, but the one that's left is audio, and audio still hadn't had its moment up until just a few months ago, when OpenAI released a demo of a voice system working and people said, whoa, we're getting into very human-like territory here.

And I think voice and audio is pretty special in that regard, because you can see text, you can see video, you can see images, but you can't see audio. It fits into a person's brain sort of in a different spot, where if it's not very high quality, like the voice that they hear or the recognition that is happening, et cetera, then they just write it off as not working. But for images or for self-driving cars or that kind of thing, they give a lot of leeway, because it's actually entertaining for humans to look at the results. They can see the pictures, they can see all sorts of things. So this has actually kind of artificially held audio research back, and audio funding back, and everything. But now folks can see the result, because you put everything together: you put speech to text along with an LLM, along with text to speech, and you put it together and then you have this expressive agent that you can talk to. Now the volume is instantly up to 11 for voice as well. So there's this long-running research thread that has built the foundations to get it all working, but there's also the human psychology, where just the right moment happens as well. And now they're coinciding, and this is why it's everywhere in the last few months.
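To make the pipeline Scott describes here concrete, below is a minimal sketch of one agent turn that chains speech-to-text, an LLM, and text-to-speech. The `transcribe`, `generate_reply`, and `synthesize` functions are hypothetical placeholders for whichever providers you plug in, not any specific vendor's API.

```python
# Minimal voice-agent loop: speech in, text reasoning in the middle, speech out.
# transcribe(), generate_reply(), and synthesize() are hypothetical stand-ins
# for a speech-to-text API, an LLM, and a text-to-speech API respectively.

def transcribe(audio_bytes: bytes) -> str:
    """Placeholder: send audio to a speech-to-text service, return the transcript."""
    raise NotImplementedError

def generate_reply(history: list[dict], user_text: str) -> str:
    """Placeholder: send the conversation so far to an LLM, return its reply."""
    raise NotImplementedError

def synthesize(text: str) -> bytes:
    """Placeholder: send text to a text-to-speech service, return audio bytes."""
    raise NotImplementedError

def voice_agent_turn(history: list[dict], audio_bytes: bytes) -> tuple[list[dict], bytes]:
    """One turn of the agent: hear, think, speak."""
    user_text = transcribe(audio_bytes)                      # speech -> text
    history = history + [{"role": "user", "content": user_text}]
    reply_text = generate_reply(history, user_text)          # text -> text
    history = history + [{"role": "assistant", "content": reply_text}]
    reply_audio = synthesize(reply_text)                     # text -> speech
    return history, reply_audio
```

In practice, production voice agents stream and overlap all three stages to keep latency low enough to feel conversational; the batch version above is only meant to show the shape of the loop.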

Kate (11:26)
Wow, OK, yeah, you're kind of blowing my mind here. And I want to dig into something that you mentioned, which is OpenAI, because I think their relationship to audio and the large language model revolution has been unique. I'm thinking of Sam Altman: he tried to recreate Samantha from Her by licensing Scarlett Johansson's voice, and of course that story is a little bit more nuanced depending on who you hear it from. But I think what it points to is this idea that speech is extremely intimate in a way that is so different from images. So I'm curious what your take is on that. Why do you think there is this emotional resonance from voice that we don't get from the uncanny valley of humanoid robots? Is voice going to help us feel better or worse about having this computer that is going to be the interface to so many systems that we already use?

Scott Stephenson (12:19)
Yeah, I think it's video and voice, and the thing that's common between them is that it's a real-time evolving system. And it really taps into our senses as people. Like I mentioned, we are programmed over the millennia to notice sights, to notice sounds, et cetera. And once you start to get into territory where it's not just a joke anymore, where it's wow, this is starting to sound real, this is starting to sound serious, it really heightens your attention as a person, and then you start to think this could actually be useful. There are different phases to the hype cycle and how everything gets adopted.

I like to think of a physics example here. When radioactivity was discovered, it was looked at in scientific circles and used in many different ways. But then as the general population became aware, maybe not of how it worked, but of the powers, the sort of spooky thing that was happening with radioactivity, they started to make medical elixirs that you could drink that were literally glowing. All things that we know now would be, that's a horrible idea, don't drink these radioactive things. But folks back then would say, this is a new substance that we don't know about, let's inject it into everything, basically. Fast forward 100 years, and actually radioactivity is not bad for medicinal purposes if administered in the right location. You can treat cancer with certain radioactive therapies, but you don't just drink radioactive material and that solves your problems.

It's a similar thing with AI now, with voice, with video, et cetera. And I think it gets encapsulated in some sayings that already feel old even though they're only a few months old, like: AI is not going to replace your job, somebody who knows how to use AI is going to replace your job. That type of trope is going to be just a hundred percent real. And so folks now can start to latch onto, wow, that video looks really real, or wow, that voice sounds just like that person. But machines never get tired.

Relate that to what happened with factories and, say, wagon wheel production. Wheels were a massive productivity gain to the world, but they were all produced by artisanal people before. They had to go steam wood and bend it into a hoop and do all of this, and they could only work eight hours out of the day, and they rested on the weekends, and all of that. But pretty soon, once you figure out metallurgy and stamping machines and you put this into a factory, you can stamp out a wheel every two seconds, right? The same type of thing is going to happen with intelligence. Machine intelligence is being invented right now. It's not just artisanal intelligence anymore. And I think everybody is sort of realizing at the same time that this is happening: hey, there's going to be machine intelligence, and factories producing intelligence, doing a whole bunch of work. I don't know where I fit in, but I know that it's going to happen, because look at how real it is. I can see it in front of me, I can hear it, et cetera. And then that just cranks up to fear, right? How is it all going to work out?

But go back to that saying: AI's not gonna take your job, somebody who knows how to use AI's gonna take your job. So people are gonna have to learn new tools, they're just not exactly sure how to do it right now. Nevertheless, this is gonna be a time of education in our lives, just like how many people still alive today remember what it was like to first use Google, or first use a computer, or whatever it is. There are similar moments happening here with AI. But I think there's something different from just using ChatGPT, talking through text, that kind of thing. That still feels a little bit separate; it's like you're sending it out somewhere else, it's not really in the human sphere so much. But when you hear the voice talking back to you, when you see videos generated, et cetera, then it's going to be, whoa. Okay. You're making synthetic humans. You're replicating humans now, and what does that mean for the world? And I think it's all just happening at once now.

Kate (16:20)
Right, so the Rubicon that you're describing, the one we are now crossing, is not necessarily AGI, but just the fact that all of these AI agents and technologies and tools are going to be part of our workflows, our day-to-day lives. It's going to be like air: we're surrounded by it, so we don't even notice it.

Scott Stephenson (16:38)
Absolutely. And we're all going to have to retool. We have a great research team at Deepgram and a great engineering team, and all of these folks are at the cutting edge of AI, and even they have to sit back and think: the old way that I was doing things, writing everything from scratch, figuring out how I'm going to train this model, et cetera, why don't I just make an agent that does it for me? And then I'll be the critic, basically. I'll say that's a good job or that's not a good job, in that mode of operation. So even the folks that are creating this technology have to step back and think about how they can use the technology they're creating to make themselves more efficient too. Everybody has to retool. There's no select group that already knows how to do it. Everybody is figuring out how to do it.

Kate (17:31)
Well, let's talk brass tacks then. We're a developer-focused analyst firm at RedMonk, so I suspect many of our listeners are going to be curious to hear more about the role of speech in their domain. So what should a software developer pay attention to when it comes to integrating voice AI platforms like Deepgram into their apps?

Scott Stephenson (17:49)
Yeah, there are really two categories. One is: are you building a voice-centric app, a new application, a new experience? Are you trying to invent something that hasn't happened before? If you're doing that, then you're going to have to really get into the weeds. And yes, you'll stitch together components of other systems, but you'll probably have to invent some things to make your system work, because these products and tools just haven't been invented yet.

But there's another category, which is: I'm going to partner and use what is already built, and then I'm going to go build applications with that. So for instance, just low-hanging fruit here: say you had an FAQ page before, but now you've converted it into an LLM bot where you can ask it questions. Well, now you'll just add voice to it so you can talk to it. These seem like really simple add-ons, but every company is touched by it, every product is touched by it, and they will be a big business by themselves. It's kind of like the old joke with SaaS, where you just take whatever's offline, put it online, and make a billion dollars. A similar kind of thing is going to happen here. Just take something that isn't voice enabled, make it voice enabled, and that will make it stand out versus others. You've created this new experience just by adding voice to it, but if you get that experience right, then this is a product that will win in the future.

There's a story around this that I think is pretty enlightening. I'll start with a rhetorical question, but I'll answer it as well: why did Google start doing voice? Why did they do that? They started doing this back in the 2000s. They started a service called Google 411, where you could dial in and talk to an operator and say, I would like to talk to Joe's Pizza, can you connect me? Basically, they were an operator, the old style of dialing zero, getting an operator, and having them help you get what you're looking for. But why did Google do that? Because they knew that voice is one of the major modalities people use. Sure, you can click things and you can read and do text, but the other major modality is voice; there are really just the two of them. They wanted to get started in that, and their research team wanted to get started in it, but they didn't have data. So they created this system to record these phone calls, to figure out what people were asking for and what kinds of things they would say by voice, and to gather data for it, and then over time, over two decades now, produce a system to voice-enable search.

I can say more about that, but I think it's important to see the text domain as the easier one for the technology we've had up until this point in time. Now, though, voice is going to be good enough for voice interfaces to be prevalent around the world. So take any service that is text-only now: if somebody who is competing with them offers a similar text service, but voice enabled, then there's a massive attack vector on Google, on ChatGPT, on others. This is why they're doing voice. This is why OpenAI is doing voice. It's because this is an attack vector. These companies talk about screens: you have your desktop as a screen, your laptop as a screen, your iPad as a screen, your phone as a screen. Well, the next screen is no screen, it's voice. And some hard metrics you can look at here are Google's searches: what fraction of them are spoken rather than typed? It's about 40%. So it's a massive attack vector, and if somebody gets voice right, that becomes an attack vector on that side. This is why these consumer, traditionally text-based systems think so much about voice.

But I want to contrast that with B2B, because for all of the things I just listed, taking something that didn't have a voice mode before and adding one, companies will need some system to do that. Trying to build it themselves: not a good idea. Trying to use open source stuff to do it: not a good idea. Going to Google or OpenAI or something like that: they could help you get to a good demo, but all of the controls that you would want, guardrails on the system, control over your cost and COGS, the ability to adapt it and train it into your own personality, basically, that's not going to be available from them, because they're not thinking about building that for you, they're thinking about building it for themselves. They want to make a good interface to go compete in their domain. If you want to buy their system, they're happy to sell it to you, but they're not going to specifically say, hey, we're building this for B2B, and make it adaptable and controllable. So this is the sort of world that is developing right now. All of these companies and products need their own voice interface, but the ones we might come across from the consumer brands like Google or OpenAI are not going to cut it for what they need. So there's this huge demand to actually build those. And this is where companies like Deepgram or Anthropic or Cohere, et cetera, these enterprise B2B AI companies, start to come in and fill the gap in the text domain and the audio domain, all partnering together to fulfill that need.

Kate (23:09)
It's really interesting to me how speech, and this rise in interacting with our computers using our voice, is changing behavior. I'm thinking of my mother-in-law. She's from Pittsburgh, and when I hear her speaking a text instead of typing it out, she articulates a lot more clearly than she would normally. You know, there's no yinzing in her speech when she's dictating. And I think we all do it to an extent, right? We really try to articulate clearly. And I think that's just sort of a microcosm of how important transcription is now, the fact that we want to make sure that what we say is recorded clearly. And then of course, when we get these automated phone calls, which I assume are increasingly generated by AI in some capacity at least, they sound lifelike. So there seems to be such a push to move these technologies from the B2B use cases that you mentioned to end users, to popular culture, in ways that I haven't seen before. And it is moving so fast. As someone who has an inside view on that, I'm interested in what you consider to be the future. What's the next horizon that you're looking towards in the next, I don't know, six months, a year, 10 years?

Scott Stephenson (24:24)
Yeah, there will be waves, and I think a great way to analyze this situation is to look at certain pinch points or leverage points and just ask: how does a human do it? Whatever goal you're trying to accomplish, how does a human do it? Because that is the number one way it's going to be done, probably for the next five to ten years. And let me be more clear about what I'm saying there.

For instance, say you'd like to create an AI SDR, a sales development representative, somebody who reaches out to different people to figure out if they're a good fit for buying a product at a certain company. This is something many SaaS companies have in their arsenal to go get new customers. SDRs have a long list of tools that they know how to use, and the way they use those tools is by looking at a screen, moving a mouse, clicking on their keyboard, and talking through their microphone. Those are pretty much the inputs. So if you would like to put data into Salesforce or another CRM, you could interact with the API, but APIs generally don't have perfect symmetry with the user interface, with what the user is experiencing and doing. Many times there's some way to do it in the API, but the first move for many businesses will be: can you just fill out this form for me?

So let me paint a quick picture. An SDR has an introductory call with a lead. What do they do in those calls? They typically ask: what are you doing now, what kind of pain do you have, that kind of thing. Afterward, that SDR has to go fill in all that information in the CRM, and probably they do it in a very lossy way, because they're also trying to be an affable, nice person on the call, and they can't necessarily keep all of this information in one place. One thing that would help them is to have a bot listening in on the conversation, taking the notes for them, and filling it out. Yes, you could make that work with the API, but what I'm saying is that in most circumstances over the next five to ten years, it will actually be done through the human interface. Some people call this robotic process automation: you just think like a human, type like a human, go to the field, literally type into the field, go to the next field, type into that field, et cetera.

I think the reason this is going to be very important for the first phase of AI is because it's relatable and it's debuggable. If everybody had to be a programmer and know how to use APIs to actually make AI work, that would hold back the opportunity here. Now, don't get me wrong, the folks who know how to use APIs and know how to program will probably be able to do a better, less error-prone job, that type of thing. But it kind of won't matter, because the value will be so high to go through this robotic process automation way of doing things. So there will be a phase where you just ask: how does a human do this? Okay, I'm going to build a system that just mimics a human, essentially. It goes and types things in, it moves things over there, it goes to the Asana board and moves the card to the next step, it does whatever it's going to do.

And then there will be another phase after that, where really AI-native systems are built. We as humans have to have information organized in a certain way in order to understand it. A good analogy: if we look at a screen, there's only so much real estate there, and we can only look at so many things. We can't look at 15,000 web pages at once. We just can't do it, we can't read it, we can't look at them all at once. But AI can. So there's going to be a new way to represent everything, but that will take a lot of trust-building with the folks who are using that type of technology, and it's going to be a much slower build. So the first big bang is going to be: what would a human do? How do I do the simple things? How do I automate them? How do I get the agents in there doing the job? And then humans essentially do QA on top of that.

Then there will be another version of this, probably starting three to five years from now and really coming into its own ten years from now, where there are AI-native systems that humans don't necessarily interpret themselves; it's all jammed into the stew, basically, but you have tools that let you figure out why the system was reasoning the way it was, that type of thing. All of these observability tools for AI have to be built, and they're not built yet. And the way you're going to get observability at first is by doing it like a human would, because all the observability tools for humans are already there, essentially. That's not audio specific, that's any AI system, that's how it's going to go. Audio won't be any different. Call centers will treat virtual agents just like they do human agents. They will have human agents with virtual agents listening in, just like a manager would be listening in, et cetera. They're going to use all sorts of human analogies for the next few years to make it all work.
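As a rough illustration of the "do it the way a human does it" pattern Scott describes for the SDR example, the sketch below takes a call transcript, asks an LLM to extract the CRM fields, and then fills them into the same form a person would use. The field names, `call_llm`, and the `form` UI-automation helper are all hypothetical stand-ins, not a real CRM or RPA API, and in the first phase Scott describes a human still reviews the result.

```python
# Sketch: robotic-process-automation-style CRM entry from a sales call transcript.
# call_llm() and the `form` object are hypothetical placeholders for whatever
# LLM endpoint and UI-automation tooling you already use.

import json

CRM_FIELDS = ["company", "current_tooling", "pain_points", "next_step"]  # assumed schema

def call_llm(prompt: str) -> str:
    """Placeholder: send a prompt to an LLM and return its text response."""
    raise NotImplementedError

def extract_fields(transcript: str) -> dict:
    """Ask the LLM to summarize the call into the CRM fields as JSON."""
    prompt = (
        f"From this sales call transcript, fill in the fields {CRM_FIELDS} "
        f"and answer with JSON only.\n\n{transcript}"
    )
    return json.loads(call_llm(prompt))

def fill_crm_form(form, fields: dict) -> None:
    """Mimic the human workflow: focus each field in the UI and literally type into it."""
    for name in CRM_FIELDS:
        form.focus(name)                        # hypothetical: click into the field
        form.type_text(fields.get(name, ""))    # hypothetical: type like a person would
    # In the first phase Scott describes, a human QA review sits before this submit.
    form.submit()
```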

Kate (29:36)
That makes sense, and I think that aligns with what we're seeing already. But I'm curious if we can flip this, since we're speaking about the future: how does this relate to the ways these AI voice agents have functioned in the past? They tend to be couched in accessibility narratives, used by folks who are maybe blind and need a screen reader. When I was a front-end engineer, a big part of what I did was checking web pages for accessibility, to make sure that screen readers would work. So that's certainly the frame I'm coming from, back when I was a practitioner at least. But I'm also thinking about it in terms of technical communication. I used to teach tech comm at Georgia Tech back when I was still an academic.

One of my colleagues there was Halcyon Lawrence. She's since passed away, unfortunately, but she did a lot of research on bias in speech recognition. She was from Trinidad, so her first language was English, but she had an accent compared to how Midwestern English speakers tend to sound, and she did a lot of research on what that looks like in the systems in place. Are you thinking at all about ways we can train models so that they can understand a range of, I guess, speech differences? And how are you prepared to make sure that there's an inclusive future when it comes to speech?

Scott Stephenson (30:55)
Mm-hmm. Yeah, this is something that we already do today, but I'll also say that it's not as good as it needs to be. But there is an answer, and the answer is really great and really interesting, which is: if you have a system to understand voice and a system to generate voice, and you can generate arbitrary voices, then that means you can train systems to understand arbitrary voices as well.

What I'm getting at is that synthetic data is going to be unbelievably good for this area. There are so many of what folks would call low-resource languages, where there's not a lot of labeled data. In some languages there may be a lot of speakers and a lot of content produced, but nobody labels the transcriptions. It's just not a thing; they don't have a lot of closed captioning in their culture, in their language, et cetera. There may be hundreds of millions of speakers in the world. Turkish is one of those, where there's a large population but not a lot of closed captioning, which means there's not a large transcription business already set up for Turkish, and therefore it's harder to label the data, et cetera. And there are lots of languages that fall into all sorts of different buckets here.

But one of the things that's going to push all of this so much farther forward than in the past is getting over the fact that, until very recently, you had to have a human deeply involved in creating any piece of training data for these systems. Just to give you some round numbers: an hour of training data will probably cost you $100 to $200 to create. But now it'll be more like a dollar an hour. So you're talking about a two-orders-of-magnitude reduction in the cost of creating training data. And the way that's going to be done is through generative techniques, through generating speech.
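Taking Scott's ballpark rates at face value, the gap is easy to see; the hour count below is purely illustrative and not a figure from the interview.

```python
# Rough cost comparison using the ballpark per-hour rates Scott mentions.
hours_needed = 10_000                      # illustrative corpus size, not a real figure

human_labeled = hours_needed * 150         # roughly $100-$200 per human-labeled hour
synthetic = hours_needed * 1               # roughly $1 per synthetic hour

print(f"Human-labeled: ${human_labeled:,}")              # $1,500,000
print(f"Synthetic:     ${synthetic:,}")                  # $10,000
print(f"Reduction:     ~{human_labeled // synthetic}x")  # ~150x, about two orders of magnitude
```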

So it's the equivalent of voice cloning, but it's not just voice cloning, it's audio-scape cloning: what environment are you in, what kind of echo is there, what kind of microphone does it sound like, what distractions are in the background? Then you can take one person's voice and generate it in a thousand different scenarios saying a thousand different things, and now you have a million different utterances, all for a hundred dollars or whatever. It's not going to cost a lot to do it. And that's just one person; now stretch that out over hundreds of people, thousands of people, et cetera. Now stretch it out over voices that don't exist. You've probably seen the website This Person Does Not Exist; well, this voice doesn't exist. But still, it may have certain features that a speech-to-text system, or any type of perception system, has a challenge understanding, while a human, or maybe a human from that area, would be able to understand it. Well, as long as you can generate audio in a way that a human from that area would understand it, you can expose your system to it.

So there's going to be a synthetic data revolution. It's not going to be easy, though, because you run into this problem: if you just trust that all the synthetic data you generate is good, and you train your system on all of it, your system will get worse, because not all of the data is actually good. You have to do something called active learning, which is selecting the good examples and throwing out the rest. That's the magic in doing this: how good is your active learning? If you make your active learning very good, then synthetic data is a massive boon for everything working. And if you get it wrong, then you'll just say, I did the synthetic data thing, it should be better, but it actually isn't, and I don't know why. That's the difference: do you know how to do that or not? It's really, really tricky to get right, and it takes a long time to get right, but it's going to happen. I can tell you we're doing it at Deepgram. And it's going to happen all of a sudden, because once you get it right for one language, then it's easy to get it right for five languages, then for 20, then for 1,000.
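Scott doesn't spell out how Deepgram's active learning works, but one common shape for "select the good ones and throw out the rest" is to verify each synthetic utterance with an existing reference recognizer and keep only the pairs where the audio clearly says what it was supposed to say. The sketch below is a hedged illustration under that assumption; `generate_utterance` and `reference_recognize` are hypothetical placeholders, and real selection pipelines are considerably more sophisticated.

```python
# Hedged sketch of filtering synthetic speech data before training on it.
# generate_utterance() is a stand-in for a TTS / audio-scape generator;
# reference_recognize() is a stand-in for an existing speech-to-text model
# used only to check generation quality, not the model being trained.

def generate_utterance(text: str, speaker_profile: dict) -> bytes:
    """Placeholder: synthesize `text` in a given voice, accent, and acoustic environment."""
    raise NotImplementedError

def reference_recognize(audio: bytes) -> tuple[str, float]:
    """Placeholder: return (transcript, confidence) from a trusted reference recognizer."""
    raise NotImplementedError

def select_training_pairs(texts, speaker_profiles, min_confidence=0.9):
    """Keep only (audio, text) pairs where the synthetic audio verifiably matches its
    intended transcript; discard the rest instead of training on them."""
    kept = []
    for text in texts:
        for profile in speaker_profiles:
            audio = generate_utterance(text, profile)
            hypothesis, confidence = reference_recognize(audio)
            if confidence >= min_confidence and hypothesis.strip().lower() == text.strip().lower():
                kept.append((audio, text))
    return kept
```

One known pitfall, and part of what makes this hard to get right, is that a filter like this can also throw away exactly the difficult accents and environments you wanted the model to learn, so the selection criterion itself has to be tuned carefully.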

Kate (34:55)
That makes a lot of sense. I'm thinking two major thoughts along this line. One would be the political will to make sure that we're representing all these different languages, because I know it's expensive to train. It sounds like, as with most AI problems today, it's a data problem. Is it copyrighted? That seems to be where the conversation always ends up. And it sounds like the data is not as well represented for non-Western languages right now, and maybe there even needs to be some artificial push on that: hey, we need more of these languages that are not represented in our data set. So I'm curious where the political will is. Does it make sense as a founder to make sure that you're representing a broad swath of not only languages but accents?

Scott Stephenson (35:29)
Yeah, I don't think you have to rely on a political force to do this. Capitalism takes care of it, because the services that are using…

Kate (35:48)
Okay.

Scott Stephenson (35:55)
So the customers of services like the ones I'm talking about, speech to text from Deepgram or others, have users from all over the world. And those users will complain and say, hey, it doesn't work in this situation. We have a great tracking system for that, and if you complain and you allow that data to be used to make the system better, then it will make the system better. So over a very short period of time it's: yep, that was a problem, it's not anymore, next problem. And by the way, the surface area of problems is massive, so you can't solve all of it right away, but you can make massive strides in a very short period of time with just a few of what I would call topology changes. You just have to set up the system the right way. You have to let different users of the system know that they can help the system get better. There will be many people who say, I don't want to do that, but there will be many other people who do, and those are the ones that will make the system better.

And a way to incentivize that as an AI service is to charge people less if they're willing to make the system better, a discount basically, and you don't get the discount if you aren't willing to make the system better. But everybody wins. You can use the system, you just pay a little bit more if you don't want to contribute data or feedback, and the people who are willing to contribute data or feedback get a better deal. Then you're talking just a few years' time span and the systems are unbelievably good compared to where they were before, and couple that with synthetic data and you can shorten the timelines even more.

So without synthetic data, it's really, really expensive to do this. It's possible, but it's just really expensive. With synthetic data, it's not going to be super expensive. It's not going to be, we need $10 billion to do this; tens of millions or hundreds of millions will solve this problem. Yeah.

Kate (37:44)
Right. So I can understand why you're optimistic on this front, and that makes sense. I guess time will tell. I think what's so interesting, as someone who studies language, and that was my career before I became an analyst and before I was a front-end engineer, so I spent 10 years in academia really focused on communication, is that language changes, right? I'm an old millennial. The way that younger folks are talking now, I mean, there are all kinds of words that they use that I don't know. So it makes sense to me that a system that is optimized for language would evolve as just part of the programming, right? That makes perfect sense to me.

And so I am hopeful that what you're describing is the case: that even if these languages aren't represented right now, they will be synthetically added, just because these systems are meant to evolve and move. They're not static at all. So hey, I like it.

Scott Stephenson (38:36)
Yeah, why have one model? Or, okay, we have 10 languages, 100 languages, whatever. No, there need to be tens of thousands or hundreds of thousands or millions of models, billions of models. Why are we limiting ourselves in this way? You just have to create a new system that can adapt. This is really the promise of intelligence: the system adapts. So if you want it to work a certain way, okay, you just nudge it in that direction, and then it's way better at that thing. All you had to do is give it the signal that that's what you wanted. And that type of thing is coming.

Kate (39:12)
All right, we are about out of time, but before we go, how can folks hear more from you? What are your preferred social channels and are you planning to do any speaking at the end of this year or into 2025?

Scott Stephenson (39:23)
Yeah, so reach out to me on LinkedIn, just Scott Stephenson, and search for Deepgram. @DeepgramAI is our Twitter handle. We have a Discord server as well; you can find us at deepgram.com and look for the Discord server. There's a great community there, especially for developers and API users who are trying to build things or just think about what's coming up in the future for voice. So yeah, those are great places to find me.

Kate (39:50)
All right, thank you so much for coming on the show, Scott, and thanks for getting nerdy with me. I mean, I feel like we got into some metaphysics here. That’s not typically where the show goes. So I do appreciate it. Again, my name is Kate Holterhoff, senior analyst at RedMonk. If you enjoyed this conversation, please like, subscribe, and review the MonkCast on your podcast platform of choice. If you are watching us on YouTube, please like, subscribe, and engage with us in the comments.

 
