What You Wanted to Know About AI Agents but Were Afraid to Ask (with Maya Murad)


Ever wondered what agentic AI actually means—beyond the buzzwords? In this RedMonk Conversation, Maya Murad, Technical Product Manager on IBM’s Research Incubation team, pulls back the curtain on how IBM is exploring, building, and scaling agentic AI.

In this video we’ll talk about:
– What AI agents actually are – going beyond the buzzwords to see how reinforcement learning techniques laid the groundwork for earlier AI agents, and how LLMs have led to agentic AI
– What makes agentic AI different from traditional AI workflows
– Why “AI agents” aren’t just hype—they’re a whole new paradigm
– How you can use agentic AI as a prototyping tool
– The roadmap from here, and what’s needed to bring this technology mainstream
– How Maya and her team use the [Bee Framework](https://github.com/i-am-bee/beeai-framework) to build agentic workflows

🔗 Check out IBM’s [BeeAI Framework on GitHub](https://github.com/i-am-bee/beeai-framework)
📖 Read more of Maya’s research about [General Purpose Agents here](https://towardsdatascience.com/build-a-general-purpose-ai-agent-c40be49e7400/)

This was a RedMonk video, sponsored by IBM.

Transcript

Rachel Stephens
Hi, everyone. This is Rachel Stephens with RedMonk, and today I’m very excited to introduce you to Maya Murad from IBM. She’s a Technical Product Manager with the Research Incubation team, which I’m very excited to hear more about. Maya, could you tell us what the Research Incubation team does? Tell me about the things you’re working on.

Maya Murad
Yeah. Thanks for having me, Rachel. So I work on maybe one of the coolest teams at IBM. What we do is look at emerging technologies and see which ones have the best propensity to disrupt us. Two years ago, that was large language models, and we went all in on that technology. We not only thought through how it would be consumed; that work also uncovered all the challenges around scalability and how to use large language models at scale. Large language models basically came with hardware restrictions and scalability issues that we had not seen before. That was a really interesting learning. Then last year, we went all in on AI agents, which is the topic of this conversation today.

Rachel Stephens
Yes, we’re definitely going to talk about AI agents. I feel like a lot of people out there have a hard time with the term agentic AI. I think it’s become a term that is used so often, by so many people, in so many ways that it’s become a little bit fuzzy and nebulous, and sometimes a little bit market-y. So I want to start by talking about what the definition of an AI agent is in your world. Let’s talk about the landscape. Because, as you said, in the last few years the AI landscape started with this LLM-based question of how can we just get to a place of doing chatbots? How can we query these large language models in a way that is relevant to the enterprise? Then we started to get to this world of agentic AI, and it’s becoming more and more important. And I wanted to start off by quoting you, because I loved this article that you wrote, and you had, I think, a very useful way to talk about the LLM agent, which is “an LLM agent is a program whose execution logic is controlled by its underlying model.” I think that is a really clear definition, and it provides more of an understanding than some of the definitions that I have seen in other places.

But maybe could you just talk us through how do you see agents versus the way that we’ve talked about AI in the iterations before it? And let’s just set the table in terms of where you see the landscape now.

Maya Murad
Yeah, absolutely. So this is a great place to start, and there’s a lot to disambiguate here, because there’s this vision of agents, and then there’s the reality of what we’re able to do with them right now. It’s also a term that’s been around for a while. Even though it seems like it’s popping up new onto the scene, this is a field that has been discussed for decades. To start explaining this, I’m going to share my screen. I have a few visuals, and I’m a visual learner and speaker, so that should make this easier to discuss.

Starting with this representation. For me, if we’re going back to the original definition, the one that has been studied for decades, the simplest way to think about an agent is that it’s a program. A program that is intended to complete a certain goal; that goal can be set in advance, and the agent can complete it by taking actions. So the agent can determine what actions to take and can start executing them. And then there are certain ways to measure whether the goal was attained or not. So this is a definition that has been around for a while.
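
Read literally, that classic definition maps onto a small interface: a goal set in advance, actions the agent chooses and executes, and a way to measure attainment. A minimal Python sketch, with names that are illustrative rather than from the interview:

```python
from typing import Protocol

class Agent(Protocol):
    """The decades-old definition: a program pursuing a pre-set goal."""

    def choose_action(self, observation: str) -> str:
        """Determine which action to take next toward the goal."""
        ...

    def act(self, action: str) -> str:
        """Start executing the chosen action; returns a new observation."""
        ...

    def goal_attained(self, observation: str) -> bool:
        """Measure whether the goal was attained or not."""
        ...
```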

And in the last decade, when we were talking about AI agents, that was very different from large language models. So just to compare and contrast: in the last decade, when we spoke about AI agents, these were underpinned by reinforcement learning models. These are models that are able to adjust their parameters based on what they’ve learned in an environment. And that is very different from the current large language models, which are more static. So for the reinforcement learning paradigm, I think some of you might be familiar with AlphaGo, which was an AI system that was able to beat the world champion at the game of Go. Reinforcement learning is a really great paradigm for these closed environments where you know what moves are available to you on the board: it’s very easy to tally up your score and to understand if you’re winning. And the way these agents were able to learn these games is that they played thousands of them, they played against themselves, and they learned through trial and error. So every time they made a move, there was some reward, or the opposite of a reward.

There was some penalty if they made the wrong move. And all of that was measurable, because you could simulate the pathway or the statistical probability of success. Now, this was helpful, and it solved really interesting problems, specifically problems where the goal is really well defined and measurable. And I think there’s some connection in how AlphaGo led to AlphaFold, which helped in the chemistry world. But if we’re talking about this broader world that’s language-based, if we’re talking about the average problem of a business user or a person going about their day, the really interesting paradigm here is building agents that are based on large language models, because it allows you to go after goals that are fuzzier, that are more difficult to validate as correct or not. If you’re writing a summary, or a draft, or you’re just completing tasks where there’s no easy way to measure whether it’s done correctly, large language models make that possible. The way they can do that is they can take a task and break it down into sub-steps. They can do that based on the knowledge that they’ve seen before, and they can determine what tools and knowledge to use in that process.

These could be an HR system that I’ve told the LLM it has access to, or files that I give to the model. So that contrasts where we were and where we’re going with this technology, and why we’re hearing so much about it: the set of problems that we can now target is much larger than with the previous technology. Now, another way to contrast an agentic workflow with some of the other patterns that we’re seeing, like retrieval augmented generation, is to think through how the program is controlled. So I’m starting with the initial definition that you quoted: an agent is a program whose control logic is defined by the model itself. Until the last year or so, that’s not how we built with models. When we were building programs around models, there was a builder who defined the steps that the program should execute. This could be a lot more complicated than what you’re seeing here on the screen; it could take different tangents, and it could execute different workflows in parallel. But the bottom line is, all of this is predefined ahead of time by a builder.

Actually, most software that we interact with is built this way. The innovation here is that how the program is going to execute is only defined at run time, depending on how the model determines the user query is best answered. This opens up a whole new paradigm where we do not have to prescriptively define how the program needs to execute. We could just give it the goal. I could say, oh, I want you to create a report that summarizes the competitive environment in this space and also translate it into different languages. In the old paradigm, I would have to break down every step of the process, articulate it, and optimize it. In this new paradigm, I just have to describe the end result I want, and the system determines how best to get there. This was a lot, but I hope it helps the average person understand these concepts and relate them to the different technologies they’re hearing about.
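
To make that contrast concrete, here is a minimal Python sketch of the two control styles: a fixed flow whose steps a builder wired ahead of time, and an agentic flow where the model picks the next step at run time. The helper functions and the decision format are hypothetical stand-ins, not any particular framework’s API:

```python
# Hypothetical tools; a real system would call search APIs, models, etc.
def search_corpus(query: str) -> str:
    return f"documents about {query}"

def summarize(text: str) -> str:
    return f"summary of: {text}"

def llm(prompt: str) -> str:
    # Stub for a real model call; here it always finishes immediately.
    return "final: stubbed answer"

# Fixed flow: every step predefined ahead of time by the builder.
def fixed_flow(query: str) -> str:
    docs = search_corpus(query)   # step 1, always runs
    return summarize(docs)        # step 2, always runs

# Agentic flow: the execution path is chosen by the model at run time.
TOOLS = {"search": search_corpus, "summarize": summarize}

def agentic_flow(goal: str, max_steps: int = 5) -> str:
    scratchpad = f"Goal: {goal}"
    for _ in range(max_steps):
        decision = llm(
            scratchpad + "\nReply 'tool: <name>: <arg>' or 'final: <answer>'."
        )
        if decision.startswith("final:"):
            return decision.removeprefix("final:").strip()
        _, name, arg = (part.strip() for part in decision.split(":", 2))
        scratchpad += f"\n{name} -> {TOOLS[name](arg)}"
    return "stopped after max_steps"
```

The point is only where the control logic lives: in `fixed_flow` the builder decided the steps; in `agentic_flow` the model’s reply decides them, one step at a time.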

Rachel Stephens
Yes, I think that was an excellent explanation. I love that the term you used was disambiguation, and I think you did such a good job. Because I was just thinking about the initial framing: if you have come into the world of AI in the last few years, it’s, okay, ChatGPT popped onto the scene and all of a sudden I am now thinking about this. But you came at it from an even broader scope: AI has been a technology that people have been working on for decades. So I loved the way that you brought in the reinforcement learning path, and I really appreciated that general framing of how the technology has evolved across a lot of different frameworks. I thought that was a really excellent way to help people understand how things have evolved across a variety of spectrums. So thank you, I thought that was great. All right. So that’s where we’re at; that’s how things have evolved. And I wanted to talk about a lovely article that you recently wrote for Towards Data Science. We will definitely link to it below.

But the thing that you talked about was a general purpose agent, and that was really helpful for me to read. What surprised me was the concept overall, because a lot of the time when I think about agents, I will go to these keynotes and people will talk about, this is your agent to help you book a flight, or something like that. The agents that people are explaining tend to be hyper-specific to the task that people are demonstrating. So I wanted to talk about it in general: what is a general purpose agent, and why would you want to build one? Can you just walk us through that?

Maya Murad
Yeah, of course. A general purpose agent is, from a technical perspective, one architecture, so one program that you define once, but that allows you to do a number of things: from data analysis, to automating workflows, to being a task reminder, for example. In theory, this sounds great, because this is what all the people who want to get to AGI maybe want to reach. In practice, I think we’re quite far from that paradigm, though some people believe that with better models it could be closer. Translating that architecture to reality, one of the challenges you’re going to face is that models don’t do well when they’re overwhelmed with context. If you tell a person ten things at once, they’re just going to struggle to remember them, and that’s how today’s models work too. Maybe that will be solved, but it impacts the breadth of what you’re able to do. It limits the number of tools you can give the agent, so you have to be way more selective with the tools. It limits how deep and how complex the workflow can go. But again, this is a statement of where we are right now, and there are tons of money being invested in solving these problems.

So given that it’s something that is interesting but in reality is limited, why would you still want to go after it? In this article, I’m actually very specific about the reference architecture to implement: I specifically determine what tools to prioritize and how to handle memory. It’s not just a general statement of why this is important, but of why the specific architecture I was proposing is interesting to go after, because I think it gets you to maybe 40% to 50% performance on most use cases. It’s a really great prototyping tool. You have one architecture that allows you to prototype all these different use cases. Then from there, you learn two things. One, do I really need an agentic approach for my use case? And two, if I want to start customizing this workflow to really work well for my use case and get me from 50% performance to 90%, what do I need to change? I might observe that, okay, it’s looping five or six times because it’s struggling to understand my data. That’s an actionable data point that I can do something about, and it might change the way that I architect my system.

So I think it’s a really helpful way to uncover these insights, and it’s a faster way to get there than if you just tediously start with, for example, a flow-based system.
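
As a hedged illustration of “one architecture, defined once, prototyping many use cases,” here is a sketch of a general-purpose agent object. The class shape, tool registry, and loop-count logging are illustrative assumptions, not the actual reference architecture from Maya’s article:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class GeneralPurposeAgent:
    """One architecture, defined once, pointed at many use cases."""
    llm: Callable[[str], str]                # any model of your choice
    tools: dict[str, Callable[[str], str]]   # kept small: context is scarce
    memory: list[str] = field(default_factory=list)

    def run(self, task: str, max_loops: int = 8) -> str:
        self.memory.append(f"Task: {task}")
        for loop in range(1, max_loops + 1):
            decision = self.llm("\n".join(self.memory))
            if decision.startswith("final:"):
                # Loop count is the actionable data point: if one use case
                # keeps looping five or six times, it likely needs a custom flow.
                print(f"finished in {loop} loop(s)")
                return decision.removeprefix("final:").strip()
            name, _, arg = decision.partition(" ")
            self.memory.append(f"{name} -> {self.tools[name](arg)}")
        return "hit loop limit: consider a more tailored architecture"

# The same agent, unchanged, prototypes very different use cases:
# agent.run("Analyze this quarter's sales data")
# agent.run("Draft a competitive-landscape report")
```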

Rachel Stephens
Yeah, so a useful prototyping tool and a tool for helping uncover how you should be thinking about your architecture going forward. Interesting. I like that, and I think that’s really useful. All right. So then let’s talk about some general architectural things that you’re thinking about in architecting agents. How are you thinking about the confusion around agentic approaches? And what have you learned so far?

Maya Murad
Yeah, absolutely. So my first answer here is there’s not one architecture to rule them all. As much as we want to have that single agentic architecture that can take on all these different use cases, in reality we don’t have the models to get there. I think the best way to illustrate the answer is to give you a visual representation of what I mean, so I’m going to share my screen again. The way I like to talk about it is that there’s a spectrum of agentic architectures, and this spectrum runs between limited agency and high agency. For high agency, a good example is that general purpose agentic architecture: one architecture that can fully determine what steps to take and has a number of tools at its disposal, making it really versatile. On the other end of the spectrum, you have the fixed flow, something where I’ve predefined how it should work. These are great because they’re reliable and I’m able to control how they work, but they’re less flexible. If a complex question comes in, it might hit an error, and that error is fatal, and then the program cannot self-correct.

Or it’s just very limited to one use case, and if I want to go after another use case, I’m going to have to invest all this time to rebuild it. So how can we take some of the benefits from one end of the spectrum and infuse them with some of the benefits from the other? That allows us to explore different agentic architectures that live in between. To make this more tangible, what if we started with a fixed flow? This could be a RAG, retrieval augmented generation, style fixed flow. Maybe there’s a question related to a company’s policy on a certain thing. So the first step the program might do, and again, we’re still executing things in a fixed flow that was predetermined, is search a corpus for more information, then feed it into a model. What changes here, and what adds more agency, is the next two steps. The model itself would try to validate the answer. This is a really interesting concept. I could basically take a model and specifically prompt it on two things: given the user’s question and the answer generated, how well does the answer respond to the question? Maybe it’s completely off topic, and then you might want to look into that.

And if you want your answer not to be hallucinated, you could also, specifically at this validation step, ask the model how well the answer produced matches the data that was collected from the vector database. Based on this information, the LLM can decide: do I need to go back and maybe change my query or look into a different database, or is the answer good enough to return to the end user? Depending on the user’s question, you might just go through the first three steps, or you might loop three or four times in order to get to the right answer. This brings more flexibility. And this is a good way of thinking about it: you could start building your workflows, and then for different failure points, you could try to automate recovering from that failure point. I think that’s a good way of thinking through where you should add LLM agency into your system. I hope that makes sense.
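
A minimal Python sketch of the validation loop just described. Here `retrieve` and `llm` are hypothetical callables supplied by the caller (a vector-database search and a model call), and the two validation prompts mirror the two checks Maya names: relevance to the question, and grounding in the retrieved data.

```python
from typing import Callable

def agentic_rag(
    question: str,
    llm: Callable[[str], str],       # model call, supplied by the caller
    retrieve: Callable[[str], str],  # vector-database search, supplied by the caller
    max_loops: int = 4,
) -> str:
    query = question
    answer = ""
    for _ in range(max_loops):
        # Fixed steps: search the corpus, then feed results into the model.
        context = retrieve(query)
        answer = llm(f"Context:\n{context}\n\nAnswer this question: {question}")

        # Validation check 1: does the answer actually address the question?
        relevant = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Does the answer address the question? Reply PASS or FAIL."
        )
        # Validation check 2: is the answer grounded in the retrieved data?
        grounded = llm(
            f"Context: {context}\nAnswer: {answer}\n"
            "Is the answer supported by the context? Reply PASS or FAIL."
        )
        if relevant.startswith("PASS") and grounded.startswith("PASS"):
            return answer  # good enough to return to the end user

        # Otherwise, let the model rewrite the query and loop again.
        query = llm(f"Rewrite this search query to get better results: {query}")
    return answer  # best effort after max_loops
```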

Rachel Stephens
Yeah. I think that’s helpful. A lot of people hear about RAG architectures and don’t quite understand that there can be that nuance in terms of validation and how many times to loop through and things like that. I think it’s a useful way to see how that architecture can be more flexible.

Maya Murad
Absolutely.

Rachel Stephens
All right. Well, how are you thinking about all of this in the research arm of IBM, and how is it starting to come into being in some of IBM’s products?

Maya Murad
Yeah, that’s a great question. So last year, when we were observing the space, it quickly became clear that we needed to be active in the open-source community as a way to drive more learnings. Again, this is an emerging technology, and it’s going to impact a lot of things downstream. The whole human-AI interaction paradigm is going to be very different because of this technology. This is a technology that can interact with the external environment, so it’s going to bring about new HCI modalities. It’s going to enable us to solve different problems. But also, we still don’t know how to solve problems well enough with this technology. There’s a lot to learn, and being in the open source is really helpful to accelerate those learnings. And then, more broadly at IBM Research, we really believe in the power of open ecosystems, and our models have all been open sourced. I think that is the healthy path forward in the field of AI to get to something beneficial. So what we did around summer of last year is we open sourced Project Bee. Bee is a framework that allows developers to build production-grade agentic workflows.

They can come in with any model of their choice. Where we focused last year is how to make that easier for open source models. So we took that open source play end-to-end: not just the software layer that wraps the model, but also letting you use open models in that process. So we support Granite, we support Llama, and all of that.

Rachel Stephens
In terms of existing limitations of this technology, can you talk about where you see the landscape now, and how are you addressing that? How are you thinking that through? What do you think we as an industry need to focus on and work on?

Maya Murad
There are so many challenges to speak of. I think one of the key issues people will find when they’re playing with agents is the reliability problem. You could ask the same question twice and get a completely different answer. It might not be incorrect, but it could be formatted in a very different way, so you might get all this different extra information. To be very concrete: I’m working with an agent that automatically analyzes data and generates a management report. In one iteration, it’s formatted nicely and has these charts that I thought were very useful. Ask the same question again, and the charts are no longer there. So when you’re thinking of using this at scale, you want to have acceptance criteria for what the output should have, and there’s no way to guarantee it. As much as you provide instructions, all instructions in the context of models are soft instructions, and you’re just not guaranteed to get consistent outputs. I think this is a big problem to solve, and the solution is going to be a combination of better models, but also better software that wraps the model, so you check the outputs and you’re able to flag these issues ahead of time.
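
One hedged example of “software that wraps the model”: encode the acceptance criteria as a machine check, then flag or retry when an output drifts, such as the charts that silently disappear between runs. The criteria, marker strings, and retry strategy below are invented for illustration.

```python
from typing import Callable

def violated_criteria(report: str) -> list[str]:
    """Return the acceptance criteria this output fails (empty list = OK)."""
    problems = []
    if "## Summary" not in report:
        problems.append("missing summary section")
    if "![chart]" not in report:  # e.g., the charts that vanished between runs
        problems.append("missing charts")
    return problems

def checked_report(llm: Callable[[str], str], prompt: str, max_retries: int = 2) -> str:
    """Generate, check against acceptance criteria, and retry with feedback."""
    feedback = ""
    problems: list[str] = []
    for _ in range(max_retries + 1):
        report = llm(prompt + feedback)
        problems = violated_criteria(report)
        if not problems:
            return report
        # Instructions are soft, so feed the violations back and retry.
        feedback = "\n\nFix these issues: " + "; ".join(problems)
    raise ValueError(f"output failed acceptance criteria: {problems}")
```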

So that’s the first problem. In the academic field, there are all these debates around whether large language models are good planners in the first place, whether they reason or don’t reason. It’s a heavy debate, with people on both sides arguing for and against. But at the end of the day, I think the ability of large language models to plan well is limited. They’re only able to do it to the extent that they’ve been trained on that data. They are trained to break down problems into smaller pieces; if that is something they were able to generalize and learn from their training data, then great. But if it’s something new, then it’s not going to work well. When it comes to the planning problem, the way I like to phrase it is that it’s not purely a scientific planning problem; a lot of the time there’s an alignment problem. Take the traveling salesman problem, or the question of the fastest path to get from A to B: some people might prefer you to take a very mathematical route.

We were once playing with an agent that came up with an answer that was correct, but the way it came up with the answer is that it googled the question, found an article, and then referenced that article. It didn’t do the work; it didn’t do the math behind it. So there’s this preference: when I’m getting an answer to a certain problem, do I want it done mathematically, or do I just want to rely on trusted sources? I think there’s this additional question of how we align the way problems are solved with our expectations and our acceptance criteria, and I think that’s also very interesting. Another set of problems is how we operationalize this technology. Let’s say I’m running an agentic workflow: how do I scale it up and down as traffic to my website increases, for example? How can I track what goes wrong and use that information to improve the system? Because, like I said, these are static systems. You build them. It’s not like reinforcement learning, where systems are continuously learning and improving. A lot of people have that expectation of AI agents, but the reality is that AI agents do not learn on the fly.

So how can we get back to that modality where you’re able to improve based on past data, post-training? I think this is really interesting. And I think the biggest challenge is going to come with security. Given that these systems are not reliable and not consistent, they might perform the wrong actions. They might call tools when they’re not supposed to. So if you’re thinking about the enterprise adopting this, I think the security aspect could be one of the first blockers.

Rachel Stephens
Are you encountering trust as a concern?

Maya Murad
That’s a really interesting question.

Rachel Stephens
That’s part of the security question, maybe.

Maya Murad
That’s a really interesting question. So we’ve done our own user research, and we found that if you format things nicely, and if the user believes they’re interacting with an intelligent system, they tend to overtrust, and that is a big problem. They might take the answer for granted, specifically if at a high level they see, oh, it seems the agent went to the internet, and oh, it quoted Wikipedia, and I trust Wikipedia as a source. People just tend to take mental shortcuts and say, I trust the result. But oftentimes, when we dug into the results, we found that even though Wikipedia was traversed by the agent as a source, it was not used in providing the final answer, and the final answer had some inaccuracies in it. So it’s really tough to understand what is right and wrong. And I think human nature, which likes to take shortcuts, exacerbates this overtrust problem. That is also something that needs to be addressed if we want to make these systems more useful at scale.

Rachel Stephens
Yes. I love this. Well, this has been such an enlightening conversation. Thank you so much for taking time out of your research and your busy day to come talk with me. If people want to know more about your work and find out more about IBM Bee or the projects you are working on, where should they go?

Maya Murad
Yeah, that’s a great question. So for Project Bee, I think the best way is to go to our GitHub repo; it’s called the Bee Agent Framework. From there, you’ll be able to find links to our YouTube channel and to our Discord channel, where we provide hands-on support. I also write a lot about AI and post my own content, and the best way to find me is on LinkedIn.

Rachel Stephens
Wonderful. Maya, thank you so much for your time today. This has been an absolute pleasure. I appreciate it. And thank you everyone for listening. We’ll have links for you in the show notes. Thanks, everyone.

Maya Murad
Thank you.
