#2: How Agents Work
About a year and a half ago, I saw the video that got me really excited about agents. I still remember the day. I was in the office of a friend’s startup, and they were starting to hear rumblings about this new tool called ChatGPT that suddenly enabled new use cases around information retrieval and automation.
I started doing a bit of research to learn more and stumbled upon a video that Dharmesh, the co-founder of HubSpot, had created to showcase his new tool, ChatSpot. In it, he did something that, at the time, I thought was awesome. He asked for a graph of some data using simple natural language and the system created it for him.
Here’s the video in case you’ve never seen it:
At the time, my mind was blown. I had been working on software for 20 years and machine learning applications for 10, and had spent countless conversations telling folks that machines “just can’t understand” what you’re saying when you ask for something like a breakdown of which users are most likely to churn or which types of content perform best on your blog.
This was the first time I had seen a machine properly understand and act on intent, which, up until then, had been a notoriously difficult problem for software developers to automate.
I can tell you that after that day, not only did I immediately change my tune, but I pivoted my entire career towards building systems that could understand and act on intent: what I (and many others) now refer to as autonomous agents.
Under the hood
The same day that I saw that video, I went on a massive research binge. As a software engineer, I had a compulsion to try to understand how this thing was working. I just had to be able to make sense of it in my head.
So I started reading everything I could find: OpenAI’s site, posts from around the internet, and even much of the source code of agent-building frameworks like LangChain.
After about a full day’s worth of research and reading, I walked out of that office thinking “I can’t believe it’s that simple, but man this is huge.” Within the course of a day, I knew how these things tick, and the next day I started building the workings of my own agent, which eventually turned into one of my current products: Locusive.
In fact, what’s happening under the hood is incredibly simple, and anyone with a basic background in coding can recreate the basic functionality of an agent. And if you’re not a coder, don’t worry: I won’t be getting deep into the tech in this issue. What I will do is break down how an agent works, step by step, so that you can understand what’s happening when you send a request to an agent.
The core concept is extremely straightforward — an agent is just a software application that has access to other software tools and uses an LLM to decide which tool to use in which order. In fact, many agents are structured as a sequence of steps within a loop, and the entire high-level code of an agent might look something like this:
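(What follows is a simplified sketch, not any particular framework’s code; `call_llm` is a stub you’d wire up to your LLM provider of choice, and the JSON format is just one I made up for illustration.)

```python
import json

MAX_STEPS = 10  # hard cap so the agent can't loop forever

def call_llm(prompt: str) -> str:
    """Stub: wire this to whatever LLM provider you're using."""
    raise NotImplementedError

def run_agent(user_request: str, tools: dict) -> str:
    # `tools` maps a tool name to a plain function that takes a string
    # and returns a string, e.g. {"search": run_search, "sql": run_query}
    history = [f"User request: {user_request}"]

    for _ in range(MAX_STEPS):
        # Ask the LLM what to do next, given the request and everything the
        # agent has seen so far. We ask for JSON so our code can parse it.
        prompt = (
            "You can use these tools: " + ", ".join(tools) + ".\n"
            "Conversation so far:\n" + "\n".join(history) + "\n"
            'Reply with JSON: {"tool": "<tool name or finish>", "input": "<text>"}'
        )
        decision = json.loads(call_llm(prompt))

        if decision["tool"] == "finish":
            return decision["input"]  # the LLM decided it can answer now

        # Otherwise run the tool it chose, with the input it wrote for it,
        # and feed the result back in for the next pass through the loop.
        result = tools[decision["tool"]](decision["input"])
        history.append(f"{decision['tool']} returned: {result}")

    return "Sorry, I couldn't complete that within the step limit."
```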
At the highest level, that’s really all it is. You basically use an LLM to do all the heavy lifting: thinking through which tool to use, figuring out how to use that tool (e.g., if you need a search engine, using the LLM to come up with the search query), and deciding whether there’s enough data to move forward or the agent should end without a response. At its core, an agent is basically an LLM for brains and tool use for brawn.
And if you think about it, and squint just right, you can see that this is kind of what we as humans do as well. If I need to know about the weather tomorrow, I might first think about which app will give me that information, then go to the app and look it up, and then process that information and summarize (internally) what I found.
Or, in a more complex scenario, if I need to pull a report from my SQL database about which users interacted with which features on my app, I would think through which tables hold that usage data, write a query against them, run it, check that the results make sense, and then summarize what I found.
In much the same way that I, as a person, plan, think through, and act on a set of tools to achieve an objective, autonomous agents can do the same thing.
Now you might ask, if this is so simple, why haven’t we seen more agents permeate all of the software we use?
The devil’s in the details
Much like everything else in this world, building an agent is simple in concept but complicated in practice. There are a lot of blockers that prevent agents from working well, at least not without a lot of additional engineering work.
Tools are complicated
First, LLMs need to be connected to the tools required to service a user’s request. When you think about it, an agent is ultimately limited by the set of tools it can use, and giving an agent access to those tools can be painstaking work. Some tools might be easy to integrate — a search engine, for example, just needs a single input, a search query. But other tools, like a complex API such as Salesforce, can take a practically unlimited variety of inputs (since any particular request might need to call one of many API endpoints, and those endpoints themselves might require highly specific and dynamic queries).
Adding support for these tools requires more of a traditional engineering effort, where we need to explicitly program instructions into the system for every different edge case that we want the agent to be able to support in the tool.
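To make the difference concrete, here’s a rough sketch of what tool definitions might look like. The `Tool` class and the function bodies below are placeholders of mine, not any real integration:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str            # shown to the LLM so it knows when to pick this tool
    run: Callable[[str], str]   # takes the LLM-written input, returns a result

def search_api(query: str) -> str:
    """Stand-in for whatever search client you actually use."""
    ...

# A search engine is the easy case: one input (the query), one kind of output.
search_tool = Tool(
    name="web_search",
    description="Search the web. Input: a plain-text search query.",
    run=search_api,
)

# A large API like Salesforce is much harder: a request could map to any one
# of dozens of endpoints, each with its own parameters, so the developer ends
# up writing explicit handling for every case the agent should support.
def run_salesforce(request: str) -> str:
    # In practice this becomes a mini-router: work out which endpoint and
    # parameters the request maps to, validate them, call the API, and turn
    # errors back into text the LLM can reason about.
    ...

salesforce_tool = Tool(
    name="salesforce",
    description="Read or update CRM records. Input: a description of the operation.",
    run=run_salesforce,
)
```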
Certain tools, like databases, can also be incredibly nuanced, complex, and extremely custom to a given organization. They might have strange or unobvious data structures, which an LLM might never be able to understand without additional context, in which case, we need to add support for end users to add that context, and then provide it to the LLM on demand.
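One common way to handle this is to let users write down their schema quirks in plain language and inject those notes into the prompt whenever the agent needs to write a query. A minimal sketch, with made-up table names:

```python
# User-supplied context about a schema an LLM could never guess on its own.
# (These table and column notes are invented for illustration.)
schema_notes = """
Table 'evt_41' holds product events; column 'u' is the user id.
Table 'dim_u' holds users; 'churn_flag' = 1 means the account cancelled.
"Active users" means rows in evt_41 from the last 30 days.
"""

def write_sql(question: str, llm) -> str:
    # `llm` is any callable that takes a prompt string and returns text.
    # We hand the human-written notes to the LLM alongside the question so it
    # can translate business language into this organization's odd schema.
    prompt = (
        "You write SQL for the database described below.\n"
        f"Schema notes:\n{schema_notes}\n"
        f"Question: {question}\n"
        "Return only the SQL query."
    )
    return llm(prompt)
```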
And beyond the complexity of supporting a tool, LLMs first need to select the right tool at the right time, which they don’t always do in a rational way.
LLMs can be irrational
The more tools that we make available to an LLM, the greater the chances are that an LLM will select the wrong tool at any given point in time. While we as agent developers have tricks to reduce this illogic, it’s not likely that we can prevent it entirely.
When an LLM selects the wrong tool, it wastes time, money, and precious space in its conversational memory executing that tool, which may then affect how it course-corrects down the line.
And even if an LLM selects the right tool, it may not provide its response in a way that our (more structured, more brittle) code expects, which then causes our code to not know which tool was selected, requiring us to ask the LLM to repeat itself or retry.
While it’s not a big deal for an LLM to select the wrong tool once or twice, it’s very possible that an LLM can get into a negative cycle of continuously selecting either the wrong tool or the wrong input to a tool, which causes it to fail forever, going into an uncontrolled loop.
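Agent developers usually defend against these failure modes with very ordinary code: validate the LLM’s reply, re-ask a bounded number of times if it’s malformed, and keep a hard cap on total steps (like the MAX_STEPS guard in the loop sketch above) so a confused agent can’t run forever. Here’s a sketch of the retry half, with placeholder names:

```python
import json

MAX_RETRIES = 3  # how many times we re-ask when the reply isn't usable

def get_tool_choice(prompt: str, valid_tools: set, call_llm) -> dict:
    """Ask the LLM which tool to use, retrying if the reply is malformed."""
    for _ in range(MAX_RETRIES):
        reply = call_llm(prompt)
        try:
            choice = json.loads(reply)
            if choice.get("tool") in valid_tools:
                return choice
            # Valid JSON but an unknown tool: tell the LLM and try again.
            prompt += f"\n'{choice.get('tool')}' is not a valid tool. Try again."
        except json.JSONDecodeError:
            # Not JSON at all: ask the LLM to repeat itself in the right format.
            prompt += "\nYour last reply wasn't valid JSON. Reply with JSON only."
    raise RuntimeError("LLM never produced a usable tool choice")
```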
LLMs can go off the rails
An LLM that’s connected to powerful tools can also do powerful damage. LLMs might also simply say something dumb, offensive, or totally wrong, which could have negative consequences for users. And again, while it’s possible to control for these scenarios and put guardrails in place to ensure they don’t happen (or, if they do, that their consequences are minimized), the risk is still there, and many companies don’t want to take it on. So they’re still in the process of rolling out simpler agents that aren’t capable of doing much, while they experiment with more powerful LLMs in more controlled environments.
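The guardrails themselves are often equally unglamorous: an allow-list of tools the agent can call on its own, and a human confirmation step in front of anything destructive. A sketch of that idea, with placeholder tool names:

```python
# Tools the agent may call freely vs. ones that need a human sign-off.
# The names here are placeholders for illustration.
READ_ONLY_TOOLS = {"web_search", "read_database"}
NEEDS_APPROVAL = {"send_email", "update_crm", "delete_record"}

def execute_tool(name: str, tool_input: str, tools: dict) -> str:
    if name in NEEDS_APPROVAL:
        # Pause and let a human approve the action before it runs.
        answer = input(f"Agent wants to run {name} with input {tool_input!r}. Allow? [y/N] ")
        if answer.strip().lower() != "y":
            return "Action was blocked by the user."
    elif name not in READ_ONLY_TOOLS:
        # Anything not explicitly allowed is refused outright.
        return f"Tool '{name}' is not permitted."
    return tools[name](tool_input)
```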
As you can see, a lot of the effort that goes into building a true autonomous agent goes into creating the tools, infrastructure, guardrails, and contextual ecosystem. This is a more traditional software engineering effort, one which takes time and development resources.
That’s why we’re starting to see simpler agents for now: ones that only support information retrieval and analysis, or that require humans to orchestrate them manually through those awfully tedious workflow-builder interfaces.
As we see agents start to progress, it’s likely we’ll start to have more sophisticated components and infrastructure that help improve the performance of these agents. There may be a world where we combine the text-based reasoning of an LLM with the predictive capabilities of a machine learning model, and the combination of these two systems can lead to more reasonable, “accurate” agents.
As agent developers continue to iterate on these systems, we’ll likely see more incremental, smaller agents continue to dominate the software landscape as we eventually build up to more horizontal, fully-capable autonomous agents. The recent increase in excitement around “vertical agents” is likely an outcome of this approach.
That’s all for today’s issue. If you think your friends or family might like this content, I’d appreciate you sending them over to my subscription page. I’ll continue creating a new issue every couple of weeks, and while I’ve got a lot of content already planned out, if there’s anything you want to learn more about, please let me know.
Until next time,
-Shanif