#2: How Agents Work

About a year and a half ago, I saw the video that got me really excited about agents. I still remember the day. I was in the office of a friend’s startup, and they were starting to hear rumblings about this new ChatGPT tool that suddenly enabled new use cases around information retrieval and automation.

I started doing a bit of research to learn more and stumbled upon a video that Dharmesh, the co-founder of HubSpot, had created to showcase his new tool, ChatSpot. In it, he did something that, at the time, I thought was awesome. He asked for a graph of some data using simple natural language and the system created it for him.

Here’s the video in case you’ve never seen it:

At the time, my mind was blown. I had been working on software for 20 years and machine learning applications for 10 years, and had spent countless conversations telling folks that machines “just can’t understand” what you’re saying when you ask for something like a breakdown of which users are most likely to churn or what types of content perform best on your blog.

This was the first time I had seen a machine properly understand and act on intent, which, up until then, had been a notoriously difficult problem for software developers to automate.

I can tell you that after that day, not only did I immediately change my tune, but I pivoted my entire career towards building systems that could understand and act on intent: what I (and many others) now refer to as autonomous agents.

Under the hood

The same day that I saw that video, I went on a massive research binge. As a software engineer, I had a compulsion to try to understand how this thing was working. I just had to be able to make sense of it in my head.

So I started reading everything I could — on OpenAI’s site, across the rest of the internet, and even in the source code of agent-building frameworks like LangChain.

After about a full day’s worth of research and reading, I walked out of that office thinking “I can’t believe it’s that simple, but man this is huge.” Within the course of a day, I understood how these things tick, and the next day I started building my own agent, which eventually turned into one of my current products: Locusive.

In fact, what’s happening under the hood is incredibly simple, and anyone with a basic background in coding can recreate the core functionality of an agent. And if you’re not a coder, don’t worry: I won’t be getting deep into the tech in this issue. What I will do is break down how an agent works, step by step, so that you can understand what’s happening when you send a request to one.

The core concept is extremely straightforward — an agent is just a software application that has access to other software tools and uses an LLM to decide which tools to use and in what order. In fact, many agents are structured as a sequence of steps within a loop, and the entire high-level flow of an agent might look something like this:

  1. Before the loop:
     a. The user sends in a request
     b. Optional planning stage: present the agent with a list of the other software applications (or “tools” or “functions”) it has access to, and ask it to come up with a list of steps to take to answer the user’s question
     c. Do any validation/authentication/etc.
  2. In the loop:
     a. Have the LLM pick a tool to run next (or, if no tool is possible, have the LLM say as much)
     b. Run the tool, using an LLM to figure out what inputs to give the tool, and using the LLM to course-correct if there was an error
     c. Evaluate the results and either go back to (a) or exit
  3. Send a final response back to the user

At the highest level, that’s really all it is. You basically use an LLM to do all the heavy lifting — thinking through which tool to use, figuring out how to use that tool (e.g. if you need a search engine, using the LLM to come up with the search query), and deciding whether there’s enough data to move forward or to end without a response. At its core, an agent is basically an LLM for brains and tool use for brawn.
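
To make that loop concrete, here’s a minimal sketch in Python. Everything in it (the tool list, the 'TOOL'/'FINAL' reply format, and the helper functions) is a placeholder I’ve made up for illustration, not the API of any particular framework:

```python
# A rough sketch of the agent loop described above. The helpers below are
# stand-ins: call_llm would call your LLM provider, run_tool would actually
# execute the chosen tool.

TOOLS = {
    "search": "Search the web. Input: a search query string.",
    "sql": "Run a SQL query against the app database. Input: a SQL string.",
}

def call_llm(prompt: str) -> str:
    ...  # placeholder for a call to whatever LLM you're using

def run_tool(name: str, tool_input: str) -> str:
    ...  # placeholder for the code that actually runs the tool

def run_agent(user_request: str, max_steps: int = 10) -> str:
    history = [f"User request: {user_request}"]

    for _ in range(max_steps):  # cap the loop so the agent can't run forever
        # 1. Ask the LLM which tool to use next, or whether it's done
        decision = call_llm(
            "You can use these tools:\n"
            + "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
            + "\n\nConversation so far:\n" + "\n".join(history)
            + "\n\nReply with 'TOOL <name> <input>' or 'FINAL <answer>'."
        )

        # 2. If the LLM says it has enough information, exit the loop
        if decision.startswith("FINAL"):
            return decision.removeprefix("FINAL").strip()

        # 3. Otherwise run the chosen tool and feed the result back in
        #    (assume the reply is well-formed for this sketch)
        _, name, tool_input = decision.split(" ", 2)
        try:
            result = run_tool(name, tool_input)
        except Exception as exc:  # show the error so the LLM can course-correct
            result = f"Tool '{name}' failed: {exc}"
        history.append(f"Used {name} on {tool_input!r} and got: {result}")

    return "I couldn't complete that request within my step limit."
```

A real agent adds structured tool schemas, memory management, and much better error handling, but the shape of the loop stays the same.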

And if you think about it, and squint just right, you can see that this is kind of what we as humans do as well. If I need to know about the weather tomorrow, I might first think about which app will give me that information, then go to the app and look it up, and then process that information and summarize (internally) what I found.

Or, in a more complex scenario, if I need to pull a report from my SQL database about which users interacted with which features on my app, I would:

  1. First think about where that data lives
  2. Recognize that most of it is in my database
  3. Go into my database and potentially examine the tables and columns
  4. Create my best guess SQL query to pull the data in a format I need
  5. Analyze and fix any errors that the database gives me about my query, then re-run it until I get reasonable results
  6. Take a closer look at the results to make sure they have exactly what I need
  7. Possibly graph and visualize the results
  8. Save or send the data from the database

In much the same way that I, as a person, plan, think through, and act on a set of tools to achieve an objective, autonomous agents can do the same thing.
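
To make that concrete on the agent side, here’s a hedged sketch of how an agent might automate steps 4 and 5 of that list: ask the LLM for a query, run it, and feed any database error straight back to the LLM so it can fix its own mistake. The function names and prompt wording are mine, not from any particular product:

```python
import sqlite3  # standing in for whatever database the agent is connected to

def call_llm(prompt: str) -> str:
    ...  # placeholder LLM call, as in the loop sketch above

def generate_and_run_query(question: str, schema_description: str,
                           conn: sqlite3.Connection, max_attempts: int = 3):
    """Ask the LLM for a SQL query, run it, and let the LLM fix its own errors."""
    prompt = (
        f"Schema:\n{schema_description}\n\n"
        f"Write a SQL query that answers: {question}\nReturn only the SQL."
    )
    for _ in range(max_attempts):
        sql = call_llm(prompt)
        try:
            return conn.execute(sql).fetchall()   # step 4: best-guess query
        except sqlite3.Error as exc:              # step 5: show the error, retry
            prompt += f"\n\nThat query failed with: {exc}\nPlease fix it."
    raise RuntimeError("Could not produce a working query.")
```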

Now you might ask, if this is so simple, why haven’t we seen more agents permeate all of the software we use?

The devil’s in the details

Much like everything else in this world, building an agent is simple in concept but complicated in practice. There are a lot of blockers that prevent agents from working well, at least not without a lot of additional engineering work.

Tools are complicated

First, LLMs need to be connected to the tools required to service a user’s request. When you think about it, an agent is ultimately only as capable as the sum of the tools it can use. And giving an agent access to those tools can be painstaking work. Some tools are easy to integrate — a search engine, for example, just needs a single input: a search query. But other tools, like a complex API such as Salesforce, can have an effectively infinite number of inputs (since any particular request might need to call one of many API endpoints, and those endpoints themselves might require highly specific, dynamic queries).
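
As a rough illustration of that gap, here’s what tool descriptions might look like in the JSON-schema style that function-calling APIs commonly use. The field values are made up, and the Salesforce entry is deliberately oversimplified; a real integration would need far more than one generic “make a request” tool:

```python
# A simple tool is easy to describe: one obvious input.
search_tool = {
    "name": "web_search",
    "description": "Search the web for up-to-date information.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query."},
        },
        "required": ["query"],
    },
}

# A complex tool is much harder: this single entry papers over dozens of
# Salesforce endpoints, each with its own objects, fields, and query language.
salesforce_tool = {
    "name": "salesforce_request",
    "description": "Call the Salesforce API. Pick the right endpoint, "
                   "HTTP method, and body for the user's request.",
    "parameters": {
        "type": "object",
        "properties": {
            "endpoint": {"type": "string", "description": "API path to call."},
            "method": {"type": "string", "enum": ["GET", "POST", "PATCH", "DELETE"]},
            "body": {"type": "object", "description": "Endpoint-specific payload."},
        },
        "required": ["endpoint", "method"],
    },
}
```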

Adding support for these tools requires more of a traditional engineering effort, where we need to explicitly program instructions into the system for every different edge case that we want the agent to be able to support in the tool.

Certain tools, like databases, can also be incredibly nuanced, complex, and highly custom to a given organization. They might have strange or unobvious data structures that an LLM will never understand without additional context. In that case, we need to let end users supply that context, and then provide it to the LLM on demand.
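
One common workaround, sketched below, is to let users describe their own tables in plain language once, and then fold those notes into the prompt whenever the agent touches the database. The table names and notes here are invented examples:

```python
# Hypothetical user-supplied notes about a nonobvious schema.
table_notes = {
    "usr_evt_v2": "One row per user action; an 'etype' of 3 means a feature click.",
    "acct_dim": "Accounts table; a 'stat_cd' of 9 means the account has churned.",
}

def build_database_prompt(question: str) -> str:
    """Prepend the human-written notes so the LLM can interpret the schema."""
    notes = "\n".join(f"- {table}: {note}" for table, note in table_notes.items())
    return f"Notes about our database:\n{notes}\n\nQuestion: {question}"
```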

And beyond the work of adding support for running a tool, LLMs first need to select the right tool at the right time, which they don’t always do in a rational way.

LLMs can be irrational

The more tools we make available to an LLM, the greater the chance that it selects the wrong tool at any given point in time. While we as agent developers have tricks to reduce this kind of error, it’s unlikely that we can prevent it entirely.

When an LLM selects the wrong tool, it then wastes a lot of time, money, and precious space in its conversational memory executing that tool, which may then affect how it course-corrects down the line.

And even if an LLM selects the right tool, it may not provide its response in a way that our (more structured, more brittle) code expects, which means our code can’t tell which tool was selected and we have to ask the LLM to repeat itself or retry.
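
Here’s a rough sketch of what that defensive parsing might look like, assuming the agent asked the LLM to reply with JSON naming a tool. The retry wrapper, the tool names, and the error message are all my own placeholders:

```python
import json

KNOWN_TOOLS = {"web_search", "sql_query", "send_email"}  # whatever your agent supports

def call_llm(prompt: str) -> str:
    ...  # placeholder LLM call, as in the earlier sketches

def choose_tool(prompt: str, max_retries: int = 2) -> dict:
    """Ask the LLM for a tool choice as JSON; re-ask if the reply doesn't parse."""
    for _ in range(max_retries + 1):
        reply = call_llm(prompt)
        try:
            choice = json.loads(reply)
        except json.JSONDecodeError:
            choice = None
        if isinstance(choice, dict) and choice.get("tool") in KNOWN_TOOLS:
            return choice  # e.g. {"tool": "web_search", "input": "..."}
        # The reply wasn't valid JSON or named an unknown tool, so ask again.
        prompt += (f"\n\nYour last reply was not usable: {reply!r}. "
                   'Reply with JSON like {"tool": "web_search", "input": "..."}.')
    raise ValueError("The LLM never produced a usable tool selection.")
```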

While it’s not a big deal for an LLM to select the wrong tool once or twice, it’s very possible for an LLM to get into a negative cycle of continuously selecting either the wrong tool or the wrong input to a tool, failing over and over in an uncontrolled loop.
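
A simple safety valve, sketched below, is to watch the agent’s recent tool calls and bail out (or hand off to a human) if it keeps repeating itself; the threshold of three is an arbitrary choice on my part:

```python
def is_stuck(recent_calls: list[tuple[str, str]], threshold: int = 3) -> bool:
    """Return True if the last few (tool, input) pairs are identical, which
    usually means the agent is looping rather than making progress."""
    if len(recent_calls) < threshold:
        return False
    return len(set(recent_calls[-threshold:])) == 1
```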

LLMs can go off the rails

An LLM that’s connected to powerful tools can also do powerful damage. LLMs might also simply say something dumb, offensive, or totally wrong, which could have negative consequences for users. And again, while it’s possible to control for these scenarios and put guardrails in place to ensure they don’t happen (or, if they do, that their consequences are minimized), the risk is still there, and many companies don’t want to take it on. So they’re still rolling out simpler agents that aren’t capable of doing much, while they experiment with more powerful LLMs in more controlled environments.
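
One of the most common guardrails, sketched below, is to force a human confirmation step before the agent runs any tool that can change or delete data. The tool names are placeholders, and real systems usually do this through the product’s UI rather than a terminal prompt:

```python
# Tools that can cause real damage require explicit human sign-off.
DESTRUCTIVE_TOOLS = {"send_email", "delete_record", "update_crm"}

def run_tool(name: str, tool_input: str) -> str:
    ...  # placeholder for the code that actually executes the tool

def run_tool_with_guardrail(name: str, tool_input: str) -> str:
    if name in DESTRUCTIVE_TOOLS:
        answer = input(f"Agent wants to run {name} with {tool_input!r}. Allow? [y/N] ")
        if answer.strip().lower() != "y":
            return "The user declined this action."
    return run_tool(name, tool_input)
```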


As you can see, much of the work of building a true autonomous agent goes into creating the tools, infrastructure, guardrails, and contextual ecosystem around it. This is a more traditional software engineering effort, one that takes time and development resources.

That’s why we’re starting to see simpler agents for now: ones that only support information retrieval and analysis, or that require humans to orchestrate them manually with those awfully tedious workflow builder interfaces.

These are the worst...

As agents start to progress, it’s likely we’ll get more sophisticated components and infrastructure that help improve their performance. There may be a world where we combine the text-based reasoning of an LLM with the predictive capabilities of a machine learning model, and the combination of these two systems could lead to more reasonable, “accurate” agents.

As agent developers continue to iterate on these systems, we’ll likely see smaller, more incremental agents continue to dominate the software landscape as we eventually build up to more horizontal, fully capable autonomous agents. The recent increase in excitement around “vertical agents” is likely an outcome of this approach.

That’s all for today’s issue. If you think your friends or family might like this content, I’d appreciate you sending them over to my subscription page. I’ll continue creating a new issue every couple of weeks, and while I’ve got a lot of content already planned out, if there’s anything you want to learn more about, please let me know.

Until next time,

-Shanif

