The Agents Newsletter #8: Guardrails
Howdy. Welcome to issue #8 of The Agents Newsletter. If this is the first issue you’re reading, thanks for (hopefully) subscribing. And a special thank you to Jeff N. for all the continuous kind words of support for this newsletter and the rest of my work. It makes a big difference, you rock.
Now, onwards.
If you talk to me about agents for long enough, you'll inevitably hear me say that agents are "smart but not that smart." They've got the entirety of the world's information encoded into their neurons, and yet they frequently don't follow basic output instructions. They can provide truly insightful, and oftentimes novel, analyses and reasoning, but sometimes you need to tell them the same thing 5 different ways. They're an enigma. They work until they don't. And it's the bane of an agent developer's existence.
Getting agents to handle instructions and reason in a logical way might be a fool's errand. And yet, we persist in building these darn things because when they do work, they're magical. We must press on. But we have to do so with the knowledge that these things may screw up, sometimes in minor ways, sometimes in major ways. As agent developers, it's our job to do our best to understand where they'll screw up and try to control for those situations as best we can. That's where guardrails come in.
Hard-coded checks
If you read the 2nd issue of this newsletter, you may remember that, under the hood, many agents follow a similar high-level pattern: ask an LLM what to do next, parse its structured response, run whichever tool it picked, feed the results back to the LLM, and repeat until the task is done.
This approach is robust and straightforward to understand, but its generalizability is both a strength and a weakness. It allows us to plug-and-play lots of different tools, thereby adding a huge amount of capabilities and skills to an agent, but it also introduces a lot of surface area for agents to make errors.
Guardrails allow us, as agent developers (and users), to check for known or common errors along the way and introduce strategies to mitigate those errors when we see them. In short, we can use hard-coded checks to guard against common or frequently seen errors in our agents, and, when implemented correctly, these guardrails can go a long way toward turning a brittle and error-prone agent into a reliable and skilled one.
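To make that concrete, here's a minimal sketch of the kind of loop and checkpoints I'm describing. Everything here is illustrative: `call_llm`, `run_tool`, and the guardrail helpers are placeholders I'm making up for this example, not functions from any particular framework (I'll sketch a few of them later in this issue).

```python
import json

def run_agent(user_request: str, max_steps: int = 10) -> str:
    """A stripped-down agent loop with guardrail checkpoints along the way."""
    history = [{"role": "user", "content": user_request}]

    for _ in range(max_steps):
        raw_response = call_llm(history)  # placeholder: ask the LLM what to do next

        # Guardrail 1: make sure we can actually parse the structured output
        plan = extract_json(raw_response)
        if plan is None:
            history.append({"role": "user", "content": "Your last reply wasn't valid JSON. Please try again."})
            continue

        if plan.get("action") == "final_answer":
            return plan.get("answer", "")

        # Guardrail 2: only run tools that actually exist, with arguments we consider safe
        tool_name = resolve_tool_name(plan.get("tool"))
        if tool_name is None or not is_request_safe(plan.get("arguments", {})):
            history.append({"role": "user", "content": f"'{plan.get('tool')}' isn't a valid or allowed tool call."})
            continue

        result = run_tool(tool_name, plan.get("arguments", {}))  # placeholder: actually run the tool
        history.append({"role": "assistant", "content": raw_response})
        history.append({"role": "user", "content": f"Tool result: {json.dumps(result)}"})

    return "I couldn't complete the request within the step limit."
```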
Now, while adding guardrails sounds like a relatively straightforward and effective fix, in reality, it turns into a game of whack-a-mole. Some known (or, more frequently, unknown) error comes up, we do our best to fix it using a combination of prompt engineering, if/then checks, heuristics, or other typical software-development techniques, push out the fix, and wait for the next strange behavior to reveal itself, at which point we start the process all over again. Unfortunately, as developers, we're always playing catch-up, and this process is unlikely to change until large action models (more on these in a future issue) or reinforcement learning agents become the norm.
So we persist.
We try to add automated tests to ensure that our fixes are working more often than not. We add robust checks to handle increasingly sophisticated lapses in logic and judgment in underlying LLMs. We try to add training instructions or situational context that helps LLMs minimize errors. And the smartest among us ultimately end up narrowing the scope of their agents 😀. For now, it’s the best we can do, but I suspect this paradigm will fall by the wayside as we develop better techniques for machine reasoning and error-correction.
Now, I know all this has been fairly abstract and theoretical, so in the rest of this issue, I’ll run through some of the guardrails I’ve had to implement when creating Locusive and Nobi, with the hopes that it may save some of you some time and energy, or it might just be fun for you to laugh at the absurdity of the whole situation.
Examples of guardrails
Here are some of the fun problems in LLM illogic that I’ve had to contend with over the past two years:
Incorrect output formatting
Many agent developers ask LLMs to format their output in a structured way to make it easier for their code to find the most important pieces of information, like what tools to run next or what sources were used. Many of us use JSON for this formatting (though, I recently read a post that said XML may be better…). You’d think that formatting your output in a structured manner would be something that LLMs would excel at.
You’d think.
I generally ask LLMs to output their entire response as standard JSON with no extra text, and yet I frequently get back random ramblings, lists, incorrectly encoded characters, and a whole slew of other issues. So, while I have the standard fallback of asking an LLM to just "try again", I'll frequently try to see if I can parse its response first by doing things like looking for standard code markers (like the triple backtick ``` or the first curly brace "{"). I created an entire code file simply for the purpose of cleaning up incorrectly formatted responses from an LLM so that I could try to parse them into something more comprehensible for my code.
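As a rough illustration, here's the kind of cleanup logic that lives in that file. This is a minimal sketch, not my actual code, and the markers worth checking for will depend on the models you use:

```python
import json
from typing import Optional

def extract_json(raw: str) -> Optional[dict]:
    """Try to salvage a JSON object from a messy LLM response."""
    text = raw.strip()

    # If the model wrapped its answer in a fenced code block, strip the fences
    if "```" in text:
        parts = text.split("```")
        text = parts[1] if len(parts) > 1 else parts[0]
        text = text.removeprefix("json").strip()  # drop a leading "json" language tag

    # Fall back to grabbing everything between the first "{" and the last "}"
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1 or end <= start:
        return None
    candidate = text[start:end + 1]

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # the caller can always fall back to asking the LLM to try again
```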
Maliciousness checks
Agents with access to tools can be risky, and agents can be dumb as rocks when someone tries to get them to do something they shouldn't be doing. They'll proceed exactly as they're asked, which can be dangerous if they have access to tools that allow them to change or modify data. That's why I've added lots of different guardrails that prevent them from doing anything "bad" with the systems they have access to. These checks can range from the simple, like checking for dangerous SQL keywords (DROP, DELETE, etc.) and blocking them, to using other LLMs to do pre-checks and analyze the maliciousness of an anonymous user's request, and everything in between.
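The simplest version of that SQL check is just a keyword blocklist. Here's a rough sketch; the keyword list is illustrative rather than exhaustive, and a real deployment would also want protections like read-only database credentials:

```python
import re

# Keywords that should never appear in a query an agent is allowed to run
DANGEROUS_SQL_KEYWORDS = {"DROP", "DELETE", "TRUNCATE", "ALTER", "UPDATE", "INSERT", "GRANT"}

def is_query_safe(sql: str) -> bool:
    """Reject queries containing keywords that could modify or destroy data."""
    tokens = set(re.findall(r"[A-Za-z_]+", sql.upper()))
    return tokens.isdisjoint(DANGEROUS_SQL_KEYWORDS)

# is_query_safe("SELECT name FROM customers")  -> True
# is_query_safe("DROP TABLE customers")        -> False
```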
Having to manage dumb LLMs is enough without malicious users as well… yeesh.
Incorrect tool names
When I need an agent to take an action, I'll generally ask the underlying LLM to provide me with the name of a tool to use, and it's not uncommon for the LLM to make up a tool name that doesn't exist. So I've had to add keywords and aliases to my tools to check whether an agent needs to use one of my existing tools but just couldn't identify it properly. At some point down the line, I'll probably switch to asking the LLM to provide a tool's unique identifier rather than its name, but even then, it might just hallucinate the identifier, and I'll have to add hard-coded checks to see if it did so.
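Here's a sketch of what that alias lookup can look like. The tool names and aliases are made up for illustration, and difflib's fuzzy matching is just one way to catch near-misses:

```python
from difflib import get_close_matches
from typing import Optional

# Map each real tool to the aliases and near-names an LLM might invent for it
TOOL_ALIASES = {
    "search_documents": {"search_documents", "search_docs", "document_search", "find_documents"},
    "run_sql_query": {"run_sql_query", "query_database", "sql_query", "execute_sql"},
}

def resolve_tool_name(requested: Optional[str]) -> Optional[str]:
    """Map whatever name the LLM produced onto a tool that actually exists."""
    if not requested:
        return None
    normalized = requested.strip().lower().replace(" ", "_")

    for real_name, aliases in TOOL_ALIASES.items():
        if normalized in aliases:
            return real_name

    # Last resort: fuzzy-match against every known name and alias
    all_names = [alias for aliases in TOOL_ALIASES.values() for alias in aliases]
    close = get_close_matches(normalized, all_names, n=1, cutoff=0.8)
    if close:
        for real_name, aliases in TOOL_ALIASES.items():
            if close[0] in aliases:
                return real_name
    return None
```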
Hallucinations
LLMs are notorious for just making stuff up. Some (very technical) folks say it's a feature, but when you're trying to get an LLM to make rational, linear decisions or provide verifiable stats, it can be a bug. There are a ton of different ways to deal with hallucinations, the most effective of which is to provide additional context when asking an LLM for an answer, but even then, it can still produce downstream hallucinations (for example, when you ask it to aggregate database query results into a file and upload that file to the cloud, it could simply make up the fact that it uploaded a file and provide a bogus link).
One of the most effective ways I've found for dealing with hallucinations, outside of providing relevant context, is to check responses for keywords that are common indicators of a made-up answer. These can include BS names like "John Smith" or "Jane Doe", nonsense URLs like "example.com", filenames like "example.csv", non-existent or incorrect cloud URLs (when checking to see if a file was truly uploaded), etc.
This rudimentary strategy works extremely well, but it requires me to keep a constant eye on my agent's responses and analyze them regularly to see whether new indicators of hallucinations are cropping up. Integration tests can help quite a bit here.
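The check itself is about as simple as it sounds. A sketch, with an indicator list you'd keep growing as you spot new tells:

```python
from typing import Optional

# Strings that tend to show up when an answer has been made up.
# This list is illustrative; the real value comes from growing it over time.
HALLUCINATION_INDICATORS = [
    "john smith",
    "jane doe",
    "example.com",
    "example.csv",
]

def looks_hallucinated(response: str, expected_url_prefix: Optional[str] = None) -> bool:
    """Flag responses that contain common tells of a fabricated answer."""
    lowered = response.lower()
    if any(indicator in lowered for indicator in HALLUCINATION_INDICATORS):
        return True
    # If we expected a real cloud URL (e.g., after a file upload), make sure it's actually there
    if expected_url_prefix and expected_url_prefix not in response:
        return True
    return False
```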
Asking an LLM to correct itself
LLMs can provide fluid responses, which isn't ideal for structured and less-flexible code. Sometimes your code is expecting a particular JSON attribute to be present. Sometimes it expects a particular format and the LLM provides something entirely different. One of the best ways to deal with these issues is to simply ask the LLM to try again while providing it with the problem or error that prevented your code from proceeding correctly. LLMs will frequently be smart enough to take a look at their last message, analyze the error, and correct themselves accordingly.
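In code, that usually looks like a retry loop that feeds the error straight back to the model. A rough sketch, reusing the hypothetical `call_llm` and `extract_json` helpers from the earlier sketches:

```python
def ask_until_parseable(history: list, max_attempts: int = 3) -> dict:
    """Ask the LLM for structured output, feeding parse errors back so it can correct itself."""
    for _ in range(max_attempts):
        raw = call_llm(history)       # placeholder from the earlier sketch
        parsed = extract_json(raw)
        if parsed is not None:
            return parsed

        # Tell the model exactly what went wrong and let it try again
        history.append({"role": "assistant", "content": raw})
        history.append({
            "role": "user",
            "content": "I couldn't parse that as JSON. Please resend your entire response "
                       "as a single valid JSON object with no extra text.",
        })
    raise ValueError(f"LLM failed to produce valid JSON after {max_attempts} attempts")
```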
There are so many other ways in which an agent can go off the rails that writing about them all will turn this into a novel, so I’ll stop here. While I’ll admit that part of the fun of creating an agent is to find and squash all the ways that it can go off the rails, it does become a tedious job. Often, the guardrails we put into our code work wonderfully. Sometimes, despite all of our tweaking and prompt engineering, they fail miserably, and we have to take an entirely new approach to getting our agents to do what we want.
It’s all interesting, and, while it may be made obsolete by future advances in AI (which is what I believe will happen), for now, there’s little else we can do outside of putting in the legwork to strengthen and harden our agents against these common errors.
That's it for today's issue. I hope you got a flavor of the joy that is guardrail development for agents. As always, I really welcome comments, questions, and suggestions for my content. If you haven't subscribed yet and you'd like to get this newsletter delivered to you every other week, you can do so here.
Until next time.
-Shanif