Hey!

If there's a topic you want me to cover click reply and let me know!

A few months ago I sat down with my coffee ☕️, ready to build a RAG pipeline. Instead I spent the week debugging Agents and building observability and monitoring tools.

Luckily I had just implemented one of the foundational pieces, so I thought I was off to a good start 🏃.

I wasn’t.

But it turned out to be 10x more valuable than the RAG pipeline, and only 5x the work 😅. Here’s what I learned and how you can do it too.

Evals are all the rage, but there’s much lower-hanging fruit. 🍑

Visibility is everything.

You need to be able to see what your Agent is doing. This is pretty simple, but the amount of data can be overwhelming, especially when the Agent has a lot of tools, does a lot of thinking, and runs for a long time. Make sure you’re tracing everything (requests, responses, errors, everything!). This will quickly explode in complexity, because Agents are basically threading on steroids.
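Here’s a rough sketch of the kind of tracing I mean, using plain Python logging. `get_invoice` is a made-up example tool, and you’d probably swap the logger for whatever observability stack you already have:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent.trace")

def traced(tool_fn):
    """Wrap a tool so every call logs its arguments, result, latency, and errors."""
    def wrapper(**kwargs):
        call_id = str(uuid.uuid4())  # correlate request, response and error lines
        logger.info("tool_call id=%s tool=%s args=%s",
                    call_id, tool_fn.__name__, json.dumps(kwargs))
        start = time.monotonic()
        try:
            result = tool_fn(**kwargs)
            logger.info("tool_result id=%s tool=%s ms=%.0f result=%s",
                        call_id, tool_fn.__name__,
                        (time.monotonic() - start) * 1000, result)
            return result
        except Exception:
            logger.exception("tool_error id=%s tool=%s", call_id, tool_fn.__name__)
            raise
    return wrapper

@traced
def get_invoice(invoice_id: str) -> dict:
    # Made-up example tool: look the invoice up wherever you actually store it.
    return {"invoice_id": invoice_id, "status": "paid"}
```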

You need to trust that when your Agent uses its tools, they work as expected. It’s like giving an employee a dashboard: without experience, they won’t know whether there’s an issue or there really is no data to return.

You also need a way to know if it’s hallucinating or using genuine data.

This is a hard problem.

The lowest-hanging fruit right now is: test your tools.

Happy path, sad path, etc.

I know, groundbreaking, but easy to miss.
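For example, with pytest (the `my_agent.tools` module and the `get_invoice` tool are made up, and I’m assuming it raises `LookupError` for unknown invoices; adjust for your own tools):

```python
import pytest

from my_agent.tools import get_invoice  # hypothetical module layout

def test_get_invoice_happy_path():
    # Happy path: a known invoice comes back with the fields the Agent relies on.
    invoice = get_invoice(invoice_id="3f0c0e1e-9c1b-4d2a-8a3e-0d6f2b7c9a10")
    assert invoice["status"] in {"paid", "unpaid"}

def test_get_invoice_sad_path():
    # Sad path: an unknown invoice should be an explicit error, not an empty dict
    # the Agent might mistake for "there's genuinely no data".
    with pytest.raises(LookupError):
        get_invoice(invoice_id="00000000-0000-0000-0000-000000000000")
```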

Similarly, you need to validate the arguments your LLM is passing to the tool.

Most LLM providers use JSON Schema for tool parameters, which will get you so far. Unfortunately, JSON Schema doesn’t validate UUIDs (they’re just strings), so you need to check they’re actually UUIDs and, better yet, that they actually exist in your database. Be careful with this, though: you don’t want to create an attack vector.
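A rough sketch of that extra validation layer, using Pydantic (the `invoice_exists` lookup is a stand-in for your own database check):

```python
from uuid import UUID

from pydantic import BaseModel, ValidationError

class GetInvoiceArgs(BaseModel):
    # JSON Schema would accept any string here; Pydantic insists on a real UUID.
    invoice_id: UUID

def invoice_exists(invoice_id: UUID) -> bool:
    # Stand-in for a real database lookup. Keep it cheap and rate-limited so
    # argument validation doesn't become an attack vector.
    return False

def validate_args(raw_args: dict) -> GetInvoiceArgs | str:
    try:
        args = GetInvoiceArgs(**raw_args)
    except ValidationError as exc:
        # Return the error as text so it can be fed back to the LLM.
        return f"Invalid arguments: {exc}"
    if not invoice_exists(args.invoice_id):
        return f"No invoice found with id {args.invoice_id}."
    return args
```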

You also need to make sure your tools are retried when you return validation errors; otherwise, what’s the point?
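One way to wire that up is to hand the validation error back to the model as the tool result and let it try again, with a cap on attempts. This sketch leans on the `validate_args` and `get_invoice` examples above, and `call_llm` is a placeholder for whichever provider SDK you’re using:

```python
MAX_TOOL_RETRIES = 3

def run_tool_with_retries(messages: list[dict], tool_call: dict) -> dict:
    """Give the LLM a few chances to fix its own arguments before giving up."""
    for _ in range(MAX_TOOL_RETRIES):
        args = validate_args(tool_call["arguments"])
        if not isinstance(args, str):  # validation passed, actually run the tool
            return {"ok": True, "result": get_invoice(str(args.invoice_id))}
        # Feed the validation error back as the tool's output so the model retries.
        messages.append({"role": "tool", "name": tool_call["name"], "content": args})
        tool_call = call_llm(messages)  # placeholder for your provider's API call
    return {"ok": False, "error": "arguments still invalid after retries"}
```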

Which leads nicely to the next point:

Tool metrics 🎯

You’re going to want to know when those tools aren’t retried, how many times they’re retried, and when the retries fail. Start sending those to Slack at a minimum. If you have a lot on your plate, start collecting this as a metric so you can prioritise which tools need improving the most.
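A bare-bones version of this is just a couple of counters and a Slack incoming webhook (the webhook URL below is a placeholder):

```python
import json
import urllib.request
from collections import Counter

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

retry_counts: Counter[str] = Counter()    # retries per tool
failed_retries: Counter[str] = Counter()  # retries that still ended in failure

def record_retry(tool_name: str, succeeded: bool) -> None:
    retry_counts[tool_name] += 1
    if not succeeded:
        failed_retries[tool_name] += 1
        notify_slack(f"Tool `{tool_name}` failed even after retrying "
                     f"({failed_retries[tool_name]} failures so far).")

def notify_slack(text: str) -> None:
    # Slack incoming webhooks accept a JSON body with a "text" field.
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```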

You need to know how often the LLM is passing wrong or incomplete arguments to the tool. You can use this to tweak your tool descriptions and prompts and track the improvement.

You need to know when the tools are running but returning unexpected results. Because a tool can basically be anything, this might sound vague.

So to be more specific: if your tool is another LLM, you need Evals (insert rabbit hole). If your tool is getting data or performing an action, you should unit test it.

You should measure how successful your tools are. A tool’s success rate is simply the percentage of the time it returns what it should. Realistically, you should aim for 100%.
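Measuring it doesn’t need to be fancy. If you’re already logging tool outcomes, a per-tool success rate is just a bit of counting:

```python
from collections import defaultdict

# True when the tool returned what it should, False otherwise.
outcomes: defaultdict[str, list[bool]] = defaultdict(list)

def success_rate(tool_name: str) -> float:
    results = outcomes[tool_name]
    return 100.0 * sum(results) / len(results) if results else 0.0

outcomes["get_invoice"] += [True, True, True, False]
print(f"get_invoice success rate: {success_rate('get_invoice'):.0f}%")  # 75%
```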

Reply to let me know if you'd like me to dive deeper into any of this.

See you next week 👋

Dan
