Edition 36 - Improving LLM Safety & Reliability
This month's edition of the Evaluator is packed with cutting-edge insights and practical know-how from our team. Learn how to instrument your LLM app, how to create multi-agent applications, or check out our agents series (now on-demand) for real-world examples of agents in production.
As always, we conclude with some of our favorite news, papers, community threads, and upcoming events.
Improving LLM Safety & Reliability in LLM Applications
Today’s AI engineering loop is brittle: small changes can cause big performance drops. Building better AI requires addressing LLM safety and reliability, and this blog shows you how. Eric Xiao reviews the ways to improve safety and reliability in your LLM applications, including tracing, evaluations, experiments, guardrails, and more. Read it
LLM Evaluation Course
LLM evaluations can take many forms, from code-based comparisons against ground-truth data to LLM-as-a-judge queries that validate outputs. This resource by Aparna Dhinakaran and Steven Miller covers the different types of LLM evals, how they are used, and important factors to consider when structuring your LLM evaluation system. Read it
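To make the two eval styles concrete, here is a minimal Python sketch. It is not taken from the course: the function names, the normalization rule, and the judge prompt wording are all illustrative assumptions, and the judge function only builds a prompt rather than calling any model.

```python
def exact_match_eval(outputs, ground_truth):
    """Code-based eval: fraction of outputs equal to the reference answer.

    Normalization (lowercase, collapsed whitespace) is an assumption made
    for this sketch; real evals often need task-specific comparison logic.
    """
    if len(outputs) != len(ground_truth) or not outputs:
        raise ValueError("outputs and ground_truth must be non-empty and equal in length")

    def normalize(text):
        return " ".join(text.lower().split())

    matches = [normalize(o) == normalize(g) for o, g in zip(outputs, ground_truth)]
    return sum(matches) / len(matches)


def build_judge_prompt(question, answer):
    """LLM-as-a-judge eval: build the prompt a judge model would receive.

    Hypothetical template; sending it to a model is left to your LLM client.
    """
    return (
        "You are grading an answer for correctness.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with exactly one word: correct or incorrect."
    )


score = exact_match_eval(["Paris", " paris "], ["paris", "London"])
prompt = build_judge_prompt("What is the capital of France?", "Paris")
```

The code-based check is cheap and deterministic but only works when a reference answer exists; the judge prompt trades that determinism for the ability to grade open-ended outputs.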
OpenAI's Realtime API
Sally-Ann DeLucia and Aparna Dhinakaran cover how to seamlessly integrate powerful language models into your applications for instant, context-aware responses that drive user engagement. Whether you’re building chatbots, dynamic content tools, or enhancing real-time collaboration, we walk through the API’s capabilities and potential use cases. Read it
Instrumenting Your LLM Application: Arize Phoenix and Vercel AI SDK
Evan Jolley dives into why instrumentation matters for LLM applications, outlines the benefits of implementing it, and provides a guide to integrating Arize Phoenix with Vercel AI SDK for observability in Next.js applications. Read it
What is AutoGen?
AutoGen is a framework that helps you easily create multi-agent applications. Multi-agent applications are a relatively recent idea that involves defining multiple LLM agents, each with its own goals and capabilities, and allowing them to work together to achieve an end goal. John Gilhuly explains how it works. Read it
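The multi-agent idea can be sketched in plain Python. This is a toy illustration only and does not use AutoGen's API: the Agent class, the round-robin loop, and the "DONE" stop signal are assumptions made for the example, and each agent's respond function is a stub standing in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Agent:
    name: str
    goal: str                       # what this agent is trying to achieve
    respond: Callable[[str], str]   # stub standing in for an LLM call


def run_conversation(agents: List[Agent], task: str, max_turns: int = 4) -> List[Tuple[str, str]]:
    """Pass a message between agents in round-robin order.

    Stops when an agent's reply contains "DONE" (an assumed convention
    for this sketch) or when max_turns is reached.
    """
    transcript = []
    message = task
    for turn in range(max_turns):
        agent = agents[turn % len(agents)]
        message = agent.respond(message)
        transcript.append((agent.name, message))
        if "DONE" in message:
            break
    return transcript


writer = Agent("writer", "draft an answer", lambda msg: f"Draft for: {msg}")
reviewer = Agent("reviewer", "approve or request changes", lambda msg: f"DONE: approved '{msg}'")

transcript = run_conversation([writer, reviewer], "Summarize the release notes")
for name, message in transcript:
    print(f"{name}: {message}")
```

A framework such as AutoGen takes care of the parts stubbed out here, such as the model calls and the conversation management between agents.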
On Demand: Building an Agent or Assistant Series
Our five-part series on real-life agents deployed in production is now available to watch. We dive deep into the agent architectures, the systems used to develop them, and the lessons learned from running them in production. Each installment unpacks a new example agent or agent component from a real-world deployment.
If you want a primer first, our previous series is available on-demand, and covers basic agent components, architectures, and frameworks. Watch it
Staff Picks 🤖
Here's a roundup of our team's favorite recent news, papers, and community threads.