Edition 26 - The LLM Observability Checklist ✓
The Drift is a collection of top content we've published recently at Arize AI. This month's edition features your definitive LLM observability checklist, a series of on-demand workshops that will help you navigate advanced LLM evals, a developer's guide to evaluating prompts, a two-part series that dives into the latest generative models, and more.
Read on and dive in...
The Definitive LLM Observability Checklist
What should teams look for when assessing an LLM observability platform? Informed by experience working with hundreds of practitioners across dozens of large enterprises and technology companies with LLM apps in production, this checklist covers the essential elements to consider when evaluating an LLM observability provider. Read it.
Advanced LLM Evals
These workshops on advanced techniques and best practices for leveraging LLM evals cover everything from creating an LLM eval from scratch to different classes of evals, how to generate data, and advanced techniques for LLM retrieval evals. Watch them.
Evaluating Prompts: A Developer's Guide
At its core, prompt engineering is about crafting textual cues that effectively guide AI responses. These prompts, ranging from straightforward templates to complex structures, are instrumental in steering the vast knowledge of LLMs.
This article delves into the nuances of prompt engineering, the iterative processes essential for refining prompts, and the challenges that come with them. Understanding prompt engineering is crucial for anyone looking to unlock the full potential of LLMs in practical applications. Read it.
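For a concrete sense of what a templated prompt looks like in practice, here is a minimal sketch in Python; the template wording and variable names are illustrative only, not taken from the article.

```python
# A simple prompt template: fixed instructions plus slots for runtime variables.
# The wording and variable names here are illustrative, not from the article.
SUPPORT_TEMPLATE = """You are a support assistant for {product_name}.
Answer the customer's question using only the context provided.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question:
{question}
"""

def build_prompt(product_name: str, context: str, question: str) -> str:
    """Fill the template with runtime values to produce the final prompt."""
    return SUPPORT_TEMPLATE.format(
        product_name=product_name, context=context, question=question
    )

prompt = build_prompt(
    product_name="Acme Router",
    context="The Acme Router supports WPA3 and firmware updates over the web UI.",
    question="Does it support WPA3?",
)
print(prompt)
```

Iterating on a template like this one (tightening instructions, adding few-shot examples, adjusting constraints) and re-evaluating the outputs is the refinement loop the article walks through.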
Benchmarking OpenAI Function Calling and Explanations
We benchmark OpenAI’s GPT models with function calling and explanations against various performance metrics. We are specifically interested in how well the GPT models and OpenAI features perform at correctly classifying hallucinated and relevant responses.
The results show the trade-offs between speed and performance for different LLM application systems, and we discuss how these results with explanations can be used for data labeling, LLM-assisted evaluation, and quality checks. The experimental framework we used is provided in the piece so that practitioners can iterate on and improve the default classification template. Read it.
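To make the setup concrete, here is a minimal sketch of how function calling can be used to have a GPT model return a structured label plus explanation for a response; the function name, schema, and prompt wording below are our own assumptions, not the article's default classification template.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative schema: force the model to return a structured label and an explanation.
classify_tool = {
    "type": "function",
    "function": {
        "name": "record_classification",
        "description": "Record whether the answer is grounded in the reference text.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string", "enum": ["factual", "hallucinated"]},
                "explanation": {"type": "string", "description": "Why this label was chosen."},
            },
            "required": ["label", "explanation"],
        },
    },
}

def classify(reference: str, answer: str) -> dict:
    """Ask the model to judge an answer against reference text via function calling."""
    response = client.chat.completions.create(
        model="gpt-4",  # swap in whichever GPT model is being benchmarked
        messages=[{
            "role": "user",
            "content": (
                "Reference text:\n" + reference +
                "\n\nAnswer to evaluate:\n" + answer +
                "\n\nClassify whether the answer is supported by the reference text."
            ),
        }],
        tools=[classify_tool],
        tool_choice={"type": "function", "function": {"name": "record_classification"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)  # {"label": ..., "explanation": ...}
```

Because the output arrives as structured arguments rather than free text, the labels and explanations can be logged directly for data labeling and quality checks, at the cost of the extra latency the benchmarks measure.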
The Build vs. Buy Guide
As the generative AI field continues to evolve, teams face a critical decision: whether to build or buy their AI observability infrastructure. How should they navigate this decision as new research, foundation models, orchestration frameworks, and methods constantly upend established techniques?
Informed by work with dozens of enterprises and companies that have both traditional ML models and LLM apps live in production, this paper suggests some approaches for making build-versus-buy decisions in today’s world. Read it.
Mixtral 8x7B Discussion
For the last paper read of the year, Arize CPO & Co-Founder Aparna Dhinakaran is joined by Dat Ngo (ML Solutions Architect) and Aman Khan (Group Product Manager) for an exploration of the new kids on the block: Gemini and Mixtral 8x7B.
There’s a lot to cover, so this week’s paper read is Part I in a series about Mixtral and Gemini. In Part I, we provide some background and context for Mixtral 8x7B from Mistral AI, a high-quality sparse mixture-of-experts (SMoE) model that outperforms Llama 2 70B on most benchmarks with 6x faster inference. Read it.
How to Prompt LLMs for Text-to-SQL Paper Read
For this paper read, we’re joined by Shuaichen Chang, now an Applied Scientist at AWS AI Lab and author of this week’s paper to discuss his findings. Shuaichen’s research (conducted at the Ohio State University) investigates the impact of prompt constructions on the performance of large language models (LLMs) in the text-to-SQL task, particularly focusing on zero-shot, single-domain, and cross-domain settings. Read it.
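For flavor, a zero-shot text-to-SQL prompt is often built by serializing the database schema and appending the question. The sketch below is a generic illustration with an invented schema and wording, not one of the exact prompt constructions compared in the paper.

```python
# Illustrative zero-shot text-to-SQL prompt: schema serialized as DDL plus the question.
# The schema and wording are our own; the paper compares several alternative constructions.
schema_ddl = """CREATE TABLE singer (singer_id INT, name TEXT, country TEXT, age INT);
CREATE TABLE concert (concert_id INT, singer_id INT, year INT);"""

question = "How many singers from France performed in concerts after 2015?"

prompt = (
    "Given the database schema below, write a SQL query that answers the question.\n\n"
    f"{schema_ddl}\n\n"
    f"Question: {question}\n"
    "SQL:"
)
print(prompt)
```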
Why Enterprise Executives Should be Hip to LLMOps
Given the rapid rate of adoption, some early growing pains are inevitable. Among early adopters of LLMs, nearly half (43%) cite issues like evaluation, hallucinations, and needless abstraction as implementation challenges. How can large enterprises overcome these challenges to deliver results and minimize organizational risk?
Here are three keys that enterprises successfully deploying LLMs are embracing to rise to the challenge. Read it.
Paper Read Wednesday January 10: Gemini
Join us for a deep dive into Gemini on January 10 (Wednesday) at 10:15am PST. This is Part II in a series; we discussed Mixtral 8x7B in December, which you can find linked above. Sign up here.
Staff Picks 🤖
Here's a roundup of our team's recent favorite news, papers, and community threads.