Edition 26 - The LLM Observability Checklist ✓
The Drift is a collection of top content we've published recently at Arize AI. This month's edition features your definitive LLM observability checklist, a series of on-demand workshops that will help you navigate advanced LLM evals, a developer's guide to evaluating prompts, a two-part series that dives into the latest generative models, and more.
Read on and dive in...
The Definitive LLM Observability Checklist
What should teams look for when assessing an LLM observability platform? Informed by experience working with hundreds of practitioners across dozens of large enterprises and technology companies with LLM apps in production, this checklist covers the essential elements to consider when evaluating an LLM observability provider. Read it.
Advanced LLM Evals
These workshops on advanced techniques and best practices for leveraging LLM evals cover everything from creating an LLM eval from scratch to different classes of evals, how to generate data, and advanced techniques for LLM retrieval evals. Watch them.
Evaluating Prompts: A Developer's Guide
At its core, prompt engineering is about crafting textual cues that effectively guide AI responses. These prompts, ranging from straightforward templates to complex structures, are instrumental in steering the vast knowledge of LLMs.
This article delves into the nuances of prompt engineering, the iterative processes essential for refining prompts, and the challenges that come with them. Understanding prompt engineering is crucial for anyone looking to unlock the full potential of LLMs in practical applications. Read it.
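For a concrete sense of what a templated prompt looks like in practice, here is a minimal sketch in Python; the template wording and variable names are illustrative only, not taken from the article.

```python
# A simple prompt template: fixed instructions plus slots for runtime variables.
# The wording and variable names here are illustrative, not from the article.
SUPPORT_TEMPLATE = """You are a support assistant for {product_name}.
Answer the customer's question using only the context provided.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question:
{question}
"""

def build_prompt(product_name: str, context: str, question: str) -> str:
    """Fill the template with runtime values to produce the final prompt."""
    return SUPPORT_TEMPLATE.format(
        product_name=product_name, context=context, question=question
    )

prompt = build_prompt(
    product_name="Acme Router",
    context="The Acme Router supports WPA3 and firmware updates over the web UI.",
    question="Does it support WPA3?",
)
print(prompt)
```

Iterating on a template like this one (tightening instructions, adding few-shot examples, adjusting constraints) and re-evaluating the outputs is the refinement loop the article walks through.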
Benchmarking OpenAI Function Calling and Explanations
We benchmark OpenAI’s GPT models with function calling and explanations against various performance metrics. We are specifically interested in how well the GPT models and OpenAI features perform at correctly classifying hallucinated and relevant responses.
The results show the trade-offs between speed and performance for different LLM application systems, and we discuss how these results with explanations can be used for data labeling, LLM-assisted evaluation, and quality checks. The experimental framework we used is provided in the piece so that practitioners can iterate on and improve the default classification template. Read it.
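To make the setup concrete, here is a minimal sketch of how function calling can be used to have a GPT model return a structured label plus explanation for a response; the function name, schema, and prompt wording below are our own assumptions, not the article's default classification template.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative schema: force the model to return a structured label and an explanation.
classify_tool = {
    "type": "function",
    "function": {
        "name": "record_classification",
        "description": "Record whether the answer is grounded in the reference text.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string", "enum": ["factual", "hallucinated"]},
                "explanation": {"type": "string", "description": "Why this label was chosen."},
            },
            "required": ["label", "explanation"],
        },
    },
}

def classify(reference: str, answer: str) -> dict:
    """Ask the model to judge an answer against reference text via function calling."""
    response = client.chat.completions.create(
        model="gpt-4",  # swap in whichever GPT model is being benchmarked
        messages=[{
            "role": "user",
            "content": (
                "Reference text:\n" + reference +
                "\n\nAnswer to evaluate:\n" + answer +
                "\n\nClassify whether the answer is supported by the reference text."
            ),
        }],
        tools=[classify_tool],
        tool_choice={"type": "function", "function": {"name": "record_classification"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)  # {"label": ..., "explanation": ...}
```

Because the output arrives as structured arguments rather than free text, the labels and explanations can be logged directly for data labeling and quality checks, at the cost of the extra latency the benchmarks measure.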
The Build vs. Buy Guide
As the generative AI field continues to evolve, teams face a critical decision: whether to build or buy their AI observability infrastructure. How should they navigate this decision as new research, foundation models, orchestration frameworks, and methods constantly upend established techniques?
Informed by work with dozens of enterprises and companies that have both traditional ML models and LLM apps live in production, this paper suggests some approaches for making build-versus-buy decisions in today’s world. Read it.
Mixtral 8x7B Discussion
For the last paper read of the year, Arize CPO & Co-Founder Aparna Dhinakaran is joined by Dat Ngo (ML Solutions Architect) and Aman Khan (Group Product Manager) for an exploration of the new kids on the block: Gemini and Mixtral 8x7B.
There’s a lot to cover, so this week’s paper read is Part I in a series about Mixtral and Gemini. In Part I, we provide some background and context for Mixtral 8x7B from Mistral AI, a high-quality sparse mixture-of-experts (SMoE) model that outperforms Llama 2 70B on most benchmarks with 6x faster inference. Read it.
How to Prompt LLMs for Text-to-SQL Paper Read
For this paper read, we’re joined by Shuaichen Chang, now an Applied Scientist at AWS AI Lab and author of this week’s paper to discuss his findings. Shuaichen’s research (conducted at the Ohio State University) investigates the impact of prompt constructions on the performance of large language models (LLMs) in the text-to-SQL task, particularly focusing on zero-shot, single-domain, and cross-domain settings. Read it.
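For flavor, a zero-shot text-to-SQL prompt is often built by serializing the database schema and appending the question. The sketch below is a generic illustration with an invented schema and wording, not one of the exact prompt constructions compared in the paper.

```python
# Illustrative zero-shot text-to-SQL prompt: schema serialized as DDL plus the question.
# The schema and wording are our own; the paper compares several alternative constructions.
schema_ddl = """CREATE TABLE singer (singer_id INT, name TEXT, country TEXT, age INT);
CREATE TABLE concert (concert_id INT, singer_id INT, year INT);"""

question = "How many singers from France performed in concerts after 2015?"

prompt = (
    "Given the database schema below, write a SQL query that answers the question.\n\n"
    f"{schema_ddl}\n\n"
    f"Question: {question}\n"
    "SQL:"
)
print(prompt)
```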
Why Enterprise Executives Should be Hip to LLMOps
Given the rapid rate of adoption, some early growing pains are inevitable. Among early adopters of LLMs, nearly half (43%) cite issues like evaluation, hallucinations, and needless abstraction as implementation challenges. How can large enterprises overcome these challenges to deliver results and minimize organizational risk?
Here are three keys that enterprises successfully deploying LLMs are embracing to rise to the challenge. Read it.
Paper Read Wednesday January 10: Gemini
Join us for a deep dive into Gemini on January 10 (Wednesday) at 10:15am PST. This is Part II in a series; we discussed Mixtral 8x7B in December, which you can find linked above. Sign up here.
Staff Picks 🤖
Here's a roundup of our team's recent favorite news, papers, and community threads.