🤔 Has OpenAI Lost Its Edge?

In this issue:

  1. Anthropic’s new flagship model
  2. Q* is out and OpenAI has nothing to do with it
  3. Please, we don’t need another Devin


1. Introducing Claude 3.5 Sonnet

Watching: Claude 3.5 Sonnet (model card)


What problem does it solve? Claude 3.5 Sonnet, the latest iteration of Anthropic's AI assistant, addresses the need for more capable and efficient language models that can handle complex reasoning tasks, demonstrate broad knowledge, and excel in coding proficiency. It aims to provide a more nuanced understanding of language, humor, and intricate instructions while generating high-quality, relatable content. Additionally, it tackles the challenge of performing context-sensitive customer support and orchestrating multi-step workflows in a cost-effective manner.

How does it solve the problem? Claude 3.5 Sonnet achieves its impressive performance through a combination of architectural improvements and enhanced training data. By operating at twice the speed of its predecessor, Claude 3 Opus, it offers a significant performance boost while maintaining cost-effective pricing. The model's ability to independently write, edit, and execute code with sophisticated reasoning and troubleshooting capabilities is a result of its training on a diverse range of coding problems and its exposure to relevant tools. Its proficiency in code translations further enhances its effectiveness in updating legacy applications and migrating codebases.
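
To make this a bit more tangible, here is a minimal sketch of asking Claude 3.5 Sonnet to translate a legacy snippet via Anthropic's Python SDK. The model ID shown is the June 2024 snapshot and the prompt is purely illustrative; check Anthropic's documentation for current model names.

```python
# Minimal sketch: code translation with Claude 3.5 Sonnet via the Anthropic
# Python SDK (pip install anthropic). The model ID is the June 2024 snapshot.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (
            "Translate this Python 2 snippet to idiomatic Python 3:\n"
            "print 'hello, %s' % name"
        ),
    }],
)

print(response.content[0].text)  # the model's translated code
```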

What's next? As Claude 3.5 Sonnet sets new industry benchmarks across various domains, it opens up exciting possibilities for future applications. Since it is also much faster and cheaper than Opus, it might become feasible for use cases that were previously out of reach because Opus was too slow and/or expensive. My personal favorite, however, is the new Artifacts feature, which makes it much easier to iterate on - and generally work with - structured model outputs. Structured outputs created by Claude 3.5 are saved as artifacts and remain accessible throughout the conversation.


2. Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning

Watching: Q* (paper)


What problem does it solve? Large Language Models (LLMs) have shown remarkable performance across a wide range of natural language tasks. However, their auto-regressive generation process can lead to errors, hallucinations, and inconsistencies, especially when dealing with multi-step reasoning tasks. These failure modes limit the reliability and usefulness of LLMs in real-world applications that require accurate and coherent outputs.

How does it solve the problem? The researchers introduce Q*, a general and versatile framework that guides the decoding process of LLMs using deliberative planning. Q* learns a plug-and-play Q-value model that serves as a heuristic function to select the most promising next step during generation. By doing so, Q* effectively steers LLMs towards more accurate and consistent outputs without the need for fine-tuning the entire model for each specific task. This approach avoids the significant computational overhead and potential performance degradation on other tasks that can occur with fine-tuning.
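
The paper's interfaces aren't reproduced here, but the core idea - best-first search over partial reasoning traces, scored by a learned Q-value heuristic - can be sketched in a few lines. Everything below (`llm_propose_steps`, `q_model`, `path_utility`) is a hypothetical stand-in, not the authors' code.

```python
# Hypothetical sketch of Q*-style deliberative decoding: an A*-like best-first
# search where a learned Q-value model scores how promising each candidate
# next reasoning step is. All callables are illustrative stand-ins.
import heapq

def q_star_decode(question, llm_propose_steps, q_model, path_utility,
                  is_terminal, n_candidates=4, max_expansions=100):
    counter = 0                       # tie-breaker for the heap
    frontier = [(0.0, counter, [])]   # entries: (-f_score, id, steps so far)

    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, steps = heapq.heappop(frontier)
        if is_terminal(question, steps):
            return steps              # complete reasoning trace found
        # Ask the frozen LLM for a handful of candidate next steps
        for step in llm_propose_steps(question, steps, n=n_candidates):
            new_steps = steps + [step]
            g = path_utility(question, new_steps)  # utility of the path so far
            h = q_model(question, new_steps)       # learned Q-value heuristic
            counter += 1
            heapq.heappush(frontier, (-(g + h), counter, new_steps))
    return []                         # search budget exhausted
```

Note that the base LLM stays frozen throughout: only the lightweight Q-value model is trained, which is what makes the approach plug-and-play across tasks.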

What's next? The Q* framework has demonstrated its effectiveness on several benchmark datasets, including GSM8K, MATH, and MBPP. However, it would be interesting to see how well this approach generalizes to other types of tasks and domains beyond mathematical reasoning and programming. Additionally, future research could explore the integration of Q* with other techniques to further improve the quality and robustness of LLM outputs.


3. Code Droid: A Technical Report

Watching: Code Droid (report)


What problem does it solve? Software development is a complex and time-consuming process that requires skilled human developers. As software systems become increasingly large and intricate, the demand for developer resources continues to grow. Factory aims to address this challenge by creating autonomous systems called Droids that can accelerate software engineering velocity. These Droids are designed to model the cognitive processes of human developers, adapted to the scale and speed required in modern software development.

How does it solve the problem? Factory's approach to building Droids (i.e., agents) is interdisciplinary, drawing on research into human and machine decision-making, complex problem-solving, and learning from environments. The Code Droid, in particular, has demonstrated state-of-the-art performance on SWE-bench, a benchmark for evaluating software engineering capabilities: it resolved 19.27% of issues on the full SWE-bench and 31.67% on the Lite version, indicating that it can handle a wide range of software engineering tasks efficiently.
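
Factory hasn't open-sourced Code Droid, so the snippet below is not their implementation - just a bare-bones, purely illustrative version of the plan/edit/test loop that autonomous coding agents of this kind are built around. `propose_patch` and `apply_patch` are hypothetical LLM-backed helpers.

```python
# Purely illustrative agent loop: propose a patch, apply it, run the tests,
# and feed failures back into the next attempt. Not Factory's implementation.
import subprocess

def run_tests(repo_path):
    """Run the project's test suite and return (passed, combined output)."""
    result = subprocess.run(["pytest", "-q"], cwd=repo_path,
                            capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def coding_agent(issue, repo_path, propose_patch, apply_patch, max_iters=5):
    """propose_patch(issue, feedback) -> patch text (e.g. from an LLM);
    apply_patch(repo_path, patch) applies it to the working tree."""
    feedback = ""
    for _ in range(max_iters):
        patch = propose_patch(issue, feedback)
        apply_patch(repo_path, patch)
        passed, log = run_tests(repo_path)
        if passed:
            return patch     # tests pass: candidate fix found
        feedback = log       # use the failure log to refine the next attempt
    return None              # give up after the iteration budget
```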

What's next? As Factory continues to develop and refine their Droids, we can expect further improvements in their performance on software engineering tasks. The ultimate goal is to create autonomous systems that significantly accelerate software development, freeing human developers to focus on higher-level tasks and strategic decision-making. I just really, really hope this doesn't turn into another Devin-level letdown.

