Keeping up with LLMs: Cost, Accuracy, and the Path to Better Results

Welcome to another edition of Digital Leap!

Imagine developing a product that relies on cutting-edge AI to solve problems and answer complex queries. That’s the reality at CodeSherlock.AI, where we’ve spent the past year on a fascinating quest: harnessing the power of Large Language Models (LLMs) to analyze code. In this article, we’ll delve into CodeSherlock’s journey of continuous LLM evaluation—meticulously assessing different LLMs and ultimately enhancing both the quality and cost-efficiency of our code analysis responses.

By sharing our hard-won lessons in model evaluation, prompt engineering, and performance optimization, we aim to benefit anyone looking to integrate AI into their products. Developers, product managers, and AI enthusiasts alike will gain a practical understanding of how to navigate the challenges of LLM selection and deployment, ensuring a balance of cost, accuracy, and scalability.

What Makes CodeSherlock Unique

  1. Prompt Mastery We have meticulously crafted prompts (essentially instructions for the LLM) ranging from roughly 250 to 130,000 tokens, since we pass code as part of the prompt, to ensure comprehensive results with minimal inaccuracies; a simplified sketch of how code is embedded in a prompt follows this list.
  2. Expert LLM Selection Our success hinges not just on sophisticated prompts but also on carefully choosing the best LLM for the job.
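
For illustration, here is a minimal sketch of how source code can be embedded in an analysis prompt. The instruction text, file name, and helper function below are simplified placeholders rather than our production prompts.

  # Minimal sketch: assemble a code-analysis prompt from instructions plus a file.
  from pathlib import Path

  ANALYSIS_INSTRUCTIONS = (
      "You are a code reviewer. Identify bugs, security issues, and style "
      "problems in the code below. For each issue, give the line number, a "
      "short description, and a suggested fix with a code example."
  )

  def build_prompt(source_file: str) -> str:
      """Combine the review instructions with the file contents into one prompt."""
      code = Path(source_file).read_text(encoding="utf-8")
      return f"{ANALYSIS_INSTRUCTIONS}\n\n--- {source_file} ---\n{code}"

  prompt = build_prompt("example.py")  # placeholder file name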

Our journey began with established closed-source models like the GPT and Gemini families. As a startup venturing into the nascent world of AI, our priority was building a stable, secure, and scalable product. This required a robust LLM infrastructure capable of supporting a commercial system. While open-source models are now more prevalent, we initially opted for proven systems that could provide the reliability and scale we needed to launch successfully.

Our Requirements for Ideal LLM Responses

  • Relevant, Accurate, and Actionable Provide useful information directly related to the query.
  • Well-Formatted and Consistent Present information clearly, adhering to a predictable structure.
  • Optimized for Length Balance comprehensiveness with conciseness to manage token usage effectively.
  • Parseable Easily processed by downstream systems or tools.
  • Rich with Relevant Code Examples Illustrate both issues and solutions effectively.
  • Cost-Effective Minimize input token consumption.
  • Structured Outputs Enable slicing and dicing information for better user experience.

Our LLM Evaluation Process

Our evaluation approach is a continuous cycle designed to keep pace with rapidly evolving models:

  1. Monitoring We closely track advancements and updates across various LLMs.
  2. Internal Review Our team engages in ongoing discussions to assess different LLM offerings.
  3. Rigorous Testing We continuously evaluate LLMs through dedicated testing procedures.
  4. Comparative Analysis We deploy various LLM implementations in Azure environments to directly compare their responses across a range of prompts.

Over the past year, our small, agile team has rigorously tested various GPT and Gemini LLMs, gaining valuable insights. Our primary goal remains consistent: deliver the most useful responses to our users effortlessly by selecting the optimal LLM and crafting the perfect prompts. Today, CodeSherlock, as a commercially available product, leverages GPT 4o Mini to provide consistently strong results.

How We Test

  • Structured Test Plans We develop comprehensive test plans encompassing code files of various sizes and programming languages.
  • Side-by-Side Comparison We meticulously compare results across different LLMs using detailed Excel spreadsheets for each scenario (a simplified comparison harness is sketched after this list).
  • In-Depth Analysis We generate insights on the number of issues found, errors encountered, and relevant results produced. Our testers and developers thoroughly examine each result’s quality.
  • Impact Evaluation Based on these insights, we evaluate the impact, gains, and overall value of each LLM to inform our selection.
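
As a rough illustration of the side-by-side step, the sketch below sends the same prompt to several models and records the raw responses for later review. The client setup, model names, and CSV output are placeholders; our actual comparisons are captured in the Excel spreadsheets described above.

  # Illustrative comparison harness: run one prompt against several models
  # and write the responses to a CSV file for manual review.
  import csv
  from openai import OpenAI  # assumes the official openai package (>= 1.0)

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  MODELS = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]  # placeholder model list

  def compare_models(prompt: str, out_path: str = "comparison.csv") -> None:
      """Collect each model's response to the same prompt in one spreadsheet."""
      with open(out_path, "w", newline="", encoding="utf-8") as f:
          writer = csv.writer(f)
          writer.writerow(["model", "response"])
          for model in MODELS:
              resp = client.chat.completions.create(
                  model=model,
                  messages=[{"role": "user", "content": prompt}],
              )
              writer.writerow([model, resp.choices[0].message.content])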

Applying this process over the past year, we’ve thoroughly assessed six different LLMs: GPT 3.5, GPT 4, GPT 4o, GPT 4o Mini, Gemini Pro, and Gemini Flash. In some cases, we performed multiple evaluations of models within the same family (e.g., Gemini) to see how updates affected their performance.

Key Learnings About Different LLMs

Below are some of the findings from our evaluations, organized around our initial requirements:

Cost vs. Accuracy/Reasoning

  1. GPT 4: Highest cost among the models. Strongest reasoning abilities. Minimal need for prompt examples to achieve accurate outputs.
  2. GPT 4o: Balances cost and performance effectively—lower cost than GPT 4 but still strong in reasoning and formatting.
  3. GPT 3.5: The most cost-effective of the GPT family we tested. Requires more examples or more explicit prompts to achieve the correct format and accuracy, and even then only partially.
  4. GPT 4o Mini: Cheaper than GPT 4, GPT 4o, and GPT 3.5. Similar or better accuracy than GPT 3.5, with consistent formatting when given one-shot examples. Second only to GPT 4o on code-related tasks and well trained on code. A good option for budget-conscious use cases without sacrificing much accuracy.
  5. Gemini Pro 1.5 & Gemini Flash 1.5: Comparable to or cheaper than GPT 3.5. Main drawback: inconsistent instruction following in many scenarios. Can hallucinate (e.g., adding code snippets that don’t exist).

Formatting and Instruction Following

  • GPT 4 & GPT 4o: Extremely accurate in following instructions and formatting. Perform well even without extensive one-shot prompting or formatting examples. They support structured outputs (e.g., JSON).
  • GPT 4o Mini: Maintains 100% adherence to formatting instructions. Supports structured outputs (e.g., JSON).
  • GPT 3.5: More prone to formatting inconsistencies. Often needs more structured or example-based prompts (a one-shot example is sketched after this list).
  • Gemini Pro & Gemini Flash: Often produce incomplete or truncated outputs, especially for larger code bases, and show constant inconsistencies in their Markdown output.
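
To illustrate what example-based prompting looks like in practice, the sketch below adds a one-shot formatting example to the message list before the real request. The wording, example code, and format are illustrative placeholders, not our production prompts.

  # Illustrative one-shot prompt: show the model one example of the desired
  # output format before asking it to review new code.
  new_code = "def divide(a, b):\n    return a / b"  # placeholder code to review
  example_code = "def read_config(path):\n    return open(path).read()"
  example_answer = (
      "- Line 2: File handle is never closed.\n"
      "  Suggested fix: use `with open(path) as f: return f.read()`."
  )

  messages = [
      {"role": "system", "content": "You are a code reviewer. List issues as "
                                    "Markdown bullets: line, problem, suggested fix."},
      {"role": "user", "content": f"Review this code:\n{example_code}"},
      {"role": "assistant", "content": example_answer},  # the one-shot example
      {"role": "user", "content": f"Review this code:\n{new_code}"},
  ]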

Emerging Requirements: Structured Outputs and Context Length

As our evaluation progressed, two additional factors became increasingly important:

Structured Outputs

These refer to LLM responses that follow a predefined format (e.g., JSON, XML, or a custom schema). By providing clear instructions and examples, we guide the LLM to include specific keys, nest data correctly, and adhere to strict formatting requirements. This eliminates the need for extensive manual parsing or cleanup. Markdown responses had posed challenges because they were complex to parse; structured outputs simplified the process. OpenAI API advancements in structured outputs also encouraged new methods of segmenting responses and presenting them more effectively.
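
As an example of what this looks like with the OpenAI Python SDK's Structured Outputs helper, here is a minimal sketch; the schema fields and model name are simplified placeholders, not our production schema.

  # Sketch: request a structured (JSON) response that is parsed into typed objects.
  from openai import OpenAI
  from pydantic import BaseModel

  class Issue(BaseModel):
      line: int
      severity: str
      description: str
      suggested_fix: str

  class CodeAnalysis(BaseModel):
      issues: list[Issue]

  client = OpenAI()
  code = "def divide(a, b):\n    return a / b"  # placeholder code to analyze
  completion = client.beta.chat.completions.parse(
      model="gpt-4o-mini",           # a model that supports Structured Outputs
      messages=[{"role": "user", "content": f"Review this code:\n{code}"}],
      response_format=CodeAnalysis,  # the SDK enforces and parses this schema
  )
  analysis = completion.choices[0].message.parsed  # a CodeAnalysis instance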

Here is the current state of Structured Outputs:

OpenAI Models

  • GPT-4o (version 2024-08-06)
  • GPT-4o-mini (version 2024-07-18)
  • GPT-4-turbo
  • GPT-4
  • GPT-3.5-turbo

Azure OpenAI Models

  • GPT-4o (version 2024-08-06)
  • GPT-4o-mini (version 2024-07-18)
  • GPT-4o (version 2024-12-17)

Gemini Pro & Gemini Flash

  • Both Gemini 1.5 Pro and Gemini 1.5 Flash offer structured output capabilities.

Context Length

Context length defines how much text an LLM can consider when generating a response. A longer context window enables handling more complex queries, retaining coherence across extended text, and referencing previous parts of the conversation. It is important to remember that the context window includes both the input prompt and the generated output.
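
As a rough way to check whether a given file fits, the prompt's token count plus a reserve for the expected output can be compared against the model's context window. The sketch below uses the tiktoken library with the o200k_base encoding (used by the GPT-4o family); the window and reserve sizes are placeholder values.

  # Sketch: estimate whether prompt + expected output fits in a context window.
  import tiktoken

  encoding = tiktoken.get_encoding("o200k_base")  # encoding used by GPT-4o models

  def fits_in_context(prompt: str, context_window: int = 128_000,
                      output_reserve: int = 16_000) -> bool:
      """Return True if the prompt leaves room for the reserved output tokens."""
      prompt_tokens = len(encoding.encode(prompt))
      return prompt_tokens + output_reserve <= context_window

  print(fits_in_context("Review this code:\n" + open("example.py").read()))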

[Image: Summary of model context lengths. For GPT 4, the context length shown is for the Standard tier, which is what we evaluated. The Gemini Pro and Flash versions are 1.5.]

GPT 4o Mini: Our Current Model of Choice

As you can see above, GPT-4o Mini offered one of the larger context windows among the models we tested, along with a strong combination of cost-effectiveness, accuracy, consistency, and overall quality. This combination of context, cost, and quality gave us a model that met all our current needs. On the HumanEval benchmark for code generation, GPT-4o Mini performs second best after GPT-4o, showing significant improvements over GPT-3.5 Turbo. This indicates that GPT-4o Mini is highly effective for code analysis and generation, leading us to select it as our model of choice. It has yielded consistently satisfactory results in CodeSherlock’s commercial deployment.

Despite showing initial promise and offering larger context windows, the Gemini family of models did not consistently demonstrate the required levels of accuracy and consistency in our evaluations, so we have discontinued their use for now.

Key Takeaways and Findings Summarized

  1. Cost vs. Accuracy: Higher-end models like GPT 4 excel at reasoning but can be cost-prohibitive. GPT 4o Mini strikes a productive balance for many use cases.
  2. Formatting & Instruction Adherence: GPT 4o Mini, GPT 4, and GPT 4o exhibit strong instruction following and formatting capabilities, while Gemini Pro & Flash often struggle with truncated or inconsistent outputs.
  3. Structured Outputs: Adopting JSON or similar formats reduces parsing complexity and improves integration with other systems.
  4. Context Length: Larger context windows allow more comprehensive input, which is crucial for analyzing extensive code bases. GPT-4o Mini excels in this aspect cost effectively.

[Image: Summary of model evaluations. For GPT 4, the context length shown is for the Standard tier, which is what we evaluated. The Gemini Pro and Flash versions are 1.5.]

Looking Ahead

The future of LLM-driven solutions holds immense potential for innovation and refinement:

  • Open-Source Exploration: Could reduce costs and enhance flexibility.
  • Advanced Fine-Tuning Techniques: May unlock specialized capabilities for domain-specific challenges.
  • Smaller, Domain-Specific Models: Promise more relevant results with fewer hallucinations and better cost efficiency.
  • Hybrid Approaches: Combining strengths of various models to dynamically select the optimal model for each task.

In summary, this article aimed to show the ongoing journey at CodeSherlock: balancing LLM cost, accuracy, and scalability through a methodical evaluation and selection process. By adopting structured outputs and keeping an eye on context length, we’ve refined our product’s ability to analyze code at scale. We hope these insights will guide anyone looking to integrate AI solutions into their projects, just as they’ve helped shape CodeSherlock’s approach to harnessing LLMs for high-quality, actionable code analysis.

Resources:

HumanEval code generation model comparison

OpenAI Structured Outputs article
