Keeping up with LLMs: Cost, Accuracy, and the Path to Better Results
Welcome to another edition of Digital Leap!
Imagine developing a product that relies on cutting-edge AI to solve problems and answer complex queries. That’s the reality at CodeSherlock.AI, where we’ve spent the past year on a fascinating quest: harnessing the power of Large Language Models (LLMs) to analyze code. In this article, we’ll delve into CodeSherlock’s journey of continuous LLM evaluation—meticulously assessing different LLMs and ultimately enhancing both the quality and cost-efficiency of our code analysis responses.
By sharing our hard-won lessons in model evaluation, prompt engineering, and performance optimization, we aim to benefit anyone looking to integrate AI into their products. Developers, product managers, and AI enthusiasts alike will gain a practical understanding of how to navigate the challenges of LLM selection and deployment, ensuring a balance of cost, accuracy, and scalability.
What Makes CodeSherlock Unique
Our journey began with established closed-source models like the GPT and Gemini families. As a startup venturing into the nascent world of AI, our priority was building a stable, secure, and scalable product. This required a robust LLM infrastructure capable of supporting a commercial system. While open-source models are now more prevalent, we initially opted for proven systems that could provide the reliability and scale we needed to launch successfully.
Our Requirements for Ideal LLM Responses
Our LLM Evaluation Process
Our evaluation approach is a continuous cycle designed to keep pace with rapidly evolving models:
Over the past year, our small, agile team has rigorously tested various GPT and Gemini LLMs, gaining valuable insights. Our primary goal remains consistent: deliver the most useful responses to our users with minimal friction by selecting the optimal LLM and crafting effective prompts. Today, CodeSherlock, as a commercially available product, leverages GPT-4o Mini to provide consistently strong results.
How We Test
Applying this process over the past year, we've thoroughly assessed six different LLMs: GPT-3.5, GPT-4, GPT-4o, GPT-4o Mini, Gemini Pro, and Gemini Flash. In some cases, we performed multiple evaluations of models within the same family (e.g., Gemini) to see how updates affected their performance.
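One simple way to make such comparisons repeatable is to run every candidate against the same fixed set of code-analysis prompts and score the responses side by side. The sketch below illustrates the idea using the OpenAI Python SDK; the model names, prompts, and scoring step are placeholders for this article, not our actual test suite.

```python
# Minimal sketch of a side-by-side model comparison, assuming the
# OpenAI Python SDK (`pip install openai`) and an OPENAI_API_KEY env var.
# The model shortlist and prompt set below are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

CANDIDATE_MODELS = ["gpt-3.5-turbo", "gpt-4o", "gpt-4o-mini"]  # hypothetical shortlist

# A fixed prompt set keeps the comparison apples-to-apples across models.
PROMPTS = [
    "Review this function for bugs:\n\ndef add(a, b):\n    return a - b",
    "Explain the time complexity of bubble sort and suggest an alternative.",
]

def ask(model: str, prompt: str) -> str:
    """Send the same code-analysis prompt to a given chat model."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a code review assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,  # near-deterministic output makes comparisons fairer
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    results = {
        model: [ask(model, p) for p in PROMPTS] for model in CANDIDATE_MODELS
    }
    # Dump responses for manual or rubric-based scoring afterwards.
    print(json.dumps(results, indent=2))
```

The scoring itself stays human-in-the-loop in this sketch; the value is simply that every model sees identical inputs.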
Key Learnings About Different LLMs
Below are some of the initial requirements we set and the findings we uncovered through our evaluations:
Cost vs. Accuracy/Reasoning
Formatting and Instruction Following
Emerging Requirements: Structured Outputs and Context Length
As our evaluation progressed, two additional factors became increasingly important:
Structured Outputs
Structured outputs are LLM responses that follow a predefined format (e.g., JSON, XML, or a custom schema). By providing clear instructions and examples, we guide the LLM to include specific keys, nest data correctly, and adhere to strict formatting requirements. This eliminates the need for extensive manual parsing or cleanup: Markdown responses had posed parsing challenges, and structured outputs simplified the process considerably. Advances in the OpenAI API's support for structured outputs also encouraged us to find new ways of segmenting responses and presenting them more effectively.
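As an illustration, here is a minimal sketch of requesting structured output through the OpenAI Chat Completions API with a JSON Schema. The schema fields (file, severity, summary) are hypothetical examples for this article, not CodeSherlock's actual response format.

```python
# Minimal sketch of OpenAI structured outputs via JSON Schema.
# Assumes an OpenAI SDK version and model that support structured
# outputs (e.g., gpt-4o-mini). The schema below is illustrative only.
import json
from openai import OpenAI

client = OpenAI()

finding_schema = {
    "name": "code_findings",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "findings": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "file": {"type": "string"},
                        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                        "summary": {"type": "string"},
                    },
                    "required": ["file", "severity", "summary"],
                    "additionalProperties": False,
                },
            }
        },
        "required": ["findings"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Return code review findings as JSON."},
        {"role": "user", "content": "Review: def div(a, b): return a / b"},
    ],
    response_format={"type": "json_schema", "json_schema": finding_schema},
)

# Because the model is constrained to the schema, json.loads should not
# trip over malformed output the way free-form Markdown parsing could.
findings = json.loads(response.choices[0].message.content)
print(findings["findings"])
```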
Here is the current state of Structured Outputs:
OpenAI Models
Azure OpenAI Models
Gemini Pro & Gemini Flash
Context Length
Context length defines how much text an LLM can consider when generating a response. A longer context window enables handling more complex queries, retaining coherence across extended text, and referencing earlier parts of the conversation. It is important to remember that the context window includes both the input prompt and the generated output.
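In practice this becomes a simple budgeting exercise: count the prompt tokens, add the output budget you plan to request, and check the total against the model's context window. The sketch below shows one way to do that with the tiktoken library; the 128k window and 4k output reserve are illustrative assumptions, not a spec for any particular model.

```python
# Minimal sketch of budgeting prompt + output tokens against a context window.
# Assumes a tiktoken version that ships the o200k_base encoding (used by the
# GPT-4o family); the window and output figures below are assumptions.
import tiktoken

CONTEXT_WINDOW = 128_000   # assumed context window for the target model
MAX_OUTPUT_TOKENS = 4_000  # room reserved for the generated response

encoding = tiktoken.get_encoding("o200k_base")

def fits_in_context(prompt: str) -> bool:
    """Return True if the prompt plus the reserved output budget fits the window."""
    prompt_tokens = len(encoding.encode(prompt))
    return prompt_tokens + MAX_OUTPUT_TOKENS <= CONTEXT_WINDOW

source_file = "def main():\n    print('hello')\n" * 200  # stand-in for real code
prompt = f"Analyze the following code for issues:\n\n{source_file}"
print(fits_in_context(prompt))
```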
GPT-4o Mini: Our Current Model of Choice
As you can see above, GPT-4o Mini offered one of the larger context windows among the models we tested, along with a strong combination of cost-effectiveness, accuracy, consistency, and overall quality. That balance of context, cost, and quality gave us a model that met all of our current needs. On the HumanEval benchmark for code generation, GPT-4o Mini performs second best after GPT-4o, a significant improvement over GPT-3.5 Turbo, which indicates that it is highly effective for code analysis and generation. This led us to select it as our model of choice, and it has yielded very satisfactory results in CodeSherlock's commercial deployment.
Despite initial promise and larger context windows, the Gemini family of models did not consistently reach the required levels of accuracy and consistency in our evaluations, so we have discontinued their use for now.
Key Takeaways and Findings Summarized
Looking Ahead
The future of LLM-driven solutions holds immense potential for innovation and refinement:
In summary, this article aimed to show the ongoing journey at CodeSherlock: balancing LLM cost, accuracy, and scalability through a methodical evaluation and selection process. By adopting structured outputs and keeping an eye on context length, we've refined our product's ability to analyze code at scale. We hope these insights will guide anyone looking to integrate AI solutions into their projects, just as they've helped shape CodeSherlock's approach to harnessing LLMs for high-quality, actionable code analysis.
Resources: