Building a Python-Powered Data Analysis Agent with LangChain: A Step-by-Step Tutorial
Have you ever asked ChatGPT to analyze a dataset or perform complex calculations? If so, you've probably noticed it struggles with accuracy. That's because LLMs aren't designed to perform mathematical operations directly, but they excel at generating Python code!
Today, I'll share how I built a powerful data analysis agent that combines the reasoning capabilities of LLMs with the computational power of Python, using LangChain and MCP (Model Context Protocol).
The Problem with LLMs and Data Analysis
Despite their impressive capabilities, LLMs face significant limitations when it comes to:
1. Performing precise arithmetic and statistical calculations
2. Processing and transforming datasets of any real size
3. Producing reproducible, verifiable numerical results
However, these same models are excellent at writing Python code. This led us to a solution: what if we give an LLM access to a Python interpreter so it can write and execute code to solve data problems?
The Architecture: LangChain + MCP Python Runtime
My implementation connects an OpenAI model to a Python runtime environment using:
1. LangChain's ReAct agent (via LangGraph)
2. The @langchain/mcp-adapters package, which bridges LangChain and MCP servers
3. An MCP Python Runner that executes code in a sandboxed Pyodide/Deno environment
Here's a diagram of how it all works:
1. User asks a data question
2. LLM thinks through the problem
3. LLM writes Python code to solve it
4. Code is executed in a sandboxed Python environment
5. Results are returned to the LLM
6. LLM interprets results and responds to the user
Let's dive into the implementation! You can find the complete source code in this GitHub repository.
Step 1: Setting Up the Environment
First, let's install the required packages:
npm install @langchain/openai @langchain/langgraph @langchain/mcp-adapters @langchain/core dotenv
Then create a .env file for your OpenAI API key:
OPENAI_API_KEY=your-api-key-here
Step 2: Basic Implementation
Let's start with the basic structure (this is the src/index.ts file):
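Here's a minimal sketch of that setup. Note that the server command below is an assumption on my part: it starts Pydantic's mcp-run-python server, which matches the Pyodide-on-Deno sandbox described later; swap in your own MCP Python server if you use a different one.

import "dotenv/config";
import { ChatOpenAI } from "@langchain/openai";
import { MultiServerMCPClient } from "@langchain/mcp-adapters";

// The LLM that reasons about the task and writes Python code.
const model = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });

// Connect to an MCP Python runner over stdio. The command below
// (an assumption) launches Pydantic's mcp-run-python server, which
// executes Python via Pyodide inside a Deno sandbox.
const client = new MultiServerMCPClient({
  mcpServers: {
    "python-runner": {
      transport: "stdio",
      command: "deno",
      args: [
        "run", "-N", "-R=node_modules", "-W=node_modules",
        "--node-modules-dir=auto",
        "jsr:@pydantic/mcp-run-python", "stdio",
      ],
    },
  },
});

// Expose the server's tools (e.g. run_python_code) to LangChain.
const tools = await client.getTools();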
This code sets up our OpenAI model and connects to the MCP Python Runner server. The key part is the MultiServerMCPClient configuration, which uses Deno to run a sandboxed Python environment.
Step 3: Creating the Agent with a Specialized System Prompt
Now let's add code to create the ReAct agent with a system prompt that emphasizes using Python for data tasks:
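Here's roughly what that looks like; the prompt wording below is illustrative rather than the exact text from the repo:

import { createReactAgent } from "@langchain/langgraph/prebuilt";

// An illustrative system prompt; the exact wording in the repo may differ.
const SYSTEM_PROMPT = `You are a data analysis assistant with access to a
Python interpreter. For ANY task involving numbers, data, or computation:
1. Think through the problem step by step.
2. Write Python code to solve it and execute it with the Python tool.
3. Inspect the execution results before answering.
Never compute numerical results in your head; always verify with code.`;

// Wire the model and the MCP tools into a ReAct loop.
const agent = createReactAgent({
  llm: model,
  tools,
  prompt: SYSTEM_PROMPT,
});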
The system prompt is crucial - it instructs the LLM to always use Python code for data tasks and provides a clear thinking pattern to follow.
Step 4: Testing the Agent with a Data Analysis Task
Let's add code to test our agent with a complex data analysis task:
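For example, something along these lines (the query itself is illustrative, not the exact one from the repo):

// An illustrative multi-step analysis task.
const query = `Generate 100 samples from a normal distribution with mean 50
and standard deviation 10, then report the sample mean, median, and standard
deviation, and count how many values fall more than two standard deviations
from the mean.`;

const result = await agent.invoke({
  messages: [{ role: "user", content: query }],
});

// The last message holds the model's interpretation of the results.
console.log(result.messages.at(-1)?.content);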
This test query asks the agent to perform a sequence of data analysis tasks that would be impossible for an LLM to do on its own.
Step 5: Building an Interactive CLI
To make this tool more useful, I also created a CLI version that allows for interactive queries. Here's a simplified version (from src/cli.ts):
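A minimal sketch of that loop, building on the agent and client from the previous steps (the real file adds error handling and output formatting):

import * as readline from "node:readline/promises";
import { stdin, stdout } from "node:process";

const rl = readline.createInterface({ input: stdin, output: stdout });

while (true) {
  const question = await rl.question("\nAsk a data question (or 'exit'): ");
  if (question.trim().toLowerCase() === "exit") break;

  const result = await agent.invoke({
    messages: [{ role: "user", content: question }],
  });
  console.log(result.messages.at(-1)?.content);
}

rl.close();
await client.close();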
I also implemented some parsing of the model output, so the CLI can show, step by step, everything the model did to reach the final answer; a sketch of that parsing follows.
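One way to surface those intermediate steps is to walk the message history returned by the agent (again a sketch, not the exact code from the repo):

import { AIMessage, ToolMessage } from "@langchain/core/messages";
import type { BaseMessage } from "@langchain/core/messages";

// Print every tool call and tool result from an agent run.
function printSteps(messages: BaseMessage[]) {
  for (const message of messages) {
    if (message instanceof AIMessage && message.tool_calls?.length) {
      for (const call of message.tool_calls) {
        console.log(`→ Tool call: ${call.name}(${JSON.stringify(call.args)})`);
      }
    } else if (message instanceof ToolMessage) {
      console.log(`← Result: ${String(message.content).slice(0, 200)}`);
    }
  }
}

// Usage: printSteps(result.messages) after each agent.invoke() call.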
How It Works Under the Hood
When the agent receives a query, it runs the thought → action → observation loop from the architecture above: the LLM reasons about the problem, writes Python code, the sandbox executes it, and the model interprets the returned results before answering.
The MCP Python Runner uses Pyodide to execute Python code in a JavaScript environment with Deno, isolating it from the host system for security.
Advanced Implementation Details
The most interesting parts of this implementation are:
1. The sandboxed runtime: generated code runs in Pyodide inside Deno, so it can't touch the host system.
2. The specialized system prompt, which forces the model to verify every numerical claim with executed code.
3. The CLI's output parsing, which surfaces each intermediate tool call and result instead of just the final answer.
Limitations and Future Improvements
While powerful, this approach has some limitations:
1. Each round trip (generate code, execute it, interpret the output) adds noticeable latency.
2. The Pyodide sandbox only supports packages that run under WebAssembly, so some Python libraries aren't available.
3. There's no persistent state between queries out of the box, so follow-up questions can't reference earlier results.
Future improvements to consider:
1. Persisting interpreter state across queries, so users can ask follow-up questions about previously analyzed data.
2. Returning visualizations, for example by encoding matplotlib figures as base64 images.
3. Supporting additional runtimes, such as R for statistical analysis.
Conclusion
By combining the reasoning capabilities of LLMs with the computational power of Python, we can create agents that excel at data analysis tasks that would be impossible for either component alone.
This hybrid approach represents a powerful pattern for AI development: use LLMs for what they're good at (reasoning, generating code, explaining results) and specialized tools for what they do best (calculations, visualizations, data processing).
What do you think about this approach? Are you building similar AI agents? I'd love to hear your thoughts and experiences in the comments!
#LLMEngineering #AIAgents #DataAnalysis #Python #LangChain