Building a Python-Powered Data Analysis Agent with LangChain: A Step-by-Step Tutorial

Have you ever tried asking ChatGPT to analyze a dataset or perform complex calculations? If so, you've probably noticed it struggles with accuracy. That's because LLMs aren't designed to perform mathematical operations directly - but they excel at generating Python code!

Today, I'll share how I built a powerful data analysis agent that combines the reasoning capabilities of LLMs with the computational power of Python, using LangChain and MCP (Model Context Protocol).


The Problem with LLMs and Data Analysis

Despite their impressive capabilities, LLMs face significant limitations when it comes to:

  • Performing accurate numerical calculations
  • Analyzing large datasets
  • Generating precise visualizations
  • Statistical analysis and modeling

However, these same models are excellent at writing Python code. This led us to a solution: what if we give an LLM access to a Python interpreter so it can write and execute code to solve data problems?


The Architecture: LangChain + MCP Python Runtime

My implementation connects an OpenAI model to a Python runtime environment using LangChain's ReAct agent, the @langchain/mcp-adapters package, and an MCP Python Runner that executes code in a sandbox with Pyodide under Deno.

Here's how it all works, step by step:

  1. User asks a data question
  2. LLM thinks through the problem
  3. LLM writes Python code to solve it
  4. Code is executed in a sandboxed Python environment
  5. Results are returned to the LLM
  6. LLM interprets results and responds to the user

Let's dive into the implementation! You can find the complete source code in this GitHub repository.


Step 1: Setting Up the Environment

First, let's install the required packages:

npm install @langchain/openai @langchain/langgraph @langchain/mcp-adapters @langchain/core dotenv        

Then create a .env file for your OpenAI API key:

OPENAI_API_KEY=your-api-key-here        

Step 2: Basic Implementation

Let's start with the basic structure (this is the src/index.ts file):

Basic structure of the agent (shown as a screenshot in the original article)
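A minimal sketch of what src/index.ts looks like, assuming the gpt-4o model and Pydantic's MCP Run Python server published on JSR (the actual file in the repository may differ):

import "dotenv/config";
import { ChatOpenAI } from "@langchain/openai";
import { MultiServerMCPClient } from "@langchain/mcp-adapters";

// Chat model that does the reasoning and writes the Python code
const model = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });

// MCP client that spawns the Python runtime as a local subprocess over stdio
const client = new MultiServerMCPClient({
  mcpServers: {
    "python-runner": {
      transport: "stdio",
      command: "deno",
      args: [
        "run",
        "-N", "-R=node_modules", "-W=node_modules",
        "--node-modules-dir=auto",
        "jsr:@pydantic/mcp-run-python", // assumed server package
        "stdio",
      ],
    },
  },
});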

This code sets up our OpenAI model and connects to the MCP Python Runner server. The key part is the MultiServerMCPClient configuration, which uses Deno to run a sandboxed Python environment.

  • -N -R=node_modules -W=node_modules (alias of --allow-net --allow-read=node_modules --allow-write=node_modules) allows network access and read+write access to ./node_modules. These are required so Pyodide can download and cache the Python standard library and packages
  • --node-modules-dir=auto tells Deno to use a local node_modules directory
  • stdio runs the server with the Stdio MCP transport — suitable for running the process as a subprocess locally
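For completeness, the same Deno invocation with the long flag names spelled out (again assuming Pydantic's MCP Run Python server, which the screenshots may name differently):

deno run --allow-net --allow-read=node_modules --allow-write=node_modules --node-modules-dir=auto jsr:@pydantic/mcp-run-python stdio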


Step 3: Creating the Agent with a Specialized System Prompt

Now let's add code to create the ReAct agent with a system prompt that emphasizes using Python for data tasks:

Agent creation with the specialized system prompt (shown as a screenshot in the original article)
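As a rough sketch, with an illustrative system prompt rather than the exact one from the screenshot:

import { createReactAgent } from "@langchain/langgraph/prebuilt";

// Load the tools exposed by the MCP server (including run_python_code)
const tools = await client.getTools();

// System prompt that pushes the model to solve every data task with Python
const systemPrompt =
  "You are a data analysis assistant. Never compute answers in your head: " +
  "for ANY calculation, data manipulation or statistical task, write Python " +
  "code, run it with the run_python_code tool, and base your answer on the output.";

const agent = createReactAgent({
  llm: model,
  tools,
  // Depending on the @langchain/langgraph version this option may be named
  // prompt, stateModifier or messageModifier
  prompt: systemPrompt,
});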

The system prompt is crucial - it instructs the LLM to always use Python code for data tasks and provides a clear thinking pattern to follow.


Step 4: Testing the Agent with a Data Analysis Task

Let's add code to test our agent with a complex data analysis task:


Complex data analysis query (shown as a screenshot in the original article)
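An illustrative query of the same flavor (not the exact one from the screenshot) looks like this:

const result = await agent.invoke({
  messages: [
    {
      role: "user",
      content:
        "Generate 1000 samples from a normal distribution with mean 50 and " +
        "standard deviation 10, report the sample mean, median and standard " +
        "deviation, then fit a linear regression on a noisy y = 2x + 1 dataset " +
        "and report the estimated coefficients.",
    },
  ],
});

// The last message holds the model's final, human-readable answer
console.log(result.messages[result.messages.length - 1].content);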

This test query asks the agent to perform a sequence of data analysis tasks that an LLM could not carry out reliably on its own.


Step 5: Building an Interactive CLI

To make this tool more useful, I also created a CLI version that allows for interactive queries. Here's a simplified version (from src/cli.ts):

CLI interface implementation (shown as a screenshot in the original article; you can find the complete version in the GitHub repository)
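A stripped-down sketch of the interactive loop, reusing the model, client and agent from the snippets above:

import * as readline from "node:readline/promises";

const rl = readline.createInterface({ input: process.stdin, output: process.stdout });

while (true) {
  const question = await rl.question("\nAsk a data question (or type 'exit'): ");
  if (question.trim().toLowerCase() === "exit") break;

  // Each question goes through the full ReAct loop: reason, write Python, run it, answer
  const result = await agent.invoke({
    messages: [{ role: "user", content: question }],
  });
  console.log(result.messages[result.messages.length - 1].content);
}

rl.close();
await client.close();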

I also implemented some parsing of the model output, so you can see in detail every step the model took to reach the final answer.


Model execution step by step (shown as a screenshot in the original article)

How It Works Under the Hood

When the agent receives a query, it follows these steps:

  1. Reasoning: The LLM thinks about how to solve the problem with Python
  2. Code Generation: It writes Python code using libraries like pandas, numpy, matplotlib
  3. Code Execution: The code is sent to the MCP Python Runner via the run_python_code tool
  4. Result Processing: The LLM interprets the execution results (including visualizations)
  5. Response Generation: It provides a human-friendly explanation of the results

The MCP Python Runner uses Pyodide to execute Python code in a JavaScript environment with Deno, isolating it from the host system for security.


Advanced Implementation Details

The most interesting parts of this implementation are:

  1. Tool Configuration: The MCP client connects to the Python runtime securely via stdio
  2. System Prompt Engineering: Forcing the LLM to always use Python for calculations
  3. Response Processing: I added special handling in the CLI to extract and display Python code, execution results, and conclusions separately (sketched below)
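As a rough sketch of that handling, the messages returned by the agent (the result from the earlier invoke call) can be walked to separate the generated code, the execution output, and the final answer; the exact shape of the tool-call arguments depends on the Python runner's schema, so this simply dumps them:

import { AIMessage, ToolMessage } from "@langchain/core/messages";

for (const message of result.messages) {
  if (message instanceof AIMessage && message.tool_calls?.length) {
    // The tool-call arguments carry the Python code the model wants to run
    console.log("\n[Generated Python code]");
    console.log(JSON.stringify(message.tool_calls[0].args, null, 2));
  } else if (message instanceof ToolMessage) {
    // Raw output coming back from the sandboxed Python runtime
    console.log("\n[Execution result]");
    console.log(message.content);
  } else if (message instanceof AIMessage) {
    // AI messages without tool calls are the model's conclusions
    console.log("\n[Conclusion]");
    console.log(message.content);
  }
}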


Limitations and Future Improvements

While powerful, this approach has some limitations:

  • The sandboxed environment has limited access to Python packages
  • Visualizations are mostly not possible in the sandbox
  • Long-running computations might timeout

Future improvements worth considering:

  • Supporting file uploads for CSV/Excel analysis
  • Persistent sessions for ongoing data analysis
  • Adding more specialized data visualization tools


Conclusion

By combining the reasoning capabilities of LLMs with the computational power of Python, we can create agents that excel at data analysis tasks that would be impossible for either component alone.

This hybrid approach represents a powerful pattern for AI development: use LLMs for what they're good at (reasoning, generating code, explaining results) and specialized tools for what they're good at (calculations, visualizations, data processing).

What do you think about this approach? Are you building similar AI agents? I'd love to hear your thoughts and experiences in the comments!

#LLMEngineering #AIAgents #DataAnalysis #Python #LangChain

