Building a Full-Stack AI Data Assistant
The Problem: Bridging the Gap Between Data and Usability
As part of our final project for a Data Science course, my teammates and I were tasked with solving a real-world problem using what we had learned. One challenge we all recognized was the gap between data availability and data usability. Organizations collect massive amounts of data, but the ability to clean, analyze, and model it often requires advanced technical skills. We wanted to create a tool to make data science accessible to anyone, regardless of their coding background.
The Solution: A Conversational Web Application for Data Science
To address this, we developed a web application that transforms the entire data science process into a conversation. Built using Streamlit and powered by OpenAI’s GPT-4o-mini API, our app allows users to upload a dataset, interact with it through natural language, and move from raw data to predictive modeling, all within a guided interface. For our demonstration, we used the Titanic dataset, a classic benchmark containing demographic and survival information about passengers aboard the RMS Titanic.
Step 1: Upload and Preview the Data
The process begins with data upload. Users select a CSV file, and once uploaded, the app previews the data and displays key statistics such as the number of rows and columns, data types, and missing values. This immediate feedback helps users understand the structure of their dataset before diving deeper.
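For readers curious what this looks like in code, here is a minimal sketch of the upload-and-preview step in Streamlit (labels and layout here are illustrative, not our exact implementation):

```python
import streamlit as st
import pandas as pd

uploaded = st.file_uploader("Upload a CSV file", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)

    # Preview the first rows so users can sanity-check the upload
    st.dataframe(df.head())

    # Key statistics: shape, column types, and missing values per column
    st.write(f"{df.shape[0]} rows, {df.shape[1]} columns")
    st.write(df.dtypes.astype(str).rename("dtype"))
    st.write(df.isna().sum().rename("missing values"))
```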
Step 2: Data Cleaning
Next comes the cleaning phase, typically one of the most time-consuming parts of data analysis. Our app simplifies this with AI-generated suggestions that identify missing values, inconsistent data types, and unnecessary columns. These suggestions are written in plain English. For example, the app might say, “Column ‘Cabin’ has over 70% missing values. Would you like to drop it?” Users can accept the suggestions or respond with modified instructions like, “Instead of dropping, fill missing ‘Age’ values with the median.” These inputs are processed by GPT-4o-mini and translated into executable Python code, which is run safely inside the app environment.
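The core pattern behind this step is simple: describe the dataframe and the user’s request to the model, ask for bare pandas code back, and run that code against the dataframe. Here is a simplified sketch of that loop (the helper name, prompt wording, and bare `exec` are assumptions for illustration; a real app should validate generated code before running it):

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def apply_cleaning_instruction(df: pd.DataFrame, instruction: str) -> pd.DataFrame:
    """Translate a plain-English cleaning request into pandas code and run it."""
    prompt = (
        "You are a data-cleaning assistant. A pandas DataFrame named `df` has "
        f"columns {list(df.columns)}. Write only executable Python code, with "
        f"no markdown and no commentary, that does the following: {instruction}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    code = response.choices[0].message.content

    # Run the snippet in its own namespace so it only touches a copy of df
    namespace = {"df": df.copy()}
    exec(code, namespace)
    return namespace["df"]

# Example: apply_cleaning_instruction(df, "fill missing 'Age' values with the median")
```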
Step 3: Ask Questions and Visualize the Data
Once the data is cleaned, users can explore it by asking questions about patterns and trends. With the Titanic dataset, for example, a user might type, “How did survival rates differ by passenger class?” or “What was the average age of survivors?”
The app generates the necessary code to create charts and statistics using matplotlib, seaborn, and pandas. It also provides a plain-language summary of what each chart reveals. This turns traditional exploratory data analysis into a conversation, allowing users to iterate naturally as they explore insights.
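Under the hood, the generated code is ordinary matplotlib/seaborn. For the passenger-class question above, the snippet the model returns looks roughly like this (a hand-written approximation, using the demo copy of the Titanic data that ships with seaborn):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# seaborn bundles a small Titanic dataset, convenient for a standalone demo
df = sns.load_dataset("titanic")

fig, ax = plt.subplots()
# 'survived' is 0/1, so its mean per class is the survival rate
sns.barplot(data=df, x="pclass", y="survived", ax=ax)
ax.set_ylabel("Survival rate")
ax.set_title("Survival rate by passenger class")
plt.show()  # in the app, the figure is rendered with st.pyplot(fig)
```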
Step 4: Build Machine Learning Models with Plain Language
In the final stage, users can build predictive models. The app detects whether a classification or regression model is needed and suggests an appropriate model, such as logistic regression or a decision tree. Continuing with the Titanic dataset, the app might recommend building a model to predict survival using features like age, sex, and passenger class. Users can request things like “Build a logistic regression model to predict survival” or “Try a decision tree and compare the accuracy.”
All models are built using scikit-learn, and the outputs, including accuracy scores and visuals, are displayed in the app. Users can refine their models by asking follow-up questions or adjusting input features, all without writing code.
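A request like “build a logistic regression model to predict survival” maps to a fairly standard scikit-learn workflow. A minimal sketch, again on the seaborn copy of the Titanic data (our app’s preprocessing and evaluation are more involved):

```python
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Keep the target plus the three features mentioned above, dropping missing rows
df = sns.load_dataset("titanic")[["survived", "pclass", "sex", "age"]].dropna()
df["sex"] = (df["sex"] == "female").astype(int)  # simple binary encoding

X = df[["pclass", "sex", "age"]]
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```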
Technical Architecture and Tools Used
From a technical standpoint, the app is modular and maintainable. We organized it into components for data upload, cleaning, analysis, and modeling. Utility modules handle backend logic, while a service layer manages secure communication with the OpenAI API. Prompt engineering was key: we structured prompts to return only safe, runnable Python code, eliminating markdown and commentary from the model’s output. All file handling and visualization are done securely in the local environment, preserving user privacy.
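To make that concrete, here is one way such a service-layer function could look: the system prompt pins the model to bare Python, and the caller strips any stray code fences defensively (names and prompt wording are illustrative, not our exact code):

```python
from openai import OpenAI

SYSTEM_PROMPT = (
    "You are a code generator. Respond with plain, executable Python only: "
    "no markdown fences, no comments, no explanations. Assume a pandas "
    "DataFrame named `df` already exists."
)

client = OpenAI()

def generate_code(user_request: str, schema: str) -> str:
    """Return a bare Python snippet that fulfills the user's request."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Columns: {schema}\nTask: {user_request}"},
        ],
        temperature=0,
    )
    code = response.choices[0].message.content.strip()
    # Strip fences defensively in case the model ignores the instruction
    if code.startswith("```"):
        code = code.strip("`").removeprefix("python").strip()
    return code
```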
What We Learned
This project was an opportunity to bring together everything we had learned in the course—data cleaning, exploratory data analysis, machine learning, and application development—into a single cohesive experience. It also introduced us to the emerging potential of LLMs in workflow automation. In testing, we found that even users with no programming background could clean data, explore survival patterns, and build machine learning models just by describing what they wanted.