Building a Full-Stack AI Data Assistant


The Problem: Bridging the Gap Between Data and Usability

As part of our final project for a Data Science course, my teammates and I were tasked with solving a real-world problem using what we had learned. One challenge we all recognized was the gap between data availability and data usability. Organizations collect massive amounts of data, but the ability to clean, analyze, and model it often requires advanced technical skills. We wanted to create a tool to make data science accessible to anyone, regardless of their coding background.

The Solution: A Conversational Web Application for Data Science

To address this, we developed a web application that transforms the entire data science process into a conversation. Built using Streamlit and powered by OpenAI’s GPT-4o-mini API, our app allows users to upload a dataset, interact with it through natural language, and move from raw data to predictive modeling—all within a guided interface. For our demonstration, we used the Titanic dataset, a classic dataset that contains demographic and survival information about passengers aboard the RMS Titanic.

Step 1: Upload and Preview the Data

The process begins with data upload. Users select a CSV file, and once uploaded, the app previews the data and displays key statistics such as the number of rows and columns, data types, and missing values. This immediate feedback helps users understand the structure of their dataset before diving deeper.
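
To make this concrete, here is a minimal sketch of what the upload-and-preview step can look like in Streamlit with pandas; the widget labels and layout are illustrative assumptions rather than our exact code.

import pandas as pd
import streamlit as st

# Step 1 sketch: upload a CSV, preview it, and report basic structure
uploaded_file = st.file_uploader("Upload a CSV file", type="csv")
if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)

    st.dataframe(df.head())                                   # preview the first rows
    st.write(f"Rows: {df.shape[0]}  Columns: {df.shape[1]}")  # dataset size
    st.write("Data types:", df.dtypes.astype(str))            # column types
    st.write("Missing values per column:", df.isna().sum())   # gaps to address in Step 2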


Step 2: Data Cleaning

Next comes the cleaning phase, which is typically one of the most time-consuming parts of data analysis. Our app simplifies this with AI-generated suggestions that identify missing values, inconsistent data types, and unnecessary columns. These suggestions are written in plain English. For example, the app might say, “Column ‘Cabin’ has over 70% missing values. Would you like to drop it?” Users can accept the suggestions or respond with modified instructions like, “Instead of dropping, fill missing ‘Age’ values with the median.” These inputs are processed by GPT-4o-mini and translated into executable Python code, which is run safely inside the app environment.
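
As a rough illustration of that instruction-to-code loop, the sketch below sends a cleaning request to the OpenAI API and runs the returned snippet against the DataFrame. The prompt wording, the helper function, and the execution safeguards shown are simplified assumptions, not our production code; in the real app, validation and sandboxing around the generated code do most of the work.

import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_cleaning_code(instruction: str, columns: list[str]) -> str:
    """Illustrative helper: translate a plain-English cleaning instruction into pandas code."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "You translate data-cleaning instructions into Python pandas code. "
                "The DataFrame is named df. Return only runnable code, no markdown."
            )},
            {"role": "user", "content": f"Columns: {columns}. Instruction: {instruction}"},
        ],
    )
    return response.choices[0].message.content

# Example: the user overrides the app's suggestion for the 'Age' column
df = pd.DataFrame({"Age": [22.0, None, 38.0], "Cabin": [None, "C85", None]})
code = generate_cleaning_code("Fill missing 'Age' values with the median.", list(df.columns))
exec(code, {"df": df, "pd": pd})  # the real app restricts execution to a controlled namespace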


Step 3: Ask Questions and Visualize the Data

Once the data is cleaned, users can explore it by asking questions about patterns and trends. With the Titanic dataset, for example, a user might type:


  • “What is the survival rate by passenger class?”
  • “Can you show a bar chart comparing survival between men and women?”
  • “What’s the relationship between age and survival?”
  • “Can you visualize the distribution of fares by class?”


The app generates the necessary code to create charts and statistics using matplotlib, seaborn, and pandas. It also provides a plain-language summary of what each chart reveals. This turns traditional exploratory data analysis into a conversation, allowing users to iterate naturally as they explore insights.
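
For example, the first question above might be answered by code along these lines. This is a hand-written sketch of the kind of output the assistant produces, assuming the standard Titanic column names, not the model's literal output.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("titanic.csv")  # assumes the standard Titanic columns (Pclass, Survived, ...)

# Survival rate by passenger class
survival_by_class = df.groupby("Pclass")["Survived"].mean()

fig, ax = plt.subplots()
sns.barplot(x=survival_by_class.index, y=survival_by_class.values, ax=ax)
ax.set_xlabel("Passenger class")
ax.set_ylabel("Survival rate")
ax.set_title("Survival rate by passenger class")
plt.show()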


Step 4: Build Machine Learning Models with Plain Language

In the final stage, users can build predictive models. The app detects whether a classification or regression model is needed and suggests an appropriate model, such as logistic regression or decision trees. Continuing with the Titanic dataset, the app might recommend building a model to predict survival using features like age, sex, and passenger class. Users can request things like:

  • “Build a logistic regression model using age, sex, and class.”
  • “Show the model’s accuracy.”
  • “Display the confusion matrix.”
  • “Which features were most important in predicting survival?”


All models are built using scikit-learn, and the outputs, including accuracy scores and visuals, are displayed in the app. Users can refine their models by asking follow-up questions or adjusting input features, all without writing code.
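
As an illustration, a request like “Build a logistic regression model using age, sex, and class” corresponds roughly to the scikit-learn code below. The preprocessing choices shown here are simplified assumptions for the sketch, not the app's exact pipeline.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic.csv")
df = df.dropna(subset=["Age"])  # simple handling of missing ages for this sketch

# One-hot encode 'Sex'; 'Age' and 'Pclass' stay numeric
X = pd.get_dummies(df[["Age", "Sex", "Pclass"]], drop_first=True)
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
print("Confusion matrix:\n", confusion_matrix(y_test, predictions))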


Technical Architecture and Tools Used

From a technical standpoint, the app is modular and maintainable. We organized it into components for data upload, cleaning, analysis, and modeling. Utility modules handle backend logic, while a service layer manages secure communication with the OpenAI API. Prompt engineering was key: we structured prompts to return only safe, runnable Python code, eliminating markdown and commentary from the model's output. All file handling and visualization are done securely in the local environment, preserving user privacy.
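
To give a flavor of that prompt structure, here is a simplified sketch: a system prompt that pins the model to plain Python output, plus a small post-processing step that strips stray markdown fences. The exact wording and safeguards in our service layer differ; this only illustrates the pattern.

import re

# Illustrative system prompt: plain Python only, no markdown or commentary
SYSTEM_PROMPT = (
    "You are a data-science code generator. You will be given a pandas DataFrame "
    "named df and a user request. Respond with runnable Python code only: "
    "no markdown fences, no explanations, and no libraries beyond pandas, "
    "matplotlib, seaborn, and scikit-learn."
)

def strip_markdown_fences(code: str) -> str:
    """Remove ```python ... ``` fences if the model adds them despite instructions."""
    return re.sub(r"^```(?:python)?\s*|\s*```$", "", code.strip())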

What We Learned

This project was an opportunity to bring together everything we had learned in the course—data cleaning, exploratory data analysis, machine learning, and application development—into a single cohesive experience. It also introduced us to the emerging potential of LLMs in workflow automation. In testing, we found that even users with no programming background could clean data, explore survival patterns, and build machine learning models just by describing what they wanted.

GitHub

