
Simplify Setup and Boost Data Science in the Cloud using NVIDIA CUDA-X and Coiled


Imagine analyzing millions of NYC ride-share journeys—tracking patterns across boroughs, comparing service pricing, or identifying profitable pickup locations. The publicly available New York City Taxi and Limousine Commission (TLC) Trip Record Data contains valuable information that could reveal game-changing insights, but traditional processing approaches leave analysts waiting hours for results due to the volume of the data.

These delays interrupt analytical flow and limit business responsiveness. Data scientists at ride-hailing companies, urban planning departments, and financial firms need timely insights for critical decisions. The difference between waiting 9 minutes versus 5 seconds isn’t just convenience—it’s a competitive advantage.

Modern data science is perfectly suited to GPU parallelism. Operations like filtering and transforming large datasets involve applying the same function across millions of independent data points. When processing the NYC ride-share dataset, a GPU can evaluate calculations across thousands of rides simultaneously rather than sequentially, dramatically reducing computation time.

Despite these advantages, using GPUs has traditionally required specialized programming models and complex cloud configuration. Recent developments have made GPU-accelerated data science accessible to anyone with basic Python skills, without the need to invest in specialized hardware.

This post demonstrates how to use NVIDIA RAPIDS, part of NVIDIA CUDA-X libraries, and cloud GPUs with the cloud platform Coiled, which is designed to simplify running Python workloads at scale.

Prerequisites

  • A Coiled account
  • A local Python environment
  • Your cloud account (AWS, GCP or Azure) set to work with Coiled

NVIDIA RAPIDS: GPU acceleration for the PyData ecosystem

NVIDIA RAPIDS offers GPU acceleration for data science workloads with zero code changes required. The cudf.pandas accelerator enables instant GPU execution of pandas operations:

%load_ext cudf.pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Learn more about how NVIDIA cuDF can accelerate pandas by up to 150x.
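Outside a notebook, the accelerator can also be enabled from a plain Python script. The following is a minimal sketch, assuming cuDF may or may not be installed on the machine: it calls cudf.pandas.install() before pandas is imported and falls back to regular CPU pandas otherwise. The toy DataFrame and the gpu_enabled flag are illustrative and not part of the original workflow.

```python
# Sketch: enable cudf.pandas in a script, with a CPU fallback.
try:
    import cudf.pandas

    cudf.pandas.install()  # must be called before pandas is imported
    gpu_enabled = True
except ImportError:
    # cuDF is not installed; the code below runs on regular pandas
    gpu_enabled = False

import pandas as pd

# Toy data (hypothetical), only to show that the pandas code is unchanged
df = pd.DataFrame({"company": ["a", "b", "a"], "fare": [10.0, 20.0, 30.0]})
mean_fare = df.groupby("company")["fare"].mean()

print(f"GPU acceleration enabled: {gpu_enabled}")
print(mean_fare)
```

The pandas code after the import is identical in both cases, which is the point of the accelerator.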

Cloud GPUs

Many cloud platforms provide immediate access to the latest NVIDIA GPU architectures, without hardware refresh cycles, offering flexibility to scale resources based on computational demands. This availability democratizes access to state-of-the-art GPU acceleration for teams of all sizes.

The performance advantages of these advanced GPUs transform data analysis capabilities. With the NYC ride-share dataset, operations that previously took minutes on CPUs now complete in seconds, enabling iterative exploration that fundamentally changes analytical workflows. Data scientists can test more hypotheses, explore additional variables, and refine models with near-instant feedback, leading to deeper insights and discoveries that might otherwise remain hidden.

While cloud environments typically involve configuration challenges, specialized platforms like Coiled can simplify this process for GPU workflows. By abstracting resource provisioning and environment setup, these solutions let data scientists focus on analysis rather than infrastructure management, accelerating innovation by removing technical barriers to advanced computing capabilities.

Coiled Notebooks

To start an interactive Jupyter notebook session with Coiled Notebooks, run the RAPIDS notebook container through the notebook service.

coiled notebook start --gpu --container nvcr.io/nvidia/rapidsai/notebooks:25.02-cuda12.8-py3.12

Note that the --gpu flag will automatically select a g4dn.xlarge instance with an NVIDIA T4 GPU. You could add the --vm-type flag to explicitly choose another machine type with a different GPU configuration. For example, to choose a machine with four L4 GPUs, you would run the following.

coiled notebook start --gpu --vm-type g6.24xlarge --container nvcr.io/nvidia/rapidsai/notebooks:25.02-cuda12.8-py3.12

To access Jupyter, click the link displayed in the terminal.

Figure 1. Output after starting a notebook session on Coiled
Figure 2. JupyterLab with GPU access, launched through Coiled

As the commands above show, moving from local development to cloud GPU execution is seamless, and the RAPIDS Notebook images provide a convenient way to accelerate your workflows and change how you approach large-scale problems.

Coiled Run

You can also run Python scripts in ephemeral VMs through Coiled Run. This boots a VM from the cloud, copies all the necessary packages from your local environment using package sync, runs the script, and shuts down the VM.

coiled run python my_code.py  # Boots a VM on the cloud, runs the script, then shuts down again

Use this to run GPU code on a remote environment using the RAPIDS container. You can set the coiled CLI to keep the VM around for a few minutes after execution is complete, just in case you want to run it again and reuse the same hardware.

$ coiled run --gpu --name rapids-demo --keepalive 5m --container nvcr.io/nvidia/rapidsai/base:25.02-cuda12.8-py3.12

This works very nicely when paired with the cudf.pandas CLI tool.

$ coiled run --gpu --name rapids-demo --keepalive 5m --container nvcr.io/nvidia/rapidsai/base:25.02-cuda12.8-py3.12 -- python -m cudf.pandas cudf_pandas_coiled_demo.py

Output
------

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.download.nvidia.com/licenses/NVIDIA_Deep_Learning_Container_License.pdf

Calculate violations by state took: 3.470 seconds
Calculate violations by vehicle type took: 0.145 seconds
Calculate violations by day of week took: 1.238 seconds
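The timing lines in the output above suggest that the script wraps each step in a timer. The script itself is not shown here, so the following is only a sketch of one way to print lines in that format. The timed helper, the timings dictionary, and the toy workload are all hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally record it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label} took: {elapsed:.3f} seconds")


timings = {}

# Toy workload standing in for a real pandas operation
with timed("Sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```

Wrapping each pandas step this way makes it easy to compare a run with python my_script.py against a run with python -m cudf.pandas my_script.py.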

In the NVIDIA RAPIDS deployment documentation, you will find the Jupyter Notebook used in this experiment. You can download it and run it for yourself to reproduce the performance numbers mentioned throughout this article.

Analyzing the ride-share dataset

The NYC TLC Trip Record Data is available through S3. This data is also available in Parquet format, partitioned into files of 100 MB, on Coiled’s S3 bucket, which was used for this example.

I used 60 partitions, which correspond to the latest recorded data and amount to about 64.8M rows. Performing different operations on this data gives an idea of the speedup achieved with the cudf.pandas accelerator compared to vanilla pandas.

I used the g6.24xlarge EC2 instance as shown in the snippet in the previous section. This machine comes with four NVIDIA L4 Tensor Core GPUs along with 96 vCPUs and 384 GB of memory.

The following shows some of the possible operations that can be accelerated with zero code changes. 

Loading data and optimizing data types

Data is loaded from S3 using s3fs together with the pandas read_parquet() function.

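import pandas as pd
import s3fs

# Assumption: AWS credentials are available in the environment;
# use s3fs.S3FileSystem(anon=True) if the bucket allows anonymous reads.
fs = s3fs.S3FileSystem()

# Read the 60 most recent partitions (part.660 through part.719)
frames = []
for i in range(660, 720):
    frames.append(pd.read_parquet(f"s3://coiled-data/uber/part.{i}.parquet", filesystem=fs))

data = pd.concat(frames, ignore_index=True)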
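As a quick sanity check on the numbers quoted earlier, the partition paths can be built up front without touching S3. This is a small sketch: the path pattern is the one used in the loop above, and the row count is the approximate figure reported in the text, not a measured value.

```python
# Partition indices 660..719 inclusive, matching range(660, 720) in the loop above
partition_ids = range(660, 720)
paths = [f"s3://coiled-data/uber/part.{i}.parquet" for i in partition_ids]

total_rows = 64_800_000  # approximate row count reported in the text
rows_per_partition = total_rows / len(paths)

print(f"{len(paths)} partitions, from {paths[0]} to {paths[-1]}")
print(f"{rows_per_partition:,.0f} rows per partition on average")
```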

To optimize memory usage, I converted all string and object types to categorical values, along with converting int32 and float64 to int16 and float32, respectively.

# Convert data types to save memory
for col in data.columns:
    if data[col].dtype == 'int32':
        if data[col].min() >= -32768 and data[col].max() <= 32767:
            data[col] = data[col].astype('int16')
    if data[col].dtype == 'float64':
        data[col] = data[col].astype('float32')
    if data[col].dtype == 'string' or data[col].dtype == 'object':
        data[col] = data[col].astype('category')

This operation took 15 seconds with pandas, but only one second with the cudf.pandas accelerator.
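To see the effect of this conversion without downloading the dataset, the same rules can be applied to a small synthetic frame and the memory footprint compared before and after. This is a sketch under stated assumptions: the downcast helper, the column values, and the company names are made up for illustration, and the exact savings depend on the data.

```python
import numpy as np
import pandas as pd


def downcast(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the dtype rules described above applied."""
    out = df.copy()
    int16 = np.iinfo(np.int16)
    for col in out.columns:
        dtype = out[col].dtype
        if dtype == "int32":
            # Only narrow integers when every value fits in int16
            if out[col].min() >= int16.min and out[col].max() <= int16.max:
                out[col] = out[col].astype("int16")
        elif dtype == "float64":
            out[col] = out[col].astype("float32")
        elif dtype == "object" or pd.api.types.is_string_dtype(dtype):
            out[col] = out[col].astype("category")
    return out


# Synthetic sample (hypothetical values), 1,000 rows
sample = pd.DataFrame({
    "trip_time": np.array([300, 900, 1500, 2400] * 250, dtype="int32"),
    "base_passenger_fare": np.array([8.5, 14.0, 27.25, 40.0] * 250, dtype="float64"),
    "company": ["Uber", "Lyft", "Uber", "Lyft"] * 250,
})

before = sample.memory_usage(deep=True).sum()
small = downcast(sample)
after = small.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
print(small.dtypes)
```

Note that float32 keeps roughly seven significant digits, so this trade-off is reasonable for fares but may not suit every column.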

Finding monthly revenue and profit by company 

Next, I wanted to find the revenue and profit of each ride-share company for each month present in the data. The dataset contains several columns about the fare paid by the rider, including base passenger fare, tolls, and sales tax. To calculate each company's total revenue, I summed all the fare-related columns into a new column called total_fare. I then grouped the data by company and month.

data['pickup_month'] = data['pickup_datetime'].dt.month

data['total_fare'] = data['base_passenger_fare'] + data['tolls'] + data['bcf'] + data['sales_tax'] + data['congestion_surcharge'] + data['airport_fee']

grouped = data.groupby(['company', 'pickup_month']).agg({
    'company': 'count',
    'total_fare': ['sum', 'mean'],
    'driver_pay': 'sum',
    'tips': 'sum'
}).reset_index()

grouped.columns = ['company', 'pickup_month', 'trip_count', 'total_revenue', 'avg_fare', 'total_driver_pay', 'total_tips']

grouped['total_driver_payout'] = grouped['total_driver_pay'] + grouped['total_tips']

grouped = grouped[['company', 'pickup_month', 'trip_count', 'total_revenue', 'avg_fare', 'total_driver_payout']]

grouped = grouped.sort_values(['company', 'pickup_month'])

grouped['profit'] = grouped['total_revenue'] - grouped['total_driver_payout']

grouped.head()

This operation took about 4.7 seconds with pandas, but completed in 2.67 seconds with the cudf.pandas accelerator enabled.
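The same summary can also be written with pandas named aggregation, which names the output columns directly instead of assigning grouped.columns afterward. The following is an alternative sketch on a toy frame, not the code that was timed above. The rows are invented, and total_fare is assumed to be precomputed.

```python
import pandas as pd

# Toy trips (hypothetical values); total_fare is assumed to be precomputed
trips = pd.DataFrame({
    "company": ["Uber", "Uber", "Lyft", "Lyft"],
    "pickup_month": [1, 1, 1, 2],
    "total_fare": [20.0, 30.0, 15.0, 25.0],
    "driver_pay": [12.0, 18.0, 9.0, 15.0],
    "tips": [2.0, 0.0, 1.0, 3.0],
})

# Named aggregation: output_column=(input_column, aggregation)
summary = trips.groupby(["company", "pickup_month"], as_index=False).agg(
    trip_count=("total_fare", "count"),
    total_revenue=("total_fare", "sum"),
    avg_fare=("total_fare", "mean"),
    total_driver_pay=("driver_pay", "sum"),
    total_tips=("tips", "sum"),
)

summary["total_driver_payout"] = summary["total_driver_pay"] + summary["total_tips"]
summary["profit"] = summary["total_revenue"] - summary["total_driver_payout"]

print(summary)
```

Because the output names are declared next to each aggregation, the column order no longer has to be kept in sync with a separate list of names.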

Categorizing the trips based on duration

To understand the speedup for user-defined functions (UDFs), I categorized all the trips in the data into these three categories:

  1. Short (indicated by a 0 in the trip_category column) for trips of less than 10 minutes.
  2. Medium (indicated by a 1 in the trip_category column) for trips between 10 and 20 minutes.
  3. Long (indicated by a 2 in the trip_category column) for trips of 20 minutes or more.

I then used these categories to calculate the mean fare and number of trips in each category.

def categorize_trip(row):
    if row['trip_time'] < 600:  # Less than 10 minutes
        return 0
    elif row['trip_time'] < 1200:  # 10-20 minutes
        return 1
    else:  # 20 minutes or more
        return 2

# Apply UDF
data['trip_category'] = data.apply(categorize_trip, axis=1)

# Create a mapping for trip categories
trip_category_map = {0: 'short', 1: 'medium', 2: 'long'}

# Group by trip category
category_stats = data.groupby('trip_category').agg({
    'total_fare': ['mean', 'sum'],
    'trip_time': 'count'
})

# Rename the index with descriptive labels
category_stats.index = category_stats.index.map(lambda x: f"{trip_category_map[x]}")

category_stats

This operation took 408 seconds with pandas, but completed in just 0.2 seconds with the cudf.pandas accelerator enabled. The categorize_trip function must be applied to every row in the dataset, which is slow when done one row at a time but is inherently parallelizable, so a GPU delivers significantly higher performance.
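For comparison, the same binning can be expressed without a row-wise UDF. The sketch below is an alternative, not what was benchmarked above: it uses pd.cut with left-closed bins and checks on a few hand-picked trip times that it matches categorize_trip, including at the 600 and 1200 second boundaries.

```python
import numpy as np
import pandas as pd

# Hand-picked trip times in seconds, including both boundary values
trips = pd.DataFrame({"trip_time": [120, 599, 600, 1199, 1200, 3600]})


def categorize_trip(row):
    if row["trip_time"] < 600:  # Less than 10 minutes
        return 0
    elif row["trip_time"] < 1200:  # 10-20 minutes
        return 1
    else:  # 20 minutes or more
        return 2


# Row-wise UDF, as in the example above
row_wise = trips.apply(categorize_trip, axis=1)

# Vectorized equivalent: bins are [-inf, 600), [600, 1200), [1200, inf)
vectorized = pd.cut(
    trips["trip_time"],
    bins=[-np.inf, 600, 1200, np.inf],
    right=False,
    labels=[0, 1, 2],
).astype("int8")

print(pd.DataFrame({"trip_time": trips["trip_time"], "udf": row_wise, "cut": vectorized}))
```

A vectorized form like this tends to be much faster on CPU pandas as well, so it is worth considering whenever the UDF is a simple threshold rule.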

Finding frequently taken routes

The TLC dataset has columns PULocationID and DOLocationID, which identify the pickup and dropoff zones according to the NYC TLC taxi zones. A lookup table that maps each location ID to its zone and borough is available in CSV format.

Merging this lookup table into the trips DataFrame makes it possible to find the 10 most frequently used routes.

taxi_zones = pd.read_csv('taxi_zones.csv', usecols=['LocationID', 'zone', 'borough'])

#Convert PULocationID to pickup_location combining zone and borough information
data = pd.merge(data, taxi_zones, left_on='PULocationID', right_on='LocationID', how='left')
for col in ['zone', 'borough']:
    data[col] = data[col].fillna('NA')
data['pickup_location'] = data['zone'] + ',' + data['borough']
data.drop(['LocationID', 'zone', 'borough'], axis=1, inplace=True)

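# Do the same for the dropoff location
data = pd.merge(data, taxi_zones, left_on='DOLocationID', right_on='LocationID', how='left')
for col in ['zone', 'borough']:
    data[col] = data[col].fillna('NA')
data['dropoff_location'] = data['zone'] + ',' + data['borough']
data.drop(['LocationID', 'zone', 'borough'], axis=1, inplace=True)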

location_group = data.groupby(['pickup_location', 'dropoff_location']).size().reset_index(name='ride_count')
location_group = location_group.sort_values('ride_count', ascending=False)

# Identify top 10 hotspots
top_hotspots = location_group.head(10)
print("Top 10 Pickup and Dropoff Hotspots:")
print(top_hotspots)

Each merge operation took about 30 seconds with pandas, but only 1.3 seconds with the cudf.pandas accelerator.
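Because the zone table is a simple ID-to-label lookup, the labels can also be attached with Series.map instead of a merge followed by dropping the helper columns. This is an alternative sketch on toy data, not the code that was timed above. The three zone rows and the trip IDs are illustrative, and ID 99 is deliberately left unmatched.

```python
import pandas as pd

# Toy lookup table with the same columns as taxi_zones.csv (illustrative rows)
taxi_zones = pd.DataFrame({
    "LocationID": [1, 2, 3],
    "zone": ["Newark Airport", "Jamaica Bay", "Allerton/Pelham Gardens"],
    "borough": ["EWR", "Queens", "Bronx"],
})

# Toy trips; location ID 99 has no entry in the lookup table
trips = pd.DataFrame({
    "PULocationID": [1, 1, 2, 1, 99],
    "DOLocationID": [2, 2, 3, 3, 1],
})

# Series mapping LocationID -> "zone,borough"
labels = pd.Series(
    (taxi_zones["zone"] + "," + taxi_zones["borough"]).values,
    index=taxi_zones["LocationID"],
)

# Unmatched IDs become "NA,NA", matching the fillna('NA') behavior above
trips["pickup_location"] = trips["PULocationID"].map(labels).fillna("NA,NA")
trips["dropoff_location"] = trips["DOLocationID"].map(labels).fillna("NA,NA")

routes = (
    trips.groupby(["pickup_location", "dropoff_location"])
    .size()
    .reset_index(name="ride_count")
    .sort_values("ride_count", ascending=False)
)
top_route = routes.iloc[0]

print(routes)
```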

Overall time comparison

Let us compare the time taken to run our example notebook in its entirety. The standard Pandas implementation on CPU required 18 minutes and 45 seconds to execute. In contrast, the GPU-accelerated version using cudf.pandas executed identical operations in just 2 minutes and 6 seconds—an 8.9x speedup in execution time alone.

When accounting for infrastructure setup overhead—the 2 minutes and 45 seconds required to provision a GPU-equipped EC2 instance with the RAPIDS Docker image on Coiled—the total runtime was 4 minutes and 50 seconds. This still represents a 3.9x performance improvement over the CPU-based implementation, achieved with minimal infrastructure setup.
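The speedup figures follow directly from the reported durations. The short sketch below reproduces the arithmetic with the standard library; the variable names are illustrative, and the durations are the ones quoted in the text.

```python
from datetime import timedelta

cpu_runtime = timedelta(minutes=18, seconds=45)  # pandas on CPU
gpu_runtime = timedelta(minutes=2, seconds=6)    # cudf.pandas on GPU
setup_time = timedelta(minutes=2, seconds=45)    # provisioning the GPU instance

# Dividing one timedelta by another yields a plain float ratio
compute_speedup = cpu_runtime / gpu_runtime
end_to_end_speedup = cpu_runtime / (gpu_runtime + setup_time)

print(f"Compute-only speedup: {compute_speedup:.1f}x")
print(f"Speedup including setup: {end_to_end_speedup:.1f}x")
```

The setup cost is paid once per session, so its share of the total shrinks as more analysis is run on the same instance.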

These metrics clearly demonstrate that GPU acceleration provides substantial benefits for computationally intensive operations involving large data, particularly those involving element-wise transformations and user-defined functions. 

Conclusion

By leveraging cudf.pandas, we achieved dramatic performance improvements, from an 8.9x overall speedup to 30x faster UDF operations, with zero code changes. Coiled reduced the complexity of cloud GPU computing by provisioning and scaling resources automatically, which made setup simple and kept costs down by shutting resources down after use. This combination of familiar syntax and simplified infrastructure management gives data scientists a powerful toolkit to accelerate analytical workflows, shorten development cycles, and extract more value from large datasets while staying focused on insights rather than infrastructure.

Learn More:

  1. Get started with RAPIDS
  2. Install Coiled
  3. API reference for cudf.pandas