In this article, you will learn what cuML is, and how it can significantly speed up the training of machine learning models through GPU acceleration.

Topics we will cover include:

  • The aim and distinctive features of cuML.
  • How to prepare datasets and train a machine learning model for classification with cuML in a scikit-learn-like fashion.
  • How to easily compare results with an equivalent conventional scikit-learn model, in terms of classification accuracy and training time.

Let’s not waste any more time.

A Hands-On Introduction to cuML for GPU-Accelerated Machine Learning Workflows


Introduction

This article offers a hands-on Python introduction to cuML, a library from RAPIDS AI (NVIDIA's open-source suite of GPU-accelerated data science libraries) for GPU-accelerated machine learning workflows across widely used models. In conjunction with its DataFrame-oriented sibling, cuDF, cuML has gained popularity among practitioners who need scalable, production-ready machine learning solutions.

The hands-on tutorial below uses cuML together with cuDF for GPU-accelerated dataset management in a DataFrame format. For an introduction to cuDF, check out this related article.

About cuML: An “Accelerated Scikit-Learn”

RAPIDS cuML (short for CUDA Machine Learning) is an open-source library that accelerates scikit-learn–style machine learning on NVIDIA GPUs. It provides drop-in replacements for many popular algorithms, often reducing training and inference times on large datasets — without major code changes or a steep learning curve for those familiar with scikit-learn.

Among its three most distinctive features:

  • cuML follows a scikit-learn-like API, easing the transition from CPU to GPU for machine learning with minimal code changes.
  • It covers a broad set of techniques — all GPU-accelerated — including regression, classification, ensemble methods, clustering, and dimensionality reduction.
  • Through tight integration with the RAPIDS ecosystem, cuML works hand-in-hand with cuDF for data preprocessing, as well as with related libraries to facilitate end-to-end, GPU-native pipelines.

Hands-On Introductory Example

To illustrate the basics of cuML for building GPU-accelerated machine learning models, we will use a fairly large yet easily accessible dataset, available via a public URL in Jason Brownlee’s Datasets repository: the adult income dataset. This is a large, slightly class-imbalanced dataset intended for binary classification, namely predicting whether an adult’s income is high (above $50K) or low ($50K or less) based on a set of demographic and socio-economic features. Accordingly, we will build a binary classification model.

IMPORTANT: To run the code below on Google Colab or a similar notebook environment, make sure you change the runtime type to GPU; otherwise, a warning will be raised indicating cuDF cannot find the specific CUDA driver library it utilizes.

We start by importing the necessary libraries for our scenario:
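A minimal sketch of those imports is shown below; the aliases for the scikit-learn components are an organizational choice for this example so that both versions can coexist in the same script:

```python
from time import time

import cudf

# GPU-accelerated components from cuML
from cuml.model_selection import train_test_split
from cuml.linear_model import LogisticRegression
from cuml.metrics import accuracy_score

# Equivalent CPU-based components from scikit-learn, aliased for comparison
from sklearn.model_selection import train_test_split as sk_train_test_split
from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from sklearn.metrics import accuracy_score as sk_accuracy_score
```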

Note that, in addition to the cuML modules and functions used to split the dataset and train a logistic regression classifier, we have also imported their classical scikit-learn counterparts. This is not mandatory for using cuML (it works independently of plain scikit-learn), but we import the equivalent scikit-learn components for the sake of comparison in the rest of the example.

Next, we load the dataset into a cuDF dataframe optimized for GPU usage:
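The sketch below assumes the adult-all.csv file in Jason Brownlee’s Datasets repository on GitHub and the standard column names of the UCI adult dataset (the file has no header row); if cudf.read_csv cannot fetch the remote file directly in your environment, download it first and point to the local path:

```python
# Public mirror of the adult income dataset (no header row in the file)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/adult-all.csv"

column_names = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

# Read the CSV straight into a GPU-resident cuDF DataFrame
df = cudf.read_csv(url, header=None, names=column_names)
print(df.shape)
```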

Once the data is loaded, we identify the target variable and convert it into binary (1 for high income, 0 for low income):
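A minimal sketch of this step, assuming the label lives in the income column as the strings ">50K" / "<=50K":

```python
# Strip stray whitespace around the string labels, then map them to 0/1
df["income"] = df["income"].str.strip()
df["target"] = (df["income"] == ">50K").astype("int32")

# Inspect the (slightly imbalanced) class distribution
print(df["target"].value_counts())
```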

This dataset combines numeric features with a slight predominance of categorical ones. Most scikit-learn models — including decision trees and logistic regression — do not natively handle string-valued categorical features, so they require encoding. A similar pattern applies to cuML; hence, we will select a small number of features to train our classifier and one-hot encode the categorical ones.
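The feature subset below is an illustrative choice rather than one prescribed by the dataset; cudf.get_dummies performs the one-hot encoding directly on the GPU:

```python
# A small, illustrative subset of numeric and categorical features
numeric_features = ["age", "education-num", "hours-per-week"]
categorical_features = ["workclass", "marital-status", "occupation"]

X = df[numeric_features + categorical_features]
y = df["target"]

# One-hot encode the categorical columns and cast everything to float32,
# which cuML estimators handle natively
X = cudf.get_dummies(X, columns=categorical_features).astype("float32")
```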

So far, we have used cuML (and cuDF) much as we would use classical scikit-learn alongside pandas.

Now comes the interesting part. We will split the dataset into training and test sets and train a logistic regression classifier twice, using both CUDA GPU (cuML) and standalone scikit-learn. We will then compare both the classification accuracy and the time taken to train each model. Here’s the complete code for the model training and comparison:
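The listing below is a sketch of that comparison under the assumptions made so far (the test_size, random_state, and max_iter values are illustrative choices): the data is split once on the GPU, cuML trains on the cuDF splits, and the splits are copied to host memory so the scikit-learn model trains on identical data:

```python
# Split once on the GPU; cuML's train_test_split works directly on cuDF objects
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# --- GPU-accelerated logistic regression with cuML ---
start = time()
cu_model = LogisticRegression(max_iter=1000)
cu_model.fit(X_train, y_train)
cu_time = time() - start

cu_acc = float(accuracy_score(y_test, cu_model.predict(X_test)))

# --- CPU-based logistic regression with scikit-learn ---
# Copy the same splits to host memory so both models see identical data
X_train_cpu, X_test_cpu = X_train.to_pandas(), X_test.to_pandas()
y_train_cpu, y_test_cpu = y_train.to_pandas(), y_test.to_pandas()

start = time()
sk_model = SkLogisticRegression(max_iter=1000)
sk_model.fit(X_train_cpu, y_train_cpu)
sk_time = time() - start

sk_acc = sk_accuracy_score(y_test_cpu, sk_model.predict(X_test_cpu))

print(f"cuML         -> accuracy: {cu_acc:.4f} | training time: {cu_time:.2f} s")
print(f"scikit-learn -> accuracy: {sk_acc:.4f} | training time: {sk_time:.2f} s")
```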

The results are quite interesting.

In our run, the model trained with cuML achieved classification performance very similar to its classical scikit-learn counterpart, but it trained over an order of magnitude faster: about 0.5 seconds versus roughly 15 seconds for the scikit-learn classifier. Your exact numbers will vary with hardware, drivers, and library versions.

Wrapping Up

This article provided a gentle, hands-on introduction to the cuML library for GPU-accelerated construction of machine learning models for classification, regression, clustering, and more. Through a simple comparison, we showed how cuML can help build effective models with significantly faster training.