📘 GPU Programming: CuPy Basics

🎯 Introduction

Welcome to this exciting tutorial on GPU Programming with CuPy! 🎉 In this guide, we’ll explore how to harness the incredible power of your graphics card for lightning-fast Python computations.

You’ll discover how CuPy can transform your data processing and scientific computing experience. Whether you’re building machine learning models 🤖, processing large datasets 📊, or running complex simulations 🔬, understanding GPU programming is essential for achieving blazing-fast performance!

By the end of this tutorial, you’ll feel confident using CuPy to accelerate array-heavy Python code. On large inputs the speedup is often 10x or more, though the exact number depends on your workload and hardware. Let’s dive in! 🏊‍♂️

📚 Understanding GPU Programming

🤔 What is GPU Programming?

GPU programming is like having thousands of tiny workers instead of just a few powerful ones 🏭. Think of it as the difference between one master chef (CPU) preparing a meal versus an entire kitchen brigade (GPU) working in parallel!

In Python terms, CuPy provides a NumPy-compatible interface for GPU computing 🚀. This means you can:

✨ Run array operations on thousands of cores simultaneously
🚀 Process massive datasets at incredible speeds
🛡️ Keep your familiar NumPy syntax while gaining GPU power

💡 Why Use CuPy?

Here’s why developers love CuPy for GPU programming:

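NumPy Compatibility 🔄: Drop-in replacement for most NumPy code
Massive Speedups ⚡: Often 10-100x faster for large arrays, depending on the operation and your hardware
Easy Migration 🎯: Often as simple as changing import numpy as np to import cupy as cp
Memory Management 🧠: A built-in memory pool reuses GPU allocations automatically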

Real-world example: Imagine processing millions of images 📷. With CuPy, what takes hours on CPU can finish in minutes on GPU!
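
To make the “Easy Migration” point concrete, here is a minimal sketch rather than an official recipe. The normalize helper and the xp alias are names made up for this illustration, and the NumPy fallback is an assumption added so the snippet still runs on a machine without a CUDA GPU.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device is found
    xp = cp
except Exception:
    xp = np  # assumption: fall back to NumPy so the sketch runs without a GPU


def normalize(values):
    """Scale values to zero mean and unit variance with whichever backend is active."""
    return (values - xp.mean(values)) / xp.std(values)


data = xp.arange(10, dtype=xp.float64)
scaled = normalize(data)

print(f"Backend: {xp.__name__}")
print(f"Mean after scaling: {float(xp.mean(scaled)):.6f}")
print(f"Std after scaling:  {float(xp.std(scaled)):.6f}")
```

The body of normalize never mentions the GPU. Only the module bound to xp decides where the work runs, which is what “drop-in replacement” means in practice.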

🔧 Basic Syntax and Usage

📝 Simple Example

Let’s start with a friendly example:

# 👋 Hello, CuPy!
import cupy as cp
import numpy as np

# 🎨 Creating GPU arrays
gpu_array = cp.array([1, 2, 3, 4, 5])
print(f"GPU array: {gpu_array} 🚀")

# 🔄 Converting from NumPy
cpu_data = np.array([10, 20, 30, 40, 50])
gpu_data = cp.asarray(cpu_data)  # 📤 Send to GPU!

# ⚡ Fast computations on GPU
result = gpu_data * 2 + 10
print(f"GPU result: {result} ✨")

# 📥 Get result back to CPU
cpu_result = cp.asnumpy(result)
print(f"CPU result: {cpu_result} 💻")

💡 Explanation: Notice how similar it is to NumPy! The magic happens behind the scenes where CuPy runs operations on your GPU’s thousands of cores!

🎯 Common Patterns

Here are patterns you’ll use daily:

# 🏗️ Pattern 1: Large array operations
size = 10_000_000  # 10 million elements! 
gpu_array = cp.random.random(size)  # 🎲 Random on GPU

# 🎨 Pattern 2: Mathematical operations
mean = cp.mean(gpu_array)  # 📊 Statistics
squared = cp.square(gpu_array)  # 🔢 Element-wise ops
sorted_arr = cp.sort(gpu_array)  # 📈 Sorting

# 🔄 Pattern 3: Matrix operations
matrix_a = cp.random.random((1000, 1000))
matrix_b = cp.random.random((1000, 1000))
result = cp.dot(matrix_a, matrix_b)  # ⚡ Super fast matrix multiply!

💡 Practical Examples

🛒 Example 1: Image Processing Pipeline

Let’s build something real:

# 🖼️ Image processing on GPU
import cupy as cp
import numpy as np
from cupyx.scipy import ndimage as gpu_ndimage  # 🧰 GPU counterpart of scipy.ndimage

class GPUImageProcessor:
    def __init__(self):
        # 🎛️ 3x3 kernels stored as float arrays on the GPU
        self.filters = {
            "blur": cp.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=cp.float64) / 16,
            "edge": cp.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=cp.float64),
            "sharpen": cp.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=cp.float64)
        }

    # 📷 Process image batch
    def process_batch(self, images):
        # 📤 Send images to GPU
        gpu_images = cp.asarray(images)
        processed = []

        for img in gpu_images:
            # ✨ Apply filters
            blurred = self.apply_filter(img, "blur")
            edges = self.apply_filter(img, "edge")
            sharpened = self.apply_filter(img, "sharpen")

            # 🎨 Combine effects
            result = (blurred * 0.3 + edges * 0.3 + sharpened * 0.4)
            processed.append(result)

        # 📥 Stack on the GPU, then return to CPU in one transfer
        return cp.asnumpy(cp.stack(processed))

    # 🔧 Apply convolution filter
    def apply_filter(self, image, filter_type):
        kernel = self.filters[filter_type]
        # ⚡ True 2D convolution on the GPU. Flattening the image and kernel would
        # give a 1D convolution instead. mode="constant" zero-pads the borders
        # and keeps the output the same shape as the input.
        return gpu_ndimage.convolve(image, kernel, mode="constant")

# 🎮 Let's use it!
processor = GPUImageProcessor()
fake_images = np.random.random((10, 256, 256))  # 10 images
processed = processor.process_batch(fake_images)
print(f"Processed {len(processed)} images on GPU! 🚀")

🎯 Try it yourself: Add a brightness adjustment feature and measure the speedup compared to CPU!
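
If you want a starting point for the brightness half of that exercise, here is one possible sketch. The adjust_brightness helper and the delta parameter are hypothetical names, pixel values are assumed to live in [0, 1], and the NumPy fallback is an assumption added so the code runs without a GPU. Timing is left for you to add.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device is found
    xp = cp
except Exception:
    xp = np  # assumption: NumPy fallback so the sketch runs without a GPU


def adjust_brightness(images, delta):
    """Shift every pixel by delta, then clip back into the valid [0, 1] range."""
    return xp.clip(images + delta, 0.0, 1.0)


rng = np.random.default_rng(seed=0)
images_cpu = rng.random((10, 256, 256))  # same shape as fake_images above
images = xp.asarray(images_cpu)          # one transfer to the GPU (no-op on NumPy)

brighter = adjust_brightness(images, 0.2)

# Bring the result back to the CPU and compare with a plain NumPy reference
brighter_host = brighter if xp is np else xp.asnumpy(brighter)
reference = np.clip(images_cpu + 0.2, 0.0, 1.0)

print(f"Backend: {xp.__name__}")
print(f"Matches NumPy reference: {np.allclose(brighter_host, reference)}")
```

Because the whole batch is adjusted in a single array expression, there is no per-image Python loop, which is the kind of workload a GPU handles well.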

🎮 Example 2: Monte Carlo Simulation

Let’s make it fun with simulations:

# 🎲 Monte Carlo Pi estimation on GPU
import cupy as cp
import numpy as np
import time

class GPUMonteCarloSimulator:
    def __init__(self):
        self.results = []
        
    # 🎯 Estimate Pi using random points
    def estimate_pi(self, n_points=10_000_000):
        print(f"🎲 Throwing {n_points:,} darts at a circle...")

        # ⚡ Generate random points on GPU
        start = time.time()
        x = cp.random.uniform(-1, 1, n_points)
        y = cp.random.uniform(-1, 1, n_points)

        # 🎨 Check if points are inside circle
        inside_circle = (x**2 + y**2) <= 1
        # 📥 float() copies the result to the CPU, which also waits for the GPU
        # to finish. Without it the timer would stop before the work is done.
        pi_estimate = float(4 * cp.sum(inside_circle) / n_points)

        gpu_time = time.time() - start

        # 📊 Compare with CPU on the same number of points
        cpu_start = time.time()
        cpu_x = np.random.uniform(-1, 1, n_points)
        cpu_y = np.random.uniform(-1, 1, n_points)
        cpu_inside = (cpu_x**2 + cpu_y**2) <= 1
        cpu_pi = 4 * np.sum(cpu_inside) / n_points
        cpu_time = time.time() - cpu_start

        print(f"🚀 GPU estimate: {pi_estimate:.6f} (Time: {gpu_time:.3f}s)")
        print(f"💻 CPU estimate: {cpu_pi:.6f} (Time: {cpu_time:.3f}s)")
        # ⏱️ The first run includes one-time GPU kernel compilation
        print(f"⚡ GPU speedup: {cpu_time/gpu_time:.1f}x")

        return pi_estimate
    
    # 🔬 Run multiple simulations
    def run_simulations(self, n_sims=10):
        estimates = []
        for i in range(n_sims):
            print(f"\n🎮 Simulation {i+1}/{n_sims}")
            estimate = self.estimate_pi()
            estimates.append(estimate)
            
        # 📈 Calculate statistics
        estimates_gpu = cp.array(estimates)
        # 📥 Convert the GPU scalars to Python floats before formatting them
        mean_pi = float(cp.mean(estimates_gpu))
        std_pi = float(cp.std(estimates_gpu))
        
        print(f"\n🏆 Final Results:")
        print(f"  📊 Mean estimate: {mean_pi:.6f}")
        print(f"  📏 Actual Pi: {np.pi:.6f}")
        print(f"  🎯 Error: {abs(mean_pi - np.pi):.6f}")
        print(f"  📈 Std deviation: {std_pi:.6f}")

# 🎲 Let's simulate!
simulator = GPUMonteCarloSimulator()
simulator.run_simulations(5)

🚀 Advanced Concepts

🧙‍♂️ Custom CUDA Kernels

When you’re ready to level up, write custom GPU code:

# 🎯 Custom CUDA kernel for element-wise operations
import cupy as cp

# 🪄 Define a custom GPU kernel
add_multiply_kernel = cp.ElementwiseKernel(
    'float32 x, float32 y, float32 a, float32 b',  # Input params
    'float32 z',  # Output
    'z = a * x + b * y',  # GPU code! ✨
    'add_multiply'  # Kernel name
)

# 🚀 Use the custom kernel
size = 1_000_000
x = cp.random.random(size, dtype=cp.float32)
y = cp.random.random(size, dtype=cp.float32)
a, b = 2.5, 3.7

# ⚡ Run custom operation on GPU
result = add_multiply_kernel(x, y, a, b)
print(f"Custom kernel processed {size:,} elements! 🎉")

🏗️ Memory Management

For the brave developers handling large datasets:

# 🧠 Smart GPU memory management
import cupy as cp

class GPUMemoryManager:
    def __init__(self):
        self.memory_pool = cp.get_default_memory_pool()
        self.pinned_memory_pool = cp.get_default_pinned_memory_pool()
        
    # 📊 Check memory usage
    def check_memory(self):
        used_bytes = self.memory_pool.used_bytes()
        total_bytes = self.memory_pool.total_bytes()
        
        print(f"🧠 GPU Memory Status:")
        print(f"  📊 Used: {used_bytes / 1e9:.2f} GB")
        print(f"  📈 Total allocated: {total_bytes / 1e9:.2f} GB")
        
    # 🧹 Clear GPU memory
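    def clear_memory(self):
        print("🧹 Releasing unused GPU memory...")
        cp.cuda.Stream.null.synchronize()  # ⏳ Wait for pending GPU work first
        # free_all_blocks() only releases cached blocks that no live array still uses
        self.memory_pool.free_all_blocks()
        self.pinned_memory_pool.free_all_blocks()
        print("✨ Unused GPU memory released!")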
        
    # 🎯 Context manager for memory-safe operations
    def memory_scope(self):
        class MemoryScope:
            def __enter__(scope_self):
                self.check_memory()
                return scope_self
                
            def __exit__(scope_self, *args):
                self.clear_memory()
                
        return MemoryScope()

# 🎮 Use memory manager
manager = GPUMemoryManager()
with manager.memory_scope():
    # ⚡ Large computation
    huge_array = cp.random.random((10000, 10000))
    result = cp.dot(huge_array, huge_array.T)
    print(f"Computed {result.shape} matrix! 🚀")

⚠️ Common Pitfalls and Solutions

😱 Pitfall 1: Out of Memory

# ❌ Wrong way - loading too much at once!
import cupy as cp
import numpy as np

try:
    # 💥 100,000 x 100,000 float64 values = 80 GB - OOM error on most GPUs!
    huge_array = cp.zeros((100000, 100000))
except cp.cuda.memory.OutOfMemoryError:
    print("😰 GPU out of memory!")

# ✅ Correct way - process in chunks!
def process_in_chunks(data, chunk_size=1000):
    results = []
    for i in range(0, len(data), chunk_size):
        chunk = cp.asarray(data[i:i+chunk_size])  # 📤 Only one chunk on the GPU at a time
        result = cp.sum(chunk, axis=1)  # Process chunk
        results.append(cp.asnumpy(result))  # 📥 Keep only the small result, on the CPU
    return np.concatenate(results)

print("✅ Processing in chunks saves memory! 🛡️")
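
Before allocating a big array, it helps to do the arithmetic first. The sketch below is one way to do that. The required_bytes and fits_on_gpu helpers and the safety_factor parameter are hypothetical names invented here. The only CuPy call used is cp.cuda.Device().mem_info, which reports free and total device memory in bytes.

```python
import math

import numpy as np


def required_bytes(shape, dtype=np.float64):
    """Bytes needed to hold one dense array of the given shape and dtype."""
    return math.prod(shape) * np.dtype(dtype).itemsize


def fits_on_gpu(shape, dtype=np.float64, safety_factor=0.8):
    """Return True or False if a CUDA device is visible, or None if it cannot be checked."""
    try:
        import cupy as cp
        free_bytes, _total_bytes = cp.cuda.Device().mem_info
    except Exception:
        return None  # no CuPy or no GPU here, so there is nothing to compare against
    return required_bytes(shape, dtype) <= safety_factor * free_bytes


shape = (100_000, 100_000)
print(f"float64: {required_bytes(shape) / 1e9:.0f} GB")              # 80 GB
print(f"float32: {required_bytes(shape, np.float32) / 1e9:.0f} GB")  # 40 GB
print(f"Fits on this GPU? {fits_on_gpu(shape)}")
```

The safety_factor leaves headroom for temporaries, since an expression like a * 2 + 10 allocates intermediate arrays of the same size as its input.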

🤯 Pitfall 2: Unnecessary Transfers

# ❌ Slow - too many CPU-GPU transfers!
def slow_computation(data):
    result = 0
    for i in range(len(data)):
        gpu_data = cp.asarray(data[i])  # 📤 Transfer
        result += float(cp.sum(gpu_data))  # 📥 Transfer back
    return result

# ✅ Fast - minimize transfers!
def fast_computation(data):
    gpu_data = cp.asarray(data)  # 📤 One transfer
    result = cp.sum(gpu_data)  # ⚡ All ops on GPU
    return float(result)  # 📥 One transfer back

print("✅ Batch operations for speed! 🚀")

🛠️ Best Practices

🎯 Profile First: Measure before optimizing - not all code benefits from GPU!
📊 Use Large Arrays: GPUs shine with millions of elements
🛡️ Handle Memory: Monitor and manage GPU memory usage
🎨 Batch Operations: Process multiple items together
✨ Keep Data on GPU: Minimize CPU-GPU transfers
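
For “Profile First”, keep in mind that GPU work is queued asynchronously, so a timer can stop before the GPU has finished. Here is a minimal timing sketch under that assumption. The best_time and synchronize helpers are hypothetical names, and the NumPy fallback is only there so the code runs without a GPU.

```python
import time

import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device is found
    xp = cp
except Exception:
    xp = np  # assumption: NumPy fallback so the sketch runs without a GPU


def synchronize():
    """Wait for queued GPU work to finish; does nothing on the NumPy fallback."""
    if xp is not np:
        xp.cuda.Stream.null.synchronize()


def best_time(func, *args, repeats=5):
    """Return the fastest of `repeats` timed calls, after one untimed warm-up call."""
    func(*args)  # warm-up: the first GPU call may include kernel compilation
    synchronize()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        synchronize()  # make sure the work is really done before stopping the clock
        timings.append(time.perf_counter() - start)
    return min(timings)


small = xp.random.random(1_000)
large = xp.random.random(1_000_000)

print(f"Backend: {xp.__name__}")
print(f"sort, 1,000 elements:     {best_time(xp.sort, small) * 1e3:.3f} ms")
print(f"sort, 1,000,000 elements: {best_time(xp.sort, large) * 1e3:.3f} ms")
```

Running the same helper against NumPy and CuPy at several array sizes shows where the GPU starts to pay off on your machine. CuPy also ships its own timing utility, cupyx.profiler.benchmark, which is worth a look once you outgrow a hand-rolled helper like this one.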

🧪 Hands-On Exercise

🎯 Challenge: Build a GPU-Accelerated Data Analyzer

Create a data analysis system using CuPy:

📋 Requirements:

✅ Load and process CSV data on GPU
🏷️ Calculate statistics (mean, std, percentiles)
👤 Find correlations between columns
📅 Time series analysis with moving averages
🎨 Visualize performance gains!

🚀 Bonus Points:

Add outlier detection
Implement parallel sorting
Create a performance benchmark suite

💡 Solution

# 🎯 GPU-accelerated data analyzer!
import cupy as cp
import numpy as np
import time

class GPUDataAnalyzer:
    def __init__(self):
        self.data = None
        self.stats = {}
        
    # 📊 Load data to GPU
    def load_data(self, data_array):
        print("📤 Loading data to GPU...")
        self.data = cp.asarray(data_array)
        print(f"✅ Loaded {self.data.shape} array!")
        
    # 📈 Calculate statistics
    def calculate_stats(self):
        if self.data is None:
            return
            
        print("🔬 Calculating statistics on GPU...")
        start = time.time()
        
        self.stats = {
            'mean': cp.mean(self.data, axis=0),
            'std': cp.std(self.data, axis=0),
            'min': cp.min(self.data, axis=0),
            'max': cp.max(self.data, axis=0),
            'median': cp.median(self.data, axis=0),
            'percentile_25': cp.percentile(self.data, 25, axis=0),
            'percentile_75': cp.percentile(self.data, 75, axis=0)
        }
        
        cp.cuda.Stream.null.synchronize()  # ⏳ Wait for the GPU before stopping the timer
        gpu_time = time.time() - start
        print(f"⚡ GPU stats calculated in {gpu_time:.3f}s!")
        
        return self.stats
        
    # 🔗 Calculate correlations
    def calculate_correlations(self):
        if self.data is None:
            return
            
        print("🔗 Computing correlation matrix...")
        start = time.time()
        
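        # Standardize data (cp.std defaults to the population std, ddof=0)
        mean = cp.mean(self.data, axis=0)
        std = cp.std(self.data, axis=0)
        standardized = (self.data - mean) / std

        # Compute correlation matrix; dividing by n matches ddof=0,
        # so every diagonal entry comes out as 1
        n = self.data.shape[0]
        corr_matrix = cp.dot(standardized.T, standardized) / n

        cp.cuda.Stream.null.synchronize()  # ⏳ Wait for the GPU before stopping the timer
        gpu_time = time.time() - start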
        print(f"✅ Correlation matrix ({corr_matrix.shape}) computed in {gpu_time:.3f}s!")
        
        return corr_matrix
        
    # 📊 Moving average analysis
    def moving_average(self, window_size=10):
        if self.data is None:
            return
            
        print(f"📈 Computing {window_size}-period moving average...")
        start = time.time()
        
        # Efficient convolution for moving average
        kernel = cp.ones(window_size) / window_size
        ma_results = []
        
        for col in range(self.data.shape[1]):
            ma = cp.convolve(self.data[:, col], kernel, mode='valid')
            ma_results.append(ma)
            
        cp.cuda.Stream.null.synchronize()  # ⏳ Wait for the GPU before stopping the timer
        gpu_time = time.time() - start
        print(f"🚀 Moving averages computed in {gpu_time:.3f}s!")
        
        return cp.stack(ma_results, axis=1)
        
    # 🎯 Detect outliers
    def detect_outliers(self, threshold=3):
        if self.data is None:
            return
            
        print(f"🔍 Detecting outliers (>{threshold} std devs)...")
        
        mean = cp.mean(self.data, axis=0)
        std = cp.std(self.data, axis=0)
        
        # Find outliers
        z_scores = cp.abs((self.data - mean) / std)
        outliers = z_scores > threshold
        outlier_count = cp.sum(outliers, axis=0)
        
        print(f"⚠️ Found {int(cp.sum(outlier_count))} total outliers!")
        return outliers, outlier_count
        
    # 📊 Performance comparison
    def benchmark_vs_cpu(self, cpu_data):
        print("\n🏁 Performance Benchmark: GPU vs CPU")
        print("=" * 50)

        # 📤 Use the same data and the same operations on both sides
        gpu_data = cp.asarray(cpu_data)

        # GPU timing (CuPy)
        gpu_start = time.time()
        cp.mean(gpu_data, axis=0)
        cp.std(gpu_data, axis=0)
        cp.corrcoef(gpu_data.T)
        cp.cuda.Stream.null.synchronize()  # ⏳ Wait for the GPU to finish
        gpu_total = time.time() - gpu_start

        # CPU timing (NumPy)
        cpu_start = time.time()
        np.mean(cpu_data, axis=0)
        np.std(cpu_data, axis=0)
        np.corrcoef(cpu_data.T)
        cpu_total = time.time() - cpu_start

        speedup = cpu_total / gpu_total
        print(f"\n🚀 GPU Total Time: {gpu_total:.3f}s")
        print(f"💻 CPU Total Time: {cpu_total:.3f}s")
        print(f"⚡ GPU Speedup: {speedup:.1f}x")
        print("🎉 GPU wins!" if speedup > 1 else "💻 CPU wins this round - try larger arrays!")

# 🎮 Test it out!
analyzer = GPUDataAnalyzer()

# Generate test data
n_samples, n_features = 1_000_000, 50
test_data = np.random.randn(n_samples, n_features)

# Analyze on GPU
analyzer.load_data(test_data)
stats = analyzer.calculate_stats()
correlations = analyzer.calculate_correlations()
ma = analyzer.moving_average(window_size=20)
outliers, outlier_counts = analyzer.detect_outliers()

# Benchmark
analyzer.benchmark_vs_cpu(test_data[:100_000])  # Smaller CPU sample
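
The first requirement, loading CSV data, is not covered by the solution above, so here is one hedged way to approach it. The CSV text and column names are made-up sample data, and the NumPy fallback is an assumption so the sketch runs without a GPU. The idea is to parse the file on the CPU with NumPy, then move the numeric array to the GPU in a single transfer.

```python
import io

import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device is found
    xp = cp
except Exception:
    xp = np  # assumption: NumPy fallback so the sketch runs without a GPU

# Hypothetical sample data standing in for a real CSV file on disk
csv_text = """temperature,pressure,humidity
20.5,1012.0,45.0
21.0,1011.5,47.5
19.5,1013.0,50.0
22.0,1010.5,42.5
"""

# Parse on the CPU (text parsing is not GPU work), skipping the header row
host_data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# One transfer to the GPU, then all statistics stay on the device
data = xp.asarray(host_data)
column_means = xp.mean(data, axis=0)

print(f"Backend: {xp.__name__}")
print(f"Loaded shape: {data.shape}")
print(f"Column means: {column_means.tolist()}")  # [20.75, 1011.75, 46.25]
```

With a real file you would pass its path to np.loadtxt instead of the StringIO object, and the resulting array can go straight into analyzer.load_data.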

🎓 Key Takeaways

You’ve learned so much! Here’s what you can now do:

✅ Accelerate NumPy code with minimal changes 💪
✅ Process massive datasets at GPU speeds 🛡️
✅ Write custom GPU kernels for specialized operations 🎯
✅ Manage GPU memory efficiently 🐛
✅ Build blazing-fast data processing pipelines! 🚀

Remember: GPUs are incredibly powerful, but they’re not always the answer. Profile your code and use GPUs where they shine - large-scale parallel computations! 🤝

🤝 Next Steps

Congratulations! 🎉 You’ve learned the basics of GPU programming with CuPy!

Here’s what to do next:

💻 Practice with the exercises above
🏗️ Accelerate your existing NumPy projects
📚 Explore CuPy’s advanced features (cuDNN, cuBLAS)
🌟 Share your GPU speedup results with others!

Remember: Every data scientist started with their first GPU array. Keep experimenting, keep optimizing, and most importantly, enjoy the speed! 🚀

Happy GPU coding! 🎉🚀✨
