GPU vs CPU Performance: A Curious Experiment in Speed and Parallelism for Technical Analysis

5 min read

Lately, I’ve been curious: how much faster is a GPU (Nvidia RTX 4060) than a CPU (AMD Ryzen 7) for technical analysis indicators? Do simple operations like EMA, Bollinger Bands, and matrix multiplication really benefit from GPU acceleration? I decided to run a series of benchmarks using NumPy, Numba, and CuPy across various trading-related use cases to find out.

This blog captures my exploration: the good, the bad, and the surprising, complete with benchmarks, bar charts, and practical notes. If you’ve ever wondered whether GPU acceleration is worth it for your analytics pipeline, you’ll find this especially useful.

System Configuration Used for Benchmarking

  • CPU: AMD Ryzen 7, 16GB RAM, Windows 11
  • GPU: Nvidia RTX 4060, CUDA 12.8 installed and detected via CuPy (1 device)
  • Python Tools: NumPy, Numba, CuPy, SciPy, Seaborn

What is NumPy?

NumPy is a Python library for high-performance numerical computing on CPU. It provides fast array operations, vectorized math functions, and broadcasting. Most technical analysis tools and data pipelines in Python use NumPy for handling large arrays and calculations efficiently.
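
For instance, a quick illustrative sketch of the vectorized style that makes NumPy fast (toy arrays, not from the benchmarks below):

import numpy as np

prices = np.random.rand(1_000_000).astype(np.float32)

# One C-level operation over the whole array, instead of a Python loop per element
returns = np.diff(prices) / prices[:-1]

# Broadcasting: subtract a scalar from a million elements in a single call
normalized = prices - prices.mean()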


What is Numba?

Numba is a just-in-time (JIT) compiler for Python, specifically designed to speed up numerical code. It works by compiling Python functions into fast machine code using LLVM. With just a @njit decorator, you can take ordinary Python loops and logic and make them run as fast as C — without rewriting anything in C.

Numba is especially effective for:

  • Loops and recursion (e.g., EMA, ATR, backtests)
  • Parallel execution across CPU cores
  • Speeding up NumPy-style math without using a GPU

If your function is too slow in plain Python or NumPy, try Numba before reaching for CuPy or TensorFlow.
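
As a minimal sketch (rolling_max is a toy function of my own, not one of the benchmarks below), the decorator is all it takes:

import numpy as np
from numba import njit

@njit
def rolling_max(values, window):
    # Ordinary Python loops: slow in CPython, compiled to machine code by Numba
    out = np.empty_like(values)
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out[i] = values[start:i + 1].max()
    return out

data = np.random.rand(1_000_000).astype(np.float32)
rolling_max(data, 20)  # first call triggers JIT compilation; later calls run at native speed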


What is CuPy?

CuPy is a GPU-accelerated library with a NumPy-compatible API. It enables you to offload computations to the GPU by simply switching import numpy as np to import cupy as cp. Behind the scenes, CuPy compiles CUDA kernels to run operations on an NVIDIA GPU — giving huge performance gains for parallelizable tasks.
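
A minimal sketch of that swap, with the explicit host/device transfers the one-line import change tends to hide:

import numpy as np
import cupy as cp

prices_cpu = np.random.rand(1_000_000).astype(np.float32)
prices_gpu = cp.asarray(prices_cpu)        # host -> device transfer

log_returns = cp.diff(cp.log(prices_gpu))  # executes as CUDA kernels on the GPU

result = cp.asnumpy(log_returns)           # device -> host transfer back to NumPy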

What is the CUDA Toolkit?

The CUDA Toolkit is a software development kit from NVIDIA that allows programs to use the GPU for computation. It includes:

  • Runtime libraries used by Python tools like CuPy and PyTorch
  • GPU drivers
  • The nvcc compiler
  • Math libraries like cuBLAS and cuFFT
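
A quick way to sanity-check the toolkit and driver from Python, assuming CuPy is already installed (see Setup below):

import cupy as cp

# Both report versions as integers, e.g. 12080 for CUDA 12.8
print("CUDA runtime:", cp.cuda.runtime.runtimeGetVersion())
print("CUDA driver :", cp.cuda.runtime.driverGetVersion())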


Setup

Installation Instructions

Python packages:

To install the required Python libraries (note that PyTorch’s package on PyPI is named torch, not pytorch):

pip install numpy numba cupy-cuda12x scipy torch seaborn

✅ Install NVIDIA CUDA Toolkit:

After installation, set the CUDA_PATH_V12_8 environment variable if it isn’t auto-configured:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8

Ensure these paths are included in your system PATH:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\libnvvp

Restart your terminal or system after this setup.

Testing CuPy

import torch
print(torch.version.cuda)  # CUDA version PyTorch was built against (may differ from the installed toolkit)

import cupy as cp

# Test array on GPU
x = cp.arange(5)
print("CuPy array:", x)
print("CuPy on GPU:", cp.cuda.runtime.getDeviceCount(), "device(s)")

Output

(venv) PS D:\Python\test> python .\cuda_test.py
12.4
CuPy array: [0 1 2 3 4]
CuPy on GPU: 1 device(s)

Benchmarks and Findings


Source Code

Here’s the exact code I used for all tests. You can replicate it on your system to verify or tweak the benchmarks.

1. 1D EMA Benchmark (Recursive)

import numpy as np
import time
from numba import njit
import cupy as cp

def ema_numpy(values, span):
    alpha = 2 / (span + 1)
    ema = np.zeros_like(values)
    ema[0] = values[0]
    for i in range(1, len(values)):
        ema[i] = alpha * values[i] + (1 - alpha) * ema[i - 1]
    return ema

@njit
def ema_numba(values, span):
    alpha = 2 / (span + 1)
    ema = np.zeros_like(values)
    ema[0] = values[0]
    for i in range(1, len(values)):
        ema[i] = alpha * values[i] + (1 - alpha) * ema[i - 1]
    return ema

def ema_cupy(values, span):
    alpha = 2 / (span + 1)
    ema = cp.zeros_like(values)
    ema[0] = values[0]
    for i in range(1, len(values)):
        ema[i] = alpha * values[i] + (1 - alpha) * ema[i - 1]
    return ema

def benchmark(func, *args, label="Benchmark"):
    start = time.perf_counter()
    result = func(*args)
    if isinstance(result, cp.ndarray):
        cp.cuda.Device(0).synchronize()  # wait for queued GPU work to finish before stopping the clock
    end = time.perf_counter()
    print(f"{label:<25}: {1000*(end - start):.2f} ms")
    return result

if __name__ == "__main__":
    size = 1_000_000
    span = 20

    values_np = np.random.rand(size).astype(np.float32)
    values_cp = cp.asarray(values_np)

    benchmark(ema_numpy, values_np, span, label="NumPy")
    benchmark(ema_numba, values_np, span, label="Numba + NumPy")
    benchmark(ema_cupy, values_cp, span, label="CuPy (GPU)")

Results:

NumPy                    : 1731.09 ms
Numba + NumPy            : 297.69 ms
CuPy (GPU)               : 47802.28 ms

The CuPy number looks absurd, but it makes sense: the recursion forces a Python loop that launches several tiny CUDA kernels for every element, so a million elements means millions of kernel launches, and the launch overhead dwarfs the arithmetic.
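
One caveat on the Numba figure: the first call to an @njit function pays one-time JIT compilation, which is included in the 297 ms above. A minimal sketch (repeating the ema_numba definition from the listing) that separates compile time from steady-state time:

import time
import numpy as np
from numba import njit

@njit
def ema_numba(values, span):
    alpha = 2 / (span + 1)
    ema = np.zeros_like(values)
    ema[0] = values[0]
    for i in range(1, len(values)):
        ema[i] = alpha * values[i] + (1 - alpha) * ema[i - 1]
    return ema

values = np.random.rand(1_000_000).astype(np.float32)

ema_numba(values, 20)  # warm-up call: triggers JIT compilation

start = time.perf_counter()
ema_numba(values, 20)  # steady-state call: pure compiled speed
print(f"Warm Numba: {1000 * (time.perf_counter() - start):.2f} ms")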

2. Batch EMA Benchmark (Parallelized)

import numpy as np
import cupy as cp
import time
from numba import njit, prange

def batch_ema_numpy(values, span):
    alpha = 2 / (span + 1)
    batch, length = values.shape
    output = np.zeros_like(values)
    for b in range(batch):
        output[b, 0] = values[b, 0]
        for i in range(1, length):
            output[b, i] = alpha * values[b, i] + (1 - alpha) * output[b, i - 1]
    return output

@njit(parallel=True)
def batch_ema_numba(values, span):
    alpha = 2 / (span + 1)
    batch, length = values.shape
    output = np.zeros_like(values)
    for b in prange(batch):
        output[b, 0] = values[b, 0]
        for i in range(1, length):
            output[b, i] = alpha * values[b, i] + (1 - alpha) * output[b, i - 1]
    return output

def batch_ema_cupy(values, span):
    alpha = 2 / (span + 1)
    output = cp.zeros_like(values)
    output[:, 0] = values[:, 0]
    for i in range(1, values.shape[1]):
        output[:, i] = alpha * values[:, i] + (1 - alpha) * output[:, i - 1]
    return output

def benchmark(func, *args, label="Benchmark"):
    start = time.perf_counter()
    result = func(*args)
    if isinstance(result, cp.ndarray):
        cp.cuda.Device(0).synchronize()
    end = time.perf_counter()
    print(f"{label:<25}: {1000*(end - start):.2f} ms")
    return result

if __name__ == "__main__":
    batch_size = 1000
    length = 100_000
    span = 20

    data_np = np.random.rand(batch_size, length).astype(np.float32)
    data_cp = cp.asarray(data_np)

    benchmark(batch_ema_numpy, data_np, span, label="NumPy (Looped)")
    benchmark(batch_ema_numba, data_np, span, label="Numba (Parallel)")
    benchmark(batch_ema_cupy, data_cp, span, label="CuPy (GPU)")

Results:

NumPy (Looped)           : 177279.69 ms
Numba (Parallel)         : 1084.78 ms
CuPy (GPU)               : 5341.71 ms

The GPU fares much better here because each time step now processes 1,000 series at once, but the CuPy version still walks the 100,000 time steps in a Python loop (one kernel launch per step), which is why Numba’s parallel CPU loops still win.

3. Bollinger Bands Benchmark (Vectorized)

import numpy as np
import cupy as cp
import time
from scipy.ndimage import uniform_filter1d

def bollinger_numpy_vec(data, window=20, num_std=2):
    mean = uniform_filter1d(data, size=window, axis=1)
    std = np.sqrt(uniform_filter1d((data - mean)**2, size=window, axis=1))
    return mean + num_std * std, mean - num_std * std

def bollinger_cupy_vec(data, window=20, num_std=2):
    kernel = cp.ones(window) / window
    mean = cp.apply_along_axis(lambda x: cp.convolve(x, kernel, mode='same'), axis=1, arr=data)
    std = cp.sqrt(cp.apply_along_axis(lambda x: cp.convolve((x - cp.convolve(x, kernel, mode='same'))**2, kernel, mode='same'), axis=1, arr=data))
    return mean + num_std * std, mean - num_std * std

def benchmark(func, *args, label="Benchmark"):
    start = time.perf_counter()
    result = func(*args)
    if isinstance(result[0], cp.ndarray):
        cp.cuda.Device(0).synchronize()
    end = time.perf_counter()
    print(f"{label:<25}: {1000*(end - start):.2f} ms")
    return result

if __name__ == "__main__":
    rows = 5000
    cols = 50000
    window = 20

    data_np = np.random.rand(rows, cols).astype(np.float32)
    data_cp = cp.asarray(data_np)

    benchmark(bollinger_numpy_vec, data_np, window, label="NumPy (Vectorized)")
    benchmark(bollinger_cupy_vec, data_cp, window, label="CuPy (GPU)")

Results:

NumPy (Vectorized)       : 3780.91 ms
CuPy (GPU)               : 4352.09 ms
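
A likely culprit for the GPU loss: cp.apply_along_axis walks the 5,000 rows in a Python loop, launching kernels row by row. A fully batched sketch, assuming your CuPy build ships cupyx.scipy.ndimage.uniform_filter1d (the GPU port of the SciPy function used in the NumPy version):

import cupy as cp
from cupyx.scipy.ndimage import uniform_filter1d

def bollinger_cupy_batched(data, window=20, num_std=2):
    # One filtered pass over the whole 2D array: no Python-level row loop
    mean = uniform_filter1d(data, size=window, axis=1)
    std = cp.sqrt(uniform_filter1d((data - mean) ** 2, size=window, axis=1))
    return mean + num_std * std, mean - num_std * std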

4. Matrix Multiplication Benchmark

import numpy as np
import cupy as cp
import time

def benchmark(func, label="Benchmark"):
    start = time.perf_counter()
    result = func()
    if isinstance(result, cp.ndarray):
        cp.cuda.Device(0).synchronize()
    end = time.perf_counter()
    print(f"{label:<20}: {1000 * (end - start):.2f} ms")

N = 16384
M = 16384
K = 16384

A_np = np.random.rand(N, M).astype(np.float32)
B_np = np.random.rand(M, K).astype(np.float32)

def cpu_matmul():
    return A_np @ B_np

benchmark(cpu_matmul, label="NumPy (CPU)")

A_cp = cp.asarray(A_np)
B_cp = cp.asarray(B_np)

def gpu_matmul():
    return A_cp @ B_cp

benchmark(gpu_matmul, label="CuPy (GPU)")

Results:

NumPy (CPU)         : 8075.27 ms
CuPy (GPU)          : 1760.24 ms

Finally a clear GPU win: a 16384×16384 float32 matmul is exactly the kind of massively parallel, compute-bound workload cuBLAS is built for. Note that the timing excludes host-to-device transfers, since both arrays were copied to the GPU before the benchmark.

Reflections

This wasn’t just about performance; it was about curiosity: seeing how parallel computing helps (or doesn’t) with basic technical analysis. Here’s what I learned:

Final Takeaways

  • Numba is wildly efficient for recursive or loop-based operations
  • CuPy (GPU) only wins big when the computation is highly parallel (e.g., matrix math)
  • Vectorized NumPy is still king for most simple TA indicators
  • GPU has overhead, so don’t expect it to magically speed up everything (see the sketch below)
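
To make that overhead concrete, here is a minimal sketch (using an 8192×8192 matrix, smaller than benchmark 4 to keep it quick) that times the host-to-device copy separately from the compute:

import time
import numpy as np
import cupy as cp

A_np = np.random.rand(8192, 8192).astype(np.float32)

start = time.perf_counter()
A_cp = cp.asarray(A_np)  # host -> device copy over PCIe
cp.cuda.Device(0).synchronize()
print(f"Transfer: {1000 * (time.perf_counter() - start):.2f} ms")

start = time.perf_counter()
C = A_cp @ A_cp  # cuBLAS matmul (first call in a process also pays one-time cuBLAS init)
cp.cuda.Device(0).synchronize()
print(f"Matmul  : {1000 * (time.perf_counter() - start):.2f} ms")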

Next Steps

If you’re building indicators or trading backtests:

  • Use Numba for recursive indicators
  • Stick with NumPy if your data fits in memory and is vectorized
  • Leverage CuPy or PyTorch only for GPU-intensive matrix ops or model inference

This was a fun experiment—I learned a lot about performance bottlenecks, memory overhead, and what’s truly parallel.

And next? Maybe test multi-GPU workloads on deep learning-based signal generation 😉


Rajandran R Creator of OpenAlgo - OpenSource Algo Trading framework for Indian Traders. Building GenAI Applications. Telecom Engineer turned Full-time Derivative Trader. Mostly Trading Nifty, Banknifty, Highly Liquid Stock Derivatives. Trading the Markets since 2006. Using Market Profile and Orderflow for more than a decade. Designed and published 100+ open source trading systems on various trading tools. Strongly believe that market understanding and robust trading frameworks are the key to trading success. Building Algo Platforms, Writing about Markets, Trading System Design, Market Sentiment, Trading Softwares & Trading Nuances since 2007. Author of Marketcalls.in
