Lately, I’ve been curious: how much faster is a GPU (NVIDIA RTX 4060) than a CPU (AMD Ryzen 7) for technical analysis indicators? Do operations like EMA, Bollinger Bands, and matrix multiplication really benefit from GPU acceleration? To find out, I ran a series of benchmarks using NumPy, Numba, and CuPy across several trading-related use cases.

This blog captures my exploration: the good, the bad, and the surprising, complete with benchmarks, bar charts, and practical notes. If you’ve ever wondered whether GPU acceleration is worth it for your analytics pipeline, you’ll find this especially useful.
System Configuration used for Benchmarking
- CPU: AMD Ryzen 7, 16GB RAM, Windows 11
- GPU: NVIDIA RTX 4060, CUDA 12.8 installed and detected via CuPy (1 device)
- Python Tools: NumPy, Numba, CuPy, SciPy, Seaborn

What is NumPy?
NumPy is a Python library for high-performance numerical computing on the CPU. It provides fast array operations, vectorized math functions, and broadcasting. Most technical analysis tools and data pipelines in Python use NumPy to handle large arrays and calculations efficiently.
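To see what vectorization buys you, here is a tiny illustration (my own toy example, not part of the benchmarks): the same computation as a single array expression versus a pure Python loop.

import numpy as np

prices = np.random.rand(1_000_000).astype(np.float32)

# Vectorized: one expression computes all 999,999 returns in optimized C
returns = prices[1:] / prices[:-1] - 1

# Equivalent pure-Python loop: same result, orders of magnitude slower
loop_returns = [prices[i + 1] / prices[i] - 1 for i in range(len(prices) - 1)]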
What is Numba?
Numba is a just-in-time (JIT) compiler for Python, specifically designed to speed up numerical code. It works by compiling Python functions into fast machine code using LLVM. With just an `@njit` decorator, you can take ordinary Python loops and logic and make them run at close to C speed, without rewriting anything in C.
Numba is especially effective for:
- Loops and recursion (e.g., EMA, ATR, backtests)
- Parallel execution across CPU cores
- Speeding up NumPy-style math without using a GPU
If your function is too slow in plain Python or NumPy, try Numba before reaching for CuPy or TensorFlow.
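Here is a minimal sketch of what that looks like (a toy running-maximum indicator of my own, not one of the benchmarks below):

import numpy as np
from numba import njit

@njit
def running_max(values):
    # Ordinary Python loop; Numba compiles it to machine code on first call
    out = np.empty_like(values)
    current = values[0]
    for i in range(len(values)):
        if values[i] > current:
            current = values[i]
        out[i] = current
    return out

data = np.random.rand(1_000_000).astype(np.float32)
running_max(data)           # first call includes JIT compilation
result = running_max(data)  # subsequent calls run at compiled speed

Keep in mind that the first call pays a one-time compilation cost; the benchmarks below time the first call, so their Numba numbers include it.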
What is CuPy?
CuPy is a GPU-accelerated library with a NumPy-compatible API. It enables you to offload computations to the GPU by simply switching `import numpy as np` to `import cupy as cp`. Behind the scenes, CuPy compiles CUDA kernels to run operations on an NVIDIA GPU, giving huge performance gains for parallelizable tasks.
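The drop-in nature is easy to demonstrate (again, a minimal sketch of my own):

import numpy as np
import cupy as cp

x_cpu = np.random.rand(1_000_000).astype(np.float32)
x_gpu = cp.asarray(x_cpu)       # copy the host array into GPU memory

y_gpu = cp.sqrt(x_gpu) * 2 + 1  # executes as CUDA kernels on the GPU
y_cpu = cp.asnumpy(y_gpu)       # copy the result back to the host

The transfers in and out of GPU memory are part of the cost, which is one reason small or memory-bound workloads often don’t benefit.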
What is the CUDA Toolkit?
The CUDA Toolkit is a software development kit from NVIDIA that allows programs to use the GPU for computation. It includes:
- Runtime libraries used by Python tools like CuPy and PyTorch
- GPU drivers
- A compiler (`nvcc`)
- Math libraries such as cuBLAS and cuFFT
Setup
Installation Instructions
Python packages:
To install the required Python libraries:
pip install numpy numba cupy-cuda12x scipy torch seaborn
✅ Install NVIDIA CUDA Toolkit:
- Official link: https://developer.nvidia.com/cuda-downloads
- Choose your OS (Windows), architecture, version, and installer type
- Follow on-screen instructions to install

After installation, set the CUDA_PATH_V12_8 environment variable if it isn’t auto-configured:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8

Ensure these paths are included in your system PATH:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\libnvvp

Restart your terminal or system after this setup.
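To confirm the variables are visible to Python, here is a quick check (assuming the v12.8 installer set CUDA_PATH_V12_8 as above):

import os

# Both should print a path under the NVIDIA GPU Computing Toolkit directory
print(os.environ.get("CUDA_PATH"))
print(os.environ.get("CUDA_PATH_V12_8"))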
Testing CuPy
import torch
import cupy as cp

# CUDA version the installed PyTorch wheel was built with
print(torch.version.cuda)

# Test array on GPU
x = cp.arange(5)
print("CuPy array:", x)
print("CuPy on GPU:", cp.cuda.runtime.getDeviceCount(), "device(s)")
Output
(venv) PS D:\Python\test> python .\cuda_test.py
12.4
CuPy array: [0 1 2 3 4]
CuPy on GPU: 1 device(s)

The 12.4 reported by torch.version.cuda is the CUDA version the PyTorch wheel was built against; it can differ from the installed 12.8 toolkit because pip wheels bundle their own CUDA runtime.
Benchmarks and Findings
Source Code
Here’s the exact code I used for all tests. You can replicate it on your system to verify or tweak the benchmarks.
1. 1D EMA Benchmark (Recursive)
import numpy as np
import time
from numba import njit
import cupy as cp

def ema_numpy(values, span):
    alpha = 2 / (span + 1)
    ema = np.zeros_like(values)
    ema[0] = values[0]
    for i in range(1, len(values)):
        ema[i] = alpha * values[i] + (1 - alpha) * ema[i - 1]
    return ema

@njit
def ema_numba(values, span):
    # Identical loop; Numba compiles it to machine code
    alpha = 2 / (span + 1)
    ema = np.zeros_like(values)
    ema[0] = values[0]
    for i in range(1, len(values)):
        ema[i] = alpha * values[i] + (1 - alpha) * ema[i - 1]
    return ema

def ema_cupy(values, span):
    # Element-by-element loop over GPU arrays: every step launches kernels
    alpha = 2 / (span + 1)
    ema = cp.zeros_like(values)
    ema[0] = values[0]
    for i in range(1, len(values)):
        ema[i] = alpha * values[i] + (1 - alpha) * ema[i - 1]
    return ema

def benchmark(func, *args, label="Benchmark"):
    start = time.perf_counter()
    result = func(*args)
    if isinstance(result, cp.ndarray):
        cp.cuda.Device(0).synchronize()  # wait for GPU work before stopping the clock
    end = time.perf_counter()
    print(f"{label:<25}: {1000 * (end - start):.2f} ms")
    return result

if __name__ == "__main__":
    size = 1_000_000
    span = 20
    values_np = np.random.rand(size).astype(np.float32)
    values_cp = cp.asarray(values_np)
    benchmark(ema_numpy, values_np, span, label="NumPy")
    benchmark(ema_numba, values_np, span, label="Numba + NumPy")
    benchmark(ema_cupy, values_cp, span, label="CuPy (GPU)")
Results:
NumPy                    : 1731.09 ms
Numba + NumPy            : 297.69 ms
CuPy (GPU)               : 47802.28 ms

The CuPy number is not a typo. The EMA recurrence is inherently sequential, so the loop issues roughly a million tiny GPU operations one after another, and kernel-launch overhead completely swamps the arithmetic.
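As an aside: if you want to time only the GPU work, CuPy exposes CUDA events, which avoid counting Python-side overhead. A minimal sketch of that pattern (I stuck with the simpler perf_counter approach above):

import cupy as cp

x = cp.random.rand(1_000_000, dtype=cp.float32)

start = cp.cuda.Event()
end = cp.cuda.Event()

start.record()
y = cp.sqrt(x) * 2 + 1  # the GPU work being timed
end.record()
end.synchronize()

print(f"GPU time: {cp.cuda.get_elapsed_time(start, end):.2f} ms")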

2. Batch EMA Benchmark (Parallelized)
import numpy as np
import cupy as cp
import time
from numba import njit, prange

def batch_ema_numpy(values, span):
    alpha = 2 / (span + 1)
    batch, length = values.shape
    output = np.zeros_like(values)
    for b in range(batch):
        output[b, 0] = values[b, 0]
        for i in range(1, length):
            output[b, i] = alpha * values[b, i] + (1 - alpha) * output[b, i - 1]
    return output

@njit(parallel=True)
def batch_ema_numba(values, span):
    alpha = 2 / (span + 1)
    batch, length = values.shape
    output = np.zeros_like(values)
    for b in prange(batch):  # rows are independent, so Numba spreads them across cores
        output[b, 0] = values[b, 0]
        for i in range(1, length):
            output[b, i] = alpha * values[b, i] + (1 - alpha) * output[b, i - 1]
    return output

def batch_ema_cupy(values, span):
    alpha = 2 / (span + 1)
    output = cp.zeros_like(values)
    output[:, 0] = values[:, 0]
    for i in range(1, values.shape[1]):
        # Still one loop iteration per time step, but each step updates all rows at once
        output[:, i] = alpha * values[:, i] + (1 - alpha) * output[:, i - 1]
    return output

def benchmark(func, *args, label="Benchmark"):
    start = time.perf_counter()
    result = func(*args)
    if isinstance(result, cp.ndarray):
        cp.cuda.Device(0).synchronize()
    end = time.perf_counter()
    print(f"{label:<25}: {1000 * (end - start):.2f} ms")
    return result

if __name__ == "__main__":
    batch_size = 1000
    length = 100_000
    span = 20
    data_np = np.random.rand(batch_size, length).astype(np.float32)
    data_cp = cp.asarray(data_np)
    benchmark(batch_ema_numpy, data_np, span, label="NumPy (Looped)")
    benchmark(batch_ema_numba, data_np, span, label="Numba (Parallel)")
    benchmark(batch_ema_cupy, data_cp, span, label="CuPy (GPU)")
Results:
NumPy (Looped)           : 177279.69 ms
Numba (Parallel)         : 1084.78 ms
CuPy (GPU)               : 5341.71 ms

Batching transforms the GPU result (47.8 s down to 5.3 s) because each time step now updates 1,000 independent series at once. The loop over 100,000 time steps is still sequential, though, so parallel Numba on the CPU remains roughly 5x faster.

3. Bollinger Bands Benchmark (Vectorized)
import numpy as np
import cupy as cp
import time
from scipy.ndimage import uniform_filter1d

def bollinger_numpy_vec(data, window=20, num_std=2):
    # Rolling mean and rolling std via a uniform (moving-average) filter
    mean = uniform_filter1d(data, size=window, axis=1)
    std = np.sqrt(uniform_filter1d((data - mean) ** 2, size=window, axis=1))
    return mean + num_std * std, mean - num_std * std

def bollinger_cupy_vec(data, window=20, num_std=2):
    kernel = cp.ones(window) / window
    mean = cp.apply_along_axis(lambda x: cp.convolve(x, kernel, mode='same'), axis=1, arr=data)
    std = cp.sqrt(cp.apply_along_axis(
        lambda x: cp.convolve((x - cp.convolve(x, kernel, mode='same')) ** 2, kernel, mode='same'),
        axis=1, arr=data))
    return mean + num_std * std, mean - num_std * std

def benchmark(func, *args, label="Benchmark"):
    start = time.perf_counter()
    result = func(*args)
    if isinstance(result[0], cp.ndarray):
        cp.cuda.Device(0).synchronize()
    end = time.perf_counter()
    print(f"{label:<25}: {1000 * (end - start):.2f} ms")
    return result

if __name__ == "__main__":
    rows = 5000
    cols = 50000
    window = 20
    data_np = np.random.rand(rows, cols).astype(np.float32)
    data_cp = cp.asarray(data_np)
    benchmark(bollinger_numpy_vec, data_np, window, label="NumPy (Vectorized)")
    benchmark(bollinger_cupy_vec, data_cp, window, label="CuPy (GPU)")
Results:
NumPy (Vectorized)       : 3780.91 ms
CuPy (GPU)               : 4352.09 ms

Here the GPU actually loses. cp.apply_along_axis processes one row at a time from Python, so the 5,000 rows each trigger several kernel launches, and that overhead outweighs the GPU’s raw throughput.
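A more direct GPU port would mirror the SciPy version using CuPy’s SciPy-compatible routines under cupyx.scipy. A sketch of what that would look like (assuming your CuPy build ships cupyx.scipy.ndimage; I haven’t included it in the timings above):

import cupy as cp
from cupyx.scipy.ndimage import uniform_filter1d

def bollinger_cupy_ndimage(data, window=20, num_std=2):
    # Same rolling-mean construction as the SciPy version, but on GPU arrays
    mean = uniform_filter1d(data, size=window, axis=1)
    std = cp.sqrt(uniform_filter1d((data - mean) ** 2, size=window, axis=1))
    return mean + num_std * std, mean - num_std * std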

4. Matrix Multiplication Benchmark
import numpy as np
import cupy as cp
import time

def benchmark(func, label="Benchmark"):
    start = time.perf_counter()
    result = func()
    if isinstance(result, cp.ndarray):
        cp.cuda.Device(0).synchronize()
    end = time.perf_counter()
    print(f"{label:<20}: {1000 * (end - start):.2f} ms")

N = 16384
M = 16384
K = 16384

A_np = np.random.rand(N, M).astype(np.float32)
B_np = np.random.rand(M, K).astype(np.float32)

def cpu_matmul():
    return A_np @ B_np  # dispatches to the BLAS library NumPy is linked against

benchmark(cpu_matmul, label="NumPy (CPU)")

A_cp = cp.asarray(A_np)
B_cp = cp.asarray(B_np)

def gpu_matmul():
    return A_cp @ B_cp  # dispatches to cuBLAS on the GPU

benchmark(gpu_matmul, label="CuPy (GPU)")
Results:
NumPy (CPU)         : 8075.27 ms
CuPy (GPU)          : 1760.24 ms

This is the workload GPUs are built for: one huge matrix product with no sequential dependency. cuBLAS keeps thousands of cores busy and comes out about 4.6x ahead of the CPU.

Reflections
This wasn’t just about performance. It was about curiosity: seeing how parallel computing helps (or doesn’t) with basic technical analysis. Here’s what I learned:
Final Takeaways
- Numba is wildly efficient for recursive or loop-based operations
- CuPy (GPU) only wins big when the computation is highly parallel (e.g., matrix math)
- Vectorized NumPy is still king for most simple TA indicators
- GPU has overhead, so don’t expect it to magically speed up everything
Next Steps
If you’re building indicators or trading backtests:
- Use Numba for recursive indicators
- Stick with NumPy if your data fits in memory and is vectorized
- Leverage CuPy or PyTorch only for GPU-intensive matrix ops or model inference
This was a fun experiment—I learned a lot about performance bottlenecks, memory overhead, and what’s truly parallel.
And next? Maybe test multi-GPU workloads on deep learning-based signal generation 😉