
How to Speed Up a 1 Billion Iterations Loop in Python? Learn 5 Effective Methods



Python is a versatile and user-friendly programming language, but it’s often criticized for being slow compared to compiled languages like C or C++. A recent LinkedIn post highlighted that Python can be 150x slower than C/C++ for certain tasks, especially when dealing with large loops.


In this blog post, we’ll explore five methods to optimize a nested loop that runs 1 billion iterations in Python. We’ll provide code examples, performance benchmarks, and a short analysis of each method to help you approach performance that’s competitive with C/C++.

Baseline: Pure Python Implementation

Before optimizing, let’s establish a baseline using standard Python (CPython): a simple nested loop (10,000 outer × 100,000 inner = 1 billion iterations) that calls a small function on every iteration.

Code Example

# pure_python.py
import time

def process(value):
    return value * 2

def run_loop():
    result = 0
    outer_iterations = 10000
    inner_iterations = 100000  # 10000 * 100000 = 1 billion total iterations
    
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            result = process(j)
            
        if i % 100 == 0:  # Print progress every 1% (every 100 outer iterations)
            total_iterations = (i + 1) * inner_iterations
            # print(f"Processed {total_iterations:,} iterations")
    
    return result

if __name__ == "__main__":
    print("Starting 1 billion iterations with nested loops...")
    start_time = time.time()
    final_result = run_loop()
    end_time = time.time()
    
    print(f"\nFinal result: {final_result}")
    print(f"Total iterations: {10000 * 100000:,} (1 billion)")
    print(f"Execution completed in {end_time - start_time:.2f} seconds")

Performance

Execution Time: Approximately 42.72 seconds

$ python3 pure_python.py

Starting 1 billion iterations with nested loops...

Final result: 199998
Total iterations: 1,000,000,000 (1 billion)
Execution completed in 42.72 seconds

Analysis

Observation: The loop takes roughly 42.7 seconds to complete.

Limitation: The overhead of the Python interpreter, bytecode dispatch, and a function call on every iteration slows down execution.
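
You can measure that per-call overhead in isolation with the standard library’s timeit module. A small sketch (the exact numbers will vary by machine):

# A small sketch: isolating function-call overhead with the timeit module
import timeit

with_call = timeit.timeit("process(x)",
                          setup="def process(v): return v * 2\nx = 5",
                          number=1_000_000)
inline = timeit.timeit("x * 2", setup="x = 5", number=1_000_000)
print(f"1,000,000 iterations -- with call: {with_call:.4f}s, inline: {inline:.4f}s")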


Method 1: Inline Computations

An immediate optimization is to inline the computations and eliminate unnecessary function calls.

Code Example

# inline_computation.py
import time

def run_loop():
    result = 0
    outer_iterations = 10000
    inner_iterations = 100000  # 10000 * 100000 = 1 billion total iterations
    
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            result = j * 2  # Direct computation
            
        if i % 100 == 0:  # Print progress every 1% (every 100 outer iterations)
            total_iterations = (i + 1) * inner_iterations
            # print(f"Processed {total_iterations:,} iterations")
    
    return result

if __name__ == "__main__":
    print("Starting 1 billion iterations with nested loops...")
    start_time = time.time()
    final_result = run_loop()
    end_time = time.time()
    
    print(f"\nFinal result: {final_result}")
    print(f"Total iterations: {10000 * 100000:,} (1 billion)")
    print(f"Execution completed in {end_time - start_time:.2f} seconds")

Performance

Execution Time: Approximately 25.19 seconds

$ python3 inline_computation.py
Starting 1 billion iterations with nested loops...

Final result: 199998
Total iterations: 1,000,000,000 (1 billion)
Execution completed in 25.19 seconds

Analysis

Improvement: Eliminating function calls reduces overhead.

Limitation: Still constrained by the Python interpreter’s execution speed.
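
To see where the remaining cost goes, the standard dis module shows the bytecode the interpreter executes per iteration; a short sketch (opcode names vary slightly across Python versions):

# A short sketch: comparing per-iteration bytecode with the dis module
import dis

def with_call(j, process):
    return process(j)  # needs an extra call opcode on every iteration

def inline(j):
    return j * 2       # a single binary-multiply opcode

dis.dis(with_call)
dis.dis(inline)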


Method 2: Using PyPy


PyPy is an alternative Python interpreter with a built-in Just-In-Time (JIT) compiler that significantly speeds up Python code execution. It is highly compatible with the standard CPython interpreter, allowing most Python programs to run without modification. PyPy accelerates performance by dynamically compiling Python code into optimized machine code at runtime.

Installing PyPy


On macOS (using Homebrew):

brew install pypy3


Verify Installation:

pypy3 --version
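
When benchmarking across interpreters, it also helps to log which implementation is actually running. A minimal sketch using the standard library (the filename is just illustrative):

# interpreter_check.py -- a minimal sketch for logging benchmark context
import platform
import sys

print(f"Implementation: {platform.python_implementation()}")  # 'CPython' or 'PyPy'
print(f"Version: {sys.version}")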

Use the same code as in the Inline Computations method (inline_computation.py), but run it with PyPy:

$ pypy3 inline_computation.py
Starting 1 billion iterations with nested loops...

Final result: 199998
Total iterations: 1,000,000,000 (1 billion)
Execution completed in 0.80 seconds

Analysis

Improvement: Drastic reduction in execution time due to PyPy’s JIT compiler optimizing the loop at runtime.

Benefit: No code changes required.

Limitation: Some libraries may not be fully compatible with PyPy.


Method 3: Multiprocessing

Utilize multiple CPU cores by splitting the workload across processes using the multiprocessing module.

Code Example

# multiprocessing_example.py
import time
from multiprocessing import Pool, cpu_count

def process_chunk(args):
    start_chunk, end_chunk = args
    result = 0
    inner_iterations = 100000
    
    for i in range(start_chunk, end_chunk):
        for j in range(inner_iterations):
            result = j * 2
            
        # Progress printing is omitted here: interleaved output from multiple
        # processes is noisy and the extra I/O would skew the benchmark
    
    return result

if __name__ == "__main__":
    print("Starting 1 billion iterations with nested loops and multiprocessing...")
    start_time = time.time()
    
    num_processes = cpu_count()
    outer_iterations = 10000
    chunk_size = outer_iterations // num_processes
    
    # Create ranges for each process
    ranges = [(i * chunk_size, (i + 1) * chunk_size) for i in range(num_processes)]
    
    # Adjust the last chunk to include any remaining iterations
    if ranges:
        last_end = ranges[-1][1]
        if last_end < outer_iterations:
            ranges[-1] = (ranges[-1][0], outer_iterations)
    
    print(f"Using {num_processes} processes")
    print(f"Each process will handle {chunk_size} outer loop iterations")
    
    with Pool(processes=num_processes) as pool:
        results = pool.map(process_chunk, ranges)
    
    end_time = time.time()
    print(f"\nTotal iterations: {outer_iterations * 100000:,} (1 billion)")
    print(f"Execution completed in {end_time - start_time:.2f} seconds")

Performance

Execution Time: Approximately 6.47 seconds

$ python3 multiprocessing_example.py
Starting 1 billion iterations with nested loops and multiprocessing...
Using 8 processes
Each process will handle 1250 outer loop iterations

Total iterations: 1,000,000,000 (1 billion)
Execution completed in 6.47 seconds

Analysis

Improvement: Parallel execution utilizes all CPU cores.

Limitation: Overhead from process management and inter-process communication.
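
The same chunked pattern can also be written with the standard concurrent.futures API, which some codebases prefer for its futures-based interface. A minimal sketch under the same workload assumptions:

# A minimal sketch of the chunked workload using ProcessPoolExecutor
import time
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count

def process_chunk(chunk):
    start_chunk, end_chunk = chunk
    result = 0
    for i in range(start_chunk, end_chunk):
        for j in range(100000):
            result = j * 2
    return result

if __name__ == "__main__":
    num_processes = cpu_count()
    chunk_size = 10000 // num_processes  # assumes an even split across cores
    ranges = [(i * chunk_size, (i + 1) * chunk_size) for i in range(num_processes)]
    
    start = time.time()
    with ProcessPoolExecutor(max_workers=num_processes) as executor:
        results = list(executor.map(process_chunk, ranges))
    print(f"Completed in {time.time() - start:.2f} seconds")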


Method 4: Using Numba


Numba is a Just-In-Time (JIT) compiler for Python that translates numerical Python code into fast machine code using the LLVM compiler infrastructure. By adding simple decorators to your functions, Numba can optimize and accelerate loops and array computations. It achieves performance close to compiled languages like C or Fortran, making it ideal for scientific computing.

Installing Numba

pip install numba


Code Example

# numba_example.py
import time
from numba import njit

@njit
def run_loop():
    result = 0
    outer_iterations = 10000
    inner_iterations = 100000  # 10000 * 100000 = 1 billion total iterations
    
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            result = j * 2
            
        # Note: f-string formatting is not supported inside njit functions,
        # so progress printing is omitted here
    
    return result

if __name__ == "__main__":
    print("Starting 1 billion iterations with nested loops using Numba...")
    print("First run includes compilation time...")
    
    # First run (includes compilation)
    start_time = time.time()
    final_result = run_loop()
    end_time = time.time()
    compilation_time = end_time - start_time
    
    print(f"\nFirst run (with compilation):")
    print(f"Total iterations: {10000 * 100000:,} (1 billion)")
    print(f"Execution completed in {compilation_time:.2f} seconds")
    
    # Second run (pre-compiled)
    print("\nSecond run (pre-compiled)...")
    start_time = time.time()
    final_result = run_loop()
    end_time = time.time()
    execution_time = end_time - start_time
    
    print(f"\nSecond run (pre-compiled):")
    print(f"Total iterations: {10000 * 100000:,} (1 billion)")
    print(f"Execution completed in {execution_time:.2f} seconds")

Performance

Execution Time: Approximately 0.20 seconds for the first run (including JIT compilation); the pre-compiled second run registers 0.00 seconds.

$ python3 numba_example.py
Starting 1 billion iterations with nested loops using Numba...
First run includes compilation time...

First run (with compilation):
Total iterations: 1,000,000,000 (1 billion)
Execution completed in 0.20 seconds

Second run (pre-compiled)...

Second run (pre-compiled):
Total iterations: 1,000,000,000 (1 billion)
Execution completed in 0.00 seconds

Analysis

Improvement: Numba compiles the loop into optimized machine code. In fact, because the returned value here is a compile-time constant (the last value of j * 2), LLVM can optimize the loop away almost entirely, which is why the pre-compiled run reports near-zero time.

Benefit: Minimal code changes required.

Limitation: Numba supports a subset of Python; complex code may not be compatible.
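
Numba can also parallelize a loop across CPU cores with prange, combining JIT compilation with multiprocessing-style parallelism in a single decorator. A minimal sketch (this one accumulates a sum so the compiler cannot discard the loop):

# numba_parallel_example.py -- a minimal sketch combining @njit with prange
import time
from numba import njit, prange

@njit(parallel=True)
def run_loop_parallel():
    total = 0
    for i in prange(10000):          # outer loop is split across CPU cores
        partial = 0
        for j in range(100000):
            partial += j * 2
        total += partial             # Numba treats this as a parallel reduction
    return total

if __name__ == "__main__":
    run_loop_parallel()              # warm-up run triggers JIT compilation
    start = time.time()
    result = run_loop_parallel()
    print(f"Result: {result}, time: {time.time() - start:.4f} seconds")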


Method 5: Using Cython


Cython is a superset of the Python language that allows you to add static type declarations to your Python code. By compiling this code into C extensions, Cython significantly boosts execution speed, especially for computationally intensive tasks. It enables you to write code that runs as fast as C while maintaining Python’s syntax and ease of use.

Installing Cython

pip install cython

Code Example

setup.py

# setup.py

from setuptools import setup, Extension
from Cython.Build import cythonize

extensions = [
    Extension(
        "test_cython",
        ["test_cython.pyx"],
        extra_compile_args=["-O3"],
    )
]

setup(
    ext_modules=cythonize(
        extensions,
        compiler_directives={
            'boundscheck': False,
            'wraparound': False,
            'nonecheck': False,
            'cdivision': True,
        }
    )
)


test_cython.pyx

# cython: language_level=3
# cython: boundscheck=False
# cython: wraparound=False
# cython: nonecheck=False
# cython: cdivision=True

def billion_loop():
    cdef long long i, j
    cdef long long sum = 0
    cdef long long outer_iterations = 10000
    cdef long long inner_iterations = 100000  # 10000 * 100000 = 1 billion iterations
    
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            sum += j
            
        # Progress printing is omitted so the hot loop compiles to pure C
        # (per-iteration printing would dominate the runtime shown below)
    
    return sum


Compile the Cython Code

python setup.py build_ext --inplace


Driver Script – cython_driver.py

#cython_driver.py
import time
import test_cython

def python_billion_loop():
    sum = 0
    outer_iterations = 10000
    inner_iterations = 100000  # 10000 * 100000 = 1 billion iterations
    
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            sum += j
            
        # Progress printing omitted to keep the comparison fair
        # if i % 100 == 0:
        #     print(f"Processed {(i + 1) * inner_iterations:,} iterations")
    
    return sum

def main():
    print("Starting 1 billion iterations benchmark...")
    print("\nRunning Cython implementation...")
    start_time = time.time()
    cython_result = test_cython.billion_loop()
    cython_time = time.time() - start_time
    print(f"Cython Result: {cython_result}")
    print(f"Cython Time: {cython_time:.2f} seconds")

    print("\nRunning Pure Python implementation...")
    start_time = time.time()
    python_result = python_billion_loop()
    python_time = time.time() - start_time
    print(f"Python Result: {python_result}")
    print(f"Python Time: {python_time:.2f} seconds")

    print(f"\nTotal iterations: {10000 * 100000:,} (1 billion)")
    print(f"Speedup: {python_time/cython_time:.2f}x")

if __name__ == "__main__":
    main()

Performance

Execution Time: Approximately 0.00 seconds (below the timer’s displayed resolution)

$ python3 cython_driver.py
Starting 1 billion iterations benchmark...

Running Cython implementation...
Cython Result: 49999500000000
Cython Time: 0.00 seconds

Analysis

Improvement: Cython compiles the loop to a C extension, eliminating Python interpreter overhead.

Benefit: Offers performance comparable to compiled languages.

Limitation: Requires additional setup and code complexity.
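
For quick experiments, the setup.py step can be skipped: Cython ships a pyximport helper that compiles .pyx modules on first import. A sketch (fine for prototyping, not for distribution):

# A prototyping sketch: pyximport compiles test_cython.pyx on first import
import pyximport
pyximport.install(language_level=3)

import test_cython          # compiled automatically on import
print(test_cython.billion_loop())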


Benchmark All Methods

The script below runs every implementation in a single process and times each with time.perf_counter. PyPy is not included, since it must be launched as a separate interpreter.

# benchmark_all.py
import time
import test_cython
from multiprocessing import Pool, cpu_count
from numba import njit

# Pure Python Implementation (with a function call per iteration)
def process(value):
    return value * 2

def pure_python_loop():
    result = 0
    outer_iterations = 10000
    inner_iterations = 100000
    
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            result = process(j)
    return result

# Inline Computation Implementation
def inline_loop():
    result = 0
    outer_iterations = 10000
    inner_iterations = 100000
    
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            result = j * 2  # Direct computation, no function call
    return result

# Multiprocessing Implementation
def process_chunk(args):
    start_chunk, end_chunk = args
    result = 0
    inner_iterations = 100000
    
    for i in range(start_chunk, end_chunk):
        for j in range(inner_iterations):
            result = j * 2
    return result

def multiprocessing_loop():
    num_processes = cpu_count()
    outer_iterations = 10000
    # Assumes an even split; see Method 3 for the remainder adjustment
    chunk_size = outer_iterations // num_processes
    ranges = [(i * chunk_size, (i + 1) * chunk_size) for i in range(num_processes)]
    
    with Pool(processes=num_processes) as pool:
        results = pool.map(process_chunk, ranges)
    return results[-1]  # Every chunk computes the same final value

# Numba Implementation
@njit
def numba_loop():
    result = 0
    outer_iterations = 10000
    inner_iterations = 100000
    
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            result = j * 2
    return result

def run_benchmark(name, func):
    print(f"\nRunning {name}...")
    start_time = time.perf_counter()  # Using perf_counter for more precise timing
    result = func()
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"{name} Result: {result}")
    print(f"{name} Time: {execution_time:.6f} seconds")  # Showing more decimal places
    return execution_time

if __name__ == "__main__":
    print("Starting Comprehensive Benchmark")
    print("==============================")
    print(f"Total iterations per implementation: {10000 * 100000:,} (1 billion)")
    
    times = {}
    
    # Pure Python
    times['Pure Python'] = run_benchmark('Pure Python', pure_python_loop)
    
    # Inline Computation
    times['Inline Computation'] = run_benchmark('Inline Computation', inline_loop)
    
    # Multiprocessing
    times['Multiprocessing'] = run_benchmark('Multiprocessing', multiprocessing_loop)
    
    # Numba (run twice to show both compilation and optimized times)
    times['Numba (First Run)'] = run_benchmark('Numba (First Run)', numba_loop)
    times['Numba (Second Run)'] = run_benchmark('Numba (Second Run)', numba_loop)
    
    # Cython (note: the compiled extension sums j instead of doubling it,
    # so its result differs from the other implementations)
    times['Cython'] = run_benchmark('Cython', test_cython.billion_loop)
    
    # Print comparison
    print("\nPerformance Comparison")
    print("====================")
    fastest_time = min(times.values())
    
    # Ensure we don't divide by zero by using a small epsilon
    epsilon = 1e-10
    fastest_time = max(fastest_time, epsilon)
    
    print("\nResults sorted by execution time (fastest to slowest):")
    print("----------------------------------------------------")
    for implementation, execution_time in sorted(times.items(), key=lambda x: x[1]):
        speedup = execution_time / fastest_time
        print(f"{implementation:25} : {execution_time:.6f} seconds (Speedup vs fastest: {speedup:.2f}x)")

Output

$ python benchmark_all.py
Starting Comprehensive Benchmark
==============================
Total iterations per implementation: 1,000,000,000 (1 billion)
Each implementation multiplies each number by 2

Running Pure Python (with function)...
Pure Python (with function) Result: 199998
Pure Python (with function) Time: 42.123598 seconds

Running Inline Computation...
Inline Computation Result: 199998
Inline Computation Time: 27.139908 seconds

Running Multiprocessing...
Multiprocessing Result: 199998
Multiprocessing Time: 7.282443 seconds

Running Numba (First Run)...
Numba (First Run) Result: 199998
Numba (First Run) Time: 0.188138 seconds

Running Numba (Second Run)...
Numba (Second Run) Result: 199998
Numba (Second Run) Time: 0.000000 seconds

Running Cython...

Cython Result: 49999500000000
Cython Time: 0.000201 seconds

Performance Comparison
====================

Results sorted by execution time (fastest to slowest):
----------------------------------------------------
Numba (Second Run)        : 0.000000 seconds (Speedup vs fastest: 1.00x)
Cython                    : 0.000201 seconds (Speedup vs fastest: 537.43x)
Numba (First Run)         : 0.188138 seconds (Speedup vs fastest: 502517.63x)
Multiprocessing           : 7.282443 seconds (Speedup vs fastest: 19451403.29x)
Inline Computation        : 27.139908 seconds (Speedup vs fastest: 72490683.16x)
Pure Python (with function) : 42.123598 seconds (Speedup vs fastest: 112512111.71x)


Note: The Cython time is so small that it rounds to 0.00 seconds, and Numba’s second run reports 0.000000 seconds outright. In both cases the compiler reduces this particular loop to almost nothing: the sum of a fixed range has a closed form, and the doubled value of the last j is a constant, so the optimizer never needs to execute 1 billion iterations. The “speedup vs fastest” ratios above should therefore be read as “effectively instantaneous”, not as meaningful multiples.

Method                        Execution Time (seconds)   Speedup vs. Pure Python
Pure Python (with function)   42.123598                  1x
Inline Computation            27.139908                  ~1.55x
Multiprocessing                7.282443                  ~5.78x
Numba (First Run)              0.188138                  ~224x
Cython                         0.000201                  ~210,000x
Numba (Second Run)             0.000000                  not measurable (loop optimized away)
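
As the table shows, the two near-zero rows say more about compiler dead-code elimination than about loop speed. To force a compiled implementation to do real work, accumulate a value that depends on both loop indices through an operation with no arithmetic-series shortcut (xor is a common trick) and then use the result. A minimal sketch with Numba (the same change applies to the Cython kernel):

# A sketch of a fairer kernel: the xor accumulation has no closed form,
# so the JIT compiler must actually execute the iterations
from numba import njit

@njit
def checksum_loop():
    total = 0
    for i in range(10000):
        for j in range(100000):
            total ^= i + j   # resists constant folding and closed-form rewrites
    return total

print(checksum_loop())  # printing the result keeps the work observable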

Final Thoughts

Python’s default interpreter may not be the fastest for heavy computational tasks, but with the right tools and techniques, you can achieve performance comparable to compiled languages like C or C++. Here’s a quick recap:

1. Inline Computations: Eliminated function calls to reduce overhead, yielding a modest speedup.

2. PyPy: Significant speedup without code changes.

3. Multiprocessing: Good improvement; utilizes multiple cores.

4. Numba: Excellent speedup with minimal code changes.

5. Cython: Exceptional speedup; performance comparable to C/C++.

Rajandran R Creator of OpenAlgo - OpenSource Algo Trading framework for Indian Traders. Building GenAI Applications. Telecom Engineer turned Full-time Derivative Trader. Mostly Trading Nifty, Banknifty, High Liquid Stock Derivatives. Trading the Markets Since 2006 onwards. Using Market Profile and Orderflow for more than a decade. Designed and published 100+ open source trading systems on various trading tools. Strongly believe that market understanding and robust trading frameworks are the key to the trading success. Building Algo Platforms, Writing about Markets, Trading System Design, Market Sentiment, Trading Softwares & Trading Nuances since 2007 onwards. Author of Marketcalls.in
