Python is a versatile and user-friendly programming language, but it’s often criticized for being slow compared to compiled languages like C or C++. A recent LinkedIn post highlighted that Python can be 150x slower than C/C++ for certain tasks, especially when dealing with large loops.
In this blog post, we’ll explore five methods to optimize a nested Python loop that runs 1 billion iterations. We’ll provide code examples, performance benchmarks, and detailed explanations for each method to help you achieve performance that’s competitive with C/C++.
Baseline: Pure Python Implementation
Before optimizing, let’s establish a baseline using standard Python (CPython) with a nested loop that performs 1 billion iterations in total (10,000 outer × 100,000 inner).
Code Example
# pure_python.py
import time

def process(value):
    return value * 2

def run_loop():
    result = 0
    outer_iterations = 10000
    inner_iterations = 100000  # 10000 * 100000 = 1 billion total iterations
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            result = process(j)
        if i % 100 == 0:  # Progress checkpoint every 1% (every 100 outer iterations)
            total_iterations = (i + 1) * inner_iterations
            # print(f"Processed {total_iterations:,} iterations")
    return result

if __name__ == "__main__":
    print("Starting 1 billion iterations with nested loops...")
    start_time = time.time()
    final_result = run_loop()
    end_time = time.time()
    print(f"\nFinal result: {final_result}")
    print(f"Total iterations: {10000 * 100000:,} (1 billion)")
    print(f"Execution completed in {end_time - start_time:.2f} seconds")
Performance
• Execution Time: Approximately 42.72 seconds
$ python3 pure_python.py
Starting 1 billion iterations with nested loops...
Final result: 199998
Total iterations: 1,000,000,000 (1 billion)
Execution completed in 42.72 seconds
Analysis
• Observation: The loop takes about 42.7 seconds to complete.
• Limitation: The overhead of the Python interpreter and function calls slows down execution.
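To see where that overhead comes from, you can disassemble the inner loop with the standard dis module. The sketch below is illustrative only (the helper names are mine, not part of the benchmark): every pass through the inner loop dispatches several bytecode instructions plus a full function call.

# bytecode_peek.py (illustrative sketch; not part of the benchmark)
import dis

def process(value):
    return value * 2

def inner():
    result = 0
    for j in range(100000):
        result = process(j)  # each iteration: FOR_ITER, CALL, STORE_FAST, ...
    return result

dis.dis(inner)  # prints the bytecode the interpreter dispatches on every iteration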
Method 1: Inline Computations
An immediate optimization is to inline the computations and eliminate unnecessary function calls.
Code Example
# inline_computation.py
import time

def run_loop():
    result = 0
    outer_iterations = 10000
    inner_iterations = 100000  # 10000 * 100000 = 1 billion total iterations
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            result = j * 2  # Direct computation
        if i % 100 == 0:  # Progress checkpoint every 1% (every 100 outer iterations)
            total_iterations = (i + 1) * inner_iterations
            # print(f"Processed {total_iterations:,} iterations")
    return result

if __name__ == "__main__":
    print("Starting 1 billion iterations with nested loops...")
    start_time = time.time()
    final_result = run_loop()
    end_time = time.time()
    print(f"\nFinal result: {final_result}")
    print(f"Total iterations: {10000 * 100000:,} (1 billion)")
    print(f"Execution completed in {end_time - start_time:.2f} seconds")
Performance
• Execution Time: Approximately 25.19 seconds
$ python3 inline_computation.py
Starting 1 billion iterations with nested loops...
Final result: 199998
Total iterations: 1,000,000,000 (1 billion)
Execution completed in 25.19 seconds
Analysis
• Improvement: Eliminating function calls reduces overhead.
• Limitation: Still constrained by the Python interpreter’s execution speed.
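If you want to measure the call overhead in isolation, a quick timeit comparison works. This is a rough sketch (the names and the 10-million-iteration count are my choices, not from the benchmark above), and absolute numbers will vary by machine.

# call_overhead.py (rough sketch; numbers will vary by machine)
import timeit

def process(value):
    return value * 2

n = 10_000_000
with_call = timeit.timeit("for j in range(n): r = process(j)",
                          globals={"process": process, "n": n}, number=1)
inline = timeit.timeit("for j in range(n): r = j * 2",
                       globals={"n": n}, number=1)
print(f"with call: {with_call:.2f}s  inline: {inline:.2f}s")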
Method 2: Using PyPy
PyPy is an alternative Python interpreter with a built-in Just-In-Time (JIT) compiler that significantly speeds up Python code execution. It is highly compatible with the standard CPython interpreter, allowing most Python programs to run without modification. PyPy accelerates performance by dynamically compiling Python code into optimized machine code at runtime.
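If you juggle several interpreters, a one-liner confirms which one is actually executing your script (this check is my addition, not part of the benchmark):

# which_interpreter.py (my addition; not part of the benchmark)
import sys
print(sys.implementation.name)  # "cpython" under CPython, "pypy" under PyPy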
Installing PyPy
• On macOS (using Homebrew):
brew install pypy3
• Verify Installation:
pypy3 --version
Code Example
Use the same code as in the Inline Computations method (inline_computation.py), but run it with PyPy:
Performance
• Execution Time: Approximately 0.80 seconds
$ pypy3 inline_computation.py
Starting 1 billion iterations with nested loops...
Final result: 199998
Total iterations: 1,000,000,000 (1 billion)
Execution completed in 0.80 seconds
Analysis
• Improvement: Drastic reduction in execution time due to PyPy’s JIT compiler optimizing the loop at runtime.
• Benefit: No code changes required.
• Limitation: Some libraries may not be fully compatible with PyPy.
Method 3: Multiprocessing
Utilize multiple CPU cores by splitting the workload across processes using the multiprocessing module.
Code Example
# multiprocessing_example.py
import time
from multiprocessing import Pool, cpu_count

def process_chunk(args):
    start_chunk, end_chunk = args
    result = 0
    inner_iterations = 100000
    for i in range(start_chunk, end_chunk):
        for j in range(inner_iterations):
            result = j * 2
        # Print progress for each process
        if (i - start_chunk) % 100 == 0:
            total_iterations = (i - start_chunk + 1) * inner_iterations
            print(f"Process chunk {start_chunk}-{end_chunk}: Processed {total_iterations:,} iterations")
    return result

if __name__ == "__main__":
    print("Starting 1 billion iterations with nested loops and multiprocessing...")
    start_time = time.time()
    num_processes = cpu_count()
    outer_iterations = 10000
    chunk_size = outer_iterations // num_processes
    # Create (start, end) outer-loop ranges, one per process
    ranges = [(i * chunk_size, (i + 1) * chunk_size) for i in range(num_processes)]
    # Extend the last chunk to cover any remainder when the split is uneven
    if ranges and ranges[-1][1] < outer_iterations:
        ranges[-1] = (ranges[-1][0], outer_iterations)
    print(f"Using {num_processes} processes")
    print(f"Each process will handle {chunk_size} outer loop iterations")
    with Pool(processes=num_processes) as pool:
        results = pool.map(process_chunk, ranges)
    end_time = time.time()
    print(f"\nTotal iterations: {outer_iterations * 100000:,} (1 billion)")
    print(f"Execution completed in {end_time - start_time:.2f} seconds")
Performance
• Execution Time: Approximately 6.47 seconds
$ python3 multiprocessing_example.py
Starting 1 billion iterations with nested loops and multiprocessing...
Using 8 processes
Each process will handle 1250 outer loop iterations
Total iterations: 1,000,000,000 (1 billion)
Execution completed in 6.47 seconds
Analysis
• Improvement: Parallel execution utilizes all CPU cores.
• Limitation: Overhead from process management and inter-process communication.
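The same chunking also works with the higher-level concurrent.futures API. The sketch below is my variant (not benchmarked in this post); executors are handy because they compose with other concurrent code, but the process-management overhead is the same.

# futures_variant.py (untested sketch; mirrors the Pool-based chunking above)
import os
import time
from concurrent.futures import ProcessPoolExecutor

def process_chunk(bounds):
    start_chunk, end_chunk = bounds
    result = 0
    for i in range(start_chunk, end_chunk):
        for j in range(100000):
            result = j * 2
    return result

if __name__ == "__main__":
    workers = os.cpu_count()
    chunk = 10000 // workers
    ranges = [(k * chunk, (k + 1) * chunk) for k in range(workers)]
    ranges[-1] = (ranges[-1][0], 10000)  # absorb any remainder
    t0 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(process_chunk, ranges))
    print(f"Completed in {time.perf_counter() - t0:.2f} seconds")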
Method 4: Using Numba
Numba is a Just-In-Time (JIT) compiler for Python that translates numerical Python code into fast machine code using the LLVM compiler infrastructure. By adding simple decorators to your functions, Numba can optimize and accelerate loops and array computations. It achieves performance close to compiled languages like C or Fortran, making it ideal for scientific computing.
Installing Numba
pip install numba
Code Example
# numba_example.py
import time
from numba import njit

@njit
def run_loop():
    result = 0
    outer_iterations = 10000
    inner_iterations = 100000  # 10000 * 100000 = 1 billion total iterations
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            result = j * 2
    # Note: formatted printing is limited inside njit functions,
    # so progress is reported only after completion
    return result

if __name__ == "__main__":
    print("Starting 1 billion iterations with nested loops using Numba...")
    print("First run includes compilation time...")
    # First run (includes JIT compilation)
    start_time = time.time()
    final_result = run_loop()
    end_time = time.time()
    compilation_time = end_time - start_time
    print(f"\nFirst run (with compilation):")
    print(f"Total iterations: {10000 * 100000:,} (1 billion)")
    print(f"Execution completed in {compilation_time:.2f} seconds")
    # Second run (already compiled)
    print("\nSecond run (pre-compiled)...")
    start_time = time.time()
    final_result = run_loop()
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"\nSecond run (pre-compiled):")
    print(f"Total iterations: {10000 * 100000:,} (1 billion)")
    print(f"Execution completed in {execution_time:.2f} seconds")
Performance
• Execution Time: Approximately 0.20 seconds for the first run (including JIT compilation); near zero once compiled
$ python3 numba_example.py
Starting 1 billion iterations with nested loops using Numba...
First run includes compilation time...
First run (with compilation):
Total iterations: 1,000,000,000 (1 billion)
Execution completed in 0.20 seconds
Second run (pre-compiled)...
Second run (pre-compiled):
Total iterations: 1,000,000,000 (1 billion)
Execution completed in 0.00 seconds
Analysis
• Improvement: Numba compiles the loop into optimized machine code.
• Benefit: Minimal code changes required.
• Limitation: Numba supports only a subset of Python; complex code may not be compatible.
• Caveat: The near-zero second run suggests LLVM reduced the loop to a constant (the result depends only on the last value of j), so this benchmark flatters Numba; loops with real data dependencies will not collapse this way. For multi-core scaling, see the prange sketch below.
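Numba can also parallelize the outer loop across cores with prange. The sketch below is my addition (not benchmarked above) and computes a sum rather than the overwrite pattern used earlier, because a plain assignment like result = j * 2 has no well-defined value across threads, whereas Numba recognizes += on a scalar as a parallel reduction.

# numba_parallel.py (my addition; not benchmarked above)
from numba import njit, prange

@njit(parallel=True)
def parallel_sum():
    total = 0
    for i in prange(10000):    # outer loop is distributed across cores
        partial = 0
        for j in range(100000):
            partial += j
        total += partial       # Numba treats += in a prange body as a reduction
    return total

print(parallel_sum())  # 49999500000000, matching the Cython result below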
Method 5: Using Cython
Cython is a superset of the Python language that allows you to add static type declarations to your Python code. By compiling this code into C extensions, Cython significantly boosts execution speed, especially for computationally intensive tasks. It enables you to write code that runs as fast as C while maintaining Python’s syntax and ease of use.
Installing Cython
pip install cython
Code Example
setup.py
# setup.py
from setuptools import setup, Extension  # distutils was removed in Python 3.12
from Cython.Build import cythonize

extensions = [
    Extension(
        "test_cython",
        ["test_cython.pyx"],
        extra_compile_args=["-O3"],
    )
]

setup(
    ext_modules=cythonize(
        extensions,
        compiler_directives={
            'boundscheck': False,
            'wraparound': False,
            'nonecheck': False,
            'cdivision': True,
        }
    )
)
test_cython.pyx (the filename must match the Extension name above and the import in the driver)
# cython: language_level=3
# cython: boundscheck=False
# cython: wraparound=False
# cython: nonecheck=False
# cython: cdivision=True

def billion_loop():
    cdef long long i, j
    cdef long long total = 0
    cdef long long outer_iterations = 10000
    cdef long long inner_iterations = 100000  # 10000 * 100000 = 1 billion iterations
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            total += j
        if i % 100 == 0:  # Print progress every 1% (every 100 outer iterations)
            total_iterations = (i + 1) * inner_iterations
            print(f"Processed {total_iterations:,} iterations")
    return total
Compile the Cython Code
python setup.py build_ext --inplace
Driver Script – cython_driver.py
# cython_driver.py
import time
import test_cython

def python_billion_loop():
    total = 0
    outer_iterations = 10000
    inner_iterations = 100000  # 10000 * 100000 = 1 billion iterations
    for i in range(outer_iterations):
        for j in range(inner_iterations):
            total += j
        if i % 100 == 0:  # Print progress every 1% (every 100 outer iterations)
            total_iterations = (i + 1) * inner_iterations
            print(f"Processed {total_iterations:,} iterations")
    return total

def main():
    print("Starting 1 billion iterations benchmark...")
    print("\nRunning Cython implementation...")
    start_time = time.time()
    cython_result = test_cython.billion_loop()
    cython_time = time.time() - start_time
    print(f"Cython Result: {cython_result}")
    print(f"Cython Time: {cython_time:.2f} seconds")
    print("\nRunning Pure Python implementation...")
    start_time = time.time()
    python_result = python_billion_loop()
    python_time = time.time() - start_time
    print(f"Python Result: {python_result}")
    print(f"Python Time: {python_time:.2f} seconds")
    print(f"\nTotal iterations: {10000 * 100000:,} (1 billion)")
    print(f"Speedup: {python_time / cython_time:.2f}x")

if __name__ == "__main__":
    main()
Performance
• Execution Time: Effectively 0.00 seconds (below the timer’s resolution)
$ python3 cython_driver.py
Starting 1 billion iterations benchmark...
Running Cython implementation...
Cython Result: 49999500000000
Cython Time: 0.00 seconds
Analysis
• Improvement: Cython compiles the loop to a C extension, eliminating Python interpreter overhead.
• Benefit: Performance comparable to compiled languages; the sub-millisecond time indicates the C compiler likely collapsed the summation loop into a closed-form computation at -O3.
• Limitation: Requires a build step and additional code complexity. (For quick experiments without setup.py, see the pyximport sketch below.)
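For quick iteration you can skip the setup.py step entirely: Cython ships a pyximport hook that compiles .pyx files on import. A minimal sketch, reusing the test_cython.pyx above; note that pyximport will not pick up the extra_compile_args from setup.py, so it is for convenience rather than peak performance.

# quick_try.py (minimal sketch; compiles test_cython.pyx on first import)
import pyximport
pyximport.install(language_level=3)

import test_cython  # built automatically by the pyximport hook
print(test_cython.billion_loop())  # 49999500000000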
Benchmark All Methods
# benchmark_all.py
import time
import test_cython
from multiprocessing import Pool, cpu_count
from numba import njit

# Pure Python implementation (with a function call per iteration)
def process(value):
    return value * 2

def pure_python_loop():
    result = 0
    for i in range(10000):
        for j in range(100000):
            result = process(j)
    return result

# Inline computation implementation
def inline_loop():
    result = 0
    for i in range(10000):
        for j in range(100000):
            result = j * 2
    return result

# Multiprocessing implementation
def process_chunk(args):
    start_chunk, end_chunk = args
    result = 0
    for i in range(start_chunk, end_chunk):
        for j in range(100000):
            result = j * 2
    return result

def multiprocessing_loop():
    num_processes = cpu_count()
    outer_iterations = 10000
    chunk_size = outer_iterations // num_processes
    ranges = [(i * chunk_size, (i + 1) * chunk_size) for i in range(num_processes)]
    with Pool(processes=num_processes) as pool:
        results = pool.map(process_chunk, ranges)
    return results[-1]  # every chunk ends with the same final value

# Numba implementation
@njit
def numba_loop():
    result = 0
    for i in range(10000):
        for j in range(100000):
            result = j * 2
    return result

def run_benchmark(name, func):
    print(f"\nRunning {name}...")
    start_time = time.perf_counter()  # perf_counter gives higher-resolution timing
    result = func()
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"{name} Result: {result}")
    print(f"{name} Time: {execution_time:.6f} seconds")
    return execution_time

if __name__ == "__main__":
    print("Starting Comprehensive Benchmark")
    print("==============================")
    print(f"Total iterations per implementation: {10000 * 100000:,} (1 billion)")
    print("Each implementation multiplies each number by 2")
    times = {}
    times['Pure Python (with function)'] = run_benchmark('Pure Python (with function)', pure_python_loop)
    times['Inline Computation'] = run_benchmark('Inline Computation', inline_loop)
    times['Multiprocessing'] = run_benchmark('Multiprocessing', multiprocessing_loop)
    # Run Numba twice to show compilation cost vs. steady-state speed
    times['Numba (First Run)'] = run_benchmark('Numba (First Run)', numba_loop)
    times['Numba (Second Run)'] = run_benchmark('Numba (Second Run)', numba_loop)
    times['Cython'] = run_benchmark('Cython', test_cython.billion_loop)
    print("\nPerformance Comparison")
    print("====================")
    # Guard against division by zero when a run is below timer resolution
    fastest_time = max(min(times.values()), 1e-10)
    print("\nResults sorted by execution time (fastest to slowest):")
    print("----------------------------------------------------")
    for implementation, execution_time in sorted(times.items(), key=lambda x: x[1]):
        speedup = execution_time / fastest_time
        print(f"{implementation:25} : {execution_time:.6f} seconds (Speedup vs fastest: {speedup:.2f}x)")
Output
$ python benchmark_all.py
Starting Comprehensive Benchmark
==============================
Total iterations per implementation: 1,000,000,000 (1 billion)
Each implementation multiplies each number by 2
Running Pure Python (with function)...
Pure Python (with function) Result: 199998
Pure Python (with function) Time: 42.123598 seconds
Running Inline Computation...
Inline Computation Result: 199998
Inline Computation Time: 27.139908 seconds
Running Multiprocessing...
Multiprocessing Result: 199998
Multiprocessing Time: 7.282443 seconds
Running Numba (First Run)...
Numba (First Run) Result: 199998
Numba (First Run) Time: 0.188138 seconds
Running Numba (Second Run)...
Numba (Second Run) Result: 199998
Numba (Second Run) Time: 0.000000 seconds
Running Cython...
Cython Result: 49999500000000
Cython Time: 0.000201 seconds
Performance Comparison
====================
Results sorted by execution time (fastest to slowest):
----------------------------------------------------
Numba (Second Run) : 0.000000 seconds (Speedup vs fastest: 1.00x)
Cython : 0.000201 seconds (Speedup vs fastest: 537.43x)
Numba (First Run) : 0.188138 seconds (Speedup vs fastest: 502517.63x)
Multiprocessing : 7.282443 seconds (Speedup vs fastest: 19451403.29x)
Inline Computation : 27.139908 seconds (Speedup vs fastest: 72490683.16x)
Pure Python (with function) : 42.123598 seconds (Speedup vs fastest: 112512111.71x)
Note: The Numba (second run) and Cython timings fall below the timer’s resolution, which is why they display as (nearly) zero and why the “Speedup vs fastest” ratios computed against them balloon into the millions; those ratios are not meaningful. In both cases the compiler has almost certainly reduced the loop to a constant-time computation.
Method | Execution Time (seconds) | Speedup vs. Pure Python
---|---|---
Pure Python (with function) | 42.123598 | 1x
Inline Computation | 27.139908 | ~1.55x
Multiprocessing | 7.282443 | ~5.78x
Numba (First Run) | 0.188138 | ~224x
Cython | 0.000201 | ~210,000x
Numba (Second Run) | 0.000000 | n/a (below timer resolution)
Final Thoughts
Python’s default interpreter may not be the fastest for heavy computational tasks, but with the right tools and techniques, you can achieve performance comparable to compiled languages like C or C++. Here’s a quick recap:
1. Inline Computations: Eliminated function calls to reduce overhead, yielding a modest speedup.
2. PyPy: Significant speedup without code changes.
3. Multiprocessing: Good improvement; utilizes multiple cores.
4. Numba: Excellent speedup with minimal code changes.
5. Cython: Exceptional speedup; performance comparable to C/C++.