Sorry for the late post bubs. I know this is not what was promised, but I want to make it clear that I did not slack off. Essentially, I’ve spent the last 2 weeks working on a much bigger post, which I thought (for whatever reason) I could release in just a couple of days, but which is now looking like it might take almost a month to complete. So in the meantime, I’ll be peppering this timeline with posts about random stuff I’m learning while building said project that would be too off-topic to include in the mega post itself. Enough yapping though, let’s get down to business.

The topic today is concurrency/parallelism and how we can implement it in Python for our projects. I’ll start off by explaining what exactly these terms mean:

Concurrency vs Parallelism

  • Concurrency:
    • At least two tasks making progress during overlapping time periods, though not necessarily executing at the same instant
    • Can be achieved on a single-core CPU via task switching, a mechanism known as multitasking. In this case, only one task is actually running at any given instant.
  • Parallelism:
    • Tasks are both concurrent and executing at the same time
    • Requires at least two CPU cores
    • Parallelism implies concurrency, but not vice versa

Concurrency vs Parallelism
Source: Deepshi Garg on Medium

Seems simple enough! But in order to fully understand how these concepts can be implemented in Python, we will first need a primer on the

Python Execution Model

The Python execution workflow goes like this:

  1. Python source code
  2. Bytecode
  3. Python Virtual Machine (PVM)
  4. Executable
  5. Hardware

Python Execution Workflow
Source: DevOpsSchool.com

Bytecode

  • Intermediate representation of Python code
  • Platform-agnostic, allowing for cross-platform compatibility
  • Each Python statement is converted into a group of bytecode instructions
  • Looks like this:
from dis import dis  # dis() pretty-prints a function's bytecode

def add(a, b):
    return a + b

dis(add)  # dis() prints directly; wrapping it in print() would just tack a stray "None" onto the output

OUTPUT (on CPython 3.11; exact opcodes and offsets vary by version):

  3           0 RESUME                   0

  4           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE

Python Virtual Machine (PVM)

  • Converts bytecode into machine instructions
  • Equipped with an interpreter that executes the bytecode
  • Typically assigns work to only one CPU core at a time due to the Global Interpreter Lock (more on this later)
  • Looks like this (shocking)

PVM
Source: deez nuts

Python vs C++ vs Java

Now that we are familiar with the execution model of Python, we need a benchmark to compare it to. C++ and Java make for great comparisons. The models of these languages differ slightly, and unfortunately this slight difference has a huge impact on execution speeds.

Python: Source code → Bytecode → PVM → Executable → Hardware

  • Flexible, but harder to optimize than C++ and Java
  • Cannot utilize caching as effectively as Java, since CPython has no JIT to cache compiled machine code

C++: Source code → Executable → Hardware

  • Directly compiled to machine code, fastest of the three (not even a contest)

Java: Source code → Bytecode → JVM → Executable → Hardware

  • Uses Just-In-Time (JIT) compilation
  • Can cache JIT-compiled machine code for optimizations

As a Python enjoyer I am sad to admit that C++ absolutely mogs Python when it comes to low-latency applications. But that’s why Claude exists - to transform all my unoptimized Python slop into kino C++ nanosecond inference code.

Global Interpreter Lock (GIL): The Great Limiter

Nearly every performance-related limitation in Python traces back to the global interpreter lock. It’s a key concept to understand if you want to know why exactly Python isn’t the best for performance-critical operations. Cliffs notes are as follows:

  • CPython, the reference implementation of Python, is not natively thread-safe
  • What this means is that if multiple threads share a variable and attempt to modify it, the result is non-deterministic; this situation is known as a race condition
  • Example of a race condition: consider two threads incrementing a shared counter. If not properly managed, they might read the same initial value, increment it, and write back, effectively losing one increment.
  • While threads can be scheduled across different cores, the GIL ensures that only one of them runs Python bytecode at any given moment, keeping the interpreter’s internals thread-safe
  • Unfortunately, this makes multithreading for CPU-bound tasks close to useless in CPython, regardless of how many cores are available

If this doesn’t make much sense right now, keep reading, it should become clearer.
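To make the counter example concrete, here’s a minimal sketch. The time.sleep(0) is a demo aid I added to widen the race window so the lost updates show up reliably (a good reminder that the GIL protects the interpreter’s internals, not your own read-modify-write sequences):

import threading
import time

counter = 0  # shared state

def increment(n):
    global counter
    for _ in range(n):
        tmp = counter      # 1. read the shared value
        time.sleep(0)      # yield, so another thread can sneak in between read and write
        counter = tmp + 1  # 2. write it back; any increments made in between are lost

threads = [threading.Thread(target=increment, args=(1_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Expected 2000, got {counter}")  # almost always less than 2000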

Workarounds for GIL

Before we get into understanding how we can bypass the limitations of GIL, there is some prerequisite knowledge that we must know.

I/O vs CPU Bound Work

In general, an application’s work can be divided into I/O-bound and CPU-bound work (“bound” meaning the limiting factor):

  • I/O operations involve a computer’s input and output devices, such as the keyboard, hard drive, and network card
  • CPU operations involve the central processing unit and consist of work such as mathematical computation

Processes vs Threads

| Processes | Threads |
| --- | --- |
| Independent execution units with their own memory space | Lightweight execution units within a process |
| Isolated from other processes | Share memory with the parent process |
| Can run in parallel on multi-core systems | Multiple threads in a process can run concurrently |
| Heavier weight, more resource-intensive | Lighter weight, less resource-intensive |
| Used for CPU-intensive tasks and parallel execution across multiple cores/machines | Used for I/O-bound tasks and concurrent operations within an application |
| Used for running independent programs | Used for concurrent operations within a program |

Additional points:

  • Every process has at least one thread known as the main thread. The other threads created from the main thread are known as worker threads.
  • One can think of threads as lightweight processes, with a different memory profile
  • The entry into any Python application begins with a process and its main thread
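You can verify that last point yourself in a few lines:

import multiprocessing
import os
import threading

# Every Python script is already a process running a main thread
print(multiprocessing.current_process().name)  # MainProcess
print(threading.current_thread().name)         # MainThread
print(os.getpid())                             # the OS-level id of this process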

That’s it for the prerequisites!

Multithreading & Multiprocessing

Multithreading and multiprocessing are the Python implementations of concurrency and parallelism, and they allow us to get around the limitations of the GIL:

Multithreading for I/O-bound tasks

Key Concept: I/O operations make lower-level system calls in the OS that run outside of the Python runtime

  1. The GIL is released by a thread when an I/O task is in progress, since we are not working directly with Python objects
  2. During this period, the GIL is assigned to another thread that requires it
  3. The GIL is reacquired by the original thread when the data is translated back into Python objects
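Here’s a minimal timing sketch of this in action, using time.sleep as a stand-in for real blocking I/O (it releases the GIL in the same way):

import threading
import time

def fake_io():
    time.sleep(0.5)  # releases the GIL while sleeping, just like real blocking I/O

# Sequential: 4 waits of 0.5s happen one after another, ~2s total
start = time.perf_counter()
for _ in range(4):
    fake_io()
print(f"Sequential: {time.perf_counter() - start:.2f}s")

# Threaded: all 4 waits overlap, ~0.5s total
start = time.perf_counter()
threads = [threading.Thread(target=fake_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Threaded:   {time.perf_counter() - start:.2f}s")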

Multiprocessing for CPU-bound tasks

Key Concept: Allows utilization of multiple CPU cores

  1. Creates multiple separate Python processes
  2. These processes can be run in parallel because each process has its own interpreter (PVM and GIL) that executes the instructions allocated to it
  3. At any given instant, only one process actually executes on a given core; the others sit in other states (ready, blocked, etc.)
  4. OS scheduler manages time-sharing of cores among processes
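And a minimal timing sketch for this one, assuming a machine with at least 4 cores (the speedup only materializes when the computation dwarfs the process start-up overhead):

import multiprocessing
import time

def cpu_work(n):
    return sum(i * i for i in range(n))  # pure-Python number crunching, holds the GIL throughout

if __name__ == "__main__":
    # Sequential: one process grinds through all 4 jobs
    start = time.perf_counter()
    for _ in range(4):
        cpu_work(5_000_000)
    print(f"Sequential:  {time.perf_counter() - start:.2f}s")

    # Parallel: 4 processes, each with its own interpreter and GIL
    start = time.perf_counter()
    processes = [multiprocessing.Process(target=cpu_work, args=(5_000_000,)) for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(f"4 processes: {time.perf_counter() - start:.2f}s")  # roughly 4x faster on 4+ cores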

Multithreading vs Multiprocessing

| Multithreading | Multiprocessing |
| --- | --- |
| Works by creating multiple threads within a single process, sharing the same memory space | Works by creating multiple independent processes, each with its own memory space |
| Threads share the same memory space and resources | Each process has its own memory space and resources |
| Concurrency, but not true parallelism | True parallelism |
| Lighter-weight than multiprocessing | Higher overhead* compared to multithreading |
| Better for I/O-bound tasks: when a thread is waiting for I/O, it releases the GIL, allowing other threads to execute | Generally less efficient for I/O-bound tasks due to higher overhead |
| Good for operations that spend significant time waiting on input/output, like downloading stock market data, running database queries, and file operations | Good for CPU-intensive tasks like number crunching and data processing |
| Very limited use for CPU-bound tasks due to the GIL | Useful for CPU-bound tasks because it can utilize multiple CPU cores simultaneously |

*Multiprocessing in Python works by allowing a single program to control multiple PVMs. The overhead comes from the allocation of new memory blocks and resources to create new PVMs.

Multiprocessing in Python
Source: my course instructor who I hope doesn’t mind me using this

Python vs C++ vs Java Revisited

Does this mean that Python can actually match up to C++ and Java? Stop asking, the answer is still no (T_T). Let’s look at this again, this time through the lens of multithreading/multiprocessing:

  • With multiprocessing, Python can utilize multiple cores effectively, similar to C++ and Java, and thus achieve comparable performance on very CPU-intensive tasks where the computation time outweighs the process creation and communication overhead.
  • However, C++ and Java can multithread CPU-bound tasks (unlike Python, which can only multithread I/O-bound tasks). As a consequence, for shorter-running tasks or those requiring frequent communication between parallel units, the overhead of multiprocessing will result in lower performance compared to multithreaded solutions in C++ or Java.

Numpy and Alternative Python Implementations

Another way to get around the GIL is by using NumPy or alternative Python implementations:

  • NumPy’s core is implemented in C and releases the GIL during heavy computations, allowing for better performance
    • Interop (language interoperability) is what allows Python to interface with compiled code from other languages, which is what enables libraries like NumPy to exist
  • Some alternative implementations (e.g., Jython, IronPython) don’t have a GIL, potentially allowing multithreading of CPU-bound tasks

But this comes with its own set of caveats:

  1. NumPy can only be used for very specific number-crunching purposes
  2. Alternative implementations suffer from compatibility issues and are not well documented

how interops makes numpy possible
Source: same course instructor
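To make the NumPy point concrete, here’s a minimal sketch comparing a pure-Python loop against the equivalent vectorized call, which pushes the entire loop down into compiled C:

import time
import numpy as np

data = np.arange(1_000_000, dtype=np.float64)

# Pure-Python loop: every iteration bounces through the interpreter
start = time.perf_counter()
total = 0.0
for x in data:
    total += x
print(f"Python loop: {time.perf_counter() - start:.2f}s")

# Vectorized: the loop runs in compiled C, largely with the GIL released
start = time.perf_counter()
total = data.sum()
print(f"NumPy sum:   {time.perf_counter() - start:.4f}s")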

Implementing Concurrency in Python (Code)

Finally we can get to the meat of it. Even though it may not match up to C++ in terms of speed, Python has a lot to offer us beyond speed (especially for machine learning applications). Even so, this doesn’t mean we should give up on reducing latency altogether! As we have learned, multithreading/multiprocessing are great ways to improve execution performance. In this section, I’ll go over a couple of libraries that provide this functionality in Python, accompanied by minimum viable examples showing how we can use them to optimize our own apps. But before that, a real quick intro to the different types of multitasking. I swear this is not stalling; knowing this is pretty much necessary to understand the nuance between library choices.

Preemptive vs Cooperative Multitasking

CPU multitasking falls into two types:

  • Preemptive multitasking: the OS decides how and when to switch between different tasks via time slicing. When the OS switches between running threads or processes, a context switch occurs, and the OS needs to save the state of the running process/thread in order to resume it later.
  • Cooperative multitasking: the running process decides to give up CPU time via explicitly earmarked regions in the code.

Multithreading with threading

threading uses preemptive multitasking to enable multithreading in Python.

Code

import threading  # Import the threading module

def thread_function(name):
    print(f"Thread {name} is running")  # This runs in a worker thread

# Main thread creates and starts worker threads
threads = [threading.Thread(target=thread_function, args=(f"T{i}",)) for i in range(3)]  # Create 3 Thread objects
for thread in threads:
    thread.start()  # Main thread starts each worker thread

# Main thread waits for all worker threads to complete
for thread in threads:
    thread.join()  # Main thread blocks until each worker thread finishes

print("All threads completed")  # Main thread continues after all workers are done

In English:

  1. Main thread starts with script execution.
  2. threading.Thread() creates a Thread object (blueprint), not a running thread
  3. thread.start() creates and begins execution of a new worker thread
  4. thread.join() makes the main thread wait for the worker thread to finish

The fact that the main thread must join() each worker thread sequentially, blocking until each completes, is what makes this code synchronous. The workers themselves still run concurrently; it is the main thread’s flow that follows a predictable order. This synchronicity makes the code simple to understand, but also less efficient for I/O-bound tasks. More on this later.

how threading works
Source: HangukQuant’s QT101
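Side note: the standard library also ships concurrent.futures.ThreadPoolExecutor, which wraps the whole create/start/join dance and hands return values back; a minimal sketch:

from concurrent.futures import ThreadPoolExecutor

def thread_function(name):
    return f"Thread {name} done"

# The executor creates, starts, and joins the worker threads for us
with ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(thread_function, [f"T{i}" for i in range(3)])

print(list(results))  # ['Thread T0 done', 'Thread T1 done', 'Thread T2 done']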

Multiprocessing with multiprocessing

multiprocessing enables true parallelism in Python. I’d wager that this is not as useful as threading, since most of our CPU-intensive tasks will be number crunching, which is what NumPy is for.

Code

import multiprocessing  # Import the multiprocessing module

def process_function(name):
    print(f"Process {name} is running")  # This runs in a separate process

if __name__ == "__main__":  # Required with the default spawn start method (Windows/macOS): child processes re-import this module, and the guard keeps them from spawning workers of their own
    # Main process creates and starts worker processes
    processes = [multiprocessing.Process(target=process_function, args=(f"P{i}",)) for i in range(3)]  # Create 3 Process objects
    for process in processes:
        process.start()  # Main process starts each worker process

    # Main process waits for all worker processes to complete
    for process in processes:
        process.join()  # Main process blocks until each worker process finishes

    print("All processes completed")  # Main process continues after all workers are done

In English:

  1. Main process starts with script execution
  2. multiprocessing.Process() creates a Process object (blueprint), not a running process
  3. process.start() creates and begins execution of a new worker process
  4. process.join() makes the main process wait for the worker process to finish

Asynchronous Programming with asyncio

Time for a wild card entry. Asyncio is a library that also provides a framework for concurrency in Python. However, it doesn’t do this through multithreading or multiprocessing. Instead, it employs asynchronous programming.

What exactly does this entail? For one, asyncio uses coroutines and an event loop to achieve concurrency, rather than using threads or processes. This is known as a cooperative multitasking model, where tasks voluntarily yield control when they’re waiting for something (like I/O operations) rather than being preemptively interrupted.

Comparison with threading

| Aspect | Asyncio | Threading |
| --- | --- | --- |
| Concurrency model | Cooperative multitasking (tasks yield control voluntarily) | Preemptive multitasking (OS schedules threads) |
| Execution | Single-threaded, runs on one CPU core | Multi-threaded, can utilize multiple CPU cores |
| Resource usage | Typically lower memory footprint | Higher memory usage due to separate thread stacks |
| Scalability | Highly scalable, since it doesn’t rely on OS threads | Limited by context-switch overhead |
| Complexity | Harder to reason about execution flow due to the code structure | Straightforward execution order, so code is easier to understand |

Key Concepts

Asynchronous Programming: Asynchronous programming allows parts of a program to run independently of the main program flow.

Coroutines: Coroutines are the building blocks of asyncio-based code. They are special functions that can be paused and resumed, allowing other code to run in the meantime. Coroutines are defined using the async def syntax and are awaited with the await keyword.

Event Loop: The event loop is at the core of every asyncio application. It runs asynchronous tasks and callbacks, performs network I/O operations, and manages subprocesses. The event loop is responsible for scheduling and running coroutines concurrently.

Sockets: Low-level abstractions that provide endpoints for data transfer between applications over a network. Things to note:

  • By default, sockets are blocking: when reading or writing, the application waits until the operation is complete.
  • However, sockets can be set to a non-blocking mode, where operations return immediately even if they haven’t completed. The catch is that your application needs to keep checking whether the operation has finished (see the sketch after this list).
  • Different operating systems have different ways of notifying programs about socket events. Sounds complex, and it is! Thankfully, asyncio abstracts away these OS-specific details, allowing us to remain mostly ignorant of the underlying mechanisms.
  • This non-blocking approach is what lets asyncio efficiently manage many network connections at once, without getting stuck waiting on any single operation.
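Here’s a minimal sketch of blocking vs non-blocking behavior. It uses a socketpair so it runs without any network, and select as one example of the OS readiness notifications mentioned above:

import select
import socket

a, b = socket.socketpair()  # a pre-connected pair of sockets, no network needed
a.setblocking(False)        # switch one end to non-blocking mode

try:
    a.recv(1024)            # nothing has been sent yet...
except BlockingIOError:
    # ...so instead of waiting, the call returns immediately with an error
    print("No data yet, free to do other work")

b.send(b"hello")
select.select([a], [], [])  # readiness notification: block until 'a' has data
print(a.recv(1024))         # b'hello'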

Tasks: Wrappers around coroutines that are scheduled and managed by the event loop.

Event Loop Workflow

  1. An event loop is instantiated with an empty queue of tasks
  2. The loop iterates continuously, checking for events and tasks to run
  3. When a task is selected, the loop runs its coroutine
  4. If the coroutine hits an I/O operation, it’s suspended
  5. The loop makes necessary system calls for the I/O operation
  6. It also sets up notifications for the operation’s progress
  7. The suspended task is put back in the queue
  8. The loop moves on to check for and run other tasks
  9. Each iteration also checks for completed tasks
  10. This process repeats, allowing concurrent execution
  11. When all tasks are complete, the loop exits

Thread Submitting Tasks to the Event Loop
Source: Matthew Fowler in “Python Concurrency with asyncio”

Code

import asyncio

async def coroutine1():
    print("Task 1 started")
    # Simulate an I/O operation that takes 2 seconds
    # During this time, the event loop can run other tasks
    await asyncio.sleep(2)
    print("Task 1 finished")

async def coroutine2():
    print("Task 2 started")
    # Simulate an I/O operation that takes 1 second
    await asyncio.sleep(1)
    print("Task 2 finished")

async def main():
    # Create task objects from coroutines
    # These tasks are scheduled to run on the event loop
    task1 = asyncio.create_task(coroutine1())
    task2 = asyncio.create_task(coroutine2())
    
    # Wait for both tasks to complete
    # The 'await' keyword allows the event loop to run other tasks
    # while waiting for these tasks to finish
    await task1
    await task2

# Run the event loop with the main coroutine
# asyncio.run() creates a new event loop and runs the main coroutine until it's complete
asyncio.run(main())

In English:

  1. Start task1, encounter the sleep, and switch to task2
  2. Start task2, encounter the sleep
  3. After 1 second, resume and finish task2
  4. After another 1 second (2 seconds total), resume and finish task1
  5. Return control to main(), which completes
  6. Close the event loop
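One last note: the two awaits in main() are often collapsed into a single asyncio.gather call, which schedules a batch of coroutines as tasks and waits for all of them; a minimal sketch:

import asyncio

async def fetch(name, delay):
    print(f"{name} started")
    await asyncio.sleep(delay)  # stand-in for a real I/O operation
    print(f"{name} finished")
    return name

async def main():
    # gather schedules both coroutines as tasks and awaits all their results
    results = await asyncio.gather(fetch("Task 1", 2), fetch("Task 2", 1))
    print(results)  # ['Task 1', 'Task 2'] (same order as the arguments)

asyncio.run(main())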

Closing Notes

This has been a fun exercise! Undoubtedly, understanding concurrency gives us a much better grasp of how production systems work. Before I close, I’d like to mention HangukQuant’s great PDF on the topic, which served as the main reference for this post. Highly recommend going through it if you have time. That’s about it!

Alright, see you in the next one :))