Multithreading and GIL

  • A process is a running program.
  • Each process has a state isolated from other processes:
    • virtual address space,
    • pointer to the current instruction,
    • call stack,
    • system resources, such as open file descriptors.
  • Processes are convenient for performing multiple tasks simultaneously.
  • An alternative: delegate each task to a thread.


  • A thread is similar to a process in that its execution occurs independently of other threads (and processes).
  • Unlike a process, a thread executes within a process and shares the address space and system resources with it.
  • Threads are convenient for simultaneously performing several tasks that require access to a shared state.
  • The operating system schedules all processes and threads together, giving each of them in turn a slice of processor time.

threading module

  • A thread in Python is a system thread: the operating system, not the interpreter, schedules its execution.
  • You can create a thread using the Thread class from the threading module of the standard library.
import time
from threading import Thread

def countdown(n):
    for i in range(n):
        print(n - i - 1, "left")
t = Thread(target=countdown, args=(3,))
t.start()  # the thread does not run until start() is called
An alternative way to create a thread is inheritance:

class CountdownThread(Thread):
    def __init__(self, n):
        super().__init__()  # required: initializes the Thread machinery
        self.n = n
    def run(self):  # called by start() in the new thread
        for i in range(self.n):
            print(self.n - i - 1, "left")
t = CountdownThread(3)
t.start()
The disadvantage of this approach is that it limits the reuse of code: the functionality of the CountdownThread class can be used only in a separate thread.

  • When creating a thread, you can specify a name. By default it is 'Thread-N'.
  • Each started thread has an id - a nonzero integer unique among all active threads.
t = Thread()
t.name   # default name of the form 'Thread-N'
t.ident  # None until the thread is started
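A minimal sketch of how the name and id can be inspected. The name "worker-1" and the helper report are made up for this illustration; current_thread, name and ident come from the threading module.

```python
from threading import Thread, current_thread

def report(seen):
    # Record the name and id of the thread that executes this function.
    t = current_thread()
    seen.append((t.name, t.ident))

seen = []
t = Thread(target=report, args=(seen,), name="worker-1")  # explicit name
ident_before_start = t.ident  # None: the thread has not been started yet
t.start()
t.join()

name, ident = seen[0]
print(name, ident == t.ident)
```

The id is assigned only when the thread starts, so code that needs it should read it after start().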
  • The join method lets you wait until a thread finishes.
    • The calling thread pauses until thread t ends.
    • Repeated calls to the join method have no effect.
t = Thread(target=time.sleep, args=(5, ))
t.start()
t.join() # blocks for 5 seconds
t.join() # returns immediately
  • You can check whether a thread is running using the is_alive method:
t = Thread(target=time.sleep, args=(5, ))
t.start()
t.is_alive() # True while running, False after 5 seconds
  • A daemon is a thread created with the daemon=True argument:
t = Thread(target=time.sleep, args=(5,), daemon=True)
  • Unlike a regular thread, a daemon thread is destroyed automatically when the interpreter exits.
  • Python has no built-in mechanism for terminating threads - this is not an accident, but a deliberate decision of the language developers.
  • Terminating a thread correctly often involves releasing resources, for example:
    • a thread may be working with a file whose descriptor needs to be closed,
    • or may be holding a synchronization primitive.
  • To stop a thread, a flag is usually used:
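class Task:
    def __init__(self):
        self._running = True
    def terminate(self):
        self._running = False
    def run(self, n):
        # The flag is checked on every iteration, so terminate()
        # takes effect before the next unit of work starts.
        while self._running and n > 0:
            # assumed loop body: one second of placeholder work per step
            n -= 1
            time.sleep(1)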
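A sketch of how such a flag might be used from the main thread. The loop body and the ticks counter are assumptions made for this illustration, not part of the original class.

```python
import time
from threading import Thread

class Task:
    def __init__(self):
        self._running = True
        self.ticks = 0  # hypothetical counter of completed work units

    def terminate(self):
        self._running = False

    def run(self):
        while self._running:
            self.ticks += 1      # placeholder for one unit of work
            time.sleep(0.01)
        # Resources (files, locks) would be released here, in the
        # thread that owns them, before run() returns.

task = Task()
t = Thread(target=task.run)
t.start()
time.sleep(0.05)     # let the worker make some progress
task.terminate()     # ask the worker to stop
t.join(timeout=1)    # wait for it to finish its current iteration
stopped = not t.is_alive()
print(stopped)
```

The thread stops only at the next check of the flag, so each unit of work should be short.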

The set of synchronization primitives in the threading module is standard:

  • Lock - a regular mutex, used to provide exclusive access to a shared state.
  • RLock - a recursive mutex that the owning thread can acquire more than once.
  • Semaphore - a generalization of the mutex that can be held by no more than a fixed number of threads at a time.
  • BoundedSemaphore - a semaphore that checks that it is not released more times than it was acquired.
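The difference between the two semaphores can be sketched as follows. The limit of two slots, the six workers and the stats dictionary are arbitrary choices for this illustration.

```python
import time
from threading import BoundedSemaphore, Lock, Semaphore, Thread

slots = Semaphore(2)             # at most two threads in the guarded section
stats_lock = Lock()              # protects the bookkeeping below
stats = {"active": 0, "max_active": 0}

def work():
    with slots:                  # acquire on entry, release on exit
        with stats_lock:
            stats["active"] += 1
            stats["max_active"] = max(stats["max_active"], stats["active"])
        time.sleep(0.05)         # simulated work
        with stats_lock:
            stats["active"] -= 1

threads = [Thread(target=work) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A BoundedSemaphore reports an extra release as an error.
bounded = BoundedSemaphore(1)
bounded.acquire()
bounded.release()
try:
    bounded.release()            # one release too many
    over_release_detected = False
except ValueError:
    over_release_detected = True

print(stats["max_active"], over_release_detected)
```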

All synchronization primitives implement a single interface:

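  • the acquire method acquires the synchronization primitive,
  • and the release method releases it.
from threading import Lock

class SharedCounter:
    def __init__(self, value):
        self._value = value
        self._lock = Lock()
    def increment(self, delta=1):
        self._lock.acquire()
        try:
            self._value += delta
        finally:
            self._lock.release()  # released even if += raises
    def get(self):
        with self._lock:  # same as acquire() ... release()
            return self._value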
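A sketch of a lock-protected counter under contention. The class name LockedCounter, the helper bump and the thread and iteration counts are chosen for this illustration; the lock is taken through the with statement, which calls acquire and release.

```python
from threading import Lock, Thread

class LockedCounter:  # hypothetical name, mirrors the counter above
    def __init__(self, value=0):
        self._value = value
        self._lock = Lock()

    def increment(self, delta=1):
        with self._lock:          # read-modify-write is now exclusive
            self._value += delta

    def get(self):
        with self._lock:
            return self._value

counter = LockedCounter()

def bump(n):
    for _ in range(n):
        counter.increment()

threads = [Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = counter.get()
print(total)
```

With the lock in place no increment can be lost, so the total equals the number of calls.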

queue module

The queue module implements several thread-safe queues:

  • Queue - FIFO queue,
  • LifoQueue - LIFO queue (a stack),
  • PriorityQueue - a queue whose elements are (priority, item) pairs.
  • There is nothing fancy in the implementation of the queues: every state-changing method runs while holding a mutex.
  • The Queue class uses deque as the container, and the LifoQueue and PriorityQueue classes use list.
def worker(q):
    while True:
        item = q.get()      # blocks until the next item is available
        do_something(item)
        q.task_done()       # tells the queue that the item has been processed
def master(q):
    for item in source():
        q.put(item)
    # blocks until every item in the queue
    # has been processed
    q.join()
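The worker above never exits. One common way to stop it is a sentinel value; the sketch below assumes a STOP object and squares the items in place of do_something, both invented for this illustration.

```python
from queue import Queue
from threading import Thread

STOP = object()  # hypothetical sentinel that tells the worker to exit

def worker(q, results):
    while True:
        item = q.get()               # blocks until an item is available
        if item is STOP:
            q.task_done()
            break
        results.append(item * item)  # stands in for do_something(item)
        q.task_done()

q = Queue()
results = []
t = Thread(target=worker, args=(q, results))
t.start()

for item in range(5):
    q.put(item)
q.join()       # returns once task_done() was called for every item
q.put(STOP)    # now let the worker leave its loop
t.join()
print(results)
```

With several workers, one sentinel per worker would be needed.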

futures module

  • The concurrent.futures module contains the abstract class Executor and its implementation as a thread pool - ThreadPoolExecutor.
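  • The executor's interface consists of only three methods: submit, map and shutdown.
from concurrent.futures import ThreadPoolExecutor, as_completed
executor = ThreadPoolExecutor(max_workers=4)
executor.submit(print, "Hello, world!")
Hello, world!

<Future at 0x1043b1d90 state=running>
list(executor.map(print, ["Knock?", "Knock!"]))

[None, None]
executor.shutdown()  # waits for pending tasks and frees the workers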
  • Executors support the context manager protocol:
with ThreadPoolExecutor(max_workers=4) as executor:
    ...  # shutdown() is called automatically on exit
  • The Executor.submit method returns an instance of the Future class, which encapsulates an asynchronous computation.

What can be done with Future?

with ThreadPoolExecutor(max_workers=4) as executor:
    f = executor.submit(sorted, [4, 3, 1, 2])
  • Query the status of the computation:
f.running(), f.done(), f.cancelled()
(False, True, False)
  • Block until the result of the computation is available:
f.result()
[1, 2, 3, 4]
  • Add a function to be called once the computation completes:
f.add_done_callback(print)
<Future at 0x1043b7f10 state=finished returned list>
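A Future also carries the exception if the call failed. A minimal sketch, assuming a made-up parse function and three sample inputs:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parse(text):
    return int(text)  # raises ValueError for non-numeric input

parsed, failed = [], []
with ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to the input it was created from.
    futures = {executor.submit(parse, s): s for s in ["1", "2", "x"]}
    for fut in as_completed(futures):    # yields futures as they finish
        if fut.exception() is None:
            parsed.append(fut.result())  # does not block: already done
        else:
            failed.append(futures[fut])  # fut.result() would re-raise

print(sorted(parsed), failed)
```

Calling result() on a failed future re-raises the exception in the calling thread, so checking exception() first avoids a try block here.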

futures module example: integrate

import math

def integrate(f, a, b, *, n_iter=1000):
    acc = 0
    step = (b - a) / n_iter
    for i in range(n_iter):
        acc += f(a + i * step) * step
    return acc
integrate(math.cos, 0, math.pi / 2)
from functools import partial

def integrate_async(f, a, b, *, n_jobs, n_iter=1000):
    executor = ThreadPoolExecutor(max_workers=n_jobs)
    spawn = partial(executor.submit, integrate, f,
                    n_iter=n_iter // n_jobs)
    step = (b - a) / n_jobs
    # each job integrates its own sub-interval of [a, b]
    fs = [spawn(a + i * step, a + (i + 1) * step)
          for i in range(n_jobs)]
    return sum(f.result() for f in as_completed(fs))
integrate_async(math.cos, 0, math.pi / 2, n_jobs=2)

Parallelism and Concurrency

Compare the performance of the serial and parallel versions of the integrate function using the timeit magic command:

%%timeit -n100
integrate(math.cos, 0, math.pi / 2, n_iter=10**6)
154 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100
integrate_async(math.cos, 0, math.pi / 2, n_iter=10**6, n_jobs=2)
142 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


  • The GIL (global interpreter lock) is a mutex that ensures that only one thread at a time has access to the internal state of the interpreter.
  • The Python C API allows you to release the GIL, but this is safe only while the code does not touch Python objects or other interpreter state.

Is GIL Bad?

  • The answer depends on the task.
  • The GIL makes threads useless for parallelism in CPU-bound Python code: several threads do not speed the program up, and sometimes even slow it down.
  • The GIL does not interfere with using threads for concurrency when working with I/O.
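The I/O case can be sketched with time.sleep standing in for a blocking I/O call, since it releases the GIL while waiting. The fake_io helper, the 0.2 second delay and the four threads are arbitrary choices for this illustration.

```python
import time
from threading import Thread

def fake_io(delay):
    # time.sleep releases the GIL while waiting, like blocking I/O does.
    time.sleep(delay)

start = time.perf_counter()
threads = [Thread(target=fake_io, args=(0.2,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# The four waits overlap, so the total is close to one delay,
# not to the 0.8 s that four sequential calls would take.
print(f"{elapsed:.2f} s")
```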

C and Cython - GIL Remedy

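%load_ext Cython

%%cython
from libc.math cimport cos

# f is ignored: the C function cos is called directly,
# so the loop can run without the GIL
def integrate(f, double a, double b, long n_iter):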
    cdef double acc = 0
    cdef double step=(b-a)/n_iter
    cdef long i
    with nogil:
        for i in range(n_iter):
            acc += cos(a + i * step) * step
    return acc
%%timeit -n100
integrate_async(math.cos, 0, math.pi / 2, n_iter=10**6, n_jobs=2)
5.88 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

multiprocessing module

Processes - Another GIL Remedy

  • You can use processes instead of threads.
  • Each process has its own GIL, so separate processes can run in parallel.
  • The multiprocessing module is responsible for working with processes in Python:
import multiprocessing as mp
p = mp.Process(target=countdown, args=(5, ))
p.start()
4 left
3 left
2 left
1 left
0 left
  • The module implements basic synchronization primitives: mutexes, semaphores, condition variables.
  • To organize the interaction between processes, you can use Pipe - a socket-based connection between two processes:
def ponger(conn):
    conn.send("pong")  # assumed body: reply through the child's end
parent_conn, child_conn = mp.Pipe()
p = mp.Process(target=ponger, args=(child_conn, ))
p.start()
parent_conn.recv()  # blocks until the child sends a message

Process and performance

The thread-pool implementation of the integrate_async function was slow, so let's try a process pool instead:

from concurrent.futures import ProcessPoolExecutor
def integrate_async(f, a, b, *, n_jobs, n_iter=1000):
    executor = ProcessPoolExecutor(max_workers=n_jobs)
    spawn = partial(executor.submit, integrate, f,
                    n_iter=n_iter // n_jobs)

    step = (b - a) / n_jobs
    fs=[spawn(a + i * step, a + (i + 1) * step)
        for i in range(n_jobs)]
    return sum(f.result() for f in as_completed(fs))
%%timeit -n100
integrate_async(math.cos, 0, math.pi / 2, n_iter=10**6, n_jobs=2)
16.6 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

joblib package

The joblib package implements a parallel analogue of the for loop, which is convenient for parallel execution of independent tasks.

from joblib import Parallel, delayed

def integrate_async(f, a, b, *, n_jobs, n_iter=1000, backend=None):
    step = (b - a) / n_jobs
    with Parallel(n_jobs=n_jobs, backend=backend) as parallel:
        fs = (delayed(integrate)(f, a + i * step,
                                 a + (i + 1) * step,
                                 n_iter=n_iter // n_jobs)
              for i in range(n_jobs))
        # run the tasks while the worker pool is still open
        return sum(parallel(fs))
%%timeit -n100
integrate_async(math.cos, 0, math.pi / 2, n_iter=10**6, n_jobs=2, backend="threading")
104 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100
integrate_async(math.cos, 0, math.pi / 2, n_iter=10**6, n_jobs=2, backend="multiprocessing")
290 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


  • The GIL is a global mutex that limits the use of threads for parallelism in Python programs.
  • For programs that are mostly I/O-bound, the GIL is not a problem: in CPython, these operations release the GIL.
  • For programs that need parallelism, there are options for improving performance:
    • write the critical functionality in C or Cython;
    • or use the multiprocessing module.

numba JIT compiler

import math
from numba import jit, prange

@jit(nopython=True, parallel=True, fastmath=True, cache=True)
def integrate(a, b, *, n_iter=1000):
    acc = 0
    step = (b - a) / n_iter
    for i in prange(n_iter):
        acc += math.cos(a + i * step) * step
    return acc
%%timeit -n100
integrate(0, math.pi / 2, n_iter=10**6)
5.11 ms ± 983 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
