Speeding up NumPy using CFFI
Below is a case study of using CFFI to manipulate NumPy array data in C code, with an execution speed comparison between the C implementation and a pure Python one. It may serve as a boilerplate solution for anyone looking to either use existing C code with Python and NumPy, or to speed up a Python app with their own C code that remains reusable outside of a Python context.
The Starting Point
I was missing a row-reduce function in NumPy that would simply convert a matrix to reduced row echelon form, rather than producing the final solution to a system of equations as `np.linalg.solve()` does. So, I went ahead and implemented one in Python, using the Gaussian elimination algorithm.
myutils.py
```python
import numpy as np

tolerance = np.finfo(float).eps * 10


def is_zero(x):
    """
    Check if x is effectively zero.
    """
    return np.abs(x) < tolerance


def row_reduce_py(input_m):
    """
    Row reduce a matrix to reduced row echelon form using Gaussian elimination.
    Python version.
    """
    m = np.array(input_m, dtype=float)  # Working with a copy to preserve input.
    rows, cols = m.shape
    pivot_row = 0
    pivot_col = 0
    while pivot_row < rows and pivot_col < cols:
        # Find the row with the largest value in the pivot column.
        max_row = np.argmax(np.abs(m[pivot_row:, pivot_col])) + pivot_row
        if is_zero(m[max_row, pivot_col]):
            # No nonzero pivot in this column, move to the next column.
            pivot_col += 1
            continue
        # Swap the pivot row with max_row.
        if pivot_row != max_row:
            m[[pivot_row, max_row]] = m[[max_row, pivot_row]]
        # Get the pivot value.
        pivot = m[pivot_row, pivot_col]
        # Divide the pivot row by the pivot value.
        m[pivot_row, :] /= pivot
        # Subtract multiples of the pivot row from all other rows.
        for i in range(rows):
            if i != pivot_row:
                m[i, :] -= m[i, pivot_col] * m[pivot_row, :]
        # Move to the next pivot row and column.
        pivot_row += 1
        pivot_col += 1
    return m
```
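As a quick sanity check, here's a minimal usage sketch (hypothetical, not part of the module itself), row reducing the augmented matrix of a small linear system:

```python
import numpy as np
from myutils import row_reduce_py

# Augmented matrix for the system: x + 2y = 3, 4x + 5y = 6.
a = np.array([[1, 2, 3], [4, 5, 6]])
print(row_reduce_py(a))
# Expected output:
# [[ 1.  0. -1.]
#  [ 0.  1.  2.]]
```

The last column now holds the solution x = -1, y = 2.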
This did the job for me, and the performance wasn't an issue since I only used it with very small matrices anyway. However, I wanted to get some practice with augmenting the Python code with C, so I went ahead with that.
Speeding Up
NumPy has a C API that allows one to work with NumPy objects directly from C. However, I wanted to pick a library that would take on the burden of gluing my C and Python code together, without requiring a deep dive into the details of the integration. CFFI's primary use case is integrating existing C libraries with Python code, so it seemed suitable upon a quick comparison with the other options out there (mainly the ones on the list in the Python docs here).
The C version
Let's start with the header file, declaring the single function that we'll expose to the outside world. My Python module is called myutils, so the C part will be called mylib.
mylib.h
```c
#ifndef MYLIB_H
#define MYLIB_H

#include <stddef.h>

/*
 * Implementation of Gaussian Elimination, editing data in-place.
 */
void row_reduce(double* m, const size_t rows, const size_t cols);

#endif
```
And now for the implementation of row reduction in C. Note that, compared to the Python code, I don't make a copy of the input matrix here and am changing it in place; we'll handle that in the Python code.
mylib.c
```c
#include "mylib.h"

#include <math.h>
#include <stdbool.h>

// Using the same epsilon value as in the Python code: np.finfo(float).eps * 10.
bool is_zero(double x) { return fabs(x) < 2.220446049250313e-15; }

// Find the row with the largest absolute value in the pivot column.
extern inline void find_max_pivot_row(double* m, size_t rows, size_t cols,
    size_t pivot_row, size_t pivot_col, size_t* max_row, double* pivot)
{
    for (size_t i = pivot_row + 1; i < rows; ++i) {
        if (fabs(m[i * cols + pivot_col]) > fabs(*pivot)) {
            *max_row = i;
            *pivot = m[i * cols + pivot_col];
        }
    }
}

// Swap two rows.
extern inline void swap_rows(
    double* m, size_t rows, size_t cols, size_t row1, size_t row2)
{
    if (row1 != row2) {
        // TODO: Use memcpy?
        for (size_t i = 0; i < cols; ++i) {
            double temp = m[row2 * cols + i];
            m[row2 * cols + i] = m[row1 * cols + i];
            m[row1 * cols + i] = temp;
        }
    }
}

// Multiply a row by a scalar.
extern inline void multiply_row(double* m, size_t cols, size_t row, double x)
{
    for (size_t i = 0; i < cols; ++i) {
        m[row * cols + i] *= x;
    }
}

// Subtract the pivot row from a given row, achieving 0 in the pivot column.
extern inline void subtract_pivot_row(
    double* m, size_t cols, size_t pivot_row, size_t pivot_col, size_t i)
{
    double factor = m[i * cols + pivot_col];
    for (size_t j = pivot_col; j < cols; ++j) {
        m[i * cols + j] -= factor * m[pivot_row * cols + j];
    }
}

/*
 * Implementation of Gaussian Elimination, editing data in-place.
 */
void row_reduce(double* m, const size_t rows, const size_t cols)
{
    if (!m || !rows || !cols) {
        return;
    }
    size_t pivot_row = 0;
    size_t pivot_col = 0;
    while (pivot_row < rows && pivot_col < cols) {
        size_t max_row = pivot_row;
        double pivot = m[pivot_row * cols + pivot_col];
        find_max_pivot_row(
            m, rows, cols, pivot_row, pivot_col, &max_row, &pivot);
        if (is_zero(pivot)) {
            // The pivot column is all zeros, so we can't reduce it any further.
            ++pivot_col;
            continue;
        }
        // Swap the pivot row with the max row.
        swap_rows(m, rows, cols, pivot_row, max_row);
        // Reduce the pivot row.
        multiply_row(m, cols, pivot_row, 1 / pivot);
        // Subtract multiples of the pivot row from all other rows.
        for (size_t i = 0; i < rows; ++i) {
            if (i == pivot_row) {
                continue;
            }
            subtract_pivot_row(m, cols, pivot_row, pivot_col, i);
        }
        ++pivot_row;
        ++pivot_col;
    }
}
```
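Since this C code has no Python dependencies, it can be reused and tested entirely outside of Python. Here's a minimal standalone driver (a hypothetical sketch, not part of the example repo), compiled with something like `gcc -o test_mylib test_mylib.c mylib.c -lm`:

```c
// test_mylib.c -- hypothetical standalone test for mylib.
#include <stdio.h>
#include "mylib.h"

int main(void)
{
    // Augmented matrix (row-major) for: x + 2y = 3, 4x + 5y = 6.
    double m[2 * 3] = { 1, 2, 3, 4, 5, 6 };
    row_reduce(m, 2, 3);
    // Expect: 1 0 -1 / 0 1 2.
    for (size_t i = 0; i < 2; ++i) {
        printf("%6.2f %6.2f %6.2f\n", m[i * 3 + 0], m[i * 3 + 1], m[i * 3 + 2]);
    }
    return 0;
}
```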
Now it's time to connect it to Python. I'll use CFFI to produce a shared library that can be imported as a Python module, and I will add a wrapper in myutils.py that will hide the interaction with the library from the module user. The code below is copy-pasted from the CFFI docs with adjustments to my library and function names; otherwise, it's very straightforward.
mylib_extension_build.py
```python
from cffi import FFI

ffibuilder = FFI()

# Copy the required definitions from the header file.
ffibuilder.cdef("""
void row_reduce(double* m, size_t rows, size_t cols);
""")

ffibuilder.set_source("_my",  # Name of the output C extension.
                      """
                      #include "mylib.h"
                      """,
                      sources=['mylib.c'],
                      libraries=['m'])

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)
```
Now we can simply run it to produce the shared library:
```sh
python mylib_extension_build.py
```
This will generate the C code implementing the Python C API wrapper around mylib (_my.c), and will compile it, producing a shared library file (called _my.cpython-311-x86_64-linux-gnu.so in my case).
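The generated `_my` module exposes two objects, `ffi` and `lib`, with the latter holding our C function. A quick smoke test from the build directory, assuming the build above succeeded:

```python
from _my import ffi, lib

print(lib.row_reduce)  # A wrapper around the C function from mylib.
```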
Now we can import it in myutils.py and add another version of the row-reduce function that uses it:
myutils.py
```python
from cffi import FFI

from _my import lib as mylib

ffi = FFI()

...

def row_reduce_c(input_m):
    """
    Row reduce a matrix to reduced row echelon form using Gaussian elimination.
    C version, copying the input data in Python.
    """
    # Make a copy of the input matrix.
    m = np.array(input_m, dtype=float)
    # Get a pointer to the numpy array data.
    d = ffi.from_buffer("double[]", m)
    mylib.row_reduce(d, m.shape[0], m.shape[1])
    return m
```
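A quick way to gain confidence in the C version is to check that it agrees with the Python one (a sketch, assuming both functions are importable from myutils):

```python
import numpy as np
import myutils as my

a = np.random.rand(6, 6)
assert np.allclose(my.row_reduce_py(a), my.row_reduce_c(a))
```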
Copying the data in C code
In the above version, we create a copy of the input matrix in Python code and pass it to the C code to be edited in place. Alternatively, we can make the copy of the data in C code and return a pointer to that copy. We'll find out below which way is more efficient, but the implementation goes as follows:
mylib.h
```c
/*
 * Implementation of Gaussian Elimination, returning a copy of the input data.
 */
double* row_reduce_copy(
    const double* in_m, const size_t rows, const size_t cols);
```
mylib.c
```c
#include <stdlib.h>
#include <string.h>

...

double* row_reduce_copy(
    const double* in_m, const size_t rows, const size_t cols)
{
    double* m = malloc(rows * cols * sizeof(double));
    if (!m) {
        return NULL;
    }
    memcpy(m, in_m, rows * cols * sizeof(double));
    row_reduce(m, rows, cols);
    // The caller is responsible for freeing the returned buffer.
    return m;
}
```
mylib_extension_build.py
```python
ffibuilder.cdef("""
void row_reduce(double* m, size_t rows, size_t cols);
double* row_reduce_copy(
    const double* in_m, const size_t rows, const size_t cols);
/* Declare the standard free() so Python can release the C-allocated copy. */
void free(void* ptr);
""")
```
myutils.py
```python
def row_reduce_c2(input_m):
    """
    Row reduce a matrix to reduced row echelon form using Gaussian elimination.
    C version, copying the input data in C.
    """
    # Get a pointer to the numpy array data.
    ind = ffi.from_buffer("double[]", input_m)
    # This will return a pointer to the raw data, which we have to process.
    outd = mylib.row_reduce_copy(ind, input_m.shape[0], input_m.shape[1])
    # Wrap the raw data into a buffer object and copy it out into bytes.
    b = ffi.buffer(outd, np.dtype("float").itemsize * input_m.size)[:]
    # The copy was allocated with malloc() in C, so release it now that the
    # data is safely copied into a Python bytes object.
    mylib.free(outd)
    # Feed it to numpy.
    m = np.frombuffer(b, dtype=float)
    return m.reshape(input_m.shape)
```
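An alternative to calling `free()` explicitly is CFFI's `ffi.gc()`, which ties the deallocation to the garbage collection of the returned pointer object; a sketch:

```python
# The C buffer will be released via free() once outd is garbage collected.
outd = ffi.gc(
    mylib.row_reduce_copy(ind, input_m.shape[0], input_m.shape[1]),
    mylib.free)
```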
Benchmarking
Next, let's measure how much faster the C version is compared to the Python one, and whether it makes any difference to copy the data in Python or in C code.
For this, I made a simple benchmark with the following protocol:
- For each matrix size (5x5, 10x10, 100x100, 200x200, 500x500, 1000x1000):
  - Generate 100 random matrices of that size.
  - Run each implementation on each matrix, measuring the execution time.
  - Report stats (average, median, 90th and 99th percentiles).
benchmark.py
```python
import time

import numpy as np

import myutils as my


def timed_sample(f):
    start_time = time.monotonic_ns()
    f()
    return time.monotonic_ns() - start_time


def print_stats_header():
    print("version matrix count average median 90th 99th")


ns_in_ms = 1000000


def print_stats(version, matrix_size, samples):
    count = len(samples)
    avg = np.average(samples) / ns_in_ms
    median = np.median(samples) / ns_in_ms
    percentile90 = np.percentile(samples, 90) / ns_in_ms
    percentile99 = np.percentile(samples, 99) / ns_in_ms
    print(f"{version:7} {matrix_size:8d} {count:10d} {avg:12.4f} "
          f"{median:12.4f} {percentile90:12.4f} {percentile99:12.4f}")


if __name__ == '__main__':
    # Warm up the library, otherwise we'd get one slow first call.
    my.row_reduce_c(np.array([[1, 2, 3], [4, 5, 6]]))
    print_stats_header()
    for i in (5, 10, 100, 200, 500, 1000):
        pysamples = []
        csamples1 = []
        csamples2 = []
        for j in range(100):
            m = np.random.rand(i, i)
            pysamples.append(timed_sample(lambda: my.row_reduce_py(m)))
            csamples1.append(timed_sample(lambda: my.row_reduce_c(m)))
            csamples2.append(timed_sample(lambda: my.row_reduce_c2(m)))
        print_stats("Python", i, pysamples)
        print_stats("C1", i, csamples1)
        print_stats("C2", i, csamples2)
```
On my machine, I got the following results (all times in milliseconds):
version | matrix size | count | average | median | 90th | 99th |
---|---|---|---|---|---|---|
Python | 5 | 100 | 0.0808 | 0.0787 | 0.0882 | 0.1796 |
C1 | 5 | 100 | 0.0029 | 0.0026 | 0.0037 | 0.0088 |
C2 | 5 | 100 | 0.0041 | 0.0036 | 0.005 | 0.0118 |
Python | 10 | 100 | 0.242 | 0.2367 | 0.2612 | 0.3221 |
C1 | 10 | 100 | 0.0034 | 0.0029 | 0.0049 | 0.0092 |
C2 | 10 | 100 | 0.0046 | 0.0039 | 0.0065 | 0.0132 |
Python | 100 | 100 | 17.5802 | 17.4525 | 17.8326 | 19.6342 |
C1 | 100 | 100 | 0.1713 | 0.1648 | 0.1808 | 0.253 |
C2 | 100 | 100 | 0.1934 | 0.1885 | 0.2076 | 0.271 |
Python | 200 | 100 | 78.3592 | 76.9973 | 81.8242 | 104.076 |
C1 | 200 | 100 | 1.3269 | 1.2979 | 1.4478 | 1.754 |
C2 | 200 | 100 | 1.4206 | 1.3903 | 1.5575 | 1.6921 |
Python | 500 | 100 | 573.1949 | 558.0741 | 643.4814 | 876.2773 |
C1 | 500 | 100 | 25.8631 | 24.7236 | 28.7465 | 46.2102 |
C2 | 500 | 100 | 27.2346 | 26.3597 | 31.2285 | 42.2302 |
Python | 1000 | 100 | 2668.173 | 2619.901 | 2893.422 | 3211.205 |
C1 | 1000 | 100 | 202.2592 | 191.0541 | 239.1883 | 297.3644 |
C2 | 1000 | 100 | 202.4343 | 197.2587 | 227.2721 | 253.8931 |
It's no surprise that the C implementation is significantly faster than the Python one, although it was interesting to see that the maximal speedup (roughly 100x) occurs at the 100x100 matrix size, after which the "Py to C speed ratio" decreases for larger matrices.
This may be explained by NumPy itself using very efficient low-level implementations for the array multiplication/subtraction operations that I'm employing in the `row_reduce_py` function. So, with larger matrices, the effect of moving all matrix manipulation into C becomes less significant, as more time is spent in these efficient parts of NumPy rather than in my own Python code. Still, the C version remains approximately 10x faster for 1000x1000 matrices, which makes the effort worthwhile.
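To illustrate the point, compare a single vectorized NumPy row operation against the same arithmetic in interpreted Python (an illustrative sketch; exact numbers will vary by machine):

```python
import timeit

import numpy as np

row = np.random.rand(1000)
factor = 2.0

# The vectorized row update, as used in row_reduce_py: runs in C inside NumPy.
print(timeit.timeit(lambda: row - factor * row, number=10000))

# The equivalent pure-Python loop, typically orders of magnitude slower.
print(timeit.timeit(lambda: [x - factor * x for x in row], number=10000))
```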
Another interesting observation is that the C2 version ended up being a tiny bit slower than C1, although the difference diminishes as the matrix size grows.
It's apparent that NumPy's `np.array` function performs a more efficient data copy than what I'm doing in `row_reduce_copy`, so it makes sense to use the facilities of NumPy (or other mature libraries) as much as possible, only implementing in C the parts that would otherwise be slowly interpreted by Python.
Conclusion
The above demonstrates that integrating C code into a Python program can be quite straightforward, especially when there's no need to expose the Python and NumPy type systems directly to the C code (which would require a deeper dive into Python's C API) and a wrapper like CFFI can be used.
While I achieved a significant speedup by moving my own algorithm into C, I would advise first using what's already available in existing libraries like NumPy, as any existing algorithm implementation there is very likely to be more efficient than your own version.
There are some assumptions in this example that would have to be handled if you were to build something for real-world usage:
- I'm assuming that NumPy arrays with `dtype=float` are used, which is equivalent to C's `double`.
- I'm assuming that the NumPy array has the C memory layout (row-major, which is the default). If this is not the case in your application, you would have to handle that by either calling `numpy.ascontiguousarray` before `ffi.from_buffer` (see the sketch below), or taking the column-major order into account in your C code.
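For example, a defensive conversion before handing the data to C might look like this (a sketch; `as_c_matrix` is a hypothetical helper, not part of the example repo):

```python
def as_c_matrix(a):
    # Ensure float64 elements and a contiguous row-major (C) layout,
    # copying only when necessary.
    return np.ascontiguousarray(a, dtype=float)
```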
The provided example is also single-threaded, so there's room for further speedup by, for example, running `subtract_pivot_row` for several rows in parallel, as sketched below.
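For instance, the elimination loop inside row_reduce could be annotated with OpenMP (a sketch, assuming the code is compiled with -fopenmp; not part of the example repo):

```c
// Each row update is independent of the others, so the loop can be
// parallelized safely across threads.
#pragma omp parallel for
for (size_t i = 0; i < rows; ++i) {
    if (i != pivot_row) {
        subtract_pivot_row(m, cols, pivot_row, pivot_col, i);
    }
}
```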
The full code for this example is available at https://github.com/vf42/numpy-cffi-example/.