Speeding up NumPy using CFFI
Below is a case study of using CFFI to manipulate NumPy array data in C code, with an execution speed comparison between the C implementation and a pure Python one. It may serve as a boilerplate solution for anyone looking to either use existing C code with Python and NumPy, or to speed up a Python app with their own C code that remains reusable outside of a Python context.
The Starting Point
I was missing a row-reduce function in NumPy that would simply convert a matrix to reduced row echelon form, rather than producing the final solution to a system of equations as `np.linalg.solve()` does. So, I went ahead and implemented one in Python, using the Gaussian elimination algorithm.
myutils.py
```python
import numpy as np

tolerance = np.finfo(float).eps * 10


def is_zero(x):
    """
    Check if x is effectively zero.
    """
    return np.abs(x) < tolerance


def row_reduce_py(input_m):
    """
    Row reduce a matrix to reduced row echelon form using Gaussian elimination.
    Python version.
    """
    m = np.array(input_m, dtype=float)  # Working with a copy to preserve input.
    rows, cols = m.shape
    pivot_row = 0
    pivot_col = 0
    while pivot_row < rows and pivot_col < cols:
        # Find the row with the largest value in the pivot column.
        max_row = np.argmax(np.abs(m[pivot_row:, pivot_col])) + pivot_row
        if is_zero(m[max_row, pivot_col]):
            # No nonzero pivot in this column, move to the next column.
            pivot_col += 1
            continue
        # Swap the pivot row with max_row.
        if pivot_row != max_row:
            m[[pivot_row, max_row]] = m[[max_row, pivot_row]]
        # Get the pivot value.
        pivot = m[pivot_row, pivot_col]
        # Divide the pivot row by the pivot value.
        m[pivot_row, :] /= pivot
        # Subtract multiples of the pivot row from all other rows.
        for i in range(rows):
            if i != pivot_row:
                m[i, :] -= m[i, pivot_col] * m[pivot_row, :]
        # Move to the next pivot row and column.
        pivot_row += 1
        pivot_col += 1
    return m
```
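As a quick sanity check, here's a minimal usage sketch (hypothetical, not part of the module itself), row reducing the augmented matrix of a small linear system:

```python
import numpy as np
from myutils import row_reduce_py

# Augmented matrix for the system: x + 2y = 3, 4x + 5y = 6.
a = np.array([[1, 2, 3], [4, 5, 6]])
print(row_reduce_py(a))
# Expected output:
# [[ 1.  0. -1.]
#  [ 0.  1.  2.]]
```

The last column now holds the solution x = -1, y = 2.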
This did the job for me, and the performance wasn't an issue since I only used it with very small matrices anyway. However, I wanted to get some practice with augmenting the Python code with C, so I went ahead with that.
Speeding Up
NumPy has a C API that allows one to work with NumPy objects directly from C. However, I wanted to pick a library that would take on the burden of gluing my C and Python code together, without requiring a deep dive into the details of the integration. CFFI's primary use case is integrating existing C libraries with Python code, so it seemed suitable upon a quick comparison with the other options out there (mainly the ones on the list in the Python docs here).
The C version
Let's start with the header file, declaring the single function that we'll expose to the outside world. My Python module is called myutils, so the C part will be called mylib.
mylib.h
```c
#ifndef MYLIB_H
#define MYLIB_H

#include <stddef.h>

/*
 * Implementation of Gaussian Elimination, editing data in-place.
 */
void row_reduce(double* m, const size_t rows, const size_t cols);

#endif
```
And now for the implementation of row reduction in C. Note that, compared to the Python code, I don't make a copy of the input matrix here and am changing it in place; we'll handle that in the Python code.
mylib.c
```c
#include "mylib.h"

#include <math.h>
#include <stdbool.h>

// Using the same epsilon value as in the Python code: np.finfo(float).eps * 10.
bool is_zero(double x) { return fabs(x) < 2.220446049250313e-15; }

// Find the row with the largest absolute value in the pivot column.
extern inline void find_max_pivot_row(double* m, size_t rows, size_t cols,
    size_t pivot_row, size_t pivot_col, size_t* max_row, double* pivot)
{
    for (size_t i = pivot_row + 1; i < rows; ++i) {
        if (fabs(m[i * cols + pivot_col]) > fabs(*pivot)) {
            *max_row = i;
            *pivot = m[i * cols + pivot_col];
        }
    }
}

// Swap two rows.
extern inline void swap_rows(
    double* m, size_t rows, size_t cols, size_t row1, size_t row2)
{
    if (row1 != row2) {
        // TODO: Use memcpy?
        for (size_t i = 0; i < cols; ++i) {
            double temp = m[row2 * cols + i];
            m[row2 * cols + i] = m[row1 * cols + i];
            m[row1 * cols + i] = temp;
        }
    }
}

// Multiply a row by a scalar.
extern inline void multiply_row(double* m, size_t cols, size_t row, double x)
{
    for (size_t i = 0; i < cols; ++i) {
        m[row * cols + i] *= x;
    }
}

// Subtract the pivot row from a given row, achieving 0 in the pivot column.
extern inline void subtract_pivot_row(
    double* m, size_t cols, size_t pivot_row, size_t pivot_col, size_t i)
{
    double factor = m[i * cols + pivot_col];
    for (size_t j = pivot_col; j < cols; ++j) {
        m[i * cols + j] -= factor * m[pivot_row * cols + j];
    }
}

/*
 * Implementation of Gaussian Elimination, editing data in-place.
 */
void row_reduce(double* m, const size_t rows, const size_t cols)
{
    if (!m || !rows || !cols) {
        return;
    }
    size_t pivot_row = 0;
    size_t pivot_col = 0;
    while (pivot_row < rows && pivot_col < cols) {
        size_t max_row = pivot_row;
        double pivot = m[pivot_row * cols + pivot_col];
        find_max_pivot_row(
            m, rows, cols, pivot_row, pivot_col, &max_row, &pivot);
        if (is_zero(pivot)) {
            // The pivot column is all zeros, so we can't reduce it any further.
            ++pivot_col;
            continue;
        }
        // Swap the pivot row with the max row.
        swap_rows(m, rows, cols, pivot_row, max_row);
        // Reduce the pivot row.
        multiply_row(m, cols, pivot_row, 1 / pivot);
        // Subtract multiples of the pivot row from all other rows.
        for (size_t i = 0; i < rows; ++i) {
            if (i == pivot_row) {
                continue;
            }
            subtract_pivot_row(m, cols, pivot_row, pivot_col, i);
        }
        ++pivot_row;
        ++pivot_col;
    }
}
```
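Since this C code has no Python dependencies, it can be reused and tested entirely outside of Python. Here's a minimal standalone driver (a hypothetical sketch, not part of the example repo), compiled with something like `gcc -o test_mylib test_mylib.c mylib.c -lm`:

```c
// test_mylib.c -- hypothetical standalone test for mylib.
#include <stdio.h>
#include "mylib.h"

int main(void)
{
    // Augmented matrix (row-major) for: x + 2y = 3, 4x + 5y = 6.
    double m[2 * 3] = { 1, 2, 3, 4, 5, 6 };
    row_reduce(m, 2, 3);
    // Expect: 1 0 -1 / 0 1 2.
    for (size_t i = 0; i < 2; ++i) {
        printf("%6.2f %6.2f %6.2f\n", m[i * 3 + 0], m[i * 3 + 1], m[i * 3 + 2]);
    }
    return 0;
}
```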
Now it's time to connect it to Python. I'll use CFFI to produce a shared library that can be imported as a Python module, and I will add a wrapper in myutils.py that will hide the interaction with the library from the module user. The code below is copy-pasted from the CFFI docs with adjustments to my library and function names; otherwise, it's very straightforward.
mylib_extension_build.py
```python
from cffi import FFI

ffibuilder = FFI()

# Copy the required definitions from the header file.
ffibuilder.cdef("""
void row_reduce(double* m, size_t rows, size_t cols);
""")

ffibuilder.set_source("_my",  # Name of the output C extension.
                      """
                      #include "mylib.h"
                      """,
                      sources=['mylib.c'],
                      libraries=['m'])

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)
```
Now we can simply run it to produce the shared library:
```sh
python mylib_extension_build.py
```
This will generate the C code implementing the Python C API wrapper around mylib (_my.c), and will compile it, producing a shared library file (called _my.cpython-311-x86_64-linux-gnu.so in my case).
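The generated `_my` module exposes two objects, `ffi` and `lib`, with the latter holding our C function. A quick smoke test from the build directory, assuming the build above succeeded:

```python
from _my import ffi, lib

print(lib.row_reduce)  # A wrapper around the C function from mylib.
```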
Now we can import it in myutils.py and add another version of the row-reduce function that uses it:
myutils.py
```python
from cffi import FFI

from _my import lib as mylib

ffi = FFI()

...

def row_reduce_c(input_m):
    """
    Row reduce a matrix to reduced row echelon form using Gaussian elimination.
    C version, copying the input data in Python.
    """
    # Make a copy of the input matrix.
    m = np.array(input_m, dtype=float)
    # Get a pointer to the numpy array data.
    d = ffi.from_buffer("double[]", m)
    mylib.row_reduce(d, m.shape[0], m.shape[1])
    return m
```
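A quick way to gain confidence in the C version is to check that it agrees with the Python one (a sketch, assuming both functions are importable from myutils):

```python
import numpy as np
import myutils as my

a = np.random.rand(6, 6)
assert np.allclose(my.row_reduce_py(a), my.row_reduce_c(a))
```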
Copying the data in C code
In the above version, we create a copy of the input matrix in Python code and pass it to the C code to be edited in place. Alternatively, we can make the copy of the data in C code and return a pointer to that copy. We'll find out below which way is more efficient, but the implementation goes as follows:
mylib.h
```c
/*
 * Implementation of Gaussian Elimination, returning a copy of the input data.
 */
double* row_reduce_copy(
    const double* in_m, const size_t rows, const size_t cols);
```
mylib.c
```c
#include <stdlib.h>
#include <string.h>

...

double* row_reduce_copy(
    const double* in_m, const size_t rows, const size_t cols)
{
    double* m = malloc(rows * cols * sizeof(double));
    if (!m) {
        return NULL;
    }
    memcpy(m, in_m, rows * cols * sizeof(double));
    row_reduce(m, rows, cols);
    // The caller is responsible for freeing the returned buffer.
    return m;
}
```
mylib_extension_build.py
```python
ffibuilder.cdef("""
void row_reduce(double* m, size_t rows, size_t cols);
double* row_reduce_copy(
    const double* in_m, const size_t rows, const size_t cols);
/* Declare the standard free() so Python can release the C-allocated copy. */
void free(void* ptr);
""")
```
myutils.py
```python
def row_reduce_c2(input_m):
    """
    Row reduce a matrix to reduced row echelon form using Gaussian elimination.
    C version, copying the input data in C.
    """
    # Get a pointer to the numpy array data.
    ind = ffi.from_buffer("double[]", input_m)
    # This will return a pointer to the raw data, which we have to process.
    outd = mylib.row_reduce_copy(ind, input_m.shape[0], input_m.shape[1])
    # Wrap the raw data into a buffer object and copy it out into bytes.
    b = ffi.buffer(outd, np.dtype("float").itemsize * input_m.size)[:]
    # The copy was allocated with malloc() in C, so release it now that the
    # data is safely copied into a Python bytes object.
    mylib.free(outd)
    # Feed it to numpy.
    m = np.frombuffer(b, dtype=float)
    return m.reshape(input_m.shape)
```
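An alternative to calling `free()` explicitly is CFFI's `ffi.gc()`, which ties the deallocation to the garbage collection of the returned pointer object; a sketch:

```python
# The C buffer will be released via free() once outd is garbage collected.
outd = ffi.gc(
    mylib.row_reduce_copy(ind, input_m.shape[0], input_m.shape[1]),
    mylib.free)
```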
Benchmarking
Next, let's measure how much faster the C version is compared to the Python one, and whether it makes any difference to copy the data in Python or in C code.
For this, I made a simple benchmark with the following protocol:
- For each matrix size (5x5, 10x10, 100x100, 200x200, 500x500, 1000x1000):
  - Generate 100 random matrices of that size.
  - Run each implementation on each matrix, measuring the execution time.
  - Report stats (average, median, 90th and 99th percentiles).
benchmark.py
```python
import time

import numpy as np

import myutils as my


def timed_sample(f):
    start_time = time.monotonic_ns()
    f()
    return time.monotonic_ns() - start_time


def print_stats_header():
    print("version matrix count average median 90th 99th")


ns_in_ms = 1000000


def print_stats(version, matrix_size, samples):
    count = len(samples)
    avg = np.average(samples) / ns_in_ms
    median = np.median(samples) / ns_in_ms
    percentile90 = np.percentile(samples, 90) / ns_in_ms
    percentile99 = np.percentile(samples, 99) / ns_in_ms
    print(f"{version:7} {matrix_size:8d} {count:10d} {avg:12.4f} "
          f"{median:12.4f} {percentile90:12.4f} {percentile99:12.4f}")


if __name__ == '__main__':
    # Warm up the library, otherwise we'd get one slow first call.
    my.row_reduce_c(np.array([[1, 2, 3], [4, 5, 6]]))
    print_stats_header()
    for i in (5, 10, 100, 200, 500, 1000):
        pysamples = []
        csamples1 = []
        csamples2 = []
        for j in range(100):
            m = np.random.rand(i, i)
            pysamples.append(timed_sample(lambda: my.row_reduce_py(m)))
            csamples1.append(timed_sample(lambda: my.row_reduce_c(m)))
            csamples2.append(timed_sample(lambda: my.row_reduce_c2(m)))
        print_stats("Python", i, pysamples)
        print_stats("C1", i, csamples1)
        print_stats("C2", i, csamples2)
```
On my machine, I got the following results (all times in milliseconds):
version | matrix size | count | average | median | 90th | 99th |
---|---|---|---|---|---|---|
Python | 5 | 100 | 0.0808 | 0.0787 | 0.0882 | 0.1796 |
C1 | 5 | 100 | 0.0029 | 0.0026 | 0.0037 | 0.0088 |
C2 | 5 | 100 | 0.0041 | 0.0036 | 0.005 | 0.0118 |
Python | 10 | 100 | 0.242 | 0.2367 | 0.2612 | 0.3221 |
C1 | 10 | 100 | 0.0034 | 0.0029 | 0.0049 | 0.0092 |
C2 | 10 | 100 | 0.0046 | 0.0039 | 0.0065 | 0.0132 |
Python | 100 | 100 | 17.5802 | 17.4525 | 17.8326 | 19.6342 |
C1 | 100 | 100 | 0.1713 | 0.1648 | 0.1808 | 0.253 |
C2 | 100 | 100 | 0.1934 | 0.1885 | 0.2076 | 0.271 |
Python | 200 | 100 | 78.3592 | 76.9973 | 81.8242 | 104.076 |
C1 | 200 | 100 | 1.3269 | 1.2979 | 1.4478 | 1.754 |
C2 | 200 | 100 | 1.4206 | 1.3903 | 1.5575 | 1.6921 |
Python | 500 | 100 | 573.1949 | 558.0741 | 643.4814 | 876.2773 |
C1 | 500 | 100 | 25.8631 | 24.7236 | 28.7465 | 46.2102 |
C2 | 500 | 100 | 27.2346 | 26.3597 | 31.2285 | 42.2302 |
Python | 1000 | 100 | 2668.173 | 2619.901 | 2893.422 | 3211.205 |
C1 | 1000 | 100 | 202.2592 | 191.0541 | 239.1883 | 297.3644 |
C2 | 1000 | 100 | 202.4343 | 197.2587 | 227.2721 | 253.8931 |
It's no surprise that the C implementation is significantly faster than the Python one, although it was interesting to see that the maximal speedup (roughly 100x) occurs at the 100x100 matrix size, after which the "Py to C speed ratio" decreases for larger matrices.
This may be explained by NumPy itself using very efficient low-level implementations for the array multiplication/subtraction operations that I'm employing in the `row_reduce_py` function. So, with larger matrices, the effect of moving all matrix manipulation into C becomes less significant, as more time is spent in these efficient parts of NumPy rather than in my own Python code. Still, the C version remains approximately 10x faster for 1000x1000 matrices, which makes the effort worthwhile.
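To illustrate the point, compare a single vectorized NumPy row operation against the same arithmetic in interpreted Python (an illustrative sketch; exact numbers will vary by machine):

```python
import timeit

import numpy as np

row = np.random.rand(1000)
factor = 2.0

# The vectorized row update, as used in row_reduce_py: runs in C inside NumPy.
print(timeit.timeit(lambda: row - factor * row, number=10000))

# The equivalent pure-Python loop, typically orders of magnitude slower.
print(timeit.timeit(lambda: [x - factor * x for x in row], number=10000))
```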
Another interesting observation is that the C2 version ended up being a tiny bit slower than C1, although the difference diminishes as the matrix size grows.
It's apparent that NumPy's `np.array` function performs a more efficient data copy than what I'm doing in `row_reduce_copy`, so it makes sense to use the facilities of NumPy (or other mature libraries) as much as possible, only implementing in C the parts that would otherwise be slowly interpreted by Python.
Conclusion
The above demonstrates that integrating C code into a Python program can be quite straightforward, especially when there's no need to expose the Python and NumPy type systems directly to the C code (which would require a deeper dive into Python's C API) and a wrapper like CFFI can be used.
While I achieved a significant speedup by moving my own algorithm into C, I would advise first using what's already available in existing libraries like NumPy, as any existing algorithm implementation there is very likely to be more efficient than your own version.
There are some assumptions in this example that would have to be handled if you were to build something for real-world usage:
- I'm assuming that NumPy arrays with `dtype=float` are used, which is equivalent to C's `double`.
- I'm assuming that the NumPy array has the C memory layout (row-major, which is the default). If this is not the case in your application, you would have to handle that by either calling `numpy.ascontiguousarray` before `ffi.from_buffer` (see the sketch below), or taking the column-major order into account in your C code.
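For example, a defensive conversion before handing the data to C might look like this (a sketch; `as_c_matrix` is a hypothetical helper, not part of the example repo):

```python
def as_c_matrix(a):
    # Ensure float64 elements and a contiguous row-major (C) layout,
    # copying only when necessary.
    return np.ascontiguousarray(a, dtype=float)
```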
The provided example is also single-threaded, so there's room for further speedup by, for example, running `subtract_pivot_row` for several rows in parallel, as sketched below.
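For instance, the elimination loop inside row_reduce could be annotated with OpenMP (a sketch, assuming the code is compiled with -fopenmp; not part of the example repo):

```c
// Each row update is independent of the others, so the loop can be
// parallelized safely across threads.
#pragma omp parallel for
for (size_t i = 0; i < rows; ++i) {
    if (i != pivot_row) {
        subtract_pivot_row(m, cols, pivot_row, pivot_col, i);
    }
}
```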
The full code for this example is available at https://github.com/vf42/numpy-cffi-example/.