Introduction
The equadratures (EQ
) code is based purely on Python and some standard core packages, with the majority of linear algebra performed using vectorised numpy
operations.
Although the use of the numpy
module significantly improves the speed of operations, the inherent simplicity of Python’s syntax and dynamic typing lends itself to a marked performance drop compared to statically typed languages such as C
, its derivatives and Fortran
.
The advantage behind Python lies in the ability to rapidly prototype code and the wealth of modules developed to simplify code development. The performance drop generally tends to be an acceptable disadvantage when taking these advantages into account. However, as a code matures and the size of the problems start to expand, the runtime starts to become a higher priority.
This post is simply a step in exploring the wide gamut of the methods that exist to seamlessly reduce runtimes for a user. It is simply recorded here for posterity, and to gain a quantifiable insight into the performance improvements that could be obtained via the use of C++
/Fortran
.
Testcase
There are wellestablished modules in development that can improve the speed of much of the code in EQ
such as Cython and Numba. This post focuses on a first step towards understanding the benefits of using C++
and Fortran
to accelerate the newly integrated conjugate descent algorithm for the elastic net regularisation method. A brief intro to the method has been laid out in one of the GSoC proposals for 2021 link here.
In brief, the elasticnet regularized regression method combines the l_1 and l_2 penalty terms of both the standard ridge and Lasso regression and aims to minimise the following loss function:
Figure 1: Elastic net loss function
where \alpha is a mixing term between ridge (\alpha=0) and Lasso regression (\alpha=1), and \lambda is a tunable parameter that in EQ
is optimised for using the coordinate descent method. This coordinate descent implementation is the target for speedup in this post.
For more details regarding the elastic net regression method, please consider reading the upcoming amazingly and beautifully written post by @ascillitoe  Ze Beauty of Elastic Net Algo.
Datageneration
The gen_linear method from equadratues.dataset
is used to create a series of synthetic test data for regression using the aforementioned elastic net method.
from equadratures.datasets import gen_linear
X, y = gen_linear(n_observations=r, n_dim=c, n_relevent=c, bias=0.0, noise=0.0, random_seed=x)
where, r and c are number of observations and dimensions for the test data randomly generated. They are increased to create a 2D contour of runtime for Python, Fortran and C++ codes for performance comparison.
Part I: A Fortran Story
Although there exist many modules to speed up preexisting Python code, including those that aim to be a bridge between Python and statically typed languages such as C (Cython
), Fortran
is wellestablished within the scientific community and is capable of basic linear algebra without requiring external libraries. Furthermore, in terms of syntax, it is relatively simple (compared to C) and easy to learn with a small learning curve for Python developers.
The numpy
module is already a requirement in using EQ
, and has a tool called F2PY which creates an interface between Python
and Fortran
without the need for external libraries for basic linear algebra operations or having to write local subroutines to do so. Although in this case, the OpenBLAS library is also optionally used.
Without delving into the syntax differences between Python and Fortran, a few additions are required in the subroutine arguments when using F2PY
compared to a standalone Fortran compiled code.
#Python function definition
def _elastic_net_cd_py(x, A, b, lamda, alpha, max_iter, tol, verbose):
!Fortran subroutine definition
subroutine elastic_net_cd_for(x, A, b, lambda, alpha, max_iter, dx_tol, verbose, xn, An, Am, bn)
Proceeding the verbose argument, an additional set of integers are listed (xn, An, Am, bn
) which correspond to the dimensions of the vector/arrays, x
, A
, and b
. These do not have to be included in the python call to the subroutine  as F2PY
is capable of inferring these dimensions through the numpy
array inputs. They are required in any Fortran function/subroutine definitions one may write in the future. Apart from this slight addition, the rest of the code can be written as any other standalone Fortran code.
Part II: A C++ Story
C++
is a ubiquitous language with compiler support found in even the most ancient of linux distros in one form or another. Its general objectoriented nature and the wealth of containers, algorithms and functions found under the Standard Template Library (STL) make it an attractive language to consider.
Initially, the xtensor library along with their BLAS and Python extensions and pybind11
were used to code the coordinate descent algorithm in C++
. Their approach at mimicking the numpy API made it an impressive contender in this test due to the smaller learning curve for any developer with only Python experience. For this reason and mainly having prior experience in using it for another project, the initial C++
version of the code was made using the xtensor library. However, upon extensive use of the array slicing, a significant drop in performance (final code being slower than the pure numpy version) led to alternatives being considered.
For performance, working with raw Carrays would be ideal, but due to the number of common operations being required for EQ
, and wanting to keep the complexity down, the rather battlehardened Eigen
library with pybind11 for Python
to C++
interface was selected. Its ability to implicitly use BLAS, LAPACK and/or Intel MKL depending on the available libraries in the system without requiring explicit calls made it an attractive proposition. No check were completed on which aspects of the code were being pushed to BLAS, but multithreading was verified as working in this test. Thus, there may still be performance being left on the table by being explicit in the BLAS usage.
Python Interface Definition:
For pybind11
, additional boiler plate code is required to interface any C++ function/class to Python.
namespace py = pybind11;
PYBIND11_MODULE(solver_cpp, m)
{
m.doc() = "Coordinate Descent for Elastic Path";
m.def("elastic_net_cd_cpp", &elastic_net_cd_cpp, "Perform coordinate descent",
py::arg("x").noconvert() = NULL,
py::arg("A").noconvert() = NULL,
py::arg("b").noconvert() = NULL,
py::arg("lambda") = NULL,
py::arg("alpha") = 1.0,
py::arg("max_iter") = 100,
py::arg("tol") = 1E6,
py::arg("verbose") = NULL
);
}

PYBIND11_MODULE(solver_cpp, m)
 A macro which takes in the required Python module name
solver_cpp
and a variablem
as its args.
 A macro which takes in the required Python module name

m.doc() = "Coordinate Descent for Elastic Path";
 Docstring passed to the Python code

m.def("elastic_net_cd_cpp", &elastic_net_cd_cpp, "Perform coordinate descent",
 The python module will have a function with name
elastic_net_cd_cpp
(1st argument)  The
C++
function reference is given as the second argument&elastic_net_cd_cpp
 With an explanation of the function given as the third argument
 The python module will have a function with name

py::arg(_) = *,
 Each of the input arguments are labelled and given a default value (optional).
 The
.noconvert()
annotation is used to ensure that if the array input is not in the expected memory storage layout, then it will stop with an error  this can be removed in which case the code will proceed as normal, but Eigen will create a new copy with the correct memory storage layout which is an expensive process.
Note:
Eigen does not have a separate concept of a vector, i.e. a Numpy vector with shape (n,)
will be an array of shape (n,1)
within Eigen  this may be of consequence depending on how any returned data is further processed using Numpy. The vector form can be reobtained by using the ravel function (array.ravel()
) on the returned data if needed.
Commonality
Both Fortran and Eigen (by default) follow a Columnmajor storage format where indexing a multidimension array in linear memory would lead to element increment along columns. This is in contrast to Numpy where elements in a row are contiguous in memory. Depending on the operations, having Columnmajor storage can have performance benefits. In this case, the initialisation and generation of the large arrays is completed with Columnmajor format before passing the data to the respective language functions. Interestingly this also improves the performance of the numpy code by ~3x by simply switching the storage layout order from the default Rowmajor to Columnmajor for the purely numpy
based coordinate descent code. This can be done by either initialising an array with Fortran order:
np.zeros((2, 1), order='F')
or, by changing the order of an existing array:
farray = np.asfortranarray(array)
This is redundant for 1D arrays/vectors, as they are stored in a contiguous manner in either case and can be thought of being either a row or a column of data.
Benchmarks
Numpy is an extremely efficient piece of code and is designed to run optimally depending on system configuration. Significant performance improvements can be obtained effectively for free without any code changes by ensuring that Numpy is linked with a BLAS library (there are various implementations out there, with Intel MKL, OpenBLAS and ATLAS being some of the performant versions). Some light reading:
The comparisons made here are with Numpy linked against OpenBLAS (verified by finding a link to the BLAS library using the approach found in the first link above). The Python version of the code has been written by Dr. Ashley Scillitoe, @ascillitoe.
The number of threads used for all the different languages can be controlled by setting the following environment variable:
export OMP_NUM_THREADS=$NTHREADS
where $NTHREADS
is at most the maximum number of threads available to the system. It can also be set by preceding the program call with it:
OMP_NUM_THREADS=$NTHREADS python runscript.py
This is not mandatory however, as by default, it tends to run with the maximum number of threads available.
System Configuration
The results were obtained using the following hardware/software:
 MacBook Pro 15inch 2017
 Intel i77920HQ @ 3.1GHz
 16GB 2133MHz
 macOS 10.15.7
 OpenBLAS 0.3.13
 Python 3.8.6
 Numpy 1.20.1
 pybind11 2.6.2
 Eigen 3.3.9
Results
Figure 2: Contours of dimensions against Observations for Python, Fortran, and C++; and speedups relative to Python
Figure 3: Contours of dimensions against % of active dimensions with fixed 2500 observations for Python, Fortran, and C++; and speedups relative to Python
Interactive versions of the above plots can be opened by selecting them.
Figure 2 shows contours of runtime and potential speedup for each languages relative to Python. Overall, roughly 23 times the speedup can be seen for the largest of cases with the use of Fortran
or C++
which is not insignificant. A similar runtime pattern is seen through all languages showing a linear improvement in performance with their use. A closer look is recommended at the speedup subplots via the interactive link as they show that with the highest number of observations and dimensions, Fortran and C++ still show a speedup of 2.8x, and 1.95x respectively. The minimum 1.5x speedups are concentrated in the region with only 100 observations. There still needs to be some thought put into improving the implementation of the basic coordinate descent algorithm.
This implementation of the coordinate descent algorithm omits working on any inactive subspaces to reduce runtime. Thus, Figure 3 shows increasing number of dimensions in the yaxis and an increasing number of active dimensions in the xaxis. This is with a fixed number of observations of 2500. Interestingly the most benefit is found with the highest number of active dimensions, with it showing greater than 4x speedup with Fortran
with 1000 dimensions of which 100% is active.
Summary
Rather significant improvements have been shown through the use of Fortran
and C++
of between 2x3x for even the largest of cases, with the worst being around 1.5x. The standard operations available within Fortran which helps in scientific computing, its higher performance in this study, and limited additional requirements shows that Fortran
should be considered for accelerating EQ
, especially with more loopheavy parts of the code, where, staticallytyped languages such as Fortran
and C++
can flex their performance advantages.
Some thought also needs to be put into how the main EQ
setup.py
would have to be adapted to account for this dual language setup.
Please feel free to contribute any suggestions/corrections regarding the options out there and anything I may have missed out or may have possibly misconceived in the above approach. Especially with the use of xtensor if anyone is familiar  its simplicity makes it something that is quite desirable provided the performance is not impacted  which is something that I came across, with its performance seemingly much slower than numpy
. If it helps, the initial xtensor version of the code can be made available for improvement.
More results utilising Microsoft Azure cloud computing for benchmarking will be forthcoming either in an upcoming post, or as an update to this post.
Files
The Fortran
, C++
code and the Python
scripts for benchmarking can be found in this GitHub Repo.

 1st plot benchmark, but using matplotlib instead.

1benchmarkpythonfortranndimsvsnactive.py
 2nd plot benchmark, by fixing observations to 2500, and varying number of active dimensions.


Fortran
version of the code, but contains two different subroutines,elastic_net_cd_for
andelastic_net_cd_purefor
 The first subroutine uses explicit calls to BLAS. It is multithreaded.
 The second is written simply using Fortran builtin calls. It is not multithreaded.
 Compiled by running:
sh compile.sh



C++
version of the code, but contains two different functions,elastic_net_cd_cpp
andelastic_net_cd_cpp_nosets
 The latter avoids the use of sets when checking for active dimensions.
 Requires the
Eigen
library added to environment path. In this case, a version of it is included in the root directory, so no changes are required in the Path.  Compiled by running:
python cppsetup.py build_ext i
; with thei
flag compiling the module inplace rather than globally  optional.



Python
version of the code  same as the current version on the develop branch.  Written by @ascillitoe.


 Bash script to compile Fortran code.

 Setup file to compile C++ code