The Quest for Speed - An Indicative Survey in Improving Performance

Introduction

The equadratures (EQ) code is based purely on Python and some standard core packages, with the majority of linear algebra performed using vectorised numpy operations.

Although the use of the numpy module significantly improves the speed of operations, the inherent simplicity of Python’s syntax and dynamic typing lends itself to a marked performance drop compared to statically typed languages such as C, its derivatives and Fortran.

The advantage behind Python lies in the ability to rapidly prototype code and the wealth of modules developed to simplify code development. The performance drop generally tends to be an acceptable disadvantage when taking these advantages into account. However, as a code matures and the size of the problems start to expand, the run-time starts to become a higher priority.

This post is simply a step in exploring the wide gamut of the methods that exist to seamlessly reduce runtimes for a user. It is simply recorded here for posterity, and to gain a quantifiable insight into the performance improvements that could be obtained via the use of C++/Fortran.

Test-case

There are well-established modules in development that can improve the speed of much of the code in EQ such as Cython and Numba. This post focuses on a first step towards understanding the benefits of using C++ and Fortran to accelerate the newly integrated conjugate descent algorithm for the elastic net regularisation method. A brief intro to the method has been laid out in one of the GSoC proposals for 2021 link here.

In brief, the elastic-net regularized regression method combines the l_1 and l_2 penalty terms of both the standard ridge and Lasso regression and aims to minimise the following loss function:

enet-loss

Figure 1: Elastic net loss function

where \alpha is a mixing term between ridge (\alpha=0) and Lasso regression (\alpha=1), and \lambda is a tunable parameter that in EQ is optimised for using the coordinate descent method. This coordinate descent implementation is the target for speedup in this post.

For more details regarding the elastic net regression method, please consider reading the upcoming amazingly and beautifully written post by @ascillitoe - Ze Beauty of Elastic Net Algo.

Data-generation

The gen_linear method from equadratues.dataset is used to create a series of synthetic test data for regression using the aforementioned elastic net method.

from equadratures.datasets import gen_linear

X, y = gen_linear(n_observations=r, n_dim=c, n_relevent=c, bias=0.0, noise=0.0, random_seed=x)

where, r and c are number of observations and dimensions for the test data randomly generated. They are increased to create a 2D contour of runtime for Python, Fortran and C++ codes for performance comparison.

Part I: A Fortran Story

Although there exist many modules to speed up pre-existing Python code, including those that aim to be a bridge between Python and statically typed languages such as C (Cython), Fortran is well-established within the scientific community and is capable of basic linear algebra without requiring external libraries. Furthermore, in terms of syntax, it is relatively simple (compared to C) and easy to learn with a small learning curve for Python developers.

The numpy module is already a requirement in using EQ, and has a tool called F2PY which creates an interface between Python and Fortran without the need for external libraries for basic linear algebra operations or having to write local subroutines to do so. Although in this case, the OpenBLAS library is also optionally used.

Without delving into the syntax differences between Python and Fortran, a few additions are required in the subroutine arguments when using F2PY compared to a standalone Fortran compiled code.

    #Python function definition
    def _elastic_net_cd_py(x, A, b, lamda, alpha, max_iter, tol, verbose):
    !Fortran subroutine definition
    subroutine elastic_net_cd_for(x, A, b, lambda, alpha, max_iter, dx_tol, verbose, xn, An, Am, bn)

Proceeding the verbose argument, an additional set of integers are listed (xn, An, Am, bn) which correspond to the dimensions of the vector/arrays, x, A, and b. These do not have to be included in the python call to the subroutine - as F2PY is capable of inferring these dimensions through the numpy array inputs. They are required in any Fortran function/subroutine definitions one may write in the future. Apart from this slight addition, the rest of the code can be written as any other standalone Fortran code.

Part II: A C++ Story

C++ is a ubiquitous language with compiler support found in even the most ancient of linux distros in one form or another. Its general object-oriented nature and the wealth of containers, algorithms and functions found under the Standard Template Library (STL) make it an attractive language to consider.

Initially, the xtensor library along with their BLAS and Python extensions and pybind11 were used to code the coordinate descent algorithm in C++. Their approach at mimicking the numpy API made it an impressive contender in this test due to the smaller learning curve for any developer with only Python experience. For this reason and mainly having prior experience in using it for another project, the initial C++ version of the code was made using the xtensor library. However, upon extensive use of the array slicing, a significant drop in performance (final code being slower than the pure numpy version) led to alternatives being considered.

For performance, working with raw C-arrays would be ideal, but due to the number of common operations being required for EQ, and wanting to keep the complexity down, the rather battle-hardened Eigen library with pybind11 for Python to C++ interface was selected. Its ability to implicitly use BLAS, LAPACK and/or Intel MKL depending on the available libraries in the system without requiring explicit calls made it an attractive proposition. No check were completed on which aspects of the code were being pushed to BLAS, but multi-threading was verified as working in this test. Thus, there may still be performance being left on the table by being explicit in the BLAS usage.

Python Interface Definition:

For pybind11, additional boiler plate code is required to interface any C++ function/class to Python.

namespace py = pybind11;

PYBIND11_MODULE(solver_cpp, m)
{
  m.doc() = "Coordinate Descent for Elastic Path";

  m.def("elastic_net_cd_cpp", &elastic_net_cd_cpp, "Perform coordinate descent",
        py::arg("x").noconvert() = NULL,
        py::arg("A").noconvert() = NULL,
        py::arg("b").noconvert() = NULL,
        py::arg("lambda") = NULL,
        py::arg("alpha") = 1.0,
        py::arg("max_iter") = 100,
        py::arg("tol") = 1E-6,
        py::arg("verbose") = NULL
      );
}
  • PYBIND11_MODULE(solver_cpp, m)

    • A macro which takes in the required Python module name solver_cpp and a variable m as its args.
  • m.doc() = "Coordinate Descent for Elastic Path";

    • Docstring passed to the Python code
  • m.def("elastic_net_cd_cpp", &elastic_net_cd_cpp, "Perform coordinate descent",

    • The python module will have a function with name elastic_net_cd_cpp (1st argument)
    • The C++ function reference is given as the second argument &elastic_net_cd_cpp
    • With an explanation of the function given as the third argument
  • py::arg(_) = *,

    • Each of the input arguments are labelled and given a default value (optional).
    • The .noconvert() annotation is used to ensure that if the array input is not in the expected memory storage layout, then it will stop with an error - this can be removed in which case the code will proceed as normal, but Eigen will create a new copy with the correct memory storage layout which is an expensive process.

Note:

Eigen does not have a separate concept of a vector, i.e. a Numpy vector with shape (n,) will be an array of shape (n,1) within Eigen - this may be of consequence depending on how any returned data is further processed using Numpy. The vector form can be reobtained by using the ravel function (array.ravel()) on the returned data if needed.

Commonality

Both Fortran and Eigen (by default) follow a Column-major storage format where indexing a multidimension array in linear memory would lead to element increment along columns. This is in contrast to Numpy where elements in a row are contiguous in memory. Depending on the operations, having Column-major storage can have performance benefits. In this case, the initialisation and generation of the large arrays is completed with Column-major format before passing the data to the respective language functions. Interestingly this also improves the performance of the numpy code by ~3x by simply switching the storage layout order from the default Row-major to Column-major for the purely numpy based coordinate descent code. This can be done by either initialising an array with Fortran order:

np.zeros((2, 1), order='F')

or, by changing the order of an existing array:

farray = np.asfortranarray(array)

This is redundant for 1D arrays/vectors, as they are stored in a contiguous manner in either case and can be thought of being either a row or a column of data.

Benchmarks

Numpy is an extremely efficient piece of code and is designed to run optimally depending on system configuration. Significant performance improvements can be obtained effectively for free without any code changes by ensuring that Numpy is linked with a BLAS library (there are various implementations out there, with Intel MKL, OpenBLAS and ATLAS being some of the performant versions). Some light reading:

The comparisons made here are with Numpy linked against OpenBLAS (verified by finding a link to the BLAS library using the approach found in the first link above). The Python version of the code has been written by Dr. Ashley Scillitoe, @ascillitoe.

The number of threads used for all the different languages can be controlled by setting the following environment variable:

export OMP_NUM_THREADS=$NTHREADS

where $NTHREADS is at most the maximum number of threads available to the system. It can also be set by preceding the program call with it:

OMP_NUM_THREADS=$NTHREADS python runscript.py

This is not mandatory however, as by default, it tends to run with the maximum number of threads available.

System Configuration

The results were obtained using the following hardware/software:

  • MacBook Pro 15-inch 2017
    • Intel i7-7920HQ @ 3.1GHz
    • 16GB 2133MHz
  • macOS 10.15.7
  • OpenBLAS 0.3.13
  • Python 3.8.6
  • Numpy 1.20.1
  • pybind11 2.6.2
  • Eigen 3.3.9

Results

ndims-all-Macbook

Figure 2: Contours of dimensions against Observations for Python, Fortran, and C++; and speedups relative to Python

active_dims-all-Macbook

Figure 3: Contours of dimensions against % of active dimensions with fixed 2500 observations for Python, Fortran, and C++; and speedups relative to Python

Interactive versions of the above plots can be opened by selecting them.

Figure 2 shows contours of runtime and potential speedup for each languages relative to Python. Overall, roughly 2-3 times the speedup can be seen for the largest of cases with the use of Fortran or C++ which is not insignificant. A similar runtime pattern is seen through all languages showing a linear improvement in performance with their use. A closer look is recommended at the speedup sub-plots via the interactive link as they show that with the highest number of observations and dimensions, Fortran and C++ still show a speedup of 2.8x, and 1.95x respectively. The minimum 1.5x speedups are concentrated in the region with only 100 observations. There still needs to be some thought put into improving the implementation of the basic coordinate descent algorithm.

This implementation of the coordinate descent algorithm omits working on any inactive subspaces to reduce runtime. Thus, Figure 3 shows increasing number of dimensions in the y-axis and an increasing number of active dimensions in the x-axis. This is with a fixed number of observations of 2500. Interestingly the most benefit is found with the highest number of active dimensions, with it showing greater than 4x speedup with Fortran with 1000 dimensions of which 100% is active.

Summary

Rather significant improvements have been shown through the use of Fortran and C++ of between 2x-3x for even the largest of cases, with the worst being around 1.5x. The standard operations available within Fortran which helps in scientific computing, its higher performance in this study, and limited additional requirements shows that Fortran should be considered for accelerating EQ, especially with more loop-heavy parts of the code, where, statically-typed languages such as Fortran and C++ can flex their performance advantages.

Some thought also needs to be put into how the main EQ setup.py would have to be adapted to account for this dual language setup.

Please feel free to contribute any suggestions/corrections regarding the options out there and anything I may have missed out or may have possibly misconceived in the above approach. Especially with the use of xtensor if anyone is familiar - its simplicity makes it something that is quite desirable provided the performance is not impacted - which is something that I came across, with its performance seemingly much slower than numpy. If it helps, the initial xtensor version of the code can be made available for improvement.

More results utilising Microsoft Azure cloud computing for benchmarking will be forthcoming either in an upcoming post, or as an update to this post.

Files

The Fortran, C++ code and the Python scripts for benchmarking can be found in this GitHub Repo.

  • 0-benchmark-python-fortran.py

    • 1st plot benchmark, but using matplotlib instead.
  • 1-benchmark-python-fortran-ndims-vs-nactive.py

    • 2nd plot benchmark, by fixing observations to 2500, and varying number of active dimensions.
  • solver_fortran.f90

    • Fortran version of the code, but contains two different subroutines, elastic_net_cd_for and elastic_net_cd_purefor
    • The first subroutine uses explicit calls to BLAS. It is multi-threaded.
    • The second is written simply using Fortran built-in calls. It is not multi-threaded.
    • Compiled by running: sh compile.sh
  • solver_cpp.cpp

    • C++ version of the code, but contains two different functions, elastic_net_cd_cpp and elastic_net_cd_cpp_nosets
    • The latter avoids the use of sets when checking for active dimensions.
    • Requires the Eigen library added to environment path. In this case, a version of it is included in the root directory, so no changes are required in the Path.
    • Compiled by running: python cppsetup.py build_ext -i; with the -i flag compiling the module in-place rather than globally - optional.
  • solver_python.py

    • Python version of the code - same as the current version on the develop branch.
    • Written by @ascillitoe.
  • compile.sh

    • Bash script to compile Fortran code.
  • cppsetup.py

    • Setup file to compile C++ code
2 Likes

Wow fantastic stuff @bubald! Looks like speedups of over 4x might be achievable for many of the problem sizes we typically encounter?

Just out of curiosity, any thoughts on why there is such an extreme speedup (or in other words python/numpy is so slow) for the very low dimensional cases? I would have guessed it was going to be the other way around, since our primary look is through the dimensions (columns of \mathbf{A}). :thinking: