Development ideas for newcomers!

ascillitoe · 1 March 2021 12:34

We’ve had a number of inquiries recently regarding ideas/issues suitable for newcomers to contribute to. A few areas which might be quite suitable for this are our new dataset generation and plotting utilities, along with further development of our tutorials.

To bring all of the above together, below are a few short “micro-project” ideas. Feel free to have a go at these, and do post code here so that we can provide feedback and advice. These short projects would be a good way to learn the basics of the code, as well as contribute some PR’s to the dataset loading and plotting modules.

Sensitivity analysis of aerofoil noise: This will first involve fitting an equadratures Poly() to the UCI machine learning repo airfoil noise dataset, and then computing Sobol indices to understand the key drivers behind the aerofoil noise. This would be a good opportunity to develop a new dataset loader in datasets.py, in order to streamline the process of loading UCI datasets (see this post). It would then be great to develop a plotting method (see this post) to plot the Sobol indices i.e. Poly().plot_sobol(). Feel free to pick a different UCI dataset (e.g. boston housing, Wine etc) if another one interests you more!
Turbomachinery blade design via dimension reduction: In this recent aeronautical journal article, equadratures is used to explore the design of three turbomachinery blades. This small project would involve porting the data from its current repo to our equadratures dataset repo, so that the data can be easily loaded into equadratures. You could then obtain dimension reducing subspaces for the data, and develop plotting methods (see this post) to reproduce the key visualisations from the paper (we can help with this!). - Thanks to @Nick for this idea.

In addition to PR’s associated with the dataset loading and plotting methods, we’d love to have your micro-projects included in our tutorials if they’re done well Another area where we welcome help is identifying and fixing bugs and/or poor error reporting. If while having a go at any of the above you encounter issues, please do flag these via Github issues!

ascillitoe · 22 March 2021 14:03

Responding to @Simardeep27’s question here: Google Summer of Code 2021 Projects - #24 by Simardeep27, which was " have gone through a lot of research papers but was not able to find a formula for sound pressure levels in terms of the given parameters".

The aerofoil noise UCI dataset comes from real experimental measurements of aerofoil noise in a wind tunnel. There are probably low order correlations (i.e. formula’s) to relate the noise (SPL) to the independent variables (Angle of attack, chord length etc), but they aren’t necessary for the purposes of this task. We already have the measured SPL (our y), and the independent variables (our \boldsymbol{x}). There is a a true functional mapping relating the two i.e. f(\boldsymbol{x}), but our task is to approximate this by fitting a polynomial model g(\boldsymbol{x})\approx f(\boldsymbol{x}). We can then use g(\boldsymbol{x}) to compute the sensitivity indices in order to learn more about the physical relationships in the data.

The Regularisation 1: Handling real-world data post you mentioned has an example of doing something similar for a piston. I would recommend starting off by simply trying to fit a suitable polynomial to the aerofoil data. Please do feel free to share code or ask more questions!

Simardeep27 · 22 March 2021 14:49

Thanks a lot for the clarification,

Actually , I was creating a separate function for calculating the SPL , with the features(Chord length, Frequency etc.) as the parameters to the function and then taking my X as the input features and y as my calculated SPL and then splitting them.
This was the research paper I was referring to
https://www.researchgate.net/publication/343218882_Investigation_of_the_mechanics_of_the_airfoil_application_of_machine_learning_in_the_noise

The parameters for the SPL equation was found using a coupled linear regression model.

I will implement the method provided by you and post the code here as soon as possible.

Simardeep27 · 24 March 2021 16:58

Hello @ascillitoe , I was working on the model, I obtained the coefficients for the g(x) function, but the r2_score was nearly 0.75, so should I try different models too?
This is the link to the notebook
https://colab.research.google.com/drive/1826YpI8xHvr5udTPmzhNn6lKCwgXog28?pli=1&authuser=2#scrollTo=azwTlpxdIyi1

ascillitoe · 24 March 2021 17:44

Hi @Simardeep27, an R^2 score of 0.75 doesn’t sound too bad. Is this on the training data, or on withheld test data? From memory, I believe the highest test R^2 score I’ve gotten with equadratures is 0.84, and that was with a more complex polynomial regression tree. I think a fully grown random forest is able to get up to 0.95, but then you’re losing a lot of interpretability here.

For the purposes of what we’re doing here I think 0.75 is OK (assuming its test score not training). The Sobol indices will still be useful, as long as you remember that they are telling you about the sensitivity of the model g(x) to the inputs, and that is not strictly identical to the sensitivity of the true function f(x).

P.s. I wasn’t able to access your notebook, are the privacy settings correct?

Simardeep27 · 25 March 2021 01:03

0.75 score is on the test data, I think even neural network can also give a better result,
Oh, I will make it public, sorry for the delay.

ascillitoe · 25 March 2021 11:19

Hi @Simardeep27, yes I’d expect a deep neural net to give a very high accuracy score here. I took a look at your notebook, looking good! I went ahead and tried matching your scikit-learn models with equadratures:

I’m able to get $R^2=0.74$ with a 3rd order least-squares poly, and $R^2=0.84$ with a 9th order regularised poly, so OK I think! The next step here, and where you could make a nice code contribution, is to plot the Sobol indices like I did in the regularisation blog post. You could package this up into a nice Poly().plot_sobol() method. The ability to plot higher order Sobol indices would be even better!

From a learning point of view, it would be interesting to look at how these Sobol indicies compare to just looking at the polynomial coefficients from a standard linear regression, or even more advanced neural net and random forest interpretation techniques.

Simardeep27 · 25 March 2021 11:38

Thanks for the advice @ascillitoe ,
I alternatively tried a DecisionTreeRegressor and got 0.84 R2_score, but the problem was to map the g(x) function as we require the coefficients but sklearn does not provide any method for finding the coefficients for tree methods , so I used PolynomialRegressor.

I would try to complete this part as soon as possible
Thanks again.

ascillitoe · 25 March 2021 12:01

Sounds good!

Just a quick note - we are not strictly interested in the polynomial (or other) model’s coefficients here. For high dimensional (and high order) polynomials the coefficients get rather confusing anyway. We’re actually interested in the Sobol’ indices, which are a global sensitivity measure, which decompose the variance of the model output into fractions which can be attributed to its inputs.

Comparing R^2 to scikit-learn polynomials and random forests etc is a good exercise to tell us if our equadratures polynomial model is way off in terms of accuracy, But the exercise here is really about fitting an equadratures polynomial, and then using it to compute Sobol indices. That being said, comparing what the Sobol indices tell you to what you get from some of these methods applied to RF’s etc would be very cool!

Simardeep27 · 25 March 2021 15:25

Oh actually what I was thinking that we need to find the index using a model from sklearn and then find the g(x) function for applying get_sobol_indices method. Thanks a lot for the clarification.

I will surely to try to compare them from the interpretation methods, also , I found a way to plot sobol indices for higher order, we can just increase the number of for loops in the main sobol_plot method. The complexity is currently high, I’ll try to reduce it and post it here.

Simardeep27 · 26 March 2021 06:33

Hi @ascillitoe , I created the method for plotting higher index sobol values (order=1,2,3).
I just have one doubt when evaluating sobol values for higher index, the value of some parameters are really low, hence they are not plotted on the graph, should I try to reduce the scale such that we can get some value from the plot or this would work fine?
Here is the notebook:

P.S. - The equadrature polynomial shows a negative R2_value on the test score for the same code but on training it shows near 0.98 score, so the model has to be changed right?

ascillitoe · 26 March 2021 10:31

Hi @Simardeep27, sorry, my mistake there. The poly in line y_pred_train = poly.get_polyfit(X_train.to_numpy()) and the y_pred_test line should be regpoly (i.e. the regularised polynomial). With that you should should be able to get a better R^2 with higher orders. Otherwise, you could stick with poly but drop the order back to 3 or 4.

The plotting method is getting there! The Sobol indices themselves should make more sense once the R^2 is better. I think the way you are doing the different orders makes sense, and probably a good idea to stop at order 3. In terms of a PR if you’d like to extend plot_Sobol_indices() in plot.py to cope with higher orders that would be great! This current method could also do with a bit of work:

Adding rotation=45 (and maybe `ha=‘right’) to the xticklabels would help with visualisation.
Might be an idea to have the default be to label the xticks S_1, S_2 etc, (and S_{11}, S_{12} etc for higher order), and have an optional argument for the user to provide the full feature names to use as labels.

P.s. just for your own info, in case you’re wondering what the higher order Sobol indices mean. they represent the variance of the model output due to the interaction of the input terms. For example S_{12} tells you how much the interaction between parameters 1 and 2 contributes to the variance in the output.

p.p.s. your work on this would make a nice blog post! Especially with some comparisons with random forests etc

ascillitoe · 26 March 2021 11:03

By the way @Simardeep27, you could also duplicate plot_Sobol_indices() in plot.py to create plot_total_Sobol_indices(), which would call .get_total_sobol_indices() and plot the total Sobol indices. You wouldn’t need to specify the order for these ones.

Simardeep27 · 26 March 2021 14:25

I am able to plot the higher order indices, I was thinking of taking a list of parameters as input for the function but the problem is about setting the default xticks like S1,S2,S3 because we need to know the length of the parameters.

Sure I would work on random forests just after completing this task.
Also, I thought about the idea you mentioned, of joining the plot_sobol_indices , so would it create sobol indices for the three order directly?

ascillitoe · 26 March 2021 15:47

Sorry I don’t quite get what you mean here? I would take a look at the way the plotting is currently done by Poly.plot_sobol_indices() within plot.py (in the develop branch). You could try out the current method in your notebook by running regpoly.plot_Sobol_indices(). I just meant that instead of calling the xticks “Parameter 1”, “Parameter 2” etc, we could save some space by renaming the defaults S_1, S_2 etc. Basically line 237 would just become:

 xTickMarks = [r'$S_%d$' %j for j in range(0, regpoly.dimensions)]

Having the feature names passed in as a list should also be doable as you can simply index it within the loop.

Currently, plot_Sobol_indices() only plots the first order Sobol indices i.e. the ones you get from get_sobol_indices(1), so a relatively straightforward extension would be to extend it to do the second and third order Sobol indices. Plus you could do the total order ones too.

With this I only meant that if you wanted to write a short blog post on your work here, you could include the comparisons you’ve already done looking at the test accuracies. You could also look at different interpretation methods but only if it’s something you were interested in!

Simardeep27 · 27 March 2021 02:35

Actually I was having problem in setting the default parameter earlier, it is working fine now by adding the code you provided. So, here the user would have a option to provide the parameters as a list or if not provided the default values will be used, This is the implementation, should I make a PR?

I would definitely write a blog and start working with the interpretation methods.

ascillitoe · 28 March 2021 12:55

Hi @Simardeep27, looking good! A PR to develop would be great. Just a very minor point, parameters=None might be a better default…

Simardeep27 · 28 March 2021 13:16

Sure, I will change it to default, I will start working on the blog next.
Thanks again

Simardeep27 · 30 March 2021 06:24

Hi @ascillitoe , I have started working on the blog and was wondering if the model interpretation methods like Partial Dependence Plot, ALE, etc. can be added to Equadratures since a lot of insights can be gained from these plots.

ascillitoe · 30 March 2021 13:38

Hi @Simardeep27, the short answer is I don’t think we know. We haven’t really looked into any of these “model agnostic” interpretation techniques in the context of equadratures. By “model agnostic” I mean techniques such as PDP and ALE which generally work by sampling the predictions from your model (in contrast to Sobol’ indices which are calculated directly from the polynomial coefficients).

Re the blog post I mentioned some of these methods as thought it might be interesting to look at what using these methods on the random forest tells you about the airfoil dataset, compared to what you learn from the equadratures Sobol’ indices. If there was a convincing argument to also apply them to the equadratures polynomials we could have a think about that too. However, I wonder if some of the existing interpretability libraries such as ELI5 and PDPbox might actually be able to accept an equadratures Poly already, if we were to create a .predict() method which wraps around get_polyfit()? That being said, it could still be an idea to add some of this functionality into equadratures itself, but only if any of the techniques really do appear to be useful.