Variable naming in Poly

mdonnelly · 16 April 2021 12:01

Many of our models have multiple outputs; for example, a gas turbine performance model may predict pressures and temperatures at multiple stations in the gas path. As a result, it’s not uncommon to find ourselves training a large number of models from the same input dataset.

Our normal way of dealing with this is to fit all of these in one step with a list comprehension so we can then pass this list around various functions without needing to have lots of duplicate code for each separate output quantity of interest. The problem with this is that it’s then a bit tricky to remember which output variable is associated with which Poly.

A possible improvement could be to allow “naming” of the Poly objects which can then be accessed like any other attribute. It would make plotting a bit simpler as the plot title (or y-axis label) could then just be read directly from Poly rather than needing to keep track of a separate list of variable names. I think Parameter already does this via:

self.variable = variable

Could this also be added to Poly? I guess the assigned variable name should also be inherited in the derivative objects like Correlations etc. I’m happy to have a go at implementing as it should just be a quick addition.
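
For anyone following along, the pattern and the proposed change could be sketched roughly as below. `NamedModel` is a hypothetical stand-in, not the real Poly class, and its "fit" (a simple mean) is a placeholder so the snippet runs without equadratures installed. The output names and numbers are made up; only the optional `variable` argument mirrors what is being proposed.

```python
from statistics import mean


class NamedModel:
    """Hypothetical stand-in for Poly; NOT the equadratures API."""

    def __init__(self, outputs, variable=None):
        # Optional label for the output quantity, as proposed for Poly.
        self.variable = variable
        # Placeholder "fit": just store the mean of the training outputs.
        self.coefficient = mean(outputs)


# Illustrative training outputs keyed by output name (made-up numbers).
training_outputs = {
    "P30": [1.0, 2.0, 3.0],
    "T30": [10.0, 20.0, 30.0],
}

# Fit every output in one step; each model carries its own name.
models = [NamedModel(y, variable=name) for name, y in training_outputs.items()]

# Downstream code can read the label from the object itself.
labels = [m.variable for m in models]
print(labels)
```

With the name stored on each object, the list of models can be passed around on its own, without a parallel list of variable names.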

ascillitoe · 18 April 2021 11:23

Hi @mdonnelly, adding this as an optional argument seems like a nice idea to me. We could also incorporate the variable name into titles in the new plotting functions e.g. Poly.plot_polyfit_1D() etc.

I guess the only argument against doing this is that it does add a little bit more code into equadratures, when one could just manually add their own attribute to the Poly once defined if they like? e.g.

mypoly = eq.Poly(parameters=my_param_list, basis=my_basis, method='numerical-integration')
mypoly.variable = variable
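
This works because ordinary Python instances accept new attributes at runtime, assuming the class does not define `__slots__`. A minimal sketch, using a hypothetical `Model` class in place of Poly, with `getattr` supplying a fallback for objects that were never labelled:

```python
class Model:
    """Hypothetical stand-in for Poly; it knows nothing about `variable`."""


def label_of(model, default="output"):
    # Fall back to a generic label when no name was attached.
    return getattr(model, "variable", default)


named = Model()
named.variable = "T30"  # attribute attached after construction

unnamed = Model()

print(label_of(named), label_of(unnamed))
```

The drawback of the manual route is that every consumer has to guard against the attribute being absent, which a built-in optional argument defaulting to None would avoid.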

@psesh any thoughts?

mdonnelly · 18 April 2021 17:33

I guess another possible use could be to have an extra line or two within get_summary() to explicitly state what variable it concerns just to potentially remove a little bit of ambiguity.

if self.variable is not None:
    # Append a line naming the output variable to the summary text.
    added += 'The output variable is ' + str(self.variable)

Good point about just manually adding them - in a similar way I guess you could also specify the summary output filename to be something based on the variable to make the above redundant too!
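
As a rough sketch of both ideas, the snippet below appends the variable line to a summary string and derives a filename from the same name. `build_summary` and `summary_filename` are hypothetical helpers, not part of equadratures, and the base summary text is a placeholder.

```python
import re


def build_summary(base_text, variable=None):
    # Append the variable line only when a name has been set.
    if variable is not None:
        return base_text + " The output variable is " + str(variable) + "."
    return base_text


def summary_filename(variable, suffix=".txt"):
    # Replace characters that are awkward in filenames with underscores.
    safe = re.sub(r"[^A-Za-z0-9_-]+", "_", str(variable)).strip("_")
    return "summary_" + safe + suffix


print(build_summary("Placeholder summary.", variable="T30"))
print(summary_filename("exit temperature (K)"))
```

Either approach removes the ambiguity; putting the name inside the summary has the advantage that it survives if the file is later renamed.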

ascillitoe · 18 April 2021 18:44

Ah yes that’s a great idea with the get_summary(). I suppose at the moment the summary outputs might lose their usefulness when you have many variables unless you’re quite careful with the naming of the files?

I reckon this feature is a worthwhile one to add, it’s only a few lines of extra code and sounds like it could add a fair bit of convenience for you.

ascillitoe · 18 April 2021 18:47

P.s. just out of curiosity, do you actually use get_summary() or know anyone that does? I’m just wondering about its usefulness in general, and whether there might be other formats we want to think about outputting info in? i.e. would the functionality to output to a csv file, pandas data frame, or something else entirely be useful?
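
To make the csv option concrete, here is one possible shape for it using only the standard library. The record fields (`variable`, `method`, `order`) and their values are assumptions chosen for illustration, not what get_summary() currently reports.

```python
import csv
import io

# Hypothetical audit records; the field names are assumptions for illustration.
records = [
    {"variable": "P30", "method": "least-squares", "order": 2},
    {"variable": "T30", "method": "least-squares", "order": 3},
]


def summaries_to_csv(rows):
    # Write one row per model into an in-memory CSV string.
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["variable", "method", "order"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = summaries_to_csv(records)
print(csv_text)
```

One row per model keeps many outputs in a single file, which may be easier to audit than one text summary per variable.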

psesh · 18 April 2021 19:48

I think this is a very do-able task, and we could easily alter the “y”-axis for the relevant plots to capture that. @mdonnelly, are there specific output plots you require (e.g., truth vs polynomial / response surfaces)?

mdonnelly · 18 April 2021 21:13

@ascillitoe the use-case for me is as a bit of an audit file for our workflows. We sometimes revisit analyses months later and it’s often helpful to have a concise summary of what was run in terms of parameter assumptions etc rather than needing to open up each notebook/dataset etc. If something else like a csv is better then I’m all ears!

@psesh some examples of the more routine ones we create are below (mainly for UQ and sensitivity analysis). All fairly standard matplotlib type stuff. We have toyed with things like Bokeh but those are a bit more niche.

Matrix plot of all input parameter distributions e.g. all the CDFs/PDFs/histograms of the parameters associated with a poly.
Truth vs prediction for both train and test data sets.
Matrix plot of all main effects, essentially just independent parameter sweeps over their upper and lower bounds (recognise this is a bit more difficult for distributions that don’t have fixed limits).
Heatmap showing strength of (two-way) input parameter interactions
Output distributions with overlaid confidence intervals
Parallel coordinates plot showing all outputs.
Pareto plot of first-order and total Sobol’ indices.

mdonnelly · 21 April 2021 17:53

I’ve made a PR covering the initial part of this (variable naming). As much as I’d be happy to help on the plotting side I expect you’d want to drive that yourselves!

psesh · 22 April 2021 08:39

Cheers @mdonnelly! I’ll have a go at updating the plotting functionality and drop a note here when done.

ascillitoe · 22 April 2021 12:34

Hi @psesh, shall I merge in @mdonnelly’s PR or do you want to commit your updates to that PR?

psesh · 22 April 2021 12:47

Hi @ascillitoe, yes for now please merge the PR.

ascillitoe · 6 May 2021 13:08

Hi @mdonnelly, @Simardeep27 has been implementing a few of the plots you mentioned above for us. Can I just check with you please, re this one:

“Matrix plot of all input parameter distributions e.g. all the CDFs/PDFs/histograms of the parameters associated with a poly.”

Did you have in mind something like the seaborn pairplot? I have a few ideas regarding how we could improve on this, but just wanted to check if this is what you meant first!

mdonnelly · 6 May 2021 13:57

Hi @ascillitoe,

Short answer is yes and no!

I do use those types of plots regularly (usually the scatter_matrix from Pandas) but the Seaborn one does look a bit more modern straight out of the box. They’re useful as a quick glance to see what the sampling space of the problem is (distributions on the diagonals and whether there’s any correlation in the off-diagonals). Sometimes I also plot the output data in this too, more as a quick EDA type step to spot the high-level relationships. Having it built in would be useful.

However, I think what I had in mind in the earlier post was just a way to plot all the PDFs of the input parameters being used by Poly. Calling it a matrix plot was probably a bad choice on my part - I just meant something like putting them all into an X by Y subplot, as often we have lots of (e.g. 40+) parameters. The use case for this one is similar to the one above, plus general reporting.
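
The X by Y layout mostly comes down to working out the grid shape for a given number of columns. A minimal sketch of that arithmetic follows; `grid_shape` and `unused_panels` are hypothetical helper names, not equadratures functions, and the resulting `rows` and `cols` would typically be handed to something like `matplotlib.pyplot.subplots(rows, cols)`.

```python
import math


def grid_shape(n_plots, cols=3):
    # Rows needed to fit n_plots panels into `cols` columns.
    if n_plots < 1 or cols < 1:
        raise ValueError("n_plots and cols must be positive")
    cols = min(cols, n_plots)  # avoid empty columns for small inputs
    rows = math.ceil(n_plots / cols)
    return rows, cols


def unused_panels(n_plots, rows, cols):
    # Flat indices of grid cells with no plot, to be hidden.
    return list(range(n_plots, rows * cols))


rows, cols = grid_shape(40, cols=6)
print(rows, cols, unused_panels(40, rows, cols))
```

For 40 parameters in 6 columns this gives a 7 by 6 grid with the last two cells left over, which can then be hidden so the figure has no empty axes.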

ascillitoe · 6 May 2021 15:41

Hi @mdonnelly, thanks for this, very helpful!

mdonnelly · 6 May 2021 20:07

@ascillitoe I made a little PR just covering the second bit of my comment above. Also spotted a bug (maybe) that I’ve had a go at fixing.

github.com/equadratures/equadratures

Added function to plot PDFs associated with Poly.

equadratures:develop ← mdonnelly1:develop

opened 08:04PM - 06 May 21 UTC

mdonnelly1

+68 -2

@ascillitoe following on from the discourse I added a function to plot all of the PDFs from parameters associated with a polynomial. Please review and clean up as you see fit - no worries if you have other ideas on how to do this and don't use it - I just thought I'd offer this up as I already had most of it written from existing work.

Essentially it does a loop through Polynomial.parameters and calls the existing `plot.plot_pdf` function. To allow a little bit of customisation I include a cols argument to allow you to specify the number of columns to put the parameters in.

I also noticed a bug in `plot_pdf` where the defaults were specified in the return, making it impossible to change them. I've just removed this so it now gives the expected behaviour. I also spotted some odd behaviour in the axes from `plot_pdf` when putting it into a subplot. After some digging this was because of the `sns.despine` call, which was trimming everything. The fix was to add `ax=ax` into it so it was more targeted.

ascillitoe · 18 June 2021 18:19

Hi @mdonnelly, the above is now merged into version 9.1 so I shall mark as solved. Please do feel free to shout out if you feel this isn’t correct.