Datasets module

ascillitoe · 1 March 2021 12:03

Dataset generation and loading utilities were recently added to datasets.py in equadratures (currently on the develop branch). The module includes several functions for generating simple canonical datasets for testing. For example, to generate a simple linear dataset with a given number of active and inactive dimensions, and then split it into training and test sets:

import equadratures as eq
# Generate 10D linear dataset with 2 relevant (active) features
X, y = eq.datasets.gen_linear(n_observations=500, n_dim=10, bias=0.5, n_relevent=2, noise=0.2, random_seed=1)
# Hold out 20% of the observations for testing
X_train, X_test, y_train, y_test = eq.datasets.train_test_split(X, y, train=0.8, random_seed=42)
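
For anyone unsure what "active" and "inactive" dimensions mean here, the following is a rough NumPy-only sketch of the idea. It is not the equadratures implementation: the function names `sketch_linear_dataset` and `sketch_split`, the uniform sampling range and the coefficient range are all assumptions made for illustration.

```python
import numpy as np

def sketch_linear_dataset(n_observations=500, n_dim=10, n_active=2,
                          bias=0.5, noise=0.2, seed=1):
    """Illustrative only: y depends linearly on the first n_active columns of X."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_observations, n_dim))
    coeffs = np.zeros(n_dim)
    # Only the active dimensions get a non-zero weight (range chosen arbitrarily)
    coeffs[:n_active] = rng.uniform(0.5, 1.5, size=n_active)
    y = X @ coeffs + bias + noise * rng.standard_normal(n_observations)
    return X, y, coeffs

def sketch_split(X, y, train=0.8, seed=42):
    """Illustrative only: shuffle the rows, then cut at the training fraction."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    n_train = int(round(train * X.shape[0]))
    tr, te = idx[:n_train], idx[n_train:]
    return X[tr], X[te], y[tr], y[te]

X, y, coeffs = sketch_linear_dataset()
X_train, X_test, y_train, y_test = sketch_split(X, y)

# Least-squares fit with an intercept column appended
A = np.column_stack([X_train, np.ones(len(X_train))])
w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
print(np.round(w[:-1], 2))  # weights on the inactive dimensions should be close to zero
```

A least-squares fit on the training split should give near-zero weights for the eight inactive dimensions, which is what makes this kind of dataset handy for testing dimension reduction or feature selection.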

There is also a function to load more advanced datasets from our own datasets repository:

import equadratures as eq
data = eq.datasets.load_eq_dataset('naca0012')
X = data['X']
Cp = data['Cp']

To develop this module further, it would be great to add utilities that pull in datasets from other popular sources, such as the UCI Machine Learning Repository. We also need to expose all of this functionality to users through the tutorials.
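
As a starting point for discussion, here is a minimal sketch of what a generic loader for externally sourced data might look like. The name `load_csv_dataset` and its parameters are hypothetical, nothing like this exists in equadratures yet, and it assumes a purely numeric, header-free CSV with the target in one column. Fetching the file from a remote repository is left to the caller.

```python
import io
import numpy as np

def load_csv_dataset(source, target_col=-1, delimiter=","):
    """Hypothetical helper: split a numeric CSV into features X and target y.

    source can be a file path or any file-like object accepted by np.loadtxt.
    """
    data = np.loadtxt(source, delimiter=delimiter, ndmin=2)
    y = data[:, target_col]
    X = np.delete(data, target_col, axis=1)  # everything except the target column
    return X, y

# Small in-memory stand-in for a downloaded file
sample = io.StringIO("0.1,0.2,1.0\n0.3,0.4,2.0\n0.5,0.6,3.0\n")
X, y = load_csv_dataset(sample)
print(X.shape, y)
```

Real repository files often contain headers, categorical columns or missing values, so a proper utility would need more handling than this.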

Please feel free to reply with ideas for other datasets to add to our own datasets repository, and ideas for other dataset sources to include!