Polynomial graphs

This blog concerns the construction of related orthogonal polynomial approximations over a graph, and is based on the work of Tuck et al. [1]. For this exercise, we will use the House Sales in King County Kaggle data set. A series of scatter plots of the houses in question is shown below, where the colours indicate the house price on a logarithm base-10 scale.


Our primary motivation in developing polynomial graphs is to construct K polynomial approximations over distinct domains, while having those K approximations be related to each other. Tuck et al. and others refer to this modelling paradigm as a stratified model, as such models ``depend in an arbitrary way on a set of selected categorical features, and depend linearly on the other features''. If there exist K such models, the objective function to be minimised is of the form

\mathcal{L}\left(\boldsymbol{\alpha}_1 , \ldots, \boldsymbol{\alpha}_{K} \right) + \sum_{k=1}^{K}h_{k} \left(\boldsymbol{\alpha}_{k} \right),

where the first term represents the global Laplacian loss, leveraging a matrix representation of the graph that is a function of the number of connections between different nodes. This term is given by

\mathcal{L}\left(\boldsymbol{\alpha}_1 , \ldots, \boldsymbol{\alpha}_{K} \right) = \frac{1}{2} \sum_{i=1}^{K} \sum_{j=1}^{K} \boldsymbol{W}_{ij} \left\Vert \boldsymbol{\alpha}_{i} - \boldsymbol{\alpha}_{j} \right\Vert_{2}^{2},

where \boldsymbol{W}_{ij} denotes the edge weight between nodes i and j. The second term on the right-hand side in the first equation is essentially a local L_2 norm penalty, given by

h_{k} \left(\boldsymbol{\alpha}_{k} \right) = \left\Vert \boldsymbol{f}_{k} - \boldsymbol{P}^{T} \left( \boldsymbol{x}_{k} \right) \boldsymbol{\alpha}_{k} \right\Vert_{2}^{2}.

It is important to note that as all the edge weights approach 0, we effectively have separate (independent) models for each node. Conversely, as all the edge weights approach infinity, we have a single model across all the nodes.
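This limiting behaviour is easy to check numerically. Below is a minimal sketch of the Laplacian loss on a toy graph (the graph, weights and coefficient vectors here are made up for illustration); summing over the undirected edges is equivalent to the halved double sum above:

```python
import numpy as np
import networkx as nx

# Toy graph with K = 3 nodes connected in a path: 0 -- 1 -- 2
G = nx.path_graph(3)

# Made-up coefficient vectors, one per node
alphas = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]

def laplacian_loss(G, alphas, w):
    # Sum w * ||alpha_i - alpha_j||^2 over undirected edges; this equals
    # (1/2) * sum over all ordered pairs (i, j) with W_ij = w on edges.
    return sum(w * np.sum((alphas[i] - alphas[j]) ** 2) for i, j in G.edges())

print(laplacian_loss(G, alphas, 1.0))  # → 3.0
print(laplacian_loss(G, alphas, 0.0))  # → 0.0 (nodes fully decoupled)
```

With w = 0 the coupling term vanishes and each node's coefficients are fit independently; as w grows, any disagreement between neighbouring coefficient vectors is penalised ever more heavily, driving them towards a single shared model.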

To utilise the Graphpolys functionality in equadratures, we require three inputs: (i) a networkx graph; (ii) an instance of the Poly class that sets the multi-index and basis of the polynomial; and (iii) the training data. We provide the source code towards the end of the blog; the main figures are summarised below.

We set each weight in \boldsymbol{W} to be identical, starting with a value of 15 (based on Tuck et al.). The figure below shows the first-order Sobol' indices of the polynomial models. Results with weights of 0.0001 and 10000 are also given below.

Stratified model \boldsymbol{W}=15


Almost independent (separate) model \boldsymbol{W}=0.0001


Nearly one model \boldsymbol{W}=10000


As was reported in [1], the root mean squared error of the stratified model is the lowest, owing to the graph-based regularisation.

Source code below.

import numpy as np
import pandas as pd
import networkx as nx
import equadratures as eq
from sklearn import model_selection

df = pd.read_csv('examples/data/kc_house_data.csv',
                 usecols=["price", "bedrooms", "bathrooms", "sqft_living", "sqft_lot",
                          "floors", "waterfront", "condition", "grade", "yr_built", "lat", "long"])

df['log_price'] = np.log(df['price'])
df = df.query('long <= -121.6')
bins = 50
df['lat_bin'] = pd.cut(df['lat'], bins=bins)
df['long_bin'] = pd.cut(df['long'], bins=bins)
code_to_latbin = dict(enumerate(df['lat_bin'].cat.categories))
code_to_longbin = dict(enumerate(df['long_bin'].cat.categories))
df['lat_bin'] = df['lat_bin'].cat.codes
df['long_bin'] = df['long_bin'].cat.codes
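The pd.cut / cat.codes pattern above maps each coordinate to an integer bin index, which is what ties each house to a node of the grid graph. A minimal illustration with made-up latitudes:

```python
import pandas as pd

# Three made-up latitudes, split into two equal-width bins
lat = pd.Series([47.1, 47.5, 47.9])
lat_bin = pd.cut(lat, bins=2)   # interval-valued categorical
codes = lat_bin.cat.codes       # integer bin index per row
print(list(codes))              # → [0, 0, 1]
```

Note that pd.cut intervals are right-inclusive by default, so 47.5 falls in the first bin here.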
df.drop(["price"], axis=1, inplace=True)

# Create dataset
df_train, df_test = model_selection.train_test_split(df)
G = nx.grid_2d_graph(bins, bins)

def get_data(df):
    Xs = []
    Ys = []
    Zs = []
    for node in G.nodes():
        latbin, longbin = node
        df_node = df.query('lat_bin == %d & long_bin == %d' %
                           (latbin, longbin))
        X_node = np.array(df_node.drop(
            ['log_price', 'lat', 'long', 'lat_bin', 'long_bin'], axis=1))
        Y_node = np.array(df_node['log_price'])
        N = X_node.shape[0]
        Xs += [X_node]
        Ys += [Y_node]
        Zs.extend([node] * N)

    return np.concatenate(Xs, axis=0), np.concatenate(Ys, axis=0)[:, np.newaxis], Zs

X_train, Y_train, Z_train = get_data(df_train)
X_test, Y_test, Z_test = get_data(df_test)

K, n = bins * bins, X_train.shape[1]
print("The stratified model will have", K * n, "variables.")

ss = eq.scalers.scaler_meanvar()
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)
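As a side note, when standardising, the mean and variance are usually estimated on the training set and then reused for the test set. A plain NumPy sketch of that convention (synthetic data; this is not the equadratures scaler API):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(100, 3))
X_test = rng.normal(5.0, 2.0, size=(20, 3))

# Fit the statistics on the training set only
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma  # reuse training statistics

print(np.allclose(X_train_s.mean(axis=0), 0.0))  # → True
print(np.allclose(X_train_s.std(axis=0), 1.0))   # → True
```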

# Fit models
data_train = dict(X=X_train, Y=Y_train, Z=Z_train)
data_test = dict(X=X_test, Y=Y_test, Z=Z_test)

inputs = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot",
          "floors", "waterfront", "condition", "grade", "yr_built"]

dims = 9
basis = eq.Basis('total-order')
params = []
for j in range(0, dims):
    # One Parameter per input dimension, each named after its feature
    param = eq.Parameter(distribution='Gaussian', shape_parameter_A=0.,
                         shape_parameter_B=1., order=1)
    param.name = inputs[j]
    params.append(param)
poly = eq.Poly(params, basis)

# edge_weight controls the regularisation strength (15, 0.0001 and 10000 were used above)
gpoly = eq.Graphpolys(Graph=G, data_train=data_train, poly=poly, edge_weight=10000)


[1] Tuck, Barratt, Boyd, (2021) “A distributed method for fitting Laplacian regularised stratified models”. Journal of Machine Learning Research, 22, 1-37.


Perhaps a comparison of graphs vs regression trees is in order :slight_smile: ?
