Monday, 13 June 2016

How is cv_values_ computed in sklearn.linear_model.RidgeCV?

A reproducible example to anchor the discussion:

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import scale

# standardize the features; the target is left unscaled
boston = scale(load_boston().data)
target = load_boston().target

alphas = np.linspace(1.0, 200.0, 5)
fit0 = RidgeCV(alphas=alphas, store_cv_values=True, gcv_mode='eigen').fit(boston, target)
fit0.alpha_             # the selected alpha
fit0.cv_values_[:, 0]   # per-sample values for the first alpha (1.0)

The question: what formula is used to compute fit0.cv_values_?

Edit:

@Abhinav Arora's answer below seems to suggest that fit0.cv_values_[:,0][0], the first entry of fit0.cv_values_[:,0], would be

(fit1.predict(boston[0,].reshape(1, -1)) - target[0])**2

where fit1 is a ridge regression with alpha = 1.0, fitted to the data-set from which observation 0 was removed.
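To make that claim concrete, here is a minimal sketch of leave-one-out squared errors for ridge regression, written with plain NumPy on synthetic data. It assumes a model without an intercept, and the helper names (ridge_coef, loo_squared_errors, loo_squared_errors_shortcut) are hypothetical, not part of scikit-learn. The brute-force loop is cross-checked against the standard hat-matrix identity, under which the held-out residual equals the full-fit residual divided by 1 - H_ii. Whether RidgeCV computes cv_values_ exactly this way, in particular how it treats the intercept, is not established here.

```python
import numpy as np


def ridge_coef(X, y, alpha):
    """Closed-form ridge coefficients, no intercept (an assumption of this sketch)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def loo_squared_errors(X, y, alpha):
    """Brute force: refit with each row held out, score on that row."""
    n_samples = X.shape[0]
    errors = np.empty(n_samples)
    for i in range(n_samples):
        keep = np.arange(n_samples) != i
        coef = ridge_coef(X[keep], y[keep], alpha)
        errors[i] = (X[i] @ coef - y[i]) ** 2
    return errors


def loo_squared_errors_shortcut(X, y, alpha):
    """Same quantity from a single fit, via the hat matrix H."""
    n_features = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T)
    residuals = y - H @ y
    return (residuals / (1.0 - np.diag(H))) ** 2


# synthetic data, only to exercise the two functions
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=30)

brute = loo_squared_errors(X, y, alpha=1.0)
fast = loo_squared_errors_shortcut(X, y, alpha=1.0)
print(np.allclose(brute, fast))
```

The two functions agree here only because no intercept is fitted; once the data are centered or an intercept is estimated, the result depends on whether that step is redone for every held-out row.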

Let's see:

1) create new dataset with first row of original dataset removed:

from sklearn.linear_model import Ridge
boston1 = np.delete(boston, (0), axis=0)
target1 = np.delete(target, (0), axis=0)

2) fit a ridge model with alpha = 1.0 on this truncated dataset:

fit1 = Ridge(alpha=1.0).fit(boston1, target1)

3) check the squared error of that model on the first data point:

(fit1.predict(boston[0,].reshape(1, -1)) - target[0])**2

The result is array([ 37.64650853]), which is not the same as the corresponding entry of fit0.cv_values_[:,0], namely:

fit0.cv_values_[:,0][0]

which is 37.495629960571137

What gives?
