I'm trying to figure out how to interpret the trees from my random forest. My data contains around 29,000 observations and 35 features. Below I have pasted the first 22 observations, showing the first 11 features as well as the feature that I am trying to predict (HighLowMobility).
birthcohort countyfipscode county_name cty_pop2000 statename state_id stateabbrv perm_res_p25_kr24 perm_res_p75_kr24 perm_res_p25_c1823 perm_res_p75_c1823 HighLowMobility
1980 1001 Autauga 43671 Alabama 1 AL 45.2994 60.7061 Low
1981 1001 Autauga 43671 Alabama 1 AL 42.6184 63.2107 29.7232 75.266 Low
1982 1001 Autauga 43671 Alabama 1 AL 48.2699 62.3438 38.0642 72.2544 Low
1983 1001 Autauga 43671 Alabama 1 AL 42.6337 56.4204 38.2588 80.4664 Low
1984 1001 Autauga 43671 Alabama 1 AL 44.0163 62.2799 38.1238 73.747 Low
1985 1001 Autauga 43671 Alabama 1 AL 45.7178 61.3187 40.9339 83.0661 Low
1986 1001 Autauga 43671 Alabama 1 AL 47.9204 59.6553 47.4841 72.491 Low
1987 1001 Autauga 43671 Alabama 1 AL 48.3108 54.042 53.199 84.5379 Low
1988 1001 Autauga 43671 Alabama 1 AL 47.9855 59.42 52.8927 85.2844 Low
1980 1003 Baldwin 140415 Alabama 1 AL 42.4611 51.4142 Low
1981 1003 Baldwin 140415 Alabama 1 AL 43.0029 55.1014 35.5923 76.9857 Low
1982 1003 Baldwin 140415 Alabama 1 AL 46.2496 56.0045 38.679 77.038 Low
1983 1003 Baldwin 140415 Alabama 1 AL 44.3001 54.5173 38.7106 81.0388 Low
1984 1003 Baldwin 140415 Alabama 1 AL 46.4349 55.5245 42.4422 80.3047 Low
1985 1003 Baldwin 140415 Alabama 1 AL 47.1544 52.8189 42.7994 79.0835 Low
1986 1003 Baldwin 140415 Alabama 1 AL 47.553 54.934 42.0653 78.4398 Low
1987 1003 Baldwin 140415 Alabama 1 AL 48.9752 54.3541 39.96 79.4915 Low
1988 1003 Baldwin 140415 Alabama 1 AL 48.6887 55.3087 43.8557 79.387 Low
1980 1005 Barbour 29038 Alabama 1 AL Low
1981 1005 Barbour 29038 Alabama 1 AL 37.5338 54.3618 34.8771 75.1904 Low
1982 1005 Barbour 29038 Alabama 1 AL 37.028 57.2471 36.5392 90.3262 Low
1983 1005 Barbour 29038 Alabama 1 AL Low
Here is my random forest:
import numpy as np
import pandas as pd

# Load the data into a data frame
X = pd.read_csv('raw_data_for_edits.csv')
# Impute the missing values with the column medians (numeric columns only)
X = X.fillna(X.median(numeric_only=True))
# Drop the categorical columns
X = X.drop(['county_name', 'statename', 'stateabbrv'], axis=1)
# Collect the output in the y variable
y = X['HighLowMobility']
X = X.drop(['HighLowMobility'], axis=1)
from sklearn.preprocessing import LabelEncoder

# Encode the output labels: 'Low' -> 0, 'High' -> 1
def preprocess_labels(y):
    yp = []
    for i in range(len(y)):
        if str(y[i]) == 'Low':
            yp.append(0)
        elif str(y[i]) == 'High':
            yp.append(1)
        else:
            # any other value is also encoded as 1
            yp.append(1)
    return yp

#y = LabelEncoder().fit_transform(y)
yp = preprocess_labels(y)
yp = np.array(yp)
yp.shape
X.shape
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, yp, test_size=0.25, random_state=42)
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)
training_data = X_train, y_train
test_data = X_test, y_test
dims = X_train.shape[1]
if __name__ == '__main__':
    # Neural_Network is a custom class that is not shown in this post
    nn = Neural_Network([dims, 10, 5, 1], learning_rate=1, C=1, opt=False, check_gradients=True, batch_size=200, epochs=100)
    nn.fit(X_train, y_train)
    weights = nn.final_weights()
    testlabels_out = nn.predict(X_test)
    print(testlabels_out)
    print("Neural Net Accuracy is " + str(np.round(nn.score(X_test, y_test), 2)))
'''
RANDOM FOREST AND LOGISTIC REGRESSION
'''
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

clf1 = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None)
# min_samples_split must be at least 2 in current scikit-learn
clf2 = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
for clf, label in zip([clf1, clf2], ['Logistic Regression', 'Random Forest']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
How would I interpret my trees? For example, perm_res_p25_c1823 is a feature that gives the college attendance at ages 18-23 for a child born at the 25th percentile, perm_res_p75_c1823 is the same measure for the 75th percentile, and HighLowMobility states whether there is High or Low upward income mobility. So how would I show something like the following: "If the person comes from the 25th percentile and lives in Autauga, Alabama, then they will probably have lower upward mobility"?
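One possible way to get statements of that "if ... then ..." form is to walk a single tree of the forest and print the path to each leaf. The sketch below is only an illustration under stated assumptions: `tree_rules` is a hypothetical helper (not part of scikit-learn), and the toy tree uses made-up thresholds and class counts rather than anything fitted to the data above. Its five input arrays follow the layout of scikit-learn's `tree_` attribute (`children_left`, `children_right`, `feature`, `threshold`, `value`), where a child index of -1 marks a leaf.

```python
# Sketch: turn one decision tree into readable if/then rules.
# The arrays mirror scikit-learn's `tree_` layout; -1 as a child means "leaf".

def tree_rules(children_left, children_right, feature, threshold, value,
               feature_names, class_names):
    """Return one 'IF ... THEN class' string per leaf of the tree."""
    rules = []

    def walk(node, conditions):
        if children_left[node] == -1:  # leaf: report the majority class
            counts = list(value[node])
            label = class_names[counts.index(max(counts))]
            rules.append("IF " + (" AND ".join(conditions) or "TRUE")
                         + " THEN " + label)
            return
        name = feature_names[feature[node]]
        cut = threshold[node]
        # scikit-learn sends samples with feature <= threshold to the left
        walk(children_left[node], conditions + ["%s <= %.2f" % (name, cut)])
        walk(children_right[node], conditions + ["%s > %.2f" % (name, cut)])

    walk(0, [])
    return rules


# Hand-built toy tree (made-up thresholds and counts, NOT fitted to the data):
# node 0 splits on perm_res_p25_c1823, node 2 splits on cty_pop2000.
toy_rules = tree_rules(
    children_left=[1, -1, 3, -1, -1],
    children_right=[2, -1, 4, -1, -1],
    feature=[0, -2, 1, -2, -2],
    threshold=[40.0, -2.0, 50000.0, -2.0, -2.0],
    value=[[60, 40], [50, 5], [10, 35], [8, 5], [2, 30]],
    feature_names=["perm_res_p25_c1823", "cty_pop2000"],
    class_names=["Low", "High"],
)
for rule in toy_rules:
    print(rule)
```

To apply this to the real forest, clf2 would first need to be fitted directly (cross_val_score fits clones and leaves clf2 itself unfitted). Then, for t = clf2.estimators_[0].tree_, the arguments would be t.children_left, t.children_right, t.feature, t.threshold, t.value[:, 0, :], list(X.columns) and list(clf2.classes_). Two caveats: with 100 unpruned trees the rule lists get very long, so clf2.feature_importances_ is probably a more practical summary, and since county_name is dropped before training, a rule about Autauga could only appear through countyfipscode (1001). Newer scikit-learn versions also provide sklearn.tree.export_text for printing a single tree.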