Monday, June 27, 2016

How to interpret trees from a random forest in Python

I'm trying to figure out how to interpret the trees in my random forest. My data contains around 29,000 observations and 35 features. Below are the first 22 observations with the first 11 features, plus the feature I am trying to predict (HighLowMobility).

birthcohort countyfipscode  county_name cty_pop2000 statename   state_id    stateabbrv  perm_res_p25_kr24   perm_res_p75_kr24   perm_res_p25_c1823  perm_res_p75_c1823  HighLowMobility
1980    1001    Autauga 43671   Alabama 1   AL  45.2994 60.7061         Low
1981    1001    Autauga 43671   Alabama 1   AL  42.6184 63.2107 29.7232 75.266  Low
1982    1001    Autauga 43671   Alabama 1   AL  48.2699 62.3438 38.0642 72.2544 Low
1983    1001    Autauga 43671   Alabama 1   AL  42.6337 56.4204 38.2588 80.4664 Low
1984    1001    Autauga 43671   Alabama 1   AL  44.0163 62.2799 38.1238 73.747  Low
1985    1001    Autauga 43671   Alabama 1   AL  45.7178 61.3187 40.9339 83.0661 Low
1986    1001    Autauga 43671   Alabama 1   AL  47.9204 59.6553 47.4841 72.491  Low
1987    1001    Autauga 43671   Alabama 1   AL  48.3108 54.042  53.199  84.5379 Low
1988    1001    Autauga 43671   Alabama 1   AL  47.9855 59.42   52.8927 85.2844 Low
1980    1003    Baldwin 140415  Alabama 1   AL  42.4611 51.4142         Low
1981    1003    Baldwin 140415  Alabama 1   AL  43.0029 55.1014 35.5923 76.9857 Low
1982    1003    Baldwin 140415  Alabama 1   AL  46.2496 56.0045 38.679  77.038  Low
1983    1003    Baldwin 140415  Alabama 1   AL  44.3001 54.5173 38.7106 81.0388 Low
1984    1003    Baldwin 140415  Alabama 1   AL  46.4349 55.5245 42.4422 80.3047 Low
1985    1003    Baldwin 140415  Alabama 1   AL  47.1544 52.8189 42.7994 79.0835 Low
1986    1003    Baldwin 140415  Alabama 1   AL  47.553  54.934  42.0653 78.4398 Low
1987    1003    Baldwin 140415  Alabama 1   AL  48.9752 54.3541 39.96   79.4915 Low
1988    1003    Baldwin 140415  Alabama 1   AL  48.6887 55.3087 43.8557 79.387  Low
1980    1005    Barbour 29038   Alabama 1   AL                  Low
1981    1005    Barbour 29038   Alabama 1   AL  37.5338 54.3618 34.8771 75.1904 Low
1982    1005    Barbour 29038   Alabama 1   AL  37.028  57.2471 36.5392 90.3262 Low
1983    1005    Barbour 29038   Alabama 1   AL                  Low

Here is my random forest:

    import pandas as pd
    import numpy as np

    # Load the data into a data frame
    X = pd.read_csv('raw_data_for_edits.csv')
    # Impute the missing values with column medians
    X = X.fillna(X.median())

    # Drop the categorical columns
    X = X.drop(['county_name', 'statename', 'stateabbrv'], axis=1)

    # Collect the target in y before dropping it from the features
    y = X['HighLowMobility']
    X = X.drop(['HighLowMobility'], axis=1)


    from sklearn.preprocessing import LabelEncoder

    # Encode the output labels: 'Low' -> 0, anything else -> 1
    def preprocess_labels(y):
        yp = []
        for i in range(len(y)):
            if str(y[i]) == 'Low':
                yp.append(0)
            else:  # 'High' or anything else
                yp.append(1)
        return yp  # return after the loop, not inside it
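The hand-rolled encoder above duplicates what scikit-learn's LabelEncoder (already imported, and hinted at in the commented-out line further down) does in one call. A minimal sketch on made-up labels; note that LabelEncoder assigns codes in sorted class order, so 'High' becomes 0 and 'Low' becomes 1, the opposite of the manual mapping:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(['Low', 'High', 'Low', 'Low', 'High'])
le = LabelEncoder()
yp = le.fit_transform(y)          # codes assigned in sorted class order
print(le.classes_)                # ['High' 'Lo w'] -> classes sorted alphabetically
print(yp)                         # [1 0 1 1 0]
print(le.inverse_transform(yp))   # decodes back to the original strings
```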



    # y = LabelEncoder().fit_transform(y)
    yp = preprocess_labels(y)
    yp = np.array(yp)
    yp.shape
    X.shape
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, yp, test_size=0.25, random_state=42)
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    X_test = np.array(X_test)
    y_test = np.array(y_test)
    training_data = X_train, y_train
    test_data = X_test, y_test
    dims = X_train.shape[1]
    if __name__ == '__main__':
        nn = Neural_Network([dims, 10, 5, 1], learning_rate=1, C=1, opt=False, check_gradients=True, batch_size=200, epochs=100)
        nn.fit(X_train, y_train)
        weights = nn.final_weights()
        testlabels_out = nn.predict(X_test)
        print(testlabels_out)
        print("Neural Net Accuracy is " + str(np.round(nn.score(X_test, y_test), 2)))


    '''
    RANDOM FOREST AND LOGISTIC REGRESSION
    '''
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    clf1 = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None)
    clf2 = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
    for clf, label in zip([clf1, clf2], ['Logistic Regression', 'Random Forest']):
        scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

How would I interpret my trees? For example, perm_res_p25_c1823 is a feature giving college attendance at ages 18-23 for a child born at the 25th income percentile, perm_res_p75_c1823 is the same for the 75th percentile, and HighLowMobility states whether there is high or low upward income mobility. So how would I show something like: "If the person comes from the 25th percentile and lives in Autauga, Alabama, then they will probably have lower upward mobility"?
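One way to extract rules like that is to inspect individual estimators: a fitted RandomForestClassifier is just a collection of DecisionTreeClassifiers held in its estimators_ attribute, and each one can be dumped as nested if/else split rules with sklearn.tree.export_text (export_graphviz renders the same tree graphically), while feature_importances_ summarizes which features the whole forest relies on. A minimal sketch on synthetic stand-in data, since I don't have the mobility CSV; the feature names are reused from the question but the numbers are fabricated by make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in for four of the mobility features (not the real CSV)
feature_names = ['perm_res_p25_kr24', 'perm_res_p75_kr24',
                 'perm_res_p25_c1823', 'perm_res_p75_c1823']
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Dump the first tree in the ensemble as readable split rules
rules = export_text(forest.estimators_[0], feature_names=feature_names)
print(rules)

# Which features the forest as a whole relies on, largest first
for name, imp in sorted(zip(feature_names, forest.feature_importances_),
                        key=lambda t: -t[1]):
    print('%s: %.3f' % (name, imp))
```

Each printed branch reads as a rule ("if perm_res_p25_c1823 <= 40.1 and ... then class 0"), which is the closest mechanical analogue of the sentence you want; for the county to appear in a split, categorical columns like county_name would have to be one-hot encoded instead of dropped.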
