Thursday, 16 June 2016

Extract features from a pandas column containing textual (descriptive) data and combine them with the rest of the features?

I have a data set like the one below (just a couple of rows and some of the columns):

    id summary description01 decription02 recommendation connections total-experience_in_months

    1 Experienced and seasoned Data warehouse IT professional with 10 Years of experience looking for a Technical Architecture Role,process optimization involved in technical planning 5 524 30

    2 Working as a Technical Consultant at SAS Institute. Specialties: Business Intelligence, Data Warehousing, Data modeling, SAS Platform Administration,Worked for Discover Cards financial service for a transition 5 4000 40

I want to extract features from the text columns, and I am using the tf-idf approach for that.

This is what I am trying:

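from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level tf-idf over 1- to 3-grams, ignoring English stop words.
# min_df=1 keeps every term (recent scikit-learn releases reject an integer min_df of 0).
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')

# Calculate tf-idf for the summary column (first row only).
# raw_data is the pandas DataFrame shown above.
tfidf_matrix = tf.fit_transform(raw_data['summary'][:1])
# On scikit-learn 1.0 and later, use tf.get_feature_names_out() instead.
feature_names = tf.get_feature_names()

print(len(feature_names))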

feature_names[50:70]

dense = tfidf_matrix.todense()

Now I have a dense matrix representation of my first text column, summary (computed for the first row only).

My question is: how do I combine this with the rest of the features in my dataset so that I can use it in my model?
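To make the goal concrete, here is a minimal sketch of one way the combination could look, assuming the tf-idf matrix is kept sparse and the numeric columns are appended with `scipy.sparse.hstack`. The two-row `raw_data` frame and the `numeric_cols` list are hypothetical stand-ins for my real data, not the actual dataset.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for raw_data (text abbreviated, values from the sample rows).
raw_data = pd.DataFrame({
    "summary": [
        "Experienced and seasoned Data warehouse IT professional",
        "Working as a Technical Consultant at SAS Institute",
    ],
    "recommendation": [5, 5],
    "connections": [524, 4000],
    "total-experience_in_months": [30, 40],
})
numeric_cols = ["recommendation", "connections", "total-experience_in_months"]

# tf-idf on the whole summary column: a sparse matrix with one row per record.
tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")
tfidf_matrix = tf.fit_transform(raw_data["summary"])

# Wrap the numeric columns in a sparse matrix and append them column-wise.
numeric_block = csr_matrix(raw_data[numeric_cols].to_numpy(dtype=float))
X = hstack([tfidf_matrix, numeric_block]).tocsr()

print(X.shape)  # (number of rows, tf-idf terms + numeric columns)
```

The resulting `X` could then be passed to an estimator that accepts sparse input. The numeric columns are on a very different scale from the tf-idf weights, so scaling them first may be worth considering.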

Do I need to merge all the text columns into a single column and then calculate the tf-idf values, or should I calculate them for each text column separately?

I referred to the link below:
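The two options I have in mind can be sketched as follows. This is only an illustration on a hypothetical two-row frame with two text columns. It does not settle which option is better for a model.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for raw_data with two text columns.
raw_data = pd.DataFrame({
    "summary": ["data warehouse professional", "technical consultant"],
    "description01": ["technical planning", "data modeling transition"],
})
text_cols = ["summary", "description01"]

# Option 1: join the text columns into one string per row and fit a single
# vectorizer, so all columns share one vocabulary.
joined = raw_data["summary"].fillna("") + " " + raw_data["description01"].fillna("")
X_joined = TfidfVectorizer().fit_transform(joined)

# Option 2: fit one vectorizer per column and stack the blocks side by side,
# so the same word in different columns becomes a different feature.
blocks = [TfidfVectorizer().fit_transform(raw_data[c].fillna("")) for c in text_cols]
X_separate = hstack(blocks).tocsr()

print(X_joined.shape, X_separate.shape)
```

With separate vectorizers, a word that appears in both columns gets one feature per column, so the second matrix is wider than the first.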

http://www.markhneedham.com/blog/2015/02/15/pythonscikit-learn-calculating-tfidf-on-how-i-met-your-mother-transcripts/
