Tuesday, 28 June 2016

Filter keywords and clean data with many columns

I have scraped data from Twitter using a lat-lon search, and I am now finding it difficult to clean the data and to separate the # keywords and @ mentions from the message.

I am very new to this side of Python programming and am finding it difficult to work with so many columns.

Header Fields

tweetID,
tweetText,
tweetRetweetCt,
tweetFavoriteCt,
tweetSource,
tweetCreated,
userID,
userScreen,
userName,
userCreateDt,
userDesc,
userFollowerCt,
userFriendsCt,
userLocation,
userTimezone,
Coordinates,
GeoEnabled,
Language

Tweet Sample One

  721125953926258688,
"Wind 4.4 mph W. Barometer 1001.0 mb, Rising slowly. Temperature 7.8 °C. Rain today 0.0 mm. Humidity 87%",
0,
0,
Sandaysoft Cumulus,
2016-04-15 23:59:59,
15380049,
BandAid,
JD,
2008-07-10 17:00:00,
50 something baldy bloke on the South Coast.,
103,
385,
Lee-on-the-Solent,
London,
"{u'type': u'Point', u'coordinates': [-1.19027778, 50.80194444]}",
True,
en

Tweet Sample Two

721125952886059008,
"Wind 4.0 mph WNW
    Barometer 997.80 mb Rising
    Temperature 7.7 C
    Rain today 0.0 mm
    Humidity 97% 
    #Clacton #Weather
    URL(had to remove)",
0,
0,
Sandaysoft Cumulus,
2016-04-15 23:59:59,
22190942,
Clacton_Weather,
Clacton Weather,
2009-02-27 21:12:34,
Sign up for Tweets from Clacton on Sea Weather Station. Get weather updates every hour. Check out the website. Also a Radio Ham URL(removed),
967,
870,
"Clacton on Sea, Essex, UK",
London,
"{u'type': u'Point', u'coordinates': [1.16888889, 51.81361111]}",
True,
en

What am I trying to achieve?

One: Create two new columns for # keywords and @ mentions

tweetID,
tweetText,
hashKeywords,#New Addition
mentions,    #New Addition
tweetRetweetCt,
tweetFavoriteCt,
tweetSource,
tweetCreated,
userID,
userScreen,
userName,
userCreateDt,
userDesc,
userFollowerCt,
userFriendsCt,
userLocation,
userTimezone,
Coordinates,
GeoEnabled,
Language 
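One possible way to add the two columns, sketched with only the standard library `csv` module. It assumes the file has a header row using the field names listed above and that multi-line tweets are quoted (as in Tweet Sample Two). The function name `add_columns` and the in-memory sample are illustrative, not part of the original data.

```python
import csv
import io

NEW_COLUMNS = ["hashKeywords", "mentions"]

def add_columns(src, dst, after="tweetText", new_columns=NEW_COLUMNS, default="0"):
    """Copy CSV rows from src to dst, inserting new_columns right after `after`."""
    reader = csv.DictReader(src)
    fields = list(reader.fieldnames)
    pos = fields.index(after) + 1
    out_fields = fields[:pos] + list(new_columns) + fields[pos:]

    writer = csv.DictWriter(dst, fieldnames=out_fields)
    writer.writeheader()
    for row in reader:
        # Placeholder value; the real values are filled in a later step.
        for name in new_columns:
            row[name] = default
        writer.writerow(row)
    return out_fields

# Tiny in-memory stand-in for the real file (hypothetical, shortened header).
sample = io.StringIO(
    'tweetID,tweetText,tweetRetweetCt\n'
    '1,"Wind 4.0 mph WNW\n#Clacton #Weather",0\n'
)
result = io.StringIO()
header = add_columns(sample, result)
```

With real files, `src` and `dst` would be file objects opened with `newline=""`, which is what the `csv` module expects so that line breaks inside quoted tweets survive.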

Two: Fill the newly created columns by extracting the # keywords and @ mentions from tweetText. Where the text contains no hashtags or no mentions, the corresponding value can be set to 0.
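A minimal sketch of the extraction using regular expressions. The patterns, the helper name `extract_tags`, and the choice to join several matches with a space are all assumptions. Twitter's own rules for what counts as a hashtag or mention are more detailed than `\w+`.

```python
import re

# '#' or '@' followed by word characters, not preceded by a word character
# (so the '@' in an e-mail address such as a@b.com is not treated as a mention).
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")

def extract_tags(text, pattern, empty="0"):
    """Return the matches joined by spaces, or `empty` when there are none."""
    found = pattern.findall(text)
    return " ".join(found) if found else empty

tweet = "Wind 4.0 mph WNW\nHumidity 97%\n#Clacton #Weather"
hash_keywords = extract_tags(tweet, HASHTAG_RE)
mentions = extract_tags(tweet, MENTION_RE)
```

For each row, `hash_keywords` and `mentions` would be written into the hashKeywords and mentions columns. The `#` and `@` prefixes are dropped here; to keep them, move the symbol inside the capture group.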

Three: Clean the Coordinates column by keeping only the coordinates and removing all other words and special characters.

I have searched extensively but have not been able to find a proper reference.
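The Coordinates values in the samples look like the text form of a Python dict, so one option is to parse them with `ast.literal_eval` rather than strip characters by hand. This is a sketch under that assumption. The helper name `parse_coordinates` is made up, and the pair is treated as [longitude, latitude], which matches the GeoJSON ordering and the UK locations in the samples.

```python
import ast

def parse_coordinates(raw):
    """Return (lon, lat) from a dict-like string, or None if it cannot be parsed."""
    if not raw:
        return None
    try:
        # literal_eval only evaluates literals, so it is safe on scraped text.
        geo = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return None
    if not isinstance(geo, dict):
        return None
    coords = geo.get("coordinates")
    if not isinstance(coords, (list, tuple)) or len(coords) != 2:
        return None
    lon, lat = coords
    return lon, lat

raw = "{u'type': u'Point', u'coordinates': [-1.19027778, 50.80194444]}"
point = parse_coordinates(raw)
```

The cleaned column could then hold something like `"%s,%s" % point`, or the two numbers could be split into separate longitude and latitude columns.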
