I scraped data from Twitter using a lat-lon search, and I am now having trouble cleaning it and separating the # keywords and @ mentions from the message text.
I am very new to this side of Python programming and find it hard to work with so many columns.
Header Fields
tweetID,
tweetText,
tweetRetweetCt,
tweetFavoriteCt,
tweetSource,
tweetCreated,
userID,
userScreen,
userName,
userCreateDt,
userDesc,
userFollowerCt,
userFriendsCt,
userLocation,
userTimezone,
Coordinates,
GeoEnabled,
Language
Tweet Sample One
721125953926258688,
"Wind 4.4 mph W. Barometer 1001.0 mb, Rising slowly. Temperature 7.8 °C. Rain today 0.0 mm. Humidity 87%",
0,
0,
Sandaysoft Cumulus,
2016-04-15 23:59:59,
15380049,
BandAid,
JD,
2008-07-10 17:00:00,
50 something baldy bloke on the South Coast.,
103,
385,
Lee-on-the-Solent,
London,
"{u'type': u'Point', u'coordinates': [-1.19027778, 50.80194444]}",
True,
en
Tweet Sample Two
721125952886059008,
"Wind 4.0 mph WNW
Barometer 997.80 mb Rising
Temperature 7.7 C
Rain today 0.0 mm
Humidity 97%
#Clacton #Weather
URL(had to remove)",
0,
0,
Sandaysoft Cumulus,
2016-04-15 23:59:59,
22190942,
Clacton_Weather,
Clacton Weather,
2009-02-27 21:12:34,
Sign up for Tweets from Clacton on Sea Weather Station. Get weather updates every hour. Check out the website. Also a Radio Ham URL(removed),
967,
870,
"Clacton on Sea, Essex, UK",
London,
"{u'type': u'Point', u'coordinates': [1.16888889, 51.81361111]}",
True,
en
What am I trying to achieve?
One: Create two new columns for # keywords and @ mentions
tweetID,
tweetText,
hashKeywords, #New Addition
mentions, #New Addition
tweetRetweetCt,
tweetFavoriteCt,
tweetSource,
tweetCreated,
userID,
userScreen,
userName,
userCreateDt,
userDesc,
userFollowerCt,
userFriendsCt,
userLocation,
userTimezone,
Coordinates,
GeoEnabled,
Language
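For step One, here is a minimal sketch of one possible approach using only the standard csv module, assuming Python 3 and a file whose first row is the header shown above. The function names extended_header and add_empty_columns are made up for illustration, and "0" is used as the placeholder value. The csv module also copes with quoted tweetText values that span several lines, as in Tweet Sample Two.

```python
import csv
import io

def extended_header(header, new_columns=("hashKeywords", "mentions"), after="tweetText"):
    """Return a copy of header with new_columns placed right after `after`."""
    position = header.index(after) + 1
    return header[:position] + list(new_columns) + header[position:]

def add_empty_columns(in_file, out_file):
    """Copy a CSV, adding the two new columns with a placeholder value of "0"."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(
        out_file,
        fieldnames=extended_header(list(reader.fieldnames)),
        restval="0",  # written for any column missing from a row
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)

# Tiny in-memory demo with a shortened header.
# For real files, open them with open(path, newline="") as the csv docs advise.
source = io.StringIO('tweetID,tweetText,Language\n1,"Rain today\n#Clacton",en\n')
target = io.StringIO()
add_empty_columns(source, target)
print(target.getvalue())
```

The new columns only hold the placeholder at this point; filling them is a separate step.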
Two: Fill the newly created columns by filtering the # keywords and @ mentions out of tweetText.
If the text contains no match for a filter, the value can be set to 0.
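For step Two, a regular expression is probably the simplest tool. The sketch below assumes a hashtag or mention is "#" or "@" followed by letters, digits or underscores, and not glued to a preceding word character (so an e-mail address is not picked up as a mention). The function name extract_tags and the choice to join several matches with a space are illustrative assumptions, not a fixed convention.

```python
import re

# "#" or "@" followed by word characters, not preceded by a word character.
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")

def extract_tags(text):
    """Return (hashKeywords, mentions) for one tweet; "0" when nothing matches."""
    hashtags = HASHTAG_RE.findall(text)
    mentions = MENTION_RE.findall(text)
    return (" ".join(hashtags) or "0", " ".join(mentions) or "0")

# Text taken from Tweet Sample Two (shortened).
sample = "Rain today 0.0 mm\nHumidity 97%\n#Clacton #Weather"
print(extract_tags(sample))
```

Each row's tweetText can be passed through this function and the two results written into the hashKeywords and mentions columns.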
Three: Clean the Coordinates column by keeping only the coordinates and removing all other words and special characters.
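For step Three, the Coordinates value in the samples looks like the printed form of a Python dict, so ast.literal_eval from the standard library can parse it safely instead of stripping characters by hand. This is only a sketch: the function name clean_coordinates is made up, the empty-string result for missing or unparseable values is an assumption, and the two numbers are kept in the order they appear in the data (which for the samples looks like longitude first, then latitude).

```python
import ast

def clean_coordinates(raw, missing=""):
    """Return "first,second" from a dict-like Coordinates string, or `missing`."""
    try:
        geo = ast.literal_eval(raw)          # parses "{u'type': ..., u'coordinates': [...]}"
        first, second = geo["coordinates"]   # order is kept exactly as stored
    except (ValueError, SyntaxError, TypeError, KeyError):
        return missing                       # e.g. empty cell, "None", or malformed text
    return "{},{}".format(first, second)

# Value taken from Tweet Sample One.
raw = "{u'type': u'Point', u'coordinates': [-1.19027778, 50.80194444]}"
print(clean_coordinates(raw))
```

If two separate numeric columns are preferred, the function could return the pair of floats instead of a joined string.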
I have searched extensively but have not been able to find a proper reference.