Thursday, July 7, 2016

python string encoding unicode

I'm using Python 2.7 and I'm having trouble converting characters like "ä" to "ae".

I'm retrieving the content of a webpage using:

req = urllib2.Request(url + str(questionID))
response = urllib2.urlopen(req)
data = response.read()

After that I extract part of the page, and that is where my problem occurs.

extractedStr = pageContent[start:end]  # this string contains the "ä"
extractedStr = extractedStr.decode("utf8")  # the error is raised here; I tried encode() as well
extractedStr = extractedStr.replace(u"ä", "ae")

--> 'utf8' codec can't decode byte 0xe4 in position 13: invalid continuation byte
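
A note on the error: 0xe4 is the single byte that ISO-8859-1 (Latin-1) and Windows-1252 use for "ä", while UTF-8 encodes the same character as the two bytes 0xc3 0xa4. That suggests the page is probably not served as UTF-8. Below is a minimal sketch of one possible workaround, assuming the page is Latin-1. That is only a guess; the actual charset should be checked in the response's Content-Type header or the page's meta tag. The helper name `decode_page_bytes` and the sample bytes are hypothetical, not from the original code.

```python
def decode_page_bytes(raw, encodings=("utf-8", "latin-1")):
    """Return raw decoded with the first encoding that accepts it.

    Assumption: the page is either UTF-8 or Latin-1. Latin-1 maps every
    byte to a character, so it never fails and acts as the last resort.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the given encodings could decode the data")


# Hypothetical sample: the word "Geraeusch" with its umlaut as sent by a Latin-1 page.
page_bytes = b"Ger\xe4usch"

text = decode_page_bytes(page_bytes)      # UTF-8 fails, Latin-1 succeeds
converted = text.replace(u"\u00e4", u"ae")  # u"\u00e4" is the character "ä"
print(converted)
```

The fallback order matters: UTF-8 is tried first because it rejects most non-UTF-8 input, whereas Latin-1 accepts anything and would silently mis-decode real UTF-8 data.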

But my simple test works fine:

someStr = "geräusch"
someStr = someStr.decode("utf8")
someStr = someStr.replace(u"ä", "ae")

I have a feeling it has something to do with when I call .decode(). I tried it at several points in the code, without success :(
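
One plausible reason the simple test behaves differently has less to do with when .decode() is called than with which bytes it receives. If the script file is saved as UTF-8 (an assumption about the editor setup), the literal "geräusch" is stored as UTF-8 bytes, so decoding it as UTF-8 succeeds. The sketch below compares the two byte sequences; the variable names and the helper `decodes_as_utf8` are illustrative only.

```python
# The same word as two different byte sequences.
utf8_bytes = b"ger\xc3\xa4usch"   # how a UTF-8 source file stores the literal
latin1_bytes = b"ger\xe4usch"     # how a Latin-1 web page would send it


def decodes_as_utf8(raw):
    """Report whether raw is valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(utf8_bytes))    # the literal from the simple test
print(decodes_as_utf8(latin1_bytes))  # the bytes that trigger the error

# Decoded with the matching codec, both yield the same text.
same_text = utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1")
```

Once both inputs are decoded with their matching codec, the replace(u"ä", "ae") step works identically on either one.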
