This question already has an answer here:
- Parsing invalid Unicode JSON in Python (2 answers)
I am getting a large data file from an external service, where each line is a JSON object. However, it contains hex escapes such as \xef, \xa0, \xa9 and some Unicode escapes such as \u2022. I am reading the file like this:
import json

# filename is the path to the line-delimited JSON file
with open(filename, 'r') as fh:
    for line in fh:
        attr = json.loads(line)
I tried passing the encodings utf-8 and latin-1 to open, but json.loads still fails. If I remove the invalid characters, loads works, but I don't want to lose any data. What's the recommended way to fix this?
repr(line) sample:
'{"product_type":"SHOES","recommended_browse_nodes":"361208011","item_name":["Citygate 960561 Ankle Boots Womens Gray Grau (anthrazit 9) Size: 8 (42 EU)"],"product_description":[],"brand_name":"Citygate","manufacturer":"J H P\xf6lking GmbH & Co KG","bullet_point":[],"department_name":"Women\u2019s","size_name":"42 EU","material_composition":["Leather"]}\n'
json.loads fails at \xf6 in item_name with Invalid \escape: line 1 column 105 (char 104).
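One possible workaround, sketched under an assumption the post does not confirm: the file contains the literal four-character text \xNN (backslash, x, two hex digits) rather than raw bytes. JSON has no \x escape, which would explain the Invalid \escape error, while \uNNNN is legal. The helper names fix_hex_escapes and parse_line below are hypothetical, and the rewrite treats each NN as a Latin-1 code point, which may not match how the service actually encoded the data.

```python
import json
import re

# Matches a literal \xNN escape whose backslash is not itself escaped:
# group 1 captures any preceding pairs of backslashes (already-escaped
# backslashes), group 2 captures the two hex digits.
_HEX_ESCAPE = re.compile(r'(?<!\\)((?:\\\\)*)\\x([0-9a-fA-F]{2})')


def fix_hex_escapes(line):
    """Rewrite \\xNN as the JSON-legal \\u00NN.

    Assumption: NN is a Latin-1 code point (e.g. f6 -> o-umlaut).
    """
    return _HEX_ESCAPE.sub(
        lambda m: m.group(1) + '\\u00' + m.group(2), line
    )


def parse_line(line):
    """Parse one line of the file after repairing \\xNN escapes."""
    return json.loads(fix_hex_escapes(line))
```

In the loop from the question this would be used as attr = parse_line(line). Note that a run such as \xef\xa0\xa9 comes out as three separate characters; if those are really the bytes of one UTF-8 character, the affected strings would need an extra re-decoding step (encode as latin-1, then decode as utf-8), and that cannot be determined from the sample alone.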