Article from: https://segmentfault.com/q/1010000012136179
Question:
While scraping data with Python, I extracted a JSON string, str, from a web page:
{\"count\":8,\"sub_images\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470700000c7084773fb2\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470700000c7084773fb2\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470700000c7084773fb2\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470700000c7084773fb2\"}],\"uri\":\"origin\\/470700000c7084773fb2\",\"height\":1590},{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/47050001b69355a0bf1b\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/47050001b69355a0bf1b\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/47050001b69355a0bf1b\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/47050001b69355a0bf1b\"}],\"uri\":\"origin\\/47050001b69355a0bf1b\",\"height\":1557},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470300020761150d671a\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470300020761150d671a\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470300020761150d671a\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470300020761150d671a\"}],\"uri\":\"origin\\/470300020761150d671a\",\"height\":1552},{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/47000002200f2a0a9020\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/47000002200f2a0a9020\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/47000002200f2a0a9020\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/47000002200f2a0a9020\"}],\"uri\":\"origin\\/47000002200f2a0a9020\",\"height\":1575},{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/470000022011d5569ccb\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/470000022011d5569ccb\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/470000022011d5569ccb\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470000022011d5569ccb\"}],\"uri\":\"origin\\/470000022011d5569ccb\",\"height\":1588},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/4700000220127db96444\",\"width\":1178,\"url_list\":[{\
"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/4700000220127db96444\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/4700000220127db96444\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/4700000220127db96444\"}],\"uri\":\"origin\\/4700000220127db96444\",\"height\":1561},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/46ff000532e33a9fa35a\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"}],\"uri\":\"origin\\/46ff000532e33a9fa35a\",\"height\":1563},{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/470700000c7b871a5fae\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/470700000c7b871a5fae\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470700000c7b871a5fae\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/470700000c7b871a5fae\"}],\"uri\":\"origin\\/470700000c7b871a5fae\",\"height\":1575}],\"max_img_width\":1178,\"labels\":[],\"sub_abstracts\":[\" \",\" \",\" \",\" \",\" \",\" \",\" \",\" \"],\"sub_titles\":[\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\"]}
As you can see, the quotes and slashes are backslash-escaped, so the string cannot be passed directly to
data = json.loads(str)
to turn it into a Python object. So far I have found two ways around this.
1. Use the demjson library, which I found by searching online:
data = json.loads(demjson.decode(str))
2. Apply a regular-expression replacement to str. This does not seem reliable, though, because it may also alter legitimate data.
Is there any other way? I am new to Python and would appreciate some advice.
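To make the problem concrete, here is a minimal sketch that reproduces the failure on a short excerpt of the string above. It assumes the scraped text really contains the backslashes as literal characters (hence the raw string literal); the variable names `raw` and `parsed_ok` are only illustrative.

```python
import json

# A short excerpt of the scraped text. The raw literal keeps every backslash,
# so the string starts with {\"count\" rather than {"count".
raw = r'{\"count\":8,\"uri\":\"origin\\/470700000c7084773fb2\"}'

try:
    json.loads(raw)
    parsed_ok = True
except json.JSONDecodeError as exc:
    # The parser expects a quoted key right after "{" but finds a backslash.
    parsed_ok = False
    print("json.loads failed:", exc)
```

The parser rejects the text at the first backslash, which is why some form of unescaping is needed before json.loads can succeed.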
Answer 0:
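import re
import json

# Short excerpt of the escaped text; replace it with your own string.
data = r'{\"count\":8,\"uri\":\"origin\\/470700000c7084773fb2\"}'

# r'\\"' matches a backslash followed by a double quote.
data = re.sub(r'\\"', '"', data)
# r'\\+/' matches one or more backslashes followed by a slash.
data = re.sub(r'\\+/', '/', data)

data_json = json.loads(data)
print(data_json)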
Answer 1:
Just replace all of the backslashes with an empty string.
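Reading this answer as "remove every backslash", a minimal sketch might look like the following. The sample is a shortened excerpt of the question's data, kept in a raw string on the assumption that the scraped text contains literal backslashes. It also shows the risk the question worried about: the backslash in front of each \uXXXX escape is removed too, so the titles are no longer decoded.

```python
import json

# Shortened excerpt of the escaped text, including two \uXXXX escapes.
raw = r'{\"count\":8,\"sub_titles\":[\"\\u6e05\\u65b0\"]}'

# Remove every backslash, then parse what is left.
stripped = raw.replace('\\', '')
data = json.loads(stripped)

print(data['count'])       # 8
print(data['sub_titles'])  # ['u6e05u65b0'], not the intended Chinese text
```

Parsing succeeds and the numeric fields and URLs come out fine, so this may be enough when only those fields are needed; any field that relies on backslash escapes is silently damaged.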
Answer 2:
I tested this myself and it works:
import re
import json
jsonStr = '{\"count\":8,\"sub_images\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470700000c7084773fb2\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470700000c7084773fb2\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470700000c7084773fb2\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470700000c7084773fb2\"}],\"uri\":\"origin\\/470700000c7084773fb2\",\"height\":1590},{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/47050001b69355a0bf1b\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/47050001b69355a0bf1b\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/47050001b69355a0bf1b\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/47050001b69355a0bf1b\"}],\"uri\":\"origin\\/47050001b69355a0bf1b\",\"height\":1557},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470300020761150d671a\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470300020761150d671a\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470300020761150d671a\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470300020761150d671a\"}],\"uri\":\"origin\\/470300020761150d671a\",\"height\":1552},{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/47000002200f2a0a9020\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/47000002200f2a0a9020\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/47000002200f2a0a9020\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/47000002200f2a0a9020\"}],\"uri\":\"origin\\/47000002200f2a0a9020\",\"height\":1575},{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/470000022011d5569ccb\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/470000022011d5569ccb\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/470000022011d5569ccb\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470000022011d5569ccb\"}],\"uri\":\"origin\\/470000022011d5569ccb\",\"height\":1588},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/4700000220127db96444\",\"width\":1178,\"url
_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/4700000220127db96444\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/4700000220127db96444\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/4700000220127db96444\"}],\"uri\":\"origin\\/4700000220127db96444\",\"height\":1561},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/46ff000532e33a9fa35a\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"}],\"uri\":\"origin\\/46ff000532e33a9fa35a\",\"height\":1563},{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/470700000c7b871a5fae\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/470700000c7b871a5fae\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470700000c7b871a5fae\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/470700000c7b871a5fae\"}],\"uri\":\"origin\\/470700000c7b871a5fae\",\"height\":1575}],\"max_img_width\":1178,\"labels\":[],\"sub_abstracts\":[\" \",\" \",\" \",\" \",\" \",\" \",\" \",\" \"],\"sub_titles\":[\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\"]}'
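# Remove the escaping backslashes (one or two in a row), then parse the result.
# Caveat: this also strips the backslash from \uXXXX escapes, so non-ASCII
# text such as sub_titles will not be decoded correctly.
jsonStr2 = re.sub(r'\\{1,2}', '', jsonStr)
jsonObj = json.loads(jsonStr2)
print(jsonObj)
print(jsonObj.get('count'))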
Answer 3:
Quite simple (this is Python 2, where str supports the string_escape codec):
>>> print '"Hello,\\nworld!"'.decode('string_escape')
"Hello,
world!"
>>> data = json.loads('{\"count\":8,\"sub_images\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470700000c7084773fb2\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470700000c7084773fb2\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470700000c7084773fb2\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470700000c7084773fb2\"}],\"uri\":\"origin\\/470700000c7084773fb2\",\"height\":1590},{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/47050001b69355a0bf1b\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/47050001b69355a0bf1b\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/47050001b69355a0bf1b\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/47050001b69355a0bf1b\"}],\"uri\":\"origin\\/47050001b69355a0bf1b\",\"height\":1557},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470300020761150d671a\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470300020761150d671a\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470300020761150d671a\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470300020761150d671a\"}],\"uri\":\"origin\\/470300020761150d671a\",\"height\":1552},{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/47000002200f2a0a9020\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/47000002200f2a0a9020\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/47000002200f2a0a9020\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/47000002200f2a0a9020\"}],\"uri\":\"origin\\/47000002200f2a0a9020\",\"height\":1575},{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/470000022011d5569ccb\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/470000022011d5569ccb\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/470000022011d5569ccb\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470000022011d5569ccb\"}],\"uri\":\"origin\\/470000022011d5569ccb\",\"height\":1588},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/4700000220127db96444\",\"width\
":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/4700000220127db96444\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/4700000220127db96444\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/4700000220127db96444\"}],\"uri\":\"origin\\/4700000220127db96444\",\"height\":1561},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/46ff000532e33a9fa35a\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"}],\"uri\":\"origin\\/46ff000532e33a9fa35a\",\"height\":1563},{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/470700000c7b871a5fae\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/470700000c7b871a5fae\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470700000c7b871a5fae\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/470700000c7b871a5fae\"}],\"uri\":\"origin\\/470700000c7b871a5fae\",\"height\":1575}],\"max_img_width\":1178,\"labels\":[],\"sub_abstracts\":[\" \",\" \",\" \",\" \",\" \",\" \",\" \",\" \"],\"sub_titles\":[\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\"]}'.decode('string_escape'))
>>>
>>> data["count"]
8
>>>
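The string_escape codec used above exists only in Python 2. As a hedged sketch of a Python 3 alternative, the text can be treated as the body of a JSON string literal and decoded in two passes with json.loads. This assumes the scraped text is exactly such a body (no surrounding quotes, no unescaped double quotes or raw newlines); the helper name `decode_escaped_json` is made up for illustration.

```python
import json


def decode_escaped_json(raw):
    """Decode JSON text that was itself escaped as a JSON string body."""
    # First pass: wrap the text in quotes and decode it as one JSON string,
    # which turns \" into " and \\ into a single backslash.
    inner = json.loads('"' + raw + '"')
    # Second pass: parse the now-valid JSON document.
    return json.loads(inner)


# Shortened excerpt of the data from the question, kept verbatim in a raw string.
raw = (r'{\"count\":8,\"uri\":\"origin\\/470700000c7084773fb2\",'
       r'\"sub_titles\":[\"\\u6e05\\u65b0\"]}')

data = decode_escaped_json(raw)
print(data['count'])       # 8
print(data['uri'])         # origin/470700000c7084773fb2
print(data['sub_titles'])  # ['清新']
```

Because both passes use the JSON parser itself, no backslash is removed by hand, so escaped slashes and \uXXXX sequences are decoded rather than damaged.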
Answer 4:
This looks like the street-snap gallery data from Toutiao ("Today's Headlines"). You can simply replace the backslashes with an empty string ('') and then call json.loads.
Link of this Article: The problem of Python crawler JSON parsing