The school server can reach the outside network, so I decided to write a script that automatically crawls jokes and posts them to the BBS. I searched the Internet for a joke website and found one where most of the jokes did not feel too lame. Its HTML structure is as follows:
As you can see, the list of joke links sits inside <div class="list_title">. A regular expression can pull out the addresses of the most recent entries, and from there the script goes into each joke page.
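The link extraction described above can be sketched on a small sample. The markup in sample_html is a hypothetical, simplified stand-in for the list page (the real page may differ), and extract_joke_links is a helper name introduced only for this illustration. It assumes the list_title div contains no nested div.

```python
import re

# Hypothetical, simplified stand-in for the list page; the real markup may differ.
sample_html = '''
<div class="list_title">
<ul>
<li><a href="/jokehtml/a/1.htm" target="_blank">First</a></li>
<li><a href="/jokehtml/b/2.htm" target="_blank">Second</a></li>
</ul>
</div>
<div class="other"><a href="/about.htm" target="_blank">About</a></div>
'''


def extract_joke_links(html, limit=4):
    """Return up to `limit` hrefs found inside the list_title div."""
    # [\w\W]*? spans newlines; the lazy match stops at the first </div>.
    block = re.search(r'<div class="list_title">([\w\W]*?)</div>', html)
    if block is None:
        return []
    # \s* tolerates optional whitespace around the target attribute.
    links = re.findall(r'<a href="(.*?)"\s*target="_blank"\s*>', block.group(1))
    return links[:limit]


print(extract_joke_links(sample_html))
```

Restricting the search to the div first keeps unrelated links elsewhere on the page (such as the "About" link in the sample) out of the result.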
Each joke page is made up of several short jokes, all under the <span id="text110"> tag, and each short joke is wrapped in its own <p>, so it is easy to put the individual jokes into a list. I only want to post jokes during the daytime each day, so crawling 20 is enough. Each page holds 5 jokes on average, so 4 pages will do. Here are some details. Some links on this joke site contain Chinese characters, such as:
<a href="/jokehtml/冷笑话/2014051200030765.htm" target="_blank">Reading more than ten thousand volumes, funny like a God</a>
Calling urllib.request.urlopen directly on a URL that contains Chinese characters fails. The path has to be percent-encoded with urllib.parse.quote first so that it resolves correctly. One more detail: there are newlines between the jokes, and "." in a regular expression does not match a newline, so "." has to be replaced with "[\w\W]" for the pattern to match. Here is the code:
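    import re
    import urllib.parse
    import urllib.request

    # [\w\W]*? also matches newlines, which '.' does not
    rule_joke = re.compile(r'<span id="text110">([\w\W]*?)</span>')
    # \s* tolerates optional whitespace around the target attribute
    rule_url = re.compile(r'<a href="(.*?)"\s*target="_blank"\s*>')
    mainUrl = 'http://www.jokeji.cn'
    url = 'http://www.jokeji.cn/list.htm'

    # Fetch the list page (decoded as GBK) and collect the links
    req = urllib.request.urlopen(url)
    html = req.read().decode('gbk')
    urls = rule_url.findall(html)

    with open('joke.txt', 'w', encoding='utf-8') as f:
        for link in urls[:4]:  # 4 pages, about 5 jokes each
            # Percent-encode the Chinese characters in the path
            joke_url = mainUrl + urllib.parse.quote(link)
            req2 = urllib.request.urlopen(joke_url)
            html2 = req2.read().decode('gbk')
            found = rule_joke.findall(html2)
            if not found:
                continue  # no joke container on this page
            # findall returns a list, so split the first match, not the list
            for joke in found[0].split('<P>'):
                joke = joke.replace('</P>', '').replace('<BR>', '')
                joke = joke[2:].strip()  # skip the first two characters of each piece
                if joke:
                    f.write(joke + '\n')  # one joke per line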
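The two details mentioned before the listing can be checked in isolation, without touching the network. This is a minimal sketch: the path and the HTML fragment are made-up samples, and the re.DOTALL variant is an alternative not used in the code above.

```python
import re
import urllib.parse

# Detail 1: percent-encode a path that contains Chinese characters.
# quote() leaves '/' untouched by default, so the path structure survives.
path = '/jokehtml/冷笑话/2014051200030765.htm'  # sample path for illustration
encoded = urllib.parse.quote(path)
print(encoded)

# Detail 2: '.' does not match a newline, '[\w\W]' does.
html = '<span id="text110"><P>1、first</P>\n<P>2、second</P></span>'  # made-up fragment
dot_rule = re.compile(r'<span id="text110">(.*?)</span>')
any_rule = re.compile(r'<span id="text110">([\w\W]*?)</span>')
# Alternative: keep '.' and pass re.DOTALL so that '.' also matches newlines.
dotall_rule = re.compile(r'<span id="text110">(.*?)</span>', re.DOTALL)

print(dot_rule.findall(html))      # finds nothing because of the newline
print(any_rule.findall(html))      # finds the whole span body
print(dotall_rule.findall(html))   # same result as any_rule
```

The encoded path contains only ASCII characters, which is what urlopen needs, and urllib.parse.unquote turns it back into the original.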
Here are the results of the crawl.
This way, each line is a separate joke, which makes the file easy for other programs to use.
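As a rough idea of how another program might consume the file, here is a minimal sketch. load_jokes and pick_joke are hypothetical helper names, not part of the crawler, and the encoding argument is an assumption that must match whatever encoding joke.txt was written with.

```python
import random


def load_jokes(path='joke.txt', encoding='utf-8'):
    """Read one joke per line, skipping blank lines."""
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


def pick_joke(jokes, rng=random):
    """Return a random joke, or None when the list is empty."""
    return rng.choice(jokes) if jokes else None
```

A posting script could then call pick_joke(load_jokes()) and send the result to the BBS.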