
The school server has access to the outside network, so I decided to write a crawler that automatically fetches jokes and posts them to the BBS. I searched the Internet for a joke website whose jokes mostly do not feel too corny. Its HTML structure is as follows:

As you can see, the list of joke links sits inside <div class="list_title">, so a regular expression can pick out the most recent jokes, and from there the crawler can open each joke page.
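
The link extraction described above can be sketched offline. This is only an illustration: the sample markup, the paths in it, and the names `rule_block`, `rule_link`, and `extract_links` are assumptions, and the real page may be laid out differently.

```python
import re

# Hypothetical sample of the list page; the real markup may differ.
sample_html = '''
<div class="nav"><a href="/about.htm" target="_blank">About</a></div>
<div class="list_title">
<ul>
<li><a href="/jokehtml/a/1.htm" target="_blank">First joke</a></li>
<li><a href="/jokehtml/a/2.htm" target="_blank">Second joke</a></li>
</ul>
</div>
'''

# Capture the contents of each list_title block; [\w\W] also matches newlines.
rule_block = re.compile(r'<div class="list_title">([\w\W]*?)</div>')
# Capture the href of every link found inside a block.
rule_link = re.compile(r'<a href="(.*?)" target="_blank">')


def extract_links(html):
    """Return the hrefs of all links that sit inside list_title blocks."""
    links = []
    for block in rule_block.findall(html):
        links.extend(rule_link.findall(block))
    return links


links = extract_links(sample_html)
print(links)
```

Restricting the search to the list_title block first keeps navigation links elsewhere on the page out of the result, which a single page-wide link pattern would not do.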

Each joke page is made up of several short jokes, all under the <span id="text110"> tag, and each short joke is wrapped in its own <p>, so it is easy to put the individual jokes into a list. My aim is to post jokes during the daytime every day, so crawling 20 of them is enough. Each page holds about 5 jokes on average, so 4 pages will do. Here are some details. Some links on this site contain Chinese characters (shown translated here), such as:

    <a href="/jokehtml/Cold joke /2014051200030765.htm" target="_blank">Reading more than ten thousand volumes, funny like a God</a>

urllib.request.urlopen cannot open a URL that contains Chinese characters directly; the path has to be percent-encoded with urllib.parse.quote first. Another detail is that there are newlines between the jokes, and "." in a regular expression does not match a newline, so it has to be replaced with "[\w\W]" to match across lines. Here is the code:

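    import re
    import urllib.parse
    import urllib.request

    # [\w\W] matches any character, including the newlines between jokes.
    rule_joke = re.compile(r'<span id="text110">([\w\W]*?)</span>')
    # Matches links of the form shown in the example above.
    rule_url = re.compile(r'<a href="(.*?)" target="_blank">')
    mainUrl = ''  # site root (left blank in the original)
    url = ''      # joke list page (left blank in the original)

    req = urllib.request.urlopen(url)
    # This step is missing from the original listing. The encoding is an
    # assumption; adjust it to the site's actual charset (for example 'gbk').
    html = req.read().decode('utf-8', errors='ignore')
    urls = rule_url.findall(html)
    f = open('joke.txt', 'w')
    for i in range(4):  # 4 pages of about 5 jokes each
        url2 = urllib.parse.quote(urls[i])  # percent-encode the Chinese path
        joke_url = mainUrl + url2
        req2 = urllib.request.urlopen(joke_url)
        # Also missing from the original listing; same encoding assumption.
        html2 = req2.read().decode('utf-8', errors='ignore')
        joke = rule_joke.findall(html2)
        jokes = joke[0].split('<P>')

        for text in jokes:
            text = text.replace('</P>', '')
            text = text.replace('<BR>', '')
            text = text[2:]  # drop the first two characters of each piece
            f.write(text)
    f.close()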
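
The two details mentioned before the listing can be checked without touching the network. The sketch below is only an illustration: the path and the page fragment are made-up stand-ins, not taken from the real site, and `re.DOTALL` is shown as an alternative to "[\w\W]" that the original script does not use.

```python
import re
import urllib.parse

# Hypothetical path with Chinese characters (not the site's real path).
path = '/jokehtml/冷笑话/2014051200030765.htm'
# quote() percent-encodes the non-ASCII characters and leaves '/' alone by default.
encoded = urllib.parse.quote(path)
print(encoded)

# Hypothetical page fragment with a newline between two jokes.
page = '<span id="text110"><P>first joke</P>\n<P>second joke</P></span>'

dot_rule = re.compile(r'<span id="text110">(.*?)</span>')
any_rule = re.compile(r'<span id="text110">([\w\W]*?)</span>')
dotall_rule = re.compile(r'<span id="text110">(.*?)</span>', re.DOTALL)

dot_hits = dot_rule.findall(page)        # '.' cannot cross the newline
any_hits = any_rule.findall(page)        # [\w\W] matches any character
dotall_hits = dotall_rule.findall(page)  # DOTALL lets '.' match newlines too

print(len(dot_hits), len(any_hits), len(dotall_hits))
```

Either "[\w\W]" or the `re.DOTALL` flag works here; the flag has the advantage of keeping the pattern itself readable.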

Here are the results of the crawl.

  This way, each line holds a separate joke, which makes the file easy for other programs to use.
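
As an illustration of how another program might consume such a file, here is a minimal sketch that picks one joke at random. The helper names `load_jokes` and `pick_joke` are hypothetical, a throwaway file stands in for joke.txt, and UTF-8 is an assumed encoding that should match however the file was actually written.

```python
import os
import random
import tempfile


def load_jokes(path):
    """Return the non-empty lines of the file, one joke per line."""
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def pick_joke(jokes):
    """Pick one joke at random; return None when the list is empty."""
    return random.choice(jokes) if jokes else None


# Demo with a throwaway file standing in for joke.txt.
demo_path = os.path.join(tempfile.mkdtemp(), 'joke.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('first joke\n\nsecond joke\n')

demo_jokes = load_jokes(demo_path)
print(pick_joke(demo_jokes))
```

Skipping blank lines in `load_jokes` means stray empty lines left over from the HTML do not turn into empty posts.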
