Article from: https://www.cnblogs.com/cx59244405/p/9064574.html

A crawler essentially simulates browser access.

So in principle, crawler code can reproduce any operation you would perform in a normal web page.

Once you are familiar with HTTP, it is fairly simple.

The steps of a crawler

1. Page analysis

Open the browser developer tools (F12) to inspect the page and locate the data you need.

If the data is rendered by JavaScript, you need a tool such as PhantomJS (a headless browser) or Selenium (which drives a real browser); a sketch follows.
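
The original article gives no Selenium example; here is a minimal sketch, assuming Selenium and a matching Chrome driver are installed (PhantomJS support has since been removed from Selenium, so headless Chrome stands in for it):

#A minimal sketch: drive a headless browser so JS-rendered data shows
#up in the HTML. Headless Chrome is an assumption, replacing PhantomJS.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')        #run without a visible window
driver = webdriver.Chrome(options=options)
driver.get('http://www.521609.com/daxuexiaohua')
html = driver.page_source                 #HTML after JavaScript has run
driver.quit()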

2. Crawl the pages

Download the required pages, parse them, and extract the data.

Libraries: requests (builds and sends the HTTP request), BeautifulSoup (parses the DOM into queryable Python objects). The basic pattern is sketched below.
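
As a minimal sketch of that pattern (reusing the same selectors as the full scripts below):

import requests
from bs4 import BeautifulSoup

rep = requests.get('http://www.521609.com/daxuexiaohua')
soup = BeautifulSoup(rep.text, 'lxml')    #requires the lxml parser
#soup is now a queryable tree of Python objects
for img in soup.select('.index_img ul a img'):
    print(img.get('src'))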

3. Save the data

Persist the downloaded data locally: to a database, a spreadsheet, or plain files; see the saving sketch below.
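
The scripts below only print each img_url rather than saving it; a minimal sketch of actually writing an image to disk might look like this (the save_image helper is illustrative, not from the original article):

import os
import requests

def save_image(img_url, out_dir='imgs'):
    #Hypothetical helper: fetch one image and write its bytes to disk
    os.makedirs(out_dir, exist_ok=True)
    rep = requests.get(img_url)
    filename = os.path.join(out_dir, img_url.rsplit('/', 1)[-1])
    with open(filename, 'wb') as f:       #binary mode for image bytes
        f.write(rep.content)
    return filename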

4. Analyze the data

Use analysis libraries to process the data and generate reports, for example as sketched below.
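
The article stops at printing; as one hedged example of this step, if the crawl results were saved as (page, img_url) rows, a report could be as simple as:

import pandas as pd

#Toy rows standing in for real crawl results
rows = [
    ('/list/1.html', 'http://www.521609.com/a.jpg'),
    ('/list/1.html', 'http://www.521609.com/b.jpg'),
    ('/list/2.html', 'http://www.521609.com/c.jpg'),
]
df = pd.DataFrame(rows, columns=['page', 'img_url'])
#Report: how many images each page yielded
print(df.groupby('page').size().rename('images'))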

Source code:

 

#coding:utf8
import requests
from bs4 import BeautifulSoup
import time

#Single-threaded recursive crawler
#Set of visited pages, used for deduplication
page_set = set()

def get_page(url):
    global page_set
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    }
    rep = requests.get('http://www.521609.com/daxuexiaohua'+url, headers=header)
    soup = BeautifulSoup(rep.text, 'lxml')
    #Find each picture and build its img_url
    imgs = soup.select('.index_img ul a img')
    for img in imgs:
        img_url = 'http://www.521609.com' + img.get('src')
        print('Downloading image %s' % img_url)
    #Find links to other pages
    pages = soup.select('.index_img .listpage a')
    for page in pages:
        page_url = '/' + page.get('href')
        #Only unseen pages are visited
        if page_url not in page_set:
            page_set.add(page_url)
            print('Visiting %s' % page_url)
            #Recurse into the new page; the dedup set bounds the recursion
            get_page(page_url)

time_start = time.time()
get_page('')
time_end = time.time()
print('Time consumed: %s seconds' % (time_end - time_start))
#Total time: 43.121999979 seconds

Single-threaded crawl of the campus-beauty photo site (521609.com)

 

Multithreading has a somewhat poor reputation here, mainly because nothing caps the thread count; a thread pool addresses that (see the sketch after this script).

#coding:utf8
import threading
import requests
from bs4 import BeautifulSoup
import time

#Multithreaded crawler; it sets no upper limit on the number of
#threads, so with too many pages it consumes too many resources.
#Set of visited pages, used for deduplication
page_set = set()
lock = threading.Lock()

def get_page(url):
    global page_set
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    }
    rep = requests.get('http://www.521609.com/daxuexiaohua'+url, headers=header)
    soup = BeautifulSoup(rep.text, 'lxml')
    #Find each picture and build its img_url
    imgs = soup.select('.index_img ul a img')
    for img in imgs:
        img_url = 'http://www.521609.com' + img.get('src')
        #Lock around the shared resource (stdout)
        with lock:
            print('Downloading image %s' % img_url)
    #Find links to other pages
    pages = soup.select('.index_img .listpage a')
    t_list = []
    for page in pages:
        page_url = '/' + page.get('href')
        #The check-and-add must happen under the lock, or two threads
        #could both claim the same page
        with lock:
            if page_url in page_set:
                continue
            page_set.add(page_url)
            with open('qq.txt', 'a+') as f:
                f.write('Visiting %s\r\n' % page_url)
            print('Visiting %s' % page_url)
        #Each new page is crawled in its own thread
        task = threading.Thread(target=get_page, args=(page_url,))
        task.start()
        t_list.append(task)
    for t in t_list:
        t.join()

time_start = time.time()
get_page('')
time_end = time.time()
print('Time consumed: %s seconds' % (time_end - time_start))
#Total time: 4.117000103 seconds

Multithreaded crawl of the campus-beauty photo site
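
As noted above, the threaded script caps nothing; here is a minimal sketch of bounding the thread count with concurrent.futures.ThreadPoolExecutor. It uses a work queue instead of recursion, since recursive submits can deadlock a bounded pool while parents wait on children. parse_page is a hypothetical function that crawls one page and returns the new page URLs it found:

from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl(parse_page, max_workers=8):
    #'' is the start URL suffix, matching get_page('') in the scripts here
    seen = {''}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(parse_page, '')}
        while pending:
            done = next(as_completed(pending))   #first finished future
            pending.remove(done)
            for url in done.result():            #newly discovered pages
                if url not in seen:
                    seen.add(url)
                    pending.add(pool.submit(parse_page, url))

With get_page refactored to return the page links it finds instead of recursing, crawl(get_page) would replace the direct call.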

 

Gevent coroutines are the recommended approach: small overhead, simple setup.

from gevent import monkey
monkey.patch_all()
from gevent.pool import Pool   #imported but unused here; see the sketch below
import gevent
import requests
from bs4 import BeautifulSoup
import time

#Coroutine (gevent) version
#Set of visited pages, used for deduplication
page_set = set()


def get_page(url):
    global page_set
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    }
    rep = requests.get('http://www.521609.com/daxuexiaohua'+url, headers=header)
    soup = BeautifulSoup(rep.text, 'lxml')
    #Find each picture and build its img_url
    imgs = soup.select('.index_img ul a img')
    for img in imgs:
        img_url = 'http://www.521609.com' + img.get('src')
        print('Downloading image %s' % img_url)
    #Find links to other pages
    pages = soup.select('.index_img .listpage a')
    g_list = []
    for page in pages:
        page_url = '/' + page.get('href')
        #Only unseen pages are visited; greenlets yield cooperatively,
        #so no lock is needed around the set
        if page_url not in page_set:
            page_set.add(page_url)
            print('Visiting %s' % page_url)
            #Each new page is crawled in its own greenlet
            g = gevent.spawn(get_page, page_url)
            g_list.append(g)
    for g1 in g_list:
        g1.join()


time_start = time.time()
get_page('')
time_end = time.time()
print('Time consumed: %s seconds' % (time_end - time_start))
#Total time: 4.104735612869263 seconds

Gevent coroutine crawl of the campus-beauty photo site
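
The script above imports gevent.pool.Pool but never uses it; a minimal sketch of bounding the greenlet count with it could look like this (the recursive gevent.spawn calls inside get_page would likewise become pool.spawn, which blocks once the pool is full):

pool = Pool(10)            #at most 10 greenlets run concurrently
pool.spawn(get_page, '')
pool.join()                #wait for every greenlet spawned via the pool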

 
