When I woke up, I found someone on QQ asking me for the source code of a Tieba crawler, which reminded me that I had once written a crawler that scrapes email addresses and phone numbers from Baidu Tieba posts. So I am open-sourcing it here for learning and reference.
- Requirement analysis:
- Test environment:
- Environment preparation:
- Details of environment choices:
- Knowledge points involved in a multi-threaded crawler:
- threading module (multi-threading):
- Queue module (queues):
- re (regular expressions):
- urllib and urllib2:
- Error handling in an automated crawler:
This crawler grabs the content of posts in Baidu Tieba and parses that content to extract phone numbers and email addresses. The main flow is explained in detail in the code comments.
The code has been tested on Windows 7 64-bit with Python 2.7 64-bit (MySQLdb extension installed) and on CentOS 6.5 with Python 2.7 (MySQLdb extension installed).
As the saying goes, to do a good job you must first sharpen your tools. As you can see from the screenshots, my environment is Windows 7 + PyCharm, and my Python is 2.7 64-bit, which is a comfortable development environment for beginners. I would also suggest that everyone install easy_install; as the name suggests, it is an installer, used to install extension packages. For example, Python cannot operate a MySQL database natively, so we must install the MySQLdb package before Python can talk to MySQL. With easy_install we can install the MySQLdb extension with a single command, making it as convenient as composer in PHP, yum in CentOS, or apt-get in Ubuntu.
The relevant tools can be found in my GitHub repository: cw1997/python-tools. To install easy_install, just run the py script from the Python command line and wait a moment; it will automatically add the Windows environment variable. If typing easy_install in the Windows command line produces output, the installation succeeded.
Details of environment choices:
As for hardware: the faster the CPU the better, and memory should start at 8 GB, because the crawler itself stores and parses a large amount of intermediate data. A multi-threaded crawler in particular, when grabbing both the paginated list and the detail pages and using a Queue to distribute the grabbing tasks, consumes a lot of memory. Also, the data we grab is sometimes JSON; if it is stored in a NoSQL database such as MongoDB, that also takes considerable memory.
For the network connection I suggest a wired network, because some low-quality wireless routers and ordinary consumer wireless cards on the market suffer intermittent disconnections, data loss, dropped packets and so on when many threads are open.
As for the operating system and Python, choose 64-bit, of course. A 32-bit operating system cannot use large amounts of memory, and a 32-bit Python may feel fine when grabbing data on a small scale, but once the data volume grows, for example when a list, queue, or dictionary holds a lot of data, Python will raise a memory overflow error once memory usage exceeds about 2 GB. The reason is explained in Evian's answer to a question I answered on SegmentFault: "Java - once python's memory reaches 1.9G, the httplib module begins to report memory overflow errors - SegmentFault".
If you plan to store the data in MySQL, use MySQL 5.7 or later, because 5.7 adds a native JSON data type, which lets you do without MongoDB. (Some say MySQL is a bit more stable than MongoDB; I am not sure about that.)
Python has already released the 3.x series, so why am I still using 2.7? One reason I chose 2.7 is that the copy of Core Python Programming I bought long ago is the second edition, which still uses 2.7 as its example version. Also, a great many tutorials on the Internet are still explained in terms of 2.7, and 2.7 differs from 3.x in quite a few ways; if we never learn 2.7, subtle syntax differences may cause us to misunderstand or fail to follow sample code. Moreover, some dependency packages are still only compatible with version 2.7. My suggestion: if you are preparing to learn Python and then work at a company, and the company has no legacy code to maintain, you can consider going straight to 3.x; but if you have plenty of time, no systematic mentor, and can only learn from scattered blog posts on the Internet, then learn 2.7 first and 3.x afterwards. After all, once you know 2.7, 3.x is quick to pick up.
Knowledge points involved in a multi-threaded crawler:
In fact, for any software project, if we want to know what knowledge is required to write it, we can look at which packages the project's main entry file imports.
Looking at our project as someone new to Python, some of these packages may be unfamiliar, so in this section I will briefly explain what role each package plays, what knowledge it involves, and what the key terms are. This article will not spend much time on fundamentals, so learn to make good use of Baidu and search for the keywords in these knowledge points. The points are listed below.
A crawler grabbing data is essentially a series of continuous HTTP requests whose responses we store on our own machine. Understanding the HTTP protocol helps us control parameters that can speed up grabbing, such as keep-alive.
threading module (multi-threading):
The programs we usually write are single-threaded: the code we write runs in the main thread, and the main thread runs in the Python process. For an explanation of threads and processes, see Ruan Yifeng's blog post "A simple explanation of processes and threads".
Multi-threading in Python is implemented through a module called threading. There used to be a thread module, but threading gives much stronger control over threads, so everyone later switched to threading for multi-threaded programming.
On using threading for multi-threading, I think this article is good: "[python] topic eight: multi-threaded programming with thread and threading"; feel free to refer to it.
Simply put, writing a multi-threaded program with the threading module means first defining a class yourself that inherits threading.Thread, and writing the work code for each thread in the class's run method. If the thread needs to do some initialization when it is created, write that code in its __init__ method, which is just like a constructor in PHP or Java.
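As a minimal sketch of this pattern (the class and names here are illustrative, not the crawler's actual code):

```python
import threading

class CrawlerThread(threading.Thread):
    """A minimal sketch of the subclass-threading.Thread pattern."""

    def __init__(self, thread_id, results):
        # Per-thread initialization goes here, but always call the
        # parent constructor first so the thread is set up correctly.
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.results = results

    def run(self):
        # The code each thread executes lives in run(); it is invoked
        # when start() is called, not when the object is constructed.
        self.results.append("thread %d done" % self.thread_id)

results = []
threads = [CrawlerThread(i, results) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every worker before the main thread moves on
```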
Another point worth mentioning here is thread safety. In a single-threaded program, only one thread operates on each resource at any moment, so no conflicts arise. In a multi-threaded program, however, two threads may operate on the same resource at the same moment and corrupt it, so we need a mechanism to prevent the damage such conflicts cause, usually locking of some kind. For example, MySQL's InnoDB table engine has row-level locks, and file operations have read locks; these are completed at the bottom layer of the respective programs. So as long as we know which operations, or which programs, handle thread-safety issues for us, we can use them in multi-threaded programming. Code written with thread safety in mind is usually called a "thread-safe version"; PHP, for example, has a TS build, where TS stands for Thread Safety. The Queue module we will discuss below is a thread-safe queue data structure, so we can use it in multi-threaded programming with confidence.
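A small sketch of protecting a shared resource with a lock; the counter here stands in for any shared structure:

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        # Without the lock, the read-increment-write below could
        # interleave between threads and lose updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter is exactly 40000 because every increment was serialized by the lock
```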
Finally, we come to the concept of thread blocking. After studying the threading module in detail, you will know how to create and start a thread. But if we just create a thread and call its start method, we may find that the whole program seems to end immediately. What is going on? The main thread contains only the code that starts the child threads, which means its only job is to start them; the code the child threads execute is essentially just a method written in a class, and the main thread does not actually run it. So once the main thread has started the child threads, its work is done and it retires honorably, and if the process ends with it, there is no memory space left for the other threads to keep working in. We therefore have to make the main thread wait until all the child threads have finished before it exits. What method on the thread object can block the main thread? time.sleep? That is indeed an option, but how long should the main thread sleep? We do not know in advance how long a task takes, so at this point we should search the Internet for a way to make a child thread "jam" the main thread. "Jam" sounds crude; the proper word is "block", so we can search for "Python child thread blocks main thread". If we use the search engine correctly, we will find a method called join(). Yes, join() is exactly the method by which a child thread blocks the main thread: while the child thread has not finished, the main thread is stuck on the line containing the join() call, and the code after join() runs only after all the threads have finished executing.
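A tiny demonstration of join() blocking the main thread; the 0.2-second sleep stands in for real crawling work:

```python
import threading
import time

events = []

def worker():
    time.sleep(0.2)          # simulate a slow crawling task
    events.append("worker finished")

t = threading.Thread(target=worker)
t.start()
t.join()                     # the main thread blocks here until worker() returns
events.append("main continues")
# events is guaranteed to list the worker's entry first
```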
Suppose we have a scenario like this: we need to grab a person's blog, and we know that this blog has two kinds of pages, a list.php page showing links to all of the blog's posts, and a view.php page showing the full content of one post.
If we want to grab every article on this blog, a single-threaded crawler's approach would be: use a regular expression to grab the href attribute of every link a tag on the list.php page into an array named article_list (in Python it is not called an array but a list), then traverse article_list with a for loop, grab each page's content with the various content-grabbing functions, and store it in the database.
If we instead write a multi-threaded crawler for this task, say with 10 threads, we have to find a way to split article_list into 10 parts and assign each part to one of the child threads.
But there is a problem: if the length of article_list is not a multiple of 10, i.e. the number of articles is not an integer multiple of the thread count, then the last thread will be assigned a different number of tasks than the others and will finish at a different time.
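A sketch of that naive fixed split (split_tasks is a hypothetical helper, not the crawler's actual code); note how uneven the slices come out when the length is not a multiple of the thread count:

```python
def split_tasks(tasks, n_threads):
    """Naively split a task list into n_threads consecutive slices."""
    size = len(tasks) // n_threads
    chunks = [tasks[i * size:(i + 1) * size] for i in range(n_threads - 1)]
    chunks.append(tasks[(n_threads - 1) * size:])  # last slice absorbs the remainder
    return chunks

chunks = split_tasks(list(range(23)), 10)
# Nine threads get 2 tasks each while the last gets 5: an uneven load,
# so some threads sit idle while one keeps working.
```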
If we are only grabbing a few thousand short blog posts, this hardly seems like a problem; but if a task (not necessarily grabbing web pages; it might be mathematical computation, or something time-consuming like graphics rendering) runs for a long time, it causes a great waste of resources and time. The whole point of multi-threading is to use all the computing resources and compute time as fully as possible, so we have to find ways to distribute the tasks more scientifically and reasonably.
We also need to consider another situation: when the number of articles is very large, we want to grab content quickly and see what has been captured as soon as possible. This requirement comes up often in CMS collection sites.
For example, suppose the target blog has tens of millions of articles we want to grab. A blog that large will normally be paginated, so grabbing all the pages of list.php up front could take hours or even days. If the boss wants the captured content displayed as soon as possible, presented on our CMS collection site right away, then we should grab list.php and drop the captured URLs into an article_list array while, at the same time, another thread extracts already-grabbed URLs from article_list and uses regular expressions to pull the blog post content from the corresponding view.php page. How do we implement this?
We need to start two kinds of threads at once. One kind is specifically responsible for grabbing URLs from list.php and dropping them into the article_list array; the other kind is specifically responsible for extracting URLs from article_list and then pulling the corresponding blog content out of the matching view.php page.
But remember the concept of thread safety mentioned earlier? The first kind of thread writes data into article_list, while the other kind reads from article_list and deletes the data it has already read. Python's list is not a thread-safe data structure, so this can cause unpredictable errors. That is why we turn to a more convenient, thread-safe structure: the Queue data structure mentioned in our subtitle.
Queue also has a join() method, which is similar in spirit to threading's join() described in the previous section, except that Queue's join() blocks until every item that was put into the queue has been marked finished with task_done(); only then does the code after join() continue to execute. In this crawler I used this method to block the main thread, rather than blocking it directly through each thread's join. The advantage of this approach is that we do not have to write a busy loop to check whether there are unfinished tasks in the task queue; the program runs more efficiently and the code is more elegant.
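A producer/consumer sketch of this approach, using Queue's task_done()/join() instead of joining each thread. It is shown with Python 3's module name queue (in 2.7 the module is spelled Queue), and the URLs are hypothetical:

```python
import queue
import threading

task_queue = queue.Queue()
results = []

def consumer():
    while True:
        url = task_queue.get()     # blocks until an item is available
        results.append("fetched " + url)
        task_queue.task_done()     # mark this unit of work as finished

# Daemon threads die with the main thread, so task_queue.join() below
# is what actually keeps the program alive until the work is done.
for _ in range(3):
    t = threading.Thread(target=consumer)
    t.daemon = True
    t.start()

for i in range(10):
    task_queue.put("http://example.com/view.php?id=%d" % i)  # hypothetical URLs

task_queue.join()  # blocks until task_done() has been called for every put()
```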
One more detail: the queue module is named Queue in Python 2.7 but has been renamed queue in 3.x; the difference is only the case of the first letter. If you copy code from the Internet, remember this small difference.
getopt module:
If you have learned C, this module should feel familiar. It is responsible for extracting the accompanying parameters from a command on the command line. For example, when we operate a MySQL database from the command line we type mysql -h127.0.0.1 -uroot -p, where the "-h127.0.0.1 -uroot -p" after mysql is the parameter part that can be obtained.
When writing a crawler, there are some parameters we need the user to enter by hand, such as MySQL's host IP, username and password. To make our program friendlier and more generic, some configuration items should not be hard-coded but passed in dynamically at execution time, and importing the getopt module lets us implement that.
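A sketch of parsing MySQL-style options with getopt; the option letters mirror the mysql client, and the hard-coded argv stands in for the sys.argv[1:] a real script would use:

```python
import getopt

# In a real script this would be sys.argv[1:].
argv = ["-h", "127.0.0.1", "-u", "root", "-p", "secret"]

# "h:u:p:" means each of -h, -u, -p expects a value after it.
opts, args = getopt.getopt(argv, "h:u:p:")

config = {}
for opt, value in opts:
    if opt == "-h":
        config["host"] = value
    elif opt == "-u":
        config["user"] = value
    elif opt == "-p":
        config["password"] = value
# config is now {'host': '127.0.0.1', 'user': 'root', 'password': 'secret'}
```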
hashlib module:
A hash is essentially a family of mathematical algorithms with the property that you feed in one input and it outputs a result which, though very short, can be considered practically unique. For example, the MD5 and SHA-1 we often hear about both belong to the hash algorithms. After a series of mathematical operations, they can turn a file or text of any size into a short fixed-length string of letters and digits.
Python's hashlib module encapsulates these mathematical functions for us; we can complete a hash operation simply by calling them.
Why does my crawler use this package? Because in some interface requests the server requires a check code to ensure the request data has not been tampered with or lost. These check codes are generally produced by hash algorithms, so we need this module to perform the operation.
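A minimal hashlib sketch showing the two algorithms named above; the input text is arbitrary:

```python
import hashlib

# Hash the same text with two common algorithms; equal input always
# yields the same fixed-length digest.
text = "hello tieba".encode("utf-8")   # hashlib wants bytes in Python 3
md5_digest = hashlib.md5(text).hexdigest()
sha1_digest = hashlib.sha1(text).hexdigest()

# MD5 gives 32 hex characters, SHA-1 gives 40, regardless of input size.
```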
json module:
In many cases, the data we grab is not HTML but JSON, and JSON is essentially just a string of key-value pairs. If we need to extract a particular field from it, the json module converts the JSON string to the dict type, which is convenient for us to operate on.
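A minimal sketch of that conversion; the response body here is made up:

```python
import json

# A made-up response body resembling what an API might return.
raw = '{"post_id": 123, "author": "cw1997", "contact": {"email": "a@b.com"}}'
data = json.loads(raw)             # JSON string -> dict

email = data["contact"]["email"]   # ordinary dict indexing now works
```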
re module (regular expressions):
Sometimes we grab page content but need to extract pieces in a specific format from it; for example, an email address usually looks like a few letters, then an @ symbol, then a domain like xxx.xxx. Like a small computer language for describing such formats, regular expressions let us express the format and have the computer automatically match, from a large string, the pieces that conform to it.
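For instance, a deliberately simplified email pattern (real-world address matching needs a more careful expression):

```python
import re

# A simplified illustration pattern, not a complete email grammar:
# some name characters, an @, a domain, a dot, and a TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

page = "Contact me at foo@example.com or bar.baz@test.org for details."
emails = EMAIL_RE.findall(page)
# emails now holds every substring of the page that matched the pattern
```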
sys module:
This module mainly deals with system-level matters. In this crawler I use it to solve an output-encoding problem.
time module:
Anyone with a little English can guess that this module deals with time. In this crawler I use it to get the current timestamp, and then, at the end of the main thread, subtract the timestamp at which the program started running from the current timestamp to obtain the program's running time.
As the screenshot shows, grabbing 100 pages with 50 threads (30 posts per page, i.e. 3000 posts in total), parsing the post content, and extracting the phone numbers and email addresses took 330 seconds.
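The timing technique itself is just two timestamps subtracted; below, a short sleep stands in for the crawl:

```python
import time

start_time = time.time()   # timestamp when the crawl begins

# ... the crawling work would happen here; a tiny sleep stands in for it ...
time.sleep(0.1)

elapsed = time.time() - start_time
print("finished in %.2f seconds" % elapsed)
```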
urllib and urllib2 modules:
These two modules handle HTTP requests and URL formatting. The core HTTP-request code of my crawler is completed with these modules.
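A sketch of the two pieces, URL formatting and building a request, shown under the Python 3 names urllib.request/urllib.parse (in 2.7 the same pieces live in urllib and urllib2). The Tieba URL is hypothetical and the network call itself is left out so the sketch stays offline:

```python
from urllib.request import Request
from urllib.parse import urlencode

# Query-string formatting: dict -> "kw=python&pn=0"
params = urlencode({"kw": "python", "pn": 0})
url = "http://tieba.baidu.com/f?" + params       # hypothetical list URL

# Build the request with a browser-like User-Agent header.
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
# urlopen(req).read() would perform the actual HTTP request; it is
# omitted here to keep the example from touching the network.
```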
MySQLdb module:
This is a third-party module used to operate a MySQL database from Python.
Note one detail here: the MySQLdb module is not thread-safe, meaning we cannot share the same MySQL connection handle across threads. So in my code you can see that I create a new MySQL connection handle in every thread's constructor; each child thread therefore uses only its own independent MySQL connection handle.
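A sketch of this one-connection-per-thread pattern. To keep the example self-contained it uses the stdlib sqlite3 module, which imposes a similar per-thread restriction; with MySQLdb you would call MySQLdb.connect(...) for each thread in the same way:

```python
import sqlite3
import threading

results = []

class DbWorker(threading.Thread):
    """Each worker opens its OWN connection rather than sharing one handle,
    the same pattern the crawler uses with MySQLdb."""

    def run(self):
        # A private, per-thread connection; nothing is shared between threads.
        conn = sqlite3.connect(":memory:")
        cur = conn.cursor()
        cur.execute("SELECT 40 + 2")
        results.append(cur.fetchone()[0])
        conn.close()

workers = [DbWorker() for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
# every worker queried through its own handle; results == [42, 42, 42]
```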
This is also a third-party module; you can find the relevant code online. It is mainly used to output colored strings to the command line. For example, when the crawler hits a bug, outputting it in a red font makes it more conspicuous, and that is what this module is for.
Error handling in an automated crawler:
If you use this crawler in an environment with a poor network, you will see exceptions like those shown in the picture; out of laziness I did not write any exception-handling logic there.
Normally, to write a highly automated crawler, we need to anticipate all the exceptions our crawler may encounter and handle them.
For example, on the error shown in the picture, we should re-insert the task that was being processed into the task queue; otherwise we would lose data. This is one of the tricky parts of crawler writing.
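A sketch of re-inserting a failed task into the queue so no data is lost; fetch_page is a hypothetical stand-in for the real HTTP-request code, rigged to fail on the first attempt:

```python
import queue

attempts = []

def fetch_page(url):
    """Hypothetical fetch that simulates one transient network failure."""
    attempts.append(url)
    if len(attempts) == 1:
        raise IOError("simulated network failure")   # first try fails
    return "<html>post content</html>"

tasks = queue.Queue()
tasks.put("http://example.com/view.php?id=1")

pages = []
while not tasks.empty():
    url = tasks.get()
    try:
        pages.append(fetch_page(url))
    except IOError:
        tasks.put(url)   # push the task back so a later attempt can retry it
    finally:
        tasks.task_done()
# the failed task was retried and its content was eventually captured
```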
In fact, writing a multi-threaded crawler is not complicated: read more example code, experiment more yourself, and exchange ideas in communities and forums; many classic books also explain multi-threaded programming in great detail. This article is essentially a popular-science piece and its content is not very deep; I too learn from all kinds of material on the Internet. If there is logic in the code you do not understand, ask in the comment section and I will answer patiently when I am free.
Python learning exchange group: 125240963
Reprinted from: https://zhuanlan.zhihu.com/p/25039408