Grabbing the full contents of a page involves:
- The elements in the page
The page elements include both those returned directly by the server and those constructed dynamically by scripts.
- All the resources in the page
The resources of a page include resources on the page's own domain and third-party domain resources; resources on the main domain are also treated as third-party resources. Third-party resources are generally referenced by an absolute path, while same-domain resources mainly appear in three forms (using https://www.baidu.com as the example):
a). Relative path
<img src="./image/logo.png" />
b). Absolute path
<img src="https://www.baidu.com/image/logo.png" />
c). Absolute path 2 (protocol-relative)
<img src="//www.baidu.com/image/logo.png" />
With the protocol-relative form, the browser automatically prepends the protocol that was used to open the page when it makes the request. Once the page has been saved locally and is opened through the file protocol, the file: prefix is prepended instead, so the reference no longer points at the original server.
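The three forms can be compared with a minimal sketch that uses Node's built-in WHATWG `URL` class, which follows the same resolution rules as a browser. The helper name `resolveResource` and the local file path are assumptions made for illustration only.

```javascript
// Page URL taken from the article's example.
const pageUrl = "https://www.baidu.com/";

// Resolve a src attribute against a base URL, as a browser would.
function resolveResource(src, base) {
  return new URL(src, base).href;
}

const forms = [
  "./image/logo.png",                     // a) relative path
  "https://www.baidu.com/image/logo.png", // b) absolute path
  "//www.baidu.com/image/logo.png",       // c) protocol-relative path
];

// Opened over https, all three forms resolve to the same resource.
for (const src of forms) {
  console.log(src, "->", resolveResource(src, pageUrl));
}

// Hypothetical location of the saved page on disk.
const savedPage = "file:///tmp/saved/index.html";

// Opened from disk, the protocol-relative form inherits the file: scheme.
const fromDisk = new URL(forms[2], savedPage);
console.log(fromDisk.protocol); // file:
```

This is why protocol-relative references need to be rewritten when a page is saved locally.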
Current implementation scheme
Server-side HTTP GET of the page
puppeteer is a high-level Node API for controlling Chromium. When the browser opens a page, the process can be roughly divided into the following steps:
- The browser is told to initiate a request
- The browser sends the request
- The browser receives the response content
- The browser passes the response content to the rendering engine
- The rendering engine processes the content
Throughout this process, puppeteer provides a mechanism for intercepting the second and third stages: the browser sending the request and the browser receiving the response. This makes much more possible. For example, we can intercept every request the page makes and every response it receives, regardless of how a request was triggered, as long as it is actually sent. In addition, the biggest difference from a direct HTTP GET of the page is that puppeteer returns the rendered page while the HTTP GET returns the original source, so puppeteer is friendlier to SPAs and other pages constructed by scripts.
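As a rough sketch of the response side of this mechanism, the helper below records the body of every successful response a page receives. The name `collectResponses` is hypothetical; it assumes `page` is a puppeteer `Page` and relies only on the `response` event and the `ok()`, `url()` and `buffer()` methods of the response object.

```javascript
// Collect the body of every successful response the page receives.
// `page` is assumed to be a puppeteer Page; only page.on("response") is used.
function collectResponses(page) {
  const captured = new Map(); // resource URL -> Promise resolving to a Buffer or null

  page.on("response", (response) => {
    // Skip redirects and error statuses; ok() is true only for 2xx responses.
    if (!response.ok()) return;
    // Start reading immediately and store the promise; a failed read becomes null.
    captured.set(response.url(), response.buffer().catch(() => null));
  });

  return captured;
}

// Usage sketch (requires `npm install puppeteer`; not executed here):
// const puppeteer = require("puppeteer");
// const browser = await puppeteer.launch();
// const page = await browser.newPage();
// const captured = collectResponses(page);
// await page.goto("https://www.baidu.com", { waitUntil: "networkidle0" });
// const html = await page.content(); // rendered DOM, not the raw server response
// await browser.close();
```

Attaching the listener before `page.goto` matters, since responses that arrive before the listener is registered are not seen.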
The puppeteer-based implementation addresses the shortcomings of the original scheme:
- All network requests are intercepted, covering both resource requests and the requests involved in building the DOM.
- For resources under the same domain name, the relative path is preserved and the corresponding directory structure is created locally.
- For resources under a different domain name (third-party resources), a new directory named after the third-party domain is created to store them.
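The directory layout described above could be sketched as follows. The function name `localPathFor`, the `out` directory and the CDN host are assumptions for illustration, not the original implementation.

```javascript
const path = require("node:path");

// Map a resource URL to a path inside the output directory.
// Same-host resources keep their URL path; resources on any other host
// are placed under a directory named after that host.
function localPathFor(resourceUrl, pageUrl, outDir) {
  const resource = new URL(resourceUrl, pageUrl);
  const page = new URL(pageUrl);

  // Drop the leading "/" so the URL path becomes relative to outDir.
  let relative = resource.pathname.replace(/^\/+/, "");
  // Directory-style URLs need a file name to be written to.
  if (relative === "" || relative.endsWith("/")) relative += "index.html";

  return resource.host === page.host
    ? path.posix.join(outDir, relative)
    : path.posix.join(outDir, resource.hostname, relative);
}

console.log(localPathFor("./image/logo.png", "https://www.baidu.com/", "out"));
// out/image/logo.png
console.log(localPathFor("https://cdn.example.com/lib/a.js", "https://www.baidu.com/", "out"));
// out/cdn.example.com/lib/a.js
```

Query strings are ignored in this sketch, so two URLs that differ only in their query would map to the same file.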
Core code description
The core code implementing the new scheme above is as follows. Detailed comments are included in the code, so no further explanation is given here.
The scheme above solves almost all the problems the original scheme could not, but it is not perfect. First, compared with the original scheme it adds a rendering step, so performance is lower. Second, it fails for some site layouts. For example, if a resource under the https://www.xxx.com/admin path, such as a CSS file, contains 'background: url('./xxx.bg.png')', the image will not be found: during the resource path replacement phase the reference is replaced with one based on the hostname, so the lookup goes to the root directory and fails. There is also room for other improvements, such as making the directory path used for each domain name more flexible and allowing consumers of the interface to customize it.
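The path problem can be illustrated with a small sketch. Relative `url(...)` references in a stylesheet are resolved against the stylesheet's own URL, so resolving them against the host root yields a different location. The stylesheet file name `style.css` and the helper `resolveCssRef` are hypothetical.

```javascript
// Resolve a url(...) reference found inside a stylesheet.
// Relative references in CSS are relative to the stylesheet's own URL.
function resolveCssRef(ref, cssUrl) {
  return new URL(ref, cssUrl).href;
}

// Hypothetical stylesheet located under the /admin path.
const cssUrl = "https://www.xxx.com/admin/style.css";

// Resolution against the stylesheet URL keeps the /admin directory.
const correct = resolveCssRef("./xxx.bg.png", cssUrl);

// Resolution against the host root alone loses the /admin directory.
const rootBased = new URL("./xxx.bg.png", new URL(cssUrl).origin).href;

console.log(correct);   // https://www.xxx.com/admin/xxx.bg.png
console.log(rootBased); // https://www.xxx.com/xxx.bg.png
```

One possible fix, under these assumptions, is to carry each stylesheet's URL along when rewriting the references it contains.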