Grabbing all the contents of a page involves:
- The elements in the page
Page elements include both those returned directly by the server and those constructed dynamically by scripts.
- All the resources in the page
The resources of a page include same-domain resources and third-party-domain resources; resources served from the main domain are also treated as third-party resources. Third-party resources are generally referenced by an absolute path, while same-domain resources appear in three main forms (using https://www.baidu.com as the example):
a). Relative path
<img src="./image/logo.png" />
b). Absolute path
<img src="https://www.baidu.com/image/logo.png" />
c). Protocol-relative path
<img src="//www.baidu.com/image/logo.png" />
With this last form, the browser automatically prepends the protocol of the page it is viewing when it issues the request. Once the page has been saved locally and is opened from disk, the file: protocol is prepended instead, so the reference no longer points at the original server.
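The three reference forms can be checked with the standard URL constructor, which follows the same resolution rules a browser uses. This is only a sketch: the helper name `resolveResource` and the local path `/tmp/index.html` are assumptions made for illustration.

```javascript
// Resolve a resource reference against the URL of the page that contains it.
// `resolveResource` is a hypothetical helper name used only in this sketch.
function resolveResource(src, pageUrl) {
  return new URL(src, pageUrl).href;
}

const pageUrl = "https://www.baidu.com/";
const forms = [
  "./image/logo.png",                     // a) relative path
  "https://www.baidu.com/image/logo.png", // b) absolute path
  "//www.baidu.com/image/logo.png",       // c) protocol-relative path
];

// Every form resolves to the same absolute URL while the page is served over https.
const resolved = forms.map((src) => resolveResource(src, pageUrl));

// Opened from disk (assumed path), form c) inherits the file: protocol instead.
const fromDisk = new URL("//www.baidu.com/image/logo.png", "file:///tmp/index.html");

console.log(resolved, fromDisk.protocol);
```

This is why form c) has to be rewritten when a page is saved: after saving, it no longer resolves to the original https address.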
Current implementation scheme
The server fetches the page with an HTTP GET request.
Puppeteer is a high-level Node API for controlling Chromium. When the browser opens a page, the process can be roughly broken down into the following steps:
- The browser is told to initiate a request
- The browser sends the request
- The browser receives the response content
- The browser passes the response content to the rendering engine
- The rendering engine processes the content
Within this process, puppeteer provides a mechanism that lets us intercept the second and third steps (the request being sent and the response being received). We can build on this: for example, we can intercept every request the page makes, or capture every response without caring how the request was issued, as long as the page actually made it. The biggest difference from fetching the page with a plain HTTP GET is that puppeteer yields the rendered page while HTTP GET yields the raw source, so puppeteer copes much better with SPAs and pages constructed by scripts.
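A minimal sketch of the response-capturing side, assuming a puppeteer `Page` object is passed in. The function name `collectResponses` and the shape of the stored records are assumptions for illustration, not the article's original code; only `page.on("response")`, `response.url()`, `response.status()`, `response.buffer()` and `request.resourceType()` are puppeteer calls.

```javascript
// Record every response a puppeteer page receives.
// `collectResponses` and the record shape are hypothetical, for illustration only.
function collectResponses(page) {
  const store = new Map(); // url -> { status, type, body }
  const pending = [];

  page.on("response", (response) => {
    const task = (async () => {
      let body = null;
      try {
        // buffer() rejects for redirects and other bodiless responses
        body = await response.buffer();
      } catch (err) {
        body = null;
      }
      store.set(response.url(), {
        status: response.status(),
        type: response.request().resourceType(), // e.g. "document", "stylesheet", "image"
        body,
      });
    })();
    pending.push(task);
  });

  // Resolves once every response seen so far has been read.
  const settled = () => Promise.all(pending);
  return { store, settled };
}

// Typical use (requires the puppeteer package):
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   const { store, settled } = collectResponses(page);
//   await page.goto("https://www.baidu.com", { waitUntil: "networkidle0" });
//   await settled();
//   const html = await page.content(); // rendered DOM, not the raw source
//   await browser.close();
```

Listening to responses is enough to save resources. Modifying or blocking requests would additionally need `page.setRequestInterception(true)` and a `request` handler that calls `request.continue()`.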
The puppeteer-based implementation addresses the shortcomings of the original scheme:
- All network requests are intercepted, covering both resource requests and the requests involved in building the DOM.
- Resources under the same domain name keep their relative paths, and the corresponding relative path is created locally.
- Resources from a different domain name (third-party resources) are stored in a new directory named after the third-party domain.
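The two storage rules above could be sketched as a single mapping from a resource URL to a local path. The helper name `localPathFor`, the `index.html` fallback and the host `cdn.example.com` are assumptions for illustration; query strings are ignored here to keep the sketch short.

```javascript
// Map a resource URL to a path inside the local snapshot directory.
// `localPathFor` is a hypothetical helper; the layout follows the rules described above.
function localPathFor(resourceUrl, pageUrl) {
  const resource = new URL(resourceUrl, pageUrl);
  const page = new URL(pageUrl);

  // Drop the leading slash and give directory-style URLs a file name (assumed fallback).
  let relative = resource.pathname.replace(/^\/+/, "");
  if (relative === "" || relative.endsWith("/")) relative += "index.html";

  // Same host: keep the site's own relative path.
  if (resource.hostname === page.hostname) return relative;

  // Third-party host: store under a directory named after that host.
  return `${resource.hostname}/${relative}`;
}

const base = "https://www.baidu.com/";
console.log(localPathFor("./image/logo.png", base));                 // image/logo.png
console.log(localPathFor("https://cdn.example.com/lib/a.js", base)); // cdn.example.com/lib/a.js
```

Because the third-party directory is a sibling of the site's own directories, references in the saved HTML have to be rewritten to these local paths for the page to open offline.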
Core code description
The core code implementing the new scheme carries detailed comments, so no further explanation is given here.
The scheme above solves almost every problem the original scheme could not, but it is not perfect. First, compared with the original scheme it adds a rendering step, so performance is lower. Second, it breaks on some site layouts. Suppose a resource under a path such as https://www.xxx.com/admin, for example a CSS file, contains 'background: url("./xxx.bg.png")'. The image will not be found, because during the resource-path replacement phase the path is rewritten against the hostname, so the lookup goes to the root directory instead of the directory that holds the CSS file. There is also room for other improvements, such as making the domain-name directory layout more flexible and letting consumers of the interface customize it.
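The path problem described above comes down to which base URL a relative reference is resolved against. The sketch below shows the difference; the stylesheet name `style.css` is an assumption, since the original only gives the `admin` path.

```javascript
// A url(...) reference inside a stylesheet is relative to the stylesheet, not to the host root.
const stylesheetUrl = "https://www.xxx.com/admin/style.css"; // assumed file name
const ref = "./xxx.bg.png";

// Resolving against the hostname only (what the replacement phase effectively does):
const againstRoot = new URL(ref, "https://www.xxx.com/").pathname;

// Resolving against the stylesheet's own URL (what the browser does):
const againstSheet = new URL(ref, stylesheetUrl).pathname;

console.log(againstRoot);  // /xxx.bg.png
console.log(againstSheet); // /admin/xxx.bg.png
```

One possible fix, under these assumptions, would be to keep the URL of each stylesheet alongside its content and use it as the base when rewriting the references it contains.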