Article from: https://www.cnblogs.com/Johnzhang/p/9061998.html

Basic requirements

Capturing the full content of a page covers:

  1. The elements in the page

Page elements include both the elements returned directly by the server and the elements constructed dynamically.

  2. All the resources in the page

The resources of a page include resources from the page's own domain and resources from third-party domains; resources served from other hosts of the main domain (for example its subdomains) are also treated as third-party resources. Third-party resources are generally referenced by absolute URL, while same-domain resources mainly appear in three forms (using https://www.baidu.com as the example).

a). Relative path

<img src="./image/logo.png" />

b). Absolute path

<img src="https://www.baidu.com/image/logo.png" />

c). Protocol-relative path

<img src="//www.baidu.com/image/logo.png" />

With this form, the browser automatically prepends the protocol of the current page when it issues the request. Once the page has been saved locally and is opened over the file protocol, the prefix that gets added is file: instead.
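The three forms can be compared with a minimal sketch that uses the WHATWG URL class built into Node.js. The helper name resolveResource and the local path file:///tmp/index.html are illustrative assumptions, not part of the original article.

```javascript
// Resolve a resource reference against the URL of the page that contains it.
// URL is a global in current Node.js versions.
function resolveResource(src, pageUrl) {
    return new URL(src, pageUrl).href;
}

const pageUrl = 'https://www.baidu.com/';

// a) relative path
console.log(resolveResource('./image/logo.png', pageUrl));
// b) absolute path
console.log(resolveResource('https://www.baidu.com/image/logo.png', pageUrl));
// c) protocol-relative path: the scheme is taken from the page URL
console.log(resolveResource('//www.baidu.com/image/logo.png', pageUrl));

// The same reference inside a page opened from disk takes the file: scheme,
// which is why protocol-relative references break after saving locally.
console.log(new URL('//www.baidu.com/image/logo.png', 'file:///tmp/index.html').protocol);
```

When the page is served over https, all three forms resolve to the same address; only the saved copy behaves differently.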

Current implementation

Basic process

  1. The server fetches the page with an HTTP GET request.

  2. Based on the HTML in the response, it walks through the other resources that need to be loaded, such as JavaScript, images, CSS, fonts, and media.

  3. It processes the HTML, JavaScript, CSS, and other files, replacing resource paths so that the page still opens correctly after it has been saved locally.
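Step 2 of this process could look roughly like the sketch below. The article does not show the original implementation, so this is only an assumption about its shape: the function name extractResourceUrls, the sample HTML, and the example.com hosts are hypothetical, and a regular expression stands in for a real HTML parser to keep the example short.

```javascript
// Collect the src/href values of a few resource-carrying tags from raw HTML
// and resolve them against the page URL. This is not a full HTML parser.
function extractResourceUrls(html, pageUrl) {
    const pattern = /<(?:img|script|link)\b[^>]*?\b(?:src|href)\s*=\s*["']([^"']+)["']/gi;
    const found = new Set();
    let match;
    while ((match = pattern.exec(html)) !== null) {
        found.add(new URL(match[1], pageUrl).href);
    }
    return [...found];
}

const sampleHtml = `
<link rel="stylesheet" href="/css/site.css">
<script src="//cdn.example.com/lib.js"></script>
<img src="./image/logo.png">
`;
console.log(extractResourceUrls(sampleHtml, 'https://www.example.com/index.html'));
```

Because it only sees the HTML returned by the server, an approach like this cannot find elements or resources that scripts add later.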

New implementation

puppeteer is a high-level Node API for controlling Chromium. When the browser opens a page, the process can be roughly divided into the following steps:

  1. The browser is told to initiate a request
  2. The browser sends the request
  3. The browser receives the response
  4. The browser passes the response content to the rendering engine
  5. The rendering engine processes it

Throughout this process, puppeteer provides a mechanism for intercepting steps 2 and 3, which opens up further possibilities. For example, we can intercept every request and every response of the page without caring where a request originates: as long as the browser issues it, we see it. In addition, the biggest difference from fetching the page with a direct HTTP GET is that puppeteer yields the rendered page while a direct GET yields the raw source, so puppeteer is friendlier to SPAs and to pages constructed by script.
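The decision taken at the request stage can be sketched as a small pure function. The function name decideRequest and the parameters pageOrigin and blockedHosts are illustrative assumptions, as are the example.com hosts; requestUrl and resourceType correspond to the values puppeteer exposes through req.url() and req.resourceType().

```javascript
// Decide what to do with an intercepted request: 'abort' or 'continue'.
function decideRequest(requestUrl, resourceType, pageOrigin, blockedHosts) {
    const pageResourceTypes = ['image', 'script', 'stylesheet', 'document', 'font'];
    const target = new URL(requestUrl);

    // Tracking and analytics hosts contribute nothing to the saved page.
    if (blockedHosts.includes(target.hostname)) {
        return 'abort';
    }
    // From other origins, keep only resources that take part in building the page.
    if (target.origin !== pageOrigin && !pageResourceTypes.includes(resourceType)) {
        return 'abort';
    }
    return 'continue';
}

const blocked = ['hm.baidu.com', 'www.google-analytics.com'];
const origin = 'https://www.cnblogs.com';
console.log(decideRequest('https://hm.baidu.com/hm.js', 'script', origin, blocked));
console.log(decideRequest('https://cdn.example.com/app.js', 'script', origin, blocked));
console.log(decideRequest('https://api.example.com/data', 'xhr', origin, blocked));
```

Inside a puppeteer 'request' handler, the result would decide between calling req.abort() and req.continue().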

The puppeteer-based implementation addresses the shortcomings of the original scheme:

  1. Intercept all network requests and process only the requests for resources and for building the DOM.

  2. For resources under the same domain name, keep the relative path and create the corresponding path locally.

  3. For resources under a different domain name (third-party resources), create a directory named after the third-party domain and store the resources there.

  4. Process HTML, CSS, and JavaScript files, rewriting absolute paths to relative paths (here an absolute path means a directly referenced URL such as a CDN address, and a relative path means the path inside the local directory created for that CDN domain).
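Points 2 and 3 amount to a mapping from a resource URL to a local file path. A minimal sketch of that mapping follows; the function name localPathFor, the directory name snapshot, and the example.com host are hypothetical.

```javascript
const path = require('path');

// Map a resource URL to a path under the local snapshot directory.
// Same-origin resources keep their original path; resources from other hosts
// go into a directory named after the host.
function localPathFor(resourceUrl, pageOrigin, snapshotDir) {
    const target = new URL(resourceUrl);
    const segments = target.origin === pageOrigin
        ? [snapshotDir, target.pathname]
        : [snapshotDir, target.hostname, target.pathname];
    // path.posix keeps forward slashes regardless of the operating system.
    return path.posix.join(...segments);
}

const pageOrigin = 'https://www.cnblogs.com';
console.log(localPathFor('https://www.cnblogs.com/css/site.css', pageOrigin, 'snapshot'));
// snapshot/css/site.css
console.log(localPathFor('https://cdn.example.com/lib/app.js', pageOrigin, 'snapshot'));
// snapshot/cdn.example.com/lib/app.js
```

Keeping the original path below the host directory means relative references between files from the same host continue to resolve.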

Core code

The core code implementing the new scheme is shown below. It is annotated in detail, so no further explanation is given.

const puppeteer = require('puppeteer');
const URL = require('url');
const md5 = require('md5');
const fs = require('fs');
const util = require('util');
const path = require('path');
const shell = require('shelljs');

// Directory where the captured resources are saved
const BASEDIR = './asserts/';

const start = async () => {

    // Clearing the resource directory at startup is for the test phase only, since each run writes to its own timestamp-based directory.
    shell.exec('rm -rf asserts/');
    // All network requests are intercepted, so requests unrelated to page resources and DOM construction must be filtered out.
    // The hosts below are common front-end data-collection services (many others are not listed).
    const blackList = [
        'collect.ptengine.cn', 
        'collect.ptengine.jp',
        'js.ptengine.cn',
        'js.ptengine.jp',
        'hm.baidu.com',
        'api.growingio.com',
        'www.google-analytics.com',
        'script.hotjar.com',
        'vars.hotjar.com'
    ];
    // Caches the content of CSS and JavaScript resources. The complete list of third-party hosts is not known until all requests have finished, so references inside CSS and JavaScript cannot be fully replaced any earlier. Cache first, then rewrite everything once loading is done.
    const resourceBufferMap = new Map();
    // Third-party hosts (domain names) seen during the capture
    const thirdPartyList = {};
    try {
        const browser = await puppeteer.launch();

        const page = await browser.newPage();
        // Enable request interception
        await page.setRequestInterception(true);
        // Capture the cnblogs home page as an example
        let url = "https://www.cnblogs.com";
        let docUrl = URL.parse(url);
        // Origin of the requested page, used to decide whether a resource comes from a third party
        let originUrl = (docUrl.protocol + "//" + docUrl.hostname);
        // @fixme Name of the directory that holds the content generated by each capture
        let md5_prefix = md5(Date.now());

        page.on('request', async (req) => {
            const whitelist = ['image', 'script', 'stylesheet', 'document', 'font'];
            // For requests to third-party hosts, keep only the resource types involved in building the page.
            if (req.url().indexOf(originUrl) == -1 && !whitelist.includes(req.resourceType())) {
                return req.abort();
            }
            // Requests to blacklisted hosts are dropped.
            if (blackList.indexOf(URL.parse(req.url()).host) != -1) {
                return req.abort();
            }
            req.continue();


        });

        page.on('response', async res => {
            let request = res.request(),
                resourceUrl = request.url(),
                urlObj = URL.parse(resourceUrl),
                filePath = urlObj.pathname, // File path
                dirPath = path.dirname(filePath), // Directory path
                requestMethod = request.method().toUpperCase(), // Request method
                isSameOrigin = resourceUrl.includes(originUrl); // Whether the request is same-origin

            // Only GET requests are considered; other HTTP verbs rarely fetch file resources.
            if (requestMethod === 'GET') {
                // For a resource under the same domain name, create the directory and download the file directly.
                // The directory is created from the path of the request itself, which preserves the directory layout of the original site, so relative require paths in CMD or AMD modules still resolve when the code runs again.
                let dirPathCreatedIfNotExists,
                    filePathCreatedIfNotExists;

                let hostname = urlObj.hostname;

                if (isSameOrigin) {
                    // Build the path for a same-domain resource.
                    // Same-domain resources are sometimes referenced as //www.xxx.com/images/logo.png, so that form needs special handling.
                    thirdPartyList[`//${hostname}`] = '';
                    dirPathCreatedIfNotExists = path.join(BASEDIR, md5_prefix, dirPath);
                    filePathCreatedIfNotExists = path.join(BASEDIR, md5_prefix, filePath);
                } else {
                    // For third-party resources, build a regular expression that rewrites the http://, https:// and // forms to the local directory path.
                    thirdPartyList[`(https?:)?//${hostname}`] = `/${hostname}`;
                    dirPathCreatedIfNotExists = path.join(BASEDIR, md5_prefix, hostname, dirPath);
                    filePathCreatedIfNotExists = path.join(BASEDIR, md5_prefix, hostname, filePath);
                }
                // A path without a file extension is not treated as a resource file.
                if (path.extname(filePathCreatedIfNotExists)) {
                    // If the directory does not exist, create it including any missing parent directories.
                    if (!fs.existsSync(dirPathCreatedIfNotExists)) {
                        shell.exec(`mkdir -p ${dirPathCreatedIfNotExists}`);
                        console.log('create dir');
                    }
                    if (res.ok()) {
                        if ((isSameOrigin && dirPath != '/') || !isSameOrigin) {
                            let needReplace = ['stylesheet', 'script'];
                            // Read the response body once and reuse it below.
                            let buffer = await res.buffer();
                            if (needReplace.includes(request.resourceType())) {
                                // JS and CSS files may contain URLs that need to be rewritten,
                                // so cache them for now instead of writing them to disk.
                                // @fixme toString() may cause encoding problems
                                resourceBufferMap.set(filePathCreatedIfNotExists, buffer.toString());
                            } else {
                                fs.writeFileSync(filePathCreatedIfNotExists, buffer);
                            }
                        }
                    }
                }

            }

        });

        await page.goto(url, {
            waitUntil: 'networkidle0'
        });

        let content = await page.content();

        // Rewrite the cached CSS and JavaScript files and write them to disk
        resourceBufferMap.forEach((value, key) => {
            value = applyReplace(value, thirdPartyList);
            fs.writeFileSync(key, value);
        })

        // Rewrite the HTML content
        content = applyReplace(content, thirdPartyList);

        fs.writeFileSync(path.join(BASEDIR, md5_prefix, 'index.html'), content);

        await page.close();
        await browser.close();
    } catch (error) {
        console.log(error);
    }


}

function applyReplace(origin, regList) {
    for (let prop in regList) {
        // Global replacement with a regular expression
        let reg = new RegExp(prop, 'g')
        origin = origin.replace(reg, regList[prop]);
    }
    return origin;
}


start();
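One detail of applyReplace is worth noting: the host names are placed into the regular expression unescaped, so a dot matches any character. The sketch below shows one possible way to match host names literally. The helpers escapeRegExp and localizeHost and the example.com hosts are hypothetical and are not part of the original code.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Rewrite http://host, https://host and //host references to /host,
// matching the host name literally.
function localizeHost(content, hostname) {
    const pattern = new RegExp(`(?:https?:)?//${escapeRegExp(hostname)}`, 'g');
    return content.replace(pattern, `/${hostname}`);
}

const css = 'a{background:url(https://cdn.example.com/bg.png)} b{background:url(//cdnXexample.com/x.png)}';
// Only the first reference is rewritten; the look-alike host is left untouched.
console.log(localizeHost(css, 'cdn.example.com'));
```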

Summary

The scheme above solves almost all of the problems the original scheme could not, but it is not perfect. First, it adds a rendering step compared with the original scheme, so performance is lower. Second, some sites have unusual layouts. For example, if a resource under the https://www.xxx.com/admin path, such as a CSS file, contains 'background:url('./xxx.bg.png')', the image will not be found: during the path replacement phase the path is rewritten based on the hostname, so the lookup goes to the root directory and fails. There is still room for improvement, such as making the mapping from domain name to path more flexible and letting consumers of the interface customize it.
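The path problem can be illustrated by resolving the same reference against two different bases. This is only a sketch of the symptom: the stylesheet location /admin/css/site.css and the helper name resolveCssReference are assumptions made for the example.

```javascript
// Resolve a url() reference found in a stylesheet against a chosen base URL
// and return the resulting path.
function resolveCssReference(reference, baseUrl) {
    return new URL(reference, baseUrl).pathname;
}

// Hypothetical stylesheet location under the /admin path.
const stylesheetUrl = 'https://www.xxx.com/admin/css/site.css';

// Resolved against the stylesheet itself, as a browser does:
console.log(resolveCssReference('./xxx.bg.png', stylesheetUrl));
// Resolved against the site root, which is where the lookup ends up
// when the directory part of the path is lost:
console.log(resolveCssReference('./xxx.bg.png', 'https://www.xxx.com/'));
```

The two results differ, which is why the file saved under the original directory structure is not found when it is looked up from the root.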
