Article From:https://blog.csdn.net/fox64194167/article/details/79776379

1. Steps for simulating a login and crawling

1.1 Use the browser's developer tools to check whether any hidden input is submitted along with the login form

1.1.1 If there is one, request the login page first, parse it, and extract the hidden input's value

1.2 Use the browser tools to inspect every form field that is submitted and record them

1.3 Submit the form fields from 1.2 together with the hidden input

1.4 Once logged in, request the target URL.
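Steps 1.1.1 and 1.3 can be tried offline before any spider is written. The following is only a minimal sketch using the standard library: the sample markup, the token value, and the names HiddenInputCollector and build_login_form are made up for illustration and are not part of Scrapy or of CSDN's page.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.hidden[attr["name"]] = attr.get("value") or ""


def build_login_form(page_html, username, password):
    """Step 1.1.1: read the hidden inputs. Step 1.3: add the visible fields."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    form = dict(collector.hidden)
    form.update({"username": username, "password": password})
    return form


# Hypothetical login page markup; a real page will contain more than this.
SAMPLE_PAGE = """
<form method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="lt" value="LT-EXAMPLE-0001">
</form>
"""

login_form = build_login_form(SAMPLE_PAGE, "some_user", "some_password")
print(login_form)
```

The resulting dictionary is the kind of data that would be posted in step 1.3; in the spider itself, Scrapy's selectors and FormRequest do this work.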

 

2. Check for hidden inputs with the browser tools

This time we use CSDN's login page: https://passport.csdn.net/account/login

The page contains a hidden input named lt. CSDN's developers were careful here: each user is issued a serial number, which prevents a machine from logging in directly with only a username and password.

 

3. Parse the login page first to get the hidden input

 

    def start_requests(self):
        start_url = 'https://passport.csdn.net/account/login?from=https://mp.csdn.net/postlist/list/all'
        return [
            Request(start_url, callback=self.parseWelcome)
        ]

    def parseWelcome(self, response):
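        # The hidden "lt" input carries the serial number issued with the login page.
        lt = response.xpath('//input[@name="lt"]/@value').extract_first()
        # %s formatting avoids a TypeError when the input is missing and lt is None.
        logging.info('lt: %s', lt)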
        return FormRequest.from_response(
            response,
            url='https://passport.csdn.net/account/login?from=https://mp.csdn.net/postlist/list/all',
            #meta={'cookiejar': response.meta['cookiejar']},
            formdata={"username":"fox64194167", "password" : "*****", "lt" : lt},
            callback=self.afterLogin
        )
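
The login URL used here carries the post-login target in its from query parameter, pasted in as a raw URL. A small sketch of building the same URL with the parameter percent-encoded follows; build_login_url is a hypothetical helper, and whether the site treats the encoded and raw forms identically is an assumption, not something verified here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

LOGIN_PAGE = "https://passport.csdn.net/account/login"
TARGET_URL = "https://mp.csdn.net/postlist/list/all"


def build_login_url(login_page, target_url):
    """Attach the post-login target as a percent-encoded `from` parameter."""
    return login_page + "?" + urlencode({"from": target_url})


start_url = build_login_url(LOGIN_PAGE, TARGET_URL)
print(start_url)

# Parsing the query string again recovers the original target URL.
decoded_target = parse_qs(urlsplit(start_url).query)["from"][0]
print(decoded_target)
```

Keeping the target URL in one constant also means the login URL and the spider's target_url cannot drift apart.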

4. Complete code

 

import scrapy

from tutorial.items import CSDNItem
import logging
from scrapy.http import Request, FormRequest, HtmlResponse

class CSDNLoginSpider(scrapy.Spider):
    name = "csdnLogin"

    target_url = 'https://mp.csdn.net/postlist/list/all'

    def start_requests(self):
        start_url = 'https://passport.csdn.net/account/login?from=https://mp.csdn.net/postlist/list/all'
        return [
            Request(start_url, callback=self.parseWelcome)
        ]

    def parseWelcome(self, response):
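        # The hidden "lt" input carries the serial number issued with the login page.
        lt = response.xpath('//input[@name="lt"]/@value').extract_first()
        # %s formatting avoids a TypeError when the input is missing and lt is None.
        logging.info('lt: %s', lt)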
        return FormRequest.from_response(
            response,
            url='https://passport.csdn.net/account/login?from=https://mp.csdn.net/postlist/list/all',
            #meta={'cookiejar': response.meta['cookiejar']},
            formdata={"username":"fox64194167", "password" : "*****", "lt" : lt},
            callback=self.afterLogin
        )

    def afterLogin(self, response):
        # Logged in: request the target list page. With no callback given,
        # Scrapy passes the response to parse().
        yield Request(self.target_url)

    def parseDetail(self, response):
        # Extract the title and body of a single article.
        item = CSDNItem()
        item['title'] = response.css('.csdn_top::text').extract_first()
        item['body'] = response.css('#article_content .htmledit_views').extract_first()
        yield item

    def parse(self, response):
        # Follow every article link on the current list page.
        for article in response.css('.list-item-title .article-list-item-txt'):
            articleId = article.css('a::attr("href")').extract_first()
            if articleId is not None:
                # The article ID is the last path segment of the link.
                articleId = str(articleId)
                articleId = articleId[articleId.rfind("/") + 1:]
                next_page = 'https://blog.csdn.net/fox64194167/article/details/%s' % articleId
                yield response.follow(next_page, self.parseDetail)

        # The active pagination item holds the current page number.
        bottomNavNum = response.css('.page-item.active a::text').extract_first()
        logging.info('bottomNavNum: %s', bottomNavNum)

        # Convert only after the None check; int(None) raises TypeError.
        if bottomNavNum is not None:
            next_page = 'https://mp.csdn.net/postlist/list/all/%d' % (int(bottomNavNum) + 1)
            logging.info('next_page: %s', next_page)
            yield response.follow(next_page, self.parse)
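
The string handling inside parse (cutting the article ID out of a link and computing the next list page) can be checked without touching the network. A sketch as two plain functions; the names article_id_from_href and next_list_page are invented here for illustration, and the URL pattern is the one used in the spider.

```python
def article_id_from_href(href):
    """Return the text after the last '/' in an article link."""
    return href.rsplit("/", 1)[-1]


def next_list_page(active_page_text):
    """Return the next list-page URL, or None when no page number was found."""
    if active_page_text is None:
        return None
    current_page = int(active_page_text.strip())
    return "https://mp.csdn.net/postlist/list/all/%d" % (current_page + 1)


article_id = article_id_from_href(
    "https://blog.csdn.net/fox64194167/article/details/79776379"
)
print(article_id)
print(next_list_page("1"))
print(next_list_page(None))
```

Pulling the logic out like this makes the None case explicit: when no active page number is found, there is no next page to follow.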

Scrapy simulated login: username plus password

For further explanation, see the article

Scrapy: using a hard-coded cookie to crawl pages that require login
