Article from: https://blog.csdn.net/fox64194167/article/details/79765815

1. Environment setup

1. Use XAMPP to install PHP, MySQL and phpMyAdmin

2. Install Python 3 and pip

3. Install pymysql (a quick connectivity check is sketched after this list)

4. Install Scrapy (on Windows this step differs; I'm on a Mac, so Homebrew is installed and Scrapy is installed with brew)
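
Before writing any Scrapy code, it's worth a quick sanity check that Python can actually reach the local MySQL server. This is only a sketch and assumes the same root/123456 credentials used later in settings.py; adjust them to your own XAMPP setup:

import pymysql

# Connect to the MySQL server started by XAMPP (no database selected yet).
conn = pymysql.connect(host='localhost', user='root', passwd='123456', charset='utf8')
print(conn.get_server_info())  # prints the server version if the connection works
conn.close()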

 

2. Overall process

1. Create the database and table so the data can be saved

2. Write the crawler and make network requests to the target URL

3. Parse the crawled responses and extract the specific data we want

4. Save the extracted data to the database

 

2.1 Create the database

First create a database called scrapy, then create a table called article. We add a unique index on the body column to prevent duplicate data.

--
-- Database: `scrapy`
--

-- --------------------------------------------------------

--
-- Table structure for table `article`
--

CREATE TABLE `article` (
  `id` int(11) NOT NULL,
  `body` varchar(200) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `author` varchar(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `createDate` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

--
-- Indexes for table `article`
--

ALTER TABLE `article`
  ADD PRIMARY KEY (`id`),
  ADD UNIQUE KEY `uk_body` (`body`);

After running this, the table is created. Note that the pipeline below does not supply an id when it inserts rows, so make sure id is set to AUTO_INCREMENT, otherwise the inserts will not work as intended.

 

2.2 First, a look at the structure of the whole crawler project

quotes_spider.py is the core: it handles the network requests and parses the returned content, then hands the finished items off to the pipeline, which does the detailed processing and saves them to the database, so the crawling speed is not affected.


2.3 Write the crawler: request the target URL

 

import scrapy

from tutorial.items import TutorialItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # Entry point: yield a request for each URL we want to crawl.
        url = 'http://quotes.toscrape.com/tag/humor/'
        yield scrapy.Request(url)

    def parse(self, response):
        # Called with the downloaded response; extract one item per quote.
        for quote in response.css('div.quote'):
            item = TutorialItem()
            item['body'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            yield item
        # Follow the "next" link, if there is one, and parse it with this same method.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

start_requests is where you write the specific URL(s) to request.

parse is the core: it processes the returned data, yields it in the form of items, and then decides what to crawl next (here, the next page).
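
If you want to experiment with those CSS selectors without running a full crawl, a small sketch like this works (the HTML snippet here is made up, shaped like one quote block on quotes.toscrape.com):

from scrapy.selector import Selector

# A hand-written snippet that mimics one quote block on quotes.toscrape.com.
html = '''
<div class="quote">
  <span class="text">"A day without sunshine is like, you know, night."</span>
  <small class="author">Steve Martin</small>
</div>
'''

sel = Selector(text=html)
for quote in sel.css('div.quote'):
    print(quote.css('span.text::text').extract_first())     # the quote body
    print(quote.css('small.author::text').extract_first())  # the author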

 

2.4 items (items.py)

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TutorialItem(scrapy.Item):
    # The two fields the spider fills in and the pipeline writes to MySQL.
    body = scrapy.Field()
    author = scrapy.Field()
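
A scrapy.Item behaves like a dict restricted to its declared fields, which is why the spider can assign item['body'] and the pipeline can read it back later. A tiny illustration (assuming it is run inside this tutorial project):

from tutorial.items import TutorialItem

item = TutorialItem(body="Some quote", author="Someone")
print(item['body'])   # Some quote
print(dict(item))     # {'body': 'Some quote', 'author': 'Someone'}
# item['tags'] = []   # would raise KeyError, because 'tags' is not a declared Field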

 

2.5 pipelines (pipelines.py)

 

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
import datetime
from tutorial import settings
import logging

class TutorialPipeline(object):
    def __init__(self):
        # Open one MySQL connection for the lifetime of the spider.
        self.connect = pymysql.connect(
            host = settings.MYSQL_HOST,
            port = settings.MYSQL_PORT,
            db = settings.MYSQL_DBNAME,
            user = settings.MYSQL_USER,
            passwd = settings.MYSQL_PASSWD,
            charset = 'utf8',
            use_unicode = True
        )
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # Insert the item; the UNIQUE key on `body` plus ON DUPLICATE KEY UPDATE
        # turns a duplicate quote into a harmless no-op instead of an error.
        try:
            self.cursor.execute(
                "insert into article (body, author, createDate) values (%s, %s, %s) on duplicate key update author=author",
                (item['body'],
                 item['author'],
                 datetime.datetime.now()
                 ))
            self.connect.commit()
        except Exception as error:
            logging.error(error)
        return item

    def close_spider(self, spider):
        # Close the connection when the spider finishes.
        self.connect.close()
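
To see the de-duplication behaviour on its own, the same INSERT can be run from a plain script outside Scrapy. This is just a sketch assuming the credentials from settings.py in the next section; run it twice and the article table still ends up with a single row, because body carries the UNIQUE key:

import datetime
import pymysql

connect = pymysql.connect(host='localhost', port=3306, db='scrapy',
                          user='root', passwd='123456', charset='utf8')
cursor = connect.cursor()
# The second run hits the UNIQUE key on `body`, and the ON DUPLICATE KEY UPDATE
# clause turns it into a harmless update instead of an error.
cursor.execute(
    "insert into article (body, author, createDate) values (%s, %s, %s) "
    "on duplicate key update author=author",
    ("test body", "test author", datetime.datetime.now()))
connect.commit()
connect.close()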

2.6 Configuration (settings.py)

 

# Register the pipeline; the number sets the order when several pipelines are enabled (lower runs first).
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300
}

# MySQL connection settings, read by the pipeline via `from tutorial import settings`.
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'scrapy'
MYSQL_USER = 'root'
MYSQL_PASSWD = '123456'
MYSQL_PORT = 3306
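
With everything in place, the crawl is normally started from the project root with the scrapy crawl quotes command. If you would rather launch it from Python, here is a minimal sketch (assuming the quotes_spider.py above lives at tutorial/spiders/quotes_spider.py):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.quotes_spider import QuotesSpider

# Load the project settings (ITEM_PIPELINES, the MYSQL_* values, ...) and run the spider.
process = CrawlerProcess(get_project_settings())
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl has finished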


3 Replies to “Use Scrapy to save crawled data to MySQL and prevent duplicates”

  1. Hello,
    Thanks for the tutorial, but I had a problem with inserting data into MySQL…
    For some reason the connection succeeds but the data is not there, I mean it does not get inserted into my database… with the exact same code you use.

    1. self.cursor.execute(
      "insert into article (body, author, createDate) value(%s, %s, %s)",
      ("body",
      "author",
      datetime.datetime.now()
      ))
      self.connect.commit()
      Test with code like this; maybe the table is wrong.

      1. Thank you for the fast reply. I found that the item name defined in items.py did not match the item name used in the pipeline :). Thanks a lot, my friend, the tutorial was good and the pipeline was fantastically clean, and I can really understand how it works now, unlike the others.
