Article from: https://www.cnblogs.com/yijian001/p/9051124.html

 

Original address: http://blog.wiseturtles.com/posts/scrapyd.html

Tags: scrapyd, scrapy, scrapyd-client. By crazygit, 2015-10-29

scrapyd is a service for deploying and running Scrapy crawlers. It lets you deploy crawler projects and control their spiders through a JSON API.

Overview

Projects and versions

scrapyd can manage multiple projects, and each project can hold multiple versions, but only the latest version is used to run crawlers.

The most convenient way to manage versions is to use a VCS tool to tag your crawler code. Version comparison is not simply alphabetical but uses an intelligent algorithm in the style of distutils, so that, for example, r10 is considered greater than r9.
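As a quick illustration (not scrapyd code itself), distutils' LooseVersion produces the ordering described above, while plain string comparison does not:

# Illustration of distutils-style version ordering (not scrapyd code itself).
# Note: the distutils module is deprecated in recent Python releases.
from distutils.version import LooseVersion

print('r10' > 'r9')                              # False: plain strings compare alphabetically
print(LooseVersion('r10') > LooseVersion('r9'))  # True: the numeric part compares as a number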

Working principle

scrapyd is a daemon that listens for requests to run crawlers and spawns a process to execute each one.

Start service

# Note: the directory from which you start scrapyd stores the logs and item files generated while it runs, so pick an appropriate location before running the command.
$ scrapyd

Schedule a crawler run

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2
{"status": "ok", "jobid": "26d1b1a6d6f111e0be5c001e648c57f8"}
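The same call can be made from Python; a minimal sketch using the requests library, with the project and spider names from the curl example above:

# Minimal sketch: schedule a spider run through scrapyd's JSON API.
import requests

resp = requests.post('http://localhost:6800/schedule.json',
                     data={'project': 'myproject', 'spider': 'spider2'})
resp.raise_for_status()
print(resp.json())  # e.g. {'status': 'ok', 'jobid': '...'}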

Web interface

http://localhost:6800/

Install

Requirements

  • Python 2.6+
  • Twisted 8.0+
  • Scrapy 0.17+

Install

$ pip install scrapyd

or

$ sudo apt-get install scrapyd

Project deployment

Use the scrapyd-deploy tool that ships with scrapyd-client.

Install scrapyd-client

$ pip install scrapyd-client

How scrapyd-client works

It packages the project into an egg and then deploys it by calling scrapyd's addversion.json endpoint.
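If you already have a built egg, the same upload can be done by hand; a minimal sketch with requests, where the egg filename and version are hypothetical:

# Sketch of the underlying call: upload a pre-built egg to addversion.json,
# which is what scrapyd-deploy does after packaging the project.
# 'myproject.egg' is a hypothetical egg built beforehand (e.g. with python setup.py bdist_egg).
import requests

with open('myproject.egg', 'rb') as egg:
    resp = requests.post('http://localhost:6800/addversion.json',
                         data={'project': 'myproject', 'version': 'r1'},
                         files={'egg': egg})
print(resp.json())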

Configuring server information

To make the description concrete, the whole deployment process uses a Douban movies crawler as the example. To configure the server and project information, edit the scrapy.cfg file and add the following:

[deploy:server-douban]
url = http://localhost:6800/

Here server-douban is the server name, and url is the address of the server, i.e. the machine on which the scrapyd command is running.

Check the configuration and list the currently available servers:

$ scrapyd-deploy -l
server-douban        http://localhost:6800/

List all the projects on a server. Make sure scrapyd is running there, otherwise the connection will fail. On a first run you will see only the default project:

$ scrapyd-deploy -L server-douban
default

Open http://localhost:6800/ and you will see Available projects: default.

Deploy the project

Run the following command in the root directory of the crawler project, where target is the server name configured in the previous step and project is a project name of your choosing:

scrapyd-deploy <target> -p <project>
$ scrapyd-deploy server-douban -p douban-movies
Packing version 1446102534
Deploying to project "douban-movies" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "douban-movies", "version": "1446102534", "spiders": 1, "node_name": "sky"}

The deployment packages your current project. If a setup.py file exists in the project it will be used; if not, one is created automatically. (If the project later needs custom packaging, you can edit this file as required; otherwise you can ignore it for now.) From the response we can see the deployment status, the project name, the version number, the number of spiders, and the host name.
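For reference, the auto-generated setup.py has roughly the following shape; the entry_points line points Scrapy tooling at the project settings (douban_movies.settings is a hypothetical module path for this example):

# Rough shape of the setup.py that scrapyd-deploy creates when none exists.
# 'douban_movies.settings' is a hypothetical settings module path for this example.
from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = douban_movies.settings']},
)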

Check the deployment results

$ scrapyd-deploy -L server-douban
default
douban-movies

Or open http://localhost:6800/ again and you will see Available projects: default, douban-movies.

We can also write the project name into the configuration file so it does not have to be passed on every deployment. Edit scrapy.cfg and add the project entry:

[deploy:server-douban]
url = http://localhost:6800/
project = douban-movies

Subsequent deployments can then be run with no arguments:

$ scrapyd-deploy

If multiple servers are configured, the project can be deployed to all of them at once:

$ scrapyd-deploy -a -p <project>

Specify a version number

By default scrapyd-deploy uses the current timestamp as the version number. We can use --version to specify one explicitly:

scrapyd-deploy <target> -p <project> --version <version>

The version number format must be accepted by LooseVersion.

For example:

# Set the version number to 0.1
$ scrapyd-deploy server-douban -p douban-movies --version 0.1
Packing version 0.1
Deploying to project "douban-movies" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "douban-movies", "version": "0.1", "spiders": 1, "node_name": "sky"}

If the code is managed with Mercurial or Git, you can pass HG or GIT as the version value; it can also be written into scrapy.cfg, and the current revision will then be used as the version number.

[deploy:target]
...
version = GIT
$ cat scrapy.cfg
...
[deploy:server-douban]
url = http://localhost:6800/
project = douban-movies
version = GIT

# The current revision is r7-master
$ scrapyd-deploy server-douban -p douban-movies
fatal: No names found, cannot describe anything.
Packing version r7-master
Deploying to project "douban-movies" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "douban-movies", "version": "r7-master", "spiders": 1, "node_name": "sky"}

How the version number is derived from GIT can be seen in this excerpt from the scrapyd-client source (Popen and PIPE come from the subprocess module):

  elif version == 'GIT':
        p = Popen(['git', 'describe'], stdout=PIPE)
        d = p.communicate()[0].strip('\n')
        if p.wait() != 0:
            p = Popen(['git', 'rev-list', '--count', 'HEAD'], stdout=PIPE)
            d = 'r%s' % p.communicate()[0].strip('\n')

        p = Popen(['git', 'rev-parse', '--abbrev-ref', 'HEAD'], stdout=PIPE)
        b = p.communicate()[0].strip('\n')
        return '%s-%s' % (d, b)

Add authentication information to the server

We can also put a reverse proxy in front of scrapyd to add user authentication. Taking nginx as an example, configure it like this:

server {
       listen 6801;
       location / {
            proxy_pass            http://127.0.0.1:6800/;
            auth_basic            "Restricted";
            auth_basic_user_file  /etc/nginx/htpasswd/user.htpasswd;
        }
}

The username and password set in /etc/nginx/htpasswd/user.htpasswd are both test. Modify the configuration file to add the credentials:

...
[deploy:server-douban]
url = http://localhost:6801/
project = douban-movies
version = GIT
username = test
password = test

Note that the port in the url above has been changed to the one nginx listens on.

Reminder: remember to set the bind_address field in the scrapyd configuration on the server to 127.0.0.1, so that outside clients cannot bypass nginx and access port 6800 directly. For configuration details, see the configuration file section later in this article.
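Clients then have to supply the credentials; a minimal sketch using requests with HTTP basic auth (the spider name is hypothetical):

# Sketch: call the scrapyd API through the nginx proxy with HTTP basic auth.
import requests
from requests.auth import HTTPBasicAuth

resp = requests.post('http://localhost:6801/schedule.json',
                     data={'project': 'douban-movies', 'spider': 'movies'},  # hypothetical spider name
                     auth=HTTPBasicAuth('test', 'test'))
print(resp.json())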

API

scrapyd's web interface is fairly simple and is mainly useful for monitoring; all scheduling goes through the JSON API.

Common endpoints (a Python sketch of calling them follows the list):

  • Schedule a crawler run

    $ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider
    # With extra settings and spider arguments
    $ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider -d setting=DOWNLOAD_DELAY=2 -d arg1=val1
    
  • Cancel a job

    $ curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444
    
  • List projects

    $ curl http://localhost:6800/listprojects.json
    
  • List versions

    $ curl http://localhost:6800/listversions.json?project=myproject
    
  • List spiders

    $ curl http://localhost:6800/listspiders.json?project=myproject
    
  • List jobs

    $ curl http://localhost:6800/listjobs.json?project=myproject
    
  • Delete a version

    $ curl http://localhost:6800/delversion.json -d project=myproject -d version=r99
    
  • Delete a project

    $ curl http://localhost:6800/delproject.json -d project=myproject
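The same endpoints can be driven from Python; a minimal sketch with requests, reusing the jobid from the cancel example above:

# Minimal sketch: query and cancel jobs through the JSON API with requests.
import requests

base = 'http://localhost:6800'

# List pending, running and finished jobs for a project.
jobs = requests.get(base + '/listjobs.json', params={'project': 'myproject'}).json()
print(jobs)

# Cancel a running job by its jobid (the id from the cancel example above).
cancel = requests.post(base + '/cancel.json',
                       data={'project': 'myproject',
                             'job': '6487ec79947edab326d6db28a2d86511e8247444'}).json()
print(cancel)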
    

Configuration file

scrapyd searches for configuration files automatically on startup; they are loaded in the following order:

  • /etc/scrapyd/scrapyd.conf
  • /etc/scrapyd/conf.d/*
  • scrapyd.conf
  • ~/.scrapyd.conf

Settings loaded later override earlier ones.

The default configuration file is as follows.

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   = items
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs

For details of each configuration parameter, refer to the official documentation.
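For example, a small local override placed in any of the locations above, keeping scrapyd bound to localhost as suggested in the authentication section, might look like this (the values are illustrative):

[scrapyd]
bind_address     = 127.0.0.1
http_port        = 6800
max_proc_per_cpu = 2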

Update

A correction about the default project mentioned above (the one you see after starting scrapyd): my earlier understanding was wrong. The default project only appears when the scrapyd command is run inside a scrapy project directory, in which case the default project is that scrapy project; if scrapyd is started outside a scrapy project, no default project is shown. The advantage of running scrapyd directly inside the crawler project is that the crawlers can be run straight away, which suits projects whose code structure is not yet standardized enough for eggifying, because the prerequisite for deploying crawlers with scrapyd-client is that the project satisfies the following conditions:

Deploying your project to a Scrapyd server typically involves two steps:

  • Eggifying your project. You’ll need to install setuptools for this. See Egg Caveats below.
  • Uploading the egg to the Scrapyd server through the addversion.json endpoint.

Summary

As for scrapyd, my personal feeling is that it is well suited to deploying and scheduling crawlers on a single machine. For distributed crawler scheduling it may be less appropriate: you have to start the service on every machine and there is no way to centralize the logs. Something like Jenkins may be a better fit for that.
