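New Year's Day holiday: the air is dry, the wind is bitingly cold, and it is not a day for going out. That's roughly what the almanac said, so it must be right.

So I stayed indoors for three days...

All of that is just an excuse. The truth is I'm lazy and a homebody.

It was a good chance to get familiar again with Scrapy, the Python crawling framework. This post is not a tutorial, just a record of the problems I ran into while practicing. If you want to learn Scrapy itself, please look at articles elsewhere or go straight to the official Scrapy documentation.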

Thumbnail
Rule: https://photo.tuchong.com/ + author_id + /g/ + img_id + .jpg, e.g. https://photo.tuchong.com/1370833/g/20494979.jpg
author_id is the ID of the image's author; img_id is the ID of the image.
Original image
Rule: https://photo.tuchong.com/ + author_id + /f/ + img_id + .jpg, e.g. https://photo.tuchong.com/1370833/f/20494979.jpg
author_id and img_id have the same meaning as for the thumbnail.
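
The two rules differ only in the path segment (/g/ versus /f/), so they can be sketched as one small helper. The function name and the size parameter are my own, chosen for illustration; only the URL pattern comes from the analysis above.

```python
def tuchong_image_url(author_id, img_id, size="f"):
    """Build an image URL following the rules above.

    size: "g" for the thumbnail, "f" for the original image.
    """
    if size not in ("g", "f"):
        raise ValueError("size must be 'g' (thumbnail) or 'f' (original)")
    return f"https://photo.tuchong.com/{author_id}/{size}/{img_id}.jpg"


if __name__ == "__main__":
    print(tuchong_image_url(1370833, 20494979, "g"))
    print(tuchong_image_url(1370833, 20494979, "f"))
```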

Writing the code

Creating the spider

# Create the project
scrapy startproject TuchongProj
# Enter the project directory
cd TuchongProj
# Generate the spider
scrapy genspider tuchong tuchong.com

Writing the relevant code

tuchong.py, demo code. Pagination is not implemented; it fetches only one page of data.

import json

import scrapy

from TuchongProj.items import TuchongprojItem


class TuchongSpider(scrapy.Spider):
    name = 'tuchong'
    allowed_domains = ['tuchong.com']
    # Parameters used to build the request URL
    tag = "风光"  # the tag to crawl ("scenery")
    types = "hot"  # or "new"; not used in the URL below
    order = "weekly"
    page = "1"
    count = "20"
    baseUrl = "https://tuchong.com/rest/tags/"
    start_urls = [baseUrl + tag + "/posts?page=" +
                  page + "&count=" + count + "&order=" + order]
    # Pagination (looping over page) is not implemented yet

    def parse(self, response):
        # The endpoint returns JSON; each post carries a list of images
        posts = json.loads(response.body)['postList']
        for post in posts:
            images = post['images']
            for img in images:
                item = TuchongprojItem()
                author_id = img['user_id']
                img_id = img['img_id']
                # Assemble the full-size image URL (/f/ rule)
                img_link = "https://photo.tuchong.com/" + \
                    str(author_id) + "/f/" + str(img_id) + ".jpg"
                item['authorId'] = author_id
                item['imageId'] = img_id
                item['imageLink'] = img_link
                yield item
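
The demo only requests page 1. A minimal sketch of how the start URLs for several pages could be built is shown here. It assumes the endpoint accepts larger page values in the same way, which I have not verified; build_page_urls and its parameters are hypothetical names, not part of the project.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://tuchong.com/rest/tags/"


def build_page_urls(tag, pages, count=20, order="weekly"):
    """Return one request URL per page, from page 1 up to `pages`."""
    urls = []
    for page in range(1, pages + 1):
        query = urlencode({"page": page, "count": count, "order": order})
        # quote() percent-encodes non-ASCII tags such as "风光"
        urls.append(f"{BASE_URL}{quote(tag)}/posts?{query}")
    return urls


if __name__ == "__main__":
    for url in build_page_urls("风光", 3):
        print(url)
```

In the spider, the result could be assigned to start_urls in place of the single hand-built URL.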

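The Item I defined has three fields. From the JSON returned by the endpoint I only take the author ID (user_id in the JSON) and img_id, then assemble the full-size image URL using the rule analyzed above.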
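
To make the extraction step easier to try without running a crawl, here is a sketch of the same logic as a plain function. The sample payload is made up and contains only the keys the spider reads (postList, images, user_id, img_id); the real response has many more fields.

```python
import json


def extract_image_records(body):
    """Turn the JSON response body into a list of author/image/link records."""
    payload = json.loads(body)
    records = []
    for post in payload.get("postList", []):
        for img in post.get("images", []):
            author_id = img["user_id"]
            img_id = img["img_id"]
            records.append({
                "authorId": author_id,
                "imageId": img_id,
                "imageLink": f"https://photo.tuchong.com/{author_id}/f/{img_id}.jpg",
            })
    return records


if __name__ == "__main__":
    # Hypothetical, heavily trimmed sample payload
    sample = json.dumps({
        "postList": [
            {"images": [{"user_id": 1370833, "img_id": 20494979}]},
        ]
    })
    print(extract_image_records(sample))
```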

Item definition:

import scrapy


class TuchongprojItem(scrapy.Item):
    # define the fields for your item here like:
    authorId = scrapy.Field()
    imageId = scrapy.Field()
    imageLink = scrapy.Field()

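Next, define the pipeline that processes the data.

import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings

# Assumption: the original snippet never defines images_store;
# here it is read from the IMAGES_STORE setting.
images_store = get_project_settings().get('IMAGES_STORE')


class TuchongprojPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        imglink = item['imageLink']
        # print(imglink)
        yield scrapy.Request(imglink)

    def item_completed(self, result, item, info):
        # print(result)
        # [(True, {'url': 'https://photo.tuchong.com/1660246/f/9281459.jpg', 'path': 'full/e8876e7547c166d58876c3fce547d6819d8bf031.jpg', 'checksum': 'd78e6ba1115ae956878bdc630173ea7a'})]
        # Collect the storage paths of the successful downloads
        author_id = item['authorId']
        img_id = item['imageId']
        patharr = [x["path"] for ok, x in result if ok]
        if not patharr:
            # Nothing was downloaded, so there is nothing to rename
            raise DropItem("Image download failed: %s" % item['imageLink'])
        path = patharr[0]
        # Rename the file (old path, new path)
        os.rename(os.path.join(images_store, path),
                  os.path.join(images_store, "full",
                               str(author_id) + "-" + str(img_id) + ".jpg"))
        return item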

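Since we are dealing with image data, we use the ImagesPipeline that Scrapy provides.
We override two of its methods:
get_media_requests is responsible for issuing the image download requests. Here we pass the full-size image URL we assembled to Request.

item_completed is used here to rename the downloaded image. Our naming rule is author_id + '-' + img_id, which makes the files easy to sort ^_^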
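
The renaming step can be tried on its own, without Scrapy. The sketch below takes a results list in the shape ImagesPipeline passes to item_completed (pairs of a success flag and an info dict with a "path" key) and renames the first successful download. The helper name rename_downloaded is hypothetical, and returning None on failure is my own choice for illustration.

```python
import os


def rename_downloaded(store_dir, results, author_id, img_id):
    """Rename the first successfully downloaded file to '<author_id>-<img_id>.jpg'.

    Returns the new path, or None when no download succeeded.
    """
    paths = [info["path"] for ok, info in results if ok]
    if not paths:
        return None
    old_path = os.path.join(store_dir, paths[0])
    new_path = os.path.join(store_dir, "full", f"{author_id}-{img_id}.jpg")
    os.rename(old_path, new_path)
    return new_path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as store:
        # Simulate a file stored by the pipeline under full/<hash>.jpg
        os.makedirs(os.path.join(store, "full"))
        with open(os.path.join(store, "full", "abc123.jpg"), "wb") as fh:
            fh.write(b"fake image bytes")
        results = [(True, {"path": "full/abc123.jpg"})]
        print(rename_downloaded(store, results, 1660246, 9281459))
```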

To use ImagesPipeline we need to specify where the images are stored, and we also need to enable the pipeline.

settings.py

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'TuchongProj.pipelines.TuchongprojPipeline': 300,
}

# Where the downloaded images are stored
IMAGES_STORE = "F:/_tuchong/fengjing/"

OK, the spider is more or less done. Let's run it!

scrapy crawl tuchong

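Other

Writing my own crawler is purely for fun.
When we write a crawler to fetch someone else's data, we should keep concurrency as low as we can and avoid being a nuisance to the site.
Where possible, crawl only the content that the site's robots.txt permits.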
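
As a rough sketch of what "keeping concurrency low" could look like in settings.py: the setting names below are standard Scrapy settings, but the values are only illustrative guesses on my part, not tuned for Tuchong.

```python
# settings.py (additional, illustrative values)

# Respect the site's robots.txt
ROBOTSTXT_OBEY = True

# Keep the number of simultaneous requests small
CONCURRENT_REQUESTS = 4
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Wait (in seconds) between requests to the same site
DOWNLOAD_DELAY = 1.0

# Let Scrapy adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
```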

Appendix: Tuchong's robots.txt

# Robots.txt file from http://www.tuchong.com
# All robots will spider the domain

User-agent: YandexBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: Purebot
Disallow: /
User-agent: psbot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /api/
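
To check a URL against these rules by hand, Python's standard urllib.robotparser can parse the text directly. This is a small sketch using the rules exactly as quoted above; the live file may have changed since, and the user agent name my-practice-bot is made up.

```python
from urllib.robotparser import RobotFileParser

# The rules as quoted in this post
ROBOTS_TXT = """\
User-agent: YandexBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: Purebot
Disallow: /
User-agent: psbot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /api/
"""


def is_allowed(url, user_agent="my-practice-bot"):
    """Return True if the quoted rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("https://tuchong.com/rest/tags/x/posts"))
    print(is_allowed("https://tuchong.com/api/x"))
```

Under the quoted rules, paths below /rest/ are not disallowed for generic user agents, while /admin/ and /api/ are.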

Related learning resources

Scrapy documentation (CN)

Practicing with the Scrapy framework: scraping images from Tuchong

Yoze