Scrapy Learning Path 2 (Downloading Images and Getting the Download Path)

WelliJhon, published 2019-07-30 15:21 / 871 reads

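Summary: downloading images and getting the path after download. Covers crawling the small cover image and passing it via meta to parse_info, crawling the detail-page URL and building its full address, crawling and requesting the next page, enabling the pipeline, a note on the spider, further customization, and a supplementary utils/common.py.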

Downloading images and getting the path after download 1

items.py

import scrapy

class InfoItem(scrapy.Item):
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    small_image = scrapy.Field()
    small_image_path = scrapy.Field()
    big_image = scrapy.Field()
    big_image_path = scrapy.Field()
    code = scrapy.Field()
    date = scrapy.Field()
    lengths = scrapy.Field()
    author = scrapy.Field()
    cate = scrapy.Field()
    av_artor = scrapy.Field()

spider/jxxx.py

# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
from scrapy.http import Request
from JaSpider.items import InfoItem
from JaSpider.utils.common import get_md5


class JxxxSpider(scrapy.Spider):
    name = "jxxx"
    allowed_domains = ["www.jxxx.com"]
    start_urls = ["http://www.jxxx.com/cn/vl_update.php"]

    def parse(self, response):
        for i in response.css(".video"):
            small_image = i.css("img::attr(src)").extract_first()  # small cover image; passed to parse_info via meta
            link = i.css("a::attr(href)").extract_first()  # detail-page URL (may be relative)
            real_url = parse.urljoin(response.url, link)  # absolute detail-page URL
            yield Request(url=real_url, meta={"small_image": small_image}, callback=self.parse_info)
        # crawl and request the next page
        next_url = response.css(".page_selector .page.next::attr(href)").extract_first()
        if next_url:
            perfect_next_url = parse.urljoin(response.url, next_url)
            yield Request(url=perfect_next_url, callback=self.parse)

    def parse_info(self, response):
        small_image = "http:"+response.meta["small_image"]
        big_image = "http:" + response.xpath('//div[@id="video_jacket"]/img/@src').extract_first()
        code = response.css("#video_id .text::text").extract_first()
        date = response.css("#video_date .text::text").extract_first()
        lengths = response.css("#video_length .text::text").extract_first()
        author = response.css("#video_director .director a::text").extract_first() or "不明"  # "不明" means "unknown"
        cate = ",".join([i.css("a::text").extract_first() for i in response.css("#video_genres .genre") if i.css("a::text").extract_first()])
        av_artor = ",".join([i.css("a::text").extract_first() for i in response.css(".star") if i.css("a::text").extract_first()])
        # print("http:"+small_image)
        info_item = InfoItem()
        info_item["url"] = response.url
        info_item["url_object_id"] = get_md5(response.url)
        info_item["small_image"] = small_image
        info_item["big_image"] = [big_image]
        info_item["code"] = code
        info_item["date"] = date
        info_item["lengths"] = lengths
        info_item["author"] = author
        info_item["cate"] = cate
        info_item["av_artor"] = av_artor
        yield info_item
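
The spider relies on `parse.urljoin` to turn the links it extracts into absolute URLs. A small sketch of how that resolution behaves; the hrefs below are made-up placeholders, not values taken from the site:

```python
from urllib import parse

base = "http://www.jxxx.com/cn/vl_update.php"

# A root-relative href is resolved against the base URL's host.
detail_url = parse.urljoin(base, "/cn/?v=abc")
print(detail_url)  # http://www.jxxx.com/cn/?v=abc

# A protocol-relative src ("//host/path") inherits the base URL's scheme,
# which has the same effect as prefixing "http:" by hand.
image_url = parse.urljoin(base, "//img.example.com/cover.jpg")
print(image_url)  # http://img.example.com/cover.jpg

# An already absolute URL is returned unchanged.
absolute_url = parse.urljoin(base, "http://other.example.com/x")
print(absolute_url)  # http://other.example.com/x
```

Because `urljoin` already handles protocol-relative URLs, it could replace the manual `"http:" + ...` concatenation in `parse_info`, assuming the site's image `src` values start with `//`.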

Enable the pipeline in settings.py
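
The settings listing for this step did not survive in the source. A minimal sketch of what enabling Scrapy's built-in image pipeline typically looks like, assuming the image URLs live in the `big_image` field of the item above. The `images` directory name and the priority value `1` are assumptions, not the author's values:

```python
# settings.py (sketch; the store directory and priority are assumptions)
import os

# Enable Scrapy's built-in image pipeline (it needs Pillow installed).
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Item field that holds the list of image URLs to download.
IMAGES_URLS_FIELD = "big_image"

# Directory where downloaded images are saved. A relative name is
# resolved here against the directory the crawl is started from.
IMAGES_STORE = os.path.abspath("images")
```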

Note:
spider/jxxx.py
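
The body of this note is missing from the source. Judging from the spider listing, it most likely concerns the fact that `big_image` is assigned as a list (`[big_image]`): the image pipeline iterates over the field value, so a bare string would be treated as a sequence of one-character URLs. A simplified sketch of that iteration; the function `urls_to_request` is hypothetical and is not Scrapy's actual code:

```python
def urls_to_request(field_value):
    """Mimic how an image pipeline walks the URL field: one request per element."""
    return [url for url in field_value]


image_url = "http://example.com/cover.jpg"  # placeholder URL

# Correct: a list holding one URL yields exactly one request.
print(urls_to_request([image_url]))

# Wrong: a bare string is iterated character by character,
# so the first "URL" would be just "h".
print(urls_to_request(image_url)[:4])
```

With a bare string, Scrapy typically fails with a `ValueError` complaining about a missing scheme in the request URL.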

For further customization
settings.py

pipeline.py
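
The settings.py and pipeline.py listings for this step are missing from the source. Given the goal of getting the downloaded path, a plausible sketch is to subclass `ImagesPipeline` and override `item_completed`, whose `results` argument is a list of `(success, value)` tuples; on success, `value` is a dict that includes the saved file's `path`. The class name `JaImagesPipeline` and the choice to store a comma-joined string in `big_image_path` are assumptions, and the module path in the final comment simply follows the file name given above:

```python
# pipeline.py (sketch; class name and storage format are assumptions)
try:
    from scrapy.pipelines.images import ImagesPipeline
except ImportError:
    # Fallback only so this sketch can be tried without Scrapy installed.
    ImagesPipeline = object


class JaImagesPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results: list of (success, value) tuples, one per requested image.
        # On success, value is a dict with "url", "path" and "checksum";
        # "path" is relative to IMAGES_STORE.
        paths = [value["path"] for ok, value in results if ok]
        item["big_image_path"] = ",".join(paths)
        return item


# settings.py would then point at the subclass instead of the built-in one:
# ITEM_PIPELINES = {"JaSpider.pipeline.JaImagesPipeline": 1}
```

Returning the item from `item_completed` passes it on to any later pipeline, now carrying the stored path alongside the original URL.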

Supplement
Create utils/common.py

import hashlib


def get_md5(url):
    if isinstance(url, str):
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()


if __name__ == "__main__":
    a = get_md5("http://www.haddu.com")
    print(a)

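Copyright of this article belongs to the author. Please do not reproduce it without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reproducing, please cite this article's address: http://www.ezyhdfw.cn/yun/41201.html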
