Scraping photos of girls from Baidu Tieba's drift bottles with Python

bang590, published 2019-07-25 11:44 / 3347 views

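I stumbled on the fact that Tieba has also launched a drift bottle feature. Flipping through it, I found it was full of photos of girls, so with nothing better to do I decided to write a crawler to download all the pictures.

Here is the Tieba drift bottle address: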
http://tieba.baidu.com/bottle...

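1. Analysis

First open Fiddler, the packet capture tool, then open the drift bottle home page and load a few pages. After filtering out image data and every response whose status is not HTTP 200 in Fiddler, it turns out that each page is fetched in a very regular way, which makes scraping easy. The URL that fetches one page of content looks like this: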

http://tieba.baidu.com/bottle...

The parameters are easy to understand: page_number is the current page number, and page_size is the number of bottles contained in that page.

The response is JSON, structured roughly as follows:
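As a minimal sketch of how the two parameters fit together, the helper below builds the page URL. It is written for Python 3 (the article itself targets python2.7), the endpoint is the one that appears in the code in section 2.1 and may no longer be live, and the name page_url is made up for this illustration.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the article's code; it may no longer respond.
BASE_URL = "http://tieba.baidu.com/bottle/bottles"


def page_url(page_number, page_size=30):
    """Build the URL for one page of bottles (hypothetical helper)."""
    query = urlencode({"page_number": page_number, "page_size": page_size})
    return "%s?%s" % (BASE_URL, query)


print(page_url(1))
print(page_url(2, page_size=10))
```

Only the query string changes from page to page, which is what makes a simple counter-based loop possible.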

{
    "error_code": 0,
    "error_msg": "success",
    "data": {
        "has_more": 1,
        "bottles": [
            {
                "thread_id": "5057974188",
                "title": "美得不可一世",
                "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
            },
            {
                "thread_id": "5057974188",
                "title": "美得不可一世",
                "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
            },
            ...
        ]
    }
}

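The content is straightforward: the data in bottles is what we want (thread_id is the bottle's id, title is the text the girl posted, img_url is the real address of the photo), and iterating over bottles yields every bottle on the current page. (What we get at this point is only the cover image; opening an individual bottle holds more. I was too lazy to write that part up, but I did analyze the inner data as well. The URL is: http://tieba.baidu.com/bottle...<bottle thread_id>)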
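The field walk-through above can be sketched as a small Python 3 example that decodes a sample response and iterates over bottles. The sample mirrors the structure shown earlier with a placeholder title, and iter_bottles is a hypothetical name, not part of the original script.

```python
import json

# Sample shaped like the response above; the title is a placeholder.
sample = """
{
    "error_code": 0,
    "error_msg": "success",
    "data": {
        "has_more": 1,
        "bottles": [
            {
                "thread_id": "5057974188",
                "title": "sample title",
                "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
            }
        ]
    }
}
"""


def iter_bottles(payload):
    """Yield (thread_id, title, img_url) for each bottle in a decoded response."""
    if payload.get("error_code") != 0:
        return  # the server reported an error, so there is nothing to yield
    for bottle in payload["data"]["bottles"]:
        yield bottle["thread_id"], bottle["title"], bottle["img_url"]


payload = json.loads(sample)
for thread_id, title, img_url in iter_bottles(payload):
    print(thread_id, title, img_url)
```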

There is one more parameter, has_more, which I guess indicates whether a next page exists.
That settles the collection strategy: start at the first page and keep looping forward until has_more is no longer 1.
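The loop described above can be tried offline with a fake data source. In this Python 3 sketch, fetch_page is a hypothetical callable that maps a page number to the decoded "data" object, and FAKE_PAGES stands in for the real endpoint.

```python
def iter_all_pages(fetch_page):
    """Yield every bottle, page by page, until has_more is no longer 1.

    fetch_page is a hypothetical callable: page_number -> decoded "data" dict.
    """
    page_number = 1
    while True:
        data = fetch_page(page_number)
        for bottle in data["bottles"]:
            yield bottle
        if data["has_more"] != 1:
            break
        page_number += 1


# Fake three-page source used in place of the real endpoint.
FAKE_PAGES = {
    1: {"has_more": 1, "bottles": ["a", "b"]},
    2: {"has_more": 1, "bottles": ["c"]},
    3: {"has_more": 0, "bottles": ["d"]},
}

print(list(iter_all_pages(FAKE_PAGES.__getitem__)))
```

Passing the fetch function in as an argument keeps the paging logic separate from the network code, so the stop condition can be checked without touching the site.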

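2. Coding

This job uses python2.7 + urllib2 + demjson. urllib2 ships with python2.7; demjson has to be installed separately. (Normally python's built-in json library is enough for parsing JSON, but many sites now serve JSON that does not follow the standard, and the built-in json library cannot handle it.)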
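To illustrate the point about non-standard JSON, here is a small Python 3 sketch using only the standard library. The loose string is a made-up example of the kind of JavaScript-style reply some sites return (unquoted keys, single quotes); it is not taken from the Tieba response.

```python
import json

# Hypothetical non-standard reply: unquoted keys and a single-quoted string.
loose = "{error_code: 0, error_msg: 'success'}"
# The same data written as standard JSON.
strict = '{"error_code": 0, "error_msg": "success"}'


def try_strict(text):
    """Return the decoded object, or None if the standard parser rejects the text."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


print(try_strict(strict))
print(try_strict(loose))
```

The standard parser accepts the first string and rejects the second, which is the situation a more lenient decoder such as demjson is meant to cover.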

Installing demjson (sudo is not needed on Windows)

sudo pip install demjson

or

sudo easy_install demjson

2.1 Fetching one page of content

def bottlegen():
    page_number = 1
    while True:
        try:
            # Fetch one page (30 bottles per page) and decode the JSON reply.
            data = urllib2.urlopen(
                "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30" % page_number).read()
            result = demjson.decode(data)
            if result["error_code"] == 0:
                data = result["data"]
                has_more = data["has_more"]
                bottles = data["bottles"]
                for bottle in bottles:
                    thread_id = bottle["thread_id"]
                    title = bottle["title"]
                    img_url = bottle["img_url"]
                    yield (thread_id, title, img_url)
                # Stop once the server reports that there is no next page.
                if has_more != 1:
                    break
                page_number += 1
        except Exception:
            # Report the failure, wait a moment, then retry the same page.
            print("bottlegen exception")
            time.sleep(5)

A python generator is used here so that parsed items are emitted continuously as they are found.
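As a toy illustration of why a generator suits this job, the Python 3 sketch below uses a stand-in generator (numbered_items, a made-up name) that could run forever, yet only produces as many items as the consumer asks for.

```python
from itertools import islice


def numbered_items():
    """Toy generator standing in for bottlegen(): yields items without end."""
    n = 1
    while True:
        print("producing item %d" % n)
        yield n
        n += 1


# Only three items are requested, so only three are ever produced.
first_three = list(islice(numbered_items(), 3))
print(first_three)
```

In the crawler this means pages are requested only as the download loop consumes bottles, rather than all up front.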

2.2 Saving image data by URL

for thread_id, title, img_url in bottlegen():
    filename = os.path.basename(img_url)
    pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
    print(filename)
    # The with block closes the file automatically.
    with open(pathname, "wb") as f:
        f.write(urllib2.urlopen(img_url).read())
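For readers on Python 3, the saving step might look roughly like the sketch below. It keeps the article's naming scheme (thread_id plus the image file name) but, as a deliberate variation, takes the file name from the URL path so that any query string is dropped. target_path and save_image are hypothetical names, and save_image is defined but not called here because the image host may no longer respond.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def target_path(thread_id, img_url, root="tieba/bottles"):
    """Build the local path '<root>/<thread_id>_<image file name>'."""
    filename = os.path.basename(urlparse(img_url).path)
    return "%s/%s_%s" % (root, thread_id, filename)


def save_image(thread_id, img_url, root="tieba/bottles", timeout=10):
    """Download one image and write it under root (not invoked in this sketch)."""
    os.makedirs(root, exist_ok=True)
    pathname = target_path(thread_id, img_url, root)
    with urlopen(img_url, timeout=timeout) as response, open(pathname, "wb") as f:
        f.write(response.read())
    return pathname


print(target_path(
    "5057974188",
    "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"))
```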

2.3 The full code

# -*- encoding: utf-8 -*-
import urllib2
import demjson
import time
import re
import os

def bottlegen():
    page_number = 1
    while True:
        try:
            # Fetch one page (30 bottles per page) and decode the JSON reply.
            data = urllib2.urlopen(
                "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30" % page_number).read()
            result = demjson.decode(data)
            if result["error_code"] == 0:
                data = result["data"]
                has_more = data["has_more"]
                bottles = data["bottles"]
                for bottle in bottles:
                    thread_id = bottle["thread_id"]
                    title = bottle["title"]
                    img_url = bottle["img_url"]
                    yield (thread_id, title, img_url)
                # Stop once the server reports that there is no next page.
                if has_more != 1:
                    break
                page_number += 1
        except Exception:
            # Report the failure, wait a moment, then retry the same page.
            print("bottlegen exception")
            time.sleep(5)

def imggen(thread_id):
    try:
        data = urllib2.urlopen(
            "http://tieba.baidu.com/bottle/photopbPage?thread_id=%s" % thread_id).read()
        # The image list is embedded in the page as an argument of an inline script call.
        match = re.search(r'_\.Module\.use\("encourage/widget/bottle",(.*?),function\(\)\{\}\);', data)
        data = match.group(1)
        result = demjson.decode(data)
        # The second element is itself a JSON string: strip newlines, then decode again.
        result = demjson.decode(result[1].replace("\n", ""))
        for i in result:
            thread_id = i["thread_id"]
            text = i["text"]
            img_url = i["img_url"]
            yield (thread_id, text, img_url)
    except Exception:
        # Report the failure and skip this bottle.
        print("imggen exception")

# Create the output directory; ignore the error if it already exists.
try:
    os.makedirs("tieba/bottles")
except OSError:
    pass

for thread_id, _, _ in bottlegen():
    for _, title, img_url in imggen(thread_id):
        filename = os.path.basename(img_url)
        pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
        print(filename)
        # The with block closes the file automatically.
        with open(pathname, "wb") as f:
            f.write(urllib2.urlopen(img_url).read())

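When run, the script first fetches all the bottles on each page, then fetches every image inside each bottle, and writes them to tieba/bottles/xxxxx.jpg. (I was lazy and did not add proper error handling, sorry ^_^)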
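The per-bottle step in the listing pulls a JSON argument out of an inline script call and decodes it twice. The Python 3 sketch below reproduces that idea on a synthetic page: the real page markup is not shown in the article, so the page string, the first array element and the sample record are all assumptions built only to exercise the regular expression, and extract_images is a made-up name.

```python
import json
import re

# Pattern for the inline call: _.Module.use("encourage/widget/bottle", <args>, function(){});
PATTERN = re.compile(
    r'_\.Module\.use\("encourage/widget/bottle",(.*?),function\(\)\{\}\);')

# Synthetic page: the second array element is a JSON document stored as a string.
inner = json.dumps([
    {"thread_id": "1", "text": "hi", "img_url": "http://example.com/a.jpg"},
])
args = json.dumps(["placeholder", inner])
page = ('<script>_.Module.use("encourage/widget/bottle",%s,function(){});</script>'
        % args)


def extract_images(html):
    """Return (thread_id, text, img_url) tuples found in the embedded JSON."""
    match = PATTERN.search(html)
    if match is None:
        return []
    outer = json.loads(match.group(1))  # first decode: the argument array
    records = json.loads(outer[1])      # second decode: the embedded string
    return [(r["thread_id"], r["text"], r["img_url"]) for r in records]


print(extract_images(page))
```

Returning an empty list when the pattern is missing avoids the attribute error that a failed match would otherwise cause.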

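Conclusion

The conclusion is... it is all a trick: only the home page has a few good-looking photos - -... damn...

Finally, here are the results of the scrape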

When reposting, please cite this article's address: http://www.ezyhdfw.cn/yun/38580.html
