Data Science

anquan, published 2019-07-30 17:52 / 2,491 views

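Abstract: Data analysis covers filtering data, detecting missing values, filling missing values, transforming data, handling time formats, reshaping data, and learning regular expressions. pandas provides a high-performance, easy-to-use data format that lets users manipulate and analyze data quickly. Missing values can be filled with descriptive statistics such as the mean, median, or mode.

Roughly 90% of useful data is said to reside in databases.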

Data

Data types

Qualitative data: describes characteristics or categories, for example ethnicity or region.

Quantitative data: can be counted or measured, for example height or amount spent.

Quantitative data

Discrete data
Can only be expressed in natural numbers or whole units.
Values are obtained by counting in units, using ordinary counting methods.
Example: the number of students in a class.

Continuous data
Can take any value within an interval; the values are continuous, and infinitely many values lie between any two of them.
Values can only be obtained by measuring.
Example: the dimensions of a machine part.

Data sources

Structured data
Every record has fixed fields and a fixed format, which makes later retrieval and analysis by programs easy.
Example: a database.

Semi-structured data (flexible enough to store, yet still easy to search)
Sits between structured and unstructured data.
Records have fields and can be looked up by field, which is convenient, but the fields may differ from record to record.
Examples: XML, JSON.

Unstructured data
Has no fixed format and must be organized before it can be stored and retrieved.
Examples: free-form text, web page data, file data.

Unstructured data must go through an ETL (Extract, Transformation, Loading) tool that converts it into structured data before it can be used.
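As a minimal sketch of the ETL idea, assuming a made-up sentence format (the sample text, the regular expression, and the field names `name`, `age`, and `city` are illustrative assumptions, not part of the original), the three steps could look like this:

```python
import json
import re

# Hypothetical unstructured input: free-form sentences, one per line.
RAW_TEXT = """Mary is 24 years old and lives in CityA.
Tom is 31 years old and lives in CityB.
This line has no usable information."""

# Assumed pattern describing the only sentence shape this sketch can parse.
PATTERN = re.compile(
    r"(?P<name>\w+) is (?P<age>\d+) years old and lives in (?P<city>\w+)"
)


def extract(text):
    """Extract: keep only the lines that match the expected pattern."""
    matches = (PATTERN.search(line) for line in text.splitlines())
    return [m for m in matches if m]


def transform(matches):
    """Transform: turn each match into a record with typed fields."""
    return [
        {"name": m["name"], "age": int(m["age"]), "city": m["city"]}
        for m in matches
    ]


def load(records):
    """Load: serialize the records; a real pipeline would write to a database."""
    return json.dumps(records, ensure_ascii=False)


records = transform(extract(RAW_TEXT))
print(load(records))
```

The third input line is dropped at the extract step because it carries no fields the pattern recognizes.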

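File handling

Basic file operations

# "raw" is not a valid mode: write with "w", then reopen with "r" to read.
with open("filename", "w", encoding="utf-8") as f:
    f.write("hello world")

with open("filename", "r", encoding="utf-8") as f:
    content = f.read()     # the whole file as one string
    f.seek(0)              # rewind before reading again
    lines = f.readlines()  # a list of lines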

CSV data

Method 1:
Open the file and read the data.

with open("./Population.csv", "r", encoding="UTF-8") as f:
    # print(f.read())
    for line in f.readlines():
        print(line)
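Reading raw lines leaves the splitting into fields to you. A small sketch with the standard-library `csv` module is shown here, assuming hypothetical file content (the real columns of Population.csv are not shown in the text, so `city` and `population` are made-up names):

```python
import csv
import io

# Hypothetical CSV content standing in for Population.csv.
SAMPLE = "city,population\nCityA,100\nCityB,250\n"

# csv.DictReader uses the first row as field names and yields one dict per row.
reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)

for row in rows:
    # Every value is read as a string, so numbers must be converted explicitly.
    print(row["city"], int(row["population"]))
```

With a real file, the `io.StringIO(SAMPLE)` argument would be replaced by the file object returned by `open(..., newline="")`.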

Method 2:
Read it with the pandas module.

import pandas as pd


df = pd.read_csv("./Population.csv")
print(df.values)

Excel data

import pandas as pd

filename = "house_sample.xlsx"

df = pd.read_excel(filename)

print(df.values[0][0])

JSON data

Method 1:
Read the file, then parse it with the json module into Python list data.

import json


filename = "jd.json"
with open(filename, "r") as f:
    fc = f.read()

df = json.loads(fc)
print(df)

strjson = json.dumps(df)

Method 2:
Read it with the pandas module.

import pandas as pd

filename = "jd.json"

df = pd.read_json(filename)

print(df.values)

XML data

Processing with the xml module:

import xml.etree.ElementTree as ET

filename = "weather.xml"
tree = ET.parse(filename)

root = tree.getroot()

for city in root.iter("city"):
    print(city.get("cityname"))
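To try the same traversal without a file on disk, here is a sketch that parses an in-memory string. The XML content is hypothetical; only the `city` tag and its `cityname` attribute come from the example above, and the `temp` attribute is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML standing in for weather.xml.
SAMPLE = """<weather>
    <city cityname="CityA" temp="30"/>
    <city cityname="CityB" temp="25"/>
</weather>"""

root = ET.fromstring(SAMPLE)

# Copy the attributes of each <city> element into a plain dict.
cities = [dict(city.attrib) for city in root.iter("city")]
names = [c["cityname"] for c in cities]
print(names)
```

Attribute values always come back as strings, so numeric attributes need an explicit conversion.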

Web scraping

Required modules:

BeautifulSoup

requests: fetches network resources and supports the REST operations POST, PUT, GET, and DELETE.

Simple scraping:

import requests

newurl = "http://news.qq.com/"

res = requests.get(newurl)
print(res.text)

BeautifulSoup

The bs4 module parses a fetched page into a DOM document and lets you use CSS selectors to find the content you need.

import requests
from bs4 import BeautifulSoup

newurl = "http://news.qq.com/"

res = requests.get(newurl)

html = res.text
# print(res.text)

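# Sample markup for the demo. The original tags were lost in extraction,
# so the <h1 title=...> and <h2> below are an assumed reconstruction.
html = """
    <h1 title="hello">hello world</h1>
    <h2>Data Science</h2>
"""
soup = BeautifulSoup(html, "html.parser")

s = soup.select("h1") # select elements
print(s[0]["title"]) # read an attribute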

Handy tools for locating elements to scrape

Chrome

Firefox

InfoLite

xpath and the lxml library

Data obtained elsewhere can be saved as .json, .csv, or .xlsx by first loading it into a DataFrame.

import pandas
import requests
from bs4 import BeautifulSoup

newurl = "http://news.qq.com/"
html = requests.get(newurl).text

soup = BeautifulSoup(html, "html.parser")
warp = soup.select(".head .Q-tpWrap .text")

dataArr = []
for news in warp:
    # Keep the text as str (calling .encode() would store bytes) and spell the key "href".
    dataArr.append({"name": news.select("a")[0].text, "href": news.select("a")[0]["href"]})

newsdf = pandas.DataFrame(dataArr)
newsdf.to_json("news.json")
newsdf.to_csv("news.csv")
newsdf.to_excel("news.xlsx")

import requests
from bs4 import BeautifulSoup
import json

url = "http://xm.esf.fang.com/"
html = requests.get(url).text

soup = BeautifulSoup(html, "html.parser")
resultArr = []

for house in soup.select(".shop_list dl"):
    shop = {
        "tit_shop": house.select("dd:nth-of-type(1) .tit_shop") and house.select("dd:nth-of-type(1) .tit_shop")[0].text,
        "tel_shop": house.select("dd:nth-of-type(1) .tel_shop") and "".join( house.select("dd:nth-of-type(1) .tel_shop")[0].text.split("|") ).strip(),
        "add_shop": house.select("dd:nth-of-type(1) .add_shop") and "Community: " + house.select("dd:nth-of-type(1) .add_shop")[0].select("a")[0].text + "; address: " + house.select("dd:nth-of-type(1) .add_shop")[0].select("span")[0].text,
        "price_shop": house.select("dd:nth-of-type(2) span b") and house.select("dd:nth-of-type(2) span b")[0].text,
        "sqm": house.select("dd:nth-of-type(2) span") and house.select("dd:nth-of-type(2) span")[1].text
    }
    resultArr.append(shop)

resultArr = json.dumps(resultArr)

with open("fang.json", "w") as f:
    f.write(resultArr)
print("ok")

Scraping Xiamen second-hand housing data from Fang.com

import json

import requests
from bs4 import BeautifulSoup

url = "http://xm.esf.fang.com/"
html = requests.get(url).text
domain = "http://xm.esf.fang.com"


def getUrlDetails(url):
    dhtml = requests.get(url).text
    dsoup = BeautifulSoup(dhtml, "html.parser")

    info = {}
    # The dictionary keys stay in Chinese to match the field names scraped from the page.
    info["標題"] = dsoup.select(".title h1")[0] and dsoup.select(
        ".title h1")[0].text.strip()  # title
    info["總價"] = dsoup.select(".tab-cont-right .price_esf")[0].text  # total price

    for item in dsoup.select(".tab-cont-right .trl-item1"):
        info[item.select(".font14")[0].text] = item.select(
            ".tt")[0].text.strip()
    info["地址"] = dsoup.select(
        ".tab-cont-right .trl-item2 .rcont")[0].text.strip()[0:-2]  # address
    for item in dsoup.select(".zf_new_left .cont .text-item"):
        # Split on newlines, then drop empty fragments.
        st_split = item.text.strip().split("\n")
        while "" in st_split:
            st_split.remove("")
        while "\n" in st_split:
            st_split.remove("\n")
        if len(st_split) > 2:
            st_split = [st_split[0]] + ["".join(st_split[1:])]
        k, v = st_split
        info[k] = v.strip()
    return info


if __name__ == "__main__":
    soup = BeautifulSoup(html, "html.parser")
    resultArr = []
    for house in soup.select(".shop_list dl"):
        if house.select("h4 a"):
            resUrl = domain + house.select("h4 a")[0]["href"]
            details = getUrlDetails(resUrl)  # fetch each detail page only once
            if details:
                resultArr.append(details)

    result = json.dumps(resultArr)
    print("Scraping finished")
    with open("house.json", "w") as f:
        f.write(result)
    print("Writing finished")

Scraping job postings from Lagou (JSON format)

# coding=utf-8

import json
import time

import requests
import xlwt

url = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"

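# Fetch the JSON object holding the job postings and iterate over it to collect company name, benefits, location, education requirement, job type, posting time, position name, salary, and years of experience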
def getJobRow(url, datas):
    time.sleep(10)
    header = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        "Host": "www.lagou.com",
        "Origin": "https://www.lagou.com",
        "Referer": "https://www.lagou.com/jobs/list_?labelWords=&fromSearch=true&suginput="
    }
    cookie = {
        "Cookie": "JSESSIONID=ABAAABAAAIAACBI80DD5F5ACDEA0EB9CA0A1B926B8EAD3C; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539844668; _ga=GA1.2.439735849.1539844668; _gid=GA1.2.491577759.1539844668; user_trace_token=20181018143747-53713f4a-d2a0-11e8-814e-525400f775ce; LGSID=20181018143747-53714082-d2a0-11e8-814e-525400f775ce; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; LGUID=20181018143747-53714251-d2a0-11e8-814e-525400f775ce; index_location_city=%E4%B8%8A%E6%B5%B7; TG-TRACK-CODE=index_search; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539844675; LGRID=20181018143754-57b25578-d2a0-11e8-bdc4-5254005c3644; SEARCH_ID=d0a97cea1d1d47d0afa41b3f298f41d5"
    }
    result_data = requests.post(url=url, cookies=cookie, headers=header, data=datas).json()
    content = result_data["content"]["positionResult"]["result"]

    info_item = []
    for i in content:
        information = []
        information.append(i["positionId"]) # position ID
        information.append(i["companyFullName"]) # full company name
        information.append(", ".join(i["companyLabelList"] or [])) # benefits, joined into one string so xlwt can write the cell
        information.append(i["district"]) # work location
        information.append(i["education"]) # education requirement
        information.append(i["firstType"]) # job type
        information.append(i["formatCreateTime"]) # posting time
        information.append(i["positionName"]) # position name
        information.append(i["salary"]) # salary
        information.append(i["workYear"]) # years of experience

        info_item.append(information)
    return info_item

def main():
    city = input("City to scrape: ")
    page = int(input("Number of pages to scrape: "))
    kd = input("Search keyword: ")
    info_result = []
    title = ["Position ID", "Company", "Benefits", "Location", "Education", "Job type", "Posted", "Position", "Salary", "Experience"]
    info_result.append(title)

    for x in range(1, page+1):
        datas = {
            "first": True,
            "pn": x,
            "kd": kd,
            "city": city
        }
        info = getJobRow(url, datas)
        info_result = info_result + info
        print(info_result, "info_result")

        # Data for Excel is assembled as [[header row], [row], [row], ...]
        workbook = xlwt.Workbook(encoding="utf-8")
        worksheet = workbook.add_sheet("lagou" + kd, cell_overwrite_ok=True)

        for i, row in enumerate(info_result):
            # print row
            for j, col in enumerate(row):
                # print col
                worksheet.write(i, j, col) # (i, j) is the cell position, col is the content

        workbook.save("lagou" + kd + city + ".xls")

if __name__ == "__main__":
    main()

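Data cleaning

Data processing

Data analysis

Interpreting results

Little of the time on a project actually goes to analysis itself, so you need to make good use of tools.

Data analysis:

Filtering data

Detecting missing values

Filling missing values

Transforming data

Handling time-format data

Reshaping data

Learning regular expressions

Processing data with pandas:

Table-like format.

Provides a high-performance, easy-to-use data format (DataFrame) that lets users manipulate and analyze data quickly.

pandas is built on top of numpy.

numpy features:

A Python package for numerical computing

N-dimensional array object

Many mathematical functions

Can integrate with C/C++ and Fortran

Representing structured (tabular) information with numpy alone has shortcomings; pandas, built on top of numpy, handles such data more sensibly.

pandas adds the Series structure:

A one-dimensional object similar to an Array or List

Every Series can be accessed through its index

By default a Series is indexed from 0 to its length minus 1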
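A short sketch of the default and custom Series index, assuming pandas is installed (the values and the labels `Mary` and `Tom` are made up for illustration):

```python
import pandas as pd

# Default index: integer labels 0 .. len-1.
s = pd.Series([10, 20, 30])
print(list(s.index))

# Custom labels can replace the default index.
ages = pd.Series([24, 31], index=["Mary", "Tom"])
print(ages["Tom"])
```

Access through `s.loc[2]` or `ages["Tom"]` goes through the index labels, not through the position in memory.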

Data processing

Filtering data

Accessing elements and slicing:

df.loc[1] # get one record by index label (df.ix has been removed from pandas)
df.loc[1:4] # get the records labeled 1 through 4
df["name"] # get data by column name
df[["name", "age"]] # get the data of several columns

df.loc[1:2, ["name", "age"]] # filter by index label and column names

df["gender"] == "M" # compare a categorical column with a value; returns True/False per row
df[df["gender"] == "M"] # return the rows of the table that match

df[(df["gender"] == "M") & (df["age"] >= 30)] # & takes the intersection of the conditions
df[(df["gender"] == "M") | (df["age"] >= 30)] # | takes the union of the conditions

df["employee"] = True # add a column
del df["employee"] # delete a column
df = df.drop("employee", axis=1) # delete a column (alternative to del)

df.loc[6] = {"age": 18, "gender": "F", "name": "aa"} # add a record
df = pd.concat([df, pd.DataFrame([{"age": 18, "gender": "F", "name": "aa"}])], ignore_index=True) # add a record (DataFrame.append was removed in pandas 2.0)
df = df.drop(6) # delete a record

df["userid"] = range(101, 117) # create the column that will become the index
df.set_index("userid", inplace=True) # set the new index

df.iloc[1] # get data by position, regardless of the new index
df.iloc[1:3] # get a slice by position

Ways to get values

df.loc[[101, 102]] # by label (userid); this replaces the removed df.ix
df.loc[[101, 105]] # by label (userid)
df.iloc[1, 2]  # by integer position
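The difference between label-based and position-based access is easiest to see on a tiny table. This is a sketch with made-up rows; only the column names and the `userid` index starting at 101 follow the snippets above.

```python
import pandas as pd

# Hypothetical sample table indexed by userid.
df = pd.DataFrame(
    {"name": ["a", "b", "c"], "gender": ["M", "F", "M"], "age": [35, 28, 22]},
    index=pd.Index([101, 102, 103], name="userid"),
)

by_label = df.loc[102, "name"]   # label-based: the row labeled 102
by_position = df.iloc[1, 0]      # position-based: second row, first column

# Each condition needs its own parentheses when combined with & or |.
older_men = df[(df["gender"] == "M") & (df["age"] >= 30)]

print(by_label, by_position, list(older_men.index))
```

Both lookups reach the same cell here, but only `loc` keeps working unchanged if the rows are reordered.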

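Detecting missing values

Missing data means that specific values, or a range of values, are incomplete.

Missing values can lead to biased inferences during data analysis.

Missing values may be mechanical (an equipment failure kept the data from being saved completely) or human (incomplete or untruthful answers).

Placeholder:
Use numpy.nan from numpy as the placeholder for a missing value.

pd.DataFrame(["a", numpy.nan])

Checking whether a Series has missing values:

df["gender"].notnull() # True where the value is present
df["gender"].isnull() # True where the value is missing

Checking whether a column or the DataFrame contains missing values:

df["name"].isnull().values.any() # does the column contain any missing value?

df.isnull().values.any() # does the DataFrame contain any missing value? Returns True or False

Counting missing values:

df.isnull().sum() # number of missing values in each column
df.isnull().sum().sum() # total number of missing values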

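Filling missing values

Drop missing values: when they make up a small share of the data.

Fill missing values with descriptive statistics such as the mean, median, or mode.

Fill missing values by interpolation: when the column follows a linear trend.

Dropping missing values:

df.dropna() # drop rows containing any missing value
df.dropna(how="all") # drop rows in which every column is NaN
df.dropna(thresh=2) # keep only rows with at least two non-missing values

df["age"] = numpy.nan # add a column made up entirely of missing values
df.dropna(axis=1, how="all") # drop columns in which every value is missing

Filling missing values:

df.fillna(0) # fill missing values with 0
df["age"].fillna(df["age"].mean()) # fill with the mean
df["age"].fillna(df.groupby("gender")["age"].transform("mean")) # fill with the mean age of each gender

df.ffill() # forward fill: carry the previous value forward (replaces the deprecated fillna(method="pad"))
df.bfill(limit=2) # backward fill: use the next value, filling at most 2 consecutive gaps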
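To compare the strategies listed above side by side, here is a sketch on a made-up Series with two gaps. It assumes pandas and numpy are installed; `interpolate()` with its default linear method stands in for the interpolation approach mentioned in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

mean_filled = s.fillna(s.mean())  # the mean of the non-missing values is 3.0
forward_filled = s.ffill()        # carry the previous value forward
interpolated = s.interpolate()    # linear interpolation between neighbours

print(mean_filled.tolist())
print(forward_filled.tolist())
print(interpolated.tolist())
```

On data that really is linear, interpolation recovers the missing points exactly, while the mean and forward fill flatten them.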

Marking unwanted or special values as np.nan:

df.loc[df["物業費用"] == "暫無資料", "物業費用"] = np.nan # replace "暫無資料" (no data available) in the 物業費用 (property fee) column with np.nan

First three rows: df.head(3)
Last three rows: df.tail(3)
DataFrame summary: df.info()
Column names: df.columns
Column types: df.dtypes
Descriptive statistics: df.describe()
Check for missing values: df.isnull().any()
Count missing values: df.isnull().sum()
Share of missing values in each column: df.isnull().sum() / len(df)
Value counts for a particular column: df["volume"].value_counts()
Fill missing values: df["volume"].fillna(0)

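Transforming data

How do you clean and transform the data? Use vectorized computation.

Computing new values:

df["總價"] * 10000 # 總價 (total price) is given in units of 萬 (10,000); convert to the base unit

Computing with a numpy function:

import numpy as np
np.sqrt(df["總價"])

Combining two columns:

df["朝向"] + df["戶型"] # 朝向 = orientation, 戶型 = floor plan

Computing a derived value:

df["均價"] = df["總價"] * 10000 / df["建筑面積"] # 均價 (average price) = total price / 建筑面積 (floor area)

map: apply a function to every element of a column (Series)

def removeDollar(e):
  return e.split("萬")[0]
df["總價"].map(removeDollar) # strip the "萬" character from the 總價 column

df["總價"].map(lambda e: e.split("萬")[0]) # the same with a lambda

apply: apply a function to each row or column of a DataFrame

df.apply(lambda e: e.max() - e.min(), axis=1) # axis=0 applies the function to each column, axis=1 to each row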
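A small runnable sketch contrasting `map` on a Series with `apply` on rows. The table is made up: the column names `total`, `low`, and `high` are hypothetical, and only the "萬" suffix mirrors the data described above.

```python
import pandas as pd

# Hypothetical data: prices stored as text with a trailing unit character.
df = pd.DataFrame({"total": ["350萬", "120萬"], "low": [1, 5], "high": [4, 9]})

# map works element by element on a single column (Series).
df["total_num"] = df["total"].map(lambda e: float(e.split("萬")[0]))

# apply with axis=1 hands each row to the function as a Series.
df["spread"] = df[["low", "high"]].apply(lambda row: row.max() - row.min(), axis=1)

print(df[["total_num", "spread"]])
```

For simple arithmetic like the spread, the vectorized form `df["high"] - df["low"]` is usually faster; `apply` is shown only to illustrate the row-wise call.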

applymap: apply a function to every element of a DataFrame

import numpy as np
df.applymap(lambda e: np.nan if e == "暫無資料" else e) # replace every "暫無資料" element with a missing value (NaN); pandas 2.1+ spells this DataFrame.map

"""
lambda e: np.nan if e == "暫無資料" else e

is equivalent to:

def convertNaN(e):
    if e == "暫無資料":
        return np.nan
    else:
        return e
"""

Handling time formats

Current time:

from datetime import datetime
current_time = datetime.now()

Convert a time to a string:

current_time.strftime("%Y-%m-%d")

Convert a string to a time:

datetime.strptime("2018-08-17", "%Y-%m-%d")

Go back one day:

from datetime import timedelta
current_time - timedelta(days=1)

Go back ten days:

from datetime import timedelta
for i in range(1, 11):
    dt = current_time - timedelta(days=i)
    print(dt.strftime("%Y-%m-%d")) # print each of the previous ten dates

current_time - timedelta(days=10)

Convert a datetime to a UNIX timestamp:

from time import mktime
mktime(current_time.timetuple()) # mktime expects a time tuple in local time

Convert a UNIX timestamp to a datetime:

datetime.fromtimestamp(1538202783)
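Because `mktime` and `datetime.fromtimestamp` both work in the machine's local time zone, results differ between machines. A sketch of a reproducible round trip with timezone-aware datetimes follows; pinning everything to UTC is a choice made for this example, not something the original does.

```python
from datetime import datetime, timedelta, timezone

# A fixed UTC instant keeps the example reproducible on any machine.
moment = datetime(2018, 9, 29, 6, 33, 3, tzinfo=timezone.utc)

ts = moment.timestamp()                                  # datetime -> UNIX timestamp
restored = datetime.fromtimestamp(ts, tz=timezone.utc)   # UNIX timestamp -> datetime

# The ten days before `moment`, newest first.
previous_days = [
    (moment - timedelta(days=i)).strftime("%Y-%m-%d") for i in range(1, 11)
]

print(int(ts), restored == moment, previous_days[0], previous_days[-1])
```

The timestamp produced here is the same 1538202783 used in the snippet above, interpreted as UTC.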

Converting times in pandas:

import pandas as pd
df["日期"] = pd.to_datetime(df["日期"], format="%Y年%m月%d日") # 日期 = date; parsed values display with "-", e.g. 2018-09-29

Reshaping data

Creating dummy variables:

pandas.get_dummies(df["朝向"]) # build dummy variables from the 朝向 (orientation) column
df = pandas.concat([df, pandas.get_dummies(df["朝向"])], axis=1) # join the dummy variables to the original DataFrame as real columns
df = df.drop("朝向", axis=1) # drop the original column (pass the column name, not the Series)

Building a pivot table with pivot_table:

df2 = df.pivot_table(index="單價", columns="產權年限", values="參考均價", aggfunc="sum")
df2.head() # index: row labels; columns: column labels; values: the data the aggregation function is applied to
# df2.T transposes the pivot table (swaps rows and columns)

df3 = df.pivot_table(index=["產權性質", "產權年限"], columns="日期", values="總價", aggfunc="sum")
df3.head()
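The following sketch runs both reshaping steps on a made-up table. The column names `district`, `year`, and `price` are hypothetical stand-ins for the Chinese field names used above.

```python
import pandas as pd

# Hypothetical long-format data: one row per (district, year) observation.
df = pd.DataFrame({
    "district": ["A", "A", "B", "B"],
    "year": [2017, 2018, 2017, 2018],
    "price": [100, 120, 80, 90],
})

# Dummy variables: one indicator column per distinct value.
dummies = pd.get_dummies(df["district"])

# Pivot: districts become rows, years become columns, prices are summed.
table = df.pivot_table(index="district", columns="year", values="price", aggfunc="sum")

print(list(dummies.columns))
print(table)
```

Each cell of `table` holds the sum of `price` for one district and year; `table.T` would swap the two axes.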

Regular expressions

Use regular expressions to pull data out of text.

Matching and comparison.

[] matches any one of the enclosed characters (an "or")

*?, +?, ?? are the non-greedy forms (match as little as possible)

Reading captured data by group name:

import re

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "David Chiu")
print(m.group("first_name"), m.group("last_name"))

str1 = "scp file.text root@10.0.0.1:./"
m = re.search(r"^scp ([\w.]+) (\w+)@([\w.]+):(.+)", str1)
if m:
    print(m.group(1), m.group(2), m.group(3), m.group(4))

Using regular expressions on a DataFrame:

df[["室", "廳", "衛"]] = df["戶型"].str.extract(r"(\d+)室(\d+)廳(\d+)衛", expand=False)
# numbers of rooms (室), living rooms (廳), and bathrooms (衛)
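A compact sketch of named groups and of greedy versus non-greedy matching, using only the standard library. The `<b>` sample string is made up for illustration.

```python
import re

# Named groups: (?P<name>...) lets you read captures by name.
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "David Chiu")
full_name = m.groupdict()

# Greedy vs non-greedy on the same input.
text = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", text)   # .* grabs as much as possible
lazy = re.findall(r"<b>.*?</b>", text)    # .*? stops at the first closing tag

print(full_name, greedy, lazy)
```

The greedy pattern returns one long match spanning both tags, while the non-greedy one returns each tag pair separately.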

Scraping Sina News:

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs


def getDetails(url, idx):
    if idx > 5:
        return
    print(url, "url")
    res = requests.get(url)
    res.encoding = "utf-8"
    d = bs(res.text, "html.parser")

    title = d.select(".main-title")[0].text
    create_time = d.select(".date-source")[0].select(".date")[0].text
    source = d.select(".date-source")[0].select(".source")[0].text
    article = " ".join(d.select("#article")[0].text.split())
    keywords = d.select("#keywords")[0].text

    return {
        "title": title,
        "create_time": create_time,
        "source": source,
        "article": article,
        "keywords": keywords
    }


if __name__ == "__main__":
    url = "https://news.sina.com.cn/china/"
    res = requests.get(url)
    res.encoding = "utf-8"
    dsoup = bs(res.text, "html.parser")
    news_href = [h["href"]
                 for h in dsoup.select(".left-content-1 div")[3].select("li a")]
    newArr = []
    for idx, new in enumerate(news_href):
        t = getDetails(new, idx)
        if t:
            newArr.append(t)

    df = pd.DataFrame(newArr)
    df["keywords"] = df["keywords"].apply(lambda i: i.split(":")[1].split())
    df["create_time"] = pd.to_datetime(df["create_time"], format=r"%Y年%m月%d日 %H:%M")
    df = df[["title", "source", "create_time", "keywords", "article"]] # reorder the columns
    df.to_json("news.json")

    print("ok")

Visualizing data

Descriptive statistics

Summarize the data systematically to understand its overall shape.
Describe the data sample, for example with the mean, standard deviation, frequency counts, and percentages.
Present the data graphically, turning summaries into charts.
Visualization work leans mostly on descriptive statistics.

In most data analysis, 80% of the work comes down to sums and averages.

SQL can be used for descriptive statistics: splitting, transforming, aggregating, and exploring data.

Python has similar analysis tools.

Getting stock prices: pandas-datareader

import pandas_datareader as pdd

df = pdd.DataReader("BABA", data_source="yahoo")
print(df.tail())

Simple statistics on a single column:
Sum: df["Volume"].sum()
Mean: df["Volume"].mean()
Standard deviation: df["Volume"].std()
Minimum: df["Volume"].min()
Maximum: df["Volume"].max()
Number of records: df["Volume"].count()
Overall descriptive statistics: df.describe()

import pandas_datareader as pdd

df = pdd.DataReader("BABA", data_source="yahoo")


# compute rises and falls (a rise means the close is above the open)
df["diff"] = df["Close"] - df["Open"]
df["rise"] = df["diff"] > 0
df["fall"] = df["diff"] < 0

# compute the daily return
df["ret"] = df["Close"].pct_change(1)

print(df[["rise", "fall"]].sum()) # number of rising and falling days
# print(df[df.index > "2018-08-01"])
print(df.loc[df.index > "2018-08-01", ["rise", "fall"]].sum()) # rises and falls since 2018-08-01
print(df.groupby([df.index.year, df.index.month])[["rise", "fall"]].sum()) # rises and falls per year and month
print(df.groupby([df.index.year, df.index.month])["ret"].mean()) # mean daily return per year and month
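Inferential statistics

Building data models
Inferring the overall population from a sample
Correlation, regression, factor analysis

Drawing charts

People are visual creatures: about 70% of the information we take in is said to come through the eyes, and about 30% through the other senses (taste, smell, hearing, and so on).

What information charts are for

Communicating known information (Storytelling)

Discovering the facts behind the data (Exploration)

Information visualization

Goals of visualization:
Communicate effectively
Be clear
Be complete
Encourage interaction among participants

Focus on communicating effectively.

Visualization + interaction = successful visualization

Drawing charts with pandas

The matplotlib module must be installed.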

import pandas_datareader as pdd

df = pdd.DataReader("BABA", data_source="yahoo")

# line chart
df["Close"].plot(kind="line", figsize=[10, 5], legend=True, title="BABA", grid=True)
# legend=True shows the legend
# grid=True draws grid lines

# moving average
df["mvg30"] = df["Close"].rolling(window=30).mean()
df[["Close", "mvg30"]].plot(kind="line", legend=True, figsize=[10, 5])

# bar chart of volume
df.loc[df.index >= "2017-04-01", "Volume"].plot(kind="bar", figsize=[10, 5], title="BABA", legend=True)

# pie chart
df["diff"] = df["Close"] - df["Open"]
df["rise"] = df["diff"] > 0
df["fall"] = df["diff"] < 0

df[["rise", "fall"]].sum().plot(kind="pie", figsize=[5, 5], counterclock=True, startangle=90, legend=True)

Storing data

Store data in a structured way so that users can query and maintain it quickly through Structured Query Language (SQL).

ACID principles:

Atomicity: a transaction either completes in full or not at all.

Consistency: from the start to the end of a transaction, data integrity satisfies the defined rules and constraints.

Isolation: concurrent transactions do not affect one another.

Durability: once a transaction completes, its changes to the database are kept permanently.
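Atomicity can be observed directly with Python's standard-library `sqlite3` module. This is a sketch under stated assumptions: the `account` table, its `CHECK` constraint, and the transfer amounts are invented for the demonstration.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.execute(
    "CREATE TABLE account("
    "name TEXT PRIMARY KEY, "
    "balance INT NOT NULL CHECK (balance >= 0))"
)
con.execute("INSERT INTO account VALUES ('a', 100), ('b', 0)")
con.commit()

try:
    # `with con` commits on success and rolls back if an exception escapes.
    with con:
        con.execute("UPDATE account SET balance = balance + 150 WHERE name = 'b'")
        # This would make a's balance negative, violating the CHECK constraint,
        # so the whole transfer (including the update above) is undone.
        con.execute("UPDATE account SET balance = balance - 150 WHERE name = 'a'")
except sqlite3.IntegrityError:
    pass

balances = dict(con.execute("SELECT name, balance FROM account"))
print(balances)
con.close()
```

After the failed transfer both balances are unchanged, which is the "all or nothing" behavior atomicity describes.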

sqlite3

A package (module) in the Python standard library.

import sqlite3 as lite

con = lite.connect("test.sqlite") # open the connection
cur = con.cursor() # create a cursor
cur.execute("SELECT SQLITE_VERSION()") # execute a statement
data = cur.fetchone() # fetch one row

print(data)

con.close()

Insert and query:

import sqlite3 as lite

with lite.connect("test.sqlite") as con:
    cur = con.cursor()
    cur.execute("DROP TABLE IF EXISTS PhoneAddress")
    cur.execute("CREATE TABLE PhoneAddress(phone CHAR(10) PRIMARY KEY, address TEXT, name TEXT unique, age INT NOT NULL)")
    cur.execute("INSERT INTO PhoneAddress VALUES('245345345', 'United', 'Jsan', 50)")
    cur.execute("SELECT phone,address FROM PhoneAddress")
    data = cur.fetchall()

    for rec in data:
        print(rec[0], rec[1])

fetchone and fetchall read rows starting from wherever the cursor currently stands.
All of these operations are built on the cursor.
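The cursor-position behavior can be checked with a short sketch against an in-memory database (the table `t` and its values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t(n INT)")
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

cur.execute("SELECT n FROM t ORDER BY n")
first = cur.fetchone()      # advances the cursor past the first row
rest = cur.fetchall()       # returns only the rows not yet consumed
after_end = cur.fetchone()  # nothing left, so this returns None

print(first, rest, after_end)
con.close()
```

Because each fetch consumes rows, calling `fetchall` after `fetchone` does not return the first row again.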

Storing data with pandas

Build a DataFrame, then write it to the database:

import sqlite3 as lite
import pandas

employee = [{
    "name": "Mary",
    "age": 24,
    "gender": "F"
}]
df = pandas.DataFrame(employee)

with lite.connect("test.sqlite") as db:
    cur = db.cursor()
    df.to_sql(name="employee", index=False, con=db, if_exists="replace")
    d = pandas.read_sql("SELECT * FROM employee", con=db) # any SQL statement (aggregates, ORDER BY, and so on) can be used; pandas converts the result into a DataFrame
    print(d)

    cur.execute("SELECT * FROM employee")
    data = cur.fetchone()
    print(data)

Fetching the RMB central parity rate from the State Administration of Foreign Exchange (SAFE)

Fetch and process the data:

import sqlite3 as lite
from datetime import datetime, timedelta

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://www.safe.gov.cn/AppStructured/hlw/RMBQuery.do"
ua = "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.63.6726.400 QQBrowser/10.2.2265.400"

headers = {
    "User-Agent": ua,
    "Content-Type": "text/html;charset=UTF-8"
}


def getCurrency(start, end):
    payload = {
        "startDate": start,
        "endDate": end,
        "queryYN": True
    }
    html = requests.post(url, data=payload, headers=headers).text

    soup = BeautifulSoup(html, "html.parser")

    dfs = pd.read_html(str(soup.select("#InfoTable")[0]), header=0)[0]  # read the table into a DataFrame
    # soup.select("#InfoTable")[0].prettify("UTF-8") produced garbled Chinese text in testing

    dfs = pd.melt(dfs, col_level=0, id_vars="日期")  # 日期 = date
    dfs.columns = ["date", "currency", "exchange"]

    with lite.connect("currency.sqlite") as db:
        dfs.to_sql(name="currency", index=None, con=db, if_exists="append")

        cur = db.cursor()
        cur.execute("SELECT * FROM currency")
        data = cur.fetchall()
        print(len(data))


if __name__ == "__main__":
    current_time = datetime.now()

    for i in range(1, 300, 30):
        start_date = (current_time - timedelta(days=i+30)).strftime("%Y-%m-%d")
        end_date = (current_time - timedelta(days=i+1)).strftime("%Y-%m-%d")
        print(start_date, end_date)
        getCurrency(start_date, end_date)
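The date arithmetic in the loop above can be checked in isolation. The sketch below wraps it in a hypothetical helper, `date_windows` (a name introduced here, not part of the original), and uses an arbitrary fixed date so the output is reproducible.

```python
from datetime import date, timedelta


def date_windows(today, total_days=300, step=30):
    """Yield (start, end) date strings for consecutive windows going back in time."""
    for i in range(1, total_days, step):
        start = today - timedelta(days=i + step)
        end = today - timedelta(days=i + 1)
        yield start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")


# A fixed date (arbitrary choice) instead of datetime.now().
windows = list(date_windows(date(2018, 10, 31)))
print(windows[0], windows[1], len(windows))
```

Each window covers 30 days and ends the day before the previous window starts, so ten requests cover 300 days without overlap.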

Plotting the data:

import sqlite3 as lite
import pandas as pd

with lite.connect("currency.sqlite") as db:
    df = pd.read_sql("SELECT * FROM currency", con=db)

    df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

    df.index = df.date
    print(df.head())
    df.plot(kind="line", rot=30, color="blue")
