亚洲中字慕日产2020,大陆极品少妇内射AAAAAA,无码av大香线蕉伊人久久,久久精品国产亚洲av麻豆网站


Machine Learning: Random Forest Study Notes



Preface

Random forest is a very powerful model: a group of decision trees whose votes are combined to produce the final result. To understand random forests properly, you first need to understand decision trees, and then understand how aggregating many trees improves on a single one.

The purpose of this article is to gather in one place the material I found useful while learning this model.

Decision Tree Basics

Key points of decision trees

ID3: information gain
C4.5: information gain ratio
CART: Gini index
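
For reference, the three splitting criteria can be written out explicitly (standard textbook formulas, added here for convenience, not part of the original notes). For a node holding dataset $D$, with class proportions $p_k$ and a candidate feature $A$ whose values split $D$ into subsets $D_v$:

$$H(D) = -\sum_k p_k \log_2 p_k$$

$$\text{Information gain (ID3):}\quad g(D,A) = H(D) - \sum_v \frac{|D_v|}{|D|}\,H(D_v)$$

$$\text{Gain ratio (C4.5):}\quad g_R(D,A) = \frac{g(D,A)}{H_A(D)},\qquad H_A(D) = -\sum_v \frac{|D_v|}{|D|}\log_2\frac{|D_v|}{|D|}$$

$$\text{Gini index (CART):}\quad \mathrm{Gini}(D) = 1 - \sum_k p_k^2$$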

Pros and Cons of Decision Trees

From the book Programming Collective Intelligence:

Advantages:

The biggest advantage is that the model is easy to interpret.

It accepts both categorical and numerical data, with no need for preprocessing or normalization.

It allows uncertain outcomes: when a leaf node contains several possible result values and cannot be split any further, you can count the occurrences of each value and estimate a probability (see the short sketch after this list).
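
As a small illustration of that last point (a made-up example, not from the original notes), a fitted sklearn decision tree exposes exactly these leaf counts as class probabilities:

from sklearn.tree import DecisionTreeClassifier

# Identical inputs with conflicting labels cannot be separated further,
# so they all end up in the same leaf
X = [[0], [0], [0], [1]]
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict_proba([[0]]))   # about [[0.67, 0.33]]: the leaf's counts turned into a probability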

Disadvantages:

The algorithm works well for problems with only a handful of possible outcomes; on datasets with a large number of possible outcomes, the tree becomes extremely complex and prediction quality can degrade considerably.

Although it can handle simple numerical data, it can only create nodes that test "greater than / less than" conditions. When the classification depends on a more complex combination of variables, classifying with a decision tree becomes difficult. For example, if the outcome is determined by the difference of two variables, the tree grows extremely large and its predictive accuracy drops off quickly.

In short, decision trees are best suited to datasets that have clear split points and are composed of a mix of categorical and numerical data.

We can test the book's claim that, when the outcome is determined by the difference of two variables, the tree grows huge and its accuracy deteriorates, using the following experiment:

library(rpart)
library(rpart.plot)
library(dplyr)   # needed for %>% and mutate() below

# Two independent integer ages, uniformly distributed between 18 and 29
age1 <- as.integer(runif(1000, min=18, max=30))
age2 <- as.integer(runif(1000, min=18, max=30))

df <- data.frame(age1, age2)

# The label depends only on the difference between the two ages
df <- df %>% dplyr::mutate(diff = age1 - age2, label = diff >= 0 & diff <= 5)

ct <- rpart.control(xval=10, minsplit=20, cp=0.01)

# Tree 1: predict the label from the raw variables age1 and age2
cfit <- rpart(label ~ age1 + age2,
              data=df, method="class", control=ct,
              parms=list(split="gini"))
print(cfit)


rpart.plot(cfit, branch=1, branch.type=2, type=1, extra=102,
           shadow.col="gray", box.col="green",
           border.col="blue", split.col="red",
           split.cex=1.2, main="Decision Tree")


# Tree 2: predict the label directly from the engineered feature diff
cfit <- rpart(label ~ diff,
              data=df, method="class", control=ct,
              parms=list(split="gini"))
print(cfit)

rpart.plot(cfit, branch=1, branch.type=2, type=1, extra=102,
           shadow.col="gray", box.col="green",
           border.col="blue", split.col="red",
           split.cex=1.2, main="Decision Tree")

Predicting from age1 and age2 gives the following decision tree (screenshot):

Predicting from diff gives the following decision tree (screenshot). As the book suggests, the tree built on diff needs only a couple of splits, since the label is just an interval of diff, while the tree built on age1 and age2 has to approximate that diagonal band with many axis-aligned splits and grows much larger.

Random Forest Theory

From the sklearn official documentation:

Each tree in the ensemble is built from a sample drawn with replacement (bootstrap sample) from the training set. When splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.

As a result of this randomness, the bias of the forest usually slightly increases with respect to the bias of a single non-random tree, but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

In contrast to the original publication, the sklearn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
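
To make the difference concrete, here is a minimal sketch on synthetic data (not from the original post) showing that the forest's predict_proba is the average of its trees' probability estimates, whereas Breiman's original formulation has each tree cast a single vote:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
x = X[:1]

# Soft voting (what sklearn does): average the trees' class probabilities
avg_proba = np.mean([tree.predict_proba(x) for tree in clf.estimators_], axis=0)
print(avg_proba, clf.predict_proba(x))   # the two should agree

# Hard voting (the original formulation): each tree casts one vote
votes = np.array([tree.predict(x)[0] for tree in clf.estimators_], dtype=int)
print(np.bincount(votes).argmax())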

Random Forest Implementation
from sklearn.ensemble import RandomForestClassifier

# A minimal toy dataset: two samples with two features each
X = [[0, 0], [1, 1]]
Y = [0, 1]

clf = RandomForestClassifier(n_estimators=10)   # n_estimators = number of trees in the forest
clf = clf.fit(X, Y)
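
Once fitted, the forest is used like any other sklearn estimator, for example:

print(clf.predict([[0.8, 0.8]]))        # predicted class for a new sample
print(clf.predict_proba([[0.8, 0.8]]))  # class probabilities, averaged over the trees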
Parameter Tuning

From the sklearn official site:

The core parameters are n_estimators and max_features:

n_estimators: the number of trees in the forest

max_features: the size of the random subsets of features to consider when splitting a node. Default values: max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks.

Other parameters: Good results are often achieved when setting max_depth=None in combination with min_samples_split=1. (Note that recent scikit-learn versions no longer accept min_samples_split=1; the smallest allowed integer value is 2.)

n_jobs=k: computations are partitioned into k jobs and run on k cores of the machine. If n_jobs=-1, all cores available on the machine are used.
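
As a sketch of how these two parameters might be tuned in practice (the dataset and parameter grid below are made up for illustration), GridSearchCV can search over them while n_jobs=-1 parallelizes the work:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic classification problem, purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", 0.5, None],   # None means "consider all features"
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    n_jobs=-1,   # use all available cores
)
search.fit(X, y)
print(search.best_params_, search.best_score_)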

Feature Importance Evaluation

From the sklearn official documentation:

The depth of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.

By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.

In practice those estimates are stored as an attribute named feature_importances_ on the fitted model. This is an array with shape (n_features,) whose values are positive and sum to 1.0. The higher the value, the more important is the contribution of the matching feature to the prediction function.
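
A quick sketch of reading those importances off a fitted forest (using the iris dataset purely as an example):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

# One score per feature, non-negative and summing to 1.0
for name, score in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {score:.3f}")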

From StackOverflow:

You initialize an array feature_importances of all zeros with size n_features.

You traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].

The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy). It's the impurity of the set of observations that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split.
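
That recipe can be reproduced from the attributes of a fitted tree. The following is a sketch for a single DecisionTreeClassifier (attribute names are those of sklearn's tree_ object; the forest then averages these per-tree scores):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
tree = clf.tree_

importances = np.zeros(iris.data.shape[1])
for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:   # leaf node: no split, no contribution
        continue
    # Weighted impurity decrease of this split (Gini impurity by default)
    decrease = (
        tree.weighted_n_node_samples[node] * tree.impurity[node]
        - tree.weighted_n_node_samples[left] * tree.impurity[left]
        - tree.weighted_n_node_samples[right] * tree.impurity[right]
    )
    importances[tree.feature[node]] += decrease

importances /= importances.sum()   # normalize so the scores sum to 1.0
print(importances)
print(clf.feature_importances_)    # sklearn's own computation, for comparison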

About the author: 丹追兵 is a data analyst whose programming languages are Python and R, and who works with Spark, Hadoop, Storm, and ODPS. This article comes from 丹追兵's pytrafficR column; please credit the author and source when reposting: https://segmentfault.com/blog...
