javascript實現(xiàn)樸素貝葉斯分類與決策樹ID3分類

ernest.wang 發(fā)布于2019-08-20 18:21 / 1764人閱讀

摘要：根據(jù)這個訓(xùn)練集，運用樸素貝葉斯分類和決策樹分類則可以得到一個數(shù)據(jù)模型，然后通過輸入一條測試數(shù)據(jù)來判斷是否回去打網(wǎng)球。一樸素貝葉斯分類大學(xué)概率論的貝葉斯定理實現(xiàn)了通過計算概率求出假設(shè)推理的結(jié)論。

今年畢業(yè)時的畢設(shè)是有關(guān)大數(shù)據(jù)及機器學(xué)習(xí)的題目。因為那個時間已經(jīng)步入前端的行業(yè)自然選擇使用JavaScript來實現(xiàn)其中具體的算法。雖然JavaScript不是做大數(shù)據(jù)處理的最佳語言，相比還沒有優(yōu)勢，但是這提升了自己對與js的理解以及彌補了一點點關(guān)于數(shù)據(jù)結(jié)構(gòu)的弱點。對機器學(xué)習(xí)感興趣的朋友還是去用 python，最終還是在學(xué)校的死板論文格式要求之外，記錄一下實現(xiàn)的過程和我自己對于算法的理解。
源碼在github：https://github.com/abzerolee/...
開始學(xué)習(xí)機器學(xué)習(xí)算法是通過 Tom M. Mitchel. Machine Learning[M] 1994 一書。喜歡研究機器學(xué)習(xí)的朋友入門可以看看這本。接下來敘述的也僅僅是個人對于算法的淺薄理解與實現(xiàn)，只是針對沒有接觸過機器學(xué)習(xí)的朋友看個樂呵，自己總結(jié)記憶一下。當然能引起大家對機器學(xué)習(xí)算法的研究熱情是最好不過的了。

算法原理

實現(xiàn)過程其實是對訓(xùn)練集合（已知分類）的數(shù)據(jù)進行分析解析得到一個分類模型，通過輸入一條測試數(shù)據(jù)（未知分類），分類模型可以推斷出該條數(shù)據(jù)的分類結(jié)果。訓(xùn)練數(shù)據(jù)如下圖所示

這個數(shù)據(jù)集合意思為天氣狀況決定是否要最終去打網(wǎng)球一個數(shù)組代表一條天氣情況與對應(yīng)結(jié)果。前四列代表數(shù)據(jù)的特征屬性（天氣，溫度，濕度，是否刮風(fēng)），最后一列代表分類結(jié)果。根據(jù)這個訓(xùn)練集，運用樸素貝葉斯分類和決策樹ID3分類則可以得到一個數(shù)據(jù)模型，然后通過輸入一條測試數(shù)據(jù)：“sunny cool high TRUE” 來判斷是否回去打網(wǎng)球。相似的只要特征屬性保持一定且有對應(yīng)的分類結(jié)果，不論訓(xùn)練集為什么樣的數(shù)據(jù)，都可以通過特征屬性得到分類結(jié)果。所謂分類模型，就是通過一些概率論，統(tǒng)計學(xué)的理論基礎(chǔ)，用編程語言實現(xiàn)。下面簡單介紹一下兩種算法原理。

一.樸素貝葉斯分類

大學(xué)概率論的貝葉斯定理實現(xiàn)了通過計算概率求出假設(shè)推理的結(jié)論。貝葉斯定理如下圖所示：

E代表訓(xùn)練集合，r表示一個分類結(jié)果（即yes或no），P(E)是一個獨立于分類結(jié)果r的常量，可以發(fā)現(xiàn)P(E)越大，P(r|E)受到訓(xùn)練集影響越小。
即可以得到為 P(r) => P(yes)=9/14，或者P(no)=5/14,
再求的條件概率 P(E|r) => P(wind=TRUE|yes)=3/9 P(wind=FALSE|no)=2/5
這樣可以得到每個特征屬性在分類結(jié)果情況下的條件概率。當輸入一條測試數(shù)據(jù)時，通過計算這條數(shù)據(jù)特質(zhì)屬性值在某種分類假設(shè)的錢以下的條件概率，就可以得到對應(yīng)的分類假設(shè)的概率，然后比較出最大值，稱為極大似然假設(shè)，對應(yīng)的分類結(jié)果就是測試數(shù)據(jù)的分類結(jié)果。
比如測試數(shù)據(jù)如上：sunny,cool,high,TRUE則對應(yīng)的計算為：
P(yes)P(sunny|yes)P(high|yes)P(cool|yes)P(TRUE|yes) = P(yes|E)
P(no)P(sunny|no)P(high|no)P(cool|no)P(TRUE|no) = P(no|E)
推斷出 no 。
這里推薦介紹貝葉斯文本分類的博客http://www.cnblogs.com/phinec...

二.決策樹ID3分類法

決策樹分類法更像是我們思考的過程：

測試數(shù)據(jù)和上文相同，在天氣節(jié)點判斷則進入sunny分支溫度節(jié)點判斷進入high 分支則直接得出no的結(jié)果。
決策樹在根據(jù)測試數(shù)據(jù)分類時淺顯易懂，關(guān)鍵點在通過訓(xùn)練數(shù)據(jù)構(gòu)建決策樹，那相應(yīng)的出現(xiàn)兩個問題：
1.選擇哪個特征屬性作為根節(jié)點判斷？
2.特征屬性值對應(yīng)的分支上的下一個屬性節(jié)點如何來判斷？
這兩個問題可以總結(jié)為如何判斷最優(yōu)測試屬性？在信息論中，期望信息越小，那么信息增益就越大，從而純度就越高。其實就是特征屬性能夠為最終的分類結(jié)果帶來多少信息，帶來的信息越多，該特征屬性越重要。對一個屬性而言，分類時它是否存在會導(dǎo)致分類信息量發(fā)生變化，而前后信息量的差值就是這個特征屬性給分類帶來的信息量。而信息量就是信息熵。信息熵表示每個離散的消息提供的平均信息量。
如上文中的例子：可以表示為

當選取了某個特征屬性attr后的信息熵可以表示為

對應(yīng)該屬性的信息增益可以表示為

選擇最適合樹節(jié)點的特征屬性，就是信息增益最大的屬性。應(yīng)該可以得到Gain(天氣)=0.246
接下來是對該屬性值分支的節(jié)點選取的判斷，從訓(xùn)練集中找出滿足該屬性值的子集再次進行對于子集的每個屬性的信息增益，比較。重復(fù)上述步驟，直到子集為空返回最普遍的分類結(jié)果。

上圖為《Machine Learning》一書中對于ID3算法的介紹，下圖為程序流程圖

三.分類模型評估
分類模型的評估指標通過混淆矩陣來進行計算

P為樣本數(shù)據(jù)中yes的數(shù)量，N為樣本數(shù)據(jù)中no的數(shù)量，TP為正確預(yù)測yes的數(shù)量，F(xiàn)P為把yes預(yù)測為no的數(shù)量，F(xiàn)N為把yes預(yù)測為no的數(shù)量，TN為正確預(yù)測yes的數(shù)目。評估量度為
1.命中率：正確診斷確實患病的的概率 TP/P
2.虛警率：沒有患病卻診斷為患病概率。FP/N
分類模型的評估方法為交叉驗證法與.632的平均抽樣法，比如100條原始數(shù)據(jù)，對訓(xùn)練集有放回的隨機抽樣100次，并在每次抽樣時標注抽取的次數(shù) 將大于63.2的數(shù)據(jù)作為訓(xùn)練集，小于的數(shù)據(jù)作為測試集，但是實際程序?qū)崿F(xiàn)中可以樣本偏離的太厲害我選擇了44次作為標準。
這樣將測試集的每一條數(shù)據(jù)輸入，通過訓(xùn)練集得到的分類模型，得出測試數(shù)據(jù)的分類結(jié)果與真實分類進行比較。就可以得到混淆矩陣，最后根據(jù)混淆矩陣可以得到?jīng)Q策樹與貝葉斯分類的命中率與虛警率。重復(fù)評估40次則可以得到[命中率，虛警率]，以命中率為縱坐標，虛警率為橫坐標描點可以得到ROC曲線，描出的點越靠近左上角代表分類模型越正確，直觀的表現(xiàn)出來兩種分類模型差異。我得到的描點圖如下所示

從圖中明顯可以發(fā)現(xiàn)對于小樣本的數(shù)據(jù)，決策樹分類模型更為準確。

核心代碼

樸素貝葉斯分類法

const HashMap = require("./HashMap");

function Bayes($data){
  this._DATA = $data;
}
Bayes.prototype = {
  /**
   * 將訓(xùn)練數(shù)據(jù)單條數(shù)據(jù)按類別分類
   * @return HashMap<類別，對用類別的訓(xùn)練數(shù)據(jù)>
   */
  dataOfClass: function() {
    var map = new HashMap();
    var t = [], c = "";
    var datas = this._DATA;
    if(!(datas instanceof Array)) return;
    for(var i = 0; i < datas.length; i++){
      t = datas[i];
      c = t[t.length - 1];
      if(map.hasKey(c)){
        var ot = map.get(c);
        ot.push(t);
        map.put(c, ot);
      }else{
        var nt = [];
        nt.push(t);
        map.put(c, nt);
      }
    }
    return map;
  },
  /**
   * 預(yù)測測試數(shù)據(jù)的類別
   * @param Array testT 測試數(shù)據(jù)
   * @return String 測試數(shù)據(jù)對應(yīng)類別
   */
  predictClass: function(testT){
    var doc = this.dataOfClass();
    var maxP = 0, maxPIndex = -1;
    var classes = doc.keys();
    for(var i = 0; i < classes.length; i++){
      var c = classes[i]
      var d = doc.get(c);
      var pOfC = d.length / this._DATA.length;
      for(var j = 0; j < testT.length; j++){
        var pv = this.pOfV(d, testT[j], j);
        pOfC = pOfC * pv;
      }
      if(pOfC > maxP){
        maxP = pOfC;
        maxPIndex = i;
      }
    }
    if(maxPIndex === -1 || maxPIndex > doc.length){
      return "無法分類";
    }
    return classes[maxPIndex];
  },
  /**
   * 計算指定屬性在訓(xùn)練數(shù)據(jù)中指定值出現(xiàn)的條件概率
   * @param d     屬于某一類的訓(xùn)練元組
   * @param value 指定屬性
   * @param index 指定屬性所在列
   * @return 特征屬性在某類別下的條件概率
   */
  pOfV: function(d, value, index){
    var p = 0, count = 0, total = d.length, t = [];
    for(var i = 0; i < total; i++){
      if(d[i][index] === value)
        count++;
    }
    p = count / total;
    return p;
  } 
}

module.exports = Bayes;

2.決策樹ID3分類法

const HashMap = require("./HashMap");
const $data = require("./data");
const TreeNode = require("./TreeNode");
const InfoGain = require("./InfoGain");

function Iterator(arr){
  if(!(arr instanceof Array)){
    throw new Error("iterator needs a arguments that type is Array!");
  }
  this.arr = arr;
  this.length = arr.length;
  this.index = 0;
}
Iterator.prototype.current = function() {
  return this.arr[this.index-1];
}
Iterator.prototype.next = function(){
  this.index += 1;
  if(this.index > this.length || this.arr[this.index-1] === null)
    return false;
  return true;
}

function DecisionTree(data, attribute) {
  if(!(data instanceof Array) || !(attribute instanceof Array)){
    throw new Error("argument needs Array!");
  }
  this._data = data;
  this._attr = attribute;
  this._node = this.createDT(this._data,this._attr);
}
DecisionTree.prototype.createDT = function(data, attrList) {
  var node = new TreeNode();
  var resultMap = this.isPure(this.getTarget(data));
  
  if(resultMap.size() === 1){
    node.setType("result");
    node.setName(resultMap.keys()[0]);
    node.setVals(resultMap.keys()[0]);
    // console.log("單節(jié)點樹：" + node.getVals());
    return node;
  }
  if(attrList.length === 0){
    var max = this.getMaxVal(resultMap);
    node.setType("result");
    node.setName(max)
    node.setVals(max);
    // console.log("最普遍性結(jié)果："+ max);
    return node;
  }

  var maxGain = this.getMaxGain(data, attrList).maxGain;
  var attrIndex = this.getMaxGain(data, attrList).attrIndex
  // console.log("選出的最大增益率屬性為："+ attrList[attrIndex]);
  // console.log("創(chuàng)建節(jié)點："+attrList[attrIndex])
  node.setName(attrList[attrIndex]);
  node.setType("attribute");

  var remainAttr = new Array();
  remainAttr = attrList;
  // remainAttr.splice(attrIndex, 1);

  var self = this;
  var gain = new InfoGain(data, attrList)
  var attrValueMap = gain.getAttrValue(attrIndex); //最好分類的屬性的值MAP
  var possibleValues = attrValueMap.keys();
  
  node_vals = possibleValues.map(function(v) {
    // console.log("創(chuàng)建分支："+v);
    var newData = data.filter(function(x) {
      return x[attrIndex] === v;
    });
    // newData = newData.map(function(v) {
    //   return v.slice(1);
    // })
    var child_node = new TreeNode(v, "feature_values");
    var leafNode = self.createDT(newData, remainAttr);
    child_node.setVals(leafNode);
    return child_node;
  })
  node.setVals(node_vals);

  this._node = node;
  return node;
}
/**
 * 判斷訓(xùn)練數(shù)據(jù)純度分類是否為一種分類或沒有分類
 */
DecisionTree.prototype.getTarget = function(data){
  var list = new Array();
  var iter = new Iterator(data);
  while(iter.next()){
    var index = iter.current().length - 1;
    var value = iter.current()[index];
    list.push(value);
  }
  return list;
},
/**
 * 獲取分類結(jié)果數(shù)組，判斷純度
 */
DecisionTree.prototype.isPure = function(list) {
  var map = new HashMap(), count = 1;
  list.forEach(function(item) {
    if(map.get(item)){
      count++;
    }
    map.put(item, count);
  });
  return map;
}
/**
 * 獲取最大增益量屬性
 */
DecisionTree.prototype.getMaxGain = function(data, attrList) {
  var gain = new InfoGain(data, attrList);
  var maxGain = 0;
  var attrIndex = -1;
  for(var i = 0; i < attrList.length; i++){
    var temp = gain.getGainRaito(i);
    if(maxGain < temp){
      maxGain = temp;
      attrIndex = i;
    }
  }
  return {attrIndex: attrIndex, maxGain: maxGain};
}
/**
 * 獲取resultMap中值最大的key
 */
DecisionTree.prototype.getMaxVal = function(map){
  var obj = map.obj, temp = 0, okey = "";
  for(var key in obj){
    if(temp < obj[key] && typeof obj[key] === "number"){
      temp = obj[key];
      okey = key;
    };
  }
  return okey;
}
/**
 * 預(yù)測屬性
 */
DecisionTree.prototype.predictClass = function(sample){
  var root = this._node;
  var map = new HashMap();
  var attrList = this._attr;
  for(var i = 0; i < attrList.length; i++){
    map.put(attrList[i], sample[i]);
  }

  while(root.type !== "result"){
    if(root.name === undefined){
      return root = "無法分類";
    }
    var attr = root.name;
    var sample = map.get(attr);
    var childNode = root.vals.filter(function(node) {
      return node.name === sample;
    });
    if(childNode.length === 0){
      return root = "無法分類";
    }
    root = childNode[0].vals; // 只遍歷attribute節(jié)點
  }
  return root.vals;
}

module.exports = DecisionTree;

3.增益率計算

function InfoGain(data, attr) {
  if(!(data instanceof Array) || !(attr instanceof Array)){
    throw new Error("arguments needs Array!");
  }
  this._data = data;
  this._attr = attr;
}
InfoGain.prototype = {
  /**
   * 獲取訓(xùn)練數(shù)據(jù)分類個數(shù)
   * @return hashMap<類別, 該類別數(shù)量>
   */
  getTargetValue: function() {
    var map = new HashMap();
    var iter = new Iterator(this._data);
    while(iter.next()){
      var t = iter.current();
      var key = t[t.length-1];
      var value = map.get(key);
      map.put(key, value !== undefined ? ++value : 1);
    }
    return map;
  },
  /**
   * 獲取訓(xùn)練數(shù)據(jù)信息熵
   * @return 訓(xùn)練數(shù)據(jù)信息熵
   */
  getEntroy: function(){
    var targetValueMap = this.getTargetValue();
    var targetKey = targetValueMap.keys(), entroy = 0;
    var self = this;
    var iter = new Iterator(targetKey);
    while(iter.next()){
      var p = targetValueMap.get(iter.current()) / self._data.length;
      entroy += (-1) * p * (Math.log(p) / Math.LN2);
    }
    return entroy;
  },
  /**
   * 獲取屬性值在訓(xùn)練數(shù)據(jù)集中的數(shù)量
   * @param number index 屬性名數(shù)組索引
   */
  getAttrValue: function(index){
    var map = new HashMap();
    var iter = new Iterator(this._data);
    while(iter.next()){
      var t = iter.current();
      var key = t[index];
      var value = map.get(key);
      map.put(key, value !== undefined ? ++value : 1);
    }
    return map;
  },
  /**
   * 得到屬性值在決策空間的比例
   * @param string name 屬性值
   * @param number index 屬性所在第幾列
   */
  getAttrValueTargetValue: function(name, index){
    var map = new HashMap();
    var iter = new Iterator(this._data);
    while(iter.next()){
      var t = iter.current();
      if(name === t[index]){
        var size = t.length;
        var key = t[t.length-1];
        var value = map.get(key);
        map.put(key, value !== undefined ? ++value : 1);
      }
    }
    return map;
  },
  /**
   * 獲取特征屬性作用于訓(xùn)練數(shù)據(jù)集后分類出的數(shù)據(jù)集的熵
   * @param number index 屬性名數(shù)組索引
   */
  getInfoAttr: function(index){
    var attrValueMap = this.getAttrValue(index);
    var infoA = 0;
    var c = attrValueMap.keys();
    for(var i = 0; i < attrValueMap.size(); i++){
      var size = this._data.length;
      var attrP = attrValueMap.get(c[i]) / size;
      var targetValueMap = this.getAttrValueTargetValue(c[i], index);
      var totalCount = 0 ,valueSum = 0;
      for(var j = 0; j < targetValueMap.size(); j++){
        totalCount += targetValueMap.get(targetValueMap.keys()[j]);
      }
      for(var k = 0; k < targetValueMap.size(); k++){
        var p = targetValueMap.get(targetValueMap.keys()[k]) / totalCount;
        valueSum += (Math.log(p) / Math.LN2) * p;
      }
      infoA += (-1) * attrP * valueSum;
    }
    return infoA;
  },
  /**
   * 獲得信息增益量
   */
  getGain: function(index) {
    return this.getEntroy() - this.getInfoAttr(index);
  },
  getSplitInfo: function(index){
    var map = this.getAttrValue(index);
    var splitA = 0;
    for(var i = 0; i < map.size(); i++){
      var size = this._data.length;
      var attrP = map.get(map.keys()[i]) / size;
      splitA += (-1) * attrP * (Math.log(attrP) / Math.LN2);
    }
    return splitA;
  },
  /**
   * 獲得增益率
   */
  getGainRaito: function(index){
    return this.getGain(index) / this.getSplitInfo(index);
  },
  getData4Value: function(attrValue, attrIndex){
    var resultData = new Array();
    var iter = new Iterator(this._data);
    while(iter.next()){
      var temp = iter.current();
      if(temp[attrIndex] === attrValue){
        resultData.push(temp);
      }
    }
    return resultData;
  }
}

具體的程序?qū)崿F(xiàn)我會再繼續(xù)介紹的，待續(xù)。。。。
第一次在segmentfault發(fā)文章有點緊張各位有什么意見或者想法可以及時指正我。

GPU云服務(wù)器云服務(wù)器貝葉斯分類樸素貝葉斯樸素貝葉斯算法 php高斯樸素貝葉斯

文章版權(quán)歸作者所有，未經(jīng)允許請勿轉(zhuǎn)載,若此文章存在違規(guī)行為，您可以聯(lián)系管理員刪除。

轉(zhuǎn)載請注明本文地址：http://www.ezyhdfw.cn/yun/84402.html

發(fā)表評論

登陸后可評論

0條評論

ernest.wang

男|高級講師

我要關(guān)注我要私信

TA的文章

DeepSeek R1系列模型一鍵部署教程，支持云端免費體驗！

閱讀 1583·2025-02-07 13:29
特朗普勝任，將對跨境電商賣家產(chǎn)生哪些影響？

閱讀 1008·2024-11-07 18:25
幻獸帕魯常見問題匯總

閱讀 131502·2024-02-01 10:43
在Linux和Windows上部署帕魯?shù)膮^(qū)別？

閱讀 1375·2024-01-31 14:58
怎么重啟幻獸帕魯云服務(wù)器？

閱讀 1149·2024-01-31 14:54
幻獸帕魯自建服務(wù)器存檔備份及回檔 - windows篇

閱讀 83576·2024-01-29 17:11
小白必讀教程，幻獸帕魯服務(wù)器手動部署教程 - Windows篇

閱讀 3877·2024-01-25 14:55
A800GPU租用-A800顯卡

閱讀 2395·2023-06-02 13:36

亚洲中字慕日产2020,大陆极品少妇内射AAAAAA,无码av大香线蕉伊人久久,久久精品国产亚洲av麻豆网站

資訊專欄INFORMATION COLUMN

上云采購季！| 2核2G4M爆款云服務(wù)器低至59元/年，更有多臺、長期優(yōu)惠，快來選購！

javascript實現(xiàn)樸素貝葉斯分類與決策樹ID3分類

相關(guān)文章

機器學(xué)習(xí)算法基礎(chǔ)（使用Python代碼）

基于概率論的分類方法：樸素貝葉斯

發(fā)表評論

0條評論

ernest.wang

男|高級講師

TA的文章

DeepSeek R1系列模型一鍵部署教程，支持云端免費體驗！

特朗普勝任，將對跨境電商賣家產(chǎn)生哪些影響？

幻獸帕魯常見問題匯總

在Linux和Windows上部署帕魯?shù)膮^(qū)別？

怎么重啟幻獸帕魯云服務(wù)器？

幻獸帕魯自建服務(wù)器存檔備份及回檔 - windows篇

小白必讀教程，幻獸帕魯服務(wù)器手動部署教程 - Windows篇

A800GPU租用-A800顯卡

最新活動

資訊專欄INFORMATION COLUMN

上云采購季！| 2核2G4M爆款云服務(wù)器低至59元/年，更有多臺、長期優(yōu)惠，快來選購！

javascript實現(xiàn)樸素貝葉斯分類與決策樹ID3分類

相關(guān)文章

發(fā)表評論

0條評論

男|高級講師

TA的文章

最新活動

上云采購季！| 2核2G4M爆款云服務(wù)器低至59元/年，更有多臺、長期優(yōu)惠，快來選購！