pingpang

BlogJava :: 首頁 :: 聯系 :: 聚合

:: 管理

21 Posts :: 0 Stories :: 3 Comments :: 0 Trackbacks

數據挖掘中決策樹算法實現——Bash

一、決策樹簡介：

關于決策樹，幾乎是數據挖掘分類算法中最先介紹到的。

決策樹，顧名思義就是用來做決定的樹，一個分支就是一個決策過程。

每個決策過程中涉及一個數據的屬性，而且只涉及一個。然后遞歸地，貪心地直到滿足決策條件(即可以得到明確的決策結果)。

決策樹的實現首先要有一些先驗（已經知道結果的歷史）數據做訓練，通過分析訓練數據得到每個屬性對結果的影響的大小，這里我們通過一種叫做信息增益的理論去描述它，期間也涉及到熵的概念。也可參考文章信息增益與熵.

下面我們結合實例說一下決策樹實現過程中的上述關鍵概念：

假設我們有如下數據：

age	job	house	credit	class
1	0	0	1	0
1	0	0	2	0
1	1	0	2	1
1	1	1	1	1
1	0	0	1	0
2	0	0	1	0
2	0	0	2	0
2	1	1	2	1
2	0	1	3	1
2	0	1	3	1
3	0	1	3	1
3	0	1	2	1
3	1	0	2	1
3	1	0	3	1
3	0	0	1	0

(一)

我們首先要通過計算找到哪個屬性的所有屬性值能更好地表達class字段的不同。通過計算，我們發現house的屬性值最能表現class字段的不同。這個衡量標準其實就是信息增益。計算方法是：首先計算全部數據的熵，然后除class之外的其他屬性逐個遍歷，找到熵最小的那個屬性(house)，然后將全部數據的熵減去按照house屬性劃分數據之后的數據的熵。

這個值如果滿足條件假如(>0.1)，我們認為數據應該按照這個節點進行分裂，也就是說這個屬性(house)構成了我們的一次決策過程。

(二)

然后

在按照house分裂的每個數據集上，針對其他屬性(house除外)進行與(一)相同的過程，直到信息增益不足以滿足數據分裂的條件。

這樣，我們就得到了一個關于屬性數據劃分的一棵樹。可以作為class字段未知的數據的決策依據。

二、決策樹代碼實現：

具體計算代碼如下：---假設上述數據我們保存為descision.dat文件，以及需要bash4.0及以上支持運行。

Bash代碼  
#!/home/admin/bin/bash_bin/bash_4  
  
input=$1;  
  
if [ -z $input ]; then  
    echo "please input the traning file";  
    exit 1;  
fi   
  
## pre calculate the log2 value for the later calculate operation  
declare -a log2;  
logi=0;  
records=$(cat $input | wc -l);  
for i in `awk -v n=$records 'BEGIN{for(i=1;i<n;i++) print log(i)/log(2);}'`  
do  
    ((logi+=1));  
    log2[$logi]=$i;  
done  
  
  
## function for calculating the entropy for the given distribution of the class  
function getEntropy {  
    local input=`echo $1`;  
    if [[ $input == *" "* ]]; then  
        local current_entropy=0;  
        local sum=0;  
        local i;  
        for i in $input  
        do  
            ((sum+=$i));  
            current_entropy=$(awk -v n=$i -v l=${log2[$i]} -v o=$current_entropy 'BEGIN{print n*l+o}');  
        done  
        current_entropy=$(awk -v n=$current_entropy -v b=$sum -v l=${log2[$sum]} 'BEGIN{print n/b*-1+l;}')  
        eval $2=$current_entropy;  
    else  
        eval $2=0;  
    fi  
}  
  
  
### the header title of the input data  
declare -A header_info;  
header=$(head -1 $input);  
headers=(${header//,/ })  
length=${#headers[@]};  
for((i=0;i<length;i++))  
do  
    attr=${headers[$i]};  
    header_info[$attr]=$i;  
done  
  
  
  
### the data content of the input data  
data=${input}_dat;  
sed -n '2,$p' $input > $data  
  
  
  
## use an array to store the information of a descision tree  
## the node structure is {child,slibling,parent,attr,attr_value,leaf,class}  
## the root is a virtual node with none used attribute  
## only the leaf node has class flag and the "leaf,class" is meaningfull  
## the descision_tree  
declare -a descision_tree;  
  
## the root node with no child\slibing and anythings else  
descision_tree[0]="0:0:0:N:N:0:0";  
  
  
## use recursive algrithm to build the tree   
## so we need a trace_stack to record the call level infomation  
declare -a trace_stack;  
  
## push the root node into the stack  
trace_stack[0]=0;  
stack_deep=1;  
  
## begin to build the tree until the trace_stack is empty  
while [ $stack_deep -ne 0 ]  
do  
    ((stack_deep-=1));  
    current_node_index=${trace_stack[$stack_deep]};  
    current_node=${descision_tree[$current_node_index]};  
    current_node_struct=(${current_node//:/ });  
  
    ## select the current data set   
    ## get used attr and their values  
    attrs=${current_node_struct[3]};  
    attrv=${current_node_struct[4]};  
  
    declare -a grepstra=();  
  
    if [ $attrs != "N" ];then  
        attr=(${attrs//,/ });  
        attrvs=(${attrv//,/ });  
        attrc=${#attr[@]};  
        for((i=0;i<attrc;i++))  
        do  
            a=${attr[$i]};  
            index=${header_info[$a]};  
            grepstra[$index]=${attrvs[$i]};  
        done  
    fi  
  
    for((i=0;i<length;i++))  
    do  
        if [ -z ${grepstra[$i]} ]; then  
            grepstra[$i]=".*";  
        fi  
    done  
    grepstrt=${grepstra[*]};  
    grepstr=${grepstrt// /,};  
    grep $grepstr $data > current_node_data  
  
    ## calculate the entropy before split the records  
    entropy=0;  
    input=`cat current_node_data | cut -d "," -f 5 | sort | uniq -c | sed 's/^ \+//g' | cut -d " " -f 1`  
    getEntropy "$input" entropy;  
  
    ## calculate the entropy for each of the rest attrs  
    ## and select the min one  
    min_attr_entropy=1;   
    min_attr_name="";  
    min_attr_index=0;  
    for((i=0;i<length-1;i++))  
    do  
        ## just use the rest attrs  
        if [[ "$attrs" != *"${headers[$i]}"* ]]; then  
            ## calculate the entropy for the current attr  
            ### get the different values for the headers[$i]  
            j=$((i+1));  
            cut -d "," -f $j,$length current_node_data > tmp_attr_ds  
            dist_values=`cut -d , -f 1 tmp_attr_ds | sort | uniq -c | sed 's/^ \+//g' | sed 's/ /,/g'`;  
            totle=0;  
            totle_entropy_attr=0;  
            for k in $dist_values  
            do  
                info=(${k//,/ });  
                ((totle+=${info[0]}));  
                cur_class_input=`grep "^${info[1]}," tmp_attr_ds | cut -d "," -f 2 | sort | uniq -c | sed 's/^ \+//g' | cut -d " " -f 1`  
                cur_attr_value_entropy=0;  
                getEntropy "$cur_class_input" cur_attr_value_entropy;  
                totle_entropy_attr=$(awk -v c=${info[0]} -v e=$cur_attr_value_entropy -v o=$totle_entropy_attr 'BEGIN{print c*e+o;}');  
            done  
            attr_entropy=$(awk -v e=$totle_entropy_attr -v c=$totle 'BEGIN{print e/c;}');  
            if [ $(echo "$attr_entropy < $min_attr_entropy" | bc) = 1 ]; then  
                min_attr_entropy=$attr_entropy;  
                min_attr_name="${headers[$i]}";  
                min_attr_index=$j;  
            fi  
        fi  
    done  
  
    ## calculate the gain between the original entropy of the current node   
    ## and the entropy after split by the attribute which has the min_entropy  
    gain=$(awk -v b=$entropy -v a=$min_attr_entropy 'BEGIN{print b-a;}');  
  
    ## when the gain is large than 0.1 and  then put it as a branch  
    ##      and add the child nodes to the current node and push the index to the trace_stack  
    ## otherwise make it as a leaf node and get the class flag  
    ##      and do not push trace_stack  
    if [ $(echo "$gain > 0.1" | bc)  = 1 ]; then  
        ### get the attribute values  
        attr_values_str=`cut -d , -f $min_attr_index current_node_data | sort | uniq`;  
        attr_values=($attr_values_str);  
  
        ### generate the node and add to the tree and add their index to the trace_stack  
        tree_store_length=${#descision_tree[@]};  
        current_node_struct[0]=$tree_store_length;  
        parent_node_index=$current_node_index;  
         
        attr_value_c=${#attr_values[@]};  
        for((i=0;i<attr_value_c;i++))  
        do  
            tree_store_length=${#descision_tree[@]};  
            slibling=0;  
            if [ $i -lt $((attr_value_c-1)) ]; then  
                slibling=$((tree_store_length+1));  
            fi  
  
            new_attr="";  
            new_attrvalue="";  
            if [ $attrs != "N" ]; then  
                new_attr="$attrs,$min_attr_name";  
                new_attrvalue="$attrv,${attr_values[$i]}";  
            else  
                new_attr="$min_attr_name";  
                new_attrvalue="${attr_values[$i]}";  
            fi  
            new_node="0:$slibling:$parent_node_index:$new_attr:$new_attr_value:0:0";  
            descision_tree[$tree_store_length]="$new_node";  
            trace_stack[$stack_deep]=$tree_store_length;  
            ((stack_deep+=1));  
        done  
        current_node_update=${current_node_struct[*]};  
        descision_tree[$current_node_index]=${current_node_update// /:};  
    else   ## current node is a leaf node   
        current_node_struct[5]=1;  
        current_node_struct[6]=`cut -d , -f $length current_node_data | sort | uniq -c | sort -n -r | head -1 | sed 's/^ \+[^ ]* //g'`;  
        current_node_update=${current_node_struct[*]};  
        descision_tree[$current_node_index]=${current_node_update// /:};  
    fi   
      
    ## output the descision tree after every step for split or leaf node generater  
    echo ${descision_tree[@]};  
done  

執行代碼：

Bash代碼  
./descision.sh descision.dat  

執行結果為：

Java代碼  
0:0:N:N:0:0 0:2:0:house:0:0:0 0:0:0:house:1:0:0  
0:0:N:N:0:0 0:2:0:house:0:0:0 0:0:0:house:1:1:1  
0:0:N:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:0:0 0:0:1:house,job:0,1:0:0  
0:0:N:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:0:0 0:0:1:house,job:0,1:1:1  
0:0:N:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:1:0 0:0:1:house,job:0,1:1:1  

輸出結果中展示了決策樹結構生成過程的細節，以及生成過程中樹的變化過程

本代碼中使用了一維數組結構來存儲整棵決策樹，輸出的先后順序按照數組下標輸出。

輸出結果中最后一行表示最終的決策樹。它表示的樹形結構其實是：

決策樹結果

這樣看著是不是就爽多了。

說明：

關于上述決策樹結果中其實有部分存在誤導：

默認根節點是放在數組的第一個位置的，即索引值為0，而子節點中的child與sibling為0時并不表示指向跟節點，而是表示無意義，即沒有子節點或兄弟節點。

該決策樹所代表的分類規則：

根據該決策樹輸出，我們挖掘的規則是這樣的：

首先根據house屬性判斷，當house屬性為1時，走到索引為2的節點，此時該節點是葉子節點，預測值class為1.

當house屬性為0時，接著按照job屬性來判斷，當job屬性為0時走到索引為3的節點，預測值class為0。如果job屬性為1時走到索引值為4的節點，此時預測值class為1.

posted on 2012-07-21 22:21 往事隨風閱讀(1296) 評論(0) 編輯收藏

新用戶注冊刷新評論列表


只有注冊用戶登錄后才能發表評論。




網站導航: 博客園 IT新聞 Chat2DB C++博客博問管理

pingpang

常用鏈接

留言簿(1)

我參與的團隊

隨筆檔案

搜索

最新評論

閱讀排行榜

評論排行榜