Using decision trees for classification has some advantages; for example, they do not require much domain knowledge, and both learning and classification are relatively fast.
The decision tree algorithm takes three parameters: the tuples to be classified, the attribute list of those tuples, and the attribute selection method (attribute_selection_method). Learning decision trees is mostly about learning the attribute selection methods.
The book Data Mining: Concepts and Techniques explains the tree-construction process very clearly; the book can be found in an earlier blog post: http://www.tkk7.com/vcycyv/archive/2011/09/05/357967.html
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3)     return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5)     return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute selection method(D, attribute list) to find the "best" splitting criterion;
(7) label node N with splitting criterion;
(8) if splitting attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
(9)     attribute list = attribute list - splitting attribute; // remove splitting attribute
(10) for each outcome j of splitting criterion
         // partition the tuples and grow subtrees for each partition
(11)     let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12)     if Dj is empty then
(13)         attach a leaf labeled with the majority class in D to node N;
(14)     else attach the node returned by Generate decision tree(Dj, attribute list) to node N;
     endfor
(15) return N;
Whether the resulting tree is binary mainly depends on the attribute selection method (attribute_selection_method): for example, the Gini index produces a binary tree, while information gain imposes no such restriction.
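To make the pseudocode concrete, here is a minimal Python sketch of the same procedure. It is an illustration, not the book's implementation: tuples are assumed to be (attribute-dict, label) pairs, all attributes are assumed discrete-valued with multiway splits allowed, and attribute_selection_method is any function that returns the "best" attribute (e.g. by information gain).

from collections import Counter

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    # D is a list of (attributes_dict, class_label) pairs; mirrors steps (1)-(15) above
    labels = [label for _, label in D]

    # steps (2)-(3): all tuples belong to the same class -> leaf
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}

    # steps (4)-(5): attribute list empty -> leaf labeled with the majority class
    majority = Counter(labels).most_common(1)[0][0]
    if not attribute_list:
        return {"leaf": majority}

    # steps (6)-(9): pick the splitting attribute and remove it from the list
    split_attr = attribute_selection_method(D, attribute_list)
    remaining = [a for a in attribute_list if a != split_attr]
    node = {"split_attr": split_attr, "children": {}}

    # steps (10)-(14): one branch per observed outcome of the splitting attribute
    # (in the book the outcomes are all known values of the attribute, so a
    #  partition Dj can be empty; the empty check is kept for fidelity)
    for value in {row[split_attr] for row, _ in D}:
        Dj = [(row, label) for row, label in D if row[split_attr] == value]
        if not Dj:
            node["children"][value] = {"leaf": majority}
        else:
            node["children"][value] = generate_decision_tree(Dj, remaining, attribute_selection_method)
    return node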
About the attribute selection methods:
Information gain:
The expected information needed to classify a tuple in data set D is: Info(D) = -Σ_{i=1..m} p_i log2(p_i), where p_i is the probability that an arbitrary tuple belongs to class Ci.
If attribute A is used to split D into several partitions, the information still needed to classify a tuple after the split is: Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj) (don't worry if this is not obvious; there is an example below).
Information gain is then simply: Gain(A) = Info(D) - Info_A(D).
Example:
The class label attribute, buys computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (that is, m = 2). Let class C1 correspond to yes and class C2 correspond to no. There are nine tuples of class yes and five tuples of class no. A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must compute the information gain of each attribute. We first compute the expected information needed to classify a tuple in D: Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits.
Next, we need to compute the expected information requirement for each attribute. Let's start with the attribute age. We need to look at the distribution of yes and no tuples for each category of age. For the age category youth, there are two yes tuples and three no tuples. For the category middle aged, there are four yes tuples and zero no tuples. For the category senior, there are three yes tuples and two no tuples. Hence Info_age(D) = (5/14)(-(2/5)log2(2/5) - (3/5)log2(3/5)) + (4/14)(0) + (5/14)(-(3/5)log2(3/5) - (2/5)log2(2/5)) = 0.694 bits, and Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits.
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit rating) = 0.048 bits. Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
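As a sanity check, a small Python sketch of these two formulas reproduces the numbers above. The data below just encodes the 9/5 class split and the per-age counts from the example; the helper names info and info_gain are mine, not the book's.

import math
from collections import Counter

def info(labels):
    # Info(D) = -sum(p_i * log2(p_i))
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Gain(A) = Info(D) - Info_A(D), where Info_A(D) weights each partition by |Dj|/|D|
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# 9 yes / 5 no overall; age: youth 2 yes / 3 no, middle aged 4 yes / 0 no, senior 3 yes / 2 no
rows = [{"age": "youth"}] * 5 + [{"age": "middle_aged"}] * 4 + [{"age": "senior"}] * 5
labels = ["yes", "yes", "no", "no", "no"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]
print(round(info(labels), 3))                    # 0.94 bits
print(round(info_gain(rows, labels, "age"), 3))  # 0.247 (the book's 0.940 - 0.694 = 0.246 rounds intermediates)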
Gain ratio in brief:
Information gain does not work well with multi-valued attributes. For example, if there is an attribute such as product_id, splitting on it puts every tuple in its own partition, so every partition is pure and the post-split Info(D) is 0; splitting on product_id therefore yields the largest gain, yet the split is obviously meaningless. Gain ratio adjusts information gain by using a ratio instead of a plain subtraction: it divides the gain by the split information. The book works through an example, so I won't repeat it here.
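Continuing the sketch above (it reuses info_gain and the math/Counter imports, and split_info and gain_ratio are my names for illustration), the adjustment looks roughly like this:

def split_info(rows, attr):
    # SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the partitions induced by A
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in Counter(row[attr] for row in rows).values())

def gain_ratio(rows, labels, attr):
    # GainRatio(A) = Gain(A) / SplitInfo_A(D); a many-valued attribute like product_id
    # gets a very large SplitInfo, which cancels out its inflated information gain
    return info_gain(rows, labels, attr) / split_info(rows, attr)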
Gini index:
The Gini index measures the impurity of D. As mentioned above, this method uses binary splits. An example of a binary split:
if income has three possible values, namely {low, medium, high}, then the possible subsets are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}. We exclude the power set, {low, medium, high}, and the empty set from consideration since, conceptually, they do not represent a split.
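A quick way to enumerate those candidate binary splits in Python (illustrative only): every non-empty proper subset of the attribute's values defines one candidate split.

from itertools import combinations

values = ["low", "medium", "high"]
# all non-empty proper subsets (the full set and the empty set are excluded, as above)
candidate_splits = [c for r in range(1, len(values)) for c in combinations(values, r)]
print(candidate_splits)
# [('low',), ('medium',), ('high',), ('low', 'medium'), ('low', 'high'), ('medium', 'high')]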
Example:
there are nine tuples belonging to the class buys computer = yes and the remaining five tuples belong to the class buys computer = no. The impurity of D is Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459.
To find the splitting criterion for the tuples in D, we need to compute the Gini index for each attribute. Let's start with the attribute income and consider each of the possible splitting subsets. Consider the subset {low, medium}. This would result in 10 tuples in partition D1 satisfying the condition "income ∈ {low, medium}." The remaining four tuples of D would be assigned to partition D2. The Gini index value computed based on this partitioning is Gini_income∈{low,medium}(D) = (10/14)Gini(D1) + (4/14)Gini(D2) = (10/14)(1 - (7/10)^2 - (3/10)^2) + (4/14)(1 - (2/4)^2 - (2/4)^2) = 0.443.
The other income subsets can be computed in the same way, and the procedure then extends to attributes such as age.
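A quick Python check of this computation, taking the class counts for this split from the book's AllElectronics table (D1 has 7 yes / 3 no, D2 has 2 yes / 2 no); gini and gini_split are illustrative helper names:

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2)
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels_d1, labels_d2):
    # Gini index of a binary split: |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)
    n = len(labels_d1) + len(labels_d2)
    return len(labels_d1) / n * gini(labels_d1) + len(labels_d2) / n * gini(labels_d2)

D  = ["yes"] * 9 + ["no"] * 5          # 9 yes / 5 no overall
D1 = ["yes"] * 7 + ["no"] * 3          # income in {low, medium}
D2 = ["yes"] * 2 + ["no"] * 2          # income = high
print(round(gini(D), 3))               # 0.459
print(round(gini_split(D1, D2), 3))    # 0.443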
About pruning:
To remove noise and outliers, the tree needs to be pruned. There are two pruning approaches: prepruning and postpruning.
Prepruning computes a statistic such as information gain or the Gini index while the tree is being split; once the value falls below a threshold, splitting stops and the node becomes a leaf.
Postpruning grows the tree fully first and then removes branches bottom-up. The branch-removal approach is called cost complexity pruning, and it is based mainly on the number of leaves and the error rate: for each node, compare the cost complexity with and without pruning it, and prune the node if pruning yields a smaller cost complexity. Note that cost complexity is computed on a separate data set, not the training set and not the validation set.
Postpruning is generally considered more reliable. In practice, prepruning and postpruning can be combined.
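As a rough illustration of the postpruning decision for a single node, here is my simplified take: a linear cost of errors plus a penalty alpha per leaf, with error counts measured on the separate pruning set. The book only says cost complexity is a function of the number of leaves and the error rate, so the exact form and the alpha value below are assumptions.

def should_prune(subtree_errors, subtree_leaves, leaf_errors, alpha):
    # cost complexity of keeping the subtree vs. collapsing it to a single leaf
    keep_cost = subtree_errors + alpha * subtree_leaves
    prune_cost = leaf_errors + alpha * 1
    return prune_cost <= keep_cost

# a subtree with 4 leaves making 3 errors on the pruning set, vs. 5 errors as one leaf:
print(should_prune(subtree_errors=3, subtree_leaves=4, leaf_errors=5, alpha=1.0))  # True -> prune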
The book Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management points out two things worth noting:
1. The two leaf nodes produced by a split may belong to the same category; only their probabilities differ, but both pass the threshold.
2. A binary tree can classify a target with multiple categories; conversely, a multiway tree can also classify a binary target.