This post covers: Written Part 1. Given a dataset with the attributes {Height, Hair, Eye} and two classes {C1, C2}, build a decision tree based on information gain.

Height Hair Eye Class
1 Tall Blond Brown C1
2 Ta

$P(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}} \exp\!\left(-\frac{(x_i-\mu_{c,i})^2}{2\sigma_{c,i}^2}\right)$   (6)
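As a quick numerical check of Eq. (6), a minimal sketch in Python (the values of x, mu, and sigma below are illustrative, not taken from the assignment data):

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """Class-conditional likelihood of a continuous attribute value x
    under a Gaussian with mean mu and standard deviation sigma (Eq. 6)."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# At x == mu the density peaks at 1 / (sqrt(2*pi) * sigma).
print(round(gaussian_likelihood(0.0, 0.0, 1.0), 5))  # → 0.39894
```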
For the sample Z = (Height = Short, Hair = blond, Eye = brown), first compute the prior probabilities:

$P(\mathrm{C1}) = \frac{5}{9}, \qquad P(\mathrm{C2}) = \frac{4}{9}$

For the attribute Height:

$P(\mathrm{Height=Short} \mid \mathrm{C1}) = \frac{2}{5}, \qquad P(\mathrm{Height=Short} \mid \mathrm{C2}) = \frac{1}{4}$

For the attribute Hair:

$P(\mathrm{Hair=blond} \mid \mathrm{C1}) = \frac{2}{5}, \qquad P(\mathrm{Hair=blond} \mid \mathrm{C2}) = \frac{1}{2}$

For the attribute Eye:

$P(\mathrm{Eye=brown} \mid \mathrm{C1}) = \frac{3}{5}, \qquad P(\mathrm{Eye=brown} \mid \mathrm{C2}) = 0$

Therefore,

$P(\mathrm{C1} \mid Z) \propto P(\mathrm{C1})\,P(\mathrm{Height=Short} \mid \mathrm{C1})\,P(\mathrm{Hair=blond} \mid \mathrm{C1})\,P(\mathrm{Eye=brown} \mid \mathrm{C1}) = \frac{5}{9}\cdot\frac{2}{5}\cdot\frac{2}{5}\cdot\frac{3}{5} \approx 0.0533$

$P(\mathrm{C2} \mid Z) \propto P(\mathrm{C2})\,P(\mathrm{Height=Short} \mid \mathrm{C2})\,P(\mathrm{Hair=blond} \mid \mathrm{C2})\,P(\mathrm{Eye=brown} \mid \mathrm{C2}) = \frac{4}{9}\cdot\frac{1}{4}\cdot\frac{1}{2}\cdot 0 = 0$

Without smoothing, $P(\mathrm{Eye=brown} \mid \mathrm{C2}) = 0$ forces $P(\mathrm{C2} \mid Z) = 0$. Since $P(\mathrm{C1} \mid Z) > P(\mathrm{C2} \mid Z)$, sample Z is classified as C1.
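The whole comparison can be reproduced in a few lines; a minimal sketch, with the priors and conditional probabilities hard-coded from the counts derived above:

```python
# Naive Bayes scores for Z = (Height=Short, Hair=blond, Eye=brown).
# Each score is prior * product of class-conditional likelihoods; no smoothing.
score_c1 = (5/9) * (2/5) * (2/5) * (3/5)   # P(C1) and its likelihoods
score_c2 = (4/9) * (1/4) * (1/2) * 0       # P(Eye=brown | C2) = 0 zeroes this

prediction = "C1" if score_c1 > score_c2 else "C2"
print(round(score_c1, 4), score_c2, prediction)  # → 0.0533 0.0 C1
```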

Lab Part

Suppose a supermarket wants to promote pasta. Use the data in "Transactions.txt" as training data to build a C5.0 decision tree model that predicts whether a customer will buy pasta.

1. Build a decision tree from "Transactions.txt", using the other fields to predict the "pasta" field. With the Type node under Field Ops, set the "type" of every field except COD to "Flag", set the "type" of COD to "Typeless", and set the "direction" property of the "pasta" field to "out". In the C5.0 node under Modeling, select "Expert", set "Pruning severity" to 65, and set "Minimum records per child branch" to 95.

Figure 5 is a screenshot of the Clementine stream. The decision tree built from "Transactions.txt" is shown in Figure 6.

Figure 5    Clementine screenshot

Figure 6    Decision tree

Although the horizontal layout of the decision tree looks nicer, it distorts badly when zoomed, so the vertical layout was chosen instead.

2. Use the model built above to predict, for each of the 20 customers in "rollout.txt", whether the customer will buy pasta.

Figures 7 and 8 show the data type configuration and the prediction results for "rollout.txt", respectively.

Figure 7    rollout data type configuration

Figure 8    Decision tree prediction results

The prediction rules for the first five levels are as follows:

tomato souce = 1 [ Mode: 1 ] 
	tunny = 1 [ Mode: 1 ] => 1 
	tunny = 0 [ Mode: 1 ] 
		rice = 1 [ Mode: 1 ] => 1 
		rice = 0 [ Mode: 0 ] 
			brioches = 1 [ Mode: 1 ] => 1 
			brioches = 0 [ Mode: 0 ] 
				frozen vegetables = 1 [ Mode: 1 ] => 1 
				frozen vegetables = 0 [ Mode: 0 ] 
					coffee = 1 [ Mode: 1 ] => 1 
					coffee = 0 [ Mode: 0 ] => 0 
tomato souce = 0 [ Mode: 0 ] 
	rice = 1 [ Mode: 0 ] 
		coffee = 1 [ Mode: 1 ] => 1 
		coffee = 0 [ Mode: 0 ] 
			biscuits = 1 [ Mode: 1 ] => 1 
			biscuits = 0 [ Mode: 0 ] 
				coke = 1 [ Mode: 1 ] => 1 
				coke = 0 [ Mode: 0 ] => 0 
	rice = 0 [ Mode: 0 ] 
		tunny = 1 [ Mode: 0 ] => 0 
		tunny = 0 [ Mode: 0 ] 
			oil = 1 [ Mode: 0 ] => 0 
			oil = 0 [ Mode: 0 ] 
				water = 1 [ Mode: 0 ] => 0 
				water = 0 [ Mode: 0 ] 
					milk = 1 [ Mode: 0 ] => 0 
					milk = 0 [ Mode: 0 ] 
						yoghurt = 1 [ Mode: 0 ] => 0 
						yoghurt = 0 [ Mode: 0 ] 
							coke = 1 [ Mode: 0 ] => 0 
							coke = 0 [ Mode: 0 ] 
								biscuits = 1 [ Mode: 0 ] => 0 
								biscuits = 0 [ Mode: 0 ] 
									brioches = 1 [ Mode: 0 ] => 0 
									brioches = 0 [ Mode: 1 ] 
										coffee = 1 [ Mode: 0 ] => 0 
										coffee = 0 [ Mode: 1 ] 
											frozen vegetables = 1 [ Mode: 0 ] => 0 
											frozen vegetables = 0 [ Mode: 1 ] 
												beer = 1 [ Mode: 0 ] => 0 
												beer = 0 [ Mode: 1 ] 
													juices = 1 [ Mode: 0 ] => 0 
													juices = 0 [ Mode: 1 ] 
														mozzarella = 1 [ Mode: 0 ] => 0 
														mozzarella = 0 [ Mode: 1 ] 
															crackers = 1 [ Mode: 0 ] => 0 
															crackers = 0 [ Mode: 1 ] 
																frozen fish = 1 [ Mode: 0 ] => 0 
																frozen fish = 0 [ Mode: 1 ] => 1 
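For reference, the printed rules can be transcribed into a plain function; a sketch assuming each customer basket is represented as a dict of 0/1 item flags (items missing from the dict are treated as not purchased):

```python
# Transcription of the printed C5.0 rules above into a plain function.
def predict_pasta(basket):
    g = lambda item: basket.get(item, 0)
    if g("tomato souce"):
        # Any of these alongside tomato sauce predicts a pasta purchase.
        for item in ("tunny", "rice", "brioches", "frozen vegetables", "coffee"):
            if g(item):
                return 1
        return 0
    if g("rice"):
        for item in ("coffee", "biscuits", "coke"):
            if g(item):
                return 1
        return 0
    # No tomato sauce, no rice: the long right-hand chain, where the
    # presence of any listed item predicts "no pasta".
    for item in ("tunny", "oil", "water", "milk", "yoghurt", "coke",
                 "biscuits", "brioches", "coffee", "frozen vegetables",
                 "beer", "juices", "mozzarella", "crackers", "frozen fish"):
        if g(item):
            return 0
    return 1

print(predict_pasta({"tomato souce": 1, "tunny": 1}))  # → 1
```

Note that "tomato souce" is kept spelled exactly as it appears in the dataset's field names.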

Build models on a labeled dataset from an online training system to predict other members' final-exam results. The dataset comes from the system's log data and captures each member's online learning behavior. Try several different models and parameter settings to build a high-quality predictive model.

The training set has 873 records and the test set has 461 records. Both contain the following variables:

Person ID, total online time (min), online reading time (min), online testing time (min), full-text reading count, smart reading count, knowledge-point reading count, question reading count,
backtrack-to-source count, question-bank test count, mock exam count, mock exam excellent count, mock exam good count, mock exam pass count, mock exam fail count, Class

1. Run decision tree classification on the training set. Set every field except "Person ID" as input. Set the "direction" of "Class" to "out" and its "type" to "Flag". Customize "pruning severity" and "minimum records per child branch", and check "use global pruning".

Several parameter combinations were tried, as shown in Figure 9, where PS stands for pruning severity and MRPCB for minimum records per child branch. The "best" combination is PS=5, MRPCB=5.

Figure 9    Decision tree confusion matrices

2. Process the training set with a neural network using default settings. Other settings are the same as above.

The confusion matrix for the neural network with default settings is shown in Figure 10.

Figure 10    Neural network confusion matrix

3. Process the training set with a logistic regression model using default settings. Other settings are the same as above.

The confusion matrix for the logistic regression model with default settings is shown in Figure 11.

Figure 11    Logistic regression confusion matrix

4. Analyze the confusion matrices produced by the three models above and evaluate model quality.

Comparing the PS=5, MRPCB=5 decision tree against the default neural network and the default logistic regression model, the decision tree outperforms the other two on accuracy, recall, and precision, but this does not necessarily mean the decision tree is the best fit for this dataset.
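The three metrics mentioned above come straight from the confusion matrix; a minimal sketch (the counts below are hypothetical, not the values from Figures 9-11):

```python
# 2x2 confusion matrix for a binary classifier, hypothetical counts:
#                  predicted 1   predicted 0
# actual 1 (pos)     tp = 400      fn = 50
# actual 0 (neg)     fp = 60       tn = 363
tp, fn, fp, tn = 400, 50, 60, 363

accuracy  = (tp + tn) / (tp + fn + fp + tn)   # all correct / all records
precision = tp / (tp + fp)                    # correct 1s / predicted 1s
recall    = tp / (tp + fn)                    # correct 1s / actual 1s

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```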


Original post: https://blog.csdn.net/weixin_46221946/article/details/134693284
