C2-4.2.2 Decision Trees: Purity, Information Entropy, and Information Gain



1. First, the application background: decision trees

Decision tree algorithm in detail; core idea of the algorithm; structure;

Put simply, the decision tree here is just a binary tree.

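2. Purity

Take buying gold as an example. Gold comes in 999 and 9999 grades. They differ: the number indicates the purity of the gold (relative to its impurities). Decision trees borrow the same notion of "purity". If everything in a result set belongs to one class, we call it "very pure". If a result set has 6 items but only 3 of them belong to one class, we call it "not pure", and the items other than those 3 are the "impurities".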

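2.1 Purity in brief

If a result set (after one or more binary-tree decisions) is all cats or all non-cats, we say the result set is very pure.
If a result set contains both cats and non-cats, it is not pure. But "not pure" comes in degrees, which leads to the formula we use to compute it.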


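P1: the fraction of examples in the set that are cats.
- When a set has 6 examples and 0 are cats, the entropy is 0 and the purity is highest.
- When a set has 6 examples and 3 are cats, the entropy is 1 (its maximum) and the purity is lowest. An entropy of about 0.92 corresponds to 2 cats out of 6, which is still poor purity.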
  
  …
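The entropy values for sets of 6 examples can be checked numerically. The following is a minimal sketch, assuming the standard two-class entropy with a base-2 logarithm and the convention that 0 * log2(0) counts as 0. The function name `binary_entropy` is hypothetical and chosen only for illustration.

```python
import math

def binary_entropy(p1: float) -> float:
    """Entropy (in bits) of a two-class set where a fraction p1 is cats."""
    # Assumed convention: 0 * log2(0) is treated as 0, so a pure set has entropy 0.
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

# Sets of 6 examples with a varying number of cats.
for cats in (0, 2, 3, 6):
    print(cats, round(binary_entropy(cats / 6), 2))
```

With 0 or 6 cats the entropy is 0 (a pure set), with 3 cats it is 1, and with 2 cats it rounds to 0.92.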

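3. Information entropy (entropy)

When buying gold, a specialized machine can measure its purity. For a result set in a decision tree, how do we judge purity, and by what standard? This leads to the definition of **"information entropy"**.

3.1 Definition of information entropy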

In Machine Learning, entropy ※※measures the level of disorder or uncertainty in a given dataset or system. It is a metric that quantifies the amount of information in a dataset, and it is commonly used to evaluate the quality of a model and its ability to make accurate predictions.

※A higher entropy value indicates a more heterogeneous dataset with diverse classes, while a lower entropy signifies a more pure and homogeneous subset of data. Decision tree models can use entropy to determine the best splits to make informed decisions and build accurate predictive models.

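[※※※ Summary]:
- Information entropy measures how pure the data in a given dataset is.
- The smaller the entropy, the purer the data.
- It is typically used for classification tasks in machine learning.

3.2 The information entropy formula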

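Assuming the usual two-class (cat / non-cat) setting with a base-2 logarithm, the entropy can be written as follows. Here $p_1$ is the fraction of cats (P1 in the text); the symbols $H$ and $p_0$ are names chosen for illustration, not taken from the original.

$$
p_0 = 1 - p_1, \qquad H(p_1) = -p_1 \log_2 p_1 - p_0 \log_2 p_0, \qquad \text{with } 0 \log_2 0 := 0
$$

For example, $H(0) = 0$ for a set with no cats, and $H(1/2) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ for an evenly mixed set.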

4. Information Gain

4.1 The concept of information gain:

Information gain calculates the reduction in entropy or surprise from transforming a dataset in some way.
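As a formula, "reduction in entropy" is commonly written like this. This is a sketch under the assumption that the transformation is a split of a set $S$ on a feature $A$ into subsets $S_v$; the symbols $IG$, $S$, $A$, and $S_v$ are illustrative names, not taken from the original.

$$
IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} \, H(S_v)
$$

For a binary split there are only two subsets (left and right), so the sum has two terms, each weighted by the fraction of examples that land in that branch.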

It is commonly used in the construction of decision trees from a training dataset, by evaluating the information gain for each variable, and selecting the variable that maximizes the information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification.
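The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, assuming 0/1 labels (1 = cat) and 0/1 features; the toy data and the names `entropy`, `information_gain`, `pointy_ears`, and `has_whiskers` are hypothetical.

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of 0/1 labels, where 1 means cat."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    if p1 in (0.0, 1.0):  # pure set: treat 0 * log2(0) as 0
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def information_gain(labels, feature):
    """Entropy before the split minus the weighted entropy after splitting on a 0/1 feature."""
    left = [y for y, x in zip(labels, feature) if x == 1]
    right = [y for y, x in zip(labels, feature) if x == 0]
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

# Hypothetical toy data: 6 animals, label 1 = cat.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "pointy_ears": [1, 1, 1, 0, 0, 0],   # separates the two classes perfectly
    "has_whiskers": [1, 0, 1, 0, 1, 0],  # only slightly informative
}

# Evaluate the gain of each feature and pick the one with the largest gain.
gains = {name: information_gain(labels, col) for name, col in features.items()}
best = max(gains, key=gains.get)
print(gains, best)
```

On this toy data, `pointy_ears` yields a gain of 1 (both branches become pure), while `has_whiskers` yields a gain of roughly 0.08, so `pointy_ears` would be chosen for the split.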
