ABW505 Python & Predictive Analytics Final Exam Ultimate Question Bank: Code Fill-Ins + Algorithm Hand Calculations (with Past-Paper Solutions)
49-minute read
🎓 ABW505 Exam Structure (Official)
| Question | Marks | Type | Scope |
|---|---|---|---|
| Q1 | 20 | Type 2 | Python basics: core objects, variables, input/output, lists, tuples, functions, loops, decision structures |
| Q2 | 30 | Type 3 | Answer 3 of 5: decision structures, repetition structures, Boolean logic, lists/tuples, functions |
| Q3 | 25 | Type 1, 3 | Pandas, data preprocessing, encoders, SVM, Random Forest |
| Q4 | 25 | Type 1, 3, 4 | Naive Bayes, decision trees, data preprocessing (calculator required!) |

Question types:
- Type 1: Theory
- Type 2: Read code and compute the result
- Type 3: Interpret results and write code
- Type 4: Calculation (calculator allowed)
Task: read the code and predict its output. Tests variable scope, mutability, and logic flow.
def extend_list(val, list=[]):
    list.append(val)
    return list

list1 = extend_list(10)
list2 = extend_list(123, [])
list3 = extend_list('a')

print(f"list1 = {list1}")
print(f"list2 = {list2}")
print(f"list3 = {list3}")
Output:
list1 = [10, 'a']
list2 = [123]
list3 = [10, 'a']
Analysis:
This is a classic Python interview question. The default parameter list=[] is created once at function definition time, not every time the function is called.
- `extend_list(10)` → default parameter initialized as `[]`, then append 10 → `[10]`
- `extend_list(123, [])` → passes a new list `[]`, independent of the default → `[123]`
- `extend_list('a')` → reuses the already-modified default `[10]` → `[10, 'a']`

Key Concept: Python default parameters are evaluated at function definition time. Mutable objects retain their state across function calls.
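The standard fix is to default to `None` and create the list inside the function. A minimal sketch (the rewritten signature is mine, not part of the exam question):

```python
def extend_list(val, lst=None):
    if lst is None:      # a fresh list is created on every call
        lst = []
    lst.append(val)
    return lst

print(extend_list(10))   # [10]
print(extend_list('a'))  # ['a'] -- no state leaks between calls
```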
x = 0
for i in range(5):
    if i == 2:
        continue
    if i == 4:
        break
    x += i
print(x)
Output:
4
Execution Trace:
| Iteration | i | Condition | Operation | x Value |
|---|---|---|---|---|
| 1 | 0 | - | x += 0 | 0 |
| 2 | 1 | - | x += 1 | 1 |
| 3 | 2 | i==2 | continue (skip) | 1 |
| 4 | 3 | - | x += 3 | 4 |
| 5 | 4 | i==4 | break (exit) | 4 |
Result: x = 4
data = (10, 20, 30, 40, 50)
a, *b, c = data
print(a)
print(b)
print(c)
Output:
10
[20, 30, 40]
50
Analysis:
This uses Python 3's Extended Unpacking syntax:
- `a` takes the first element → 10
- `c` takes the last element → 50
- `*b` collects all remaining middle elements and packs them into a list → [20, 30, 40]

Note: `*b` collects into a list, not a tuple.
Task: write NumPy code for the following: (i) a 4x4 matrix with values 1 to 16; (ii) a null vector of size 10 with the 7th value set to 10; (iii) an 8x8 checkerboard pattern.
import numpy as np
# (i) 4x4 matrix ranging from 1 to 16
matrix_4x4 = np.arange(1, 17).reshape(4, 4)
print("4x4 Matrix:\n", matrix_4x4)
# (ii) Null vector of size 10, update 7th value to 10
# Note: Python indexing starts from 0, so 7th value is index 6
null_vector = np.zeros(10)
null_vector[6] = 10
print("\nNull Vector:\n", null_vector)
# (iii) 8x8 Checkerboard pattern
checkerboard = np.zeros((8, 8), dtype=int)
# Use slicing: odd rows even columns, even rows odd columns set to 1
checkerboard[1::2, ::2] = 1
checkerboard[::2, 1::2] = 1
print("\nCheckerboard:\n", checkerboard)Run Output:
4x4 Matrix:
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]
[13 14 15 16]]
Null Vector:
[ 0. 0. 0. 0. 0. 0. 10. 0. 0. 0.]
Checkerboard:
[[0 1 0 1 0 1 0 1]
[1 0 1 0 1 0 1 0]
[0 1 0 1 0 1 0 1]
[1 0 1 0 1 0 1 0]
[0 1 0 1 0 1 0 1]
[1 0 1 0 1 0 1 0]
[0 1 0 1 0 1 0 1]
[1 0 1 0 1 0 1 0]]
Analysis:
(i) Create 4x4 Matrix:
- `np.arange(1, 17)` creates an array from 1 to 16 (17 excluded)
- `.reshape(4, 4)` reshapes it into a 4x4 matrix

(ii) Null Vector & Indexing:
- `np.zeros(10)` creates a zero vector
- `null_vector[6] = 10` updates the 7th position (indexing starts at 0)

(iii) Checkerboard Pattern:
- `checkerboard[1::2, ::2] = 1` → odd rows, even columns set to 1
- `checkerboard[::2, 1::2] = 1` → even rows, odd columns set to 1

Key Concepts: slicing with `[start:stop:step]`, `np.arange`, `reshape`, `np.zeros`.
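As a sanity check, the same checkerboard can be derived from index parity in one step; a small sketch (variable names are mine):

```python
import numpy as np

# A square is 1 exactly when its row+column index sum is odd
rows, cols = np.indices((8, 8))
checkerboard = (rows + cols) % 2
print(checkerboard)
```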
Task: create a vector containing the values 10 to 49, then reverse it.
import numpy as np
# Create vector from 10 to 49
vector = np.arange(10, 50)
print("Original Vector:\n", vector)
# Reverse it using slicing
vector_reversed = vector[::-1]
print("\nReversed Vector:\n", vector_reversed)Run Output:
Original Vector:
[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]
Reversed Vector:
[49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26
25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10]

Analysis:
Method 1: slicing reversal (recommended):
- `[::-1]` is Python's slice-reversal syntax
- With `start:stop:step`, a step of -1 traverses backwards

Method 2: `np.flip()`:
- `vector_reversed = np.flip(vector)`

Method 3: `np.flipud()`:
- `vector_reversed = np.flipud(vector)`

Key Concepts: `np.arange`, slicing reversal with `[::-1]`, `np.flip`, `np.flipud`.

def process_data(*args, **kwargs):
print(f"Positional args: {args}")
print(f"Keyword args: {kwargs}")
total = sum(args)
multiplier = kwargs.get('multiplier', 1)
return total * multiplier
result = process_data(1, 2, 3, multiplier=10, label='test')
print(f"Result: {result}")👉 点击查看运行结果与解析
Output:
Positional args: (1, 2, 3)
Keyword args: {'multiplier': 10, 'label': 'test'}
Result: 60

Analysis:
*args (variable positional arguments):
- `args = (1, 2, 3)`

**kwargs (variable keyword arguments):
- `kwargs = {'multiplier': 10, 'label': 'test'}`
- `.get()` gives safe access and avoids a KeyError

Execution flow:
1. `sum(args)` → `sum((1, 2, 3))` → 6
2. `kwargs.get('multiplier', 1)` → returns 10 (default 1 if absent)
3. `6 * 10 = 60`

Task type: write code from the given specification.
Requirement: given a list of numbers, create a new list containing only the even numbers multiplied by 2.
Input: nums = [1, 2, 3, 4, 5]
Solution Code:
nums = [1, 2, 3, 4, 5]
result = [x * 2 for x in nums if x % 2 == 0]
print(result)

Run Output:
[4, 8]

Analysis:
- Iterate over each element `x in nums`
- Keep only `x % 2 == 0` (even numbers)
- Transform with `x * 2` (multiply by 2)
- Result: `[4, 8]`

Requirement: given a nested list, flatten it into a single-level list.
Input: nested = [[1, 2], [3, 4, 5], [6]]
Solution Code:
nested = [[1, 2], [3, 4, 5], [6]]
# Method 1: List comprehension
flat = [item for sublist in nested for item in sublist]
print(f"Method 1: {flat}")
# Method 2: Using sum() with empty list
flat2 = sum(nested, [])
print(f"Method 2: {flat2}")
# Method 3: Traditional loop
flat3 = []
for sublist in nested:
    for item in sublist:
        flat3.append(item)
print(f"Method 3: {flat3}")

Run Output:
Method 1: [1, 2, 3, 4, 5, 6]
Method 2: [1, 2, 3, 4, 5, 6]
Method 3: [1, 2, 3, 4, 5, 6]
Analysis:
Method 1 (List Comprehension):
- `for sublist in nested` → iterate through each sublist
- `for item in sublist` → iterate through each element of that sublist

Method 2 (sum function):
- `sum(nested, [])` → starts with an empty list and concatenates each sublist
- `[] + [1,2] + [3,4,5] + [6]` → results in `[1,2,3,4,5,6]`

Method 3 (traditional loop):
- Explicit nested iteration; the most readable for beginners
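For longer inputs, `itertools.chain.from_iterable` is a standard-library alternative worth knowing (a sketch, not required by the exam answer):

```python
from itertools import chain

nested = [[1, 2], [3, 4, 5], [6]]
flat = list(chain.from_iterable(nested))
print(flat)  # [1, 2, 3, 4, 5, 6]
```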
Requirement: count the frequency of each character in a string.
Input: text = "hello"
Solution Code:
text = "hello"
freq = {}
for char in text:
    if char in freq:
        freq[char] += 1
    else:
        freq[char] = 1
print(freq)

Run Output:
{'h': 1, 'e': 1, 'l': 2, 'o': 1}

Analysis:
- 'l' appears twice (there are two l's in "hello"); every other character appears once.
Advanced Implementation (using defaultdict):
from collections import defaultdict
text = "hello"
freq = defaultdict(int)
for char in text:
    freq[char] += 1
print(dict(freq))
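The same count is a one-liner with the standard library's `collections.Counter`; a quick sketch:

```python
from collections import Counter

text = "hello"
print(dict(Counter(text)))  # {'h': 1, 'e': 1, 'l': 2, 'o': 1}
```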
Requirement: from a list of numbers, use lambda and filter to select the numbers greater than 5.

Input: numbers = [2, 8, 3, 12, 5, 7]
Solution Code:
numbers = [2, 8, 3, 12, 5, 7]
# Using filter() with lambda
result = list(filter(lambda x: x > 5, numbers))
print(f"Numbers > 5: {result}")
# Equivalent list comprehension
result2 = [x for x in numbers if x > 5]
print(f"List comprehension: {result2}")Run Output:
Numbers > 5: [8, 12, 7]
List comprehension: [8, 12, 7]

Analysis:
Lambda Function:
- `lambda x: x > 5` → creates an anonymous function that checks whether x > 5
- Returns `True` or `False`

Filter Function:
- `filter(function, iterable)` → applies the function to each element and keeps those returning `True`
- Returns a lazy filter object (hence the `list()` conversion)

Comparison:
- `filter()` + `lambda`: functional-programming style
- List comprehension: usually considered more Pythonic and readable

Task: fix and explain the following "guess the number" logic. Goal: generate a random number from 1 to 9 and allow 4 guesses.
import random
target_num = random.randint(1, 9)
guess_num = 0
guess_counter = 0
max_attempts = 4
# Loop condition: haven't guessed correctly AND within attempt limit
while target_num != guess_num and guess_counter < max_attempts:
    guess_counter += 1
    # In the real exam, you'd use input()
    guess_num = int(input(f"Attempt {guess_counter}/{max_attempts}, Enter a number (1-9): "))
    if target_num == guess_num:
        print(f"🎉 Well guessed! The number was {target_num}")
        break
    elif guess_counter < max_attempts:
        if guess_num < target_num:
            print("Too low! Try again.")
        else:
            print("Too high! Try again.")
    else:
        print(f"❌ Out of chances! The number was {target_num}")

Key Concepts:
1. While Loop Termination Condition:
- `while target_num != guess_num and guess_counter < max_attempts:`

2. break Statement:
- Exits the loop immediately once the guess is correct

3. Variable Increment:
- `guess_counter += 1` is shorthand for `guess_counter = guess_counter + 1`

4. Boundary Condition Check:
- `guess_counter < max_attempts` ensures the attempt limit is never exceeded

Common Mistakes:
- Forgetting to increment `guess_counter`, causing an infinite loop
- Using the wrong Boolean operator (`or` instead of `and`)

Task: draw a horizontal bar chart from the following data:
Data: Moscow (70), Tokyo (60), Washington (75), Beijing (50), Delhi (40)
import matplotlib.pyplot as plt
# 1. Prepare data
cities = ['Moscow', 'Tokyo', 'Washington', 'Beijing', 'Delhi']
happiness_index = [70, 60, 75, 50, 40]
# 2. Create figure
plt.figure(figsize=(10, 5))
# 3. Draw horizontal bar chart (barh)
plt.barh(cities, happiness_index, color='skyblue', edgecolor='navy')
# 4. Add labels and title
plt.xlabel('Happiness Index', fontsize=12)
plt.ylabel('City', fontsize=12)
plt.title('Happiness Index by City', fontsize=14, fontweight='bold')
# 5. Add grid for better readability
plt.grid(axis='x', linestyle='--', alpha=0.7)
# 6. Display
plt.tight_layout()
plt.show()

Code Analysis:
1. Data Preparation:
- Two parallel lists: `cities` and `happiness_index`

2. Create Figure:
- `plt.figure(figsize=(10, 5))` sets the canvas size (10 inches wide, 5 inches tall)

3. Draw Horizontal Bar Chart:
- `plt.barh()` draws horizontal bars (h = horizontal); `plt.bar()` draws vertical bars
- `color='skyblue'` sets the fill color, `edgecolor='navy'` the border color

4. Add Labels:
- `xlabel()` - X-axis label; `ylabel()` - Y-axis label; `title()` - chart title
- `fontsize` - font size; `fontweight='bold'` - bold text

5. Grid Lines:
- `plt.grid(axis='x')` shows only the X-axis grid
- `linestyle='--'` dashed lines; `alpha=0.7` 70% opacity

6. Display Figure:
- `plt.tight_layout()` auto-adjusts the layout to prevent label overlap
- `plt.show()` displays the figure

Common Exam Points: the difference between `bar()` and `barh()`.

Requirement: find the intersection and difference of two sets.
Input:
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}

Solution Code:
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}
# Intersection - common elements
intersection = set_a & set_b # or set_a.intersection(set_b)
print(f"Intersection: {intersection}")
# Difference - elements in A but not in B
difference = set_a - set_b # or set_a.difference(set_b)
print(f"A - B: {difference}")
# Symmetric difference - elements in either but not both
sym_diff = set_a ^ set_b # or set_a.symmetric_difference(set_b)
print(f"Symmetric difference: {sym_diff}")
# Union - all unique elements
union = set_a | set_b # or set_a.union(set_b)
print(f"Union: {union}")Run Output:
Intersection: {4, 5}
A - B: {1, 2, 3}
Symmetric difference: {1, 2, 3, 6, 7, 8}
Union: {1, 2, 3, 4, 5, 6, 7, 8}
Analysis:
Set Operators:
& (intersection): Elements common to both sets → {4, 5}- (difference): Elements in A but not in B → {1, 2, 3}^ (symmetric difference): Elements in either but not both → {1, 2, 3, 6, 7, 8}| (union): All unique elements from both → {1, 2, 3, 4, 5, 6, 7, 8}Use Cases:
Scenario: you have a DataFrame df with columns ['Salary', 'Department'].

Requirements:
1. Fill missing values in Salary with the median
2. Filter the rows where Salary > 5000

Solution Code:
import pandas as pd
import numpy as np
# Sample data
data = {'Salary': [3000, 6000, np.nan, 8000],
        'Department': ['HR', 'IT', 'IT', 'HR']}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print("\n" + "="*50 + "\n")
# 1. Fill missing values with median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
# 2. Filter high salary
high_salary = df[df['Salary'] > 5000]
print("After filling and filtering:")
print(high_salary)

Run Output:
Original data:
Salary Department
0 3000.0 HR
1 6000.0 IT
2 NaN IT
3 8000.0 HR
==================================================
After filling and filtering:
Salary Department
1 6000.0 IT
3 8000.0 HR

Key Steps:
- `df['Salary'].median()` → computes the median (the median of 3000, 6000, 8000 is 6000)
- `fillna()` → replaces NaN with the median
- `df[df['Salary'] > 5000]` → filters the high-salary employees

Scenario: compute the average salary per department and find the department with the highest average.
Solution Code:
import pandas as pd
# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Department': ['HR', 'IT', 'HR', 'IT', 'Finance', 'Finance'],
    'Salary': [5000, 7000, 5500, 8000, 6000, 6500]
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print("\n" + "="*50 + "\n")
# Group by department and calculate mean salary
dept_avg = df.groupby('Department')['Salary'].mean()
print("Average salary by department:")
print(dept_avg)
print("\n" + "="*50 + "\n")
# Find department with highest average
max_dept = dept_avg.idxmax()
max_salary = dept_avg.max()
print(f"Highest average: {max_dept} (${max_salary:.2f})")Run Output:
Original data:
Name Department Salary
0 Alice HR 5000
1 Bob IT 7000
2 Charlie HR 5500
3 David IT 8000
4 Eve Finance 6000
5 Frank Finance 6500
==================================================
Average salary by department:
Department
Finance 6250.0
HR 5250.0
IT 7500.0
Name: Salary, dtype: float64
==================================================
Highest average: IT ($7500.00)

Key Operations:
- `df.groupby('Department')['Salary'].mean()` → groups rows by department, then averages the Salary column
- `idxmax()` → returns the index label (department name) of the maximum value
- `max()` → returns the maximum value itself

Use Cases: group-level aggregation reports such as average salary by department.
Scenario: given the following student data table, derive decision rules from the known rows and predict the results for the four question marks.
| Student ID | Grade | GPA | Prediction (Yes/No) |
|---|---|---|---|
| 101 | A | 3.85 | Yes |
| 205 | C | 2.85 | No |
| 640 | A- | 3.50 | Yes |
| 710 | B | 3.00 | ? (1) |
| 595 | A | 3.30 | ? (2) |
| 540 | B- | 4.00 | ? (3) |
| 630 | B+ | 3.00 | ? (4) |
Task (i): Propose Manual Expert Rules
Observation from the first three rows:
- Student 101: Grade A, GPA 3.85 → Yes
- Student 205: Grade C, GPA 2.85 → No
- Student 640: Grade A-, GPA 3.50 → Yes

Pattern Identified:
- Every "Yes" row has GPA >= 3.50 and Grade A/A-; the only "No" row has GPA below 3.50
Proposed Expert Rule (Example):
Rule 1 (Based on GPA threshold):
IF GPA >= 3.50 THEN Prediction = Yes
ELSE Prediction = No
Or rule based on Grade:
Rule 2 (Based on Grade):
IF Grade is 'A' or 'A-' THEN Prediction = Yes
ELSE Prediction = No
Recommend Rule 1, because GPA is a numerical feature, more objective and stable.
Task (ii): Identify Predictions
Based on Rule 1: IF GPA >= 3.50 THEN Yes, ELSE No
| Student ID | Grade | GPA | Rule Evaluation | Prediction |
|---|---|---|---|---|
| 710 | B | 3.00 | 3.00 < 3.50 | No |
| 595 | A | 3.30 | 3.30 < 3.50 | No |
| 540 | B- | 4.00 | 4.00 >= 3.50 | Yes ✅ |
| 630 | B+ | 3.00 | 3.00 < 3.50 | No |
Final Answers: (1) 710 → No, (2) 595 → No, (3) 540 → Yes, (4) 630 → No
If using Rule 2 (Based on Grade):
IF Grade is 'A' or 'A-' THEN Yes
ELSE No
| Student ID | Grade | Rule Evaluation | Prediction |
|---|---|---|---|
| 710 | B | B ≠ A/A- | No |
| 595 | A | A = A | Yes ✅ |
| 540 | B- | B- ≠ A/A- | No |
| 630 | B+ | B+ ≠ A/A- | No |
Final Answers (Rule 2): (1) 710 → No, (2) 595 → Yes, (3) 540 → No, (4) 630 → No

Key Concepts Summary:
- Manual expert rules are simple IF-THEN classifiers derived by inspecting the data
- Equally consistent rules (GPA threshold vs. Grade) can disagree on unseen rows: here students 595 and 540 flip between the two rules

Practical Tips: state your rule explicitly, justify it from the known rows, and apply it consistently to every unknown row.
Question: when should you use One-Hot Encoding instead of Label Encoding?
Label Encoding works by:
- Mapping each category to an integer (e.g., Low → 0, Medium → 1, High → 2)
- This implies an ordering, which is only meaningful for ordinal data

One-Hot Encoding works by:
- Creating one binary column per category (e.g., City_Paris = 1, all others = 0)
- No ordering is implied, so it suits nominal data

Usage Rules:
| Data Type | Definition | Examples | Method |
|---|---|---|---|
| Nominal | No order, no ranking | Cities, Colors, Brands | ✅ One-Hot |
| Ordinal | Has order, has ranking | Low, Medium, High | ✅ Label |
Conclusion: Use One-Hot for unordered categorical data (cities, colors, countries), use Label for ordered categorical data (education level, income brackets).
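A minimal sketch of both encodings in pandas (column and category names are illustrative, not from the exam):

```python
import pandas as pd

df = pd.DataFrame({'City': ['Paris', 'Tokyo', 'Paris'],
                   'Level': ['Low', 'High', 'Medium']})

# One-Hot for nominal data: one binary column per city
print(pd.get_dummies(df['City'], prefix='City'))

# Label encoding for ordinal data: map the order explicitly
df['Level_encoded'] = df['Level'].map({'Low': 0, 'Medium': 1, 'High': 2})
print(df)
```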
Question: when should you use Min-Max normalization and when Z-Score standardization?

Min-Max Normalization:
- Rescales data to [0, 1] via $X_{new} = \frac{X - X_{min}}{X_{max} - X_{min}}$
- Sensitive to outliers: a single extreme value compresses everything else

Z-Score Standardization:
- Rescales to mean 0 and standard deviation 1 via $X_{new} = \frac{X - \mu}{\sigma}$
- Much more robust to outliers
Code Example:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np
# Sample data with outlier
data = np.array([[1], [2], [3], [4], [100]]) # 100 is outlier
print("Original data:")
print(data.flatten())
# Min-Max Normalization
min_max_scaler = MinMaxScaler()
normalized = min_max_scaler.fit_transform(data)
print(f"\nMin-Max Normalized: {normalized.flatten()}")
# Z-Score Standardization
standard_scaler = StandardScaler()
standardized = standard_scaler.fit_transform(data)
print(f"Standardized: {standardized.flatten()}")Output:
Original data:
[ 1 2 3 4 100]
Min-Max Normalized: [0. 0.01010101 0.02020202 0.03030303 1. ]
Standardized: [-0.70039279 -0.67894753 -0.65750227 -0.63605702 2.6728996 ]

Observation: with Min-Max, the outlier 100 squashes the first four values into [0, 0.03]; with Z-Score they remain clearly distinguishable.
Conclusion: Prefer Z-Score when outliers exist!
Question: given the confusion matrix, compute precision, recall, and the F1 score.

Confusion Matrix:

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP = 80 | FN = 20 |
| Actual Negative | FP = 10 | TN = 90 |
Formulas:
$\text{Precision} = \frac{TP}{TP + FP} = \frac{80}{80 + 10} = \frac{80}{90} \approx 0.889$
$\text{Recall} = \frac{TP}{TP + FN} = \frac{80}{80 + 20} = \frac{80}{100} = 0.800$
$\text{F1-Score} = 2 \times \frac{Precision \times Recall}{Precision + Recall} = 2 \times \frac{0.889 \times 0.800}{0.889 + 0.800} \approx 0.842$
Interpretation:
Precision: Of predicted positives, how many are actually positive ("don't misclassify")
Recall: Of actual positives, how many are correctly predicted ("don't miss any")
F1-Score: Harmonic mean of Precision and Recall (balanced metric)
Use Cases: favor precision when false positives are costly (e.g., spam filtering), favor recall when false negatives are costly (e.g., disease screening), and use F1 when you need a single balanced metric.
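The hand calculation can be verified with scikit-learn by rebuilding label arrays that match the matrix (a sketch using the counts above):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# 100 actual positives (80 TP + 20 FN), 100 actual negatives (10 FP + 90 TN)
y_true = np.array([1]*100 + [0]*100)
y_pred = np.array([1]*80 + [0]*20 + [1]*10 + [0]*90)

print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # 0.889
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # 0.800
print(f"F1-Score:  {f1_score(y_true, y_pred):.3f}")         # 0.842
```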
Question: what is the difference between the Linear, RBF, and Polynomial kernels in SVM?

1. Linear Kernel:
- Decision boundary is a straight hyperplane
- Fast to train; works well when the data is (nearly) linearly separable or high-dimensional, e.g. text

2. RBF Kernel (Radial Basis Function, Gaussian):
- Maps data into a very high-dimensional space, allowing highly non-linear boundaries
- A good default choice; behavior controlled by `gamma` and `C`

3. Polynomial Kernel:
- Decision boundary is a polynomial surface of a chosen degree
- Extra hyperparameters (`degree`, `coef0`); high degrees risk overfitting

Selection Guide: try Linear first for high-dimensional sparse data, otherwise start with RBF; reserve Polynomial for cases where polynomial feature interactions are expected.
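In scikit-learn the three kernels are selected with the `kernel` argument; a minimal sketch (parameter values are illustrative):

```python
from sklearn.svm import SVC

linear_svm = SVC(kernel='linear', C=1.0)
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale')
poly_svm = SVC(kernel='poly', degree=3, C=1.0)
# Each model then follows the usual workflow:
# model.fit(X_train, y_train); model.predict(X_test)
```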
Question: explain the key Random Forest hyperparameters: n_estimators, max_depth, min_samples_split.

1. n_estimators (Number of Trees):
- More trees give more stable predictions but longer training; accuracy plateaus beyond a point (typical values: 100-300)

2. max_depth (Maximum Tree Depth):
- Limits how deep each tree can grow; deeper trees capture more detail but risk overfitting (`None` lets trees grow until leaves are pure)

3. min_samples_split (Minimum Samples to Split):
- A node is split only if it holds at least this many samples; larger values regularize the trees

Tuning Strategy (GridSearchCV):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10]
}
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
# grid_search.fit(X_train, y_train)

Question: what is K-fold cross-validation and why is it important?
K-Fold Cross-Validation Principle:
- Split the data into K equal folds; train on K-1 folds and validate on the remaining fold; repeat K times so every fold serves once as validation; average the K scores

Why Important:
- Every sample is used for both training and validation, giving a more reliable estimate of generalization than a single train/test split and reducing the variance of that estimate

Common K Values: K = 5 or K = 10, trading off bias, variance, and compute time.
Code Example:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Create sample dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=42)
# Create model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Perform 5-fold cross-validation
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")Output:
Cross-validation scores: [0.85 0.9 0.8 0.95 0.85]
Mean accuracy: 0.870 (+/- 0.108)

Interpretation: the five fold scores range from 0.80 to 0.95; the mean (0.870) estimates generalization accuracy, and the ± two-standard-deviation band (0.108) shows how much it varies across folds.
Question: how do you identify and fix overfitting and underfitting?
| Feature | Overfitting | Underfitting |
|---|---|---|
| Performance | High training, low test accuracy | Low training and test accuracy |
| Cause | Model too complex, memorizes noise | Model too simple, misses patterns |
| Training Error | Very low (near 0) | Very high |
| Test Error | Very high | Very high |
| Generalization | Poor (can't apply to new data) | Poor (can't even fit training data) |
How to Identify:
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np  # needed for np.linspace below
train_sizes, train_scores, test_scores = learning_curve(
model, X, y, cv=5,
train_sizes=np.linspace(0.1, 1.0, 10)
)
# Plot learning curves
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, test_scores.mean(axis=1), label='Test score')
plt.legend()

Diagnosis:
- A large gap between the training and test curves → overfitting
- Both curves plateauing at a low score → underfitting
Solutions:
| Problem | Solutions |
|---|---|
| Overfitting | 1. Add more training data 2. Reduce model complexity (e.g., smaller max_depth) 3. Regularization 4. Cross-validation |
| Underfitting | 1. Increase model complexity 2. Add or engineer better features 3. Reduce regularization 4. Train longer |
Question: explain the main differences between SVM and Random Forest classification.

SVM (Support Vector Machine):
- Finds the maximum-margin hyperplane; kernels handle non-linear boundaries
- Strong on small-to-medium, high-dimensional data; requires feature scaling; less interpretable

Random Forest:
- An ensemble of decision trees trained on bootstrap samples with random feature subsets; predicts by majority vote
- Handles mixed feature types without scaling, reports feature importance, and is robust to outliers
⚠️ Important: this part mixes code implementation with hand calculation. Bring a calculator to the exam!

Scenario: use sklearn's KNeighborsClassifier to predict whether a customer will buy a product.

Incomplete code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
from sklearn.neighbors import KNeighborsClassifier
# [A] Formulate an instance of the class (Set K=7)
# [B] Fit the instance on the data
# [C] Predict the expected value
print(classes[y_pred[0]])

Task: fill in blanks [A], [B], [C].
Complete Code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
from sklearn.neighbors import KNeighborsClassifier
# [A] Instantiate model with K=7
knn = KNeighborsClassifier(n_neighbors=7)
# [B] Fit the model on training data
knn.fit(X_train, y_train)
# [C] Predict on test set
y_pred = knn.predict(X_test)
# Output first prediction
print(classes[y_pred[0]])

Detailed Explanation:
[A] Instantiate Model:
- `knn = KNeighborsClassifier(n_neighbors=7)`
- `KNeighborsClassifier` is sklearn's KNN classifier class
- `n_neighbors=7` sets K to 7 (the 7 nearest neighbors vote)
- Other useful parameters: `metric='euclidean'` (default distance), `weights='uniform'` (default, equal weights) or `weights='distance'` (closer neighbors count more)

[B] Train Model:
- `knn.fit(X_train, y_train)` trains the model
- `X_train` is the feature matrix (shape: [samples, features]); `y_train` is the label vector (shape: [samples])

[C] Predict Results:
- `y_pred = knn.predict(X_test)` predicts on the test set
- `y_pred` is the prediction vector (shape: [test samples])

Complete Workflow Example:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1]) # 0 = Not Purchase, 1 = Purchase
classes = ['Not Purchase', 'Purchase']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# [A] Instantiate
knn = KNeighborsClassifier(n_neighbors=3) # or K=7 per requirements
# [B] Train
knn.fit(X_train, y_train)
# [C] Predict
y_pred = knn.predict(X_test)
# Output
print("Predictions:", y_pred)
print("First prediction:", classes[y_pred[0]])
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=classes))

Key Concepts Summary:
- Standard sklearn workflow: import → instantiate → fit → predict
- Key KNN parameters: `n_neighbors` (K value), `metric` (distance measure), `weights` (weighting strategy)
- `fit(X_train, y_train)` for training, `predict(X_test)` for prediction

Common Mistakes:
- Misspelling `KNeighborsClassifier`
- Misspelling `n_neighbors` (it is not `k=7`)
- Reversed argument order in `fit()` (must be `X_train, y_train`)
- Forgetting to pass `X_test` to `predict()`

Question: what happens to the decision boundary as K increases in KNN?
Effect of K on the Decision Boundary:

Small K (e.g., K=1):
- Highly irregular, jagged boundary that follows individual training points
- Low bias, high variance → prone to overfitting and sensitive to noise

Large K (e.g., K=100):
- Smooth boundary that approaches the majority-class prediction
- High bias, low variance → prone to underfitting

Optimal K Selection: scan a range of K values with cross-validation; an odd K avoids ties in binary classification.

Conclusion: increasing K smooths the decision boundary; the best K balances overfitting against underfitting.
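A common way to choose K in practice is a cross-validated scan; a sketch on synthetic data (the dataset and K range are my assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Try odd K values and keep the best cross-validated accuracy
best_k, best_score = 1, 0.0
for k in range(1, 30, 2):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(f"Best K = {best_k}, CV accuracy = {best_score:.3f}")
```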
Scenario: decide whether an email is spam.

Training data (10 emails total):

| Class | Total | Emails containing "Free" |
|---|---|---|
| Spam | 4 | 3 |
| Not Spam | 6 | 1 |

Task: a new email contains the word "Free". Is it Spam or Not Spam?

Compute the numerators of P(Spam | "Free") and P(Not Spam | "Free") and compare them.
Step 1: Calculate Prior Probability
$P(\text{Spam}) = \frac{4}{10} = 0.4$
$P(\text{Not Spam}) = \frac{6}{10} = 0.6$
Step 2: Calculate Likelihood Probability
$P(\text{"Free"} \mid \text{Spam}) = \frac{3}{4} = 0.75$
$P(\text{"Free"} \mid \text{Not Spam}) = \frac{1}{6} \approx 0.167$
Step 3: Calculate Posterior Numerators
Using Bayes' theorem: $P(Spam | "Free") \propto P("Free" | Spam) \times P(Spam)$
$\text{Spam Score} = P(\text{"Free"} \mid \text{Spam}) \times P(\text{Spam}) = 0.75 \times 0.4 = 0.30$
$\text{Not Spam Score} = P(\text{"Free"} \mid \text{Not Spam}) \times P(\text{Not Spam}) = 0.167 \times 0.6 \approx 0.10$
Step 4: Make Decision
Since $0.30 > 0.10$, the model predicts this email is Spam.
Conclusion: Emails containing "Free" are more likely to be spam.
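The arithmetic is easy to double-check in Python; the values below come straight from the table:

```python
# Priors
p_spam, p_not_spam = 4/10, 6/10
# Likelihoods of seeing "Free"
p_free_given_spam, p_free_given_not = 3/4, 1/6

spam_score = p_free_given_spam * p_spam          # 0.300
not_spam_score = p_free_given_not * p_not_spam   # ~0.100
print("Spam" if spam_score > not_spam_score else "Not Spam")  # Spam
```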
Scenario: a decision-tree node contains 8 samples: 6 positive (+) and 2 negative (-).

Task: compute the entropy of this node.

Entropy formula: $$\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

where $p_i$ is the proportion of samples in class $i$.
Step 1: Calculate Sample Proportions
$p_{\text{Positive}} = \frac{6}{8} = 0.75$
$p_{\text{Negative}} = \frac{2}{8} = 0.25$
Step 2: Substitute into Entropy Formula
$\text{Entropy} = -[p_+ \log_2(p_+) + p_- \log_2(p_-)]$
$= -[0.75 \times \log_2(0.75) + 0.25 \times \log_2(0.25)]$
Step 3: Use Calculator to Calculate Logarithms (Keep 4 decimals)
$\log_2(0.75) = \frac{\log(0.75)}{\log(2)} \approx \frac{-0.1249}{0.3010} \approx -0.4150$
$\log_2(0.25) = \log_2(\frac{1}{4}) = -\log_2(4) = -2$
Step 4: Calculate Final Result
Term 1: $0.75 \times (-0.4150) = -0.3112$
Term 2: $0.25 \times (-2) = -0.50$
Entropy Value:
$\text{Entropy} = -(-0.3112 - 0.50) = -(-0.8112) = 0.8112 \approx \boxed{0.811}$
Interpretation: an entropy of 0.811 (the binary maximum is 1.0, reached at a 50/50 split) means the node is moderately impure; a pure node would have entropy 0.
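A tiny helper to confirm the hand calculation (the function is mine, not from the paper):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(f"{entropy([6, 2]):.4f}")  # 0.8113
```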
Scenario: compute information gain to decide which feature to split on.

Dataset (10 samples, predicting whether to play tennis):
| Outlook | Temperature | Play Tennis |
|---|---|---|
| Sunny | Hot | No |
| Sunny | Hot | No |
| Overcast | Hot | Yes |
| Rain | Mild | Yes |
| Rain | Cool | Yes |
| Rain | Cool | No |
| Overcast | Cool | Yes |
| Sunny | Mild | No |
| Sunny | Cool | Yes |
| Rain | Mild | Yes |
任务: 计算 "Outlook" 特征的信息增益。
👉 点击查看完整计算过程
第一步:计算总体熵(Root Entropy)
总样本:10 个,其中 Yes = 6,No = 4
$p_{\text{Yes}} = \frac{6}{10} = 0.6, \quad p_{\text{No}} = \frac{4}{10} = 0.4$
$H_{\text{total}} = -[0.6 \times \log_2(0.6) + 0.4 \times \log_2(0.4)]$
使用计算器:
$H_{\text{total}} = -[0.6 \times (-0.737) + 0.4 \times (-1.322)]$ $= -[-0.442 - 0.529] = -(-0.971) = \boxed{0.971}$
第二步:按 Outlook 分组计算加权熵
| Outlook | Total | Yes | No |
|---|---|---|---|
| Sunny | 4 | 1 | 3 |
| Overcast | 2 | 2 | 0 |
| Rain | 4 | 3 | 1 |
Sunny 组的熵: $H_{\text{Sunny}} = -\left[\frac{1}{4} \log_2\left(\frac{1}{4}\right) + \frac{3}{4} \log_2\left(\frac{3}{4}\right)\right]$
$H_{\text{Sunny}} = -[0.25 \times (-2) + 0.75 \times (-0.415)]$ $= -[-0.5 - 0.311] = 0.811$
Overcast 组的熵: 全是 Yes(纯净节点) $H_{\text{Overcast}} = 0$
Rain 组的熵: $H_{\text{Rain}} = -\left[\frac{3}{4} \log_2\left(\frac{3}{4}\right) + \frac{1}{4} \log_2\left(\frac{1}{4}\right)\right]$ $= -[0.75 \times (-0.415) + 0.25 \times (-2)]$ $= -[-0.311 - 0.5] = 0.811$
加权平均熵: $H_{\text{weighted}} = \frac{4}{10} \times 0.811 + \frac{2}{10} \times 0 + \frac{4}{10} \times 0.811$ $= 0.324 + 0 + 0.324 = 0.648$
第三步:计算信息增益(Information Gain)
$\text{IG(Outlook)} = H_{\text{total}} - H_{\text{weighted}}$ $= 0.971 - 0.648 = \boxed{0.323}$
解释:
Step 1: Calculate Root Entropy
Total samples: 10, where Yes = 6, No = 4
$p_{\text{Yes}} = \frac{6}{10} = 0.6, \quad p_{\text{No}} = \frac{4}{10} = 0.4$
$H_{\text{total}} = -[0.6 \times \log_2(0.6) + 0.4 \times \log_2(0.4)]$
Using calculator:
$H_{\text{total}} = -[0.6 \times (-0.737) + 0.4 \times (-1.322)]$ $= -[-0.442 - 0.529] = -(-0.971) = \boxed{0.971}$
Step 2: Calculate Weighted Entropy by Outlook Groups
| Outlook | Total | Yes | No |
|---|---|---|---|
| Sunny | 4 | 1 | 3 |
| Overcast | 2 | 2 | 0 |
| Rain | 4 | 3 | 1 |
Entropy of Sunny: $H_{\text{Sunny}} = -\left[\frac{1}{4} \log_2\left(\frac{1}{4}\right) + \frac{3}{4} \log_2\left(\frac{3}{4}\right)\right]$
$H_{\text{Sunny}} = -[0.25 \times (-2) + 0.75 \times (-0.415)]$ $= -[-0.5 - 0.311] = 0.811$
Entropy of Overcast: All Yes (pure node) $H_{\text{Overcast}} = 0$
Entropy of Rain: $H_{\text{Rain}} = -\left[\frac{3}{4} \log_2\left(\frac{3}{4}\right) + \frac{1}{4} \log_2\left(\frac{1}{4}\right)\right]$ $= -[0.75 \times (-0.415) + 0.25 \times (-2)]$ $= -[-0.311 - 0.5] = 0.811$
Weighted Average Entropy: $H_{\text{weighted}} = \frac{4}{10} \times 0.811 + \frac{2}{10} \times 0 + \frac{4}{10} \times 0.811$ $= 0.324 + 0 + 0.324 = 0.648$
Step 3: Calculate Information Gain
$\text{IG(Outlook)} = H_{\text{total}} - H_{\text{weighted}}$ $= 0.971 - 0.648 = \boxed{0.323}$
Interpretation: IG(Outlook) = 0.323 is the entropy reduction achieved by splitting on Outlook; compute the IG of every candidate feature and split on the largest.
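Reusing the entropy helper from the previous exercise, the information gain can be checked in a few lines (the group counts come from the table above):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

h_total = entropy([6, 4])  # root node: 6 Yes, 4 No
# Outlook groups: Sunny (1 Yes, 3 No), Overcast (2, 0), Rain (3, 1)
h_weighted = 0.4 * entropy([1, 3]) + 0.2 * entropy([2, 0]) + 0.4 * entropy([3, 1])
print(f"IG(Outlook) = {h_total - h_weighted:.3f}")  # 0.322 (0.323 with the rounded intermediates above)
```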
Scenario: compute the Gini index of a decision-tree node.

Question: a node holds 40 samples: 25 of class A and 15 of class B. Compute the Gini index.
Gini Formula: $$\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$$
Step 1: Calculate Class Probabilities
$p_A = \frac{25}{40} = 0.625$ $p_B = \frac{15}{40} = 0.375$
Step 2: Substitute into Gini Formula
$\text{Gini} = 1 - (p_A^2 + p_B^2)$ $= 1 - (0.625^2 + 0.375^2)$ $= 1 - (0.3906 + 0.1406)$ $= 1 - 0.5312$ $= \boxed{0.4688} \approx 0.469$
Interpretation: Gini = 0.469 is close to the binary maximum of 0.5 (reached at a 50/50 split), so the node is quite impure; a pure node has Gini 0.

Gini vs Entropy:
- Both measure node impurity; for binary classes Gini ranges over [0, 0.5] and entropy over [0, 1]
- Gini is cheaper to compute (no logarithms); CART uses Gini, while ID3/C4.5 use entropy
- In practice the two criteria usually pick the same splits
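And the Gini calculation, as a short sketch (the helper is mine):

```python
def gini(counts):
    """Gini impurity of a list of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(f"{gini([25, 15]):.4f}")  # 0.4688
```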
Scenario: decide whether to play tennis from two features (Outlook and Temperature).

Training data (8 samples):
| Outlook | Temperature | Play |
|---|---|---|
| Sunny | Hot | No |
| Sunny | Hot | No |
| Overcast | Hot | Yes |
| Rain | Mild | Yes |
| Rain | Cool | Yes |
| Overcast | Cool | Yes |
| Sunny | Mild | No |
| Rain | Hot | Yes |
Test sample: Outlook = Sunny, Temperature = Cool. Will they play tennis?
第一步:统计训练数据
先验概率: $P(\text{Yes}) = \frac{5}{8} = 0.625$ $P(\text{No}) = \frac{3}{8} = 0.375$
第二步:计算条件概率
对于 Yes 类:
$P(\text{Sunny} | \text{Yes}) = \frac{0}{5} = 0$ $P(\text{Cool} | \text{Yes}) = \frac{2}{5} = 0.4$
对于 No 类:
$P(\text{Sunny} | \text{No}) = \frac{3}{3} = 1.0$ $P(\text{Cool} | \text{No}) = \frac{0}{3} = 0$
第三步:应用 Naive Bayes
$P(\text{Yes} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{Yes}) \times P(\text{Cool} | \text{Yes}) \times P(\text{Yes})$ $= 0 \times 0.4 \times 0.625 = 0$
$P(\text{No} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{No}) \times P(\text{Cool} | \text{No}) \times P(\text{No})$ $= 1.0 \times 0 \times 0.375 = 0$
问题:零概率问题!
两个类别的概率都是 0,无法做出判断。这是因为训练数据中没有出现 "Sunny + Cool" 的组合。
解决方案:Laplace 平滑(拉普拉斯平滑)
修正公式: $P(\text{Feature} | \text{Class}) = \frac{\text{count} + 1}{\text{total} + \text{num_features}}$
应用平滑后:
对于 Yes 类(特征总数 = 3: Sunny, Overcast, Rain): $P(\text{Sunny} | \text{Yes}) = \frac{0 + 1}{5 + 3} = \frac{1}{8} = 0.125$ $P(\text{Cool} | \text{Yes}) = \frac{2 + 1}{5 + 3} = \frac{3}{8} = 0.375$
对于 No 类: $P(\text{Sunny} | \text{No}) = \frac{3 + 1}{3 + 3} = \frac{4}{6} = 0.667$ $P(\text{Cool} | \text{No}) = \frac{0 + 1}{3 + 3} = \frac{1}{6} = 0.167$
重新计算:
$\text{Yes Score} = 0.125 \times 0.375 \times 0.625 = 0.0293$ $\text{No Score} = 0.667 \times 0.167 \times 0.375 = 0.0418$
结论:因为 $0.0418 > 0.0293$,预测为 No(不打网球)。
关键要点:
Step 1: Statistics from Training Data
Prior Probability: $P(\text{Yes}) = \frac{5}{8} = 0.625$ $P(\text{No}) = \frac{3}{8} = 0.375$
Step 2: Calculate Conditional Probabilities
For Yes class:
$P(\text{Sunny} | \text{Yes}) = \frac{0}{5} = 0$ $P(\text{Cool} | \text{Yes}) = \frac{2}{5} = 0.4$
For No class:
$P(\text{Sunny} | \text{No}) = \frac{3}{3} = 1.0$ $P(\text{Cool} | \text{No}) = \frac{0}{3} = 0$
Step 3: Apply Naive Bayes
$P(\text{Yes} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{Yes}) \times P(\text{Cool} | \text{Yes}) \times P(\text{Yes})$ $= 0 \times 0.4 \times 0.625 = 0$
$P(\text{No} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{No}) \times P(\text{Cool} | \text{No}) \times P(\text{No})$ $= 1.0 \times 0 \times 0.375 = 0$
Problem: Zero Probability Issue!
Both classes have probability 0, making classification impossible. This is because "Sunny + Cool" combination never appeared in training data.
Solution: Laplace Smoothing
Corrected formula: $P(\text{Feature} | \text{Class}) = \frac{\text{count} + 1}{\text{total} + \text{num_features}}$
After applying smoothing:
For Yes class (total features = 3: Sunny, Overcast, Rain): $P(\text{Sunny} | \text{Yes}) = \frac{0 + 1}{5 + 3} = \frac{1}{8} = 0.125$ $P(\text{Cool} | \text{Yes}) = \frac{2 + 1}{5 + 3} = \frac{3}{8} = 0.375$
For No class: $P(\text{Sunny} | \text{No}) = \frac{3 + 1}{3 + 3} = \frac{4}{6} = 0.667$ $P(\text{Cool} | \text{No}) = \frac{0 + 1}{3 + 3} = \frac{1}{6} = 0.167$
Recalculate:
$\text{Yes Score} = 0.125 \times 0.375 \times 0.625 = 0.0293$ $\text{No Score} = 0.667 \times 0.167 \times 0.375 = 0.0418$
Conclusion: Since $0.0418 > 0.0293$, prediction is No (Don't play).
Key Takeaways:
- Naive Bayes multiplies conditional probabilities, so a single zero count zeroes out the whole product
- Laplace smoothing (add 1 to every count) guarantees non-zero probabilities
- After smoothing, the comparison becomes meaningful again and yields a usable prediction
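A short sketch verifying the smoothed scores (counts are taken from the worked example; the helper is mine):

```python
def laplace(count, class_total, n_values):
    """Laplace-smoothed conditional probability: (count+1)/(total+n_values)."""
    return (count + 1) / (class_total + n_values)

p_yes, p_no = 5/8, 3/8
yes_score = laplace(0, 5, 3) * laplace(2, 5, 3) * p_yes  # Sunny|Yes * Cool|Yes * P(Yes)
no_score = laplace(3, 3, 3) * laplace(0, 3, 3) * p_no    # Sunny|No  * Cool|No  * P(No)
print(f"Yes: {yes_score:.4f}, No: {no_score:.4f}")  # Yes: 0.0293, No: 0.0417 (0.0418 above, from rounding)
print("Prediction:", "Yes" if yes_score > no_score else "No")  # No
```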
| Pitfall | Common Mistake | Correct Approach |
|---|---|---|
| Mutable default argument | def f(x, lst=[]) | Default to None and initialize inside the function |
| List vs tuple | Assuming tuples are also mutable | Lists are mutable; tuples are immutable |
| Loop variable scope | Assuming i disappears after the loop | In Python, i still exists after the loop ends |
| List slicing | Assuming lst[:] is a reference | lst[:] creates a shallow copy |
Quick review checklist:
- Pandas: use `isnull()` to locate missing values before filling them
- Naive Bayes: multiply the prior by the likelihoods; watch for the zero-probability problem (use Laplace smoothing)
- Decision Tree: know the entropy, information gain, and Gini formulas by heart
Source: final past paper Q3(b) - write out two data normalization formulas.

Purpose: rescale data to the [0, 1] range
$$X_{new} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
Explanation:
- Subtracting $X_{min}$ and dividing by the range maps the minimum to 0 and the maximum to 1

Use Cases:
- Neural networks, image pixel values, and other algorithms that expect bounded inputs

Drawbacks:
- Highly sensitive to outliers: a single extreme value compresses all other values into a narrow band
Purpose: transform data to mean = 0 and standard deviation = 1
$$X_{new} = \frac{X - \mu}{\sigma}$$
Where:
- $\mu$ is the mean of the feature and $\sigma$ its standard deviation

Explanation:
- Each value becomes the number of standard deviations it lies from the mean (its z-score)

Use Cases:
- SVM, KNN, logistic regression, PCA; data containing outliers

Advantages:
- Far less sensitive to outliers than Min-Max; no bounded output range is assumed
| Method | Formula | Output Range | Outlier Sensitivity | Use Cases |
|---|---|---|---|---|
| Min-Max | $\frac{X - X_{min}}{X_{max} - X_{min}}$ | [0, 1] | High | Neural networks, bounded data |
| Z-Score | $\frac{X - \mu}{\sigma}$ | Unbounded | Low | SVM, KNN, data with outliers |
Good luck on the exam! 🎓
Last updated: January 24, 2026 | Fully upgraded edition with 50+ practice questions (including past-paper Q1-Q4)