ABW505 Python与预测分析期末终极题库:代码填空 + 算法手算 (含真题解析)
50 分钟阅读
🎓 ABW505 考试结构(官方)
| 题号 | 分值 | 题型 | 考试范围 |
|---|---|---|---|
| Q1 | 20 | Type 2 | Python入门:核心对象、变量、输入输出、列表、元组、函数、循环、决策结构 |
| Q2 | 30 | Type 3 | 5选3:决策结构、重复结构、布尔逻辑、列表/元组、函数 |
| Q3 | 25 | Type 1,3 | Pandas、数据预处理、编码器、SVM、随机森林 |
| Q4 | 25 | Type 1,3,4 | 朴素贝叶斯、决策树、数据预处理(带计算器!) |
题型说明:
- Type 1: 理论题
- Type 2: 解读代码,计算结果
- Type 3: 解读结果,编写代码
- Type 4: 计算题(带计算器)
🟢 第一部分:Python 核心与逻辑 (Q1 & Q2)
🧐 题型一:代码追踪
题型说明: 阅读代码,预测输出结果。考察变量作用域、可变性及逻辑流。
Q1.1: 列表的陷阱 (List Mutability)
def extend_list(val, list=[]):
list.append(val)
return list
list1 = extend_list(10)
list2 = extend_list(123, [])
list3 = extend_list('a')
print(f"list1 = {list1}")
print(f"list2 = {list2}")
print(f"list3 = {list3}")👉 点击查看运行结果与解析
Output:
list1 = [10, 'a']
list2 = [123]
list3 = [10, 'a']
解析:
这是 Python 面试经典题。函数的默认参数 list=[] 是在函数定义时创建的,且只创建一次。
- 第一次调用
extend_list(10)→ 默认参数初始化为[],然后 append 10 →[10] - 第二次调用
extend_list(123, [])→ 传入新列表[],与默认参数无关 →[123] - 第三次调用
extend_list('a')→ 再次使用已被修改过的默认参数[10]→[10, 'a']
关键概念:Python 默认参数在函数定义时评估,可变对象会跨调用保留状态。
Analysis:
This is a classic Python interview question. The default parameter list=[] is created once at function definition time, not every time the function is called.
- First call
extend_list(10)→ Default parameter initialized as[], then append 10 →[10] - Second call
extend_list(123, [])→ Passes a new list[], independent of default →[123] - Third call
extend_list('a')→ Reuses the already-modified default parameter[10]→[10, 'a']
Key Concept: Python default parameters are evaluated at function definition time. Mutable objects retain their state across function calls.
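补充:避免这一陷阱的常见写法是用 None 作为默认值(示例代码,函数名为自拟,仅供参考):
def extend_list_safe(val, lst=None):
    # Use a None sentinel so a fresh list is created on every call
    if lst is None:
        lst = []
    lst.append(val)
    return lst
print(extend_list_safe(10))   # [10]
print(extend_list_safe('a'))  # ['a'] -- no state carried over between calls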
Q1.2: 循环与逻辑控制
x = 0
for i in range(5):
if i == 2:
continue
if i == 4:
break
x += i
print(x)
👉 点击查看运行结果与解析
Output:
4
执行追踪:
| 轮次 | i | 条件 | 操作 | x 值 |
|---|---|---|---|---|
| 1 | 0 | - | x += 0 | 0 |
| 2 | 1 | - | x += 1 | 1 |
| 3 | 2 | i==2 | continue(跳过) | 1 |
| 4 | 3 | - | x += 3 | 4 |
| 5 | 4 | i==4 | break(终止) | 4 |
结果: x = 4
Execution Trace:
| Iteration | i | Condition | Operation | x Value |
|---|---|---|---|---|
| 1 | 0 | - | x += 0 | 0 |
| 2 | 1 | - | x += 1 | 1 |
| 3 | 2 | i==2 | continue (skip) | 1 |
| 4 | 3 | - | x += 3 | 4 |
| 5 | 4 | i==4 | break (exit) | 4 |
Result: x = 4
Q1.3: 元组拆包与切片
data = (10, 20, 30, 40, 50)
a, *b, c = data
print(a)
print(b)
print(c)
👉 点击查看运行结果与解析
Output:
10
[20, 30, 40]
50
解析:
这是 Python 3 的拓展解包(Extended Unpacking)语法:
a拿走第一个元素 →10c拿走最后一个元素 →50*b拿走中间剩余的所有元素,并打包成一个列表 →[20, 30, 40]
注意:*b 收集的是列表,不是元组。
Analysis:
This uses Python 3's Extended Unpacking syntax:
atakes the first element →10ctakes the last element →50*bcollects all remaining middle elements and packs them as a list →[20, 30, 40]
Note: *b collects into a list, not a tuple.
Q1.5: Numpy 矩阵操作(期末真题 Q1a)
要求: 编写 Numpy 程序完成以下操作:
- 创建一个 4x4 的矩阵,数值范围从 1 到 16。
- 创建一个长度为 10 的空向量(全0),并将第 7 个值更新为 10。
- 创建一个 8x8 的矩阵,并用 0 和 1 填充成"棋盘模式"。
👉 点击查看参考答案
import numpy as np
# (i) 4x4 matrix ranging from 1 to 16
matrix_4x4 = np.arange(1, 17).reshape(4, 4)
print("4x4 Matrix:\n", matrix_4x4)
# (ii) Null vector of size 10, update 7th value to 10
# Note: Python indexing starts from 0, so 7th value is index 6
null_vector = np.zeros(10)
null_vector[6] = 10
print("\nNull Vector:\n", null_vector)
# (iii) 8x8 Checkerboard pattern
checkerboard = np.zeros((8, 8), dtype=int)
# Use slicing: odd rows even columns, even rows odd columns set to 1
checkerboard[1::2, ::2] = 1
checkerboard[::2, 1::2] = 1
print("\nCheckerboard:\n", checkerboard)Run Output:
4x4 Matrix:
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]
[13 14 15 16]]
Null Vector:
[ 0. 0. 0. 0. 0. 0. 10. 0. 0. 0.]
Checkerboard:
[[0 1 0 1 0 1 0 1]
[1 0 1 0 1 0 1 0]
[0 1 0 1 0 1 0 1]
[1 0 1 0 1 0 1 0]
[0 1 0 1 0 1 0 1]
[1 0 1 0 1 0 1 0]
[0 1 0 1 0 1 0 1]
[1 0 1 0 1 0 1 0]]
解析:
(i) 创建 4x4 矩阵:
np.arange(1, 17)创建从 1 到 16 的数组(不包括17).reshape(4, 4)将其重塑为 4x4 矩阵
(ii) 空向量与索引:
np.zeros(10)创建全0向量- 关键: Python 索引从 0 开始,第 7 个值的索引是 6
null_vector[6] = 10更新第 7 个位置
(iii) 棋盘模式:
- 初始化全0矩阵
checkerboard[1::2, ::2] = 1→ 奇数行的偶数列置为1checkerboard[::2, 1::2] = 1→ 偶数行的奇数列置为1- 切片语法:
[start:stop:step]
考点:
- Numpy 数组创建与重塑
- 索引规则(0-based vs 1-based)
- 切片操作的步长参数
Analysis:
(i) Create 4x4 Matrix:
np.arange(1, 17)creates array from 1 to 16 (excluding 17).reshape(4, 4)reshapes it into 4x4 matrix
(ii) Null Vector & Indexing:
np.zeros(10)creates zero vector- Key Point: Python uses 0-based indexing, so 7th value is at index 6
null_vector[6] = 10updates the 7th position
(iii) Checkerboard Pattern:
- Initialize all-zero matrix
checkerboard[1::2, ::2] = 1→ Odd rows, even columns set to 1checkerboard[::2, 1::2] = 1→ Even rows, odd columns set to 1- Slicing syntax:
[start:stop:step]
Key Concepts:
- Numpy array creation and reshaping
- Indexing rules (0-based vs 1-based)
- Slicing with step parameter
Q1.6: Numpy 向量反转(期末真题 Q1b)
要求: 创建一个向量,包含从 10 到 49 的数值,并将其反转。
👉 点击查看参考答案
import numpy as np
# Create vector from 10 to 49
vector = np.arange(10, 50)
print("Original Vector:\n", vector)
# Reverse it using slicing
vector_reversed = vector[::-1]
print("\nReversed Vector:\n", vector_reversed)Run Output:
Original Vector:
[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]
Reversed Vector:
[49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26
25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10]
解析:
方法1: 切片反转(推荐):
[::-1]是Python切片的反转语法start:stop:step,步长为-1表示从后往前
方法2: np.flip()函数:
vector_reversed = np.flip(vector)
方法3: np.flipud()函数:
vector_reversed = np.flipud(vector)
考点:
- Python切片反转语法
[::-1] - Numpy数组操作函数
Analysis:
Method 1: Slicing Reversal (Recommended):
[::-1]is Python's slice reversal syntaxstart:stop:step, step of -1 means traverse backwards
Method 2: np.flip() function:
vector_reversed = np.flip(vector)
Method 3: np.flipud() function:
vector_reversed = np.flipud(vector)
Key Concepts:
- Python slice reversal syntax
[::-1] - Numpy array manipulation functions
Q1.4: 函数参数 - *args 和 **kwargs
def process_data(*args, **kwargs):
print(f"Positional args: {args}")
print(f"Keyword args: {kwargs}")
total = sum(args)
multiplier = kwargs.get('multiplier', 1)
return total * multiplier
result = process_data(1, 2, 3, multiplier=10, label='test')
print(f"Result: {result}")👉 点击查看运行结果与解析
Output:
Positional args: (1, 2, 3)
Keyword args: {'multiplier': 10, 'label': 'test'}
Result: 60
解析:
*args(任意数量的位置参数):
- 收集所有位置参数到一个元组中
- 在这里:
args = (1, 2, 3) - 可以像普通元组一样使用:迭代、索引、sum() 等
**kwargs(任意数量的关键字参数):
- 收集所有关键字参数到一个字典中
- 在这里:
kwargs = {'multiplier': 10, 'label': 'test'} - 使用
.get()方法安全访问,避免 KeyError
执行流程:
sum(args)→sum((1, 2, 3))→6kwargs.get('multiplier', 1)→ 返回10(如果没有则返回默认值 1)6 * 10 = 60
Analysis:
*args (Variable Positional Arguments):
- Collects all positional arguments into a tuple
- Here:
args = (1, 2, 3) - Can be used like a normal tuple: iterate, index, sum(), etc.
**kwargs (Variable Keyword Arguments):
- Collects all keyword arguments into a dictionary
- Here:
kwargs = {'multiplier': 10, 'label': 'test'} - Use
.get()method for safe access, avoiding KeyError
Execution Flow:
sum(args)→sum((1, 2, 3))→6kwargs.get('multiplier', 1)→ Returns10(default 1 if not present)6 * 10 = 60
💻 题型二:编程实战
题型说明: 根据要求写出代码。
Q2.1: 列表推导式
需求: 给定一个数字列表,创建一个新列表,只包含偶数乘以 2 的结果。
Input: nums = [1, 2, 3, 4, 5]
Solution Code:
nums = [1, 2, 3, 4, 5]
result = [x * 2 for x in nums if x % 2 == 0]
print(result)
Run Output:
[4, 8]
解析:
- 遍历
nums中的每个元素x - 筛选条件:
x % 2 == 0(偶数) - 变换:
x * 2(乘以 2) - 偶数有 2 和 4,所以结果是
[4, 8]
Analysis:
- Iterate through each element
xinnums - Filter condition:
x % 2 == 0(even numbers) - Transform:
x * 2(multiply by 2) - Even numbers are 2 and 4, so result is
[4, 8]
Q2.2: 嵌套列表操作
需求: 给定一个嵌套列表,将其展平为单层列表。
Input: nested = [[1, 2], [3, 4, 5], [6]]
Solution Code:
nested = [[1, 2], [3, 4, 5], [6]]
# Method 1: List comprehension
flat = [item for sublist in nested for item in sublist]
print(f"Method 1: {flat}")
# Method 2: Using sum() with empty list
flat2 = sum(nested, [])
print(f"Method 2: {flat2}")
# Method 3: Traditional loop
flat3 = []
for sublist in nested:
for item in sublist:
flat3.append(item)
print(f"Method 3: {flat3}")Run Output:
Method 1: [1, 2, 3, 4, 5, 6]
Method 2: [1, 2, 3, 4, 5, 6]
Method 3: [1, 2, 3, 4, 5, 6]
解析:
方法 1(列表推导式):
- 外层循环:
for sublist in nested→ 遍历每个子列表 - 内层循环:
for item in sublist→ 遍历子列表中的每个元素 - 这是最 Pythonic 的写法,推荐使用
方法 2(sum函数):
sum(nested, [])→ 从空列表开始,依次将每个子列表相加[] + [1,2] + [3,4,5] + [6]→ 最终得到[1,2,3,4,5,6]- 简洁但可读性略差
方法 3(传统循环):
- 双重 for 循环逐个添加元素
- 代码最长但最容易理解
Analysis:
Method 1 (List Comprehension):
- Outer loop:
for sublist in nested→ Iterate through each sublist - Inner loop:
for item in sublist→ Iterate through each element in sublist - This is the most Pythonic way, recommended
Method 2 (sum function):
sum(nested, [])→ Starts with empty list, concatenates each sublist[] + [1,2] + [3,4,5] + [6]→ Results in[1,2,3,4,5,6]- Concise but less readable
Method 3 (Traditional Loop):
- Nested for loops add elements one by one
- Longest code but easiest to understand
Q2.3: 字典统计
需求: 统计字符串中每个字符出现的频率。
Input: text = "hello"
Solution Code:
text = "hello"
freq = {}
for char in text:
if char in freq:
freq[char] += 1
else:
freq[char] = 1
print(freq)
Run Output:
{'h': 1, 'e': 1, 'l': 2, 'o': 1}
解析:
- 遍历字符串中的每个字符
- 如果字符已在字典中,计数加 1
- 如果字符不在字典中,初始化为 1
- 最终
'l'出现 2 次(因为有两个 l),其他字符各出现 1 次
进阶写法(使用 defaultdict):
Analysis:
- Iterate through each character in the string
- If character exists in dictionary, increment count by 1
- If character doesn't exist, initialize it as 1
- Finally, 'l' appears 2 times, other characters appear 1 time each
Advanced Implementation (using defaultdict):
from collections import defaultdict
text = "hello"
freq = defaultdict(int)
for char in text:
freq[char] += 1
print(dict(freq))
Q2.4: Lambda 与 Filter 操作
需求: 从数字列表中,使用 lambda 和 filter 筛选出大于 5 的数字。
Input: numbers = [2, 8, 3, 12, 5, 7]
Solution Code:
numbers = [2, 8, 3, 12, 5, 7]
# Using filter() with lambda
result = list(filter(lambda x: x > 5, numbers))
print(f"Numbers > 5: {result}")
# Equivalent list comprehension
result2 = [x for x in numbers if x > 5]
print(f"List comprehension: {result2}")Run Output:
Numbers > 5: [8, 12, 7]
List comprehension: [8, 12, 7]
解析:
Lambda 函数:
lambda x: x > 5→ 创建一个匿名函数,检查 x 是否大于 5- 返回布尔值
True或False
Filter 函数:
filter(function, iterable)→ 对可迭代对象应用函数,保留返回 True 的元素- 返回一个 filter 对象(需要用
list()转换)
两种方法比较:
filter()+lambda:函数式编程风格- 列表推导式:更 Pythonic,可读性更好,推荐使用
Analysis:
Lambda Function:
lambda x: x > 5→ Creates an anonymous function that checks if x > 5- Returns boolean
TrueorFalse
Filter Function:
filter(function, iterable)→ Applies function to iterable, keeps elements that return True- Returns a filter object (needs
list()conversion)
Comparison:
filter()+lambda: Functional programming style- List comprehension: More Pythonic, better readability, recommended
Q2.6: While 循环 - 猜数字游戏逻辑(复习题)
要求: 修复并解释以下"猜数字"代码逻辑。目标:生成 1-9 的随机数,允许猜 4 次。
👉 点击查看参考代码
import random
target_num = random.randint(1, 9)
guess_num = 0
guess_counter = 0
max_attempts = 4
# Loop condition: haven't guessed correctly AND within attempt limit
while target_num != guess_num and guess_counter < max_attempts:
guess_counter += 1
# In real exam, you'd use input()
guess_num = int(input(f"Attempt {guess_counter}/{max_attempts}, Enter a number (1-9): "))
if target_num == guess_num:
print(f"🎉 Well guessed! The number was {target_num}")
break
elif guess_counter < max_attempts:
if guess_num < target_num:
print("Too low! Try again.")
else:
print("Too high! Try again.")
else:
print(f"❌ Out of chances! The number was {target_num}")考点解析:
1. While 循环终止条件:
while target_num != guess_num and guess_counter < max_attempts:- 两个条件必须同时满足才继续循环
- 猜对了或次数用完都会退出
2. break 语句:
- 提前终止循环,不等待条件检查
- 适用于猜对的情况
3. 变量自增:
guess_counter += 1- 等价于
guess_counter = guess_counter + 1
4. 边界条件检查:
guess_counter < max_attempts确保不会超过限制- 最后一次猜测时不再显示"Try again"提示
常见错误:
- 忘记递增
guess_counter,导致死循环 - 终止条件写错(如用
or代替and) - 在循环外忘记检查是否猜对
Key Concepts:
1. While Loop Termination Condition:
while target_num != guess_num and guess_counter < max_attempts:- Both conditions must be true to continue looping
- Exits when guessed correctly OR out of attempts
2. break Statement:
- Terminates loop early without waiting for condition check
- Used when guess is correct
3. Variable Increment:
guess_counter += 1- Equivalent to
guess_counter = guess_counter + 1
4. Boundary Condition Check:
guess_counter < max_attemptsensures no overflow- Don't show "Try again" on last attempt
Common Mistakes:
- Forgetting to increment
guess_counter, causing infinite loop - Wrong termination condition (using
orinstead ofand) - Not checking if guessed correctly outside loop
Q2.7: Matplotlib 可视化 - 水平条形图(复习题)
要求: 根据以下数据绘制水平条形图:
Data: Moscow (70), Tokyo (60), Washington (75), Beijing (50), Delhi (40)
👉 点击查看参考代码
import matplotlib.pyplot as plt
# 1. Prepare data
cities = ['Moscow', 'Tokyo', 'Washington', 'Beijing', 'Delhi']
happiness_index = [70, 60, 75, 50, 40]
# 2. Create figure
plt.figure(figsize=(10, 5))
# 3. Draw horizontal bar chart (barh)
plt.barh(cities, happiness_index, color='skyblue', edgecolor='navy')
# 4. Add labels and title
plt.xlabel('Happiness Index', fontsize=12)
plt.ylabel('City', fontsize=12)
plt.title('Happiness Index by City', fontsize=14, fontweight='bold')
# 5. Add grid for better readability
plt.grid(axis='x', linestyle='--', alpha=0.7)
# 6. Display
plt.tight_layout()
plt.show()
代码解析:
1. 数据准备:
- 使用列表存储城市名称和幸福指数
- 顺序要对应(城市和数值按位置匹配)
2. 创建图形:
plt.figure(figsize=(10, 5))设置画布大小(宽10英寸,高5英寸)
3. 绘制水平条形图:
plt.barh()是横向条形图(h = horizontal)plt.bar()是纵向条形图color='skyblue'设置填充色edgecolor='navy'设置边框色
4. 添加标签:
xlabel()- X轴标签ylabel()- Y轴标签title()- 图表标题fontsize- 字体大小fontweight='bold'- 加粗
5. 网格线:
plt.grid(axis='x')只显示X轴网格linestyle='--'虚线alpha=0.7透明度70%
6. 显示图形:
plt.tight_layout()自动调整布局,避免标签重叠plt.show()显示图形
常见考点:
- 区分
bar()和barh() - 标签参数的拼写(xlabel, ylabel, title)
- 颜色参数名称(color, edgecolor)
Code Analysis:
1. Data Preparation:
- Use lists to store city names and happiness indices
- Order must match (cities and values aligned by position)
2. Create Figure:
plt.figure(figsize=(10, 5))sets canvas size (10 inches wide, 5 inches tall)
3. Draw Horizontal Bar Chart:
plt.barh()is for horizontal bars (h = horizontal)plt.bar()is for vertical barscolor='skyblue'sets fill coloredgecolor='navy'sets border color
4. Add Labels:
xlabel()- X-axis labelylabel()- Y-axis labeltitle()- Chart titlefontsize- Font sizefontweight='bold'- Bold text
5. Grid Lines:
plt.grid(axis='x')shows only X-axis gridlinestyle='--'dashed linealpha=0.770% transparency
6. Display Figure:
plt.tight_layout()auto-adjusts layout to prevent label overlapplt.show()displays the figure
Common Exam Points:
- Distinguish between
bar()andbarh() - Spelling of label parameters (xlabel, ylabel, title)
- Color parameter names (color, edgecolor)
Q2.5: 集合操作
需求: 找出两个集合的交集和差集。
Input:
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}
Solution Code:
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}
# Intersection - common elements
intersection = set_a & set_b # or set_a.intersection(set_b)
print(f"Intersection: {intersection}")
# Difference - elements in A but not in B
difference = set_a - set_b # or set_a.difference(set_b)
print(f"A - B: {difference}")
# Symmetric difference - elements in either but not both
sym_diff = set_a ^ set_b # or set_a.symmetric_difference(set_b)
print(f"Symmetric difference: {sym_diff}")
# Union - all unique elements
union = set_a | set_b # or set_a.union(set_b)
print(f"Union: {union}")Run Output:
Intersection: {4, 5}
A - B: {1, 2, 3}
Symmetric difference: {1, 2, 3, 6, 7, 8}
Union: {1, 2, 3, 4, 5, 6, 7, 8}
解析:
集合运算符:
&(交集):两个集合共有的元素 →{4, 5}-(差集):在 A 中但不在 B 中 →{1, 2, 3}^(对称差集):在 A 或 B 中,但不同时在两者中 →{1, 2, 3, 6, 7, 8}|(并集):A 和 B 的所有唯一元素 →{1, 2, 3, 4, 5, 6, 7, 8}
应用场景:
- 数据去重、查找共同项、排除重复等
Analysis:
Set Operators:
&(intersection): Elements common to both sets →{4, 5}-(difference): Elements in A but not in B →{1, 2, 3}^(symmetric difference): Elements in either but not both →{1, 2, 3, 6, 7, 8}|(union): All unique elements from both →{1, 2, 3, 4, 5, 6, 7, 8}
Use Cases:
- Data deduplication, finding common items, excluding duplicates
🔵 第二部分:数据科学与算法 (Q3 & Q4)
🐼 Q3: Pandas 与机器学习理论
Q3.1: Pandas 数据预处理
场景: 你有一个 DataFrame df,包含列 ['Salary', 'Department']。
要求:
- 用中位数填充缺失的
Salary - 筛选
Salary > 5000的行
Solution Code:
import pandas as pd
import numpy as np
# Sample data
data = {'Salary': [3000, 6000, np.nan, 8000],
'Department': ['HR', 'IT', 'IT', 'HR']}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print("\n" + "="*50 + "\n")
# 1. Fill missing values with median
df['Salary'].fillna(df['Salary'].median(), inplace=True)
# 2. Filter high salary
high_salary = df[df['Salary'] > 5000]
print("After filling and filtering:")
print(high_salary)
Run Output:
Original data:
Salary Department
0 3000.0 HR
1 6000.0 IT
2 NaN IT
3 8000.0 HR
==================================================
After filling and filtering:
Salary Department
1 6000.0 IT
3 8000.0 HR
关键步骤:
df['Salary'].median()→ 计算中位数(3000, 6000, 8000 的中位数是 6000)fillna()→ 使用中位数替换 NaN- 布尔索引
df[df['Salary'] > 5000]→ 筛选高薪员工
Key Steps:
df['Salary'].median()→ Calculates the median (median of 3000, 6000, 8000 is 6000)fillna()→ Replaces NaN with the median- Boolean indexing
df[df['Salary'] > 5000]→ Filters high-salary employees
Q3.2: Pandas 数据分组与聚合
场景: 计算每个部门的平均薪资,并找出平均薪资最高的部门。
Solution Code:
import pandas as pd
# Sample data
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Department': ['HR', 'IT', 'HR', 'IT', 'Finance', 'Finance'],
'Salary': [5000, 7000, 5500, 8000, 6000, 6500]
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print("\n" + "="*50 + "\n")
# Group by department and calculate mean salary
dept_avg = df.groupby('Department')['Salary'].mean()
print("Average salary by department:")
print(dept_avg)
print("\n" + "="*50 + "\n")
# Find department with highest average
max_dept = dept_avg.idxmax()
max_salary = dept_avg.max()
print(f"Highest average: {max_dept} (${max_salary:.2f})")Run Output:
Original data:
Name Department Salary
0 Alice HR 5000
1 Bob IT 7000
2 Charlie HR 5500
3 David IT 8000
4 Eve Finance 6000
5 Frank Finance 6500
==================================================
Average salary by department:
Department
Finance 6250.0
HR 5250.0
IT 7500.0
Name: Salary, dtype: float64
==================================================
Highest average: IT ($7500.00)
关键操作:
- groupby('Department'):按部门分组
- ['Salary'].mean():计算每组的平均薪资
- idxmax():找到最大值对应的索引(部门名)
- max():获取最大值
应用场景:
- 统计分析、数据透视、业务报表生成
Key Operations:
- groupby('Department'): Group by department
- ['Salary'].mean(): Calculate mean salary for each group
- idxmax(): Get index (department name) of maximum value
- max(): Get the maximum value
Use Cases:
- Statistical analysis, data pivoting, business report generation
Q3.8: 手动推导专家规则与预测(期末真题 Q3)
场景: 给定以下学生数据表,根据已有数据推导决策规则,并预测剩下四个问号的结果。
| Student ID | Grade | GPA | Prediction (Yes/No) |
|---|---|---|---|
| 101 | A | 3.85 | Yes |
| 205 | C | 2.85 | No |
| 640 | A- | 3.50 | Yes |
| 710 | B | 3.00 | ? (1) |
| 595 | A | 3.30 | ? (2) |
| 540 | B- | 4.00 | ? (3) |
| 630 | B+ | 3.00 | ? (4) |
👉 点击查看解答
Task (i): 提出手动专家规则 (Propose Manual Expert Rules)
观察前三行已知数据:
- GPA 3.85 (High) → Prediction = Yes
- GPA 2.85 (Low) → Prediction = No
- GPA 3.50 (High) → Prediction = Yes
归纳规律:
- 高 GPA (≥ 3.50) → Yes
- 低 GPA (< 3.50) → No
提出的专家规则(示例):
Rule 1 (基于 GPA 阈值):
IF GPA >= 3.50 THEN Prediction = Yes
ELSE Prediction = No
或者基于成绩等级的规则:
Rule 2 (基于 Grade):
IF Grade is 'A' or 'A-' THEN Prediction = Yes
ELSE Prediction = No
推荐使用 Rule 1,因为 GPA 是数值型特征,更客观稳定。
Task (ii): 识别预测结果 (Identify Predictions)
基于 Rule 1: IF GPA >= 3.50 THEN Yes, ELSE No
| Student ID | Grade | GPA | Rule Evaluation | Prediction |
|---|---|---|---|---|
| 710 | B | 3.00 | 3.00 < 3.50 | No |
| 595 | A | 3.30 | 3.30 < 3.50 | No |
| 540 | B- | 4.00 | 4.00 >= 3.50 | Yes ✅ |
| 630 | B+ | 3.00 | 3.00 < 3.50 | No |
最终答案:
- Student 710: No
- Student 595: No (虽然是A,但GPA 3.30 < 3.50)
- Student 540: Yes ✅
- Student 630: No
如果使用 Rule 2 (基于 Grade):
IF Grade is 'A' or 'A-' THEN Yes
ELSE No
| Student ID | Grade | Rule Evaluation | Prediction |
|---|---|---|---|
| 710 | B | B ≠ A/A- | No |
| 595 | A | A = A | Yes ✅ |
| 540 | B- | B- ≠ A/A- | No |
| 630 | B+ | B+ ≠ A/A- | No |
最终答案 (Rule 2):
- Student 710: No
- Student 595: Yes ✅
- Student 540: No
- Student 630: No
考点总结:
- 规则归纳:从已知样本中观察模式,提出假设
- 规则验证:规则应该能正确分类训练数据(前3行)
- 规则应用:对新样本逐条检查,输出预测
- 多规则冲突:不同规则可能给出不同预测,需要选择最合理的(考试时说明你的选择依据)
实际操作建议:
- 先尝试最简单的规则(单特征阈值)
- 如果数据模式复杂,可以组合多个特征(AND/OR 逻辑)
- 规则要能解释(可解释性是专家规则的优势)
Task (i): Propose Manual Expert Rules
Observation from first three rows:
- GPA 3.85 (High) → Prediction = Yes
- GPA 2.85 (Low) → Prediction = No
- GPA 3.50 (High) → Prediction = Yes
Pattern Identified:
- High GPA (≥ 3.50) → Yes
- Low GPA (< 3.50) → No
Proposed Expert Rule (Example):
Rule 1 (Based on GPA threshold):
IF GPA >= 3.50 THEN Prediction = Yes
ELSE Prediction = No
Or rule based on Grade:
Rule 2 (Based on Grade):
IF Grade is 'A' or 'A-' THEN Prediction = Yes
ELSE Prediction = No
Recommend Rule 1, because GPA is a numerical feature, more objective and stable.
Task (ii): Identify Predictions
Based on Rule 1: IF GPA >= 3.50 THEN Yes, ELSE No
| Student ID | Grade | GPA | Rule Evaluation | Prediction |
|---|---|---|---|---|
| 710 | B | 3.00 | 3.00 < 3.50 | No |
| 595 | A | 3.30 | 3.30 < 3.50 | No |
| 540 | B- | 4.00 | 4.00 >= 3.50 | Yes ✅ |
| 630 | B+ | 3.00 | 3.00 < 3.50 | No |
Final Answers:
- Student 710: No
- Student 595: No (Grade A but GPA 3.30 < 3.50)
- Student 540: Yes ✅
- Student 630: No
If using Rule 2 (Based on Grade):
IF Grade is 'A' or 'A-' THEN Yes
ELSE No
| Student ID | Grade | Rule Evaluation | Prediction |
|---|---|---|---|
| 710 | B | B ≠ A/A- | No |
| 595 | A | A = A | Yes ✅ |
| 540 | B- | B- ≠ A/A- | No |
| 630 | B+ | B+ ≠ A/A- | No |
Final Answers (Rule 2):
- Student 710: No
- Student 595: Yes ✅
- Student 540: No
- Student 630: No
Key Concepts Summary:
- Rule Induction: Observe patterns from known samples, propose hypotheses
- Rule Validation: Rules should correctly classify training data (first 3 rows)
- Rule Application: Apply rules to new samples one by one, output predictions
- Rule Conflicts: Different rules may give different predictions, choose most reasonable (explain your choice in exam)
Practical Tips:
- Start with simplest rules (single feature threshold)
- For complex patterns, combine multiple features (AND/OR logic)
- Rules should be interpretable (interpretability is advantage of expert rules)
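补充:Rule 1 也可以直接用 pandas 验证(示例;DataFrame 与列名均为假设,仅对应表中四个未知样本):
import pandas as pd
# Hypothetical DataFrame mirroring the four unknown rows in the table above
students = pd.DataFrame({
    'StudentID': [710, 595, 540, 630],
    'Grade': ['B', 'A', 'B-', 'B+'],
    'GPA': [3.00, 3.30, 4.00, 3.00]
})
# Rule 1: IF GPA >= 3.50 THEN Yes ELSE No
students['Prediction'] = students['GPA'].apply(lambda g: 'Yes' if g >= 3.50 else 'No')
print(students)  # 540 -> Yes, others -> No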
Q3.3: 编码方法选择
问题: 什么时候应该用 One-Hot Encoding 而不是 Label Encoding?
👉 点击查看答案
Label Encoding 的工作原理:
- 将分类转换为数字(例如:Red=1, Blue=2, Green=3)
- 问题:算法可能误认为 "3 > 1" 意味着 Green "大于" Red(引入虚假的序关系)
One-Hot Encoding 的工作原理:
- 为每个类别创建一个二进制列(Is_Red, Is_Blue, Is_Green 等)
- 每行只有一列为 1,其余为 0
使用规则:
| 数据类型 | 定义 | 例子 | 编码方法 |
|---|---|---|---|
| Nominal(名义) | 无序、无排名关系 | 城市、颜色、品牌 | ✅ One-Hot |
| Ordinal(序数) | 有序、有排名关系 | Low, Medium, High | ✅ Label |
结论:用 One-Hot 处理没有排名关系的分类数据(如城市、颜色、国家),用 Label 处理有明确顺序的分类数据(如教育水平、收入等级)。
Label Encoding works by:
- Converting categories to integers (e.g., Red=1, Blue=2, Green=3)
- Problem: Algorithms might interpret "3 > 1" as Green being "greater" than Red (introduces false ordinal relationship)
One-Hot Encoding works by:
- Creating a binary column for each category (Is_Red, Is_Blue, Is_Green, etc.)
- Each row has exactly one column as 1, rest as 0
Usage Rules:
| Data Type | Definition | Examples | Method |
|---|---|---|---|
| Nominal | No order, no ranking | Cities, Colors, Brands | ✅ One-Hot |
| Ordinal | Has order, has ranking | Low, Medium, High | ✅ Label |
Conclusion: Use One-Hot for unordered categorical data (cities, colors, countries), use Label for ordered categorical data (education level, income brackets).
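补充示例(参考写法,数据为虚构):用 pandas 的 get_dummies 做 One-Hot,用 sklearn 的 LabelEncoder 做 Label 编码。注意 LabelEncoder 按字母序编号,若要保留真实的顺序关系,手动映射更稳妥。
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({
    'City': ['KL', 'Penang', 'KL'],       # Nominal -> One-Hot
    'Level': ['Low', 'High', 'Medium']    # Ordinal -> Label / manual mapping
})
# One-Hot Encoding for the nominal column
one_hot = pd.get_dummies(df['City'], prefix='City')
print(one_hot)
# Label Encoding for the ordinal column (alphabetical order: High=0, Low=1, Medium=2)
df['Level_label'] = LabelEncoder().fit_transform(df['Level'])
# Safer for a true ordinal order: explicit mapping
df['Level_ordinal'] = df['Level'].map({'Low': 0, 'Medium': 1, 'High': 2})
print(df)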
Q3.4: 特征缩放 - 归一化 vs 标准化
问题: 什么时候用 Min-Max 归一化,什么时候用 Z-Score 标准化?
👉 点击查看答案与代码示例
Min-Max 归一化(Normalization):
- 公式: $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
- 结果范围: [0, 1]
- 特点: 保留原始数据的分布形状
- 适用场景:
- 神经网络(要求输入在固定范围)
- 图像处理(像素值 0-255 → 0-1)
- 数据分布已知且无异常值
Z-Score 标准化(Standardization):
- 公式: $z = \frac{x - \mu}{\sigma}$(其中 μ 是均值,σ 是标准差)
- 结果范围: 通常在 [-3, 3] 之间(无固定范围)
- 特点: 数据均值为 0,标准差为 1
- 适用场景:
- 存在异常值时更鲁棒
- SVM、逻辑回归等算法
- 不同量级特征需要公平比较
Min-Max Normalization:
- Formula: $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
- Result Range: [0, 1]
- Characteristics: Preserves original distribution shape
- Use Cases:
- Neural networks (require inputs in fixed range)
- Image processing (pixel values 0-255 → 0-1)
- Known distribution without outliers
Z-Score Standardization:
- Formula: $z = \frac{x - \mu}{\sigma}$ (where μ is mean, σ is standard deviation)
- Result Range: Typically [-3, 3] (no fixed range)
- Characteristics: Mean of 0, standard deviation of 1
- Use Cases:
- More robust when outliers exist
- SVM, Logistic Regression algorithms
- Comparing features of different scales
Code Example:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np
# Sample data with outlier
data = np.array([[1], [2], [3], [4], [100]]) # 100 is outlier
print("Original data:")
print(data.flatten())
# Min-Max Normalization
min_max_scaler = MinMaxScaler()
normalized = min_max_scaler.fit_transform(data)
print(f"\nMin-Max Normalized: {normalized.flatten()}")
# Z-Score Standardization
standard_scaler = StandardScaler()
standardized = standard_scaler.fit_transform(data)
print(f"Standardized: {standardized.flatten()}")Output:
Original data:
[ 1 2 3 4 100]
Min-Max Normalized: [0. 0.01010101 0.02020202 0.03030303 1. ]
Standardized: [-0.70039279 -0.67894753 -0.65750227 -0.63605702 2.6728996 ]观察:
- Min-Max:异常值 100 被映射到 1,其他值挤在 0-0.03 之间(信息损失)
- Z-Score:异常值约为 2.67,其他值在 -0.7 左右(更好地保留分布)
结论:有异常值时优先用 Z-Score!
Observation:
- Min-Max: Outlier 100 mapped to 1, others squeezed in 0-0.03 range (information loss)
- Z-Score: Outlier ~2.67, others around -0.7 (better preserves distribution)
Conclusion: Prefer Z-Score when outliers exist!
Q3.5: 混淆矩阵与模型评估
问题: 给定混淆矩阵,计算精确率、召回率和 F1 分数。
Confusion Matrix:
| 预测为正 | 预测为负 | |
|---|---|---|
| 实际为正 | TP = 80 | FN = 20 |
| 实际为负 | FP = 10 | TN = 90 |
👉 点击查看计算过程
公式:
$$\text{Precision (精确率)} = \frac{TP}{TP + FP} = \frac{80}{80 + 10} = \frac{80}{90} \approx 0.889$$
$$\text{Recall (召回率)} = \frac{TP}{TP + FN} = \frac{80}{80 + 20} = \frac{80}{100} = 0.800$$
$$\text{F1-Score} = 2 \times \frac{Precision \times Recall}{Precision + Recall} = 2 \times \frac{0.889 \times 0.800}{0.889 + 0.800} \approx 0.842$$
解释:
-
Precision(精确率):预测为正的样本中,真正为正的比例("不误伤")
- 在这个例子中:预测为正的 90 个样本中,80 个确实是正样本
-
Recall(召回率):实际为正的样本中,被正确预测的比例("不漏掉")
- 在这个例子中:100 个真实正样本中,找到了 80 个
-
F1-Score:Precision 和 Recall 的调和平均数(综合指标)
- 当两者都重要时使用
应用场景:
- 垃圾邮件检测:Precision 重要(不能误判正常邮件)
- 疾病筛查:Recall 重要(不能漏掉病人)
- 一般分类:F1-Score 平衡两者
Formulas:
$$\text{Precision} = \frac{TP}{TP + FP} = \frac{80}{80 + 10} = \frac{80}{90} \approx 0.889$$
$$\text{Recall} = \frac{TP}{TP + FN} = \frac{80}{80 + 20} = \frac{80}{100} = 0.800$$
$$\text{F1-Score} = 2 \times \frac{Precision \times Recall}{Precision + Recall} = 2 \times \frac{0.889 \times 0.800}{0.889 + 0.800} \approx 0.842$$
Interpretation:
-
Precision: Of predicted positives, how many are actually positive ("don't misclassify")
- In this example: Of 90 predicted positive, 80 are truly positive
-
Recall: Of actual positives, how many are correctly predicted ("don't miss any")
- In this example: Of 100 actual positives, found 80
-
F1-Score: Harmonic mean of Precision and Recall (balanced metric)
- Use when both are equally important
Use Cases:
- Spam detection: Precision matters (don't flag normal emails)
- Disease screening: Recall matters (don't miss patients)
- General classification: F1-Score balances both
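补充:可以直接用上面的 TP/FP/FN/TN 在代码里核对手算结果(简单示例):
# Values taken from the confusion matrix above
TP, FN, FP, TN = 80, 20, 10, 90
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision: {precision:.3f}")  # 0.889
print(f"Recall:    {recall:.3f}")     # 0.800
print(f"F1-Score:  {f1:.3f}")         # 0.842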
Q3.6: SVM 核函数选择
问题: SVM 中 Linear、RBF 和 Polynomial 核函数有什么区别?
👉 点击查看答案
1. Linear Kernel(线性核):
- 公式: $K(x, y) = x^T y$
- 决策边界: 直线/平面(线性可分)
- 适用场景:
- 数据本身线性可分
- 特征数量远大于样本数量(高维稀疏数据)
- 文本分类(词袋模型)
- 优点: 速度快,易解释
- 缺点: 无法处理非线性问题
2. RBF Kernel(径向基核,Gaussian 核):
- 公式: $K(x, y) = \exp(-\gamma ||x - y||^2)$
- 决策边界: 复杂曲线/曲面(非线性)
- 适用场景:
- 数据非线性可分
- 不确定用哪个核时的默认选择
- 样本数量适中
- 优点: 强大,能处理复杂边界
- 缺点: 容易过拟合,需调参 γ
3. Polynomial Kernel(多项式核):
- 公式: $K(x, y) = (x^T y + c)^d$
- 决策边界: 多项式曲线
- 适用场景:
- 图像处理
- 需要考虑特征交互
- 优点: 可控制复杂度(通过 d)
- 缺点: 计算复杂,参数敏感
选择建议:
- 先尝试 Linear → 快速基线
- 如果性能不够,试 RBF → 最常用
- 特殊场景考虑 Polynomial
1. Linear Kernel:
- Formula: $K(x, y) = x^T y$
- Decision Boundary: Line/Plane (linearly separable)
- Use Cases:
- Data is linearly separable
- Features >> Samples (high-dimensional sparse data)
- Text classification (bag-of-words)
- Pros: Fast, interpretable
- Cons: Can't handle non-linear problems
2. RBF Kernel (Radial Basis Function, Gaussian):
- Formula: $K(x, y) = \exp(-\gamma ||x - y||^2)$
- Decision Boundary: Complex curves/surfaces (non-linear)
- Use Cases:
- Non-linearly separable data
- Default choice when unsure
- Moderate sample size
- Pros: Powerful, handles complex boundaries
- Cons: Prone to overfitting, needs γ tuning
3. Polynomial Kernel:
- Formula: $K(x, y) = (x^T y + c)^d$
- Decision Boundary: Polynomial curves
- Use Cases:
- Image processing
- Feature interactions matter
- Pros: Controllable complexity (via d)
- Cons: Computationally expensive, parameter sensitive
Selection Guide:
- Try Linear first → Quick baseline
- If performance insufficient, try RBF → Most common
- Consider Polynomial for special cases
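补充示例(仅作演示,使用 sklearn 自带的 make_moons 玩具数据,准确率数值取决于随机划分):在同一份非线性数据上比较三种核函数。
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
# Toy non-linear dataset for illustration
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
for kernel in ['linear', 'rbf', 'poly']:
    clf = SVC(kernel=kernel, gamma='scale')  # gamma only affects rbf/poly
    clf.fit(X_train, y_train)
    print(kernel, round(clf.score(X_test, y_test), 3))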
Q3.7: Random Forest 超参数
问题: 解释 Random Forest 的关键超参数:n_estimators, max_depth, min_samples_split。
👉 点击查看答案
1. n_estimators(树的数量):
- 含义: 随机森林中决策树的个数
- 影响:
- 增加 → 模型更稳定,性能提升(但收益递减)
- 太少 → 欠拟合
- 太多 → 训练时间长,但一般不会过拟合
- 推荐值: 100-500(根据数据规模)
2. max_depth(树的最大深度):
- 含义: 每棵树允许的最大层数
- 影响:
- 增加 → 模型更复杂,可能过拟合
- 减少 → 模型简单,可能欠拟合
- 推荐值:
- None(不限制)→ 小数据集
- 10-30 → 大数据集
3. min_samples_split(最小分裂样本数):
- 含义: 节点分裂所需的最小样本数
- 影响:
- 增加 → 树更简单,防止过拟合
- 减少 → 树更复杂,可能过拟合
- 推荐值: 2-20
调参策略(GridSearchCV):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10]
}
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
# grid_search.fit(X_train, y_train)
1. n_estimators (Number of Trees):
- Meaning: Number of decision trees in the forest
- Impact:
- Increase → More stable, better performance (diminishing returns)
- Too few → Underfitting
- Too many → Longer training, but rarely overfits
- Recommended: 100-500 (depends on data size)
2. max_depth (Maximum Tree Depth):
- Meaning: Maximum levels allowed in each tree
- Impact:
- Increase → More complex, may overfit
- Decrease → Simpler, may underfit
- Recommended:
- None (unlimited) → Small datasets
- 10-30 → Large datasets
3. min_samples_split (Minimum Samples to Split):
- Meaning: Minimum samples required to split a node
- Impact:
- Increase → Simpler tree, prevents overfitting
- Decrease → More complex, may overfit
- Recommended: 2-20
Tuning Strategy (GridSearchCV):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10]
}
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
# grid_search.fit(X_train, y_train)
Q3.8: 交叉验证
问题: 什么是 K 折交叉验证,为什么它很重要?
👉 点击查看答案与示例
K-Fold 交叉验证原理:
- 划分数据:将数据集分成 K 个大小相等的子集(fold)
- 轮流训练:每次用 K-1 个子集训练模型,用剩余 1 个子集验证
- 重复 K 次:每个子集都有机会作为验证集
- 平均结果:将 K 次验证结果平均,得到最终性能评估
为什么重要:
- ✅ 充分利用数据:每个样本既用于训练又用于验证
- ✅ 降低过拟合风险:避免模型只在某个特定测试集上表现好
- ✅ 更可靠的性能估计:多次验证的平均值更稳定
- ✅ 适合小数据集:在数据有限时特别有用
常用 K 值:
- K = 5(5折):速度和准确性的平衡,最常用
- K = 10(10折):更准确但计算量更大
- K = N(留一法,LOOCV):极端情况,每次只留一个样本验证
K-Fold Cross-Validation Principle:
- Split Data: Divide dataset into K equal-sized subsets (folds)
- Rotate Training: Use K-1 subsets for training, 1 for validation
- Repeat K Times: Each subset gets a chance to be the validation set
- Average Results: Average the K validation results for final performance
Why Important:
- ✅ Maximize Data Usage: Every sample used for both training and validation
- ✅ Reduce Overfitting Risk: Avoid model performing well only on specific test set
- ✅ More Reliable Estimate: Average of multiple validations is more stable
- ✅ Good for Small Datasets: Especially useful when data is limited
Common K Values:
- K = 5 (5-fold): Balance between speed and accuracy, most common
- K = 10 (10-fold): More accurate but more computational cost
- K = N (Leave-One-Out, LOOCV): Extreme case, one sample for validation each time
Code Example:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Create sample dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=42)
# Create model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Perform 5-fold cross-validation
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")Output:
Cross-validation scores: [0.85 0.9 0.8 0.95 0.85]
Mean accuracy: 0.870 (+/- 0.108)解释:
- 5 次验证的准确率分别为:85%, 90%, 80%, 95%, 85%
- 平均准确率:87%
- 标准差的 2 倍(95% 置信区间):±10.8%
- 结论:模型在这个数据集上的准确率约为 87%,且较为稳定
Interpretation:
- 5 validation accuracies: 85%, 90%, 80%, 95%, 85%
- Mean accuracy: 87%
- 2x standard deviation (95% confidence): ±10.8%
- Conclusion: Model accuracy ~87% on this dataset, relatively stable
Q3.9: 过拟合 vs 欠拟合
问题: 如何识别和解决过拟合和欠拟合?
👉 点击查看答案
| 特征 | 过拟合 (Overfitting) | 欠拟合 (Underfitting) |
|---|---|---|
| 表现 | 训练集准确率高,测试集准确率低 | 训练集和测试集准确率都低 |
| 原因 | 模型过于复杂,记住了噪声 | 模型过于简单,未学到规律 |
| 训练误差 | 很低(接近 0) | 很高 |
| 测试误差 | 很高 | 很高 |
| 泛化能力 | 差(不能应用到新数据) | 差(连训练数据都拟合不好) |
如何识别:
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
train_sizes, train_scores, test_scores = learning_curve(
model, X, y, cv=5,
train_sizes=np.linspace(0.1, 1.0, 10)
)
# 画学习曲线
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, test_scores.mean(axis=1), label='Test score')
plt.legend()
判断标准:
- 过拟合:训练曲线和测试曲线之间有大的间隙
- 欠拟合:两条曲线都很低且接近
解决方法:
| 问题 | 解决方案 |
|---|---|
| 过拟合 | 1. 增加训练数据 2. 减少模型复杂度(降低树深度、减少特征) 3. 正则化(L1/L2) 4. Dropout(神经网络) 5. Early stopping(提前停止训练) |
| 欠拟合 | 1. 增加模型复杂度 2. 增加特征 3. 减少正则化 4. 训练更长时间 5. 换更强大的模型 |
| Feature | Overfitting | Underfitting |
|---|---|---|
| Performance | High training, low test accuracy | Low training and test accuracy |
| Cause | Model too complex, memorizes noise | Model too simple, misses patterns |
| Training Error | Very low (near 0) | Very high |
| Test Error | Very high | Very high |
| Generalization | Poor (can't apply to new data) | Poor (can't even fit training data) |
How to Identify:
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
train_sizes, train_scores, test_scores = learning_curve(
model, X, y, cv=5,
train_sizes=np.linspace(0.1, 1.0, 10)
)
# Plot learning curves
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, test_scores.mean(axis=1), label='Test score')
plt.legend()
Diagnosis:
- Overfitting: Large gap between training and test curves
- Underfitting: Both curves low and close together
Solutions:
| Problem | Solutions |
|---|---|
| Overfitting | 1. Add more training data 2. Reduce model complexity (lower tree depth, fewer features) 3. Regularization (L1/L2) 4. Dropout (neural networks) 5. Early stopping |
| Underfitting | 1. Increase model complexity 2. Add more features 3. Reduce regularization 4. Train longer 5. Use a more powerful model |
问题: 解释 SVM 和 Random Forest 分类的主要区别。
👉 点击查看答案
SVM (Support Vector Machine):
- 核心思想:找到一个超平面(在 2D 中是直线,在高维中是曲面),使得两个类别之间的间隔(margin)最大
- 决策依据:依赖支持向量(距离超平面最近的数据点)
- 特点:
- 对高维数据和少量特征表现好
- 计算复杂度较高(O(n²) 到 O(n³))
- 需要特征归一化
Random Forest:
- 核心思想:构建多个决策树,用多数投票进行预测
- 决策依据:每个树根据特征不纯度(Gini Index 或 Entropy)选择分割点
- 特点:
- 并行计算,速度快
- 可处理非线性关系
- 自动特征归一化,易于使用
- 能处理大量特征和缺失值
SVM (Support Vector Machine):
- Core Idea: Find an optimal hyperplane (line in 2D, surface in higher dimensions) that maximizes the margin between two classes
- Decision Basis: Depends on support vectors (data points closest to the hyperplane)
- Characteristics:
- Works well with high-dimensional data and few features
- High computational complexity (O(n²) to O(n³))
- Requires feature normalization
Random Forest:
- Core Idea: Build multiple decision trees and use majority voting for prediction
- Decision Basis: Each tree selects split points based on feature impurity (Gini Index or Entropy)
- Characteristics:
- Fast with parallel computing
- Naturally handles non-linear relationships
- Auto-normalization, easy to use
- Handles many features and missing values well
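补充示例(参考,使用 sklearn 自带的 breast_cancer 数据集):SVM 放入带 StandardScaler 的 Pipeline(需要特征缩放),Random Forest 直接训练,侧面体现两者对预处理要求的差异。
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# SVM benefits from feature scaling, so wrap it in a pipeline
svm_clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
for name, model in [('SVM (scaled)', svm_clf), ('Random Forest', rf_clf)]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))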
🔴 第三部分:机器学习实现与手工计算 (Q4)
⚠️ 重要提示:这部分包含代码实现和手工计算。考试时请带上计算器!
🔥 Q4.0: K-最近邻 (KNN) 实现(期末真题 Q4 - 25分)
场景:使用 sklearn KNeighborsClassifier 预测客户是否购买产品。
不完整代码:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
from sklearn.neighbors import KNeighborsClassifier
# [A] Formulate an instance of the class (Set K=7)
# [B] Fit the instance on the data
# [C] Predict the expected value
print(classes[y_pred[0]])
任务:填写空白处 [A]、[B]、[C]。
👉 点击查看完整代码解答
完整代码:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
from sklearn.neighbors import KNeighborsClassifier
# [A] 实例化模型,设置 K=7
knn = KNeighborsClassifier(n_neighbors=7)
# [B] 在训练集上训练模型 (Fit)
knn.fit(X_train, y_train)
# [C] 在测试集上进行预测 (Predict)
y_pred = knn.predict(X_test)
# 输出第一个预测结果
print(classes[y_pred[0]])详细解析:
[A] 实例化模型:
knn = KNeighborsClassifier(n_neighbors=7)KNeighborsClassifier是 sklearn 的 KNN 分类器类n_neighbors=7设置 K 值为 7(即选取最近的 7 个邻居)- 其他常用参数:
metric='euclidean'(默认,欧氏距离)weights='uniform'(默认,所有邻居权重相等)weights='distance'(按距离加权,近的邻居权重更大)
[B] 训练模型:
knn.fit(X_train, y_train)fit()方法用于训练模型X_train是训练特征矩阵(形状:[样本数, 特征数])y_train是训练标签向量(形状:[样本数])- 注意:KNN 实际上不进行"训练",只是存储训练数据,预测时才计算距离
[C] 预测结果:
y_pred = knn.predict(X_test)predict()方法对测试集进行预测X_test是测试特征矩阵y_pred是预测结果向量(形状:[测试样本数])- 每个样本的预测流程:
- 计算该样本与所有训练样本的距离
- 选出距离最近的 K 个邻居
- 统计这 K 个邻居的类别,取多数投票结果
补充:完整工作流程示例
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
# 假设数据
X = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1]) # 0 = Not Purchase, 1 = Purchase
classes = ['Not Purchase', 'Purchase']
# 拆分数据
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# [A] 实例化
knn = KNeighborsClassifier(n_neighbors=3) # 或 K=7 视题目要求
# [B] 训练
knn.fit(X_train, y_train)
# [C] 预测
y_pred = knn.predict(X_test)
# 输出
print("预测结果:", y_pred)
print("第一个预测:", classes[y_pred[0]])
# 评估
accuracy = accuracy_score(y_test, y_pred)
print(f"准确率: {accuracy:.2f}")
print("\n分类报告:")
print(classification_report(y_test, y_pred, target_names=classes))
考点总结:
- sklearn 标准流程:
import → instantiate → fit → predict - 参数理解:
n_neighbors(K值)、metric(距离度量)、weights(权重策略) - 方法调用:
fit(X_train, y_train)用于训练,predict(X_test)用于预测 - 数据形状: X 必须是 2D 数组 (样本×特征),y 是 1D 数组(样本)
常见错误:
- ❌ 忘记导入
KNeighborsClassifier - ❌
n_neighbors拼写错误(不是k=7) - ❌
fit()方法参数顺序错误(应该是X_train, y_train,不能颠倒) - ❌
predict()方法忘记传入X_test
Complete Code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
from sklearn.neighbors import KNeighborsClassifier
# [A] Instantiate model with K=7
knn = KNeighborsClassifier(n_neighbors=7)
# [B] Fit the model on training data
knn.fit(X_train, y_train)
# [C] Predict on test set
y_pred = knn.predict(X_test)
# Output first prediction
print(classes[y_pred[0]])
Detailed Explanation:
[A] Instantiate Model:
knn = KNeighborsClassifier(n_neighbors=7)KNeighborsClassifieris sklearn's KNN classifier classn_neighbors=7sets K value to 7 (select 7 nearest neighbors)- Other common parameters:
metric='euclidean'(default, Euclidean distance)weights='uniform'(default, all neighbors equal weight)weights='distance'(distance-based weighting, closer neighbors have more weight)
[B] Train Model:
knn.fit(X_train, y_train)fit()method trains the modelX_trainis feature matrix (shape: [samples, features])y_trainis label vector (shape: [samples])- Note: KNN doesn't actually "train" - it just stores training data, calculates distances during prediction
[C] Predict Results:
y_pred = knn.predict(X_test)predict()method predicts on test setX_testis test feature matrixy_predis prediction vector (shape: [test samples])- Prediction process for each sample:
- Calculate distance to all training samples
- Select K nearest neighbors
- Count class labels of these K neighbors, take majority vote
Complete Workflow Example:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1]) # 0 = Not Purchase, 1 = Purchase
classes = ['Not Purchase', 'Purchase']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# [A] Instantiate
knn = KNeighborsClassifier(n_neighbors=3) # or K=7 per requirements
# [B] Train
knn.fit(X_train, y_train)
# [C] Predict
y_pred = knn.predict(X_test)
# Output
print("Predictions:", y_pred)
print("First prediction:", classes[y_pred[0]])
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=classes))
Key Concepts Summary:
- sklearn standard workflow:
import → instantiate → fit → predict - Parameter understanding:
n_neighbors(K value),metric(distance measure),weights(weighting strategy) - Method calls:
fit(X_train, y_train)for training,predict(X_test)for prediction - Data shapes: X must be 2D array (samples × features), y is 1D array (samples)
Common Mistakes:
- ❌ Forgetting to import
KNeighborsClassifier - ❌ Misspelling
n_neighbors(notk=7) - ❌ Wrong parameter order in
fit()(should beX_train, y_train, not reversed) - ❌ Forgetting to pass
X_testtopredict()
Q4.0(b): KNN 理论 - K 值的影响
问题: 当 KNN 中的 K 值增大时,决策边界会发生什么变化?
👉 点击查看答案
K 值对决策边界的影响:
Small K (e.g., K=1):
- 决策边界非常曲折、复杂
- 模型对训练数据拟合过紧,容易过拟合 (Overfitting)
- 对噪声非常敏感
- 训练准确率高,测试准确率可能低
Large K (e.g., K=100):
- 决策边界变得非常平滑 (Smoother),甚至接近直线
- 模型过于简单,容易欠拟合 (Underfitting)
- 对噪声不敏感
- 训练和测试准确率都可能偏低
最佳 K 值选择:
- 通常使用交叉验证找最优 K
- K 值建议范围:3-15(小数据集)或 $\sqrt{N}$(N 为样本数)
- K 应该是奇数(避免平票,尤其二分类)
结论:
- As K increases, the decision boundary becomes smoother.
- 随着 K 增大,决策边界变得更平滑。
Effect of K on Decision Boundary:
Small K (e.g., K=1):
- Decision boundary is very wiggly and complex
- Model fits training data too tightly, prone to overfitting
- Very sensitive to noise
- High training accuracy, may have low test accuracy
Large K (e.g., K=100):
- Decision boundary becomes very smooth, even approaching a straight line
- Model too simple, prone to underfitting
- Not sensitive to noise
- Both training and test accuracy may be low
Optimal K Selection:
- Typically use cross-validation to find best K
- K value suggestion: 3-15 (small datasets) or $\sqrt{N}$ (N = number of samples)
- K should be odd (avoid ties, especially for binary classification)
Conclusion:
- As K increases, the decision boundary becomes smoother.
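补充示例(参考写法,使用 sklearn 自带的 iris 数据集):用交叉验证在若干个奇数 K 中挑选平均准确率最高的 K。
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
# Try several odd K values and compare mean cross-validation accuracy
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"K={k}: mean accuracy = {score:.3f}")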
🔥 Q4.1: Naive Bayes 分类计算(必考题)
场景:判断一封邮件是否是垃圾邮件。
训练数据集(共 10 封邮件):
| 类别 | 总数 | 包含"Free"的邮件数 |
|---|---|---|
| Spam | 4 | 3 |
| Not Spam | 6 | 1 |
任务:一封新邮件包含单词 "Free"。判断它是 Spam 还是 Not Spam?
计算 P(Spam | "Free") 和 P(Not Spam | "Free") 的分子部分,比较大小。
详细计算步骤
👉 点击查看完整计算过程
第一步:计算先验概率 (Prior Probability)
$$P(\text{Spam}) = \frac{4}{10} = 0.4$$
$$P(\text{Not Spam}) = \frac{6}{10} = 0.6$$
第二步:计算似然概率 (Likelihood)
$$P(\text{"Free"} \mid \text{Spam}) = \frac{3}{4} = 0.75$$
$$P(\text{"Free"} \mid \text{Not Spam}) = \frac{1}{6} \approx 0.167$$
第三步:计算后验概率的分子 (Posterior Numerator)
使用贝叶斯定理:$P(Spam | "Free") \propto P("Free" | Spam) \times P(Spam)$
$$\text{Spam Score} = P(\text{"Free"} \mid \text{Spam}) \times P(\text{Spam}) = 0.75 \times 0.4 = 0.30$$
$$\text{Not Spam Score} = P(\text{"Free"} \mid \text{Not Spam}) \times P(\text{Not Spam}) = 0.167 \times 0.6 \approx 0.10$$
第四步:做出决策
因为 $0.30 > 0.10$,模型预测该邮件是 Spam(垃圾邮件)。
结论:含有"Free"这个词的邮件更可能是垃圾邮件。
Step 1: Calculate Prior Probability
$$P(\text{Spam}) = \frac{4}{10} = 0.4$$
$$P(\text{Not Spam}) = \frac{6}{10} = 0.6$$
Step 2: Calculate Likelihood Probability
$$P(\text{"Free"} \mid \text{Spam}) = \frac{3}{4} = 0.75$$
$$P(\text{"Free"} \mid \text{Not Spam}) = \frac{1}{6} \approx 0.167$$
Step 3: Calculate Posterior Numerators
Using Bayes' theorem: $P(Spam | "Free") \propto P("Free" | Spam) \times P(Spam)$
$$\text{Spam Score} = P(\text{"Free"} \mid \text{Spam}) \times P(\text{Spam}) = 0.75 \times 0.4 = 0.30$$
$$\text{Not Spam Score} = P(\text{"Free"} \mid \text{Not Spam}) \times P(\text{Not Spam}) = 0.167 \times 0.6 \approx 0.10$$
Step 4: Make Decision
Since $0.30 > 0.10$, the model predicts this email is Spam.
Conclusion: Emails containing "Free" are more likely to be spam.
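补充:上面的手算可以用几行 Python 复现,便于核对:
# Priors and likelihoods taken from the table above
p_spam, p_not_spam = 4 / 10, 6 / 10
p_free_given_spam, p_free_given_not_spam = 3 / 4, 1 / 6
spam_score = p_free_given_spam * p_spam              # 0.30
not_spam_score = p_free_given_not_spam * p_not_spam  # ~0.10
print("Spam" if spam_score > not_spam_score else "Not Spam")  # Spam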
🔥 Q4.2: Decision Tree 熵计算(必考题)
场景:一个决策树节点有 6 个正样本 (+) 和 2 个负样本 (-) 共 8 个样本。
任务:计算该节点的熵(Entropy)。
熵公式:$$\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2(p_i)$$
其中 $p_i$ 是第 $i$ 类样本的比例。
详细计算步骤
👉 点击查看完整计算过程
第一步:计算各类样本比例
$$p_{\text{Positive}} = \frac{6}{8} = 0.75$$
$$p_{\text{Negative}} = \frac{2}{8} = 0.25$$
第二步:代入熵公式
$$\text{Entropy} = -[p_+ \log_2(p_+) + p_- \log_2(p_-)]$$
$$= -[0.75 \times \log_2(0.75) + 0.25 \times \log_2(0.25)]$$
第三步:使用计算器计算对数(保留 4 位小数)
$$\log_2(0.75) = \frac{\log(0.75)}{\log(2)} \approx \frac{-0.1249}{0.3010} \approx -0.4150$$
$$\log_2(0.25) = \log_2(\frac{1}{4}) = -\log_2(4) = -2$$
第四步:计算最终结果
项一:$0.75 \times (-0.4150) = -0.3112$
项二:$0.25 \times (-2) = -0.50$
熵值:
$$\text{Entropy} = -(-0.3112 - 0.50) = -(-0.8112) = 0.8112 \approx \boxed{0.811}$$
解读:
- 熵值范围是 0 到 1(对于二分类)
- 熵值 0.811 表示该节点混合程度较高(接近 75%-25% 分布)
- 如果节点是纯净的(全是正或全是负),熵值为 0
- 如果正负各占一半(50%-50%),熵值最大为 1
Step 1: Calculate Sample Proportions
$$p_{\text{Positive}} = \frac{6}{8} = 0.75$$
$$p_{\text{Negative}} = \frac{2}{8} = 0.25$$
Step 2: Substitute into Entropy Formula
$$\text{Entropy} = -[p_+ \log_2(p_+) + p_- \log_2(p_-)]$$
$$= -[0.75 \times \log_2(0.75) + 0.25 \times \log_2(0.25)]$$
Step 3: Use Calculator to Calculate Logarithms (Keep 4 decimals)
$$\log_2(0.75) = \frac{\log(0.75)}{\log(2)} \approx \frac{-0.1249}{0.3010} \approx -0.4150$$
$$\log_2(0.25) = \log_2(\frac{1}{4}) = -\log_2(4) = -2$$
Step 4: Calculate Final Result
Term 1: $0.75 \times (-0.4150) = -0.3112$
Term 2: $0.25 \times (-2) = -0.50$
Entropy Value:
$$\text{Entropy} = -(-0.3112 - 0.50) = -(-0.8112) = 0.8112 \approx \boxed{0.811}$$
Interpretation:
- Entropy range is 0 to 1 (for binary classification)
- Entropy 0.811 indicates a highly mixed node (close to 75%-25% distribution)
- Pure node (all positive or all negative) has entropy 0
- Node with 50%-50% split has maximum entropy of 1
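补充:熵也可以写成一个小函数来核对手算结果(示例):
import math
def entropy(counts):
    # counts: class counts at a node, e.g. [6, 2]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
print(round(entropy([6, 2]), 3))  # 0.811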
🔥 Q4.3: 信息增益计算
场景: 计算信息增益,决定用哪个特征进行分裂。
数据集(10个样本,预测是否打网球):
| Outlook | Temperature | Play Tennis |
|---|---|---|
| Sunny | Hot | No |
| Sunny | Hot | No |
| Overcast | Hot | Yes |
| Rain | Mild | Yes |
| Rain | Cool | Yes |
| Rain | Cool | No |
| Overcast | Cool | Yes |
| Sunny | Mild | No |
| Sunny | Cool | Yes |
| Rain | Mild | Yes |
任务: 计算 "Outlook" 特征的信息增益。
👉 点击查看完整计算过程
第一步:计算总体熵(Root Entropy)
总样本:10 个,其中 Yes = 6,No = 4
$$p_{\text{Yes}} = \frac{6}{10} = 0.6, \quad p_{\text{No}} = \frac{4}{10} = 0.4$$
$$H_{\text{total}} = -[0.6 \times \log_2(0.6) + 0.4 \times \log_2(0.4)]$$
使用计算器:
- $\log_2(0.6) = \frac{\ln(0.6)}{\ln(2)} \approx -0.737$
- $\log_2(0.4) = \frac{\ln(0.4)}{\ln(2)} \approx -1.322$
$$H_{\text{total}} = -[0.6 \times (-0.737) + 0.4 \times (-1.322)]$$ $$= -[-0.442 - 0.529] = -(-0.971) = \boxed{0.971}$$
第二步:按 Outlook 分组计算加权熵
| Outlook | Total | Yes | No |
|---|---|---|---|
| Sunny | 4 | 1 | 3 |
| Overcast | 2 | 2 | 0 |
| Rain | 4 | 3 | 1 |
Sunny 组的熵: $$H_{\text{Sunny}} = -\left[\frac{1}{4} \log_2\left(\frac{1}{4}\right) + \frac{3}{4} \log_2\left(\frac{3}{4}\right)\right]$$
- $\log_2(0.25) = -2$
- $\log_2(0.75) \approx -0.415$
$$H_{\text{Sunny}} = -[0.25 \times (-2) + 0.75 \times (-0.415)]$$ $$= -[-0.5 - 0.311] = 0.811$$
Overcast 组的熵: 全是 Yes(纯净节点) $$H_{\text{Overcast}} = 0$$
Rain 组的熵: $$H_{\text{Rain}} = -\left[\frac{3}{4} \log_2\left(\frac{3}{4}\right) + \frac{1}{4} \log_2\left(\frac{1}{4}\right)\right]$$ $$= -[0.75 \times (-0.415) + 0.25 \times (-2)]$$ $$= -[-0.311 - 0.5] = 0.811$$
加权平均熵: $$H_{\text{weighted}} = \frac{4}{10} \times 0.811 + \frac{2}{10} \times 0 + \frac{4}{10} \times 0.811$$ $$= 0.324 + 0 + 0.324 = 0.648$$
第三步:计算信息增益(Information Gain)
$$\text{IG(Outlook)} = H_{\text{total}} - H_{\text{weighted}}$$ $$= 0.971 - 0.648 = \boxed{0.323}$$
解释:
- 信息增益 = 0.323,表示使用 "Outlook" 特征可以减少 32.3% 的不确定性
- 信息增益越大,该特征的分类能力越强
- 决策树会优先选择信息增益最大的特征进行分裂
Step 1: Calculate Root Entropy
Total samples: 10, where Yes = 6, No = 4
$$p_{\text{Yes}} = \frac{6}{10} = 0.6, \quad p_{\text{No}} = \frac{4}{10} = 0.4$$
$$H_{\text{total}} = -[0.6 \times \log_2(0.6) + 0.4 \times \log_2(0.4)]$$
Using calculator:
- $\log_2(0.6) = \frac{\ln(0.6)}{\ln(2)} \approx -0.737$
- $\log_2(0.4) = \frac{\ln(0.4)}{\ln(2)} \approx -1.322$
$$H_{\text{total}} = -[0.6 \times (-0.737) + 0.4 \times (-1.322)]$$ $$= -[-0.442 - 0.529] = -(-0.971) = \boxed{0.971}$$
Step 2: Calculate Weighted Entropy by Outlook Groups
| Outlook | Total | Yes | No |
|---|---|---|---|
| Sunny | 4 | 1 | 3 |
| Overcast | 2 | 2 | 0 |
| Rain | 4 | 3 | 1 |
Entropy of Sunny: $$H_{\text{Sunny}} = -\left[\frac{1}{4} \log_2\left(\frac{1}{4}\right) + \frac{3}{4} \log_2\left(\frac{3}{4}\right)\right]$$
- $\log_2(0.25) = -2$
- $\log_2(0.75) \approx -0.415$
$$H_{\text{Sunny}} = -[0.25 \times (-2) + 0.75 \times (-0.415)]$$ $$= -[-0.5 - 0.311] = 0.811$$
Entropy of Overcast: All Yes (pure node) $$H_{\text{Overcast}} = 0$$
Entropy of Rain: $$H_{\text{Rain}} = -\left[\frac{3}{4} \log_2\left(\frac{3}{4}\right) + \frac{1}{4} \log_2\left(\frac{1}{4}\right)\right]$$ $$= -[0.75 \times (-0.415) + 0.25 \times (-2)]$$ $$= -[-0.311 - 0.5] = 0.811$$
Weighted Average Entropy: $$H_{\text{weighted}} = \frac{4}{10} \times 0.811 + \frac{2}{10} \times 0 + \frac{4}{10} \times 0.811$$ $$= 0.324 + 0 + 0.324 = 0.648$$
Step 3: Calculate Information Gain
$$\text{IG(Outlook)} = H_{\text{total}} - H_{\text{weighted}}$$ $$= 0.971 - 0.648 = \boxed{0.323}$$
Interpretation:
- Information Gain = 0.323 means "Outlook" reduces uncertainty by 32.3%
- Higher Information Gain → stronger classification ability
- Decision trees prioritize features with highest Information Gain for splitting
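补充:信息增益同样可以用代码核对(沿用上面的熵函数思路;代码结果约 0.322,与手算的 0.323 仅差在对数取值的舍入上):
import math
def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
# Root: 6 Yes / 4 No; Outlook groups: Sunny 1/3, Overcast 2/0, Rain 3/1
h_total = entropy([6, 4])
h_weighted = 4/10 * entropy([1, 3]) + 2/10 * entropy([2, 0]) + 4/10 * entropy([3, 1])
print(round(h_total - h_weighted, 3))  # ~0.322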
🔥 Q4.4: Gini 指数计算
场景: 计算决策树节点的 Gini 指数。
问题: 一个节点有 40 个样本:25 个 A 类,15 个 B 类。计算 Gini 指数。
Gini Formula: $$\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$$
👉 点击查看计算过程
第一步:计算各类别概率
$$p_A = \frac{25}{40} = 0.625$$ $$p_B = \frac{15}{40} = 0.375$$
第二步:代入 Gini 公式
$$\text{Gini} = 1 - (p_A^2 + p_B^2)$$ $$= 1 - (0.625^2 + 0.375^2)$$ $$= 1 - (0.3906 + 0.1406)$$ $$= 1 - 0.5312$$ $$= \boxed{0.4688} \approx 0.469$$
解释:
- Gini 指数范围:[0, 0.5](二分类情况下)
- Gini = 0 → 节点纯净(全是同一类)
- Gini = 0.5 → 节点最混乱(各类别均匀分布)
- Gini = 0.469 → 较高的不纯度,需要进一步分裂
Gini vs Entropy:
- Gini 计算更快(无对数运算)
- Entropy 对不纯度更敏感
- 实际效果相似,Gini 更常用(sklearn 默认)
Step 1: Calculate Class Probabilities
$$p_A = \frac{25}{40} = 0.625$$ $$p_B = \frac{15}{40} = 0.375$$
Step 2: Substitute into Gini Formula
$$\text{Gini} = 1 - (p_A^2 + p_B^2)$$ $$= 1 - (0.625^2 + 0.375^2)$$ $$= 1 - (0.3906 + 0.1406)$$ $$= 1 - 0.5312$$ $$= \boxed{0.4688} \approx 0.469$$
Interpretation:
- Gini range: [0, 0.5] (for binary classification)
- Gini = 0 → Pure node (all same class)
- Gini = 0.5 → Most impure (uniform distribution)
- Gini = 0.469 → High impurity, needs further splitting
Gini vs Entropy:
- Gini faster to compute (no logarithms)
- Entropy more sensitive to impurity
- Similar results in practice, Gini more common (sklearn default)
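补充:Gini 指数的核对代码(示例):
def gini(counts):
    # counts: class counts at a node, e.g. [25, 15]
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)
print(round(gini([25, 15]), 4))  # 0.4688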
🔥 Q4.5: 多特征 Naive Bayes
场景: 根据两个特征(天气和温度)判断是否打网球。
训练数据(8个样本):
| Outlook | Temperature | Play |
|---|---|---|
| Sunny | Hot | No |
| Sunny | Hot | No |
| Overcast | Hot | Yes |
| Rain | Mild | Yes |
| Rain | Cool | Yes |
| Overcast | Cool | Yes |
| Sunny | Mild | No |
| Rain | Hot | Yes |
测试样本: Outlook = Sunny, Temperature = Cool。会打网球吗?
👉 点击查看完整计算过程
第一步:统计训练数据
- Play = Yes: 5 次
- Play = No: 3 次
先验概率: $$P(\text{Yes}) = \frac{5}{8} = 0.625$$ $$P(\text{No}) = \frac{3}{8} = 0.375$$
第二步:计算条件概率
对于 Yes 类:
- Sunny & Yes: 0 次(共 5 个 Yes)
- Cool & Yes: 2 次(共 5 个 Yes)
$$P(\text{Sunny} | \text{Yes}) = \frac{0}{5} = 0$$ $$P(\text{Cool} | \text{Yes}) = \frac{2}{5} = 0.4$$
对于 No 类:
- Sunny & No: 3 次(共 3 个 No)
- Cool & No: 0 次(共 3 个 No)
$$P(\text{Sunny} | \text{No}) = \frac{3}{3} = 1.0$$ $$P(\text{Cool} | \text{No}) = \frac{0}{3} = 0$$
第三步:应用 Naive Bayes
$$P(\text{Yes} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{Yes}) \times P(\text{Cool} | \text{Yes}) \times P(\text{Yes})$$ $$= 0 \times 0.4 \times 0.625 = 0$$
$$P(\text{No} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{No}) \times P(\text{Cool} | \text{No}) \times P(\text{No})$$ $$= 1.0 \times 0 \times 0.375 = 0$$
问题:零概率问题!
两个类别的概率都是 0,无法做出判断。这是因为训练数据中没有出现 "Sunny + Cool" 的组合。
解决方案:Laplace 平滑(拉普拉斯平滑)
修正公式: $$P(\text{Feature} \mid \text{Class}) = \frac{\text{count} + 1}{\text{total} + k}$$ 其中 $k$ 为该特征的可能取值个数。
应用平滑后:
对于 Yes 类(Outlook 的取值个数 k = 3: Sunny, Overcast, Rain;Temperature 同样有 3 个取值): $$P(\text{Sunny} | \text{Yes}) = \frac{0 + 1}{5 + 3} = \frac{1}{8} = 0.125$$ $$P(\text{Cool} | \text{Yes}) = \frac{2 + 1}{5 + 3} = \frac{3}{8} = 0.375$$
对于 No 类: $$P(\text{Sunny} | \text{No}) = \frac{3 + 1}{3 + 3} = \frac{4}{6} = 0.667$$ $$P(\text{Cool} | \text{No}) = \frac{0 + 1}{3 + 3} = \frac{1}{6} = 0.167$$
重新计算:
$$\text{Yes Score} = 0.125 \times 0.375 \times 0.625 = 0.0293$$ $$\text{No Score} = 0.667 \times 0.167 \times 0.375 = 0.0418$$
结论:因为 $0.0418 > 0.0293$,预测为 No(不打网球)。
关键要点:
- Naive Bayes 假设特征独立(这就是"Naive"的含义)
- 遇到零概率必须使用平滑技术
- Laplace 平滑是最常用的方法
Step 1: Statistics from Training Data
- Play = Yes: 5 times
- Play = No: 3 times
Prior Probability: $$P(\text{Yes}) = \frac{5}{8} = 0.625$$ $$P(\text{No}) = \frac{3}{8} = 0.375$$
Step 2: Calculate Conditional Probabilities
For Yes class:
- Sunny & Yes: 0 times (out of 5 Yes)
- Cool & Yes: 2 times (out of 5 Yes)
$$P(\text{Sunny} | \text{Yes}) = \frac{0}{5} = 0$$ $$P(\text{Cool} | \text{Yes}) = \frac{2}{5} = 0.4$$
For No class:
- Sunny & No: 3 times (out of 3 No)
- Cool & No: 0 times (out of 3 No)
$$P(\text{Sunny} | \text{No}) = \frac{3}{3} = 1.0$$ $$P(\text{Cool} | \text{No}) = \frac{0}{3} = 0$$
Step 3: Apply Naive Bayes
$$P(\text{Yes} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{Yes}) \times P(\text{Cool} | \text{Yes}) \times P(\text{Yes})$$ $$= 0 \times 0.4 \times 0.625 = 0$$
$$P(\text{No} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{No}) \times P(\text{Cool} | \text{No}) \times P(\text{No})$$ $$= 1.0 \times 0 \times 0.375 = 0$$
Problem: Zero Probability Issue!
Both classes have probability 0, making classification impossible. This is because "Sunny + Cool" combination never appeared in training data.
Solution: Laplace Smoothing
Corrected formula: $$P(\text{Feature} \mid \text{Class}) = \frac{\text{count} + 1}{\text{total} + k}$$ where $k$ is the number of possible values of that feature.
After applying smoothing:
For Yes class (Outlook has k = 3 possible values: Sunny, Overcast, Rain; Temperature also has 3): $$P(\text{Sunny} | \text{Yes}) = \frac{0 + 1}{5 + 3} = \frac{1}{8} = 0.125$$ $$P(\text{Cool} | \text{Yes}) = \frac{2 + 1}{5 + 3} = \frac{3}{8} = 0.375$$
For No class: $$P(\text{Sunny} | \text{No}) = \frac{3 + 1}{3 + 3} = \frac{4}{6} = 0.667$$ $$P(\text{Cool} | \text{No}) = \frac{0 + 1}{3 + 3} = \frac{1}{6} = 0.167$$
Recalculate:
$$\text{Yes Score} = 0.125 \times 0.375 \times 0.625 = 0.0293$$ $$\text{No Score} = 0.667 \times 0.167 \times 0.375 = 0.0418$$
Conclusion: Since $0.0418 > 0.0293$, prediction is No (Don't play).
Key Takeaways:
- Naive Bayes assumes feature independence (hence "Naive")
- Zero probability requires smoothing techniques
- Laplace smoothing is most commonly used
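补充:把上面的拉普拉斯平滑写成代码核对一遍(示例;alpha=1,每个特征有 3 个取值;0.0417 与手算的 0.0418 的差异来自中间值的舍入):
def smoothed(count, class_total, n_values, alpha=1):
    # Laplace smoothing: (count + alpha) / (class_total + alpha * n_values)
    return (count + alpha) / (class_total + alpha * n_values)
p_sunny_yes, p_cool_yes = smoothed(0, 5, 3), smoothed(2, 5, 3)  # 1/8, 3/8
p_sunny_no, p_cool_no = smoothed(3, 3, 3), smoothed(0, 3, 3)    # 4/6, 1/6
yes_score = p_sunny_yes * p_cool_yes * 5 / 8
no_score = p_sunny_no * p_cool_no * 3 / 8
print(round(yes_score, 4), round(no_score, 4))  # 0.0293 0.0417 -> predict No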
Q1/Q2 常见陷阱
| 陷阱 | 易错点 | 正确做法 |
|---|---|---|
| 可变默认参数 | def f(x, lst=[]) | 用 None 替代,内部初始化 |
| 列表 vs 元组 | 认为元组也可变 | 记住:列表可变,元组不可变 |
| 循环变量作用域 | i 在循环外消失 | Python 的 i 在循环外仍存在 |
| 列表切片 | 认为 lst[:] 是引用 | lst[:] 创建浅拷贝 |
Q3 数据处理检查清单
- 检查缺失值,用
isnull()确认位置 - 选择合适的填充方法(均值、中位数、前向填充等)
- 对分类变量选择合适的编码(One-Hot vs Label)
- 特征缩放(归一化 / 标准化)
- 拆分训练/测试集
Q4 手算检查清单
Naive Bayes:
- 计算先验概率 P(Class)
- 计算似然概率 P(Feature | Class)
- 相乘得到后验分子
- 比较大小做决策
Decision Tree:
- 明确样本总数和各类样本数
- 计算概率 p_i = count_i / total
- 使用计算器计算 log₂ 值
- 代入公式计算熵值
- 保留 3 位小数报告答案
⚡ 附加:数据预处理公式(速查)
来源:期末真题 Q3(b) - 写出两种数据归一化公式。
1. Min-Max Normalization (最小-最大规范化)
目的:将数据缩放到 [0, 1] 范围
$$X_{new} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
解释:
- $X_{min}$ → 变为 0
- $X_{max}$ → 变为 1
- 中间值按比例缩放
使用场景:
- 已知数据范围有明确上下界
- 神经网络输入层(需要0-1范围)
- 图像像素值归一化
缺点:
- 对异常值敏感(一个极端值会影响整体缩放)
Explanation:
- $X_{min}$ → becomes 0
- $X_{max}$ → becomes 1
- Values in between scaled proportionally
Use Cases:
- Known data range with clear bounds
- Neural network input layer (requires 0-1 range)
- Image pixel normalization
Drawbacks:
- Sensitive to outliers (one extreme value affects entire scaling)
2. Z-Score Standardization (标准化)
目的:将数据转换为均值=0,标准差=1
$$X_{new} = \frac{X - \mu}{\sigma}$$
其中:
- $\mu$ = 均值 (mean)
- $\sigma$ = 标准差 (standard deviation)
解释:
- 数据变换后均值为 0
- 标准差为 1
- 范围不固定(可能是负数)
使用场景:
- 数据存在异常值
- 需要比较不同量纲的特征
- SVM、KNN、逻辑回归等算法
优点:
- 对异常值更鲁棒(相比 Min-Max)
Where:
- $\mu$ = mean
- $\sigma$ = standard deviation
Explanation:
- Transformed data has mean of 0
- Standard deviation of 1
- Range not fixed (can be negative)
Use Cases:
- Data contains outliers
- Need to compare features with different scales
- Algorithms like SVM, KNN, Logistic Regression
Advantages:
- More robust to outliers (compared to Min-Max)
快速对比表
| 方法 | 公式 | 输出范围 | 异常值敏感度 | 使用场景 |
|---|---|---|---|---|
| Min-Max | $\frac{X - X_{min}}{X_{max} - X_{min}}$ | [0, 1] | 高 | 神经网络、有界数据 |
| Z-Score | $\frac{X - \mu}{\sigma}$ | 无界 | 低 | SVM、KNN、有异常值数据 |
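补充:两个公式用 NumPy 各一行即可实现(示例数据为自拟):
import numpy as np
x = np.array([10., 20., 30., 40., 50.])
# Min-Max normalization -> range [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())
# Z-score standardization -> mean 0, std 1
x_zscore = (x - x.mean()) / x.std()
print(x_minmax)  # [0.   0.25 0.5  0.75 1.  ]
print(x_zscore)  # approx. [-1.414 -0.707  0.     0.707  1.414]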
💡 最后的建议
- Python 部分(Q1/Q2):重点记住 可变性 和 作用域 的概念,多做代码追踪题。
- 数据科学部分(Q3):理解编码方法的 什么时候用,算法之间的 核心区别。
- 手算部分(Q4):务必带计算器,步骤要清晰,最后答案保留 3 位小数。
- 考试策略:先做自己擅长的题,再挑战计算题。时间紧张时,理论题往往比手算题更容易拿分。
祝考试顺利! 🎓
最后更新:2026年1月24日 | 全面升级版,包含50+实战题目(含期末真题 Q1-Q4)