
ABW505 Python与预测分析期末终极题库:代码填空 + 算法手算 (含真题解析)

January 24, 2026
50 min read
#期末复习#ABW505#Python#Numpy#KNN#机器学习#题库#算法手算



🎓 ABW505 考试结构(官方)

题号 | 分值 | 题型 | 考试范围
Q1 | 20 | Type 2 | Python入门:核心对象、变量、输入输出、列表、元组、函数、循环、决策结构
Q2 | 30 | Type 3 | 5选3:决策结构、重复结构、布尔逻辑、列表/元组、函数
Q3 | 25 | Type 1,3 | Pandas、数据预处理、编码器、SVM、随机森林
Q4 | 25 | Type 1,3,4 | 朴素贝叶斯、决策树、数据预处理(带计算器!)

题型说明:

  • Type 1: 理论题
  • Type 2: 解读代码,计算结果
  • Type 3: 解读结果,编写代码
  • Type 4: 计算题(带计算器)

🟢 第一部分:Python 核心与逻辑 (Q1 & Q2)

🧐 题型一:代码追踪

题型说明: 阅读代码,预测输出结果。考察变量作用域、可变性及逻辑流。

Q1.1: 列表的陷阱 (List Mutability)

def extend_list(val, list=[]):
    list.append(val)
    return list
 
list1 = extend_list(10)
list2 = extend_list(123, [])
list3 = extend_list('a')
 
print(f"list1 = {list1}")
print(f"list2 = {list2}")
print(f"list3 = {list3}")

Output:

list1 = [10, 'a']
list2 = [123]
list3 = [10, 'a']

解析:

这是 Python 面试经典题。函数的默认参数 list=[] 是在函数定义时创建的,且只创建一次。

  • 第一次调用 extend_list(10) → 默认参数初始化为 [],然后 append 10 → [10]
  • 第二次调用 extend_list(123, []) → 传入新列表 [],与默认参数无关 → [123]
  • 第三次调用 extend_list('a') → 再次使用已被修改过的默认参数 [10] → [10, 'a']

关键概念:Python 默认参数在函数定义时评估,可变对象会跨调用保留状态。

Analysis:

This is a classic Python interview question. The default parameter list=[] is created once at function definition time, not every time the function is called.

  • First call extend_list(10) → Default parameter initialized as [], then append 10 → [10]
  • Second call extend_list(123, []) → Passes a new list [], independent of default → [123]
  • Third call extend_list('a') → Reuses the already-modified default parameter [10] → [10, 'a']

Key Concept: Python default parameters are evaluated at function definition time. Mutable objects retain their state across function calls.
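
To avoid this trap in your own code, the usual idiom is to default to None and build the list inside the function. A minimal sketch of that fix (extend_list_safe is just an illustrative name, not part of the exam question):

def extend_list_safe(val, lst=None):
    # Create a fresh list on every call instead of sharing one default object
    if lst is None:
        lst = []
    lst.append(val)
    return lst

print(extend_list_safe(10))   # [10]
print(extend_list_safe('a'))  # ['a'] - no state carried over between calls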


Q1.2: 循环与逻辑控制

x = 0
for i in range(5):
    if i == 2:
        continue
    if i == 4:
        break
    x += i
print(x)

Output:

4

执行追踪:

轮次 | i | 条件 | 操作 | x 值
1 | 0 | - | x += 0 | 0
2 | 1 | - | x += 1 | 1
3 | 2 | i==2 | continue(跳过) | 1
4 | 3 | - | x += 3 | 4
5 | 4 | i==4 | break(终止) | 4

结果: x = 4

Execution Trace:

Iteration | i | Condition | Operation | x Value
1 | 0 | - | x += 0 | 0
2 | 1 | - | x += 1 | 1
3 | 2 | i==2 | continue (skip) | 1
4 | 3 | - | x += 3 | 4
5 | 4 | i==4 | break (exit) | 4

Result: x = 4


Q1.3: 元组拆包与切片

data = (10, 20, 30, 40, 50)
a, *b, c = data
print(a)
print(b)
print(c)

Output:

10
[20, 30, 40]
50

解析:

这是 Python 3 的拓展解包(Extended Unpacking)语法:

  • a 拿走第一个元素 → 10
  • c 拿走最后一个元素 → 50
  • *b 拿走中间剩余的所有元素,并打包成一个列表 → [20, 30, 40]

注意:*b 收集的是列表,不是元组。

Analysis:

This uses Python 3's Extended Unpacking syntax:

  • a takes the first element → 10
  • c takes the last element → 50
  • *b collects all remaining middle elements and packs them as a list → [20, 30, 40]

Note: *b collects into a list, not a tuple.


Q1.5: Numpy 矩阵操作(期末真题 Q1a)

要求: 编写 Numpy 程序完成以下操作:

  1. 创建一个 4x4 的矩阵,数值范围从 1 到 16。
  2. 创建一个长度为 10 的空向量(全0),并将第 7 个值更新为 10。
  3. 创建一个 8x8 的矩阵,并用 0 和 1 填充成"棋盘模式"。
Reference Answer:
import numpy as np
 
# (i) 4x4 matrix ranging from 1 to 16
matrix_4x4 = np.arange(1, 17).reshape(4, 4)
print("4x4 Matrix:\n", matrix_4x4)
 
# (ii) Null vector of size 10, update 7th value to 10
# Note: Python indexing starts from 0, so 7th value is index 6
null_vector = np.zeros(10)
null_vector[6] = 10
print("\nNull Vector:\n", null_vector)
 
# (iii) 8x8 Checkerboard pattern
checkerboard = np.zeros((8, 8), dtype=int)
# Use slicing: odd rows even columns, even rows odd columns set to 1
checkerboard[1::2, ::2] = 1
checkerboard[::2, 1::2] = 1
print("\nCheckerboard:\n", checkerboard)

Run Output:

4x4 Matrix:
 [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]]
 
Null Vector:
 [ 0.  0.  0.  0.  0.  0. 10.  0.  0.  0.]
 
Checkerboard:
 [[0 1 0 1 0 1 0 1]
 [1 0 1 0 1 0 1 0]
 [0 1 0 1 0 1 0 1]
 [1 0 1 0 1 0 1 0]
 [0 1 0 1 0 1 0 1]
 [1 0 1 0 1 0 1 0]
 [0 1 0 1 0 1 0 1]
 [1 0 1 0 1 0 1 0]]

解析:

(i) 创建 4x4 矩阵:

  • np.arange(1, 17) 创建从 1 到 16 的数组(不包括17)
  • .reshape(4, 4) 将其重塑为 4x4 矩阵

(ii) 空向量与索引:

  • np.zeros(10) 创建全0向量
  • 关键: Python 索引从 0 开始,第 7 个值的索引是 6
  • null_vector[6] = 10 更新第 7 个位置

(iii) 棋盘模式:

  • 初始化全0矩阵
  • checkerboard[1::2, ::2] = 1 → 奇数行的偶数列置为1
  • checkerboard[::2, 1::2] = 1 → 偶数行的奇数列置为1
  • 切片语法:[start:stop:step]

考点:

  • Numpy 数组创建与重塑
  • 索引规则(0-based vs 1-based)
  • 切片操作的步长参数

Analysis:

(i) Create 4x4 Matrix:

  • np.arange(1, 17) creates array from 1 to 16 (excluding 17)
  • .reshape(4, 4) reshapes it into 4x4 matrix

(ii) Null Vector & Indexing:

  • np.zeros(10) creates zero vector
  • Key Point: Python uses 0-based indexing, so 7th value is at index 6
  • null_vector[6] = 10 updates the 7th position

(iii) Checkerboard Pattern:

  • Initialize all-zero matrix
  • checkerboard[1::2, ::2] = 1 → Odd rows, even columns set to 1
  • checkerboard[::2, 1::2] = 1 → Even rows, odd columns set to 1
  • Slicing syntax: [start:stop:step]

Key Concepts:

  • Numpy array creation and reshaping
  • Indexing rules (0-based vs 1-based)
  • Slicing with step parameter

Q1.6: Numpy 向量反转(期末真题 Q1b)

要求: 创建一个向量,包含从 10 到 49 的数值,并将其反转。

Reference Answer:
import numpy as np
 
# Create vector from 10 to 49
vector = np.arange(10, 50)
print("Original Vector:\n", vector)
 
# Reverse it using slicing
vector_reversed = vector[::-1]
print("\nReversed Vector:\n", vector_reversed)

Run Output:

Original Vector:
 [10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]
 
Reversed Vector:
 [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26
 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10]

解析:

方法1: 切片反转(推荐):

  • [::-1] 是Python切片的反转语法
  • start:stop:step,步长为-1表示从后往前

方法2: np.flip()函数:

vector_reversed = np.flip(vector)

方法3: np.flipud()函数:

vector_reversed = np.flipud(vector)

考点:

  • Python切片反转语法 [::-1]
  • Numpy数组操作函数

Analysis:

Method 1: Slicing Reversal (Recommended):

  • [::-1] is Python's slice reversal syntax
  • start:stop:step, step of -1 means traverse backwards

Method 2: np.flip() function:

vector_reversed = np.flip(vector)

Method 3: np.flipud() function:

vector_reversed = np.flipud(vector)

Key Concepts:

  • Python slice reversal syntax [::-1]
  • Numpy array manipulation functions

Q1.4: 函数参数 - *args 和 **kwargs

def process_data(*args, **kwargs):
    print(f"Positional args: {args}")
    print(f"Keyword args: {kwargs}")
    total = sum(args)
    multiplier = kwargs.get('multiplier', 1)
    return total * multiplier
 
result = process_data(1, 2, 3, multiplier=10, label='test')
print(f"Result: {result}")

Output:

Positional args: (1, 2, 3)
Keyword args: {'multiplier': 10, 'label': 'test'}
Result: 60

解析:

*args(任意数量的位置参数):

  • 收集所有位置参数到一个元组中
  • 在这里:args = (1, 2, 3)
  • 可以像普通元组一样使用:迭代、索引、sum() 等

**kwargs(任意数量的关键字参数):

  • 收集所有关键字参数到一个字典中
  • 在这里:kwargs = {'multiplier': 10, 'label': 'test'}
  • 使用 .get() 方法安全访问,避免 KeyError

执行流程:

  1. sum(args) → sum((1, 2, 3)) → 6
  2. kwargs.get('multiplier', 1) → 返回 10(如果没有则返回默认值 1)
  3. 6 * 10 = 60

Analysis:

*args (Variable Positional Arguments):

  • Collects all positional arguments into a tuple
  • Here: args = (1, 2, 3)
  • Can be used like a normal tuple: iterate, index, sum(), etc.

**kwargs (Variable Keyword Arguments):

  • Collects all keyword arguments into a dictionary
  • Here: kwargs = {'multiplier': 10, 'label': 'test'}
  • Use .get() method for safe access, avoiding KeyError

Execution Flow:

  1. sum(args) → sum((1, 2, 3)) → 6
  2. kwargs.get('multiplier', 1) → Returns 10 (default 1 if not present)
  3. 6 * 10 = 60

💻 题型二:编程实战

题型说明: 根据要求写出代码。

Q2.1: 列表推导式

需求: 给定一个数字列表,创建一个新列表,只包含偶数乘以 2 的结果。

Input: nums = [1, 2, 3, 4, 5]

Solution Code:

nums = [1, 2, 3, 4, 5]
result = [x * 2 for x in nums if x % 2 == 0]
print(result)

Run Output:

[4, 8]

解析:

  • 遍历 nums 中的每个元素 x
  • 筛选条件:x % 2 == 0(偶数)
  • 变换:x * 2(乘以 2)
  • 偶数有 2 和 4,所以结果是 [4, 8]

Analysis:

  • Iterate through each element x in nums
  • Filter condition: x % 2 == 0 (even numbers)
  • Transform: x * 2 (multiply by 2)
  • Even numbers are 2 and 4, so result is [4, 8]

Q2.2: 嵌套列表操作

需求: 给定一个嵌套列表,将其展平为单层列表。

Input: nested = [[1, 2], [3, 4, 5], [6]]

Solution Code:

nested = [[1, 2], [3, 4, 5], [6]]
 
# Method 1: List comprehension
flat = [item for sublist in nested for item in sublist]
print(f"Method 1: {flat}")
 
# Method 2: Using sum() with empty list
flat2 = sum(nested, [])
print(f"Method 2: {flat2}")
 
# Method 3: Traditional loop
flat3 = []
for sublist in nested:
    for item in sublist:
        flat3.append(item)
print(f"Method 3: {flat3}")

Run Output:

Method 1: [1, 2, 3, 4, 5, 6]
Method 2: [1, 2, 3, 4, 5, 6]
Method 3: [1, 2, 3, 4, 5, 6]

解析:

方法 1(列表推导式):

  • 外层循环:for sublist in nested → 遍历每个子列表
  • 内层循环:for item in sublist → 遍历子列表中的每个元素
  • 这是最 Pythonic 的写法,推荐使用

方法 2(sum函数):

  • sum(nested, []) → 从空列表开始,依次将每个子列表相加
  • [] + [1,2] + [3,4,5] + [6] → 最终得到 [1,2,3,4,5,6]
  • 简洁但可读性略差

方法 3(传统循环):

  • 双重 for 循环逐个添加元素
  • 代码最长但最容易理解

Analysis:

Method 1 (List Comprehension):

  • Outer loop: for sublist in nested → Iterate through each sublist
  • Inner loop: for item in sublist → Iterate through each element in sublist
  • This is the most Pythonic way, recommended

Method 2 (sum function):

  • sum(nested, []) → Starts with empty list, concatenates each sublist
  • [] + [1,2] + [3,4,5] + [6] → Results in [1,2,3,4,5,6]
  • Concise but less readable

Method 3 (Traditional Loop):

  • Nested for loops add elements one by one
  • Longest code but easiest to understand

Q2.3: 字典统计

需求: 统计字符串中每个字符出现的频率。

Input: text = "hello"

Solution Code:

text = "hello"
freq = {}
for char in text:
    if char in freq:
        freq[char] += 1
    else:
        freq[char] = 1
print(freq)

Run Output:

{'h': 1, 'e': 1, 'l': 2, 'o': 1}

解析:

  • 遍历字符串中的每个字符
  • 如果字符已在字典中,计数加 1
  • 如果字符不在字典中,初始化为 1
  • 最终 'l' 出现 2 次(因为有两个 l),其他字符各出现 1 次

进阶写法(使用 defaultdict):

Analysis:

  • Iterate through each character in the string
  • If character exists in dictionary, increment count by 1
  • If character doesn't exist, initialize it as 1
  • Finally, 'l' appears 2 times, other characters appear 1 time each

Advanced Implementation (using defaultdict):

from collections import defaultdict
text = "hello"
freq = defaultdict(int)
for char in text:
    freq[char] += 1
print(dict(freq))
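
collections.Counter gives the same frequency table in one call; a short alternative sketch (not required by the question):

from collections import Counter

text = "hello"
freq = Counter(text)
print(dict(freq))  # {'h': 1, 'e': 1, 'l': 2, 'o': 1}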

Q2.4: Lambda 与 Filter 操作

需求: 从数字列表中,使用 lambda 和 filter 筛选出大于 5 的数字。

Input: numbers = [2, 8, 3, 12, 5, 7]

Solution Code:

numbers = [2, 8, 3, 12, 5, 7]
 
# Using filter() with lambda
result = list(filter(lambda x: x > 5, numbers))
print(f"Numbers > 5: {result}")
 
# Equivalent list comprehension
result2 = [x for x in numbers if x > 5]
print(f"List comprehension: {result2}")

Run Output:

Numbers > 5: [8, 12, 7]
List comprehension: [8, 12, 7]

解析:

Lambda 函数:

  • lambda x: x > 5 → 创建一个匿名函数,检查 x 是否大于 5
  • 返回布尔值 True 或 False

Filter 函数:

  • filter(function, iterable) → 对可迭代对象应用函数,保留返回 True 的元素
  • 返回一个 filter 对象(需要用 list() 转换)

两种方法比较:

  • filter() + lambda:函数式编程风格
  • 列表推导式:更 Pythonic,可读性更好,推荐使用

Analysis:

Lambda Function:

  • lambda x: x > 5 → Creates an anonymous function that checks if x > 5
  • Returns boolean True or False

Filter Function:

  • filter(function, iterable) → Applies function to iterable, keeps elements that return True
  • Returns a filter object (needs list() conversion)

Comparison:

  • filter() + lambda: Functional programming style
  • List comprehension: More Pythonic, better readability, recommended

Q2.6: While 循环 - 猜数字游戏逻辑(复习题)

要求: 修复并解释以下"猜数字"代码逻辑。目标:生成 1-9 的随机数,允许猜 4 次。

Reference Code:
import random
 
target_num = random.randint(1, 9)
guess_num = 0
guess_counter = 0
max_attempts = 4
 
# Loop condition: haven't guessed correctly AND within attempt limit
while target_num != guess_num and guess_counter < max_attempts:
    guess_counter += 1
    
    # In real exam, you'd use input()
    guess_num = int(input(f"Attempt {guess_counter}/{max_attempts}, Enter a number (1-9): "))
    
    if target_num == guess_num:
        print(f"🎉 Well guessed! The number was {target_num}")
        break
    elif guess_counter < max_attempts:
        if guess_num < target_num:
            print("Too low! Try again.")
        else:
            print("Too high! Try again.")
    else:
        print(f"❌ Out of chances! The number was {target_num}")

考点解析:

1. While 循环终止条件:

while target_num != guess_num and guess_counter < max_attempts:
  • 两个条件必须同时满足才继续循环
  • 猜对了或次数用完都会退出

2. break 语句:

  • 提前终止循环,不等待条件检查
  • 适用于猜对的情况

3. 变量自增:

guess_counter += 1
  • 等价于 guess_counter = guess_counter + 1

4. 边界条件检查:

  • guess_counter < max_attempts 确保不会超过限制
  • 最后一次猜测时不再显示"Try again"提示

常见错误:

  • 忘记递增 guess_counter,导致死循环
  • 终止条件写错(如用 or 代替 and)
  • 在循环外忘记检查是否猜对

Key Concepts:

1. While Loop Termination Condition:

while target_num != guess_num and guess_counter < max_attempts:
  • Both conditions must be true to continue looping
  • Exits when guessed correctly OR out of attempts

2. break Statement:

  • Terminates loop early without waiting for condition check
  • Used when guess is correct

3. Variable Increment:

guess_counter += 1
  • Equivalent to guess_counter = guess_counter + 1

4. Boundary Condition Check:

  • guess_counter < max_attempts ensures no overflow
  • Don't show "Try again" on last attempt

Common Mistakes:

  • Forgetting to increment guess_counter, causing infinite loop
  • Wrong termination condition (using or instead of and)
  • Not checking if guessed correctly outside loop

Q2.7: Matplotlib 可视化 - 水平条形图(复习题)

要求: 根据以下数据绘制水平条形图:

Data: Moscow (70), Tokyo (60), Washington (75), Beijing (50), Delhi (40)

Reference Code:
import matplotlib.pyplot as plt
 
# 1. Prepare data
cities = ['Moscow', 'Tokyo', 'Washington', 'Beijing', 'Delhi']
happiness_index = [70, 60, 75, 50, 40]
 
# 2. Create figure
plt.figure(figsize=(10, 5))
 
# 3. Draw horizontal bar chart (barh)
plt.barh(cities, happiness_index, color='skyblue', edgecolor='navy')
 
# 4. Add labels and title
plt.xlabel('Happiness Index', fontsize=12)
plt.ylabel('City', fontsize=12)
plt.title('Happiness Index by City', fontsize=14, fontweight='bold')
 
# 5. Add grid for better readability
plt.grid(axis='x', linestyle='--', alpha=0.7)
 
# 6. Display
plt.tight_layout()
plt.show()

代码解析:

1. 数据准备:

  • 使用列表存储城市名称和幸福指数
  • 顺序要对应(城市和数值按位置匹配)

2. 创建图形:

  • plt.figure(figsize=(10, 5)) 设置画布大小(宽10英寸,高5英寸)

3. 绘制水平条形图:

  • plt.barh() 是横向条形图(h = horizontal)
  • plt.bar() 是纵向条形图
  • color='skyblue' 设置填充色
  • edgecolor='navy' 设置边框色

4. 添加标签:

  • xlabel() - X轴标签
  • ylabel() - Y轴标签
  • title() - 图表标题
  • fontsize - 字体大小
  • fontweight='bold' - 加粗

5. 网格线:

  • plt.grid(axis='x') 只显示X轴网格
  • linestyle='--' 虚线
  • alpha=0.7 透明度70%

6. 显示图形:

  • plt.tight_layout() 自动调整布局,避免标签重叠
  • plt.show() 显示图形

常见考点:

  • 区分 bar() 和 barh()
  • 标签参数的拼写(xlabel, ylabel, title)
  • 颜色参数名称(color, edgecolor)

Code Analysis:

1. Data Preparation:

  • Use lists to store city names and happiness indices
  • Order must match (cities and values aligned by position)

2. Create Figure:

  • plt.figure(figsize=(10, 5)) sets canvas size (10 inches wide, 5 inches tall)

3. Draw Horizontal Bar Chart:

  • plt.barh() is for horizontal bars (h = horizontal)
  • plt.bar() is for vertical bars
  • color='skyblue' sets fill color
  • edgecolor='navy' sets border color

4. Add Labels:

  • xlabel() - X-axis label
  • ylabel() - Y-axis label
  • title() - Chart title
  • fontsize - Font size
  • fontweight='bold' - Bold text

5. Grid Lines:

  • plt.grid(axis='x') shows only X-axis grid
  • linestyle='--' dashed line
  • alpha=0.7 70% transparency

6. Display Figure:

  • plt.tight_layout() auto-adjusts layout to prevent label overlap
  • plt.show() displays the figure

Common Exam Points:

  • Distinguish between bar() and barh()
  • Spelling of label parameters (xlabel, ylabel, title)
  • Color parameter names (color, edgecolor)

Q2.5: 集合操作

需求: 找出两个集合的交集和差集。

Input:

set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}

Solution Code:

set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}
 
# Intersection - common elements
intersection = set_a & set_b  # or set_a.intersection(set_b)
print(f"Intersection: {intersection}")
 
# Difference - elements in A but not in B
difference = set_a - set_b  # or set_a.difference(set_b)
print(f"A - B: {difference}")
 
# Symmetric difference - elements in either but not both
sym_diff = set_a ^ set_b  # or set_a.symmetric_difference(set_b)
print(f"Symmetric difference: {sym_diff}")
 
# Union - all unique elements
union = set_a | set_b  # or set_a.union(set_b)
print(f"Union: {union}")

Run Output:

Intersection: {4, 5}
A - B: {1, 2, 3}
Symmetric difference: {1, 2, 3, 6, 7, 8}
Union: {1, 2, 3, 4, 5, 6, 7, 8}

解析:

集合运算符:

  • & (交集):两个集合共有的元素 → {4, 5}
  • - (差集):在 A 中但不在 B 中 → {1, 2, 3}
  • ^ (对称差集):在 A 或 B 中,但不同时在两者中 → {1, 2, 3, 6, 7, 8}
  • | (并集):A 和 B 的所有唯一元素 → {1, 2, 3, 4, 5, 6, 7, 8}

应用场景:

  • 数据去重、查找共同项、排除重复等

Analysis:

Set Operators:

  • & (intersection): Elements common to both sets → {4, 5}
  • - (difference): Elements in A but not in B → {1, 2, 3}
  • ^ (symmetric difference): Elements in either but not both → {1, 2, 3, 6, 7, 8}
  • | (union): All unique elements from both → {1, 2, 3, 4, 5, 6, 7, 8}

Use Cases:

  • Data deduplication, finding common items, excluding duplicates

🔵 第二部分:数据科学与算法 (Q3 & Q4)

🐼 Q3: Pandas 与机器学习理论

Q3.1: Pandas 数据预处理

场景: 你有一个 DataFrame df,包含列 ['Salary', 'Department']。

要求:

  1. 用中位数填充缺失的 Salary
  2. 筛选 Salary > 5000 的行

Solution Code:

import pandas as pd
import numpy as np
 
# Sample data
data = {'Salary': [3000, 6000, np.nan, 8000], 
        'Department': ['HR', 'IT', 'IT', 'HR']}
df = pd.DataFrame(data)
 
print("Original data:")
print(df)
print("\n" + "="*50 + "\n")
 
# 1. Fill missing values with median (assign back rather than using
#    inplace=True on a column, which is discouraged in recent pandas)
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
 
# 2. Filter high salary
high_salary = df[df['Salary'] > 5000]
 
print("After filling and filtering:")
print(high_salary)

Run Output:

Original data:
   Salary Department
0  3000.0         HR
1  6000.0         IT
2    NaN         IT
3  8000.0         HR
 
==================================================
 
After filling and filtering:
   Salary Department
1  6000.0         IT
2  6000.0         IT
3  8000.0         HR

关键步骤:

  • df['Salary'].median() → 计算中位数(3000, 6000, 8000 的中位数是 6000)
  • fillna() → 使用中位数替换 NaN
  • 布尔索引 df[df['Salary'] > 5000] → 筛选高薪员工

Key Steps:

  • df['Salary'].median() → Calculates the median (median of 3000, 6000, 8000 is 6000)
  • fillna() → Replaces NaN with the median
  • Boolean indexing df[df['Salary'] > 5000] → Filters high-salary employees

Q3.2: Pandas 数据分组与聚合

场景: 计算每个部门的平均薪资,并找出平均薪资最高的部门。

Solution Code:

import pandas as pd
 
# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Department': ['HR', 'IT', 'HR', 'IT', 'Finance', 'Finance'],
    'Salary': [5000, 7000, 5500, 8000, 6000, 6500]
}
df = pd.DataFrame(data)
 
print("Original data:")
print(df)
print("\n" + "="*50 + "\n")
 
# Group by department and calculate mean salary
dept_avg = df.groupby('Department')['Salary'].mean()
print("Average salary by department:")
print(dept_avg)
print("\n" + "="*50 + "\n")
 
# Find department with highest average
max_dept = dept_avg.idxmax()
max_salary = dept_avg.max()
print(f"Highest average: {max_dept} (${max_salary:.2f})")

Run Output:

Original data:
      Name Department  Salary
0    Alice         HR    5000
1      Bob         IT    7000
2  Charlie         HR    5500
3    David         IT    8000
4      Eve    Finance    6000
5    Frank    Finance    6500
 
==================================================
 
Average salary by department:
Department
Finance    6250.0
HR         5250.0
IT         7500.0
Name: Salary, dtype: float64
 
==================================================
 
Highest average: IT ($7500.00)

关键操作:

  1. groupby('Department'):按部门分组
  2. ['Salary'].mean():计算每组的平均薪资
  3. idxmax():找到最大值对应的索引(部门名)
  4. max():获取最大值

应用场景:

  • 统计分析、数据透视、业务报表生成

Key Operations:

  1. groupby('Department'): Group by department
  2. ['Salary'].mean(): Calculate mean salary for each group
  3. idxmax(): Get index (department name) of maximum value
  4. max(): Get the maximum value

Use Cases:

  • Statistical analysis, data pivoting, business report generation

Q3.8: 手动推导专家规则与预测(期末真题 Q3)

场景: 给定以下学生数据表,根据已有数据推导决策规则,并预测剩下四个问号的结果。

Student ID | Grade | GPA | Prediction (Yes/No)
101 | A | 3.85 | Yes
205 | C | 2.85 | No
640 | A- | 3.50 | Yes
710 | B | 3.00 | ? (1)
595 | A | 3.30 | ? (2)
540 | B- | 4.00 | ? (3)
630 | B+ | 3.00 | ? (4)

Task (i): 提出手动专家规则 (Propose Manual Expert Rules)

观察前三行已知数据:

  • GPA 3.85 (High) → Prediction = Yes
  • GPA 2.85 (Low) → Prediction = No
  • GPA 3.50 (High) → Prediction = Yes

归纳规律:

  • 高 GPA (≥ 3.50) → Yes
  • 低 GPA (< 3.50) → No

提出的专家规则(示例):

Rule 1 (基于 GPA 阈值):
IF GPA >= 3.50 THEN Prediction = Yes
ELSE Prediction = No

或者基于成绩等级的规则:

Rule 2 (基于 Grade):
IF Grade is 'A' or 'A-' THEN Prediction = Yes
ELSE Prediction = No

推荐使用 Rule 1,因为 GPA 是数值型特征,更客观稳定。


Task (ii): 识别预测结果 (Identify Predictions)

基于 Rule 1: IF GPA >= 3.50 THEN Yes, ELSE No

Student ID | Grade | GPA | Rule Evaluation | Prediction
710 | B | 3.00 | 3.00 < 3.50 | No
595 | A | 3.30 | 3.30 < 3.50 | No
540 | B- | 4.00 | 4.00 >= 3.50 | Yes ✅
630 | B+ | 3.00 | 3.00 < 3.50 | No

最终答案:

  1. Student 710: No
  2. Student 595: No (虽然是A,但GPA 3.30 < 3.50)
  3. Student 540: Yes ✅
  4. Student 630: No

如果使用 Rule 2 (基于 Grade):

IF Grade is 'A' or 'A-' THEN Yes
ELSE No
Student ID | Grade | Rule Evaluation | Prediction
710 | B | B ≠ A/A- | No
595 | A | A = A | Yes ✅
540 | B- | B- ≠ A/A- | No
630 | B+ | B+ ≠ A/A- | No

最终答案 (Rule 2):

  1. Student 710: No
  2. Student 595: Yes ✅
  3. Student 540: No
  4. Student 630: No

考点总结:

  1. 规则归纳:从已知样本中观察模式,提出假设
  2. 规则验证:规则应该能正确分类训练数据(前3行)
  3. 规则应用:对新样本逐条检查,输出预测
  4. 多规则冲突:不同规则可能给出不同预测,需要选择最合理的(考试时说明你的选择依据)

实际操作建议:

  • 先尝试最简单的规则(单特征阈值)
  • 如果数据模式复杂,可以组合多个特征(AND/OR 逻辑)
  • 规则要能解释(可解释性是专家规则的优势)

Task (i): Propose Manual Expert Rules

Observation from first three rows:

  • GPA 3.85 (High) → Prediction = Yes
  • GPA 2.85 (Low) → Prediction = No
  • GPA 3.50 (High) → Prediction = Yes

Pattern Identified:

  • High GPA (≥ 3.50) → Yes
  • Low GPA (< 3.50) → No

Proposed Expert Rule (Example):

Rule 1 (Based on GPA threshold):
IF GPA >= 3.50 THEN Prediction = Yes
ELSE Prediction = No

Or rule based on Grade:

Rule 2 (Based on Grade):
IF Grade is 'A' or 'A-' THEN Prediction = Yes
ELSE Prediction = No

Recommend Rule 1, because GPA is a numerical feature, more objective and stable.


Task (ii): Identify Predictions

Based on Rule 1: IF GPA >= 3.50 THEN Yes, ELSE No

Student ID | Grade | GPA | Rule Evaluation | Prediction
710 | B | 3.00 | 3.00 < 3.50 | No
595 | A | 3.30 | 3.30 < 3.50 | No
540 | B- | 4.00 | 4.00 >= 3.50 | Yes ✅
630 | B+ | 3.00 | 3.00 < 3.50 | No

Final Answers:

  1. Student 710: No
  2. Student 595: No (Grade A but GPA 3.30 < 3.50)
  3. Student 540: Yes ✅
  4. Student 630: No

If using Rule 2 (Based on Grade):

IF Grade is 'A' or 'A-' THEN Yes
ELSE No
Student ID | Grade | Rule Evaluation | Prediction
710 | B | B ≠ A/A- | No
595 | A | A = A | Yes ✅
540 | B- | B- ≠ A/A- | No
630 | B+ | B+ ≠ A/A- | No

Final Answers (Rule 2):

  1. Student 710: No
  2. Student 595: Yes ✅
  3. Student 540: No
  4. Student 630: No

Key Concepts Summary:

  1. Rule Induction: Observe patterns from known samples, propose hypotheses
  2. Rule Validation: Rules should correctly classify training data (first 3 rows)
  3. Rule Application: Apply rules to new samples one by one, output predictions
  4. Rule Conflicts: Different rules may give different predictions, choose most reasonable (explain your choice in exam)

Practical Tips:

  • Start with simplest rules (single feature threshold)
  • For complex patterns, combine multiple features (AND/OR logic)
  • Rules should be interpretable (interpretability is advantage of expert rules)
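
Expressed in code, Rule 1 is a single comparison; a minimal sketch using the GPA threshold derived above (predict_rule1 is an illustrative name):

def predict_rule1(gpa):
    # Expert Rule 1: GPA >= 3.50 -> Yes, otherwise No
    return "Yes" if gpa >= 3.50 else "No"

for student_id, gpa in [(710, 3.00), (595, 3.30), (540, 4.00), (630, 3.00)]:
    print(student_id, predict_rule1(gpa))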

Q3.3: 编码方法选择

问题: 什么时候应该用 One-Hot Encoding 而不是 Label Encoding?


Label Encoding 的工作原理:

  • 将分类转换为数字(例如:Red=1, Blue=2, Green=3)
  • 问题:算法可能误认为 "3 > 1" 意味着 Green "大于" Red(引入虚假的序关系)

One-Hot Encoding 的工作原理:

  • 为每个类别创建一个二进制列(Is_Red, Is_Blue, Is_Green 等)
  • 每行只有一列为 1,其余为 0

使用规则:

数据类型 | 定义 | 例子 | 编码方法
Nominal(名义) | 无序、无排名关系 | 城市、颜色、品牌 | ✅ One-Hot
Ordinal(序数) | 有序、有排名关系 | Low, Medium, High | ✅ Label

结论:用 One-Hot 处理没有排名关系的分类数据(如城市、颜色、国家),用 Label 处理有明确顺序的分类数据(如教育水平、收入等级)。

Label Encoding works by:

  • Converting categories to integers (e.g., Red=1, Blue=2, Green=3)
  • Problem: Algorithms might interpret "3 > 1" as Green being "greater" than Red (introduces false ordinal relationship)

One-Hot Encoding works by:

  • Creating a binary column for each category (Is_Red, Is_Blue, Is_Green, etc.)
  • Each row has exactly one column as 1, rest as 0

Usage Rules:

Data Type | Definition | Examples | Method
Nominal | No order, no ranking | Cities, Colors, Brands | ✅ One-Hot
Ordinal | Has order, has ranking | Low, Medium, High | ✅ Label

Conclusion: Use One-Hot for unordered categorical data (cities, colors, countries), use Label for ordered categorical data (education level, income brackets).
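
As a concrete illustration, a small sketch using pandas get_dummies for a nominal column and sklearn's LabelEncoder for an ordinal-style column (the City/Level data are made up for the example):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'City': ['KL', 'Penang', 'KL'],        # nominal feature
                   'Level': ['Low', 'High', 'Medium']})   # ordinal feature

# One-Hot: one binary column per city (no false ordering introduced)
one_hot = pd.get_dummies(df['City'], prefix='City')
print(one_hot)

# Label: map each category to an integer; LabelEncoder sorts alphabetically,
# so for truly ordinal data an explicit mapping such as
# {'Low': 0, 'Medium': 1, 'High': 2} is often preferable
le = LabelEncoder()
df['Level_encoded'] = le.fit_transform(df['Level'])
print(df[['Level', 'Level_encoded']])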


Q3.4: 特征缩放 - 归一化 vs 标准化

问题: 什么时候用 Min-Max 归一化,什么时候用 Z-Score 标准化?


Min-Max 归一化(Normalization):

  • 公式: $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
  • 结果范围: [0, 1]
  • 特点: 保留原始数据的分布形状
  • 适用场景:
    • 神经网络(要求输入在固定范围)
    • 图像处理(像素值 0-255 → 0-1)
    • 数据分布已知且无异常值

Z-Score 标准化(Standardization):

  • 公式: $z = \frac{x - \mu}{\sigma}$(其中 μ 是均值,σ 是标准差)
  • 结果范围: 通常在 [-3, 3] 之间(无固定范围)
  • 特点: 数据均值为 0,标准差为 1
  • 适用场景:
    • 存在异常值时更鲁棒
    • SVM、逻辑回归等算法
    • 不同量级特征需要公平比较

Min-Max Normalization:

  • Formula: $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
  • Result Range: [0, 1]
  • Characteristics: Preserves original distribution shape
  • Use Cases:
    • Neural networks (require inputs in fixed range)
    • Image processing (pixel values 0-255 → 0-1)
    • Known distribution without outliers

Z-Score Standardization:

  • Formula: $z = \frac{x - \mu}{\sigma}$ (where μ is mean, σ is standard deviation)
  • Result Range: Typically [-3, 3] (no fixed range)
  • Characteristics: Mean of 0, standard deviation of 1
  • Use Cases:
    • More robust when outliers exist
    • SVM, Logistic Regression algorithms
    • Comparing features of different scales

Code Example:

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np
 
# Sample data with outlier
data = np.array([[1], [2], [3], [4], [100]])  # 100 is outlier
 
print("Original data:")
print(data.flatten())
 
# Min-Max Normalization
min_max_scaler = MinMaxScaler()
normalized = min_max_scaler.fit_transform(data)
print(f"\nMin-Max Normalized: {normalized.flatten()}")
 
# Z-Score Standardization
standard_scaler = StandardScaler()
standardized = standard_scaler.fit_transform(data)
print(f"Standardized: {standardized.flatten()}")

Output:

Original data:
[  1   2   3   4 100]
 
Min-Max Normalized: [0.         0.01010101 0.02020202 0.03030303 1.        ]
Standardized: [-0.53828496 -0.51265234 -0.48701972 -0.4613871   1.99934412]

观察:

  • Min-Max:异常值 100 被映射到 1,其他值挤在 0-0.03 之间(信息损失)
  • Z-Score:异常值约为 2.0,其他值在 -0.5 左右(更好地保留分布)

结论:有异常值时优先用 Z-Score!

Observation:

  • Min-Max: Outlier 100 mapped to 1, others squeezed in 0-0.03 range (information loss)
  • Z-Score: Outlier ~2.0, others around -0.5 (better preserves distribution)

Conclusion: Prefer Z-Score when outliers exist!


Q3.5: 混淆矩阵与模型评估

问题: 给定混淆矩阵,计算精确率、召回率和 F1 分数。

Confusion Matrix:

 | 预测为正 | 预测为负
实际为正 | TP = 80 | FN = 20
实际为负 | FP = 10 | TN = 90

公式:

$$\text{Precision (精确率)} = \frac{TP}{TP + FP} = \frac{80}{80 + 10} = \frac{80}{90} \approx 0.889$$

$$\text{Recall (召回率)} = \frac{TP}{TP + FN} = \frac{80}{80 + 20} = \frac{80}{100} = 0.800$$

$$\text{F1-Score} = 2 \times \frac{Precision \times Recall}{Precision + Recall} = 2 \times \frac{0.889 \times 0.800}{0.889 + 0.800} \approx 0.842$$

解释:

  • Precision(精确率):预测为正的样本中,真正为正的比例("不误伤")

    • 在这个例子中:预测为正的 90 个样本中,80 个确实是正样本
  • Recall(召回率):实际为正的样本中,被正确预测的比例("不漏掉")

    • 在这个例子中:100 个真实正样本中,找到了 80 个
  • F1-Score:Precision 和 Recall 的调和平均数(综合指标)

    • 当两者都重要时使用

应用场景:

  • 垃圾邮件检测:Precision 重要(不能误判正常邮件)
  • 疾病筛查:Recall 重要(不能漏掉病人)
  • 一般分类:F1-Score 平衡两者

Formulas:

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{80}{80 + 10} = \frac{80}{90} \approx 0.889$$

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{80}{80 + 20} = \frac{80}{100} = 0.800$$

$$\text{F1-Score} = 2 \times \frac{Precision \times Recall}{Precision + Recall} = 2 \times \frac{0.889 \times 0.800}{0.889 + 0.800} \approx 0.842$$

Interpretation:

  • Precision: Of predicted positives, how many are actually positive ("don't misclassify")

    • In this example: Of 90 predicted positive, 80 are truly positive
  • Recall: Of actual positives, how many are correctly predicted ("don't miss any")

    • In this example: Of 100 actual positives, found 80
  • F1-Score: Harmonic mean of Precision and Recall (balanced metric)

    • Use when both are equally important

Use Cases:

  • Spam detection: Precision matters (don't flag normal emails)
  • Disease screening: Recall matters (don't miss patients)
  • General classification: F1-Score balances both
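
The same numbers can be double-checked in a few lines of Python; a small sketch that mirrors the hand calculation (TP/FN/FP/TN are taken from the matrix above):

# Verify the hand calculation from the confusion matrix above
TP, FN, FP, TN = 80, 20, 10, 90

precision = TP / (TP + FP)                          # 80 / 90
recall = TP / (TP + FN)                             # 80 / 100
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")  # 0.889
print(f"Recall:    {recall:.3f}")     # 0.800
print(f"F1-Score:  {f1:.3f}")         # 0.842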

Q3.6: SVM 核函数选择

问题: SVM 中 Linear、RBF 和 Polynomial 核函数有什么区别?


1. Linear Kernel(线性核):

  • 公式: $K(x, y) = x^T y$
  • 决策边界: 直线/平面(线性可分)
  • 适用场景:
    • 数据本身线性可分
    • 特征数量远大于样本数量(高维稀疏数据)
    • 文本分类(词袋模型)
  • 优点: 速度快,易解释
  • 缺点: 无法处理非线性问题

2. RBF Kernel(径向基核,Gaussian 核):

  • 公式: $K(x, y) = \exp(-\gamma ||x - y||^2)$
  • 决策边界: 复杂曲线/曲面(非线性)
  • 适用场景:
    • 数据非线性可分
    • 不确定用哪个核时的默认选择
    • 样本数量适中
  • 优点: 强大,能处理复杂边界
  • 缺点: 容易过拟合,需调参 γ

3. Polynomial Kernel(多项式核):

  • 公式: $K(x, y) = (x^T y + c)^d$
  • 决策边界: 多项式曲线
  • 适用场景:
    • 图像处理
    • 需要考虑特征交互
  • 优点: 可控制复杂度(通过 d)
  • 缺点: 计算复杂,参数敏感

选择建议:

  1. 先尝试 Linear → 快速基线
  2. 如果性能不够,试 RBF → 最常用
  3. 特殊场景考虑 Polynomial

1. Linear Kernel:

  • Formula: $K(x, y) = x^T y$
  • Decision Boundary: Line/Plane (linearly separable)
  • Use Cases:
    • Data is linearly separable
    • Features >> Samples (high-dimensional sparse data)
    • Text classification (bag-of-words)
  • Pros: Fast, interpretable
  • Cons: Can't handle non-linear problems

2. RBF Kernel (Radial Basis Function, Gaussian):

  • Formula: $K(x, y) = \exp(-\gamma ||x - y||^2)$
  • Decision Boundary: Complex curves/surfaces (non-linear)
  • Use Cases:
    • Non-linearly separable data
    • Default choice when unsure
    • Moderate sample size
  • Pros: Powerful, handles complex boundaries
  • Cons: Prone to overfitting, needs γ tuning

3. Polynomial Kernel:

  • Formula: $K(x, y) = (x^T y + c)^d$
  • Decision Boundary: Polynomial curves
  • Use Cases:
    • Image processing
    • Feature interactions matter
  • Pros: Controllable complexity (via d)
  • Cons: Computationally expensive, parameter sensitive

Selection Guide:

  1. Try Linear first → Quick baseline
  2. If performance insufficient, try RBF → Most common
  3. Consider Polynomial for special cases
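
In sklearn the kernel is chosen via the kernel argument of SVC; a minimal sketch comparing the three on a toy dataset (the make_moons data and the parameters shown are illustrative only):

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Small non-linear toy dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

for kernel in ['linear', 'rbf', 'poly']:
    clf = SVC(kernel=kernel, degree=3, gamma='scale')
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{kernel:>6}: mean accuracy = {scores.mean():.3f}")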

Q3.7: Random Forest 超参数

问题: 解释 Random Forest 的关键超参数:n_estimators, max_depth, min_samples_split。


1. n_estimators(树的数量):

  • 含义: 随机森林中决策树的个数
  • 影响:
    • 增加 → 模型更稳定,性能提升(但收益递减)
    • 太少 → 欠拟合
    • 太多 → 训练时间长,但一般不会过拟合
  • 推荐值: 100-500(根据数据规模)

2. max_depth(树的最大深度):

  • 含义: 每棵树允许的最大层数
  • 影响:
    • 增加 → 模型更复杂,可能过拟合
    • 减少 → 模型简单,可能欠拟合
  • 推荐值:
    • None(不限制)→ 小数据集
    • 10-30 → 大数据集

3. min_samples_split(最小分裂样本数):

  • 含义: 节点分裂所需的最小样本数
  • 影响:
    • 增加 → 树更简单,防止过拟合
    • 减少 → 树更复杂,可能过拟合
  • 推荐值: 2-20

调参策略(GridSearchCV):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
 
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10]
}
 
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
# grid_search.fit(X_train, y_train)

1. n_estimators (Number of Trees):

  • Meaning: Number of decision trees in the forest
  • Impact:
    • Increase → More stable, better performance (diminishing returns)
    • Too few → Underfitting
    • Too many → Longer training, but rarely overfits
  • Recommended: 100-500 (depends on data size)

2. max_depth (Maximum Tree Depth):

  • Meaning: Maximum levels allowed in each tree
  • Impact:
    • Increase → More complex, may overfit
    • Decrease → Simpler, may underfit
  • Recommended:
    • None (unlimited) → Small datasets
    • 10-30 → Large datasets

3. min_samples_split (Minimum Samples to Split):

  • Meaning: Minimum samples required to split a node
  • Impact:
    • Increase → Simpler tree, prevents overfitting
    • Decrease → More complex, may overfit
  • Recommended: 2-20

Tuning Strategy (GridSearchCV):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
 
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10]
}
 
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
# grid_search.fit(X_train, y_train)

Q3.8: 交叉验证

问题: 什么是 K 折交叉验证,为什么它很重要?


K-Fold 交叉验证原理:

  1. 划分数据:将数据集分成 K 个大小相等的子集(fold)
  2. 轮流训练:每次用 K-1 个子集训练模型,用剩余 1 个子集验证
  3. 重复 K 次:每个子集都有机会作为验证集
  4. 平均结果:将 K 次验证结果平均,得到最终性能评估

为什么重要:

  • ✅ 充分利用数据:每个样本既用于训练又用于验证
  • ✅ 降低过拟合风险:避免模型只在某个特定测试集上表现好
  • ✅ 更可靠的性能估计:多次验证的平均值更稳定
  • ✅ 适合小数据集:在数据有限时特别有用

常用 K 值:

  • K = 5(5折):速度和准确性的平衡,最常用
  • K = 10(10折):更准确但计算量更大
  • K = N(留一法,LOOCV):极端情况,每次只留一个样本验证

K-Fold Cross-Validation Principle:

  1. Split Data: Divide dataset into K equal-sized subsets (folds)
  2. Rotate Training: Use K-1 subsets for training, 1 for validation
  3. Repeat K Times: Each subset gets a chance to be the validation set
  4. Average Results: Average the K validation results for final performance

Why Important:

  • ✅ Maximize Data Usage: Every sample used for both training and validation
  • ✅ Reduce Overfitting Risk: Avoid model performing well only on specific test set
  • ✅ More Reliable Estimate: Average of multiple validations is more stable
  • ✅ Good for Small Datasets: Especially useful when data is limited

Common K Values:

  • K = 5 (5-fold): Balance between speed and accuracy, most common
  • K = 10 (10-fold): More accurate but more computational cost
  • K = N (Leave-One-Out, LOOCV): Extreme case, one sample for validation each time

Code Example:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
 
# Create sample dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=42)
 
# Create model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
 
# Perform 5-fold cross-validation
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
 
print(f"Cross-validation scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Output:

Cross-validation scores: [0.85  0.9   0.8   0.95  0.85]
Mean accuracy: 0.870 (+/- 0.102)

解释:

  • 5 次验证的准确率分别为:85%, 90%, 80%, 95%, 85%
  • 平均准确率:87%
  • 标准差的 2 倍(95% 置信区间):±10.2%
  • 结论:模型在这个数据集上的准确率约为 87%,且较为稳定

Interpretation:

  • 5 validation accuracies: 85%, 90%, 80%, 95%, 85%
  • Mean accuracy: 87%
  • 2x standard deviation (95% confidence): ±10.2%
  • Conclusion: Model accuracy ~87% on this dataset, relatively stable

Q3.9: 过拟合 vs 欠拟合

问题: 如何识别和解决过拟合和欠拟合?

特征 | 过拟合 (Overfitting) | 欠拟合 (Underfitting)
表现 | 训练集准确率高,测试集准确率低 | 训练集和测试集准确率都低
原因 | 模型过于复杂,记住了噪声 | 模型过于简单,未学到规律
训练误差 | 很低(接近 0) | 很高
测试误差 | 很高 | 很高
泛化能力 | 差(不能应用到新数据) | 差(连训练数据都拟合不好)

如何识别:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
 
# 假设 model、X、y 已在前面定义(任意 sklearn 估计器及对应数据集)
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10)
)
 
# 画学习曲线
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, test_scores.mean(axis=1), label='Test score')
plt.legend()
plt.show()

判断标准:

  • 过拟合:训练曲线和测试曲线之间有大的间隙
  • 欠拟合:两条曲线都很低且接近

解决方法:

过拟合的解决方案:

  • 增加训练数据
  • 减少模型复杂度(降低树深度、减少特征)
  • 正则化(L1/L2)
  • Dropout(神经网络)
  • Early stopping(提前停止训练)

欠拟合的解决方案:

  • 增加模型复杂度
  • 增加特征
  • 减少正则化
  • 训练更长时间
  • 换更强大的模型
Feature | Overfitting | Underfitting
Performance | High training, low test accuracy | Low training and test accuracy
Cause | Model too complex, memorizes noise | Model too simple, misses patterns
Training Error | Very low (near 0) | Very high
Test Error | Very high | Very high
Generalization | Poor (can't apply to new data) | Poor (can't even fit training data)

How to Identify:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
 
# Assumes model, X and y are defined earlier (any sklearn estimator and dataset)
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10)
)
 
# Plot learning curves
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, test_scores.mean(axis=1), label='Test score')
plt.legend()
plt.show()

Diagnosis:

  • Overfitting: Large gap between training and test curves
  • Underfitting: Both curves low and close together

Solutions:

Solutions for Overfitting:

  • Add more training data
  • Reduce model complexity (lower tree depth, fewer features)
  • Regularization (L1/L2)
  • Dropout (neural networks)
  • Early stopping

Solutions for Underfitting:

  • Increase model complexity
  • Add more features
  • Reduce regularization
  • Train longer
  • Use a more powerful model

问题: 解释 SVM 和 Random Forest 分类的主要区别。


SVM (Support Vector Machine):

  • 核心思想:找到一个超平面(在 2D 中是直线,在高维中是曲面),使得两个类别之间的间隔(margin)最大
  • 决策依据:依赖支持向量(距离超平面最近的数据点)
  • 特点:
    • 对高维数据和少量特征表现好
    • 计算复杂度较高(O(n²) 到 O(n³))
    • 需要特征归一化

Random Forest:

  • 核心思想:构建多个决策树,用多数投票进行预测
  • 决策依据:每个树根据特征不纯度(Gini Index 或 Entropy)选择分割点
  • 特点:
    • 并行计算,速度快
    • 可处理非线性关系
    • 自动特征归一化,易于使用
    • 能处理大量特征和缺失值

SVM (Support Vector Machine):

  • Core Idea: Find an optimal hyperplane (line in 2D, surface in higher dimensions) that maximizes the margin between two classes
  • Decision Basis: Depends on support vectors (data points closest to the hyperplane)
  • Characteristics:
    • Works well with high-dimensional data and few features
    • High computational complexity (O(n²) to O(n³))
    • Requires feature normalization

Random Forest:

  • Core Idea: Build multiple decision trees and use majority voting for prediction
  • Decision Basis: Each tree selects split points based on feature impurity (Gini Index or Entropy)
  • Characteristics:
    • Fast with parallel computing
    • Naturally handles non-linear relationships
    • Auto-normalization, easy to use
    • Handles many features and missing values well

🔴 第三部分:机器学习实现与手工计算 (Q4)

⚠️ 重要提示:这部分包含代码实现和手工计算。考试时请带上计算器!

🔥 Q4.0: K-最近邻 (KNN) 实现(期末真题 Q4 - 25分)

场景:使用 sklearn KNeighborsClassifier 预测客户是否购买产品。

不完整代码:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
 
from sklearn.neighbors import KNeighborsClassifier
 
# [A] Formulate an instance of the class (Set K=7)
# [B] Fit the instance on the data
# [C] Predict the expected value
 
print(classes[y_pred[0]])

任务:填写空白处 [A]、[B]、[C]。


完整代码:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
 
from sklearn.neighbors import KNeighborsClassifier
 
# [A] 实例化模型,设置 K=7
knn = KNeighborsClassifier(n_neighbors=7)
 
# [B] 在训练集上训练模型 (Fit)
knn.fit(X_train, y_train)
 
# [C] 在测试集上进行预测 (Predict)
y_pred = knn.predict(X_test)
 
# 输出第一个预测结果
print(classes[y_pred[0]])

详细解析:

[A] 实例化模型:

knn = KNeighborsClassifier(n_neighbors=7)
  • KNeighborsClassifier 是 sklearn 的 KNN 分类器类
  • n_neighbors=7 设置 K 值为 7(即选取最近的 7 个邻居)
  • 其他常用参数:
    • metric='euclidean' (默认,欧氏距离)
    • weights='uniform' (默认,所有邻居权重相等)
    • weights='distance' (按距离加权,近的邻居权重更大)

[B] 训练模型:

knn.fit(X_train, y_train)
  • fit() 方法用于训练模型
  • X_train 是训练特征矩阵(形状:[样本数, 特征数])
  • y_train 是训练标签向量(形状:[样本数])
  • 注意:KNN 实际上不进行"训练",只是存储训练数据,预测时才计算距离

[C] 预测结果:

y_pred = knn.predict(X_test)
  • predict() 方法对测试集进行预测
  • X_test 是测试特征矩阵
  • y_pred 是预测结果向量(形状:[测试样本数])
  • 每个样本的预测流程:
    1. 计算该样本与所有训练样本的距离
    2. 选出距离最近的 K 个邻居
    3. 统计这 K 个邻居的类别,取多数投票结果

补充:完整工作流程示例

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
 
# 假设数据
X = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = Not Purchase, 1 = Purchase
classes = ['Not Purchase', 'Purchase']
 
# 拆分数据
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
 
# [A] 实例化
knn = KNeighborsClassifier(n_neighbors=3)  # 或 K=7 视题目要求
 
# [B] 训练
knn.fit(X_train, y_train)
 
# [C] 预测
y_pred = knn.predict(X_test)
 
# 输出
print("预测结果:", y_pred)
print("第一个预测:", classes[y_pred[0]])
 
# 评估
accuracy = accuracy_score(y_test, y_pred)
print(f"准确率: {accuracy:.2f}")
print("\n分类报告:")
print(classification_report(y_test, y_pred, target_names=classes))

考点总结:

  1. sklearn 标准流程: import → instantiate → fit → predict
  2. 参数理解: n_neighbors (K值)、metric (距离度量)、weights (权重策略)
  3. 方法调用: fit(X_train, y_train) 用于训练,predict(X_test) 用于预测
  4. 数据形状: X 必须是 2D 数组 (样本×特征),y 是 1D 数组(样本)

常见错误:

  • ❌ 忘记导入 KNeighborsClassifier
  • ❌ n_neighbors 拼写错误(不是 k=7)
  • ❌ fit() 方法参数顺序错误(应该是 X_train, y_train,不能颠倒)
  • ❌ predict() 方法忘记传入 X_test

Complete Code:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
 
from sklearn.neighbors import KNeighborsClassifier
 
# [A] Instantiate model with K=7
knn = KNeighborsClassifier(n_neighbors=7)
 
# [B] Fit the model on training data
knn.fit(X_train, y_train)
 
# [C] Predict on test set
y_pred = knn.predict(X_test)
 
# Output first prediction
print(classes[y_pred[0]])

Detailed Explanation:

[A] Instantiate Model:

knn = KNeighborsClassifier(n_neighbors=7)
  • KNeighborsClassifier is sklearn's KNN classifier class
  • n_neighbors=7 sets K value to 7 (select 7 nearest neighbors)
  • Other common parameters:
    • metric='euclidean' (default, Euclidean distance)
    • weights='uniform' (default, all neighbors equal weight)
    • weights='distance' (distance-based weighting, closer neighbors have more weight)

[B] Train Model:

knn.fit(X_train, y_train)
  • fit() method trains the model
  • X_train is feature matrix (shape: [samples, features])
  • y_train is label vector (shape: [samples])
  • Note: KNN doesn't actually "train" - it just stores training data, calculates distances during prediction

[C] Predict Results:

y_pred = knn.predict(X_test)
  • predict() method predicts on test set
  • X_test is test feature matrix
  • y_pred is prediction vector (shape: [test samples])
  • Prediction process for each sample:
    1. Calculate distance to all training samples
    2. Select K nearest neighbors
    3. Count class labels of these K neighbors, take majority vote

Complete Workflow Example:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
 
# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = Not Purchase, 1 = Purchase
classes = ['Not Purchase', 'Purchase']
 
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
 
# [A] Instantiate
knn = KNeighborsClassifier(n_neighbors=3)  # or K=7 per requirements
 
# [B] Train
knn.fit(X_train, y_train)
 
# [C] Predict
y_pred = knn.predict(X_test)
 
# Output
print("Predictions:", y_pred)
print("First prediction:", classes[y_pred[0]])
 
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=classes))

Key Concepts Summary:

  1. sklearn standard workflow: import → instantiate → fit → predict
  2. Parameter understanding: n_neighbors (K value), metric (distance measure), weights (weighting strategy)
  3. Method calls: fit(X_train, y_train) for training, predict(X_test) for prediction
  4. Data shapes: X must be 2D array (samples × features), y is 1D array (samples)

Common Mistakes:

  • ❌ Forgetting to import KNeighborsClassifier
  • ❌ Misspelling n_neighbors (not k=7)
  • ❌ Wrong parameter order in fit() (should be X_train, y_train, not reversed)
  • ❌ Forgetting to pass X_test to predict()

Q4.0(b): KNN 理论 - K 值的影响

问题: 当 KNN 中的 K 值增大时,决策边界会发生什么变化?


K 值对决策边界的影响:

Small K (e.g., K=1):

  • 决策边界非常曲折、复杂
  • 模型对训练数据拟合过紧,容易过拟合 (Overfitting)
  • 对噪声非常敏感
  • 训练准确率高,测试准确率可能低

Large K (e.g., K=100):

  • 决策边界变得非常平滑 (Smoother),甚至接近直线
  • 模型过于简单,容易欠拟合 (Underfitting)
  • 对噪声不敏感
  • 训练和测试准确率都可能偏低

最佳 K 值选择:

  • 通常使用交叉验证找最优 K
  • K 值建议范围:3-15(小数据集)或 $\sqrt{N}$(N 为样本数)
  • K 应该是奇数(避免平票,尤其二分类)

结论:

  • As K increases, the decision boundary becomes smoother.
  • 随着 K 增大,决策边界变得更平滑。

Effect of K on Decision Boundary:

Small K (e.g., K=1):

  • Decision boundary is very wiggly and complex
  • Model fits training data too tightly, prone to overfitting
  • Very sensitive to noise
  • High training accuracy, may have low test accuracy

Large K (e.g., K=100):

  • Decision boundary becomes very smooth, even approaching a straight line
  • Model too simple, prone to underfitting
  • Not sensitive to noise
  • Both training and test accuracy may be low

Optimal K Selection:

  • Typically use cross-validation to find best K
  • K value suggestion: 3-15 (small datasets) or $\sqrt{N}$ (N = number of samples)
  • K should be odd (avoid ties, especially for binary classification)

Conclusion:

  • As K increases, the decision boundary becomes smoother.
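
A common way to choose K in practice is to score several odd values with cross-validation and keep the best; a minimal sketch on synthetic data (the dataset is made up for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

best_k, best_score = None, 0.0
for k in range(1, 16, 2):  # odd K values 1, 3, ..., 15
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"K={k}: accuracy={score:.3f}")
    if score > best_score:
        best_k, best_score = k, score

print(f"Best K: {best_k} (accuracy={best_score:.3f})")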

🔥 Q4.1: Naive Bayes 分类计算(必考题)

场景:判断一封邮件是否是垃圾邮件。

训练数据集(共 10 封邮件):

类别 | 总数 | 包含"Free"的邮件数
Spam | 4 | 3
Not Spam | 6 | 1

任务:一封新邮件包含单词 "Free"。判断它是 Spam 还是 Not Spam?

计算 P(Spam | "Free") 和 P(Not Spam | "Free") 的分子部分,比较大小。

详细计算步骤


第一步:计算先验概率 (Prior Probability)

$$P(\text{Spam}) = \frac{4}{10} = 0.4$$

$$P(\text{Not Spam}) = \frac{6}{10} = 0.6$$

第二步:计算似然概率 (Likelihood)

$$P(\text{"Free"} \mid \text{Spam}) = \frac{3}{4} = 0.75$$

$$P(\text{"Free"} \mid \text{Not Spam}) = \frac{1}{6} \approx 0.167$$

第三步:计算后验概率的分子 (Posterior Numerator)

使用贝叶斯定理:$P(Spam | "Free") \propto P("Free" | Spam) \times P(Spam)$

$$\text{Spam Score} = P(\text{"Free"} \mid \text{Spam}) \times P(\text{Spam}) = 0.75 \times 0.4 = 0.30$$

$$\text{Not Spam Score} = P(\text{"Free"} \mid \text{Not Spam}) \times P(\text{Not Spam}) = 0.167 \times 0.6 \approx 0.10$$

第四步:做出决策

因为 $0.30 > 0.10$,模型预测该邮件是 Spam(垃圾邮件)。

结论:含有"Free"这个词的邮件更可能是垃圾邮件。

Step 1: Calculate Prior Probability

$$P(\text{Spam}) = \frac{4}{10} = 0.4$$

$$P(\text{Not Spam}) = \frac{6}{10} = 0.6$$

Step 2: Calculate Likelihood Probability

$$P(\text{"Free"} \mid \text{Spam}) = \frac{3}{4} = 0.75$$

$$P(\text{"Free"} \mid \text{Not Spam}) = \frac{1}{6} \approx 0.167$$

Step 3: Calculate Posterior Numerators

Using Bayes' theorem: $P(Spam | "Free") \propto P("Free" | Spam) \times P(Spam)$

$$\text{Spam Score} = P(\text{"Free"} \mid \text{Spam}) \times P(\text{Spam}) = 0.75 \times 0.4 = 0.30$$

$$\text{Not Spam Score} = P(\text{"Free"} \mid \text{Not Spam}) \times P(\text{Not Spam}) = 0.167 \times 0.6 \approx 0.10$$

Step 4: Make Decision

Since $0.30 > 0.10$, the model predicts this email is Spam.

Conclusion: Emails containing "Free" are more likely to be spam.
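
The comparison can be reproduced in a few lines of Python; a small sketch mirroring the hand calculation above:

# Reproduce the hand calculation: compare P(class) * P("Free" | class)
p_spam, p_not_spam = 4 / 10, 6 / 10
p_free_given_spam, p_free_given_not_spam = 3 / 4, 1 / 6

spam_score = p_free_given_spam * p_spam              # 0.75 * 0.4  = 0.30
not_spam_score = p_free_given_not_spam * p_not_spam  # 0.167 * 0.6 ≈ 0.10

print(f"Spam score:     {spam_score:.3f}")
print(f"Not Spam score: {not_spam_score:.3f}")
print("Prediction:", "Spam" if spam_score > not_spam_score else "Not Spam")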


🔥 Q4.2: Decision Tree 熵计算(必考题)

场景:一个决策树节点有 6 个正样本 (+) 和 2 个负样本 (-) 共 8 个样本。

任务:计算该节点的熵(Entropy)。

熵公式:$$\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

其中 $p_i$ 是第 $i$ 类样本的比例。

详细计算步骤


第一步:计算各类样本比例

$$p_{\text{Positive}} = \frac{6}{8} = 0.75$$

$$p_{\text{Negative}} = \frac{2}{8} = 0.25$$

第二步:代入熵公式

$$\text{Entropy} = -[p_+ \log_2(p_+) + p_- \log_2(p_-)]$$

$$= -[0.75 \times \log_2(0.75) + 0.25 \times \log_2(0.25)]$$

第三步:使用计算器计算对数(保留 4 位小数)

$$\log_2(0.75) = \frac{\log(0.75)}{\log(2)} \approx \frac{-0.1249}{0.3010} \approx -0.4150$$

$$\log_2(0.25) = \log_2(\frac{1}{4}) = -\log_2(4) = -2$$

第四步:计算最终结果

项一:$0.75 \times (-0.4150) = -0.3112$

项二:$0.25 \times (-2) = -0.50$

熵值:

$$\text{Entropy} = -(-0.3112 - 0.50) = -(-0.8112) = 0.8112 \approx \boxed{0.811}$$

解读:

  • 熵值范围是 0 到 1(对于二分类)
  • 熵值 0.811 表示该节点混合程度较高(接近 75%-25% 分布)
  • 如果节点是纯净的(全是正或全是负),熵值为 0
  • 如果正负各占一半(50%-50%),熵值最大为 1

Step 1: Calculate Sample Proportions

$$p_{\text{Positive}} = \frac{6}{8} = 0.75$$

$$p_{\text{Negative}} = \frac{2}{8} = 0.25$$

Step 2: Substitute into Entropy Formula

$$\text{Entropy} = -[p_+ \log_2(p_+) + p_- \log_2(p_-)]$$

$$= -[0.75 \times \log_2(0.75) + 0.25 \times \log_2(0.25)]$$

Step 3: Use Calculator to Calculate Logarithms (Keep 4 decimals)

$$\log_2(0.75) = \frac{\log(0.75)}{\log(2)} \approx \frac{-0.1249}{0.3010} \approx -0.4150$$

$$\log_2(0.25) = \log_2(\frac{1}{4}) = -\log_2(4) = -2$$

Step 4: Calculate Final Result

Term 1: $0.75 \times (-0.4150) = -0.3112$

Term 2: $0.25 \times (-2) = -0.50$

Entropy Value:

$$\text{Entropy} = -(-0.3112 - 0.50) = -(-0.8112) = 0.8112 \approx \boxed{0.811}$$

Interpretation:

  • Entropy range is 0 to 1 (for binary classification)
  • Entropy 0.811 indicates a highly mixed node (close to 75%-25% distribution)
  • Pure node (all positive or all negative) has entropy 0
  • Node with 50%-50% split has maximum entropy of 1
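
For self-checking such calculations, a short helper can reproduce the formula with math.log2 (a sketch, not something the exam requires):

import math

def entropy(counts):
    # counts = number of samples in each class, e.g. [6, 2]
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            result -= p * math.log2(p)
    return result

print(f"Entropy(6+, 2-): {entropy([6, 2]):.3f}")  # 0.811
print(f"Entropy(4+, 4-): {entropy([4, 4]):.3f}")  # 1.000 (50-50 split)
print(f"Entropy(8+, 0-): {entropy([8, 0]):.3f}")  # 0.000 (pure node)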

🔥 Q4.3: 信息增益计算

场景: 计算信息增益,决定用哪个特征进行分裂。

数据集(10个样本,预测是否打网球):

Outlook | Temperature | Play Tennis
Sunny | Hot | No
Sunny | Hot | No
Overcast | Hot | Yes
Rain | Mild | Yes
Rain | Cool | Yes
Rain | Cool | No
Overcast | Cool | Yes
Sunny | Mild | No
Sunny | Cool | Yes
Rain | Mild | Yes

任务: 计算 "Outlook" 特征的信息增益。


第一步:计算总体熵(Root Entropy)

总样本:10 个,其中 Yes = 6,No = 4

$$p_{\text{Yes}} = \frac{6}{10} = 0.6, \quad p_{\text{No}} = \frac{4}{10} = 0.4$$

$$H_{\text{total}} = -[0.6 \times \log_2(0.6) + 0.4 \times \log_2(0.4)]$$

使用计算器:

  • $\log_2(0.6) = \frac{\ln(0.6)}{\ln(2)} \approx -0.737$
  • $\log_2(0.4) = \frac{\ln(0.4)}{\ln(2)} \approx -1.322$

$$H_{\text{total}} = -[0.6 \times (-0.737) + 0.4 \times (-1.322)]$$ $$= -[-0.442 - 0.529] = -(-0.971) = \boxed{0.971}$$


第二步:按 Outlook 分组计算加权熵

Outlook | Total | Yes | No
Sunny | 4 | 1 | 3
Overcast | 2 | 2 | 0
Rain | 4 | 3 | 1

Sunny 组的熵: $$H_{\text{Sunny}} = -\left[\frac{1}{4} \log_2\left(\frac{1}{4}\right) + \frac{3}{4} \log_2\left(\frac{3}{4}\right)\right]$$

  • $\log_2(0.25) = -2$
  • $\log_2(0.75) \approx -0.415$

$$H_{\text{Sunny}} = -[0.25 \times (-2) + 0.75 \times (-0.415)]$$ $$= -[-0.5 - 0.311] = 0.811$$

Overcast 组的熵: 全是 Yes(纯净节点) $$H_{\text{Overcast}} = 0$$

Rain 组的熵: $$H_{\text{Rain}} = -\left[\frac{3}{4} \log_2\left(\frac{3}{4}\right) + \frac{1}{4} \log_2\left(\frac{1}{4}\right)\right]$$ $$= -[0.75 \times (-0.415) + 0.25 \times (-2)]$$ $$= -[-0.311 - 0.5] = 0.811$$

加权平均熵: $$H_{\text{weighted}} = \frac{4}{10} \times 0.811 + \frac{2}{10} \times 0 + \frac{4}{10} \times 0.811$$ $$= 0.324 + 0 + 0.324 = 0.648$$


第三步:计算信息增益(Information Gain)

$$\text{IG(Outlook)} = H_{\text{total}} - H_{\text{weighted}}$$ $$= 0.971 - 0.648 = \boxed{0.323}$$

解释:

  • 信息增益 = 0.323,表示使用 "Outlook" 特征可以减少 32.3% 的不确定性
  • 信息增益越大,该特征的分类能力越强
  • 决策树会优先选择信息增益最大的特征进行分裂

Step 1: Calculate Root Entropy

Total samples: 10, where Yes = 6, No = 4

$$p_{\text{Yes}} = \frac{6}{10} = 0.6, \quad p_{\text{No}} = \frac{4}{10} = 0.4$$

$$H_{\text{total}} = -[0.6 \times \log_2(0.6) + 0.4 \times \log_2(0.4)]$$

Using calculator:

  • $\log_2(0.6) = \frac{\ln(0.6)}{\ln(2)} \approx -0.737$
  • $\log_2(0.4) = \frac{\ln(0.4)}{\ln(2)} \approx -1.322$

$$H_{\text{total}} = -[0.6 \times (-0.737) + 0.4 \times (-1.322)]$$ $$= -[-0.442 - 0.529] = -(-0.971) = \boxed{0.971}$$


Step 2: Calculate Weighted Entropy by Outlook Groups

Outlook | Total | Yes | No
Sunny | 4 | 1 | 3
Overcast | 2 | 2 | 0
Rain | 4 | 3 | 1

Entropy of Sunny: $$H_{\text{Sunny}} = -\left[\frac{1}{4} \log_2\left(\frac{1}{4}\right) + \frac{3}{4} \log_2\left(\frac{3}{4}\right)\right]$$

  • $\log_2(0.25) = -2$
  • $\log_2(0.75) \approx -0.415$

$$H_{\text{Sunny}} = -[0.25 \times (-2) + 0.75 \times (-0.415)]$$ $$= -[-0.5 - 0.311] = 0.811$$

Entropy of Overcast: All Yes (pure node) $$H_{\text{Overcast}} = 0$$

Entropy of Rain: $$H_{\text{Rain}} = -\left[\frac{3}{4} \log_2\left(\frac{3}{4}\right) + \frac{1}{4} \log_2\left(\frac{1}{4}\right)\right]$$ $$= -[0.75 \times (-0.415) + 0.25 \times (-2)]$$ $$= -[-0.311 - 0.5] = 0.811$$

Weighted Average Entropy: $$H_{\text{weighted}} = \frac{4}{10} \times 0.811 + \frac{2}{10} \times 0 + \frac{4}{10} \times 0.811$$ $$= 0.324 + 0 + 0.324 = 0.648$$


Step 3: Calculate Information Gain

$$\text{IG(Outlook)} = H_{\text{total}} - H_{\text{weighted}}$$ $$= 0.971 - 0.648 = \boxed{0.323}$$

Interpretation:

  • Information Gain = 0.323 means "Outlook" reduces uncertainty by 32.3%
  • Higher Information Gain → stronger classification ability
  • Decision trees prioritize features with highest Information Gain for splitting
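
The same information-gain calculation can be verified programmatically; a small sketch using the class counts from the tables above (exact arithmetic gives 0.322, which the rounded hand calculation reports as 0.323):

import math

def entropy(counts):
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            result -= p * math.log2(p)
    return result

# (Yes, No) counts for the whole dataset and for each Outlook value
root = [6, 4]
groups = {'Sunny': [1, 3], 'Overcast': [2, 0], 'Rain': [3, 1]}

n = sum(root)
weighted = sum(sum(g) / n * entropy(g) for g in groups.values())
info_gain = entropy(root) - weighted

print(f"Root entropy:     {entropy(root):.3f}")  # 0.971
print(f"Weighted entropy: {weighted:.3f}")       # 0.649
print(f"Information gain: {info_gain:.3f}")      # 0.322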

🔥 Q4.4: Gini 指数计算

场景: 计算决策树节点的 Gini 指数。

问题: 一个节点有 40 个样本:25 个 A 类,15 个 B 类。计算 Gini 指数。

Gini Formula: $$\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$$


第一步:计算各类别概率

$$p_A = \frac{25}{40} = 0.625$$ $$p_B = \frac{15}{40} = 0.375$$

第二步:代入 Gini 公式

$$\text{Gini} = 1 - (p_A^2 + p_B^2)$$ $$= 1 - (0.625^2 + 0.375^2)$$ $$= 1 - (0.3906 + 0.1406)$$ $$= 1 - 0.5312$$ $$= \boxed{0.4688} \approx 0.469$$

解释:

  • Gini 指数范围:[0, 0.5](二分类情况下)
  • Gini = 0 → 节点纯净(全是同一类)
  • Gini = 0.5 → 节点最混乱(各类别均匀分布)
  • Gini = 0.469 → 较高的不纯度,需要进一步分裂

Gini vs Entropy:

  • Gini 计算更快(无对数运算)
  • Entropy 对不纯度更敏感
  • 实际效果相似,Gini 更常用(sklearn 默认)

Step 1: Calculate Class Probabilities

$$p_A = \frac{25}{40} = 0.625$$ $$p_B = \frac{15}{40} = 0.375$$

Step 2: Substitute into Gini Formula

$$\text{Gini} = 1 - (p_A^2 + p_B^2)$$ $$= 1 - (0.625^2 + 0.375^2)$$ $$= 1 - (0.3906 + 0.1406)$$ $$= 1 - 0.5312$$ $$= \boxed{0.4688} \approx 0.469$$

Interpretation:

  • Gini range: [0, 0.5] (for binary classification)
  • Gini = 0 → Pure node (all same class)
  • Gini = 0.5 → Most impure (uniform distribution)
  • Gini = 0.469 → High impurity, needs further splitting

Gini vs Entropy:

  • Gini faster to compute (no logarithms)
  • Entropy more sensitive to impurity
  • Similar results in practice, Gini more common (sklearn default)
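
A short helper verifies the Gini computation (illustrative sketch):

def gini(counts):
    # counts = number of samples per class, e.g. [25, 15]
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(f"Gini(25 A, 15 B): {gini([25, 15]):.4f}")  # 0.4688
print(f"Gini(pure node):  {gini([40, 0]):.4f}")   # 0.0000
print(f"Gini(20 A, 20 B): {gini([20, 20]):.4f}")  # 0.5000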

🔥 Q4.5: 多特征 Naive Bayes

场景: 根据两个特征(天气和温度)判断是否打网球。

训练数据(8个样本):

Outlook | Temperature | Play
Sunny | Hot | No
Sunny | Hot | No
Overcast | Hot | Yes
Rain | Mild | Yes
Rain | Cool | Yes
Overcast | Cool | Yes
Sunny | Mild | No
Rain | Hot | Yes

测试样本: Outlook = Sunny, Temperature = Cool。会打网球吗?


第一步:统计训练数据

  • Play = Yes: 5 次
  • Play = No: 3 次

先验概率: $$P(\text{Yes}) = \frac{5}{8} = 0.625$$ $$P(\text{No}) = \frac{3}{8} = 0.375$$


第二步:计算条件概率

对于 Yes 类:

  • Sunny & Yes: 0 次(共 5 个 Yes)
  • Cool & Yes: 2 次(共 5 个 Yes)

$$P(\text{Sunny} | \text{Yes}) = \frac{0}{5} = 0$$ $$P(\text{Cool} | \text{Yes}) = \frac{2}{5} = 0.4$$

对于 No 类:

  • Sunny & No: 3 次(共 3 个 No)
  • Cool & No: 0 次(共 3 个 No)

$$P(\text{Sunny} | \text{No}) = \frac{3}{3} = 1.0$$ $$P(\text{Cool} | \text{No}) = \frac{0}{3} = 0$$


第三步:应用 Naive Bayes

$$P(\text{Yes} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{Yes}) \times P(\text{Cool} | \text{Yes}) \times P(\text{Yes})$$ $$= 0 \times 0.4 \times 0.625 = 0$$

$$P(\text{No} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{No}) \times P(\text{Cool} | \text{No}) \times P(\text{No})$$ $$= 1.0 \times 0 \times 0.375 = 0$$


问题:零概率问题!

两个类别的概率都是 0,无法做出判断。这是因为训练数据中没有出现 "Sunny + Cool" 的组合。

解决方案:Laplace 平滑(拉普拉斯平滑)

修正公式: $$P(\text{Feature} \mid \text{Class}) = \frac{\text{count} + 1}{\text{total} + k}$$(其中 k 为该特征可能取值的个数)

应用平滑后:

对于 Yes 类(Outlook 可能取值数 k = 3: Sunny, Overcast, Rain): $$P(\text{Sunny} \mid \text{Yes}) = \frac{0 + 1}{5 + 3} = \frac{1}{8} = 0.125$$ $$P(\text{Cool} \mid \text{Yes}) = \frac{2 + 1}{5 + 3} = \frac{3}{8} = 0.375$$

对于 No 类: $$P(\text{Sunny} | \text{No}) = \frac{3 + 1}{3 + 3} = \frac{4}{6} = 0.667$$ $$P(\text{Cool} | \text{No}) = \frac{0 + 1}{3 + 3} = \frac{1}{6} = 0.167$$

重新计算:

$$\text{Yes Score} = 0.125 \times 0.375 \times 0.625 = 0.0293$$ $$\text{No Score} = 0.667 \times 0.167 \times 0.375 = 0.0418$$

结论:因为 $0.0418 > 0.0293$,预测为 No(不打网球)。

关键要点:

  • Naive Bayes 假设特征独立(这就是"Naive"的含义)
  • 遇到零概率必须使用平滑技术
  • Laplace 平滑是最常用的方法

Step 1: Statistics from Training Data

  • Play = Yes: 5 times
  • Play = No: 3 times

Prior Probability: $$P(\text{Yes}) = \frac{5}{8} = 0.625$$ $$P(\text{No}) = \frac{3}{8} = 0.375$$


Step 2: Calculate Conditional Probabilities

For Yes class:

  • Sunny & Yes: 0 times (out of 5 Yes)
  • Cool & Yes: 2 times (out of 5 Yes)

$$P(\text{Sunny} | \text{Yes}) = \frac{0}{5} = 0$$ $$P(\text{Cool} | \text{Yes}) = \frac{2}{5} = 0.4$$

For No class:

  • Sunny & No: 3 times (out of 3 No)
  • Cool & No: 0 times (out of 3 No)

$$P(\text{Sunny} | \text{No}) = \frac{3}{3} = 1.0$$ $$P(\text{Cool} | \text{No}) = \frac{0}{3} = 0$$


Step 3: Apply Naive Bayes

$$P(\text{Yes} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{Yes}) \times P(\text{Cool} | \text{Yes}) \times P(\text{Yes})$$ $$= 0 \times 0.4 \times 0.625 = 0$$

$$P(\text{No} | \text{Sunny, Cool}) \propto P(\text{Sunny} | \text{No}) \times P(\text{Cool} | \text{No}) \times P(\text{No})$$ $$= 1.0 \times 0 \times 0.375 = 0$$


Problem: Zero Probability Issue!

Both classes have probability 0, making classification impossible. This is because "Sunny + Cool" combination never appeared in training data.

Solution: Laplace Smoothing

Corrected formula: $$P(\text{Feature} \mid \text{Class}) = \frac{\text{count} + 1}{\text{total} + k}$$ (where k is the number of possible values of the feature)

After applying smoothing:

For Yes class (number of possible Outlook values k = 3: Sunny, Overcast, Rain): $$P(\text{Sunny} \mid \text{Yes}) = \frac{0 + 1}{5 + 3} = \frac{1}{8} = 0.125$$ $$P(\text{Cool} \mid \text{Yes}) = \frac{2 + 1}{5 + 3} = \frac{3}{8} = 0.375$$

For No class: $$P(\text{Sunny} | \text{No}) = \frac{3 + 1}{3 + 3} = \frac{4}{6} = 0.667$$ $$P(\text{Cool} | \text{No}) = \frac{0 + 1}{3 + 3} = \frac{1}{6} = 0.167$$

Recalculate:

$$\text{Yes Score} = 0.125 \times 0.375 \times 0.625 = 0.0293$$ $$\text{No Score} = 0.667 \times 0.167 \times 0.375 = 0.0418$$

Conclusion: Since $0.0418 > 0.0293$, prediction is No (Don't play).

Key Takeaways:

  • Naive Bayes assumes feature independence (hence "Naive")
  • Zero probability requires smoothing techniques
  • Laplace smoothing is most commonly used
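
A compact sketch that plugs the same Laplace-smoothed counts into code (counts come from the table above; variable names are illustrative):

# Laplace-smoothed scores for the test sample (Outlook=Sunny, Temperature=Cool)
n_yes, n_no = 5, 3
k = 3  # possible values per feature (e.g. Sunny / Overcast / Rain)

p_yes, p_no = n_yes / 8, n_no / 8

p_sunny_yes = (0 + 1) / (n_yes + k)  # 1/8
p_cool_yes  = (2 + 1) / (n_yes + k)  # 3/8
p_sunny_no  = (3 + 1) / (n_no + k)   # 4/6
p_cool_no   = (0 + 1) / (n_no + k)   # 1/6

yes_score = p_sunny_yes * p_cool_yes * p_yes
no_score  = p_sunny_no * p_cool_no * p_no

print(f"Yes score: {yes_score:.4f}")  # 0.0293
print(f"No score:  {no_score:.4f}")   # 0.0417
print("Prediction:", "Yes" if yes_score > no_score else "No")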

Q1/Q2 常见陷阱

陷阱 | 易错点 | 正确做法
可变默认参数 | def f(x, lst=[]) | 用 None 替代,内部初始化
列表 vs 元组 | 认为元组也可变 | 记住:列表可变,元组不可变
循环变量作用域 | i 在循环外消失 | Python 的 i 在循环外仍存在
列表切片 | 认为 lst[:] 是引用 | lst[:] 创建浅拷贝

Q3 数据处理检查清单

  • 检查缺失值,用 isnull() 确认位置
  • 选择合适的填充方法(均值、中位数、前向填充等)
  • 对分类变量选择合适的编码(One-Hot vs Label)
  • 特征缩放(归一化 / 标准化)
  • 拆分训练/测试集

Q4 手算检查清单

Naive Bayes:

  • 计算先验概率 P(Class)
  • 计算似然概率 P(Feature | Class)
  • 相乘得到后验分子
  • 比较大小做决策

Decision Tree:

  • 明确样本总数和各类样本数
  • 计算概率 p_i = count_i / total
  • 使用计算器计算 log₂ 值
  • 代入公式计算熵值
  • 保留 3 位小数报告答案

⚡ 附加:数据预处理公式(速查)

来源:期末真题 Q3(b) - 写出两种数据归一化公式。

1. Min-Max Normalization (最小-最大规范化)

目的:将数据缩放到 [0, 1] 范围

$$X_{new} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

解释:

  • $X_{min}$ → 变为 0
  • $X_{max}$ → 变为 1
  • 中间值按比例缩放

使用场景:

  • 已知数据范围有明确上下界
  • 神经网络输入层(需要0-1范围)
  • 图像像素值归一化

缺点:

  • 对异常值敏感(一个极端值会影响整体缩放)

Explanation:

  • $X_{min}$ → becomes 0
  • $X_{max}$ → becomes 1
  • Values in between scaled proportionally

Use Cases:

  • Known data range with clear bounds
  • Neural network input layer (requires 0-1 range)
  • Image pixel normalization

Drawbacks:

  • Sensitive to outliers (one extreme value affects entire scaling)

2. Z-Score Standardization (标准化)

目的:将数据转换为均值=0,标准差=1

$$X_{new} = \frac{X - \mu}{\sigma}$$

其中:

  • $\mu$ = 均值 (mean)
  • $\sigma$ = 标准差 (standard deviation)

解释:

  • 数据变换后均值为 0
  • 标准差为 1
  • 范围不固定(可能是负数)

使用场景:

  • 数据存在异常值
  • 需要比较不同量纲的特征
  • SVM、KNN、逻辑回归等算法

优点:

  • 对异常值更鲁棒(相比 Min-Max)

Where:

  • $\mu$ = mean
  • $\sigma$ = standard deviation

Explanation:

  • Transformed data has mean of 0
  • Standard deviation of 1
  • Range not fixed (can be negative)

Use Cases:

  • Data contains outliers
  • Need to compare features with different scales
  • Algorithms like SVM, KNN, Logistic Regression

Advantages:

  • More robust to outliers (compared to Min-Max)
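
Both formulas are easy to apply with numpy; a quick sketch on an arbitrary five-number sample:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-Max normalization -> values in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())
print("Min-Max:", x_minmax)   # [0.   0.25 0.5  0.75 1.  ]

# Z-score standardization -> mean 0, standard deviation 1
x_zscore = (x - x.mean()) / x.std()
print("Z-score:", x_zscore)   # approx [-1.41 -0.71  0.    0.71  1.41]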

快速对比表

方法 | 公式 | 输出范围 | 异常值敏感度 | 使用场景
Min-Max | $\frac{X - X_{min}}{X_{max} - X_{min}}$ | [0, 1] | 高 | 神经网络、有界数据
Z-Score | $\frac{X - \mu}{\sigma}$ | 无界 | 低 | SVM、KNN、有异常值数据

💡 最后的建议

  1. Python 部分(Q1/Q2):重点记住 可变性 和 作用域 的概念,多做代码追踪题。
  2. 数据科学部分(Q3):理解编码方法的 什么时候用,算法之间的 核心区别。
  3. 手算部分(Q4):务必带计算器,步骤要清晰,最后答案保留 3 位小数。
  4. 考试策略:先做自己擅长的题,再挑战计算题。时间紧张时,理论题往往比手算题更容易拿分。

祝考试顺利! 🎓


最后更新:2026年1月24日 | 全面升级版,包含50+实战题目(含期末真题 Q1-Q4)