ABW505 Complete Question Bank: Python & Machine Learning (All English)
📚 All code in this document has been verified and tested. Every answer includes detailed explanations and comments.
📋 Exam Structure Overview
| Section | Points | Type | Coverage |
|---|---|---|---|
| Q1 | 20 | Output Analysis | Python basics: variables, operators, lists, tuples, functions, loops, conditions |
| Q2 | 30 | Code Writing (3/5) | Decision structure, repetition, boolean logic, lists/tuples, functions |
| Q3 | 25 | Theory + Code | Pandas, Data Preprocessing, Encoder, SVM, Random Forest |
| Q4 | 25 | Theory + Calculation | Naive Bayes, Decision Tree, Gini Index, Entropy |
📝 Q1: Python Output Analysis (20 points)
Key Pattern 1: Operator Precedence (MUST KNOW!)
Priority order: ** (power) → *, /, //, % → +, -
| Operator | Meaning | Example |
|---|---|---|
| `**` | Power/Exponent | `5**2 = 25` |
| `/` | Division (float) | `5/2 = 2.5` |
| `//` | Floor division (integer) | `5//2 = 2` |
| `%` | Modulo (remainder) | `10%3 = 1` |
Problem 1.1: Power and Floor Division
print(5**2 // 3)
Step-by-step:
- 5**2 = 25 (power first, highest priority)
- 25 // 3 = 8 (floor division, discard remainder)
Answer: 8
Key concept: Power ** has higher priority than //. Floor division always rounds DOWN toward negative infinity.
Problem 1.2: Mixed Operations
print(3 + 4 * 4 // 4)
Step-by-step:
- 4 * 4 = 16 (multiplication first)
- 16 // 4 = 4 (floor division, same priority as multiplication, left to right)
- 3 + 4 = 7 (addition last)
Answer: 7
Problem 1.3: Power and Multiplication
print(2 * 3 ** 2)
Step-by-step:
- 3**2 = 9 (power first!)
- 2 * 9 = 18
Answer: 18
Common mistake: 2 * 3 = 6, then 6**2 = 36. WRONG! Power has higher priority.
Problem 1.4: Negative Floor Division (TRICKY!)
print(-5 // 3)
Key insight: Floor division rounds toward NEGATIVE infinity, not toward zero!
- -5 ÷ 3 = -1.666...
- Rounding DOWN (toward -∞) → -2
Answer: -2
This is NOT the same as integer division in some other languages! Python's // always floors toward negative infinity.
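The difference between flooring and truncating can be checked directly. This is a minimal sketch; the variable names are made up for illustration.

```python
import math

# Floor division rounds toward negative infinity
floor_result = -5 // 3            # -2
# int() truncates toward zero, like integer division in C-style languages
trunc_result = int(-5 / 3)        # -1
# math.floor agrees with //
math_floor = math.floor(-5 / 3)   # -2
# The remainder pairs with the floored quotient: (-2) * 3 + 1 == -5
remainder = -5 % 3                # 1

print(floor_result, trunc_result, math_floor, remainder)
```

For positive operands the two approaches give the same result, so the difference only shows up when exactly one operand is negative.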
Key Pattern 2: List Iteration and Sum
Problem 1.5: Calculate Average
numbers = [2, 4, 6, 8]
total = 0
for n in numbers:
    total += n
print(total / len(numbers))
Trace:
- Loop 1: total = 0 + 2 = 2
- Loop 2: total = 2 + 4 = 6
- Loop 3: total = 6 + 6 = 12
- Loop 4: total = 12 + 8 = 20
- Average: 20 / 4 = 5.0
Answer: 5.0
Note: Division / always returns a float in Python 3, so the answer is 5.0 not 5.
Key Pattern 3: Tuple Operations (MUST KNOW!)
Keywords: Tuple = immutable list; it cannot be modified after creation and is defined with ()
Problem 1.5a: Basic Tuple Index
t = ("study", "exercises", "exam")
print(t[1])
Index map:
Element: "study" "exercises" "exam"
Index: 0 1 2
Answer: exercises
Keywords: Tuple indexing starts at 0, the same as a list
Problem 1.5b: Tuple with len()
t = ("A", "B", "C")
print(len(t))
Answer: 3
Keywords: len() counts the elements; it works the same for tuples and lists
Problem 1.5c: Tuple Negative Indexing
t = ("A", "B", "C")
print(t[-1])
Negative index map:
Element: "A" "B" "C"
Negative: -3 -2 -1
Answer: C
Keywords: -1 is the last element; negative indices count from right to left
Problem 1.5d: Tuple Slicing
t = (1, 2, 3, 4, 5)
print(t[1:4])
Slice rule: left-inclusive, right-exclusive
Answer: (2, 3, 4)
Keywords: t[1:4] takes indices 1, 2, 3 (4 is excluded); the result is still a tuple
Problem 1.5e: Tuple Repetition
t = (1, 2)
print(t * 3)
Answer: (1, 2, 1, 2, 1, 2)
Keywords: * repeats the sequence, similar to strings
Problem 1.5f: Tuple is Immutable (TRICKY!)
t = (1, 2, 3)
t[0] = 100
print(t)
Answer: TypeError (the program raises an error, so print(t) never runs!)
Keywords: A tuple is immutable; its elements cannot be changed after creation
Compare: A list is mutable; its elements can be changed
lst = [1, 2, 3]
lst[0] = 100 # ✅ works fine
Problem 1.5g: For Loop with Tuple
t = (2, 4, 6)
for x in t:
    print(x)
Answer:
2
4
6
Keywords: Tuples support for-loop iteration, exactly like lists
Problem 1.5h: List of Tuples (nested-structure question!)
data = [("Ann", 80), ("Bob", 60)]
print(data[1])
Key: The outer container is a list; each element is a tuple
Answer: ('Bob', 60)
Keywords: data[1] takes the list element at index 1 (the whole tuple)
Problem 1.5i: Nested Indexing (double indexing)
data = [("Ann", 80), ("Bob", 60)]
print(data[1][0])
Step-by-step:
- data[1] = ("Bob", 60)
- ("Bob", 60)[0] = "Bob"
Answer: Bob
Keywords: Double indexing = nesting; take the outer element first, then the inner one
Problem 1.5j: Tuple Unpacking with For Loop
data = [("Ann", 80), ("Bob", 60)]
for name, score in data:
    print(name)
Key: Each tuple is unpacked automatically; name and score receive its two elements
Answer:
Ann
Bob
Keywords: Tuple unpacking; the number of variables must match the number of tuple elements
Problem 1.5k: Mixed Tuple and List
data = [(1, 2), (3, 4), (5, 6)]
print(data[2][1])
Step-by-step:
- data[2] = (5, 6) (the third tuple)
- (5, 6)[1] = 6 (the tuple's second element)
Answer: 6
Problem 1.5l: in Operator with Tuple
t = ("X", "Y", "Z")
if "Y" in t:
    print("Y")
else:
    print("N")
Answer: Y
Keywords: in checks whether an element exists; both tuples and lists support it
Key Pattern 4: Function Basics (MUST KNOW!)
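Keywords: def defines a function; return sends back a result and ends the function; print handles output
Problem 1.6a: Basic Function
def f(x):
    return x * 2
print(f(3))
Step-by-step:
- Call f(3), x = 3
- return 3*2 = 6
- print(6)
Answer: 6
Keywords: The argument is passed in; return sends back the computed result
Problem 1.6b: Function Without Print (TRICKY!)
def f(x):
    return x * 2
f(3)
Answer: Nothing is printed when this runs as a script (no output!). The returned value 6 is discarded.
Keywords: return only hands back a value; it does not display it! Without print, nothing is shown!
Key difference:
- return = sends back the result and ends the function (nothing displayed)
- print = outputs to the screen (displayed)
- Calling a function ≠ automatic output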
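The return/print distinction can be seen side by side in a short sketch. The function names double and show_double are hypothetical, chosen only for this illustration.

```python
def double(x):
    return x * 2      # hands the value back to the caller; displays nothing


def show_double(x):
    print(x * 2)      # displays the value; the function implicitly returns None


a = double(3)         # nothing is displayed here; a holds 6
b = show_double(3)    # displays 6; b holds None
print(a, b)
```

A function that only prints cannot be used in further calculations, which is why exam answers usually return the result and let the caller decide whether to print it.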
Problem 1.6c: Multiple Parameters
def add(a, b):
    return a + b
print(add(2, 5))
Answer: 7
Keywords: Multiple parameters are separated by commas; 2+5=7
Problem 1.6d: Function with Arithmetic
def f(x):
    return x + 1
print(f(2) + f(3) * 2)
Step-by-step:
- f(2) = 2+1 = 3
- f(3) = 3+1 = 4
- 3 + 4*2 = 3 + 8 = 11 (multiplication first!)
Answer: 11
Keywords: Function return values take part in expressions and follow arithmetic precedence
Problem 1.6e: Boolean Function
def is_even(n):
    return n % 2 == 0
print(is_even(5))
Step-by-step:
- 5 % 2 = 1 (remainder)
- 1 == 0? False
Answer: False
Keywords: A Boolean function returns True/False; % gives the remainder
Problem 1.6f: Function with If (common mixed question type)
def check(n):
    if n > 10:
        print("Big")
    else:
        print("Small")
    return None
result = check(12)
print(result)
Step-by-step:
- check(12): 12 > 10 is true, print("Big")
- return None
- print(result) → print(None)
Answer:
Big
None
Keywords: print inside the function still runs; the returned None is printed too
Problem 1.6g: Function with For Loop
def sum_list(a):
    s = 0
    for x in a:
        s += x
    return s
print(sum_list([1, 2, 3]))
Trace:
- s=0, x=1: s=0+1=1
- x=2: s=1+2=3
- x=3: s=3+3=6
Answer: 6
Keywords: A function parameter can be a list; loop over it and accumulate
Problem 1.6h: Function Returning String
def grade(m):
    if m >= 50:
        return "pass"
    else:
        return "fail"
print(grade(45))
Step-by-step:
- grade(45): 45>=50? False
- return "fail"
Answer: fail
Keywords: return can return any type, including strings
Problem 1.6i: Nested Function Call (may be beyond the syllabus)
def f(x):
    return x + 1
def g(x):
    return f(x) * 2
print(g(3))
Step-by-step:
- g(3) calls f(3)
- f(3) = 3+1 = 4
- g(3) = 4 * 2 = 8
Answer: 8
Keywords: Nested function calls; the inner function runs first
Key Pattern 5: String Slicing (Left-Inclusive, Right-Exclusive)
Problem 1.6: String Slice
s = "ABW505"
print(s[1:5])
Index map:
Character: A B W 5 0 5
Index: 0 1 2 3 4 5
s[1:5] → indices 1, 2, 3, 4 (NOT including 5)
Answer: BW50
Problem 1.7: Negative Indexing
text = "Hello World"
print(text[-5:-1])
Index map (␣ marks the space character):
Character: H e l l o ␣ W o r l d
Positive: 0 1 2 3 4 5 6 7 8 9 10
Negative: -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1
text[-5:-1] → from 'W' (index -5) to 'l' (index -2, NOT including -1)
Answer: Worl
Key Pattern 6: List Operations
Problem 1.8: Slice Assignment (TRICKY!)
numbers = [10, 20, 30, 40, 50]
numbers[1:4] = [100]
print(len(numbers))
print(numbers[2])
Step-by-step:
- Original: [10, 20, 30, 40, 50]
- numbers[1:4] selects [20, 30, 40] (3 elements)
- Replace with [100] (1 element)
- Result: [10, 100, 50]
- Length: 3
- numbers[2] = 50
Answers:
- len(numbers) → 3
- numbers[2] → 50
Key concept: Slice assignment can change list size! Replacing 3 elements with 1 element reduces length by 2.
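Slice assignment can also grow a list or delete elements. The sketch below extends the same example; the snapshot names (shrunk, grown, deleted) are made up for illustration.

```python
numbers = [10, 20, 30, 40, 50]

numbers[1:4] = [100]        # 3 elements replaced by 1
shrunk = list(numbers)      # snapshot: [10, 100, 50]

numbers[1:2] = [7, 8, 9]    # 1 element replaced by 3
grown = list(numbers)       # snapshot: [10, 7, 8, 9, 50]

numbers[1:4] = []           # assigning an empty list deletes the slice
deleted = list(numbers)     # snapshot: [10, 50]

print(shrunk, grown, deleted)
```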
Problem 1.9: List Reference vs Copy
a = [1, 2, 3]
b = a
b.append(4)
print(a)
print(a is b)
Key concept: b = a creates a REFERENCE, not a copy!
- a and b point to the SAME list object
- Modifying b also modifies a
- a is b → True (same object in memory)
Answers:
- print(a) → [1, 2, 3, 4]
- print(a is b) → True
To create an independent copy: Use b = a.copy() or b = a[:]
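A short sketch of the difference between an alias and a copy; the names alias, copy1 and copy2 are hypothetical.

```python
a = [1, 2, 3]
alias = a           # second name for the same list object
copy1 = a.copy()    # new list with the same elements
copy2 = a[:]        # slicing the whole list also creates a new list

a.append(4)         # visible through alias, not through the copies

print(alias)        # [1, 2, 3, 4]
print(copy1)        # [1, 2, 3]
print(alias is a, copy1 is a, copy2 is a)
```

Both copy() and [:] make shallow copies, which is enough for flat lists of numbers or strings like the ones in this question bank.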
Key Pattern 7: Functions with Default Arguments
Problem 1.10: Keyword Arguments
def mystery(a, b=5, c=10):
    return a * 2 + b - c
result = mystery(3, c=4)
print(result)
Step-by-step:
- a = 3 (positional argument)
- b = 5 (uses default, NOT overridden)
- c = 4 (keyword argument overrides default)
- Calculation: 3 * 2 + 5 - 4 = 6 + 5 - 4 = 7
Answer: 7
Key concept: Keyword arguments let you skip over default parameters.
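To see how each calling style binds the parameters, the same function can be called in a few different ways. The result names r1 to r4 are made up for this sketch.

```python
def mystery(a, b=5, c=10):
    return a * 2 + b - c


r1 = mystery(3)                # b=5,  c=10 → 6 + 5 - 10 = 1
r2 = mystery(3, 1)             # b=1,  c=10 → 6 + 1 - 10 = -3
r3 = mystery(3, c=4)           # b=5,  c=4  → 6 + 5 - 4  = 7
r4 = mystery(c=4, a=3, b=0)    # keyword order does not matter → 6 + 0 - 4 = 2

print(r1, r2, r3, r4)
```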
Key Pattern 8: Loops and Range
Problem 1.11: Range with Accumulator
total = 0
for i in range(1, 4):
    total += i
print(total)
range(1, 4) generates: 1, 2, 3 (NOT including 4)
Accumulation: 0 + 1 + 2 + 3 = 6
Answer: 6
Problem 1.12: Break Statement
for i in range(5):
    if i == 2:
        break
    print(i)
- i=0: Print 0
- i=1: Print 1
- i=2: Break! Exit loop immediately
Answer:
0
1
Key Pattern 9: List Comprehension
Problem 1.13: Filtered List Comprehension
nums = [1, 2, 3, 4, 5]
result = [x**2 for x in nums if x % 2 == 1]
print(result)
print(sum(result))
Step-by-step:
- Filter odd numbers: 1, 3, 5 (where x % 2 == 1)
- Square each: 1², 3², 5² = 1, 9, 25
- Result: [1, 9, 25]
- Sum: 1 + 9 + 25 = 35
Answers:
- result → [1, 9, 25]
- sum(result) → 35
Pattern: [expression for item in iterable if condition]
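If the pattern is hard to read, it may help to expand it into the equivalent for loop. This is a sketch; loop_result is a made-up name.

```python
nums = [1, 2, 3, 4, 5]

# Comprehension form: [expression for item in iterable if condition]
comprehension = [x**2 for x in nums if x % 2 == 1]

# Equivalent loop form
loop_result = []
for x in nums:               # for item in iterable
    if x % 2 == 1:           # if condition
        loop_result.append(x**2)   # expression

print(comprehension, loop_result)
```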
📝 Q2: Code Writing (30 points - Choose 3 of 5)
Template 1: Menu with List (MUST MEMORIZE!)
Problem: Write a Python program that displays this menu repeatedly:
- Add a number to the list
- Display the list
- Exit
# Initialize empty list to store numbers
data = []
# Main program loop - runs until user chooses to exit
while True:
    # Display menu options with clear prompts
    print("\n--- MENU ---")
    print("1. Add a number to the list")
    print("2. Display the list")
    print("3. Exit")
    # Get user choice with prompt (IMPORTANT: include prompt text!)
    choice = input("Enter your choice (1/2/3): ")
    # Process user choice
    if choice == "1":
        # Option 1: Add number
        # Use try-except to handle invalid input gracefully
        try:
            num = int(input("Enter a number to add: "))
            data.append(num)
            print(f"Added {num} to the list.")
        except ValueError:
            print("Invalid input! Please enter a valid integer.")
    elif choice == "2":
        # Option 2: Display list
        if len(data) == 0:
            print("The list is empty.")
        else:
            print(f"Current list: {data}")
    elif choice == "3":
        # Option 3: Exit program
        print("Goodbye!")
        break
    else:
        # Handle invalid menu choice
        print("Invalid choice! Please enter 1, 2, or 3.")
Key improvements over the original buggy version:
- ✅ Added prompt text to input() - users know what to enter
- ✅ Added try-except for error handling - won't crash on invalid input
- ✅ Used string comparison instead of int - avoids crash if user enters text
- ✅ Added feedback messages - users know what happened
- ✅ Added empty list check - better user experience
ORIGINAL BUGGY VERSION (what was wrong):
# PROBLEMATIC CODE - DO NOT USE IN EXAM
data = []
while True:
    print("1.Add")
    print("2.Show")
    print("3.Exit")
    c = int(input()) # BUG: Crashes if user enters non-integer!
    if c == 1:
        data.append(int(input())) # BUG: Crashes on invalid input, no prompt!
    elif c == 2:
        print(data)
    elif c == 3:
        break
    # Missing: else clause, error handling, user prompts
Why it crashes: int(input()) without try-except raises ValueError if the user enters anything that is not an integer (like pressing Enter, or typing "abc").
✍️ Condensed Version (HANDWRITING VERSION)
Keeps only the core logic, with all comments and error handling removed:
data = []
while True:
    print("1.Add 2.Show 3.Exit")
    c = input("Choice: ")
    if c == "1":
        data.append(int(input("Num: ")))
    elif c == "2":
        print(data)
    elif c == "3":
        break
Handwriting tips: about 10 lines; must have while True + break to exit
Template 2: List with Sentinel Value (-1)
Problem: Write a Python program that:
- Allows user to enter integers
- Stops when user enters -1
- Prints the minimum, maximum, and average
# Initialize empty list to store user's numbers
nums = []
print("Enter integers. Enter -1 to stop.")
# Main input loop
while True:
    try:
        # Get integer input with clear prompt
        n = int(input("Enter a number (-1 to stop): "))
        # Check for sentinel value
        if n == -1:
            break # Exit loop when user enters -1
        # Add valid number to list
        nums.append(n)
    except ValueError:
        # Handle non-integer input
        print("Invalid input! Please enter an integer.")
# Calculate and display statistics
# IMPORTANT: Check if list is empty to avoid division by zero!
if len(nums) == 0:
    print("No numbers were entered.")
else:
    minimum = min(nums)
    maximum = max(nums)
    average = sum(nums) / len(nums)
    print("\nResults:")
    print(f"Minimum: {minimum}")
    print(f"Maximum: {maximum}")
    print(f"Average: {average:.2f}") # .2f for 2 decimal places
Sample run:
Enter integers. Enter -1 to stop.
Enter a number (-1 to stop): 5
Enter a number (-1 to stop): 10
Enter a number (-1 to stop): 3
Enter a number (-1 to stop): -1
Results:
Minimum: 3
Maximum: 10
Average: 6.00
Edge case handling: Always check if list is empty before calculating statistics! min([]) and max([]) will raise ValueError, and sum([])/len([]) will raise ZeroDivisionError.
Template 3: Dictionary Operations
Problem: Write a Python program that:
- Stores student names and marks in a dictionary
- Allows multiple entries
- Prints the average mark
# Initialize empty dictionary: {name: mark}
students = {}
print("Student Grade Recorder")
print("Enter student names and marks. Type 'stop' as name to finish.")
# Main input loop
while True:
    # Get student name with prompt
    name = input("\nEnter student name (or 'stop' to finish): ")
    # Check for stop condition (case-insensitive)
    if name.lower() == "stop":
        break
    # Check for empty name
    if name.strip() == "":
        print("Name cannot be empty!")
        continue
    # Get mark with error handling
    try:
        mark = int(input(f"Enter mark for {name}: "))
        # Optional: Validate mark range
        if mark < 0 or mark > 100:
            print("Warning: Mark is outside 0-100 range.")
        # Store in dictionary
        students[name] = mark
        print(f"Recorded: {name} = {mark}")
    except ValueError:
        print("Invalid mark! Please enter a number.")
# Calculate and display average
if len(students) == 0:
    print("\nNo students were recorded.")
else:
    # Get all marks using .values()
    all_marks = students.values()
    average = sum(all_marks) / len(all_marks)
    print("\n--- Student Records ---")
    for name, mark in students.items():
        print(f"{name}: {mark}")
    print(f"\nAverage mark: {average:.2f}")
Key dictionary operations:
- dict.values() - get all values (marks)
- dict.items() - get all key-value pairs
- dict.keys() - get all keys (names)
✍️ Condensed Version (HANDWRITING VERSION)
students = {}
while True:
    name = input("Name (stop to end): ")
    if name == "stop":
        break
    mark = int(input("Mark: "))
    students[name] = mark
# Calculate average
avg = sum(students.values()) / len(students)
print("Average:", avg)
Handwriting tips: about 10 lines; store the data in a dict; .values() gets all the marks
Template 4: Grade Calculator (Decision Structure)
Problem: Write a function grade_calculator(score) that:
- Returns letter grade: 90+ → "A", 80+ → "B", 70+ → "C", 60+ → "D", <60 → "F"
- Returns "Invalid" for negative or > 100
def grade_calculator(score):
    """
    Convert numeric score to letter grade.
    Args:
        score: Numeric score (expected 0-100)
    Returns:
        str: Letter grade (A/B/C/D/F) or "Invalid"
    """
    # FIRST: Check for invalid input
    # Must check this BEFORE checking grade ranges
    if score < 0 or score > 100:
        return "Invalid"
    # Check grades from highest to lowest
    # Using elif ensures only ONE condition is matched
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    elif score >= 60:
        return "D"
    else:
        return "F"
# Test the function
if __name__ == "__main__":
    test_scores = [95, 85, 73, 65, 45, -5, 105]
    for s in test_scores:
        print(f"Score {s} → Grade {grade_calculator(s)}")
Output:
Score 95 → Grade A
Score 85 → Grade B
Score 73 → Grade C
Score 65 → Grade D
Score 45 → Grade F
Score -5 → Grade Invalid
Score 105 → Grade Invalid
Common mistakes:
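- Not checking invalid input FIRST
- Using multiple if instead of elif when assigning the grade to a variable (later conditions overwrite it and give the wrong grade; separate if statements are only safe when each one returns immediately)
- Checking in wrong order (60+ before 90+)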
✍️ Condensed Version (HANDWRITING VERSION)
def grade(score):
    if score < 0 or score > 100:
        return "Invalid"
    if score >= 90: return "A"
    if score >= 80: return "B"
    if score >= 70: return "C"
    if score >= 60: return "D"
    return "F"
Handwriting tips: about 8 lines; check for invalid input first, then test from highest to lowest
Template 5: Boolean Function (PREDICTED TOPIC!)
Problem: Write a function that returns True/False based on conditions (function + and/or/not)
Example 1: Check if number is in range [10, 50]
def in_range(n):
    return n >= 10 and n <= 50
print(in_range(25)) # True
print(in_range(5)) # False
Example 2: Check if all three numbers are positive
def all_positive(a, b, c):
    return a > 0 and b > 0 and c > 0
print(all_positive(1, 2, 3)) # True
print(all_positive(1, -2, 3)) # False
Example 3: Check if at least one is even
def has_even(a, b, c):
    return a % 2 == 0 or b % 2 == 0 or c % 2 == 0
print(has_even(1, 3, 5)) # False
print(has_even(1, 2, 5)) # True
Example 4: Check if string is valid password
def is_valid_password(pwd):
    # At least 8 characters and contains digit
    has_length = len(pwd) >= 8
    has_digit = any(c.isdigit() for c in pwd)
    return has_length and has_digit
print(is_valid_password("abc12345")) # True
print(is_valid_password("short1")) # False
Boolean operators:
- and = both conditions must be true
- or = at least one condition must be true
- not = negation
Keywords: return True/False, combining conditions
✍️ Condensed Version (HANDWRITING VERSION)
def is_valid(x, y, z):
    return x > 0 and y > 0 and z > 0
Handwriting tips: a single return line is enough; combine conditions with and/or
Template 6: Prime Number Check
Problem: Write a function is_prime(num) that returns True if prime, False otherwise.
def is_prime(num):
    """
    Check if a number is prime.
    A prime number is:
    - Greater than 1
    - Only divisible by 1 and itself
    Args:
        num: Integer to check
    Returns:
        bool: True if prime, False otherwise
    """
    # Numbers less than 2 are not prime
    # (0, 1, and negative numbers)
    if num < 2:
        return False
    # 2 is the only even prime
    if num == 2:
        return True
    # All other even numbers are not prime
    if num % 2 == 0:
        return False
    # Check odd divisors up to square root of num
    # Why sqrt? If n = a × b, one of a,b must be ≤ √n
    # We use int(num ** 0.5) + 1 to include the square root
    for i in range(3, int(num ** 0.5) + 1, 2): # Step by 2 (odd numbers only)
        if num % i == 0:
            return False # Found a divisor, not prime
    return True # No divisors found, it's prime
# Test the function
if __name__ == "__main__":
    test_nums = [1, 2, 3, 7, 10, 11, 25, 29]
    for n in test_nums:
        result = "Prime" if is_prime(n) else "Not Prime"
        print(f"{n}: {result}")
Output:
1: Not Prime
2: Prime
3: Prime
7: Prime
10: Not Prime
11: Prime
25: Not Prime
29: Prime
Optimization: Only checking up to √n reduces time complexity from O(n) to O(√n).
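To make the saving concrete, the sketch below compares a naive check (every candidate from 2 to n - 1) with the square-root bound and counts the trial divisions for n = 97. The function names and the choice of 97 are assumptions for illustration; the even-number shortcut from the template is left out to keep the comparison simple.

```python
def is_prime_naive(num):
    """Test every candidate divisor from 2 up to num - 1."""
    if num < 2:
        return False
    for i in range(2, num):
        if num % i == 0:
            return False
    return True


def is_prime_sqrt(num):
    """Test candidate divisors only up to the square root of num."""
    if num < 2:
        return False
    for i in range(2, int(num ** 0.5) + 1):
        if num % i == 0:
            return False
    return True


# Both versions agree on every number below 200
agree = all(is_prime_naive(n) == is_prime_sqrt(n) for n in range(200))

# Number of candidate divisors tried for the prime 97
naive_checks = len(range(2, 97))
sqrt_checks = len(range(2, int(97 ** 0.5) + 1))

print(agree, naive_checks, sqrt_checks)
```

For 97 the naive loop tries 95 candidates while the square-root version tries 8, and the gap widens as n grows.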
Template 7: Fibonacci Sequence
Problem: Write a function fibonacci(n) that returns the first n Fibonacci numbers as a list.
def fibonacci(n):
    """
    Generate the first n Fibonacci numbers.
    Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, ...
    Each number is the sum of the two preceding numbers.
    Args:
        n: Number of Fibonacci numbers to generate
    Returns:
        list: First n Fibonacci numbers
    """
    # Handle edge cases
    if n <= 0:
        return [] # Empty list for invalid input
    if n == 1:
        return [0] # Only the first number
    # Start with first two Fibonacci numbers
    result = [0, 1]
    # Generate remaining numbers
    for i in range(2, n):
        # Each new number = sum of last two
        next_num = result[-1] + result[-2] # Use negative indexing
        result.append(next_num)
    return result
# Test the function
if __name__ == "__main__":
    for count in [0, 1, 5, 10]:
        print(f"fibonacci({count}) = {fibonacci(count)}")
Output:
fibonacci(0) = []
fibonacci(1) = [0]
fibonacci(5) = [0, 1, 1, 2, 3]
fibonacci(10) = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
Template 8: Remove Duplicates (Preserve Order)
Problem: Write a function that removes duplicates from a list while preserving the order of first occurrence.
def remove_duplicates(lst):
    """
    Remove duplicate elements while preserving first occurrence order.
    Example: [1, 2, 2, 3, 1, 4] → [1, 2, 3, 4]
    Args:
        lst: Input list with possible duplicates
    Returns:
        list: New list with duplicates removed
    """
    seen = [] # Track items we've already seen
    for item in lst:
        if item not in seen: # Only add if not seen before
            seen.append(item)
    return seen
# Alternative using dict (Python 3.7+ preserves insertion order)
def remove_duplicates_v2(lst):
    """
    Remove duplicates using dictionary (more efficient for large lists).
    dict.fromkeys() preserves first occurrence order.
    """
    return list(dict.fromkeys(lst))
# Test both versions
if __name__ == "__main__":
    test = [1, 2, 2, 3, 1, 4, 5, 3, 2]
    print(f"Original: {test}")
    print(f"Method 1: {remove_duplicates(test)}")
    print(f"Method 2: {remove_duplicates_v2(test)}")
Output:
Original: [1, 2, 2, 3, 1, 4, 5, 3, 2]
Method 1: [1, 2, 3, 4, 5]
Method 2: [1, 2, 3, 4, 5]
Why not use set()? Sets don't preserve order! list(set([1, 2, 2, 3, 1, 4])) might give [1, 2, 3, 4] but order is not guaranteed.
Template 9: Exception Handling
Problem: Write a program that repeatedly asks for a nonzero integer and calculates its reciprocal, handling invalid inputs.
def get_reciprocal():
    """
    Get a nonzero integer from user and calculate its reciprocal.
    Handles ValueError (non-integer) and ZeroDivisionError (zero input).
    """
    while True:
        try:
            # Get input from user
            n = int(input("Enter a nonzero integer: "))
            # Calculate reciprocal (will raise ZeroDivisionError if n=0)
            reciprocal = 1 / n
            # If we get here, input was valid
            print(f"The reciprocal of {n} is {reciprocal:.3f}")
            break # Exit loop on success
        except ValueError:
            # int() failed - input was not a valid integer
            print("Error: You did not enter a valid integer. Try again.")
        except ZeroDivisionError:
            # Division by zero
            print("Error: You entered zero. Cannot divide by zero. Try again.")
# Run the function
if __name__ == "__main__":
    get_reciprocal()
Sample run:
Enter a nonzero integer: abc
Error: You did not enter a valid integer. Try again.
Enter a nonzero integer: 0
Error: You entered zero. Cannot divide by zero. Try again.
Enter a nonzero integer: 4
The reciprocal of 4 is 0.250
📝 Q3: Machine Learning - Theory & Code (25 points)
Flowchart Symbols (MUST KNOW!)
Keywords: Flowcharts are used to design and explain program logic
| Symbol | Shape | Name | Purpose |
|---|---|---|---|
| ⬭ | Oval | Terminal | Start/End of the flowchart |
| ▱ | Parallelogram | I/O | Input/Output operations (e.g., enter values, display results) |
| ▭ | Rectangle | Process | Processing/Calculation (e.g., x = a + b) |
| ◇ | Diamond | Decision | Condition check (Yes/No branches) |
| → | Arrow | Flow Line | Direction of flow in program logic |
Example question: Draw a flowchart to find the largest among three numbers (a, b, c).
Flowchart structure:
[Start] → [Input a, b, c] → <a > b?>
    Yes ↓                        ↓ No
    <a > c?>                  <b > c?>
  Yes ↓     ↓ No            Yes ↓     ↓ No
[max=a]   [max=c]         [max=b]   [max=c]
    ↓         ↓               ↓         ↓
    [Output max] → [End]
Key points for exam:
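- Start/End: The flowchart must have start and end symbols
- Input: Get the input before processing
- Decision: Use a diamond for each condition check, with Yes/No branches
- Process: Write calculations inside rectangles
- Arrows: Connect all symbols with arrows to show the direction of flow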
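The flowchart for the largest of three numbers maps onto nested if/else statements, one per diamond. This is a sketch; the function name largest_of_three is hypothetical.

```python
def largest_of_three(a, b, c):
    # Decision 1: a > b?
    if a > b:
        # Decision 2 (Yes branch): a > c?
        if a > c:
            largest = a
        else:
            largest = c
    else:
        # Decision 2 (No branch): b > c?
        if b > c:
            largest = b
        else:
            largest = c
    return largest   # corresponds to [Output max]


print(largest_of_three(4, 9, 2))
```

Each diamond becomes one if/else and each rectangle becomes one assignment, which is a quick way to check a hand-drawn flowchart against working code.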
Algorithm Comparison Table (MUST KNOW!)
Keywords: Choose a suitable model based on the characteristics of the data
| Algorithm | Best For | Pros | Cons | When to Use? |
|---|---|---|---|---|
| Naive Bayes | Small, low-dimensional | Fast, simple, low variance | Independence assumption unrealistic | Compare probabilities, pick highest |
| Logistic Regression | Small-medium data | Interpretable, stable | Linear separation only | Linear relationship, probability output |
| Decision Tree | Small-medium data | Easy to understand, visual | Prone to overfitting | Need clear rules, explainable |
| KNN | Small data | Simple, no training | Slow, sensitive to noise | Small data, low dimensions, classification |
| SVM | Small, high-dimensional | High accuracy, good generalization | Complex, hard to tune | High-dimensional data, margin-based |
| Random Forest | Medium data | Accurate, resists overfitting | Less interpretable | Improved bagging decision tree |
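💡 Model Selection Quick Rules (exam quick reference)
Q: Small data, low dimensions, classification problem: which model? A: Naive Bayes - performs well, has low variance, needs little data, and is designed for classification
Q: How is Random Forest better than a Decision Tree? A: Less overfitting - it combines many decision trees (bagging) to improve accuracy
Q: What happens as K increases in KNN? A: Larger K → lower variance (more stable) + higher bias
Q: What kind of data suits SVM? A: Strong on high-dimensional data, slow on large datasets - it works well in high-dimensional spaces but is computationally expensive for large datasets
Q: Why use Encoder before ML model? A: Machine learning models require numerical input; encoders transform categorical data into numbers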
Pandas Basics
import pandas as pd
# ========== Reading Data ==========
# Read CSV file
df = pd.read_csv("data.csv")
# Display first/last rows
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.shape) # (rows, columns)
# ========== Handling Missing Values ==========
# Check for missing values
print(df.isnull().sum()) # Count of nulls per column
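# Fill missing values (assign the result back to the column; calling
# fillna(..., inplace=True) on a selected column may not update df in recent pandas versions)
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Option A: fill with mean
# df['Age'] = df['Age'].fillna(df['Age'].median()) # Option B: fill with median instead
df['Salary'] = df['Salary'].fillna(50000) # Fill with specific value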
# Drop rows with missing values
df.dropna(inplace=True)
# ========== Selecting Data ==========
# Select single column
ages = df['Age']
# Select multiple columns
subset = df[['Name', 'Age']]
# Filter rows
adults = df[df['Age'] >= 18]
# Multiple conditions (use & for AND, | for OR)
result = df[(df['Age'] >= 18) & (df['Department'] == 'IT')]
# ========== Grouping ==========
# Group by and aggregate
avg_salary = df.groupby('Department')['Salary'].mean()
LabelEncoder for Categorical Data
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Sample data with categorical columns
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']
}
df = pd.DataFrame(data)
# Create LabelEncoder instance
le = LabelEncoder()
# Encode each categorical column
# LabelEncoder sorts values alphabetically then assigns 0, 1, 2...
for col in df.columns:
    if df[col].dtype == 'object': # Check if column is string/object type
        df[col] = le.fit_transform(df[col])
print(df)
# Color encoding: Blue=0, Green=1, Red=2 (alphabetical)
# Size encoding: Large=0, Medium=1, Small=2 (alphabetical)
LabelEncoder mapping (always alphabetical):
| Original | Encoded |
|---|---|
| Blue | 0 |
| Green | 1 |
| Red | 2 |
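The alphabetical mapping in the table can be reproduced by hand without scikit-learn. The sketch below only mimics the sort-then-number behaviour described above; it is not LabelEncoder's own code, and the variable names are made up.

```python
colors = ['Red', 'Blue', 'Green', 'Red', 'Blue']

# Step 1: unique labels in sorted (alphabetical) order
classes = sorted(set(colors))            # ['Blue', 'Green', 'Red']

# Step 2: number them 0, 1, 2, ...
mapping = {label: code for code, label in enumerate(classes)}

# Step 3: replace every label with its code
encoded = [mapping[c] for c in colors]   # [2, 0, 1, 2, 0]

print(mapping)
print(encoded)
```

Working the mapping out this way is useful in the exam when a question asks for the encoded column without letting you run the code.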
Train-Test Split
from sklearn.model_selection import train_test_split
# Assume X = features, y = target variable
X = df.drop('target', axis=1) # All columns except target
y = df['target'] # Target column only
# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2, # 20% for testing
    random_state=42 # For reproducibility
)
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
Why train-test split?
- Evaluate model on UNSEEN data
- Detect overfitting (memorizing training data)
- Simulate real-world usage
- Get honest performance estimate
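Conceptually, a train-test split shuffles the rows and sets a fraction aside. The pure-Python sketch below illustrates that idea only; simple_split is a hypothetical helper, not how scikit-learn implements train_test_split, and its rounding of the test size is an assumption.

```python
import random


def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle the rows reproducibly and hold out a fraction for testing."""
    rng = random.Random(seed)          # fixed seed → same split every run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)


train, test = simple_split(range(10))
print(len(train), len(test))
```

The key property is that no row appears in both parts, so the test score measures performance on data the model has never seen.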
fit() vs predict()
| Method | Purpose | When Used |
|---|---|---|
| `fit()` | Train the model | Once, on training data only |
| `predict()` | Apply the model | On test/new data |
| `fit_transform()` | Fit and transform in one step | For preprocessing (scaler, encoder) |
Important:
- Use fit_transform() on training data
- Use transform() only on test data (NOT fit_transform!)
# CORRECT workflow for scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit AND transform
X_test_scaled = scaler.transform(X_test) # Transform only (no fit!)
SVM Implementation
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
# Step 1: Load data
df = pd.read_csv("data.csv")
# Step 2: Encode categorical features
le = LabelEncoder()
for col in df.select_dtypes(include=['object']).columns:
    if col != 'target': # Don't encode target yet if needed later
        df[col] = le.fit_transform(df[col])
# Step 3: Separate features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']
# Step 4: Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Step 5: Scale features (IMPORTANT for SVM!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit on training data
X_test_scaled = scaler.transform(X_test) # Transform only on test data
# Step 6: Create and train SVM model
svm_model = SVC(kernel='rbf', random_state=42) # RBF kernel is default
svm_model.fit(X_train_scaled, y_train)
# Step 7: Make predictions
y_pred = svm_model.predict(X_test_scaled)
# Step 8: Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
SVM Key Points:
- Scaling is REQUIRED - SVM is sensitive to feature magnitudes
- Kernel trick - transforms data to higher dimensions for separation
- Common kernels: 'linear', 'rbf' (Gaussian), 'poly' (polynomial)
Random Forest Implementation
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Step 1: Load and prepare data
df = pd.read_csv("data.csv")
# Step 2: Handle categorical (if needed)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in df.select_dtypes(include=['object']).columns:
    df[col] = le.fit_transform(df[col])
# Step 3: Split features and target
X = df.drop('target', axis=1)
y = df['target']
# Step 4: Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Step 5: Create and train Random Forest
# Note: No scaling needed for tree-based models!
rf_model = RandomForestClassifier(
    n_estimators=100, # Number of trees
    max_depth=None, # No limit on depth
    random_state=42
)
rf_model.fit(X_train, y_train)
# Step 6: Predictions and evaluation
y_train_pred = rf_model.predict(X_train)
y_test_pred = rf_model.predict(X_test)
print(f"Training Accuracy: {accuracy_score(y_train, y_train_pred):.2f}")
print(f"Testing Accuracy: {accuracy_score(y_test, y_test_pred):.2f}")
# Step 7: Feature importance
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(importance)
Random Forest Key Points:
- Bagging: Creates multiple trees on different data subsets
- Reduces overfitting: Averaging many trees is more stable
- No scaling needed: Tree-based methods don't need scaling
- Feature importance: Shows which features matter most
StandardScaler vs MinMaxScaler
| Scaler | Formula | Output Range | Best For |
|---|---|---|---|
| StandardScaler | (x - mean) / std | Mean=0, Std=1 | SVM, Logistic Regression; less distorted by outliers than MinMaxScaler (though still affected) |
| MinMaxScaler | (x - min) / (max - min) | [0, 1] | Neural Networks, KNN, bounded features |
Quick rule:
- SVM, Linear models → StandardScaler
- Neural networks, images → MinMaxScaler
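Both formulas in the table can be checked by hand on a small column of numbers. The sketch below uses only the standard library and made-up sample values; it assumes the population standard deviation (dividing by n), which is what StandardScaler uses.

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 50.0]   # made-up sample column

# StandardScaler formula: (x - mean) / std
mean = statistics.fmean(values)           # 30.0
std = statistics.pstdev(values)           # population std, about 14.14
standardized = [(x - mean) / std for x in values]

# MinMaxScaler formula: (x - min) / (max - min)
lo, hi = min(values), max(values)
minmax = [(x - lo) / (hi - lo) for x in values]

print([round(z, 3) for z in standardized])
print(minmax)
```

After standardizing, the column has mean 0 and standard deviation 1; after min-max scaling, the smallest value maps to 0 and the largest to 1.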
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# StandardScaler: Z-score normalization
scaler1 = StandardScaler()
X_standard = scaler1.fit_transform(X)
# MinMaxScaler: Scale to [0, 1]
scaler2 = MinMaxScaler()
X_minmax = scaler2.fit_transform(X)
📝 Q4: Naive Bayes & Decision Tree (25 points)
Bayes' Theorem Formula
$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$
Where:
- $P(A|B)$ = Posterior probability (what we want)
- $P(B|A)$ = Likelihood
- $P(A)$ = Prior probability
- $P(B)$ = Evidence (normalizing constant)
Naive Bayes Calculation Example
Dataset: Classify emails as Spam or Not Spam
| Email | Contains "Free" | Contains "Winner" | Spam? |
|---|---|---|---|
| 1 | Yes | Yes | Spam |
| 2 | Yes | No | Spam |
| 3 | No | Yes | Spam |
| 4 | No | No | Not Spam |
| 5 | Yes | No | Not Spam |
| 6 | No | No | Not Spam |
Question: New email has "Free"=Yes, "Winner"=No. Is it Spam?
Step 1: Calculate Priors
- P(Spam) = 3/6 = 0.5
- P(Not Spam) = 3/6 = 0.5
Step 2: Calculate Likelihoods
For Spam emails (1, 2, 3):
- P(Free=Yes | Spam) = 2/3 (emails 1, 2)
- P(Winner=No | Spam) = 1/3 (email 2 only)
For Not Spam emails (4, 5, 6):
- P(Free=Yes | Not Spam) = 1/3 (email 5)
- P(Winner=No | Not Spam) = 3/3 = 1 (all three)
Step 3: Calculate Unnormalized Posteriors
$P(Spam | evidence) \propto P(Spam) \times P(Free=Yes|Spam) \times P(Winner=No|Spam)$ $= 0.5 \times \frac{2}{3} \times \frac{1}{3} = 0.111$
$P(Not Spam | evidence) \propto 0.5 \times \frac{1}{3} \times 1 = 0.167$
Step 4: Normalize
$P(Spam | evidence) = \frac{0.111}{0.111 + 0.167} = \frac{0.111}{0.278} = 0.40 = 40\%$
Prediction: NOT SPAM (40% < 50%)
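The four steps above can be reproduced in a few lines of Python. This is a minimal sketch that counts directly from the six-email table and uses exact fractions. The function name `unnormalized_posterior` is just an illustrative choice, and no smoothing is applied, matching the hand calculation.

```python
from fractions import Fraction

# (contains_free, contains_winner, label) for emails 1-6 from the table
emails = [
    ("Yes", "Yes", "Spam"),
    ("Yes", "No", "Spam"),
    ("No", "Yes", "Spam"),
    ("No", "No", "Not Spam"),
    ("Yes", "No", "Not Spam"),
    ("No", "No", "Not Spam"),
]

def unnormalized_posterior(label, free, winner):
    """Prior x likelihoods for one class (Steps 1-3)."""
    rows = [e for e in emails if e[2] == label]
    prior = Fraction(len(rows), len(emails))
    p_free = Fraction(sum(e[0] == free for e in rows), len(rows))
    p_winner = Fraction(sum(e[1] == winner for e in rows), len(rows))
    return prior * p_free * p_winner

# New email: "Free"=Yes, "Winner"=No
scores = {
    label: unnormalized_posterior(label, "Yes", "No")
    for label in ("Spam", "Not Spam")
}

# Step 4: normalize so the two posteriors sum to 1
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

for label, p in posteriors.items():
    print(f"P({label} | evidence) = {p} = {float(p):.2f}")
```

The exact values are 2/5 for Spam and 3/5 for Not Spam, which agrees with the rounded 40% above.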
Gini Index Formula & Calculation
$$Gini = 1 - \sum_{i=1}^{n} p_i^2$$
Where $p_i$ is the proportion of class $i$ in the node.
Dataset: 20 emails (12 Spam, 8 Not Spam)
Split by "Contains Free":
- Contains "free": 10 emails (9 Spam, 1 Not Spam)
- No "free": 10 emails (3 Spam, 7 Not Spam)
Step 1: Gini for "Contains Free" node (9S, 1N)
- P(Spam) = 9/10 = 0.9
- P(Not Spam) = 1/10 = 0.1
- Gini = 1 - (0.9² + 0.1²) = 1 - (0.81 + 0.01) = 0.18
Step 2: Gini for "No Free" node (3S, 7N)
- P(Spam) = 3/10 = 0.3
- P(Not Spam) = 7/10 = 0.7
- Gini = 1 - (0.3² + 0.7²) = 1 - (0.09 + 0.49) = 0.42
Step 3: Weighted Average Gini
$Gini_{split} = \frac{10}{20} \times 0.18 + \frac{10}{20} \times 0.42 = 0.5 \times 0.18 + 0.5 \times 0.42 = 0.09 + 0.21 = 0.30$
Final Answer: Gini for this split = 0.30
Interpretation: Lower Gini = better split. Pure node has Gini = 0.
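As a quick check of the numbers above, the following sketch computes the same Gini values from class counts. The helper names `gini` and `weighted_gini` are illustrative. The parent Gini (12 Spam, 8 Not Spam) is not part of the worked example. It is added here only to show how much impurity the split removes.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Size-weighted average Gini over the child nodes of a split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

parent = [12, 8]               # 12 Spam, 8 Not Spam
children = [[9, 1], [3, 7]]    # "Contains free" node, "No free" node

parent_gini = gini(parent)
split_gini = weighted_gini(children)

print(f"Gini(contains free) = {gini(children[0]):.2f}")       # 0.18
print(f"Gini(no free)       = {gini(children[1]):.2f}")       # 0.42
print(f"Gini(split)         = {split_gini:.2f}")              # 0.30
print(f"Gini(parent)        = {parent_gini:.2f}")             # 0.48
print(f"Reduction           = {parent_gini - split_gini:.2f}")  # 0.18
```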
Information Gain (Entropy)
$$Entropy = -\sum_{i=1}^{n} p_i \log_2(p_i)$$
$$Information\ Gain = Entropy(parent) - \sum_{children} \frac{n_{child}}{n_{parent}} \times Entropy(child)$$
Parent node: 3 Spam, 3 Not Spam (50/50 split)
Parent Entropy (perfect balance = maximum entropy): $H(parent) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5)$ $= -0.5(-1) - 0.5(-1) = 0.5 + 0.5 = 1.0$
Child node "Free=Yes" (2 Spam, 1 Not Spam): $H = -\frac{2}{3} \log_2(\frac{2}{3}) - \frac{1}{3} \log_2(\frac{1}{3})$ $= 0.390 + 0.528 = 0.918$
Child node "Free=No" (1 Spam, 2 Not Spam): $H = -\frac{1}{3} \log_2(\frac{1}{3}) - \frac{2}{3} \log_2(\frac{2}{3}) = 0.918$
Weighted Entropy: $= \frac{3}{6}(0.918) + \frac{3}{6}(0.918) = 0.918$
Information Gain: $IG = 1.0 - 0.918 = 0.082$
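The same entropy and information gain values can be computed from class counts. This is a small sketch with illustrative function names. Classes with a zero count are skipped, which applies the convention 0 × log₂(0) = 0.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [3, 3]                # 3 Spam, 3 Not Spam
children = [[2, 1], [1, 2]]    # Free=Yes node, Free=No node

print(f"H(parent)   = {entropy(parent):.3f}")                      # 1.000
print(f"H(Free=Yes) = {entropy(children[0]):.3f}")                 # 0.918
print(f"H(Free=No)  = {entropy(children[1]):.3f}")                 # 0.918
print(f"IG          = {information_gain(parent, children):.3f}")   # 0.082
```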
Complete Entropy Calculation Example (EXAM FORMAT!)
Dataset: Predict if student will pass
| GPA | Studied | Passed |
|---|---|---|
| Low | No | No |
| Low | Yes | No |
| Med | No | No |
| Med | Yes | Yes |
| High | No | Yes |
| High | Yes | Yes |
Question: Calculate H(Passed), H(Passed|GPA), H(Passed|Studied), then draw decision tree.
Step 1: Calculate H(Passed) - entropy of the target variable
- Passed=Yes: 3 records → P(Yes) = 3/6 = 0.5
- Passed=No: 3 records → P(No) = 3/6 = 0.5
$H(Passed) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5)$ $= -0.5 \times (-1) - 0.5 \times (-1) = 0.5 + 0.5 = 1.0$
Answer: H(Passed) = 1.0 (a perfect 50/50 split = maximum entropy)
Step 2: Calculate H(Passed | GPA) - conditional entropy after grouping by GPA
GPA = Low (2 records: 0 Yes, 2 No)
- $H(Low) = -0 \log_2(0) - 1 \log_2(1) = 0$ (pure node!)
GPA = Med (2 records: 1 Yes, 1 No)
- $H(Med) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1.0$
GPA = High (2 records: 2 Yes, 0 No)
- $H(High) = -1 \log_2(1) - 0 \log_2(0) = 0$ (pure node!)
Weighted Average: $H(Passed|GPA) = \frac{2}{6} \times 0 + \frac{2}{6} \times 1.0 + \frac{2}{6} \times 0$ $= 0 + 0.333 + 0 = 0.333$
Answer: H(Passed|GPA) = 0.333
Step 3: Calculate H(Passed | Studied) - conditional entropy after grouping by Studied
Studied = No (3 records: 1 Yes, 2 No)
- P(Yes) = 1/3, P(No) = 2/3
- $H(No) = -\frac{1}{3} \log_2(\frac{1}{3}) - \frac{2}{3} \log_2(\frac{2}{3}) = 0.528 + 0.390 = 0.918$
Studied = Yes (3 records: 2 Yes, 1 No)
- P(Yes) = 2/3, P(No) = 1/3
- H(Yes) = 0.918 (by symmetry)
Weighted Average: $H(Passed|Studied) = \frac{3}{6} \times 0.918 + \frac{3}{6} \times 0.918 = 0.918$
Answer: H(Passed|Studied) = 0.918
Step 4: Compare Information Gain
- IG(GPA) = H(Passed) - H(Passed|GPA) = 1.0 - 0.333 = 0.667 ✅ Higher!
- IG(Studied) = H(Passed) - H(Passed|Studied) = 1.0 - 0.918 = 0.082
Choose GPA as the root node (higher information gain).
Step 5: Draw Decision Tree
              [GPA?]
            /    |    \
         Low    Med    High
          ↓      ↓      ↓
        [No] [Studied?] [Yes]
              /    \
            No      Yes
            ↓        ↓
          [No]     [Yes]
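One way to sanity-check a hand-drawn tree is to write it as nested conditions and run it over the training table. The function name `predict_passed` is a hypothetical choice for this sketch.

```python
def predict_passed(gpa, studied):
    """Follow the tree: split on GPA first, then on Studied for Med."""
    if gpa == "Low":
        return "No"
    if gpa == "High":
        return "Yes"
    # gpa == "Med": the only impure branch, so split again on Studied
    return "Yes" if studied == "Yes" else "No"

# (GPA, Studied, Passed) rows from the dataset table
training_rows = [
    ("Low", "No", "No"), ("Low", "Yes", "No"),
    ("Med", "No", "No"), ("Med", "Yes", "Yes"),
    ("High", "No", "Yes"), ("High", "Yes", "Yes"),
]

correct = sum(predict_passed(g, s) == p for g, s, p in training_rows)
print(f"{correct}/{len(training_rows)} training rows classified correctly")  # 6/6
```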
Log value quick reference (a calculator may be used in the exam):
- log₂(0.5) = -1
- log₂(1) = 0
- log₂(1/3) ≈ -1.585
- log₂(2/3) ≈ -0.585
Rule: 0 × log₂(0) = 0 (by convention)
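Steps 1 to 4 of the student example can also be computed directly from the table. The sketch below is one possible way to do it. The helper names are illustrative, and columns are addressed by position (0 = GPA, 1 = Studied, last = Passed).

```python
from collections import Counter, defaultdict
from math import log2

# (GPA, Studied, Passed) rows from the dataset table
rows = [
    ("Low", "No", "No"), ("Low", "Yes", "No"),
    ("Med", "No", "No"), ("Med", "Yes", "Yes"),
    ("High", "No", "Yes"), ("High", "Yes", "Yes"),
]

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(rows, col):
    """Weighted entropy of Passed after grouping rows by column `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

h_passed = entropy([row[-1] for row in rows])
print(f"H(Passed) = {h_passed:.3f}")

for name, col in (("GPA", 0), ("Studied", 1)):
    h_cond = conditional_entropy(rows, col)
    print(f"H(Passed|{name}) = {h_cond:.3f}, IG({name}) = {h_passed - h_cond:.3f}")
```

The attribute with the larger information gain (GPA here) becomes the root node.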
Decision Tree Code Template
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load data
df = pd.read_csv("data.csv")
# Encode categorical if needed
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in df.select_dtypes(include=['object']).columns:
    df[col] = le.fit_transform(df[col])
# Split features and target
X = df.drop('target', axis=1)
y = df['target']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Create and train Decision Tree
# criterion='gini' is the default (the impurity measure used by CART)
# criterion='entropy' splits by information gain (the measure used by ID3/C4.5)
dt_model = DecisionTreeClassifier(
criterion='gini', # or 'entropy'
max_depth=5, # Limit depth to prevent overfitting
random_state=42
)
dt_model.fit(X_train, y_train)
# Evaluate
y_pred = dt_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
🎯 Quick Reference Checklist
Python Essentials
- Operator precedence: ** > *, /, //, % > +, -
- String slicing: left-inclusive, right-exclusive
- Floor division // rounds toward negative infinity
- range(a, b) generates a to b-1
- List assignment creates a reference, not a copy
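Each item in the checklist can be confirmed with a one-line experiment. The values below are arbitrary examples chosen for this sketch.

```python
# Operator precedence: ** binds tighter than *
print(2 * 3 ** 2)          # 18

# Floor division rounds toward negative infinity
print(-5 // 3)             # -2

# Slicing is left-inclusive, right-exclusive
print("python"[1:4])       # yth

# range(a, b) stops at b - 1
print(list(range(2, 5)))   # [2, 3, 4]

# Assignment creates a reference; .copy() creates an independent list
a = [1, 2, 3]
b = a           # b and a are the same list object
c = a.copy()    # c is a separate list
b.append(4)
print(a)        # [1, 2, 3, 4]
print(c)        # [1, 2, 3]
```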
Machine Learning Essentials
- Bayes formula: P(A|B) = P(B|A) × P(A) / P(B)
- Gini: 1 - Σ(pᵢ²)
- Entropy: -Σ pᵢ log₂(pᵢ)
- SVM: needs scaling, uses kernel trick
- Random Forest: reduces overfitting via bagging
- Decision Tree: uses Gini (CART) or Entropy (ID3)
💪 Good luck on your exam! 🎓