ABW501 Mock Exam 2: Analytics Edge (With Answers)
📋 Exam Information
| Item | Details |
|---|---|
| Total Points | 100 |
| Time Allowed | 90 minutes |
| Format | Closed book, calculator allowed |
| Structure | 4 Blocks, 8 Questions |
Block 1: Analytics Strategy & Applications (25 points)
Q1.1 (12 points)
A hospital wants to reduce patient readmission rates. For each analytics approach, give a specific example of how it could help:
a) Descriptive Analytics
b) Diagnostic Analytics
c) Predictive Analytics
d) Prescriptive Analytics
💡 Answer & Solution
a) Descriptive Analytics: "Dashboard showing readmission rates by department, age group, and diagnosis. Example: 'Cardiology has 15% readmission rate vs. 8% hospital average.'"
b) Diagnostic Analytics: "Root cause analysis to understand WHY readmissions happen. Example: 'Patients discharged on Friday have 20% higher readmission - likely due to weekend pharmacy closures.'"
c) Predictive Analytics: "ML model predicting which patients are likely to be readmitted. Example: 'Patient John has 78% probability of readmission within 30 days based on his diagnosis, age, and prior history.'"
d) Prescriptive Analytics: "Recommending specific interventions. Example: 'For high-risk patients, schedule follow-up call within 48 hours, arrange home nurse visits, and ensure medication delivery.'"
Key Pattern:
- Descriptive: Summarize what happened
- Diagnostic: Explain why
- Predictive: Forecast risk
- Prescriptive: Recommend actions
Q1.2 (13 points)
Explain the concept of data-driven decision making vs intuition-based decision making. Give TWO advantages and TWO disadvantages of each approach.
💡 Answer & Solution
Data-Driven Decision Making:
| Advantages | Disadvantages |
|---|---|
| 1. Objective - removes bias | 1. Requires quality data (garbage in = garbage out) |
| 2. Scalable - can analyze millions of records | 2. May miss context that humans understand |
| 3. Reproducible - same data → same decision | 3. Expensive to set up and maintain |
| 4. Measurable - can track outcomes | 4. Can lead to "analysis paralysis" |
Intuition-Based Decision Making:
| Advantages | Disadvantages |
|---|---|
| 1. Fast - no data collection needed | 1. Subject to cognitive biases |
| 2. Works when data is unavailable | 2. Hard to explain or justify |
| 3. Can capture tacit knowledge | 3. Not scalable |
| 4. Good for unprecedented situations | 4. Inconsistent results |
Best Practice: Combine both - use data to inform decisions, but let human judgment handle context and ethics.
Block 2: Data Quality & Preparation (25 points)
Q2.1 (12 points)
Explain FOUR common data quality issues and how to address each:
💡 Answer & Solution
1. Missing Values
- Problem: Empty cells in dataset
- Causes: Survey non-response, system errors, data not collected
- Solutions:
- Delete rows (if few missing)
- Impute with mean/median/mode
- Use predictive models to estimate
- Create "missing" category for categorical
2. Outliers
- Problem: Extreme values far from normal range
- Causes: Data entry errors, genuine rare events
- Solutions:
- Remove if clearly erroneous
- Cap at percentiles (winsorization)
- Transform data (log scale)
- Use robust algorithms
3. Inconsistent Formatting
- Problem: Same thing recorded differently
- Examples: "USA", "U.S.A.", "United States"
- Solutions:
- Standardize formats
- Create lookup tables
- Use data validation rules
- Regular expressions for cleaning
4. Duplicate Records
- Problem: Same entity recorded multiple times
- Causes: Multiple data sources, entry errors
- Solutions:
- Exact matching (same ID)
- Fuzzy matching (similar names)
- Deduplication algorithms
- Define business rules for merging
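Each of these fixes maps onto one or two pandas operations. Below is a minimal sketch on a hypothetical DataFrame; the column names (`CustomerID`, `Country`, `Age`) and the thresholds are illustrative, not taken from the exam.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; columns and values are illustrative only
df = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "Country": ["USA", "U.S.A.", "United States", "usa"],
    "Age": [25.0, 25.0, np.nan, 300.0],
})

# 1. Missing values: impute with the median
df["Age"] = df["Age"].fillna(df["Age"].median())

# 2. Outliers: cap at the 1st/99th percentiles (winsorization)
low, high = df["Age"].quantile([0.01, 0.99])
df["Age"] = df["Age"].clip(lower=low, upper=high)

# 3. Inconsistent formatting: standardize via a lookup table
country_map = {"usa": "United States", "u.s.a.": "United States",
               "united states": "United States"}
df["Country"] = df["Country"].str.lower().map(country_map)

# 4. Duplicates: exact matching on the business key
df = df.drop_duplicates(subset=["CustomerID"])
```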
Q2.2 (13 points)
You receive a dataset with the following issues:
- Age column has values: 25, 30, -5, 150, 45, NULL, 35
- Gender column has: "M", "Male", "m", "F", "female", "Female"
a) Identify what's wrong with each column
b) Propose specific cleaning steps
💡 Answer & Solution
a) Age Column Issues:
- -5: Invalid (negative age impossible)
- 150: Outlier (likely data entry error - no one is 150)
- NULL: Missing value
Gender Column Issues:
- Inconsistent case: "M" vs "m", "Male" vs "male"
- Inconsistent format: "M" vs "Male" (abbreviation vs full word)
- Inconsistent capitalization: "female" vs "Female"
b) Cleaning Steps:
For Age:
```python
import numpy as np

# Step 1: Replace invalid values (negative ages) with NaN
df.loc[df['Age'] < 0, 'Age'] = np.nan
# Step 2: Replace implausible outliers (> 120) with NaN
df.loc[df['Age'] > 120, 'Age'] = np.nan
# Step 3: Fill missing values with the median of the valid ages
df['Age'] = df['Age'].fillna(df['Age'].median())
```

For Gender:
```python
# Step 1: Convert to lowercase
df['Gender'] = df['Gender'].str.lower()
# Step 2: Standardize to a single format
df['Gender'] = df['Gender'].replace({
    'm': 'Male',
    'male': 'Male',
    'f': 'Female',
    'female': 'Female'
})
```

Result:
- Age: [25, 30, 32.5, 32.5, 45, 32.5, 35] (the median of the valid ages 25, 30, 35, 45 is 32.5)
- Gender: ['Male', 'Male', 'Male', 'Female', 'Female', 'Female']
Block 3: Model Evaluation & Interpretation (25 points)
Q3.1 (12 points)
Explain the following model evaluation concepts:
a) Accuracy, Precision, Recall
b) When is accuracy NOT a good metric?
c) What is the F1 Score and when to use it?
💡 Answer & Solution
a) Definitions:
Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
- Overall correctness
- "Of all predictions, how many were right?"
Precision: $\text{Precision} = \frac{TP}{TP + FP}$
- Of positive predictions, how many were actually positive?
- "When I predict positive, how often am I right?"
Recall (Sensitivity): $\text{Recall} = \frac{TP}{TP + FN}$
- Of actual positives, how many did I catch?
- "Of all actual positives, how many did I find?"
b) When Accuracy Fails:
Imbalanced Classes!
Example: Fraud detection
- 99% legitimate, 1% fraud
- Model predicts "all legitimate" → 99% accuracy!
- But catches 0% of fraud (useless)
Rule: Use precision/recall for imbalanced datasets.
c) F1 Score:
$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
- Harmonic mean of precision and recall
- Balances both metrics
- Range: 0 to 1 (higher is better)
Use When:
- Class imbalance exists
- Both false positives AND false negatives matter
- Need single metric to compare models
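A quick numeric sketch of the fraud example from part (b), assuming 1,000 transactions (990 legitimate, 10 fraud) and a model that predicts "legitimate" for everything:

```python
# All-"legitimate" model on 990 legitimate / 10 fraud transactions
TP, FP = 0, 0        # no fraud is ever predicted
FN, TN = 10, 990     # all 10 frauds missed, all legitimate correct

accuracy = (TP + TN) / (TP + TN + FP + FN)            # 0.99 -- looks great
precision = TP / (TP + FP) if (TP + FP) else 0.0      # undefined here -> treat as 0
recall = TP / (TP + FN)                               # 0.0 -- catches no fraud
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)               # 0.0

print(accuracy, precision, recall, f1)  # 0.99 0.0 0.0 0.0
```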
Q3.2 (13 points)
A spam detection model has the following confusion matrix:
| | Predicted: Spam | Predicted: Not Spam |
|---|---|---|
| Actual: Spam | 85 | 15 |
| Actual: Not Spam | 10 | 890 |
Calculate:
a) Accuracy
b) Precision (for Spam)
c) Recall (for Spam)
d) F1 Score
e) Which is more important for spam detection: Precision or Recall? Why?
💡 Answer & Solution
Confusion Matrix Values:
- TP (Spam correctly identified) = 85
- FN (Spam missed) = 15
- FP (Not spam marked as spam) = 10
- TN (Not spam correctly identified) = 890
- Total = 1000
a) Accuracy: $\text{Accuracy} = \frac{85 + 890}{1000} = \frac{975}{1000} = 97.5\%$
b) Precision: $\text{Precision} = \frac{85}{85 + 10} = \frac{85}{95} = 89.47\%$
c) Recall: $\text{Recall} = \frac{85}{85 + 15} = \frac{85}{100} = 85.0\%$
d) F1 Score: $F1 = 2 \times \frac{0.8947 \times 0.85}{0.8947 + 0.85} = \frac{1.5210}{1.7447} = 87.18\%$
e) Precision is MORE important for spam detection
Reasoning:
- High precision = Few false positives
- False positive = Important email marked as spam
- This is WORSE than missing some spam (user might miss critical email!)
Trade-off:
- Prefer: Some spam in inbox (annoying but visible)
- Avoid: Important email in spam folder (might never be seen)
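The hand calculations above can be checked with a few lines of Python, using the values straight from the confusion matrix:

```python
TP, FN, FP, TN = 85, 15, 10, 890

accuracy = (TP + TN) / (TP + TN + FP + FN)           # 0.975
precision = TP / (TP + FP)                           # 0.8947...
recall = TP / (TP + FN)                              # 0.85
f1 = 2 * precision * recall / (precision + recall)   # 0.8718...

print(f"{accuracy:.1%}  {precision:.2%}  {recall:.1%}  {f1:.2%}")
# 97.5%  89.47%  85.0%  87.18%
```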
Block 4: Advanced Analytics Concepts (25 points)
Q4.1 (12 points)
Compare Supervised vs Unsupervised learning:
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Definition | ? | ? |
| Data Requirements | ? | ? |
| Example Algorithms | ? | ? |
| Business Use Cases | ? | ? |
💡 Answer & Solution
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Definition | Learning from labeled data (input → known output) | Finding patterns in unlabeled data |
| Data Requirements | Needs labeled training data (expensive to create) | Only needs input data (no labels needed) |
| Example Algorithms | Decision Tree, Random Forest, SVM, Linear Regression, Naive Bayes | K-Means Clustering, Hierarchical Clustering, PCA, Association Rules |
| Business Use Cases | Spam detection, price prediction, customer churn, loan approval | Customer segmentation, market basket analysis, anomaly detection |
Key Difference: Supervised has a "teacher" (labels), unsupervised discovers structure on its own.
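As a small illustration of the "teacher vs. no teacher" distinction, the sketch below fits one algorithm from each column of the table on the same synthetic data (scikit-learn assumed; the dataset is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # input features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # labels -- only the supervised model sees these

# Supervised: learns the mapping X -> y from labeled examples
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("predictions:", clf.predict(X[:5]))

# Unsupervised: finds structure in X alone, no labels involved
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", km.labels_[:5])
```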
Q4.2 (13 points)
Explain the concept of overfitting:
a) What is overfitting?
b) How can you detect it?
c) List FOUR techniques to prevent overfitting
💡 Answer & Solution
a) What is Overfitting?
Model learns training data TOO well, including noise and random fluctuations.
- Performs excellently on training data
- Performs poorly on new/unseen data
- Model has memorized rather than learned general patterns
Analogy: Student who memorizes test answers but can't solve new problems.
b) How to Detect Overfitting

1. Train-Test Gap
   - Training accuracy = 99%
   - Test accuracy = 70%
   - Large gap = overfitting
2. Learning Curves
   - Training error keeps decreasing
   - Validation error increases or plateaus
   - Lines diverge
3. Cross-Validation
   - High variance in scores across folds
   - Some folds much worse than others
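The first and third checks are easy to demonstrate in code. Here is a minimal sketch with scikit-learn on a synthetic dataset (the data and exact scores are illustrative, not from the exam):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Train-test gap: an unconstrained tree tends to memorize the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))   # typically ~1.00
print("test accuracy: ", tree.score(X_test, y_test))     # noticeably lower -> overfitting

# 3. Cross-validation: high variance across folds is another warning sign
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("fold scores:", scores.round(3), "std:", round(scores.std(), 3))
```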
c) Four Techniques to Prevent Overfitting

1. Cross-Validation
   - Split data into k folds
   - Train on k-1, test on 1, rotate
   - More reliable performance estimate
2. Regularization (L1/L2) (see the sketch after this list)
   - Adds penalty for complex models
   - L1 (Lasso): can eliminate features
   - L2 (Ridge): shrinks coefficients
3. Early Stopping
   - Monitor validation error during training
   - Stop when validation error starts increasing
   - Prevents over-training
4. Reduce Model Complexity
   - Fewer features (feature selection)
   - Simpler model (shallower tree)
   - Fewer parameters
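Regularization can be illustrated with a minimal scikit-learn sketch (synthetic data; the alpha values are arbitrary): with an L1 penalty the irrelevant coefficients are driven to exactly zero, while an L2 penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features matter; the other eight are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# With this setup, Lasso sets the noise coefficients to exactly 0;
# Ridge keeps them small but nonzero
print("Lasso:", lasso.coef_.round(2))
print("Ridge:", ridge.coef_.round(2))
```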
Bonus techniques:
- Dropout (neural networks)
- More training data
- Data augmentation
- Ensemble methods
🎁 Bonus: Ethics in Analytics (5 extra points)
A company wants to use ML for hiring decisions. The model was trained on historical hiring data (who was hired and who succeeded).
What ethical concerns should be considered?
💡 Answer & Solution
Ethical Concerns:

1. Historical Bias Perpetuation
   - If past hiring was biased (e.g., fewer women in tech), the model learns this
   - Model will discriminate against groups historically underrepresented
   - "The model is only as fair as its training data"
2. Proxy Discrimination
   - Even without protected attributes (race, gender), proxies exist
   - Zip code correlates with race
   - Name style correlates with ethnicity
   - Model may discriminate indirectly
3. Lack of Transparency
   - Candidates can't understand why they were rejected
   - "Black box" decisions are hard to appeal
   - Legal requirement for explainable decisions
4. Feedback Loop
   - Only hired people have success data
   - Rejected candidates might have succeeded
   - Model never learns from its mistakes
Recommendations:
- Audit model for bias regularly
- Use diverse training data
- Ensure human oversight
- Allow candidates to appeal/explain
- Be transparent about AI use
🏁 End of Exam
| Block | Topic | Points |
|---|---|---|
| Block 1 | Analytics Strategy | 25 |
| Block 2 | Data Quality | 25 |
| Block 3 | Model Evaluation | 25 |
| Block 4 | Advanced Concepts | 25 |
| Total | | 100 |
| Bonus | Ethics | +5 |
📝 Key Formulas Reference
| Metric | Formula |
|---|---|
| Accuracy | (TP + TN) / Total |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) |
| Specificity | TN / (TN + FP) |
Confusion Matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
Show all working for partial credit. Good luck!