

ABW501 Mock Exam 2: Analytics Edge (With Answers)

January 24, 2026
10 min read
#ABW501, #Mock Exam, #Business Analytics, #Data Mining, #Regression, #Practice Test


ABW501 Mock Exam 2 - Analytics Edge

📋 Exam Information

| Item | Details |
| --- | --- |
| Total Points | 100 |
| Time Allowed | 90 minutes |
| Format | Closed book, calculator allowed |
| Structure | 4 Blocks, 8 Questions |

Block 1: Analytics Strategy & Applications (25 points)

Q1.1 (12 points)

A hospital wants to reduce patient readmission rates. For each analytics approach, give a specific example of how it could help:

a) Descriptive Analytics
b) Diagnostic Analytics
c) Predictive Analytics
d) Prescriptive Analytics

💡 Answer & Solution

a) Descriptive Analytics: "Dashboard showing readmission rates by department, age group, and diagnosis. Example: 'Cardiology has 15% readmission rate vs. 8% hospital average.'"

b) Diagnostic Analytics: "Root cause analysis to understand WHY readmissions happen. Example: 'Patients discharged on Friday have 20% higher readmission - likely due to weekend pharmacy closures.'"

c) Predictive Analytics: "ML model predicting which patients are likely to be readmitted. Example: 'Patient John has 78% probability of readmission within 30 days based on his diagnosis, age, and prior history.'"

d) Prescriptive Analytics: "Recommending specific interventions. Example: 'For high-risk patients, schedule follow-up call within 48 hours, arrange home nurse visits, and ensure medication delivery.'"

Key Pattern:

  • Descriptive: Summarize what happened
  • Diagnostic: Explain why
  • Predictive: Forecast risk
  • Prescriptive: Recommend actions

Q1.2 (13 points)

Explain the concept of data-driven decision making vs intuition-based decision making. Give TWO advantages and TWO disadvantages of each approach.

💡 Answer & Solution

Data-Driven Decision Making:

| Advantages | Disadvantages |
| --- | --- |
| 1. Objective - removes bias | 1. Requires quality data (garbage in = garbage out) |
| 2. Scalable - can analyze millions of records | 2. May miss context that humans understand |
| 3. Reproducible - same data → same decision | 3. Expensive to set up and maintain |
| 4. Measurable - can track outcomes | 4. Can lead to "analysis paralysis" |

Intuition-Based Decision Making:

| Advantages | Disadvantages |
| --- | --- |
| 1. Fast - no data collection needed | 1. Subject to cognitive biases |
| 2. Works when data is unavailable | 2. Hard to explain or justify |
| 3. Can capture tacit knowledge | 3. Not scalable |
| 4. Good for unprecedented situations | 4. Inconsistent results |

Best Practice: Combine both - use data to inform decisions, but let human judgment handle context and ethics.


Block 2: Data Quality & Preparation (25 points)

Q2.1 (12 points)

Explain FOUR common data quality issues and how to address each:

💡 Answer & Solution

1. Missing Values

  • Problem: Empty cells in dataset
  • Causes: Survey non-response, system errors, data not collected
  • Solutions:
    • Delete rows (if few missing)
    • Impute with mean/median/mode
    • Use predictive models to estimate
    • Create "missing" category for categorical

2. Outliers

  • Problem: Extreme values far from normal range
  • Causes: Data entry errors, genuine rare events
  • Solutions:
    • Remove if clearly erroneous
    • Cap at percentiles (winsorization)
    • Transform data (log scale)
    • Use robust algorithms

3. Inconsistent Formatting

  • Problem: Same thing recorded differently
  • Examples: "USA", "U.S.A.", "United States"
  • Solutions:
    • Standardize formats
    • Create lookup tables
    • Use data validation rules
    • Regular expressions for cleaning

4. Duplicate Records

  • Problem: Same entity recorded multiple times
  • Causes: Multiple data sources, entry errors
  • Solutions:
    • Exact matching (same ID)
    • Fuzzy matching (similar names)
    • Deduplication algorithms
    • Define business rules for merging
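Not something you would write in a closed-book exam, but the four fixes above map directly onto a few lines of pandas. This is a minimal sketch on a hypothetical mini-dataset; the column names, values, and percentile thresholds are made up purely for illustration:

import numpy as np
import pandas as pd

# Hypothetical mini-dataset exhibiting all four issues above
df = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 4],                                   # duplicate: id 1 appears twice
    'country': ['USA', 'U.S.A.', 'United States', 'usa', 'Canada'],   # inconsistent formatting
    'income': [52000, 52000, np.nan, 9999999, 48000],                 # missing value and outlier
})

# 1. Missing values: impute with the median
df['income'] = df['income'].fillna(df['income'].median())

# 2. Outliers: cap at lower/upper percentiles (winsorization; thresholds are illustrative)
low, high = df['income'].quantile([0.01, 0.99])
df['income'] = df['income'].clip(lower=low, upper=high)

# 3. Inconsistent formatting: standardize via a lookup table
country_map = {'usa': 'USA', 'u.s.a.': 'USA', 'united states': 'USA', 'canada': 'Canada'}
df['country'] = df['country'].str.lower().map(country_map).fillna(df['country'])

# 4. Duplicates: drop exact repeats of the same customer_id
df = df.drop_duplicates(subset='customer_id', keep='first')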

Q2.2 (13 points)

You receive a dataset with the following issues:

  • Age column has values: 25, 30, -5, 150, 45, NULL, 35
  • Gender column has: "M", "Male", "m", "F", "female", "Female"

a) Identify what's wrong with each column
b) Propose specific cleaning steps

💡 Answer & Solution

a) Age Column Issues:

  1. -5: Invalid (negative age impossible)
  2. 150: Outlier (likely data entry error - no one is 150)
  3. NULL: Missing value

Gender Column Issues:

  1. Inconsistent case: "M" vs "m", "Male" vs "male"
  2. Inconsistent format: "M" vs "Male" (abbreviation vs full word)
  3. Inconsistent capitalization: "female" vs "Female"

b) Cleaning Steps:

For Age:

# Step 0: Imports needed below (the DataFrame df is assumed already loaded)
import numpy as np
import pandas as pd

# Step 1: Replace invalid values with NaN
df.loc[df['Age'] < 0, 'Age'] = np.nan

# Step 2: Replace outliers (>120) with NaN
df.loc[df['Age'] > 120, 'Age'] = np.nan

# Step 3: Fill missing with median (reassignment avoids pandas' inplace/chained-assignment warnings)
df['Age'] = df['Age'].fillna(df['Age'].median())

For Gender:

# Step 1: Convert to lowercase
df['Gender'] = df['Gender'].str.lower()
 
# Step 2: Standardize to single format
df['Gender'] = df['Gender'].replace({
    'm': 'Male',
    'male': 'Male',
    'f': 'Female',
    'female': 'Female'
})

Result:

  • Age: [25, 30, 32.5, 32.5, 45, 32.5, 35] (the remaining valid ages are 25, 30, 35, 45, so the median is (30 + 35) / 2 = 32.5)
  • Gender: ['Male', 'Male', 'Male', 'Female', 'Female', 'Female']

Block 3: Model Evaluation & Interpretation (25 points)

Q3.1 (12 points)

Explain the following model evaluation concepts:

a) Accuracy, Precision, Recall
b) When is accuracy NOT a good metric?
c) What is the F1 Score and when to use it?

💡 Answer & Solution

a) Definitions:

Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

  • Overall correctness
  • "Of all predictions, how many were right?"

Precision: $\text{Precision} = \frac{TP}{TP + FP}$

  • Of positive predictions, how many were actually positive?
  • "When I predict positive, how often am I right?"

Recall (Sensitivity): $\text{Recall} = \frac{TP}{TP + FN}$

  • Of actual positives, how many did I catch?
  • "Of all actual positives, how many did I find?"

b) When Accuracy Fails:

Imbalanced Classes!

Example: Fraud detection

  • 99% legitimate, 1% fraud
  • Model predicts "all legitimate" → 99% accuracy!
  • But catches 0% of fraud (useless)

Rule: Use precision/recall for imbalanced datasets.

c) F1 Score:

$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

  • Harmonic mean of precision and recall
  • Balances both metrics
  • Range: 0 to 1 (higher is better)

Use When:

  • Class imbalance exists
  • Both false positives AND false negatives matter
  • Need single metric to compare models
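A quick way to see point (b) in practice: a toy scikit-learn sketch (hand-made labels, assuming scikit-learn is installed) where a model that predicts "legitimate" for every transaction still scores 98% accuracy:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy imbalanced labels: 1 = fraud, 0 = legitimate; the "model" predicts 0 for everything
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))                    # 0.98 - looks impressive
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0  - no correct fraud predictions
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0  - catches no fraud at all
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0  - F1 exposes the failure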

Q3.2 (13 points)

A spam detection model has the following confusion matrix:

|  | Predicted: Spam | Predicted: Not Spam |
| --- | --- | --- |
| Actual: Spam | 85 | 15 |
| Actual: Not Spam | 10 | 890 |

Calculate:
a) Accuracy
b) Precision (for Spam)
c) Recall (for Spam)
d) F1 Score
e) Which is more important for spam detection: Precision or Recall? Why?

💡 Answer & Solution

Confusion Matrix Values:

  • TP (Spam correctly identified) = 85
  • FN (Spam missed) = 15
  • FP (Not spam marked as spam) = 10
  • TN (Not spam correctly identified) = 890
  • Total = 1000

a) Accuracy: $\text{Accuracy} = \frac{85 + 890}{1000} = \frac{975}{1000} = 97.5\%$

b) Precision: $\text{Precision} = \frac{85}{85 + 10} = \frac{85}{95} = 89.47\%$

c) Recall: $\text{Recall} = \frac{85}{85 + 15} = \frac{85}{100} = 85.0\%$

d) F1 Score: $F1 = 2 \times \frac{0.8947 \times 0.85}{0.8947 + 0.85} = 2 \times \frac{0.7605}{1.7447} = 87.18\%$

e) Precision is MORE important for spam detection

Reasoning:

  • High precision = Few false positives
  • False positive = Important email marked as spam
  • This is WORSE than missing some spam (user might miss critical email!)

Trade-off:

  • Prefer: Some spam in inbox (annoying but visible)
  • Avoid: Important email in spam folder (might never be seen)
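For practice outside the closed-book setting, the arithmetic above can be sanity-checked in a few lines of Python (same counts as the confusion matrix):

# Confusion-matrix counts from the question
TP, FN, FP, TN = 85, 15, 10, 890
total = TP + FN + FP + TN

accuracy  = (TP + TN) / total                              # 0.9750
precision = TP / (TP + FP)                                 # 0.8947
recall    = TP / (TP + FN)                                 # 0.8500
f1        = 2 * precision * recall / (precision + recall)  # 0.8718
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))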

Block 4: Advanced Analytics Concepts (25 points)

Q4.1 (12 points)

Compare Supervised vs Unsupervised learning:

| Aspect | Supervised | Unsupervised |
| --- | --- | --- |
| Definition | ? | ? |
| Data Requirements | ? | ? |
| Example Algorithms | ? | ? |
| Business Use Cases | ? | ? |

💡 Answer & Solution

| Aspect | Supervised | Unsupervised |
| --- | --- | --- |
| Definition | Learning from labeled data (input → known output) | Finding patterns in unlabeled data |
| Data Requirements | Needs labeled training data (expensive to create) | Only needs input data (no labels needed) |
| Example Algorithms | Decision Tree, Random Forest, SVM, Linear Regression, Naive Bayes | K-Means Clustering, Hierarchical Clustering, PCA, Association Rules |
| Business Use Cases | Spam detection, price prediction, customer churn, loan approval | Customer segmentation, market basket analysis, anomaly detection |

Key Difference: Supervised has a "teacher" (labels), unsupervised discovers structure on its own.
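The contrast is easy to see in code. A minimal scikit-learn sketch (synthetic data and hypothetical parameters, purely for illustration): the supervised model is given y, the unsupervised one never sees it.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Synthetic data: X = features, y = labels
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Supervised: the model is trained with the labels y (the "teacher")
clf = DecisionTreeClassifier(random_state=42).fit(X, y)
print(clf.predict(X[:3]))     # predicted target classes

# Unsupervised: the model only sees X and discovers its own groups
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_[:3])         # cluster IDs, not the original labels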


Q4.2 (13 points)

Explain the concept of overfitting:

a) What is overfitting?
b) How can you detect it?
c) List FOUR techniques to prevent overfitting

💡 Answer & Solution

a) What is Overfitting?

The model learns the training data TOO well, including noise and random fluctuations.

  • Performs excellently on training data
  • Performs poorly on new/unseen data
  • Model has memorized rather than learned general patterns

Analogy: Student who memorizes test answers but can't solve new problems.

b) How to Detect Overfitting

  1. Train-Test Gap:

    • Training accuracy = 99%
    • Test accuracy = 70%
    • Large gap = overfitting
  2. Learning Curves:

    • Training error keeps decreasing
    • Validation error increases or plateaus
    • Lines diverge
  3. Cross-Validation:

    • High variance in scores across folds
    • Some folds much worse than others

c) Four Techniques to Prevent Overfitting

  1. Cross-Validation

    • Split data into k folds
    • Train on k-1, test on 1, rotate
    • More reliable performance estimate
  2. Regularization (L1/L2)

    • Adds penalty for complex models
    • L1 (Lasso): Can eliminate features
    • L2 (Ridge): Shrinks coefficients
  3. Early Stopping

    • Monitor validation error during training
    • Stop when validation error starts increasing
    • Prevents over-training
  4. Reduce Model Complexity

    • Fewer features (feature selection)
    • Simpler model (shallower tree)
    • Fewer parameters

Bonus techniques:

  • Dropout (neural networks)
  • More training data
  • Data augmentation
  • Ensemble methods
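Detection point 1 (the train-test gap) and prevention technique 4 (reduced complexity) can both be demonstrated in a short sketch; the dataset and hyperparameters below are synthetic assumptions, not exam material:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data purely for illustration (flip_y adds label noise)
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Unconstrained tree: fits the training set (noise included) almost perfectly
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print('deep:   ', deep.score(X_train, y_train), deep.score(X_test, y_test))      # large train-test gap

# Reduced complexity (shallow tree): lower training score, smaller gap
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print('shallow:', shallow.score(X_train, y_train), shallow.score(X_test, y_test))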

🎁 Bonus: Ethics in Analytics (5 extra points)

A company wants to use ML for hiring decisions. The model was trained on historical hiring data (who was hired and who succeeded).

What ethical concerns should be considered?

💡 Answer & Solution

Ethical Concerns:

  1. Historical Bias Perpetuation

    • If past hiring was biased (e.g., fewer women in tech), model learns this
    • Model will discriminate against groups historically underrepresented
    • "The model is only as fair as its training data"
  2. Proxy Discrimination

    • Even without protected attributes (race, gender), proxies exist
    • Zip code correlates with race
    • Name style correlates with ethnicity
    • Model may discriminate indirectly
  3. Lack of Transparency

    • Candidates can't understand why they were rejected
    • "Black box" decisions are hard to appeal
    • Legal requirement for explainable decisions
  4. Feedback Loop

    • Only hired people have success data
    • Rejected candidates might have succeeded
    • Model never learns from its mistakes

Recommendations:

  • Audit model for bias regularly
  • Use diverse training data
  • Ensure human oversight
  • Allow candidates to appeal/explain
  • Be transparent about AI use

🏁 End of Exam

| Block | Topic | Points |
| --- | --- | --- |
| Block 1 | Analytics Strategy | 25 |
| Block 2 | Data Quality | 25 |
| Block 3 | Model Evaluation | 25 |
| Block 4 | Advanced Concepts | 25 |
| Total |  | 100 |
| Bonus | Ethics | +5 |

📝 Key Formulas Reference

| Metric | Formula |
| --- | --- |
| Accuracy | (TP + TN) / Total |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) |
| Specificity | TN / (TN + FP) |

Confusion Matrix:

                Predicted
              Pos    Neg
Actual Pos    TP     FN
       Neg    FP     TN

Show all working for partial credit. Good luck!