The Data Science Methodology provides a step-by-step framework for solving real-world problems using data. Developed by John Rollins at IBM Analytics, this approach ensures that each project is structured, repeatable, and goal-oriented.
It is especially useful for students developing Capstone Projects, guiding them from problem identification to deploying a predictive model and gathering feedback.
🔷 Overview of the 10 Steps in Data Science Methodology
These ten steps are grouped into five logical stages, each of which is iterative and may involve revisiting earlier stages for refinement.
🔹 1. From Problem to Approach
🔷 1.1 Business Understanding
This is the foundational step in any data science project.
It involves working closely with stakeholders to define what problem needs solving and why it matters.
Tools such as Design Thinking (DT) and the 5W1H framework (What, Why, When, Where, Who, and How) are used to dig deeper into the context of the problem.
Example: A food delivery company wants to reduce delivery time. The data scientist must understand the business goal: improving customer satisfaction through faster deliveries.
🔷 1.2 Analytic Approach
This step involves deciding the type of analytics to use based on the problem:
Descriptive Analytics – What happened?
Diagnostic Analytics – Why did it happen?
Predictive Analytics – What will happen?
Prescriptive Analytics – What should be done?
Selecting the right AI technique (a brief code sketch follows this list):
Classification: Used when predicting a category (e.g., spam or not spam).
Regression: Used when predicting a numeric value (e.g., house price).
Clustering: Grouping similar items (e.g., customer segments).
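To make the mapping concrete, here is a minimal, hypothetical sketch (assuming scikit-learn and made-up toy data, not examples from the curriculum) of how each technique corresponds to an estimator:

```python
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

X = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]  # toy feature matrix

# Classification: predict a category (e.g., spam vs. not spam)
clf = LogisticRegression().fit(X, [0, 0, 1, 1])

# Regression: predict a numeric value (e.g., house price)
reg = LinearRegression().fit(X, [10.0, 11.0, 50.0, 52.0])

# Clustering: group similar items (e.g., customer segments); no labels needed
km = KMeans(n_clusters=2, n_init=10).fit(X)

print(clf.predict([[1.5, 1.5]]), reg.predict([[8.5, 8.5]]), km.labels_)
```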
🔹 2. From Requirements to Collection
🔷 2.1 Data Requirements
🔹 Purpose:
Before collecting any data, a data scientist must clearly define what data is required to address the problem identified in the Business Understanding phase. This includes determining the nature, type, format, quantity, and sources of the data to be used.
🔹 Key Considerations:
✅ What type of data do we need?
Numerical Data: e.g., prices, age, temperature
Categorical Data: e.g., gender, city, product type
Textual Data: e.g., reviews, tweets, comments
Multimedia Data: e.g., images, audio, video (for advanced AI tasks)
✅ What format should the data be in?
CSV: Simple and widely used for tabular data
JSON/XML: Common for web data and APIs
Excel: Often used for business reports
SQL Databases: Used for structured enterprise data
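As a quick illustration, here is a hedged pandas sketch for loading each of these formats; the file names and the SQLite database are hypothetical placeholders:

```python
import pandas as pd
import sqlite3

df_csv = pd.read_csv("sales.csv")        # CSV: simple tabular data
df_json = pd.read_json("records.json")   # JSON: web/API data
df_xlsx = pd.read_excel("report.xlsx")   # Excel: business reports
conn = sqlite3.connect("company.db")     # SQL: structured enterprise data
df_sql = pd.read_sql("SELECT * FROM orders", conn)
```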
✅ From where will the data be sourced?
Primary Data Sources (firsthand data)
Surveys
Observations
Sensors/IoT devices
Interviews
Secondary Data Sources (already available)
Government or public datasets (e.g., data.gov)
Open data platforms (e.g., Kaggle, UCI ML Repository)
Organizational databases
✅ How much data is enough?
Consider sample size, data completeness, and distribution.
Assess whether the data will allow reliable model training and validation.
✅ What about data quality?
Assess whether the data is accurate, up-to-date, unbiased, and relevant to the problem.
🔹 Types of Data (based on structure):

| Data Type | Description | Examples |
| --- | --- | --- |
| Structured Data | Organized in rows and columns | Excel sheets, SQL databases |
| Semi-structured Data | Has tags or markers, but no fixed schema | JSON, XML |
| Unstructured Data | No predefined structure | Images, videos, social media posts |
🔷 2.2 Data Collection
🔹 Purpose:
This step involves the actual gathering of data according to the requirements defined earlier. The collected data becomes the foundation for all subsequent modeling and evaluation.
🔹 Sources of Data:
✅ Primary Data Sources:
Data collected directly by the team
Examples:
Conducting a survey on student learning habits
Collecting sensor data from a fitness device
Performing interviews or observations
✅ Secondary Data Sources:
Pre-existing data collected by others
Examples:
Datasets on Kaggle (e.g., Titanic survival data, housing prices)
Public health data from WHO or UNICEF
Social media data using APIs (e.g., Twitter API)
Government portals like data.gov.in
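A sketch of pulling a secondary dataset directly into pandas; the URL is a hypothetical placeholder, not a real portal endpoint:

```python
import pandas as pd

url = "https://example.com/open-data/student-records.csv"  # hypothetical
students = pd.read_csv(url)
print(students.shape)   # quick check: rows and columns received
print(students.head())  # preview the first few records
```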
🔹 Challenges in Data Collection:
Accessibility issues: Data behind paywalls or protected APIs
Privacy concerns: Especially for personal or sensitive data
Data inconsistency: Varying formats across sources
Incomplete or outdated data: May not reflect current realities
🔹 Tips for Effective Collection:
Ensure data collection methods align with project timelines.
Validate the authenticity and credibility of secondary sources.
Plan for potential gaps and be ready to revisit the requirements phase if needed.
✏️ Example Use Case:
Problem: A school wants to predict which students are likely to drop out.
Pull historical data from school management software (structured)
Use teacher surveys for recent behavioral observations (primary source)
Download education-related datasets from government portals (secondary source)
🔹 3. From Understanding to Preparation
This phase is critical because it bridges the gap between raw data and a format ready for analysis and modeling. Even if a dataset has been collected, it must be thoroughly examined and cleaned to ensure that the insights derived and the models built from it are accurate and reliable.
🔷 3.1 Data Understanding
🔹 Purpose:
To deeply explore and assess the collected data to determine its structure, quality, and relevance to the problem at hand.
🔹 Key Activities:
✅ Exploratory Data Analysis (EDA):
Use statistical summaries to understand the distribution of each feature (mean, median, standard deviation).
Visual tools like the following (a short plotting sketch appears after this list):
Histograms: To check the distribution of continuous variables.
Box plots: To identify outliers.
Scatter plots: To analyze relationships between variables.
Correlation heatmaps: To spot multicollinearity.
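A minimal plotting sketch with made-up data (assuming pandas, matplotlib, and seaborn) covering the four visual tools above:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up toy data standing in for a real dataset
df = pd.DataFrame({"area":  [500, 750, 1000, 1250, 1500, 5000],
                   "price": [50, 72, 98, 120, 150, 600]})

df["price"].hist(bins=10)             # histogram: distribution
plt.show()
sns.boxplot(x=df["price"])            # box plot: outliers
plt.show()
plt.scatter(df["area"], df["price"])  # scatter plot: relationship
plt.show()
sns.heatmap(df.corr(), annot=True)    # correlation heatmap
plt.show()
```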
✅ Data Profiling:
Check data types: Are columns properly classified as integers, floats, strings, etc.?
Identify missing values and how frequently they occur.
Detect inconsistencies like unexpected text in numeric columns.
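A short profiling sketch, using a made-up DataFrame that exhibits each issue above:

```python
import pandas as pd

# Toy data with a wrong type, missing values, and stray text in a numeric column
df = pd.DataFrame({"age":  ["21", "22", "abc", None],
                   "city": ["Delhi", "Delhi", None, "Mumbai"]})

print(df.dtypes)        # check data types of each column
print(df.isna().sum())  # count missing values per column

# Detect unexpected text hiding in a numeric column:
df["age_num"] = pd.to_numeric(df["age"], errors="coerce")
print(df[df["age_num"].isna() & df["age"].notna()])  # rows like "abc"
```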
✅ Relevance Analysis:
Determine whether the data is relevant to the business objectives.
Identify features that could serve as potential predictors or targets.
If gaps are identified, this stage may loop back to the Data Collection phase.
🔹 Common Issues Discovered in This Phase:
Missing or incomplete data
Duplicate records
Outliers that may skew the analysis
Incorrect formats (e.g., dates stored as text)
📌 Example:
If analyzing sales data, you might find:
Missing entries for certain product categories
Sales recorded in different currencies
Product IDs entered inconsistently
Understanding these issues early allows for targeted fixes before modeling begins.
🔷 3.2 Data Preparation
🔹 Purpose:
To transform raw data into a clean and structured format suitable for analysis and modeling. This phase is often referred to as data wrangling or data preprocessing.
🔹 Key Steps in Data Preparation:
✅ Data Cleaning
Handling missing data:
Imputation using mean/median/mode
Dropping missing entries (if justified)
Removing duplicates
Correcting errors:
Fix typos in categorical values (e.g., “Delhii” to “Delhi”)
Convert data to correct formats (e.g., strings to dates)
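A minimal cleaning sketch over made-up data, illustrating each step above:

```python
import pandas as pd

# Toy data exhibiting the issues above (values are made up)
df = pd.DataFrame({"city":  ["Delhi", "Delhii", "Delhi", "Delhii"],
                   "sales": [100.0, None, 100.0, 250.0],
                   "date":  ["2024-01-05", "2024-01-06",
                             "2024-01-05", "2024-01-07"]})

df["sales"] = df["sales"].fillna(df["sales"].median())  # impute with median
df = df.drop_duplicates()                               # remove duplicates
df["city"] = df["city"].replace({"Delhii": "Delhi"})    # fix typos
df["date"] = pd.to_datetime(df["date"])                 # strings -> dates
```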
✅ Feature Engineering
Creating new variables from existing data that might be more meaningful.
Examples:
From “Date of Purchase” → extract “Day of Week”, “Month”, “Season”
From “Price” and “Area” → derive “Price per Square Foot”
Useful for improving model accuracy and performance.
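A hedged sketch of the two example derivations above, with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"date_of_purchase": pd.to_datetime(["2024-01-05",
                                                       "2024-06-20"]),
                   "price": [500000.0, 750000.0],   # made-up values
                   "area":  [1000.0, 1200.0]})

df["day_of_week"] = df["date_of_purchase"].dt.day_name()  # e.g., "Friday"
df["month"] = df["date_of_purchase"].dt.month
df["price_per_sqft"] = df["price"] / df["area"]           # derived feature
```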
✅ Data Transformation
Scaling and Normalization:
Standardize variables for algorithms sensitive to magnitude (e.g., SVM, KNN)
Methods: Min-Max Scaling, Z-score Normalization
Encoding categorical variables:
Label Encoding: Assign numeric values to categories (e.g., Male = 0, Female = 1)
One-Hot Encoding: Create binary columns for each category
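A brief sketch of these transformations (assuming scikit-learn and pandas; the values are made up):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

df = pd.DataFrame({"income": [30000.0, 60000.0, 90000.0],
                   "city":   ["Delhi", "Mumbai", "Delhi"]})

# Scaling / normalization
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
df["income_z"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Encoding categorical variables
df["city_label"] = LabelEncoder().fit_transform(df["city"])  # label encoding
one_hot = pd.get_dummies(df["city"], prefix="city")          # one-hot encoding
```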
✅ Data Integration
Combining multiple datasets:
Merging sales data with customer demographics
Joining geographic data with sensor readings
Ensures a comprehensive dataset for analysis
✅ Data Reduction (if necessary)
Eliminate irrelevant features or reduce dimensionality using techniques like PCA (Principal Component Analysis)
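A small sketch combining integration and reduction on toy data (the tables and features are hypothetical):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Integration: merge sales data with customer demographics (toy data)
sales = pd.DataFrame({"customer_id": [1, 2], "amount": [250.0, 400.0]})
demo = pd.DataFrame({"customer_id": [1, 2], "age": [34, 45]})
merged = sales.merge(demo, on="customer_id")

# Reduction: project numeric features onto one principal component
X_reduced = PCA(n_components=1).fit_transform(merged[["amount", "age"]])
```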
🔹 Why This Step is Crucial:
“Garbage in, garbage out” — If the input data is flawed, the model outcomes will also be flawed.
Well-prepared data leads to better model performance.
Reduces the risk of overfitting or underfitting.
Makes the model more interpretable and robust.
🛠️ Tools Commonly Used in This Phase:
Python Libraries:
pandas for manipulation
numpy for numerical operations
matplotlib/seaborn for visualization
Jupyter Notebook / Google Colab for interactive data exploration
📌 Example Use Case:
Project Goal: Predict student dropout risk.
Data Understanding:
Explore attendance data, grades, and engagement metrics.
Identify missing values in performance records.
Data Preparation:
Impute missing attendance with average values.
Encode engagement levels as High/Medium/Low.
Create a new feature: “Average Grade per Term.”
🔹 4. From Modeling to Evaluation
This stage focuses on the core of data science work—building, training, and evaluating machine learning models. It’s where insights and patterns hidden in data are translated into actionable predictions or classifications. These models form the basis for decision-making in real-world applications.
🔷 4.1 AI Modeling
🔹 Purpose:
To build a machine learning model that can learn from data and make predictions or classifications. This model acts as the solution to the business problem identified earlier.
🔹 Types of Modeling Approaches:
✅ Supervised Learning
The model is trained on a labeled dataset, meaning the input data is paired with the correct output.
Used for:
Classification: Predicting categories (e.g., spam vs. non-spam)
Regression: Predicting continuous values (e.g., house prices)
✅ Unsupervised Learning
The model finds hidden patterns or structures in unlabeled data.
Used for:
Clustering: Grouping similar items (e.g., customer segmentation)
Dimensionality Reduction: Simplifying datasets while retaining structure
✅ Other Approaches
Semi-supervised Learning: Combination of labeled and unlabeled data
Reinforcement Learning: Learning through trial and error (e.g., robotics, game AI)
🔹 Model Selection
Choosing the right algorithm depends on:
Nature of the problem (classification, regression, etc.)
Size and quality of the data
Computational efficiency
Interpretability requirements
| Problem Type | Common Algorithms |
| --- | --- |
| Classification | Logistic Regression, Decision Tree, Random Forest, SVM, KNN |
| Regression | Linear Regression, Decision Tree Regression, SVR |
| Clustering | K-Means, DBSCAN, Hierarchical Clustering |
| Recommendation | Collaborative Filtering, Content-Based Filtering |
| Anomaly Detection | Isolation Forest, One-Class SVM |
🔹 Model Building Workflow (sketched in code after these steps):
Split the data into training and testing sets (or use cross-validation).
Train the model on the training data using selected algorithm(s).
Tune hyperparameters to optimize performance (e.g., tree depth, learning rate).
Validate using unseen data to check generalization.
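The four steps above, sketched end-to-end on synthetic data (assuming scikit-learn; the hyperparameter grid is illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)  # synthetic data

# 1. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2-3. Train and tune hyperparameters (e.g., tree depth) via grid search
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid={"max_depth": [3, 5, None]}, cv=5)
search.fit(X_train, y_train)

# 4. Validate on unseen data to check generalization
print("test accuracy:", search.score(X_test, y_test))
```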
🔹 Model Output:
Trained model capable of:
Predicting future outcomes
Classifying new inputs
Grouping or ranking data points
📌 Example Use Case:
Goal: Predict customer churn for a telecom company.
Modeling Choices:
Use Logistic Regression or Random Forest for classification
Features: Usage time, call drop rate, complaints, payment history
Output: Probability that a customer will churn (Yes/No)
🔷 4.2 Evaluation
🔹 Purpose:
To assess how well the trained model performs on unseen data. This ensures that the model is accurate, reliable, and generalizes well.
🔹 Evaluation Strategies:
✅ Train-Test Split
The dataset is divided into:
Training Set: Used to train the model (e.g., 70–80%)
Testing Set: Used to evaluate the model’s performance on new data (e.g., 20–30%)
✅ Cross-Validation (e.g., K-Fold)
The data is split into k equal folds.
Model is trained on k-1 folds and tested on the remaining fold.
This is repeated k times, each time with a different fold used as the test set.
Final score is the average performance across all runs.
Helps in reducing bias and gives a more stable estimate.
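A minimal 5-fold cross-validation sketch on synthetic data (assuming scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)  # synthetic data

# 5-fold CV: train on 4 folds, test on the 5th, repeat with each fold held out
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # final score = average across all folds
```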
🔹 Evaluation Metrics:
📌 For Classification Models:
| Metric | Description |
| --- | --- |
| Accuracy | Ratio of correct predictions to total predictions. Effective only with balanced datasets. |
| Precision | Proportion of positive identifications that were actually correct. Useful in cases like spam detection. |
| Recall | Proportion of actual positives that were correctly identified. Important in medical diagnoses. |
| F1-Score | Harmonic mean of precision and recall. Best when you need a balance between the two. |
| Confusion Matrix | A table showing TP, TN, FP, FN to visualize prediction outcomes. |
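How these metrics are computed in code, sketched with made-up labels (assuming scikit-learn):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # made-up model predictions

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
```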
📌 For Regression Models:
| Metric | Description |
| --- | --- |
| MAE (Mean Absolute Error) | Average absolute difference between predicted and actual values. |
| MSE (Mean Squared Error) | Average of squared differences. Penalizes larger errors more. |
| RMSE (Root Mean Squared Error) | Square root of MSE. Brings error back to the original unit scale. |
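The same idea for regression, sketched with made-up values (assuming scikit-learn and numpy):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]  # made-up actual values
y_pred = [2.5, 5.0, 4.0, 8.0]  # made-up predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # back in the target's original units
print(mae, mse, rmse)
```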
🔹 Model Validation Example:
Scenario: Evaluating a fraud detection system
Accuracy alone may be misleading due to class imbalance.
Focus instead on:
Recall: Are we catching most fraud cases?
Precision: Are flagged transactions actually fraud?
🔹 Iterative Improvement:
After evaluation:
If performance is poor: Revisit feature engineering, try a different model, or tune hyperparameters.
If performance is acceptable: Proceed to deployment and monitor in the real world.
🛠️ Tools for Modeling and Evaluation:
Python libraries: scikit-learn, xgboost, lightgbm, keras
Visualization: matplotlib, seaborn, plotly
Model tuning: GridSearchCV, RandomizedSearchCV
🔹 5. From Deployment to Feedback
This phase marks the transition from building and evaluating a model to actually using it in the real world. It’s where data science moves from theory to practice. A well-performing model is only useful if it can be deployed effectively and continuously improved through user and system feedback.
🔷 5.1 Deployment
🔹 Purpose:
To operationalize the model—that is, make it available for use in real-world applications such as business systems, websites, mobile apps, or embedded systems.
🔹 Deployment Formats:
✅ Web Applications
Model integrated into websites using back-end servers or cloud platforms.
Example: A recommendation engine for an e-commerce website.
✅ Mobile Applications
AI models embedded within apps using platforms like Thunkable or MIT App Inventor.
Example: A diet tracker app suggesting food based on user input.
✅ APIs (Application Programming Interfaces)
The model is hosted on a server, and other applications can send data to it via API calls.
Example: A chatbot service accessing an NLP model via REST API (a minimal serving sketch follows this list).
✅ Batch Processing
For models that work with bulk data (e.g., daily sales forecasting), deployment may involve scheduled scripts that process data periodically.
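As one hedged illustration of the API format above, a minimal Flask sketch; the model file "model.pkl" and the expected input shape are hypothetical assumptions:

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:  # a previously trained, pickled model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Client sends JSON such as {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```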
🔹 Key Considerations During Deployment:
Scalability: Can the model handle increasing amounts of users or data?
Performance: Does it provide fast and accurate predictions?
Security & Privacy: Are data privacy regulations (like GDPR) being followed?
Resource Optimization: Is the deployment cost-effective and efficient?
Accessibility: Is the model accessible to the intended users (via apps, dashboards, etc.)?
🔹 Types of Deployment Tools:
| Tool/Platform | Use Case |
| --- | --- |
| Thunkable | Deploying models into mobile apps without coding |
| Weebly/Wix | Embedding models into websites |
| Flask/Django | Hosting models as web services using Python |
| AWS, Azure, GCP | Scalable cloud-based deployment |
| Heroku | Easy-to-use platform-as-a-service for smaller applications |
📌 Example Use Case:
Problem: Predicting the probability of student dropout.
Deployment: Model integrated into a school dashboard. Teachers input student metrics and receive real-time predictions of dropout risk, allowing for timely interventions.
🔷 5.2 Feedback
🔹 Purpose:
Once the model is deployed, it’s essential to monitor how well it performs in a real-world environment. Feedback allows for continuous learning, performance tuning, and user adaptation.
🔹 Sources of Feedback:
✅ User Feedback
Direct input from users or stakeholders on how useful and accurate the model’s outputs are.
Example: Users flagging incorrect recommendations or predictions.
✅ System Monitoring
Track performance metrics over time to detect:
Model drift: When data patterns change, reducing model accuracy.
Latency issues: Slow predictions or server errors.
Use logs, dashboards, and alerts to monitor model behavior.
✅ Performance Metrics Over Time
Compare live performance against initial evaluation.
Detect drop in accuracy, precision, or recall, signaling the need for retraining.
🔹 Actions Taken Based on Feedback:
Model Retraining
Retrain the model with new data to account for evolving trends.
Example: Updating a sales forecasting model after a new product is launched.
Feature Re-engineering
Introduce or remove features based on how the model is behaving in the real world.
Interface Updates
Improve usability based on user feedback (e.g., clearer output labels, easier input methods).
Version Control
Maintain multiple versions of models and switch based on performance.
🔹 Automated Feedback Loop
In advanced systems, feedback is collected automatically and used to retrain the model periodically.
This creates a self-improving system that evolves with new data.
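A deliberately simplified sketch of such a loop (the frames, columns, and model choice are all hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Historical training data plus newly collected feedback (made-up values)
historical = pd.DataFrame({"x": [1.0, 2.0, 8.0, 9.0], "y": [0, 0, 1, 1]})
new_feedback = pd.DataFrame({"x": [3.0, 7.0], "y": [0, 1]})

# Periodically retrain on the combined data so the model evolves
combined = pd.concat([historical, new_feedback], ignore_index=True)
model = LogisticRegression().fit(combined[["x"]], combined["y"])
```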
📌 Example Use Case:
Deployed Model: A health app that predicts the risk of diabetes.
Feedback Loop: As users track their diet and physical activity, new data is used to improve prediction accuracy. Alerts are also refined based on user input.
🔷 Model Validation
Purpose:
To ensure the model’s predictions are generalizable to new, unseen data.
Techniques:
1. Train-Test Split
Common approach where:
70–80% of data is used for training
20–30% is used for testing
Helps evaluate performance before real deployment.
2. K-Fold Cross-Validation
Data is split into k subsets (folds).
Model is trained on k-1 folds and tested on the remaining one.
Repeated k times, with a different fold used as the test each time.
Final performance is the average across all k runs.
🔷 Evaluation Metrics in Depth
For Classification Models:
Confusion Matrix: Matrix of predicted vs actual outcomes.
TP: Correctly predicted positive
TN: Correctly predicted negative
FP: Incorrectly predicted positive
FN: Incorrectly predicted negative
Precision = TP / (TP + FP) – how many predicted positives were correct.
Recall = TP / (TP + FN) – how many actual positives were identified.
F1-Score = Harmonic mean of precision and recall – balances the two.
Accuracy = (TP + TN) / Total predictions.
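For instance, with made-up counts TP = 40, FP = 10, FN = 20, and TN = 30 (100 predictions in total): Precision = 40 / (40 + 10) = 0.80; Recall = 40 / (40 + 20) ≈ 0.67; F1 ≈ 2 × (0.80 × 0.67) / (0.80 + 0.67) ≈ 0.73; Accuracy = (40 + 30) / 100 = 0.70.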
For Regression Models:
MAE: Average magnitude of errors in predictions.
MSE: Average of squared errors – penalizes larger errors more.
RMSE: Square root of MSE – in the same unit as the target variable.
🔷 Application to Capstone Projects
Students apply this methodology to:
Frame a real-world problem
Gather and analyze data
Build and test models
Deploy a solution
Evaluate and iterate
This approach ensures project relevance, structure, and practical value.
🔷 Case Examples from the Curriculum
| Scenario | Key Concept Applied |
| --- | --- |
| Investment firm improving user experience | Business Understanding |
| Course recommendation system | Evaluation |
| Predicting sales using marketing spend | Regression Modeling |
| Churn prediction for small startups | Cross-Validation |
| Predicting equipment failure | Predictive Analytics |
| Delivery optimization | Prescriptive Analytics |
🔷 Tools & Platforms for Implementation
Languages & Libraries: Python (pandas, sklearn, matplotlib)