The Data Science Methodology provides a step-by-step framework for solving real-world problems using data. Developed by John Rollins at IBM Analytics, this approach ensures that each project is structured, repeatable, and goal-oriented.
It is especially useful for students developing Capstone Projects, guiding them from problem identification to deploying a predictive model and gathering feedback.
🔷 Overview of the 10 Steps in Data Science Methodology
These ten steps are grouped into five logical stages, each of which is iterative and may involve revisiting earlier stages for refinement.
🔹 1. From Problem to Approach
🔷 1.1 Business Understanding
This is the foundational step in any data science project.
It involves working closely with stakeholders to define what problem needs solving and why it matters.
Tools such as Design Thinking (DT) and the 5W1H framework (What, Why, When, Where, Who, and How) are used to dig deeper into the context of the problem.
Example: A food delivery company wants to reduce delivery time. The data scientist must understand the business goal: improving customer satisfaction through faster deliveries.
🔷 1.2 Analytic Approach
This step involves deciding the type of analytics to use based on the problem:
Descriptive Analytics – What happened?
Diagnostic Analytics – Why did it happen?
Predictive Analytics – What will happen?
Prescriptive Analytics – What should be done?
Selecting the right AI technique (a brief code sketch follows this list):
Classification: Used when predicting a category (e.g., spam or not spam).
Regression: Used when predicting a numeric value (e.g., house price).
Clustering: Grouping similar items (e.g., customer segments).
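To make the mapping concrete, here is a minimal, hypothetical sketch (assuming scikit-learn and made-up toy data, not examples from the curriculum) of how each technique corresponds to an estimator:

```python
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

X = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]  # toy feature matrix

# Classification: predict a category (e.g., spam vs. not spam)
clf = LogisticRegression().fit(X, [0, 0, 1, 1])

# Regression: predict a numeric value (e.g., house price)
reg = LinearRegression().fit(X, [10.0, 11.0, 50.0, 52.0])

# Clustering: group similar items (e.g., customer segments); no labels needed
km = KMeans(n_clusters=2, n_init=10).fit(X)

print(clf.predict([[1.5, 1.5]]), reg.predict([[8.5, 8.5]]), km.labels_)
```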
🔹 2. From Requirements to Collection
🔷 2.1 Data Requirements
🔹 Purpose:
Before collecting any data, a data scientist must clearly define what data is required to address the problem identified in the Business Understanding phase. This includes determining the nature, type, format, quantity, and sources of the data to be used.
🔹 Key Considerations:
✅ What type of data do we need?
Numerical Data: e.g., prices, age, temperature
Categorical Data: e.g., gender, city, product type
Textual Data: e.g., reviews, tweets, comments
Multimedia Data: e.g., images, audio, video (for advanced AI tasks)
✅ What format should the data be in?
CSV: Simple and widely used for tabular data
JSON/XML: Common for web data and APIs
Excel: Often used for business reports
SQL Databases: Used for structured enterprise data
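As a quick illustration, here is a hedged pandas sketch for loading each of these formats; the file names and the SQLite database are hypothetical placeholders:

```python
import pandas as pd
import sqlite3

df_csv = pd.read_csv("sales.csv")        # CSV: simple tabular data
df_json = pd.read_json("records.json")   # JSON: web/API data
df_xlsx = pd.read_excel("report.xlsx")   # Excel: business reports
conn = sqlite3.connect("company.db")     # SQL: structured enterprise data
df_sql = pd.read_sql("SELECT * FROM orders", conn)
```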
✅ From where will the data be sourced?
Primary Data Sources (firsthand data)
Surveys
Observations
Sensors/IoT devices
Interviews
Secondary Data Sources (already available)
Government or public datasets (e.g., data.gov)
Open data platforms (e.g., Kaggle, UCI ML Repository)
Organizational databases
✅ How much data is enough?
Consider sample size, data completeness, and distribution.
Assess whether the data will allow reliable model training and validation.
✅ What about data quality?
Assess whether the data is accurate, up-to-date, unbiased, and relevant to the problem.
🔹 Types of Data (based on structure):

| Data Type | Description | Examples |
| --- | --- | --- |
| Structured Data | Organized in rows and columns | Excel sheets, SQL databases |
| Semi-structured Data | Has tags or markers, but no fixed schema | JSON, XML |
| Unstructured Data | No predefined structure | Images, videos, social media posts |
🔷 2.2 Data Collection
🔹 Purpose:
This step involves the actual gathering of data according to the requirements defined earlier. The collected data becomes the foundation for all subsequent modeling and evaluation.
🔹 Sources of Data:
✅ Primary Data Sources:
Data collected directly by the team
Examples:
Conducting a survey on student learning habits
Collecting sensor data from a fitness device
Performing interviews or observations
✅ Secondary Data Sources:
Pre-existing data collected by others
Examples:
Datasets on Kaggle (e.g., Titanic survival data, housing prices)
Public health data from WHO or UNICEF
Social media data using APIs (e.g., Twitter API)
Government portals like data.gov.in
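A sketch of pulling a secondary dataset directly into pandas; the URL is a hypothetical placeholder, not a real portal endpoint:

```python
import pandas as pd

url = "https://example.com/open-data/student-records.csv"  # hypothetical
students = pd.read_csv(url)
print(students.shape)   # quick check: rows and columns received
print(students.head())  # preview the first few records
```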
🔹 Challenges in Data Collection:
Accessibility issues: Data behind paywalls or protected APIs
Privacy concerns: Especially for personal or sensitive data
Data inconsistency: Varying formats across sources
Incomplete or outdated data: May not reflect current realities
🔹 Tips for Effective Collection:
Ensure data collection methods align with project timelines.
Validate the authenticity and credibility of secondary sources.
Plan for potential gaps and be ready to revisit the requirements phase if needed.
✏️ Example Use Case:
Problem: A school wants to predict which students are likely to drop out.
Pull historical data from school management software (structured)
Use teacher surveys for recent behavioral observations (primary source)
Download education-related datasets from government portals (secondary source)
🔹 3. From Understanding to Preparation
This phase is critical because it bridges the gap between raw data and a format ready for analysis and modeling. Even if a dataset has been collected, it must be thoroughly examined and cleaned to ensure that the insights derived and the models built from it are accurate and reliable.
🔷 3.1 Data Understanding
🔹 Purpose:
To deeply explore and assess the collected data to determine its structure, quality, and relevance to the problem at hand.
🔹 Key Activities:
✅ Exploratory Data Analysis (EDA):
Use statistical summaries to understand the distribution of each feature (mean, median, standard deviation).
Visual tools like the following (a short plotting sketch appears after this list):
Histograms: To check the distribution of continuous variables.
Box plots: To identify outliers.
Scatter plots: To analyze relationships between variables.
Correlation heatmaps: To spot multicollinearity.
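A minimal plotting sketch with made-up data (assuming pandas, matplotlib, and seaborn) covering the four visual tools above:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up toy data standing in for a real dataset
df = pd.DataFrame({"area":  [500, 750, 1000, 1250, 1500, 5000],
                   "price": [50, 72, 98, 120, 150, 600]})

df["price"].hist(bins=10)             # histogram: distribution
plt.show()
sns.boxplot(x=df["price"])            # box plot: outliers
plt.show()
plt.scatter(df["area"], df["price"])  # scatter plot: relationship
plt.show()
sns.heatmap(df.corr(), annot=True)    # correlation heatmap
plt.show()
```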
✅ Data Profiling:
Check data types: Are columns properly classified as integers, floats, strings, etc.?
Identify missing values and how frequently they occur.
Detect inconsistencies like unexpected text in numeric columns.
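A short profiling sketch, using a made-up DataFrame that exhibits each issue above:

```python
import pandas as pd

# Toy data with a wrong type, missing values, and stray text in a numeric column
df = pd.DataFrame({"age":  ["21", "22", "abc", None],
                   "city": ["Delhi", "Delhi", None, "Mumbai"]})

print(df.dtypes)        # check data types of each column
print(df.isna().sum())  # count missing values per column

# Detect unexpected text hiding in a numeric column:
df["age_num"] = pd.to_numeric(df["age"], errors="coerce")
print(df[df["age_num"].isna() & df["age"].notna()])  # rows like "abc"
```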
✅ Relevance Analysis:
Determine whether the data is relevant to the business objectives.
Identify features that could serve as potential predictors or targets.
If gaps are identified, this stage may loop back to the Data Collection phase.
🔹 Common Issues Discovered in This Phase:
Missing or incomplete data
Duplicate records
Outliers that may skew the analysis
Incorrect formats (e.g., dates stored as text)
📌 Example:
If analyzing sales data, you might find:
Missing entries for certain product categories
Sales recorded in different currencies
Product IDs entered inconsistently
Understanding these issues early allows for targeted fixes before modeling begins.
🔷 3.2 Data Preparation
🔹 Purpose:
To transform raw data into a clean and structured format suitable for analysis and modeling. This phase is often referred to as data wrangling or data preprocessing.
🔹 Key Steps in Data Preparation:
✅ Data Cleaning
Handling missing data:
Imputation using mean/median/mode
Dropping missing entries (if justified)
Removing duplicates
Correcting errors:
Fix typos in categorical values (e.g., “Delhii” to “Delhi”)
Convert data to correct formats (e.g., strings to dates)
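A minimal cleaning sketch over made-up data, illustrating each step above:

```python
import pandas as pd

# Toy data exhibiting the issues above (values are made up)
df = pd.DataFrame({"city":  ["Delhi", "Delhii", "Delhi", "Delhii"],
                   "sales": [100.0, None, 100.0, 250.0],
                   "date":  ["2024-01-05", "2024-01-06",
                             "2024-01-05", "2024-01-07"]})

df["sales"] = df["sales"].fillna(df["sales"].median())  # impute with median
df = df.drop_duplicates()                               # remove duplicates
df["city"] = df["city"].replace({"Delhii": "Delhi"})    # fix typos
df["date"] = pd.to_datetime(df["date"])                 # strings -> dates
```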
✅ Feature Engineering
Creating new variables from existing data that might be more meaningful.
Examples:
From “Date of Purchase” → extract “Day of Week”, “Month”, “Season”
From “Price” and “Area” → derive “Price per Square Foot”
Useful for improving model accuracy and performance.
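A hedged sketch of the two example derivations above, with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"date_of_purchase": pd.to_datetime(["2024-01-05",
                                                       "2024-06-20"]),
                   "price": [500000.0, 750000.0],   # made-up values
                   "area":  [1000.0, 1200.0]})

df["day_of_week"] = df["date_of_purchase"].dt.day_name()  # e.g., "Friday"
df["month"] = df["date_of_purchase"].dt.month
df["price_per_sqft"] = df["price"] / df["area"]           # derived feature
```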
✅ Data Transformation
Scaling and Normalization:
Standardize variables for algorithms sensitive to magnitude (e.g., SVM, KNN)
Methods: Min-Max Scaling, Z-score Normalization
Encoding categorical variables:
Label Encoding: Assign numeric values to categories (e.g., Male = 0, Female = 1)
One-Hot Encoding: Create binary columns for each category
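A brief sketch of these transformations (assuming scikit-learn and pandas; the values are made up):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

df = pd.DataFrame({"income": [30000.0, 60000.0, 90000.0],
                   "city":   ["Delhi", "Mumbai", "Delhi"]})

# Scaling / normalization
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
df["income_z"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Encoding categorical variables
df["city_label"] = LabelEncoder().fit_transform(df["city"])  # label encoding
one_hot = pd.get_dummies(df["city"], prefix="city")          # one-hot encoding
```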
✅ Data Integration
Combining multiple datasets:
Merging sales data with customer demographics
Joining geographic data with sensor readings
Ensures a comprehensive dataset for analysis
✅ Data Reduction (if necessary)
Eliminate irrelevant features or reduce dimensionality using techniques like PCA (Principal Component Analysis)
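A small sketch combining integration and reduction on toy data (the tables and features are hypothetical):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Integration: merge sales data with customer demographics (toy data)
sales = pd.DataFrame({"customer_id": [1, 2], "amount": [250.0, 400.0]})
demo = pd.DataFrame({"customer_id": [1, 2], "age": [34, 45]})
merged = sales.merge(demo, on="customer_id")

# Reduction: project numeric features onto one principal component
X_reduced = PCA(n_components=1).fit_transform(merged[["amount", "age"]])
```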
🔹 Why This Step is Crucial:
“Garbage in, garbage out” — If the input data is flawed, the model outcomes will also be flawed.
Well-prepared data leads to better model performance.
Reduces the risk of overfitting or underfitting.
Makes the model more interpretable and robust.
🛠️ Tools Commonly Used in This Phase:
Python Libraries:
pandas for manipulation
numpy for numerical operations
matplotlib/seaborn for visualization
Jupyter Notebook / Google Colab for interactive data exploration
📌 Example Use Case:
Project Goal: Predict student dropout risk.
Data Understanding:
Explore attendance data, grades, and engagement metrics.
Identify missing values in performance records.
Data Preparation:
Impute missing attendance with average values.
Encode engagement levels as High/Medium/Low.
Create a new feature: “Average Grade per Term.”
🔹 4. From Modeling to Evaluation
This stage focuses on the core of data science work—building, training, and evaluating machine learning models. It’s where insights and patterns hidden in data are translated into actionable predictions or classifications. These models form the basis for decision-making in real-world applications.
🔷 4.1 AI Modeling
🔹 Purpose:
To build a machine learning model that can learn from data and make predictions or classifications. This model acts as the solution to the business problem identified earlier.
🔹 Types of Modeling Approaches:
✅ Supervised Learning
The model is trained on a labeled dataset, meaning the input data is paired with the correct output.
Used for:
Classification: Predicting categories (e.g., spam vs. non-spam)
Regression: Predicting continuous values (e.g., house prices)
✅ Unsupervised Learning
The model finds hidden patterns or structures in unlabeled data.
Used for:
Clustering: Grouping similar items (e.g., customer segmentation)
Dimensionality Reduction: Simplifying datasets while retaining structure
✅ Other Approaches
Semi-supervised Learning: Combination of labeled and unlabeled data
Reinforcement Learning: Learning through trial and error (e.g., robotics, game AI)
🔹 Model Selection
Choosing the right algorithm depends on:
Nature of the problem (classification, regression, etc.)
Size and quality of the data
Computational efficiency
Interpretability requirements
| Problem Type | Common Algorithms |
| --- | --- |
| Classification | Logistic Regression, Decision Tree, Random Forest, SVM, KNN |
| Regression | Linear Regression, Decision Tree Regression, SVR |
| Clustering | K-Means, DBSCAN, Hierarchical Clustering |
| Recommendation | Collaborative Filtering, Content-Based Filtering |
| Anomaly Detection | Isolation Forest, One-Class SVM |
🔹 Model Building Workflow (sketched in code after these steps):
Split the data into training and testing sets (or use cross-validation).
Train the model on the training data using selected algorithm(s).
Tune hyperparameters to optimize performance (e.g., tree depth, learning rate).
Validate using unseen data to check generalization.
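The four steps above, sketched end-to-end on synthetic data (assuming scikit-learn; the hyperparameter grid is illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)  # synthetic data

# 1. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2-3. Train and tune hyperparameters (e.g., tree depth) via grid search
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid={"max_depth": [3, 5, None]}, cv=5)
search.fit(X_train, y_train)

# 4. Validate on unseen data to check generalization
print("test accuracy:", search.score(X_test, y_test))
```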
🔹 Model Output:
Trained model capable of:
Predicting future outcomes
Classifying new inputs
Grouping or ranking data points
📌 Example Use Case:
Goal: Predict customer churn for a telecom company.
Modeling Choices:
Use Logistic Regression or Random Forest for classification
Features: Usage time, call drop rate, complaints, payment history
Output: Probability that a customer will churn (Yes/No)
🔷 4.2 Evaluation
🔹 Purpose:
To assess how well the trained model performs on unseen data. This ensures that the model is accurate, reliable, and generalizes well.
🔹 Evaluation Strategies:
✅ Train-Test Split
The dataset is divided into:
Training Set: Used to train the model (e.g., 70–80%)
Testing Set: Used to evaluate the model’s performance on new data (e.g., 20–30%)
✅ Cross-Validation (e.g., K-Fold)
The data is split into k equal folds.
Model is trained on k-1 folds and tested on the remaining fold.
This is repeated k times, each time with a different fold used as the test set.
Final score is the average performance across all runs.
Helps in reducing bias and gives a more stable estimate.
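A minimal 5-fold cross-validation sketch on synthetic data (assuming scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)  # synthetic data

# 5-fold CV: train on 4 folds, test on the 5th, repeat with each fold held out
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # final score = average across all folds
```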
🔹 Evaluation Metrics:
📌 For Classification Models:
| Metric | Description |
| --- | --- |
| Accuracy | Ratio of correct predictions to total predictions. Effective only with balanced datasets. |
| Precision | Proportion of positive identifications that were actually correct. Useful in cases like spam detection. |
| Recall | Proportion of actual positives that were correctly identified. Important in medical diagnoses. |
| F1-Score | Harmonic mean of precision and recall. Best when you need a balance between the two. |
| Confusion Matrix | A table showing TP, TN, FP, FN to visualize prediction outcomes. |
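How these metrics are computed in code, sketched with made-up labels (assuming scikit-learn):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # made-up model predictions

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
```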
📌 For Regression Models:
| Metric | Description |
| --- | --- |
| MAE (Mean Absolute Error) | Average absolute difference between predicted and actual values. |
| MSE (Mean Squared Error) | Average of squared differences. Penalizes larger errors more. |
| RMSE (Root Mean Squared Error) | Square root of MSE. Brings error back to the original unit scale. |
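The same idea for regression, sketched with made-up values (assuming scikit-learn and numpy):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]  # made-up actual values
y_pred = [2.5, 5.0, 4.0, 8.0]  # made-up predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # back in the target's original units
print(mae, mse, rmse)
```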
🔹 Model Validation Example:
Scenario: Evaluating a fraud detection system
Accuracy alone may be misleading due to class imbalance.
Focus instead on:
Recall: Are we catching most fraud cases?
Precision: Are flagged transactions actually fraud?
🔹 Iterative Improvement:
After evaluation:
If performance is poor: Revisit feature engineering, try a different model, or tune hyperparameters.
If performance is acceptable: Proceed to deployment and monitor in the real world.
🛠️ Tools for Modeling and Evaluation:
Python libraries: scikit-learn, xgboost, lightgbm, keras
Visualization: matplotlib, seaborn, plotly
Model tuning: GridSearchCV, RandomizedSearchCV
🔹 5. From Deployment to Feedback
This phase marks the transition from building and evaluating a model to actually using it in the real world. It’s where data science moves from theory to practice. A well-performing model is only useful if it can be deployed effectively and continuously improved through user and system feedback.
🔷 5.1 Deployment
🔹 Purpose:
To operationalize the model—that is, make it available for use in real-world applications such as business systems, websites, mobile apps, or embedded systems.
🔹 Deployment Formats:
✅ Web Applications
Model integrated into websites using back-end servers or cloud platforms.
Example: A recommendation engine for an e-commerce website.
✅ Mobile Applications
AI models embedded within apps using platforms like Thunkable or MIT App Inventor.
Example: A diet tracker app suggesting food based on user input.
✅ APIs (Application Programming Interfaces)
The model is hosted on a server, and other applications can send data to it via API calls.
Example: A chatbot service accessing an NLP model via REST API (a minimal serving sketch follows this list).
✅ Batch Processing
For models that work with bulk data (e.g., daily sales forecasting), deployment may involve scheduled scripts that process data periodically.
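As one hedged illustration of the API format above, a minimal Flask sketch; the model file "model.pkl" and the expected input shape are hypothetical assumptions:

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:  # a previously trained, pickled model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Client sends JSON such as {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```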
🔹 Key Considerations During Deployment:
Scalability: Can the model handle increasing amounts of users or data?
Performance: Does it provide fast and accurate predictions?
Security & Privacy: Are data privacy regulations (like GDPR) being followed?
Resource Optimization: Is the deployment cost-effective and efficient?
Accessibility: Is the model accessible to the intended users (via apps, dashboards, etc.)?
🔹 Types of Deployment Tools:
| Tool/Platform | Use Case |
| --- | --- |
| Thunkable | Deploying models into mobile apps without coding |
| Weebly/Wix | Embedding models into websites |
| Flask/Django | Hosting models as web services using Python |
| AWS, Azure, GCP | Scalable cloud-based deployment |
| Heroku | Easy-to-use platform-as-a-service for smaller applications |
📌 Example Use Case:
Problem: Predicting the probability of student dropout.
Deployment: Model integrated into a school dashboard. Teachers input student metrics and receive real-time predictions of dropout risk, allowing for timely interventions.
🔷 5.2 Feedback
🔹 Purpose:
Once the model is deployed, it’s essential to monitor how well it performs in a real-world environment. Feedback allows for continuous learning, performance tuning, and user adaptation.
🔹 Sources of Feedback:
✅ User Feedback
Direct input from users or stakeholders on how useful and accurate the model’s outputs are.
Example: Users flagging incorrect recommendations or predictions.
✅ System Monitoring
Track performance metrics over time to detect:
Model drift: When data patterns change, reducing model accuracy.
Latency issues: Slow predictions or server errors.
Use logs, dashboards, and alerts to monitor model behavior.
✅ Performance Metrics Over Time
Compare live performance against initial evaluation.
Detect drop in accuracy, precision, or recall, signaling the need for retraining.
🔹 Actions Taken Based on Feedback:
Model Retraining
Retrain the model with new data to account for evolving trends.
Example: Updating a sales forecasting model after a new product is launched.
Feature Re-engineering
Introduce or remove features based on how the model is behaving in the real world.
Interface Updates
Improve usability based on user feedback (e.g., clearer output labels, easier input methods).
Version Control
Maintain multiple versions of models and switch based on performance.
🔹 Automated Feedback Loop
In advanced systems, feedback is collected automatically and used to retrain the model periodically.
This creates a self-improving system that evolves with new data.
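A deliberately simplified sketch of such a loop (the frames, columns, and model choice are all hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Historical training data plus newly collected feedback (made-up values)
historical = pd.DataFrame({"x": [1.0, 2.0, 8.0, 9.0], "y": [0, 0, 1, 1]})
new_feedback = pd.DataFrame({"x": [3.0, 7.0], "y": [0, 1]})

# Periodically retrain on the combined data so the model evolves
combined = pd.concat([historical, new_feedback], ignore_index=True)
model = LogisticRegression().fit(combined[["x"]], combined["y"])
```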
📌 Example Use Case:
Deployed Model: A health app that predicts the risk of diabetes.
Feedback Loop: As users track their diet and physical activity, new data is used to improve prediction accuracy. Alerts are also refined based on user input.
🔷 Model Validation
Purpose:
To ensure the model’s predictions are generalizable to new, unseen data.
Techniques:
1. Train-Test Split
Common approach where:
70–80% of data is used for training
20–30% is used for testing
Helps evaluate performance before real deployment.
2. K-Fold Cross-Validation
Data is split into k subsets (folds).
Model is trained on k-1 folds and tested on the remaining one.
Repeated k times, with a different fold used as the test each time.
Final performance is the average across all k runs.
🔷 Evaluation Metrics in Depth
For Classification Models:
Confusion Matrix: Matrix of predicted vs actual outcomes.
TP: Correctly predicted positive
TN: Correctly predicted negative
FP: Incorrectly predicted positive
FN: Incorrectly predicted negative
Precision = TP / (TP + FP) – how many predicted positives were correct.
Recall = TP / (TP + FN) – how many actual positives were identified.
F1-Score = Harmonic mean of precision and recall – balances the two.
Accuracy = (TP + TN) / Total predictions.
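For instance, with made-up counts TP = 40, FP = 10, FN = 20, and TN = 30 (100 predictions in total): Precision = 40 / (40 + 10) = 0.80; Recall = 40 / (40 + 20) ≈ 0.67; F1 ≈ 2 × (0.80 × 0.67) / (0.80 + 0.67) ≈ 0.73; Accuracy = (40 + 30) / 100 = 0.70.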
For Regression Models:
MAE: Average magnitude of errors in predictions.
MSE: Average of squared errors – penalizes larger errors more.
RMSE: Square root of MSE – in the same unit as the target variable.
🔷 Application to Capstone Projects
Students apply this methodology to:
Frame a real-world problem
Gather and analyze data
Build and test models
Deploy a solution
Evaluate and iterate
This approach ensures project relevance, structure, and practical value.
🔷 Case Examples from the Curriculum
| Scenario | Key Concept Applied |
| --- | --- |
| Investment firm improving user experience | Business Understanding |
| Course recommendation system | Evaluation |
| Predicting sales using marketing spend | Regression Modeling |
| Churn prediction for small startups | Cross-Validation |
| Predicting equipment failure | Predictive Analytics |
| Delivery optimization | Prescriptive Analytics |
🔷 Tools & Platforms for Implementation
Languages & Libraries: Python (pandas, sklearn, matplotlib)