Unit 2: Data Science Methodology – An Analytic Approach to Capstone Project


Data Science Methodology

Introduction to Data Science Methodology

The Data Science Methodology provides a step-by-step framework for solving real-world problems using data. Developed by John Rollins at IBM Analytics, this approach ensures that each project is structured, repeatable, and goal-oriented.

It is especially useful for students developing Capstone Projects, guiding them from problem identification to deploying a predictive model and gathering feedback.


🔷 Overview of the 10 Steps in Data Science Methodology

These ten steps are grouped into five logical stages, each of which is iterative and may involve revisiting earlier stages for refinement.


🔹 1. From Problem to Approach

1.1 Business Understanding

  • This is the foundational step in any data science project.
  • It involves working closely with stakeholders to define what problem needs solving and why it matters.
  • Tools such as Design Thinking (DT) and the 5W1H framework (What, Why, When, Where, Who, and How) are used to dig deeper into the context of the problem.
  • Example: A food delivery company wants to reduce delivery time. The data scientist must understand the business goal: improving customer satisfaction through faster deliveries.

1.2 Analytic Approach

  • This step involves deciding the type of analytics to use based on the problem:
    • Descriptive Analytics – What happened?
    • Diagnostic Analytics – Why did it happen?
    • Predictive Analytics – What will happen?
    • Prescriptive Analytics – What should be done?
  • Selecting the right AI technique:
    • Classification: Used when predicting a category (e.g., spam or not spam).
    • Regression: Used when predicting a numeric value (e.g., house price).
    • Clustering: Grouping similar items (e.g., customer segments).
    • Anomaly Detection: Spotting unusual events (e.g., fraud).
    • Recommendation Systems: Suggesting items (e.g., movies, products).

🔹 2. From Requirements to Collection

🔷 2.1 Data Requirements

🔹 Purpose:

Before collecting any data, a data scientist must clearly define what data is required to address the problem identified in the Business Understanding phase. This includes determining the nature, type, format, quantity, and sources of data that will be used to solve the problem.

🔹 Key Considerations:

What type of data do we need?

  • Numerical Data: e.g., prices, age, temperature
  • Categorical Data: e.g., gender, city, product type
  • Textual Data: e.g., reviews, tweets, comments
  • Multimedia Data: e.g., images, audio, video (for advanced AI tasks)

What format should the data be in?

  • CSV: Simple and widely used for tabular data
  • JSON/XML: Common for web data and APIs
  • Excel: Often used for business reports
  • SQL Databases: Used for structured enterprise data
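
For illustration, here is a minimal sketch of loading each of these formats with pandas; the file names (sales.csv, sales.json, sales.xlsx, sales.db) are placeholders, not part of the curriculum.

```python
import sqlite3

import pandas as pd

# CSV: simple, widely used for tabular data
df_csv = pd.read_csv("sales.csv")

# JSON: common for web data and APIs
df_json = pd.read_json("sales.json")

# Excel: often used for business reports (needs the openpyxl package for .xlsx)
df_xlsx = pd.read_excel("sales.xlsx")

# SQL: structured enterprise data (SQLite used here as a stand-in)
conn = sqlite3.connect("sales.db")
df_sql = pd.read_sql("SELECT * FROM sales", conn)
conn.close()
```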

From where will the data be sourced?

  • Primary Data Sources (firsthand data)
    • Surveys
    • Observations
    • Sensors/IoT devices
    • Interviews
  • Secondary Data Sources (already available)
    • Government or public datasets (e.g., data.gov)
    • Open data platforms (e.g., Kaggle, UCI ML Repository)
    • Organizational databases

How much data is enough?

  • Consider sample size, data completeness, and distribution.
  • Assess whether the data will allow reliable model training and validation.

What about data quality?

  • Assess whether the data is accurate, up-to-date, unbiased, and relevant to the problem.

🔹 Types of Data (based on structure):

| Data Type | Description | Examples |
|---|---|---|
| Structured Data | Organized in rows and columns | Excel sheets, SQL databases |
| Semi-structured Data | Has tags or markers, but not a fixed schema | JSON, XML |
| Unstructured Data | No predefined structure | Images, videos, social media posts |

🔷 2.2 Data Collection

🔹 Purpose:

This step involves the actual gathering of data based on the requirements defined earlier. This data becomes the foundation upon which all future modeling and evaluation will be based.

🔹 Sources of Data:

Primary Data Sources:

  • Data collected directly by the team
  • Examples:
    • Conducting a survey on student learning habits
    • Collecting sensor data from a fitness device
    • Performing interviews or observations

Secondary Data Sources:

  • Pre-existing data collected by others
  • Examples:
    • Datasets on Kaggle (e.g., Titanic survival data, housing prices)
    • Public health data from WHO or UNICEF
    • Social media data using APIs (e.g., Twitter API)
    • Government portals like data.gov.in

🔹 Challenges in Data Collection:

  • Accessibility issues: Data behind paywalls or protected APIs
  • Privacy concerns: Especially for personal or sensitive data
  • Data inconsistency: Varying formats across sources
  • Incomplete or outdated data: May not reflect current realities

🔹 Tips for Effective Collection:

  • Ensure data collection methods align with project timelines.
  • Validate the authenticity and credibility of secondary sources.
  • Plan for potential gaps and be ready to revisit the requirements phase if needed.

✏️ Example Use Case:

Problem: A school wants to predict which students are likely to drop out.

Data Requirements:

  • Student demographics, attendance, performance records, disciplinary actions

Data Collection:

  • Pull historical data from school management software (structured)
  • Use teacher surveys for recent behavioral observations (primary source)
  • Download education-related datasets from government portals (secondary source)

🔹 3. From Understanding to Preparation

This phase is critical because it bridges the gap between raw data and a format ready for analysis and modeling. Even if a dataset has been collected, it must be thoroughly examined and cleaned to ensure that the insights derived and the models built from it are accurate and reliable.


🔷 3.1 Data Understanding

🔹 Purpose:

To deeply explore and assess the collected data to determine its structure, quality, and relevance to the problem at hand.

🔹 Key Activities:

Exploratory Data Analysis (EDA):

  • Use statistical summaries to understand the distribution of each feature (mean, median, standard deviation).
  • Visual tools like:
    • Histograms: To check the distribution of continuous variables.
    • Box plots: To identify outliers.
    • Scatter plots: To analyze relationships between variables.
    • Correlation heatmaps: To spot multicollinearity.

Data Profiling:

  • Check data types: Are columns properly classified as integers, floats, strings, etc.?
  • Identify missing values and how frequently they occur.
  • Detect inconsistencies like unexpected text in numeric columns.
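
A minimal sketch of the EDA and profiling activities above, using pandas, matplotlib, and seaborn; the dataset (students.csv) and the column name (attendance) are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("students.csv")          # hypothetical dataset

# Statistical summaries: count, mean, std, quartiles for each numeric feature
print(df.describe())

# Data profiling: column types and how often values are missing
print(df.dtypes)
print(df.isnull().sum())

# Histogram: distribution of a continuous variable
df["attendance"].hist()
plt.show()

# Box plot: identify outliers
sns.boxplot(x=df["attendance"])
plt.show()

# Correlation heatmap: spot strongly related (multicollinear) features
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
```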

Relevance Analysis:

  • Determine whether the data is relevant to the business objectives.
  • Identify features that could serve as potential predictors or targets.
  • If gaps are identified, this stage may loop back to the Data Collection phase.

🔹 Common Issues Discovered in This Phase:

  • Missing or incomplete data
  • Duplicate records
  • Outliers that may skew the analysis
  • Incorrect formats (e.g., dates stored as text)

📌 Example:

If analyzing sales data, you might find:

  • Missing entries for certain product categories
  • Sales recorded in different currencies
  • Product IDs entered inconsistently

Understanding these issues early allows for targeted fixes before modeling begins.


🔷 3.2 Data Preparation

🔹 Purpose:

To transform raw data into a clean and structured format suitable for analysis and modeling. This phase is often referred to as data wrangling or data preprocessing.

🔹 Key Steps in Data Preparation:

Data Cleaning

  • Handling missing data:
    • Imputation using mean/median/mode
    • Dropping missing entries (if justified)
  • Removing duplicates
  • Correcting errors:
    • Fix typos in categorical values (e.g., “Delhii” to “Delhi”)
    • Convert data to correct formats (e.g., strings to dates)
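
A minimal cleaning sketch with pandas, assuming a hypothetical dataset with columns such as price, customer_id, city, and purchase_date.

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")          # hypothetical file

# Handle missing data: impute a numeric column with its median
df["price"] = df["price"].fillna(df["price"].median())

# Drop rows still missing a critical field, if that is justified
df = df.dropna(subset=["customer_id"])

# Remove duplicate records
df = df.drop_duplicates()

# Fix typos in categorical values
df["city"] = df["city"].replace({"Delhii": "Delhi"})

# Convert strings to proper dates
df["purchase_date"] = pd.to_datetime(df["purchase_date"])
```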

Feature Engineering

  • Creating new variables from existing data that might be more meaningful.
  • Examples:
    • From “Date of Purchase” → extract “Day of Week”, “Month”, “Season”
    • From “Price” and “Area” → derive “Price per Square Foot”
  • Useful for improving model accuracy and performance.
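
A small sketch of the two examples above, using a tiny hypothetical housing-sales DataFrame.

```python
import pandas as pd

# Tiny hypothetical housing-sales data
df = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2024-01-05", "2024-02-14"]),
    "price": [5_000_000, 7_200_000],
    "area": [1000, 1200],
})

# From "Date of Purchase" extract day of week and month
df["day_of_week"] = df["purchase_date"].dt.day_name()
df["month"] = df["purchase_date"].dt.month

# From "Price" and "Area" derive price per square foot
df["price_per_sqft"] = df["price"] / df["area"]
print(df)
```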

Data Transformation

  • Scaling and Normalization:
    • Standardize variables for algorithms sensitive to magnitude (e.g., SVM, KNN)
    • Methods: Min-Max Scaling, Z-score Normalization
  • Encoding categorical variables:
    • Label Encoding: Assign numeric values to categories (e.g., Male = 0, Female = 1)
    • One-Hot Encoding: Create binary columns for each category
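
A brief sketch of scaling and encoding with scikit-learn and pandas, on made-up data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Hypothetical data
df = pd.DataFrame({
    "age": [18, 25, 40, 60],
    "income": [20000, 45000, 80000, 120000],
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
})

# Min-Max scaling: rescale numeric features to the 0-1 range
minmax = MinMaxScaler().fit_transform(df[["age", "income"]])
df["age_minmax"], df["income_minmax"] = minmax[:, 0], minmax[:, 1]

# Z-score normalization: mean 0, standard deviation 1
zscores = StandardScaler().fit_transform(df[["age", "income"]])
df["age_z"], df["income_z"] = zscores[:, 0], zscores[:, 1]

# Label encoding: one numeric code per category
df["city_code"] = LabelEncoder().fit_transform(df["city"])

# One-hot encoding: one binary column per category
df = pd.get_dummies(df, columns=["city"])
print(df)
```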

Data Integration

  • Combining multiple datasets:
    • Merging sales data with customer demographics
    • Joining geographic data with sensor readings
  • Ensures a comprehensive dataset for analysis
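
A minimal sketch of combining two hypothetical tables with pandas.merge on a shared key.

```python
import pandas as pd

# Two hypothetical tables sharing a customer_id key
sales = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [250, 400, 150]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "city": ["Delhi", "Pune", "Jaipur"]})

# Merge on the shared key to build one comprehensive dataset
combined = pd.merge(sales, customers, on="customer_id", how="left")
print(combined)
```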

Data Reduction (if necessary)

  • Eliminate irrelevant features or reduce dimensionality using techniques like PCA (Principal Component Analysis)
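
A short sketch of dimensionality reduction with scikit-learn's PCA on a random, hypothetical feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 100 samples, 5 features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))

# Reduce to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance retained by each component
```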

🔹 Why This Step is Crucial:

“Garbage in, garbage out” — If the input data is flawed, the model outcomes will also be flawed.

  • Well-prepared data leads to better model performance.
  • Reduces the risk of overfitting or underfitting.
  • Makes the model more interpretable and robust.

🛠️ Tools Commonly Used in This Phase:

  • Python Libraries:
    • pandas for manipulation
    • numpy for numerical operations
    • matplotlib/seaborn for visualization
  • Jupyter Notebook / Google Colab for interactive data exploration

📌 Example Use Case:

Project Goal: Predict student dropout risk.

Data Understanding:

  • Explore attendance data, grades, and engagement metrics.
  • Identify missing values in performance records.

Data Preparation:

  • Impute missing attendance with average values.
  • Encode engagement levels as High/Medium/Low.
  • Create a new feature: “Average Grade per Term.”

🔹 4. From Modeling to Evaluation

This stage focuses on the core of data science work—building, training, and evaluating machine learning models. It’s where insights and patterns hidden in data are translated into actionable predictions or classifications. These models form the basis for decision-making in real-world applications.


🔷 4.1 AI Modeling

🔹 Purpose:

To build a machine learning model that can learn from data and make predictions or classifications. This model acts as the solution to the business problem identified earlier.


🔹 Types of Modeling Approaches:

Supervised Learning

  • The model is trained on a labeled dataset, meaning the input data is paired with the correct output.
  • Used for:
    • Classification: Predicting categories (e.g., spam vs. non-spam)
    • Regression: Predicting continuous values (e.g., house prices)

Unsupervised Learning

  • The model finds hidden patterns or structures in unlabeled data.
  • Used for:
    • Clustering: Grouping similar items (e.g., customer segmentation)
    • Dimensionality Reduction: Simplifying datasets while retaining structure

Other Approaches

  • Semi-supervised Learning: Combination of labeled and unlabeled data
  • Reinforcement Learning: Learning through trial and error (e.g., robotics, game AI)

🔹 Model Selection

Choosing the right algorithm depends on:

  • Nature of the problem (classification, regression, etc.)
  • Size and quality of the data
  • Computational efficiency
  • Interpretability requirements

| Problem Type | Common Algorithms |
|---|---|
| Classification | Logistic Regression, Decision Tree, Random Forest, SVM, KNN |
| Regression | Linear Regression, Decision Tree Regression, SVR |
| Clustering | K-Means, DBSCAN, Hierarchical Clustering |
| Recommendation | Collaborative Filtering, Content-Based Filtering |
| Anomaly Detection | Isolation Forest, One-Class SVM |

🔹 Model Building Workflow:

  1. Split the data into training and testing sets (or use cross-validation).
  2. Train the model on the training data using selected algorithm(s).
  3. Tune hyperparameters to optimize performance (e.g., tree depth, learning rate).
  4. Validate using unseen data to check generalization.
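
A minimal sketch of this workflow with scikit-learn, using a built-in sample dataset purely for illustration; the algorithm and hyperparameter grid are arbitrary choices, not prescribed by the methodology.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Split the data into training and testing sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2-3. Train the model and tune hyperparameters (e.g., tree depth, number of trees)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, None], "n_estimators": [50, 100]},
    cv=5,
)
grid.fit(X_train, y_train)

# 4. Validate on unseen data to check generalization
print("Best parameters:", grid.best_params_)
print("Test accuracy :", grid.score(X_test, y_test))
```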

🔹 Model Output:

  • Trained model capable of:
    • Predicting future outcomes
    • Classifying new inputs
    • Grouping or ranking data points

📌 Example Use Case:

Goal: Predict customer churn for a telecom company
Modeling Choices:

  • Use Logistic Regression or Random Forest for classification
  • Features: Usage time, call drop rate, complaints, payment history
  • Output: Probability that a customer will churn, converted into a Yes/No prediction

🔷 4.2 Evaluation

🔹 Purpose:

To assess how well the trained model performs on unseen data. This ensures that the model is accurate, reliable, and generalizes well.


🔹 Evaluation Strategies:

Train-Test Split

  • The dataset is divided into:
    • Training Set: Used to train the model (e.g., 70–80%)
    • Testing Set: Used to evaluate the model’s performance on new data (e.g., 20–30%)

Cross-Validation (e.g., K-Fold)

  • The data is split into k equal folds.
  • Model is trained on k-1 folds and tested on the remaining fold.
  • This is repeated k times, each time with a different fold used as the test set.
  • Final score is the average performance across all runs.
  • Helps in reducing bias and gives a more stable estimate.
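
A short sketch of k-fold cross-validation with scikit-learn (k = 5 here), using a built-in sample dataset for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, repeated 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("Score per fold:", scores)
print("Average score:", scores.mean())
```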

🔹 Evaluation Metrics:

📌 For Classification Models:

| Metric | Description |
|---|---|
| Accuracy | Ratio of correct predictions to total predictions. Effective only with balanced datasets. |
| Precision | Proportion of positive identifications that were actually correct. Useful in cases like spam detection. |
| Recall | Proportion of actual positives that were correctly identified. Important in medical diagnoses. |
| F1-Score | Harmonic mean of precision and recall. Best when you need a balance between the two. |
| Confusion Matrix | A table showing TP, TN, FP, FN to visualize prediction outcomes. |
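
A minimal sketch of computing these metrics with scikit-learn; the labels and predictions below are made-up values for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up true labels and model predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```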

📌 For Regression Models:

| Metric | Description |
|---|---|
| MAE (Mean Absolute Error) | Average absolute difference between predicted and actual values. |
| MSE (Mean Squared Error) | Average of squared differences. Penalizes larger errors more. |
| RMSE (Root Mean Squared Error) | Square root of MSE. Brings error back to the original unit scale. |
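
A minimal sketch of computing these metrics with scikit-learn and NumPy on made-up values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted values (e.g., house prices in lakhs)
y_true = [50, 60, 75, 90]
y_pred = [48, 65, 70, 95]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # back in the same unit as the target

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
```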

🔹 Model Validation Example:

Scenario: Evaluating a fraud detection system

  • Accuracy alone may be misleading due to class imbalance.
  • Focus instead on:
    • Recall: Are we catching most fraud cases?
    • Precision: Are flagged transactions actually fraud?

🔹 Iterative Improvement:

After evaluation:

  • If performance is poor: Revisit feature engineering, try a different model, or tune hyperparameters.
  • If performance is acceptable: Proceed to deployment and monitor in the real world.

🛠️ Tools for Modeling and Evaluation:

  • Python libraries: scikit-learn, xgboost, lightgbm, keras
  • Visualization: matplotlib, seaborn, plotly
  • Model tuning: GridSearchCV, RandomizedSearchCV

🔹 5. From Deployment to Feedback

This phase marks the transition from building and evaluating a model to actually using it in the real world. It’s where data science moves from theory to practice. A well-performing model is only useful if it can be deployed effectively and continuously improved through user and system feedback.


🔷 5.1 Deployment

🔹 Purpose:

To operationalize the model—that is, make it available for use in real-world applications such as business systems, websites, mobile apps, or embedded systems.

🔹 Deployment Formats:

Web Applications

  • Model integrated into websites using back-end servers or cloud platforms.
  • Example: A recommendation engine for an e-commerce website.

Mobile Applications

  • AI models embedded within apps using platforms like Thunkable or MIT App Inventor.
  • Example: A diet tracker app suggesting food based on user input.

APIs (Application Programming Interfaces)

  • The model is hosted on a server, and other applications can send data to it via API calls.
  • Example: A chatbot service accessing an NLP model via REST API.
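
As an illustration of the API pattern, here is a minimal Flask sketch that serves a previously saved model; the file model.pkl, the /predict route, and the input format are assumptions for this example, not a prescribed setup.

```python
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("model.pkl")   # hypothetical model saved earlier with joblib.dump

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON such as {"features": [5.1, 3.5, 1.4, 0.2]}
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": str(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```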

Batch Processing

  • For models that work with bulk data (e.g., daily sales forecasting), deployment may involve scheduled scripts that process data periodically.

🔹 Key Considerations During Deployment:

  1. Scalability: Can the model handle increasing amounts of users or data?
  2. Performance: Does it provide fast and accurate predictions?
  3. Security & Privacy: Are data privacy regulations (like GDPR) being followed?
  4. Resource Optimization: Is the deployment cost-effective and efficient?
  5. Accessibility: Is the model accessible to the intended users (via apps, dashboards, etc.)?

🔹 Types of Deployment Tools:

| Tool/Platform | Use Case |
|---|---|
| Thunkable | Deploying models into mobile apps without coding |
| Weebly/Wix | Embedding models into websites |
| Flask/Django | Hosting models as web services using Python |
| AWS, Azure, GCP | Scalable cloud-based deployment |
| Heroku | Easy-to-use platform-as-a-service for smaller applications |

📌 Example Use Case:

Problem: Predicting the probability of student dropout
Deployment: Model integrated into a school dashboard. Teachers input student metrics and receive real-time predictions about dropout risk, allowing for timely interventions.


🔷 5.2 Feedback

🔹 Purpose:

Once the model is deployed, it’s essential to monitor how well it performs in a real-world environment. Feedback allows for continuous learning, performance tuning, and user adaptation.


🔹 Sources of Feedback:

User Feedback

  • Direct input from users or stakeholders on how useful and accurate the model’s outputs are.
  • Example: Users flagging incorrect recommendations or predictions.

System Monitoring

  • Track performance metrics over time to detect:
    • Model drift: When data patterns change, reducing model accuracy.
    • Latency issues: Slow predictions or server errors.
  • Use logs, dashboards, and alerts to monitor model behavior.

Performance Metrics Over Time

  • Compare live performance against initial evaluation.
  • Detect drop in accuracy, precision, or recall, signaling the need for retraining.

🔹 Actions Taken Based on Feedback:

  1. Model Retraining
    • Retrain the model with new data to account for evolving trends.
    • Example: Updating a sales forecasting model after a new product is launched.
  2. Feature Re-engineering
    • Introduce or remove features based on how the model is behaving in the real world.
  3. Interface Updates
    • Improve usability based on user feedback (e.g., clearer output labels, easier input methods).
  4. Version Control
    • Maintain multiple versions of models and switch based on performance.

🔹 Automated Feedback Loop

  • In advanced systems, feedback is collected automatically and used to retrain the model periodically.
  • This creates a self-improving system that evolves with new data.

📌 Example Use Case:

Deployed Model: A health app that predicts risk of diabetes
Feedback Loop: As users track their diet and physical activity, new data is used to improve prediction accuracy. Alerts are also refined based on user input.


🔷 Model Validation

Purpose:

To ensure the model’s predictions are generalizable to new, unseen data.

Techniques:

1. Train-Test Split

  • Common approach where:
    • 70–80% of data is used for training
    • 20–30% is used for testing
  • Helps evaluate performance before real deployment.

2. K-Fold Cross-Validation

  • Data is split into k subsets (folds).
  • Model is trained on k-1 folds and tested on the remaining one.
  • Repeated k times, with a different fold used as the test each time.
  • Final performance is the average across all k runs.

🔷 Evaluation Metrics in Depth

For Classification Models:

  • Confusion Matrix: Matrix of predicted vs actual outcomes.
    • TP: Correctly predicted positive
    • TN: Correctly predicted negative
    • FP: Incorrectly predicted positive
    • FN: Incorrectly predicted negative
  • Precision = TP / (TP + FP) – how many predicted positives were correct.
  • Recall = TP / (TP + FN) – how many actual positives were identified.
  • F1-Score = Harmonic mean of precision and recall – balances the two.
  • Accuracy = (TP + TN) / Total predictions.
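
A worked example with hypothetical counts: suppose TP = 40, FP = 10, FN = 20, TN = 30 (100 predictions in total). Then:

  • Precision = 40 / (40 + 10) = 0.80
  • Recall = 40 / (40 + 20) ≈ 0.67
  • F1-Score = 2 × (0.80 × 0.67) / (0.80 + 0.67) ≈ 0.73
  • Accuracy = (40 + 30) / 100 = 0.70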

For Regression Models:

  • MAE: Average magnitude of errors in predictions.
  • MSE: Average of squared errors – penalizes larger errors more.
  • RMSE: Square root of MSE – in the same unit as the target variable.

🔷 Application to Capstone Projects

Students apply this methodology to:

  • Frame a real-world problem
  • Gather and analyze data
  • Build and test models
  • Deploy a solution
  • Evaluate and iterate

This approach ensures project relevance, structure, and practical value.


🔷 Case Examples from the Curriculum

| Scenario | Key Concept Applied |
|---|---|
| Investment firm improving user experience | Business Understanding |
| Course recommendation system | Evaluation |
| Predicting sales using marketing spend | Regression Modeling |
| Churn prediction for small startups | Cross-Validation |
| Predicting equipment failure | Predictive Analytics |
| Delivery optimization | Prescriptive Analytics |

🔷 Tools & Platforms for Implementation

  • Languages & Libraries: Python (pandas, sklearn, matplotlib)
  • Data Sources: Kaggle, data.gov, WHO, UNICEF
  • Development Tools: Jupyter Notebook, Google Colab
  • Deployment Tools: Thunkable, Weebly



