Definition: Big Data refers to extremely large, complex datasets that are difficult to manage, process, or analyze using traditional computing techniques and databases. The rapid growth of data in today’s digital age—driven by internet usage, social media, mobile apps, and IoT devices—has led to a need for more advanced tools and strategies to harness its potential.
Examples of Big Data Sources:
Transactional Data: Online purchases, billing systems, bank transactions.
Machine Data: IoT sensors, logs, smart devices.
Social Data: Tweets, posts, likes, shares on platforms like Facebook and Twitter.
Big Data plays a vital role in Artificial Intelligence (AI) and Data Science by providing the raw material (data) needed to train models and extract patterns that guide decision-making.
2. Types of Big Data
Structured Data:
Well-organized and stored in relational databases (tables with rows and columns).
Easily searchable and manageable using SQL.
Examples: Customer names, addresses, product IDs, transaction records.
Semi-structured Data:
Has some structure, but does not follow the rigid schema of structured data.
May contain tags or metadata for categorization.
Examples: XML files, JSON, CSV files, HTML documents.
Unstructured Data:
Lacks a predefined structure, making it hard to analyze with traditional tools.
Often includes multimedia and free-form text.
Examples: Images, videos, audio recordings, social media comments, emails.
Each data type requires different tools and techniques for storage, processing, and analysis.
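The contrast between the three data types can be shown in a short sketch (all names and values below are illustrative, not from a real dataset): structured rows fit a fixed schema, semi-structured JSON carries its own tags, and unstructured text has no schema until analysis imposes one.

```python
import json

# Structured: fixed schema, like rows and columns in a SQL table.
transactions = [
    {"customer_id": 101, "product_id": "P-17", "amount": 49.99},
    {"customer_id": 102, "product_id": "P-03", "amount": 15.00},
]

# Semi-structured: JSON keys act as self-describing tags/metadata,
# but individual records are free to differ in shape.
raw = '{"user": "asha", "likes": 12, "tags": ["sale", "tech"]}'
post = json.loads(raw)

# Unstructured: free-form text has no schema; we impose one at analysis time.
comment = "Loved the product, but delivery was slow!!"
tokens = comment.lower().split()  # crude tokenization as a first step

total = sum(t["amount"] for t in transactions)  # structured: easy aggregation
tags = post["tags"]                             # semi-structured: key lookup
n_tokens = len(tokens)                          # unstructured: derived feature
```

Note how the effort shifts: aggregating the structured rows is one line, while the unstructured comment first has to be converted into features before any analysis is possible.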
3. Advantages of Big Data
Enhanced Decision-Making:
Real-time insights enable organizations to respond quickly to market changes.
Improved Efficiency & Productivity:
Data analysis identifies bottlenecks and optimizes business operations.
Deeper Customer Insights:
Allows personalization based on customer behavior and preferences.
Competitive Advantage:
Early identification of trends helps companies outperform rivals.
Fosters Innovation:
Encourages development of new products, services, and business models.
4. Disadvantages of Big Data
Privacy & Security Concerns:
Large-scale data collection risks misuse or unauthorized access.
Data Quality Issues:
Inaccurate, incomplete, or inconsistent data can mislead analysis.
Technical Complexity:
Requires advanced infrastructure and skilled personnel.
Regulatory Compliance:
Must follow data protection laws like GDPR and India’s DPDP Act 2023.
High Costs:
Infrastructure, tools, and talent demand significant investment.
5. Characteristics of Big Data – 6V Framework
1. Volume
Massive amounts of data generated daily (an estimated 328 million terabytes per day worldwide).
Traditional storage is insufficient—requires scalable solutions.
2. Velocity
Refers to the speed at which new data is created (e.g., Google handles over 40,000 search queries per second).
Demands real-time processing and analysis.
3. Variety
Big Data comes in many formats—structured, semi-structured, and unstructured.
Complex formats like audio, images, and video provide unique insights.
4. Veracity
Refers to the trustworthiness of the data.
Data must be cleaned to ensure accuracy and reliability.
5. Value
The ultimate goal of Big Data is to extract business value and insights.
Insights drive growth, efficiency, and customer satisfaction.
6. Variability
Refers to the unpredictability and dynamic nature of data.
Requires systems that can adapt to changing formats and patterns.
6. Big Data vs Traditional Data
| Feature     | Traditional Data              | Big Data                          |
|-------------|-------------------------------|-----------------------------------|
| Volume      | Limited (gigabytes)           | Massive (terabytes to zettabytes) |
| Variety     | Limited to structured formats | Includes all formats              |
| Velocity    | Low (processed in batches)    | High (often real-time)            |
| Flexibility | Less flexible                 | Highly adaptable and dynamic      |
7. Big Data Analytics
Definition: Big Data Analytics refers to the use of advanced computational and statistical techniques to analyze large volumes of data in order to discover patterns, extract insights, and make informed decisions.
Types of Analytics:
Descriptive Analytics:
Summarizes past events and trends.
Example: Monthly sales reports.
Diagnostic Analytics:
Explains why something happened.
Example: Why did customer churn increase?
Predictive Analytics:
Uses historical data to forecast future outcomes.
Example: Predicting product demand or customer behavior.
Prescriptive Analytics:
Recommends actions based on analysis.
Example: Suggesting the best marketing strategy.
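The four types of analytics can be illustrated on one toy dataset (the sales figures and the stocking rule below are invented for illustration): descriptive summarizes the past, diagnostic compares the latest month to the baseline, predictive fits a simple linear trend, and prescriptive turns the forecast into a recommended action.

```python
from statistics import mean

monthly_sales = [120, 130, 125, 150, 160, 170]  # toy data (assumed)

# Descriptive: summarize what happened.
avg = mean(monthly_sales)

# Diagnostic: why did the last month change? Compare it to the prior average.
lift = monthly_sales[-1] - mean(monthly_sales[:-1])

# Predictive: forecast next month with a simple least-squares linear trend.
n = len(monthly_sales)
xs = range(n)
x_bar, y_bar = mean(xs), mean(monthly_sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, monthly_sales)) / \
        sum((x - x_bar) ** 2 for x in xs)
forecast = y_bar + slope * (n - x_bar)

# Prescriptive: recommend an action based on the analysis.
action = "increase stock" if forecast > avg else "hold stock"
```

In practice each stage uses far richer models, but the progression is the same: describe, explain, predict, then recommend.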
Emerging Trends Driving Big Data Analytics:
Moore’s Law: Growth in computing power makes complex data analysis feasible.
Mobile Computing: Mobile apps generate valuable user data.
Social Networking: Massive data from user interaction and behavior.
Cloud Computing: Affordable, scalable infrastructure for Big Data storage and analytics.
8. Working on Big Data Analytics
Step 1: Gather Data
Sources include mobile apps, social media, cloud services, IoT sensors.
Step 2: Process Data
Batch Processing: Large datasets processed over a period (not real-time).
Stream Processing: Data processed in real-time or near-real-time.
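The batch-versus-stream distinction can be sketched with a running average (a minimal illustration, not a production pipeline): the batch version sees the whole dataset at once, while the stream version updates a summary one record at a time and never retains the full data.

```python
# Batch: the whole dataset is available; process it in one pass after the fact.
def batch_average(readings):
    return sum(readings) / len(readings)

# Stream: records arrive one at a time; keep only a running summary,
# never the full dataset (constant memory).
class StreamAverage:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # current average so far

readings = [21.0, 22.5, 23.0, 21.5]
stream = StreamAverage()
for r in readings:
    current = stream.update(r)

# Both arrive at the same answer; the stream version never held the list.
```

The trade-off: batch processing can revisit any record, while stream processing must decide what to keep as each record flies by.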
Step 3: Clean Data
Data cleaning improves quality by removing errors, duplicates, or irrelevant information.
Tools like the Impute widget in Orange handle missing values.
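The three cleaning operations named above (removing duplicates, catching errors, imputing missing values) can be sketched in plain Python; the records and the valid-age range are invented for illustration, and mean imputation mirrors one of the strategies a tool like Orange's Impute widget offers.

```python
from statistics import mean

# Raw records: one duplicate, one impossible age, one missing value (None).
records = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},      # duplicate
    {"id": 2, "age": -5},      # error: invalid age
    {"id": 3, "age": None},    # missing value
    {"id": 4, "age": 28},
]

# 1) Drop duplicates (keep the first occurrence of each id).
seen, deduped = set(), []
for rec in records:
    if rec["id"] not in seen:
        seen.add(rec["id"])
        deduped.append(rec)

# 2) Treat invalid values as missing (assumed valid range: 0-120).
for rec in deduped:
    if rec["age"] is not None and not (0 <= rec["age"] <= 120):
        rec["age"] = None

# 3) Impute missing values with the mean of the valid ones.
valid = [rec["age"] for rec in deduped if rec["age"] is not None]
fill = mean(valid)
for rec in deduped:
    if rec["age"] is None:
        rec["age"] = fill
```

Order matters here: deduplicating and invalidating errors before imputation keeps bad values from distorting the fill-in mean.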
Step 4: Analyze Data
Apply models such as:
K-Means for clustering
Logistic Regression or Decision Trees for predictions
Use Visualization Tools like scatter plots, box plots, and heatmaps.
Test & Score widget: validates the model using techniques like cross-validation.
Predict widget: generates predictions.
Orange provides a visual, no-code environment to implement analytics workflows efficiently.
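Behind Orange's widgets sit standard algorithms; as a rough idea of what the K-Means clustering step does, here is a minimal one-dimensional version in plain Python (the spend values and starting centers are invented, and real K-Means works in many dimensions with smarter initialization).

```python
from statistics import mean

# Toy 1-D customer spend data with two visible groups (low vs high spenders).
spend = [10, 12, 11, 90, 95, 88]

def kmeans_1d(data, centers, iters=10):
    """Minimal 1-D K-Means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d(spend, centers=[0.0, 100.0])
```

After a few iterations the centers settle near the two group means, which is exactly the "discover natural groupings" behavior the clustering step relies on.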
9. Data Stream Mining
Definition: Data stream mining is the process of extracting useful insights and patterns from a continuous stream of real-time data. It differs from traditional mining because it processes data as it arrives rather than storing the complete dataset first.
Examples:
Monitoring website activity to detect sudden surges in interest (e.g., spike in “election results” searches).
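A surge detector of this kind can be sketched with a sliding window (the per-minute counts and the 3x-baseline threshold below are invented for illustration): only the last few minutes are kept in memory, matching the stream-mining constraint of never storing the full stream.

```python
from collections import deque

def spike_detector(counts_per_minute, window=5, factor=3.0):
    """Flag minutes whose count exceeds `factor` times the average
    of the previous `window` minutes (a simple surge rule)."""
    history = deque(maxlen=window)   # only the window is kept, not the stream
    spikes = []
    for minute, count in enumerate(counts_per_minute):
        if len(history) == window:
            baseline = sum(history) / window
            if count > factor * baseline:
                spikes.append(minute)
        history.append(count)
    return spikes

# e.g. searches for "election results" per minute: steady, then a surge
stream = [100, 110, 95, 105, 100, 98, 102, 900, 950, 120]
print(spike_detector(stream))  # → [7, 8]
```

The `deque(maxlen=window)` is the key design choice: memory stays constant no matter how long the stream runs, which is what distinguishes stream mining from batch analysis.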