Big data refers to datasets so large, fast, or varied that ordinary tools are not enough to store, process, and analyze them effectively. Data science is the field that turns those raw data streams into useful patterns, predictions, and decisions. It matters because modern systems such as search engines, medical databases, weather models, and recommendation apps all depend on extracting meaning from huge amounts of information.

A typical data science pipeline collects raw data, cleans it, stores it, analyzes it, and then communicates the results. Big data systems often use distributed computing, where many machines work together on parts of the same problem. Machine learning models can then detect trends, classify examples, or make predictions, but the quality of the output depends strongly on the quality, fairness, and relevance of the input data.
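The pipeline and the idea of splitting work across machines can be sketched in a few lines of Python. This is a minimal illustration, not a real big data system: the sensor readings are made up, the helper names (`clean`, `partition`, `partial_stats`, `distributed_mean`) are hypothetical, and threads on one computer stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical raw readings: (sensor_id, temperature). None marks a missing value.
raw = [("a", 20.0), ("b", None), ("a", 20.0), ("c", 22.0), ("d", 24.0), ("e", 26.0)]

def clean(records):
    """Drop records with missing values and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec[1] is None or rec in seen:
            continue
        seen.add(rec)
        cleaned.append(rec)
    return cleaned

def partition(records, n_parts):
    """Split records into n_parts roughly equal chunks, one per worker."""
    return [records[i::n_parts] for i in range(n_parts)]

def partial_stats(chunk):
    """Each worker returns a (sum, count) pair for its own chunk."""
    values = [value for _, value in chunk]
    return sum(values), len(values)

def distributed_mean(records, n_workers=2):
    """Combine the workers' partial results into one overall mean."""
    chunks = partition(records, n_workers)
    # Assumption: threads simulate machines; a real cluster would send chunks over a network.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

cleaned = clean(raw)                   # clean
mean_temp = distributed_mean(cleaned)  # analyze
print(f"{len(cleaned)} clean records, mean temperature {mean_temp:.1f}")  # communicate
# prints: 4 clean records, mean temperature 23.0
```

Each worker only needs to return a sum and a count, so the final answer can be assembled without any single worker seeing the whole dataset. That is the basic pattern behind many distributed calculations.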

Key Facts

  • The 5 V's of big data are volume, velocity, variety, veracity, and value.
  • Storage needed = number of records × size per record.
  • Throughput = data processed ÷ processing time.
  • Speedup = time on one computer ÷ time on multiple computers.
  • Accuracy = correct predictions ÷ total predictions.
  • A data science pipeline often follows: collect, clean, store, analyze, model, visualize, decide.
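The four formulas above translate directly into code. The sketch below uses made-up numbers chosen only for illustration, and the function names are hypothetical. It assumes 1 GB = 1,000,000,000 bytes.

```python
def storage_bytes(num_records, bytes_per_record):
    """Storage needed = number of records x size per record."""
    return num_records * bytes_per_record

def throughput(data_processed, processing_time):
    """Throughput = data processed / processing time."""
    return data_processed / processing_time

def speedup(time_one_computer, time_multiple_computers):
    """Speedup = time on one computer / time on multiple computers."""
    return time_one_computer / time_multiple_computers

def accuracy(correct_predictions, total_predictions):
    """Accuracy = correct predictions / total predictions."""
    return correct_predictions / total_predictions

BYTES_PER_GB = 1_000_000_000  # assumption: decimal gigabytes

# Hypothetical example values
print(storage_bytes(1_000_000, 200) / BYTES_PER_GB)  # 0.2 (GB)
print(throughput(600, 120))                          # 5.0 (e.g., GB per minute)
print(speedup(10, 2.5))                              # 4.0 (times faster)
print(accuracy(90, 100))                             # 0.9 (90% correct)
```

Keeping track of units matters as much as the arithmetic: throughput is only meaningful once you state what the data and time units are.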

Vocabulary

Big Data
Big data is data that is too large, fast-moving, or complex for traditional processing tools to handle easily.
Data Science
Data science is the practice of using statistics, computing, and domain knowledge to find useful insights in data.
Distributed Computing
Distributed computing uses many connected computers to store data or solve parts of a problem at the same time.
Machine Learning
Machine learning is a method in which algorithms improve at a task by finding patterns in data.
Data Visualization
Data visualization is the use of charts, graphs, maps, or diagrams to make patterns in data easier to understand.

Common Mistakes to Avoid

  • Confusing big data with good data. A huge dataset can still be biased, incomplete, duplicated, or poorly measured.
  • Skipping data cleaning before analysis. Dirty data can produce misleading averages, false trends, and incorrect model predictions.
  • Treating correlation as causation. Two variables moving together does not prove that one directly causes the other.
  • Judging a model only by accuracy. Accuracy can hide poor performance on rare classes, biased subgroups, or high-cost errors.
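The last mistake is easy to see with a small, made-up example. Suppose only 5 out of 100 cases belong to a rare class, and a hypothetical model always predicts the common class. The sketch below compares accuracy with recall (the share of actual rare cases the model finds). The labels are invented for illustration.

```python
# Hypothetical labels: 1 = rare class (for example, fraud), 0 = common class.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # a lazy model that always predicts the common class

def accuracy(y_true, y_pred):
    """Correct predictions divided by total predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of actual positive cases that the model predicted as positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

The model scores 95% accuracy while missing every rare case, which is why accuracy should be checked alongside other measures.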

Practice Questions

  1. A sensor network creates 2,000,000 records per day, and each record is 500 bytes. How many gigabytes of data are created in 30 days if 1 GB = 1,000,000,000 bytes?
  2. A data job takes 8 hours on one computer and 1.25 hours on a cluster. What is the speedup?
  3. A company trains a hiring model using only data from past employees. Explain why this dataset could create biased predictions and name one way to reduce the problem.