Spark Note
BerkeleyX: CS120x
- Course website
- Databricks
Course outline:
- Week 1: ML overview, math review, Spark RDD overview
- Week 2: Distributed ML principles and linear regression
- Week 3: Classification with click-through rate prediction
- Week 4: Exploratory analysis with brain imaging data
How to Handle Massive Data
- scale-up (one big machine)
- scale-out (many machines, i.e., distributed)
What is ML?
ML builds methods that learn from data and make predictions on data.
unsupervised learning: learning from unlabeled observations.
- The learning algorithm must find latent structure from the features alone
- Can be goal in itself (discover hidden patterns, exploratory data analysis)
- Can be means to an end (preprocessing for supervised task)
Linear Algebra Review
A vector $$ a \in \mathbb{R}^4 $$ has four real-valued entries.
- addition and subtraction
- Matrix Scalar Multiplication
- Matrix-vector product
- Matrix-matrix multiplication
- Identity matrix
- inverse matrix
- Euclidean norm for vectors: the Euclidean norm of $$ x \in \mathbb{R}^m $$ is denoted by $$ \|x\|_2 $$
$$ \|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_m^2} $$
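The operations listed above can be sketched in plain Python, which keeps each definition visible. This is only an illustration: the helper names (`add`, `scale`, `mat_vec`, `mat_mul`, `identity`, `euclidean_norm`) and the example matrices are made up for these notes, and real code would normally use a library such as NumPy.

```python
import math

def add(A, B):
    """Element-wise sum of two matrices with the same shape."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

def scale(c, A):
    """Multiply every entry of matrix A by the scalar c."""
    return [[c * a for a in row] for row in A]

def mat_vec(A, x):
    """Matrix-vector product: entry i is the dot product of row i of A with x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def mat_mul(A, B):
    """Matrix-matrix product: entry (i, j) is row i of A dotted with column j of B."""
    cols_of_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_of_B] for row in A]

def identity(n):
    """n x n identity matrix: ones on the diagonal, zeros elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def euclidean_norm(x):
    """Square root of the sum of squared entries of x."""
    return math.sqrt(sum(xi * xi for xi in x))

# Hypothetical example data.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
A_inv = [[-2.0, 1.0], [1.5, -0.5]]  # inverse of A, worked out by hand (det A = -2)

print(add(A, B))                     # [[1, 3], [4, 4]]
print(scale(2, A))                   # [[2, 4], [6, 8]]
print(mat_vec(A, [1, 1]))            # [3, 7]
print(mat_mul(A, B))                 # [[2, 1], [4, 3]]
print(mat_mul(A, identity(2)) == A)  # True: multiplying by the identity leaves A unchanged
print(mat_mul(A, A_inv))             # [[1.0, 0.0], [0.0, 1.0]]: A times its inverse is the identity
print(euclidean_norm([3, 4]))        # 5.0
```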
Big O Notation
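- O(1) complexity: constant-time algorithms perform a fixed number of operations, independent of the input size n.
- O(n) complexity: linear-time algorithms perform a number of operations proportional to n.
- O(n^2) complexity: quadratic-time algorithms perform a number of operations proportional to n^2.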
Time and space complexity can differ: an algorithm may take O(n^2) time yet need only O(1) extra space.
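A rough sketch of these complexity classes, using small hypothetical functions that are not from the course material. The last two functions both run in quadratic time, but only one of them also uses quadratic space.

```python
def first_item(items):
    """O(1) time: one operation, regardless of len(items)."""
    return items[0]

def total(items):
    """O(n) time, O(1) extra space: one addition per element."""
    result = 0
    for x in items:
        result += x
    return result

def all_pairs(items):
    """O(n^2) time and O(n^2) space: stores every ordered pair."""
    return [(a, b) for a in items for b in items]

def count_pairs_with_sum(items, target):
    """O(n^2) time but O(1) extra space: same nested loops, only a counter is kept."""
    count = 0
    for a in items:
        for b in items:
            if a + b == target:
                count += 1
    return count

data = [1, 2, 3, 4]
print(first_item(data))               # 1
print(total(data))                    # 10
print(len(all_pairs(data)))           # 16, i.e. n^2 pairs for n = 4
print(count_pairs_with_sum(data, 5))  # 4: (1,4), (2,3), (3,2), (4,1)
```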
RDD Fundamentals
Resilient Distributed Datasets (RDDs)
- Write programs in terms of operations on distributed data
- partitioned collections of objects spread across a cluster
- Diverse set of parallel transformations and actions
- Fault tolerant
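A minimal PySpark sketch of the points above, assuming PySpark is installed and a local master is acceptable. The helper names (`square`, `is_even`, `even_squares`, `main`), the `local[2]` master, and the app name are illustrative choices, not part of the course material. The Spark import sits inside `main()` so the helper functions can be loaded on a machine without Spark.

```python
def square(x):
    return x * x

def is_even(x):
    return x % 2 == 0

def even_squares(sc, numbers, num_partitions=2):
    """Return the even squares of `numbers`, computed through an RDD."""
    rdd = sc.parallelize(numbers, num_partitions)  # partitioned collection spread over workers
    squared = rdd.map(square)                      # transformation: lazy, nothing runs yet
    evens = squared.filter(is_even)                # transformation: still lazy
    return evens.collect()                         # action: triggers the distributed computation

def main():
    # Imported here so the helpers above do not require Spark to be installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "rdd-sketch")  # assumed local master and app name
    try:
        print(even_squares(sc, range(1, 6)))
    finally:
        sc.stop()
```

Calling `main()` starts a local Spark context, runs the two transformations when `collect()` is reached, and shuts the context down afterwards. If a partition is lost, Spark can recompute it from the recorded chain of transformations, which is what makes RDDs fault tolerant.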
RDD vs DataFrames
- RDDs provide a low-level interface into Apache Spark
- DataFrames have a schema
- DataFrames are optimized via the Catalyst optimizer
- DataFrames are built on top of RDDs and the core Spark APIs
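The contrast can be sketched in PySpark as follows, assuming PySpark is installed. The rows, column names, function names, and the `local[2]` master are hypothetical values chosen for illustration. As a design choice, the Spark import is kept inside `main()` so the module loads without Spark.

```python
ROWS = [("Alice", 34), ("Bob", 45)]  # hypothetical example data
COLUMNS = ["name", "age"]            # column names; Spark infers the column types

def build_people_df(spark):
    """Create a small DataFrame from ROWS using the column names in COLUMNS."""
    return spark.createDataFrame(ROWS, schema=COLUMNS)

def main():
    # Imported here so the constants above can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")     # assumed local master
             .appName("df-sketch")   # assumed app name
             .getOrCreate())
    try:
        df = build_people_df(spark)
        df.printSchema()                    # DataFrames carry a schema
        df.filter(df.age > 40).explain()    # prints the plan produced by the optimizer
        names = df.rdd.map(lambda row: row.name).collect()  # drop down to the underlying RDD of Row objects
        print(names)
    finally:
        spark.stop()
```

The `df.rdd` attribute exposes the RDD beneath the DataFrame, which is one way to see that the DataFrame API is layered on the lower-level one.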