Spark Note

BerkeleyX: CS120x

Course website
Databricks

Week 1: ML overview, math review, Spark RDD overview
Week 2: Distributed ML principles and linear regression
Week 3: Classification with click-through rate prediction
Week 4: Exploratory analysis with brain imaging data

How to Handle Massive Data

Scale-up (one big machine)
Scale-out (many machines, i.e., distributed)

What is ML?

ML constructs and studies methods that learn from and make predictions on data.

Unsupervised learning: learning from unlabeled observations.

The learning algorithm must find latent structure from the features alone
Can be a goal in itself (discovering hidden patterns, exploratory data analysis)
Can be a means to an end (preprocessing for a supervised task)

Linear Algebra Review

A vector $$ a \in \mathbb{R}^4 $$ has four real-valued entries.

Addition and subtraction
Matrix-scalar multiplication
Matrix-vector product
Matrix-matrix multiplication
Identity matrix
Inverse matrix
Euclidean norm for vectors

The Euclidean norm of $$ x \in \mathbb{R}^m $$ is denoted by $$ \|x\|_2 $$:

$$ \|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_m^2} $$
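The operations in this review can be tried out with a few lines of plain Python. This is a minimal sketch, not an efficient implementation: matrices are assumed to be lists of rows, and the helper names (`mat_vec`, `mat_mul`, `identity`, `euclidean_norm`) and the example values are made up for illustration.

```python
import math

def mat_vec(A, x):
    """Matrix-vector product: entry i is the dot product of row i of A with x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def mat_mul(A, B):
    """Matrix-matrix product: entry (i, j) is row i of A dotted with column j of B."""
    cols_of_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_of_B] for row in A]

def identity(n):
    """n-by-n identity matrix: ones on the diagonal, zeros elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def euclidean_norm(x):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x_i ** 2 for x_i in x))

# Illustrative values only.
A = [[1, 2], [3, 4]]
x = [3, 4]

y = mat_vec(A, x)                  # [1*3 + 2*4, 3*3 + 4*4]
same_A = mat_mul(A, identity(2))   # multiplying by the identity leaves A unchanged

# Inverse of A, worked out by hand: (1 / det) * [[4, -2], [-3, 1]] with det = -2.
A_inv = [[-2, 1], [1.5, -0.5]]
should_be_identity = mat_mul(A, A_inv)

norm_x = euclidean_norm(x)         # sqrt(3^2 + 4^2)
```

In practice a library such as NumPy would handle these operations; the loops here are only meant to make each definition explicit.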

Big O Notation

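O(1) complexity: constant-time algorithms perform a fixed number of operations regardless of input size.
O(n) complexity: linear-time algorithms perform a number of operations proportional to n.
O(n^2) complexity: quadratic-time algorithms perform a number of operations proportional to n^2.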

Time and space complexity can differ.
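A rough illustration of these complexity classes, written as plain Python functions. The function names and inputs are hypothetical and chosen only for this sketch. The last function shows one way time and space can differ: it takes quadratic time but only constant extra memory.

```python
def first_element(xs):
    """O(1) time: one operation no matter how long xs is."""
    return xs[0]

def inner_product(x, y):
    """O(n) time, O(1) extra space: one multiply-add per entry, one running total."""
    total = 0
    for x_i, y_i in zip(x, y):
        total += x_i * y_i
    return total

def outer_product(x, y):
    """O(n^2) time and O(n^2) space: fills an n-by-n table."""
    return [[x_i * y_j for y_j in y] for x_i in x]

def count_pairs_with_sum(xs, target):
    """O(n^2) time but O(1) extra space: checks every pair, stores only a counter."""
    count = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            if xs[i] + xs[j] == target:
                count += 1
    return count

dot = inner_product([1, 2, 3], [4, 5, 6])
table = outer_product([1, 2], [3, 4])
pairs = count_pairs_with_sum([1, 2, 3, 4], 5)
```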

RDD Fundamentals

Resilient Distributed Datasets (RDDs)

Write programs in terms of operations on distributed data
Partitioned collections of objects spread across a cluster
Diverse set of parallel transformations and actions
Fault tolerant
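The ideas in this list can be mimicked on a single machine in plain Python. The `ToyRDD` class below is a hypothetical stand-in written for these notes, not the Spark API. It only sketches three points: data is split into partitions, transformations such as `map` are recorded lazily, and actions such as `collect` and `reduce` trigger the computation. It also assumes that keeping the source data plus the recorded functions is enough to rebuild a lost partition, which is the intuition behind fault tolerance.

```python
import functools

class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not part of Spark)."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions   # source data, one list per partition
        self.lineage = lineage         # recorded per-element functions, not yet applied

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Split the data into contiguous chunks, one per partition.
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, fn):
        # Transformation: lazy, it only extends the recorded lineage.
        return ToyRDD(self.partitions, self.lineage + (fn,))

    def compute_partition(self, index):
        # Rebuild one partition from its source data and the lineage.
        rows = self.partitions[index]
        for fn in self.lineage:
            rows = [fn(row) for row in rows]
        return rows

    def collect(self):
        # Action: computes every partition and gathers the results.
        return [row for i in range(len(self.partitions))
                for row in self.compute_partition(i)]

    def reduce(self, fn):
        # Action: reduce each partition, then combine the partial results.
        # Assumes every partition is non-empty.
        partials = [functools.reduce(fn, self.compute_partition(i))
                    for i in range(len(self.partitions))]
        return functools.reduce(fn, partials)

numbers = ToyRDD.parallelize([1, 2, 3, 4, 5, 6], num_partitions=2)
squares = numbers.map(lambda v: v * v)        # nothing is computed yet
total = squares.reduce(lambda a, b: a + b)    # action triggers the work
recovered = squares.compute_partition(1)      # recompute a single partition
```

In real Spark the partitions live on different machines and the scheduler decides where each one is computed; this sketch only keeps the bookkeeping.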

RDD vs DataFrames

RDDs provide a low-level interface into Apache Spark
DataFrames have a schema
DataFrames are optimized via Catalyst
DataFrames are built on top of the RDD and core Spark APIs

Spark