Spark Note

BerkeleyX: CS120x

  • Week 1: ML overview, math review, Spark RDD overview
  • Week 2: Distributed ML principles and linear regression
  • Week 3: Classification with click-through rate prediction
  • Week 4: Exploratory analysis with brain imaging data

How to Handle Massive Data

  • scale-up (one big machine)
  • scale-out (many machines, i.e., distributed)

What is ML?

Machine learning builds methods that learn from and make predictions on data.

unsupervised learning: learning from unlabeled observations.

  • The learning algorithm must find latent structure from the features alone
  • Can be a goal in itself (discovering hidden patterns, exploratory data analysis)
  • Can be a means to an end (preprocessing for a supervised task)
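As a minimal sketch of finding structure from features alone, the snippet below runs a one-dimensional k-means (Lloyd's algorithm) on six unlabeled points. The function name `kmeans_1d`, the sample data, and the starting centers are all illustrative assumptions, not part of the course material.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data: alternate assignment and mean update."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Unlabeled observations: one feature per point, no target values.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = kmeans_1d(points, centers=[0.0, 6.0])
```

No labels are supplied, yet the points separate into a group near 1 and a group near 5. The resulting cluster assignments could then serve as a preprocessing step for a supervised task.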

Linear Algebra Review

$$ a \in \mathbb{R}^4 $$ means that a is a vector with four real-valued entries.

  • addition and subtraction
  • matrix-scalar multiplication
  • matrix-vector product
  • matrix-matrix multiplication

  • identity matrix

  • inverse matrix
  • Euclidean norm for vectors

The Euclidean norm for $$ x \in \mathbb{R}^m $$ is denoted by $$ \|x\|_2 $$:

$$ \|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_m^2} $$
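The operations listed above can be sketched in plain Python with nested lists standing in for matrices. The helper names (`mat_vec`, `mat_mat`, `euclidean_norm`) and the sample values are illustrative assumptions; in practice a library such as NumPy would normally be used instead.

```python
import math


def mat_vec(A, x):
    """Matrix-vector product: entry i is the dot product of row i of A with x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def mat_mat(A, B):
    """Matrix-matrix product: entry (i, j) is row i of A dotted with column j of B."""
    cols = list(zip(*B))  # columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def euclidean_norm(x):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for v in x))


A = [[1, 2], [3, 4]]
I = [[1, 0], [0, 1]]  # 2 x 2 identity matrix
x = [3, 4]

Ax = mat_vec(A, x)          # [1*3 + 2*4, 3*3 + 4*4] = [11, 25]
AI = mat_mat(A, I)          # multiplying by the identity returns A
norm_x = euclidean_norm(x)  # sqrt(9 + 16) = 5.0
```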

Big O Notation

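  • O(1) complexity: constant-time algorithms perform the same number of operations regardless of input size.
  • O(n) complexity: linear-time algorithms perform a number of operations proportional to the input size n.
  • O(n^2) complexity: quadratic-time algorithms perform a number of operations proportional to the square of the input size.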

Time and space complexity can differ.
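A small sketch of these complexity classes follows, with hypothetical function names chosen for illustration. The last two functions both take quadratic time, but one keeps a single running total while the other stores every product, which shows how time and space costs can differ.

```python
def first_item(items):
    # O(1) time: one lookup, regardless of how long the list is.
    return items[0]


def total(items):
    # O(n) time, O(1) extra space: visits every element once.
    s = 0
    for v in items:
        s += v
    return s


def pairwise_sum(items):
    # O(n^2) time, O(1) extra space: every pair is visited,
    # but only a single running total is stored.
    s = 0
    for a in items:
        for b in items:
            s += a * b
    return s


def pairwise_table(items):
    # O(n^2) time and O(n^2) space: every product is kept in memory.
    return [[a * b for b in items] for a in items]


items = [1, 2, 3]
head = first_item(items)
linear = total(items)
quadratic_sum = pairwise_sum(items)
table = pairwise_table(items)
```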

RDD Fundamentals

Resilient Distributed Datasets (RDDs)

  • write programs in terms of operations on distributed data
  • partitioned collections of objects spread across a cluster
  • Diverse set of parallel transformations and actions
  • Fault tolerant
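The properties above can be illustrated with a toy, single-process model. This is not the Spark API: `ToyRDD` and its methods are hypothetical stand-ins that only borrow Spark's vocabulary. The sketch assumes that transformations are recorded lazily as lineage and replayed per partition when an action runs, so a lost partition could be rebuilt by replaying that lineage.

```python
import functools


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus recorded lineage."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions  # source data split into chunks
        self.lineage = lineage        # transformations recorded, not yet run

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Deal the items round-robin into num_partitions chunks.
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self.partitions, self.lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.partitions, self.lineage + (("filter", pred),))

    def _compute(self, part):
        # Replay the lineage on one partition. Re-running this on the
        # source chunk is how a lost partition could be recomputed.
        for kind, fn in self.lineage:
            if kind == "map":
                part = [fn(v) for v in part]
            else:
                part = [v for v in part if fn(v)]
        return part

    # Actions: trigger the computation and return a result.
    def collect(self):
        return [v for part in self.partitions for v in self._compute(part)]

    def reduce(self, fn):
        computed = [self._compute(p) for p in self.partitions]
        partials = [functools.reduce(fn, p) for p in computed if p]
        return functools.reduce(fn, partials)


rdd = ToyRDD.parallelize(list(range(1, 11)), num_partitions=3)
squares_of_evens = rdd.filter(lambda v: v % 2 == 0).map(lambda v: v * v)
collected = sorted(squares_of_evens.collect())
total = squares_of_evens.reduce(lambda a, b: a + b)
```

Each partition is processed independently and the partial results are combined at the end, which is the part a real cluster would spread across machines.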

RDD vs DataFrames

  • RDDs provide a low-level interface into Apache Spark
  • DataFrames have a schema
  • DataFrames are optimized via the Catalyst optimizer
  • DataFrames are built on top of RDDs and the core Spark APIs
