Spark can run on a cluster managed by Mesos or by Hadoop's YARN.
Spark can interface with many file systems, such as HDFS; unlike Hadoop MapReduce, it does not implement its own JobTracker or file system.
Spark is not rack aware.
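A rough sketch of how the cluster manager gets picked: the master URL decides it (the app name and URLs below are placeholders, not from these notes).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("example")
  .master("yarn")   // or "mesos://host:5050", "spark://host:7077", "local[*]"
  .getOrCreate()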
When running a Spark program, we have access to:
SparkSQL : We can write SQL queries directly against the data structures (see the sketch after this list).
MLlib : Machine learning library (e.g., run KNN on a dataset).
GraphX : Used for graph analysis and graph algorithms.
Streaming : For data that never ends; it has its own APIs.
Parquet : A standard columnar storage format.
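A minimal SparkSQL sketch, assuming a SparkSession named spark; the people.json file and its name column are made up for illustration:

val df2 = spark.read.json("people.json")     // hypothetical input file
df2.createOrReplaceTempView("people")        // register the DataFrame under a SQL-visible name
spark.sql("SELECT name FROM people").show()  // plain SQL against the data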
val people = spark.read.parquet("...").as[Person]   // Scala: Person supplies the schema for a typed Dataset
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class));   // Java equivalent
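For the typed read above to compile, Person must be a case class whose fields match the Parquet schema; a minimal sketch with assumed field names:

import spark.implicits._                     // brings the Encoder for case classes into scope
case class Person(name: String, age: Long)   // field names/types assumed; must match the Parquet columns
val people = spark.read.parquet("people.parquet").as[Person]   // typed Dataset[Person]
people.filter(_.age > 21).show()             // field access is checked at compile time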
map : Goes row by row and builds a new Dataset x containing the result of the function passed to map.
val x = df.map(_.name)     // x contains only the 'name' values, one per row
val x = df.select("name")  // produces the same values as the statement above
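Worth noting: map returns a typed Dataset while select returns a DataFrame. A small sketch, assuming df is the Dataset[Person] from above:

val names1 = df.map(_.name)      // Dataset[String]: one value per row
val names2 = df.select("name")   // DataFrame: a single column called "name"
names1.show()                    // both show the same values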
MAP
SELECT
FILTER
JOIN
printSchema
foreach
show
take
takeAsList
Group By functions : agg, avg, count, max, mean, min, pivot, sum
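pivot is the only one of these not shown in the session below; a hedged sketch, assuming df has Building, month, and temp columns (all hypothetical):

import org.apache.spark.sql.functions.avg
val wide = df.groupBy("Building").pivot("month").agg(avg("temp"))   // one column per distinct month value
wide.show()   // one row per Building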
>> spark                      // the SparkSession entry point
>> val d = spark.
>> sc.setLogLevel("ERROR")    // silence INFO/WARN output in the shell
>> val df = spark.read.
>> val df = spark.read.csv("sdsfdsfd").withColumn("last", $"_c26".cast("Double"))   // take column _c26 and cast it to Double (cast applies to the column, not the DataFrame)
>> df.select("docID","_c26")   // same as spark.sql("SELECT docID, _c26 FROM df") once df is registered as a temp view
>> df.select("docID","_c26").head(100)                 // returns the first 100 rows as an Array
>> df.select("docID","_c26").sort(desc("_c26")).head   // highest _c26 value first
>> d2.persist   // cache d2 in memory so later actions reuse it
>> d2.count     // count the number of rows (an action; triggers computation)
>> val result = df.groupBy("roomName").avg("temp")                                // average temp per room
>> val result = df.groupBy("Building","roomName").agg(avg("temp"), max("Time"))   // several aggregates at once
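For the session above to run, the column functions need importing; the StorageLevel line is an assumption about what "save it in memory" intends:

>> import org.apache.spark.sql.functions.{avg, max, desc}   // column functions used in the lines above
>> import org.apache.spark.storage.StorageLevel
>> d2.persist(StorageLevel.MEMORY_ONLY)                     // explicit in-memory-only caching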