docker run -p 8080:8080 --rm -d -t -v D:\data\zeppelin\logs:/logs -v D:\data\zeppelin\notebook:/notebook -e ZEPPELIN_LOG_DIR='/logs' apache/zeppelin:(version)
%pig  (verifies the Pig interpreter is working)
A = LOAD '/shared/(path name)';
B = LIMIT A 10;
DUMP B;      ------> JSON data
DESCRIBE A;  ------> Schema for A is unknown
Elephant Bird : Developed by Twitter to read JSON files. Required jars:
Elephant Bird core
Elephant Bird pig
Elephant Bird hadoop compat
%pig
REGISTER '/shared/(path name)/elephant-bird-core-4.15.jar'
REGISTER '/shared/(path name)/elephant-bird-pig-4.15.jar'
REGISTER '/shared/(path name)/elephant-bird-hadoop-compat-4.15.jar'
REGISTER '/shared/(path name)/json-simple-1.1.1.jar'
A = LOAD '/shared/(path name)/business.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (Json: map[]);
DESCRIBE A; -----> {Json: map[]}
DUMP A; --------> (key,value) pair
%pig
B = FOREACH A GENERATE Json#'city';  (iterate through each record and pull the 'city' key out of the map)
Load the JSON
Iterate through it
Inner (nested) structures: you need to FLATTEN them
CAT = FOREACH A GENERATE FLATTEN(Json#'(field name)');
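The load / iterate / flatten steps above can be sketched in plain Python (a minimal illustration of the idea, not Pig itself; the sample records and the 'city'/'categories' keys are invented for the example):

```python
import json

# Two JSON records, one per line, as a JSON-lines loader would see them.
# The records and field names here are made up for illustration.
lines = [
    '{"name": "Cafe A", "city": "Austin", "categories": ["Coffee", "Breakfast"]}',
    '{"name": "Bar B", "city": "Dallas", "categories": ["Bars"]}',
]

# Load: parse each line into a dict (Pig: LOAD ... AS (Json: map[]))
records = [json.loads(line) for line in lines]

# Iterate: pull one key out of each record (Pig: GENERATE Json#'city')
cities = [r["city"] for r in records]

# Flatten: expand the nested list so each element becomes its own row
# (Pig: GENERATE FLATTEN(...))
flat = [(r["name"], c) for r in records for c in r["categories"]]

print(cities)  # ['Austin', 'Dallas']
print(flat)    # [('Cafe A', 'Coffee'), ('Cafe A', 'Breakfast'), ('Bar B', 'Bars')]
```

FLATTEN is the step that turns one record holding a list into several rows, one per list element.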
We have to do the same thing using Spark (extra credit)
%spark
sc
Basic data structures in Spark:
RDD Basics: Resilient Distributed Dataset
Spark DataFrames:
DataFrames are immutable
PySpark: you can convert a pandas DataFrame to a Spark DataFrame and back
Spark Motivation:
Keeps data structures in memory (avoids repeated disk I/O between steps).
To run pyspark (inside Hadoop): pyspark --master local[2]
Homework will require an aggregate or a select.
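A rough plain-Python sketch of what "select" and "aggregate" mean here (not PySpark itself; the toy rows and the grouping key are invented for illustration):

```python
from collections import defaultdict

# Toy rows, invented for illustration: (city, review_count)
rows = [("Austin", 10), ("Dallas", 5), ("Austin", 7), ("Dallas", 3)]

# "Select": keep only the rows/columns you need
austin_rows = [r for r in rows if r[0] == "Austin"]

# "Aggregate": group by city and sum the counts, roughly what
# df.groupBy("city").sum("review_count") does on a Spark DataFrame
totals = defaultdict(int)
for city, count in rows:
    totals[city] += count

print(austin_rows)    # [('Austin', 10), ('Austin', 7)]
print(dict(totals))   # {'Austin': 17, 'Dallas': 8}
```

In Spark the same operations run distributed across partitions, but the logic per group is the same.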