docker run -p 8080:8080 --rm -d -t -v D:\data\zeppelin\logs:/logs -v D:\data\zeppelin\notebook:/notebook -e ZEPPELIN_LOG_DIR='/logs'

%pig   (verifies that the Pig interpreter works)
A = LOAD '/shared/(path name)';


B = LIMIT A 10;
DUMP B;     ------> raw JSON data (plain strings, no structure yet)

DESCRIBE A; ------> "Schema for A unknown" (no schema was specified in the LOAD)


Elephant Bird: a library developed by Twitter; its JSON loader is what lets Pig read JSON files. Register these jars:

Elephant Bird core

Elephant Bird pig

Elephant Bird hadoop compat


%pig
REGISTER '/shared/(path name)/elephant-bird-core-4.15.jar'
REGISTER '/shared/(path name)/elephant-bird-pig-4.15.jar'
REGISTER '/shared/(path name)/elephant-bird-hadoop-compat-4.15.jar'
REGISTER '/shared/(path name)/json-simple-1.1.1.jar'

A = LOAD '/shared/(path name)/business.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (Json: map[]);


DESCRIBE A; ------> {Json: map[]}
DUMP A;     ------> one map of (key, value) pairs per record


%pig
B = FOREACH A GENERATE Json#'city';   -- iterate through each record and pull the 'city' key out of the map


Load the JSON

Iterate through it

Inner (nested) structure: you need to FLATTEN it, for example:

CAT = FOREACH A GENERATE FLATTEN(Json#'some_nested_field');   -- 'some_nested_field' is a placeholder for whichever nested key you need

We have to do the same thing using Spark (extra credit); a sketch follows below.
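
A minimal PySpark sketch of the same steps, assuming a %pyspark paragraph is available in the notebook and the SparkSession is exposed as spark; the path and the 'city' field come from the Pig example above, while 'categories' is only a hypothetical nested field:

%pyspark
from pyspark.sql.functions import explode

# spark.read.json infers the schema, including nested structures, so no extra jars are needed
df = spark.read.json('/shared/(path name)/business.json')

df.printSchema()             # analogous to DESCRIBE A in Pig
df.select('city').show(10)   # analogous to FOREACH A GENERATE Json#'city'

# For nested array fields, explode() plays the role of Pig's FLATTEN:
# df.select(explode('categories')).show(10)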


%spark
sc    (typing sc prints the SparkContext and verifies the Spark interpreter works)

Basic data structures in Spark:

RDD basics: RDD = Resilient Distributed Dataset
DataFrames are immutable (see the small illustration below)
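
A tiny illustration of the two structures, assuming sc and spark are the SparkContext / SparkSession that Zeppelin preconfigures (the data is purely illustrative):

%pyspark
# RDD: a low-level, immutable, partitioned collection of records
rdd = sc.parallelize([('Phoenix', 4.5), ('Las Vegas', 3.0)])

# DataFrame: an immutable, schema-aware table built on top of RDDs
df = spark.createDataFrame(rdd, ['city', 'stars'])

# Transformations return new objects; the original df is never modified in place
high = df.filter(df.stars > 4.0)
high.show()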

Spark DataFrames:

PySpark: you can move data back and forth between pandas DataFrames and Spark DataFrames, as sketched below.
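
A small sketch of that exchange, again assuming spark is the SparkSession (column names and values are purely illustrative):

%pyspark
import pandas as pd

# pandas -> Spark: createDataFrame accepts a pandas DataFrame directly
pdf = pd.DataFrame({'city': ['Phoenix', 'Las Vegas'], 'stars': [4.5, 3.0]})
sdf = spark.createDataFrame(pdf)

# Spark -> pandas: toPandas() collects the whole DataFrame to the driver, so keep it small
pdf_again = sdf.toPandas()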

Spark motivation:

Spark keeps its data structures in memory rather than writing intermediate results to disk between steps (as MapReduce does).

To run PySpark (inside the Hadoop container): pyspark --master local[2]

The homework will require an aggregate or a select; both are sketched below.
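
A rough PySpark reference for both operations; the business.json path is reused from above, and 'city' / 'stars' are placeholder column names:

%pyspark
from pyspark.sql import functions as F

df = spark.read.json('/shared/(path name)/business.json')

# select: project specific columns
df.select('city', 'stars').show(10)

# aggregate: group by one column and compute summary statistics
df.groupBy('city').agg(F.avg('stars').alias('avg_stars'),
                       F.count('*').alias('n')).show(10)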
