What is Apache Pig?
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts in the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
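For example, a minimal Pig Latin script might look like the sketch below (the file path and field names are assumptions for illustration); when it is run, the Pig Engine compiles it into one or more MapReduce jobs.

```pig
-- Load student records from HDFS (hypothetical path and schema)
students = LOAD '/data/student.txt' USING PigStorage('\t')
           AS (id:int, name:chararray, gpa:float);

-- A simple transformation followed by an output step
toppers = FILTER students BY gpa >= 3.5;
DUMP toppers;   -- triggers the underlying MapReduce job(s)
```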
Why do we need Pig?
Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex code in Java.
Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require you to write 200 lines of code (LoC) in Java can often be done in as few as 10 LoC in Apache Pig. Ultimately, Apache Pig reduces development time by almost a factor of 16.
Pig Latin is an SQL-like language, so Apache Pig is easy to learn if you are already familiar with SQL.
Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.
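As a hedged illustration (the relation and field names below are assumed), grouping in Pig Latin naturally produces the nested bag type:

```pig
-- Hypothetical sales data: (store, amount)
sales = LOAD 'sales.csv' USING PigStorage(',') AS (store:chararray, amount:double);

-- GROUP produces a nested structure: {group: chararray, sales: bag{(store, amount)}}
by_store = GROUP sales BY store;

-- A built-in aggregate is applied to the bag of each group
totals = FOREACH by_store GENERATE group AS store, SUM(sales.amount) AS total;
DUMP totals;
```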
Features of Pig
Apache Pig comes with the following features −
- Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
- Ease of programming − Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
- Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language.
- Extensibility − Using the existing operators, users can develop their own functions to read, process, and write data.
- UDFs − Pig provides the facility to create User Defined Functions in other programming languages such as Java and to invoke or embed them in Pig scripts (see the sketch after this list).
- Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
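As a rough sketch of what such a UDF can look like in Java (the package and class names here are hypothetical), the function extends Pig's EvalFunc and is later registered in a Pig script:

```java
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: converts a chararray field to upper case
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Guard against empty or null input tuples
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```

In a Pig script, the jar would then be registered with `REGISTER myudfs.jar;` and the function invoked as, for example, `FOREACH students GENERATE myudfs.Upper(name);` (names again assumed for illustration).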
Apache Pig vs. MapReduce
Listed below are the major differences between Apache Pig and MapReduce.
| Apache Pig | MapReduce |
| --- | --- |
| Apache Pig is a data flow language. | MapReduce is a data processing paradigm. |
| It is a high-level language. | MapReduce is low-level and rigid. |
| Performing a Join operation in Apache Pig is pretty simple (see the sketch after this table). | It is quite difficult in MapReduce to perform a Join operation between datasets. |
| Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig. | Exposure to Java is a must to work with MapReduce. |
| Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent. | MapReduce requires almost 20 times as many lines of code to perform the same task. |
| There is no need for compilation; on execution, every Apache Pig operator is converted internally into a MapReduce job. | MapReduce jobs have a long compilation process. |
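As a minimal sketch (the file names and schemas below are assumed for illustration), a join in Pig Latin is a single operator:

```pig
-- Hypothetical inputs: customers.txt (id, name) and orders.txt (order_id, cust_id, amount)
customers = LOAD 'customers.txt' USING PigStorage(',') AS (id:int, name:chararray);
orders    = LOAD 'orders.txt'    USING PigStorage(',') AS (order_id:int, cust_id:int, amount:double);

-- One operator replaces the custom partitioning and sorting code a hand-written MapReduce join needs
joined = JOIN customers BY id, orders BY cust_id;
DUMP joined;
```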
Apache Pig Latin −
- Allows splits in the pipeline.
- Allows developers to store data anywhere in the pipeline.
- Declares execution plans.
- Provides operators to perform ETL (Extract, Transform, and Load) functions (see the sketch after this list).
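A rough sketch of such a pipeline is shown below (file names, fields, and thresholds are assumptions for illustration): data is extracted with LOAD, transformed with FILTER, split into two branches mid-pipeline, and stored at two different points.

```pig
-- Extract (hypothetical log file and schema)
logs = LOAD 'web_logs.txt' USING PigStorage('\t') AS (user:chararray, url:chararray, status:int);

-- Transform
clean = FILTER logs BY status IS NOT NULL;

-- Split the pipeline into two branches
SPLIT clean INTO errors IF status >= 400, ok IF status < 400;

-- Store results at any point in the pipeline
STORE errors INTO 'output/errors' USING PigStorage(',');
STORE ok     INTO 'output/ok'     USING PigStorage(',');
```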
Schema on Read:
Traditional data systems require users to create a schema before loading any data into the system. This allows such systems to tightly control the placement of the data at load time and therefore answer interactive queries very quickly, but it comes at the cost of agility. Hadoop instead offers a schema-on-read capability: data can start flowing into the system in its original form, and the schema is parsed at read time, so each user can apply their own "data lens" to interpret the data. This allows for extreme agility when dealing with complex, evolving data structures.
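Pig's LOAD operator is a simple example of schema on read: the sketch below (hypothetical file and fields) reads the same raw file both without any schema and with a schema applied at read time.

```pig
-- No schema declared up front: fields are referenced by position ($0, $1, ...)
raw  = LOAD 'events.log' USING PigStorage('\t');
urls = FOREACH raw GENERATE $2;

-- Another user applies their own "lens" to the same file at read time
events = LOAD 'events.log' USING PigStorage('\t')
         AS (ts:long, user:chararray, url:chararray);
recent = FILTER events BY ts > 1500000000L;
```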
Lazy Evaluation:
1. Objective
In this Apache Spark lazy evaluation tutorial, we will understand what lazy evaluation is in Apache Spark, how Spark manages the lazy evaluation of RDD transformations, the reason behind keeping Spark lazy, and what the advantages of lazy evaluation are in Spark transformations.
2. What is Lazy Evaluation in Apache Spark?
As the name itself indicates, lazy evaluation in Spark means that execution will not start until an action is triggered. In Spark, lazy evaluation comes into the picture when Spark transformations occur.
Transformations are lazy in nature, meaning that when we call some operation on an RDD, it does not execute immediately. Spark maintains a record of which operations have been called (through the DAG). We can think of a Spark RDD as a description of the data that we build up through transformations. Since transformations are lazy, we can execute the pipeline at any time by calling an action on the data. Hence, with lazy evaluation, data is not loaded or computed until it is necessary.
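A minimal sketch in Java (the data and the local master setting are assumptions for illustration) shows that the transformations below only build up the DAG, and nothing runs until the count() action is called:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyEvalDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("lazy-eval-demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

        // Transformations: recorded in the DAG, nothing executes yet
        JavaRDD<Integer> evens   = numbers.filter(n -> n % 2 == 0);
        JavaRDD<Integer> squares = evens.map(n -> n * n);

        // Action: triggers a single job that runs the whole pipeline
        long count = squares.count();
        System.out.println("count = " + count);

        sc.stop();
    }
}
```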
In MapReduce, developers waste a lot of time minimizing the number of MapReduce passes, which they do by manually clubbing operations together. In Spark, we instead write many simple operations, and the lazy execution graph groups them into an efficient plan for us. This is one of the key differences between Hadoop MapReduce and Apache Spark.
In Spark, the driver program loads the code to the cluster. If the code executed eagerly after every operation, the job would be time- and memory-consuming, because data would have to be sent to the cluster for evaluation each time. Lazy evaluation avoids this overhead.
3. Advantages of Lazy Evaluation in Spark Transformation
There are some benefits of lazy evaluation in Apache Spark −
a. Increases Manageability
With lazy evaluation, users can organize their Apache Spark program into many smaller operations. Spark then reduces the number of passes over the data by grouping those operations together.
b. Saves Computation and increases Speed
Spark lazy evaluation plays a key role in saving computation overhead, since only the values that are actually needed get computed. It also saves round trips between the driver and the cluster, which speeds up the process.
c. Reduces Complexities
The two main complexities of any operation are time complexity and space complexity. Using Apache Spark's lazy evaluation we can reduce both: since not every operation is executed eagerly, time is saved, and it even lets us work with conceptually infinite data structures. Because an action is triggered only when the data is required, overhead is reduced.
d. Optimization
Lazy evaluation provides optimization by reducing the number of queries Spark has to run.
4. Conclusion
Hence, lazy evaluation enhances the power of Apache Spark by reducing the execution time of RDD operations. Spark maintains the lineage graph to remember the operations applied to an RDD; as a result, it optimizes performance and achieves fault tolerance.