Introduction to Spark SQL

Spark SQL is Spark's interface for working with structured and semi-structured data. Structured data is any data that has a schema, meaning a known set of fields for each record; Hive tables and Parquet files are typical examples. Semi-structured data, such as JSON, carries its schema inside the data itself, so there is no separation between the schema and the data.

Features of Spark SQL:

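1) Integration with Spark. Spark SQL queries are integrated with Spark programs, so SQL can be mixed with regular Spark code in the same application.

2) Uniform data access. DataFrames and SQL provide a common way to read sources such as Hive, Parquet, JSON, and JDBC.

3) Hive compatibility. Existing HiveQL queries can run against existing Hive warehouses and metastores.

4) Standard connectivity. Spark SQL can be reached over JDBC and ODBC.

5) Performance and scalability. Queries pass through an optimizer and code generation, and scale across a cluster on the Spark engine.

6) User-defined functions. Custom functions can be registered and then called from SQL.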
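
To make the first and last features concrete, here is a minimal Scala sketch that mixes a SQL query with ordinary Spark code and calls a user-defined function from that query. It assumes Spark 2.0 or later running in local mode; the table name employees, the UDF name shout, and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlFeaturesSketch {

  // Returns upper-cased names of employees older than minAge.
  // The data, the view name and the UDF name are illustrative assumptions.
  def shoutedNamesOlderThan(spark: SparkSession, minAge: Int): Array[String] = {
    import spark.implicits._

    // Ordinary Spark code: build a small DataFrame from a local collection.
    val employees = Seq(("alice", 34), ("bob", 45), ("carol", 29)).toDF("name", "age")

    // Expose the DataFrame to SQL under a temporary view name.
    employees.createOrReplaceTempView("employees")

    // Register a Scala function so that SQL queries can call it by name.
    spark.udf.register("shout", (s: String) => s.toUpperCase)

    // SQL and program logic live side by side in the same application.
    val result = spark.sql(
      s"SELECT shout(name) AS name FROM employees WHERE age > $minAge")

    result.collect().map(_.getString(0))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-features-sketch")
      .master("local[*]") // local mode, assumed here for experimentation only
      .getOrCreate()

    shoutedNamesOlderThan(spark, 30).foreach(println)
    spark.stop()
  }
}
```

The result of spark.sql is itself a DataFrame, so it can be passed on to any other Spark code.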

Components of Spark SQL:

1) Spark SQL DataFrames

Before DataFrames, Spark had no provision for handling structured data and no optimization engine for it. The developer had to optimize each RDD by hand, based on its attributes. A Spark DataFrame is a distributed collection of data organized into named columns. If you remember a table in a relational database, a Spark DataFrame is similar to that.
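
As a rough illustration of the table analogy, the sketch below builds a DataFrame with named columns and then filters and aggregates it by column name. The column names and rows are invented for the example, and Spark 2.0 or later in local mode is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

object DataFrameSketch {

  // Average age per department for people older than 25.
  // Column names and sample rows are illustrative assumptions.
  def averageAgeByDept(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Each tuple becomes a row; toDF assigns the column names.
    val people = Seq(
      ("alice", "eng", 34),
      ("bob", "eng", 45),
      ("carol", "ops", 29)
    ).toDF("name", "dept", "age")

    // Columns are addressed by name, much like in a relational table.
    people
      .filter(col("age") > 25)
      .groupBy("dept")
      .agg(avg("age").as("avg_age"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]") // local mode, assumed for experimentation only
      .getOrCreate()

    val result = averageAgeByDept(spark)
    result.printSchema() // shows the named, typed columns
    result.show()
    spark.stop()
  }
}
```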

2) Spark SQL Datasets

The appeal of this interface is that it combines the benefits of RDDs with the benefits of the optimized execution engine of Spark SQL. An encoder handles the conversion between JVM objects and the tabular representation. A Dataset can be constructed from JVM objects and then manipulated with functional transformations such as map and filter.
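
A small Scala sketch of that idea follows. The Person case class and its sample values are hypothetical; importing spark.implicits._ is what brings the encoders for case classes and common types into scope. Spark 2.0 or later in local mode is assumed.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, defined at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object DatasetSketch {

  // Names of the people aged 18 or over, as a typed Dataset.
  def adultNames(spark: SparkSession): Dataset[String] = {
    // Brings the implicit encoders for case classes and common types into scope.
    import spark.implicits._

    // Build a Dataset directly from JVM objects.
    val people: Dataset[Person] = Seq(Person("alice", 34), Person("bob", 15)).toDS()

    // Functional transformations work on Person objects, not on untyped rows.
    people.filter(p => p.age >= 18).map(p => p.name)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]") // local mode, assumed for experimentation only
      .getOrCreate()

    adultNames(spark).show()
    spark.stop()
  }
}
```

Because the lambdas take a Person, a misspelled field name is caught by the Scala compiler rather than at run time.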

3) Spark Catalyst Optimizer

Catalyst is the optimizer used in Spark SQL, and every query written in SQL or with the DataFrame DSL is optimized by it. Hand-written RDD code gets no such treatment, because RDD operations are opaque to the engine, so queries that go through Catalyst generally perform better.
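
One way to watch Catalyst at work is to print the query plans, as in the sketch below. It chains two filters and compares the analyzed plan with the optimized one; one would expect the optimizer to merge adjacent filters, but the exact plan text depends on the Spark version. Note that queryExecution and the catalyst package are developer-facing internals rather than a stable public API, and the query itself is made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.functions.col

object CatalystSketch {

  // An illustrative query with two separate filters on the same column.
  def query(spark: SparkSession): DataFrame =
    spark.range(0, 100).toDF()
      .filter(col("id") > 10)
      .filter(col("id") < 20)

  // Counts Filter nodes in a logical plan (relies on Catalyst internals).
  def filterNodes(plan: LogicalPlan): Int =
    plan.collect { case f: Filter => f }.size

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-sketch")
      .master("local[*]") // local mode, assumed for experimentation only
      .getOrCreate()

    val df = query(spark)

    // Prints the parsed, analyzed, optimized and physical plans.
    df.explain(true)

    val qe = df.queryExecution
    println(s"Filter nodes before optimization: ${filterNodes(qe.analyzed)}")
    println(s"Filter nodes after optimization:  ${filterNodes(qe.optimizedPlan)}")

    spark.stop()
  }
}
```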

Conclusion

With Spark SQL, the Apache Software Foundation has delivered a carefully thought-out component for real-time analytics. As the analytics world runs into the shortcomings of Hadoop in providing real-time analytics, migrating to Spark becomes the obvious next step.
