Advanced Data Solutions : Spark 3.1 is now Generally Available on HDInsight

Notable Changes:

The Python version in Apache Spark 3.1.2 changes to 3.8. Python 2.x is no longer supported.
Accordingly, Jupyter and Zeppelin notebooks support only Python 3.8 from now on. Take note of this when migrating to Spark 3.1.2.
HDInsight Spark 3.1 ships with Apache Kafka client 2.4 jars, while open-source Spark 3.1 ships with Apache Kafka client 2.6 jars.
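
When porting notebooks, a small guard in the first cell can confirm which interpreter is in use. This is a minimal sketch using only the standard library; the names `REQUIRED` and `check_python` are illustrative and not part of HDInsight or Spark.

```python
import sys

# Minimum interpreter version, taken from the note above (Python 3.8).
REQUIRED = (3, 8)


def check_python(version_info=None, required=REQUIRED):
    """Return True when the (major, minor) version is at least `required`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(required)


# Warn early instead of failing later on Python 2 style syntax.
if not check_python():
    print("Warning: running Python %d.%d; this cluster expects 3.8" % sys.version_info[:2])
```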

Known Issues

The Hive Warehouse Connector (HWC) will ship with the next release cycle of HDI 4.0.
Customers who want to use HWC before the next release can follow these steps:

Set up the HWC v2 dependencies by running the respective Script Actions on the Spark and Interactive Query clusters.
Spark Cluster
- Execute the Spark cluster-specific Script Action on HeadNodes and WorkerNodes.
- On the Spark cluster's Script Actions screen, use the following options:
  
  => Script Type – Custom
  => Bash Script URI - https://hdiconfigactions2.blob.core.windows.net/install-hwc-v2/install_hwc_v2_spark_script_action.sh
  => Node Types - HeadNode, WorkerNode
- The script action for the Spark cluster installs hive-warehouse-connector-assembly-2.x.jar at the /usr/hdp/5.x.x.x/hive_warehouse_connector path.
- After installation, the Spark service is restarted automatically to add the new dependencies to the classpath.
- After successful execution, validate the presence of hive-warehouse-connector-assembly-2.x.jar at /usr/hdp/5.x.x.x/hive_warehouse_connector.
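
The validation step can be scripted on a cluster node. The sketch below is one possible check, not an official tool: the helper name `find_hwc_jars` is hypothetical, and wildcards stand in for the version directory and jar version, which are left unspecified above.

```python
import glob
import os


def find_hwc_jars(hdp_root="/usr/hdp"):
    """Return sorted paths of HWC assembly jars under any version directory."""
    pattern = os.path.join(
        hdp_root,
        "*",  # version directory (5.x.x.x in the text above)
        "hive_warehouse_connector",
        "hive-warehouse-connector-assembly-*.jar",
    )
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    jars = find_hwc_jars()
    print("\n".join(jars) if jars else "HWC assembly jar not found")
```

An empty result after the script action has finished suggests the installation did not complete on that node.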

Interactive Query Cluster
- Execute the Interactive Query cluster-specific Script Action on HeadNodes.
- On the Interactive Query cluster's Script Actions screen, use the following options:
  
  => Script Type – Custom
  => Bash Script URI - https://hdiconfigactions2.blob.core.windows.net/install-hwc-v2/install_hwc_v2_llap_script_action.sh
  => Node Types - HeadNode
- The script action for the Interactive Query cluster patches the following dependencies at the /usr/hdp/5.x.x.x/hive/libs path:
  - hive-exec.jar
  - hive-serde.jar
  - hive-llap-server.jar
  - arrow-memory-core-2.x.jar
  - arrow-memory-netty-2.x.jar
  - arrow-vector-2.x.jar
  - arrow-format-2.x.jar
- After installation, the Hive service is restarted automatically to add the new dependencies to the classpath.
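
To confirm that the Interactive Query cluster was patched, the directory listing can be compared against the dependency names given here. This is a hedged sketch: `EXPECTED_PATTERNS`, `missing_dependencies`, and `check_dir` are illustrative names, and the wildcard patterns are an assumption about how the "2.x" versions appear on disk.

```python
import fnmatch
import os

# File-name patterns derived from the dependency list; "2.x" is matched
# with a wildcard because the exact versions are not stated.
EXPECTED_PATTERNS = [
    "hive-exec*.jar",
    "hive-serde*.jar",
    "hive-llap-server*.jar",
    "arrow-memory-core-2.*.jar",
    "arrow-memory-netty-2.*.jar",
    "arrow-vector-2.*.jar",
    "arrow-format-2.*.jar",
]


def missing_dependencies(file_names, patterns=EXPECTED_PATTERNS):
    """Return the patterns that match none of the given file names."""
    return [p for p in patterns if not fnmatch.filter(file_names, p)]


def check_dir(libs_dir):
    """List `libs_dir` (the Hive libs path on the node) and report gaps."""
    return missing_dependencies(os.listdir(libs_dir))
```

Calling `check_dir` with the Hive libs path on a head node returns an empty list when every expected jar is present.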

Component versions

Component	Version
Apache Spark	3.1.2
Operating System	Ubuntu 18.04
Java	1.8.0_282
Scala	2.12.10
Hadoop	3.1.1
Python	3.8
Hive	3.1.0
Zookeeper	3.4.6
Zeppelin	0.8.0
Jupyter	1.0.0
Kafka	2.4

Some of the key changes in the Apache Spark 3.1.2 release are:

ANSI SQL compliance

This release adds further improvements for ANSI SQL compliance, which simplify workload migration from traditional data warehouse systems to Apache Spark.

Support char/varchar data type (SPARK-33480)
ANSI mode: Runtime errors instead of returning null (SPARK-33275)
ANSI mode: New explicit cast syntax rules (SPARK-33354)
Add SQL standard command SET TIME ZONE (SPARK-32272)
Unify create table SQL syntax (SPARK-31257)
Unify temporary view and permanent view behaviors (SPARK-33138)
Support column list in INSERT statement (SPARK-32976)
Support ANSI nested bracketed comments (SPARK-28880)
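
Several of these items can be tried from a PySpark session. The sketch below is illustrative only: the table name `demo_chars`, the helper `run_examples`, and the choice of statements are assumptions, and `spark` is expected to be an existing SparkSession on a Spark 3.1 cluster.

```python
# Each entry pairs a JIRA id from the list above with a statement that
# exercises it. The table name demo_chars is hypothetical.
ANSI_EXAMPLES = [
    ("SPARK-33480", "CREATE TABLE demo_chars (code CHAR(5), name VARCHAR(20)) USING parquet"),
    ("SPARK-32976", "INSERT INTO demo_chars (code, name) VALUES ('A1', 'alpha')"),
    ("SPARK-32272", "SET TIME ZONE 'America/Los_Angeles'"),
    ("SPARK-28880", "/* outer /* nested */ comment */ SELECT 1"),
    # Expected to raise under ANSI mode rather than return NULL
    # (array index out of bounds).
    ("SPARK-33275", "SELECT element_at(array(1, 2), 5)"),
]


def run_examples(spark, examples=ANSI_EXAMPLES):
    """Run each statement with ANSI mode on; map JIRA id to error text or None."""
    spark.conf.set("spark.sql.ansi.enabled", "true")
    results = {}
    for jira_id, statement in examples:
        try:
            spark.sql(statement).collect()
            results[jira_id] = None
        except Exception as exc:  # keep going so every statement is reported
            results[jira_id] = str(exc)
    return results
```

In a notebook, `run_examples(spark)` returns a dictionary in which a non-empty message marks a statement that raised an error.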

Performance

Host-local shuffle data reading without shuffle service (SPARK-32077)
Remove redundant sorts before repartition nodes (SPARK-32276)
Partially push down predicates (SPARK-32302, SPARK-32352)
Push down filters through expand (SPARK-33302)
Push more possible predicates through Join via CNF conversion (SPARK-31705)
Remove shuffle by preserving output partitioning of broadcast hash join (SPARK-31869)
Remove shuffle by improving reordering join keys (SPARK-32282)
Remove shuffle by normalizing output partitioning and sort order (SPARK-33399)

Streaming

Cache fetched list of files beyond maxFilesPerTrigger as unread file (SPARK-30866)
Streamline the logic on file stream source and sink metadata log (SPARK-30462)
Avoid reading compact metadata log twice if the query restarts from compact batch (SPARK-30900)

Project Zen initiative

Project Zen was initiated in this release to improve PySpark’s usability in the following manner:

Being Pythonic

Pandas UDF enhancements and type hints
Avoid dynamic function definitions, for example in functions.py, which prevent IDEs from detecting the functions.

Better and easier usability in PySpark

User-facing error message and warnings
Documentation
User guide
Better examples and API documentation, e.g. Koalas and pandas

Better interoperability with other Python libraries

Visualization and plotting
Potentially better interface by leveraging Arrow
Compatibility with other libraries such as NumPy universal functions or pandas possibly by leveraging Koalas

PyPI Installation

PySpark with Hadoop 3 support on PyPI
Better error handling

For a complete list of the open-source Apache Spark 3.1.2 features now available in Azure HDInsight, please see the release notes.

Customers who use ARM templates to create Spark 3.0 clusters are advised to update their templates to Apache Spark 3.1. Please review this document for the steps to create a cluster using ARM templates.
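
For teams that keep many templates, the version bump can be scripted. The following is a rough sketch, not the documented procedure: it assumes the template stores the Spark version under `properties.clusterDefinition.componentVersion.Spark` on resources of type `Microsoft.HDInsight/clusters`, and the helper name `set_spark_version` is hypothetical. Check the layout against your own template before relying on it.

```python
import copy


def set_spark_version(template, version="3.1"):
    """Return a copy of an ARM template dict with the Spark version updated.

    Assumed layout: resources[*].properties.clusterDefinition with
    kind == "spark" and a componentVersion mapping.
    """
    updated = copy.deepcopy(template)
    for resource in updated.get("resources", []):
        if resource.get("type") != "Microsoft.HDInsight/clusters":
            continue
        definition = resource.get("properties", {}).get("clusterDefinition", {})
        # Templates that set `kind` through a parameter expression are skipped.
        if str(definition.get("kind", "")).lower() == "spark":
            definition.setdefault("componentVersion", {})["Spark"] = version
    return updated
```

The function leaves the input untouched, so the original and updated templates can be compared before deployment.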

For a more comprehensive list of changes in Spark 3.1, please see the release notes.

For more details on migration, please refer to the Migration Guide - Spark 3.1.2.

Posted at https://sl.advdat.com/3CwtP3a

Friday, March 11, 2022
