Friday, March 11, 2022

Spark 3.1 is now Generally Available on HDInsight

Notable Changes:

  • The Python version in Apache Spark 3.1.2 changes to 3.8; Python 2.x is no longer supported.
  • Accordingly, Jupyter and Zeppelin notebooks now support only Python 3.8. Keep this in mind when migrating to Spark 3.1.2.
  • HDInsight Spark 3.1 ships with Apache Kafka client 2.4 jars, while open-source Spark 3.1 ships with Apache Kafka client 2.6 jars (a short read sketch follows this list).
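Since the bundled Kafka client moves from the open-source 2.6 to 2.4, the Structured Streaming DataFrame API itself is unchanged; only broker-side features that require a newer client are affected. Below is a minimal PySpark sketch of reading a Kafka topic, assuming the spark-sql-kafka connector is on the classpath; the broker address and topic name are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-read-sketch").getOrCreate()

    # Broker and topic below are hypothetical placeholders; substitute your own.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "events")
           .load())

    # Kafka keys and values arrive as binary; cast them before processing.
    decoded = raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    query = decoded.writeStream.format("console").start()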

Known Issues

  • The HWC connector will ship with the next release cycle of HDI 4.0.
  • Customers who want to use HWC before the next release can follow the steps below (a verification sketch follows the list):
  1. Set up the HWC v2 dependencies by executing the respective Script Actions on the Spark and Interactive Query clusters.
  2. Spark Cluster
    • Execute the Spark cluster specific Script Action on HeadNodes and WorkerNodes.
    • On the Spark Cluster Script Actions’ screen, use the following options:

      => Script Type - Custom
      => Bash Script URI - https://hdiconfigactions2.blob.core.windows.net/install-hwc-v2/install_hwc_v2_spark_script_action.sh
      => Node Types - HeadNode, WorkerNode

    • The script action for the Spark cluster installs hive-warehouse-connector-assembly-2.x.jar under the /usr/hdp/5.x.x.x/hive_warehouse_connector path.

    • After installation, the Spark service is automatically restarted to add the new dependencies to the classpath.

    • After successful execution, validate the presence of hive-warehouse-connector-assembly-2.x.jar at /usr/hdp/5.x.x.x/hive_warehouse_connector.

  3. Interactive Query Cluster
    • Execute the Interactive Query cluster specific Script Action on HeadNodes.
    • On the Interactive Query Cluster Script Actions’ screen, use the following options:

      => Script Type - Custom
      => Bash Script URI - https://hdiconfigactions2.blob.core.windows.net/install-hwc-v2/install_hwc_v2_llap_script_action.sh
      => Node Types - HeadNode

    • The script action for the Interactive Query cluster patches the following dependencies under the /usr/hdp/5.x.x.x/hive/libs path:
      • hive-exec.jar
      • hive-serde.jar
      • hive-llap-server.jar
      • arrow-memory-core-2.x.jar
      • arrow-memory-netty-2.x.jar
      • arrow-vector-2.x.jar
      • arrow-format-2.x.jar
    • After installation, the Hive service is automatically restarted to add the new dependencies to the classpath.
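
After both script actions complete, a quick way to verify the connector end to end is to open a PySpark session against the installed assembly jar. The sketch below is illustrative only: the jar path mirrors the placeholder versions above, the JDBC URL is a hypothetical stand-in for your Interactive Query cluster's HiveServer2 Interactive URL (available in Ambari), and the pyspark_hwc zip shipped with HWC must be supplied to Python (for example via --py-files) for the import to resolve.

    from pyspark.sql import SparkSession
    from pyspark_llap import HiveWarehouseSession  # Python API bundled with HWC

    spark = (SparkSession.builder
             .appName("hwc-smoke-test")
             # Placeholder path from the steps above; substitute the real versions.
             .config("spark.jars", "/usr/hdp/5.x.x.x/hive_warehouse_connector/hive-warehouse-connector-assembly-2.x.jar")
             # Hypothetical HiveServer2 Interactive JDBC URL; copy yours from Ambari.
             .config("spark.sql.hive.hiveserver2.jdbc.url", "jdbc:hive2://<llap-host>:10500/")
             .getOrCreate())

    hive = HiveWarehouseSession.session(spark).build()
    hive.showDatabases().show()  # lists Hive databases if the wiring is correct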

Component versions

Component          Version
Apache Spark       3.1.2
Operating System   Ubuntu 18.04
Java               1.8.0_282
Scala              2.12.10
Hadoop             3.1.1
Python             3.8
Hive               3.1.0
Zookeeper          3.4.6
Zeppelin           0.8.0
Jupyter            1.0.0
Kafka              2.4

Some of the key changes in the Apache Spark 3.1.2 release are:

 

ANSI SQL compliance

This release adds further improvements for ANSI SQL compliance, which helps simplify the migration of workloads from traditional data warehouse systems to Apache Spark.
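
A minimal sketch of the behavioral difference, using the spark.sql.ansi.enabled flag:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("ansi-demo").getOrCreate()

    # With ANSI mode off (the default), an invalid cast silently yields NULL.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('not a number' AS INT) AS v").show()

    # With ANSI mode on, the same cast fails at runtime, as ANSI SQL requires.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    try:
        spark.sql("SELECT CAST('not a number' AS INT) AS v").show()
    except Exception as err:
        print("ANSI mode rejected the cast:", err)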

Performance

  • Host-local shuffle data reading without shuffle service (SPARK-32077)
  • Remove redundant sorts before repartition nodes (SPARK-32276; see the sketch after this list)
  • Partially push down predicates (SPARK-32302, SPARK-32352)
  • Push down filters through expand (SPARK-33302)
  • Push more possible predicates through Join via CNF conversion (SPARK-31705)
  • Remove shuffle by preserving output partitioning of broadcast hash join (SPARK-31869)
  • Remove shuffle by improving reordering join keys (SPARK-32282)
  • Remove shuffle by normalizing output partitioning and sort order (SPARK-33399)
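
As an illustration of the sort-removal item above, the sketch below builds a plan with a sort immediately followed by a repartition; since round-robin repartitioning discards ordering, the optimizer can drop the sort. This is a minimal sketch, not a benchmark.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("sort-repartition").getOrCreate()

    df = spark.range(1000)

    # The sort below is redundant: repartition() discards ordering (SPARK-32276).
    plan = df.orderBy("id").repartition(4)
    plan.explain()  # expect an Exchange with no preceding Sort in the plan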

Streaming

  • Cache the fetched list of files beyond maxFilesPerTrigger as unread files (SPARK-30866; see the sketch after this list)
  • Streamline the logic on file stream source and sink metadata log (SPARK-30462)
  • Avoid reading compact metadata log twice if the query restarts from compact batch (SPARK-30900)
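
The first improvement matters when a file source is rate-limited with maxFilesPerTrigger: files listed beyond the cap are now cached as unread rather than re-listed on every micro-batch. A minimal sketch with a hypothetical landing directory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("file-stream-sketch").getOrCreate()

    stream = (spark.readStream
              .format("json")
              .schema("id INT, payload STRING")   # file sources need an explicit schema
              .option("maxFilesPerTrigger", 10)   # at most 10 new files per micro-batch
              .load("/data/incoming"))            # hypothetical landing directory

    query = stream.writeStream.format("console").start()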

Project Zen initiative

Project Zen was initiated in this release to improve PySpark’s usability in the following ways:

  • Being Pythonic
    • Pandas UDF enhancements and type hints (see the sketch after this list)
    • Avoid dynamic function definitions, for example in functions.py, which IDEs cannot detect
  • Better and easier usability in PySpark
    • User-facing error message and warnings
    • Documentation
    • User guide
    • Better examples and API documentation, e.g. Koalas and pandas
  • Better interoperability with other Python libraries
    • Visualization and plotting
    • Potentially better interface by leveraging Arrow
    • Compatibility with other libraries such as NumPy universal functions or pandas possibly by leveraging Koalas
  • PyPI Installation
    • PySpark with Hadoop 3 support on PyPI
    • Better error handling
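
As an example of the Pandas UDF work referenced above, Spark 3.x lets Python type hints describe the UDF shape instead of the older PandasUDFType constants. A minimal sketch (requires pyarrow):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.master("local[*]").appName("zen-udf").getOrCreate()

    # The pd.Series -> pd.Series hints mark this as a series-to-series pandas UDF.
    @pandas_udf("long")
    def plus_one(s: pd.Series) -> pd.Series:
        return s + 1

    spark.range(5).select(plus_one("id").alias("id_plus_one")).show()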

For a complete list of the open-source Apache Spark 3.1.2 features now available in Azure HDInsight, please see the release notes.

 

Customers using ARM templates to create Spark 3.0 clusters are advised to update their templates to Apache Spark 3.1. Please review this document for the steps to create a cluster using ARM templates.

 

For a more comprehensive list of changes in Spark 3.1, please see the release notes.

 

For more details on migration, please refer to the Migration Guide - Spark 3.1.2.

Posted at https://sl.advdat.com/3CwtP3a