Cloudera Enterprise 6.0.x

Apache Spark Overview

  Note: This page contains information related to Spark 2.x, which is included with CDH beginning with CDH 6. This information supersedes the documentation for the separately available parcel for CDS Powered By Apache Spark.

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala and consists of Spark core and several related projects.

You can run Spark applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Interactive use is common during the data-exploration phase and for ad hoc analysis.

To run applications distributed across a cluster, Spark requires a cluster manager. In CDH 6, Cloudera supports only the YARN cluster manager. When run on YARN, Spark application processes are managed by the YARN ResourceManager and NodeManager roles. Spark Standalone is no longer supported.
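As a rough illustration of submitting an application to a YARN-managed cluster, the following Python sketch assembles a spark-submit command line. The helper name build_spark_submit, the application file wordcount.py, and the input path are hypothetical and used only for illustration. The --master, --deploy-mode, and --num-executors options are standard spark-submit options, but check the Cloudera documentation for the settings recommended on your cluster.

```python
import shlex


def build_spark_submit(app, deploy_mode="cluster", num_executors=None, app_args=()):
    """Assemble a spark-submit argument list for a YARN-managed cluster.

    Hypothetical helper for illustration only; it builds the command
    but does not run it.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")

    # YARN is the only cluster manager supported in CDH 6.
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]

    if num_executors is not None:
        cmd += ["--num-executors", str(num_executors)]

    # The application file comes after the options, followed by its own arguments.
    cmd.append(app)
    cmd += list(app_args)
    return cmd


if __name__ == "__main__":
    # wordcount.py and the input path are placeholder names (assumptions).
    argv = build_spark_submit(
        "wordcount.py", num_executors=4, app_args=["/user/alice/input"]
    )
    print(shlex.join(argv))
```

The interactive shells (spark-shell and pyspark) accept the same --master yarn option, although they run only in client deploy mode.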

For detailed API information, see the Apache Spark project site.

  Note: Although this document makes some references to the external Spark site, not all of the features, components, and recommendations described there apply to Spark when used on CDH. Always cross-check the Cloudera documentation before relying on any aspect of Spark that might not be supported or recommended by Cloudera. In particular, see Apache Spark Known Issues for components and features to avoid.

The Apache Spark 2 service in CDH 6 consists of Spark core and several related projects:

  • Spark SQL: Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
  • Spark Streaming: API that allows you to build scalable fault-tolerant streaming applications.
  • MLlib: API that implements common machine learning algorithms.

The Cloudera Enterprise product includes the Spark features roughly corresponding to the feature set and bug fixes of Apache Spark 2.2. The Spark 2.x service was previously shipped as its own parcel, separate from CDH.

CDH 6 does not include the Spark 1.6 service. The Spark History Server listens on port 18088, the same port that was used with Spark 1.6, rather than port 18089, which was used for the Spark 2 parcel.
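Scripts that were written against the Spark 2 parcel may need their History Server port updated. The following Python sketch shows one way to build the History Server URL with port 18088 as the default. The helper name history_server_url and the host name hs.example.com are hypothetical, and the http scheme is an assumption that may not hold on clusters with TLS enabled. The /api/v1/applications path is the application listing endpoint of the Spark monitoring REST API.

```python
from urllib.parse import urlunsplit

# Port values taken from this page.
CDH6_HISTORY_SERVER_PORT = 18088           # CDH 6 (same as Spark 1.6)
SPARK2_PARCEL_HISTORY_SERVER_PORT = 18089  # former Spark 2 parcel


def history_server_url(host, port=CDH6_HISTORY_SERVER_PORT,
                       path="/api/v1/applications", scheme="http"):
    """Build a Spark History Server URL (hypothetical helper).

    The scheme defaults to http, which is an assumption; adjust it
    if the History Server is configured for TLS.
    """
    return urlunsplit((scheme, f"{host}:{port}", path, "", ""))


if __name__ == "__main__":
    # hs.example.com is a placeholder host name.
    print(history_server_url("hs.example.com"))
```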

Unsupported Features

The following Spark features are not supported:

  • Apache Spark experimental features/APIs are not supported unless stated otherwise.
  • Using the JDBC Datasource API to access Hive or Impala is not supported.
  • ADLS is not supported for all Spark components. Microsoft Azure Data Lake Store (ADLS) is a cloud-based filesystem that you can access through Spark applications. Spark with Kudu is not currently supported for ADLS data. (Hive on Spark is available for ADLS in CDH 5.12 and higher.)
  • IPython / Jupyter notebooks are not supported. This applies to the IPython notebook system, which was renamed to Jupyter as of IPython 4.0.
  • Certain Spark Streaming features are not supported. The mapWithState method is unsupported because it is a nascent, unstable API.
  • The Thrift JDBC/ODBC server is not supported.
  • The Spark SQL CLI is not supported.
  • GraphX is not supported.
  • SparkR is not supported.
  • Structured Streaming is not supported.
  • The Spark cost-based optimizer (CBO) is not supported.

Consult Apache Spark Known Issues for a comprehensive list of Spark 2 features that are not supported with CDH 6.

Page generated July 25, 2018.