
Overview of Ingesting and Querying Data with Apache Hive in CDH

Data ingestion begins your data pipeline or "write path." Classic data pipelines bring in data and then apply ETL operations to it, which clean and transform the data for consumption. Apache Hive in CDH is the preferred tool for ETL workloads. Hive queries transform data, using functions such as CAST or TRIM or by joining data sets, to ensure that the data conforms to the target data models for your data warehouse. The transformed data can then be consumed, for example, by business intelligence (BI) tools or by users running ad-hoc queries.
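
The following HiveQL sketch illustrates this kind of transformation step. The staging.raw_orders, staging.customers, and warehouse.orders_clean tables and their columns are hypothetical examples, not objects that CDH provides.

-- A minimal sketch of a Hive ETL transformation: cast and trim raw columns
-- and join to a reference table before loading the warehouse table.
INSERT OVERWRITE TABLE warehouse.orders_clean
SELECT
  CAST(o.order_id AS BIGINT)            AS order_id,
  TRIM(o.customer_name)                 AS customer_name,
  CAST(o.order_total AS DECIMAL(10,2))  AS order_total,
  c.region                              AS region
FROM staging.raw_orders o
JOIN staging.customers c
  ON o.customer_id = c.customer_id;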

Ingesting Data with Hive

Hive can ingest data in several different file formats, such as Parquet, Avro, TEXTFILE, or RCFile. If you are setting up a data pipeline where Apache Impala is involved on the query side, use Parquet. See Using Apache Parquet Data Files with CDH for general information about the Parquet file format and about using Parquet tables in Hive. If a custom file format is required, you can extend the Hive SerDes. See the Apache Hive wiki for information about Hive SerDes and how to write your own.
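
As a hedged example, the following HiveQL shows one way to land delimited text data and then convert it to Parquet for Impala to query. All database, table, and column names are illustrative assumptions.

-- Hypothetical landing table for raw, tab-delimited text data.
CREATE TABLE staging.events_text (
  event_id   BIGINT,
  event_time STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Convert the text data to Parquet so Impala can query it efficiently.
CREATE TABLE warehouse.events
STORED AS PARQUET
AS SELECT * FROM staging.events_text;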

  Important:

The configuration property serialization.null.format can be set in the Hive and Impala engines, as a SerDe property or a table property, to specify how NULL values are serialized to and deserialized from the storage format.

This configuration option is suitable for text file formats only. If used with binary storage formats such as RCFile or Parquet, the option causes compatibility, complexity, and efficiency issues. A sketch of setting this property on a text-format table follows this note.
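
The following sketch shows both ways to set the property on a hypothetical text-format table; the staging.customers_text table is an assumption for illustration only.

-- Hypothetical text-format table that treats empty strings as NULL.
-- Per the note above, do not set this property on RCFile or Parquet tables.
CREATE TABLE staging.customers_text (
  customer_id   BIGINT,
  customer_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('serialization.null.format' = '');

-- The same setting can also be applied as a SerDe property.
ALTER TABLE staging.customers_text
SET SERDEPROPERTIES ('serialization.null.format' = '');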

See Using Avro Data Files in Hive for details about using Avro to ingest data into Hive tables and about using Snappy compression on the output files.
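
A minimal sketch of writing Avro output with Snappy compression is shown below; the warehouse.events_avro and staging.events_text tables are hypothetical, and the linked topic covers the full details.

-- Enable compressed output and select the Snappy codec for Avro files,
-- then write an Avro copy of a hypothetical source table.
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

CREATE TABLE warehouse.events_avro
STORED AS AVRO
AS SELECT * FROM staging.events_text;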

Column and Table Statistics for Query Optimization

Hive statistics include the number of rows in tables or partitions and histograms of interesting columns. The cost functions of the query optimizer use these statistics to generate efficient query plans.

See Accessing Apache Hive Table Statistics in CDH for details about collecting statistics for Hive.
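
As a brief example, the ANALYZE TABLE statement gathers table-level and column-level statistics; the warehouse.orders_clean table name is illustrative.

-- Gather table-level statistics (row counts, sizes) for a hypothetical table.
ANALYZE TABLE warehouse.orders_clean COMPUTE STATISTICS;

-- Gather column-level statistics used for histograms and selectivity estimates.
ANALYZE TABLE warehouse.orders_clean COMPUTE STATISTICS FOR COLUMNS;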

Transaction (ACID) Support in Hive

The CDH distribution of Hive does not support transactions (HIVE-5317). Currently, transaction support in Hive is an experimental feature that works only with the ORC file format. Cloudera recommends using the Parquet file format, which works across many tools. To merge updates into Hive tables, use existing functionality, such as the INSERT, INSERT OVERWRITE, and CREATE TABLE AS SELECT statements.
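
The following sketch shows one upsert-style merge pattern using CREATE TABLE AS SELECT and a window function to keep the latest row per key; all table and column names are assumptions for illustration.

-- Combine the current table with new updates, keep the most recent row per
-- customer_id, and write the result to a new Parquet table.
CREATE TABLE warehouse.customers_merged
STORED AS PARQUET
AS
SELECT customer_id, customer_name, region
FROM (
  SELECT customer_id, customer_name, region,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY update_time DESC) AS rn
  FROM (
    SELECT customer_id, customer_name, region, update_time
      FROM warehouse.customers
    UNION ALL
    SELECT customer_id, customer_name, region, update_time
      FROM staging.customer_updates
  ) all_rows
) ranked
WHERE rn = 1;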

Upstream Information for Hive

Detailed Hive documentation is available on the Apache Software Foundation site on the Hive project page, which also links to documentation for specific areas of Apache Hive.

Because Cloudera does not support all Hive features, for example ACID transactions, always check external Hive documentation against the version and supported features of Hive included in your CDH distribution.

Hive has its own JIRA issue tracker.
