Cloudera Enterprise 6.0.x

Apache Parquet Tables with Hive in CDH

Apache Parquet is a columnar storage format available to any component in the Hadoop ecosystem, regardless of the data processing framework, data model, or programming language. The Parquet file format incorporates several features that support data warehouse-style operations:

  • Columnar storage layout - A query can examine and perform calculations on all values for a column while reading only a small fraction of the data from a data file or table.
  • Flexible compression options - Data can be compressed with any of several codecs. Different data files can be compressed differently.
  • Innovative encoding schemes - Sequences of identical, similar, or related data values can be represented in ways that save disk space and memory. The encoding schemes provide an extra level of space savings beyond overall compression for each data file.
  • Large file size - The layout of Parquet data files is optimized for queries that process large volumes of data, with individual files in the multi-megabyte or even gigabyte range.

Parquet is automatically installed when you install CDH, and the required libraries are automatically placed in the classpath for all CDH components. Copies of the libraries are in /usr/lib/parquet or /opt/cloudera/parcels/CDH/lib/parquet.

CDH lets you use the component of your choice with the Parquet file format for each phase of data processing. For example, you can read and write Parquet files using Pig and MapReduce jobs. You can convert, transform, and query Parquet tables through Hive, Impala, and Spark. And you can interchange data files between all of these components.
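
For example, a Parquet table created and populated in Hive can be read from Impala with no conversion step, because both components work against the same data files and metastore. The following is a minimal sketch of that hand-off; the table name and sample values are placeholders, and it assumes Impala uses the same Hive metastore:

-- In Hive (for example, through Beeline): create and populate a Parquet table.
CREATE TABLE web_logs_parquet (ip STRING, url STRING, hits INT) STORED AS PARQUET;
INSERT INTO TABLE web_logs_parquet VALUES ('10.0.0.1', '/index.html', 42);

-- In impala-shell: pick up the newly created table, then query the same Parquet files.
INVALIDATE METADATA web_logs_parquet;
SELECT url, SUM(hits) FROM web_logs_parquet GROUP BY url;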

Using Parquet Tables in Hive

To create a Hive table that uses the Parquet format, use a command like the following, substituting your own table name, column names, and data types:

CREATE TABLE parquet_table_name (x INT, y STRING) STORED AS PARQUET;
  Note:
  • Once you create a Parquet table, you can query it or insert into it through other components such as Impala and Spark.
  • Set dfs.block.size to 256 MB in hdfs-site.xml so that each Parquet data file fits within a single HDFS block.
  • To enhance performance on Parquet tables in Hive, see Enabling Query Vectorization.
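
The recommendations in the note also have session-level counterparts that can be set from Hive. The following is a minimal sketch, assuming the parquet_table_name table from the example above; a SET of dfs.block.size affects only files written by that session's jobs and is not a substitute for the cluster-wide hdfs-site.xml change:

SET dfs.block.size=268435456;                -- 256 MB, matching the recommendation above
SET hive.vectorized.execution.enabled=true;  -- per-session query vectorization
SELECT y, COUNT(*) FROM parquet_table_name GROUP BY y;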

If the table will be populated with data files generated outside of Impala and Hive, you can create the table as an external table pointing to the location where the files will be created:

CREATE EXTERNAL TABLE parquet_table_name (x INT, y STRING)
STORED AS PARQUET
LOCATION '/test-warehouse/tinytable';

To populate the table with an INSERT statement, and to read the table with a SELECT statement, see Loading Data into Parquet Tables.
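
As a minimal sketch of that pattern, assuming a text-format staging table named texttable whose columns match the Parquet table, INSERT ... SELECT rewrites the source rows as Parquet data files (a plain LOAD DATA would move the files without converting their format), and the result can be read back with an ordinary query:

INSERT OVERWRITE TABLE parquet_table_name SELECT * FROM texttable;
SELECT x, y FROM parquet_table_name LIMIT 10;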

To set the compression type to use when writing data, configure the parquet.compression property:

SET parquet.compression=GZIP;
INSERT OVERWRITE TABLE tinytable SELECT * FROM texttable;

The supported compression types are UNCOMPRESSED, GZIP, and SNAPPY.
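
Because parquet.compression is read at the time the data files are written, files produced by different statements can use different codecs within the same table, which is the per-file flexibility described at the start of this page. A minimal sketch, reusing the tinytable and texttable names from the example above:

SET parquet.compression=SNAPPY;
INSERT INTO TABLE tinytable SELECT * FROM texttable;   -- these new files are Snappy-compressed
SET parquet.compression=UNCOMPRESSED;
INSERT INTO TABLE tinytable SELECT * FROM texttable;   -- these files are written without compression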
