Cloudera Enterprise 6.0.x

Snappy Compression

Snappy is a compression/decompression library. It favors very high compression and decompression speed with moderate compression ratios over maximum compression or compatibility with other compression libraries.

Snappy is supported for all CDH components. How you specify compression depends on the component.

Continue reading:

Using Snappy with HBase
Using Snappy with Hive or Impala
Using Snappy with MapReduce
Using Snappy with Pig
Using Snappy with Spark SQL
Using Snappy Compression with Sqoop 1 and Sqoop 2 Imports

Using Snappy with HBase

If you install Hadoop and HBase from RPM or Debian packages, Snappy requires no HBase configuration.

Using Snappy with Hive or Impala

To enable Snappy compression for Hive SequenceFile output, use the following settings:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

For information about configuring Snappy compression for Parquet files with Hive, see Using Parquet Tables in Hive. For information about using Snappy compression for Parquet files with Impala, see Snappy and GZip Compression for Parquet Data Files in the Impala Guide.

Using Snappy with MapReduce

Enabling MapReduce intermediate compression can make jobs run faster without requiring application changes. Only the temporary intermediate files created by Hadoop for the shuffle phase are compressed; the final output may or may not be compressed. Snappy is ideal in this case because it compresses and decompresses very quickly compared to other compression algorithms, such as Gzip. For information about choosing a compression format, see Choosing and Configuring Data Compression.

To enable Snappy for MapReduce intermediate compression for the whole cluster, set the following properties in mapred-site.xml:

MRv1

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

YARN

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

You can also set these properties on a per-job basis.

Use the properties in the following table to compress the final output of a MapReduce job. These are usually set on a per-job basis.

MRv1 Property	YARN Property	Description
mapred.output.compress	mapreduce.output.fileoutputformat.compress	Whether to compress the final job outputs (`true` or `false`).
mapred.output.compression.codec	mapreduce.output.fileoutputformat.compress.codec	If the final job outputs are to be compressed, the codec to use. Set to `org.apache.hadoop.io.compress.SnappyCodec` for Snappy compression.
mapred.output.compression.type	mapreduce.output.fileoutputformat.compress.type	For `SequenceFile` outputs, the type of compression to use (`NONE`, `RECORD`, or `BLOCK`). Cloudera recommends `BLOCK`.

Note: The MRv1 property names are also supported (but deprecated) in YARN. You do not need to update them in this release.
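
As an illustration of the final-output properties in the table, the following is a minimal sketch of a per-job configuration file that uses the YARN property names. The file name snappy-output.xml is hypothetical, and the sketch assumes the job's driver parses Hadoop generic options (for example, through ToolRunner), so that the file can be supplied with the -conf option.

```xml
<?xml version="1.0"?>
<!-- snappy-output.xml: hypothetical per-job configuration file (name is an assumption) -->
<configuration>
  <!-- Compress the final job output -->
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <!-- Use Snappy as the codec for the final output -->
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <!-- Only affects SequenceFile outputs; BLOCK is the recommended type -->
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
```

Under the same assumption about generic option parsing, each property could instead be passed individually on the command line as -D name=value.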

Using Snappy with Pig

Set the same properties for Pig as for MapReduce.

Using Snappy with Spark SQL

To enable Snappy compression for Spark SQL when writing Parquet tables, set the spark.sql.parquet.compression.codec configuration property to snappy:

sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")

Using Snappy Compression with Sqoop 1 and Sqoop 2 Imports

Sqoop 1 - On the command line, use the following option to enable Snappy compression:
```
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
```
Cloudera recommends using the --as-sequencefile option with this compression option.

Sqoop 2 - When you create a job (sqoop:000> create job), choose 7 (SNAPPY) as the compression format.

Page generated July 25, 2018.