Managing Apache Hive User-Defined Functions (UDFs) in CDH
Hive's query language (HiveQL) can be extended with Java-based user-defined functions (UDFs). See the Apache Hive Language Manual UDF page for information about Hive built-in UDFs. To create customized UDFs, see the Apache Hive wiki.
After creating a new Java class to implement your UDF (for example, in the com.example.hive.udf package), compile your code into a Java archive file (JAR) and make it available to Hive. The ADD JAR command adds a JAR file to the Hive classpath, but it does not work with HiveServer2 and the Beeline client when Beeline runs on a different host. Instead of ADD JAR, use Hive's auxiliary path functionality. Cloudera recommends the hive.reloadable.aux.jars.path property, which enables you to update UDF JAR files, or add new ones to the UDF directory specified in the property, without restarting HiveServer2. Instead of a restart, use the Beeline reload command, which refreshes the server configuration without a service interruption. The following sections explain how to use this property to configure Hive to use custom UDFs.
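For illustration only, the following is a minimal sketch of such a Java class; the package, class, and function names are placeholders rather than part of any CDH procedure, and the class extends the standard org.apache.hadoop.hive.ql.exec.UDF base class:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Example UDF that upper-cases a string. Hive locates the evaluate()
// method by reflection and calls it once per input row.
public class ToUpper extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;  // preserve SQL NULL semantics
    }
    return new Text(input.toString().toUpperCase());
  }
}

After you compile this class and package it into a JAR file, you register it in Beeline with a statement such as CREATE FUNCTION to_upper AS 'com.example.hive.udf.ToUpper'; as described in the sections that follow.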
Continue reading:
Using Cloudera Manager to Create User-Defined Functions (UDFs) with HiveServer2
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
Creating Permanent Functions
- Copy the JAR file to the host on which HiveServer2 is running. Save the JARs to any directory you choose, give the hive user read, write, and execute access to this directory, and make a note of the path (for example, /usr/lib/hive/lib/).
Note: If the Hive metastore is running on a different host, create the same directory there that you created on the HiveServer2 host. You do not need to copy the JAR file onto the Hive metastore host, but the same directory must exist there. For example, if you copied the JAR file to /usr/lib/hive/lib/ on the HiveServer2 host, you must create the same directory on the Hive metastore host. If the same directory is not present on the Hive metastore host, the Hive metastore service does not start.
- In the Cloudera Manager Admin Console, go to the Hive service.
- Click the Configuration tab.
- Under Filters, click Hive (Service-Wide) scope.
- Click the Advanced category.
- In the panel on the right, locate the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml, click the plus sign (+) to the right of it, and enter the following information:
- In the Name field, enter the hive.reloadable.aux.jars.path property.
- In the Value field, enter the path where you copied the JAR file to in Step 1.
- In the Description field, enter the property description. For example, Path to Hive UDF JAR files.
- Click Save Changes.
- Redeploy the Hive client configuration.
- In the Cloudera Manager Admin Console, go to the Hive service.
- From the Actions menu at the top right of the service page, select Deploy Client Configuration.
- Click Deploy Client Configuration.
- Restart the Hive service.
Important: You need to restart the Hive service only when you first set the hive.reloadable.aux.jars.path property to the JAR file location, so the server can read the location. Afterwards, if you add or remove JAR files in this directory, you can use the Beeline reload command to make the changes visible to HiveServer2 without restarting the service.
- If Sentry is enabled on your cluster (otherwise, skip this step): Grant privileges on the JAR files to the roles that require access. Log in to Beeline as the hive user and use the Hive SQL GRANT statement to do so. For example:
GRANT ALL ON URI 'file:///usr/lib/hive/lib/<my.jar>' TO ROLE <example_role>;
- Run the CREATE FUNCTION command in Beeline to create the UDF from the class in the JAR file:
- If Sentry is enabled on your cluster:
The USING JAR clause is not supported. To load the JAR file, make sure it is at the location pointed to by the hive.reloadable.aux.jars.path property, and then use the following CREATE FUNCTION statement:
CREATE FUNCTION <your_function_name> AS '<fully_qualified_class_name>';
Where <fully_qualified_class_name> is the fully qualified name of the Java class in your JAR file. For example, if your Java class is located at directory_1/directory_2/directory_3/udf_class.class in your JAR file, use:
CREATE FUNCTION <your_function_name> AS 'directory_1.directory_2.directory_3.udf_class';
- Without Sentry enabled on your cluster:
- Copy the JAR file to HDFS, make sure the hive user can access it, and make a note of the path (for example, hdfs:///user/hive/udf_jars/).
- Run the CREATE FUNCTION command as follows and point to the JAR file location in HDFS:
CREATE FUNCTION <your_function_name> AS '<fully_qualified_class_name>' USING JAR 'hdfs:///<path/to/jar/in/hdfs>';
Where <fully_qualified_class_name> is the fully qualified name of the Java class in your JAR file.
Creating Temporary Functions
- Copy the JAR file to the host on which HiveServer2 is running. Save the JARs to any directory you choose, give the hive user read, write, and execute access to this directory, and make a note of the path (for example, /usr/lib/hive/lib/).
Note: If the Hive metastore is running on a different host, create the same directory there that you created on the HiveServer2 host. You do not need to copy the JAR file onto the Hive metastore host, but the same directory must exist there. For example, if you copied the JAR file to /usr/lib/hive/lib/ on the HiveServer2 host, you must create the same directory on the Hive metastore host. If the same directory is not present on the Hive metastore host, the Hive metastore service does not start.
- In the Cloudera Manager Admin Console, go to the Hive service.
- Click the Configuration tab.
- Under Filters, click Hive (Service-Wide) scope.
- Click the Advanced category.
- In the panel on the right, locate the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml, click the plus sign (+) to the right of it, and enter the following information:
- In the Name field, enter the hive.reloadable.aux.jars.path property.
- In the Value field, enter the path where you copied the JAR file to in Step 1. For example, /usr/lib/hive/lib.
- In the Description field, enter the property description. For example, Path to Hive UDF JAR files.
- Click Save Changes.
- Redeploy the Hive client configuration.
- In the Cloudera Manager Admin Console, go to the Hive service.
- From the Actions menu at the top right of the service page, select Deploy Client Configuration.
- Click Deploy Client Configuration.
- Restart the Hive service.
Important: You need to restart the Hive service only when you first set the hive.reloadable.aux.jars.path property to the JAR file location, so the server can read the location. Afterwards, if you add or remove JAR files in this directory, you can use the Beeline reload command to make the changes visible to HiveServer2 without restarting the service.
- Run the CREATE TEMPORARY FUNCTION command. For example:
CREATE TEMPORARY FUNCTION <your_function_name> AS '<fully_qualified_class_name>';
Where <fully_qualified_class_name> is the fully qualified name of the Java class in your JAR file.
Using the Command Line to Create User-Defined Functions (UDFs) with HiveServer2
The following sections describe how to create permanent and temporary functions using the command line.
Creating Permanent Functions
- Copy the JAR file to the host on which HiveServer2 is running. Save the JARs to any directory you choose, give the hive user read, write, and execute access to this directory, and make a note of the path (for example, /usr/lib/hive/lib/).
Note: If the Hive metastore is running on a different host, create the same directory there that you created on the HiveServer2 host. You do not need to copy the JAR file onto the Hive metastore host, but the same directory must exist there. For example, if you copied the JAR file to /usr/lib/hive/lib/ on the HiveServer2 host, you must create the same directory on the Hive metastore host. If the same directory is not present on the Hive metastore host, the Hive metastore service does not start.
- On the Beeline client machine, in /etc/hive/conf/hive-site.xml, set the hive.reloadable.aux.jars.path property to the fully qualified path where you copied the JAR file in Step 1. If there are multiple JAR files, use commas to separate them:
<property>
  <name>hive.reloadable.aux.jars.path</name>
  <value>path/to/java_class1.jar,path/to/java_class2.jar</value>
  <description>property_description</description>
</property>
- Restart HiveServer2.
Important: You need to restart HiveServer2 only when you first set the hive.reloadable.aux.jars.path property to the JAR file location, so the server can read the location. Afterwards, if you add or remove JAR files in this directory, you can use the Beeline reload command to make the changes visible to HiveServer2 without restarting the service.
- If Sentry is enabled on your cluster (otherwise, skip this step): Grant privileges on the JAR files to the roles that require access. Log in to Beeline as the hive user and use the Hive SQL GRANT statement to do so. For example:
GRANT ALL ON URI 'file:///usr/lib/hive/lib/<my.jar>' TO ROLE <example_role>;
If you are using Sentry policy files, grant the URI privilege as follows:
udf_r = server=server1->uri=file:///<path/to/jar>
- Run the CREATE FUNCTION command in Beeline to create the UDF from the JAR file:
- If Sentry is enabled on your cluster:
The USING JAR clause is not supported. To load the JAR file, make sure it is at the location pointed to by the hive.reloadable.aux.jars.path property, and then use the following CREATE FUNCTION statement:
CREATE FUNCTION <your_function_name> AS '<fully_qualified_class_name>';
Where <fully_qualified_class_name> is the fully qualified name of the Java class in your JAR file. For example, if your Java class is located at directory_1/directory_2/directory_3/udf_class.class in your JAR file, use:
CREATE FUNCTION <your_function_name> AS 'directory_1.directory_2.directory_3.udf_class';
- Without Sentry enabled on your cluster:
- Copy the JAR file to HDFS, make sure the hive user can access it, and make a note of the path (for example, hdfs:///user/hive/udf_jars/).
- Run the CREATE FUNCTION command as follows and point to the JAR file location in HDFS:
CREATE FUNCTION <your_function_name> AS '<fully_qualified_class_name>' USING JAR 'hdfs:///<path/to/jar/in/hdfs>';
Where <fully_qualified_class_name> is the fully qualified name of the Java class in your JAR file.
Creating Temporary Functions
- Copy the JAR file to the host on which HiveServer2 is running. Save the JARs to any directory you choose, give the hive user read, write, and execute access to this directory, and make a note of the path (for example, /usr/lib/hive/lib/).
Note: If the Hive metastore is running on a different host, create the same directory there that you created on the HiveServer2 host. You do not need to copy the JAR file onto the Hive metastore host, but the same directory must exist there. For example, if you copied the JAR file to /usr/lib/hive/lib/ on the HiveServer2 host, you must create the same directory on the Hive metastore host. If the same directory is not present on the Hive metastore host, the Hive metastore service does not start.
- On the Beeline client machine, in /etc/hive/conf/hive-site.xml, set the hive.reloadable.aux.jars.path property to the fully qualified path where you copied the JAR file in Step 1. If there are multiple JAR files, use commas to separate them:
<property>
  <name>hive.reloadable.aux.jars.path</name>
  <value>path/to/java_class1.jar,path/to/java_class2.jar</value>
  <description>property_description</description>
</property>
- Restart HiveServer2.
Important: You need to restart HiveServer2 only when you first set the hive.reloadable.aux.jars.path property to the JAR file location, so the server can read the location. Afterwards, if you add or remove JAR files in this directory, you can use the Beeline reload command to make the changes visible to HiveServer2 without restarting the service.
- Run the CREATE TEMPORARY FUNCTION command. For example:
CREATE TEMPORARY FUNCTION <your_function_name> AS '<fully_qualified_class_name>';
Where <fully_qualified_class_name> is the fully qualified name of the Java class in your JAR file.
Updating Existing HiveServer2 User-Defined Functions (UDFs)
To update an existing UDF, first update the Java class and rebuild the JAR file. Then drop the existing function in Beeline and re-create it with the CREATE FUNCTION statement. These steps are explained in detail below:
- Update the Java class in the JAR file to update the UDF. For more information, see the Apache Hive wiki.
- Drop the UDF that has been updated. For example, if you have updated a UDF named my_udf, log in to Beeline and run the following command:
DROP FUNCTION my_udf;
- Delete the JAR file from which the my_udf function was created, both from HDFS and from the local filesystem. On the local filesystem, the JAR file is at the location pointed to by either the hive.aux.jars.path or the hive.reloadable.aux.jars.path property.
- Copy the updated JAR file to the local filesystem as described in Step 1 of Using Cloudera Manager to Create User-Defined Functions (UDFs) with HiveServer2.
Important: If you specified the old JAR file location with the hive.reloadable.aux.jars.path property, make sure that you copy the updated JAR file to the same directory. This enables you to use the Beeline reload command so that HiveServer2 can read the new JAR file without a service restart, avoiding a service disruption.
- Set the hive.reloadable.aux.jars.path property to the location of the updated JAR file:
- In the Cloudera Manager Admin Console, go to the Hive service.
- Click the Configuration tab.
- Under Filters, click Hive (Service-Wide) scope.
- Click the Advanced category.
- In the panel on the right, locate the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml, click the plus sign (+) to the right of it, and enter the following information:
- In the Name field, enter the hive.reloadable.aux.jars.path property.
- In the Value field, enter the path where you copied the JAR file to in Step 3.
- In the Description field, enter the property description. For example, Path to Hive UDF JAR files.
- Click Save Changes.
- Redeploy the Hive client configuration.
- In the Cloudera Manager Admin Console, go to the Hive service.
- From the Actions menu at the top right of the service page, select Deploy Client Configuration.
- Click Deploy Client Configuration.
- Restart the Hive service.
Important: You need to restart the Hive service only when you first set the hive.reloadable.aux.jars.path property to the JAR file location, so the server can read the location. Afterwards, if you add or remove JAR files in this directory, you can use the Beeline reload command to make the changes visible to HiveServer2 without restarting the service.
- If Sentry is enabled on your cluster (otherwise, skip this step): Grant privileges on the JAR files to the roles that require access. Log in to Beeline as the hive user and use the Hive SQL GRANT statement to do so. For example:
GRANT ALL ON URI 'file:///usr/lib/hive/lib/<my.jar>' TO ROLE <example_role>;
- Run the CREATE FUNCTION command in Beeline to create the UDF from the class in the JAR file:
- If Sentry is enabled on your cluster:
The USING JAR clause is not supported. To load the JAR file, make sure it is at the location pointed to by the hive.reloadable.aux.jars.path property, and then use the following CREATE FUNCTION statement:
CREATE FUNCTION <your_function_name> AS '<fully_qualified_class_name>';
Where <fully_qualified_class_name> is the fully qualified name of the Java class in your JAR file. For example, if your Java class is located at directory_1/directory_2/directory_3/udf_class.class in your JAR file, use:
CREATE FUNCTION <your_function_name> AS 'directory_1.directory_2.directory_3.udf_class';
- Without Sentry enabled on your cluster:
- Copy the JAR file to HDFS, make sure the hive user can access it, and make a note of the path (for example, hdfs:///user/hive/udf_jars/).
- Run the CREATE FUNCTION command as follows and point to the JAR file location in HDFS:
CREATE FUNCTION <your_function_name> AS '<fully_qualified_class_name>' USING JAR 'hdfs:///<path/to/jar/in/hdfs>';
Where <fully_qualified_class_name> is the fully qualified name of the Java class in your JAR file.
Adding Built-in UDFs to the HiveServer2 Blacklist
HiveServer2 maintains a blacklist of built-in UDFs to protect itself against attacks in multi-user environments, where the hive user's credentials could otherwise be used to execute arbitrary Java code.
hive.server2.builtin.udf.blacklist
Description: A comma-separated list of built-in UDFs that are not allowed to be executed. A UDF that is included in the list returns an error if invoked from a query.
Default value: Empty
To check whether hive.server2.builtin.udf.blacklist contains any UDFs, run the following SET statement in Beeline:
SET hive.server2.builtin.udf.blacklist;
If any UDFs are blacklisted, they are returned by this command. For example, if character_length() and ascii() are blacklisted, the SET statement returns the following, showing that these two built-in UDFs are disallowed:
+----------------------------------------------------+--+
| set |
+----------------------------------------------------+--+
| hive.server2.builtin.udf.blacklist=character_length,ascii |
+----------------------------------------------------+--+
To add built-in UDF names to the hive.server2.builtin.udf.blacklist property with Cloudera Manager:
- In the Cloudera Manager Admin Console, go to the Hive service.
- On the Hive service page, click the Configuration tab.
- On the Configuration page, click HiveServer2 under Scope and click Advanced under Category.
- Search for HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml and add the following information:
- Name: hive.server2.builtin.udf.blacklist
- Value: <builtin_udf_name1>,<builtin_udf_name2>...
- Description: Blacklisted built-in UDFs.
- Click Save Changes and restart the HiveServer2 service for the changes to take effect.
If you are not using Cloudera Manager to manage your cluster, set the hive.server2.builtin.udf.blacklist property in the hive-site.xml file.
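For example, an entry of the following form in hive-site.xml blacklists the two built-in UDFs used in the earlier example (substitute your own comma-separated list of UDF names), and HiveServer2 must be restarted afterward for the change to take effect:
<property>
  <name>hive.server2.builtin.udf.blacklist</name>
  <value>character_length,ascii</value>
  <description>Blacklisted built-in UDFs.</description>
</property>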