How to Configure Impala with Dedicated Coordinators
By default, each host that runs the Impala Daemon acts as both a coordinator and an executor, managing metadata caching, query compilation, and query execution. In this configuration, Impala clients can connect to any Impala daemon and send query requests.
Under highly concurrent workloads for large-scale queries, these dual roles can cause scalability issues because:
- The extra work required for a host to act as the coordinator could interfere with its capacity to perform other work for the later phases of the query. For example, coordinators can experience significant network and CPU overhead with queries containing a large number of query fragments. Each coordinator caches metadata for all table partitions and data files, which can be substantial and can contend with memory needed to process joins, aggregations, and other operations performed by query executors.
- Having a large number of hosts act as coordinators can cause unnecessary network overhead, or even timeout errors, as each of those hosts communicates with the Statestored daemon for metadata updates.
- The "soft limits" imposed by the admission control feature are more likely to be exceeded when there are a large number of heavily loaded hosts acting as coordinators. Check IMPALA-3649 and IMPALA-6437 to see the status of the enhancements to mitigate this issue.
The following factors can further exacerbate the above issues:
- High number of concurrent query fragments due to query concurrency and/or query complexity
- Large metadata topic size related to the number of partitions/files/blocks
- High number of coordinator nodes
- High number of coordinators used in the same resource pool
If such scalability bottlenecks occur, in CDH 5.12 / Impala 2.9 and higher, you can address them by assigning each Impala daemon host a single dedicated role, either as a coordinator or as an executor.
- All explicit or load-balanced client connections must go to the coordinator hosts. These hosts perform the network communication to keep metadata up-to-date and route query results to the appropriate clients. The dedicated coordinator hosts do not participate in I/O-intensive operations such as scans, or CPU-intensive operations such as aggregations.
- The executor hosts perform the intensive I/O, CPU, and memory operations that make up the bulk of the work for each query. The executors do communicate with the Statestored daemon for membership status, but the dedicated executor hosts do not process the final result sets for queries.
Using dedicated coordinators offers the following benefits:
- Reduces memory usage by limiting the number of Impala nodes that need to cache metadata.
- Provides better concurrency by avoiding coordinator bottlenecks.
- Eliminates query over-admission by using one dedicated coordinator.
- Reduces resource utilization, especially network utilization, on the Statestore daemon by limiting metadata broadcasts to a subset of nodes.
- Improves reliability and performance for highly concurrent workloads by reducing workload stress on coordinators. Dedicated coordinators require 50% or fewer connections and threads.
- Reduces the number of explicit metadata refreshes required.
- Improves diagnosability: if a bottleneck or other performance issue arises on a specific host, you can narrow down the cause more easily because each host is dedicated to specific operations within the overall Impala workload.
In this configuration with dedicated coordinators and executors, you cannot connect to the dedicated executor hosts through clients such as impala-shell or business intelligence tools, because only the coordinator nodes support client connections.
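For example, a client application must point at a coordinator host rather than an executor. The following is a minimal sketch using the impyla client library; the hostname coordinator-1.example.com is a placeholder, and the default HiveServer2 port 21050 is assumed.

```python
from impala.dbapi import connect  # pip install impyla

# Connect to a dedicated coordinator host (hypothetical hostname).
# Executor-only hosts do not accept client connections in this setup.
conn = connect(host="coordinator-1.example.com", port=21050)
cursor = conn.cursor()
cursor.execute("SELECT version()")
print(cursor.fetchall())
cursor.close()
conn.close()
```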
Determining the Optimal Number of Dedicated Coordinators
You should have the smallest number of coordinators that still satisfies your workload requirements in a cluster. A rough estimate is 1 coordinator for every 50 executors.
To maintain a healthy state and optimal performance, it is recommended that you keep the peak utilization of all resources used by Impala, including CPU, the number of threads, the number of connections, and RPCs, under 80%.
Consider the following factors to determine the right number of coordinators in your cluster:
- What is the number of concurrent queries?
- What percentage of the workload is DDL?
- What is the average query resource usage at the various stages (merge, runtime filter, result set size, and so on)?
- How many Impala daemons (impalad) are in the cluster?
- Is there a high availability requirement?
- What is the compute/storage capacity reduction factor?
Use the following steps to determine the initial number of coordinators; a sketch of this sizing heuristic follows the steps:
- If your cluster has fewer than 10 nodes, we recommend that you configure one dedicated coordinator. Deploy the dedicated coordinator on a DataNode to avoid losing storage capacity. In most cases, one dedicated coordinator is enough to support all workloads on a cluster.
- Add more coordinators if the dedicated coordinator's peak CPU or network utilization is 80% or higher. You might need 1 coordinator for every 50 executors.
- If the Impala service is shared by multiple workgroups, each with a dynamic resource pool assigned, use one coordinator per pool to avoid over-admission in admission control.
- If high availability is required, double the number of coordinators, with one set acting as the active set and the other as a backup set.
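The following is a minimal sketch of this sizing heuristic in Python, assuming the 1-coordinator-per-50-executors rule of thumb, one coordinator per resource pool, and doubling for high availability. The numbers are illustrative starting points, not a substitute for measuring peak utilization.

```python
import math

def initial_coordinator_count(num_executors: int,
                              num_resource_pools: int = 1,
                              high_availability: bool = False) -> int:
    """Rough starting point only; adjust based on measured peak utilization."""
    # Rule of thumb: roughly 1 coordinator per 50 executors,
    # with a minimum of one dedicated coordinator.
    by_ratio = max(1, math.ceil(num_executors / 50))
    # One coordinator per dynamic resource pool helps avoid over-admission.
    count = max(by_ratio, num_resource_pools)
    # For high availability, double the count: an active set plus a backup set.
    return count * 2 if high_availability else count

print(initial_coordinator_count(num_executors=120))                        # 3
print(initial_coordinator_count(num_executors=8, high_availability=True))  # 2
```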
Advanced Tuning
- The concurrency of DML statements does not typically depend on the number of coordinators or size of the cluster. Queries that return large result sets (10,000+ rows) consume more CPU and memory resources on the coordinator. Add one or two coordinators if the workload has many such queries.
- DDL queries, excluding COMPUTE STATS and CREATE TABLE AS SELECT, are executed only on coordinators. If your workload contains many DDL queries running concurrently, you could add one coordinator.
- CPU contention on coordinators can slow down query execution when concurrency is high, especially for very short queries (<10s). Add more coordinators to avoid CPU contention.
- On a large cluster, the number of network connections from a coordinator to executors can grow quickly as query complexity increases, and the growth is much greater on coordinators than on executors. Add a few more coordinators to share the load if the workload is complex, that is, (average number of fragments per query * number of impalads) > 500, even when memory and CPU usage are low; see the sketch after this list. Watch IMPALA-4603 and IMPALA-7213 to track the progress on fixing this issue.
- When using multiple coordinators for DML statements, divide queries into different groups (number of groups = number of coordinators). Configure a separate dynamic resource pool for each group and direct each group of query requests to a specific coordinator. This avoids query over-admission.
- The frontend connection requirement is not a factor in determining the number of dedicated coordinators. Consider setting up a connection pool on the client side instead of adding coordinators. As a short-term solution, you could increase the value of fe_service_threads on coordinators to allow more client connections.
- In general, you should have a very small number of coordinators, so storage capacity reduction is not a concern. On a very small cluster (fewer than 10 nodes), deploy a dedicated coordinator on a DataNode to avoid storage capacity reduction.
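A minimal sketch of the complexity check above, assuming you have already derived the average fragment count per query (for example, total fragments / total queries from the monitoring charts described below):

```python
def needs_more_coordinators(avg_fragments_per_query: float,
                            num_impalads: int,
                            threshold: int = 500) -> bool:
    """Complex workloads fan out many fragment connections from each
    coordinator, so add coordinators once this product passes the threshold."""
    return avg_fragments_per_query * num_impalads > threshold

# Example: an average of 12 fragments per query on a 60-node cluster.
print(needs_more_coordinators(12, 60))  # True (720 > 500)
```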
Estimating Coordinator Resource Usage
Resource | Safe range | Notes / CM tsquery to monitor
---|---|---
Memory | (Max JVM heap setting + query concurrency * query mem_limit) <= 80% of Impala process memory allocation | Memory usage: SELECT mem_rss WHERE entityName = "Coordinator Instance ID" AND category = ROLE. JVM heap usage (metadata cache): SELECT impala_jvm_heap_current_usage_bytes WHERE entityName = "Coordinator Instance ID" AND category = ROLE (only in release 5.15 and above)
TCP Connection | Incoming + outgoing < 16K | Incoming connection usage: SELECT thrift_server_backend_connections_in_use WHERE entityName = "Coordinator Instance ID" AND category = ROLE. Outgoing connection usage: SELECT backends_client_cache_clients_in_use WHERE entityName = "Coordinator Instance ID" AND category = ROLE
Threads | < 32K | SELECT thread_manager_running_threads WHERE entityName = "Coordinator Instance ID" AND category = ROLE
CPU | Non-DDL query concurrency <= number of virtual cores (vcores) allocated to Impala per node | Base the CPU usage estimate on the cores allocated to Impala per node, not the sum of all cores in the cluster. In theory, optimal concurrency should not exceed the number of vcores allocated to Impala per node. Query concurrency: SELECT total_impala_num_queries_registered_across_impalads WHERE entityName = "IMPALA-1" AND category = SERVICE
If usage of any of the above resources exceeds the safe range, consider adding one more coordinator.
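A minimal sketch that applies these safe-range checks to sampled values; the figures in the example call are placeholders you would replace with measurements from the tsqueries above.

```python
def coordinator_within_safe_ranges(jvm_heap_bytes: int,
                                   query_concurrency: int,
                                   query_mem_limit_bytes: int,
                                   process_mem_limit_bytes: int,
                                   tcp_connections: int,
                                   threads: int,
                                   vcores_per_node: int) -> bool:
    """Evaluates the safe ranges from the table above for one coordinator."""
    return all([
        # Memory: JVM heap + concurrency * per-query mem_limit must stay
        # within 80% of the Impala process memory allocation.
        jvm_heap_bytes + query_concurrency * query_mem_limit_bytes
            <= 0.8 * process_mem_limit_bytes,
        tcp_connections < 16_000,               # incoming + outgoing TCP
        threads < 32_000,                       # running threads
        query_concurrency <= vcores_per_node,   # non-DDL concurrency vs vcores
    ])

# Illustrative sample: 8 GiB heap, 20 concurrent queries at 2 GiB mem_limit,
# 96 GiB process memory limit, 32 vcores allocated to Impala on the node.
GiB = 1 << 30
print(coordinator_within_safe_ranges(8 * GiB, 20, 2 * GiB, 96 * GiB,
                                     tcp_connections=4_000, threads=9_000,
                                     vcores_per_node=32))  # True
```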
Monitoring Coordinator Resource Usage
- Impala Queries tab: Monitor attributes such as DDL queries and Rows produced. See Monitoring Impala Queries for detailed information.
- Custom charts: Monitor activities such as query complexity, which is the average fragment count per query (total fragments / total queries).
- tsquery: Build custom charts to monitor and estimate the amount of resources the coordinator needs. See tsquery Language for more information.
The following are sample queries for common resource usage monitoring. Replace the entityName values with your coordinator instance ID.
Resource usage | Tsquery
---|---
Memory usage | SELECT impala_memory_total_used, mem_tracker_process_limit WHERE entityName = "Coordinator Instance ID" AND category = ROLE
JVM heap usage (metadata cache) | SELECT impala_jvm_heap_current_usage_bytes WHERE entityName = "Coordinator Instance ID" AND category = ROLE (only in release 5.15 and above)
CPU usage | SELECT cpu_user_rate / getHostFact(numCores, 1) * 100, cpu_system_rate / getHostFact(numCores, 1) * 100 WHERE entityName = "Coordinator Instance ID"
Network usage (host level) | SELECT total_bytes_receive_rate_across_network_interfaces, total_bytes_transmit_rate_across_network_interfaces WHERE entityName = "Coordinator Instance ID"
Incoming connection usage | SELECT thrift_server_backend_connections_in_use WHERE entityName = "Coordinator Instance ID" AND category = ROLE
Outgoing connection usage | SELECT backends_client_cache_clients_in_use WHERE entityName = "Coordinator Instance ID" AND category = ROLE
Thread usage | SELECT thread_manager_running_threads WHERE entityName = "Coordinator Instance ID" AND category = ROLE
The following samples are service-level metrics, aggregated across all impalads:

Resource usage | Tsquery
---|---
Frontend connection usage | SELECT total_thrift_server_beeswax_frontend_connections_in_use_across_impalads, total_thrift_server_hiveserver2_frontend_connections_in_use_across_impalads
Query concurrency | SELECT total_impala_num_queries_registered_across_impalads WHERE entityName = "IMPALA-1" AND category = SERVICE
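To pull any of these metrics programmatically instead of charting them, you can query the Cloudera Manager REST API timeseries endpoint with a tsquery. A minimal sketch, assuming a CM server at the placeholder host cm.example.com:7180, API version v11, and placeholder credentials:

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical Cloudera Manager endpoint and credentials for illustration.
CM_TIMESERIES_URL = "http://cm.example.com:7180/api/v11/timeseries"
TSQUERY = ('SELECT thread_manager_running_threads '
           'WHERE entityName = "Coordinator Instance ID" AND category = ROLE')

resp = requests.get(CM_TIMESERIES_URL,
                    params={"query": TSQUERY},
                    auth=("admin", "admin"))
resp.raise_for_status()

# Print the latest data point of each returned time series.
for item in resp.json()["items"]:
    for series in item["timeSeries"]:
        entity = series["metadata"]["entityName"]
        data = series["data"]
        print(entity, data[-1]["value"] if data else "no data")
```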
Deploying Dedicated Coordinators and Executors
This section describes the process to configure dedicated coordinator and dedicated executor roles for Impala.
- Dedicated coordinators: They should run on edge nodes with no other services on them. They do not need large local disks, but they still need some local disk space for spilling. They require the same or an even larger memory allocation.
- Dedicated executors: They should be co-located with DataNodes as usual. The number of hosts with this setting typically increases as the cluster grows larger and handles more table partitions, data files, and concurrent queries.
- Navigate to Clusters > Impala > Configuration > Role Groups.
- Click Create to create two role groups with the following values:
  - Group for Coordinators
    - Group Name: Coordinators
    - Role Type: Impala Daemon
    - Copy from: None
  - Group for Executors
    - Group Name: Executors
    - Role Type: Impala Daemon
    - Copy from: None
- On the Role Groups page, click Impala Daemon Default Group.
  - Select the set of nodes intended to be coordinators.
  - Click Action for Selected and select Move To Different Role Group….
  - Select Coordinators.
  - Select the set of nodes intended to be executors.
  - Click Action for Selected and select Move To Different Role Group….
  - Select Executors.
- Click Configuration. In the search field, type Impala Command Line Argument Advanced Configuration Snippet (Safety Valve).
- For Coordinators, in the argument field, type: --is_executor=false
- For Executors, in the argument field, type: --is_coordinator=false
- Click Save Changes and then restart the Impala service.
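After the restart, you can spot-check that each daemon picked up the intended role. A minimal sketch, assuming the impalad debug web UI is reachable on its default port 25000 and using placeholder hostnames; the exact flag formatting on the /varz page may vary by release:

```python
import urllib.request

# Hypothetical hostnames; replace with your coordinator and executor hosts.
HOSTS = ["coordinator-1.example.com", "executor-1.example.com"]

for host in HOSTS:
    # The impalad debug web UI lists startup flag values on its /varz page.
    with urllib.request.urlopen(f"http://{host}:25000/varz?raw") as resp:
        flags = resp.read().decode()
    is_coordinator = "--is_coordinator=true" in flags
    is_executor = "--is_executor=true" in flags
    print(f"{host}: coordinator={is_coordinator} executor={is_executor}")
```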