This post is part 2 of the Monitoring Apache Spark series. In part 1 – Monitoring Apache Spark: Why is it Challenging? , we discussed the basics of Spark architecture and why it is challenging to monitor a distributed and dynamic Spark cluster. In this post, we will dig deeper into the specific Apache Spark metrics and understand their behavior. Finally, we will discuss how you can automate the task of configuring monitoring for Spark and provide visibility into key concerns.
Key Spark Metrics and Behavior Patterns
Once you have identified and broken down the Spark and associated infrastructure and application components you want to monitor, you need to understand the metrics that you should really care about that affect the performance of your application as well as your infrastructure. Let's dig deeper into some of the things you should care about monitoring.
- In Spark, it is well known that memory- related issues are typical if you haven't paid attention to the memory usage when building your application. Make sure you track garbage collection and memory across the cluster on each component, specifically, the executors and the driver. Garbage collection stalls or abnormality in patterns can increase back pressure.
- Performance issues mostly show up during shuffles . Although avoiding shuffles and reducing the amount of data being moved between nodes is nice to have, it is often times unavoidable. It would be wise to track the shuffle read and write bytes on each executor and driver as well as the total Input bytes. As a thumb rule, more the shuffles, lower the performance.
- Track the latency of your Spark application. This is the time it takes to complete a single job on your Spark cluster
- Measure the throughput of your Spark application. This gives you the number of records that are being processed per second
- Monitor the number of file descriptors that are open. Shuffles sometimes cause too many file descriptors to be opened resulting in "Too many open files" exception. Check your ulimit as well.
- Monitor scheduling delay in streaming mode. It gives the time taken by the scheduler to submit jobs to the cluster
At OpsClarity, we understand Spark in-depth. Alerts are automatically configured for you. So when throughput or latency changes abnormally, you get alerted. If garbage collection and memory usage patterns go awry, or if shuffles are heavy, you get alerted. We do this through anomaly detection models that are baked into the product. As mentioned before, it also makes it very easy to identify where the bottlenecks are instead of jumping from component to component trying to find the root cause of an issue. We've curated a list of metrics you should consider monitoring and alerting, and these are automatically made available with OpsClarity.
|Spark Driver (one driver for each application)||Spark Executor (multiple executors for each application)|
• active jobs
• total jobs
• # of receivers
• # of completed batches
• # of running batches
• # of records received
• # of records processed
• # of unprocessed batches
• average message processing time
Max Memory Allocated
# of Active Tasks
# of Failed Tasks
# of Completed Tasks
Total Duration for which the Executor has run
Total Input Bytes
Total Shuffle Read Bytes
Total Shuffle Write Bytes
|Spark Master||Spark Worker|
# of Apps Running and Waiting
# of Apps Waiting
# of Workers Provisioned in the Cluster
# of Used Cores
# of Executors Currently Running
Down Data Nodes
HDFS disk %
NameNode Heap %
How Do I Troubleshoot Issues in Spark?
While you can hit a number of bumps when running Apache Spark, here are a few things to ask yourself during the debugging process.
- Is the Spark pipeline healthy? Are there affected components that could lead to degraded performance?
- Does the cluster have enough resources (cores, memory) available for jobs to run?
- How well is the data parallelized? Are there too many or too little partitions?
- How much memory is the application consuming per executor?
- What is the garbage collection pattern on the executors and the driver? Too much GC correlating with increase in ingestion rate could cause havoc on the executors
- Are you using the right serializer?
- Has there been an increase in shuffle reads/writes?
- What is the throughput and latency of my application?
The Spark UI provides details of a single job in great depth by breaking it down to stages and tasks. However, if you need a production ready monitoring tool to get an aggregated view of how the cluster is performing, and understand how a dynamic infrastructure like Spark with different moving components affect availability and performance, you need a more complete monitoring tool. Understanding what’s important to be alerted on and not, can greatly help you focus on things that matter and increase your agility on building applications on Spark.
At OpsClarity, we not only discover and automatically monitor Spark but also the connected components such as Hadoop services, messaging systems like Kafka, etc. With an understanding of how your components are connected (the Operational Knowledge Graph), OpsClarity can help you seamlessly monitor performance and availability across different services.