In my first blog on Top Tomcat Performance Problems I focused on Database Access , Micro-Services Access and Bad Frameworks that impact your application performance and scalability running in your Java App Server. This time I will focus on Bad Coding resulting in high CPU, Sync or Wait hotspots, Inefficient Logging, and the Impact of Invisible Exceptions.
As a reminder – here is my complete Top 10 list which is also applicable to other App Servers – so keep reading if your app runs on Jetty, JBoss, WebSphere, WebLogic, Glassfish, and more.
- Database Access : Loading too much data inefficiently
- Micro-Service Access : Inefficient access and badly designed Service APIs
- Bad Frameworks : Bottlenecks under load or misconfiguration
- Bad Coding : CPU, Sync and Wait Hotspots
- Inefficient Logging : Even too much for Splunk & ELK
- Invisible Exceptions : Frameworks gone wild!
- Exceptions : Overhead through Stack Trace generation
- Pools & Queues : Bottlenecks through wrong Sizing
- Multi-Threading : Locks, Syncs & Wait Issues
- Memory : Leaks and Garbage Collection Impact
Once again I want to offer a big “thank you” to all our Dynatrace Personal License users for sharing their data with me through my Share Your PurePath program. Without you I wouldn’t have so much information to share with everyone. Also, it is great to see that some folks have already accepted the challenge announced in my first blog post on this topic: proving me wrong or sending me new problem patterns. The winner gets a speaking session with me at a conference or user group meeting (and, of course, eternal fame!).
As I mentioned in Part I of this blog series: If your app suffers from any of these problems I can guarantee that not even Docker, EC2 or Azure will help you. Throwing more computing power at your problems will not help at all!
Again, I will be using Dynatrace as it is my tool of choice but my approaches should also work with other APM or Performance Diagnostics Tools! If you want to watch a video rather than reading this blog series check out my Tomcat Performance Analysis YouTube Tutorial instead.
Top Tomcat Performance Problem #4: Bad Coding — CPU, Sync & Wait Hotspots
I used to be a developer and just like every developer I thought my code was the best, most elegant, and most efficient “out there”. Sound familiar? Now, in 2016, I know better because I proved myself wrong by analyzing some of my old code. The same realization occurred frequently last year when I analyzed CPU, Sync and Wait Hotspots in Java Apps. We still find the classic “String Concatenations” but fortunately seen less often these days.
When I analyze CPU, Sync (time spent in entering a synchronized block or method) and Wait (call to object.wait) Hotspots I typically start in the Layer Breakdown as shown in Part I of this blog series. There I identify the problematic layer and then drill to the Method Hotspots View. An alternative is to start in the Response Time Hotspot which allows me to filter on “Slowest 10%”, “Slowest 5%” and “Slowest 1%”. This is useful to determine how your hotspots shift for your slowest transactions. Here is an interesting observation on a PurePath that was sent to me: “ Very high wait time in two custom code packages as well as in the ElasticSearch API. What is the code waiting for? Waiting also means they block the current executing threads which means the worker threads cannot work on new incoming requests! ”
Good synchronized access to shared resources is very hard to achieve and can only be seen under load when you actually have concurrent access to these shared resources. Therefore always check out for sync time when system is under load. The same is true for waiting on asynchronous tasks.
In the example above we see extremely high wait time. Wait time typically means you are waiting on some task to be completed, e.g: making an external REST call and waiting for the result. It could also mean waiting on a free resource, e.g: waiting for the next free Http connection to make a call. Both cases mean that the current worker thread is put on hold and can’t do anything until to object gets notified. In the above use case the calling thread is waiting for the response of an external REST Call as see from the Methods Hotspots Dashlet:
It is important to understand why your code is waiting: lack of resources or long job execution. In both cases your calling threads are blocked and your JVM eventually runs out of worker threads.
If you are calling external services synchronously and these services are getting slower to respond it impacts your JVMs ability to handle new incoming requests. You might run out of worker threads if they are all waiting. This can have a ripple effect as requests will back-up on your frontend app server, web server or load balancer. It is easier said than done and implemented – but in this case you want to think about an event driven and asynchronous invocation model to decouple your worker threads from your external service calls. Another alternative would be to speed up these external service calls – assuming you can speed them up, as they are under your control, e.g: make them faster by fixing performance problems or add more service instances to the cluster to handle more parallel requests!
Another scenario I’ve seen are undersized outgoing connection pools. If your JVM opens outgoing network connections you must go through a pool managed by your JVM or by the communication framework you are using. Make sure you check the number of connections in that pool and size it appropriately to the number of connections you think your system will need to open under peak load.
A final example I want to raise is a CPU Hotspot. I blogged about this story a long time ago so I’ll keep it short here and let you refer to the original blog. It turned out to be a CPU Hotspot when using a Hashtable. The Hashtable itself was obviously not the problem as it is highly optimized. However, the Hashtable was used very inefficiently, causing the custom application code to spend most of its CPU time in Hashtable.get. Here is the Method Hotspots Dashlet with some of the details to this use case:
You need to understand how your code is using standard libraries such as Hashtable’s. CPU Hotspots in these libraries are typically caused by bad access patterns from your custom code!
Dynatrace Tip : The Response Time and Methods Hotspot also show I/O. That is time spent on the Network or File I/O. Don’t let yourself fooled down a wrong path if you see high I/O on the web server. That is usually normal for web applications delivering volumes of static content.
Top Tomcat Performance Problem#5: Inefficient Logging Even for Splunk & ELK
Splunk and ELK (ElasticSearch & Kibana) are a blessing for many log analytics fanatics. The dark side of the story is that log levels in frameworks seem to get abused to squeeze more data out of any app. But what if this data is a) CRXP and b) logging becomes your performance bottleneck?
In the last year I’ve seen several of these scenarios, and there are blogs about the performance impact of logging . The first one is simple yet still happens frequently. Debug log level makes it into a high load environment. Is this really necessary? Was it intentional? Or was it a mistake? The performance impact is huge as you can see from the Methods Hotspot Dashlet:
Do your sanity checks on log levels and log appenders. There has to be a VERY GOOD reason for debug logging being turned on in a high load or production environment. Here is proof of the potential impact.
Another example is from a misconfigured Spring framework that generated too many log entries. This behavior is also immediately visible as a performance hotspot.
CPU and I/O shows up as hotspot in log4j when Spring was configured to generate verbose log output. Please do your homework and verify that what ends up in the log file should really be there!
A closing remark on logging: You do not need an APM tool to figure out you are logging too much – simply look in the log files from time to time and figure out what percent of log entries actually make sense. Also, APM tools do a great job showing you the current impact of logging on your performance, as well as showing you exactly which framework or custom code is responsible for it!
Dynatrace Tip: Dynatrace can also capture log messages in the context of each executed transaction. This makes error analysis very easy if you always see the PurePath that generated a log message. By default the Logging Sensors only capture ERROR, FATAL and SEVERE but you can change that in the Sensor Properties. Check out my YouTube Tutorial on Advanced Sensors for more hands-in guidance!
Top Tomcat Performance Problem #6: Invisible exceptions — Frameworks gone wild!
From my viewpoint there are three types of exceptions: those that we capture and handle; those that we don’t handle and eventually end up in a log file or as an HTTP 5xx; and those of which we are not even aware. It appears the third exception outnumbers the other two by 9:1! These are exceptions that are thrown and handled within frameworks and runtime libraries we use. There can be millions of these exceptions being thrown without you noticing because they are handled perfectly, e.g: a framework missing some configuration parameters throws an Exception and handles it by defining a default value.
Two years ago we upgraded our Confluence used to power our internal and external Wiki to an up-to-date version. After the update we received complaints about slow performance. Our servers also showed higher CPU utilization than before the upgrade. We identified that fillInStackTrace was on the top of our CPU hotspots – a method that gets called when a new Exception object is created:
Every Exception object will get the current stack trace calling fillInStackTrace. This can become a CPU issue if you have thousands of Exceptions being created all the time!
What was odd is that we didn’t really see many log messages that would indicate a big problem. When correlating the number of log messages (severe, warning, fatal) with the number of Exception Objects created it became apparent that something was happening within the frameworks that we didn’t know about. For every 1 log entry we saw 4000 Exceptions being created!
I always like to correlate Log Messages with Exceptions. It gives you a good indication on whether something is going on within your frameworks that you are not aware of. Especially after an upgrade or new deployment!
When looking at these Exceptions in the Dynatrace Exception Dashlet it became clear what was happening:
A deployment problem caused thousands of Missing Resource Exceptions to be thrown but handled internally by Atlassian. We would have never found out about a deployment problem through the log files.
Lesson learned: even Splunk or ELK wouldn’t have helped here as the information of a missing resource never made it to a log file. These exceptions were handled internally causing a lot of CPU overhead. I therefore always watch out for hotspots in fillInStackTrace and then analyze which Exceptions we are missing!
Dynatrace Tip : By default Dynatrace captures most Exceptions that are created in your application. Not only those that are missed but really everything . We only exclude a few system exceptions to avoid capturing lots of irrelevant Java internal information. To learn more about our Exception Sensor and how to configure it, check out this Advanced Sensor YouTube Tutorial .
Challenge: Prove me wrong or show me a new problem & win a speaking engagement with me!
As I mentioned in my first blog post I want to give you the opportunity to get on stage with me at a user group or conference that suits your schedule! All you need to do is to either prove to me that your app is not suffering from these problems (demo apps don’t count) or you show me a new problem pattern that I do not have on the list.
Challenge Accepted? If so – just sign up for the Dynatrace Personal License . After the 30-day trial period it is free for life for you to use to analyze local apps. After you sign up and received the license file (check your spam folder for emails from email@example.com ) you have two options:
- Full Install of Dynatrace in your environment -> Download and Installfrom here!
- Just use the pre-configured Dynatrace Docker Containers on GitHub -> special thanks to my colleague Martin Etmajer !
I also recommend checking out my YouTube Tutorials on What Is Dynatrace and How Does it Work as well as Tomcat Performance Analysis with Dynatrace. Once you have some Dynatrace PurePaths collected share them with me through my Share Your PurePath program.
So, let me throw the gauntlet down again – are you ready to answer this challenge?
About The Author
Andreas Grabner Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi