Slow or excessive SQL Queries; wrongly configured connection pools; excessive service, REST and remoting calls; overhead through excessive logging or inefficient exception handling; as well as bad coding leading to CPU Hotspots, Memory Leaks, impact through Garbage Collection or stuck threads through synchronization issues. These are some of the top performance problems I analyzed through my “Share Your PurePath” program last year.
A big “Thank You!” to all our Dynatrace Personal License users for sharing their data with me – allowing me to not only help them with a free performance review but also share it with a larger audience.
Tomcat Performance Issues List
Because there are many Tomcat users out there, I compiled the top 10 problem patterns I look for when analyzing a Tomcat environment. This list also applies to other App Servers, so keep reading if your app runs on Jetty, JBoss, WebSphere, WebLogic, Glassfish, etc.:
- Database Access : Loading too much data inefficiently
- Micro-Service Access : Inefficient access and badly designed Service APIs
- Bad Frameworks : Bottlenecks under load or misconfiguration
- Bad Coding : CPU, Sync and Wait Hotspots
- Inefficient Logging : Even too much for Splunk & ELK
- Invisible Exceptions : Frameworks gone wild!
- Exceptions : Overhead through Stack Trace generation
- Pools & Queues : Bottlenecks through wrong Sizing
- Multi-Threading : Locks, Syncs & Wait Issues
- Memory : Leaks and Garbage Collection Impact
If your app suffers from any of these problems I can guarantee that not even Docker, EC2 or Azure will help you. Throwing more computing power at your problems will not help at all!
In Part I of this blog series I focus on how I identify problem patterns in Database Access, Service Access and Bad Frameworks. Part II & Part III will cover the remaining items. And who knows: if you are up for the challenge I have for you at the end of this blog post, there might be more to come!
I will be using Dynatrace as it is my tool of choice, but my approaches should also work using other APM or Performance Diagnostics Tools! If you want to watch a video rather than reading this blog series check out my Tomcat Performance Analysis YouTube Tutorial instead.
#1: Database Access: Loading too much data inefficiently
Believe it or not, loading too much data or loading data in an inefficient way is THE #1 Problem out there. And it’s not just me saying this – check out Theodora’s blog On Top Java Performance Issues, The Netflix Blog on Databases or Alois Reitbauer’s blog series on Hibernate Performance. I can therefore easily live with comments such as “WOW – yet another N+1 query problem – don’t you have anything new to tell?” (recently received this comment via email). It’s up to you out there whether to fix this issue: as long as I find this problem in > 80% of the performance data that people share with me I will present it as my #1 Problem. The following is my favorite view in Dynatrace to identify bad database access patterns: the Transaction Flow. Similar visualizations are available in other APM Tools. In Dynatrace you get this for every single request, which also allows you to analyze bad database access patterns even when they do not yet lead to slow overall transaction performance:
Most often this happens through an incorrectly configured O/R-Mapper such as Hibernate, Spring or the ADO.NET Entity Framework. Simply watch out for the # of SQL Statements being executed per request!
Also, examine the actual SQL statements executed to identify the following patterns: N+1 Query Problem, Slow Running Statements, Unprepared Statements. I covered these and other database patterns in a recent article I wrote for InfoQ: Top Java Performance Hotspots.
Don’t blindly trust your database access layer. Look at the individual SQL queries, execution count, type, on which connections they are executed and whether they are prepared or not. You will be surprised what you will find!
Dynatrace Tip : When drilling to the Database Dashlet I always click on the “Percentage of Transactions Calling” header. This will show me the table as shown above starting with the “Execs/calling transaction” -> If this column shows a value > 1 you know you have a potential N+1!
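To make the "executions per calling transaction" number concrete, here is a minimal sketch of the N+1 pattern next to a batched alternative. It does not talk to a real database: `CountingDb`, its methods and the order/item data are hypothetical stand-ins that only count how many statements a request would execute. The SQL in the comments is illustrative, not taken from any of the analyzed apps.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class NPlusOneSketch {

    /** Hypothetical in-memory "database" that counts every statement it executes. */
    static final class CountingDb {
        private final Map<Integer, List<String>> itemsByOrder = new TreeMap<>();
        private int statements = 0;

        CountingDb(int orders) {
            for (int id = 1; id <= orders; id++) {
                itemsByOrder.put(id, List.of("item-" + id + "-a", "item-" + id + "-b"));
            }
        }

        // Stands in for: SELECT id FROM orders
        List<Integer> findOrderIds() {
            statements++;
            return new ArrayList<>(itemsByOrder.keySet());
        }

        // Stands in for: SELECT name FROM items WHERE order_id = ?
        List<String> findItems(int orderId) {
            statements++;
            return itemsByOrder.get(orderId);
        }

        // Stands in for: SELECT order_id, name FROM items WHERE order_id IN (...)
        Map<Integer, List<String>> findItemsForOrders(Collection<Integer> orderIds) {
            statements++;
            Map<Integer, List<String>> result = new TreeMap<>();
            for (int id : orderIds) {
                result.put(id, itemsByOrder.get(id));
            }
            return result;
        }

        int statements() {
            return statements;
        }
    }

    // N+1 pattern: one statement for the ids, then one more per order.
    static Map<Integer, List<String>> loadNaive(CountingDb db) {
        Map<Integer, List<String>> result = new TreeMap<>();
        for (int id : db.findOrderIds()) {
            result.put(id, db.findItems(id));
        }
        return result;
    }

    // Batched alternative: two statements no matter how many orders exist.
    static Map<Integer, List<String>> loadBatched(CountingDb db) {
        return db.findItemsForOrders(db.findOrderIds());
    }

    public static void main(String[] args) {
        CountingDb naive = new CountingDb(40);
        loadNaive(naive);
        CountingDb batched = new CountingDb(40);
        loadBatched(batched);
        System.out.println("naive statements:   " + naive.statements());
        System.out.println("batched statements: " + batched.statements());
    }
}
```

Both methods return the same data; only the statement count differs, and it grows with the number of orders in the naive version. With an O/R-Mapper the equivalent fix is usually a fetch or batch setting rather than hand-written SQL, so treat this as an illustration of the access pattern only.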
#2: Micro-Service Access: Inefficient access and badly designed Service APIs
With the continued trend towards service orientation I see the problem of apps becoming “very chatty” on these remoting interfaces. For every remote call your app is going to bind two threads (on the caller and callee side), you are going to transport data over the network and your app has to process the data. Whether you still use RMI or traditional SOAP Web Services or you have already moved towards “lightweight” REST Services using JSON: you want to avoid bad service interface design and excessive interaction with your services. The N+1 Query Problem is yet again a hot topic – but now it is between two App Servers vs. between App Server and Database.
The following Transaction Flow shows a very classical architectural issue. The Frontend Tomcat Server makes 40 individual RMI calls to a backend service! Each service call itself must access the database, totaling 2,790 SQL executions! This is a distributed N+1 Query Problem!
Service Oriented Architecture is not always implemented well. Make sure you know which service is invoking which other service and what the end-to-end impact on the system is. Keep an eye on every thread pool involved!
If your app makes service calls – or has a frontend web server like in the example above – make sure to monitor and correctly size all the connection pools involved. Apache has a connection pool in its Tomcat Connector. Tomcat itself has an incoming worker thread pool as well as an outgoing thread pool when making remoting calls to other systems. Pool size and Utilization are typically exposed via JMX. When you do your load and performance testing make sure to find the correct pool sizes for each and every pool in your system.
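As a rough sketch of what reading pool size and utilization via JMX can look like, the following Java snippet queries thread pool MBeans and prints busy versus maximum threads. The object name pattern `Catalina:type=ThreadPool,*` and the attribute names `currentThreadsBusy` and `maxThreads` are assumptions based on a default standalone Tomcat; embedded setups often register under a different JMX domain, so check your own MBean tree (for example with JConsole) first. The snippet only sees these MBeans when it runs inside the Tomcat JVM; for a remote Tomcat you would need a JMX connector instead of the platform MBean server.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class PoolUtilizationSketch {

    // Fraction of the pool in use; NaN when the pool reports no usable maximum.
    static double utilization(int busy, int max) {
        if (max <= 0) {
            return Double.NaN;
        }
        return (double) busy / max;
    }

    // Prints one line per matching pool and returns how many pools were found.
    static int report(MBeanServer server, String pattern) throws Exception {
        Set<ObjectName> pools = server.queryNames(new ObjectName(pattern), null);
        for (ObjectName pool : pools) {
            // Attribute names are assumptions; adjust them to what your server exposes.
            int busy = ((Number) server.getAttribute(pool, "currentThreadsBusy")).intValue();
            int max = ((Number) server.getAttribute(pool, "maxThreads")).intValue();
            System.out.printf("%s busy=%d max=%d utilization=%.0f%%%n",
                    pool.getKeyProperty("name"), busy, max, 100 * utilization(busy, max));
        }
        return pools.size();
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        int found = report(server, "Catalina:type=ThreadPool,*");
        if (found == 0) {
            System.out.println("No Tomcat thread pool MBeans found in this JVM.");
        }
    }
}
```

Sampling these values at every step of a load test shows which pool saturates first, which is usually more useful than a single snapshot.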
Most important, though, is to figure out whether all these calls are really necessary or whether there might be a better Service API to get your work done. The following screenshot is an example of a Job Search Portal. When executing a search the frontend Tomcat server compiles a search result of 38 jobs found for a specific search. For each Job that matches the search criteria it makes a call to a Job Details Backend Service. This is a classical N+1 Query Problem. It would be smarter to provide a backend service API that returns the details of all Jobs that match the query. This would eliminate 37 Remoting calls!
N+1 Query Problem with potentially even bigger impact than on the database level. Make sure you understand which APIs you call how often. Think about defining “smart APIs” that get the job done more efficiently!
A quick look at the individual Remoting Calls makes it easy to not only spot the N+1 Query Problem, it also highlights that some of these calls are actually done twice! In this case it is time to define a new remote API — getAllResults — and allow a more complex query string to pass instead of only individual Job Titles:
Just as with the SQL Statements you should analyze every single remoting call. It is easy to spot the access pattern. This is also great input for extending the backend service with new API calls to make this more efficient.
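The difference between the chatty access pattern and a bulk API such as the proposed getAllResults can be sketched in a few lines of Java. Everything here is hypothetical: `JobDetailsClient`, its method signatures and the returned strings are placeholders that only count remote round trips, not the real service from the example above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BulkServiceSketch {

    /** Hypothetical backend client that counts remote round trips. */
    static final class JobDetailsClient {
        private int remoteCalls = 0;

        // Chatty API: one remote call per job title.
        String getJobDetails(String title) {
            remoteCalls++;
            return "details for " + title;
        }

        // Hypothetical bulk API: one remote call for all requested titles.
        Map<String, String> getAllResults(Set<String> titles) {
            remoteCalls++;
            Map<String, String> details = new LinkedHashMap<>();
            for (String title : titles) {
                details.put(title, "details for " + title);
            }
            return details;
        }

        int remoteCalls() {
            return remoteCalls;
        }
    }

    // One call per search hit; duplicate titles are fetched again.
    static Map<String, String> loadChatty(JobDetailsClient client, List<String> titles) {
        Map<String, String> details = new LinkedHashMap<>();
        for (String title : titles) {
            details.put(title, client.getJobDetails(title));
        }
        return details;
    }

    // De-duplicate the titles first, then fetch everything in one round trip.
    static Map<String, String> loadBulk(JobDetailsClient client, List<String> titles) {
        return client.getAllResults(new LinkedHashSet<>(titles));
    }

    public static void main(String[] args) {
        List<String> titles = List.of("java developer", "tester", "java developer", "architect");
        JobDetailsClient chatty = new JobDetailsClient();
        loadChatty(chatty, titles);
        JobDetailsClient bulk = new JobDetailsClient();
        loadBulk(bulk, titles);
        System.out.println("chatty remote calls: " + chatty.remoteCalls());
        System.out.println("bulk remote calls:   " + bulk.remoteCalls());
    }
}
```

The bulk variant trades a slightly more complex request for a fixed number of round trips, which also means fewer threads bound on both the caller and the callee side.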
Dynatrace Tip : If you use Dynatrace, simply drill to the Web Requests Dashlet. Then make sure to select “Show -> All” and “Group By -> Uri and Query” in the Context Menu. This will now show you all remoting calls WITHIN a single transaction including URI and Query Parameters!
#3: Bad Frameworks: Bottlenecks under load
I assume that > 90% of the code that runs in your Tomcat is not your own code but instead Java Runtime, Spring, Hibernate, Netflix Hystrix, or any other cool library you downloaded on GitHub. We all build new software on top of existing frameworks instead of re-inventing the wheel every time we start a new project. The problem with this approach is that many developers or architects start with a sample or prototype implementation which then becomes the real code base without taking the time to figure out whether this framework will also work under “non demo” conditions.
A favorite approach of mine to figure out whether a framework is a potential performance bottleneck is to run an Increasing Load load test. You can pick any load testing tool available to you. My favorites are JMeter (or BlazeMeter), Neotys, SoapUI, SilkPerformer (I used to be an engineer on that team) or even LoadRunner.
In an Increasing Load load-test I slowly start ramping up the load on the system until I reach a breaking point of the app. A breaking point for me is when Response Time spikes or I run out of CPU, Memory, Disk or Network bandwidth on my servers.
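One simple way to turn "Response Time spikes" into something you can check automatically is to compare every load step against the first, lightly loaded step. The sketch below does that in Java; the `Step` class, the sample numbers and the factor of 3 are assumptions for illustration only, and it looks at response time alone, not at CPU, Memory, Disk or Network saturation.

```java
import java.util.List;
import java.util.OptionalInt;

class BreakingPointSketch {

    /** One load step: concurrent users and the average response time measured at that step. */
    static final class Step {
        final int users;
        final double avgResponseMs;

        Step(int users, double avgResponseMs) {
            this.users = users;
            this.avgResponseMs = avgResponseMs;
        }
    }

    // Returns the user count of the first step whose response time exceeds
    // baseline * factor, where the baseline is the first (lowest load) step.
    static OptionalInt breakingPoint(List<Step> steps, double factor) {
        if (steps.isEmpty()) {
            return OptionalInt.empty();
        }
        double baseline = steps.get(0).avgResponseMs;
        for (Step step : steps) {
            if (step.avgResponseMs > baseline * factor) {
                return OptionalInt.of(step.users);
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Made-up measurements from a ramp-up test.
        List<Step> steps = List.of(
                new Step(10, 100), new Step(20, 105), new Step(40, 120),
                new Step(80, 450), new Step(160, 2000));
        OptionalInt result = breakingPoint(steps, 3.0);
        System.out.println(result.isPresent()
                ? "breaking point at " + result.getAsInt() + " users"
                : "no breaking point reached");
    }
}
```

The threshold factor is a judgment call; pick it based on the response time your users are actually willing to accept.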
I always do a quick sanity check on host and process metrics before going into the application itself. If I already see that load is not correctly distributed behind the load balancer or that memory sizes are incorrectly configured, I stop here and fix these problems right away.
In 80% of the cases I can typically spot the problem by looking at what Dynatrace calls the Layer Breakdown chart. Other APM tools have similar views. The idea is that we break down response time into logical layers of your application code. A logical layer could be JDBC, Hibernate, Spring, RMI, Web Services, Caching Framework or Your Own Code. The other 20% are configuration issues (pool sizes, load balancers, etc.).
The following is a Layer Breakdown that clearly shows the breaking point in a Caching Framework used by this application. At a certain point into the load test the code in that Caching Framework showed disproportional contribution to the response time of the application.
The Layer Breakdown is the perfect starting point for hotspot analysis. Especially under varying load conditions it is easy to spot which layers of your application don’t scale!
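If your tooling does not offer a layer view, the idea can be approximated by grouping per-class execution time by package prefix. The following Java sketch shows that grouping. The prefix table is an assumption (in particular `com.example.cache.` is a made-up package for a caching framework), the timings are invented, and this is not how Dynatrace computes its Layer Breakdown.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class LayerBreakdownSketch {

    // Hypothetical mapping from package prefix to logical layer.
    static final Map<String, String> LAYERS = new LinkedHashMap<>();
    static {
        LAYERS.put("java.sql.", "JDBC");
        LAYERS.put("org.hibernate.", "Hibernate");
        LAYERS.put("org.springframework.", "Spring");
        LAYERS.put("com.example.cache.", "Caching Framework");
    }

    // Classes that match no known prefix are counted as application code.
    static String layerOf(String className) {
        for (Map.Entry<String, String> entry : LAYERS.entrySet()) {
            if (className.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "Your Own Code";
    }

    // Sums the execution time (in milliseconds) per logical layer.
    static Map<String, Long> breakdown(Map<String, Long> millisByClass) {
        Map<String, Long> byLayer = new TreeMap<>();
        for (Map.Entry<String, Long> entry : millisByClass.entrySet()) {
            byLayer.merge(layerOf(entry.getKey()), entry.getValue(), Long::sum);
        }
        return byLayer;
    }

    public static void main(String[] args) {
        Map<String, Long> sample = new LinkedHashMap<>();
        sample.put("org.hibernate.Session", 120L);
        sample.put("org.hibernate.Query", 30L);
        sample.put("com.example.cache.Store", 400L);
        sample.put("com.acme.shop.Cart", 50L);
        System.out.println(breakdown(sample));
    }
}
```

Computing this breakdown once per load step and comparing the steps shows which layer's share of the response time grows disproportionately as load increases.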
Having identified the layer I start analyzing the actual methods in the code that cause the problem. I can even compare two timeframes (before the problem started and when we see the biggest spikes) to make identifying the root cause even easier. One of my favorite views here is the Methods Hotspot Dashlet that shows me which methods actually contributed to the execution time – even broken down into CPU, Synchronization, Wait, and Garbage Collection Suspension Time:
Make sure you always load and performance test the frameworks you are using. Consult the vendors and ask for best practice configurations.
I have a long history in Load Testing and Test Analysis, and it is one of my favorite tasks. I encourage you to read my two blog posts on Key Load Testing Metrics: Part I & Part II.
Dynatrace Tip : When analyzing Load Tests take a look at the Load Testing Overview Dashboard which you can open through the Start Center. Also check out my Load Testing with Dynatrace YouTube Tutorial .
Challenge: Prove me wrong or show me a new problem & win a speaking gig!
I have the luxury and privilege of speaking at many user groups and conferences around the world – mainly presenting the top problem patterns I found in prior months and years. I would like to give you the opportunity to get on stage with me at a user group or conference that works for you. All you need to do is to either prove to me that your app is not suffering from these problems (demo apps don’t count) or you show me a new problem pattern that I do not yet have on the list!
Challenge Accepted? If so – just sign up for the Dynatrace Personal License . After the 30-day trial period it remains FREE FOR LIFE for you to analyze your local apps. After you have signed up and receive the license file (check your spam folder for emails from firstname.lastname@example.org ) you have two options:
- Full Install of Dynatrace in your environment -> Download and Install from here!
- Just use the pre-configured Dynatrace Docker Containers on GitHub -> special thanks to my colleague Martin Etmajer !
I also recommend you check out my YouTube Tutorials on What Is Dynatrace and How Does it Work as well as Tomcat Performance Analysis with Dynatrace. Once you have some Dynatrace PurePaths collected share them with me through my Share Your PurePath program.
So – who is up for the challenge? First come – first win!
About The Author
Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi