Continuous Integration/Continuous Deployment (CI/CD) has become central to software development. To ensure high-quality software releases, smoke tests, regression tests, performance tests, static code analysis, and security scans are run in the CI/CD pipeline. Despite all these quality measures, OutOfMemoryError, CPU spikes, unresponsiveness, and degraded response times still surface in production environments.
These performance problems surface in production because the CI/CD pipeline typically studies only macro-level metrics, such as static code quality metrics, test/code coverage, CPU utilization, memory consumption, and response time. In several cases these macrometrics aren’t sufficient to uncover performance problems. In this article, let’s review the micrometrics that should be studied in the CI/CD pipeline to deliver high-quality releases to production. We will also learn how to source these micrometrics and integrate them into the CI/CD pipeline.
How are Tsunamis forecasted?
You might wonder how Tsunami forecasting is related to this article. There is a relationship :-). A normal sea wave travels at a speed of 5 – 60 miles/hr, whereas Tsunami waves travel at a speed of 500 – 600 miles/hr. Even though Tsunami waves travel roughly 10x – 100x faster than normal waves, they are very hard to forecast. Thus, modern-day technologies use micrometrics to forecast Tsunami waves.
Fig: DART Device to detect Tsunami
To forecast Tsunamis, multiple DART (Deep-ocean Assessment and Reporting of Tsunamis) devices are installed throughout the world. Each DART device has two parts:
a. Surface Buoy: Device which floats at the top of ocean water
b. Seabed Monitor: Device which is stationed at the bottom of the ocean
The deep ocean is about 6000 meters deep (roughly 20x the height of Salesforce Tower, the tallest building in San Francisco). Whenever the sea level rises by more than 1 mm, DART automatically detects it and transmits this information to a satellite. This 1 mm rise in sea water is a lead indicator of a Tsunami’s origination. Pause here for a second and visualize 1 mm on the scale of a 6000-meter sea depth. It’s almost nothing, negligible. But this micrometric analysis is what is used for forecasting Tsunamis.
How to forecast Performance Tsunamis through Micrometrics?
Similarly, there are a few micrometrics that you can monitor in your CI/CD pipeline. These micrometrics are lead indicators of several performance problems that you will face in production. A rise or drop in the values of these micrometrics is a strong indicator of the origination of performance problems.
1. Garbage Collection Throughput
2. Average GC pause time
3. Maximum GC pause time
4. Object creation rate
5. Peak heap size
6. Thread Count
7. Thread States
8. Thread Groups
9. Wasted Memory
10. Object Count
11. Class Count
Let’s study each micrometric in detail:
1. GARBAGE COLLECTION THROUGHPUT
Garbage Collection throughput is the percentage of time your application spends processing customer transactions, as opposed to the time it spends doing garbage collection.
Let’s say your application has been running for 60 minutes. In these 60 minutes, 2 minutes are spent on GC activities.
It means the application has spent 3.33% of its time on GC activities (i.e. 2 / 60 * 100).
It means Garbage Collection throughput is 96.67% (i.e. 100 – 3.33).
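The arithmetic above can be sketched in a few lines of Java. This is only an illustration of the calculation; the class and method names (GcThroughput, throughputPercent) are made up for this example and are not part of any tool mentioned in this article.

```java
final class GcThroughput {

    /** Returns the percentage of elapsed time that was NOT spent in garbage collection. */
    static double throughputPercent(double totalMinutes, double gcMinutes) {
        if (totalMinutes <= 0 || gcMinutes < 0 || gcMinutes > totalMinutes) {
            throw new IllegalArgumentException(
                    "expected totalMinutes > 0 and 0 <= gcMinutes <= totalMinutes");
        }
        double gcPercent = gcMinutes / totalMinutes * 100.0; // share of time spent in GC
        return 100.0 - gcPercent;
    }

    public static void main(String[] args) {
        // The example from the text: 60 minutes of runtime, 2 of them spent in GC.
        System.out.printf("GC throughput: %.2f%%%n", throughputPercent(60, 2));
    }
}
```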
When there is a degradation in the GC throughput, it’s an indication of some sort of memory problem. Now the question is: what is the acceptable throughput %? It depends on the application and business demands. Typically, one should target more than 98% throughput.
2. AVERAGE GARBAGE COLLECTION PAUSE TIME
When a Garbage Collection event runs, the entire application pauses. This is because Garbage Collection has to mark every object in the application and check whether it is referenced by other objects; if there are no references, the object has to be evicted from memory. Then the fragmented memory has to be compacted. To do all these operations, the application is paused. Thus, when Garbage Collection runs, customers experience pauses/delays, and one should always aim for a low average GC pause time.
3. MAX GARBAGE COLLECTION PAUSE TIME
Some Garbage Collection events might take a few milliseconds, whereas others might take several seconds to minutes. You should measure the maximum Garbage Collection pause time to understand the worst possible impact on the customer. Proper tuning (and, if needed, application code changes) is needed to reduce the maximum Garbage Collection pause time.
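To make the difference between the average and the maximum pause concrete, here is a minimal Java sketch. It assumes the individual pause durations have already been extracted from a GC log; the pause values and the class name GcPauseStats are hypothetical.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;

final class GcPauseStats {

    /** Summarizes GC pause durations (in milliseconds) into count, average, max, etc. */
    static DoubleSummaryStatistics summarize(double[] pauseMillis) {
        if (pauseMillis.length == 0) {
            throw new IllegalArgumentException("no GC pauses recorded");
        }
        return Arrays.stream(pauseMillis).summaryStatistics();
    }

    public static void main(String[] args) {
        // Hypothetical pause times: three short pauses and one long outlier.
        double[] pauses = {12.0, 8.5, 950.0, 15.5};
        DoubleSummaryStatistics stats = summarize(pauses);
        System.out.println("average pause (ms): " + stats.getAverage());
        System.out.println("maximum pause (ms): " + stats.getMax());
    }
}
```

A single long pause pulls the average up only moderately, but it is the maximum that tells you what the unluckiest customer experienced, which is why the two are tracked separately.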
4. OBJECT CREATION RATE
Object creation rate is the average amount of objects created by your application per unit of time. Maybe in your previous code commit the application was creating 100 MB/sec, and starting from a recent code commit it creates 150 MB/sec. This additional object creation can trigger a lot more GC activity, CPU spikes, potential OutOfMemoryError, and memory leaks when the application runs for a longer period.
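A comparison between two commits, like the 100 MB/sec versus 150 MB/sec example, could be checked as follows. This is a sketch only: the 20% tolerance and the names CreationRateCheck, percentIncrease, and regressed are assumptions chosen for illustration.

```java
final class CreationRateCheck {

    /** Percentage change of the current rate relative to the baseline rate. */
    static double percentIncrease(double baselineMbPerSec, double currentMbPerSec) {
        if (baselineMbPerSec <= 0) {
            throw new IllegalArgumentException("baseline rate must be positive");
        }
        return (currentMbPerSec - baselineMbPerSec) / baselineMbPerSec * 100.0;
    }

    /** True when the rate grew by more than the allowed percentage. */
    static boolean regressed(double baselineMbPerSec, double currentMbPerSec,
                             double allowedIncreasePercent) {
        return percentIncrease(baselineMbPerSec, currentMbPerSec) > allowedIncreasePercent;
    }

    public static void main(String[] args) {
        // Previous commit: 100 MB/sec, recent commit: 150 MB/sec (values from the text).
        System.out.println("increase (%): " + percentIncrease(100, 150));
        // Hypothetical tolerance of 20% before the build is flagged.
        System.out.println("regressed: " + regressed(100, 150, 20));
    }
}
```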
5. PEAK HEAP SIZE
Peak heap size is the maximum amount of memory consumed by your application. If the peak heap size goes beyond a limit, you must investigate it. Maybe there is a memory leak in the application, or newly introduced code (or 3rd party libraries/frameworks) is consuming a lot of memory. Or maybe the usage is legitimate, in which case you will have to change your JVM arguments to allocate more memory.
“Garbage collection throughput, average GC pause time, maximum GC pause time, object creation rate, and peak heap size micrometrics can be sourced only from garbage collection logs. No other tools can be used for this purpose.
As part of your CI/CD pipeline, you need to run a regression test suite or (ideally) a performance test. The Garbage Collection logs generated from the test should be passed to GCeasy’s REST API. This API analyzes the garbage collection logs and responds with the above-mentioned micrometrics. To learn where these micrometrics appear in the API response, and the JSON path expressions for them, refer to this article. If any value is breached, the build can be failed. The GCeasy REST API also has the intelligence to detect various other garbage collection problems, such as memory leaks, user time > sys + real time, sys time > user time, invocation of System.gc() API calls,… Any detected GC problems will be reported in the ‘problem’ element of the API response. You might want to track this element as well.”
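The "if any value is breached, fail the build" step could look roughly like the sketch below. It assumes the micrometrics have already been extracted from the API response into a map. The map keys (throughputPercent, avgPauseMillis, maxPauseMillis) and the pause limits are hypothetical and are not the field names of the actual API response; only the 98% throughput target comes from this article.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GcBuildGate {

    // Hypothetical limits; only the 98% throughput target is taken from the article.
    static final double MIN_THROUGHPUT_PERCENT = 98.0;
    static final double MAX_AVG_PAUSE_MILLIS = 200.0;
    static final double MAX_PAUSE_MILLIS = 2000.0;

    /** Returns one message per breached limit; an empty list means the build may pass. */
    static List<String> violations(Map<String, Double> metrics) {
        List<String> found = new ArrayList<>();
        check(metrics, "throughputPercent", MIN_THROUGHPUT_PERCENT, true, found);
        check(metrics, "avgPauseMillis", MAX_AVG_PAUSE_MILLIS, false, found);
        check(metrics, "maxPauseMillis", MAX_PAUSE_MILLIS, false, found);
        return found;
    }

    private static void check(Map<String, Double> metrics, String key, double limit,
                              boolean limitIsMinimum, List<String> found) {
        Double value = metrics.get(key);
        if (value == null) {
            found.add(key + " is missing"); // a missing metric is treated as a failure
            return;
        }
        boolean ok = limitIsMinimum ? value >= limit : value <= limit;
        if (!ok) {
            found.add(key + "=" + value + " breaches limit " + limit);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> metrics = new LinkedHashMap<>();
        metrics.put("throughputPercent", 96.67); // value from the throughput example
        metrics.put("avgPauseMillis", 50.0);     // hypothetical
        metrics.put("maxPauseMillis", 500.0);    // hypothetical
        List<String> found = violations(metrics);
        found.forEach(System.out::println);
        if (!found.isEmpty()) {
            System.exit(1); // a non-zero exit code fails most CI steps
        }
    }
}
```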
6. THREAD COUNT
Thread count is another key metric to monitor. If the thread count goes beyond a limit, it can cause CPU and memory problems. Too many threads can cause ‘java.lang.OutOfMemoryError: unable to create new native thread’ in a long-running production environment.
7. THREAD STATES
Application threads can be in different thread states. To learn about the various thread states, refer to this quick video clip. Too many threads in the RUNNABLE state can cause CPU spikes. Too many threads in the BLOCKED state can make the application unresponsive. If the number of threads in a particular thread state crosses a certain threshold, you may consider generating appropriate alerts/warnings in the CI/CD report.
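As a rough illustration of counting threads per state, the sketch below inspects the threads of the JVM it runs in, using the standard Thread.getAllStackTraces() and Thread.State APIs. This is an in-process approximation for demonstration and not the thread dump analysis described in this article; the class name ThreadStateCounter and the limit of 50 are hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateCounter {

    /** Counts the live threads of the current JVM, grouped by thread state. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            counts.merge(thread.getState(), 1, Integer::sum);
        }
        return counts;
    }

    /** True when more threads are in the given state than the limit allows. */
    static boolean exceeds(Map<Thread.State, Integer> counts, Thread.State state, int limit) {
        return counts.getOrDefault(state, 0) > limit;
    }

    public static void main(String[] args) {
        Map<Thread.State, Integer> counts = countByState();
        counts.forEach((state, n) -> System.out.println(state + ": " + n));
        // Hypothetical limit: warn when more than 50 threads are BLOCKED.
        if (exceeds(counts, Thread.State.BLOCKED, 50)) {
            System.out.println("WARNING: too many BLOCKED threads");
        }
    }
}
```

Summing the per-state counts also gives the total thread count discussed in the previous section.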
8. THREAD GROUPS
A thread group represents a collection of threads performing similar tasks. There could be a servlet container thread group that processes all the incoming HTTP requests. There could be a JMS thread group, which handles all the JMS sending and receiving activity. There could be some other sensitive thread groups in the application as well. You might want to track the size of those sensitive thread groups. You don’t want their size to drop below a threshold or to go beyond one. Too few threads in a thread group can stall its activities. Too many threads can lead to memory and CPU problems.
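One simple way to approximate thread groups is to group threads by their name with the trailing number removed, so that "exec-1" and "exec-2" land in the same group. The sketch below uses that heuristic; it is an assumption for illustration and not necessarily how any thread dump analysis tool forms its groups. The thread names, bounds, and class name ThreadGroupSizes are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ThreadGroupSizes {

    /** Strips a trailing number (and its separator) from a thread name. */
    static String groupOf(String threadName) {
        String group = threadName.replaceAll("[-#\\s]*\\d+$", "");
        return group.isEmpty() ? threadName : group;
    }

    /** Counts how many thread names fall into each group. */
    static Map<String, Integer> sizes(List<String> threadNames) {
        Map<String, Integer> sizes = new TreeMap<>();
        for (String name : threadNames) {
            sizes.merge(groupOf(name), 1, Integer::sum);
        }
        return sizes;
    }

    /** True when the group size is neither below min nor above max. */
    static boolean withinBounds(Map<String, Integer> sizes, String group, int min, int max) {
        int size = sizes.getOrDefault(group, 0);
        return size >= min && size <= max;
    }

    public static void main(String[] args) {
        // Hypothetical thread names as they might appear in a thread dump.
        List<String> names = Arrays.asList(
                "http-nio-8080-exec-1", "http-nio-8080-exec-2", "jms-listener-1", "main");
        Map<String, Integer> sizes = sizes(names);
        System.out.println(sizes);
        System.out.println("http group ok: " + withinBounds(sizes, "http-nio-8080-exec", 1, 200));
    }
}
```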
“Thread count, thread states, and thread groups micrometrics can be sourced from thread dumps. As part of your CI/CD pipeline, you need to run a regression test suite or (ideally) a performance test. Three thread dumps should be captured at 10-second intervals while the tests are running. The captured thread dumps should be passed to FastThread’s REST API. This API analyzes the thread dumps and responds with the above-mentioned micrometrics.
To learn where these micrometrics appear in the API response, and the JSON path expressions for them, refer to this article. If any value is breached, the build can be failed. The FastThread REST API also has the intelligence to detect several threading problems, such as deadlocks, CPU spiking threads, prolonged blocking threads, … Any detected problems will be reported in the ‘problem’ element of the API response. You might want to track this element as well.”
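The "three dumps, 10 seconds apart" capture pattern can be sketched in Java as shown below. This is an illustration of the timing loop only: it snapshots the threads of the JVM it runs in and writes them in a simplified text layout that is an assumption of this example, not the format produced by a tool such as jstack. In a real pipeline the dumps would be taken from the JVM under test. The class name ThreadDumpCapture is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ThreadDumpCapture {

    /** Renders all live threads of the current JVM with their state and stack frames. */
    static String snapshot() {
        StringBuilder dump = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            dump.append('"').append(thread.getName()).append("\" state=")
                .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                dump.append("    at ").append(frame).append('\n');
            }
            dump.append('\n');
        }
        return dump.toString();
    }

    /** Captures `count` snapshots, sleeping `intervalMillis` between consecutive ones. */
    static List<String> capture(int count, long intervalMillis) {
        List<String> dumps = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            dumps.add(snapshot());
            if (i < count - 1) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop early
                    break;
                }
            }
        }
        return dumps;
    }
}
```

With the numbers from the text, the call would be ThreadDumpCapture.capture(3, 10_000), which takes about 20 seconds to return.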
9. WASTED MEMORY
In the modern computing world, a lot of memory is wasted because of poor programming practices such as duplicate object creation, duplicate string creation, inefficient collection implementations, sub-optimal data type definitions, inefficient finalizations, … The HeapHero API detects the amount of memory wasted due to these inefficient programming practices. This can be a key metric to track. If the amount of wasted memory goes beyond a certain percentage, the CI/CD build can be failed, or warnings can be generated.
10. OBJECT COUNT
You might also want to track the total number of objects that are present in the application’s memory. Object count can spike because of inefficient code or newly introduced 3rd party libraries and frameworks. Too many objects can cause OutOfMemoryError, memory leaks, and CPU spikes in production.
11. CLASS COUNT
You might also want to track the total number of classes that are present in the application’s memory. Sometimes the class count can spike because of the introduction of 3rd party libraries or frameworks. A spike in the class count can cause problems in the Metaspace/PermGen space of the memory.
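For a quick cross-check of the class count, the JVM also exposes the number of currently loaded classes through the standard ClassLoadingMXBean. The sketch below reads it for the JVM it runs in. It is a lightweight in-process approximation, not the heap dump analysis described in this article; the class name ClassCountProbe and the limit of 20,000 are hypothetical.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

final class ClassCountProbe {

    /** Number of classes currently loaded in this JVM. */
    static int loadedClassCount() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return bean.getLoadedClassCount();
    }

    /** True when the class count is above the allowed limit. */
    static boolean exceeds(int classCount, int limit) {
        return classCount > limit;
    }

    public static void main(String[] args) {
        int count = loadedClassCount();
        System.out.println("loaded classes: " + count);
        // Hypothetical limit for illustration only.
        if (exceeds(count, 20_000)) {
            System.out.println("WARNING: class count above limit");
        }
    }
}
```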
“Wasted memory size, object count, and class count micrometrics can be sourced from heap dumps. As part of your CI/CD pipeline, you need to run a regression test suite or (ideally) a performance test. Heap dumps should be captured after the test run is complete. The captured heap dumps should be passed to HeapHero’s REST API. This API analyzes the heap dumps and responds with these micrometrics.
To learn where these micrometrics appear in the API response, and the JSON path expressions for them, refer to this article. If any value is breached, the build can be failed. The HeapHero REST API also has the intelligence to detect several memory-related problems, such as memory leaks, objects finalization,… Any detected problems will be reported in the ‘problem’ element of the API response. You might want to track this element as well.”
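One way to capture a heap dump programmatically on HotSpot-based JVMs is the HotSpotDiagnosticMXBean, sketched below. This is only one option and assumes the code runs inside the JVM whose heap you want to dump; the class name HeapDumper and the default file name are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

final class HeapDumper {

    /**
     * Writes a heap dump of the current JVM to outputFile.
     * The file must not already exist; recent JDKs also expect a ".hprof" extension.
     * With liveObjectsOnly set, only reachable objects are dumped.
     */
    static void dump(String outputFile, boolean liveObjectsOnly) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(outputFile, liveObjectsOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical default file name; pass a path as the first argument to override it.
        String target = args.length > 0 ? args[0] : "ci-test-run.hprof";
        dump(target, true);
        System.out.println("heap dump written to " + target);
    }
}
```

Heap dumps can be large and usually contain application data, so the file is best treated as a sensitive build artifact.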