There are a lot of similarities between detective stories (from Sherlock Holmes, James Bond,…) and troubleshooting production problems. Detective stories need a very complex, burning problem. If your application is experiencing issues in production, it automatically becomes a burning problem in the enterprise and gets attention from Senior Management. A detective starts from very basic clues, extrapolates from them, rules out the odd possibilities, puts in a lot of hard work and identifies the villain. He fights against all odds, takes risks and eradicates the evil. A lot of heroism is involved. This is in no way different from debugging and troubleshooting complex production problems. Thus I am going to introduce a fictional troubleshooting character: ‘Jack Che’. Through this fictional character, I am going to narrate how complex real-world production problems faced by major enterprises are solved. Feel free to share your comments and let me know whether you like it. If not, I can always revert to my regular writing style.
While Twitter, Google and others are talking about response times of 10 or 20 milliseconds, there are still significant enterprises whose response times run to several seconds. One such enterprise had ‘search’ transactions that took several seconds to respond. Recently, this enterprise ported their application to the AWS Elastic Beanstalk environment on the Java 8/Tomcat 8 platform.
When a customer performs a ‘search’ operation in this application, a progress bar is displayed in the browser. Once the search completes, the progress bar vanishes and the search results are displayed. After the port to AWS Elastic Beanstalk, for certain data conditions the customer was seeing the progress bar on the screen forever. Management didn’t know what was causing this issue or how to go about solving it. Thus they engaged Tier1app LLC to solve the problem. Tier1app LLC sent out their top-notch troubleshooting detective ‘Jack Che’ to solve the problem.
HTTP 504 Gateway Time-out Error Code
As always, Jack Che was super excited to solve this problem. He assessed the situation quickly. He wanted to understand what interaction was going on between the server and the browser. Thus he launched the developer console in the Chrome browser and triggered the search transaction. After some time, he saw an HTTP 504 error code returned from the server. (HTTP 504 is a gateway timeout: a gateway or proxy in front of the application did not receive a response from the upstream server in time.) Ah, he got his first clue.
Now Jack Che started to review the Ajax JavaScript which made the backend server-side call. Unfortunately, the JavaScript didn’t have any error handling code in place. Thus, when the error code was returned, it wasn’t handled and the screen kept displaying the progress bar forever. Wow, an initial breakthrough for Jack Che within a few minutes of starting the job.
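To make the missing piece concrete, here is a minimal TypeScript sketch of the kind of error handling the Ajax call lacked. The application's real code is not shown in this story, so every name here (`SearchUi`, `handleSearchResponse`, `runSearch`, the `/search` URL and the injected `httpGet` function) is a hypothetical stand-in. The point is only that the progress bar is hidden on every path, including failures.

```typescript
// Hypothetical view of the page; the real application's UI code is not shown.
interface SearchUi {
  hideProgressBar(): void;
  showResults(body: string): void;
  showError(message: string): void;
}

// Hypothetical shape of a completed HTTP exchange.
interface HttpResult {
  status: number;
  body: string;
}

// Injected HTTP function, so the sketch does not depend on a specific Ajax library.
type HttpGet = (url: string) => Promise<HttpResult>;

// Decide what to show once the server (or a gateway in front of it) has answered.
// The progress bar is hidden first, whether the call succeeded or failed.
function handleSearchResponse(status: number, body: string, ui: SearchUi): void {
  ui.hideProgressBar();
  if (status >= 200 && status < 300) {
    ui.showResults(body);
  } else if (status === 502 || status === 504) {
    ui.showError("A gateway in front of the application gave up (HTTP " + status + ").");
  } else {
    ui.showError("Search failed (HTTP " + status + ").");
  }
}

// Run the search. "/search" is an assumed endpoint, not the application's real URL.
async function runSearch(query: string, httpGet: HttpGet, ui: SearchUi): Promise<void> {
  let result: HttpResult;
  try {
    result = await httpGet("/search?q=" + encodeURIComponent(query));
  } catch {
    // Network-level failure: there is no HTTP status to inspect.
    ui.hideProgressBar();
    ui.showError("Search could not reach the server.");
    return;
  }
  handleSearchResponse(result.status, result.body, ui);
}
```

With a handler along these lines, the customer would have seen an error message at the 60th second instead of a progress bar that never went away. It would not have fixed the timeout itself, which is what Jack Che went after next.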
Seeing the smoke, where is the fire?
Now Jack Che was curious to figure out where this HTTP 504 error code was coming from. He soon found a second clue: the HTTP 504 error code was returned at exactly the 60th second of the search transaction. Since the error always appeared at exactly the 60th second, Jack Che believed some sort of timeout was kicking in. But he wasn’t sure where this timeout value was configured. He searched throughout the application source code to see whether any 60-second timeout was configured. He checked with the application development team. But there was no 60-second timeout configured anywhere within the application source code.
Elastic Beanstalk Architecture
Now he came to the conclusion that the timeout was triggered by some component outside of the source code. Thus he started to examine each layer in the technology stack. Below is a very quick overview of the Elastic Beanstalk architecture.
Fig: High-level Elastic Beanstalk Architecture
There is an Elastic Load Balancer at the forefront. It receives the requests from the customers and distributes the traffic to the backend Apache servers. Each Apache server has a dedicated Tomcat server. The Apache server relays the request to the Tomcat server. The application running on the Tomcat server processes the request and sends back the response.
Timeout in Elastic Load Balancer
As a first step, Jack Che started to look at the AWS Elastic Load Balancer’s settings. His research revealed that the AWS Elastic Load Balancer has an idle timeout value set at 60 seconds. If there is no activity on a connection for 60 seconds, the connection is torn down and HTTP error code 504 is returned to the customer. Jack followed the steps below to change the timeout value in the AWS Elastic Load Balancer:
- Sign in to AWS Console
- Go to EC2 Services
- On the left panel, click on Load Balancing > Load Balancers
- In the top panel, select the Load Balancer for which you want to change the idle timeout
- Now in the bottom panel, under the ‘Attributes’ section, click on the ‘Edit idle timeout’ button. The default value would be 60 seconds. Change it to the value that you would like. (say 180 seconds)
- Click on ‘Save’ button
Fig: Editing Idle Timeout in AWS Elastic Load Balancer
After changing the timeout setting in the AWS Elastic Load Balancer, Jack Che got good news and bad news.
Good news: HTTP error code 504 stopped coming.
Bad News: New HTTP error code 502 was thrown 😦
Timeout in Apache Server
The interesting part is: this new HTTP error code 502 was also returned at exactly the 60th second. This once again confirmed that some other timeout value was kicking in. The next layer in the technology stack is the Apache web server, so Jack Che started to tinker with the Apache web server’s settings. He figured out that in the AWS Elastic Beanstalk environment, the Apache server’s Timeout value was set to 60 seconds. He followed the steps below to increase this value to 180 seconds. Note: these are the steps to update the Apache web server settings on the Java 8/Tomcat 8 platform. If you are using a different platform, the steps might be different as well:
- In your application Web Archive (WAR) file, create a folder “.ebextensions/httpd/conf”
- Under this folder, create the file “httpd.conf” with the below contents.
    # Managed by Elastic Beanstalk
    PidFile run/httpd.pid

    # Enable TCP keepalive
    # Timeout raised from the default of 60 seconds
    Timeout 180
    KeepAlive On
    MaxKeepAliveRequests 100
    # KeepAliveTimeout raised from the default of 60 seconds
    KeepAliveTimeout 180

    <IfModule worker.c>
    StartServers 10
    MinSpareThreads 250
    MaxSpareThreads 250
    ServerLimit 10
    MaxClients 250
    MaxRequestsPerChild 1000000
    </IfModule>

    Listen 80

    Include conf.d/*.conf
    Include conf.d/elasticbeanstalk/*.conf

    User apache
    Group apache

    CustomLog logs/access_log "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""

    TraceEnable off

    LoadModule alias_module modules/mod_alias.so
    LoadModule authz_host_module modules/mod_authz_host.so
    LoadModule log_config_module modules/mod_log_config.so
    LoadModule deflate_module modules/mod_deflate.so
    LoadModule headers_module modules/mod_headers.so
    LoadModule proxy_module modules/mod_proxy.so
    LoadModule proxy_balancer_module modules/mod_proxy_balancer.so
    LoadModule proxy_ftp_module modules/mod_proxy_ftp.so
    LoadModule proxy_http_module modules/mod_proxy_http.so
    LoadModule proxy_ajp_module modules/mod_proxy_ajp.so
    LoadModule proxy_connect_module modules/mod_proxy_connect.so
    LoadModule cache_module modules/mod_cache.so

NOTE: Here only two changes have been made from the default:
- Timeout is set to 180. (Default value is 60)
- KeepAliveTimeout is set to 180. (Default value is 60)
After making the above change, Jack Che deployed the new WAR file to the Elastic Beanstalk environment. To everyone’s delight, the HTTP 502 error code stopped. Search transactions completed successfully. Business was back on its wheels.
Wow!! Senior Management of the company couldn’t believe that troubleshooting detective Jack Che was able to solve this problem within a few hours. Excitement and celebrations continued at the happy hour party as well.
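Why did fixing one timeout only reveal another? A small TypeScript sketch of the reasoning may help. It treats each layer in front of Tomcat as having its own timeout, and the layer with the shortest one ends the request first. The `Layer` type, the `firstTimeout` function and the 90-second request duration are illustrative assumptions, and the status codes are simply the ones observed in this story, not a general rule about which layer returns which code.

```typescript
// One hop in front of the application, with the timeout configured on it.
interface Layer {
  name: string;
  timeoutSeconds: number;
  // Status code seen in this story when this layer's timeout fired.
  observedStatus: number;
}

// Returns the layer whose timeout fires first for a request of the given
// duration, or null if the request finishes before any timeout.
// Layers are listed outermost first; a tie goes to the outermost layer
// (an assumption chosen to match the order of events in the story).
function firstTimeout(layers: Layer[], requestSeconds: number): Layer | null {
  let winner: Layer | null = null;
  for (const layer of layers) {
    if (layer.timeoutSeconds >= requestSeconds) {
      continue; // this layer waits long enough for the request to finish
    }
    if (winner === null || layer.timeoutSeconds < winner.timeoutSeconds) {
      winner = layer;
    }
  }
  return winner;
}

// Hypothetical slow search that needs 90 seconds to complete.
const slowSearchSeconds = 90;

const original: Layer[] = [
  { name: "Elastic Load Balancer", timeoutSeconds: 60, observedStatus: 504 },
  { name: "Apache", timeoutSeconds: 60, observedStatus: 502 },
];
const elbRaised: Layer[] = [
  { name: "Elastic Load Balancer", timeoutSeconds: 180, observedStatus: 504 },
  { name: "Apache", timeoutSeconds: 60, observedStatus: 502 },
];
const bothRaised: Layer[] = [
  { name: "Elastic Load Balancer", timeoutSeconds: 180, observedStatus: 504 },
  { name: "Apache", timeoutSeconds: 180, observedStatus: 502 },
];

console.log(firstTimeout(original, slowSearchSeconds)?.observedStatus);   // 504
console.log(firstTimeout(elbRaised, slowSearchSeconds)?.observedStatus);  // 502
console.log(firstTimeout(bothRaised, slowSearchSeconds));                 // null
```

The takeaway is that a request only succeeds when every layer in the chain is willing to wait at least as long as the slowest transaction, so each layer's timeout has to be checked, not just the first one that complains.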