DETECTIVE STORY: TROUBLESHOOTING TIMEOUT IN AWS ELASTIC BEANSTALK

There are many similarities between detective stories (Sherlock Holmes, James Bond, …) and troubleshooting production problems. A detective story needs a complex, burning problem. If your application is experiencing issues in production, it automatically becomes a burning problem in the enterprise and gets attention from senior management. A detective starts from very basic clues, extrapolates from them, rules out the unlikely possibilities, puts in a lot of hard work and identifies the villain. He fights against all odds, takes risks and eradicates the evil. A lot of heroism is involved. This is in no way different from debugging and troubleshooting complex production problems. So I am going to introduce a fictional troubleshooting character: ‘Jack Che’. Through this character, I am going to narrate how complex real-world production problems faced by major enterprises are solved. Feel free to share your comments and let me know whether you like it. If not, I can always revert to my regular writing style.

While Twitter, Google and others talk about response times of 10 or 20 milliseconds, there are still significant enterprises whose response times run to several seconds. One such enterprise had ‘search’ transactions that took several seconds to respond. Recently, this enterprise ported its application to an AWS Elastic Beanstalk environment on the Java 8/Tomcat 8 platform.

When a customer performs a ‘search’ operation in this application, a progress bar is displayed in the browser. Once the search completes, the progress bar vanishes and the search results are displayed. After the port to AWS Elastic Beanstalk, under certain data conditions the customer saw the progress bar on the screen forever. Management didn’t know what was causing the issue or how to go about solving it, so they engaged Tier1app LLC. Tier1app LLC sent its top-notch troubleshooting detective, ‘Jack Che’, to solve the problem.

HTTP 504 Gateway Time-out Error Code

As always, Jack Che was super excited to solve this problem. He assessed the situation quickly. He wanted to understand the interaction between the server and the browser, so he opened the developer console in the Chrome browser and triggered the search transaction. After a wait, he saw the server return an HTTP 504 error code. (HTTP 504, Gateway Timeout, means that a gateway or proxy in front of the application did not receive a response in time from the server behind it.) Ah, he had his first clue.

Next, Jack Che reviewed the Ajax JavaScript that made the call to the backend server. Unfortunately, the JavaScript had no error-handling code in place. So when the error code came back, nothing handled it and the screen kept displaying the progress bar forever. Wow, an initial breakthrough for Jack Che within a few minutes on the job.
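
The page's JavaScript is not shown in this story, so here is only a rough sketch of the decision logic the missing error handler needed, written in Python as a stand-in. The function name, the dictionary keys and the messages are all made up for illustration. The point it shows is that every outcome, success or failure, must switch the progress bar off.

```python
def search_ui_state(status_code, body=""):
    """Map the HTTP outcome of a search call to what the page should show.

    Hypothetical helper: the keys and messages are illustrative only.
    """
    if 200 <= status_code < 300:
        # Success: hide the progress bar and show the results.
        return {"progress_bar": False, "results": body, "error": None}
    if status_code == 504:
        message = "The search timed out. Please try again."
    else:
        message = "The search failed (HTTP %d)." % status_code
    # Failure: still hide the progress bar, and tell the user what happened.
    return {"progress_bar": False, "results": None, "error": message}
```

With a handler along these lines, the customer would have seen a timeout message instead of a progress bar that never went away.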

Seeing the smoke, where is the fire?

Now Jack Che was curious to figure out where this HTTP 504 error code was coming from. He found a second clue: the error was thrown exactly at the 60th second of the search transaction. Because the failure landed on such a round number, Jack Che believed some sort of timeout was kicking in, but he wasn’t sure where that timeout value was configured. He searched the entire application source code for a 60-second timeout and checked with the application development team. No 60-second timeout was configured anywhere in the application source code.
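
Jack read the timing off the browser's developer console. If you prefer to script that kind of measurement, a minimal sketch could look like the one below. It assumes the search can be triggered with a plain GET request; the function names, the 300-second client timeout and the 1-second tolerance are my own choices, not something from the application.

```python
import time
import urllib.error
import urllib.request


def timed_request(url, timeout=300.0):
    """Return (status_code, elapsed_seconds) for a single GET request.

    The client-side timeout must be longer than the server-side timeout
    you are hunting for, otherwise the client gives up first.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses (such as 504) arrive here; keep the status code.
        status = error.code
    return status, time.monotonic() - start


def looks_like_fixed_timeout(elapsed_samples, suspected_seconds=60.0, tolerance=1.0):
    """True if every failing request ended close to the suspected timeout."""
    return bool(elapsed_samples) and all(
        abs(elapsed - suspected_seconds) <= tolerance
        for elapsed in elapsed_samples
    )
```

Running `timed_request` a few times against the failing search and passing the elapsed times to `looks_like_fixed_timeout` gives a quick yes/no on whether the failures cluster around 60 seconds.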

Elastic Beanstalk Architecture

He concluded that the timeout was being triggered by some component outside the application source code, so he started to examine each layer in the technology stack. Below is a very quick overview of the Elastic Beanstalk architecture.

Fig: High-level Elastic Beanstalk Architecture

An Elastic Load Balancer sits at the front. It receives requests from customers and distributes the traffic to the backend Apache servers. Each Apache server has a dedicated Tomcat server and relays requests to it. The application running on the Tomcat server processes the request and sends back the response.

Timeout in Elastic Load Balancer

As a first step, Jack Che looked at the AWS Elastic Load Balancer’s settings. His research revealed that the AWS Elastic Load Balancer has an idle timeout, set to 60 seconds by default. If no data flows over a connection for 60 seconds, the connection is torn down and HTTP error code 504 is returned to the customer. Jack followed the steps below to change the timeout value in the AWS Elastic Load Balancer:

  1. Sign in to the AWS Console
  2. Go to EC2 Services
  3. On the left panel, click Load Balancing > Load Balancers
  4. In the top panel, select the Load Balancer whose idle timeout you want to change
  5. In the bottom panel, under the ‘Attributes’ section, click the ‘Edit idle timeout’ button. The default value is 60 seconds. Change it to the value you would like (say, 180 seconds)
  6. Click the ‘Save’ button

Fig: Editing Idle Timeout in AWS Elastic Load Balancer
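
Jack made this change by hand in the console. If you would rather script it, the sketch below shows one way with boto3, the AWS SDK for Python. It assumes a Classic Load Balancer (the `elb` client), and the load balancer name you pass in is yours to supply. The two function names are my own.

```python
def idle_timeout_attributes(seconds):
    """Build the attribute payload that sets a Classic Load Balancer idle timeout."""
    if not isinstance(seconds, int) or seconds < 1:
        raise ValueError("idle timeout must be a positive whole number of seconds")
    return {"ConnectionSettings": {"IdleTimeout": seconds}}


def set_idle_timeout(load_balancer_name, seconds, region_name=None):
    """Apply the idle timeout through the Classic ELB API.

    Needs boto3 installed and AWS credentials configured.
    """
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("elb", region_name=region_name)
    return client.modify_load_balancer_attributes(
        LoadBalancerName=load_balancer_name,
        LoadBalancerAttributes=idle_timeout_attributes(seconds),
    )
```

Calling `set_idle_timeout` with your load balancer's name and `180` would be the scripted equivalent of steps 4 to 6 above. Application Load Balancers use a different client and attribute format, so treat this as a starting point only.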

After changing the timeout setting in the AWS Elastic Load Balancer, Jack Che got good news and bad news.

Good news: the HTTP 504 errors stopped.

Bad news: a new error, HTTP 502, was thrown instead 😦

Timeout in Apache Server

The interesting part: this new HTTP 502 error was also thrown exactly at the 60th second. This confirmed that some other timeout was kicking in. The next layer in the technology stack is the Apache web server, so Jack Che started to tinker with its settings. He figured out that in the AWS Elastic Beanstalk environment, the Apache server’s Timeout value was set to 60 seconds. He followed the steps below to increase this value to 180 seconds. Note: these steps update the Apache web server settings on the Java 8/Tomcat 8 platform. If you are using a different platform, the steps might differ:

  1. In your application’s Web Archive (WAR) file, create the folder “.ebextensions/httpd/conf”.
  2. Under this folder, create the file “httpd.conf” with the below contents.
# Managed by Elastic Beanstalk
PidFile run/httpd.pid

# Enable TCP keepalive
Timeout 180
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 180

<IfModule worker.c>
StartServers        10
MinSpareThreads     250
MaxSpareThreads     250
ServerLimit         10
MaxClients          250
MaxRequestsPerChild 1000000
</IfModule>

Listen 80

Include conf.d/*.conf
Include conf.d/elasticbeanstalk/*.conf

User apache
Group apache

CustomLog logs/access_log "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""
TraceEnable off

LoadModule alias_module modules/mod_alias.so
LoadModule authz_host_module modules/mod_authz_host.so
LoadModule log_config_module modules/mod_log_config.so
LoadModule deflate_module modules/mod_deflate.so
LoadModule headers_module modules/mod_headers.so
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_balancer_module modules/mod_proxy_balancer.so
LoadModule proxy_ftp_module modules/mod_proxy_ftp.so
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule proxy_ajp_module modules/mod_proxy_ajp.so
LoadModule proxy_connect_module modules/mod_proxy_connect.so
LoadModule cache_module modules/mod_cache.so

NOTE: Only two changes have been made from the default:

  1. Timeout is set to 180. (Default value is 60)
  2. KeepAliveTimeout is set to 180. (Default value is 60)

After making the above change, Jack Che deployed the new WAR file to the Elastic Beanstalk environment. To everyone’s surprise, the HTTP 502 errors stopped. Search transactions completed successfully. The business was back on track.
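
Why did fixing one timeout only reveal the next? A toy model makes the pattern easier to see. It is a simplification with made-up names, not how AWS or Apache work internally: each layer gives up once its timeout passes, and on a tie the outermost layer wins, an assumption chosen to mirror the order in which the errors surfaced in this story. The 90-second search duration is also hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    timeout_seconds: float


def first_layer_to_time_out(layers, request_seconds):
    """Return the layer that gives up first, or None if the request fits.

    `layers` is ordered from the outermost hop (closest to the browser)
    inwards. min() keeps the earliest entry on a tie, so the outermost
    layer wins when two timeouts are equal.
    """
    expired = [layer for layer in layers if layer.timeout_seconds <= request_seconds]
    if not expired:
        return None
    return min(expired, key=lambda layer: layer.timeout_seconds)


# The three stages of the story, for a hypothetical 90-second search.
before = [Layer("Elastic Load Balancer", 60), Layer("Apache", 60)]
after_elb_fix = [Layer("Elastic Load Balancer", 180), Layer("Apache", 60)]
after_both_fixes = [Layer("Elastic Load Balancer", 180), Layer("Apache", 180)]
```

For the 90-second search, `first_layer_to_time_out` picks the load balancer in the first setup, Apache in the second, and returns `None` in the third. The lesson: the shortest timeout anywhere in the chain decides how long a request may run, so every layer has to be checked.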

Wow!! The company’s senior management couldn’t believe that Jack Che, the troubleshooting detective, had solved this problem within a few hours. The excitement and celebrations continued at the happy hour party as well.
