Unlocking the Secrets to Solving Database Max. Connection Limit: Lessons from our Experience
At Hapily, we utilize Serverless architecture in most of our applications. Most of our codebases are hosted in Lambda functions. Few of them are in EC2 instances. Nevertheless, we still use a relational database, AWS RDS, as our primary database.
By nature, AWS Lambda cannot persist open connection as Lambda only persists when it's being used, and it would be closed in a non-deterministic period of time when it's unused. Lambda will set up a new execution environment when it receives invocation. This is also a cold start problem. To understand better, AWS has written a good explanation of Lambda's execution environment lifecycle in this blog post. Whenever a new Lambda spins up, it will be hosted in a pristine, different execution environment.
Not to forget, Lambda supports concurrency process. So, whenever Lambda receives a lot of invocations, instead of letting the invocation be throttled, Lambda can spin up several Lambdas to handle all of those invocations. Each lambda has a different execution environment. To learn more about the Lambda concurrency process, AWS has written a good explanation of it in this documentation.
Now, we enter the main problem by using this tech stack.
While managing the database connectivity for EC2-based applications is straightforward, managing the database connectivity for Lambda-based applications is not.
As our applications and user base grew, we began to notice an alarming trend: the number of open connections was skyrocketing. We even faced a situation where the database cluster returned "Too many connections" which caused some of our lambdas not to execute. Hence, it caused problems in our business process and affected our customers.
To resolve this database max. connection issue, there are several researches and experiments that we did.
List of experimentation
1. Each application uses a different database user account
Previously, there were several applications that used the same user account to create a connection to the database. It prevents us from getting insight of "which application contributes to creating too many open connections".
SELECT user, count(*) as total FROM information_schema.processlist GROUP BY 1
To fix this issue, we create multiple user accounts and make sure that a single user account is only being used by a single application.
This experimentation helped us to get a better picture of which application contributed the most. We learned that some of them were coming from Lambda-based applications and some of them coming from EC2-based applications.
It still didn't give us a clear picture of the root cause, which led us to the next experimentation.
2. Utilize AWS RDS Proxy
In several resources that we found, they always mentioned using AWS RDS Proxy. We also didn't need to make any code changes to implement it. Hence, it became our next ideal step.
We made sure that all of our applications connected to the database cluster via RDS Proxy instead of a direct connection.
With this implementation, we saw that the total number of open connections was going down for a while until it went up again.
At this point, we started to assume that the horizontal scaling may be the root cause.
3. Lambda's concurrent invocation and EC2's auto scaling group
We tried to reduce the total number of reserved concurrency in some of our most active lambdas and did several experiments. It gave us some insignificant difference in the total number of open connections; It just fluctuated. On some rare occasions, we even see an application that used to have a small number of open connections suddenly have quite a lot of open connections. Hence, we believe that the total number of concurrent invocations may be the knob that we need to adjust, but there were some missing pieces to get a significant reduction on the open connection.
We also tried to make some adjustments in our EC2's auto-scaling group. Nevertheless, we quickly rejected this hypothesis given that it hasn't scaled up enough.
Given that we still stumbled upon the same problem, we started to explore an option to make some code changes to our application.
4. Refactor code
We refactor our application with a two-stage process
Use "Connection pooling" instead of "single connection"
Move out connection initialization from "inside lambda handler" into "outside lambda handler"
Based on several resources that we found, they always recommend establishing database connections by using a connection pooling library, especially for Lambda-based applications. It will help to manage the database connections instead of managing it manually.
Also, by moving the database connection initialization out from the lambda handler, it provided the possibility of re-using the existing connection if the lambda's execution environment is still active.
After we implemented code changes to some of our applications, we still see a high number of database connections in our database cluster.
While this code change sounds necessary and could solve our problem, it could not. We were frustrated at this point. We stepped back for a while to do another round of research and exploration in the AWS console. On there, we found a setting called, "AWS RDS Proxy's Target group configuration".
5. Fine-tune AWS RDS Proxy's configuration
In AWS Proxy, there is a section called, "Target groups". By default, AWS will create a target group attached to the AWS Proxy called, "default".
In this setting, the target group sets 100% of maximum connections as the connection pool maximum connections (MaxConnectionsPercent) by default. It means that the RDS proxy will try to create as many connections as possible until it hits the maximum connections of the database cluster.
After we realized this configuration, we changed it from 100% to a lower percentage. We picked the number based on simple calculation and common sense to foresee the total number of connection our system need.
This simple configuration change helped us to reduce the total number of open connections until low enough to make us feel confident to save our weekend.
From here, our latest structure for both Lambda-based applications and EC2-based applications is like the diagram below. We rely on AWS RDS Proxy to manage the maximum number of open connections to the database cluster while we also rely on the database pooling library to manage the reusability of database connection in the application layer.
To learn more about RDS Proxy management, AWS has written a section specifically talking about it.
From that firefighting experience and multiple experimentation, there is 1 silver line that we drew: understand how each component of our system works, end-to-end, from the business layer, application layer, and infrastructure layer. This is crucial to allow us to manage and to optimize the application effectively.
There are several points that could help us to achieve it:
Documentation is your friend
Not only internal documentation but also public documentation. For our case, we use AWS as the cloud service provider. We also use some of the public libraries as dependencies in our application. They provide documentation that contains explanations about their service/library. It may look daunting, especially AWS documentation since they provide a lot of information there. But the gold; the information that you're seeking for may be there.
Monitor your application
AWS provides a lot of useful standard CloudWatch metrics as the baseline. Without them, we believe this system issue resolution would be much harder than what we experienced.
Don't take simplicity for granted; understand how it works
A lot of services being provided by AWS simplifies our work. For instance, lambda concurrency execution helps us to reduce throttling issues. A lot of library being provided by public library also helps us to simplify our work.
However, this simplification comes with a trade-off: you may not know what & how things work behind the scenes. Using something that you don't know how it works may be costly down the road. Understanding how it works is the baseline for us to effectively manage and optimize the application.
I hope this article could help you to learn new things and/or solve similar problems. If you have any questions, feel free to reach me out!
If you are using HubSpot ecosystem, check out some cool applications at https://hapily.com/