Debugging Essentials: A Comprehensive Approach to Resolving Critical System Issues, Part 2

Shashwat Srivastava
5 min read · Jul 20, 2023


If you haven’t had a chance to read part-1 yet, I highly recommend going through it before diving into part-2 of this blog.

In this section, we will examine several alarming scenarios, such as sudden spikes in RDS CPU, memory, or database connections. Additionally, we will delve into the causes of 403 errors and discuss the cause of sudden surges in traffic. Understanding these potential causes is crucial, as encountering such issues in a production environment could lead to a sense of urgency within the team.

Case: RDS experiencing a sudden spike in CPU, memory, or DB connections.

  1. Verify throughput: Check whether load on the system has suddenly increased. More requests to the database mean higher CPU and memory consumption.
  2. Check code releases: Check whether new code has been released or whether database config parameters such as cache sizes or connection limits have been modified.
  3. Verify scripts: Check whether any script is running against the production database. If a temporary one-time script is running and CPU usage stays above the critical baseline, you can either wait for the script to finish or halt it, depending on the circumstances.
  4. Concurrency and connection issues: High numbers of concurrent connections or connection leaks in your application can strain the database and lead to increased CPU usage. This can happen if your application is not managing connections efficiently or if there are connection pooling issues. Furthermore, when a new service instance is introduced, it may open the minimum number of connections up front (as specified by the connection pool's minimum size), which can contribute to increased CPU utilization.
  5. Verify slow queries: Check whether any specific query has started taking more time. Sudden spikes due to slow queries generally have one of these causes:

a. Missing index: Without a suitable index, the database may scan far more rows than the query needs and do extra sorting and grouping work, which costs more CPU and memory.

b. Transaction lock: When multiple transactions contend for the same locked resource, they are queued until the lock is released. Waiting sessions keep their connections open, so the connection count climbs.

c. Large result sets: Queries that return large result sets can consume substantial memory resources, especially if the data is not efficiently paginated or streamed.

d. Deadlock: Two or more transactions each wait for a lock that another one holds, so none of them can proceed. Detecting a deadlock costs CPU, because the database engine has to work out the dependencies and conflicts between the transactions involved.
Configure database transaction and lock timeouts so that blocked transactions fail fast instead of waiting indefinitely. Timeouts limit the damage but do not prevent deadlocks; acquiring locks in a consistent order reduces the chance of one.
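To make the missing-index case concrete, here is a minimal sketch that compares a query plan before and after adding an index. It uses Python's built-in sqlite3 module as a stand-in, since the engine behind your RDS instance (MySQL, PostgreSQL, and so on) has its own EXPLAIN syntax and output. The orders table, its columns, and the index name are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last of the four columns holds the plan description.
    return [row[3] for row in rows]


# Hypothetical table, created in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 500, i * 1.5) for i in range(10_000)],
)

lookup = "SELECT id, total FROM orders WHERE user_id = ?"

# Without an index on user_id, the engine has to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# With the index, it can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = query_plan(conn, lookup, (42,))

print("before index:", plan_before)
print("after index: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should describe a scan of orders and the second a search using idx_orders_user_id. Running the equivalent EXPLAIN on a suspect production query is a quick way to confirm or rule out a missing index.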

Note:

  • Scale up or out: If the spike in CPU usage is due to increased workload or application demands, consider scaling up your RDS instance by using a larger instance type with more CPU resources. Alternatively, you can also consider horizontal scaling by adding read replicas to distribute the workload across multiple instances.
  • Enable CloudWatch query logs: Set up alerts on slow queries. You can send those alerts to internal Slack channels to proactively monitor and act on slow queries.
  • Consider database parameter tuning: Review the RDS parameter group settings and ensure they are optimized for your workload. Adjusting parameters like max_connections, max_worker_processes can help optimize the database performance and reduce CPU usage.
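Before changing max_connections or adding service instances, it helps to do the arithmetic on how many connections the application pools can open in total. The sketch below is a rough estimate under assumed numbers: the pool sizes, the instance counts, the limit of 200, and the 5 reserved connections are all made up for illustration, and PoolConfig, connection_demand, and headroom are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class PoolConfig:
    min_size: int  # connections each instance opens at startup
    max_size: int  # connections each instance may open under load


def connection_demand(instances: int, pool: PoolConfig) -> tuple[int, int]:
    """Return (idle_floor, worst_case) connection counts across all instances."""
    return instances * pool.min_size, instances * pool.max_size


def headroom(max_connections: int, reserved: int, instances: int, pool: PoolConfig) -> int:
    """Connections left in the worst case; a negative value means over-subscribed."""
    _, worst_case = connection_demand(instances, pool)
    return max_connections - reserved - worst_case


# Assumed values, for illustration only.
pool = PoolConfig(min_size=10, max_size=40)
for instances in (4, 5, 6):
    floor, worst = connection_demand(instances, pool)
    left = headroom(max_connections=200, reserved=5, instances=instances, pool=pool)
    print(f"{instances} instances: floor={floor}, worst case={worst}, headroom={left}")
```

With these assumed numbers, four instances leave spare capacity, while a fifth can already exhaust the limit under load. The same calculation shows why a deployment that adds instances can cause a connection spike even when traffic is flat.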

Remember that these are general possibilities, and the specific cause of the CPU spike will depend on your application, workload, and configuration. Thoroughly analyzing the relevant metrics, logs, and system behaviour will help you pinpoint the exact reason for the sudden CPU spike in your RDS instance.

Case: Unexpected spike in HTTP 403 errors

  1. Check logs: Look for a pattern in the APIs that are returning errors, and check whether specific user_ids are affected.
  2. Verify resource availability: Validate that the requested resource is up and running, and check whether any code or config change was released recently.
  3. Authentication Issue: If authentication is required to access the resource, verify that the authentication service is up and running. Check if there have been any changes to the authentication system or user management processes that could be causing the increase in 403 errors. Review the authentication logs for any relevant information or error messages.
  4. Authorization Issue: If the user is authenticated but still receiving 403 errors, it suggests that they lack the necessary permissions to access the requested resource. Review the authorization rules and roles assigned to the user or client. Check if there have been any changes to the authorization mechanisms or if there are misconfigurations that prevent access. Audit the authorization process to ensure it is working as intended.
  5. IP Whitelisting or Blacklisting: If IP-based access controls are in place, verify that the IPs of the clients or users experiencing the 403 errors are not inadvertently blocked or restricted. Double-check the IP whitelists or blacklists to ensure they accurately reflect the intended access permissions.
  6. Validate firewall settings: If you have an application firewall configured, ensure that it is not blocking legitimate requests. Review the WAF (Web Application Firewall) configuration, rules, and logs for any indications of false positives or misconfigurations. Adjust the WAF settings if necessary to allow legitimate requests.
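The log check in step 1 can be sketched as a simple grouping exercise: count the 403 responses by endpoint, user, and client IP, and see where they cluster. The records below are hypothetical and already parsed into dictionaries; the field names (status, path, user_id, ip) are assumptions, so adapt them to whatever your access logs actually contain.

```python
from collections import Counter

# Hypothetical, already-parsed access log records.
records = [
    {"status": 403, "path": "/api/orders", "user_id": "u1", "ip": "10.0.0.5"},
    {"status": 403, "path": "/api/orders", "user_id": "u2", "ip": "10.0.0.5"},
    {"status": 200, "path": "/api/orders", "user_id": "u3", "ip": "10.0.0.9"},
    {"status": 403, "path": "/api/profile", "user_id": "u1", "ip": "10.0.0.5"},
    {"status": 403, "path": "/api/orders", "user_id": "u4", "ip": "10.0.0.7"},
]


def summarize_403s(records, top_n=3):
    """Count 403 responses by path, user_id, and client IP."""
    forbidden = [r for r in records if r["status"] == 403]
    return {
        field: Counter(r[field] for r in forbidden).most_common(top_n)
        for field in ("path", "user_id", "ip")
    }


summary = summarize_403s(records)
for field, counts in summary.items():
    print(field, counts)
```

If most 403s share one path, a recent code or authorization change on that endpoint is a likely suspect. If they share one IP, the IP rules or the WAF are worth checking first. If they are spread evenly, the authentication service itself is a better place to start.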

Case: Sudden Spike in Throughput.

  1. Operational Surge: A genuine surge in server traffic may occur, driven by various factors such as marketing campaigns, seasonal events, or external influences attracting more users to your service.
  2. Feature Rollout: A new feature or app update could be undergoing a full-scale release and reaching the entire user base at once.
  3. Security Attacks: Certain types of security attacks, such as Distributed Denial of Service (DDoS) attacks, can artificially increase the load on your system, causing a throughput spike.
  4. Content Delivery: If your system serves large files, such as videos, images, or software updates, high demand for content delivery can result in a throughput spike.
  5. External service: External services calling your service can result in increased traffic due to various reasons, such as increased load on the calling service, bugs, or specific use cases that trigger higher traffic to your service.
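One quick way to narrow down the causes above is to check how concentrated the traffic is. The sketch below is a rough heuristic, not a detection system: it computes the share of requests coming from the busiest source and labels the spike as concentrated or broad. The source names, the request counts, and the 50% threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def top_source_share(requests_by_source: Counter) -> tuple[str, float]:
    """Return the busiest source and its share of all requests (0.0 to 1.0)."""
    total = sum(requests_by_source.values())
    if total == 0:
        raise ValueError("no requests to analyze")
    source, count = requests_by_source.most_common(1)[0]
    return source, count / total


def describe_spike(requests_by_source: Counter, concentration_threshold: float = 0.5) -> str:
    """Label a spike as concentrated in one source or spread across many."""
    source, share = top_source_share(requests_by_source)
    if share >= concentration_threshold:
        return f"concentrated: {source} sends {share:.0%} of traffic"
    return f"broad: top source {source} sends only {share:.0%} of traffic"


# Hypothetical request counts per caller during the spike window.
broad = Counter({"user-a": 120, "user-b": 110, "user-c": 130, "user-d": 140})
concentrated = Counter({"billing-service": 900, "user-a": 50, "user-b": 50})

print(describe_spike(broad))
print(describe_spike(concentrated))
```

A concentrated spike points toward a single external service or client, though it cannot tell a bug from an attack. A broad spike fits an operational surge or a feature rollout, but a distributed attack can look broad too, so treat the label as a starting point for investigation rather than a verdict.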

In the upcoming sections, I will recount some real incidents from production that I’ve encountered. These incidents caused panic within the team and had the potential to lead to downtime. Along with detailing these incidents, we will also explore the solutions we implemented to prevent similar issues from occurring in the future. Stay tuned!

I hope this article was helpful. Please follow to get updates on upcoming articles. Thanks.
