API requests timing out and goroutines hanging

Hey everyone, I’ve been maintaining a small restaurant-focused website (mainly about Texas Roadhouse locations, menu updates, and deals), and the backend is written entirely in Go. I’ve recently started running into strange performance issues in my API layer: requests time out randomly and goroutines hang even under light traffic. What’s odd is that everything ran fine for months, and the issue only appeared after I made a few structural changes to the code.

The website’s backend is built with the net/http package and a few helper functions that serve JSON responses to the frontend. It’s deployed on an Ubuntu VPS with Caddy as the reverse proxy. The endpoints mostly fetch menu data from a local PostgreSQL database and occasionally call an external API for restaurant details. Lately, several requests get stuck indefinitely: the browser keeps waiting for a response, and the logs show no indication that the handler ever completed.

When I checked the runtime metrics using pprof, I saw an unexpected increase in the number of goroutines over time, even though I’m not spawning any in tight loops. It looks like some of my database query contexts aren’t being properly canceled when the client disconnects. I’ve been using context.WithTimeout() for all DB queries, but maybe I’ve missed something subtle — because even after the timeout, the goroutines stay alive until I restart the service.
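For reference, this is roughly the pattern I’m aiming for in the handlers (simplified; the table, helper, and query parameter names here are just placeholders), though clearly something isn’t getting cancelled the way I expect:

package api

import (
	"context"
	"database/sql"
	"encoding/json"
	"net/http"
	"time"
)

type Server struct {
	db *sql.DB
}

func (s *Server) getMenu(w http.ResponseWriter, r *http.Request) {
	// Derive from the request context so a client disconnect cancels the
	// query as well, not just the 3-second timeout.
	ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
	defer cancel()

	rows, err := s.db.QueryContext(ctx,
		`SELECT name FROM menu_items WHERE location_id = $1`, r.URL.Query().Get("loc"))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	// Forgetting this keeps the connection checked out of the pool.
	defer rows.Close()

	var names []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		names = append(names, name)
	}
	if err := rows.Err(); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	json.NewEncoder(w).Encode(names)
}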

I also tried running benchmarks and stress tests locally with ab and wrk, and after around 2000–3000 requests, the response time spikes from 100ms to over 5 seconds. CPU usage stays relatively stable, but memory usage keeps climbing. I suspect this is a connection pooling issue with database/sql or perhaps a race condition introduced by my recent caching layer (which uses a simple map guarded by a mutex). I added some defer statements and double-checked for possible deadlocks, but I haven’t found anything conclusive yet.
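On the pooling side, one thing I’m planning to try is putting explicit limits on the database/sql pool, something like this sketch (the numbers are guesses rather than tuned values):

package api

import (
	"database/sql"
	"time"
)

// configurePool applies explicit pool limits. The values are guesses, not tuned.
func configurePool(db *sql.DB) {
	db.SetMaxOpenConns(25)                 // cap concurrent connections to Postgres
	db.SetMaxIdleConns(25)                 // reuse idle connections instead of reopening
	db.SetConnMaxLifetime(5 * time.Minute) // recycle long-lived connections
	db.SetConnMaxIdleTime(2 * time.Minute) // drop connections that sit idle too long
}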

The really strange part is that when I disable my caching middleware, everything runs more smoothly (just slower), so the bug seems tied to how I’m managing shared state between goroutines. I might not be handling concurrent reads and writes correctly. I’m using Go 1.22.3, and the code runs fine locally for small tests, but under real-world load it degrades fast.
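For what it’s worth, the cache itself is nothing fancy. Stripped down, it has roughly this shape (the type and field names here are placeholders, and I’ve sketched it with a sync.RWMutex so concurrent readers don’t serialize):

package api

import (
	"sync"
	"time"
)

type entry struct {
	body    []byte
	expires time.Time
}

// menuCache is a plain map behind a mutex. With an RWMutex, readers can
// proceed concurrently; only writers block everyone else.
type menuCache struct {
	mu    sync.RWMutex
	items map[string]entry
}

func newMenuCache() *menuCache {
	return &menuCache{items: make(map[string]entry)}
}

func (c *menuCache) get(key string) ([]byte, bool) {
	c.mu.RLock()
	e, ok := c.items[key]
	c.mu.RUnlock()
	if !ok || time.Now().After(e.expires) {
		return nil, false
	}
	return e.body, true
}

func (c *menuCache) set(key string, body []byte, ttl time.Duration) {
	c.mu.Lock()
	c.items[key] = entry{body: body, expires: time.Now().Add(ttl)}
	c.mu.Unlock()
}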

Has anyone here experienced goroutines hanging due to improper context cancellation or mutex contention in Go web servers? What’s the best approach to debugging issues like this: should I use pprof’s block profiling, or is there a more direct way to detect stuck goroutines in real time? I’d love to make my backend more resilient, since my site’s restaurant data API is growing quickly and I don’t want to deal with manual restarts every few hours. Sorry for the long post!
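For context, the only live inspection I’ve wired up so far is the stock net/http/pprof endpoints on a separate localhost listener, roughly like this (the port is arbitrary):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Profiling listener, kept off the public interface.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the real API server starts here ...
	select {}
}

Hitting /debug/pprof/goroutine?debug=2 on that port dumps every goroutine’s full stack, which is how I can see the count climbing; I just haven’t managed to trace it back to a root cause yet.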

In what way are you using goroutines? Most of my handlers don’t explicitly spawn any, because net/http already runs each request in its own goroutine and most request work happens synchronously. I usually only spawn goroutines for long-running tasks that can happen outside the context of the request (e.g. a user signs up for a new account and the handler itself doesn’t need to wait for a “welcome” email to send).
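Something like this, roughly (a hypothetical handler, not your code):

package api

import (
	"context"
	"log"
	"net/http"
	"time"
)

func signupHandler(w http.ResponseWriter, r *http.Request) {
	// ... create the account synchronously ...

	// Fire-and-forget work gets its own context. r.Context() is cancelled as
	// soon as the handler returns, which would kill the email mid-send.
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if err := sendWelcomeEmail(ctx, "user@example.com"); err != nil {
			log.Printf("welcome email: %v", err)
		}
	}()

	w.WriteHeader(http.StatusCreated)
}

// sendWelcomeEmail stands in for whatever mail client you actually use.
func sendWelcomeEmail(ctx context.Context, to string) error { return nil }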

Are you using transactions at all? Because one common “gotcha” I’ve seen is people remembering to commit but forgetting to rollback:

tx := db.Begin()
err := tx.Exec(`some sql`, assignmentIDs, facultyID).Error
if err == nil {
	tx.Commit()
} else {
	// Forget to rollback on error? You're in trouble!
	tx.Rollback()
}
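My snippet above is ORM-flavored; with plain database/sql (which it sounds like you’re using), the simplest way to avoid that trap is to defer the rollback right after starting the transaction. Rolling back after a successful commit is a harmless no-op, so the transaction is always released no matter how the function exits. Rough sketch, with made-up table and column names:

package api

import (
	"context"
	"database/sql"
)

func insertMenuGroup(ctx context.Context, db *sql.DB, names []string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	// Runs on every exit path, including early returns and panics.
	// After a successful Commit it just returns sql.ErrTxDone.
	defer tx.Rollback()

	for _, name := range names {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO menu_items (name) VALUES ($1)`, name); err != nil {
			return err
		}
	}
	return tx.Commit()
}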

I ran into something where connections weren’t being released (it was caused by transactions not being rolled back as I mentioned above). It was very intermittent but eventually my app would grind to a halt. I created a DB stats route that might be of use:

func DBStats(w http.ResponseWriter, r *http.Request) {
	stats := queries.DBStats()

	encoded, _ := json.MarshalIndent(stats, "", "\t")

	html := fmt.Sprintf(`
<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>DB Stats</title>
  <style>
  html, body {
    height: 100%%;
}

html {
    display: table;
    margin: auto;
}

body {
    display: table-cell;
    vertical-align: middle;
}
</style>
</head>

<body style="font-family: Helvetica, Arial, sans-serif;">
	<h1>DB Stats</h1>
	<p><strong>Wait count:</strong> %v</p>
	<p><strong>Wait Duration:</strong> %v</p>
	<p><strong>Idle:</strong> %v</p>
	<p><strong>In Use:</strong> %v</p>
	<p><strong>JSON:</strong><pre>%s</pre></p>
</body>

</html>`, stats.WaitCount, stats.WaitDuration.String(), stats.Idle, stats.InUse, encoded)
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	w.Write([]byte(html))
}

As you can see this is a simple HTML stats page. The type returned from my queries package’s DBStats func is sql.DBStats, which you get by calling Stats() on your *sql.DB connection pool. Give that a try, and look closely at every place you’re using transactions.
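In my case the queries.DBStats func is just a thin wrapper around the pool, roughly:

package queries

import "database/sql"

var db *sql.DB // initialized elsewhere at startup

// DBStats exposes the pool's runtime counters (wait count, idle, in use, ...).
func DBStats() sql.DBStats {
	return db.Stats()
}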

Hi Joe,

As you pointed out, this probably has to do with your caching middleware. Without seeing the code it is hard to tell exactly what happens.
My guess is that your goroutines are blocking on a mutex that stays locked. Try to decouple the blocking behaviour; the mutex should not stay locked for long.

What could be happening? I’m not sure, but a typical scenario: you lock the mutex, defer the unlock, and then everything else the function does (including slow calls) runs while the lock is still held…

What I recommend:
I do not use defer at all for mutex unlocks.
Instead, keep the lock held for as short a time as possible and copy the data into a local variable if needed.

I always use manual lock and unlock without defer:

fc.mu.Lock()
infos := fc.Infos
fc.mu.Unlock()

Yes, the code will not look as nice, but it is simpler and has saved me many times. Defer is great for other use cases, but for mutex handling it is a trap in my opinion.
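The thing to avoid is holding the lock across anything slow, like a database query or an HTTP call. A made-up cache type, just to show the contrast:

package api

import "sync"

type cache struct {
	mu   sync.Mutex
	data map[string][]byte
}

func newCache() *cache {
	return &cache{data: make(map[string][]byte)}
}

// Bad: the lock is held for the whole upstream fetch, so every other
// request that touches the cache queues up behind one slow call.
func (c *cache) refreshSlow(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = fetchFromUpstream(key) // slow network call under the lock
}

// Better: do the slow work outside the critical section and only lock
// long enough to write the result into the map.
func (c *cache) refreshFast(key string) {
	body := fetchFromUpstream(key) // no lock held here

	c.mu.Lock()
	c.data[key] = body
	c.mu.Unlock()
}

// fetchFromUpstream stands in for whatever slow call fills the cache.
func fetchFromUpstream(key string) []byte { return nil }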

One last thought about troubleshooting. Have you seen this?

Thanks for the detailed response, that’s actually really helpful! You make a great point about unnecessary goroutines. I’m not explicitly spawning them in most handlers, but I do have a few places where I launch background cache refreshes and async database calls for menu updates (my website handles restaurant data like Texas Roadhouse menus), so that might be part of the issue.

And yes, I’m using transactions in a few parts of the code — particularly for grouped menu inserts — and I think you’re spot on about rollback handling. I went back through my code and found one spot where I wasn’t rolling back properly on error. That could explain why some DB connections weren’t being released and the goroutine count kept climbing.

I really like your idea of adding a simple DB stats route. That’ll make it much easier to monitor connection pool behavior in real time. I’ll implement that and see what it reveals under load. Thanks again for sharing such a practical debugging approach — this gives me a clearer direction to track down the hanging requests.

A properly configured linter, like the ones bundled in golangci-lint, should flag problems like these most of the time. An unclosed transaction, file, or request body is usually easy for a linter to detect, and catching those early helps mitigate exactly this kind of problem.

You will always get some false positives or opinionated flags when first starting out with a linter framework, but I think it pays off in the long run. When I stumble across a bug like this in my code, I try to improve my linter configuration so the same bug will be flagged in my editor in the future.
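As a starting point, a minimal .golangci.yml that enables the checks most relevant to leaked connections and contexts might look roughly like this (worth double-checking the exact linter names against the golangci-lint docs for your version):

linters:
  enable:
    - errcheck       # unchecked errors (e.g. an ignored Rollback or Close)
    - sqlclosecheck  # sql.Rows / sql.Stmt that are never closed
    - rowserrcheck   # rows.Err() never checked after iterating
    - bodyclose      # HTTP response bodies left unclosed
    - noctx          # outgoing HTTP requests sent without a context
    - govet          # includes lostcancel: a context cancel func that is never called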