Recover() usage - Go idiomatic way

pieterlouw · January 31, 2017, 6:40am

Hi,

I want to start a discussion on the use of recover() in your projects.

Should your code contain as much or as little recover() as possible?
In which cases is recover() justified?
Are there any Go idioms around the use of recover()?

In my personal experience I have found that a good place to put recover() is in any server (TCP/HTTP) handler functions.
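
A minimal sketch of that pattern, assuming an HTTP server; recoverMiddleware and panicStatus are hypothetical names used only for illustration, not part of any library:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// recoverMiddleware (hypothetical helper) wraps a handler so that a panic in
// one request is logged and answered with a 500 instead of propagating.
func recoverMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if v := recover(); v != nil {
				log.Printf("panic serving %s: %v", r.URL.Path, v)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

// panicStatus runs a deliberately panicking handler through the middleware
// and returns the status code a client would see.
func panicStatus() int {
	h := recoverMiddleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var m map[string]int
		m["x"] = 1 // write to a nil map: runtime panic
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/demo", nil))
	return rec.Code
}

func main() {
	fmt.Println(panicStatus())
}
```

Note that the net/http server already recovers panics raised inside handlers on its own (it logs a stack trace and drops the connection), so a wrapper like this mostly changes what the client sees and what gets logged.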

Pieter

dfc · January 31, 2017, 6:57am

IMO you should not use recover at all. If there is a panic-worthy event, your program should not attempt to recover. Instead it should exit and let your process management software restart it.

Cf. http://wiki.c2.com/?CrashOnlySoftware

pieterlouw · January 31, 2017, 9:04am

Thanks @dfc

I’m struggling to see what difference there is between restarting the program and recovering and then continuing.

calmh · January 31, 2017, 12:45pm

By crashing and restarting you get:

A nice backtrace to the console, which your service monitor can capture and alert on.
The program running again from a known clean state.

The underlying assumption here is that panics are not something commonplace that you should be expected to know how to handle. Those are errors; handle them. Panics are things like out-of-bounds slice accesses and dereferencing nil pointers. That kind of issue is an indication that your logic is out of sync with reality and your state is corrupt. There’s really not much to do to recover at that point, in general.
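
To make the distinction concrete, here is a small sketch; parsePort and third are hypothetical functions made up for this illustration. Bad input is an expected condition and comes back as an error value, while indexing past the end of a slice is a bug that the runtime turns into a panic:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// parsePort treats bad input as an expected condition and reports it as an
// error value that the caller can handle.
func parsePort(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parse port %q: %w", s, err)
	}
	if n < 1 || n > 65535 {
		return 0, errors.New("port out of range")
	}
	return n, nil
}

// third assumes the slice has at least three elements. If it does not, that
// is a logic bug in the caller, and the runtime panics on the index.
func third(s []int) int {
	return s[2]
}

func main() {
	if _, err := parsePort("http"); err != nil {
		fmt.Println("handled:", err) // an error: handle it and carry on
	}
	fmt.Println(third([]int{1, 2, 3}))
	// third([]int{1}) would panic with an index-out-of-range runtime error
	// and, if not recovered, crash the program with a stack trace.
}
```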

radovskyb · January 31, 2017, 1:01pm

I think that was said really well.

pieterlouw · January 31, 2017, 1:14pm

This makes sense, thanks, especially the fact that the program is running from a clean state.

This also points out the fact that libraries in Go should not panic in cases where it’s actually an error.
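
One related idiom, described in Effective Go, is for a package to use panic internally but recover at its exported boundary and hand the caller an ordinary error. A rough sketch, where parseError, mustInt and Sum are hypothetical names invented for this example:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseError marks panics that this package raises on purpose.
type parseError struct{ msg string }

func (e parseError) Error() string { return e.msg }

// mustInt is internal: it panics with a parseError on bad input.
func mustInt(s string) int {
	n, err := strconv.Atoi(s)
	if err != nil {
		panic(parseError{"not an integer: " + strconv.Quote(s)})
	}
	return n
}

// Sum is the exported entry point. It converts the package's own panics into
// an error and re-panics on anything else, since that would be a real bug.
func Sum(fields []string) (total int, err error) {
	defer func() {
		if v := recover(); v != nil {
			pe, ok := v.(parseError)
			if !ok {
				panic(v) // not ours: let it crash
			}
			total, err = 0, pe
		}
	}()
	for _, f := range fields {
		total += mustInt(f)
	}
	return total, nil
}

func main() {
	fmt.Println(Sum([]string{"1", "2", "3"}))
	fmt.Println(Sum([]string{"1", "x"}))
}
```

The important detail is the type check: only panics the package raised itself are turned into errors, so nil dereferences and similar bugs still crash loudly.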

willis7 · January 31, 2017, 7:19pm

I’m not sure I entirely agree with the whole crash-and-restart approach, especially in the age of schedulers which will immediately fire up a new instance. Most people don’t have monitoring/alerting telling them this has happened, and it goes unnoticed for a long time. (Well, that’s my experience at least.)

There’s also the poor user experience. Consider a long-running routine that gets terminated because of a poorly written short one that runs once in a blue moon. Is that acceptable?

I like to think of recover() in the same vein as tests. Hopefully I don’t need it to catch bugs, but I’m grateful if it does.

dfc · January 31, 2017, 8:00pm

I hear this argument a lot; people want very reliable software, to the point that they are not prepared to let a program that has suffered a critical failure exit. Yet at the same time they don’t appear to be prepared to invest the effort commensurate with the level of importance of that program.

Here are the ground rules if you care about a service:

have more than one of them; ruthlessly eradicate any single point of failure. This also means a failure of one component should not lead to an outage, only a reduction in capacity.
don’t just monitor binary values like “is the process up” because software has bugs and can get wedged just as often as it crashes. Instead, monitor things like throughput, request latency, queue length, and alert on those.
assume everything is broken, always. The key to designing robust software is to answer the question “what happens when I send this request and I never hear back?” Once you’ve done that for every request, in every component in your system, not only do you get increased reliability, but you get the happy, successful path for free.
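
For the last rule, one common way to answer the “never hear back” question in Go is to put a deadline on every outgoing call. A minimal sketch, where slowReply stands in for a remote component and both function names are hypothetical:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowReply simulates a remote component that answers after delay, unless
// the caller's context is done first.
func slowReply(ctx context.Context, delay time.Duration) (string, error) {
	select {
	case <-time.After(delay):
		return "ok", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// callWithDeadline bounds how long the caller is willing to wait, so a
// silent peer becomes an ordinary error instead of a hung goroutine.
func callWithDeadline(timeout, delay time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return slowReply(ctx, delay)
}

func main() {
	// The peer answers in time.
	fmt.Println(callWithDeadline(200*time.Millisecond, time.Millisecond))
	// The peer is too slow: the call returns a deadline error instead.
	fmt.Println(callWithDeadline(10*time.Millisecond, 2*time.Second))
}
```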

willis7 · January 31, 2017, 8:33pm

I hear your point and agree for the most part. However, if we’re being pragmatic, isn’t it cheaper to implement a simple Recover() than a logging and monitoring solution?

I guess we could apply the same logic to writing tests. Why bother? Let’s just fix on fail.

This meme comes to mind.

dfc · January 31, 2017, 8:45pm

I think you’re still approaching this from a monolithic application point of view. Once you have more than one component in your system you have to assume everything is failing all the time and plan your deployment accordingly.

willis7 · January 31, 2017, 8:54pm

Yes, that’s a fair assumption. I am thinking from a monolith POV. It won’t be long until the industry pivots and we start building monoliths again.

christophberger · February 1, 2017, 6:25am

Implementation cost is not the important factor here. The point is, if your code catches a panic from some call levels deeper in the code, perhaps even from within some larger third-party library, then how can a recover() routine clearly determine where the panic came from, what the exact reasons were, what state the process is in now, and how to restore the process to a well-known, clean state?

Worst case: Imagine the recover code saves the process from crashing but now the process operates on broken data, and no one notices for weeks, until it turns out that a large money transfer had the wrong recipient who then headed off to an unknown foreign country with a bag of money and a fake passport and now sits in some beach bar sipping a Cuba Libre, knowing he would not need to work anymore for the rest of his life.

Damage done.
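
A toy sketch of how that can happen; the ledger type and all function names are hypothetical, and the bug is simulated with an out-of-range index. The panic fires after the debit but before the credit, recover() swallows it, and the process carries on with money missing:

```go
package main

import "fmt"

type ledger struct{ balances map[string]int }

// total sums all balances; it should never change during a transfer.
func (l *ledger) total() int {
	sum := 0
	for _, b := range l.balances {
		sum += b
	}
	return sum
}

// transfer debits first and credits second. The index expression stands in
// for a bug that panics between the two steps when route is empty.
func (l *ledger) transfer(from, to string, amount int, route []string) {
	l.balances[from] -= amount
	_ = route[0]
	l.balances[to] += amount
}

// safeTransfer recovers from the panic, but cannot undo the half-finished
// update it knows nothing about.
func (l *ledger) safeTransfer(from, to string, amount int, route []string) (err error) {
	defer func() {
		if v := recover(); v != nil {
			err = fmt.Errorf("transfer failed: %v", v)
		}
	}()
	l.transfer(from, to, amount, route)
	return nil
}

// totalsAroundPanic returns the ledger total before and after a transfer
// whose panic was recovered.
func totalsAroundPanic() (before, after int) {
	l := &ledger{balances: map[string]int{"alice": 100, "bob": 100}}
	before = l.total()
	_ = l.safeTransfer("alice", "bob", 50, nil) // nil route triggers the bug
	return before, l.total()
}

func main() {
	before, after := totalsAroundPanic()
	fmt.Println(before, after) // the totals differ: 50 has vanished
}
```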

pieterlouw · February 1, 2017, 7:29am

Do you mean something like Prometheus?

dfc · February 1, 2017, 7:49am

Sure, there are lots of good monitoring and alerting tools.

willis7 · February 1, 2017, 11:03am

Maybe I am misunderstanding the purpose of recover(), but my understanding was that it’s used as a type of circuit breaker. You should still report it, and act on it, but my challenge is: is it right that a panic() during the transfer of £10 in one routine kills the transfer of £100,000,000 in another… hypothetically speaking, of course.

I’m pretty sure most bank managers would agree with me and say they would rather recover() from the panic() in the £10 transfer, so that the £100,000,000 transaction may continue.
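
What I have in mind is roughly the following sketch (runIsolated and demo are made-up names). An unrecovered panic in any goroutine takes down the whole process, and recover() only works when deferred in the goroutine that panicked, so each job gets its own deferred recover:

```go
package main

import (
	"fmt"
	"sync"
)

// runIsolated runs each job in its own goroutine and turns a panic in one
// job into an error for that job only, leaving the others running.
func runIsolated(jobs []func()) []error {
	errs := make([]error, len(jobs))
	var wg sync.WaitGroup
	for i, job := range jobs {
		wg.Add(1)
		go func(i int, job func()) {
			defer wg.Done()
			defer func() {
				if v := recover(); v != nil {
					errs[i] = fmt.Errorf("job %d panicked: %v", i, v)
				}
			}()
			job()
		}(i, job)
	}
	wg.Wait()
	return errs
}

// demo runs a small buggy job next to a large healthy one and reports
// whether the large one completed.
func demo() (largeDone bool, errs []error) {
	errs = runIsolated([]func(){
		func() { panic("bug in the small transfer") },
		func() { largeDone = true },
	})
	return largeDone, errs
}

func main() {
	done, errs := demo()
	fmt.Println(done, errs[0], errs[1])
}
```

This isolates the goroutines from each other, but it says nothing about any shared state the panicking job may have left half-updated.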

dfc · February 1, 2017, 11:45am

I doubt I can change your mind, but consider issues like this https://github.com/golang/go/issues/13879

dfc · February 1, 2017, 11:50am

Also, how is your situation any different from one where a transfer of 100,000 pounds is in flight and a cap blows in the power supply of the server?

The point is, failures happen. Stop trying to categorise them into ones you think shouldn’t happen.

willis7 · February 1, 2017, 1:54pm

Nassim Nicholas Taleb has already beaten me to the categorization… Ok, that was very tongue in cheek, but I still haven’t been convinced recover is bad. I do agree that there can be bad implementations, which I believe is what you alluded to in your link.

From the direction of this conversation so far I believe it runs the risk of turning into a tab vs. space debate, so I will quit now, but I am keen to hear more from others.

system · May 2, 2017, 1:55pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.