IMO you should not use recover at all. If there is a panic-worthy event, your program should not attempt to recover; instead it should exit and let your process management software restart it. That gives you two things:
A nice backtrace to the console, which your service monitor can capture and alert on.
The program running again from a known clean state.
The underlying assumption here is that panics are not something commonplace that you should expect to know how to handle. That’s errors - handle those. But panics are things like out-of-bounds slice accesses and nil pointer dereferences. That kind of issue is an indication that your logic is out of sync with reality and your state is corrupt. In general, there’s really not much you can do to recover at that point.
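To make the error/panic distinction concrete, here is a minimal sketch. The names first, firstUnchecked and errEmpty are made up for illustration; the first function reports an expected condition as an error, while the second contains the kind of logic bug that makes the runtime panic.

```go
package main

import (
	"errors"
	"fmt"
)

// errEmpty is a hypothetical sentinel error for an expected failure.
var errEmpty = errors.New("no items")

// first reports an expected condition (empty input) as an ordinary
// error that the caller can handle.
func first(items []string) (string, error) {
	if len(items) == 0 {
		return "", errEmpty
	}
	return items[0], nil
}

// firstUnchecked has a logic bug: it assumes the slice is never empty.
// Called with an empty slice, the runtime panics with an
// index-out-of-range error.
func firstUnchecked(items []string) string {
	return items[0]
}

func main() {
	if _, err := first(nil); err != nil {
		fmt.Println("handled error:", err)
	}
	// Uncommenting the next line crashes the program with a stack trace.
	// fmt.Println(firstUnchecked(nil))
}
```

The empty-input case is something the caller can reasonably plan for; the out-of-range access means the code's assumptions were wrong.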
I’m not sure I entirely agree with the whole crash-and-restart approach, especially in the age of schedulers which will immediately fire up a new instance. Most people don’t have monitoring/alerting telling them this has happened, so it goes unnoticed for a long time. (Well, that’s my experience at least.)
There’s also the poor user experience. Consider a long-running routine that gets terminated because of a poorly written short one that runs once in a blue moon. Is that acceptable?
I like to think of recover in the same vein as tests. Hopefully I don’t need it to catch bugs, but I’m grateful if it does.
I hear this argument a lot; people want very reliable software, to the point that they are not prepared to let a program that has suffered a critical failure exit. Yet at the same time they don’t appear to be prepared to invest the effort commensurate with the level of importance of that program.
Here are the ground rules if you care about a service:
have more than one of them; ruthlessly eradicate any single point of failure. This also means a failure of one component should not lead to an outage, only a reduction in capacity
don’t just monitor binary values like “is the process up” because software has bugs and can get wedged just as often as it crashes. Instead, monitor things like throughput, request latency, queue length, and alert on those.
assume everything is broken, always. The key to designing robust software is to answer the question “what happens when I send this request and I never hear back”. Once you’ve done that for every request, in every component in your system, not only do you get increased reliability, but you get the happy successful path for free.
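One way to answer the “I never hear back” question in Go is to put a deadline on every call. This is only a sketch: callWithTimeout, silentPeer and demoTimeout are hypothetical names, and silentPeer merely simulates a component that never replies.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// demoTimeout is an arbitrary deadline chosen for the example.
const demoTimeout = 50 * time.Millisecond

// silentPeer simulates a component that never answers; it only
// returns once the caller's context is cancelled or expires.
func silentPeer(ctx context.Context) (string, error) {
	<-ctx.Done()
	return "", ctx.Err()
}

// callWithTimeout waits at most d for call to answer.
func callWithTimeout(d time.Duration, call func(context.Context) (string, error)) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()

	type result struct {
		val string
		err error
	}
	// Buffered so the goroutine can finish even after we stop waiting.
	ch := make(chan result, 1)
	go func() {
		v, err := call(ctx)
		ch <- result{v, err}
	}()

	select {
	case r := <-ch:
		return r.val, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	_, err := callWithTimeout(demoTimeout, silentPeer)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

The caller always gets an answer within the deadline, so deciding what to do with a silent dependency becomes ordinary error handling.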
I hear your point and agree for the most part. However, if we’re being pragmatic, isn’t it cheaper to implement a simple recover() than a logging and monitoring solution?
I guess we could apply the same logic to writing tests. Why bother? Let’s just fix on fail.
I think you’re still approaching this from a monolithic application point of view. Once you have more than one component in your system you have to assume everything is failing all the time and plan your deployment accordingly.
Implementation cost is not the important factor here. The point is, if your code catches a panic from some call levels deeper in the code, perhaps even from within some larger third-party library, then how can a recover() routine clearly determine where the panic came from, what the exact reasons were, what state the process is in now, and how to restore the process to a well-known, clean state?
Worst case: Imagine the recover code saves the process from crashing but now the process operates on broken data, and no one notices for weeks, until it turns out that a large money transfer had the wrong recipient who then headed off to an unknown foreign country with a bag of money and a fake passport and now sits in some beach bar sipping a Cuba Libre, knowing he would not need to work anymore for the rest of his life.
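A small sketch of that worst case, under made-up assumptions: ledger, transfer and recoveredTransfer are hypothetical, and the nil fee pointer stands in for any bug that panics halfway through an update.

```go
package main

import "fmt"

// ledger is a hypothetical in-memory account store.
type ledger map[string]int

// transfer debits the sender, then credits the receiver.
// The credit dereferences fee, so a nil fee panics after the debit
// has already been applied.
func transfer(l ledger, from, to string, amount int, fee *int) {
	l[from] -= amount
	l[to] += amount - *fee // panics if fee is nil
}

// recoveredTransfer swallows the panic and keeps the process alive.
func recoveredTransfer(l ledger, from, to string, amount int, fee *int) (ok bool) {
	defer func() {
		if r := recover(); r != nil {
			ok = false
		}
	}()
	transfer(l, from, to, amount, fee)
	return true
}

// total sums all balances; it should stay constant across transfers
// that charge no fee.
func total(l ledger) int {
	sum := 0
	for _, v := range l {
		sum += v
	}
	return sum
}

func main() {
	l := ledger{"alice": 100, "bob": 0}
	ok := recoveredTransfer(l, "alice", "bob", 40, nil)
	fmt.Printf("ok=%v alice=%d bob=%d total=%d\n", ok, l["alice"], l["bob"], total(l))
	// prints "ok=false alice=60 bob=0 total=60"
}
```

The process survives, but 40 units have vanished from the ledger and nothing in the recover handler knows that.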
Maybe I am misunderstanding the purpose of recover(), but my understanding was that it’s used as a type of circuit breaker. You should still report it, and act on it, but my challenge is: is it right that a panic() during the transfer of £10 in one routine kills the transfer of £100,000,000 in another… hypothetically speaking, of course.
I’m pretty sure most bank managers would agree with me and say they would rather recover() from the panic() in the £10 transfer, so that the £100,000,000 transaction may continue.
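For anyone following along, the circuit-breaker usage I have in mind looks roughly like this. It is a sketch only: safely and the two jobs are made-up names, and it assumes the jobs share no state that a panic could leave half-updated.

```go
package main

import (
	"fmt"
	"sync"
)

// safely runs job and converts a panic inside it into an error, so a
// failure in one job does not take down the whole process.
func safely(name string, job func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("job %s panicked: %v", name, r)
		}
	}()
	job()
	return nil
}

func main() {
	jobs := map[string]func(){
		// The small job has a bug: writing to a nil map panics.
		"small": func() {
			var m map[string]int
			m["x"] = 1
		},
		"large": func() { fmt.Println("large transfer completed") },
	}

	var wg sync.WaitGroup
	for name, job := range jobs {
		wg.Add(1)
		go func(name string, job func()) {
			defer wg.Done()
			// The panic is reported, not silently dropped.
			if err := safely(name, job); err != nil {
				fmt.Println("reported:", err)
			}
		}(name, job)
	}
	wg.Wait()
}
```

Note that recover only works in the goroutine that panicked, so each goroutine needs its own deferred handler; an unrecovered panic in any goroutine still terminates the whole program.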
Nassim Nicholas Taleb has already beaten me to the categorization… Ok, that was very tongue in cheek, but I still haven’t been convinced recover is bad. I do agree that there can be bad implementations, which I believe is what you alluded to in your link.
From the direction of this conversation so far, I believe it runs the risk of turning into a tabs vs. spaces debate, so I will quit now, but I am keen to hear more from others.