Ah, that makes sense. Thanks for the response.
It asks the error it wrapped. Here is an example from the net package:
func (e *OpError) Temporary() bool {
	if ne, ok := e.Err.(*os.SyscallError); ok {
		t, ok := ne.Err.(temporary)
		return ok && t.Temporary()
	}
	t, ok := e.Err.(temporary)
	return ok && t.Temporary()
}
Yes, but that raises the question: how does the wrapped error (which comes from a widely reusable package) know what the application considers temporary? And temporary on what timescale?
It was mentioned above that the primary use of the Temporary behavior was to avoid logging such errors, but that doesn't agree with my experience in the applications that I have been writing. I find that the primary requirement is deciding whether to retry the operation or not. Even when retrying, I usually want to log the error that caused the retry.
In the context of deciding whether to retry, each application will consider different errors as worthy of a retry. It usually depends on the application requirements, which a widely reusable package should not be concerned with.
This line of argument is the opposite of the principle that the caller should consider errors opaque, and I really do think the caller should consider errors opaque.
There are some exceptions, as always, like io.EOF, but they are exceptions, not the rule.
In my experience the value of the error returned almost never alters the logic of the caller. That is, if there was an error, the caller decides to retry based on some logic of its own, not by inspecting the error for its cause.
Inspecting the error appears to happen almost exclusively for logging.
How opaque should the errors be? Isn't providing Temporary, Timeout, or other methods leaking some information?
My objection is not to the idea of errors providing behaviors, but to the Temporary behavior specifically. The semantics of Temporary are much harder to define than those of Timeout, for example.
Temporary is whatever the author of the original value thinks it is. Given that the only interface which exports this behaviour is net.Error, I don't think you need to worry about disagreement over a definition. It feels to me that you're trying to inspect the error by value to make your own judgement on it.
I fear we are going in circles. How does the author of the original value know what my application considers temporary?
I am going to do my best to help break us out of this loop. I have read most of the references you linked above before today, but had not read the discussion at https://github.com/golang/go/issues/4373 until now. I feel that I have a much better understanding of where you are coming from. Now I will attempt to explain more about where I am coming from.
I have several applications that work with a database via database/sql using the github.com/alexbrainman/odbc driver. These are long-running applications that are expected to be tolerant of external problems and also not make any harmful updates in the face of unexpected DB state.
In general these applications avoid making harmful updates by following the crash-only software model. Taking this approach has forced the application to have very solid startup and recovery behavior. But as the "Misconceptions" section of the article explains:
Some programmers overuse the technique by deliberately writing code to crash the program whenever something goes wrong, when the correct solution is to handle all the errors you can think of correctly, and then rely on crash/restart for unforeseen error conditions.
In the early days of the application we didn't know what kinds of external problems to expect, so all such errors crashed the service for a restart. But since we logged the errors, we could learn how to better handle them when appropriate.
One day our monitoring dashboard lit up because every single instance of the application crashed at once. That's when we learned that every so often the DBAs shut down the DB for maintenance. On the one hand it was awesome because the application recovered properly. On the other hand we would prefer that the application recognized the DB was down and waited for it to come back without crashing.
But how should we distinguish between database errors to know whether to crash or retry? We should only retry if the error is known to represent a DB outage, all others should cause a crash.
This sounds like a great opportunity to follow your advice and inspect returned errors for the Temporary behavior. In fact, we have a package that wraps database/sql and provides retry logic. The package includes this function:
// New returns a new robust database provider. Methods on the returned
// database pass errors to retry. If retry returns nil the operation is tried
// again; otherwise the error returned from retry is returned to the caller.
func New(db *sql.DB, retry func(err error) error) *DB {
	return &DB{
		db:    db,
		retry: retry,
	}
}
In theory we wouldn't need to provide a retry function to this constructor, because errors could simply be inspected for the Temporary method, which we could call to decide when to retry. In practice that doesn't scale.
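For reference, that "in theory" approach would look something like the sketch below: a generic retry loop that keys only off a Temporary method. The tempErr type and the lack of backoff are illustrative simplifications, not part of any real package:

```go
package main

import "fmt"

// temporary is the behavior the loop asserts for.
type temporary interface {
	Temporary() bool
}

// tempErr is a stand-in error that always reports itself as temporary.
type tempErr struct{ msg string }

func (e *tempErr) Error() string   { return e.msg }
func (e *tempErr) Temporary() bool { return true }

// doWithRetry retries op as long as it fails with a Temporary error.
func doWithRetry(op func() error) error {
	for {
		err := op()
		if err == nil {
			return nil
		}
		if t, ok := err.(temporary); ok && t.Temporary() {
			continue // transient: try again (real code would back off)
		}
		return err
	}
}

func main() {
	attempts := 0
	err := doWithRetry(func() error {
		attempts++
		if attempts < 3 {
			return &tempErr{"connection reset"} // fails twice, then succeeds
		}
		return nil
	})
	fmt.Println(err == nil, attempts) // true 3
}
```

The loop works, but it hands the retry decision entirely to whoever authored the error value, which is exactly the objection being raised here.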
ODBC defines specific error codes that ODBC drivers can return. But ODBC is itself an abstraction and different DB engines may use the error codes in subtly different ways. Even different ODBC drivers talking to the same DB may work slightly differently.
Given that situation, an ODBC driver for database/sql such as github.com/alexbrainman/odbc would need to know about the behavior of too many ODBC driver/DB engine configurations. Although it is possible for the driver to faithfully communicate the error information, it is not practical for it to interpret that information accurately across all database engines to determine whether the error is temporary for our application.
But our application has the context to know the specifics of the DB engine we have chosen and how it should handle those errors. Thus, we can write a function analogous to os.IsNotExist. It looks like this:
// IsTemporary returns true if err represents an error that is most likely
// caused by a temporary database outage.
func IsTemporary(err error) bool {
	if err == driver.ErrBadConn {
		return true
	}
	odbcErr, ok := err.(*odbc.Error)
	if !ok {
		return false
	}
	for _, dr := range odbcErr.Diag {
		// 08001: Client unable to establish connection
		// 08S01: Communication link failure
		if dr.State == "08001" || dr.State == "08S01" {
			return true
		}
		// 42000: Syntax error or access violation
		if dr.State == "42000" {
			switch dr.NativeError {
			// 978: [SQL Server]Availability group read-only node
			// 983: [SQL Server]Availability group not PRIMARY or SECONDARY
			// 6005: [SQL Server]SHUTDOWN is in progress
			case 978, 983, 6005:
				return true
			}
		}
	}
	return false
}
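Wiring a function like IsTemporary into the New constructor shown earlier might look like the following self-contained sketch. The exec harness, the sentinel errOutage, and the sleep interval are all illustrative assumptions standing in for the real wrapper package and ODBC error values:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errOutage is a stand-in sentinel; the real code inspects *odbc.Error.
var errOutage = errors.New("db outage")

// isTemporary stands in for the IsTemporary function above.
func isTemporary(err error) bool { return err == errOutage }

// retry is the callback we would pass to New: returning nil means
// "try again"; anything else is surfaced to the caller (the crash path).
func retry(err error) error {
	if isTemporary(err) {
		time.Sleep(10 * time.Millisecond) // back off before retrying
		return nil
	}
	return err
}

// exec mimics how the wrapper's methods would use the callback.
func exec(op func() error, retry func(error) error) error {
	for {
		err := op()
		if err == nil {
			return nil
		}
		if rerr := retry(err); rerr != nil {
			return rerr
		}
	}
}

func main() {
	calls := 0
	err := exec(func() error {
		calls++
		if calls < 2 {
			return errOutage // first attempt hits the outage
		}
		return nil
	}, retry)
	fmt.Println(err, calls) // <nil> 2
}
```

The retry policy lives entirely in the application, so the driver package only has to report errors faithfully, not interpret them.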
Recently our organization has decided to install a clustered version of the database our application works with. The new technology will provide automatic fail-over, with no data loss, to a secondary DB server should the primary fail. Of course we are testing it before we deploy it to production.
When we forced a fail-over while a test application was running transactions, we discovered that this new configuration returns some new errors during the fail-over window. All of these new errors use the same ODBC error state, but have different native error codes. Given our software architecture we simply added the new cases to our IsTemporary function (native errors 978 and 983, seen above). I see that as a big win compared to the author of github.com/alexbrainman/odbc being required to stay abreast of the error-reporting idiosyncrasies of all the DB engines people are using with his driver, and hoping that the author's concept of temporary is compatible with ours.
Thanks for your comprehensive reply. I think I understand your problem well.
In response to your two points: what defines a temporary error? Something the caller might want to retry. The canonical examples are the errors that return true from Temporary when implementing the net.Error interface.
In the example of your odbc driver, would adding a method
func (e *odbc.Error) Temporary() bool {
	for _, dr := range e.Diag {
		// 08001: Client unable to establish connection
		// 08S01: Communication link failure
		if dr.State == "08001" || dr.State == "08S01" {
			return true
		}
		// 42000: Syntax error or access violation
		if dr.State == "42000" {
			switch dr.NativeError {
			// 978: [SQL Server]Availability group read-only node
			// 983: [SQL Server]Availability group not PRIMARY or SECONDARY
			// 6005: [SQL Server]SHUTDOWN is in progress
			case 978, 983, 6005:
				return true
			}
		}
	}
	return false
}
work?
It would work in the sense that the application would function as we want, but I don't think it is optimal from a software engineering perspective. I have these questions:
- Is that list of conditions what others want?
- Will the maintainer of the driver package want to take on the burden of making that method work for all the database engines out there that have ODBC drivers?
- If yes, is it even possible? Note that the switch dr.NativeError section is specific to SQL Server. It seems likely that another DB engine might use some of the same native error codes to mean different things. That implies that Temporary must know which DB engine is behind the ODBC interface in order to interpret the native codes.
I think the answers are:
- Maybe.
- No.
- Unlikely or problematic.
It seems like you're overthinking this. Temporary just means that the code returning the error thinks you might be able to succeed if you retry. It is not any kind of guarantee, nor is it any kind of binding contract. It seems like your problem is "but what if I don't agree with the value returned by Temporary()?" In which case, don't call it; just do your own thing, which is what Dave was suggesting.
I think the discussion further reinforces the point of your blog post. Returning errors is hard. There is not one technique that is always right.
@dfc would you recommend always making custom error types unexported, with unexported fields? This would force you to add behaviors that allow the concrete type value to remain opaque.
Yup, but I wouldn't recommend creating custom error types, period.
I've gone back and forth on how I should do errors. I think a good example is the Exception class in Java and what it carries. I generally want a message, a cause, and a stack trace.
As a real-life example, the ingest server we use is written in Go, and there are a handful of places where we retry on errors (of a specific type, because there is no behavior to introspect on), a single place where we introspect error behavior so we know what HTTP error codes and messages to show users, and the 99% case of "raise the error back up until it's logged and a metric is incremented".
Because after-the-fact diagnosis is the 99% real-life use case for us, we embed the Java-exception base parts (message/cause/stack) in our errors, which is what most of the error-wrapping libraries do.
I think it's such a good thing to do that I wish the error libraries out there would standardize on the method signatures they use for the cause and stack trace parts, like we already have for the message with Error() string.
The problem I have with adding a stack trace to every error is that you're doing a lot of work for some text that may never see the light of day. Errors happen all the time, and often they are just handled; if you created a stack trace for every one, you've slowed down your code for little benefit. The problem, of course, is that you never know at error-creation time whether it'll be "worth it" to record the stack trace. I guess one solution is to make it a feature you can turn on with an environment variable or something, for when you're debugging. But that can still screw you if you get an error in production and don't know where it came from.
In my experience, error messages are usually unique enough that I can grep for them, though there have been a few that were pretty opaque and took some time to track down.
But generally if youâre properly annotating your errors with contextual information, you almost (almost) donât need stack traces.
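A sketch of what such annotation looks like in practice: each wrap adds the context a stack trace would otherwise have to supply. The function and file name here are made up for the example:

```go
package main

import (
	"errors"
	"fmt"
)

// readConfig pretends an underlying read failed and annotates the error
// with what we were doing, not just what went wrong.
func readConfig() error {
	err := errors.New("permission denied")
	return fmt.Errorf("reading config settings.toml: %v", err)
}

func main() {
	if err := readConfig(); err != nil {
		// The resulting message is usually unique enough to grep for.
		fmt.Println(err) // reading config settings.toml: permission denied
	}
}
```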
I agree about every error. It's behind a flag for us, and it only happens at the first wrap in a chain of wrapped errors.