Returning errors is hard

Ah, that makes sense. Thanks for the response.

It asks the error it wrapped. Here is an example from the net package:

func (e *OpError) Temporary() bool {
	if ne, ok := e.Err.(*os.SyscallError); ok {
		t, ok := ne.Err.(temporary)
		return ok && t.Temporary()
	}
	t, ok := e.Err.(temporary)
	return ok && t.Temporary()
}
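
The temporary interface being asserted against is a small unexported helper inside the net package; it looks roughly like this:

type temporary interface {
	Temporary() bool
}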

Yes, but that raises the question of how the wrapped error (which comes from a widely reusable package) knows what the application considers temporary. Also, temporary on what timescale?

It was mentioned above that the primary use of the Temporary behavior was to avoid logging such errors, but that doesn’t agree with my experience in the applications that I have been writing. I find that the primary requirement is for deciding whether to retry the operation or not. Even when retrying I usually want to log the error that caused the retry.

In the context of deciding whether to retry, each application will consider different errors as worthy of a retry. It usually depends on the application requirements, which a widely reusable package should not be concerned with.

This line of argument is the opposite of the position that the caller should consider errors opaque, and I really do think the caller should consider errors opaque.

There are some exceptions, as always, like io.EOF, but they are just that: exceptions, not the rule.

In my experience the value of the error returned almost never alters the logic of the caller. That is, if there was an error, the caller decides whether to retry based on some logic of its own, not by inspecting the error for its cause.

Inspecting the error appears to be closely tied, almost exclusively, to logging.

How opaque should the errors be? Isn’t providing Temporary, Timeout or other methods leaking some information?

My objection is not with the idea of errors providing behaviors, but with the Temporary behavior specifically. The semantics of Temporary are much harder to define than Timeout for example.


Temporary is whatever the author of the original error value thinks it is. Given that the only interface which exports this behaviour is net.Error, I don’t think you need to worry about disagreement over a definition. It feels to me that you’re trying to inspect the error by value to make your own judgement on it.
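
For reference, net.Error is declared in the standard library (essentially) as:

type Error interface {
	error
	Timeout() bool   // Is the error a timeout?
	Temporary() bool // Is the error temporary?
}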

I fear we are going in circles. How does the author of the original value know what my application considers temporary?

I am going to do my best to help break us out of this loop. I have read most of the references you linked above before today, but had not read the discussion at https://github.com/golang/go/issues/4373 until now. I feel that I have a much better understanding of where you are coming from. Now I will attempt to explain more about where I am coming from.

I have several applications that work with a database via database/sql using the github.com/alexbrainman/odbc driver. These are long running applications that are expected to be tolerant of external problems and also not make any harmful updates in the face of unexpected DB state.

In general these applications avoid making harmful updates by following the crash-only software model. Taking this approach has forced the application to have very solid startup and recovery behavior. But as the “Misconceptions …” section of the article explains:

Some programmers overuse the technique by deliberately writing code to crash the program whenever something goes wrong, when the correct solution is to handle all the errors you can think of correctly, and then rely on crash/restart for unforeseen error conditions.

In the early days of the application we didn’t know what kinds of external problems to expect, so all such errors crashed the service for a restart. But since we logged the errors we could learn how to better handle the errors when appropriate.

One day our monitoring dashboard lit up because every single instance of the application crashed at once. That’s when we learned that every so often the DBAs shut down the DB for maintenance. On the one hand it was awesome because the application recovered properly. On the other hand we would prefer that the application recognize that the DB was down and wait for it to come back without crashing.

But how should we distinguish between database errors to know whether to crash or retry? We should only retry if the error is known to represent a DB outage, all others should cause a crash.

This sounds like a great opportunity to follow your advice and inspect returned errors for the Temporary behavior. In fact, we have a package that wraps database/sql and provides retry logic. The package includes this function:

// New returns a new robust database provider. Methods on the returned
// database pass errors to retry. If retry returns nil the operation is tried
// again, otherwise it returns the error returned from retry.
func New(db *sql.DB, retry func(err error) error) *DB {
	return &DB{
		db:    db,
		retry: retry,
	}
}
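
To make the shape concrete, here is a minimal sketch of how the wrapper might use the retry function (this Query method is illustrative, not the package’s actual API):

func (d *DB) Query(query string, args ...interface{}) (*sql.Rows, error) {
	for {
		rows, err := d.db.Query(query, args...)
		if err == nil {
			return rows, nil
		}
		// Ask the caller-supplied policy what to do with this error.
		// The retry function is expected to handle backoff and giving up.
		if rerr := d.retry(err); rerr != nil {
			return nil, rerr // not retryable: surface the error
		}
		// retry returned nil, so try the operation again.
	}
}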

In theory we wouldn’t need to pass a retry function to this constructor, because the wrapper could simply check each error for a Temporary method and call it to decide whether to retry. In practice that doesn’t scale.

ODBC defines specific error codes that ODBC drivers can return. But ODBC is itself an abstraction and different DB engines may use the error codes in subtly different ways. Even different ODBC drivers talking to the same DB may work slightly differently.

Given that situation, an ODBC driver for database/sql such as github.com/alexbrainman/odbc would need to know about the behavior of far too many ODBC driver/DB engine configurations. Although it is possible for the driver to faithfully communicate the error information, it is not practical for it to interpret that information accurately across all database engines and decide whether the error is temporary for our application.

But our application has the context to know the specifics of the DB engine we have chosen and how it should handle those errors. Thus, we can write a function analogous to os.IsNotExist. It looks like this:

// IsTemporary returns true if err represents an error that is most likely
// caused by a temporary database outage.
func IsTemporary(err error) bool {
	if err == driver.ErrBadConn {
		return true
	}

	odbcErr, ok := err.(*odbc.Error)
	if !ok {
		return false
	}

	for _, dr := range odbcErr.Diag {
		// 08001: Client unable to establish connection
		// 08S01: Communication link failure
		if dr.State == "08001" || dr.State == "08S01" {
			return true
		}
		// 42000: Syntax error or access violation
		if dr.State == "42000" {
			switch dr.NativeError {
			// 978: [SQL Server]Availability group read-only node
			// 983: [SQL Server]Availability group not PRIMARY or SECONDARY
			// 6005: [SQL Server]SHUTDOWN is in progress
			case 978, 983, 6005:
				return true
			}
		}
	}

	return false
}
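
Wiring that into the wrapper shown earlier is then a one-liner (the robust package name is made up for illustration; sqlDB is an existing *sql.DB):

db := robust.New(sqlDB, func(err error) error {
	if IsTemporary(err) {
		return nil // temporary DB outage: retry the operation
	}
	return err // anything else: propagate, and let the service crash/restart
})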

Recently our organization has decided to install a clustered version of the database our application works with. The new technology will provide automatic fail-over—with no data loss—to a secondary DB server should the primary fail. Of course we are testing it before we deploy it to production.

When we forced a fail-over while a test application was running transactions, we discovered that this new configuration returns some new errors during the fail-over window. All of these new errors use the same ODBC error state but have different native error codes. Given our software architecture we simply added the new cases to our IsTemporary function (native errors 978 and 983 seen above). I see that as a big win compared to requiring the author of github.com/alexbrainman/odbc to stay abreast of the error-reporting idiosyncrasies of every DB engine people use with his driver, and then hoping that his concept of temporary is compatible with ours.


Thanks for your comprehensive reply. I think I understand your problem well.

In response to your two points: what defines a temporary error? Something the caller might want to retry. A canonical example is the set of errors that return true from the Temporary method of the net.Error interface.

In the example of your odbc driver, would adding a method

// in package odbc, so the receiver is the package's own Error type
func (e *Error) Temporary() bool {
	for _, dr := range e.Diag {
		// 08001: Client unable to establish connection
		// 08S01: Communication link failure
		if dr.State == "08001" || dr.State == "08S01" {
			return true
		}
		// 42000: Syntax error or access violation
		if dr.State == "42000" {
			switch dr.NativeError {
			// 978: [SQL Server]Availability group read-only node
			// 983: [SQL Server]Availability group not PRIMARY or SECONDARY
			// 6005: [SQL Server]SHUTDOWN is in progress
			case 978, 983, 6005:
				return true
			}
		}
	}
	return false
}

work?

It would work in the sense that the application would function as we want, but I don’t think it is optimal from a software engineering perspective. I have these questions:

  1. Is that list of conditions what others want?
  2. Will the maintainer of the driver package want to take on the burden of making that method work for all the database engines out there that have ODBC drivers?
  3. If yes, is it even possible? Note the switch dr.NativeError section is specific to SQL Server. It seems likely that another DB engine might use some of the same native error codes to mean different things. That implies that Temporary must know which DB engine is behind the ODBC interface in order to interpret the native codes.

I think the answers are:

  1. Maybe.
  2. No.
  3. Unlikely or problematic.

It seems like you’re overthinking this. Temporary just means that the code returning the error thinks you might be able to succeed if you retry. It is not any kind of guarantee, nor is it any kind of binding contract. It seems like your problem is “but what if I don’t agree with the value returned by Temporary()?” In that case, don’t call it; just do your own thing, which is what Dave was suggesting.


I think the discussion further reinforces the point of your blog post. Returning errors is hard. There is not one technique that is always right.


@dfc would you recommend always creating custom error types as unexported types with unexported fields? That would force you to add behaviors that allow the concrete value to remain opaque.

Yup, but I wouldn’t recommend creating custom error types, period.

I’ve gone back and forth on how I should do errors. I think a good example is the Exception class in Java and what it carries. I generally want a message, a cause, and a stack trace.

As a real-life example, the ingest server we use is written in Go. There are a handful of places where we retry on errors (of a specific type, because there is no behavior to introspect on :frowning:), a single place where we introspect error behavior so we know which HTTP error codes and messages to show users, and the 99% case of “raise the error back up until it’s logged and a metric is incremented”.

Because after-the-fact diagnosis is the 99% real-life use case for us, we embed the Java exception basics (message/cause/stack trace) in our errors, which is what most of the error-wrapping libraries do.

I think it’s such a good thing to do that I wish the error libraries out there would standardize on the method signatures they use for the cause and stack-trace parts, the way we already have Error() string for the message.
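
For example, github.com/pkg/errors documents these two interfaces for callers to assert against; other wrapping libraries expose the same ideas with different signatures, which is the lack of standardization I’m lamenting:

type causer interface {
	Cause() error
}

type stackTracer interface {
	StackTrace() errors.StackTrace
}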


The problem I have with adding a stack trace to every error is that you’re doing a lot of work for some text that may never see the light of day. Errors happen all the time, and often they are just handled… if you created a stack trace for every one, you’ve slowed down your code for little benefit. The problem, of course, is that you never know at error-creation time whether it’ll be “worth it” to record the stack trace. One solution is to make it a feature you can turn on with an environment variable or something, for when you’re debugging. But that can still screw you if you get an error in production and don’t know where it came from.

In my experience, error messages are usually unique enough that I can grep for them, though there have been a few that were pretty opaque and took some time to track down.

But generally if you’re properly annotating your errors with contextual information, you almost (almost) don’t need stack traces.
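
For instance, an annotation like this (using github.com/pkg/errors for the wrapping; the query and variable names are made up) usually pins down the failure site well enough to grep for:

rows, err := db.Query(balanceQuery, accountID)
if err != nil {
	return errors.Wrapf(err, "querying balances for account %d", accountID)
}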


I agree about not doing it for every error. For us it’s behind a flag, and the stack is only captured at the first wrap in a chain of wrapped errors.
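
A minimal sketch of that idea, assuming a hypothetical in-house wrapping helper (none of these names come from a real library):

import (
	"os"
	"runtime/debug"
)

var captureStacks = os.Getenv("DEBUG_ERRORS") != ""

type wrapped struct {
	msg   string
	cause error
	stack []byte
}

func (w *wrapped) Error() string { return w.msg + ": " + w.cause.Error() }
func (w *wrapped) Cause() error  { return w.cause }

func Wrap(err error, msg string) error {
	if err == nil {
		return nil
	}
	w := &wrapped{msg: msg, cause: err}
	// Record a stack only when the flag is set and this is the first wrap;
	// anything already wrapped carries its stack from the original site.
	if _, already := err.(*wrapped); captureStacks && !already {
		w.stack = debug.Stack()
	}
	return w
}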
