Solving split brain in a distributed application

I’m working on a distributed application that needs to elect a master to schedule a bunch of tasks for the cluster. etcd works just fine for this, but how can I handle a split brain?

Imagine a node is elected as master, then, for whatever reason, it loses the lease and another node takes over. Inside etcd everything is fine: it knows exactly who the new master is. The problem is that the old master receives the notification that it is no longer the master at around the same time the new master takes over, so for a moment both are working, and this can cause bad application behaviour.

This is a snippet of the code that elects the master node:

// Elect campaigns for leadership and returns a context that is cancelled
// when the campaign fails or leadership is lost.
func (e *Election) Elect(ctx context.Context) context.Context {
	session, err := concurrency.NewSession(e.Client.base, concurrency.WithTTL(e.TTL))
	if err != nil {
		// Return an already-cancelled context so the caller bails out.
		nctx, nctxCancel := context.WithCancel(context.Background())
		nctxCancel()
		return nctx
	}

	nctx, nctxCancel := context.WithCancel(ctx)
	election := concurrency.NewElection(session, "/election")
	if err = election.Campaign(nctx, e.NodeID); err != nil {
		nctxCancel()
		return nctx
	}

	go func() {
		// Drain leadership observations; when the channel closes,
		// observation has ended, so cancel the derived context.
		for range election.Observe(nctx) {
		}
		nctxCancel()
	}()

	// Hand back the cancellable context so the caller sees leadership loss.
	return nctx
}

Any suggestions?

https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection


I’m using etcd, and it does the same thing as ZooKeeper. Maybe I wasn’t clear enough: the problem is when an elected master loses its leadership because it didn’t renew the lease for some reason (network issues, overload, …) and another node is elected as master.

During a short, but real, period of time, the old master still thinks it’s the master and can keep operating as such, while the newly elected master takes over and starts operating. During this window, the cluster can behave incorrectly.

I haven’t hit this problem in production yet. Looking at the code, it’s possible for it to happen, but since I don’t see an easy way to solve it, I’m going to take the risk and deal with it only when it happens.

Just found out that ZooKeeper doesn’t guarantee this either:


Yes, this is exactly the problem. I thought of a way to solve it; it isn’t the most elegant, but it should work. The application has a struct that generates and keeps the etcd leases. Each lease has a TTL, and I was using Lease.KeepAlive to keep the lease alive, but using Lease.KeepAliveOnce inside a loop makes much more sense. The first is called once and the lease is refreshed inside the etcd Go client, so I don’t have much control over it; with the second, I control exactly how the lease gets refreshed.

Let’s imagine the lease has a TTL of 60 seconds: I could refresh it every 30 seconds, and if anything goes wrong, the application cancels the context and revokes the lease. That gives me a 30-second margin of error.
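A minimal sketch of that refresh loop, assuming the go.etcd.io/etcd/client/v3 import path; keepLeaseAlive and its parameters are illustrative names, not the application’s actual code:

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// keepLeaseAlive grants a lease with the given TTL and refreshes it at
// half the TTL. On any refresh failure it revokes the lease and cancels
// the context so everything tied to the lease stops.
func keepLeaseAlive(ctx context.Context, cancel context.CancelFunc, cli *clientv3.Client, ttl int64) {
	// Grant a lease with the full TTL (e.g. 60s).
	grant, err := cli.Grant(ctx, ttl)
	if err != nil {
		cancel()
		return
	}

	// Refresh at half the TTL (e.g. every 30s), keeping a margin of error.
	ticker := time.NewTicker(time.Duration(ttl) * time.Second / 2)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			// Use a fresh context: ctx is already cancelled here.
			cli.Revoke(context.Background(), grant.ID)
			return
		case <-ticker.C:
			if _, err := cli.KeepAliveOnce(ctx, grant.ID); err != nil {
				// Refresh failed: revoke the lease and cancel the
				// context so dependent tasks shut down.
				cli.Revoke(context.Background(), grant.ID)
				cancel()
				return
			}
		}
	}
}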

This is how the application is working today:

.
└── registry              lease 1
    ├── election          lease 2
    │   └── scheduler     lease 2
    └── runner            lease 3
        └── runner-unit   lease 3

Basically I’m using 3 distinct leases to control everything. If lease 1 is revoked, the application stops the child tasks by calling a stop function. In the case of lease 2 and lease 3, if the lease is revoked, the child tasks are stopped immediately without the need to call the stop function. This way, I can reduce even more the chance of having more than one master at the same time.
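As a sketch of the “stopped immediately” behaviour: a lease-bound concurrency.Session exposes a Done channel that closes when its lease is gone, so a child task’s context can be tied to it. runUnderLease is an illustrative name, not from the application:

import (
	"context"

	"go.etcd.io/etcd/client/v3/concurrency"
)

// runUnderLease runs fn with a context that is cancelled the moment the
// session's lease is revoked or expires, so the child task stops without
// an explicit stop call.
func runUnderLease(ctx context.Context, session *concurrency.Session, fn func(context.Context)) {
	leaseCtx, cancel := context.WithCancel(ctx)
	defer cancel()

	go func() {
		// Session.Done() is closed when the underlying lease is gone.
		<-session.Done()
		cancel()
	}()

	fn(leaseCtx)
}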

PS: I really wish Go had supervisors like Erlang. My code would be easier/better.

It’s not the same, and it can’t be in Go, but I like thejerf/suture for that kind of thing.
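For the curious, a minimal sketch of a small supervision tree with suture, assuming the v4 API; the Scheduler service here is hypothetical:

import (
	"context"

	"github.com/thejerf/suture/v4"
)

// Scheduler is a hypothetical service. Serve must block until ctx is
// done or an error occurs; suture restarts it according to its policy.
type Scheduler struct{}

func (s *Scheduler) Serve(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	sup := suture.NewSimple("registry")
	sup.Add(&Scheduler{}) // restarted automatically if it crashes

	// Serve blocks until the context is cancelled; all supervised
	// services are stopped along with it.
	_ = sup.Serve(context.Background())
}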


@diego.bernardes, this might be interesting for you:

Consul should be implementing this.
