Go binary behaves differently on two identical Linux servers


(Владимир Савостин) #1

Hello there!
I have a service written in Go that uses gRPC to link one grpc-client with a pool of grpc-servers.
I have an issue with my grpc-server, which runs on two absolutely identical nodes rented on linode.com.
The node specifications are shown below (server-1 and server-2):

server-1 and server-2 report the same:

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.5 LTS
Release:	16.04
Codename:	xenial

$ arch
x86_64

My grpc-server holds a pool of cryptocurrency exchanges that implement a common interface. The grpc-server receives a request from the grpc-client over gRPC and starts every exchange from the pool in a separate goroutine to gather data from the internet over HTTP. Each goroutine continuously collects data from its specific crypto exchange and sends it to a common channel, DataChan. All this data is read from the channel and sent to the grpc-client in the gRPC stream.

Everything works fine on server-1: all data is sent continuously.
On server-2, however, my program starts consuming more than 90% of the CPU before it even manages to make its first request to the internet from any goroutine.

My binaries were built on a Bamboo server with the same architecture and OS version.

I have no problems with connectivity; I checked using curl.

On server-1 you can see that many connections are established by collector-ser (my grpc-server), so my requests to the exchanges work as expected:

@server-1:~$ netstat -nputw
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 172.104.157.111:44722   104.20.147.108:443      TIME_WAIT   -               
tcp        0      0 172.104.157.111:32844   104.17.177.152:443      ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:52996   104.20.199.74:443       ESTABLISHED 21056/collector-ser
tcp        0      0 127.0.0.1:9001          127.0.0.1:35250         ESTABLISHED -               
tcp        0      0 172.104.157.111:57636   149.154.167.220:443     TIME_WAIT   -               
tcp        0      0 172.104.157.111:33708   104.17.154.108:443      ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:7888    89.33.219.167:37860     ESTABLISHED -               
tcp        0    215 172.104.157.111:42774   47.56.56.151:80         ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:33068   104.17.6.188:443        TIME_WAIT   -               
tcp        0      0 172.104.157.111:34906   104.16.233.188:443      ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:54140   143.204.101.87:443      ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:39856   104.17.179.152:443      TIME_WAIT   -               
tcp        0      0 172.104.157.111:50390   104.20.190.108:443      ESTABLISHED 21056/collector-ser
tcp        0    155 172.104.157.111:57586   103.206.42.112:443      ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:37740   107.154.248.133:443     ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:59366   104.19.245.31:443       TIME_WAIT   -               
tcp        0      0 172.104.157.111:54814   143.204.101.74:443      TIME_WAIT   -               
tcp        0      0 172.104.157.111:46198   104.25.8.112:443        ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:35940   143.204.214.30:443      TIME_WAIT   -               
tcp        0      0 172.104.157.111:53358   104.18.216.39:443       ESTABLISHED 21056/collector-ser
tcp        0      0 127.0.0.1:35250         127.0.0.1:9001          ESTABLISHED 17524/darvin-courie
tcp        0      0 172.104.157.111:49054   13.225.78.21:443        ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:57634   149.154.167.220:443     TIME_WAIT   -               
tcp        0      0 172.104.157.111:33484   2.16.106.88:443         ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:57144   13.224.196.108:443      ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:48702   47.244.38.215:443       TIME_WAIT   -               
tcp        0      0 172.104.157.111:49260   47.244.38.215:443       ESTABLISHED 21056/collector-ser
tcp        0     44 172.104.157.111:38022   13.230.49.199:443       ESTABLISHED 21056/collector-ser
tcp        0     44 172.104.157.111:52372   15.164.106.249:443      ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:43276   217.182.199.239:443     ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:55928   13.32.158.212:443       TIME_WAIT   -               
tcp        0      0 172.104.157.111:50886   104.19.246.31:443       ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:49956   104.20.22.137:443       ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:54632   104.17.185.234:443      ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:44468   104.25.7.112:443        TIME_WAIT   -               
tcp        0      0 172.104.157.111:59174   104.17.7.188:443        ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:50304   52.17.153.78:443        ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:50274   52.17.153.78:443        TIME_WAIT   -               
tcp        0      0 172.104.157.111:47538   104.17.184.234:443      TIME_WAIT   -               
tcp        0      0 172.104.157.111:37416   104.16.127.19:443       TIME_WAIT   -               
tcp        0      0 172.104.157.111:37970   104.16.127.19:443       ESTABLISHED 21056/collector-ser
tcp        0    126 172.104.157.111:33780   13.112.137.99:80        ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:51914   104.19.212.87:443       ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:34344   104.16.233.188:443      TIME_WAIT   -               
tcp        0      0 172.104.157.111:51358   104.19.212.87:443       TIME_WAIT   -               
tcp        0      0 172.104.157.111:49832   104.20.190.108:443      TIME_WAIT   -               
tcp        0      0 172.104.157.111:36180   104.16.144.70:443       TIME_WAIT   -               
tcp        0      0 172.104.157.111:45286   104.20.147.108:443      ESTABLISHED 21056/collector-ser
tcp        0    368 172.104.157.111:57412   47.52.123.127:443       ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:51166   104.17.156.108:443      TIME_WAIT   -               
tcp        0      0 172.104.157.111:52434   104.20.199.74:443       TIME_WAIT   -               
tcp        0      0 172.104.157.111:36730   104.16.144.70:443       ESTABLISHED 21056/collector-ser
tcp        0      0 172.104.157.111:57452   104.20.21.137:443       TIME_WAIT   -    

But on server-2 only one connection is established by collector-ser (the connection with the grpc-client):

server-2:~$ netstat -nputw
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 172.104.246.194:7888    89.33.219.167:45392     ESTABLISHED -               
tcp6       0      0 192.168.129.202:9888    192.168.128.90:44900    ESTABLISHED -               
tcp6       0      0 192.168.129.202:9888    192.168.128.90:46058    ESTABLISHED -               
tcp6       0      0 192.168.129.202:9888    192.168.128.90:44912    ESTABLISHED -               
tcp6       0      0 192.168.129.202:9888    192.168.128.90:48092    ESTABLISHED -               
tcp6       0      0 192.168.129.202:9888    192.168.128.90:46878    ESTABLISHED -               
tcp6       0      0 192.168.129.202:9888    192.168.128.90:48084    ESTABLISHED -               
tcp6       0      0 192.168.129.202:9888    192.168.128.90:44902    ESTABLISHED -               
tcp6       0      0 192.168.129.202:9888    192.168.128.90:50540    ESTABLISHED -               
tcp6      30      0 192.168.129.202:10555   192.168.139.182:58262   ESTABLISHED 28591/collector-ser
tcp6       0      0 192.168.129.202:9888    192.168.128.90:44916    ESTABLISHED -         

And here is the top output from both servers:

server-1:~$ top -bn 1 | head -15
top - 11:50:23 up 23 min,  1 user,  load average: 0.33, 0.47, 0.33
Tasks: 125 total,   2 running,  59 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.5 us,  1.2 sy,  0.0 ni, 81.3 id,  0.1 wa,  0.7 hi,  0.4 si,  7.9 st
KiB Mem :  4022248 total,  3398316 free,   130056 used,   493876 buff/cache
KiB Swap:   262140 total,   262140 free,        0 used.  3640232 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1159 plutos    20   0  395252  30816  14236 S  12.5  0.8   4:18.49 collector-serve
 1137 root      20   0   60152  20048   7340 R   6.2  0.5   0:33.70 supervisord
    1 root      20   0   37844   5744   3908 S   0.0  0.1   0:01.65 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_par_gp
    5 root      20   0       0      0      0 I   0.0  0.0   0:00.04 kworker/0:0-eve
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H-kb
server-2:~$ top -bn 1 | head -15
top - 11:50:35 up 18 min,  1 user,  load average: 1.00, 0.95, 0.61
Tasks: 126 total,   2 running,  59 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.9 us,  0.4 sy,  0.0 ni, 54.8 id,  0.1 wa,  1.0 hi,  0.0 si, 17.7 st
KiB Mem :  4022248 total,  3765628 free,   119012 used,   137608 buff/cache
KiB Swap:   262140 total,   262140 free,        0 used.  3696904 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1247 plutos    20   0  378092  16524  10972 R  93.8  0.4   7:16.65 collector-serve
 1367 plutos    20   0   41664   3584   3068 R   6.2  0.1   0:00.01 top
    1 root      20   0   37804   5788   3956 S   0.0  0.1   0:01.76 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.01 kthreadd
    3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_par_gp
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H-kb
    8 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq

Below are some lines of my code (I have not provided a full example, to keep it simple).

My entrypoint, main.go:

package main

import (
	server2 "path-to-grpc-server-implementation/server"
	pb "path-to-proto-files/proto"
	"fmt"
	"google.golang.org/grpc"
	"net"	
)

func main() {
	lis, err := net.Listen("tcp", fmt.Sprintf("0.0.0.0:%d", myPort))
	if err != nil {
		logs.Error("failed to listen:", err)
		return // lis is nil here, so don't call Serve
	}

	grpcServer := grpc.NewServer()
	pb.RegisterGrpcCollectorServer(grpcServer, server2.NewCollectorServer())
	err = grpcServer.Serve(lis)
	if err != nil {
		logs.Error("serve grpc: %s", err)
	}

}

My server implementation:

package server

// my grpc-server struct
type GrpcCollectorServer struct {
	// pool of Exchanger implementations
	Exchanges map[string]darvin_exchanger.Exchanger
	// common chan for all goroutines
	DataChan chan exchange.Data
}


func NewCollectorServer() *GrpcCollectorServer {
	return &GrpcCollectorServer{
		// my fields
	}
}

// new request from grpc-client to collect data
func (s *GrpcCollectorServer) DoStuff(e *pb.Empty, stream pb.GrpcCollector_GetOrderBooksServer) error {
	// range over exchangers and start collecting data from each in a separate goroutine
	wg := &sync.WaitGroup{}
	for _, ex := range s.Exchanges {
		wg.Add(1)
		go ex.CollectData(s.DataChan, wg)
	}

	// read collected data from the common chan
	for {
		v := <-s.DataChan // new data from the goroutines
		msg := new(pb.SomeMsg)
		_ = v // form the message properly (omitted to simplify)

		// send msg to the grpc stream
		if err := stream.Send(msg); err != nil {
			logs.Error(err.Error())
		}
	}

	// note: never reached while the loop above has no exit condition
	wg.Wait()
	return nil
}

And here is my common Exchanger implementation:

// Exchanger - common interface for all crypto exchanges
type Exchanger interface {
	CollectData(data chan<- exchange.Data, wg *sync.WaitGroup)
}

// exchangeN1 - one of my exchanges
type exchangeN1 struct {
	// some fields
}

// CollectData - collects data from the crypto exchange N1
func (ex *exchangeN1) CollectData(dataChan chan<- exchange.Data, wg *sync.WaitGroup) {
	defer wg.Done()

	for { // main loop
		// doing some http request (to simplify I use http.Get)

		//!!!!!! it hangs here judging by the logs
		//(no requests are sent from any routine (any exchanger))!!!!!!
		resp, err := http.Get("https://api.exchangeN1/example/endpoint")
		if err != nil {
			logs.Error(err.Error())
			continue // resp is nil here, so skip this iteration
		}
		data := formDataFromResp(resp) // unmarshaling
		// send a new portion of data to the common chan
		dataChan <- data
	}
}

I expect my grpc-server to work the same on both server-1 and server-2. But on server-2 everything works up to the moment the goroutines try to make their HTTP requests, and then it hangs with high CPU usage.

I even tried making a few test HTTP requests with one of my exchangers before starting the goroutines, and that works fine on both server-1 and server-2.

I have no idea how to solve this problem. Please help!