Hello there!
I have a service written in Go that uses gRPC to link one gRPC client with a pool of gRPC servers.
I have an issue with my gRPC server, which runs on absolutely identical nodes rented on linode.com. The node specifications are shown below (server-1 and server-2, for example); both report the same:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.5 LTS
Release: 16.04
Codename: xenial
$ arch
x86_64
My gRPC server has a pool of cryptocurrency exchanges that implement a common interface. The server receives a request from the gRPC client over gRPC and, in separate goroutines, starts all the exchanges from the pool to gather data from the internet over HTTP. Each goroutine continuously collects data from a specific crypto exchange and sends it to a common channel, DataChan. All this data is read from the channel and sent to the gRPC client in the gRPC stream.
Everything works fine on server-1: all data is sent continuously. But on server-2 my program starts to consume more than 90% of the CPU before it even gets to make a first internet request from any goroutine.
My binaries were built on a Bamboo server with the same arch and OS version. I have no connectivity problems; I checked with curl. On server-1 you can see many connections established by collector-ser (my gRPC server), so my requests to the exchanges work as expected:
@server-1:~$ netstat -nputw
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 172.104.157.111:44722 104.20.147.108:443 TIME_WAIT -
tcp 0 0 172.104.157.111:32844 104.17.177.152:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:52996 104.20.199.74:443 ESTABLISHED 21056/collector-ser
tcp 0 0 127.0.0.1:9001 127.0.0.1:35250 ESTABLISHED -
tcp 0 0 172.104.157.111:57636 149.154.167.220:443 TIME_WAIT -
tcp 0 0 172.104.157.111:33708 104.17.154.108:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:7888 89.33.219.167:37860 ESTABLISHED -
tcp 0 215 172.104.157.111:42774 47.56.56.151:80 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:33068 104.17.6.188:443 TIME_WAIT -
tcp 0 0 172.104.157.111:34906 104.16.233.188:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:54140 143.204.101.87:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:39856 104.17.179.152:443 TIME_WAIT -
tcp 0 0 172.104.157.111:50390 104.20.190.108:443 ESTABLISHED 21056/collector-ser
tcp 0 155 172.104.157.111:57586 103.206.42.112:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:37740 107.154.248.133:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:59366 104.19.245.31:443 TIME_WAIT -
tcp 0 0 172.104.157.111:54814 143.204.101.74:443 TIME_WAIT -
tcp 0 0 172.104.157.111:46198 104.25.8.112:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:35940 143.204.214.30:443 TIME_WAIT -
tcp 0 0 172.104.157.111:53358 104.18.216.39:443 ESTABLISHED 21056/collector-ser
tcp 0 0 127.0.0.1:35250 127.0.0.1:9001 ESTABLISHED 17524/darvin-courie
tcp 0 0 172.104.157.111:49054 13.225.78.21:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:57634 149.154.167.220:443 TIME_WAIT -
tcp 0 0 172.104.157.111:33484 2.16.106.88:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:57144 13.224.196.108:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:48702 47.244.38.215:443 TIME_WAIT -
tcp 0 0 172.104.157.111:49260 47.244.38.215:443 ESTABLISHED 21056/collector-ser
tcp 0 44 172.104.157.111:38022 13.230.49.199:443 ESTABLISHED 21056/collector-ser
tcp 0 44 172.104.157.111:52372 15.164.106.249:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:43276 217.182.199.239:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:55928 13.32.158.212:443 TIME_WAIT -
tcp 0 0 172.104.157.111:50886 104.19.246.31:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:49956 104.20.22.137:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:54632 104.17.185.234:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:44468 104.25.7.112:443 TIME_WAIT -
tcp 0 0 172.104.157.111:59174 104.17.7.188:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:50304 52.17.153.78:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:50274 52.17.153.78:443 TIME_WAIT -
tcp 0 0 172.104.157.111:47538 104.17.184.234:443 TIME_WAIT -
tcp 0 0 172.104.157.111:37416 104.16.127.19:443 TIME_WAIT -
tcp 0 0 172.104.157.111:37970 104.16.127.19:443 ESTABLISHED 21056/collector-ser
tcp 0 126 172.104.157.111:33780 13.112.137.99:80 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:51914 104.19.212.87:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:34344 104.16.233.188:443 TIME_WAIT -
tcp 0 0 172.104.157.111:51358 104.19.212.87:443 TIME_WAIT -
tcp 0 0 172.104.157.111:49832 104.20.190.108:443 TIME_WAIT -
tcp 0 0 172.104.157.111:36180 104.16.144.70:443 TIME_WAIT -
tcp 0 0 172.104.157.111:45286 104.20.147.108:443 ESTABLISHED 21056/collector-ser
tcp 0 368 172.104.157.111:57412 47.52.123.127:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:51166 104.17.156.108:443 TIME_WAIT -
tcp 0 0 172.104.157.111:52434 104.20.199.74:443 TIME_WAIT -
tcp 0 0 172.104.157.111:36730 104.16.144.70:443 ESTABLISHED 21056/collector-ser
tcp 0 0 172.104.157.111:57452 104.20.21.137:443 TIME_WAIT -
But on server-2 you can see only one established connection, the one to the gRPC client:
server-2:~$ netstat -nputw
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 172.104.246.194:7888 89.33.219.167:45392 ESTABLISHED -
tcp6 0 0 192.168.129.202:9888 192.168.128.90:44900 ESTABLISHED -
tcp6 0 0 192.168.129.202:9888 192.168.128.90:46058 ESTABLISHED -
tcp6 0 0 192.168.129.202:9888 192.168.128.90:44912 ESTABLISHED -
tcp6 0 0 192.168.129.202:9888 192.168.128.90:48092 ESTABLISHED -
tcp6 0 0 192.168.129.202:9888 192.168.128.90:46878 ESTABLISHED -
tcp6 0 0 192.168.129.202:9888 192.168.128.90:48084 ESTABLISHED -
tcp6 0 0 192.168.129.202:9888 192.168.128.90:44902 ESTABLISHED -
tcp6 0 0 192.168.129.202:9888 192.168.128.90:50540 ESTABLISHED -
tcp6 30 0 192.168.129.202:10555 192.168.139.182:58262 ESTABLISHED 28591/collector-ser
tcp6 0 0 192.168.129.202:9888 192.168.128.90:44916 ESTABLISHED -
And here is the top output from both servers:
server-1:~$ top -bn 1 | head -15
top - 11:50:23 up 23 min, 1 user, load average: 0.33, 0.47, 0.33
Tasks: 125 total, 2 running, 59 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.5 us, 1.2 sy, 0.0 ni, 81.3 id, 0.1 wa, 0.7 hi, 0.4 si, 7.9 st
KiB Mem : 4022248 total, 3398316 free, 130056 used, 493876 buff/cache
KiB Swap: 262140 total, 262140 free, 0 used. 3640232 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1159 plutos 20 0 395252 30816 14236 S 12.5 0.8 4:18.49 collector-serve
1137 root 20 0 60152 20048 7340 R 6.2 0.5 0:33.70 supervisord
1 root 20 0 37844 5744 3908 S 0.0 0.1 0:01.65 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
5 root 20 0 0 0 0 I 0.0 0.0 0:00.04 kworker/0:0-eve
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H-kb
server-2:~$ top -bn 1 | head -15
top - 11:50:35 up 18 min, 1 user, load average: 1.00, 0.95, 0.61
Tasks: 126 total, 2 running, 59 sleeping, 0 stopped, 0 zombie
%Cpu(s): 25.9 us, 0.4 sy, 0.0 ni, 54.8 id, 0.1 wa, 1.0 hi, 0.0 si, 17.7 st
KiB Mem : 4022248 total, 3765628 free, 119012 used, 137608 buff/cache
KiB Swap: 262140 total, 262140 free, 0 used. 3696904 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1247 plutos 20 0 378092 16524 10972 R 93.8 0.4 7:16.65 collector-serve
1367 plutos 20 0 41664 3584 3068 R 6.2 0.1 0:00.01 top
1 root 20 0 37804 5788 3956 S 0.0 0.1 0:01.76 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H-kb
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
Below are some lines of my code (I have not provided the full example, to keep things simple).
My entrypoint, main.go:
package main

import (
	"fmt"
	"net"

	"google.golang.org/grpc"

	server2 "path-to-grpc-server-implementation/server"
	pb "path-to-proto-files/proto"
)

func main() {
	lis, err := net.Listen("tcp", fmt.Sprintf("0.0.0.0:%d", myPort))
	if err != nil {
		logs.Error("failed to listen: %s", err)
		return // don't try to serve on a nil listener
	}
	grpcServer := grpc.NewServer()
	pb.RegisterGrpcCollectorServer(grpcServer, server2.NewCollectorServer())
	if err = grpcServer.Serve(lis); err != nil {
		logs.Error("serve grpc: %s", err)
	}
}
My server implementation:
package server

// GrpcCollectorServer - my gRPC server struct.
type GrpcCollectorServer struct {
	// pool of Exchanger implementations
	Exchanges map[string]darvin_exchanger.Exchanger
	// common channel for all goroutines
	DataChan chan exchange.Data
}

func NewCollectorServer() *GrpcCollectorServer {
	return &GrpcCollectorServer{
		// my fields
	}
}

// DoStuff - a new request from the gRPC client to collect data.
func (s *GrpcCollectorServer) DoStuff(e *pb.Empty, stream pb.GrpcCollector_GetOrderBooksServer) error {
	// range over the exchangers and start collecting from each in a separate goroutine
	wg := &sync.WaitGroup{}
	for _, ex := range s.Exchanges {
		wg.Add(1)
		go ex.CollectData(s.DataChan, wg)
	}
	// read the collected data from the common channel
	for v := range s.DataChan { // new data from the goroutines
		msg := new(pb.SomeMsg)
		// form msg properly from v...
		// send msg to the gRPC stream
		if err := stream.Send(msg); err != nil {
			logs.Error(err.Error())
		}
	}
	wg.Wait()
	return nil
}
And here is my common Exchanger implementation:
// Exchanger - the common interface for all crypto exchanges.
type Exchanger interface {
	CollectData(data chan<- exchange.Data, wg *sync.WaitGroup)
}
// exchangeN1 - one of my exchanges.
type exchangeN1 struct {
	// some fields
}

// CollectData - collects data from the crypto exchange N1.
// Note the pointer: taking sync.WaitGroup by value copies it and breaks Wait.
func (ex *exchangeN1) CollectData(dataChan chan<- exchange.Data, wg *sync.WaitGroup) {
	defer wg.Done()
	for { // main loop
		// doing some HTTP request (to simplify, I use http.Get here)
		// !!! it hangs here judging by the logs
		// (no requests are sent from any goroutine / any exchanger) !!!
		resp, err := http.Get("https://api.exchangeN1/example/endpoint")
		if err != nil {
			logs.Error(err.Error())
			continue // resp is nil on error; don't unmarshal it
		}
		data := formDataFromResp(resp) // unmarshaling
		// send the new portion of data to the common channel
		dataChan <- data
	}
}
I expect my gRPC server to work the same on both server-1 and server-2. But on server-2 everything works only until the goroutines try to make their HTTP requests, and then it hangs with high CPU usage.
I even tried making a few test HTTP requests with one of my exchangers before starting the goroutines, and that works fine on both server-1 and server-2.
I have no idea how to solve this problem. Please help!