I have a Go service running in Kubernetes (AWS EKS). Its RES memory grows until it reaches resources.limits.memory and the pod gets OOMKilled by Kubernetes. As a point of reference, the same code currently runs in production on an EC2 t3.small instance, where RES stays under 50MB under a heavy load of 20k rps (async message processing). In my k8s dev environment I only see 5 rps, yet RES grows to whatever limit I set; I tried giving the pod up to 2GB, but the pattern is the same and the process gets OOMKilled. It looks like memory is not being fully released to the OS, even though the GC reports a small footprint (12->12->9 MB, 19 MB goal).
K8s config and env variables
For easier testing, I configured the pods with 365Mi, since the pattern was the same with 2GB. This is the k8s manifest resources configuration:
resources:
  limits:
    memory: 365Mi
  requests:
    cpu: 1000m
    memory: 365Mi
With the following environment variables:
GOMEMLIMIT: '180MiB'
GOMAXPROCS: '2'
GODEBUG: 'gctrace=1'
GOGC: '50'
Initially I didn’t set GOMEMLIMIT, but I read that it could help, so I set it to half of the pod’s memory limit; it made no difference. I also started with the default GOGC and tried 50 instead, again with no effect.
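To rule out the runtime silently ignoring these variables, I print the effective settings at startup. A minimal standalone sketch (not the actual service code):

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// A negative argument to SetMemoryLimit only reads the current limit,
	// which confirms whether the runtime actually parsed GOMEMLIMIT.
	fmt.Println("effective GOMEMLIMIT (bytes):", debug.SetMemoryLimit(-1))

	// SetGCPercent returns the previous value; restore it immediately.
	gogc := debug.SetGCPercent(100)
	debug.SetGCPercent(gogc)
	fmt.Println("effective GOGC:", gogc)

	// GOMAXPROCS(0) queries the setting without changing it.
	fmt.Println("effective GOMAXPROCS:", runtime.GOMAXPROCS(0))
}

Both GOMEMLIMIT and GOMAXPROCS come back as expected, so the variables are being picked up.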
Garbage collection
GC logs before it gets killed show the following:
21:03:41.854 gc 1643 @80939.735s 0%: 0.060+2.4+0.002 ms clock, 0.12+1.1/1.4/0.57+0.005 ms cpu, 12->12->8 MB, 12 MB goal, 0 MB stacks, 0 MB globals, 2 P
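For anyone decoding the gctrace format, my annotations of the relevant fields:

gc 1643 @80939.735s 0%         // cycle number, seconds since process start, % of CPU spent in GC since start
0.060+2.4+0.002 ms clock       // STW sweep termination + concurrent mark/scan + STW mark termination
12->12->8 MB, 12 MB goal, 2 P  // heap at GC start -> heap at GC end -> live heap; next heap goal; 2 procs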
Memstats
runtime.MemStats shows the following (I added the MB conversions manually for readability):
21:03:43.195
{
"Alloc":8922888 (8.9MB),
"TotalAlloc":5646312096 (5.6GB),
"Sys":28415240 (28.4MB),
"HeapSys":18284544 (18.2MB),
"HeapIdle":6414336 (6.4MB),
"HeapReleased":3121152 (3.1MB),
"HeapInuse":11870208 (11.8MB),
"HeapObjects":24393,
"MallocsObjects":43016155,
"FreesObjects":42991762,
"LiveObjects":24393,
"PauseTotalNs":153890330,
"NumGC":1643,
"NumGoroutine":265
}
Alloc is 8.9MB, which matches the final 8 MB in the gc log (12->12->8 MB).
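These numbers come from runtime.ReadMemStats plus runtime.NumGoroutine(); LiveObjects is derived as Mallocs - Frees (43016155 - 42991762 = 24393, matching HeapObjects). A minimal sketch of an equivalent collector:

var m runtime.MemStats
runtime.ReadMemStats(&m) // stop-the-world snapshot of allocator statistics
log.Printf(`{"Alloc":%d,"TotalAlloc":%d,"Sys":%d,"HeapSys":%d,"HeapIdle":%d,"HeapReleased":%d,"HeapInuse":%d,"HeapObjects":%d,"MallocsObjects":%d,"FreesObjects":%d,"LiveObjects":%d,"PauseTotalNs":%d,"NumGC":%d,"NumGoroutine":%d}`,
	m.Alloc, m.TotalAlloc, m.Sys, m.HeapSys, m.HeapIdle, m.HeapReleased, m.HeapInuse,
	m.HeapObjects, m.Mallocs, m.Frees, m.Mallocs-m.Frees, // LiveObjects = Mallocs - Frees
	m.PauseTotalNs, m.NumGC, runtime.NumGoroutine())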
Here’s another log sample from before and after the OOMKill:
15:39:21.969 my-service gc 1709 @168600.017s 0%: 0.033+3.5+0.002 ms clock, 0.033+0/0.059/3.4+0.002 ms cpu, 12->12->9 MB, 19 MB goal, 0 MB stacks, 0 MB globals, 1 P (forced)
15:39:23.947 my-service {"Alloc":10126368,"TotalAlloc":5661878296,"Sys":36803848,"HeapSys":26771456,"HeapIdle":13369344,"HeapReleased":13336576,"HeapObjects":42613,"MallocsObjects":35141353,"FreesObjects":35098740,"LiveObjects":42613,"PauseTotalNs":70123823,"NumGC":1709,"NumGoroutine":264}
15:40:23.948 my-service {"Alloc":14120360,"TotalAlloc":5665872288,"Sys":36803848,"HeapSys":26738688,"HeapIdle":10780672,"HeapReleased":10780672,"HeapObjects":73826,"MallocsObjects":35172566,"FreesObjects":35098740,"LiveObjects":73826,"PauseTotalNs":70123823,"NumGC":1709,"NumGoroutine":264}
15:41:16.861 my-service Killed
15:41:18.201 my-service gc 1 @0.015s 6%: 0.007+4.9+0.002 ms clock, 0.007+0.027/1.3/0+0.002 ms cpu, 3->4->2 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 1 P
top
kubectl top pod my-service-pod-56f7fcffbb-d8tdh shows:
NAME CPU(cores) MEMORY(bytes)
my-service-pod-56f7fcffbb-d8tdh 8m 344Mi
top shows:
top - 05:04:05 up 14 days, 8:05, 0 user, load average: 1.78, 1.95, 1.89
Tasks: 4 total, 1 running, 3 sleeping, 0 stopped, 0 zombie
%Cpu(s): 13.2 us, 5.2 sy, 0.0 ni, 79.2 id, 0.8 wa, 0.0 hi, 1.6 si, 0.0 st
MiB Mem : 15801.5 total, 4087.7 free, 9661.0 used, 2460.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 6140.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11 guest 20 0 2315940 348884 10532 S 0.7 2.2 7:02.35 my-service
pprof
pprof doesn’t show anything out of the ordinary either; the total in-use memory is 7096.39kB.
go tool pprof ~/Downloads/my-service-pod-56f7fcffbb-d8tdh.tar.gz
File: my-service
Type: inuse_space
Time: Feb 6, 2024 at 8:40pm (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 7096.39kB, 100% of 7096.39kB total
Showing top 10 nodes out of 35
flat  flat%   sum%        cum   cum%
4104kB 57.83% 57.83% 4104kB 57.83% github.com/DataDog/datadog-go/v5/statsd.newStatsdBuffer (inline)
902.59kB 12.72% 70.55% 902.59kB 12.72% compress/flate.NewWriter (inline)
553.04kB 7.79% 78.34% 553.04kB 7.79% gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.newConcentrator
512.34kB 7.22% 85.56% 512.34kB 7.22% crypto/x509.map.init.0
512.31kB 7.22% 92.78% 512.31kB 7.22% vendor/golang.org/x/net/http/httpguts.map.init.0
512.10kB 7.22% 100% 512.10kB 7.22% github.com/aws/aws-sdk-go/aws/endpoints.init
0 0% 100% 902.59kB 12.72% bufio.(*Writer).Flush
0 0% 100% 902.59kB 12.72% compress/gzip.(*Writer).Write
0 0% 100% 512.34kB 7.22% crypto/x509.init
0 0% 100% 4104kB 57.83% github.com/DataDog/datadog-go/v5/statsd.(*bufferPool).addNewBuffer
(pprof)
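(For reproducibility: heap profiles like this one are typically exposed via net/http/pprof. The sketch below shows the standard wiring and is an assumption about the setup, not necessarily how this service exposes it; the side port 6060 is arbitrary.)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // blank import registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a side port, away from the main listener.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the real service
}

The heap profile can then be pulled with: go tool pprof http://localhost:6060/debug/pprof/heap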
Question
Am I correct in concluding that there’s no memory leak, given the following?
- runtime.MemStats.Alloc is low.
- NumGoroutine is constant (no leaked goroutines).
- runtime.MemStats.TotalAlloc is big, but that’s the accumulated total since the service started, and most of that memory has since been freed.
- The GC has very little memory to release on each run (12->12->8 MB, 12 MB goal). I haven’t seen that goal above 18 MB.
It looks like the GC is releasing memory, but not all of it is going back to the OS. runtime.MemStats.TotalAlloc does show 5.6GB, which tells me memory is being allocated and freed continuously; yet unlike on the EC2 instance, where the process stays under a 50MB RES footprint, in Kubernetes the service’s RES keeps growing until it hits resources.limits.memory.
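As a next experiment (a sketch; the 30s interval is arbitrary), I’m considering forcing the runtime to hand freed pages back to the OS and watching whether the container RSS actually drops:

import (
	"runtime/debug"
	"time"
)

// freeOSMemoryLoop forces a GC and returns as much memory to the OS as
// possible on every tick. If RSS still climbs to the limit afterwards,
// the growth must be coming from outside the Go heap.
func freeOSMemoryLoop() {
	for range time.Tick(30 * time.Second) {
		debug.FreeOSMemory()
	}
}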
I wasn’t expecting any issues, since the service runs fine on an EC2 instance, and I’ve spent weeks going through posts trying to find an answer.