What k8s factors can affect garbage collector behavior inside a pod?

Hello. We have a project where Go was used for two purposes:

  1. Build the executor binary for our project
  2. Build Helm libraries from the helm.sh sources

In short, the project is about deploying Helm charts, and it runs as a Kubernetes Job. The issue is that the same Job with the same image, same parameters, same requests, and same limits (basically the same job.yaml) works fine on two of our three k8s clusters, but on the third one it runs into OOM, and we want to understand what can cause that. Regarding the Job:

  1. Memory requests and limits are the same values
  2. CPU requests and limits are slightly different

During actual execution we noticed:

  1. CPU is throttled at the limit, but I believe that's OK
  2. Even when the CPU limit is 1m (literally the minimum possible), memory consumption is still normal on those two clusters

But we noticed that if we disable the garbage collector using `debug.SetGCPercent(-1)`, memory consumption increases drastically. So we suspect that on the third environment the garbage collector is not working as expected.
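To compare the environments, a minimal sketch like the one below (not our actual code; the output format is just an illustration) could be added to the job to log the GC-related knobs the Go runtime picks up at startup, since GOGC, GOMEMLIMIT, and GOMAXPROCS are values that a base image, operator, or admission webhook could set differently per cluster (debug.SetMemoryLimit requires Go 1.19+):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/debug"
)

func main() {
	// Environment variables the Go runtime reads at startup; these can be
	// injected differently per cluster without any change to the image.
	fmt.Println("GOGC env       =", os.Getenv("GOGC"))
	fmt.Println("GOMEMLIMIT env =", os.Getenv("GOMEMLIMIT"))

	// Effective values inside the runtime.
	fmt.Println("GOMAXPROCS     =", runtime.GOMAXPROCS(0))        // 0 queries without changing
	fmt.Println("MemoryLimit    =", debug.SetMemoryLimit(-1))     // negative input queries without changing

	prev := debug.SetGCPercent(100) // set to read the previous value...
	debug.SetGCPercent(prev)        // ...then restore it
	fmt.Println("GCPercent      =", prev)

	// Current heap statistics, useful to compare across the clusters.
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("HeapAlloc=%d HeapSys=%d NextGC=%d NumGC=%d\n",
		ms.HeapAlloc, ms.HeapSys, ms.NextGC, ms.NumGC)
}
```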

The question is: what potential k8s environment configurations can affect the garbage collector so that it behaves differently on that cluster?

Unfortunately, we can't share any source code due to company policy, and debugging on the problem environment is not possible.

Any tips and suggestions would be helpful.

It sounds like you're just going to have to experiment by trial and error. Since you can't debug in the environment in question, maybe you can create a local Docker image with limited memory and try to reproduce the error there? Also, did you find this possibly-related article?

Hi, @Dean_Davidson

Thanks for the link. Definitely something worth checking. Yes, we are experimenting with different configurations. We haven't changed the memory requests/limits, since for us it's important to observe the memory growth in Grafana. We're still experimenting and discussing with the company the possibility of running the job with increased logging and memory profile capture, as in the end that's the only way to confirm the root cause. But we need to ensure the maximum data is collected so we don't have to ask for another round of debugging on that poor env :smile:
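For the memory profile capture, something like this minimal sketch is what we have in mind (not our real code, and the output path is a hypothetical example): the job dumps a heap profile before exiting so it can be pulled off the pod and analyzed offline with `go tool pprof`.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps a heap profile to path so it can be collected after
// the Job finishes (e.g. from a mounted volume) and inspected offline.
func writeHeapProfile(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Printf("could not create heap profile: %v", err)
		return
	}
	defer f.Close()

	runtime.GC() // force a collection so the profile reflects live objects
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Printf("could not write heap profile: %v", err)
	}
}

func main() {
	// ... run the actual deployment logic here ...

	// Hypothetical path; in a Job you'd typically write to a mounted volume
	// so the file survives pod completion.
	writeHeapProfile("/tmp/heap.pprof")
}
```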

Hi. Looks like it was an environment issue. It turned out that something was wrong with the Kubernetes node where the job was scheduled. On a healthy environment the issue is not reproduced.