Lstat/Fstat (Syscall6?) very slow

Hi Everyone,

I’m struggling with a performance issue surrounding Lstat.
I have a large directory with 100k files that I have to list and then stat each file.
The C code runs in around 1.5 seconds, but the equivalent Go code takes 1.5 minutes (60x slower!).
I tried to converge the C and the Go code as much as I could, and I don’t see any difference in what they do, just that, for some reason, Go takes 60x longer to make a syscall, which sounds unreasonable. I also tried using unix.Lstat, but it seemed to make no difference.
What am I doing wrong?

    start = time.Now()
	f, err := os.Open(dirname)
	if err != nil {
		fmt.Println(err)
	}
	defer f.Close()

	files, err := f.Readdirnames(-1) // I also tried ReadDir, which also does a stat... it's just as slow
	if err != nil {
		fmt.Println(err)
	}

	for _, file := range files {
		var fs syscall.Stat_t
		err := syscall.Lstat(dirname+"/"+file, &fs)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(file, fs.Size) // for the sake of the example
	}

	duration = time.Since(start)
	fmt.Println(duration) //--> 1.5 minutes
  DIR *dir;
  struct dirent *ent;
  struct stat sta;
  char buf[5000];
  strcpy (buf, dirname);
  int pos = strlen(buf);
  buf[pos++] = '/';
  auto start = std::chrono::high_resolution_clock::now();
  if ((dir = opendir (dirname)) != NULL) {
    while ((ent = readdir (dir)) != NULL) {
      strcpy(buf+pos, ent->d_name);
      fstatat(-0x64, buf, &sta, 0x100); // to mirror the Go implementation
      printf ("%s %d\n", ent->d_name, sta.st_size);
    }
    closedir (dir);
  }
  auto end = std::chrono::high_resolution_clock::now();
  std::cerr << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms" << std::endl; // <-- 1.5 sec

Above I forgot to mention that the files are on an NFS drive, which may play some sort of role in all this.

To investigate further, I created two cgo versions of this code:
A. where the Readdirnames and loop for the files is done on the Go side, and the actual stat call is on the cgo side
B. where the entire activity is done on the C side (the cgo side is basically identical to my C example), and the Go side is a single function call to the cgo side

This showed some very interesting results: the (B) cgo version was essentially as fast as the C version (1.5 sec); the (A) version however was just as slow as the Go version (1.5 min).

Doing an strace on these versions showed that:

  1. the newfstatat function calls took 9 us in the (B) case, but 100 us in the (A) case
  2. the number of futex calls increased from 10 (not 10k, just 10) in (B) to 350k in (A), taking a minute of the runtime
  3. the detailed strace showed that the newfstatat calls were made on different threads

The documentation mentioned that you can lock a goroutine to a single thread, so I called LockOSThread and confirmed with strace that all the newfstatat calls were indeed on the same thread – regardless the scheduler was still spamming the kernel with hundreds of thousands of futex calls, taking a minute in total.

To further diagnose the situation, I also made a dummy version from (A), where the C code just returns a static value without calling fstat. There were some spurious futex calls, but it ran almost as fast as the pure C version. I replaced the static value with a pid() call, and got a result that was 3x slower than C (inbetween static values and fstat calls).

So there is some behavior in the scheduler that amplifies the time taken by system calls exponentially.

Is there some workaround to the flatten this exponential futex runaway? Or is Go just completely unsuitable for programs that need to make many system calls?

Hi, @esp32wrangler, and welcome to the forum!

I tried reproducing what you describe with this, but I’m getting numbers between 20s-1m25s for both the C and Go functions; sometimes the Go function runs faster. I tested with >500K files in my “/home/sean/test” folder because I don’t have an NFS share set up. If you try your program on a local directory, do you see the same huge gap in performance?

Hi @skillian ,

Thank you for taking the time to make the nice goplay version and for looking at the issue!
With the local filesystem, I don’t see a difference between go and cgo either.

To convince myself that I’m not seeing some side-effect of my particular environment, I set up an NFS server in in a WSL environment on a clean PC (following the Youtube tutorial “Linux Tips - NFS Server on Windows using WSL (2022)”), and created a loopback mount.
Even with this loopback directory I could reproduce the futex runaway. I had to modify your test program to be able to run the cgo and go version separately, because the NFS cache has a disproportionate impact if you run the two versions immediately one after the other.

This is what the strace -c -f looks like for the cgo version:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 65.54    6.974096      581174        12           futex
 32.87    3.497595          34    100006           newfstatat
  1.19    0.127158        1025       124           getdents64
...

And for the go version:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 82.24  106.391117         808    131655     26638 futex
  5.83    7.547247         152     49354           nanosleep
  4.44    5.740233          57    100002           newfstatat
...

Without strace there is a 5-10x runtime difference in this toy environment, smaller than the 60x difference in the real environment, but still noticeable.