Hello. I have run into a problem that beats me. Imagine I have about 350K XML files (for testing, I am using the same file copied that many times; the file names are sample_1.xml, sample_2.xml, …). I want to parse them, that's all. Here's a mock-up of the parsing function:
func singleRun(xmlFile fs.DirEntry) {
    xmlFileOpened, err := os.Open(filepath.Join("data", xmlFile.Name()))
    if err != nil {
        log.Println(err)
        return
    }
    defer xmlFileOpened.Close()
    byteValue, err := ioutil.ReadAll(xmlFileOpened)
    if err != nil {
        log.Println(err)
        return
    }
    var users Users
    if err := xml.Unmarshal(byteValue, &users); err != nil {
        log.Println(err)
        return
    }
    for i := 0; i < len(users.Users); i++ {
        // We would do something with the data here.
        // Here, I do nothing.
    }
}
So, this function only opens the file, reads and unmarshals it, loops over the users (which are XML fields), and then closes the file.
Now, this is the first version of the function:
func ParseWithGoroutines() {
    files, err := os.ReadDir("data")
    if err != nil {
        log.Fatal(err)
    }
    var xmlFiles []fs.DirEntry
    for _, file := range files {
        if strings.HasSuffix(file.Name(), ".xml") {
            xmlFiles = append(xmlFiles, file)
        }
    }
    if len(files) == len(xmlFiles) {
        log.Println("Found", len(files), "XML files in the data folder.")
    } else {
        log.Println("Among", len(files), "files in the data folder,", len(xmlFiles), "XML files were found.")
    }
    var wg sync.WaitGroup
    wg.Add(len(xmlFiles))
    nFiles := 0
    for _, xmlFile := range xmlFiles {
        go func(file fs.DirEntry) {
            defer wg.Done()
            singleRun(file)
            nFiles += 1
        }(xmlFile)
    }
    wg.Wait()
    fmt.Println("Parsed", nFiles, "XML files.")
}
The heart of the function starts with var wg sync.WaitGroup. This function works totally fine (checked). I wanted to compare its timing with a slightly different version, in which the single wg.Add(len(xmlFiles)) call is replaced by wg.Add(1) inside the loop. Here's this version:
func ParseWithGoroutinesV2() {
    files, err := os.ReadDir("data")
    if err != nil {
        log.Fatal(err)
    }
    var xmlFiles []fs.DirEntry
    for _, file := range files {
        if strings.HasSuffix(file.Name(), ".xml") {
            xmlFiles = append(xmlFiles, file)
        }
    }
    if len(files) == len(xmlFiles) {
        log.Println("Found", len(files), "XML files in the data folder.")
    } else {
        log.Println("Among", len(files), "files in the data folder,", len(xmlFiles), "XML files were found.")
    }
    var wg sync.WaitGroup
    nFiles := 0
    for _, xmlFile := range xmlFiles {
        wg.Add(1)
        go func(file fs.DirEntry) {
            defer wg.Done()
            singleRun(file)
            nFiles += 1
        }(xmlFile)
    }
    wg.Wait()
    fmt.Println("Parsed", nFiles, "XML files.")
}
Do note that the only difference is removing wg.Add(len(xmlFiles)) and instead calling wg.Add(1) inside the loop, just before launching each goroutine.
And what amazed me is that this second version does not work! It does not close the files, and after running for a while I get a lot of lines saying that too many files are open:
...
2021/07/26 08:12:36 open data/sample_345429.xml: too many open files
2021/07/26 08:12:36 open data/sample_345556.xml: too many open files
2021/07/26 08:12:36 open data/sample_345652.xml: too many open files
...
Frankly, I have no idea what's going on, though it may well be a very simple mistake on my part. The singleRun() function has the deferred Close(), the goroutine in the loop has the deferred wg.Done(), so I really can't see where the problem is. Does anyone have an idea what's going on?
(I could paste the full code here, but it requires generating hundreds of thousands of XML files, so, since this seems to be a generic problem, maybe someone knows what's going on without running the actual code?)