I am trying to load a large CSV file in Go using goroutines. The CSV has dimensions (254882, 100). But when I parse the CSV in goroutines and store it in a 2D slice, I get fewer than 254882 rows, and the number varies from run to run. I suspect the goroutines are the cause but can't pinpoint the reason. Can anyone please help? I am also new to Go. Here is my code:
func loadCSV(csvFile string) (*[][]float64, error) {
	startTime := time.Now()
	var dataset [][]float64
	f, err := os.Open(csvFile)
	if err != nil {
		return &dataset, err
	}
	r := csv.NewReader(bufio.NewReader(f))
	counter := 0
	var wg sync.WaitGroup
	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if counter != 0 {
			wg.Add(1)
			go func(r []string, dataset *[][]float64) {
				var temp []float64
				for _, each := range record {
					f, err := strconv.ParseFloat(each, 64)
					if err == nil {
						temp = append(temp, f)
					}
				}
				*dataset = append(*dataset, temp)
				wg.Done()
			}(record, &dataset)
		}
		counter++
	}
	wg.Wait()
	duration := time.Now().Sub(startTime)
	log.Printf("Loaded %d rows in %v seconds", counter, duration)
	return &dataset, nil
}
Multiple goroutines modify the same variable, dataset, at the same time, so they can overwrite each other's results: append on a shared slice is not safe for concurrent use. You can do one of these:
communicate data between the goroutines using channels
use a lock to synchronize writes to the variable
don't use goroutines. The goroutines are started in sequence, one after each line read from the CSV, so I don't know how much faster the read will be. It depends on how much time parsing the strings to floats takes.
One small optimisation is to create the temp slice with the number of elements it needs, the length of record, and then set the items by index instead of using append. Or at least create temp with the capacity you will need.
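Both variants of that suggestion might look like the sketch below. The function names parseIndexed and parseWithCap are invented for the example. One thing to keep in mind: the original loop skips fields that fail to parse, which only works with append. When filling by index, a skipped field would be left as 0, so the indexed version returns an error instead.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseIndexed allocates the result once at its final length and fills it
// by index. A field that fails to parse is reported as an error, because
// skipping it would leave a misleading zero in the row.
func parseIndexed(record []string) ([]float64, error) {
	temp := make([]float64, len(record))
	for i, each := range record {
		f, err := strconv.ParseFloat(each, 64)
		if err != nil {
			return nil, fmt.Errorf("field %d: %w", i, err)
		}
		temp[i] = f
	}
	return temp, nil
}

// parseWithCap reserves the capacity up front but still uses append, so it
// can keep the original behaviour of skipping unparseable fields.
func parseWithCap(record []string) []float64 {
	temp := make([]float64, 0, len(record))
	for _, each := range record {
		if f, err := strconv.ParseFloat(each, 64); err == nil {
			temp = append(temp, f)
		}
	}
	return temp
}

func main() {
	vals, err := parseIndexed([]string{"1.5", "2", "3e2"})
	fmt.Println(vals, err) // [1.5 2 300] <nil>

	fmt.Println(parseWithCap([]string{"1.5", "oops", "2"})) // [1.5 2]
}
```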
Have you tried running the program without goroutines? It may be slower because less work is done in parallel, but locking also takes time.
Thanks for your reply… Actually, goroutines cut the time in half. Without goroutines it takes around 6.2 seconds, while with goroutines it takes around 3.1 seconds. But I shall try initialising temp, then benchmark and see whether it improves…
If you use benchmarks you should see slightly fewer allocations, I think. As far as I know, append grows a slice in steps, roughly doubling its capacity while the slice is small, each time it runs out of room. So putting 20 elements into an empty slice takes several allocations, instead of a single allocation of 20 elements from the beginning.
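A small experiment can make the growth steps visible. The helper countGrowths is a name made up for this sketch; it counts how often the capacity changes while appending, which corresponds to a new backing array being allocated. The exact growth steps depend on the Go version, so the first number is not spelled out here.

```go
package main

import "fmt"

// countGrowths appends n values to s and returns how many times the
// capacity changed, i.e. how often append had to allocate a new backing array.
func countGrowths(s []float64, n int) int {
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, float64(i))
		if cap(s) != before {
			growths++
		}
	}
	return growths
}

func main() {
	// Starting from a nil slice, append has to grow the slice several times.
	fmt.Println("nil slice:   ", countGrowths(nil, 20))

	// With the capacity reserved up front, append never needs to grow it.
	fmt.Println("preallocated:", countGrowths(make([]float64, 0, 20), 20)) // 0
}
```

For real measurements, a benchmark run with go test -bench . -benchmem reports allocations per operation, which should show the same difference.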
No, I haven't done the benchmarking yet, but I have run it enough times to be sure of it. However, I plan to do it later. Can you comment on whether nested goroutines are possible or not?