Undefined Behaviour of Goroutines while parsing large CSV

MayukhSobo · November 2, 2018, 7:43am

I am trying to load a big CSV file in Go using goroutines. The dimensions of the CSV are (254882, 100). But when I parse the CSV with my goroutines and store it in a 2D slice, I get fewer than 254882 rows, and the number varies on each run. I feel it is happening because of the goroutines, but I can’t pinpoint the reason. Can anyone please help me? I am also new to Go. Here is my code:

func loadCSV(csvFile string) (*[][]float64, error) {
    startTime := time.Now()
    var dataset [][]float64
    f, err := os.Open(csvFile)
    if err != nil {
        return &dataset, err
    }
    r := csv.NewReader(bufio.NewReader(f))
    counter := 0
    var wg sync.WaitGroup
    for {
        record, err := r.Read()
        if err == io.EOF {
            break
        }
        if counter != 0 {
            wg.Add(1)
            go func(r []string, dataset *[][]float64) {
                var temp []float64
                for _, each := range record {
                    f, err := strconv.ParseFloat(each, 64)
                    if err == nil {
                        temp = append(temp, f)
                    }
                }
                *dataset = append(*dataset, temp)
                wg.Done()
            }(record, &dataset)
        }
        counter++
    }
    wg.Wait()
    duration := time.Now().Sub(startTime)
    log.Printf("Loaded %d rows in %v seconds", counter, duration)
    return &dataset, nil
}

And my main function looks like the following

func main() {
    // runtime.GOMAXPROCS(4)
    dataset, err := loadCSV("AvgW2V_train.csv")
    if err != nil {
        panic(err)
    }
    fmt.Println(len(*dataset))
}

If anyone needs to download the CSV too, then click the link below (485 MB) https://drive.google.com/file/d/1G4Nw6JyeC-i0R1exWp5BtRtGM1Fwyelm/view?usp=sharing

johandalabacka · November 2, 2018, 8:48am

Multiple goroutines append to the same variable, dataset, without any synchronization, so they can overwrite each other’s results. You can do one of these:

Communicate data between the goroutines using channels.
Use a lock to synchronize writing to the variable.
Don’t use goroutines. The goroutines are started in sequence, one after each line of the CSV is read, so I don’t know how much faster the read will be. It depends on how much time parsing the strings to floats takes.

MayukhSobo · November 2, 2018, 9:24am

Thanks, I got the solution in the following way:

func loadCSV(csvFile string) [][]float64 {
    var dataset [][]float64

    f, err := os.Open(csvFile)
    if err != nil {
        log.Fatal(err) // do not silently continue with a nil file
    }
    defer f.Close()

    r := csv.NewReader(f)

    var wg sync.WaitGroup
    l := new(sync.Mutex) // lock

    for record, err := r.Read(); err == nil; record, err = r.Read() {
        wg.Add(1)

        go func(record []string) {
            defer wg.Done()

            var temp []float64
            for _, each := range record {
                if f, err := strconv.ParseFloat(each, 64); err == nil {
                    temp = append(temp, f)
                }
            }
            l.Lock() // lock before writing
            dataset = append(dataset, temp) // write
            l.Unlock() // unlock

        }(record)
    }

    wg.Wait()

    return dataset
}

Can I improve this and make the code even faster?

johandalabacka · November 2, 2018, 9:56am

One small optimisation you can do is to initialize the temp variable with the number of elements it should have, that is, the length of record, and then set the items by index instead of using append. Or at least make the temp variable with the capacity you will need.

Have you tried running the program without goroutines? It may be slower because less work is done in parallel, but locking also takes time.
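As a minimal sketch of that suggestion (parseRecord is a hypothetical helper name, not something from the thread): the slice is allocated once at its final length and filled by index. Because writing by index leaves no room to silently drop a field, this version returns an error for a value that does not parse, which is an assumption about the desired behaviour.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseRecord converts one CSV record to floats, writing by index into a
// slice that is allocated once with its final length.
func parseRecord(record []string) ([]float64, error) {
	values := make([]float64, len(record))
	for i, field := range record {
		v, err := strconv.ParseFloat(field, 64)
		if err != nil {
			return nil, fmt.Errorf("column %d: %w", i, err)
		}
		values[i] = v
	}
	return values, nil
}

func main() {
	values, err := parseRecord([]string{"0.5", "1.25", "-3"})
	fmt.Println(values, err)
}
```

To keep the original behaviour of skipping bad fields, make([]float64, 0, len(record)) followed by append reserves the capacity up front without changing the logic.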

MayukhSobo · November 2, 2018, 10:30am

Thanks for your reply. Actually, goroutines cut the time in half. Without goroutines it takes around 6.2 seconds, while with goroutines it takes around 3.1 seconds. But I will try initialising temp, then benchmark and see whether it improves or not.

MayukhSobo · November 2, 2018, 10:41am

I just tested initialising temp and then writing into it directly by index. It didn’t improve the performance, but it didn’t make it worse either.

johandalabacka · November 2, 2018, 11:45am

If you use benchmarks you should see slightly fewer allocations, I think. As far as I know, append starts from a small capacity and, each time the slice runs out of room, allocates a new backing array with a larger capacity, roughly doubling for small slices. So if 20 elements should go into a slice, filling it by append might require several allocations before everything fits, instead of a single allocation of 20 elements from the beginning.

Good that it took less time with goroutines.

How do you benchmark? Have you looked into the testing package? https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go
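A rough sketch of such a comparison, under some assumptions: benchmarks normally live in a _test.go file and run with go test -bench, but here testing.Benchmark is called from main so the example is a single file. The function names, the synthetic 100-column record and the sink variable are made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// record is a synthetic row with 100 columns, matching the CSV width in the thread.
var record = makeRecord(100)

// sink keeps results reachable so the compiler cannot discard the work.
var sink []float64

func makeRecord(n int) []string {
	rec := make([]string, n)
	for i := range rec {
		rec[i] = strconv.Itoa(i) + ".5"
	}
	return rec
}

// parseAppend grows the slice from nil, as in the thread's code.
func parseAppend(record []string) []float64 {
	var temp []float64
	for _, each := range record {
		if f, err := strconv.ParseFloat(each, 64); err == nil {
			temp = append(temp, f)
		}
	}
	return temp
}

// parsePrealloc reserves the full capacity before appending.
func parsePrealloc(record []string) []float64 {
	temp := make([]float64, 0, len(record))
	for _, each := range record {
		if f, err := strconv.ParseFloat(each, 64); err == nil {
			temp = append(temp, f)
		}
	}
	return temp
}

func BenchmarkAppend(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = parseAppend(record)
	}
}

func BenchmarkPrealloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = parsePrealloc(record)
	}
}

func main() {
	a := testing.Benchmark(BenchmarkAppend)
	p := testing.Benchmark(BenchmarkPrealloc)
	fmt.Printf("append:   %d ns/op, %d allocs/op\n", a.NsPerOp(), a.AllocsPerOp())
	fmt.Printf("prealloc: %d ns/op, %d allocs/op\n", p.NsPerOp(), p.AllocsPerOp())
}
```

The allocation count is the more stable number here; the timings depend on the machine and on how much of the total run is spent parsing rather than allocating.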

MayukhSobo · November 2, 2018, 11:49am

No, I haven’t done the benchmarking yet, but I have run it enough times to be sure of it. However, I plan to do it later. Can you comment on whether nested goroutines are possible or not?

johandalabacka · November 2, 2018, 12:54pm

It is good to know. I have used it when testing programs to find out which version is the fastest and which does the fewest allocations.

By nested, do you mean goroutines creating goroutines? That you can certainly do. Or did you mean something else?
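A small sketch of goroutines starting goroutines, with a made-up function name and made-up data. All of them share one WaitGroup, and each parent calls wg.Add for a child before starting it and before its own Done runs, so the counter cannot drop to zero while children are still being started.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquaresNested starts one goroutine per chunk, and each of those starts
// one child goroutine per number. Hypothetical example.
func sumSquaresNested(chunks [][]int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	total := 0

	for _, chunk := range chunks {
		wg.Add(1)
		go func(chunk []int) { // outer goroutine
			defer wg.Done()
			for _, n := range chunk {
				wg.Add(1) // register the child while the parent is still counted
				go func(n int) { // inner goroutine, started by the outer one
					defer wg.Done()
					mu.Lock()
					total += n * n
					mu.Unlock()
				}(n)
			}
		}(chunk)
	}

	wg.Wait() // returns only after all parents and children are done
	return total
}

func main() {
	fmt.Println(sumSquaresNested([][]int{{1, 2}, {3, 4}}))
}
```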

MayukhSobo · November 2, 2018, 12:59pm

Yes I meant goroutines creating goroutines…This may not be required for this task but I need it for some other task.