Bulk insert speed gap when multiple machines are used

I have a Go utility that reads documents from flat files and bulk-loads them into Couchbase. The application can insert data at up to 21K writes per second when 2 writer threads run on a single machine (the destination is a remote Couchbase cluster with one server on the same network).
But when 2 writer threads run on 2 different machines (1 thread each), the insertion speed drops to roughly half (10K writes/sec). Since each machine uses its own RAM and CPU for insertion, and the utility has already shown speeds of up to 25K writes per second, the network doesn’t seem to be the issue (I also checked network utilization; it stays below 50 percent when multiple machines are used).

Note: All machines used have an i7 3.40GHz quad-core processor and 8GB RAM. The total amount of data being inserted is up to 500MB. Bucket configuration: 5.00GB RAM, bucket disk I/O priority: High.

I need to know what’s causing this speed gap. Please help…

You should show us the code, or it’s a pointless guessing game.

As Go doesn’t do anything special in a multi-machine setup, the problem is 100% in your setup. Your description skips a lot of useful details, for example where you are reading the data from.

For a successful analysis, make sure to picture the full chain of events your program causes, and benchmark each part separately.
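For instance, here is a minimal sketch of that kind of per-stage timing; the stage bodies are placeholders for your own read and write steps, not your actual code:

package main

import (
    "fmt"
    "time"
)

// timed runs one stage of the pipeline and reports its duration, so the
// file-reading/parsing cost can be compared with the network-write cost.
func timed(name string, fn func()) {
    start := time.Now()
    fn()
    fmt.Println(name, "took", time.Since(start))
}

func main() {
    timed("read+parse", func() { /* read the flat file and build a batch */ })
    timed("bulk write", func() { /* send one batch to the cluster */ })
}

If one stage dominates on a single machine but not on two, that tells you where to look next.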

Hi @Giulio_lotti,

Both machines have data files (File_1.txt, File_2.txt, File_3.txt, etc.) stored locally on their respective hard drives, and all documents are unique.

  • The function ‘InsertDataFromFile’ reads documents from the text file line by line, appends them to the ‘items’ array, and once 500 documents have been appended, it performs a bulk insert operation and continues with the next chunk of data.
    After each bulk insert operation completes, the ‘items’ array is emptied for the next batch of data.

  • This function is invoked as a goroutine in a loop, so multiple files are inserted at the same time.

Here’s the Code:

package main

import (
    "bufio"
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "os"
    "runtime"
    "strconv"
    "sync"

    "gopkg.in/couchbase/gocb.v1"
)

var (
    bucket  *gocb.Bucket
    CB_Host string
)

func main() {
    var wg sync.WaitGroup
    CB_Host = "<IP Address of Couchbase Server>" //..........Placeholder: fill in the server address
    runtime.GOMAXPROCS(runtime.NumCPU())
    cluster, err := gocb.Connect("couchbase://" + CB_Host) //..........Establish Couchbase connection
    if err != nil {
        log.Fatalln("ERROR CONNECTING COUCHBASE:", err)
    }
    bucket, err = cluster.OpenBucket("BUCKET", "*******")
    if err != nil {
        log.Fatalln("ERROR OPENING BUCKET:", err)
    }
    Path := "E:\\Data\\File_" //..........Path of the text files that contain the data

    for i := 1; i <= 2; i++ {
        wg.Add(1)
        go InsertDataFromFile(Path+strconv.Itoa(i)+".txt", i, &wg)
    }

    wg.Wait()

    err = bucket.Close() //..............Close Couchbase connection
    if err != nil {
        fmt.Println("ERROR CLOSING COUCHBASE CONNECTION:", err)
    }
}

/*-- Main function Ends Here --*/


func InsertDataFromFile(FilePath string, i int, wg *sync.WaitGroup) {
    var (
        ID       string
        JSONData string
        items    []gocb.BulkOp
    )

    csvFile, err := os.Open(FilePath) //...............Open the flat file containing data
    if err != nil {
        log.Fatal(err)
    }
    defer csvFile.Close()
    reader := csv.NewReader(bufio.NewReader(csvFile))
    reader.Comma = '$'
    reader.LazyQuotes = true
    counter := 1
    fmt.Println("Starting Insertion of File " + strconv.Itoa(i) + "...")
    for {
        line, err := reader.Read()
        if err == io.EOF {
            break
        } else if err != nil { //...............Parse data and append it to the items[] slice
            log.Fatal(err)
        }
        ID = line[0]
        JSONData = line[1]
        items = append(items, &gocb.UpsertOp{Key: ID, Value: JSONData})
        if counter%500 == 0 {
            BulkInsert(&items) //................Bulk insert the next 500 documents into Couchbase
            items = nil
        }
        counter = counter + 1
    }

    BulkInsert(&items) //................Insert the remaining documents
    items = nil
    fmt.Println("Insertion of File " + strconv.Itoa(i) + " Completed...")
    wg.Done()
}

func BulkInsert(item *[]gocb.BulkOp) {
    err := bucket.Do(*item)
    if err != nil {
        fmt.Println("ERROR PERFORMING BULK INSERT:", err)
    }
    for _, op := range *item { //................bucket.Do only reports a top-level error;
        if upsert, ok := op.(*gocb.UpsertOp); ok && upsert.Err != nil { // each op records its own result in Err
            fmt.Println("ERROR UPSERTING", upsert.Key, ":", upsert.Err)
        }
    }
}

It looks like you have two goroutines that read CSV files, but the Couchbase client is shared, so sending is always done in a synchronized way.

You should try using a separate connection to Couchbase from each reader. Also, your “items” slice is inefficiently reallocated after every flush. You can allocate it once with capacity 500 and re-slice it to empty when you flush; see the sketch below.
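Here is a minimal sketch of both suggestions, reusing the imports and the CB_Host variable from your program; the worker name is mine, and the bucket name and credentials are placeholders, the read loop is elided:

// Sketch: each reader goroutine opens its own cluster connection and bucket
// handle instead of sharing the package-level one, and reuses a single
// pre-allocated batch slice between flushes.
func insertWorker(filePath string, wg *sync.WaitGroup) {
    defer wg.Done()

    cluster, err := gocb.Connect("couchbase://" + CB_Host) // own connection per goroutine
    if err != nil {
        log.Fatal(err)
    }
    b, err := cluster.OpenBucket("BUCKET", "*******") // placeholder bucket/credentials
    if err != nil {
        log.Fatal(err)
    }
    defer b.Close()

    items := make([]gocb.BulkOp, 0, 500) // allocated once, capacity 500
    // ... append UpsertOps while reading filePath; then, on every 500th document:
    if err := b.Do(items); err != nil {
        fmt.Println("ERROR PERFORMING BULK INSERT:", err)
    }
    items = items[:0] // re-slice to empty: length 0, backing array kept
}

Re-slicing with items[:0] keeps the backing array alive, so the 500-op batch is allocated once per goroutine instead of once per flush.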

That, of course, should improve the performance of your single-machine agent, which is the opposite of the performance result you describe… To improve performance across several machines, I would concentrate on the Couchbase configuration.