Hello, I’m putting together some code that eventually will read and process some big freely available company datasets. I’ve so far managed to read the raw CSV and print the file contents. In order to be sure that I have read the entire file, I thought I would put in a bytes read mechanism. This is where my problems started. Code as follows:
package main

import (
	"bufio"
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"

	"golang.org/x/text/language"
	"golang.org/x/text/message"
)

func main() {
	fileName := "CompanyData.csv"
	csvFile, _ := os.Open(fileName)
	fi, _ := csvFile.Stat()
	fileSize := fi.Size()
	reader := csv.NewReader(bufio.NewReader(csvFile))
	bytesRead := 0
	var lineLength int
	for {
		record, error := reader.Read()
		lineLength = 0 // PROBLEM code
		for index, element := range record {
			fmt.Println(index)
			fmt.Println(element[index:index])       // Falls over after index exceeds 20
			lineLength += len(element[index:index]) // with: panic: runtime error:
			// slice bounds out of range
		}
		if error == io.EOF {
			fmt.Println("\nEOF reached")
			break
		} else if error != nil {
			log.Fatal(error)
		}
		fmt.Println(record)     // OK on previous run
		bytesRead += lineLength // Doesn't get this far
		p := message.NewPrinter(language.English)
		p.Printf("\r %d/%d", bytesRead, fileSize)
	}
}
Since reader.Read() returns a []string, I am trying to write code, starting at // PROBLEM code, that assigns the length of the CSV record to lineLength, but it falls over at run time with a slice bounds out of range error. Previous versions of the code, which did not include the code starting at // Doesn't get this far, appeared to read the entire file successfully.
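A note on the panic itself: element[index:index] is a slice expression that starts and ends at byte offset index, so it always yields an empty string, and it panics as soon as index is greater than len(element). Below is a minimal sketch of what the loop probably intends, assuming the goal is to add up the byte length of every field in the record. The helper name fieldBytes and the sample record are made up for illustration.

```go
package main

import "fmt"

// fieldBytes is a hypothetical helper: it sums the byte lengths of the
// fields in one CSV record, as returned by csv.Reader.Read.
func fieldBytes(record []string) int {
	total := 0
	for _, element := range record {
		total += len(element) // len on a string counts bytes, not characters
	}
	return total
}

func main() {
	// Invented sample record: two 8-byte fields and one empty field.
	record := []string{"ACME LTD", "01234567", ""}
	fmt.Println(fieldBytes(record)) // prints 16
}
```

This total still leaves out the separators, quotes and line endings that the CSV reader strips, so it will come in under the file size.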
The dataset, in case anyone is interested, is this file http://download.companieshouse.gov.uk/BasicCompanyDataAsOneFile-2017-11-01.zip which I renamed to the much more manageable file name CompanyData.csv. Be warned, unzipped it’s about 2 gig in size. I thought it would be an interesting introduction to “Big Data”!
The record count is 4106758, and my bytes read mechanism shows a final reading of 1,993,100,453 bytes read from a total file size of 1,993,103,070 which is pretty good, BUT NOT GOOD ENOUGH!! Where are the missing 2617 bytes?
You are not counting line endings, quoting removed by the CSV reader, and maybe other things - at least not necessarily correctly. If you want to keep track of how much you’ve read, you can make a counting reader.
type countingReader struct {
	r   io.Reader
	tot int64 // bytes
}

func (c *countingReader) Read(bs []byte) (int, error) {
	n, err := c.r.Read(bs)
	c.tot += int64(n)
	return n, err
}
Wrap your file in it:
csvFile, err := os.Open(fileName)
// handle the err, don't ignore it
cr := &countingReader{r: csvFile}
reader := csv.NewReader(cr) // no need for a bufio.Reader here, csv.Reader already has one
After you’re done, look at cr.tot which will hold the number of bytes read.
It’s true. It is a beautifully elegant solution, and it’s taught me that if I want to find out how many things have been read, then I should count how many things have been read rather than try to work it out by implication, as it were, which is what I was attempting to do by counting line feeds and quote characters and so on.
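To see where bytes go missing when you count by implication, here is a small sketch that compares, record by record, the bytes consumed from the input with the bytes that end up in the fields. It assumes Go 1.17 or later, where csv.Reader has an InputOffset method. The function name overhead and the sample data are invented for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"strings"
)

// overhead (hypothetical name) returns, for each record, how many input bytes
// the record occupied beyond the bytes in its fields: separators, quotes and
// the line ending.
func overhead(data string) ([]int64, error) {
	reader := csv.NewReader(strings.NewReader(data))
	var out []int64
	var prev int64
	for {
		record, err := reader.Read()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		var fields int64
		for _, f := range record {
			fields += int64(len(f))
		}
		off := reader.InputOffset() // byte offset just past the record just read
		out = append(out, off-prev-fields)
		prev = off
	}
}

func main() {
	data := "name,number\r\n\"ACME, LTD\",01234567\r\n"
	extra, err := overhead(data)
	if err != nil {
		log.Fatal(err)
	}
	// First row: one comma plus \r\n. Second row: two quotes, one comma, \r\n.
	fmt.Println(extra) // prints [3 5]
}
```

The per-record overhead varies with quoting and line endings, which is why adding a fixed amount per line never quite matches the file size.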