Problem getting size in bytes from a []string

Hello, I’m putting together some code that eventually will read and process some big freely available company datasets. I’ve so far managed to read the raw CSV and print the file contents. In order to be sure that I have read the entire file, I thought I would put in a bytes read mechanism. This is where my problems started. Code as follows:

package main

import (
	"bufio"
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"

	"golang.org/x/text/language"
	"golang.org/x/text/message"
)

func main() {
	fileName := "CompanyData.csv"
	csvFile, _ := os.Open(fileName)
	fi, _ := csvFile.Stat()
	fileSize := fi.Size()
	reader := csv.NewReader(bufio.NewReader(csvFile))

	bytesRead := 0
	var lineLength int

	for {
		record, error := reader.Read()

		lineLength = 0 // PROBLEM code
		for index, element := range record {
			fmt.Println(index)
			// Falls over after index exceeds 20 with:
			// panic: runtime error: slice bounds out of range
			fmt.Println(element[index:index])
			lineLength += len(element[index:index])
		}

		if error == io.EOF {
			fmt.Println("\nEOF reached")
			break
		} else if error != nil {
			log.Fatal(error)
		}

		fmt.Println(record) // OK on previous run

		bytesRead += lineLength // Doesn't get this far
		p := message.NewPrinter(language.English)
		p.Printf("\r %d/%d", bytesRead, fileSize)
	}
}

Since reader.Read() returns a []string, I am trying to write code starting at // PROBLEM which assigns the length of the CSV record to lineLength but the code is falling over at run time, with a slice bounds out of range error. Previous versions of the code which did not include the code starting at // Doesn't get this far appeared to read the entire file successfully.

The dataset, in case anyone is interested, is this file http://download.companieshouse.gov.uk/BasicCompanyDataAsOneFile-2017-11-01.zip which I renamed to the much more manageable file name CompanyData.csv. Be warned, unzipped it’s about 2 GB in size. I thought it would be an interesting introduction to “Big Data”!

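You cannot use the index into record as an index into element. The index is the position of the field within the record, not a position inside the field's string.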
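
To make that concrete: element[index:index] is an empty slice (length 0) as long as index is no greater than len(element), and it panics once index exceeds len(element), which is why the loop fell over on a short field late in the record. A minimal sketch of summing the field lengths instead follows. fieldBytes is a hypothetical helper name, the sample record is made up, and the result covers only the field contents, not the commas, quotes, or line endings that the CSV reader strips:

```go
package main

import "fmt"

// fieldBytes returns the total length in bytes of all fields in one
// parsed CSV record. Separators, quotes, and line endings are not
// included, because csv.Reader has already removed them.
func fieldBytes(record []string) int {
	total := 0
	for _, element := range record {
		// len on a string is its length in bytes, not in characters.
		total += len(element)
	}
	return total
}

func main() {
	// Made-up sample record; the real file has many more fields.
	record := []string{"EXAMPLE LTD", "01234567", ""}
	fmt.Println(fieldBytes(record))
}
```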

Yeah. I spotted that a few minutes ago. I’m putting together a corrected version. Back in a few minutes!

This is the latest version:

package main

import (
	"bufio"
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
	"strings"

	"golang.org/x/text/language"
	"golang.org/x/text/message"
)

func main() {
	fileName := "CompanyData.csv"
	csvFile, _ := os.Open(fileName)
	fi, _ := csvFile.Stat()
	fileSize := fi.Size()
	reader := csv.NewReader(bufio.NewReader(csvFile))
	bytesRead := 0
	recordCount := 0

	for {
		recordCount++

		record, error := reader.Read()
		recordString := strings.Join(record, "")

		quoteCount := len(record) * 2
		commaCount := len(record) - 1
		crCount := 1
		lineLength := len(recordString) + quoteCount + commaCount + crCount

		if error == io.EOF {
			fmt.Println("\nEOF reached")
			fmt.Printf("\n%d records read", recordCount)
			break
		} else if error != nil {
			log.Fatal(error)
		}
		bytesRead += lineLength
		p := message.NewPrinter(language.English)
		p.Printf("\r %d / %d", bytesRead, fileSize)
	}
}

The record count is 4106758, and my bytes read mechanism shows a final reading of 1,993,100,453 bytes read from a total file size of 1,993,103,070 which is pretty good, BUT NOT GOOD ENOUGH!! Where are the missing 2617 bytes?

Can anyone see what I am missing? Thanks!

Oops, I forgot defer csvFile.Close() after the csvFile, _ := os.Open(fileName) line!

Why are you ignoring the errors in both Open and Stat? It’s pretty common to have unreadable files and in that case your program crashes badly.

You are not counting line endings, quoting removed by the CSV reader, and maybe other things - at least not necessarily correctly. If you want to keep track of how much you’ve read, you can make a counting reader.

type countingReader struct {
        r    io.Reader
        tot  int64 // bytes
}

func (c *countingReader) Read(bs []byte) (int, error) {
        n, err := c.r.Read(bs)
        c.tot += int64(n)
        return n, err
}

Wrap your file in it:

csvFile, err := os.Open(fileName)
// handle the err, don't ignore it
cr := &countingReader{r: csvFile}
reader := csv.NewReader(cr) // no need for a bufio.Reader here, csv.Reader already has one

After you’re done, look at cr.tot which will hold the number of bytes read.
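
To see why reconstructing the size from parsed records drifts, here is a small self-contained sketch that runs both approaches over an in-memory CSV. The sample input and the countBytes helper are assumptions made for illustration; the estimate uses the same arithmetic as the earlier version in this thread (two quotes per field, commas between fields, one byte for the line ending). Escaped quotes inside a field ("" becomes ") and CRLF line endings are two of the things that estimate cannot see.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

type countingReader struct {
	r   io.Reader
	tot int64 // bytes
}

func (c *countingReader) Read(bs []byte) (int, error) {
	n, err := c.r.Read(bs)
	c.tot += int64(n)
	return n, err
}

// countBytes parses input as CSV and returns the bytes actually read
// through the counting reader alongside the per-record estimate.
func countBytes(input string) (int64, int64, error) {
	cr := &countingReader{r: strings.NewReader(input)}
	reader := csv.NewReader(cr)
	var estimate int64
	for {
		record, err := reader.Read()
		if err == io.EOF {
			return cr.tot, estimate, nil
		}
		if err != nil {
			return cr.tot, estimate, err
		}
		// Field bytes + 2 quotes per field + commas + 1 line-ending byte.
		estimate += int64(len(strings.Join(record, "")) + len(record)*2 + len(record) - 1 + 1)
	}
}

func main() {
	// Two records, CRLF line endings, one field with escaped quotes.
	input := "\"a\",\"say \"\"hi\"\"\"\r\n\"b\",\"c\"\r\n"
	actual, estimate, err := countBytes(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("input: %d bytes, counted: %d, estimated: %d\n", len(input), actual, estimate)
}
```

The counted value matches the input length because every byte passes through Read, whatever the CSV parser later does with it.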

It is a beautiful solution. Thanks for sharing!

New version is:

package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"

	"github.com/pkg/errors"
	"golang.org/x/text/language"
	"golang.org/x/text/message"
)

func checkError(err error) {
	if err != nil {
		fmt.Printf("%+v", errors.WithStack(err))
		os.Exit(1) // or anything else ...
	}
}

type countingReader struct {
	r   io.Reader
	tot int64 // bytes
}

func (c *countingReader) Read(bs []byte) (int, error) {
	n, err := c.r.Read(bs)
	c.tot += int64(n)
	return n, err
}

func main() {
	fileName := "CompanyData.csv"
	csvFile, err := os.Open(fileName)
	checkError(err)
	defer csvFile.Close()
	fi, err := csvFile.Stat()
	checkError(err)
	fileSize := fi.Size()
	cr := &countingReader{r: csvFile}
	reader := csv.NewReader(cr)
	recordCount := 0

	for {
		recordCount++
		// record, err := reader.Read()
		_, err := reader.Read()
		p := message.NewPrinter(language.English)
		if err == io.EOF {
			p.Printf("\nEOF reached")
			p.Printf("\n%d records read", recordCount)
			break
		} else {
			checkError(err)
		}
		p.Printf("\r%d / %d", cr.tot, fileSize)
	}
	fmt.Println()
}

Program output is now:

1,993,103,070 / 1,993,103,070
EOF reached
4,106,758 records read

Yay! Thank you, Jakob. God! I love this language, and Jakob, I respect you greatly!

It’s true. It is a beautifully elegant solution, and it’s taught me that if I want to find out how many things have been read, then I should count how many things have been read rather than try to work it out by implication, as it were, which is what I was attempting to do by counting line feeds and quote characters and so on.
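
One caveat for progress display: csv.Reader buffers its input, so cr.tot counts bytes pulled from the file, which can run ahead of the records returned so far. In Go 1.17 and later, csv.Reader also provides an InputOffset method that reports the byte offset of the end of the most recently read row. The sketch below assumes Go 1.17 or newer; recordOffsets is a hypothetical helper and the input is made up.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// recordOffsets parses input as CSV and returns the input byte offset
// reported by the reader after each record.
func recordOffsets(input string) ([]int64, error) {
	reader := csv.NewReader(strings.NewReader(input))
	var offsets []int64
	for {
		_, err := reader.Read()
		if err == io.EOF {
			return offsets, nil
		}
		if err != nil {
			return nil, err
		}
		// InputOffset is the position just past the row that was read.
		offsets = append(offsets, reader.InputOffset())
	}
}

func main() {
	input := "\"a\",\"b\"\r\n\"c\",\"d\"\r\n"
	offsets, err := recordOffsets(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(offsets, "of", len(input), "bytes")
}
```

For a total at end of file both approaches should agree; the difference only matters for the running figure printed inside the loop.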
