Memory usage when reading large files

gojo · September 2, 2018, 6:19pm

I’m reading a file from disk (~570 MB) using the NewScanner() function from the bufio package and appending the lines returned by Text() to a string slice. I wanted to see the memory consumption of my program, so I ran the top command in Linux, and to my surprise the resident memory taken by my code hovers around 950 MB to 1 GB. I understand that since I’ve read the contents of the whole file into a string slice, the runtime has to allocate memory equal to the size of the file to hold the contents in RAM, but where has the additional ~400 MB gone?
My function:

func read(path string) (*[]string, error) {
	var lines []string
	file, err := os.OpenFile(path, os.O_RDONLY, os.ModePerm)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		if len(line) == 0 {
			continue
		}
		lines = append(lines, scanner.Text())
	}
	return &lines, nil
}

clbanning · September 2, 2018, 8:41pm

Try:

func read(path string) ([][]byte, error) {
   data, err := ioutil.ReadFile(path)
   if err != nil {
      return nil, err
   }
   return bytes.Split(data, []byte("\n")), nil
}

jayts · September 3, 2018, 12:27am

This looked interesting because later in the development of my project, I may need to read in whole files too. I decided to look into it a bit.

To get memory statistics, top may not be the best tool. The Go package runtime has functions for reporting memory statistics, and you can get a lot more detail. In the code sample below, I’m using runtime.ReadMemStats() for this purpose.

Not just in Go, but generally, it is a lot more efficient to read in a whole file in a single gulp rather than a line at a time, but I was surprised to see how much difference it makes. I wrote a little program to compare gojo’s read() function with the one provided by clbanning.

Here is the full program, and directions for usage follow the code:

package main

import (
        "bufio"
        "bytes"
        "fmt"
        "io/ioutil"
        "os"
        "runtime"
        "strconv"
)

/* gojo (original poster) */

func read1(path string) (*[]string, error) {
        var lines []string

        file, err := os.OpenFile(path, os.O_RDONLY, os.ModePerm)
        if err != nil {
                return nil, err
        }

        defer file.Close()

        scanner := bufio.NewScanner(file)

        for scanner.Scan() {
                line := scanner.Text()
                if len(line) == 0 { continue }
                lines = append(lines, scanner.Text())
        }

        return &lines, nil
}

/* Charles Banning's (clbanning) */

func read2(path string) ([][]byte, error) {
   data, err := ioutil.ReadFile(path)
   if err != nil {
      return nil, err
   }
   return bytes.Split(data, []byte("\n")), nil
}

const (
        gojo = iota + 1
        clbanning
        )

func usage(exitval int) {
//
        fmt.Fprintf(os.Stderr,"usage: %s <filename> <version>\n",os.Args[0])
        fmt.Fprintf(os.Stderr,"\tversion is either 1 (for gojo) or 2 (for clbanning)\n")
        os.Exit(exitval)
}

func main() {
//
        var ms runtime.MemStats
        var err error
        var version int

        if len(os.Args) > 3 {
        //
                fmt.Fprintf(os.Stderr,"%s: too many arguments\n",os.Args[0])
                usage(1)
        }

        if len(os.Args) < 3 {
        //
                fmt.Fprintf(os.Stderr,"%s: missing argument(s)\n",os.Args[0])
                usage(1)
        }

        version, err = strconv.Atoi(os.Args[2])

        if err != nil {
        //
                fmt.Fprintf(os.Stderr,"%s: version error\n",os.Args[0])
                usage(2)
        }

        switch version {
        //
                case gojo:
                        // non-testing version: ignore return value
                        _, err = read1(os.Args[1])
                        // for testing: print the first line of the file
//                      var s1 *[]string
//                      s1, err = read1(os.Args[1])
//                      fmt.Printf("%s\n",(*s1)[0:1][0]) // Print first line of s1
                case clbanning:
                        // non-testing version: ignore return value
                        _, err = read2(os.Args[1])
                        // for testing: print the first line of the file
//                      var s2 [][]byte
//                      s2, err = read2(os.Args[1])
//                      fmt.Printf("%s\n",string(s2[0]))
                default:
                        fmt.Fprintf(os.Stderr,"%s: version error\n",os.Args[0])
                        os.Exit(2)
        }

        if err != nil {
        //
                fmt.Fprintf(os.Stderr,"Can't open file \"%s\"\n",os.Args[1])
                os.Exit(2)
        }


        runtime.ReadMemStats(&ms)

        fmt.Printf("\n")
        fmt.Printf("Alloc: %d MB, TotalAlloc: %d MB, Sys: %d MB\n",
                ms.Alloc/1024/1024, ms.TotalAlloc/1024/1024,ms.Sys/1024/1024)
        fmt.Printf("Mallocs: %d, Frees: %d\n",
                ms.Mallocs, ms.Frees)
        fmt.Printf("HeapAlloc: %d MB, HeapSys: %d MB, HeapIdle: %d MB\n",
                ms.HeapAlloc/1024/1024, ms.HeapSys/1024/1024, ms.HeapIdle/1024/1024)
        fmt.Printf("HeapObjects: %d\n", ms.HeapObjects)
        fmt.Printf("\n")
}

To use it, name the file “readfile.go”, then use the ‘go build’ command to build it:

go build readfile.go

To do the tests, use the readfile program like this:

readfile input_file 1

or

readfile input_file 2

where input_file is the name of a file.

When I tested a 570 MB file, I got these results:

Reading a line at a time (gojo):

Alloc: 651 MB, TotalAlloc: 1192 MB, Sys: 812 MB
Mallocs: 1166446, Frees: 518919
HeapAlloc: 651 MB, HeapSys: 767 MB, HeapIdle: 74 MB
HeapObjects: 647527

Reading all at once (clbanning):

Alloc: 583 MB, TotalAlloc: 583 MB, Sys: 662 MB
Mallocs: 192, Frees: 13
HeapAlloc: 583 MB, HeapSys: 639 MB, HeapIdle: 55 MB
HeapObjects: 179

You can easily see that reading the file a line at a time required many more memory allocations and frees, and resulted in many more objects on the heap. (The numbers you get on your own data depend on how many newlines are in the file.)

To understand the numbers better, this will help:

(Documentation for package runtime, type MemStats)

There are many more fields in the MemStats struct than I used, and you can modify the code to your liking to look at other things. Have fun.

gojo · September 4, 2018, 1:50pm

Thank you @clbanning and @jayts for your inputs.

The new function returning byte slices certainly takes less memory compared to mine. However, reading the whole file into memory at once using ioutil is not an option for me, because in my case I don’t know the size of the incoming file beforehand, and it could very well be in the gigabytes. So I’ve opted to use ReadBytes() from the bufio package:

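file, err := os.Open("myfile")
if err != nil {
	log.Fatal(err)
}
defer file.Close()
r := bufio.NewReader(file)
for {
	line, err := r.ReadBytes('\n')
	if len(line) > 0 {
		// Process the line. ReadBytes can return a final line that has
		// no trailing newline together with io.EOF, so handle the data
		// before checking the error.
	}
	if err == io.EOF {
		break
	} else if err != nil {
		log.Fatal(err) // or handle the error some other way
	}
}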

clbanning · September 4, 2018, 2:09pm

If you’re really concerned about memory allocation and processing large files with lots of lines you might want to tune it a bit more -

r := bufio.NewReader(file)
var b []byte
var err error
for {
   b, err = r.ReadBytes('\n')
   ...

Juk · September 4, 2018, 3:45pm

Out of curiosity (I’m a Go beginner), how would we read big text files then? From how I understand it, memory allocation is as big (if not bigger) when only reading a portion of a text at a time.

jayts · September 7, 2018, 4:56pm

The best way to read a file depends on your specific needs. Handle that first, then think about optimization.

For example, if you are writing a program that takes lines of text from the user one at a time and behaves interactively after each line is entered, there is no way to read in all of standard input and then process it. You have to accept a line, process it, and respond.

For other programs, it may make sense to read the entire file into a slice of strings, and process each string one at a time. And then, be careful about how long the lines may be. (What if there are no newlines in the file at all? Can your program handle that?)

Before reading a file, whether in a gulp or a line at a time, it’s probably best to check to see how big the file is. Trying to read a 20 GB file into 8 GB of system memory is probably a mistake, and in that case, you may need to read and process the file in smaller chunks.

Reading files line-by-line or by single characters almost always made sense in the 1980s or before, when computers had relatively little memory. Now we have options, and can often use the benefit of huge memory to do things more efficiently.

sandyethadka · September 9, 2018, 6:45am

Awesome info @jayts though I am not the one who raised the question. Nice to know. Thanks.

Juk · September 11, 2018, 4:18pm

Thanks for your thoughtful reply. It makes sense thinking about it, and it shows your experience.
