Consider a \n separated file of ints that is so large that it does not fit into memory. To find the largest N ints in this file it is not possible to read the file into memory and sort it. The file must be read line by line.
I’d implement this using a slice to hold the top N ints found so far plus the int read from the current line. To compare the current int against the current top N, I’d use sort.Sort.
package main
import (
"bufio"
"fmt"
"os"
"sort"
"strconv"
)
func main() {
// We are interested in the n largest ints.
n := 5
// The slice tops will hold the n largest ints on
// positions 1..n-1 and the current int on position 0.
tops := make([]int, n+1)
scanner := bufio.NewScanner(os.Stdin)
for scanner.Scan() {
// Read the next int.
text := scanner.Text()
i, err := strconv.Atoi(text)
if err != nil {
fmt.Fprintf(os.Stderr, "could not parse int from '"+text+"'\n")
continue
}
// Put it on position 0 of the slice.
tops[0] = i
// Sort the slice. If the current int is among the n largest,
// it will be moved to a position > 0.
sort.Sort(sort.IntSlice(tops))
}
// The largest n ints are on position 1..n-1.
fmt.Println(tops[1:])
}
This approach works but the comparison requires sort.Sort for every line.
What do you thing? Is this a reasonable approach? Is there a more efficient way to compare the current int with the current top N?
If the top N list is sorted you only need to compare with the lowest, and if your current item is larger use a binary search to find the insertion position. The list is then still sorted.
Sort seems like overkill… wonder if a ring could be used instead, if it would perform better? Altho, I’m pretty sure it would require more code to write.
I’ll try to take a look tomorrow (if no one else does).
I have used the sort command on multi-gigabyte files. I wasn’t sure that it do numbers in text strings (not all the same width unless they have leading spaces/zeros; nor would it work if there were negative numbers). But I found this:
-n, --numeric-sort Compare according to string numerical value.
This example actually gets the top N unique integers on the input stream. If you want dups instead, remove the check to see if the value is already on the ring.
Hey @lutzhorn, good question. So far, people have shared good thoughts and I just wanted to contribute a bit to the discussion. Your solution asymptotically takes O(M *(N * log N)) where M is the large number of \n separated ints like you shared. My suggestion improves it to O(M * log N )
I suggest you use a MinHeap (https://golang.org/pkg/container/heap/#example__intHeap) which would allow you to keep track of the N top numbers while giving you O(1) access to the lowest number. This comes in handy to replace such lowest number with a larger one you see while scanning new numbers all in O(log N).