# Get largest N ints from large input

Consider a `\n` separated file of `int`s that is so large that it does not fit into memory. To find the largest N `int`s in this file it is not possible to read the file into memory and sort it. The file must be read line by line.

I’d implement this using a slice to hold the top N `int`s found so far plus the `int` read from the current line. To compare the current `int` against the current top N, I’d use `sort.Sort`.

``````package main

import (
"bufio"
"fmt"
"os"
"sort"
"strconv"
)

func main() {
// We are interested in the n largest ints.
n := 5

// The slice tops will hold the n largest ints on
// positions 1..n-1 and the current int on position 0.
tops := make([]int, n+1)

scanner := bufio.NewScanner(os.Stdin)
for scanner.Scan() {
text := scanner.Text()
i, err := strconv.Atoi(text)
if err != nil {
fmt.Fprintf(os.Stderr, "could not parse int from '"+text+"'\n")
continue
}

// Put it on position 0 of the slice.
tops[0] = i

// Sort the slice. If the current int is among the n largest,
// it will be moved to a position > 0.
sort.Sort(sort.IntSlice(tops))
}

// The largest n ints are on position 1..n-1.
fmt.Println(tops[1:])
}
``````

This approach works but the comparison requires `sort.Sort` for every line.

What do you thing? Is this a reasonable approach? Is there a more efficient way to compare the current `int` with the current top N?

2 Likes

If the top N list is sorted you only need to compare with the lowest, and if your current item is larger use a binary search to find the insertion position. The list is then still sorted.

Yes, that could be an improvement.

How does `sort.Sort` sort such an almost sorted slice? Would a binary search be faster than sorting something like `[12, 10, 11, 12, 13, 14]`?

You can’t binary search if it’s not sorted, but you also don’t need to keep the “current” element as part of the slice.

Something like this? https://play.golang.org/p/e6SQCR4cu-1

2 Likes

Sort seems like overkill… wonder if a ring could be used instead, if it would perform better? Altho, I’m pretty sure it would require more code to write.

I’ll try to take a look tomorrow (if no one else does).

That would be nice, any idea is welcome.

The boring solution is to run a command line file sort and read the output.

Does `sort` handl files of gigabyte size?

I don’t see why not (but I am an ex-mainframer so I have many strange pre-conceptions I suppose).

I have used the sort command on multi-gigabyte files. I wasn’t sure that it do numbers in text strings (not all the same width unless they have leading spaces/zeros; nor would it work if there were negative numbers). But I found this:

-n, --numeric-sort Compare according to string numerical value.

So I’m sure it would be faster than anything written in Go!

2 Likes

Here is a ring example:
https://play.golang.org/p/FO1Dy3Z0Zpg

This example actually gets the top N unique integers on the input stream. If you want dups instead, remove the check to see if the value is already on the ring.

Cheers.

3 Likes

Hey @lutzhorn, good question. So far, people have shared good thoughts and I just wanted to contribute a bit to the discussion. Your solution asymptotically takes O(M *(N * log N)) where M is the large number of `\n` separated `ints` like you shared. My suggestion improves it to O(M * log N )

I suggest you use a `MinHeap` (https://golang.org/pkg/container/heap/#example__intHeap) which would allow you to keep track of the N top numbers while giving you O(1) access to the lowest number. This comes in handy to replace such lowest number with a larger one you see while scanning new numbers all in O(log N).

Here’s the code: https://play.golang.org/p/z1l9hRreK0n

I hope this helps.

2 Likes