Get largest N ints from large input

lutzhorn · January 16, 2018, 9:18am

Consider a \n separated file of ints that is so large that it does not fit into memory. To find the largest N ints in this file it is not possible to read the file into memory and sort it. The file must be read line by line.

I’d implement this using a slice to hold the top N ints found so far plus the int read from the current line. To compare the current int against the current top N, I’d use sort.Sort.

package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strconv"
)

func main() {
	// We are interested in the n largest ints.
	n := 5

	// The slice tops will hold the n largest ints on
	// positions 1..n-1 and the current int on position 0.
	tops := make([]int, n+1)

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		// Read the next int.
		text := scanner.Text()
		i, err := strconv.Atoi(text)
		if err != nil {
			fmt.Fprintf(os.Stderr, "could not parse int from '"+text+"'\n")
			continue
		}

		// Put it on position 0 of the slice.
		tops[0] = i

		// Sort the slice. If the current int is among the n largest,
		// it will be moved to a position > 0.
		sort.Sort(sort.IntSlice(tops))
	}

	// The largest n ints are on position 1..n-1.
	fmt.Println(tops[1:])
}

This approach works but the comparison requires sort.Sort for every line.

What do you thing? Is this a reasonable approach? Is there a more efficient way to compare the current int with the current top N?

calmh · January 16, 2018, 9:24am

If the top N list is sorted you only need to compare with the lowest, and if your current item is larger use a binary search to find the insertion position. The list is then still sorted.

lutzhorn · January 16, 2018, 9:28am

Yes, that could be an improvement.

How does sort.Sort sort such an almost sorted slice? Would a binary search be faster than sorting something like [12, 10, 11, 12, 13, 14]?

calmh · January 16, 2018, 9:30am

You can’t binary search if it’s not sorted, but you also don’t need to keep the “current” element as part of the slice.

calmh · January 16, 2018, 9:40am

Something like this? https://play.golang.org/p/e6SQCR4cu-1

Cecil_New · January 16, 2018, 9:15pm

Sort seems like overkill… wonder if a ring could be used instead, if it would perform better? Altho, I’m pretty sure it would require more code to write.

I’ll try to take a look tomorrow (if no one else does).

lutzhorn · January 17, 2018, 7:30am

That would be nice, any idea is welcome.

ChrisL · January 17, 2018, 11:00am

The boring solution is to run a command line file sort and read the output.

lutzhorn · January 17, 2018, 11:00am

Does sort handl files of gigabyte size?

ChrisL · January 17, 2018, 11:04am

I don’t see why not (but I am an ex-mainframer so I have many strange pre-conceptions I suppose).

Cecil_New · January 17, 2018, 1:01pm

I have used the sort command on multi-gigabyte files. I wasn’t sure that it do numbers in text strings (not all the same width unless they have leading spaces/zeros; nor would it work if there were negative numbers). But I found this:

-n, --numeric-sort Compare according to string numerical value.

at Linux Sort Command Help and Examples

So I’m sure it would be faster than anything written in Go!

Cecil_New · January 17, 2018, 2:37pm

Here is a ring example:
https://play.golang.org/p/FO1Dy3Z0Zpg

This example actually gets the top N unique integers on the input stream. If you want dups instead, remove the check to see if the value is already on the ring.

Cheers.

joe94 · January 20, 2018, 8:40am

Hey @lutzhorn, good question. So far, people have shared good thoughts and I just wanted to contribute a bit to the discussion. Your solution asymptotically takes O(M *(N * log N)) where M is the large number of \n separated ints like you shared. My suggestion improves it to O(M * log N )

I suggest you use a MinHeap (https://golang.org/pkg/container/heap/#example__intHeap) which would allow you to keep track of the N top numbers while giving you O(1) access to the lowest number. This comes in handy to replace such lowest number with a larger one you see while scanning new numbers all in O(log N).

Here’s the code: https://play.golang.org/p/z1l9hRreK0n

I hope this helps.