How to find the offset of a []byte in a large binary file?
Let’s say I have a 5 MB file that contains “hello” somewhere.
How can I find the offset (in number of bytes) in the file where the “hello” starts?
I am looking for something like this (pseudo-code):
f = os.Open("file.bin")
defer f.Close()
search = []byte("hello")
offset, err = scanFile(f, search)
// offset should now contain e.g. 1234567890 if "hello" starts at that offset
On IRC it was suggested to use a bufio.Reader, but that can only scan for single bytes, not for a []byte.
This seems to do the trick. Thanks @bpalmer on IRC.
// ScanFile returns the offset of the first occurrence of a []byte in a file,
// or -1 if the []byte was not found in the file, and seeks to the beginning
// of the found []byte.
func ScanFile(f io.ReadSeeker, search []byte) int64 {
    ix := 0
    r := bufio.NewReader(f)
    offset := int64(0)
    for ix < len(search) {
        b, err := r.ReadByte()
        if err != nil {
            return -1
        }
        if search[ix] == b {
            ix++
        } else {
            ix = 0
        }
        offset++
    }
    f.Seek(offset-int64(len(search)), 0) // Seek to the beginning of the found []byte
    return offset - int64(len(search))
}
It won’t find the string “aab” in the string “aaab”. Once the first two characters “aa” have matched, there is a mismatch at position 2. The search then restarts at position 2, while it should restart at position 1.
You are missing a backtrack.
Another problem is the seek: the bufio.Reader reads ahead, so the position of f is not the position of the last matching byte we just read.
Your algorithm is incorrect. Let’s say the EOL marker is \n, the default for a scanner. If the stopWord contains one, it may span two scan lines, and your algorithm won’t find it.
While this is true in most cases, it is still arguable. Today it is 5 MB, tomorrow it is 5 GB and we will need a new algorithm. I guess it is up to the topic starter to decide what is more likely to happen for him: that he will need to parse a 5 GB file, or that he will be looking for a word with a newline in it.
I’d suggest reading chunks and searching through them. Make sure consecutive chunks overlap by the pattern length, to capture instances straddling chunk boundaries. Something like this, perhaps.
The I/O pattern would be nicer without the overlapping reads, but then we would need to be smarter when searching. This way we can just fall back on the already correct and efficient bytes.Index, as suggested. Simple beats clever until proven otherwise.
Here is a version that uses a plain io.Reader. We keep the last len(search)-1 bytes as a tail that we copy to the front of the buffer before each read. This ensures the search sees matches that overlap chunk boundaries.