How to find the offset of a []byte in a large binary file?
Let’s say I have a 5 MB file that contains “hello” somewhere.
How can I find the offset (in number of bytes) in the file where the “hello” starts?
I am looking for something like this (pseudo-code):
f = os.Open("file.bin")
defer f.Close()
search = []byte("hello")
offset, err = scanFile(f, search)
// offset should now contain e.g. 1234567890 if "hello" starts at that offset
On IRC it was suggested to use a bufio.Reader, but that can only scan for single bytes, not for a []byte.
This seems to do the trick. Thanks @bpalmer on IRC.
// ScanFile returns the offset of the first occurrence of a []byte in a file,
// or -1 if the []byte was not found in the file, and seeks to the beginning
// of the found []byte.
func ScanFile(f io.ReadSeeker, search []byte) int64 {
    ix := 0
    r := bufio.NewReader(f)
    offset := int64(0)
    for ix < len(search) {
        b, err := r.ReadByte()
        if err != nil {
            return -1
        }
        if search[ix] == b {
            ix++
        } else {
            ix = 0
        }
        offset++
    }
    f.Seek(offset-int64(len(search)), 0) // Seek to the beginning of the found []byte
    return offset - int64(len(search))
}
It won’t find the string “aab” in the string “aaab”. Once the first two characters “aa” have matched, there is a mismatch at position 2. The search then restarts at position 2, while it should restart at position 1.
You are missing a backtrack.
Another problem is the seek: the bufio.Reader reads ahead, so the position of f is not the position of the last matching byte we just read.
Your algorithm is incorrect. Let’s say the EOL marker is \n, the default for a scanner. If the stopWord contains one, it may span two scan lines, and your algorithm won’t find it.
While this is true in most cases, it is still arguable. Today it is 5 MB, tomorrow it is 5 GB and we will need a new algorithm. I guess it is up to the topic starter to decide what is more likely to happen for him: that he will need to parse a 5 GB file, or that he will be looking for a word with a newline in it.
I’d suggest reading chunks and searching through them. Make sure consecutive chunks overlap by the pattern length, to capture instances straddling chunk boundaries. Something like this, perhaps.
The I/O pattern would be nicer without the overlapping reads, but then we would need to be smarter when searching. This way we can just fall back on the already correct and efficient bytes.Index, as suggested. Simple beats clever until proven otherwise.
Here is a version that uses a plain io.Reader. We keep the last len(search)-1 bytes as a tail that we copy to the front of the buffer before each read. This ensures the search sees matches that overlap chunk boundaries.