Interface Pitfalls and Harnessing io.Reader

Hi All,

I just put together an article to help newcomers to Go start to grasp interfaces. It walks through some of the steps I’ve seen people coming from languages like Perl go through. I’d really appreciate any vetting, comments or critiques you could offer.

https://medium.com/p/golang-interface-pitfalls-and-io-reader-a57e2d8842a2

Thanks a bunch!

Hi Joshua and thanks for the nice article!

I haven’t checked thoroughly, but I think there might be a flaw in the final code. When an occurrence of m.Old is split across two chunks, it might not be replaced.

p := &Processor{
    Src:       strings.NewReader("hello hold the door trail"),
    ChunkSize: 10,
    Old:       []byte("hold the door"),
    New:       []byte("hodor"),
}
if _, err := io.Copy(os.Stdout, p); err != nil {
    log.Fatal("copy: ", err)
}

Is that the case, or am I wrong?

One solution is to implement Read a bit differently: when a full occurrence of m.Old is found, write everything before it, write m.New, reset the buffer to just after m.Old, and start again.

If there is a partial occurrence (that is, when the buffer ends in a prefix of m.Old), write everything before the matched part, reset the buffer to the start of the matched part, and start again. The next iteration will handle the replacement (or not).

To match, I suppose one can just compare byte by byte; I am not sure there is any advantage in handling utf-8 in a special way.

This probably gets a bit tricky with bytes.Buffer, and it would need some care to keep allocations down.

Sadly I cannot write any code now, but I will try later.

Thanks for taking a look at the post!

I think I understand the issue you’re explaining. I am aware that, because of the way the input is chunked, a chunk boundary can fall inside a string that should be replaced. I did write in the article:

It should be noted that there is a bit of a bug in this implementation in that the chunk could read until the middle of a hodor and it wouldn’t get replaced properly. Since this code is for demonstration only, fixing it is an exercise left to the reader.

I suppose that is a bit of professor’s sleight of hand, but it is, as you said, a non-trivial issue to address.

The simplest way to fix it would be to use a buffer with peek-ahead ability as a read-through cache of the source. That way, if you find that you are within a potential match, but reach the end of a chunk, you could peek ahead a few more bytes until the correct action (or inaction) can be taken.
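To make the peek-ahead idea concrete, here is a minimal sketch built on bufio.Reader. It is not the article's Processor: the names peekReplacer, newPeekReplacer and replaceStream are made up for illustration, and it assumes Old is non-empty and that going byte by byte through the bufio buffer is fast enough for a demo.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// peekReplacer is a hypothetical type (not from the article) that replaces
// old with repl while reading, using bufio.Reader.Peek to look ahead.
type peekReplacer struct {
	src     *bufio.Reader
	old     []byte
	repl    []byte
	pending []byte // replacement bytes not yet handed to the caller
}

func newPeekReplacer(src io.Reader, old, repl []byte) *peekReplacer {
	// The buffer must be larger than len(old), otherwise Peek(len(old))
	// could fail with bufio.ErrBufferFull.
	return &peekReplacer{src: bufio.NewReaderSize(src, len(old)+4096), old: old, repl: repl}
}

func (r *peekReplacer) Read(p []byte) (int, error) {
	if len(r.old) == 0 {
		return r.src.Read(p) // nothing to search for: pass through
	}
	n := 0
	for n < len(p) {
		if len(r.pending) > 0 { // finish emitting a replacement first
			c := copy(p[n:], r.pending)
			r.pending = r.pending[c:]
			n += c
			continue
		}
		// Look ahead without consuming. Near EOF, Peek returns fewer bytes
		// together with an error, so a full match is impossible there.
		head, err := r.src.Peek(len(r.old))
		if err == nil && bytes.Equal(head, r.old) {
			r.src.Discard(len(r.old)) // cannot fail: the bytes are buffered
			r.pending = r.repl
			continue
		}
		if len(head) == 0 { // source exhausted or failed
			if n > 0 {
				return n, nil
			}
			return 0, err
		}
		b, _ := r.src.ReadByte() // buffered, so this cannot fail
		p[n] = b
		n++
	}
	return n, nil
}

// replaceStream is a small helper for trying the reader out.
func replaceStream(input, old, repl string) (string, error) {
	out, err := io.ReadAll(newPeekReplacer(strings.NewReader(input), []byte(old), []byte(repl)))
	return string(out), err
}

func main() {
	out, err := replaceStream("hello hold the door trail", "hold the door", "hodor")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // prints "hello hodor trail"
}
```

Because the lookahead happens inside the bufio buffer, the chunk size of the underlying source no longer matters for correctness.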

Taking it further, if you know you are dealing with strings (as opposed to any byte-slice), you may want to only replace Old when it exists as a complete word or phrase (e.g. replace “hodor”, but not “foohodorbar”).

Word boundaries can often be determined by the presence of spaces or punctuation, but there are a lot of corner cases (e.g. contractions like “don’t”) that make this difficult.

We’ve had to implement similar logic, and chose to use the Unicode word boundary algorithm described here: UAX #29: Unicode Text Segmentation
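For comparison, here is a deliberately naive whole-word check. It is not UAX #29 and not the logic we use; wholeWordAt and isWordRune are hypothetical helpers that simply treat letters, digits and underscores as word characters. It is only meant to show where the simple rule works and where it goes wrong.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isWordRune is a crude stand-in for "part of a word" (an assumption,
// far simpler than the UAX #29 rules).
func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_'
}

// wholeWordAt reports whether s[start:end] has no word rune directly
// before or after it.
func wholeWordAt(s string, start, end int) bool {
	if start > 0 {
		r, _ := utf8.DecodeLastRuneInString(s[:start])
		if isWordRune(r) {
			return false
		}
	}
	if end < len(s) {
		r, _ := utf8.DecodeRuneInString(s[end:])
		if isWordRune(r) {
			return false
		}
	}
	return true
}

func main() {
	s := "hodor foohodorbar"
	fmt.Println(wholeWordAt(s, 0, 5))  // standalone "hodor"
	fmt.Println(wholeWordAt(s, 9, 14)) // "hodor" inside "foohodorbar"
	// Corner case: the apostrophe counts as a boundary here, so "don" in
	// "don't" is wrongly treated as a complete word.
	fmt.Println(wholeWordAt("don't", 0, 3))
}
```

The last call is the kind of case that pushes you toward a real segmentation algorithm.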

There is certainly a lot that can be said about how to fix the “bug” in the article, but I felt it distracted from the core subject.

I’m happy to discuss other approaches and improvements here though!

Thanks very much for the reply! Sorry for only skimming through and missing the warning :)

Also thank you for the explanation on how Unicode is used, very interesting.

What I tentatively described above is similar to what you say. I was thinking that one could profit from the fact that Read can return short. That way one can copy into the Read buffer just the data before the match, then the replacement, and then the rest after the match, without doing any real substitution (which usually involves copying into another, resized buffer).
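A rough sketch of that idea, to have something to benchmark against. The names shortReplacer, newShortReplacer, partialSuffix and replaceChunked are invented here and are not from the article, Old is assumed to be non-empty, and no effort is made to recycle the internal buffer.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// shortReplacer (hypothetical) relies on short reads: each Read call hands
// back either plain data up to a safe point or part of a replacement.
type shortReplacer struct {
	src       io.Reader
	old, repl []byte
	chunk     []byte // scratch space for one read from src
	buf       []byte // bytes read from src but not yet emitted
	pending   []byte // replacement bytes still owed to the caller
	srcErr    error  // first error (including io.EOF) seen from src
}

func newShortReplacer(src io.Reader, chunkSize int, old, repl []byte) *shortReplacer {
	if chunkSize <= 0 {
		chunkSize = 4096
	}
	return &shortReplacer{src: src, old: old, repl: repl, chunk: make([]byte, chunkSize)}
}

// partialSuffix returns the length of the longest suffix of buf that is a
// proper prefix of old, i.e. a match that might complete in the next chunk.
func partialSuffix(buf, old []byte) int {
	k := len(old) - 1
	if k > len(buf) {
		k = len(buf)
	}
	for ; k > 0; k-- {
		if bytes.HasSuffix(buf, old[:k]) {
			return k
		}
	}
	return 0
}

func (r *shortReplacer) Read(p []byte) (int, error) {
	if len(r.old) == 0 {
		return r.src.Read(p) // nothing to search for: pass through
	}
	for {
		if len(r.pending) > 0 { // emit the replacement, possibly in pieces
			n := copy(p, r.pending)
			r.pending = r.pending[n:]
			return n, nil
		}
		safe := bytes.Index(r.buf, r.old)
		if safe == 0 { // match at the front: swap it for the replacement
			r.buf = r.buf[len(r.old):]
			r.pending = r.repl
			continue
		}
		if safe < 0 { // no full match: hold back a possible partial match
			safe = len(r.buf)
			if r.srcErr == nil {
				safe -= partialSuffix(r.buf, r.old)
			}
		}
		if safe > 0 { // short read: everything before the (possible) match
			n := copy(p, r.buf[:safe])
			r.buf = r.buf[n:]
			return n, nil
		}
		if r.srcErr != nil {
			return 0, r.srcErr
		}
		m, err := r.src.Read(r.chunk)
		r.buf = append(r.buf, r.chunk[:m]...)
		r.srcErr = err
	}
}

// replaceChunked is a small helper for trying the reader out.
func replaceChunked(input string, chunkSize int, old, repl string) (string, error) {
	r := newShortReplacer(strings.NewReader(input), chunkSize, []byte(old), []byte(repl))
	out, err := io.ReadAll(r)
	return string(out), err
}

func main() {
	out, err := replaceChunked("hello hold the door trail", 10, "hold the door", "hodor")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // prints "hello hodor trail"
}
```

With a chunk size of 10 the first chunk ends in "hold", which is held back until the next chunk shows whether the match completes.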

If you had the time to set up a proper repository, including the benchmark code, trying different implementations could be an exercise for the weekend!

I like that idea a lot.

I probably should have done that already. I’ll see what I can do.

Basically, you process it a lot like a no-seek ReadLine() would: read, then return everything up to the stopping point and save the leftovers. One slight difference: there is no need to recheck the saved buffer for the next stopping point at the top of the function.
