So for this weekend I wanted to implement a lexer just for fun. Before starting the implementation I looked into the state of lexers in Go, and so far I’ve found three lexers in the std lib.
One is based on the iterative approach, and the other does the scanning based on state functions (Rob Pike’s method).
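For readers unfamiliar with the state-function method: each lexing state is a function that returns the next state, and tokens are emitted over a channel from a goroutine. The sketch below is a toy version under my own assumptions (a single `lexText` state that splits words on spaces), not code from any of the std lib packages:

```go
package main

import "fmt"

// token is a minimal token type for this sketch.
type token struct {
	kind string
	val  string
}

type lexer struct {
	input  string
	start  int // start offset of the current token
	pos    int // current scan offset
	tokens chan token
}

// stateFn is the heart of Pike's method: a state is a function
// that does some scanning and returns the next state (or nil).
type stateFn func(*lexer) stateFn

// emit sends the text scanned so far as a token.
func (l *lexer) emit(kind string) {
	l.tokens <- token{kind, l.input[l.start:l.pos]}
	l.start = l.pos
}

// lexText emits space-separated runs of characters as WORD tokens.
func lexText(l *lexer) stateFn {
	for l.pos < len(l.input) {
		if l.input[l.pos] == ' ' {
			if l.pos > l.start {
				l.emit("WORD")
			}
			l.pos++
			l.start = l.pos
			continue
		}
		l.pos++
	}
	if l.pos > l.start {
		l.emit("WORD")
	}
	close(l.tokens)
	return nil
}

// lex runs the state machine concurrently and returns the token channel,
// so the parser can range over tokens while the lexer keeps scanning.
func lex(input string) chan token {
	l := &lexer{input: input, tokens: make(chan token)}
	go func() {
		for state := lexText; state != nil; {
			state = state(l)
		}
	}()
	return l.tokens
}

func main() {
	for t := range lex("hello lexer world") {
		fmt.Println(t.kind, t.val)
	}
}
```

A real lexer would have many state functions (string literals, numbers, comments) handing off to each other; the appeal of the pattern is that the "where am I?" bookkeeping lives in the function being executed rather than in an explicit state variable.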
Seeing two ways of implementing a lexer in the std lib, I’m not sure which one is the preferred way. Maybe they both have advantages that are unknown to me.
So my question is: which one do you think is preferable? Or, to put it better, which do you think is more idiomatic Go and more maintainable in the long term?
Rob Pike’s own answer to a related question on go-nuts:
That talk was about a lexer, but the deeper purpose was to demonstrate how concurrency can make programs nice even without obvious parallelism in the problem. And like many such uses of concurrency, the code is pretty but not necessarily fast.
I think it’s a fine approach to a lexer if you don’t care about performance. It is significantly slower than some other approaches but is very easy to adapt. I used it in ivy, for example, but just so you know, I’m probably going to replace the one in ivy with a more traditional model to avoid some issues with the lexer accessing global state. You don’t care about that for your application, I’m sure.
So: It’s pretty and nice to work on, but you’d probably not choose that approach for a production compiler.
This is really nice @corylanou. I wonder why you write the literals into a buffer every time, when you could already get that information from the position information. This seems to be a different approach from text/scanner or go/scanner, which both rely on the token start/end positions to return the token literal.
Returning the literal saves the step of calling another function to retrieve it. In cases like InfluxDB and go/scanner, where we know the literal is going to be needed and used, we might as well go ahead and return it. text/scanner, on the other hand, is a general-purpose scanner, and some use cases might not want or need to pay the penalty of actually retrieving the literal.
You are right that both return the token, pos, and literal, so the signatures are the same.
But the main difference between influxql and go/scanner is how they prepare the literal. go/scanner uses the already available information (i.e., the position) and returns the literal straight from the source buffer, whereas influxql prepares a new buffer from scratch and fills it as it scans a token: https://github.com/influxdb/influxdb/blob/master/influxql/scanner.go#L134
I’m not sure which is the better way; I just wanted to point out the difference in the underlying implementations.
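To make the two strategies concrete, here is a toy sketch of each, under my own assumptions (both functions just scan a single identifier up to the first space; the names `sliceLiteral` and `bufferLiteral` are hypothetical, not from either codebase):

```go
package main

import (
	"fmt"
	"strings"
)

// sliceLiteral mimics the go/scanner style: record the start/end
// offsets while scanning, then slice the literal straight out of
// the original source string. No extra allocation per token.
func sliceLiteral(src string) (lit string, start, end int) {
	start = 0
	end = start
	for end < len(src) && src[end] != ' ' {
		end++
	}
	return src[start:end], start, end
}

// bufferLiteral mimics the influxql style: copy each character into
// a fresh buffer while scanning, then return the buffer's contents.
// This costs a copy, but the literal is independent of the source.
func bufferLiteral(src string) string {
	var buf strings.Builder
	for i := 0; i < len(src) && src[i] != ' '; i++ {
		buf.WriteByte(src[i])
	}
	return buf.String()
}

func main() {
	lit, start, end := sliceLiteral("SELECT value")
	fmt.Println(lit, start, end) // SELECT 0 6
	fmt.Println(bufferLiteral("SELECT value")) // SELECT
}
```

One practical trade-off: slicing pins the whole source string in memory for as long as any literal is retained, while the buffer approach lets the source be freed at the cost of a copy per token.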