Best practices for implementing a Lexer

So this weekend I wanted to implement a lexer, just for fun. Before starting, I looked into the current state of lexers in Go. So far I’ve found three lexers in the std lib:

- go/scanner
- text/scanner
- the lexer inside text/template

Both go/scanner and text/scanner are similar and process tokens iteratively. The lexer in text/template, however, is based on the state-function approach from Rob Pike’s talk “Lexical Scanning in Go”. That approach is also used in some third-party packages, such as BurntSushi’s toml package:

There are also two great blog posts about writing lexers:

One is based on the iterative approach, and the other covers scanning with states (Rob Pike’s method).
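To make the state-based style concrete, here is a minimal, hypothetical sketch of a Pike-style lexer for a toy grammar of numbers and `+`. The token names and the tiny grammar are invented for illustration; the essential ideas from the talk are that each state is a function returning the next state, and that tokens are emitted on a channel from a goroutine:

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative token kinds for a toy "number + number" grammar.
type tokenType int

const (
	tokenNumber tokenType = iota
	tokenPlus
	tokenEOF
)

type token struct {
	typ tokenType
	val string
}

// stateFn is the heart of the design: each state returns the next state.
type stateFn func(*lexer) stateFn

type lexer struct {
	input  string
	start  int // start offset of the current token
	pos    int // current scan offset
	tokens chan token
}

func lex(input string) *lexer {
	l := &lexer{input: input, tokens: make(chan token)}
	go l.run() // the lexer runs concurrently, emitting tokens on a channel
	return l
}

func (l *lexer) run() {
	for state := lexExpr; state != nil; {
		state = state(l)
	}
	close(l.tokens)
}

// emit sends the text scanned since the last emit as a token.
func (l *lexer) emit(t tokenType) {
	l.tokens <- token{t, l.input[l.start:l.pos]}
	l.start = l.pos
}

func lexExpr(l *lexer) stateFn {
	for l.pos < len(l.input) {
		switch c := l.input[l.pos]; {
		case c >= '0' && c <= '9':
			return lexNumber
		case c == '+':
			l.pos++
			l.emit(tokenPlus)
		default:
			l.pos++ // skip anything else (e.g. spaces)
			l.start = l.pos
		}
	}
	l.emit(tokenEOF)
	return nil // no next state: run() stops and closes the channel
}

func lexNumber(l *lexer) stateFn {
	rest := l.input[l.pos:]
	l.pos += len(rest) - len(strings.TrimLeft(rest, "0123456789"))
	l.emit(tokenNumber)
	return lexExpr
}

func main() {
	for tok := range lex("12+34").tokens {
		fmt.Printf("%d %q\n", tok.typ, tok.val)
	}
}
```

The caller simply ranges over the channel; the lexer and the consumer run as a pipeline, which is exactly the concurrency point Rob Pike was making.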

Seeing two ways of implementing a lexer in the std lib, I’m not sure which one is the preferred way. Maybe each has its own advantages that I’m not aware of.

So my question is: which one do you think is preferable? Or, to put it better: which do you think is more idiomatic Go and more maintainable in the long term?


Rob Pike’s own answer to a related question on go-nuts:

That talk was about a lexer, but the deeper purpose was to demonstrate how concurrency can make programs nice even without obvious parallelism in the problem. And like many such uses of concurrency, the code is pretty but not necessarily fast.

I think it’s a fine approach to a lexer if you don’t care about performance. It is significantly slower than some other approaches but is very easy to adapt. I used it in ivy, for example, but just so you know, I’m probably going to replace the one in ivy with a more traditional model to avoid some issues with the lexer accessing global state. You don’t care about that for your application, I’m sure.

So: It’s pretty and nice to work on, but you’d probably not choose that approach for a production compiler.


You could also use a lexer generator, for example nex. It is similar to the well-known flex utility, but for Go instead of C.

I’ve played with nex, but in the end I wrote a traditional state-driven lexer the last time I did this.
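For contrast with the channel-based style, a “traditional” lexer is usually a synchronous pull API: the caller asks for the next token, and a switch over the current character decides what to do. A minimal, hypothetical sketch (token kinds and the tiny grammar are invented for illustration):

```go
package main

import "fmt"

// Scanner is a traditional synchronous lexer: the caller pulls tokens
// with Next; no goroutines or channels are involved.
type Scanner struct {
	src []byte
	pos int
}

type Token struct {
	Kind string
	Lit  string
}

func (s *Scanner) Next() Token {
	// Skip spaces.
	for s.pos < len(s.src) && s.src[s.pos] == ' ' {
		s.pos++
	}
	if s.pos >= len(s.src) {
		return Token{Kind: "EOF"}
	}
	start := s.pos
	switch c := s.src[s.pos]; {
	case c >= '0' && c <= '9':
		for s.pos < len(s.src) && s.src[s.pos] >= '0' && s.src[s.pos] <= '9' {
			s.pos++
		}
		return Token{"NUMBER", string(s.src[start:s.pos])}
	case c == '+' || c == '-':
		s.pos++
		return Token{"OP", string(s.src[start:s.pos])}
	default:
		s.pos++
		return Token{"ILLEGAL", string(s.src[start:s.pos])}
	}
}

func main() {
	s := &Scanner{src: []byte("1 + 23")}
	for {
		t := s.Next()
		fmt.Println(t.Kind, t.Lit)
		if t.Kind == "EOF" {
			break
		}
	}
}
```

Because there is no goroutine behind it, this style avoids the scheduling overhead and the shutdown/global-state issues Rob Pike mentions above, at the cost of a little more bookkeeping in each case.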


I’m not claiming it’s a best practice, but here is what we have done:

So far, it has been pretty easy to keep adding language features.


Thanks ncw. As you said, I’m also in favor of writing the lexer by hand instead of generating it. It makes the code more readable and understandable.

This is really nice @corylanou. I wonder why you write the literals into a buffer every time, when you could already get that information from the position information. This seems to be a different approach from text/scanner and go/scanner, which both rely on the token start/end positions to return the token literal.

@fatih The influxdb scanner that @corylanou linked to and go/scanner are similar. Both return position, token, and literal.

influxql …

go/scanner …

Speculating a bit…

Returning the literal saves the step of calling another function to retrieve it. In cases like InfluxDB and go/scanner, where we know the literal is going to be needed and used, we might as well return it right away. text/scanner, on the other hand, is a general-purpose scanner, and some use cases might not want or need to pay the cost of actually retrieving the literal.

You are right about returning the token, position, and literal, so the signatures are the same.

But the main difference between influxql and go/scanner is how they prepare the literal. go/scanner uses the already available information (i.e. the position) and returns the literal straight from the source buffer, whereas influxql prepares a new buffer from scratch and fills it whenever it starts to scan a token:

I’m not sure which way is better; there is just a difference in the underlying implementation.
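That difference can be boiled down to a toy example. Both helpers below scan a leading identifier: one records offsets and slices the source afterwards (the go/scanner style), the other copies each accepted byte into a fresh bytes.Buffer as it goes (the influxql style). The function names and the simplified letter test are mine:

```go
package main

import (
	"bytes"
	"fmt"
)

func isLetter(c byte) bool {
	return 'a' <= c && c <= 'z' || 'A' <= c && c <= 'Z'
}

// sliceLiteral records start/end offsets and slices the source,
// in the style of go/scanner — the literal shares the source's memory.
func sliceLiteral(src string) string {
	start, pos := 0, 0
	for pos < len(src) && isLetter(src[pos]) {
		pos++
	}
	return src[start:pos]
}

// bufferLiteral copies each accepted byte into a fresh buffer while
// scanning, in the style of influxql's scanner — the literal is an
// independent copy, built up as the token is recognized.
func bufferLiteral(src string) string {
	var buf bytes.Buffer
	for pos := 0; pos < len(src) && isLetter(src[pos]); pos++ {
		buf.WriteByte(src[pos])
	}
	return buf.String()
}

func main() {
	fmt.Println(sliceLiteral("select *"))
	fmt.Println(bufferLiteral("select *"))
}
```

Both return the same literal; slicing avoids the copy, while building a buffer keeps the literal independent of the source and works naturally when the input is read incrementally rather than held in memory whole.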
