Best practices for implementing a Lexer

So this weekend I wanted to implement a lexer, just for fun. Before starting, I looked into the current state of lexers in Go. So far I’ve found three lexers in the std lib:

- go/scanner
- text/scanner
- the lexer inside text/template

Both go/scanner and text/scanner are similar and process tokens iteratively. The lexer in text/template, however, is based on the state-function approach from Rob Pike’s talk “Lexical Scanning in Go”. That approach is also used in some third-party packages, such as BurntSushi’s toml package:

There are also two great blog posts about writing lexers:

One is based on the iterative approach, and the other covers scanning with states (Rob Pike’s method).
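To make the state-based style concrete, here is a minimal, hypothetical sketch of a Pike-style lexer for a toy grammar of numbers and `+`. The token names and the tiny grammar are invented for illustration; the essential ideas from the talk are that each state is a function returning the next state, and that tokens are emitted on a channel from a goroutine:

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative token kinds for a toy "number + number" grammar.
type tokenType int

const (
	tokenNumber tokenType = iota
	tokenPlus
	tokenEOF
)

type token struct {
	typ tokenType
	val string
}

// stateFn is the heart of the design: each state returns the next state.
type stateFn func(*lexer) stateFn

type lexer struct {
	input  string
	start  int // start offset of the current token
	pos    int // current scan offset
	tokens chan token
}

func lex(input string) *lexer {
	l := &lexer{input: input, tokens: make(chan token)}
	go l.run() // the lexer runs concurrently, emitting tokens on a channel
	return l
}

func (l *lexer) run() {
	for state := lexExpr; state != nil; {
		state = state(l)
	}
	close(l.tokens)
}

// emit sends the text scanned since the last emit as a token.
func (l *lexer) emit(t tokenType) {
	l.tokens <- token{t, l.input[l.start:l.pos]}
	l.start = l.pos
}

func lexExpr(l *lexer) stateFn {
	for l.pos < len(l.input) {
		switch c := l.input[l.pos]; {
		case c >= '0' && c <= '9':
			return lexNumber
		case c == '+':
			l.pos++
			l.emit(tokenPlus)
		default:
			l.pos++ // skip anything else (e.g. spaces)
			l.start = l.pos
		}
	}
	l.emit(tokenEOF)
	return nil // no next state: run() stops and closes the channel
}

func lexNumber(l *lexer) stateFn {
	rest := l.input[l.pos:]
	l.pos += len(rest) - len(strings.TrimLeft(rest, "0123456789"))
	l.emit(tokenNumber)
	return lexExpr
}

func main() {
	for tok := range lex("12+34").tokens {
		fmt.Printf("%d %q\n", tok.typ, tok.val)
	}
}
```

The caller simply ranges over the channel; the lexer and the consumer run as a pipeline, which is exactly the concurrency point Rob Pike was making.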

Seeing two ways of implementing a lexer in the std lib, I’m not sure which one is the preferred way. Maybe each has its own advantages that I’m not aware of.

So my question is: which one do you think is preferable? Or, to put it better: which do you think is more idiomatic Go and more maintainable in the long term?


Rob Pike’s own answer to a related question on go-nuts:

That talk was about a lexer, but the deeper purpose was to demonstrate how concurrency can make programs nice even without obvious parallelism in the problem. And like many such uses of concurrency, the code is pretty but not necessarily fast.

I think it’s a fine approach to a lexer if you don’t care about performance. It is significantly slower than some other approaches but is very easy to adapt. I used it in ivy, for example, but just so you know, I’m probably going to replace the one in ivy with a more traditional model to avoid some issues with the lexer accessing global state. You don’t care about that for your application, I’m sure.

So: It’s pretty and nice to work on, but you’d probably not choose that approach for a production compiler.


You could also use a lexer generator, for example nex. It is similar to the well-known flex utility, but for Go instead of C.

I’ve played with nex, but in the end I wrote a traditional state-driven lexer the last time I did this.
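For contrast with the channel-based style, a “traditional” lexer is usually a synchronous pull API: the caller asks for the next token, and a switch over the current character decides what to do. A minimal, hypothetical sketch (token kinds and the tiny grammar are invented for illustration):

```go
package main

import "fmt"

// Scanner is a traditional synchronous lexer: the caller pulls tokens
// with Next; no goroutines or channels are involved.
type Scanner struct {
	src []byte
	pos int
}

type Token struct {
	Kind string
	Lit  string
}

func (s *Scanner) Next() Token {
	// Skip spaces.
	for s.pos < len(s.src) && s.src[s.pos] == ' ' {
		s.pos++
	}
	if s.pos >= len(s.src) {
		return Token{Kind: "EOF"}
	}
	start := s.pos
	switch c := s.src[s.pos]; {
	case c >= '0' && c <= '9':
		for s.pos < len(s.src) && s.src[s.pos] >= '0' && s.src[s.pos] <= '9' {
			s.pos++
		}
		return Token{"NUMBER", string(s.src[start:s.pos])}
	case c == '+' || c == '-':
		s.pos++
		return Token{"OP", string(s.src[start:s.pos])}
	default:
		s.pos++
		return Token{"ILLEGAL", string(s.src[start:s.pos])}
	}
}

func main() {
	s := &Scanner{src: []byte("1 + 23")}
	for {
		t := s.Next()
		fmt.Println(t.Kind, t.Lit)
		if t.Kind == "EOF" {
			break
		}
	}
}
```

Because there is no goroutine behind it, this style avoids the scheduling overhead and the shutdown/global-state issues Rob Pike mentions above, at the cost of a little more bookkeeping in each case.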


I’m not claiming it’s a best practice, but here is what we have done:

So far, it has been pretty easy to keep adding language features.


Thanks ncw. As you said, I’m also in favor of writing the lexer by hand instead of generating it. It makes the code more readable and understandable.

This is really nice @corylanou. I wonder why you write the literals into a buffer every time, when you could already get that information from the position information. This seems to be a different approach from text/scanner and go/scanner, which both rely on the token start/end positions to return the token literal.

@fatih The influxdb scanner that @corylanou linked to and go/scanner are similar. Both return position, token, and literal.

influxql …

go/scanner …

Speculating a bit…

Returning the literal saves the step of calling another function to retrieve it. In cases like InfluxDB and go/scanner, where we know the literal is going to be needed and used, we might as well return it right away. text/scanner, on the other hand, is a general-purpose scanner, and some use cases might not want or need to pay the cost of actually retrieving the literal.

You are right about returning the token, position, and literal, so the signatures are the same.

But the main difference between influxql and go/scanner is how they prepare the literal. go/scanner uses the already available information (i.e. the position) and returns the literal straight from the source buffer, whereas influxql prepares a new buffer from scratch and fills it whenever it starts to scan a token:

I’m not sure which way is better; there is just a difference in the underlying implementation.
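That difference can be boiled down to a toy example. Both helpers below scan a leading identifier: one records offsets and slices the source afterwards (the go/scanner style), the other copies each accepted byte into a fresh bytes.Buffer as it goes (the influxql style). The function names and the simplified letter test are mine:

```go
package main

import (
	"bytes"
	"fmt"
)

func isLetter(c byte) bool {
	return 'a' <= c && c <= 'z' || 'A' <= c && c <= 'Z'
}

// sliceLiteral records start/end offsets and slices the source,
// in the style of go/scanner — the literal shares the source's memory.
func sliceLiteral(src string) string {
	start, pos := 0, 0
	for pos < len(src) && isLetter(src[pos]) {
		pos++
	}
	return src[start:pos]
}

// bufferLiteral copies each accepted byte into a fresh buffer while
// scanning, in the style of influxql's scanner — the literal is an
// independent copy, built up as the token is recognized.
func bufferLiteral(src string) string {
	var buf bytes.Buffer
	for pos := 0; pos < len(src) && isLetter(src[pos]); pos++ {
		buf.WriteByte(src[pos])
	}
	return buf.String()
}

func main() {
	fmt.Println(sliceLiteral("select *"))
	fmt.Println(bufferLiteral("select *"))
}
```

Both return the same literal; slicing avoids the copy, while building a buffer keeps the literal independent of the source and works naturally when the input is read incrementally rather than held in memory whole.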
