Why Perl regexp faster than Go?

Hi!
Simple script for analize access_log

Perl:

  $l=~s/\A([^\s]+?) - - \[([^\]]+?)\] \"([^\"]+?)\" ([^\s]+?) ([^\s]+?) \"([^\"]+?)\"(.+)/$1\n$2\n$3\n$4\n$5\n$6\n$7/g;
  ($ip, $time, $page, $code, $size, $ref, $agent, $els) = split(/\n/, $l);
  $page=~s/(GET|HEAD|POST) (.+) (HTTP.+)/$2/;
  $hash{$page}++;
  # this faster than Golang (35-40%)

Go:
> log_format := ^([^ ]+) (-) (-) \[([^\]]+)\] "([^\"]+?)" ([0-9]+) ([^ ]+) "([^"])*" "([^"]*)"
> logParser := regexp.MustCompilePOSIX(log_format)
> log_format_get := ^(GET|HEAD|POST) (.+) (HTTP.+)$
> logParserGet := regexp.MustCompilePOSIX(log_format_get)
> …
> analize1 := func (iline *string) {
> submatch := logParser.FindSubmatch(strings.TrimSpace(*iline))
> if (len(submatch[0])>0){
> pg := logParserGet.FindAllStringSubmatch(strings.TrimSpace(submatch[0][5]), 1)
> if (len(pg)>0){
> hash[pg[0][2]]++
> }
> }

Are you using a buffered reader?

Can you provide a complete runnable program? Perhaps someone can suggest how to speed it up?

Those two regexps are different. They might both be written for matching the same set of data, but still they are different. Even a small change to a regexp can cause a significant change in execution time. (Just consider the difference between the greedy and non-greedy versions of the * and + operators.)
Comparing two different regexp expressions that run on two different regexp engines does not say much.

Input data also has an impact on the execution time of a regexp, but I assume you have tested both regexps against the same access.log file, so this should not be an issue here.

In general, I would not worry too much about a performance difference of 30%-40%. I would worry if the difference is far above a factor of 2.

Also consider that the Perl regexp engine exists since quite some time now and surely has been heavily optimized during the years, even for corner cases that newer regexp engine implementations might not even have on their radar yet.

Optimizing a regexp can be quite difficult even for quite simple expressions. You would need to know how the regexp engine behaves (and there are different strategies for implementing a regexp engine, with different compile and execution times). And you’d always need to ensure that no optimization step inadvertently changes the semantics of the expression.

You might gain some speed improvement by trying to match as much as possible using simple substring matching, and only using a regexp for the remaining cases (where the regexp would be much simpler then).

4 Likes

If you’re interested in regular expressions and why they might be slow or fast, these articles from Russ Cox are really good background reading:

https://swtch.com/~rsc/regexp/regexp1.html
https://swtch.com/~rsc/regexp/regexp2.html
https://swtch.com/~rsc/regexp/regexp3.html
https://swtch.com/~rsc/regexp/regexp4.html

and this gonuts thread, while dated, has some good reading on a similar problem (there are a few threads on golangnuts about regexp speed):

https://groups.google.com/forum/#!topic/golang-nuts/8PKZf5mv2uI/discussion

In many cases, the answer depends on your particular expression, and you can either adjust the regexp, or stop using regexp for parsing for a dramatic speed-up.

4 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.