Regexp and PCRE

Hi,

Would appreciate help in understanding the ‘Go’ way to work around this regexp issue, please.

When trying to extract URL endpoints out of a chunk of HTML such as:

<li><a href="BPL15/"> BPL15/</a></li>
<li><a href="BPL16/"> BPL16/</a></li>
<li><a href="BPL17/"> BPL17/</a></li>
<li><a href="BRAAD15/"> BRAAD15/</a></li>
<li><a href="BRAAD16/"> BRAAD16/</a></li>
<li><a href="BRAFS15/"> BRAFS15/</a></li>
<li><a href="BRAFS16/"> BRAFS16/</a></li>
<li><a href="BRASD15/"> BRASD15/</a></li>
<li><a href="BRASD16/"> BRASD16/</a></li>
<li><a href="BRDSolos15/"> BRDSolos15/</a></li>
<li><a href="BRDSolos16/"> BRDSolos16/</a></li>
<li><a href="BRDSolos17/"> BRDSolos17/</a></li>

I would normally use something like:

(?<=href\=\").*(?=\/)

Basically, a zero-width assertion (lookaround) to anchor a simple match. I’ve used this approach frequently in other languages without issue, so the lack of support for lookarounds in the Go regexp implementation hurts a bit.

However, I’m sure there will be a more Go-ey way of doing this. Right?

How do others compensate for the lack of support for certain patterns in the regexp package? I can see from the regexp syntax documentation that there are a shedload of things which are not supported.

The pattern (?:re) looked like a promising option as a non-capturing group, but I can’t seem to get it to work as expected. Using something like:

(?:href\=\")\S*

the match still includes the href=" part.

Any assistance on how to anchor matches with this regexp library would be very much appreciated.

I would work around the regexp issue by using github.com/PuerkitoBio/goquery, which lets you use CSS selectors and is backed by an HTML5 parser.

package main

import (
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

const fragment = `<li><a href="BPL15/"> BPL15/</a></li>
<li><a href="BPL16/"> BPL16/</a></li>
<li><a href="BPL17/"> BPL17/</a></li>
<li><a href="BRAAD15/"> BRAAD15/</a></li>
<li><a href="BRAAD16/"> BRAAD16/</a></li>
<li><a href="BRAFS15/"> BRAFS15/</a></li>
<li><a href="BRAFS16/"> BRAFS16/</a></li>
<li><a href="BRASD15/"> BRASD15/</a></li>
<li><a href="BRASD16/"> BRASD16/</a></li>
<li><a href="BRDSolos15/"> BRDSolos15/</a></li>
<li><a href="BRDSolos16/"> BRDSolos16/</a></li>
<li><a href="BRDSolos17/"> BRDSolos17/</a></li>`

func main() {
	log.SetFlags(log.Lshortfile)

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(fragment)) // parse the fragment with the HTML5 parser
	if err != nil {
		log.Fatalln(err)
	}

	doc.Find("a").Each(func(i int, s *goquery.Selection) { // every <a> element, selected via a CSS selector
		href, exists := s.Attr("href") // attribute value plus whether the attribute was present
		if exists {
			log.Println(href)
		}
	})
}

Produces:

$ go run main.go
main.go:34: BPL15/
main.go:34: BPL16/
main.go:34: BPL17/
main.go:34: BRAAD15/
main.go:34: BRAAD16/
main.go:34: BRAFS15/
main.go:34: BRAFS16/
main.go:34: BRASD15/
main.go:34: BRASD16/
main.go:34: BRDSolos15/
main.go:34: BRDSolos16/
main.go:34: BRDSolos17/
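
If you’d rather stay with the standard regexp package: a non-capturing group only controls what gets captured, not what the overall match covers, which is why (?:href\=\")\S* still returned the href=" prefix. The usual substitute for a lookaround is the opposite, a capturing group: match the whole attribute, then use only the submatch. A minimal sketch, reusing a couple of lines of the fragment above (and assuming the attribute values never contain embedded quotes):

package main

import (
	"fmt"
	"regexp"
)

const fragment = `<li><a href="BPL15/"> BPL15/</a></li>
<li><a href="BPL16/"> BPL16/</a></li>`

func main() {
	// The capture group takes the place of the lookbehind/lookahead:
	// the overall match still includes href="…", but m[1] holds only
	// the value between the quotes.
	re := regexp.MustCompile(`href="([^"]*)"`)
	for _, m := range re.FindAllStringSubmatch(fragment, -1) {
		fmt.Println(m[1]) // BPL15/, BPL16/, …
	}
}

That said, once the markup gets messier a real HTML parser is the more robust choice, which is why I’d still reach for goquery first.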

Wonderful stuff. The moment I get home I’ll give that a try.

Your suggestion will also help enormously with the part that comes after the URLs are collected.

Thank you!

