encoding/xml Unmarshal on dynamically structured elements

Hi,

I’m working with EPUBs and need to fetch the cover image from the cover.xhtml file (or whichever file is referenced in the .opf file). My problem is that the element structure of the cover.xhtml file is not fixed.

Each EPUB structures its cover.xhtml file differently. For example:

<body>
    <figure id="cover-image">
        <img src="covers/9781449328030_lrg.jpg" alt="First Edition" />
    </figure>
</body>

Another EPUB’s cover.xhtml file:

<body>
    <div>
        <img src="@public@vhost@g@gutenberg@html@files@54869@54869-h@images@cover.jpg" alt="Cover" />
    </div>
</body>

I need to fetch the img tag’s src attribute from this file, but I haven’t been able to.

Here is the part of my code that deals with unmarshalling the cover.xhtml file:

type CPSRCS struct {
    Src string `xml:"src,attr"`
}

type CPIMGS struct {
    Image CPSRCS `xml:"img"`
}

XMLContent, err = ioutil.ReadFile("./uploads/moby-dick/OPS/cover.xhtml")
CheckError(err)

coverFile := CPIMGS{}
err = xml.Unmarshal(XMLContent, &coverFile)
CheckError(err)
fmt.Println(coverFile)

The output is:

{{}}

The output I’m expecting is:

{{covers/9781449328030_lrg.jpg}}

Thanks in advance!

I like using github.com/PuerkitoBio/goquery to extract data from HTML files. Since it uses CSS selectors, you can just tell it which elements you want, no matter how deeply they are nested.

Here’s an example using your files:

package main

import (
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	log.SetFlags(log.Lshortfile)

	files := []string{
		`
<body>
    <figure id="cover-image">
        <img src="covers/9781449328030_lrg.jpg" alt="First Edition" />
    </figure>
</body>
`,
		`
<body>
    <div>
        <img src="@public@vhost@g@gutenberg@html@files@54869@54869-h@images@cover.jpg" alt="Cover" />
    </div>
</body>
`,
	}

	for i, file := range files {
		d, err := goquery.NewDocumentFromReader(strings.NewReader(file))
		if err != nil {
			log.Fatalln(i, err)
		}

		d.Find("img").Each(func(j int, s *goquery.Selection) {
			src, ok := s.Attr("src")
			if !ok {
				log.Println("src not found")
				return
			}

			log.Println(src)
		})
	}
}

The output is:

main.go:43: covers/9781449328030_lrg.jpg
main.go:43: @public@vhost@g@gutenberg@html@files@54869@54869-h@images@cover.jpg
