How do I get the original string which a `*html.Node` was parsed from?

Hello

How can I get the original string which a *html.Node was parsed from?

Rendering the node with html.Render is not sufficient for me, since I need the original string character by character.

I would prefer not to fork and modify the internals of golang.org/x/net/html, but I suspect it is the only way :-(.

Any thoughts?

Thanks

I don’t think you can do this with that package as-is. Why do you need the original string character by character?

I’m writing a program that crawls a website looking for references (links, images, stylesheet imports, etc.) to unavailable resources and generates a report. In the report it would be useful to have the original string. Say it finds a link that references an unavailable resource, <a href="https://gøøglæ.com">google.com</a>. I would like the report to say something like:

Found a link <a href="https://gøøglæ.com">google.com</a>, GET https://gøøglæ.com: something went wrong

If I merely html.Render the node and include that in the report, I risk a user copying the HTML <a href="https://gøøglæ.com">google.com</a>, searching for it with Ctrl-F in their HTML file, and scratching their head over why they get 0 results. That is why I need the original string character by character.

For now I just wrote a warning message in the report saying something like “the HTML in the report might not be equal character-by-character to the HTML in the web page”, but I am not too happy with this solution.
