How can I get the original string which a *html.Node was parsed from?
Rendering the node with html.Render is not sufficient for me, since I need the original string character by character.
I would prefer not to fork and modify the internals of golang.org/x/net/html, but I suspect it is the only way :-(.
I don’t think you can do this with that package as-is. Why do you need the original string character by character?
I’m writing a program which crawls a website looking for references (links, images, stylesheet imports, etc.) to unavailable resources and generates a report. In the report it would be useful to have the original string. Say it finds a link with a reference to an unavailable resource:
<a href="https://gøøglæ.com">google.com</a>
I would like the report to say something like:
Found a link <a href="https://gøøglæ.com">google.com</a>, GET https://gøøglæ.com: something went wrong
If I merely html.Render the node and include that in the report, I risk a user copying the HTML <a href="https://gøøglæ.com">google.com</a>, ctrl-f searching for it in their HTML file, and scratching their head over why they get 0 results. That is why I need the original string character by character.
For now I just wrote a warning message in the report saying something like “the HTML in the report might not be equal character-by-character to the HTML in the web page”, but I am not too happy with this solution.