Strategy to parse big XML files

In my organization I am working on a problem where we hit an internal URL to get pages of XML information that we need to parse and send up to an S3 bucket. Currently we do that in a Lambda running a Node.js application. Because of Node's leaky futures implementation we are unable to use parallel processing, so instead we hit the URL and traverse the chain of URLs sequentially to work around that problem.
I want to move the Node app to Go in order to parallelise this job.
A caveat is that the XML files can be 1 MB to 4 MB in size, and only minimal parsing needs to be done in the app. Node.js has good support for XML parsers.
What does XML parsing look like in Go?
I came across this issue, which put a bit of a dampener on my plan.

Can someone please suggest the best way to approach XML parsing in Go?

You could use a streaming parser like gosax if it is not necessary to parse the whole XML document into a tree that can be navigated in code.

Here is some background:

https://eli.thegreenplace.net/2019/faster-xml-stream-processing-in-go/
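
For comparison, the standard library's encoding/xml can also stream: xml.Decoder.Token returns one token at a time, so you never hold the whole tree in memory. A minimal sketch, where the hypothetical `<record>` elements stand in for whatever your documents actually contain:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

func main() {
	// In practice the reader would be the HTTP response body from your
	// internal URL; a string keeps the sketch self-contained.
	doc := strings.NewReader(`<records><record name="a"/><record name="b"/></records>`)

	dec := xml.NewDecoder(doc)
	count := 0
	for {
		tok, err := dec.Token() // one token at a time, never the whole tree
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// React only to the start tags you care about; skip everything else.
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "record" {
			count++
		}
	}
	fmt.Println("records:", count)
}
```

gosax takes the same token-at-a-time idea further with SAX-style callbacks; the article above benchmarks it against encoding/xml and shows where the SAX approach wins.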

Thanks for the information. I will try that library.

On one app I need very cursory parsing; on the other I need more detailed parsing. I am interested in validating whether the parallelism gained through Go would be an improvement at all over the Node.js setup we have today.

XML parsing is mainly IO-bound. Parsing multiple XML files concurrently may very well be faster than parsing them one after the other, but you will have to try it and benchmark.
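
If it helps while you evaluate this: the usual Go shape for that kind of job is a bounded worker pool reading URLs from a channel. A minimal sketch, assuming the page URLs can be enumerated up front; the example URLs, the pool size of 8, and the parse stub are all placeholders:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// parse is a placeholder for whatever minimal XML processing
// and S3 upload you actually need to do per page.
func parse(body io.Reader) error {
	_, err := io.Copy(io.Discard, body)
	return err
}

func main() {
	// Placeholder URLs; in your case these would be the internal pages.
	urls := []string{
		"http://internal.example/page1",
		"http://internal.example/page2",
	}

	jobs := make(chan string)
	var wg sync.WaitGroup

	// A small fixed pool of workers, so fetches and parses overlap;
	// that overlap is where the win over a sequential loop comes from.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				resp, err := http.Get(u)
				if err != nil {
					fmt.Println("fetch:", err)
					continue
				}
				if err := parse(resp.Body); err != nil {
					fmt.Println("parse:", err)
				}
				resp.Body.Close()
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
}
```

If each page only reveals the URL of the next one, the fetches themselves stay sequential, but you can still overlap each page's parsing and upload with the next fetch by handing response bodies off to the workers.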
