Strategy to parse big XML files

In my organization I am working on a problem where we hit an internal URL to get pages of XML information that we need to parse and send up to an S3 bucket. Currently we do that in a Lambda running a Node.js application. Because of Node's leaky futures implementation we are unable to use parallel processing, so instead we hit the URL and traverse the chain of URLs sequentially to work around that problem.
I want to move the Node app to Go in order to parallelise this job.
A caveat is that the XML files can be 1 MB to 4 MB in size, and only minimal parsing needs to be done in the app. Node.js has good support for XML parsers.
What does XML parsing look like in Go?
I came across this issue, which put a bit of a dampener on my plan.

Can someone please suggest the best way to approach XML parsing in Go?

You could use a streaming parser like gosax if it is not necessary to parse the whole XML document into a tree that can be navigated in code.

Here is some background:

https://eli.thegreenplace.net/2019/faster-xml-stream-processing-in-go/
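
For comparison, the standard library's encoding/xml can also stream: xml.Decoder.Token returns one token at a time, so you never hold the whole tree in memory. A minimal sketch, where the hypothetical `<record>` elements stand in for whatever your documents actually contain:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

func main() {
	// In practice the reader would be the HTTP response body from your
	// internal URL; a string keeps the sketch self-contained.
	doc := strings.NewReader(`<records><record name="a"/><record name="b"/></records>`)

	dec := xml.NewDecoder(doc)
	count := 0
	for {
		tok, err := dec.Token() // one token at a time, never the whole tree
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// React only to the start tags you care about; skip everything else.
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "record" {
			count++
		}
	}
	fmt.Println("records:", count)
}
```

gosax takes the same token-at-a-time idea further with SAX-style callbacks; the article above benchmarks it against encoding/xml and shows where the SAX approach wins.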

Thanks for the information. I will try that library.

On one app I need very cursory parsing; on the other I need more detailed parsing. I am interested in validating whether the parallelism gained through Go would be an improvement at all over the Node.js setup we have today.

XML parsing is mainly IO-bound. Parsing multiple XML files concurrently may very well be faster than parsing them one after the other, but you will have to try it and benchmark.
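
If it helps while you evaluate this: the usual Go shape for that kind of job is a bounded worker pool reading URLs from a channel. A minimal sketch, assuming the page URLs can be enumerated up front; the example URLs, the pool size of 8, and the parse stub are all placeholders:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// parse is a placeholder for whatever minimal XML processing
// and S3 upload you actually need to do per page.
func parse(body io.Reader) error {
	_, err := io.Copy(io.Discard, body)
	return err
}

func main() {
	// Placeholder URLs; in your case these would be the internal pages.
	urls := []string{
		"http://internal.example/page1",
		"http://internal.example/page2",
	}

	jobs := make(chan string)
	var wg sync.WaitGroup

	// A small fixed pool of workers, so fetches and parses overlap;
	// that overlap is where the win over a sequential loop comes from.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				resp, err := http.Get(u)
				if err != nil {
					fmt.Println("fetch:", err)
					continue
				}
				if err := parse(resp.Body); err != nil {
					fmt.Println("parse:", err)
				}
				resp.Body.Close()
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
}
```

If each page only reveals the URL of the next one, the fetches themselves stay sequential, but you can still overlap each page's parsing and upload with the next fetch by handing response bodies off to the workers.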
