Storing data into an archive

Hi everyone, I’m new to Go development and I have a small question for you today.
I need to develop a small package that can store files into an archive. Once the files are written to the archive and the main program closes, the next time I run the program I will need to add more data to the archive, not only as new files but also by appending new data to files already in the archive.

My first thought was to go with the “archive/zip” library, only to find out that it supports neither adding new files to an existing zip nor appending content to existing files within the zip. There is even a GitHub issue about this with some ideas/suggestions. It also seems the same applies to tar and gzip.

The question is: is there a best practice or recommended way to achieve the above? Should I extend “archive/zip” to handle the cases I need?


Hi, welcome to Go Forum :rocket: :fireworks:

As far as I understand the mechanics of compressed archives, you need to re-compress every time you modify/delete the payloads to keep the compression effective.

Hence, Append/Delete is considered a costly “good-to-have” feature. What you can do is design wrapper functions on top of the standard library (if you really need it). The wrapper should follow this sequence:

  1. decompress the archive.
  2. update the payload.
  3. re-compress the payload.

Thanks for the answer. I feel like I’m approaching this all the wrong way and that a compressed archive might not be the right way to go. But since I will have many different files, I thought it might be kinda nice to have everything inside a “container” instead of having everything lying around in a folder.

It depends on the requirements and the nature of the payloads. If the total uncompressed size is small (<5GB) and modifications are infrequent, you can use your method.

Otherwise, you might need to consider other approaches, like scattering the archive into multiple small archives (reorganization). It’s more about strategizing your approach.

It boils down to:

  1. What are the average and edge-case file sizes for each payload change?
  2. How frequently would you change the payload?
  3. How does your customer access the payload via your compression choice?

That’s a different story. On Linux (not sure about Mac / Windows), we can mount an encrypted storage drive as a “container” to store everything inside. This method only needs os/exec for a one-time mount.

This method makes it even easier to modify the payload, without needing to pack it into a compressed archive. Even if that is a requirement, you can easily archive that mounted directory at will.
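As a rough illustration of that one-time mount via os/exec (Linux-only; the device path, mapper name, and mount point below are placeholders, and this needs root privileges plus cryptsetup installed):

```go
package main

import (
	"fmt"
	"os/exec"
)

// mountContainer opens a LUKS-encrypted device and mounts it as a
// "container" directory. This is a Linux-only sketch: the arguments
// are hypothetical and the calls will fail without root/cryptsetup.
func mountContainer(device, name, mountpoint string) error {
	// e.g. cryptsetup open /dev/sdb1 mydata
	if out, err := exec.Command("cryptsetup", "open", device, name).CombinedOutput(); err != nil {
		return fmt.Errorf("cryptsetup: %v: %s", err, out)
	}
	// e.g. mount /dev/mapper/mydata /mnt/mydata
	if out, err := exec.Command("mount", "/dev/mapper/"+name, mountpoint).CombinedOutput(); err != nil {
		return fmt.Errorf("mount: %v: %s", err, out)
	}
	return nil
}
```

From then on the Go program just reads and writes ordinary files under the mount point; no archive library is involved until you decide to pack the directory up.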

The costs are:

  1. it is not platform-independent
  2. it requires you to learn a few things like RAID 1 (optional), cryptsetup, and LVM.

I believe Windows already has such a solution. (I haven’t been a Windows user for >5 years, so :yum:)


Thank you for your answer. I’ve done some testing and essentially:

  1. What are the average and edge-case file sizes for each payload change?
    File sizes can increase a lot and go over 5GB. They are mainly JSON-encoded files.
  2. How frequently would you change the payload?
    Very frequently
  3. How does your customer access the payload via your compression choice?
    If they want to uncompress the archive themselves they can, or they can open it with the Go program.

Having everything in a zip container seems a very nice thing to have, as the compression would save a lot of space based on my preliminary tests :frowning:

Are you saying you have 5GB of JSON that’s updated frequently? What does your application do?

I should clarify: the update is just an append. I don’t change old data; only new data gets appended.

Because of the big files you describe, I suggest an approach like logrotate, rather than compressing one file. Beyond a certain size it will become unusable.

What I’m doing is saving HTTP requests and responses coming into a proxy. I am not considering SQLite because I don’t really need a relational database, and I was advised to consider a simple JSON file to which I would append the requests and responses. The thing is that it seems to easily create large files. A history of roughly 244 items is already a couple of MB.

It boils down to how your customer queries the information to update their local “database” at a particular time. This is a “database” and “design” question.

For example, for large tracker data like per-second IoT sensor dumps, I would organize the data into chained archives by 30-minute timestamps (similar to logging mechanisms, or git in software development). Then on the client side, a function simply reconstructs these archives into a cacheable database after receiving them one at a time.

This is done for 2 reasons:

  1. Small archives are a lot easier to transmit and distribute over a network.
  2. The client can consume the data one stage at a time (e.g. with or without a particular update)

Large archive files will at some point be sliced into multiple small fragments for effective network transmission anyway, so one big archive is redundant effort. Also, in case you need to optimize or secure the “database”, it means a lot of wait time on a consumer-level laptop.

If you really do need to go with a single file that grows beyond 5GB (assuming it peaks at 5TB), I would suggest you use some kind of transactional database, such as a NoSQL database.

If you prefer:

  1. no relation
  2. text-based file
  3. transaction
  4. do not want the user to install this or that dependency

Eventually you will face problems with querying, so you might consider opting for a database early.

There are a lot of databases for you to consider if none of the above suits the requirements.


This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.