Parse an ugly txt file, split it into sections, join rows, and create a CSV file

Hello,
I’m a newbie with Go and I don’t know the best approach to solving this problem.

I have a txt file like this.
The final result should be a CSV file with (see my comment on the gist):

  • 1st row = date from line 4; start from line 5; end from line 5
  • 2nd row = lines 9, 10, 11, 13, 14, 15 joined (skipping line 12, always the fourth line of the “section”), with some fields removed and others reformatted (see the bold text in the comment)
  • 3rd row = lines 17, 18, 19, 21, 22, 23 joined (skipping line 20)

and so on.

I think the best solution is to split the file into “sections”, separated by
------------------------------------------------------------------------------------------------------------*-------------------------------------;

and after:

  • for the first section, ignore all lines except 4 and 5
  • for the other sections, remove the unwanted lines, join the rest, and do some work on the text

My first problem: how do I split the file into… a “slice of rows”?
Any hints? :grin:

Thanks a lot,
Marco
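The split-into-sections idea described above could be sketched in Go roughly like this. This is only a sketch: the separator test (a line starting with dashes) and the toy input are assumptions standing in for the real file format.

```go
package main

import (
	"fmt"
	"strings"
)

// isSeparator reports whether a line is one of the long
// dashed separator lines between sections.
func isSeparator(line string) bool {
	return strings.HasPrefix(line, "----")
}

// splitSections groups lines into sections, dropping the separator lines.
func splitSections(lines []string) [][]string {
	var sections [][]string
	var current []string
	for _, line := range lines {
		if isSeparator(line) {
			if len(current) > 0 {
				sections = append(sections, current)
				current = nil
			}
			continue
		}
		current = append(current, line)
	}
	if len(current) > 0 {
		sections = append(sections, current)
	}
	return sections
}

func main() {
	// Toy input standing in for the real export file.
	lines := []string{"header", "date line", "--------*-----;", "row 1", "row 2"}
	for i, s := range splitSections(lines) {
		fmt.Printf("section %d: %v\n", i, s)
	}
}
```

Each element of the returned `[][]string` is one section, i.e. a slice of its rows, which can then be processed independently.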

Perhaps better to use Logstash for this kind of thing?
https://www.elastic.co/products/logstash

What you describe would be pretty trivial for Logstash.

Thank you for the reply.
So, do you suggest using Logstash for this kind of job?
I have to create a CSV from this ugly bookkeeping export file and pass it to XLCubed. This… two or three times a month.

:slight_smile:

Perhaps not the full Logstash stack, but maybe you can find a standalone version of grok, the tool Logstash uses for parsing.

I have some doubts about the first block of code (it is a different event), but I can try this way (multiline codec and moooore grok).
I thought it was easier with Go. :speak_no_evil:

Grok is designed to parse data out of semi-structured text.

Most common parsing approaches aren’t suitable for that and need fixed grammars.

Of course, grok requires learning yet another language to actually define that parser.

But since you seem to have to deal with whole lines only (except the date), it might be easiest to just split the input by lines, access them by index, and write back into the CSV. Maybe use intermediate structs for input and output to keep semantics where necessary.
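That line-indexing idea with an intermediate struct might look something like this in Go. The struct fields, the indices, and the sample data are placeholders, since the real positions depend on the actual file layout:

```go
package main

import (
	"encoding/csv"
	"os"
	"strings"
)

// record is an intermediate struct that keeps some semantics
// instead of juggling raw slice indices everywhere.
// The field names here are illustrative only.
type record struct {
	Date  string
	Start string
	End   string
}

// sectionToRecord picks the interesting lines out of a section by index.
// Indices 0, 1, 2 are placeholders; the real ones depend on the file.
func sectionToRecord(section []string) record {
	return record{
		Date:  strings.TrimSpace(section[0]),
		Start: strings.TrimSpace(section[1]),
		End:   strings.TrimSpace(section[2]),
	}
}

func main() {
	// A toy section standing in for one parsed block of the input file.
	section := []string{" 23/06/2017 ", " 1 ", " 9999999 "}
	rec := sectionToRecord(section)

	w := csv.NewWriter(os.Stdout)
	w.Comma = ';' // the sample data is semicolon-separated
	w.Write([]string{rec.Date, rec.Start, rec.End})
	w.Flush() // prints 23/06/2017;1;9999999
}
```

`encoding/csv` from the standard library handles quoting and the custom `;` delimiter, so the output row never has to be assembled by hand.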

I’m a big Logstash fan…

If you wanted to IM me a link to

  • a sample data file
  • A manually created ‘result’, which shows how the data should look when parsed to your requirements.

I’d be happy to put a Logstash conf file together for you which does it.


Hi @maba

Thanks for the links and data. For what it’s worth (and at risk of being flamed for being WAY OT for this forum), here is a Logstash config which seems to produce what you need:

input {
    file {
        path => "/Users/me/Documents/Code/logstash/input.txt"
        ignore_older => 0
        codec => multiline {
            pattern => "^\d{1,2}\;\d{2}\/\d{2}.*"
            negate => "true"
            what => "previous"
        }
    }
}

filter {

    if [message] =~ /.*Date\s\:\;\d{1,2}\/\d{1,2}\/\d{4}.*/ {
        grok {
            match => { "message" => ".*Date\s\:\;(?<date>\d{1,2}\/\d{1,2}\/\d{4}).*\:\;(?<i>\d*).*numero\s(?<z>\d*)" }
            add_field => { "myComment" => "#Comment" }
        }
    }
    if [myComment] {
        mutate {
            add_field => { "output" => "%{myComment}: %{date};%{i};%{z}"}
            remove_field => ["myComment", "tags", "path", "host", "date", "i", "z", "@version", "message", "@timestamp" ]
        }
    }    
    if [message] =~ /RILEVAZIONE\sCOSTI/ {
        mutate {
            rename => { "message" => "i" } 
            split => { "i" => ";" }
            add_field => { "split" => "1" }
        }
        if [split] == "1" {
            mutate {
                gsub => ["[i]", "\r\n", "" ]
                gsub => ["[i]", "\s*", "" ]
                remove_field => [ "tags", "@version", "@timestamp", "split", "path", "host" ]
            }
            mutate {
                add_field => { "a1" => "%{[i][0]};%{[i][1]};%{[i][2]};%{[i][3]} - %{[i][4]};%{[i][5]};%{[i][6]};%{[i][7]};%{[i][8]};%{[i][9]};"}
                add_field => { "a2" => "%{[i][10]};;;%{[i][14]};;%{[i][17]};Abbuoini attivi;%{[i][20]}€;%{[i][21]};%{[i][26]};%{[i][27]};"}
                add_field => { "a3" => "%{[i][28]};%{[i][29]};%{[i][30]};%{[i][31]};%{[i][32]};%{[i][33]};%{[i][37]};150.00;" }
                add_field => { "result" => "%{a1}%{a2}%{a3}" }
                remove_field => [ "a1","a2","a3","i" ]
            }
        }       
    }
}

output {
    stdout {
        codec => rubydebug
    }
}

That will produce (based on your test data):

{
    "output" => "#Comment: 23/06/2017;1;9999999"
}
{
    "result" => "1;02/01/2017;02/01/2017;10 - RILEVAZIONECOSTI;Effettivo;No;A;/;1;//;;;Incremento;;47/5/29;Abbuoini attivi;-200,00€;abcd;1;CR100-centrodicostodipro;1;FF-dispesaFUEL;1;BD-BUDIESEL;1;15-intercompanyversoIT;-200,00;150.00;"
}
{
    "result" => "2;02/01/2017;02/01/2017;10 - RILEVAZIONECOSTI;Effettivo;No;A;/;1;//;;;Incremento;;39/5/18;Abbuoini attivi;1.000,00€;dcba;1;CR100-centrodicostodipro;1;FF-tipologiadispesaFUEL;1;BD-BUDIESEL;1;15-intercompanyversoIT;1.000,00;150.00;"
}
{
    "result" => "33;02/01/2017;02/01/2017;10 - RILEVAZIONECOSTI;Effettivo;No;A;/;1;//;;;Incremento;;39/5/18;Abbuoini attivi;150,00€;nuovacommessa;1;CR100-centrodicostodipro;1;FF-tipologiadispesaFUEL;1;BD-BUDIESEL;1;15-intercompanyversoIT;150,00;150.00;"
}
{
    "result" => "34;03/01/2017;03/01/2017;10 - RILEVAZIONECOSTI;Effettivo;No;A;/;1;//;;;Incremento;;39/5/36;Abbuoini attivi;300,00€;nuovacommessa;1;CR100-centrodicostodipro;1;FF-tipologiadispesaFUEL;1;BD-BUDIESEL;1;15-intercompanyversoIT;300,00;150.00;"
}

When you are happy with it, you can very easily change the output block to send the results to a file. More info here:
https://www.elastic.co/guide/en/logstash/2.3/plugins-outputs-file.html
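For example, a file output block might look something like this (the output path is a guess based on the input path in the config above, and `result` is the field that config builds):

```
output {
    file {
        path => "/Users/me/Documents/Code/logstash/output.csv"
        codec => line { format => "%{result}" }
    }
}
```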


This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.