File cleaning support

I’ve a csv file which looks to be corrupted, so I want to do some cleaning.

I want to handle it as txt file for cleaning, then reading it as csv.

In the file I’ve some fields having "" and some does not, and notes that those having "" are containing a , inside it, examples:

Some numbers appearing as: "2,15.0 and some appearing as 22.3 without ""

Some texts appearing as: Manager, Supply Chain and some appearing as Supervisor without ""

My approach for solving this, is:

  1. Remove the , from the file if it is between ""
  2. Remove the "" from the file

Lets say I’ve the lines in the file as:

John, supervisor, 20.22
Mark, "Manager, SC", "3,200.0"
Joseph, "Technician, Electrical", 15.2
"Selphia, Henry", "Manager, Lab", "4,250.0"

Then the file after cleaning should be:

John, supervisor, 20.22
Mark, Manager SC, 3200.0
Joseph, Technician Electrical, 15.2
Selphia Henry, Manager Lab, 4250.0

Any thoughts, support?

1 Like

Don’t corrupt data while cleaning! "Manager, SC" should become Manager, SC.

Perform the conversion step-by-step, checking the results after each step.

John, supervisor, 20.22
Mark, "Manager, SC", "3,200.0"
Joseph, "Technician, Electrical", 15.2
"Selphia, Henry", "Manager, Lab", "4,250.0"

Read the .csv file as if a .txt file. For each line, replace commas not inside quotes ('"') by tabs ("\t" or 0x09). Write the input .csv file, after conversion, with a .tsv file extension. Check the output .tsv file. For visibility, I have shown the tab character ("\t" or 0x09) as \t in the examples.

John\t supervisor\t 20.22
Mark\t "Manager, SC"\t "3,200.0"
Joseph\t "Technician, Electrical"\t 15.2
"Selphia, Henry"\t "Manager, Lab"\t "4,250.0"

Read the .tsv file. For each field, trim leading and trailing spaces. If the first and last characters of the field are quotes ('"') trim them. Write the .tsv file. Check the output .tsv file.

John\tsupervisor\t20.22
Mark\tManager, SC\t3,200.0
Joseph\tTechnician, Electrical\t15.2
Selphia, Henry\tManager, Lab\t4,250.0

Read the .tsv file. For each field, if removing commas and strconv.ParseFloat returns err == nil then remove commas. Write the .tsv file. Check the output .tsv file.

John\tsupervisor\t20.22
Mark\tManager, SC\t3200.0
Joseph\tTechnician, Electrical\t15.2
Selphia, Henry\tManager, Lab\t4250.0

For .tsv files, use package encoding/csv:

rdr := csv.NewReader(file)
rdr.Comma = '\t'
1 Like

Thanks a lot, I tried the below code:

package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"regexp"
)

// https://yourbasic.org/golang/regexp-cheat-sheet/
func main() {
	// str1 := `"Selphia, Henry", "Manager, Lab", "4,250.0"` // {}
	input, err := ioutil.ReadFile("sample.csv")
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	re := regexp.MustCompile(`"\s*,\s*"|,\s*"|",\s*`)
	str := re.ReplaceAllString(string(input), `\t`) // `0x09`

	output := bytes.Replace([]byte(str), []byte(`"`), []byte(""), -1) // remove the `"` at begining and at end of each line
	if err = ioutil.WriteFile("modified.tsv", output, 0666); err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	fr, err := os.Open("modified.tsv")
	if err != nil {
		log.Fatal("Unable to open output")
	}
	//methods.FailOnError(err)
	defer fr.Close()
	rdr := csv.NewReader(fr)
	rdr.Comma = '\t'
	rows, err := rdr.ReadAll()
	if err != nil {
		log.Fatal("Unable to read data: ")
	}
	fmt.Printf("%s", rows[1])
}

The output was:

[Mark\tManager, SC\t3,200.0]

And if I tried reading it as:

fmt.Printf("%s", rows[1][2])

I get:

panic: runtime error: index out of range [2] with length 1

Regular expressions are the wrong solution.

The XY problem is asking about your attempted solution rather than your actual problem: The XY Problem.

Your csv file is not corrupt. Your solution would corrupt your csv file by removing commas from fields. Removing quotes surrounding fields is wrong. Your csv file could use some fixes to make it more usable.

RFC 4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files

Comma-separated values - Wikipedia

Quote characters (") are used to escape commas in fields. The escape for a quote in a field is two quotes ("").

One good test for a valid csv file is to read it into Microsoft Excel and LibreOffice Calc.


Leading spaces in an unquoted field are significant. Using ", " as a field separator instead of “,” is wrong. For example,

John, supervisor, 20.22

Using a comma in a numerice field (3,200.0) requires quotes to escape the comma. For example,

Mark, "Manager, SC", "3,200.0"

Microsoft Excel and LibreOffice Calc treat fields in quotes as text not numbers. Commas in a numeric field are unnecessary.

If we fix the last two issues then your csv file will be valid and usable. For example,

John, supervisor, 20.22
Mark, "Manager, SC", "3,200.0"
Joseph, "Technician, Electrical", 15.2
"Selphia, Henry", "Manager, Lab", "4,250.0"

becomes

John,supervisor,20.22
Mark,"Manager, SC",3200
Joseph,"Technician, Electrical",15.2
"Selphia, Henry","Manager, Lab",4250

Using Go package encoding/csv, there is a simple fix:

func cleanCSV(in io.Reader, out io.Writer) error {
    r := csv.NewReader(in)
    r.TrimLeadingSpace = true
    w := csv.NewWriter(out)
    for {
	    rec, err := r.Read()
	    if err != nil {
		    if err == io.EOF {
			    break
		    }
		    return err
	    }
	    for i, fld := range rec {
		    if strings.IndexByte(fld, ',') >= 0 {
			    fld = strings.Replace(fld, ",", "", -1)
			    if _, err := strconv.ParseFloat(fld, 64); err == nil {
				    rec[i] = fld
			    }
		    }
	    }
	    err = w.Write(rec)
	    if err != nil {
		    return err
	    }
    }
    w.Flush()
    if err := w.Error(); err != nil {
	    return err
    }
    return nil
}

The program (cleancsv.go) in full: Go Playground - The Go Programming Language

$ go build cleancsv.go

$ ./cleancsv -i=clean.in.csv -o=clean.out.csv
cleaned: in: clean.in.csv out: clean.out.csv

$ cat clean.in.csv
John, supervisor, 20.22
Mark, "Manager, SC", "3,200.0"
Joseph, "Technician, Electrical", 15.2
"Selphia, Henry", "Manager, Lab", "4,250.0"

$ cat clean.out.csv
John,supervisor,20.22
Mark,"Manager, SC",3200.0
Joseph,"Technician, Electrical",15.2
"Selphia, Henry","Manager, Lab",4250.0

$ 

To avoid the use of quotes to escape commas in fields, the program (cleantsv.go) outputs a tsv file: Go Playground - The Go Programming Language

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.