Writing the needed amount to a csv file


(Justin) #1

Im trying to write a scraper that gets the data from the body of an RFC site, and then parses the data to a csv file. Trying also to get the top 20 most popular words with a length > 3. unfortunately, i cant seem to get the amount i want. This is my result


It basically gets all the words and their occurrences(in the form of a map, i.e key and value, and ignoring the limit i specified
I believe the problem is somewhere from line 152. Here’s a link to the full code
https://play.golang.org/p/RZ9iCRRKypA
Here’s the function that does the parsing

func writeSlice(results chan []MyMap, outfile string, headers []string) error {
	f, err := os.OpenFile(outfile, os.O_RDONLY|os.O_CREATE, 0666)
	if err != nil {
		return fmt.Errorf("couldnt open file %s", outfile)
	}
	defer f.Close()

	w := csv.NewWriter(f)

	w.Flush()

	if err = w.Write(headers); err != nil {
		return fmt.Errorf("couldnt write headers to file %s", outfile)
	}
	data := make([]string, 0, len(results))
	for v := range results {
		for i := 0; i < 20; i++ {
			data = append(data, v[i].key)
			data = append(data, strconv.Itoa(v[i].value))
			if err = w.Write(data); err != nil {
				return fmt.Errorf("couldnt write headers to file %s", outfile)
			}
		}
	}
	if err = w.Error(); err != nil {
		return fmt.Errorf("couldnt write headers to file %s", outfile)
	}
	return nil
}

My problem is that im not sure how to be specific and get only the top 20 results with a particular length and w.Write() the result


(Justin) #2

I can do something like this if im trying to just output the result

for i := 0; i < *concurrency; i++ {
		go func() {
			for value := range tasks {
				site, err := scraper(value.url, value.id, query)
				if err != nil {
					fmt.Printf("error parsing data %v", err)
				}
				for i := 0; i < 20; i++ {
					fmt.Printf("%s\t : %d\n", site[i].key, site[i].value)
				}
				results <- site
			}
			wg.Done()
		}()
	}

Not sure how to check if the length of the words > 3 yet


(Holloway) #3

Is it possible to do it under single thread? without using concurrency for the time being?


(Christophe Meessen) #4

Why using goquery ? The web page (https://tools.ietf.org/rfc/rfc%d.txt) is pure text. It doesn’t contain html.
There is no need for goquery.

Also, I don’t understand “… and then parses the data to a csv file”. What must you do exactly ? Do you mean writing ? What data would you want to write to the csv file ? The 20 most frequent words ?

Here is the code to count words with more than 3 letters. It’s more efficient to provide the code than to explain the algorithm. Note that it will consider a number as a word. Remove || unicode.IsNumber(c) if you want to consider words with letters only.

func countWordsIn(text string) map[string]int {
    var wordBegPos, runeCount int
    wordCounts := make(map[string]int)
    for i, c := range text {
        if unicode.IsLetter(c) || unicode.IsNumber(c) {
            if runeCount == 0 {
                wordBegPos = i
            }
            runeCount++
            continue
        }
        if runeCount > 3 {
            word := text[wordBegPos:i]
            count := wordCounts[word] // return 0 if word is not in wordCounts
            count++
            wordCounts[word] = count
        }
        runeCount = 0
    }
    return wordCounts
}

If you have to accumulate the word counts of different RFC text files, you need to accumulate the maps. You do it like this:

var totalWordCounts = make(map[string]int)
func accumulateWordCounts(wordCounts map[string]int) {
    for key, value := range wordCounts {
        totalWordCounts[key] = totalWordCounts[key] + value
    }
}

To get the 20 most frequent words, you could copy all word-count pairs in a slice, sort it in decreasing count order and return the first 20 word-count pairs.

A more efficient method is to use a heap ( import "container/heap"). A heap is a slice with the smallest or biggest value at index position 0. It will be the smallest or biggest depending on the comparison function provided. Here, you would only need a heap of 20 word-count pairs with the word with smallest count at index position 0 (the top of the heap). You then iterate over all word-count pairs in the map, and if the count is bigger than the count of the heap top, you replace the word-count pair at index position 0 with the word-count pair from the map. After that you call heap.Fix(h, 0) to get the new smallest word-count value at the top.

Here is the code I had some fun to implement.

package main

import (
    "container/heap"
    "fmt"
    "sort"
)

type elem struct {
    word  string
    count int
}

type elemHeap []elem

func (h elemHeap) Len() int           { return len(h) }
func (h elemHeap) Less(i, j int) bool { return h[i].count < h[j].count }
func (h elemHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h elemHeap) Push(x interface{}) { /* not used */ }
func (h elemHeap) Pop() interface{}   { /* not used */ return nil }

func mostFrequentWords(m map[string]int, nbrWords int) []elem {
    h := elemHeap(make([]elem, nbrWords))
    for word, count := range m {
        if count > h[0].count {
            h[0] = elem{word: word, count: count}
            heap.Fix(h, 0)
        }
    }
    sort.Slice(h, func(i, j int) bool { return h[i].count > h[j].count })
    return h
}

func main() {
    m := map[string]int{"the": 123, "this": 29, "house": 4, "hold": 8, "then": 27}
    r := mostFrequentWords(m, 3)
    for i := range r {
        fmt.Println(r[i])
    }
}

(Justin) #5

Sorry, i mean that i’m trying to write the results to the csv file. The results being the top 20 words word with > 3 letters


(Justin) #6

Yes, its possible. We can get the result from using just a single thread. But i doubt i can get multiple results w/o running concurrently, because im trying to go over 1000 RFC sites


(Justin) #7

Yeah, i need only words


(Christophe Meessen) #9

This is the mostFrequentWord function without using the heap package that requires to define the elemHeap type and functions. It implements the heap.Fix() function inside.

type elem struct {
    word  string
    count int
}

func mostFrequentWords(m map[string]int, nbrWords int) []elem {
	h := elemHeap(make([]elem, nbrWords))
	for word, count := range m {
		if count < h[0].count {
			continue
		}
		h[0] = elem{word: word, count: count}
		for i, j := 0, 0; i < nbrWords; i = j {
			j := i * 2
			if j < nbrWords && h[i].count > h[j].count {
				h[j], h[i] = h[i], h[j]
				continue
			}
			j++
			if j < nbrWords && h[i].count > h[j].count {
				h[j], h[i] = h[i], h[j]
				continue
			}
			break
		}
	}
	sort.Slice(h, func(i, j int) bool { return h[i].count > h[j].count })
	return h
}

There is a small problem with the implementation of countWordsIn(). Words will be sub-slices of the rfc text. As a side effect, the rfc text won’t be garbage collected and we may end with all the rfc text in memory. We shouldn’t bother about this because an rfc text is not that big and there are not many rfc. It will require a few megabytes at most.

In order to avoid this, we would have to create a copy of the word used as key of the wordCounts map. Unfortunately, there is not trivial and sure way to do this in Go.


(Justin) #11

modified it so i can pass a slice of string as a parameter, since most of my code was using a slice. Hope it was a good approach.

package main

import (
	"fmt"
	"strings"
	"unicode"
)

func main() {
	text := []string{"this is my is is my wow so is is my my so hey no yes"}
	m := countWordsIn(text)
	for k, v := range m {
		fmt.Printf("%d : %s\n", v, k)
	}
}

func countWordsIn(text []string) map[string]int {
	var wordBegPos, runeCount int
	wordCounts := make(map[string]int)
	texts := strings.Join(text, " ")
	for i, c := range texts {
		if unicode.IsLetter(c) {
			if runeCount == 0 {
				wordBegPos = i
			}
			runeCount++
			continue
		}
		if runeCount > 3 {
			word := texts[wordBegPos:i]
			count := wordCounts[word]
			count++
			wordCounts[word] = count
		}
		runeCount = 0
	}
	return wordCounts
}

(Christophe Meessen) #12

Here is an even better implementation of mostFrequentWords. It is simpler but slightly less efficient than the heap. It’s main benefit is that the result is exactly what you would get if you built a slice of the map content, sort it and return the sub-slice of nbrWords first entries. The algorithm is very simple. It builds a list of nbrWords elem that it keeps sorted and it insert new elem as needed.

func mostFrequentWords(m map[string]int, nbrWords int) []elem {
	h := elemHeap(make([]elem, 0, nbrWords))
	last := nbrWords - 1
	for word, count := range m {
		if len(h) < nbrWords {
			pos := sort.Search(len(h), func(i int) bool { return h[i].count < count || (h[i].count == count && h[i].word >= word) })
			if pos == len(h) {
				h = append(h, elem{count: count, word: word})
				continue
			}
			h = append(h, elem{})
			copy(h[pos+1:], h[pos:len(h)])
			h[pos] = elem{count: count, word: word}
			continue
		}
		if count >= h[last].count {
			pos := sort.Search(nbrWords, func(i int) bool { return h[i].count < count || (h[i].count == count && h[i].word >= word) })
			copy(h[pos+1:], h[pos:last])
			h[pos] = elem{count: count, word: word}
		}
	}
	return h
}

(Christophe Meessen) #13

modified it so i can pass a slice of string as a parameter, since most of my code was using a slice. Hope it was a good approach.

I’m lost. Why do you need a slice of strings ? Each rfc can be loaded in one text string.
Here is an example implementation of the scraper function. As you see, the RFC text is fully read into b which is a byte slice.

func scraper(url string, id int) (map[string]int, error) {
	client := &http.Client{
		Timeout: time.Minute,
	}
	resp, err := client.Get(url)
	if resp != nil {
		defer resp.Body.Close()
	}
	if err != nil {
		return nil, fmt.Errorf("error making request %v", err)
	}
	b, err := ioutil.ReadAll(resp.Body) 
	if err != nil {
		return nil, fmt.Errorf("error reading RFC text %v", err)
	}
	m := countWordsIn(string(b))
	return m, nil
}

The main routine would then call scraper for each url and accumulate the returned result to sum all word counts in all scraped RFC, if this is what you want to do.

As suggested by @hollowaykeanho, implement first a single threaded version and make sure it works before implementing a concurrent version.


(Christophe Meessen) #14

Sorry, the mostFrequentWords function using my own heap implementation was incorrect.

I implemented and corrected the functions. I also benchmarked them to identify the most efficient one.
You can find them in my github repo.

The following is the most simple and efficient implementation:

func mostFrequentWords(m map[string]int, nbrWords int) []elem {
	h := elemHeap(make([]elem, nbrWords))
	last := nbrWords - 1
	for word, count := range m {
		if count < h[last].count || (count == h[last].count && word > h[last].word) {
			continue
		}
		pos := sort.Search(nbrWords, func(i int) bool { return h[i].count < count || (h[i].count == count && h[i].word > word) })
		copy(h[pos+1:], h[pos:last])
		h[pos] = elem{count: count, word: word}
	}
	if len(m) < nbrWords {
		return h[:len(m)]
	}
	return h
}

Sorry for the noise.


(Justin) #15

Thanks a lot for your time. I’ll take my time to go through your implementation, and ask you questions also if theres need. Thanks again