How to make 1 million GET requests with Golang concurrency on a low-end Windows PC


(Md Ariful Islam Protik) #1

My code works fine when my ip.txt file has 3K lines. When I increase it to 50K lines, meaning I want to make 50K HTTP GET requests concurrently, I get:

SOCKET: Too many open files 

I tried many solutions. I tried reducing the goroutine count but I still get the same error. My code:

const (
    // this is where you can specify how many maxFileDescriptors
    // you want to allow open
    maxFileDescriptors = 1000
)
var wg sync.WaitGroup

// Final iteration
func main() {
    file, err := os.Open("ip.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()
    outfile, err := os.Create("urls.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer outfile.Close()
    results := make(chan []string, maxFileDescriptors)
    go func() {
        for output := range results {
            for _, url := range output {
                fmt.Fprintln(outfile, url)
            }
        }
    }()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        wg.Add(1)
        go Grabber(scanner.Text(), results)

    }
    wg.Wait()
    close(results)

    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}

Grabber() just sends the request, parses the data, and returns a list of strings on its channel. Any solution? How can I make it work without changing ulimit? I don't know how to change ulimit on Windows. :sob::sob:

func Grabber(ip string, results chan []string) {
	defer wg.Done()
	var output []string
	if ip == "" {
		return
	}
	page := 1
	for page < 251 {
		client := &http.Client{}
		req, err := http.NewRequest(
			http.MethodGet,
			fmt.Sprintf(
				"http://www.bing.com/search?q=ip:%s+&count=50&first=1",
				url.QueryEscape(ip),
			),
			nil,
		)
		if err != nil {
			fmt.Println(err.Error())
		}
		req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:57.0) Gecko/20100101 Firefox/57.0")
		res, err := client.Do(req)
		if err != nil {
			fmt.Println("Invalid Request")
			return
		}
		defer res.Body.Close()
		body, err := ioutil.ReadAll(res.Body)
		if err != nil {
			fmt.Println("Couldn't Read")
		}
		re := regexp.MustCompile(`<h2><a href="(.*?)"`)
		links := re.FindAllString(string(body), -1)
		if links != nil {
			for l := range links {
				o := strings.Split(links[l], `"`)
				d := strings.Split(o[1], "/")
				s := d[0] + "//" + d[2]
				if !stringInArray(s, output) {
					output = append(output, s)
				}
			}
		}
		page = page + 50
	}
	results <- output
	for _, links := range output {
		fmt.Println(links)
	}
}

(Qi Yin) #2

I don't think that more goroutines always make a program run faster. Hardware resources are limited. In purely CPU-bound programs, I try to keep the number of goroutines close to the number of logical CPU cores and not far above it. This reduces context switching and keeps computation fast.

When I need to handle a lot of IO, I use more goroutines so that the CPU can switch to other tasks while one is blocked on IO. The right number depends on the IO latency. For example, if I request a web page with a latency of 1 second on an 8-core server, I generally start 50-100 goroutines, so that most of the CPU time goes to program logic instead of context switching or waiting on IO.

Here is an example of task distribution; I don’t know if it will help you. How many goroutines do you need? You have to measure it yourself to find the number that gives the best CPU utilization.

package main

import (
	"net/http"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	urls := make([]string, 1000000)
	gnum := 100
	wg.Add(gnum)

	queue := make(chan string)
	for i := 0; i < gnum; i++ {
		go func() {
			defer wg.Done()
			request(queue)
		}()
	}

	for _, url := range urls {
		queue <- url
	}

	close(queue)
	wg.Wait()
}

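func request(queue chan string) {
	for url := range queue {
		resp, err := http.Get(url)
		if err != nil {
			continue // skip URLs that fail
		}
		// Close the body right away so the connection (and its socket) is released.
		resp.Body.Close()
	}
}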


(Ali Koyuncu) #3

You can change your design to work around the Windows handle limitation. Create one buffered channel that receives your IPs and one func that receives IPs from that channel. Something like this:

// benchmark different channel lengths to find the best one for you (CPU + IO)
var jobs = make(chan string, 100)

// Response stands for whatever type you use for results.
func run(ctx context.Context, results chan<- Response) error {
	for {
		select {
		case ip, ok := <-jobs:
			if !ok {
				return nil // jobs channel closed, nothing left to do
			}
			// make the request for ip (with a timeout) and send the Response to results
			_ = ip
		case <-ctx.Done():
			return ctx.Err() // handle context cancellation
		}
	}
}

This link could give you some tips on how to create a worker pool.


(Christophe Meessen) #4

You have a defer leak in your code. Inside the for page < 251 loop you make an HTTP request and defer closing the body. A defer is only executed when the function returns, so the deferred calls pile up, one per loop iteration (five per Grabber call, since page advances by 50), and none of those response bodies is closed until Grabber returns.

To fix this bug, I suggest you replace this:

defer res.Body.Close()
body, err := ioutil.ReadAll(res.Body)

with this:

body, err := ioutil.ReadAll(res.Body)
res.Body.Close()

Not sure it will solve your problem, but it may help since not closing the body will keep the connection open.


(Senseye) #5
  1. You can compile the regex once at package level, like the WaitGroup:
var re = regexp.MustCompile(`<h2><a href="(.*?)"`)
  2. You can use fasthttp to reuse memory
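
As a rough illustration of the first point, the pattern can live in a package-level variable that every goroutine shares (a compiled *regexp.Regexp is safe for concurrent use). The sketch below goes a bit further than the suggestion, and that part is an assumption about what the original code intends: extractOrigins is a hypothetical helper that uses the capture group and net/url instead of splitting strings, and a map instead of scanning the output slice for duplicates.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// Compiled once at package level and shared by all goroutines.
var linkRe = regexp.MustCompile(`<h2><a href="(.*?)"`)

// extractOrigins is a hypothetical helper that returns the unique
// scheme://host prefixes of the links found in body, in order of appearance.
func extractOrigins(body string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, m := range linkRe.FindAllStringSubmatch(body, -1) {
		u, err := url.Parse(m[1]) // m[1] is the captured href value
		if err != nil || u.Host == "" {
			continue
		}
		origin := u.Scheme + "://" + u.Host
		if !seen[origin] {
			seen[origin] = true
			out = append(out, origin)
		}
	}
	return out
}

func main() {
	body := `<h2><a href="http://example.com/a">A</a></h2>` +
		`<h2><a href="http://example.com/b">B</a></h2>` +
		`<h2><a href="https://example.org/">C</a></h2>`
	fmt.Println(extractOrigins(body)) // prints [http://example.com https://example.org]
}
```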

(Gabriel Nelle) #6

I’m pretty sure this is the problem. If the body is not closed, the connection is not closed properly. The operating system then cannot release the connection immediately (it only does so after a timeout), so you quickly run into a connection limit. (Did that on our network and kind of crashed our internet connection with it after a few days. Don’t ask :wink: )