How can I restrict third-party URLs in go-colly?

anujdakwala · January 10, 2019, 9:23pm

Suppose I have to crawl the abc.com domain. While visiting its URLs, the crawler gets redirected to lots of third-party URLs such as facebook.com and google.com.

Does go-colly have any rules for restricting a crawl to certain domains, like Scrapy's LinkExtractor rules?

johandalabacka · January 10, 2019, 9:34pm

colly.Collector has a field AllowedDomains. Try setting this.

https://github.com/gocolly/colly/blob/master/colly.go#L54


	"github.com/PuerkitoBio/goquery"
	"github.com/antchfx/htmlquery"
	"github.com/antchfx/xmlquery"
	"github.com/kennygrant/sanitize"
	"github.com/temoto/robotstxt"


	"github.com/gocolly/colly/debug"
	"github.com/gocolly/colly/storage"
)


// Collector provides the scraper instance for a scraping job
type Collector struct {
	// UserAgent is the User-Agent string used by HTTP requests
	UserAgent string
	// MaxDepth limits the recursion depth of visited URLs.
	// Set it to 0 for infinite recursion (default).
	MaxDepth int
	// AllowedDomains is a domain whitelist.
	// Leave it blank to allow any domains to be visited
	AllowedDomains []string
	// DisallowedDomains is a domain blacklist.

johandalabacka · January 10, 2019, 9:50pm

The RedirectHandler field in Collector can also be used.

https://github.com/gocolly/colly/blob/master/colly.go#L105


	Async bool
	// ParseHTTPErrorResponse allows parsing HTTP responses with non 2xx status codes.
	// By default, Colly parses only successful HTTP responses. Set ParseHTTPErrorResponse
	// to true to enable it.
	ParseHTTPErrorResponse bool
	// ID is the unique identifier of a collector
	ID uint32
	// DetectCharset can enable character encoding detection for non-utf8 response bodies
	// without explicit charset declaration. This feature uses https://github.com/saintfish/chardet
	DetectCharset bool
	// RedirectHandler allows control on how a redirect will be managed
	RedirectHandler   func(req *http.Request, via []*http.Request) error
	store             storage.Storage
	debugger          debug.Debugger
	robotsMap         map[string]*robotstxt.RobotsData
	htmlCallbacks     []*htmlCallbackContainer
	xmlCallbacks      []*xmlCallbackContainer
	requestCallbacks  []RequestCallback
	responseCallbacks []ResponseCallback
	errorCallbacks    []ErrorCallback
	scrapedCallbacks  []ScrapedCallback
