Dataflow kit: a web scraping framework for extracting structured data from websites


(Dmitry) #1

Dataflow kit (DFK) has been released. It is a new scraping framework for extracting structured data from web pages. https://github.com/slotix/dataflowkit

Dataflow kit is fast: 50 pages can be fetched and parsed in about 4-6 seconds.

Dataflow kit can process large volumes of data. In our tests it took about 7 hours to parse approximately 4 million pages.
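To put the two figures above on the same scale, here is a small arithmetic sketch in Python that converts them to pages per second. Only the numbers quoted in this post are used; the helper name `pages_per_second` is made up for illustration.

```python
# Figures quoted in the post: 50 pages in 4-6 s, about 4 million pages in about 7 h.
pages_small, secs_low, secs_high = 50, 4, 6
pages_large, hours_large = 4_000_000, 7


def pages_per_second(pages: int, seconds: float) -> float:
    """Average throughput over a whole run."""
    return pages / seconds


small_fast = pages_per_second(pages_small, secs_low)            # best case for the 50-page run
small_slow = pages_per_second(pages_small, secs_high)           # worst case for the 50-page run
large_avg = pages_per_second(pages_large, hours_large * 3600)   # 7 h = 25,200 s

print(f"50-page run: {small_slow:.1f} to {small_fast:.1f} pages/s")
print(f"4M-page run: {large_avg:.1f} pages/s on average")
```

The large run averages roughly 159 pages per second, against roughly 8 to 12 for the 50-page run. That gap suggests the two tests ran under different conditions (for example, different levels of parallelism), but the post does not give those details.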

Other notable benefits of our product are:

• Work with any interactive site (JavaScript-driven pages are rendered using headless Chrome)

• Scrape a website behind a login form

• Extract data from multiple pages.

• Scrape infinite-scroll pages.

• Crawl detail pages; extract and follow links.

• Skip intermediary pages while scraping, for example when you need an array of goods as the result and want to skip summary pages.

• Follow the directives of robots.txt

• Save results as CSV, JSON, XML
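For readers unfamiliar with the last two bullets, here is a minimal sketch of what honoring robots.txt and saving results as CSV, JSON and XML can look like. It is generic Python using only the standard library, not Dataflow kit's own API. The robots.txt content, the `example-bot` agent name, the sample records and all function names are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content and scraped records, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""
RECORDS = [
    {"title": "Widget", "price": "9.99"},
    {"title": "Gadget", "price": "19.50"},
]


def allowed(url: str, robots_txt: str, agent: str = "example-bot") -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory text, no network access
    return parser.can_fetch(agent, url)


def to_csv(records: list[dict]) -> str:
    """Serialize records as CSV, using the first record's keys as the header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json(records: list[dict]) -> str:
    """Serialize records as a JSON array."""
    return json.dumps(records, indent=2)


def to_xml(records: list[dict]) -> str:
    """Serialize records as XML; assumes every key is a valid XML element name."""
    root = ET.Element("items")
    for rec in records:
        item = ET.SubElement(root, "item")
        for key, value in rec.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for url in ("https://example.com/catalog", "https://example.com/private/data"):
        print(url, "allowed" if allowed(url, ROBOTS_TXT) else "blocked")
    print(to_csv(RECORDS))
    print(to_json(RECORDS))
    print(to_xml(RECORDS))
```

With the sample rules, the `/catalog` URL is allowed and the `/private/data` URL is blocked. A real scraper would fetch robots.txt from the target site and run this check before each request.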

