Dataflow kit (DFK) has been released. DFK is a new scraping framework for extracting structured data from web pages. https://github.com/slotix/dataflowkit
Dataflow kit is fast: 50 pages can be fetched and parsed in about 4-6 seconds.
Dataflow kit can also process large volumes of data. In our tests, it took about 7 hours to parse approximately 4 million pages.
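To put the two figures above on the same scale, here is a small arithmetic sketch in Python. It only divides the page counts by the durations quoted above; the function name is ours, and the hardware and concurrency behind each figure are not stated here, so treat the results as rough averages.

```python
def pages_per_second(pages: int, seconds: float) -> float:
    """Average throughput for a run that handled `pages` pages in `seconds` seconds."""
    return pages / seconds


# 50 pages in about 4-6 seconds (figures quoted above).
batch_fast = pages_per_second(50, 4)
batch_slow = pages_per_second(50, 6)

# About 4 million pages in about 7 hours (figures quoted above).
bulk = pages_per_second(4_000_000, 7 * 3600)

print(f"50-page batch: {batch_slow:.1f} to {batch_fast:.1f} pages/s")
print(f"4M-page run:   about {bulk:.0f} pages/s")
```

This prints roughly 8.3 to 12.5 pages/s for the 50-page batch and about 159 pages/s for the 4-million-page run.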
Other notable benefits of our product:
• Work with any interactive site (JavaScript-driven pages are rendered using headless Chrome).
• Scrape websites behind a login form.
• Extract data from multiple pages.
• Scrape pages with infinite scrolling.
• Crawl detail pages: extract and follow links.
• Skip intermediary pages while scraping. For example, skip summary pages when you only need an array of goods as the result.
• Follow the directives of robots.txt.
• Save results as CSV, JSON, or XML.
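
For readers who want a feel for what the last two points involve, here is a minimal Python sketch using only the standard library. It is not Dataflow kit's own API: the sample records, function names, user agent, and example.com URLs are all hypothetical and serve only to illustrate checking robots.txt rules and writing the same records as JSON, CSV, and XML.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Hypothetical scraped records (an "array of goods").
records = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget", "price": "19.50"},
]


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def to_json(rows: list[dict]) -> str:
    return json.dumps(rows, indent=2)


def to_csv(rows: list[dict]) -> str:
    if not rows:
        return ""
    buf = io.StringIO()
    # Column order is taken from the first record.
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows: list[dict]) -> str:
    root = ET.Element("items")
    for row in rows:
        item = ET.SubElement(root, "item")
        for key, value in row.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical robots.txt content that blocks one directory for all agents.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    print(allowed(ROBOTS_TXT, "example-bot", "https://example.com/goods"))
    print(allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/page"))
    print(to_json(records))
    print(to_csv(records))
    print(to_xml(records))
```

In this sketch the robots.txt check runs before any fetch, and the three writers take the same list of records, so choosing an output format does not affect the scraping step.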