Review of existing systems
Web data extraction systems can be classified by the type of selector they use to locate content: CSS selectors or XPath expressions. CSS selectors are generally faster and are natively supported by browsers. Ducky uses CSS selectors to extract data from similarly structured pages.
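To make the selector-based approach concrete, the sketch below hand-translates a CSS selector into its equivalent XPath and runs it with the standard library's ElementTree; CSS engines such as lxml's cssselect perform this kind of translation internally. The sample markup and selector are illustrative, not taken from Ducky.

```python
import xml.etree.ElementTree as ET

# Illustrative page fragment with a repeated ("similarly structured") pattern.
page = """
<html><body>
  <div class="item"><span class="price">10.00</span></div>
  <div class="item"><span class="price">12.50</span></div>
</body></html>
"""

# CSS selector: div.item > span.price
# Hand-translated to the XPath subset that ElementTree supports:
xpath = ".//div[@class='item']/span[@class='price']"

root = ET.fromstring(page)
prices = [el.text for el in root.findall(xpath)]
print(prices)  # ['10.00', '12.50']
```

Because each `div.item` block has the same structure, one selector extracts every occurrence of the field across the page.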
XPath expressions, on the other hand, are more reliable, handle text matching better, and offer a more powerful way to locate elements than CSS selectors. This remains an active area of research. OXPath provides an extension of XPath. The system created by V. Crescenzi, P. Merialdo, and D. Qiu uses XPath expressions to locate the training data from which it generates queries posed to the workers of a crowdsourcing platform.
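One concrete advantage of XPath is matching on an element's text content, which a CSS selector alone cannot express. The sketch below (with made-up markup) selects the pagination link whose visible text is "Next", using the XPath subset supported by the standard library's ElementTree in Python 3.7+.

```python
import xml.etree.ElementTree as ET

# Illustrative pagination fragment.
page = """
<div>
  <a href="/page/1">Prev</a>
  <a href="/page/3">Next</a>
</div>
"""

root = ET.fromstring(page)
# Select the <a> element whose complete text content equals "Next".
next_link = root.find(".//a[.='Next']")
print(next_link.get("href"))  # /page/3
```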
Systems such as Ducky and Deixto are driven by configuration files in which the user supplies simple details such as the base pages and a "next" field when multiple pages must be parsed. Deixto additionally supports tag filtering, whereby unnecessary HTML tags are ignored when the DOM (Document Object Model) tree is created.
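A configuration file of this kind might look like the following; the field names and values here are purely illustrative and do not reflect the actual schema of Ducky or Deixto.

```json
{
  "base_page": "http://example.com/products?page=1",
  "next": "a.next-page",
  "fields": {
    "title": "div.item > h2",
    "price": "div.item > span.price"
  }
}
```

The appeal of this design is that a non-programmer only fills in URLs and selectors, and the system handles crawling and extraction.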
Scrapy, an open-source project, provides a framework for web crawlers and extractors. It supports spider programs that are written by hand to extract data from the web, and it uses XPath expressions to locate content. The output formats of both Ducky and Scrapy include XML, CSV, and JSON files.