Concepts

Our software uses the following concepts throughout the data extraction process.

Project

A Project is a container for all the elements used to create web scraping agents and data extractors. It contains patterns, agents, crawlers, and everything else you need to automate the process of data extraction.

Agent

An Agent is the essential tool for extracting data from target web sites. It collects data the way a human would: it loads pages, clicks links, types text, extracts text and images, modifies the extracted data, and stores it in a data storage (CSV, Excel, database, etc.). Read more about Agent here.

Extractor

An Extractor is a simplified Agent. Unlike an Agent, it cannot click links or interact with the web site, for example by running a search. Its main goal is to load pages and extract data. The pages can come from a list of URLs, URLs in a CSV or Excel file, local HTML files, URLs or HTML pages stored in a database, or a URL generator. To increase extraction speed, an Extractor can process several pages simultaneously. Read more about Extractor here.
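The idea of loading pages from a list of URLs and processing several of them at once can be sketched in Python. This is only an illustration of the concept, not the product's implementation: `fetch_page`, `run_extractor`, and their parameters are hypothetical names, and a thread pool is assumed as the way to work on several pages simultaneously.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Download one page and return its HTML as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def run_extractor(urls, extract, fetch=fetch_page, workers=4):
    """Load every URL and apply `extract` to its HTML, several pages at a time.

    Returns a list of (url, extracted_data) pairs in the same order as `urls`.
    """
    def process(url):
        return url, extract(fetch(url))

    # The pool works on up to `workers` pages at once; map() keeps input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, urls))
```

Passing `fetch` as a parameter means the same loop could read from local HTML files or a database instead of the network, which mirrors the different page sources listed above.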

Crawler

A Crawler is a bot that browses all pages of a specific web site and stores the pages that meet given criteria. It is not used directly to extract web data. The main purpose of the Crawler is to store web pages for further processing by an Agent or Extractor. Read more about Crawler here.
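A minimal sketch of this behavior, assuming a breadth-first walk over same-site links, is shown here. The names `crawl`, `fetch`, `keep`, and `max_pages` are hypothetical and do not come from the product: `fetch` is any function that returns the HTML for a URL, and `keep` is the criterion that decides whether a page is stored.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, keep, max_pages=100):
    """Visit pages of one site and return {url: html} for pages that pass `keep`."""
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    stored = {}
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited += 1
        if keep(url, html):
            stored[url] = html  # kept for later processing
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored
```

The crawler itself extracts nothing; the stored pages are what a later extraction step would consume.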

Pattern

A Pattern is the basic instrument for extracting data from a web page. Once defined, a Pattern can extract data from other similar pages on the same web site. Patterns are configured with the WebSundew point-and-click wizards. There are different types of Patterns. Details Patterns are used to extract static data (useful for product details pages). List Patterns are used to extract tables, lists, etc. Page Patterns are used to find Next page links. Find more info about Patterns here.
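To illustrate how one definition can be reused across similar pages, here is a rough sketch that uses regular expressions as a stand-in for Patterns. This is an assumption made for the example only: the product configures Patterns through wizards, and the markup, the expressions, and the function names below are all hypothetical.

```python
import re

# Details pattern: one record per page (e.g. a product details page).
DETAILS = re.compile(
    r'<h1>(?P<name>.*?)</h1>.*?<span class="price">(?P<price>.*?)</span>', re.S
)
# List pattern: many repeated items on one page.
LIST_ITEM = re.compile(r'<li class="item">(?P<item>.*?)</li>', re.S)
# Page pattern: the link to the next page.
NEXT_PAGE = re.compile(r'<a class="next" href="(?P<href>[^"]+)"')


def apply_details(html):
    """Return a dict of named fields, or None if the page does not match."""
    match = DETAILS.search(html)
    return match.groupdict() if match else None


def apply_list(html):
    """Return every list item found on the page."""
    return [match.group("item") for match in LIST_ITEM.finditer(html)]


def apply_next(html):
    """Return the Next page link, or None on the last page."""
    match = NEXT_PAGE.search(html)
    return match.group("href") if match else None
```

Because the definitions depend only on page structure, the same `apply_details` call works on any page that shares the layout of the page it was written for.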

Capture

Capture is a way to transform or modify the data extracted by a Pattern. The data extracted by a Pattern is usually HTML. A Capture can convert HTML to text, remove unnecessary text, replace text, extract links to images and files, extract HTML attributes, and download images and files.
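Some of these modifications can be sketched with Python's standard `html.parser` module. The example covers converting HTML to text, removing and replacing text, and collecting image links. The `capture` function and its `remove` and `replace` parameters are hypothetical and are not part of the product.

```python
from html.parser import HTMLParser


class CaptureParser(HTMLParser):
    """Collects visible text and the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_links.append(src)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def capture(html, remove=(), replace=None):
    """Return (plain_text, image_links) for a fragment of extracted HTML."""
    parser = CaptureParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)   # convert HTML to text
    for unwanted in remove:              # remove unnecessary text
        text = text.replace(unwanted, "")
    for old, new in (replace or {}).items():  # replace some text
        text = text.replace(old, new)
    # Collapse whitespace left behind by the removals.
    return " ".join(text.split()), parser.image_links
```

For instance, calling `capture('<div><b>Price:</b> $10 <img src="/img/a.png"></div>', remove=("Price:",))` returns the text `$10` together with the list `["/img/a.png"]`.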

Login

In some cases you need to be logged in to access data. Login is a special Agent that automatically logs in to the web site. It can be used with bots that do not directly support logging in, such as Extractor or Crawler. Find more info about Login here.