Our software is built around a few concepts that are used throughout the data extraction process.
A Project is a container for all the elements used to create web scraping agents and data extractors.
It contains patterns, agents, crawlers and everything else you need to automate the data extraction process.
An Agent is the essential tool for extracting data from target web sites. It behaves like a human collecting data: it loads pages,
clicks links, types text, extracts text and images, modifies the extracted data and stores it in a data storage (CSV, Excel, database, etc.).
Read more about Agent here.
An Extractor is a simplified Agent. Unlike an Agent, it cannot click links or interact with the web site, for example by running a search.
Its main goal is to load pages and extract data. The source of pages can be a list of URLs, URLs from a CSV or Excel file, local HTML files,
URLs or HTML pages from a database, or a URL generator. To increase extraction speed, an Extractor can process several pages simultaneously.
Read more about Extractor here.
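The idea of loading several pages at once can be sketched in a few lines of Python. This is only an illustration of the concept, not how WebSundew is implemented. The names run_extractor, load_page and PAGES are hypothetical, and the pages are kept in memory so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Extraction step: here it only pulls out the page title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def run_extractor(sources, load_page, workers=4):
    """Load every source and extract its data, several pages at a time."""

    def process(source):
        return source, extract_title(load_page(source))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as `sources`.
        return list(pool.map(process, sources))


# Hypothetical in-memory pages standing in for URLs, files or database rows.
PAGES = {
    "page-1": "<html><head><title>First product</title></head></html>",
    "page-2": "<html><head><title>Second product</title></head></html>",
}

results = run_extractor(list(PAGES), PAGES.get)
for source, title in results:
    print(source, title)
```

Because load_page is passed in as a parameter, the same sketch could read from a list of URLs, local HTML files or a database by swapping that one function.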
A Crawler is a bot that browses all pages of a specific web site and stores the pages that meet given criteria. It is not used directly to
extract web data. The main purpose of the Crawler is to store web pages for further processing by an Agent or Extractor.
Read more about Crawler here.
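The behaviour described above (follow links within one site, keep only the pages that match a criterion) can be sketched as a breadth-first traversal. This is a minimal illustration under stated assumptions, not the WebSundew Crawler itself. The names crawl, load_page, keep and SITE are hypothetical, and example.test is a made-up site held in memory.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, load_page, keep):
    """Visit every reachable page on the start URL's host.

    Pages for which keep(url, html) is true are stored and returned.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    stored = {}
    while queue:
        url = queue.popleft()
        html = load_page(url)
        if html is None:  # page could not be loaded
            continue
        if keep(url, html):
            stored[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            # Stay on the same site and never queue a page twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored


# Hypothetical in-memory site used instead of real HTTP requests.
SITE = {
    "http://example.test/": (
        '<a href="/product/1">One</a> <a href="/about">About</a> '
        '<a href="http://other.test/x">External</a>'
    ),
    "http://example.test/product/1": '<a href="/product/2">Next</a>',
    "http://example.test/product/2": '<a href="/">Home</a>',
    "http://example.test/about": "About us",
}

stored = crawl("http://example.test/", SITE.get,
               lambda url, html: "/product/" in url)
print(sorted(stored))
```

The stored pages are kept as raw HTML, which matches the Crawler's role: it collects pages and leaves the actual data extraction to an Agent or Extractor.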
A Pattern is the basic instrument for extracting data from a web page. Once defined, a Pattern can extract data from other similar pages
on the same web site. Patterns are configured with the WebSundew point-and-click wizards. There are different types of Patterns.
Details Patterns are used to extract static data (they are useful for extracting data from product details pages).
List Patterns are used to extract tables, lists, etc. Page Patterns are used to find Next page links.
Find more info about Patterns here.
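To show what "define once, reuse on similar pages" means, here is a rough Python sketch of a Details pattern. WebSundew Patterns are built with its wizards, not written as code, so everything here (DETAILS_PATTERN, apply_pattern, the sample pages and their class names) is a hypothetical stand-in, and regular expressions are used only to keep the example short.

```python
import re

# Hypothetical "Details pattern": one regular expression per field.
DETAILS_PATTERN = {
    "name": re.compile(r'<h1 class="name">(.*?)</h1>', re.S),
    "price": re.compile(r'<span class="price">(.*?)</span>', re.S),
}


def apply_pattern(pattern, html):
    """Apply every field expression to the page and build one record."""
    record = {}
    for field, regex in pattern.items():
        match = regex.search(html)
        # A field that is missing from the page is reported as None.
        record[field] = match.group(1).strip() if match else None
    return record


# Two pages with the same layout, as found on one web site.
PAGE_A = '<h1 class="name">Red Mug</h1><span class="price">$5</span>'
PAGE_B = '<h1 class="name">Blue Mug</h1><span class="price">$7</span>'

records = [apply_pattern(DETAILS_PATTERN, page) for page in (PAGE_A, PAGE_B)]
print(records)
```

The pattern is written against the layout of the first page and then works unchanged on the second one because both share the same structure.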
A Capture is the way to transform or modify data extracted by a Pattern. Usually the data extracted by a Pattern is HTML. A Capture can
modify it: convert HTML to text, remove unnecessary text, replace text, extract links to images and files, extract HTML attributes, and download
images and files.
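Some of these transformations can be illustrated with Python's standard HTML parser. This is a sketch of the idea only: the capture function and its remove parameter are hypothetical names, and the fragment is a made-up example of what a Pattern might return.

```python
from html.parser import HTMLParser


class CaptureParser(HTMLParser):
    """Collects the text content and the image links of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")  # extract an HTML attribute
            if src:
                self.image_links.append(src)

    def handle_data(self, data):
        self.text_parts.append(data)


def capture(fragment, remove=""):
    """Convert HTML to text, drop unwanted text and list image links."""
    parser = CaptureParser()
    parser.feed(fragment)
    parser.close()
    text = "".join(parser.text_parts)
    if remove:
        text = text.replace(remove, "")
    # Collapse runs of whitespace left over from the markup.
    text = " ".join(text.split())
    return text, parser.image_links


fragment = '<div><b>Price:</b> $19.99 <img src="/img/item.png" alt="Item"></div>'
text, images = capture(fragment, remove="Price:")
print(text)
print(images)
```

Downloading the collected images would be a further step on top of this and is left out to keep the sketch free of network access.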
In some cases you need to be logged in to access data. Login is a special Agent that automatically logs in to the web site. It can be used
with bots that do not directly support logging in, such as Extractor or Crawler. Find more info about Login here.