Advanced Data Extraction
Welcome to your second tutorial. In the first tutorial we created an extraction project that extracts data from a simple e-commerce site. In this tutorial we will create an Agent that visits all product detail pages. The Agent will extract the following fields and save them to an Excel file: product title, price, image, and product properties.
We will use one of our demo web sites (the e-commerce demo) to create the Agent. Because the site has the same structure as the one shown here, you can easily repeat every step of this tutorial.
Step 1 - Create New Project
The first step is to create a new project. A project is the place where all components of a data extraction process are stored. Click New Project in the application toolbar.
The New Project dialog will appear.
Enter a project name. It is best to use a name related to the target web site, so that the project is easy to find later. Click Ok. The new project will be added to the project workspace.
Every time you run WebSundew, you will be able to access it in the Project View.
Step 2 - Create Agent
The Agent is one of the main concepts of WebSundew. It automates everything a user would do to collect web data: for example, navigating between web pages, clicking links, extracting data, and saving it to data storage.
Click New Agent in the application toolbar.
The New Agent dialog will appear.
Enter the URL of the target web site and click Ok. For this tutorial we will use our demo e-commerce web site: https://demo.websundew.io/ecommerce/products
A new Agent will be created and added to the current project, and the agent's editor will open. The new Agent has two states: Init and State1.
Step 3 - Visit Detail Pages
We need to configure the Agent to visit all product pages. The Agent should collect the links to the detail pages, then visit each page in turn and capture the product data. To configure the Agent this way, click Deep Crawler in the application toolbar.
The Deep Crawler dialog will appear.
Select List and click Ok.
The List pattern wizard will start. We need to configure the List pattern to collect all product links.
Click on the first product link.
Click Add in the wizard. The pattern builder will try to find patterns, and the Result pane will contain several results.
We need to select the appropriate result. In our case there are 9 product links on the web page, so click on the 9. The collected data will be highlighted on the web page and shown in the Preview View.
Now click Finish.
A Loop statement will be added to State1. This Loop contains two statements: Capture and Load Page. The Capture statement extracts the link to the detail page, and the Load Page statement loads a new state from the extracted URL.
A new state, State2, will be added to the Agent. This state is associated with the product detail page.
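To make the Loop, Capture, and Load Page statements more concrete, here is a rough conceptual sketch in Python of what such a deep crawl does. It is not WebSundew code and not how WebSundew is implemented. The sample markup, the `product-link` class, and the detail page paths are hypothetical, and nothing is fetched over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.css_class in classes and href:
            self.links.append(href)


def collect_detail_urls(listing_html, base_url, css_class="product-link"):
    """Return absolute detail-page URLs found on one listing page."""
    collector = LinkCollector(css_class)
    collector.feed(listing_html)
    return [urljoin(base_url, href) for href in collector.links]


# Hypothetical listing markup; the real demo site may be structured differently.
SAMPLE_LISTING = """
<ul>
  <li><a class="product-link" href="/ecommerce/products/1">Camera</a></li>
  <li><a class="product-link" href="/ecommerce/products/2">Desktop</a></li>
  <li><a class="nav" href="/ecommerce/about">About</a></li>
</ul>
"""

BASE_URL = "https://demo.websundew.io/ecommerce/products"

for url in collect_detail_urls(SAMPLE_LISTING, BASE_URL):
    # Stand-in for the Load Page statement: a real crawler would fetch the URL here.
    print("would load:", url)
```

The class filter plays the role of the List pattern: it keeps the product links and ignores unrelated links such as navigation.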
Step 4 - Capture Data
Now we can capture the required data on the detail page: the product title, price, image, and properties.
Notice that the detail page has the following structure. There is a fixed part with the product image, title, and price.
There is also a variable part that contains properties as Name - Value pairs. These properties depend on the product. For example, a Digital Camera has a Lens Mount property, but a Desktop Computer does not.
WebSundew uses two special types of data pattern here: the Simple Detail data pattern captures statically located data, and the Linked Tag data pattern captures the Name - Value pairs in the variable part.
4.1 Capture Simple Data
The Simple Detail pattern captures data that is statically located on the web page. To capture such data, click Capture in the application toolbar.
The capture type selector dialog will appear.
Select Simple, then click Ok. The Simple Detail pattern wizard will start.
Now you need to configure the pattern fields. Click the required data on the web page (for example, the title), then click Add in the pattern wizard.
Repeat the process for every required field, and add the main image as well. You can rename the fields to more meaningful names (double-click a field title to start editing).
Click Finish to complete the Simple Detail pattern wizard. A Capture Block statement based on the Simple Detail pattern will be added to State2.
4.2 Capture Name Value Pairs
The Linked Tag pattern captures data that is arranged in Name - Value pairs. To capture such data, click Capture in the application toolbar.
The capture type selector dialog will appear.
Select Pairs, then click Ok. The Linked Tag data pattern wizard will start.
Now you need to add the Name (tag) - Value pairs. Select the tag element on the web page.
Next, add the value part. Click the Value tab in the wizard.
Select the value on the web page.
Click Add in the pattern wizard to finish creating the field.
Repeat the same process for all required data.
Click Finish to complete the pattern. A Capture Block statement based on the Linked Tag pattern will be added to State2.
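The idea behind capturing Name - Value pairs can be illustrated with a small Python sketch. This is only a conceptual illustration under stated assumptions, not WebSundew's implementation. It assumes a hypothetical layout in which each name sits in a `<th>` cell and its value in the following `<td>` cell, and the sample property names and values are invented.

```python
from html.parser import HTMLParser


class PairParser(HTMLParser):
    """Pairs each <th> (name) with the <td> (value) that follows it."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._current_tag = None   # "th" or "td" while inside such a cell
        self._pending_name = None  # last name seen, waiting for its value

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._current_tag = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current_tag == "th":
            self._pending_name = text
        elif self._current_tag == "td" and self._pending_name is not None:
            self.pairs[self._pending_name] = text
            self._pending_name = None


def extract_properties(detail_html):
    """Return a dict of property name -> value for one detail page."""
    parser = PairParser()
    parser.feed(detail_html)
    return parser.pairs


# Hypothetical markup and values for two products with different properties.
CAMERA_HTML = (
    "<table><tr><th>Brand</th><td>Acme</td></tr>"
    "<tr><th>Lens Mount</th><td>EF</td></tr></table>"
)
DESKTOP_HTML = (
    "<table><tr><th>Brand</th><td>Acme</td></tr>"
    "<tr><th>RAM</th><td>16 GB</td></tr></table>"
)

print(extract_properties(CAMERA_HTML))
print(extract_properties(DESKTOP_HTML))
```

Because each value is looked up by its name rather than by a fixed position, a product that lacks a property (such as Lens Mount on a desktop computer) simply produces a record without that key.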
Step 5 - Pagination
To collect all the data, the Agent must visit every page of the product list. Select State1 in the agent's graph. The site uses simple paging with a separate next page element.
To configure the Agent to visit all pages, we need to create a Pagination. Click Pagination in the application toolbar.
The Pagination wizard dialog will appear.
Because the web site uses simple pagination with a separate next page element, choose the Simple page pattern type and click Ok. The Next page pattern wizard will appear.
WebSundew uses a page pattern to find the HTML element that leads to the next page. Click the next page element in the browser part of the agent's editor, then click Select Next in the wizard.
Click Finish in the wizard part of the agent's editor. The Next page pattern will be added to the project.
The Agent will use this pattern on all similar pages to find the next page element.
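Conceptually, simple pagination amounts to following the next page element until there is none left. The Python sketch below illustrates that loop under stated assumptions and is not WebSundew's implementation. The function name `crawl_all_pages`, the `fetch_page` callback, and the in-memory `FAKE_SITE` are all hypothetical.

```python
def crawl_all_pages(start_url, fetch_page, max_pages=100):
    """Follow 'next' links from start_url until no next page remains.

    fetch_page(url) must return a tuple (items, next_url_or_None).
    The seen set and max_pages guard against loops in the paging links.
    """
    collected = []
    seen = set()
    url = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        items, url = fetch_page(url)
        collected.extend(items)
    return collected


# Hypothetical three-page site, held in memory instead of fetched over HTTP.
FAKE_SITE = {
    "page1": (["A", "B", "C"], "page2"),
    "page2": (["D", "E", "F"], "page3"),
    "page3": (["G"], None),
}

print(crawl_all_pages("page1", FAKE_SITE.__getitem__))
# prints ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```

In a real crawler, `fetch_page` would load the page, extract its items, and locate the next page element, which is the part the Next page pattern handles in WebSundew.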
Step 6 - Store Captured Data
Now we are ready to store the captured data. Click Export in the application toolbar.
The Data export wizard will appear.
Select Excel, then click Next.
Configure the Excel properties or leave the default values. Click Finish. The data storage will be added to the Agent.
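Since products have different properties, a tabular export needs one column per property that appears on any product. The Python sketch below shows one way this could be done, as an illustration only. It writes CSV text (which Excel can open) using the standard library as a stand-in for a real Excel file. How WebSundew actually arranges columns is not described here, so the column layout, the `rows_to_csv` helper, and the sample records are assumptions.

```python
import csv
import io


def rows_to_csv(records, fixed_fields=("title", "price", "image")):
    """Serialize records to CSV text with one column per distinct field.

    Fixed fields come first; variable properties follow in the order
    they are first seen. Missing values are written as empty cells.
    """
    fieldnames = list(fixed_fields)
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical captured records; values are invented for illustration.
RECORDS = [
    {"title": "Camera", "price": "499", "image": "cam.jpg", "Lens Mount": "EF"},
    {"title": "Desktop", "price": "899", "image": "pc.jpg", "RAM": "16 GB"},
]

print(rows_to_csv(RECORDS))
```

Each row then has the same columns, and a property that a product does not have is left blank in that row.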
Step 7 - Run the Agent
Now we are ready to run the Agent. Click Run in the application toolbar.
The Agent will start extracting the data.