Mozenda

A Good Tool for Web Scraping and Data Extraction
CS 548 Showcase
By Xuanxing Huang
Jan 31, 2014
Resources
How to Get and Use Mozenda
    Mozenda’s Home Page: http://www.mozenda.com
    Mozenda’s Web Console: https://www.mozenda.com/console
    Mozenda’s Tutorial Videos: https://system.mozenda.com/training#OnDemand
Target Websites in the Demo
    The National Air Quality in Real Time: http://www.cnpm25.cn
    The National Air Quality Ranks in Real Time: http://www.cnpm25.cn/paiming.htm
    The National Meteorological Center of CMA: http://www.nmc.gov.cn/index.html
More References for the Demo
    Web Scraping: http://en.wikipedia.org/wiki/Web_scraping
    Air Quality Index: http://en.wikipedia.org/wiki/Air_quality_index
    Haze: http://en.wikipedia.org/wiki/Haze
About Mozenda
Free for collecting data from up to 500 pages per month
Simple: no data mining techniques or algorithms involved
Two components
    Agent Builder: construct and test agents
    Web Console: manage agents, data collections, and jobs
You can download and launch the Agent Builder from the Web Console
Topic
Haze and Air Pollution in China
Demo 1
Task: collect AQI data for different cities
Purpose: to show how to gather big data from multiple web pages
Steps:
1. Input http://www.cnpm25.cn and start a new agent from this page
2. Create a list of cities on Page 1
    Click two similar items and capture the field as “City Name”
3. Click a city item
4. Create a list of monitoring stations on Page 2
    Capture “Station Name”, “AQI”, “AQI Trends”, “PM25”, “PM25 Trends”, “Air Pollution Level”
    Set optional fields
    Refine captured text
5. Test and save the agent
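For readers who want to see the list-then-capture idea of Demo 1 outside the Agent Builder, here is a minimal Python sketch. It is not how Mozenda works internally. The sample markup, the reduced field list, and the names `SAMPLE_PAGE`, `RowCollector`, and `capture_stations` are all assumptions made for illustration; the real page at cnpm25.cn is likely structured differently.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for one city's station table.
SAMPLE_PAGE = """
<table>
  <tr><td>Station A</td><td>152</td><td>115</td><td>Moderate pollution</td></tr>
  <tr><td>Station B</td><td>98</td><td></td><td>Good</td></tr>
</table>
"""

# A reduced set of the Demo 1 fields, in the column order assumed above.
FIELDS = ["Station Name", "AQI", "PM25", "Air Pollution Level"]


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def capture_stations(html, city):
    """Turn each table row into a record tagged with the city it came from."""
    parser = RowCollector()
    parser.feed(html)
    records = []
    for cells in parser.rows:
        record = {"City Name": city}
        # An empty cell becomes None, loosely mirroring an optional field.
        record.update({name: (value or None) for name, value in zip(FIELDS, cells)})
        records.append(record)
    return records
```

Calling `capture_stations(SAMPLE_PAGE, "Example City")` yields one record per station row; repeating that call for every city in the Page 1 list corresponds to steps 2 to 4.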
Demo 2
Task: collect AQI and climate data for different cities
Purpose: to show how to combine data from different sources
Steps:
1. Input http://www.cnpm25.cn/paiming.htm and start a new agent from this page
2. Create lists
    Capture “City Name”, “Province Name”, “AQI”, “PM25”, “Air Pollution Level”
    Refine captured text
3. Test and save the agent as “Demo 2 AQI”
4. Run it in the Web Console
5. Input http://www.nmc.gov.cn/index.html and start another agent
6. Set user input on Page 1
    Use the data collection of agent “Demo 2 AQI”, input “City Name”
7. Click the search item
8. Capture text on Page 2
    Capture “City Name”, “Air Temperature”, “Air Pressure”, “Relative Humidity”, “Wind Speed”
    Refine captured text
9. Test and save the agent as “Demo 2 Climate”
10. Run it in the Web Console with Error Handling
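The chaining in Demo 2 (city names from one collection used as inputs to a second agent, with text refinement and error handling) can be sketched in Python as follows. This is an illustration under stated assumptions, not Mozenda's behavior: `refine_number`, `run_climate_agent`, `FAKE_SITE`, and `fake_lookup` are hypothetical names, and the fake lookup stands in for searching the weather site.

```python
import re


def refine_number(text):
    """Return the first number found in a captured string, or None if absent."""
    match = re.search(r"-?\d+(?:\.\d+)?", text or "")
    return float(match.group()) if match else None


def run_climate_agent(city_names, lookup):
    """Call lookup(city) for each input; record failures instead of stopping."""
    records, errors = [], []
    for city in city_names:
        try:
            raw = lookup(city)
        except LookupError as exc:
            errors.append((city, str(exc)))
            continue
        records.append(
            {"City Name": city, "Wind Speed": refine_number(raw.get("Wind Speed"))}
        )
    return records, errors


# Hypothetical stand-in for the search results of the second website.
FAKE_SITE = {
    "City A": {"Wind Speed": "3.2 m/s"},
    "City B": {"Wind Speed": "n/a"},
}


def fake_lookup(city):
    # Raises KeyError (a LookupError subclass) for cities the site does not know.
    return FAKE_SITE[city]
```

Running `run_climate_agent(["City A", "City B", "City C"], fake_lookup)` returns two records and one logged error for "City C", so a single failed search does not abort the whole run.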
Agent Schedule
Estimated Run Time
    Demo 2 AQI: 2m
    Demo 2 Climate: 1h
Schedule Two Agents in Demo 2
    The air quality data is updated every hour on the hour
    The climate data is updated more frequently
    Set Demo 2 AQI to run at 12:00AM and repeat every 2 hours
    Set Demo 2 Climate to run at 12:30AM and repeat every 2 hours
Combine data from two agents
    Problem: data from the two agents are not synchronized
    In the Professional Version, we can run multiple agents concurrently
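To make the synchronization problem concrete, the schedule above can be worked through with a small Python sketch. It assumes the estimated run times (2 minutes and 1 hour) hold exactly and uses an arbitrary date; `run_windows` and `sync_gaps` are hypothetical helper names.

```python
from datetime import datetime, timedelta


def run_windows(first_start, repeat, duration, count):
    """Return (start, estimated_end) pairs for a job that repeats at a fixed interval."""
    return [
        (first_start + i * repeat, first_start + i * repeat + duration)
        for i in range(count)
    ]


def sync_gaps(aqi_windows, climate_windows):
    """Time between the end of each AQI run and the end of the paired climate run."""
    return [
        c_end - a_end
        for (_, a_end), (_, c_end) in zip(aqi_windows, climate_windows)
    ]


midnight = datetime(2014, 1, 31)  # arbitrary day; 12:00AM is 00:00
aqi = run_windows(midnight, timedelta(hours=2), timedelta(minutes=2), 3)
climate = run_windows(
    midnight + timedelta(minutes=30), timedelta(hours=2), timedelta(hours=1), 3
)

for gap in sync_gaps(aqi, climate):
    print(gap)  # 1:28:00 for every pair
```

Under these assumptions the AQI snapshot is finished at 12:02AM while the climate run finishes around 1:30AM, so the last climate records are captured almost an hour and a half after the AQI values they are combined with.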
Results
The Demo 2 Climate agent reached the page limit and captured only 65 records
Results were inconsistent after running in the Web Console: only “Wind Speed” was captured successfully
Combine data collections
    Combine “Wind Speed” with the AQI data
    Make “City Name” unique
Configure the user’s view
    Select the attributes you want to show
    Rearrange attributes
    Set filters
Export data as a CSV file
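The combine, view, and export steps above correspond roughly to a keyed join followed by a column selection and a filter. The Python sketch below illustrates that idea with the standard `csv` module; it is not the Web Console's implementation. `combine` and `export_csv` are hypothetical names, and treating "make City Name unique" as "keep one climate record per city" is an assumption.

```python
import csv
import io


def combine(aqi_rows, climate_rows, key="City Name"):
    """Attach "Wind Speed" from the climate rows to each AQI row by city."""
    # One climate record per city; if a city repeats, the last record wins.
    climate_by_city = {row[key]: row for row in climate_rows}
    combined = []
    for row in aqi_rows:
        match = climate_by_city.get(row[key], {})
        combined.append({**row, "Wind Speed": match.get("Wind Speed")})
    return combined


def export_csv(rows, columns, keep=lambda row: True):
    """Write the chosen columns, in the given order, for rows passing the filter."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(row for row in rows if keep(row))
    return buffer.getvalue()
```

Cities without a climate record keep their AQI data and get an empty "Wind Speed" cell, which matches the situation where the climate agent stopped at the page limit. The `columns` argument plays the role of selecting and rearranging attributes, and `keep` plays the role of a filter.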
More Information
In Agent Running Records
Use a temporal model (HMMs)
Export data as a CSV file
The End!
 

Mozenda is a tool for web scraping and data extraction. It is free for collecting data from up to 500 pages per month and requires no data mining techniques or algorithms. Users build and test agents in the Agent Builder and manage agents, data collections, and jobs through the Web Console. The demos show how to gather data from multiple web pages and how to combine data from different sources.


Uploaded on Feb 16, 2025



