Enterprise Characteristics Through Web Data Analysis

Estimating Enterprise Characteristics from Web Data:
Achievements and Future Developments

Monica Scannapieco (Istat, Italy)
Galya Stateva (BNSI, Bulgaria)
Peter Struijs (CBS, Netherlands)

NTTS (New Techniques and Technologies for Statistics), 12-14 March 2019

This presentation delves into the use of webscraping, text mining, and inference techniques to gather and improve information on enterprises from their websites. It covers the objectives, examples of use cases, pilot projects, and results from the ESSnet Big Data initiatives led by various European countries like Italy, Bulgaria, Netherlands, Poland, Sweden, and the UK. The focus is on leveraging web data for official statistics and exploring innovative methods in data processing.

  • Web Data Analysis
  • Enterprise Characteristics
  • Webscraping
  • ESSnet Big Data
  • Statistics




Presentation Transcript


  1. Estimating Enterprise Characteristics from Web Data: Achievements and Future Developments
     Monica Scannapieco (Istat, Italy)
     Galya Stateva (BNSI, Bulgaria)
     Peter Struijs (CBS, Netherlands)
     NTTS (New Techniques and Technologies for Statistics), 12-14 March 2019

  2. Outline
     • Background
     • ESSnet Big Data I: Workpackage 2 «Webscraping Enterprise Characteristics»
     • ESSnet Big Data II: Workpackage C «Enterprise Characteristics»
     • Final Remarks

  3. Enterprise Websites as a Source for Official Statistics
     Objective: to investigate whether webscraping, text mining and inference techniques can be used to collect, process and improve general information about enterprises.
     Diagram labels: Enterprises' Websites; National Business Register; Business Statistics Surveys

  4. Examples of Use Cases
     • Use case 1: URLs Inventory
     • Use case 2: Websales - E-commerce
     • Use case 3: Social Media Presence
     • Use case 4: Job Advertisements
     • Use case 5: Economic Activity Classification (NACE)

  5. Pilots of the ESSnet Big Data I
     List of pilot projects:
     • Webscraping (2 work packages): job vacancies; enterprise characteristics
     • Smart meters: electricity consumption; temporary vacant dwellings
     • Automatic Identification System (AIS): vessel identification data
     • Mobile phone data: preparing for access to data
     • Early estimates: various domains
     • Multiple domains: population, tourism / border crossing, agriculture

  6. WP2: Webscraping / Enterprise Characteristics
     WP leader: Italy
     Partners: Bulgaria, Netherlands, Poland, Sweden, UK

  7. Big Data Pilots I - WP2 Results (1)
     A pipeline for processing data scraped from enterprises' websites was defined in detail and shared among the participants.

  8. Big Data Pilots I - WP2 Results (2)
     Methods:
     • Webscraping methods (including URLs retrieval when necessary)
     • Text representation and mining methods for processing webscraped texts
     • Deterministic and Machine Learning methods tested for unit-level prediction of enterprise characteristics
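
     As a rough illustration of the deterministic route listed in slide 8, the sketch below turns scraped page text into a bag-of-words representation and applies a keyword rule to flag e-commerce activity at unit level. It is not the WP2 implementation: the keyword list, the function names, the threshold and the sample pages are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical keyword list (an assumption); a real study would derive it
# from labelled data or domain expertise.
ECOMMERCE_KEYWORDS = {"cart", "checkout", "basket", "payment", "shipping"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Represent a scraped page as term counts (a simple text representation)."""
    return Counter(tokenize(text))

def predict_ecommerce(text, min_hits=2):
    """Deterministic rule: flag the unit if enough distinct keywords occur."""
    bow = bag_of_words(text)
    hits = sum(1 for keyword in ECOMMERCE_KEYWORDS if bow[keyword] > 0)
    return hits >= min_hits

# Made-up scraped texts, keyed by a placeholder site name.
pages = {
    "shop.example": "Add to cart and proceed to checkout. Free shipping!",
    "law.example": "Our firm offers legal advice. Contact us for an appointment.",
}
predictions = {site: predict_ecommerce(text) for site, text in pages.items()}
print(predictions)
```

     A machine learning variant would typically feed the same term counts to a classifier trained on enterprises whose characteristic is already known, for instance from a survey, in place of the fixed keyword rule.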

  9. Big Data Pilots I - WP2 Results (3)
     IT Solutions:
     • BNSI/Bulgaria: URLs retrieval; scraping; deterministic analysis
     • Istat/Italy: generalized scraping; URLs retrieval; ML analysis (used by Poland and Bulgaria)
     • GUS/Poland: social media scraping & analysis (used by Italy, Netherlands, Sweden, Bulgaria)
     Developed software available at: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/WP2_Links

  10. Big Data Pilots I - WP2 Results (4)
      First Experimental Statistics:
      https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/Category:WP2_Experimental_statistics1

  11. ESSnet Big Data II: Overview of Workpackages

      WP   WP name                         WP leader                                  Country
      WPA  Coordination and Communication  Peter Struijs; Marc Debusschere (deputy)   NL; BE
      WPB  Online Job Vacancies            Tomaž Špeh                                 SI
      WPC  Enterprise Characteristics      Galya Stateva                              BG
      WPD  Smart Energy                    Arko Kesküla                               EE
      WPE  Tracking Ships                  Anke Consten                               NL
      WPF  Process and Architecture        Monica Scannapieco                         IT
      WPG  Financial Transactions Data     Johan Fosen                                NO
      WPH  Earth Observation               Marek Morze                                PL
      WPI  Mobile Networks Data            David Salgado                              ES
      WPJ  Innovative Tourism Statistics   Marek Cierpiał-Wolan                       PL
      WPK  Methodology and Quality         Alexander Kowarik                          AT
      WPL  Preparing Smart Statistics      Natalie Rosenski                           DE

  12. WPC: Enterprise Characteristics
      WP leader: Bulgaria
      Partners: Austria, Germany, Finland, Ireland, Italy, Netherlands, Poland, UK

  13. From WP2 to WPC
      Results of WP2 as a starting point for WPC, in particular:
      • URLs retrieval methodology, i.e. a process and software implementations for detecting websites of enterprises based on search engines and machine learning techniques;
      • Methodologies, processes and software implementations for detecting characteristics of enterprises such as e-commerce activities, social media presence, job advertisements, NACE code, etc.
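
      The URLs retrieval idea in slide 13 can be sketched as follows: candidate pages returned by a search engine are scored against what the business register knows about the enterprise, and the best candidate is kept only if it scores high enough. This is a simplified stand-in, not the WP2 software: the hand-set integer weights replace the trained machine learning model, and the field names, legal-form list, threshold and sample records are all assumptions for illustration.

```python
import re
from urllib.parse import urlparse

# Legal-form suffixes ignored when matching names (hypothetical list).
LEGAL_FORMS = {"ltd", "srl", "bv", "spa", "gmbh"}

# Hand-set weights standing in for a trained classifier (an assumption).
WEIGHTS = {"name_in_domain": 3, "vat_in_text": 5, "city_in_text": 2}

def name_tokens(name):
    """Split an enterprise name into lowercase tokens, dropping legal forms."""
    return set(re.findall(r"[a-z0-9]+", name.lower())) - LEGAL_FORMS

def features(enterprise, candidate):
    """Compare one candidate page with the register record."""
    domain = urlparse(candidate["url"]).netloc.lower()
    text = candidate["text"].lower()
    return {
        "name_in_domain": any(t in domain for t in name_tokens(enterprise["name"])),
        "vat_in_text": enterprise["vat"] in candidate["text"],
        "city_in_text": enterprise["city"].lower() in text,
    }

def score(enterprise, candidate):
    """Sum the weights of the features that fire for this candidate."""
    fired = features(enterprise, candidate)
    return sum(WEIGHTS[name] for name, value in fired.items() if value)

def best_url(enterprise, candidates, threshold=5):
    """Return the highest-scoring URL, or None if nothing reaches the threshold."""
    if not candidates:
        return None
    top = max(candidates, key=lambda c: score(enterprise, c))
    return top["url"] if score(enterprise, top) >= threshold else None

# Made-up register record and search results.
enterprise = {"name": "Rossi Meccanica Srl", "vat": "01234567890", "city": "Torino"}
candidates = [
    {"url": "https://www.rossimeccanica.example/",
     "text": "Rossi Meccanica, Torino. P.IVA 01234567890"},
    {"url": "https://directory.example/rossi",
     "text": "Listings for Rossi in Milano"},
]
chosen = best_url(enterprise, candidates)
print(chosen)
```

      Returning None below the threshold reflects the fact that not every enterprise has a website, so an inventory built this way should leave a unit unmatched instead of forcing a weak candidate.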

  14. WPC Objective and Tasks
      From piloting to implementation. Five tasks:
      • ESS webscraping policies
      • Methodological Framework/Guidelines
      • Experimental Statistics, including reference metadata
      • Starter Kit for NSIs
      • Quality template for statistical outputs

  15. WPC: Updated Use Cases
      • Use case 1: URLs Inventory
      • Use case 2: Variables on the ICT usage in enterprise survey
      • Use case 3: Validation of BR and NACE classification
      • Use case 4: Experimental Language Statistics

  16. Final Remarks
      • The work done within ESSnet Big Data I and the initial work of ESSnet Big Data II are milestones on the route to using the Internet as a data source for official statistics.
      • The work on using enterprise websites to support business statistics addresses the whole production pipeline, from data collection to data dissemination, and has an impact at the technical, legal and organizational levels.
      • The promising results achieved, together with the concrete actions planned to address implementation issues, are expected to move this pipeline towards full-fledged statistical production in the short to medium term for several countries of the ESS.

  17. Questions?
      https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata
      Thank you for your attention!
      Contacts: scannapi@istat.it, GStateva@NSI.bg, p.struijs@cbs.nl
