Graph Structure in the Web – Revisited

A paper by Robert Meusel (University of Mannheim, Germany), Sebastiano Vigna (Università degli Studi di Milano, Italy), Oliver Lehmberg (University of Mannheim, Germany), and Christian Bizer (University of Mannheim, Germany)
ABSTRACT

This paper analyzes a large, publicly accessible crawl of the web that was gathered by the Common Crawl Foundation in 2012 and contains over 3.5 billion web pages and 128.7 billion links, together forming a web graph. In particular, the paper compares features such as degree distributions, connectivity, average distances, and the structure of weakly and strongly connected components of the web graph built from this crawled data.
Introduction

Purpose of this paper –
To gain knowledge about the general structure of the web graph.
To improve ranking mechanisms.
To increase the efficiency of web crawling using structural information.
To detect spam networks.
Prior work and findings –
Based on the study of the structure of the web as a whole published by Broder et al. in 2000, using an AltaVista crawl of 200 million pages and 1.5 billion links:
Bow-tie structure of the web graph.
A giant strongly connected component containing 28% of the nodes.
The indegree distribution, the outdegree distribution, and the distribution of component sizes are heavy-tailed.
Intention of the authors:
Revisit and update the findings of previous research to give an up-to-date view of the web graph today, using a crawl that is significantly larger (3.5 billion pages) than the ones used in previous work.
Dataset and Methodology

Based on a large web crawl gathered by the Common Crawl Foundation.
Contains 3.83 billion web documents, of which over 3.53 billion (92%) are of MIME type text/html.
The crawling strategy is breadth-first visiting, together with heuristics to detect spam pages.
Methodology

Each node represents a page, and each arc between two nodes represents the links between the associated pages.
Data extraction is a three-step process of parsing the Common Crawl corpus and extracting the structured data embedded in HTML pages:
1. Collect the URL, MIME type, links to other pages, link type, and redirect URL for each crawled page.
2. Filter the extracted URLs by MIME type text/html.
3. Use a 40-node Amazon Elastic MapReduce cluster to compress the graph.
Construction of a host graph and the pay-level-domain (PLD) graph (a minimal construction sketch follows below).
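The paper does not include code for this pipeline; the following is a minimal Python sketch of the page, host, and PLD graph construction, assuming crawl records are already available as (url, mime_type, outlinks) tuples. The record layout and the pld() heuristic are illustrative assumptions – the actual extraction parses WARC files and uses the public-suffix list to determine pay-level domains.

    from collections import defaultdict
    from urllib.parse import urlsplit

    # Hypothetical crawl records: (url, mime_type, outlinks) tuples.
    records = [
        ("http://a.example.com/x", "text/html", ["http://b.example.org/y"]),
        ("http://b.example.org/y", "text/html", ["http://a.example.com/x"]),
    ]

    def host(url):
        return urlsplit(url).hostname

    def pld(url):
        # Crude pay-level-domain guess: last two host labels. The real
        # pipeline uses the public-suffix list; this is an approximation.
        return ".".join(host(url).split(".")[-2:])

    # Page graph: one node per page, one arc per link
    # (only text/html pages are kept, as in step 2 above).
    page_graph = defaultdict(set)
    for url, mime, links in records:
        if mime == "text/html":
            page_graph[url].update(links)

    # Host and PLD graphs: aggregate page-level arcs by host / pay-level domain.
    host_graph, pld_graph = defaultdict(set), defaultdict(set)
    for src, dsts in page_graph.items():
        for dst in dsts:
            host_graph[host(src)].add(host(dst))
            pld_graph[pld(src)].add(pld(dst))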
ANALYSIS OF THE WEB GRAPH

The analysis was performed using the "big" version of the WebGraph framework, which can handle more than 2^31 nodes. The BV compression scheme was used to compress the graph in crawl order at 3.52 bits per link.
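The 3.52 bits-per-link figure comes from BV compression exploiting locality: successor lists are sorted and stored as gaps, so links between nearby pages become small numbers that need few bits. The toy Python sketch below illustrates only that gap-encoding idea; the real BV format additionally uses reference compression, interval codes, and bit-level zeta codes rather than the byte-level code shown here.

    def encode_successors(node, successors):
        # Gap-encode a sorted successor list: the first entry is a signed
        # offset from the source node, the rest are gaps between successors.
        succ = sorted(successors)
        return [succ[0] - node] + [b - a - 1 for a, b in zip(succ, succ[1:])]

    def zigzag(n):
        # Map signed ints to unsigned so small magnitudes stay small
        # (real BV treats only the first, signed entry specially).
        return (n << 1) ^ (n >> 63)

    def varint(n):
        # LEB128-style byte code: small numbers take few bytes.
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            out.append(byte | (0x80 if n else 0))
            if not n:
                return bytes(out)

    # A node whose successors are close to it compresses very well:
    gaps = encode_successors(1000, [998, 1001, 1002, 1007])
    payload = b"".join(varint(zigzag(g)) for g in gaps)
    print(gaps, len(payload), "bytes")  # [-2, 2, 0, 4] 4 bytes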
Analysis is done for the following:
Indegree & Outdegree Distribution
High Indegree Pages and Hosts
Components
The Bow Tie Structure
Diameter and Distances
Indegree & Outdegree Distribution

Computing the density of the degree distribution – for each degree value, the ratio between the number of nodes (pages) having that degree and the total number of nodes in the graph.
Aggregating the values using Fibonacci binning to obtain an approximate shape of the distribution.
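Fibonacci binning averages the raw frequencies over bins whose boundaries are Fibonacci numbers, which smooths the noisy tail of the plot without distorting its shape. A minimal Python sketch, assuming degrees is simply a list with one degree value per node (bins cover degrees >= 1):

    from collections import Counter

    def fibonacci_bins(upper):
        # Bin boundaries 1, 2, 3, 5, 8, ... up to at least `upper`.
        bins, a, b = [1], 1, 2
        while bins[-1] <= upper:
            bins.append(b)
            a, b = b, a + b
        return bins

    def binned_density(degrees):
        # Average density per Fibonacci bin: (#nodes in bin) / (bin width * n).
        n = len(degrees)
        counts = Counter(degrees)
        bins = fibonacci_bins(max(degrees))
        out = []
        for lo, hi in zip(bins, bins[1:]):  # each bin is [lo, hi)
            mass = sum(counts.get(d, 0) for d in range(lo, hi))
            out.append((lo, mass / ((hi - lo) * n)))
        return out

    # Example: a tiny synthetic degree sample.
    print(binned_density([1, 1, 1, 2, 2, 3, 5, 8, 13, 40]))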
Fitting a power law to the tail of the data.
Observations after the fitting test and the estimation of a p-value:
The p-value of the best fits is 0 (estimated with a precision of 0.01), which means the tail of the distribution is not a power law.
A size-rank plot nonetheless depicts the possibility of a heavy tail.
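The fitting methodology is the Clauset–Shalizi–Newman procedure: choose the cutoff x_min and the exponent by minimizing the Kolmogorov–Smirnov distance, then estimate a p-value as the fraction of synthetic power-law samples that fit worse than the data. Below is a simplified sketch using the third-party powerlaw Python package; the function name and the reduced bootstrap are illustrative, and the full procedure also resamples the body of the data below x_min and uses thousands of trials.

    import numpy as np
    import powerlaw  # third-party package: pip install powerlaw

    def power_law_p_value(degrees, n_trials=100):
        # Fit exponent alpha and cutoff xmin by minimizing the KS distance.
        data = np.asarray([d for d in degrees if d > 0])
        fit = powerlaw.Fit(data, discrete=True)
        ks_data = fit.power_law.D          # KS distance of the best fit
        tail_size = int((data >= fit.power_law.xmin).sum())
        # p-value: fraction of synthetic samples whose own best fit is
        # at least as far from a power law as the observed data.
        worse = 0
        for _ in range(n_trials):
            synth = fit.power_law.generate_random(tail_size)
            worse += powerlaw.Fit(synth, discrete=True).power_law.D >= ks_data
        return fit.power_law.alpha, fit.power_law.xmin, worse / n_trials

    # A p-value of (almost) 0, as the paper finds, rejects the
    # power-law hypothesis for the tail.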
High Indegree Pages and Hosts
Analysis of Weakly Connected Components

The largest component (the rightmost grey point in the paper's plot) contains around 94% of the whole graph.
The p-value is again 0, and the power-law fit covers only 1% of the distribution.
Analysis of Strongly Connected Components

The largest component (again the rightmost grey point) contains 51.3% of the nodes.
The power-law fit covers only 8.9% of the distribution.
The p-value is again 0.
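At the scale of 3.5 billion nodes these components are computed with WebGraph's algorithms, but the reported quantities themselves are standard; a minimal Python sketch with networkx on a toy graph:

    import networkx as nx

    def component_fractions(g: nx.DiGraph):
        # Fraction of nodes in the largest weakly / strongly connected
        # component - the two headline numbers reported above.
        n = g.number_of_nodes()
        wcc = max(nx.weakly_connected_components(g), key=len)
        scc = max(nx.strongly_connected_components(g), key=len)
        return len(wcc) / n, len(scc) / n

    # Toy example: 1 -> 2 -> 3 -> 1 is an SCC; 4 dangles off it.
    g = nx.DiGraph([(1, 2), (2, 3), (3, 1), (3, 4)])
    print(component_fractions(g))  # (1.0, 0.75)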
The Bow-Tie Structure of the Web Graph

The core is given by the giant strongly connected component (LSCC);
The IN component contains non-core pages that can reach the core via a directed path;
The OUT component contains non-core pages that can be reached from the core;
The TUBES are formed by non-core pages reachable from IN that can reach OUT;
Pages reachable from IN, or that can reach OUT, but not listed above, are called TENDRILS;
The remaining pages are DISCONNECTED (a minimal sketch of this decomposition follows below).
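The decomposition only needs the largest SCC plus forward and backward reachability. A minimal networkx sketch on a toy graph; the paper computes the same sets on 3.5 billion nodes with WebGraph, and the per-node reachability step below is written for clarity, not efficiency.

    import networkx as nx

    def bow_tie(g: nx.DiGraph):
        lscc = max(nx.strongly_connected_components(g), key=len)
        seed = next(iter(lscc))
        fwd = nx.descendants(g, seed) | {seed}   # reachable from the core
        bwd = nx.ancestors(g, seed) | {seed}     # can reach the core
        out_c, in_c = fwd - lscc, bwd - lscc
        rest = set(g) - lscc - in_c - out_c
        # Remaining pages reachable from IN and/or able to reach OUT.
        from_in = set().union(*(nx.descendants(g, v) for v in in_c)) & rest
        to_out = set().union(*(nx.ancestors(g, v) for v in out_c)) & rest
        tubes = from_in & to_out
        tendrils = (from_in | to_out) - tubes
        disc = rest - tubes - tendrils
        return {"LSCC": lscc, "IN": in_c, "OUT": out_c,
                "TUBES": tubes, "TENDRILS": tendrils, "DISCONNECTED": disc}

    # Toy graph: {1, 2} is the core, 0 is IN, 3 is OUT.
    g = nx.DiGraph([(0, 1), (1, 2), (2, 1), (2, 3)])
    print({k: sorted(v) for k, v in bow_tie(g).items()})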
Comparison with Broder's results
Diameter and Distances

Observations –
Approximately 48.15% of the pairs of pages have a connecting directed path. The average distance is 12.84, and the harmonic mean of distances is approximately 24.43.
Broder's results –
25% of the pairs of pages are connected, and the average distance is 16.12.
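The harmonic mean can exceed the average distance because it is taken over all pairs, with unreachable pairs contributing an infinite distance (hence 24.43 vs. 12.84). At web scale the paper estimates these quantities probabilistically (with HyperLogLog-based neighborhood functions); the exact definitions, as a minimal networkx sketch on a toy graph:

    import networkx as nx

    def distance_stats(g: nx.DiGraph):
        # Fraction of ordered pairs joined by a directed path, their average
        # distance, and the harmonic mean of distances over ALL pairs
        # (an unreachable pair has infinite distance, adding 0 to inv_sum).
        n = g.number_of_nodes()
        total_pairs = n * (n - 1)
        reachable, dist_sum, inv_sum = 0, 0, 0.0
        for src in g:
            for dst, d in nx.single_source_shortest_path_length(g, src).items():
                if dst != src:
                    reachable += 1
                    dist_sum += d
                    inv_sum += 1.0 / d
        return reachable / total_pairs, dist_sum / reachable, total_pairs / inv_sum

    g = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3)])
    print(distance_stats(g))  # approx. (0.75, 1.67, 1.89)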
Conclusion

The average degree increased by a factor of 5.
The connectivity of the web increased and the average distance between pages decreased, in spite of a growth that was predicted to be logarithmic in the number of pages.
The structure of the bow-tie depends on the specific web crawl; one cannot rely on the sizes of its components.
The indegree and outdegree distributions are extremely different from each other.
Neither the degree nor the component-size distributions fit a power law.
Size-rank plots suggest that their tails are not fat (i.e., they decrease faster than a polynomial).
Thank You