Exploring Graph Structure in the Web: A Comprehensive Analysis
This study presents a detailed analysis of the web graph, based on a dataset of 3.5 billion web pages and 128.7 billion links. It examines features such as degree distributions, connectivity, average distances, and the structure of connected components, revisiting and updating previous findings on the web's structure with the aim of improving ranking mechanisms, increasing web-crawling efficiency, and detecting spam networks.
Presentation Transcript
Graph Structure in the Web, Revisited. A paper by Robert Meusel (University of Mannheim, Germany), Sebastiano Vigna (Università degli Studi di Milano, Italy), Oliver Lehmberg (University of Mannheim, Germany) and Christian Bizer (University of Mannheim, Germany)
ABSTRACT This paper analyses a large, publicly accessible crawl of the web gathered by the Common Crawl Foundation in 2012, containing over 3.5 billion web pages and 128.7 billion links that form a web graph. In particular, the paper compares features such as degree distributions, connectivity, average distances, and the structure of the weakly and strongly connected components of the web graph built from this crawled data.
Introduction Purpose of this paper: to gain knowledge about the general structure of the web graph; to improve ranking mechanisms; to increase the efficiency of web crawling by exploiting structural information; and to detect spam networks. Prior work and findings: the structure of the web as a whole was described by Broder et al. in 2000, using an AltaVista crawl of 200 million pages and 1.5 billion links. They found a bow-tie structure of the web graph, with a giant strongly connected component containing 28% of the nodes, and observed that the indegree distribution, the outdegree distribution and the distribution of component sizes are heavy-tailed. Intention of the authors: revisit and update the findings of previous research to give an up-to-date view of the web graph today, using a crawl (3.5 billion pages) that is significantly larger than the ones used in previous work.
Dataset and Methodology Based on a large web crawl gathered by the Common Crawl Foundation. It contains 3.83 billion web documents, of which over 3.53 billion (92%) are of MIME type text/html. The crawling strategy is breadth-first visiting, together with heuristics to detect spam pages.
Methodology Each node represents a page and each arc between two nodes represents a link between the associated pages. Data extraction is a three-step process of parsing the Common Crawl corpus and extracting structured data embedded in HTML pages: 1. Collect the URL, MIME type, links to other pages, link type, and redirect URL for each crawled page. 2. Filter the extracted URLs by MIME type text/html. 3. Use a 40-node Amazon Elastic MapReduce cluster to compress the graph. The authors also construct a host graph and a pay-level-domain (PLD) graph.
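The three extraction steps above can be sketched as a small pipeline. This is an illustrative simplification, not the paper's actual MapReduce code: the record format (url, mime_type, outlinks) and the function name `build_graph` are assumptions for the example.

```python
def build_graph(records):
    """Sketch of the extraction pipeline: keep text/html pages, map URLs
    to integer node ids, and emit the arc list of the page graph.
    `records` is an assumed iterable of (url, mime_type, outlinks) tuples."""
    # Step 2: filter by MIME type text/html.
    html = [(url, links) for url, mime, links in records if mime == "text/html"]
    # Assign one integer node id per crawled HTML page.
    node_id = {}
    for url, _ in html:
        node_id.setdefault(url, len(node_id))
    # Step 3 (simplified): emit arcs between crawled HTML pages only.
    edges = []
    for url, links in html:
        for target in links:
            if target in node_id:
                edges.append((node_id[url], node_id[target]))
    return node_id, edges
```

At the paper's scale this mapping cannot be held in memory on one machine, which is why the authors use a 40-node Elastic MapReduce cluster for the compression step.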
ANALYSIS OF THE WEB GRAPH The analysis was performed using the big version of the WebGraph framework, which can handle more than 2^31 nodes. The BV compression scheme compresses the graph in crawl order at 3.52 bits per link. The analysis covers the following: Indegree & Outdegree Distribution, High Indegree Pages and Hosts, Components, The Bow-Tie Structure, Diameter and Distances.
Indegree & Outdegree Distribution The degree distributions of the web graph are computed as, for each degree, the ratio between the number of nodes (pages) with that degree and the total number of nodes in the graph. The values are aggregated using Fibonacci binning to obtain an approximate shape of the distribution.
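A minimal sketch of the two steps above, computing an indegree distribution and the Fibonacci bin boundaries used to smooth the plot. The function names are illustrative; the paper uses the WebGraph framework for this at scale.

```python
from collections import Counter

def indegree_distribution(edges, num_nodes):
    """For each indegree, the fraction of nodes having that indegree
    (nodes with no inbound arcs are counted under degree 0)."""
    indeg = Counter(dst for _, dst in edges)
    counts = Counter(indeg.values())
    counts[0] = num_nodes - len(indeg)  # nodes never appearing as a target
    return {d: c / num_nodes for d, c in counts.items()}

def fibonacci_bins(max_degree):
    """Bin boundaries at Fibonacci numbers: exponentially growing bins
    that smooth the noisy tail of the raw distribution."""
    a, b, bounds = 1, 2, [1]
    while a <= max_degree:
        bounds.append(b)
        a, b = b, a + b
    return bounds
```

Fibonacci binning grows the bin widths roughly geometrically, so sparse high-degree counts are averaged over wide bins while low degrees keep fine resolution.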
A power law is fitted to the tail of the data. Observations after the fitting test and estimation of a p-value: the p-value of the best fits is 0 (well below the 0.01 threshold), which means the tail of the distribution is not a power law. A size-rank plot is used to depict the possibility of a heavy tail.
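The fitting step relies on a maximum-likelihood estimate of the power-law exponent for the tail, in the style of Clauset, Shalizi and Newman (whose goodness-of-fit test also produces the p-value mentioned above). A minimal sketch of just the exponent estimate, with an assumed function name and the discrete correction xmin - 0.5:

```python
import math

def powerlaw_alpha(samples, xmin):
    """MLE estimate of the power-law exponent alpha for the tail x >= xmin,
    using the approximate discrete correction xmin - 0.5."""
    tail = [x for x in samples if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / (xmin - 0.5)) for x in tail)
```

The full test bootstraps synthetic samples from the fitted law and compares their Kolmogorov-Smirnov distance to the empirical one; a p-value of 0 means no synthetic sample fit as badly as the real data, i.e. the power-law hypothesis is rejected.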
Analysis of Weakly Connected Components The largest component (rightmost grey point) contains around 94% of the whole graph. The p-value is again 0, and the power-law fit covers only 1% of the distribution.
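Weak connectivity ignores arc direction, so the component containing 94% of the graph can be found with a simple union-find over the undirected version of the graph. A small sketch under that standard definition (the function name is an assumption; the paper computes this with the WebGraph toolchain):

```python
from collections import Counter

def largest_wcc_fraction(num_nodes, edges):
    """Fraction of nodes in the largest weakly connected component,
    via union-find with path halving, ignoring arc direction."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)  # union the two endpoints' components
    sizes = Counter(find(v) for v in range(num_nodes))
    return max(sizes.values()) / num_nodes
```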
Analysis of Strongly Connected Components The largest component (rightmost grey point) contains 51.3% of the nodes. The power-law fit covers 8.9% of the distribution, and the p-value is again 0.
The Bow-Tie Structure of the Web Graph The core is given by the giant strongly connected component (LSCC); the IN component contains non-core pages that can reach the core via a directed path; the OUT component contains non-core pages that can be reached from the core; the TUBES are formed by non-core pages reachable from IN that can reach OUT; pages reachable from IN, or that can reach OUT, but not listed above, are called TENDRILS; the remaining pages are DISCONNECTED.
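The six bow-tie regions defined above can be computed from four reachability sweeps once the core (LSCC) is known. A toy sketch of that classification, with assumed function names; each directed traversal corresponds to one region boundary:

```python
from collections import defaultdict

def bowtie_classify(num_nodes, edges, core):
    """Label each node with its bow-tie region, given the set `core` (the LSCC)."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    def reach(starts, adj):
        """All nodes reachable from `starts` following `adj` (starts included)."""
        seen, frontier = set(starts), list(starts)
        while frontier:
            u = frontier.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
        return seen

    out_set = reach(core, fwd) - core   # reachable from the core
    in_set = reach(core, rev) - core    # can reach the core
    from_in = reach(in_set, fwd)        # reachable from IN
    to_out = reach(out_set, rev)        # can reach OUT
    labels = {}
    for v in range(num_nodes):
        if v in core:
            labels[v] = "LSCC"
        elif v in in_set:
            labels[v] = "IN"
        elif v in out_set:
            labels[v] = "OUT"
        elif v in from_in and v in to_out:
            labels[v] = "TUBE"
        elif v in from_in or v in to_out:
            labels[v] = "TENDRIL"
        else:
            labels[v] = "DISCONNECTED"
    return labels
```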
Diameter and Distances Observations: approximately 48.15% of the pairs of pages have a connecting directed path. The average distance is 12.84 and the harmonic mean of distances is approximately 24.43. Broder's results: 25% of the pairs of pages were connected, and the average distance was 16.12.
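Both statistics can be illustrated exactly on a small graph by running a BFS from every node: the average distance is taken over reachable pairs only, while the harmonic mean, n(n-1) divided by the sum of reciprocal distances, naturally accounts for unreachable pairs (they contribute 0). This exhaustive sketch is only feasible on toy graphs; at web scale such distances are estimated with probabilistic counters rather than all-pairs BFS:

```python
from collections import defaultdict, deque

def distance_stats(num_nodes, edges):
    """Average distance over reachable ordered pairs, and the harmonic mean
    of distances over all ordered pairs, via BFS from every node."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    total, reachable, inv_sum = 0, 0, 0.0
    for s in range(num_nodes):
        dist = {s: 0}
        queue = deque([s])
        while queue:                      # BFS: arcs have unit length
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for t, d in dist.items():
            if t != s:
                total += d
                reachable += 1
                inv_sum += 1.0 / d        # unreachable pairs add 0
    pairs = num_nodes * (num_nodes - 1)
    avg = total / reachable if reachable else float("inf")
    harmonic = pairs / inv_sum if inv_sum else float("inf")
    return avg, harmonic
```

Because unreachable pairs drag the harmonic mean up (24.43) relative to the average over connected pairs (12.84), the two numbers reported above differ substantially.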
Conclusion The average degree increased by a factor of 5. The connectivity of the graph increased and the average distance between pages decreased, in spite of a predicted growth that should have been logarithmic in the number of pages. The structure of the bow-tie depends on the specific web crawl, so one cannot rely on the sizes of its components. The distributions of indegrees and outdegrees are extremely different from one another. Neither degree nor component-size distributions fit a power law. Size-rank plots suggest that their tails are not fat (i.e., they decrease faster than polynomially).