Understanding Web Characteristics for Building a Search Engine
This lecture covers key characteristics of the Web relevant to building a search engine: the size of the Web, its link structure, and the number of sites it contains. It discusses the importance of defining quantifiable measures when assessing the Web's scale and addresses complications such as multi-hosted servers and virtual hosting. The material is by Junghoo "John" Cho of UCLA.
CS246: Web Characteristics
Junghoo "John" Cho
Web Characteristics
To build a search engine, we need to know what the Web is like.
Q: What is the Web like? What characteristics are important for a search engine?
Some relevant questions:
- How big is the Web?
- What is the Web graph like?
- Anything else?
Web Characteristics
- Size of the Web
- Link structure of the Web
Size of the Web
Q: What does the size of the Web mean?
Every scientific endeavor starts with defining a quantifiable measure. How can we define the size of the Web?
Possibilities:
- How many bytes of data?
- How many sites?
- How many Web pages?
How Many Web Sites?
Naive approach: poll every IP address.
- 2^32 ≈ 4B addresses, 10 sec/IP, 1,000 simultaneous connections:
- 4 * 10^9 * 10 / (1000 * 24 * 60 * 60) ≈ 460 days
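As a quick check of that arithmetic, a minimal Python sketch (the per-probe timeout and connection count are the slide's assumptions):

```python
# Back-of-the-envelope cost of probing the entire IPv4 address space.
ADDRESS_SPACE = 2 ** 32            # ~4.3B IPv4 addresses (the slide rounds to 4B)
SECONDS_PER_PROBE = 10             # timeout budget per IP
PARALLEL_CONNECTIONS = 1_000

total_seconds = ADDRESS_SPACE * SECONDS_PER_PROBE / PARALLEL_CONNECTIONS
print(total_seconds / (24 * 60 * 60))  # ~497 days; ~460 with the 4B approximation
```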
How Many Web Sites?
Sampling-based approach. Notation:
- T: all IPs
- S: sampled IPs
- V: valid replies
How Many Web Sites?
1. Select |S| random IPs.
2. Send HTTP requests to port 80 at the selected IPs.
3. Count valid replies (HTTP 200 OK): |V|.
4. Estimate: (number of sites) ≈ |T| * |V| / |S|, where |T| = 2^32.
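A minimal sketch of this procedure using only Python's standard socket module. Note two simplifications: it counts accepted TCP connections on port 80 rather than full HTTP 200 replies, and a real survey would also skip reserved and private address blocks.

```python
import random
import socket

def estimate_web_servers(sample_size=10_000, timeout=5.0):
    """Estimate the number of IPs running a web server by random sampling.

    Probe |S| random IPv4 addresses, count valid replies |V|, and scale
    the reply rate up by |T| = 2**32.
    """
    valid = 0
    for _ in range(sample_size):
        ip = socket.inet_ntoa(random.getrandbits(32).to_bytes(4, "big"))
        try:
            with socket.create_connection((ip, 80), timeout=timeout):
                valid += 1          # a server accepted the connection
        except OSError:
            pass                    # no reply: not a web server
    return (2 ** 32) * valid / sample_size
```

This gives an IP-level estimate, which the issues on the next two slides then refine.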
Issues
Multi-hosted servers: one site can answer at many IPs.
Example: cnn.com resolves to 207.25.71.5, 207.25.71.20, ...
Fix: count a host only at its lowest IP address. For each sampled IP:
- Look up the domain name (reverse DNS).
- Resolve that name back to its IP addresses.
- Is our sampled IP the lowest one?
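A sketch of the lowest-IP rule using standard reverse/forward DNS lookups (the function name is mine, not from the slide):

```python
import socket

def is_lowest_ip(sampled_ip):
    """Check whether sampled_ip is the canonical (lowest) address of its host.

    Reverse-resolve the IP to a hostname, forward-resolve that hostname to
    all of its addresses, and keep the sample only if it is the lowest one.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(sampled_ip)
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return True   # no DNS data: treat the IP as its own host
    # Packed big-endian bytes compare in the same order as numeric IPs.
    lowest = min(addresses, key=socket.inet_aton)
    return sampled_ip == lowest
```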
Issues
Virtual hosting: multiple sites on the same IP.
Fix: find the average number of hosted sites per IP and scale the estimate, as in the sketch below.
- 7.4M sites on 3.4M IPs, found by polling all available site names [Netcraft, 2000]
Other open issues: other ports? Temporarily unavailable sites?
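A tiny worked example of that correction, using the Netcraft figures above (the IP count fed in is a made-up placeholder, standing in for the output of the sampler sketched earlier):

```python
# Correct an IP-level server count for virtual hosting.
sites_per_ip = 7.4e6 / 3.4e6            # ~2.2 site names per IP [Netcraft, 2000]
estimated_ips = 3.0e6                   # hypothetical output of the IP sampler
estimated_sites = estimated_ips * sites_per_ip
print(f"~{estimated_sites:.1e} sites")  # ~6.5e+06
```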
Web Site Statistics (# unique hostnames)
- The Web was invented in 1989 by Tim Berners-Lee.
- The first-ever Web site (info.cern.ch) went online on August 6, 1991.
- Year 1998 (Google launched): 2,636,000 sites, according to OCLC (Online Computer Library Center).
- Year 2014: first time to reach 1 billion sites (October), according to the Netcraft Web Site Survey.
- Today: ~1.6 billion sites. Growth stopped around 2015.
- About 75% of sites are parked: the hostname resolves to an IP, but there is no real content.
http://www.internetlivestats.com/total-number-of-websites/
Questions?
How Many Web Pages?
Sampling-based, as before?
- T: all URLs
- S: sampled URLs
- V: valid replies
Problem: there is an infinite number of possible URLs, so we cannot sample them uniformly.
How Many Web Pages?
Solution 1: estimate the average number of pages per site:
(total pages) ≈ (average # of pages per site) * (total # of sites)
Algorithm: for each site with a valid reply, download all of its pages; take the average.
Result [LG99]: 289 pages per site * 2.8M sites ≈ 800M pages
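A minimal sketch of Solution 1's arithmetic (the crawling itself is stubbed out; the page counts would come from fully downloading each responding sampled site):

```python
def estimate_total_pages(pages_per_sampled_site, total_sites):
    """Solution 1: (average # of pages per site) * (total # of sites).

    pages_per_sampled_site: page counts obtained by fully crawling each
    sampled site that gave a valid reply.
    """
    avg_pages = sum(pages_per_sampled_site) / len(pages_per_sampled_site)
    return avg_pages * total_sites

# Reproducing the [LG99] numbers: 289 pages/site * 2.8M sites ~ 800M pages.
print(estimate_total_pages([289], 2.8e6))   # ~8.1e8
```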
Issues 1,000,000 900,000 800,000 700,000 No of Pages 600,000 500,000 99.99% of the sites 400,000 300,000 200,000 100,000 0 0 200 400 No of Sites 600 800 1000 A small number of sites with TONS of pages Very likely to miss these sites Lots of samples necessary Junghoo "John" Cho (UCLA Computer Science) 14
How Many Pages?
Solution 2: sampling-based.
- T: all pages
- B: base set (a large known subset of T, e.g., a search engine's index)
- S: random samples from T
Estimate: |T| ≈ |B| * |S| / |B ∩ S|
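A sketch of this overlap estimator: since S is uniform over T, the fraction of samples that lands in B estimates |B| / |T|.

```python
def estimate_total(base_set, random_samples):
    """Solution 2: |T| ~ |B| * |S| / |B n S|.

    base_set: set of page URLs in the base set B (e.g., a search engine
    index). random_samples: uniform random page URLs drawn from T.
    """
    overlap = sum(1 for page in random_samples if page in base_set)
    if overlap == 0:
        raise ValueError("no samples hit the base set; draw more samples")
    return len(base_set) * len(random_samples) / overlap
```

With the 1998 numbers on a later slide (|B| = 250M, 35% overlap), this gives 250M / 0.35 ≈ 720M. The hard part, of course, is drawing uniform random pages, which the next slides address.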
Random Page?
Idea: random walk.
- Start from the Yahoo home page.
- Follow random links, say, 10,000 times.
- Select the page you end on.
Problem: biased toward popular pages, e.g., Google, Amazon, Facebook, ...
Random Page?
Random walks on a regular, undirected graph converge to a uniform random sample.
- Regular graph: an equal number of edges at every node.
- Convergence after (1/ε) log N steps, where ε depends on the graph structure (the difference between the top two eigenvalues of the graph Laplacian matrix) and N is the number of nodes.
Idea: transform the Web graph into a regular, undirected graph, then perform a random walk on it.
Q: How can we transform the Web graph into a regular graph?
Ideal Random Walk
Generate the regular, undirected graph:
- Make all edges undirected.
- Decide d, the maximum # of edges per page: say, 300,000.
- If degree(n) < 300,000, add self-loops to node n until its degree reaches d.
Perform random walks on the graph:
- ε ~ 10^-5 for the 1996 Web, N ~ 10^9
- ~3,000,000 steps, but mostly self-loops: only ~100 of them are actual moves along real edges.
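A minimal sketch of a walk on the self-loop-padded regular graph, run on a toy in-memory graph (the real algorithm walks the live Web with d = 300,000):

```python
import random

def regular_walk(neighbors, start, steps, d):
    """Random walk on a graph made d-regular by implicit self-loops.

    neighbors: dict node -> list of undirected neighbors (each shorter
    than d). Each step picks one of d edge slots; slots beyond the real
    neighbors are self-loops, so low-degree nodes hold the walk longer
    and the stationary distribution becomes uniform over all nodes.
    """
    node = start
    for _ in range(steps):
        slot = random.randrange(d)
        if slot < len(neighbors[node]):
            node = neighbors[node][slot]   # a real edge: move
        # else: self-loop, stay put
    return node

# Toy usage: a 3-node path graph, padded to degree d = 4.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
sample = regular_walk(graph, "a", steps=10_000, d=4)
```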
Different Interpretation
A random walk on the irregular Web graph has a high chance of being at a popular node at any particular time. The self-loops increase the chance of being at an unpopular node by making the walk stay there longer.
[Figure: one popular node linked to many unpopular nodes]
Issues
Q: How do we get the edges to/from a node n? We don't know the Web graph!
Possible sources of edge information:
- Edges discovered so far during the walk
- Link data from search engines, like Google and Bing
But incoming-link information is still limited.
WebWalker [BBCF00]
Our graph does not have to be the same as the real Web. Construct the regular undirected graph while performing the random walk:
- Add a new node n when the walk visits n.
- Fix the edges of node n at that time, using (1) edges discovered so far and (2) links reported by search engines.
- Add self-loops as necessary.
- Ignore any edges to n discovered later.
WebWalker
[Figure: WebWalker example with d = 5; visited nodes have 1, 2, 2, 3, and 1 real edges, each padded with self-loops up to degree 5]
WebWalker
Why ignore new incoming edges?
- It keeps the graph regular: already-discovered parts of the graph do not change, so the uniformity theorem still holds.
Can we still arrive at all reachable pages?
- Yes: we ignore only edges pointing to already-visited nodes.
Can we use the same ε?
- No.
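A minimal sketch of one WebWalker step under these rules. The names and the get_links callback are mine, not from [BBCF00]; the real edge data would come from the page's out-links plus search-engine-reported in-links, and for brevity this sketch skips mirroring a frozen edge back into the neighbor's own list.

```python
import random

def webwalker_step(node, frozen_edges, d, get_links):
    """One step of a WebWalker-style walk (simplified sketch).

    frozen_edges: dict node -> neighbor list, fixed at first visit.
    get_links(node): returns the edges currently known for node (its
    out-links plus in-links reported by a search engine). Later-discovered
    edges to an already-frozen node are ignored, so the built part of the
    regular graph never changes.
    """
    if node not in frozen_edges:
        frozen_edges[node] = get_links(node)[:d]  # freeze <= d real edges
    slot = random.randrange(d)                    # remaining slots: self-loops
    edges = frozen_edges[node]
    return edges[slot] if slot < len(edges) else node
```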
WebWalker Results
Size of the Web in 1998, using AltaVista as the base set:
- |B| = 250M
- |B ∩ S| / |S| = 35%
- |T| = 250M / 0.35 ≈ 720M pages
WebWalker Results
Random samples from WebWalker can be used to estimate other Web statistics (1998):
- Avg page size: 12KB
- Avg # of out-links: 10
- Pages by domain: .com: 49%, .edu: 8%, .org: 7%, .net: 6%, ...
The Web of today:
- # of Web sites: ~1.6 billion
- Less than 10% reside in the US
- Avg page size: ~1MB of text (HTML, CSS, JavaScript) + ~1MB of images
- Avg # of out-links: ~100
What About Other Web Pages?
Pages that are:
- Available only within a corporate intranet
- Protected by authentication
- Not reachable by following links, e.g., pages within e-commerce sites
Deep Web vs. Hidden Web: information reachable only through a search interface.
Q: What if a page is reachable both through links and through a search interface?
References
[LG99] Steve Lawrence and C. Lee Giles: Accessibility of Information on the Web. Nature, 400(6740): 107-109, 1999.
[BBCF00] Ziv Bar-Yossef et al.: Approximating Aggregate Queries about Web Pages via Random Walks. VLDB 2000.