
Understanding Hyperlinks, Robots, and Data Scraping
Explore the world of hyperlinks, robots, and data scraping. Learn about default URLs, types of hyperlinks, absolute and relative links, and how domain relative links work. Dive into the complexities of link structures on the web.
Hyperlinks, Robots, & Data Scraping
Default URLs
Default URLs are URLs that don't have an actual page name as part of the path:
    https://byu.edu
    https://cs111.byu.edu
    https://cs111.byu.edu/lab/lab04/
    ...
They may or may not have a trailing forward slash (/). When a web server receives a URL like this, it appends its default filename to the URL to get the correct file to load. This default filename is usually index.html.
Types of Hyperlinks
The content of the href attribute in an <a> tag can have several different formats:
    Absolute links - these contain complete URLs
    Relative links - these are links relative to the current domain or page and don't contain complete URLs
    Section links - these are a form of relative link that point at another part of the same page
    mailto: links - these trigger the creation of an email message
The src attribute in an <img> tag can only have absolute and relative links.
Absolute Links
When a hyperlink contains an absolute link, the value of the href or src attribute is a complete URL containing the protocol, the domain, and the path (which may be just /). For example, all of the following contain absolute links:
    <a href="https://byu.edu">BYU homepage</a>
    <a href="https://cs111.byu.edu/hw/hw07">Homework 7</a>
    <a href="https://frontierexplorer.org/data/FrontierExplorer036.pdf">Issue 36</a>
    <img src="https://expandingfrontier.com/wp-content/uploads/2021/06/MS2-rotated-e1624401676511-1024x737.jpg">
Relative links
There are two types of relative links:
    Domain relative
    Page relative
All of these links are part of the same domain as the page you are currently visiting. The only difference is the starting point for finding the path.
Domain Relative Links
Domain relative links do not contain a protocol or domain and start with a leading forward slash (/). These are relative to the domain of the website. To get the full URL to the resource, you append the link body to the domain. For example, if we are on the https://cs111.byu.edu domain and see the following link:
    <a href="/hw/hw02">Homework 2</a>
the full URL that we should load is https://cs111.byu.edu/hw/hw02
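You don't have to assemble these URLs by hand in Python. A minimal sketch using the standard library's urllib.parse.urljoin (the page URL below is just an example):

    from urllib.parse import urljoin

    # A leading slash means "start over from the domain", no matter what page we are on now
    current_page = "https://cs111.byu.edu/lab/lab03/index.html"  # example page
    print(urljoin(current_page, "/hw/hw02"))
    # prints: https://cs111.byu.edu/hw/hw02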
Page Relative Links
Like domain relative links, page relative links do not begin with a protocol and domain. Additionally, they do not have a leading forward slash (/). These links are relative to the current page's directory, not the domain. To generate the full URL:
    Start with the current page's URL
    If it has a page name (i.e. <something>.htm or .html), remove the page name and the forward slash before the page name
    Append a forward slash and the link contents to the URL
Example of generating URL from page relative links (1)
If you were on a page with the URL https://cs111.byu.edu/lab/lab03 and you had this link:
    <img src="assets/iron.png">
the resultant URL would be:
    https://cs111.byu.edu/lab/lab03/assets/iron.png
This example uses a default URL that doesn't have an explicit page name (ending in .html, .htm, or .<something>), so we just append the link reference (the value of the src attribute in this case) to the current page's URL.
Example of generating URL from page relative links (2)
If you were on a page with the URL https://cs111.byu.edu/lab/lab04/index.html and you had this link:
    <a href="lab04.zip">lab04.zip</a>
the resultant URL would be:
    https://cs111.byu.edu/lab/lab04/lab04.zip
This example has an explicit page name (index.html), so we drop the page name and append the link reference to get the full URL.
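Python's urllib.parse.urljoin follows these same rules, with one caveat worth knowing (not covered on the slide): it only treats a default URL as a directory if the URL ends in a trailing slash, so add one if it is missing. A minimal sketch reproducing both examples:

    from urllib.parse import urljoin

    # Example (1): a default URL; note the trailing slash so urljoin treats lab03 as a directory
    print(urljoin("https://cs111.byu.edu/lab/lab03/", "assets/iron.png"))
    # prints: https://cs111.byu.edu/lab/lab03/assets/iron.png

    # Example (2): the explicit page name (index.html) is dropped before appending
    print(urljoin("https://cs111.byu.edu/lab/lab04/index.html", "lab04.zip"))
    # prints: https://cs111.byu.edu/lab/lab04/lab04.zip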
Section links
If a tag in an HTML document has an id attribute:
    <h2 id="starter-files">Starter Files</h2>
then we can create a hyperlink to that specific point in the document using the pound (or hashtag) symbol (#). This can be done to link to content on a different page:
    <a href="https://cs111.byu.edu/lab/lab07/index.html#starter-files">Lab 7 Starter Files</a>
or to link to id attributes in our current page:
    <a href="#starter-files">Starter Files</a>
You can see these types of links in the table of contents sections in the left sidebar on the lab, homework, and project pages.
mailto links
As the WWW was designed to connect research scientists and collaborators, the ability to send email was built into the web through the mailto: URL scheme. The format for this type of link is the text mailto: followed by an email address:
    You can email <a href="mailto:tstephen@cs.byu.edu">Dr. Stephens</a>
Clicking on this type of link opens your system's default email client and starts composing a new email with the email address in the To: field. This is not used as much today, as spammers like to harvest these links and use them to feed their spambots.
Finding Hyperlinks in an HTML document
We can use Beautiful Soup to find all the hyperlinks in a given HTML document. You'll be doing this as part of Project 4. Unfortunately, we can't just search for attributes; we have to search for tags, and then look at the attributes on those tags. But that's okay, since we know all links (hrefs) are in <a> tags.
Finding Hyperlinks in an HTML document
You start by finding all the <a> tags in the document, then loop over each tag and get its href attribute value. Then you can process the link.

    import requests
    from bs4 import BeautifulSoup

    URL = "https://cs111.byu.edu"
    resp = requests.get(URL)
    soup = BeautifulSoup(resp.text, "html.parser")
    aTags = soup.find_all("a")
    for tag in aTags:
        link = tag.get("href")  # or tag.attrs["href"]
        # do something with the link here
PageRank
In 1998 when Google was created, it used the PageRank algorithm to determine which pages were most relevant for a given search. This worked by analyzing hyperlinks. A simplified form of PageRank works like this:
    Crawl and read every page on the web.
    For every page, count the number of other web pages that link to that page.
    That count (or some mathematical formula applied to that count) is the page's PageRank.
    For a given search, find all the pages that match and list them in descending order of their PageRank.
If you're interested in exactly how it worked, you can read the PageRank Wikipedia article.
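To make the simplified version concrete, here is a toy sketch in Python; the outgoing_links data is made up purely for illustration:

    # Made-up example data: page -> list of pages it links to
    outgoing_links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    # Count how many pages link *to* each page (the simplified PageRank)
    inbound_counts = {}
    for source, targets in outgoing_links.items():
        for target in targets:
            inbound_counts[target] = inbound_counts.get(target, 0) + 1

    # "Search results" listed in descending order of this count
    for page in sorted(inbound_counts, key=inbound_counts.get, reverse=True):
        print(page, inbound_counts[page])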
Link Counting in Project 4
You will be doing a similar type of calculation in Project 4. You will crawl the CS111 website and keep track of how many times each link was referenced. Every time you encounter a link you either:
    add it to the list of links with a count of 1 if it's not in the list
    increment its count by one if it is in the list
You'll want to use a dictionary to do this, with the link as the key and the count as the value. You'll need to construct the full URL to use as the key so you don't have different entries for relative and absolute links. You also need to remove section tags so all the links to a page are counted together and not individually.
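One possible way to build that full, fragment-free key (a sketch using the standard library, not necessarily how Project 4 expects you to do it):

    from urllib.parse import urljoin, urldefrag

    def normalize_link(current_url, href):
        full_url = urljoin(current_url, href)      # resolve relative links against the current page
        full_url, _fragment = urldefrag(full_url)  # drop any #section part
        return full_url

    print(normalize_link("https://cs111.byu.edu/lab/lab07/index.html", "#starter-files"))
    # prints: https://cs111.byu.edu/lab/lab07/index.html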
Link Counting with a Dictionary

    link_counts = {}  # create an empty dictionary

    # the variable link contains a full URL
    if link in link_counts:
        link_counts[link] += 1  # add 1 if it is already there
    else:
        link_counts[link] = 1   # add an initial count of 1 if not

When done, the dictionary will contain every full URL found as its keys and, for each URL, a count of how many times it was referenced. You could then use this to do different types of analysis. We're going to use it to make a plot, but more about that in a future lecture.
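For example, one simple analysis (just an illustration, not part of the project requirements) is to list the most-referenced URLs:

    # Sort the (url, count) pairs by count, largest first, and print the top five
    top_links = sorted(link_counts.items(), key=lambda item: item[1], reverse=True)
    for url, count in top_links[:5]:
        print(count, url)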
Human vs. Computer web browsing
Webpages are designed for human interaction. Automated tools (bots) can access pages faster and more broadly on a site than any person interacting with the site. This can potentially cause problems:
    Bandwidth costs money - bots constantly downloading large files increases costs for site owners
    Bandwidth is finite - bots downloading lots of files, large or small, limits the bandwidth available to human users
    Server resources are finite - large numbers of connections from a bot limits connections for human visitors, effectively creating a Denial of Service (DoS) attack on the server
The robots.txt file An important part of crawling a website is respecting the site's robots.txt file This is a file that tells automated web tools how they can interact with the site. If you plan on visiting/downloading a large number of pages on a website with an automated tool, this should be the first page you download. Always! At the very least, the file describes directories and files that you should not access. It might also specify how fast you can send queries to the site
Robots.txt contents
There are four main parts to a robots.txt file:
    User-agent: - specifies which robots the following lines apply to
    Allow: lines - paths that the robot is explicitly allowed to visit
    Disallow: lines - paths that the robot is not allowed to visit
    Crawl-delay: lines - time, in seconds, that the robot must allow to pass between requests
For example:
    User-agent: *
    Crawl-delay: 10
    Disallow: /data
    Disallow: /sites/default/files/episodes
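For reference, Python's standard library ships a robots.txt parser (urllib.robotparser) that applies these rules for you; a minimal sketch, with an example URL:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://cs111.byu.edu/robots.txt")  # example site
    rp.read()  # downloads and parses the robots.txt file

    # Is the "*" user agent allowed to fetch this path?
    print(rp.can_fetch("*", "https://cs111.byu.edu/lab/lab04/"))
    # Crawl-delay for this user agent, or None if the file doesn't specify one
    print(rp.crawl_delay("*"))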
Processing a robots.txt file
When you process a robots.txt file, you proceed top to bottom and stop at the first match. You use the rules in the first section that matches your user agent (more on that on the next slide). When looking at a link, you go down the list of Allow and Disallow rules, and the first one that matches the link you're looking at is the one that applies.
User-agent:
As part of the HTTP protocol, every program performing a web query must send a "User-agent" string identifying it. This can be anything the program wants it to be:
    My current Firefox browser uses: "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0"
    My current Google Chrome uses: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
    Python's requests library uses: "python-requests/2.31.0" (the number at the end is the version number of the library)
In the robots.txt file, you can specify different rules for different robots by specifying which user agent applies to which section of the file. However, most robots.txt files just specify '*', signifying that the rules apply to all bots.
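If you want your crawler to identify itself as something more descriptive than the default, requests lets you override the header; a minimal sketch (the bot name here is made up):

    import requests

    headers = {"User-Agent": "cs111-example-bot/0.1"}  # made-up identifier for illustration
    resp = requests.get("https://cs111.byu.edu", headers=headers)
    print(resp.request.headers["User-Agent"])  # confirms the header that was actually sent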
Allow:
Lines that begin with Allow: contain paths that bots are explicitly allowed to visit. These are often used to allow access to certain file types or subdirectories within directories that are disallowed generally. Since robots.txt files are processed top to bottom, if you hit an allowed path before a disallowed path, you can visit that page; thus the Allow: lines typically precede the Disallow: lines.
    User-agent: *
    Crawl-delay: 10
    Allow: /data/ship-templates
    Disallow: /data
    Disallow: /sites/default/files/episodes
Disallow:
This is the most common entry in a robots.txt file. Each line gives a path that a robot is not supposed to visit. If a link matches a path in a Disallow line before matching an Allow line, the link should not be visited. There are a couple of special cases for the Disallow entries:
    Disallow: (with no path) - this allows everything; it's the same as not having a robots.txt file
    Disallow: / - this disallows everything on the site, as '/' matches every path
Matching paths
The easiest way to see if a path matches or not is to build a regular expression from the specified path. The one thing to be aware of is the wildcards:
    * - zero or more characters
    ? - a single character
Examples:
    Path in Disallow:                  Regular expression
    /                                  r"/"
    /data/                             r"/data/"
    /data/images/*.jpg                 r"/data/images/.*\.jpg"
    /data/images/character??.png       r"/data/images/character..\.png"
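A sketch of that conversion in Python (one possible implementation, not the only way to do it):

    import re

    def disallow_to_regex(path):
        pattern = re.escape(path)               # escape literal characters such as '.'
        pattern = pattern.replace(r"\*", ".*")  # robots wildcard * -> regex .*
        pattern = pattern.replace(r"\?", ".")   # robots wildcard ? -> regex .
        return re.compile(pattern)

    rule = disallow_to_regex("/data/images/*.jpg")
    print(bool(rule.match("/data/images/photos/cat.jpg")))  # True - matches the rule
    print(bool(rule.match("/about.html")))                  # False - no match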
Crawl-delay
This is not an "official" part of a robots.txt file, but some bots do honor it (Google does not). The number specified is basically a number of seconds that must elapse between requests to the site; e.g. a value of 10 means the bot should limit its requests to one every 10 seconds. While you are not "required" to follow this command, it is good etiquette to do so.
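Honoring it can be as simple as sleeping between requests; a minimal sketch (the list of URLs is a placeholder):

    import time
    import requests

    CRAWL_DELAY = 10  # seconds, taken from the site's robots.txt

    urls_to_visit = ["https://cs111.byu.edu", "https://cs111.byu.edu/lab/lab04/"]  # placeholder list
    for url in urls_to_visit:
        resp = requests.get(url)
        # ... process the page here ...
        time.sleep(CRAWL_DELAY)  # wait before making the next request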
Data Scraping It is very common to want to extract data from websites to use in some sort of analysis. For a single page we might be able to do it by hand, but as the number of pages grows, or if it changes frequently, this can get tedious. Downloading the HTML page gives you the data but has all the tags in it. We need to be able to extract the data from the tags. This process is generally known as web scraping or data scraping.
Understanding the data
When you are web scraping, you must understand the structure of the page you are trying to extract data from:
    Download the page source and look at it
    Get the page and use BeautifulSoup's prettify() function to make it more readable if necessary
    Look at the tags and attributes on the data you want to extract
    Are there patterns? Are there specific tags or attributes used?
Understanding how the page is constructed will help you in writing a script to extract the data you need.
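A quick way to do that inspection from Python (the URL is just an example):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://cs111.byu.edu")   # example page to inspect
    soup = BeautifulSoup(resp.text, "html.parser")
    print(soup.prettify()[:2000])                  # first part of the nicely indented HTML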