Data Scraping
 
 
Robots.txt
 
Human vs. Computer web browsing
 
Webpages are designed for human interaction
Automated tools (bots) can access pages faster and more broadly on a
site than any person interacting with the site.
This can potentially cause problems:
Bandwidth costs money – bots constantly downloading large files increases costs for site owners
Bandwidth is finite – bots downloading lots of files, large or small, limits the bandwidth available to human users
Server resources are finite – large numbers of connections from a bot limits connections for human visitors, effectively creating a Denial of Service (DoS) attack on the server.
 
The robots.txt file
 
An important part of crawling a website is respecting the site's robots.txt file
This is a file that tells automated web tools how they can interact with the site.
If you plan on visiting/downloading a large number of pages on a
website with an automated tool, this should be the first page you
download. 
Always!
At the very least, the file describes directories and files that you
should not access.
It might also specify how fast you can send queries to the site
 
Robots.txt contents
 
There are four main parts to a robots.txt file:
User-agent: specification – specifies which robots the following lines apply to
Allow: lines – paths that the robot is explicitly allowed to visit
Disallow: lines – paths that the robot is not allowed to visit
Crawl-delay: lines – time, in seconds, that the robot must allow to pass between requests.
 
For example:
 
User-agent: *
Crawl-delay: 10
Disallow: /data
Disallow: /sites/default/files/episodes
 
Processing a robots.txt file
 
When you process a robots.txt file, you proceed top to bottom and
stop at the first match
You use the rules in the first section that matches your user agent
(more on that on the next slide)
When looking at a link, you start down the list of allow and disallow rules, and the first one that matches the link you're looking at is the one that applies.
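If you would rather not implement this matching yourself, Python's standard library includes urllib.robotparser, which reads and applies these rules for you. A minimal sketch, using example.com as a placeholder site:
 
from urllib import robotparser
 
rp = robotparser.RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')   # placeholder URL
rp.read()                                          # download and parse the file
 
# True if the rules allow the given user agent to fetch this path
print(rp.can_fetch('*', 'https://www.example.com/data/file.csv'))
print(rp.crawl_delay('*'))   # the Crawl-delay for this agent, or None if unspecified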
 
User-agent:
 
As part of the HTTP protocol, every program performing a web query must send a "User-agent" string identifying it.
This can be anything the program wants it to be:
My current Firefox browser uses: "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0"
My current Google Chrome uses: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
Python's requests library uses: "Python-requests/2.31.0" (the number at the end is the version number of the library)
In the robots.txt file, you can specify different rules for different robots by specifying which user agent applies to which section of the file.
However, most robots.txt files just specify '*', signifying that the rules apply to all bots.
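If you want your own bot to identify itself with something more descriptive than the default, the requests library lets you override the User-agent by passing a headers dictionary. A small sketch (the bot name is made up):
 
import requests
 
headers = {'User-agent': 'MyCrawler/1.0 (student project)'}   # hypothetical name
response = requests.get('https://www.example.com/', headers=headers)
print(response.request.headers['User-agent'])   # shows what was actually sent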
 
Allow:
 
Lines that begin with Allow: contain paths that bots are explicitly
allowed to visit.
These are often used to allow access to certain file types or
subdirectories within directories that are disallowed generally.
Since robots.txt files are processed top to bottom, if you hit an allowed path before a disallowed path, you can visit that page; thus the Allow: lines typically precede the Disallow: lines
User-agent: *
Crawl-delay: 10
Allow: /data/ship-templates
Disallow: /data
Disallow: /sites/default/files/episodes
 
Disallow:
 
This is the most common entry in a robots.txt file
Each line gives a path that a robot is not supposed to visit.
If a link matches a path in a Disallow line before matching an Allow line, the link should not be visited.
There are a couple of special cases for the Disallow entries:
 
Disallow:
 
This allows everything – it's the same as not having a robots.txt file
 
Disallow: /
 
This disallows everything on the site as '/' matches every path
 
Matching paths
 
The easiest way to see if a path matches or not is to build a regular
expression from the specified path
The one thing to be aware of is the wildcards:
* - zero or more characters
? - a single character
 
Examples:
 
Path in Disallow                    Regular expression
/                                   r"/"
/data/                              r"/data/"
/data/images/*.jpg                  r"/data/images/.*\.jpg"
/data/images/character??.png        r"/data/images/character..\.png"
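As a sketch of that idea, the helper below (a made-up function, not part of any library) escapes the path and swaps the robots.txt wildcards for their regex equivalents:
 
import re
 
def disallow_to_regex(path):
    # Escape regex metacharacters, then map the robots.txt wildcards
    # onto their regular-expression equivalents.
    pattern = re.escape(path)
    pattern = pattern.replace(r'\*', '.*')   # * -> zero or more characters
    pattern = pattern.replace(r'\?', '.')    # ? -> a single character
    return re.compile('^' + pattern)
 
rule = disallow_to_regex('/data/images/*.jpg')
print(bool(rule.match('/data/images/photo01.jpg')))   # True
print(bool(rule.match('/docs/index.html')))           # False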
 
Crawl-delay
 
This is not an "official" part of a robots.txt file but some bots do honor
it (Google does not)
The number specified is basically a number of seconds that must
elapse between requests to the site.
e.g. a value of 10 means the bot should limit its requests to one every 10 seconds.
While you are not "required" to follow this command, it is good
etiquette to do so.
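One simple way to honor a Crawl-delay is to sleep between requests, roughly like the sketch below (the URL list is hypothetical):
 
import time
import requests
 
crawl_delay = 10   # seconds, as read from robots.txt
urls = ['https://www.example.com/page1', 'https://www.example.com/page2']
for url in urls:
    response = requests.get(url)
    # ... process the page here ...
    time.sleep(crawl_delay)   # wait before the next request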
Data Scraping
 
It is very common to want to extract data from websites to use in
some sort of analysis.
For a single page we might be able to do it by hand, but as the
number of pages grows, or if it changes frequently, this can get
tedious.
Downloading the HTML page gives you the data but has all the tags in
it.
We need to be able to extract the data from the tags.
This process is generally known as web scraping or data scraping.
 
Understanding the data
 
When you are web scraping, you must understand the structure of the
page you are trying to extract data from.
Download the page source and look at it
Get the page and use BeautifulSoup's prettify() function to make it more
readable if necessary
Look at the tags and attributes on the data you want to extract
Are there patterns?
Are there specific tags or attributes used?
Understanding how the page is constructed will help you in writing a
script to extract the data you need.
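For example, a minimal sketch of fetching a page and printing an indented version of its HTML so you can study the structure (the URL is a placeholder):
 
import requests
from bs4 import BeautifulSoup
 
response = requests.get('https://www.example.com/degrees.html')   # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())   # indented HTML is much easier to read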
 
Finding Tags with specific attributes
 
Previously we showed you how to find all instances of a tag within a
document using the find_all() method. But it can do more
Often you don't want all the tags but rather ones with specific
attributes or even attribute values.
To find all instances of a tag with a specific, known attribute name, you can use the find_all() function in this form:
 
tags = soup.find_all('<tag>', <attribute>=True)
 
This finds all instances of <tag> that have the <attribute> attribute (regardless of its value) and ignores all others
It returns a list of Tag objects
The following would find all the images with a height attribute:
 
tags = soup.find_all('img', height=True)
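Since the result is a list of Tag objects, you can loop over it and read attribute values with dictionary-style access, for example:
 
for img in soup.find_all('img', height=True):
    print(img['height'], img.get('src'))   # .get() returns None if 'src' is missing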
 
Finding tags with specific attribute values
 
If we want to find tags with specific attributes and attribute values, we can use the find_all() function again:
 
tags = soup.find_all('<tag>', {'<attribute>':'<value>'})
 
The first argument is the tag name
The second argument is a dictionary with the attribute as the key and the attribute value as the dictionary value
This would find all the 'tr' tags with an 'id' attribute with "data" as its value:
 
tags = soup.find_all('tr', {'id':'data'})
Searching multiple tags
 
It's also possible to search for different tags that have the same
attribute or same attribute and value.
Instead of passing in a single tag as the first argument to find_all(), you pass in a list of tags
This finds all the 'p' and 'h1' tags that have an 'id' attribute:
 
tags = soup.find_all(['p','h1'], id=True)
 
This finds all the 'h1' and 'tr' tags that have the 'id' attribute with "header" as its value:
 
tags = soup.find_all(['h1','tr'], {'id':'header'})
Searching for multiple attributes and values
 
Since the collection of attributes and values to search for is a dictionary, we can add additional attribute-value pairs by adding entries to the dictionary:
 
tags = soup.find_all(['h1','tr','p'],
       {'id':'header','data':'index'})
 
The tag must have all the attribute-value pairs specified to match
 
If you want to have different possible values for a single attribute, make the value for that attribute in the dictionary into a list containing the possible values:
 
tags = soup.find_all(['h1','tr','p'],
       {'id':['header','start'],'data':'index'})
 
Using Regular Expressions
 
As mentioned in a previous lecture, you can use regular expressions to
select tags, attributes, or values
To do so, you must first compile the regular expression so it can be used.
This is done using the re.compile() function:
 
re.compile(<regex string>)
 
This returns a regular expression object that you can bind to a name and use repeatedly, or just put the re.compile() expression right where you want the regex to be used
Using Regular Expressions (examples)
 
data_index = re.compile(r'data\d*')
 
This creates a regular expression that matches data followed by zero or more digits
 
results = soup.find_all(['td','p'],
               {'id': re.compile(r'data\d*')})
 
This searches for any 'td' or 'p' tags that have the 'id' attribute with a value that matches the regular expression
 
results = soup.find_all(['td','p','tr'],
               {'id':data_index,'title':data_index})
 
This uses the regex multiple times as the value for different attributes
 
Reading Tables
 
 
Reading Tables
 
We've looked at reading arbitrary tags, let's look specifically at
reading tabular data on a web page
Imagine a table of degrees granted per year at a university:
 
Academic Year    Bachelors    Masters    Doctoral    Total
2021-2022        6406         1128       233         7767
2020-2021        6683         959        192         7834
2019-2020        6684         1033       212         7929
...
1896-1897        1            0          0           1
 
We want a list with the data from each column and a list of column
headings.
How do we read this if it is rendered on a webpage?
 
The table as HTML
 
<table id="degrees" border="1">
  <tr><th>Academic Year</th><th>Bachelors</th>
      <th>Masters</th><th>Doctoral</th><th>Total</th></tr>
  <tr><td>2021-2022</td><td>6406</td>
      <td>1128</td><td>233</td><td>7767</td></tr>
  <tr><td>2020-2021</td><td>6683</td><td>959</td>
      <td>192</td><td>7834</td></tr>
  <tr><td>2019-2020</td><td>6684</td>
      <td>1033</td><td>212</td><td>7929</td></tr>
...
  <tr><td>1896-1897</td><td>1</td>
      <td>0</td><td>0</td><td>1</td></tr>
</table>
 
Exercise: Read the table's data
 
How do we find the table?
How do we get the column headers?
How do we read data from each column/row?
 
Exercise: Read the table's data (solution)
 
table = soup.find_all('table',{'id':'degrees'})[0]
heads = table.find_all('th')
headers = []
for item in heads:
    headers.append(item.string)
data = [[], [], [], [], []]   # make a list of 5 lists, one for each column
rows = table.find_all('tr')
for row in rows:
    columns = row.find_all("td")
    index = 0
    for col in columns:
        data[index].append(col.string)
        index += 1
for col in data:
    print(col)
 
Handling Images
 
Reading and Saving Images
 
What if you want to save all the images on the page to a local
directory?
What information do you need to do this?
The URL to the image
The directory you want to save the image in
The output filename
How do we do this?
Finding the URLs
 
How would we find the URLs to all the images on a page?
 
images = soup.find_all('img')
img_srcs = []
for img in images:
    img_srcs.append(img['src'])
 
Remember these could be relative links, so you'll need to construct the full URL from the current page/domain before you try to access the images (one way to do that is sketched below).
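One way to build those full URLs is urllib.parse.urljoin, which resolves a (possibly relative) src against the page's own address. A sketch, assuming page_url holds the URL the page was downloaded from:
 
from urllib.parse import urljoin
 
full_urls = [urljoin(page_url, src) for src in img_srcs]   # page_url is an assumed variable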
Requesting the images
 
Once you have the URL for the image, you can just make a GET request to have the server send it to you:
 
image_response = requests.get(imageURL)
 
The .text attribute on the response object is not going to give us what we need.
There is another attribute, .raw, that gives us the raw bytes of the data in the response.
Note: to properly use the data via the .raw attribute, your GET request needs to include an additional parameter: stream=True
 
image_response = requests.get(imageURL, stream=True)
 
We've got the raw data, what do we do with it?
 
Saving binary data to a file
 
We can use Python's copyfileobj() function (from the shutil library) to write the raw file contents directly to disk:
 
import shutil
 
with open(output_filename, 'wb') as out_file:
    shutil.copyfileobj(image_response.raw, out_file)
del image_response    # this frees up the memory (optional)
 
output_filename is the path+filename of the output file
The 'wb' parameter says to open the file to write in binary format
The copyfileobj() function takes a source of binary data and a destination
you could use this to copy a file: open one file to read and use that as the source, and the second to write and use that as the destination
The del command deletes the named object immediately instead of waiting for Python to do it. Can help to save memory.
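Putting the pieces together, a rough sketch of downloading every image found earlier into a local directory (the directory name and file-naming scheme are just illustrative; real code would derive the extension from the URL or the Content-Type header):
 
import os
import shutil
import requests
 
os.makedirs('images', exist_ok=True)
for i, url in enumerate(full_urls):   # full_urls from the urljoin sketch above
    image_response = requests.get(url, stream=True)
    output_filename = os.path.join('images', 'image_{}.jpg'.format(i))
    with open(output_filename, 'wb') as out_file:
        shutil.copyfileobj(image_response.raw, out_file)
    del image_response   # free the memory before the next download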
 
Using the image in the program
 
You could also use the image directly in your program if you needed to.
You'd access it using the response object's .content attribute, which allows you to access the content as binary data.
 
from PIL import Image    # PIL is the library under byuimage
from io import BytesIO   # a built-in Python library
 
image = Image.open(BytesIO(image_response.content))