Working with the Requests Library in Python

Requests and Beautiful Soup
 
The Requests Library
Now that we understand a little of how the web works, what a URL is,
and how HTML documents are structured, it's time to figure out how to
read them from a Python program.
To start, we need to be able to request and download content from
URLs.
To do this, we'll be using the Requests library
(https://requests.readthedocs.io/en/latest/).
This is an external library, so we'll need to install it first:
pip install requests
Then to use it, we just import the library into our scripts:
import requests
A basic request
In this class, we'll only be making simple GET requests.
To do that, we use the Requests library's get() function:
URL = 'https://cs111.byu.edu'
response = requests.get(URL)
This returns a response object which, in the code above, is bound to
the response name.
If we wanted to do a POST request, we'd use the post() function.
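As a sketch of what a POST might look like (the URL and form fields here
are made-up placeholders, not a real endpoint):
# Hypothetical example: send form data to a server with post()
payload = {'name': 'Ada', 'course': 'CS 111'}
post_response = requests.post('https://example.com/submit', data=payload)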
Checking the response code
When we get our response object, one of the first things we should do
is check the status code.
If we got a 200, everything is fine and we can continue.
Anything else means we have some sort of error.
The response object has a status_code attribute:
>>> response.status_code
200
We can check the status codes against the values we know, or we can
use the names in the requests.codes attribute.
The most common ones we'll be checking are requests.codes.ok (200) and
requests.codes.not_found (404):
>>> response.status_code == requests.codes.ok
True
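In a script (rather than the REPL), one way to structure this check is a
simple guard like the following sketch; raise_for_status() is part of the
Requests API and raises an exception for any error status code:
response = requests.get(URL)
if response.status_code == requests.codes.ok:
    print('Success!')
elif response.status_code == requests.codes.not_found:
    print('Page not found.')
else:
    # Raises requests.HTTPError for any other error status
    response.raise_for_status()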
Response Content
The content returned in the response can be accessed in a variety of
ways.
The .text attribute provides the text representation of the resource.
For a text file like an HTML file, it will just be the contents:
>>> print(response.text)
<!DOCTYPE html>
<html class="h-full" lang="en">
  <head>
The .content attribute provides the data in its binary form. This is
useful when downloading non-text resources such as images.
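For example, here is a minimal sketch of saving an image with .content
(the image URL is a made-up placeholder):
# Download an image and write its raw bytes to a file
img_response = requests.get('https://example.com/logo.png')
if img_response.status_code == requests.codes.ok:
    with open('logo.png', 'wb') as f:  # 'wb' = write binary
        f.write(img_response.content)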
 
 
The Document Tree
 
A simple HTML document
 
The document has structure.
How could we represent it?
<html>
  <head>
    <title>Hello world!</title>
  </head>
  <body>
    <h1>Hello world!</h1>
    <p>This is a simple <em>Hello World</em> web page.</p>
    <p>This paragraph has a link to the <a
       href="https://cs111.byu.edu">CS 111 Homepage</a> in
       it.</p>
  </body>
</html>
A Tree!
The tree structure that represents a web page is called the Document
Object Model (DOM).
[Figure: the DOM tree of the simple HTML document, with <html> at the
root, <head> and <body> as its children, and the title, heading,
paragraphs, link, and text nodes below them.]
 
 
Beautiful Soup
 
Beautiful Soup
The Beautiful Soup library is designed to make accessing the elements
of the DOM easier for us as developers.
To install the library:
pip install beautifulsoup4
To use the library, we import bs4:
import bs4
Beautiful Soup allows you to perform a lot of manipulations on the
DOM, but we're only going to be using it to read and extract data from
our web pages.
The full documentation on the library can be found at
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Making Soup
To allow us to work with the document tree, we first need to make a
Beautiful Soup object.
The constructor takes two inputs:
A string containing the HTML
This is the contents of the .text attribute from our response object
A parser that knows how to read the HTML
We can just use the built-in Python parser called 'html.parser'
soup = bs4.BeautifulSoup(response.text, 'html.parser')
With the Beautiful Soup object, we can start exploring the document
tree.
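Putting the pieces together so far, a minimal end-to-end sketch looks
like this:
import bs4
import requests

URL = 'https://cs111.byu.edu'
response = requests.get(URL)
if response.status_code == requests.codes.ok:
    # Parse the downloaded HTML into a navigable document tree
    soup = bs4.BeautifulSoup(response.text, 'html.parser')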
Finding Tags
Beautiful Soup generates Tag objects for every HTML tag found in the
document.
Each tag appears as an attribute on the soup object:
soup.title
soup.p
soup.h1
soup.a
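For the simple "Hello world!" page shown earlier, these lookups would
return tags like the following (a REPL sketch):
>>> soup.title
<title>Hello world!</title>
>>> soup.a
<a href="https://cs111.byu.edu">CS 111 Homepage</a>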
Finding Tags
However, each of these tag names only returns the first instance of
that tag in the document.
If you want to get all of the instances, use the find_all() method with
the name of the tag you are looking for.
This returns a list with all the instances of the specified tag as its
elements:
soup.find_all('p')
[<p>This is a simple <em>Hello World</em> web page.</p>,
 <p>This paragraph has a link to the <a href="https://cs111.byu.edu">CS
 111 Homepage</a> in it.</p>]
Tag Attributes
 
Each instance of a tag has a number of attributes:
.name – the name of the tag
soup.title.name
'title'
.attrs – a dictionary of all the tag's attributes, with the attribute
name as the key and its value as the value in the dictionary
soup.a.attrs
{'href': 'https://cs111.byu.edu'}
These can be accessed like any dictionary, using the key to get the value:
soup.a.attrs['href']
'https://cs111.byu.edu'
.string – the text contained within the tag
soup.a.string
'CS 111 Homepage'
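As a sketch combining find_all() with .attrs, here is one way to collect
every link URL on a page:
# Gather the href value from every <a> tag in the document
links = []
for a in soup.find_all('a'):
    links.append(a.attrs['href'])
print(links)  # for the sample page: ['https://cs111.byu.edu']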
Accessing a Tag's Children
If a tag has children, we can access them through the .contents and
.children attributes.
.contents is simply a list of all the child elements
.children is an iterator that allows you to iterate through the child
elements:
for item in soup.body.children:
    print(type(item))
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
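The NavigableStrings in that output are just the whitespace between the
tags; a sketch that keeps only the element children can test the type
with isinstance():
# Print only the actual tags, skipping the whitespace strings between them
for item in soup.body.children:
    if isinstance(item, bs4.element.Tag):
        print(item.name)  # h1, p, p for the sample page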
Accessing a Tag's Parent
Just like you can find a tag's children, you can also find its parent.
The .parent attribute gives you the tag that is the current tag's parent:
soup.a.parent
<p>This paragraph has a link to the <a
href="https://cs111.byu.edu">CS 111 Homepage</a> in it.</p>
The .parents attribute is an iterator that allows you to iterate through
all of a tag's ancestors back to the document root:
for parent in soup.a.parents:
    print(parent.name)
p
body
html
[document]
 
 
Search Filters
 
Search Filters
Earlier we showed you the find_all() method and passed in a tag name
as the thing to find.
There are other options as well (see the sketch after this list):
A regular expression – this will find all the tags whose name matches
the regular expression provided
A list – this will find all the tags that match anything in the list
True – this returns all the tags
A function – you can pass in a function that takes a tag as its argument
and returns True if the tag matches any criteria you define in the
function. find_all() will return any tag that gives a True result from
the function.
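Here is a quick sketch of each filter type against the sample page (all
of these use the find_all() behavior described above):
import re

# Regular expression: tags whose name starts with 'h' (html, head, h1)
soup.find_all(re.compile(r'^h'))
# List: all the <h1> and <p> tags
soup.find_all(['h1', 'p'])
# True: every tag in the document
soup.find_all(True)
# Function: tags that have an href attribute (just the <a> tag here)
def has_href(tag):
    return tag.has_attr('href')
soup.find_all(has_href)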
Searching Strings
By default, find_all() searches for tags that match the input criteria.
Sometimes, you want to search the strings in a document for
something.
To do this, you use the string parameter to the find_all() method.
It can take the same filters as searching tags, i.e. strings, regular
expressions, etc.:
soup.find_all(string=re.compile(r"[Hh]ello"))
'Hello world!'
'Hello world!'
'Hello World'
Searching only part of the document
Not only can find_all() be called on the entire document, it can also be
called on a specific tag to search only the items in that tag and its
children:
soup.body.find_all(string=re.compile(r"[Hh]ello"))
'Hello world!'
'Hello World'
prettify()
If you want to see the contents of a tag in a slightly easier-to-read
format, you can use the prettify() method.
It prints out one tag or string per line, indenting them by one space
per level of the document tree they appear on.
print(soup.p)
<p>This is a simple <em>Hello World</em> web page.</p>
print(soup.p.prettify())
<p>
 This is a simple
 <em>
  Hello World
 </em>
 web page.
</p>
 
 