Mastering Web Scraping with Python

 
Basic Web Scraping with Python
 
Everything you need to get data from the web
 
Python is great for collecting data from the web
 
Python is unique among programming languages in that it has very strong
communities for both web programming and data management/analysis.
As a result, your entire computational research pipeline can be integrated:
instead of having separate applications for data collection, storage, management,
analysis and visualization, you have one code base.
 
Approaches to Web Scraping in Python
 
There are two primary approaches to web scraping in Python:
1. Customize a canned spider using ScraPy
2. Create a fully custom spider using requests, lxml, sqlalchemy and celery
In general, unless you're trying to do something really unusual - such as
distributed, high-throughput crawling - ScraPy is the right choice.
 
How should I manage my data?
 
Before we start talking about the specifics of each approach, we need to address
our data management strategy.
The right answer is basically the same for everyone: store it in a PostgreSQL
database.  PostgreSQL is available for all operating systems, and is easy to set
up.  Research computing also offers managed PostgreSQL databases.
You will want to create a table for each scraping project, with columns for the URL
of the scraped resource, the date and time it was scraped, and the various data
fields you are collecting.  If the data you’re collecting doesn’t fit easily into a
standard field, you can create a JSON column and store it there.
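For example, such a table could be defined and created with SQLAlchemy (covered later in this deck). This is only a sketch; the table and column names below are hypothetical:

import sqlalchemy
from sqlalchemy.dialects.postgresql import JSONB

metadata = sqlalchemy.MetaData()

# One row per scraped page: the URL, when it was scraped, a couple of
# ordinary data columns, and a JSONB column for anything irregular.
scraped_pages = sqlalchemy.Table(
    "scraped_pages", metadata,
    sqlalchemy.Column("id", sqlalchemy.Integer, primary_key=True),
    sqlalchemy.Column("url", sqlalchemy.Text, nullable=False),
    sqlalchemy.Column("scraped_at", sqlalchemy.DateTime(timezone=True), nullable=False),
    sqlalchemy.Column("title", sqlalchemy.Text),
    sqlalchemy.Column("extra", JSONB),
)

engine = sqlalchemy.create_engine("postgresql://user:pass@host/database")
metadata.create_all(engine)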
 
Creating a spider using ScraPy
 
Once you’ve installed ScraPy, the first step is to create a new project:
$ scrapy startproject <your_project_name>
Next, change directory to the newly created project directory, and create a spider:
$ cd <your_project_name>
$ scrapy genspider <name> <domain>
 
Creating a spider using ScraPy (continued…)
 
Once that's done, a directory will be created with the name of your project, and
below that, a directory called spiders, containing a Python file named after the
spider you generated.  If you open it, it will look something like this:
 
 
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        pass
 
 
Creating a spider using ScraPy (continued…)
 
In most cases, all you need to do is add data and link extraction code to the parse
method, for example:
 
   
    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
 
Creating a spider using ScraPy (continued…)
 
If you want to add custom code to generate starting URLs for scraping, you can do
that by overriding the start_requests method, like this:
 
   
    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)
 
ScraPy canned spider templates
 
ScraPy includes a number of spider templates that make building scrapers for
common types of sites and data formats easy, including:
CrawlSpider - a generic spider that lets you specify rules for following links and
extracting items rather than writing extraction code (a sketch follows below).
XMLFeedSpider - a spider for crawling XML feeds.
CSVFeedSpider - like the XMLFeedSpider, but for CSV document feeds.
SitemapSpider - a spider that crawls based on links listed in a sitemap.
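As a rough sketch of what using a canned template looks like, here is a minimal CrawlSpider that follows links matching a URL pattern and extracts a title from each matched page. The spider name, domain, URL pattern and extracted fields are made up for illustration:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ArticleSpider(CrawlSpider):
    name = 'articles'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    # Follow every link whose URL contains /articles/, hand each matching
    # page to parse_item, and keep following links from those pages too.
    rules = (
        Rule(LinkExtractor(allow=r'/articles/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'url': response.url,
            'title': response.xpath('//h1/text()').extract_first(),
        }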
 
Using ScraPy spiders
 
Once you’ve set up your ScraPy project, the next step (from the project directory)
is to test and run it.
You can test your extraction logic by running the command: scrapy shell
--spider=<spider> <url>.  This starts an interactive spider session, so you can
verify that your logic extracts data and links from the page as intended.
Once you're satisfied that your spider is working properly, you can crawl a site
by running the command: scrapy crawl <spider> -o results.json.
 
ScraPy shell
 
The ScraPy shell is a good place to verify that you’ve written your spider correctly.
From the shell you can test your parse function and extraction code like so:
 
 
 
# Produce a list of all links and data extracted from the target url
list(spider.parse(response))
 
# You can also test xpath selectors here
for anchor_text in response.xpath("//a/text()").extract():
    print(anchor_text)
 
What if ScraPy doesn’t do what you need?
 
First, it is worth your time to double-check: ScraPy is very configurable, and you
want to avoid reinventing the wheel if at all possible.
If you are absolutely certain ScraPy isn’t up to the task, the next step is a custom
spider.
For this, you will need Requests, LXML, SQL Alchemy and maybe Celery.
 
Libraries used to create a custom spider
 
Requests - This is Python’s best HTTP request library.  You will be using it to
fetch web pages using get/post/etc requests.
LXML - This is Python’s best HTML/XML parsing and processing library.  You
will be using it to extract data and links from the fetched web pages.
SQL Alchemy - This is Python’s best database library.  You will be using it to
write scraping results into a database.
Celery - This is a task queue library.  Using Celery is optional, unless you want
multiple spiders running on the same domain simultaneously.
 
Requests
 
Using requests is drop dead simple.  For example:
>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r.text
u'[{"repository":{"open_issues":0,"url":"https://github.com/…
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
>>> requests.get('https://api.github.com/user', auth=('user', 'pass'))
<Response [200]>
 
That is pretty much all there is to it!
 
LXML
 
LXML is a large, powerful library that can take a while to master.  You will need to
become familiar with xpath in order to take advantage of LXML’s power.
from lxml import etree

# here r is the response object from the previous requests example
document = etree.HTML(r.text)  # HTML parser; etree.fromstring() expects well-formed XML

anchor_tags = document.xpath("//a")
hrefs = [a.attrib.get("href") for a in anchor_tags]
anchor_text = [a.xpath("text()") for a in anchor_tags]
 
SQL Alchemy
 
SQL Alchemy is a large, powerful library for interacting with databases in Python.
There is a lot to learn, but initially you just need to know how to insert data.
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@host/database")

# Reflect the existing tables from the database (SQLAlchemy 2.0-style API)
metadata = sqlalchemy.MetaData()
metadata.reflect(bind=engine)
tables = metadata.tables

with engine.connect() as connection:
    connection.execute(tables["my_table"].insert().values(column_1="random", column_2="data"))
    connection.commit()
 
Celery
 
This part is optional.  You don’t need celery unless you have several spiders
running on a site simultaneously.
Celery provides you with a task queue.  This queue might be all the pages you
want to scrape.  You would then have tasks that scrape a single page, extract
data from it and push any found links to the task queue.
You can add jobs to the task queue in one part of your program, and separate
worker processes complete jobs from the queue.
Celery requires a separate RabbitMQ (or equivalent) server.
 
Celery Worker
 
Celery uses worker processes to handle jobs from the queue:
worker.py:
from celery import Celery
import requests

app = Celery('tasks', broker='pyamqp://guest@localhost//')

@app.task
def scrape(url):
    # Fetch the page, store its data, then queue a task for every link found.
    # extract_data, add_data_to_database and extract_links are assumed to be
    # defined elsewhere in your project.
    text = requests.get(url).text
    data = extract_data(text)
    add_data_to_database(data)
    for link in extract_links(text):
        scrape.delay(link)
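To actually start a crawl, some other part of your program has to put the first job on the queue. A minimal sketch, assuming the worker.py module above is importable and using an example seed URL:

from worker import scrape

# Enqueue the first page; the workers fan out from there by calling
# scrape.delay() on every link they discover.
scrape.delay('http://www.example.com/')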
 
Running a Celery worker
 
If you want your tasks to be completed, you need to have one or more worker
processes running.  Starting a worker process is as simple as:
$ celery -A tasks worker