Mastering Web Scraping with Python

Learn the essentials of web scraping with Python and how to extract data from the web efficiently. Python's unusually strong communities for both web programming and data management mean an entire research pipeline can live in one code base. This presentation compares the main approaches to web scraping in Python, explains how to manage the resulting data (including storing it in a PostgreSQL database), and walks through creating spiders with Scrapy, setting up a project structure, and building a fully custom spider when Scrapy is not enough.


Presentation Transcript


  1. Basic Web Scraping with Python: Everything you need to get data from the web

  2. Python is great for collecting data from the web. Python is unique as a programming language in that it has very strong communities for both web programming and data management/analysis. As a result, your entire computational research pipeline can be integrated: instead of having separate applications for data collection, storage, management, analysis and visualization, you have one code base.

  3. Approaches to Web Scraping in Python There are two primary approaches to web scraping in Python:
     1. Customize a canned spider using Scrapy.
     2. Create a fully custom spider using requests, lxml, sqlalchemy and celery.
     In general, unless you're trying to do something really unusual (such as distributed, high-throughput crawling), Scrapy is the right choice.

  4. How should I manage my data? Before we start talking about the specifics of each approach, we need to address our data management strategy. The right answer is basically the same for everyone: store it in a PostgreSQL database. PostgreSQL is available for all operating systems and is easy to set up. Research computing also offers managed PostgreSQL databases. You will want to create a table for each scraping project, with columns for the URL of the scraped resource, the date and time it was scraped, and the various data fields you are collecting. If the data you're collecting doesn't fit easily into a standard field, you can create a JSON column and store it there.
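
     For concreteness, here is a minimal sketch of what such a table might look like when defined with SQLAlchemy. The table name, column names and connection string are illustrative placeholders.

     # A hypothetical per-project scraping table: URL, timestamp, data fields,
     # plus a JSONB column for anything that doesn't fit a standard field.
     import sqlalchemy
     from sqlalchemy.dialects.postgresql import JSONB

     metadata = sqlalchemy.MetaData()

     scraped_pages = sqlalchemy.Table(
         'scraped_pages', metadata,
         sqlalchemy.Column('id', sqlalchemy.Integer, primary_key=True),
         sqlalchemy.Column('url', sqlalchemy.Text, nullable=False),
         sqlalchemy.Column('scraped_at', sqlalchemy.DateTime(timezone=True)),
         sqlalchemy.Column('title', sqlalchemy.Text),  # example data field
         sqlalchemy.Column('extra', JSONB),            # catch-all JSON column
     )

     engine = sqlalchemy.create_engine('postgresql://user:pass@host/database')
     metadata.create_all(engine)  # create the table if it does not already exist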

  5. Creating a spider using Scrapy Once you've installed Scrapy, the first step is to create a new project:

     $ scrapy startproject <your_project_name>

     Next, change directory to the newly created project directory and create a spider:

     $ cd <your_project_name>
     $ scrapy genspider <name> <domain>

  6. Creating a spider using Scrapy (continued) Once that's done, a directory will be created with the name of your project, and below that, a directory called spiders containing a Python file named after the spider you generated. If you open it up, it will look something like this:

     import scrapy

     class MySpider(scrapy.Spider):
         name = 'example.com'
         allowed_domains = ['example.com']
         start_urls = ['http://www.example.com/']

         def parse(self, response):
             pass

  7. Creating a spider using Scrapy (continued) In most cases, all you need to do is add data and link extraction code to the parse method, for example:

     def parse(self, response):
         for h3 in response.xpath('//h3').extract():
             yield {"title": h3}

         for url in response.xpath('//a/@href').extract():
             yield scrapy.Request(url, callback=self.parse)

  8. Creating a spider using Scrapy (continued) If you want to add custom code to generate starting URLs for scraping, you can do that by overriding the start_requests method, like this:

     def start_requests(self):
         yield scrapy.Request('http://www.example.com/1.html', self.parse)
         yield scrapy.Request('http://www.example.com/2.html', self.parse)
         yield scrapy.Request('http://www.example.com/3.html', self.parse)

  9. Scrapy canned spider templates Scrapy includes a number of spider templates that make building scrapers for common types of sites and data formats easy, including:
     CrawlSpider - a generic spider that lets you specify rules for following links and extracting items rather than writing extraction code yourself.
     XMLFeedSpider - a spider for crawling XML feeds.
     CSVFeedSpider - like the XMLFeedSpider, but for CSV document feeds.
     SitemapSpider - a spider that crawls based on the links listed in a sitemap.
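
     The CrawlSpider, for example, only needs a set of rules. A minimal sketch, assuming a placeholder domain and extraction XPath:

     # A hypothetical CrawlSpider: follow every internal link and record page titles.
     import scrapy
     from scrapy.spiders import CrawlSpider, Rule
     from scrapy.linkextractors import LinkExtractor

     class ExampleCrawlSpider(CrawlSpider):
         name = 'example_crawl'
         allowed_domains = ['example.com']
         start_urls = ['http://www.example.com/']

         # Follow links within the allowed domain and send each page to parse_item.
         rules = (
             Rule(LinkExtractor(allow_domains=['example.com']),
                  callback='parse_item', follow=True),
         )

         def parse_item(self, response):
             yield {'url': response.url,
                    'title': response.xpath('//title/text()').get()}

     Note that a CrawlSpider should not override parse, which it uses internally; extraction code goes in a separately named callback such as parse_item.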

  10. Using Scrapy spiders Once you've set up your Scrapy project, the next step (from the project directory) is to test and run it. You can test your extraction logic by running:

     $ scrapy shell --spider=<spider> <url>

     This starts an interactive shell session, so you can verify that your logic is extracting data and links from the page as intended. Once you're satisfied that your spider is working properly, you can crawl a site by running:

     $ scrapy crawl <spider> -o results.json

  11. Scrapy shell The Scrapy shell is a good place to verify that you've written your spider correctly. From the shell you can test your parse function and extraction code like so:

     # Produce a list of all links and data extracted from the target URL
     list(spider.parse(response))

     # You can also test xpath selectors here
     for anchor_text in response.xpath('//a/text()').extract():
         print(anchor_text)

  12. What if Scrapy doesn't do what you need? First, it is worth your time to double-check: Scrapy is very configurable, and you want to avoid reinventing the wheel if at all possible. If you are absolutely certain Scrapy isn't up to the task, the next step is a custom spider. For this, you will need Requests, lxml, SQLAlchemy and possibly Celery.

  13. Libraries used to create a custom spider
     Requests - Python's best HTTP request library. You will be using it to fetch web pages with GET/POST/etc. requests.
     lxml - Python's best HTML/XML parsing and processing library. You will be using it to extract data and links from the fetched web pages.
     SQLAlchemy - Python's best database library. You will be using it to write scraping results into a database.
     Celery - a task queue library. Celery is optional unless you want multiple spiders running on the same domain simultaneously.
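
     Before looking at each of these in turn, here is a minimal, single-process sketch of how Requests and lxml fit together in the fetch-parse-follow loop at the heart of any custom spider. The start URL and XPath are placeholders, and the database step is left out (see the SQLAlchemy slide).

     # A hypothetical minimal crawler: fetch a page, extract data and links, repeat.
     import requests
     from lxml import etree

     seen = set()
     queue = ['http://www.example.com/']  # placeholder starting point

     while queue:
         url = queue.pop()
         if url in seen:
             continue
         seen.add(url)

         text = requests.get(url).text
         document = etree.HTML(text)  # lenient HTML parsing

         titles = document.xpath('//h3/text()')  # placeholder data extraction
         print(url, titles)  # in a real spider, write to the database instead

         for href in document.xpath('//a/@href'):
             if href.startswith('http://www.example.com/'):
                 queue.append(href)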

  14. Requests Using requests is drop dead simple. For example:

     >>> import requests
     >>> r = requests.get('https://api.github.com/events')
     >>> r.text
     u'[{"repository":{"open_issues":0,"url":"https://github.com/
     >>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
     >>> requests.get('https://api.github.com/user', auth=('user', 'pass'))
     <Response [200]>

     That is pretty much all there is to it!

  15. lxml The lxml library is large and powerful and can take a while to master. You will need to become familiar with XPath in order to take advantage of lxml's power.

     from lxml import etree

     # here r is the response object from the previous Requests example
     document = etree.HTML(r.text)  # parse the page as (possibly messy) HTML
     anchor_tags = document.xpath('//a')
     hrefs = [a.attrib.get('href') for a in anchor_tags]
     anchor_text = [a.xpath('text()') for a in anchor_tags]

  16. SQLAlchemy SQLAlchemy is a large, powerful library for interacting with databases in Python. There is a lot to learn, but initially you just need to know how to insert data.

     import sqlalchemy

     engine = sqlalchemy.create_engine('postgresql://user:pass@host/database')
     metadata = sqlalchemy.MetaData()
     metadata.reflect(bind=engine)  # load the existing table definitions

     with engine.begin() as connection:
         connection.execute(
             metadata.tables['my_table'].insert(),
             {'column_1': 'random', 'column_2': 'data'},
         )

  17. Celery This part is optional: you don't need Celery unless you have several spiders running on a site simultaneously. Celery provides you with a task queue. This queue might hold all the pages you want to scrape. You would then have tasks that scrape a single page, extract data from it, and push any links they find onto the task queue. You can add jobs to the task queue in one part of your program, and separate worker processes complete jobs from the queue. Celery requires a separate RabbitMQ (or equivalent) broker server.

  18. Celery Worker Celery uses worker processes to handle jobs from the queue. For example, in tasks.py:

     import requests
     from celery import Celery

     app = Celery('tasks', broker='pyamqp://guest@localhost//')

     @app.task
     def scrape(url):
         # extract_data, extract_links and add_data_to_database are your own helper functions
         text = requests.get(url).text
         data = extract_data(text)
         add_data_to_database(data)
         for link in extract_links(text):
             scrape.delay(link)
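
     Something also has to put the first job on the queue; typically another part of your program seeds it with one or more starting URLs. A minimal sketch, assuming the tasks.py module above and a placeholder seed URL:

     # seed.py -- enqueue the first page; the workers take it from there.
     from tasks import scrape

     if __name__ == '__main__':
         scrape.delay('http://www.example.com/')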

  19. Running a Celery worker If you want your tasks to be completed, you need to have one or more worker processes running. Starting a worker process is as simple as:

     $ celery -A tasks worker
