Mastering Web Scraping with Python

Learn the essentials of web scraping with Python and how to extract data from the web efficiently. Python's unusually strong communities for both web programming and data management mean an entire research pipeline can live in one code base. This presentation compares the main approaches to web scraping in Python, explains how to manage the resulting data (including storing it in a PostgreSQL database), and walks through creating spiders with Scrapy, setting up a project structure, and building a fully custom spider when Scrapy is not enough.


Presentation Transcript


  1. Basic Web Scraping with Python: Everything you need to get data from the web

  2. Python is great for collecting data from the web. Python is unique as a programming language in that it has very strong communities for both web programming and data management/analysis. As a result, your entire computational research pipeline can be integrated: instead of having separate applications for data collection, storage, management, analysis and visualization, you have one code base.

  3. Approaches to Web Scraping in Python There are two primary approaches to web scraping in Python:
     1. Customize a canned spider using Scrapy.
     2. Create a fully custom spider using requests, lxml, sqlalchemy and celery.
     In general, unless you're trying to do something really unusual (such as distributed, high-throughput crawling), Scrapy is the right choice.

  4. How should I manage my data? Before we start talking about the specifics of each approach, we need to address our data management strategy. The right answer is basically the same for everyone: store it in a PostgreSQL database. PostgreSQL is available for all operating systems and is easy to set up. Research computing also offers managed PostgreSQL databases. You will want to create a table for each scraping project, with columns for the URL of the scraped resource, the date and time it was scraped, and the various data fields you are collecting. If the data you're collecting doesn't fit easily into a standard field, you can create a JSON column and store it there.
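
     For concreteness, here is a minimal sketch of what such a table might look like when defined with SQLAlchemy. The table name, column names and connection string are illustrative placeholders.

     # A hypothetical per-project scraping table: URL, timestamp, data fields,
     # plus a JSONB column for anything that doesn't fit a standard field.
     import sqlalchemy
     from sqlalchemy.dialects.postgresql import JSONB

     metadata = sqlalchemy.MetaData()

     scraped_pages = sqlalchemy.Table(
         'scraped_pages', metadata,
         sqlalchemy.Column('id', sqlalchemy.Integer, primary_key=True),
         sqlalchemy.Column('url', sqlalchemy.Text, nullable=False),
         sqlalchemy.Column('scraped_at', sqlalchemy.DateTime(timezone=True)),
         sqlalchemy.Column('title', sqlalchemy.Text),  # example data field
         sqlalchemy.Column('extra', JSONB),            # catch-all JSON column
     )

     engine = sqlalchemy.create_engine('postgresql://user:pass@host/database')
     metadata.create_all(engine)  # create the table if it does not already exist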

  5. Creating a spider using Scrapy Once you've installed Scrapy, the first step is to create a new project:

     $ scrapy startproject <your_project_name>

     Next, change directory to the newly created project directory and create a spider:

     $ cd <your_project_name>
     $ scrapy genspider <name> <domain>

  6. Creating a spider using Scrapy (continued) Once that's done, a directory will be created with the name of your project, and below that, a directory called spiders containing a Python file named after the spider you generated. If you open it up, it will look something like this:

     import scrapy

     class MySpider(scrapy.Spider):
         name = 'example.com'
         allowed_domains = ['example.com']
         start_urls = ['http://www.example.com/']

         def parse(self, response):
             pass

  7. Creating a spider using Scrapy (continued) In most cases, all you need to do is add data and link extraction code to the parse method, for example:

     def parse(self, response):
         for h3 in response.xpath('//h3').extract():
             yield {"title": h3}

         for url in response.xpath('//a/@href').extract():
             yield scrapy.Request(url, callback=self.parse)

  8. Creating a spider using Scrapy (continued) If you want to add custom code to generate starting URLs for scraping, you can do that by overriding the start_requests method, like this:

     def start_requests(self):
         yield scrapy.Request('http://www.example.com/1.html', self.parse)
         yield scrapy.Request('http://www.example.com/2.html', self.parse)
         yield scrapy.Request('http://www.example.com/3.html', self.parse)

  9. Scrapy canned spider templates Scrapy includes a number of spider templates that make building scrapers for common types of sites and data formats easy, including:
     CrawlSpider - a generic spider that lets you specify rules for following links and extracting items rather than writing extraction code yourself.
     XMLFeedSpider - a spider for crawling XML feeds.
     CSVFeedSpider - like the XMLFeedSpider, but for CSV document feeds.
     SitemapSpider - a spider that crawls based on the links listed in a sitemap.
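
     The CrawlSpider, for example, only needs a set of rules. A minimal sketch, assuming a placeholder domain and extraction XPath:

     # A hypothetical CrawlSpider: follow every internal link and record page titles.
     import scrapy
     from scrapy.spiders import CrawlSpider, Rule
     from scrapy.linkextractors import LinkExtractor

     class ExampleCrawlSpider(CrawlSpider):
         name = 'example_crawl'
         allowed_domains = ['example.com']
         start_urls = ['http://www.example.com/']

         # Follow links within the allowed domain and send each page to parse_item.
         rules = (
             Rule(LinkExtractor(allow_domains=['example.com']),
                  callback='parse_item', follow=True),
         )

         def parse_item(self, response):
             yield {'url': response.url,
                    'title': response.xpath('//title/text()').get()}

     Note that a CrawlSpider should not override parse, which it uses internally; extraction code goes in a separately named callback such as parse_item.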

  10. Using Scrapy spiders Once you've set up your Scrapy project, the next step (from the project directory) is to test and run it. You can test your extraction logic by running:

     $ scrapy shell --spider=<spider> <url>

     This starts an interactive shell session, so you can verify that your logic is extracting data and links from the page as intended. Once you're satisfied that your spider is working properly, you can crawl a site by running:

     $ scrapy crawl <spider> -o results.json

  11. Scrapy shell The Scrapy shell is a good place to verify that you've written your spider correctly. From the shell you can test your parse function and extraction code like so:

     # Produce a list of all links and data extracted from the target URL
     list(spider.parse(response))

     # You can also test xpath selectors here
     for anchor_text in response.xpath('//a/text()').extract():
         print(anchor_text)

  12. What if Scrapy doesn't do what you need? First, it is worth your time to double-check: Scrapy is very configurable, and you want to avoid reinventing the wheel if at all possible. If you are absolutely certain Scrapy isn't up to the task, the next step is a custom spider. For this, you will need Requests, lxml, SQLAlchemy and possibly Celery.

  13. Libraries used to create a custom spider
     Requests - Python's best HTTP request library. You will be using it to fetch web pages with GET/POST/etc. requests.
     lxml - Python's best HTML/XML parsing and processing library. You will be using it to extract data and links from the fetched web pages.
     SQLAlchemy - Python's best database library. You will be using it to write scraping results into a database.
     Celery - a task queue library. Celery is optional unless you want multiple spiders running on the same domain simultaneously.
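
     Before looking at each of these in turn, here is a minimal, single-process sketch of how Requests and lxml fit together in the fetch-parse-follow loop at the heart of any custom spider. The start URL and XPath are placeholders, and the database step is left out (see the SQLAlchemy slide).

     # A hypothetical minimal crawler: fetch a page, extract data and links, repeat.
     import requests
     from lxml import etree

     seen = set()
     queue = ['http://www.example.com/']  # placeholder starting point

     while queue:
         url = queue.pop()
         if url in seen:
             continue
         seen.add(url)

         text = requests.get(url).text
         document = etree.HTML(text)  # lenient HTML parsing

         titles = document.xpath('//h3/text()')  # placeholder data extraction
         print(url, titles)  # in a real spider, write to the database instead

         for href in document.xpath('//a/@href'):
             if href.startswith('http://www.example.com/'):
                 queue.append(href)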

  14. Requests Using requests is drop dead simple. For example:

     >>> import requests
     >>> r = requests.get('https://api.github.com/events')
     >>> r.text
     u'[{"repository":{"open_issues":0,"url":"https://github.com/
     >>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
     >>> requests.get('https://api.github.com/user', auth=('user', 'pass'))
     <Response [200]>

     That is pretty much all there is to it!

  15. lxml The lxml library is large and powerful and can take a while to master. You will need to become familiar with XPath in order to take advantage of lxml's power.

     from lxml import etree

     # here r is the response object from the previous Requests example
     document = etree.HTML(r.text)  # parse the page as (possibly messy) HTML
     anchor_tags = document.xpath('//a')
     hrefs = [a.attrib.get('href') for a in anchor_tags]
     anchor_text = [a.xpath('text()') for a in anchor_tags]

  16. SQLAlchemy SQLAlchemy is a large, powerful library for interacting with databases in Python. There is a lot to learn, but initially you just need to know how to insert data.

     import sqlalchemy

     engine = sqlalchemy.create_engine('postgresql://user:pass@host/database')
     metadata = sqlalchemy.MetaData()
     metadata.reflect(bind=engine)  # load the existing table definitions

     with engine.begin() as connection:
         connection.execute(
             metadata.tables['my_table'].insert(),
             {'column_1': 'random', 'column_2': 'data'},
         )

  17. Celery This part is optional: you don't need Celery unless you have several spiders running on a site simultaneously. Celery provides you with a task queue. This queue might hold all the pages you want to scrape. You would then have tasks that scrape a single page, extract data from it, and push any links they find onto the task queue. You can add jobs to the task queue in one part of your program, and separate worker processes complete jobs from the queue. Celery requires a separate RabbitMQ (or equivalent) broker server.

  18. Celery Worker Celery uses worker processes to handle jobs from the queue. For example, in tasks.py:

     import requests
     from celery import Celery

     app = Celery('tasks', broker='pyamqp://guest@localhost//')

     @app.task
     def scrape(url):
         # extract_data, extract_links and add_data_to_database are your own helper functions
         text = requests.get(url).text
         data = extract_data(text)
         add_data_to_database(data)
         for link in extract_links(text):
             scrape.delay(link)
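
     Something also has to put the first job on the queue; typically another part of your program seeds it with one or more starting URLs. A minimal sketch, assuming the tasks.py module above and a placeholder seed URL:

     # seed.py -- enqueue the first page; the workers take it from there.
     from tasks import scrape

     if __name__ == '__main__':
         scrape.delay('http://www.example.com/')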

  19. Running a Celery worker If you want your tasks to be completed, you need to have one or more worker processes running. Starting a worker process is as simple as:

     $ celery -A tasks worker
