Web Scraping with Scrapy

Web scraping is the automated process of extracting data from websites; the programs that perform the crawling are often called web spiders ('Web Örümceği' in Turkish). It involves programmatically fetching web pages and then parsing their content to retrieve specific information. This data can then be stored in various formats, such as CSV, JSON, or databases, for further analysis or use.
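As a rough illustration of that fetch-parse-store cycle, the following standalone sketch uses only Python's standard library (the URL, regular expression, and output filename are illustrative assumptions, not part of Scrapy):

--- Example: fetch, parse, and store with the standard library ---
import csv
import re
import urllib.request

url = 'http://quotes.toscrape.com/'
# Fetch: download the raw HTML of the page
html = urllib.request.urlopen(url).read().decode('utf-8')

# Parse: crude extraction of the page title with a regular expression
match = re.search(r'<title>(.*?)</title>', html, re.S)
title = match.group(1).strip() if match else ''

# Store: write the extracted value to a CSV file
with open('titles.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title'])
    writer.writerow([url, title])

Doing this reliably at scale (many pages, retries, concurrency) quickly outgrows ad-hoc scripts like the one above, which is where a dedicated framework comes in.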
Scrapy is a powerful, fast, and open-source web crawling and scraping framework for Python. It's designed to be highly extensible and provides all the necessary components for building web spiders that can efficiently extract data from websites. Scrapy is built on top of the Twisted asynchronous networking framework, allowing it to handle requests concurrently and significantly speed up the scraping process.
Key features and concepts of Scrapy include:
1. Spiders: These are classes that you define to tell Scrapy how to crawl a site (i.e., which links to follow) and how to extract structured data from its pages.
2. Selectors: Scrapy uses XPath and CSS selectors to extract data from HTML/XML responses, which allows for precise targeting of specific elements on a web page (see the selector sketch after this list).
3. Items: Scrapy Items are simple Python objects used to define the structure of the data you want to scrape. They behave like dictionaries but provide additional protection against typos.
4. Item Pipelines: Once an Item is scraped by a Spider, it is passed through the enabled Item Pipelines. These are components that process the scraped Item, performing tasks such as validation, cleaning, duplicate filtering, and storing the Item in a database or file system (see the pipeline sketch after this list).
5. Middlewares: Scrapy provides Downloader Middlewares (which sit between the engine and the downloader, processing outgoing requests and incoming responses) and Spider Middlewares (which process a spider's input and output). These are useful for tasks like handling cookies, user agents, retries, redirects, and proxies (see the middleware sketch after this list).
6. Requests & Responses: Scrapy interacts with websites by sending Requests and receiving Responses. Requests can specify callbacks that will process the received Responses.
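To make the Selector concept concrete, here is a minimal sketch that runs Scrapy's Selector on an inline HTML snippet (the HTML string itself is an illustrative assumption):

--- Example: CSS and XPath selectors ---
from scrapy.selector import Selector

html = '<div class="quote"><span class="text">Hello</span><small class="author">Ada</small></div>'
sel = Selector(text=html)

# CSS selector: extract the quote text
print(sel.css('span.text::text').get())                    # -> 'Hello'
# XPath selector: extract the author
print(sel.xpath('//small[@class="author"]/text()').get())  # -> 'Ada'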
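A minimal Item Pipeline might look like the sketch below (the class name CleanTextPipeline and the priority value 300 are assumptions for illustration). It would live in myproject/pipelines.py and be enabled in settings.py:

--- Example: a simple Item Pipeline ---
from scrapy.exceptions import DropItem

class CleanTextPipeline:
    def process_item(self, item, spider):
        # Drop items that are missing the 'text' field entirely
        if not item.get('text'):
            raise DropItem('Missing quote text')
        # Clean the field: strip surrounding whitespace
        item['text'] = item['text'].strip()
        return item

--- In settings.py ---
ITEM_PIPELINES = {
    'myproject.pipelines.CleanTextPipeline': 300,  # lower numbers run first
}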
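Likewise, a bare-bones Downloader Middleware could be sketched as follows (the class name and header value are assumptions); it would go in myproject/middlewares.py and be activated via the DOWNLOADER_MIDDLEWARES setting:

--- Example: a simple Downloader Middleware ---
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Stamp every outgoing request with a default User-Agent header
        request.headers.setdefault('User-Agent', 'myproject-bot/0.1')
        return None  # returning None lets Scrapy continue processing the request normally

--- In settings.py ---
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeaderMiddleware': 543,
}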
Scrapy is highly scalable and suitable for a wide range of scraping tasks, from small personal projects to large-scale, distributed data extraction systems. It handles many common scraping challenges out of the box, such as redirects, retries, and rate limiting, allowing developers to focus on the data extraction logic.
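Most of this built-in behaviour is controlled from the project's settings.py; the values below are illustrative, not recommendations:

--- Example: tuning built-in behaviour in settings.py ---
CONCURRENT_REQUESTS = 16      # how many requests Scrapy issues in parallel
DOWNLOAD_DELAY = 0.5          # seconds to wait between requests to the same site
RETRY_ENABLED = True
RETRY_TIMES = 2               # retry failed requests up to 2 extra times
AUTOTHROTTLE_ENABLED = True   # adapt the crawl rate to server response times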
Example Code
To set up a Scrapy project:
1. Install Scrapy: pip install Scrapy
2. Start a new Scrapy project: scrapy startproject myproject
3. Navigate into the project directory: cd myproject
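The startproject command generates a skeleton roughly like the following (file names may vary slightly between Scrapy versions):

myproject/
    scrapy.cfg            # deploy/configuration file
    myproject/            # the project's Python package
        __init__.py
        items.py          # Item definitions
        middlewares.py    # project middlewares
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py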
--- File: myproject/myproject/items.py ---
import scrapy


class QuoteItem(scrapy.Item):
    # Define the fields for your item here, like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
--- File: myproject/myproject/spiders/quotes_spider.py ---
import scrapy

from myproject.items import QuoteItem  # Import your defined Item


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # A unique name for your spider
    start_urls = [
        'http://quotes.toscrape.com/page/1/',  # Starting URL(s) for the spider
    ]

    def parse(self, response):
        # This method is called for each URL in start_urls and for every
        # URL followed afterwards. It parses the response and extracts
        # scraped items.

        # Iterate over each quote container on the page
        for quote_div in response.css('div.quote'):
            item = QuoteItem()  # Instantiate your Item
            item['text'] = quote_div.css('span.text::text').get()  # Extract quote text
            item['author'] = quote_div.css('small.author::text').get()  # Extract author
            item['tags'] = quote_div.css('div.tags a.tag::text').getall()  # Extract all tags
            yield item  # Yield the populated item

        # Find the link to the next page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # If a next page exists, follow it and call 'parse' again
            yield response.follow(next_page, callback=self.parse)
--- How to run the spider ---
From the 'myproject' directory (where scrapy.cfg is located), execute:
scrapy crawl quotes -o quotes.json
This command will run the 'quotes' spider and save the extracted data into a JSON file named 'quotes.json'.
You can also output to CSV: scrapy crawl quotes -o quotes.csv
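A spider can also be run from a plain Python script instead of the CLI; the sketch below uses Scrapy's CrawlerProcess together with the FEEDS setting to produce the same quotes.json output:

--- Example: running the spider from a script ---
from scrapy.crawler import CrawlerProcess
from myproject.spiders.quotes_spider import QuotesSpider

process = CrawlerProcess(settings={
    'FEEDS': {'quotes.json': {'format': 'json'}},  # equivalent to -o quotes.json
})
process.crawl(QuotesSpider)
process.start()  # the script blocks here until the crawl is finished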