In this tutorial, we focus on Scrapy, one of the best frameworks for web crawling. You will learn the basics of Scrapy and how to create your first web crawler, or spider. The tutorial also demonstrates how to extract and store the scraped data.
Scrapy is a web crawling framework written in Python that is used to crawl websites and extract data efficiently.
You can use the extracted data for further processing, data mining, storage in spreadsheets, or any other business need.
The architecture of Scrapy contains five main components:
- Scrapy Engine
- Scheduler
- Downloader
- Spiders
- Item Pipelines
The Scrapy engine is the main component of Scrapy. It controls the data flow between all other components and triggers events when certain actions occur.
The scheduler receives the requests sent by the engine and queues them.
The downloader fetches the web pages and sends them to the engine. The engine then sends the web pages to the spiders.
Spiders are the classes you write to parse websites and extract data.
The item pipeline processes the items once the spiders have extracted them.
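To make the data flow more concrete, here is a toy sketch in plain Python. It is not how Scrapy is implemented; the function names, the in-memory FAKE_SITE, and the convention that strings are URLs and dicts are items are all assumptions made for illustration.

```python
from collections import deque

def toy_crawl(start_urls, download, parse, pipeline):
    """Toy engine loop: scheduler -> downloader -> spider -> item pipeline."""
    scheduler = deque(start_urls)   # the scheduler queues pending requests
    seen = set(start_urls)          # avoid scheduling the same URL twice
    items = []
    while scheduler:
        url = scheduler.popleft()
        page = download(url)                # the downloader fetches the page
        for result in parse(url, page):     # the spider yields items or new URLs
            if isinstance(result, str):     # a string is treated as a new request
                if result not in seen:
                    seen.add(result)
                    scheduler.append(result)
            else:                           # anything else is treated as an item
                items.append(pipeline(result))
    return items

# Hypothetical in-memory "website": each URL maps to the links found on it.
FAKE_SITE = {
    "/jobs/": ["/jobs/1/", "/jobs/2/"],
    "/jobs/1/": [],
    "/jobs/2/": [],
}

def fake_download(url):
    return FAKE_SITE[url]

def fake_parse(url, links):
    yield {"url": url}   # one item per page
    yield from links     # plus the links to follow

def fake_pipeline(item):
    item["url"] = item["url"].rstrip("/")   # a trivial clean-up step
    return item

items = toy_crawl(["/jobs/"], fake_download, fake_parse, fake_pipeline)
print([item["url"] for item in items])
```

The real engine is asynchronous and far more capable, but the order of the hand-offs is the same.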
You can simply install Scrapy along with its dependencies by using the Python Package Manager (pip).
Run the following command to install Scrapy in Windows:
pip install scrapy
However, the official installation guide recommends installing Scrapy in a virtual environment, because the Scrapy dependencies may conflict with other Python system packages, which can affect other scripts and tools.
Therefore, we will create a virtual environment to provide an isolated development environment.
In this tutorial, we will install a virtual environment first and then continue with the installation of Scrapy.
- Run the following command in the Python Scripts folder to install the virtual environment:
pip install virtualenv
- Now install virtualenvwrapper-win, which lets us create isolated Python virtual environments.
pip install virtualenvwrapper-win
- Set the path within the scripts folder, so you can globally use the Python commands:
- Create a virtual environment:
Where ScrapyTut is the name of our environment:
- Create your project folder and connect it with the virtual environment:
- Bind the virtual environment to the current working directory:
- If you want to turn off the virtual environment mode, simply use deactivate as below:
- If you want to work on the project again, use the workon command along with the name of your environment:
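If you prefer not to install extra tooling, an isolated environment can also be created from Python itself. The sketch below uses the standard library venv module, which is a different tool from the virtualenvwrapper-win workflow described above; the create_env helper and the temporary directory are assumptions for illustration.

```python
import tempfile
import venv
from pathlib import Path

def create_env(parent_dir, name="ScrapyTut"):
    """Create an isolated environment called `name` inside `parent_dir`."""
    env_dir = Path(parent_dir) / name
    # with_pip=False keeps the sketch fast; use with_pip=True for real work
    # so that pip is available inside the new environment.
    venv.create(env_dir, with_pip=False)
    return env_dir

# A throwaway directory is used here so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(tmp)
    has_cfg = (env_dir / "pyvenv.cfg").exists()   # marker file of a virtual environment
    print(env_dir.name, has_cfg)
```

In a real project you would pass your project folder instead of a temporary directory and then activate the environment from the command line.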
Now that we have our virtual environment, we can continue with the installation of Scrapy.
- For installation on Windows, you have to download OpenSSL and install it. Choose the regular version that matches your version of Python. Also, install the Visual C++ 2008 redistributables; otherwise, you will get an error when installing dependencies.
- Add C:\OpenSSL-Win32\bin to the system PATH.
- Scrapy depends on a number of packages that have to be installed along with it. These packages include pywin32, twisted, zope.interface, lxml and pyOpenSSL.
- In the ScrapyTut directory run the following pip command to install Scrapy:
pip install scrapy
Note that when installing Twisted, you may encounter an error such as:
Microsoft Visual C++ 14.0 is required
To fix this error, you will have to install the following from Microsoft Build Tools:
After this installation, if you get another error like the following:
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\link.exe' failed with exit status 1158
Simply download the wheel for Twisted that matches your version of Python, and place this wheel in your current working directory:
Now run the following command:
pip install Twisted-18.9.0-cp37-cp37m-win32.whl
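Whether a wheel "matches your version of Python" can be read from its file name, which ends with a Python tag, an ABI tag and a platform tag. The following sketch splits the name used above and compares the Python tag with the running interpreter; the helper names are hypothetical, and the cpXY tag format assumes CPython.

```python
import sys

def parse_wheel_name(filename):
    """Split a wheel file name into its name, version and compatibility tags."""
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    # The last three parts are always the Python, ABI and platform tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }

def current_python_tag():
    """Return the CPython tag of the running interpreter, for example cp37."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"

info = parse_wheel_name("Twisted-18.9.0-cp37-cp37m-win32.whl")
print(info["python"], info["platform"])
print("matches this interpreter:", info["python"] == current_python_tag())
```

Here cp37 means CPython 3.7 and win32 means 32-bit Windows, so this particular wheel only suits that combination.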
Now, everything is ready to create our first crawler, so let’s do it.
Create a Scrapy Project
Before writing any Scrapy code, you will have to create a Scrapy project using the startproject command like this:
scrapy startproject myFirstScrapy
That will generate the project directory with the following contents:
The spiders folder contains the spiders.
Here the scrapy.cfg file is the configuration file. Inside the myFirstScrapy folder we will have the following files:
Create a Spider
After creating the project, navigate to the project directory and generate your spider along with the website URL that you want to crawl by executing the following command:
scrapy genspider jobs www.python.org
The result will be like the following:
Our spiders folder, with the new “jobs” spider, will look like this:
In the spiders folder, we can have multiple spiders within the same project.
Now let’s go through the content of our newly created spider. Open the jobs.py file which contains the following code:
Here JobsSpider is a subclass of scrapy.Spider. The ‘name’ variable is the name of our spider, which was assigned when the spider was created and is used to run the spider. The ‘allowed_domains’ variable lists the domains this spider is allowed to crawl.
The start_urls variable lists the URLs where the web crawling begins. Then we have the parse method, which parses the content of the page.
To crawl the jobs page of our URL, we need to add one more link to the start_urls property as below:
As we want to crawl more than one page, it is recommended to subclass the spider from the CrawlSpider class instead of the scrapy.Spider class. For this, you will have to import the following class:
from scrapy.spiders import CrawlSpider
Our class will look like the following:
class JobsSpider(CrawlSpider): …
The next step is to initialize the rules variable. The rules variable defines the navigation rules that will be followed when crawling the site. To use the rules object, import the following class:
from scrapy.spiders import Rule
The rules variable contains Rule objects. Each Rule takes parameters such as:
- link_extractor, which is an object of the LinkExtractor class. The link_extractor object specifies how to extract links from the crawled page. For this, you will have to import the LinkExtractor class like this:
from scrapy.linkextractors import LinkExtractor
The rule variable will look like the following:
- callback is a callable, or a string naming a spider method, that is called for each page fetched from an extracted link. It specifies the method that will be used when accessing the elements of the page.
- follow is a Boolean which specifies whether links should be followed from each page extracted with this rule.
Here allow can be used to specify a regular expression that the URLs to be extracted must match. In our example, however, we have restricted the extraction by CSS class with restrict_css, so only links inside elements with the specified class are extracted.
The callback parameter specifies the method that will be called when parsing the page. The .list-recent-jobs is the class for all the jobs listed on the page. You can check the class of an item by right-clicking on that item on the web page and selecting Inspect.
In the example, we called the spider’s parse_item method instead of parse, because CrawlSpider uses the parse method internally to implement its own logic.
The content of the parse_item method is as follows:
This will print Extracting… along with the URL currently being extracted. For example, a link https://www.python.org/jobs/3698/ is extracted. So on the output screen, Extracting…https://www.python.org/jobs/3698/ will be printed.
To run the spider, navigate to your project folder and type in the following command:
scrapy crawl jobs
The output will be like the following:
In this example, we set follow=True, which means the crawler keeps following the links matched by the rule on every page it visits until no new matching links are found, that is, until the list of jobs ends.
If you want to get only the print statement, you can use the following command:
scrapy crawl --nolog jobs
The output will be like the following:
Congratulations! You’ve built your first web crawler.
Now we can crawl web pages. Let’s play with the crawled content a little.
You can use selectors to select some parts of data from the crawled HTML. The selectors select data from HTML by using XPath and CSS through response.xpath() and response.css() respectively. In the previous example, we used a CSS class to select the data.
Consider the following example where we declared a string with HTML tags. Using the selector class we extracted the data in the h1 tag using the Selector.xpath:
Spiders can return the extracted data as plain Python dicts.
For more structure, Scrapy provides the Item class, which provides item objects. We can use these item objects as containers for the scraped data.
Items provide a simple syntax to declare fields. The syntax is like the following:
The Field object specifies the metadata for each field.
You may notice when the Scrapy project is created, an items.py file is also created in our project directory. We can modify this file to add our items as follows:
Here we have added one item. You can call this class from your spider file to initialize the items as follows:
In the above code, we have used the css method of response to extract the data.
In our web page, we have a div with class text. Inside this div, we have a heading with class listing-company; inside this heading, we have a span tag with class listing-location; and finally, we have an a tag that contains some text. This text is extracted using the extract() method.
Finally, we will loop through all the items extracted and call the items class.
Instead of doing all this in the crawler, we can also test our crawler by using only one statement while working in the Scrapy shell. We will demonstrate Scrapy shell in a later section.
Items are populated with the scraped data by using an Item Loader. You can use the item loader to extend the parsing rules.
After extracting items, we can populate the items in the item loader with the help of selectors.
The syntax for Item loader is as follows:
Scrapy shell is a command line tool that lets the developers test the parser without going through the crawler itself. With Scrapy shell, you can debug your code easily. The main purpose of Scrapy shell is to test the data extraction code.
We use the Scrapy shell to test the data extracted by CSS and XPath expression when performing crawl operations on a website.
You can activate Scrapy shell from the current project using the shell command:
If you want to parse a web page, use the shell command along with the link of the page:
scrapy shell https://www.python.org/jobs/3659/
To extract the location of the job, simply run the following command in the shell:
response.css('.text > .listing-company > .listing-location > a::text').extract()
The result will be like this:
Similarly, you can extract any data from the website.
To get the current working URL, you can use the command below:
This is how you extract all the data in Scrapy. In the next section, we will save this data into a CSV file.
Storing the data
Let’s use the response.css in our actual code. We will store the value returned by this statement into a variable and after that, we will store this into a CSV file. Use the following code:
Here we stored the result of response.css into a variable called location. Then we assigned this variable to the location field of a MyfirstscrapyItem() item.
Execute the following command to run your crawler and store the result into a CSV file:
scrapy crawl jobs -o ScrappedData.csv
This will generate a CSV file in the project directory:
Scrapy is a very easy framework for crawling web pages, and this was just the beginning. If you liked the tutorial and are hungry for more, tell us in the comments below which Scrapy topic you would like to read about next.
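To give an idea of the shape of such a file, the sketch below writes two made-up items with the standard library csv module. It only imitates the layout (a header row with the field names, then one row per item) and is not Scrapy's own exporter.

```python
import csv
import io

# Hypothetical scraped items with a single "location" field.
items = [{"location": "Remote"}, {"location": "Berlin"}]

# An in-memory buffer is used instead of a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["location"])
writer.writeheader()      # first row: the field names
writer.writerows(items)   # one row per item

csv_text = buffer.getvalue()
print(csv_text)
```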