A few months back I took a one-month trial of a virtual machine on Windows Azure to run an experimental web-crawling project. The experiment was a good learning experience; more than anything, it helped me appreciate the technology infrastructure that Google, Yahoo, MSN and the like must have. I couldn't crawl even a single website completely, while they continue to crawl billions of web pages regularly. In this post, I'll describe how I went about crawling more than 100 million public web pages with a machine having a Core 2 Duo processor and 3.5 GB of RAM, for as long as the trial period lasted.
I used a popular C-based library called libcurl to fetch the web pages and wrote a wrapper in C++ to parse the HTML.
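If you haven't used libcurl before, the fetch routine is essentially this. It's a minimal sketch using libcurl's easy interface; my actual wrapper had more error handling and request options:

```cpp
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl calls this as data arrives; we append it to a std::string.
static size_t write_cb(char* ptr, size_t size, size_t nmemb, void* userdata) {
    static_cast<std::string*>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

// Fetch one URL into a string; returns an empty string on failure.
std::string fetch_page(const std::string& url) {
    std::string body;
    CURL* curl = curl_easy_init();
    if (!curl) return body;

    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);   // follow redirects
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, 30L);          // don't hang forever

    if (curl_easy_perform(curl) != CURLE_OK)
        body.clear();

    curl_easy_cleanup(curl);
    return body;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    std::cout << fetch_page("https://example.com").size() << " bytes fetched\n";
    curl_global_cleanup();
}
```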
Just like any web crawler, the process started with a list of seed URLs, each of which may or may not lead to new URLs. As you might have seen, most websites have many pages and built-in links to many other web pages. These pages may live on the same domain or belong to other websites. The logic of crawling is simple: fetch a web page, look for new URLs in its HTML source, and add those new URLs to the ever-growing list of pages still to fetch.
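In C++ terms, the core loop is roughly the following; `extract_urls` is a placeholder name for whatever link extraction your HTML parser does:

```cpp
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

// Placeholders: fetch_page() as sketched earlier, extract_urls() is whatever
// link extraction your HTML parser provides (hypothetical name here).
std::string fetch_page(const std::string& url);
std::vector<std::string> extract_urls(const std::string& html);

void crawl(const std::vector<std::string>& seeds) {
    std::queue<std::string> to_fetch;        // URLs waiting to be crawled
    std::unordered_set<std::string> seen;    // URLs already queued or fetched

    for (const auto& s : seeds) { to_fetch.push(s); seen.insert(s); }

    while (!to_fetch.empty()) {
        std::string url = to_fetch.front();
        to_fetch.pop();

        std::string html = fetch_page(url);
        for (const auto& link : extract_urls(html)) {
            if (seen.insert(link).second)    // true only for unseen URLs
                to_fetch.push(link);         // the list keeps growing
        }
    }
}
```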
I started the crawler with a seed
URL list of a few public web-profiles of a popular professional networking site.
As every page has links to close to a dozen other profiles, it wasn't a problem
to find new profile pages to add to the URL list.
The crawler started by fetching one web page at a time. Every request took around 5-10 seconds to completely fetch the page and parse its content, and I managed close to 600 profiles in the first hour. It was very frustrating to see such a slow crawl rate. After a few hours of searching I came across libcurl's multiple-request (multi) interface, which can fetch up to 1000 pages in a single call. This could increase the crawling rate very significantly and make the whole exercise meaningful.
As it was the first time I was using libcurl, it took a while to explore its features, so I spent a weekend understanding it and embedding its functionality in my code.
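For the curious, the feature is libcurl's multi interface: you attach many easy handles to one multi handle and drive them all together. A stripped-down sketch of the idea; real code would keep a per-handle write buffer and check each transfer's result:

```cpp
#include <curl/curl.h>
#include <string>
#include <vector>

static size_t discard_cb(char*, size_t size, size_t nmemb, void*) {
    return size * nmemb;   // the real crawler stored the body per handle
}

// Fetch a batch of URLs concurrently with the multi interface.
// Assumes curl_global_init() was called at program startup.
void fetch_batch(const std::vector<std::string>& urls) {
    CURLM* multi = curl_multi_init();
    std::vector<CURL*> handles;

    for (const auto& u : urls) {
        CURL* e = curl_easy_init();
        curl_easy_setopt(e, CURLOPT_URL, u.c_str());
        curl_easy_setopt(e, CURLOPT_WRITEFUNCTION, discard_cb);
        curl_easy_setopt(e, CURLOPT_TIMEOUT, 30L);
        curl_multi_add_handle(multi, e);
        handles.push_back(e);
    }

    int still_running = 0;
    do {
        curl_multi_perform(multi, &still_running);
        curl_multi_wait(multi, nullptr, 0, 1000, nullptr);  // wait for activity
    } while (still_running > 0);

    for (CURL* e : handles) {
        curl_multi_remove_handle(multi, e);
        curl_easy_cleanup(e);
    }
    curl_multi_cleanup(multi);
}
```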
With the new crawler, I started fetching 500 URLs in one go. The multiple requests themselves worked, and the crawler fetched the pages successfully, but as anyone might expect, the target website blocked my requests within 2-3 minutes, after just a few attempts. Now the task was to run the crawler without getting blocked.
There are two ways to do this. The first is to hit the target website at a very low rate, just a few hundred pages per day, so that it doesn't treat you as a threat and continues to respond to your requests. The other is to send your requests so that they appear to come from disparate sources, with significant delays, so that the target website isn't alarmed by an automated robot.
I chose the second option. I browsed many websites that publish proxy server IPs, built a list of around 3000 proxies, and added a function to the code to continuously check their status; at any time about 15-20% of them were working. The proxies that worked against my target website were used to fetch the pages; the ones that were not active were removed and added to the end of the queue. In this fashion, the crawler was able to run continuously.
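The rotation itself boils down to setting CURLOPT_PROXY per request and cycling through the list. Roughly like this, with the liveness check simplified to "did the request succeed":

```cpp
#include <curl/curl.h>
#include <deque>
#include <string>

static size_t discard_cb(char*, size_t size, size_t nmemb, void*) {
    return size * nmemb;   // the real crawler stored the body, of course
}

// Rotate through a list of "host:port" proxies; a proxy that fails the
// request is demoted to the back of the queue and the next one is tried.
bool fetch_via_proxies(const std::string& url, std::deque<std::string>& proxies) {
    for (size_t attempts = 0; attempts < proxies.size(); ++attempts) {
        std::string proxy = proxies.front();
        proxies.pop_front();

        CURL* curl = curl_easy_init();
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_PROXY, proxy.c_str());  // route via proxy
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, discard_cb);
        curl_easy_setopt(curl, CURLOPT_TIMEOUT, 20L);
        CURLcode rc = curl_easy_perform(curl);
        curl_easy_cleanup(curl);

        if (rc == CURLE_OK) {
            proxies.push_front(proxy);   // keep a working proxy at the front
            return true;
        }
        proxies.push_back(proxy);        // dead proxy goes to the back
    }
    return false;
}
```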
There were a few occasions when the application would crash suddenly. After many hours of debugging, I realized the crash was happening inside a libcurl function call, which I didn't have much control over since it came from the library itself. This was quite troubling, as I had to restart the application every time it crashed, which meant I couldn't leave it running unattended. So I set up a monitor that checked the running status of the application through WMIC; if the application wasn't running, it relaunched it, allowing me to sleep peacefully.
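The watchdog lived outside the crawler. A rough sketch of the idea; the executable name `crawler.exe` is just a placeholder, and the query is the stock `wmic process where ... get ProcessId` form:

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <thread>

// Ask WMIC whether a process with the given image name is running.
// "crawler.exe" below is a placeholder name for the crawler binary.
bool is_running(const std::string& image_name) {
    std::string cmd =
        "wmic process where \"name='" + image_name + "'\" get ProcessId";
    FILE* pipe = _popen(cmd.c_str(), "r");
    if (!pipe) return false;

    bool found = false;
    char line[256];
    while (std::fgets(line, sizeof(line), pipe)) {
        for (char* p = line; *p; ++p)
            if (*p >= '0' && *p <= '9') { found = true; break; }  // a PID was printed
    }
    _pclose(pipe);
    return found;
}

int main() {
    while (true) {
        if (!is_running("crawler.exe"))
            std::system("start \"\" crawler.exe");   // relaunch the crawler
        std::this_thread::sleep_for(std::chrono::seconds(60));
    }
}
```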
The next challenge was handling the lists of URLs that had already been fetched and those still queued to be fetched. Every new URL had to be checked against both lists to avoid fetching the same page twice. When the list exceeded around 15 million URLs, I noticed that 3.5 GB of RAM could no longer hold it. Besides the deduplication check, I also had to track how many pages each URL appeared on, which gave it a rank: the more occurrences of a URL, the better its rank and the higher its priority in the queue.
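Conceptually, the in-memory bookkeeping looked something like this before it outgrew the RAM (a simplified sketch; the names are illustrative):

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

struct UrlBook {
    std::unordered_set<std::string> fetched;    // URLs already downloaded
    std::unordered_map<std::string, int> rank;  // how many pages link to a URL

    // Record a URL found on some page; the count doubles as its priority.
    void found(const std::string& url) {
        if (!fetched.count(url))
            ++rank[url];             // more occurrences -> higher rank
    }

    // Pick the not-yet-fetched URL with the highest rank (linear scan here;
    // a priority queue would be the obvious refinement).
    std::string next() const {
        std::string best;
        int best_rank = -1;
        for (const auto& kv : rank)
            if (kv.second > best_rank) { best = kv.first; best_rank = kv.second; }
        return best;
    }

    void mark_fetched(const std::string& url) {
        fetched.insert(url);
        rank.erase(url);             // it no longer competes for the queue
    }
};
```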
After a few days of continuous crawling, it became necessary to use a proper database to handle the URL queues. I chose SQLite because of its easy interface with C++. The process continued for over two weeks, filling close to 2 TB of the hard drives with fetched web pages and parsed content in the SQLite database. The HTML parser was also written in C++; I tried doing it in Python, but it was noticeably slower than C++, so I stuck with C++.
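The schema can stay tiny: one table keyed on the URL, with the occurrence count as the rank. Here is a hedged sketch using the sqlite3 C API; the table and column names are illustrative, not the original schema:

```cpp
#include <sqlite3.h>
#include <string>

// Upsert a discovered URL: insert it if new, then bump its occurrence count.
void record_url(sqlite3* db, const std::string& url) {
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS urls ("
        "  url     TEXT PRIMARY KEY,"
        "  rank    INTEGER DEFAULT 0,"   // how many pages referenced this URL
        "  fetched INTEGER DEFAULT 0);", // 0 = still queued, 1 = downloaded
        nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;

    sqlite3_prepare_v2(db, "INSERT OR IGNORE INTO urls(url) VALUES(?1);",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, url.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    sqlite3_prepare_v2(db, "UPDATE urls SET rank = rank + 1 WHERE url = ?1;",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, url.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}

// The next batch to crawl is then just:
//   SELECT url FROM urls WHERE fetched = 0 ORDER BY rank DESC LIMIT 500;
```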
By then the trial period was about to end, with just 3 days' worth of money left in the trial account. I thought of taking the parsed content out of the virtual machine and storing it somewhere safe where I could access it later. However, there was no way I could store close to 2 TB of HTML content free of cost.
So I decided to upload a few GBs of the meaningful data to an online drive. I dragged the data file to the online drive and went to sleep. The next morning, what I saw was unbelievable: the trial was over and only a few GBs of data had reached the online drive. I could not even log into the virtual machine, as the trial had already ended for lack of credit in the account.
The cause was that while downloading data into the virtual machine was very cheap, uploading data out of the machine was quite expensive. It sucked out all the remaining hundreds of rupees.
The trial account was closed abruptly, leaving all the parsed data and SQLite database files behind on the remote cloud server!
Still, for me it was an amazing learning experience, doing a grain of what Google and the like do on a daily basis.