The very first step in building a web crawler is to extract the links present in a given seed page. To get a picture of the source code of a web page, right-click anywhere on this page and select the option "View Source Code". A new window opens containing the source code of the page. A basic knowledge of HTML will be helpful, but there is nothing to worry about if you are new to it. In the source code of the web page you'll notice certain links. These links have the following format:
<a href="http://www.link.com">
"<a href" is the start of the HTML tag used to represent a link on a web page. Our primary objective is to find all of these "<a href" tags in the source code of a web page and extract their links. If you have gone through the earlier chapter on String Handling, finding and extracting links will be a cakewalk for you. The part between the quotes above, i.e. http://www.link.com, is known as a link, and that is what we have to extract. For the sake of simplicity, let's break our task down into smaller steps.
- Searching for the "<a href" tag.
- Searching for the starting quotes of the URL.
- Searching for the ending quotes of the URL.
- Extracting the URL between the starting and the ending quotes.
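As a rough preview, the four steps can be sketched on a one-line string. The sample tag and the names `snippet` and `url` are illustrative assumptions, not part of the walkthrough that follows; `print(...)` with a single argument behaves the same in Python 2 and Python 3.

```python
# A minimal sketch of the four steps on a tiny, made-up snippet.
snippet = '<p>See <a href="http://www.link.com">this page</a></p>'

start_link = snippet.find("<a href")             # step 1: locate the tag
start_quote = snippet.find('"', start_link)      # step 2: opening quote of the URL
end_quote = snippet.find('"', start_quote + 1)   # step 3: closing quote of the URL
url = snippet[start_quote + 1:end_quote]         # step 4: text between the two quotes

print(url)
```

Each of these lines is explained one at a time in the steps below.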
Step 1
How do we search for the "<a href" tag in the source code? We haven't yet talked about how to get the source code of a given web page; we'll cover that in detail in a later post. For the time being, let us assume that we have a variable named page, initialized to the source code of the web page. We'll do this manually: first of all, let us initialize the "page" variable to some sample HTML code. Since we are dealing with multiple lines, we make use of triple quotes ("""), so the assignment opens with page = """ and closes with a matching """.
<body>
<a href="http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>"""
Now we have a string variable "page" containing the HTML code of a sample web page, in which we have to search for the first occurrence of the "<a href" tag. This is done using the "find" method, which returns the index of the first occurrence of "<a href". We assign this index to the variable "start_link". Following is the code:
>>> start_link = page.find("<a href")
>>> print start_link
48
>>> print page[48:]
<a href="http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>
The first command uses the find method on the string variable "page" to search for the "<a href" tag. If the search is successful, find returns the index of the first occurrence of the substring searched for; if it is unsuccessful, find returns -1. The returned result is assigned to the variable "start_link", and the next command prints its value. The third command checks that the returned index really points to the start of an "<a href" tag. It does so by slicing the string with the starting index given and the ending index left blank, so that everything from that index to the last character is printed. The output shows that the index really does point to the beginning of an "<a href" tag.
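The two possible outcomes of find can be seen side by side in a small sketch. The string `text` and the "<img" search are made-up examples, not taken from the sample page above.

```python
# find returns the index of the first match, or -1 when there is none.
text = 'intro <a href="http://www.link.com">link</a>'

found = text.find("<a href")     # 6: the tag starts right after 'intro '
missing = text.find("<img")      # -1: there is no such tag in text

print(found)
print(missing)
print(text[found:])              # slice from the match to the end of the string

# Check for -1 before slicing: text[-1:] would quietly return the last character.
if missing == -1:
    print("no <img tag found")
```

Checking for -1 matters once the code runs on pages that may contain no links at all.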
Step 2
The next step is to search for the starting quote of the URL within the "<a href" tag. This is easily done using the find method, as in the previous step, with only a small difference. Since we have to search for the starting quote within the "<a href" tag, we provide the value of start_link as the second parameter to the find method. If you recall from the previous post, the second parameter to find specifies the position from which the search is to begin. The find method will therefore look for the first double quote at or after that index. We assign the index of the starting quote of the URL to the variable "start_quote". Following is the code:
>>> start_quote = page.find('"', start_link)
>>> print start_quote
56
>>> print page[56:]
"http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>
The third and last command above tells us that we are on the right path: printing the page from start_quote onwards shows the output beginning at the opening quote of the URL.
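Why the second parameter matters becomes clearer on a string where another quote appears before the link. The `<div class="nav">` wrapper below is a made-up example; the sample page in this post has no such attribute.

```python
# An earlier attribute puts a double quote in front of the link tag.
text = '<div class="nav"><a href="http://www.link.com">link</a></div>'

start_link = text.find("<a href")            # 17
first_quote_anywhere = text.find('"')        # 11: the quote in class="nav"
start_quote = text.find('"', start_link)     # 25: the quote that opens the URL

print(first_quote_anywhere)
print(start_quote)
print(text[start_quote:])
```

Without the second parameter the search would stop at the quote of the class attribute, which has nothing to do with the URL.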
Step 3
The third step involves finding the ending quote of the URL. This is done in the same way as the previous step, but some care is needed. Suppose we start the search for the ending quote from the position of the starting quote. We would end up getting the same index as that of the starting quote, because the find method begins its search at that index, and the very first character it examines is a double quote; find stops right there and returns the starting index. We avoid this by starting the search from the next character, that is, by adding one to the second parameter of the find method. The index returned by find is stored in the variable "end_quote". Following is the code:
>>> end_quote = page.find('"', start_quote)
>>> print end_quote
56
>>> end_quote = page.find('"', start_quote+1)
>>> print end_quote
126
>>> print page[start_quote:end_quote+1]
"http://flywithpython.blogspot.com/2012/08/python-getting-started.html"
As the first command shows, using start_quote as the starting point for the find method returns the same index, since the very first character examined is a double quote, which is exactly what we are searching for. Also, in the last command the ending index of the slice is (end_quote+1) and not just "end_quote". Can you guess why? A slice selects every character up to, but not including, the ending index. We therefore increase the index by one so as to include the closing quote.
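Both off-by-one details can be checked on a single tag. The string `tag` and the names `same_quote`, `with_quotes` and `without_end` are illustrative assumptions.

```python
tag = '<a href="http://www.link.com">'

start_quote = tag.find('"')                    # 8
same_quote = tag.find('"', start_quote)        # 8 again: the search starts on the quote itself
end_quote = tag.find('"', start_quote + 1)     # 28: the search starts one character later

with_quotes = tag[start_quote:end_quote + 1]   # '"http://www.link.com"'
without_end = tag[start_quote:end_quote]       # '"http://www.link.com' (closing quote dropped)

print(same_quote == start_quote)
print(with_quotes)
print(without_end)
```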
Step 4
The last step is to extract the link and assign it to a variable; we'll call it "link". This is quite simple. Also, since we have now extracted the first link of the web page, we update "page" so that it starts from the index of the ending quote of the extracted link.
>>> link = page[start_quote:end_quote+1]
>>> print link
"http://flywithpython.blogspot.com/2012/08/python-getting-started.html"
>>> page = page[end_quote:]
>>> print page
">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>
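The point of shortening "page" is that the same steps can then be repeated to reach the next link. The sketch below assumes a made-up two-link string with placeholder URLs; `bare_link` and `second_link` are hypothetical names, and stripping the quotes is an optional extra that the steps above do not perform.

```python
# Made-up two-link page; the URLs are placeholders.
page = '<a href="http://a.example/1">one</a> <a href="http://a.example/2">two</a>'

start_link = page.find("<a href")
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
link = page[start_quote:end_quote + 1]     # '"http://a.example/1"', quotes included
bare_link = link.strip('"')                # 'http://a.example/1'

page = page[end_quote:]                    # remainder begins at the closing quote

# The same four lines now pick up the second link.
start_link = page.find("<a href")
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
second_link = page[start_quote:end_quote + 1]

print(link)
print(bare_link)
print(second_link)
```

The leftover quote at the start of the shortened page is harmless, because the next quote search begins at the new start_link rather than at index 0.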
Summary
Putting these bits and pieces together, we get the complete code for extracting a link from a given web page. As in Step 1, the sample HTML is assigned to "page" inside triple quotes. Following is the complete code:
<body>
<a href="http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>"""
>>> start_link = page.find("<a href")
>>> start_quote = page.find('"', start_link)
>>> end_quote = page.find('"', start_quote+1)
>>> link = page[start_quote:end_quote+1]
>>> print link
"http://flywithpython.blogspot.com/2012/08/python-getting-started.html"
>>> page = page[end_quote:]
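Since the stated objective is to find all the links, one possible way to repeat these steps is to wrap them in a function and call it in a loop. This is only a sketch under assumptions: the function names `get_next_link` and `get_all_links` are hypothetical, the sample page is a shortened stand-in for the one used above, and the URLs are returned without their surrounding quotes.

```python
def get_next_link(page):
    """Return (url, index just past the closing quote), or (None, -1) if no link is left."""
    start_link = page.find("<a href")
    if start_link == -1:
        return None, -1
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    if start_quote == -1 or end_quote == -1:
        return None, -1                      # malformed tag: no quoted URL found
    return page[start_quote + 1:end_quote], end_quote + 1


def get_all_links(page):
    """Collect every URL by repeatedly extracting a link and shortening the page."""
    links = []
    while True:
        url, next_pos = get_next_link(page)
        if url is None:
            break
        links.append(url)
        page = page[next_pos:]               # continue after the link just extracted
    return links


sample = """<html>
<body>
<a href="http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>"""

for url in get_all_links(sample):
    print(url)
```

The -1 checks stop the loop cleanly when no further "<a href" tag is found, instead of slicing with a negative index.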