Saturday, 11 August 2012

Ch 5: Learning Procedures

In the previous chapter we learnt how to extract the first link on a given web-page. In this chapter we'll figure out a way to extract all the links present on a web-page. To carry out this task, two very important concepts of Python programming come into play: Procedure and Control. While a procedure saves us from writing the same code over and over again, control tells us how to proceed. Together, with the help of Procedure and Control, you'll easily be able to extract all the links present on a web-page.

Procedure

Procedural abstraction is a very important tool. It helps the programmer avoid writing the same code over and over again. A procedure is a set of instructions to carry out a specific task. For example, to extract all the links from a given web-page, one would expect to write the code for extracting a link over and over again. Procedural abstraction elegantly solves the problem by defining the common code once, inside a block. A procedure is something that takes in some inputs, performs some operations on them and produces the outputs. There can be more than one input and output. The code specified in a procedure can work on different inputs, producing different results depending on the inputs. The idea of a procedure can easily be related to a built-in operator, let's say +. The addition operator takes two numbers as input and produces their sum as output. Since it is a built-in operator, the syntax of addition is quite different from that of a procedure, but the idea is the same.

Syntax

def <name>(<parameters>):
    <block>

Python provides the keyword def, short for define, for defining a procedure. The name of the procedure can be anything except a Python keyword or a name starting with a digit. Any name that can be given to a variable is eligible to be the name of a procedure. Next come the left and right parentheses. Parameters are provided inside the parentheses. Parameters are nothing but a fancy name for the inputs. Any number of parameters can be provided to the procedure, each separated from the next by a comma, for example (a, b, c, d, e). The parameters are followed by a colon.

Next comes the block or the body of the procedure. This is where the actual processing is done. The block contains the set of instructions that operate on the inputs provided as parameters. Notice that the block is indented by four spaces. It is indentation that makes it possible for the Python interpreter to distinguish between the different blocks of code. We have already seen how to provide input to a procedure; now we'll figure out a way to yield the output from a procedure. We use the keyword return to produce the output of the procedure. The following code will help you understand the concept better.

>>> def procedure(str_1, str_2):
...     result = str_1 + str_2
...     return result
...
>>> procedure("Hello ", "Sid!!!")
'Hello Sid!!!'
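To tie this back to the + operator mentioned earlier, here is a minimal sketch. The procedure name add is my own choice for illustration, and print(...) is written with parentheses so that the snippet runs on Python 3 as well as Python 2.7.

```python
# A hypothetical procedure that mirrors the built-in + operator.
def add(a, b):
    # Two inputs go in, one output comes out.
    return a + b

# The same block of code works on different inputs.
print(add(2, 3))
print(add("Hello ", "Sid!!!"))
```

The body is written only once, yet it gives a sum for numbers and a joined string for strings, depending on what is passed in.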

Just like in the code below, you can also return multiple outputs by separating them with a comma.

>>> def rectangle(length, breadth):
...     area = length * breadth
...     perimeter = 2 * (length + breadth)
...     return area, perimeter
...
>>> rectangle(10, 4)
(40, 28)
>>> ar, peri = rectangle(10,4)
>>> print ar
40
>>> print peri
28

Defining Procedure

Now, let's get started. We are now on our way to defining our own procedure. Recall the code to extract a link from Ch 4: Extracting A Link.

>>> page = """<!DOCTYPE html>
<head>Sample Page</head>
<body>
<a href="http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>"""

>>> start_link = page.find("<a href")
>>> start_quote = page.find('"', start_link)
>>> end_quote = page.find('"', start_quote+1)
>>> link = page[start_quote:end_quote+1]
>>> print link
"http://flywithpython.blogspot.com/2012/08/python-getting-started.html"
>>> page = page[end_quote:]

Now, instead of writing the same code over and over again to extract all the links from a web-page, we'll design a procedure in Python to do the work for us. In the first piece of code we just initialized the variable "page" to the contents of a sample web-page. The real work is done in the second piece of code above. Observe that each time you need to extract a link from a web-page, the first four commands remain as they are.

The only thing that changes is the content of the variable "page", which will act as the input to the procedure, i.e. the input to our procedure is the variable "page" that contains the source code of the web-page. Each time we extract a link, we update the variable "page" so that it starts from the end of the extracted link. This is indispensable, since after extracting the first link we would like to extract the second link, then the third and so on. This can only be done if our search starts from the end of the previously extracted link. So, each time we call the procedure, we provide the rest of the web-page's source code as the input. Thus we have figured out the input to our procedure.

So, with page as the input, what do you think should be the output of our procedure? Well, it's nothing but the extracted link itself and the index number giving the position of the end_quote of the extracted URL. Let's call the new procedure get_next_target. It takes as input the variable page, which contains the source code of the remaining page, and gives as output the extracted link and the position of the end_quote. I assume you have already initialized the variable "page" with the sample source code I provided in the first piece of code under the "Defining Procedure" heading. Well, now it's time to define the procedure.

>>> def get_next_target(page):
...     start_link = page.find("<a href")
...     start_quote = page.find('"', start_link)
...     end_quote = page.find('"', start_quote+1)
...     link = page[start_quote+1:end_quote]
...     return link, end_quote
...
>>> get_next_target(page)
('http://flywithpython.blogspot.com/2012/08/python-getting-started.html', 126)
>>> li, en = get_next_target(page)
>>> print li
http://flywithpython.blogspot.com/2012/08/python-getting-started.html
>>> page = page[en:]
>>> li, en = get_next_target(page)
>>> print li
http://flywithpython.blogspot.com/2012/08/python-basics.html
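As a rough sketch of where this is heading, the procedure can be called by hand once per link. The variable names first, second and third are my own, and print(...) is written with parentheses so that the snippet also runs on Python 3.

```python
# Sample page from this chapter (three links).
page = """<!DOCTYPE html>
<head>Sample Page</head>
<body>
<a href="http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>"""

def get_next_target(page):
    start_link = page.find("<a href")
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    link = page[start_quote + 1:end_quote]
    return link, end_quote

# Call the procedure three times by hand, trimming the page after each call.
first, end_quote = get_next_target(page)
page = page[end_quote:]
second, end_quote = get_next_target(page)
page = page[end_quote:]
third, end_quote = get_next_target(page)

print(first)
print(second)
print(third)
```

Repeating the call by hand works for three links, but it clearly does not scale to a real page, which is exactly the gap that Control fills.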

In this post we have learnt the concept of procedures in Python and how to define them. In the next post we'll dive into the Control part and learn how to go on and on until all the links have been extracted from a web-page.

Friday, 10 August 2012

Ch 4: Extracting A Link

The very first step in building a web-crawler is to extract the links present in a given seed page. To get a picture of the source code of a web page, right click anywhere on this page and select the option "View Source Code". A new window opens containing the source code of this web-page. A basic knowledge of HTML will be helpful, although there is nothing to worry about if you are new to HTML. In the source code of the web page you'll notice certain links. These links have the following format:

<a href="http://www.link.com">Description of the link</a>

"<a href" is an HTML tag used to represent a link on a web-page. Our primary objective is to find all of these "<a href" links in the source code of a web page and extract them. If you have gone through the earlier chapter on String Handling, finding and extracting links will be a cake walk for you. The text between the quotes above, i.e. http://www.link.com, is known as a link, and that is what we have to extract. For the sake of simplicity let's break our task down into smaller steps.

1. Searching for the "<a href" tag.
2. Searching for the starting quotes of the URL.
3. Searching for the ending quotes of the URL.
4. Extracting the URL between the starting and the ending quotes.
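The four steps above can be sketched on a one-line page as follows. This is only a rough sketch: the sample string reuses the placeholder link from the text, the variable name url is my own, and print(...) is written with parentheses so the snippet also runs on Python 3. Each step is worked through in detail in the sections that follow.

```python
# A one-line sample page built from the placeholder link in the text.
page = '<a href="http://www.link.com">Description of the link</a>'

start_link = page.find("<a href")             # step 1: locate the tag
start_quote = page.find('"', start_link)      # step 2: opening quote of the URL
end_quote = page.find('"', start_quote + 1)   # step 3: closing quote of the URL
url = page[start_quote + 1:end_quote]         # step 4: text between the quotes

print(url)
```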

Step 1

How do we search for the "<a href" tag in the source code? We haven't talked much about how to get the source code of a given web page. We'll talk about it in detail in a later post. For the time being let us assume that we have a variable named page, initialized to the source code of the web page. We'll do this manually. First of all let us initialize the "page" variable to some sample HTML code. Since we are dealing with multiple lines, we make use of triple quotes (""").

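>>> page = """<!DOCTYPE html>
<head>Sample Page</head>
<body>
<a href="http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>"""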

Now we have a string variable "page" containing the HTML code of a sample web-page, in which we have to search for the first occurrence of the "<a href" tag. This is done using the "find" method. The "find" method returns the index number of the first occurrence of the "<a href" tag. We assign this index number to the variable "start_link". Following is the code.

>>> start_link = page.find("<a href")
>>> print start_link
48
>>> print page[48:]
<a href="http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>

In the first command we use the find method on the string variable "page" to search for the "<a href" tag. If the search is successful, the find method returns the index number of the first occurrence of the sub-sequence searched for. However, if the search is unsuccessful, the find method returns -1. The returned result is then assigned to the variable "start_link". The next command prints the value stored in "start_link". In the third command we check whether the returned index number really points to the start of an "<a href" tag. This is done by providing the starting index number and leaving the ending index number blank, so that the string is printed from that index up to its last character. The output shows that the index really points to the beginning of an "<a href" tag.
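To see both outcomes of find side by side, here is a small sketch. The two sample strings and their variable names are made up for illustration, and print(...) is written with parentheses so the snippet also runs on Python 3.

```python
# Two hypothetical pages: one with a link, one without.
with_link = '<body><a href="http://www.link.com">A link</a></body>'
without_link = "<body>No links here</body>"

found = with_link.find("<a href")       # index where the tag starts
missing = without_link.find("<a href")  # -1 because the tag is absent

print(found)
print(missing)
```

The first search prints 6, the position just after "<body>", while the second prints -1 to signal that there is nothing to extract.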

Step 2

The next step is to search for the starting quotes of the URL within the "<a href" tag. This can easily be done using the find method, as in the previous step, with only a minute difference. Since we have to search for the starting quotes within the "<a href" tag, we'll provide the value of start_link as the second parameter to the find method. If you recall from the previous post, the second parameter to the find method specifies the position from where the search is to commence. Therefore the find method will search for the starting quotes of the URL from the index number provided by the second parameter. We assign the index number of the starting quotes of the URL to the variable "start_quote". Following is the code:

>>> start_quote = page.find('"', start_link)
>>> print start_quote
56
>>> print page[56:]
"http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>

The third and last command above tells us that we are on the right path by printing the page from the starting quote of the URL onwards.

Step 3

The third step involves finding the end quotes of the URL. This is done in exactly the same way as the previous step. However, a cautious approach needs to be practised while searching for the end quotes of the URL. Suppose we start the search for the end quotes from the position of the starting quotes. This way you'll end up getting the same index number as that of the starting quotes. This is because the find method starts its search from the index of the starting quotes. The very first character is a double quote, so the find method ends its search there and returns the same index number as the starting index. This situation is taken care of by starting the search for the end quotes from the next character. This is done by adding one to the second parameter of the find method. The index number returned by the find method is stored in the variable "end_quote". Following is the code:

>>> end_quote = page.find('"', start_quote)
>>> print end_quote
56
>>> end_quote = page.find('"', start_quote+1)
>>> print end_quote
126
>>> print page[start_quote:end_quote+1]
"http://flywithpython.blogspot.com/2012/08/python-getting-started.html"

As is clear from the first command, using start_quote as the starting point for the search ends up giving the same index number, because the very first character is a double quote and that is what we are searching for. Also, in the last command we use (end_quote+1) as the second index number and not just "end_quote". Can you guess why? Well, the second or ending index number selects all the characters up to, but not including, that index. Therefore, we increase the index by one so as to include the closing quote.
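The effect of the +1 on either end of the slice is easier to see on a tiny string. This is just a sketch: the sample text and the three result names are my own, and print(...) is written with parentheses so the snippet also runs on Python 3.

```python
# A tiny hypothetical sample: the quotes sit at index 5 and index 9.
text = 'href="abc"'
start_quote = text.find('"')
end_quote = text.find('"', start_quote + 1)

no_closing_quote = text[start_quote:end_quote]    # end index is excluded
with_quotes = text[start_quote:end_quote + 1]     # +1 keeps the closing quote
without_quotes = text[start_quote + 1:end_quote]  # +1 at the start drops the opening quote

print(no_closing_quote)
print(with_quotes)
print(without_quotes)
```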

Step 4

The last step is to extract the link and assign it to a variable. We'll assign the extracted link to the variable "link". This is quite simple. Also, since we have extracted the first link of the web-page, we'll update the page so that it starts from the index number of the end quote of the extracted link.

>>> link = page[start_quote:end_quote+1]
>>> print link
"http://flywithpython.blogspot.com/2012/08/python-getting-started.html"
>>> page = page[end_quote:]
>>> print page
">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>

Summary

Now putting these bits and pieces together, we get the proper code of extracting a link from a given web-page. Following is the complete code:

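>>> page = """<!DOCTYPE html>
<head>Sample Page</head>
<body>
<a href="http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>"""

>>> start_link = page.find("<a href")
>>> start_quote = page.find('"', start_link)
>>> end_quote = page.find('"', start_quote+1)
>>> link = page[start_quote:end_quote+1]
>>> print link
"http://flywithpython.blogspot.com/2012/08/python-getting-started.html"
>>> page = page[end_quote:]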

Thursday, 9 August 2012

Ch 3: String Handling

In our quest of building a web crawler in Python, we need to deal with web pages containing a large amount of data in the form of text. A string is nothing but a sequence of characters enclosed within quotes. Also, triple quotes are used for strings that span multiple lines. Following are some sample strings in Python.

'I am a string'

"I am a double quote string"

"""The two roads diverged in a wood, and I-
I took the one less traveled by,
And that has made all the difference."""

Strings can be enclosed within single quotes or double quotes. There is no difference whatsoever between the two types. The only requirement is that a string commenced with a single quote must end with a single quote. The same goes for double quote strings. This feature comes in handy, as a double quote string may contain a single quote within the string. Following is an example.

"I'll be happy with double quotes"

The above string contains a single quote within the double quotes. This saves us the need for escaping the single quote. It is important to grasp the concept of string handling, as it is indispensable for building a web crawler. Now let's fire up our Python command line interpreter and quickly get to know the Python string type.
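For comparison, here is a small sketch of the same sentence written both ways, once inside double quotes and once inside single quotes with the apostrophe escaped by a backslash. The variable names are my own, and print(...) is written with parentheses so the snippet also runs on Python 3.

```python
# Same text, two ways of quoting it.
double_quoted = "I'll be happy with double quotes"
single_quoted = 'I\'ll be happy with double quotes'  # apostrophe escaped with a backslash

# The reverse trick: double quotes inside a single quote string.
quoted_speech = 'He said "hello"'

print(double_quoted == single_quoted)
print(quoted_speech)
```

The comparison prints True, because the two spellings produce exactly the same string.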

>>> print 'Hello'
Hello
>>> print "Hello"
Hello
>>> print Hello
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'Hello' is not defined
>>> hello = "howdy"
>>> print hello
howdy

In the first command above, the string is printed using single quotes, while in the second command double quotes are used. As is clear from the output, there is no difference between the two. However, in the third command we try to print Hello without the single or double quotes, so it now behaves as a variable name. Since we haven't defined any variable with the name Hello, the Python runtime raises a NameError exception. In the fourth command we assign the string value "howdy" to the variable hello, and in the next command we print the value contained in the variable hello.

Concatenating Python strings is quite simple. Strings are concatenated using the overloaded addition symbol (+). A string can also be repeated using the multiplication symbol (*). Following are some examples.

>>> name = "Sid"
>>> print "hello " + name + "!" + "!" + "!"
hello Sid!!!
>>> print "hello " + name + 3 * "!"
hello Sid!!!
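One thing to watch out for, sketched here as an aside: + joins a string only with another string, so a number has to be converted first. The built-in str() does that conversion. The variable names and the message are my own, and print(...) is written with parentheses so the snippet also runs on Python 3.

```python
name = "Sid"
greeting = "hello " + name + 3 * "!"   # * repeats the string three times
print(greeting)

# A number must be turned into a string before it can be concatenated.
count = 3
message = name + " has " + str(count) + " links"
print(message)
```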

Indexing Strings

The most important feature of the Python string type is that we can extract a sub-string from a given string. This is made possible by the flexibility Python provides. Strings can be indexed, i.e. given an index number, any character can be extracted using square brackets, just like an array or list. The first element has the index number zero. The following representation makes this clear. Let us assume the string to be "WOODS": 'W' has index zero, 'O' index one, 'O' index two, 'D' index three and 'S' index four.

W	O	O	D	S
0	1	2	3	4

Now, with the knowledge of index numbers we can perform the following operations on string types.

>>> name = "WOODS"
>>> print name[0]
W
>>> print name[4]
S
>>> print name[2]
O
>>> print name[-1]
S
>>> print name[-2]
D
>>> print name[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range

Now, in the first command we assign the string "WOODS" to the variable name. In the next command we extract the character at index number zero, which happens to be 'W'. Similarly, we extract the characters at index numbers four and two. In the next command we use the index number -1, which is perfectly legal in Python. The -1 index number extracts the last character of the string, which is 'S'. Similarly, the second last character is extracted using -2 as the index number and so on. In the last command we try to extract the character at index 5. However, the last index number is 4 and therefore the Python runtime raises an IndexError exception.
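How negative and positive index numbers line up can be sketched with the built-in len(), which returns the number of characters in a string. The variable name last is my own, and print(...) is written with parentheses so the snippet also runs on Python 3.

```python
name = "WOODS"
last = len(name) - 1           # 5 characters, so the last index is 4

print(name[last])              # same character as name[-1]
print(name[-1] == name[last])
print(name[-len(name)])        # counting back 5 places lands on the first character
```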

Selecting Sub-Sequence

In building a web-crawler the very first step is to extract a link from a web page. This is done by selecting the link in the web page source code and extracting it. Sub-sequences can easily be extracted from a given string using index numbers. A very important thing to keep in mind while selecting a sub-sequence from a string is that the first index, i.e. the starting index, selects the sub-sequence including the character at that index, while the second index, i.e. the ending index, selects all the characters up to that index but excluding the character at it. For example, in the code below the command name[0:3] selects the string 'EVE' and not the string 'EVER', although the character 'R' is at index 3. Also, if the starting index number is not provided, the sub-sequence is selected from the beginning, and if the ending index number is not provided, the sub-sequence is selected till the last element. Following is the sample code.

>>> name = "EVERGREEN"
>>> name[0:3]
'EVE'
>>> name[0:4]
'EVER'
>>> name[4:]
'GREEN'
>>> name[:4]
'EVER'
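A handy consequence of the ending index being excluded, sketched here with the same string: cutting a string at any index and joining the two pieces gives back the original. The variable names head, tail and middle are my own, and print(...) is written with parentheses so the snippet also runs on Python 3.

```python
name = "EVERGREEN"

head = name[:4]     # characters at index 0, 1, 2, 3
tail = name[4:]     # characters from index 4 to the end
middle = name[2:6]  # index 2 up to, but not including, index 6

print(head + tail == name)
print(middle)
print(len(middle))  # ending index minus starting index
```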

Finding A Sub-Sequence

We can find a given sub-sequence within a string using the "find" method. The sub-sequence to be searched within the string is provided as a parameter to the find method. Let us assume we have the following string.

For men may come and men may go but I go on forever.

The find method returns the index number from where the matching string starts. If the sub-sequence occurs more than once, it returns the index number of the first occurrence. Any sub-sequence can be searched for, like "men", "go", "forever" etc. For example, when searching for "men", the find method returns the index number 4, i.e. the index number of the first occurrence of the sub-sequence "men". If no match is found, the find method returns -1. The following example makes this clear.

>>> var = "For men may come and men may go but I go on forever."
>>> print var.find("men")
4
>>> print var.find("forever")
44
>>> print var.find("man")
-1

To find the next occurrence of a sub-sequence in a string, a second parameter can also be passed to the find method. This parameter provides the position from where the search is to start. Caution should be practised while using the second argument to search for the second occurrence of a sub-sequence in a string. For example, the first occurrence of the sub-sequence "men" is at index 4. If we want to search for the second occurrence, we have to start our search from the next index and not from index 4. If we start our search from index 4, the find method will find the same occurrence again and return the same index number.

>>> var = "For men may come and men may go but I go on forever."
>>> print var.find("men")
4
>>> print var.find("men", 4)
4
>>> print var.find("men", 5)
21
>>> print var.find("men", 4+1)
21
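The same idea can be chained by feeding each result back in as the next starting point. This is a small sketch; the names first, second and third are my own, and print(...) is written with parentheses so the snippet also runs on Python 3.

```python
var = "For men may come and men may go but I go on forever."

first = var.find("men")               # first occurrence
second = var.find("men", first + 1)   # search again, one past the first match
third = var.find("men", second + 1)   # no third "men", so this is -1

print(first)
print(second)
print(third)
```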

We learnt the basics of string handling in python. In the next post we'll apply these tools to extract the links from a given web page.

Monday, 6 August 2012

Ch 2: The Basics

I hope by now you have a Python command line interpreter up and running on your system. If you don't, please download the current stable version of Python from the official website, python.org.

Now, let's quickly get to know the basics of the Python language. Unlike many other languages, Python comes without too much fanfare and fuss. The syntax is neat and easy to comprehend. Let's start with the Python command line interpreter. If you are on a Linux machine, just enter python at the shell prompt to start the Python command line. Windows users can run the interpreter from the command line if the PATH variable has been set appropriately. Alternatively, you can use the IDLE program. IDLE is short for Integrated DeveLopment Environment. Click on Start -> Programs -> Python 2.7 -> IDLE (Python GUI). The command line interpreter executes the current command on the command line, i.e. one line at a time. For the time being let's play around with the Python command line to get a feel for it before jumping in to create a full-fledged program.

On a Linux machine the Python command line interpreter looks something like this.

neo@neo:~$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

Lets print HELLO WORLD.

neo@neo:~$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print "HELLO WORLD"
HELLO WORLD
>>>

You can try things on your own like printing numbers, performing operations on numbers and displaying the result etc.

neo@neo:~$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print "HELLO WORLD"
HELLO WORLD
>>> print 34
34
>>> print 3 + 7
10
>>> print 11 % 5
1
>>> print 10 * 5
50
>>>

Printing multiple lines using triple quotes """.

neo@neo:~$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print """The two roads diverged in a wood, and I-
... I took the one less traveled by,
... And that has made all the difference."""
The two roads diverged in a wood, and I-
I took the one less traveled by,
And that has made all the difference.
>>>

Understanding Variables

Unlike many other languages, which require variables to be declared with their data type, Python gives a variable its correct data type automatically. That means variables need not be declared before using them to store data. However, there is a catch. Consider the last statement. What do you think should be the output?

neo@neo:~$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 17
>>> int(a)
17
>>> b = 23
>>> print a + b
40
>>> c = a + b
>>> print c
40
>>> a = a + 1
>>> print a
18
>>> x = x + 1

In the last statement the expression on the right-hand side of the = sign is evaluated first. The statement is meant to add 1 to the current value of x and store the result back in the variable x. However, the variable x is not defined prior to this statement, so the Python runtime raises an error.

>>> x = x + 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
>>>
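The fix is simply to give the variable a value before using it on the right-hand side. Here is a minimal sketch; starting x at zero is my own choice, and print(...) is written with parentheses so the snippet also runs on Python 3.

```python
x = 0        # give x a value first
x = x + 1    # the right-hand side is evaluated first (0 + 1), then stored in x
print(x)

x = x + 1    # the same statement again, now starting from 1
print(x)
```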

Following are the definitions of some of the different data-types.

>>> i = 38 #integer type
>>> f = 12.25 #float type
>>> l = 980089L #long type (note the trailing L)
>>> d = 1098827.9032 #also float type; Python floats are double precision
>>> s = "hello" #string type
>>> b = True #boolean type(either True or False)
>>>
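If you are ever unsure which type Python has picked, the built-in type() will tell you. The sketch below compares the result against the type names int, float, str and bool. I have left out the long type because it exists only in Python 2, and print(...) is written with parentheses so the snippet also runs on Python 3.

```python
i = 38
f = 12.25
s = "hello"
b = True

# type() reports the data type Python chose for each value.
print(type(i) == int)
print(type(f) == float)
print(type(s) == str)
print(type(b) == bool)
```

All four comparisons print True.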

Sunday, 5 August 2012

Ch 1: Getting Started

Why Walk When You Can Fly With Python

When, Where And How It All Began?

Python development began in late 1989.

Guido van Rossum, a researcher at the National Research Institute for Mathematics and Computer Science in Amsterdam (CWI), needed a high level scripting language for administrative tasks. Necessity is the mother of invention, and thus was born the Python language.

Python was introduced publicly in 1991 and since then a steadily growing community of Python developers and users has contributed to improving Python and making it what it is today.

Python programming language is named after the British surreal comedy group Monty Python that created the famous Monty Python's Flying Circus. Every language has its own philosophy. Python's philosophy has its own uniqueness and is known by the name of "Zen Of Python".

Python: In A League Of Its Own

Why walk when you can run, and better still, why run when you can fly? Yes, you can really learn to fly with the Python programming language. Have a problem with C's pointer manipulation, or trouble understanding the complex OOP (Object Oriented Programming) model of Java? Then the Python language is the solution to your problems. The best thing I like about Python is getting the maximum bang from a bare minimum of code. That's what we programmers ultimately aim for, isn't it? To put things in perspective, suffice it to say that you can actually build a complete web crawler with an indexing feature in a couple of dozen lines of code!

For example, following are the constructs in different languages for printing a simple line of text to the screen.

Java:

System.out.println("The road less travelled by.");

C:

printf("The road less travelled by.");
//and that too after including the stdio.h header file

C#:

System.Console.WriteLine("The road less travelled by.");

I mean, what's the point in going through so much pain when in Python you can do the same with the following line of code?

print "Learn to fly with python" #neat isn't it?

Advantages Of Python

I will not get into the specifics of the benefits of the Python programming language, as theory is not what I'll concern myself with in my blog, but I would still like to briefly highlight these points.

Python is designed so that a novice as well as an experienced programmer can easily learn the language.

Python is portable as well as extensible. Programs can be developed rapidly without sacrificing scalability and maintenance.

Python is a modularly extensible language, i.e. new modules developed by anyone can be incorporated to extend its capabilities.

Python is platform independent making it portable enough to run on any platform thereby extending its credibility.

Compared to many other languages, Python code is more concise, i.e. a piece of code written in C can be 2 to 10 times more compact in Python.

Most importantly, the English-like constructs make it all the easier for novice programmers to get started with Python in no time.

And You Wonder Where It Is Used

Python's implementation ranges from typical desktop applications to controlling a highly sophisticated NASA space expedition rover. Following are some of the domains based on Python's implementation.

Desktop Applications

BitTorrent, original P2P client
Dropbox, a web-based file-hosting system
Cinema 4D, a program for creating 3D art and animation
Ubuntu Software Center, graphical package manager in Ubuntu
WikidPad, a wiki-like platform for managing ideas, contacts etc.

Web Applications

OpenERP, open source software for business applications
ERP5, a powerful open source ERP / CRM used in Banking and e-government

Video games

Civilization IV uses Python for most of its development
Frets On Fire is written in Python and uses Pygame

Web Frameworks

Django, an MVT (model, view, template) web framework
Google App Engine, platform for developing and hosting web applications
TurboGears, another web framework based on python

Last but not the least, suffice it to say that Google uses Python for many of its services.