In the previous chapter we learnt how to extract the first link on a given web-page. In this chapter we'll work towards extracting all the links present on a web-page. Two very important concepts of Python programming come into play here: Procedure and Control. A procedure saves us from writing the same code over and over again, while control tells the program how to proceed. Together, with the help of Procedure and Control, you'll easily be able to extract all the links present on a web-page.

Procedure

Procedural abstraction is a very important tool. It helps the programmer avoid writing the same code over and over again. A procedure is a set of instructions that carries out a specific task. For example, to extract all the links from a given web-page, one would expect to write the code for extracting a link over and over again. Procedural abstraction elegantly solves this problem by defining the common code once, inside a block. A procedure takes in some inputs, performs some operations on them and produces outputs. There can be more than one input and more than one output. The code in a procedure can work on different inputs, producing different results depending on those inputs. The idea of a procedure can easily be related to a built-in operator, say +. The addition operator takes two numbers as input and produces their sum as output. Since it is a built-in operator, the syntax of addition is quite different from that of a procedure, but the idea is the same. The following image illustrates the idea.

Syntax

def <name>(<parameters>):
    <block>

Python provides the keyword def, short for define, for defining a procedure. The name of the procedure can be anything except a Python keyword or a name starting with a digit. Any name that can be given to a variable can also be given to a procedure. Next come the left and right parentheses, with the parameters listed between them. Parameters are nothing but a fancy name for the inputs. Any number of parameters can be provided to the procedure, each separated from the next by a comma, for example (a, b, c, d, e). The closing parenthesis is followed by a colon.

Next comes the block, or body, of the procedure. This is where the actual processing is done. The block contains the set of instructions that operate on the inputs provided as parameters. Notice that the block is indented by four spaces. It is indentation that lets the Python interpreter distinguish between different blocks of code. We have already seen how to provide input to a procedure; now let's see how to get output from it. We use the keyword return to produce the output of the procedure. The following code will help you understand the concept better.

>>> def procedure(str_1, str_2):
    result = str_1 + str_2
    return result

>>> procedure("Hello ", "Sid!!!")
'Hello Sid!!!'
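To make the comparison with the + operator concrete, here is a minimal sketch. The name add is hypothetical, chosen only for illustration. It wraps the built-in operator in a procedure so the two can be compared side by side, and it shows the same block of code producing different results for different inputs.

```python
# Hypothetical procedure that mirrors the built-in + operator.
def add(a, b):
    total = a + b
    return total

# The same code works on different inputs.
print(add(2, 3))                # numbers are summed
print(add("Hello ", "Sid!!!"))  # strings are joined

# The procedure and the operator agree.
print(add(2, 3) == 2 + 3)
```

The only real difference is in how the two are written: the operator sits between its inputs, while the procedure takes them inside parentheses.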

As the code below shows, you can also return multiple outputs by separating them with commas.

>>> def rectangle(length, breadth):
    area = length * breadth
    perimeter = 2 * (length + breadth)
    return area, perimeter

>>> rectangle(10, 4)
(40, 28)
>>> ar, peri = rectangle(10, 4)
>>> print(ar)
40
>>> print(peri)
28
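What comes back from rectangle is a single tuple holding both values, which is why it can be unpacked into two names. The sketch below repeats the procedure so that it runs on its own; the variable name result is an assumption made for illustration.

```python
def rectangle(length, breadth):
    area = length * breadth
    perimeter = 2 * (length + breadth)
    return area, perimeter

# One tuple containing both outputs.
result = rectangle(10, 4)
print(result[0])  # first output (area), picked by position
print(result[1])  # second output (perimeter), picked by position

# Unpacking gives each output its own name.
area, perimeter = result
print(area + perimeter)
```

Unpacking is usually the more readable choice, since each output gets a meaningful name instead of a position.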

Defining Procedure

Now, let's get started and define our own procedure. Recall the code to extract a link from Ch 4: Extracting a link.

>>> page = """<!DOCTYPE html>
<head>Sample Page</head>
<body>
<a href="http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>"""

>>> start_link = page.find("<a href")
>>> start_quote = page.find('"', start_link)
>>> end_quote = page.find('"', start_quote + 1)
>>> link = page[start_quote:end_quote+1]
>>> print(link)
"http://flywithpython.blogspot.com/2012/08/python-getting-started.html"
>>> page = page[end_quote:]

Now, instead of writing the same code over and over again to extract all the links from a web-page, we'll design a procedure in Python to do the work for us. The first piece of code above simply initializes the variable "page" with the contents of a sample web-page. The real work is done in the second piece of code. Observe that each time you need to extract a link from a web-page, the first four commands stay exactly the same.

The only thing that changes is the content of the variable "page", so "page" will act as the input to the procedure: it holds the source code of the web-page. Each time we extract a link, we update "page" so that it starts from the end of the link we just extracted. This is indispensable, since after extracting the first link we would like to extract the second link, then the third, and so on. That can only be done if the search starts from the end of the previously extracted link. So, each time we call the procedure, we provide the rest of the web-page's source code as the input. With that, we have figured out the input to our procedure.
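Here is a minimal sketch of that idea. The page below is a shortened, hypothetical one with two made-up example.com links, not the sample page from this chapter, and the names first_link and second_link are assumptions for illustration. Notice that the same three find commands are written out twice; only "page" changes in between.

```python
# Shortened, hypothetical page with two links.
page = ('<a href="http://example.com/one">One</a> '
        '<a href="http://example.com/two">Two</a>')

# Extract the first link.
start_link = page.find("<a href")
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
first_link = page[start_quote + 1:end_quote]
print(first_link)

# Keep only what follows the first link.
page = page[end_quote:]

# The very same commands now find the second link.
start_link = page.find("<a href")
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
second_link = page[start_quote + 1:end_quote]
print(second_link)
```

This repetition is exactly what a procedure lets us avoid.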

So, with page as the input, what do you think should be the output of our procedure? It is the extracted link itself, together with the index giving the position of the end_quote of the extracted URL. Let's call the new procedure get_next_target. Its input is the variable page, which contains the source code of the remaining page, and its outputs are the extracted link and the position of the end_quote. I assume you have already initialized the variable "page" with the sample source code provided in the first piece of code under the "Defining Procedure" heading. Now it's time to define the procedure.

>>> def get_next_target(page):
    start_link = page.find("<a href")
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    link = page[start_quote + 1:end_quote]
    return link, end_quote

>>> get_next_target(page)
('http://flywithpython.blogspot.com/2012/08/python-getting-started.html', 126)
>>> li, en = get_next_target(page)
>>> print(li)
http://flywithpython.blogspot.com/2012/08/python-getting-started.html
>>> page = page[en:]
>>> li, en = get_next_target(page)
>>> print(li)
http://flywithpython.blogspot.com/2012/08/python-basics.html
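Carrying on in the same way, the third link can be pulled out with one more call. The sketch below puts the sample page and the procedure in one place so that it runs on its own; the variable names first, second, third and end are assumptions made for illustration.

```python
page = """<!DOCTYPE html>
<head>Sample Page</head>
<body>
<a href="http://flywithpython.blogspot.com/2012/08/python-getting-started.html">Ch 1:Getting Started</a>
<a href="http://flywithpython.blogspot.com/2012/08/python-basics.html">Ch 2:The Basics</a>
<a href="http://flywithpython.blogspot.com/2012/08/string-handling.html">Ch 3:String Handling</a>
</body>
</html>"""

def get_next_target(page):
    start_link = page.find("<a href")
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    link = page[start_quote + 1:end_quote]
    return link, end_quote

# Call the procedure once per link, trimming the page after each call.
first, end = get_next_target(page)
page = page[end:]
second, end = get_next_target(page)
page = page[end:]
third, end = get_next_target(page)

print(first)
print(second)
print(third)
```

This works, but we still have to know in advance how many links there are and write one call for each of them.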

In this post we have learnt the concept of procedures in Python and how to define them. In the next post we'll dive into the Control part and learn how to keep going until all the links have been extracted from a web-page.

Fly With Python

Saturday, 11 August 2012

Ch 5: Learning Procedures
