In our quest to build a web crawler in Python, we need to deal with web pages containing a large amount of data in the form of text. A string is simply a sequence of characters enclosed within quotes. Triple quotes are used for strings that span multiple lines. Following are some sample strings in Python.
I took the one less traveled by,
And that has made all the difference."""
Strings can be enclosed within single quotes or double quotes. There is no difference whatsoever between the two. The only requirement is that a string that commences with a single quote must end with a single quote; the same goes for double-quoted strings. This comes in handy because a double-quoted string may contain a single quote within it. Following is an example.
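As a minimal sketch of this point (the sample text and variable names are made up for illustration, and print is written with parentheses so the snippet also runs under Python 3):

```python
# A double-quoted string may contain a single quote without escaping.
double_quoted = "It's a crawler"

# The same text in single quotes needs the inner quote escaped with a backslash.
single_quoted = 'It\'s a crawler'

# Both spellings produce exactly the same string value.
print(double_quoted)
print(double_quoted == single_quoted)
```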
The above string contains a single quote within the double quotes. This saves us the need to escape the single quote. It is important to grasp the concept of string handling, as it is indispensable for building a web crawler. Now let's fire up our Python command line interpreter and quickly get to know the Python string type.
>>> print 'Hello'
Hello
>>> print "Hello"
Hello
>>> print Hello
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'Hello' is not defined
>>> hello = "howdy"
>>> print hello
howdy
In the first command above, the string is printed using single quotes, while in the second command double quotes are used. As is clear from the output, there is no difference between the two. However, in the third command we tried to print Hello without single or double quotes, so it is treated as a variable name. Since we haven't defined any variable named Hello, the Python runtime raises a NameError exception. In the fourth command we assign the string value "howdy" to the variable hello, and in the next command we print the value contained in that variable.
Concatenating Python strings is quite simple. Strings are concatenated using the overloaded addition operator (+), and a string can be repeated using the multiplication operator (*). Following are some examples.
>>> name = "Sid"
>>> print "hello " + name + "!" + "!" + "!"
hello Sid!!!
>>> print "hello " + name + 3 * "!"
hello Sid!!!
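One thing to watch for is that + only joins a string with another string. The sketch below shows one way to handle a number; the visits counter is a hypothetical value added for illustration, and print is written with parentheses so the snippet also runs under Python 3.

```python
name = "Sid"
visits = 3  # hypothetical counter, not from the original examples

# "+" only joins a string with another string; mixing in an int fails.
try:
    message = "hello " + name + " " + visits
except TypeError:
    # Convert the number explicitly with str() before concatenating.
    message = "hello " + name + " " + str(visits)

print(message)

# Repetition with "*" is the one place where a string and an int combine directly.
print(3 * "!" == "!!!")
```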
Indexing Strings
The most important feature of the Python string type is that we can extract a sub-string from a given string. Strings can be indexed, i.e. given an index number, the character at that position can be extracted using square brackets, just like an element of an array or list. The first character has index zero. The following representation will make this clear. Let us assume the string is "WOODS": 'W' has index zero, 'O' index one, 'O' index two, 'D' index three and 'S' index four.
| W | O | O | D | S |
| 0 | 1 | 2 | 3 | 4 |
Now, with the knowledge of index numbers we can perform the following operations on string types.
>>> name = "WOODS"
>>> print name[0]
W
>>> print name[4]
S
>>> print name[2]
O
>>> print name[-1]
S
>>> print name[-2]
D
>>> print name[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
In the first command we assign the string "WOODS" to the variable name. In the next command we extract the character at index zero, which happens to be 'W'. Similarly, we extract the characters at index four and index two. We then use the index -1, which is perfectly legal in Python: it extracts the last character of the string, which is 'S'. Likewise, the second-to-last character is extracted using -2 as the index, and so on. In the last command we tried to extract the character at index 5. However, the last valid index is 4, and therefore the Python runtime raises an IndexError exception.
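The relationship between positive and negative indexes can be sketched as follows. The char_at helper is hypothetical and only illustrates one way to guard against an out-of-range index; print is written with parentheses so the snippet also runs under Python 3.

```python
name = "WOODS"

# Valid non-negative indexes run from 0 to len(name) - 1.
last_index = len(name) - 1

# A negative index -k refers to the same character as len(name) - k.
print(name[-1] == name[last_index])      # both sides are 'S'
print(name[-2] == name[len(name) - 2])   # both sides are 'D'


def char_at(text, index, default=""):
    """Hypothetical helper: return text[index], or default if out of range."""
    try:
        return text[index]
    except IndexError:
        return default


print(char_at(name, 0))
print(repr(char_at(name, 5)))
```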
Selecting Sub-Sequence
In building a web crawler, the very first step is to extract a link from a web page. This is done by locating the link in the page's source code and extracting it. Sub-sequences can easily be extracted from a given string using index numbers. A very important thing to keep in mind is that the starting index is inclusive, i.e. the sub-sequence includes the character at that index, while the ending index is exclusive, i.e. the sub-sequence contains all characters up to, but not including, the character at that index. For example, in the code below the command name[0:3] selects the string 'EVE' and not the string 'EVER', because the character 'R' is at index 3, which is excluded. To include it we have to write name[0:4]. Also, if the starting index is not provided, the sub-sequence is selected from the beginning, and if the ending index is not provided, the sub-sequence is selected up to the last character. Following is the sample code.
>>> name = "EVERGREEN"
>>> name[0:3]
'EVE'
>>> name[0:4]
'EVER'
>>> name[4:]
'GREEN'
>>> name[:4]
'EVER'
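A few consequences of the inclusive start and exclusive end can be checked with a short sketch. The variable names head and tail are chosen for illustration, and print is written with parentheses so the snippet also runs under Python 3.

```python
name = "EVERGREEN"

head = name[:4]   # characters at indexes 0, 1, 2, 3
tail = name[4:]   # characters from index 4 to the end

# Because the end index is exclusive, the two halves never overlap
# and always join back into the original string.
print(head + tail == name)

# A slice has end - start characters when both indexes are in range.
print(len(name[2:6]) == 6 - 2)

# Unlike single-character indexing, an out-of-range slice does not raise
# an error; it is clipped to the end of the string.
print(name[4:100])
```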
Finding A Sub-Sequence
We can find a given sub-sequence within a string using the find method. The sub-sequence to search for is passed as a parameter to find. Let us assume we have the following string.
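A sample string that is consistent with the index values quoted in this section could look like the following. The exact text and punctuation are an assumption made for illustration, and print is written with parentheses so the snippet also runs under Python 3.

```python
# Assumed sample string; chosen so that the index values quoted
# in this section work out.
var = "For men may come and men may go But I go on forever"

print(var.find("men"))      # index where the first "men" starts
print(var.find("forever"))  # index where "forever" starts
print(var.find("man"))      # -1, because "man" does not occur
```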
The find method returns the index at which the matching sub-sequence starts. If the sub-sequence occurs more than once, it returns the index of the first occurrence. Any sub-sequence can be searched for, such as "men", "go" or "forever". For example, when we search for "men", find returns 4, i.e. the index of the first occurrence of "men". If no match is found, find returns -1. The following example will make this clear.
>>> print var.find("men")
4
>>> print var.find("forever")
44
>>> print var.find("man")
-1
To find the next occurrence of a sub-sequence in a string, a second parameter can be passed to the find method. This parameter gives the position from which the search starts. Care should be taken when using the second argument to search for the second occurrence. For example, the first occurrence of "men" is at index 4. If we want the second occurrence, we have to start our search from the next index, 5, and not from index 4. If we start from index 4, find begins its search at index 4 and returns the same index again.
>>> print var.find("men")
4
>>> print var.find("men", 4)
4
>>> print var.find("men", 5)
21
>>> print var.find("men", 4+1)
21
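The "previous index plus one" idea extends naturally to a loop that collects every occurrence. The sketch below is one way to write it; find_all is a hypothetical helper name, the sample string is an assumption chosen to match the indexes above, and print is written with parentheses so the snippet also runs under Python 3.

```python
def find_all(text, sub):
    """Hypothetical helper: return the start index of every occurrence of sub."""
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # Resume one character past the previous match, so the same
        # occurrence is not reported again.
        start = text.find(sub, start + 1)
    return positions


# Assumed sample string, chosen to match the indexes quoted in this section.
var = "For men may come and men may go But I go on forever"

print(find_all(var, "men"))
print(find_all(var, "man"))
```

Because the search resumes only one character further on, overlapping matches are reported as well; resuming at start + len(sub) instead would skip them.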
We have learnt the basics of string handling in Python. In the next post we'll apply these tools to extract the links from a given web page.