This is a summary of the IRC tutorial on Python held on #debian-women on February 11, 2006.

This version of the tutorial is not a direct transcription of the IRC log. Instead, it expands a little bit on some points, so if you were part of the IRC tutorial, you may still want to read this one.

The structure of the tutorial was that I gave examples and said something about them, and then there were questions and discussion until we moved to the next one.

About Python in general

From the Python FAQ:

Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and classes.

Interpreted means, in practice, that you can just run Python programs stored in files without having to compile them first, so it's similar to shell scripts in that way, and unlike C programs.

Interactive means that you can start the Python interpreter and feed it statements, one at a time, and it will execute them and print out any results. Those who know the Read-Eval-Print Loop (REPL) of Lisp, or the interactive mode of BASIC, will find this familiar.

Object-oriented means that Python favors OOP, but it doesn't force it. Python is more relaxed about paradigms than, say, Java.

Of the other things the FAQ lists, "dynamic typing" is perhaps the most interesting. It comes as a bit of a shock to those whose only languages are like C or Java, which are statically typed: variables have a type that is explicitly declared, and therefore every expression also has a type that can be analyzed at compile time. As a result, the compiler can find some errors before the program starts.

In dynamic typing, variables don't have types, values do. A variable can be "bound" to values of different types at different points in time, and therefore all type checking is done at run time. This makes some things nicer to do, but can make it harder to find type errors.
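
A minimal sketch of what this means in practice. The names x and add_one are made up for illustration, and the code avoids "print" so that it should run on both Python 2 and Python 3.

```python
# A variable is just a name; the value it is bound to carries the type.
x = 42
first_type = type(x)       # int
x = "forty-two"
second_type = type(x)      # str

def add_one(value):
    # Nothing here says what type "value" must be.
    return value + 1

# The mistake below is only detected when the line actually runs.
try:
    add_one("forty-two")
    error_seen = False
except TypeError:
    error_seen = True
```

A statically typed language would reject the call to add_one with a string before the program ever started.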

Dynamic typing tends to be good for small to medium-sized programs, and for rapid development and prototyping. Static typing tends to be good for large programs.

Python is often described as a scripting language, but that doesn't mean it isn't a general-purpose language. Python, like Perl, is often used for quick hacks, and it is good for that since it is high-level and interpreted, and has some nifty features and libraries for certain kinds of sysadmin-like tasks. At the same time, it is often used for so-called real application development.

The hello world program

    # Save to file "hello.py" and run with "python hello.py"
    print "hello, world"

Save the above to a file called "hello.py", and run it with the command "python hello.py". If that works, then you know you have a working Python installation, and know how to use it.

The first line is a comment: it starts with a hash sign ("#") and continues to the end of the line.

The second line is a simple statement that prints out a string, plus a newline. "print" is the simplest way of producing output to the standard output. To print several values, separate them with commas; they are printed with spaces between them.

Command line arguments

    # Run this as: python hello2.py yourname
    import sys
    print "greetings,", sys.argv[1]

We have here a way to import a library module, and a way to access command line arguments. "sys" is one of the modules in the Python standard library. In that module there is an array, or really a list, called argv, which contains the command line arguments of the Python program, similar to C's argv argument to the main function.

Lists (or arrays) start indexing at zero. Thus, sys.argv[0] is the first element of the list; as in C, it is the name of the program being run. sys.argv[1] is the first actual argument. Note that since Python is dynamically typed, list elements don't all need to be of the same type.

Run the program as "python hello2.py darling", and it will print out "greetings, darling".

If you don't give it a command line argument, the program will try to use sys.argv[1] when it doesn't exist. This causes a run-time error, called an exception, and the Python interpreter prints out a long, nasty error message. Like this:

    liw@esme$ python hello2.py
    greetings,
    Traceback (most recent call last):
      File "hello2.py", line 3, in ?
        print "greetings,", sys.argv[1]
    IndexError: list index out of range

The exception traceback has a record (two lines) per entry in the call stack, with the place where the exception was raised at the bottom (i.e., the main program at the top).

By reading it carefully and analyzing what was called where, you can (usually!) figure out what was wrong. It's possible for a program to catch exceptions.
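
As a rough sketch of what catching an exception looks like, the code below handles the IndexError from the traceback above instead of letting it end the program. The helper name first_argument is made up for illustration, and "print" is written with parentheses so that the code should run on both Python 2 and Python 3.

```python
import sys

def first_argument(argv, default):
    # Hypothetical helper: return argv[1], or default if it is missing.
    try:
        return argv[1]
    except IndexError:
        # Raised by the list when index 1 does not exist.
        return default

print("greetings, " + first_argument(sys.argv, "my nameless friend"))
```

Only IndexError is caught here; any other exception would still produce a traceback.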

if

We continue to be inordinately fond of "hello, world" examples.

    # Run this as: python hello3.py yourname
    import sys
    print "greetings,",
    if len(sys.argv) == 2:
        print sys.argv[1]
    else:
        print "my nameless friend"

A comma after the last argument to "print" will prevent it from printing a newline.

"len(foo)" returns the length of a list "foo", that is, the number of elements in it. Indexes, therefore, go from 0 to len(foo)-1. len() is a very fast, constant-time function.

We use len to check that there was an argument given on the command line and if not, we substitute a generic greeting.

The other big thing about this example is the "if" statement. This is where we learn about Python's use of indentation to mark blocks.

The "then" and "else" parts of the "if" statement are both marked by indenting them more than the "if" statement. Python does not have explicit block markers; blocks are always marked with indentation. A block is a series of statements with the same indentation; empty lines and comments are ignored, of course.

When Python measures indentation, a tab advances to the next multiple of 8 columns. Editors let you configure the tab width, but using any other value is likely to cause trouble when sharing code with others. Python programmers tend to prefer to use spaces only, and no tabs at all.

"if" and other statements that introduce blocks end with a colon. The colon is mostly a stylistic feature of the language, but it is mandatory, and occasionally it helps the parser to catch syntax errors.

It is possible to put several statements on a line by separating them with a semicolon, but it is considered very bad style.

while

One more "hello" example.

    # Run this as: python hello4.py name1 name2 name3 ...
    import sys
    print "greetings,",

    if len(sys.argv) == 1:
        print "my nameless friend"
    elif len(sys.argv) == 2:
        print sys.argv[1]
    else:
        i = 1
        last_index = len(sys.argv) - 1
        while i < last_index:
            print sys.argv[i] + ",",
            i = i + 1
        print "and", sys.argv[last_index]

In this example we get variables, "elif", and "while". Variables work pretty much as you'd expect. Variables are not declared, but it is an error to use a variable that has not yet been assigned to in the local scope, or a surrounding scope.

An assignment creates a local variable (it is possible to use global variables, but we'll skip that now). Thus, a typo in the last statement of the while loop above that changed it to "i = j + 1" would cause Python to raise an exception, but "j = i + 1" would run without complaint and cause an infinite loop, since i would never change.

When I say "a variable is assigned a value", what I really mean is that a variable gets a reference to the value. All variables are references.
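
A small sketch of what "all variables are references" means in practice. The variable names are arbitrary; list() is used here as one way to make a copy.

```python
a = [1, 2, 3]
b = a              # b refers to the very same list object as a
b.append(4)        # so the change is visible through a as well
# a is now [1, 2, 3, 4]

c = list(a)        # c refers to a new list with the same elements
c.append(5)        # a and b are not affected
# a is still [1, 2, 3, 4], c is [1, 2, 3, 4, 5]
```

Assignment never copies a value; it only makes another name refer to it.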

"elif" is a contraction of "else if", and should likewise be pretty clear. There is no "switch" statement in Python, instead a long "if: ... elif: ... elif: ... else: ..." statement is used.

All the usual integer operators work:

 +, -, *, /, %, <, >, <=, >=, ==, !=.

There are no ++ or -- operators, and += and similar ones can only be used as assignment statements (assignments are never expressions in Python).

Some of the operators are overloaded for other types as well. For example, + also works as string concatenation when both operands are strings.

For "if", "while", and other contexts where a boolean value is required, the values False, 0, 0.0, "", and None, plus a few other "empty" values, are treated as false, everything else as true.
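
The rule can be tried out with a small sketch like the following. The function name describe is made up, and "print" is called with a single parenthesized argument so that the code should run on both Python 2 and Python 3.

```python
def describe(value):
    # "if" accepts any value, not only True and False.
    if value:
        return "true"
    return "false"

# All of these are treated as false.
for value in [False, 0, 0.0, "", None, [], {}]:
    print(repr(value) + " is treated as " + describe(value))

# Non-zero numbers and non-empty strings and lists are treated as true,
# even when their contents look "false".
for value in [1, -1, "0", " ", [0]]:
    print(repr(value) + " is treated as " + describe(value))
```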

for

More greetings.

    # Run this as: python hello5.py name1 name2 name3 ...
    import sys
    print "greetings,",

    if len(sys.argv) == 1:
        print "my nameless friend"
    elif len(sys.argv) == 2:
        print sys.argv[1]
    else:
        for name in sys.argv[1:-1]:
            print name + ",",
        print "and", sys.argv[-1]

The new thing here is the "for" loop, which iterates over a sequence of values, such as a list. "name" is assigned the value of each command line argument in turn, and then the block inside the "for" is executed. This tends to be more convenient than doing explicit indexing with "while".

The other fun thing is the use of slices. A slice creates a new list out of a subsection of another list. Given a list "foo", "foo[i]" is the element at index i, and "foo[a:b]" is a new list with all elements from index a up to, but not including, index b. For extra fun, i, a, and b can all be negative, in which case they index from the end of the list, so "foo[-1]" is the last element. Thus, "sys.argv[1:-1]" is all the command line arguments from the first one after the program name up to, but not including, the last one.

The a and b index may also be missing; in that case, the corresponding end of the list is used. "foo[a:]" is everything from index a to the end of the list. "foo[:b]" is everything from the beginning of the list up to, but not including index b. "foo[:]" is a copy of the entire list.
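
The slicing rules above, applied to a made-up argument list (the names in it are just sample data):

```python
argv = ["hello5.py", "alice", "bob", "carol"]

last = argv[-1]          # "carol"
middle = argv[1:-1]      # ["alice", "bob"]
names = argv[1:]         # ["alice", "bob", "carol"]
first_two = argv[:2]     # ["hello5.py", "alice"]
last_two = argv[-2:]     # ["bob", "carol"]

copy = argv[:]           # a new list with the same elements
copy.append("dave")      # does not change argv
```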

Functions

The greetings never end, do they?

    # Run this as: python hello6.py name1 name2 name3 ...
    import sys

    def greet(greeting, names):
        print greeting + ",",

        if not names:
            print "my nameless friend"
        elif len(names) == 1:
            print names[0]
        else:
            for name in names[:-1]:
                print name + ",",
            print "and", names[-1]

    greet("hi there", sys.argv[1:])

Here we see how a function is defined. Note that argument names (if any) are declared, but not their types, and neither is the return type. All typing in Python is dynamic.

Also note that the function gets a list of names to be greeted, and sys.argv starts with the name of the program, so the main program strips it out with a slice when calling the function.
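
Because neither argument types nor return types are declared, one function can work with any values that support the operations it uses. A small sketch with made-up functions:

```python
def double(value):
    # Works for any type that supports "+" with itself.
    return value + value

def forgetful(word):
    # No return statement, so the function returns None.
    word.upper()

number = double(21)          # 42
text = double("ab")          # "abab"
items = double([1, 2])       # [1, 2, 1, 2]
nothing = forgetful("hi")    # None
```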

Hashbanging

My bag of helloworld programs is infinite!

    #!/usr/bin/python

    import sys

    def greet(greeting, names):
        print greeting + ",",

        if not names:
            print "my nameless friend"
        elif len(names) == 1:
            print names[0]
        else:
            for name in names[:-1]:
                print name + ",",
            print "and", names[-1]

    def main():
        greet("hi there", sys.argv[1:])

    main()

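This is how one would make a Python script that can be run like any other command, without prefixing the command with "python". The first line, starting with "#!", tells the system which interpreter to run the file with. Just save this into a file "hello7", make it executable with "chmod +x hello7", and then run it with "./hello7".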

The main program of a Python program is customarily put into a function (often called "main"). That function is then called either directly or, better, like this:

    if __name__ == "__main__":
        main()

"__name__" is a special Python variable that has the value "__main__" if the Python file is run directly. This allows the file to be used as a Python module (i.e., with "import") without invoking its main program. This can also be used to invoke unit testing.

I/O

I think we've been polite enough now.

    #!/usr/bin/python

    import sys

    line_count = 0
    while True:
        line = sys.stdin.readline()
        if not line:
            break
        line_count += 1

    sys.stdout.write("%d lines\n" % line_count)

This program counts the number of lines in the standard input.

"sys.stdin", "sys.stdout", and "sys.stderr" are file objects that correspond to the standard input, output, and error streams.

File objects have a method ".readline()" that reads and returns the next line, including the newline, or the empty string at end of file (EOF).

Similarly, ".write()" is a file object method that writes a string to the file; it does not add a newline.

"if not line" tests whether the variable line is false; a string is false only if it is the empty string. Thus, the condition is true at the end of the file. "break" then jumps out of the innermost loop.

The "while True: data = f.read(); if not data: break" pattern is a common way of doing input in a loop.
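
The same loop works with any file object, not just sys.stdin. Below is a minimal sketch in which the function name count_lines is made up and io.StringIO from the standard library stands in for a real file, so that the example is self-contained.

```python
import io

def count_lines(f):
    # f can be any file object that has a readline() method.
    line_count = 0
    while True:
        line = f.readline()
        if not line:
            # The empty string means end of file.
            break
        line_count += 1
    return line_count

# An in-memory file whose last line has no newline at the end.
fake_input = io.StringIO(u"first\nsecond\nlast line without newline")
total = count_lines(fake_input)   # 3
```

An empty line in the middle of the input does not end the loop, since readline() returns it as "\n", which is not an empty string.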

When the first operand of the % operator is a string, it works similarly to sprintf in C. The first operand acts as the format string, and "%s" in it gets replaced by a string value, "%d" with an integer value, etc.

The values are taken from the second operand, which can be a single value, if there is only one %something in the format string, or a sequence of values inside parentheses if there are several.
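
A few sample uses of the % operator on strings; the values are made up for illustration:

```python
# One value: no parentheses needed.
one = "%d lines" % 3                              # "3 lines"

# Several values: put them in parentheses, in order.
several = "%s has %d lines" % ("hello.py", 2)     # "hello.py has 2 lines"

# Field widths and precision work as in C's printf family.
padded = "%5d|%-5s|" % (42, "ab")                 # "   42|ab   |"
rounded = "%.2f" % 3.14159                        # "3.14"
```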

String manipulation

We're not going back to hello, world.

    #!/usr/bin/python

    import sys

    def count_words(str):
        word_count = 0
        i = 0
        in_word = False
        while i < len(str):
            c = str[i]
            is_word_char = (c >= "a" and c <= "z") or (c >= "A" and c <= "Z")
            if in_word:
                if not is_word_char:
                    in_word = False
            else:
                if is_word_char:
                    in_word = True
                    word_count += 1
            i += 1
        return word_count

    def main():
        line_count = 0
        word_count = 0
        byte_count = 0

        while True:
            line = sys.stdin.readline()
            if not line:
                break
            byte_count += len(line)
            line_count += 1
            word_count += count_words(line)

        sys.stdout.write("%d words, %d lines, %d bytes\n" %
                         (word_count, line_count, byte_count))

    main()

This program counts lines, words, and bytes, with words defined as sequences of ASCII letters. It is the biggest example yet, and it is also very, very ugly. We'll make it prettier next, though.

Strings can be used (partly) like lists: "len(str)" is the length of a string, "str[i]" is the character at index i, and "str[a:b]" also works as expected. There is no separate character type; single-character strings are used instead.

Strings can be compared with <, <=, and so on; the comparison is based on the values of the bytes (since strings are strings of bytes; we'll come to Unicode later).

The last line of main() shows one way of extending Python statements to multiple lines: if a parenthesized expression is too long, just break it to the next line, and it will all work automatically. The other way is to use a backslash at the end of a line.

The ugly parts of this code are that it is very much specific to ASCII, when it should be locale sensitive, and that it uses "while" to loop over the characters in a string, which is pointless since "for" also works.

Unicode strings

Disclaimer: I am not very good at Unicode handling, either in general or in Python.

Unicode characters are bigger than 8 bits (and you don't need to care exactly how big they are, when using Python). Python has a separate string type for Unicode strings. They work pretty much identically to normal strings (which are strings of bytes), but for I/O you need to convert them from and to byte strings, using some kind of encoding. The encoding depends on various factors, but often it is OK to use an encoding based on the current locale.

Note that a Python Unicode string is not a UTF-8 string. UTF-8 is one of the encodings used for I/O (and storage).
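
To make the distinction concrete, here is a minimal sketch using the byte string methods ".decode()" and the Unicode string method ".encode()". The sample word is arbitrary, and the b"..." and u"..." literals are used so that the code should run on both Python 2 and Python 3.

```python
# The UTF-8 encoding of "cafe" with an acute accent on the e:
# the accented letter takes two bytes.
raw = b"caf\xc3\xa9"

text = raw.decode("utf-8")          # byte string -> Unicode string
again = text.encode("utf-8")        # Unicode string -> byte string
latin1 = text.encode("latin-1")     # same characters, another encoding

byte_length = len(raw)              # 5 bytes
char_length = len(text)             # 4 characters
```

The Unicode string has the same four characters no matter which encoding the bytes came from or go to.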

In source code, 'u"Copyright \u00A9 2006 Lars Wirzenius"' is a Unicode string containing the copyright sign. You can't write non-ASCII characters into Python source code unless you tell the Python interpreter what the encoding is, which is done with a comment such as "# -*- coding: utf-8 -*-" on the first or second line of the file.

"sys.stdin.readline" returns a normal string, which we will call "s" here. "s.decode(enc)" decodes s into a Unicode string ("u") using some encoding. "u.encode(enc)" encodes in the other direction, from Unicode to normal string. "enc" can be "utf-8", for example. "locale.getpreferredencoding()" returns the preferred encoding for the current locale.

Wordcounting revisited

Let's apply what we learned to word counting.

    #!/usr/bin/python

    import locale
    import sys

    def count_words(str):
        word_count = 0
        in_word = False
        for c in str:
            if in_word and not c.isalnum():
                in_word = False
            elif not in_word and c.isalnum():
                in_word = True
                word_count += 1
        return word_count

    def main():
        locale.setlocale(locale.LC_ALL, "")

        line_count = 0
        word_count = 0
        char_count = 0

        while True:
            line = sys.stdin.readline()
            if not line:
                break
            line = line.decode(locale.getpreferredencoding())
            char_count += len(line)
            line_count += 1
            word_count += count_words(line)

        sys.stdout.write("%d words, %d lines, %d chars\n" %
                         (word_count, line_count, char_count))

    main()

Besides the Unicode handling discussed above, note the line 'locale.setlocale(locale.LC_ALL, "")', which is necessary to activate the locale settings.

More word play: print out all words

Let's write words out.

    #!/usr/bin/python

    import locale
    import sys

    def split_words(str):
        words = []
        word = None
        for c in str + " ":
            if word:
                if c.isalnum():
                    word += c
                else:
                    words.append(word)
                    word = None
            else:
                if c.isalnum():
                    word = c
        return words

    def main():
        locale.setlocale(locale.LC_ALL, "")
        encoding = locale.getpreferredencoding()

        while True:
            line = sys.stdin.readline()
            if not line:
                break
            line = line.decode(encoding)
            for word in split_words(line):
                sys.stdout.write("%s\n" % word.encode(encoding))

    main()

An empty list is written as "[]". A non-empty list would be written like this: '[1, 2, 3, "hello"]'. "list.append(item)" modifies the list in place and adds a new item to the end. Lists can be concatenated: "[1,2] + [3,4]" gives "[1,2,3,4]".
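
These list operations can be tried on their own; a short sketch with made-up values:

```python
words = []                          # an empty list
result = words.append("hello")      # modifies words in place, returns None
words.append("world")

# "+" creates a new list and leaves its operands alone.
longer = words + ["again", 42]

# words is ["hello", "world"]
# longer is ["hello", "world", "again", 42]
```

Since append() returns None, writing "words = words.append(x)" is a common mistake that loses the list.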

The split_words function creates new lists (and new strings) indiscriminately, and they are not freed anywhere in the program. That is fine, because Python does garbage collection, which is a very nice thing to have.

The 'str + " "' in split_words guarantees that the input ends with a non-isalnum character, so that a word at the very end of the line (with no newline after it) is still added to the list.

Word frequencies: dictionaries!

Let's count word frequencies.

    #!/usr/bin/python

    import locale
    import sys

    def split_words(str):
        words = []
        word = None
        for c in str + " ":
            if word:
                if c.isalnum():
                    word += c
                else:
                    words.append(word)
                    word = None
            else:
                if c.isalnum():
                    word = c
        return words

    def main():
        locale.setlocale(locale.LC_ALL, "")
        encoding = locale.getpreferredencoding()

        counts = {}

        while True:
            line = sys.stdin.readline()
            if not line:
                break
            line = line.decode(encoding)
            for word in split_words(line):
                word = word.lower()
                if counts.has_key(word):
                    counts[word] += 1
                else:
                    counts[word] = 1

        words = counts.keys()
        words.sort()
        for word in words:
            sys.stdout.write("%d %s\n" % (counts[word], word.encode(encoding)))

    main()

The changes are to the main program. Python's hash tables (or hash maps) are called dictionaries. An empty dictionary: "{}". A non-empty one: '{ "foo": 0, "bar": 1 }'. "dict[key]" is the value stored at a given key. Keys can be numbers, strings, or various other types for which Python knows how to compute a hash value.

"dict.has_key(key)" is True if "dict[key]" exists (has been assigned to). Alternatively, you can write "key in dict".

"dict.keys()" is an unsorted list of all keys.

The string method ".lower()" converts a string to lower case, returning the new string (the original is not modified; strings cannot be modified in Python). Similarly, ".upper()" converts to upper case.
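
The counting logic can be tried in isolation, without any I/O. The sketch below uses a made-up list of words, tests membership with "key in dict", and wraps "counts.keys()" in list() so that the code should run on both Python 2 and Python 3.

```python
counts = {}

for word in ["The", "cat", "saw", "the", "other", "Cat", "the"]:
    word = word.lower()
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

# list() makes sure we have a real list that can be sorted in place.
words = list(counts.keys())
words.sort()

for word in words:
    print("%d %s" % (counts[word], word))
```

The loop prints the words in alphabetical order, each preceded by the number of times it occurred.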

"list.sort()" sorts in place (original list is changed, does not return sorted list, or any other value).

Let's have some class

Let's see how classes and objects are used in Python.

    #!/usr/bin/python

    import locale
    import sys

    class WordFreqCounter:

        def __init__(self):
            self.counts = {}

        def count_word(self, word):
            word = word.lower()
            if self.counts.has_key(word):
                self.counts[word] += 1
            else:
                self.counts[word] = 1

        def print_counts(self, file):
            encoding = locale.getpreferredencoding()
            words = self.counts.keys()
            words.sort()
            for word in words:
                file.write("%d %s\n" %
                           (self.counts[word], word.encode(encoding)))

    def split_words(str):
        words = []
        word = None
        for c in str + " ":
            if word:
                if c.isalnum():
                    word += c
                else:
                    words.append(word)
                    word = None
            else:
                if c.isalnum():
                    word = c
        return words

    def main():
        locale.setlocale(locale.LC_ALL, "")
        encoding = locale.getpreferredencoding()

        counter = WordFreqCounter()

        while True:
            line = sys.stdin.readline()
            if not line:
                break
            line = line.decode(encoding)
            for word in split_words(line):
                counter.count_word(word)

        counter.print_counts(sys.stdout)

    main()

In this example, we put the dictionary inside a class. It doesn't really matter in a program this small, whether we have a custom class or a plain dictionary, but we do it for demonstration purposes.

"class" starts a class definition. A class is instantiated by saying "ClassName()" (with arguments, if any, to the constructor inside the parentheses).

Methods are defined as functions inside the class, i.e., they must be indented relative to the "class" line. Methods are straightforward, except for their first argument, customarily called "self", which is a reference to the class instance (object) they're being called for.

Thus, when you call "counter.count_word(word)", the method's first argument ("self") is bound to "counter", and its second argument ("word") is bound to "word" (in the caller's context).

There is no implicit way to refer to other methods or attributes of the object or class; you must always go via "self".

The special method name "__init__" indicates the constructor. It is called when the object is created.

Simplifying slightly, there are no access controls on Python object attributes and methods. Everything is "public" in the C++ terminology.
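
A smaller class may make these points easier to see. This is only an illustrative sketch; the class Greeter and its method names are made up.

```python
class Greeter:

    def __init__(self, greeting):
        # The constructor: store per-object state as attributes of self.
        self.greeting = greeting
        self.greeted = 0

    def format(self, name):
        return "%s, %s" % (self.greeting, name)

    def greet(self, name):
        # Other methods and attributes are only reachable via self.
        self.greeted += 1
        return self.format(name)

polite = Greeter("good evening")    # calls __init__ with greeting bound
casual = Greeter("hi")              # a second, independent object

message = polite.greet("darling")   # "good evening, darling"

# No access control: attributes can be read from outside the class.
polite_count = polite.greeted       # 1
casual_count = casual.greeted       # 0
```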

Modules

First the file wordstuff.py:

    import locale

    class WordFreqCounter:

        def __init__(self):
            self.counts = {}

        def count_word(self, word):
            word = word.lower()
            if self.counts.has_key(word):
                self.counts[word] += 1
            else:
                self.counts[word] = 1

        def print_counts(self, file):
            encoding = locale.getpreferredencoding()
            words = self.counts.keys()
            words.sort()
            for word in words:
                file.write("%d %s\n" %
                           (self.counts[word], word.encode(encoding)))

    def split_words(str):
        words = []
        word = None
        for c in str + " ":
            if word:
                if c.isalnum():
                    word += c
                else:
                    words.append(word)
                    word = None
            else:
                if c.isalnum():
                    word = c
        return words

And then the file freq3.py:

    #!/usr/bin/python

    import locale
    import sys

    from wordstuff import WordFreqCounter, split_words

    def main():
        locale.setlocale(locale.LC_ALL, "")
        encoding = locale.getpreferredencoding()

        counter = WordFreqCounter()

        while True:
            line = sys.stdin.readline()
            if not line:
                break
            line = line.decode(encoding)
            for word in split_words(line):
                counter.count_word(word)

        counter.print_counts(sys.stdout)

    if __name__ == "__main__":
        main()

freq3.py is the main program and uses wordstuff.py as a module, and imports only certain names from it. These names can then be referred to without prefixing them with the module name.

Potentially every Python file is a module that can be imported into another file. Modules are searched for in the directory of the main program, in the directories listed in $PYTHONPATH, and in the default locations of the Python installation; see the documentation for more details. Usually you don't need to worry about setting $PYTHONPATH if things are installed in the canonical way.
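
The two import styles can be compared using a standard library module, so that the sketch does not depend on wordstuff.py being present. The path used is just sample data.

```python
import sys
import os.path                      # names must be prefixed: os.path.basename
from os.path import basename        # the name can be used without a prefix

path = "/usr/lib/python/wordstuff.py"

with_prefix = os.path.basename(path)    # "wordstuff.py"
without_prefix = basename(path)         # "wordstuff.py"

# sys.path is the list of directories that "import" actually searches;
# the directories from $PYTHONPATH end up in it.
search_path = sys.path
```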

What next

Read the tutorial on python.org.

Skim through the library reference and play with any interesting stuff you find there.

Write programs, read programs.