
Chapter 2   Quick Start -- What can you do with Biopython?

This section is designed to get you started quickly with Biopython, and to give a general overview of what is available and how to use it. All of the examples in this section assume that you have some general working knowledge of Python, and that you have successfully installed Biopython on your system. If you think you need to brush up on your Python, the main Python web site provides quite a bit of free documentation to get started with (http://www.python.org/doc/).

Since much biological work on the computer involves connecting with databases on the internet, some of the examples will also require a working internet connection in order to run.

Now that that is all out of the way, let's get into what we can do with Biopython.

2.1   General overview of what Biopython provides

As mentioned in the introduction, Biopython is a set of libraries to provide the ability to deal with ``things'' of interest to biologists working on the computer. In general this means that you will need to have at least some programming experience (in python, of course!) or at least an interest in learning to program. Biopython's job is to make your job easier as a programmer by supplying reusable libraries so that you can focus on answering your specific question of interest, instead of focusing on the internals of parsing a particular file format (of course, if you want to help by writing a parser that doesn't exist and contributing it to Biopython, please go ahead!). So Biopython's job is to make you happy!

One thing to note about Biopython is that it often provides multiple ways of ``doing the same thing.'' To me, this can be frustrating since I often want to just know the one right way to do something. However, this is also a real benefit because it gives you lots of flexibility and control over the libraries. The tutorial helps to show you the common or easy ways to do things so that you can just make things work. To learn more about the alternative possibilities, look into the Cookbook section (which tells you some cool tricks and tips) and the Advanced section (which provides you with as much detail as you'd ever want to know!).

2.2   Working with sequences

Disputably (of course!), the central object in bioinformatics is the sequence. Thus, we'll start with the Biopython mechanisms for dealing with sequences. When I think of a sequence, the first thing that pops into my mind is a string of letters: 'AGTACACTGGT'. This seems natural since this is the most common way that sequences appear in biological file formats. However, a simple string of letters by itself is also very uninformative -- is it a DNA or protein sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines and Threonines!), what kind of organism did it come from, what is so interesting about it, and so on. The challenge in designing a sequence interface is to pick a representation that is informative enough to take into account this more complex information, yet is as lightweight and easy to work with as a simple string.

The approach taken in the Biopython sequence class is to use a class that holds more complex information, yet can be manipulated as if it were a simple string. This is accomplished by utilizing operator overloading to make manipulating a sequence object feel like manipulating a Python string. The sequence class, referred to simply as Seq, is defined in the file Bio/Seq.py. Let's take a deeper look at the Seq class to see what it has to offer.

A Biopython Seq object has two important attributes:

  1. data -- as the name implies, this is the actual sequence data string of the sequence.

  2. alphabet -- an object describing what the individual characters making up the string ``mean'' and how they should be interpreted.
Clearly the alphabet object is what makes the Seq object more than just a string. The currently available alphabets for Biopython are defined in the Bio/Alphabet module. We'll use the IUPAC alphabets (http://www.chem.qmw.ac.uk/iupac/) here to deal with some of our favorite objects: DNA, RNA and proteins.

Bio/Alphabet/IUPAC.py provides basic definitions for proteins, DNA and RNA, but additionally provides the ability to extend and customize the basic definitions. For instance, for proteins, there is a basic IUPACProtein class, but there is an additional ExtendedIUPACProtein class providing for the additional elements ``Asx'' (asparagine or aspartic acid), ``Sec'' (selenocysteine), and ``Glx'' (glutamine or glutamic acid). For DNA you've got choices of IUPACUnambiguousDNA, which provides for just the basic letters, IUPACAmbiguousDNA (which provides for ambiguity letters for every possible situation) and ExtendedIUPACDNA, which allows letters for modified bases. Similarly, RNA can be represented by IUPACAmbiguousRNA or IUPACUnambiguousRNA.

The advantages of having an alphabet class are twofold. First, this gives an idea of the type of information the data object contains. Second, this provides a means of constraining the information in the data object, which acts as a form of type checking.
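To get a feel for these alphabets before we start using them in earnest, we can poke at them directly from the interpreter. This is just a quick illustrative sketch; it assumes the alphabet objects expose the characters they allow through a letters attribute (as the IUPAC classes do in the versions I've seen):

>>> from Bio.Alphabet import IUPAC
>>> IUPAC.unambiguous_dna.letters
'GATC'
>>> IUPAC.protein.letters
'ACDEFGHIKLMNPQRSTVWY'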

Now that we know what we are dealing with, let's look at how to utilize this class to do interesting work.

First, let's create a Seq object from a string of sequence information. We'll create an unambiguous DNA object:

>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> from Bio.Seq import Seq
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', my_alpha)
>>> print my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
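To connect this back to the two attributes described above, you can look at them directly (exactly how the alphabet prints itself may differ a bit between Biopython versions, so treat the second output line as illustrative):

>>> my_seq.data
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
>>> my_seq.alphabet
IUPACUnambiguousDNA()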
Even though this is a sequence object, we can deal with it in some ways as if it were a normal python string. For instance, let's get a slice of the sequence.

>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())
Two things are interesting to note. First, this follows the normal conventions for python sequences. So the first element of the sequence is at position 0 (which is normal for computer science, but not so normal for biology). When you do a slice, the first item is included (i.e. position 4 in this case) and the last is excluded (position 12 in this case), which is the way things work in Python, but of course not necessarily the way everyone in the world would expect. The main goal is to stay consistent with what Python does. The second thing to notice is that the slice is performed on the sequence data string, but the new object produced retains the alphabet information from the original Seq object.

You can treat the Seq object like a string in many ways:

>>> len(my_seq)
32
>>> new_seq = my_seq[0:5]
>>> print new_seq
Seq('GATCG', IUPACUnambiguousDNA())
>>> my_seq + new_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGCGATCG', IUPACUnambiguousDNA())
>>> my_seq[5]
'A'
>>> my_seq == new_seq
0
In all of these operations, the alphabet property is maintained. This is very useful in case you accidentally end up trying to do something weird like adding a protein sequence and a DNA sequence:

>>> protein_seq = Seq('EVRNAK', IUPAC.protein)
>>> dna_seq = Seq('ACGT', IUPAC.unambiguous_dna)
>>> protein_seq + dna_seq
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python1.6/site-packages/Bio/Seq.py", line 42, in __add__
    raise TypeError, ("incompatable alphabets", str(self.alphabet),
TypeError: ('incompatable alphabets', 'IUPACProtein()', 'IUPACUnambiguousDNA()')
And if you really just need the string to insert into something, it is very easy to extract:

>>> my_seq.tostring()
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
The sequence object is not mutable by default, since in many biological applications you want to ensure you are not changing your data:

>>> my_seq[5] = 'G'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'Seq' instance has no attribute '__setitem__'
However, you can convert it into a mutable sequence and do pretty much anything you want with it:

>>> mutable_seq = my_seq.tomutable()
>>> print mutable_seq
MutableSeq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> mutable_seq[5] = 'T'
>>> print mutable_seq
MutableSeq('GATCGTTGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> mutable_seq.remove('T')
>>> print mutable_seq
MutableSeq('GACGTTGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> mutable_seq.reverse()
>>> print mutable_seq
MutableSeq('CGCTAAAAGCTAGGATATATCCGGGTTGCAG', IUPACUnambiguousDNA())
Now that the nature of the sequence object makes some sense, the next thing to look at is what kinds of things we can do with a sequence. The Bio/Tools directory contains two useful modules to transcribe and translate a sequence object. These tools work based on the alphabet of the sequence. For instance, let's suppose we want to transcribe our my_seq object. Remember that this contains an unambiguous alphabet, so to transcribe we would do the following:

>>> from Bio.Tools import Transcribe
>>> transcriber = Transcribe.unambiguous_transcriber
>>> my_rna_seq = transcriber.transcribe(my_seq)
>>> print my_rna_seq
Seq('GAUCGAUGGGCCUAUAUAGGAUCGAAAAUCGC', IUPACUnambiguousRNA())
The alphabet of the new RNA Seq object is created for free, so again, dealing with a Seq object is no more difficult than dealing with a simple string.

You can also reverse transcribe RNA sequences:

>>> transcriber.back_transcribe(my_rna_seq)
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
To translate our DNA object we have quite a few choices. First, we can use any number of translation tables depending on what we know about our DNA sequence. The translation tables available in Biopython were taken from information at ftp://ncbi.nlm.nih.gov/entrez/misc/data/gc.prt. So, you have tons of choices to pick from. For this, let's just focus on two choices: the Standard translation table, and the translation table for Vertebrate Mitochondrial DNA. These tables are labelled with id numbers 1 and 2, respectively. Now that we know which tables we want, we're all set to perform a basic translation. First, we need to get the translators that use these tables. Since we are still dealing with our unambiguous DNA object, we want to fetch translators that take this into account:

>>> from Bio.Tools import Translate
>>> standard_translator = Translate.unambiguous_dna_by_id[1] 
>>> mito_translator = Translate.unambiguous_dna_by_id[2]
Once we've got the proper translators, it's time to go ahead and translate a sequence:

>>> my_seq = Seq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPAC.unambiguous_dna)
>>> standard_translator.translate(my_seq)
Seq('AIVMGR*KGAR', IUPACProtein())
>>> mito_translator.translate(my_seq)
Seq('AIVMGRWKGAR', IUPACProtein())
Notice that the default translation will just go ahead and proceed blindly through a stop codon. If you are aware that you are translating some kind of open reading frame and want to just see everything up until the stop codon, this can be easily done with the translate_to_stop function:

>>> standard_translator.translate_to_stop(my_seq)
Seq('AIVMGR', IUPACProtein())
Similar to the transcriber, it is also possible to reverse translate a protein into a DNA sequence:

>>> my_protein = Seq('AVMGRWKGGRAAG', IUPAC.protein)
>>> standard_translator.back_translate(my_protein)
Seq('GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGT', IUPACUnambiguousDNA())
This covers the basic features and uses of the Biopython sequence class. There is a more detailed description of the design ideas behind the sequence class in the Advanced section of this tutorial. This class is still under development and comments on the design and use are, of course, very welcome. Now that you've got some idea of what it is like to interact with the Biopython libraries, it's time to delve into the fun, fun world of dealing with biological file formats!

2.3   A usage example

Before we jump right into parsers and everything else to do with Biopython, let's set up an example to motivate everything we do and make life more interesting. After all, if there wasn't any biology in this tutorial, why would you want to read it?

Since I love plants, I think we're just going to have to have a plant-based example (sorry to all the fans of other organisms out there!). Having just completed a recent trip to our local greenhouse, we've suddenly developed an incredible obsession with Lady Slipper Orchids (if you wonder why, take a look at http://www.millicentorchids.com/greenhouse/images/papesq01.jpg. If that doesn't convince you, you can look at all of the available photos at http://www.millicentorchids.com/greenhouse/indexphoto.htm). Of course, orchids are not only beautiful to look at, they are also extremely interesting for people studying evolution and systematics. So we're thinking about writing a little proposal to do a molecular study of Lady Slipper evolution and would like to see what kind of research has already been done and how we can add to that.

After a little bit of reading up, we discover that the Lady Slipper Orchids are in the Orchidaceae family and the Cypripedioideae sub-family, and are made up of 5 genera: Cypripedium, Paphiopedilum, Phragmipedium, Selenipedium and Mexipedium. That gives us enough information to get started delving for more information. So, let's look at how the Biopython tools can help us.

2.4   Parsing biological file formats

A large part of bioinformatics work involves dealing with the many types of file formats designed to hold biological data. These files are loaded with interesting biological data, and a special challenge is parsing these files into a format so that you can manipulate them with some kind of programming language. However, the task of parsing these files can be frustrated by the fact that the formats can change quite regularly, and that formats may contain small subtleties which can break even the most well designed parsers.

2.4.1   General parser design

The Biopython solution to this problem is to develop a structured parser framework that is applicable to all of the parsers. The advantages are twofold. First, this allows code reuse between parsers (see Bio/ParserSupport.py). Second, this provides a similar framework for all the parsers, so that it is relatively easy to jump into the internals of a parser and figure out problems you might be having.

All of the parsers have two components:

  1. Scanner - The part of the parser that actually does the work of going through the file and extracting useful information. This useful information is converted into Events.

  2. Consumer - The consumer does the job of processing the useful information and spitting it out in a format that the programmer can use. The consumer does this by receiving the events created by the scanner.
So, the parser design is event-oriented. The general idea is that a scanner will go through and produce events for every item that might be of interest in a file. For instance, let's say we've got the following FASTA formatted entry (edited to fit the page):

>gi|6273290|gb|AF191664.1|AF191664 Opuntia clavata rpl16 gene; chloroplast gene for...
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAA
TCTAAATGATATAGGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAAT
AAA...
As a scanner moved through a file containing this entry, it would create the following events:

Event Name       Entry input
begin_sequence   (as soon as we notice we've got a new >)
title            >gi|6273290|gb|AF191664.1|AF191664 Opuntia clavata...
sequence         TATACATTAAAGGAGGGGGATGCGGAT...
sequence         TCTAAATGATATAGGATTCCACTATGTAA...
sequence         AAA... (and so on -- you've got the idea!)
end_sequence     (as soon as we reach a blank line after the sequence data)


So, events are being produced. Big deal, unless we are able to capture these events and do something interesting with them! This is where the consumer comes in. The consumer needs to do two things:

  1. Register itself with the scanner to let it know that it wants to receive those events that are being generated.

  2. Do something with the events (and the information associated with them).
An example should make it more clear how this works.

2.4.2   Writing your own consumer

Now it's time to understand our parser framework, and also start looking at our friends, the lady slipper orchids. To start our search, let's just take a look through the nucleotide databases at NCBI, using an Entrez search (http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide) for everything mentioning the text Cypripedioideae (this is the subfamily of lady slipper orchids). This search gives us 94 hits, which we save into a FASTA formatted text file.

Now, let's try and use a parser to summarize the type of information we got. To take a simple example, why don't we go through and grab all of the organism names mentioned, to see how many different species of lady slipper orchids are represented in this data.

To do this kind of useful work, we need to get our hands dirty and write the solution to all of our problems -- a consumer. This is what our consumer implementation to extract the organisms will look like:

import string
from Bio.ParserSupport import AbstractConsumer

class SpeciesExtractor(AbstractConsumer):

    def __init__(self):
        self.species_list = []

    def title(self, title_info):
        title_atoms = string.split(title_info)
        new_species = title_atoms[1]

        if new_species not in self.species_list:
            self.species_list.append(new_species)
The first thing we do is import the base class which Consumers should derive from, AbstractConsumer. This base class does nice things for us like taking care of all the sections we aren't interested in. Then we create our personal consumer class by deriving from this base AbstractConsumer class.

Just like any other python class, we define an __init__ function that will be called when a new instance of the class is created. In the initialization we set up a species_list attribute. This will store the species information as the file is parsed, and will allow us to extract this information later on.

Now we come to the function that is really nifty, the title function. This function will be called by the Scanner every time a 'title' event is generated. So, when the Scanner comes to the first line in our example FASTA file:

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
It will recognize this as a title, and call the title function with the title_info argument set to the value of the title.

Now that we've captured the title, we want to extract the information from it that we are interested in. Looking at the FASTA title info, we notice that the second item in the info string is the organism name. To get this out we can use the string.split function from the standard string module to split the string at every space, creating the list title_atoms. Since the second element of this list is the species name, we grab this out of the list. Then we simply check to see if the species is already in our current list of organisms, and if not, we add it.
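If the splitting-up of the title feels abstract, it is easy to try it out at the interpreter on the example title line above (this is plain Python, nothing Biopython-specific going on here):

>>> import string
>>> title = 'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA'
>>> title_atoms = string.split(title)
>>> title_atoms[1]
'C.irapeanum'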

Okay, well that was easy enough -- so now we need to call the Scanner and actually do this work. To do this, we write a little function:

from Bio import Fasta

def extract_organisms(file, num_records):
    scanner = Fasta._Scanner()
    consumer = SpeciesExtractor()

    file_to_parse = open(file, 'r')

    for fasta_record in range(num_records):
        scanner.feed(file_to_parse, consumer)

    file_to_parse.close()

    return consumer.species_list

Let's walk through the function step by step. First, we import the Fasta parser from the Biopython library, then we proceed to define our function. It takes two arguments, the FASTA formatted file to parse, and the number of records in the file. It then creates a Scanner for scanning through FASTA files, and a SpeciesExtractor consumer which will collect all of the organisms, as we described above.

We then open our file to parse and get it ready to go. All of the Biopython parsers take file handles as inputs. This means that you can pass them any kind of ``file-like'' object. For instance, the urllib module allows you to work with a document at a remote URL as if it were a local file, so you could use this to pass the parser a document somewhere on the net.
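For instance, a quick sketch of feeding the scanner a file fetched over the web might look like the following (the URL here is just a hypothetical example, and we're reusing the scanner and SpeciesExtractor consumer from above):

import urllib

from Bio import Fasta

# any file-like object will do -- here we grab a FASTA file over the web
remote_handle = urllib.urlopen("http://www.example.com/ls_orchid.fasta")

scanner = Fasta._Scanner()
consumer = SpeciesExtractor()
scanner.feed(remote_handle, consumer)   # parse a record straight off the net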

Now that we've got the open file, it's time to parse it. The way the parser works is that you call feed for every item in the file you want to parse, passing it the file-like object to parse, and the consumer to call back to. So we loop through all the records in the file and scan them, relying on the consumer to deal with the records as appropriate. Finally, once that is all done, we get the species_list attribute of the consumer, which contains all of the extracted information, and return it.

Whew, with all of that work out of the way, parsing the file will be really easy:

all_species = extract_organisms("ls_orchid.fasta", 94)
print "number of species:", len(all_species)
print 'species names:', all_species
Running this all as one big program (the program is available as fasta_consumer.py in the Doc/examples directory) gives us the information we desire:

[chapmanb@taxus examples]# python fasta_consumer.py
number of species: 92
species names: ['C.irapeanum', 'C.californicum', 'C.fasciculatum', 
'C.margaritaceum', 'C.lichiangense', 'C.yatabeanum', 'C.guttatum',
...

2.4.3   Making it easier

The last section explained all of the nitty gritty of writing a specialized consumer. The flexibility of being able to write your own consumer is very nice; however, for many applications it can be overkill. For simple applications, the nice Biopython folks have provided some useful classes that act more like full parsers. So, for quick and dirty applications, these are what you want to look at.

One big problem with the parser we used above was that we were required to know the number of elements in the file. This can be pretty impractical for a lot of applications when you just want to parse through a file and look for something. To help deal with this problem, we can use the Iterator interface for FASTA files. This interface allows you to not worry about consumers and scanners and all of that jazz, and just march through a file one record at a time. However, doing this requires that the Iterator return some kind of object that you can manipulate to extract the info you want.

To deal with this problem, a very useful class was written into many of the Biopython parsers -- a RecordParser, which parses a file into a python class that represents it. For instance, for FASTA files, the Record class is just an object with title and sequence attributes.

Let's make all of this talk more concrete by using the Iterator and Record interfaces to do what we did before -- extract a unique list of all the species in our FASTA file. First we need to set up our parser and iterator:

>>> from Bio import Fasta
>>> parser = Fasta.RecordParser()
>>> file = open("ls_orchid.fasta")
>>> iterator = Fasta.Iterator(file, parser)
First we create our parser -- a RecordParser that parses a FASTA entry into the Record class representing it. Then we open the file, and create an iterator, and we are ready to start parsing.

Like most iterator interfaces, we retrieve the objects by calling next(). When there are no sequences left to parse, next() will return None, so we know that it is time to stop. To get our first record:

>>> cur_record = iterator.next()
Let's take a look at the record object. It has sequence and title attributes that we can easily get at:

>>> dir(cur_record)
['_colwidth', 'sequence', 'title']
>>> print cur_record.title
gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>>> print cur_record.sequence
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGA
...
In my opinion, the coolest thing is how easy it is to get the FASTA record right back:

>>> print cur_record
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA
CGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGT
GACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCC
CGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCC
CAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAA
CGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTG
AATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA
GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCG
GCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCG
GCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTG
GCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCC
TTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGG
GGCACCCGCTGAGTTTACGC
We can use all of this to do a rewrite of our extract_organisms function:

import string

from Bio import Fasta

def extract_organisms(file_to_parse):

    parser = Fasta.RecordParser()
    file = open(file_to_parse, 'r')
    iterator = Fasta.Iterator(file, parser)

    all_species = []

    while 1:
        cur_record = iterator.next()

        if cur_record is None:
            break

        # extract the info from the title
        title_atoms = string.split(cur_record.title)
        new_species = title_atoms[1]

        # append the new species to the list if it isn't there
        if new_species not in all_species:
            all_species.append(new_species)

    return all_species
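Calling the new version looks much like before, except that we no longer have to say how many records are in the file (the iterator works out when to stop on its own):

all_species = extract_organisms("ls_orchid.fasta")
print "number of species:", len(all_species)
print 'species names:', all_species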
Running this will give us identical results to what we saw earlier. Whether you choose to use this interface or the previous one depends on your needs and what you feel most comfortable with. Hey, Perl isn't the only place where there's more than one way to do things!

2.4.4   FASTA files as Dictionaries

The final thing that we'll do with our ubiquitous orchid FASTA file is to show how to index it and access it like a database. This is very useful for large files where you only need to access certain elements of the file, and makes for a nice quick 'n dirty database.

Let's index our records via their GenBank accession numbers, which seems like a useful way to keep track of them. To do this, we first need to write a small function which returns accession numbers from a FASTA record (this is the Record class we discussed earlier):

import string

def get_accession_num(fasta_record):
    title_atoms = string.split(fasta_record.title)

    # all of the accession number information is stuck in the first element
    # and separated by '|'s
    accession_atoms = string.split(title_atoms[0], '|')
 
    # the accession number is the 4th element
    gb_name = accession_atoms[3]

    # strip the version info before returning
    return gb_name[:-2]
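To convince ourselves this does what we want, we can test it on a hand-built record (a quick sketch -- this assumes the Fasta.Record class can be created empty and have its title filled in by hand, and that get_accession_num has been typed in as above):

>>> from Bio import Fasta
>>> test_record = Fasta.Record()
>>> test_record.title = 'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA'
>>> get_accession_num(test_record)
'Z78533'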
Now we need to create an index from this file, which will show us why writing this function was important. The general format for indexing a file is:

index_file(file_to_index, index_file_to_create, function_to_get_index_key)
The tricky argument is the function_to_get_index_key. Basically, this is a reference to a function that will be called for each record and returns the value to use as that record's key. The idea here is that index_file should be general purpose, and there isn't any good way for the Biopython developers to read your mind and know how you want to index your files. So now the get_accession_num function above makes a lot of sense!

Using all of this, we can now create an index file and see how it works. First, let's create the index file:

>>> from Bio import Fasta 
>>> Fasta.index_file("ls_orchid.fasta", "my_orchid_dict.idx", get_accession_num)
This creates an index file my_orchid_dict.idx based on the input from ls_orchid.fasta, indexed by the values returned by our get_accession_num function.

Now that we've got the index, we can create an in-memory dictionary to access the file contents using the index:

>>> from Bio.Alphabet import IUPAC
>>> dna_parser = Fasta.SequenceParser(IUPAC.ambiguous_dna)
>>> orchid_dict = Fasta.Dictionary("my_orchid_dict.idx", dna_parser)
The dictionary is created from our my_orchid_dict.idx file, and returns SeqRecord objects. The type of objects returned by the dictionary is based on the second argument passed. This argument should be a parser that the index file will pass the information through before returning it. If no parser is passed, then the dictionary will just return the raw object (i.e. exactly as it appears in the file).

The parser that we pass here is a SequenceParser which converts FASTA files into SeqRecord objects. The SequenceParser takes an argument of the alphabet we should be using for the sequences. Since the parser can't be smart enough to know exactly what it will be parsing, we need to tell it. If no alphabet is provided, then the parser will default to Alphabet.generic_alphabet.
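As an aside, if you wanted the raw entries rather than SeqRecord objects, the idea would be to leave the parser out altogether. A rough sketch (take it as illustrative rather than a verified recipe):

>>> raw_dict = Fasta.Dictionary("my_orchid_dict.idx")
>>> raw_record = raw_dict['Z78475']
>>> print raw_record[:40]   # just the start of the entry, as it appears in the file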

Since this is a dictionary, we can look at all of the keys we have available:

>>> print orchid_dict.keys()
['Z78475', 'Z78519', 'Z78516', 'Z78517', 'Z78514', 'Z78515', 'Z78512', 
 'Z78513', 'Z78510', 'Z78511', 'Z78457', 'Z78456', 'Z78455', 'Z78454', 
...
We can access a single sequence object via the keys and manipulate the object as a normal SeqRecord object:

>>> seq_record = orchid_dict['Z78475']
>>> print seq_record
<Bio.SeqRecord.SeqRecord instance at 0x102c1f2c>
>>> print seq_record.description
gi|2765600|emb|Z78475.1|PSZ78475 P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
>>> print seq_record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACATAATAAT ...', IUPACAmbiguousDNA())
That easily, we have created a database of our FASTA file that will spit out sequence objects.

2.4.5   I love parsing -- please don't stop talking about it!

Biopython has a lot of parsers, and each has its own little special niche based on the sequence format it is parsing and all of that. The best place to look for information about specific parsers and how to do cool things with them is in the Cookbook section of the Tutorial. If you don't find the information you are looking for, please consider helping out your poor overworked documenters by submitting a cookbook entry about it! (once you figure out how to do it, that is!)

2.5   Connecting with biological databases

One of the very common things that you need to do in bioinformatics is extract information from biological databases. It can be quite tedious to access these databases manually, especially if you have a lot of repetitive work to do. Biopython attempts to save you time and energy by making some on-line databases available from Python scripts. Currently, Biopython has code for extracting information from a number of these databases (the Cookbook has more details on the individual services).

The code in these modules basically makes it easy to write Python code that interacts with the CGI scripts on these pages, so that you can get results in an easy-to-deal-with format. In some cases, the results can be tightly integrated with the Biopython parsers to make it even easier to extract information.

Here we'll show a simple example of performing a remote Entrez query. More information on the other services is available in the Cookbook section of this tutorial.

In section 2.4.2 of the parsing examples, we talked about using Entrez to search the NCBI nucleotide databases for info on Cypripedioideae, our friends the lady slipper orchids. Now, we'll look at how to automate that process using a Python script. For Entrez searching, this is more useful for displaying results than as a tool for getting sequences. The NCBI web site is mostly set up to allow remote queries so that you can write your own local CGI scripts that return information from NCBI pages. For this reason, the results are returned as HTML and it is pretty tough to get a flat file in a quick manner.

In this example, we'll just show how to connect, get the results, and display them in a web browser. First, we'll start by defining our search and how to display the results:

search_command = 'Search'
search_database = 'Nucleotide'
return_format = 'FASTA'
search_term = 'Cypripedioideae'

my_browser = 'lynx'
The first four terms define the search we are going to do. To use the Entrez module, you'll need to know a bit about how the remote CGI scripts at NCBI work, and you can find out more about this at http://www.ncbi.nlm.nih.gov/entrez/query/static/linking.html. The final term just describes the browser to display the results in.

Now that we've got this all set up, we can query Entrez and get a handle with the results. This is done with the following code:

from Bio.WWW import NCBI

result_handle = NCBI.query(search_command, search_database, term = search_term,
                           doptcmdl = return_format)
The query function does all of the work of preparing the CGI script command line and rounding up the HTML results.

Now that we've got the results, we are ready to save them to a file and display them in our browser, which we can do with code like:

import os

result_file_name = os.path.join(os.getcwd(), 'results.html')
result_file = open(result_file_name, 'w')
result_file.write(result_handle.read())
result_file.close()

if my_browser == 'lynx':
    os.system('lynx -force_html ' + result_file_name)
elif my_browser == 'netscape':
    os.system('netscape file:' + result_file_name)
Snazzy! We can fetch things and display them automatically -- you could use this to quickly set up searches that you want to repeat on a daily basis and check by hand, or to set up a small CGI script to do queries and locally save the results before displaying them (as a kind of lab notebook of our search results). Hopefully whatever your task, the database connectivity code will make things lots easier for you!

2.6   What to do next

Now that you've made it this far, you hopefully have a good understanding of the basics of Biopython and are ready to start using it for doing useful work. The best thing to do now is to start snooping around in the source code and looking at the automatically generated documentation.

Once you get a picture of what you want to do, and what libraries in Biopython will do it, you should take a peek at the Cookbook, which may have example code to do something similar to what you want to do.

If you know what you want to do, but can't figure out how to do it, please feel free to post questions to the main biopython list (biopython@biopython.org). This will not only help us answer your question, it will also allow us to improve the documentation so it can help the next person do what you want to do.

Enjoy the code!

