[BiO BB] question regarding unicode, biopython Seq object, DAS

Fri Dec 8 03:37:13 EST 2006

Hello,

I'm attempting to get sequence data from a DAS server (UCSC, DAS1) and
am having what appears to be a Unicode-related problem. If you have
any insights or advice, I'd be grateful for the help.

I'm running biopython v. 1.42 on Mac OS X 10.3.9.

My SAX parser delivers character (sequence) data as unicode strings,
but when I make a Seq object from a unicode string and then try to
reverse complement the sequence, I get an exception:

TypeError: character mapping must return integer, None or unicode

To narrow it down, I tried this in the interpreter:

>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> s = Seq(u'atcg',IUPAC.unambiguous_dna)
>>> s.reverse_complement()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.4/site-packages/Bio/Seq.py", line 117, in reverse_complement
    s = self.data[-1::-1].translate(ttable)
TypeError: character mapping must return integer, None or unicode
>>> s = Seq('atcg',IUPAC.unambiguous_dna) # note: no longer unicode
>>> s.reverse_complement()
Seq('cgat', IUPACUnambiguousDNA())
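
A possible explanation, offered as an assumption rather than something
confirmed against the Biopython source: in Python 2, str.translate()
takes a 256-character table (such as one built with string.maketrans),
while unicode.translate() expects a mapping from code points to code
points, unicode strings, or None. If ttable is a 256-character string,
indexing it returns a one-character str, which would produce exactly
this TypeError. The sketch below does not use Biopython; the names
COMPLEMENT and reverse_complement_text are made up for illustration,
and it only covers the unambiguous DNA letters:

```python
# Hypothetical helper, not part of Biopython.
# Build a mapping of code points, which is the form that
# unicode.translate() (Python 2) and str.translate() (Python 3) expect.
COMPLEMENT = dict(
    (ord(a), ord(b)) for a, b in zip(u"acgtACGT", u"tgcaTGCA")
)


def reverse_complement_text(seq):
    """Reverse-complement a unicode (text) DNA string.

    Characters missing from COMPLEMENT are left unchanged.
    """
    return seq[::-1].translate(COMPLEMENT)


print(reverse_complement_text(u"atcg"))
```

In Python 2, another thing to try is converting before building the
object, for example Seq(str(u'atcg'), IUPAC.unambiguous_dna). For a
unicode string holding only ASCII characters, str() yields a plain
string, which is the case that works in the session above.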

An example access of the UCSC DAS1 site follows. In my code I'm using
a SAX parser to get the data, but this demonstrates a bit of how the
DAS aspect works:

>>> u = 'http://genome.cse.ucsc.edu/cgi-bin/das/hg17/dna?segment=1:158288275,158302415'
>>> import urllib
>>> fh = urllib.urlopen(u)
>>> fh.readline()
'<?xml version="1.0" standalone="no"?>\n'
>>> fh.readline()
'<DASDNA>\n'
>>> fh.readline()
'<SEQUENCE id="1" start="158288275" stop="158302415" version="1.00">\n'
>>> fh.readline()
'<DNA length="14141">\n'
>>> fh.readline()
'gtctcttaaaacccactggacgttggcacagtgctgggatgactatggag\n'

...and so on.
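
For reference, a minimal sketch of the SAX side, which is not my actual
code. It assumes the sequence is the character data inside the DNA
element, as in the response above. The sample document is a made-up,
trimmed-down version of that response (the length and stop values are
adjusted to fit, and the closing tags are assumed), and DNAHandler and
extract_dna are hypothetical names. The parser may hand over the
character data in several pieces, and in Python 2 each piece is a
unicode object:

```python
import xml.sax

# Made-up, abbreviated sample modeled on the DAS response shown above.
SAMPLE_XML = u"""<DASDNA>
<SEQUENCE id="1" start="158288275" stop="158288324" version="1.00">
<DNA length="50">
gtctcttaaaacccactggacgttg
gcacagtgctgggatgactatggag
</DNA>
</SEQUENCE>
</DASDNA>
"""


class DNAHandler(xml.sax.ContentHandler):
    """Collect the character data found inside <DNA> elements."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_dna = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "DNA":
            self.in_dna = True

    def endElement(self, name):
        if name == "DNA":
            self.in_dna = False

    def characters(self, content):
        # May be called several times for one element.
        if self.in_dna:
            self.chunks.append(content)


def extract_dna(xml_bytes):
    """Return the DNA text with all whitespace and newlines removed."""
    handler = DNAHandler()
    xml.sax.parseString(xml_bytes, handler)
    return "".join("".join(handler.chunks).split())


print(extract_dna(SAMPLE_XML.encode("ascii")))
```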

Yours,

Ann

-- 
Ann Loraine
Assistant Professor
Departments of Genetics, Biostatistics, and
Section on Statistical Genetics
University of Alabama at Birmingham
http://www.ssg.uab.edu
http://www.transvar.org