[BiO BB] sequence retrieval

hz5 at njit.edu
Sun Feb 3 13:42:35 EST 2002


Hi group,
I wrote a tool in Java to retrieve any region of a gene from the human
genome draft sequence.

For example, if I ask for the -200 to +100 region of the gene FUBP3 (GenBank
accession U69127), the tool retrieves the 300 bp sequence (200 bp upstream and
100 bp downstream).
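
The coordinate arithmetic behind such a request can be sketched roughly as
below. This is only an illustration, not the tool's actual code: the class and
method names are hypothetical, the gene is assumed to lie on the forward
strand, and "-200 to +100" is assumed to mean the 200 bases before the first
base of the gene plus the first 100 bases of the gene itself.

```java
// Hypothetical helper; none of these names come from the original tool.
final class RegionWindow {

    /**
     * Returns the bases from `upstream` bp before the gene start to
     * `downstream` bp after it (forward strand assumed).
     *
     * @param contig    the contig sequence
     * @param geneStart 0-based index of the first base of the gene on the contig
     */
    static String extract(String contig, int geneStart, int upstream, int downstream) {
        if (geneStart < 0 || geneStart > contig.length()) {
            throw new IllegalArgumentException("gene start is outside the contig");
        }
        int from = geneStart - upstream;   // inclusive
        int to = geneStart + downstream;   // exclusive
        if (from < 0 || to > contig.length()) {
            // The window runs off the contig, so this contig alone cannot answer the request.
            throw new IndexOutOfBoundsException(
                "window [" + from + ", " + to + ") leaves the contig of length " + contig.length());
        }
        return contig.substring(from, to);
    }

    /** Number of upstream bases that would have to come from a neighbouring contig. */
    static int missingUpstream(int geneStart, int upstream) {
        return Math.max(0, upstream - geneStart);
    }
}
```

With this convention, `RegionWindow.extract(contig, geneStart, 200, 100)` returns
exactly 300 bases whenever the window fits inside the contig.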

I have also implemented batch retrieval. To parse and retrieve, I simply open a
socket connection to NCBI and parse the web pages to get the information. The
pages and tools involved are UniGene, LocusLink, and the ASN.1 files.

Questions and problems:
1. Is there any BioJava tool I can use here to make my tool compatible with
BioJava? I would like to replace the equivalent functions in my code with the
BioJava API, because it is always a good idea to stay with the common source.
2. I use contig coordinates to locate the upstream and downstream sequence
positions. This raises a problem when a gene lies at either end of a contig: I
cannot find the information that tells me which contig is adjacent to it. Say a
gene begins at position 20 on contig NT_001100. If I want 200 bp upstream, I
cannot get it from this contig alone; I must know which contig overlaps with it
and retrieve the sequence accordingly. I currently don't know where this
information is kept at NCBI.
3. I found that the contigs NCBI uses for LocusLink are different from the
contigs depicted in the Human Genome Project draft report. Any thoughts?
4. My class can also count the A/T/G/C/N composition and build a random sequence
according to that composition. It can also build a first-order Markov chain and
generate a random sequence from it (which tends to preserve the dimer
composition). I used Java's Math.random(); is it safe? Have these tools already
been implemented in BioJava, or could they be of some help?
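
To make question 4 concrete, the first-order Markov chain generator could be
sketched roughly as follows. This is a minimal illustration and not the actual
class: all names are hypothetical, and it uses java.util.Random with a
caller-supplied seed instead of Math.random(), on the assumption that
reproducible runs are wanted (Math.random() cannot be seeded).

```java
import java.util.Random;

// Hypothetical sketch; class and method names are illustrative only.
final class MarkovSequence {
    static final String ALPHABET = "ACGTN";

    /** counts[i][j] = number of times ALPHABET[j] directly follows ALPHABET[i]. */
    static int[][] transitionCounts(String seq) {
        int[][] counts = new int[ALPHABET.length()][ALPHABET.length()];
        String s = seq.toUpperCase();
        for (int k = 0; k + 1 < s.length(); k++) {
            int a = ALPHABET.indexOf(s.charAt(k));
            int b = ALPHABET.indexOf(s.charAt(k + 1));
            if (a >= 0 && b >= 0) counts[a][b]++;   // skip dimers with unknown symbols
        }
        return counts;
    }

    /** Picks an index with probability proportional to its weight, or -1 if all weights are zero. */
    static int sample(int[] weights, Random rng) {
        int total = 0;
        for (int w : weights) total += w;
        if (total == 0) return -1;
        int r = rng.nextInt(total);
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0) return i;
        }
        throw new IllegalStateException("unreachable");
    }

    /** Generates `length` bases whose dimer frequencies follow those of `template`. */
    static String generate(String template, int length, Random rng) {
        int[][] counts = transitionCounts(template);
        int[] composition = new int[ALPHABET.length()];   // single-base counts
        for (char c : template.toUpperCase().toCharArray()) {
            int i = ALPHABET.indexOf(c);
            if (i >= 0) composition[i]++;
        }
        int current = sample(composition, rng);            // first base drawn from the composition
        if (current < 0) throw new IllegalArgumentException("template contains no A/C/G/T/N");
        StringBuilder out = new StringBuilder(Math.max(length, 0));
        while (out.length() < length) {
            out.append(ALPHABET.charAt(current));
            int next = sample(counts[current], rng);
            // A base seen only at the very end of the template has no successor: fall back to composition.
            current = (next >= 0) ? next : sample(composition, rng);
        }
        return out.toString();
    }
}
```

Calling `MarkovSequence.generate(template, 1000, new Random(42))` twice gives the
same sequence, which makes results easier to check and compare.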

Any evaluations and suggestions are highly appreciated! Thanks!

Haibo Zhang


