[BiO BB] Matching and Filtering -- try grep

Mon Nov 17 07:12:13 EST 2003

Hi, Pooja Jain!

 On Mon, Nov 17, 2003 at 11:15:10AM -0000, Pooja Jain wrote:

> I have a txt file with a list of accession numbers for a few of the
> sequences from the entire Arabidopsis thaliana genome. I have another
> tab-delimited txt file with all the accession numbers and other details
> about every sequence present in the genome (one per row). From this latter
> file I want to extract the details for only those sequences which have
> the same accession numbers as in the former file.
> 
> Could someone please suggest a simple way to do this matching and
> filtering? I tried simple shell commands like cmp and diff, but
> neither of them worked. Is there any other command I can use in the
> shell? Any way to do this with Perl is also welcome.

From man pages:

    grep, egrep, fgrep - print lines matching a pattern

You should use grep.

If
    file-with-a-list is a txt file with a list of accession numbers
and
    file-with-all-the-details is the other file,

then this shell one-liner

    user@host$ cat file-with-a-list \
               | while read AN ; do \
                   grep "^$AN" file-with-all-the-details ; \
                 done >> file-with-the-details-for-the-listed-accnum 

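should work for you, provided the accession numbers are at the beginning of the lines in the "other" file.  If they are instead preceded by white-space characters at the beginning of each line, change "^$AN" to "^[[:space:]]*$AN" (keep the quotation marks).  Note that the pattern matches any line that begins with the accession number, so an accession that is a prefix of a longer one will match that longer one's line too.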
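
If the accession number is always the first tab-delimited column of the "other" file, an exact-match variant can be sketched with awk instead of grep.  This is only a sketch under that assumption: the function name filter_by_accession is made up for illustration, and it compares the whole first column rather than a prefix, reading each file once.

```shell
# filter_by_accession LIST DETAILS
# Hypothetical helper (not from the original post): prints the rows of the
# tab-delimited DETAILS file whose first column exactly equals one of the
# accession numbers listed, one per line, in LIST.
filter_by_accession() {
    awk -F'\t' '
        # First file (LIST): remember every non-empty accession number.
        NR == FNR { if ($1 != "") wanted[$1]; next }
        # Second file (DETAILS): print rows whose first column was listed.
        $1 in wanted
    ' "$1" "$2"
}
```

It could then be called as

    filter_by_accession file-with-a-list file-with-all-the-details \
        > file-with-the-details-for-the-listed-accnum

Rows come out in the order of the details file, not the order of the list.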

Hope this helps,

-- 
DIG (Dmitri I GOULIAEV)        http://www.bioinformatics.org/~dig/
1024D/63A6C649: 26A0 E4D5 AB3F C2D4 0112  66CD 4343 C0AF 63A6 C649