[Bioclusters] Request for advice and pointers on a project to help biologists do simple formatting and analysis

Tue Mar 8 14:07:23 EST 2005

Hi.

This isn't exactly a bioclusters issue, but I was told that this list would
be a good place to find programmers who help biologists manipulate their
data. (That's "manipulate" like formatting, not cheating :) Apologies if
this isn't the right place to post, and apologies to people also subscribing
to bioperl-l.

I've gotten the impression - in my short time in bioinformatics - that
biologists get very frustrated with data formatting and analysis tasks.
Which is too bad, because many of these tasks are trivial for someone with a
bit of programming knowledge. Then again, we can't force them to learn
programming, even if it would be For Their Own Good.

I was thinking it would be useful to have a toolkit of outrageously simple
Perl one-liners.  Here's one:

    # Merge two lists, removing duplicates (logical OR)
    perl -ne '$seen{$_}++; END {print keys %seen}' file1 file2 > outfile

A biologist would look through a website containing a bunch of (searchable,
categorized, etc.) scripts, cut & paste the Perl onto the Unix command line,
then backspace over the filenames and type in their own filenames, and end
up with something like this:

    myhost> perl -ne '$seen{$_}++; END {print keys %seen}' genes1 genes2 > all_genes

The biologist hits return & voilà! Instant data munging!

Of course, I'm not the first one to identify this problem or try to solve
it.  But I think I'm working on a slightly different problem than previous
solutions, and my (complete lack of) interface is different too.  Here's the
"prior art" I've seen in this area, compared and contrasted with my idea.
- EMBOSS et al.: solving harder bioinformatics problems; interface is Unix
executables
- Bioperl's bioscripts: harder problems; Perl executables
- Taverna / myGrid: fancy GUI interface (but I do think of my scripts as
"shims")

I'm really aiming for the lowest of low-hanging fruit here. I don't want
scripts that run Blast or do fancy analysis. Rather, we'll have scripts like
the above to merge lists, or get the standard deviation of column 7 of
tabular data, or get the GenBank IDs of the top 10 hits from a BLAST output,
or whatever. These are all tasks that're trivial in (Bio)Perl - and some you
can even do in Excel - but most biologists won't know either Perl or fancy
Excel.  Think of it as pipelining software for your VT100.

Why one-liners?
- really, really fast development of new tools (especially compared with GUI
tools)
- no installation necessary, no dependencies (except Perl)
- no download necessary; just cut and paste a tool from the web page
- biologist doesn't need to learn an interface
- if a biologist learns just a bit of Perl, they can tweak the one-liners:
much easier than writing from scratch, but makes tools much more flexible
- take advantage of existing tools' APIs: perl -MBio::Perl -e '...'

Potential problems:
- psychological barrier to using command line (I figure I'll aim first at
the Unix-aware subset of biologists, and leave complete World Domination to
Phase 2.)
- we can't fit error-handling into one-liners. Caveat scriptor

So my questions for you (finally!):
- Are there other projects that have tried to address this niche of
problems, i.e., allowing biologists to do simple formatting & analysis of
biological or tabular data?
- Are there at least discussions of this issue that I could read somewhere
for ideas?
- Does anyone have any free advice (positive or negative or both) to offer
for this project?
- Are there any other lists I should post these questions to?

The working name for my toolbox of bio scripts is "Scriptome".  If it ever
gets off the ground (and anyone cares), I'll post more info about it, along
with a request for more advice, I'm sure.  

Thanks,
-Amir Karger
akarger at cgr.harvard.edu