Hi.

This isn't exactly a bioclusters issue, but I was told that this list would be a good place to find programmers who help biologists manipulate their data. (That's "manipulate" as in formatting, not cheating :) Apologies if this isn't the right place to post, and apologies to people also subscribing to bioperl-l.

I've gotten the impression - in my short time in bioinformatics - that biologists get very frustrated with data formatting and analysis tasks. Which is too bad, because many of these tasks are trivial for someone with a bit of programming knowledge. Then again, we can't force them to learn programming, even if it would be For Their Own Good.

I was thinking it would be useful to have a toolkit of outrageously simple Perl one-liners. Here's one:

    # Merge two lists, removing duplicates (logical OR)
    perl -ne '$seen{$_}++; END {print keys %seen}' file1 file2 > outfile

A biologist would look through a website containing a bunch of (searchable, categorized, etc.) scripts, cut & paste the Perl from the web page into a Unix shell, then backspace over the filenames and type in their own filenames, and end up with something like this on the command line:

    myhost>perl -ne '$seen{$_}++; END {print keys %seen}' genes1 genes2 > all_genes

The biologist hits return & voilà! Instant data munging!

Of course, I'm not the first one to identify this problem or try to solve it. But I think I'm working on a slightly different problem than previous solutions, and my (complete lack of) interface is different too. Here's the "prior art" I've seen in this area, compared and contrasted with my idea.

- EMBOSS et al.: solving harder bioinformatics problems; interface is Unix executables
- Bioperl's bioscripts: harder problems; Perl executables
- Taverna / myGrid: fancy GUI interface (but I do think of my scripts as "shims")

I'm really aiming for the lowest of low-hanging fruit here. I don't want scripts that run Blast or do fancy analysis.
Rather, we'll have scripts like the above to merge lists, or get the standard deviation of column 7 of tabular data, or get the GenBank IDs of the top 10 hits from a BLAST output, or whatever. These are all tasks that are trivial in (Bio)Perl - and some you can even do in Excel - but most biologists won't know either Perl or fancy Excel. Think of it as pipelining software for your vterm100.

Why one-liners?

- really, really fast development of new tools (especially compared with GUI tools)
- no installation necessary, no dependencies (except Perl)
- no download necessary; just cut and paste a tool from the web page
- biologist doesn't need to learn an interface
- if a biologist learns just a bit of Perl, they can tweak the one-liners: much easier than writing from scratch, but makes the tools much more flexible
- take advantage of existing tools' APIs: perl -MBio::Perl -e '...'

Potential problems:

- psychological barrier to using the command line (I figure I'll aim first at the Unix-aware subset of biologists, and leave complete World Domination to Phase 2.)
- we can't fit error-handling into one-liners. Caveat scriptor.

So my questions for you (finally!):

- Are there other projects that have tried to fill this niche, i.e., allowing biologists to do simple formatting & analysis of biological or tabular data?
- Are there at least discussions of this issue that I could read somewhere for ideas?
- Does anyone have any free advice (positive or negative or both) to offer for this project?
- Are there any other lists I should post these questions to?

The working name for my toolbox of bio scripts is "Scriptome". If it ever gets off the ground (and anyone cares), I'll post more info about it, along with a request for more advice, I'm sure.

Thanks,

-Amir Karger
akarger at cgr.harvard.edu