Bioinformatics FAQ

From Bioinformatics.Org Wiki

(Difference between revisions)
(Europe: Moved)
m (Reverted edits by Yxuhehybyja (Talk) to last revision by Admin)
 
(45 intermediate revisions not shown)
Line 1: Line 1:
-
''We're in the process of moving the subsections to separate articles.  Please pardon the mess.''
 
-
 
==Bioinformatics==
==Bioinformatics==
[[Bioinformatics|What is bioinformatics?]]
[[Bioinformatics|What is bioinformatics?]]
-
[[Origins Of Bioinformatics|What are the origins of bioinformatics?]]
+
[[Origins of bioinformatics|What are the origins of bioinformatics?]]
-
[[Common Programs|What are the most common bioinformatics programs?]]
+
[[Common programs|What are the most common bioinformatics programs?]]
-
[[Common Technologies|What are the most common bioinformatics technologies?]]
+
[[Common technologies|What are the most common bioinformatics technologies?]]
-
[[Data Analysis|How are data analyzed in bioinformatics?]]
+
[[Data analysis|How are data analyzed in bioinformatics?]]
-
==Fields Related to Bioinformatics==
+
==Fields related to bioinformatics==
[[Biophysics|What is biophysics?]]
[[Biophysics|What is biophysics?]]
-
[[Computational Biology|What is computational biology?]]
+
[[Computational biology|What is computational biology?]]
-
[[Medical Informatics|What is medical informatics?]]
+
[[Medical informatics|What is medical informatics?]]
[[Cheminformatics|What is cheminformatics?]]
[[Cheminformatics|What is cheminformatics?]]
Line 25: Line 23:
[[Genomics|What is genomics?]]
[[Genomics|What is genomics?]]
-
[[Mathematical Biology|What is mathematical biology?]]
+
[[Mathematical biology|What is mathematical biology?]]
[[Proteomics|What is proteomics?]]
[[Proteomics|What is proteomics?]]
Line 35: Line 33:
==Books: Can you recommend any bioinformatics books?==
==Books: Can you recommend any bioinformatics books?==
-
It's notoriously difficult to find any books on bioinformatics itself that cater well for all of those coming from computing, from mathematics and from biology backgrounds. The few textbooks available in the field tend to be eyewateringly expensive as well. I've divided suggested reading into [#generalBooks books of general interest], [#computerScientistsBooks those] best suited to people coming from a computational/mathematical background and [#biologistsBooks books for biologists interested in bioinformatics]. After my suggestions are some links to other lists of bioinformatics books.
+
See [[Recommended books|this article]].
-
===General introductions===
+
==Centers of bioinformatics activity: Where is bioinformatics done?==
-
Many people are curious about the Human Genome (Project). The completion of the first draft probably represents bioinformatics' coming of age as a discipline. The first couple of books are aimed at the intelligent layperson.
+
[http://www.rfcgr.mrc.ac.uk/GenomeWeb/ Genome Web] at the [http://www.rfcgr.mrc.ac.uk/ Rosalind Franklin Centre for Genomics Research] at the [http://www.hinxton.wellcome.ac.uk/ Genome Campus] near [http://www.cambridge.gov.uk/cambridge.htm Cambridge], UK, provides some of the links below.
-
A gossipy and insightful account of the race to sequence the genome can be found in "<cite>The Sequence</cite>" by Kevin Davies [Weidenfeld; ISBN 0297646982]. Matt Ridley's "<cite>Genome</cite>" [Fourth Estate; ISBN 185702835X] is both an interesting layperson's introduction to the issues raised by the bioinformatic revolution and an overview of its biology and enormous scope. If I remember rightly, Ridley's book received a slightly snooty review from Walter Bodmer. This is understandable, since his and Robin McKie's excellent "pre-genomic" guide to the Human Genome Mapping Project, "The Book of Life" [Oxford Paperbacks; ISBN 0195114876] was undeservedly in a remainders bin when I bought my copy a couple of years ago.
+
[[Research centers]]
-
If you are a non-biological scientist (or a non-scientist) and are hooked by these, why not go back to the "real beginning" of the race and read James Watson's entertaining and indiscreet memoir of his and Francis Crick's determination of the structure of DNA, "<cite>The Double Helix</cite>" [Penguin; ISBN 0140268774]---now updated with an introduction by media don Steve Jones.
+
[[Sequencing centers]]
-
Nigel Barber at Peterborough Regional College in the UK recommends Gary Zweiger's "Transducing the Genome" [McGraw-Hill Professional Publishing: ISBN 0071369805]. The [http://www.amazon.com/exec/obidos/ASIN/0071369805/ summary] at Amazon makes it sound a tad pretentious, but all the reviews seem pretty positive so it might be worth a read.
+
[[Standard centers]]
-
If you are a quantitative scientist and would like a deeper knowledge of contemporary (molecular) biology, but you want to acquire it as painlessly as possible you could try the following:
+
[[Virtual centers for bioinformatics activity]]
-
* Donna Rae Siegfried's <cite>Biology for Dummies</cite> [Wiley; ISBN 0-7645-5326-7] is fun, well thought out and a lot more informative than the title might suggest. If only all biology textbooks were this entertaining and unpretentious.
+
==Online resources: What bioinformatics websites are there?==
-
* If you already have some biological knowledge and would like to get a grip on modern biomolecular science then Richard J. Epstein's <cite>Human Molecular Biology</cite> is an elegant, colourful and detailed guide.
+
-
 
+
-
===Computational/Mathematical aspects===
+
-
 
+
-
If you are a hardcore maths/computing person Michael Waterman's <cite>"Introduction to Computational Biology"</cite> [Chapman & Hall/CRC Statistics and Mathematics; ISBN 0412993910] and Pavel Pevzner's <cite>"Computational Molecular Biology - An Algorithmic Approach"</cite> [The MIT Press (A Bradford Book); ISBN 0262161974] will give you all the discrete maths you can shake a stick at, but perfunctory introductions to the biology.
+
-
 
+
-
Bioinformatics.Org's very own Jeff Bizzaro recommends Dan Gusfield's <cite>"Algorithms on Strings, Trees and Sequences"</cite> [Cambridge, 1997 ISBN 0-52158-519-8], Richard Durbin, S. Eddy, A. Krogh, G. Mitchison <cite>"Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids"</cite> [Cambridge, 1997 ISBN 0-52162-971-3] (which I think is one of the clearest and most comprehensive guides to alignment algorithms) and---for that full "computers-to-biology conversion"--- Geoffrey M. Cooper <cite>"The Cell: A Molecular Approach"</cite> [ASM Press, 1996 ISBN 0-87893-119-8]. Jeff Ames writes that a second edition of this book is now available [Sinauer Associates, Incorporated, 2000 ISBN 0-87893-106-6] and that this version---if you can find it in the shops---comes with a CD.
+
-
 
+
-
===Applying bioinformatics to biological research===
+
-
 
+
-
One outstanding general text for the biologist is David W. Mount's "<cite>Bioinformatics</cite>" [Cold Spring Harbor Press; ISBN 0879696087]. It's not cheap, but it's the best I've seen if you are studying bioinformatics ''itself''.
+
-
 
+
-
Bioinformatics has been dismissed by some as "the science of BLAST searches". The best collection of advice so far on doing BLAST searches is [http://www.oreilly.com/ O'Reilly's] [http://www.oreilly.com/catalog/blast/ <cite>BLAST</cite>] book by Ian Korf, Mark Yandell and Joseph Bedell [O'Reilly ISBN 0-596-00299-8]. I reviewed it enthusiastically, but not uncritically, for the [http://www.ukuug.org/ UK UNIX Users' Group] magazine. I'd go as far as to say that all biologists thinking of using BLAST in their research should read the relevant sections before they even go near a computer.
+
-
 
+
-
If you wish to use general bioinformatics ''tools'', especially if you are a little wary of computers, my new "best" book is "<cite>Bioinformatics for Dummies</cite>" [John Wiley and Sons ISBN 0764516965]. It is (obviously) aimed at people who are beginners, who are happier using the Web rather than typing commands, and who are more interested in learning than in impressing people---the writing is friendly, clear and unpretentious. However, like several of my other tips (below) it concentrates on Web-based resources so it will, inevitably, date. (This is partially compensated for by there being [http://www.dummies.com/extras/bioinformatics_fd/ a companion Website].)
+
-
 
+
-
Also, if you're coming to the subject as a computer user with a biological background, looking to exploit the many tools available, you might want to try Terry Attwood and David Parry-Smith's <cite>"Introduction to Bioinformatics"</cite> [Longman Higher Education; ISBN 0582327881], or Des Higgins and Willie Taylor's <cite>"Bioinformatics: Sequence Structure and Databanks"</cite> [Oxford University Press; ISBN 0199637903]. Another excellent practical introduction is Andreas Baxevanis and Francis Ouellette's "<cite>Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins</cite>" [Wiley-Interscience; ISBN 0471383910], now in its new and improved second edition. Bax teaches bioinformatics all over Canada and the experience shows.
+
-
 
+
-
Bioinformatics.Org also recommends Cynthia Gibas and Per Jambeck's <cite>"Developing Bioinformatics Skills"</cite> [O'Reilly, 2001 ISBN 1-56592-664-1].
+
-
 
+
-
Stuart Brown recommends his own book <cite>"Bioinformatics: A Biologist's Guide to Biocomputing and the Internet"</cite> [Eaton Pub Co; ISBN: 188129918X]. If he sends me a review copy I might recommend it too ;-) .
+
-
 
+
-
===Fiction books===
+
-
 
+
-
<cite>"Darwin's Radio"</cite> by Greg Bear [Ballantine Books, ISBN: 0345435249] is a wonderful hard SF thriller which stretches ideas derived from genome discoveries to their breaking point. It's gripping and humane.
+
-
 
+
-
Leonard Crane, the author of <cite>[http://www.ninthday.com/ Ninth Day of Creation]</cite> kindly sent me a copy for review. So far it's an excellent read. I haven't finished it yet, not because it isn't a rattling good story, but because, like <cite>"Darwin's Radio"</cite>, it is very long and because I am very busy. If you'd like to read a well-researched, but speculative, novel containing actual scenes of practising bioinformatics then try it.
+
-
 
+
-
Ken Allen contributed the following reviews:
+
-
 
+
-
<blockquote>
+
-
 
+
-
"<cite>Frameshift</cite> [Tor Books, ISBN: 0812571088] by Robert J. Sawyer---based around the HGP---reasonable read, but poor / confused ending."
+
-
 
+
-
</blockquote><blockquote>
+
-
 
+
-
<cite>Calculating God</cite> [Tor Books, ISBN: 0812580354] by the same author---has a subtler bio connection and is a much better read. Near the start an alien spacecraft lands, the alien emerges and says 'take me to your paleontologist'.
+
-
 
+
-
</blockquote>
+
-
 
+
-
===Other lists of bioinformatics books===
+
-
 
+
-
See also [http://compbiology.org compbiology.org]'s [http://compbiology.org/?section=books list], Steve Brenner's [http://compbio.berkeley.edu/people/brenner/misc/books-compbio.html  list], and [http://www.brc.dcs.gla.ac.uk/%7Eactan/bioinformatics/BioinformaticsBooks.html Aik Choon Tan's collection of books].
+
-
 
+
-
==Centres of Bioinformatics Activity: Where is bioinformatics done?==
+
-
 
+
-
The biggest and best source of bioinformatics links I have encountered is the [http://www.rfcgr.mrc.ac.uk/GenomeWeb/ Genome Web] at the [http://www.rfcgr.mrc.ac.uk/ Rosalind Franklin Centre for Genomics Research] at the [http://www.hinxton.wellcome.ac.uk/ Genome Campus] near [http://www.cambridge.gov.uk/cambridge.htm Cambridge], UK. Most of the links below come from that resource. My list is necessarily limited by comparison.
+
-
 
+
-
[[Research Centers]]
+
-
 
+
-
[[Sequencing Centers]]
+
-
 
+
-
[[Standard Centers]]
+
-
 
+
-
[[Virtual Centers for Bioinformatics Activity]]
+
-
 
+
-
==Online Resources: What bioinformatics Websites are there?==
+
[[Blogs]]
[[Blogs]]
-
[[Information]]
+
[[General information websites]]
[[Directories]]
[[Directories]]
Line 116: Line 57:
[[Societies]]
[[Societies]]
-
[[Collections of Tools]]
+
[[Collections of tools]]
[[Portals]]
[[Portals]]
Line 122: Line 63:
[[Tutorials]]
[[Tutorials]]
-
==Education: Where can I study Bioinformatics...==
+
==Education: Where can I study bioinformatics?==
-
This section is ''not'' complete, but contributions to broaden its coverage are welcome. '''Please do not direct questions about eligibility, course quality or admissions policy to me; ask the individual institutions directly.''' Use the links to obtain contact details. If an institution doesn't provide telephone numbers/email addresses or snailmail details on its Web site it doesn't deserve your patronage.
+
The pages below list complete, full-time degree programmes rather than individual study modules. If you are looking for short courses, you can look elsewhere: Rockefeller has a [http://linkage.rockefeller.edu/wli/bioinfocourse/ list] that is mirrored at various other sites, and the ISCB also maintains a [http://www.iscb.org/univ.shtml list].
-
 
+
-
This resource focuses on complete, full-time degree programmes rather than on individual study modules. Curating a list of the latter would be a full-time job. You can go to other places, however, if you are looking for short courses. Thanks to various [#acknowledgementsLinks contributors], including Wentian Li, who pointed me to this [http://linkage.rockefeller.edu/wli/bioinfocourse/ list] at Rockefeller, which is mirrored at various other sites, and to Humberto Ortiz Zuazaga for mailing me a link to the ISCB, where you can find [http://www.iscb.org/univ.shtml this list].
+
-
 
+
-
If you are interested in U.S. programmes, here's [http://wbiomed.curtin.edu.au/teach/biochem/resources/Bioinformatics.html a list from Curtin] and here's [http://www.smi.stanford.edu/academics/pdfs/degree_table.pdf a list from Stanford]. Thanks to Amelie Stein who also supplied some of the individual entries in this section.
+
-
 
+
-
Those wanting to find programmes in the Asia Pacific region could have a look at [http://www.apbionet.org/project/edu/index.shtml this resource] maintained by the Asia Pacific Bioinformatics Network APBioNet. Thanks to Sentausa.
+
-
 
+
-
In the UK, [http://www.rfcgr.mrc.ac.uk/CCP11/index.jsp The Bioinformatics Resource] (part of the [http://www.bbsrc.ac.uk/ BBSRC]'s [http://www.rfcgr.mrc.ac.uk/CCP11/index.jsp  CCP11] project) maintains (among many other resources) lists of (mainly) British [http://www.rfcgr.mrc.ac.uk/CCP11/directory/directory_mastersdegrees.jsp?Rp=20  Masters] and [http://www.rfcgr.mrc.ac.uk/CCP11/directory/directory_phds.jsp?Rp=20  PhDs] in bioinformatics. If you have any suggestions or updates please [/sendmessage.php?toaddress=counsell_maillink_bioinformatics.org  contact] me with them. You can publicize your course and offer a public service at the same time.
+
[[Africa]]
[[Africa]]
Line 144: Line 77:
[[Europe]]
[[Europe]]
-
===...Remotely (Distance/Correspondence Courses)===
+
[[Distance or correspondence courses]]
-
Many visitors to the FAQ ask about bioinformatics distance learning. Eventually I will try to gather together all those courses on this list which can be taken remotely---if I ever have the time. Unfortunately I don't at the moment. All I can suggest is that you examine the courses yourself through the links provided in the FAQ. Many can be taken over the Net or offer components that can be studied at a distance. (And, if you do compile such a list for yourself, do please email it to me and I will post it here for the benefit of our users with, as usual, a full credit for your efforts.)
+
==Careers: How can I become a bioinformatics practitioner?==
-
If you are thinking of studying at a UK institution you might want to search through the [file:///people/dcounsel/public_html/papers/Counsell03_education.pdf pre-print] of my review of UK bioinformatics education for the word "distance". At the moment I think the courses at Birkbeck, Exeter and Oxford offer either full or part distance learning options.
+
[[Getting involved]]
-
==Careers: How can I become a bioinformatician?==
+
[[Careers]]
-
 
+
-
===How can I get involved?===
+
-
 
+
-
If you want to get involved in bioinformatics, now is an exciting time. I can honestly say this is one area of science where demand for skilled practitioners (and salaries) can be high. Whether this will still be the case when you graduate is another question. At lower levels of seniority it looks as though demand may be falling, partly for general economic reasons, partly, perhaps, as some of the hype about the field subsides.
+
-
 
+
-
This section is opinionated; there are people in the field, both computer scientists and biologists, whom I would love to provoke (or convert). If you are a newcomer, and especially if you come from one of bioinformatics' component pure disciplines, I hope my ranted warnings will help you to avoid the mistakes of your predecessors---and I write as one of the mistaken. [http://www.bio.upenn.edu/faculty/roos/  David S. Roos] put it well in his [http://www.sciencemag.org/cgi/content/full/291/5507/1260  review] in the journal [http://www.sciencemag.org/ Science]<nowiki>:</nowiki>
+
-
 
+
-
<blockquote> "Lack of familiarity with the intellectual questions that motivate each side can also lead to misunderstandings. For example, writing a computer program that assembles overlapping expressed sequence tags (EST) sequences may be of great importance to the biologist without breaking any new ground in computer science. Similarly, proving that it is impossible to determine a globally optimal phylogenetic tree under certain conditions may constitute a significant finding in computer science, while being of little practical use to the biologist." </blockquote>
+
-
 
+
-
====How can I get involved?---I am a "newbie"====
+
-
 
+
-
Please read the education section above for information about some of the places you can currently study bioinformatics. '''Please do not direct questions about eligibility, course quality or admissions policy to me; ask the individual institutions directly.'''
+
-
 
+
-
If you are a high school student / sixth former, think about taking an interdisciplinary computational biology or bioinformatics bachelor's degree of the sort offered at, for example, Manchester University in the UK or UPenn in the States. Don't worry if you can't find a place on such a course or there isn't one nearby; perhaps the best way to approach this subject is from two sides. Do a bachelor's degree in one area while taking a healthy interest in the other---or (if you can afford to) complement a first degree in one part of the discipline with a second degree in the second.
+
-
 
+
-
If you already have a degree in a biological discipline there are similar Master's courses---both interdisciplinary (''e.g.'' Birkbeck's in London) and conversion type courses---for biologists or others to learn computer science, for example.
+
-
 
+
-
If you are currently doing a computer science or biology PhD, try to take advantage of the opportunity to take courses in the "other" discipline.
+
-
 
+
-
====How can I get involved?---I am a biologist====
+
-
 
+
-
To a biologist I would say: take as many ''real'' computing courses as you can. It's important not just to learn a programming language, but also to learn the ''discipline'' of computing; to structure and document your work in a rigorous way. What courses you take might be directed by the kind of work you are interested in doing when you graduate---whether you see yourself supporting bioinformatics applications or building them. For the former you need all-round familiarity with the programs themselves and the hardware and software needed to run them---plus your existing understanding of biology. For the latter you need to learn a structured programming language and the principles of good program design---plus the ability to talk to and understand biologists.
+
-
 
+
-
=====Courses biologists might consider taking:=====
+
-
 
+
-
; UNIX
+
-
: Of all the computing courses available it is most important that you have a proper introduction to the UNIX operating system(s). Most current bioinformatics software (especially the free stuff) runs on "open" platforms like [http://www.linux.org/ Linux] and the Web. The UNIX philosophy is elegant, powerful, and frustrating. Master it and you will save a lot of time.
+
-
; Mathematics
+
-
: Learn some maths. Basic statistics, logic/set theory and a little calculus would be my recommendation. Many practising biologists have little or no grasp of elementary concepts like statistical significance, permutations and combinations and the principles of good experimental design. Logic will come in handy at the very least if you want to query databases in an intelligent way.
+
-
; Programming
+
-
: If you're interested in development, learn a real programming language: Pascal, C(++), Java or Fortran.
+
-
Perl and HTML are the stuff that holds the Web together. A grasp of these is essential for a lot of the Web/database work being done by many bioinformaticians at the moment.
+
-
Good old BASIC can be very useful as an introduction to programming or as a tool in its own right, but none of these latter languages is built to crunch numbers and tackle real world biological problems---which isn't to say people don't try...
+
-
 
+
-
====How can I get involved?---I am a computational/quantitative scientist====
+
-
 
+
-
One thing that I will emphasise repeatedly in this section is the simple value of doing some "proper" biological laboratory science. I have sat through many talks during which a bioinformatics "scientist" describes in great detail how his---it's usually "his"---application of a trendy mathematical tool offers a supposed insight into a (sometimes supposed) biological problem. Nine times out of ten I know that this method will never be so much as sneezed on by a practising biologist.
+
-
 
+
-
Quantitative scientists sometimes talk about their interest in studying some aspect of "God's mind". Biologists, in contrast, are interested in "Mother Nature". You might meditate on God in the hope of some revelation, but to understand Nature you have to meet her in the flesh. You are as likely to be useful to biologists working in isolation at the keyboard as you are to conceive with your clothes on. Desk-bound bioinformaticians ''have'' written code that has turned out to be popular with biologists, but almost always because they have collaborated with biologists.
+
-
 
+
-
=====Courses quantitative scientists might consider taking:=====
+
-
 
+
-
; Molecular biology
+
-
: "MoBi" was the bioinformatics of its day; desperately fashionable, the province of new, higher-paid practitioners and considered with slight suspicion by more traditional biologists. It was once a great achievement to sequence a modest stretch of DNA, now it's a job for robots. Today the technology of molecular biology is very well established. Scientists can buy kits to perform the sort of genetic manipulations that would make your parents' jaws drop. Some of the kits are so simple your small children could use them (with a modest amount of training and supervision).
+
-
Despite the profusion of commercial kits, there is still a requirement for real skill in molecular biology, and the general level of scientific understanding required to be a good biological scientist---rather than just completing a practical class---doesn't come easy. Living matter, the stuff you have to work with, is unpredictable and responds slowly---except when it's dying. Even supposedly fast-growing bacteria can take a long time to yield up their secrets.
+
-
Now, fashions in biomedical research are shifting from molecular biology back to cell biology and protein biochemistry, but it's well worth offering yourself up as a volunteer for some vacation work in a molecular biology lab. The term is now more often used to refer to the technological tools provided by MoBi to biology in general, rather than to fundamental research in the field itself. Those tools are common to a vast array of different kinds of research, from archaeology to zoology.
+
-
; Protein (bio)chemistry
+
-
: Protein (bio)chemistry is experiencing a revival. Proteins are still more delicate and fussy than nucleic acids. The same advice that applies to molecular biology applies to protein biochemistry. That stuff bioinformatics people refer to as "wet lab science" is much harder than it looks.
+
-
You might find it more difficult to get access to a good protein lab than a good molecular biology lab and do protein science with real wizards, but the very least you can do is read about the theoretical aspects of the subject.
+
-
For insights into the principles of protein structure, try, for example, Carl Branden and John Tooze's "Introduction to Protein Structure" [Garland ISBN 0-8153-2305-0]. Physicists in particular might find the lack of general unifying principles in this area overwhelming. Unfortunately there's no substitute for acquiring a "feel" for the subject by examining a lot of examples. The most critical stages in the successful prediction of protein structure from sequence are still those requiring human intervention.
+
-
Thomas E. Creighton has been responsible for a range of standard texts on protein chemistry. If you are working in a protein lab you are likely to come across his "Protein Function : A Practical Approach" [ISBN 019963615X] and the rather more expensive and theoretical "Proteins : Structures and Molecular Properties" [ISBN 071677030X]
+
-
; Evolutionary biology
+
-
: It's a worn quote, but worth repeating:
+
-
<blockquote>
+
-
"The mechanisms that bring evolution about certainly need study and clarification. There are no alternatives to evolution as history that can withstand critical examination. Yet we are constantly learning new and important facts about evolutionary mechanisms. ''Nothing in biology makes sense except in the light of evolution.''"
+
-
Theodosius Dobzhansky in "American Biology Teacher" vol.35
+
-
</blockquote>
+
-
Darwin's theory is one of the simplest and most misunderstood in science. Start with a good layperson's introduction: Richard Dawkins's "The Selfish Gene" (and remember: it's a ''metaphor'', stupid) or "Almost Like a Whale", Steve Jones's paraphrasing of Darwin's original "On the Origin of Species". All biologists agree on the underlying principles, but they are nearly ready to kill one another over the details. After reading a decent book on evolutionary biology you should have at least a handful of good questions. Now you are ready to take a class in the subject. Take your questions with you. You'll probably start an argument---or a fight.
+
-
 
+
-
You might also like to peruse Cynthia Gibas's [http://www.oreilly.com/news/bioinformatics_0401.html answers] to similar questions from computational scientists on the [http://www.oreilly.com/ O'Reilly Web site].
+
-
 
+
-
=====These damned biologists are making me use Word instead of LaTeX to write up---what can I do?=====
+
-
 
+
-
Try [http://www.counsell.com/wordforthewise/ this].
+
-
 
+
-
====More general advice====
+
-
 
+
-
=====Use the software=====
+
-
 
+
-
Get access to an installation of EMBOSS and/or Staden and get someone to lead you through the tools available. [http://www.umass.edu/microbio/rasmol/ RasMol] is a simple, but powerful and elegant molecular imaging program which can teach you a great deal about biological macromolecules; try a [http://www.umass.edu/microbio/rasmol/rasclass.htm tutorial]. Get out on the Web and do some ''productive'' surfing for a change :-) . The best starting point is the Human Genome Mapping Project Resource Centre's "[http://www.rfcgr.mrc.ac.uk/GenomeWeb/ GenomeWeb]". There's ''so'' much stuff out there -- and most of it is free to academics.
+
-
 
+
-
===Where can I find Bioinformatics jobs?===
+
-
 
+
-
Start here at [http://bioinformatics.org/jobs/ Bioinformatics.Org's Job Announcements Homepage]...
+
-
 
+
-
Then move on to the appointments / careers sections of the major scientific journals, or, better, search their Web jobs pages with "bioinformatics":
+
-
 
+
-
* <cite>[http://www.nature.com/naturejobs/ Nature]</cite>
+
-
* <cite>[http://recruit.sciencemag.org/ Science]</cite>
+
-
* [http://www.cell.com/ Cell], [http://www.bmn.com/ BioMedNet] and [http://www.newscientist.com/ New Scientist] have a pooled job site: <cite>[http://www.sciencejobs.com/ sciencejobs.com]</cite>
+
-
 
+
-
Appropriately for a Web-dependent discipline, there are a variety of specialist commercial Web sites which carry bioinformatics jobs:
+
-
 
+
-
* <cite>[http://www.bioinform.com/ BioInform]</cite>
+
-
* <cite>[http://www.BioPlanet.com/ BioPlanet]</cite>
+
-
* <cite>[http://www.bioexchange.com/ Bioexchange]</cite>
+
-
* <cite>[http://www.scijobs.org/ Scijobs]</cite>
+
-
 
+
-
There are also a number of companies actively recruiting in the area. Here are a few:
+
-
 
+
-
* [http://www.lionbioscience.com/career Lion Biosciences]
+
-
* [http://www.accelrys.com/ Accelrys]
+
-
* [http://www.genomecorp.com/jobs/index.shtml Genome Therapeutics Corporation]
+
==Practical tips==
==Practical tips==
-
This section includes some simple rules-of-thumb to apply when performing common bioinformatics tasks. I try to give a reference to a more detailed source of guidance where I know of one.
+
This section includes some simple rules-of-thumb to apply when performing common bioinformatics tasks.  
-
===How do I find a sequence?===
+
[[Finding a sequence]]
-
The most common task in bioinformatics must be the acquisition of some bioinformatics data on which to operate. Usually this is in the form of a nucleic acid or protein sequence, stored as characters in the appropriate alphabet together with a header of related information: for example, some kind of unique identifying number, the species from which the original biological substrate was obtained, the names of any authors who published the sequence and so on.
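As a rough illustration of "a sequence plus a header", here is a minimal Python sketch that reads records in the widely used FASTA text layout (a header line starting with ">", followed by lines of sequence characters). The choice of FASTA is an assumption, since this FAQ does not name a format, and the identifier, description and sequence in the usage note are made up for the example rather than taken from any database.

```python
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    identifier: str   # first word of the header line
    description: str  # remainder of the header line (species, authors, ...)
    sequence: str     # residues or bases, joined into one string


def parse_fasta(text):
    """Split FASTA-formatted text into a list of SequenceRecord objects."""
    records = []
    identifier = None
    description = ""
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                records.append(SequenceRecord(identifier, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        records.append(SequenceRecord(identifier, description, "".join(chunks)))
    return records
```

For instance, `parse_fasta(">EXAMPLE001 hypothetical fragment [Homo sapiens]\nMDDDIAALVV\nDNGSGMCKAG\n")` yields a single record whose `sequence` is the two lines joined together. Real database entries carry much richer headers than this sketch handles.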
+
[[Sequence alignment|Aligning two sequences]]
-
You may have already generated your own sequence data experimentally. In this case you are likely to want to find sequences which are identical or similar (and therefore possibly related) to yours. The task is then one of ''similarity search''.
+
[[Gene function prediction|Predicting the functions of a gene]]
-
====...I have a description.====
+
[[Sequence structure prediction|Predicting the structure of a sequence]]
-
A paradoxical problem generated by the success of the bioinformatics revolution is the increasing difficulty of navigating the huge amount of data available. Once you could print out most of the existing sequence databases onto paper and cram them into a single binder. Now a search for "actin" alone will pull out hundreds and hundreds of sequences. The key to finding what you want is to develop your own discriminatory skills rather than rely on computers to figure out what it is you're ''really'' after.
+
[[Simulating a biomolecule]]
-
=====Use Entrez-PubMed=====
+
[[Publishing]]
-
 
+
-
Make sure you are clear about your aim first. If you are looking for a sequence for a specific scientific purpose then you might be best to start with a relevant human-generated publication. For example, if you have cloned a gene which is part of a well-characterised biochemical pathway and you want to find other sequences of the same functional gene product in other species (orthologues), [http://www.ncbi.nlm.nih.gov/PubMed/ Entrez PubMed] is your friend.
+
-
 
+
-
PubMed is a huge and very comprehensive database of the biomedical scientific literature, created by the U.S. National Library of Medicine (NLM). Entrez PubMed is another indispensable resource of the U.S. National Center for Biotechnology Information (NCBI). Both are part of the [http://www.nih.gov/ U.S. Department of Health and Human Services National Institutes of Health].
+
-
 
+
-
=====Use Swiss-Prot=====
+
-
 
+
-
Swiss-Prot is curated by human beings.
+
-
 
+
-
=====Use SRS at the RFCGR=====
+
-
 
+
-
[XXXX INSERT DETAILED ADVICE HERE]
+
-
 
+
-
=====Use Boolean logic=====
+
-
 
+
-
[XXXX INSERT DETAILED ADVICE HERE]
+
-
 
+
-
=====Use cunning=====
+
-
 
+
-
[XXXX INSERT DETAILED ADVICE HERE]
+
-
 
+
-
====...I have an accession number.====
+
-
 
+
-
[XXXX INSERT DETAILED SEQUENCE ADVICE HERE]
+
-
 
+
-
====...I have another sequence.====
+
-
 
+
-
This section will be expanded---and there will be a more basic and detailed explanation for novice searchers, but, in the meantime, here are the top tips cribbed from the excellent [http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=10868283&dopt=Abstract  paper] by Hugh B. Nicholas Jr., David W Deerfield II and Alexander J. Ropelewski in [http://www.BioTechniques.com/ BioTechniques].
+
-
 
+
-
* Use a local favourite program on the Web server of your choice.
+
-
* Use at least two and preferably three similarity tables.
+
-
* If using Smith-Waterman or FASTA algorithms ensure that the gap opening penalty is high enough.
+
-
* If the initial search finds no or insufficient matches repeat it with a highly diverged matrix and/or with a Smith-Waterman-based server.
+
-
* If this doesn't work try switching from a PAM matrix to a BLOSUM matrix.
+
-
 
+
-
====...I'm not sure whether or not to use the defaults.====
+
-
 
+
-
Hugh, David and Alexander again on when not to use the default search parameters provided by a server.
+
-
 
+
-
* ...when the homologues you are looking for to match your query are highly diverged.
+
-
* ...when the query or matches are short.
+
-
* ...when you are only interested in a specific (in the sense of "species") subset of database matches with a particular evolutionary relationship to your sequence of interest---a relationship not implied by the default settings.
+
-
 
+
-
===How can I align two sequences?===
+
-
 
+
-
This section will also be expanded for newbies; until then, here are Hugh, David and Alexander's tips for alignment:
+
-
 
+
-
* Use an appropriately divergent matrix (I'll be adding a table soon to explain this).
+
-
* Reduce your gap penalty relative to that you used for your database search.
+
-
* Use the MaxSegs/Waterman-Eggert version of the dynamic programming algorithm to provide the best local alignment and also to search for repeats.
+
-
 
+
-
===How can I predict the function of a gene (product)?===
+
-
 
+
-
[XXXX INSERT FUNCTION PREDICTION ADVICE HERE]
+
-
 
+
-
===How can I predict the structure of a sequence?===
+
-
 
+
-
You could start with any one of these excellent guides (listed strictly in alphabetical order):
+
-
 
+
-
* Rob Russell's [http://speedy.embl-heidelberg.de/gtsp/ <cite>Guide to Structure Prediction (version 2)</cite>]
+
-
* András Fiser and Andrej Šali's [http://salilab.org/pdf/086_FiserDekker2000.pdf <cite>Comparative protein structure modeling</cite>]
+
-
* Gert Vriend's [http://www.cmbi.kun.nl/gv/articles/text/gambling.html <cite>Professional gambling</cite>]
+
-
 
+
-
===How can I simulate a biomolecule?===
+
-
 
+
-
Here's Peter J. Steinbach's [http://cmm.cit.nih.gov/intro_simulation/course_for_html.html <cite>"Introduction to Macromolecular Simulation"</cite>]
+
-
 
+
-
===How can I write up?===
+
-
 
+
-
Go [http://www.counsell.com/wordforthewise/ here] to download some detailed advice. Go [http://users.path.ox.ac.uk/%7Epcook/w1/w1dict.htm here] for more links.
+
==Glossary of bioinformatics terms==
==Glossary of bioinformatics terms==
-
Here I attempt to define some common terms in bioinformatics. I have tried to balance clarity, brevity and rigour. Let me know if I let one of these priorities over-ride the others.
+
Here are some common terms in bioinformatics:
-
 
+
-
===What is an alignment?===
+
-
 
+
-
When two symbolic representations of DNA or protein sequences are arranged next to one another so that their most similar elements are juxtaposed they are said to be '''aligned'''. Many bioinformatics tasks depend upon successful alignments. Alignments are conventionally shown as '''traces'''.
+
-
 
+
-
In a symbolic sequence each base or residue monomer in each sequence is represented by a letter. The convention is to print the single-letter codes for the constituent monomers in order in a fixed font (from the N-most to C-most end of the protein sequence in question or from 5' to 3' of a nucleic acid molecule). This is based on the assumption that the combined monomers are evenly spaced along the single dimension of the molecule's primary structure. From now on I shall refer to an alignment of two protein sequences.
+
-
 
+
-
Every element in a trace is either a '''match''' or a '''gap'''. Where a residue in one of two aligned sequences is identical to its counterpart in the other the corresponding amino-acid letter codes in the two sequences are vertically aligned in the trace: a match. When a residue in one sequence seems to have been deleted since the assumed divergence of the sequence from its counterpart, its "absence" is labelled by a dash in the derived sequence. When a residue appears to have been inserted to produce a longer sequence a dash appears opposite in the unaugmented sequence. Since these dashes represent "gaps" in one or other sequence, the action of inserting such spacers is known as '''gapping'''.
+
-
 
+
-
A deletion in one sequence is symmetric with an insertion in the other. When one sequence is gapped relative to another a deletion in sequence '''a''' can be seen as an insertion in sequence '''b'''. Indeed, the two types of mutation are referred to together as '''indels'''. If we imagine that at some point one of the sequences was identical to its primitive homologue, then a trace can represent the three ways divergence could occur (at that point).
+
-
 
+
-
====Biological interpretation of an alignment====
+
-
 
+
-
A trace can represent a '''substitution'''<nowiki>:</nowiki>
+
-
 
+
-
<blockquote>
+
-
 
+
-
+
-
AKVAIL
+
-
 
+
-
+
-
AKIAIL
+
-
 
+
-
</blockquote>
+
-
 
+
-
A trace can represent a '''deletion'''<nowiki>:</nowiki>
+
-
 
+
-
<blockquote>
+
-
 
+
-
+
-
VCGMD
+
-
 
+
-
+
-
VCG-D
+
-
 
+
-
</blockquote>
+
-
 
+
-
A trace can represent an '''insertion'''<nowiki>:</nowiki>
+
-
 
+
-
<blockquote>
+
-
 
+
-
+
-
GS-K
+
-
 
+
-
+
-
GSGK
+
-
 
+
-
</blockquote>
+
-
 
+
-
For obvious reasons I do not represent a silent mutation.
+
-
 
+
-
Traces may represent recent genetic changes which obscure older changes. Here I have only represented point mutations for simplicity. Actual mutations often insert or delete several residues.
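The three kinds of divergence shown in the traces above can be told apart column by column. The following is a minimal sketch in Python, not part of the original FAQ: the function name <code>classify_columns</code> and its labels are hypothetical, and it assumes the first sequence is treated as the original and the second as the derived one, with a dash marking a gap.

```python
def classify_columns(original, derived):
    """Label each column of a two-sequence trace.

    Assumption for illustration: `original` is taken as the ancestral
    form and `derived` as the diverged form; '-' marks a gap.
    """
    if len(original) != len(derived):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    for x, y in zip(original, derived):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if y == "-":
            # Residue present in the original, absent in the derived sequence.
            labels.append("deletion")
        elif x == "-":
            # Residue present only in the derived (longer) sequence.
            labels.append("insertion")
        elif x == y:
            labels.append("match")
        else:
            labels.append("substitution")
    return labels


substitution_trace = classify_columns("AKVAIL", "AKIAIL")
deletion_trace = classify_columns("VCGMD", "VCG-D")
insertion_trace = classify_columns("GS-K", "GSGK")
```

Because a deletion in one sequence is symmetric with an insertion in the other, swapping the two arguments turns every "deletion" label into "insertion" and vice versa.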
+
-
 
+
-
===What is a DNA array?===
+
-
 
+
-
Thanks to Bioinformatics.Org member Ravi Jain for the following answer, which I present ''verbatim''.
+
-
 
+
-
DNA microarrays consist of thousands of immobilized DNA sequences present on a miniaturized surface the size of a business card or less. Arrays are used to analyze a sample for the presence of gene variations or mutations (genotyping), or for patterns of gene expression, performing the equivalent of ''ca.'' 5 000 to 10 000 individual "test tube" experiments in approximately two days of time.
+
-
 
+
-
Robotic technology is employed in the preparation of most arrays. The DNA sequences are bound to a surface such as a nylon membrane or glass slide at precisely defined locations on a grid. Using an alternate method, some arrays are produced using laser lithographic processes and are referred to as biochips or gene chips. The composition of DNA on the arrays is of two general types:
+
-
 
+
-
* Oligonucleotides or DNA fragments (approximately 20-25 nucleotide bases). These arrays are frequently used in genotyping experiments. The sequences of alternate gene forms may be included for detection of mutations or normal variants (polymorphisms).
+
-
* Complete or partial cDNA (approximately 500-5 000 nucleotide bases). These arrays are generally used for relative gene expression analysis of two or more samples; however, oligonucleotide-based arrays may also be used for these studies.
+
-
 
+
-
DNA samples are prepared from the cells or tissues of interest. For genotyping analysis, the sample is genomic DNA. For expression analysis, the sample is cDNA, DNA copies of RNA. The DNA samples are tagged with a radioactive or fluorescent label and applied to the array. Single stranded DNA will bind to a complementary strand of DNA. At positions on the array where the immobilized DNA recognizes a complementary DNA in the sample, binding or hybridization occurs. The labeled sample DNA marks the exact positions on the array where binding occurs, allowing automatic detection. The output consists of a list of hybridization events, indicating the presence or the relative abundance of specific DNA sequences that are present in the sample.
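The complementary binding described above can be illustrated with a small sketch in Python. It is a deliberate simplification and not part of the contributed answer: the function names are hypothetical, only perfect complementarity is checked, and real hybridization also depends on conditions such as temperature and tolerates some mismatches.

```python
# Watson-Crick pairing for DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the strand that would pair with `seq`, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def hybridizes(probe, sample):
    """True if `sample` contains the perfect reverse complement of `probe`.

    Simplifying assumption: exact matches only.
    """
    return reverse_complement(probe) in sample.upper()


probe_rc = reverse_complement("ATGC")
bound = hybridizes("ATGC", "TTGCATTT")
unbound = hybridizes("ATGC", "AAAAAA")
```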
+
-
 
+
-
===What is a homologue?===
+
-
 
+
-
[http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=3621342&dopt=Abstract "Homology" is a much-misused term] and existed in biology long before the notion of protein sequences. Strictly homology cannot be qualified; it is not correct to state that two proteins are "30% homologous" with each other, for example. If we could look back far enough in the evolutionary histories of any two molecules under comparison, we would be guaranteed to find a common ancestor eventually, but this is not true homology. An example of this would be the relationship between two variants of a single ancestral enzyme resulting from a gene duplication event.
+
-
 
+
-
As a rule-of-thumb, true homology should be assigned only when the feature which leads us to suspect a relationship between molecules is one we consider likely to have ''derived from the molecules' common ancestor''. To quote Page and Holmes [<cite>Molecular Evolution: A Phylogenetic Approach</cite>, Roderick D. M. Page and Edward C. Holmes; Blackwell Scientific; ISBN 0865428891]: <blockquote> "The classic molecular example is the parallel evolution of amino acid sequences in the lysozyme enzyme in leaf-eating langur monkeys and in cows. Both animals have independently evolved foregut fermentation using bacteria, and in both cases lysozyme has been recruited to degrade these bacteria. Therefore, langur and cow lysozymes are homologous as genes; however, as digestive enzymes they are not homologous because this functionality was not present in the ancestral lysozyme" </blockquote> Although sequence determines structure, it is possible for two proteins to have very different sequences and functions and share a common fold. In fact, most gene products with similar three-dimensional structures are insufficiently similar at the sequence level for true homology or analogy (non-homologous similarity) to be distinguished.
+
-
 
+
-
===What is an ontology?===
+
-
 
+
-
Biology is changing from being a descriptive to an analytical science. Accurate and consistent descriptions are, however, vital to analysis. The idea of ''ontologies'' has been co-opted from [http://www.formalontology.it/ philosophy] and [http://www-ksl.stanford.edu/kst/what-is-an-ontology.html artificial intelligence] to partition bioinformatic knowledge in a way which can be reliably navigated by computers.
+
-
 
+
-
[resources/Holloway02.pdf This preprint] of a review by [http://www.ebi.ac.uk/Information/Staff/ele_holloway.html Ele Holloway] of the [http://www.ebi.ac.uk/ European Bioinformatics Institute] gives a more detailed insight into the varied approaches to ontologies in bioinformatics by covering a recent meeting on the subject. The final version appears in [http://www.wiley.co.uk/wileychi/genomics/cfg.html <cite>Comparative and Functional Genomics</cite>].
+
-
 
+
-
===What is a scoring matrix?===
+
-
 
+
-
The following explanation was edited from a contribution by Amelie Stein.
+
-
 
+
-
The aim of a [http://bioinformatics.org/faq/#glossaryAlignment sequence alignment] is to match "the most similar elements" of two sequences. This similarity must be evaluated somehow. For example, consider the following two alignments:
+
-
 
+
-
<blockquote>{| summary="illustration of an alignment"
+
-
|- align="center"
+
-
| align="center" |
+
-
(a)
+
-
 
+
-
+
-
                      AIWQH
+
-
                      AL-QH
+
-
+
-
| align="center" |
+
-
(b)
+
-
 
+
-
+
-
                      AIWQH
+
-
                      A-LQH
+
-
+
-
|}</blockquote>
+
-
 
+
-
They seem quite similar: both contain one "indel" and one substitution, just at different positions. However, if we think of the letters as amino acid residues rather than elements of strings, alignment (a) is the better one, because isoleucine (I) and leucine (L) are similar sidechains, while tryptophan (W) has a very different structure. This is a physico-chemical measure; we might prefer these days to say that leucine simply substitutes for isoleucine more frequently---without giving an underlying "reason" for this observation.
+
-
 
+
-
However we explain it, it is much more likely that a mutation changed I into L and that W was lost, as in (a), than that W changed into L and I was lost. We would expect that a change from I to L would not affect the function as much as a mutation from W to L---but this deserves its own topic.
+
-
To quantify the similarity achieved by an alignment, ''scoring matrices'' are used: they contain a value for each possible substitution, and the ''alignment score'' is the sum of the matrix's entries for each aligned amino acid pair. For gaps (indels), a special ''gap score'' is necessary---a very simple one is just to add a constant penalty score for each indel. The ''optimal alignment'' is the one which maximizes the alignment score.
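As a rough illustration of this sum, here is a minimal sketch in Python that scores alignments (a) and (b) from the example above. The scores are invented for illustration and are not taken from PAM or any other published matrix: identical residues score +2, the I/L pair scores +1, every other substitution scores -1, and each gap position costs a constant -2.

```python
# Toy substitution scores (illustrative only, NOT real PAM or BLOSUM values).
MATCH_SCORE = 2
MISMATCH_SCORE = -1
SIMILAR_PAIRS = {frozenset("IL"): 1}  # assume I and L substitute readily
GAP_PENALTY = -2  # constant penalty for each gap position


def pair_score(x, y):
    """Look up the score for one aligned residue pair in the toy matrix."""
    if x == y:
        return MATCH_SCORE
    return SIMILAR_PAIRS.get(frozenset((x, y)), MISMATCH_SCORE)


def alignment_score(top, bottom):
    """Sum the column scores of two gapped sequences of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += GAP_PENALTY
        else:
            total += pair_score(x, y)
    return total


score_a = alignment_score("AIWQH", "AL-QH")  # alignment (a)
score_b = alignment_score("AIWQH", "A-LQH")  # alignment (b)
```

Under these assumed scores alignment (a) totals 5 and alignment (b) totals 3, so (a) is preferred, in line with the argument above. With a real matrix the individual numbers would differ, but the summation is the same.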
+
[[Sequence alignment]]
-
''PAM'' matrices are a common family of score matrices. PAM stands for '''''P'''ercent '''A'''ccepted '''M'''utations'', where "accepted" means that the mutation has been adopted by the sequence in question. Thus, using the [http://acer.gen.tcd.ie/~amclysag/pam250.html PAM 250] scoring matrix means that about 250 mutations per 100 amino acids may have happened, while with PAM 10 only 10 mutations per 100 amino acids are assumed, so that only very similar sequences will reach useful alignment scores.
+
[[DNA array]]
-
PAM matrices contain positive and negative values: if the alignment score is greater than zero, the sequences are considered to be related (they are similar with respect to the scoring matrix used); if the score is negative, it is assumed that they are not related. "Relationship" here may refer to evolution as well as functionality of the proteins, and of course the choice of the matrix affects the result, so one has to make an assumption on the similarity of the sequences in order to receive a useful result: rather distant sequences won't produce a good alignment using PAM 10, and the optimal alignment of two very similar sequences with PAM 500 may be less useful than that with PAM 50.
+
[[Homologue]]
-
Finally, it should be noted that some scoring matrices use ''similarity'' to evaluate alignments while others use ''distance'', so be careful when interpreting the results!
+
[[Ontology]]
-
After this brief and necessarily superficial overview, [http://www.inf.ethz.ch/personal/gonnet/papers/Distance/Distance.html you might want to read some more about scoring matrices].
+
[[Scoring matrix]]

Latest revision as of 03:00, 24 November 2010

Contents

Bioinformatics

What is bioinformatics?

What are the origins of bioinformatics?

What are the most common bioinformatics programs?

What are the most common bioinformatics technologies?

How are data analyzed in bioinformatics?

Fields related to bioinformatics

What is biophysics?

What is computational biology?

What is medical informatics?

What is cheminformatics?

What is genomics?

What is mathematical biology?

What is proteomics?

What is pharmacogenomics?

What is pharmacogenetics?

Books: Can you recommend any bioinformatics books?

See this article.

Centers of bioinformatics activity: Where is bioinformatics done?

Genome Web at the Rosalind Franklin Centre for Genomics Research at the Genome Campus near Cambridge, UK, provides some of the links below.

Research centers

Sequencing centers

Standard centers

Virtual centers for bioinformatics activity

Online resources: What bioinformatics websites are there?

Blogs

General information websites

Directories

Societies

Collections of tools

Portals

Tutorials

Education: Where can I study bioinformatics?

Below are complete, full-time degree programmes rather than individual study modules. You can go to other places, however, if you are looking for short courses. Rockefeller has a list that is mirrored at various other sites. ICSB also maintains a list.

Africa

The Americas

Asia

Australia

Europe

Distance or correspondence courses

Careers: How can I become a bioinformatics practitioner?

Getting involved

Careers

Practical tips

This section includes some simple rules-of-thumb to apply when performing common bioinformatics tasks.

Finding a sequence

Aligning two sequences

Predicting the functions of a gene

Predicting the structure of a sequence

Simulating a biomolecule

Publishing

Glossary of bioinformatics terms

Here are some common terms in bioinformatics:

Sequence alignment

DNA array

Homologue

Ontology

Scoring matrix
