BRIE Background
With the increasing availability of textual information
related to biology, including MEDLINE abstracts and full-text journal articles,
research on information extraction is rapidly becoming an essential
component of various bioinformatics applications.
It is expected that text mining in general and
information extraction in particular will provide tools that will facilitate
the annotation of vast amounts of molecular information, including gene
sequences, transcription profiles and biological pathways.
Text mining and information extraction have already
been successfully used in a number of applications, including the detection
of gene and protein interactions, the functional classification
of proteins, automatic database-driven sequence annotation and the annotation
of transcription profiles from microarray technology.
As in other areas of data mining, the primary goal of textual
information extraction is the detection of linguistic patterns already
present in the corpus under investigation. Thus, novel discoveries
in biology should be expected from the mere mining of biological text.
OAP Background
For most publications, copyright on scientific communications (published
articles and so forth) belongs to the publishing companies, not to the authors.
Scientists wishing to share relevant communications, even their own in
some cases, face legal challenges from publishers.
Publishing companies charge expensive subscription fees for access to scientific
communications. Scientists in developing countries and poorly endowed institutions,
although intellectually on par with their peers, are severely hindered
by this.
These two problems have prevented scientists from gaining any access,
even for simple searches, to the full text of these communications.
Scientific communications are published in journals segregated by topic.
This has resulted in confusion as to the best place to publish, retrieve
or extract information (e.g., mathematical biology communications could
be published in either a mathematical journal or a biological one).
Communications are also published in journals differing by publisher.
This has caused the segregation of communications by the prestige of the
journal (e.g., how difficult it is to be published in the journal and the
composition of the readership). This has also allowed room for personal
politics in scientific communication.
These two problems are compounded by the first two: with a limited
budget, to which journals should one subscribe? What we are left with is
an artificial selection, by publishers, of which communications are best
suited to a scientist's field of study.
This may be the result of a competitive marketplace for readership,
but is there an alternative to profit-based publications? Should there
be? Can an alternative publication model be profitable for a publisher?
Additionally, even with the advent of computers, databases, and the
World Wide Web, scientific communications are published as they were 100
years ago: as linear, printable text. And they are archived this way. While
this makes good reading, it is not the best format for information retrieval
or extraction.
All of these problems restrict information retrieval, extraction, and
scientific inquiry. How do we resolve them? As the ultimate solution, should
future communications be published in an "open-access, global knowledge-base"?
Before or after information extraction techniques are applied?
BRIE Aims
The primary aim of this conference is to bring together individuals
and groups actively involved in text mining for biology.
OAP Aims
We identify several obstacles to information retrieval and extraction:
copyright restrictions, costly subscriptions, artificial segregation of
communications, and archival of information in a manner not suited for
information retrieval and extraction. We seek to discuss the concept
of "open-access publications" and whether it is a viable solution to these
problems.
OAP also serves as a "Birds of a Feather" (BoF) meeting for Bioinformatics.org,
an organization committed to freedom and openness in the field of bioinformatics.
BRIE Scope
We are seeking abstract submissions in the area of biological discovery
using text mining techniques. In particular, we would like to put more
emphasis on the use of these techniques in the discovery of highly non-trivial,
novel information in biology, including relationships at the molecular,
biochemical and cellular levels.
Abstracts in the following areas are particularly welcome:
- Biological discoveries made using text mining and independently supported
  by other experimental information
- Annotated corpora of biological text for the benchmarking of existing
  methods
- Approaches for database maintenance and integrity using text mining
- Standardization and evaluation of methods for information extraction,
  including algorithms (e.g. full parsers) and databases (e.g. portable ontologies)
OAP Scope
We are seeking several speakers who can address how the above
problems might be solved. Topics may include author-owned copyrights, free or
inexpensive subscriptions, uniform and multiple categories for communications,
and archival of information in a manner suited for information retrieval and
extraction (for example, knowledge bases).