[BiO BB] Looking for researcher, to assist on blast-like invention

Mon Feb 11 18:56:41 EST 2008

On 11 Feb 2008, at 22:28, Ryan Golhar wrote:

> Why don't you write up a paper describing the algorithm in detail and
> submit it to a bioinformatics journal?  And why not make the executable
> available with documentation so that people can download it and try it
> out for themselves?
>
> Do you have any test cases that show it runs faster/better than BLAST?
> Describe them and make them available.

The first thing I'd need to do is make a good test. I'm not sure what  
constitutes "a good test", in this case.

How big should the databanks be to make the test reasonable? Is
randomly generated data good enough, or is a randomly selected sample
of real sequences better? If a sample is better, how large a dataset
must I gather to do the test?

Perhaps certain settings make my algorithm work better or worse  
relative to BLAST. But then how do I know which settings are more  
likely to be used and which aren't?

I think someone who uses BLAST frequently, and knows it well from a
user's perspective, would have a better feel for creating a test
than I do.

The worst thing that could happen is that I make a test which is
unfairly biased toward my algorithm :) The next thing that would happen
is that people would see my test has "suspiciously good" results, be
annoyed about that, and lose interest, even if it were an innocent
mistake on my end. I'd rather avoid that sort of mistake by having a
knowledgeable eye on the design of the test!

Like I said, I haven't got all the code in C++ yet. I already have a
framework in C++, and I know how to write C++. I also know what to do,
as I've already written it in a prototyping language.

The C++ version will come soon, though.

> Theodore H. Smith wrote:
>> Hi everyone,
>>
>> So I've been working, on and off, on this algorithm for quite a while
>> now. It's basically an invention of mine. It is a "BLAST-like"
>> algorithm, in that it does "fuzzy lookup" operations across a database
>> of letters. I am designing this algorithm to be useful for
>> bioinformatics, which is the main field I am initially targeting.
>>
>> The database will be filled with protein sequences, and the query
>> searched against the database will be another protein sequence. The
>> algorithm has a "scoring matrix", which can accept different protein
>> replacement scores. The cost of inserting letters (protein letters)
>> can also be configured.
>>
>> In this sense, it's no different to Smith-Waterman. The same input,
>> the same output!
>>
>> The real difference from Smith-Waterman is its speed. My algorithm
>> should be hugely faster, because I use many techniques to avoid
>> processing unnecessary parts of the Smith-Waterman matrix.
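
For reference, here is a minimal Python sketch of the plain Smith-Waterman scoring that the paragraphs above take as their baseline. It is the textbook recurrence that fills every cell of the matrix, with a linear gap cost; it is not the ElfDataFuzzy algorithm. The names `smith_waterman_score` and `toy_score`, the +2/-1 substitution scores, and the gap cost of -2 are assumptions chosen only for illustration.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a real scoring matrix: +2 match, -1 mismatch."""
    return 2 if a == b else -1


def smith_waterman_score(query, target, score, gap=-2):
    """Return the best local alignment score between query and target.

    score(a, b) gives the substitution score for two letters; gap is the
    (negative) cost of inserting or deleting one letter. A linear gap model
    is assumed here. Every cell of the matrix is computed, one row per
    letter of the target sequence.
    """
    prev = [0] * (len(query) + 1)  # row for the empty target prefix
    best = 0
    for t in target:
        cur = [0]
        for j, q in enumerate(query, start=1):
            cell = max(
                0,                          # start a new local alignment
                prev[j - 1] + score(q, t),  # align q with t
                prev[j] + gap,              # skip the target letter
                cur[j - 1] + gap,           # skip the query letter
            )
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

A result from any faster method could be checked against this function on the same inputs, for example `smith_waterman_score("ABCD", "ACD", toy_score)`, since the claim is that input and output are the same and only the amount of work differs.
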
>>
>> I also use many tricks to reuse computations across various proteins.
>> For example, the matrix for protein "ABCDE" is identical, at first
>> anyhow, to the matrix for "ABCDEFG". This means that if I have both
>> proteins "ABCDE" and "ABCDEFG" in my protein database, I can test
>> both of them against the search query in almost half the time. My
>> algorithm also runs in logarithmic time with respect to the size of
>> the database. Basically, bigger databases are searched
>> disproportionately faster.
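
One way the prefix reuse described above could work is to store the database in a trie and compute one matrix row per trie node, so that sequences with a common prefix share the rows for that prefix. The Python sketch below illustrates only that idea, under the same toy scoring assumptions (+2/-1, linear gap of -2); the names `build_trie`, `next_row`, and `search_shared_prefixes` are hypothetical. It is not necessarily how the author's algorithm is implemented, and it does not demonstrate the logarithmic-time claim.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a real scoring matrix: +2 match, -1 mismatch."""
    return 2 if a == b else -1


def build_trie(sequences):
    """Build a letter trie; each node records the sequences that end there."""
    root = {"children": {}, "ends": []}
    for seq in sequences:
        node = root
        for letter in seq:
            node = node["children"].setdefault(
                letter, {"children": {}, "ends": []})
        node["ends"].append(seq)
    return root


def next_row(prev, query, letter, score, gap):
    """Compute one Smith-Waterman row from the previous row."""
    cur = [0]
    for j, q in enumerate(query, start=1):
        cur.append(max(0,
                       prev[j - 1] + score(q, letter),
                       prev[j] + gap,
                       cur[j - 1] + gap))
    return cur


def search_shared_prefixes(query, sequences, score=toy_score, gap=-2):
    """Score every sequence against query, sharing rows between prefixes.

    Returns ({sequence: best local score}, number of rows computed).
    """
    results, rows = {}, 0
    # Each stack entry: (trie node, row for that prefix, best score so far).
    stack = [(build_trie(sequences), [0] * (len(query) + 1), 0)]
    while stack:
        node, prev, best = stack.pop()
        for seq in node["ends"]:
            results[seq] = best
        for letter, child in node["children"].items():
            cur = next_row(prev, query, letter, score, gap)
            rows += 1
            stack.append((child, cur, max(best, max(cur))))
    return results, rows
```

With the two sequences from the example, `search_shared_prefixes("ABCDE", ["ABCDE", "ABCDEFG"])` computes 7 rows instead of the 12 (5 + 7) needed when each sequence is scored separately, which is in line with the "almost half the time" estimate.
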
>>
>> I want to turn this algorithm into something useful for people. My
>> first challenge here is to answer the question "is this algorithm
>> faster, or better, than BLAST?" If it is not faster, my algorithm
>> basically has little use. But I have good hopes it will be faster! I
>> am very good with these sorts of things, you see :) Speed is my
>> strong point.
>>
>> Currently, I do not know about the speed, because I haven't
>> implemented a C++ version of my algorithm, or a good speed testing
>> framework.
>>
>> I do however know that my algorithm is more accurate than BLAST,
>> because it is just as accurate as SSEARCH: mine uses the
>> Smith-Waterman algorithm, whereas BLAST uses a heuristic, intelligent
>> guesswork basically. A fine heuristic, but still a heuristic. Mine is
>> exact, not heuristic-based.
>>
>> So here is what I am looking for!
>>
>> I am hoping that someone in the field will be able to offer me
>> guidance, interest, enthusiasm, suggestions and maybe even do some
>> testing for me.
>>
>> Perhaps a student doing a bioinformatics-related degree, who would
>> like to write a paper on an alternative way of processing protein
>> databases. My invention could be an interesting subject for a paper.
>>
>> Or perhaps a researcher who just has an interest in these sorts of
>> things! Perhaps a researcher who feels there must be a better way of
>> doing these things. Or anyone really in this field with the time and
>> interest, who feels that helping me could help him (or her) too in
>> some way.
>>
>> I'd like someone I can ask a lot of questions, show my software to,
>> and explain what I hope to achieve with it.
>>
>> Basically, my first questions to you would be "how would I set this up
>> to be useful for someone?" and "how would I test its usefulness? What
>> would you need to know about my algorithm to decide to use it over
>> BLAST?"
>>
>> It's sort of a vague question from me, like "what do you need me to
>> do", but... well that's where I am right now. Sort of a bit on the
>> outside hoping someone on the inside will show me something.
>>
>> So it's an opportunity to tell me what you want, basically!! Tell me,
>> and I might just make it.
>>
>> Who knows? Maybe one day, in a few years' time, everyone will be using
>> this "ElfDataFuzzy" algorithm that I invented, instead of BLAST! You
>> might be part of something.
>>
>> Thanks to anyone who replies!
>>
>> --
>> http://elfdata.com/plugin/
>> "String processing, done right"
>>
>>
>>
>> _______________________________________________
>> BBB mailing list
>> BBB at bioinformatics.org
>> http://www.bioinformatics.org/mailman/listinfo/bbb
>>
>>

--
http://elfdata.com/plugin/
"String processing, done right"