[BiO BB] Clustering Large Datasets
Hilmar Lapp
hlapp at gmx.net
Thu Jun 8 09:17:02 EDT 2006
A simple back-of-the-envelope calculation shows that your matrix
would occupy at least 46 GB even if every matrix element took up
only a single byte. Do you have a machine with this much memory?
R stores numeric values as doubles, hence the size would be 8x as much.
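The memory estimate above can be reproduced in a few lines (a sketch in Python; the element sizes are the 1-byte and 8-byte-double cases discussed above):

```python
# Back-of-the-envelope memory estimate for a dense 100,000 x 500,000 matrix.
rows, cols = 100_000, 500_000
bytes_per_double = 8  # R stores numeric matrices as 8-byte doubles

gib = 1024 ** 3
size_1byte = rows * cols * 1 / gib                 # 1 byte per element
size_double = rows * cols * bytes_per_double / gib  # 8 bytes per element
print(f"1 byte/element:  {size_1byte:.1f} GiB")   # ~46.6 GiB
print(f"8 bytes/element: {size_double:.1f} GiB")  # ~372.5 GiB
```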
If the matrix is sparse you may be able to use a sparse matrix
representation - but I doubt this will be applicable to clustering.
If not, you'll have to write some code to farm it out in chunks over
hundreds of nodes of a compute farm.
Note that another back-of-the-envelope calculation tells you that
computing the pairwise distances will take roughly 58 days if a single
pairwise distance takes 1 ms.
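The runtime estimate follows from counting the distinct pairs among the 100,000 rows (a quick check, assuming 1 ms per distance as above):

```python
# Back-of-the-envelope runtime for all pairwise distances among 100,000 rows.
n = 100_000
pairs = n * (n - 1) // 2       # distinct unordered pairs: 4,999,950,000
seconds = pairs * 0.001        # 1 ms per pairwise distance
days = seconds / 86_400
print(f"{pairs:,} pairs -> about {days:.1f} days")  # ~57.9 days
```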
If I were you I would rethink what you're trying to do and reduce the
dimensionality upfront.
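One common way to reduce dimensionality upfront is a truncated SVD, keeping only the top few components before clustering. A minimal sketch with NumPy, where the small random matrix merely stands in for the real expression data and the target dimensionality k is an arbitrary illustrative choice:

```python
# Truncated-SVD dimensionality reduction sketch (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1000))  # stand-in for samples x features

k = 20  # target dimensionality (illustrative)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * s[:k]  # project onto the top-k singular directions

print(X.shape, "->", X_reduced.shape)  # (200, 1000) -> (200, 20)
```

Clustering the reduced matrix instead of the full one cuts both the memory footprint and the per-distance cost by the same factor.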
-hilmar
On Jun 7, 2006, at 7:20 PM, Alex butarbutar wrote:
> Hello,
> I am having a difficult time finding a program / statistical
> package that will allow clustering of large data sets. By large data
> sets, I mean a 100,000 x 500,000 matrix from a microarray
> experiment. I know this will probably require an excess of
> computational time / memory.
> I've tried using R, only to find out that it does not allow a matrix
> of that size.
> Any suggestions?
>
> alex b
> _______________________________________________
> Bioinformatics.Org general forum -
> BiO_Bulletin_Board at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bio_bulletin_board
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================