[BiO BB] Clustering Large Datasets

Hilmar Lapp hlapp at gmx.net
Thu Jun 8 09:17:02 EDT 2006


A simple back-of-the-envelope calculation shows that your matrix
would be at least 46 GB even if every matrix element occupied only a
single byte: 100,000 x 500,000 = 5 x 10^10 elements, or about 46.6
GB. Do you have a machine with this much memory?

R stores numeric matrices as doubles (8 bytes per element), hence
the size would be 8x as much, roughly 373 GB.
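
For what it's worth, the arithmetic (in R itself, conveniently):

    n_rows <- 1e5                      # 100,000 rows
    n_cols <- 5e5                      # 500,000 columns
    n_elements <- n_rows * n_cols      # 5e10 matrix elements
    n_elements / 2^30                  # ~46.6 GB at one byte per element
    n_elements * 8 / 2^30              # ~372.5 GB stored as R doubles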

If the matrix is sparse you may be able to use a sparse matrix
representation - though I doubt this will help with clustering,
since distance computations tend to densify the data again.
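
In case it were applicable, here is a rough sketch of what a sparse
representation buys, using the Matrix package (the sizes and density
below are toy numbers of my own choosing, not your data):

    library(Matrix)
    set.seed(1)
    # toy example: 100,000 x 500,000 with only 1e6 nonzero entries
    i <- sample(1e5, 1e6, replace = TRUE)   # row indices of the nonzeros
    j <- sample(5e5, 1e6, replace = TRUE)   # column indices
    x <- rnorm(1e6)                         # the nonzero values
    m <- sparseMatrix(i = i, j = j, x = x, dims = c(1e5, 5e5))
    print(object.size(m), units = "MB")     # tens of MB, not hundreds of GB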

If not, you'll have to write some code to farm it out in chunks over  
hundreds of nodes of a compute farm.
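
A minimal sketch of what that chunking could look like, shown on a
small in-memory toy matrix. On the real data each chunk would be
read from storage, and each block of the distance matrix would be an
independent job on one node; cross_dist() below is my own helper,
not a standard function:

    set.seed(1)
    n_rows <- 200
    m <- matrix(rnorm(n_rows * 50), nrow = n_rows)  # toy stand-in

    # Euclidean distances between every row of a and every row of b
    cross_dist <- function(a, b) {
      sq <- outer(rowSums(a^2), rowSums(b^2), `+`) - 2 * tcrossprod(a, b)
      sqrt(pmax(sq, 0))   # pmax guards against small negative rounding errors
    }

    chunk_size <- 50
    starts <- seq(1, n_rows, by = chunk_size)
    for (s1 in starts) {
      a <- m[s1:min(s1 + chunk_size - 1, n_rows), , drop = FALSE]
      for (s2 in starts[starts >= s1]) {    # upper triangle only
        b <- m[s2:min(s2 + chunk_size - 1, n_rows), , drop = FALSE]
        d <- cross_dist(a, b)               # an independent job on one node
        # a real run would write d to disk here instead of discarding it
      }
    }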

Note that another back-of-the-envelope calculation tells you that
computing the pairwise distances between 100,000 rows (about 5 x
10^9 distinct pairs) will take 57 days if a single pairwise distance
takes 1 ms.
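
Again in R:

    choose(1e5, 2)                    # ~5.0e9 distinct row pairs
    choose(1e5, 2) * 1e-3 / 86400     # ~57.9 days at 1 ms per distance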

If I were you I would rethink what you're trying to do and reduce
the dimensionality up front.
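
For example, a common first step with microarray data is to keep
only the most variable features and optionally project onto a few
principal components before clustering - a rough sketch, with toy
data and thresholds picked arbitrarily by me:

    set.seed(2)
    m <- matrix(rnorm(200 * 5000), nrow = 200)   # toy stand-in for real data

    # keep only the 100 most variable features
    vars <- apply(m, 2, var)
    m_small <- m[, order(vars, decreasing = TRUE)[1:100]]

    # compress further with PCA, then cluster in the reduced space
    pcs <- prcomp(m_small)$x[, 1:20]    # scores on the first 20 components
    km  <- kmeans(pcs, centers = 10)
    table(km$cluster)                   # cluster sizes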

	-hilmar

On Jun 7, 2006, at 7:20 PM, Alex butarbutar wrote:

> Hello,
> I am having a difficult time finding a program / statistical
> package that will allow clustering of large data sets. By large
> data sets, I mean a 100,000 x 500,000 matrix of data from a
> microarray experiment. I know this will probably require an excess
> of computational time / memory.
> I've tried using R, only to find out that it does not allow
> matrices of that size.
> Any suggestions?
>
> alex b

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================