[Bioclusters] Large numbers of files (again...?)

Tim Cutts bioclusters@bioinformatics.org
Wed, 28 Jan 2004 13:37:31 +0000


On 28 Jan 2004, at 12:25, Dan Bolser wrote:

> Hello,
>
> Sorry if this is a repost; I am not sure how your moderation works,
> but now that I am a member of the list, I am sending this mail again...

Well, here's a trivial filename hashing routine in Perl, which does 
quite a nice job:

use Digest::MD5 qw(md5_hex);

sub get_hash_path_for_file {
	my $filename = shift; # Assumed to have no path component

	# md5_hex() returns 32 hex characters; split them into two-character
	# chunks and use the first few as directory levels.
	# Adjust the following array slice for the hash depth you want.
	return join('/', (md5_hex($filename) =~ /\G../g)[0..1], $filename);
}
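
To actually use it you need to create the hash directories before moving
the file into place; a minimal sketch (the filename is just a placeholder,
and it assumes the file sits in the current directory):

use File::Path qw(mkpath);
use File::Copy qw(move);
use File::Basename qw(dirname);

my $file = 'some_sequence.fasta';        # hypothetical example file
my $dest = get_hash_path_for_file($file);

mkpath(dirname($dest));                  # create the hash directories if needed
move($file, $dest) or die "move failed: $!";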

The above hash depth of two directories gives 256^2 = 65,536 leaf 
directories, so it's probably fine for up to 65 million files or so 
(assuming you want to keep things down to around 1000 files per leaf 
directory).
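
If you need more headroom, the array slice is the only thing to change;
each extra level multiplies the number of leaf directories by 256. A
quick sketch of a depth-parameterised variant (the name
get_hash_path_at_depth is just for illustration):

use Digest::MD5 qw(md5_hex);

sub get_hash_path_at_depth {
	my ($filename, $depth) = @_;
	# Use the first $depth two-character chunks of the MD5 as directories
	return join('/', (md5_hex($filename) =~ /\G../g)[0 .. $depth - 1], $filename);
}

# depth 2: 256^2 =     65,536 leaves -> ~65 million files at ~1000 per leaf
# depth 3: 256^3 = 16,777,216 leaves -> ~17 billion files at ~1000 per leaf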

You could of course just use a better filesystem, as Joe suggested, but 
that's not an option on some architectures. :-)

Tim

-- 
Dr Tim Cutts
Informatics Systems Group
Wellcome Trust Sanger Institute
Hinxton, Cambridge, CB10 1SA, UK