[Bioclusters] Current Dell Powerconnect 3048 switches can fail in some conditions under high NFS traffic loads

chris dagdigian bioclusters@bioinformatics.org
Thu, 21 Nov 2002 14:00:46 -0500


Responding to 2 messages at once here...

This response may not be all that helpful -

We had problems with the 3048 switch locking up under heavy NFSv3 load, 
as I wrote in my August bioclusters post (archived at 
https://bioinformatics.org/pipermail/bioclusters/2002-August/000341.html).

I was able to get in touch with both the Powerconnect product manager 
and an engineer who was working on the problem. Both reported that the 
problem had been seen before at a "huge customer" and that it was being 
taken seriously.

My guess is that this was the switch that Dell used for the massive 
4,000 CPU cluster at SUNY Buffalo. Given the PR they are trying to 
squeeze from that project I'd say that any switch problems would get the 
highest level of attention. Dell told me that the powerconnect people 
were working with the internal Dell high performance clustering group to 
try to recreate the failure in the lab. Last I heard they had been 
partially successful at getting the switch to lock up in their lab.

We (Bioteam.net) were under heavy deadline pressure for the client 
cluster we were building. We tried one firmware fix that Dell provided 
directly to us and when that failed we met with our customer and decided 
that we didn't have the time to wait for a Dell fix. The customer ended 
up taking the 3048's out of the cluster and deploying them elsewhere in 
the company, where they have been working fine ever since in a less 
demanding network environment (no crashes).

We replaced the 3048's with Dell Powerconnect 3248's, which were more 
expensive and had all these layer3/4 features that we didn't need or 
want. However, the 3248 runs totally different firmware and has a totally 
different command-line interface.

Since replacing the 3048's with a pair of trunked 3248's we've had zero 
problems with the switches in the cluster. In fact the cluster has been 
working wonderfully and as of today has an uptime figure of 44 days 
straight which is cool given the somewhat odd and experimental stuff we 
have been doing to that poor system.

I'd like to assume that the 3048 problem has long since been fixed. 
Keith and Tony -- if you are still having problems I can contact you 
directly if you like and pass along the email addresses and name of the 
powerconnect product manager. I'd guess that he would be able to talk 
about the current state of the 3048.

Regards,
Chris
www.BioTeam.net







> Keith Maples wrote:
>> We just purchased a Powerconnect 3048 and had the same problems you list 
>> in your report.  They had us do a firmware update on the switch. I 
>> checked the date of the firmware and it was in June, which is some 
>> time before you wrote the article on these switches.  Does this mean 
>> that the firmware didn't work for you on the switch?  I hate to bother 
>> you but we just got the switch and we'll return it if it's still having 
>> issues with heavy NFS traffic.
>> 
>> We have a Macintosh Server (OS X 10.1) uplinked to it via Gigabit and 
>> assumed it was the culprit.  Any input would be GREATLY appreciated.
>> 
>> Thank you very much...


Meoni, Tony wrote:
> Chris Dagdigian,
> 
> Read your warning tonight on bioinformatics.org, but unfortunately too 
> late.  We have these lemons and they have been trouble.  Loaded a fix 
> 5.2.8 on them and they worked for about 17 days and then all 6 
> crashed.  Separated them from a stack and couldn't ping between two PCs 
> on any of the stacks.  Can you shed any other light on this?  Have the 
> 3248's still been working okay?  Maybe I'll ask Dell for an exchange.
> 
> Thanks for any help or info you can give.
> 
> ____________
> Anthony Meoni
> tmeoni@aahp.org