[Beowulf] Tips for diagnosing intermittent problems on a small cluster
Andrew M.A. Cater amacater at galactic.demon.co.uk
Sun Nov 25 03:27:21 PST 2007
On Thu, Nov 22, 2007 at 01:53:04PM +0100, Jürgen Kabelitz wrote:
>
> Hi,
>
> We had the same problems with a cluster of 40 nodes. The motherboard has
> problems with heavy I/O. We have some test programs that use only the CPU
> and do little or no I/O. These programs run and run. But when you have a
> program like Gaussian with heavy I/O, then this can happen.
> In the end we replaced the motherboard with the S2882.
> J. Kabelitz
>
>
> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On behalf of stephen mulcahy
> Sent: Wednesday, 21 November 2007 18:28
> To: beowulf at beowulf.org
> Subject: [Beowulf] Tips for diagnosing intermittent problems on a small cluster
>
> Hi,
>
> As I mentioned in my previous posting, the 20 node Tyan S2891 Dual
> Opteron dual core Debian cluster (1 NFS providing head node, 19 diskless
> compute nodes) is currently experiencing 2 intermittent problems which
> I'm trying to diagnose.
>
> After a few days of testing and digging through system logs I'm pretty
> much stumped as to what may be causing these. There are 2 separate
> problems - anyone's opinions on how to go about diagnosing these problems
> or things I might have missed would be most welcome.
>
> Problem #1
> Over the last 6 months, 3 different nodes have been found in a powered
> down state - the nodes seem to have powered off during a run of the
> model.

Same here on a single machine with an earlier model Tyan board - it
happened to us either after a very occasional kernel panic/exception or
after 25-28 days of continuous running.

I've got a 2885 here, if I can just find two Opterons, memory and a
case :-) I'll let you know if this one does it too.

There _may_ be some PSU involvement with ours: the machine and fans are
running but not accepting connections. You have to disconnect the power
for a few minutes for it to even boot again properly.
Power-cycling from the front panel doesn't always work.

Debian etch, stock Debian kernel (2.6.18-5 from memory).

Andy
