[Beowulf] failure rates
Walid walid.shaari at gmail.com
Mon Feb 5 09:48:33 PST 2007
- Previous message: [Beowulf] failure rates
- Next message: [Beowulf] massive parallel processing application required
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi,

I do not know if I can really help answer the original question, but most of the failures we see on the system side are, in this order:

- hard disks
- interconnect cards
- misconfigured nodes
- uncorrected memory errors
- system board failures
- unexplainable failures

Failures related to the application itself we do not see, as users will quietly correct their mistakes and resubmit their jobs.

The issue is that clusters are by definition not highly available systems. They are made up of commodity hardware, and if most of these clusters use a standard MPI implementation, then they work on the fail-stop principle: if something fails, the job stops. Most of the time, failure investigation is minimal, because the priority is getting the node back to work. So is failure rate really a concern? If it were, we would see more fault-tolerance layers in clusters, and failure-rate metrics in monitoring tools and reports.

I am interested in reducing these failure rates because user demand is growing. Instead of using a few nodes, users now use as many as possible and request even more, and the more nodes we give them, the more failures we get!

What are you trying to achieve with your thesis? Will the question of how to reduce or manage failures be part of it?

Regards,
Walid
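
To put rough numbers on the point that more nodes mean more failures, here is a minimal sketch. It assumes that nodes fail independently at a constant rate and that a single node failure kills the whole job (the fail-stop behaviour described above). The function names and all figures (node MTBF, job length, node counts) are hypothetical and chosen only for illustration; they are not measurements from any real cluster.

```python
import math


def system_mtbf_hours(nodes, node_mtbf_hours):
    """Mean time between failures anywhere in the cluster.

    Assumption: independent nodes with a constant failure rate, so the
    per-node rates simply add up and the system MTBF shrinks as 1/nodes.
    """
    return node_mtbf_hours / nodes


def job_failure_probability(nodes, node_mtbf_hours, job_hours):
    """Probability that at least one node fails while the job runs.

    Assumption: one node failure aborts the whole job (fail-stop), and
    time to failure is exponentially distributed.
    """
    combined_rate = nodes / node_mtbf_hours  # failures per hour
    return 1.0 - math.exp(-combined_rate * job_hours)


if __name__ == "__main__":
    NODE_MTBF_HOURS = 50_000  # hypothetical per-node MTBF
    JOB_HOURS = 24            # hypothetical job length
    for nodes in (16, 128, 1024):
        mtbf = system_mtbf_hours(nodes, NODE_MTBF_HOURS)
        p_fail = job_failure_probability(nodes, NODE_MTBF_HOURS, JOB_HOURS)
        print(f"{nodes:5d} nodes: system MTBF {mtbf:8.1f} h, "
              f"P(job fails) {p_fail:.3f}")
```

Under these assumptions the chance of losing a job grows quickly with node count, which is why the same hardware that looks reliable for small jobs starts to hurt once users ask for as many nodes as they can get.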
