[Beowulf] Problems with a JS21 - Ah, the networking...
Patrick Geoffray patrick at myri.com
Mon Oct 1 06:35:38 PDT 2007
Hi Ivan,
Ivan Paganini wrote:
> The myrinet connection was working right, but sometimes a user program
> just got stuck - one of the processes was sleeping, and all others
> were running. Then, the program hangs. Investigating this further,
Unless you are using blocking receives ("--mx-recv blocking" or
"--mx-recv hybrid"), the default mode is polling. So, a process will
only sleep if it is still in the spawning phase (in MPI_Init) or if it's
blocking on something outside MPI (like disk IO).
> overheat. mpirun.ch_mx -v shows that all the processes are issued ok
> to the nodes, but somehow one (or more) process go to sleep or never
> starts, and all the other processes just hangs. The mx diagnose tools
All processes wait on everybody at spawn time, so if one process never
starts, the rest of the MPI world will wait for it, possibly forever.
The root problem is the process not starting.
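
To illustrate the spawn-time rendezvous described above, here is a rough
sketch in Python. It is not MPICH-MX code: threads stand in for MPI
processes, and the function name spawn_world and the timeout are
assumptions made only so the example terminates. A real MPI_Init has no
such timeout, so the ranks that did start would simply keep waiting.

```python
import threading

def spawn_world(n_procs, missing=(), timeout=0.5):
    """Toy model of a spawn-time rendezvous (hypothetical helper).

    Every rank must check in at the barrier before any rank proceeds.
    Ranks listed in `missing` never start, like a process that was
    never launched on its node.
    """
    barrier = threading.Barrier(n_procs)
    results = {}

    def rank_main(rank):
        try:
            # The timeout exists only to keep this demo from hanging.
            barrier.wait(timeout=timeout)
            results[rank] = "running"
        except threading.BrokenBarrierError:
            # Some rank never arrived, so nobody gets past the barrier.
            results[rank] = "stuck"

    threads = [
        threading.Thread(target=rank_main, args=(rank,))
        for rank in range(n_procs)
        if rank not in missing
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(spawn_world(4))               # every rank checks in
    print(spawn_world(4, missing={2}))  # rank 2 never starts
```

With all four ranks present, every rank gets past the barrier. With rank
2 missing, the three ranks that did start are all reported as stuck,
which is the symptom you would see from the outside.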
The spawning phase in MPICH-MX uses sockets and ssh (or rsh). Usually,
ssh uses native Ethernet, but it could also use IPoM (Ethernet over
Myrinet). Which case is it for you?
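
One way to answer that question is to look at which subnet each node
name resolves to. The sketch below is only an illustration: the two
subnets and the names NETWORKS and classify are made up, so substitute
the ranges your site actually uses for Ethernet and for IPoM.

```python
import ipaddress

# Hypothetical subnets, for illustration only. Replace them with the
# address ranges actually assigned at your site.
NETWORKS = {
    "ethernet": ipaddress.ip_network("192.168.1.0/24"),
    "ipom": ipaddress.ip_network("10.0.0.0/24"),
}

def classify(addr):
    """Return the name of the configured network that contains addr."""
    ip = ipaddress.ip_address(addr)
    for name, net in NETWORKS.items():
        if ip in net:
            return name
    return "unknown"

if __name__ == "__main__":
    for addr in ("192.168.1.17", "10.0.0.5", "172.16.0.1"):
        print(addr, classify(addr))
```

On a real cluster you could pass in the result of
socket.gethostbyname() for each node name in your machine file; that
tells you which network the name, and therefore ssh, would use by
default.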
Patrick
