From buccaneer at rocketmail.com Sun Jul 1 05:55:20 2007
From: buccaneer at rocketmail.com (Buccaneer for Hire.)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] Resources for starting a Beowulf Cluster (NFS Setup?)
In-Reply-To: <468711BE.2090406@gmail.com>
Message-ID: <147529.50197.qm@web30602.mail.mud.yahoo.com>

> What about integrating rsync into the password
> scripts? Fundamentally, I
> don't trust NIS.

From a security standpoint NIS is not secure, and I don't think anyone would tell you differently. On the other hand, you don't normally place a cluster on an unprotected network.

One of the clusters I manage globally is a tiny 34-node HP cluster (SMP dual-core Opteron, though), which is basically a single-user cluster kept running at almost 100% of capacity. There are enough changes happening this morning that rsyncing, parallel copy, etc. just become onerous, so I am turning on NIS today for it.

From hahn at mcmaster.ca Sun Jul 1 12:51:26 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] Resources for starting a Beowulf Cluster (NFS Setup?)
In-Reply-To: <147529.50197.qm@web30602.mail.mud.yahoo.com>
References: <147529.50197.qm@web30602.mail.mud.yahoo.com>
Message-ID:

> don't think anyone would tell you differently. On the
> other hand you don't normally place a cluster on an
> unprotected network.

can NIS actually preserve shadow-ness of passwords? I would never run a cluster that exposed encrypted passwords.

> almost 100% of capacity. There are enough changes
> happening this morning that rsyncing, parallel copy,
> etc just becomes onerous so I am turning on NIS today
> for it.

for small clusters, I would definitely use nfs-root; for larger ones, probably ldap.
having a cron job run rsync frequently isn't a terrible solution, though, especially if there's one canonical node (a login node) where the only changes are made.

From buccaneer at rocketmail.com Sun Jul 1 14:00:12 2007
From: buccaneer at rocketmail.com (Buccaneer for Hire.)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] Resources for starting a Beowulf Cluster (NFS Setup?)
In-Reply-To:
Message-ID: <218444.87901.qm@web30608.mail.mud.yahoo.com>

--- Mark Hahn wrote:

> can NIS actually preserve shadow-ness of passwords?
> I would never run a cluster that exposed encrypted
> passwords.

I don't include root (or other sometimes critical) passwords, and I let nsswitch handle how it works. The cluster is segregated from my normal network, and the login node and monitoring/grid node are the only ones with access to both the cluster nodes and the real world.

> for small clusters, I would definitely use nfs-root;
> for larger ones, probably ldap. having a cron job
> run rsync frequently isn't a terrible solution,
> though, especially if there's one canonical node (a
> login node) where the only changes are made.

Remember, for my $DAYJOB I run commercial clusters where downtime has real implications and a real cost associated with it. I build each node stand-alone, so that if an NFS server drops out there is a good chance the entire cluster won't hit the skids.

We tried using rsync for our kickstart servers and found that when you are pressed for time you don't always remember to rsync manually and then have to wait until the normal rsync time. LDAP takes too much work for a small cluster unless it is part of your normal infrastructure.
From hahn at mcmaster.ca Sun Jul 1 15:14:43 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] interconnects (intel's optical cx4)
Message-ID:

this surprised me:

http://www.intel.com/design/network/products/optical/cables/index.htm

seems like a potentially nice solution to me - any guesses about price? I guess my hope is that these might become somewhat commoditized, and thus provide a 10+ Gb media that avoids a pair of $500 XFP's, and yet gives thinner/lighter/longer cables than CX4. (and from what I read, chances are poor for 10GbaseT to run cool and low-latency enough.)

I especially like the fact that they are agnostic of the difference between 10GE and IB. but they do draw 1.1-1.2W from the connector - is this commonly supported?

thanks, mark hahn.

From gebhardt at hrz.uni-marburg.de Mon Jul 2 07:05:34 2007
From: gebhardt at hrz.uni-marburg.de (Gebhardt Thomas)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] Solved: SATA(?) errors locks up node
In-Reply-To: <200705231114.00231.gebhardt@hrz.uni-marburg.de>
References: <200705231114.00231.gebhardt@hrz.uni-marburg.de>
Message-ID: <200707021605.34849.gebhardt@hrz.uni-marburg.de>

Hello,

thank you all for your advice!

After a Firmware upgrade (->20.06C06) of the SATA disks we had no further incident until now. So I'm pretty sure that we have caught the bug.

Thanks again,
Th. Gebhardt

On Wednesday 23 May 2007 11:13, Gebhardt Thomas wrote:
> we are running a cluster of 57 dual opteron nodes. Once or twice a week
> one of these nodes gets in an error state and can't connect to the
> I/O-subsystem anymore. I need to reboot that node. As far as I can see,
> the problem occurs randomly at any of our nodes, i.e., the MTBF of a single
> node is about 6-12 months.
>
> I still don't know whether this is a problem of the linux kernel sata
> driver, a hardware problem, a flaw of the disk firmware or something else.
> I'm looking for a possibility to track down the problem without
> substantially interfering with the jobs on the cluster.
>
> This is our environment:
> TYAN S3992 motherboard with Serverworks HT1000+2000 chipset.
> 2 DualCore Opteron 2216 HE 2.4GHz, 16GByte Mem
> Western Digital 250GByte SATA disk, WDC WD2500YS-01SHB0, firmware rev. 20.06C03

From rgb at phy.duke.edu Mon Jul 2 08:24:16 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] cold cathode fluorescent backlighting
In-Reply-To: <46844EEF.6040308@bogus.com>
References: <6394934c0705251214r4ec157f6o882f391ec8c0e7d9@mail.gmail.com> <50811.70.22.66.55.1183026598.squirrel@email.powweb.com> <46844EEF.6040308@bogus.com>
Message-ID:

On Thu, 28 Jun 2007, Joel Jaeggli wrote:

> Robert G. Brown wrote:
>
>> So yes, I think that LCDs are, on average, far better for the planet and
>> your pocketbook than CRTs (remember, an 80W power differential can add
>> up to $100's in power savings over the lifetime of a monitor), but not
>> perfect. LEDs, if/when they ever appear (Cree, are you listening?)
>> would almost certainly be better than either in all ways.
>
> led backlit displays are already commercially available (for a year or
> more in some case like cellphone and high-end lcd tv), with the new mac
> be a notable but not first example in a laptop. As lumens/watt continues
> to increase their advantages over ccfl's will continue to grow... At the
> same time direct emissive displays (oled) will eventually challenge lcd
> in most areas where lcd currently challenges other technology.

Ah, I hadn't realized LEDs had made it to the street in real computer displays (as I'm not a Mac person:-). But I'm quite ready to lose the tubes in LCDs...

   rgb

>
>> rgb
>>
>>>
>>> Wikipedia is a good source of keywords for use in further research. I
>>> would not consider Wikipedia an authoritative source of information on
>>> it's own.
>>
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu

From alenzo at mail.rochester.edu Mon Jul 2 12:45:06 2007
From: alenzo at mail.rochester.edu (A Lenzo)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] RE: Beowulf Digest, Vol 41, Issue 2
In-Reply-To: <200707021900.l62J08ph014587@bluewest.scyld.com>
References: <200707021900.l62J08ph014587@bluewest.scyld.com>
Message-ID: <006c01c7bce1$80c50ed0$f6339780@libra.cc.rochester.edu>

Re: NIS

Hello all,

Thank you to everybody who has been helping with my NFS/NIS setup. I am very grateful. Even though NIS may not be extremely secure, it seems a decent solution for my small network which is, BTW, behind a firewall. All I need is a simple solution. I have begun setting up NIS - to get it to work, I changed the yp.conf file to the following:

domain nisbanjosrv server banjosrv
ypserver banjosrv

The problem is that every time I reboot, this file reverts to the following instead:

#generated by /sbin/dhclient-script
ypserver 128.??.??.4

----------------------

Now, I can vi the file and fix it, but of course, I'd rather not have to do it on every reboot. By the way, 128.??.??.4 is another Linux server on my same network that probably uses NIS also. I don't know how or why this is being picked up - but this new server (and the nodes it supports) will ultimately have to be separate even though they will be on the same network.

I tried using chmod 444 on yp.conf in the hope that it would prevent the system from overwriting it, but this has not worked either.

So the question is: how can I prevent yp.conf from being changed on every boot?
Thanks all for any and all help,
Sol

From buccaneer at rocketmail.com Mon Jul 2 13:14:14 2007
From: buccaneer at rocketmail.com (Buccaneer for Hire.)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] RE: Beowulf Digest, Vol 41, Issue 2
In-Reply-To: <006c01c7bce1$80c50ed0$f6339780@libra.cc.rochester.edu>
Message-ID: <175696.69242.qm@web30615.mail.mud.yahoo.com>

option nis-domain "XXXXXXX";

--- A Lenzo wrote:

> Hello all,
>
> Thank you to Everybody who has been helping with my
> NFS/NIS setup. I am
> very grateful. Even though NIS may not be extremely
> secure, it seems a
> decent solution for my small network which is, BTW,
> behind a firewall. All
> I need is a simple solution. I have begun setting
> up NIS - to get it to
> work, I changed the yp.conf file to the following:
>
> domain nisbanjosrv server banjosrv
> ypserver banjosrv
>
> The problem is that every time I reboot, this file
> reverts to the following
> code instead:
>
> #generated by /sbin/dhclient-script
> ypserver 128.??.??.4
>
> ----------------------
> Now, I can vi the file and fix it, but of course,
> I'd rather not have to do
> it on every reboot. By the way, 128.??.??.4 is
> another Linux server on my
> same network that probably uses NIS also. Don't know
> how or why this is
> being picked up - but this new server (and the nodes
> it supports) will
> ultimately have to be separate even though they will
> be on the same network.
>
> I tried using chmod 444 on yp.conf in hopes that it
> would prevent the system
> from overwriting it, but this has also not worked.
>
> So the question is: how can I prevent yp.conf from
> being changed on every
> boot?
>
> Thanks all for any and all help,
> Sol

From cousins at umit.maine.edu Tue Jul 3 08:59:54 2007
From: cousins at umit.maine.edu (Steve Cousins)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] 10 GbE vs. Trunked 1 GbE performance?
Message-ID:

We are purchasing a NAS NFS server for our cluster and I'm wondering if a 10 GbE card would give us twice the performance of, say, 4 trunked 1 GbE lines. Given that the NAS itself has the performance to drive this, do any of you have real-world numbers comparing this sort of thing?

Thanks,

Steve

--
______________________________________________________________________
Steve Cousins, Ocean Modeling Group Email: cousins@umit.maine.edu
Marine Sciences, 452 Aubert Hall http://rocky.umeoce.maine.edu
Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302

From jack at crepinc.com Tue Jul 3 14:07:14 2007
From: jack at crepinc.com (Jack C)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] 10 GbE vs. Trunked 1 GbE performance?
In-Reply-To:
References:
Message-ID: <2ad0f9f60707031407w45d56bf2v18c2e098c08c137@mail.gmail.com>

Steve,

Depends on your app. Trunking just allows alternate paths for data: you're not going to be able to push 4 Gb/s over a single link on that bond. Rather, if you have multiple connections all firing data around, you can approach that bandwidth of 4 Gb/s.
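A minimal sketch of that point, assuming the bond assigns each source/destination pair to one link by hashing. The CRC32 hash, the host names and the function names below are illustrative assumptions, not the policy of any particular bonding driver or switch:

```python
import zlib

LINKS = 4         # 1 GbE links in the trunk, as in Steve's question
LINK_GBPS = 1.0   # nominal capacity of each link


def pick_link(src, dst, links=LINKS):
    """Assign a (source, destination) pair to one link.

    CRC32 is only a stand-in for whatever hash a real bond uses.
    """
    return zlib.crc32(f"{src}->{dst}".encode()) % links


def max_trunk_gbps(flows, links=LINKS, link_gbps=LINK_GBPS):
    """Upper bound on aggregate bandwidth for a set of flows."""
    used_links = {pick_link(src, dst, links) for src, dst in flows}
    return len(used_links) * link_gbps


# One big transfer between two hosts never leaves its link.
single = [("nas", "node01")]
# Many clients hitting the NAS at once can spread over the trunk.
many = [("nas", f"node{i:02d}") for i in range(64)]

print(max_trunk_gbps(single))  # bounded by one link
print(max_trunk_gbps(many))    # at most LINKS * LINK_GBPS
```

Under these assumptions a single flow is capped at one link's rate no matter how many links are trunked, while a 10 GbE port has no such per-flow cap.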
So: if you're pushing huge arrays of data around not very often, 10 GbE will be much faster, but if you're throwing a zillion smaller messages to lots of different hosts at once, you will definitely get some usage out of that trunk.

At least, this is how I understand it.

-Jack Carrozzo

On 7/3/07, Steve Cousins wrote:
>
>
> We are purchasing a NAS NFS server for our cluster and I'm wondering if a
> 10 GbE card would give us twice the performance than say 4 trunked 1 GbE
> lines. Given that the NAS itself has the performance to drive this, do
> any of you have real-world numbers comparing this sort of thing?
>
> Thanks,
>
> Steve
>

From mathog at caltech.edu Thu Jul 5 16:16:24 2007
From: mathog at caltech.edu (David Mathog)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] static pvm library won't link anymore
Message-ID:

In the land of dominoes...

The NCBI added another character or two to the allowed list of letters that can be stored in BLAST databases

which

broke my software that interfaces with those databases

which

broke every program I use that reads sequence data through that interface.

That sort of change happens once or twice a year and so is par for the course.
What is unusual this time is that when I tried to rebuild the first one of these PVM based applications (hmmer 2.3.2 package by Sean Eddy) the linker was very unhappy with the static pvm library:

/usr/common/pvm3/lib/LINUX/libpvm3.a(lpvmpack.o): In function `pvm_vpackf':
lpvmpack.c:(.text+0x36d4): undefined reference to `__ctype_b'
lpvmpack.c:(.text+0x36f8): undefined reference to `__ctype_b'
lpvmpack.c:(.text+0x3728): undefined reference to `__ctype_b'
lpvmpack.c:(.text+0x3747): undefined reference to `__ctype_b'
/usr/common/pvm3/lib/LINUX/libpvm3.a(lpvmpack.o): In function `pvm_vunpackf':
lpvmpack.c:(.text+0x3bb8): undefined reference to `__ctype_b'
/usr/common/pvm3/lib/LINUX/libpvm3.a(lpvmpack.o):lpvmpack.c:(.text+0x3bd7): more undefined references to `__ctype_b' follow
collect2: ld returned 1 exit status
make: *** [afetch] Error 1

In the spirit of leaving well enough alone the hmm applications have not been modified in 3 years, and libpvm3.a has not been touched since Oct 2002. Apparently that library must now be rebuilt. The source was PVM 3.4.4, it seems in the last 5 years they've released 3.4.5, so I guess it's time to update to that.

Anybody else seen this particular PVM/linker issue? If so, is there anything else that needs to be done beyond rebuilding PVM (applications and library)?

Thanks,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From hahn at mcmaster.ca Thu Jul 5 16:40:24 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] static pvm library won't link anymore
In-Reply-To:
References:
Message-ID:

> /usr/common/pvm3/lib/LINUX/libpvm3.a(lpvmpack.o): In function `pvm_vpackf':
> lpvmpack.c:(.text+0x36d4): undefined reference to `__ctype_b'

I'm guessing that's an array that maps a byte/char to ctype flags.
this is an internal bit of implementation detail to libc and if you run into a problem like this, it implies that there's some disparity between the libc headers and libc library (often implicitly added by cc when linking.) or, in this case, the former version assumed a different libc than you currently have installed...

> In the spirit of leaving well enough alone the hmm applications have not
> been modified in 3 years, and libpvm3.a has not been
> touched since Oct 2002. Apparently that library must now be rebuilt.

you can probably find compat libc's that define the symbol.

> The source was PVM 3.4.4, it seems in the last 5 years they've released

rebuilding from source will presumably use your current libc's headers and all will be well...

> If so, is there anything else that needs to be done beyond
> rebuilding PVM (applications and library)?

probably solvable by lib version hackery as well...

From landman at scalableinformatics.com Thu Jul 5 16:58:25 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] static pvm library won't link anymore
In-Reply-To:
References:
Message-ID: <468D85A1.6080106@scalableinformatics.com>

David Mathog wrote:
> In the land of dominoes...
>
> The NCBI added another character or two to the allowed
> list of letters that can be stored in BLAST databases

Yeah ... breaks my parsers every now and then ...

>
> which
>
> broke my software that interfaces with those databases

:)

>
> which
>
> broke every program I use that reads sequence data through that interface.

You should see what they did to the code itself. Just go try to build it in a non-GCC linux environment on x86_64. It will complain, badly. I have patches for that in my latest RPMs (http://downloads.scalableinformatics.com/downloads/ncbi/). Annoying.

> That sort of change happens once or twice a year and so is par for the

Progress ...

> course.
> What is unusual this time is that when I tried to rebuild the
> first one of these PVM based applications (hmmer 2.3.2 package by Sean
> Eddy) the linker was very unhappy with the static pvm library:
>
> /usr/common/pvm3/lib/LINUX/libpvm3.a(lpvmpack.o): In function `pvm_vpackf':
> lpvmpack.c:(.text+0x36d4): undefined reference to `__ctype_b'
> lpvmpack.c:(.text+0x36f8): undefined reference to `__ctype_b'
> lpvmpack.c:(.text+0x3728): undefined reference to `__ctype_b'
> lpvmpack.c:(.text+0x3747): undefined reference to `__ctype_b'
> /usr/common/pvm3/lib/LINUX/libpvm3.a(lpvmpack.o): In function
> `pvm_vunpackf':
> lpvmpack.c:(.text+0x3bb8): undefined reference to `__ctype_b'
> /usr/common/pvm3/lib/LINUX/libpvm3.a(lpvmpack.o):lpvmpack.c:(.text+0x3bd7):
> more undefined references to `__ctype_b' follow
> collect2: ld returned 1 exit status
> make: *** [afetch] Error 1
>
> In the spirit of leaving well enough alone the hmm applications have not
> been modified in 3 years, and libpvm3.a has not been

Been modified by some :) to make them go faster. Also there is a nice MPI version out now (no PVM vs MPI here, just pointing it out) at http://code.google.com/p/mpihmmer/ . Dev team are good guys :)

> touched since Oct 2002. Apparently that library must now be rebuilt.
> The source was PVM 3.4.4, it seems in the last 5 years they've released
> 3.4.5, so I guess it's time to update to that.
>
> Anybody else seen this particular PVM/linker issue?

I think this could be due to some libs being built with older glibc/gcc bits. Library mismatch is best guess.

> If so, is there anything else that needs to be done beyond
> rebuilding PVM (applications and library)?

Possibly any dependencies.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615

From mathog at caltech.edu Fri Jul 6 09:55:06 2007
From: mathog at caltech.edu (David Mathog)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] static pvm library won't link anymore
Message-ID:

> If so, is there anything else that needs to be done beyond
> rebuilding PVM (applications and library)?

Silly me, I assumed that since it used to be just "make" and it would build it would still be that easy.

Unfortunately 3.4.5 throws about a zillion compilation errors on Mandriva 2007.1 with gcc 4.1.2. It's missing all sorts of includes (unistd.h, stdlib.h) and has some funky extern declarations that this recent gcc is mighty unhappy with. Before I reinvent the wheel fixing all these problems, do any of you have a copy of 3.4.5 with these changes already applied?

Thanks,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From hahn at mcmaster.ca Fri Jul 6 10:17:44 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] static pvm library won't link anymore
In-Reply-To:
References:
Message-ID:

> errors on Mandriva 2007.1 with gcc 4.1.2. It's missing all sorts
> of includes (unistd.h, stdlib.h) and has some funky extern declarations

afaik, this is only possible if you are somehow missing some basic package like glibc-headers. (you can find this out on an rpm system by "rpm -qf /usr/include/stdlib.h".)

From mathog at caltech.edu Fri Jul 6 10:24:08 2007
From: mathog at caltech.edu (David Mathog)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] static pvm library won't link anymore
Message-ID:

> > errors on Mandriva 2007.1 with gcc 4.1.2.
> > It's missing all sorts
> > of includes (unistd.h, stdlib.h) and has some funky extern declarations
>
> afaik, this is only possible if you are somehow missing
> some basic package like glibc-headers. (you can find this out
> on an rpm system by "rpm -qf /usr/include/stdlib.h".)
>

It isn't that, it's that 3.4.5 (and 3.4.4 I assume) uses forms of C that the gcc 4.1.x compilers are not happy with. Here's one discussion of it:

http://groups.google.com/group/comp.parallel.pvm/browse_thread/thread/34f97af6f7998375/41fd271b8443e8ea?lnk=st&q=%22diff+-r+-b+-B+pvm3%2Fconf%2FLINUX.def+pvm3orig%2Fconf%2FLINUX.def%22&rnum=1&hl=en#41fd271b8443e8ea

I've been working through the patches at the end of that thread, but there still seem to be issues.

Regards,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From rgb at phy.duke.edu Fri Jul 6 11:24:01 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] static pvm library won't link anymore
In-Reply-To:
References:
Message-ID:

On Fri, 6 Jul 2007, David Mathog wrote:

>> If so, is there anything else that needs to be done beyond
>> rebuilding PVM (applications and library)?
>
> Silly me, I assumed that since it used to be just "make" and it would
> build it would still be that easy.
>
> Unfortunately 3.4.5 throws about a zillion compilation
> errors on Mandriva 2007.1 with gcc 4.1.2. It's missing all sorts
> of includes (unistd.h, stdlib.h) and has some funky extern declarations
> that this recent gcc is mighty unhappy with. Before I reinvent the wheel
> fixing all these problems, do any of you have a copy of 3.4.5 with
> these changes already applied?

If it can't find stdlib.h, there is something SERIOUSLY wrong -- something so wrong that it is probably trivial to fix. However, there are two other things to try before messing with it.

One is that 3.4.5 is in FC core 6 (and doubtless 7), so there exist source RPMs for it.
Yes, source RPMs don't always build across distros, but they DO have to be built using autoconf instead of aimk (according to FC spec, anyway:-) and so there is a decent chance that they will.

The second is to open up the source rpm and grab ITS tarball as a semi-sane starting point instead of the OTC distribution.

To me the one thing that REALLY needs to happen in pvm (now that it is, after all, 2007) is for aimk to GO AWAY! In SGE too, for that matter. It's just plain crazy to still be using it -- it was a tool designed to solve the same problem that autoconf solves, about 18 years ago.

(A third is to use FC X instead of Mandriva in the first place -- nowadays one can just do "yum install pvm pvm-gui" and be done...)

   rgb

>
> Thanks,
>
> David Mathog
> mathog@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
>

--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu

From mathog at caltech.edu Fri Jul 6 14:20:12 2007
From: mathog at caltech.edu (David Mathog)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] Re: static pvm library won't link anymore
Message-ID:

Well it builds cleanly now, but it doesn't run. Traced the failure down to the exec in /usr/common/pvm3/lib/LINUX/pvmd.
For the old version it does this:

% /usr/common/pvm3_3.4.4/lib/LINUX/pvmd3 -s -d0x0 -nmonkey01.cluster 1 c0a801dc:84c2 4080 30 c0a80101:0000
ddpro<2316> arch ip mtu<4080> dsig<4229185>

(as it should be)

for the new one it does this:

% /usr/common/pvm3_3.4.5/lib/LINUX/pvmd3 -s -d0x0 -nmonkey01.cluster 1 c0a801dc:84c2 4080 30 c0a80101:0000
Floating point exception (core dumped)

Sigh,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From mathog at caltech.edu Fri Jul 6 15:52:21 2007
From: mathog at caltech.edu (David Mathog)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] Re: static pvm library won't link anymore
Message-ID:

It still doesn't work, pvmd blows up on Linux. Here are the tiny number of changes I made to the vanilla pvm 3.4.5 distribution, based on the changes in the Mandriva 2007.1 and FC6 src rpms (mostly). It's only 18 modified files and mostly one or two lines in each, and most of those are just adding and the like:

http://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/pvm_3.4.5_diffs.txt

This gives a clean compile on both Linux (gcc 4.1.2 or 4.1.1) and Solaris (gcc 3.4.2). It seems to run ok on Solaris but dumps core on the 32bit X86 machines. The old one, 3.4.4, still works (aside from the blast library issue and the fact that I can't link anything to it, which started this whole mess), so I fell back to that for now.

Very frustrating when a package like this won't build from source.

Regards,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From laytonjb at charter.net Sat Jul 7 08:41:01 2007
From: laytonjb at charter.net (Jeffrey B. Layton)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] Core 2 (or Woodcrest) Performance on HPL (Top500)
Message-ID: <468FB40D.3050308@charter.net>

Good morning,

Has anyone run HPL (Top500 benchmark) on a single node that has the Core 2 Duo or the Quad-core on-board? I'm interested in the performance numbers you obtained.
I don't need the performance for more than one node. Ideally I would like the numbers for just one core, but I can make do with 2 cores for the Core 2 and 4 cores for the quad-core.

Thanks!

Jeff

From kilian at stanford.edu Sat Jul 7 13:31:17 2007
From: kilian at stanford.edu (Kilian CAVALOTTI)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] Core 2 (or Woodcrest) Performance on HPL (Top500)
In-Reply-To: <468FB40D.3050308@charter.net>
References: <468FB40D.3050308@charter.net>
Message-ID: <200707071331.18037.kilian@stanford.edu>

On Saturday 07 July 2007 08:41:01 Jeffrey B. Layton wrote:
> Good morning,
>
> Has anyone run HPL (Top500 benchmark) on a single node
> that has the Core 2 Duo or the Quad-core on-board? I'm
> interested in the performance numbers you obtained. I don't
> need the performance for more than one node. Ideally I would
> like the numbers for just one core, but I can make do with
> 2 cores for the Core 2 and 4 cores for the quad-core.

From what I recall, I got roughly 7.5 Gflops per core on a Clovertown (quad-core 2.66GHz, E5345), and 55 Gflops per node (dual socket, 8 cores, 16GB).

I can re-launch some runs if you like.

Cheers,
--
Kilian

From pub at acnlab.csie.ncu.edu.tw Fri Jul 6 00:37:52 2007
From: pub at acnlab.csie.ncu.edu.tw (ACNLab)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] [CFP] Deadline Extension: P2P-NVE 2007
Message-ID: <468DF150.6070205@acnlab.csie.ncu.edu.tw>

Dear Colleagues,

Please be advised that the deadline of P2P-NVE 2007 has been extended to July 24, 2007. We hope that you can submit your paper early. Attached is the CFP, where you can find useful information about paper submission.
Please note that the topics of interest are also extended to include the following items:

* P2P systems and infrastructures
* Applications of P2P systems
* Performance evaluation of P2P systems
* Trust and security issues in P2P systems
* Network support for P2P systems
* Fault tolerance in P2P systems
* Efficient P2P resource lookup and sharing
* Distributed Hash Tables (DHTs) and related issues

Best wishes,

Jehn-Ruey Jiang
Program Chair
P2P-NVE 2007

================ CALL FOR PAPERS ================

P2P-NVE 2007
International Workshop on Peer-to-Peer Networked Virtual Environments 2007

in conjunction with

The 13th International Conference on Parallel and Distributed Systems (ICPADS 2007)
December 5 -7, 2007
National Tsing Hua University, Hsinchu, Taiwan, R.O.C.
http://acnlab.csie.ncu.edu.tw/P2PNVE2007.

=================================================

About P2P-NVE 2007

The rapid growth and popularity of networked virtual environments (NVEs) such as Massively Multiplayer Online Games (MMOGs) in recent years have spawned a series of research interests in constructing such large-scale virtual environments. To increase scalability and decrease the cost of management and deployment, more and more studies propose using peer-to-peer (P2P) architectures to construct large-scale NVEs for games, multimedia virtual worlds and other applications. The goal of such research is to support an Earth-scale virtual environment or to make hosting virtual worlds more affordable than existing client-server approaches. However, existing solutions for consistency control, persistent data storage, multimedia data dissemination, and cheat-prevention may not be straightforwardly adapted to such new environments; novel ideas and designs are thus needed to realize the potential of P2P-based NVEs.
The theme of this workshop is to solicit original and previously unpublished new ideas on the construction and realization of P2P-based NVEs, with a focus on facilitating discussions and idea exchanges between academics and practitioners. All papers accepted for the workshop will be included in the IEEE Xplore Digital Library and in the proceedings published by the IEEE Computer Society.

Topics of interest include, but are not limited to:

* P2P systems and infrastructures
* Applications of P2P systems
* Performance evaluation of P2P systems
* Trust and security issues in P2P systems
* Network support for P2P systems
* Fault tolerance in P2P systems
* Efficient P2P resource lookup and sharing
* Distributed Hash Tables (DHTs) and related issues
* Constructions of P2P overlays for NVEs
* Multicast for P2P NVEs
* P2P NVE content distribution
* 3D streaming for P2P NVEs
* Voice communication on P2P NVEs
* Persistent storage for P2P NVEs
* Security and cheat-prevention mechanisms for P2P games
* Data structures and queries for P2P NVEs
* Consistency control for P2P NVEs
* Design considerations for P2P NVEs
* Prototypes of P2P NVEs
* P2P control for mobile NVEs
* P2P NVE applications on mobile devices

Important Dates

Paper Submission: July 24, 2007 (Extended)
Author Notification: August 24, 2007
Camera Ready Copy Due: September 2, 2007

Paper Submission

Authors are invited to submit an electronic version of original, unpublished manuscripts, not to exceed 8 double-columned, single-spaced pages, to the web site http://acnlab.csie.ncu.edu.tw/P2PNVE2007. Submitted papers should be in PDF format in accordance with IEEE Computer Society guidelines (ftp://pubftp.computer.org/press/outgoing/proceedings). All submitted papers will be refereed by reviewers in terms of originality, contribution, correctness, and presentation.
From masud62 at yahoo.com Fri Jul 6 12:04:38 2007
From: masud62 at yahoo.com (Mostafiz)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] Cluster Diagram of 500 PC
In-Reply-To:
Message-ID: <745845.71782.qm@web52310.mail.re2.yahoo.com>

Dear Sir,

We want to set up a cluster of 500 PCs in the following configuration:
Intel Dual Core 1.88 GHz 2MB Cache
4 GB DDR-2 RAM
2 X 80 GB

How do we connect these computers, and how many will be defined as master?
How do we connect them, and using how many switches?
How will power connections be provided?
How do we start and stop all nodes using a remote computer?
How do we ensure fault-tolerant network connectivity?
We want to use Windows XP or Windows 2003 as the OS. For better performance, CentOS or Linux RHL may be selected.
Please advise us and help me by providing a network diagram of the system.

Regards.

Mostafiz.
Lt Col
IT directorate
Bangladesh
Dhaka-1206
Tel: 880-2-8752348
Cell: 880-1711431880

From john.hearns at streamline-computing.com Sun Jul 8 10:41:24 2007
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] Cluster Diagram of 500 PC
In-Reply-To: <745845.71782.qm@web52310.mail.re2.yahoo.com>
References: <745845.71782.qm@web52310.mail.re2.yahoo.com>
Message-ID: <469121C4.5050802@streamline-computing.com>

Mostafiz wrote:
>
> Dear Sir,
>
> We want to set up a cluster of 500 PCs in the following configuration:
> Intel Dual Core 1.88 GHz 2MB Cache
> 4 GB DDR-2 RAM
> 2 X 80 GB
>
> How do we connect these computers, and how many will be defined as
> master?
>

Mostafiz, ourselves and many other companies would normally give you detailed answers to questions such as these, including detailed diagrams, in a response to your requirements, either an open tender or a requirement. We would only do this after a discussion with you regarding applications and other requirements. I suggest you contact a range of vendors directly.

John Hearns

From rgb at phy.duke.edu Sun Jul 8 15:18:39 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] Cluster Diagram of 500 PC
In-Reply-To: <745845.71782.qm@web52310.mail.re2.yahoo.com>
References: <745845.71782.qm@web52310.mail.re2.yahoo.com>
Message-ID:

On Fri, 6 Jul 2007, Mostafiz wrote:

> Dear Sir,
>
> We want to set up a cluster of 500 PCs in the following configuration:
> Intel Dual Core 1.88 GHz 2MB Cache
> 4 GB DDR-2 RAM
> 2 X 80 GB
>
> How do we connect these computers, and how many will be defined as master?
> How do we connect them, and using how many switches?
> How will power connections be provided?
> How do we start and stop all nodes using a remote computer?
> How do we ensure fault-tolerant network connectivity?
> We want to use Windows XP or Windows 2003 as the OS. For better performance, CentOS or Linux RHL may be selected.
> Please advise us and help me by providing a network diagram of the system.

Dear Mostafiz,

There is a free online book on cluster engineering here:

http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php

that you might want to read over to get yourself familiar with the concepts and design constraints.

Second, the standard answer to all of your questions above is:

"First think about your application(s), THEN engineer your cluster."

In other words, don't pick your node hardware first.
The correct way to engineer a cost-benefit optimal cluster that does the most work for the least up-front cost and long term administrative expense is to think about the computations you want to perform, and then decide on the network and compute hardware simultaneously. Once the application mix is understood, most of your decisions above are pretty obvious. If your application is "real parallel software" using MPI and with a large communication requirement between nodes, you will need to invest much more heavily in network than in nodes -- fewer nodes and an expensive network will get more work done than more nodes and a cheap network, as your nodes will sit idle waiting on the network. If your application is a lot of very simple tasks that can run completely independently and don't need a lot of access to disk or other nodes and aren't all linear algebra (and hence memory intensive) then you want to minimize investment in the network and maximize the compute capacity of your nodes (which might or might not be achieved with Intel CPUs, depending on the code). The main thing is to COMPLETELY UNDERSTAND the problem(s) you wish to run and how they will fit onto the hardware you will use before making any definitive choices regarding that hardware. Incidentally, I would personally strongly advise you against building a Windows cluster. First of all, it will cost vastly more money as you add several hundred dollars in completely unnecessary software cost per node -- for 500 nodes at $200 each for XP-Pro that's an extra $100,000 right there, and even at $20 per node $10,000 is still far too much. Linux is free and is VASTLY more efficient. It is also far more flexible regarding cluster configuration. Finally, nearly everyone on this list runs and builds linux clusters, for good reasons. I occasionally struggle with dealing with Windows in SMALLISH client/server LAN environments, and believe me, it is nightmarish compared to linux. Good luck. 
You can probably find consultants on this list to help you with the above design process at a fairly reasonable price, at least if your project isn't classified (so that you can't actually TELL us what you're going to use the cluster for...;-). In the latter case -- which I imagine isn't that unlikely -- then you'll simply have to develop the local expertise to answer the right questions on your own to guide your design process. My book should help -- and the list will generally answer sufficiently specific questions that HAVE an answer... rgb > > Regards. > > Mostafiz. > Lt Col > IT directorate > Bangladesh > Dhaka-1206 > Tel: 880-2-8752348 > Cell: 880-1711431880 > > > > --------------------------------- > Boardwalk for $500? In 2007? Ha! > Play Monopoly Here and Now (it's updated for today's economy) at Yahoo! Games. -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From roberto.ammendola at roma2.infn.it Mon Jul 9 08:05:18 2007 From: roberto.ammendola at roma2.infn.it (Roberto Ammendola) Date: Sun Jul 27 01:06:10 2008 Subject: [Beowulf] Core 2 (or Woodcrest) Performance on HPL (Top500) In-Reply-To: <468FB40D.3050308@charter.net> References: <468FB40D.3050308@charter.net> Message-ID: <46924EAE.209@roma2.infn.it> Jeffrey B. Layton wrote: > Good morning, > > Has anyone run HPL (Top500 benchmark) on a single node > that has the Core 2 Duo or the Quad-core on-board? I'm > interested in the performance numbers you obtained. I got around 36 GFlops on a dual dual-core 5160 (Woodcrest 3.0 GHz) with 8 GB RAM. 
regards Roberto From kohlja at ornl.gov Mon Jul 9 09:54:32 2007 From: kohlja at ornl.gov (kohlja@ornl.gov) Date: Sun Jul 27 01:06:10 2008 Subject: [Beowulf] Re: More on PVM's afterlife :-) In-Reply-To: <4692192E.8090301@ornl.gov> References: <4692192E.8090301@ornl.gov> Message-ID: <20070709165432.GA26509@neo.csm.ornl.gov> Hey David/Beowulf Gang, Sorry for your PVM woes, but please try the latest "patch pre-release" version of PVM, 3.4.5+7 available at: http://www.csm.ornl.gov/~kohl/PVM/pvm3.4.5+7.tar.gz This likely fixes many recent problems, especially on 64-bit arches. (Yeah, I know, another full release would be handy sometime... :-}) In the future, please don't hesitate to email: pvm@msr.csm.ornl.gov with any PVM-related questions or concerns. We're unfunded but not dead yet... :) All the Best, Jim On Mon, Jul 09, 2007 at 07:17:02AM -0400, Wael R. Elwasif wrote: > ---------------------------------------------------------------------- > Message: 1 > Date: Fri, 06 Jul 2007 14:20:12 -0700 > From: "David Mathog" > Subject: [Beowulf] Re: static pvm library won't link anymore > To: beowulf@beowulf.org > Message-ID: > Content-Type: text/plain; charset=iso-8859-1 > Well it builds cleanly now, but it doesn't run. Traced the failure > down to the exec in /usr/common/pvm3/lib/LINUX/pvmd. 
For the old > version it does this: > % /usr/common/pvm3_3.4.4/lib/LINUX/pvmd3 -s -d0x0 -nmonkey01.cluster 1 > c0a801dc:84c2 4080 30 c0a80101:0000 > ddpro<2316> arch ip mtu<4080> dsig<4229185> > (as it should be) > for the new one it does this: > % /usr/common/pvm3_3.4.5/lib/LINUX/pvmd3 -s -d0x0 -nmonkey01.cluster 1 > c0a801dc:84c2 4080 30 c0a80101:0000 > Floating point exception (core dumped) > Sigh, > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > ------------------------------ > Message: 2 > Date: Fri, 06 Jul 2007 15:52:21 -0700 > From: "David Mathog" > Subject: [Beowulf] Re: static pvm library won't link anymore > To: beowulf@beowulf.org > Message-ID: > Content-Type: text/plain; charset=iso-8859-1 > It still doesn't work, pvmd blows up on Linux. > Here are the tiny number of changes I made to the vanilla pvm 3.4.5 > distribution, based on the changes in the Mandriva 2007.1 and FC6 > src rpms (mostly). It's only 18 modified files and mostly one or > two lines in each, and most of those are just adding and > the like: > http://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/pvm_3.4.5_diffs.txt > This gives a clean compile on both Linux (gcc 4.1.2 or 4.1.1) > and Solaris (gcc 3.4.2). It seems to run ok on Solaris but dumps > core on the 32bit X86 machines. The old one, 3.4.4, still works > (aside from the blast library issue and the fact I can't link anything > to it. which started this whole mess), so I fell back to that for now. > Very frustrating when a package like this won't build from source. > Regards, > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > ------------------------------ > Message: 3 > Date: Sat, 07 Jul 2007 11:41:01 -0400 > From: "Jeffrey B. 
Layton" > Subject: [Beowulf] Core 2 (or Woodcrest) Performance on HPL (Top500) > To: Beowulf Mailing list > Message-ID: <468FB40D.3050308@charter.net> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > Good morning, > Has anyone run HPL (Top500 benchmark) on a single node > that has the Core 2 Duo or the Quad-core on-board? I'm > interested in the performance numbers you obtained. I don't > need the performance for more than one node. Ideally I would > like the numbers for just one core, but I can make do with > 2 cores for the Core 2 and 4 cores for the quad-core. > Thanks! > Jeff > ------------------------------ > _______________________________________________ > Beowulf mailing list > Beowulf@beowulf.org > http://www.beowulf.org/mailman/listinfo/beowulf > End of Beowulf Digest, Vol 41, Issue 7 > ************************************** > -- > Wael R. Elwasif > Research Staff Member > Network & Cluster Computing Group > Oak Ridge National Laboratory > P.O. Box 2008, Bldg. 5700, MS 6164 > Oak Ridge, TN 37831-6367 > Office : (865)241-0002 > Fax: (865)576-5491 (:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(: James Arthur "Jeeembo" Kohl, Ph.D. "Da Blooos Brathas?! They Oak Ridge National Laboratory still owe you money, Fool!" kohlja@ornl.gov http://www.csm.ornl.gov/~kohl/ Long Live Curtis Blues!!! :):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):) From mathog at caltech.edu Mon Jul 9 11:04:41 2007 From: mathog at caltech.edu (David Mathog) Date: Sun Jul 27 01:06:10 2008 Subject: [Beowulf] Re: static pvm library won't link anymore Message-ID: Turns out that the difficulties getting pvm to run were not just pvm problems. There is a bizarre problem where every program, even "Hello World" built on Mandriva 2007.1 (gcc 4.1.2,2.6.17-14mdv) will not run on Mandriva 2007.0 (gcc 4.1.1, 2.6.19.3). 
Going in the other direction works, as does Mandriva 2006.0 to either, and of course all programs run wherever they were built in the first place. Here's hello refusing to run on 2007.0 under gdb:

(master)
# cat hello.c
#include <stdio.h>
int main(void){
  (void) fprintf(stdout,"HELLO\n");
}
# gcc -g -o hello hello.c
# hello
HELLO
# cp hello /usr/common/bin

(slave)
# gdb /usr/common/tmp/hello
GNU gdb 6.3-8mdv2007.0 (Mandriva Linux release 2007.0)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i586-mandriva-linux-gnu"...Using host libthread_db library "/lib/i686/libthread_db.so.1".

(gdb) run
Starting program: /usr/common/tmp/hello
Failed to read a valid object file image from memory.

Program received signal SIGFPE, Arithmetic exception.
0xb7f8b96f in do_lookup_x (undef_name=0xb7e598d3 "_res", hash=420035, ref=0xb7e52834, result=0xbffef1f0, scope=0xb7f9c838, i=0, version=0xb7f78328, flags=0, skip=0x0, type_class=Variable "type_class" is not available.
) at do-lookup.h:72
72 do-lookup.h: No such file or directory.
in do-lookup.h
(gdb) bt
#0 0xb7f8b96f in do_lookup_x (undef_name=0xb7e598d3 "_res", hash=420035, ref=0xb7e52834, result=0xbffef1f0, scope=0xb7f9c838, i=0, version=0xb7f78328, flags=0, skip=0x0, type_class=Variable "type_class" is not available.
) at do-lookup.h:72
#1 0xb7f8bc87 in _dl_lookup_symbol_x (undef_name=0xb7e598d3 "_res", undef_map=0xb7f78000, ref=0xbffef310, symbol_scope=0xb7f781a8, version=0xb7f78328, type_class=0, flags=0, skip_map=0x0) at dl-lookup.c:233
#2 0xb7f8d263 in _dl_relocate_object (l=Variable "l" is not available.
) at ../sysdeps/i386/dl-machine.h:354 #3 0xb7f8631f in dl_main (phdr=0x8048034, phnum=224, user_entry=0xbffef700) at rtld.c:2235 #4 0xb7f9540e in _dl_sysdep_start (start_argptr=0xbffef760, dl_main=0xb7f85050 ) at ../elf/dl-sysdep.c:239 #5 0xb7f84709 in _dl_start (arg=0xbffef760) at rtld.c:333 #6 0xb7f83847 in _start () at rtld.c:788 (gdb) quit The program is running. Exit anyway? (y or n) y When the master node was upgraded a short time ago from Mandriva 2006.0 to 2007.1 it introduced this problem, but it didn't manifest until the first time a program was built on the master and distributed to the slaves, and that was, unfortunately, pvm, which has enough build issues, that it took a while to determine that this was yet _another_ issue beyond those. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mathog at caltech.edu Mon Jul 9 15:34:36 2007 From: mathog at caltech.edu (David Mathog) Date: Sun Jul 27 01:06:10 2008 Subject: [Beowulf] Re: static pvm library won't link anymore Message-ID: > Turns out that the difficulties getting pvm to run were not just > pvm problems. There is a bizarre problem where every program, even > "Hello World" built on Mandriva 2007.1 (gcc 4.1.2,2.6.17-14mdv) > will not run on Mandriva 2007.0 (gcc 4.1.1, 2.6.19.3). Google with strings from the back trace messages showed that this was a known issue. New distributions such as Mandriva 2007.1 and FC7 default to a new linker mode: --hash-style=gnu Unfortunately ld in older distros has no clue what this is and treats the binary as if it had the older --hash-style=sysv resulting in "floating point exception" (which is a bit of a red herring in terms of figuring out why the program won't run) and a core dump. The newer version of ld still knows about the older hash-style, so old binaries work on the newer systems. 
To build a binary on the newer system that can run on the older one use either: gcc -g -o hello -Wl,--hash-style=sysv hello.c or gcc -g -o hello -Wl,--hash-style=both hello.c Or upgrade every machine in your cluster to a "newer" release. In the long run, this might be the easiest choice. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From weikuan.yu at gmail.com Tue Jul 10 12:52:47 2007 From: weikuan.yu at gmail.com (Weikuan Yu) Date: Sun Jul 27 01:06:10 2008 Subject: [Beowulf] passing f90 module data to C with PGI Message-ID: <4693E38F.6050001@gmail.com> Hi, Anybody has experiences in passing f90 module data to C functions, when using PGI compilers? I am sharing a long list of module parameters from different f90 modules to a C function. It now looks quite awkward with 17 parameters. But I do not want to combine all of them as COMMON blocks though. I am only able to find some information about doing such things (prefixing with module name), using intel compiler or HP fortran 90. But this approach of prefixing variable names with the module name does not seem to apply on my environment with PGI compilers. I am exactly not sure if this is a compiler issue. Thanks, -- Weikuan Yu http://ft.ornl.gov/~wyu From hahn at mcmaster.ca Tue Jul 10 15:29:26 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:10 2008 Subject: [Beowulf] Cluster Diagram of 500 PC In-Reply-To: <745845.71782.qm@web52310.mail.re2.yahoo.com> References: <745845.71782.qm@web52310.mail.re2.yahoo.com> Message-ID: > We want to setup a Cluster of 500 PC in following Configuration: > Intel Duel Core 1.88 GHz 2MB Cache > 4 GB DDR-2 RAM > 2 X 80 GB I'm not saying this configuration is bad, but how did you arrive at it? there are tradeoffs in each of these hardware choices, and those decisions are the ones you can't fix later. (I have especially not often seen 2x80G be a useful cluster node config. 
it's potentially too much for a diskful install, even if you insist on diskfulness. and yet clusters are normally quite automated, so raid1-ing an OS install doesn't make much sense unless you have some specific reasons. finally, if you have disk-intensive applications that can utilize local disks, it would make more sense to use larger disks, since these days 250-320G is pretty much entry-level (one-platter).)

> How do we connect these computers and how many will be defined as master.

you don't necessarily even need a master, but to answer this, you must quantify your workload fairly precisely. with 500 nodes, you might well expect a significant number of users logged in at once, which might incur a significant "support" load (compiling, etc). or perhaps you want to run many, many serial jobs, in which case you'll need to split up your "admin" load across multiple machines (queueing, monitoring, logging.)

> How do we connect using how many switch.

again, depends entirely on your workload. it _could_ be quite reasonable to have a rack of nodes going to one switch, and just one uplink from each rack to a top-level switch. that would clearly optimize the cabling at the expense of serious MPI programs. unless the workload consisted solely of rack-sized MPI programs! large switches of the size you're looking for tend to be expensive; if you compromise (say, single 10G uplink per rack), modular switches can still be used. otoh, maybe a spindly ethernet fabric alongside a fast and flat 512-way myri10G network? all depends on the workload.

> How power connection will be provided.

you want some sort of PDU in the rack with high-current, high-voltage feeds to each rack. dual 30A 220 3-phase is not an unusual design point. obviously, if you can make nodes more efficient, you save money on the power infrastructure, as well as operating costs. for instance, the cluster I sit next to has dual-95W sockets per node, with each node pulling around 300W.
higher-efficiency power supplies might save 30W/node, which would be only 1.1 kW/rack; 65W sockets would save 60W/node - that's starting to be significant. providing consistently cool air saves power too (nodes here have 12 fans that consume up to 10W each!).

> How do we start and stop all nodes using a remote computer.

IPMI is an excellent, portable, well-scriptable interface for control and monitoring. there are some vendor-specific alternatives, as well as cruder mechanisms (controllable PDU's).

> How do we ensure fault tolerant network connectivity.

something like LVS. it's a software thing, thus easy ;)

> We want to use windows XP or windows 2003 as OS. Better performance Centos or Linux RHL may be selected.

don't bother with windows unless you really are a windows guru and also incredibly linux-averse.

> please advise us and help me in providing a network diagram of the system

nodes in racks, leaf switch(s) in racks, uplinking to top-level switch(s). knowing nothing about your workload, that's reasonable.

From siegert at sfu.ca Tue Jul 10 17:11:50 2007
From: siegert at sfu.ca (Martin Siegert)
Date: Sun Jul 27 01:06:10 2008
Subject: [Beowulf] 8GB memeory limit?
Message-ID: <20070711001150.GB616@stikine.ucs.sfu.ca>

Hi,

I am running into a bizarre memory issue: I do not appear to be able to allocate 8GB of memory into a single array:

===================================================
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]){
   long int i, n;
   long int *m;

   n = 1024*1024*1024*sizeof(long int);
   for (i = 1; i <= n; i*=2) {
      m = (long *) malloc(i);
      if (m == NULL) {
         fprintf(stderr, "allocation of %li bytes failed.\n", i);
         exit(-1);
      }
      free(m);
   }
}
====================================================

# gcc -m64 int_malloc.c
# ./a.out
allocation of 8589934592 bytes failed.

This is with a 2.6.5 kernel (SLES 9). If I compile the same program under kernel 2.6.16.27 (openSuSE 10.2), the program completes without problem.
Under either OS I can allocate, e.g., 5 arrays of 4GB each within the same program without problem. Where does this limit of 8GB for a single array come from? Is it in the kernel? If yes, can it be changed, e.g., through a sysctl? Which one? Cheers, Martin -- Martin Siegert Head, HPC@SFU WestGrid Site Lead Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert@sfu.ca Canada V5A 1S6 From hahn at mcmaster.ca Tue Jul 10 22:03:51 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:10 2008 Subject: [Beowulf] 8GB memeory limit? In-Reply-To: <20070711001150.GB616@stikine.ucs.sfu.ca> References: <20070711001150.GB616@stikine.ucs.sfu.ca> Message-ID: > # gcc -m64 int_malloc.c try -mcmodel=medium. basically, mcmodel defaults to small in most cases, which means 32b data relocations. medium bumps single-object addressability up to 64b. neither of these affect the basic addressing model (int=32b, long=64b, pointer=64). > This is with a 2.6.5 kernel (SLES 9). not a kernel issue, but rather the width of address calculations for a single addressable object (as decided by the compiler.) From lindahl at pbm.com Wed Jul 11 01:17:35 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] 8GB memeory limit? In-Reply-To: References: <20070711001150.GB616@stikine.ucs.sfu.ca> Message-ID: <20070711081735.GA6392@bx9.net> On Wed, Jul 11, 2007 at 01:03:51AM -0400, Mark Hahn wrote: > try -mcmodel=medium. ... which doesn't affect malloc() at all. It only affects static data. -- greg From hahn at mcmaster.ca Wed Jul 11 05:37:34 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] 8GB memeory limit? In-Reply-To: <20070711081735.GA6392@bx9.net> References: <20070711001150.GB616@stikine.ucs.sfu.ca> <20070711081735.GA6392@bx9.net> Message-ID: >> try -mcmodel=medium. > > ... which doesn't affect malloc() at all. 
It only affects static data. oops, duh! failure of attention. I'd have to guess the issue is a ulimit. malloc is mapped (for large allocations) to an anonymous mmap normally. it might be interesting to strace the config that fails, to verify this. there's also the kernel's overcommit setting (/proc/sys/vm/overcommit_*) which could cause this effect. -mark From hahn at mcmaster.ca Wed Jul 11 06:26:34 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Cluster Diagram of 500 PC In-Reply-To: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> References: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> Message-ID: > By experience, some IPMI hardware implementations are not sufficient to > ensure efficient reboot, for example, we had some issues rebooting the > nodes when they were in the PXE boot stage, or blocked in grub with a > missing kernel, or worse: when running a freeBSD system. that is most peculiar - why would any activity on the host affect the IPMI in the first place? oh - was this IPMI one of the ones that shares a NIC/port with the host? I can easily imagine that would cause some possible issues. > Many other solutions are OK: they tend to be scriptable though a telnet + > expect script, so it's OK as long as it can reboot all your nodes in any > situation. I guess I'd be surprised if the protocol to the BMC made any difference - IPMI or telnet. but I'm often surprised ;) From hahn at mcmaster.ca Wed Jul 11 09:18:10 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Cluster Diagram of 500 PC In-Reply-To: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> References: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> Message-ID: there are now several switches on the market that have, for instance, 48x Gb ports and 2-4 10G ports. having the higher-speed ports is attractive both to attach servers, but also to build a larger switch fabric. 
in particular, if you have 5x of these switches, you could plug their 10G ports into each other and entirely avoid a top-level switch. it "only" gets you to 240 nodes, but is also fairly cheap. my question is: do switches these days have smart protocols for mapping and routing in such a configuration? I know that the original spanning tree protocol would reduce such a config to a single tree, turning off all the redundant links, effectively creating a hotspot and ~doubling hop-counts. maybe with vlans this could be avoided? (incidentally, I notice HP is bragging about latencies <2.1 us for 10G and <3.7 for 1G, as well as supporting jumbo packets even on fairly economy-level switches...) From lindahl at pbm.com Wed Jul 11 09:45:41 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Cluster Diagram of 500 PC In-Reply-To: References: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> Message-ID: <20070711164541.GB32524@bx9.net> On Wed, Jul 11, 2007 at 12:18:10PM -0400, Mark Hahn wrote: > my question is: do switches these days have smart protocols for > mapping and routing in such a configuration? I know that the original > spanning tree protocol would reduce such a config to a single tree, > turning off all the redundant links, effectively creating a hotspot > and ~doubling hop-counts. maybe with vlans this could be avoided? Yes, you can do more interesting routing with vlans; I've run into a couple of companies that have claimed to patent it, but there's probably tons of prior art from previous Ethernet generations. It's also the case that some ethernet chipsets support alternate non-standard routing methods; they're designed for bulding big switches composed of many switch chips, and they support arbitrary topologies. I've never seen a switch touting this ability, though. 
-- greg From patrick at myri.com Wed Jul 11 11:39:42 2007 From: patrick at myri.com (Patrick Geoffray) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Cluster Diagram of 500 PC In-Reply-To: References: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> Message-ID: <469523EE.7050005@myri.com> Hi Mark, Mark Hahn wrote: > my question is: do switches these days have smart protocols for mapping > and routing in such a configuration? I know that the original spanning That's 802.3ad. Quick pointer: http://en.wikipedia.org/wiki/Link_aggregation You can use it between switches to use multiple links. The spec does not specify how to load balance the packets across the links, so link aggregation may not be very efficient between switches from different vendors. YMMV. Patrick From hahn at mcmaster.ca Wed Jul 11 12:18:10 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Cluster Diagram of 500 PC In-Reply-To: <469523EE.7050005@myri.com> References: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> <469523EE.7050005@myri.com> Message-ID: >> my question is: do switches these days have smart protocols for mapping and >> routing in such a configuration? I know that the original spanning > > That's 802.3ad. Quick pointer: > http://en.wikipedia.org/wiki/Link_aggregation I don't think that's what I meant. imagine instead that you have 48pt GE switches, each of which has 4x 10G extra ports. now, take 5 such switches and fully connect them (each switch has a 10G link to each of the other 4 switches). I don't think 802.3ad helps here, since what you want is to _avoid_ a single spanning tree, which would necessarily have one root. 802.3ad is exactly the right thing if you simply want to stack two such switches and want 4x10Gb inter-switch bandwidth. I noticed that d-link appears to use 10G links for stacking, but has a route-discovery protocol that lets them structure the switches into a ring. 
I'm not sure they use this to reduce hop-count, though - perhaps just for reliability. From lindahl at pbm.com Wed Jul 11 14:18:57 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Cluster Diagram of 500 PC In-Reply-To: References: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> <469523EE.7050005@myri.com> Message-ID: <20070711211857.GC32038@bx9.net> On Wed, Jul 11, 2007 at 03:18:10PM -0400, Mark Hahn wrote: > 802.3ad is exactly the right thing if you simply want to stack > two such switches and want 4x10Gb inter-switch bandwidth. Right. It's just spanning-tree-plus-trunking. -- greg From patrick at myri.com Wed Jul 11 21:21:08 2007 From: patrick at myri.com (Patrick Geoffray) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Cluster Diagram of 500 PC In-Reply-To: References: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> <469523EE.7050005@myri.com> Message-ID: <4695AC34.70905@myri.com> Mark Hahn wrote: > I don't think that's what I meant. imagine instead that you have 48pt > GE switches, each of which has 4x 10G extra ports. now, take > 5 such switches and fully connect them (each switch has a 10G link > to each of the other 4 switches). I don't think 802.3ad helps here, > since what you want is to _avoid_ a single spanning tree, which would > necessarily have one root. Yes, there is still a spanning tree and it will prevent cycles by choosing a root for you. There is no way to work around it at the pure Ethernet level. Sure, you can play with vlans, but it's more an hack than a well defined way to solve the problem. In practice, you can do this type of topology by using a proprietary "stacking" interface if there is one available. These ports are not Ethernet links, they are backplane extensions, so they are not subject to spanning tree limitations. Of course, it's vendor specific and some switches don't have such stacking ports. 
You could also turn off the spanning tree, but you cannot call it Ethernet anymore. Another solution is to do layer-3 routing between switches, if you have such IP routing capability.

Patrick

From jonbernard at fmailbox.com Tue Jul 10 15:22:28 2007
From: jonbernard at fmailbox.com (Jon Bernard)
Date: Sun Jul 27 01:06:11 2008
Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216
Message-ID: <1184106148.13500.1199500571@webmail.messagingengine.com>

Two vendors recently bid on a cluster we're buying, and we've got questions about their power estimates that I thought I'd also put to the list while we wait for the vendors to respond.

Vendor A estimates that at peak load a compute node with two AMD 2216s, 4 GB of 667 DDR2 RAM, a hard drive, and an IB board will draw 265 watts. Vendor B estimates that such a node will draw 450 watts. Vendor B also estimates that a similar machine with two Intel 5160s will draw 550 watts at peak load.

Anybody have any actual measurements?

Jon

From SternM at grc.nia.nih.gov Tue Jul 10 18:13:47 2007
From: SternM at grc.nia.nih.gov (Stern, Michael (NIH/NIA/IRP) [E])
Date: Sun Jul 27 01:06:11 2008
Subject: [Beowulf] MPI_reduce roundoff question.
Message-ID:

Hi,

I'm a user of the NIH Biowulf cluster, and I'm looking for an answer to a question that the staff here didn't know. I'm aware that roundoff error in MPI_reduce can vary with the number of processors, because the order of summation depends on the communication path. My question is whether the order of summation can differ among different calls to MPI_reduce within the same program (with the same number of processors during a single run).

From rcd2951 at satx.rr.com Tue Jul 10 19:46:43 2007
From: rcd2951 at satx.rr.com (rcd2951@satx.rr.com)
Date: Sun Jul 27 01:06:11 2008
Subject: [Beowulf] Orion Multisystems DS96
In-Reply-To: <20070711001150.GB616@stikine.ucs.sfu.ca>
References: <20070711001150.GB616@stikine.ucs.sfu.ca>
Message-ID:

Hi:

My name is John Pearson.
I work for the US Air Force in San Antonio, Texas. My organization recently inherited an Orion Multisystems DS96 personal cluster. I was told that the device was not fully populated with HDDs for each node; however, after opening the system, we found that each of the 8 processor boards had 12 40 GB HDDs, along with a full RAM complement. I was also told we could access each node's HDD independently, which is critical for our application. I am hoping someone on this list has one, or has experience building and operating one. I am also hoping someone will have some ideas as to where to get spare parts if needed. I can be reached via this list or at john.pearson@lackland.af.mil Regards John T. Pearson, Jr. Senior C3ISR/EW Modsim Engineer From Julien.Leduc at lri.fr Wed Jul 11 03:10:15 2007 From: Julien.Leduc at lri.fr (Julien.Leduc@lri.fr) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Cluster Diagram of 500 PC Message-ID: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> >> How do we start and stop all nodes using a remote computer. > > IPMI is an excellent, portable, well-scriptable interface for control and monitoring. there are some vendor-specific alternatives, as well as cruder mechanisms (controllable PDU's). IPMI is sometimes OK, sometimes not that good: be careful about your exact needs. IPMI is just a standard that can be implemented quite well, or so poorly that it does not work most of the time (and at a 500-node scale, that is a nightmare!). I take care of a cluster that is similar in size to the one you want to build, and that requires a lot of reboots (>460 000 rebooted nodes in a 9-month time slot => an average of 5 reboots per node per day). In my experience, some IPMI hardware implementations are not sufficient to ensure efficient reboots; for example, we had some issues rebooting the nodes when they were in the PXE boot stage, or blocked in grub with a missing kernel, or worse: when running a FreeBSD system.
Controllable PDUs are not a good idea, because they will burn out your hard drives and your nodes' components pretty quickly, and with so many nodes you will lose many even if your reboot rate is low. Many other solutions are OK: they tend to be scriptable through a telnet + expect script, so they are fine as long as they can reboot all your nodes in any situation. Regards, Julien Leduc From rokrau at yahoo.com Wed Jul 11 09:40:03 2007 From: rokrau at yahoo.com (Roland Krause) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] 8GB memeory limit? In-Reply-To: <20070711001150.GB616@stikine.ucs.sfu.ca> Message-ID: <136443.72286.qm@web81101.mail.mud.yahoo.com> Martin, your problem, or observation, is due to a change in the way the kernel and glibc handle the allocation of memory. The background is explained very well in this earlier post http://www.mail-archive.com/beowulf%40beowulf.org/msg03723.html and that's also where I learned about it, so all I will say here is that the change that allows you to allocate your entire memory using mmap occurred in kernel 2.6.9. Best regards, Roland --- Martin Siegert wrote: > Hi, > > I am running into a bizarre memory issue: I do not appear to be > able to allocate 8GB of memory into a single array:
>
> ===================================================
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[]){
>    long int i, n;
>    long int *m;
>
>    /* upper bound: 2^30 * sizeof(long int) bytes (8GB when long is 8 bytes) */
>    n = 1024*1024*1024*sizeof(long int);
>    /* try allocation sizes of 1, 2, 4, ... bytes, doubling up to n */
>    for (i = 1; i <= n; i*=2) {
>       m = (long *) malloc(i);
>       if (m == NULL) {
>          fprintf(stderr, "allocation of %li bytes failed.\n", i);
>          exit(-1);
>       }
>       free(m);
>    }
> }
> ====================================================
>
> # gcc -m64 int_malloc.c
> # ./a.out
> allocation of 8589934592 bytes failed.
>
> This is with a 2.6.5 kernel (SLES 9). > If I compile the same program under kernel 2.6.16.27 (openSuSE 10.2), > the program completes without problem.
> > Under either OS I can allocate, e.g., 5 arrays of 4GB each within the > same program without problem. > > Where does this limit of 8GB for a single array come from? > Is it in the kernel? If yes, can it be changed, e.g., through a > sysctl? > Which one? > > Cheers, > Martin > > -- > Martin Siegert > Head, HPC@SFU > WestGrid Site Lead > Academic Computing Services phone: (604) 291-4691 > Simon Fraser University fax: (604) 291-4242 > Burnaby, British Columbia email: siegert@sfu.ca > Canada V5A 1S6 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From scheinin at crs4.it Thu Jul 12 08:41:15 2007 From: scheinin at crs4.it (Alan Louis Scheinine) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: <1184106148.13500.1199500571@webmail.messagingengine.com> References: <1184106148.13500.1199500571@webmail.messagingengine.com> Message-ID: <46964B9B.7040607@crs4.it> > Vendor A estimates that at peak load a compute node with two AMD 2216s, > 4 GB of 667 DDR2 RAM, a hard drive, and an IB board will draw 265 watts. > Vendor B estimates that such a node will draw 450 watts. > > Vendor B also estimates that a similar machine with two Intel 5160s will > draw 550 watts at peak load. > 450 watts would be the specification for the power supply. Because of the typical way of describing power supplies, the 450 watt version would probably not give clean power nor last long at 450 watts continuous. The typical and perhaps even the maximum power drawn by a computer with two CPUs, memory, and a few hard disks would be about 265 watts. Perhaps in other fields of equipment the specs for a power supply are defined so that a maximum power drawn of 256 watts could be met by a power supply of 270 watts -- I don't know.
But in the commercial-consumer field of PCs it is typical to see specs of 400-600 watts for a power supply intended for a 2 or 4 CPU board, whereas we have measured something near 265 watts for a dual Opteron computer. best regards, Alan Scheinine -- Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna Center for Advanced Studies, Research, and Development in Sardinia Postal Address: | Physical Address for FedEx, UPS, DHL: --------------- | ------------------------------------- Alan Scheinine | Alan Scheinine c/o CRS4 | c/o CRS4 C.P. n. 25 | Loc. Pixina Manna Edificio 1 09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy Email: scheinin@crs4.it Phone: 070 9250 238 [+39 070 9250 238] Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220] Operator at reception: 070 9250 1 [+39 070 9250 1] Mobile phone: 347 7990472 [+39 347 7990472] From becker at scyld.com Thu Jul 12 09:43:04 2007 From: becker at scyld.com (Donald Becker) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: <1184106148.13500.1199500571@webmail.messagingengine.com> Message-ID: On Tue, 10 Jul 2007, Jon Bernard wrote: > Vendor A estimates that at peak load a compute node with two AMD 2216s, > 4 GB of 667 DDR2 RAM, a hard drive, and an IB board will draw 265 watts. > Vendor B estimates that such a node will draw 450 watts. Considering that the regular 2216 is 95W peak (the 'HE' version is about 65W), and the memory and IB card are both pretty warm, 265 watts is unrealistic. Divide a realistic max power by the power supply efficiency and you'll get about 450 watts. There is a way to get that lower number: 2216HE processors, and very efficient (93% would be exceptional, high 80s more realistic) power supplies. But that will be significantly more expensive. Easily enough $$ that you would know if that's what you are buying.
(We use 'HE' processors, special memory and highly-efficient power supplies in our blade systems to make the thermals work, and system price/MIPS looks pretty bad compared to standard 1U products.) The chipsets and base processors are the same between vendors, and the IB cards and disks are about the same. The memory power use will vary a bit, but you might get different numbers between the sample and shipped nodes. The biggest variation will be the power supply efficiency. And even there it will vary significantly with the power draw. We've measured between 50% and 93% efficiency. The worst are the supplies in old generic 1U cases, which hopefully we won't see again. There is a significant cost difference between today's low-end, low-efficiency supplies and the better 80+% units. Doing an extra conversion, e.g. to 50VDC and then to the final board voltage, won't improve the overall numbers, but will move the conversion and thermals to "behind the curtain". We could get into an interesting discussion about the best way to decrease the typical power use of a cluster. The best way to do this is with software -- laptop-style power control, and powering nodes down. But when purchasing and installing clusters you have to design for that long-running job that stays at the peak power draw and thermal state of the cluster. -- Donald Becker becker@scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA From Ron.Jerome at nrc-cnrc.gc.ca Thu Jul 12 09:19:54 2007 From: Ron.Jerome at nrc-cnrc.gc.ca (Jerome, Ron) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: <1184106148.13500.1199500571@webmail.messagingengine.com> Message-ID: If it helps any, I have a rack of 32 dual Intel 5320's with 8G FBDIMM (4 x 2G) with a single 70G SAS drive. The whole rack (at peak usage) draws about 11,000 watts. Ron.
> -----Original Message----- > From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On > Behalf Of Jon Bernard > Sent: Tuesday, July 10, 2007 6:22 PM > To: beowulf@beowulf.org > Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 > > Two vendors recently bid on a cluster we're buying, and we've got > questions about their power estimates that I though I'd also put to the > list while we wait for the vendors to respond. > > Vendor A estimates that at peak load a compute node with two AMD 2216s, > 4 GB of 667 DDR2 RAM, a hard drive, and an IB board will draw 265 watts. > Vendor B estimates that such a node will draw 450 watts. > > Vendor B also estimates that a similar machine with two Intel 5160s will > draw 550 watts at peak load. > > Anybody have any actual measurements? > > Jon From jmdavis1 at vcu.edu Thu Jul 12 10:14:09 2007 From: jmdavis1 at vcu.edu (Mike Davis) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: References: Message-ID: <46966161.5010504@vcu.edu> Donald, I have spun up a Sun x4100 (dual core, dual processor) to 100% processor usage with normal HD writes and measured the actual power usage at 267 watts. Obviously, higher-than-normal HD usage (such as swapping) would drive the number up, but I was very pleased with these results as well as those of the x2200 machines. These tests were made with CentOS 4 installed. Mike Davis Donald Becker wrote: > On Tue, 10 Jul 2007, Jon Bernard wrote: > > >>Vendor A estimates that at peak load a compute node with two AMD 2216s, >>4 GB of 667 DDR2 RAM, a hard drive, and an IB board will draw 265 watts. >>Vendor B estimates that such a node will draw 450 watts.
> > > Considering that the regular 2216 is 95W peak (the 'HE' version is about > 65W), and the memory and IB card are both pretty warm, 265 watts is > unrealistic. Multiply a realistic max power by a power supply > efficiency and you'll get about 450 watts. > > There is way to get that lower number: 2216HE processors, and very efficient > (93% would be exceptional, high 80s more realistic) power supplies. But > that will be significantly more expensive. Easily enough $$ that you would > know if that's what you are buying. (We use 'HE' processors, special > memory and highly-efficient power supplies in our blade systems to make > the thermals work, and system price/MIPS looks pretty bad compared to > standard 1U products.) > > The chipsets and base processors are same between vendors, and the IB > cards and disks are about the same. The memory power use will vary a bit, > but you might get different numbers between the sample and shipped nodes. > The biggest variation will be the power supply efficiency. And even there > it will be vary significantly with the power draw. > > We've measured between 50% and 93% efficiency. The worst are the supplies > in old generic 1U cases, which hopefully we won't see again. There is > a significant cost difference between today's low-end, low-efficiency > supplies and the better 80+% units. Doing an extra conversion e.g. to > 50VDC then to the final board voltage, won't improve the > overall numbers, but will move the conversion and thermals to "behind the > curtain". > > We could get into an interesting discussion about the best way to decrease > the typical power use of a cluster. The best way to do this is with > software -- laptop-style power control, and powering nodes down. But when > purchasing and installing clusters you have to design for that > long-running job that stays at the peak power draw and thermal state of > the cluster. > > From rgb at phy.duke.edu Thu Jul 12 10:35:31 2007 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: <1184106148.13500.1199500571@webmail.messagingengine.com> References: <1184106148.13500.1199500571@webmail.messagingengine.com> Message-ID: On Tue, 10 Jul 2007, Jon Bernard wrote: > Two vendors recently bid on a cluster we're buying, and we've got > questions about their power estimates that I though I'd also put to the > list while we wait for the vendors to respond. > > Vendor A estimates that at peak load a compute node with two AMD 2216s, > 4 GB of 667 DDR2 RAM, a hard drive, and an IB board will draw 265 watts. > Vendor B estimates that such a node will draw 450 watts. > > Vendor B also estimates that a similar machine with two Intel 5160s will > draw 550 watts at peak load. > > Anybody have any actual measurements? Not for this specific configuration -- I'd be inclined towards ~300W (we could have a lottery!) although it will vary depending on whether the system is loaded/active or idle by as much as 30% of the loaded value, so they couldn't QUITE both be right but it's close. Google for a "kill-a-watt". Brand new, shipped, it's about $25 (from e.g. Amazon). A must have for any serious cluster enthusiast or manager -- use it to plug the systems in and then look. It's one of the only ways to be sure. rgb > > Jon > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From larry.stewart at sicortex.com Thu Jul 12 11:36:20 2007 From: larry.stewart at sicortex.com (Larry Stewart) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. 
AMD 2216 In-Reply-To: References: Message-ID: <469674A4.8010305@sicortex.com> Donald Becker wrote: >We could get into an interesting discussion about the best way to decrease >the typical power use of a cluster. The best way to do this is with >software -- laptop-style power control, and powering nodes down. But when >purchasing and installing clusters you have to design for that >long-running job that stays at the peak power draw and thermal state of >the cluster. > > > > Been waiting for <> for a straight line. ... or you could design new machines for low power. Both our machines and IBM's forthcoming Blue Gene/P are about 3 watts per peak gigaflop. I do agree though, that if you aren't computing anything, then you may as well unplug the machine. -Larry / SiCortex From cap at nsc.liu.se Thu Jul 12 13:05:42 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: <1184106148.13500.1199500571@webmail.messagingengine.com> References: <1184106148.13500.1199500571@webmail.messagingengine.com> Message-ID: <200707122205.46211.cap@nsc.liu.se> On Wednesday 11 July 2007, Jon Bernard wrote: > Two vendors recently bid on a cluster we're buying, and we've got > questions about their power estimates that I though I'd also put to the > list while we wait for the vendors to respond. > > Vendor A estimates that at peak load a compute node with two AMD 2216s, > 4 GB of 667 DDR2 RAM, a hard drive, and an IB board will draw 265 watts. > Vendor B estimates that such a node will draw 450 watts. > > Vendor B also estimates that a similar machine with two Intel 5160s will > draw 550 watts at peak load. > > Anybody have any actual measurements? For our latest big procurement I measured (among other things) the actual power draw of all final candidates. One was a dual Opteron 2220 (2.8 GHz dual core part) which at idle/load scored 218W/301W this with 12 dims (24 gig) one sata drive and no IB. 
A 2nd vendor measured 230/315 for a very similar configuration. We did not test any intel 51xx, only 53xx (quad core). The 2.33 GHz 53xx (80W tdp if I remember correctly) did around 260W/350W (this with one drive and 8 fbdimms with a total of 32 gig). An IB board is typically <10W, Mellanox says that their LX single port DDR card is only 4W. Hope that helps a bit, Peter > Jon -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. Url : http://www.scyld.com/pipermail/beowulf/attachments/20070712/dc0db169/attachment.bin From gardner at backserv.gsfc.nasa.gov Thu Jul 12 10:34:28 2007 From: gardner at backserv.gsfc.nasa.gov (Glen Gardner) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: <1184106148.13500.1199500571@webmail.messagingengine.com> References: <1184106148.13500.1199500571@webmail.messagingengine.com> Message-ID: <1184261668.1974.5.camel@kaze.gsfc.nasa.gov> 450 watts peak is not too far off from what I've experienced. two AMD socket F dual core cpu's on a TYAN Thunder n3600M S2932GNR motherboard. Max startup power use: 330W Idle power use after OS boot: 200W Maximum power use during heavy file I/O to raid (using io_bench.c): 271W Max power use during an MPI pheapsort.c run (11 processes): 351W estimated max continuous power use (heavy I/O & cpu use): 422W Glen On Tue, 2007-07-10 at 17:22 -0500, Jon Bernard wrote: > Two vendors recently bid on a cluster we're buying, and we've got > questions about their power estimates that I though I'd also put to the > list while we wait for the vendors to respond. > > Vendor A estimates that at peak load a compute node with two AMD 2216s, > 4 GB of 667 DDR2 RAM, a hard drive, and an IB board will draw 265 watts. > Vendor B estimates that such a node will draw 450 watts. 
> > Vendor B also estimates that a similar machine with two Intel 5160s will > draw 550 watts at peak load. > > Anybody have any actual measurements? > > Jon > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From apittman at concurrent-thinking.com Thu Jul 12 12:06:54 2007 From: apittman at concurrent-thinking.com (Ashley Pittman) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] MPI_reduce roundoff question. In-Reply-To: References: Message-ID: <1184267214.21024.12.camel@bruce.priv.wark.uk.streamline-computing.com> On Tue, 2007-07-10 at 21:13 -0400, Stern, Michael (NIH/NIA/IRP) [E] wrote: > Hi, > I'm a user of the NIH Biowulf cluster, and I'm looking for an answer > to a question that the staff here didn't know. I'm aware that > roundoff error in MPI_reduce can vary with the number of processors, > because the order of summation depends on the communication path. My > question is whether the order of summation can differ among different > calls to MPI_reduce within the same program (with the same number of > processors during a single run). An interesting question. Technically the answer is yes it can according to the letter of the MPI spec although the spec also advises vendors that it shouldn't differ regardless of the layout of the processes. In practice I'd be very surprised if two calls to MPI_Reduce on the same communicator with the same values would produce different results although you may find that communicators with the same size but different process layout within the same job gave you slightly different answers if shared memory optimisations have been employed. 
I'm aware this has been an issue for people in the past and I've seen procurement contracts which state that any results obtained by the computer must be 100% repeatable, which in effect means the answer to your question is no, they cannot differ. Ashley, From mathog at caltech.edu Thu Jul 12 14:14:48 2007 From: mathog at caltech.edu (David Mathog) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 Message-ID: "Robert G. Brown" wrote > Google for a "kill-a-watt". Brand new, shipped, it's about $25 (from > e.g. Amazon). A must have for any serious cluster enthusiast or > manager -- use it to plug the systems in and then look. It's one of the > only ways to be sure. I'll second that recommendation. I used one to make a lot of power measurements on various systems: http://saf.bio.caltech.edu/saving_power.html One thing it taught me is that screensavers WASTE energy, even ones that blank the display. While that worked ok for CRTs (where power usage is proportional to screen brightness) it just sets the pixels to black on an LCD, which saves no energy at all. Far better to let the OS put the monitor into standby, where energy usage is roughly zero. Larry Stewart wrote: > I do agree though, that if you aren't computing anything, then you > may as well unplug the machine. Well, the problem with going from full off to full on is that it's quite slow. Plus they won't always come back up reliably (Cough, Tyan, cough). It would be nice if somebody made a server class CPU with an ultra low power setting - the system would still be running, just very, very slowly. Much faster kicking up the CPU from 100 MHz to 3 GHz than rebooting the system. The normal CPU frequency shifting software could keep an eye on the load and run as slowly, or as quickly, as is possible/required. Also, it's definitely bad to spin disks down and up frequently, but I think mostly the problem is that the disk goes all the way to stopped, and bad things tend to happen there.
I wonder if a disk couldn't be built that could slow down from 10K RPM to 5K RPM (or less), and save energy that way. Just keep the platters moving fast enough to float the heads. The time to reposition the heads wouldn't change, although the overall access time would be a bit longer, waiting for the right piece of data to rotate around. Such a disk could conceivably also monitor its own activity and spin up to full speed when that is warranted. With 8 MB buffers being common I suspect that under a lot of circumstances there would be virtually no difference between the 5K and 10K modes. The flip side of all of this is that if the power usage is going up and down dramatically on the cluster it might cause problems with, for instance, the room's UPS or A/C. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From danapa2000 at gmail.com Thu Jul 12 14:41:11 2007 From: danapa2000 at gmail.com (Daniel Navas-Parejo Alonso) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Cluster Diagram of 500 PC In-Reply-To: References: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> <469523EE.7050005@myri.com> Message-ID: <171130500707121441r323a1037qe636474e3a13ed3d@mail.gmail.com> A drawback of stacking could be scalability (there is always a maximum number of switches that can be stacked), but it is a nice solution for a limited number of nodes. Of course, this is not at all an HA approach. There is a network manufacturer that implements a couple of protocols that could be interesting for network redundancy and low-latency failover, and that could fit well with the ring proposal you're making. Those protocols are EAPSv2 (based on a set of rings) for L2 and ESRP for L3. Yet once again, these are proprietary protocols; that's the drawback. Spanning tree has some caveats and could increase the number of hops when two distant leaves communicate.
In this case there are some vendor-specific spanning tree implementations (e.g. PVST+, EMISTP, etc.) but also a number of "standard" spanning tree implementations (for instance the first definition of STP in 802.1D and its successors 802.1w, etc.). Anyway, hop latency in Ethernet is most of the time just peanuts compared to TCP/IP stack overhead... Another option could be not to implement redundancy, but of course this depends on how critical your cluster is.... 2007/7/11, Mark Hahn : > > >> my question is: do switches these days have smart protocols for mapping > and > >> routing in such a configuration? I know that the original spanning > > > > That's 802.3ad. Quick pointer: > > http://en.wikipedia.org/wiki/Link_aggregation > > I don't think that's what I meant. imagine instead that you have > 48pt GE switches, each of which has 4x 10G extra ports. now, take > 5 such switches and fully connect them (each switch has a 10G link > to each of the other 4 switches). I don't think 802.3ad helps here, > since what you want is to _avoid_ a single spanning tree, which > would necessarily have one root. > > 802.3ad is exactly the right thing if you simply want to stack > two such switches and want 4x10Gb inter-switch bandwidth. > > I noticed that d-link appears to use 10G links for stacking, but > has a route-discovery protocol that lets them structure the switches > into a ring. I'm not sure they use this to reduce hop-count, though - > perhaps just for reliability. -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20070712/7bfa28b6/attachment.html From ballen at gravity.phys.uwm.edu Thu Jul 12 14:45:40 2007 From: ballen at gravity.phys.uwm.edu (Bruce Allen) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: <46966161.5010504@vcu.edu> References: <46966161.5010504@vcu.edu> Message-ID: > I have spun up Sun x4100 dual core, dual processor to 100% processor > usage and normal HD writes and measured the actually powerusage at 267 > watts. Obviously higher than normal HD usage (such as swapping) would > drive the number up, Even heavy disk thrashing would only drive the number up by a few watts. Typical SATA disks draw around 6 to 8W in normal use [except on power up, when spinning up the platters can increase power use by 15W for one or two seconds]. Once the disk is spun up, seeking the heads back and forth as fast as possible while reading and writing only increases the power use by a few watts. Here is a specific example for illustration: Hitachi Deskstar T7K250 250GB, reference OEM manual, Version 1.8, 12 September 2006, http://www.hitachigst.com/tech/techlib.nsf/products/Deskstar_T7K250 . Note: I chose this disk NOT as an endorsement of any kind but because Hitachi/IBM have detailed OEM manuals on their web site, and because it is a typical high-sales-volume inexpensive dual platter 7200 rpm SATA disk. Citing from section 7.4 of the OEM manual: Idle average: 6.2W Random RW average: 10.5W So heavy disk loading would drive up the power consumption by 4.3W, which is less than 2% of the total power use of the node. Most of the power in typical cluster machines is consumed by the CPU(s), the chipset(s), inefficiency in the power supply, and the memory. One thing that surprised me a bit the last time I looked at this a year ago was that the power use of typical memory sticks is quite high, often more than 10W.
In fact if you look at a system with well-designed cooling, you'll see that the fans are designed to blow a lot of air over the memory area. Cheers, Bruce From jmdavis1 at vcu.edu Thu Jul 12 14:52:20 2007 From: jmdavis1 at vcu.edu (Mike Davis) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: References: <46966161.5010504@vcu.edu> Message-ID: <4696A294.7070508@vcu.edu> That's excellent information. I was surprised by how close to the documented power use my numbers were. Knowing that even high disk use adds little power usage is important to those of us fighting the power/cooling wars. Mike Bruce Allen wrote: >> I have spun up Sun x4100 dual core, dual processor to 100% processor >> usage and normal HD writes and measured the actually powerusage at 267 >> watts. Obviously higher than normal HD usage (such as swapping) would >> drive the number up, > > > Even heavy disk thrashing would only drive the number up by a few watts. > > Typical SATA disk are around 6 or 8W normal power use [except on power > up, when spinning up the platters can increase power use by 15W for one > or two seconds]. Once the disk is spun up, seeking the heads back and > forth as fast as possible while reading and writing only increases the > power use by a few watts. > > Here is a specific example for illustration: Hitachi Deskstar T7K250 > 250GB, reference OEM manual, Version 1.8, 12 September 2006, > http://www.hitachigst.com/tech/techlib.nsf/products/Deskstar_T7K250 . > Note: I choose this disk NOT as an endorsement of any kind but because > Hitachi/IBM have detailed OEM manuals on their web site, and because it > is a typical high-sales-volume inexpensive dual platter 7200 rpm SATA disk. > > Citing from section 7.4 of the OEM manual: > Idle average: 6.2W > Random RW average: 10.5W > > So heavy disk loading would drive up the power consumption by 4.3W, > which is less than 2% of the total power use of the node.
> > Most of the power in typical cluster machines is consumed by the CPU(s), > the chipset(s), inefficiency in the power supply, and the memory. On > thing that surprised me a bit the last time I looked at this a year ago > was that the power use of typical memory sticks is quite high, often > more than 10W. In fact if you look at system with well-designed cooling, > you'll see that the fans are designed to blow a lot of air over the > memory area. > > Cheers, > Bruce From ballen at gravity.phys.uwm.edu Thu Jul 12 15:15:44 2007 From: ballen at gravity.phys.uwm.edu (Bruce Allen) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: <4696A294.7070508@vcu.edu> References: <46966161.5010504@vcu.edu> <4696A294.7070508@vcu.edu> Message-ID: Thanks Mike! To understand power use, I have in the past found it very helpful to look at OEM motherboard manuals. For example, the Intel OEM motherboard manuals typically include a table listing expected power consumption for a system built around the motherboard. The table includes standby, normal and peak power use for all the major components: cpu, memory, disk, chipset, networking, etc. It's helpful to get a quick overview and to develop a feel for where the power is going. It's my experience that when building large clusters, issues of space, power and cooling are often harder and more time-consuming to resolve than actually getting the cluster itself purchased, commissioned, and operating. For example, I've recently taken up a new position in Hannover, Germany, where as part of my start-up package the MPG is building a cluster room (450 square meters floor space, 500kW cooling, 800kW UPS, with the option to double cooling/power in four years). The design of the room began in September 2005 and construction is now underway; the room is scheduled for completion at the end of this year. So total design and construction time is 2.3 years.
In contrast, I am just now starting some serious benchmarking for the cluster itself, which will probably arrive early next year. Total design and construction time will be about 0.5 years. The cost of the cluster room is about equal to the cost of the initial cluster that will go into it. But the ratio of time to completion (and design time) is more than 4 to 1. Cheers, Bruce On Thu, 12 Jul 2007, Mike Davis wrote: > That's excellent information. I was surprised by how close to the documented > power use my numbers were. Knowing that even high disk use adds little power > usage is important to those of us fighting the power/cooling wars. > > Mike From hahn at mcmaster.ca Thu Jul 12 15:17:51 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: References: Message-ID: >> Vendor A estimates that at peak load a compute node with two AMD 2216s, >> 4 GB of 667 DDR2 RAM, a hard drive, and an IB board will draw 265 watts. that sounds right to me. I have a cluster of dual-95W opteron nodes, 8GB ram, 2x HD, quadrics card, and they measure about that when under load. (assuming it's not hyperventilating - when the fan controller gets upset, dissipation for the 12 fans increases and draw is closer to 350.) >> Vendor B estimates that such a node will draw 450 watts. typical vendor response, based on PS rating. it's only trivially correct, but then again a mere sales droid can provide the answer ;) > Considering that the regular 2216 is 95W peak (the 'HE' version is about > 65W), and the memory and IB card are both pretty warm, 265 watts is > unrealistic. hmm. I imagine there have been some warm IB cards, but most NICs I see are in the ~8W range (quadrics, myrinet, IB, though _not_ counting those gross 10G TOE cards). memory is typically about 1W/chip (.8W for the pc2-800 1Gb production chip from Micron I just looked at). so 4x 1G 9-chip ECC dimms would total about 36W. disks are only 5-15W.
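The per-component figures above can be tallied in a few lines. This is only a sketch of the arithmetic for the vendor-A node (two 2216s, 4 DIMMs, one disk, one interconnect card): it takes the upper end of the quoted disk range and deliberately leaves out the chipset, fans and power-supply losses, which are not broken down in this message.

```python
# Peak component figures as quoted in the message, in watts.
components_w = {
    "cpus": 2 * 95,       # two Opteron 2216 at 95 W peak each
    "memory": 4 * 9 * 1,  # 4 ECC DIMMs x 9 chips x about 1 W per chip
    "nic": 8,             # typical interconnect card, about 8 W
    "disk": 15,           # upper end of the 5-15 W range
}

listed_total_w = sum(components_w.values())
for name, watts in components_w.items():
    print(f"{name:>7}: {watts:4d} W")
print(f"  total: {listed_total_w:4d} W (chipset, fans, PSU losses not included)")
```

Adding a line for any further component, or substituting measured values, keeps the tally honest about what is and is not being counted.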
> Multiply a realistic max power by a power supply > efficiency and you'll get about 450 watts. well, that seems a bit high to me, but not outlandish. I'd probably work with something between 350 and 400. > There is a way to get that lower number: 2216HE processors, and very efficient HE model is $73 more according to AMD's current list price. according to my math (and local ~5 cent/kwh prices), it won't justify itself in direct operating cost, but might when you factor in lower infrastructure (wires, cooling). one interesting thing is that PSUs are most efficient near their max load. so getting nodes with 500W PSUs (like mine), and then only running at 265W is probably a bad thing. I think I've read that a typical PSU might be only, say, 65% efficient at half-load, up to say 75 or 80% near full load. so just sizing the PSU right could save almost the same power as HE CPUs... > We've measured between 50% and 93% efficiency. The worst are the supplies Don, how do you measure efficiency? > We could get into an interesting discussion about the best way to decrease > the typical power use of a cluster. The best way to do this is with I've got disks in my cluster in standby most of the time - it'll only save a few watts per node, but the cost is negligible. our cpus are, for better or worse, nearly 100% utilized all the time. I don't know if there's any way to tell what fraction of time they are, for instance, in blocking MPI communication. for longer-term blocking (perhaps local or network disk IO), I'd hope the CPU uses a high-savings idle mode (halt, I guess). hmm, I wonder if any nics can monitor how recently the host polled for status, and use that to decide whether to issue an interrupt when a packet comes in. then the MPI lib could poll for the approximate RTT (say 5 us) and sleep otherwise. if the nic hadn't seen a poll for a few us, it could assume the host needed an irq to wake up... then again, presumably 4-8 core chips will keep nics warmer...
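The PSU-sizing remark earlier in this message can be put into numbers. The sketch below assumes that 265 W is the DC load delivered by the supply (the message does not say whether it was measured at the wall) and uses the rough 65% and 80% efficiencies mentioned; the HE saving uses the 95 W and roughly 65 W processor figures quoted in the thread.

```python
def wall_draw_w(dc_load_w, efficiency):
    """AC power drawn at the wall for a given DC load and PSU efficiency."""
    return dc_load_w / efficiency


# Assumption: 265 W is the DC load. Efficiencies are the rough values
# recalled in the message (65% at half load, 80% near full load).
oversized_w = wall_draw_w(265, 0.65)
right_sized_w = wall_draw_w(265, 0.80)
psu_saving_w = oversized_w - right_sized_w

# Two HE processors at about 65 W each instead of 95 W each.
he_saving_w = 2 * (95 - 65)

print(f"oversized PSU:   {oversized_w:.0f} W at the wall")
print(f"right-sized PSU: {right_sized_w:.0f} W at the wall")
print(f"PSU saving: {psu_saving_w:.0f} W, HE CPU saving: {he_saving_w} W")
```

Under these assumptions the two savings come out in the same range, which is the point being made; real supplies have their own efficiency curves, so measured values should replace the guesses.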
regards, mark hahn. From landman at scalableinformatics.com Thu Jul 12 15:19:49 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: <4696A294.7070508@vcu.edu> References: <46966161.5010504@vcu.edu> <4696A294.7070508@vcu.edu> Message-ID: <4696A905.9060805@scalableinformatics.com> Mike Davis wrote: > That's excellent information. I was surprised by how close to the > documented power use my numbers were. Knowing that even high disk use > adds little power usage is important to those of us fighting the > power/cooling wars. Memory dimms are a major consumer. If you have the choice to make between smaller cheaper DIMMs using more slots, versus fewer larger (and more costly) DIMMs using fewer slots, the latter will usually consume less power. 8 GB can be 8x1GB dimms, 4x2GB dimms (or 2x 4GB dimms, though this is still not cost competitive even factoring in the power). The 2GB dimms emit the same heat as the 1 GB dimms. So if you have a 1000 node cluster, and you use the larger (slightly more expensive) 2GB dimms vs the 1GB dimms, you will emit somewhat less heat. I haven't done the analysis, but I bet it would be close to a good tradeoff for TCO. That and few parts means lower absolute number of failures, but that is another issue. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Thu Jul 12 15:46:37 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: References: <46966161.5010504@vcu.edu> Message-ID: > chipset(s), inefficiency in the power supply, and the memory.
One thing that > surprised me a bit the last time I looked at this a year ago was that the > power use of typical memory sticks is quite high, often more than 10W. In do your systems use very large numbers of dimms? I think ours are typically 4-8. so even 8x10W is going to be comparable to a cpu or the inefficiency of a mediocre PSU. like most of these other factors, this is a situation that's improving - one of ddr3's main changes is going from 1.8 to 1.5V, saving ~30% on power. one nice thing is that dram dissipates a lot less power when fully-on but not continuously bursting (~50 mA vs 300). I don't know whether current bios/mem controllers will put dram into the really low-power modes (~7mA self-refresh) by default. > fact if you look at a system with well-designed cooling, you'll see that the > fans are designed to blow a lot of air over the memory area. yes, though the power density is relatively low. I wonder if one could come up with a specific access pattern that would overheat particular chips on a dimm - perhaps even corresponding chips on a dual-rank dimm... From hahn at mcmaster.ca Thu Jul 12 15:59:37 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: References: <46966161.5010504@vcu.edu> <4696A294.7070508@vcu.edu> Message-ID: > It's my experience that when building large clusters, issues of space, power > and cooling are often harder and more time-consuming to resolve than actually > getting the cluster itself purchased, commissioned, and operating. For that is somewhat perplexing, since the space/power/cooling issues aren't really _that_ complicated. I think it's one of those areas where too much choice leads to harder decisions. perhaps it also reflects the fact that we're still not really comfortable with the state of affairs - for instance, vendors still advocate blade servers, which if fully populated are basically uncoolable (~24 KW/rack!).
> example I've recently taken up a new position in Hannover Germany where as > part of my start-up package the MPG is building a cluster room (450 square > meters floor space, 500kW cooling, 800kW UPS, with the option to double > cooling/power in four years). those numbers seem strange to me - unless I've botched conversion, the cluster I sit next to is about 4.7 KW/sq-m. (such a large UPS seems strange too - did they choose it based on poor quality line power? we have none of our compute hardware on UPS, and don't have problems, since modern PDUs seem to ride out the typical 1-second glitch without much trouble...) > end of this year. So total design and construction time is 2.3 years. In that's a bit extreme, I think. our room was a bare-slab reno and took a bit over a year. another one of our sites was built from scratch and took about 1.5. > construction time will be about 0.5 years. The cost of the cluster room is > about equal to the cost of the initial cluster that will go into it. But the strange. I'm pretty sure the cost ratio we see is more like 4:1 for the from-scratch site (and closer to 10:1 for renos.) regards, mark hahn. From hahn at mcmaster.ca Thu Jul 12 16:07:53 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: <4696A905.9060805@scalableinformatics.com> References: <46966161.5010504@vcu.edu> <4696A294.7070508@vcu.edu> <4696A905.9060805@scalableinformatics.com> Message-ID: > The 2GB dimms emit the same heat as the 1 GB dimms. So if you have a 1000 > node cluster, and you use the larger (slightly more expensive) 2GB dimms vs > the 1GB dimms, you will emit somewhat less heat. I haven't done the assuming the same number of chips per dimm. if your 1G are single-sided, and 2G are double, you save nothing. it's also interesting that for a given generation chip, the higher-clocked dimms are significantly hotter (say, 200 vs 300 mA max draw for 1G pc2/667 vs /800). 
also, I notice that x16 chips dissipate a lot more than x4 or x8, even though the chips have the same number of onchip banks. I guess this says that the main power issue is driving wide parallel buses at speed... > That and few parts means lower absolute number of failures, but that is > another issue. a very interesting one. I wonder how many people have scrubbing turned on in their cluster, and how many use mcelog to monitor the ECC rate. comments? thanks, mark. From hahn at mcmaster.ca Thu Jul 12 16:28:47 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Cluster Diagram of 500 PC In-Reply-To: <171130500707121441r323a1037qe636474e3a13ed3d@mail.gmail.com> References: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> <469523EE.7050005@myri.com> <171130500707121441r323a1037qe636474e3a13ed3d@mail.gmail.com> Message-ID: > Anyway, hop latency in Ethernet is most of times just peanuts in terms of > latency compared to TCP/IP stack overhead... unfortunately - I'm still puzzled why we haven't seen any open, widely-used, LAN-tuned non-TCP implementation that reduces the latency. it should be possible to do ~10 us vs ~40 for a typical MPI-over-Gb-TCP. but my main motive for asking was to spread the _bandwidth_, since STP creates BW hotspots that would be 4.8x over-subscribed in the simple and fairly flat topology I described... From James.P.Lux at jpl.nasa.gov Thu Jul 12 20:56:14 2007 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: References: <46966161.5010504@vcu.edu> <4696A294.7070508@vcu.edu> <4696A905.9060805@scalableinformatics.com> Message-ID: <6.2.3.4.2.20070712205321.03250b78@mail.jpl.nasa.gov> At 04:07 PM 7/12/2007, Mark Hahn wrote: >>The 2GB dimms emit the same heat as the 1 GB dimms. 
So if you have >>a 1000 node cluster, and you use the larger (slightly more >>expensive) 2GB dimms vs the 1GB dimms, you will emit somewhat less >>heat. I haven't done the > >assuming the same number of chips per dimm. if your 1G are single-sided, >and 2G are double, you save nothing. it's also interesting that for a given >generation chip, the higher-clocked dimms are significantly hotter >(say, 200 vs 300 mA max draw for 1G pc2/667 vs /800). > >also, I notice that x16 chips dissipate a lot more than x4 or x8, even >though the chips have the same number of onchip banks. I guess this >says that the main power issue is driving wide parallel buses at speed... You betcha.. the power dissipation inside the chip is fairly low (on both ends of the path, either within the DRAM or within the CPU).. it's the drivers and receivers on the bus (or just getting on and off chip) that consume the joules. Consider that if you are using a voltage source and a series resistive termination, a well matched source will dissipate the same power as is being transmitted (i.e. Thevenin). In reality, the termination and the Z are partly reactive, which doesn't theoretically use any power, but real systems do burn power charging and discharging the capacitance. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From csamuel at vpac.org Thu Jul 12 22:29:37 2007 From: csamuel at vpac.org (Chris Samuel) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Gaussian g03 on CentOS5/RHEL5 ? Message-ID: <200707131529.37257.csamuel@vpac.org> Hi folks, Don't suppose anyone out there has any war stories about trying to get Gaussian 03 going with CentOS5/RHEL5 ? We're looking at running G03 here at VPAC and the new cluster will be running CentOS5 and I'm trying to find out as much as possible before committing!
All the best, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From ballen at gravity.phys.uwm.edu Fri Jul 13 01:41:40 2007 From: ballen at gravity.phys.uwm.edu (Bruce Allen) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: References: <46966161.5010504@vcu.edu> <4696A294.7070508@vcu.edu> Message-ID: Hi Mark, >> It's my experience that when building large clusters, issues of space, >> power and cooling are often harder and more time-consuming to resolve >> than actually getting the cluster itself purchased, commissioned, and >> operating. For > > that is somewhat perplexing, since the space/power/cooling issues aren't > really _that_ complicated. I think it's one of those areas where too much > choice leads to harder decisions. Over the past three years, I've been closely involved in the construction of two large cluster rooms (400kW and 500kW). One is conventional raised floor air cooling and the second is water cooled racks. In both cases, they were institutional construction projects, and took substantially more than a year. In the past, I have done remodelling efforts at the 40kW level. This, I agree, is something that can be done in a matter of a couple of months. But when scaled up by a factor of ten it's substantially more difficult. > perhaps it also reflects the fact that we're still not really > comfortable with the state of affairs - for instance, vendors still > advocate blade servers, which if fully populated are basically > uncoolable (~24 KW/rack!).
With good design (water cooled racks) you can cool a 24kW system. >> example I've recently taken up a new position in Hannover Germany where as >> part of my start-up package the MPG is building a cluster room (450 square >> meters floor space, 500kW cooling, 800kW UPS, with the option to double >> cooling/power in four years). > > those numbers seem strange to me - unless I've botched conversion, the > cluster I sit next to is about 4.7 KW/sq-m. In my case, it turns out that the most cost-effective systems are LOW density ones. So the room has been designed to accommodate these, with a lower kW/m^2 value. And the physical space itself was 'free' so it's nice to have some elbow room. > (such a large UPS seems strange too - did they choose it based on poor > quality line power? we have none of our compute hardware on UPS, and > don't have problems, I have had power-related problems in the past, and have found that the lower maintenance needs and higher reliability of UPS backed systems are worth it. The room is being designed with a 20-year lifetime. When amortized over that time (at least 6 or 7 clusters) the one-time cost of the UPS is negligible. > since modern PDUs seem to ride out the typical 1-second glitch without > much trouble...) That's interesting. Where does the PDU store 1 second of power? >> end of this year. So total design and construction time is 2.3 years. In > that's a bit extreme, I think. our room was a bare-slab reno and took a > bit over a year. That's fast! > another one of our sites was built from scratch and took about 1.5. Fair enough. From my experience this 1.5 years from scratch is probably typical. >> construction time will be about 0.5 years. The cost of the cluster room is >> about equal to the cost of the initial cluster that will go into it. But >> the > strange. I'm pretty sure the cost ratio we see is more like 4:1 for the > from-scratch site (and closer to 10:1 for renos.)
When I built a 40kW system (renovation) the ratio was 10:1. In other words the room renovation cost was 10% of the cluster cost. In that case the construction was done by a University 'Physical Plant' as a remodelling effort. No management or controls. When I built a 400kW system (bare room) the cost ratio was about 2:1. In that case the construction was done by a state agency, and there were a lot of management and process requirement costs. This did result in a higher-quality room but I think that it probably doubled the construction costs. The room that I am building now would *also* be about 2:1, except for the fact that we are using liquid cooled racks (we have a very low ceiling). This effectively doubles the cost, hence the 1:1 ratio. Fortunately, over the 20 year lifetime that the room will be in use, I expect that at least six generations of cluster will go into it. So in the end the ratio should be at least 6:1. Cheers, Bruce From john.hearns at streamline-computing.com Fri Jul 13 01:58:53 2007 From: john.hearns at streamline-computing.com (John Hearns) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] Cluster Diagram of 500 PC In-Reply-To: References: <49467.88.37.125.98.1184148615.squirrel@www.lri.fr> <469523EE.7050005@myri.com> <171130500707121441r323a1037qe636474e3a13ed3d@mail.gmail.com> Message-ID: <46973ECD.8000608@streamline-computing.com> Mark Hahn wrote: >> Anyway, hop latency in Ethernet is most of times just peanuts in terms of >> latency compared to TCP/IP stack overhead... > > unfortunately - I'm still puzzled why we haven't seen any open, > widely-used, > LAN-tuned non-TCP implementation that reduces the latency. it should be > possible to do ~10 us vs ~40 for a typical MPI-over-Gb-TCP. Well, the SCore implementation which we install on all our clusters does just this. www.pccluster.org In fact, we have one 500 machine cluster which (at the time of install) ranked 167 in the Top 500 and achieved a very high efficiency.
All connected with gigabit ethernet only. http://www.streamline-computing.com/index.php?wcId=76&xwcId=72 -- John Hearns Senior HPC Engineer Streamline Computing, The Innovation Centre, Warwick Technology Park, Gallows Hill, Warwick CV34 6UW Office: 01926 623130 Mobile: 07841 231235 From rgb at phy.duke.edu Fri Jul 13 05:48:02 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sun Jul 27 01:06:11 2008 Subject: [Beowulf] power usage, Intel 5160 vs. AMD 2216 In-Reply-To: <4696A905.9060805@scalableinformatics.com> References: <46966161.5010504@vcu.edu> <4696A294.7070508@vcu.edu> <4696A905.9060805@scalableinformatics.com> Message-ID: On Thu, 12 Jul 2007, Joe Landman wrote: > Mike Davis wrote: >> That's excellent information. I was surprised by how close to the documented >> power use my numbers were. Knowing that even high disk use adds little power >> usage is important to those of us fighting the power/cooling wars. > > Memory dimms are a major consumer. If you have the choice to make between > smaller cheaper DIMMs using more slots, versus fewer larger (and more costly) > DIMMs using fewer slots, the latter will usually consume less power. 8 GB > can be 8x1GB dimms, 4x2GB dimms (or 2x 4GB dimms, though this is still not > cost competitive even factoring in the power). > > The 2GB dimms emit the same heat as the 1 GB dimms. So if you have a 1000 > node cluster, and you use the larger (slightly more expensive) 2GB dimms vs > the 1GB dimms, you will emit somewhat less heat. I haven't done the > analysis, but I bet it would be close to a good tradeoff for TCO. The analysis is easy using Unka Rob's Foolproof Power Cost Estimate Rate: $1/watt/year. This is just a ballpark number, deliberately slightly highballed -- the actual number is probably between $0.60 and $0.80 -- but it depends on the cost per kWh, taxes, efficiency of your AC and temperature of your server room, and so on.
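A minimal sketch of how such a per-watt yearly rate can be derived, assuming an electricity price of $0.08 per kWh and an air-conditioning coefficient of performance (CoP) of 3. Both values are assumptions to be replaced with local figures, and the function name is illustrative only.

```python
HOURS_PER_YEAR = 24 * 365


def cost_per_watt_year(price_per_kwh, ac_cop):
    """Yearly cost in dollars of one watt of continuous load, including cooling."""
    electricity = price_per_kwh * HOURS_PER_YEAR / 1000.0  # $ per W per year
    # An AC with coefficient of performance ac_cop removes ac_cop watts of
    # heat per watt of electricity it consumes.
    cooling = electricity / ac_cop
    return electricity + cooling


# Assumed inputs: $0.08/kWh and a CoP of 3.
rate = cost_per_watt_year(price_per_kwh=0.08, ac_cop=3.0)
print(f"${rate:.2f} per watt per year")
```

With cheaper power or a more efficient AC the rate drops accordingly, which is why rounding up to $1/watt/year is a conservative ballpark.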
It's been a while since I computed it but IIRC it is based on $0.08/KW-hour and an AC CoP between 2 and 3. Then lessee 24*365*0.08 = $700/KW/year, or $0.70/W/year, to which I'm arbitrarily adding $0.30 for AC costs even though it is more likely ~$0.20 -- the extra dime is that slop. If your power costs only $0.06/KW-hour and your AC is super efficient and the outdoor temperature is on average cold and has a CoP of 5, well, adjust accordingly. Anyway, if a DIMM draws an average power of (say) 10W and is expected to be on for 3 years, that means it costs roughly $30 over its lifetime. so if the marginal c