From csamuel at vpac.org Sat Mar 1 12:25:25 2008
From: csamuel at vpac.org (csamuel@vpac.org)
Date: Sun Jul 27 01:06:56 2008
Subject: [Beowulf] Cluster Monitoring Tool
In-Reply-To: <19973042.101204403003202.JavaMail.csamuel@ubuntu>
Message-ID: <12219323.121204403093384.JavaMail.csamuel@ubuntu>

----- "Cally" wrote:

> Hi everyone,

Hi Cally,

> Is there some kinda tool to use, say if I want to see how much of memory
> is being used just for a rendering process.

You don't mention which O/S you're using, so I'm presuming
it's Linux based.

The Linux kernel memory accounting is, umm, sub-optimal for
this sort of stuff, to get an accurate picture you probably
want to look into something like exmap which is both a kernel
module and a user space program. Ubuntu describes it as:

  Exmap is a memory analysis tool which allows you to accurately
  determine how much physical memory and swap is used by individual
  processes and shared libraries on a running system. In particular,
  it accounts for the sharing of memory and swap between different
  processes.

Help is at hand though - 2.6.25 will include [1] Matt Mackall's
page map patches [2] to improve the memory accounting in the kernel.

[1] - http://lwn.net/Articles/267849/
[2] - http://lwn.net/Articles/230975/

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

From csamuel at vpac.org Sat Mar 1 13:02:02 2008
From: csamuel at vpac.org (csamuel@vpac.org)
Date: Sun Jul 27 01:06:56 2008
Subject: [Beowulf] Open source Job Scheduler for Apple Leopard 10.5.2 server that will work with Open Directory?
In-Reply-To: <15809818.381204405232710.JavaMail.csamuel@ubuntu>
Message-ID: <27647177.401204405286579.JavaMail.csamuel@ubuntu>

----- "Prakashan Korambath" wrote:

> Anyone knows an Open source job scheduler for Apple Leopard 10.5.2
> server that will work with Open Directory?
> SGE seems to have intermittent problems, Condor and Torque supports
> only 10.4. Tiger.

It might be worth asking on the Torque development list [1]
about whether anyone there is working with Leopard, I know
that they've just got the SVN trunk version building with no
warnings on OSX, but don't know which version they're testing
with (I've just asked).

I guess there's not much demand for OSX clusters these days..

[1] - http://www.supercluster.org/mailman/listinfo/torquedev

cheers!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

From reuti at staff.uni-marburg.de Sun Mar 2 01:45:06 2008
From: reuti at staff.uni-marburg.de (Reuti)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] Re: python2.4 error when loose MPICH2 TI with Grid Engine
In-Reply-To: 
References: 
Message-ID: 

Hi,

On 22.02.2008 at 09:23, Sangamesh B wrote:

> Dear Reuti & members of beowulf,
>
> I need to execute a parallel job thru grid engine.
>
> MPICH2 is installed with Process Manager:mpd.
>
> Added a parallel environment MPICH2 into SGE:
>
> $ qconf -sp MPICH2
> pe_name            MPICH2
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /share/apps/MPICH2/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args     /share/apps/MPICH2/stopmpi.sh
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
>
> Added this PE to the default queue: all.q.
>
> mpdboot is done. mpd's are running on two nodes.
>
> The script for submitting this job thru sge is:
>
> $ cat subsamplempi.sh
> #!/bin/bash
> #$ -S /bin/bash
> #$ -cwd
> #$ -N Samplejob
> #$ -q all.q
> #$ -pe MPICH2 4
> #$ -e ERR_$JOB_NAME.$JOB_ID
> #$ -o OUT_$JOB_NAME.$JOB_ID
> date
> hostname
> /opt/MPI_LIBS/MPICH2-GNU/bin/mpirun -np $NSLOTS -machinefile $TMP_DIR/machines ./samplempi
> echo "Executed"
> exit 0
>
> The job is getting submitted, but not executing. The error and
> output file contain:
>
> cat ERR_Samplejob.192
> /usr/bin/env: python2.4: No such file or directory
>
> $ cat OUT_Samplejob.192
> -catch_rsh /opt/gridengine/default/spool/compute-0-0/active_jobs/192.1/pe_hostfile
> compute-0-0
> compute-0-0
> compute-0-0
> compute-0-0
> Fri Feb 22 12:57:18 IST 2008
> compute-0-0.local
> Executed
>
> So the problem is coming for python2.4.
>
> $ which python2.4
> /opt/rocks/bin/python2.4
>
> I googled this error. Then created a symbolic link:
>
> # ln -sf /opt/rocks/bin/python2.4 /bin/python2.4
>
> After this also same error is coming.
>
> I guess the problem might be different. i.e. gridengine might not
> getting the link to running mpd.
>
> And the procedure followed by me to configure PE might be wrong.
>
> So, I expect from you to clear my doubts and help me to resolve
> this error.
>
> 1. Is the PE configuration of MPICH2 + grid engine right?

if you want to integrate MPICH2 with MPD it's similar to a PVM
setup. The daemons must be started in start_proc_args on every node
with a dedicated port number per job. You don't say what your
startmpi.sh is doing.

> 2. Without Tight integration, is there a way to run a MPICh2(mpd)
> based job using gridengine?

Yes.

> 3. In smpd-daemon based and daemonless MPICH2 tight integration,
> which one is better?

Depends: if you have just one mpirun per job which will run for
days, I would go for the daemonless startup.
But if you issue many mpirun calls in your jobscript which will
just run for seconds I would go for the daemon based startup, as
the mpirun will be distributed to the slaves faster.

> 4. Can we do mvapich2 tight integration with SGE? Any differences
> with process managers wrt MVAPICH2?

Maybe, if the startup is similar to standard MPICH2.

-- Reuti

> Thanks & Best Regards,
> Sangamesh B

From reuti at staff.uni-marburg.de Sun Mar 2 01:48:53 2008
From: reuti at staff.uni-marburg.de (Reuti)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] Three questions on a new Beowulf Cluster
In-Reply-To: <47BEEF3A.9020806@sps.aero>
References: <47BEEF3A.9020806@sps.aero>
Message-ID: <00C9756F-C215-4934-BF18-498997FEBB4A@staff.uni-marburg.de>

Hi,

On 22.02.2008 at 16:50, John P. Kosky, PhD wrote:

> My company is taking its first foray into the world of HPC with an
> expandable architecture, 16 processor (comprised of quad core
> Opterons), one header node cluster using Infiniband interconnects.
> OS has tentatively been selected as SUSE 64-bit Linux. The
> principal purpose of the cluster is as a tool for spacecraft and
> propulsion design support. The cluster will therefore be running
> the most recent versions of commercially available software -
> initially for FEA and CFD using COMSOL Multiphysics and associated
> packages, NASTRAN, MatLab modules, as well as an internally
> modified and expanded commercial code for materials properties
> prediction, with emphasis on polymer modeling (Accelrys Materials
> Studio). Since we will be repetitively running standard modeling
> codes on this system, we are trying to make the system as user
> friendly as possible... most of our scientists and engineers want
> to use this as a tool, and not have to become cluster experts.
> The company WILL be hiring an IT Sys Admin with good cluster
> experience to support the system, however...
>
> Question 1:
> 1) Does anyone here know of any issues that have arisen running the
> above named commercial packages on clusters using infiniband?
>
> Question 2:
> 2) As far as the MPI for the system is concerned, for the system
> and application requirements described above, would OpenMPI or
> MvApich be better for managing node usage?

none of them will manage the node usage - you have to assemble a
node list for every run by hand. What you might be looking for is a
resource manager like SGE, Torque, LSF, Condor,... and run parallel
jobs under their supervision.

-- Reuti

> ANY help or advice would be greatly appreciated.
>
> Thanks in advance
>
> John
>
> John P. Kosky, PhD
> Director of Technical Development
> Space Propulsion Systems
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From hahn at mcmaster.ca Sun Mar 2 11:40:28 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] IB + 10G eth integrated switches?
Message-ID: 

Hi all,
has anyone experienced switches that have integrated both IB
and 10Geth?
I guess I've seen IB switch vendors talking about EoIB
modules for a while, but I'm curious about how well they
work. mainly I'm curious about providing high-speed IP
connectivity for a cluster that has only IB networking. I
presume this would work by configuring an EoIB interface on
the node, and having the IB switch act like an eth switch (L2?).

thanks, mark hahn.

From kalpana0611 at gmail.com Sun Mar 2 02:21:29 2008
From: kalpana0611 at gmail.com (Cally)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] Re: Cluster Monitoring Tool
Message-ID: 

Yeah, I am using OpenSuse for the 2 node thing, but the cluster in
the lab uses Redhat, thanks a lot.
I am checking out your mail.

From glen.beane at jax.org Sun Mar 2 17:21:46 2008
From: glen.beane at jax.org (Glen Beane)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] Open source Job Scheduler for Apple Leopard 10.5.2 server that will work with Open Directory?
In-Reply-To: <43F64E86355A744E9D51506B6C6783B9021AE68C@EM2.ad.ucla.edu>
References: <02A63D14-3E34-4C0E-A012-D491922AC023@ee.duke.edu> <220FE1C2-C27A-4B94-8060-D4D78DFCF50A@staff.uni-marburg.de> <43F64E86355A744E9D51506B6C6783B9021AE68C@EM2.ad.ucla.edu>
Message-ID: <47CB52AA.9070306@jax.org>

Korambath, Prakashan wrote:
> Anyone knows an Open source job scheduler for Apple Leopard 10.5.2
> server that will work with Open Directory? SGE seems to have
> intermittent problems, Condor and Torque supports only 10.4. Tiger.
> Thanks.

Hello Prakashan,

I am a TORQUE developer and Apple user. TORQUE 2.2.x and up should
work on Leopard, although 2.2.x requires passing
--disable-gcc-warnings to configure, as we compile with
-Wall -pedantic -Werror by default and there are a few harmless
warnings generated. If you are compiling the latest 2.3.0
(development) snapshot then these warnings are fixed (some were
Apple's fault, the code was fine, and this has been fixed in the 2.5
and 3.0 Developer Tools). An official 2.3.0 release should be out
soon.

If you have any problems running TORQUE on OS X Leopard I would like
to know so I can improve our support for OS X

-- 
Glen L. Beane
Software Engineer
The Jackson Laboratory
Phone (207) 288-6153

From Shainer at mellanox.com Sun Mar 2 22:34:44 2008
From: Shainer at mellanox.com (Gilad Shainer)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] IB + 10G eth integrated switches?
In-Reply-To: 
Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784FF826F6@mtiexch01.mti.com>

Hi Mark,

> Hi all,
> has anyone experienced switches that have integrated both IB
> and 10Geth?
> I guess I've seen IB switch vendors talking about EoIB
> modules for a while, but I'm curious about how well they
> work. mainly I'm curious about providing high-speed IP
> connectivity for a cluster that has only IB networking. I
> presume this would work by configuring an EoIB interface on
> the node, and having the IB switch act like an eth switch (L2?).

Most of the IB switch vendors have gateway solutions to connect
InfiniBand based clusters to 10GigE networks, and the gateway changes
L2 of IB to L2 of Eth. Those gateways have been installed in many
places and work great.

Gilad.

> thanks, mark hahn.

From hahn at mcmaster.ca Sun Mar 2 22:43:35 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] IB + 10G eth integrated switches?
In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784FF826F6@mtiexch01.mti.com>
References: <9FA59C95FFCBB34EA5E42C1A8573784FF826F6@mtiexch01.mti.com>
Message-ID: 

> of IB to L2 of Eth. Those gateways have been installed in many places
> and work great.

thanks, but that's not what I'm asking for. I am interested in actual
experience from purchasers and users, not vendor/marketing praise.
for instance, if you stream TCP through such a route, can you
saturate the link? how well does NFS work through such a network?
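
One way to put a number on the "can you saturate the link" question is
to stream a fixed amount of data over a single TCP connection and time
it. The Python sketch below is not from the thread; it is only an
illustration under stated assumptions. The names measure_tcp_stream and
_sink are hypothetical, and both ends run on the loopback interface so
that the example is self-contained. For a real test of an IB-to-Ethernet
gateway the receiving end would have to run on a host on the 10GigE side
and the sender on an IB-only node, and a single Python stream may well
be CPU-bound before the link is.

```python
import socket
import threading
import time


def _sink(server_sock, result):
    """Accept one connection and count every byte until the sender closes."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(1 << 16)
            if not chunk:  # empty read: the sender closed the connection
                break
            total += len(chunk)
    result["bytes"] = total


def measure_tcp_stream(host="127.0.0.1", total_bytes=8 * 1024 * 1024,
                       chunk_size=1 << 16):
    """Stream total_bytes over one TCP connection.

    Returns (bytes_received, megabytes_per_second). Assumption: sender and
    receiver share this process and talk over loopback, which measures the
    local TCP stack rather than any real network path.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS choose a free port
    server.listen(1)
    port = server.getsockname()[1]

    result = {}
    receiver = threading.Thread(target=_sink, args=(server, result))
    receiver.start()

    payload = b"\0" * chunk_size
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        while sent < total_bytes:
            n = min(chunk_size, total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    receiver.join()  # stop the clock only after the receiver has drained it all
    elapsed = time.perf_counter() - start
    server.close()

    return result["bytes"], result["bytes"] / elapsed / 1e6
```

Calling measure_tcp_stream() returns the byte count seen by the receiver
and the achieved rate in MB/sec, which can then be compared against the
nominal link speed. Established tools such as iperf or netperf do the
same job far more carefully; the sketch only shows what "streaming TCP
through the route" actually measures.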
From deadline at eadline.org Mon Mar 3 06:23:08 2008
From: deadline at eadline.org (Douglas Eadline)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] NYCA-HUG Meeting (geo specific)
Message-ID: <40232.192.168.1.1.1204554188.squirrel@mail.eadline.org>

Ignore this message if you are not in the
New York City Metro area.

For those interested, this month's meeting will
be on March 6th at Le Figaro Cafe. Mario Juric
from the School of Natural Sciences, Institute for
Advanced Study at Princeton will be talking about
computing with GPGPU chips. (General Purpose Graphical
Processing Units i.e. video cards)

More information, directions, and links at
http://www.linux-mag.com/nyca-hug

See you there.

-- 
Doug

From ppk at ats.ucla.edu Mon Mar 3 06:36:44 2008
From: ppk at ats.ucla.edu (Korambath, Prakashan)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] Open source Job Scheduler for Apple Leopard 10.5.2 server that will work with Open Directory?
References: <02A63D14-3E34-4C0E-A012-D491922AC023@ee.duke.edu> <220FE1C2-C27A-4B94-8060-D4D78DFCF50A@staff.uni-marburg.de> <43F64E86355A744E9D51506B6C6783B9021AE68C@EM2.ad.ucla.edu> <47CB52AA.9070306@jax.org>
Message-ID: <43F64E86355A744E9D51506B6C6783B9021AE6C6@EM2.ad.ucla.edu>

Thank you Glen. I will try Torque and let you know if there are any
problems.

Prakashan

From glen.beane at jax.org Mon Mar 3 07:29:37 2008
From: glen.beane at jax.org (Glen Beane)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] Open source Job Scheduler for Apple Leopard 10.5.2 server that will work with Open Directory?
In-Reply-To: <43F64E86355A744E9D51506B6C6783B9021AE6C6@EM2.ad.ucla.edu>
References: <02A63D14-3E34-4C0E-A012-D491922AC023@ee.duke.edu> <220FE1C2-C27A-4B94-8060-D4D78DFCF50A@staff.uni-marburg.de> <43F64E86355A744E9D51506B6C6783B9021AE68C@EM2.ad.ucla.edu> <47CB52AA.9070306@jax.org> <43F64E86355A744E9D51506B6C6783B9021AE6C6@EM2.ad.ucla.edu>
Message-ID: <47CC1961.7080505@jax.org>

Korambath, Prakashan wrote:
> Thank you Glen. I will try Torque and let you know if there are any
> problems.

please note that the job memory usage reported by Torque will be
incorrect on OS X - I just got some fixes for that into subversion a
few days ago. These fixes will be released with 2.3.0 (and in the
next 2.3.0 development snapshot)

I also backported these job memory reporting fixes to the 2.2 branch
and those will be available to 2.2.x users when 2.2.2 is released.

-- 
Glen L. Beane
Software Engineer
The Jackson Laboratory
Phone (207) 288-6153

From mathog at caltech.edu Mon Mar 3 12:47:06 2008
From: mathog at caltech.edu (David Mathog)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] mysterious slow disk
Message-ID: 

One node has developed a "slow disk", although at this point I'm not
at all sure that the disk is actually at fault. Have any of you ever
seen something like this:

1. hdparm -t -T /dev/hda on the slow node is typically:

 Timing cached reads:         504 MB in 2.00 seconds = 251.61 MB/sec
 Timing buffered disk reads:  104 MB in 3.55 seconds = 29.29 MB/sec

but the second line varies A LOT, down to 24 MB/sec and up to over
30. However the same test on the other nodes varies only a little:

 Timing cached reads:         514 MB in 2.01 seconds = 256.03 MB/sec
 Timing buffered disk reads:  124 MB in 3.04 seconds = 40.74 MB/sec
(plus or minus about 1 MB/sec on the second line).

2. hdparm -i -v and -i -m are identical on all machines, except
for serial number. Here is -i -v on the slow one:

/dev/hda:
 multcount    = 16 (on)
 IO_support   =  1 (32-bit)
 unmaskirq    =  1 (on)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    = 256 (on)
 geometry     = 65535/16/63, sectors = 78165360, start = 0

 Model=WDC WD400BB-00DEA0, FwRev=05.03E05, SerialNo=WD-WMAD11736294
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
 BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=78165360
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 udma3 udma4 *udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified: ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3
 ATA/ATAPI-4 ATA/ATAPI-5

3. Western Digital diagnostics and smartctl show nothing wrong with
the disk.
It has no bad sectors or other errors logged. The smartctl
tests do take longer to complete than on the other disks.

4. DMA is working (at least partially), since turning that off drops
the hdparm test down to about 4 MB/sec.

5. Opened the case and all jumpers were as they should be. Power supply
tested good.

6. dmesg from the slow node and a normal node shows no significant
differences. (Changes in the 3rd digit after the decimal of the
bogoMIPS value, for instance.)

The only oddity I've found was that the setting "32 bit I/O" in the BIOS
was disabled for some reason on the slow node. Changing it to enabled
made no difference (even after several reboots, cold and warm.) Is it
possible that the OS has the earlier setting hidden away somewhere and
is still using it? This was particularly weird because hdparm showed
 IO_Support = 1 (32-bit)
even when this BIOS bit was disabled.

The speed issue initially turned up in a run where a certain program was
required to allocate about 1.6 GB of memory (at least 0.6 GB of which had
to come out of the 2 GB swap, since there was only 1 GB of RAM.) That
large region was then ordered with qsort(). Bizarrely, this took
forever on one node (hours longer than on any other node, with similar
sized data), and when it finished the sort the resulting binary data was
written to disk at only 0.5 MB/sec. Yes, 500 kilobytes/sec. Nothing
else was using CPU time. Prior to that one run this node did
nothing to draw my attention to it as a "slow node".

This is on one of the notorious Tyan S2466 boards. I'm beginning to
wonder if perhaps it now has a bit stuck somewhere in the BIOS, in which
case maybe wiping the BIOS settings and redoing them will fix it.
I already tried powering it off for 15 minutes unplugged, but that
did not help.

Also the whole cluster had to be powered down for about 15 minutes
in the morning before this started for A/C service. If the battery
on that board is iffy it might explain how the "32 bit I/O" became
disabled.
However, it did not reset again on a subsequent long power down.

Thanks,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From mathog at caltech.edu Mon Mar 3 14:25:24 2008
From: mathog at caltech.edu (David Mathog)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] Re: mysterious slow disk
Message-ID: 

Bruno Coutinho wrote:

> What is the Reallocated_Sector_Ct in smartctl? If it's non zero, you will
> have badblocks soon.
>
> How does the Seek_Error_Rate compare with other HDs?

Here is part of smartctl -a for the slow one, with some columns
edited out so that it will fit without wrapping. The disk is old but
it isn't throwing any conventional sorts of errors:

ID# ATTRIBUTE_NAME           FLAG    UPDATED  WHEN_FAILED  RAW_VALUE
  1 Raw_Read_Error_Rate      0x000b  Always   -            0
  3 Spin_Up_Time             0x0007  Always   -            2658
  4 Start_Stop_Count         0x0032  Always   -            88
  5 Reallocated_Sector_Ct    0x0033  Always   -            0
  7 Seek_Error_Rate          0x000b  Always   -            0
  9 Power_On_Hours           0x0032  Always   -            44036
 10 Spin_Retry_Count         0x0013  Always   -            0
 11 Calibration_Retry_Count  0x0013  Always   -            0
 12 Power_Cycle_Count        0x0032  Always   -            88
196 Reallocated_Event_Count  0x0032  Always   -            0
197 Current_Pending_Sector   0x0012  Always   -            0
198 Offline_Uncorrectable    0x0012  Always   -            0
199 UDMA_CRC_Error_Count     0x000a  Always   -            0
200 Multi_Zone_Error_Rate    0x0009  Offline  -            0

Regards,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From carsten.aulbert at aei.mpg.de Mon Mar 3 22:57:24 2008
From: carsten.aulbert at aei.mpg.de (Carsten Aulbert)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] mysterious slow disk
In-Reply-To: 
References: 
Message-ID: <47CCF2D4.9090701@aei.mpg.de>

Hi David,

David Mathog wrote:
> 2. hdparm -i -v and -i -m are identical on all machines, except
> for serial number. Here is -i -v on the slow one:

Looks good. Can you run hdparm -I /dev/hda?
And there please have a look at the line with acoustic mgmt:

 Recommended acoustic management value: 128, current value: 128

(this line is from my laptop). Run it on the faster node and compare
those. Maybe it's just that somehow acoustic mgmt is enabled on one
and not on the other.

HTH

Carsten

From asabigue at fing.edu.uy Tue Mar 4 02:14:19 2008
From: asabigue at fing.edu.uy (ariel sabiguero yawelak)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] mysterious slow disk
In-Reply-To: 
References: 
Message-ID: <47CD20FB.2000605@fing.edu.uy>

Hi!
I found a situation pretty similar to this a few years ago in a system
with shared video memory. Whenever you allocate less than 32MB of video
memory (even without X server running) the performance drops incredibly
low, without any kind of error, warning or whatever. The system seems
to stall erratically, but when it continues working, everything seems
ok. The performance problem was related to every I/O operation we
performed, not only HDD related.
We were able to detect the problem because we were "tuning" the
configuration, trying to free as much memory and resources as possible.
After we tuned the system, it just started to crawl and when we rolled
back, we realized the source of the problem: "If it works, don't fix it"

I hope this helps.

best regards

ariel

From steffen.grunewald at aei.mpg.de Tue Mar 4 00:39:18 2008
From: steffen.grunewald at aei.mpg.de (Steffen Grunewald)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] Re: mysterious slow disk
In-Reply-To: 
References: 
Message-ID: <20080304083918.GE23076@casco.aei.mpg.de>

On Mon, Mar 03, 2008 at 02:25:24PM -0800, David Mathog wrote:
> Bruno Coutinho wrote:
>
> > What is the Reallocated_Sector_Ct in smartctl? If it's non zero, you will
> > have badblocks soon.
> >
> > How does the Seek_Error_Rate compare with other HDs?
>
> Here is part of smartctl -a for the slow one, with some columns
> edited out so that it will fit without wrapping. The disk is old but
> it isn't throwing any conventional sorts of errors:
>
> ID# ATTRIBUTE_NAME           FLAG    UPDATED  WHEN_FAILED  RAW_VALUE
>   1 Raw_Read_Error_Rate      0x000b  Always   -            0
>   3 Spin_Up_Time             0x0007  Always   -            2658
>   4 Start_Stop_Count         0x0032  Always   -            88
>   5 Reallocated_Sector_Ct    0x0033  Always   -            0
>   7 Seek_Error_Rate          0x000b  Always   -            0
>   9 Power_On_Hours           0x0032  Always   -            44036
>  10 Spin_Retry_Count         0x0013  Always   -            0
>  11 Calibration_Retry_Count  0x0013  Always   -            0
>  12 Power_Cycle_Count        0x0032  Always   -            88
> 196 Reallocated_Event_Count  0x0032  Always   -            0
> 197 Current_Pending_Sector   0x0012  Always   -            0
> 198 Offline_Uncorrectable    0x0012  Always   -            0
> 199 UDMA_CRC_Error_Count     0x000a  Always   -            0
> 200 Multi_Zone_Error_Rate    0x0009  Offline  -            0

Looks clean.

(Sorry, but I probably missed part of the thread:)
Did you check the (U)DMA mode? (hdparm -di /dev/hd*)
Would a full reboot (power down, power up) change the performance?

Steffen
-- 
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html

From p2s2-chairs at mcs.anl.gov Tue Mar 4 13:03:37 2008
From: p2s2-chairs at mcs.anl.gov (p2s2-chairs@mcs.anl.gov)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] [p2s2-announce] CFP: Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2)
Message-ID: <20080304210337.BAC4246218@shakey.mcs.anl.gov>

CALL FOR PAPERS
===============

First International Workshop on Parallel Programming Models and
Systems Software for High-end Computing (P2S2)
(http://www.mcs.anl.gov/events/workshops/p2s2)

Sep. 8th, 2008

To be held in conjunction with
ICPP-08: The 27th International Conference on Parallel Processing
Sep. 8-12, 2008
Portland, Oregon, USA

SCOPE
-----

The goal of this workshop is to bring together researchers and
practitioners in parallel programming models and systems software for
high-end computing systems. Please join us in a discussion of new
ideas, experiences, and the latest trends in these areas at the
workshop.

TOPICS OF INTEREST
------------------

The focus areas for this workshop include, but are not limited to:

* Programming models and their high-performance implementations
  o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel
  o Other Hybrid Programming Models
* Systems software for scientific and enterprise computing
  o Communication sub-subsystems for high-end computing
  o High-performance File and storage systems
  o Fault-tolerance techniques and implementations
  o Efficient and high-performance virtualization and other
    management mechanisms
* Tools for Management, Maintenance, Coordination and Synchronization
  o Software for Enterprise Data-centers using Modern Architectures
  o Job scheduling libraries
  o Management libraries for large-scale systems
  o Toolkits for process and task coordination on modern platforms
* Performance evaluation, analysis and modeling of emerging computing
  platforms

SUBMISSION INSTRUCTIONS
-----------------------

Submissions should be in PDF format in U.S. Letter size paper. They
should not exceed 8 pages (all inclusive). Submissions will be judged
based on relevance, significance, originality, correctness and
clarity.

DATES AND DEADLINES
-------------------

Paper Submission: April 11th, 2008
Author Notification: May 20th, 2008
Camera Ready: June 2nd, 2008

PROGRAM CHAIRS
--------------

* Pavan Balaji (Argonne National Laboratory)
* Sayantan Sur (IBM Research)

STEERING COMMITTEE
------------------

* William D. Gropp (University of Illinois Urbana-Champaign)
* Dhabaleswar K. Panda (Ohio State University)
* Vijay Saraswat (IBM Research)

PROGRAM COMMITTEE
-----------------

* David Bernholdt (Oak Ridge National Laboratory)
* Ron Brightwell (Sandia National Laboratory)
* Wu-chun Feng (Virginia Tech)
* Richard Graham (Oak Ridge National Laboratory)
* Hyun-wook Jin (Konkuk University, South Korea)
* Sameer Kumar (IBM Research)
* Doug Lea (State University of New York at Oswego)
* Jarek Nieplocha (Pacific Northwest National Laboratory)
* Scott Pakin (Los Alamos National Laboratory)
* Vivek Sarkar (Rice University)
* Rajeev Thakur (Argonne National Laboratory)
* Pete Wyckoff (Ohio Supercomputing Center)

If you have any questions, please contact us at
p2s2-chairs@mcs.anl.gov

-------------------------------------------------------------------------
If you do not want to receive any more announcements about the P2S2
workshop, please send an email to majordomo@mcs.anl.gov with the email
body "unsubscribe p2s2-announce".
-------------------------------------------------------------------------

From csamuel at vpac.org Wed Mar 5 05:55:59 2008
From: csamuel at vpac.org (csamuel@vpac.org)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] mysterious slow disk
In-Reply-To: 
Message-ID: <502938.641204725340009.JavaMail.csamuel@ubuntu>

----- "David Mathog" wrote:

> One node has developed a "slow disk", although at this point I'm not
> at all sure that the disk is actually at fault.

As an extreme suggestion - how about swapping the drive
with a known good node and rebuilding both ?

If the problem moves with the drive then you'll know
it's that, and if it stays then it's likely to be the
MB or the cabling.

Of course there's always the chance that the problem
will disappear completely or you'll have two slow
drives instead of one.. :-)

cheers!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

From Bogdan.Costescu at iwr.uni-heidelberg.de Wed Mar 5 07:00:30 2008
From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] mysterious slow disk
In-Reply-To: <502938.641204725340009.JavaMail.csamuel@ubuntu>
References: <502938.641204725340009.JavaMail.csamuel@ubuntu>
Message-ID: 

On Thu, 6 Mar 2008, csamuel@vpac.org wrote:

> If the problem moves with the drive then you'll know it's that, and
> if it stays then it's likely to be the MB or the cabling.

Sorry for the late jump... IIRC, the OP had dual Athlon based nodes;
there used to be some issues of slow I/O related to the usage of
energy saving features of the CPUs and/or chipset. Have you already
checked for such a situation ?

Another thing that comes to my mind: I have seen some SATA disks (I
think from Hitachi) which have developed some bad sectors, which could
be seen in the smartctl output. I have run the DFT on them and after
choosing to rewrite the whole disk (I have forgotten the exact option
name...) the disk was shown as good and there was no sign of bad
sectors in the smartctl output. However, the access to disk was
slower, especially the writes, for which the speed was approximately
halved. So, is it sure that the disk was never subjected to a
manufacturer "diagnose and repair" tool ?

-- 
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850
E-mail: bogdan.costescu@iwr.uni-heidelberg.de

From mathog at caltech.edu Wed Mar 5 11:44:59 2008
From: mathog at caltech.edu (David Mathog)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] Re: mysterious slow disk
Message-ID: 

I don't think I'm going to solve this one :-(.

Bruno Coutinho wrote:
>I noticed that some disks can't get full bandwidth at once.
>A way to monitor disk throughput for a longer time is to use dd
>to copy a partition to /dev/null.

The disk is also slow for long dd operations by about the same factor.

SLOW:
% sync; \
  TIME=`accudate -t0` ; \
  dd if=/dev/zero count=4000000 of=/scratch/tmp/foo.dat ; \
  sync; \
  accudate -ds $TIME
4000000+0 records in
4000000+0 records out
2048000000 bytes (2.0 GB) copied, 72.4961 seconds, 28.2 MB/s
0000105.343

vs.

FAST:
% sync; \
  TIME=`accudate -t0` ; \
  dd if=/dev/zero count=4000000 of=/scratch/tmp/foo.dat ; \
  sync; \
  accudate -ds $TIME
4000000+0 records in
4000000+0 records out
2048000000 bytes (2.0 GB) copied, 59.6326 seconds, 34.3 MB/s
0000075.422

(27 MB/s sustained on the fast node vs. 19.4 MB/s sustained on the slow
one, taking the 2048000000 bytes over the total elapsed time including
the final sync.)

Carsten Aulbert wrote:

>Looks good. Can you run hdparm -I /dev/hda?
>And there please have a look at the line with acoustic mgmt:

hdparm -I was identical on fast and slow systems (except for
serial numbers).

hdparm -I /dev/hda

/dev/hda:

ATA device, with non-removable media
        Model Number:       WDC WD400BB-00DEA0
        Serial Number:      WD-WMAD11736294
        Firmware Revision:  05.03E05
Standards:
        Supported: 5 4 3
        Likely used: 6
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:   78165360
        device size with M = 1024*1024:       38166 MBytes
        device size with M = 1000*1000:       40020 MBytes (40 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        bytes avail on r/w long: 40
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 128, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    DOWNLOAD_MICROCODE
                SET_MAX security extension
                Automatic Acoustic Management feature set
           *    Device Configuration Overlay feature set
           *    SMART error logging
           *    SMART self-test
Security:
                supported
        not     enabled
        not     locked
                frozen
        not     expired: security count
        not     supported: enhanced erase
HW reset results:
        CBLID- above Vih
        Device num = 0 determined by CSEL
Checksum: correct

All three Acoustic management settings of 0, 128, 254 were tried.
128 made it slightly slower still, 0 or 254 were apparently equivalent,
and it was at 0 to start with. (The only difference between 0 and 254
is that the former unchecks the "Automatic Acoustic Management feature
set" line.)

ariel sabiguero yawelak wrote:

>I found a situation pretty similar to this a few years ago in a system
>with shared video memory.

The system has a separate graphics card.

As a final shot the case was opened again and all of the following
checked, none of which made any difference and/or were different from
other systems:

1. motherboard and disk jumpers
2. IDE cable
3. voltage on the power connector to the drive
4. checked power supply with two testers (both showed PS in spec)
5. cleared the BIOS with the CMOS jumper, loaded defaults, changed the
   few settings that were not at default to match the other systems.
6. moved cable from the primary IDE socket to the secondary IDE socket
7. Observed the exposed running disk. The amount of vibration and
   temperature were typical, and there were no unusual noises.

Ran the Stream benchmark on both the slow and normal systems and it
scored the same. The hdparm -T test is also the same on the slow
system as on the fast ones, it is only -t which is slow. Seems like
there is something on the disk itself which is slow and it isn't a CPU
speed, memory speed, or even IDE bus speed issue.

The disk is apparently going, but it is an odd way to fail. I almost
wonder if it has not stepped down from 7200 RPM to 5400 RPM.
That ratio is 0.75, and the speed ratio in the dd test was .72, which is
pretty close. It spins up with no problems though.

Thanks all,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From timattox at open-mpi.org Thu Mar 6 12:59:26 2008
From: timattox at open-mpi.org (Tim Mattox)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] Re: Setting up a new Beowulf cluster
In-Reply-To: <5721d9d70802190700m2a356e5fo3d637308e2d9a34d@mail.gmail.com>
References: <5721d9d70802190700m2a356e5fo3d637308e2d9a34d@mail.gmail.com>
Message-ID:

One important consideration that I didn't see mentioned when skimming
this thread on the beowulf list is what software to use to manage the
cluster. You will save yourself a LOT of time using good cluster
management software. I recommend PERCEUS (http://www.perceus.org/)
but there are others you can try, such as OSCAR
(http://oscar.openclustergroup.org/).

On Tue, Feb 19, 2008 at 10:00 AM, Berkley Starks wrote:
> Thank you all for the help and support here. With what has been presented
> here, and sound considerations, we have decided on a home for our Beowulf
> cluster. The room is already sound proofed, and well air conditioned. As
> for people worrying about noise, it will be housed with our vacuum chamber,
> so those going into the room and doing stuff are already used to a little
> bit of noise.
>
> The floor is rated to hold more than enough computers and the AC in there is
> phenomenal. I just finished meeting with campus physical facilities the
> other day and have got the budget requisitioned and approved to allow us
> independent AC control of the room.
>
> Right now we are seeing how much money can be appropriated for the actual
> construction of the cluster.
>
> Thank you all so much for your input and support so far. It has helped a
> lot.
>
> Berkley Starks
>
>
> On Feb 14, 2008 9:39 AM, David Mathog wrote:
> >
> > Jim Lux wrote:
> >
> > > >>quiet down a rack because to first order sound insulation == heat
> > > >>insulation. \
> > >
> > > Actually, no.. good acoustic isolation is not good thermal
> > > isolation. Sure, things like fiberglass batts provide thermal
> > > insulation and also (slightly) attenuate high frequencies.
> >
> > I guess I should have used => or some other "implies". Sound insulators
> > tend to be good heat insulators, heat insulators are generally not good
> > sound insulators.
> >
> > I spent way too long trying to quiet down a rack when it had to live in
> > a classroom. Mass loaded vinyl on all 4 sides worked fairly well
> > to stop the noise coming out that way, but then it just turned into a
> > big speaker enclosure and directed nearly as much sound out the fan
> > holes, where it bounced off the ceiling and floor. And the rack exhaust
> > fans (2 very high capacity 120mm fans on the top) were not able to keep
> > it cool when it was fully sound insulated. The rated capacity
> > of those two fans was more than the sum of all the little ones in the
> > nodes, but the air flow was too restricted, I think mostly by the narrow
> > space between the node's front panels and the front insulator panel.
> > Thankfully it finally moved to a machine room and the noise problem went
> > away.
> >
> > Anyway, it is much easier to sound insulate a room than it is a single
> > noisy rack.
> >
> > David Mathog
> > mathog@caltech.edu
> > Manager, Sequence Analysis Facility, Biology Division, Caltech

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmattox@gmail.com || timattox@open-mpi.org
    I'm a bright... http://www.the-brights.net/

From peter.skomoroch at gmail.com Fri Mar 7 08:21:00 2008
From: peter.skomoroch at gmail.com (Peter Skomoroch)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] VMC - Virtual Machine Console
Message-ID:

I used Doug's BPS package to benchmark a virtual cluster on Amazon EC2,
and was hoping the beowulf list could give their feedback on the results
and feasibility of using this for on-demand clusters. This approach is
currently being used to run some MPI code which is tolerant of poor
latency, i.e. mpiBLAST, Monte Carlo runs, etc.

You get gigabit ethernet on EC2, but the latency from netpipe seems to
be an order of magnitude higher than Doug's Kronos example on the
Cluster Monkey page:

Amazon EC2 Latency: 0.000492 seconds (492 microseconds)
Kronos Latency:     0.000029 seconds (29 microseconds)

Full Results/Charts for a "small" cluster of two extra-large nodes here
(I just used the default BPS config with MPICH2):

http://www.datawrangling.com/media/BPS-AmazonEC2-xlarge-run-1/index.html
http://www.datawrangling.com/media/BPS-AmazonEC2-xlarge-run-2/index.html

The unixbench results are misleading on a VM, so I left those out.
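
A quick check of the "order of magnitude" remark, as a minimal sketch. It assumes the two netpipe latency figures quoted above are in seconds (values of that size only make sense in that unit); the constant and function names are illustrative and not part of BPS or NetPIPE.

```python
# Latency figures as quoted in the post; ASSUMED to be in seconds.
EC2_LATENCY_S = 0.000492
KRONOS_LATENCY_S = 0.000029


def to_microseconds(seconds):
    """Convert a latency in seconds to microseconds."""
    return seconds * 1e6


def latency_ratio(slow_s, fast_s):
    """How many times larger the slow latency is than the fast one."""
    return slow_s / fast_s


if __name__ == "__main__":
    ratio = latency_ratio(EC2_LATENCY_S, KRONOS_LATENCY_S)
    print(f"EC2:    {to_microseconds(EC2_LATENCY_S):.0f} us")
    print(f"Kronos: {to_microseconds(KRONOS_LATENCY_S):.0f} us")
    print(f"EC2 latency is about {ratio:.0f}x the Kronos latency")
```

Under that assumption the EC2 figure is somewhat more than ten times the Kronos one, which fits the description of an order of magnitude.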
Others have verified the performance mentioned in the EC2 documentation:
"One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2
GHz 2007 Opteron or 2007 Xeon processor." Some bonnie results are here:

http://blog.dbadojo.com/2007/10/bonnie-io-benchmark-vs-ec2.html

The cluster is launched and configured using some python scripts and a
custom beowulf Amazon Machine Image (AMI), which is basically a Xen
image configured to run on EC2. You end up paying 80 cents/hour for
8 EC2 Compute Units (4 virtual cores) with 15 GB RAM, and can scale that
up to 100 or more instances if you need to. I'm cleaning up the code,
and will post it on my blog if anyone wants to try it out. I think this
could be a cost effective path for people who, for whatever reason,
can't build/use a dedicated cluster.

Here are the specifications for each instance:

Extra Large Instance:

    15 GB memory
    8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
    1,690 GB instance storage (4 x 420 GB plus 10 GB root partition)
    64-bit platform
    I/O Performance: High
    Price: $0.80 per instance hour

-Pete

> There are plenty of parallel chores that are tolerant of poor latency --
> the whole world of embarrassingly parallel computations plus some
> extension up into merely coarse grained, not terribly synchronous real
> parallel computations.
>
> VMs can also be wonderful for TEACHING clustering and for managing
> "political" problems. ... Having any sort of access to a high-latency Linux VM
> node running on a Windows box beats the hell out of having no node at
> all or having to port one's code to work under Windows.
>
> We can therefore see that there are clearly environments where the bulk
> of the work being done is latency tolerant and where VMs may well have
> benefits in administration and security and fault tolerance and local
> politics that make them a great boon in clustering, just as there are
> without question computations for which latency is the devil and any
> suggestion of adding a layer of VM latency on top of what is already
> inherent to the device and minimal OS will bring out the peasants with
> pitchforks and torches. Multiboot systems, via grub and local
> provisioning or PXE and remote e.g. NFS provisioning is also useful but
> is not always politically possible or easy to set up.
>
> It is my hope that folks working on both sorts of multienvironment
> provisioning and sysadmin environments work hard and produce spectacular
> tools. I've done way more work than I care to setting up both of these
> sorts of things. It is not easy, and requires a lot of expertise.
> Hiding this detail and expertise from the user would be a wonderful
> contribution to practical clustering (and of course useful in the HA
> world as well).

-- 
Peter N. Skomoroch
peter.skomoroch@gmail.com
http://www.datawrangling.com

From landman at scalableinformatics.com Fri Mar 7 08:57:35 2008
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] VMC - Virtual Machine Console
In-Reply-To:
References:
Message-ID: <47D173FF.6030001@scalableinformatics.com>

Peter Skomoroch wrote:

> Extra Large Instance:
>
>     15 GB memory
>     8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
>     1,690 GB instance storage (4 x 420 GB plus 10 GB root partition)
>     64-bit platform
>     I/O Performance: High

Note: minor criticism, but overall, nice results.
Your bonnie results are worth a quick comment. Any time you have bonnie
or IOzone (or other IO benchmarks) testing file sizes less than RAM
size, you are not actually measuring disk IO. This is cache speed, pure
and simple. Either page/buffer cache, or RAID cache, or whatever.

We have had people tell us to our face that their 2GB file results (on a
16 GB RAM machine) were somehow indicative of real file performance,
when, if they walked over to the units they were testing, they would
have noticed the HD lights simply not blinking ... Yeah, an amusing
beer story (the longer version of it), but a problem nonetheless.

Your 1GB file data is likely more representative, but with 15 GB ram,
you need to be testing 30-60 GB files.

Not trying to be a marketing guy here or anything like that ... we test
our JackRabbit units with 80GB to 1.3TB sized files. We see (sustained)
750 MB/s - 1.3 GB/s in these tests. We also note some serious issues
with the linux buffer cache and multiple RAID controllers (buffer cache
appears to serialize access). We do this as we actually want to measure
disk performance, and not buffer cache performance.

That criticism aside, nice results. It shows what a "cloud" can do.

> Price: $0.80 per instance hour

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615

From peter.skomoroch at gmail.com Fri Mar 7 09:05:03 2008
From: peter.skomoroch at gmail.com (Peter Skomoroch)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] VMC - Virtual Machine Console
In-Reply-To: <47D173FF.6030001@scalableinformatics.com>
References: <47D173FF.6030001@scalableinformatics.com>
Message-ID:

Joe, thanks for the feedback. The bonnie results were not actually mine,
I was just pointing to some numbers run by Paul Moen.
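
Joe's rule of thumb above can be written down as a small sizing helper. This is only a sketch: the 2x to 4x multipliers are inferred from his "30-60 GB files for 15 GB ram" example rather than stated by him as a general rule, and the function names are made up for illustration (they are not part of bonnie++ or IOzone).

```python
def benchmark_file_size_range_gb(ram_gb, low_factor=2, high_factor=4):
    """Suggested (min, max) benchmark file size in GB for a given RAM size.

    The default factors are an ASSUMPTION inferred from the example in the
    thread: 15 GB of RAM -> test with 30-60 GB files.
    """
    return (ram_gb * low_factor, ram_gb * high_factor)


def is_cache_bound(file_gb, ram_gb):
    """True if the test file fits in RAM, so the run mostly measures cache."""
    return file_gb <= ram_gb


if __name__ == "__main__":
    low, high = benchmark_file_size_range_gb(15)
    print(f"15 GB RAM: test with {low}-{high} GB files")
    print("2 GB file on a 16 GB machine cache bound?", is_cache_bound(2, 16))
    print("30 GB file on a 15 GB machine cache bound?", is_cache_bound(30, 15))
```

With bonnie++ the resulting size would go to the -s option (given in MB), with -r set to the machine's RAM size.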
Your 1GB file data is likely more representative, but with 15 GB ram, > you need to be testing 30-60 GB files. > I'll try to tweak the BPS bonnie tests to run some large files... On Fri, Mar 7, 2008 at 11:57 AM, Joe Landman < landman@scalableinformatics.com> wrote: > Peter Skomoroch wrote: > > > Extra Large Instance: > > > > 15 GB memory > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units > each) > > 1,690 GB instance storage (4 x 420 GB plus 10 GB root partition) > > 64-bit platform > > I/O Performance: High > > Note: minor criticism, but overall, nice results. > > Looking over your bonnie results is worth a quick comment. Any time you > have bonnie or IOzone (or other IO benchmarks) which are testing file > sizes less than ram size, you are not actually measuring disk IO. This > is cache speed pure and simple. Either page/buffer cache, or RAID > cache, or whatever. > > We have had people tell us to our face that their 2GB file results (on a > 16 GB RAM machine) were somehow indicative of real file performance, > when, if they walked over to the units they were testing, they would > have noticed the HD lights simply not blinking ... Yeah, an amusing > beer story (the longer version of it), but a problem none-the-less. > > Your 1GB file data is likely more representative, but with 15 GB ram, > you need to be testing 30-60 GB files. > > Not trying to be a marketing guy here or anything like that ... we test > our JackRabbit units with 80GB to 1.3TB sized files. We see (sustained) > 750 MB/s - 1.3 GB/s in these tests. We also note some serious issues > with the linux buffer cache and multiple RAID controllers (buffer cache > appears to serialize access). We do this as we actually want to measure > disk performance, and not buffer cache performance. > > That criticism aside, nice results. It shows what a "cloud" can do. 
> > > Price: $0.80 per instance hour > > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics LLC, > email: landman@scalableinformatics.com > web : http://www.scalableinformatics.com > http://jackrabbit.scalableinformatics.com > phone: +1 734 786 8423 > fax : +1 866 888 3112 > cell : +1 734 612 4615 > -- Peter N. Skomoroch peter.skomoroch@gmail.com http://www.datawrangling.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20080307/a043bfa0/attachment.html From peter.skomoroch at gmail.com Fri Mar 7 09:49:46 2008 From: peter.skomoroch at gmail.com (Peter Skomoroch) Date: Sun Jul 27 01:06:57 2008 Subject: [Beowulf] VMC - Virtual Machine Console In-Reply-To: References: <47D173FF.6030001@scalableinformatics.com> Message-ID: I'm running bonnie++ on a xlarge instance right now with 30 GB files on /mnt. I'll post the results when it finishes. I also have Ganglia set up on the node, so you can check that out until I shut the instance down: http://ec2-72-44-53-20.compute-1.amazonaws.com/ganglia On Fri, Mar 7, 2008 at 12:05 PM, Peter Skomoroch wrote: > Joe, thanks for the feedback. The bonnie results were not actually mine, > I was just pointing to some numbers run by Paul Moen. > > Your 1GB file data is likely more representative, but with 15 GB ram, > > you need to be testing 30-60 GB files. > > > > I'll try to tweak the BPS bonnie tests to run some large files... > > > > On Fri, Mar 7, 2008 at 11:57 AM, Joe Landman < > landman@scalableinformatics.com> wrote: > > > Peter Skomoroch wrote: > > > > > Extra Large Instance: > > > > > > 15 GB memory > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units > > each) > > > 1,690 GB instance storage (4 x 420 GB plus 10 GB root partition) > > > 64-bit platform > > > I/O Performance: High > > > > Note: minor criticism, but overall, nice results. > > > > Looking over your bonnie results is worth a quick comment. 
Any time you > > have bonnie or IOzone (or other IO benchmarks) which are testing file > > sizes less than ram size, you are not actually measuring disk IO. This > > is cache speed pure and simple. Either page/buffer cache, or RAID > > cache, or whatever. > > > > We have had people tell us to our face that their 2GB file results (on a > > 16 GB RAM machine) were somehow indicative of real file performance, > > when, if they walked over to the units they were testing, they would > > have noticed the HD lights simply not blinking ... Yeah, an amusing > > beer story (the longer version of it), but a problem none-the-less. > > > > Your 1GB file data is likely more representative, but with 15 GB ram, > > you need to be testing 30-60 GB files. > > > > Not trying to be a marketing guy here or anything like that ... we test > > our JackRabbit units with 80GB to 1.3TB sized files. We see (sustained) > > 750 MB/s - 1.3 GB/s in these tests. We also note some serious issues > > with the linux buffer cache and multiple RAID controllers (buffer cache > > appears to serialize access). We do this as we actually want to measure > > disk performance, and not buffer cache performance. > > > > That criticism aside, nice results. It shows what a "cloud" can do. > > > > > Price: $0.80 per instance hour > > > > > > -- > > Joseph Landman, Ph.D > > Founder and CEO > > Scalable Informatics LLC, > > email: landman@scalableinformatics.com > > web : http://www.scalableinformatics.com > > http://jackrabbit.scalableinformatics.com > > phone: +1 734 786 8423 > > fax : +1 866 888 3112 > > cell : +1 734 612 4615 > > > > > > -- > Peter N. Skomoroch > peter.skomoroch@gmail.com > http://www.datawrangling.com > -- Peter N. Skomoroch peter.skomoroch@gmail.com http://www.datawrangling.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20080307/760a28fd/attachment.html From peter.st.john at gmail.com Fri Mar 7 10:18:49 2008 From: peter.st.john at gmail.com (Peter St. John) Date: Sun Jul 27 01:06:57 2008 Subject: [Beowulf] VMC - Virtual Machine Console In-Reply-To: References: <47D173FF.6030001@scalableinformatics.com> Message-ID: Peter Sk, Where is the blog you mentioned (where you'll be posting followups)? Thanks, Peter St On Fri, Mar 7, 2008 at 12:49 PM, Peter Skomoroch wrote: > I'm running bonnie++ on a xlarge instance right now with 30 GB files on > /mnt. I'll post the results when it finishes. I also have Ganglia set up > on the node, so you can check that out until I shut the instance down: > > http://ec2-72-44-53-20.compute-1.amazonaws.com/ganglia > > > On Fri, Mar 7, 2008 at 12:05 PM, Peter Skomoroch < > peter.skomoroch@gmail.com> wrote: > > > Joe, thanks for the feedback. The bonnie results were not actually > > mine, I was just pointing to some numbers run by Paul Moen. > > > > Your 1GB file data is likely more representative, but with 15 GB ram, > > > you need to be testing 30-60 GB files. > > > > > > > I'll try to tweak the BPS bonnie tests to run some large files... > > > > > > > > On Fri, Mar 7, 2008 at 11:57 AM, Joe Landman < > > landman@scalableinformatics.com> wrote: > > > > > Peter Skomoroch wrote: > > > > > > > Extra Large Instance: > > > > > > > > 15 GB memory > > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units > > > each) > > > > 1,690 GB instance storage (4 x 420 GB plus 10 GB root > > > partition) > > > > 64-bit platform > > > > I/O Performance: High > > > > > > Note: minor criticism, but overall, nice results. > > > > > > Looking over your bonnie results is worth a quick comment. Any time > > > you > > > have bonnie or IOzone (or other IO benchmarks) which are testing file > > > sizes less than ram size, you are not actually measuring disk IO. > > > This > > > is cache speed pure and simple. 
Either page/buffer cache, or RAID > > > cache, or whatever. > > > > > > We have had people tell us to our face that their 2GB file results (on > > > a > > > 16 GB RAM machine) were somehow indicative of real file performance, > > > when, if they walked over to the units they were testing, they would > > > have noticed the HD lights simply not blinking ... Yeah, an amusing > > > beer story (the longer version of it), but a problem none-the-less. > > > > > > Your 1GB file data is likely more representative, but with 15 GB ram, > > > you need to be testing 30-60 GB files. > > > > > > Not trying to be a marketing guy here or anything like that ... we > > > test > > > our JackRabbit units with 80GB to 1.3TB sized files. We see > > > (sustained) > > > 750 MB/s - 1.3 GB/s in these tests. We also note some serious issues > > > with the linux buffer cache and multiple RAID controllers (buffer > > > cache > > > appears to serialize access). We do this as we actually want to > > > measure > > > disk performance, and not buffer cache performance. > > > > > > That criticism aside, nice results. It shows what a "cloud" can do. > > > > > > > Price: $0.80 per instance hour > > > > > > > > > -- > > > Joseph Landman, Ph.D > > > Founder and CEO > > > Scalable Informatics LLC, > > > email: landman@scalableinformatics.com > > > web : http://www.scalableinformatics.com > > > http://jackrabbit.scalableinformatics.com > > > phone: +1 734 786 8423 > > > fax : +1 866 888 3112 > > > cell : +1 734 612 4615 > > > > > > > > > > > -- > > Peter N. Skomoroch > > peter.skomoroch@gmail.com > > http://www.datawrangling.com > > > > > > -- > Peter N. Skomoroch > peter.skomoroch@gmail.com > http://www.datawrangling.com > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20080307/d378d961/attachment.html From peter.skomoroch at gmail.com Fri Mar 7 10:30:11 2008 From: peter.skomoroch at gmail.com (Peter Skomoroch) Date: Sun Jul 27 01:06:57 2008 Subject: [Beowulf] VMC - Virtual Machine Console In-Reply-To: References: <47D173FF.6030001@scalableinformatics.com> Message-ID: My blog is here: http://www.datawrangling.com/ I had a post last year describing launching a cluster of the small 32 bit instances: http://www.datawrangling.com/mpi-cluster-with-python-and-amazon-ec2-part-2-of-3.html Since then, Amazon upgraded to 64 bit "extra-large" instances with full gigabit ethernet which might make this feasible. Essentially you are getting a full physical box, where with the small instances you would be sharing network, disk, etc. I built a set of new Fedora images and config scripts for the extra-large instances which include NFS,mpich,lam,openmpi,ganglia etc. I'd like to use a standard cluster distribution, but it would take some hacking to get the networking to work properly within EC2. Amazon uses a custom firewall setup where autodiscovery won't work, also multicast is not supported and subnets are randomly assigned. On Fri, Mar 7, 2008 at 1:18 PM, Peter St. John wrote: > Peter Sk, > Where is the blog you mentioned (where you'll be posting followups)? > Thanks, > Peter St > > On Fri, Mar 7, 2008 at 12:49 PM, Peter Skomoroch < > peter.skomoroch@gmail.com> wrote: > > > I'm running bonnie++ on a xlarge instance right now with 30 GB files on > > /mnt. I'll post the results when it finishes. I also have Ganglia set up > > on the node, so you can check that out until I shut the instance down: > > > > http://ec2-72-44-53-20.compute-1.amazonaws.com/ganglia > > > > > > On Fri, Mar 7, 2008 at 12:05 PM, Peter Skomoroch < > > peter.skomoroch@gmail.com> wrote: > > > > > Joe, thanks for the feedback. 
The bonnie results were not actually > > > mine, I was just pointing to some numbers run by Paul Moen. > > > > > > Your 1GB file data is likely more representative, but with 15 GB ram, > > > > you need to be testing 30-60 GB files. > > > > > > > > > > I'll try to tweak the BPS bonnie tests to run some large files... > > > > > > > > > > > > On Fri, Mar 7, 2008 at 11:57 AM, Joe Landman < > > > landman@scalableinformatics.com> wrote: > > > > > > > Peter Skomoroch wrote: > > > > > > > > > Extra Large Instance: > > > > > > > > > > 15 GB memory > > > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute > > > > Units each) > > > > > 1,690 GB instance storage (4 x 420 GB plus 10 GB root > > > > partition) > > > > > 64-bit platform > > > > > I/O Performance: High > > > > > > > > Note: minor criticism, but overall, nice results. > > > > > > > > Looking over your bonnie results is worth a quick comment. Any time > > > > you > > > > have bonnie or IOzone (or other IO benchmarks) which are testing > > > > file > > > > sizes less than ram size, you are not actually measuring disk IO. > > > > This > > > > is cache speed pure and simple. Either page/buffer cache, or RAID > > > > cache, or whatever. > > > > > > > > We have had people tell us to our face that their 2GB file results > > > > (on a > > > > 16 GB RAM machine) were somehow indicative of real file performance, > > > > when, if they walked over to the units they were testing, they would > > > > have noticed the HD lights simply not blinking ... Yeah, an amusing > > > > beer story (the longer version of it), but a problem none-the-less. > > > > > > > > Your 1GB file data is likely more representative, but with 15 GB > > > > ram, > > > > you need to be testing 30-60 GB files. > > > > > > > > Not trying to be a marketing guy here or anything like that ... we > > > > test > > > > our JackRabbit units with 80GB to 1.3TB sized files. We see > > > > (sustained) > > > > 750 MB/s - 1.3 GB/s in these tests. 
We also note some serious > > > > issues > > > > with the linux buffer cache and multiple RAID controllers (buffer > > > > cache > > > > appears to serialize access). We do this as we actually want to > > > > measure > > > > disk performance, and not buffer cache performance. > > > > > > > > That criticism aside, nice results. It shows what a "cloud" can do. > > > > > > > > > Price: $0.80 per instance hour > > > > > > > > > > > > -- > > > > Joseph Landman, Ph.D > > > > Founder and CEO > > > > Scalable Informatics LLC, > > > > email: landman@scalableinformatics.com > > > > web : http://www.scalableinformatics.com > > > > http://jackrabbit.scalableinformatics.com > > > > phone: +1 734 786 8423 > > > > fax : +1 866 888 3112 > > > > cell : +1 734 612 4615 > > > > > > > > > > > > > > > > -- > > > Peter N. Skomoroch > > > peter.skomoroch@gmail.com > > > http://www.datawrangling.com > > > > > > > > > > > -- > > Peter N. Skomoroch > > peter.skomoroch@gmail.com > > http://www.datawrangling.com > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > -- Peter N. Skomoroch peter.skomoroch@gmail.com http://www.datawrangling.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20080307/383d5e5b/attachment.html From peter.skomoroch at gmail.com Fri Mar 7 11:07:51 2008 From: peter.skomoroch at gmail.com (Peter Skomoroch) Date: Sun Jul 27 01:06:57 2008 Subject: [Beowulf] VMC - Virtual Machine Console In-Reply-To: References: <47D173FF.6030001@scalableinformatics.com> Message-ID: Here are the results from one pass: [root@ip-10-251-166-127 bonnie++-1.03a]# ./bonnie++ -d /mnt/bonnie -s 30000 -n 1 -m Fedora6 -x 3 -r 15000 -u lamuser Using uid:500, gid:500. 
name,file_size,putc,putc_cpu,put_block,put_block_cpu,rewrite,rewrite_cpu,getc,getc_cpu,get_block,get_block_cpu,seeks,seeks_cpu,num_files,seq_create,seq_create_cpu,seq_stat,seq_stat_cpu,seq_del,seq_del_cpu,ran_create,ran_create_cpu,ran_stat,ran_stat_cpu,ran_del,ran_del_cpu
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Fedora6,30000M,14984,28,54544,15,22190,1,41469,58,55526,0,176.0,0,1,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

On Fri, Mar 7, 2008 at 12:49 PM, Peter Skomoroch wrote:

> I'm running bonnie++ on a xlarge instance right now with 30 GB files on
> /mnt. I'll post the results when it finishes. I also have Ganglia set up
> on the node, so you can check that out until I shut the instance down:
>
> http://ec2-72-44-53-20.compute-1.amazonaws.com/ganglia
>
>
> On Fri, Mar 7, 2008 at 12:05 PM, Peter Skomoroch <
> peter.skomoroch@gmail.com> wrote:
>
> > Joe, thanks for the feedback. The bonnie results were not actually
> > mine, I was just pointing to some numbers run by Paul Moen.
> >
> > > Your 1GB file data is likely more representative, but with 15 GB ram,
> > > you need to be testing 30-60 GB files.
> > >
> >
> > I'll try to tweak the BPS bonnie tests to run some large files...
> > > > > > > > On Fri, Mar 7, 2008 at 11:57 AM, Joe Landman < > > landman@scalableinformatics.com> wrote: > > > > > Peter Skomoroch wrote: > > > > > > > Extra Large Instance: > > > > > > > > 15 GB memory > > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units > > > each) > > > > 1,690 GB instance storage (4 x 420 GB plus 10 GB root > > > partition) > > > > 64-bit platform > > > > I/O Performance: High > > > > > > Note: minor criticism, but overall, nice results. > > > > > > Looking over your bonnie results is worth a quick comment. Any time > > > you > > > have bonnie or IOzone (or other IO benchmarks) which are testing file > > > sizes less than ram size, you are not actually measuring disk IO. > > > This > > > is cache speed pure and simple. Either page/buffer cache, or RAID > > > cache, or whatever. > > > > > > We have had people tell us to our face that their 2GB file results (on > > > a > > > 16 GB RAM machine) were somehow indicative of real file performance, > > > when, if they walked over to the units they were testing, they would > > > have noticed the HD lights simply not blinking ... Yeah, an amusing > > > beer story (the longer version of it), but a problem none-the-less. > > > > > > Your 1GB file data is likely more representative, but with 15 GB ram, > > > you need to be testing 30-60 GB files. > > > > > > Not trying to be a marketing guy here or anything like that ... we > > > test > > > our JackRabbit units with 80GB to 1.3TB sized files. We see > > > (sustained) > > > 750 MB/s - 1.3 GB/s in these tests. We also note some serious issues > > > with the linux buffer cache and multiple RAID controllers (buffer > > > cache > > > appears to serialize access). We do this as we actually want to > > > measure > > > disk performance, and not buffer cache performance. > > > > > > That criticism aside, nice results. It shows what a "cloud" can do. 
> > > > > > > Price: $0.80 per instance hour > > > > > > > > > -- > > > Joseph Landman, Ph.D > > > Founder and CEO > > > Scalable Informatics LLC, > > > email: landman@scalableinformatics.com > > > web : http://www.scalableinformatics.com > > > http://jackrabbit.scalableinformatics.com > > > phone: +1 734 786 8423 > > > fax : +1 866 888 3112 > > > cell : +1 734 612 4615 > > > > > > > > > > > -- > > Peter N. Skomoroch > > peter.skomoroch@gmail.com > > http://www.datawrangling.com > > > > > > -- > Peter N. Skomoroch > peter.skomoroch@gmail.com > http://www.datawrangling.com > -- Peter N. Skomoroch peter.skomoroch@gmail.com http://www.datawrangling.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20080307/e5938597/attachment.html From erwan at seanodes.com Sat Mar 8 15:07:02 2008 From: erwan at seanodes.com (Erwan) Date: Sun Jul 27 01:06:57 2008 Subject: [Beowulf] Re: mysterious slow disk In-Reply-To: References: Message-ID: <47D31C16.50600@seanodes.com> David Mathog wrote: > I don't think I'm going to solve this one :-( > Here come some ideas. If you invert a disk from a good node and a bad node, does the good node becomes bad ? If yes, the disk is the problem, not the host. If no, the host is the problem. This test will help us searching in the right direction. Maybe the pci latency isn't the same, could you diff a lspci -vvv between a good and a bad host ? Could you also diff the dmidecode of a good and a bad host ? Maybe we missed something. Could you provide a "smartctl -A /dev/" output for a good and a bad disk ? 
Maybe their proprietary tools could diagnose something we can't see directly: http://websupport.wdc.com/rd.asp?p=sw30&t=122&lang=fr&s=http://support.wdc.com/download/dlg/Diag504cCD.iso

I'm out of ideas for the moment ;o)

Erwan,

--------------------------------------------------------------------------------
The views and opinions expressed by the author of this message are personal. SEANODES shall assume no liability, express or implied for such message. This e-mail and any attachments is a confidential correspondence intended only for use of the individual or entity named above. If you are not the intended recipient or the agent responsible for delivering the message to the intended recipient, you are hereby notified that any disclosure, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify the sender by phone or by replying this message, and then delete this message from your system.
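[Editorial sketch] Erwan's suggestion to diff the `lspci -vvv`, `dmidecode` and `smartctl -A` output of a good and a bad host could be done roughly as below. This is only a sketch: it assumes each command's output was already saved to text on both nodes, the function names `diff_reports` and `changed_lines` are made up for illustration, and the sample strings in the usage note are placeholders, not output from the machines discussed in this thread.

```python
import difflib


def diff_reports(good_text, bad_text, good_name="good-node", bad_name="bad-node"):
    """Return unified-diff lines between two saved command outputs."""
    return list(
        difflib.unified_diff(
            good_text.splitlines(),
            bad_text.splitlines(),
            fromfile=good_name,
            tofile=bad_name,
            lineterm="",
        )
    )


def changed_lines(diff):
    """Keep only added/removed lines, dropping the ---/+++ file headers."""
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


if __name__ == "__main__":
    # Placeholder text standing in for saved command output (hypothetical).
    good = "setting A: 1\nsetting B: on"
    bad = "setting A: 1\nsetting B: off"
    for line in changed_lines(diff_reports(good, bad)):
        print(line)
```

An empty result from `changed_lines` means the two reports are identical, which points away from that particular layer (PCI configuration, DMI tables or SMART attributes) as the cause.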
From mathog at caltech.edu Mon Mar 10 11:10:43 2008 From: mathog at caltech.edu (David Mathog) Date: Sun Jul 27 01:06:57 2008 Subject: [Beowulf] Re: mysterious slow disk Message-ID: >Erwan wrote > If you invert a disk from a good node and a bad node, does the good node > becomes bad ? It is on the list of things to try. > Maybe the pci latency isn't the same, could you diff a lspci -vvv > between a good and a bad host ? Already looked at that, they were the same. > Could you also diff the dmidecode of a good and a bad host ? Maybe we > missed something. That's interesting, dmidecode on the slow node shows only: # dmidecode 2.9 # No SMBIOS nor DMI entry point found, sorry. The other nodes actually have dmidecode output. dmesg shows on the slow one: DMI not present or invalid. the others show DMI 2.3 present. This is probably not the problem though, since it seems to have come up after I tried reloading the BIOS last week. Prior to that I did diff dmesg and this difference was not present. Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From xclski at yahoo.com Wed Mar 12 20:07:40 2008 From: xclski at yahoo.com (Ellis Wilson) Date: Sun Jul 27 01:06:57 2008 Subject: [Beowulf] Live Implementation for Clusters Message-ID: <993205.49661.qm@web37909.mail.mud.yahoo.com> Hi all, I've been working lately developing a LiveCD/LiveUSB which will dynamically bring up a cluster for a few professors at my university. Their issue is that this (LaSalle U.) being largely a business and liberal arts university, all their funds go towards instruments not geared towards computers (or really research in general, this is largely a "teaching" school). However, a few are really into research in Chemistry (specifically using ORCA, Gaussian is too expensive and actually doesn't do what they need). 
They've run their tasks in the past for weeks at a time on their home computers or old computers turned over by the school following student use. I asked, why not use the computers for the students when they can't use them? This sparked my LiveCD endeavor. Now, granted, using gigabit ethernet is a huge drawback (and again, money is the real issue, so I can't buy a switch and install nice NICs into the computers for my own use in those off hours), but the task at hand won't even allow propagation beyond 12 computers (hard coded limitation in ORCA). Therefore, our speedup is somewhat CPU bound (at night, obviously a non-dedicated switch during the day would create really terrible problems since students are mucking around on it simultaneously). I'd like to hear concerns/comments from the community on this. For reference I've built the CD based on a slightly stripped down Gentoo, but kept it fairly run of the mill so I can use it for some other applications afterwards by simply unmerging the chemistry applications and reburning it with my new configuration. MPI 1.27 is utilized because ORCA (again very picky and unfortunately closed source) requires it. I've got a small folder of scripts that create the computer as a Node or Master with more or less one command. Oh, and no, IT here is really, really bad and super Windows friendly. There's no way they'd let me install onto another partition. Feel free to rip away, I know its not a perfect solution, but I'm not sure under the heavy money circumstances a better one exists (please prove me wrong!). Ellis ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. 
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From john.hearns at streamline-computing.com Wed Mar 12 23:20:31 2008 From: john.hearns at streamline-computing.com (John Hearns) Date: Sun Jul 27 01:06:57 2008 Subject: [Beowulf] Live Implementation for Clusters In-Reply-To: <993205.49661.qm@web37909.mail.mud.yahoo.com> References: <993205.49661.qm@web37909.mail.mud.yahoo.com> Message-ID: <1205389241.10217.7.camel@Vigor13> On Wed, 2008-03-12 at 20:07 -0700, Ellis Wilson wrote: > Hi all, > > > Feel free to rip away, Why should we? This sounds an exciting and innovative project. Write it up. Doug Eadline - sounds like a perfect article for the Clustermonkey site. Actually, your post is quite topical for me. I'm doing a talk on Saturday, and was looking at the Cluster Knoppix site, as I'll probably be asked about how to go about getting a taster of cluster building. http://clusterknoppix.sw.be/ I was a bit surprised how out of date this is. Best of British with the project, and as I say write it up. John Hearns Senior HPC Engineer Streamline Computing From john.hearns at streamline-computing.com Wed Mar 12 23:42:21 2008 From: john.hearns at streamline-computing.com (John Hearns) Date: Sun Jul 27 01:06:57 2008 Subject: [Beowulf] Live Implementation for Clusters In-Reply-To: <993205.49661.qm@web37909.mail.mud.yahoo.com> References: <993205.49661.qm@web37909.mail.mud.yahoo.com> Message-ID: <1205390551.10217.23.camel@Vigor13> On Wed, 2008-03-12 at 20:07 -0700, Ellis Wilson wrote: > This sparked my LiveCD endeavor. Now, granted, using > gigabit ethernet > is a huge drawback Hey, don't say that. We build and install clusters for customers every day with gigabit ethernet as the main interconnect. (And Infiniband; Infinipath; Myrinet; Quadrics or anything else your budget will stretch to). An enhancement you might like to think of is wiring any second on-board gigabit ports to a dedicated switch for MPI traffic. 
(See Beowulf passim for discussions on the merits of various switches) People are using our gig ethernet connected clusters right now for serious computational chemistry work all over the UK. John Hearns Senior HPC Engineer Streamline Computing From xclski at yahoo.com Thu Mar 13 07:43:42 2008 From: xclski at yahoo.com (Ellis Wilson) Date: Sun Jul 27 01:06:57 2008 Subject: [Beowulf] Live Implementation for Clusters Message-ID: <933254.46423.qm@web37910.mail.mud.yahoo.com> > An enhancement you might like to think of is wiring any second on-board > gigabit ports to a dedicated switch for MPI traffic. > (See Beowulf passim for discussions on the merits of various switches) Actually, its funny you say that because as of right now the script serves in a (very rough) DHCP capacity that seems to work best on a dedicated switch. My work today was to change that, but it might be good to make two functions so one might choose. Anyhow, the way that it currently works in my script is: 1. Fire up the master node, which configures itself as 192.168.0.250 and starts an NFS server in its home folder as .ssh 2. Fire up each node, running the script, which gets an IP from the standard DHCP server, connects to the NFS server, grabs a file called IP from it and assigns its IP based on that and increments the IP file on the masters NFS server. Unmount NFS, restart network, remount NFS, run SSH passwordless stuff (which is placed into the .ssh folder, handily available now to any PC connected to the NFS server) and is ready to go. I was going to change this so that it doesn't grab an IP file but rather simply appends its standard assigned ip address to a file on the Masters NFS server and when all your nodes are up, one issues a command from the Master to refresh all the Hosts and Machine files on the nodes and master using this file with all the random ip addresses in it (obviously Node1 in hosts is the first IP, and so on). 
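[Editorial sketch] The planned refresh step described above, where every node appends its DHCP-assigned address to a shared file and the master then rebuilds the hosts and machine files, might look something like this. It is a sketch under assumptions: the names `Master` and `Node<N>`, the one-name-per-line machine file layout, and the function `build_cluster_files` are illustrative and not taken from Ellis's scripts; the addresses in the usage note are made up.

```python
def build_cluster_files(ip_lines, master_ip, prefix="Node"):
    """Build hosts-file and machine-file text from registered node IPs.

    ip_lines is the content of the shared registration file, one IP per
    line, in the order the nodes came up. Blank lines and repeated IPs
    (a node that registered twice) are ignored.
    """
    seen = []
    for line in ip_lines:
        ip = line.strip()
        if ip and ip not in seen:
            seen.append(ip)

    hosts = [f"{master_ip}\tMaster"]
    machines = ["Master"]
    # The first registered IP becomes Node1, the second Node2, and so on.
    for index, ip in enumerate(seen, start=1):
        hosts.append(f"{ip}\t{prefix}{index}")
        machines.append(f"{prefix}{index}")

    return "\n".join(hosts) + "\n", "\n".join(machines) + "\n"


if __name__ == "__main__":
    # Made-up addresses; in practice these lines would be read from the
    # registration file on the master's NFS export.
    hosts_text, machines_text = build_cluster_files(
        ["10.0.0.5\n", "10.0.0.9\n"], master_ip="10.0.0.2"
    )
    print(hosts_text, end="")
    print(machines_text, end="")
```

Keeping registration order as the numbering rule matches the "Node1 in hosts is the first IP" convention, and de-duplicating guards against a node that reboots and registers a second time.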
This way I can take advantage of my school using a non 192.168.***.*** network. I tried the old system using this, the master sets its ip as 192.168.0.250, and yet the nodes don't seem to find it. Not sure why.

Ellis

From garantes at iq.usp.br Fri Mar 14 05:55:03 2008
From: garantes at iq.usp.br (Guilherme Menegon Arantes)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] some tips on new cluster hardware
Message-ID: <20080314125503.GA9770@dinamobile>

Hi folks,

We are building a new baby and would like some tips on inexpensive hardware. Any help is much appreciated:

1) Suggestions on MoBo's that support a 1333MHz FSB to plug Intel's Q9450, with video and jumbo-frames capable Gigabit on-board;
2) NIC PCI Gigabit card, jumbo-frames capable (because I might be asking too much on item 1);
3) 16-port Gigabit Switch, jumbo-frames capable, with Flow Control.

The most likely brands we can find here in Brazil are Intel, 3COM, D-Link, BCM and TrendWare.

Thanks a lot for your opinions/hints,

G

--
Guilherme Menegon Arantes, PhD São Paulo, Brasil
______________________________________________________

From laytonjb at charter.net Sat Mar 15 16:27:00 2008
From: laytonjb at charter.net (Jeffrey B. Layton)
Date: Sun Jul 27 01:06:57 2008
Subject: [Beowulf] some tips on new cluster hardware
In-Reply-To: <20080314125503.GA9770@dinamobile>
References: <20080314125503.GA9770@dinamobile>
Message-ID: <47DC5B44.40907@charter.net>

Guilherme Menegon Arantes wrote:
> Hi folks,
>
> We are building a new baby and would like some tips on inexpensive
> hardware.
Any help is much appreciated: > > 1) Suggestions on MoBo's that support a 1333MHz FSB to plug Intel's > Q9450, with video and jumbo-frames capable Gigabit on-board; > To get a good idea, look at www.newegg.com and look for motherboards that people like and also have a fair number of reviews. You might want to think about looking at a newer chipset instead of really old ones while you're at it (just my opinion). > 2) NIC PCI Gigabit card, jumbo-frames capable (because I might be > asking too much on item 1); > I would think about using a simple PCI-e x1 GigE card. Intel has a good one that's not too expensive. Doug Eadline has done some testing and these NICs look a bit better than simple PCI NICs. Jeff From hahn at mcmaster.ca Sun Mar 16 15:25:47 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun Jul 27 01:06:57 2008 Subject: [Beowulf] some tips on new cluster hardware In-Reply-To: <20080314125503.GA9770@dinamobile> References: <20080314125503.GA9770@dinamobile> Message-ID: > 1) Suggestions on MoBo's that support a 1333MHz FSB to plug Intel's > Q9450, with video and jumbo-frames capable Gigabit on-board; > 2) NIC PCI Gigabit card, jumbo-frames capable (because I might be > asking too much on item 1); > 3) 16-port Gigabit Switch, jumbo-frames capable, with Flow Control. it sounds to me as if you've done some research, and just want help shopping. how thorough was your research? often people start with preconceptions that limit the range of solutions they look at, unnecessarily. for instance, you have clearly decided you want Gb+jumbo - how did you decide that? jumbo is mainly about reducing the cpu overhead of high-bandwidth apps, which does not necessarily correspond with typical cluster workloads. if you're concerned about bandwidth, might you be better off with some other interconnect? IB has gotten a lot cheaper recently. to me, cluster hardware should really take one of two approaches: high road or low. 
the high road includes IB or 10Ge, and tends to also include rackmount and dual or quad socket. this is approach provides an excellent platform for demanding (high-bandwidth and/or tight-coupled) workloads. the low road is more "beowulfy" - to leverage commodity parts. that means gigabit, and excludes IB. but it also means PC-style cases, probably single-socket mATX boards with integrated video and gbit. these parts are a LOT cheaper than high-road "server" boards. main downsides of the low road: - no real managability. cheap boards don't have IPMI, though nowadays they will at least PXE boot. - at most 4cores and one memory controller per node. - gigabit isn't exactly low-latency or high-bandwidth, so this kind of cluster is mostly appropriate for serial or small/loose parallel jobs. > The most likely brands we can find here in Brazil are Intel, 3COM, > D-Link, BCM and TrendWare. I think you should look closely at using motherboard-integrated gigabit. besides saving cost, it reduces the complexity of the system, and if you're feeling DIYish, you can mount boards caseless for extra savings (and often better control of airflow!) I also think you should avoid falling in love with your ethernet switch ;) I just checked the price on the DGS-1216t, which appears to be what you're looking for, and it's $Cdn 270 or so (cheaper than one cpu). you should probably consider getting a modestly larger switch - 24 is still commodity, and would let you expand or add a trunked uplink to fileserver(s). From garantes at iq.usp.br Sun Mar 16 17:27:38 2008 From: garantes at iq.usp.br (Guilherme Menegon Arantes) Date: Sun Jul 27 01:06:58 2008 Subject: [Beowulf] some tips on new cluster hardware In-Reply-To: References: <20080314125503.GA9770@dinamobile> Message-ID: <20080317002738.GA4265@dinamobile> On Sun, Mar 16, 2008 at 06:25:47PM -0400, Mark Hahn wrote: > > it sounds to me as if you've done some research, and just want help > shopping. Yes. 
We don't have much cheap hardware options around here. Nor well-educated sellers that can help finding solutions, so I was just asking to see if you could give me some ideas. > how thorough was your research? often people start with preconceptions > that limit the range of solutions they look at, unnecessarily. for > instance, > you have clearly decided you want Gb+jumbo - how did you decide that? Based on previous experience. It helps considerably with one of our applications, which has loads of disk I/O (previous and new cluster will be diskless). > if you're concerned about bandwidth, might you be better off with some > other interconnect? IB has gotten a lot cheaper recently. We are aware of that. Actually, we are in touch with Colfax to import some chips. > to me, cluster hardware should really take one of two approaches: > high road or low. the high road includes IB or 10Ge, and tends to Agreed. We are in the low side at the mo, because of scarce funds and little experience. But, want to learn/get experience (with IB, for example) and move towards the higher end road. > feeling DIYish, you can mount boards caseless for extra savings (and often > better control of airflow!) No, thanks. We are going rackmount. > I also think you should avoid falling in love with your ethernet switch ;) Don't worry, I won't change my girlfriend for that ;-), but a 24 port (as suggested by somelse on Beowulf before) switch is something I am considering. Just a bit unsure of D-Link brand, since I had problems with it before (wireless ap and NIC). 
Thanks for the suggestions/attention,

G

--
Guilherme Menegon Arantes, PhD São Paulo, Brasil
______________________________________________________

From garantes at iq.usp.br Sun Mar 16 17:30:26 2008
From: garantes at iq.usp.br (Guilherme Menegon Arantes)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] some tips on new cluster hardware
In-Reply-To: <47DC5B44.40907@charter.net>
References: <20080314125503.GA9770@dinamobile> <47DC5B44.40907@charter.net>
Message-ID: <20080317003026.GB4265@dinamobile>

On Sat, Mar 15, 2008 at 06:27:00PM -0500, Jeffrey B. Layton wrote:
>
> I would think about using a simple PCI-e x1 GigE card. Intel has a good
> one that's not too expensive. Doug Eadline has done some testing and
> these NICs look a bit better than simple PCI NICs.

URL? I searched Clustermonkey without success...

Thanks,

G

--
Guilherme Menegon Arantes, PhD São Paulo, Brasil
______________________________________________________

From midair77 at gmail.com Mon Mar 17 11:49:14 2008
From: midair77 at gmail.com (Steven Truong)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] NMI (Non maskable interrupts)
Message-ID: <28bb77d30803171149p396f515cs124f1ce045ceeea7@mail.gmail.com>

Dear all,

We recently bought some dual quad-core AMD Barcelona nodes with the Asus KFSN4-DRE motherboard and installed Rocks Cluster 4.3, CentOS 5.1 on these machines. What we found has irked us in terms of the number of NMIs generated.
# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:   11468997   11550969   11551029   11550982   11550374   11549932   11549991   11553108   IO-APIC-edge   timer
  8:          0          0          0          0          0          0          0          0   IO-APIC-edge   rtc
  9:          0          0          0          0          0          0          0          0   IO-APIC-level  acpi
177:          0          0          0          0          0          0          0          0   IO-APIC-level  ohci_hcd
185:          0          0          0          0          0          0          0          0   IO-APIC-level  ehci_hcd
193:          0     149313        392      38104      36736          0          1      77870   IO-APIC-level  libata
201:          0          0          0          0          0          0          0          0   IO-APIC-level  libata
233:   29715082          0          0          0          0          0          0          0   PCI-MSI        eth0
NMI:     658519     686187     682474     687981     690017     689957     685692     588203
LOC:   92324693   92324694   92324693   92324692   92324692   92324687   92324689   92324681
ERR:          0
MIS:          0

# uptime
 11:38:50 up 10 days, 16:27,  1 user,  load average: 7.99, 7.98, 7.99

From my understanding, NMI is not good since the processors really have to handle these interrupts right away and these might degrade the performance of the nodes. From what I read, NMI are usually generated by bad hardwares or memory issues and I would like to know how to find out what causes these NMI... Could you please point me to the right direction in finding out more about this?

Thank you.

From hahn at mcmaster.ca Mon Mar 17 15:02:24 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] NMI (Non maskable interrupts)
In-Reply-To: <28bb77d30803171149p396f515cs124f1ce045ceeea7@mail.gmail.com>
References: <28bb77d30803171149p396f515cs124f1ce045ceeea7@mail.gmail.com>
Message-ID:

> From my understanding, NMI is not good since the processors really
> have to handle these interrupts right away and these might degrade the
> performance of the nodes.

I think you're mistaken - NMI's of the sort you're talking about will result in a panic. these NMI's are probably just low-level kernel synchronization like where one CPU needs to cause others to immediately do something like changing the status of a page in their MMUs.
for instance, I notice that more recent kernels classify interrupts more finely:

[root@experiment ~]# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:         68          0          0          0   IO-APIC-edge     timer
  1:          0          0          0         10   IO-APIC-edge     i8042
  4:          0          0          0          2   IO-APIC-edge
  8:          0          0          0          0   IO-APIC-edge     rtc
  9:          0          0          0          0   IO-APIC-fasteoi  acpi
 12:          0          0          0          4   IO-APIC-edge     i8042
 14:          0          0          0          0   IO-APIC-edge     ide0
 17:          0          0          0          0   IO-APIC-fasteoi  sata_nv
 18:          0          0          0          0   IO-APIC-fasteoi  sata_nv
 19:     123229        148        514       4698   IO-APIC-fasteoi  sata_nv
362:  127524168    5281605     236961     121506   PCI-MSI-edge     eth1
377:     519748   12731137     607115   42573852   PCI-MSI-edge     eth0:MSI-X-2-RX
378:     109154      80191  302109913    6487104   PCI-MSI-edge     eth0:MSI-X-1-TX
NMI:          0          0          0          0   Non-maskable interrupts
LOC:  300446104  300446082  300446060  300446038   Local timer interrupts
RES:    2698262      44102    2234502    3677120   Rescheduling interrupts
CAL:       4135       4379       4460        415   function call interrupts
TLB:      14018      15088       4079       7251   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          0

I suspect that all the counts listed after RES are, in earlier kernels, lumped into NMI. obviously, rescheduling, function call and TLB shootdowns are perfectly normal, not indicating any error (though you might want to minimize them as well...)

how about trying a new kernel? the above is 2.6.24.3. note that there are important security fixes that you might be missing if you're running certain ranges of old kernels...

From csamuel at vpac.org Tue Mar 18 05:53:32 2008
From: csamuel at vpac.org (csamuel@vpac.org)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] NMI (Non maskable interrupts)
In-Reply-To:
Message-ID: <20740902.661205814821402.JavaMail.csamuel@ubuntu>

----- "Mark Hahn" wrote:

> how about trying a new kernel? the above is 2.6.24.3. note that
> there are important security fixes that you might be missing if
> you're running certain ranges of old kernels...
We also saw performance improvements going to the mainline kernel (2.6.24.*) from the CentOS 5 kernels. Don't forget to make sure you're not disabling ACPI on Barcelona by booting with noacpi, if you do then the kernel won't learn that it's a NUMA box as the old K8 table hack doesn't work on K10h and you'll get much worse memory bandwidth. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From midair77 at gmail.com Tue Mar 18 11:55:21 2008 From: midair77 at gmail.com (Steven Truong) Date: Sun Jul 27 01:06:58 2008 Subject: [Beowulf] NMI (Non maskable interrupts) In-Reply-To: References: <28bb77d30803171149p396f515cs124f1ce045ceeea7@mail.gmail.com> Message-ID: <28bb77d30803181155n3f52aeb6k6d92ac2fb5a90322@mail.gmail.com> On Mon, Mar 17, 2008 at 3:02 PM, Mark Hahn wrote: > > From my understanding, NMI is not good since the processors really > > have to handle these interrupts right away and these might degrade the > > performance of the nodes. > > I think you're mistaken - NMI's of the sort you're talking about will > result in a panic. these NMI's are probably just low-level kernel > synchronization like where one CPU needs to cause others to immediately do > something like changing the status of a page in their MMUs. 
> > for instance, I notice that more recent kernels classify interrupts > more finely: > > [root@experiment ~]# cat /proc/interrupts > CPU0 CPU1 CPU2 CPU3 > 0: 68 0 0 0 IO-APIC-edge timer > 1: 0 0 0 10 IO-APIC-edge i8042 > 4: 0 0 0 2 IO-APIC-edge > 8: 0 0 0 0 IO-APIC-edge rtc > 9: 0 0 0 0 IO-APIC-fasteoi acpi > 12: 0 0 0 4 IO-APIC-edge i8042 > 14: 0 0 0 0 IO-APIC-edge ide0 > 17: 0 0 0 0 IO-APIC-fasteoi sata_nv > 18: 0 0 0 0 IO-APIC-fasteoi sata_nv > 19: 123229 148 514 4698 IO-APIC-fasteoi sata_nv > 362: 127524168 5281605 236961 121506 PCI-MSI-edge eth1 > 377: 519748 12731137 607115 42573852 PCI-MSI-edge eth0:MSI-X-2-RX > 378: 109154 80191 302109913 6487104 PCI-MSI-edge eth0:MSI-X-1-TX > NMI: 0 0 0 0 Non-maskable interrupts > LOC: 300446104 300446082 300446060 300446038 Local timer interrupts > RES: 2698262 44102 2234502 3677120 Rescheduling interrupts > CAL: 4135 4379 4460 415 function call interrupts > TLB: 14018 15088 4079 7251 TLB shootdowns > TRM: 0 0 0 0 Thermal event interrupts > THR: 0 0 0 0 Threshold APIC interrupts > SPU: 0 0 0 0 Spurious interrupts > ERR: 0 > > I suspect that all the counts listed after RES are, in earlier kernels, > lumped into NMI. obviously, rescheduling, function call and TLB shootdowns > are perfectly normal, not indicating any error (though you might want to > minimize them as well...) > > how about trying a new kernel? the above is 2.6.24.3. note that there are > important security fixes that you might be missing if you're running certain > ranges of old kernels... > Hi, Mark. Yes, I was wrong. I also found a very informative discussion of NMI. http://x86vmm.blogspot.com/2005/10/linux-nmis-on-intel-64-bit-hardware.html Thank you. From carsten.aulbert at aei.mpg.de Tue Mar 18 12:35:13 2008 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Sun Jul 27 01:06:58 2008 Subject: [Beowulf] Which Xeon supports NUMA? 
Message-ID: <47E01971.6050108@aei.mpg.de>

Hi,

Many-core Xeons (especially quad and/or many-socket systems) have some
memory speed issues. With NUMA the kernel seems to be able to optimize
this somehow. However, I have two questions:

(1) Which EM64T Xeon supports NUMA? I've searched a bit, but I have not
found a definitive answer so far. Is this visible from /proc/cpuinfo?

(2) More importantly, has someone measured (how?) if this improves
performance?

Thanks for a brief answer

Carsten

From hahn at mcmaster.ca Tue Mar 18 12:42:01 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] NMI (Non maskable interrupts)
In-Reply-To: <20740902.661205814821402.JavaMail.csamuel@ubuntu>
References: <20740902.661205814821402.JavaMail.csamuel@ubuntu>
Message-ID:

> We also saw performance improvements going to the mainline
> kernel (2.6.24.*) from the CentOS 5 kernels.

interesting - in what area?

thanks, mark hahn.

From john.leidel at gmail.com Tue Mar 18 12:50:30 2008
From: john.leidel at gmail.com (John Leidel)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] Which Xeon supports NUMA?
In-Reply-To: <47E01971.6050108@aei.mpg.de>
References: <47E01971.6050108@aei.mpg.de>
Message-ID: <1205869830.28737.13.camel@e521.site>

This should be an option in your kernel. Check to see if you have this
enabled [assuming you are running some sort of Linux variant]

On Tue, 2008-03-18 at 20:35 +0100, Carsten Aulbert wrote:
> Hi,
>
> given that many core Xeons (especially quad and/or many socket systems)
> have some memory speed issues. With NUMA the kernel seems to be able to
> optimize this somehow. However, I have two questions:
>
> (1) Which EM64T Xeon supports NUMA? I've searched a bit, but I have not
> found a definitive answer so far. Is this visible from /proc/cpuinfo?
>
> (2) More importantly, has someone measured (how?) if this improves
> performance?
>
> Thanks for a brief answer
>
> Carsten

From carsten.aulbert at aei.mpg.de Tue Mar 18 12:54:37 2008
From: carsten.aulbert at aei.mpg.de (Carsten Aulbert)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] Which Xeon supports NUMA?
In-Reply-To: <1205869830.28737.13.camel@e521.site>
References: <47E01971.6050108@aei.mpg.de> <1205869830.28737.13.camel@e521.site>
Message-ID: <47E01DFD.6030800@aei.mpg.de>

Hi John,

John Leidel wrote:
> This should be an option in your kernel. Check to see if you have this
> enabled [assuming you are running some sort of Linux variant]
>

Sorry, I came up with this question because when compiling 2.6.24.3
(linux kernel) I came across these lines:

  For x86_64 this is recommended on all multiprocessor Opteron systems.
  If the system is EM64T, you should say N unless your system is
  EM64T NUMA.

Hence my question: are our Xeon E5345 or Xeon X3220 EM64T NUMA or just
EM64T?

Sorry for not being clearer initially.

Cheers

Carsten

From lindahl at pbm.com Tue Mar 18 13:16:45 2008
From: lindahl at pbm.com (Greg Lindahl)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] Which Xeon supports NUMA?
In-Reply-To: <47E01971.6050108@aei.mpg.de>
References: <47E01971.6050108@aei.mpg.de>
Message-ID: <20080318201645.GB12358@bx9.net>

On Tue, Mar 18, 2008 at 08:35:13PM +0100, Carsten Aulbert wrote:

> (1) Which EM64T Xeon supports NUMA?

The typical chipsets we use around here are not NUMA with Xeons -- all
of main memory is the same distance from all the cpus. It's my
impression that the grouping of cores to sockets and such isn't part
of the NUMA code in the kernel. (Anyone?)

In the future, QPI Xeons will be NUMA, similar to Opterons. There are
also some exotic chipsets that do NUMA with Xeons today, but you
probably don't own one.
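
A minimal sketch of how one might check what the running kernel actually sees, assuming a Linux system with sysfs mounted at /sys and a kernel built with NUMA support: such kernels expose one nodeN directory per memory node under /sys/devices/system/node, so counting those entries indicates whether more than one node is visible. The function names count_numa_nodes and is_node_dir_name are made up for this illustration and do not come from the thread or from any library.

```c
#include <ctype.h>
#include <dirent.h>
#include <string.h>

/* Return 1 if name looks like "node<digits>" (for example "node0"), else 0. */
int is_node_dir_name(const char *name)
{
    const char *p;

    if (strncmp(name, "node", 4) != 0 || name[4] == '\0')
        return 0;
    for (p = name + 4; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return 0;
    }
    return 1;
}

/*
 * Count the "node<N>" entries under dir.
 * Assumption: on a NUMA-enabled Linux kernel, dir would be
 * "/sys/devices/system/node". Returns -1 if dir cannot be opened,
 * for instance when the kernel was built without NUMA support.
 */
int count_numa_nodes(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    int count = 0;

    if (d == NULL)
        return -1;
    while ((e = readdir(d)) != NULL) {
        if (is_node_dir_name(e->d_name))
            count++;
    }
    closedir(d);
    return count;
}
```

Calling count_numa_nodes("/sys/devices/system/node") and getting a value greater than 1 would suggest the kernel treats the machine as NUMA; a result of 1 (or -1 when the directory is absent) would be consistent with the uniform-memory Xeon chipsets described above.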

-- greg

From jan.heichler at gmx.net Tue Mar 18 13:18:09 2008
From: jan.heichler at gmx.net (Jan Heichler)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] Which Xeon supports NUMA?
In-Reply-To: <47E01DFD.6030800@aei.mpg.de>
References: <47E01971.6050108@aei.mpg.de> <1205869830.28737.13.camel@e521.site> <47E01DFD.6030800@aei.mpg.de>
Message-ID: <366299804.20080318211809@gmx.net>

Hello Carsten,

On Tuesday, 18 March 2008, you wrote:

CA> Hi John,

CA> John Leidel wrote:
>> This should be an option in your kernel. Check to see if you have this
>> enabled [assuming you are running some sort of Linux variant]

CA> Sorry, I came up with this question because when compiling 2.6.24.3
CA> (linux kernel) I came across these lines:

CA> For x86_64 this is recommended on all multiprocessor Opteron systems.
CA> If the system is EM64T, you should say N unless your system is
CA> EM64T NUMA.

CA> Therefore the question if our Xeon E5345 or Xeon X3220 are EM64T NUMA or
CA> just EM64T.

CA> Sorry for not being more clear initially.

Could it be that there is already code for the upcoming Nehalem CPU in
the kernel? Is Intel maybe providing code?

I asked a few people who have good knowledge about technologies and
Linux, and this was the only explanation that came up.

Cheers,
Jan

From hahn at mcmaster.ca Tue Mar 18 14:00:10 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] Which Xeon supports NUMA?
In-Reply-To: <47E01971.6050108@aei.mpg.de>
References: <47E01971.6050108@aei.mpg.de>
Message-ID:

> given that many core Xeons (especially quad and/or many socket systems) have
> some memory speed issues. With NUMA the kernel seems to be able to optimize
> this somehow.

I don't believe so.
Intel currently still uses a single memory controller (MCH),
which means that memory access is, in the NUMA sense, uniform.
I don't believe that Intel's recent use of multiple socket-MCH links,
or multiple independent FBDIMM channels off the MCH changes this.

here's an Intel Itanium 2 chipset:

http://www.intel.com/products/i/chipsets/e8870sp/e8870_blkdiag_8way_800.jpg

you can see that there are two FSB's with 4 cpus each. a CPU on the left
will have non-uniform access to a memory bank which happens to be on the
right side of the system. I don't believe any of the Intel x86 chipsets
provide this kind of design, though several other companies have done
numa x86 chipsets (IBM for one).

the interesting thing is that Intel has decided to embrace the
numa-oriented system architecture of AMD (et al). it'll be very
interesting to see how this plays out with Nehalem/QPI. obviously,
AMD really, really needs to wake up and try a little harder to compete...

> (2) More importantly, has someone measured (how?) if this improves
> performance?

usually, tuning for NUMA just means trying to keep a process near its
memory. in the chipset above, if a proc starts on the left half, make an
effort to allocate its memory on the left as well, and keep scheduling it
on left cpus. the kernel does contain code that tries to understand this
topology - the most common machines that use it are multi-socket opteron
boxes. but systems like SGI Altix depend on this sort of thing quite
heavily.

following is a trivial measurement of the effect. I'm running the stream
benchmark on a single thread. in the first case, I force the process and
memory to be on the same socket. then the "wrong" socket.

[hahn@rb17 ~]$ numactl --membind=0 --cpubind=0 ./s
...
The total memory requirement is 1144 MB
You are running each test 11 times
...
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         5298.8324       0.1515       0.1510       0.1520
Scale:        5334.1523       0.1504       0.1500       0.1510
Add:          5455.4020       0.2200       0.2200       0.2200
Triad:        5455.3902       0.2200       0.2200       0.2200
...
[hahn@rb17 ~]$ numactl --membind=0 --cpubind=1 ./s
...
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         3556.1072       0.2253       0.2250       0.2260
Scale:        3620.4688       0.2213       0.2210       0.2220
Add:          3647.9716       0.3305       0.3289       0.3310
Triad:        3659.0890       0.3305       0.3280       0.3310

note that NUMA optimizations are a wonderful thing, but hardly a panacea.
for instance, a busy system might not be able to put all a proc's memory
on a particular node. or perhaps the cpus of that node are busy. and then
think about multithreaded programs. on top of that, consider caches,
which these days are variously per-core, per-chip and per-socket.

> Thanks for a brief answer

oh, sorry ;)

From amjad11 at gmail.com Wed Mar 19 00:58:44 2008
From: amjad11 at gmail.com (amjad ali)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] Three questions on a new Beowulf Cluster
In-Reply-To: <47BEEF3A.9020806@sps.aero>
References: <47BEEF3A.9020806@sps.aero>
Message-ID: <428810f20803190058r4ea2ced4vce9f809b1b0b5b03@mail.gmail.com>

Hello,

If you don't want to delve into cluster details and want to use it as a
"black box" kind of tool, then I would suggest using a commercial
cluster scheduler/manager like Aspen Beowulf Cluster, Platform LSF,
Scali Manage, etc., or you may go for ROCKS, OSCAR, etc.

This way you could perhaps have an easier, simpler life.

Also have a look at PETSc (which uses MPI internally).

regards,
Amjad Ali.

On Fri, Feb 22, 2008 at 8:50 PM, John P. Kosky, PhD wrote:

> My company is taking its first foray into the world of HPC with an
> expandable architecture, 16 processor (comprised of quad core Opterons),
> one header node cluster using Infiniband interconnects. OS has
> tentatively been selected as SUSE 64-bit Linux. The principal purpose of
> the cluster is as a tool for spacecraft and propulsion design support.
> The cluster will therefore be running the most recent versions of
> commercially available software - initially for FEA and CFD using COMSOL
> Multiphysics and associated packages, NASTRAN, MatLab modules, as well
> as an internally modified and expanded commercial code for materials
> properties prediction, with emphasis on polymer modeling (Accelrys
> Materials Studio). Since we will be repetitively running standard
> modeling codes on this system, we are trying to make the system as user
> friendly as possible... most of our scientists and engineers want to use
> this as a tool, and not have to become cluster experts. The company WILL
> be hiring an IT Sys Admin with good cluster experience to support the
> system, however...
>
> Question 1:
> 1) Does anyone here know of any issues that have arisen running the
> above named commercial packages on clusters using infiniband?
>
> Question 2:
> 2) As far as the MPI for the system is concerned, for the system and
> application requirements described above, would OpenMPI or MvApich be
> better for managing node usage?
>
> ANY help or advice would be greatly appreciated.
>
> Thanks in advance
>
> John
>
> John P. Kosky, PhD
> Director of Technical Development
> Space Propulsion Systems

From csamuel at vpac.org Wed Mar 19 03:44:33 2008
From: csamuel at vpac.org (Chris Samuel)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] NMI (Non maskable interrupts)
In-Reply-To: <1655697996.235421205896688535.JavaMail.root@zimbra.vpac.org>
Message-ID: <875929636.241761205923473284.JavaMail.root@zimbra.vpac.org>

----- "Mark Hahn" wrote:

> > We also saw performance improvements going to the mainline
> > kernel (2.6.24.*) from the CentOS 5 kernels.
>
> interesting - in what area?

I think it was NAMD, from memory.

--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

From csamuel at vpac.org Wed Mar 19 04:04:30 2008
From: csamuel at vpac.org (Chris Samuel)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] Which Xeon supports NUMA?
In-Reply-To: <6581730.241861205924318549.JavaMail.root@zimbra.vpac.org>
Message-ID: <741569291.241881205924670773.JavaMail.root@zimbra.vpac.org>

----- "Carsten Aulbert" wrote:

> Sorry, I came up with this question because when compiling 2.6.24.3
> (linux kernel) I came across these lines:
>
> For x86_64 this is recommended on all multiprocessor Opteron systems.
> If the system is EM64T, you should say N unless your system is
> EM64T NUMA.

Interestingly that text dates back to a modification for
2.6.14-rc5 back in 2005:

http://osdir.com/ml/linux.ports.x86-64.general/2005-10/msg00140.html

with a comment that says:

# CONFIG_K8_NUMA is not needed for Intel EM64T NUMA boxes

So either this was in the early days of CSI, or there are/were
other Intel AMD64 NUMA boxes out there that none of us know about..

cheers!
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O.
Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

From csamuel at vpac.org Wed Mar 19 04:08:46 2008
From: csamuel at vpac.org (Chris Samuel)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] Which Xeon supports NUMA?
In-Reply-To: <741569291.241881205924670773.JavaMail.root@zimbra.vpac.org>
Message-ID: <400907048.241911205924926660.JavaMail.root@zimbra.vpac.org>

----- "Chris Samuel" wrote:

> So either this was in the early days of CSI, or there are/were
> other Intel AMD64 NUMA boxes out there that none of us know about..

Aha - to answer myself: this even predates the Core architecture.
Rolling our clocks back to 2005, we find this message from an IBM
kernel developer correcting Andi Kleen (then the x86_64 maintainer)
on the issue.

This is quoted from http://lkml.org/lkml/2005/10/17/183 :

----------------8< snip snip 8<----------------
On Mon, Oct 17, 2005 at 05:40:56PM +0200, Andi Kleen wrote:

[...]
> Intel NUMA x86 machines are really rare

No they are not. IBM X460s are generally available machines and the bug
affects those boxes.
----------------8< snip snip 8<----------------

So there you go!

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

From surs at us.ibm.com Wed Mar 19 11:54:33 2008
From: surs at us.ibm.com (Sayantan Sur)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] [p2s2-announce] UPDATED CFP: Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2)
Message-ID:

CALL FOR PAPERS
===============

First International Workshop on Parallel Programming Models and
Systems Software for High-end Computing (P2S2)
(http://www.mcs.anl.gov/events/workshops/p2s2)

Sep. 8th, 2008

To be held in conjunction with
ICPP-08: The 27th International Conference on Parallel Processing
Sep.
8-12, 2008
Portland, Oregon, USA

SCOPE
-----
The goal of this workshop is to bring together researchers and
practitioners in parallel programming models and systems software for
high-end computing systems. Please join us in a discussion of new ideas,
experiences, and the latest trends in these areas at the workshop.

TOPICS OF INTEREST
------------------
The focus areas for this workshop include, but are not limited to:

* Programming models and their high-performance implementations
    o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel
    o Other Hybrid Programming Models
* Systems software for scientific and enterprise computing
    o Communication subsystems for high-end computing
    o High-performance File and storage systems
    o Fault-tolerance techniques and implementations
    o Efficient and high-performance virtualization and other
      management mechanisms
* Tools for Management, Maintenance, Coordination and Synchronization
    o Software for Enterprise Data-centers using Modern Architectures
    o Job scheduling libraries
    o Management libraries for large-scale systems
    o Toolkits for process and task coordination on modern platforms
* Performance evaluation, analysis and modeling of emerging computing
  platforms

PROCEEDINGS
-----------
Proceedings of this workshop will be published by the IEEE Computer
Society (together with the ICPP conference proceedings) in CD format
only and will be available at the conference.

SUBMISSION INSTRUCTIONS
-----------------------
Submissions should be in PDF format in U.S. Letter size paper. They
should not exceed 8 pages (all inclusive). Submissions will be judged
based on relevance, significance, originality, correctness and clarity.

DATES AND DEADLINES
-------------------
Paper Submission:    April 11th, 2008
Author Notification: May 20th, 2008
Camera Ready:        June 2nd, 2008

PROGRAM CHAIRS
--------------
* Pavan Balaji (Argonne National Laboratory)
* Sayantan Sur (IBM Research)

STEERING COMMITTEE
------------------
* William D.
  Gropp (University of Illinois Urbana-Champaign)
* Dhabaleswar K. Panda (Ohio State University)
* Vijay Saraswat (IBM Research)

PROGRAM COMMITTEE
-----------------
* David Bernholdt (Oak Ridge National Laboratory)
* Ron Brightwell (Sandia National Laboratory)
* Wu-chun Feng (Virginia Tech)
* Richard Graham (Oak Ridge National Laboratory)
* Hyun-wook Jin (Konkuk University, South Korea)
* Sameer Kumar (IBM Research)
* Doug Lea (State University of New York at Oswego)
* Jarek Nieplocha (Pacific Northwest National Laboratory)
* Scott Pakin (Los Alamos National Laboratory)
* Vivek Sarkar (Rice University)
* Rajeev Thakur (Argonne National Laboratory)
* Pete Wyckoff (Ohio Supercomputing Center)

If you have any questions, please contact us at p2s2-chairs@mcs.anl.gov

From saville at comcast.net Wed Mar 19 16:31:21 2008
From: saville at comcast.net (Gregg Germain)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] Simple MPI programs hang
Message-ID: <47E1A249.5030206@comcast.net>

Hi everyone,

I've created a 2 node cluster running FC8. I've installed MPICH2
1.0.6pl on both (not NFS'd).

The Master, Ragnar, is a 64 bit; olaf is a 32 bit.

I set up the ring, and mpdtrace shows:

$ mpdtrace -l
Ragnar_37601 (192.168.0.2)
olaf_45530 (192.168.0.5)
$

I run a VERY simple MPI program and it hangs:

#include "mpi.h"
#include <stdio.h>   /* for printf(); the other header names were lost in the archive */

int main( int argc, char *argv[] )
{
    MPI_Init(&argc,&argv);
    printf("Hello!\n");
    MPI_Finalize();
    return 0;
}

The program outputs the two lines for the two nodes and hangs. I have to
Ctrl-C out of it:

[gregg@Ragnar ~/BEOAPPS]$ mpiexec -l -n 2 mpibase
0: Hello!
1: Hello!

It would sit there forever if I didn't bail.
Other simple tests work fine:

Running a simple "hostname" test works fine:

$ mpiexec -l -n 2 hostname
0: Ragnar
1: olaf
$

Now I run a Hello World (no MPI):

#include <stdio.h>   /* for printf(); the other header name was lost in the archive */

int main(int argc,char *argv[])
{
    printf("\nHello World!\n");
    return 0;
}

$ mpiexec -l -n 2 ../HelloWorld
0:
0: Hello World!
1:
1: Hello World!
$

Any help with this would be appreciated

Gregg

From gdjacobs at gmail.com Wed Mar 19 18:13:56 2008
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] Simple MPI programs hang
In-Reply-To: <47E1A249.5030206@comcast.net>
References: <47E1A249.5030206@comcast.net>
Message-ID: <47E1BA54.4080406@gmail.com>

Gregg Germain wrote:
> Hi everyone,
>
> I've created a 2 node cluster running FC8. I've installed MPICH2
> 1.0.6pl on both (not NFS'd).
>
> The Master, Ragnar, is a 64 bit; olaf is a 32 bit.
>
> I set up the ring, and mpdtrace shows:
>
> $ mpdtrace -l
> Ragnar_37601 (192.168.0.2)
> olaf_45530 (192.168.0.5)
> $
>
> I run a VERY simple MPI program and it hangs:
> #include "mpi.h"
> #include <stdio.h>   /* other header names were lost in the archive */
>
> int main( int argc, char *argv[] )
> {
>     MPI_Init(&argc,&argv);
>     printf("Hello!\n");
>     MPI_Finalize();
>     return 0;
> }
>
> The program outputs the two lines for the two nodes and hangs. I have to
> CNTRL-C out of it:
>
> [gregg@Ragnar ~/BEOAPPS]$ mpiexec -l -n 2 mpibase
> 0: Hello!
> 1: Hello!
>
> It would sit there forever if I didn't bail. Other simple tests work fine:
>
> Running a simple "hostname" test works fine:
>
> $ mpiexec -l -n 2 hostname
> 0: Ragnar
> 1: olaf
> $
>
> Now I run a Hello World (no MPI):
> #include <stdio.h>   /* other header name was lost in the archive */
>
> int main(int argc,char *argv[])
> {
>     printf("\nHello World!\n");
>     return 0;
> }
>
> $ mpiexec -l -n 2 ../HelloWorld
> 0:
> 0: Hello World!
> 1:
> 1: Hello World!
> $
>
> Any help with this would be appreciated
>
> Gregg

Last time I checked, MPICH2 does not permit heterogeneous machine
architectures.
If Ragnar is using an AMD64 build of MPICH2 and Olaf is using an
MPICH2 build targeting IA32, you are most likely seeing an ABI conflict.

You can get around this by using a 32 bit compiler and MPICH library on
Ragnar, or a 32 bit development environment residing in a chroot, or a
hosted 32 bit VM image, or just reinstall Ragnar as 32 bit only.

Or you can go shopping for a different MPI library. The Open MPI people
look like they're actively working on this functionality, for example.

--
Geoffrey D. Jacobs

From wrankin at ee.duke.edu Thu Mar 20 15:51:54 2008
From: wrankin at ee.duke.edu (Bill Rankin)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] Simple MPI programs hang
In-Reply-To: <47E1A249.5030206@comcast.net>
References: <47E1A249.5030206@comcast.net>
Message-ID: <41DA62C4-678F-401F-AA4C-83BA221F7A79@ee.duke.edu>

On Mar 19, 2008, at 7:31 PM, Gregg Germain wrote:

> Hi everyone,
>
> I've created a 2 node cluster running FC8. I've installed MPICH2
> 1.0.6pl on both (not NFS'd).
>
> The Master, Ragnar, is a 64 bit; olaf is a 32 bit.
>
> I set up the ring, and mpdtrace shows:
>
> $ mpdtrace -l
> Ragnar_37601 (192.168.0.2)
> olaf_45530 (192.168.0.5)
> $
>
> I run a VERY simple MPI program and it hangs:
> #include "mpi.h"
> #include <stdio.h>   /* other header names were lost in the archive */
>
> int main( int argc, char *argv[] )
> {
>     MPI_Init(&argc,&argv);
>     printf("Hello!\n");
>     MPI_Finalize();
>     return 0;
> }
>

If it's not a 32/64 bit issue (as the other poster mentioned) you
should try putting an MPI_Barrier() call immediately before the
MPI_Finalize(). This will at least make sure that both instances have
made it completely through MPI_Init() before one or the other exits.

-bill

From eagles051387 at gmail.com Thu Mar 20 12:32:49 2008
From: eagles051387 at gmail.com (Jon Aquilina)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] bonic projects on a cluster
Message-ID:

has anyone set up a cluster to be used with the analysis of decently
sized boinc project data sets.
do they integrate nicely into a cluster environment?

--
Jonathan Aquilina

From carsten.aulbert at aei.mpg.de Fri Mar 21 00:11:33 2008
From: carsten.aulbert at aei.mpg.de (Carsten Aulbert)
Date: Sun Jul 27 01:06:58 2008
Subject: [Beowulf] bonic projects on a cluster
In-Reply-To:
References:
Message-ID: <47E35FA5.6050702@aei.mpg.de>

Hi,

Jon Aquilina wrote:
> has anyone setup a cluster to be used with the analysis of decently
> sized boinc project data sets. do they integrate nicely into a cluster
> environment?

Yes, Einstein@Home has been used for this, either as a backfill when
no job was running on the node (core) or as a continuous job, maximally
niced, in the background. Usually the memory footprint is not too
harsh, so there is not much of a problem there. However, if running
Linux, be advised to stay away from the Completely Fair Scheduler (CFS).
This tends to give too many cycles to the job if other