* XFS/Linux Sanity check
@ 2011-05-02 15:47 Paul Anderson
From: Paul Anderson @ 2011-05-02 15:47 UTC (permalink / raw)
To: xfs
Our genetic sequencing research group is growing our file storage from
1PB to 2PB.
Our workload looks very much like large video processing might look -
relatively low metadata, very, very high sequential I/O. The servers
will either be doing very high I/O with local I/O bound jobs, or
serving data via NFSv4 (or possibly custom data distribution means) to
our compute grid for compute bound jobs. Our first PB of data is
largely on Promise RAID arrays, all of which are set up with XFS.
Generally, we're big fans of XFS for stability, high performance, and
robustness in the face of crashes. We tried ZFS, but ran into I/O
throttling issues that at the time seemed intractable (write picketing -
essentially halving the maximum write rate of the hardware).
We are deploying five Dell 810s, 192GiB RAM, 12 core, each with three
LSI 9200-8E SAS controllers, and three SuperMicro 847 45 drive bay
cabinets with enterprise grade 2TB drives.
We're running Ubuntu 10.04 LTS, and have tried both the stock kernel
(2.6.32-30) and 2.6.35 from linux.org. We organize the storage as one
software (MD) RAID 0 composed of 7 software RAID (MD) 6s, each with 18
drives, giving 204 TiB usable (9 drives of the 135 are unused). XFS
is set up properly (as far as I know) with respect to stripe and chunk
sizes. Allocation groups are 1TiB in size, which seems sane for the
size of files we expect to work with.
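For reference, the array and filesystem creation might look roughly like the sketch below. The post does not give the actual commands; the device names and the mdadm/mkfs.xfs invocations are illustrative (and commented out, since they destroy data), using the geometry stated in the thread: 18-drive RAID6 (16 data + 2 parity) with a 64KiB chunk.

```shell
# Geometry from the thread: 18 drives per RAID6, 64KiB md chunk.
CHUNK_KB=64
DATA_DISKS=16   # 18 drives minus 2 parity

# Illustrative creation commands (device names are assumptions):
#   mdadm --create /dev/md0 --level=6 --raid-devices=18 --chunk=$CHUNK_KB /dev/sd[b-s]
#   mkfs.xfs -d su=${CHUNK_KB}k,sw=${DATA_DISKS} /dev/md8

# Sanity check: the full stripe width XFS should see is chunk * data disks.
echo $((CHUNK_KB * DATA_DISKS))   # KiB per full RAID6 stripe
```

The check prints 1024 (KiB), i.e. a 1MiB full stripe per RAID6, which is consistent with the sunit/swidth values reported later in the thread.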
In isolated testing, I see around 5GiBytes/second raw (135 parallel dd
reads), and with a benchmark test of 10 simultaneous 64GiByte dd
commands, I can see just shy of 2 GiBytes/second reading, and around
1.4GiBytes/second writing through XFS. The benchmark is crude, but
fairly representative of our expected use.
md apparently does not support barriers, so I know we are badly exposed
in that respect. As a test, I disabled the write cache on all drives;
performance dropped by 30% or so, but since md is apparently the
problem, barriers still didn't work.
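Toggling the per-drive write cache is typically done with hdparm; a dry-run sketch (the drive list is a stand-in, and the commands are printed rather than executed since they need real hardware):

```shell
# Stand-in drive list; a real host would enumerate /dev/sd* instead.
drives="sdb sdc sdd"
for d in $drives; do
    # hdparm -W0 turns the drive's volatile write cache off (-W1 turns it on)
    echo "hdparm -W0 /dev/$d"
done
```

Dropping the `echo` runs the commands for real.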
Nonetheless, what we need, but don't have, is stability.
With 2.6.32-30, we get reliable kernel panics after about 2 days of
sustained rsync to the machine (around 150-250MiBytes/second for the
entire time - the source machines are slow). With 2.6.35, we hit a bad
resource contention problem much more quickly - well under 24 hours.
In that case we start getting XFS kernel thread timeouts similar to
what I've seen posted here recently, but it isn't clear whether it is
only XFS or also the ext3 boot drives that are starved for I/O -
suspending or killing all I/O load doesn't solve the problem; only a
reboot does.
First, I'd like to find informed opinions about how I can improve this
arrangement - we are mildly flexible on RAID controllers, very
flexible on versions of Linux, etc., and can try other OSes as a last
resort (the leading contender there would be "something" running ZFS,
and though I love ZFS, it really didn't seem to work well for our
needs).
Secondly, I welcome suggestions about which version of the Linux
kernel you'd prefer to hear bug reports about, as well as what kinds
of output are most useful (we're setting up all chassis with serial
consoles so we can use kgdb and capture full kernel panic output).
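For reference, a typical serial-console-plus-SysRq setup looks something like the fragment below. The port, speed, and file paths are assumptions to adapt, not details from the post:

```shell
# /etc/default/grub fragment: mirror kernel output to the serial port
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"

# /etc/sysctl.conf: keep magic SysRq enabled so sysrq-t over serial
# (or "echo t > /proc/sysrq-trigger") can still dump all task stacks
# during a hang
kernel.sysrq = 1
```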
Thanks in advance,
Paul Anderson
Center for Statistical Genetics
University of Michigan
USA
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: XFS/Linux Sanity check
From: Andi Kleen @ 2011-05-02 17:09 UTC (permalink / raw)
To: Paul Anderson; +Cc: xfs

Paul Anderson <pha@umich.edu> writes:
>
> md apparently does not support barriers, so we are badly exposed in

MD barriers for RAID-0 were added in 2.6.33, and very recent kernels
have improved barrier support further.

> Secondly, I welcome suggestions about which version of the linux
> kernel you'd prefer to hear bug reports about, as well as what kinds

Kernel developers usually want reports about the newest versions,
ideally with enough information to debug the problem (that is,
backtraces etc.).

When you have a hang but the console is still active, you can also
just dump the threads with sysrq-t.

Often when things hang, it's the underlying I/O subsystem (driver,
I/O device) that is at fault.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only
* Re: XFS/Linux Sanity check
From: Emmanuel Florac @ 2011-05-02 17:13 UTC (permalink / raw)
To: Paul Anderson; +Cc: xfs

On Mon, 2 May 2011 11:47:48 -0400, Paul Anderson <pha@umich.edu> wrote:

> We are deploying five Dell 810s, 192GiB RAM, 12 core, each with three
> LSI 9200-8E SAS controllers, and three SuperMicro 847 45 drive bay
> cabinets with enterprise grade 2TB drives.

I have very little experience with these RAID controllers. However, I
have a 9212 4i4e (same card generation and same chipset) in test, and
so far I must say it looks like _utter_ _crap_. The performance is
abysmal (it's been busy rebuilding a 20TB array for... 6 days!); the
server regularly freezes and crashes without any reason (it's a pure
dev system with virtually zero load and zero I/O); and there have been
lots of filesystem corruptions. I'm running a plain vanilla 64-bit
2.6.32.25 kernel that poses no problem whatsoever with any other
configuration.

> In isolated testing, I see around 5GiBytes/second raw (135 parallel dd
> reads), and with a benchmark test of 10 simultaneous 64GiByte dd
> commands, I can see just shy of 2 GiBytes/second reading, and around
> 1.4GiBytes/second writing through XFS. The benchmark is crude, but
> fairly representative of our expected use.

I don't understand why there's such a gap between the raw and XFS
performance. Generally XFS gives 90% or more of raw performance.

> md apparently does not support barriers, so we are badly exposed in
> that manner, I know. As a test, I disabled write cache on all drives,
> performance dropped by 30% or so, but since md is apparently the
> problem, barriers still didn't work.

Frankly, I'd stay away from md at this array size. I'm pretty sure
you're exploring uncharted territory here.

> Ideally, I'd firstly be able to find informed opinions about how I can
> improve this arrangement - we are mildly flexible on RAID controllers,
> very flexible on versions of Linux, etc, and can try other OS's as a
> last resort (but the leading contender here would be "something"
> running ZFS, and though I love ZFS, it really didn't seem to work well
> for our needs).

I can't yet be sure because I plan more testing with this card, but
I'd ditch the LSI controllers for LSI/3Ware or Adaptec (or perhaps
Areca), and stay away from md RAID in favour of hardware RAID. I'm a
hardware RAID freak, but... hardware RAID allows a proper write cache,
for a start (because it has BBUs).

-- 
------------------------------------------------------------------------
Emmanuel Florac | Direction technique | Intellique
<eflorac@intellique.com> | +33 1 78 94 84 02
------------------------------------------------------------------------
* FYI: LSI rebuilding; and XFS speed V. raw - hints on maxing out 'dd' (if not already obvious)
From: Linda Walsh @ 2011-06-11 1:33 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Paul Anderson, xfs

Emmanuel Florac wrote:
> On Mon, 2 May 2011 11:47:48 -0400, Paul Anderson <pha@umich.edu> wrote:
>
>> We are deploying five Dell 810s, 192GiB RAM, 12 core, each with three
>> LSI 9200-8E SAS controllers, and three SuperMicro 847 45 drive bay
>> cabinets with enterprise grade 2TB drives.
>
> I have very little experience with these RAID controllers. However I
> have a 9212 4i4e (same card generation and same chipset) in test, and so
> far I must say it looks like _utter_ _crap_. The performance is abysmal
> (it's been busy rebuilding a 20TB array for... 6 days!); the server

----
By default the card only allocates about 20% of its disk capacity to
rebuilds, with the rest reserved for 'real work'. It's not so smart as
to use 100% when there is no real work. If you enter the control
software (it runs under X on Linux -- it even displays on CygX) and
set something like 90%, you'll find your rebuilds go much faster, but
you can expect any real access to the device to suffer accordingly.

I have a 9285-8E and have been pretty happy with its performance, but
I only have 10 data disks (2x6-disk RAID5 => RAID50) with 2TB SATAs
and get 1GB/s performance... about what I'd expect from disks that do
around 120MB/s each, with two RAID5 calculations on top.

----
The only other things I can think of when benching XFS for max
throughput:

1) A realtime partition might be an option (never tried it, but
   thought I'd mention it).

2) On "dd", if you are testing write performance, try pre-allocating
   the file (filling in the vars):

     xfs_io -f -c "truncate $size" -c "resvsp 0 $size" "$Newfile"

   then test for fragmentation (see if it is in one extent; one per
   line):

     xfs_bmap "$Newfile"

   If needed, try defragging:

     xfs_fsr "$Newfile"

   Then on "dd" use the conv="nocreat,notrunc" flags -- that way
   you'll be able to dump I/O directly into the file without it
   having to be created or allocated.
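Pulled together, the preallocate-then-write flow might look like the dry-run below. It only prints the commands (xfs_io/xfs_bmap need an XFS-mounted target), and the size and path are placeholders:

```shell
# Dry-run of the preallocate-then-write flow; prints the commands rather
# than running them. Swap 'cat' for direct execution on an XFS host.
size=64g                      # placeholder size
file=/exports/prealloc.dat    # placeholder path on the XFS filesystem
cat <<EOF
xfs_io -f -c "truncate $size" -c "resvsp 0 $size" $file
xfs_bmap $file
xfs_fsr $file
dd if=/dev/zero of=$file bs=1024k conv=nocreat,notrunc
EOF
```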
* Re: FYI: LSI rebuilding; and XFS speed V. raw - hints on maxing out 'dd' (if not already obvious)
From: Emmanuel Florac @ 2011-06-11 9:30 UTC (permalink / raw)
To: Linda Walsh; +Cc: Paul Anderson, xfs

On Fri, 10 Jun 2011 18:33:08 -0700, you wrote:

> I have a 9285-8E and have been pretty happy with its performance,
> but I only have 10 data disks (2x6-disk RAID5 => RAID50) with 2TB
> SATAs and get 1GB/s performance... about what I'd expect from disks
> that do around 120MB/s each, with two RAID5 calculations on top.

I've made some more tests since then; I had been using cheap drives
for testing that, for some reason, behave extraordinarily poorly in
combination with the LSI card. I've retested the card with Hitachi
Ultrastars and it worked just fine (though a bit slower than the
Adaptec 6xx5). OTOH, the Adaptec works fine with the "bad" drives
too, go figure.

I haven't yet explored the advanced capabilities of the card
(apparently it may have some snapshot mechanism) because MegaCli is
such a PITA :)

regards,
-- 
------------------------------------------------------------------------
Emmanuel Florac | Direction technique | Intellique
<eflorac@intellique.com> | +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: FYI: LSI rebuilding; and XFS speed V. raw - hints on maxing out 'dd' (if not already obvious)
From: Linda Walsh @ 2011-06-11 16:48 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Paul Anderson, xfs

Emmanuel Florac wrote:
> On Fri, 10 Jun 2011 18:33:08 -0700, you wrote:
>
>> I have a 9285-8E and have been pretty happy with its performance,
>> but I only have 10 data disks (2x6-disk RAID5 => RAID50) with 2TB
>> SATAs and get 1GB/s performance... about what I'd expect from disks
>> that do around 120MB/s each, with two RAID5 calculations on top.
>
> I've made some more tests since then; I had been using cheap drives
> for testing that, for some reason, behave extraordinarily poorly in
> combination with the LSI card. I've retested the card with Hitachi
> Ultrastars and it worked just fine (though a bit slower than the
> Adaptec 6xx5). OTOH, the Adaptec works fine with the "bad" drives
> too, go figure.

-----
I noticed this issue with the LSI card as well -- i.e. with 'bad'
low-quality drives. What I noticed, though, in independent testing of
the lower-quality drives, was that even though they were all spec'ed
at 7200RPM, their actual speeds varied from top to bottom by almost
50% (indirectly measured by top transfer speeds copying 128G files to
and from the drives... all freshly formatted, factory new, from the
same generation). Speeds ranged from about 129MB/s at the top down to
87MB/s at the bottom, with 95-120MB/s being typical. That is quite a
range of speeds; trying to synchronize reads/writes on drives with
such varying speeds would be a nightmare.

I tried another RAID controller and got about 60% of the LSI's
performance -- but that was also with the cheaper drives. The LSI
controller wouldn't "pass" the drives as "acceptable" if they were too
far out of spec. The Ultrastars (NOT the Deskstars, which are not
speed controlled) were all fine. I've heard of people having the same
problem with enterprise-class WD drives.

So it may be a matter of the controller 'protecting itself' from
drives that are out of spec, and therefore not considered 'RAID class'
drives. Some would argue that being more tolerant of poor drives is a
good thing, but if you want the fastest speed, you need to make sure
the drive speeds are well matched.
* Re: XFS/Linux Sanity check
From: Dave Chinner @ 2011-05-03 3:18 UTC (permalink / raw)
To: Paul Anderson; +Cc: xfs

On Mon, May 02, 2011 at 11:47:48AM -0400, Paul Anderson wrote:
> Our genetic sequencing research group is growing our file storage from
> 1PB to 2PB.
.....
> We are deploying five Dell 810s, 192GiB RAM, 12 core, each with three
> LSI 9200-8E SAS controllers, and three SuperMicro 847 45 drive bay
> cabinets with enterprise grade 2TB drives.

So roughly 250TB raw capacity per box.

> We're running Ubuntu 10.04 LTS, and have tried either the stock kernel
> (2.6.32-30) or 2.6.35 from linux.org.

(OT: why do people install a desktop OS on their servers?)

> We organize the storage as one
> software (MD) RAID 0 composed of 7 software RAID (MD) 6s, each with 18
> drives, giving 204 TiB usable (9 drives of the 135 are unused).

That's adventurous. I would seriously consider rethinking this -
hardware RAID-6 with controllers that have a significant amount of
BBWC is much more appropriate for this scale of storage. You get an
unclean shutdown (e.g. power loss) and MD is going to take _weeks_
to resync those RAID6 arrays. Background scrubbing is likely to
never cease, either....

Also, it would help to know how you spread out the disks in each
RAID-6 group between controllers, trays, etc., as that has important
performance and failure implications.

e.g. I'm guessing that you are taking 6 drives from each enclosure
for each 18-drive raid-6 group, which would split the RAID-6 group
across all three SAS controllers and enclosures. That means if you
lose a SAS controller or enclosure you lose all RAID-6 groups at
once, which is effectively catastrophic from a recovery point of view.
It also means that one slow controller slows down everything, so load
balancing is difficult.

Large stripes might look like a good idea, but when you get to this
scale, concatenation of high-throughput LUNs provides better
throughput because of less contention through the storage
controllers and enclosures.

> XFS
> is set up properly (as far as I know) with respect to stripe and chunk
> sizes.

Any details? You might be wrong ;)

> Allocation groups are 1TiB in size, which seems sane for the
> size of files we expect to work with.

Any filesystem over 16TB will use 1TB AGs.

> In isolated testing, I see around 5GiBytes/second raw (135 parallel dd
> reads), and with a benchmark test of 10 simultaneous 64GiByte dd
> commands, I can see just shy of 2 GiBytes/second reading, and around
> 1.4GiBytes/second writing through XFS. The benchmark is crude, but
> fairly representative of our expected use.

If you want insightful comments, then you'll need to provide intimate
details of the tests you ran and the results (e.g. command lines, raw
results, etc).

> md apparently does not support barriers, so we are badly exposed in
> that manner, I know. As a test, I disabled write cache on all drives,
> performance dropped by 30% or so, but since md is apparently the
> problem, barriers still didn't work.

Doesn't matter if you have BBWC on your hardware RAID controllers.
Seriously, if you want to sustain high throughput, you want a large
amount of BBWC in front of your disks....

> Nonetheless, what we need, but don't have, is stability.
>
> With 2.6.32-30, we get reliable kernel panics after 2 days of
> sustained rsync to the machine (around 150-250MiBytes/second for the
> entire time - the source machines are slow),

Stack traces from the crash?

> and with 2.6.35, we get a
> bad resource contention problem fairly quickly - much less than 24
> hours (in this instance, we start getting XFS kernel thread timeouts
> similar to what I've seen posted here recently, but it isn't clear
> whether it is only XFS or also ext3 boot drives that are starved for
> I/O - suspending or killing all I/O load doesn't solve the problem -
> only a reboot does).

Details of the timeout messages?

> Ideally, I'd firstly be able to find informed opinions about how I can
> improve this arrangement - we are mildly flexible on RAID controllers,
> very flexible on versions of Linux, etc, and can try other OS's as a
> last resort (but the leading contender here would be "something"
> running ZFS, and though I love ZFS, it really didn't seem to work well
> for our needs).
>
> Secondly, I welcome suggestions about which version of the linux
> kernel you'd prefer to hear bug reports about, as well as what kinds
> of output is most useful (we're getting all chassis set up with serial
> console so we can do kgdb and also full kernel panic output results).

If you want to stay on mainline kernels with best-effort community
support, I'd suggest that 2.6.38 or more recent kernels are the only
ones we're going to debug. If you want fixes, then running the current
-rc kernels is probably a good idea. It's unlikely you'll get anyone
backporting fixes for you to older kernels.

Alternatively, you can switch to something like RHEL (or SLES) where
XFS is fully supported (and in the RHEL case, pays my bills :). The
advantage of this is that once the bug is fixed in mainline, it will
get backported to the supported kernel you are running.

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: XFS/Linux Sanity check
From: Michael Monnerie @ 2011-05-03 8:58 UTC (permalink / raw)
To: xfs

On Tuesday, 3 May 2011, Dave Chinner wrote:
> > We're running Ubuntu 10.04 LTS, and have tried either the stock
> > kernel (2.6.32-30) or 2.6.35 from linux.org.
>
> (OT: why do people install a desktop OS on their servers?)

I can only speak for us: we use openSUSE, and tried SLES, but support
on openSUSE is better even though you pay for SLES, and openSUSE is
(by nature) newer in every package. That gives you (nearly) current
XFS improvements through newer kernels, and therefore more performance
(e.g. delaylog).

-- 
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531

// Radio interview on the subject of spam:
// http://www.it-podcast.at/archiv.html#podcast-100716
// House for sale: http://zmi.at/langegg/
* Re: XFS/Linux Sanity check
From: Paul Anderson @ 2011-05-03 16:05 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

Dave, thanks for your feedback - comments below, possibly of interest
to others.

Several underlying assumptions strongly influence the choices I've
made here. Sequential I/O is of paramount importance - all else is
nearly insignificant (not entirely true, but a reasonable plan for the
coming year or two). Highly I/O-intensive work can and should be done
locally to avoid networking (NFS and 10GigE just add more delays -
later, research could be done into saturating a 10GigE link in a
variety of other ways, but that is of secondary concern to me today).
Compute-intensive workloads will start looking more random, because
we'll send those out to the grid, and large numbers of incoming
requests make the I/O stream less predictable. Mind you, I envision
eliminating NFS or any other network filesystem in favor of straight
TCP/IP, or even something like RoCE from Red Hat. With proper
buffering, even serving data like this can look sequential by and
large.

The team here favors large filesystems because, from the user
perspective, they are simply easier than having to juggle space among
distinct partitions. The easy administrative solution of splitting
204TiB into, say, 7 mounted volumes imposes a real barrier to how work
is organized, and further wastes storage.

I believe that typical working file sizes will exceed 100GiB within a
year or two - for example, one project is generating 250 sequencing
sample files, each of which is 250GiB in size, which we need to pull,
reprocess, and analyze. This is fallout from the very rapid drop in
the cost of genome sequencing that is still underway.
On Mon, May 2, 2011 at 11:18 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, May 02, 2011 at 11:47:48AM -0400, Paul Anderson wrote:
>> Our genetic sequencing research group is growing our file storage from
>> 1PB to 2PB.
> .....
>> We are deploying five Dell 810s, 192GiB RAM, 12 core, each with three
>> LSI 9200-8E SAS controllers, and three SuperMicro 847 45 drive bay
>> cabinets with enterprise grade 2TB drives.
>
> So roughly 250TB raw capacity per box.
>
>> We're running Ubuntu 10.04 LTS, and have tried either the stock kernel
>> (2.6.32-30) or 2.6.35 from linux.org.
>
> (OT: why do people install a desktop OS on their servers?)

Our end users want many GUI-based apps running on the compute head
nodes, so we wind up installing most of the desktop anyway; it is just
easier to install it and add whatever server-related packages we may
need. I'm not fond of that situation myself.

>> We organize the storage as one
>> software (MD) RAID 0 composed of 7 software RAID (MD) 6s, each with 18
>> drives, giving 204 TiB usable (9 drives of the 135 are unused).
>
> That's adventurous. I would seriously consider rethinking this -
> hardware RAID-6 with controllers that have a significant amount of
> BBWC is much more appropriate for this scale of storage. You get an
> unclean shutdown (e.g. power loss) and MD is going to take _weeks_
> to resync those RAID6 arrays. Background scrubbing is likely to
> never cease, either....

18 hours from start - remember, the sync proceeds at over
4GiBytes/sec (14.5 hours if exactly 4GiBytes/second).

The big problem with my setup is the lack of BBWC. The controllers are
running in JBOD mode, and I can disable the per-drive write cache and
still maintain decent performance across the array. That said, there
are few if any cases where we care about loss of in-flight data - we
care a great deal about static data that is corrupted or lost due to
metadata corruption, so this is still probably an open issue (ideas
welcome).
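Paul's parenthetical resync figure can be reproduced with quick arithmetic (a back-of-the-envelope sketch, assuming the resync sweeps the 204TiB usable capacity at exactly 4GiB/s):

```shell
# 204 TiB swept at exactly 4 GiB/s, expressed in hours to one decimal
# place (integer arithmetic, scaled by 10 to keep the fraction).
tenths=$((204 * 1024 * 10 / 4 / 3600))   # TiB->GiB, / (GiB/s), / (s/hour), x10
echo "$((tenths / 10)).$((tenths % 10)) hours"
```

This prints 14.5 hours, matching the figure quoted in the reply.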
> Also, knowing how you spread out the disks in each RAID-6 group
> between controllers, trays, etc as that has important performance
> and failure implications.

You bet!

> e.g. I'm guessing that you are taking 6 drives from each enclosure
> for each 18-drive raid-6 group, which would split the RAID-6 group
> across all three SAS controllers and enclosures. That means if you
> lose a SAS controller or enclosure you lose all RAID-6 groups at
> once which is effectively catastrophic from a recovery point of view.
> It also means that one slow controller slows down everything so load
> balancing is difficult.

Each of the three enclosures has a pair of SAS expanders, and each LSI
9200-8e controller has two SAS cables, so I actually ordered the
RAID-6 drive sets as subsets of three, each drawn from successive
distinct controller cards in round-robin fashion, until there is a
full set of 18 drives. A wrinkle is that the SAS expanders have
differing numbers of drives - 24 in front, 21 in the rear (the other 3
in the rear are taken by the power supplies). So finding a good match
of RAID size versus available channels, and splitting I/O across those
channels, is a bit challenging.

> Large stripes might look like a good idea, but when you get to this
> scale concatenation of high throughput LUNs provides better
> throughput because of less contention through the storage
> controllers and enclosures.

I don't disagree, but what I'd need to do is run a scripted test
varying stripe size, stripe units, chunk size (an md parameter), etc.
- this gets cumbersome with 135 drives, as trying to strike a good
balance across the available resources is tedious and not automatic.
Basically, I found a combo (described immediately below) that works
pretty well, and started working on problems other than performance. I
have sufficient hardware to test other combinations, but the time to
run them is an issue for me
(i.e. set them up precisely right, babysit them, wait for parity to
build, then test - yes, I tested on various subsets of the full
126-drive array, but getting those configs right and then knowing you
can extrapolate to the full-size set is confusing and hurts my poor
little head).

>> XFS
>> is set up properly (as far as I know) with respect to stripe and chunk
>> sizes.
>
> Any details? You might be wrong ;)

Oh yes indeedy, I could be wrong! Each of the 126 in-use drives shows
something like this:

  /dev/sdbc1:
            Magic : a92b4efc
          Version : 1.1
      Feature Map : 0x0
       Array UUID : f3c44896:ecdcadca:153ee6d1:1770781f
             Name : louie:5  (local to host louie)
    Creation Time : Fri Apr  8 15:01:16 2011
       Raid Level : raid6
     Raid Devices : 18
   Avail Dev Size : 3907026856 (1863.02 GiB 2000.40 GB)
       Array Size : 62512429056 (29808.25 GiB 32006.36 GB)
    Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
      Data Offset : 264 sectors
     Super Offset : 0 sectors
            State : clean
      Device UUID : adbd8716:94ebf4a2:ea753ee0:418b7bd8
      Update Time : Tue May  3 11:18:45 2011
         Checksum : 44d36ef7 - correct
           Events : 187
       Chunk Size : 64K

There are 7 RAID-6 arrays, each of which looks like this:

  /dev/md0:
            Magic : a92b4efc
          Version : 1.1
      Feature Map : 0x0
       Array UUID : cbb4b32e:afc7126a:922e501d:9404011e
             Name : louie:8  (local to host louie)
    Creation Time : Fri Apr  8 15:02:20 2011
       Raid Level : raid0
     Raid Devices : 7
   Avail Dev Size : 62512429048 (29808.25 GiB 32006.36 GB)
    Used Dev Size : 0
      Data Offset : 8 sectors
     Super Offset : 0 sectors
            State : active
      Device UUID : 94bfd084:138f8ca5:2938df2e:1ef0b76d
      Update Time : Fri Apr  8 15:02:20 2011
         Checksum : d733d87a - correct
           Events : 0
       Chunk Size : 1024K
       Array Slot : 0 (0, 1, 2, 3, 4, 5, 6)
      Array State : Uuuuuuu

The seven RAID-6 devices are concatenated into a RAID 0:

  /dev/md8:
          Version : 01.01
    Creation Time : Fri Apr  8 15:02:20 2011
       Raid Level : raid0
       Array Size : 218793494528 (208657.74 GiB 224044.54 GB)
     Raid Devices : 7
    Total Devices : 7
  Preferred Minor : 8
      Persistence : Superblock is persistent
      Update Time : Fri Apr  8 15:02:20 2011
            State : clean
   Active Devices : 7
  Working Devices : 7
   Failed Devices : 0
    Spare Devices : 0
       Chunk Size : 1024K
             Name : louie:8  (local to host louie)
             UUID : cbb4b32e:afc7126a:922e501d:9404011e
           Events : 0

      Number   Major   Minor   RaidDevice State
         0       9        0        0      active sync   /dev/block/9:0
         1       9        1        1      active sync   /dev/block/9:1
         2       9        2        2      active sync   /dev/block/9:2
         3       9        3        3      active sync   /dev/block/9:3
         4       9        4        4      active sync   /dev/block/9:4
         5       9        5        5      active sync   /dev/block/9:5
         6       9        6        6      active sync   /dev/block/9:6

The xfs_info for the mounted volume is:

  meta-data=/dev/md8        isize=256    agcount=204, agsize=268435440 blks
           =                sectsz=512   attr=2
  data     =                bsize=4096   blocks=54698373632, imaxpct=1
           =                sunit=16     swidth=256 blks
  naming   =version 2       bsize=4096   ascii-ci=0
  log      =internal        bsize=4096   blocks=521728, version=2
           =                sectsz=512   sunit=16 blks, lazy-count=1
  realtime =none            extsz=4096   blocks=0, rtextents=0

The sunit/swidth parameters are chosen to exactly match the RAID-6
devices, not the RAID-0. Mount options are negligible, although I will
be trying this:

  UUID=0a675b55-d68a-41f2-8bb7-063e33123531 /exports xfs inode64,largeio,logbufs=8,noatime 0 2

All disk drives (almost a thousand here now) are Hitachi HUA72202 2TB
enterprise drives. We did a failed experiment awhile back with desktop
drives... never again.

>> Allocation groups are 1TiB in size, which seems sane for the
>> size of files we expect to work with.
>
> Any filesystem over 16TB will use 1TB AGs.
>
>> In isolated testing, I see around 5GiBytes/second raw (135 parallel dd
>> reads), and with a benchmark test of 10 simultaneous 64GiByte dd
>> commands, I can see just shy of 2 GiBytes/second reading, and around
>> 1.4GiBytes/second writing through XFS. The benchmark is crude, but
>> fairly representative of our expected use.
>
> If you want insightful comments, then you'll need to provide
> intimate details of the tests you ran and the results (e.g. command
> lines, raw results, etc).
To test raw read rates, I do this:

  for i in /dev/sd[b-z] /dev/sd[a-z][a-z] ; do
      dd if=$i of=/dev/null bs=1024k &
  done

killall dd gets rid of them. I use "dstat 1" to check what the kernel
thinks is happening.

For filesystem tests (configured and mounted as described above with
the mdadm output and xfs_info), I do this:

  for load in 0 1 2 3 4 5 6 7 8 9 ; do
      dd if=/dev/zero of=/exports/load_$load$step bs=1024k count=32768 &
  done

Later, to test reads, I do:

  for load in 0 1 2 3 4 5 6 7 8 9 ; do
      dd of=/dev/null if=/exports/load_$load bs=1024 &
  done

In both cases, I watch I/O rates after the buffers overflow - with
192GB of RAM, this takes a few seconds. For giggles, I've allowed the
read commands to cache 20-100GB in RAM, then rerun the read test to
see what a cached read rate looks like - interestingly, the aggregate
dd-reported I/O rate in that case is around 5GiBytes/second,
indicating that this approaches an upper limit for this particular
chassis.

I am fully aware that this is a simplified test. I'm also quite
familiar with the workload, and know this is a reasonable facsimile of
what we do. Better real-world benchmarking for us now consists of
end-user jobs - day-long jobs on a single sequencing run using a bunch
of home-grown software.

>> md apparently does not support barriers, so we are badly exposed in
>> that manner, I know. As a test, I disabled write cache on all drives,
>> performance dropped by 30% or so, but since md is apparently the
>> problem, barriers still didn't work.
>
> Doesn't matter if you have BBWC on your hardware RAID
> controllers. Seriously, if you want to sustain high throughput, you
> want a large amount of BBWC in front your disks....

Here we get to performance expectations and goals - from my testing so
far, I can reasonably say I'm happy with the performance of the
software RAID with XFS running on top of it. What I need now is
stability and robustness in the face of crashes.
I'm still perfectly willing to buy good HW RAID cards, don't get me
wrong, but their main benefit to me will be the battery-backed cache,
not the performance.

Keep in mind that it is hard to balance a HW RAID card across
multiple SAS expanders - you can certainly get a -16e card of some
sort, but then it does ALL of the I/O to those 4 expanders ALL of the
time.  I'm not sure that is a win, either.  Cheaper cards, one per
expander, might work, though (but with six 8x slots available,
probably a HW RAID card with 8e would be best - run two expanders per
card as I do now).

>
>> Nonetheless, what we need, but don't have, is stability.
>>
>> With 2.6.32-30, we get reliable kernel panics after 2 days of
>> sustained rsync to the machine (around 150-250MiBytes/second for the
>> entire time - the source machines are slow),
>
> Stack traces from the crash?

Mostly a non-responsive console, and kgdb was not set up at the time -
I am trying to get this set up now.  Here's the one stack trace I
wrote down from the console (again from a 2.6.32-30 kernel):

RSP 0018:ffff880dcce39e48 EFLAGS 287
_spin_lock+0xe/0x20
futex_wake+0x7d/0x130
handle_mm_fault+0x1a8/0x3c0
do_futex+0x68/0x1b0
sys_futex+0x7b/0x170
do_page_fault+0x158/0x3b0
system_call_fastpath+0x16/0x1b

All other info was lost - other crashes result in a locked console
that we've not been able to revive.

The load on the system at the time of the crash was simply 3-4
rsyncs copying data via 'ssh -c arcfour' over to the XFS filesystem
(basically loading up the test server with user data for further
testing).  Sustained I/O rates were moderate - 200-400MiBytes/second.
No swap, CPU load of significance, or user jobs.

Obviously, this is an old kernel and of less interest, but it
nonetheless answers your question.
>
>> and with 2.6.35, we get a
>> bad resource contention problem fairly quickly - much less than 24
>> hours (in this instance, we start getting XFS kernel thread timeouts
>> similar to what I've seen posted here recently, but it isn't clear
>> whether it is only XFS or also ext3 boot drives that are starved for
>> I/O - suspending or killing all I/O load doesn't solve the problem -
>> only a reboot does).
>
> Details of the timeout messages?

Here are some typical ones from yesterday, when I was trying to run
the sync command on a relatively lightly loaded 2.6.35 machine
(sustained 100MiByte/second copies onto the server in question):

[178602.197456] INFO: task sync:2787 blocked for more than 120 seconds.
[178602.203933] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[178602.211863] sync          D 0000000000000000     0  2787   2691 0x00000000
[178602.211867]  ffff880d2dc51cd8 0000000000000086 ffff880d2dc51cc8 0000000000015880
[178602.211870]  ffff880d2dc51fd8 0000000000015880 ffff880d2dc51fd8 ffff8817fb725d40
[178602.211872]  0000000000015880 0000000000015880 ffff880d2dc51fd8 0000000000015880
[178602.211875] Call Trace:
[178602.211887]  [<ffffffff81050641>] ? select_task_rq_fair+0x561/0x8e0
[178602.211893]  [<ffffffff8156436d>] schedule_timeout+0x22d/0x310
[178602.211896]  [<ffffffff8104af33>] ? enqueue_task_fair+0x43/0x90
[178602.211898]  [<ffffffff8104e609>] ? enqueue_task+0x79/0x90
[178602.211900]  [<ffffffff81563606>] wait_for_common+0xd6/0x180
[178602.211904]  [<ffffffff81053310>] ? default_wake_function+0x0/0x20
[178602.211910]  [<ffffffff81167570>] ? sync_one_sb+0x0/0x30
[178602.211912]  [<ffffffff8156378d>] wait_for_completion+0x1d/0x20
[178602.211915]  [<ffffffff81162b19>] sync_inodes_sb+0x89/0x180
[178602.211955]  [<ffffffffa032c0f1>] ? xfs_quiesce_data+0x71/0xc0 [xfs]
[178602.211958]  [<ffffffff81167570>] ? sync_one_sb+0x0/0x30
[178602.211960]  [<ffffffff81167558>] __sync_filesystem+0x88/0xa0
[178602.211962]  [<ffffffff81167590>] sync_one_sb+0x20/0x30
[178602.211966]  [<ffffffff81142afb>] iterate_supers+0x8b/0xd0
[178602.211968]  [<ffffffff811675e5>] sys_sync+0x45/0x70
[178602.211973]  [<ffffffff8100a072>] system_call_fastpath+0x16/0x1b

>
>> Ideally, I'd firstly be able to find informed opinions about how I can
>> improve this arrangement - we are mildly flexible on RAID controllers,
>> very flexible on versions of Linux, etc., and can try other OSes as a
>> last resort (but the leading contender here would be "something"
>> running ZFS, and though I love ZFS, it really didn't seem to work well
>> for our needs).
>>
>> Secondly, I welcome suggestions about which version of the Linux
>> kernel you'd prefer to hear bug reports about, as well as what kind
>> of output is most useful (we're getting all chassis set up with serial
>> console so we can do kgdb and also capture full kernel panic output).
>
> If you want to stay on mainline kernels with best-effort community
> support, I'd suggest 2.6.38 or more recent kernels are the only ones
> we're going to debug.  If you want fixes, then running the current -rc
> kernels is probably a good idea.  It's unlikely you'll get anyone
> backporting fixes for you to older kernels.

I will be doing that today.  We could backport if it were crucial to
do so, but I'm not aware of any local reasons why this would be so.

>
> Alternatively, you can switch to something like RHEL (or SLES) where
> XFS is fully supported (and in the RHEL case, pays my bills :). The
> advantage of this is that once the bug is fixed in mainline, it will
> get backported to the supported kernel you are running.

We're buying a RHEL support license today - hooray!  My rationale for
doing that is that I'm not convinced I will be seeing just XFS issues
in the kernel - the stack trace I reported is more generic than XFS...

Paul

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: XFS/Linux Sanity check
  2011-05-03 16:05 ` Paul Anderson
@ 2011-05-04 10:36   ` Stan Hoeppner
  0 siblings, 0 replies; 12+ messages in thread
From: Stan Hoeppner @ 2011-05-04 10:36 UTC (permalink / raw)
To: xfs

On 5/3/2011 11:05 AM, Paul Anderson wrote:

> I'm still perfectly willing to buy good HW RAID cards, don't get me
> wrong, but their main benefit to me will be the battery backed cache,
> not the performance.

Good RAID cards have many more advantages than battery cache and
performance.  One is moving a RAID card and its attached arrays from
a failed host to a new one.  In the case of a hardware RAID card,
usually all that is required is loading the HBA driver and mounting
the filesystem.  Such a move of an mdraid array is usually, well, not
nearly as straightforward, to put it kindly.

> Keep in mind that it is hard to balance a HW RAID card across multiple
> SAS expanders - you can certainly get a -16e card of some sort, but
> then it does ALL of the I/O to those 4 expanders ALL of the time.

I'm not sure I know exactly what you mean here, Paul.  You seem to be
talking about RAID card <-> drive chassis cabling flexibility and
symmetrical bandwidth.  The following two SAS expander/switch
products are likely worth a quick read:

http://www.intel.com/Products/Server/RAID-controllers/re-res2sv240/RES2SV240-Overview.htm
http://www.lsi.com/channel/products/switch/sas6160/index.html

Using an LSI 9260-4i single 8087-port RAID card, the Intel expander,
and some 8087/8088 panel converters, one could attach *5* 24-drive
LSI 620J SAS JBOD chassis for a total of 120 drives with equal
bandwidth to/from all drives, about 2GB/s total bandwidth, RAID ASIC
limited.  Few would want to connect 120 drives to such a single-port
RAID controller, but this example demonstrates that symmetry can be
achieved across a large number of cascaded SAS expander ASICs (6
total) with a lot of drives.
--
Stan
* Re: XFS/Linux Sanity check
  2011-05-03  3:18 ` XFS/Linux Sanity check Dave Chinner
  2011-05-03  8:58   ` Michael Monnerie
  2011-05-03 16:05   ` Paul Anderson
@ 2011-05-04  6:18   ` Stan Hoeppner
  2 siblings, 0 replies; 12+ messages in thread
From: Stan Hoeppner @ 2011-05-04 6:18 UTC (permalink / raw)
To: xfs

On 5/2/2011 10:18 PM, Dave Chinner wrote:

> Also, knowing how you spread out the disks in each RAID-6 group
> between controllers, trays, etc as that has important performance
> and failure implications.
>
> e.g. I'm guessing that you are taking 6 drives from each enclosure
> for each 18-drive raid-6 group, which would split the RAID-6 group
> across all three SAS controllers and enclosures. That means if you
> lose a SAS controller or enclosure you lose all RAID-6 groups at
> once, which is effectively catastrophic from a recovery point of view.
> It also means that one slow controller slows down everything, so load
> balancing is difficult.

Assuming Paul's SC847 SAS chassis have the standard EL1 backplanes,
his bandwidth profile per chassis is:

24 x 6Gb/s drives on 4 x 6Gb/s host ports via 36-port LSI expander
21 x 6Gb/s drives on 4 x 6Gb/s host ports via 36-port LSI expander

Not balanced, but not horribly bad.  I recommend using one LSI
9285-8E RAID card per SC847 chassis, one SFF-8088 cable connected to
the front backplane, the other connected to the rear.  Create two
21-drive RAID6 arrays, taking care that one array consists only of
drives on the front backplane, the other array only of drives on the
rear backplane.  Configure the remaining 3 drives on the front
backplane as cold spares.  Not perfect, but I think the best solution
given the unbalanced nature of the chassis backplanes.

> Large stripes might look like a good idea, but when you get to this
> scale concatenation of high throughput LUNs provides better
> throughput because of less contention through the storage
> controllers and enclosures.
Now create an LVM or mdraid concatenated device of the 6 hardware
RAID6 LUNs.  Format the resulting device with mkfs.xfs defaults,
allowing XFS allocation groups to drive your parallelism and
throughput instead of a big stripe, just as Dave recommends.

Each 9285-8E should be able to pump streaming reads at about 3.2 to
3.5GB/s, a little less than the streaming aggregate capability of the
38 RAID6 spindles.  At this throughput level you're bumping against
the PCIe 2.0 x8 one-way bandwidth limit after encoding and error
correction overhead.  So overall I think you're fairly well balanced
now, overcoming the slight imbalance of the disk chassis
configuration.

Assuming you're able to load balance interrupts and tune things
optimally, and assuming the Intel chipset in the R810 is up to the
task, the above recommended setup should be capable of 8-10GB/s
throughput with a parallel workload.

Newegg carries both the 9285-8E and the cache battery unit, ~$1200
total.  So it'll run you about $18,000 for 15 units for 5 servers,
about 3x what you spent on the 9200-8E cards, and worth every sweet
penny.

--
Stan
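(The PCIe 2.0 x8 one-way ceiling mentioned above can be sketched from the spec's per-lane rate and 8b/10b encoding; the exact packet/protocol overhead that brings 4.0 GB/s down toward the quoted 3.2-3.5GB/s varies with payload size and is not computed here.)

```python
# PCIe 2.0 x8 one-way bandwidth, before packet/protocol overhead.
lanes = 8
gts_per_lane = 5e9           # PCIe 2.0: 5 GT/s per lane
payload_frac = 8 / 10        # 8b/10b encoding: 8 data bits per 10 on the wire

# bits/s of payload per lane -> bytes/s -> GB/s across all lanes
one_way_gbs = lanes * gts_per_lane * payload_frac / 8 / 1e9
print(one_way_gbs)           # 4.0 GB/s raw one-way
```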
* Re: XFS/Linux Sanity check
  2011-05-02 15:47 XFS/Linux Sanity check Paul Anderson
  ` (2 preceding siblings ...)
  2011-05-03  3:18 ` XFS/Linux Sanity check Dave Chinner
@ 2011-05-04  1:10 ` Stan Hoeppner
  3 siblings, 0 replies; 12+ messages in thread
From: Stan Hoeppner @ 2011-05-04 1:10 UTC (permalink / raw)
To: xfs

On 5/2/2011 10:47 AM, Paul Anderson wrote:

Hi Paul,

> md apparently does not support barriers, so we are badly exposed in
> that manner, I know.  As a test, I disabled write cache on all drives,
> performance dropped by 30% or so, but since md is apparently the
> problem, barriers still didn't work.
...
> Ideally, I'd firstly be able to find informed opinions about how I can
> improve this arrangement - we are mildly flexible on RAID controllers,

I'm not familiar enough with the md driver to address the barrier
issue.  Try the mdadm mailing list.  However...

You should be able to solve the barrier issue, and get additional
advantages, by simply swapping out the LSI 9200-8Es for the 9285-8E
with cache battery.  The 9285 has a dual-core 800MHz PowerPC (vs a
single-core 533MHz on the 9280) and 1GB of cache.  Configure 3 x
15-drive hardware RAID6 arrays per controller, then stitch the
resulting 9 arrays together with mdraid or LVM striping or
concatenation.  I'd test both under your normal multi-streaming
workload to see which works best.

A multilevel stripe will show better performance with an artificial
single-stream test such as dd, but under your operational
multiple-stream workload, concatenation may have similar performance,
while at the same time giving you additional capability, especially
if done with LVM instead of mdraid --linear.  Using LVM concatenation
enables snapshots and the ability to grow and shrink the volume,
neither of which you can do with striping (RAID 0).

The 9285-8E will be pricier than the 9280-8E, but it's well worth the
extra dollars, given the low overall cost percentage of the HBAs vs
total system cost.
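(A minimal sketch of the LVM concatenation approach described above. The device names /dev/sdb through /dev/sdd are hypothetical placeholders for the LUNs the RAID controllers would export; this is an illustration of the recipe, not commands to run blindly, and all of it requires root against real block devices.)

```sh
# Hypothetical LUN names - substitute the devices your controllers export.
pvcreate /dev/sdb /dev/sdc /dev/sdd
vgcreate bigvg /dev/sdb /dev/sdc /dev/sdd

# Linear (concatenated, not striped) logical volume spanning all LUNs.
lvcreate -l 100%FREE -n bigvol bigvg

# mkfs.xfs defaults: no stripe alignment is forced across the concat,
# letting XFS allocation groups provide the parallelism.
mkfs.xfs /dev/bigvg/bigvol
mount -o inode64,noatime /dev/bigvg/bigvol /exports
```

Because the volume is linear rather than striped, it can later be grown with lvextend plus xfs_growfs, and LVM snapshots remain available.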
You'll get better performance and the data safety you're looking
for.  Just make sure that in addition to BBWC on the HBAs you have
good UPS units backing the servers and SC847 chassis.

> very flexible on versions of Linux, etc, and can try other OS's as a
> last resort (but the leading contender here would be "something"
> running ZFS, and though I love ZFS, it really didn't seem to work well
> for our needs).

Supermicro product is usually pretty decent.  However, "DIY" arrays
comprised of an inexpensive tier 2/3 vendor drive
box/backplane/expander and off-the-shelf drives, whose firmware may
not all match, can often be a recipe for problems that are difficult
to troubleshoot.  Your problems may not be caused by a kernel issue
at all.  The kernel may simply be showing the symptoms, but not the
cause.

You've ordered, if my math is correct, 675 'enterprise class' 2TB
SATA drives: 45 per chassis, 135 per system, 5 systems.  Did you
specify/verify with the vendor that all drives must be of the same
manufacturing lot and have matching firmware?  When building huge
storage subsystems it is critical that all drives behave the same,
which usually means identical firmware.

> Secondly, I welcome suggestions about which version of the linux
> kernel you'd prefer to hear bug reports about, as well as what kinds
> of output is most useful (we're getting all chassis set up with serial
> console so we can do kgdb and also full kernel panic output results).

Others are better qualified to answer this.  I'm just the lowly
hardware guy on the list. ;)

--
Stan
end of thread, other threads: [~2011-06-11 16:49 UTC | newest]

Thread overview: 12+ messages
2011-05-02 15:47 XFS/Linux Sanity check Paul Anderson
2011-05-02 17:09 ` Andi Kleen
2011-05-02 17:13 ` Emmanuel Florac
2011-06-11  1:33   ` FYI: LSI rebuilding; and XFS speed V. raw - hints on maxing out 'dd'....(if not already obvious) Linda Walsh
2011-06-11  9:30     ` Emmanuel Florac
2011-06-11 16:48       ` Linda Walsh
2011-05-03  3:18 ` XFS/Linux Sanity check Dave Chinner
2011-05-03  8:58   ` Michael Monnerie
2011-05-03 16:05   ` Paul Anderson
2011-05-04 10:36     ` Stan Hoeppner
2011-05-04  6:18 ` Stan Hoeppner
2011-05-04  1:10 ` Stan Hoeppner