* raw vs XFS sequential write and system load
@ 2007-10-18 10:50 Mario Kadastik
From: Mario Kadastik
To: xfs

Hello,

I have a slight problem. We have 4 systems, each with 2x 3ware 9550SX cards, each card running hardware RAID5. Everything is running the latest FW etc. The systems have at least 3GB of memory and at least 2 CPUs (one has 4GB and 4 CPUs).

Now my problem: as this is a Grid storage node, we constantly transfer 2+ GB files back and forth, tens of them in parallel, and the systems seem to be having serious problems. We can sustain read speeds limited basically only by the network, without the system picking up any load (we do use blockdev --setra 16384), and can keep a combined 200MB/s going to the network over a long period of time. However, if writes hit the systems then even at low speeds they hog the machines and send the load to 20+ (the largest I have recovered from is 150, usually they are between 20 and 60).

As this makes the system basically unusable, I did a lot of digging to try to understand what causes such a high load. The basic thing is that with vmstat I can see a number of blocked processes during the writes and high io wait for the system. All of the RAID5s have XFS on top of them.

Finally, after weeks of not getting any adequate response from 3ware etc., I freed up one of the systems to do some extra tests. Just yesterday I measured the basic difference between doing a dd directly to the raw device and doing the dd to a local file on XFS. The results are in detail here:

http://hep.kbfi.ee/dbg/jupiter_test.txt

As you can see, the system is quite responsive during the sequential read and write to the raw device. There are no blocked processes and the io wait is < 5%. However, going through XFS we immediately see blocked processes, which over time leads to very high load on the system.
Is there something I'm missing? I did create the XFS with the correct RAID5 su,sw settings, but is there a way to tune XFS or general kernel parameters so that this blocking doesn't occur as much? I don't really care if I don't get 400MB/s read/write performance; I'd be satisfied with 10% of that during production load, as long as the system doesn't fall over because of it.

Just as a clarification: the io done is sequential read or write of 2.5GB files, with a number of files accessed in parallel (I'd say up to 20, but I can limit the number). I have googled around a lot but, to be honest, don't have enough inside knowledge of the kernel tunables to make an educated guess on where to begin. I'd assume the io patterns of XFS differ from raw sequential read/write, but it can probably be tuned to some extent to better match the hardware, and there are probably ways to make sure that the speed goes down instead of the load going up.

Thanks in advance,

Mario Kadastik
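As an aside on the readahead figure quoted above: `blockdev --setra` takes a count of 512-byte sectors, so the effective per-device readahead window is easy to check. A throwaway sketch of the arithmetic (plain Python, nothing system-specific):

```python
# blockdev --setra N sets per-device readahead to N 512-byte sectors.
SECTOR_BYTES = 512
setra = 16384  # the value used on these storage nodes

readahead_bytes = setra * SECTOR_BYTES
readahead_mib = readahead_bytes / 2**20
print(readahead_mib)  # 8.0 -> an 8 MiB readahead window per device
```

That large window is why single-stream raw reads run at near wire speed here; it does nothing for the write-side behaviour discussed below.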
* Re: raw vs XFS sequential write and system load
@ 2007-10-18 22:23 David Chinner
From: David Chinner
To: Mario Kadastik; +Cc: xfs

On Thu, Oct 18, 2007 at 12:50:44PM +0200, Mario Kadastik wrote:
> Hello,
>
> I have a slight problem. Namely we have 4 systems with each having 2x
> 3ware 9550SX cards in them each with hardware RAID5. Everything is
> running the latest FW etc. The systems have at least 3GB of memory
> and at least 2 CPU-s (one has 4GB and 4 cpu-s).

Before going any further, what kernel are you using, and what's the
output of xfs_info </mntpt> of the filesystem you are testing?

FWIW, high iowait = high load average. High iowait is generally an
indicator of an overloaded disk subsystem. Your tests to the raw
device only used a single stream, so they're unlikely to show any of
the issues you're complaining about when running tens of parallel
streams....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raw vs XFS sequential write and system load
@ 2007-10-19 6:12 Mario Kadastik
From: Mario Kadastik
To: David Chinner; +Cc: xfs

>> I have a slight problem. Namely we have 4 systems with each having 2x
>> 3ware 9550SX cards in them each with hardware RAID5. Everything is
>> running the latest FW etc. The systems have at least 3GB of memory
>> and at least 2 CPU-s (one has 4GB and 4 cpu-s).
>
> Before going any further, what kernel are you using and what's
> the output of xfs_info </mntpt> of the filesytsem you are testing?

Well, I did manage to accidentally kill that specific box (I did the heavy dd to a file on the root disk instead of the XFS mount, having forgotten to mount first, filling it and losing the system from the net, so I will have to wait for it to come back after someone locally can go and have a look). But I moved over to another box where I had freed up one RAID5 for testing purposes, and a number of things became apparent:

1. On the original box I had been running 2.6.9 SMP, the default shipped with Scientific Linux 4. With that kernel the single stream to the raw device seemed to go with no io wait and everything seemed very nice; however, the XFS performance was, as I wrote, under par to say the very least.

2. Before I lost the box I had rebooted it into 2.6.22.9 SMP, as I had been reading around about XFS and found that 2.6.15+ kernels had a few updates which might be of interest. However, I immediately found that 2.6.22.9 behaved completely differently. For one thing, the single stream write to the raw disk no longer had 0% io wait, but instead around 40-50%. A quick look at the differences between the two kernels revealed, for example, that /sys/block/sda/queue/nr_requests had gone from 8192 in 2.6.9 to 128 in 2.6.22.9. Going back to 8192 decreased the io wait of a single stream write to the raw disk to the 10% region, but not to 0.
Soon after, however, I killed the system, so I had to stop the tests for a while.

3. On the new box with 4 CPUs, 4GB of memory and a 12-drive RAID5 I was running 2.6.23 SMP with CONFIG_4KSTACKS disabled (one of our admins thought that could cure a few crashes we had seen before on the system due to high network load; I don't know if it's relevant, but I mention it just in case). On this box I also first saw horrible io wait with a single stream write to the raw device, and again raising nr_requests seemed to cure that to the 10% level. However, here I found that XFS was performing exactly the same as the direct raw device, also in the 5-10% region of io wait. Doing 2 parallel writes to the filesystem increased the io wait to 25%. Doing parallel read and write had the system at around 15-20% io wait. The more concrete numbers for some of the tests I did:

1 w 0 r: 10%
2 w 0 r: 20%
3 w 0 r: 33%
4 w 0 r: 45%
5 w 0 r: 50%

3 w 3 r: 50-60% (system still ca 20% idle)
3 w 10 r: 50-80% (system ca 10% idle, over time system load increased to 14)

The last one was already a more realistic scenario (8 RAID5s, 3 writes per each is 24 writes, which is about the order of magnitude I'm aiming for; 80 reads is still quite conservative, it's more likely 120 across the whole storage of 4 systems, though we will increase that number further to spread out the load even more). However, I have been running the test on only one controller while the other sat idle; in reality both would be hit the same way at the same time.
Now, as I only have access to the new box, I'll provide the XFS info for that one:

meta-data=/dev/sdc               isize=256    agcount=32, agsize=62941568 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=2014129920, imaxpct=25
         =                       sunit=16     swidth=176 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

It was created with mkfs.xfs -d su=64k,sw=11 /dev/sdc to match the underlying RAID5 of 12 disks and a stripe size of 64k.

> FWIW, high iowait = high load average. High iowait is generally an
> indicator of an overloaded disk subsystem. Your tests to the raw
> device only used a single stream, so they're unlikely to show any of
> the issues you're complaining about when running tens of parallel
> streams....

Well, I do understand that high io wait leads to high load over some time period. And I also understand that high io wait indicates an overloaded disk. However, as the percentage of io wait seems to vary greatly with which kernel is running and what the kernel settings are, I think the system should be able to cope with what I'm throwing at it.

Now, my main concern is not the speed. As long as I get around 2-3MB/s per file/stream read/written, I'm happy AS LONG AS the system remains responsive. I mean, the Linux kernel must have a way to gear down network traffic (or in the case of dd, memory access) to suit the underlying system which is taking the hit. It's probably a question of tuning the kernel to act correctly: not to try to do everything at maximum speed, but to do it in a stable way. All of the above tests were still going at high speed, with average read and write speeds totalling around 150-200MB/s, but I'd be happy with 10% of that if it made the system more stable.
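[Editorial note: the su/sw choice above can be cross-checked against the sunit/swidth values that xfs_info reports. A small sketch of the arithmetic, plain Python rather than an XFS tool:]

```python
# mkfs.xfs -d su=64k,sw=11: the stripe unit matches the controller's
# 64 KiB chunk, and the stripe width counts only data disks
# (12 drives minus 1 parity disk for RAID5).
disks = 12
parity = 1
chunk_kib = 64

sw = disks - parity               # 11 data disks
full_stripe_kib = sw * chunk_kib  # data written per full RAID5 stripe

# xfs_info reports sunit/swidth in 4 KiB filesystem blocks:
sunit_blocks = chunk_kib * 1024 // 4096   # 16, matches sunit=16 above
swidth_blocks = sw * sunit_blocks         # 176, matches swidth=176 above
print(sunit_blocks, swidth_blocks, full_stripe_kib)  # 16 176 704
```

So the filesystem geometry is consistent with the array: a full stripe is 704 KiB of data, and allocations aligned to sunit/swidth avoid read-modify-write cycles on the RAID5.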
It seems now that XFS may not be the big culprit here, but I do think that kernel VM management is best tuned by people who understand how XFS behaves, to make sure it can cope with what I'm hoping for, as well as tuning XFS itself to match the io patterns and the underlying system. I do appreciate any help you can give me.

Thanks in advance,

Mario
* Re: raw vs XFS sequential write and system load
@ 2007-10-19 7:59 David Chinner
From: David Chinner
To: Mario Kadastik; +Cc: xfs

On Fri, Oct 19, 2007 at 08:12:16AM +0200, Mario Kadastik wrote:
> Well I did manage to accidentally kill that specific box (did the
> heavy dd to a file on the root disk instead of the XFS mount (forgot
> to mount first), filling it and losing the system from net, so will
> have to wait for it to come back after someone locally can go and
> have a look). But I moved over to another box where I had freed up
> one RAID5 for testing purposes and a number of things became apparent:

Oops :/

> 1. on the original box I had been running 2.6.9 SMP which was the
> default shipped with Scientific Linux 4. With that kernel the single
> stream to raw device seemed to go with no io wait and everything
> seemed very nice, however the XFS performance was as I wrote, under
> par to say the very least.

Ah - 2.6.9. That explains the bad behaviour of XFS - it's locking all
the system memory in the elevator because the queue depth is so large.
I.e. throttle at 7/8 * 8192 requests, and each request can be 512k,
which means that we can have ~3.5GB of RAM locked in a single elevator
queue before it will throttle. Effectively your config is running your
machine out of available memory....
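[Editorial note: Dave's back-of-the-envelope number checks out; the sketch below simply redoes the arithmetic. The 512k figure is the per-request maximum he quotes, not something measured independently here:]

```python
# The elevator throttles writers once 7/8 of nr_requests are in flight;
# with each request up to 512 KiB, nr_requests=8192 pins a huge amount
# of RAM in a single queue before any throttling kicks in.
nr_requests = 8192
max_request_kib = 512

throttle_requests = nr_requests * 7 // 8            # 7168 requests
locked_gib = throttle_requests * max_request_kib / 2**20
print(locked_gib)  # 3.5 -> ~3.5 GiB pinned per elevator queue
```

On a 3-4 GB machine with several arrays, that is effectively all of memory, which is why the 2.6.9-era default behaved so badly and why the later 128-request default exists.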
> 2. Before I lost the box I had rebooted it to 2.6.22.9 SMP as I had
> been reading around about XFS and found that 2.6.15+ kernels had a
> few updates which might be of interest, however I immediately found
> that 2.6.22.9 behaved absolutely different.

Absolutely. We fixed all the problems w.r.t. queue depth and
congestion, and we completely rewrote the write path....

> For one thing the single
> stream write to raw disk no longer had 0% io wait, but instead around
> 40-50%. A quick look of the difference of the two kernels revealed
> for example that the /sys/block/sda/queue/nr_requests had gone from
> 8192 in 2.6.9 to 128 in 2.6.22.9.

Yup, it was set to something sane, and the block device is throttling
writes on device congestion.

> Going back to 8192 decreased the
> load of single stream write to raw disk io wait to 10% region, but
> not to 0. Soon after however I killed the system so had to stop the
> tests for a while.

Yup, you probably had the OOM killer trigger, because setting the
queue depth that deep is a Bad Thing To Do. Effectively, you turned
off all feedback from the I/O layer to the VM saying "the drive has
enough I/O now, please stop sending more, because all I'm doing is
queuing it".

> 3. On the new box with 4 cpu-s, 4 GB of memory and 12 drive RAID5 I
> was running 2.6.23 SMP with CONFIG_4KSTACKS disabled (one of our
> admins thought that could cure a few crashes we had seen before on
> the system due to high network load, don't know if it's relevant, but
> just in case mentioned). On this box I first also discovered horrible
> io wait with single stream write to raw device and again the
> nr_requests seemed to cure that to 10% level.

That's not a cure! That's asking for trouble. You're seeing high I/O
wait because the system can feed data to the disks *much* faster than
the disks can do the I/O, and you're not consuming any CPU time. This
is *not wrong*.
XFS can feed disk subsystems many, many times faster than what you
have - you will always see iowait time on this sort of system when
using XFS. It's telling you the filesystem is far, far faster than
your disks. ;)

> However here I also
> found that XFS was performing exactly the same as the direct raw
> device. Also in the 5-10% region of io wait. Doing 2 parallel writes
> to the filesystem increased the io wait to 25%. Doing parallel read
> and write had the system at around 15-20% of io wait, the more
> concrete numbers for some of the tests I did:
>
> 1 w 0 r: 10%
> 2 w 0 r: 20%
> 3 w 0 r: 33%
> 4 w 0 r: 45%
> 5 w 0 r: 50%
>
> 3 w 3 r: 50-60% (system still ca 20% idle)
> 3 w 10 r: 50-80% (system ca 10% idle, over time system load increased
> to 14)

Now change nr_requests back to 128 and run the test again. What
happens to your iowait? What happens to responsiveness?

> Now as I have only access to the new box I'll provide the XFS info
> for that one:
>
> meta-data=/dev/sdc               isize=256    agcount=32, agsize=62941568 blks
>          =                       sectsz=512   attr=0
> data     =                       bsize=4096   blocks=2014129920, imaxpct=25
>          =                       sunit=16     swidth=176 blks, unwritten=1
> naming   =version 2              bsize=4096
> log      =internal log           bsize=4096   blocks=32768, version=1
>          =                       sectsz=512   sunit=0 blks, lazy-count=0
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> it was created with mkfs.xfs -d su=64k,sw=11 /dev/sdc to match the
> underlying RAID5 of 12 disks and stripe size 64k.

Add v2 logs with a log stripe unit of 64k.

> Now, my main concern is not the speed. As long as I get around
> 2-3MB/s per file/stream read/written I'm happy AS LONG AS the system
> remains responsive. I mean Linux kernel must have a way to gear down
> network traffic (or in the case of dd then memory access) to suit the
> underlying system which is taking the hit.

It *does*. It's the elevator queue depth! By setting it back to 8192
you turned off the mechanism Linux uses to maintain responsiveness
under heavy I/O load.
> It's probably a question
> of tuning the kernel to act correctly, not try to do all at maximum
> speed, but to do it in a stable way.

By default it should do the right thing. You should not have to tweak
anything at all. Your tweaking is causing the instability in the
recent kernels. Use the defaults and your system should remain
responsive under any I/O load you throw at it. High iowait time and/or
high load average is *not* an indication of a problem, just that your
system is under load and you're not CPU bound.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raw vs XFS sequential write and system load
@ 2007-10-19 10:11 Mario Kadastik
From: Mario Kadastik
To: David Chinner; +Cc: xfs

> Ah - 2.6.9. That explains the bad behaviour of XFS - it's locking all
> the system memory in the elevator because the depth is so large.
> i.e. throttle at 7/8 * 8192 requests, and each request will be
> 512k which means that we can have ~3.5GB of RAM locked in a single
> elevator queue before it will throttle. Effectively your config
> is running your machine out of available memory....

Ok, that explains a few things...

> Now change nr_requests back to 128 and run the test again. What
> happens to your iowait? What happens to responsiveness?

1 w 0 r: 25-50%, and the bo column of vmstat fluctuates wildly
2 w 0 r: 60-90%, and the fluctuations are big
3 w 0 r: 80-100%
4 w 0 r: 90+ %
5 w 0 r: 95+ %

3 w 3 r: 90%, most of the time there is no idle cpu %
3 w 10 r: 95%, nothing idle
8 w 10 r: 95%, nothing idle

However, the speeds seem to be quite stable in and out in the
read+write tests, around 50-70MB/s.
You can see the system behaviour as I ramped up the tests here:

http://monitor.hep.kbfi.ee/?c=Jupiter%20SE&h=europa.hep.kbfi.ee&m=&r=hour&s=descending&hc=4

It was running 8 w 10 r at the end, and the load stayed at about 26
with the cpus in io wait. The disk rates aren't visible there, but as
an example here is some vmstat output from when the test had already
been running for quite some time:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache  si  so    bi    bo    in    cs us sy id wa
 1 26    144 516584    344 3378396  0   0  77760 37764  1859  3353  0  5  0 95
 1 25    144 526844    348 3356420  0   0  77888 69652  2177  4728  0  7  0 93
 0 27    144 438660    348 3450708  0   0  55296 36580  1270  2487  0  4  0 96
 0 26    144 467444    348 3419976  0   0  71616 66988  1870  3856  0  7  0 93
 3 27    144 534780    348 3362948  0   0  59392 45628  1374  3380  0  5  0 95
 0 27    144 545440    344 3349692  0   0  96256 57316  2462  3736  0  7  0 93
 0 26    144 438876    348 3464304  0   0  73664 38608  1798  2038  0  3  0 97
10 20    144 480852    348 3410584  0   0  61568 53908  1549  3455  0  5  0 95
 0 26    144 530496    348 3356732  0   0  61376 57240  1620  4370  0  6  0 95
 0 23    144 582324    348 3302928  0   0  64000 42036  1433  3808  0  4  0 96
 8 18    144 493092    348 3401184  0   0  49728 55193  1502  2784  0  4  0 96
 0 26    144 513676    444 3375716  0   0  60832 73583  2033  4772  3  6  0 91
 0 26    144 460332    444 3437160  0   0  49024 46160  1434  2225  0  4  0 96

So around 60MB/s reads and 50MB/s writes were ongoing in the system at
the time. The main question now is whether this can be kept up stably.
To test this I'd have to migrate the data (3.1TB) back onto the new
XFS and wait and see. The system was responsive, and if the load stays
flat, then I guess it is not such a big problem.

The 3ware-recommended value of nr_requests for the 9550SX is, I think,
512, so I tried that as well (changing it live during the test); the
result was that io wait remained around 93% (so it dropped a few %),
but the speed did increase to around 80-90MB/s on reads and around
70MB/s on writes.
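[Editorial note: vmstat's bi/bo columns are blocks (KiB) per second, so the "around 60MB/s reads and 50MB/s writes" summary above can be reproduced by averaging the sample. A throwaway sketch with the values copied from the output:]

```python
# bi = KiB/s read in from block devices, bo = KiB/s written out.
bi = [77760, 77888, 55296, 71616, 59392, 96256, 73664,
      61568, 61376, 64000, 49728, 60832, 49024]
bo = [37764, 69652, 36580, 66988, 45628, 57316, 38608,
      53908, 57240, 42036, 55193, 73583, 46160]

avg_read_mb = sum(bi) / len(bi) / 1024    # ~64 MB/s sustained reads
avg_write_mb = sum(bo) / len(bo) / 1024   # ~51 MB/s sustained writes
print(round(avg_read_mb, 1), round(avg_write_mb, 1))
```

The averages land a little above the eyeballed figures, which is expected given the large swing between samples.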
The system load itself remained at the same level. I'll let it run in
the background for a longer period to see how things behave.

>> it was created with mkfs.xfs -d su=64k,sw=11 /dev/sdc to match the
>> underlying RAID5 of 12 disks and stripe size 64k.
>
> Add v2 logs, log stripe unit of 64k.

Did that.

> It *does*. It's the elevator queue depth! By setting it back to 8192
> you turned off the mechanism linux uses to maintain responsiveness
> under heavy I/O load.

Ok, 8192 is probably way too high, but I guess the 512 I remember from
3ware should be about right?

>> It's probably a question
>> of tuning the kernel to act correctly, not try to do all at maximum
>> speed, but to do it in a stable way.
>
> By default it should do the right thing. You should not have to
> tweak anything at all. Your tweaking is causing the instability
> in the recent kernels. Use the defaults and your system should
> remain responsive under any I/O load you throw at it. High iowait
> time and/or high load average is *not* an indication of a problem,
> just that your system is under load and you're not cpu bound.

Well, my question is whether or not one also needs to tune the VM
management (dirty ratio etc.) considering the high amount of data
transfers. I haven't added the network to the mix yet; I'll do that
when I put the newly optimized system online for use and see how it
performs. I guess having 8 pdflush threads in uninterruptible sleep
can also cause problems and could maybe be handled better somehow?

Thanks a lot for the answers,

Mario
* Re: raw vs XFS sequential write and system load
@ 2007-10-27 11:30 Mario Kadastik
From: Mario Kadastik
To: David Chinner; +Cc: xfs

Well, to finally summarize: the things pointed out all helped too, but
the major change in system behavior came from the fact that 2.6.23 has
totally different virtual memory defaults from 2.6.9, and with 2.6.23
one has to change dirty_ratio to something bigger to allow a fast i/o
machine to actually handle the load. The four nodes we have are now
all running very nicely and calmly and performing all the tasks we
have asked of them; we no longer see any congestion etc.

I have summarized my weeks of investigation into a twiki page;
comments are welcome:

http://hep.kbfi.ee/index.php/IT/KernelTuning

Thanks for the help,

Mario
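[Editorial note: for readers wondering what raising dirty_ratio actually buys: it is the percentage of memory that dirty page-cache data may reach before writing processes are throttled into doing writeback themselves. A sketch of the effect on the 4 GB test box; the percentages below are illustrative examples, not the exact defaults of either kernel:]

```python
# vm.dirty_ratio caps dirty page cache as a percentage of total memory;
# a writer that pushes past it is forced into synchronous writeback.
def dirty_threshold_mib(ram_mib, dirty_ratio_pct):
    return ram_mib * dirty_ratio_pct / 100

ram_mib = 4 * 1024  # the 4 GB box from the thread

# A small ratio throttles writers early; a larger one lets a fast
# array buffer far more data before anything blocks.
low = dirty_threshold_mib(ram_mib, 10)   # 409.6 MiB of dirty data
high = dirty_threshold_mib(ram_mib, 40)  # 1638.4 MiB
print(low, high)  # 409.6 1638.4
```

The trade-off is the same one the nr_requests discussion hit: more buffered data smooths bursts, but it also means more work queued when the VM finally forces writeback.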
* Re: raw vs XFS sequential write and system load
@ 2007-10-27 13:07 Justin Piszcz
From: Justin Piszcz
To: Mario Kadastik; +Cc: David Chinner, xfs

On Sat, 27 Oct 2007, Mario Kadastik wrote:

> Well to finally summarize, the things pointed out all helped too, but the
> major change in system behavior came from the fact that 2.6.23 had totally
> different virtual memory defaults than 2.6.9 and running with 2.6.23 one has
> to change the dirty_ratio to something bigger to allow for a fast i/o machine
> to actually handle the load. Now the four nodes we have are all running very
> nicely and calmly and performing all the tasks we have asked from them, no
> more see we any congestion etc.
>
> I have summarized my weeks of investigations into a twiki page, comments are
> welcome:
>
> http://hep.kbfi.ee/index.php/IT/KernelTuning

Very nice doc! Thanks.

Justin.
Thread overview: 7+ messages (newest: 2007-10-27 13:07 UTC)

2007-10-18 10:50 raw vs XFS sequential write and system load -- Mario Kadastik
2007-10-18 22:23 ` David Chinner
2007-10-19  6:12   ` Mario Kadastik
2007-10-19  7:59     ` David Chinner
2007-10-19 10:11       ` Mario Kadastik
2007-10-27 11:30       ` Mario Kadastik
2007-10-27 13:07         ` Justin Piszcz