* raw vs XFS sequential write and system load
@ 2007-10-18 10:50 Mario Kadastik
2007-10-18 22:23 ` David Chinner
0 siblings, 1 reply; 7+ messages in thread
From: Mario Kadastik @ 2007-10-18 10:50 UTC (permalink / raw)
To: xfs
Hello,
I have a slight problem. We have 4 systems, each with two 3ware
9550SX cards, each card running a hardware RAID5. Everything is
running the latest FW etc. The systems have at least 3GB of memory
and at least 2 CPUs (one has 4GB and 4 CPUs).
Now my problem is that this is a Grid storage node: we constantly
transfer 2+ GB files back and forth, tens of them in parallel, and
the system seems to be having serious problems.
We can read at speeds limited basically only by the network, without
the system picking up any load (we do use blockdev --setra 16384),
and can sustain a combined 200MB/s to the network over long periods
of time. However, when writes hit the systems, even at low speeds
they hog the machines and drive the load to 20+ (the largest I have
recovered from is 150; usually they sit between 20 and 60). As this
makes the system basically unusable, I did a lot of digging to try
to understand what causes such a high load.
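For reference, the readahead value is set per block device and is
counted in 512-byte sectors; the device name below is just an
example, not one of our actual devices:

```shell
# blockdev --setra takes a count of 512-byte sectors, so 16384
# sectors is an 8 MiB read-ahead window per device, e.g.:
#   blockdev --setra 16384 /dev/sdc     # placeholder device name
SECTORS=16384
BYTES=$((SECTORS * 512))
echo "$((BYTES / 1024 / 1024)) MiB read-ahead"
```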
The basic observation is that with vmstat I can see a number of
blocked processes during the writes, and high io wait on the system.
All of the RAID5s have XFS on top of them. Finally, after weeks of
not getting any adequate response from 3ware etc., I freed up one of
the systems for some extra tests. Just yesterday I measured the
basic difference between a direct dd to the raw device and a dd to a
local file on XFS. The detailed results are here:
http://hep.kbfi.ee/dbg/jupiter_test.txt
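The tests were along these lines (the device name, mount point and
exact dd options below are illustrative placeholders for this
sketch, not the exact commands from the linked log):

```shell
# CAUTION: the raw-device write destroys whatever is on the device.
DEV=/dev/sdc    # placeholder raw RAID5 device
MNT=/mnt/xfs    # placeholder XFS mount point
COUNT=2500      # 1 MiB blocks, ~2.5 GB to match the production files

# Sequential write straight to the block device:
#   dd if=/dev/zero of=$DEV bs=1M count=$COUNT
# The same write through XFS:
#   dd if=/dev/zero of=$MNT/testfile bs=1M count=$COUNT
# In another terminal, watch blocked processes (b) and iowait (wa):
#   vmstat 1
```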
As you can see the system is quite responsive during the read and
write sequentially to the raw device. There are no blocked processes
and the io wait is < 5%. However going to XFS we immediately see
blocked processes which over time leads to very high load on the system.
Is there something I'm missing? I did create the XFS filesystem with
the correct RAID5 su,sw settings, but is there a way to tune XFS or
general kernel parameters so that this blocking doesn't occur as
much? I don't really care about getting 400MB/s read/write
performance; I'd be satisfied with 10% of that during production
load, as long as the system doesn't fall over because of it.
Just as a clarification: the io done is sequential reads or writes
of 2.5GB files, with a number of files accessed in parallel (I'd say
up to 20, but I can limit the number).
I have googled around a lot but, to be honest, don't have enough
inside knowledge of the kernel tunables to make an educated guess
about where to begin. I'd assume the io patterns of XFS differ from
raw sequential read/write, but it could probably be tuned to some
extent to better match the hardware, and there are probably ways to
make sure that the speed goes down instead of the load going up.
Thanks in advance,
Mario Kadastik
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: raw vs XFS sequential write and system load
2007-10-18 10:50 raw vs XFS sequential write and system load Mario Kadastik
@ 2007-10-18 22:23 ` David Chinner
2007-10-19 6:12 ` Mario Kadastik
0 siblings, 1 reply; 7+ messages in thread
From: David Chinner @ 2007-10-18 22:23 UTC (permalink / raw)
To: Mario Kadastik; +Cc: xfs
On Thu, Oct 18, 2007 at 12:50:44PM +0200, Mario Kadastik wrote:
> Hello,
>
> I have a slight problem. Namely we have 4 systems with each having 2x
> 3ware 9550SX cards in them each with hardware RAID5. Everything is
> running the latest FW etc. The systems have at least 3GB of memory
> and at least 2 CPU-s (one has 4GB and 4 cpu-s).
Before going any further, what kernel are you using, and what's
the output of xfs_info </mntpt> for the filesystem you are testing?
FWIW, high iowait = high load average. High iowait is generally an
indicator of an overloaded disk subsystem. Your tests to the raw
device only used a single stream, so they're unlikely to show any of
the issues you're complaining about when running tens of parallel
streams....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raw vs XFS sequential write and system load
2007-10-18 22:23 ` David Chinner
@ 2007-10-19 6:12 ` Mario Kadastik
2007-10-19 7:59 ` David Chinner
0 siblings, 1 reply; 7+ messages in thread
From: Mario Kadastik @ 2007-10-19 6:12 UTC (permalink / raw)
To: David Chinner; +Cc: xfs
>> I have a slight problem. Namely we have 4 systems with each having 2x
>> 3ware 9550SX cards in them each with hardware RAID5. Everything is
>> running the latest FW etc. The systems have at least 3GB of memory
>> and at least 2 CPU-s (one has 4GB and 4 cpu-s).
>
> Before going any further, what kernel are you using, and what's
> the output of xfs_info </mntpt> for the filesystem you are testing?
Well, I did manage to accidentally kill that specific box (I did the
heavy dd to a file on the root disk instead of the XFS mount, having
forgotten to mount first, filling the disk and losing the system
from the net, so I'll have to wait for it to come back until someone
local can go and have a look). But I moved over to another box where
I had freed up one RAID5 for testing purposes, and a number of
things became apparent:
1. On the original box I had been running 2.6.9 SMP, the default
shipped with Scientific Linux 4. With that kernel the single stream
to the raw device seemed to go with no io wait and everything seemed
very nice, however the XFS performance was, as I wrote, under par to
say the least.
2. Before I lost the box I had rebooted it into 2.6.22.9 SMP, as I
had been reading around about XFS and found that 2.6.15+ kernels had
a few updates which might be of interest. However, I immediately
found that 2.6.22.9 behaved completely differently. For one thing,
the single stream write to the raw disk no longer had 0% io wait,
but instead around 40-50%. A quick look at the differences between
the two kernels revealed, for example, that
/sys/block/sda/queue/nr_requests had gone from 8192 in 2.6.9 to 128
in 2.6.22.9. Going back to 8192 brought the io wait of a single
stream write to the raw disk down to the 10% region, but not to 0.
Soon after, however, I killed the system, so I had to stop the tests
for a while.
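For reference, the knob in question lives in sysfs (the device name
below is a placeholder):

```shell
# nr_requests caps how many requests the elevator will queue per
# block device.
Q=/sys/block/sda/queue/nr_requests   # placeholder device
# Inspect the current depth (128 on 2.6.22.9, 8192 on old 2.6.9):
#   cat "$Q"
# Restore the old, very deep queue (as tried above; Dave advises
# against this later in the thread):
#   echo 8192 > "$Q"
echo "$Q"
```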
3. On the new box with 4 CPUs, 4 GB of memory and a 12-drive RAID5,
I was running 2.6.23 SMP with CONFIG_4KSTACKS disabled (one of our
admins thought that could cure a few crashes we had seen before on
the system under high network load; I don't know if it's relevant,
but I mention it just in case). On this box I also first discovered
horrible io wait with a single stream write to the raw device, and
again nr_requests seemed to cure that down to the 10% level.
However, here I also found that XFS performed exactly the same as
the direct raw device, also in the 5-10% region of io wait. Doing 2
parallel writes to the filesystem increased the io wait to 25%.
Doing parallel reads and writes had the system at around 15-20% io
wait. The more concrete numbers for some of the tests I did:
1 w 0 r: 10%
2 w 0 r: 20%
3 w 0 r: 33%
4 w 0 r: 45%
5 w 0 r: 50%
3 w 3 r: 50-60% (system still ca 20% idle)
3 w 10 r: 50-80% (system ca 10% idle, over time system load increased
to 14)
The last one was already a more realistic scenario (8 RAID5s with 3
writes each is 24 writes, which is about the order of magnitude I'm
aiming for; 80 reads is still quite conservative, more likely 120
across the whole storage of 4 systems, though we will increase that
number further to spread out the load even more). However, I have
been running the test on only one controller while the other sat
idle; in reality both of them would be hit the same way at the same
time.
Now as I have only access to the new box I'll provide the XFS info
for that one:
meta-data=/dev/sdc             isize=256    agcount=32, agsize=62941568 blks
         =                     sectsz=512   attr=0
data     =                     bsize=4096   blocks=2014129920, imaxpct=25
         =                     sunit=16     swidth=176 blks, unwritten=1
naming   =version 2            bsize=4096
log      =internal log         bsize=4096   blocks=32768, version=1
         =                     sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                 extsz=4096   blocks=0, rtextents=0
It was created with mkfs.xfs -d su=64k,sw=11 /dev/sdc to match the
underlying RAID5 of 12 disks with a 64k stripe size.
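As a sanity check, the xfs_info output above matches that geometry
(sunit/swidth there are reported in 4096-byte filesystem blocks):

```shell
# A 12-disk RAID5 leaves 11 data disks per stripe, hence sw=11;
# su matches the 64k per-disk chunk size.
DISKS=12
SW=$((DISKS - 1))
# xfs_info reports sunit=16 and swidth=176 (4096-byte blocks):
SU_KB=$((16 * 4096 / 1024))    # 64 -> matches su=64k
DATA_DISKS=$((176 / 16))       # 11 -> matches sw=11
echo "su=${SU_KB}k sw=$SW data_disks=$DATA_DISKS"
```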
>
> FWIW, high iowait = high load average. High iowait is generally an
> indicator of an overloaded disk subsystem. Your tests to the raw
> device only used a single stream, so they're unlikely to show any of
> the issues you're complaining about when running tens of parallel
> streams....
>
Well, I do understand that high io wait leads to high load over some
time period, and I also understand that high io wait indicates an
overloaded disk. However, as the io wait percentage seems to vary
widely with which kernel is running and what the kernel settings
are, I think the system should be able to cope with what I'm
throwing at it.
Now, my main concern is not speed. As long as I get around 2-3MB/s
per file/stream read or written I'm happy, AS LONG AS the system
remains responsive. I mean, the Linux kernel must have a way to gear
down network traffic (or, in the case of dd, memory access) to suit
the underlying system that is taking the hit. It's probably a
question of tuning the kernel to act correctly: not to try to do
everything at maximum speed, but to do it in a stable way. All of
the above tests were still going at high speed, with total average
read and write speeds of around 150-200MB/s, but I'd be happy with
10% of that if it made the system more stable.
It seems now that XFS may not be the big culprit here, but I do
think that the kernel VM management is best tuned by people who
understand how XFS behaves, both to make sure it can cope with what
I'm hoping for and to tune XFS itself to match the io patterns and
the underlying system. I appreciate any help you can give me.
Thanks in advance,
Mario
* Re: raw vs XFS sequential write and system load
2007-10-19 6:12 ` Mario Kadastik
@ 2007-10-19 7:59 ` David Chinner
2007-10-19 10:11 ` Mario Kadastik
2007-10-27 11:30 ` Mario Kadastik
0 siblings, 2 replies; 7+ messages in thread
From: David Chinner @ 2007-10-19 7:59 UTC (permalink / raw)
To: Mario Kadastik; +Cc: xfs
On Fri, Oct 19, 2007 at 08:12:16AM +0200, Mario Kadastik wrote:
> >>I have a slight problem. Namely we have 4 systems with each having 2x
> >>3ware 9550SX cards in them each with hardware RAID5. Everything is
> >>running the latest FW etc. The systems have at least 3GB of memory
> >>and at least 2 CPU-s (one has 4GB and 4 cpu-s).
> >
> >Before going any further, what kernel are you using, and what's
> >the output of xfs_info </mntpt> for the filesystem you are testing?
>
> Well I did manage to accidentally kill that specific box (did the
> heavy dd to a file on the root disk instead of the XFS mount (forgot
> to mount first), filling it and losing the system from net, so will
> have to wait for it to come back after someone locally can go and
> have a look). But I moved over to another box where I had freed up
> one RAID5 for testing purposes, and a number of things became apparent:
Oops :/
> 1. on the original box I had been running 2.6.9 SMP which was the
> default shipped with Scientific Linux 4. With that kernel the single
> stream to the raw device seemed to go with no io wait and everything
> seemed very nice, however the XFS performance was, as I wrote, under
> par to say the least.
Ah - 2.6.9. That explains the bad behaviour of XFS - it's locking all
the system memory in the elevator because the depth is so large.
i.e. throttle at 7/8 * 8192 requests, and each request will be
512k which means that we can have ~3.5GB of RAM locked in a single
elevator queue before it will throttle. Effectively your config
is running your machine out of available memory....
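To put numbers on that arithmetic:

```shell
# The elevator throttles at 7/8 of nr_requests; each request can
# carry up to 512 KiB, so with nr_requests=8192:
NR_REQUESTS=8192
REQ_KB=512
THROTTLE=$((NR_REQUESTS * 7 / 8))        # 7168 requests in flight
LOCKED_MB=$((THROTTLE * REQ_KB / 1024))  # 3584 MiB, i.e. ~3.5 GiB
echo "up to ${LOCKED_MB} MiB queued in one elevator queue"
```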
> 2. before I lost the box I had rebooted it to 2.6.22.9 SMP as I had
> been reading around about XFS and found that 2.6.15+ kernels had a
> few updates which might be of interest, however I immediately found
> that 2.6.22.9 behaved absolutely different.
Absolutely. We fixed all the problems w.r.t. queue depth and
congestion, and we completely rewrote the write path....
> For one thing the single
> stream write to raw disk no longer had 0% io wait, but instead around
> 40-50%. A quick look of the difference of the two kernels revealed
> for example that the /sys/block/sda/queue/nr_requests had gone from
> 8192 in 2.6.9 to 128 in 2.6.22.9.
Yup, it was set to something sane and the block device is throttling
writes on device congestion.
> Going back to 8192 decreased the
> load of single stream write to raw disk io wait to 10% region, but
> not to 0. Soon after however I killed the system so had to stop the
> tests for a while.
Yup, you probably had the OOM killer trigger, because setting the
queue depth that deep is a Bad Thing To Do. Effectively, you turned
off all feedback from the I/O layer to the VM that says "the drive
has enough I/O queued now, please stop sending more, because all I'm
doing is queueing it".
> 3. On the new box with 4 cpu-s, 4 GB of memory and 12 drive RAID5 I
> was running 2.6.23 SMP with CONFIG_4KSTACKS disabled (one of our
> admins thought that could cure a few crashes we had seen before on
> the system due to high network load, don't know if it's relevant, but
> just in case mentioned). On this box I first also discovered horrible
> io wait with single stream write to raw device and again the
> nr_requests seemed to cure that to 10% level.
That's not a cure! That's asking for trouble. You're seeing high
I/O wait because the system can feed data to the disks *much* faster
than the disks can do the I/O, and you're not consuming any CPU
time. This is *not wrong*.
XFS can feed disk subsystems many, many times faster than what you
have - you will always see iowait time on this sort of system
when using XFS. It's telling you the filesystem is far, far
faster than your disk. ;)
> However here I also
> found that XFS was performing exactly the same as the direct raw
> device. Also in the 5-10% region of io wait. Doing 2 parallel writes
> to the filesystem increased the io wait to 25%. Doing parallel read
> and write had the system at around 15-20% of io wait, the more
> concrete numbers for some of the tests I did:
>
> 1 w 0 r: 10%
> 2 w 0 r: 20%
> 3 w 0 r: 33%
> 4 w 0 r: 45%
> 5 w 0 r: 50%
>
> 3 w 3 r: 50-60% (system still ca 20% idle)
> 3 w 10 r: 50-80% (system ca 10% idle, over time system load increased
> to 14)
Now change nr_requests back to 128 and run the test again. What
happens to your iowait? What happens to responsiveness?
> Now as I have only access to the new box I'll provide the XFS info
> for that one:
> meta-data=/dev/sdc             isize=256    agcount=32, agsize=62941568 blks
>          =                     sectsz=512   attr=0
> data     =                     bsize=4096   blocks=2014129920, imaxpct=25
>          =                     sunit=16     swidth=176 blks, unwritten=1
> naming   =version 2            bsize=4096
> log      =internal log         bsize=4096   blocks=32768, version=1
>          =                     sectsz=512   sunit=0 blks, lazy-count=0
> realtime =none                 extsz=4096   blocks=0, rtextents=0
>
> it was created with mkfs.xfs -d su=64k,sw=11 /dev/sdc to match the
> underlying RAID5 of 12 disks and stripe size 64k.
Add v2 logs, log stripe unit of 64k.
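i.e. folding that into the original mkfs line would look something
like this (a sketch only; destructive, so only for the intended
empty device):

```shell
# -l version=2 selects the version-2 log format, and -l su=64k sets
# a 64k log stripe unit matching the data stripe unit:
CMD="mkfs.xfs -d su=64k,sw=11 -l version=2,su=64k /dev/sdc"
echo "$CMD"    # run only on the intended, empty device
```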
> Now, my main concern is not the speed. As long as I get around 2-3MB/
> s per file/stream read/written I'm happy AS LONG AS the system
> remains responsive. I mean Linux kernel must have a way to gear down
> network traffic (or in the case of dd then memory access) to suit the
> underlying system which is taking the hit.
It *does*. It's the elevator queue depth! By setting it back to 8192
you turned off the mechanism linux uses to maintain responsiveness
under heavy I/O load.
> It's probably a question
> of tuning the kernel to act correctly, not try to do all at maximum
> speed, but to do it in a stable way.
By default it should do the right thing; you should not have to
tweak anything at all. Your tweaking is causing the instability in
the recent kernels. Use the defaults and your system should remain
responsive under any I/O load you throw at it. High iowait time
and/or high load average is *not* an indication of a problem, just
that your system is under load and not CPU bound.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raw vs XFS sequential write and system load
2007-10-19 7:59 ` David Chinner
@ 2007-10-19 10:11 ` Mario Kadastik
2007-10-27 11:30 ` Mario Kadastik
1 sibling, 0 replies; 7+ messages in thread
From: Mario Kadastik @ 2007-10-19 10:11 UTC (permalink / raw)
To: David Chinner; +Cc: xfs
> Ah - 2.6.9. That explains the bad behaviour of XFS - it's locking all
> the system memory in the elevator because the depth is so large.
> i.e. throttle at 7/8 * 8192 requests, and each request will be
> 512k which means that we can have ~3.5GB of RAM locked in a single
> elevator queue before it will throttle. Effectively your config
> is running your machine out of available memory....
Ok, that explains a few things ...
>> However here I also
>> found that XFS was performing exactly the same as the direct raw
>> device. Also in the 5-10% region of io wait. Doing 2 parallel writes
>> to the filesystem increased the io wait to 25%. Doing parallel read
>> and write had the system at around 15-20% of io wait, the more
>> concrete numbers for some of the tests I did:
>>
>> 1 w 0 r: 10%
>> 2 w 0 r: 20%
>> 3 w 0 r: 33%
>> 4 w 0 r: 45%
>> 5 w 0 r: 50%
>>
>> 3 w 3 r: 50-60% (system still ca 20% idle)
>> 3 w 10 r: 50-80% (system ca 10% idle, over time system load increased
>> to 14)
>
> Now change nr_requests back to 128 and run the test again. What
> happens to your iowait? What happens to responsiveness?
1 w 0 r: 25-50% and the bo column of vmstat fluctuates heavily
2 w 0 r: 60-90% and the fluctuations are big
3 w 0 r: 80-100%
4 w 0 r: 90+ %
5 w 0 r: 95+ %
3 w 3 r: 90% most of the time there is no cpu idle %
3 w 10 r: 95%, nothing idle
8 w 10 r: 95%, nothing idle
However, the speeds seem to be quite stable, in and out, in the
read+write tests: around 50-70MB/s. You can see the system behaviour
as I ramped up the tests here:
http://monitor.hep.kbfi.ee/?c=Jupiter%20SE&h=europa.hep.kbfi.ee&m=&r=hour&s=descending&hc=4
At the end it was running 8 w 10 r and the load stayed at about 26,
with the CPUs in io wait. The disk rates aren't visible there, but
as an example here is some vmstat output from when the test had
already been running for quite some time:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache  si  so    bi    bo   in   cs  us sy id wa
 1 26    144 516584    344 3378396   0   0 77760 37764 1859 3353   0  5  0 95
 1 25    144 526844    348 3356420   0   0 77888 69652 2177 4728   0  7  0 93
 0 27    144 438660    348 3450708   0   0 55296 36580 1270 2487   0  4  0 96
 0 26    144 467444    348 3419976   0   0 71616 66988 1870 3856   0  7  0 93
 3 27    144 534780    348 3362948   0   0 59392 45628 1374 3380   0  5  0 95
 0 27    144 545440    344 3349692   0   0 96256 57316 2462 3736   0  7  0 93
 0 26    144 438876    348 3464304   0   0 73664 38608 1798 2038   0  3  0 97
10 20    144 480852    348 3410584   0   0 61568 53908 1549 3455   0  5  0 95
 0 26    144 530496    348 3356732   0   0 61376 57240 1620 4370   0  6  0 95
 0 23    144 582324    348 3302928   0   0 64000 42036 1433 3808   0  4  0 96
 8 18    144 493092    348 3401184   0   0 49728 55193 1502 2784   0  4  0 96
 0 26    144 513676    444 3375716   0   0 60832 73583 2033 4772   3  6  0 91
 0 26    144 460332    444 3437160   0   0 49024 46160 1434 2225   0  4  0 96
So around 60MB/s of reads and 50MB/s of writes were ongoing in the
system at the time. The main question now is whether this can be
kept up stably. To test this I'd have to migrate the data back onto
the new XFS (3.1TB of data) and wait and see. The system was
responsive, and if the load stays flat, then I guess it is not such
a big problem.
The 3ware-recommended nr_requests value for the 9550SX is, I think,
512, so I tried that as well (changing it live during the test). The
result was that io wait remained around 93% (so it dropped a few %),
but the speed did increase to around 80-90MB/s on reads and around
70MB/s on writes. The system load itself stayed at the same level.
I'll let it run in the background for a longer period to see how
things behave.
>> it was created with mkfs.xfs -d su=64k,sw=11 /dev/sdc to match the
>> underlying RAID5 of 12 disks and stripe size 64k.
>
> Add v2 logs, log stripe unit of 64k.
Did that.
> It *does*. It's the elevator queue depth! By setting it back to 8192
> you turned off the mechanism linux uses to maintain responsiveness
> under heavy I/O load.
Ok, 8192 is probably way too high, but I guess the 512 that I
remember from the 3ware recommendation should be about right?
>> It's probably a question
>> of tuning the kernel to act correctly, not try to do all at maximum
>> speed, but to do it in a stable way.
>
> By default it should do the right thing. You should not have to
> tweak anything at all. You're tweaking is causing the unstableness
> in the recent kernels. Use the defaults and your system should
> remain responsive under any I/o load you throw at it. High iowait
> time and/or high load average is *not* an indication of a problem,
> just that your system is under load and you're not cpu bound.
Well, my question is whether one also needs to tune the VM
management (dirty ratio etc.) given the high volume of data
transfers. I haven't added the network to the mix yet; that comes
once I put the newly optimized system online and see how it
performs. I guess having 8 pdflush threads in uninterruptible sleep
can also cause problems and could maybe be handled better somehow?
Thanks a lot for the answers,
Mario
* Re: raw vs XFS sequential write and system load
2007-10-19 7:59 ` David Chinner
2007-10-19 10:11 ` Mario Kadastik
@ 2007-10-27 11:30 ` Mario Kadastik
2007-10-27 13:07 ` Justin Piszcz
1 sibling, 1 reply; 7+ messages in thread
From: Mario Kadastik @ 2007-10-27 11:30 UTC (permalink / raw)
To: David Chinner; +Cc: xfs
Well, to finally summarize: the things pointed out all helped too,
but the major change in system behavior came from the fact that
2.6.23 has totally different virtual memory defaults from 2.6.9, and
with 2.6.23 one has to raise dirty_ratio to something bigger to
allow a fast i/o machine to actually handle the load. Now the four
nodes we have are all running very nicely and calmly, performing all
the tasks we have asked of them, and we no longer see any congestion
etc.
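For the record, these are the sysctls involved; the values below are
purely illustrative placeholders, not our exact production numbers:

```shell
# Raising vm.dirty_ratio lets a fast I/O machine accumulate more
# dirty pages before writers are throttled synchronously:
#   sysctl -w vm.dirty_ratio=40
#   sysctl -w vm.dirty_background_ratio=10
# To persist across reboots, add the same keys to /etc/sysctl.conf:
#   vm.dirty_ratio = 40
#   vm.dirty_background_ratio = 10
```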
I have summarized my weeks of investigations into a twiki page,
comments are welcome:
http://hep.kbfi.ee/index.php/IT/KernelTuning
Thanks for the help,
Mario
* Re: raw vs XFS sequential write and system load
2007-10-27 11:30 ` Mario Kadastik
@ 2007-10-27 13:07 ` Justin Piszcz
0 siblings, 0 replies; 7+ messages in thread
From: Justin Piszcz @ 2007-10-27 13:07 UTC (permalink / raw)
To: Mario Kadastik; +Cc: David Chinner, xfs
On Sat, 27 Oct 2007, Mario Kadastik wrote:
> Well, to finally summarize: the things pointed out all helped too, but
> the major change in system behavior came from the fact that 2.6.23 has
> totally different virtual memory defaults from 2.6.9, and with 2.6.23
> one has to raise dirty_ratio to something bigger to allow a fast i/o
> machine to actually handle the load. Now the four nodes we have are all
> running very nicely and calmly, performing all the tasks we have asked
> of them, and we no longer see any congestion etc.
>
> I have summarized my weeks of investigations into a twiki page, comments are
> welcome:
>
> http://hep.kbfi.ee/index.php/IT/KernelTuning
>
> Thanks for the help,
>
> Mario
>
Very nice doc!
Thanks.
Justin.
Thread overview:
2007-10-18 10:50 raw vs XFS sequential write and system load Mario Kadastik
2007-10-18 22:23 ` David Chinner
2007-10-19 6:12 ` Mario Kadastik
2007-10-19 7:59 ` David Chinner
2007-10-19 10:11 ` Mario Kadastik
2007-10-27 11:30 ` Mario Kadastik
2007-10-27 13:07 ` Justin Piszcz