* RAID 5: low sequential write performance?
@ 2013-06-15 23:10 Corey Hickey
2013-06-16 21:27 ` Peter Grandi
0 siblings, 1 reply; 9+ messages in thread
From: Corey Hickey @ 2013-06-15 23:10 UTC (permalink / raw)
To: 'Linux RAID'
Hi,
I'm getting poorer performance than I expected for large sequential
writes with a 3-drive RAID 5--each drive writes at about half the
speed it is capable of. When I monitor the I/O with dstat or iostat, I
see a high number of read operations for each drive, and I suspect that
is related to the low performance, since presumably the drives are
having to seek in order to perform these reads.
I'm aware of the RAID 5 write penalty, but does it still apply to large
sequential writes that traverse many stripes? If the kernel is
overwriting an entire stripe, can't it just overwrite the parity chunk
without having to read anything beforehand? I tried to find out if the
kernel actually does this, but my searches came up short. Perhaps my
assumption is naive.
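The arithmetic behind my assumption, at least, is simple: RAID 5 parity
is a byte-wise XOR of the data chunks, so a full-stripe overwrite can
compute the new parity from the new data alone, while a partial
overwrite needs the old chunk and old parity first. A toy sketch (the
chunk contents are made up):

```python
from functools import reduce

def parity(chunks):
    """Byte-wise XOR of all data chunks gives the RAID 5 parity chunk."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

# A 3-drive RAID 5 stripe: two data chunks plus one parity chunk.
old_data = [b"\x01\x02", b"\x04\x08"]
old_parity = parity(old_data)          # b"\x05\x0a"

# Full-stripe overwrite: new parity from the new data alone, no pre-reads.
new_data = [b"\xff\x00", b"\x0f\xf0"]
new_parity = parity(new_data)

# Partial overwrite of chunk 0: read-modify-write needs the old chunk
# and old parity: new_P = old_P ^ old_chunk ^ new_chunk.
new_chunk = b"\xff\x00"
rmw_parity = bytes(p ^ o ^ n for p, o, n in
                   zip(old_parity, old_data[0], new_chunk))
assert rmw_parity == parity([new_chunk, old_data[1]])
```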
I know this doesn't have anything to do with the filesystem--I was able
to reproduce the behavior on a test system, writing directly to an
otherwise unused array, using a single 768 MB write() call (verified by
strace).
I wrote a script to benchmark the number of read/write operations along
with the elapsed time for writing. The methodology is basically:
1. create array
2. read 768 MB to a buffer
3. wait for array to finish resyncing
4. sync; drop buffers/caches
5. read stats from /proc/diskstats
6. write buffer to array
7. sync
8. read stats from /proc/diskstats
9. analyze data:
- for each component device, subtract initial stats from final stats
- sum up the stats from all the devices
That last step is probably invalid for the fields in /proc/diskstats
that are not counters, but I wasn't interested in them.
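The diskstats bookkeeping (steps 5, 8, and 9) boils down to parsing and
differencing the counter fields, roughly like this (a minimal sketch;
the sample lines are made up, and the field layout follows the kernel's
iostats documentation):

```python
def parse(line):
    """Split a /proc/diskstats line into device name and its 11 stat fields."""
    f = line.split()
    return f[2], [int(x) for x in f[3:]]

def delta(before, after):
    """Per-device difference of the stat fields; only meaningful for the
    counter fields (field 8, I/Os currently in flight, is a gauge)."""
    return [a - b for b, a in zip(before, after)]

before = "8 16 sdb 100 0 800 40 200 0 1600 90 0 120 130"
after  = "8 16 sdb 150 0 1200 60 900 0 7200 400 0 500 460"

name, c0 = parse(before)
_, c1 = parse(after)
d = delta(c0, c1)
# Fields 0/2 are reads completed / sectors read; 4/6 the write equivalents.
print(name, "reads:", d[0], "sectors read:", d[2],
      "writes:", d[4], "sectors written:", d[6])
```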
I measured chunk sizes at each power of 2 from 2^2 to 2^14 KB. The
results of this are that smaller chunks performed the best, with
generally lower performance for larger ones, corresponding to more read
and write operations.
http://www.fatooh.org/files/tmp/chunks/output1.png
Note that the blue line (time) has the Y axis on the right.
Does this behavior seem expected? Am I doing something wrong, or is
there something I can tune? I'd like to be able to understand this
better, but I don't have enough background.
Full results, scripts, and raw data are available here:
http://www.fatooh.org/files/tmp/chunks/
The CSV fields are:
- chunk size
- time to write 768 MB
- the fields calculated from /proc/diskstat in step #9 above
Test system stats:
2 GB RAM
Athlon64 3400+
Debian Sid, 64-bit
Linux 3.8-2-amd64 (Debian kernel)
mdadm v3.2.5
3 disk raid 5 of 1 GB partitions on separate disks
(small RAID size for testing to keep the resync time down)
Thanks for any help,
Corey
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RAID 5: low sequential write performance?
2013-06-15 23:10 RAID 5: low sequential write performance? Corey Hickey
@ 2013-06-16 21:27 ` Peter Grandi
2013-06-17 6:39 ` Corey Hickey
0 siblings, 1 reply; 9+ messages in thread
From: Peter Grandi @ 2013-06-16 21:27 UTC (permalink / raw)
To: Linux RAID
> [ ... ] see a high number of read operations for each drive,
> and I suspect that is related to the low performance, since
> presumably the drives are having to seek in order to perform
> these reads. I'm aware of the RAID 5 write penalty
Yes, that's Read-Modify-Write.
> but does it still apply to large sequential writes that
> traverse many stripes?
If the writes are stripe-aligned, things should be good. But
there is no guarantee that the writes you issue to a '/dev/md'
device will not be rescheduled by the IO subsystem, and even if
you issue aligned logical writes the physical writes may not be
aligned.
> I know this doesn't have anything to do with the filesystem--
> I was able to reproduce the behavior on a test system, writing
> directly to an otherwise unused array, using a single 768 MB
> write() call.
Usually writes via a filesystem are more likely to avoid RMW
issues, as suitably chosen filesystem designs take stripe
alignment into account.
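For instance, mkfs.xfs accepts the array geometry explicitly (the
values below are hypothetical, for a 3-disk RAID5 with a 512 KiB chunk;
recent mkfs.xfs versions usually detect md geometry on their own):

```shell
# su = stripe unit (the md chunk size); sw = data disks (3 drives - 1 parity)
mkfs.xfs -d su=512k,sw=2 /dev/md3
```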
Some time ago I did some tests and I was also writing to a
'/dev/md' device, but I found I got RMW only if using
'O_DIRECT', while buffered writes ended up being aligned.
Without going into details, it looked like the Linux IO
subsystem does significant reordering of requests, sometimes
surprisingly, when directly accessing the block device, but not
when writing files after creating a filesystem in that block
device. Perhaps currently MD expects to be fronted by a
filesystem.
> I measured chunk sizes at each power of 2 from 2^2 to 2^14
> KB. The results of this are that smaller chunks performed the
> best, [ ... ]
Your Perl script is a bit convoluted. I prefer to keep it simple
and use 'dd' advisedly to get upper boundaries.
Anyhow, try using a stripe-aware filesystem like XFS, and also
perhaps increase significantly the size of the stripe cache.
That seems to help scheduling too. Changing the elevator on the
member devices sometimes helps too (but is not necessarily
related to RMW issues).
* Re: RAID 5: low sequential write performance?
2013-06-16 21:27 ` Peter Grandi
@ 2013-06-17 6:39 ` Corey Hickey
2013-06-17 14:22 ` Stan Hoeppner
0 siblings, 1 reply; 9+ messages in thread
From: Corey Hickey @ 2013-06-17 6:39 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux RAID
On 2013-06-16 14:27, Peter Grandi wrote:
>> I know this doesn't have anything to do with the filesystem--
>> I was able to reproduce the behavior on a test system, writing
>> directly to an otherwise unused array, using a single 768 MB
>> write() call.
>
> Usually writes via a filesystem are more likely to avoid RMW
> issues, as suitably chosen filesystem designs take stripe
> alignment into account.
Yeah... but if I issue a single write() directly to the array without
seeking anywhere, I would expect it to be aligned.
> Some time ago I did some tests and I was also writing to a
> '/dev/md' device, but I found I got RMW only if using
> 'O_DIRECT', while buffered writes ended up being aligned.
> Without going into details, it looked like that the Linux IO
> subsystem does significant reordering of requests, sometimes
> surprisingly, when directly accessing the block device, but not
> when writing files after creating a filesystem in that block
> device. Perhaps currently MD expects to be fronted by a
> filesystem.
Hmm. I tried some more simple tests with dd right now, for chunk size
512 (I'm only reporting single runs here, but I ran them a few times to
make sure they were representative).
Without O_DIRECT:
icebox:~# dd if=/dev/zero bs=768M count=1 of=/dev/md3 conv=fdatasync
1+0 records in
1+0 records out
805306368 bytes (805 MB) copied, 16.1938 s, 49.7 MB/s
With O_DIRECT:
icebox:~# dd if=/dev/zero bs=768M count=1 of=/dev/md3 oflag=direct conv=fdatasync
1+0 records in
1+0 records out
805306368 bytes (805 MB) copied, 18.1964 s, 44.3 MB/s
So, that doesn't seem to help. Interestingly, it did help when I tested
with a 16384 KB chunk size (34 MB/s --> 39 MB/s); but I'll stick with 512
KB chunks for now.
>> I measured chunk sizes at each power of 2 from 2^2 to 2^14
>> KB. The results of this are that smaller chunks performed the
>> best, [ ... ]
>
> Your Perl script is a bit convoluted. I prefer to keep it simple
> and use 'dd' advisedly to get upper boundaries.
Yeah... I had used dd for the early testing, but I wanted to try random
data and /dev/urandom was so slow it added a large baseline to each test
unless I read it into memory before starting the time measurements.
The rest of the script is just for the setup, reporting, etc.
> Anyhow, try using a stripe-aware filesystem like XFS, and also
> perhaps increase significantly the size of the stripe cache.
> That seems to help scheduling too. Changing the elevator on the
> member devices sometimes helps too (but is not necessarily
> related to RMW issues).
The underlying devices were using cfq. noop and deadline were available,
but they didn't make a noticeable difference.
The stripe cache, however, made a huge difference. It was 256 (KB,
right?) by default. Here are some average-of-three dd results (without
O_DIRECT, as above).
256 KB: 50.2 MB/s
512 KB: 61.0 MB/s
1024 KB: 72.7 MB/s
2048 KB: 79.6 MB/s
4096 KB: 87.5 MB/s
8192 KB: 87.3 MB/s
16384 KB: 89.8 MB/s
32768 KB: 91.3 MB/s
...then I tried O_DIRECT again with 32768 KB stripe cache, and it
consistently gets slightly better results: 92.7 MB/s.
This is just on some old dissimilar drives I stuffed into my old
desktop, so I'm not expecting stellar performance. sdd is the slowest,
and it only writes at 54.6 MB/s on its own, so 92.7 MB/s is not too
shabby for the RAID, especially compared to the 49.7 MB/s I was getting
before.
I've been watching a dstat run this whole time, and increasing the
stripe cache size does indeed result in fewer reads, until they go away
entirely at 32768 KB (except for a few reads at the end, which appear
to be unrelated to RAID).
32768 seems to be the maximum for the stripe cache. I'm quite happy to
spend 32 MB for this. 256 KB seems quite low, especially since it's only
half the default chunk size.
Out of curiosity, I did some tests with xfs vs. ext4 on an empty
filesystem. I'm not familiar with xfs, so I may be missing out on some
tuning. This isn't really meant to be a comprehensive benchmark, but for
a single sequential write with dd:
mkfs.xfs /dev/md3
direct: 89.8 MB/s; not direct: 90.0 MB/s
mkfs.ext4 /dev/md3 -E lazy_itable_init=0
direct: 86.2 MB/s; not direct: 85.4 MB/s
mkfs.ext4 /dev/md3 -E lazy_itable_init=0,stride=128,stripe_width=256
direct: 89.0 MB/s; not direct: 85.6 MB/s
Anyway, xfs indeed did slightly better, so I may evaluate it further
next time I rebuild my main array (the one that actually matters for
all this testing). :) As far as that array goes, write performance went from
30.2 MB/s to 53.1 MB/s. Not that great, unfortunately, but I'm using
dm-crypt and that may be the bottleneck now. Reads are still present
during a write (due to fragmentation, perhaps?), but they are minimal.
Thanks Peter, your email was a great help to me. I'm still interested if
you or anyone else has anything to comment on here, but I'm satisfied
that I've managed to eliminate unnecessary read-modify-write as a source
of slowness.
Thanks,
Corey
* Re: RAID 5: low sequential write performance?
2013-06-17 6:39 ` Corey Hickey
@ 2013-06-17 14:22 ` Stan Hoeppner
2013-06-17 17:14 ` Corey Hickey
0 siblings, 1 reply; 9+ messages in thread
From: Stan Hoeppner @ 2013-06-17 14:22 UTC (permalink / raw)
To: Corey Hickey; +Cc: Peter Grandi, Linux RAID
On 6/17/2013 1:39 AM, Corey Hickey wrote:
> 32768 seems to be the maximum for the stripe cache. I'm quite happy to
> spend 32 MB for this. 256 KB seems quite low, especially since it's only
> half the default chunk size.
FULL STOP. Your stripe cache is consuming *384MB* of RAM, not 32MB.
Check your actual memory consumption. The value plugged into
stripe_cache_size is not a byte value. The value specifies the number
of data elements in the stripe cache array. Each element is #disks*4KB
in size. The formula for calculating memory consumed by the stripe
cache is:
(num_of_disks * 4KB) * stripe_cache_size
In your case this would be
(3 * 4KB) * 32768 = 384MB
Test different values until you find the best combo of performance and
lowest RAM usage. It'll probably be 2048, 4096, or 8192, which will
cost you 24MB, 48MB, or 96MB of RAM.
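The arithmetic is easy to double-check (a standalone sketch, assuming
the common 4 KiB page size):

```python
def stripe_cache_bytes(num_disks, stripe_cache_size, page_size=4096):
    """md stripe cache memory: one page per member disk per cache entry."""
    return num_disks * page_size * stripe_cache_size

MB = 1024 * 1024
# The 3-disk array at the maximum setting:
assert stripe_cache_bytes(3, 32768) == 384 * MB
# The suggested middle-ground values:
assert stripe_cache_bytes(3, 2048) == 24 * MB
assert stripe_cache_bytes(3, 4096) == 48 * MB
assert stripe_cache_bytes(3, 8192) == 96 * MB
print("all figures check out")
```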
> mkfs.xfs /dev/md3
> direct: 89.8 MB/s; not direct: 90.0 MB/s
You didn't align XFS. Though with large streaming writes it won't
matter much as md and the block layer will fill the stripes. However,
XFS' big advantage is parallel IO and you're testing serial IO. Fire up
4 O_DIRECT threads/processes and compare to EXT4 w/4 write threads. The
throughput gap will increase until you run out of hardware.
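A rough sketch of such a parallel-writer test (hedged: it uses buffered
writes to a temp directory so it can run anywhere; on the real array
you would target the mounted filesystem and open the files with
O_DIRECT and aligned buffers):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def writer(path, mb=8):
    # Buffered writes for portability; a real O_DIRECT test would use
    # os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT) instead.
    with open(path, "wb") as f:
        for _ in range(mb):
            f.write(b"\0" * (1 << 20))   # 1 MiB per write
        f.flush()
        os.fsync(f.fileno())

tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, "writer%d" % i) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as ex:
    list(ex.map(writer, paths))          # four concurrent sequential writers

assert all(os.path.getsize(p) == 8 << 20 for p in paths)
```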
--
Stan
* Re: RAID 5: low sequential write performance?
2013-06-17 14:22 ` Stan Hoeppner
@ 2013-06-17 17:14 ` Corey Hickey
2013-06-17 17:45 ` Mikael Abrahamsson
2013-06-18 5:52 ` Stan Hoeppner
0 siblings, 2 replies; 9+ messages in thread
From: Corey Hickey @ 2013-06-17 17:14 UTC (permalink / raw)
To: stan; +Cc: Peter Grandi, Linux RAID
On 2013-06-17 07:22, Stan Hoeppner wrote:
> On 6/17/2013 1:39 AM, Corey Hickey wrote:
>
>> 32768 seems to be the maximum for the stripe cache. I'm quite happy to
>> spend 32 MB for this. 256 KB seems quite low, especially since it's only
>> half the default chunk size.
>
> FULL STOP. Your stripe cache is consuming *384MB* of RAM, not 32MB.
> Check your actual memory consumption. The value plugged into
> stripe_cache_size is not a byte value. The value specifies the number
> of data elements in the stripe cache array. Each element is #disks*4KB
> in size. The formula for calculating memory consumed by the stripe
> cache is:
>
> (num_of_disks * 4KB) * stripe_cache_size
>
> In your case this would be
>
> (3 * 4KB) * 32768 = 384MB
I'm actually seeing a bit more memory difference: 401-402 MB when going
from 256 to 32768, on a mostly idle system, so maybe there's
something else coming into play.
Still, your formula does make more sense. Apparently the idea of the
value being KB is a common misconception, possibly perpetuated by this:
https://raid.wiki.kernel.org/index.php/Performance
---
# Set stripe-cache_size for RAID5.
echo "Setting stripe_cache_size to 16 MiB for /dev/md3"
echo 16384 > /sys/block/md3/md/stripe_cache_size
---
Is 256 really a reasonable default? Given what I've been seeing, it
appears that 256 is either unreasonably low or I have something else wrong.
>> mkfs.xfs /dev/md3
>> direct: 89.8 MB/s; not direct: 90.0 MB/s
>
> You didn't align XFS. Though with large streaming writes it won't
> matter much as md and the block layer will fill the stripes. However,
> XFS' big advantage is parallel IO and you're testing serial IO. Fire up
> 4 O_DIRECT threads/processes and compare to EXT4 w/4 write threads. The
> throughput gap will increase until you run out of hardware.
This will be something to test next time I rebuild my "real" array.
Thanks,
Corey
* Re: RAID 5: low sequential write performance?
2013-06-17 17:14 ` Corey Hickey
@ 2013-06-17 17:45 ` Mikael Abrahamsson
2013-06-18 5:32 ` Corey Hickey
2013-06-18 5:52 ` Stan Hoeppner
1 sibling, 1 reply; 9+ messages in thread
From: Mikael Abrahamsson @ 2013-06-17 17:45 UTC (permalink / raw)
To: Linux RAID
On Mon, 17 Jun 2013, Corey Hickey wrote:
> Is 256 really a reasonable default? Given what I've been seeing, it
> appears that 256 is either unreasonably low or I have something else
> wrong.
It's a safe setting for a low-memory system; 1 megabyte per drive can
probably be handled by most systems.
Right now it's expected that you know how to tune this yourself. I seem
to remember Neil saying something a few years back about it being
desirable for the stripe cache to be auto-tuned depending on the amount
of available RAM, but other things were probably higher priority, plus
it's not obvious exactly what the setting should be for a given amount
of RAM.
--
Mikael Abrahamsson email: swmike@swm.pp.se
* Re: RAID 5: low sequential write performance?
2013-06-17 17:45 ` Mikael Abrahamsson
@ 2013-06-18 5:32 ` Corey Hickey
0 siblings, 0 replies; 9+ messages in thread
From: Corey Hickey @ 2013-06-18 5:32 UTC (permalink / raw)
To: Linux RAID
On 2013-06-17 10:45, Mikael Abrahamsson wrote:
> On Mon, 17 Jun 2013, Corey Hickey wrote:
>
>> Is 256 really a reasonable default? Given what I've been seeing, it
>> appears that 256 is either unreasonably low or I have something else
>> wrong.
>
> It's a safe setting for a low-memory system; 1 megabyte per drive can
> probably be handled by most systems.
>
> Right now it's expected that you know how to tune this yourself. I seem
> to remember Neil saying something a few years back about it being
> desirable for the stripe cache to be auto-tuned depending on the amount
> of available RAM, but other things were probably higher priority, plus
> it's not obvious exactly what the setting should be for a given amount
> of RAM.
That would seem like a good thing.
Thanks,
Corey
* Re: RAID 5: low sequential write performance?
2013-06-17 17:14 ` Corey Hickey
2013-06-17 17:45 ` Mikael Abrahamsson
@ 2013-06-18 5:52 ` Stan Hoeppner
2013-06-18 6:29 ` Corey Hickey
1 sibling, 1 reply; 9+ messages in thread
From: Stan Hoeppner @ 2013-06-18 5:52 UTC (permalink / raw)
To: Corey Hickey; +Cc: Linux RAID
On 6/17/2013 12:14 PM, Corey Hickey wrote:
> On 2013-06-17 07:22, Stan Hoeppner wrote:
>> On 6/17/2013 1:39 AM, Corey Hickey wrote:
>>
>>> 32768 seems to be the maximum for the stripe cache. I'm quite happy to
>>> spend 32 MB for this. 256 KB seems quite low, especially since it's only
>>> half the default chunk size.
>>
>> FULL STOP. Your stripe cache is consuming *384MB* of RAM, not 32MB.
>> Check your actual memory consumption. The value plugged into
>> stripe_cache_size is not a byte value. The value specifies the number
>> of data elements in the stripe cache array. Each element is #disks*4KB
>> in size. The formula for calculating memory consumed by the stripe
>> cache is:
>>
>> (num_of_disks * 4KB) * stripe_cache_size
>>
>> In your case this would be
>>
>> (3 * 4KB) * 32768 = 384MB
>
> I'm actually seeing a bit more memory difference: 401-402 MB when going
> from 256 to 32768, on a mostly idle system, so maybe there's
> something else coming into play.
384MB = 402,653,184 bytes
> Still your formula does make more sense. Apparently the idea of the
> value being KB is a common misconception, possibly perpetuated by this:
>
> https://raid.wiki.kernel.org/index.php/Performance
> ---
> # Set stripe-cache_size for RAID5.
> echo "Setting stripe_cache_size to 16 MiB for /dev/md3"
> echo 16384 > /sys/block/md3/md/stripe_cache_size
> ---
Note that kernel wikis are not official documentation. They don't
receive the same review as kernel docs. Pretty much anyone can edit
them. So the odds of incomplete information or misinformation are higher. And of
course always be skeptical of performance claims. A much better source
of stripe_cache_size information is this linux-raid thread from March of
this year: http://www.spinics.net/lists/raid/msg42370.html
> Is 256 really a reasonable default? Given what I've been seeing, it
> appears that 256 is either unreasonably low or I have something else wrong.
Neither. You simply haven't digested the information given you, nor
considered that many/most folks have more than 3 drives in their md
array, some considerably more drives. Revisit the memory consumption
equation, found in md(4):
memory_consumed = system_page_size * nr_disks * stripe_cache_size
The current default is 256. On i386/x86-64 platforms with a default 4KB
page size, this consumes 1MB of memory per drive; a 12 drive array eats
12MB. Increase the default to 1024 and you now eat 4MB/drive. A
default kernel managing a 12 drive md/RAID6 array now eats 48MB just to
manage the array, 96MB for a 24 drive RAID6. This memory consumption is
unreasonable for a default kernel.
Defaults do not exist to work optimally with your setup. They exist to
work reasonably well with all possible setups.
--
Stan
* Re: RAID 5: low sequential write performance?
2013-06-18 5:52 ` Stan Hoeppner
@ 2013-06-18 6:29 ` Corey Hickey
0 siblings, 0 replies; 9+ messages in thread
From: Corey Hickey @ 2013-06-18 6:29 UTC (permalink / raw)
To: stan; +Cc: Linux RAID
On 2013-06-17 22:52, Stan Hoeppner wrote:
>>> (num_of_disks * 4KB) * stripe_cache_size
>>>
>>> In your case this would be
>>>
>>> (3 * 4KB) * 32768 = 384MB
>>
>> I'm actually seeing a bit more memory difference: 401-402 MB when going
>> from 256 to 32768, on a mostly idle system, so maybe there's
>> something else coming into play.
>
> 384MB = 402,653,184 bytes
:)
I think that's just a coincidence, but it's possible I'm measuring it
wrong. I just did "free -m" (without --si) immediately before and after
changing the cache size.
stripe_cache_size = 256
---
             total       used       free     shared    buffers     cached
Mem:         16083      13278       2805          0       1387       4028
-/+ buffers/cache:       7862       8221
Swap:            0          0          0
---
stripe_cache_size = 32768
---
             total       used       free     shared    buffers     cached
Mem:         16083      12876       3207          0       1387       4028
-/+ buffers/cache:       7461       8622
Swap:            0          0          0
---
The exact memory usage isn't really that important to me; I just
mentioned it.
> memory_consumed = system_page_size * nr_disks * stripe_cache_size
>
> The current default is 256. On i386/x86-64 platforms with a default 4KB
> page size, this consumes 1MB of memory per drive; a 12 drive array eats
> 12MB. Increase the default to 1024 and you now eat 4MB/drive. A
> default kernel managing a 12 drive md/RAID6 array now eats 48MB just to
> manage the array, 96MB for a 24 drive RAID6. This memory consumption is
> unreasonable for a default kernel.
>
> Defaults do not exist to work optimally with your setup. They exist to
> work reasonably well with all possible setups.
True, and I will grant you that I was not considering low-memory setups.
I wouldn't want the kernel to frivolously consume RAM either. In a choice
between getting the low performance I was seeing vs. spending the RAM,
though, I'd much rather spend the RAM. Now that I know I can tune that,
I'm happy enough; I was just surprised...
Thanks,
Corey