* md-raid5, dm-crypt, alignment and readahead
From: Christian Pernegger @ 2008-03-10 16:10 UTC (permalink / raw)
  To: linux-raid-u79uwXL29TY76Z2rM5mHXA, dm-crypt-4q3lyFh4P1g

Hi!

Since the original thread has gotten rather long and convoluted,
mostly because I've been barking up a few wrong trees, I'd like to
start a new one and welcome the dm-crypt people aboard at the same
time.


HARDWARE:

Tyan Thunder K8W (S2885)
Dual Opteron 254, 2GB (2x2x512MB) DDR333 ECC RAM
Adaptec 29160 with 1x Maxtor Atlas 15K II (system disk)
Dawicontrol DC-4320 RAID with 4x WD RE2-GP 1TB

The Dawicontrol is a 4-port SATA2 PCI-X card using a Silicon Image
3124 chip with the sata_sil24 driver. According to its datasheet the
card's maximum total throughput is 300MB/s, which I have confirmed
empirically. TCQ is enabled and works flawlessly as far as I can tell.
The WD RE2-GPs can do ~75MB/s reads or writes at the very beginning of
the disk; of course throughput drops from there.
The Opterons' crypto performance is ~100MB/s for aes-256-cbc.


SOFTWARE:

Debian testing-amd64
linux-image-2.6.22-3-amd64
e2fsprogs 1.40.6-1
mdadm 2.6.4-1
cryptsetup 2:1.0.6~pre1+svn45-1
aes-x86_64 module


SETUP:

1. md only
After some extensive benchmarking I decided to create an md RAID5
across the 4 disks with a 1MB chunk size and a 512MB bitmap chunk size
(internal).
/sys/block/md0/md/stripe_cache_size = 8192   (a further increase might
help; I haven't tested this much. FWIW I haven't seen
stripe_cache_active full yet.)
Readahead is set to 0 for the component devices and to 2 full stripes
= 6MB = 12288 sectors for the md device, via blockdev --setra. NOTE:
dm-crypt is not involved yet.
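
In shell terms the above boils down to roughly this (the component
device names sd[b-e] are taken from the iostat output below; treat it
as a sketch, not the exact command history):

  # md RAID5 over the four RE2-GPs: 1MB chunk, 512MB internal bitmap chunk
  mdadm --create /dev/md0 --level=5 --raid-devices=4 \
        --chunk=1024 --bitmap=internal --bitmap-chunk=524288 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde

  # larger stripe cache
  echo 8192 > /sys/block/md0/md/stripe_cache_size

  # no readahead on the components, 2 full stripes (12288 sectors = 6MB) on md0
  for d in sdb sdc sdd sde; do blockdev --setra 0 /dev/$d; done
  blockdev --setra 12288 /dev/md0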

Tested with "sync; echo 3 > /proc/sys/vm/drop_caches; dd of=/dev/null
if=/dev/md0 bs=3145728 count=2730" (bs = one full 3MB stripe, ~8GB in
total) and averaged over 3 runs:

Reads: 274MB/s
- that's 91% of reading from the four disks in parallel
- larger readahead does not bring any improvement
- iostat shows that during reads the load is evenly distributed over
the component disks and there aren't any writes.

Writes: 182MB/s
- that's 81% of the write performance of 3 disks in parallel
- iostat shows that during writes the load is evenly distributed over
the component disks, but also that there are *reads* going on in
parallel, albeit slowly. Why is that? The dd block size is a full
stripe and in any case large enough to be merged into full-stripe
writes. When I do some badly misaligned writes on purpose the
"MB_read/s" values are about 10-15 times higher, so this doesn't look
like raid5 read-modify-write cycles, but what is it reading?

Sample output of iostat -m 2:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00   45,02    0,00    0,00   54,98

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb             238,12         1,00        57,72          2        116
sdc             250,50         2,24        57,93          4        117
sdd             136,14         2,05        56,43          4        113
sde             202,97         1,74        54,71          3        110
md0           42650,99         0,00       166,61          0        336

Additionally, for both reads and writes the tps ("transfers per
second") figures seem strange. The component disks do ~4 transfers per
MB written and ~3 transfers per MB read, while the md device does ~256
transfers per MB for both reads and writes, i.e. roughly one transfer
per 4KB. Is this normal? Maybe these "transfers" are merged later
anyway, but I'd have expected md's tps to be 3-4 times that of a
component disk.


2. md + dm-crypt
Since I was getting nice performance out of the RAID, even though I
obviously can't interpret iostat, I decided to move on to the dm-crypt
layer.

That raised the dreaded question of alignment. In theory, telling
cryptsetup to align at chunk boundaries (= 1MB = 2048 sectors) should
do the trick. There should be no need to align to stripe boundaries,
because it doesn't matter whether a full-stripe write is [d0 -> d1 ->
d2 -> d3] or, e.g., [d2 -> d3 -> d0 -> d1]. Testing this with the same
method as above I got:

ALIGN (KB)   read (MB/s)   write (MB/s)
      1024           113            131   # chunk
      3072           114            133   # stripe
      4096           116            132   # nicer multiple of chunk
         4           115            130   # default alignment
        81            83             80   # cruel mis-align

If it weren't for the last case I'd doubt that the --align-payload
option does anything at all. Especially the fact that not giving an
explicit alignment doesn't hurt is strange. Of course the requests
could be merged somewhere so that most of them still end up as
full-stripe writes, but then the same should happen for the
pathological 81KB case, shouldn't it? Oh, and why does misalignment
kill *reads* as well?
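
For completeness, each row above was produced along these lines, with
only --align-payload varied (the mapping name and cipher spec here are
placeholders, not the exact invocation):

  # 1024KB case: align the LUKS payload to 2048 512-byte sectors = 1MB = one chunk
  cryptsetup luksFormat --align-payload=2048 -c aes-cbc-essiv:sha256 -s 256 /dev/md0
  cryptsetup luksOpen /dev/md0 md0_crypt
  blockdev --setra 12288 /dev/mapper/md0_crypt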

Next up, readahead:

Is there a difference between running blockdev --setra and echoing an
int into /sys/block/.../read_ahead_kb for devices that have the entry
in /sys? How is readahead handled when "stacked" virtual block devices
are involved? Does only the top layer count, or does each layer read
ahead for itself, and if it does, is that data used at all?

In theory it would make sense to have md read ahead but not dm-crypt,
because decrypting blocks that turn out not to be needed is
ridiculously expensive. That way the encrypted blocks would be read
ahead into the page cache and dm-crypt could get them from there when
needed. I just don't know if it works that way at all, and dd /
sequential I/O is not a benchmark suited to testing it.

Considering the past reports of dm-crypt-on-md data corruption: what
is a good data corruption test that I can leave running for a few
days, so that if it passes I can at least hope everything is fine?
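
The naive loop I have in mind looks roughly like this (paths and the
mapping name are placeholders), but I'd much rather use something
proven:

  mount /dev/mapper/md0_crypt /mnt/test        # fs on top of dm-crypt-on-md
  pass=0
  while :; do
      # /dev/urandom is slow; a pre-generated file would do as well
      dd if=/dev/urandom of=/mnt/test/blob bs=1M count=1024 2>/dev/null
      md5sum /mnt/test/blob > /tmp/blob.md5
      sync; echo 3 > /proc/sys/vm/drop_caches  # force the verify to re-read from disk
      md5sum -c /tmp/blob.md5 || break         # stop at the first mismatch
      pass=$((pass + 1))
  done
  echo "mismatch after $pass clean passes"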

Thanks for reading this far :-)

Cheers,

C.


* Re: md-raid5, dm-crypt, alignment and readahead
From: Peter Grandi @ 2008-03-10 21:37 UTC (permalink / raw)
  To: Linux RAID


[ ... careful RAID5 test ... ]

The test is fairly careful, but your chunk size of 1MB is rather
large. Not a good idea for sequential reads, for example.

pernegger> Writes: 182MB/s
pernegger> - that's 81% of the write performance of 3 disks in parallel
pernegger> - iostat shows that during writes the load is evenly
pernegger>   distributed over the component disks, but also that
pernegger>   there are *reads* going on in parallel, if
pernegger>   slowly.
          
pernegger> Why is that? The dd block size should be a full
pernegger> stripe and in any case large enough to be combined
pernegger> into one. When I do some badly misaligned writes on
pernegger> purpose the "MB_read/s" values are about 10-15 times
pernegger> higher, so it's not raid5 read-modify-write cycles,
pernegger> but what is it reading?

My guess is that the reads and writes you issue get rearranged by the
page cache, and not all of them necessarily remain stripe-aligned. I
would make sure that 'syslog' is logging debug-level messages to a
named pipe, say '/dev/xconsole', 'cat' it, and then 'sysctl
vm/block_dump=1' to see the stream of IO operations and check where
the reads are coming from.
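
In rough terms, something like this (assuming '/dev/xconsole' is
already set up as a pipe in syslog.conf):

  cat /dev/xconsole &                  # stream kernel debug messages as they arrive
  sysctl -w vm/block_dump=1            # log every block IO (task, READ/WRITE, sector, device)
  sync; echo 3 > /proc/sys/vm/drop_caches
  dd if=/dev/zero of=/dev/md0 bs=3145728 count=100   # repeat the (destructive) write test,
                                                     # watch for READ lines on sd[b-e]
  sysctl -w vm/block_dump=0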

pernegger> How is readahead handled when "stacked" virtual block
pernegger> devices are involved? Does only the top layer count,
pernegger> does each layer read ahead for itself and if it does
pernegger> is the data used at all?

Good question. From some quick testing I did in the case of an MD
block device on top of disk block devices, the readahead that matters
is the top-level one. I tried to follow the logic in the code, but it
is a bit opaque. I suspect it depends on the type of request function
the upper layer uses to issue requests to the lower layer.

pernegger> Considering the past reports on dm-crypt-on-md data
pernegger> corruption - what is a good data corruption test I
pernegger> can leave running for a few days and at least hope
pernegger> that everything is fine if it passes?

I personally like 'loop-AES'; it seems to be particularly reliable
and has some very good encryption code built in.


* Re: md-raid5, dm-crypt, alignment and readahead
From: Christian Pernegger @ 2008-03-11 23:47 UTC (permalink / raw)
  To: linux-raid, dm-crypt

>  Is there a difference between running blockdev --setra and echo-int to
>  /sys/block/.../read_ahead_kb for devices that have the entry in /sys?

To answer myself: blockdev --setra (in 512-byte sectors) simply
changes /sys/block/.../read_ahead_kb (in KB), so I'll just always use
the former.
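
Quick check that they are the same knob in different units (sectors
vs. KB):

  blockdev --setra 12288 /dev/md0
  blockdev --getra /dev/md0                 # 12288 (sectors)
  cat /sys/block/md0/queue/read_ahead_kb    # 6144  (12288 * 512 bytes = 6144 KB)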

>  How is readahead handled when "stacked" virtual block devices are
>  involved?

I ran quite a few tests, and readahead at any layer but the top one
did not have any effect. This means that blocks read ahead will always
be decrypted. Oh well ...
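
For example, one of the tested combinations, roughly (the mapping name
is a placeholder): readahead on the dm-crypt device only, nothing
below:

  for d in sdb sdc sdd sde md0; do blockdev --setra 0 /dev/$d; done
  blockdev --setra 12288 /dev/mapper/md0_crypt
  sync; echo 3 > /proc/sys/vm/drop_caches
  dd of=/dev/null if=/dev/mapper/md0_crypt bs=3145728 count=2730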

Comments on the other issues still welcome :-)


Also, I've been having responsiveness problems with 2.6.22: accessing
an md-raid5 and/or md-raid5 + dm-crypt device in certain ways (a
surefire test case is mke2fs) would block ANY disk I/O on the system,
including to/from the SCSI system disk (also encrypted), for minutes
at a time. Apps that were already running would be fine as long as
they did no disk access, but launching even a simple 'cat' was too
much.
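
For the record, the trigger is nothing more exotic than roughly this
(the stride is just my attempt to match the 1MB chunk; the mapping
name is a placeholder):

  mke2fs -j -E stride=256 /dev/mapper/md0_crypt   # 1MB chunk / 4KB block = 256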

It seems that something related to dm-crypt has some serious
congestion issues. After finally digging up a few threads about
loosely related problems I tried 2.6.24, which allegedly had improved
in this area. Result:

+ (other) system I/O no longer stops dead during mke2fs
- running apps are sluggish, ssh input is laggy (it wasn't before)
- bonnie++ (sequential) write performance has dropped by more than 20%
- I'm getting "slab: cache kmem_cache error: free_objects accounting
error" messages, which may be totally unrelated but are worrying
nonetheless.

Can't say I'm happy with either kernel's behaviour - any ideas?


C.


* Re: Re: md-raid5, dm-crypt, alignment and readahead
From: Pasi Kärkkäinen @ 2008-03-12 11:55 UTC (permalink / raw)
  To: dm-crypt-4q3lyFh4P1g; +Cc: linux-raid-u79uwXL29TY76Z2rM5mHXA

On Wed, Mar 12, 2008 at 12:47:02AM +0100, Christian Pernegger wrote:
> >  Is there a difference between running blockdev --setra and echo-int to
> >  /sys/block/.../read_ahead_kb for devices that have the entry in /sys?
> 
> To answer myself, blockdev --setra seems to change
> /sys/block/.../read_ahead_kb, so I'll just always use that.
> 
> >  How is readahead handled when "stacked" virtual block devices are
> >  involved?
> 
> Ran quite a few tests and readahead at any layer but the top layer did
> not have any effect. This means that read ahead blocks will always be
> decrypted. Oh well ...
>

Just to make it clear.. with "top layer" you mean you set read ahead only for
the dm-crypt device and NOT for the md-raid and/or physical sdX devices? 
 
-- Pasi


* Re: [dm-crypt] Re: md-raid5, dm-crypt, alignment and readahead
From: Christian Pernegger @ 2008-03-12 13:07 UTC (permalink / raw)
  To: dm-crypt, linux-raid

>  > Ran quite a few tests and readahead at any layer but the top layer did
>  > not have any effect. This means that read ahead blocks will always be
>  > decrypted. Oh well ...
>  >
>
>  Just to make it clear.. with "top layer" you mean you set read ahead only for
>  the dm-crypt device and NOT for the md-raid and/or physical sdX devices?

Exactly.

C.

