* dm-crypt on RAID5/6 write performance - cause & proposed solutions
From: Chris Lais @ 2011-05-11 19:11 UTC
  To: dm-devel

I've recently installed a system with dm-crypt placed over a software
RAID5 array, and have noticed some very severe issues with write
performance due to the way dm-crypt works.

Almost all of these problems are caused by dm-crypt re-ordering bios
to an extreme degree (as shown by blktrace), such that it is very hard
for the raid layer to merge them into full stripes, leading to many
extra reads and writes.  There are minor problems with losing
io_context and seeking for CFQ, but they have far less impact.

I've worked around the reordering locally by increasing the size of
the various queues to very nearly their maximum, and preferring to
write full stripes before partial ones by such an extreme amount that
partial stripe writes can take up to ~30 seconds to complete.  Some
partial writes still get through where there should be none.

This is sub-optimal for the intended use of the machine (an
interactive workstation), and I'd like to open some discussion on
possible solutions.

Increasing the queue sizes and preferring full stripe writes has
increased sequential write performance roughly 6-fold, so this is a
MAJOR issue with this configuration (dm-crypt on top of RAID5/6).

Using RAID5/6 without dm-crypt does /not/ have these problems in my
setup, even with standard queue sizes, because the raid layer can
handle the stripe merging when the bios are not so far out of order.
Using lower RAID levels even with dm-crypt also does not have these
problems to such an extreme degree, because they don't need
read-parity-write cycles for partial stripes.
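
To make the cost difference concrete, here is a rough back-of-the-envelope
sketch in Python.  It uses a deliberately simplified model of RAID5 writes
(an isolated partial write does one read-modify-write: read old data and
old parity, write new data and new parity) with the geometry from the setup
listed at the end of this message (3 disks, 512 KiB chunks).  The 4 KiB bio
size and all function names are assumptions made for illustration.  The real
md code can also pick reconstruct-writes and merge neighbouring bios, so the
partial-write figure is a worst case, not a measurement.

```python
# Simplified RAID5 write-cost model (illustrative only, not md's actual logic).

def data_stripe_bytes(n_disks: int, chunk_bytes: int) -> int:
    """Data held by one full stripe: every disk but one carries data."""
    return (n_disks - 1) * chunk_bytes


def full_stripe_ios(n_disks: int, chunk_bytes: int, bio_bytes: int):
    """(reads, writes) in bio-sized units when the whole stripe arrives together.

    Parity is computed from the new data alone, so nothing is read back.
    """
    pages_per_chunk = chunk_bytes // bio_bytes
    return 0, n_disks * pages_per_chunk


def isolated_partial_ios(n_disks: int, chunk_bytes: int, bio_bytes: int):
    """(reads, writes) if every bio of the stripe is handled on its own.

    Worst case: each bio triggers a read-modify-write of one data block
    and the matching parity block (2 reads + 2 writes).
    """
    bios = data_stripe_bytes(n_disks, chunk_bytes) // bio_bytes
    return 2 * bios, 2 * bios


KIB = 1024
print(data_stripe_bytes(3, 512 * KIB))                 # bytes of data per stripe
print(full_stripe_ios(3, 512 * KIB, 4 * KIB))          # (reads, writes)
print(isolated_partial_ios(3, 512 * KIB, 4 * KIB))     # (reads, writes)
```

In this model a merged stripe costs 384 block writes and no reads, while the
fully unmerged case costs 512 reads plus 512 writes for the same 1 MiB of data.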

Solution #1 -
Don't re-order bios in dm-crypt.
This would also have the side effect of making barriers work again,
but would probably require a very large sorted queue on the kcryptd_io
thread, would introduce some latency, would probably introduce memory
starvation for some loads, and could potentially introduce deadlocks
if not done properly.  It may also cause bursty output instead of
sustained, even when the input is sustained.

Solution #2 -
Merge stripes in dm-crypt, and submit an entire stripe at once.
This is a huge hack, but it would use much smaller queues than would
be required at a lower layer (i.e., the raid5/6 layer, where it is
currently happening).  It would still produce out-of-order stripe
writes, but the I/O scheduler would probably handle that with a large
enough request queue, and seeking is much cheaper than multiple
read-parity-write cycles per stripe.

Solution #3 -
In the md layer, in addition to preread_bypass_threshold, add a
preread_expire, to allow stripes that need pre-read to be submitted
based on time, rather than skip count.
Nothing more than triage for the partial stripe write delay when
favoring full stripes.
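
A toy model of what Solution #3 proposes, sketched in Python.
preread_bypass_threshold is the existing md tunable; preread_expire does not
exist and is the hypothetical knob suggested above.  The two queues, the
function name pick_stripe, and the selection rule are invented for
illustration and are not md's code.

```python
from collections import deque


def pick_stripe(full, partial, bypassed, threshold, now=None, expire=None):
    """Choose the next stripe to handle; return (stripe_id, new_bypass_count).

    full     -- deque of stripe ids that need no pre-read
    partial  -- deque of (stripe_id, queued_at) that need a pre-read
    bypassed -- how many times the partial queue has been skipped so far
    expire   -- hypothetical preread_expire in seconds (None disables it)
    """
    if partial:
        _, queued_at = partial[0]
        timed_out = (
            expire is not None and now is not None and now - queued_at >= expire
        )
        # Serve the oldest partial stripe if nothing else is waiting, if it
        # has been skipped often enough, or if it has waited too long.
        if not full or bypassed >= threshold or timed_out:
            return partial.popleft()[0], 0
    if full:
        # A full stripe jumps the queue; count the skip only if someone waited.
        return full.popleft(), bypassed + (1 if partial else 0)
    return None, bypassed


full = deque(["F1", "F2", "F3"])
partial = deque([("P1", 0.0)])

# Skip count only, with a very high threshold: the full stripe wins.
print(pick_stripe(full, partial, bypassed=0, threshold=1000))
# Same threshold, but the partial stripe has waited past a 5 s expiry.
print(pick_stripe(full, partial, bypassed=1, threshold=1000, now=6.0, expire=5.0))
```

With a time bound, the threshold can stay high enough to favor full stripes
while the worst-case delay of a partial write is capped by the expiry.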

Does anyone have any more ideas, or comments on these?
Need logs?  I can produce them, just ask for what you want.

Setup:
Linux version: 2.6.38
Processor: i7-870 (2.93GHz with 4 cores + HT = 8 logical units)
RAM: 8GB
RAID5: 3x1.5TB w/ 512k chunks
LVM2
dm-crypt: LUKS with aes-cbc-essiv:sha256
All layers are properly aligned.

-- 
Chris


* Re: dm-crypt on RAID5/6 write performance - cause & proposed solutions
From: Milan Broz @ 2011-05-12  8:47 UTC
  To: device-mapper development; +Cc: Chris Lais

Hi,

On 05/11/2011 09:11 PM, Chris Lais wrote:
> I've recently installed a system with dm-crypt placed over a software
> RAID5 array, and have noticed some very severe issues with write
> performance due to the way dm-crypt works.
> 
> Almost all of these problems are caused by dm-crypt re-ordering bios
> to an extreme degree (as shown by blktrace), such that it is very hard
> for the raid layer to merge them into full stripes, leading to many
> extra reads and writes.  There are minor problems with losing
> io_context and seeking for CFQ, but they have far less impact.

There is no explicit reordering of bios in dmcrypt.

There are basically two situations where dmcrypt can reorder requests:

The first is when the crypto layer processes requests asynchronously
(probably not the case here - according to your system spec you should
probably be using AES-NI, right?)

The second possible reordering can happen if you run a 2.6.38 kernel or
later, where the encryption always runs on the CPU core that submitted it.

The first thing is to check what's really going on in your system and why.

- What's the I/O pattern here? Are several applications issuing writes
in parallel? Can you provide the commands you used to test it?

- Can you test an older kernel (2.6.37) and check blktrace?
Does it behave differently? (It should - no reordering, but all
encryption on just one core.)

- Also, 2.6.39-rc (with the flush changes) can have an influence here;
if you can test whether the problem is still there, that would be nice
(any fix will be based on this version).

Anyway, we need to find out what's really going on before suggesting any fix.

> Using RAID5/6 without dm-crypt does /not/ have these problems in my
> setup, even with standard queue sizes, because the raid layer can
> handle the stripe merging when the bios are not so far out of order.
> Using lower RAID levels even with dm-crypt also does not have these
> problems to such an extreme degree, because they don't need
> read-parity-write cycles for partial stripes.

Ah, so you are suggesting that the problem is caused by read/write
interleaving (parity blocks)?
Or are you talking about degraded mode as well?

Milan


* Re: dm-crypt on RAID5/6 write performance - cause & proposed solutions
From: Chris Lais @ 2011-05-12 13:07 UTC
  To: dm-devel

On Thu, May 12, 2011 at 3:47 AM, Milan Broz <mbroz@redhat.com> wrote:
> Hi,
>
> On 05/11/2011 09:11 PM, Chris Lais wrote:
>> I've recently installed a system with dm-crypt placed over a software
>> RAID5 array, and have noticed some very severe issues with write
>> performance due to the way dm-crypt works.
>>
>> Almost all of these problems are caused by dm-crypt re-ordering bios
>> to an extreme degree (as shown by blktrace), such that it is very hard
>> for the raid layer to merge them into full stripes, leading to many
>> extra reads and writes.  There are minor problems with losing
>> io_context and seeking for CFQ, but they have far less impact.
>
> There is no explicit reordering of bios in dmcrypt.
>
> There are basically two situations where dmcrypt can reorder requests:
>
> The first is when the crypto layer processes requests asynchronously
> (probably not the case here - according to your system spec you should
> probably be using AES-NI, right?)

No, the i7-870 does not have AES-NI.

>
> The second possible reordering can happen if you run a 2.6.38 kernel or
> later, where the encryption always runs on the CPU core that submitted it.
>
> The first thing is to check what's really going on in your system and why.
>
> - What's the I/O pattern here? Are several applications issuing writes
> in parallel? Can you provide the commands you used to test it?
>

The I/O pattern is a single dd command, using a block size of 1M or 2M
(the two do not differ substantially).  And before you ask, this
/is/ one of the main intended workloads, not a failed attempt at
a benchmark.

For the purposes of testing, I'm inputting from /dev/zero, but
normally it will be from an attached drive, which will sometimes be
slower than 180MB/s, and sometimes faster, but will always be
substantially faster than 30MB/s.

The I/O is being submitted by a dirty-page writeback thread
(flush-254:5 in the traces below), which is jumping cores periodically
(and whose CPU affinity I don't think I can set reliably).

I don't know why the caches aren't able to cope without very large
cache sizes (and *still* fail to assemble full stripes frequently),
unless the switching is happening very often and is splitting between
stripes (very likely, with a stripe size of 1MB).

Even with perfect splitting (as in the case with a parallel workload
with no reordering), the cache size for merging stripes will have to
be at least stripe_size*threads.  I have to think we'd get far
better performance (for any media with large physical block sizes)
keeping the bios for each block/stripe together starting from the
upper-most block layer, but the system doesn't seem to be designed in
a way that makes this easy at all.
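
As a rough worked example of that bound, in Python: the numbers below use
the 1 MiB data stripe of this array (2 data disks x 512 KiB chunks).  The
thread count of 8 is only an example value (the number of logical CPUs
here), and the conversion to md stripe-cache entries assumes each entry
covers one 4 KiB page per member device.  Function names are made up.

```python
KIB = 1024
MIB = 1024 * KIB


def min_merge_cache_bytes(stripe_bytes: int, threads: int) -> int:
    """Lower bound from the text: one full data stripe per submitting thread."""
    return stripe_bytes * threads


def stripe_cache_entries(chunk_bytes: int, threads: int, page_bytes: int = 4096) -> int:
    """Entries needed if one entry spans a single page on every member disk.

    A full stripe is then chunk_bytes / page_bytes entries (an assumption
    about how the cache is organised, not a measured value).
    """
    return (chunk_bytes // page_bytes) * threads


threads = 8  # example value only
print(min_merge_cache_bytes(1 * MIB, threads) // MIB, "MiB of data in flight")
print(stripe_cache_entries(512 * KIB, threads), "stripe-cache entries")
```

Under those assumptions, 8 interleaved submitters need 8 MiB of buffered
data, or 1024 entries, before a single reordered bio is accounted for.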


dd if=/dev/zero of=test bs=1048576:

submitted to dm-crypt layer (top-level) [dm-5]:
254,5    5      419     1.019698208  1533  Q   W 761892040 + 8 [flush-254:5]
254,5    5      420     1.019699440  1533  Q   W 761892048 + 8 [flush-254:5]
254,5    5      421     1.019700449  1533  Q   W 761892056 + 8 [flush-254:5]
254,5    5      422     1.019701510  1533  Q   W 761892064 + 8 [flush-254:5]
254,5    5      423     1.019702466  1533  Q   W 761892072 + 8 [flush-254:5]
254,5    5      424     1.019703528  1533  Q   W 761892080 + 8 [flush-254:5]
[snip]
254,5    1      418     1.030607158  1533  Q   W 761959960 + 8 [flush-254:5]
254,5    1      419     1.030608679  1533  Q   W 761959968 + 8 [flush-254:5]
254,5    1      420     1.030610084  1533  Q   W 761959976 + 8 [flush-254:5]
254,5    1      421     1.030611534  1533  Q   W 761959984 + 8 [flush-254:5]
254,5    1      422     1.030612991  1533  Q   W 761959992 + 8 [flush-254:5]
254,5    1      423     1.030614446  1533  Q   W 761960000 + 8 [flush-254:5]
[snip]
254,5    3      423     1.062605245  1533  Q   W 762049928 + 8 [flush-254:5]
254,5    3      424     1.062606044  1533  Q   W 762049936 + 8 [flush-254:5]
254,5    3      425     1.062606853  1533  Q   W 762049944 + 8 [flush-254:5]
254,5    3      426     1.062607616  1533  Q   W 762049952 + 8 [flush-254:5]
254,5    3      427     1.062609579  1533  Q   W 762049960 + 8 [flush-254:5]
254,5    3      428     1.062610503  1533  Q   W 762049968 + 8 [flush-254:5]
254,5    3      429     1.062611306  1533  Q   W 762049976 + 8 [flush-254:5]
254,5    3      430     1.062612079  1533  Q   W 762049984 + 8 [flush-254:5]
254,5    3      431     1.062612851  1533  Q   W 762049992 + 8 [flush-254:5]

submitted to LVM2 logical volume layer (directly below dm-5) [dm-3]:
254,3    1       34     1.055642427  6282  Q   W 761959960 + 8 [kworker/1:2]
254,3    3       39     1.055676830  6402  Q   W 762049928 + 8 [kworker/3:0]
254,3    5       35     1.055707355  6349  Q   W 761892040 + 8 [kworker/5:1]
254,3    3       40     1.055720657  6402  Q   W 762049936 + 8 [kworker/3:0]
254,3    1       35     1.055720737  6282  Q   W 761959968 + 8 [kworker/1:2]
254,3    3       41     1.055768875  6402  Q   W 762049944 + 8 [kworker/3:0]
254,3    5       36     1.055782164  6349  Q   W 761892048 + 8 [kworker/5:1]
254,3    1       36     1.055798939  6282  Q   W 761959976 + 8 [kworker/1:2]
254,3    3       42     1.055813807  6402  Q   W 762049952 + 8 [kworker/3:0]
254,3    5       37     1.055858505  6349  Q   W 761892056 + 8 [kworker/5:1]
254,3    3       43     1.055858595  6402  Q   W 762049960 + 8 [kworker/3:0]
254,3    1       37     1.055873828  6282  Q   W 761959984 + 8 [kworker/1:2]
254,3    3       44     1.055906790  6402  Q   W 762049968 + 8 [kworker/3:0]
254,3    5       38     1.055937878  6349  Q   W 761892064 + 8 [kworker/5:1]
254,3    3       45     1.055950798  6402  Q   W 762049976 + 8 [kworker/3:0]
254,3    1       38     1.055950939  6282  Q   W 761959992 + 8 [kworker/1:2]
254,3    3       46     1.055999370  6402  Q   W 762049984 + 8 [kworker/3:0]
254,3    5       39     1.056011893  6349  Q   W 761892072 + 8 [kworker/5:1]
254,3    1       39     1.056028144  6282  Q   W 761960000 + 8 [kworker/1:2]
254,3    3       47     1.056044505  6402  Q   W 762049992 + 8 [kworker/3:0]
254,3    5       40     1.056088439  6349  Q   W 761892080 + 8 [kworker/5:1]

http://zenthought.org/tmp/dm-crypt+raid5/dm-5,dm-3.single-thread.dd.zero.1M.tar.gz
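
One way to put a number on the reordering is to count how often
consecutive queue (Q) events are not sector-contiguous.  The Python sketch
below does that for blkparse text output in the default format shown above.
The function names and the "discontinuity" metric are my own choice for
illustration, and the two samples are just the first few lines of each
excerpt above.

```python
import re

# dev  cpu  seq  timestamp  pid  action  rwbs  sector + nsectors
LINE = re.compile(
    r"^\s*(\d+),(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\d+)\s+\+\s+(\d+)"
)


def parse_queue_events(text):
    """Return (timestamp, sector, nsectors) for every Q event, in time order."""
    events = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m and m.group(7) == "Q":
            events.append((float(m.group(5)), int(m.group(9)), int(m.group(10))))
    events.sort()
    return events


def count_discontinuities(events):
    """Count adjacent pairs where the next bio does not start where the last ended."""
    breaks = 0
    for (_, sector, length), (_, next_sector, _) in zip(events, events[1:]):
        if next_sector != sector + length:
            breaks += 1
    return breaks


DM5_SAMPLE = """
254,5    5      419     1.019698208  1533  Q   W 761892040 + 8 [flush-254:5]
254,5    5      420     1.019699440  1533  Q   W 761892048 + 8 [flush-254:5]
254,5    5      421     1.019700449  1533  Q   W 761892056 + 8 [flush-254:5]
"""

DM3_SAMPLE = """
254,3    1       34     1.055642427  6282  Q   W 761959960 + 8 [kworker/1:2]
254,3    3       39     1.055676830  6402  Q   W 762049928 + 8 [kworker/3:0]
254,3    5       35     1.055707355  6349  Q   W 761892040 + 8 [kworker/5:1]
254,3    3       40     1.055720657  6402  Q   W 762049936 + 8 [kworker/3:0]
254,3    1       35     1.055720737  6282  Q   W 761959968 + 8 [kworker/1:2]
254,3    3       41     1.055768875  6402  Q   W 762049944 + 8 [kworker/3:0]
"""

print(count_discontinuities(parse_queue_events(DM5_SAMPLE)))  # input side
print(count_discontinuities(parse_queue_events(DM3_SAMPLE)))  # below dm-crypt
```

On these samples the input to dm-crypt has no breaks, while every adjacent
pair below dm-crypt is a break (5 out of 5).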

> - Can you test an older kernel (2.6.37) and check blktrace?
> Does it behave differently? (It should - no reordering, but all
> encryption on just one core.)
>
> - Also, 2.6.39-rc (with the flush changes) can have an influence here;
> if you can test whether the problem is still there, that would be nice
> (any fix will be based on this version).

I will test both of these when I'm able (should be in the next few
days), but I suspect 2.6.37 will perform much better if it's doing it
on one core with no re-ordering.

I'll have to let you know on 2.6.39-rc*.

>
> Anyway, we need to find out what's really going on before suggesting any fix.
>
>> Using RAID5/6 without dm-crypt does /not/ have these problems in my
>> setup, even with standard queue sizes, because the raid layer can
>> handle the stripe merging when the bios are not so far out of order.
>> Using lower RAID levels even with dm-crypt also does not have these
>> problems to such an extreme degree, because they don't need
>> read-parity-write cycles for partial stripes.
>
> Ah, so you are suggesting that the problem is caused by read/write
> interleaving (parity blocks)?
> Or are you talking about degraded mode as well?

Yes, it seems to be caused almost entirely by multiple partial stripe
writes to the same stripes, leading to extra unnecessary reads and
parity calculations (I do suspect that the reads themselves have much
more impact on this system, however).

I'm not talking about degraded mode (I don't expect that to perform well).

>
> Milan
>
>

--
Chris

