* dm-crypt parallelization patches @ 2013-09-03 19:06 Mikulas Patocka 2013-09-03 20:07 ` Andi Kleen 2013-09-03 20:49 ` Milan Broz 0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-09-03 19:06 UTC (permalink / raw)
To: Andi Kleen; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon

Hi Andi

You wrote the original dm-crypt parallelization patch (commit c029772125594e31eb1a5ad9e0913724ed9891f2). That patch improves parallelization by encrypting data on the same CPU that submitted the request. I improved on that so that encryption is performed by all processors: it uses an unbound workqueue, so any processor can do the encryption.

My patches improve performance mainly for sequential reads - your original patch parallelized sequential reads badly because most requests were submitted by the same CPU.

I put my patch series at http://people.redhat.com/mpatocka/patches/kernel/dm-crypt-paralelizace/current/series.html

I'd like it if you could look at it and ack it for inclusion in the upstream kernel.

Mikulas

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-09-03 19:06 dm-crypt parallelization patches Mikulas Patocka @ 2013-09-03 20:07 ` Andi Kleen 2013-09-11 23:03 ` Mikulas Patocka 2013-09-03 20:49 ` Milan Broz 1 sibling, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2013-09-03 20:07 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon

On Tue, Sep 03, 2013 at 03:06:03PM -0400, Mikulas Patocka wrote:
> Hi Andi
>
> You wrote the original dm-crypt parallelization patch (commit
> c029772125594e31eb1a5ad9e0913724ed9891f2). That patch improves
> parallelization by encrypting data on the same CPU that
> submitted the request.

The motivation for my patches was to make large systems with many cores scale. I think the motivation for yours is to make small systems without hardware crypto offload slightly faster. That's fine (although I suspect it's more and more obscure), as long as you don't impact the scalability of the large systems.

Can you do some tests on a larger system with very parallel IO (many CPUs submitting) and see if the IO rate is equivalent? If that's a given, the patches are OK with me.

-Andi

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-09-03 20:07 ` Andi Kleen @ 2013-09-11 23:03 ` Mikulas Patocka 2013-09-11 23:33 ` Andi Kleen 2013-09-12 3:51 ` Milan Broz 0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-09-11 23:03 UTC (permalink / raw)
To: Andi Kleen; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon

On Tue, 3 Sep 2013, Andi Kleen wrote:

> On Tue, Sep 03, 2013 at 03:06:03PM -0400, Mikulas Patocka wrote:
> > Hi Andi
> >
> > You wrote the original dm-crypt parallelization patch (commit
> > c029772125594e31eb1a5ad9e0913724ed9891f2). That patch improves
> > parallelization by encrypting data on the same CPU that
> > submitted the request.
>
> The motivation for my patches was to make large systems with many
> cores scale.
>
> I think the motivation for yours is to make small systems without
> hardware crypto offload slightly faster.
>
> That's fine (although I suspect it's more and more obscure), as long as
> you don't impact the scalability of the large systems.
>
> Can you do some tests on a larger system with very parallel IO
> (many CPUs submitting) and see if the IO rate is equivalent?
>
> If that's a given, the patches are OK with me.
>
> -Andi

The patches mainly help with sequential I/O (sequential I/O is normally submitted from a few CPUs, so the existing dm-crypt doesn't parallelize it well). There is a decrease in performance when 12 threads are doing writes on a ramdisk:

Sequential read of two SCSI disks in RAID-0 (throughput):
raw read rate: 262MB/s stdev 2.9
existing dm-crypt: 111MB/s stdev 19.4 (it varies greatly)
new dm-crypt: 254MB/s stdev 1.5

Sequential read of ramdisk (throughput):
raw read rate: 908MB/s stdev 7.7
existing dm-crypt: 133MB/s stdev 0.8
new dm-crypt: 475MB/s stdev 6

fio --rw=randread --bs=4k --direct=1
Random read (4k block) on ramdisk from 12 threads (time in seconds):
raw device: 2.32s stdev 0.02
existing dm-crypt: 17.95s stdev 3.40
new dm-crypt: 15.72s stdev 1.86

fio --rw=randwrite --bs=4k --direct=1
Random write (4k block) on ramdisk from 12 threads (time in seconds):
raw device: 3.91s stdev 0.01
existing dm-crypt: 21.16s stdev 3.67
new dm-crypt: 27.42s stdev 0.27

fio --rw=randrw --bs=4k --direct=1
Random read+write (4k block) on ramdisk from 12 threads (time in seconds):
raw device: 3.09s stdev 0.01
existing dm-crypt: 20.08s stdev 3.48
new dm-crypt: 24.78s stdev 0.13

Tests were done on two six-core Istanbul-class Opterons. I don't have a computer with AES-NI, so I can't test that.

Mikulas

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-09-11 23:03 ` Mikulas Patocka @ 2013-09-11 23:33 ` Andi Kleen 2013-09-12 3:51 ` Milan Broz 1 sibling, 0 replies; 22+ messages in thread From: Andi Kleen @ 2013-09-11 23:33 UTC (permalink / raw) To: Mikulas Patocka; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon > The patches mainly help with sequential I/O (sequential I/O is normally > submitted from a few CPUs, so existing dm-crypt doesn't parallelize > right). There is a decrease of performance when 12 threads are doing > writes on ramdisk: That's a bit worrying. Could you please find a larger system (4S) in RH and try with that? I suspect you may need to limit the load balancing to smaller groups of CPUs. Querying all other CPUs all the time is unlikely to scale. -Andi ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-09-11 23:03 ` Mikulas Patocka 2013-09-11 23:33 ` Andi Kleen @ 2013-09-12 3:51 ` Milan Broz 1 sibling, 0 replies; 22+ messages in thread
From: Milan Broz @ 2013-09-12 3:51 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: Ondrej Kozina, dm-devel, Andi Kleen, Alasdair G. Kergon

On 09/12/2013 01:03 AM, Mikulas Patocka wrote:
>
> Tests were done on two six-core Istanbul-class Opterons.
>
> I don't have a computer with AES-NI, so I can't test that.

AES-NI is available in almost all systems where dmcrypt is currently deployed (typically notebooks)... so I think it must be tested there.

Also, ramdisk tests are a good baseline, but they are not a real-world scenario. I would really like to see the same test with RAID5 (over fast rotational disks) with various stripe sizes, and for SSDs (and RAID0 with SSDs). (These are the problematic scenarios reported in recent years that I know about.)

Can you also show how the sorting patch influences performance? (IOW, how it performs without it.)

Another problematic part was latency (e.g. an SQL database over dmcrypt doesn't perform well). Isn't it even worse with these patches? (I am not sure if anyone has an fio script to simulate such a workload.)

I would like to test it myself but I have no usable fast-enough test systems at all currently. But there are surely some in RH. (Added cc to Ondra, he has some test scripts I used.)

I also remember I tried your patches (a year ago) with limited CPU workqueues (I mean using e.g. only 3 of 6 cores) and it performed better in some tests. This supports Andi's suggestion to limit it to smaller groups of CPUs.

Also, I think we discussed that this should be configurable through optional dmcrypt per-device parameters (e.g. I would like to set up some dmcrypt device to use only one CPU). Think about a backend for virtual machines etc.; it should not eat all available CPU power on the host... Do you plan to add such a patch as well?

Milan

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-09-03 19:06 dm-crypt parallelization patches Mikulas Patocka 2013-09-03 20:07 ` Andi Kleen @ 2013-09-03 20:49 ` Milan Broz 1 sibling, 0 replies; 22+ messages in thread
From: Milan Broz @ 2013-09-03 20:49 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: dm-devel, Andi Kleen, Milan Broz, Alasdair G. Kergon

On 09/03/2013 09:06 PM, Mikulas Patocka wrote:
> Hi Andi
>
> You wrote the original dm-crypt parallelization patch (commit
> c029772125594e31eb1a5ad9e0913724ed9891f2). That patch improves
> parallelization by encrypting data on the same CPU that
> submitted the request.

Hi Mikulas,

Are there any recently reported performance problems with the current dmcrypt code? Please post numbers from your tests, and also include systems using AES-NI.

You know I tested your patchset circa a year ago and the result was not a clear improvement... So if this has changed, it is good news.

Milan

^ permalink raw reply [flat|nested] 22+ messages in thread
* dm-crypt performance @ 2013-03-26 3:47 Mikulas Patocka 2013-03-26 12:27 ` [dm-devel] " Alasdair G Kergon 0 siblings, 1 reply; 22+ messages in thread
From: Mikulas Patocka @ 2013-03-26 3:47 UTC (permalink / raw)
To: Mike Snitzer, Alasdair G. Kergon, dm-devel

Hi

I performed some dm-crypt performance tests as Mike suggested.

It turns out that unbound workqueue performance has improved somewhere between kernel 3.2 (when I made the dm-crypt patches) and 3.8, so the patches for hand-built dispatch are no longer needed.

For RAID-0 composed of two disks with a total throughput of 260MB/s, the unbound workqueue performs as well as the hand-built dispatch (both sustain the 260MB/s transfer rate).

For ramdisk, the unbound workqueue performs better than the hand-built dispatch (620MB/s vs 400MB/s). The unbound workqueue with the patch that Mike suggested (git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git) improves performance slightly on ramdisk compared to 3.8 (700MB/s vs. 620MB/s).

However, there is still the problem with request ordering. Milan found out that under some circumstances parallel dm-crypt has worse performance than the previous dm-crypt code. I found out that this is not caused by deficiencies in the code that distributes work to individual processors. The performance drop is caused by the fact that distributing write bios to multiple processors causes the encryption to finish out of order, and the I/O scheduler is unable to merge these out-of-order bios.

The deadline and noop schedulers perform better (only 50% slowdown compared to old dm-crypt), CFQ performs very badly (8 times slowdown).

If I sort the requests in dm-crypt to come out in the same order as they were received, there is no longer any slowdown; the new crypt performs as well as the old crypt. But the last time I submitted the patches, people objected to sorting requests in dm-crypt, saying that the I/O scheduler should sort them. But it doesn't. This problem still persists in the current kernels.

For best performance we could use the unbound workqueue implementation with request sorting, if people don't object to the request sorting being done in dm-crypt.

Mikulas

^ permalink raw reply [flat|nested] 22+ messages in thread
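(As a rough sketch of what the "unbound workqueue" variant discussed above amounts to: the key difference from a per-CPU queue is the WQ_UNBOUND flag at allocation time. The flag combination, queue name and max_active value below are illustrative assumptions, not the exact code from the patch series.)

/* Illustrative only - not the actual dm-crypt patch. */
#include <linux/workqueue.h>
#include <linux/cpumask.h>
#include <linux/errno.h>

static struct workqueue_struct *crypt_queue;

static int crypt_wq_init(bool unbound)
{
	unsigned int flags = WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM;

	if (unbound)
		flags |= WQ_UNBOUND;	/* work may run on any CPU, not just the submitter's */

	crypt_queue = alloc_workqueue("kcryptd", flags, num_online_cpus());
	return crypt_queue ? 0 : -ENOMEM;
}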
* Re: [dm-devel] dm-crypt performance 2013-03-26 3:47 dm-crypt performance Mikulas Patocka @ 2013-03-26 12:27 ` Alasdair G Kergon 2013-03-26 20:05 ` Milan Broz 0 siblings, 1 reply; 22+ messages in thread From: Alasdair G Kergon @ 2013-03-26 12:27 UTC (permalink / raw) To: Mikulas Patocka Cc: Mike Snitzer, dm-devel, Andi Kleen, dm-crypt, Milan Broz, linux-kernel, Christoph Hellwig, Christian Schmidt [Adding dm-crypt + linux-kernel] On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote: > I performed some dm-crypt performance tests as Mike suggested. > > It turns out that unbound workqueue performance has improved somewhere > between kernel 3.2 (when I made the dm-crypt patches) and 3.8, so the > patches for hand-built dispatch are no longer needed. > > For RAID-0 composed of two disks with total throughput 260MB/s, the > unbound workqueue performs as well as the hand-built dispatch (both > sustain the 260MB/s transfer rate). > > For ramdisk, unbound workqueue performs better than hand-built dispatch > (620MB/s vs 400MB/s). Unbound workqueue with the patch that Mike suggested > (git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git) improves > performance slighlty on ramdisk compared to 3.8 (700MB/s vs. 620MB/s). > > > > However, there is still the problem with request ordering. Milan found out > that under some circumstances parallel dm-crypt has worse performance than > the previous dm-crypt code. I found out that this is not caused by > deficiencies in the code that distributes work to individual processors. > Performance drop is caused by the fact that distributing write bios to > multiple processors causes the encryption to finish out of order and the > I/O scheduler is unable to merge these out-of-order bios. > > The deadline and noop schedulers perform better (only 50% slowdown > compared to old dm-crypt), CFQ performs very badly (8 times slowdown). > > > If I sort the requests in dm-crypt to come out in the same order as they > were received, there is no longer any slowdown, the new crypt performs as > well as the old crypt, but the last time I submitted the patches, people > objected to sorting requests in dm-crypt, saying that the I/O scheduler > should sort them. But it doesn't. This problem still persists in the > current kernels. > > > For best performance we could use the unbound workqueue implementation > with request sorting, if people don't object to the request sorting being > done in dm-crypt. On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote: > FYI, XFS also does it's own request ordering for the metadata buffers, > because it knows the needed ordering and has a bigger view than than > than especially CFQ. You at least have precedence in a widely used > subsystem for this code. So please post this updated version of the patches for a wider group of people to try out. Alasdair ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [dm-devel] dm-crypt performance 2013-03-26 12:27 ` [dm-devel] " Alasdair G Kergon @ 2013-03-26 20:05 ` Milan Broz 2013-03-26 20:28 ` Mike Snitzer 0 siblings, 1 reply; 22+ messages in thread
From: Milan Broz @ 2013-03-26 20:05 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Mike Snitzer, dm-devel, Andi Kleen, dm-crypt, Milan Broz, linux-kernel, Christoph Hellwig, Christian Schmidt

On 26.3.2013 13:27, Alasdair G Kergon wrote:
> [Adding dm-crypt + linux-kernel]

Thanks.

> On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
>> I performed some dm-crypt performance tests as Mike suggested.
>>
>> It turns out that unbound workqueue performance has improved somewhere
>> between kernel 3.2 (when I made the dm-crypt patches) and 3.8, so the
>> patches for hand-built dispatch are no longer needed.
>>
>> For RAID-0 composed of two disks with total throughput 260MB/s, the
>> unbound workqueue performs as well as the hand-built dispatch (both
>> sustain the 260MB/s transfer rate).
>>
>> For ramdisk, unbound workqueue performs better than hand-built dispatch
>> (620MB/s vs 400MB/s). Unbound workqueue with the patch that Mike suggested
>> (git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git) improves
>> performance slightly on ramdisk compared to 3.8 (700MB/s vs. 620MB/s).

I found that ramdisk tests are usually quite misleading for dmcrypt. Better to use some fast SSD, ideally in RAID0 (so you get >500MB/s or so). Also be sure you compare recent machines which use AES-NI. For reference, a null cipher (no crypt, data copy only) works as well, but this is not a real-world scenario.

After introducing Andi's patches, we created a performance regression for people who created "RAID over several dmcrypt devices". (All IOs were processed by one core.) A rare use case, but several people complained.

But most people reported that the current approach works much better (even with a stupid dd test - I think it is because the page cache submits requests from different CPUs, so it in fact runs in parallel). But using dd with direct-io is a trivial way to simulate the "problem". (I guess we all like using dd for performance testing... :-])

>> However, there is still the problem with request ordering. Milan found out
>> that under some circumstances parallel dm-crypt has worse performance than
>> the previous dm-crypt code. I found out that this is not caused by
>> deficiencies in the code that distributes work to individual processors.
>> The performance drop is caused by the fact that distributing write bios to
>> multiple processors causes the encryption to finish out of order and the
>> I/O scheduler is unable to merge these out-of-order bios.

If the IO scheduler is unable to merge these requests because of out-of-order bios, please try to FIX the IO scheduler and do not invent workarounds in dmcrypt. (With recent accelerated crypto this should not happen so often, btw.)

I know it is not easy, but I really do not like having something in the "little walled device-mapper garden" that should be done at a different layer (again).

>> The deadline and noop schedulers perform better (only 50% slowdown
>> compared to old dm-crypt), CFQ performs very badly (8 times slowdown).
>>
>> If I sort the requests in dm-crypt to come out in the same order as they
>> were received, there is no longer any slowdown, the new crypt performs as
>> well as the old crypt, but the last time I submitted the patches, people
>> objected to sorting requests in dm-crypt, saying that the I/O scheduler
>> should sort them. But it doesn't. This problem still persists in the
>> current kernels.

I probably have no vote here anymore, but for the record: I am strictly against any sorting of requests in dmcrypt. My reasons are:

- dmcrypt should be a simple transparent layer (doing one thing - encryption); sorting of requests was always primarily the IO scheduler's domain (which already has well-known knobs to control it)

- Are we sure we are not introducing another side channel in disk encryption? (An unprivileged user can measure timing here.) (Perhaps a stupid reason, but please do not prefer performance to security in encryption. It is enough that we have timing attacks on AES implementations...)

- In my testing (several months ago) the output was very unstable - in some situations it helped, in some it was worse. I no longer have hard data, but some test output was sent to Alasdair.

>> For best performance we could use the unbound workqueue implementation
>> with request sorting, if people don't object to the request sorting being
>> done in dm-crypt.

So again:

- why is the IO scheduler not working properly here? Does it need some extensions? If fixed, it can help even in some other non-dmcrypt IO patterns. (I mean dmcrypt could automagically set some special parameter for the underlying device queue to fine-tune sorting parameters.)

- can we have some cpu-bound workqueue which automatically switches to unbound (relocates work to another CPU) if it detects some saturation watermark etc.? (Again, this could be used in other code. http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html (Yes, I see skepticism there :-)

> On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote:
>> FYI, XFS also does its own request ordering for the metadata buffers,
>> because it knows the needed ordering and has a bigger view than
>> especially CFQ. You at least have precedence in a widely used
>> subsystem for this code.

Nice. But XFS is a much more complex system. Isn't it enough that multipath uses its own IO queue (so we have one IO scheduler on top of another), and now we have metadata IO sorting in XFS on top of it and are planning one more in dmcrypt? Is it really a good approach?

Milan

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt performance 2013-03-26 20:05 ` Milan Broz @ 2013-03-26 20:28 ` Mike Snitzer 2013-03-28 18:53 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Mike Snitzer @ 2013-03-26 20:28 UTC (permalink / raw) To: Milan Broz Cc: Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, tj On Tue, Mar 26 2013 at 4:05pm -0400, Milan Broz <gmazyland@gmail.com> wrote: > >On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote: > > > >>For best performance we could use the unbound workqueue implementation > >>with request sorting, if people don't object to the request sorting being > >>done in dm-crypt. > > So again: > > - why IO scheduler is not working properly here? Do it need some extensions? > If fixed, it can help even is some other non-dmcrypt IO patterns. > (I mean dmcrypt can set some special parameter for underlying device queue > automagically to fine-tune sorting parameters.) Not sure, but IO scheduler changes are fairly slow to materialize given the potential for adverse side-effects. Are you so surprised that a shotgun blast of IOs might make the IO schduler less optimal than if some basic sorting were done at the layer above? > - can we have some cpu-bound workqueue which automatically switch to unbound > (relocates work to another cpu) if it detects some saturation watermark etc? > (Again, this can be used in other code. > http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html > (Yes, I see skepticism there :-) Question for Tejun? (now cc'd). > >On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote: > >>FYI, XFS also does it's own request ordering for the metadata buffers, > >>because it knows the needed ordering and has a bigger view than than > >>than especially CFQ. You at least have precedence in a widely used > >>subsystem for this code. > > Nice. But XFS is much more complex system. > Isn't it enough that multipath uses own IO queue (so we have one IO scheduler > on top of another, and now we have metadata io sorting in XFS on top of it > and planning one more in dmcrypt? Is it really good approach?) Multipath's request_queue is the only one with an active IO scheduler; the requests are dispatched directly to the underlying devices' queues without any IO scheduling. As for dm-crypt; as you know it is bio-based so it is already dealing with out of order IOs (no benefit of upper level IO scheduler). Seems relatively clear, from Mikulas' results, that maybe you're hoping for a bit too much magic from the IO scheduler gnomes that lurk on LKML. BTW, pretty sure btrfs takes care to maintain some IO dispatch ordering too. Mike ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt performance 2013-03-26 20:28 ` Mike Snitzer @ 2013-03-28 18:53 ` Tejun Heo 2013-03-28 19:33 ` Vivek Goyal 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2013-03-28 18:53 UTC (permalink / raw) To: Mike Snitzer Cc: Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Vivek Goyal, Jens Axboe Hello, (cc'ing Vivek and Jens for the iosched related bits) On Tue, Mar 26, 2013 at 04:28:38PM -0400, Mike Snitzer wrote: > On Tue, Mar 26 2013 at 4:05pm -0400, > Milan Broz <gmazyland@gmail.com> wrote: > > > >On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote: > > > > > >>For best performance we could use the unbound workqueue implementation > > >>with request sorting, if people don't object to the request sorting being > > >>done in dm-crypt. > > > > So again: > > > > - why IO scheduler is not working properly here? Do it need some extensions? > > If fixed, it can help even is some other non-dmcrypt IO patterns. > > (I mean dmcrypt can set some special parameter for underlying device queue > > automagically to fine-tune sorting parameters.) > > Not sure, but IO scheduler changes are fairly slow to materialize given > the potential for adverse side-effects. Are you so surprised that a > shotgun blast of IOs might make the IO schduler less optimal than if > some basic sorting were done at the layer above? My memory is already pretty hazy but Vivek should be able to correct me if I say something nonsense. The thing is, the order and timings of IOs coming down from upper layers has certain meanings to ioscheds and they exploit those patterns to do better scheduling. Reordering IOs randomly actually makes certain information about the IO stream lost and makes ioscheds mis-classify the IO stream - e.g. what could have been classfied as "mostly consecutive streaming IO" could after such reordering fail to be detected as such. Sure, ioscheds can probably be improved to compensate for such temporary localized reorderings but nothing is free and given that most of the upper stacks already do pretty good job of issuing IOs orderly when possible, it would be a bit silly to do more than usually necessary in ioscheds. So, no, I don't think maintaining IO order in stacking drivers is a bad idea. I actually think all stacking drivers should do that; otherwise, they really are destroying actual useful side-band information. > > - can we have some cpu-bound workqueue which automatically switch to unbound > > (relocates work to another cpu) if it detects some saturation watermark etc? > > (Again, this can be used in other code. > > http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html > > (Yes, I see skepticism there :-) > > Question for Tejun? (now cc'd). Unbound workqueues went through quite a bit of improvements lately and are currently growing NUMA affinity support. Once merged, all unbound work items issued on a NUMA node will be processed in the same NUMA node, which should mitigate some, unfortunately not all, of the disadvantages compared to per-cpu ones. Mikulas, can you share more about your test setup? Was it a NUMA machine? Which wq branch did you use? The NUMA affinity support would have less severe but similar issue as per-cpu. If all IOs are being issued from one node while other nodes are idle, that specific node can get saturated. NUMA affinity support is adjusted both from inside kernel and userland via sysfs, so there are control knobs for corner cases. 
As for maintaining CPU or NUMA affinity until the CPU / node is saturated and spilling to other CPUs/nodes beyond that, yeah, an interesting idea. It's non-trivial and would have to incorporate a lot of notions on "load" similar to the scheduler. It really becomes a generic load balancing problem as it'd be pointless and actually harmful to, say, spill work items to each other between two saturated NUMA nodes. So, if the brunt of scattering workload across random CPUs can be avoided by NUMA affinity, that could be a reasonable tradeoff, I think. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt performance 2013-03-28 18:53 ` Tejun Heo @ 2013-03-28 19:33 ` Vivek Goyal 2013-03-28 19:44 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Vivek Goyal @ 2013-03-28 19:33 UTC (permalink / raw) To: Tejun Heo Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe On Thu, Mar 28, 2013 at 11:53:27AM -0700, Tejun Heo wrote: > Hello, > > (cc'ing Vivek and Jens for the iosched related bits) > > On Tue, Mar 26, 2013 at 04:28:38PM -0400, Mike Snitzer wrote: > > On Tue, Mar 26 2013 at 4:05pm -0400, > > Milan Broz <gmazyland@gmail.com> wrote: > > > > > >On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote: > > > > > > > >>For best performance we could use the unbound workqueue implementation > > > >>with request sorting, if people don't object to the request sorting being > > > >>done in dm-crypt. > > > > > > So again: > > > > > > - why IO scheduler is not working properly here? Do it need some extensions? > > > If fixed, it can help even is some other non-dmcrypt IO patterns. > > > (I mean dmcrypt can set some special parameter for underlying device queue > > > automagically to fine-tune sorting parameters.) > > > > Not sure, but IO scheduler changes are fairly slow to materialize given > > the potential for adverse side-effects. Are you so surprised that a > > shotgun blast of IOs might make the IO schduler less optimal than if > > some basic sorting were done at the layer above? > > My memory is already pretty hazy but Vivek should be able to correct > me if I say something nonsense. The thing is, the order and timings > of IOs coming down from upper layers has certain meanings to ioscheds > and they exploit those patterns to do better scheduling. > > Reordering IOs randomly actually makes certain information about the > IO stream lost and makes ioscheds mis-classify the IO stream - > e.g. what could have been classfied as "mostly consecutive streaming > IO" could after such reordering fail to be detected as such. Sure, > ioscheds can probably be improved to compensate for such temporary > localized reorderings but nothing is free and given that most of the > upper stacks already do pretty good job of issuing IOs orderly when > possible, it would be a bit silly to do more than usually necessary in > ioscheds. > > So, no, I don't think maintaining IO order in stacking drivers is a > bad idea. I actually think all stacking drivers should do that; > otherwise, they really are destroying actual useful side-band > information. I am curious why out of order bio is a problem. Doesn't the elevator already merge bio's with existing requests and if merging does not happen then requests are sorted in order. So why ordering is not happening properly with dm-crypt. What additional info dm-crypt has that it can do better ordering than IO scheduler. CFQ might seeing more performance hit because we maintain per process queues and kernel threads might not be sharing the IO context (i am not sure). So if all the crypto threads can share the IO context, atleast it will make sure all IO from them goes into a single queue. So it would help if somebody can explain that why existing merging and sorting logic is not working well with dm-crypt and what additional info dm-crypt has which can help it do better job. Thanks Vivek ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt performance 2013-03-28 19:33 ` Vivek Goyal @ 2013-03-28 19:44 ` Tejun Heo 2013-03-28 20:38 ` Vivek Goyal 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2013-03-28 19:44 UTC (permalink / raw) To: Vivek Goyal Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe Hello, On Thu, Mar 28, 2013 at 03:33:43PM -0400, Vivek Goyal wrote: > I am curious why out of order bio is a problem. Doesn't the elevator > already merge bio's with existing requests and if merging does not > happen then requests are sorted in order. So why ordering is not > happening properly with dm-crypt. What additional info dm-crypt has > that it can do better ordering than IO scheduler. Hmmm... well, for one, it doesn't only change ordering. It also changes the timings. Before iosched would get contiguous stream of IOs when the queue gets unplugged (BTW, how does dm crypt handling plugging? If not handled properly, it could definitely affect a lot of things.) With multiple threads doing encryption in the middle, the iosched could get scattered IOs which could easily span multiple millisecs. Even if context tagging was done properly, it could easily lead to much less efficient IO patterns to hardware. Keeping IO order combined with proper plug handling would not only keep the ordering constant but also the relative timing of events, which is an important factor when scheduling IOs. > CFQ might seeing more performance hit because we maintain per > process queues and kernel threads might not be sharing the IO context > (i am not sure). So if all the crypto threads can share the IO > context, atleast it will make sure all IO from them goes into a > single queue. Right, this is important too although I fail to see how workqueue vs. custom dispatch would make any difference here. dm-crypt should definitely be using bio_associate_current(). Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt performance 2013-03-28 19:44 ` Tejun Heo @ 2013-03-28 20:38 ` Vivek Goyal 2013-03-28 20:45 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2013-03-28 20:38 UTC (permalink / raw)
To: Tejun Heo
Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe

On Thu, Mar 28, 2013 at 12:44:43PM -0700, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 28, 2013 at 03:33:43PM -0400, Vivek Goyal wrote:
> > I am curious why out of order bio is a problem. Doesn't the elevator
> > already merge bio's with existing requests and if merging does not
> > happen then requests are sorted in order. So why ordering is not
> > happening properly with dm-crypt. What additional info dm-crypt has
> > that it can do better ordering than IO scheduler.
>
> Hmmm... well, for one, it doesn't only change ordering. It also
> changes the timings. Before iosched would get contiguous stream of
> IOs when the queue gets unplugged (BTW, how does dm crypt handling
> plugging? If not handled properly, it could definitely affect a lot
> of things.) With multiple threads doing encryption in the middle, the
> iosched could get scattered IOs which could easily span multiple
> millisecs. Even if context tagging was done properly, it could easily
> lead to much less efficient IO patterns to hardware.
>
> Keeping IO order combined with proper plug handling would not only
> keep the ordering constant but also the relative timing of events,
> which is an important factor when scheduling IOs.

If timing of unordered IO is an issue, then dm-crypt can try to batch IO submission using blk_start_plug()/blk_finish_plug(). That way dm-crypt can batch bios and control submission, and there should not be a need to put specific ordering logic in dm-crypt. So if there are multiple threads doing crypto, they either end up submitting the bio after the crypto operation, or they queue it somewhere and there is a single submitter thread.

If completion of crypto is an issue, then I think it is very hard to determine whether extra waiting helps with throughput or hurts. If dm-crypt can decide that somehow, then I guess they can just try to do batch submission of IO from the various crypto threads and see if it helps with performance. (At some point, the submitter thread will become a bottleneck.)

> > CFQ might seeing more performance hit because we maintain per
> > process queues and kernel threads might not be sharing the IO context
> > (i am not sure). So if all the crypto threads can share the IO
> > context, atleast it will make sure all IO from them goes into a
> > single queue.
>
> Right, this is important too although I fail to see how workqueue
> vs. custom dispatch would make any difference here. dm-crypt should
> definitely be using bio_associate_current().

Agreed. bio_associate_current() will at least help keep all bios of a single context in a single queue and promote more merging (if submission happens in the right time frame).

Thanks
Vivek

^ permalink raw reply [flat|nested] 22+ messages in thread
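(A minimal sketch of the batching Vivek suggests, assuming the already-encrypted bios are handed to one submitter thread; the helper name and the use of a bio_list are assumptions for illustration, not code taken from dm-crypt.)

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Drain a list of already-encrypted bios inside one plug so the block
 * layer sees the whole batch at once and can merge/sort it. */
static void submit_encrypted_batch(struct bio_list *done)
{
	struct blk_plug plug;
	struct bio *bio;

	blk_start_plug(&plug);
	while ((bio = bio_list_pop(done)))
		generic_make_request(bio);
	blk_finish_plug(&plug);
}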
* Re: dm-crypt performance 2013-03-28 20:38 ` Vivek Goyal @ 2013-03-28 20:45 ` Tejun Heo 2013-04-09 17:51 ` dm-crypt parallelization patches Mikulas Patocka 0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2013-03-28 20:45 UTC (permalink / raw)
To: Vivek Goyal
Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe

Hello,

On Thu, Mar 28, 2013 at 04:38:08PM -0400, Vivek Goyal wrote:
> If timing of unordered IO is an issue, then dm-crypt can try
> to batch IO submission using blk_start_plug()/blk_finish_plug(). That way
> dm-crypt can batch bios and control submission and there should not
> be a need to put specific ordering logic in dm-crypt.

Yes, it has to preserve and propagate the plugging boundaries, and if you think about the implementation, maintaining issue order doesn't really require a "sort" per se. Just keep the bios that have been received but are still going through encryption on a list in the received order, with a counter of in-progress bios in the plugging boundary. Link the outputs to the source bios somehow, and when the counter hits zero, issue them in the same order.

While keeping the specific order itself might not be essential, it's not gonna add any significant complexity or runtime overhead, and I think it is generally a good idea for stacking drivers to preserve as much information and context as possible.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 22+ messages in thread
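(A sketch of the order-preserving scheme described above: bios are kept in received order, a counter tracks in-flight encryption, and the last completion issues the whole batch. The structure and names are illustrative assumptions; the counter is primed to 1 and only dropped when the plugging boundary closes, so a batch is never issued before it is complete.)

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/spinlock.h>
#include <linux/atomic.h>

struct ordered_batch {
	spinlock_t lock;
	struct bio_list pending;	/* received order is preserved here */
	atomic_t in_flight;		/* outstanding encryptions + 1 for the open boundary */
};

static void batch_init(struct ordered_batch *b)
{
	spin_lock_init(&b->lock);
	bio_list_init(&b->pending);
	atomic_set(&b->in_flight, 1);	/* dropped by batch_put() when the boundary closes */
}

static void batch_add(struct ordered_batch *b, struct bio *bio)
{
	spin_lock(&b->lock);
	bio_list_add(&b->pending, bio);
	spin_unlock(&b->lock);
	atomic_inc(&b->in_flight);
}

/* Called from each encryption completion and once when the plug boundary closes. */
static void batch_put(struct ordered_batch *b)
{
	struct bio_list ready;
	struct bio *bio;

	if (!atomic_dec_and_test(&b->in_flight))
		return;

	bio_list_init(&ready);
	spin_lock(&b->lock);
	bio_list_merge(&ready, &b->pending);
	bio_list_init(&b->pending);
	spin_unlock(&b->lock);

	/* the last reference issues the whole batch, still in received order */
	while ((bio = bio_list_pop(&ready)))
		generic_make_request(bio);
}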
* dm-crypt parallelization patches 2013-03-28 20:45 ` Tejun Heo @ 2013-04-09 17:51 ` dm-crypt parallelization patches Mikulas Patocka 2013-04-09 17:57 ` Tejun Heo 2013-04-09 18:36 ` Vivek Goyal 0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-04-09 17:51 UTC (permalink / raw)
To: Jens Axboe
Cc: Vivek Goyal, Tejun Heo, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

Hi

I placed the dm-crypt parallelization patches at:
http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/

The patches parallelize dm-crypt and make it possible to use all processor cores.

The patch dm-crypt-remove-percpu.patch removes some percpu variables and replaces them with per-request variables.

The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the encryption workqueue, allowing the encryption to be distributed to all CPUs in the system.

The patch dm-crypt-offload-writes-to-thread.patch moves submission of all write requests to a single thread.

The patch dm-crypt-sort-requests.patch sorts write requests submitted by a single thread. The requests are sorted according to the sector number; an rb-tree is used for efficient sorting.

Some usage notes:

* turn off automatic cpu frequency scaling (or set it to the "performance" governor) - cpufreq doesn't recognize the encryption workload correctly; sometimes it underclocks all the CPU cores when there is some encryption work to do, resulting in bad performance

* when using a filesystem on an encrypted dm-crypt device, reduce the maximum request size with "/sys/block/dm-2/queue/max_sectors_kb" (substitute "dm-2" with the real name of your dm-crypt device). Note that too-big requests mean that there is only a small number of requests and they cannot be distributed to all available processors in parallel - it results in worse performance. Too-small requests result in high request overhead and also reduced performance. So you must find the optimal request size for your system and workload. For me, when testing this on ramdisk, the optimum is 8KiB.

---

Now, the problem with the I/O scheduler: when doing performance testing, it turns out that the parallel version is sometimes worse than the previous implementation. When I create a 4.3GiB dm-crypt device on top of dm-loop on top of an ext2 filesystem on a 15k SCSI disk and run this command

time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11 --name=job12

the results are these:

CFQ scheduler:
--------------
no patches: 21.9s
patch 1: 21.7s
patches 1,2: 2:33s
patches 1,2 (+ nr_requests = 1280000): 2:18s
patches 1,2,3: 20.7s
patches 1,2,3,4: 20.7s

deadline scheduler:
-------------------
no patches: 27.4s
patch 1: 27.4s
patches 1,2: 27.8s
patches 1,2,3: 29.6s
patches 1,2,3,4: 29.6s

We can see that CFQ performs badly with patch 2, but improves with patch 3. All that patch 3 does is move write requests from the encryption threads to a separate thread. So it seems that CFQ has a deficiency: it cannot merge adjacent requests issued by different processes.
The problem is this:
- we have a 256k direct-I/O write request
- it is broken into 4k bios (because we run on dm-loop on a filesystem with 4k block size)
- encryption of these 4k bios is distributed to 12 processes on a 12-core machine
- encryption finishes out of order and in different processes, and 4k bios with encrypted data are submitted to CFQ
- CFQ doesn't merge them
- the disk is flooded with random 4k write requests and performs much worse than with 256k requests

Increasing nr_requests to 1280000 helps a little, but not much - it is still an order of magnitude slower.

I'd like to ask if someone who knows the CFQ scheduler (Jens?) could look at it and find out why it doesn't merge requests from different processes. Why do I have to do a seemingly senseless operation (hand over write requests to a separate thread) in patch 3 to improve performance?

Mikulas

^ permalink raw reply [flat|nested] 22+ messages in thread
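(For illustration, the kind of sector-ordered rb-tree that dm-crypt-sort-requests.patch is described as using might look like the sketch below. The wrapper struct and function names are assumptions; the real patch keys the tree off its per-request structure rather than a standalone wrapper.)

#include <linux/rbtree.h>
#include <linux/bio.h>

struct sorted_write {
	struct rb_node node;
	struct bio *bio;
};

static struct rb_root write_tree = RB_ROOT;

/* insert a write keyed by its starting sector */
static void write_tree_insert(struct sorted_write *w)
{
	struct rb_node **p = &write_tree.rb_node, *parent = NULL;
	sector_t sector = w->bio->bi_sector;	/* bi_iter.bi_sector on newer kernels */

	while (*p) {
		struct sorted_write *s = rb_entry(*p, struct sorted_write, node);

		parent = *p;
		p = sector < s->bio->bi_sector ? &(*p)->rb_left : &(*p)->rb_right;
	}
	rb_link_node(&w->node, parent, p);
	rb_insert_color(&w->node, &write_tree);
}

/* the single write-offload thread drains the tree in ascending sector order */
static struct sorted_write *write_tree_pop(void)
{
	struct rb_node *first = rb_first(&write_tree);

	if (!first)
		return NULL;
	rb_erase(first, &write_tree);
	return rb_entry(first, struct sorted_write, node);
}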
* Re: dm-crypt parallelization patches 2013-04-09 17:51 ` dm-crypt parallelization patches Mikulas Patocka @ 2013-04-09 17:57 ` Tejun Heo 2013-04-09 18:08 ` Mikulas Patocka 2013-04-09 18:36 ` Vivek Goyal 1 sibling, 1 reply; 22+ messages in thread From: Tejun Heo @ 2013-04-09 17:57 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote: > The patch dm-crypt-sort-requests.patch sorts write requests submitted by a > single thread. The requests are sorted according to the sector number, > rb-tree is used for efficient sorting. Hmmm? Why not just keep the issuing order along with plugging boundaries? > So it seems that CFQ has some deficiency that it cannot merge adjacent > requests done by different processes. As I wrote before, please use bio_associate_current(). Currently, dm-crypt is completely messing up all the context information that cfq depends on to schedule IOs. Of course, it doesn't perform well. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
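(For reference, the association Tejun asks for is a single call made while still in the submitting task's context, before the bio is handed to the encryption workqueue. This is only a sketch; note that bio_associate_current() is only compiled in with CONFIG_BLK_CGROUP - otherwise it is a stub returning -ENOENT, as noted below.)

#include <linux/bio.h>

static void crypt_tag_bio(struct bio *bio)
{
	/* records the submitter's io_context and blkcg on the bio */
	bio_associate_current(bio);
	/* ...then hand the bio off to the encryption workqueue as before... */
}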
* Re: dm-crypt parallelization patches 2013-04-09 17:57 ` Tejun Heo @ 2013-04-09 18:08 ` Mikulas Patocka 2013-04-09 18:10 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Mikulas Patocka @ 2013-04-09 18:08 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, 9 Apr 2013, Tejun Heo wrote: > On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote: > > The patch dm-crypt-sort-requests.patch sorts write requests submitted by a > > single thread. The requests are sorted according to the sector number, > > rb-tree is used for efficient sorting. > > Hmmm? Why not just keep the issuing order along with plugging > boundaries? What do you mean? I used to have a patch that keeps order of requests as they were introduced, but sorting the requests according to sector number is a bit simpler. > > So it seems that CFQ has some deficiency that it cannot merge adjacent > > requests done by different processes. > > As I wrote before, please use bio_associate_current(). Currently, > dm-crypt is completely messing up all the context information that cfq > depends on to schedule IOs. Of course, it doesn't perform well. bio_associate_current() is only valid on a system with cgroups and there are no cgroups on the kernel where I tested it. It is an empty function: static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } Mikulas > Thanks. > > -- > tejun > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:08 ` Mikulas Patocka @ 2013-04-09 18:10 ` Tejun Heo 2013-04-09 18:42 ` Vivek Goyal 2013-04-09 19:42 ` Mikulas Patocka 0 siblings, 2 replies; 22+ messages in thread From: Tejun Heo @ 2013-04-09 18:10 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt Hey, On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote: > > Hmmm? Why not just keep the issuing order along with plugging > > boundaries? > > What do you mean? > > I used to have a patch that keeps order of requests as they were > introduced, but sorting the requests according to sector number is a bit > simpler. You're still destroying the context information. Please just keep the issuing order along with plugging boundaries. > > As I wrote before, please use bio_associate_current(). Currently, > > dm-crypt is completely messing up all the context information that cfq > > depends on to schedule IOs. Of course, it doesn't perform well. > > bio_associate_current() is only valid on a system with cgroups and there > are no cgroups on the kernel where I tested it. It is an empty function: > > static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } Yeah, because blkcg was the only user. Please feel free to drop the ifdefs. It covers both iocontext and cgroup association. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:10 ` Tejun Heo @ 2013-04-09 18:42 ` Vivek Goyal 2013-04-09 18:57 ` Tejun Heo 2013-04-09 19:42 ` Mikulas Patocka 1 sibling, 1 reply; 22+ messages in thread From: Vivek Goyal @ 2013-04-09 18:42 UTC (permalink / raw) To: Tejun Heo Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 11:10:31AM -0700, Tejun Heo wrote: > Hey, > > On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote: > > > Hmmm? Why not just keep the issuing order along with plugging > > > boundaries? > > > > What do you mean? > > > > I used to have a patch that keeps order of requests as they were > > introduced, but sorting the requests according to sector number is a bit > > simpler. > > You're still destroying the context information. Please just keep the > issuing order along with plugging boundaries. I guess plugging boundary is more important than issuing order as block layer should take care of mering the bio and put in right order (attempt_plug_merge()). But to make use of plugging boundary, one would probably still need submission using single thread. And if one is using single thread for submission, one will still get good performance (even if you are not using bio_associate_current()), as by default all bio will go to submitting thread's context. Thanks Vivek ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:42 ` Vivek Goyal @ 2013-04-09 18:57 ` Tejun Heo 2013-04-09 19:13 ` Vivek Goyal 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2013-04-09 18:57 UTC (permalink / raw) To: Vivek Goyal Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt Hello, On Tue, Apr 09, 2013 at 02:42:48PM -0400, Vivek Goyal wrote: > I guess plugging boundary is more important than issuing order as > block layer should take care of mering the bio and put in right > order (attempt_plug_merge()). Yeah, the exact order probably doesn't affect things too much but it's just a nice design principle to follow - if you're gonna step in in the middle and meddle with requests, preserve as much context as reasonably possible, and it's not like preserving that order is difficult. > But to make use of plugging boundary, one would probably still need > submission using single thread. It doesn't have to a specific task. Whoever finishes the last bio / segment / whatever in the plugging domain can issue all of them. I probably am missing details but the overall mechanism can be pretty simple. Just keep the bios from the same plugging domain in the received order along with an atomic counter and issue them all when the counter hits zero. No need to fiddle with sorting or whatever. > And if one is using single thread for submission, one will still get > good performance (even if you are not using bio_associate_current()), as > by default all bio will go to submitting thread's context. And destroy all per-ioc and cgroup logics in block layer in the process. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:57 ` Tejun Heo @ 2013-04-09 19:13 ` Vivek Goyal 0 siblings, 0 replies; 22+ messages in thread
From: Vivek Goyal @ 2013-04-09 19:13 UTC (permalink / raw)
To: Tejun Heo
Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

On Tue, Apr 09, 2013 at 11:57:21AM -0700, Tejun Heo wrote:

[..]
> And destroy all per-ioc and cgroup logics in block layer in the
> process.

Oh, I am in no way suggesting that we don't use bio_associate_current(). I am just trying to analyze the performance issue right now and saying that, as far as performance is concerned, one will get it back even if one does not use bio_associate_current().

Yes, but one needs to use bio_associate_current() to make sure bios are attributed to the right cgroup and associated with the right task. This should help solve the long-standing issue of a task losing its ioprio if a dm-crypt target is in the stack.

Thanks
Vivek

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:10 ` Tejun Heo 2013-04-09 18:42 ` Vivek Goyal @ 2013-04-09 19:42 ` Mikulas Patocka 2013-04-09 19:52 ` Tejun Heo 1 sibling, 1 reply; 22+ messages in thread From: Mikulas Patocka @ 2013-04-09 19:42 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, 9 Apr 2013, Tejun Heo wrote: > Hey, > > On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote: > > > Hmmm? Why not just keep the issuing order along with plugging > > > boundaries? > > > > What do you mean? > > > > I used to have a patch that keeps order of requests as they were > > introduced, but sorting the requests according to sector number is a bit > > simpler. > > You're still destroying the context information. Please just keep the > issuing order along with plugging boundaries. > > > > As I wrote before, please use bio_associate_current(). Currently, > > > dm-crypt is completely messing up all the context information that cfq > > > depends on to schedule IOs. Of course, it doesn't perform well. > > > > bio_associate_current() is only valid on a system with cgroups and there > > are no cgroups on the kernel where I tested it. It is an empty function: > > > > static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } > > Yeah, because blkcg was the only user. Please feel free to drop the > ifdefs. It covers both iocontext and cgroup association. > > Thanks. If I drop ifdefs, it doesn't compile (because other cgroup stuff it missing). So I enabled bio cgroups. bio_associate_current can't be used, because by the time we allocate the outgoing write bio, we are no longer in the process that submitted the original bio. Anyway, I tried to reproduce in dm-crypt what bio_associate_current does - in the submitting process I record "ioc" and "css" fields in "dm_crypt_io" structure and set these fields on all outgoing bios. It has no effect on performance, it is as bad as if I hadn't done it. 
Mikulas --- (this is the patch that I used, to be applied after dm-crypt-unbound-workqueue.patch) --- drivers/md/dm-crypt.c | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) Index: linux-3.8.6-fast/drivers/md/dm-crypt.c =================================================================== --- linux-3.8.6-fast.orig/drivers/md/dm-crypt.c 2013-04-09 20:32:41.000000000 +0200 +++ linux-3.8.6-fast/drivers/md/dm-crypt.c 2013-04-09 21:29:12.000000000 +0200 @@ -20,6 +20,7 @@ #include <linux/backing-dev.h> #include <linux/atomic.h> #include <linux/scatterlist.h> +#include <linux/cgroup.h> #include <asm/page.h> #include <asm/unaligned.h> #include <crypto/hash.h> @@ -60,6 +61,9 @@ struct dm_crypt_io { int error; sector_t sector; struct dm_crypt_io *base_io; + + struct io_context *ioc; + struct cgroup_subsys_state *css; }; struct dm_crypt_request { @@ -797,6 +801,14 @@ static struct bio *crypt_alloc_buffer(st if (!clone) return NULL; + if (unlikely(io->base_io != NULL)) { + clone->bi_ioc = io->base_io->ioc; + clone->bi_css = io->base_io->css; + } else { + clone->bi_ioc = io->ioc; + clone->bi_css = io->css; + } + clone_init(io, clone); *out_of_pages = 0; @@ -859,6 +871,9 @@ static struct dm_crypt_io *crypt_io_allo io->ctx.req = NULL; atomic_set(&io->io_pending, 0); + io->ioc = NULL; + io->css = NULL; + return io; } @@ -884,6 +899,14 @@ static void crypt_dec_pending(struct dm_ if (io->ctx.req) mempool_free(io->ctx.req, cc->req_pool); + + if (io->ioc) { + put_io_context(io->ioc); + } + if (io->css) { + css_put(io->css); + } + mempool_free(io, cc->io_pool); if (likely(!base_io)) @@ -927,6 +950,9 @@ static void crypt_endio(struct bio *clon if (rw == WRITE) crypt_free_buffer_pages(cc, clone); + clone->bi_ioc = NULL; + clone->bi_css = NULL; + bio_put(clone); if (rw == READ && !error) { @@ -1658,6 +1684,21 @@ static int crypt_map(struct dm_target *t io = crypt_io_alloc(cc, bio, dm_target_offset(ti, bio->bi_sector)); + if (current->io_context) { + struct io_context *ioc = current->io_context; + struct cgroup_subsys_state *css; + + get_io_context_active(ioc); + io->ioc = ioc; + + /* associate blkcg if exists */ + rcu_read_lock(); + css = task_subsys_state(current, blkio_subsys_id); + if (css && css_tryget(css)) + io->css = css; + rcu_read_unlock(); + } + if (bio_data_dir(io->base_bio) == READ) { if (kcryptd_io_read(io, GFP_NOWAIT)) kcryptd_queue_io(io); ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 19:42 ` Mikulas Patocka @ 2013-04-09 19:52 ` Tejun Heo 2013-04-09 20:32 ` Mikulas Patocka 0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2013-04-09 19:52 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

On Tue, Apr 09, 2013 at 03:42:16PM -0400, Mikulas Patocka wrote:
> If I drop ifdefs, it doesn't compile (because other cgroup stuff it
> missing).
>
> So I enabled bio cgroups.
>
> bio_associate_current can't be used, because by the time we allocate the
> outgoing write bio, we are no longer in the process that submitted the
> original bio.

Oh, I suppose it'd need some massaging to selectively turn off the cgroup part.

> Anyway, I tried to reproduce in dm-crypt what bio_associate_current does -

and we probably need to change that to bio_associate_task().

> in the submitting process I record "ioc" and "css" fields in "dm_crypt_io"
> structure and set these fields on all outgoing bios. It has no effect on
> performance, it is as bad as if I hadn't done it.

A good way to verify that the tagging is correct would be to configure io limits in the block cgroup and see whether the limits are correctly applied when going through dm-crypt (please test with direct-io or reads, writeback is horribly broken, sorry). If the tagging is working correctly, maybe plugging is the overriding factor?

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 19:52 ` Tejun Heo @ 2013-04-09 20:32 ` Mikulas Patocka 2013-04-09 21:02 ` Tejun Heo 2013-04-09 21:07 ` Vivek Goyal 0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-04-09 20:32 UTC (permalink / raw)
To: Tejun Heo
Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

On Tue, 9 Apr 2013, Tejun Heo wrote:

> On Tue, Apr 09, 2013 at 03:42:16PM -0400, Mikulas Patocka wrote:
> > If I drop ifdefs, it doesn't compile (because other cgroup stuff it
> > missing).
> >
> > So I enabled bio cgroups.
> >
> > bio_associate_current can't be used, because by the time we allocate the
> > outgoing write bio, we are no longer in the process that submitted the
> > original bio.
>
> Oh, I suppose it'd need some massaging to selectively turn off the
> cgroup part.
>
> > Anyway, I tried to reproduce in dm-crypt what bio_associate_current does -
>
> and we probably need to change that to bio_associate_task().

Generally, we shouldn't associate bios with the "current" task in device mapper targets. For example, suppose that we have two stacked dm-crypt targets: the "current" process pointer in the lower dm-crypt target's request function always points to the workqueue worker of the upper dm-crypt target that submits the bios. So if we associate the bio with "current" in the lower target, we are associating it with a preallocated workqueue worker, and we have already lost the information about who submitted it.

You should associate a bio with a task when you create the bio, and the "md" and "dm" midlayers should just forward this association to lower-layer bios.

> > in the submitting process I record "ioc" and "css" fields in "dm_crypt_io"
> > structure and set these fields on all outgoing bios. It has no effect on
> > performance, it is as bad as if I hadn't done it.
>
> A good way to verify that the tagging is correct would be to configure
> io limits in the block cgroup and see whether the limits are correctly
> applied when going through dm-crypt (please test with direct-io or
> reads, writeback is horribly broken, sorry). If the tagging is working
> correctly, maybe plugging is the overriding factor?
>
> Thanks.

It doesn't work because device mapper on the underlying layers ignores bi_ioc and bi_css.

If I make device mapper forward bi_ioc and bi_css to the outgoing bios, it improves performance (from 2:30 to 1:30), but it is still far from perfect.

Mikulas

---

dm: forward cgroup context

This patch makes dm forward the associated cgroup context to cloned bios.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm.c |    9 +++++++++
 fs/bio.c        |    2 ++
 2 files changed, 11 insertions(+)

Index: linux-3.8.6-fast/drivers/md/dm.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm.c	2013-04-09 22:00:36.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm.c	2013-04-09 22:19:40.000000000 +0200
@@ -453,6 +453,10 @@ static void free_io(struct mapped_device
 
 static void free_tio(struct mapped_device *md, struct dm_target_io *tio)
 {
+#ifdef CONFIG_BLK_CGROUP
+	tio->clone.bi_ioc = NULL;
+	tio->clone.bi_css = NULL;
+#endif
 	bio_put(&tio->clone);
 }
 
@@ -1124,6 +1128,11 @@ static struct dm_target_io *alloc_tio(st
 	clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs);
 	tio = container_of(clone, struct dm_target_io, clone);
 
+#ifdef CONFIG_BLK_CGROUP
+	tio->clone.bi_ioc = ci->bio->bi_ioc;
+	tio->clone.bi_css = ci->bio->bi_css;
+#endif
+
 	tio->io = ci->io;
 	tio->ti = ti;
 	memset(&tio->info, 0, sizeof(tio->info));

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 20:32 ` Mikulas Patocka @ 2013-04-09 21:02 ` Tejun Heo 2013-04-09 21:03 ` Tejun Heo 2013-04-09 21:07 ` Vivek Goyal 1 sibling, 1 reply; 22+ messages in thread From: Tejun Heo @ 2013-04-09 21:02 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt Hey, On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote: > > and we probably need to change that to bio_associate_task(). > > Generally, we shouldn't associate bios with "current" task in device > mapper targets. For example suppose that we have two stacked dm-crypt > targets: It only follows the first association so it doesn't matter how many layers it goes through. That said, yeah, there could be situations where @task is available but the bio's already in the hands of a different task. If that's the case, change it to associate_task(@task). > It doesn't work because device mapper on the underlying layers ignores > bi_ioc and bi_css. > > If I make device mapper forward bi_ioc and bi_css to outgoing bios, it > improves performance (from 2:30 to 1:30), but it is still far from > perfect. For testing, copying bi_ioc and bi_css directly is fine but please add another interface to copy those for the actual code. Say, bio_copy_association(@to_bio, @from_bio) or whatever. As for the performance loss, I'm somewhat confident in saying the remaining difference would be from ignoring plugging boundaries. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 21:02 ` Tejun Heo @ 2013-04-09 21:03 ` Tejun Heo 0 siblings, 0 replies; 22+ messages in thread From: Tejun Heo @ 2013-04-09 21:03 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 02:02:01PM -0700, Tejun Heo wrote: > For testing, copying bi_ioc and bi_css directly is fine but please add > another interface to copy those for the actual code. Say, > bio_copy_association(@to_bio, @from_bio) or whatever. Another and probably better possibility is just remembering the issuing task (you would of course need to hold an extra ref as long as you wanna use it) and use bio_associate_task() on it when creating new bios. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
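bio_associate_task() is only being proposed in this thread, not an interface that existed in the tree at the time, but Tejun's "remember the issuing task" idea would look roughly like the sketch below. The submitter field and helper names are invented for illustration; get_task_struct()/put_task_struct() are the real refcounting calls that keep the remembered task valid.

/* Sketch of the proposed approach, assuming a bio_associate_task(bio, task)
 * interface that does for an arbitrary task what bio_associate_current()
 * does for "current". */
static void crypt_io_remember_submitter(struct dm_crypt_io *io)
{
	io->submitter = current;
	get_task_struct(io->submitter);		/* hold the task as long as we use it */
}

static void crypt_clone_associate(struct dm_crypt_io *io, struct bio *clone)
{
	bio_associate_task(clone, io->submitter);	/* proposed interface */
}

static void crypt_io_release_submitter(struct dm_crypt_io *io)
{
	put_task_struct(io->submitter);		/* drop the extra reference */
	io->submitter = NULL;
}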
* Re: dm-crypt parallelization patches 2013-04-09 20:32 ` Mikulas Patocka 2013-04-09 21:02 ` Tejun Heo @ 2013-04-09 21:07 ` Vivek Goyal 2013-04-09 21:18 ` Mikulas Patocka 1 sibling, 1 reply; 22+ messages in thread From: Vivek Goyal @ 2013-04-09 21:07 UTC (permalink / raw) To: Mikulas Patocka Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote: [..] > Generally, we shouldn't associate bios with "current" task in device > mapper targets. For example suppose that we have two stacked dm-crypt > targets: > > In the "current" process pointer in lower dm-crypt target's request > function always points to the workqueue of the upper dm-crypt target that > submits the bios. So if we associate the bio with "current" in the lower > target, we are associating it with a preallocated workqueue and we already > lost the information who submitted it. > > You should associate a bio with a task when you create the bio and "md" > and "dm" midlayers should just forward this association to lower layer > bios. bio_associate_current() returns -EBUSY if bio has already been associated with an io context. So in a stack if every driver calls bio_associate_current(), upon bio submission, it will automatically make sure bio gets associated with submitter task in top level device and calls by lower level device will be ignored. Lower level devices I think just need to make sure this context info is propagated to cloned bios. [..] > +#ifdef CONFIG_BLK_CGROUP > + tio->clone.bi_ioc = ci->bio->bi_ioc; > + tio->clone.bi_css = ci->bio->bi_css; You also need to take references to ioc and css objects. I guess a helper function will be better. Maybe something like bio_associate_bio_context(bio1, bio2). And this initializes bio2's context with bio1's context. Thanks Vivek ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 21:07 ` Vivek Goyal @ 2013-04-09 21:18 ` Mikulas Patocka 2013-04-10 19:24 ` Vivek Goyal 0 siblings, 1 reply; 22+ messages in thread From: Mikulas Patocka @ 2013-04-09 21:18 UTC (permalink / raw) To: Vivek Goyal Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, 9 Apr 2013, Vivek Goyal wrote: > On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote: > > [..] > > Generally, we shouldn't associate bios with "current" task in device > > mapper targets. For example suppose that we have two stacked dm-crypt > > targets: > > > > In the "current" process pointer in lower dm-crypt target's request > > function always points to the workqueue of the upper dm-crypt target that > > submits the bios. So if we associate the bio with "current" in the lower > > target, we are associating it with a preallocated workqueue and we already > > lost the information who submitted it. > > > > You should associate a bio with a task when you create the bio and "md" > > and "dm" midlayers should just forward this association to lower layer > > bios. > > bio_associate_current() return -EBUSY if bio has already been associated > with an io context. > > So in a stack if every driver calls bio_associate_current(), upon bio > submission, it will automatically amke sure bio gets associated with > submitter task in top level device and calls by lower level device > will be ignored. The stacking drivers do not pass the same bio to each other. The stacking driver receives a bio, allocates zero or more new bios and sends these new bios to the lower layer. So you need to propagate ownership from the received bio to newly allocated bios, you don't want to associate newly allocated bio with "current" process. > Lower level devices I think just need to make sure this context > info is propogated to cloned bios. > > > > [..] > > > +#ifdef CONFIG_BLK_CGROUP > > > + tio->clone.bi_ioc = ci->bio->bi_ioc; > > > + tio->clone.bi_css = ci->bio->bi_css; > > > > You also need to take references to ioc and css objects. I guess a helper > > function will be better. May be something like. The lifetime of the "tio" structure is shorter than the lifetime of "ci->bio". So we don't need to increment reference. We only need to increment reference if we copy ownership to a new bio that could have a longer lifetime than the original bio. But this situation is very rare - in most stacking drivers the newly allocated bio has shorter lifetime than the original one. > bio_associate_bio_context(bio1, bio2) > > And this initialize bio2's context with bio1's context. Yes, that would be ok. > Thanks > Vivek Mikulas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 21:18 ` Mikulas Patocka @ 2013-04-10 19:24 ` Vivek Goyal 0 siblings, 0 replies; 22+ messages in thread From: Vivek Goyal @ 2013-04-10 19:24 UTC (permalink / raw) To: Mikulas Patocka Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 05:18:25PM -0400, Mikulas Patocka wrote: [..] > > bio_associate_current() return -EBUSY if bio has already been associated > > with an io context. > > > > So in a stack if every driver calls bio_associate_current(), upon bio > > submission, it will automatically amke sure bio gets associated with > > submitter task in top level device and calls by lower level device > > will be ignored. > > The stacking drivers do not pass the same bio to each other. > > The stacking driver receives a bio, allocates zero or more new bios and > sends these new bios to the lower layer. So you need to propagate > ownership from the received bio to newly allocated bios, you don't want to > associate newly allocated bio with "current" process. Actually I was asking to call bio_associate_current() for the incoming bio in the driver and not for the bios newly created by the driver. For any newly created bios on behalf of this incoming bio, we need to copy the context so that this context info can be propagated down the stack. > > > Lower level devices I think just need to make sure this context > > info is propogated to cloned bios. > > > > > > [..] > > > +#ifdef CONFIG_BLK_CGROUP > > > + tio->clone.bi_ioc = ci->bio->bi_ioc; > > > + tio->clone.bi_css = ci->bio->bi_css; > > > > You also need to take references to ioc and css objects. I guess a helper > > function will be better. May be something like. > > The lifetime of the "tio" structure is shorter than the lifetime of > "ci->bio". So we don't need to increment reference. > > We only need to increment reference if we copy ownership to a new bio that > has could have longer lifetime than the original bio. But this situation > is very rare - in most stacking drivers the newly allocated bio has > shorter lifetime than the original one. I think it is not a good idea to rely on the fact that cloned bios or newly created bios will have a shorter lifetime than the original bio. In fact, when the bio completes and you free it up, bio_disassociate_task() will try to put the io context and blkcg references. So it is important to take references if you are copying context info into any newly created bio. Thanks Vivek ^ permalink raw reply [flat|nested] 22+ messages in thread
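Putting Tejun's bio_copy_association() and Vivek's bio_associate_bio_context() suggestions together, the helper would be something along these lines. The name and placement are assumptions made for illustration; the get/put pairing simply mirrors what the 3.8-era bio_associate_current()/bio_disassociate_task() pair already does.

/* Hedged sketch of the discussed helper: copy the cgroup association from
 * @from to @to and take the references that bio_disassociate_task() will
 * drop when @to completes, so the clone may safely outlive the original. */
static void bio_associate_bio_context(struct bio *to, struct bio *from)
{
#ifdef CONFIG_BLK_CGROUP
	if (from->bi_ioc) {
		get_io_context_active(from->bi_ioc);	/* as bio_associate_current() does */
		to->bi_ioc = from->bi_ioc;
	}
	if (from->bi_css && css_tryget(from->bi_css))
		to->bi_css = from->bi_css;
#endif
}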
* Re: dm-crypt parallelization patches 2013-04-09 17:51 ` dm-crypt parallelization patches Mikulas Patocka 2013-04-09 17:57 ` Tejun Heo @ 2013-04-09 18:36 ` Vivek Goyal 1 sibling, 0 replies; 22+ messages in thread From: Vivek Goyal @ 2013-04-09 18:36 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Tejun Heo, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote: > Hi > > I placed the dm-crypt parallization patches at: > http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/ > > The patches paralellize dm-crypt and make it possible to use all processor > cores. > > > The patch dm-crypt-remove-percpu.patch removes some percpu variables and > replaces them with per-request variables. > > The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the > encryption workqueue, allowing the encryption to be distributed to all > CPUs in the system. > > The patch dm-crypt-offload-writes-to-thread.patch moves submission of all > write requests to a single thread. > > The patch dm-crypt-sort-requests.patch sorts write requests submitted by a > single thread. The requests are sorted according to the sector number, > rb-tree is used for efficient sorting. > > Some usage notes: > > * turn off automatic cpu frequency scaling (or set it to "performance" > governor) - cpufreq doesn't recognize encryption workload correctly, > sometimes it underclocks all the CPU cores when there is some encryption > work to do, resulting in bad performance > > * when using filesystem on encrypted dm-crypt device, reduce maximum > request size with "/sys/block/dm-2/queue/max_sectors_kb" (substitute > "dm-2" with the real name of your dm-crypt device). Note that having too > big requests means that there is a small number of requests and they > cannot be distributed to all available processors in parallel - it > results in worse performance. Having too small requests results in high > request overhead and also reduced performance. So you must find the > optimal request size for your system and workload. For me, when testing > this on ramdisk, the optimal is 8KiB. > > --- > > Now, the problem with I/O scheduler: when doing performance testing, it > turns out that the parallel version is sometimes worse than the previous > implementation. > > When I create a 4.3GiB dm-crypt device on the top of dm-loop on the top of > ext2 filesystem on 15k SCSI disk and run this command > > time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt > --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 > --name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11 > --name=job12 > > the results are this: > CFQ scheduler: > -------------- > no patches: > 21.9s > patch 1: > 21.7s > patches 1,2: > 2:33s > patches 1,2 (+ nr_requests = 1280000) > 2:18s > patches 1,2,3: > 20.7s > patches 1,2,3,4: > 20.7s > > deadline scheduler: > ------------------- > no patches: > 27.4s > patch 1: > 27.4s > patches 1,2: > 27.8s > patches 1,2,3: > 29.6s > patches 1,2,3,4: > 29.6s > > > We can see that CFQ performs badly with the patch 2, but improves with the > patch 3. All that patch 3 does is that it moves write requests from > encryption threads to a separate thread. > > So it seems that CFQ has some deficiency that it cannot merge adjacent > requests done by different processes. > CFQ does not merge requests across different cfq queues (cfqq). 
Each queue is associated with one iocontext. So in this case each worker thread is submitting its own bio and each 4K bio must be going into a separate cfqq, hence no merging is taking place. The moment you applied patch 3, where a single thread submitted bios, each bio went into a single queue and possibly got merged. So either use a single thread to submit bios or, better, use bio_associate_current() (as Tejun suggested) on the original 256K bio. (Hopefully bio iocontext association information is retained when you split the bios into smaller pieces). > The problem is this: - we have 256k write direct-i/o request - it is broken to 4k bios (because we run on dm-loop on a filesystem with > 4k block size) - encryption of these 4k bios is distributed to 12 processes on a 12-core > machine - encryption finishes out of order and in different processes, 4k bios > with encrypted data are submitted to CFQ - CFQ doesn't merge them - the disk is flooded with random 4k write requests, and performs much > worse than with 256k requests > Thanks Vivek ^ permalink raw reply [flat|nested] 22+ messages in thread
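For dm-crypt itself, Vivek's suggestion amounts to tagging the incoming bio in the map path while the code still runs in the submitting task. A minimal sketch, not the actual change (and whether the association survives the later splitting is exactly the question he leaves open):

static int crypt_map(struct dm_target *ti, struct bio *bio)
{
	/* bio_associate_current() is the existing 3.8-era interface: -EBUSY
	 * means an upper stacked device already owns the bio and -ENOENT
	 * means the task has no io context, so both are ignored here. */
	bio_associate_current(bio);

	/* ... unchanged dm-crypt logic: defer the bio to the encryption
	 * workqueue, whose clones must then inherit this association ... */
	return DM_MAPIO_SUBMITTED;
}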