* dm-crypt parallelization patches @ 2013-09-03 19:06 Mikulas Patocka 2013-09-03 20:07 ` Andi Kleen 2013-09-03 20:49 ` Milan Broz 0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-09-03 19:06 UTC (permalink / raw)
To: Andi Kleen; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon

Hi Andi

You wrote the original dm-crypt parallelization patch (commit c029772125594e31eb1a5ad9e0913724ed9891f2). That patch improves parallelization by encrypting data on the same CPU that submitted the request. I improved on that so that encryption is performed by all processors: it uses an unbound workqueue, so any processor can do the encryption.

My patches improve performance mainly for sequential reads - your original patch parallelized sequential reads badly because most requests were submitted by the same CPU.

I put my patch series at http://people.redhat.com/mpatocka/patches/kernel/dm-crypt-paralelizace/current/series.html

I'd like it if you could look at it and ack it for inclusion in the upstream kernel.

Mikulas

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-09-03 19:06 dm-crypt parallelization patches Mikulas Patocka @ 2013-09-03 20:07 ` Andi Kleen 2013-09-11 23:03 ` Mikulas Patocka 2013-09-03 20:49 ` Milan Broz 1 sibling, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2013-09-03 20:07 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon

On Tue, Sep 03, 2013 at 03:06:03PM -0400, Mikulas Patocka wrote:
> Hi Andi
>
> You wrote the original dm-crypt parallelization patch (commit
> c029772125594e31eb1a5ad9e0913724ed9891f2). That patch improves
> parallelization by encrypting data on the same CPU that
> submitted the request.

The motivation for my patches was to make large systems with many cores scale. I think the motivation for yours is to make small systems without hardware crypto offload slightly faster. That's fine (although I suspect it's more and more obscure), as long as you don't impact the scalability of the large systems.

Can you do some tests on a larger system with very parallel IO (many CPUs submitting) and see if the IO rate is equivalent? If that's a given, the patches are OK with me.

-Andi

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-09-03 20:07 ` Andi Kleen @ 2013-09-11 23:03 ` Mikulas Patocka 2013-09-11 23:33 ` Andi Kleen 2013-09-12 3:51 ` Milan Broz 0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-09-11 23:03 UTC (permalink / raw)
To: Andi Kleen; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon

On Tue, 3 Sep 2013, Andi Kleen wrote:

> On Tue, Sep 03, 2013 at 03:06:03PM -0400, Mikulas Patocka wrote:
> > Hi Andi
> >
> > You wrote the original dm-crypt parallelization patch (commit
> > c029772125594e31eb1a5ad9e0913724ed9891f2). That patch improves
> > parallelization by encrypting data on the same CPU that
> > submitted the request.
>
> The motivation for my patches was to make large systems with many
> cores scale.
>
> I think the motivation for yours is to make small systems without
> hardware crypto offload slightly faster.
>
> That's fine (although I suspect it's more and more obscure), as long as
> you don't impact the scalability of the large systems.
>
> Can you do some tests on a larger system with very parallel IO
> (many CPUs submitting) and see if the IO rate is equivalent?
>
> If that's a given, the patches are OK with me.
>
> -Andi

The patches mainly help with sequential I/O (sequential I/O is normally submitted from a few CPUs, so the existing dm-crypt doesn't parallelize it well). There is a decrease in performance when 12 threads are doing writes on a ramdisk:

Sequential read of two SCSI disks in RAID-0 (throughput):
raw read rate: 262MB/s stdev 2.9
existing dm-crypt: 111MB/s stdev 19.4 (it varies greatly)
new dm-crypt: 254MB/s stdev 1.5

Sequential read of ramdisk (throughput):
raw read rate: 908MB/s stdev 7.7
existing dm-crypt: 133MB/s stdev 0.8
new dm-crypt: 475MB/s stdev 6

fio --rw=randread --bs=4k --direct=1
Random read (4k block) on ramdisk from 12 threads (time in seconds):
raw device: 2.32s stdev 0.02
existing dm-crypt: 17.95s stdev 3.40
new dm-crypt: 15.72s stdev 1.86

fio --rw=randwrite --bs=4k --direct=1
Random write (4k block) on ramdisk from 12 threads (time in seconds):
raw device: 3.91s stdev 0.01
existing dm-crypt: 21.16s stdev 3.67
new dm-crypt: 27.42s stdev 0.27

fio --rw=randrw --bs=4k --direct=1
Random read+write (4k block) on ramdisk from 12 threads (time in seconds):
raw device: 3.09s stdev 0.01
existing dm-crypt: 20.08s stdev 3.48
new dm-crypt: 24.78s stdev 0.13

Tests were done on two six-core Istanbul-class Opterons. I don't have a computer with AES-NI, so I can't test that.

Mikulas

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-09-11 23:03 ` Mikulas Patocka @ 2013-09-11 23:33 ` Andi Kleen 2013-09-12 3:51 ` Milan Broz 1 sibling, 0 replies; 22+ messages in thread From: Andi Kleen @ 2013-09-11 23:33 UTC (permalink / raw) To: Mikulas Patocka; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon > The patches mainly help with sequential I/O (sequential I/O is normally > submitted from a few CPUs, so existing dm-crypt doesn't parallelize > right). There is a decrease of performance when 12 threads are doing > writes on ramdisk: That's a bit worrying. Could you please find a larger system (4S) in RH and try with that? I suspect you may need to limit the load balancing to smaller groups of CPUs. Querying all other CPUs all the time is unlikely to scale. -Andi ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-09-11 23:03 ` Mikulas Patocka 2013-09-11 23:33 ` Andi Kleen @ 2013-09-12 3:51 ` Milan Broz 1 sibling, 0 replies; 22+ messages in thread
From: Milan Broz @ 2013-09-12 3:51 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: Ondrej Kozina, dm-devel, Andi Kleen, Alasdair G. Kergon

On 09/12/2013 01:03 AM, Mikulas Patocka wrote:
>
> Tests were done on two six-core Istanbul-class Opterons.
>
> I don't have a computer with AES-NI, so I can't test that.

AES-NI is available in almost all systems where dmcrypt is currently deployed (typically notebooks)... so I think it must be tested there.

Also, ramdisk tests are a good baseline, but they are not a real-world scenario. I would really like to see the same test with RAID5 (over fast rotational disks) with various stripe sizes, and for SSDs (and RAID0 with SSDs). (These are the problematic scenarios reported in recent years that I know about.)

Can you also show how the sorting patch influences performance? (IOW, how it performs without it.)

Another problematic part was latency (e.g. an SQL database over dmcrypt doesn't perform well). Isn't it even worse with these patches? (I am not sure if anyone has an fio script to simulate such a workload.)

I would like to test it myself but I have no usable fast-enough test systems at all currently. But there are surely some in RH. (Added cc to Ondra, he has some test scripts I used.)

I also remember I tried your patches (a year ago) with limited CPU workqueues (I mean using e.g. only 3 of 6 cores) and it performed better in some tests. This supports Andi's suggestion to limit it to smaller groups of CPUs.

Also, I think we discussed that this should be configurable through optional dmcrypt per-device parameters (e.g. I would like to set up some dmcrypt device to use only one CPU). Think about a backend for virtual machines etc.; it should not eat all available CPU power on the host... Do you plan to add such a patch as well?

Milan

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-09-03 19:06 dm-crypt parallelization patches Mikulas Patocka 2013-09-03 20:07 ` Andi Kleen @ 2013-09-03 20:49 ` Milan Broz 1 sibling, 0 replies; 22+ messages in thread
From: Milan Broz @ 2013-09-03 20:49 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: dm-devel, Andi Kleen, Milan Broz, Alasdair G. Kergon

On 09/03/2013 09:06 PM, Mikulas Patocka wrote:
> Hi Andi
>
> You wrote the original dm-crypt parallelization patch (commit
> c029772125594e31eb1a5ad9e0913724ed9891f2). That patch improves
> parallelization by encrypting data on the same CPU that
> submitted the request.

Hi Mikulas,

Are there any recently reported performance problems with the current dmcrypt code? Please post numbers from your tests, and also include systems using AES-NI.

You know I tested your patchset circa a year ago and the result was not a clear improvement... So if this has changed, it is good news.

Milan

^ permalink raw reply [flat|nested] 22+ messages in thread
* dm-crypt performance @ 2013-03-26 3:47 Mikulas Patocka 2013-03-26 12:27 ` [dm-devel] " Alasdair G Kergon 0 siblings, 1 reply; 22+ messages in thread
From: Mikulas Patocka @ 2013-03-26 3:47 UTC (permalink / raw)
To: Mike Snitzer, Alasdair G. Kergon, dm-devel

Hi

I performed some dm-crypt performance tests as Mike suggested.

It turns out that unbound workqueue performance has improved somewhere between kernel 3.2 (when I made the dm-crypt patches) and 3.8, so the patches for hand-built dispatch are no longer needed.

For RAID-0 composed of two disks with a total throughput of 260MB/s, the unbound workqueue performs as well as the hand-built dispatch (both sustain the 260MB/s transfer rate).

For ramdisk, the unbound workqueue performs better than the hand-built dispatch (620MB/s vs 400MB/s). The unbound workqueue with the patch that Mike suggested (git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git) improves performance slightly on ramdisk compared to 3.8 (700MB/s vs. 620MB/s).

However, there is still the problem with request ordering. Milan found out that under some circumstances parallel dm-crypt has worse performance than the previous dm-crypt code. I found out that this is not caused by deficiencies in the code that distributes work to individual processors. The performance drop is caused by the fact that distributing write bios to multiple processors causes the encryption to finish out of order, and the I/O scheduler is unable to merge these out-of-order bios.

The deadline and noop schedulers perform better (only 50% slowdown compared to old dm-crypt), CFQ performs very badly (8 times slowdown).

If I sort the requests in dm-crypt to come out in the same order as they were received, there is no longer any slowdown; the new crypt performs as well as the old crypt. But the last time I submitted the patches, people objected to sorting requests in dm-crypt, saying that the I/O scheduler should sort them. But it doesn't. This problem still persists in the current kernels.

For best performance we could use the unbound workqueue implementation with request sorting, if people don't object to the request sorting being done in dm-crypt.

Mikulas

^ permalink raw reply [flat|nested] 22+ messages in thread
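(As a rough sketch of what the "unbound workqueue" variant discussed above amounts to: the key difference from a per-CPU queue is the WQ_UNBOUND flag at allocation time. The flag combination, queue name and max_active value below are illustrative assumptions, not the exact code from the patch series.)

/* Illustrative only - not the actual dm-crypt patch. */
#include <linux/workqueue.h>
#include <linux/cpumask.h>
#include <linux/errno.h>

static struct workqueue_struct *crypt_queue;

static int crypt_wq_init(bool unbound)
{
	unsigned int flags = WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM;

	if (unbound)
		flags |= WQ_UNBOUND;	/* work may run on any CPU, not just the submitter's */

	crypt_queue = alloc_workqueue("kcryptd", flags, num_online_cpus());
	return crypt_queue ? 0 : -ENOMEM;
}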
* Re: [dm-devel] dm-crypt performance 2013-03-26 3:47 dm-crypt performance Mikulas Patocka @ 2013-03-26 12:27 ` Alasdair G Kergon 2013-03-26 20:05 ` Milan Broz 0 siblings, 1 reply; 22+ messages in thread From: Alasdair G Kergon @ 2013-03-26 12:27 UTC (permalink / raw) To: Mikulas Patocka Cc: Mike Snitzer, dm-devel, Andi Kleen, dm-crypt, Milan Broz, linux-kernel, Christoph Hellwig, Christian Schmidt [Adding dm-crypt + linux-kernel] On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote: > I performed some dm-crypt performance tests as Mike suggested. > > It turns out that unbound workqueue performance has improved somewhere > between kernel 3.2 (when I made the dm-crypt patches) and 3.8, so the > patches for hand-built dispatch are no longer needed. > > For RAID-0 composed of two disks with total throughput 260MB/s, the > unbound workqueue performs as well as the hand-built dispatch (both > sustain the 260MB/s transfer rate). > > For ramdisk, unbound workqueue performs better than hand-built dispatch > (620MB/s vs 400MB/s). Unbound workqueue with the patch that Mike suggested > (git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git) improves > performance slighlty on ramdisk compared to 3.8 (700MB/s vs. 620MB/s). > > > > However, there is still the problem with request ordering. Milan found out > that under some circumstances parallel dm-crypt has worse performance than > the previous dm-crypt code. I found out that this is not caused by > deficiencies in the code that distributes work to individual processors. > Performance drop is caused by the fact that distributing write bios to > multiple processors causes the encryption to finish out of order and the > I/O scheduler is unable to merge these out-of-order bios. > > The deadline and noop schedulers perform better (only 50% slowdown > compared to old dm-crypt), CFQ performs very badly (8 times slowdown). > > > If I sort the requests in dm-crypt to come out in the same order as they > were received, there is no longer any slowdown, the new crypt performs as > well as the old crypt, but the last time I submitted the patches, people > objected to sorting requests in dm-crypt, saying that the I/O scheduler > should sort them. But it doesn't. This problem still persists in the > current kernels. > > > For best performance we could use the unbound workqueue implementation > with request sorting, if people don't object to the request sorting being > done in dm-crypt. On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote: > FYI, XFS also does it's own request ordering for the metadata buffers, > because it knows the needed ordering and has a bigger view than than > than especially CFQ. You at least have precedence in a widely used > subsystem for this code. So please post this updated version of the patches for a wider group of people to try out. Alasdair ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [dm-devel] dm-crypt performance 2013-03-26 12:27 ` [dm-devel] " Alasdair G Kergon @ 2013-03-26 20:05 ` Milan Broz 2013-03-26 20:28 ` Mike Snitzer 0 siblings, 1 reply; 22+ messages in thread
From: Milan Broz @ 2013-03-26 20:05 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Mike Snitzer, dm-devel, Andi Kleen, dm-crypt, Milan Broz, linux-kernel, Christoph Hellwig, Christian Schmidt

On 26.3.2013 13:27, Alasdair G Kergon wrote:
> [Adding dm-crypt + linux-kernel]

Thanks.

> On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
>> I performed some dm-crypt performance tests as Mike suggested.
>>
>> It turns out that unbound workqueue performance has improved somewhere
>> between kernel 3.2 (when I made the dm-crypt patches) and 3.8, so the
>> patches for hand-built dispatch are no longer needed.
>>
>> For RAID-0 composed of two disks with total throughput 260MB/s, the
>> unbound workqueue performs as well as the hand-built dispatch (both
>> sustain the 260MB/s transfer rate).
>>
>> For ramdisk, unbound workqueue performs better than hand-built dispatch
>> (620MB/s vs 400MB/s). Unbound workqueue with the patch that Mike suggested
>> (git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git) improves
>> performance slightly on ramdisk compared to 3.8 (700MB/s vs. 620MB/s).

I found that ramdisk tests are usually quite misleading for dmcrypt. Better to use some fast SSD, ideally in RAID0 (so you get >500MB/s or so). Also be sure you compare recent machines which use AES-NI. For reference, a null cipher (no crypt, data copy only) works as well, but this is not a real-world scenario.

After introducing Andi's patches, we created a performance regression for people who created "RAID over several dmcrypt devices". (All IOs were processed by one core.) A rare use case, but several people complained.

But most people reported that the current approach works much better (even with a stupid dd test - I think it is because the page cache submits requests from different CPUs, so it in fact runs in parallel). But using dd with direct-io is a trivial way to simulate the "problem". (I guess we all like using dd for performance testing... :-])

>> However, there is still the problem with request ordering. Milan found out
>> that under some circumstances parallel dm-crypt has worse performance than
>> the previous dm-crypt code. I found out that this is not caused by
>> deficiencies in the code that distributes work to individual processors.
>> The performance drop is caused by the fact that distributing write bios to
>> multiple processors causes the encryption to finish out of order and the
>> I/O scheduler is unable to merge these out-of-order bios.

If the IO scheduler is unable to merge these requests because of out-of-order bios, please try to FIX the IO scheduler and do not invent workarounds in dmcrypt. (With recent accelerated crypto this should not happen so often, btw.)

I know it is not easy, but I really do not like having something in the "little walled device-mapper garden" that should be done at a different layer (again).

>> The deadline and noop schedulers perform better (only 50% slowdown
>> compared to old dm-crypt), CFQ performs very badly (8 times slowdown).
>>
>> If I sort the requests in dm-crypt to come out in the same order as they
>> were received, there is no longer any slowdown, the new crypt performs as
>> well as the old crypt, but the last time I submitted the patches, people
>> objected to sorting requests in dm-crypt, saying that the I/O scheduler
>> should sort them. But it doesn't. This problem still persists in the
>> current kernels.

I probably have no vote here anymore, but for the record: I am strictly against any sorting of requests in dmcrypt. My reasons are:

- dmcrypt should be a simple transparent layer (doing one thing - encryption); sorting of requests was always primarily the IO scheduler's domain (which already has well-known knobs to control it)

- Are we sure we are not introducing another side channel in disk encryption? (An unprivileged user can measure timing here.) (Perhaps a stupid reason, but please do not prefer performance to security in encryption. It is enough that we have timing attacks on AES implementations...)

- In my testing (several months ago) the output was very unstable - in some situations it helped, in some it was worse. I no longer have hard data, but some test output was sent to Alasdair.

>> For best performance we could use the unbound workqueue implementation
>> with request sorting, if people don't object to the request sorting being
>> done in dm-crypt.

So again:

- why is the IO scheduler not working properly here? Does it need some extensions? If fixed, it can help even in some other non-dmcrypt IO patterns. (I mean dmcrypt could automagically set some special parameter for the underlying device queue to fine-tune sorting parameters.)

- can we have some cpu-bound workqueue which automatically switches to unbound (relocates work to another CPU) if it detects some saturation watermark etc.? (Again, this could be used in other code. http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html (Yes, I see skepticism there :-)

> On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote:
>> FYI, XFS also does its own request ordering for the metadata buffers,
>> because it knows the needed ordering and has a bigger view than
>> especially CFQ. You at least have precedence in a widely used
>> subsystem for this code.

Nice. But XFS is a much more complex system. Isn't it enough that multipath uses its own IO queue (so we have one IO scheduler on top of another), and now we have metadata IO sorting in XFS on top of it and are planning one more in dmcrypt? Is it really a good approach?

Milan

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt performance 2013-03-26 20:05 ` Milan Broz @ 2013-03-26 20:28 ` Mike Snitzer 2013-03-28 18:53 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Mike Snitzer @ 2013-03-26 20:28 UTC (permalink / raw) To: Milan Broz Cc: Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, tj On Tue, Mar 26 2013 at 4:05pm -0400, Milan Broz <gmazyland@gmail.com> wrote: > >On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote: > > > >>For best performance we could use the unbound workqueue implementation > >>with request sorting, if people don't object to the request sorting being > >>done in dm-crypt. > > So again: > > - why IO scheduler is not working properly here? Do it need some extensions? > If fixed, it can help even is some other non-dmcrypt IO patterns. > (I mean dmcrypt can set some special parameter for underlying device queue > automagically to fine-tune sorting parameters.) Not sure, but IO scheduler changes are fairly slow to materialize given the potential for adverse side-effects. Are you so surprised that a shotgun blast of IOs might make the IO schduler less optimal than if some basic sorting were done at the layer above? > - can we have some cpu-bound workqueue which automatically switch to unbound > (relocates work to another cpu) if it detects some saturation watermark etc? > (Again, this can be used in other code. > http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html > (Yes, I see skepticism there :-) Question for Tejun? (now cc'd). > >On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote: > >>FYI, XFS also does it's own request ordering for the metadata buffers, > >>because it knows the needed ordering and has a bigger view than than > >>than especially CFQ. You at least have precedence in a widely used > >>subsystem for this code. > > Nice. But XFS is much more complex system. > Isn't it enough that multipath uses own IO queue (so we have one IO scheduler > on top of another, and now we have metadata io sorting in XFS on top of it > and planning one more in dmcrypt? Is it really good approach?) Multipath's request_queue is the only one with an active IO scheduler; the requests are dispatched directly to the underlying devices' queues without any IO scheduling. As for dm-crypt; as you know it is bio-based so it is already dealing with out of order IOs (no benefit of upper level IO scheduler). Seems relatively clear, from Mikulas' results, that maybe you're hoping for a bit too much magic from the IO scheduler gnomes that lurk on LKML. BTW, pretty sure btrfs takes care to maintain some IO dispatch ordering too. Mike ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt performance 2013-03-26 20:28 ` Mike Snitzer @ 2013-03-28 18:53 ` Tejun Heo 2013-03-28 19:33 ` Vivek Goyal 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2013-03-28 18:53 UTC (permalink / raw) To: Mike Snitzer Cc: Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Vivek Goyal, Jens Axboe Hello, (cc'ing Vivek and Jens for the iosched related bits) On Tue, Mar 26, 2013 at 04:28:38PM -0400, Mike Snitzer wrote: > On Tue, Mar 26 2013 at 4:05pm -0400, > Milan Broz <gmazyland@gmail.com> wrote: > > > >On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote: > > > > > >>For best performance we could use the unbound workqueue implementation > > >>with request sorting, if people don't object to the request sorting being > > >>done in dm-crypt. > > > > So again: > > > > - why IO scheduler is not working properly here? Do it need some extensions? > > If fixed, it can help even is some other non-dmcrypt IO patterns. > > (I mean dmcrypt can set some special parameter for underlying device queue > > automagically to fine-tune sorting parameters.) > > Not sure, but IO scheduler changes are fairly slow to materialize given > the potential for adverse side-effects. Are you so surprised that a > shotgun blast of IOs might make the IO schduler less optimal than if > some basic sorting were done at the layer above? My memory is already pretty hazy but Vivek should be able to correct me if I say something nonsense. The thing is, the order and timings of IOs coming down from upper layers has certain meanings to ioscheds and they exploit those patterns to do better scheduling. Reordering IOs randomly actually makes certain information about the IO stream lost and makes ioscheds mis-classify the IO stream - e.g. what could have been classfied as "mostly consecutive streaming IO" could after such reordering fail to be detected as such. Sure, ioscheds can probably be improved to compensate for such temporary localized reorderings but nothing is free and given that most of the upper stacks already do pretty good job of issuing IOs orderly when possible, it would be a bit silly to do more than usually necessary in ioscheds. So, no, I don't think maintaining IO order in stacking drivers is a bad idea. I actually think all stacking drivers should do that; otherwise, they really are destroying actual useful side-band information. > > - can we have some cpu-bound workqueue which automatically switch to unbound > > (relocates work to another cpu) if it detects some saturation watermark etc? > > (Again, this can be used in other code. > > http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html > > (Yes, I see skepticism there :-) > > Question for Tejun? (now cc'd). Unbound workqueues went through quite a bit of improvements lately and are currently growing NUMA affinity support. Once merged, all unbound work items issued on a NUMA node will be processed in the same NUMA node, which should mitigate some, unfortunately not all, of the disadvantages compared to per-cpu ones. Mikulas, can you share more about your test setup? Was it a NUMA machine? Which wq branch did you use? The NUMA affinity support would have less severe but similar issue as per-cpu. If all IOs are being issued from one node while other nodes are idle, that specific node can get saturated. NUMA affinity support is adjusted both from inside kernel and userland via sysfs, so there are control knobs for corner cases. 
As for maintaining CPU or NUMA affinity until the CPU / node is saturated and spilling to other CPUs/nodes beyond that, yeah, an interesting idea. It's non-trivial and would have to incorporate a lot of notions on "load" similar to the scheduler. It really becomes a generic load balancing problem as it'd be pointless and actually harmful to, say, spill work items to each other between two saturated NUMA nodes. So, if the brunt of scattering workload across random CPUs can be avoided by NUMA affinity, that could be a reasonable tradeoff, I think. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt performance 2013-03-28 18:53 ` Tejun Heo @ 2013-03-28 19:33 ` Vivek Goyal 2013-03-28 19:44 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Vivek Goyal @ 2013-03-28 19:33 UTC (permalink / raw) To: Tejun Heo Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe On Thu, Mar 28, 2013 at 11:53:27AM -0700, Tejun Heo wrote: > Hello, > > (cc'ing Vivek and Jens for the iosched related bits) > > On Tue, Mar 26, 2013 at 04:28:38PM -0400, Mike Snitzer wrote: > > On Tue, Mar 26 2013 at 4:05pm -0400, > > Milan Broz <gmazyland@gmail.com> wrote: > > > > > >On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote: > > > > > > > >>For best performance we could use the unbound workqueue implementation > > > >>with request sorting, if people don't object to the request sorting being > > > >>done in dm-crypt. > > > > > > So again: > > > > > > - why IO scheduler is not working properly here? Do it need some extensions? > > > If fixed, it can help even is some other non-dmcrypt IO patterns. > > > (I mean dmcrypt can set some special parameter for underlying device queue > > > automagically to fine-tune sorting parameters.) > > > > Not sure, but IO scheduler changes are fairly slow to materialize given > > the potential for adverse side-effects. Are you so surprised that a > > shotgun blast of IOs might make the IO schduler less optimal than if > > some basic sorting were done at the layer above? > > My memory is already pretty hazy but Vivek should be able to correct > me if I say something nonsense. The thing is, the order and timings > of IOs coming down from upper layers has certain meanings to ioscheds > and they exploit those patterns to do better scheduling. > > Reordering IOs randomly actually makes certain information about the > IO stream lost and makes ioscheds mis-classify the IO stream - > e.g. what could have been classfied as "mostly consecutive streaming > IO" could after such reordering fail to be detected as such. Sure, > ioscheds can probably be improved to compensate for such temporary > localized reorderings but nothing is free and given that most of the > upper stacks already do pretty good job of issuing IOs orderly when > possible, it would be a bit silly to do more than usually necessary in > ioscheds. > > So, no, I don't think maintaining IO order in stacking drivers is a > bad idea. I actually think all stacking drivers should do that; > otherwise, they really are destroying actual useful side-band > information. I am curious why out of order bio is a problem. Doesn't the elevator already merge bio's with existing requests and if merging does not happen then requests are sorted in order. So why ordering is not happening properly with dm-crypt. What additional info dm-crypt has that it can do better ordering than IO scheduler. CFQ might seeing more performance hit because we maintain per process queues and kernel threads might not be sharing the IO context (i am not sure). So if all the crypto threads can share the IO context, atleast it will make sure all IO from them goes into a single queue. So it would help if somebody can explain that why existing merging and sorting logic is not working well with dm-crypt and what additional info dm-crypt has which can help it do better job. Thanks Vivek ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt performance 2013-03-28 19:33 ` Vivek Goyal @ 2013-03-28 19:44 ` Tejun Heo 2013-03-28 20:38 ` Vivek Goyal 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2013-03-28 19:44 UTC (permalink / raw) To: Vivek Goyal Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe Hello, On Thu, Mar 28, 2013 at 03:33:43PM -0400, Vivek Goyal wrote: > I am curious why out of order bio is a problem. Doesn't the elevator > already merge bio's with existing requests and if merging does not > happen then requests are sorted in order. So why ordering is not > happening properly with dm-crypt. What additional info dm-crypt has > that it can do better ordering than IO scheduler. Hmmm... well, for one, it doesn't only change ordering. It also changes the timings. Before iosched would get contiguous stream of IOs when the queue gets unplugged (BTW, how does dm crypt handling plugging? If not handled properly, it could definitely affect a lot of things.) With multiple threads doing encryption in the middle, the iosched could get scattered IOs which could easily span multiple millisecs. Even if context tagging was done properly, it could easily lead to much less efficient IO patterns to hardware. Keeping IO order combined with proper plug handling would not only keep the ordering constant but also the relative timing of events, which is an important factor when scheduling IOs. > CFQ might seeing more performance hit because we maintain per > process queues and kernel threads might not be sharing the IO context > (i am not sure). So if all the crypto threads can share the IO > context, atleast it will make sure all IO from them goes into a > single queue. Right, this is important too although I fail to see how workqueue vs. custom dispatch would make any difference here. dm-crypt should definitely be using bio_associate_current(). Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt performance 2013-03-28 19:44 ` Tejun Heo @ 2013-03-28 20:38 ` Vivek Goyal 2013-03-28 20:45 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2013-03-28 20:38 UTC (permalink / raw)
To: Tejun Heo
Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe

On Thu, Mar 28, 2013 at 12:44:43PM -0700, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 28, 2013 at 03:33:43PM -0400, Vivek Goyal wrote:
> > I am curious why out of order bio is a problem. Doesn't the elevator
> > already merge bio's with existing requests and if merging does not
> > happen then requests are sorted in order. So why ordering is not
> > happening properly with dm-crypt. What additional info dm-crypt has
> > that it can do better ordering than IO scheduler.
>
> Hmmm... well, for one, it doesn't only change ordering. It also
> changes the timings. Before iosched would get contiguous stream of
> IOs when the queue gets unplugged (BTW, how does dm crypt handling
> plugging? If not handled properly, it could definitely affect a lot
> of things.) With multiple threads doing encryption in the middle, the
> iosched could get scattered IOs which could easily span multiple
> millisecs. Even if context tagging was done properly, it could easily
> lead to much less efficient IO patterns to hardware.
>
> Keeping IO order combined with proper plug handling would not only
> keep the ordering constant but also the relative timing of events,
> which is an important factor when scheduling IOs.

If timing of unordered IO is an issue, then dm-crypt can try to batch IO submission using blk_start_plug()/blk_finish_plug(). That way dm-crypt can batch bios and control submission, and there should not be a need to put specific ordering logic in dm-crypt. So if there are multiple threads doing crypto, they either end up submitting the bio after the crypto operation, or they queue it somewhere and there is a single submitter thread.

If completion of crypto is an issue, then I think it is very hard to determine whether extra waiting helps with throughput or hurts. If dm-crypt can decide that somehow, then I guess they can just try to do batch submission of IO from the various crypto threads and see if it helps with performance. (At some point, the submitter thread will become a bottleneck.)

> > CFQ might seeing more performance hit because we maintain per
> > process queues and kernel threads might not be sharing the IO context
> > (i am not sure). So if all the crypto threads can share the IO
> > context, atleast it will make sure all IO from them goes into a
> > single queue.
>
> Right, this is important too although I fail to see how workqueue
> vs. custom dispatch would make any difference here. dm-crypt should
> definitely be using bio_associate_current().

Agreed. bio_associate_current() will at least help keep all bios of a single context in a single queue and promote more merging (if submission happens in the right time frame).

Thanks
Vivek

^ permalink raw reply [flat|nested] 22+ messages in thread
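(A minimal sketch of the batching Vivek suggests, assuming the already-encrypted bios are handed to one submitter thread; the helper name and the use of a bio_list are assumptions for illustration, not code taken from dm-crypt.)

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Drain a list of already-encrypted bios inside one plug so the block
 * layer sees the whole batch at once and can merge/sort it. */
static void submit_encrypted_batch(struct bio_list *done)
{
	struct blk_plug plug;
	struct bio *bio;

	blk_start_plug(&plug);
	while ((bio = bio_list_pop(done)))
		generic_make_request(bio);
	blk_finish_plug(&plug);
}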
* Re: dm-crypt performance 2013-03-28 20:38 ` Vivek Goyal @ 2013-03-28 20:45 ` Tejun Heo 2013-04-09 17:51 ` dm-crypt parallelization patches Mikulas Patocka 0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2013-03-28 20:45 UTC (permalink / raw)
To: Vivek Goyal
Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe

Hello,

On Thu, Mar 28, 2013 at 04:38:08PM -0400, Vivek Goyal wrote:
> If timing of unordered IO is an issue, then dm-crypt can try
> to batch IO submission using blk_start_plug()/blk_finish_plug(). That way
> dm-crypt can batch bios and control submission and there should not
> be a need to put specific ordering logic in dm-crypt.

Yes, it has to preserve and propagate the plugging boundaries, and if you think about the implementation, maintaining issue order doesn't really require a "sort" per se. Just keep the bios that have been received but are still going through encryption on a list in the received order, with a counter of in-progress bios in the plugging boundary. Link the outputs to the source bios somehow, and when the counter hits zero, issue them in the same order.

While keeping the specific order itself might not be essential, it's not gonna add any significant complexity or runtime overhead, and I think it is generally a good idea for stacking drivers to preserve as much information and context as possible.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 22+ messages in thread
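(A sketch of the order-preserving scheme described above: bios are kept in received order, a counter tracks in-flight encryption, and the last completion issues the whole batch. The structure and names are illustrative assumptions; the counter is primed to 1 and only dropped when the plugging boundary closes, so a batch is never issued before it is complete.)

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/spinlock.h>
#include <linux/atomic.h>

struct ordered_batch {
	spinlock_t lock;
	struct bio_list pending;	/* received order is preserved here */
	atomic_t in_flight;		/* outstanding encryptions + 1 for the open boundary */
};

static void batch_init(struct ordered_batch *b)
{
	spin_lock_init(&b->lock);
	bio_list_init(&b->pending);
	atomic_set(&b->in_flight, 1);	/* dropped by batch_put() when the boundary closes */
}

static void batch_add(struct ordered_batch *b, struct bio *bio)
{
	spin_lock(&b->lock);
	bio_list_add(&b->pending, bio);
	spin_unlock(&b->lock);
	atomic_inc(&b->in_flight);
}

/* Called from each encryption completion and once when the plug boundary closes. */
static void batch_put(struct ordered_batch *b)
{
	struct bio_list ready;
	struct bio *bio;

	if (!atomic_dec_and_test(&b->in_flight))
		return;

	bio_list_init(&ready);
	spin_lock(&b->lock);
	bio_list_merge(&ready, &b->pending);
	bio_list_init(&b->pending);
	spin_unlock(&b->lock);

	/* the last reference issues the whole batch, still in received order */
	while ((bio = bio_list_pop(&ready)))
		generic_make_request(bio);
}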
* dm-crypt parallelization patches 2013-03-28 20:45 ` Tejun Heo @ 2013-04-09 17:51 ` dm-crypt parallelization patches Mikulas Patocka 2013-04-09 17:57 ` Tejun Heo 2013-04-09 18:36 ` Vivek Goyal 0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-04-09 17:51 UTC (permalink / raw)
To: Jens Axboe
Cc: Vivek Goyal, Tejun Heo, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

Hi

I placed the dm-crypt parallelization patches at:
http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/

The patches parallelize dm-crypt and make it possible to use all processor cores.

The patch dm-crypt-remove-percpu.patch removes some percpu variables and replaces them with per-request variables.

The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the encryption workqueue, allowing the encryption to be distributed to all CPUs in the system.

The patch dm-crypt-offload-writes-to-thread.patch moves submission of all write requests to a single thread.

The patch dm-crypt-sort-requests.patch sorts write requests submitted by a single thread. The requests are sorted according to the sector number; an rb-tree is used for efficient sorting.

Some usage notes:

* turn off automatic cpu frequency scaling (or set it to the "performance" governor) - cpufreq doesn't recognize the encryption workload correctly; sometimes it underclocks all the CPU cores when there is some encryption work to do, resulting in bad performance

* when using a filesystem on an encrypted dm-crypt device, reduce the maximum request size with "/sys/block/dm-2/queue/max_sectors_kb" (substitute "dm-2" with the real name of your dm-crypt device). Note that too-big requests mean that there is only a small number of requests and they cannot be distributed to all available processors in parallel - it results in worse performance. Too-small requests result in high request overhead and also reduced performance. So you must find the optimal request size for your system and workload. For me, when testing this on ramdisk, the optimum is 8KiB.

---

Now, the problem with the I/O scheduler: when doing performance testing, it turns out that the parallel version is sometimes worse than the previous implementation. When I create a 4.3GiB dm-crypt device on top of dm-loop on top of an ext2 filesystem on a 15k SCSI disk and run this command

time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11 --name=job12

the results are these:

CFQ scheduler:
--------------
no patches: 21.9s
patch 1: 21.7s
patches 1,2: 2:33s
patches 1,2 (+ nr_requests = 1280000): 2:18s
patches 1,2,3: 20.7s
patches 1,2,3,4: 20.7s

deadline scheduler:
-------------------
no patches: 27.4s
patch 1: 27.4s
patches 1,2: 27.8s
patches 1,2,3: 29.6s
patches 1,2,3,4: 29.6s

We can see that CFQ performs badly with patch 2, but improves with patch 3. All that patch 3 does is move write requests from the encryption threads to a separate thread. So it seems that CFQ has a deficiency: it cannot merge adjacent requests issued by different processes.
The problem is this:
- we have a 256k direct-I/O write request
- it is broken into 4k bios (because we run on dm-loop on a filesystem with 4k block size)
- encryption of these 4k bios is distributed to 12 processes on a 12-core machine
- encryption finishes out of order and in different processes, and 4k bios with encrypted data are submitted to CFQ
- CFQ doesn't merge them
- the disk is flooded with random 4k write requests and performs much worse than with 256k requests

Increasing nr_requests to 1280000 helps a little, but not much - it is still an order of magnitude slower.

I'd like to ask if someone who knows the CFQ scheduler (Jens?) could look at it and find out why it doesn't merge requests from different processes. Why do I have to do a seemingly senseless operation (hand over write requests to a separate thread) in patch 3 to improve performance?

Mikulas

^ permalink raw reply [flat|nested] 22+ messages in thread
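(For illustration, the kind of sector-ordered rb-tree that dm-crypt-sort-requests.patch is described as using might look like the sketch below. The wrapper struct and function names are assumptions; the real patch keys the tree off its per-request structure rather than a standalone wrapper.)

#include <linux/rbtree.h>
#include <linux/bio.h>

struct sorted_write {
	struct rb_node node;
	struct bio *bio;
};

static struct rb_root write_tree = RB_ROOT;

/* insert a write keyed by its starting sector */
static void write_tree_insert(struct sorted_write *w)
{
	struct rb_node **p = &write_tree.rb_node, *parent = NULL;
	sector_t sector = w->bio->bi_sector;	/* bi_iter.bi_sector on newer kernels */

	while (*p) {
		struct sorted_write *s = rb_entry(*p, struct sorted_write, node);

		parent = *p;
		p = sector < s->bio->bi_sector ? &(*p)->rb_left : &(*p)->rb_right;
	}
	rb_link_node(&w->node, parent, p);
	rb_insert_color(&w->node, &write_tree);
}

/* the single write-offload thread drains the tree in ascending sector order */
static struct sorted_write *write_tree_pop(void)
{
	struct rb_node *first = rb_first(&write_tree);

	if (!first)
		return NULL;
	rb_erase(first, &write_tree);
	return rb_entry(first, struct sorted_write, node);
}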
* Re: dm-crypt parallelization patches 2013-04-09 17:51 ` dm-crypt parallelization patches Mikulas Patocka @ 2013-04-09 17:57 ` Tejun Heo 2013-04-09 18:08 ` Mikulas Patocka 2013-04-09 18:36 ` Vivek Goyal 1 sibling, 1 reply; 22+ messages in thread From: Tejun Heo @ 2013-04-09 17:57 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote: > The patch dm-crypt-sort-requests.patch sorts write requests submitted by a > single thread. The requests are sorted according to the sector number, > rb-tree is used for efficient sorting. Hmmm? Why not just keep the issuing order along with plugging boundaries? > So it seems that CFQ has some deficiency that it cannot merge adjacent > requests done by different processes. As I wrote before, please use bio_associate_current(). Currently, dm-crypt is completely messing up all the context information that cfq depends on to schedule IOs. Of course, it doesn't perform well. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
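(For reference, the association Tejun asks for is a single call made while still in the submitting task's context, before the bio is handed to the encryption workqueue. This is only a sketch; note that bio_associate_current() is only compiled in with CONFIG_BLK_CGROUP - otherwise it is a stub returning -ENOENT, as noted below.)

#include <linux/bio.h>

static void crypt_tag_bio(struct bio *bio)
{
	/* records the submitter's io_context and blkcg on the bio */
	bio_associate_current(bio);
	/* ...then hand the bio off to the encryption workqueue as before... */
}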
* Re: dm-crypt parallelization patches 2013-04-09 17:57 ` Tejun Heo @ 2013-04-09 18:08 ` Mikulas Patocka 2013-04-09 18:10 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Mikulas Patocka @ 2013-04-09 18:08 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, 9 Apr 2013, Tejun Heo wrote: > On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote: > > The patch dm-crypt-sort-requests.patch sorts write requests submitted by a > > single thread. The requests are sorted according to the sector number, > > rb-tree is used for efficient sorting. > > Hmmm? Why not just keep the issuing order along with plugging > boundaries? What do you mean? I used to have a patch that keeps order of requests as they were introduced, but sorting the requests according to sector number is a bit simpler. > > So it seems that CFQ has some deficiency that it cannot merge adjacent > > requests done by different processes. > > As I wrote before, please use bio_associate_current(). Currently, > dm-crypt is completely messing up all the context information that cfq > depends on to schedule IOs. Of course, it doesn't perform well. bio_associate_current() is only valid on a system with cgroups and there are no cgroups on the kernel where I tested it. It is an empty function: static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } Mikulas > Thanks. > > -- > tejun > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:08 ` Mikulas Patocka @ 2013-04-09 18:10 ` Tejun Heo 2013-04-09 18:42 ` Vivek Goyal 2013-04-09 19:42 ` Mikulas Patocka 0 siblings, 2 replies; 22+ messages in thread From: Tejun Heo @ 2013-04-09 18:10 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt Hey, On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote: > > Hmmm? Why not just keep the issuing order along with plugging > > boundaries? > > What do you mean? > > I used to have a patch that keeps order of requests as they were > introduced, but sorting the requests according to sector number is a bit > simpler. You're still destroying the context information. Please just keep the issuing order along with plugging boundaries. > > As I wrote before, please use bio_associate_current(). Currently, > > dm-crypt is completely messing up all the context information that cfq > > depends on to schedule IOs. Of course, it doesn't perform well. > > bio_associate_current() is only valid on a system with cgroups and there > are no cgroups on the kernel where I tested it. It is an empty function: > > static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } Yeah, because blkcg was the only user. Please feel free to drop the ifdefs. It covers both iocontext and cgroup association. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:10 ` Tejun Heo @ 2013-04-09 18:42 ` Vivek Goyal 2013-04-09 18:57 ` Tejun Heo 2013-04-09 19:42 ` Mikulas Patocka 1 sibling, 1 reply; 22+ messages in thread From: Vivek Goyal @ 2013-04-09 18:42 UTC (permalink / raw) To: Tejun Heo Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 11:10:31AM -0700, Tejun Heo wrote: > Hey, > > On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote: > > > Hmmm? Why not just keep the issuing order along with plugging > > > boundaries? > > > > What do you mean? > > > > I used to have a patch that keeps order of requests as they were > > introduced, but sorting the requests according to sector number is a bit > > simpler. > > You're still destroying the context information. Please just keep the > issuing order along with plugging boundaries. I guess plugging boundary is more important than issuing order as block layer should take care of mering the bio and put in right order (attempt_plug_merge()). But to make use of plugging boundary, one would probably still need submission using single thread. And if one is using single thread for submission, one will still get good performance (even if you are not using bio_associate_current()), as by default all bio will go to submitting thread's context. Thanks Vivek ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:42 ` Vivek Goyal @ 2013-04-09 18:57 ` Tejun Heo 2013-04-09 19:13 ` Vivek Goyal 0 siblings, 1 reply; 22+ messages in thread From: Tejun Heo @ 2013-04-09 18:57 UTC (permalink / raw) To: Vivek Goyal Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt Hello, On Tue, Apr 09, 2013 at 02:42:48PM -0400, Vivek Goyal wrote: > I guess plugging boundary is more important than issuing order as > block layer should take care of mering the bio and put in right > order (attempt_plug_merge()). Yeah, the exact order probably doesn't affect things too much but it's just a nice design principle to follow - if you're gonna step in in the middle and meddle with requests, preserve as much context as reasonably possible, and it's not like preserving that order is difficult. > But to make use of plugging boundary, one would probably still need > submission using single thread. It doesn't have to a specific task. Whoever finishes the last bio / segment / whatever in the plugging domain can issue all of them. I probably am missing details but the overall mechanism can be pretty simple. Just keep the bios from the same plugging domain in the received order along with an atomic counter and issue them all when the counter hits zero. No need to fiddle with sorting or whatever. > And if one is using single thread for submission, one will still get > good performance (even if you are not using bio_associate_current()), as > by default all bio will go to submitting thread's context. And destroy all per-ioc and cgroup logics in block layer in the process. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:57 ` Tejun Heo @ 2013-04-09 19:13 ` Vivek Goyal 0 siblings, 0 replies; 22+ messages in thread
From: Vivek Goyal @ 2013-04-09 19:13 UTC (permalink / raw)
To: Tejun Heo
Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

On Tue, Apr 09, 2013 at 11:57:21AM -0700, Tejun Heo wrote:

[..]
> And destroy all per-ioc and cgroup logics in block layer in the
> process.

Oh, I am in no way suggesting that we don't use bio_associate_current(). I am just trying to analyze the performance issue right now and saying that, as far as performance is concerned, one will get it back even if one does not use bio_associate_current().

Yes, but one needs to use bio_associate_current() to make sure bios are attributed to the right cgroup and associated with the right task. This should help solve the long-standing issue of a task losing its ioprio if a dm-crypt target is in the stack.

Thanks
Vivek

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:10 ` Tejun Heo 2013-04-09 18:42 ` Vivek Goyal @ 2013-04-09 19:42 ` Mikulas Patocka 2013-04-09 19:52 ` Tejun Heo 1 sibling, 1 reply; 22+ messages in thread From: Mikulas Patocka @ 2013-04-09 19:42 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, 9 Apr 2013, Tejun Heo wrote: > Hey, > > On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote: > > > Hmmm? Why not just keep the issuing order along with plugging > > > boundaries? > > > > What do you mean? > > > > I used to have a patch that keeps order of requests as they were > > introduced, but sorting the requests according to sector number is a bit > > simpler. > > You're still destroying the context information. Please just keep the > issuing order along with plugging boundaries. > > > > As I wrote before, please use bio_associate_current(). Currently, > > > dm-crypt is completely messing up all the context information that cfq > > > depends on to schedule IOs. Of course, it doesn't perform well. > > > > bio_associate_current() is only valid on a system with cgroups and there > > are no cgroups on the kernel where I tested it. It is an empty function: > > > > static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } > > Yeah, because blkcg was the only user. Please feel free to drop the > ifdefs. It covers both iocontext and cgroup association. > > Thanks. If I drop ifdefs, it doesn't compile (because other cgroup stuff it missing). So I enabled bio cgroups. bio_associate_current can't be used, because by the time we allocate the outgoing write bio, we are no longer in the process that submitted the original bio. Anyway, I tried to reproduce in dm-crypt what bio_associate_current does - in the submitting process I record "ioc" and "css" fields in "dm_crypt_io" structure and set these fields on all outgoing bios. It has no effect on performance, it is as bad as if I hadn't done it. 
Mikulas --- (this is the patch that I used, to be applied after dm-crypt-unbound-workqueue.patch) --- drivers/md/dm-crypt.c | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) Index: linux-3.8.6-fast/drivers/md/dm-crypt.c =================================================================== --- linux-3.8.6-fast.orig/drivers/md/dm-crypt.c 2013-04-09 20:32:41.000000000 +0200 +++ linux-3.8.6-fast/drivers/md/dm-crypt.c 2013-04-09 21:29:12.000000000 +0200 @@ -20,6 +20,7 @@ #include <linux/backing-dev.h> #include <linux/atomic.h> #include <linux/scatterlist.h> +#include <linux/cgroup.h> #include <asm/page.h> #include <asm/unaligned.h> #include <crypto/hash.h> @@ -60,6 +61,9 @@ struct dm_crypt_io { int error; sector_t sector; struct dm_crypt_io *base_io; + + struct io_context *ioc; + struct cgroup_subsys_state *css; }; struct dm_crypt_request { @@ -797,6 +801,14 @@ static struct bio *crypt_alloc_buffer(st if (!clone) return NULL; + if (unlikely(io->base_io != NULL)) { + clone->bi_ioc = io->base_io->ioc; + clone->bi_css = io->base_io->css; + } else { + clone->bi_ioc = io->ioc; + clone->bi_css = io->css; + } + clone_init(io, clone); *out_of_pages = 0; @@ -859,6 +871,9 @@ static struct dm_crypt_io *crypt_io_allo io->ctx.req = NULL; atomic_set(&io->io_pending, 0); + io->ioc = NULL; + io->css = NULL; + return io; } @@ -884,6 +899,14 @@ static void crypt_dec_pending(struct dm_ if (io->ctx.req) mempool_free(io->ctx.req, cc->req_pool); + + if (io->ioc) { + put_io_context(io->ioc); + } + if (io->css) { + css_put(io->css); + } + mempool_free(io, cc->io_pool); if (likely(!base_io)) @@ -927,6 +950,9 @@ static void crypt_endio(struct bio *clon if (rw == WRITE) crypt_free_buffer_pages(cc, clone); + clone->bi_ioc = NULL; + clone->bi_css = NULL; + bio_put(clone); if (rw == READ && !error) { @@ -1658,6 +1684,21 @@ static int crypt_map(struct dm_target *t io = crypt_io_alloc(cc, bio, dm_target_offset(ti, bio->bi_sector)); + if (current->io_context) { + struct io_context *ioc = current->io_context; + struct cgroup_subsys_state *css; + + get_io_context_active(ioc); + io->ioc = ioc; + + /* associate blkcg if exists */ + rcu_read_lock(); + css = task_subsys_state(current, blkio_subsys_id); + if (css && css_tryget(css)) + io->css = css; + rcu_read_unlock(); + } + if (bio_data_dir(io->base_bio) == READ) { if (kcryptd_io_read(io, GFP_NOWAIT)) kcryptd_queue_io(io); ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 19:42 ` Mikulas Patocka @ 2013-04-09 19:52 ` Tejun Heo 2013-04-09 20:32 ` Mikulas Patocka 0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2013-04-09 19:52 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

On Tue, Apr 09, 2013 at 03:42:16PM -0400, Mikulas Patocka wrote:
> If I drop ifdefs, it doesn't compile (because other cgroup stuff it
> missing).
>
> So I enabled bio cgroups.
>
> bio_associate_current can't be used, because by the time we allocate the
> outgoing write bio, we are no longer in the process that submitted the
> original bio.

Oh, I suppose it'd need some massaging to selectively turn off the cgroup part.

> Anyway, I tried to reproduce in dm-crypt what bio_associate_current does -

and we probably need to change that to bio_associate_task().

> in the submitting process I record "ioc" and "css" fields in "dm_crypt_io"
> structure and set these fields on all outgoing bios. It has no effect on
> performance, it is as bad as if I hadn't done it.

A good way to verify that the tagging is correct would be to configure io limits in the block cgroup and see whether the limits are correctly applied when going through dm-crypt (please test with direct-io or reads, writeback is horribly broken, sorry). If the tagging is working correctly, maybe plugging is the overriding factor?

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 19:52 ` Tejun Heo @ 2013-04-09 20:32 ` Mikulas Patocka 2013-04-09 21:02 ` Tejun Heo 2013-04-09 21:07 ` Vivek Goyal 0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-04-09 20:32 UTC (permalink / raw)
To: Tejun Heo
Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

On Tue, 9 Apr 2013, Tejun Heo wrote:

> On Tue, Apr 09, 2013 at 03:42:16PM -0400, Mikulas Patocka wrote:
> > If I drop ifdefs, it doesn't compile (because other cgroup stuff it
> > missing).
> >
> > So I enabled bio cgroups.
> >
> > bio_associate_current can't be used, because by the time we allocate the
> > outgoing write bio, we are no longer in the process that submitted the
> > original bio.
>
> Oh, I suppose it'd need some massaging to selectively turn off the
> cgroup part.
>
> > Anyway, I tried to reproduce in dm-crypt what bio_associate_current does -
>
> and we probably need to change that to bio_associate_task().

Generally, we shouldn't associate bios with the "current" task in device mapper targets. For example, suppose that we have two stacked dm-crypt targets: the "current" process pointer in the lower dm-crypt target's request function always points to the workqueue worker of the upper dm-crypt target that submits the bios. So if we associate the bio with "current" in the lower target, we are associating it with a preallocated workqueue worker, and we have already lost the information about who submitted it.

You should associate a bio with a task when you create the bio, and the "md" and "dm" midlayers should just forward this association to lower-layer bios.

> > in the submitting process I record "ioc" and "css" fields in "dm_crypt_io"
> > structure and set these fields on all outgoing bios. It has no effect on
> > performance, it is as bad as if I hadn't done it.
>
> A good way to verify that the tagging is correct would be to configure
> io limits in the block cgroup and see whether the limits are correctly
> applied when going through dm-crypt (please test with direct-io or
> reads, writeback is horribly broken, sorry). If the tagging is working
> correctly, maybe plugging is the overriding factor?
>
> Thanks.

It doesn't work because device mapper on the underlying layers ignores bi_ioc and bi_css.

If I make device mapper forward bi_ioc and bi_css to the outgoing bios, it improves performance (from 2:30 to 1:30), but it is still far from perfect.

Mikulas

---

dm: forward cgroup context

This patch makes dm forward the associated cgroup context to cloned bios.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm.c |    9 +++++++++
 fs/bio.c        |    2 ++
 2 files changed, 11 insertions(+)

Index: linux-3.8.6-fast/drivers/md/dm.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm.c	2013-04-09 22:00:36.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm.c	2013-04-09 22:19:40.000000000 +0200
@@ -453,6 +453,10 @@ static void free_io(struct mapped_device
 
 static void free_tio(struct mapped_device *md, struct dm_target_io *tio)
 {
+#ifdef CONFIG_BLK_CGROUP
+	tio->clone.bi_ioc = NULL;
+	tio->clone.bi_css = NULL;
+#endif
 	bio_put(&tio->clone);
 }
 
@@ -1124,6 +1128,11 @@ static struct dm_target_io *alloc_tio(st
 	clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs);
 	tio = container_of(clone, struct dm_target_io, clone);
 
+#ifdef CONFIG_BLK_CGROUP
+	tio->clone.bi_ioc = ci->bio->bi_ioc;
+	tio->clone.bi_css = ci->bio->bi_css;
+#endif
+
 	tio->io = ci->io;
 	tio->ti = ti;
 	memset(&tio->info, 0, sizeof(tio->info));

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 20:32 ` Mikulas Patocka @ 2013-04-09 21:02 ` Tejun Heo 2013-04-09 21:03 ` Tejun Heo 2013-04-09 21:07 ` Vivek Goyal 1 sibling, 1 reply; 22+ messages in thread From: Tejun Heo @ 2013-04-09 21:02 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt Hey, On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote: > > and we probably need to change that to bio_associate_task(). > > Generally, we shouldn't associate bios with "current" task in device > mapper targets. For example suppose that we have two stacked dm-crypt > targets: It only follows the first association so it doesn't matter how many layers it goes through. That said, yeah, there could be situations where @task is available but the bio's already in the hands of a different task. If that's the case, change it to associate_task(@task). > It doesn't work because device mapper on the underlying layers ignores > bi_ioc and bi_css. > > If I make device mapper forward bi_ioc and bi_css to outgoing bios, it > improves performance (from 2:30 to 1:30), but it is still far from > perfect. For testing, copying bi_ioc and bi_css directly is fine but please add another interface to copy those for the actual code. Say, bio_copy_association(@to_bio, @from_bio) or whatever. As for the performance loss, I'm somewhat confident in saying the remaining difference would be from ignoring plugging boundaries. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 21:02 ` Tejun Heo @ 2013-04-09 21:03 ` Tejun Heo 0 siblings, 0 replies; 22+ messages in thread From: Tejun Heo @ 2013-04-09 21:03 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 02:02:01PM -0700, Tejun Heo wrote: > For testing, copying bi_ioc and bi_css directly is fine but please add > another interface to copy those for the actual code. Say, > bio_copy_association(@to_bio, @from_bio) or whatever. Another and probably better possibility is just remembering the issuing task (you would of course need to hold an extra ref as long as you wanna use it) and use bio_associate_task() on it when creating new bios. Thanks. -- tejun ^ permalink raw reply [flat|nested] 22+ messages in thread
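bio_associate_task() is only being proposed in this thread, not an interface that existed in the tree at the time, but Tejun's "remember the issuing task" idea would look roughly like the sketch below. The submitter field and helper names are invented for illustration; get_task_struct()/put_task_struct() are the real refcounting calls that keep the remembered task valid.

/* Sketch of the proposed approach, assuming a bio_associate_task(bio, task)
 * interface that does for an arbitrary task what bio_associate_current()
 * does for "current". */
static void crypt_io_remember_submitter(struct dm_crypt_io *io)
{
	io->submitter = current;
	get_task_struct(io->submitter);		/* hold the task as long as we use it */
}

static void crypt_clone_associate(struct dm_crypt_io *io, struct bio *clone)
{
	bio_associate_task(clone, io->submitter);	/* proposed interface */
}

static void crypt_io_release_submitter(struct dm_crypt_io *io)
{
	put_task_struct(io->submitter);		/* drop the extra reference */
	io->submitter = NULL;
}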
* Re: dm-crypt parallelization patches 2013-04-09 20:32 ` Mikulas Patocka 2013-04-09 21:02 ` Tejun Heo @ 2013-04-09 21:07 ` Vivek Goyal 2013-04-09 21:18 ` Mikulas Patocka 1 sibling, 1 reply; 22+ messages in thread From: Vivek Goyal @ 2013-04-09 21:07 UTC (permalink / raw) To: Mikulas Patocka Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote: [..] > Generally, we shouldn't associate bios with "current" task in device > mapper targets. For example suppose that we have two stacked dm-crypt > targets: > > In the "current" process pointer in lower dm-crypt target's request > function always points to the workqueue of the upper dm-crypt target that > submits the bios. So if we associate the bio with "current" in the lower > target, we are associating it with a preallocated workqueue and we already > lost the information who submitted it. > > You should associate a bio with a task when you create the bio and "md" > and "dm" midlayers should just forward this association to lower layer > bios. bio_associate_current() returns -EBUSY if bio has already been associated with an io context. So in a stack if every driver calls bio_associate_current(), upon bio submission, it will automatically make sure bio gets associated with submitter task in top level device and calls by lower level device will be ignored. Lower level devices I think just need to make sure this context info is propagated to cloned bios. [..] > +#ifdef CONFIG_BLK_CGROUP > + tio->clone.bi_ioc = ci->bio->bi_ioc; > + tio->clone.bi_css = ci->bio->bi_css; You also need to take references to ioc and css objects. I guess a helper function will be better. Maybe something like bio_associate_bio_context(bio1, bio2). And this initializes bio2's context with bio1's context. Thanks Vivek ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 21:07 ` Vivek Goyal @ 2013-04-09 21:18 ` Mikulas Patocka 2013-04-10 19:24 ` Vivek Goyal 0 siblings, 1 reply; 22+ messages in thread From: Mikulas Patocka @ 2013-04-09 21:18 UTC (permalink / raw) To: Vivek Goyal Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, 9 Apr 2013, Vivek Goyal wrote: > On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote: > > [..] > > Generally, we shouldn't associate bios with "current" task in device > > mapper targets. For example suppose that we have two stacked dm-crypt > > targets: > > > > In the "current" process pointer in lower dm-crypt target's request > > function always points to the workqueue of the upper dm-crypt target that > > submits the bios. So if we associate the bio with "current" in the lower > > target, we are associating it with a preallocated workqueue and we already > > lost the information who submitted it. > > > > You should associate a bio with a task when you create the bio and "md" > > and "dm" midlayers should just forward this association to lower layer > > bios. > > bio_associate_current() return -EBUSY if bio has already been associated > with an io context. > > So in a stack if every driver calls bio_associate_current(), upon bio > submission, it will automatically amke sure bio gets associated with > submitter task in top level device and calls by lower level device > will be ignored. The stacking drivers do not pass the same bio to each other. The stacking driver receives a bio, allocates zero or more new bios and sends these new bios to the lower layer. So you need to propagate ownership from the received bio to newly allocated bios, you don't want to associate newly allocated bio with "current" process. > Lower level devices I think just need to make sure this context > info is propogated to cloned bios. > > > > [..] > > > +#ifdef CONFIG_BLK_CGROUP > > > + tio->clone.bi_ioc = ci->bio->bi_ioc; > > > + tio->clone.bi_css = ci->bio->bi_css; > > > > You also need to take references to ioc and css objects. I guess a helper > > function will be better. May be something like. The lifetime of the "tio" structure is shorter than the lifetime of "ci->bio". So we don't need to increment reference. We only need to increment reference if we copy ownership to a new bio that could have a longer lifetime than the original bio. But this situation is very rare - in most stacking drivers the newly allocated bio has shorter lifetime than the original one. > bio_associate_bio_context(bio1, bio2) > > And this initialize bio2's context with bio1's context. Yes, that would be ok. > Thanks > Vivek Mikulas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 21:18 ` Mikulas Patocka @ 2013-04-10 19:24 ` Vivek Goyal 0 siblings, 0 replies; 22+ messages in thread From: Vivek Goyal @ 2013-04-10 19:24 UTC (permalink / raw) To: Mikulas Patocka Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 05:18:25PM -0400, Mikulas Patocka wrote: [..] > > bio_associate_current() return -EBUSY if bio has already been associated > > with an io context. > > > > So in a stack if every driver calls bio_associate_current(), upon bio > > submission, it will automatically amke sure bio gets associated with > > submitter task in top level device and calls by lower level device > > will be ignored. > > The stacking drivers do not pass the same bio to each other. > > The stacking driver receives a bio, allocates zero or more new bios and > sends these new bios to the lower layer. So you need to propagate > ownership from the received bio to newly allocated bios, you don't want to > associate newly allocated bio with "current" process. Actually I was asking to call bio_associate_current() for the incoming bio in the driver and not for the bios newly created by the driver. For any newly created bios on behalf of this incoming bio, we need to copy the context so that this context info can be propagated down the stack. > > > Lower level devices I think just need to make sure this context > > info is propogated to cloned bios. > > > > > > [..] > > > +#ifdef CONFIG_BLK_CGROUP > > > + tio->clone.bi_ioc = ci->bio->bi_ioc; > > > + tio->clone.bi_css = ci->bio->bi_css; > > > > You also need to take references to ioc and css objects. I guess a helper > > function will be better. May be something like. > > The lifetime of the "tio" structure is shorter than the lifetime of > "ci->bio". So we don't need to increment reference. > > We only need to increment reference if we copy ownership to a new bio that > has could have longer lifetime than the original bio. But this situation > is very rare - in most stacking drivers the newly allocated bio has > shorter lifetime than the original one. I think it is not a good idea to rely on the fact that cloned bios or newly created bios will have a shorter lifetime than the original bio. In fact, when the bio completes and you free it up, bio_disassociate_task() will try to put the io context and blkcg references. So it is important to take references if you are copying context info into any newly created bio. Thanks Vivek ^ permalink raw reply [flat|nested] 22+ messages in thread
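Putting Tejun's bio_copy_association() and Vivek's bio_associate_bio_context() suggestions together, the helper would be something along these lines. The name and placement are assumptions made for illustration; the get/put pairing simply mirrors what the 3.8-era bio_associate_current()/bio_disassociate_task() pair already does.

/* Hedged sketch of the discussed helper: copy the cgroup association from
 * @from to @to and take the references that bio_disassociate_task() will
 * drop when @to completes, so the clone may safely outlive the original. */
static void bio_associate_bio_context(struct bio *to, struct bio *from)
{
#ifdef CONFIG_BLK_CGROUP
	if (from->bi_ioc) {
		get_io_context_active(from->bi_ioc);	/* as bio_associate_current() does */
		to->bi_ioc = from->bi_ioc;
	}
	if (from->bi_css && css_tryget(from->bi_css))
		to->bi_css = from->bi_css;
#endif
}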
* Re: dm-crypt parallelization patches 2013-04-09 17:51 ` dm-crypt parallelization patches Mikulas Patocka 2013-04-09 17:57 ` Tejun Heo @ 2013-04-09 18:36 ` Vivek Goyal 1 sibling, 0 replies; 22+ messages in thread From: Vivek Goyal @ 2013-04-09 18:36 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Tejun Heo, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote: > Hi > > I placed the dm-crypt parallization patches at: > http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/ > > The patches paralellize dm-crypt and make it possible to use all processor > cores. > > > The patch dm-crypt-remove-percpu.patch removes some percpu variables and > replaces them with per-request variables. > > The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the > encryption workqueue, allowing the encryption to be distributed to all > CPUs in the system. > > The patch dm-crypt-offload-writes-to-thread.patch moves submission of all > write requests to a single thread. > > The patch dm-crypt-sort-requests.patch sorts write requests submitted by a > single thread. The requests are sorted according to the sector number, > rb-tree is used for efficient sorting. > > Some usage notes: > > * turn off automatic cpu frequency scaling (or set it to "performance" > governor) - cpufreq doesn't recognize encryption workload correctly, > sometimes it underclocks all the CPU cores when there is some encryption > work to do, resulting in bad performance > > * when using filesystem on encrypted dm-crypt device, reduce maximum > request size with "/sys/block/dm-2/queue/max_sectors_kb" (substitute > "dm-2" with the real name of your dm-crypt device). Note that having too > big requests means that there is a small number of requests and they > cannot be distributed to all available processors in parallel - it > results in worse performance. Having too small requests results in high > request overhead and also reduced performance. So you must find the > optimal request size for your system and workload. For me, when testing > this on ramdisk, the optimal is 8KiB. > > --- > > Now, the problem with I/O scheduler: when doing performance testing, it > turns out that the parallel version is sometimes worse than the previous > implementation. > > When I create a 4.3GiB dm-crypt device on the top of dm-loop on the top of > ext2 filesystem on 15k SCSI disk and run this command > > time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt > --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 > --name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11 > --name=job12 > > the results are this: > CFQ scheduler: > -------------- > no patches: > 21.9s > patch 1: > 21.7s > patches 1,2: > 2:33s > patches 1,2 (+ nr_requests = 1280000) > 2:18s > patches 1,2,3: > 20.7s > patches 1,2,3,4: > 20.7s > > deadline scheduler: > ------------------- > no patches: > 27.4s > patch 1: > 27.4s > patches 1,2: > 27.8s > patches 1,2,3: > 29.6s > patches 1,2,3,4: > 29.6s > > > We can see that CFQ performs badly with the patch 2, but improves with the > patch 3. All that patch 3 does is that it moves write requests from > encryption threads to a separate thread. > > So it seems that CFQ has some deficiency that it cannot merge adjacent > requests done by different processes. > CFQ does not merge requests across different cfq queues (cfqq). 
Each queue is associated with one iocontext. So in this case each worker thread is submitting its own bio and each 4K bio must be going into a separate cfqq, hence no merging is taking place. The moment you applied patch 3, where a single thread submitted bios, each bio went into a single queue and possibly got merged. So either use a single thread to submit bios or, better, use bio_associate_current() (as Tejun suggested) on the original 256K bio. (Hopefully bio iocontext association information is retained when you split the bios into smaller pieces). > The problem is this: - we have 256k write direct-i/o request - it is broken to 4k bios (because we run on dm-loop on a filesystem with > 4k block size) - encryption of these 4k bios is distributed to 12 processes on a 12-core > machine - encryption finishes out of order and in different processes, 4k bios > with encrypted data are submitted to CFQ - CFQ doesn't merge them - the disk is flooded with random 4k write requests, and performs much > worse than with 256k requests > Thanks Vivek ^ permalink raw reply [flat|nested] 22+ messages in thread
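For dm-crypt itself, Vivek's suggestion amounts to tagging the incoming bio in the map path while the code still runs in the submitting task. A minimal sketch, not the actual change (and whether the association survives the later splitting is exactly the question he leaves open):

static int crypt_map(struct dm_target *ti, struct bio *bio)
{
	/* bio_associate_current() is the existing 3.8-era interface: -EBUSY
	 * means an upper stacked device already owns the bio and -ENOENT
	 * means the task has no io context, so both are ignored here. */
	bio_associate_current(bio);

	/* ... unchanged dm-crypt logic: defer the bio to the encryption
	 * workqueue, whose clones must then inherit this association ... */
	return DM_MAPIO_SUBMITTED;
}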