* dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Mario 'BitKoenig' Holbe @ 2011-03-08 16:45 UTC (permalink / raw)
To: dm-crypt; +Cc: linux-kernel, Andi Kleen, Milan Broz, Alasdair G Kergon

Hello,

dm-crypt in 2.6.38 changed to per-CPU workqueues to increase its
performance by parallelizing encryption across multiple CPUs.
This modification seems to cause (massive) performance drops for
multiple parallel dm-crypt instances.

I'm running a 4-disk RAID0 on top of 4 independent dm-crypt (aes-xts)
devices on a 3GHz Core2Quad. This setup overcame the single-CPU
limitation of previous versions and utilized all 4 cores for
encryption.
The throughput of this array drops from 282MB/s sustained read (dd,
single process) with 2.6.37.3 down to 133MB/s with 2.6.38-rc8 (which
nearly equals the single-disk throughput of 128MB/s - just in case
this matters).

This indicates far less parallelization with 2.6.38 than before.
I don't think this was intentional :)

The dm-crypt per-CPU workqueues were introduced in 2.6.38 with commit
c029772125594e31eb1a5ad9e0913724ed9891f2. Reverting dm-crypt.c to the
version before this commit regains the same throughput as with 2.6.37.

Submitters/Signers of c029772125594e31eb1a5ad9e0913724ed9891f2 CC:ed.

Thanks for your work & regards
   Mario
-- 
There are two major products that come from Berkeley: LSD and UNIX.
We don't believe this to be a coincidence. -- Jeremy S. Anderson

^ permalink raw reply [flat|nested] 12+ messages in thread
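The revert described above can be sketched against a kernel git tree
roughly as follows (the tree path and rebuild step are assumptions,
not from the report; only the commit hash is from the thread):

```shell
# Inspect the commit that introduced the per-CPU workqueues, then put
# drivers/md/dm-crypt.c back to its pre-commit state.
cd ~/src/linux
git log --oneline -1 c029772125594e31eb1a5ad9e0913724ed9891f2
git checkout c029772125594e31eb1a5ad9e0913724ed9891f2^ -- drivers/md/dm-crypt.c
# rebuild and reinstall the dm_crypt module as usual afterwards
```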
* Re: [dm-crypt] dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Milan Broz @ 2011-03-08 17:35 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe, dm-crypt, linux-kernel, Andi Kleen, Alasdair G Kergon

On 03/08/2011 05:45 PM, Mario 'BitKoenig' Holbe wrote:
> dm-crypt in 2.6.38 changed to per-CPU workqueues to increase its
> performance by parallelizing encryption across multiple CPUs.
> This modification seems to cause (massive) performance drops for
> multiple parallel dm-crypt instances.
>
> I'm running a 4-disk RAID0 on top of 4 independent dm-crypt (aes-xts)
> devices on a 3GHz Core2Quad. This setup overcame the single-CPU
> limitation of previous versions and utilized all 4 cores for
> encryption.
> The throughput of this array drops from 282MB/s sustained read (dd,
> single process) with 2.6.37.3 down to 133MB/s with 2.6.38-rc8 (which
> nearly equals the single-disk throughput of 128MB/s - just in case
> this matters).
>
> This indicates far less parallelization with 2.6.38 than before.
> I don't think this was intentional :)

Well, it depends. I never suggested this kind of workaround because
you basically hardcoded (in the device stacking) how many parallel
instances (ideally == cpu cores) of dm-crypt can run effectively.

Previously there was no cpu affinity, so a dm-crypt thread simply ran
on some core. With the current design the IO is encrypted by the cpu
which submitted it. With RAID0 this probably means that one IO is
split into stripes and these all try to encrypt on the same core (in
"parallel"). (I need to test what actually happens though.)

If you use one dm-crypt instance over RAID0, you will now probably get
much better throughput. (Even with one process generating IOs the bios
are, surprisingly, submitted on different cpus. But this time it runs
really in parallel.)

Maybe we can find some compromise, but I basically prefer the current
design, which provides much better behaviour for most configurations.

Milan
* Re: [dm-crypt] dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Mario 'BitKoenig' Holbe @ 2011-03-08 19:23 UTC (permalink / raw)
To: Milan Broz; +Cc: dm-crypt, linux-kernel, Andi Kleen, Alasdair G Kergon

On Tue, Mar 08, 2011 at 06:35:01PM +0100, Milan Broz wrote:
> Well, it depends. I never suggested this kind of workaround because
> you basically hardcoded (in the device stacking) how many parallel
> instances (ideally == cpu cores) of dm-crypt can run effectively.

Yes. But it was the best one could get :)

> With the current design the IO is encrypted by the cpu which
> submitted it.
...
> If you use one dm-crypt instance over RAID0, you will now probably
> get much better throughput. (Even with one process generating IOs
> the bios are, surprisingly, submitted on different cpus. But this
> time it runs really in parallel.)

Mh, not really. I just tested this with kernels freshly booted into
emergency mode and udev started, to create the device nodes:

# cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo1 /dev/sdc
...
# cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo4 /dev/sdf
# mdadm -B -l raid0 -n 4 -c 256 /dev/md/foo /dev/mapper/foo[1-4]
# dd if=/dev/md/foo of=/dev/null bs=1M count=20k

2.6.37: 291MB/s    2.6.38: 139MB/s

# mdadm -B -l raid0 -n 4 -c 256 /dev/md/foo /dev/sd[c-f]
# cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo /dev/md/foo
# dd if=/dev/mapper/foo of=/dev/null bs=1M count=20k

2.6.37: 126MB/s    2.6.38: 138MB/s

So... performance drops on .37 (as expected) and nothing changes on
.38 (unlike expected).

Those results, btw., differ dramatically when using tmpfs-backed loop
devices instead of hard disks:

raid0 over crypted loops:    2.6.37: 285MB/s    2.6.38: 324MB/s
crypted raid0 over loops:    2.6.37: 119MB/s    2.6.38: 225MB/s

Here we do indeed get changing results - even if they are not what one
would expect.

All those constructs are read-only and hence can be tested on any
block device that happens to be available. Setting the devices
read-only would probably be a good idea, to compensate for being short
on sleep or whatever.

> Maybe we can find some compromise, but I basically prefer the
> current design, which provides much better behaviour for most
> configurations.

Hmmm...

regards
   Mario
-- 
File names are infinite in length where infinity is set to 255
characters. -- Peter Collinson, "The Unix File System"
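The tmpfs-backed loop-device variant behind the second table was
presumably set up along these lines (the mount point, file sizes, and
loop/device names are guesses; requires root):

```shell
# raid0 over crypted loops, backed by files on tmpfs instead of disks
mount -t tmpfs -o size=5G tmpfs /mnt/ram
for i in 1 2 3 4; do
    dd if=/dev/zero of=/mnt/ram/disk$i bs=1M count=1024
    losetup /dev/loop$i /mnt/ram/disk$i
    cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom \
        create lfoo$i /dev/loop$i
done
mdadm -B -l raid0 -n 4 -c 256 /dev/md/lfoo /dev/mapper/lfoo[1-4]
dd if=/dev/md/lfoo of=/dev/null bs=1M count=4k
```

The "crypted raid0 over loops" case swaps the stacking order: mdadm
over the raw loop devices first, then one cryptsetup instance on top
of /dev/md/lfoo.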
* Re: [dm-crypt] dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Milan Broz @ 2011-03-08 20:07 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe, dm-crypt, linux-kernel, Andi Kleen, Alasdair G Kergon

On 03/08/2011 08:23 PM, Mario 'BitKoenig' Holbe wrote:
>> If you use one dm-crypt instance over RAID0, you will now probably
>> get much better throughput. (Even with one process generating IOs
>> the bios are, surprisingly, submitted on different cpus. But this
>> time it runs really in parallel.)
>
> Mh, not really. I just tested this with kernels freshly booted into
> emergency mode and udev started, to create the device nodes:
>
> # cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo1 /dev/sdc
> ...
> # cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo4 /dev/sdf
> # mdadm -B -l raid0 -n 4 -c 256 /dev/md/foo /dev/mapper/foo[1-4]
> # dd if=/dev/md/foo of=/dev/null bs=1M count=20k
>
> 2.6.37: 291MB/s    2.6.38: 139MB/s
>
> # mdadm -B -l raid0 -n 4 -c 256 /dev/md/foo /dev/sd[c-f]
> # cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo /dev/md/foo
> # dd if=/dev/mapper/foo of=/dev/null bs=1M count=20k
>
> 2.6.37: 126MB/s    2.6.38: 138MB/s
>
> So... performance drops on .37 (as expected) and nothing changes on
> .38 (unlike expected).

Could you please also try writes? I get better results for them than
for reads here.

Anyway, the patch provides parallel processing if IO is submitted from
different CPUs; it does not provide any load balancing if everything
is submitted from one process. (That seems to be a side effect of
something else...)

So unfortunately, while for some this is a huge improvement, in this
case it causes just trouble. We need to investigate whether some
change on top of the current code can provide better results here.

Milan
* Re: [dm-crypt] dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Mario 'BitKoenig' Holbe @ 2011-03-08 20:17 UTC (permalink / raw)
To: Milan Broz; +Cc: dm-crypt, linux-kernel, Andi Kleen, Alasdair G Kergon

On Tue, Mar 08, 2011 at 09:07:10PM +0100, Milan Broz wrote:
> Could you please also try writes? I get better results for them than
> for reads here.

No, I can't, sorry. I don't have *that many* spare devices to try
with. However, if you are able to reproduce my read results, your
write results should be similar to what I'd get.

Mario
-- 
The secret that the NSA could read the Iranian secrets was more
important than any specific Iranian secrets that the NSA could read.
                                                -- Bruce Schneier
* Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Andi Kleen @ 2011-03-10 16:57 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe, dm-crypt, linux-kernel, Milan Broz, Alasdair G Kergon

On Tue, Mar 08, 2011 at 05:45:08PM +0100, Mario 'BitKoenig' Holbe wrote:
> I'm running a 4-disk RAID0 on top of 4 independent dm-crypt (aes-xts)
> devices on a 3GHz Core2Quad. This setup overcame the single-CPU
> limitation of previous versions and utilized all 4 cores for
> encryption.
> The throughput of this array drops from 282MB/s sustained read (dd,
> single process) with 2.6.37.3 down to 133MB/s with 2.6.38-rc8

It will be better with multiple processes running on different CPUs.
The new design is really for multiple processes.

Do you actually use dd for production or is this just a benchmark?
(if yes: newsflash: use a better benchmark)

-Andi
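A multi-process read along the lines Andi suggests could be sketched
as follows, pinning each reader to its own CPU so that each submission
(and hence its encryption, under the new design) lands on a separate
core. The offsets and counts are made up, and it assumes the
/dev/md/foo array from the earlier test:

```shell
# Four readers, one per core, each covering a disjoint 5GiB region.
for cpu in 0 1 2 3; do
    taskset -c $cpu dd if=/dev/md/foo of=/dev/null bs=1M count=5k \
        skip=$((cpu * 5120)) &
done
wait
```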
* Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Mario 'BitKoenig' Holbe @ 2011-03-10 17:54 UTC (permalink / raw)
To: Andi Kleen; +Cc: dm-crypt, linux-kernel, Milan Broz, Alasdair G Kergon

On Thu, Mar 10, 2011 at 08:57:30AM -0800, Andi Kleen wrote:
> Do you actually use dd for production or is this just a benchmark?

The array is streaming most of the time, i.e. single-process
sequential reads or writes (reads, mostly) of large chunks of data.
So, no and yes, but...

> (if yes: newsflash: use a better benchmark)

this makes dd quite a valid benchmark for me in this case.

> It will be better with multiple processes running on different CPUs.
> The new design is really for multiple processes.

Of course it is. What bothers me is that I can't get my old
performance back in my case, whatever I do.

I don't know what kind of parallelism padata uses, i.e. whether a
padata-based solution would suffer from the same limitations as the
current dm-crypt/kcryptd parallelism or not.

With the current approach: would it be possible to make CPU affinity
configurable for *single* kcryptd instances? Either in the sense of
nailing a specific kcryptd to a specific CPU, or (what would be better
for me, I guess) in the sense of completely removing CPU affinity from
a specific kcryptd, like it was before?

Mario
-- 
There is nothing more deceptive than an obvious fact.
                -- Sherlock Holmes by Arthur Conan Doyle
* Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Andi Kleen @ 2011-03-11 1:18 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe, dm-crypt, linux-kernel, Milan Broz, Alasdair G Kergon

> Would it be possible to make CPU affinity configurable for *single*
> kcryptd instances? Either in the sense of nailing a specific kcryptd
> to a specific CPU, or (what would be better for me, I guess) in the
> sense of completely removing CPU affinity from a specific kcryptd,
> like it was before?

I don't think that's a good idea. You probably need to find some way
to make pcrypt (the parallel crypt layer) work for dm-crypt. That may
actually give you even more speedup than your old hack because it can
balance over more cores.

Or get a system with AES-NI -- that usually solves it too.

Frankly I don't think it's a very interesting case; the majority of
workloads are not like that.

-Andi
* Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Mario 'BitKoenig' Holbe @ 2011-03-11 18:03 UTC (permalink / raw)
To: Andi Kleen; +Cc: dm-crypt, linux-kernel, Milan Broz, Alasdair G Kergon

I was pondering for a long time whether to reply to this or not, but
sorry, I couldn't resist.

On Thu, Mar 10, 2011 at 05:18:42PM -0800, Andi Kleen <ak@linux.intel.com> wrote:
> You probably need to find some way to make pcrypt (the parallel
> crypt layer) work for dm-crypt. That may actually give you even more
> speedup than your old hack because it can balance over more cores.

"My" old "hack" balances well as long as the number of stripes is
equal to or greater than the number of cores. And for my specific
case... it's hard to balance over more than 4 cores on a Core2Quad :)

> Or get a system with AES-NI -- that usually solves it too.

Honi soit qui mal y pense. Of course I understand that Intel's primary
goal is to sell new hardware, and hence I understand that you are
required to tell me this. However, based on the AES-NI benchmarks from
the linux-crypto ML, even with AES-NI it would be hard to impossible
to regain my (non-AES-NI!) pre-.38 performance with the .38 dm-crypt
parallelization approach.

> Frankly I don't think it's a very interesting case; the majority of
> workloads are not like that.

Well, I'm not sure we understand each other. My use case is probably a
little bit special, but that's not the point. The main point is that
the .38 dm-crypt parallelization approach kills performance on *every*
RAID0-over-dm-crypt setup. A setup which, I believe, is not as
uncommon as you may think, because until .38 it was the only way to
spread disk encryption over multiple CPUs.

Up to .37, due to the CPU inaffinity, accessing (reading or writing)
one stripe in the RAID0 was always spread over min(#cores, #kcryptds)
cores. Now with .38 the same access will only ever utilize one single
core, because all the chunks of the stripe are (obviously) accessed on
the same core. Hence either the multiple underlying kcryptds of the
old stacking now block each other, or with dm-crypt-over-RAID0 there
is only one kcryptd involved in serving one request on one core.

Hence, for single requests the new approach always decreases
throughput and increases latency. The latency increase holds even for
multi-process workloads. For your approach to at least match the old
one, it requires min(#cores, #kcryptds) parallel requests all the
time, assuming latency doesn't matter and disk seek time is zero (now
you tell me to get X25s, right? :)).

Mario
-- 
There are two major products that come from Berkeley: LSD and UNIX.
We don't believe this to be a coincidence. -- Jeremy S. Anderson
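The arithmetic of this argument can be put into a toy model. The
130MB/s single-core rate below is an assumption in the ballpark of the
numbers reported earlier in the thread, not a measured value:

```shell
#!/bin/bash
per_core=130   # assumed single-core aes-xts throughput, MB/s
cores=4
kcryptds=4     # one dm-crypt instance per RAID0 member

# pre-2.6.38: the chunks of one striped request are encrypted by up to
# min(#cores, #kcryptds) kcryptd threads in parallel
if [ "$cores" -lt "$kcryptds" ]; then n=$cores; else n=$kcryptds; fi
old=$(( n * per_core ))

# 2.6.38: all chunks of one request are encrypted on the submitting CPU
new=$(( per_core ))

echo "old=${old}MB/s new=${new}MB/s"
```

With these numbers the model prints old=520MB/s new=130MB/s, which
matches the shape (if not the exact magnitude) of the 282 -> 133MB/s
regression reported at the top of the thread.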
* Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Milan Broz @ 2011-03-11 18:29 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe, Andi Kleen, dm-crypt, linux-kernel, Alasdair G Kergon

On 03/11/2011 07:03 PM, Mario 'BitKoenig' Holbe wrote:
>> You probably need to find some way to make pcrypt (the parallel
>> crypt layer) work for dm-crypt. That may actually give you even
>> more speedup than your old hack because it can balance over more
>> cores.

dm-crypt is already using the async crypto interface; it is ready for
parallelization at this level.

Perhaps the problem is that pcrypt is not yet implemented for the
needed algorithms?

Milan
* Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Andi Kleen @ 2011-03-11 18:36 UTC (permalink / raw)
To: Milan Broz
Cc: Mario 'BitKoenig' Holbe, dm-crypt, linux-kernel, Alasdair G Kergon, Herbert Xu

On Fri, Mar 11, 2011 at 07:29:58PM +0100, Milan Broz wrote:
> dm-crypt is already using the async crypto interface; it is ready
> for parallelization at this level.
>
> Perhaps the problem is that pcrypt is not yet implemented for the
> needed algorithms?

It needs some glue according to Herbert. I forgot the details.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only
* Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Herbert Xu @ 2011-03-12 1:05 UTC (permalink / raw)
To: Andi Kleen
Cc: Milan Broz, Mario 'BitKoenig' Holbe, dm-crypt, linux-kernel, Alasdair G Kergon

On Fri, Mar 11, 2011 at 10:36:54AM -0800, Andi Kleen wrote:
> It needs some glue according to Herbert. I forgot the details.

As we don't want pcrypt to be on by default, it needs to be loaded by
hand. What we lack is a clean way to instantiate it. For now you have
to do something like

    modprobe tcrypt alg="pcrypt(authenc(hmac(sha1-generic),cbc(aes-asm)))" type=3

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
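Once instantiated via the modprobe line above, the pcrypt-wrapped
algorithm should be visible in the kernel's crypto registry, which
gives a quick way to confirm the glue worked (the grep pattern is just
the template name from that modprobe line; requires a kernel with the
pcrypt module available):

```shell
# Check that a pcrypt(...) instance was registered after the modprobe
grep -B2 -A8 'pcrypt' /proc/crypto
```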