* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing [not found] ` <ZmAuTceqwZlRJqHx@gondor.apana.org.au> @ 2024-06-05 9:46 ` Herbert Xu 2024-06-05 19:14 ` Eric Biggers 0 siblings, 1 reply; 18+ messages in thread From: Herbert Xu @ 2024-06-05 9:46 UTC (permalink / raw) To: Eric Biggers, Steffen Klassert, netdev Cc: linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Ard Biesheuvel, Sami Tolvanen, Bart Van Assche On Wed, Jun 05, 2024 at 05:22:21PM +0800, Herbert Xu wrote: > > However, I really dislike the idea of shoehorning this into shash. > I know you really like shash, but I think there are some clear > benefits to be had by coupling this with ahash. If we do this properly, we should be able to immediately use the mb code with IPsec. In the network stack, we already aggregate the data prior to IPsec with GSO. So at the boundary between IPsec and the Crypto API, it's dividing chunks of data up to 64K into 1500-byte packets and feeding them to crypto one at a time. It really should be sending the whole chain of packets to us as a unit. Once we have a proper mb interface, we can fix that and immediately get the benefit of mb hashing. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-05 9:46 ` [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing Herbert Xu @ 2024-06-05 19:14 ` Eric Biggers 2024-06-06 2:00 ` Herbert Xu 0 siblings, 1 reply; 18+ messages in thread From: Eric Biggers @ 2024-06-05 19:14 UTC (permalink / raw) To: Herbert Xu Cc: Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Ard Biesheuvel, Sami Tolvanen, Bart Van Assche On Wed, Jun 05, 2024 at 05:46:27PM +0800, Herbert Xu wrote: > On Wed, Jun 05, 2024 at 05:22:21PM +0800, Herbert Xu wrote: > > > > However, I really dislike the idea of shoehorning this into shash. > > I know you really like shash, but I think there are some clear > > benefits to be had by coupling this with ahash. > > If we do this properly, we should be able to immediately use the > mb code with IPsec. In the network stack, we already aggregate > the data prior to IPsec with GSO. So at the boundary between > IPsec and the Crypto API, it's dividing chunks of data up to 64K > into 1500-byte packets and feeding them to crypto one at a time. > > It really should be sending the whole chain of packets to us as > a unit. > > Once we have a proper mb interface, we can fix that and immediately > get the benefit of mb hashing. > This would at most apply to AH, not to ESP. Is AH commonly used these days? Also even for AH, the IPsec code would need to be significantly restructured to make use of multibuffer hashing. See how the segmentation happens in xfrm_output_gso(), but the ahash calls happen much lower in the stack. I'm guessing that you've had the AH use case in mind since your earlier comments. Given you were originally pushing for this to be supported using the existing async support in the ahash API (which would have required fewer code changes on the AH side), but we now agree that is not feasible, maybe it is time to reconsider whether it would still be worthwhile to make all the changes to the AH code needed to support this? Also, even if it would be worthwhile and would use ahash, ahash is almost always just a wrapper for shash. So the shash support would be needed anyway. - Eric ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-05 19:14 ` Eric Biggers @ 2024-06-06 2:00 ` Herbert Xu 2024-06-06 5:28 ` Eric Biggers 0 siblings, 1 reply; 18+ messages in thread From: Herbert Xu @ 2024-06-06 2:00 UTC (permalink / raw) To: Eric Biggers Cc: Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Ard Biesheuvel, Sami Tolvanen, Bart Van Assche On Wed, Jun 05, 2024 at 12:14:10PM -0700, Eric Biggers wrote: > > This would at most apply to AH, not to ESP. Is AH commonly used these days? No AH is completely useless. However, this applies perfectly to ESP, in conjunction with authenc. Obviously we would need to add request linking to authenc (AEAD) as well so that it can pass it along to sha. BTW, does any of this interleaving apply to AES? If so we should explore adding request linking to skcipher as well. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-06 2:00 ` Herbert Xu @ 2024-06-06 5:28 ` Eric Biggers 2024-06-06 5:41 ` Herbert Xu 0 siblings, 1 reply; 18+ messages in thread From: Eric Biggers @ 2024-06-06 5:28 UTC (permalink / raw) To: Herbert Xu Cc: Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Ard Biesheuvel, Sami Tolvanen, Bart Van Assche On Thu, Jun 06, 2024 at 10:00:05AM +0800, Herbert Xu wrote: > On Wed, Jun 05, 2024 at 12:14:10PM -0700, Eric Biggers wrote: > > > > This would at most apply to AH, not to ESP. Is AH commonly used these days? > > No AH is completely useless. However, this applies perfectly to > ESP, in conjunction with authenc. Obviously we would need to add > request linking to authenc (AEAD) as well so that it can pass it > along to sha. > > BTW, does any of this interleaving apply to AES? If so we should > explore adding request linking to skcipher as well. > With AES, interleaving would only help with non-parallelizable modes such as CBC encryption. Anyone who cares about IPsec performance should of course be using AES-GCM, which is parallelizable. Especially since my other patch https://lore.kernel.org/linux-crypto/20240602222221.176625-2-ebiggers@kernel.org/ is making AES-GCM twice as fast... With hashing we unfortunately don't have the luxury of there being widely used and accepted parallelizable algorithms. In particular, all the SHAs are serialized. So that's why interleaving makes sense there. In any case, it seems that what you're asking for at this point is far beyond the scope of this patchset. - Eric ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-06 5:28 ` Eric Biggers @ 2024-06-06 5:41 ` Herbert Xu 2024-06-06 6:58 ` Ard Biesheuvel 0 siblings, 1 reply; 18+ messages in thread From: Herbert Xu @ 2024-06-06 5:41 UTC (permalink / raw) To: Eric Biggers Cc: Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Ard Biesheuvel, Sami Tolvanen, Bart Van Assche, Megha Dey, Tim Chen On Wed, Jun 05, 2024 at 10:28:01PM -0700, Eric Biggers wrote: > > With AES, interleaving would only help with non-parallelizable modes such as CBC > encryption. Anyone who cares about IPsec performance should of course be using > AES-GCM, which is parallelizable. Especially since my other patch > https://lore.kernel.org/linux-crypto/20240602222221.176625-2-ebiggers@kernel.org/ > is making AES-GCM twice as fast... Algorithm selection may be limited by peer capability. For IPsec, if SHA is being used, then most likely CBC is also being used. > In any case, it seems that what you're asking for at this point is far beyond > the scope of this patchset. I'm more than happy to take this over if you don't wish to extend it beyond the storage usage cases. According to the original Intel sha2-mb submission, this should result in at least a two-fold speed-up. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-06 5:41 ` Herbert Xu @ 2024-06-06 6:58 ` Ard Biesheuvel 2024-06-06 7:34 ` Herbert Xu 0 siblings, 1 reply; 18+ messages in thread From: Ard Biesheuvel @ 2024-06-06 6:58 UTC (permalink / raw) To: Herbert Xu Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Megha Dey, Tim Chen On Thu, 6 Jun 2024 at 07:41, Herbert Xu <herbert@gondor.apana.org.au> wrote: > > On Wed, Jun 05, 2024 at 10:28:01PM -0700, Eric Biggers wrote: > > > > With AES, interleaving would only help with non-parallelizable modes such as CBC > > encryption. Anyone who cares about IPsec performance should of course be using > > AES-GCM, which is parallelizable. Especially since my other patch > > https://lore.kernel.org/linux-crypto/20240602222221.176625-2-ebiggers@kernel.org/ > > is making AES-GCM twice as fast... > > Algorithm selection may be limited by peer capability. For IPsec, > if SHA is being used, then most likely CBC is also being used. > IPSec users relying on software crypto and authenc() and caring about performance seems like a rather niche use case to me. > > In any case, it seems that what you're asking for at this point is far beyond > > the scope of this patchset. > > I'm more than happy to take this over if you don't wish to extend > it beyond the storage usage cases. According to the original Intel > sha2-mb submission, this should result in at least a two-fold > speed-up. > I'm struggling to follow this debate. Surely, if this functionality needs to live in ahash, the shash fallbacks need to implement this parallel scheme too, or ahash would end up just feeding the requests into shash sequentially, defeating the purpose. It is then up to the API client to choose between ahash or shash, just as it can today. So Eric has a pretty strong case for his shash implementation; kmap_local() is essentially a NOP on architectures that anyone still cares about (unlike kmap_atomic() which still disables preemption), so I don't have a problem with the caller relying on that in order to be able to use shash directly. The whole scatterlist / request alloc dance is just too tedious and pointless, given that in practice, it all gets relegated to shash anyway. But my point is that even if we go with Herbert's proposal for the ahash, we'll still need something like Eric's code on the shash side. For the true async accelerator use case, none of this should make any difference, right? If the caller already tolerates async (but in-order) completion, implementing this request chaining doesn't buy it anything. So only when the caller is sync and the implementation is async, we might be able to do something smart where the caller waits on a single completion that signals the completion of a set of inputs. But this is also rather niche, so not worth holding up this effort. So Herbert, what would the ahash_to_shash plumbing look like for the ahash API that you are proposing? What changes would it require to shash, and how much to they deviate from what Eric is suggesting? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-06 6:58 ` Ard Biesheuvel @ 2024-06-06 7:34 ` Herbert Xu 2024-06-06 7:55 ` Ard Biesheuvel 0 siblings, 1 reply; 18+ messages in thread From: Herbert Xu @ 2024-06-06 7:34 UTC (permalink / raw) To: Ard Biesheuvel Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Megha Dey, Tim Chen On Thu, Jun 06, 2024 at 08:58:47AM +0200, Ard Biesheuvel wrote: > > IPSec users relying on software crypto and authenc() and caring about > performance seems like a rather niche use case to me. It's no more niche than fs/verity and dm-integrity. In fact, this could potentially be used for all algorithms. Just the reduction in the number of function calls may produce enough of a benefit (this is something I observed when adding GSO, even without any actual hardware offload, simply aggregating packets into larger units produced a visible benefit). > I'm struggling to follow this debate. Surely, if this functionality > needs to live in ahash, the shash fallbacks need to implement this > parallel scheme too, or ahash would end up just feeding the requests > into shash sequentially, defeating the purpose. It is then up to the > API client to choose between ahash or shash, just as it can today. I've never suggested it adding it to shash at all. In fact that's my primary problem with this interface. I think it should be ahash only. Just like skcipher and aead. My reasoning is that this should cater mostly to bulk data, i.e., multiple pages (e.g., for IPsec we're talking about 64K chunks, actually that (the 64K limit) is something that we should really extend, it's not 2014 anymore). These typically will be more easily accessed as a number of distinct pages rather than as a contiguous mapping. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-06 7:34 ` Herbert Xu @ 2024-06-06 7:55 ` Ard Biesheuvel 2024-06-06 8:08 ` Herbert Xu 0 siblings, 1 reply; 18+ messages in thread From: Ard Biesheuvel @ 2024-06-06 7:55 UTC (permalink / raw) To: Herbert Xu Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Tim Chen On Thu, 6 Jun 2024 at 09:34, Herbert Xu <herbert@gondor.apana.org.au> wrote: > > On Thu, Jun 06, 2024 at 08:58:47AM +0200, Ard Biesheuvel wrote: > > > > IPSec users relying on software crypto and authenc() and caring about > > performance seems like a rather niche use case to me. > > It's no more niche than fs/verity and dm-integrity. fsverity is used by Android, which is the diametrical opposite of niche when it comes to Linux distros. > this could potentially be used for all algorithms. Just the > reduction in the number of function calls may produce enough > of a benefit (this is something I observed when adding GSO, > even without any actual hardware offload, simply aggregating > packets into larger units produced a visible benefit). > Sure. But my point is that anyone who cares enough about IPsec performance would have stopped using authenc(cbc(aes),hmac(sha2)) long ago and moved to GCM or even ChaChaPoly. This is not just a matter of efficiency in the implementation - using a general purpose hash function such as SHA-256 [twice] rather than GHASH or Poly1305 is just overkill. So complicating the performance optimization of something that runs on every (non-Apple) phone for the benefit of something that is rarely used in a context where the performance matters seems unreasonable to me. > > I'm struggling to follow this debate. Surely, if this functionality > > needs to live in ahash, the shash fallbacks need to implement this > > parallel scheme too, or ahash would end up just feeding the requests > > into shash sequentially, defeating the purpose. It is then up to the > > API client to choose between ahash or shash, just as it can today. > > I've never suggested it adding it to shash at all. In fact > that's my primary problem with this interface. I think it > should be ahash only. Just like skcipher and aead. > So again, how would that work for ahash falling back to shash. Are you saying every existing shash implementation should be duplicated into an ahash so that the multibuffer optimization can be added? shash is a public interface so we cannot just remove the existing ones and we'll end up carrying both forever. > My reasoning is that this should cater mostly to bulk data, i.e., > multiple pages (e.g., for IPsec we're talking about 64K chunks, > actually that (the 64K limit) is something that we should really > extend, it's not 2014 anymore). These typically will be more > easily accessed as a number of distinct pages rather than as a > contiguous mapping. > Sure, but the block I/O world is very different. Forcing it to use an API modeled after how IPsec might use it seems, again, unreasonable. So these 64k chunks are made up of 1500 byte frames that can be hashed in parallel? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-06 7:55 ` Ard Biesheuvel @ 2024-06-06 8:08 ` Herbert Xu 2024-06-06 8:33 ` Ard Biesheuvel 0 siblings, 1 reply; 18+ messages in thread From: Herbert Xu @ 2024-06-06 8:08 UTC (permalink / raw) To: Ard Biesheuvel Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Tim Chen On Thu, Jun 06, 2024 at 09:55:56AM +0200, Ard Biesheuvel wrote: > > So again, how would that work for ahash falling back to shash. Are you > saying every existing shash implementation should be duplicated into > an ahash so that the multibuffer optimization can be added? shash is a > public interface so we cannot just remove the existing ones and we'll > end up carrying both forever. It should do the same thing for ahash algorithms that do not support multiple requests. IOW it should process the requests one by one. > Sure, but the block I/O world is very different. Forcing it to use an > API modeled after how IPsec might use it seems, again, unreasonable. It's not different at all. You can see that by the proliferation of kmap calls in fs/verity. It's a fundamental issue. You can't consistently get a large contiguous allocation beyond one page due to fragmentation. So large data is always going to be scattered. BTW, I'm all for elminating the overhead when you already have a linear address for scattered memory, e.g., through vmalloc. We should definitely improve our interface for ahash/skcipher/aead so that vmalloc addresses (as well as kmalloc virtual addresses by extension) are supported as first class citizens, and we don't turn them into SG lists unless it's necessary for DMA. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-06 8:08 ` Herbert Xu @ 2024-06-06 8:33 ` Ard Biesheuvel 2024-06-06 9:15 ` Herbert Xu 0 siblings, 1 reply; 18+ messages in thread From: Ard Biesheuvel @ 2024-06-06 8:33 UTC (permalink / raw) To: Herbert Xu Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Tim Chen On Thu, 6 Jun 2024 at 10:08, Herbert Xu <herbert@gondor.apana.org.au> wrote: > > On Thu, Jun 06, 2024 at 09:55:56AM +0200, Ard Biesheuvel wrote: > > > > So again, how would that work for ahash falling back to shash. Are you > > saying every existing shash implementation should be duplicated into > > an ahash so that the multibuffer optimization can be added? shash is a > > public interface so we cannot just remove the existing ones and we'll > > end up carrying both forever. > > It should do the same thing for ahash algorithms that do not support > multiple requests. IOW it should process the requests one by one. > That is not what I am asking. Are you suggesting that, e.g., the arm64 sha2 shash implementation that is modified by this series should instead expose both an shash as before, and an ahash built around the same asm code that exposes the multibuffer capability? > > Sure, but the block I/O world is very different. Forcing it to use an > > API modeled after how IPsec might use it seems, again, unreasonable. > > It's not different at all. You can see that by the proliferation > of kmap calls in fs/verity. It's a fundamental issue. You can't > consistently get a large contiguous allocation beyond one page due > to fragmentation. So large data is always going to be scattered. > I don't think this is true for many uses of the block layer. > BTW, I'm all for elminating the overhead when you already have a > linear address for scattered memory, e.g., through vmalloc. We > should definitely improve our interface for ahash/skcipher/aead so > that vmalloc addresses (as well as kmalloc virtual addresses by > extension) are supported as first class citizens, and we don't turn > them into SG lists unless it's necessary for DMA. > Yes, this is something I've been pondering for a while. An ahash/skcipher/aead with CRYPTO_ALG_ASYNC cleared (which would guarantee that any provided VA would not be referenced after the algo invocation returns) should be able to consume a request that carries virtual addresses rather than SG lists. Given that it is up to the caller to choose between sync and async, it would be in a good position also to judge whether it wants to use stack/vmalloc addresses. I'll have a stab at this. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-06 8:33 ` Ard Biesheuvel @ 2024-06-06 9:15 ` Herbert Xu 2024-06-10 16:42 ` Eric Biggers 0 siblings, 1 reply; 18+ messages in thread From: Herbert Xu @ 2024-06-06 9:15 UTC (permalink / raw) To: Ard Biesheuvel Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Tim Chen On Thu, Jun 06, 2024 at 10:33:15AM +0200, Ard Biesheuvel wrote: > > Are you suggesting that, e.g., the arm64 sha2 shash implementation > that is modified by this series should instead expose both an shash as > before, and an ahash built around the same asm code that exposes the > multibuffer capability? Yes the multi-request handling should be implemented as ahash only. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-06 9:15 ` Herbert Xu @ 2024-06-10 16:42 ` Eric Biggers 2024-06-11 15:21 ` Herbert Xu 0 siblings, 1 reply; 18+ messages in thread From: Eric Biggers @ 2024-06-10 16:42 UTC (permalink / raw) To: Herbert Xu Cc: Ard Biesheuvel, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Tim Chen Hi Herbert, On Thu, Jun 06, 2024 at 05:15:16PM +0800, Herbert Xu wrote: > On Thu, Jun 06, 2024 at 10:33:15AM +0200, Ard Biesheuvel wrote: > > > > Are you suggesting that, e.g., the arm64 sha2 shash implementation > > that is modified by this series should instead expose both an shash as > > before, and an ahash built around the same asm code that exposes the > > multibuffer capability? > > Yes the multi-request handling should be implemented as ahash > only. > That is not a reasonable way to do it. It would be much more complex, more error-prone, and slower than my proposal. Also your proposal wouldn't actually bring you much closer to being able to optimize your authenc use case. First, each algorithm implementation that wants to support your scheme would need to implement the ahash API alongside shash. This would be a lot of duplicated code, and it would need to support all quirks of the ahash API including scatterlists. Second, allowing an ahash_request to actually be a list of requests creates an ambiguity where it will often be unclear when code is supposed to operate on an ahash_request individually vs. on the whole list. This is error-prone, as it invites bugs where a crypto operation is done on only one request of the list. This is a bad design, especially for a cryptographic API. We would have crypto_ahash_*() do some checks to prevent multi-requests from reaching algorithms that don't support the particular kind of request being made. But there will be a lot of cases to consider. Third, it would be hard to actually implement multibuffer hashing given an arbitrary list of requests. For multibuffer hashing to work, the lengths of all the buffers must be synced up, including the internal buffers in the hash states as well as every buffer that is produced while walking the scatterlists. We can place constraints on what is supported, but those constraints will need to be clearly specified, and each algorithm will actually need to check for them and do something sane (error or fallback) when they are not met. Note that it would be possible for the messages to get out of sync in the middle of walking the scatterlists, which would be difficult to handle. All this would add a lot of complexity and overhead compared to my proposal, which naturally expresses the same-length constraints in the API. Fourth, having the API be ahash-only would also force fsverity to switch back to the ahash API, which would add complexity and overhead. The shash API is easier to use even when working with pages, as the diffstat of commit 8fcd94add6c5 clearly shows. The ahash API also has more overhead, including in doing the needed memory allocation, setting up a scatterlist, initializing required but irrelevant or redundant fields in the ahash_request, and shash_ahash_finup() running the fully generalized scatterlist walker on the scatterlist just to finally end up with a pointer which the caller could have provided directly. All this overhead adds up, even when hashing 4K blocks. Back when CPU-based crypto was slow this didn't really matter, but now it's fast and these small overheads are significant when trying to keep up with storage device speeds (which are also fast now). E.g., even disregarding the memory allocation, hashing a 4K block is about 5% faster with shash than with ahash on x86_64. Of course, I'd need to support ahash in fsverity anyway if someone were to actually need support for non-inline hardware offload in fsverity (and if I decide to accept that despite many hardware drivers being buggy). But really the best way to do this would be to support ahash alongside shash, like what I'm proposing in dm-verity -- perhaps with ahash support limited to the data blocks only, as that's the most performance critical part. The ahash code path would use the async callback to actually properly support offload, which would mean the code would be substantially different anyway due to the fundamentally different computational model, especially vs. multibuffer hashing. A "single" API, that works for both hardware offload and for the vast majority of users that simply need very low overhead software crypto, would be nice in theory. But I'm afraid it's been shown repeatedly that just doesn't work... The existence of lib/crypto/, shash, lskcipher, and scomp all reflect this. I understand that you think the ahash based API would make it easier to add multibuffer support to "authenc(hmac(sha256),cbc(aes))" for IPsec, which seems to be a very important use case for you (though it isn't relevant to nearly as many systems as dm-verity and fsverity are). Regardless, the reality is that it would be much more difficult to take advantage of multibuffer crypto in the IPsec authenc use case than in dm-verity and fsverity. authenc uses multiple underlying algorithms, AES-CBC and HMAC-SHA256, that would both have to use multibuffer crypto in order to see a significant benefit, seeing as even if the SHA-256 support could be wired up through HMAC-SHA256, encryption would be bottlenecked on AES-CBC, especially on Intel CPUs. It also looks like the IPsec code would need a lot of updates to support multibuffer crypto. At the same time, an easier way to optimize "authenc(hmac(sha256),cbc(aes))" would be to add an implementation of it to arch/x86/crypto/ that interleaves the AES-CBC and SHA-256 and also avoids the overhead associated with the template based approach (such as all the extra indirect calls). Of course, that would require writing assembly, but so would multibuffer crypto. It seems unlikely that someone would step in to do all the work to implement a multibuffer optimization for this algorithm and its use in IPsec, when no one has ever bothered to optimize the single-buffer case in the first place, which has been possible all along and would require no API or IPsec changes... In any case, any potential multi-request support in ahash, skcipher, or aead should be separate considerations from the simple function in shash that I'm proposing. It makes sense to have the shash support regardless. Ultimately, I need to have dm-verity and fsverity be properly optimized in the downstreams that are most relevant to me. If you're not going to allow the upstream crypto API to provide the needed functionality in a reasonable way, then I'll need to shift my focus to getting this patchset into downstream kernels such as Android and Chrome OS instead. - Eric ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-10 16:42 ` Eric Biggers @ 2024-06-11 15:21 ` Herbert Xu 2024-06-11 15:39 ` Herbert Xu ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Herbert Xu @ 2024-06-11 15:21 UTC (permalink / raw) To: Eric Biggers Cc: Ard Biesheuvel, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Tim Chen On Mon, Jun 10, 2024 at 09:42:58AM -0700, Eric Biggers wrote: > > I understand that you think the ahash based API would make it easier to add > multibuffer support to "authenc(hmac(sha256),cbc(aes))" for IPsec, which seems > to be a very important use case for you (though it isn't relevant to nearly as > many systems as dm-verity and fsverity are). Regardless, the reality is that it > would be much more difficult to take advantage of multibuffer crypto in the > IPsec authenc use case than in dm-verity and fsverity. authenc uses multiple > underlying algorithms, AES-CBC and HMAC-SHA256, that would both have to use > multibuffer crypto in order to see a significant benefit, seeing as even if the > SHA-256 support could be wired up through HMAC-SHA256, encryption would be > bottlenecked on AES-CBC, especially on Intel CPUs. It also looks like the IPsec > code would need a lot of updates to support multibuffer crypto. The linked-request thing feeds nicely into networking. In fact that's where I got the idea of linking them from. In networking a large GSO (currently limited to 64K but theoretically we could make it unlimited) packet is automatically split up into a linked list of MTU-sized skb's. Therefore if we switched to a linked-list API networking could give us the buffers with minimal changes. BTW, I found an old Intel paper that claims through their multi- buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM. I wonder if we could still replicate this today: https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf > Ultimately, I need to have dm-verity and fsverity be properly optimized in the > downstreams that are most relevant to me. If you're not going to allow the > upstream crypto API to provide the needed functionality in a reasonable way, > then I'll need to shift my focus to getting this patchset into downstream > kernels such as Android and Chrome OS instead. I totally understand that this is your priority. But please give me some time to see if we can devise something that works for both scenarios. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-11 15:21 ` Herbert Xu @ 2024-06-11 15:39 ` Herbert Xu 2024-06-11 20:32 ` Eric Biggers 2024-06-11 15:46 ` Ard Biesheuvel 2024-06-11 20:18 ` Eric Biggers 2 siblings, 1 reply; 18+ messages in thread From: Herbert Xu @ 2024-06-11 15:39 UTC (permalink / raw) To: Eric Biggers Cc: Ard Biesheuvel, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Tim Chen On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote: > > Therefore if we switched to a linked-list API networking could > give us the buffers with minimal changes. BTW, this is not just about parallelising hashing. Just as one of the most significant benefits of GSO does not come from hardware offload, but rather the amortisation of (network) stack overhead. IOW you're traversing a very deep stack once instead of 40 times (this is the factor for 64K vs MTU, if we extend beyond 64K (which we absolute should do) the benefit would increase as well). The same should apply to the Crypto API. So even if this was a purely software solution with no assembly code at all, it may well improve GCM performance (at least for users able to feed us bulk data, like networking). Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-11 15:39 ` Herbert Xu @ 2024-06-11 20:32 ` Eric Biggers 0 siblings, 0 replies; 18+ messages in thread From: Eric Biggers @ 2024-06-11 20:32 UTC (permalink / raw) To: Herbert Xu Cc: Ard Biesheuvel, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Tim Chen On Tue, Jun 11, 2024 at 11:39:08PM +0800, Herbert Xu wrote: > On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote: > > > > Therefore if we switched to a linked-list API networking could > > give us the buffers with minimal changes. > > BTW, this is not just about parallelising hashing. Just as one of > the most significant benefits of GSO does not come from hardware > offload, but rather the amortisation of (network) stack overhead. > IOW you're traversing a very deep stack once instead of 40 times > (this is the factor for 64K vs MTU, if we extend beyond 64K (which > we absolute should do) the benefit would increase as well). > > The same should apply to the Crypto API. So even if this was a > purely software solution with no assembly code at all, it may well > improve GCM performance (at least for users able to feed us bulk > data, like networking). > At best this would save an indirect call per message, if the underlying algorithm explicitly added support for it and the user of the API migrated to the multi-request model. This alone doesn't seem worth the effort of migrating to multi-request, especially considering the many other already-possible optimizations that would not require API changes or migrating users to multi-request. The x86_64 AES-GCM is pretty well optimized now after my recent patches, but there's still an indirect call associated with the use of the SIMD helper which could be eliminated, saving one per message (already as much as we could hope to get from multi-request). authenc on the other hand is almost totally unoptimized, as I mentioned before; it makes little sense to talk about any sort of multi-request optimization for it at this point. - Eric ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-11 15:21 ` Herbert Xu 2024-06-11 15:39 ` Herbert Xu @ 2024-06-11 15:46 ` Ard Biesheuvel 2024-06-11 15:51 ` Herbert Xu 2024-06-11 20:18 ` Eric Biggers 2 siblings, 1 reply; 18+ messages in thread From: Ard Biesheuvel @ 2024-06-11 15:46 UTC (permalink / raw) To: Herbert Xu Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Tim Chen On Tue, 11 Jun 2024 at 17:21, Herbert Xu <herbert@gondor.apana.org.au> wrote: > > On Mon, Jun 10, 2024 at 09:42:58AM -0700, Eric Biggers wrote: > > > > I understand that you think the ahash based API would make it easier to add > > multibuffer support to "authenc(hmac(sha256),cbc(aes))" for IPsec, which seems > > to be a very important use case for you (though it isn't relevant to nearly as > > many systems as dm-verity and fsverity are). Regardless, the reality is that it > > would be much more difficult to take advantage of multibuffer crypto in the > > IPsec authenc use case than in dm-verity and fsverity. authenc uses multiple > > underlying algorithms, AES-CBC and HMAC-SHA256, that would both have to use > > multibuffer crypto in order to see a significant benefit, seeing as even if the > > SHA-256 support could be wired up through HMAC-SHA256, encryption would be > > bottlenecked on AES-CBC, especially on Intel CPUs. It also looks like the IPsec > > code would need a lot of updates to support multibuffer crypto. > > The linked-request thing feeds nicely into networking. In fact > that's where I got the idea of linking them from. In networking > a large GSO (currently limited to 64K but theoretically we could > make it unlimited) packet is automatically split up into a linked > list of MTU-sized skb's. > > Therefore if we switched to a linked-list API networking could > give us the buffers with minimal changes. > > BTW, I found an old Intel paper that claims through their multi- > buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM. > I wonder if we could still replicate this today: > > https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf > This looks like the whitepaper that describes the buggy multibuffer code that we ripped out. > > Ultimately, I need to have dm-verity and fsverity be properly optimized in the > > downstreams that are most relevant to me. If you're not going to allow the > > upstream crypto API to provide the needed functionality in a reasonable way, > > then I'll need to shift my focus to getting this patchset into downstream > > kernels such as Android and Chrome OS instead. > > I totally understand that this is your priority. But please give > me some time to see if we can devise something that works for both > scenarios. > The issue here is that the CPU based multibuffer approach has rather tight constraints in terms of input length and the shared prefix, and so designing a more generic API based on ahash doesn't help at all. The intel multibuffer code went off into the weeds entirely attempting to apply this parallel scheme to arbitrary combinations of inputs, so this is something we know we should avoid. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-11 15:46 ` Ard Biesheuvel @ 2024-06-11 15:51 ` Herbert Xu 0 siblings, 0 replies; 18+ messages in thread From: Herbert Xu @ 2024-06-11 15:51 UTC (permalink / raw) To: Ard Biesheuvel Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Tim Chen On Tue, Jun 11, 2024 at 05:46:01PM +0200, Ard Biesheuvel wrote: > > The issue here is that the CPU based multibuffer approach has rather > tight constraints in terms of input length and the shared prefix, and > so designing a more generic API based on ahash doesn't help at all. > The intel multibuffer code went off into the weeds entirely attempting > to apply this parallel scheme to arbitrary combinations of inputs, so > this is something we know we should avoid. The sha-mb approach failed because it failed to aggregate the data properly. By driving this from the data sink, it was doomed to fail. The correct way to aggregate data is to do it at the source. The user (of the Crypto API) knows exactlty how much data they want to hash and how it's structured. They should be supplying that info to the API so it can use multi-buffer where applicable. Even where multi-buffer isn't available, they would at least benefit from making a single indirect call into the Crypto stack instead of N calls. When N is large (which is almost always the case for TCP) this produces a non-trivial saving. Sure I understand that you guys are more than happy with N=2 but please let me at least try this out and see if we could make this work for a large value of N. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing 2024-06-11 15:21 ` Herbert Xu 2024-06-11 15:39 ` Herbert Xu 2024-06-11 15:46 ` Ard Biesheuvel @ 2024-06-11 20:18 ` Eric Biggers 2 siblings, 0 replies; 18+ messages in thread From: Eric Biggers @ 2024-06-11 20:18 UTC (permalink / raw) To: Herbert Xu Cc: Ard Biesheuvel, Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche, Tim Chen On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote: > > BTW, I found an old Intel paper that claims through their multi- > buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM. > I wonder if we could still replicate this today: > > https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf No, not even close. Even assuming that the lack of parallelizability in AES-CBC and AES-XCBC can be entirely compensated for via multibuffer crypto (which really it can't -- consider single packets, for example), doing AES twice is much more expensive than doing AES and GHASH. GHASH is a universal hash function, and computing a universal hash function is inherently cheaper than computing a cryptographic hash function. But also modern Intel CPUs have very fast carryless multiplication, and it uses a different execution port from what AES uses. So the overhead of AES + GHASH over AES alone is very small. By doing AES twice, you'd be entirely bottlenecked by the ports that can execute the AES instructions, while the other ports go nearly unused. So it would probably be approaching twice as slow as AES-GCM. Westmere (2010) through Ivy Bridge (2012) are the only Intel CPUs where multibuffer AES-CBC-XCBC could plausibly be faster than AES-GCM (given a sufficiently large number of messages at once), due to the very slow pclmulqdq instruction on those CPUs. This is long since fixed, as pclmulqdq became much faster in Haswell (2013), and faster still in Broadwell. This is exactly what that Intel paper shows; they show AES-GCM becoming fastest in "Gen 4", i.e. Haswell. The paper is from 2012, so of course they don't show anything after that. But AES-GCM has only pulled ahead even more since then. In theory something like AES-CBC + SHA-256 could be slightly more competitive than AES-CBC + AES-XCBC. But it would still be worse than simply doing AES-GCM -- which again, doesn't need multibuffer, and my recent patches have already fully optimized for recent x86_64 CPUs. - Eric ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2024-06-11 20:32 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20240603183731.108986-1-ebiggers@kernel.org>
[not found] ` <20240603183731.108986-7-ebiggers@kernel.org>
[not found] ` <Zl7gYOMyscYDKZ8_@gondor.apana.org.au>
[not found] ` <20240604184220.GC1566@sol.localdomain>
[not found] ` <ZmAthcxC8V3V3sm3@gondor.apana.org.au>
[not found] ` <ZmAuTceqwZlRJqHx@gondor.apana.org.au>
2024-06-05 9:46 ` [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing Herbert Xu
2024-06-05 19:14 ` Eric Biggers
2024-06-06 2:00 ` Herbert Xu
2024-06-06 5:28 ` Eric Biggers
2024-06-06 5:41 ` Herbert Xu
2024-06-06 6:58 ` Ard Biesheuvel
2024-06-06 7:34 ` Herbert Xu
2024-06-06 7:55 ` Ard Biesheuvel
2024-06-06 8:08 ` Herbert Xu
2024-06-06 8:33 ` Ard Biesheuvel
2024-06-06 9:15 ` Herbert Xu
2024-06-10 16:42 ` Eric Biggers
2024-06-11 15:21 ` Herbert Xu
2024-06-11 15:39 ` Herbert Xu
2024-06-11 20:32 ` Eric Biggers
2024-06-11 15:46 ` Ard Biesheuvel
2024-06-11 15:51 ` Herbert Xu
2024-06-11 20:18 ` Eric Biggers
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox