Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
       [not found]         ` <ZmAuTceqwZlRJqHx@gondor.apana.org.au>
@ 2024-06-05  9:46           ` Herbert Xu
  2024-06-05 19:14             ` Eric Biggers
  0 siblings, 1 reply; 18+ messages in thread
From: Herbert Xu @ 2024-06-05  9:46 UTC (permalink / raw)
  To: Eric Biggers, Steffen Klassert, netdev
  Cc: linux-crypto, fsverity, dm-devel, x86, linux-arm-kernel,
	Ard Biesheuvel, Sami Tolvanen, Bart Van Assche

On Wed, Jun 05, 2024 at 05:22:21PM +0800, Herbert Xu wrote:
>
> However, I really dislike the idea of shoehorning this into shash.
> I know you really like shash, but I think there are some clear
> benefits to be had by coupling this with ahash.

If we do this properly, we should be able to immediately use the
mb code with IPsec.  In the network stack, we already aggregate
the data prior to IPsec with GSO.  So at the boundary between
IPsec and the Crypto API, it's dividing chunks of data up to 64K
into 1500-byte packets and feeding them to crypto one at a time.

It really should be sending the whole chain of packets to us as
a unit.

Once we have a proper mb interface, we can fix that and immediately
get the benefit of mb hashing.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-05  9:46           ` [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing Herbert Xu
@ 2024-06-05 19:14             ` Eric Biggers
  2024-06-06  2:00               ` Herbert Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Biggers @ 2024-06-05 19:14 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86,
	linux-arm-kernel, Ard Biesheuvel, Sami Tolvanen, Bart Van Assche

On Wed, Jun 05, 2024 at 05:46:27PM +0800, Herbert Xu wrote:
> On Wed, Jun 05, 2024 at 05:22:21PM +0800, Herbert Xu wrote:
> >
> > However, I really dislike the idea of shoehorning this into shash.
> > I know you really like shash, but I think there are some clear
> > benefits to be had by coupling this with ahash.
> 
> If we do this properly, we should be able to immediately use the
> mb code with IPsec.  In the network stack, we already aggregate
> the data prior to IPsec with GSO.  So at the boundary between
> IPsec and the Crypto API, it's dividing chunks of data up to 64K
> into 1500-byte packets and feeding them to crypto one at a time.
> 
> It really should be sending the whole chain of packets to us as
> a unit.
> 
> Once we have a proper mb interface, we can fix that and immediately
> get the benefit of mb hashing.
> 

This would at most apply to AH, not to ESP.  Is AH commonly used these days?
Also even for AH, the IPsec code would need to be significantly restructured to
make use of multibuffer hashing.  See how the segmentation happens in
xfrm_output_gso(), but the ahash calls happen much lower in the stack.

I'm guessing that you've had the AH use case in mind since your earlier
comments.  Given you were originally pushing for this to be supported using the
existing async support in the ahash API (which would have required fewer code
changes on the AH side), but we now agree that is not feasible, maybe it is time
to reconsider whether it would still be worthwhile to make all the changes to
the AH code needed to support this?

Also, even if it would be worthwhile and would use ahash, ahash is almost always
just a wrapper for shash.  So the shash support would be needed anyway.

- Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-05 19:14             ` Eric Biggers
@ 2024-06-06  2:00               ` Herbert Xu
  2024-06-06  5:28                 ` Eric Biggers
  0 siblings, 1 reply; 18+ messages in thread
From: Herbert Xu @ 2024-06-06  2:00 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86,
	linux-arm-kernel, Ard Biesheuvel, Sami Tolvanen, Bart Van Assche

On Wed, Jun 05, 2024 at 12:14:10PM -0700, Eric Biggers wrote:
> 
> This would at most apply to AH, not to ESP.  Is AH commonly used these days?

No AH is completely useless.  However, this applies perfectly to
ESP, in conjunction with authenc.  Obviously we would need to add
request linking to authenc (AEAD) as well so that it can pass it
along to sha.

BTW, does any of this interleaving apply to AES? If so we should
explore adding request linking to skcipher as well.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-06  2:00               ` Herbert Xu
@ 2024-06-06  5:28                 ` Eric Biggers
  2024-06-06  5:41                   ` Herbert Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Biggers @ 2024-06-06  5:28 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86,
	linux-arm-kernel, Ard Biesheuvel, Sami Tolvanen, Bart Van Assche

On Thu, Jun 06, 2024 at 10:00:05AM +0800, Herbert Xu wrote:
> On Wed, Jun 05, 2024 at 12:14:10PM -0700, Eric Biggers wrote:
> > 
> > This would at most apply to AH, not to ESP.  Is AH commonly used these days?
> 
> No AH is completely useless.  However, this applies perfectly to
> ESP, in conjunction with authenc.  Obviously we would need to add
> request linking to authenc (AEAD) as well so that it can pass it
> along to sha.
> 
> BTW, does any of this interleaving apply to AES? If so we should
> explore adding request linking to skcipher as well.
> 

With AES, interleaving would only help with non-parallelizable modes such as CBC
encryption.  Anyone who cares about IPsec performance should of course be using
AES-GCM, which is parallelizable.  Especially since my other patch
https://lore.kernel.org/linux-crypto/20240602222221.176625-2-ebiggers@kernel.org/
is making AES-GCM twice as fast...

With hashing we unfortunately don't have the luxury of there being widely used
and accepted parallelizable algorithms.  In particular, all the SHAs are
serialized.  So that's why interleaving makes sense there.

In any case, it seems that what you're asking for at this point is far beyond
the scope of this patchset.

- Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-06  5:28                 ` Eric Biggers
@ 2024-06-06  5:41                   ` Herbert Xu
  2024-06-06  6:58                     ` Ard Biesheuvel
  0 siblings, 1 reply; 18+ messages in thread
From: Herbert Xu @ 2024-06-06  5:41 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Steffen Klassert, netdev, linux-crypto, fsverity, dm-devel, x86,
	linux-arm-kernel, Ard Biesheuvel, Sami Tolvanen, Bart Van Assche,
	Megha Dey, Tim Chen

On Wed, Jun 05, 2024 at 10:28:01PM -0700, Eric Biggers wrote:
>
> With AES, interleaving would only help with non-parallelizable modes such as CBC
> encryption.  Anyone who cares about IPsec performance should of course be using
> AES-GCM, which is parallelizable.  Especially since my other patch
> https://lore.kernel.org/linux-crypto/20240602222221.176625-2-ebiggers@kernel.org/
> is making AES-GCM twice as fast...

Algorithm selection may be limited by peer capability.  For IPsec,
if SHA is being used, then most likely CBC is also being used.

> In any case, it seems that what you're asking for at this point is far beyond
> the scope of this patchset.

I'm more than happy to take this over if you don't wish to extend
it beyond the storage usage cases.  According to the original Intel
sha2-mb submission, this should result in at least a two-fold
speed-up.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-06  5:41                   ` Herbert Xu
@ 2024-06-06  6:58                     ` Ard Biesheuvel
  2024-06-06  7:34                       ` Herbert Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Ard Biesheuvel @ 2024-06-06  6:58 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Megha Dey, Tim Chen

On Thu, 6 Jun 2024 at 07:41, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Wed, Jun 05, 2024 at 10:28:01PM -0700, Eric Biggers wrote:
> >
> > With AES, interleaving would only help with non-parallelizable modes such as CBC
> > encryption.  Anyone who cares about IPsec performance should of course be using
> > AES-GCM, which is parallelizable.  Especially since my other patch
> > https://lore.kernel.org/linux-crypto/20240602222221.176625-2-ebiggers@kernel.org/
> > is making AES-GCM twice as fast...
>
> Algorithm selection may be limited by peer capability.  For IPsec,
> if SHA is being used, then most likely CBC is also being used.
>

IPSec users relying on software crypto and authenc() and caring about
performance seems like a rather niche use case to me.

> > In any case, it seems that what you're asking for at this point is far beyond
> > the scope of this patchset.
>
> I'm more than happy to take this over if you don't wish to extend
> it beyond the storage usage cases.  According to the original Intel
> sha2-mb submission, this should result in at least a two-fold
> speed-up.
>

I'm struggling to follow this debate. Surely, if this functionality
needs to live in ahash, the shash fallbacks need to implement this
parallel scheme too, or ahash would end up just feeding the requests
into shash sequentially, defeating the purpose. It is then up to the
API client to choose between ahash or shash, just as it can today.

So Eric has a pretty strong case for his shash implementation;
kmap_local() is essentially a NOP on architectures that anyone still
cares about (unlike kmap_atomic() which still disables preemption), so
I don't have a problem with the caller relying on that in order to be
able to use shash directly. The whole scatterlist / request alloc
dance is just too tedious and pointless, given that in practice, it
all gets relegated to shash anyway.

But my point is that even if we go with Herbert's proposal for the
ahash, we'll still need something like Eric's code on the shash side.

For the true async accelerator use case, none of this should make any
difference, right? If the caller already tolerates async (but
in-order) completion, implementing this request chaining doesn't buy
it anything. So only when the caller is sync and the implementation is
async, we might be able to do something smart where the caller waits
on a single completion that signals the completion of a set of inputs.
But this is also rather niche, so not worth holding up this effort.

So Herbert, what would the ahash_to_shash plumbing look like for the
ahash API that you are proposing? What changes would it require to
shash, and how much to they deviate from what Eric is suggesting?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-06  6:58                     ` Ard Biesheuvel
@ 2024-06-06  7:34                       ` Herbert Xu
  2024-06-06  7:55                         ` Ard Biesheuvel
  0 siblings, 1 reply; 18+ messages in thread
From: Herbert Xu @ 2024-06-06  7:34 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Megha Dey, Tim Chen

On Thu, Jun 06, 2024 at 08:58:47AM +0200, Ard Biesheuvel wrote:
>
> IPSec users relying on software crypto and authenc() and caring about
> performance seems like a rather niche use case to me.

It's no more niche than fs/verity and dm-integrity.  In fact,
this could potentially be used for all algorithms.  Just the
reduction in the number of function calls may produce enough
of a benefit (this is something I observed when adding GSO,
even without any actual hardware offload, simply aggregating
packets into larger units produced a visible benefit).

> I'm struggling to follow this debate. Surely, if this functionality
> needs to live in ahash, the shash fallbacks need to implement this
> parallel scheme too, or ahash would end up just feeding the requests
> into shash sequentially, defeating the purpose. It is then up to the
> API client to choose between ahash or shash, just as it can today.

I've never suggested it adding it to shash at all.  In fact
that's my primary problem with this interface.  I think it
should be ahash only.  Just like skcipher and aead.

My reasoning is that this should cater mostly to bulk data, i.e.,
multiple pages (e.g., for IPsec we're talking about 64K chunks,
actually that (the 64K limit) is something that we should really
extend, it's not 2014 anymore).  These typically will be more
easily accessed as a number of distinct pages rather than as a
contiguous mapping.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-06  7:34                       ` Herbert Xu
@ 2024-06-06  7:55                         ` Ard Biesheuvel
  2024-06-06  8:08                           ` Herbert Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Ard Biesheuvel @ 2024-06-06  7:55 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Tim Chen

On Thu, 6 Jun 2024 at 09:34, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Jun 06, 2024 at 08:58:47AM +0200, Ard Biesheuvel wrote:
> >
> > IPSec users relying on software crypto and authenc() and caring about
> > performance seems like a rather niche use case to me.
>
> It's no more niche than fs/verity and dm-integrity.

fsverity is used by Android, which is the diametrical opposite of
niche when it comes to Linux distros.

> this could potentially be used for all algorithms.  Just the
> reduction in the number of function calls may produce enough
> of a benefit (this is something I observed when adding GSO,
> even without any actual hardware offload, simply aggregating
> packets into larger units produced a visible benefit).
>

Sure. But my point is that anyone who cares enough about IPsec
performance would have stopped using authenc(cbc(aes),hmac(sha2)) long
ago and moved to GCM or even ChaChaPoly. This is not just a matter of
efficiency in the implementation - using a general purpose hash
function such as SHA-256 [twice] rather than GHASH or Poly1305 is just
overkill.

So complicating the performance optimization of something that runs on
every (non-Apple) phone for the benefit of something that is rarely
used in a context where the performance matters seems unreasonable to
me.

> > I'm struggling to follow this debate. Surely, if this functionality
> > needs to live in ahash, the shash fallbacks need to implement this
> > parallel scheme too, or ahash would end up just feeding the requests
> > into shash sequentially, defeating the purpose. It is then up to the
> > API client to choose between ahash or shash, just as it can today.
>
> I've never suggested it adding it to shash at all.  In fact
> that's my primary problem with this interface.  I think it
> should be ahash only.  Just like skcipher and aead.
>

So again, how would that work for ahash falling back to shash. Are you
saying every existing shash implementation should be duplicated into
an ahash so that the multibuffer optimization can be added? shash is a
public interface so we cannot just remove the existing ones and we'll
end up carrying both forever.

> My reasoning is that this should cater mostly to bulk data, i.e.,
> multiple pages (e.g., for IPsec we're talking about 64K chunks,
> actually that (the 64K limit) is something that we should really
> extend, it's not 2014 anymore).  These typically will be more
> easily accessed as a number of distinct pages rather than as a
> contiguous mapping.
>

Sure, but the block I/O world is very different. Forcing it to use an
API modeled after how IPsec might use it seems, again, unreasonable.

So these 64k chunks are made up of 1500 byte frames that can be hashed
in parallel?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-06  7:55                         ` Ard Biesheuvel
@ 2024-06-06  8:08                           ` Herbert Xu
  2024-06-06  8:33                             ` Ard Biesheuvel
  0 siblings, 1 reply; 18+ messages in thread
From: Herbert Xu @ 2024-06-06  8:08 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Tim Chen

On Thu, Jun 06, 2024 at 09:55:56AM +0200, Ard Biesheuvel wrote:
>
> So again, how would that work for ahash falling back to shash. Are you
> saying every existing shash implementation should be duplicated into
> an ahash so that the multibuffer optimization can be added? shash is a
> public interface so we cannot just remove the existing ones and we'll
> end up carrying both forever.

It should do the same thing for ahash algorithms that do not support
multiple requests.  IOW it should process the requests one by one.

> Sure, but the block I/O world is very different. Forcing it to use an
> API modeled after how IPsec might use it seems, again, unreasonable.

It's not different at all.  You can see that by the proliferation
of kmap calls in fs/verity.  It's a fundamental issue.  You can't
consistently get a large contiguous allocation beyond one page due
to fragmentation.  So large data is always going to be scattered.

BTW, I'm all for elminating the overhead when you already have a
linear address for scattered memory, e.g., through vmalloc.  We
should definitely improve our interface for ahash/skcipher/aead so
that vmalloc addresses (as well as kmalloc virtual addresses by
extension) are supported as first class citizens, and we don't turn
them into SG lists unless it's necessary for DMA.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-06  8:08                           ` Herbert Xu
@ 2024-06-06  8:33                             ` Ard Biesheuvel
  2024-06-06  9:15                               ` Herbert Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Ard Biesheuvel @ 2024-06-06  8:33 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Tim Chen

On Thu, 6 Jun 2024 at 10:08, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Jun 06, 2024 at 09:55:56AM +0200, Ard Biesheuvel wrote:
> >
> > So again, how would that work for ahash falling back to shash. Are you
> > saying every existing shash implementation should be duplicated into
> > an ahash so that the multibuffer optimization can be added? shash is a
> > public interface so we cannot just remove the existing ones and we'll
> > end up carrying both forever.
>
> It should do the same thing for ahash algorithms that do not support
> multiple requests.  IOW it should process the requests one by one.
>

That is not what I am asking.

Are you suggesting that, e.g., the arm64 sha2 shash implementation
that is modified by this series should instead expose both an shash as
before, and an ahash built around the same asm code that exposes the
multibuffer capability?

> > Sure, but the block I/O world is very different. Forcing it to use an
> > API modeled after how IPsec might use it seems, again, unreasonable.
>
> It's not different at all.  You can see that by the proliferation
> of kmap calls in fs/verity.  It's a fundamental issue.  You can't
> consistently get a large contiguous allocation beyond one page due
> to fragmentation.  So large data is always going to be scattered.
>

I don't think this is true for many uses of the block layer.

> BTW, I'm all for elminating the overhead when you already have a
> linear address for scattered memory, e.g., through vmalloc.  We
> should definitely improve our interface for ahash/skcipher/aead so
> that vmalloc addresses (as well as kmalloc virtual addresses by
> extension) are supported as first class citizens, and we don't turn
> them into SG lists unless it's necessary for DMA.
>

Yes, this is something I've been pondering for a while. An
ahash/skcipher/aead with CRYPTO_ALG_ASYNC cleared (which would
guarantee that any provided VA would not be referenced after the algo
invocation returns) should be able to consume a request that carries
virtual addresses rather than SG lists. Given that it is up to the
caller to choose between sync and async, it would be in a good
position also to judge whether it wants to use stack/vmalloc
addresses.

I'll have a stab at this.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-06  8:33                             ` Ard Biesheuvel
@ 2024-06-06  9:15                               ` Herbert Xu
  2024-06-10 16:42                                 ` Eric Biggers
  0 siblings, 1 reply; 18+ messages in thread
From: Herbert Xu @ 2024-06-06  9:15 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Tim Chen

On Thu, Jun 06, 2024 at 10:33:15AM +0200, Ard Biesheuvel wrote:
>
> Are you suggesting that, e.g., the arm64 sha2 shash implementation
> that is modified by this series should instead expose both an shash as
> before, and an ahash built around the same asm code that exposes the
> multibuffer capability?

Yes the multi-request handling should be implemented as ahash
only.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-06  9:15                               ` Herbert Xu
@ 2024-06-10 16:42                                 ` Eric Biggers
  2024-06-11 15:21                                   ` Herbert Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Biggers @ 2024-06-10 16:42 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ard Biesheuvel, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Tim Chen

Hi Herbert,

On Thu, Jun 06, 2024 at 05:15:16PM +0800, Herbert Xu wrote:
> On Thu, Jun 06, 2024 at 10:33:15AM +0200, Ard Biesheuvel wrote:
> >
> > Are you suggesting that, e.g., the arm64 sha2 shash implementation
> > that is modified by this series should instead expose both an shash as
> > before, and an ahash built around the same asm code that exposes the
> > multibuffer capability?
> 
> Yes the multi-request handling should be implemented as ahash
> only.
> 

That is not a reasonable way to do it.  It would be much more complex, more
error-prone, and slower than my proposal.  Also your proposal wouldn't actually
bring you much closer to being able to optimize your authenc use case.

First, each algorithm implementation that wants to support your scheme would
need to implement the ahash API alongside shash.  This would be a lot of
duplicated code, and it would need to support all quirks of the ahash API
including scatterlists.

Second, allowing an ahash_request to actually be a list of requests creates an
ambiguity where it will often be unclear when code is supposed to operate on an
ahash_request individually vs. on the whole list.  This is error-prone, as it
invites bugs where a crypto operation is done on only one request of the list.
This is a bad design, especially for a cryptographic API.  We would have
crypto_ahash_*() do some checks to prevent multi-requests from reaching
algorithms that don't support the particular kind of request being made.  But
there will be a lot of cases to consider.

Third, it would be hard to actually implement multibuffer hashing given an
arbitrary list of requests.  For multibuffer hashing to work, the lengths of all
the buffers must be synced up, including the internal buffers in the hash states
as well as every buffer that is produced while walking the scatterlists.  We can
place constraints on what is supported, but those constraints will need to be
clearly specified, and each algorithm will actually need to check for them and
do something sane (error or fallback) when they are not met.  Note that it would
be possible for the messages to get out of sync in the middle of walking the
scatterlists, which would be difficult to handle.  All this would add a lot of
complexity and overhead compared to my proposal, which naturally expresses the
same-length constraints in the API.

Fourth, having the API be ahash-only would also force fsverity to switch back to
the ahash API, which would add complexity and overhead.  The shash API is easier
to use even when working with pages, as the diffstat of commit 8fcd94add6c5
clearly shows.  The ahash API also has more overhead, including in doing the
needed memory allocation, setting up a scatterlist, initializing required but
irrelevant or redundant fields in the ahash_request, and shash_ahash_finup()
running the fully generalized scatterlist walker on the scatterlist just to
finally end up with a pointer which the caller could have provided directly.
All this overhead adds up, even when hashing 4K blocks.  Back when CPU-based
crypto was slow this didn't really matter, but now it's fast and these small
overheads are significant when trying to keep up with storage device speeds
(which are also fast now).  E.g., even disregarding the memory allocation,
hashing a 4K block is about 5% faster with shash than with ahash on x86_64.

Of course, I'd need to support ahash in fsverity anyway if someone were to
actually need support for non-inline hardware offload in fsverity (and if I
decide to accept that despite many hardware drivers being buggy).  But really
the best way to do this would be to support ahash alongside shash, like what I'm
proposing in dm-verity -- perhaps with ahash support limited to the data blocks
only, as that's the most performance critical part.  The ahash code path would
use the async callback to actually properly support offload, which would mean
the code would be substantially different anyway due to the fundamentally
different computational model, especially vs. multibuffer hashing.  A "single"
API, that works for both hardware offload and for the vast majority of users
that simply need very low overhead software crypto, would be nice in theory.
But I'm afraid it's been shown repeatedly that just doesn't work...  The
existence of lib/crypto/, shash, lskcipher, and scomp all reflect this.

I understand that you think the ahash based API would make it easier to add
multibuffer support to "authenc(hmac(sha256),cbc(aes))" for IPsec, which seems
to be a very important use case for you (though it isn't relevant to nearly as
many systems as dm-verity and fsverity are).  Regardless, the reality is that it
would be much more difficult to take advantage of multibuffer crypto in the
IPsec authenc use case than in dm-verity and fsverity.  authenc uses multiple
underlying algorithms, AES-CBC and HMAC-SHA256, that would both have to use
multibuffer crypto in order to see a significant benefit, seeing as even if the
SHA-256 support could be wired up through HMAC-SHA256, encryption would be
bottlenecked on AES-CBC, especially on Intel CPUs.  It also looks like the IPsec
code would need a lot of updates to support multibuffer crypto.

At the same time, an easier way to optimize "authenc(hmac(sha256),cbc(aes))"
would be to add an implementation of it to arch/x86/crypto/ that interleaves the
AES-CBC and SHA-256 and also avoids the overhead associated with the template
based approach (such as all the extra indirect calls).  Of course, that would
require writing assembly, but so would multibuffer crypto.  It seems unlikely
that someone would step in to do all the work to implement a multibuffer
optimization for this algorithm and its use in IPsec, when no one has ever
bothered to optimize the single-buffer case in the first place, which has been
possible all along and would require no API or IPsec changes...

In any case, any potential multi-request support in ahash, skcipher, or aead
should be separate considerations from the simple function in shash that I'm
proposing.  It makes sense to have the shash support regardless.

Ultimately, I need to have dm-verity and fsverity be properly optimized in the
downstreams that are most relevant to me.  If you're not going to allow the
upstream crypto API to provide the needed functionality in a reasonable way,
then I'll need to shift my focus to getting this patchset into downstream
kernels such as Android and Chrome OS instead.

- Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-10 16:42                                 ` Eric Biggers
@ 2024-06-11 15:21                                   ` Herbert Xu
  2024-06-11 15:39                                     ` Herbert Xu
                                                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Herbert Xu @ 2024-06-11 15:21 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Ard Biesheuvel, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Tim Chen

On Mon, Jun 10, 2024 at 09:42:58AM -0700, Eric Biggers wrote:
>
> I understand that you think the ahash based API would make it easier to add
> multibuffer support to "authenc(hmac(sha256),cbc(aes))" for IPsec, which seems
> to be a very important use case for you (though it isn't relevant to nearly as
> many systems as dm-verity and fsverity are).  Regardless, the reality is that it
> would be much more difficult to take advantage of multibuffer crypto in the
> IPsec authenc use case than in dm-verity and fsverity.  authenc uses multiple
> underlying algorithms, AES-CBC and HMAC-SHA256, that would both have to use
> multibuffer crypto in order to see a significant benefit, seeing as even if the
> SHA-256 support could be wired up through HMAC-SHA256, encryption would be
> bottlenecked on AES-CBC, especially on Intel CPUs.  It also looks like the IPsec
> code would need a lot of updates to support multibuffer crypto.

The linked-request thing feeds nicely into networking.  In fact
that's where I got the idea of linking them from.  In networking
a large GSO (currently limited to 64K but theoretically we could
make it unlimited) packet is automatically split up into a linked
list of MTU-sized skb's.

Therefore if we switched to a linked-list API networking could
give us the buffers with minimal changes.

BTW, I found an old Intel paper that claims through their multi-
buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM.
I wonder if we could still replicate this today:

https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf
 
> Ultimately, I need to have dm-verity and fsverity be properly optimized in the
> downstreams that are most relevant to me.  If you're not going to allow the
> upstream crypto API to provide the needed functionality in a reasonable way,
> then I'll need to shift my focus to getting this patchset into downstream
> kernels such as Android and Chrome OS instead.

I totally understand that this is your priority.  But please give
me some time to see if we can devise something that works for both
scenarios.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-11 15:21                                   ` Herbert Xu
@ 2024-06-11 15:39                                     ` Herbert Xu
  2024-06-11 20:32                                       ` Eric Biggers
  2024-06-11 15:46                                     ` Ard Biesheuvel
  2024-06-11 20:18                                     ` Eric Biggers
  2 siblings, 1 reply; 18+ messages in thread
From: Herbert Xu @ 2024-06-11 15:39 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Ard Biesheuvel, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Tim Chen

On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote:
>
> Therefore if we switched to a linked-list API networking could
> give us the buffers with minimal changes.

BTW, this is not just about parallelising hashing.  Just as one of
the most significant benefits of GSO does not come from hardware
offload, but rather the amortisation of (network) stack overhead.
IOW you're traversing a very deep stack once instead of 40 times
(this is the factor for 64K vs MTU, if we extend beyond 64K (which
we absolute should do) the benefit would increase as well).

The same should apply to the Crypto API.  So even if this was a
purely software solution with no assembly code at all, it may well
improve GCM performance (at least for users able to feed us bulk
data, like networking).

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-11 15:39                                     ` Herbert Xu
@ 2024-06-11 20:32                                       ` Eric Biggers
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Biggers @ 2024-06-11 20:32 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ard Biesheuvel, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Tim Chen

On Tue, Jun 11, 2024 at 11:39:08PM +0800, Herbert Xu wrote:
> On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote:
> >
> > Therefore if we switched to a linked-list API networking could
> > give us the buffers with minimal changes.
> 
> BTW, this is not just about parallelising hashing.  Just as one of
> the most significant benefits of GSO does not come from hardware
> offload, but rather the amortisation of (network) stack overhead.
> IOW you're traversing a very deep stack once instead of 40 times
> (this is the factor for 64K vs MTU, if we extend beyond 64K (which
> we absolute should do) the benefit would increase as well).
> 
> The same should apply to the Crypto API.  So even if this was a
> purely software solution with no assembly code at all, it may well
> improve GCM performance (at least for users able to feed us bulk
> data, like networking).
> 

At best this would save an indirect call per message, if the underlying
algorithm explicitly added support for it and the user of the API migrated to
the multi-request model.  This alone doesn't seem worth the effort of migrating
to multi-request, especially considering the many other already-possible
optimizations that would not require API changes or migrating users to
multi-request.  The x86_64 AES-GCM is pretty well optimized now after my recent
patches, but there's still an indirect call associated with the use of the SIMD
helper which could be eliminated, saving one per message (already as much as we
could hope to get from multi-request).  authenc on the other hand is almost
totally unoptimized, as I mentioned before; it makes little sense to talk about
any sort of multi-request optimization for it at this point.

- Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-11 15:21                                   ` Herbert Xu
  2024-06-11 15:39                                     ` Herbert Xu
@ 2024-06-11 15:46                                     ` Ard Biesheuvel
  2024-06-11 15:51                                       ` Herbert Xu
  2024-06-11 20:18                                     ` Eric Biggers
  2 siblings, 1 reply; 18+ messages in thread
From: Ard Biesheuvel @ 2024-06-11 15:46 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Tim Chen

On Tue, 11 Jun 2024 at 17:21, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Mon, Jun 10, 2024 at 09:42:58AM -0700, Eric Biggers wrote:
> >
> > I understand that you think the ahash based API would make it easier to add
> > multibuffer support to "authenc(hmac(sha256),cbc(aes))" for IPsec, which seems
> > to be a very important use case for you (though it isn't relevant to nearly as
> > many systems as dm-verity and fsverity are).  Regardless, the reality is that it
> > would be much more difficult to take advantage of multibuffer crypto in the
> > IPsec authenc use case than in dm-verity and fsverity.  authenc uses multiple
> > underlying algorithms, AES-CBC and HMAC-SHA256, that would both have to use
> > multibuffer crypto in order to see a significant benefit, seeing as even if the
> > SHA-256 support could be wired up through HMAC-SHA256, encryption would be
> > bottlenecked on AES-CBC, especially on Intel CPUs.  It also looks like the IPsec
> > code would need a lot of updates to support multibuffer crypto.
>
> The linked-request thing feeds nicely into networking.  In fact
> that's where I got the idea of linking them from.  In networking
> a large GSO (currently limited to 64K but theoretically we could
> make it unlimited) packet is automatically split up into a linked
> list of MTU-sized skb's.
>
> Therefore if we switched to a linked-list API networking could
> give us the buffers with minimal changes.
>
> BTW, I found an old Intel paper that claims through their multi-
> buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM.
> I wonder if we could still replicate this today:
>
> https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf
>

This looks like the whitepaper that describes the buggy multibuffer
code that we ripped out.

> > Ultimately, I need to have dm-verity and fsverity be properly optimized in the
> > downstreams that are most relevant to me.  If you're not going to allow the
> > upstream crypto API to provide the needed functionality in a reasonable way,
> > then I'll need to shift my focus to getting this patchset into downstream
> > kernels such as Android and Chrome OS instead.
>
> I totally understand that this is your priority.  But please give
> me some time to see if we can devise something that works for both
> scenarios.
>

The issue here is that the CPU based multibuffer approach has rather
tight constraints in terms of input length and the shared prefix, and
so designing a more generic API based on ahash doesn't help at all.
The intel multibuffer code went off into the weeds entirely attempting
to apply this parallel scheme to arbitrary combinations of inputs, so
this is something we know we should avoid.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-11 15:46                                     ` Ard Biesheuvel
@ 2024-06-11 15:51                                       ` Herbert Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Herbert Xu @ 2024-06-11 15:51 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Eric Biggers, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Tim Chen

On Tue, Jun 11, 2024 at 05:46:01PM +0200, Ard Biesheuvel wrote:
>
> The issue here is that the CPU based multibuffer approach has rather
> tight constraints in terms of input length and the shared prefix, and
> so designing a more generic API based on ahash doesn't help at all.
> The intel multibuffer code went off into the weeds entirely attempting
> to apply this parallel scheme to arbitrary combinations of inputs, so
> this is something we know we should avoid.

The sha-mb approach failed because it failed to aggregate the data
properly.  By driving this from the data sink, it was doomed to fail.

The correct way to aggregate data is to do it at the source.  The
user (of the Crypto API) knows exactlty how much data they want to
hash and how it's structured.  They should be supplying that info
to the API so it can use multi-buffer where applicable.  Even where
multi-buffer isn't available, they would at least benefit from making
a single indirect call into the Crypto stack instead of N calls.
When N is large (which is almost always the case for TCP) this
produces a non-trivial saving.

Sure I understand that you guys are more than happy with N=2 but
please let me at least try this out and see if we could make this
work for a large value of N.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing
  2024-06-11 15:21                                   ` Herbert Xu
  2024-06-11 15:39                                     ` Herbert Xu
  2024-06-11 15:46                                     ` Ard Biesheuvel
@ 2024-06-11 20:18                                     ` Eric Biggers
  2 siblings, 0 replies; 18+ messages in thread
From: Eric Biggers @ 2024-06-11 20:18 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ard Biesheuvel, Steffen Klassert, netdev, linux-crypto, fsverity,
	dm-devel, x86, linux-arm-kernel, Sami Tolvanen, Bart Van Assche,
	Tim Chen

On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote:
> 
> BTW, I found an old Intel paper that claims through their multi-
> buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM.
> I wonder if we could still replicate this today:
> 
> https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf

No, not even close.  Even assuming that the lack of parallelizability in AES-CBC
and AES-XCBC can be entirely compensated for via multibuffer crypto (which
really it can't -- consider single packets, for example), doing AES twice is
much more expensive than doing AES and GHASH.  GHASH is a universal hash
function, and computing a universal hash function is inherently cheaper than
computing a cryptographic hash function.  But also modern Intel CPUs have very
fast carryless multiplication, and it uses a different execution port from what
AES uses.  So the overhead of AES + GHASH over AES alone is very small.  By
doing AES twice, you'd be entirely bottlenecked by the ports that can execute
the AES instructions, while the other ports go nearly unused.  So it would
probably be approaching twice as slow as AES-GCM.

Westmere (2010) through Ivy Bridge (2012) are the only Intel CPUs where
multibuffer AES-CBC-XCBC could plausibly be faster than AES-GCM (given a
sufficiently large number of messages at once), due to the very slow pclmulqdq
instruction on those CPUs.  This is long since fixed, as pclmulqdq became much
faster in Haswell (2013), and faster still in Broadwell.  This is exactly what
that Intel paper shows; they show AES-GCM becoming fastest in "Gen 4", i.e.
Haswell.  The paper is from 2012, so of course they don't show anything after
that.  But AES-GCM has only pulled ahead even more since then.

In theory something like AES-CBC + SHA-256 could be slightly more competitive
than AES-CBC + AES-XCBC.  But it would still be worse than simply doing AES-GCM
-- which again, doesn't need multibuffer, and my recent patches have already
fully optimized for recent x86_64 CPUs.

- Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-06-11 20:32 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20240603183731.108986-1-ebiggers@kernel.org>
     [not found] ` <20240603183731.108986-7-ebiggers@kernel.org>
     [not found]   ` <Zl7gYOMyscYDKZ8_@gondor.apana.org.au>
     [not found]     ` <20240604184220.GC1566@sol.localdomain>
     [not found]       ` <ZmAthcxC8V3V3sm3@gondor.apana.org.au>
     [not found]         ` <ZmAuTceqwZlRJqHx@gondor.apana.org.au>
2024-06-05  9:46           ` [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing Herbert Xu
2024-06-05 19:14             ` Eric Biggers
2024-06-06  2:00               ` Herbert Xu
2024-06-06  5:28                 ` Eric Biggers
2024-06-06  5:41                   ` Herbert Xu
2024-06-06  6:58                     ` Ard Biesheuvel
2024-06-06  7:34                       ` Herbert Xu
2024-06-06  7:55                         ` Ard Biesheuvel
2024-06-06  8:08                           ` Herbert Xu
2024-06-06  8:33                             ` Ard Biesheuvel
2024-06-06  9:15                               ` Herbert Xu
2024-06-10 16:42                                 ` Eric Biggers
2024-06-11 15:21                                   ` Herbert Xu
2024-06-11 15:39                                     ` Herbert Xu
2024-06-11 20:32                                       ` Eric Biggers
2024-06-11 15:46                                     ` Ard Biesheuvel
2024-06-11 15:51                                       ` Herbert Xu
2024-06-11 20:18                                     ` Eric Biggers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox