Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs

public inbox for linux-crypto@vger.kernel.org

From: Eric Biggers <ebiggers@kernel.org>
To: David Laight <David.Laight@aculab.com>
Cc: Ard Biesheuvel <ardb@kernel.org>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
	"x86@kernel.org" <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Andy Lutomirski <luto@kernel.org>,
	"Chang S . Bae" <chang.seok.bae@intel.com>
Subject: Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
Date: Fri, 5 Apr 2024 15:19:04 -0400
Message-ID: <20240405191904.GA1205@quark.localdomain>
In-Reply-To: <142077804bee45daac3b0fad8bc4c2fe@AcuMS.aculab.com>

On Thu, Apr 04, 2024 at 07:53:48AM +0000, David Laight wrote:
> > >
> > > How much does the kernel_fpu_begin() cost on real workloads?
> > > (ie when the registers are live and it forces an extra save/restore)
> > 
> > x86 Linux does lazy restore of the FPU state.  The first kernel_fpu_begin() can
> > have a significant cost, as it issues an XSAVE (or equivalent) instruction and
> > causes an XRSTOR (or equivalent) instruction to be issued when returning to
> > userspace when it otherwise might not be needed.  Additional kernel_fpu_begin()
> > / kernel_fpu_end() pairs without returning to userspace have only a small cost,
> > as they don't cause any more saves or restores of the FPU state to be done.
> > 
> > My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end()
> > pair per message (if the message doesn't span any page boundaries, which is
> > almost always the case).  That's exactly the same as the current xts-aes-aesni.
> 
> I realised after sending it that the code almost certainly already did
> kernel_fpu_begin() - so there probably isn't a difference because all the
> fpu state is always saved.
> (I'm sure there should be a way of getting access to (say) 2 ymm registers
> by providing an on-stack save area to allow wide data copies or special
> instructions - but that is a different issue.)
> 
> > I think what you may really be asking is how much the overhead of the XSAVE /
> > XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel
> > clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does.
> > That's much more relevant to this patchset.
> 
> It depends on what has to be saved, not on what is used.
> Although, since all the x/y/zmm registers are caller-saved I think they could
> be 'zapped' on syscall entry (and restored as zero later).
> Trouble is I suspect there is a single piece of code somewhere that relies
> on them being preserved across an inlined system call.
> 
> > I think the answer is that there is no additional overhead.  This is because the
> > XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers,
> > and it operates on the userspace state, not the kernel's.  Some of the newer
> > variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where
> > they don't save parts of the state that are unmodified since the last XRSTOR;
> > however, that is unimportant here because the kernel's FPU state is never saved.
> > 
> > (This would change if x86 Linux were to support preemption of kernel-mode FPU
> > code.  In that case, we may need to take more care to minimize use of AVX and
> > AVX512 state.  That being said, AES-XTS tends to be used for bulk data anyway.)
> > 
> > This is based on theory, though.  I'll do a test to confirm that there's indeed
> > no additional overhead.  And also, even if there's no additional overhead, what
> > the existing overhead actually is.
> 
> Yes, I was wondering how it is used for 'real applications'.
> If a system call that would normally return immediately (or at least without
> a full process switch) hits the aes code it gets the cost of the XSAVE added.
> Whereas the benchmark probably doesn't do anywhere near as many.
> 
> OTOH this is probably no different.

I did some tests on Sapphire Rapids using a system call that I customized to do
nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.

On average the bare syscall took 70 ns.  The syscall with the kernel_fpu_begin /
kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
if it used ymm, or 360 ns if it used zmm.  I also tried making the kernel
clobber different registers in the kernel_fpu_begin / kernel_fpu_end section,
and as I expected this did not make any difference.
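
For reference, a measurement of this kind could be sketched from userspace
roughly as follows.  This is only an illustration: the customized syscall used
for the numbers above is not shown in this mail, so SYS_getppid stands in as an
assumed "does almost nothing" syscall, and the helper names are hypothetical.
How the userspace side touched ymm or zmm state is likewise not shown here, so
that part is left out.

```c
/*
 * Userspace timing sketch.  NOT the customized syscall from the test above;
 * the caller picks a cheap stand-in such as SYS_getppid (assumption).
 */
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average wall-clock ns per call over 'iters' back-to-back syscalls. */
static double avg_syscall_ns(long nr, unsigned long iters)
{
	uint64_t start;

	assert(iters > 0);
	start = now_ns();
	for (unsigned long i = 0; i < iters; i++)
		syscall(nr);
	return (double)(now_ns() - start) / (double)iters;
}
```

Calling avg_syscall_ns(SYS_getppid, 1000000) gives a rough per-call figure.
The loop and clock overhead are included in the result, so it is only
meaningful for comparing variants of the same loop against each other.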

Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
takes about 2235 ns.  With xts-aes-vaes-avx10_512 it takes 75 ns.  (Not a typo --
it really is almost 30 times faster!)  So it seems clear the FPU state save and
restore is worth it even just for a single sector using the traditional 512-byte
sector size, let alone a 4096-byte sector size which is recommended these days.
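
The arithmetic behind that conclusion can be spelled out using only the figures
quoted above.  Two things in this sketch are assumptions rather than
measurements: that the extra cost of the kernel_fpu_begin / kernel_fpu_end pair
is the difference from the bare syscall, and that this cost simply adds to the
75 ns encryption time.  The function names are made up for illustration.

```c
#include <assert.h>

/* Figures quoted in this mail (Sapphire Rapids), all in ns. */
enum {
	SYSCALL_BARE_NS    = 70,
	SYSCALL_FPU_XMM_NS = 160,	/* userspace used xmm only */
	SYSCALL_FPU_YMM_NS = 340,	/* userspace used ymm */
	SYSCALL_FPU_ZMM_NS = 360,	/* userspace used zmm */
	XTS_GENERIC_512_NS = 2235,	/* xts(ecb(aes-generic)), one 512-byte sector */
	XTS_VAES_512_NS    = 75,	/* xts-aes-vaes-avx10_512, one 512-byte sector */
};

/* Extra ns attributed to the kernel_fpu_begin / kernel_fpu_end pair. */
static int fpu_overhead_ns(int syscall_with_fpu_ns)
{
	return syscall_with_fpu_ns - SYSCALL_BARE_NS;
}

/* ASSUMPTION: total cost = encryption time + FPU save/restore overhead. */
static int vaes_total_ns(int syscall_with_fpu_ns)
{
	return XTS_VAES_512_NS + fpu_overhead_ns(syscall_with_fpu_ns);
}

/* Speedup over the generic code in tenths, rounded down (61 means 6.1x). */
static int speedup_tenths(int syscall_with_fpu_ns)
{
	int total = vaes_total_ns(syscall_with_fpu_ns);

	assert(total > 0);
	return XTS_GENERIC_512_NS * 10 / total;
}
```

Under those assumptions the worst case (userspace using zmm) comes to
75 + 290 = 365 ns against 2235 ns, which is still roughly 6x faster for a
single 512-byte sector; with xmm-only userspace it is 165 ns, roughly 13x.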

- Eric

Thread overview: 19+ messages
2024-03-26  8:02 [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
2024-03-26  8:02 ` [PATCH 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support Eric Biggers
2024-03-26  8:10   ` Ingo Molnar
2024-03-26  8:18     ` Eric Biggers
2024-03-26  8:28       ` Ingo Molnar
2024-03-26  8:03 ` [PATCH 2/6] crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs Eric Biggers
2024-03-26  8:03 ` [PATCH 3/6] crypto: x86/aes-xts - wire up AESNI + AVX implementation Eric Biggers
2024-03-26  8:03 ` [PATCH 4/6] crypto: x86/aes-xts - wire up VAES + AVX2 implementation Eric Biggers
2024-03-26  8:03 ` [PATCH 5/6] crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation Eric Biggers
2024-03-26  8:03 ` [PATCH 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation Eric Biggers
2024-03-26  8:51 ` [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Ard Biesheuvel
2024-03-26 16:47   ` Eric Biggers
2024-04-03  8:12     ` David Laight
2024-04-04  1:35       ` Eric Biggers
2024-04-04  7:53         ` David Laight
2024-04-05 19:19           ` Eric Biggers [this message]
2024-04-08  7:41             ` David Laight
2024-04-08 12:31               ` Eric Biggers
2024-04-05  7:58 ` Herbert Xu
