From: Eric Biggers <ebiggers@kernel.org>
To: David Laight <David.Laight@aculab.com>
Cc: Ard Biesheuvel <ardb@kernel.org>,
"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
"x86@kernel.org" <x86@kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Andy Lutomirski <luto@kernel.org>,
"Chang S . Bae" <chang.seok.bae@intel.com>
Subject: Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
Date: Fri, 5 Apr 2024 15:19:04 -0400 [thread overview]
Message-ID: <20240405191904.GA1205@quark.localdomain> (raw)
In-Reply-To: <142077804bee45daac3b0fad8bc4c2fe@AcuMS.aculab.com>
On Thu, Apr 04, 2024 at 07:53:48AM +0000, David Laight wrote:
> > >
> > > How much does the kernel_fpu_begin() cost on real workloads?
> > > (ie when the registers are live and it forces an extra save/restore)
> >
> > x86 Linux does lazy restore of the FPU state. The first kernel_fpu_begin() can
> > have a significant cost, as it issues an XSAVE (or equivalent) instruction and
> > causes an XRSTOR (or equivalent) instruction to be issued when returning to
> > userspace when it otherwise might not be needed. Additional kernel_fpu_begin()
> > / kernel_fpu_end() pairs without returning to userspace have only a small cost,
> > as they don't cause any more saves or restores of the FPU state to be done.
> >
> > My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end()
> > pair per message (if the message doesn't span any page boundaries, which is
> > almost always the case). That's exactly the same as the current xts-aes-aesni.
>
> I realised after sending it that the code almost certainly already did
> kernel_fpu_begin() - so there probably isn't a difference because all the
> fpu state is always saved.
> (I'm sure there should be a way of getting access to (say) 2 ymm registers
> by providing an on-stack save area to allow wide data copies or special
> instructions - but that is a different issue.)
>
> > I think what you may really be asking is how much the overhead of the XSAVE /
> > XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel
> > clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does.
> > That's much more relevant to this patchset.
>
> It depends on what has to be saved, not on what is used.
> Although, since all the x/y/zmm registers are caller-saved I think they could
> be 'zapped' on syscall entry (and restored as zero later).
> Trouble is I suspect there is a single piece of code somewhere that relies
> on them being preserved across an inlined system call.
>
> > I think the answer is that there is no additional overhead. This is because the
> > XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers,
> > and it operates on the userspace state, not the kernel's. Some of the newer
> > variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where
> > they don't save parts of the state that are unmodified since the last XRSTOR;
> > however, that is unimportant here because the kernel's FPU state is never saved.
> >
> > (This would change if x86 Linux were to support preemption of kernel-mode FPU
> > code. In that case, we may need to take more care to minimize use of AVX and
> > AVX512 state. That being said, AES-XTS tends to be used for bulk data anyway.)
> >
> > This is based on theory, though. I'll do a test to confirm that there's indeed
> > no additional overhead. And also, even if there's no additional overhead, what
> > the existing overhead actually is.
>
> Yes, I was wondering how it is used for 'real applications'.
> If a system call that would normally return immediately (or at least without
> a full process switch) hits the aes code it gets the cost of the XSAVE added.
> Whereas the benchmark probably doesn't do anywhere near as many.
>
> OTOH this is probably no different.
I did some tests on Sapphire Rapids using a system call that I customized to do
nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.
On average the bare syscall took 70 ns. The syscall with the kernel_fpu_begin /
kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
if it used ymm, or 360 ns if it used zmm. I also tried making the kernel
clobber different registers in the kernel_fpu_begin / kernel_fpu_end section,
and as I expected this did not make any difference.
Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
takes about 2235ns. With xts-aes-vaes-avx10_512 it takes 75 ns. (Not a typo --
it really is almost 30 times faster!) So it seems clear the FPU state save and
restore is worth it even just for a single sector using the traditional 512-byte
sector size, let alone a 4096-byte sector size which is recommended these days.
- Eric
next prev parent reply other threads:[~2024-04-05 19:19 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-26 8:02 [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
2024-03-26 8:02 ` [PATCH 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support Eric Biggers
2024-03-26 8:10 ` Ingo Molnar
2024-03-26 8:18 ` Eric Biggers
2024-03-26 8:28 ` Ingo Molnar
2024-03-26 8:03 ` [PATCH 2/6] crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs Eric Biggers
2024-03-26 8:03 ` [PATCH 3/6] crypto: x86/aes-xts - wire up AESNI + AVX implementation Eric Biggers
2024-03-26 8:03 ` [PATCH 4/6] crypto: x86/aes-xts - wire up VAES + AVX2 implementation Eric Biggers
2024-03-26 8:03 ` [PATCH 5/6] crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation Eric Biggers
2024-03-26 8:03 ` [PATCH 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation Eric Biggers
2024-03-26 8:51 ` [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Ard Biesheuvel
2024-03-26 16:47 ` Eric Biggers
2024-04-03 8:12 ` David Laight
2024-04-04 1:35 ` Eric Biggers
2024-04-04 7:53 ` David Laight
2024-04-05 19:19 ` Eric Biggers [this message]
2024-04-08 7:41 ` David Laight
2024-04-08 12:31 ` Eric Biggers
2024-04-05 7:58 ` Herbert Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240405191904.GA1205@quark.localdomain \
--to=ebiggers@kernel.org \
--cc=David.Laight@aculab.com \
--cc=ardb@kernel.org \
--cc=chang.seok.bae@intel.com \
--cc=linux-crypto@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@kernel.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox