From: Ingo Molnar <mingo@kernel.org>
To: Uros Bizjak <ubizjak@gmail.com>
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
Sean Christopherson <seanjc@google.com>,
Nadav Amit <namit@vmware.com>, Andy Lutomirski <luto@kernel.org>,
Brian Gerst <brgerst@gmail.com>,
Denys Vlasenko <dvlasenk@redhat.com>,
"H . Peter Anvin" <hpa@zytor.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Josh Poimboeuf <jpoimboe@redhat.com>,
Borislav Petkov <bp@alien8.de>
Subject: Re: [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE
Date: Mon, 16 Oct 2023 13:14:54 +0200 [thread overview]
Message-ID: <ZS0bLvcC46tHjM/G@gmail.com> (raw)
In-Reply-To: <20231015202523.189168-3-ubizjak@gmail.com>
* Uros Bizjak <ubizjak@gmail.com> wrote:
> Sean says:
>
> "A significant percentage of data accesses in Intel's TDX-Module[*] use
> this pattern, e.g. even global data is relative to GS.base in the module
> due its rather odd and restricted environment. Back in the early days
> of TDX, the module used RD{FS,GS}BASE instead of prefixes to get
> pointers to per-CPU and global data structures in the TDX-Module. It's
> been a few years so I forget the exact numbers, but at the time a single
> transition between guest and host would have something like ~100 reads
> of FS.base or GS.base. Switching from RD{FS,GS}BASE to prefixed accesses
> reduced the latency for a guest<->host transition through the TDX-Module
> by several thousand cycles, as every RD{FS,GS}BASE had a latency of
> ~18 cycles (again, going off 3+ year old memories).
>
> The TDX-Module code is pretty much a pathological worth case scenario,
> but I suspect its usage is very similar to most usage of raw_cpu_ptr(),
> e.g. get a pointer to some data structure and then do multiple
> reads/writes from/to that data structure.
>
> The other wrinkle with RD{FS,FS}GSBASE is that they are trivially easy
[ Obsessive-compulsive nitpicking:
s/RD{FS,FS}GSBASE
/RD{FS,GS}BASE
]
> to emulate. If a hypervisor/VMM is advertising FSGSBASE even when it's
> not supported by hardware, e.g. to migrate VMs to older hardware, then
> every RDGSBASE will end up taking a few thousand cycles
> (#UD -> VM-Exit -> emulate). I would be surprised if any hypervisor
> actually does this as it would be easier/smarter to simply not advertise
> FSGSBASE if migrating to older hardware might be necessary, e.g. KVM
> doesn't support emulating RD{FS,GS}BASE. But at the same time, the whole
> reason I stumbled on the TDX-Module's sub-optimal RD{FS,GS}BASE usage was
> because I had hacked KVM to emulate RD{FS,GS}BASE so that I could do KVM
> TDX development on older hardware. I.e. it's not impossible that this
> code could run on hardware where RDGSBASE is emulated in software.
>
> {RD,WR}{FS,GS}BASE were added as faster alternatives to {RD,WR}MSR,
> not to accelerate actual accesses to per-CPU data, TLS, etc. E.g.
> loading a 64-bit base via a MOV to FS/GS is impossible. And presumably
> saving a userspace controlled by actually accessing FS/GS is dangerous
> for one reason or another.
>
> The instructions are guarded by a CR4 bit, the ucode cost just to check
> CR4.FSGSBASE is probably non-trivial."
BTW., a side note regarding the very last paragraph and the CR4 bit ucode
cost, given that SMAP is CR4 controlled too:
#define X86_CR4_FSGSBASE_BIT 16 /* enable RDWRFSGS support */
#define X86_CR4_FSGSBASE _BITUL(X86_CR4_FSGSBASE_BIT)
...
#define X86_CR4_SMAP_BIT 21 /* enable SMAP support */
#define X86_CR4_SMAP _BITUL(X86_CR4_SMAP_BIT)
And this modifies the behavior of STAC/CLAC, of which we have ~300
instances in a defconfig kernel image:
kepler:~/tip> objdump -wdr vmlinux | grep -w 'stac' x | wc -l
119
kepler:~/tip> objdump -wdr vmlinux | grep -w 'clac' x | wc -l
188
Are we certain that ucode on modern x86 CPUs check CR4 for every affected
instruction?
Could they perhaps use something faster, such as internal
microcode-patching (is that a thing?), to turn support for certain
instructions on/off when the relevant CR4 bit is modified, without
having to genuinely access CR4 for every instruction executed?
Thanks,
Ingo
next prev parent reply other threads:[~2023-10-16 11:15 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-15 20:24 [PATCH -tip 1/3] x86/percpu: Rewrite arch_raw_cpu_ptr() Uros Bizjak
2023-10-15 20:24 ` [PATCH -tip 2/3] x86/percpu: Use C for arch_raw_cpu_ptr() Uros Bizjak
2023-10-16 11:09 ` [tip: x86/percpu] x86/percpu: Use C for arch_raw_cpu_ptr(), to improve code generation tip-bot2 for Uros Bizjak
2023-10-15 20:24 ` [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE Uros Bizjak
2023-10-16 11:14 ` Ingo Molnar [this message]
2023-10-16 19:29 ` Sean Christopherson
2023-10-16 19:54 ` Linus Torvalds
2023-10-16 11:09 ` [tip: x86/percpu] x86/percpu: Rewrite arch_raw_cpu_ptr() to be easier for compilers to optimize tip-bot2 for Uros Bizjak
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZS0bLvcC46tHjM/G@gmail.com \
--to=mingo@kernel.org \
--cc=bp@alien8.de \
--cc=brgerst@gmail.com \
--cc=dvlasenk@redhat.com \
--cc=hpa@zytor.com \
--cc=jpoimboe@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@kernel.org \
--cc=namit@vmware.com \
--cc=peterz@infradead.org \
--cc=seanjc@google.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=ubizjak@gmail.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.