From: Andy Lutomirski <luto@amacapital.net>
To: Andi Kleen <andi@firstfloor.org>, x86@kernel.org
Cc: linux-kernel@vger.kernel.org, Andi Kleen <ak@linux.intel.com>
Subject: Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
Date: Tue, 29 Apr 2014 11:19:57 -0700
Message-ID: <535FED4D.5000703@amacapital.net>
In-Reply-To: <1398723161-21968-5-git-send-email-andi@firstfloor.org>
On 04/28/2014 03:12 PM, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> IvyBridge added new instructions to directly write the fs and gs
> 64bit base registers. Previously this had to be done with a system
> call to write to MSRs. The main use case is fast user space threading,
> which needs to switch the fs/gs base registers quickly.
>
> The instructions are opt-in and have to be explicitly enabled
> by the OS.
>
> Previously Linux couldn't support this because the paranoid
> entry code relied on the gs base never being negative outside
> the kernel to decide when to use swapgs. It would check the gs MSR
> value and assume it was already running in the kernel if the value
> was negative.
>
> This patch changes the paranoid entry code to use rdgsbase
> if available. Then we check the GS value against the expected GS value
> stored at the bottom of the IST stack. If the value is the expected
> value we skip swapgs.
>
> This is also significantly faster than an MSR read, so it will
> speed up NMIs (critical for profiling).
>
> An alternative would have been to save/restore the GS value
> unconditionally, but this approach requires fewer changes.
>
> Then after these changes we also need to use the new instructions
> to save/restore fs and gs, so that the new values set by
> users won't disappear. This is also significantly
> faster for the case when the 64bit base has to be switched
> (that is, when the GS base is above 4GB), as we can replace
> the slow MSR write with a faster wr[fg]sbase execution.
>
> The instructions do not context switch
> the segment index, so the old invariant that the fs or gs index
> has to be 0 for a different 64bit value to stick is still
> true. Previously it was enforced by arch_prctl; now the user
> program has to make sure it keeps the segment indexes zero.
> If it doesn't, the changes may not stick.
>
> This in turn enables fast switching when there are
> enough threads that their TLS segments do not fit below 4GB;
> alternatively, programs that use fs as an additional base
> register will not get a significant context switch penalty.
>
> It is all done in a single patch to avoid bisect crash
> holes.
>
> +paranoid_save_gs:
> + .byte 0xf3,0x48,0x0f,0xae,0xc9 # rdgsbaseq %rcx
> + movq $-EXCEPTION_STKSZ,%rax # non debug stack size
> + cmpq $DEBUG_STACK,ORIG_RAX+8(%rsp)
> + movq $-1,ORIG_RAX+8(%rsp) # no syscall to restart
> + jne 1f
> + movq $-DEBUG_STKSZ,%rax # debug stack size
> +1:
> + andq %rsp,%rax # bottom of stack
> + movq (%rax),%rdi # get expected GS
> + cmpq %rdi,%rcx # is it the kernel gs?
I don't like this part. There are now three cases:
1. User gs, gsbase != kernel gs base. This works the same as before
2. Kernel gs. This also works the same as before.
3. User gs, but gsbase == kernel gs base. This will cause C code to
execute on the *user* gs base.
Case 3 is annoying. If nothing tries to change the user gs base, then
everything is okay because the user gs base and the kernel gs bases are
equal. But if something does try to change the user gs base, then it
will accidentally change the kernel gs base instead.
For the IST entries, this should be fine -- cpu migration, scheduling,
and such are impossible anyway. For the non-IST entries, I'm less
convinced. The entry_64.S code suggests that the problematic entries are:
double_fault
stack_segment
machine_check
Of course, all of those entries really do use IST, so I wonder why they
are paranoid*entry instead of paranoid*entry_ist. Is it because they're
supposedly non-recursive?
In any case, wouldn't this all be much simpler and less magical if the
paranoid entries just saved the old gsbase to the rbx and loaded the new
ones? The exits could do the inverse. This should be really fast:
rdgsbaseq %rbx
wrgsbaseq {the correct value}
...
wrgsbaseq %rbx
This still doesn't support changing the usergs value inside a paranoid
entry, but at least it will fail consistently instead of only failing if
the user gs has a particular special value.
I don't know the actual latencies, but I suspect that this would be
faster, too -- it removes some branches, and wrgsbase and rdgsbase
deserve to be faster than swapgs. It's probably no good for
non-rd/wrgsbase-capable cpus, though, since I suspect that three MSR
accesses are much worse than one MSR access and two swapgs calls.
--Andy