* [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-09 12:56 [RFC PATCH v2 0/6] Per process PTI activation Willy Tarreau
@ 2018-01-09 12:56 ` Willy Tarreau
2018-01-10 7:15 ` Ingo Molnar
2018-01-10 8:22 ` Peter Zijlstra
0 siblings, 2 replies; 55+ messages in thread
From: Willy Tarreau @ 2018-01-09 12:56 UTC (permalink / raw)
To: linux-kernel, x86
Cc: Willy Tarreau, Andy Lutomirski, Borislav Petkov, Brian Gerst,
Dave Hansen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
When a syscall returns to userspace with pti_disable set, it means the
current mm is configured to disable page table isolation (PTI). In this
case, returns from kernel to user will not switch the CR3, leaving it
to the kernel one which already maps both user and kernel pages. This
avoids a TLB flush, and saves another one on next entry.
Thanks to these changes, haproxy running under KVM went back from
12700 conn/s (without PCID) or 19700 (with PCID) to 23100 once loaded
after calling prctl(), indicating that PTI has no measurable impact on
this workload.
Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kees Cook <keescook@chromium.org>
v2:
- use pti_disable instead of task flag
---
arch/x86/entry/calling.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 2c0d3b5..5361a10 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -229,6 +229,11 @@
.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+ /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
+ cmpb $0, PER_CPU_VAR(pti_disable)
+ jne .Lend_\@
+
mov %cr3, \scratch_reg
ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
--
1.7.12.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-09 12:56 ` [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set Willy Tarreau
@ 2018-01-10 7:15 ` Ingo Molnar
2018-01-10 7:23 ` Willy Tarreau
2018-01-10 8:22 ` Peter Zijlstra
1 sibling, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2018-01-10 7:15 UTC (permalink / raw)
To: Willy Tarreau
Cc: linux-kernel, x86, Andy Lutomirski, Borislav Petkov, Brian Gerst,
Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
* Willy Tarreau <w@1wt.eu> wrote:
> When a syscall returns to userspace with pti_disable set, it means the
> current mm is configured to disable page table isolation (PTI). In this
> case, returns from kernel to user will not switch the CR3, leaving it
> to the kernel one which already maps both user and kernel pages. This
> avoids a TLB flush, and saves another one on next entry.
>
> Thanks to these changes, haproxy running under KVM went back from
> 12700 conn/s (without PCID) or 19700 (with PCID) to 23100 once loaded
> after calling prctl(), indicating that PTI has no measurable impact on
> this workload.
>
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Kees Cook <keescook@chromium.org>
>
> v2:
> - use pti_disable instead of task flag
> ---
> arch/x86/entry/calling.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 2c0d3b5..5361a10 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -229,6 +229,11 @@
>
> .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> +
> + /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
> + cmpb $0, PER_CPU_VAR(pti_disable)
> + jne .Lend_\@
Could you please do this small change for future iterations:
s/per-cpu
/per-CPU
... to make the spelling more consistent with the rest of the code base?
Thanks,
Ingo
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-10 7:15 ` Ingo Molnar
@ 2018-01-10 7:23 ` Willy Tarreau
0 siblings, 0 replies; 55+ messages in thread
From: Willy Tarreau @ 2018-01-10 7:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, x86, Andy Lutomirski, Borislav Petkov, Brian Gerst,
Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Wed, Jan 10, 2018 at 08:15:10AM +0100, Ingo Molnar wrote:
> > + /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
> > + cmpb $0, PER_CPU_VAR(pti_disable)
> > + jne .Lend_\@
>
> Could you please do this small change for future iterations:
>
> s/per-cpu
> /per-CPU
>
> ... to make the spelling more consistent with the rest of the code base?
Now done as well, thank you. Do not hesitate to send future cosmetic
changes directly to me so that we keep a max of participants mostly
focused on the design choices. (and don't worry, I'll pick them, I
really appreciate these as well).
Thanks,
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-09 12:56 ` [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set Willy Tarreau
2018-01-10 7:15 ` Ingo Molnar
@ 2018-01-10 8:22 ` Peter Zijlstra
2018-01-10 9:11 ` Willy Tarreau
1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2018-01-10 8:22 UTC (permalink / raw)
To: Willy Tarreau
Cc: linux-kernel, x86, Andy Lutomirski, Borislav Petkov, Brian Gerst,
Dave Hansen, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Tue, Jan 09, 2018 at 01:56:20PM +0100, Willy Tarreau wrote:
> - use pti_disable instead of task flag
> ---
> arch/x86/entry/calling.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 2c0d3b5..5361a10 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -229,6 +229,11 @@
>
> .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> +
> + /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
> + cmpb $0, PER_CPU_VAR(pti_disable)
> + jne .Lend_\@
> +
> mov %cr3, \scratch_reg
So could you switch back to a task flag for this? That word is already
cache-hot on the exit path while your new variable is not.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-10 8:22 ` Peter Zijlstra
@ 2018-01-10 9:11 ` Willy Tarreau
2018-01-10 19:21 ` Andy Lutomirski
0 siblings, 1 reply; 55+ messages in thread
From: Willy Tarreau @ 2018-01-10 9:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, x86, Andy Lutomirski, Borislav Petkov, Brian Gerst,
Dave Hansen, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Wed, Jan 10, 2018 at 09:22:07AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 09, 2018 at 01:56:20PM +0100, Willy Tarreau wrote:
> > - use pti_disable instead of task flag
> > ---
> > arch/x86/entry/calling.h | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> > index 2c0d3b5..5361a10 100644
> > --- a/arch/x86/entry/calling.h
> > +++ b/arch/x86/entry/calling.h
> > @@ -229,6 +229,11 @@
> >
> > .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
> > ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> > +
> > + /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
> > + cmpb $0, PER_CPU_VAR(pti_disable)
> > + jne .Lend_\@
> > +
> > mov %cr3, \scratch_reg
>
> So could you switch back to a task flag for this? That word is already
> cache-hot on the exit path while your new variable is not.
That's a good point. There's already been some demands for a per-thread
setting.
What I can propose then is to partially revert the changes to have this :
- arch_prctl() adjusts the task flag and not a per-mm variable anymore
(Linus, are you OK for this ?)
- arch_prctl() only accepts to perform the action if mm->mm_users == 1
so that we don't change the setting after having created threads ;
this way the task flag is replicated to all future threads ;
- later we may decide to permit re-enabling PTI per thread if it was
disabled.
If we agree on this, I'd like to propose to have two flags :
- TIF_DISABLE_PTI_NOW : disable PTI for the current task, reset by execve()
- TIF_DISABLE_PTI_NEXT : disable PTI after execve(), reset by execve()
execve() would then simply do :
TIF_DISABLE_PTI_NOW = TIF_DISABLE_PTI_NEXT;
TIF_DISABLE_PTI_NEXT = 0;
The former would be used by applications using their own configuration.
The latter would be used by wrappers. This way we seem to cover the various
use cases. And we make this depend on a sysctl that allows the admin to
globally and permanently disable the feature and which is disabled by
default.
Any objection ?
Regards,
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-10 9:11 ` Willy Tarreau
@ 2018-01-10 19:21 ` Andy Lutomirski
2018-01-10 19:39 ` Willy Tarreau
2018-01-10 19:50 ` Linus Torvalds
0 siblings, 2 replies; 55+ messages in thread
From: Andy Lutomirski @ 2018-01-10 19:21 UTC (permalink / raw)
To: Willy Tarreau
Cc: Peter Zijlstra, LKML, X86 ML, Andy Lutomirski, Borislav Petkov,
Brian Gerst, Dave Hansen, Ingo Molnar, Linus Torvalds,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Wed, Jan 10, 2018 at 1:11 AM, Willy Tarreau <w@1wt.eu> wrote:
> On Wed, Jan 10, 2018 at 09:22:07AM +0100, Peter Zijlstra wrote:
>> On Tue, Jan 09, 2018 at 01:56:20PM +0100, Willy Tarreau wrote:
>> > - use pti_disable instead of task flag
>> > ---
>> > arch/x86/entry/calling.h | 5 +++++
>> > 1 file changed, 5 insertions(+)
>> >
>> > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
>> > index 2c0d3b5..5361a10 100644
>> > --- a/arch/x86/entry/calling.h
>> > +++ b/arch/x86/entry/calling.h
>> > @@ -229,6 +229,11 @@
>> >
>> > .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
>> > ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>> > +
>> > + /* The "pti_disable" mm attribute is mirrored into this per-cpu var */
>> > + cmpb $0, PER_CPU_VAR(pti_disable)
>> > + jne .Lend_\@
>> > +
>> > mov %cr3, \scratch_reg
>>
>> So could you switch back to a task flag for this? That word is already
>> cache-hot on the exit path while your new variable is not.
>
> That's a good point. There's already been some demands for a per-thread
> setting.
>
> What I can propose then is to partially revert the changes to have this :
>
> - arch_prctl() adjusts the task flag and not a per-mm variable anymore
> (Linus, are you OK for this ?)
>
> - arch_prctl() only accepts to perform the action if mm->mm_users == 1
> so that we don't change the setting after having created threads ;
> this way the task flag is replicated to all future threads ;
>
> - later we may decide to permit re-enabling PTI per thread if it was
> disabled.
>
> If we agree on this, I'd like to propose to have two flags :
>
> - TIF_DISABLE_PTI_NOW : disable PTI for the current task, reset by execve()
> - TIF_DISABLE_PTI_NEXT : disable PTI after execve(), reset by execve()
I really dislike state that isn't cleared on execve(). I'm assuming
that this is so you can run time pwn_me_without_pti whatever? Surely
LD_PRELOAD can do this, too?
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-10 19:21 ` Andy Lutomirski
@ 2018-01-10 19:39 ` Willy Tarreau
2018-01-10 19:44 ` Andy Lutomirski
2018-01-10 19:50 ` Linus Torvalds
1 sibling, 1 reply; 55+ messages in thread
From: Willy Tarreau @ 2018-01-10 19:39 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Peter Zijlstra, LKML, X86 ML, Borislav Petkov, Brian Gerst,
Dave Hansen, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
Hi Andy,
On Wed, Jan 10, 2018 at 11:21:15AM -0800, Andy Lutomirski wrote:
> > If we agree on this, I'd like to propose to have two flags :
> >
> > - TIF_DISABLE_PTI_NOW : disable PTI for the current task, reset by execve()
> > - TIF_DISABLE_PTI_NEXT : disable PTI after execve(), reset by execve()
>
> I really dislike state that isn't cleared on execve(). I'm assuming
> that this is so you can run time pwn_me_without_pti whatever?
Yes exactly. I've just sent a 3rd series with an example code for this.
In fact it's not that the state is not cleared by execve(), it's that
it's set for the next execve() which then resets it.
> Surely LD_PRELOAD can do this, too?
That was one of my other proposals. I really don't know if LD_PRELOAD
fits anyone's usage for such things (static/setuid binaries, complication
to pass variables maybe).
Please take a look and tell me if you still dislike it or not.
thanks!
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-10 19:39 ` Willy Tarreau
@ 2018-01-10 19:44 ` Andy Lutomirski
0 siblings, 0 replies; 55+ messages in thread
From: Andy Lutomirski @ 2018-01-10 19:44 UTC (permalink / raw)
To: Willy Tarreau
Cc: Andy Lutomirski, Peter Zijlstra, LKML, X86 ML, Borislav Petkov,
Brian Gerst, Dave Hansen, Ingo Molnar, Linus Torvalds,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Wed, Jan 10, 2018 at 11:39 AM, Willy Tarreau <w@1wt.eu> wrote:
> Hi Andy,
>
> On Wed, Jan 10, 2018 at 11:21:15AM -0800, Andy Lutomirski wrote:
>> > If we agree on this, I'd like to propose to have two flags :
>> >
>> > - TIF_DISABLE_PTI_NOW : disable PTI for the current task, reset by execve()
>> > - TIF_DISABLE_PTI_NEXT : disable PTI after execve(), reset by execve()
>>
>> I really dislike state that isn't cleared on execve(). I'm assuming
>> that this is so you can run time pwn_me_without_pti whatever?
>
> Yes exactly. I've just sent a 3rd series with an example code for this.
> In fact it's not that the state is not cleared by execve(), it's that
> it's set for the next execve() which then resets it.
>
>> Surely LD_PRELOAD can do this, too?
>
> That was one of my other proposals. I really don't know if LD_PRELOAD
> fits anyone's usage for such things (static/setuid binaries, complication
> to pass variables maybe).
>
> Please take a look and tell me if you still dislike it or not.
>
Adding flags that persist across execve() of setuid stuff is IMO even
worse. It just makes reasoning about the system's security model
really nasty.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-10 19:21 ` Andy Lutomirski
2018-01-10 19:39 ` Willy Tarreau
@ 2018-01-10 19:50 ` Linus Torvalds
2018-01-10 20:04 ` Andy Lutomirski
2018-01-11 6:42 ` Willy Tarreau
1 sibling, 2 replies; 55+ messages in thread
From: Linus Torvalds @ 2018-01-10 19:50 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Willy Tarreau, Peter Zijlstra, LKML, X86 ML, Borislav Petkov,
Brian Gerst, Dave Hansen, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Wed, Jan 10, 2018 at 11:21 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
> I really dislike state that isn't cleared on execve(). I'm assuming
> that this is so you can run time pwn_me_without_pti whatever? Surely
> LD_PRELOAD can do this, too?
Andy, what the hell is wrong with you?
You are actively trying to screw this whole interface up, aren't you?
LD_PRELOAD cannot work for a wrapper, for the simple reason that it
runs in the same context as the process. So if you want to say "I want
to run this process without PTI", but you don't want to run the
process with elevated privileges, LD_PRELOAD doesn't work.
It's like saying "why do we need 'sudo' - let's juat make a LD_PRELOAD
that sets uid to zero instead".
The "let's do it per thread" made no sense either, since that's
fundamentally not how page tables work, and it's complete broken shit.
And the whole "NOW" vs "NEXT" is complete garbage. The obvious sane
no-PTI interface is that it
(a) inherits on fork/exec, so that you don't have to worry about how
something is implemented (think "I want to run this kernel build
without the PTI overhead", but also "I want to run this system daemon
without PTI").
(b) actual domain changes clear it (ie suid, whatever).
that make it useful for random uses of "I trust service XYZ".
So I'm NAK'ing this whole series on the grounds that it has several
completely insane semantics and really need to be clarified, and where
actual usage needs to be thought about a lot more.
Linus
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-10 19:50 ` Linus Torvalds
@ 2018-01-10 20:04 ` Andy Lutomirski
2018-01-11 6:42 ` Willy Tarreau
1 sibling, 0 replies; 55+ messages in thread
From: Andy Lutomirski @ 2018-01-10 20:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Willy Tarreau, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Dave Hansen, Ingo Molnar,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Wed, Jan 10, 2018 at 11:50 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Jan 10, 2018 at 11:21 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> I really dislike state that isn't cleared on execve(). I'm assuming
>> that this is so you can run time pwn_me_without_pti whatever? Surely
>> LD_PRELOAD can do this, too?
>
> Andy, what the hell is wrong with you?
>
> You are actively trying to screw this whole interface up, aren't you?
>
> LD_PRELOAD cannot work for a wrapper, for the simple reason that it
> runs in the same context as the process. So if you want to say "I want
> to run this process without PTI", but you don't want to run the
> process with elevated privileges, LD_PRELOAD doesn't work.
Oh, right, duh. Brain was off.
> The "let's do it per thread" made no sense either, since that's
> fundamentally not how page tables work, and it's complete broken shit.
I still disagree with you here. The whole concept of per-thread or
per-mm or per-whatever PTI disablement is if the admin for some reason
trusts some piece of code not to try to exploit Meltdown. But just
imagine a program like a web browser. The browser will do some
performance critical stuff (networking) and some
absolutely-no-fucking-way-would-I-turn-off-PTI stuff (running
scripts). So per-thread seems totally sensible to me. No one sane
would ever do this for a web browser, but I can easily imagine it for
something like a web *server* or even a database server.
Just logically, too, per-thread is the obvious semantics. Whether we
rewrite CR3 when we go to usermode is a thing affecting that thread.
The only reason the mm has anything to do with it is the NX trick.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-10 19:50 ` Linus Torvalds
2018-01-10 20:04 ` Andy Lutomirski
@ 2018-01-11 6:42 ` Willy Tarreau
2018-01-11 15:29 ` Dave Hansen
1 sibling, 1 reply; 55+ messages in thread
From: Willy Tarreau @ 2018-01-11 6:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Peter Zijlstra, LKML, X86 ML, Borislav Petkov,
Brian Gerst, Dave Hansen, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Wed, Jan 10, 2018 at 11:50:46AM -0800, Linus Torvalds wrote:
> And the whole "NOW" vs "NEXT" is complete garbage. The obvious sane
> no-PTI interface is that it
>
> (a) inherits on fork/exec, so that you don't have to worry about how
> something is implemented (think "I want to run this kernel build
> without the PTI overhead", but also "I want to run this system daemon
> without PTI").
>
> (b) actual domain changes clear it (ie suid, whatever).
>
> that make it useful for random uses of "I trust service XYZ".
OK. Do you want to see something *only* based on a wrapper (i.e. works
only after execve) or can we let the application apply the change to
itself ? I would also like to let applications re-enable the protection
for processes they're going to exec and not necessarily trust.
> So I'm NAK'ing this whole series on the grounds that it has several
> completely insane semantics and really need to be clarified, and where
> actual usage needs to be thought about a lot more.
In fact we were trying to limit the risk of propagating the protection
removal too far, and leave it only on the sensitive process which really
requires it. But your example of "running the kernel build without PTI"
makes sense from a user's perspective, and it completely contradicts
our initial assumptions.
After all I don't think the NOW vs NEXT is so fundamentally broken. We
could think about having one option for the current process only (which
is cleared by execve) so that applications can apply it to themselves
only without having to wonder about clearing it, and another one which
is only for wrappers and which passes execve(). For now I considered
that we could stop at the first execve, but if I just remove the
clearing of the NEXT flag, it matches your requirement for the kernel
build. After this it's just a matter of naming and placing them on the
mm rather than thread.
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 6:42 ` Willy Tarreau
@ 2018-01-11 15:29 ` Dave Hansen
2018-01-11 15:44 ` Willy Tarreau
0 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2018-01-11 15:29 UTC (permalink / raw)
To: Willy Tarreau, Linus Torvalds
Cc: Andy Lutomirski, Peter Zijlstra, LKML, X86 ML, Borislav Petkov,
Brian Gerst, Ingo Molnar, Thomas Gleixner, Josh Poimboeuf,
H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On 01/10/2018 10:42 PM, Willy Tarreau wrote:
> On Wed, Jan 10, 2018 at 11:50:46AM -0800, Linus Torvalds wrote:
>> And the whole "NOW" vs "NEXT" is complete garbage. The obvious sane
>> no-PTI interface is that it
>>
>> (a) inherits on fork/exec, so that you don't have to worry about how
>> something is implemented (think "I want to run this kernel build
>> without the PTI overhead", but also "I want to run this system daemon
>> without PTI").
>>
>> (b) actual domain changes clear it (ie suid, whatever).
>>
>> that make it useful for random uses of "I trust service XYZ".
> OK. Do you want to see something *only* based on a wrapper (i.e. works
> only after execve) or can we let the application apply the change to
> itself ? I would also like to let applications re-enable the protection
> for processes they're going to exec and not necessarily trust.
I don't think we need a "NOW" and "NEXT" mode, at least initially. The
"NEXT" semantics are going to be tricky and I think "NOW" is good enough
Whatever we do, we'll need this PTI-disable flag to be able cross
exeve() so that a wrapper a la nice(1) work. Initially, I think the
default should be that it survives fork(). There are just too many
things out there that "start up" by doing a shell script that calls a
python script, that calls a...
Without the wrapper support, we're _basically_ stuck using this only in
newly-compiled binaries. That's going to make it much less likely to
get used.
The inheritance also gives an app a way to re-enable protections for
children, just from a _second_ wrapper. That's nice because it means we
don't initially need a "NEXT" ABI.
So, I'd do this:
1. Do the arch_prctl() (but ask the ARM guys what they want too)
2. Enabled for an entire process (not thread)
3. Inherited across fork/exec
4. Cleared on setuid() and friends
5. I'm sure the security folks have/want a way to force it on forever
Next, if we decide that we have things that both don't want PTI's
protections and are forking things not covered by #4, we can add some
"child opt out" in the prctl(), plus maybe marking binaries somehow.
Please don't forget to add ways to tell if this feature is on/off in
/proc or whatever. I think we also need to be able to dump the actual
CR3 value that we entered the kernel with before we start doing too much
other funky stuff with the entry code.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 15:29 ` Dave Hansen
@ 2018-01-11 15:44 ` Willy Tarreau
2018-01-11 15:51 ` Dave Hansen
` (2 more replies)
0 siblings, 3 replies; 55+ messages in thread
From: Willy Tarreau @ 2018-01-11 15:44 UTC (permalink / raw)
To: Dave Hansen
Cc: Linus Torvalds, Andy Lutomirski, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
Hi Dave,
On Thu, Jan 11, 2018 at 07:29:30AM -0800, Dave Hansen wrote:
> I don't think we need a "NOW" and "NEXT" mode, at least initially. The
> "NEXT" semantics are going to be tricky and I think "NOW" is good enough
In fact I thought the NEXT one would bring us a nice benefit which is that
we start the new process knowing the flag's value so we can decide whether
or not to apply _PAGE_NX on the pgd from the start, and never touch it
anymore.
> Whatever we do, we'll need this PTI-disable flag to be able cross
> exeve() so that a wrapper a la nice(1) work.
Absolutely!
> Initially, I think the
> default should be that it survives fork(). There are just too many
> things out there that "start up" by doing a shell script that calls a
> python script, that calls a...
Not only that, simply daemons, like most services are!
> Without the wrapper support, we're _basically_ stuck using this only in
> newly-compiled binaries. That's going to make it much less likely to
> get used.
I know, that's why I kept considering that option despite not really
needing it for my own use case.
> The inheritance also gives an app a way to re-enable protections for
> children, just from a _second_ wrapper. That's nice because it means we
> don't initially need a "NEXT" ABI.
>
> So, I'd do this:
> 1. Do the arch_prctl() (but ask the ARM guys what they want too)
> 2. Enabled for an entire process (not thread)
> 3. Inherited across fork/exec
> 4. Cleared on setuid() and friends
This one causes me a problem : some daemons already take care of dropping
privileges after the initial fork() for the sake of security. Haproxy
typically does this at boot :
- parse config
- chroot to /var/empty
- setuid(dedicated_uid)
- fork()
This ensures the process is properly isolated and hard enough to break out
of. So I'd really like this setuid() not to anihilate all we've done.
Probably that we want to drop it on suid binaries however, though I'm
having doubts about the benefits, because if the binary already allows
an intruder to inject its own meltdown code, you're quite screwed anyway.
> 5. I'm sure the security folks have/want a way to force it on forever
Sure! That's what I implemented using the sysctl.
> Next, if we decide that we have things that both don't want PTI's
> protections and are forking things not covered by #4, we can add some
> "child opt out" in the prctl(), plus maybe marking binaries somehow.
I was really thinking about using the "NOW" for this compard to the NEXT.
But I don't know what it could imply for the pgd not having the _PAGE_NX.
> Please don't forget to add ways to tell if this feature is on/off in
> /proc or whatever.
Very good idea, and it will be much more convenient than using the GET
prctl that I didn't like.
> I think we also need to be able to dump the actual
> CR3 value that we entered the kernel with before we start doing too much
> other funky stuff with the entry code.
When you say dump, you mean save it somewhere in a per_cpu variable ?
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 15:44 ` Willy Tarreau
@ 2018-01-11 15:51 ` Dave Hansen
2018-01-11 17:02 ` Andy Lutomirski
2018-01-11 17:09 ` Andy Lutomirski
2018-01-11 18:35 ` Dave Hansen
2 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2018-01-11 15:51 UTC (permalink / raw)
To: Willy Tarreau
Cc: Linus Torvalds, Andy Lutomirski, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On 01/11/2018 07:44 AM, Willy Tarreau wrote:
>> I think we also need to be able to dump the actual
>> CR3 value that we entered the kernel with before we start doing too much
>> other funky stuff with the entry code.
> When you say dump, you mean save it somewhere in a per_cpu variable ?
Yeah, I think a per-cpu variable is fine for now. But, that only gives
you a dump from a single entry to the kernel. Ideally, it would be nice
to have a stack of them so you could do things like debug
syscall->fault->oops. Was it the syscall entry or the fault entry that
lead to the oops?
But, the stack gets really fun because of NMIs.
I'm sure Andy Lutomirski has some ideas too.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 15:51 ` Dave Hansen
@ 2018-01-11 17:02 ` Andy Lutomirski
2018-01-11 18:21 ` Alexei Starovoitov
2018-01-11 20:00 ` Dave Hansen
0 siblings, 2 replies; 55+ messages in thread
From: Andy Lutomirski @ 2018-01-11 17:02 UTC (permalink / raw)
To: Dave Hansen
Cc: Willy Tarreau, Linus Torvalds, Andy Lutomirski, Peter Zijlstra,
LKML, X86 ML, Borislav Petkov, Brian Gerst, Ingo Molnar,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 7:51 AM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 01/11/2018 07:44 AM, Willy Tarreau wrote:
>>> I think we also need to be able to dump the actual
>>> CR3 value that we entered the kernel with before we start doing too much
>>> other funky stuff with the entry code.
>> When you say dump, you mean save it somewhere in a per_cpu variable ?
>
> Yeah, I think a per-cpu variable is fine for now. But, that only gives
> you a dump from a single entry to the kernel. Ideally, it would be nice
> to have a stack of them so you could do things like debug
> syscall->fault->oops. Was it the syscall entry or the fault entry that
> lead to the oops?
>
> But, the stack gets really fun because of NMIs.
>
> I'm sure Andy Lutomirski has some ideas too.
I was thinking that maybe we should add a new field or two to pt_regs.
They could store CR2 and maybe CR3 as well. I'd also like to expose
the error code of exceptions in stack traces. We should get this
integrated right into the unwinder.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 15:44 ` Willy Tarreau
2018-01-11 15:51 ` Dave Hansen
@ 2018-01-11 17:09 ` Andy Lutomirski
2018-01-11 17:40 ` Willy Tarreau
2018-01-11 18:35 ` Dave Hansen
2 siblings, 1 reply; 55+ messages in thread
From: Andy Lutomirski @ 2018-01-11 17:09 UTC (permalink / raw)
To: Willy Tarreau
Cc: Dave Hansen, Linus Torvalds, Andy Lutomirski, Peter Zijlstra,
LKML, X86 ML, Borislav Petkov, Brian Gerst, Ingo Molnar,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 7:44 AM, Willy Tarreau <w@1wt.eu> wrote:
> Hi Dave,
>
> On Thu, Jan 11, 2018 at 07:29:30AM -0800, Dave Hansen wrote:
>> I don't think we need a "NOW" and "NEXT" mode, at least initially. The
>> "NEXT" semantics are going to be tricky and I think "NOW" is good enough
>
> In fact I thought the NEXT one would bring us a nice benefit which is that
> we start the new process knowing the flag's value so we can decide whether
> or not to apply _PAGE_NX on the pgd from the start, and never touch it
> anymore.
>
>> Whatever we do, we'll need this PTI-disable flag to be able cross
>> exeve() so that a wrapper a la nice(1) work.
>
> Absolutely!
>
>> Initially, I think the
>> default should be that it survives fork(). There are just too many
>> things out there that "start up" by doing a shell script that calls a
>> python script, that calls a...
>
> Not only that, simply daemons, like most services are!
>
>> Without the wrapper support, we're _basically_ stuck using this only in
>> newly-compiled binaries. That's going to make it much less likely to
>> get used.
>
> I know, that's why I kept considering that option despite not really
> needing it for my own use case.
>
>> The inheritance also gives an app a way to re-enable protections for
>> children, just from a _second_ wrapper. That's nice because it means we
>> don't initially need a "NEXT" ABI.
>>
>> So, I'd do this:
>> 1. Do the arch_prctl() (but ask the ARM guys what they want too)
>> 2. Enabled for an entire process (not thread)
>> 3. Inherited across fork/exec
>> 4. Cleared on setuid() and friends
>
> This one causes me a problem : some daemons already take care of dropping
> privileges after the initial fork() for the sake of security. Haproxy
> typically does this at boot :
>
> - parse config
> - chroot to /var/empty
> - setuid(dedicated_uid)
> - fork()
>
> This ensures the process is properly isolated and hard enough to break out
> of. So I'd really like this setuid() not to anihilate all we've done.
> Probably that we want to drop it on suid binaries however, though I'm
> having doubts about the benefits, because if the binary already allows
> an intruder to inject its own meltdown code, you're quite screwed anyway.
>
>> 5. I'm sure the security folks have/want a way to force it on forever
>
> Sure! That's what I implemented using the sysctl.
>
All of these proposals have serious issues. For example, suppose I
have a setuid program called nopti that works like this:
$ nopti some_program
nopti verifies that some_program is trustworthy and runs it (as the
real uid of nopti's user) with PTI off. Now we have all the usual
problems: you can easily break out using ptrace(), for example. And
LD_PRELOAD gets this wrong. Et.
So I think that no-pti mode is a privilege as opposed to a mode per
se. If you can turn off PTI, then you have the ability to read all of
kernel memory So maybe we should treat it as such. Add a capability
CAP_DISABLE_PTI. If you have that capability (globally), then you can
use the arch_prctl() or regular prctl() or whatever to turn PTI on.
If you lose the cap, you lose no-pti mode as well. If an LSM wants to
block it, it can use existing mechanisms.
As for per-mm vs per-thread, let's make it only switchable in
single-threaded processes for now and inherited when threads are
created. We can change that if and when demand for the ability to
change it shows up.
(Another reason for per-thread instead of per-mm: as a per-mm thing,
you can't set it up for your descendents using vfork(); prctl();
exec(), and the latter is how your average language runtime that
spawns subprocesses would want to do it.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 17:09 ` Andy Lutomirski
@ 2018-01-11 17:40 ` Willy Tarreau
2018-01-11 17:53 ` Andy Lutomirski
2018-01-11 18:25 ` Linus Torvalds
0 siblings, 2 replies; 55+ messages in thread
From: Willy Tarreau @ 2018-01-11 17:40 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Dave Hansen, Linus Torvalds, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
Hi Andy!
On Thu, Jan 11, 2018 at 09:09:14AM -0800, Andy Lutomirski wrote:
> On Thu, Jan 11, 2018 at 7:44 AM, Willy Tarreau <w@1wt.eu> wrote:
> >> So, I'd do this:
> >> 1. Do the arch_prctl() (but ask the ARM guys what they want too)
> >> 2. Enabled for an entire process (not thread)
> >> 3. Inherited across fork/exec
> >> 4. Cleared on setuid() and friends
> >
> > This one causes me a problem : some daemons already take care of dropping
> > privileges after the initial fork() for the sake of security. Haproxy
> > typically does this at boot :
> >
> > - parse config
> > - chroot to /var/empty
> > - setuid(dedicated_uid)
> > - fork()
> >
> > This ensures the process is properly isolated and hard enough to break out
> > of. So I'd really like this setuid() not to anihilate all we've done.
> > Probably that we want to drop it on suid binaries however, though I'm
> > having doubts about the benefits, because if the binary already allows
> > an intruder to inject its own meltdown code, you're quite screwed anyway.
> >
> >> 5. I'm sure the security folks have/want a way to force it on forever
> >
> > Sure! That's what I implemented using the sysctl.
> >
>
> All of these proposals have serious issues. For example, suppose I
> have a setuid program called nopti that works like this:
>
> $ nopti some_program
>
> nopti verifies that some_program is trustworthy and runs it (as the
> real uid of nopti's user) with PTI off. Now we have all the usual
> problems: you can easily break out using ptrace(), for example. And
> LD_PRELOAD gets this wrong. Et.
Fine, we can drop it on suid binaries as I mentionned.
> So I think that no-pti mode is a privilege as opposed to a mode per
> se. If you can turn off PTI, then you have the ability to read all of
> kernel memory So maybe we should treat it as such. Add a capability
> CAP_DISABLE_PTI. If you have that capability (globally), then you can
> use the arch_prctl() or regular prctl() or whatever to turn PTI on.
> If you lose the cap, you lose no-pti mode as well.
I disagree on this, because the only alternative I have is to decide
to keep my process running as root, which is even worse, as root can
much more easily escape from a chroot jail for example, or access
/dev/mem and read all the memory as well. Also tell Linus that he'll
have to build his kernels as root ;-)
The arch_prctl() calls I proposed only allow to turn PTI off for
privileged users but any user can turn it back on. For me it's
important. A process might trust itself for its own use, but such
processes will rarely trust external processes in case they need to
perform an occasional fork+exec. Imagine for example a static web
server requiring to achieve the highest possible performance and
having to occasionally call logrotate to rotate+compress the logs.
It's better if the process knows how to turn PTI back on before
calling this.
> If an LSM wants to block it, it can use existing mechanisms.
>
> As for per-mm vs per-thread, let's make it only switchable in
> single-threaded processes for now and inherited when threads are
> created.
That's exactly what it does for now, but Linus doesn't like it at all.
So I'll switch it back to per-mm + per-CPU variable. Well he has a valid
point regarding the pgd and _PAGE_NX setting. point Now at least we know
the change is minimal if we have a good reason for doing differently
later.
> We can change that if and when demand for the ability to
> change it shows up.
>
> (Another reason for per-thread instead of per-mm: as a per-mm thing,
> you can't set it up for your descendents using vfork(); prctl();
> exec(), and the latter is how your average language runtime that
> spawns subprocesses would want to do it.
That's indeed the benefit it provides for now since I actually had
to *add* code to execve() to disable it then.
Cheers,
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 17:40 ` Willy Tarreau
@ 2018-01-11 17:53 ` Andy Lutomirski
2018-01-11 18:05 ` Willy Tarreau
2018-01-11 18:25 ` Linus Torvalds
1 sibling, 1 reply; 55+ messages in thread
From: Andy Lutomirski @ 2018-01-11 17:53 UTC (permalink / raw)
To: Willy Tarreau
Cc: Andy Lutomirski, Dave Hansen, Linus Torvalds, Peter Zijlstra,
LKML, X86 ML, Borislav Petkov, Brian Gerst, Ingo Molnar,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 9:40 AM, Willy Tarreau <w@1wt.eu> wrote:
> Hi Andy!
>
> On Thu, Jan 11, 2018 at 09:09:14AM -0800, Andy Lutomirski wrote:
>> On Thu, Jan 11, 2018 at 7:44 AM, Willy Tarreau <w@1wt.eu> wrote:
>> >> So, I'd do this:
>> >> 1. Do the arch_prctl() (but ask the ARM guys what they want too)
>> >> 2. Enabled for an entire process (not thread)
>> >> 3. Inherited across fork/exec
>> >> 4. Cleared on setuid() and friends
>> >
>> > This one causes me a problem : some daemons already take care of dropping
>> > privileges after the initial fork() for the sake of security. Haproxy
>> > typically does this at boot :
>> >
>> > - parse config
>> > - chroot to /var/empty
>> > - setuid(dedicated_uid)
>> > - fork()
>> >
>> > This ensures the process is properly isolated and hard enough to break out
>> > of. So I'd really like this setuid() not to anihilate all we've done.
>> > Probably that we want to drop it on suid binaries however, though I'm
>> > having doubts about the benefits, because if the binary already allows
>> > an intruder to inject its own meltdown code, you're quite screwed anyway.
>> >
>> >> 5. I'm sure the security folks have/want a way to force it on forever
>> >
>> > Sure! That's what I implemented using the sysctl.
>> >
>>
>> All of these proposals have serious issues. For example, suppose I
>> have a setuid program called nopti that works like this:
>>
>> $ nopti some_program
>>
>> nopti verifies that some_program is trustworthy and runs it (as the
>> real uid of nopti's user) with PTI off. Now we have all the usual
>> problems: you can easily break out using ptrace(), for example. And
>> LD_PRELOAD gets this wrong. Et.
>
> Fine, we can drop it on suid binaries as I mentionned.
>
>> So I think that no-pti mode is a privilege as opposed to a mode per
>> se. If you can turn off PTI, then you have the ability to read all of
>> kernel memory So maybe we should treat it as such. Add a capability
>> CAP_DISABLE_PTI. If you have that capability (globally), then you can
>> use the arch_prctl() or regular prctl() or whatever to turn PTI on.
>> If you lose the cap, you lose no-pti mode as well.
>
> I disagree on this, because the only alternative I have is to decide
> to keep my process running as root, which is even worse, as root can
> much more easily escape from a chroot jail for example, or access
> /dev/mem and read all the memory as well. Also tell Linus that he'll
> have to build his kernels as root ;-)
Not since Linux 4.3 :) You can set CAP_DISABLE_PTI as an "ambient"
capability and let it be inherited by all descendents even if
unprivileged. This was all very carefully designed so that a program
that inherited an ambient capability but tries to run a
less-privileged child using pre-4.3 techniques will correctly drop the
ambient capability, which is *exactly* what we want for PTI.
So I stand by my suggestion. Linus could still do:
$ nopti make -j512
and have it work, but trying to ptrace() the make process from outside
the nopti process tree (and without doing nopti ptrace) will fail, as
expected. (Unless root does the ptrace, again as expected.)
>
> The arch_prctl() calls I proposed only allow to turn PTI off for
> privileged users but any user can turn it back on. For me it's
> important. A process might trust itself for its own use, but such
> processes will rarely trust external processes in case they need to
> perform an occasional fork+exec. Imagine for example a static web
> server requiring to achieve the highest possible performance and
> having to occasionally call logrotate to rotate+compress the logs.
> It's better if the process knows how to turn PTI back on before
> calling this.
In my proposal, CAP_DISABLE_PTI doesn't turn off PTI -- it just grants
the ability to have PTI off. If you have PTI off, you can turn it
back in using prctl() or whatever. So you call prctl() (to turn PTI
back on) or capset() (to turn it on and drop the ability to turn it
off).
How exactly do you plan to efficiently call logrotate from your
webserver if the flag is per-mm, though? You don't want to fork() in
your fancy server because fork() write-protects the whole address
space. So you use clone() or vfork() (like posix_spawn() does
internally on a sane libc), but now you can't turn PTI back on if it's
per-mm because you haven't changed your mm.
I really really think it should be per thread.
>
>> If an LSM wants to block it, it can use existing mechanisms.
>>
>> As for per-mm vs per-thread, let's make it only switchable in
>> single-threaded processes for now and inherited when threads are
>> created.
>
> That's exactly what it does for now, but Linus doesn't like it at all.
> So I'll switch it back to per-mm + per-CPU variable. Well he has a valid
> point regarding the pgd and _PAGE_NX setting. point Now at least we know
> the change is minimal if we have a good reason for doing differently
> later.
Yuck, I hate this. Are you really going to wire it up complete with
all the IPIs needed to get the thing synced up right? it's also going
to run slower no matter what, I think, because you'll have to sync the
per-mm state back to the TI flags on context switches.
Linus, can you explain again why you think PTI should be a per-mm
thing? That is, why do you think it's useful and why do you think it
makes logical sense from a user's POV? I think the implementation is
easier and saner for per-thread. Also, if I we use a capability bit
for it, making it per-mm gets really weird.
--Andy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 17:53 ` Andy Lutomirski
@ 2018-01-11 18:05 ` Willy Tarreau
2018-01-11 18:15 ` Dave Hansen
0 siblings, 1 reply; 55+ messages in thread
From: Willy Tarreau @ 2018-01-11 18:05 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Dave Hansen, Linus Torvalds, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 09:53:26AM -0800, Andy Lutomirski wrote:
> >> So I think that no-pti mode is a privilege as opposed to a mode per
> >> se. If you can turn off PTI, then you have the ability to read all of
> >> kernel memory So maybe we should treat it as such. Add a capability
> >> CAP_DISABLE_PTI. If you have that capability (globally), then you can
> >> use the arch_prctl() or regular prctl() or whatever to turn PTI on.
> >> If you lose the cap, you lose no-pti mode as well.
> >
> > I disagree on this, because the only alternative I have is to decide
> > to keep my process running as root, which is even worse, as root can
> > much more easily escape from a chroot jail for example, or access
> > /dev/mem and read all the memory as well. Also tell Linus that he'll
> > have to build his kernels as root ;-)
>
> Not since Linux 4.3 :) You can set CAP_DISABLE_PTI as an "ambient"
> capability and let it be inherited by all descendents even if
> unprivileged. This was all very carefully designed so that a program
> that inherited an ambient capability but tries to run a
> less-privileged child using pre-4.3 techniques will correctly drop the
> ambient capability, which is *exactly* what we want for PTI.
Ah thanks for explaining what these "ambient" capabilities are, I saw
the term a few times but never looked closer.
> So I stand by my suggestion. Linus could still do:
>
> $ nopti make -j512
>
> and have it work, but trying to ptrace() the make process from outside
> the nopti process tree (and without doing nopti ptrace) will fail, as
> expected. (Unless root does the ptrace, again as expected.)
This may be reasonable.
> > The arch_prctl() calls I proposed only allow to turn PTI off for
> > privileged users but any user can turn it back on. For me it's
> > important. A process might trust itself for its own use, but such
> > processes will rarely trust external processes in case they need to
> > perform an occasional fork+exec. Imagine for example a static web
> > server requiring to achieve the highest possible performance and
> > having to occasionally call logrotate to rotate+compress the logs.
> > It's better if the process knows how to turn PTI back on before
> > calling this.
>
> In my proposal, CAP_DISABLE_PTI doesn't turn off PTI -- it just grants
> the ability to have PTI off. If you have PTI off, you can turn it
> back in using prctl() or whatever. So you call prctl() (to turn PTI
> back on) or capset() (to turn it on and drop the ability to turn it
> off).
Hmmm OK. I still don't like much to conflate the "turn back on" between
the two distinct calls. If the capability grants you the right to act
on prctl(), it should not perform the action in your back when disabled.
It may even lead people to care less about it which is not a good practise.
> How exactly do you plan to efficiently call logrotate from your
> webserver if the flag is per-mm, though? You don't want to fork() in
> your fancy server because fork() write-protects the whole address
> space.
I'm not following you, what's the problem here ? I mean most programs
do fork() then close() a few FDs that were not marked CLOEXEC, then
execve(). The purpose is to avoid changing existing programs too much.
> So you use clone() or vfork() (like posix_spawn() does
> internally on a sane libc), but now you can't turn PTI back on if it's
> per-mm because you haven't changed your mm.
But as soon as you write anything it's cloned, right ? Ie you write in
the stack. I could say bullshit here, but surely we have a way to split
them.
> I really really think it should be per thread.
It could be a good argument here.
> >> As for per-mm vs per-thread, let's make it only switchable in
> >> single-threaded processes for now and inherited when threads are
> >> created.
> >
> > That's exactly what it does for now, but Linus doesn't like it at all.
> > So I'll switch it back to per-mm + per-CPU variable. Well he has a valid
> > point regarding the pgd and _PAGE_NX setting. point Now at least we know
> > the change is minimal if we have a good reason for doing differently
> > later.
>
> Yuck, I hate this. Are you really going to wire it up complete with
> all the IPIs needed to get the thing synced up right? it's also going
> to run slower no matter what, I think, because you'll have to sync the
> per-mm state back to the TI flags on context switches.
At this point I'm lost, all I can do is trust you guys once you agree
on a solution :-)
> Linus, can you explain again why you think PTI should be a per-mm
> thing? That is, why do you think it's useful and why do you think it
> makes logical sense from a user's POV? I think the implementation is
> easier and saner for per-thread. Also, if I we use a capability bit
> for it, making it per-mm gets really weird.
thanks,
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:05 ` Willy Tarreau
@ 2018-01-11 18:15 ` Dave Hansen
2018-01-11 18:31 ` Linus Torvalds
0 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2018-01-11 18:15 UTC (permalink / raw)
To: Willy Tarreau, Andy Lutomirski
Cc: Linus Torvalds, Peter Zijlstra, LKML, X86 ML, Borislav Petkov,
Brian Gerst, Ingo Molnar, Thomas Gleixner, Josh Poimboeuf,
H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On 01/11/2018 10:05 AM, Willy Tarreau wrote:
>> That's exactly what it does for now, but Linus doesn't like it at all.
>> So I'll switch it back to per-mm + per-CPU variable. Well he has a valid
>> point regarding the pgd and _PAGE_NX setting. point Now at least we know
>> the change is minimal if we have a good reason for doing differently
>> later.
> Yuck, I hate this. Are you really going to wire it up complete with
> all the IPIs needed to get the thing synced up right?
Well, on the bright side, we don't need IPIs when _removing_ NX. We can
just handle those like a spurious fault.
But, when re-enabling it, we need all the TLB flushing for all the CPUs
that have run with the un-NX'd page tables. I guess that's the one
benefit if you come up with an API that only allows "disable PTI" while
a task is running but leaves execve()/fork() as places that implicitly
reenable PTI.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 17:02 ` Andy Lutomirski
@ 2018-01-11 18:21 ` Alexei Starovoitov
2018-01-11 18:30 ` Dave Hansen
2018-01-11 18:32 ` Josh Poimboeuf
2018-01-11 20:00 ` Dave Hansen
1 sibling, 2 replies; 55+ messages in thread
From: Alexei Starovoitov @ 2018-01-11 18:21 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Dave Hansen, Willy Tarreau, Linus Torvalds, Peter Zijlstra, LKML,
X86 ML, Borislav Petkov, Brian Gerst, Ingo Molnar,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 09:02:55AM -0800, Andy Lutomirski wrote:
> On Thu, Jan 11, 2018 at 7:51 AM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
> > On 01/11/2018 07:44 AM, Willy Tarreau wrote:
> >>> I think we also need to be able to dump the actual
> >>> CR3 value that we entered the kernel with before we start doing too much
> >>> other funky stuff with the entry code.
> >> When you say dump, you mean save it somewhere in a per_cpu variable ?
> >
> > Yeah, I think a per-cpu variable is fine for now. But, that only gives
> > you a dump from a single entry to the kernel. Ideally, it would be nice
> > to have a stack of them so you could do things like debug
> > syscall->fault->oops. Was it the syscall entry or the fault entry that
> > lead to the oops?
> >
> > But, the stack gets really fun because of NMIs.
> >
> > I'm sure Andy Lutomirski has some ideas too.
>
> I was thinking that maybe we should add a new field or two to pt_regs.
> They could store CR2 and maybe CR3 as well. I'd also like to expose
> the error code of exceptions in stack traces. We should get this
> integrated right into the unwinder.
hmm. Exposing cr3 to user space will make it trivial for user process
to know whether kpti is active. Not sure how exploitable such
information leak.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 17:40 ` Willy Tarreau
2018-01-11 17:53 ` Andy Lutomirski
@ 2018-01-11 18:25 ` Linus Torvalds
2018-01-11 18:26 ` Linus Torvalds
` (2 more replies)
1 sibling, 3 replies; 55+ messages in thread
From: Linus Torvalds @ 2018-01-11 18:25 UTC (permalink / raw)
To: Willy Tarreau
Cc: Andy Lutomirski, Dave Hansen, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 9:40 AM, Willy Tarreau <w@1wt.eu> wrote:
>> As for per-mm vs per-thread, let's make it only switchable in
>> single-threaded processes for now and inherited when threads are
>> created.
>
> That's exactly what it does for now, but Linus doesn't like it at all.
Just to clarify: I definitely want the part where it is only
switchable in single-threaded mode, and I actually do want it
"inherited" by threads when they do get created.
It's just that my mental model for threads is not that they "inherit"
the PTI flag, it's that they simply share it. But that's more of a
difference in "views" than in actual behavior.
>> (Another reason for per-thread instead of per-mm: as a per-mm thing,
>> you can't set it up for your descendents using vfork(); prctl();
>> exec(), and the latter is how your average language runtime that
>> spawns subprocesses would want to do it.
>
> That's indeed the benefit it provides for now since I actually had
> to *add* code to execve() to disable it then.
So the "vfork()" case is indeed interesting, but I don't think it's
all that relevant.
Why?
If you do the PTI on/off operation *before* the vfork(), nothing is
different. The vfork() by definition ends up having the same PTI
state, since it has the same VM. But that's actually 100% expected,
and it matches the fork() behavior too: the PTI state should be just
copied by a fork(), since fork isn't any protection domain.
And *after* you've done a vfork(), you can't do a PTI on/off, because
now the VM has multiple users, which is 100% equivalent to the thread
case that we already all agreed should be disallowed. So no, you can't
do "vfork -> setnopti -> exec', but that is in no way different from
any of the *other* things you cannot do in between vfork and execve.
And in a wrapper that sets nopti, you wouldn't want to use vfork
anyway. You wouldn't even want to use *fork*. You'd just do "set
nopti" and then execve(). That's the whole point of the wrapper.
So vfork() is worth _mentioning_, but I don't think there is any
actual issue there. Quite the reverse - it acts exactly as expected.
The main thing that should be special for PTI on/off is "execve()".
That's the one that may force PTI on again, because of a security
boundary.
The other case may be the CLONE_NEW* operations. I *think* they are
noops as far as PTI settings would be, but I think people should think
about them.
Linus
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:25 ` Linus Torvalds
@ 2018-01-11 18:26 ` Linus Torvalds
2018-01-11 19:33 ` Andy Lutomirski
2018-01-11 21:59 ` Willy Tarreau
2018-01-12 16:27 ` David Laight
2 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2018-01-11 18:26 UTC (permalink / raw)
To: Willy Tarreau
Cc: Andy Lutomirski, Dave Hansen, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 10:25 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The other case may be the CLONE_NEW* operations. I *think* they are
> noops as far as PTI settings would be, but I think people should think
> about them.
Oh, and yes, I think the npti flag should also break ptrace(). I do
agree with Andy that it's a "capability", although I do not think it
should actually be implemented as one.
Linus
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:21 ` Alexei Starovoitov
@ 2018-01-11 18:30 ` Dave Hansen
2018-01-11 18:32 ` Josh Poimboeuf
1 sibling, 0 replies; 55+ messages in thread
From: Dave Hansen @ 2018-01-11 18:30 UTC (permalink / raw)
To: Alexei Starovoitov, Andy Lutomirski
Cc: Willy Tarreau, Linus Torvalds, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On 01/11/2018 10:21 AM, Alexei Starovoitov wrote:
>> I was thinking that maybe we should add a new field or two to pt_regs.
>> They could store CR2 and maybe CR3 as well. I'd also like to expose
>> the error code of exceptions in stack traces. We should get this
>> integrated right into the unwinder.
> hmm. Exposing cr3 to user space will make it trivial for user process
> to know whether kpti is active. Not sure how exploitable such
> information leak.
It also gives userspace a pretty valuable physical address to go after.
That, plus a KASLR defeat gives you a known-valuable virtual address to
target. That's no good.
I think CR3 is a non-starter.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:15 ` Dave Hansen
@ 2018-01-11 18:31 ` Linus Torvalds
0 siblings, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2018-01-11 18:31 UTC (permalink / raw)
To: Dave Hansen
Cc: Willy Tarreau, Andy Lutomirski, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 10:15 AM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> Well, on the bright side, we don't need IPIs when _removing_ NX. We can
> just handle those like a spurious fault.
I think I agree.
> But, when re-enabling it, we need all the TLB flushing for all the CPUs
> that have run with the un-NX'd page tables.
Actually, I really don't think we should even allow "re-enable PTI".
The only thing that re-enables PTI is a completely new page table,
notably "execve()".
And I think that is when the "NOW" vs "NEXT" *may* make sense. Not for
enabling PTI, but if we want to have a "disable PTI", I think it
should act on the next execve().
And one reason I think we want that behavior is that once you've
disabled PTI, I don't think the double page tables would necessarily
even exist, and I don't think we should try to re-populate them. A
noPTI process might simply *have* just the single page table.
That wouldn't be the first implementation, but I think the interface
should be designed for that kind of thing in mind, where nopti really
means "stop doing two page tables for this process". And that may make
it *impossible* to re-enable PTI for this process, simply because we
don't have the required double-page PGD allocation at all.
Linus
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:21 ` Alexei Starovoitov
2018-01-11 18:30 ` Dave Hansen
@ 2018-01-11 18:32 ` Josh Poimboeuf
2018-01-11 18:36 ` Linus Torvalds
2018-01-11 18:38 ` Dave Hansen
1 sibling, 2 replies; 55+ messages in thread
From: Josh Poimboeuf @ 2018-01-11 18:32 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andy Lutomirski, Dave Hansen, Willy Tarreau, Linus Torvalds,
Peter Zijlstra, LKML, X86 ML, Borislav Petkov, Brian Gerst,
Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Greg Kroah-Hartman,
Kees Cook
On Thu, Jan 11, 2018 at 10:21:49AM -0800, Alexei Starovoitov wrote:
> On Thu, Jan 11, 2018 at 09:02:55AM -0800, Andy Lutomirski wrote:
> > On Thu, Jan 11, 2018 at 7:51 AM, Dave Hansen
> > <dave.hansen@linux.intel.com> wrote:
> > > On 01/11/2018 07:44 AM, Willy Tarreau wrote:
> > >>> I think we also need to be able to dump the actual
> > >>> CR3 value that we entered the kernel with before we start doing too much
> > >>> other funky stuff with the entry code.
> > >> When you say dump, you mean save it somewhere in a per_cpu variable ?
> > >
> > > Yeah, I think a per-cpu variable is fine for now. But, that only gives
> > > you a dump from a single entry to the kernel. Ideally, it would be nice
> > > to have a stack of them so you could do things like debug
> > > syscall->fault->oops. Was it the syscall entry or the fault entry that
> > > lead to the oops?
> > >
> > > But, the stack gets really fun because of NMIs.
> > >
> > > I'm sure Andy Lutomirski has some ideas too.
> >
> > I was thinking that maybe we should add a new field or two to pt_regs.
> > They could store CR2 and maybe CR3 as well. I'd also like to expose
> > the error code of exceptions in stack traces. We should get this
> > integrated right into the unwinder.
>
> hmm. Exposing cr3 to user space will make it trivial for user process
> to know whether kpti is active. Not sure how exploitable such
> information leak.
It's already trivial to detect PTI from user space.
--
Josh
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 15:44 ` Willy Tarreau
2018-01-11 15:51 ` Dave Hansen
2018-01-11 17:09 ` Andy Lutomirski
@ 2018-01-11 18:35 ` Dave Hansen
2018-01-11 21:49 ` Willy Tarreau
2 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2018-01-11 18:35 UTC (permalink / raw)
To: Willy Tarreau
Cc: Linus Torvalds, Andy Lutomirski, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On 01/11/2018 07:44 AM, Willy Tarreau wrote:
>> 4. Cleared on setuid() and friends
> This one causes me a problem : some daemons already take care of dropping
> privileges after the initial fork() for the sake of security. Haproxy
> typically does this at boot :
>
> - parse config
> - chroot to /var/empty
> - setuid(dedicated_uid)
> - fork()
This makes me a _bit_ nervous. I think Andy touched on this, but I'll
say it another way: you want PTI turned off because you trust an app to
be good, but you also drop permissions because it is exposed to an
environment where you do *not* fully trust it.
I'm not sure how you reconcile that.
If your proxy gets compromised, and pti is turned off, you are entirely
exposed to meltdown from that process. I don't know exactly what you
are doing, but isn't this proxy sitting there shuffling untrusted user
data around all day?
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:32 ` Josh Poimboeuf
@ 2018-01-11 18:36 ` Linus Torvalds
2018-01-11 18:38 ` Dave Hansen
1 sibling, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2018-01-11 18:36 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Alexei Starovoitov, Andy Lutomirski, Dave Hansen, Willy Tarreau,
Peter Zijlstra, LKML, X86 ML, Borislav Petkov, Brian Gerst,
Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Greg Kroah-Hartman,
Kees Cook
On Thu, Jan 11, 2018 at 10:32 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Jan 11, 2018 at 10:21:49AM -0800, Alexei Starovoitov wrote:
>>
>> hmm. Exposing cr3 to user space will make it trivial for user process
>> to know whether kpti is active. Not sure how exploitable such
>> information leak.
>
> It's already trivial to detect PTI from user space.
I agree. I don't think the whole "is PTI on" is all that much of a secret.
But exposing all of %cr3 to user space is completely unacceptable.
That's just about *the* best attack vector information you could give
to somebody, and would make KASLR almost totally uninteresting. If you
have access to the page directory pointer, and some other weakness
that allows you to access it, you're golden. You can do anything.
Linus
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:32 ` Josh Poimboeuf
2018-01-11 18:36 ` Linus Torvalds
@ 2018-01-11 18:38 ` Dave Hansen
2018-01-11 18:51 ` Linus Torvalds
2018-01-11 19:11 ` Willy Tarreau
1 sibling, 2 replies; 55+ messages in thread
From: Dave Hansen @ 2018-01-11 18:38 UTC (permalink / raw)
To: Josh Poimboeuf, Alexei Starovoitov
Cc: Andy Lutomirski, Willy Tarreau, Linus Torvalds, Peter Zijlstra,
LKML, X86 ML, Borislav Petkov, Brian Gerst, Ingo Molnar,
Thomas Gleixner, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
>> hmm. Exposing cr3 to user space will make it trivial for user process
>> to know whether kpti is active. Not sure how exploitable such
>> information leak.
> It's already trivial to detect PTI from user space.
Do tell.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:38 ` Dave Hansen
@ 2018-01-11 18:51 ` Linus Torvalds
2018-01-11 18:57 ` Dave Hansen
2018-01-11 19:11 ` Willy Tarreau
1 sibling, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2018-01-11 18:51 UTC (permalink / raw)
To: Dave Hansen
Cc: Josh Poimboeuf, Alexei Starovoitov, Andy Lutomirski,
Willy Tarreau, Peter Zijlstra, LKML, X86 ML, Borislav Petkov,
Brian Gerst, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 10:38 AM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
>>> hmm. Exposing cr3 to user space will make it trivial for user process
>>> to know whether kpti is active. Not sure how exploitable such
>>> information leak.
>> It's already trivial to detect PTI from user space.
>
> Do tell.
One way to do it is to just run the attack, and see if you get something.
So it's not really "is PTI enabled", but a "is meltdown there". Then
you just use that together with cpuinfo to decide if PTI is enabled.
So I think Josh is 100% right. Detecting PTI on/off is not hard.
But that does *not* mean that %cr3 isn't secret. %cr3 should
definitely never *ever* be accessible to user space.
Linus
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:51 ` Linus Torvalds
@ 2018-01-11 18:57 ` Dave Hansen
2018-01-11 19:05 ` Josh Poimboeuf
` (3 more replies)
0 siblings, 4 replies; 55+ messages in thread
From: Dave Hansen @ 2018-01-11 18:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josh Poimboeuf, Alexei Starovoitov, Andy Lutomirski,
Willy Tarreau, Peter Zijlstra, LKML, X86 ML, Borislav Petkov,
Brian Gerst, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On 01/11/2018 10:51 AM, Linus Torvalds wrote:
> On Thu, Jan 11, 2018 at 10:38 AM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
>> On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
>>>> hmm. Exposing cr3 to user space will make it trivial for user process
>>>> to know whether kpti is active. Not sure how exploitable such
>>>> information leak.
>>> It's already trivial to detect PTI from user space.
>> Do tell.
> One way to do it is to just run the attack, and see if you get something.
Not judging how trivial (or not) the attack is, I was hoping for
something that was *not* the attack itself. :)
I'd love to have a tool that tells you for sure "KPTI enabled or not",
but I'd also love to have it be something I can easily distribute
without it being handled like a WMD.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:57 ` Dave Hansen
@ 2018-01-11 19:05 ` Josh Poimboeuf
2018-01-11 19:07 ` Borislav Petkov
` (2 subsequent siblings)
3 siblings, 0 replies; 55+ messages in thread
From: Josh Poimboeuf @ 2018-01-11 19:05 UTC (permalink / raw)
To: Dave Hansen
Cc: Linus Torvalds, Alexei Starovoitov, Andy Lutomirski,
Willy Tarreau, Peter Zijlstra, LKML, X86 ML, Borislav Petkov,
Brian Gerst, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook, aarcange
On Thu, Jan 11, 2018 at 10:57:51AM -0800, Dave Hansen wrote:
> On 01/11/2018 10:51 AM, Linus Torvalds wrote:
> > On Thu, Jan 11, 2018 at 10:38 AM, Dave Hansen
> > <dave.hansen@linux.intel.com> wrote:
> >> On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
> >>>> hmm. Exposing cr3 to user space will make it trivial for user process
> >>>> to know whether kpti is active. Not sure how exploitable such
> >>>> information leak.
> >>> It's already trivial to detect PTI from user space.
> >> Do tell.
> > One way to do it is to just run the attack, and see if you get something.
>
> Not judging how trivial (or not) the attack is, I was hoping for
> something that was *not* the attack itself. :)
>
> I'd love to have a tool that tells you for sure "KPTI enabled or not",
> but I'd also love to have it be something I can easily distribute
> without it being handled like a WMD.
Instead of the meltdown attack, why not just run the KASLR attack, with
prefetches + cache timing?
Something like this (I haven't tested it though):
https://github.com/IAIK/prefetch/blob/master/addrspace/addrspace.c
Andrea also created such a tool, but IIRC, it requires knowing a kernel
address, so it wouldn't work with KASLR. It could probably be extended
to scan kernel space until it finds something.
We could maybe even add such a tool to the kernel tree.
--
Josh
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:57 ` Dave Hansen
2018-01-11 19:05 ` Josh Poimboeuf
@ 2018-01-11 19:07 ` Borislav Petkov
2018-01-11 19:17 ` Dave Hansen
2018-01-11 19:12 ` Linus Torvalds
2018-01-11 19:38 ` Alexei Starovoitov
3 siblings, 1 reply; 55+ messages in thread
From: Borislav Petkov @ 2018-01-11 19:07 UTC (permalink / raw)
To: Dave Hansen
Cc: Linus Torvalds, Josh Poimboeuf, Alexei Starovoitov,
Andy Lutomirski, Willy Tarreau, Peter Zijlstra, LKML, X86 ML,
Brian Gerst, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 10:57:51AM -0800, Dave Hansen wrote:
> I'd love to have a tool that tells you for sure "KPTI enabled or not",
> but I'd also love to have it be something I can easily distribute
> without it being handled like a WMD.
You mean this:
https://git.kernel.org/tip/87590ce6e373d1a5401f6539f0c59ef92dd924a9
?
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:38 ` Dave Hansen
2018-01-11 18:51 ` Linus Torvalds
@ 2018-01-11 19:11 ` Willy Tarreau
1 sibling, 0 replies; 55+ messages in thread
From: Willy Tarreau @ 2018-01-11 19:11 UTC (permalink / raw)
To: Dave Hansen
Cc: Josh Poimboeuf, Alexei Starovoitov, Andy Lutomirski,
Linus Torvalds, Peter Zijlstra, LKML, X86 ML, Borislav Petkov,
Brian Gerst, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 10:38:07AM -0800, Dave Hansen wrote:
> On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
> >> hmm. Exposing cr3 to user space will make it trivial for user process
> >> to know whether kpti is active. Not sure how exploitable such
> >> information leak.
> > It's already trivial to detect PTI from user space.
>
> Do tell.
Probably because meltdown doesn't work anymore ? :-)
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:57 ` Dave Hansen
2018-01-11 19:05 ` Josh Poimboeuf
2018-01-11 19:07 ` Borislav Petkov
@ 2018-01-11 19:12 ` Linus Torvalds
2018-01-11 19:38 ` Alexei Starovoitov
3 siblings, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2018-01-11 19:12 UTC (permalink / raw)
To: Dave Hansen
Cc: Josh Poimboeuf, Alexei Starovoitov, Andy Lutomirski,
Willy Tarreau, Peter Zijlstra, LKML, X86 ML, Borislav Petkov,
Brian Gerst, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 10:57 AM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> I'd love to have a tool that tells you for sure "KPTI enabled or not",
> but I'd also love to have it be something I can easily distribute
> without it being handled like a WMD.
As Josh points out, the whole meltdown excitement actually already
generated a number of projects that aren't the attack, but show some
of the effects.
But Thomas does have a whole sysfs series to show "mitigation status
for various hardware bugs".
Linus
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 19:07 ` Borislav Petkov
@ 2018-01-11 19:17 ` Dave Hansen
2018-01-11 19:19 ` Olivier Galibert
0 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2018-01-11 19:17 UTC (permalink / raw)
To: Borislav Petkov
Cc: Linus Torvalds, Josh Poimboeuf, Alexei Starovoitov,
Andy Lutomirski, Willy Tarreau, Peter Zijlstra, LKML, X86 ML,
Brian Gerst, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On 01/11/2018 11:07 AM, Borislav Petkov wrote:
> On Thu, Jan 11, 2018 at 10:57:51AM -0800, Dave Hansen wrote:
>> I'd love to have a tool that tells you for sure "KPTI enabled or not",
>> but I'd also love to have it be something I can easily distribute
>> without it being handled like a WMD.
> You mean this:
>
> https://git.kernel.org/tip/87590ce6e373d1a5401f6539f0c59ef92dd924a9
I meant that works across all the kpti implementations. Those with gunk
in /sys, /proc/cpuinfo, or with nothing at all like the original KAISER
patches I posted.
And, yes, I want a pony. I just though someone had such a pony handy
and it was trivial.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 19:17 ` Dave Hansen
@ 2018-01-11 19:19 ` Olivier Galibert
2018-01-11 19:26 ` Josh Poimboeuf
0 siblings, 1 reply; 55+ messages in thread
From: Olivier Galibert @ 2018-01-11 19:19 UTC (permalink / raw)
To: Dave Hansen
Cc: Borislav Petkov, Linus Torvalds, Josh Poimboeuf,
Alexei Starovoitov, Andy Lutomirski, Willy Tarreau,
Peter Zijlstra, LKML, X86 ML, Brian Gerst, Ingo Molnar,
Thomas Gleixner, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
Wouldn't the time taken by an easy syscall like getuid be a clear indicator?
OG.
On Thu, Jan 11, 2018 at 8:17 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 01/11/2018 11:07 AM, Borislav Petkov wrote:
>> On Thu, Jan 11, 2018 at 10:57:51AM -0800, Dave Hansen wrote:
>>> I'd love to have a tool that tells you for sure "KPTI enabled or not",
>>> but I'd also love to have it be something I can easily distribute
>>> without it being handled like a WMD.
>> You mean this:
>>
>> https://git.kernel.org/tip/87590ce6e373d1a5401f6539f0c59ef92dd924a9
>
> I meant that works across all the kpti implementations. Those with gunk
> in /sys, /proc/cpuinfo, or with nothing at all like the original KAISER
> patches I posted.
>
> And, yes, I want a pony. I just though someone had such a pony handy
> and it was trivial.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 19:19 ` Olivier Galibert
@ 2018-01-11 19:26 ` Josh Poimboeuf
2018-01-11 19:34 ` Alan Cox
0 siblings, 1 reply; 55+ messages in thread
From: Josh Poimboeuf @ 2018-01-11 19:26 UTC (permalink / raw)
To: Olivier Galibert
Cc: Dave Hansen, Borislav Petkov, Linus Torvalds, Alexei Starovoitov,
Andy Lutomirski, Willy Tarreau, Peter Zijlstra, LKML, X86 ML,
Brian Gerst, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 08:19:35PM +0100, Olivier Galibert wrote:
> Wouldn't the time taken by an easy syscall like getuid be a clear indicator?
I think only if you had a baseline measurement to compare against.
--
Josh
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:26 ` Linus Torvalds
@ 2018-01-11 19:33 ` Andy Lutomirski
2018-01-12 20:22 ` Ingo Molnar
0 siblings, 1 reply; 55+ messages in thread
From: Andy Lutomirski @ 2018-01-11 19:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Willy Tarreau, Andy Lutomirski, Dave Hansen, Peter Zijlstra, LKML,
X86 ML, Borislav Petkov, Brian Gerst, Ingo Molnar,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
> On Jan 11, 2018, at 10:26 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, Jan 11, 2018 at 10:25 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> The other case may be the CLONE_NEW* operations. I *think* they are
>> noops as far as PTI settings would be, but I think people should think
>> about them.
>
> Oh, and yes, I think the npti flag should also break ptrace(). I do
> agree with Andy that it's a "capability", although I do not think it
> should actually be implemented as one.
For all that Linux capabilities are crap, nopti walks like one and quacks like one. It needs to affect ptrace() permissions, it needs a way to disable it systemwide, it needs LSM integration, etc. Using CAP_DISABLE_PTI gives us all of this without tons of churn, auditing, and a whole new configuration thingy for each LSM. And I avoids permanently polluting ptrace checks, the LSM interface, etc for what is, essentially, a performance hack to work around a blatant error in the design of some CPUs.
Plus, with ambient caps, we already did the nasty part of the with and finished all the relevant bikeshedding.
So I'd rather just hold my nose and add the new capability bit.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 19:26 ` Josh Poimboeuf
@ 2018-01-11 19:34 ` Alan Cox
2018-01-11 21:23 ` Willy Tarreau
0 siblings, 1 reply; 55+ messages in thread
From: Alan Cox @ 2018-01-11 19:34 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Olivier Galibert, Dave Hansen, Borislav Petkov, Linus Torvalds,
Alexei Starovoitov, Andy Lutomirski, Willy Tarreau,
Peter Zijlstra, LKML, X86 ML, Brian Gerst, Ingo Molnar,
Thomas Gleixner, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Thu, 11 Jan 2018 13:26:01 -0600
Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Jan 11, 2018 at 08:19:35PM +0100, Olivier Galibert wrote:
> > Wouldn't the time taken by an easy syscall like getuid be a clear indicator?
>
> I think only if you had a baseline measurement to compare against.
getuid can also be cached by a smart libc. getppid() is what micro
benchmarks seem to favour.
Alan
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:57 ` Dave Hansen
` (2 preceding siblings ...)
2018-01-11 19:12 ` Linus Torvalds
@ 2018-01-11 19:38 ` Alexei Starovoitov
3 siblings, 0 replies; 55+ messages in thread
From: Alexei Starovoitov @ 2018-01-11 19:38 UTC (permalink / raw)
To: Dave Hansen
Cc: Linus Torvalds, Josh Poimboeuf, Andy Lutomirski, Willy Tarreau,
Peter Zijlstra, LKML, X86 ML, Borislav Petkov, Brian Gerst,
Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Greg Kroah-Hartman,
Kees Cook
On Thu, Jan 11, 2018 at 10:57:51AM -0800, Dave Hansen wrote:
> On 01/11/2018 10:51 AM, Linus Torvalds wrote:
> > On Thu, Jan 11, 2018 at 10:38 AM, Dave Hansen
> > <dave.hansen@linux.intel.com> wrote:
> >> On 01/11/2018 10:32 AM, Josh Poimboeuf wrote:
> >>>> hmm. Exposing cr3 to user space will make it trivial for user process
> >>>> to know whether kpti is active. Not sure how exploitable such
> >>>> information leak.
> >>> It's already trivial to detect PTI from user space.
> >> Do tell.
> > One way to do it is to just run the attack, and see if you get something.
>
> Not judging how trivial (or not) the attack is, I was hoping for
> something that was *not* the attack itself. :)
the attack can also be inconclusive.
I'm playing with an idea to conditionally switch into user cr3
after returning from syscall.
Like per task or per cpu counter or randomly tell syscall return to
keep kernel cr3 while next interrupt will bring task back to user cr3.
Public exploits use syscalls to keep kernel memory in L1 and
with such hack they see partial kernel reads and cannot really tell
which are the kernel bytes in the mix.
If there is a fast way for an attacker to know that after
syscall kpti is off for this task then such conditional kpti on/off
will be completely pointless.
It's not 100% secure obviously. Sort-of best effort to bring back
most of syscall performance without thinking which task should or
should not be allowed to toggle kpti flag.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 17:02 ` Andy Lutomirski
2018-01-11 18:21 ` Alexei Starovoitov
@ 2018-01-11 20:00 ` Dave Hansen
1 sibling, 0 replies; 55+ messages in thread
From: Dave Hansen @ 2018-01-11 20:00 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Willy Tarreau, Linus Torvalds, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On 01/11/2018 09:02 AM, Andy Lutomirski wrote:
>> But, the stack gets really fun because of NMIs.
>>
>> I'm sure Andy Lutomirski has some ideas too.
> I was thinking that maybe we should add a new field or two to pt_regs.
> They could store CR2 and maybe CR3 as well. I'd also like to expose
> the error code of exceptions in stack traces. We should get this
> integrated right into the unwinder.
The trampoline and (normal) interrupt stacks should be pretty doable.
It's the NMI mess that I'm worried about. I tried to change the stack
layout in there once and ran away screaming.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 19:34 ` Alan Cox
@ 2018-01-11 21:23 ` Willy Tarreau
2018-01-11 21:28 ` Linus Torvalds
0 siblings, 1 reply; 55+ messages in thread
From: Willy Tarreau @ 2018-01-11 21:23 UTC (permalink / raw)
To: Alan Cox
Cc: Josh Poimboeuf, Olivier Galibert, Dave Hansen, Borislav Petkov,
Linus Torvalds, Alexei Starovoitov, Andy Lutomirski,
Peter Zijlstra, LKML, X86 ML, Brian Gerst, Ingo Molnar,
Thomas Gleixner, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 07:34:31PM +0000, Alan Cox wrote:
> On Thu, 11 Jan 2018 13:26:01 -0600
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> > On Thu, Jan 11, 2018 at 08:19:35PM +0100, Olivier Galibert wrote:
> > > Wouldn't the time taken by an easy syscall like getuid be a clear indicator?
> >
> > I think only if you had a baseline measurement to compare against.
>
> getuid can also be cached by a smart libc. getppid() is what micro
> benchmarks seem to favour.
I initially tried getpid() which I found to be cached by glibc, but I
switched to write(-1, "a", 1) in the example in the cover letter and
the test for 3 million runs roughly climbs from 200 ms to 900 ms under
kvm (with PCID this time, I didn't retest without).
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 21:23 ` Willy Tarreau
@ 2018-01-11 21:28 ` Linus Torvalds
2018-01-11 22:06 ` Willy Tarreau
0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2018-01-11 21:28 UTC (permalink / raw)
To: Willy Tarreau
Cc: One Thousand Gnomes, Josh Poimboeuf, Olivier Galibert,
Dave Hansen, Borislav Petkov, Alexei Starovoitov, Andy Lutomirski,
Peter Zijlstra, LKML, X86 ML, Brian Gerst, Ingo Molnar,
Thomas Gleixner, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
[-- Attachment #1: Type: text/plain, Size: 943 bytes --]
The traditional fast system call to test is getppid().
write() goes through a lot more code.
Linus
On Jan 11, 2018 13:24, "Willy Tarreau" <w@1wt.eu> wrote:
> On Thu, Jan 11, 2018 at 07:34:31PM +0000, Alan Cox wrote:
> > On Thu, 11 Jan 2018 13:26:01 -0600
> > Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >
> > > On Thu, Jan 11, 2018 at 08:19:35PM +0100, Olivier Galibert wrote:
> > > > Wouldn't the time taken by an easy syscall like getuid be a clear
> indicator?
> > >
> > > I think only if you had a baseline measurement to compare against.
> >
> > getuid can also be cached by a smart libc. getppid() is what micro
> > benchmarks seem to favour.
>
> I initially tried getpid() which I found to be cached by glibc, but I
> switched to write(-1, "a", 1) in the example in the cover letter and
> the test for 3 million runs roughly climbs from 200 ms to 900 ms under
> kvm (with PCID this time, I didn't retest without).
>
> Willy
>
[-- Attachment #2: Type: text/html, Size: 1469 bytes --]
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:35 ` Dave Hansen
@ 2018-01-11 21:49 ` Willy Tarreau
0 siblings, 0 replies; 55+ messages in thread
From: Willy Tarreau @ 2018-01-11 21:49 UTC (permalink / raw)
To: Dave Hansen
Cc: Linus Torvalds, Andy Lutomirski, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 10:35:57AM -0800, Dave Hansen wrote:
> On 01/11/2018 07:44 AM, Willy Tarreau wrote:
> >> 4. Cleared on setuid() and friends
> > This one causes me a problem : some daemons already take care of dropping
> > privileges after the initial fork() for the sake of security. Haproxy
> > typically does this at boot :
> >
> > - parse config
> > - chroot to /var/empty
> > - setuid(dedicated_uid)
> > - fork()
>
> This makes me a _bit_ nervous. I think Andy touched on this, but I'll
> say it another way: you want PTI turned off because you trust an app to
> be good, but you also drop permissions because it is exposed to an
> environment where you do *not* fully trust it.
>
> I'm not sure how you reconcile that.
>
> If your proxy gets compromised, and pti is turned off, you are entirely
> exposed to meltdown from that process.
Yes but security is not either black or white, it's about adding protections
that make sense wherever possible. Meltdown would "only" allow you to dump
the system's memory. Running as root would additionally allow you to modify
this memory, ptrace processes outside of the chroot to inject backdoors etc.
For me, and for a lot of users I guess, the choice is easily made.
> I don't know exactly what you
> are doing, but isn't this proxy sitting there shuffling untrusted user
> data around all day?
Yes definitely for some users, while it can be very trusted data for others.
For example those dealing with wire transfers have much more to care about
than someone reading their kernel memory when they can also read and modify
the amount of money in the transfers if they want!
And it has to do all this *fast*, that's one of its most touted qualities.
And that's even why people who care the most about performance always
install it on Linux because today it's the best OS for this job.
Again Dave, it's a tradeoff. People using such components generally know
what they are doing and are the ones to decide. Some of them prefer to run
as root because it allows them to bind to foreign addresses, others cannot
start as root because the local policy is strict about this, some do a lot
of things some of us would consider ugly but which are critical to their
acitvity. It's not up to me as the software's author nor to us as system
developers to tell our users how they will be much slower to be more
secure if they decide that they don't care about this *specific* risk.
Our job is to properly protect those who don't know they're at risk and
to help the other ones reduce the risk to the minimum needed for a given
task.
In this case the minimum can be pretty well summarized :
- PTI on by default on the system
- no PTI for the performance-sensitive process if the admin decides it
- switch to a harmless uid/gid couple once not needed to be root anymore
- if the process may execute external processes, give it an option to
disable PTI before or during execve() to avoid exposing the system to
what they could be doing.
I think that's pretty reasonable for a normal production environment.
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:25 ` Linus Torvalds
2018-01-11 18:26 ` Linus Torvalds
@ 2018-01-11 21:59 ` Willy Tarreau
2018-01-12 16:27 ` David Laight
2 siblings, 0 replies; 55+ messages in thread
From: Willy Tarreau @ 2018-01-11 21:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Dave Hansen, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 10:25:29AM -0800, Linus Torvalds wrote:
> Just to clarify: I definitely want the part where it is only
> switchable in single-threaded mode, and I actually do want it
> "inherited" by threads when they do get created.
OK this is what is currently done in series v3 because the TIF_* flags
are copied as-is to threads (unless I missed something). Even for
re-enabling it currently refuses it if mm_users > 1.
> It's just that my mental model for threads is not that they "inherit"
> the PTI flag, it's that they simply share it. But that's more of a
> difference in "views" than in actual behavior.
I see, thanks for explaining this point, I understand better your
concern now. Well, if we document that the current process' flag is
replicated as-is to any threads so that it is consistent across all
threads and that it may only be modified on all threads atomically,
which currently we can only achieve by doing it when there's a single
thread on an mm, I suspect it could match your mental model.
> If you do the PTI on/off operation *before* the vfork(), nothing is
> different. The vfork() by definition ends up having the same PTI
> state, since it has the same VM. But that's actually 100% expected,
> and it matches the fork() behavior too: the PTI state should be just
> copied by a fork(), since fork isn't any protection domain.
>
> And *after* you've done a vfork(), you can't do a PTI on/off, because
> now the VM has multiple users, which is 100% equivalent to the thread
> case that we already all agreed should be disallowed. So no, you can't
> do "vfork -> setnopti -> exec', but that is in no way different from
> any of the *other* things you cannot do in between vfork and execve.
That's where I like the principle of the NEXT ctl which can be per-
thread. The thread about to do an execve() cannot change its own flag
because it's entangled to the other ones sharing the same mm, but it
can change its own NEXT flag so that execve() starts with the specified
mode (typically PTI on in the example of log rotation for a server).
Quite honnestly for the NOW vs NEXT, I find the NOW convenient to avoid
a wrapper, but a program could also self-exec after setting the flag
(I've already done this to change thread stack sizes on certain
processes a long time ago and that's no real hassle). And given that
NOW cannot really re-adjust the PGD that was already assigned, maybe
in the end we should stick to this NEXT thing and wait for the next
execve() to apply the operation.
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 21:28 ` Linus Torvalds
@ 2018-01-11 22:06 ` Willy Tarreau
2018-01-12 16:37 ` David Laight
0 siblings, 1 reply; 55+ messages in thread
From: Willy Tarreau @ 2018-01-11 22:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: One Thousand Gnomes, Josh Poimboeuf, Olivier Galibert,
Dave Hansen, Borislav Petkov, Alexei Starovoitov, Andy Lutomirski,
Peter Zijlstra, LKML, X86 ML, Brian Gerst, Ingo Molnar,
Thomas Gleixner, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
On Thu, Jan 11, 2018 at 01:28:18PM -0800, Linus Torvalds wrote:
> The traditional fast system call to test is getppid().
>
> write() goes through a lot more code.
Just tried getppid() now, it's relatively similar (slightly slower than
write(-1) though, maybe that one aborts very early) :
PTI=on : 920ms for 3 million calls
PTI=off (prctl) : 230ms for 3 million calls
PTI=off (boot) : 215ms for 3 million calls
The small difference between the last two very likely comes from the few
instructions avoided thanks to the alternatives when pti=off is used at
boot.
So yes here it's trivial to tell if it's on or off :-)
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
@ 2018-01-11 23:11 Alexey Dobriyan
0 siblings, 0 replies; 55+ messages in thread
From: Alexey Dobriyan @ 2018-01-11 23:11 UTC (permalink / raw)
To: w; +Cc: linux-kernel
> I initially tried getpid() which I found to be cached by glibc, but I
> switched to write(-1, "a", 1) in the example in the cover letter and
> the test for 3 million runs roughly climbs from 200 ms to 900 ms under
> kvm (with PCID this time, I didn't retest without).
umask() is the fastest system call I think.
Also who cares about glibc caching?
static inline int sys_umask(int mask)
{
int rv;
asm volatile (
"syscall"
: "=a" (rv)
: "0" (95), "D" (mask)
: "rcx", "r11", "cc", "memory"
);
return rv;
}
int main(void)
{
unsigned int i;
for (i = 0; i < (1U << 24); i++) {
sys_umask(0);
}
return 0;
}
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 18:25 ` Linus Torvalds
2018-01-11 18:26 ` Linus Torvalds
2018-01-11 21:59 ` Willy Tarreau
@ 2018-01-12 16:27 ` David Laight
2018-01-12 17:55 ` Linus Torvalds
2 siblings, 1 reply; 55+ messages in thread
From: David Laight @ 2018-01-12 16:27 UTC (permalink / raw)
To: 'Linus Torvalds', Willy Tarreau
Cc: Andy Lutomirski, Dave Hansen, Peter Zijlstra, LKML, X86 ML,
Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
Josh Poimboeuf, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
From: Linus Torvalds
> Sent: 11 January 2018 18:25
> On Thu, Jan 11, 2018 at 9:40 AM, Willy Tarreau <w@1wt.eu> wrote:
> >> As for per-mm vs per-thread, let's make it only switchable in
> >> single-threaded processes for now and inherited when threads are
> >> created.
> >
> > That's exactly what it does for now, but Linus doesn't like it at all.
>
> Just to clarify: I definitely want the part where it is only
> switchable in single-threaded mode, and I actually do want it
> "inherited" by threads when they do get created.
...
You need to allow for libraries that create threads before main()
is called.
David
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 22:06 ` Willy Tarreau
@ 2018-01-12 16:37 ` David Laight
0 siblings, 0 replies; 55+ messages in thread
From: David Laight @ 2018-01-12 16:37 UTC (permalink / raw)
To: 'Willy Tarreau', Linus Torvalds
Cc: One Thousand Gnomes, Josh Poimboeuf, Olivier Galibert,
Dave Hansen, Borislav Petkov, Alexei Starovoitov, Andy Lutomirski,
Peter Zijlstra, LKML, X86 ML, Brian Gerst, Ingo Molnar,
Thomas Gleixner, H. Peter Anvin, Greg Kroah-Hartman, Kees Cook
From: Willy Tarreau
> Sent: 11 January 2018 22:07
> On Thu, Jan 11, 2018 at 01:28:18PM -0800, Linus Torvalds wrote:
> > The traditional fast system call to test is getppid().
> >
> > write() goes through a lot more code.
>
> Just tried getppid() now, it's relatively similar (slightly slower than
> write(-1) though, maybe that one aborts very early) :
>
> PTI=on : 920ms for 3 million calls
> PTI=off (prctl) : 230ms for 3 million calls
> PTI=off (boot) : 215ms for 3 million calls
>
> The small difference between the last two very likely comes from the few
> instructions avoided thanks to the alternatives when pti=off is used at
> boot.
>
> So yes here it's trivial to tell if it's on or off :-)
A system call with a larger kernel memory footprint, and user
code that touches more pages, might show an even bigger difference
between PTI=on and PTI=off.
David
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-12 16:27 ` David Laight
@ 2018-01-12 17:55 ` Linus Torvalds
2018-01-12 19:36 ` Willy Tarreau
0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2018-01-12 17:55 UTC (permalink / raw)
To: David Laight
Cc: Willy Tarreau, Andy Lutomirski, Dave Hansen, Peter Zijlstra, LKML,
X86 ML, Borislav Petkov, Brian Gerst, Ingo Molnar,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Fri, Jan 12, 2018 at 8:27 AM, David Laight <David.Laight@aculab.com> wrote:
>
> You need to allow for libraries that create threads before main()
> is called.
I really don't think we do. I think the normal case is the wrapper.
Processes should never say "I'm so important that I'm disabling PTI".
That's crazy talk, and wrong.
It's wrong for all the usual reasons - everybody always thinks that
_their_ own work is so important and bug-free, and that things like
PTI are about protecting all those other imcompetent people.
No. Bullshit. Nobody should ever disable PTI for themselves, because
nobody is inherently trustworthy.
Instead, we have the case of something _external_ saying "this
process is so important that it should be started without PTI".
Linus
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-12 17:55 ` Linus Torvalds
@ 2018-01-12 19:36 ` Willy Tarreau
0 siblings, 0 replies; 55+ messages in thread
From: Willy Tarreau @ 2018-01-12 19:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Laight, Andy Lutomirski, Dave Hansen, Peter Zijlstra, LKML,
X86 ML, Borislav Petkov, Brian Gerst, Ingo Molnar,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Fri, Jan 12, 2018 at 09:55:45AM -0800, Linus Torvalds wrote:
> On Fri, Jan 12, 2018 at 8:27 AM, David Laight <David.Laight@aculab.com> wrote:
> >
> > You need to allow for libraries that create threads before main()
> > is called.
>
> I really don't think we do. I think the normal case is the wrapper.
>
> Processes should never say "I'm so important that I'm disabling PTI".
> That's crazy talk, and wrong.
>
> It's wrong for all the usual reasons - everybody always thinks that
> _their_ own work is so important and bug-free, and that things like
> PTI are about protecting all those other imcompetent people.
>
> No. Bullshit. Nobody should ever disable PTI for themselves, because
> nobody is inherently trustworthy.
>
> Instead, we have the case of something _external_ saying "this
> process is so important that it should be started without PTI".
I totally agree, and what I initially envisionned (for my use case)
was a config option with a scary enough name if we couldn't have the
wrapper. But the wrapper brings the long term benefit of allowing us
to do what we want with the pgd, which is a nice add-on.
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-11 19:33 ` Andy Lutomirski
@ 2018-01-12 20:22 ` Ingo Molnar
2018-01-12 21:18 ` Andy Lutomirski
0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2018-01-12 20:22 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Linus Torvalds, Willy Tarreau, Andy Lutomirski, Dave Hansen,
Peter Zijlstra, LKML, X86 ML, Borislav Petkov, Brian Gerst,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
* Andy Lutomirski <luto@amacapital.net> wrote:
> > Oh, and yes, I think the npti flag should also break ptrace(). I do agree with
> > Andy that it's a "capability", although I do not think it should actually be
> > implemented as one.
>
> For all that Linux capabilities are crap, nopti walks like one and quacks like
> one. It needs to affect ptrace() permissions, it needs a way to disable it
> systemwide, it needs LSM integration, etc. Using CAP_DISABLE_PTI gives us all
> of this without tons of churn, auditing, and a whole new configuration thingy
> for each LSM. And I avoids permanently polluting ptrace checks, the LSM
> interface, etc for what is, essentially, a performance hack to work around a
> blatant error in the design of some CPUs.
>
> Plus, with ambient caps, we already did the nasty part of the with and finished
> all the relevant bikeshedding.
>
> So I'd rather just hold my nose and add the new capability bit.
Those all seem pretty valid arguments to me.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-12 20:22 ` Ingo Molnar
@ 2018-01-12 21:18 ` Andy Lutomirski
2018-01-12 21:54 ` Willy Tarreau
0 siblings, 1 reply; 55+ messages in thread
From: Andy Lutomirski @ 2018-01-12 21:18 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Willy Tarreau, Andy Lutomirski, Dave Hansen,
Peter Zijlstra, LKML, X86 ML, Borislav Petkov, Brian Gerst,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
> On Jan 12, 2018, at 12:22 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>>> Oh, and yes, I think the npti flag should also break ptrace(). I do agree with
>>> Andy that it's a "capability", although I do not think it should actually be
>>> implemented as one.
>>
>> For all that Linux capabilities are crap, nopti walks like one and quacks like
>> one. It needs to affect ptrace() permissions, it needs a way to disable it
>> systemwide, it needs LSM integration, etc. Using CAP_DISABLE_PTI gives us all
>> of this without tons of churn, auditing, and a whole new configuration thingy
>> for each LSM. And I avoids permanently polluting ptrace checks, the LSM
>> interface, etc for what is, essentially, a performance hack to work around a
>> blatant error in the design of some CPUs.
>>
>> Plus, with ambient caps, we already did the nasty part of the with and finished
>> all the relevant bikeshedding.
>>
>> So I'd rather just hold my nose and add the new capability bit.
>
> Those all seem pretty valid arguments to me.
>
>
FWIW, if we take this approach, then either dropping the capability should turn PTI back on or we need to deal with the corner case of PTI off and capability not present. The latter is a bit awkward but not necessarily a show stopper. I think that all we need to do is to update the ptrace rules and maybe make PTI turn back on when we execve. At least there's no need to muck around with LSM hooks.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
2018-01-12 21:18 ` Andy Lutomirski
@ 2018-01-12 21:54 ` Willy Tarreau
0 siblings, 0 replies; 55+ messages in thread
From: Willy Tarreau @ 2018-01-12 21:54 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Dave Hansen,
Peter Zijlstra, LKML, X86 ML, Borislav Petkov, Brian Gerst,
Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
Greg Kroah-Hartman, Kees Cook
On Fri, Jan 12, 2018 at 01:18:06PM -0800, Andy Lutomirski wrote:
> FWIW, if we take this approach, then either dropping the capability should
> turn PTI back on or we need to deal with the corner case of PTI off and
> capability not present. The latter is a bit awkward but not necessarily a
> show stopper. I think that all we need to do is to update the ptrace rules
> and maybe make PTI turn back on when we execve. At least there's no need to
> muck around with LSM hooks.
That's my point as well, just the same principle as the "NEXT" prctl : only
perform changes on execve(). At least we're sure to deal with something
consistent and it's the right moment for deciding on _PAGE_NX.
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
end of thread, other threads:[~2018-01-12 22:01 UTC | newest]
Thread overview: 55+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-01-11 23:11 [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set Alexey Dobriyan
-- strict thread matches above, loose matches on Subject: below --
2018-01-09 12:56 [RFC PATCH v2 0/6] Per process PTI activation Willy Tarreau
2018-01-09 12:56 ` [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set Willy Tarreau
2018-01-10 7:15 ` Ingo Molnar
2018-01-10 7:23 ` Willy Tarreau
2018-01-10 8:22 ` Peter Zijlstra
2018-01-10 9:11 ` Willy Tarreau
2018-01-10 19:21 ` Andy Lutomirski
2018-01-10 19:39 ` Willy Tarreau
2018-01-10 19:44 ` Andy Lutomirski
2018-01-10 19:50 ` Linus Torvalds
2018-01-10 20:04 ` Andy Lutomirski
2018-01-11 6:42 ` Willy Tarreau
2018-01-11 15:29 ` Dave Hansen
2018-01-11 15:44 ` Willy Tarreau
2018-01-11 15:51 ` Dave Hansen
2018-01-11 17:02 ` Andy Lutomirski
2018-01-11 18:21 ` Alexei Starovoitov
2018-01-11 18:30 ` Dave Hansen
2018-01-11 18:32 ` Josh Poimboeuf
2018-01-11 18:36 ` Linus Torvalds
2018-01-11 18:38 ` Dave Hansen
2018-01-11 18:51 ` Linus Torvalds
2018-01-11 18:57 ` Dave Hansen
2018-01-11 19:05 ` Josh Poimboeuf
2018-01-11 19:07 ` Borislav Petkov
2018-01-11 19:17 ` Dave Hansen
2018-01-11 19:19 ` Olivier Galibert
2018-01-11 19:26 ` Josh Poimboeuf
2018-01-11 19:34 ` Alan Cox
2018-01-11 21:23 ` Willy Tarreau
2018-01-11 21:28 ` Linus Torvalds
2018-01-11 22:06 ` Willy Tarreau
2018-01-12 16:37 ` David Laight
2018-01-11 19:12 ` Linus Torvalds
2018-01-11 19:38 ` Alexei Starovoitov
2018-01-11 19:11 ` Willy Tarreau
2018-01-11 20:00 ` Dave Hansen
2018-01-11 17:09 ` Andy Lutomirski
2018-01-11 17:40 ` Willy Tarreau
2018-01-11 17:53 ` Andy Lutomirski
2018-01-11 18:05 ` Willy Tarreau
2018-01-11 18:15 ` Dave Hansen
2018-01-11 18:31 ` Linus Torvalds
2018-01-11 18:25 ` Linus Torvalds
2018-01-11 18:26 ` Linus Torvalds
2018-01-11 19:33 ` Andy Lutomirski
2018-01-12 20:22 ` Ingo Molnar
2018-01-12 21:18 ` Andy Lutomirski
2018-01-12 21:54 ` Willy Tarreau
2018-01-11 21:59 ` Willy Tarreau
2018-01-12 16:27 ` David Laight
2018-01-12 17:55 ` Linus Torvalds
2018-01-12 19:36 ` Willy Tarreau
2018-01-11 18:35 ` Dave Hansen
2018-01-11 21:49 ` Willy Tarreau
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox