Linux userland API discussions

* [PATCH v8 06/14] x86/cet/ibt: Add arch_prctl functions for IBT
From: Yu-cheng Yu @ 2019-08-13 20:53 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu
In-Reply-To: <20190813205359.12196-1-yu-cheng.yu@intel.com>

From: "H.J. Lu" <hjl.tools@gmail.com>

Update ARCH_X86_CET_STATUS and ARCH_X86_CET_DISABLE to include
Indirect Branch Tracking features.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/uapi/asm/prctl.h | 2 ++
 arch/x86/kernel/cet_prctl.c       | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index d962f0ec9ccf..02243127dcf6 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -18,5 +18,7 @@
 #define ARCH_X86_CET_DISABLE		0x3002
 #define ARCH_X86_CET_LOCK		0x3003
 #define ARCH_X86_CET_ALLOC_SHSTK	0x3004
+#define ARCH_X86_CET_GET_LEGACY_BITMAP	0x3005 /* deprecated */
+#define ARCH_X86_CET_SET_LEGACY_BITMAP	0x3006 /* deprecated */
 
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/cet_prctl.c b/arch/x86/kernel/cet_prctl.c
index 9c9d4262b07e..09d8c4ea935c 100644
--- a/arch/x86/kernel/cet_prctl.c
+++ b/arch/x86/kernel/cet_prctl.c
@@ -20,6 +20,8 @@ static int handle_get_status(unsigned long arg2)
 
 	if (current->thread.cet.shstk_enabled)
 		features |= GNU_PROPERTY_X86_FEATURE_1_SHSTK;
+	if (current->thread.cet.ibt_enabled)
+		features |= GNU_PROPERTY_X86_FEATURE_1_IBT;
 
 	shstk_base = current->thread.cet.shstk_base;
 	shstk_size = current->thread.cet.shstk_size;
@@ -69,6 +71,8 @@ int prctl_cet(int option, unsigned long arg2)
 			return -EPERM;
 		if (arg2 & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
 			cet_disable_free_shstk(current);
+		if (arg2 & GNU_PROPERTY_X86_FEATURE_1_IBT)
+			cet_disable_ibt();
 
 		return 0;
 
-- 
2.17.1
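
As a rough illustration of what the updated ARCH_X86_CET_STATUS handler
reports, the sketch below decodes a feature word in user space.  The bit
values are the GNU property ABI ones (IBT is bit 0, SHSTK is bit 1); the
helper name and the CET_FEATURE_* macro names are made up for this example
and are not part of the series.

```c
#include <stddef.h>
#include <stdio.h>

/* Bit values as defined by the GNU property ABI for x86. */
#define CET_FEATURE_IBT   (1UL << 0)	/* GNU_PROPERTY_X86_FEATURE_1_IBT */
#define CET_FEATURE_SHSTK (1UL << 1)	/* GNU_PROPERTY_X86_FEATURE_1_SHSTK */

/*
 * Hypothetical helper: render a CET feature word as text.
 * Returns the number of characters snprintf() would have written.
 */
static int cet_features_to_string(unsigned long features, char *out, size_t len)
{
	return snprintf(out, len, "%s%s",
			(features & CET_FEATURE_SHSTK) ? "shstk " : "",
			(features & CET_FEATURE_IBT) ? "ibt " : "");
}
```

The same two bits form the mask passed to ARCH_X86_CET_DISABLE, so a caller
that wants to turn off only IBT would pass CET_FEATURE_IBT alone.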

* [PATCH v8 07/14] x86/cet/ibt: Add ENDBR to op-code-map
From: Yu-cheng Yu @ 2019-08-13 20:53 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu
In-Reply-To: <20190813205359.12196-1-yu-cheng.yu@intel.com>

Add control transfer terminating instructions:

ENDBR64/ENDBR32:
    Mark a valid 64/32-bit control transfer endpoint.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/lib/x86-opcode-map.txt               | 13 +++++++++++--
 tools/objtool/arch/x86/lib/x86-opcode-map.txt | 13 +++++++++++--
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index c5e825d44766..fbc53481bc59 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -620,7 +620,16 @@ ea: SAVEPREVSSP (f3)
 # Skip 0xeb-0xff
 EndTable
 
-Table: 3-byte opcode 2 (0x0f 0x38)
+Table: 3-byte opcode 2 (0x0f 0x1e)
+Referrer:
+AVXcode:
+# Skip 0x00-0xf9
+fa: ENDBR64 (f3)
+fb: ENDBR32 (f3)
+#skip 0xfc-0xff
+EndTable
+
+Table: 3-byte opcode 3 (0x0f 0x38)
 Referrer: 3-byte escape 1
 AVXcode: 2
 # 0x0f 0x38 0x00-0x0f
@@ -804,7 +813,7 @@ f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v) | WRSS Pq,Qq
 f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By (F3),(v) | SHRX Gy,Ey,By (F2),(v)
 EndTable
 
-Table: 3-byte opcode 3 (0x0f 0x3a)
+Table: 3-byte opcode 4 (0x0f 0x3a)
 Referrer: 3-byte escape 2
 AVXcode: 3
 # 0x0f 0x3a 0x00-0xff
diff --git a/tools/objtool/arch/x86/lib/x86-opcode-map.txt b/tools/objtool/arch/x86/lib/x86-opcode-map.txt
index c5e825d44766..fbc53481bc59 100644
--- a/tools/objtool/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/objtool/arch/x86/lib/x86-opcode-map.txt
@@ -620,7 +620,16 @@ ea: SAVEPREVSSP (f3)
 # Skip 0xeb-0xff
 EndTable
 
-Table: 3-byte opcode 2 (0x0f 0x38)
+Table: 3-byte opcode 2 (0x0f 0x1e)
+Referrer:
+AVXcode:
+# Skip 0x00-0xf9
+fa: ENDBR64 (f3)
+fb: ENDBR32 (f3)
+#skip 0xfc-0xff
+EndTable
+
+Table: 3-byte opcode 3 (0x0f 0x38)
 Referrer: 3-byte escape 1
 AVXcode: 2
 # 0x0f 0x38 0x00-0x0f
@@ -804,7 +813,7 @@ f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v) | WRSS Pq,Qq
 f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By (F3),(v) | SHRX Gy,Ey,By (F2),(v)
 EndTable
 
-Table: 3-byte opcode 3 (0x0f 0x3a)
+Table: 3-byte opcode 4 (0x0f 0x3a)
 Referrer: 3-byte escape 2
 AVXcode: 3
 # 0x0f 0x3a 0x00-0xff
-- 
2.17.1
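
For readers unfamiliar with the encodings behind the new table entries, the
sketch below recognizes the two instructions in a byte buffer.  It follows
the map above (prefix f3, escape 0f 1e, then fa or fb); the function name is
hypothetical and this is not how the kernel decoder itself is written.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 64 for ENDBR64 (f3 0f 1e fa), 32 for ENDBR32 (f3 0f 1e fb),
 * and 0 if the bytes at p do not encode either instruction.
 */
static int endbr_kind(const uint8_t *p, size_t len)
{
	if (len < 4 || p[0] != 0xf3 || p[1] != 0x0f || p[2] != 0x1e)
		return 0;
	if (p[3] == 0xfa)
		return 64;
	if (p[3] == 0xfb)
		return 32;
	return 0;
}
```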

* [PATCH v8 08/14] x86/vdso: Insert endbr32/endbr64 to vDSO
From: Yu-cheng Yu @ 2019-08-13 20:53 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu
In-Reply-To: <20190813205359.12196-1-yu-cheng.yu@intel.com>

From: "H.J. Lu" <hjl.tools@gmail.com>

When Intel indirect branch tracking is enabled, functions in the vDSO that
may be called indirectly must have endbr32 or endbr64 as their first
instruction.  The compiler must support -fcf-protection=branch so that the
vDSO can be built with it.

Acked-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/entry/vdso/Makefile          | 12 +++++++++++-
 arch/x86/entry/vdso/vdso-layout.lds.S |  1 +
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 8df549138193..1e6a95881e73 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -114,13 +114,17 @@ vobjx32s := $(foreach F,$(vobjx32s-y),$(obj)/$F)
 
 # Convert 64bit object file to x32 for x32 vDSO.
 quiet_cmd_x32 = X32     $@
-      cmd_x32 = $(OBJCOPY) -O elf32-x86-64 $< $@
+      cmd_x32 = $(OBJCOPY) -R .note.gnu.property -O elf32-x86-64 $< $@
 
 $(obj)/%-x32.o: $(obj)/%.o FORCE
 	$(call if_changed,x32)
 
 targets += vdsox32.lds $(vobjx32s-y)
 
+ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+    $(obj)/vclock_gettime.o $(obj)/vgetcpu.o $(obj)/vdso32/vclock_gettime.o: KBUILD_CFLAGS += -fcf-protection=branch
+endif
+
 $(obj)/%.so: OBJCOPYFLAGS := -S
 $(obj)/%.so: $(obj)/%.so.dbg FORCE
 	$(call if_changed,objcopy)
@@ -178,6 +182,12 @@ quiet_cmd_vdso = VDSO    $@
 
 VDSO_LDFLAGS = -shared --hash-style=both --build-id \
 	$(call ld-option, --eh-frame-hdr) -Bsymbolic
+ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+  VDSO_LDFLAGS += $(call ld-option, -z$(comma)ibt)
+endif
+ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+  VDSO_LDFLAGS += $(call ld-option, -z$(comma)shstk)
+endif
 GCOV_PROFILE := n
 
 quiet_cmd_vdso_and_check = VDSO    $@
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index 93c6dc7812d0..3fea2ce318bc 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -52,6 +52,7 @@ SECTIONS
 		*(.gnu.linkonce.b.*)
 	}						:text
 
+	.note.gnu.property : { *(.note.gnu.property) }	:text	:note
 	.note		: { *(.note.*) }		:text	:note
 
 	.eh_frame_hdr	: { *(.eh_frame_hdr) }		:text	:eh_frame_hdr
-- 
2.17.1
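
One way to confirm that an object was really built with branch protection is
to test the __CET__ macro, which GCC documents as being defined when
-fcf-protection is active (bit 0 for branch protection, bit 1 for return
protection).  The helper below is only a sketch with a made-up name; it is
not part of the vDSO build.

```c
/*
 * Reports whether this translation unit was compiled with
 * -fcf-protection=branch (or =full).  Returns 1 if so, 0 otherwise.
 */
static int built_with_ibt(void)
{
#ifdef __CET__
	return (__CET__ & 1) != 0;	/* bit 0: indirect branch tracking */
#else
	return 0;			/* macro absent: no CET code generation */
#endif
}
```

Some distributions enable -fcf-protection by default, so either result is
possible for a plain compiler invocation.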

* [PATCH v8 09/14] x86/vdso/32: Add ENDBR32 to __kernel_vsyscall entry point
From: Yu-cheng Yu @ 2019-08-13 20:53 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu
In-Reply-To: <20190813205359.12196-1-yu-cheng.yu@intel.com>

From: "H.J. Lu" <hjl.tools@gmail.com>

Add ENDBR32 to __kernel_vsyscall entry point.

Acked-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/entry/vdso/vdso32/system_call.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/entry/vdso/vdso32/system_call.S b/arch/x86/entry/vdso/vdso32/system_call.S
index 263d7433dea8..2fc8141fff4e 100644
--- a/arch/x86/entry/vdso/vdso32/system_call.S
+++ b/arch/x86/entry/vdso/vdso32/system_call.S
@@ -14,6 +14,9 @@
 	ALIGN
 __kernel_vsyscall:
 	CFI_STARTPROC
+#ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+	endbr32
+#endif
 	/*
 	 * Reshuffle regs so that all of any of the entry instructions
 	 * will preserve enough state.
-- 
2.17.1

* [PATCH v8 10/14] x86/vsyscall/64: Add ENDBR64 to vsyscall entry points
From: Yu-cheng Yu @ 2019-08-13 20:53 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu
In-Reply-To: <20190813205359.12196-1-yu-cheng.yu@intel.com>

From: "H.J. Lu" <hjl.tools@gmail.com>

Add ENDBR64 to vsyscall entry points.

Acked-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/entry/vsyscall/vsyscall_emu_64.S | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/entry/vsyscall/vsyscall_emu_64.S b/arch/x86/entry/vsyscall/vsyscall_emu_64.S
index 2e203f3a25a7..040696333457 100644
--- a/arch/x86/entry/vsyscall/vsyscall_emu_64.S
+++ b/arch/x86/entry/vsyscall/vsyscall_emu_64.S
@@ -17,16 +17,25 @@ __PAGE_ALIGNED_DATA
 	.type __vsyscall_page, @object
 __vsyscall_page:
 
+#ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+	endbr64
+#endif
 	mov $__NR_gettimeofday, %rax
 	syscall
 	ret
 
 	.balign 1024, 0xcc
+#ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+	endbr64
+#endif
 	mov $__NR_time, %rax
 	syscall
 	ret
 
 	.balign 1024, 0xcc
+#ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+	endbr64
+#endif
 	mov $__NR_getcpu, %rax
 	syscall
 	ret
-- 
2.17.1

* [PATCH v8 11/14] x86/vsyscall/64: Fixup shadow stack and branch tracking for vsyscall
From: Yu-cheng Yu @ 2019-08-13 20:53 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu
In-Reply-To: <20190813205359.12196-1-yu-cheng.yu@intel.com>

When emulating a RET, also unwind the task's shadow stack and cancel
the current branch tracking status.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/entry/vsyscall/vsyscall_64.c    | 29 ++++++++++++++++++++++++
 arch/x86/entry/vsyscall/vsyscall_trace.h |  1 +
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index e7c596dea947..27ff81f75c82 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -38,6 +38,9 @@
 #include <asm/fixmap.h>
 #include <asm/traps.h>
 #include <asm/paravirt.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/fpu/internal.h>
 
 #define CREATE_TRACE_POINTS
 #include "vsyscall_trace.h"
@@ -286,6 +289,32 @@ bool emulate_vsyscall(unsigned long error_code,
 	/* Emulate a ret instruction. */
 	regs->ip = caller;
 	regs->sp += 8;
+
+	/* Unwind shadow stack. */
+
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+	if (current->thread.cet.shstk_enabled) {
+		u64 r;
+
+		modify_fpu_regs_begin();
+		rdmsrl(MSR_IA32_PL3_SSP, r);
+		wrmsrl(MSR_IA32_PL3_SSP, r + 8);
+		modify_fpu_regs_end();
+	}
+#endif
+
+	/* Fixup branch tracking */
+#ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+	if (current->thread.cet.ibt_enabled) {
+		u64 r;
+
+		modify_fpu_regs_begin();
+		rdmsrl(MSR_IA32_U_CET, r);
+		wrmsrl(MSR_IA32_U_CET, r & ~MSR_IA32_CET_WAIT_ENDBR);
+		modify_fpu_regs_end();
+	}
+#endif
+
 	return true;
 
 sigsegv:
diff --git a/arch/x86/entry/vsyscall/vsyscall_trace.h b/arch/x86/entry/vsyscall/vsyscall_trace.h
index 3c3f9765a85c..7aa2101ada44 100644
--- a/arch/x86/entry/vsyscall/vsyscall_trace.h
+++ b/arch/x86/entry/vsyscall/vsyscall_trace.h
@@ -25,6 +25,7 @@ TRACE_EVENT(emulate_vsyscall,
 #endif
 
 #undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
 #define TRACE_INCLUDE_PATH ../../arch/x86/entry/vsyscall/
 #define TRACE_INCLUDE_FILE vsyscall_trace
 #include <trace/define_trace.h>
-- 
2.17.1
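
The effect of the fixup can be modelled in user space as below.  This is a
sketch only: the struct and function names are hypothetical, and the
WAIT_ENDBR bit position is assumed to be the TRACKER bit (bit 11) of
IA32_U_CET as described in the Intel SDM, since the series' own definition
is not part of this patch.

```c
#include <stdint.h>

/* Assumed value: TRACKER bit (bit 11) of IA32_U_CET per the Intel SDM. */
#define CET_WAIT_ENDBR (1ULL << 11)

/* Hypothetical model of the per-task state the patch touches. */
struct cet_model {
	uint64_t sp;		/* regular stack pointer (regs->sp) */
	uint64_t ssp;		/* shadow stack pointer (IA32_PL3_SSP) */
	uint64_t u_cet;		/* contents of IA32_U_CET */
	int shstk_enabled;
	int ibt_enabled;
};

/* Mirror what emulate_vsyscall() does when it emulates the RET. */
static void model_emulated_ret(struct cet_model *m)
{
	m->sp += 8;				/* pop the return address */
	if (m->shstk_enabled)
		m->ssp += 8;			/* pop the matching shadow stack entry */
	if (m->ibt_enabled)
		m->u_cet &= ~CET_WAIT_ENDBR;	/* stop waiting for an ENDBR */
}
```

Without the second step the shadow stack would still hold the return address
of the emulated call; without the third, the task would resume still
expecting an ENDBR at the caller's next instruction.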

* [PATCH v8 12/14] x86/cet: Add PTRACE interface for CET
From: Yu-cheng Yu @ 2019-08-13 20:53 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu
In-Reply-To: <20190813205359.12196-1-yu-cheng.yu@intel.com>

Add REGSET_CET64/REGSET_CET32 to get/set CET MSRs:

    IA32_U_CET (user-mode CET settings) and
    IA32_PL3_SSP (user-mode shadow stack)

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/fpu/regset.h |  7 +++---
 arch/x86/kernel/fpu/regset.c      | 41 +++++++++++++++++++++++++++++++
 arch/x86/kernel/ptrace.c          | 16 ++++++++++++
 include/uapi/linux/elf.h          |  1 +
 4 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/regset.h b/arch/x86/include/asm/fpu/regset.h
index d5bdffb9d27f..edad0d889084 100644
--- a/arch/x86/include/asm/fpu/regset.h
+++ b/arch/x86/include/asm/fpu/regset.h
@@ -7,11 +7,12 @@
 
 #include <linux/regset.h>
 
-extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active;
+extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active,
+				cetregs_active;
 extern user_regset_get_fn fpregs_get, xfpregs_get, fpregs_soft_get,
-				xstateregs_get;
+				xstateregs_get, cetregs_get;
 extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
-				 xstateregs_set;
+				 xstateregs_set, cetregs_set;
 
 /*
  * xstateregs_active == regset_fpregs_active. Please refer to the comment
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index d652b939ccfb..2937ec9d9215 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -156,6 +156,47 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	return ret;
 }
 
+int cetregs_active(struct task_struct *target, const struct user_regset *regset)
+{
+#ifdef CONFIG_X86_INTEL_CET
+	if (target->thread.cet.shstk_enabled || target->thread.cet.ibt_enabled)
+		return regset->n;
+#endif
+	return 0;
+}
+
+int cetregs_get(struct task_struct *target, const struct user_regset *regset,
+		unsigned int pos, unsigned int count,
+		void *kbuf, void __user *ubuf)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct cet_user_state *cetregs;
+
+	if (!boot_cpu_has(X86_FEATURE_SHSTK))
+		return -ENODEV;
+
+	cetregs = get_xsave_addr(&fpu->state.xsave, XFEATURE_CET_USER);
+
+	fpu__prepare_read(fpu);
+	return user_regset_copyout(&pos, &count, &kbuf, &ubuf, cetregs, 0, -1);
+}
+
+int cetregs_set(struct task_struct *target, const struct user_regset *regset,
+		  unsigned int pos, unsigned int count,
+		  const void *kbuf, const void __user *ubuf)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct cet_user_state *cetregs;
+
+	if (!boot_cpu_has(X86_FEATURE_SHSTK))
+		return -ENODEV;
+
+	cetregs = get_xsave_addr(&fpu->state.xsave, XFEATURE_CET_USER);
+
+	fpu__prepare_write(fpu);
+	return user_regset_copyin(&pos, &count, &kbuf, &ubuf, cetregs, 0, -1);
+}
+
 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
 
 /*
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 3c5bbe8e4120..4bae0faa5331 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -52,7 +52,9 @@ enum x86_regset {
 	REGSET_IOPERM64 = REGSET_XFP,
 	REGSET_XSTATE,
 	REGSET_TLS,
+	REGSET_CET64 = REGSET_TLS,
 	REGSET_IOPERM32,
+	REGSET_CET32,
 };
 
 struct pt_regs_offset {
@@ -1239,6 +1241,13 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
 		.size = sizeof(long), .align = sizeof(long),
 		.active = ioperm_active, .get = ioperm_get
 	},
+	[REGSET_CET64] = {
+		.core_note_type = NT_X86_CET,
+		.n = sizeof(struct cet_user_state) / sizeof(u64),
+		.size = sizeof(u64), .align = sizeof(u64),
+		.active = cetregs_active, .get = cetregs_get,
+		.set = cetregs_set
+	},
 };
 
 static const struct user_regset_view user_x86_64_view = {
@@ -1294,6 +1303,13 @@ static struct user_regset x86_32_regsets[] __ro_after_init = {
 		.size = sizeof(u32), .align = sizeof(u32),
 		.active = ioperm_active, .get = ioperm_get
 	},
+	[REGSET_CET32] = {
+		.core_note_type = NT_X86_CET,
+		.n = sizeof(struct cet_user_state) / sizeof(u64),
+		.size = sizeof(u64), .align = sizeof(u64),
+		.active = cetregs_active, .get = cetregs_get,
+		.set = cetregs_set
+	},
 };
 
 static const struct user_regset_view user_x86_32_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 530ce08467c2..349c435a2ce1 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -401,6 +401,7 @@ typedef struct elf64_shdr {
 #define NT_386_TLS	0x200		/* i386 TLS slots (struct user_desc) */
 #define NT_386_IOPERM	0x201		/* x86 io permission bitmap (1=deny) */
 #define NT_X86_XSTATE	0x202		/* x86 extended state using xsave */
+#define NT_X86_CET	0x203		/* x86 cet state */
 #define NT_S390_HIGH_GPRS	0x300	/* s390 upper register halves */
 #define NT_S390_TIMER	0x301		/* s390 timer register */
 #define NT_S390_TODCMP	0x302		/* s390 TOD clock comparator register */
-- 
2.17.1

* [PATCH v8 13/14] x86: Discard .note.gnu.property sections
From: Yu-cheng Yu @ 2019-08-13 20:53 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu
In-Reply-To: <20190813205359.12196-1-yu-cheng.yu@intel.com>

From: "H.J. Lu" <hjl.tools@gmail.com>

With the command-line option, -mx86-used-note=yes, the x86 assembler
in binutils 2.32 and above generates a program property note in a note
section, .note.gnu.property, to encode used x86 ISAs and features.
To exclude .note.gnu.property sections from NOTE segment in x86 kernel
linker script:

PHDRS {
 text PT_LOAD FLAGS(5);
 data PT_LOAD FLAGS(6);
 percpu PT_LOAD FLAGS(6);
 init PT_LOAD FLAGS(7);
 note PT_NOTE FLAGS(0);
}
SECTIONS
{
...
 .notes : AT(ADDR(.notes) - 0xffffffff80000000) { __start_notes = .; KEEP(*(.note.*)) __stop_notes = .; } :text :note
...
}

this patch discards .note.gnu.property sections in kernel linker script
by adding

 /DISCARD/ : {
  *(.note.gnu.property)
 }

before .notes sections.  Because .exit.text and .exit.data sections are
discarded at runtime rather than at link time, the patch also undefines
EXIT_TEXT and EXIT_DATA so that these sections are excluded from the
default discarded sections.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/kernel/vmlinux.lds.S | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index e2feacf921a0..5ef137493a85 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -146,6 +146,10 @@ SECTIONS
 		_etext = .;
 	} :text = 0x9090
 
+	/* .note.gnu.property sections should be discarded */
+	/DISCARD/ : {
+		*(.note.gnu.property)
+	}
 	NOTES :text :note
 
 	EXCEPTION_TABLE(16) :text = 0x9090
@@ -415,6 +419,12 @@ SECTIONS
 	STABS_DEBUG
 	DWARF_DEBUG
 
+	/* Sections to be discarded.  EXIT_TEXT and EXIT_DATA discard at
+	 * runtime, not link time. */
+#undef EXIT_TEXT
+#define EXIT_TEXT
+#undef EXIT_DATA
+#define EXIT_DATA
 	DISCARDS
 	/DISCARD/ : {
 		*(.eh_frame)
-- 
2.17.1

* [PATCH v8 14/14] Introduce arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE)
From: Yu-cheng Yu @ 2019-08-13 20:53 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu
In-Reply-To: <20190813205359.12196-1-yu-cheng.yu@intel.com>

When CET Indirect Branch Tracking (IBT) is enabled, the processor expects
every indirect branch target to be an ENDBR instruction, unless the
target's address is marked as legacy in the legacy code bitmap.  The bitmap
covers the whole user-mode address space (TASK_SIZE_MAX for 64-bit,
TASK_SIZE for IA32), and each bit represents one page of linear address
range.  The bitmap is located at the topmost address:
(TASK_SIZE - IBT_BITMAP_SIZE).

The bitmap is allocated only when an application calls
ARCH_X86_CET_MARK_LEGACY_CODE for the first time.

The IBT bitmap is visible from user mode, but not writable.
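
To give a feel for the sizes involved, the sketch below repeats the
IBT_BITMAP_SIZE arithmetic in user space.  The function name is
hypothetical; the example figures assume 4 KiB pages and a 47-bit user
address space (4-level paging), which are assumptions rather than something
this patch states.

```c
#include <stdint.h>

/*
 * Bytes of bitmap needed to cover mmap_max bytes of address space at one
 * bit per page.  Dividing with round-up matches rounding mmap_max up to a
 * multiple of (page_size * 8) first and then dividing.
 */
static uint64_t ibt_bitmap_size(uint64_t mmap_max, uint64_t page_size)
{
	uint64_t span = page_size * 8;	/* address range covered by one bitmap byte */

	return (mmap_max + span - 1) / span;
}
```

Under those assumptions a 64-bit task needs 4 GiB of virtual address space
for the bitmap and an IA32 task about 128 KiB, which is why the mapping is
backed by the zero page until bits are actually set.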

Introduce:

arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE, unsigned long *buf)
    Mark an address range as IBT legacy code.

    *buf: starting linear address
    *(buf + 1): size of the legacy code
    *(buf + 2): set (1); clear (0)
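
From user space the call could be wrapped as sketched below.  The wrapper
and packing helper are hypothetical names; the option value 0x3007 is the
one proposed in this patch, so on a kernel without it the call is expected
to fail.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_X86_CET_MARK_LEGACY_CODE 0x3007	/* value proposed by this patch */

/* Fill the three-word argument block in the order listed above. */
static void pack_mark_legacy_args(unsigned long buf[3], unsigned long addr,
				  unsigned long size, int set)
{
	buf[0] = addr;		/* starting linear address */
	buf[1] = size;		/* size of the legacy code */
	buf[2] = set ? 1 : 0;	/* 1 = mark as legacy, 0 = clear */
}

/* Returns 0 on success, -1 with errno set on failure. */
static long mark_legacy_code(unsigned long addr, unsigned long size, int set)
{
#ifdef SYS_arch_prctl
	unsigned long buf[3];

	pack_mark_legacy_args(buf, addr, size, set);
	return syscall(SYS_arch_prctl, ARCH_X86_CET_MARK_LEGACY_CODE, buf);
#else
	(void)addr; (void)size; (void)set;
	errno = ENOSYS;		/* arch_prctl is x86-only */
	return -1;
#endif
}
```

A dynamic loader would be the typical caller, marking the text of a
non-IBT shared object right after mapping it.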

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/cet.h        |   3 +
 arch/x86/include/asm/processor.h  |  13 +-
 arch/x86/include/uapi/asm/prctl.h |   1 +
 arch/x86/kernel/Makefile          |   2 +-
 arch/x86/kernel/cet_bitmap.c      | 210 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cet_prctl.c       |  15 +++
 mm/memory.c                       |   8 ++
 7 files changed, 250 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/cet_bitmap.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 2561efe081ad..d5f693d082b0 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -4,6 +4,7 @@
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+#include <asm/processor.h>
 
 struct task_struct;
 struct sc_ext;
@@ -30,6 +31,7 @@ void cet_disable_free_shstk(struct task_struct *p);
 int cet_restore_signal(bool ia32, struct sc_ext *sc);
 int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
 int cet_setup_ibt(void);
+int cet_mark_legacy_code(unsigned long addr, unsigned long size, unsigned long set);
 void cet_disable_ibt(void);
 #else
 static inline int prctl_cet(int option, unsigned long arg2) { return -EINVAL; }
@@ -42,6 +44,7 @@ static inline int cet_restore_signal(bool ia32, struct sc_ext *sc) { return -EIN
 static inline int cet_setup_signal(bool ia32, unsigned long rstor,
 				   struct sc_ext *sc) { return -EINVAL; }
 static inline int cet_setup_ibt(void) { return -EINVAL; }
+static inline int cet_mark_legacy_code(unsigned long addr, unsigned long size, unsigned long set) { return -EINVAL; }
 static inline void cet_disable_ibt(void) {}
 #endif
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 0f9bc7fd1351..af3bdd545a55 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -888,7 +888,18 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE_LOW
+#define MMAP_MAX		(unsigned long)(test_thread_flag(TIF_ADDR32) ? \
+					TASK_SIZE : TASK_SIZE_MAX)
+
+#define IBT_BITMAP_SIZE		(round_up(MMAP_MAX, PAGE_SIZE * BITS_PER_BYTE) / \
+					(PAGE_SIZE * BITS_PER_BYTE))
+
+#define IBT_BITMAP_ADDR		(TASK_SIZE - IBT_BITMAP_SIZE)
+
+#define STACK_TOP		(TASK_SIZE_LOW < IBT_BITMAP_ADDR - PAGE_SIZE ? \
+					TASK_SIZE_LOW : \
+					IBT_BITMAP_ADDR - PAGE_SIZE)
+
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 02243127dcf6..da39d4bde4e1 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,5 +20,6 @@
 #define ARCH_X86_CET_ALLOC_SHSTK	0x3004
 #define ARCH_X86_CET_GET_LEGACY_BITMAP	0x3005 /* deprecated */
 #define ARCH_X86_CET_SET_LEGACY_BITMAP	0x3006 /* deprecated */
+#define ARCH_X86_CET_MARK_LEGACY_CODE	0x3007
 
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 311829335521..228906364513 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -140,7 +140,7 @@ obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
 obj-$(CONFIG_UNWINDER_FRAME_POINTER)	+= unwind_frame.o
 obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
 
-obj-$(CONFIG_X86_INTEL_CET)		+= cet.o cet_prctl.o
+obj-$(CONFIG_X86_INTEL_CET)		+= cet.o cet_prctl.o cet_bitmap.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/cet_bitmap.c b/arch/x86/kernel/cet_bitmap.c
new file mode 100644
index 000000000000..25eb441eb094
--- /dev/null
+++ b/arch/x86/kernel/cet_bitmap.c
@@ -0,0 +1,210 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/bits.h>
+#include <linux/err.h>
+#include <linux/memcontrol.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/oom.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <asm/cet.h>
+#include <asm/fpu/internal.h>
+
+/*
+ * For read fault, provide the zero page.  For write fault coming from
+ * get_user_pages(), clear the page already allocated.
+ */
+static vm_fault_t bitmap_fault(const struct vm_special_mapping *sm,
+			       struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	if (!(vmf->flags & FAULT_FLAG_WRITE)) {
+		vmf->page = ZERO_PAGE(vmf->address);
+		return 0;
+	} else {
+		vm_fault_t r;
+
+		if (!vmf->cow_page)
+			return VM_FAULT_ERROR;
+
+		clear_user_highpage(vmf->cow_page, vmf->address);
+		__SetPageUptodate(vmf->cow_page);
+		r = finish_fault(vmf);
+		return r ? r : VM_FAULT_DONE_COW;
+	}
+}
+
+static int bitmap_mremap(const struct vm_special_mapping *sm,
+			 struct vm_area_struct *vma)
+{
+	return -EINVAL;
+}
+
+static const struct vm_special_mapping bitmap_mapping = {
+	.name	= "[ibt_bitmap]",
+	.fault	= bitmap_fault,
+	.mremap	= bitmap_mremap,
+};
+
+static int alloc_bitmap(void)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	u64 msr_ia32_u_cet;
+	int r = 0;
+
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+
+	vma = _install_special_mapping(mm, IBT_BITMAP_ADDR, IBT_BITMAP_SIZE,
+				       VM_READ | VM_MAYREAD | VM_MAYWRITE,
+				       &bitmap_mapping);
+
+	if (IS_ERR(vma))
+		r = PTR_ERR(vma);
+
+	up_write(&mm->mmap_sem);
+
+	if (r)
+		return r;
+
+	current->thread.cet.ibt_bitmap_used = 1;
+
+	modify_fpu_regs_begin();
+	rdmsrl(MSR_IA32_U_CET, msr_ia32_u_cet);
+	msr_ia32_u_cet |= (MSR_IA32_CET_LEG_IW_EN | IBT_BITMAP_ADDR);
+	wrmsrl(MSR_IA32_U_CET, msr_ia32_u_cet);
+	modify_fpu_regs_end();
+	return 0;
+}
+
+/*
+ * Set bits in the IBT legacy code bitmap, which is read-only user memory.
+ */
+static int set_bits(unsigned long start_bit, unsigned long end_bit,
+		    unsigned long set)
+{
+	unsigned long start_ul, end_ul, nr_ul;
+	unsigned long start_ul_addr, tmp_addr, len;
+	int i, j;
+
+	start_ul = start_bit / BITS_PER_LONG;
+	end_ul = end_bit / BITS_PER_LONG;
+	i = start_bit % BITS_PER_LONG;
+	j = end_bit % BITS_PER_LONG;
+
+	start_ul_addr = IBT_BITMAP_ADDR + start_ul * sizeof(0UL);
+	nr_ul = end_ul - start_ul + 1;
+
+	tmp_addr = start_ul_addr;
+	len = nr_ul * sizeof(0UL);
+
+	down_read(&current->mm->mmap_sem);
+	while (len) {
+		unsigned long *first, *last, mask, bytes;
+		int ret, offset;
+		void *kern_page_addr;
+		struct page *page = NULL;
+
+		ret = get_user_pages(tmp_addr, 1, FOLL_WRITE | FOLL_FORCE,
+				     &page, NULL);
+
+		if (ret <= 0) {
+			up_read(&current->mm->mmap_sem);
+			return ret;
+		}
+
+		kern_page_addr = kmap(page);
+
+		bytes = len;
+		offset = tmp_addr & (PAGE_SIZE - 1);
+
+		/* Is end_ul in this page? */
+		if (bytes > (PAGE_SIZE - offset)) {
+			bytes = PAGE_SIZE - offset;
+			last = NULL;
+		} else {
+			last = (unsigned long *)(kern_page_addr + offset + bytes) - 1;
+		}
+
+		/* Is start_ul in this page? */
+		if (tmp_addr == start_ul_addr)
+			first = (unsigned long *)(kern_page_addr + offset);
+		else
+			first = NULL;
+
+		if (nr_ul == 1) {
+			mask = GENMASK(j, i);
+
+			if (set)
+				*first |= mask;
+			else
+				*first &= ~mask;
+		} else {
+			if (first) {
+				mask = GENMASK(BITS_PER_LONG - 1, i);
+
+				if (set)
+					*first |= mask;
+				else
+					*first &= ~mask;
+			}
+
+			if (last) {
+				mask = GENMASK(j, 0);
+
+				if (set)
+					*last |= mask;
+				else
+					*last &= ~mask;
+			}
+
+			if (nr_ul > 2) {
+				void *p = kern_page_addr + offset;
+				int cnt = bytes;
+
+				if (first) {
+					p += sizeof(*first);
+					cnt -= sizeof(*first);
+				}
+
+				if (last)
+					cnt -= sizeof(*last);
+
+				if (set)
+					memset(p, 0xff, cnt);
+				else
+					memset(p, 0, cnt);
+			}
+		}
+
+		set_page_dirty_lock(page);
+		kunmap(page);
+		put_page(page);
+
+		len -= bytes;
+		tmp_addr += bytes;
+	}
+	up_read(&current->mm->mmap_sem);
+	return 0;
+}
+
+int cet_mark_legacy_code(unsigned long addr, unsigned long size, unsigned long set)
+{
+	int r;
+
+	if (!current->thread.cet.ibt_enabled)
+		return -EINVAL;
+
+	if ((addr >= IBT_BITMAP_ADDR) || (addr + size > IBT_BITMAP_ADDR))
+		return -EINVAL;
+
+	if (!current->thread.cet.ibt_bitmap_used) {
+		r = alloc_bitmap();
+		if (r)
+			return r;
+	}
+
+	return set_bits(addr / PAGE_SIZE, (addr + size - 1) / PAGE_SIZE, set);
+}
diff --git a/arch/x86/kernel/cet_prctl.c b/arch/x86/kernel/cet_prctl.c
index 09d8c4ea935c..eec5baf8b0da 100644
--- a/arch/x86/kernel/cet_prctl.c
+++ b/arch/x86/kernel/cet_prctl.c
@@ -57,6 +57,18 @@ static int handle_alloc_shstk(unsigned long arg2)
 	return 0;
 }
 
+static int handle_mark_legacy_code(unsigned long arg2)
+{
+	unsigned long addr, size, set;
+
+	if (get_user(addr, (unsigned long __user *)arg2) ||
+	    get_user(size, (unsigned long __user *)arg2 + 1) ||
+	    get_user(set, (unsigned long __user *)arg2 + 2))
+		return -EFAULT;
+
+	return cet_mark_legacy_code(addr, size, set);
+}
+
 int prctl_cet(int option, unsigned long arg2)
 {
 	if (!cpu_x86_cet_enabled())
@@ -83,6 +95,9 @@ int prctl_cet(int option, unsigned long arg2)
 	case ARCH_X86_CET_ALLOC_SHSTK:
 		return handle_alloc_shstk(arg2);
 
+	case ARCH_X86_CET_MARK_LEGACY_CODE:
+		return handle_mark_legacy_code(arg2);
+
 	default:
 		return -EINVAL;
 	}
diff --git a/mm/memory.c b/mm/memory.c
index be93a73b5152..75076f727be0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3290,6 +3290,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 
 	flush_icache_page(vma, page);
 	entry = mk_pte(page, vma->vm_page_prot);
+
+	if (is_zero_pfn(pte_pfn(entry))) {
+		entry = pte_mkspecial(entry);
+		goto alloc_set_pte_out;
+	}
+
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	/* copy-on-write page */
@@ -3302,6 +3308,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page, false);
 	}
+
+alloc_set_pte_out:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
 	/* no need to invalidate: a not-present page won't be cached */
-- 
2.17.1
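
The range handling in cet_mark_legacy_code() can be sketched in
userspace C.  This is a minimal sketch, not the patch's code:
SKETCH_PAGE_SIZE, struct mark_args, range_below() and page_span() are
hypothetical names, and a 4096-byte page is assumed.  The struct mirrors
the three consecutive unsigned longs that handle_mark_legacy_code()
reads from arg2.  range_below() is one possible way to write the bound
check so that addr + size cannot wrap around; it also rejects
size == 0, for which (addr + size - 1) would land before addr.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Hypothetical view of the argument block: addr, size, set, in that order. */
struct mark_args {
	unsigned long addr;
	unsigned long size;
	unsigned long set;
};

/*
 * True when [addr, addr + size) is non-empty and lies entirely below
 * limit.  Comparing size against (limit - addr) avoids computing
 * addr + size, which could wrap for very large sizes.
 */
static bool range_below(unsigned long addr, unsigned long size,
			unsigned long limit)
{
	if (size == 0 || addr >= limit)
		return false;
	return size <= limit - addr;
}

/* First and last page index covered by the range (the set_bits() arguments). */
static void page_span(unsigned long addr, unsigned long size,
		      unsigned long *first, unsigned long *last)
{
	*first = addr / SKETCH_PAGE_SIZE;
	*last = (addr + size - 1) / SKETCH_PAGE_SIZE;
}
```

A caller would validate with range_below() first and only then compute
the page span, since page_span() assumes a non-empty range that does
not wrap.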


* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-13 21:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <CALCETrXEHL3+NAY6P6vUj7Pvd9ZpZsYC6VCLXOaNxb90a_POGw@mail.gmail.com>

On Tue, Aug 06, 2019 at 10:24:25PM -0700, Andy Lutomirski wrote:
> >
> > Inside containers and inside nested containers we need to start processes
> > that will use bpf. All of the processes are trusted.
> 
> Trusted by whom?  In a non-nested container, the container manager
> *might* be trusted by the outside world.  In a *nested* container,
> unless the inner container management is controlled from outside the
> outer container, it's not trusted.  I don't know much about how
> Facebook's containers work, but the LXC/LXD/Podman world is moving
> very strongly toward user namespaces and maximally-untrusted
> containers, and I think bpf() should work in that context.

agree that containers (namespaces) reduce amount of trust necessary
for apps to run, but the end goal is not security though.
Linux has become a single user system.
If user can ssh into the host they can become root.
If arbitrary code can run on the host it will break out of any sandbox.
Containers are not providing the level of security that is enough
to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
Containers are used to make production systems safer.
Some people call it more 'secure', but it's clearly not secure for
arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
It's been a constant source of pain. The constant blinding, randomization,
verifier speculative analysis, all spectre v1, v2, v4 mitigations
are simply not worth it. It's a lot of complex kernel code without users.
There is not a single use case to allow arbitrary malicious bpf
program to be loaded and executed.
As soon as we have /dev/bpf to allow all of bpf to be used without root
we will set sysctl kernel.unprivileged_bpf_disabled=1
Hence I prefer this /dev/bpf mechanism to be as simple as possible.
The applications that will use it are going to be just as trusted as systemd.

> > To solve your concern of bypassing all capable checks...
> > How about we do /dev/bpf/full_verifier first?
> > It will replace capable() checks in the verifier only.
> 
> I'm not convinced that "in the verifier" is the right distinction.
> Telling administrators that some setting lets certain users bypass
> bpf() verifier checks doesn't have a clear enough meaning.  

Linux is a single user system. There are no administrators any more.
No doubt, folks will disagree, but that game is over.
At least on bpf side it's done.

> I propose,
> instead, that the current capable() checks be divided into three
> categories:

I don't see a use case for these categories.
All bpf programs extend the kernel in some way.
The kernel vs user is one category.
Conceptually CAP_BPF is enough. It would be similar to CAP_NET_ADMIN.
When application has CAP_NET_ADMIN it covers all of networking knobs.
There is no use case that would warrant fine grain CAP_ROUTE_ADMIN,
CAP_ETHTOOL_ADMIN, CAP_ETH0_ADMIN, etc.
Similarly CAP_BPF as the only knob is enough.
The only disadvantage of CAP_BPF is that it's not possible to
pass it from one systemd-like daemon to another systemd-like daemon.
Hence /dev/bpf idea and passing file descriptor.

> This type of thing actually fits quite nicely into an idea I've been
> thinking about for a while called "implicit rights". In very brief
> summary, there would be objects called /dev/rights/xyz, where xyz is
> the name of a "right".  If there is a readable object of the right
> type at the literal path "/dev/rights/xyz", then you have right xyz.
> There's a bit more flexibility on top of this.  BPF could use
> /dev/rights/bpf/maptypes/lpm and
> /dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
> use cases include a biggie:
> /dev/rights/namespace/create_unprivileged_userns.
> /dev/rights/bind_port/80 would be nice, too.

The concept of "implicit rights" is very nice and I'm sure it will
be a good fit somewhere, but I don't see why use it in bpf space.
There is no use case for fine grain partition of bpf features.
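
The "passing file descriptor" part of the /dev/bpf idea relies on the
standard SCM_RIGHTS mechanism of AF_UNIX sockets.  A minimal sketch of
that mechanism follows.  send_fd() and recv_fd() are hypothetical helper
names, and nothing here is specific to /dev/bpf: any open descriptor can
be passed this way, assuming the two daemons share a connected Unix
socket.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor fd over the connected Unix socket sock; 0 on success. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of ordinary data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;	/* forces correct alignment of buf */
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new descriptor referring to the same open file, so
whatever access the sender's descriptor carried travels with it.  A
capability bit such as the proposed CAP_BPF cannot be handed over this
way.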


* Re: [PATCH] syscalls: Update the syscall #defines to match uapi
From: Alistair Francis @ 2019-08-13 22:04 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Alistair Francis, Linux Kernel Mailing List, Linux API,
	Deepa Dinamani
In-Reply-To: <CAK8P3a0phB6hb=Da6V=CWWmi_oYwUcpSRi6CZTqAPz7QXKFJTA@mail.gmail.com>

On Tue, Aug 13, 2019 at 12:41 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Tue, Aug 13, 2019 at 9:01 PM Alistair Francis <alistair23@gmail.com> wrote:
> > On Mon, Aug 12, 2019 at 2:49 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> > > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > > > index 2bcef4c70183..e4bf5e480d60 100644
> > > > --- a/include/linux/syscalls.h
> > > > +++ b/include/linux/syscalls.h
> > > > @@ -512,7 +512,7 @@ asmlinkage long sys_readlinkat(int dfd, const char __user *path, char __user *bu
> > > >  asmlinkage long sys_newfstatat(int dfd, const char __user *filename,
> > > >                                struct stat __user *statbuf, int flag);
> > > >  asmlinkage long sys_newfstat(unsigned int fd, struct stat __user *statbuf);
> > > > -#if defined(__ARCH_WANT_STAT64) || defined(__ARCH_WANT_COMPAT_STAT64)
> > > > +#if defined(__ARCH_WANT_NEW_STAT) || defined(__ARCH_WANT_STAT64)
> > > >  asmlinkage long sys_fstat64(unsigned long fd, struct stat64 __user *statbuf);
> > > >  asmlinkage long sys_fstatat64(int dfd, const char __user *filename,
> > > >                                struct stat64 __user *statbuf, int flag);
> > >
> > > I think this is wrong: when __ARCH_WANT_NEW_STAT is set, we are
> > > on a 64-bit architecture and only want the sys_newfstat{,at} system
> > > calls, not sys_fstat{,at}64 that gets used on 32-bit machines.
> >
> > Ah, that would make sense then. I don't think you will see the error then.
>
> So we don't need this patch to build riscv32 kernels, right? It's possible
> that it was the result of an incorrect forward port of some other patch,
> as older riscv32 kernels did provide stat64(), but newer ones only have
> statx().

The issue came up when I was just changing some things for testing and
I thought it was a bug that others might run into. It isn't directly
related to the riscv32 kernel.

Alistair

>
>        Arnd


* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Daniel Colascione @ 2019-08-13 22:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190813215823.3sfbakzzjjykyng2@ast-mbp>

On Tue, Aug 13, 2019 at 2:58 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 06, 2019 at 10:24:25PM -0700, Andy Lutomirski wrote:
> > >
> > > Inside containers and inside nested containers we need to start processes
> > > that will use bpf. All of the processes are trusted.
> >
> > Trusted by whom?  In a non-nested container, the container manager
> > *might* be trusted by the outside world.  In a *nested* container,
> > unless the inner container management is controlled from outside the
> > outer container, it's not trusted.  I don't know much about how
> > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > very strongly toward user namespaces and maximally-untrusted
> > containers, and I think bpf() should work in that context.
>
> agree that containers (namespaces) reduce amount of trust necessary
> for apps to run, but the end goal is not security though.
> Linux has become a single user system.
> If user can ssh into the host they can become root.
> If arbitrary code can run on the host it will break out of any sandbox.
> Containers are not providing the level of security that is enough
> to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> Containers are used to make production systems safer.
> Some people call it more 'secure', but it's clearly not secure for
> arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
> When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> It's been a constant source of pain. The constant blinding, randomization,
> verifier speculative analysis, all spectre v1, v2, v4 mitigations
> are simply not worth it. It's a lot of complex kernel code without users.
> There is not a single use case to allow arbitrary malicious bpf
> program to be loaded and executed.
> As soon as we have /dev/bpf to allow all of bpf to be used without root
> we will set sysctl kernel.unprivileged_bpf_disabled=1
> Hence I prefer this /dev/bpf mechanism to be as simple as possible.
> The applications that will use it are going to be just as trusted as systemd.
>
> > > To solve your concern of bypassing all capable checks...
> > > How about we do /dev/bpf/full_verifier first?
> > > It will replace capable() checks in the verifier only.
> >
> > I'm not convinced that "in the verifier" is the right distinction.
> > Telling administrators that some setting lets certain users bypass
> > bpf() verifier checks doesn't have a clear enough meaning.
>
> Linux is a single user system. There are no administrators any more.
> No doubt, folks will disagree, but that game is over.
> At least on bpf side it's done.
>
> > I propose,
> > instead, that the current capable() checks be divided into three
> > categories:
>
> I don't see a use case for these categories.
> All bpf programs extend the kernel in some way.
> The kernel vs user is one category.
> Conceptually CAP_BPF is enough. It would be similar to CAP_NET_ADMIN.
> When application has CAP_NET_ADMIN it covers all of networking knobs.
> There is no use case that would warrant fine grain CAP_ROUTE_ADMIN,
> CAP_ETHTOOL_ADMIN, CAP_ETH0_ADMIN, etc.
> Similarly CAP_BPF as the only knob is enough.
> The only disadvantage of CAP_BPF is that it's not possible to
> pass it from one systemd-like daemon to another systemd-like daemon.
> Hence /dev/bpf idea and passing file descriptor.
>
> > This type of thing actually fits quite nicely into an idea I've been
> > thinking about for a while called "implicit rights". In very brief
> > summary, there would be objects called /dev/rights/xyz, where xyz is
> > the name of a "right".  If there is a readable object of the right
> > type at the literal path "/dev/rights/xyz", then you have right xyz.
> > There's a bit more flexibility on top of this.  BPF could use
> > /dev/rights/bpf/maptypes/lpm and
> > /dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
> > use cases include a biggie:
> > /dev/rights/namespace/create_unprivileged_userns.
> > /dev/rights/bind_port/80 would be nice, too.
>
> The concept of "implicit rights" is very nice and I'm sure it will
> be a good fit somewhere, but I don't see why use it in bpf space.
> There is no use case for fine grain partition of bpf features.

Isn't this "implicit rights" model just another kind of ambient
authority --- one that constrains the otherwise-free filesystem
namespace to boot? IMHO, the kernel should be moving toward explicit
authorization tokens modeled by file descriptors and away from
contextual authorization decisions.


* Re: [PATCH v8 09/27] mm/mmap: Prevent Shadow Stack VMA merges
From: Dave Hansen @ 2019-08-13 22:34 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <20190813205225.12032-10-yu-cheng.yu@intel.com>

On 8/13/19 1:52 PM, Yu-cheng Yu wrote:
> To prevent function call/return spills into the next shadow stack
> area, do not merge shadow stack areas.

How does this prevent call/return spills?


* Re: [PATCH v8 15/27] mm: Handle shadow stack page fault
From: Andy Lutomirski @ 2019-08-13 22:55 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz
In-Reply-To: <20190813205225.12032-16-yu-cheng.yu@intel.com>

On Tue, Aug 13, 2019 at 2:02 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> When a task does fork(), its shadow stack (SHSTK) must be duplicated
> for the child.  This patch implements a flow similar to copy-on-write
> of an anonymous page, but for SHSTK.
>
> A SHSTK PTE must be RO and dirty.  This dirty bit requirement is used
> to effect the copying.  In copy_one_pte(), clear the dirty bit from a
> SHSTK PTE to cause a page fault upon the next SHSTK access.  At that
> time, fix the PTE and copy/re-use the page.

Is using VM_SHSTK and special-casing all of this really better than
using a special mapping or other pseudo-file-backed VMA and putting
all the magic in the vm_operations?

--Andy
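
For reference, the flow described in the quoted changelog can be
modelled in a few lines of userspace C.  This is only a toy model under
stated assumptions: struct shstk_pte and the three helpers are
hypothetical names, the PTE is treated as always read-only, and the
sketch assumes fork clears the dirty bit in both the parent and the
child PTE, which the changelog does not spell out.

```c
#include <stdbool.h>

/* Toy shadow stack PTE: always read-only, so only the dirty bit is modelled. */
struct shstk_pte {
	bool dirty;	/* hardware dirty bit */
	int *page_refs;	/* how many PTEs map the backing page */
};

/* A shadow stack access succeeds only while the read-only PTE is dirty. */
static bool shstk_access_faults(const struct shstk_pte *pte)
{
	return !pte->dirty;
}

/* fork(): share the page and clear dirty so the next access faults. */
static void shstk_fork(struct shstk_pte *parent, struct shstk_pte *child)
{
	parent->dirty = false;
	*child = *parent;
	(*parent->page_refs)++;
}

/*
 * Fault handler: copy the page if it is still shared, otherwise re-use
 * it, then restore the dirty bit.  Returns true if a copy was made.
 */
static bool shstk_fix_fault(struct shstk_pte *pte, int *new_page_refs)
{
	bool copied = false;

	if (*pte->page_refs > 1) {
		(*pte->page_refs)--;
		*new_page_refs = 1;
		pte->page_refs = new_page_refs;
		copied = true;
	}
	pte->dirty = true;
	return copied;
}
```

In this model the first task to touch its shadow stack after fork gets a
private copy, and the other task then re-uses the original page.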


* Re: [PATCH v8 11/27] x86/mm: Introduce _PAGE_DIRTY_SW
From: Dave Hansen @ 2019-08-13 23:02 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <20190813205225.12032-12-yu-cheng.yu@intel.com>

> +#if defined(CONFIG_X86_INTEL_SHADOW_STACK_USER)
> +static inline pte_t pte_move_flags(pte_t pte, pteval_t from, pteval_t to)
> +{
> +	if (pte_flags(pte) & from)
> +		pte = pte_set_flags(pte_clear_flags(pte, from), to);
> +	return pte;

Why is this conditional on the compile option and not a runtime check?

> +}
> +#else
> +static inline pte_t pte_move_flags(pte_t pte, pteval_t from, pteval_t to)
> +{
> +	return pte;
> +}
> +#endif

Why do we need this function?  It's not mentioned in the changelog or
commented.

>  static inline pte_t pte_mkclean(pte_t pte)
>  {
> -	return pte_clear_flags(pte, _PAGE_DIRTY);
> +	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
>  }
>  
>  static inline pte_t pte_mkold(pte_t pte)
> @@ -322,6 +336,7 @@ static inline pte_t pte_mkold(pte_t pte)
>  
>  static inline pte_t pte_wrprotect(pte_t pte)
>  {
> +	pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
>  	return pte_clear_flags(pte, _PAGE_RW);
>  }

Please comment what this is doing and why.

> @@ -332,9 +347,24 @@ static inline pte_t pte_mkexec(pte_t pte)
>  
>  static inline pte_t pte_mkdirty(pte_t pte)
>  {
> +	pteval_t dirty = (!IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER) ||
> +			   pte_write(pte)) ? _PAGE_DIRTY_HW:_PAGE_DIRTY_SW;

This is *really* hard for me to read and parse.  How about:

	pteval_t dirty = _PAGE_DIRTY_HW;

	/*
	 * When Shadow Stacks are enabled, read-only PTEs can
	 * not have the hardware dirty bit set and must use
	 * the software bit.
	 */
	if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER) &&
	    !pte_write(pte))
		dirty = _PAGE_DIRTY_SW;


> +	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
> +}
> +
> +#ifdef CONFIG_ARCH_HAS_SHSTK
> +static inline pte_t pte_mkdirty_shstk(pte_t pte)
> +{
> +	pte = pte_clear_flags(pte, _PAGE_DIRTY_SW);
>  	return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
>  }

Why does the _PAGE_DIRTY_SW *HAVE* to be cleared on shstk pages?

> +static inline bool pte_dirty_hw(pte_t pte)
> +{
> +	return pte_flags(pte) & _PAGE_DIRTY_HW;
> +}
> +#endif

Why are these #ifdef'd?

>  static inline pte_t pte_mkyoung(pte_t pte)
>  {
>  	return pte_set_flags(pte, _PAGE_ACCESSED);
> @@ -342,6 +372,7 @@ static inline pte_t pte_mkyoung(pte_t pte)
>  
>  static inline pte_t pte_mkwrite(pte_t pte)
>  {
> +	pte = pte_move_flags(pte, _PAGE_DIRTY_SW, _PAGE_DIRTY_HW);
>  	return pte_set_flags(pte, _PAGE_RW);
>  }

It also isn't clear to me why this *must* move bits here.  It's doubly
unclear why you would need to do this on systems when shadow stacks are
compiled in but disabled.

<snip>

Same comments for pmds and puds.

> -
>  /* mprotect needs to preserve PAT bits when updating vm_page_prot */
>  #define pgprot_modify pgprot_modify
>  static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
> @@ -1178,6 +1254,19 @@ static inline int pmd_write(pmd_t pmd)
>  	return pmd_flags(pmd) & _PAGE_RW;
>  }
>  
> +static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> +{
> +	pmdval_t val = pmd_val(pmd), oldval = val;
> +
> +	val &= _HPAGE_CHG_MASK;
> +	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
> +	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
> +	if ((pmd_write(pmd) && !(pgprot_val(newprot) & _PAGE_RW)))
> +		return pmd_move_flags(__pmd(val), _PAGE_DIRTY_HW,
> +				      _PAGE_DIRTY_SW);
> +	return __pmd(val);
> +}

Why was this function moved?  This makes it really hard to review what
you changed.

I'm going to stop reading this code now.  It needs a lot more care and
feeding to make it reviewable.  Please go back, double-check your
changelogs and flesh them out, then please try to make the code more
readable and understandable by commenting it.

Please take all of the compile-time checks and ask yourself whether they
need to be or *can* be runtime checks.  Consider what the overhead is of
non-shadowstack systems running shadowstack-required code.

Please also reconcile the supervisor XSAVE portion of your patches with
the ones that Fenghua has been sending around.  I've given quite a bit
of feedback to improve those.  Please consolidate and agree on a common
set of patches with him.
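
To make the bit shuffling under review easier to follow, here is a
userspace model of the helpers quoted above.  It is a sketch under
stated assumptions, not kernel code: the names are shortened, bits 1
and 6 are the x86 R/W and dirty bits, the position of PAGE_DIRTY_SW
(bit 58) is a guess because the definition is not quoted in this mail,
_PAGE_SOFT_DIRTY is left out, and the shadow stack config option is
assumed to be enabled.

```c
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_RW		(1ULL << 1)	/* x86 R/W bit */
#define PAGE_DIRTY_HW	(1ULL << 6)	/* x86 hardware dirty bit */
#define PAGE_DIRTY_SW	(1ULL << 58)	/* assumed software-available bit */

/* If any bit in from is set, clear it and set to instead. */
static pteval_t move_flags(pteval_t pte, pteval_t from, pteval_t to)
{
	if (pte & from)
		pte = (pte & ~from) | to;
	return pte;
}

/* Write-protect: park the dirty state in the software bit first. */
static pteval_t wrprotect(pteval_t pte)
{
	pte = move_flags(pte, PAGE_DIRTY_HW, PAGE_DIRTY_SW);
	return pte & ~PAGE_RW;
}

/* Make writable: the dirty state can return to the hardware bit. */
static pteval_t mkwrite(pteval_t pte)
{
	pte = move_flags(pte, PAGE_DIRTY_SW, PAGE_DIRTY_HW);
	return pte | PAGE_RW;
}

/* Read-only PTEs record dirtiness in the software bit only. */
static pteval_t mkdirty(pteval_t pte)
{
	pteval_t dirty = PAGE_DIRTY_HW;

	if (!(pte & PAGE_RW))
		dirty = PAGE_DIRTY_SW;
	return pte | dirty;
}

/* Read-only plus hardware-dirty is the shadow stack encoding. */
static int looks_like_shstk(pteval_t pte)
{
	return !(pte & PAGE_RW) && (pte & PAGE_DIRTY_HW);
}
```

The property the helpers try to maintain is that an ordinary read-only
PTE never ends up with the hardware dirty bit set, so looks_like_shstk()
stays false for it.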


* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-13 23:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190813215823.3sfbakzzjjykyng2@ast-mbp>

On Tue, Aug 13, 2019 at 2:58 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 06, 2019 at 10:24:25PM -0700, Andy Lutomirski wrote:
> > >
> > > Inside containers and inside nested containers we need to start processes
> > > that will use bpf. All of the processes are trusted.
> >
> > Trusted by whom?  In a non-nested container, the container manager
> > *might* be trusted by the outside world.  In a *nested* container,
> > unless the inner container management is controlled from outside the
> > outer container, it's not trusted.  I don't know much about how
> > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > very strongly toward user namespaces and maximally-untrusted
> > containers, and I think bpf() should work in that context.
>
> agree that containers (namespaces) reduce amount of trust necessary
> for apps to run, but the end goal is not security though.
> Linux has become a single user system.
> If user can ssh into the host they can become root.
> If arbitrary code can run on the host it will break out of any sandbox.

I would argue that this is a reasonable assumption to make if you're
designing a system using Linux, but it's not a valid assumption to
make as kernel developers.  Otherwise we should just give everyone
CAP_SYS_ADMIN and call it a day.  There really is a difference between
root and non-root.

> Containers are not providing the level of security that is enough
> to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> Containers are used to make production systems safer.
> Some people call it more 'secure', but it's clearly not secure for
> arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
> When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> It's been a constant source of pain. The constant blinding, randomization,
> verifier speculative analysis, all spectre v1, v2, v4 mitigations
> are simply not worth it. It's a lot of complex kernel code without users.

Seccomp really will want eBPF some day, and it should work without
privilege.  Maybe it should be a restricted subset of eBPF, and
Spectre will always be an issue until dramatically better hardware
shows up, but I think people will want the ability for regular
programs to load eBPF seccomp programs.

> Hence I prefer this /dev/bpf mechanism to be as simple as possible.
> The applications that will use it are going to be just as trusted as systemd.

I still don't understand your systemd example.  systemd --users is not
trusted systemwide in any respect.  The main PID 1 systemd is root.
No matter how you dice it, granting a user systemd instance extra bpf
access is tantamount to granting the user extra bpf access in general.

It sounds to me like you're thinking of eBPF as a feature a bit like
unprivileged user namespaces: *in principle*, it's supposed to be safe
to give any unprivileged process the ability to use it, and you
consider security flaws in it to be bugs worth fixing.  But you think
it's a large attack surface and that most unprivileged programs
shouldn't be allowed to use it.  Is that reasonable?


>
> > > To solve your concern of bypassing all capable checks...
> > > How about we do /dev/bpf/full_verifier first?
> > > It will replace capable() checks in the verifier only.
> >
> > I'm not convinced that "in the verifier" is the right distinction.
> > Telling administrators that some setting lets certain users bypass
> > bpf() verifier checks doesn't have a clear enough meaning.
>
> Linux is a single user system. There are no administrators any more.
> No doubt, folks will disagree, but that game is over.
> At least on bpf side it's done.
>
> > I propose,
> > instead, that the current capable() checks be divided into three
> > categories:
>
> I don't see a use case for these categories.
> All bpf programs extend the kernel in some way.
> The kernel vs user is one category.
> Conceptually CAP_BPF is enough. It would be similar to CAP_NET_ADMIN.
> When application has CAP_NET_ADMIN it covers all of networking knobs.
> There is no use case that would warrant fine grain CAP_ROUTE_ADMIN,
> CAP_ETHTOOL_ADMIN, CAP_ETH0_ADMIN, etc.
> Similarly CAP_BPF as the only knob is enough.
> The only disadvantage of CAP_BPF is that it's not possible to
> pass it from one systemd-like daemon to another systemd-like daemon.
> Hence /dev/bpf idea and passing file descriptor.
>
> > This type of thing actually fits quite nicely into an idea I've been
> > thinking about for a while called "implicit rights". In very brief
> > summary, there would be objects called /dev/rights/xyz, where xyz is
> > the name of a "right".  If there is a readable object of the right
> > type at the literal path "/dev/rights/xyz", then you have right xyz.
> > There's a bit more flexibility on top of this.  BPF could use
> > /dev/rights/bpf/maptypes/lpm and
> > /dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
> > use cases include a biggie:
> > /dev/rights/namespace/create_unprivileged_userns.
> > /dev/rights/bind_port/80 would be nice, too.
>
> The concept of "implicit rights" is very nice and I'm sure it will
> be a good fit somewhere, but I don't see why use it in bpf space.
> There is no use case for fine grain partition of bpf features.
>


* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-13 23:24 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Alexei Starovoitov, Andy Lutomirski, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <CAKOZuev8XY5+shG8SiWcx4z12QnkgzhcUqCHs9t+eV2z-6nzPA@mail.gmail.com>

On Tue, Aug 13, 2019 at 3:27 PM Daniel Colascione <dancol@google.com> wrote:
>
> On Tue, Aug 13, 2019 at 2:58 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Aug 06, 2019 at 10:24:25PM -0700, Andy Lutomirski wrote:
> > > >
> > > > Inside containers and inside nested containers we need to start processes
> > > > that will use bpf. All of the processes are trusted.
> > >
> > > Trusted by whom?  In a non-nested container, the container manager
> > > *might* be trusted by the outside world.  In a *nested* container,
> > > unless the inner container management is controlled from outside the
> > > outer container, it's not trusted.  I don't know much about how
> > > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > > very strongly toward user namespaces and maximally-untrusted
> > > containers, and I think bpf() should work in that context.
> >
> > agree that containers (namespaces) reduce amount of trust necessary
> > for apps to run, but the end goal is not security though.
> > Linux has become a single user system.
> > If user can ssh into the host they can become root.
> > > If arbitrary code can run on the host it will break out of any sandbox.
> > Containers are not providing the level of security that is enough
> > to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> > Containers are used to make production systems safer.
> > Some people call it more 'secure', but it's clearly not secure for
> > arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
> > When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> > It's been a constant source of pain. The constant blinding, randomization,
> > verifier speculative analysis, all spectre v1, v2, v4 mitigations
> > are simply not worth it. It's a lot of complex kernel code without users.
> > There is not a single use case to allow arbitrary malicious bpf
> > program to be loaded and executed.
> > As soon as we have /dev/bpf to allow all of bpf to be used without root
> > we will set sysctl kernel.unprivileged_bpf_disabled=1
> > Hence I prefer this /dev/bpf mechanism to be as simple as possible.
> > The applications that will use it are going to be just as trusted as systemd.
> >
> > > > To solve your concern of bypassing all capable checks...
> > > > How about we do /dev/bpf/full_verifier first?
> > > > It will replace capable() checks in the verifier only.
> > >
> > > I'm not convinced that "in the verifier" is the right distinction.
> > > Telling administrators that some setting lets certain users bypass
> > > bpf() verifier checks doesn't have a clear enough meaning.
> >
> > Linux is a single user system. There are no administrators any more.
> > No doubt, folks will disagree, but that game is over.
> > At least on bpf side it's done.
> >
> > > I propose,
> > > instead, that the current capable() checks be divided into three
> > > categories:
> >
> > I don't see a use case for these categories.
> > All bpf programs extend the kernel in some way.
> > The kernel vs user is one category.
> > Conceptually CAP_BPF is enough. It would be similar to CAP_NET_ADMIN.
> > When application has CAP_NET_ADMIN it covers all of networking knobs.
> > There is no use case that would warrant fine grain CAP_ROUTE_ADMIN,
> > CAP_ETHTOOL_ADMIN, CAP_ETH0_ADMIN, etc.
> > Similarly CAP_BPF as the only knob is enough.
> > The only disadvantage of CAP_BPF is that it's not possible to
> > pass it from one systemd-like daemon to another systemd-like daemon.
> > Hence /dev/bpf idea and passing file descriptor.
> >
> > > This type of thing actually fits quite nicely into an idea I've been
> > > thinking about for a while called "implicit rights". In very brief
> > > summary, there would be objects called /dev/rights/xyz, where xyz is
> > > the name of a "right".  If there is a readable object of the right
> > > type at the literal path "/dev/rights/xyz", then you have right xyz.
> > > There's a bit more flexibility on top of this.  BPF could use
> > > /dev/rights/bpf/maptypes/lpm and
> > > /dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
> > > use cases include a biggie:
> > > /dev/rights/namespace/create_unprivileged_userns.
> > > /dev/rights/bind_port/80 would be nice, too.
> >
> > The concept of "implicit rights" is very nice and I'm sure it will
> > be a good fit somewhere, but I don't see why use it in bpf space.
> > There is no use case for fine grain partition of bpf features.
>
> Isn't this "implicit rights" model just another kind of ambient
> authority --- one that constrains the otherwise-free filesystem
> namespace to boot?

Yes.

> IMHO, the kernel should be moving toward explicit
> authorization tokens modeled by file descriptors and away from
> contextual authorization decisions.

And yes, I agree there too. Here's how I think about it: there are
really two layers here:

Rights: these are objects like /dev/rights/bpf/some_bpf_privilege or
/dev/rights/namespace/unpriv_userns, and you would, ideally, use them
like genuine capabilities.  You'd pass an fd with appropriate access
(FMODE_READ, presumably, since exec is awkward to work with for fds)
into bpf() or similar, and the kernel would say "yep, caller has the
capability" and do something.  There's nothing really restricting them
to /dev/rights, but they more or less have to live on a memory-backed
file system (a real backing store has all kinds of issues), and
putting them in /dev gets a lot of nifty properties for free.  For
example, existing container systems that don't know about them will
automatically deny them to containers, since nothing with an ounce of
sense passes all of /dev through to a container.  But container
systems that are aware of them can bind-mount them into the container.
And /dev is already known to be magical due to things like
/dev/urandom.

The implicit part on top is less than ideal, but it solves two problems:

1. It keeps compatibility with existing code.  There are programs that
expect unshare(CLONE_NEWUSER) to work -- with *implicit* rights, it
will work exactly when it's supposed to.  Also, for cases like
CLONE_NEWUSER, it does have more or less the right semantics -- if
they were explicit, most programs would just try to open
/dev/rights/namespace/unpriv_userns and pass the fd to unshare2, so
we're not losing much.

2. For things like eBPF where the set of rights could be a moving
target, implicit rights lets the model evolve without breaking
userspace.  So if LPM maps eventually become bulletproof and a right
is no longer needed, it still works.  Or if some feature in the
verifier that is currently unrestricted were subsequently deemed to
need restrictions, they could be added without retrofitting all the
users.

There are cases where implicit rights would be totally inappropriate.
For example, a CAP_DAC_READ_SEARCH right could not be safely made
implicit.  In general, I think the implicit model works for system
calls where it's unambiguous what the caller wants to have happen and
there, depending on privilege level, it either works or it doesn't.
So, for accessing a filesystem, it's not at all obvious whether a
program is accessing it on its own behalf or on a client's behalf, and
privilege usage should be explicit.  For something like "don't
Spectre-mitigate this eBPF program", the semantics change and the
request should IMO be explicit.  For "create an LPM map", I don't
see how a confused deputy is likely, and an implicit right seems
reasonable.  Similarly, for creating a namespace or binding a network
port, confused deputies seem unlikely.  (For connecting to a network
address, if such a thing were ever restricted, confused deputies are
definitely possible and happen all the time, e.g. under a DNS
rebinding attack.)

^ permalink raw reply

* Re: [PATCH v8 11/27] x86/mm: Introduce _PAGE_DIRTY_SW
From: Andy Lutomirski @ 2019-08-13 23:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <dac2d62b-9045-4767-87dd-eac12e7abafd@intel.com>

On Aug 13, 2019, at 4:02 PM, Dave Hansen <dave.hansen@intel.com> wrote:

>> 
>> static inline pte_t pte_mkwrite(pte_t pte)
>> {
>> +    pte = pte_move_flags(pte, _PAGE_DIRTY_SW, _PAGE_DIRTY_HW);
>>    return pte_set_flags(pte, _PAGE_RW);
>> }
> 
> It also isn't clear to me why this *must* move bits here.  Its doubly
> unclear why you would need to do this on systems when shadow stacks are
> compiled in but disabled.

Why is it conditional at all?  ISTM, in x86, RO+dirty has been effectively repurposed. To avoid having extra things that can conditionally break, I think this code should be unconditional. 

That being said, I’m not at all sure that pte_mkwrite on a shadow stack page makes any sense.
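
For reference, a standalone model of the unconditional variant might look
like the sketch below.  It uses plain integers instead of pte_t, the
_PAGE_DIRTY_SW bit position is an arbitrary assumption, and the body of
pte_move_flags() is my reading of what the helper is meant to do (clear
"from" and set "to" only when "from" is set), not a copy of the patch.

```c
#include <stdint.h>

typedef uint64_t pteval_t;

/* _PAGE_DIRTY_SW's position is an illustrative assumption. */
#define _PAGE_RW	((pteval_t)1 << 1)
#define _PAGE_DIRTY_HW	((pteval_t)1 << 6)
#define _PAGE_DIRTY_SW	((pteval_t)1 << 58)

/* If 'from' is set, clear it and set 'to'; otherwise leave pte alone. */
static pteval_t pte_move_flags(pteval_t pte, pteval_t from, pteval_t to)
{
	if (pte & from)
		pte = (pte & ~from) | to;
	return pte;
}

/* Unconditional model of the quoted pte_mkwrite(). */
static pteval_t pte_mkwrite(pteval_t pte)
{
	pte = pte_move_flags(pte, _PAGE_DIRTY_SW, _PAGE_DIRTY_HW);
	return pte | _PAGE_RW;
}
```

With no config or runtime condition in the path, a PTE that carries the
software dirty bit always ends up writable with the hardware dirty bit set.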

> <snip>
> 
> Same comments for pmds and puds.

Wasn’t Kirill working on a rework of the whole page table system to just have integer page table levels?

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-14  0:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <CALCETrVT-dDXQGukGs5S1DkzvQv9_e=axzr_GyEd2c4T4z8Qng@mail.gmail.com>

On Tue, Aug 13, 2019 at 04:06:00PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 13, 2019 at 2:58 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Aug 06, 2019 at 10:24:25PM -0700, Andy Lutomirski wrote:
> > > >
> > > > Inside containers and inside nested containers we need to start processes
> > > > that will use bpf. All of the processes are trusted.
> > >
> > > Trusted by whom?  In a non-nested container, the container manager
> > > *might* be trusted by the outside world.  In a *nested* container,
> > > unless the inner container management is controlled from outside the
> > > outer container, it's not trusted.  I don't know much about how
> > > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > > very strongly toward user namespaces and maximally-untrusted
> > > containers, and I think bpf() should work in that context.
> >
> > agree that containers (namespaces) reduce amount of trust necessary
> > for apps to run, but the end goal is not security though.
> > Linux has become a single user system.
> > If user can ssh into the host they can become root.
> > > If arbitrary code can run on the host it will break out of any sandbox.
> 
> I would argue that this is a reasonable assumption to make if you're
> designing a system using Linux, but it's not a valid assumption to
> make as kernel developers.  Otherwise we should just give everyone
> CAP_SYS_ADMIN and call it a day.  There really is a difference between
> root and non-root.

hmm. No. Kernel developers should not make any assumptions.
They should guide their design by real use cases instead. That includes studying
what people do now and the hacks they use to work around the lack of interfaces.
Effectively bpf is root only. There are no unpriv users.
These root applications go out of their way to reduce privileges
while they still want to use bpf. That is the need that /dev/bpf is solving.

> 
> > Containers are not providing the level of security that is enough
> > to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> > Containers are used to make production systems safer.
> > Some people call it more 'secure', but it's clearly not secure for
> > arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
> > When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> > It's been a constant source of pain. The constant blinding, randomization,
> > verifier speculative analysis, all spectre v1, v2, v4 mitigations
> > are simply not worth it. It's a lot of complex kernel code without users.
> 
> Seccomp really will want eBPF some day, and it should work without
> privilege.  Maybe it should be a restricted subset of eBPF, and
> Spectre will always be an issue until dramatically better hardware
> shows up, but I think people will want the ability for regular
> programs to load eBPF seccomp programs.

I'm absolutely against using eBPF in seccomp.
Precisely due to discussions like the current one.

> 
> > Hence I prefer this /dev/bpf mechanism to be as simple a possible.
> > The applications that will use it are going to be just as trusted as systemd.
> 
> I still don't understand your systemd example.  systemd --users is not
> trusted systemwide in any respect.  The main PID 1 systemd is root.
> No matter how you dice it, granting a user systemd instance extra bpf
> access is tantamount to granting the user extra bpf access in general.

People use systemd --user while their kernels have 'undef CONFIG_USER_NS'.

> It sounds to me like you're thinking of eBPF as a feature a bit like
> unprivileged user namespaces: *in principle*, it's supposed to be safe
> to give any unprivileged process the ability to use it, and you
> consider security flaws in it to be bugs worth fixing. But you think
> it's a large attack surface and that most unprivileged programs
> shouldn't be allowed to use it.  Is that reasonable?

I think there should be no unprivileged bpf at all,
because over all these years we've seen zero use cases.
Hence all new features are root only.
LPM map is a prime example. There was not a single security bug in there.
There were few functional bugs, but not security issues.
These bugs didn't crash the kernel and didn't expose any data.
Yet we still keep LPM as root only.
Can we flip the switch and make it non-root? Isn't it a trivial single-line patch
with very low security risk?
Nope, since it will not address the underlying issue.

^ permalink raw reply

* Re: [PATCH V38 15/29] acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
From: Dave Young @ 2019-08-14  2:51 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security-module, linux-kernel, linux-api,
	Josh Boyer, bhe, kasong, Borislav Petkov, David Howells,
	Matthew Garrett, Kees Cook, linux-acpi
In-Reply-To: <20190808000721.124691-16-matthewgarrett@google.com>

On 08/07/19 at 05:07pm, Matthew Garrett wrote:
> From: Josh Boyer <jwboyer@redhat.com>
> 
> This option allows userspace to pass the RSDP address to the kernel, which
> makes it possible for a user to modify the workings of hardware. Reject
> the option when the kernel is locked down. This requires some reworking
> of the existing RSDP command line logic, since the early boot code also
> makes use of a command-line passed RSDP when locating the SRAT table
> before the lockdown code has been initialised. This is achieved by
> separating the command line RSDP path in the early boot code from the
> generic RSDP path, and then copying the command line RSDP into boot
> params in the kernel proper if lockdown is not enabled. If lockdown is
> enabled and an RSDP is provided on the command line, this will only be
> used when parsing SRAT (which shouldn't permit kernel code execution)
> and will be ignored in the rest of the kernel.
> 
> (Modified by Matthew Garrett in order to handle the early boot RSDP
> environment)
> 
> Signed-off-by: Josh Boyer <jwboyer@redhat.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> cc: Dave Young <dyoung@redhat.com>
> cc: linux-acpi@vger.kernel.org
> ---
>  arch/x86/boot/compressed/acpi.c | 19 +++++++++++++------
>  arch/x86/include/asm/acpi.h     |  9 +++++++++
>  arch/x86/include/asm/x86_init.h |  2 ++
>  arch/x86/kernel/acpi/boot.c     |  5 +++++
>  arch/x86/kernel/x86_init.c      |  1 +
>  drivers/acpi/osl.c              | 14 +++++++++++++-
>  include/linux/acpi.h            |  6 ++++++
>  7 files changed, 49 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/acpi.c b/arch/x86/boot/compressed/acpi.c
> index 15255f388a85..149795c369f2 100644
> --- a/arch/x86/boot/compressed/acpi.c
> +++ b/arch/x86/boot/compressed/acpi.c
> @@ -26,7 +26,7 @@ struct mem_vector immovable_mem[MAX_NUMNODES*2];
>   */
>  #define MAX_ADDR_LEN 19
>  
> -static acpi_physical_address get_acpi_rsdp(void)
> +static acpi_physical_address get_cmdline_acpi_rsdp(void)
>  {
>  	acpi_physical_address addr = 0;
>  
> @@ -278,10 +278,7 @@ acpi_physical_address get_rsdp_addr(void)
>  {
>  	acpi_physical_address pa;
>  
> -	pa = get_acpi_rsdp();
> -
> -	if (!pa)
> -		pa = boot_params->acpi_rsdp_addr;
> +	pa = boot_params->acpi_rsdp_addr;
>  
>  	/*
>  	 * Try to get EFI data from setup_data. This can happen when we're a
> @@ -311,7 +308,17 @@ static unsigned long get_acpi_srat_table(void)
>  	char arg[10];
>  	u8 *entry;
>  
> -	rsdp = (struct acpi_table_rsdp *)(long)boot_params->acpi_rsdp_addr;
> +	/*
> +	 * Check whether we were given an RSDP on the command line. We don't
> +	 * stash this in boot params because the kernel itself may have
> +	 * different ideas about whether to trust a command-line parameter.
> +	 */
> +	rsdp = (struct acpi_table_rsdp *)get_cmdline_acpi_rsdp();
> +
> +	if (!rsdp)
> +		rsdp = (struct acpi_table_rsdp *)(long)
> +			boot_params->acpi_rsdp_addr;
> +
>  	if (!rsdp)
>  		return 0;
>  
> diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> index aac686e1e005..bc9693c9107e 100644
> --- a/arch/x86/include/asm/acpi.h
> +++ b/arch/x86/include/asm/acpi.h
> @@ -117,6 +117,12 @@ static inline bool acpi_has_cpu_in_madt(void)
>  	return !!acpi_lapic;
>  }
>  
> +#define ACPI_HAVE_ARCH_SET_ROOT_POINTER
> +static inline void acpi_arch_set_root_pointer(u64 addr)
> +{
> +	x86_init.acpi.set_root_pointer(addr);
> +}
> +
>  #define ACPI_HAVE_ARCH_GET_ROOT_POINTER
>  static inline u64 acpi_arch_get_root_pointer(void)
>  {
> @@ -125,6 +131,7 @@ static inline u64 acpi_arch_get_root_pointer(void)
>  
>  void acpi_generic_reduced_hw_init(void);
>  
> +void x86_default_set_root_pointer(u64 addr);
>  u64 x86_default_get_root_pointer(void);
>  
>  #else /* !CONFIG_ACPI */
> @@ -138,6 +145,8 @@ static inline void disable_acpi(void) { }
>  
>  static inline void acpi_generic_reduced_hw_init(void) { }
>  
> +static inline void x86_default_set_root_pointer(u64 addr) { }
> +
>  static inline u64 x86_default_get_root_pointer(void)
>  {
>  	return 0;
> diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
> index ac0934189017..19435858df5f 100644
> --- a/arch/x86/include/asm/x86_init.h
> +++ b/arch/x86/include/asm/x86_init.h
> @@ -134,10 +134,12 @@ struct x86_hyper_init {
>  
>  /**
>   * struct x86_init_acpi - x86 ACPI init functions
> + * @set_root_poitner:		set RSDP address
>   * @get_root_pointer:		get RSDP address
>   * @reduced_hw_early_init:	hardware reduced platform early init
>   */
>  struct x86_init_acpi {
> +	void (*set_root_pointer)(u64 addr);
>  	u64 (*get_root_pointer)(void);
>  	void (*reduced_hw_early_init)(void);
>  };
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 17b33ef604f3..04205ce127a1 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -1760,6 +1760,11 @@ void __init arch_reserve_mem_area(acpi_physical_address addr, size_t size)
>  	e820__update_table_print();
>  }
>  
> +void x86_default_set_root_pointer(u64 addr)
> +{
> +	boot_params.acpi_rsdp_addr = addr;
> +}
> +
>  u64 x86_default_get_root_pointer(void)
>  {
>  	return boot_params.acpi_rsdp_addr;
> diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> index 1bef687faf22..18a799c8fa28 100644
> --- a/arch/x86/kernel/x86_init.c
> +++ b/arch/x86/kernel/x86_init.c
> @@ -95,6 +95,7 @@ struct x86_init_ops x86_init __initdata = {
>  	},
>  
>  	.acpi = {
> +		.set_root_pointer	= x86_default_set_root_pointer,
>  		.get_root_pointer	= x86_default_get_root_pointer,
>  		.reduced_hw_early_init	= acpi_generic_reduced_hw_init,
>  	},
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index 9c0edf2fc0dd..d43df3a3fa8d 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -26,6 +26,7 @@
>  #include <linux/list.h>
>  #include <linux/jiffies.h>
>  #include <linux/semaphore.h>
> +#include <linux/security.h>
>  
>  #include <asm/io.h>
>  #include <linux/uaccess.h>
> @@ -180,8 +181,19 @@ acpi_physical_address __init acpi_os_get_root_pointer(void)
>  	acpi_physical_address pa;
>  
>  #ifdef CONFIG_KEXEC
> -	if (acpi_rsdp)
> +	/*
> +	 * We may have been provided with an RSDP on the command line,
> +	 * but if a malicious user has done so they may be pointing us
> +	 * at modified ACPI tables that could alter kernel behaviour -
> +	 * so, we check the lockdown status before making use of
> +	 * it. If we trust it then also stash it in an architecture
> +	 * specific location (if appropriate) so it can be carried
> +	 * over further kexec()s.
> +	 */
> +	if (acpi_rsdp && !security_locked_down(LOCKDOWN_ACPI_TABLES)) {
> +		acpi_arch_set_root_pointer(acpi_rsdp);
>  		return acpi_rsdp;
> +	}
>  #endif
>  	pa = acpi_arch_get_root_pointer();
>  	if (pa)
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index e40e1e27ed8e..6b35f2f4cab3 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -643,6 +643,12 @@ bool acpi_gtdt_c3stop(int type);
>  int acpi_arch_timer_mem_init(struct arch_timer_mem *timer_mem, int *timer_count);
>  #endif
>  
> +#ifndef ACPI_HAVE_ARCH_SET_ROOT_POINTER
> +static inline void acpi_arch_set_root_pointer(u64 addr)
> +{
> +}
> +#endif
> +
>  #ifndef ACPI_HAVE_ARCH_GET_ROOT_POINTER
>  static inline u64 acpi_arch_get_root_pointer(void)
>  {
> -- 
> 2.22.0.770.g0f2c4a37fd-goog
> 

I'm not sure why this series got posted again so quickly.  As for this
patch, it was restructured around the early RSDP parsing code, so
personally I think keeping the original Reviewed-by is questionable and it
needs another review.  But I have not found time to read it carefully.

Adding several CCs so that it does not slip through.

Thanks
Dave

^ permalink raw reply

* [PATCH] tracefs: Fix NULL pointer dereference when no lockdown is used
From: Marek Szyprowski @ 2019-08-14  6:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Marek Szyprowski, Matthew Garrett, Steven Rostedt, James Morris,
	Krzysztof Kozlowski, Bartlomiej Zolnierkiewicz,
	linux-security-module, linux-api, Matthew Garrett
In-Reply-To: <3028ed35-3b6d-459f-f3c8-103c5636fe95@samsung.com>

Commit 757ff7244358 ("tracefs: Restrict tracefs when the kernel is locked
down") added infrastructure for restricting tracefs access when lockdown
is enabled. It however broke tracefs operation when no lockdown is used.
Fix this issue by adding missing check for a NULL ->open() callback.

Fixes: 757ff7244358 ("tracefs: Restrict tracefs when the kernel is locked down")
Reported-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
---
 fs/tracefs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 12a325fb4cbd..8efff7603032 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -43,7 +43,7 @@ static int default_open_file(struct inode *inode, struct file *filp)
 		return ret;
 
 	real_fops = dentry->d_fsdata;
-	return real_fops->open(inode, filp);
+	return real_fops->open ? real_fops->open(inode, filp) : 0;
 }
 
 static ssize_t default_read_file(struct file *file, char __user *buf,
-- 
2.17.1
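
The one-line change can be exercised outside the kernel with a reduced
model.  In this sketch, struct fops_model stands in for struct
file_operations, and call_real_open() and sample_open() are hypothetical
names used only for illustration.

```c
#include <stddef.h>

struct inode;
struct file;

/* Reduced stand-in for struct file_operations: only ->open matters. */
struct fops_model {
	int (*open)(struct inode *inode, struct file *filp);
};

/* Call ->open() if one is provided; a missing callback means success. */
static int call_real_open(const struct fops_model *real_fops,
			  struct inode *inode, struct file *filp)
{
	return real_fops->open ? real_fops->open(inode, filp) : 0;
}

/* Hypothetical callback returning an arbitrary, recognisable value. */
static int sample_open(struct inode *inode, struct file *filp)
{
	(void)inode;
	(void)filp;
	return -13;
}
```

Files whose operations leave ->open unset then open successfully instead of
jumping through a NULL function pointer.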

^ permalink raw reply related

* Re: [PATCH V38 15/29] acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
From: Borislav Petkov @ 2019-08-14  7:26 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security-module, linux-kernel, linux-api,
	Josh Boyer, David Howells, Matthew Garrett, Kees Cook, Dave Young,
	linux-acpi
In-Reply-To: <20190808000721.124691-16-matthewgarrett@google.com>

On Wed, Aug 07, 2019 at 05:07:07PM -0700, Matthew Garrett wrote:
> From: Josh Boyer <jwboyer@redhat.com>
> 
> This option allows userspace to pass the RSDP address to the kernel, which
> makes it possible for a user to modify the workings of hardware. Reject
> the option when the kernel is locked down. This requires some reworking
> of the existing RSDP command line logic, since the early boot code also
> makes use of a command-line passed RSDP when locating the SRAT table
> before the lockdown code has been initialised. This is achieved by
> separating the command line RSDP path in the early boot code from the
> generic RSDP path, and then copying the command line RSDP into boot
> params in the kernel proper if lockdown is not enabled. If lockdown is
> enabled and an RSDP is provided on the command line, this will only be
> used when parsing SRAT (which shouldn't permit kernel code execution)
> and will be ignored in the rest of the kernel.
> 
> (Modified by Matthew Garrett in order to handle the early boot RSDP
> environment)
> 
> Signed-off-by: Josh Boyer <jwboyer@redhat.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> cc: Dave Young <dyoung@redhat.com>
> cc: linux-acpi@vger.kernel.org
> ---
>  arch/x86/boot/compressed/acpi.c | 19 +++++++++++++------
>  arch/x86/include/asm/acpi.h     |  9 +++++++++
>  arch/x86/include/asm/x86_init.h |  2 ++
>  arch/x86/kernel/acpi/boot.c     |  5 +++++
>  arch/x86/kernel/x86_init.c      |  1 +
>  drivers/acpi/osl.c              | 14 +++++++++++++-
>  include/linux/acpi.h            |  6 ++++++
>  7 files changed, 49 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/acpi.c b/arch/x86/boot/compressed/acpi.c
> index 15255f388a85..149795c369f2 100644
> --- a/arch/x86/boot/compressed/acpi.c
> +++ b/arch/x86/boot/compressed/acpi.c
> @@ -26,7 +26,7 @@ struct mem_vector immovable_mem[MAX_NUMNODES*2];
>   */
>  #define MAX_ADDR_LEN 19
>  
> -static acpi_physical_address get_acpi_rsdp(void)
> +static acpi_physical_address get_cmdline_acpi_rsdp(void)
>  {
>  	acpi_physical_address addr = 0;
>  
> @@ -278,10 +278,7 @@ acpi_physical_address get_rsdp_addr(void)
>  {
>  	acpi_physical_address pa;
>  
> -	pa = get_acpi_rsdp();
> -
> -	if (!pa)
> -		pa = boot_params->acpi_rsdp_addr;

AFAICT, this looks like it would break current usage: get_rsdp_addr()
needs to call get_acpi_rsdp() which you've now called
get_cmdline_acpi_rsdp() to parse "acpi_rsdp=" but it ain't happening
anymore.

Where the parsing is happening now is in get_acpi_srat_table() which is
not present in configs with

#if defined(CONFIG_RANDOMIZE_BASE) && defined(CONFIG_MEMORY_HOTREMOVE)

false and thus not available to early code anymore.

I think the cleaner/easier approach would be to clear
boot_params->acpi_rsdp_addr after SRAT has been parsed in lockdown
configurations so that nothing else uses the supplied RSDP address
anymore. But I could very well be missing something...
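
Roughly, and only as a sketch of that idea: struct boot_params_model and
rsdp_after_srat() below are hypothetical names, and how the lockdown state
would be known at that point in boot is left open.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reduced stand-in for struct boot_params. */
struct boot_params_model {
	uint64_t acpi_rsdp_addr;
};

/*
 * Hypothetical hook, meant to run once early SRAT parsing is done.
 * Under lockdown the supplied RSDP address is dropped so that no
 * later code can pick it up; otherwise it is left untouched.
 */
static void rsdp_after_srat(struct boot_params_model *bp, bool locked_down)
{
	if (locked_down)
		bp->acpi_rsdp_addr = 0;
}
```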

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply

* Re: [PATCH V38 15/29] acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
From: Borislav Petkov @ 2019-08-14  7:28 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security-module, linux-kernel, linux-api,
	Josh Boyer, David Howells, Matthew Garrett, Kees Cook, Dave Young,
	linux-acpi, x86-ml
In-Reply-To: <20190808000721.124691-16-matthewgarrett@google.com>

On Wed, Aug 07, 2019 at 05:07:07PM -0700, Matthew Garrett wrote:
> From: Josh Boyer <jwboyer@redhat.com>
> 
> This option allows userspace to pass the RSDP address to the kernel, which
> makes it possible for a user to modify the workings of hardware. Reject
> the option when the kernel is locked down. This requires some reworking
> of the existing RSDP command line logic, since the early boot code also
> makes use of a command-line passed RSDP when locating the SRAT table
> before the lockdown code has been initialised. This is achieved by
> separating the command line RSDP path in the early boot code from the
> generic RSDP path, and then copying the command line RSDP into boot
> params in the kernel proper if lockdown is not enabled. If lockdown is
> enabled and an RSDP is provided on the command line, this will only be
> used when parsing SRAT (which shouldn't permit kernel code execution)
> and will be ignored in the rest of the kernel.
> 
> (Modified by Matthew Garrett in order to handle the early boot RSDP
> environment)
> 
> Signed-off-by: Josh Boyer <jwboyer@redhat.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> cc: Dave Young <dyoung@redhat.com>
> cc: linux-acpi@vger.kernel.org
> ---
>  arch/x86/boot/compressed/acpi.c | 19 +++++++++++++------
>  arch/x86/include/asm/acpi.h     |  9 +++++++++
>  arch/x86/include/asm/x86_init.h |  2 ++
>  arch/x86/kernel/acpi/boot.c     |  5 +++++
>  arch/x86/kernel/x86_init.c      |  1 +
>  drivers/acpi/osl.c              | 14 +++++++++++++-
>  include/linux/acpi.h            |  6 ++++++
>  7 files changed, 49 insertions(+), 7 deletions(-)

Also, why isn't an x86 person CCed here? This clearly needs an x86
maintainer's ACK before it goes in?

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply

* Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
From: Michal Hocko @ 2019-08-14  7:56 UTC (permalink / raw)
  To: Jann Horn
  Cc: Daniel Gruss, Joel Fernandes (Google), kernel list,
	Alexey Dobriyan, Andrew Morton, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, Daniel Colascione, fmayer,
	H. Peter Anvin, Ingo Molnar, Joel Fernandes, Jonathan Corbet,
	Kees Cook, kernel-team, Linux API, linux-doc, linux-fsdevel
In-Reply-To: <CAG48ez2cuqe_VYhhaqw8Hcyswv47cmz2XmkqNdvkXEhokMVaXg@mail.gmail.com>

On Tue 13-08-19 17:29:09, Jann Horn wrote:
> On Tue, Aug 13, 2019 at 12:09 PM Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 12-08-19 20:14:38, Jann Horn wrote:
> > > On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google)
> > > <joel@joelfernandes.org> wrote:
> > > > The page_idle tracking feature currently requires looking up the pagemap
> > > > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > > > Looking up PFN from pagemap in Android devices is not supported by
> > > > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> > > >
> > > > This patch adds support to directly interact with page_idle tracking at
> > > > the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> > > > the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> > > > looking up PFN through pagemap is not needed since the interface uses
> > > > virtual frame numbers, and at the same time also does not require
> > > > SYS_ADMIN.
> > > >
> > > > In Android, we are using this for the heap profiler (heapprofd) which
> > > > profiles and pin points code paths which allocates and leaves memory
> > > > idle for long periods of time. This method solves the security issue
> > > > with userspace learning the PFN, and while at it is also shown to yield
> > > > better results than the pagemap lookup, the theory being that the window
> > > > where the address space can change is reduced by eliminating the
> > > > intermediate pagemap look up stage. In virtual address indexing, the
> > > > process's mmap_sem is held for the duration of the access.
> > >
> > > What happens when you use this interface on shared pages, like memory
> > > inherited from the zygote, library file mappings and so on? If two
> > > profilers ran concurrently for two different processes that both map
> > > the same libraries, would they end up messing up each other's data?
> >
> > Yup PageIdle state is shared. That is the page_idle semantic even now
> > IIRC.
> >
> > > Can this be used to observe which library pages other processes are
> > > accessing, even if you don't have access to those processes, as long
> > > as you can map the same libraries? I realize that there are already a
> > > bunch of ways to do that with side channels and such; but if you're
> > > adding an interface that allows this by design, it seems to me like
> > > something that should be gated behind some sort of privilege check.
> >
> > Hmm, you need to be privileged to get the pfn now and without that you
> > cannot get to any page so the new interface is weakening the rules.
> > Maybe we should limit setting the idle state to processes with the write
> > status. Or do you think that even observing idle status is useful for
> > practical side channel attacks? If yes, is that a problem of the
> > profiler which does potentially dangerous things?
> 
> I suppose read-only access isn't a real problem as long as the
> profiler isn't writing the idle state in a very tight loop... but I
> don't see a usecase where you'd actually want that? As far as I can
> tell, if you can't write the idle state, being able to read it is
> pretty much useless.
> 
> If the profiler only wants to profile process-private memory, then
> that should be implementable in a safe way in principle, I think, but
> since Joel said that they want to profile CoW memory as well, I think
> that's inherently somewhat dangerous.

I cannot really say how useful that would be but I can see that
implementing ownership checks would be really non-trivial for
shared pages. Reducing the interface to exclusive pages would make it
easier as you noted but less helpful.

Besides that the attack vector shouldn't be really much different from
the page cache access, right? So essentially can_do_mincore model.

I guess we want to document that page idle tracking should be used with
care because it potentially opens a side channel opportunity if used
on sensitive data.
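
For illustration, the shape of such a check, as I understand the mincore
one, is sketched below.  This is a simplified model, not kernel code: the
struct fields and can_query_residency() are hypothetical stand-ins for the
vma and inode permission tests.

```c
#include <stdbool.h>

/* Hypothetical summary of what the caller is allowed to do with a mapping. */
struct mapping_model {
	bool anonymous;		/* anonymous memory, no backing file */
	bool has_file;		/* mapping is backed by a file */
	bool owner_or_capable;	/* caller owns the inode or is privileged */
	bool may_write;		/* caller has write permission on the file */
};

/* Allow the query for anonymous memory, or for files the caller controls. */
static bool can_query_residency(const struct mapping_model *m)
{
	if (m->anonymous)
		return true;
	if (!m->has_file)
		return false;
	return m->owner_or_capable || m->may_write;
}
```

A read-only mapping of a shared library that the caller neither owns nor can
write to would be refused under this model.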
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v5 2/6] mm/page_idle: Add support for handling swapped PG_Idle pages
From: Michal Hocko @ 2019-08-14  8:05 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: khlebnikov, linux-kernel, Minchan Kim, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Robin Murphy, Roman Gushchin
In-Reply-To: <20190813153659.GD14622@google.com>

On Tue 13-08-19 11:36:59, Joel Fernandes wrote:
> On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote:
> > On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote:
> > > Idle page tracking currently does not work well in the following
> > > scenario:
> > >  1. mark page-A idle which was present at that time.
> > >  2. run workload
> > >  3. page-A is not touched by workload
> > >  4. *sudden* memory pressure happen so finally page A is finally swapped out
> > >  5. now see the page A - it appears as if it was accessed (pte unmapped
> > >     so idle bit not set in output) - but it's incorrect.
> > > 
> > > To fix this, we store the idle information into a new idle bit of the
> > > swap PTE during swapping of anonymous pages.
> > >
> > > Also in the future, madvise extensions will allow a system process
> > > manager (like Android's ActivityManager) to swap pages out of a process
> > > that it knows will be cold. To an external process like a heap profiler
> > > that is doing idle tracking on another process, this procedure will
> > > interfere with the idle page tracking similar to the above steps.
> > 
> > This could be solved by checking the !present/swapped out pages
> > right? Whoever decided to put the page out to the swap just made it
> > idle effectively.  So the monitor can make some educated guess for
> > tracking. If that is fundamentally not possible then please describe
> > why.
> 
> But the monitoring process (profiler) does not have control over the 'whoever
> made it effectively idle' process.

Why does that matter? Whether it is a global/memcg reclaim or somebody
calling MADV_PAGEOUT or whatever it is a decision to make the page not
hot. Sure you could argue that a missing idle bit on swap entries might
mean that the swap out decision was pre-mature/sub-optimal/wrong but is
this the aim of the interface?
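
As a sketch of the educated guess on the monitor side: the present and
swapped bits below follow the documented /proc/<pid>/pagemap entry layout
(bit 63 present, bit 62 swapped), while enum idle_guess and guess_idle()
are hypothetical names for a policy the profiler would have to choose for
itself.

```c
#include <stdbool.h>
#include <stdint.h>

#define PM_PRESENT	((uint64_t)1 << 63)	/* page is in RAM */
#define PM_SWAPPED	((uint64_t)1 << 62)	/* page is in swap */

enum idle_guess { GUESS_UNKNOWN, GUESS_IDLE, GUESS_ACCESSED };

/*
 * Hypothetical monitor policy.  'idle_bit' is what idle page tracking
 * reported for this page; it is only meaningful for present pages.
 */
static enum idle_guess guess_idle(uint64_t pagemap_entry, bool idle_bit)
{
	if (pagemap_entry & PM_SWAPPED)
		return GUESS_IDLE;	/* somebody decided it was cold */
	if (!(pagemap_entry & PM_PRESENT))
		return GUESS_UNKNOWN;	/* not mapped, nothing to report */
	return idle_bit ? GUESS_IDLE : GUESS_ACCESSED;
}
```

The result is a guess rather than a guarantee, since a page may have been
touched shortly before it was swapped out.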

> As you said it will be a guess, it will not be accurate.

Yes and the point I am trying to make is that having some space and not
giving a guarantee sounds like a safer option for this interface because
...
> 
> I am curious what is your concern with using a bit in the swap PTE?

... It is a promise of semantics that I find limiting for the future. The bit
in the pte might turn out to be insufficient (e.g. pte reclaim) so teaching
the userspace to consider this a hard guarantee is a ticket to problems
later on. Maybe I am overly paranoid because I have seen so many "nice
to have" features turning into a maintenance burden in the past.

If this is really considered mostly a debugging interface then a
certain level of imprecision should be tolerable. If there is a
really strong real world usecase that simply has no other way to go
then this might be added later. Adding information is always safer
than taking it away.

That being said, if I am a minority voice here then I will not really
stand in the way and won't nack the patch. I will not ack it either
though.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply
