public inbox for cgroups@vger.kernel.org
* [PATCH RFC 0/2] Selective mitigation for trusted userspace
@ 2024-09-19 21:52 Pawan Gupta
  2024-09-19 21:52 ` [PATCH RFC 1/2] x86/entry_64: Add a separate unmitigated entry/exit path Pawan Gupta
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Pawan Gupta @ 2024-09-19 21:52 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Kaplan, Daniel Sneddon, x86, H. Peter Anvin,
	Peter Zijlstra, Josh Poimboeuf, Steven Rostedt
  Cc: linux-kernel, cgroups

Hi,

This is an experimental series exploring the feasibility of selectively
applying CPU vulnerability mitigations on a per-process basis. The
motivation behind this work is to address the performance degradation
experienced by trusted user-space applications due to system-wide CPU
mitigations.

Currently, the mitigations are applied universally across the system,
without discrimination between trusted and untrusted user-space processes.
This results in a performance penalty for all applications, regardless of
their trustworthiness. The proposed solution aims to provide a mechanism
for system administrators to explicitly mark certain applications as
trusted, allowing them to bypass these mitigations and regain lost
performance.

The series introduces a new cgroup attribute and a separate kernel
entry/exit path that can be used to selectively disable CPU mitigations for
processes that are deemed trustworthy by the system administrator. This
approach provides a tool for an administrator who understands the
security implications and is aware of the trustworthiness of the
applications they care about.

The rationale for choosing the cgroup interface over other potential
interfaces, such as LSMs, is cgroup's inherent support for core scheduling.
Core scheduling allows the grouping of tasks such that they are scheduled
to run on the same cores. By leveraging core scheduling, we can minimize
the performance overhead caused by the MSR writes during context switching
between trusted and untrusted processes. The end goal is for trusted
and untrusted processes to run on separate cores, which also enhances
security.
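To make the scheduling argument concrete, here is a small userspace model
(illustrative only, not part of the series) that counts how many MSR_LSTAR
rewrites a schedule would trigger when trusted and untrusted tasks are
interleaved on one core versus partitioned by core scheduling:

```python
# Toy model: each context switch between a trusted (T) and an
# untrusted (U) task forces an MSR_LSTAR rewrite; switches between
# tasks of the same trust class do not.

def msr_writes(schedule):
    """Count trust-boundary crossings in a per-core task schedule."""
    return sum(1 for a, b in zip(schedule, schedule[1:]) if a != b)

interleaved = "TUTUTUTU"          # trusted/untrusted sharing one core
partitioned = ["TTTT", "UUUU"]    # core scheduling keeps them apart

print(msr_writes(interleaved))                        # 7 rewrites
print(sum(msr_writes(core) for core in partitioned))  # 0 rewrites
```

The point is only that partitioning by core drives the per-switch MSR cost
toward zero; real scheduling behavior is of course far more complex.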

Patch 1 adds the unmitigated entry/exit path.
Patch 2 provides a cgroup knob to bypass CPU mitigations.
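For reference, the intended admin workflow might look something like the
sketch below. This is hypothetical: the knob name comes from patch 2, the
cgroup path is an example, and it requires a cgroup v2 hierarchy on a
kernel with this series applied.

```shell
# Create a cgroup for the trusted workload and mark it as such
# (privileged user in the init cgroup namespace only).
mkdir /sys/fs/cgroup/trusted
echo 1 > /sys/fs/cgroup/trusted/cpu.skip_mitigation

# Move the trusted workload into the cgroup; its syscalls take the
# unmitigated entry path after the next context switch.
echo "$WORKLOAD_PID" > /sys/fs/cgroup/trusted/cgroup.procs
```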

This series is lightly tested. Feedback and discussion are welcome.

TODO:
- Add CONFIG_MITIGATION_PER_PROCESS
- Add support for skipping other mitigations like RSB filling.
- Update suspend/resume paths to handle the new entry/exit path.
- Should child processes inherit the parent's unmitigated status?
- Add documentation.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
Pawan Gupta (2):
      x86/entry_64: Add a separate unmitigated entry/exit path
      cpu/bugs: cgroup: Add a cgroup knob to bypass CPU mitigations

 arch/x86/entry/entry_64.S        | 66 +++++++++++++++++++++++++++++++++-------
 arch/x86/include/asm/proto.h     | 15 ++++++---
 arch/x86/include/asm/ptrace.h    | 15 ++++++---
 arch/x86/include/asm/switch_to.h | 10 ++++++
 arch/x86/kernel/cpu/bugs.c       | 21 +++++++++++++
 arch/x86/kernel/cpu/common.c     |  2 +-
 include/linux/cgroup-defs.h      |  3 ++
 kernel/cgroup/cgroup.c           | 42 +++++++++++++++++++++++++
 kernel/sched/core.c              |  2 +-
 9 files changed, 155 insertions(+), 21 deletions(-)
---
base-commit: 98f7e32f20d28ec452afb208f9cffc08448a2652
change-id: 20240919-selective-mitigation-6d02c4bbb72b

-- 
Thanks,
Pawan


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH RFC 1/2] x86/entry_64: Add a separate unmitigated entry/exit path
  2024-09-19 21:52 [PATCH RFC 0/2] Selective mitigation for trusted userspace Pawan Gupta
@ 2024-09-19 21:52 ` Pawan Gupta
  2024-09-20  6:57   ` Waiman Long
  2024-09-19 21:52 ` [PATCH RFC 2/2] cpu/bugs: cgroup: Add a cgroup knob to bypass CPU mitigations Pawan Gupta
  2024-09-27 15:52 ` [PATCH RFC 0/2] Selective mitigation for trusted userspace Michal Koutný
  2 siblings, 1 reply; 10+ messages in thread
From: Pawan Gupta @ 2024-09-19 21:52 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Kaplan, Daniel Sneddon, x86, H. Peter Anvin,
	Peter Zijlstra, Josh Poimboeuf, Steven Rostedt
  Cc: linux-kernel, cgroups

CPU mitigations are deployed system-wide, but usually not all of
userspace is malicious. Yet, all processes suffer the performance impact
of the mitigations. This all-or-nothing approach is due to the lack of a
way for the kernel to know which userspace can be trusted and which
cannot.

For scenarios where an admin can decide which processes to trust, an
interface to tell the kernel to possibly skip the mitigations would be
useful.

In preparation for the kernel being able to selectively apply
mitigations per process, add a separate kernel entry/exit path that
skips the mitigations.

Originally-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/entry/entry_64.S     | 66 +++++++++++++++++++++++++++++++++++--------
 arch/x86/include/asm/proto.h  | 15 +++++++---
 arch/x86/include/asm/ptrace.h | 15 +++++++---
 arch/x86/kernel/cpu/common.c  |  2 +-
 4 files changed, 78 insertions(+), 20 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1b5be07f8669..eeaf4226d09c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -84,7 +84,7 @@
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
-SYM_CODE_START(entry_SYSCALL_64)
+.macro __entry_SYSCALL_64 mitigated=0
 	UNWIND_HINT_ENTRY
 	ENDBR
 
@@ -94,7 +94,12 @@ SYM_CODE_START(entry_SYSCALL_64)
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 	movq	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
 
-SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
+.if \mitigated
+SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack_mitigated, SYM_L_GLOBAL)
+.else
+SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack_unmitigated, SYM_L_GLOBAL)
+.endif
+
 	ANNOTATE_NOENDBR
 
 	/* Construct struct pt_regs on stack */
@@ -103,7 +108,11 @@ SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
 	pushq	%r11					/* pt_regs->flags */
 	pushq	$__USER_CS				/* pt_regs->cs */
 	pushq	%rcx					/* pt_regs->ip */
+
+.if \mitigated
 SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
+.endif
+
 	pushq	%rax					/* pt_regs->orig_ax */
 
 	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
@@ -113,10 +122,12 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	/* Sign extend the lower 32bit as syscall numbers are treated as int */
 	movslq	%eax, %rsi
 
+.if \mitigated
 	/* clobbers %rax, make sure it is after saving the syscall nr */
 	IBRS_ENTER
 	UNTRAIN_RET
 	CLEAR_BRANCH_HISTORY
+.endif
 
 	call	do_syscall_64		/* returns with IRQs disabled */
 
@@ -127,15 +138,26 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	 * In the Xen PV case we must use iret anyway.
 	 */
 
-	ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
-		"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
+.if \mitigated
+	push %rax
+	IBRS_EXIT
+	CLEAR_CPU_BUFFERS
+	pop %rax
+.endif
+
+	ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode_from_syscall", \
+		"jmp swapgs_restore_regs_and_return_to_usermode_from_syscall", X86_FEATURE_XENPV
 
 	/*
 	 * We win! This label is here just for ease of understanding
 	 * perf profiles. Nothing jumps here.
 	 */
-syscall_return_via_sysret:
-	IBRS_EXIT
+.if \mitigated
+syscall_return_via_sysret_mitigated:
+.else
+syscall_return_via_sysret_unmitigated:
+.endif
+
 	POP_REGS pop_rdi=0
 
 	/*
@@ -159,15 +181,36 @@ syscall_return_via_sysret:
 
 	popq	%rdi
 	popq	%rsp
-SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack, SYM_L_GLOBAL)
+
+.if \mitigated
+SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack_mitigated, SYM_L_GLOBAL)
+.else
+SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack_unmitigated, SYM_L_GLOBAL)
+.endif
+
 	ANNOTATE_NOENDBR
 	swapgs
-	CLEAR_CPU_BUFFERS
+
+.if \mitigated
+SYM_INNER_LABEL(entry_SYSRETQ_end_mitigated, SYM_L_GLOBAL)
+.else
+SYM_INNER_LABEL(entry_SYSRETQ_end_unmitigated, SYM_L_GLOBAL)
+.endif
 	sysretq
-SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
+
+.endm /* __entry_SYSCALL_64 */
+
+SYM_CODE_START(entry_SYSCALL_64_unmitigated)
+	__entry_SYSCALL_64 mitigated=0
 	ANNOTATE_NOENDBR
 	int3
-SYM_CODE_END(entry_SYSCALL_64)
+SYM_CODE_END(entry_SYSCALL_64_unmitigated)
+
+SYM_CODE_START(entry_SYSCALL_64_mitigated)
+	__entry_SYSCALL_64 mitigated=1
+	ANNOTATE_NOENDBR
+	int3
+SYM_CODE_END(entry_SYSCALL_64_mitigated)
 
 /*
  * %rdi: prev task
@@ -559,6 +602,8 @@ __irqentry_text_end:
 SYM_CODE_START_LOCAL(common_interrupt_return)
 SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 	IBRS_EXIT
+	CLEAR_CPU_BUFFERS
+SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode_from_syscall, SYM_L_GLOBAL)
 #ifdef CONFIG_XEN_PV
 	ALTERNATIVE "", "jmp xenpv_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 #endif
@@ -573,7 +618,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 
 .Lswapgs_and_iret:
 	swapgs
-	CLEAR_CPU_BUFFERS
 	/* Assert that the IRET frame indicates user mode. */
 	testb	$3, 8(%rsp)
 	jnz	.Lnative_iret
diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index 484f4f0131a5..0936e0e70659 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -11,10 +11,17 @@ struct task_struct;
 void syscall_init(void);
 
 #ifdef CONFIG_X86_64
-void entry_SYSCALL_64(void);
-void entry_SYSCALL_64_safe_stack(void);
-void entry_SYSRETQ_unsafe_stack(void);
-void entry_SYSRETQ_end(void);
+
+void entry_SYSCALL_64_unmitigated(void);
+void entry_SYSCALL_64_safe_stack_unmitigated(void);
+void entry_SYSRETQ_unsafe_stack_unmitigated(void);
+void entry_SYSRETQ_end_unmitigated(void);
+
+void entry_SYSCALL_64_mitigated(void);
+void entry_SYSCALL_64_safe_stack_mitigated(void);
+void entry_SYSRETQ_unsafe_stack_mitigated(void);
+void entry_SYSRETQ_end_mitigated(void);
+
 long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2);
 #endif
 
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 5a83fbd9bc0b..74a13c76d241 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -261,11 +261,18 @@ static inline bool any_64bit_mode(struct pt_regs *regs)
 
 static __always_inline bool ip_within_syscall_gap(struct pt_regs *regs)
 {
-	bool ret = (regs->ip >= (unsigned long)entry_SYSCALL_64 &&
-		    regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack);
+	bool ret = (regs->ip >= (unsigned long)entry_SYSCALL_64_unmitigated &&
+		    regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack_unmitigated);
+
+	ret = ret || (regs->ip >= (unsigned long)entry_SYSRETQ_unsafe_stack_unmitigated &&
+		      regs->ip <  (unsigned long)entry_SYSRETQ_end_unmitigated);
+
+	ret = ret || (regs->ip >= (unsigned long)entry_SYSCALL_64_mitigated &&
+		      regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack_mitigated);
+
+	ret = ret || (regs->ip >= (unsigned long)entry_SYSRETQ_unsafe_stack_mitigated &&
+		      regs->ip <  (unsigned long)entry_SYSRETQ_end_mitigated);
 
-	ret = ret || (regs->ip >= (unsigned long)entry_SYSRETQ_unsafe_stack &&
-		      regs->ip <  (unsigned long)entry_SYSRETQ_end);
 #ifdef CONFIG_IA32_EMULATION
 	ret = ret || (regs->ip >= (unsigned long)entry_SYSCALL_compat &&
 		      regs->ip <  (unsigned long)entry_SYSCALL_compat_safe_stack);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d4e539d4e158..e72c37f3a437 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2026,7 +2026,7 @@ static void wrmsrl_cstar(unsigned long val)
 
 static inline void idt_syscall_init(void)
 {
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64_unmitigated);
 
 	if (ia32_enabled()) {
 		wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);

-- 
2.34.1




* [PATCH RFC 2/2] cpu/bugs: cgroup: Add a cgroup knob to bypass CPU mitigations
  2024-09-19 21:52 [PATCH RFC 0/2] Selective mitigation for trusted userspace Pawan Gupta
  2024-09-19 21:52 ` [PATCH RFC 1/2] x86/entry_64: Add a separate unmitigated entry/exit path Pawan Gupta
@ 2024-09-19 21:52 ` Pawan Gupta
  2024-09-20  7:05   ` Waiman Long
  2024-09-27 15:52 ` [PATCH RFC 0/2] Selective mitigation for trusted userspace Michal Koutný
  2 siblings, 1 reply; 10+ messages in thread
From: Pawan Gupta @ 2024-09-19 21:52 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Kaplan, Daniel Sneddon, x86, H. Peter Anvin,
	Peter Zijlstra, Josh Poimboeuf, Steven Rostedt
  Cc: linux-kernel, cgroups

For cases where an admin wants to bypass CPU mitigations for a specific
workload that they trust, add a cgroup attribute "cpu.skip_mitigation" that
can only be set by a privileged user. When set, the CPU mitigations are
bypassed for all tasks in that cgroup.

Before setting this knob, the admin should be aware of possible security
risks such as confused-deputy attacks on trusted interpreters, JIT
engines, etc.
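The per-switch decision the patch implements in cpu_mitigation_skip() can
be modeled in userspace as follows (an illustrative sketch, not kernel
code; the entry-point names mirror patch 1):

```python
# Model of the context-switch hook: the MSR_LSTAR write is needed only
# when the trust status actually changes across the switch; kernel
# threads (no mm) are treated as mitigated by default.

MITIGATED = "entry_SYSCALL_64_mitigated"
UNMITIGATED = "entry_SYSCALL_64_unmitigated"

def lstar_target(prev_skip, next_skip):
    """Return the entry point to program, or None if no write is needed."""
    if prev_skip == next_skip:      # also covers the both-False case
        return None
    return UNMITIGATED if next_skip else MITIGATED

print(lstar_target(False, True))    # switching into a trusted cgroup
print(lstar_target(True, True))     # no change -> no MSR write (None)
```

Keeping the MSR write off the common path (same trust status on both
sides of the switch) is what makes the per-process approach cheap when
combined with core scheduling.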

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/include/asm/switch_to.h | 10 ++++++++++
 arch/x86/kernel/cpu/bugs.c       | 21 ++++++++++++++++++++
 include/linux/cgroup-defs.h      |  3 +++
 kernel/cgroup/cgroup.c           | 42 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c              |  2 +-
 5 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index c3bd0c0758c9..7f32fd139644 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -46,6 +46,16 @@ struct fork_frame {
 	struct pt_regs regs;
 };
 
+extern inline void cpu_mitigation_skip(struct task_struct *prev, struct task_struct *next);
+
+#define prepare_arch_switch prepare_arch_switch
+
+static inline void prepare_arch_switch(struct task_struct *prev,
+				       struct task_struct *next)
+{
+	cpu_mitigation_skip(prev, next);
+}
+
 #define switch_to(prev, next, last)					\
 do {									\
 	((last) = __switch_to_asm((prev), (next)));			\
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 45675da354f3..77eb4f6dc5c9 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -128,6 +128,27 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
 DEFINE_STATIC_KEY_FALSE(mmio_stale_data_clear);
 EXPORT_SYMBOL_GPL(mmio_stale_data_clear);
 
+inline void cpu_mitigation_skip(struct task_struct *prev,
+				struct task_struct *next)
+{
+	bool prev_skip = false, next_skip = false;
+
+	if (prev->mm)
+		prev_skip = task_dfl_cgroup(prev)->cpu_skip_mitigation;
+	if (next->mm)
+		next_skip = task_dfl_cgroup(next)->cpu_skip_mitigation;
+
+	if (!prev_skip && !next_skip)
+		return;
+	if (prev_skip == next_skip)
+		return;
+
+	if (next_skip)
+		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64_unmitigated);
+	else
+		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64_mitigated);
+}
+
 void __init cpu_select_mitigations(void)
 {
 	/*
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ae04035b6cbe..6a131a62f43c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -546,6 +546,9 @@ struct cgroup {
 	struct bpf_local_storage __rcu  *bpf_cgrp_storage;
 #endif
 
+	/* Used to bypass the CPU mitigations for tasks in a cgroup */
+	bool cpu_skip_mitigation;
+
 	/* All ancestors including self */
 	struct cgroup *ancestors[];
 };
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c8e4b62b436a..b745dbcb153e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2045,6 +2045,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	cgrp->dom_cgrp = cgrp;
 	cgrp->max_descendants = INT_MAX;
 	cgrp->max_depth = INT_MAX;
+	cgrp->cpu_skip_mitigation = 0;
 	INIT_LIST_HEAD(&cgrp->rstat_css_list);
 	prev_cputime_init(&cgrp->prev_cputime);
 
@@ -3751,6 +3752,41 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
 	return ret;
 }
 
+static int cpu_skip_mitigation_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	int ret = 0;
+
+	seq_printf(seq, "%d\n", cgrp->cpu_skip_mitigation);
+
+	return ret;
+}
+
+static ssize_t cgroup_skip_mitigation_write(struct kernfs_open_file *of,
+					    char *buf, size_t nbytes,
+					    loff_t off)
+{
+	struct cgroup *cgrp = of->kn->parent->priv;
+	struct cgroup_file_ctx *ctx = of->priv;
+	u64 skip_mitigation;
+	int ret;
+
+	/* Only privileged user in init namespace is allowed to set skip_mitigation */
+	if ((ctx->ns != &init_cgroup_ns) || !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	ret = kstrtoull(buf, 0, &skip_mitigation);
+	if (ret)
+		return -EINVAL;
+
+	if (skip_mitigation > 1)
+		return -EINVAL;
+
+	cgrp->cpu_skip_mitigation = skip_mitigation;
+
+	return nbytes;
+}
+
 static int cpu_local_stat_show(struct seq_file *seq, void *v)
 {
 	struct cgroup __maybe_unused *cgrp = seq_css(seq)->cgroup;
@@ -5290,6 +5326,12 @@ static struct cftype cgroup_base_files[] = {
 		.name = "cpu.stat.local",
 		.seq_show = cpu_local_stat_show,
 	},
+	{
+		.name = "cpu.skip_mitigation",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_skip_mitigation_show,
+		.write = cgroup_skip_mitigation_write,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3951e4a55e5..4b4109afbf7c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4994,7 +4994,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
-	prepare_arch_switch(next);
+	prepare_arch_switch(prev, next);
 }
 
 /**

-- 
2.34.1




* Re: [PATCH RFC 1/2] x86/entry_64: Add a separate unmitigated entry/exit path
  2024-09-19 21:52 ` [PATCH RFC 1/2] x86/entry_64: Add a separate unmitigated entry/exit path Pawan Gupta
@ 2024-09-20  6:57   ` Waiman Long
  2024-09-20  7:24     ` Pawan Gupta
  0 siblings, 1 reply; 10+ messages in thread
From: Waiman Long @ 2024-09-20  6:57 UTC (permalink / raw)
  To: Pawan Gupta, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Kaplan, Daniel Sneddon, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Steven Rostedt
  Cc: linux-kernel, cgroups


On 9/19/24 17:52, Pawan Gupta wrote:
> CPU mitigations are deployed system-wide, but usually not all of the
> userspace is malicious. Yet, they suffer from the performance impact
> of the mitigations. This all or nothing approach is due to lack of a
> way for kernel to know which userspace can be trusted and which cannot.
>
> For scenarios where an admin can decide which processes to trust, an
> interface to tell the kernel to possibly skip the mitigation would be
> useful.
>
> In preparation for kernel to be able to selectively apply mitigation
> per-process add a separate kernel entry/exit path that skips the
> mitigations.
>
> Originally-by: Josh Poimboeuf <jpoimboe@kernel.org>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

For the current patch, not all x86 CPU vulnerability mitigations can be 
disabled. Maybe we should list the subset of mitigations that can be 
disabled.

Cheers,
Longman



* Re: [PATCH RFC 2/2] cpu/bugs: cgroup: Add a cgroup knob to bypass CPU mitigations
  2024-09-19 21:52 ` [PATCH RFC 2/2] cpu/bugs: cgroup: Add a cgroup knob to bypass CPU mitigations Pawan Gupta
@ 2024-09-20  7:05   ` Waiman Long
  2024-09-20  7:54     ` Pawan Gupta
  0 siblings, 1 reply; 10+ messages in thread
From: Waiman Long @ 2024-09-20  7:05 UTC (permalink / raw)
  To: Pawan Gupta, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Kaplan, Daniel Sneddon, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Steven Rostedt
  Cc: linux-kernel, cgroups


On 9/19/24 17:52, Pawan Gupta wrote:
> For cases where an admin wanting to bypass CPU mitigations for a specific
> workload that they trust. Add a cgroup attribute "cpu.skip_mitigation" that
> can only be set by a privileged user. When set, the CPU mitigations are
> bypassed for all tasks in that cgroup.
>
> Before setting this knob, the admin should be aware of possible security
> risks like confused deputy attack on trusted interpreters, JIT engine,
> etc.
>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
>   arch/x86/include/asm/switch_to.h | 10 ++++++++++
>   arch/x86/kernel/cpu/bugs.c       | 21 ++++++++++++++++++++
>   include/linux/cgroup-defs.h      |  3 +++
>   kernel/cgroup/cgroup.c           | 42 ++++++++++++++++++++++++++++++++++++++++
>   kernel/sched/core.c              |  2 +-
>   5 files changed, 77 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index c3bd0c0758c9..7f32fd139644 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -46,6 +46,16 @@ struct fork_frame {
>   	struct pt_regs regs;
>   };
>   
> +extern inline void cpu_mitigation_skip(struct task_struct *prev, struct task_struct *next);
> +
> +#define prepare_arch_switch prepare_arch_switch
> +
> +static inline void prepare_arch_switch(struct task_struct *prev,
> +				       struct task_struct *next)
> +{
> +	cpu_mitigation_skip(prev, next);
> +}
> +
>   #define switch_to(prev, next, last)					\
>   do {									\
>   	((last) = __switch_to_asm((prev), (next)));			\
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 45675da354f3..77eb4f6dc5c9 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -128,6 +128,27 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
>   DEFINE_STATIC_KEY_FALSE(mmio_stale_data_clear);
>   EXPORT_SYMBOL_GPL(mmio_stale_data_clear);
>   
> +inline void cpu_mitigation_skip(struct task_struct *prev,
> +				struct task_struct *next)
> +{
> +	bool prev_skip = false, next_skip = false;
> +
> +	if (prev->mm)
> +		prev_skip = task_dfl_cgroup(prev)->cpu_skip_mitigation;
> +	if (next->mm)
> +		next_skip = task_dfl_cgroup(next)->cpu_skip_mitigation;
> +
> +	if (!prev_skip && !next_skip)
> +		return;
> +	if (prev_skip == next_skip)
> +		return;
I believe the first (!prev_skip && !next_skip) check is redundant.
> +
> +	if (next_skip)
> +		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64_unmitigated);
> +	else
> +		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64_mitigated);
> +}
> +
>   void __init cpu_select_mitigations(void)
>   {
>   	/*
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index ae04035b6cbe..6a131a62f43c 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -546,6 +546,9 @@ struct cgroup {
>   	struct bpf_local_storage __rcu  *bpf_cgrp_storage;
>   #endif
>   
> +	/* Used to bypass the CPU mitigations for tasks in a cgroup */
> +	bool cpu_skip_mitigation;
> +
>   	/* All ancestors including self */
>   	struct cgroup *ancestors[];
>   };
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c8e4b62b436a..b745dbcb153e 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2045,6 +2045,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
>   	cgrp->dom_cgrp = cgrp;
>   	cgrp->max_descendants = INT_MAX;
>   	cgrp->max_depth = INT_MAX;
> +	cgrp->cpu_skip_mitigation = 0;
>   	INIT_LIST_HEAD(&cgrp->rstat_css_list);
>   	prev_cputime_init(&cgrp->prev_cputime);
>   
> @@ -3751,6 +3752,41 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
>   	return ret;
>   }
>   
> +static int cpu_skip_mitigation_show(struct seq_file *seq, void *v)
> +{
> +	struct cgroup *cgrp = seq_css(seq)->cgroup;
> +	int ret = 0;
> +
> +	seq_printf(seq, "%d\n", cgrp->cpu_skip_mitigation);
> +
> +	return ret;
> +}
> +
> +static ssize_t cgroup_skip_mitigation_write(struct kernfs_open_file *of,
> +					    char *buf, size_t nbytes,
> +					    loff_t off)
> +{
> +	struct cgroup *cgrp = of->kn->parent->priv;
> +	struct cgroup_file_ctx *ctx = of->priv;
> +	u64 skip_mitigation;
> +	int ret;
> +
> +	/* Only privileged user in init namespace is allowed to set skip_mitigation */
> +	if ((ctx->ns != &init_cgroup_ns) || !capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	ret = kstrtoull(buf, 0, &skip_mitigation);
> +	if (ret)
> +		return -EINVAL;
> +
> +	if (skip_mitigation > 1)
> +		return -EINVAL;
> +
> +	cgrp->cpu_skip_mitigation = skip_mitigation;
> +
> +	return nbytes;
> +}
> +
>   static int cpu_local_stat_show(struct seq_file *seq, void *v)
>   {
>   	struct cgroup __maybe_unused *cgrp = seq_css(seq)->cgroup;
> @@ -5290,6 +5326,12 @@ static struct cftype cgroup_base_files[] = {
>   		.name = "cpu.stat.local",
>   		.seq_show = cpu_local_stat_show,
>   	},
> +	{
> +		.name = "cpu.skip_mitigation",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = cpu_skip_mitigation_show,
> +		.write = cgroup_skip_mitigation_write,
> +	},
>   	{ }	/* terminate */
>   };
Since this control knob is effective only on x86_64, should we enable it
only for that architecture? However, cgroup has never had an
architecture-specific control knob like this before; it would be the
first if we decide to do that.
>   
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f3951e4a55e5..4b4109afbf7c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4994,7 +4994,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>   	fire_sched_out_preempt_notifiers(prev, next);
>   	kmap_local_sched_out();
>   	prepare_task(next);
> -	prepare_arch_switch(next);
> +	prepare_arch_switch(prev, next);
>   }
>   
>   /**
>
The sparc and m68k arches have their own prepare_arch_switch() calls, so
you will need to modify those places as well.

Cheers,
Longman



* Re: [PATCH RFC 1/2] x86/entry_64: Add a separate unmitigated entry/exit path
  2024-09-20  6:57   ` Waiman Long
@ 2024-09-20  7:24     ` Pawan Gupta
  0 siblings, 0 replies; 10+ messages in thread
From: Pawan Gupta @ 2024-09-20  7:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Kaplan, Daniel Sneddon, x86, H. Peter Anvin,
	Peter Zijlstra, Josh Poimboeuf, Steven Rostedt, linux-kernel,
	cgroups

On Fri, Sep 20, 2024 at 02:57:34AM -0400, Waiman Long wrote:
> 
> On 9/19/24 17:52, Pawan Gupta wrote:
> > CPU mitigations are deployed system-wide, but usually not all of the
> > userspace is malicious. Yet, they suffer from the performance impact
> > of the mitigations. This all or nothing approach is due to lack of a
> > way for kernel to know which userspace can be trusted and which cannot.
> > 
> > For scenarios where an admin can decide which processes to trust, an
> > interface to tell the kernel to possibly skip the mitigation would be
> > useful.
> > 
> > In preparation for kernel to be able to selectively apply mitigation
> > per-process add a separate kernel entry/exit path that skips the
> > mitigations.
> > 
> > Originally-by: Josh Poimboeuf <jpoimboe@kernel.org>
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> 
> For the current patch, not all x86 CPU vulnerability mitigations can be
> disabled. Maybe we should list the subset of mitigations that can be
> disabled.

Yes, I will update the changelog to note that the mitigations that can
be bypassed are BHI, VERW, Retbleed-IBRS, Retbleed-unret and IBPB.

Meltdown, Spectre-v1, eIBRS, GDS, SRBDS, retpoline and rethunk stay
enabled.
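Independent of this series, the kernel's current view of each CPU
vulnerability and its active mitigation can be inspected through sysfs
(a standard interface on current kernels; output varies by CPU and
kernel version):

```shell
# One line per vulnerability, e.g. "spectre_v2:Mitigation: ..."
grep . /sys/devices/system/cpu/vulnerabilities/*
```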


* Re: [PATCH RFC 2/2] cpu/bugs: cgroup: Add a cgroup knob to bypass CPU mitigations
  2024-09-20  7:05   ` Waiman Long
@ 2024-09-20  7:54     ` Pawan Gupta
  2024-09-21  7:30       ` Waiman Long
  0 siblings, 1 reply; 10+ messages in thread
From: Pawan Gupta @ 2024-09-20  7:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Kaplan, Daniel Sneddon, x86, H. Peter Anvin,
	Peter Zijlstra, Josh Poimboeuf, Steven Rostedt, linux-kernel,
	cgroups

On Fri, Sep 20, 2024 at 03:05:57AM -0400, Waiman Long wrote:
> 
> On 9/19/24 17:52, Pawan Gupta wrote:
> > For cases where an admin wanting to bypass CPU mitigations for a specific
> > workload that they trust. Add a cgroup attribute "cpu.skip_mitigation" that
> > can only be set by a privileged user. When set, the CPU mitigations are
> > bypassed for all tasks in that cgroup.
> > 
> > Before setting this knob, the admin should be aware of possible security
> > risks like confused deputy attack on trusted interpreters, JIT engine,
> > etc.
> > 
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > ---
> >   arch/x86/include/asm/switch_to.h | 10 ++++++++++
> >   arch/x86/kernel/cpu/bugs.c       | 21 ++++++++++++++++++++
> >   include/linux/cgroup-defs.h      |  3 +++
> >   kernel/cgroup/cgroup.c           | 42 ++++++++++++++++++++++++++++++++++++++++
> >   kernel/sched/core.c              |  2 +-
> >   5 files changed, 77 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> > index c3bd0c0758c9..7f32fd139644 100644
> > --- a/arch/x86/include/asm/switch_to.h
> > +++ b/arch/x86/include/asm/switch_to.h
> > @@ -46,6 +46,16 @@ struct fork_frame {
> >   	struct pt_regs regs;
> >   };
> > +extern inline void cpu_mitigation_skip(struct task_struct *prev, struct task_struct *next);
> > +
> > +#define prepare_arch_switch prepare_arch_switch
> > +
> > +static inline void prepare_arch_switch(struct task_struct *prev,
> > +				       struct task_struct *next)
> > +{
> > +	cpu_mitigation_skip(prev, next);
> > +}
> > +
> >   #define switch_to(prev, next, last)					\
> >   do {									\
> >   	((last) = __switch_to_asm((prev), (next)));			\
> > diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> > index 45675da354f3..77eb4f6dc5c9 100644
> > --- a/arch/x86/kernel/cpu/bugs.c
> > +++ b/arch/x86/kernel/cpu/bugs.c
> > @@ -128,6 +128,27 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
> >   DEFINE_STATIC_KEY_FALSE(mmio_stale_data_clear);
> >   EXPORT_SYMBOL_GPL(mmio_stale_data_clear);
> > +inline void cpu_mitigation_skip(struct task_struct *prev,
> > +				struct task_struct *next)
> > +{
> > +	bool prev_skip = false, next_skip = false;
> > +
> > +	if (prev->mm)
> > +		prev_skip = task_dfl_cgroup(prev)->cpu_skip_mitigation;
> > +	if (next->mm)
> > +		next_skip = task_dfl_cgroup(next)->cpu_skip_mitigation;
> > +
> > +	if (!prev_skip && !next_skip)
> > +		return;
> > +	if (prev_skip == next_skip)
> > +		return;
> I believe the first (!prev_skip && !next_skip) check is redundant.

Agh, you are right. I will remove it.
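For the record, an exhaustive check over the four input combinations
confirms the redundancy (a quick userspace sanity check, not part of the
patch):

```python
from itertools import product

# If "prev_skip == next_skip" already returns early, the
# "not prev_skip and not next_skip" case is covered, since
# False == False. The first check can never fire on its own.
for prev_skip, next_skip in product((False, True), repeat=2):
    first = not prev_skip and not next_skip
    second = prev_skip == next_skip
    assert not first or second    # first check implies the second

print("first check is redundant")
```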

> > +	if (next_skip)
> > +		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64_unmitigated);
> > +	else
> > +		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64_mitigated);
> > +}
> > +
> >   void __init cpu_select_mitigations(void)
> >   {
> >   	/*
> > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> > index ae04035b6cbe..6a131a62f43c 100644
> > --- a/include/linux/cgroup-defs.h
> > +++ b/include/linux/cgroup-defs.h
> > @@ -546,6 +546,9 @@ struct cgroup {
> >   	struct bpf_local_storage __rcu  *bpf_cgrp_storage;
> >   #endif
> > +	/* Used to bypass the CPU mitigations for tasks in a cgroup */
> > +	bool cpu_skip_mitigation;
> > +
> >   	/* All ancestors including self */
> >   	struct cgroup *ancestors[];
> >   };
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index c8e4b62b436a..b745dbcb153e 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -2045,6 +2045,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
> >   	cgrp->dom_cgrp = cgrp;
> >   	cgrp->max_descendants = INT_MAX;
> >   	cgrp->max_depth = INT_MAX;
> > +	cgrp->cpu_skip_mitigation = 0;
> >   	INIT_LIST_HEAD(&cgrp->rstat_css_list);
> >   	prev_cputime_init(&cgrp->prev_cputime);
> > @@ -3751,6 +3752,41 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
> >   	return ret;
> >   }
> > +static int cpu_skip_mitigation_show(struct seq_file *seq, void *v)
> > +{
> > +	struct cgroup *cgrp = seq_css(seq)->cgroup;
> > +	int ret = 0;
> > +
> > +	seq_printf(seq, "%d\n", cgrp->cpu_skip_mitigation);
> > +
> > +	return ret;
> > +}
> > +
> > +static ssize_t cgroup_skip_mitigation_write(struct kernfs_open_file *of,
> > +					    char *buf, size_t nbytes,
> > +					    loff_t off)
> > +{
> > +	struct cgroup *cgrp = of->kn->parent->priv;
> > +	struct cgroup_file_ctx *ctx = of->priv;
> > +	u64 skip_mitigation;
> > +	int ret;
> > +
> > +	/* Only privileged user in init namespace is allowed to set skip_mitigation */
> > +	if ((ctx->ns != &init_cgroup_ns) || !capable(CAP_SYS_ADMIN))
> > +		return -EPERM;
> > +
> > +	ret = kstrtoull(buf, 0, &skip_mitigation);
> > +	if (ret)
> > +		return -EINVAL;
> > +
> > +	if (skip_mitigation > 1)
> > +		return -EINVAL;
> > +
> > +	cgrp->cpu_skip_mitigation = skip_mitigation;
> > +
> > +	return nbytes;
> > +}
> > +
> >   static int cpu_local_stat_show(struct seq_file *seq, void *v)
> >   {
> >   	struct cgroup __maybe_unused *cgrp = seq_css(seq)->cgroup;
> > @@ -5290,6 +5326,12 @@ static struct cftype cgroup_base_files[] = {
> >   		.name = "cpu.stat.local",
> >   		.seq_show = cpu_local_stat_show,
> >   	},
> > +	{
> > +		.name = "cpu.skip_mitigation",
> > +		.flags = CFTYPE_NOT_ON_ROOT,
> > +		.seq_show = cpu_skip_mitigation_show,
> > +		.write = cgroup_skip_mitigation_write,
> > +	},
> >   	{ }	/* terminate */
> >   };
> Since this control knob is effective only for x86_64, should we enable this
> only for this architecture?

This should be under a CONFIG option that depends on the architecture
selected. I don't see a reason why it wouldn't be useful for other archs.

> However, cgroup never has architecture specific control knob like that
> before, it will be the first if we decide to do that.

Hmm, maybe then it could be universally available, but only effective on
archs that use it, and a NOP on others? Other than performance, this knob
should not have any userspace-visible side effects.

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f3951e4a55e5..4b4109afbf7c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4994,7 +4994,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
> >   	fire_sched_out_preempt_notifiers(prev, next);
> >   	kmap_local_sched_out();
> >   	prepare_task(next);
> > -	prepare_arch_switch(next);
> > +	prepare_arch_switch(prev, next);
> >   }
> >   /**
> > 
> The sparc and m68k arches have their own prepare_arch_switch() calls. So you
> will need to modify those places as well.

Will do. Thanks for the review.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC 2/2] cpu/bugs: cgroup: Add a cgroup knob to bypass CPU mitigations
  2024-09-20  7:54     ` Pawan Gupta
@ 2024-09-21  7:30       ` Waiman Long
  0 siblings, 0 replies; 10+ messages in thread
From: Waiman Long @ 2024-09-21  7:30 UTC (permalink / raw)
  To: Pawan Gupta
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Kaplan, Daniel Sneddon, x86, H. Peter Anvin,
	Peter Zijlstra, Josh Poimboeuf, Steven Rostedt, linux-kernel,
	cgroups


On 9/20/24 03:54, Pawan Gupta wrote:
>>>    static int cpu_local_stat_show(struct seq_file *seq, void *v)
>>>    {
>>>    	struct cgroup __maybe_unused *cgrp = seq_css(seq)->cgroup;
>>> @@ -5290,6 +5326,12 @@ static struct cftype cgroup_base_files[] = {
>>>    		.name = "cpu.stat.local",
>>>    		.seq_show = cpu_local_stat_show,
>>>    	},
>>> +	{
>>> +		.name = "cpu.skip_mitigation",
>>> +		.flags = CFTYPE_NOT_ON_ROOT,
>>> +		.seq_show = cpu_skip_mitigation_show,
>>> +		.write = cgroup_skip_mitigation_write,
>>> +	},
>>>    	{ }	/* terminate */
>>>    };
>> Since this control knob is effective only for x86_64, should we enable this
>> only for this architecture?
> This should be under a CONFIG option that depends on the architecture
> selected. I don't see a reason why it will not be useful for other archs.

Using a CONFIG option looks fine to me. I just want to make sure that 
arches that don't support it won't have a useless control knob show up.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC 0/2] Selective mitigation for trusted userspace
  2024-09-19 21:52 [PATCH RFC 0/2] Selective mitigation for trusted userspace Pawan Gupta
  2024-09-19 21:52 ` [PATCH RFC 1/2] x86/entry_64: Add a separate unmitigated entry/exit path Pawan Gupta
  2024-09-19 21:52 ` [PATCH RFC 2/2] cpu/bugs: cgroup: Add a cgroup knob to bypass CPU mitigations Pawan Gupta
@ 2024-09-27 15:52 ` Michal Koutný
  2024-09-27 22:01   ` Pawan Gupta
  2 siblings, 1 reply; 10+ messages in thread
From: Michal Koutný @ 2024-09-27 15:52 UTC (permalink / raw)
  To: Pawan Gupta
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Kaplan, Daniel Sneddon, x86, H. Peter Anvin,
	Peter Zijlstra, Josh Poimboeuf, Steven Rostedt, linux-kernel,
	cgroups


Hello.

On Thu, Sep 19, 2024 at 02:52:31PM GMT, Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> This is an experimental series exploring the feasibility of selectively
> applying CPU vulnerability mitigations on a per-process basis. The
> motivation behind this work is to address the performance degradation
> experienced by trusted user-space applications due to system-wide CPU
> mitigations.

This is an interesting idea (like an extension of core scheduling).

> The rationale for choosing the cgroup interface over other potential
> interfaces, such as LSMs, is cgroup's inherent support for core scheduling.

You don't list prctl (and process inheritance) interface here.

> Core scheduling allows the grouping of tasks such that they are scheduled
> to run on the same cores. 

And that is actually how core scheduling is implemented AFAICS --
cookie creation and passing via prctls.
Thus I don't find an implementation via a cgroup attribute ideal.

(I'd also say that cgroups are more organization/resource domains but
not so much security domains.)


> - Should child processes inherit the parent's unmitigated status?

Assuming turning off mitigations is a privileged operation, fork
could preserve it. It would be up to the parent to clear it properly
before handing over execution to a child (cf. e.g. dropping uid=0).

HTH,
Michal


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC 0/2] Selective mitigation for trusted userspace
  2024-09-27 15:52 ` [PATCH RFC 0/2] Selective mitigation for trusted userspace Michal Koutný
@ 2024-09-27 22:01   ` Pawan Gupta
  0 siblings, 0 replies; 10+ messages in thread
From: Pawan Gupta @ 2024-09-27 22:01 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Kaplan, Daniel Sneddon, x86, H. Peter Anvin,
	Peter Zijlstra, Josh Poimboeuf, Steven Rostedt, linux-kernel,
	cgroups


On Fri, Sep 27, 2024 at 05:52:35PM +0200, Michal Koutný wrote:
> Hello.
> 
> On Thu, Sep 19, 2024 at 02:52:31PM GMT, Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> > This is an experimental series exploring the feasibility of selectively
> > applying CPU vulnerability mitigations on a per-process basis. The
> > motivation behind this work is to address the performance degradation
> > experienced by trusted user-space applications due to system-wide CPU
> > mitigations.
> 
> This is an interesting idea (like an extension of core scheduling).
> 
> > The rationale for choosing the cgroup interface over other potential
> > interfaces, such as LSMs, is cgroup's inherent support for core scheduling.
> 
> You don't list prctl (and process inheritance) interface here.

Apologies for the oversight. prctl is indeed the interface that core
scheduling uses to group tasks. Cgroup was the interface initially
proposed, but it was later changed to prctl.

> > Core scheduling allows the grouping of tasks such that they are scheduled
> > to run on the same cores. 
> 
> And that is actually the way how core scheduling is implemented AFAICS
> -- cookie creation and passing via prctls.
> Thus I don't find the implementation via a cgroup attribute ideal.

The problem with prctl is that a process can't set its own skip-mitigation
status unless it is privileged. So a privileged process has to spawn the
workload that needs to run unmitigated. This is not ideal for an application
that wants to selectively apply mitigations.

> (I'd also say that cgroups are more organization/resource domains but
> not so much security domains.)

Agree.

> > - Should child processes inherit the parent's unmitigated status?
> 
> Assuming turning off mitigations is a privileged operation, fork
> could preserve it. It would be up to the parent to clear it properly
> before handing over execution to a child (cf. e.g. dropping uid=0).

IIUC, skip-mitigation should be a capability that can be inherited by child
processes, or set for an executable file. If we want to leverage core
scheduling, it will be challenging to have all trusted processes in the
system share the same cookie. But all child processes of a trusted process
would inherit the skip-mitigation status and the same cookie, so they can be
co-scheduled.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-09-27 22:01 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-19 21:52 [PATCH RFC 0/2] Selective mitigation for trusted userspace Pawan Gupta
2024-09-19 21:52 ` [PATCH RFC 1/2] x86/entry_64: Add a separate unmitigated entry/exit path Pawan Gupta
2024-09-20  6:57   ` Waiman Long
2024-09-20  7:24     ` Pawan Gupta
2024-09-19 21:52 ` [PATCH RFC 2/2] cpu/bugs: cgroup: Add a cgroup knob to bypass CPU mitigations Pawan Gupta
2024-09-20  7:05   ` Waiman Long
2024-09-20  7:54     ` Pawan Gupta
2024-09-21  7:30       ` Waiman Long
2024-09-27 15:52 ` [PATCH RFC 0/2] Selective mitigation for trusted userspace Michal Koutný
2024-09-27 22:01   ` Pawan Gupta
