* [PATCH v6 1/2] ftrace: disable preemption when recursion locked
From: 王贇 @ 2021-10-27 2:34 UTC (permalink / raw)
To: Guo Ren, Steven Rostedt, Ingo Molnar, James E.J. Bottomley,
Helge Deller, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Thomas Gleixner, Borislav Petkov, x86, H. Peter Anvin,
Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence, Masami Hiramatsu, Peter Zijlstra (Intel),
Nicholas Piggin, Jisheng Zhang, linux-csky, linux-kernel,
linux-parisc, linuxppc-dev, linux-riscv, live-patching
In-Reply-To: <df8e9b3e-504c-635d-4e92-99cdf9f05479@linux.alibaba.com>
As the documentation explained, ftrace_test_recursion_trylock()
and ftrace_test_recursion_unlock() were supposed to disable and
enable preemption properly, however currently this work is done
outside of the function, which could be missing by mistake.
And since the internal using of trace_test_and_set_recursion()
and trace_clear_recursion() also require preemption disabled, we
can just merge the logical.
This patch will make sure the preemption has been disabled when
trace_test_and_set_recursion() return bit >= 0, and
trace_clear_recursion() will enable the preemption if previously
enabled.
CC: Petr Mladek <pmladek@suse.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Miroslav Benes <mbenes@suse.cz>
Reported-by: Abaci <abaci@linux.alibaba.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
arch/csky/kernel/probes/ftrace.c | 2 --
arch/parisc/kernel/ftrace.c | 2 --
arch/powerpc/kernel/kprobes-ftrace.c | 2 --
arch/riscv/kernel/probes/ftrace.c | 2 --
arch/x86/kernel/kprobes/ftrace.c | 2 --
include/linux/trace_recursion.h | 13 ++++++++++++-
kernel/livepatch/patch.c | 13 +++++++------
kernel/trace/ftrace.c | 15 +++++----------
kernel/trace/trace_functions.c | 5 -----
9 files changed, 24 insertions(+), 32 deletions(-)
diff --git a/arch/csky/kernel/probes/ftrace.c b/arch/csky/kernel/probes/ftrace.c
index b388228..834cffc 100644
--- a/arch/csky/kernel/probes/ftrace.c
+++ b/arch/csky/kernel/probes/ftrace.c
@@ -17,7 +17,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
return;
regs = ftrace_get_regs(fregs);
- preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)ip);
if (!p) {
p = get_kprobe((kprobe_opcode_t *)(ip - MCOUNT_INSN_SIZE));
@@ -57,7 +56,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
__this_cpu_write(current_kprobe, NULL);
}
out:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
index 7d14242..90c4345 100644
--- a/arch/parisc/kernel/ftrace.c
+++ b/arch/parisc/kernel/ftrace.c
@@ -210,7 +210,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
return;
regs = ftrace_get_regs(fregs);
- preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)ip);
if (unlikely(!p) || kprobe_disabled(p))
goto out;
@@ -239,7 +238,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
}
__this_cpu_write(current_kprobe, NULL);
out:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
index 7154d58..072ebe7 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -26,7 +26,6 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
return;
regs = ftrace_get_regs(fregs);
- preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)nip);
if (unlikely(!p) || kprobe_disabled(p))
goto out;
@@ -61,7 +60,6 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
__this_cpu_write(current_kprobe, NULL);
}
out:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/riscv/kernel/probes/ftrace.c b/arch/riscv/kernel/probes/ftrace.c
index aab85a8..7142ec4 100644
--- a/arch/riscv/kernel/probes/ftrace.c
+++ b/arch/riscv/kernel/probes/ftrace.c
@@ -15,7 +15,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
if (bit < 0)
return;
- preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)ip);
if (unlikely(!p) || kprobe_disabled(p))
goto out;
@@ -52,7 +51,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
__this_cpu_write(current_kprobe, NULL);
}
out:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 596de2f..dd2ec14 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -25,7 +25,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
if (bit < 0)
return;
- preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)ip);
if (unlikely(!p) || kprobe_disabled(p))
goto out;
@@ -59,7 +58,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
__this_cpu_write(current_kprobe, NULL);
}
out:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index abe1a50..64c03ee 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,6 +135,9 @@ static __always_inline int trace_get_context_bit(void)
# define do_ftrace_record_recursion(ip, pip) do { } while (0)
#endif
+/*
+ * Preemption is promised to be disabled when return bit >= 0.
+ */
static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip,
int start)
{
@@ -162,11 +165,19 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign
current->trace_recursion = val;
barrier();
+ preempt_disable_notrace();
+
return bit;
}
+/*
+ * Preemption will be enabled (if it was previously enabled).
+ */
static __always_inline void trace_clear_recursion(int bit)
{
+ WARN_ON_ONCE(bit < 0);
+
+ preempt_enable_notrace();
barrier();
trace_recursion_clear(bit);
}
@@ -178,7 +189,7 @@ static __always_inline void trace_clear_recursion(int bit)
* tracing recursed in the same context (normal vs interrupt),
*
* Returns: -1 if a recursion happened.
- * >= 0 if no recursion
+ * >= 0 if no recursion.
*/
static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
unsigned long parent_ip)
diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index e8029ae..b8d75fb 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -49,14 +49,16 @@ static void notrace klp_ftrace_handler(unsigned long ip,
ops = container_of(fops, struct klp_ops, fops);
+ /*
+ *
+ * The ftrace_test_recursion_trylock() will disable preemption,
+ * which is required for the variant of synchronize_rcu() that is
+ * used to allow patching functions where RCU is not watching.
+ * See klp_synchronize_transition() for more details.
+ */
bit = ftrace_test_recursion_trylock(ip, parent_ip);
if (WARN_ON_ONCE(bit < 0))
return;
- /*
- * A variant of synchronize_rcu() is used to allow patching functions
- * where RCU is not watching, see klp_synchronize_transition().
- */
- preempt_disable_notrace();
func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
stack_node);
@@ -120,7 +122,6 @@ static void notrace klp_ftrace_handler(unsigned long ip,
klp_arch_set_pc(fregs, (unsigned long)func->new_func);
unlock:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b7be1df..7392bc7 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7198,16 +7198,15 @@ void ftrace_reset_array_ops(struct trace_array *tr)
struct ftrace_ops *op;
int bit;
+ /*
+ * The ftrace_test_and_set_recursion() will disable preemption,
+ * which is required since some of the ops may be dynamically
+ * allocated, they must be freed after a synchronize_rcu().
+ */
bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_LIST_START);
if (bit < 0)
return;
- /*
- * Some of the ops may be dynamically allocated,
- * they must be freed after a synchronize_rcu().
- */
- preempt_disable_notrace();
-
do_for_each_ftrace_op(op, ftrace_ops_list) {
/* Stub functions don't need to be called nor tested */
if (op->flags & FTRACE_OPS_FL_STUB)
@@ -7231,7 +7230,6 @@ void ftrace_reset_array_ops(struct trace_array *tr)
}
} while_for_each_ftrace_op(op);
out:
- preempt_enable_notrace();
trace_clear_recursion(bit);
}
@@ -7279,12 +7277,9 @@ static void ftrace_ops_assist_func(unsigned long ip, unsigned long parent_ip,
if (bit < 0)
return;
- preempt_disable_notrace();
-
if (!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching())
op->func(ip, parent_ip, op, fregs);
- preempt_enable_notrace();
trace_clear_recursion(bit);
}
NOKPROBE_SYMBOL(ftrace_ops_assist_func);
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 1f0e63f..9f1bfbe 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -186,7 +186,6 @@ static void function_trace_start(struct trace_array *tr)
return;
trace_ctx = tracing_gen_ctx();
- preempt_disable_notrace();
cpu = smp_processor_id();
data = per_cpu_ptr(tr->array_buffer.data, cpu);
@@ -194,7 +193,6 @@ static void function_trace_start(struct trace_array *tr)
trace_function(tr, ip, parent_ip, trace_ctx);
ftrace_test_recursion_unlock(bit);
- preempt_enable_notrace();
}
#ifdef CONFIG_UNWINDER_ORC
@@ -298,8 +296,6 @@ static inline void process_repeats(struct trace_array *tr,
if (bit < 0)
return;
- preempt_disable_notrace();
-
cpu = smp_processor_id();
data = per_cpu_ptr(tr->array_buffer.data, cpu);
if (atomic_read(&data->disabled))
@@ -324,7 +320,6 @@ static inline void process_repeats(struct trace_array *tr,
out:
ftrace_test_recursion_unlock(bit);
- preempt_enable_notrace();
}
static void
--
1.8.3.1
^ permalink raw reply related
* [PATCH v6 2/2] ftrace: do CPU checking after preemption disabled
From: 王贇 @ 2021-10-27 2:34 UTC (permalink / raw)
To: Guo Ren, Steven Rostedt, Ingo Molnar, James E.J. Bottomley,
Helge Deller, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Thomas Gleixner, Borislav Petkov, x86, H. Peter Anvin,
Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence, Masami Hiramatsu, Peter Zijlstra (Intel),
Nicholas Piggin, Jisheng Zhang, linux-csky, linux-kernel,
linux-parisc, linuxppc-dev, linux-riscv, live-patching
In-Reply-To: <df8e9b3e-504c-635d-4e92-99cdf9f05479@linux.alibaba.com>
With CONFIG_DEBUG_PREEMPT we observed reports like:
BUG: using smp_processor_id() in preemptible
caller is perf_ftrace_function_call+0x6f/0x2e0
CPU: 1 PID: 680 Comm: a.out Not tainted
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xcf
check_preemption_disabled+0x104/0x110
? optimize_nops.isra.7+0x230/0x230
? text_poke_bp_batch+0x9f/0x310
perf_ftrace_function_call+0x6f/0x2e0
...
__text_poke+0x5/0x620
text_poke_bp_batch+0x9f/0x310
This telling us the CPU could be changed after task is preempted, and
the checking on CPU before preemption will be invalid.
Since now ftrace_test_recursion_trylock() will help to disable the
preemption, this patch just do the checking after trylock() to address
the issue.
CC: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
kernel/trace/trace_event_perf.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 6aed10e..fba8cb7 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -441,13 +441,13 @@ void perf_trace_buf_update(void *record, u16 type)
if (!rcu_is_watching())
return;
- if ((unsigned long)ops->private != smp_processor_id())
- return;
-
bit = ftrace_test_recursion_trylock(ip, parent_ip);
if (bit < 0)
return;
+ if ((unsigned long)ops->private != smp_processor_id())
+ goto out;
+
event = container_of(ops, struct perf_event, ftrace_ops);
/*
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH v6 1/2] ftrace: disable preemption when recursion locked
From: Steven Rostedt @ 2021-10-27 2:55 UTC (permalink / raw)
To: 王贇
Cc: Peter Zijlstra (Intel), Paul Walmsley, James E.J. Bottomley,
Guo Ren, Jisheng Zhang, H. Peter Anvin, live-patching,
linux-riscv, Miroslav Benes, Joe Lawrence, Helge Deller, x86,
linux-csky, Ingo Molnar, Petr Mladek, Albert Ou, Jiri Kosina,
Nicholas Piggin, Borislav Petkov, Josh Poimboeuf, Thomas Gleixner,
linux-parisc, linux-kernel, Palmer Dabbelt, Masami Hiramatsu,
Paul Mackerras, linuxppc-dev
In-Reply-To: <78c95844-16b7-8904-b48d-3b2ccd76a352@linux.alibaba.com>
On Wed, 27 Oct 2021 10:34:13 +0800
王贇 <yun.wang@linux.alibaba.com> wrote:
> +/*
> + * Preemption will be enabled (if it was previously enabled).
> + */
> static __always_inline void trace_clear_recursion(int bit)
> {
> + WARN_ON_ONCE(bit < 0);
Can you send a v7 without the WARN_ON.
This is an extremely hot path, and this will cause noticeable overhead.
If something were to call this with bit < 0, then it would crash and
burn rather quickly.
-- Steve
> +
> + preempt_enable_notrace();
> barrier();
> trace_recursion_clear(bit);
> }
^ permalink raw reply
* Re: [PATCH v6 1/2] ftrace: disable preemption when recursion locked
From: 王贇 @ 2021-10-27 3:09 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra (Intel), Paul Walmsley, James E.J. Bottomley,
Guo Ren, Jisheng Zhang, H. Peter Anvin, live-patching,
linux-riscv, Miroslav Benes, Joe Lawrence, Helge Deller, x86,
linux-csky, Ingo Molnar, Petr Mladek, Albert Ou, Jiri Kosina,
Nicholas Piggin, Borislav Petkov, Josh Poimboeuf, Thomas Gleixner,
linux-parisc, linux-kernel, Palmer Dabbelt, Masami Hiramatsu,
Paul Mackerras, linuxppc-dev
In-Reply-To: <20211026225552.72a7ee79@rorschach.local.home>
On 2021/10/27 上午10:55, Steven Rostedt wrote:
> On Wed, 27 Oct 2021 10:34:13 +0800
> 王贇 <yun.wang@linux.alibaba.com> wrote:
>
>> +/*
>> + * Preemption will be enabled (if it was previously enabled).
>> + */
>> static __always_inline void trace_clear_recursion(int bit)
>> {
>> + WARN_ON_ONCE(bit < 0);
>
> Can you send a v7 without the WARN_ON.
>
> This is an extremely hot path, and this will cause noticeable overhead.
>
> If something were to call this with bit < 0, then it would crash and
> burn rather quickly.
I see, if the problem will be notified anyway then it's fine, v7 on the way.
Regards,
Michael Wang
>
> -- Steve
>
>
>> +
>> + preempt_enable_notrace();
>> barrier();
>> trace_recursion_clear(bit);
>> }
^ permalink raw reply
* [PATCH v7 0/2] fix & prevent the missing preemption disabling
From: 王贇 @ 2021-10-27 3:13 UTC (permalink / raw)
To: Guo Ren, Steven Rostedt, Ingo Molnar, James E.J. Bottomley,
Helge Deller, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Thomas Gleixner, Borislav Petkov, x86, H. Peter Anvin,
Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence, Masami Hiramatsu, Peter Zijlstra (Intel),
Nicholas Piggin, Jisheng Zhang, linux-csky, linux-kernel,
linux-parisc, linuxppc-dev, linux-riscv, live-patching
The testing show that perf_ftrace_function_call() are using smp_processor_id()
with preemption enabled, all the checking on CPU could be wrong after preemption.
As Peter point out, the section between ftrace_test_recursion_trylock/unlock()
pair require the preemption to be disabled as 'Documentation/trace/ftrace-uses.rst'
explained, but currently the work is done outside of the helpers.
And since the internal using of trace_test_and_set_recursion()
and trace_clear_recursion() also require preemption to be disabled, we
can just merge the logical together.
Patch 1/2 will make sure preemption disabled when recursion lock succeed,
patch 2/2 will do smp_processor_id() checking after trylock() to address the
issue.
v1: https://lore.kernel.org/all/8c7de46d-9869-aa5e-2bb9-5dbc2eda395e@linux.alibaba.com/
v2: https://lore.kernel.org/all/b1d7fe43-ce84-0ed7-32f7-ea1d12d0b716@linux.alibaba.com/
v3: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
V4: https://lore.kernel.org/all/32a36348-69ee-6464-390c-3a8d6e9d2b53@linux.alibaba.com/
V5: https://lore.kernel.org/all/3ca92dc9-ea04-ddc2-71cd-524bfa5a5721@linux.alibaba.com/
V6: https://lore.kernel.org/all/df8e9b3e-504c-635d-4e92-99cdf9f05479@linux.alibaba.com/
Michael Wang (2):
ftrace: disable preemption when recursion locked
ftrace: do CPU checking after preemption disabled
arch/csky/kernel/probes/ftrace.c | 2 --
arch/parisc/kernel/ftrace.c | 2 --
arch/powerpc/kernel/kprobes-ftrace.c | 2 --
arch/riscv/kernel/probes/ftrace.c | 2 --
arch/x86/kernel/kprobes/ftrace.c | 2 --
include/linux/trace_recursion.h | 11 ++++++++++-
kernel/livepatch/patch.c | 13 +++++++------
kernel/trace/ftrace.c | 15 +++++----------
kernel/trace/trace_event_perf.c | 6 +++---
kernel/trace/trace_functions.c | 5 -----
10 files changed, 25 insertions(+), 35 deletions(-)
--
1.8.3.1
^ permalink raw reply
* [PATCH v7 1/2] ftrace: disable preemption when recursion locked
From: 王贇 @ 2021-10-27 3:14 UTC (permalink / raw)
To: Guo Ren, Steven Rostedt, Ingo Molnar, James E.J. Bottomley,
Helge Deller, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Thomas Gleixner, Borislav Petkov, x86, H. Peter Anvin,
Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence, Masami Hiramatsu, Peter Zijlstra (Intel),
Nicholas Piggin, Jisheng Zhang, linux-csky, linux-kernel,
linux-parisc, linuxppc-dev, linux-riscv, live-patching
In-Reply-To: <ecd56129-faed-3b30-5552-7fa1e09bc408@linux.alibaba.com>
As the documentation explained, ftrace_test_recursion_trylock()
and ftrace_test_recursion_unlock() were supposed to disable and
enable preemption properly, however currently this work is done
outside of the function, which could be missing by mistake.
And since the internal using of trace_test_and_set_recursion()
and trace_clear_recursion() also require preemption disabled, we
can just merge the logical.
This patch will make sure the preemption has been disabled when
trace_test_and_set_recursion() return bit >= 0, and
trace_clear_recursion() will enable the preemption if previously
enabled.
CC: Petr Mladek <pmladek@suse.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Miroslav Benes <mbenes@suse.cz>
Reported-by: Abaci <abaci@linux.alibaba.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
arch/csky/kernel/probes/ftrace.c | 2 --
arch/parisc/kernel/ftrace.c | 2 --
arch/powerpc/kernel/kprobes-ftrace.c | 2 --
arch/riscv/kernel/probes/ftrace.c | 2 --
arch/x86/kernel/kprobes/ftrace.c | 2 --
include/linux/trace_recursion.h | 11 ++++++++++-
kernel/livepatch/patch.c | 13 +++++++------
kernel/trace/ftrace.c | 15 +++++----------
kernel/trace/trace_functions.c | 5 -----
9 files changed, 22 insertions(+), 32 deletions(-)
diff --git a/arch/csky/kernel/probes/ftrace.c b/arch/csky/kernel/probes/ftrace.c
index b388228..834cffc 100644
--- a/arch/csky/kernel/probes/ftrace.c
+++ b/arch/csky/kernel/probes/ftrace.c
@@ -17,7 +17,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
return;
regs = ftrace_get_regs(fregs);
- preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)ip);
if (!p) {
p = get_kprobe((kprobe_opcode_t *)(ip - MCOUNT_INSN_SIZE));
@@ -57,7 +56,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
__this_cpu_write(current_kprobe, NULL);
}
out:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
index 7d14242..90c4345 100644
--- a/arch/parisc/kernel/ftrace.c
+++ b/arch/parisc/kernel/ftrace.c
@@ -210,7 +210,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
return;
regs = ftrace_get_regs(fregs);
- preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)ip);
if (unlikely(!p) || kprobe_disabled(p))
goto out;
@@ -239,7 +238,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
}
__this_cpu_write(current_kprobe, NULL);
out:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
index 7154d58..072ebe7 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -26,7 +26,6 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
return;
regs = ftrace_get_regs(fregs);
- preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)nip);
if (unlikely(!p) || kprobe_disabled(p))
goto out;
@@ -61,7 +60,6 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
__this_cpu_write(current_kprobe, NULL);
}
out:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/riscv/kernel/probes/ftrace.c b/arch/riscv/kernel/probes/ftrace.c
index aab85a8..7142ec4 100644
--- a/arch/riscv/kernel/probes/ftrace.c
+++ b/arch/riscv/kernel/probes/ftrace.c
@@ -15,7 +15,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
if (bit < 0)
return;
- preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)ip);
if (unlikely(!p) || kprobe_disabled(p))
goto out;
@@ -52,7 +51,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
__this_cpu_write(current_kprobe, NULL);
}
out:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 596de2f..dd2ec14 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -25,7 +25,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
if (bit < 0)
return;
- preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)ip);
if (unlikely(!p) || kprobe_disabled(p))
goto out;
@@ -59,7 +58,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
__this_cpu_write(current_kprobe, NULL);
}
out:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index abe1a50..c303f7a 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,6 +135,9 @@ static __always_inline int trace_get_context_bit(void)
# define do_ftrace_record_recursion(ip, pip) do { } while (0)
#endif
+/*
+ * Preemption is promised to be disabled when return bit >= 0.
+ */
static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip,
int start)
{
@@ -162,11 +165,17 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign
current->trace_recursion = val;
barrier();
+ preempt_disable_notrace();
+
return bit;
}
+/*
+ * Preemption will be enabled (if it was previously enabled).
+ */
static __always_inline void trace_clear_recursion(int bit)
{
+ preempt_enable_notrace();
barrier();
trace_recursion_clear(bit);
}
@@ -178,7 +187,7 @@ static __always_inline void trace_clear_recursion(int bit)
* tracing recursed in the same context (normal vs interrupt),
*
* Returns: -1 if a recursion happened.
- * >= 0 if no recursion
+ * >= 0 if no recursion.
*/
static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
unsigned long parent_ip)
diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index e8029ae..b8d75fb 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -49,14 +49,16 @@ static void notrace klp_ftrace_handler(unsigned long ip,
ops = container_of(fops, struct klp_ops, fops);
+ /*
+ *
+ * The ftrace_test_recursion_trylock() will disable preemption,
+ * which is required for the variant of synchronize_rcu() that is
+ * used to allow patching functions where RCU is not watching.
+ * See klp_synchronize_transition() for more details.
+ */
bit = ftrace_test_recursion_trylock(ip, parent_ip);
if (WARN_ON_ONCE(bit < 0))
return;
- /*
- * A variant of synchronize_rcu() is used to allow patching functions
- * where RCU is not watching, see klp_synchronize_transition().
- */
- preempt_disable_notrace();
func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
stack_node);
@@ -120,7 +122,6 @@ static void notrace klp_ftrace_handler(unsigned long ip,
klp_arch_set_pc(fregs, (unsigned long)func->new_func);
unlock:
- preempt_enable_notrace();
ftrace_test_recursion_unlock(bit);
}
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b7be1df..7392bc7 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7198,16 +7198,15 @@ void ftrace_reset_array_ops(struct trace_array *tr)
struct ftrace_ops *op;
int bit;
+ /*
+ * The ftrace_test_and_set_recursion() will disable preemption,
+ * which is required since some of the ops may be dynamically
+ * allocated, they must be freed after a synchronize_rcu().
+ */
bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_LIST_START);
if (bit < 0)
return;
- /*
- * Some of the ops may be dynamically allocated,
- * they must be freed after a synchronize_rcu().
- */
- preempt_disable_notrace();
-
do_for_each_ftrace_op(op, ftrace_ops_list) {
/* Stub functions don't need to be called nor tested */
if (op->flags & FTRACE_OPS_FL_STUB)
@@ -7231,7 +7230,6 @@ void ftrace_reset_array_ops(struct trace_array *tr)
}
} while_for_each_ftrace_op(op);
out:
- preempt_enable_notrace();
trace_clear_recursion(bit);
}
@@ -7279,12 +7277,9 @@ static void ftrace_ops_assist_func(unsigned long ip, unsigned long parent_ip,
if (bit < 0)
return;
- preempt_disable_notrace();
-
if (!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching())
op->func(ip, parent_ip, op, fregs);
- preempt_enable_notrace();
trace_clear_recursion(bit);
}
NOKPROBE_SYMBOL(ftrace_ops_assist_func);
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 1f0e63f..9f1bfbe 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -186,7 +186,6 @@ static void function_trace_start(struct trace_array *tr)
return;
trace_ctx = tracing_gen_ctx();
- preempt_disable_notrace();
cpu = smp_processor_id();
data = per_cpu_ptr(tr->array_buffer.data, cpu);
@@ -194,7 +193,6 @@ static void function_trace_start(struct trace_array *tr)
trace_function(tr, ip, parent_ip, trace_ctx);
ftrace_test_recursion_unlock(bit);
- preempt_enable_notrace();
}
#ifdef CONFIG_UNWINDER_ORC
@@ -298,8 +296,6 @@ static inline void process_repeats(struct trace_array *tr,
if (bit < 0)
return;
- preempt_disable_notrace();
-
cpu = smp_processor_id();
data = per_cpu_ptr(tr->array_buffer.data, cpu);
if (atomic_read(&data->disabled))
@@ -324,7 +320,6 @@ static inline void process_repeats(struct trace_array *tr,
out:
ftrace_test_recursion_unlock(bit);
- preempt_enable_notrace();
}
static void
--
1.8.3.1
^ permalink raw reply related
* [PATCH v7 2/2] ftrace: do CPU checking after preemption disabled
From: 王贇 @ 2021-10-27 3:15 UTC (permalink / raw)
To: Guo Ren, Steven Rostedt, Ingo Molnar, James E.J. Bottomley,
Helge Deller, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Thomas Gleixner, Borislav Petkov, x86, H. Peter Anvin,
Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence, Masami Hiramatsu, Peter Zijlstra (Intel),
Nicholas Piggin, Jisheng Zhang, linux-csky, linux-kernel,
linux-parisc, linuxppc-dev, linux-riscv, live-patching
In-Reply-To: <ecd56129-faed-3b30-5552-7fa1e09bc408@linux.alibaba.com>
With CONFIG_DEBUG_PREEMPT we observed reports like:
BUG: using smp_processor_id() in preemptible
caller is perf_ftrace_function_call+0x6f/0x2e0
CPU: 1 PID: 680 Comm: a.out Not tainted
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xcf
check_preemption_disabled+0x104/0x110
? optimize_nops.isra.7+0x230/0x230
? text_poke_bp_batch+0x9f/0x310
perf_ftrace_function_call+0x6f/0x2e0
...
__text_poke+0x5/0x620
text_poke_bp_batch+0x9f/0x310
This telling us the CPU could be changed after task is preempted, and
the checking on CPU before preemption will be invalid.
Since now ftrace_test_recursion_trylock() will help to disable the
preemption, this patch just do the checking after trylock() to address
the issue.
CC: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
kernel/trace/trace_event_perf.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 6aed10e..fba8cb7 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -441,13 +441,13 @@ void perf_trace_buf_update(void *record, u16 type)
if (!rcu_is_watching())
return;
- if ((unsigned long)ops->private != smp_processor_id())
- return;
-
bit = ftrace_test_recursion_trylock(ip, parent_ip);
if (bit < 0)
return;
+ if ((unsigned long)ops->private != smp_processor_id())
+ goto out;
+
event = container_of(ops, struct perf_event, ftrace_ops);
/*
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH 1/2] powerpc/watchdog: prevent printk and send IPI while holding the wd lock
From: Nicholas Piggin @ 2021-10-27 3:29 UTC (permalink / raw)
To: benh, Laurent Dufour, mpe, paulus; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20211026162740.16283-2-ldufour@linux.ibm.com>
Excerpts from Laurent Dufour's message of October 27, 2021 2:27 am:
> When handling the Watchdog interrupt, long processing should not be done
> while holding the __wd_smp_lock. This prevents the other CPUs to grab it
> and to process Watchdog timer interrupts. Furhtermore, this could lead to
> the following situation:
>
> CPU x detect lockup on CPU y and grab the __wd_smp_lock
> in watchdog_smp_panic()
> CPU y caught the watchdog interrupt and try to grab the __wd_smp_lock
> in soft_nmi_interrupt()
> CPU x wait for CPU y to catch the IPI for 1s in __smp_send_nmi_ipi()
CPU y should get the IPI here if it's a NMI IPI (which will be true for
>= POWER9 64s).
That said, not all platforms support it and the console lock problem
seems real, so okay.
> CPU x will timeout and so has spent 1s waiting while holding the
> __wd_smp_lock.
>
> A deadlock may also happen between the __wd_smp_lock and the console_owner
> 'lock' this way:
> CPU x grab the console_owner
> CPU y grab the __wd_smp_lock
> CPU x catch the watchdog timer interrupt and needs to grab __wd_smp_lock
> CPU y wants to print something and wait for console_owner
> -> deadlock
>
> Doing all the long processing without holding the _wd_smp_lock prevents
> these situations.
The intention was to avoid logs getting garbled e.g., if multiple
different CPUs fire at once.
I wonder if instead we could deal with that by protecting the IPI
sending and printing stuff with a trylock, and if you don't get the
trylock then just return, and you'll come back with the next timer
interrupt.
Thanks,
Nick
>
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> ---
> arch/powerpc/kernel/watchdog.c | 31 +++++++++++++++++--------------
> 1 file changed, 17 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
> index f9ea0e5357f9..bc7411327066 100644
> --- a/arch/powerpc/kernel/watchdog.c
> +++ b/arch/powerpc/kernel/watchdog.c
> @@ -149,6 +149,8 @@ static void set_cpu_stuck(int cpu, u64 tb)
>
> static void watchdog_smp_panic(int cpu, u64 tb)
> {
> + cpumask_t cpus_pending_copy;
> + u64 last_reset_tb_copy;
> unsigned long flags;
> int c;
>
> @@ -161,29 +163,32 @@ static void watchdog_smp_panic(int cpu, u64 tb)
> if (cpumask_weight(&wd_smp_cpus_pending) == 0)
> goto out;
>
> + cpumask_copy(&cpus_pending_copy, &wd_smp_cpus_pending);
> + last_reset_tb_copy = wd_smp_last_reset_tb;
> +
> + /* Take the stuck CPUs out of the watch group */
> + set_cpumask_stuck(&wd_smp_cpus_pending, tb);
> +
> + wd_smp_unlock(&flags);
> +
> pr_emerg("CPU %d detected hard LOCKUP on other CPUs %*pbl\n",
> - cpu, cpumask_pr_args(&wd_smp_cpus_pending));
> + cpu, cpumask_pr_args(&cpus_pending_copy));
> pr_emerg("CPU %d TB:%lld, last SMP heartbeat TB:%lld (%lldms ago)\n",
> - cpu, tb, wd_smp_last_reset_tb,
> - tb_to_ns(tb - wd_smp_last_reset_tb) / 1000000);
> + cpu, tb, last_reset_tb_copy,
> + tb_to_ns(tb - last_reset_tb_copy) / 1000000);
>
> if (!sysctl_hardlockup_all_cpu_backtrace) {
> /*
> * Try to trigger the stuck CPUs, unless we are going to
> * get a backtrace on all of them anyway.
> */
> - for_each_cpu(c, &wd_smp_cpus_pending) {
> + for_each_cpu(c, &cpus_pending_copy) {
> if (c == cpu)
> continue;
> smp_send_nmi_ipi(c, wd_lockup_ipi, 1000000);
> }
> }
>
> - /* Take the stuck CPUs out of the watch group */
> - set_cpumask_stuck(&wd_smp_cpus_pending, tb);
> -
> - wd_smp_unlock(&flags);
> -
> if (sysctl_hardlockup_all_cpu_backtrace)
> trigger_allbutself_cpu_backtrace();
>
> @@ -204,6 +209,8 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
> unsigned long flags;
>
> wd_smp_lock(&flags);
> + cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
> + wd_smp_unlock(&flags);
>
> pr_emerg("CPU %d became unstuck TB:%lld\n",
> cpu, tb);
> @@ -212,9 +219,6 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
> show_regs(regs);
> else
> dump_stack();
> -
> - cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
> - wd_smp_unlock(&flags);
> }
> return;
> }
> @@ -267,6 +271,7 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
> return 0;
> }
> set_cpu_stuck(cpu, tb);
> + wd_smp_unlock(&flags);
>
> pr_emerg("CPU %d self-detected hard LOCKUP @ %pS\n",
> cpu, (void *)regs->nip);
> @@ -277,8 +282,6 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
> print_irqtrace_events(current);
> show_regs(regs);
>
> - wd_smp_unlock(&flags);
> -
> if (sysctl_hardlockup_all_cpu_backtrace)
> trigger_allbutself_cpu_backtrace();
>
> --
> 2.33.1
>
>
^ permalink raw reply
* Re: [PATCH 2/2] powerpc/watchdog: ensure watchdog data accesses are protected
From: Nicholas Piggin @ 2021-10-27 3:48 UTC (permalink / raw)
To: benh, Laurent Dufour, mpe, paulus; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20211026162740.16283-3-ldufour@linux.ibm.com>
Excerpts from Laurent Dufour's message of October 27, 2021 2:27 am:
> The wd_smp_cpus_pending CPU mask should be accessed under the protection of
> the __wd_smp_lock.
>
> This prevents false alarm to be raised when the system is under an heavy
> stress. This has been seen while doing LPM on large system with a big
> workload.
>
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> ---
> arch/powerpc/kernel/watchdog.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
> index bc7411327066..8d7a1a86187e 100644
> --- a/arch/powerpc/kernel/watchdog.c
> +++ b/arch/powerpc/kernel/watchdog.c
> @@ -203,12 +203,13 @@ static void watchdog_smp_panic(int cpu, u64 tb)
>
> static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
> {
> + unsigned long flags;
> +
> + wd_smp_lock(&flags);
> if (!cpumask_test_cpu(cpu, &wd_smp_cpus_pending)) {
> if (unlikely(cpumask_test_cpu(cpu, &wd_smp_cpus_stuck))) {
> struct pt_regs *regs = get_irq_regs();
> - unsigned long flags;
>
> - wd_smp_lock(&flags);
> cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
> wd_smp_unlock(&flags);
>
> @@ -219,22 +220,23 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
> show_regs(regs);
> else
> dump_stack();
> + return;
> }
> +
> + wd_smp_unlock(&flags);
> return;
> }
> +
> cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
> if (cpumask_empty(&wd_smp_cpus_pending)) {
> - unsigned long flags;
> -
> - wd_smp_lock(&flags);
> if (cpumask_empty(&wd_smp_cpus_pending)) {
> wd_smp_last_reset_tb = tb;
> cpumask_andnot(&wd_smp_cpus_pending,
> &wd_cpus_enabled,
> &wd_smp_cpus_stuck);
> }
> - wd_smp_unlock(&flags);
> }
> + wd_smp_unlock(&flags);
Hmm. I wanted to avoid the lock here because all CPUs will do it on
every heartbeat timer.
Although maybe when you look at the cacheline transfers required it
doesn't matter much (and the timer is only once every few seconds).
I guess it's always better to aovid lock free craziness unless it's
required... so where is the race coming from? I guess 2 CPUs can
clear wd_smp_cpus_pending but not see one another's update so they
both miss cpumask_empty == true! Good catch.
We shouldn't strictly need the wd_smp_lock for the first test, but
that should be an uncommon case, so okay.
Can we clear wd_smp_cpus_pending with a non-atomic operation now?
Thanks,
Nick
^ permalink raw reply
* Re: instruction storage exception handling
From: Nicholas Piggin @ 2021-10-27 4:10 UTC (permalink / raw)
To: Jacques de Laval, To: Scott Wood; +Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <uqZVxyE3l9oalZp_hyXFJvdH-ADNTvpOuQeoNGyqrUcoNgh9afea1-FzfZKMgiaF5WxY4kdMQlJYzmjvdQ2E2joF86-mEcaxdifht9z8NA0=@protonmail.com>
Excerpts from Jacques de Laval's message of October 26, 2021 6:07 am:
> Hi,
>
> We are trying to upgrade kernel from 5.10 to 5.14.11. We have a Freescale/NXP
> T1023 SOC with two e5500 cores, and are running in 32-bit mode:
>
> CONFIG_PPC32=y
> # CONFIG_PPC64 is not set
>
> #
> # Processor support
> #
> # CONFIG_PPC_BOOK3S_32 is not set
> CONFIG_PPC_85xx=y
> # CONFIG_PPC_8xx is not set
> # CONFIG_40x is not set
> # CONFIG_44x is not set
> CONFIG_GENERIC_CPU=y
> # CONFIG_E5500_CPU is not set
> # CONFIG_E6500_CPU is not set
> CONFIG_E500=y
> CONFIG_PPC_E500MC=y
> CONFIG_PPC_FPU_REGS=y
> CONFIG_PPC_FPU=y
> CONFIG_FSL_EMB_PERFMON=y
> CONFIG_FSL_EMB_PERF_EVENT=y
> CONFIG_FSL_EMB_PERF_EVENT_E500=y
> CONFIG_BOOKE=y
> CONFIG_FSL_BOOKE=y
> CONFIG_PPC_FSL_BOOK3E=y
> CONFIG_PTE_64BIT=y
> CONFIG_PHYS_64BIT=y
> CONFIG_PPC_MMU_NOHASH=y
> CONFIG_PPC_BOOK3E_MMU=y
> # CONFIG_PMU_SYSFS is not set
> CONFIG_SMP=y
> CONFIG_NR_CPUS=2
> CONFIG_PPC_DOORBELL=y
> # end of Processor support
>
> We compile using 32-bit Bootlin PPC toolchain:
>
> powerpc-e500mc glibc bleeding-edge 2020.08-1.
>
> When booting, and starting PID 1 we sometimes get a hang. Nothing but our init
> process is running, and for debugging purposes our init currently looks like
> this:
>
> int main(int argc, char *argv[]){
> for (int i = 0; ; i++) {
> FILE *fp = fopen("/dev/kmsg", "w");
> if (fp) {
> fprintf(fp, "%d\n", i);
> fclose(fp);
> }
> sleep(1);
> }
> return 0;
> }
>
> When the hangup occur we don't get any output at all from our PID 1.
> The last output is from the kernel:
>
> Run /sbin/init as init process
> with arguments:
> /sbin/init
> with environment:
> HOME=/
> TERM=linux
> kgdboc=ttyS0,115200
>
> When issuing a backtrace on all active cpus we can see that the kernel is
> handling an instruction storage exception:
>
> sysrq: Show backtrace of all active CPUs
> sysrq: CPU0:
> CPU: 0 PID: 1 Comm: init Not tainted 5.14.11 #1
> NIP: c02aac78 LR: c02aac2c CTR: 00000000
> REGS: c1907d40 TRAP: 0500 Not tainted (5.14.11)
> MSR: 00029002 <CE,EE,ME> CR: 82244284 XER: 20000000
> GPR00: 0000000f c1907e20 c1910000 0065a000 000001d0 01100cca c1907e84 0000000c
> GPR08: d39a8000 000001d3 0000000c c1907f10 42244284 00000000 00740514 bfb71670
> GPR16: 007040e6 00701418 b7c1a5f0 00702f18 00000000 bfb71690 0000fff1 b7c1c478
> GPR24: 00708558 00701698 d3994040 00029002 c1907f20 0065a238 00000355 d39a2790
> NIP [c02aac78] handle_mm_fault+0xf8/0x11f0
> LR [c02aac2c] handle_mm_fault+0xac/0x11f0
> Call Trace:
> [c1907e20] [c02aac10] handle_mm_fault+0x90/0x11f0 (unreliable)
> [c1907ec0] [c003078c] ___do_page_fault+0x26c/0x780
> [c1907ef0] [c0030cd4] do_page_fault+0x34/0x100
> [c1907f10] [c0000988] InstructionStorage+0x108/0x120
> --- interrupt: 400 at 0x65a238
> NIP: 0065a238 LR: 0052f26c CTR: 0052f260
> REGS: c1907f20 TRAP: 0400 Not tainted (5.14.11)
> MSR: 0002d002 <CE,EE,PR,ME> CR: 42242284 XER: 00000000
> GPR00: b7be9914 bfb71620 b7c203a0 8c008000 0070400d b7c182a0 000b8260 0052f260
> GPR08: 0047d448 0052f260 0000000a 00000003 42242284 00000000 00740514 bfb71670
> GPR16: 007040e6 00701418 b7c1a5f0 00702f18 00000000 bfb71690 0000fff1 b7c1c478
> GPR24: 00708558 00701698 00700000 00000015 b7c1c2b0 00707e20 b7c1b8a8 bfb71660
> NIP [0065a238] 0x65a238
> LR [0052f26c] 0x52f26c
> --- interrupt: 400
> Instruction dump:
> 60a500c0 811f0020 57aa6cfa 813f0000 57a30026 809f004c 81080024 7d29e850
> 90a1002c 5529a33e 93c10038 7d244a14 <90610034> 7d485215 91210030 41c203dc
>
> We have also observed that the CPU is continuously servicing the same interrupt
> (north of 140k times per sec), it is not deadlocked.
>
> We have not yet been able to reproduce this behavior under QEMU system
> emulation.
>
> When bisecting between 5.10 and 5.14.11 we can see that this behavior started
> with commit a01a3f2ddbcda83e8572787c0ec1dcbeba86915a:
>
> powerpc: remove arguments from fault handler functions
Thank you for the excellent work to investigate and report this.
>
> Our best guess that the instruction storage exception is not properly handled
> and the kernel is never able to recover from the page fault, but we don't
> really know how to proceed. Does anyone have any suggestions or insights?
Before my patch, zero was passed to the page fault error code, after
my patch it passes the contents of ESR SPR.
It looks like the BookE instruction access interrupt does not set ESR
(except for BO interrupts maybe?) so you're getting what was in the ESR
register from a previous interrupt, and maybe if that was a store then
access_error won't cause a segfault because is_exec is true so that
test bails out early, then it might just keep retrying the interrupt.
That could explain why you don't always see the same thing.
Now previous code still saved ESR in regs->esr/dsisr for some reason.
I can't quite see why that should have been necessary though. Does
this patch solve it for you?
Thanks,
Nick
--
diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index e5503420b6c6..0e7cdc8716eb 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -467,10 +467,11 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
#define INSTRUCTION_STORAGE_EXCEPTION \
START_EXCEPTION(InstructionStorage) \
- NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
- mfspr r5,SPRN_ESR; /* Grab the ESR and save it */ \
+ NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
+ li r5,0; \
+ mfspr r5,SPRN_ESR; /* Store 0 in regs->esr (dsisr) */ \
stw r5,_ESR(r11); \
- stw r12, _DEAR(r11); /* Pass SRR0 as arg2 */ \
+ stw r12, _DEAR(r11); /* Set regs->dear (dar) */ \
prepare_transfer_to_handler; \
bl do_page_fault; \
b interrupt_return
^ permalink raw reply related
* Re: [PATCH 1/3] powerpc/nohash: Fix __ptep_set_access_flags() and ptep_set_wrprotect()
From: Nicholas Piggin @ 2021-10-27 4:23 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Christophe Leroy, Michael Ellerman,
Paul Mackerras
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <922bdab3a220781bae2360ff3dd5adb7fe4d34f1.1635226743.git.christophe.leroy@csgroup.eu>
Excerpts from Christophe Leroy's message of October 26, 2021 3:39 pm:
> Commit 26973fa5ac0e ("powerpc/mm: use pte helpers in generic code")
> changed those two functions to use pte helpers to determine which
> bits to clear and which bits to set.
>
> This change was based on the assumption that bits to be set/cleared
> are always the same and can be determined by applying the pte
> manipulation helpers on __pte(0).
>
> But on platforms like book3e, the bits depend on whether the page
> is a user page or not.
>
> For the time being it more or less works because of _PAGE_EXEC being
> used for user pages only and exec right being set at all time on
> kernel page. But following patch will clean that and output of
> pte_mkexec() will depend on the page being a user or kernel page.
>
> Instead of trying to make an even more complicated helper where bits
> would become dependent on the final pte value, come back to a more
> static situation like before commit 26973fa5ac0e ("powerpc/mm: use
> pte helpers in generic code"), by introducing an 8xx specific
> version of __ptep_set_access_flags() and ptep_set_wrprotect().
What is this actually fixing? Does it change anything itself, or
just a preparation patch?
Thanks,
Nick
>
> Fixes: 26973fa5ac0e ("powerpc/mm: use pte helpers in generic code")
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
> v3: No change
> v2: New
> ---
> arch/powerpc/include/asm/nohash/32/pgtable.h | 17 +++++++--------
> arch/powerpc/include/asm/nohash/32/pte-8xx.h | 22 ++++++++++++++++++++
> 2 files changed, 30 insertions(+), 9 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
> index 34ce50da1850..11c6849f7864 100644
> --- a/arch/powerpc/include/asm/nohash/32/pgtable.h
> +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
> @@ -306,30 +306,29 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
> }
>
> #define __HAVE_ARCH_PTEP_SET_WRPROTECT
> +#ifndef ptep_set_wrprotect
> static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep)
> {
> - unsigned long clr = ~pte_val(pte_wrprotect(__pte(~0)));
> - unsigned long set = pte_val(pte_wrprotect(__pte(0)));
> -
> - pte_update(mm, addr, ptep, clr, set, 0);
> + pte_update(mm, addr, ptep, _PAGE_RW, 0, 0);
> }
> +#endif
>
> +#ifndef __ptep_set_access_flags
> static inline void __ptep_set_access_flags(struct vm_area_struct *vma,
> pte_t *ptep, pte_t entry,
> unsigned long address,
> int psize)
> {
> - pte_t pte_set = pte_mkyoung(pte_mkdirty(pte_mkwrite(pte_mkexec(__pte(0)))));
> - pte_t pte_clr = pte_mkyoung(pte_mkdirty(pte_mkwrite(pte_mkexec(__pte(~0)))));
> - unsigned long set = pte_val(entry) & pte_val(pte_set);
> - unsigned long clr = ~pte_val(entry) & ~pte_val(pte_clr);
> + unsigned long set = pte_val(entry) &
> + (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW | _PAGE_EXEC);
> int huge = psize > mmu_virtual_psize ? 1 : 0;
>
> - pte_update(vma->vm_mm, address, ptep, clr, set, huge);
> + pte_update(vma->vm_mm, address, ptep, 0, set, huge);
>
> flush_tlb_page(vma, address);
> }
> +#endif
>
> static inline int pte_young(pte_t pte)
> {
> diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
> index fcc48d590d88..1a89ebdc3acc 100644
> --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
> +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
> @@ -136,6 +136,28 @@ static inline pte_t pte_mkhuge(pte_t pte)
>
> #define pte_mkhuge pte_mkhuge
>
> +static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p,
> + unsigned long clr, unsigned long set, int huge);
> +
> +static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
> +{
> + pte_update(mm, addr, ptep, 0, _PAGE_RO, 0);
> +}
> +#define ptep_set_wrprotect ptep_set_wrprotect
> +
> +static inline void __ptep_set_access_flags(struct vm_area_struct *vma, pte_t *ptep,
> + pte_t entry, unsigned long address, int psize)
> +{
> + unsigned long set = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_EXEC);
> + unsigned long clr = ~pte_val(entry) & _PAGE_RO;
> + int huge = psize > mmu_virtual_psize ? 1 : 0;
> +
> + pte_update(vma->vm_mm, address, ptep, clr, set, huge);
> +
> + flush_tlb_page(vma, address);
> +}
> +#define __ptep_set_access_flags __ptep_set_access_flags
> +
> static inline unsigned long pgd_leaf_size(pgd_t pgd)
> {
> if (pgd_val(pgd) & _PMD_PAGE_8M)
> --
> 2.31.1
>
>
^ permalink raw reply
* Re: [PATCH 1/3] powerpc/nohash: Fix __ptep_set_access_flags() and ptep_set_wrprotect()
From: Christophe Leroy @ 2021-10-27 4:39 UTC (permalink / raw)
To: Nicholas Piggin, Benjamin Herrenschmidt, Michael Ellerman,
Paul Mackerras
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1635308538.6vye6lbbh8.astroid@bobo.none>
Le 27/10/2021 à 06:23, Nicholas Piggin a écrit :
> Excerpts from Christophe Leroy's message of October 26, 2021 3:39 pm:
>> Commit 26973fa5ac0e ("powerpc/mm: use pte helpers in generic code")
>> changed those two functions to use pte helpers to determine which
>> bits to clear and which bits to set.
>>
>> This change was based on the assumption that bits to be set/cleared
>> are always the same and can be determined by applying the pte
>> manipulation helpers on __pte(0).
>>
>> But on platforms like book3e, the bits depend on whether the page
>> is a user page or not.
>>
>> For the time being it more or less works because of _PAGE_EXEC being
>> used for user pages only and exec right being set at all time on
>> kernel page. But following patch will clean that and output of
>> pte_mkexec() will depend on the page being a user or kernel page.
>>
>> Instead of trying to make an even more complicated helper where bits
>> would become dependent on the final pte value, come back to a more
>> static situation like before commit 26973fa5ac0e ("powerpc/mm: use
>> pte helpers in generic code"), by introducing an 8xx specific
>> version of __ptep_set_access_flags() and ptep_set_wrprotect().
>
> What is this actually fixing? Does it change anything itself, or
> just a preparation patch?
Just a preparation patch I think.
I didn't flag it for stable.
Once patch 2 is applied, __ptep_set_access_flags() doesn't work anymore
without this patch, because then pte_mkexec(__pte(0)) sets SX and clears
UX while pte_mkexec(__pte(~0)) sets UX and clears SX
Christophe
>
> Thanks,
> Nick
>
>>
>> Fixes: 26973fa5ac0e ("powerpc/mm: use pte helpers in generic code")
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> ---
>> v3: No change
>> v2: New
>> ---
>> arch/powerpc/include/asm/nohash/32/pgtable.h | 17 +++++++--------
>> arch/powerpc/include/asm/nohash/32/pte-8xx.h | 22 ++++++++++++++++++++
>> 2 files changed, 30 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
>> index 34ce50da1850..11c6849f7864 100644
>> --- a/arch/powerpc/include/asm/nohash/32/pgtable.h
>> +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
>> @@ -306,30 +306,29 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>> }
>>
>> #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>> +#ifndef ptep_set_wrprotect
>> static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep)
>> {
>> - unsigned long clr = ~pte_val(pte_wrprotect(__pte(~0)));
>> - unsigned long set = pte_val(pte_wrprotect(__pte(0)));
>> -
>> - pte_update(mm, addr, ptep, clr, set, 0);
>> + pte_update(mm, addr, ptep, _PAGE_RW, 0, 0);
>> }
>> +#endif
>>
>> +#ifndef __ptep_set_access_flags
>> static inline void __ptep_set_access_flags(struct vm_area_struct *vma,
>> pte_t *ptep, pte_t entry,
>> unsigned long address,
>> int psize)
>> {
>> - pte_t pte_set = pte_mkyoung(pte_mkdirty(pte_mkwrite(pte_mkexec(__pte(0)))));
>> - pte_t pte_clr = pte_mkyoung(pte_mkdirty(pte_mkwrite(pte_mkexec(__pte(~0)))));
>> - unsigned long set = pte_val(entry) & pte_val(pte_set);
>> - unsigned long clr = ~pte_val(entry) & ~pte_val(pte_clr);
>> + unsigned long set = pte_val(entry) &
>> + (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW | _PAGE_EXEC);
>> int huge = psize > mmu_virtual_psize ? 1 : 0;
>>
>> - pte_update(vma->vm_mm, address, ptep, clr, set, huge);
>> + pte_update(vma->vm_mm, address, ptep, 0, set, huge);
>>
>> flush_tlb_page(vma, address);
>> }
>> +#endif
>>
>> static inline int pte_young(pte_t pte)
>> {
>> diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
>> index fcc48d590d88..1a89ebdc3acc 100644
>> --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
>> +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
>> @@ -136,6 +136,28 @@ static inline pte_t pte_mkhuge(pte_t pte)
>>
>> #define pte_mkhuge pte_mkhuge
>>
>> +static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p,
>> + unsigned long clr, unsigned long set, int huge);
>> +
>> +static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>> +{
>> + pte_update(mm, addr, ptep, 0, _PAGE_RO, 0);
>> +}
>> +#define ptep_set_wrprotect ptep_set_wrprotect
>> +
>> +static inline void __ptep_set_access_flags(struct vm_area_struct *vma, pte_t *ptep,
>> + pte_t entry, unsigned long address, int psize)
>> +{
>> + unsigned long set = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_EXEC);
>> + unsigned long clr = ~pte_val(entry) & _PAGE_RO;
>> + int huge = psize > mmu_virtual_psize ? 1 : 0;
>> +
>> + pte_update(vma->vm_mm, address, ptep, clr, set, huge);
>> +
>> + flush_tlb_page(vma, address);
>> +}
>> +#define __ptep_set_access_flags __ptep_set_access_flags
>> +
>> static inline unsigned long pgd_leaf_size(pgd_t pgd)
>> {
>> if (pgd_val(pgd) & _PMD_PAGE_8M)
>> --
>> 2.31.1
>>
>>
^ permalink raw reply
* Re: [PATCH 2/3] powerpc/book3e: Fix set_memory_x() and set_memory_nx()
From: Nicholas Piggin @ 2021-10-27 4:44 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Christophe Leroy, Michael Ellerman,
Paul Mackerras
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <c41100f9c144dc5b62e5a751b810190c6b5d42fd.1635226743.git.christophe.leroy@csgroup.eu>
Excerpts from Christophe Leroy's message of October 26, 2021 3:39 pm:
> set_memory_x() calls pte_mkexec() which sets _PAGE_EXEC.
> set_memory_nx() calls pte_exprotec() which clears _PAGE_EXEC.
>
> Book3e has 2 bits, UX and SX, which defines the exec rights
> resp. for user (PR=1) and for kernel (PR=0).
>
> _PAGE_EXEC is defined as UX only.
>
> An executable kernel page is set with either _PAGE_KERNEL_RWX
> or _PAGE_KERNEL_ROX, which both have SX set and UX cleared.
>
> So set_memory_nx() call for an executable kernel page does
> nothing because UX is already cleared.
>
> And set_memory_x() on a non-executable kernel page makes it
> executable for the user and keeps it non-executable for kernel.
>
> Also, pte_exec() always returns 'false' on kernel pages, because
> it checks _PAGE_EXEC which doesn't include SX, so for instance
> the W+X check doesn't work.
>
> To fix this:
> - change tlb_low_64e.S to use _PAGE_BAP_UX instead of _PAGE_USER
> - sets both UX and SX in _PAGE_EXEC so that pte_user() returns
> true whenever one of the two bits is set
I don't understand this change. Which pte_user() returns true after
this change? Or do you mean pte_exec()?
Does this filter through in some cases at least for kernel executable
PTEs will get both bits set? Seems cleaner to distinguish user and
kernel exec for that but maybe it's a lot of churn?
Thanks,
Nick
> and pte_exprotect()
> clears both bits.
> - Define a book3e specific version of pte_mkexec() which sets
> either SX or UX based on UR.
>
> Fixes: 1f9ad21c3b38 ("powerpc/mm: Implement set_memory() routines")
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
> v3: Removed pte_mkexec() from nohash/64/pgtable.h
> v2: New
> ---
^ permalink raw reply
* [PATCH] livepatch: Fix build failure on 32 bits processors
From: Christophe Leroy @ 2021-10-27 4:42 UTC (permalink / raw)
To: Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
Joe Lawrence
Cc: live-patching, linuxppc-dev, linux-kernel
Trying to build livepatch on powerpc/32 results in:
kernel/livepatch/core.c: In function 'klp_resolve_symbols':
kernel/livepatch/core.c:221:23: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
221 | sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
| ^
kernel/livepatch/core.c:221:21: error: assignment to 'Elf32_Sym *' {aka 'struct elf32_sym *'} from incompatible pointer type 'Elf64_Sym *' {aka 'struct elf64_sym *'} [-Werror=incompatible-pointer-types]
221 | sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
| ^
kernel/livepatch/core.c: In function 'klp_apply_section_relocs':
kernel/livepatch/core.c:312:35: error: passing argument 1 of 'klp_resolve_symbols' from incompatible pointer type [-Werror=incompatible-pointer-types]
312 | ret = klp_resolve_symbols(sechdrs, strtab, symndx, sec, sec_objname);
| ^~~~~~~
| |
| Elf32_Shdr * {aka struct elf32_shdr *}
kernel/livepatch/core.c:193:44: note: expected 'Elf64_Shdr *' {aka 'struct elf64_shdr *'} but argument is of type 'Elf32_Shdr *' {aka 'struct elf32_shdr *'}
193 | static int klp_resolve_symbols(Elf64_Shdr *sechdrs, const char *strtab,
| ~~~~~~~~~~~~^~~~~~~
Fix it by using the right types instead of forcing 64 bits types.
Fixes: 7c8e2bdd5f0d ("livepatch: Apply vmlinux-specific KLP relocations early")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
kernel/livepatch/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 335d988bd811..c0789383807b 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -190,7 +190,7 @@ static int klp_find_object_symbol(const char *objname, const char *name,
return -EINVAL;
}
-static int klp_resolve_symbols(Elf64_Shdr *sechdrs, const char *strtab,
+static int klp_resolve_symbols(Elf_Shdr *sechdrs, const char *strtab,
unsigned int symndx, Elf_Shdr *relasec,
const char *sec_objname)
{
@@ -218,7 +218,7 @@ static int klp_resolve_symbols(Elf64_Shdr *sechdrs, const char *strtab,
relas = (Elf_Rela *) relasec->sh_addr;
/* For each rela in this klp relocation section */
for (i = 0; i < relasec->sh_size / sizeof(Elf_Rela); i++) {
- sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
+ sym = (Elf_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
if (sym->st_shndx != SHN_LIVEPATCH) {
pr_err("symbol %s is not marked as a livepatch symbol\n",
strtab + sym->st_name);
--
2.31.1
^ permalink raw reply related
* Re: [PATCH 2/3] powerpc/book3e: Fix set_memory_x() and set_memory_nx()
From: Christophe Leroy @ 2021-10-27 4:55 UTC (permalink / raw)
To: Nicholas Piggin, Benjamin Herrenschmidt, Michael Ellerman,
Paul Mackerras
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1635309296.3vv9pb80wz.astroid@bobo.none>
Le 27/10/2021 à 06:44, Nicholas Piggin a écrit :
> Excerpts from Christophe Leroy's message of October 26, 2021 3:39 pm:
>> set_memory_x() calls pte_mkexec() which sets _PAGE_EXEC.
>> set_memory_nx() calls pte_exprotec() which clears _PAGE_EXEC.
>>
>> Book3e has 2 bits, UX and SX, which defines the exec rights
>> resp. for user (PR=1) and for kernel (PR=0).
>>
>> _PAGE_EXEC is defined as UX only.
>>
>> An executable kernel page is set with either _PAGE_KERNEL_RWX
>> or _PAGE_KERNEL_ROX, which both have SX set and UX cleared.
>>
>> So set_memory_nx() call for an executable kernel page does
>> nothing because UX is already cleared.
>>
>> And set_memory_x() on a non-executable kernel page makes it
>> executable for the user and keeps it non-executable for kernel.
>>
>> Also, pte_exec() always returns 'false' on kernel pages, because
>> it checks _PAGE_EXEC which doesn't include SX, so for instance
>> the W+X check doesn't work.
>>
>> To fix this:
>> - change tlb_low_64e.S to use _PAGE_BAP_UX instead of _PAGE_USER
>> - sets both UX and SX in _PAGE_EXEC so that pte_user() returns
>> true whenever one of the two bits is set
>
> I don't understand this change. Which pte_user() returns true after
> this change? Or do you mean pte_exec()?
Oops, yes, I mean pte_exec()
Unless I have to re-spin, can Michael eventually fix that typo while
applying ?
>
> Does this filter through in some cases at least for kernel executable
> PTEs will get both bits set? Seems cleaner to distinguish user and
> kernel exec for that but maybe it's a lot of churn?
Didn't understand what you mean.
I did it like that to be able to continue using _PAGE_EXEC for checking
executability regardless of whether this is user or kernel, and then
continue using the generic nohash pte_exec() helper.
Other solution would be to get rid of _PAGE_EXEC completely for book3e
and implement both pte_exec() and pte_mkexec() with _PAGE_BAP_UX and
_PAGE_BAP_SX, but I'm not sure it is worth the churn as you say. It
would also mean different helpers for book3s/32 when it is using 32 bits
PTE (CONFIG_PTE_64BIT=n)
Christophe
>
> Thanks,
> Nick
>
>> and pte_exprotect()
>> clears both bits.
>> - Define a book3e specific version of pte_mkexec() which sets
>> either SX or UX based on UR.
>>
>> Fixes: 1f9ad21c3b38 ("powerpc/mm: Implement set_memory() routines")
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> ---
>> v3: Removed pte_mkexec() from nohash/64/pgtable.h
>> v2: New
>> ---
^ permalink raw reply
* Re: instruction storage exception handling
From: Christophe Leroy @ 2021-10-27 5:00 UTC (permalink / raw)
To: Nicholas Piggin, Jacques de Laval, Scott Wood
Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1635306738.0z8wt7619v.astroid@bobo.none>
Le 27/10/2021 à 06:10, Nicholas Piggin a écrit :
> Excerpts from Jacques de Laval's message of October 26, 2021 6:07 am:
>> Hi,
>>
>> We are trying to upgrade kernel from 5.10 to 5.14.11. We have a Freescale/NXP
>> T1023 SOC with two e5500 cores, and are running in 32-bit mode:
>>
>> CONFIG_PPC32=y
>> # CONFIG_PPC64 is not set
>>
>> #
>> # Processor support
>> #
>> # CONFIG_PPC_BOOK3S_32 is not set
>> CONFIG_PPC_85xx=y
>> # CONFIG_PPC_8xx is not set
>> # CONFIG_40x is not set
>> # CONFIG_44x is not set
>> CONFIG_GENERIC_CPU=y
>> # CONFIG_E5500_CPU is not set
>> # CONFIG_E6500_CPU is not set
>> CONFIG_E500=y
>> CONFIG_PPC_E500MC=y
>> CONFIG_PPC_FPU_REGS=y
>> CONFIG_PPC_FPU=y
>> CONFIG_FSL_EMB_PERFMON=y
>> CONFIG_FSL_EMB_PERF_EVENT=y
>> CONFIG_FSL_EMB_PERF_EVENT_E500=y
>> CONFIG_BOOKE=y
>> CONFIG_FSL_BOOKE=y
>> CONFIG_PPC_FSL_BOOK3E=y
>> CONFIG_PTE_64BIT=y
>> CONFIG_PHYS_64BIT=y
>> CONFIG_PPC_MMU_NOHASH=y
>> CONFIG_PPC_BOOK3E_MMU=y
>> # CONFIG_PMU_SYSFS is not set
>> CONFIG_SMP=y
>> CONFIG_NR_CPUS=2
>> CONFIG_PPC_DOORBELL=y
>> # end of Processor support
>>
>> We compile using 32-bit Bootlin PPC toolchain:
>>
>> powerpc-e500mc glibc bleeding-edge 2020.08-1.
>>
>> When booting, and starting PID 1 we sometimes get a hang. Nothing but our init
>> process is running, and for debugging purposes our init currently looks like
>> this:
>>
>> int main(int argc, char *argv[]){
>> for (int i = 0; ; i++) {
>> FILE *fp = fopen("/dev/kmsg", "w");
>> if (fp) {
>> fprintf(fp, "%d\n", i);
>> fclose(fp);
>> }
>> sleep(1);
>> }
>> return 0;
>> }
>>
>> When the hangup occur we don't get any output at all from our PID 1.
>> The last output is from the kernel:
>>
>> Run /sbin/init as init process
>> with arguments:
>> /sbin/init
>> with environment:
>> HOME=/
>> TERM=linux
>> kgdboc=ttyS0,115200
>>
>> When issuing a backtrace on all active cpus we can see that the kernel is
>> handling an instruction storage exception:
>>
>> sysrq: Show backtrace of all active CPUs
>> sysrq: CPU0:
>> CPU: 0 PID: 1 Comm: init Not tainted 5.14.11 #1
>> NIP: c02aac78 LR: c02aac2c CTR: 00000000
>> REGS: c1907d40 TRAP: 0500 Not tainted (5.14.11)
>> MSR: 00029002 <CE,EE,ME> CR: 82244284 XER: 20000000
>> GPR00: 0000000f c1907e20 c1910000 0065a000 000001d0 01100cca c1907e84 0000000c
>> GPR08: d39a8000 000001d3 0000000c c1907f10 42244284 00000000 00740514 bfb71670
>> GPR16: 007040e6 00701418 b7c1a5f0 00702f18 00000000 bfb71690 0000fff1 b7c1c478
>> GPR24: 00708558 00701698 d3994040 00029002 c1907f20 0065a238 00000355 d39a2790
>> NIP [c02aac78] handle_mm_fault+0xf8/0x11f0
>> LR [c02aac2c] handle_mm_fault+0xac/0x11f0
>> Call Trace:
>> [c1907e20] [c02aac10] handle_mm_fault+0x90/0x11f0 (unreliable)
>> [c1907ec0] [c003078c] ___do_page_fault+0x26c/0x780
>> [c1907ef0] [c0030cd4] do_page_fault+0x34/0x100
>> [c1907f10] [c0000988] InstructionStorage+0x108/0x120
>> --- interrupt: 400 at 0x65a238
>> NIP: 0065a238 LR: 0052f26c CTR: 0052f260
>> REGS: c1907f20 TRAP: 0400 Not tainted (5.14.11)
>> MSR: 0002d002 <CE,EE,PR,ME> CR: 42242284 XER: 00000000
>> GPR00: b7be9914 bfb71620 b7c203a0 8c008000 0070400d b7c182a0 000b8260 0052f260
>> GPR08: 0047d448 0052f260 0000000a 00000003 42242284 00000000 00740514 bfb71670
>> GPR16: 007040e6 00701418 b7c1a5f0 00702f18 00000000 bfb71690 0000fff1 b7c1c478
>> GPR24: 00708558 00701698 00700000 00000015 b7c1c2b0 00707e20 b7c1b8a8 bfb71660
>> NIP [0065a238] 0x65a238
>> LR [0052f26c] 0x52f26c
>> --- interrupt: 400
>> Instruction dump:
>> 60a500c0 811f0020 57aa6cfa 813f0000 57a30026 809f004c 81080024 7d29e850
>> 90a1002c 5529a33e 93c10038 7d244a14 <90610034> 7d485215 91210030 41c203dc
>>
>> We have also observed that the CPU is continuously servicing the same interrupt
>> (north of 140k times per sec), it is not deadlocked.
>>
>> We have not yet been able to reproduce this behavior under QEMU system
>> emulation.
>>
>> When bisecting between 5.10 and 5.14.11 we can see that this behavior started
>> with commit a01a3f2ddbcda83e8572787c0ec1dcbeba86915a:
>>
>> powerpc: remove arguments from fault handler functions
>
> Thank you for the excellent work to investigate and report this.
>
>>
>> Our best guess that the instruction storage exception is not properly handled
>> and the kernel is never able to recover from the page fault, but we don't
>> really know how to proceed. Does anyone have any suggestions or insights?
>
> Before my patch, zero was passed to the page fault error code, after
> my patch it passes the contents of ESR SPR.
>
> It looks like the BookE instruction access interrupt does not set ESR
> (except for BO interrupts maybe?) so you're getting what was in the ESR
> register from a previous interrupt, and maybe if that was a store then
> access_error won't cause a segfault because is_exec is true so that
> test bails out early, then it might just keep retrying the interrupt.
>
> That could explain why you don't always see the same thing.
>
> Now previous code still saved ESR in regs->esr/dsisr for some reason.
> I can't quite see why that should have been necessary though. Does
> this patch solve it for you?
>
> Thanks,
> Nick
> --
>
>
> diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
> index e5503420b6c6..0e7cdc8716eb 100644
> --- a/arch/powerpc/kernel/head_booke.h
> +++ b/arch/powerpc/kernel/head_booke.h
> @@ -467,10 +467,11 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
>
> #define INSTRUCTION_STORAGE_EXCEPTION \
> START_EXCEPTION(InstructionStorage) \
> - NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
> - mfspr r5,SPRN_ESR; /* Grab the ESR and save it */ \
> + NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
> + li r5,0; \
> + mfspr r5,SPRN_ESR; /* Store 0 in regs->esr (dsisr) */ \
I can't see how that can help, you set r5 to 0 and immediately after you
reload ESR into r5 so you are still saving garbage into _ESR(r11)
> stw r5,_ESR(r11); \
> - stw r12, _DEAR(r11); /* Pass SRR0 as arg2 */ \
> + stw r12, _DEAR(r11); /* Set regs->dear (dar) */ \
> prepare_transfer_to_handler; \
> bl do_page_fault; \
> b interrupt_return
>
^ permalink raw reply
* Re: [PATCH v7 1/3] riscv: Introduce CONFIG_RELOCATABLE
From: Alexandre ghiti @ 2021-10-27 5:04 UTC (permalink / raw)
To: Palmer Dabbelt
Cc: aou, linux-kernel, paulus, Paul Walmsley, linux-riscv,
alexandre.ghiti, linuxppc-dev
In-Reply-To: <mhng-4d503326-d18d-4155-a595-91dc15cfb4f1@palmerdabbelt-glaptop>
Hi Palmer,
On 10/26/21 11:29 PM, Palmer Dabbelt wrote:
> On Sat, 09 Oct 2021 10:20:20 PDT (-0700), alex@ghiti.fr wrote:
>> Arf, I have sent this patchset with the wrong email address. @Palmer
>> tell me if you want me to resend it correctly.
>
> Sorry for being kind of slow here. It's fine: there's a "From:" in
> the patch, and git picks those up so it'll match the signed-off-by
> line. I send pretty much all my patches that way, as I never managed
> to get my Google address working correctly.
>
>>
>> Thanks,
>>
>> Alex
>>
>> On 10/9/21 7:12 PM, Alexandre Ghiti wrote:
>>> From: Alexandre Ghiti <alex@ghiti.fr>
>>>
>>> This config allows to compile 64b kernel as PIE and to relocate it at
>>> any virtual address at runtime: this paves the way to KASLR.
>>> Runtime relocation is possible since relocation metadata are
>>> embedded into
>>> the kernel.
>
> IMO this should really be user selectable, at a bare minimum so it's
> testable.
> I just sent along a patch to do that (my power's off at home, so email
> is a bit
> wacky right now).
>
> I haven't put this on for-next yet as I'm not sure if you had a fix
> for the
> kasan issue (which IIUC would conflict with this).
The kasan issue only revealed that I need to move the kasan shadow
memory around with sv48 support, that's not related to the relocatable
kernel.
Thanks,
Alex
>
>>> Note that relocating at runtime introduces an overhead even if the
>>> kernel is loaded at the same address it was linked at and that the
>>> compiler
>>> options are those used in arm64 which uses the same RELA relocation
>>> format.
>>>
>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>> ---
>>> arch/riscv/Kconfig | 12 ++++++++
>>> arch/riscv/Makefile | 7 +++--
>>> arch/riscv/kernel/vmlinux.lds.S | 6 ++++
>>> arch/riscv/mm/Makefile | 4 +++
>>> arch/riscv/mm/init.c | 54 ++++++++++++++++++++++++++++++++-
>>> 5 files changed, 80 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>> index ea16fa2dd768..043ba92559fa 100644
>>> --- a/arch/riscv/Kconfig
>>> +++ b/arch/riscv/Kconfig
>>> @@ -213,6 +213,18 @@ config PGTABLE_LEVELS
>>> config LOCKDEP_SUPPORT
>>> def_bool y
>>>
>>> +config RELOCATABLE
>>> + bool
>>> + depends on MMU && 64BIT && !XIP_KERNEL
>>> + help
>>> + This builds a kernel as a Position Independent Executable
>>> (PIE),
>>> + which retains all relocation metadata required to
>>> relocate the
>>> + kernel binary at runtime to a different virtual address
>>> than the
>>> + address it was linked at.
>>> + Since RISCV uses the RELA relocation format, this requires a
>>> + relocation pass at runtime even if the kernel is loaded
>>> at the
>>> + same address it was linked at.
>>> +
>>> source "arch/riscv/Kconfig.socs"
>>> source "arch/riscv/Kconfig.erratas"
>>>
>>> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
>>> index 0eb4568fbd29..2f509915f246 100644
>>> --- a/arch/riscv/Makefile
>>> +++ b/arch/riscv/Makefile
>>> @@ -9,9 +9,12 @@
>>> #
>>>
>>> OBJCOPYFLAGS := -O binary
>>> -LDFLAGS_vmlinux :=
>>> +ifeq ($(CONFIG_RELOCATABLE),y)
>>> + LDFLAGS_vmlinux += -shared -Bsymbolic -z notext -z norelro
>>> + KBUILD_CFLAGS += -fPIE
>>> +endif
>>> ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
>>> - LDFLAGS_vmlinux := --no-relax
>>> + LDFLAGS_vmlinux += --no-relax
>>> KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
>>> CC_FLAGS_FTRACE := -fpatchable-function-entry=8
>>> endif
>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S
>>> b/arch/riscv/kernel/vmlinux.lds.S
>>> index 5104f3a871e3..862a8c09723c 100644
>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>> @@ -133,6 +133,12 @@ SECTIONS
>>>
>>> BSS_SECTION(PAGE_SIZE, PAGE_SIZE, 0)
>>>
>>> + .rela.dyn : ALIGN(8) {
>>> + __rela_dyn_start = .;
>>> + *(.rela .rela*)
>>> + __rela_dyn_end = .;
>>> + }
>>> +
>>> #ifdef CONFIG_EFI
>>> . = ALIGN(PECOFF_SECTION_ALIGNMENT);
>>> __pecoff_data_virt_size = ABSOLUTE(. - __pecoff_text_end);
>>> diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
>>> index 7ebaef10ea1b..2d33ec574bbb 100644
>>> --- a/arch/riscv/mm/Makefile
>>> +++ b/arch/riscv/mm/Makefile
>>> @@ -1,6 +1,10 @@
>>> # SPDX-License-Identifier: GPL-2.0-only
>>>
>>> CFLAGS_init.o := -mcmodel=medany
>>> +ifdef CONFIG_RELOCATABLE
>>> +CFLAGS_init.o += -fno-pie
>>> +endif
>>> +
>>> ifdef CONFIG_FTRACE
>>> CFLAGS_REMOVE_init.o = $(CC_FLAGS_FTRACE)
>>> CFLAGS_REMOVE_cacheflush.o = $(CC_FLAGS_FTRACE)
>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>> index c0cddf0fc22d..42041c12d496 100644
>>> --- a/arch/riscv/mm/init.c
>>> +++ b/arch/riscv/mm/init.c
>>> @@ -20,6 +20,9 @@
>>> #include <linux/dma-map-ops.h>
>>> #include <linux/crash_dump.h>
>>> #include <linux/hugetlb.h>
>>> +#ifdef CONFIG_RELOCATABLE
>>> +#include <linux/elf.h>
>>> +#endif
>>>
>>> #include <asm/fixmap.h>
>>> #include <asm/tlbflush.h>
>>> @@ -103,7 +106,7 @@ static void __init print_vm_layout(void)
>>> print_mlm("lowmem", (unsigned long)PAGE_OFFSET,
>>> (unsigned long)high_memory);
>>> #ifdef CONFIG_64BIT
>>> - print_mlm("kernel", (unsigned long)KERNEL_LINK_ADDR,
>>> + print_mlm("kernel", (unsigned long)kernel_map.virt_addr,
>>> (unsigned long)ADDRESS_SPACE_END);
>>> #endif
>>> }
>>> @@ -518,6 +521,44 @@ static __init pgprot_t pgprot_from_va(uintptr_t
>>> va)
>>> #error "setup_vm() is called from head.S before relocate so it
>>> should not use absolute addressing."
>>> #endif
>>>
>>> +#ifdef CONFIG_RELOCATABLE
>>> +extern unsigned long __rela_dyn_start, __rela_dyn_end;
>>> +
>>> +static void __init relocate_kernel(void)
>>> +{
>>> + Elf64_Rela *rela = (Elf64_Rela *)&__rela_dyn_start;
>>> + /*
>>> + * This holds the offset between the linked virtual address and
>>> the
>>> + * relocated virtual address.
>>> + */
>>> + uintptr_t reloc_offset = kernel_map.virt_addr - KERNEL_LINK_ADDR;
>>> + /*
>>> + * This holds the offset between kernel linked virtual address and
>>> + * physical address.
>>> + */
>>> + uintptr_t va_kernel_link_pa_offset = KERNEL_LINK_ADDR -
>>> kernel_map.phys_addr;
>>> +
>>> + for ( ; rela < (Elf64_Rela *)&__rela_dyn_end; rela++) {
>>> + Elf64_Addr addr = (rela->r_offset - va_kernel_link_pa_offset);
>>> + Elf64_Addr relocated_addr = rela->r_addend;
>>> +
>>> + if (rela->r_info != R_RISCV_RELATIVE)
>>> + continue;
>>> +
>>> + /*
>>> + * Make sure to not relocate vdso symbols like rt_sigreturn
>>> + * which are linked from the address 0 in vmlinux since
>>> + * vdso symbol addresses are actually used as an offset from
>>> + * mm->context.vdso in VDSO_OFFSET macro.
>>> + */
>>> + if (relocated_addr >= KERNEL_LINK_ADDR)
>>> + relocated_addr += reloc_offset;
>>> +
>>> + *(Elf64_Addr *)addr = relocated_addr;
>>> + }
>>> +}
>>> +#endif /* CONFIG_RELOCATABLE */
>>> +
>>> #ifdef CONFIG_XIP_KERNEL
>>> static void __init create_kernel_page_table(pgd_t *pgdir,
>>> __always_unused bool early)
>>> @@ -625,6 +666,17 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>> BUG_ON((kernel_map.virt_addr + kernel_map.size) >
>>> ADDRESS_SPACE_END - SZ_4K);
>>> #endif
>>>
>>> +#ifdef CONFIG_RELOCATABLE
>>> + /*
>>> + * Early page table uses only one PGDIR, which makes it possible
>>> + * to map PGDIR_SIZE aligned on PGDIR_SIZE: if the relocation
>>> offset
>>> + * makes the kernel cross over a PGDIR_SIZE boundary, raise a bug
>>> + * since a part of the kernel would not get mapped.
>>> + */
>>> + BUG_ON(PGDIR_SIZE - (kernel_map.virt_addr & (PGDIR_SIZE - 1)) <
>>> kernel_map.size);
>>> + relocate_kernel();
>>> +#endif
>>> +
>>> pt_ops.alloc_pte = alloc_pte_early;
>>> pt_ops.get_pte_virt = get_pte_virt_early;
>>> #ifndef __PAGETABLE_PMD_FOLDED
^ permalink raw reply
* Re: instruction storage exception handling
From: Nicholas Piggin @ 2021-10-27 5:25 UTC (permalink / raw)
To: Christophe Leroy, Jacques de Laval, Scott Wood
Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1f5c24de-6bba-d6c0-5b8e-3522f25158f6@csgroup.eu>
Excerpts from Christophe Leroy's message of October 27, 2021 3:00 pm:
>
>
> Le 27/10/2021 à 06:10, Nicholas Piggin a écrit :
>> Excerpts from Jacques de Laval's message of October 26, 2021 6:07 am:
>>> Hi,
>>>
>>> We are trying to upgrade kernel from 5.10 to 5.14.11. We have a Freescale/NXP
>>> T1023 SOC with two e5500 cores, and are running in 32-bit mode:
>>>
>>> CONFIG_PPC32=y
>>> # CONFIG_PPC64 is not set
>>>
>>> #
>>> # Processor support
>>> #
>>> # CONFIG_PPC_BOOK3S_32 is not set
>>> CONFIG_PPC_85xx=y
>>> # CONFIG_PPC_8xx is not set
>>> # CONFIG_40x is not set
>>> # CONFIG_44x is not set
>>> CONFIG_GENERIC_CPU=y
>>> # CONFIG_E5500_CPU is not set
>>> # CONFIG_E6500_CPU is not set
>>> CONFIG_E500=y
>>> CONFIG_PPC_E500MC=y
>>> CONFIG_PPC_FPU_REGS=y
>>> CONFIG_PPC_FPU=y
>>> CONFIG_FSL_EMB_PERFMON=y
>>> CONFIG_FSL_EMB_PERF_EVENT=y
>>> CONFIG_FSL_EMB_PERF_EVENT_E500=y
>>> CONFIG_BOOKE=y
>>> CONFIG_FSL_BOOKE=y
>>> CONFIG_PPC_FSL_BOOK3E=y
>>> CONFIG_PTE_64BIT=y
>>> CONFIG_PHYS_64BIT=y
>>> CONFIG_PPC_MMU_NOHASH=y
>>> CONFIG_PPC_BOOK3E_MMU=y
>>> # CONFIG_PMU_SYSFS is not set
>>> CONFIG_SMP=y
>>> CONFIG_NR_CPUS=2
>>> CONFIG_PPC_DOORBELL=y
>>> # end of Processor support
>>>
>>> We compile using 32-bit Bootlin PPC toolchain:
>>>
>>> powerpc-e500mc glibc bleeding-edge 2020.08-1.
>>>
>>> When booting, and starting PID 1 we sometimes get a hang. Nothing but our init
>>> process is running, and for debugging purposes our init currently looks like
>>> this:
>>>
>>> int main(int argc, char *argv[]){
>>> for (int i = 0; ; i++) {
>>> FILE *fp = fopen("/dev/kmsg", "w");
>>> if (fp) {
>>> fprintf(fp, "%d\n", i);
>>> fclose(fp);
>>> }
>>> sleep(1);
>>> }
>>> return 0;
>>> }
>>>
>>> When the hangup occur we don't get any output at all from our PID 1.
>>> The last output is from the kernel:
>>>
>>> Run /sbin/init as init process
>>> with arguments:
>>> /sbin/init
>>> with environment:
>>> HOME=/
>>> TERM=linux
>>> kgdboc=ttyS0,115200
>>>
>>> When issuing a backtrace on all active cpus we can see that the kernel is
>>> handling an instruction storage exception:
>>>
>>> sysrq: Show backtrace of all active CPUs
>>> sysrq: CPU0:
>>> CPU: 0 PID: 1 Comm: init Not tainted 5.14.11 #1
>>> NIP: c02aac78 LR: c02aac2c CTR: 00000000
>>> REGS: c1907d40 TRAP: 0500 Not tainted (5.14.11)
>>> MSR: 00029002 <CE,EE,ME> CR: 82244284 XER: 20000000
>>> GPR00: 0000000f c1907e20 c1910000 0065a000 000001d0 01100cca c1907e84 0000000c
>>> GPR08: d39a8000 000001d3 0000000c c1907f10 42244284 00000000 00740514 bfb71670
>>> GPR16: 007040e6 00701418 b7c1a5f0 00702f18 00000000 bfb71690 0000fff1 b7c1c478
>>> GPR24: 00708558 00701698 d3994040 00029002 c1907f20 0065a238 00000355 d39a2790
>>> NIP [c02aac78] handle_mm_fault+0xf8/0x11f0
>>> LR [c02aac2c] handle_mm_fault+0xac/0x11f0
>>> Call Trace:
>>> [c1907e20] [c02aac10] handle_mm_fault+0x90/0x11f0 (unreliable)
>>> [c1907ec0] [c003078c] ___do_page_fault+0x26c/0x780
>>> [c1907ef0] [c0030cd4] do_page_fault+0x34/0x100
>>> [c1907f10] [c0000988] InstructionStorage+0x108/0x120
>>> --- interrupt: 400 at 0x65a238
>>> NIP: 0065a238 LR: 0052f26c CTR: 0052f260
>>> REGS: c1907f20 TRAP: 0400 Not tainted (5.14.11)
>>> MSR: 0002d002 <CE,EE,PR,ME> CR: 42242284 XER: 00000000
>>> GPR00: b7be9914 bfb71620 b7c203a0 8c008000 0070400d b7c182a0 000b8260 0052f260
>>> GPR08: 0047d448 0052f260 0000000a 00000003 42242284 00000000 00740514 bfb71670
>>> GPR16: 007040e6 00701418 b7c1a5f0 00702f18 00000000 bfb71690 0000fff1 b7c1c478
>>> GPR24: 00708558 00701698 00700000 00000015 b7c1c2b0 00707e20 b7c1b8a8 bfb71660
>>> NIP [0065a238] 0x65a238
>>> LR [0052f26c] 0x52f26c
>>> --- interrupt: 400
>>> Instruction dump:
>>> 60a500c0 811f0020 57aa6cfa 813f0000 57a30026 809f004c 81080024 7d29e850
>>> 90a1002c 5529a33e 93c10038 7d244a14 <90610034> 7d485215 91210030 41c203dc
>>>
>>> We have also observed that the CPU is continuously servicing the same interrupt
>>> (north of 140k times per sec), it is not deadlocked.
>>>
>>> We have not yet been able to reproduce this behavior under QEMU system
>>> emulation.
>>>
>>> When bisecting between 5.10 and 5.14.11 we can see that this behavior started
>>> with commit a01a3f2ddbcda83e8572787c0ec1dcbeba86915a:
>>>
>>> powerpc: remove arguments from fault handler functions
>>
>> Thank you for the excellent work to investigate and report this.
>>
>>>
>>> Our best guess that the instruction storage exception is not properly handled
>>> and the kernel is never able to recover from the page fault, but we don't
>>> really know how to proceed. Does anyone have any suggestions or insights?
>>
>> Before my patch, zero was passed to the page fault error code, after
>> my patch it passes the contents of ESR SPR.
>>
>> It looks like the BookE instruction access interrupt does not set ESR
>> (except for BO interrupts maybe?) so you're getting what was in the ESR
>> register from a previous interrupt, and maybe if that was a store then
>> access_error won't cause a segfault because is_exec is true so that
>> test bails out early, then it might just keep retrying the interrupt.
>>
>> That could explain why you don't always see the same thing.
>>
>> Now previous code still saved ESR in regs->esr/dsisr for some reason.
>> I can't quite see why that should have been necessary though. Does
>> this patch solve it for you?
>>
>> Thanks,
>> Nick
>> --
>>
>>
>> diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
>> index e5503420b6c6..0e7cdc8716eb 100644
>> --- a/arch/powerpc/kernel/head_booke.h
>> +++ b/arch/powerpc/kernel/head_booke.h
>> @@ -467,10 +467,11 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
>>
>> #define INSTRUCTION_STORAGE_EXCEPTION \
>> START_EXCEPTION(InstructionStorage) \
>> - NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
>> - mfspr r5,SPRN_ESR; /* Grab the ESR and save it */ \
>> + NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
>> + li r5,0; \
>> + mfspr r5,SPRN_ESR; /* Store 0 in regs->esr (dsisr) */ \
>
> I can't see how that can help, you set r5 to 0 and immediately after you
> reload ESR into r5 so you are still saving garbage into _ESR(r11)
>
Oops, stupid mistake.
Thanks,
Nick
diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index e5503420b6c6..094c9790a490 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -467,10 +467,10 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
#define INSTRUCTION_STORAGE_EXCEPTION \
START_EXCEPTION(InstructionStorage) \
- NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
- mfspr r5,SPRN_ESR; /* Grab the ESR and save it */ \
+ NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
+ li r5,0; /* Store 0 in regs->esr (dsisr) */ \
stw r5,_ESR(r11); \
- stw r12, _DEAR(r11); /* Pass SRR0 as arg2 */ \
+ stw r12, _DEAR(r11); /* Set regs->dear (dar) */ \
prepare_transfer_to_handler; \
bl do_page_fault; \
b interrupt_return
^ permalink raw reply related
* Re: [PATCH 2/3] powerpc/book3e: Fix set_memory_x() and set_memory_nx()
From: Nicholas Piggin @ 2021-10-27 5:27 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Christophe Leroy, Michael Ellerman,
Paul Mackerras
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <063e72e1-fc05-7783-9f42-f681dd08a4b2@csgroup.eu>
Excerpts from Christophe Leroy's message of October 27, 2021 2:55 pm:
>
>
> Le 27/10/2021 à 06:44, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of October 26, 2021 3:39 pm:
>>> set_memory_x() calls pte_mkexec() which sets _PAGE_EXEC.
>>> set_memory_nx() calls pte_exprotec() which clears _PAGE_EXEC.
>>>
>>> Book3e has 2 bits, UX and SX, which defines the exec rights
>>> resp. for user (PR=1) and for kernel (PR=0).
>>>
>>> _PAGE_EXEC is defined as UX only.
>>>
>>> An executable kernel page is set with either _PAGE_KERNEL_RWX
>>> or _PAGE_KERNEL_ROX, which both have SX set and UX cleared.
>>>
>>> So set_memory_nx() call for an executable kernel page does
>>> nothing because UX is already cleared.
>>>
>>> And set_memory_x() on a non-executable kernel page makes it
>>> executable for the user and keeps it non-executable for kernel.
>>>
>>> Also, pte_exec() always returns 'false' on kernel pages, because
>>> it checks _PAGE_EXEC which doesn't include SX, so for instance
>>> the W+X check doesn't work.
>>>
>>> To fix this:
>>> - change tlb_low_64e.S to use _PAGE_BAP_UX instead of _PAGE_USER
>>> - sets both UX and SX in _PAGE_EXEC so that pte_user() returns
>>> true whenever one of the two bits is set
>>
>> I don't understand this change. Which pte_user() returns true after
>> this change? Or do you mean pte_exec()?
>
> Oops, yes, I mean pte_exec()
>
> Unless I have to re-spin, can Michael eventually fix that typo while
> applying ?
>
>>
>> Does this filter through in some cases at least for kernel executable
>> PTEs will get both bits set? Seems cleaner to distinguish user and
>> kernel exec for that but maybe it's a lot of churn?
>
> Didn't understand what you mean.
>
> I did it like that to be able to continue using _PAGE_EXEC for checking
> executability regardless of whether this is user or kernel, and then
> continue using the generic nohash pte_exec() helper.
>
> Other solution would be to get rid of _PAGE_EXEC completely for book3e
> and implement both pte_exec() and pte_mkexec() with _PAGE_BAP_UX and
> _PAGE_BAP_SX, but I'm not sure it is worth the churn as you say. It
> would also mean different helpers for book3s/32 when it is using 32 bits
> PTE (CONFIG_PTE_64BIT=n)
That's basically what I mean. And _PAGE_KERNEL_ROX etc would then not
set the UX bit. But at least for now it seems to be an improvement.
Thanks,
Nick
^ permalink raw reply
* Re: Linux kernel: powerpc: KVM guest can trigger host crash on Power8
From: Nicholas Piggin @ 2021-10-27 5:29 UTC (permalink / raw)
To: John Paul Adrian Glaubitz, mpe
Cc: oss-security, debian-powerpc@lists.debian.org, linuxppc-dev
In-Reply-To: <05b88724-90b6-a38a-bb3b-7392f85c1934@physik.fu-berlin.de>
Excerpts from John Paul Adrian Glaubitz's message of October 26, 2021 6:48 pm:
> Hi Michael!
>
>> The Linux kernel for powerpc since v5.2 has a bug which allows a
>> malicious KVM guest to crash the host, when the host is running on
>> Power8.
>>
>> Only machines using Linux as the hypervisor, aka. KVM, powernv or bare
>> metal, are affected by the bug. Machines running PowerVM are not
>> affected.
>>
>> The bug was introduced in:
>>
>> 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
>>
>> Which was first released in v5.2.
>>
>> The upstream fix is:
>>
>> cdeb5d7d890e ("KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest")
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cdeb5d7d890e14f3b70e8087e745c4a6a7d9f337
>>
>> Which will be included in the v5.16 release.
>
> I have tested these patches against 5.14 but it seems the problem [1] still remains for me
> for big-endian guests. I built a patched kernel yesterday, rebooted the KVM server and let
> the build daemons do their work over night.
>
> When I got up this morning, I noticed the machine was down, so I checked the serial console
> via IPMI and saw the same messages again as reported in [1]:
>
> [41483.963562] watchdog: BUG: soft lockup - CPU#104 stuck for 25521s! [migration/104:175]
> [41507.963307] watchdog: BUG: soft lockup - CPU#104 stuck for 25544s! [migration/104:175]
> [41518.311200] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41518.311216] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2729959
> [41547.962882] watchdog: BUG: soft lockup - CPU#104 stuck for 25581s! [migration/104:175]
> [41571.962627] watchdog: BUG: soft lockup - CPU#104 stuck for 25603s! [migration/104:175]
> [41581.330530] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41581.330546] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2736378
> [41611.962202] watchdog: BUG: soft lockup - CPU#104 stuck for 25641s! [migration/104:175]
> [41635.961947] watchdog: BUG: soft lockup - CPU#104 stuck for 25663s! [migration/104:175]
> [41644.349859] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41644.349876] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2742753
> [41671.961564] watchdog: BUG: soft lockup - CPU#104 stuck for 25697s! [migration/104:175]
> [41695.961309] watchdog: BUG: soft lockup - CPU#104 stuck for 25719s! [migration/104:175]
> [41707.369190] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41707.369206] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2749151
> [41735.960884] watchdog: BUG: soft lockup - CPU#104 stuck for 25756s! [migration/104:175]
> [41759.960629] watchdog: BUG: soft lockup - CPU#104 stuck for 25778s! [migration/104:175]
> [41770.388520] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41770.388548] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2755540
> [41776.076307] rcu: rcu_sched kthread timer wakeup didn't happen for 1423 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [41776.076327] rcu: Possible timer handling issue on cpu=32 timer-softirq=1056014
> [41776.076336] rcu: rcu_sched kthread starved for 1424 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=32
> [41776.076350] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [41776.076360] rcu: RCU grace-period kthread stack dump:
> [41776.076434] rcu: Stack dump where RCU GP kthread last ran:
> [41783.960374] watchdog: BUG: soft lockup - CPU#104 stuck for 25801s! [migration/104:175]
> [41807.960119] watchdog: BUG: soft lockup - CPU#104 stuck for 25823s! [migration/104:175]
> [41831.959864] watchdog: BUG: soft lockup - CPU#104 stuck for 25846s! [migration/104:175]
> [41833.407851] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41833.407868] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2760381
> [41863.959524] watchdog: BUG: soft lockup - CPU#104 stuck for 25875s! [migration/104:175]
I don't suppose you were able to get any more of the log saved? (The
first error messages that happened might be interesting)
Thanks,
Nick
^ permalink raw reply
* Re: Linux kernel: powerpc: KVM guest can trigger host crash on Power8
From: Michael Ellerman @ 2021-10-27 5:30 UTC (permalink / raw)
To: John Paul Adrian Glaubitz
Cc: oss-security, debian-powerpc@lists.debian.org, linuxppc-dev
In-Reply-To: <05b88724-90b6-a38a-bb3b-7392f85c1934@physik.fu-berlin.de>
John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> writes:
> Hi Michael!
Hi Adrian,
Thanks for testing ...
>> The Linux kernel for powerpc since v5.2 has a bug which allows a
>> malicious KVM guest to crash the host, when the host is running on
>> Power8.
>>
>> Only machines using Linux as the hypervisor, aka. KVM, powernv or bare
>> metal, are affected by the bug. Machines running PowerVM are not
>> affected.
>>
>> The bug was introduced in:
>>
>> 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
>>
>> Which was first released in v5.2.
>>
>> The upstream fix is:
>>
>> cdeb5d7d890e ("KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest")
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cdeb5d7d890e14f3b70e8087e745c4a6a7d9f337
>>
>> Which will be included in the v5.16 release.
>
> I have tested these patches against 5.14 but it seems the problem [1] still remains for me
> for big-endian guests. I built a patched kernel yesterday, rebooted the KVM server and let
> the build daemons do their work over night.
>
> When I got up this morning, I noticed the machine was down, so I checked the serial console
> via IPMI and saw the same messages again as reported in [1]:
>
> [41483.963562] watchdog: BUG: soft lockup - CPU#104 stuck for 25521s! [migration/104:175]
> [41507.963307] watchdog: BUG: soft lockup - CPU#104 stuck for 25544s! [migration/104:175]
> [41518.311200] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41518.311216] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2729959
> [41547.962882] watchdog: BUG: soft lockup - CPU#104 stuck for 25581s! [migration/104:175]
> [41571.962627] watchdog: BUG: soft lockup - CPU#104 stuck for 25603s! [migration/104:175]
> [41581.330530] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41581.330546] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2736378
> [41611.962202] watchdog: BUG: soft lockup - CPU#104 stuck for 25641s! [migration/104:175]
> [41635.961947] watchdog: BUG: soft lockup - CPU#104 stuck for 25663s! [migration/104:175]
> [41644.349859] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41644.349876] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2742753
> [41671.961564] watchdog: BUG: soft lockup - CPU#104 stuck for 25697s! [migration/104:175]
> [41695.961309] watchdog: BUG: soft lockup - CPU#104 stuck for 25719s! [migration/104:175]
> [41707.369190] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41707.369206] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2749151
> [41735.960884] watchdog: BUG: soft lockup - CPU#104 stuck for 25756s! [migration/104:175]
> [41759.960629] watchdog: BUG: soft lockup - CPU#104 stuck for 25778s! [migration/104:175]
> [41770.388520] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41770.388548] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2755540
> [41776.076307] rcu: rcu_sched kthread timer wakeup didn't happen for 1423 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [41776.076327] rcu: Possible timer handling issue on cpu=32 timer-softirq=1056014
> [41776.076336] rcu: rcu_sched kthread starved for 1424 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=32
> [41776.076350] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [41776.076360] rcu: RCU grace-period kthread stack dump:
> [41776.076434] rcu: Stack dump where RCU GP kthread last ran:
> [41783.960374] watchdog: BUG: soft lockup - CPU#104 stuck for 25801s! [migration/104:175]
> [41807.960119] watchdog: BUG: soft lockup - CPU#104 stuck for 25823s! [migration/104:175]
> [41831.959864] watchdog: BUG: soft lockup - CPU#104 stuck for 25846s! [migration/104:175]
> [41833.407851] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [41833.407868] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2760381
> [41863.959524] watchdog: BUG: soft lockup - CPU#104 stuck for 25875s! [migration/104:175]
>
> It seems that in this case, it was the testsuite of the git package [2] that triggered the bug. As you
> can see from the overview, the git package has been in the building state for 8 hours meaning the
> build server crashed and is no longer giving feedback to the database.
OK, that sucks.
I did test the repro case you gave me before (in the bugzilla), which
was building glibc, that passes for me with a patched host.
I guess we have yet another bug.
I tried the following in a debian BE VM and it completed fine:
$ dget -u http://ftp.debian.org/debian/pool/main/g/git/git_2.33.1-1.dsc
$ sbuild -d sid --arch=powerpc --no-arch-all git_2.33.1-1.dsc
Same for ppc64.
And I also tried both at once, repeatedly in a loop.
I guess it's something more complicated.
What exact host/guest kernel versions and configs are you running?
cheers
^ permalink raw reply
* Re: [PATCH 2/3] powerpc/book3e: Fix set_memory_x() and set_memory_nx()
From: Christophe Leroy @ 2021-10-27 5:50 UTC (permalink / raw)
To: Nicholas Piggin, Benjamin Herrenschmidt, Michael Ellerman,
Paul Mackerras
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1635312355.da7w1oggf1.astroid@bobo.none>
Le 27/10/2021 à 07:27, Nicholas Piggin a écrit :
> Excerpts from Christophe Leroy's message of October 27, 2021 2:55 pm:
>>
>>
>> Le 27/10/2021 à 06:44, Nicholas Piggin a écrit :
>>> Excerpts from Christophe Leroy's message of October 26, 2021 3:39 pm:
>>>> set_memory_x() calls pte_mkexec() which sets _PAGE_EXEC.
>>>> set_memory_nx() calls pte_exprotec() which clears _PAGE_EXEC.
>>>>
>>>> Book3e has 2 bits, UX and SX, which defines the exec rights
>>>> resp. for user (PR=1) and for kernel (PR=0).
>>>>
>>>> _PAGE_EXEC is defined as UX only.
>>>>
>>>> An executable kernel page is set with either _PAGE_KERNEL_RWX
>>>> or _PAGE_KERNEL_ROX, which both have SX set and UX cleared.
>>>>
>>>> So set_memory_nx() call for an executable kernel page does
>>>> nothing because UX is already cleared.
>>>>
>>>> And set_memory_x() on a non-executable kernel page makes it
>>>> executable for the user and keeps it non-executable for kernel.
>>>>
>>>> Also, pte_exec() always returns 'false' on kernel pages, because
>>>> it checks _PAGE_EXEC which doesn't include SX, so for instance
>>>> the W+X check doesn't work.
>>>>
>>>> To fix this:
>>>> - change tlb_low_64e.S to use _PAGE_BAP_UX instead of _PAGE_USER
>>>> - sets both UX and SX in _PAGE_EXEC so that pte_user() returns
>>>> true whenever one of the two bits is set
>>>
>>> I don't understand this change. Which pte_user() returns true after
>>> this change? Or do you mean pte_exec()?
>>
>> Oops, yes, I mean pte_exec()
>>
>> Unless I have to re-spin, can Michael eventually fix that typo while
>> applying ?
>>
>>>
>>> Does this filter through in some cases at least for kernel executable
>>> PTEs will get both bits set? Seems cleaner to distinguish user and
>>> kernel exec for that but maybe it's a lot of churn?
>>
>> Didn't understand what you mean.
>>
>> I did it like that to be able to continue using _PAGE_EXEC for checking
>> executability regardless of whether this is user or kernel, and then
>> continue using the generic nohash pte_exec() helper.
>>
>> Other solution would be to get rid of _PAGE_EXEC completely for book3e
>> and implement both pte_exec() and pte_mkexec() with _PAGE_BAP_UX and
>> _PAGE_BAP_SX, but I'm not sure it is worth the churn as you say. It
>> would also mean different helpers for book3s/32 when it is using 32 bits
>> PTE (CONFIG_PTE_64BIT=n)
>
> That's basically what I mean. And _PAGE_KERNEL_ROX etc would then not
> set the UX bit. But at least for now it seems to be an improvement.
>
That's already the case:
#define _PAGE_KERNEL_RWX (_PAGE_BAP_SW | _PAGE_BAP_SR | _PAGE_DIRTY |
_PAGE_BAP_SX)
#define _PAGE_KERNEL_ROX (_PAGE_BAP_SR | _PAGE_BAP_SX)
Christophe
^ permalink raw reply
* Re: instruction storage exception handling
From: Christophe Leroy @ 2021-10-27 5:51 UTC (permalink / raw)
To: Nicholas Piggin, Jacques de Laval, Scott Wood
Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1635312278.p87nvl11rv.astroid@bobo.none>
Le 27/10/2021 à 07:25, Nicholas Piggin a écrit :
> Excerpts from Christophe Leroy's message of October 27, 2021 3:00 pm:
>>
>>
>> Le 27/10/2021 à 06:10, Nicholas Piggin a écrit :
>>> Excerpts from Jacques de Laval's message of October 26, 2021 6:07 am:
>>>> Hi,
>>>>
>>>> We are trying to upgrade kernel from 5.10 to 5.14.11. We have a Freescale/NXP
>>>> T1023 SOC with two e5500 cores, and are running in 32-bit mode:
>>>>
>>>> CONFIG_PPC32=y
>>>> # CONFIG_PPC64 is not set
>>>>
>>>> #
>>>> # Processor support
>>>> #
>>>> # CONFIG_PPC_BOOK3S_32 is not set
>>>> CONFIG_PPC_85xx=y
>>>> # CONFIG_PPC_8xx is not set
>>>> # CONFIG_40x is not set
>>>> # CONFIG_44x is not set
>>>> CONFIG_GENERIC_CPU=y
>>>> # CONFIG_E5500_CPU is not set
>>>> # CONFIG_E6500_CPU is not set
>>>> CONFIG_E500=y
>>>> CONFIG_PPC_E500MC=y
>>>> CONFIG_PPC_FPU_REGS=y
>>>> CONFIG_PPC_FPU=y
>>>> CONFIG_FSL_EMB_PERFMON=y
>>>> CONFIG_FSL_EMB_PERF_EVENT=y
>>>> CONFIG_FSL_EMB_PERF_EVENT_E500=y
>>>> CONFIG_BOOKE=y
>>>> CONFIG_FSL_BOOKE=y
>>>> CONFIG_PPC_FSL_BOOK3E=y
>>>> CONFIG_PTE_64BIT=y
>>>> CONFIG_PHYS_64BIT=y
>>>> CONFIG_PPC_MMU_NOHASH=y
>>>> CONFIG_PPC_BOOK3E_MMU=y
>>>> # CONFIG_PMU_SYSFS is not set
>>>> CONFIG_SMP=y
>>>> CONFIG_NR_CPUS=2
>>>> CONFIG_PPC_DOORBELL=y
>>>> # end of Processor support
>>>>
>>>> We compile using 32-bit Bootlin PPC toolchain:
>>>>
>>>> powerpc-e500mc glibc bleeding-edge 2020.08-1.
>>>>
>>>> When booting, and starting PID 1 we sometimes get a hang. Nothing but our init
>>>> process is running, and for debugging purposes our init currently looks like
>>>> this:
>>>>
>>>> int main(int argc, char *argv[]){
>>>> for (int i = 0; ; i++) {
>>>> FILE *fp = fopen("/dev/kmsg", "w");
>>>> if (fp) {
>>>> fprintf(fp, "%d\n", i);
>>>> fclose(fp);
>>>> }
>>>> sleep(1);
>>>> }
>>>> return 0;
>>>> }
>>>>
>>>> When the hangup occur we don't get any output at all from our PID 1.
>>>> The last output is from the kernel:
>>>>
>>>> Run /sbin/init as init process
>>>> with arguments:
>>>> /sbin/init
>>>> with environment:
>>>> HOME=/
>>>> TERM=linux
>>>> kgdboc=ttyS0,115200
>>>>
>>>> When issuing a backtrace on all active cpus we can see that the kernel is
>>>> handling an instruction storage exception:
>>>>
>>>> sysrq: Show backtrace of all active CPUs
>>>> sysrq: CPU0:
>>>> CPU: 0 PID: 1 Comm: init Not tainted 5.14.11 #1
>>>> NIP: c02aac78 LR: c02aac2c CTR: 00000000
>>>> REGS: c1907d40 TRAP: 0500 Not tainted (5.14.11)
>>>> MSR: 00029002 <CE,EE,ME> CR: 82244284 XER: 20000000
>>>> GPR00: 0000000f c1907e20 c1910000 0065a000 000001d0 01100cca c1907e84 0000000c
>>>> GPR08: d39a8000 000001d3 0000000c c1907f10 42244284 00000000 00740514 bfb71670
>>>> GPR16: 007040e6 00701418 b7c1a5f0 00702f18 00000000 bfb71690 0000fff1 b7c1c478
>>>> GPR24: 00708558 00701698 d3994040 00029002 c1907f20 0065a238 00000355 d39a2790
>>>> NIP [c02aac78] handle_mm_fault+0xf8/0x11f0
>>>> LR [c02aac2c] handle_mm_fault+0xac/0x11f0
>>>> Call Trace:
>>>> [c1907e20] [c02aac10] handle_mm_fault+0x90/0x11f0 (unreliable)
>>>> [c1907ec0] [c003078c] ___do_page_fault+0x26c/0x780
>>>> [c1907ef0] [c0030cd4] do_page_fault+0x34/0x100
>>>> [c1907f10] [c0000988] InstructionStorage+0x108/0x120
>>>> --- interrupt: 400 at 0x65a238
>>>> NIP: 0065a238 LR: 0052f26c CTR: 0052f260
>>>> REGS: c1907f20 TRAP: 0400 Not tainted (5.14.11)
>>>> MSR: 0002d002 <CE,EE,PR,ME> CR: 42242284 XER: 00000000
>>>> GPR00: b7be9914 bfb71620 b7c203a0 8c008000 0070400d b7c182a0 000b8260 0052f260
>>>> GPR08: 0047d448 0052f260 0000000a 00000003 42242284 00000000 00740514 bfb71670
>>>> GPR16: 007040e6 00701418 b7c1a5f0 00702f18 00000000 bfb71690 0000fff1 b7c1c478
>>>> GPR24: 00708558 00701698 00700000 00000015 b7c1c2b0 00707e20 b7c1b8a8 bfb71660
>>>> NIP [0065a238] 0x65a238
>>>> LR [0052f26c] 0x52f26c
>>>> --- interrupt: 400
>>>> Instruction dump:
>>>> 60a500c0 811f0020 57aa6cfa 813f0000 57a30026 809f004c 81080024 7d29e850
>>>> 90a1002c 5529a33e 93c10038 7d244a14 <90610034> 7d485215 91210030 41c203dc
>>>>
>>>> We have also observed that the CPU is continuously servicing the same interrupt
>>>> (north of 140k times per sec), it is not deadlocked.
>>>>
>>>> We have not yet been able to reproduce this behavior under QEMU system
>>>> emulation.
>>>>
>>>> When bisecting between 5.10 and 5.14.11 we can see that this behavior started
>>>> with commit a01a3f2ddbcda83e8572787c0ec1dcbeba86915a:
>>>>
>>>> powerpc: remove arguments from fault handler functions
>>>
>>> Thank you for the excellent work to investigate and report this.
>>>
>>>>
>>>> Our best guess that the instruction storage exception is not properly handled
>>>> and the kernel is never able to recover from the page fault, but we don't
>>>> really know how to proceed. Does anyone have any suggestions or insights?
>>>
>>> Before my patch, zero was passed to the page fault error code, after
>>> my patch it passes the contents of ESR SPR.
>>>
>>> It looks like the BookE instruction access interrupt does not set ESR
>>> (except for BO interrupts maybe?) so you're getting what was in the ESR
>>> register from a previous interrupt, and maybe if that was a store then
>>> access_error won't cause a segfault because is_exec is true so that
>>> test bails out early, then it might just keep retrying the interrupt.
>>>
>>> That could explain why you don't always see the same thing.
>>>
>>> Now previous code still saved ESR in regs->esr/dsisr for some reason.
>>> I can't quite see why that should have been necessary though. Does
According to the e500 Reference Manual, on ISI:
BO is set if the instruction fetch caused a byte-ordering exception;
otherwise cleared. *All* other defined ESR bits are *cleared*.
Christophe
>>> this patch solve it for you?
>>>
>>> Thanks,
>>> Nick
>>> --
>>>
>>>
>>> diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
>>> index e5503420b6c6..0e7cdc8716eb 100644
>>> --- a/arch/powerpc/kernel/head_booke.h
>>> +++ b/arch/powerpc/kernel/head_booke.h
>>> @@ -467,10 +467,11 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
>>>
>>> #define INSTRUCTION_STORAGE_EXCEPTION \
>>> START_EXCEPTION(InstructionStorage) \
>>> - NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
>>> - mfspr r5,SPRN_ESR; /* Grab the ESR and save it */ \
>>> + NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
>>> + li r5,0; \
>>> + mfspr r5,SPRN_ESR; /* Store 0 in regs->esr (dsisr) */ \
>>
>> I can't see how that can help, you set r5 to 0 and immediately after you
>> reload ESR into r5 so you are still saving garbage into _ESR(r11)
>>
>
> Oops, stupid mistake.
>
> Thanks,
> Nick
>
> diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
> index e5503420b6c6..094c9790a490 100644
> --- a/arch/powerpc/kernel/head_booke.h
> +++ b/arch/powerpc/kernel/head_booke.h
> @@ -467,10 +467,10 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
>
> #define INSTRUCTION_STORAGE_EXCEPTION \
> START_EXCEPTION(InstructionStorage) \
> - NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
> - mfspr r5,SPRN_ESR; /* Grab the ESR and save it */ \
> + NORMAL_EXCEPTION_PROLOG(0x400, INST_STORAGE); \
> + li r5,0; /* Store 0 in regs->esr (dsisr) */ \
> stw r5,_ESR(r11); \
> - stw r12, _DEAR(r11); /* Pass SRR0 as arg2 */ \
> + stw r12, _DEAR(r11); /* Set regs->dear (dar) */ \
> prepare_transfer_to_handler; \
> bl do_page_fault; \
> b interrupt_return
>
^ permalink raw reply
* [PATCH] MAINTAINERS: Update powerpc KVM entry
From: Michael Ellerman @ 2021-10-27 6:16 UTC (permalink / raw)
To: linuxppc-dev, paulus; +Cc: kvm, pbonzini, kvm-ppc, npiggin, linux-kernel
Paul is no longer handling patches for kvmppc.
Instead we'll treat them as regular powerpc patches, taking them via the
powerpc tree, using the topic/ppc-kvm branch when necessary.
Also drop the web reference, it doesn't have any information
specifically relevant to powerpc KVM.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
MAINTAINERS | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index ca6d6fde85cf..fbfd3345c40d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10260,11 +10260,8 @@ F: arch/mips/include/uapi/asm/kvm*
F: arch/mips/kvm/
KERNEL VIRTUAL MACHINE FOR POWERPC (KVM/powerpc)
-M: Paul Mackerras <paulus@ozlabs.org>
-L: kvm-ppc@vger.kernel.org
-S: Supported
-W: http://www.linux-kvm.org/
-T: git git://github.com/agraf/linux-2.6.git
+L: linuxppc-dev@lists.ozlabs.org
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git topic/ppc-kvm
F: arch/powerpc/include/asm/kvm*
F: arch/powerpc/include/uapi/asm/kvm*
F: arch/powerpc/kernel/kvm*
--
2.31.1
^ permalink raw reply related
* Re: [PATCH 2/3] powerpc/book3e: Fix set_memory_x() and set_memory_nx()
From: Nicholas Piggin @ 2021-10-27 7:15 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Christophe Leroy, Michael Ellerman,
Paul Mackerras
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <8ccb9629-43fc-2f36-c9e4-61d6898fb80d@csgroup.eu>
Excerpts from Christophe Leroy's message of October 27, 2021 3:50 pm:
>
>
> Le 27/10/2021 à 07:27, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of October 27, 2021 2:55 pm:
>>>
>>>
>>> Le 27/10/2021 à 06:44, Nicholas Piggin a écrit :
>>>> Excerpts from Christophe Leroy's message of October 26, 2021 3:39 pm:
>>>>> set_memory_x() calls pte_mkexec() which sets _PAGE_EXEC.
>>>>> set_memory_nx() calls pte_exprotec() which clears _PAGE_EXEC.
>>>>>
>>>>> Book3e has 2 bits, UX and SX, which defines the exec rights
>>>>> resp. for user (PR=1) and for kernel (PR=0).
>>>>>
>>>>> _PAGE_EXEC is defined as UX only.
>>>>>
>>>>> An executable kernel page is set with either _PAGE_KERNEL_RWX
>>>>> or _PAGE_KERNEL_ROX, which both have SX set and UX cleared.
>>>>>
>>>>> So set_memory_nx() call for an executable kernel page does
>>>>> nothing because UX is already cleared.
>>>>>
>>>>> And set_memory_x() on a non-executable kernel page makes it
>>>>> executable for the user and keeps it non-executable for kernel.
>>>>>
>>>>> Also, pte_exec() always returns 'false' on kernel pages, because
>>>>> it checks _PAGE_EXEC which doesn't include SX, so for instance
>>>>> the W+X check doesn't work.
>>>>>
>>>>> To fix this:
>>>>> - change tlb_low_64e.S to use _PAGE_BAP_UX instead of _PAGE_USER
>>>>> - sets both UX and SX in _PAGE_EXEC so that pte_user() returns
>>>>> true whenever one of the two bits is set
>>>>
>>>> I don't understand this change. Which pte_user() returns true after
>>>> this change? Or do you mean pte_exec()?
>>>
>>> Oops, yes, I mean pte_exec()
>>>
>>> Unless I have to re-spin, can Michael eventually fix that typo while
>>> applying ?
>>>
>>>>
>>>> Does this filter through in some cases at least for kernel executable
>>>> PTEs will get both bits set? Seems cleaner to distinguish user and
>>>> kernel exec for that but maybe it's a lot of churn?
>>>
>>> Didn't understand what you mean.
>>>
>>> I did it like that to be able to continue using _PAGE_EXEC for checking
>>> executability regardless of whether this is user or kernel, and then
>>> continue using the generic nohash pte_exec() helper.
>>>
>>> Other solution would be to get rid of _PAGE_EXEC completely for book3e
>>> and implement both pte_exec() and pte_mkexec() with _PAGE_BAP_UX and
>>> _PAGE_BAP_SX, but I'm not sure it is worth the churn as you say. It
>>> would also mean different helpers for book3s/32 when it is using 32 bits
>>> PTE (CONFIG_PTE_64BIT=n)
>>
>> That's basically what I mean. And _PAGE_KERNEL_ROX etc would then not
>> set the UX bit. But at least for now it seems to be an improvement.
>>
>
> That's already the case:
>
> #define _PAGE_KERNEL_RWX (_PAGE_BAP_SW | _PAGE_BAP_SR | _PAGE_DIRTY |
> _PAGE_BAP_SX)
> #define _PAGE_KERNEL_ROX (_PAGE_BAP_SR | _PAGE_BAP_SX)
So it is, I was looking at the wrong header.
Looks okay to me then, for what it's worth.
Thanks,
Nick
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox