LinuxPPC-Dev Archive on lore.kernel.org

* [PATCH v5 0/2] fix & prevent the missing preemption disabling
From: 王贇 @ 2021-10-26  3:14 UTC
  To: Guo Ren, Steven Rostedt, Ingo Molnar, James E.J. Bottomley,
	Helge Deller, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Borislav Petkov, x86, H. Peter Anvin,
	Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
	Joe Lawrence, Masami Hiramatsu, Peter Zijlstra (Intel),
	Michael Wang, Nicholas Piggin, Jisheng Zhang, linux-csky,
	linux-kernel, linux-parisc, linuxppc-dev, linux-riscv,
	live-patching

Testing shows that perf_ftrace_function_call() uses smp_processor_id()
with preemption enabled, so every CPU check it makes can be wrong once
the task is preempted.

As Peter pointed out, the section between the
ftrace_test_recursion_trylock() and ftrace_test_recursion_unlock() pair
requires preemption to be disabled, as
'Documentation/trace/ftrace-uses.rst' explains, but currently callers
have to disable it themselves, outside of the helpers.

Since the internal users of trace_test_and_set_recursion() and
trace_clear_recursion() also require preemption to be disabled, we
can simply merge the logic into the helpers.

Patch 1/2 makes sure preemption is disabled when the recursion lock is
taken successfully; patch 2/2 moves the smp_processor_id() check after
trylock() to address the issue.

v1: https://lore.kernel.org/all/8c7de46d-9869-aa5e-2bb9-5dbc2eda395e@linux.alibaba.com/
v2: https://lore.kernel.org/all/b1d7fe43-ce84-0ed7-32f7-ea1d12d0b716@linux.alibaba.com/
v3: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
v4: https://lore.kernel.org/all/32a36348-69ee-6464-390c-3a8d6e9d2b53@linux.alibaba.com/

Michael Wang (2):
  ftrace: disable preemption when recursion locked
  ftrace: do CPU checking after preemption disabled

 arch/csky/kernel/probes/ftrace.c     |  2 --
 arch/parisc/kernel/ftrace.c          |  2 --
 arch/powerpc/kernel/kprobes-ftrace.c |  2 --
 arch/riscv/kernel/probes/ftrace.c    |  2 --
 arch/x86/kernel/kprobes/ftrace.c     |  2 --
 include/linux/trace_recursion.h      | 11 ++++++++++-
 kernel/livepatch/patch.c             | 13 +++++++------
 kernel/trace/ftrace.c                | 15 +++++----------
 kernel/trace/trace_event_perf.c      |  6 +++---
 kernel/trace/trace_functions.c       |  5 -----
 10 files changed, 25 insertions(+), 35 deletions(-)

-- 
1.8.3.1



* [PATCH v5 1/2] ftrace: disable preemption when recursion locked
From: 王贇 @ 2021-10-26  3:15 UTC
  To: Guo Ren, Steven Rostedt, Ingo Molnar, James E.J. Bottomley,
	Helge Deller, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Borislav Petkov, x86, H. Peter Anvin,
	Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
	Joe Lawrence, Masami Hiramatsu, Peter Zijlstra (Intel),
	Nicholas Piggin, Jisheng Zhang, linux-csky, linux-kernel,
	linux-parisc, linuxppc-dev, linux-riscv, live-patching
In-Reply-To: <3ca92dc9-ea04-ddc2-71cd-524bfa5a5721@linux.alibaba.com>

As the documentation explains, ftrace_test_recursion_trylock()
and ftrace_test_recursion_unlock() are supposed to disable and
enable preemption properly. Currently, however, this work is done
outside of the functions, where it can be missed by mistake.

Since the internal users of trace_test_and_set_recursion() and
trace_clear_recursion() also require preemption to be disabled, we
can simply merge the logic into the helpers.

This patch makes sure that preemption has been disabled when
trace_test_and_set_recursion() returns bit >= 0, and that
trace_clear_recursion() re-enables preemption if it was previously
enabled.
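
As a rough illustration of that contract, here is a minimal user-space
sketch. It is a model only: model_preempt_count, model_recursion,
model_trylock() and model_unlock() are hypothetical stand-ins rather than
the kernel helpers, and the selection of the context bit is left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for per-task kernel state (assumptions, not kernel code). */
static int model_preempt_count;          /* > 0 means preemption is disabled */
static unsigned long model_recursion;    /* models current->trace_recursion */

/*
 * Model of the trylock side: returns the bit (>= 0) on success with
 * "preemption" disabled, or -1 on recursion with the count untouched.
 */
static int model_trylock(int bit)
{
	if (model_recursion & (1UL << bit))
		return -1;

	model_recursion |= 1UL << bit;
	model_preempt_count++;
	return bit;
}

/* Model of the unlock side: re-enable "preemption", then clear the bit. */
static void model_unlock(int bit)
{
	assert(model_preempt_count > 0);	/* must pair with a successful trylock */
	model_preempt_count--;
	model_recursion &= ~(1UL << bit);
}
```

With the disabling folded into the pair, a caller in this model only has
to check for bit < 0 and call model_unlock(bit) on the way out; there is
no separate preempt call left to forget.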

CC: Petr Mladek <pmladek@suse.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Miroslav Benes <mbenes@suse.cz>
Reported-by: Abaci <abaci@linux.alibaba.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 arch/csky/kernel/probes/ftrace.c     |  2 --
 arch/parisc/kernel/ftrace.c          |  2 --
 arch/powerpc/kernel/kprobes-ftrace.c |  2 --
 arch/riscv/kernel/probes/ftrace.c    |  2 --
 arch/x86/kernel/kprobes/ftrace.c     |  2 --
 include/linux/trace_recursion.h      | 11 ++++++++++-
 kernel/livepatch/patch.c             | 13 +++++++------
 kernel/trace/ftrace.c                | 15 +++++----------
 kernel/trace/trace_functions.c       |  5 -----
 9 files changed, 22 insertions(+), 32 deletions(-)

diff --git a/arch/csky/kernel/probes/ftrace.c b/arch/csky/kernel/probes/ftrace.c
index b388228..834cffc 100644
--- a/arch/csky/kernel/probes/ftrace.c
+++ b/arch/csky/kernel/probes/ftrace.c
@@ -17,7 +17,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 		return;

 	regs = ftrace_get_regs(fregs);
-	preempt_disable_notrace();
 	p = get_kprobe((kprobe_opcode_t *)ip);
 	if (!p) {
 		p = get_kprobe((kprobe_opcode_t *)(ip - MCOUNT_INSN_SIZE));
@@ -57,7 +56,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 		__this_cpu_write(current_kprobe, NULL);
 	}
 out:
-	preempt_enable_notrace();
 	ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
index 7d14242..90c4345 100644
--- a/arch/parisc/kernel/ftrace.c
+++ b/arch/parisc/kernel/ftrace.c
@@ -210,7 +210,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 		return;

 	regs = ftrace_get_regs(fregs);
-	preempt_disable_notrace();
 	p = get_kprobe((kprobe_opcode_t *)ip);
 	if (unlikely(!p) || kprobe_disabled(p))
 		goto out;
@@ -239,7 +238,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 	}
 	__this_cpu_write(current_kprobe, NULL);
 out:
-	preempt_enable_notrace();
 	ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
index 7154d58..072ebe7 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -26,7 +26,6 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
 		return;

 	regs = ftrace_get_regs(fregs);
-	preempt_disable_notrace();
 	p = get_kprobe((kprobe_opcode_t *)nip);
 	if (unlikely(!p) || kprobe_disabled(p))
 		goto out;
@@ -61,7 +60,6 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
 		__this_cpu_write(current_kprobe, NULL);
 	}
 out:
-	preempt_enable_notrace();
 	ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/riscv/kernel/probes/ftrace.c b/arch/riscv/kernel/probes/ftrace.c
index aab85a8..7142ec4 100644
--- a/arch/riscv/kernel/probes/ftrace.c
+++ b/arch/riscv/kernel/probes/ftrace.c
@@ -15,7 +15,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 	if (bit < 0)
 		return;

-	preempt_disable_notrace();
 	p = get_kprobe((kprobe_opcode_t *)ip);
 	if (unlikely(!p) || kprobe_disabled(p))
 		goto out;
@@ -52,7 +51,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 		__this_cpu_write(current_kprobe, NULL);
 	}
 out:
-	preempt_enable_notrace();
 	ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 596de2f..dd2ec14 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -25,7 +25,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 	if (bit < 0)
 		return;

-	preempt_disable_notrace();
 	p = get_kprobe((kprobe_opcode_t *)ip);
 	if (unlikely(!p) || kprobe_disabled(p))
 		goto out;
@@ -59,7 +58,6 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 		__this_cpu_write(current_kprobe, NULL);
 	}
 out:
-	preempt_enable_notrace();
 	ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index abe1a50..2bc1522 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,6 +135,9 @@ static __always_inline int trace_get_context_bit(void)
 # define do_ftrace_record_recursion(ip, pip)	do { } while (0)
 #endif

+/*
+ * Preemption is promised to be disabled when return bit > 0.
+ */
 static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip,
 							int start)
 {
@@ -162,11 +165,17 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign
 	current->trace_recursion = val;
 	barrier();

+	preempt_disable_notrace();
+
 	return bit;
 }

+/*
+ * Preemption will be enabled (if it was previously enabled).
+ */
 static __always_inline void trace_clear_recursion(int bit)
 {
+	preempt_enable_notrace();
 	barrier();
 	trace_recursion_clear(bit);
 }
@@ -178,7 +187,7 @@ static __always_inline void trace_clear_recursion(int bit)
  * tracing recursed in the same context (normal vs interrupt),
  *
  * Returns: -1 if a recursion happened.
- *           >= 0 if no recursion
+ *           > 0 if no recursion.
  */
 static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
 							 unsigned long parent_ip)
diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index e8029ae..b8d75fb 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -49,14 +49,16 @@ static void notrace klp_ftrace_handler(unsigned long ip,

 	ops = container_of(fops, struct klp_ops, fops);

+	/*
+	 *
+	 * The ftrace_test_recursion_trylock() will disable preemption,
+	 * which is required for the variant of synchronize_rcu() that is
+	 * used to allow patching functions where RCU is not watching.
+	 * See klp_synchronize_transition() for more details.
+	 */
 	bit = ftrace_test_recursion_trylock(ip, parent_ip);
 	if (WARN_ON_ONCE(bit < 0))
 		return;
-	/*
-	 * A variant of synchronize_rcu() is used to allow patching functions
-	 * where RCU is not watching, see klp_synchronize_transition().
-	 */
-	preempt_disable_notrace();

 	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
 				      stack_node);
@@ -120,7 +122,6 @@ static void notrace klp_ftrace_handler(unsigned long ip,
 	klp_arch_set_pc(fregs, (unsigned long)func->new_func);

 unlock:
-	preempt_enable_notrace();
 	ftrace_test_recursion_unlock(bit);
 }

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b7be1df..7392bc7 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7198,16 +7198,15 @@ void ftrace_reset_array_ops(struct trace_array *tr)
 	struct ftrace_ops *op;
 	int bit;

+	/*
+	 * The ftrace_test_and_set_recursion() will disable preemption,
+	 * which is required since some of the ops may be dynamically
+	 * allocated, they must be freed after a synchronize_rcu().
+	 */
 	bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_LIST_START);
 	if (bit < 0)
 		return;

-	/*
-	 * Some of the ops may be dynamically allocated,
-	 * they must be freed after a synchronize_rcu().
-	 */
-	preempt_disable_notrace();
-
 	do_for_each_ftrace_op(op, ftrace_ops_list) {
 		/* Stub functions don't need to be called nor tested */
 		if (op->flags & FTRACE_OPS_FL_STUB)
@@ -7231,7 +7230,6 @@ void ftrace_reset_array_ops(struct trace_array *tr)
 		}
 	} while_for_each_ftrace_op(op);
 out:
-	preempt_enable_notrace();
 	trace_clear_recursion(bit);
 }

@@ -7279,12 +7277,9 @@ static void ftrace_ops_assist_func(unsigned long ip, unsigned long parent_ip,
 	if (bit < 0)
 		return;

-	preempt_disable_notrace();
-
 	if (!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching())
 		op->func(ip, parent_ip, op, fregs);

-	preempt_enable_notrace();
 	trace_clear_recursion(bit);
 }
 NOKPROBE_SYMBOL(ftrace_ops_assist_func);
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 1f0e63f..9f1bfbe 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -186,7 +186,6 @@ static void function_trace_start(struct trace_array *tr)
 		return;

 	trace_ctx = tracing_gen_ctx();
-	preempt_disable_notrace();

 	cpu = smp_processor_id();
 	data = per_cpu_ptr(tr->array_buffer.data, cpu);
@@ -194,7 +193,6 @@ static void function_trace_start(struct trace_array *tr)
 		trace_function(tr, ip, parent_ip, trace_ctx);

 	ftrace_test_recursion_unlock(bit);
-	preempt_enable_notrace();
 }

 #ifdef CONFIG_UNWINDER_ORC
@@ -298,8 +296,6 @@ static inline void process_repeats(struct trace_array *tr,
 	if (bit < 0)
 		return;

-	preempt_disable_notrace();
-
 	cpu = smp_processor_id();
 	data = per_cpu_ptr(tr->array_buffer.data, cpu);
 	if (atomic_read(&data->disabled))
@@ -324,7 +320,6 @@ static inline void process_repeats(struct trace_array *tr,

 out:
 	ftrace_test_recursion_unlock(bit);
-	preempt_enable_notrace();
 }

 static void
-- 
1.8.3.1



* [PATCH v5 2/2] ftrace: do CPU checking after preemption disabled
From: 王贇 @ 2021-10-26  3:15 UTC
  To: Guo Ren, Steven Rostedt, Ingo Molnar, James E.J. Bottomley,
	Helge Deller, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Borislav Petkov, x86, H. Peter Anvin,
	Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
	Joe Lawrence, Masami Hiramatsu, Peter Zijlstra (Intel),
	Nicholas Piggin, Jisheng Zhang, linux-csky, linux-kernel,
	linux-parisc, linuxppc-dev, linux-riscv, live-patching
In-Reply-To: <3ca92dc9-ea04-ddc2-71cd-524bfa5a5721@linux.alibaba.com>

With CONFIG_DEBUG_PREEMPT we observed reports like:

  BUG: using smp_processor_id() in preemptible
  caller is perf_ftrace_function_call+0x6f/0x2e0
  CPU: 1 PID: 680 Comm: a.out Not tainted
  Call Trace:
   <TASK>
   dump_stack_lvl+0x8d/0xcf
   check_preemption_disabled+0x104/0x110
   ? optimize_nops.isra.7+0x230/0x230
   ? text_poke_bp_batch+0x9f/0x310
   perf_ftrace_function_call+0x6f/0x2e0
   ...
   __text_poke+0x5/0x620
   text_poke_bp_batch+0x9f/0x310

This tells us that the CPU can change after the task is preempted, so a
CPU check made before preemption is disabled may no longer be valid.

Since ftrace_test_recursion_trylock() now disables preemption, this
patch simply does the check after trylock() to address the issue.
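
The reordering can be sketched in user space as follows. This is a model
under stated assumptions, not the perf code: model_cpu stands in for
smp_processor_id(), and model_trylock(), model_unlock() and
model_handler() are hypothetical names.

```c
#include <assert.h>

/* Hypothetical stand-ins for kernel state (assumptions, not kernel code). */
static int model_preempt_count;   /* > 0 means preemption is disabled */
static int model_locked;          /* models the recursion bit */
static int model_cpu;             /* stand-in for smp_processor_id() */

static int model_trylock(void)
{
	if (model_locked)
		return -1;
	model_locked = 1;
	model_preempt_count++;
	return 0;
}

static void model_unlock(int bit)
{
	(void)bit;
	assert(model_preempt_count > 0);	/* must pair with a successful trylock */
	model_preempt_count--;
	model_locked = 0;
}

/* Returns 1 if the event would be handled on this CPU, 0 otherwise. */
static int model_handler(int target_cpu)
{
	int handled = 0;
	int bit = model_trylock();

	if (bit < 0)
		return 0;

	/* The CPU id is only compared while the preempt count is raised. */
	if (target_cpu != model_cpu)
		goto out;	/* the lock is held, so a plain return would leak it */

	handled = 1;
out:
	model_unlock(bit);
	return handled;
}
```

Because the check now sits after the lock is taken, the mismatch path has
to release the lock, which is why the hunk below uses "goto out" instead
of "return".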

CC: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 kernel/trace/trace_event_perf.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 6aed10e..fba8cb7 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -441,13 +441,13 @@ void perf_trace_buf_update(void *record, u16 type)
 	if (!rcu_is_watching())
 		return;

-	if ((unsigned long)ops->private != smp_processor_id())
-		return;
-
 	bit = ftrace_test_recursion_trylock(ip, parent_ip);
 	if (bit < 0)
 		return;

+	if ((unsigned long)ops->private != smp_processor_id())
+		goto out;
+
 	event = container_of(ops, struct perf_event, ftrace_ops);

 	/*
-- 
1.8.3.1


* [PATCH v2] macintosh/via-pmu-led: make disk activity usage a parameter.
From: Hill Ma @ 2021-10-26  3:32 UTC
  To: benh, linuxppc-dev; +Cc: Hill Ma, linux-kernel, linux-doc

Whether to use the LED as a disk activity indicator is a user
preference. Some like this usage while others find the LED too bright.
So it might be a good idea to make this choice a runtime parameter
rather than a compile-time config option.

The default is set to disabled as OS X does not use the LED as a
disk activity indicator.
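
For illustration, the runtime choice can be modelled in user space as
below. This is only a sketch: parse_bool_param() is a hypothetical,
simplified parser that accepts just the forms listed in the documentation
hunk (1/Y/y and 0/N/n), and pick_default_trigger() is a hypothetical
helper mirroring the check added to via_pmu_led_init().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified model of bool parameter parsing (an assumption, not the
 * kernel's parser). Returns 0 and stores the value on success, -1 on
 * unrecognised input, in which case *res is left unchanged.
 */
static int parse_bool_param(const char *s, bool *res)
{
	assert(res != NULL);

	if (s == NULL || s[0] == '\0')
		return -1;

	switch (s[0]) {
	case '1': case 'y': case 'Y':
		*res = true;
		return 0;
	case '0': case 'n': case 'N':
		*res = false;
		return 0;
	default:
		return -1;
	}
}

/* Pick the LED's default trigger from the parsed value; NULL means none. */
static const char *pick_default_trigger(bool led_disk)
{
	return led_disk ? "disk-activity" : NULL;
}
```

With the parameter left at its default (disabled), the model returns no
default trigger, matching the behaviour described above; passing something
like adb_pmu_led_disk=1 at boot would select "disk-activity".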

Signed-off-by: Hill Ma <maahiuzeon@gmail.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 ++++++
 drivers/macintosh/Kconfig                       | 10 ----------
 drivers/macintosh/via-pmu-led.c                 | 11 ++++++++---
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 43dc35fe5bc0..a656a51ba0a8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -250,6 +250,12 @@
 			Use timer override. For some broken Nvidia NF5 boards
 			that require a timer override, but don't have HPET
 
+	adb_pmu_led_disk [PPC]
+			Use front LED as disk LED by default. Only applies to
+			PowerBook, iBook, PowerMac 7,2/7,3.
+			Format: <bool>  (1/Y/y=enable, 0/N/n=disable)
+			Default: disabled
+
 	add_efi_memmap	[EFI; X86] Include EFI memory map in
 			kernel's map of available physical RAM.
 
diff --git a/drivers/macintosh/Kconfig b/drivers/macintosh/Kconfig
index 5cdc361da37c..243215de563c 100644
--- a/drivers/macintosh/Kconfig
+++ b/drivers/macintosh/Kconfig
@@ -78,16 +78,6 @@ config ADB_PMU_LED
 	  behaviour of the old CONFIG_BLK_DEV_IDE_PMAC_BLINK, select this
 	  and the disk LED trigger and configure appropriately through sysfs.
 
-config ADB_PMU_LED_DISK
-	bool "Use front LED as DISK LED by default"
-	depends on ADB_PMU_LED
-	depends on LEDS_CLASS
-	select LEDS_TRIGGERS
-	select LEDS_TRIGGER_DISK
-	help
-	  This option makes the front LED default to the disk trigger
-	  so that it blinks on disk activity.
-
 config PMAC_SMU
 	bool "Support for SMU  based PowerMacs"
 	depends on PPC_PMAC64
diff --git a/drivers/macintosh/via-pmu-led.c b/drivers/macintosh/via-pmu-led.c
index ae067ab2373d..faf39a5962aa 100644
--- a/drivers/macintosh/via-pmu-led.c
+++ b/drivers/macintosh/via-pmu-led.c
@@ -25,6 +25,7 @@
 #include <linux/leds.h>
 #include <linux/adb.h>
 #include <linux/pmu.h>
+#include <linux/moduleparam.h>
 #include <asm/prom.h>
 
 static spinlock_t pmu_blink_lock;
@@ -71,11 +72,10 @@ static void pmu_led_set(struct led_classdev *led_cdev,
  	spin_unlock_irqrestore(&pmu_blink_lock, flags);
 }
 
+bool adb_pmu_led_disk;
+
 static struct led_classdev pmu_led = {
 	.name = "pmu-led::front",
-#ifdef CONFIG_ADB_PMU_LED_DISK
-	.default_trigger = "disk-activity",
-#endif
 	.brightness_set = pmu_led_set,
 };
 
@@ -106,6 +106,9 @@ static int __init via_pmu_led_init(void)
 	}
 	of_node_put(dt);
 
+	if (adb_pmu_led_disk)
+		pmu_led.default_trigger = "disk-activity";
+
 	spin_lock_init(&pmu_blink_lock);
 	/* no outstanding req */
 	pmu_blink_req.complete = 1;
@@ -114,4 +117,6 @@ static int __init via_pmu_led_init(void)
 	return led_classdev_register(NULL, &pmu_led);
 }
 
+core_param(adb_pmu_led_disk, adb_pmu_led_disk, bool, 0644);
+
 late_initcall(via_pmu_led_init);
-- 
2.33.1



* Re: [PATCH v11 2/3] tty: hvc: pass DMA capable memory to put_chars()
From: Jiri Slaby @ 2021-10-26  5:10 UTC
  To: Xianting Tian, gregkh, amit, arnd, osandov
  Cc: shile.zhang, linuxppc-dev, linux-kernel, virtualization
In-Reply-To: <20211015024658.1353987-3-xianting.tian@linux.alibaba.com>

On 15. 10. 21, 4:46, Xianting Tian wrote:
> @@ -151,9 +142,11 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
>   static void hvc_console_print(struct console *co, const char *b,
>   			      unsigned count)
>   {
> -	char c[N_OUTBUF] __ALIGNED__;
> +	char *c;
>   	unsigned i = 0, n = 0;
>   	int r, donecr = 0, index = co->index;
> +	unsigned long flags;
> +	struct hvc_struct *hp;
>   
>   	/* Console access attempt outside of acceptable console range. */
>   	if (index >= MAX_NR_HVC_CONSOLES)
> @@ -163,6 +156,13 @@ static void hvc_console_print(struct console *co, const char *b,
>   	if (vtermnos[index] == -1)
>   		return;
>   
> +	hp = cons_hvcs[index];
> +	if (!hp)
> +		return;

You effectively make the console unusable until someone calls
hvc_alloc() for this device, correct? This doesn't look right. Nor do
you describe this change of behaviour in the commit log.

regards,
-- 
js
suse labs


* [PATCH 1/3] powerpc/nohash: Fix __ptep_set_access_flags() and ptep_set_wrprotect()
From: Christophe Leroy @ 2021-10-26  5:39 UTC
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

Commit 26973fa5ac0e ("powerpc/mm: use pte helpers in generic code")
changed those two functions to use pte helpers to determine which
bits to clear and which bits to set.

This change was based on the assumption that bits to be set/cleared
are always the same and can be determined by applying the pte
manipulation helpers on __pte(0).

But on platforms like book3e, the bits depend on whether the page
is a user page or not.

For the time being it more or less works because _PAGE_EXEC is
used for user pages only and the exec right is always set on
kernel pages. But the following patch will clean that up, and the
output of pte_mkexec() will then depend on whether the page is a user
or a kernel page.

Instead of trying to make an even more complicated helper where bits
would become dependent on the final pte value, come back to a more
static situation like before commit 26973fa5ac0e ("powerpc/mm: use
pte helpers in generic code"), by introducing an 8xx specific
version of __ptep_set_access_flags() and ptep_set_wrprotect().
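
To make the failure mode concrete, here is a small user-space model. It
is a sketch under assumptions: the bit values are made up for
illustration, and mkexec(), derived_set_mask(), pte_update_model() and
the two set_access_flags_*() functions are hypothetical names, not kernel
code.

```c
#include <assert.h>

/* Hypothetical bit values for illustration; not the real PTE layout. */
#define BAP_UR 0x1UL	/* user read */
#define BAP_UX 0x2UL	/* user execute */
#define BAP_SX 0x4UL	/* supervisor execute */

/* A value-dependent helper: which exec bit is set depends on UR. */
static unsigned long mkexec(unsigned long pte)
{
	if (pte & BAP_UR)
		return (pte & ~BAP_SX) | BAP_UX;
	return (pte & ~BAP_UX) | BAP_SX;
}

/* Mask derivation by probing the helper with an all-zero PTE. */
static unsigned long derived_set_mask(void)
{
	return mkexec(0UL);
}

/* Model of pte_update(): clear bits, then set bits. */
static unsigned long pte_update_model(unsigned long old, unsigned long clr,
				      unsigned long set)
{
	assert((clr & set) == 0);	/* sanity check of this model only */
	return (old & ~clr) | set;
}

/* Access-flags update using the mask derived from the helper. */
static unsigned long set_access_flags_derived(unsigned long old,
					      unsigned long entry)
{
	unsigned long set = entry & derived_set_mask();

	return pte_update_model(old, 0, set);
}

/* Access-flags update using a static mask of the bits that may be set. */
static unsigned long set_access_flags_static(unsigned long old,
					     unsigned long entry)
{
	unsigned long set = entry & (BAP_UX | BAP_SX);

	return pte_update_model(old, 0, set);
}
```

In this model the derived mask only contains SX, because the all-zero
probe looks like a kernel page, so a user entry asking for UX is silently
dropped. The static mask does not depend on what the helper does with a
particular value.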

Fixes: 26973fa5ac0e ("powerpc/mm: use pte helpers in generic code")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: No change
v2: New
---
 arch/powerpc/include/asm/nohash/32/pgtable.h | 17 +++++++--------
 arch/powerpc/include/asm/nohash/32/pte-8xx.h | 22 ++++++++++++++++++++
 2 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 34ce50da1850..11c6849f7864 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -306,30 +306,29 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 }
 
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
+#ifndef ptep_set_wrprotect
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 				      pte_t *ptep)
 {
-	unsigned long clr = ~pte_val(pte_wrprotect(__pte(~0)));
-	unsigned long set = pte_val(pte_wrprotect(__pte(0)));
-
-	pte_update(mm, addr, ptep, clr, set, 0);
+	pte_update(mm, addr, ptep, _PAGE_RW, 0, 0);
 }
+#endif
 
+#ifndef __ptep_set_access_flags
 static inline void __ptep_set_access_flags(struct vm_area_struct *vma,
 					   pte_t *ptep, pte_t entry,
 					   unsigned long address,
 					   int psize)
 {
-	pte_t pte_set = pte_mkyoung(pte_mkdirty(pte_mkwrite(pte_mkexec(__pte(0)))));
-	pte_t pte_clr = pte_mkyoung(pte_mkdirty(pte_mkwrite(pte_mkexec(__pte(~0)))));
-	unsigned long set = pte_val(entry) & pte_val(pte_set);
-	unsigned long clr = ~pte_val(entry) & ~pte_val(pte_clr);
+	unsigned long set = pte_val(entry) &
+			    (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW | _PAGE_EXEC);
 	int huge = psize > mmu_virtual_psize ? 1 : 0;
 
-	pte_update(vma->vm_mm, address, ptep, clr, set, huge);
+	pte_update(vma->vm_mm, address, ptep, 0, set, huge);
 
 	flush_tlb_page(vma, address);
 }
+#endif
 
 static inline int pte_young(pte_t pte)
 {
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index fcc48d590d88..1a89ebdc3acc 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -136,6 +136,28 @@ static inline pte_t pte_mkhuge(pte_t pte)
 
 #define pte_mkhuge pte_mkhuge
 
+static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p,
+				     unsigned long clr, unsigned long set, int huge);
+
+static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	pte_update(mm, addr, ptep, 0, _PAGE_RO, 0);
+}
+#define ptep_set_wrprotect ptep_set_wrprotect
+
+static inline void __ptep_set_access_flags(struct vm_area_struct *vma, pte_t *ptep,
+					   pte_t entry, unsigned long address, int psize)
+{
+	unsigned long set = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_EXEC);
+	unsigned long clr = ~pte_val(entry) & _PAGE_RO;
+	int huge = psize > mmu_virtual_psize ? 1 : 0;
+
+	pte_update(vma->vm_mm, address, ptep, clr, set, huge);
+
+	flush_tlb_page(vma, address);
+}
+#define __ptep_set_access_flags __ptep_set_access_flags
+
 static inline unsigned long pgd_leaf_size(pgd_t pgd)
 {
 	if (pgd_val(pgd) & _PMD_PAGE_8M)
-- 
2.31.1



* [PATCH 2/3] powerpc/book3e: Fix set_memory_x() and set_memory_nx()
From: Christophe Leroy @ 2021-10-26  5:39 UTC
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <922bdab3a220781bae2360ff3dd5adb7fe4d34f1.1635226743.git.christophe.leroy@csgroup.eu>

set_memory_x() calls pte_mkexec() which sets _PAGE_EXEC.
set_memory_nx() calls pte_exprotect() which clears _PAGE_EXEC.

Book3e has 2 bits, UX and SX, which define the exec rights
for user (PR=1) and for kernel (PR=0) respectively.

_PAGE_EXEC is defined as UX only.

An executable kernel page is set with either _PAGE_KERNEL_RWX
or _PAGE_KERNEL_ROX, which both have SX set and UX cleared.

So set_memory_nx() call for an executable kernel page does
nothing because UX is already cleared.

And set_memory_x() on a non-executable kernel page makes it
executable for the user and keeps it non-executable for kernel.

Also, pte_exec() always returns 'false' on kernel pages, because
it checks _PAGE_EXEC which doesn't include SX, so for instance
the W+X check doesn't work.

To fix this:
- change tlb_low_64e.S to use _PAGE_BAP_UX instead of _PAGE_EXEC
- set both UX and SX in _PAGE_EXEC so that pte_exec() returns
true whenever one of the two bits is set and pte_exprotect()
clears both bits.
- define a book3e specific version of pte_mkexec() which sets
either SX or UX based on UR.
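
The effect of widening _PAGE_EXEC can be checked with a small user-space
model. This is a sketch only: the bit values are made up, and EXEC_OLD,
EXEC_NEW, model_exprotect() and model_pte_exec() are hypothetical names
standing in for the old and new _PAGE_EXEC definitions and the pte
helpers.

```c
#include <assert.h>

/* Hypothetical bit values for illustration; not the real book3e layout. */
#define BAP_SR 0x1UL	/* supervisor read */
#define BAP_SX 0x2UL	/* supervisor execute */
#define BAP_UX 0x4UL	/* user execute */

#define EXEC_OLD BAP_UX			/* _PAGE_EXEC before the fix */
#define EXEC_NEW (BAP_SX | BAP_UX)	/* _PAGE_EXEC after the fix */

/* Model of pte_exprotect(): clear every bit in the exec mask. */
static unsigned long model_exprotect(unsigned long pte, unsigned long exec_mask)
{
	assert(exec_mask != 0);	/* an empty mask would be a silent no-op */
	return pte & ~exec_mask;
}

/* Model of pte_exec(): true if any bit of the exec mask is set. */
static int model_pte_exec(unsigned long pte, unsigned long exec_mask)
{
	return (pte & exec_mask) != 0;
}
```

For a kernel read-only executable page (SR and SX set in this model), the
old mask leaves SX in place and reports the page as non-executable, while
the new mask clears SX and reports it correctly.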

Fixes: 1f9ad21c3b38 ("powerpc/mm: Implement set_memory() routines")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: Removed pte_mkexec() from nohash/64/pgtable.h
v2: New
---
 arch/powerpc/include/asm/nohash/32/pgtable.h |  2 ++
 arch/powerpc/include/asm/nohash/64/pgtable.h |  5 -----
 arch/powerpc/include/asm/nohash/pte-book3e.h | 18 ++++++++++++++----
 arch/powerpc/mm/nohash/tlb_low_64e.S         |  8 ++++----
 4 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 11c6849f7864..b67742e2a9b2 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -193,10 +193,12 @@ static inline pte_t pte_wrprotect(pte_t pte)
 }
 #endif
 
+#ifndef pte_mkexec
 static inline pte_t pte_mkexec(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_EXEC);
 }
+#endif
 
 #define pmd_none(pmd)		(!pmd_val(pmd))
 #define	pmd_bad(pmd)		(pmd_val(pmd) & _PMD_BAD)
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
index d081704b13fb..9d2905a47410 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -118,11 +118,6 @@ static inline pte_t pte_wrprotect(pte_t pte)
 	return __pte(pte_val(pte) & ~_PAGE_RW);
 }
 
-static inline pte_t pte_mkexec(pte_t pte)
-{
-	return __pte(pte_val(pte) | _PAGE_EXEC);
-}
-
 #define PMD_BAD_BITS		(PTE_TABLE_SIZE-1)
 #define PUD_BAD_BITS		(PMD_TABLE_SIZE-1)
 
diff --git a/arch/powerpc/include/asm/nohash/pte-book3e.h b/arch/powerpc/include/asm/nohash/pte-book3e.h
index 813918f40765..f798640422c2 100644
--- a/arch/powerpc/include/asm/nohash/pte-book3e.h
+++ b/arch/powerpc/include/asm/nohash/pte-book3e.h
@@ -48,7 +48,7 @@
 #define _PAGE_WRITETHRU	0x800000 /* W: cache write-through */
 
 /* "Higher level" linux bit combinations */
-#define _PAGE_EXEC		_PAGE_BAP_UX /* .. and was cache cleaned */
+#define _PAGE_EXEC		(_PAGE_BAP_SX | _PAGE_BAP_UX) /* .. and was cache cleaned */
 #define _PAGE_RW		(_PAGE_BAP_SW | _PAGE_BAP_UW) /* User write permission */
 #define _PAGE_KERNEL_RW		(_PAGE_BAP_SW | _PAGE_BAP_SR | _PAGE_DIRTY)
 #define _PAGE_KERNEL_RO		(_PAGE_BAP_SR)
@@ -93,11 +93,11 @@
 /* Permission masks used to generate the __P and __S table */
 #define PAGE_NONE	__pgprot(_PAGE_BASE)
 #define PAGE_SHARED	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW)
-#define PAGE_SHARED_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW | _PAGE_EXEC)
+#define PAGE_SHARED_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW | _PAGE_BAP_UX)
 #define PAGE_COPY	__pgprot(_PAGE_BASE | _PAGE_USER)
-#define PAGE_COPY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC)
+#define PAGE_COPY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_BAP_UX)
 #define PAGE_READONLY	__pgprot(_PAGE_BASE | _PAGE_USER)
-#define PAGE_READONLY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC)
+#define PAGE_READONLY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_BAP_UX)
 
 #ifndef __ASSEMBLY__
 static inline pte_t pte_mkprivileged(pte_t pte)
@@ -113,6 +113,16 @@ static inline pte_t pte_mkuser(pte_t pte)
 }
 
 #define pte_mkuser pte_mkuser
+
+static inline pte_t pte_mkexec(pte_t pte)
+{
+	if (pte_val(pte) & _PAGE_BAP_UR)
+		return __pte((pte_val(pte) & ~_PAGE_BAP_SX) | _PAGE_BAP_UX);
+	else
+		return __pte((pte_val(pte) & ~_PAGE_BAP_UX) | _PAGE_BAP_SX);
+}
+#define pte_mkexec pte_mkexec
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/mm/nohash/tlb_low_64e.S b/arch/powerpc/mm/nohash/tlb_low_64e.S
index bf24451f3e71..9235e720e357 100644
--- a/arch/powerpc/mm/nohash/tlb_low_64e.S
+++ b/arch/powerpc/mm/nohash/tlb_low_64e.S
@@ -222,7 +222,7 @@ tlb_miss_kernel_bolted:
 
 tlb_miss_fault_bolted:
 	/* We need to check if it was an instruction miss */
-	andi.	r10,r11,_PAGE_EXEC|_PAGE_BAP_SX
+	andi.	r10,r11,_PAGE_BAP_UX|_PAGE_BAP_SX
 	bne	itlb_miss_fault_bolted
 dtlb_miss_fault_bolted:
 	tlb_epilog_bolted
@@ -239,7 +239,7 @@ itlb_miss_fault_bolted:
 	srdi	r15,r16,60		/* get region */
 	bne-	itlb_miss_fault_bolted
 
-	li	r11,_PAGE_PRESENT|_PAGE_EXEC	/* Base perm */
+	li	r11,_PAGE_PRESENT|_PAGE_BAP_UX	/* Base perm */
 
 	/* We do the user/kernel test for the PID here along with the RW test
 	 */
@@ -614,7 +614,7 @@ itlb_miss_fault_e6500:
 
 	/* We do the user/kernel test for the PID here along with the RW test
 	 */
-	li	r11,_PAGE_PRESENT|_PAGE_EXEC	/* Base perm */
+	li	r11,_PAGE_PRESENT|_PAGE_BAP_UX	/* Base perm */
 	oris	r11,r11,_PAGE_ACCESSED@h
 
 	cmpldi	cr0,r15,0			/* Check for user region */
@@ -734,7 +734,7 @@ normal_tlb_miss_done:
 
 normal_tlb_miss_access_fault:
 	/* We need to check if it was an instruction miss */
-	andi.	r10,r11,_PAGE_EXEC
+	andi.	r10,r11,_PAGE_BAP_UX
 	bne	1f
 	ld	r14,EX_TLB_DEAR(r12)
 	ld	r15,EX_TLB_ESR(r12)
-- 
2.31.1



* [PATCH 3/3] powerpc/fsl_booke: Fix setting of exec flag when setting TLBCAMs
From: Christophe Leroy @ 2021-10-26  5:39 UTC
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel, kernel test robot
In-Reply-To: <922bdab3a220781bae2360ff3dd5adb7fe4d34f1.1635226743.git.christophe.leroy@csgroup.eu>

Building tqm8541_defconfig results in:

	arch/powerpc/mm/nohash/fsl_book3e.c: In function 'settlbcam':
	arch/powerpc/mm/nohash/fsl_book3e.c:126:40: error: '_PAGE_BAP_SX' undeclared (first use in this function)
	  126 |         TLBCAM[index].MAS3 |= (flags & _PAGE_BAP_SX) ? MAS3_SX : 0;
	      |                                        ^~~~~~~~~~~~
	arch/powerpc/mm/nohash/fsl_book3e.c:126:40: note: each undeclared identifier is reported only once for each function it appears in
	make[3]: *** [scripts/Makefile.build:277: arch/powerpc/mm/nohash/fsl_book3e.o] Error 1
	make[2]: *** [scripts/Makefile.build:540: arch/powerpc/mm/nohash] Error 2
	make[1]: *** [scripts/Makefile.build:540: arch/powerpc/mm] Error 2
	make: *** [Makefile:1868: arch/powerpc] Error 2

This is because _PAGE_BAP_SX is not defined when using 32-bit PTEs.

Now that _PAGE_EXEC contains both _PAGE_BAP_SX and _PAGE_BAP_UX, it can be used instead.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 01116e6e98b0 ("powerpc/fsl_booke: Take exec flag into account when setting TLBCAMs")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/nohash/fsl_book3e.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/nohash/fsl_book3e.c b/arch/powerpc/mm/nohash/fsl_book3e.c
index 978e0bcdfa2c..b231a54f540c 100644
--- a/arch/powerpc/mm/nohash/fsl_book3e.c
+++ b/arch/powerpc/mm/nohash/fsl_book3e.c
@@ -123,7 +123,6 @@ static void settlbcam(int index, unsigned long virt, phys_addr_t phys,
 	TLBCAM[index].MAS2 |= (flags & _PAGE_ENDIAN) ? MAS2_E : 0;
 
 	TLBCAM[index].MAS3 = (phys & MAS3_RPN) | MAS3_SR;
-	TLBCAM[index].MAS3 |= (flags & _PAGE_BAP_SX) ? MAS3_SX : 0;
 	TLBCAM[index].MAS3 |= (flags & _PAGE_RW) ? MAS3_SW : 0;
 	if (mmu_has_feature(MMU_FTR_BIG_PHYS))
 		TLBCAM[index].MAS7 = (u64)phys >> 32;
@@ -133,6 +132,8 @@ static void settlbcam(int index, unsigned long virt, phys_addr_t phys,
 		TLBCAM[index].MAS3 |= MAS3_UR;
 		TLBCAM[index].MAS3 |= (flags & _PAGE_EXEC) ? MAS3_UX : 0;
 		TLBCAM[index].MAS3 |= (flags & _PAGE_RW) ? MAS3_UW : 0;
+	} else {
+		TLBCAM[index].MAS3 |= (flags & _PAGE_EXEC) ? MAS3_SX : 0;
 	}
 
 	tlbcam_addrs[index].start = virt;
-- 
2.31.1


^ permalink raw reply related
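The effect of the hunks above on the MAS3 permission bits can be sketched in plain C. This is only a userspace model under stated assumptions: the MAS3_* and FLAG_* values are illustrative placeholders rather than values copied from the kernel headers, mas3_perms() is a hypothetical helper, and is_user stands in for the existing user-mapping test whose condition is not visible in the diff context.

```c
#include <stdbool.h>
#include <stdint.h>

/* MAS3 permission bits; illustrative values, not taken from kernel headers. */
enum {
	MAS3_SR = 0x01, MAS3_UR = 0x02,
	MAS3_SW = 0x04, MAS3_UW = 0x08,
	MAS3_SX = 0x10, MAS3_UX = 0x20,
};

/* Hypothetical stand-ins for _PAGE_RW and _PAGE_EXEC. */
enum { FLAG_RW = 0x1, FLAG_EXEC = 0x2 };

/*
 * Model of the permission part of settlbcam() after the patch:
 * the exec flag selects UX for user mappings and SX otherwise.
 */
static uint32_t mas3_perms(unsigned long flags, bool is_user)
{
	uint32_t mas3 = MAS3_SR;	/* supervisor read is always granted */

	if (flags & FLAG_RW)
		mas3 |= MAS3_SW;

	if (is_user) {
		mas3 |= MAS3_UR;
		if (flags & FLAG_EXEC)
			mas3 |= MAS3_UX;
		if (flags & FLAG_RW)
			mas3 |= MAS3_UW;
	} else if (flags & FLAG_EXEC) {
		mas3 |= MAS3_SX;	/* new else branch added by the patch */
	}

	return mas3;
}
```

In this model an executable kernel mapping ends up with SR|SX and an executable user mapping with SR|UR|UX, without referring to _PAGE_BAP_SX at all, which is why the build error goes away on 32-bit PTE configurations.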

* Re: [PATCH] powerpc: Enhance pmem DMA bypass handling
From: Alexey Kardashevskiy @ 2021-10-26  5:39 UTC (permalink / raw)
  To: Brian King, linuxppc-dev
In-Reply-To: <f9af9834-797f-cd69-bbcf-3663ce375c72@linux.vnet.ibm.com>



On 10/26/21 01:40, Brian King wrote:
> On 10/23/21 7:18 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 23/10/2021 07:18, Brian King wrote:
>>> On 10/22/21 7:24 AM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 22/10/2021 04:44, Brian King wrote:
>>>>> If ibm,pmemory is installed in the system, it can appear anywhere
>>>>> in the address space. This patch enhances how we handle DMA for devices when
>>>>> ibm,pmemory is present. In the case where we have enough DMA space to
>>>>> direct map all of RAM, but not ibm,pmemory, we use direct DMA for
>>>>> I/O to RAM and use the default window to dynamically map ibm,pmemory.
>>>>> In the case where we only have a single DMA window, this won't work,
>>>>> so if the window is not big enough to map the entire address range,
>>>>> we cannot direct map.
>>>>
>>>> but we want the pmem range to be mapped into the huge DMA window too if we can, why skip it?
>>>
>>> This patch should simply do what the comment in this commit mentioned below suggests, which says that
>>> ibm,pmemory can appear anywhere in the address space. If the DMA window is large enough
>>> to map all of MAX_PHYSMEM_BITS, we will indeed simply do direct DMA for everything,
>>> including the pmem. If we do not have a big enough window to do that, we will do
>>> direct DMA for DRAM and dynamic mapping for pmem.
>>
>>
>> Right, and this is what we do already, don't we? I am missing something here.
> 
> The upstream code does not work correctly that I can see. If I boot an upstream kernel
> with an nvme device and vpmem assigned to the LPAR, and enable dev_dbg in arch/powerpc/platforms/pseries/iommu.c,
> I see the following in the logs:
> 
> [    2.157549] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 500000 8000000 20000121 returned 0
> [    2.157561] nvme 0121:50:00.0: Skipping ibm,pmemory
> [    2.157567] nvme 0121:50:00.0: can't map partition max 0x8000000000000 with 16777216 65536-sized pages
> [    2.170150] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 500000 8000000 20000121 10 28 returned 0 (liobn = 0x70000121 starting addr = 8000000 0)
> [    2.170170] nvme 0121:50:00.0: created tce table LIOBN 0x70000121 for /pci@800000020000121/pci1014,683@0
> [    2.356260] nvme 0121:50:00.0: node is /pci@800000020000121/pci1014,683@0
> 
> This means we are heading down the leg in enable_ddw where we do not set direct_mapping to true. We
> create the DDW window, but don't do any direct DMA. This is because the window is not large enough to
> map 2PB of memory, which is what ddw_memory_hotplug_max returns without my patch.
> 
> With my patch applied, I get this in the logs:
> 
> [    2.204866] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 500000 8000000 20000121 returned 0
> [    2.204875] nvme 0121:50:00.0: Skipping ibm,pmemory
> [    2.205058] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 500000 8000000 20000121 10 21 returned 0 (liobn = 0x70000121 starting addr = 8000000 0)
> [    2.205068] nvme 0121:50:00.0: created tce table LIOBN 0x70000121 for /pci@800000020000121/pci1014,683@0
> [    2.215898] nvme 0121:50:00.0: iommu: 64-bit OK but direct DMA is limited by 800000200000000
> 


ah I see. then...


> 
> Thanks,
> 
> Brian
> 
> 
>>
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/pseries/iommu.c?id=bf6e2d562bbc4d115cf322b0bca57fe5bbd26f48
>>>
>>>
>>> Thanks,
>>>
>>> Brian
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
>>>>> ---
>>>>>    arch/powerpc/platforms/pseries/iommu.c | 19 ++++++++++---------
>>>>>    1 file changed, 10 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>>>> index 269f61d519c2..d9ae985d10a4 100644
>>>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>>>> @@ -1092,15 +1092,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
>>>>>        phys_addr_t max_addr = memory_hotplug_max();
>>>>>        struct device_node *memory;
>>>>>    -    /*
>>>>> -     * The "ibm,pmemory" can appear anywhere in the address space.
>>>>> -     * Assuming it is still backed by page structs, set the upper limit
>>>>> -     * for the huge DMA window as MAX_PHYSMEM_BITS.
>>>>> -     */
>>>>> -    if (of_find_node_by_type(NULL, "ibm,pmemory"))
>>>>> -        return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
>>>>> -            (phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
>>>>> -
>>>>>        for_each_node_by_type(memory, "memory") {
>>>>>            unsigned long start, size;
>>>>>            int n_mem_addr_cells, n_mem_size_cells, len;
>>>>> @@ -1341,6 +1332,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>>>         */
>>>>>        len = max_ram_len;
>>>>>        if (pmem_present) {
>>>>> +        if (default_win_removed) {
>>>>> +            /*
>>>>> +             * If we only have one DMA window and have pmem present,
>>>>> +             * then we need to be able to map the entire address
>>>>> +             * range in order to be able to do direct DMA to RAM.
>>>>> +             */
>>>>> +            len = order_base_2((sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
>>>>> +                    (phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS));


... len = (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ? 31 :
MAX_PHYSMEM_BITS  ?

Or actually simply drop this hunk and only leave the first one and add
this instead:


diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 591ec9e94edb..68bfcd2227d9 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1518,7 +1518,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
         * as RAM, then we failed to create a window to cover persistent
         * memory and need to set the DMA limit.
         */
-       if (pmem_present && ddw_enabled && direct_mapping && len == max_ram_len)
+       if (pmem_present && ddw_enabled && direct_mapping)

?

Thanks,



>>>>> +        }
>>>>> +
>>>>>            if (query.largest_available_block >=
>>>>>                (1ULL << (MAX_PHYSMEM_BITS - page_shift)))
>>>>>                len = MAX_PHYSMEM_BITS;
>>>>>
>>>>
>>>
>>>
>>
> 
> 

-- 
Alexey

^ permalink raw reply related
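The "can't map partition max" line in the quoted upstream log can be checked by hand: 16777216 (2^24) pages of 65536 (2^16) bytes cover 2^40 bytes (1 TiB), while the reported maximum of 0x8000000000000 is 2^51 bytes (2 PiB), which matches the "2PB" figure in the quoted explanation. A minimal sketch of that arithmetic follows; window_bytes() and window_covers() are hypothetical helper names for illustration, not functions from iommu.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes covered by a DMA window of nr_pages pages of page_size bytes each. */
static uint64_t window_bytes(uint64_t nr_pages, uint64_t page_size)
{
	return nr_pages * page_size;
}

/* True if the window is large enough to direct map every address below max_addr. */
static bool window_covers(uint64_t nr_pages, uint64_t page_size, uint64_t max_addr)
{
	return window_bytes(nr_pages, page_size) >= max_addr;
}
```

With the values from the log, window_covers(16777216, 65536, 0x8000000000000) is false, which is consistent with the kernel falling back to a window that is created but not used for direct DMA.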

* Re: [PATCH v11 2/3] tty: hvc: pass DMA capable memory to put_chars()
From: Xianting Tian @ 2021-10-26  6:02 UTC (permalink / raw)
  To: Jiri Slaby, gregkh, amit, arnd, osandov
  Cc: shile.zhang, linuxppc-dev, linux-kernel, virtualization
In-Reply-To: <208f7a41-a9fa-630c-cb44-c37c503f3a72@kernel.org>

On 2021/10/26 at 1:10 PM, Jiri Slaby wrote:
> On 15. 10. 21, 4:46, Xianting Tian wrote:
>> @@ -151,9 +142,11 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
>>   static void hvc_console_print(struct console *co, const char *b,
>>                     unsigned count)
>>   {
>> -    char c[N_OUTBUF] __ALIGNED__;
>> +    char *c;
>>       unsigned i = 0, n = 0;
>>       int r, donecr = 0, index = co->index;
>> +    unsigned long flags;
>> +    struct hvc_struct *hp;
>>         /* Console access attempt outside of acceptable console 
>> range. */
>>       if (index >= MAX_NR_HVC_CONSOLES)
>> @@ -163,6 +156,13 @@ static void hvc_console_print(struct console 
>> *co, const char *b,
>>       if (vtermnos[index] == -1)
>>           return;
>>   +    hp = cons_hvcs[index];
>> +    if (!hp)
>> +        return;
>
> You effectively make the console unusable until someone calls 
> hvc_alloc() for this device, correct? This doesn't look right. Nor do
> you describe this change of behaviour in the commit log.

I mentioned such info in the commit log:
'Introduce another array(cons_hvcs[]) for hvc pointers next to the
cons_ops[] and vtermnos[] arrays. With the array, we can easily find
hvc's cons_outbuf and its lock.'

After you pointed it out, I see that what you said makes sense. I checked the code: hvc_console_print() can print before hvc_alloc() is called when someone uses hvc_instantiate() as an early console discovery method.
Should I send a patch to fix the issue, or should this patch series be reverted first so that I can resend a new version? Thanks


>
> regards,

^ permalink raw reply

* Re: [PATCH v11 2/3] tty: hvc: pass DMA capable memory to put_chars()
From: Greg KH @ 2021-10-26  6:10 UTC (permalink / raw)
  To: Xianting Tian
  Cc: arnd, amit, Jiri Slaby, shile.zhang, linux-kernel, virtualization,
	linuxppc-dev, osandov
In-Reply-To: <cd195483-93c7-23be-8f4c-9cf7f25a3065@linux.alibaba.com>

On Tue, Oct 26, 2021 at 02:02:21PM +0800, Xianting Tian wrote:
> On 2021/10/26 at 1:10 PM, Jiri Slaby wrote:
> > On 15. 10. 21, 4:46, Xianting Tian wrote:
> > > @@ -151,9 +142,11 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
> > >   static void hvc_console_print(struct console *co, const char *b,
> > >                     unsigned count)
> > >   {
> > > -    char c[N_OUTBUF] __ALIGNED__;
> > > +    char *c;
> > >       unsigned i = 0, n = 0;
> > >       int r, donecr = 0, index = co->index;
> > > +    unsigned long flags;
> > > +    struct hvc_struct *hp;
> > >         /* Console access attempt outside of acceptable console
> > > range. */
> > >       if (index >= MAX_NR_HVC_CONSOLES)
> > > @@ -163,6 +156,13 @@ static void hvc_console_print(struct console
> > > *co, const char *b,
> > >       if (vtermnos[index] == -1)
> > >           return;
> > >   +    hp = cons_hvcs[index];
> > > +    if (!hp)
> > > +        return;
> > 
> > You effectively make the console unusable until someone calls
> > hvc_alloc() for this device, correct? This doesn't look right. Nor do
> > you describe this change of behaviour in the commit log.
> 
> I mentioned such info in the commit log:
> 'Introduce another array(cons_hvcs[]) for hvc pointers next to the
> cons_ops[] and vtermnos[] arrays. With the array, we can easily find
> hvc's cons_outbuf and its lock.'
> 
> After you pointed it out, I see that what you said makes sense. I checked the code: hvc_console_print() can print before hvc_alloc() is called when someone uses hvc_instantiate() as an early console discovery method.
> Should I send a patch to fix the issue, or should this patch series be reverted first so that I can resend a new version? Thanks

Let me revert these now and you can send an updated version.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v11 2/3] tty: hvc: pass DMA capable memory to put_chars()
From: Xianting Tian @ 2021-10-26  6:11 UTC (permalink / raw)
  To: Greg KH
  Cc: arnd, amit, Jiri Slaby, shile.zhang, linux-kernel, virtualization,
	linuxppc-dev, osandov
In-Reply-To: <YXebzdZz8oN6w+T0@kroah.com>


On 2021/10/26 at 2:10 PM, Greg KH wrote:
> On Tue, Oct 26, 2021 at 02:02:21PM +0800, Xianting Tian wrote:
>> On 2021/10/26 at 1:10 PM, Jiri Slaby wrote:
>>> On 15. 10. 21, 4:46, Xianting Tian wrote:
>>>> @@ -151,9 +142,11 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
>>>>    static void hvc_console_print(struct console *co, const char *b,
>>>>                      unsigned count)
>>>>    {
>>>> -    char c[N_OUTBUF] __ALIGNED__;
>>>> +    char *c;
>>>>        unsigned i = 0, n = 0;
>>>>        int r, donecr = 0, index = co->index;
>>>> +    unsigned long flags;
>>>> +    struct hvc_struct *hp;
>>>>          /* Console access attempt outside of acceptable console
>>>> range. */
>>>>        if (index >= MAX_NR_HVC_CONSOLES)
>>>> @@ -163,6 +156,13 @@ static void hvc_console_print(struct console
>>>> *co, const char *b,
>>>>        if (vtermnos[index] == -1)
>>>>            return;
>>>>    +    hp = cons_hvcs[index];
>>>> +    if (!hp)
>>>> +        return;
>>> You effectively make the console unusable until someone calls
>>> hvc_alloc() for this device, correct? This doesn't look right. Nor do
>>> you describe this change of behaviour in the commit log.
>> I mentioned such info in the commit log:
>> 'Introduce another array(cons_hvcs[]) for hvc pointers next to the
>> cons_ops[] and vtermnos[] arrays. With the array, we can easily find
>> hvc's cons_outbuf and its lock.'
>>
>> After you pointed it out, I see that what you said makes sense. I checked the code: hvc_console_print() can print before hvc_alloc() is called when someone uses hvc_instantiate() as an early console discovery method.
>> Should I send a patch to fix the issue, or should this patch series be reverted first so that I can resend a new version? Thanks
> Let me revert these now and you can send an updated version.
OK, thanks.
>
> thanks,
>
> greg k-h

^ permalink raw reply

* Re: [PATCH v11 2/3] tty: hvc: pass DMA capable memory to put_chars()
From: Greg KH @ 2021-10-26  6:49 UTC (permalink / raw)
  To: Xianting Tian
  Cc: arnd, amit, Jiri Slaby, shile.zhang, linux-kernel, virtualization,
	linuxppc-dev, osandov
In-Reply-To: <8f78c1b8-c736-748d-d08b-3d6121eb5af8@linux.alibaba.com>

On Tue, Oct 26, 2021 at 02:11:51PM +0800, Xianting Tian wrote:
> 
> On 2021/10/26 at 2:10 PM, Greg KH wrote:
> > On Tue, Oct 26, 2021 at 02:02:21PM +0800, Xianting Tian wrote:
> > > On 2021/10/26 at 1:10 PM, Jiri Slaby wrote:
> > > > On 15. 10. 21, 4:46, Xianting Tian wrote:
> > > > > @@ -151,9 +142,11 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
> > > > >    static void hvc_console_print(struct console *co, const char *b,
> > > > >                      unsigned count)
> > > > >    {
> > > > > -    char c[N_OUTBUF] __ALIGNED__;
> > > > > +    char *c;
> > > > >        unsigned i = 0, n = 0;
> > > > >        int r, donecr = 0, index = co->index;
> > > > > +    unsigned long flags;
> > > > > +    struct hvc_struct *hp;
> > > > >          /* Console access attempt outside of acceptable console
> > > > > range. */
> > > > >        if (index >= MAX_NR_HVC_CONSOLES)
> > > > > @@ -163,6 +156,13 @@ static void hvc_console_print(struct console
> > > > > *co, const char *b,
> > > > >        if (vtermnos[index] == -1)
> > > > >            return;
> > > > >    +    hp = cons_hvcs[index];
> > > > > +    if (!hp)
> > > > > +        return;
> > > > You effectively make the console unusable until someone calls
> > > > hvc_alloc() for this device, correct? This doesn't look right. Nor do
> > > > you describe this change of behaviour in the commit log.
> > > I mentioned such info in the commit log:
> > > 'Introduce another array(cons_hvcs[]) for hvc pointers next to the
> > > cons_ops[] and vtermnos[] arrays. With the array, we can easily find
> > > hvc's cons_outbuf and its lock.'
> > > 
> > > After you pointed it out, I see that what you said makes sense. I checked the code: hvc_console_print() can print before hvc_alloc() is called when someone uses hvc_instantiate() as an early console discovery method.
> > > Should I send a patch to fix the issue, or should this patch series be reverted first so that I can resend a new version? Thanks
> > Let me revert these now and you can send an updated version.
> OK, thanks.

I have now reverted patches 2/3 and 3/3 in this series from my tree.
The first patch was just fine.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v2 02/10] powerpc/book3e: Fix set_memory_x() and set_memory_nx()
From: Christophe Leroy @ 2021-10-26  7:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <5794f254-0523-7f2f-f9e7-ff64a7fe400d@csgroup.eu>



On 25/10/2021 at 23:53, Christophe Leroy wrote:
> 
> 
> On 23/10/2021 13:47, Christophe Leroy wrote:
>> set_memory_x() calls pte_mkexec() which sets _PAGE_EXEC.
>> set_memory_nx() calls pte_exprotect() which clears _PAGE_EXEC.
>>
>> Book3e has 2 bits, UX and SX, which define the exec rights
>> resp. for user (PR=1) and for kernel (PR=0).
>>
>> _PAGE_EXEC is defined as UX only.
>>
>> An executable kernel page is set with either _PAGE_KERNEL_RWX
>> or _PAGE_KERNEL_ROX, which both have SX set and UX cleared.
>>
>> So set_memory_nx() call for an executable kernel page does
>> nothing because UX is already cleared.
>>
>> And set_memory_x() on a non-executable kernel page makes it
>> executable for the user and keeps it non-executable for kernel.
>>
>> Also, pte_exec() always returns 'false' on kernel pages, because
>> it checks _PAGE_EXEC which doesn't include SX, so for instance
>> the W+X check doesn't work.
>>
>> To fix this:
>> - change tlb_low_64e.S to use _PAGE_BAP_UX instead of _PAGE_EXEC
>> - set both UX and SX in _PAGE_EXEC so that pte_exec() returns
>> true whenever one of the two bits is set and pte_exprotect()
>> clears both bits.
>> - Define a book3e specific version of pte_mkexec() which sets
>> either SX or UX based on UR.
>>
>> Fixes: 1f9ad21c3b38 ("powerpc/mm: Implement set_memory() routines")
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> ---
>> v2: New
> 
> pte_mkexec() in nohash/64/pgtable.h conflicts with the one in 
> nohash/pte_book3e.h
> 
> Should guard it with  #ifndef pte_mkexec(), but as pte_book3e is the 
> only user in 64 bits, then just remove it from there.
> 
> Send v3 with only that change compared to v2.

Series v1 was merged into next so I submitted followup series with the 
three fixes.

Christophe

> 
> Christophe
> 
>> ---
>>   arch/powerpc/include/asm/nohash/32/pgtable.h |  2 ++
>>   arch/powerpc/include/asm/nohash/pte-book3e.h | 18 ++++++++++++++----
>>   arch/powerpc/mm/nohash/tlb_low_64e.S         |  8 ++++----
>>   3 files changed, 20 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
>> b/arch/powerpc/include/asm/nohash/32/pgtable.h
>> index ac0a5ff48c3a..d6ba821a56ce 100644
>> --- a/arch/powerpc/include/asm/nohash/32/pgtable.h
>> +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
>> @@ -193,10 +193,12 @@ static inline pte_t pte_wrprotect(pte_t pte)
>>   }
>>   #endif
>> +#ifndef pte_mkexec
>>   static inline pte_t pte_mkexec(pte_t pte)
>>   {
>>       return __pte(pte_val(pte) | _PAGE_EXEC);
>>   }
>> +#endif
>>   #define pmd_none(pmd)        (!pmd_val(pmd))
>>   #define    pmd_bad(pmd)        (pmd_val(pmd) & _PMD_BAD)
>> diff --git a/arch/powerpc/include/asm/nohash/pte-book3e.h 
>> b/arch/powerpc/include/asm/nohash/pte-book3e.h
>> index 813918f40765..f798640422c2 100644
>> --- a/arch/powerpc/include/asm/nohash/pte-book3e.h
>> +++ b/arch/powerpc/include/asm/nohash/pte-book3e.h
>> @@ -48,7 +48,7 @@
>>   #define _PAGE_WRITETHRU    0x800000 /* W: cache write-through */
>>   /* "Higher level" linux bit combinations */
>> -#define _PAGE_EXEC        _PAGE_BAP_UX /* .. and was cache cleaned */
>> +#define _PAGE_EXEC        (_PAGE_BAP_SX | _PAGE_BAP_UX) /* .. and was 
>> cache cleaned */
>>   #define _PAGE_RW        (_PAGE_BAP_SW | _PAGE_BAP_UW) /* User write 
>> permission */
>>   #define _PAGE_KERNEL_RW        (_PAGE_BAP_SW | _PAGE_BAP_SR | 
>> _PAGE_DIRTY)
>>   #define _PAGE_KERNEL_RO        (_PAGE_BAP_SR)
>> @@ -93,11 +93,11 @@
>>   /* Permission masks used to generate the __P and __S table */
>>   #define PAGE_NONE    __pgprot(_PAGE_BASE)
>>   #define PAGE_SHARED    __pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW)
>> -#define PAGE_SHARED_X    __pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW 
>> | _PAGE_EXEC)
>> +#define PAGE_SHARED_X    __pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW 
>> | _PAGE_BAP_UX)
>>   #define PAGE_COPY    __pgprot(_PAGE_BASE | _PAGE_USER)
>> -#define PAGE_COPY_X    __pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC)
>> +#define PAGE_COPY_X    __pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_BAP_UX)
>>   #define PAGE_READONLY    __pgprot(_PAGE_BASE | _PAGE_USER)
>> -#define PAGE_READONLY_X    __pgprot(_PAGE_BASE | _PAGE_USER | 
>> _PAGE_EXEC)
>> +#define PAGE_READONLY_X    __pgprot(_PAGE_BASE | _PAGE_USER | 
>> _PAGE_BAP_UX)
>>   #ifndef __ASSEMBLY__
>>   static inline pte_t pte_mkprivileged(pte_t pte)
>> @@ -113,6 +113,16 @@ static inline pte_t pte_mkuser(pte_t pte)
>>   }
>>   #define pte_mkuser pte_mkuser
>> +
>> +static inline pte_t pte_mkexec(pte_t pte)
>> +{
>> +    if (pte_val(pte) & _PAGE_BAP_UR)
>> +        return __pte((pte_val(pte) & ~_PAGE_BAP_SX) | _PAGE_BAP_UX);
>> +    else
>> +        return __pte((pte_val(pte) & ~_PAGE_BAP_UX) | _PAGE_BAP_SX);
>> +}
>> +#define pte_mkexec pte_mkexec
>> +
>>   #endif /* __ASSEMBLY__ */
>>   #endif /* __KERNEL__ */
>> diff --git a/arch/powerpc/mm/nohash/tlb_low_64e.S 
>> b/arch/powerpc/mm/nohash/tlb_low_64e.S
>> index bf24451f3e71..9235e720e357 100644
>> --- a/arch/powerpc/mm/nohash/tlb_low_64e.S
>> +++ b/arch/powerpc/mm/nohash/tlb_low_64e.S
>> @@ -222,7 +222,7 @@ tlb_miss_kernel_bolted:
>>   tlb_miss_fault_bolted:
>>       /* We need to check if it was an instruction miss */
>> -    andi.    r10,r11,_PAGE_EXEC|_PAGE_BAP_SX
>> +    andi.    r10,r11,_PAGE_BAP_UX|_PAGE_BAP_SX
>>       bne    itlb_miss_fault_bolted
>>   dtlb_miss_fault_bolted:
>>       tlb_epilog_bolted
>> @@ -239,7 +239,7 @@ itlb_miss_fault_bolted:
>>       srdi    r15,r16,60        /* get region */
>>       bne-    itlb_miss_fault_bolted
>> -    li    r11,_PAGE_PRESENT|_PAGE_EXEC    /* Base perm */
>> +    li    r11,_PAGE_PRESENT|_PAGE_BAP_UX    /* Base perm */
>>       /* We do the user/kernel test for the PID here along with the RW 
>> test
>>        */
>> @@ -614,7 +614,7 @@ itlb_miss_fault_e6500:
>>       /* We do the user/kernel test for the PID here along with the RW 
>> test
>>        */
>> -    li    r11,_PAGE_PRESENT|_PAGE_EXEC    /* Base perm */
>> +    li    r11,_PAGE_PRESENT|_PAGE_BAP_UX    /* Base perm */
>>       oris    r11,r11,_PAGE_ACCESSED@h
>>       cmpldi    cr0,r15,0            /* Check for user region */
>> @@ -734,7 +734,7 @@ normal_tlb_miss_done:
>>   normal_tlb_miss_access_fault:
>>       /* We need to check if it was an instruction miss */
>> -    andi.    r10,r11,_PAGE_EXEC
>> +    andi.    r10,r11,_PAGE_BAP_UX
>>       bne    1f
>>       ld    r14,EX_TLB_DEAR(r12)
>>       ld    r15,EX_TLB_ESR(r12)
>>

^ permalink raw reply
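The behaviour described in the quoted commit message can be reproduced with a small userspace model. It is a sketch under assumptions: the bit values are illustrative placeholders, EXEC_OLD and EXEC_NEW are hypothetical names for the two definitions of _PAGE_EXEC (UX only versus SX|UX), and the helpers take the exec mask as a parameter so both definitions can be compared side by side. Only pte_mkexec() mirrors the logic of the function added by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_val_t;

/* Illustrative bit values (assumed), named after the Book3e BAP bits. */
#define BAP_UR ((pte_val_t)0x08)
#define BAP_SX ((pte_val_t)0x40)
#define BAP_UX ((pte_val_t)0x80)

#define EXEC_OLD BAP_UX			/* _PAGE_EXEC before the patch */
#define EXEC_NEW (BAP_SX | BAP_UX)	/* _PAGE_EXEC after the patch */

/* Exec test and exec removal, parameterised on the _PAGE_EXEC definition. */
static bool pte_exec(pte_val_t pte, pte_val_t exec_mask)
{
	return (pte & exec_mask) != 0;
}

static pte_val_t pte_exprotect(pte_val_t pte, pte_val_t exec_mask)
{
	return pte & ~exec_mask;
}

/* Same logic as the Book3e pte_mkexec() in the patch: pick UX or SX based on UR. */
static pte_val_t pte_mkexec(pte_val_t pte)
{
	if (pte & BAP_UR)
		return (pte & ~BAP_SX) | BAP_UX;
	return (pte & ~BAP_UX) | BAP_SX;
}
```

For a kernel executable page (SX set, UX clear), the old mask makes pte_exec() return false and pte_exprotect() a no-op, which are the two symptoms the commit message lists; with the new mask both behave as expected.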

* Re: Linux kernel: powerpc: KVM guest can trigger host crash on Power8
From: John Paul Adrian Glaubitz @ 2021-10-26  8:48 UTC (permalink / raw)
  To: mpe; +Cc: oss-security, debian-powerpc@lists.debian.org, linuxppc-dev
In-Reply-To: <87pmrtbbdt.fsf@mpe.ellerman.id.au>

Hi Michael!

> The Linux kernel for powerpc since v5.2 has a bug which allows a
> malicious KVM guest to crash the host, when the host is running on
> Power8.
> 
> Only machines using Linux as the hypervisor, aka. KVM, powernv or bare
> metal, are affected by the bug. Machines running PowerVM are not
> affected.
> 
> The bug was introduced in:
> 
>     10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
> 
> Which was first released in v5.2.
> 
> The upstream fix is:
> 
>   cdeb5d7d890e ("KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest")
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cdeb5d7d890e14f3b70e8087e745c4a6a7d9f337
> 
> Which will be included in the v5.16 release.

I have tested these patches against 5.14, but it seems the problem [1] still remains for me
for big-endian guests. I built a patched kernel yesterday, rebooted the KVM server and let
the build daemons do their work overnight.

When I got up this morning, I noticed the machine was down, so I checked the serial console
via IPMI and saw the same messages again as reported in [1]:

[41483.963562] watchdog: BUG: soft lockup - CPU#104 stuck for 25521s! [migration/104:175]
[41507.963307] watchdog: BUG: soft lockup - CPU#104 stuck for 25544s! [migration/104:175]
[41518.311200] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[41518.311216] rcu:     136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2729959 
[41547.962882] watchdog: BUG: soft lockup - CPU#104 stuck for 25581s! [migration/104:175]
[41571.962627] watchdog: BUG: soft lockup - CPU#104 stuck for 25603s! [migration/104:175]
[41581.330530] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[41581.330546] rcu:     136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2736378 
[41611.962202] watchdog: BUG: soft lockup - CPU#104 stuck for 25641s! [migration/104:175]
[41635.961947] watchdog: BUG: soft lockup - CPU#104 stuck for 25663s! [migration/104:175]
[41644.349859] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[41644.349876] rcu:     136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2742753 
[41671.961564] watchdog: BUG: soft lockup - CPU#104 stuck for 25697s! [migration/104:175]
[41695.961309] watchdog: BUG: soft lockup - CPU#104 stuck for 25719s! [migration/104:175]
[41707.369190] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[41707.369206] rcu:     136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2749151 
[41735.960884] watchdog: BUG: soft lockup - CPU#104 stuck for 25756s! [migration/104:175]
[41759.960629] watchdog: BUG: soft lockup - CPU#104 stuck for 25778s! [migration/104:175]
[41770.388520] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[41770.388548] rcu:     136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2755540 
[41776.076307] rcu: rcu_sched kthread timer wakeup didn't happen for 1423 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[41776.076327] rcu:     Possible timer handling issue on cpu=32 timer-softirq=1056014
[41776.076336] rcu: rcu_sched kthread starved for 1424 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=32
[41776.076350] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[41776.076360] rcu: RCU grace-period kthread stack dump:
[41776.076434] rcu: Stack dump where RCU GP kthread last ran:
[41783.960374] watchdog: BUG: soft lockup - CPU#104 stuck for 25801s! [migration/104:175]
[41807.960119] watchdog: BUG: soft lockup - CPU#104 stuck for 25823s! [migration/104:175]
[41831.959864] watchdog: BUG: soft lockup - CPU#104 stuck for 25846s! [migration/104:175]
[41833.407851] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[41833.407868] rcu:     136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2760381 
[41863.959524] watchdog: BUG: soft lockup - CPU#104 stuck for 25875s! [migration/104:175]

It seems that in this case, it was the testsuite of the git package [2] that triggered the bug. As you
can see from the overview, the git package has been in the building state for 8 hours meaning the
build server crashed and is no longer giving feedback to the database.

Adrian

> [1] https://bugzilla.kernel.org/show_bug.cgi?id=206669
> [2] https://buildd.debian.org/status/package.php?p=git&suite=experimental

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaubitz@debian.org
`. `'   Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

^ permalink raw reply

* Re: [PATCH v5 1/2] ftrace: disable preemption when recursion locked
From: Miroslav Benes @ 2021-10-26  9:35 UTC (permalink / raw)
  To: 王贇
  Cc: Peter Zijlstra (Intel), Paul Walmsley, James E.J. Bottomley,
	Guo Ren, Jisheng Zhang, H. Peter Anvin, live-patching,
	linux-riscv, Joe Lawrence, Helge Deller, x86, linux-csky,
	Ingo Molnar, Petr Mladek, Albert Ou, Jiri Kosina, Nicholas Piggin,
	Borislav Petkov, Steven Rostedt, Josh Poimboeuf, Thomas Gleixner,
	linux-parisc, linux-kernel, Palmer Dabbelt, Masami Hiramatsu,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <333cecfe-3045-8e0a-0c08-64ff590845ab@linux.alibaba.com>

Hi,

> diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> index abe1a50..2bc1522 100644
> --- a/include/linux/trace_recursion.h
> +++ b/include/linux/trace_recursion.h
> @@ -135,6 +135,9 @@ static __always_inline int trace_get_context_bit(void)
>  # define do_ftrace_record_recursion(ip, pip)	do { } while (0)
>  #endif
> 
> +/*
> + * Preemption is promised to be disabled when return bit > 0.
> + */
>  static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip,
>  							int start)
>  {
> @@ -162,11 +165,17 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign
>  	current->trace_recursion = val;
>  	barrier();
> 
> +	preempt_disable_notrace();
> +
>  	return bit;
>  }
> 
> +/*
> + * Preemption will be enabled (if it was previously enabled).
> + */
>  static __always_inline void trace_clear_recursion(int bit)
>  {
> +	preempt_enable_notrace();
>  	barrier();
>  	trace_recursion_clear(bit);
>  }

The two comments should be updated too since Steven removed the "bit == 0" 
trick.

> @@ -178,7 +187,7 @@ static __always_inline void trace_clear_recursion(int bit)
>   * tracing recursed in the same context (normal vs interrupt),
>   *
>   * Returns: -1 if a recursion happened.
> - *           >= 0 if no recursion
> + *           > 0 if no recursion.
>   */
>  static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
>  							 unsigned long parent_ip)

And this change would not be correct now.

Regards
Miroslav

^ permalink raw reply

* Re: [PATCH v5 1/2] ftrace: disable preemption when recursion locked
From: 王贇 @ 2021-10-26  9:48 UTC (permalink / raw)
  To: Miroslav Benes
  Cc: Peter Zijlstra (Intel), Paul Walmsley, James E.J. Bottomley,
	Guo Ren, Jisheng Zhang, H. Peter Anvin, live-patching,
	linux-riscv, Joe Lawrence, Helge Deller, x86, linux-csky,
	Ingo Molnar, Petr Mladek, Albert Ou, Jiri Kosina, Nicholas Piggin,
	Borislav Petkov, Steven Rostedt, Josh Poimboeuf, Thomas Gleixner,
	linux-parisc, linux-kernel, Palmer Dabbelt, Masami Hiramatsu,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <alpine.LSU.2.21.2110261128120.28494@pobox.suse.cz>

Hi, Miroslav

On 2021/10/26 at 5:35 PM, Miroslav Benes wrote:
> Hi,
> 
>> diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
>> index abe1a50..2bc1522 100644
>> --- a/include/linux/trace_recursion.h
>> +++ b/include/linux/trace_recursion.h
>> @@ -135,6 +135,9 @@ static __always_inline int trace_get_context_bit(void)
>>  # define do_ftrace_record_recursion(ip, pip)	do { } while (0)
>>  #endif
>>
>> +/*
>> + * Preemption is promised to be disabled when return bit > 0.
>> + */
>>  static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip,
>>  							int start)
>>  {
>> @@ -162,11 +165,17 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign
>>  	current->trace_recursion = val;
>>  	barrier();
>>
>> +	preempt_disable_notrace();
>> +
>>  	return bit;
>>  }
>>
>> +/*
>> + * Preemption will be enabled (if it was previously enabled).
>> + */
>>  static __always_inline void trace_clear_recursion(int bit)
>>  {
>> +	preempt_enable_notrace();
>>  	barrier();
>>  	trace_recursion_clear(bit);
>>  }
> 
> The two comments should be updated too since Steven removed the "bit == 0" 
> trick.

Could you please give me a hint on what the correct wording would be?

I get the point that bit will no longer be 0; it is now either -1 or > 0,
so trace_test_and_set_recursion() will disable preemption when bit > 0, and
trace_clear_recursion() will enable it again, since it should only be called
when bit > 0 (I remember we could use a WARN_ON here now :-P).
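
The pairing described above can be sketched in userspace C. This is only a
rough model under stated assumptions, not the kernel code: model_trylock(),
model_unlock(), and the two static variables are hypothetical stand-ins for
current->trace_recursion and the preemption counter. Whether 0 can be a valid
return value is the open question in this thread, so the sketch accepts any
non-negative bit and does not settle that point.

```c
#include <assert.h>

/* Hypothetical stand-ins for kernel state (not real kernel symbols). */
static unsigned long trace_recursion; /* models current->trace_recursion */
static int preempt_count;             /* models the preemption counter */

/* Simplified models of the _notrace preemption helpers. */
static void preempt_disable_notrace(void) { preempt_count++; }
static void preempt_enable_notrace(void)  { preempt_count--; }

/*
 * Model of the trylock side: returns -1 if the bit is already set
 * (recursion), otherwise sets the bit, disables preemption, and
 * returns the bit.
 */
static int model_trylock(int bit)
{
	if (bit < 0)
		return -1;
	if (trace_recursion & (1UL << bit))
		return -1;	/* recursion: preemption state untouched */

	trace_recursion |= 1UL << bit;
	preempt_disable_notrace();
	return bit;
}

/*
 * Model of the unlock side: must only be called with a bit that
 * model_trylock() returned successfully. It re-enables preemption
 * first, then clears the bit, mirroring the order in the patch.
 */
static void model_unlock(int bit)
{
	assert(bit >= 0);
	preempt_enable_notrace();
	trace_recursion &= ~(1UL << bit);
}
```

In this model a failed trylock leaves the preemption counter unchanged, so a
caller that gets -1 must not call the unlock side.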

> 
>> @@ -178,7 +187,7 @@ static __always_inline void trace_clear_recursion(int bit)
>>   * tracing recursed in the same context (normal vs interrupt),
>>   *
>>   * Returns: -1 if a recursion happened.
>> - *           >= 0 if no recursion
>> + *           > 0 if no recursion.
>>   */
>>  static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
>>  							 unsigned long parent_ip)
> 
> And this change would not be correct now.

I thought it would no longer return 0, so I changed it to > 0. Isn't that correct?

Regards,
Michael Wang

> 
> Regards
> Miroslav
> 


* Re: [PATCH v2] perf vendor events power10: Add metric events json file for power10 platform
From: kajoljain @ 2021-10-26  9:58 UTC (permalink / raw)
  To: Paul A. Clarke
  Cc: maddy, rnsastry, jolsa, linux-kernel, acme, linux-perf-users,
	atrajeev, linuxppc-dev
In-Reply-To: <20211022144910.GC104437@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com>



On 10/22/21 8:19 PM, Paul A. Clarke wrote:
> Thanks for the changes!
> More nits below (many left over from prior review)...
> 
> On Fri, Oct 22, 2021 at 11:55:05AM +0530, Kajol Jain wrote:
>> Add pmu metric json file for power10 platform.
>>
>> Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
>> ---
>> Changelog v1 -> v2:
>> - Did some nit changes in BriefDescription field
>>   as suggested by Paul A. Clarke
>>
>> - Link to the v1 patch: https://lkml.org/lkml/2021/10/6/131
>>
>>  .../arch/powerpc/power10/metrics.json         | 676 ++++++++++++++++++
>>  1 file changed, 676 insertions(+)
>>  create mode 100644 tools/perf/pmu-events/arch/powerpc/power10/metrics.json
>>
>> diff --git a/tools/perf/pmu-events/arch/powerpc/power10/metrics.json b/tools/perf/pmu-events/arch/powerpc/power10/metrics.json
>> new file mode 100644
>> index 000000000000..8adab5cd9934
>> --- /dev/null
>> +++ b/tools/perf/pmu-events/arch/powerpc/power10/metrics.json
>> @@ -0,0 +1,676 @@
>> +[
>> +    {
>> +        "BriefDescription": "Percentage of cycles that are run cycles",
>> +        "MetricExpr": "PM_RUN_CYC / PM_CYC * 100",
>> +        "MetricGroup": "General",
>> +        "MetricName": "RUN_CYCLES_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per completed instruction",
>> +        "MetricExpr": "PM_CYC / PM_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "CYCLES_PER_INSTRUCTION"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled for any reason",
>> +        "MetricExpr": "PM_DISP_STALL_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because there was a flush",
>> +        "MetricExpr": "PM_DISP_STALL_FLUSH / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_FLUSH_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because the MMU was handling a translation miss",
>> +        "MetricExpr": "PM_DISP_STALL_TRANSLATION / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_TRANSLATION_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled waiting to resolve an instruction ERAT miss",
>> +        "MetricExpr": "PM_DISP_STALL_IERAT_ONLY_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_IERAT_ONLY_MISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled waiting to resolve an instruction TLB miss",
>> +        "MetricExpr": "PM_DISP_STALL_ITLB_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_ITLB_MISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to an icache miss",
>> +        "MetricExpr": "PM_DISP_STALL_IC_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_IC_MISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from the local L2",
>> +        "MetricExpr": "PM_DISP_STALL_IC_L2 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_IC_L2_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from the local L3",
>> +        "MetricExpr": "PM_DISP_STALL_IC_L3 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_IC_L3_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from any source beyond the local L3",
>> +        "MetricExpr": "PM_DISP_STALL_IC_L3MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_IC_L3MISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to an icache miss after a branch mispredict",
>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_ICMISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_BR_MPRED_ICMISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from the local L2 after suffering a branch mispredict",
>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L2 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L2_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from the local L3 after suffering a branch mispredict",
>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L3 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L3_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from any source beyond the local L3 after suffering a branch mispredict",
>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L3MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L3MISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to a branch mispredict",
>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_BR_MPRED_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch for any reason",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because of a synchronizing instruction that requires the ICT to be empty before dispatch",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_SYNC_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISP_HELD_STALL_SYNC_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch while waiting on the scoreboard",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_SCOREBOARD_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISP_HELD_STALL_SCOREBOARD_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch due to issue queue full",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_ISSQ_FULL_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISP_HELD_STALL_ISSQ_FULL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the mapper/SRB was full",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_RENAME_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_RENAME_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the STF mapper/SRB was full",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_STF_MAPPER_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_STF_MAPPER_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the XVFC mapper/SRB was full",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_XVFC_MAPPER_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_XVFC_MAPPER_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch for any other reason",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_OTHER_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_OTHER_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction has been dispatched but not issued for any reason",
>> +        "MetricExpr": "PM_ISSUE_STALL / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "ISSUE_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting to be finished in one of the execution units",
>> +        "MetricExpr": "PM_EXEC_STALL / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "EXECUTION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction spent executing an NTC instruction that gets flushed some time after dispatch",
>> +        "MetricExpr": "PM_EXEC_STALL_NTC_FLUSH / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "NTC_FLUSH_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTF instruction finishes at dispatch",
>> +        "MetricExpr": "PM_EXEC_STALL_FIN_AT_DISP / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "FIN_AT_DISP_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the branch unit",
>> +        "MetricExpr": "PM_EXEC_STALL_BRU / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "BRU_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a simple fixed point instruction that is executing in the LSU",
>> +        "MetricExpr": "PM_EXEC_STALL_SIMPLE_FX / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "SIMPLE_FX_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the VSU",
>> +        "MetricExpr": "PM_EXEC_STALL_VSU / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "VSU_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting to be finished in one of the execution units",
>> +        "MetricExpr": "PM_EXEC_STALL_TRANSLATION / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "TRANSLATION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a load or store that suffered a translation miss",
>> +        "MetricExpr": "PM_EXEC_STALL_DERAT_ONLY_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DERAT_ONLY_MISS_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is recovering from a TLB miss",
>> +        "MetricExpr": "PM_EXEC_STALL_DERAT_DTLB_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DERAT_DTLB_MISS_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the LSU",
>> +        "MetricExpr": "PM_EXEC_STALL_LSU / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "LSU_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a load that is executing in the LSU",
>> +        "MetricExpr": "PM_EXEC_STALL_LOAD / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "LOAD_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_L2L3_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3, with an RC dispatch conflict",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3_CONFLICT / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_L2L3_CONFLICT_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3, without an RC dispatch conflict",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3_NOCONFLICT / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_L2L3_NOCONFLICT_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a source beyond the local L2 and local L3",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L3MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_L3MISS_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a neighbor chiplet's L2 or L3 in the same chip",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L21_L31 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_L21_L31_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from L4, local memory or OpenCapp chip",
> 
> What is "OpenCapp"?  Is is different from OpenCAPI?

Hi Paul,
    OpenCapp is the same as OpenCAPI. But these descriptions are
provided by the hardware team, and the PMU workbook uses the same term,
so we need to keep OpenCapp.

> 
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_LMEM / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_LMEM_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a remote chip (cache, L4, memory or OpenCapp) in the same group",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_OFF_CHIP / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_OFF_CHIP_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a distant chip (cache, L4, memory or OpenCapp chip)",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_OFF_NODE / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_OFF_NODE_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing a TLBIEL instruction",
>> +        "MetricExpr": "PM_EXEC_STALL_TLBIEL / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "TLBIEL_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is finishing a load after its data has been reloaded from a data source beyond the local L1, OR when the LSU is processing an L1-hit, OR when the NTF instruction merged with another load in the LMQ",
>> +        "MetricExpr": "PM_EXEC_STALL_LOAD_FINISH / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "LOAD_FINISH_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a store that is executing in the LSU",
>> +        "MetricExpr": "PM_EXEC_STALL_STORE / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "STORE_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is in the store unit outside of handling store misses or other special store operations",
> 
> Is "store unit" not the same as "LSU" ?  Use "LSU" uniformly if appropriate:
> s/store unit/LSU/

Using "store unit" is more appropriate here, as we are counting
instructions executed in the store unit of the LSU.

> 
>> +        "MetricExpr": "PM_EXEC_STALL_STORE_PIPE / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "STORE_PIPE_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a store whose cache line was not resident in the L1 and had to wait for allocation of the missing line into the L1",
>> +        "MetricExpr": "PM_EXEC_STALL_STORE_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "STORE_MISS_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a TLBIE instruction waiting for a response from the L2",
>> +        "MetricExpr": "PM_EXEC_STALL_TLBIE / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "TLBIE_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing a PTESYNC instruction",
>> +        "MetricExpr": "PM_EXEC_STALL_PTESYNC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "PTESYNC_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete because the thread was blocked",
>> +        "MetricExpr": "PM_CMPL_STALL / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete because it was interrupted by ANY exception",
>> +        "MetricExpr": "PM_CMPL_STALL_EXCEPTION / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "EXCEPTION_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is stuck at finish waiting for the non-speculative finish of either a STCX instruction waiting for its result or a load waiting for non-critical sectors of data and ECC",
>> +        "MetricExpr": "PM_CMPL_STALL_MEM_ECC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "MEM_ECC_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete the instruction is a STCX instruction waiting for resolution from the nest",
> 
> Need to reword this, I think.  I propose "Average cycles per instruction
> when the NTC instruction is a STCX instruction waiting for resolution
> from the nest", which follows the form used by HWSYNC_COMPLETION_STALL_CPI,
> below.

Yes, that makes sense. I will update this description.

> 
>> +        "MetricExpr": "PM_CMPL_STALL_STCX / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "STCX_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a LWSYNC instruction waiting to complete",
>> +        "MetricExpr": "PM_CMPL_STALL_LWSYNC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "LWSYNC_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a HWSYNC instruction stuck at finish waiting for a response from the L2",
>> +        "MetricExpr": "PM_CMPL_STALL_HWSYNC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "HWSYNC_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction required special handling before completion",
>> +        "MetricExpr": "PM_CMPL_STALL_SPECIAL / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "SPECIAL_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because fetch was being held, so there was nothing in the pipeline for this thread",
>> +        "MetricExpr": "PM_DISP_STALL_FETCH / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_FETCH_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because of power management",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_HALT_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_HALT_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of flushes per completed run instruction",
> 
> s/per completed run instruction/per instruction/

As discussed, I will update it to "completed instruction" in all the
descriptions below.

> 
>> +        "MetricExpr": "PM_FLUSH / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "Others",
>> +        "MetricName": "FLUSH_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of flushes due to a branch mispredict per instruction",
>> +        "MetricExpr": "PM_FLUSH_MPRED / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "Others",
>> +        "MetricName": "BR_MPRED_FLUSH_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of branch mispredictions per completed run instruction",
> 
> s/per completed run instruction/per instruction/
> 
>> +        "MetricExpr": "PM_BR_MPRED_CMPL / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "Others",
>> +        "MetricName": "BRANCH_MISPREDICTION_RATE"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of finished loads that missed in the L1",
>> +        "MetricExpr": "PM_LD_MISS_L1 / PM_LD_REF_L1 * 100",
>> +        "MetricGroup": "Others",
>> +        "MetricName": "L1_LD_MISS_RATIO",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of completed run instructions that were loads that missed the L1",
> 
> s/completed run instructions/instructions/
> 
>> +        "MetricExpr": "PM_LD_MISS_L1 / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "Others",
>> +        "MetricName": "L1_LD_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of instructions when the DPTEG required for the load/store instruction in execution was missing from the TLB",
>> +        "MetricExpr": "PM_DTLB_MISS / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "Others",
>> +        "MetricName": "DTLB_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of instructions dispatched per instruction completed",
>> +        "MetricExpr": "PM_INST_DISP / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "DISPATCH_PER_INST_CMPL"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of completed run instructions that were a demand load that did not hit in the L1 or L2",
> 
> s/completed run instructions/instructions/
> 
>> +        "MetricExpr": "PM_DATA_FROM_L2MISS / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "General",
>> +        "MetricName": "L2_LD_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of completed run instructions that were demand fetches that missed the L1 instruction cache",
> 
> s/completed run instructions/instructions/
> s/instruction cache/icache/ to be consistent with the rest of the file
> 
>> +        "MetricExpr": "PM_L1_ICACHE_MISS / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "Instruction_Misses",
>> +        "MetricName": "L1_INST_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of completed run instructions that were demand fetches that reloaded from beyond the L3 instruction cache",
> 
> s/completed run instructions/instructions/
> s/instruction cache/icache/ to be consistent with the rest of the file

Sure, I will make it icache.

Thanks,
Kajol Jain

> 
>> +        "MetricExpr": "PM_INST_FROM_L3MISS / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "General",
>> +        "MetricName": "L3_INST_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of completed instructions per cycle",
>> +        "MetricExpr": "PM_INST_CMPL / PM_CYC",
>> +        "MetricGroup": "General",
>> +        "MetricName": "IPC"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of cycles per completed instruction group",
>> +        "MetricExpr": "PM_CYC / PM_1PLUS_PPC_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "CYCLES_PER_COMPLETED_INSTRUCTIONS_SET"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of cycles when at least 1 instruction dispatched",
>> +        "MetricExpr": "PM_1PLUS_PPC_DISP / PM_RUN_CYC * 100",
>> +        "MetricGroup": "General",
>> +        "MetricName": "CYCLES_ATLEAST_ONE_INST_DISPATCHED",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of finished loads per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_LD_REF_L1 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "LOADS_PER_INST"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of finished stores per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_ST_FIN / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "STORES_PER_INST"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of demand loads that reloaded from beyond the L2 per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_DATA_FROM_L2MISS / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "dL1_Reloads",
>> +        "MetricName": "DL1_RELOAD_FROM_L2_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of demand loads that reloaded from beyond the L3 per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_DATA_FROM_L3MISS / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "dL1_Reloads",
>> +        "MetricName": "DL1_RELOAD_FROM_L3_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of DERAT misses with 4k page size per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_DERAT_MISS_4K / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "Translation",
>> +        "MetricName": "DERAT_4K_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of DERAT misses with 64k page size per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_DERAT_MISS_64K / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "Translation",
>> +        "MetricName": "DERAT_64K_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of run cycles per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_RUN_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "RUN_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of DERAT misses per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_DERAT_MISS / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "Translation",
>> +        "MetricName": "DERAT_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of completed run instructions per run cycle",
> 
> s/completed run instructions/instructions/
> 
>> +        "MetricExpr": "PM_RUN_INST_CMPL / PM_RUN_CYC",
>> +        "MetricGroup": "General",
>> +        "MetricName": "RUN_IPC"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of instructions completed per instruction group",
> 
> s/completed//
> 
>> +        "MetricExpr": "PM_RUN_INST_CMPL / PM_1PLUS_PPC_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "AVERAGE_COMPLETED_INSTRUCTION_SET_SIZE"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of finished instructions per completed run instructions",
> 
> s/completed run instructions/instruction/
> 
>> +        "MetricExpr": "PM_INST_FIN / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "INST_FIN_PER_CMPL"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTF instruction is completing and the finish was overlooked",
>> +        "MetricExpr": "PM_EXEC_STALL_UNKNOWN / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "EXEC_STALL_UNKOWN_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of finished branches that were taken",
>> +        "MetricExpr": "PM_BR_TAKEN_CMPL / PM_BR_FIN * 100",
>> +        "MetricGroup": "General",
>> +        "MetricName": "TAKEN_BRANCHES",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of completed run instructions that were a demand load that did not hit in the L1, L2, or the L3",
> 
> s/completed run instructions/instructions/
> 
>> +        "MetricExpr": "PM_DATA_FROM_L3MISS / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "General",
>> +        "MetricName": "L3_LD_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of finished branches per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_BR_FIN / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "BRANCHES_PER_INST"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of instructions finished in the LSU per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_LSU_FIN / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "LSU_PER_INST"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of instructions finished in the VSU per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_VSU_FIN / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "VSU_PER_INST"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of TLBIE instructions finished in the LSU per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_TLBIE_FIN / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "TLBIE_PER_INST"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of STCX instructions finshed per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_STCX_FIN / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "STXC_PER_INST"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of LARX instructions finshed per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_LARX_FIN / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "LARX_PER_INST"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of PTESYNC instructions finshed per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_PTESYNC_FIN / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "PTESYNC_PER_INST"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of simple fixed-point instructions finshed in the store unit per completed run instruction",
> 
> s/completed run instruction/instruction/
> s/store unit/LSU/
> 
>> +        "MetricExpr": "PM_FX_LSU_FIN / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "General",
>> +        "MetricName": "FX_PER_INST"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of demand load misses that reloaded the L1 cache",
>> +        "MetricExpr": "PM_LD_DEMAND_MISS_L1 / PM_LD_MISS_L1 * 100",
>> +        "MetricGroup": "General",
>> +        "MetricName": "DL1_MISS_RELOADS",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of demand load misses that reloaded from beyond the local L2",
>> +        "MetricExpr": "PM_DATA_FROM_L2MISS / PM_LD_DEMAND_MISS_L1 * 100",
>> +        "MetricGroup": "dL1_Reloads",
>> +        "MetricName": "DL1_RELOAD_FROM_L2_MISS",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of demand load misses that reloaded from beyond the local L3",
>> +        "MetricExpr": "PM_DATA_FROM_L3MISS / PM_LD_DEMAND_MISS_L1 * 100",
>> +        "MetricGroup": "dL1_Reloads",
>> +        "MetricName": "DL1_RELOAD_FROM_L3_MISS",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of cycles stalled due to the NTC instruction waiting for a load miss to resolve from a source beyond the local L2 and local L3",
>> +        "MetricExpr": "DMISS_L3MISS_STALL_CPI / RUN_CPI * 100",
>> +        "MetricGroup": "General",
>> +        "MetricName": "DCACHE_MISS_CPI",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of DERAT misses with 2M page size per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_DERAT_MISS_2M / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "Translation",
>> +        "MetricName": "DERAT_2M_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of DERAT misses with 16M page size per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_DERAT_MISS_16M / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "Translation",
>> +        "MetricName": "DERAT_16M_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "DERAT miss ratio for 4K page size",
>> +        "MetricExpr": "PM_DERAT_MISS_4K / PM_DERAT_MISS",
>> +        "MetricGroup": "Translation",
>> +        "MetricName": "DERAT_4K_MISS_RATIO"
>> +    },
>> +    {
>> +        "BriefDescription": "DERAT miss ratio for 2M page size",
>> +        "MetricExpr": "PM_DERAT_MISS_2M / PM_DERAT_MISS",
>> +        "MetricGroup": "Translation",
>> +        "MetricName": "DERAT_2M_MISS_RATIO"
>> +    },
>> +    {
>> +        "BriefDescription": "DERAT miss ratio for 16M page size",
>> +        "MetricExpr": "PM_DERAT_MISS_16M / PM_DERAT_MISS",
>> +        "MetricGroup": "Translation",
>> +        "MetricName": "DERAT_16M_MISS_RATIO"
>> +    },
>> +    {
>> +        "BriefDescription": "DERAT miss ratio for 64K page size",
>> +        "MetricExpr": "PM_DERAT_MISS_64K / PM_DERAT_MISS",
>> +        "MetricGroup": "Translation",
>> +        "MetricName": "DERAT_64K_MISS_RATIO"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of DERAT misses that resulted in TLB reloads",
>> +        "MetricExpr": "PM_DTLB_MISS / PM_DERAT_MISS * 100",
>> +        "MetricGroup": "Translation",
>> +        "MetricName": "DERAT_MISS_RELOAD",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of icache misses that were reloaded from beyond the local L3",
>> +        "MetricExpr": "PM_INST_FROM_L3MISS / PM_L1_ICACHE_MISS * 100",
>> +        "MetricGroup": "Instruction_Misses",
>> +        "MetricName": "INST_FROM_L3_MISS",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of icache reloads from the beyond the L3 per completed run instruction",
> 
> s/completed run instruction/instruction/
> 
>> +        "MetricExpr": "PM_INST_FROM_L3MISS / PM_RUN_INST_CMPL * 100",
>> +        "MetricGroup": "Instruction_Misses",
>> +        "MetricName": "INST_FROM_L3_MISS_RATE",
>> +        "ScaleUnit": "1%"
>> +    }
>> +]
>> -- 
> 
> PC
> 

* Re: [PATCH v2] perf vendor events power10: Add metric events json file for power10 platform
From: kajoljain @ 2021-10-26 10:01 UTC
  To: Paul A. Clarke, Michael Ellerman
  Cc: maddy, rnsastry, linuxppc-dev, linux-kernel, acme,
	linux-perf-users, atrajeev, jolsa
In-Reply-To: <20211025120658.GD104437@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com>



On 10/25/21 5:36 PM, Paul A. Clarke wrote:
> On Mon, Oct 25, 2021 at 02:23:15PM +1100, Michael Ellerman wrote:
>> "Paul A. Clarke" <pc@us.ibm.com> writes:
>>> Thanks for the changes!
>>> More nits below (many left over from prior review)...
>>>
>>> On Fri, Oct 22, 2021 at 11:55:05AM +0530, Kajol Jain wrote:
>>>> Add pmu metric json file for power10 platform.
>>>>
>>>> Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
>>>> ---
>>>> Changelog v1 -> v2:
>>>> - Did some nit changes in BriefDescription field
>>>>   as suggested by Paul A. Clarke
>>>>
>>>> - Link to the v1 patch: https://lkml.org/lkml/2021/10/6/131
>>>>
>>>>  .../arch/powerpc/power10/metrics.json         | 676 ++++++++++++++++++
>>>>  1 file changed, 676 insertions(+)
>>>>  create mode 100644 tools/perf/pmu-events/arch/powerpc/power10/metrics.json
>>>>
>>>> diff --git a/tools/perf/pmu-events/arch/powerpc/power10/metrics.json b/tools/perf/pmu-events/arch/powerpc/power10/metrics.json
>>>> new file mode 100644
>>>> index 000000000000..8adab5cd9934
>>>> --- /dev/null
>>>> +++ b/tools/perf/pmu-events/arch/powerpc/power10/metrics.json
>>>> @@ -0,0 +1,676 @@
>>>> +[
>>>> +    {
>>>> +        "BriefDescription": "Percentage of cycles that are run cycles",
>>>> +        "MetricExpr": "PM_RUN_CYC / PM_CYC * 100",
>>>> +        "MetricGroup": "General",
>>>> +        "MetricName": "RUN_CYCLES_RATE",
>>>> +        "ScaleUnit": "1%"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per completed instruction",
>>>> +        "MetricExpr": "PM_CYC / PM_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "CYCLES_PER_INSTRUCTION"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled for any reason",
>>>> +        "MetricExpr": "PM_DISP_STALL_CYC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because there was a flush",
>>>> +        "MetricExpr": "PM_DISP_STALL_FLUSH / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_FLUSH_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because the MMU was handling a translation miss",
>>>> +        "MetricExpr": "PM_DISP_STALL_TRANSLATION / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_TRANSLATION_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled waiting to resolve an instruction ERAT miss",
>>>> +        "MetricExpr": "PM_DISP_STALL_IERAT_ONLY_MISS / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_IERAT_ONLY_MISS_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled waiting to resolve an instruction TLB miss",
>>>> +        "MetricExpr": "PM_DISP_STALL_ITLB_MISS / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_ITLB_MISS_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to an icache miss",
>>>> +        "MetricExpr": "PM_DISP_STALL_IC_MISS / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_IC_MISS_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from the local L2",
>>>> +        "MetricExpr": "PM_DISP_STALL_IC_L2 / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_IC_L2_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from the local L3",
>>>> +        "MetricExpr": "PM_DISP_STALL_IC_L3 / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_IC_L3_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from any source beyond the local L3",
>>>> +        "MetricExpr": "PM_DISP_STALL_IC_L3MISS / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_IC_L3MISS_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to an icache miss after a branch mispredict",
>>>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_ICMISS / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_BR_MPRED_ICMISS_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from the local L2 after suffering a branch mispredict",
>>>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L2 / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L2_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from the local L3 after suffering a branch mispredict",
>>>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L3 / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L3_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from any source beyond the local L3 after suffering a branch mispredict",
>>>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L3MISS / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L3MISS_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to a branch mispredict",
>>>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_BR_MPRED_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch for any reason",
>>>> +        "MetricExpr": "PM_DISP_STALL_HELD_CYC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_HELD_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because of a synchronizing instruction that requires the ICT to be empty before dispatch",
>>>> +        "MetricExpr": "PM_DISP_STALL_HELD_SYNC_CYC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISP_HELD_STALL_SYNC_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch while waiting on the scoreboard",
>>>> +        "MetricExpr": "PM_DISP_STALL_HELD_SCOREBOARD_CYC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISP_HELD_STALL_SCOREBOARD_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch due to issue queue full",
>>>> +        "MetricExpr": "PM_DISP_STALL_HELD_ISSQ_FULL_CYC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISP_HELD_STALL_ISSQ_FULL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the mapper/SRB was full",
>>>> +        "MetricExpr": "PM_DISP_STALL_HELD_RENAME_CYC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_HELD_RENAME_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the STF mapper/SRB was full",
>>>> +        "MetricExpr": "PM_DISP_STALL_HELD_STF_MAPPER_CYC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_HELD_STF_MAPPER_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the XVFC mapper/SRB was full",
>>>> +        "MetricExpr": "PM_DISP_STALL_HELD_XVFC_MAPPER_CYC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_HELD_XVFC_MAPPER_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch for any other reason",
>>>> +        "MetricExpr": "PM_DISP_STALL_HELD_OTHER_CYC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_HELD_OTHER_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction has been dispatched but not issued for any reason",
>>>> +        "MetricExpr": "PM_ISSUE_STALL / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "ISSUE_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting to be finished in one of the execution units",
>>>> +        "MetricExpr": "PM_EXEC_STALL / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "EXECUTION_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction spent executing an NTC instruction that gets flushed some time after dispatch",
>>>> +        "MetricExpr": "PM_EXEC_STALL_NTC_FLUSH / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "NTC_FLUSH_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTF instruction finishes at dispatch",
>>>> +        "MetricExpr": "PM_EXEC_STALL_FIN_AT_DISP / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "FIN_AT_DISP_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the branch unit",
>>>> +        "MetricExpr": "PM_EXEC_STALL_BRU / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "BRU_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a simple fixed point instruction that is executing in the LSU",
>>>> +        "MetricExpr": "PM_EXEC_STALL_SIMPLE_FX / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "SIMPLE_FX_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the VSU",
>>>> +        "MetricExpr": "PM_EXEC_STALL_VSU / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "VSU_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting to be finished in one of the execution units",
>>>> +        "MetricExpr": "PM_EXEC_STALL_TRANSLATION / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "TRANSLATION_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a load or store that suffered a translation miss",
>>>> +        "MetricExpr": "PM_EXEC_STALL_DERAT_ONLY_MISS / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DERAT_ONLY_MISS_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is recovering from a TLB miss",
>>>> +        "MetricExpr": "PM_EXEC_STALL_DERAT_DTLB_MISS / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DERAT_DTLB_MISS_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the LSU",
>>>> +        "MetricExpr": "PM_EXEC_STALL_LSU / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "LSU_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a load that is executing in the LSU",
>>>> +        "MetricExpr": "PM_EXEC_STALL_LOAD / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "LOAD_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3",
>>>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3 / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DMISS_L2L3_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3, with an RC dispatch conflict",
>>>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3_CONFLICT / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DMISS_L2L3_CONFLICT_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3, without an RC dispatch conflict",
>>>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3_NOCONFLICT / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DMISS_L2L3_NOCONFLICT_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a source beyond the local L2 and local L3",
>>>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L3MISS / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DMISS_L3MISS_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a neighbor chiplet's L2 or L3 in the same chip",
>>>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L21_L31 / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DMISS_L21_L31_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from L4, local memory or OpenCapp chip",
>>>
>>> What is "OpenCapp"?  Is it different from OpenCAPI?
>>>
>>>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_LMEM / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DMISS_LMEM_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a remote chip (cache, L4, memory or OpenCapp) in the same group",
>>>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_OFF_CHIP / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DMISS_OFF_CHIP_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a distant chip (cache, L4, memory or OpenCapp chip)",
>>>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_OFF_NODE / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DMISS_OFF_NODE_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing a TLBIEL instruction",
>>>> +        "MetricExpr": "PM_EXEC_STALL_TLBIEL / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "TLBIEL_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is finishing a load after its data has been reloaded from a data source beyond the local L1, OR when the LSU is processing an L1-hit, OR when the NTF instruction merged with another load in the LMQ",
>>>> +        "MetricExpr": "PM_EXEC_STALL_LOAD_FINISH / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "LOAD_FINISH_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a store that is executing in the LSU",
>>>> +        "MetricExpr": "PM_EXEC_STALL_STORE / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "STORE_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is in the store unit outside of handling store misses or other special store operations",
>>>
>>> Is "store unit" not the same as "LSU" ?  Use "LSU" uniformly if appropriate:
>>> s/store unit/LSU/
>>>
>>>> +        "MetricExpr": "PM_EXEC_STALL_STORE_PIPE / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "STORE_PIPE_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a store whose cache line was not resident in the L1 and had to wait for allocation of the missing line into the L1",
>>>> +        "MetricExpr": "PM_EXEC_STALL_STORE_MISS / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "STORE_MISS_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a TLBIE instruction waiting for a response from the L2",
>>>> +        "MetricExpr": "PM_EXEC_STALL_TLBIE / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "TLBIE_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing a PTESYNC instruction",
>>>> +        "MetricExpr": "PM_EXEC_STALL_PTESYNC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "PTESYNC_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete because the thread was blocked",
>>>> +        "MetricExpr": "PM_CMPL_STALL / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "COMPLETION_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete because it was interrupted by ANY exception",
>>>> +        "MetricExpr": "PM_CMPL_STALL_EXCEPTION / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "EXCEPTION_COMPLETION_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is stuck at finish waiting for the non-speculative finish of either a STCX instruction waiting for its result or a load waiting for non-critical sectors of data and ECC",
>>>> +        "MetricExpr": "PM_CMPL_STALL_MEM_ECC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "MEM_ECC_COMPLETION_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete the instruction is a STCX instruction waiting for resolution from the nest",
>>>
>>> Need to reword this, I think.  I propose "Average cycles per instruction
>>> when the NTC instruction is a STCX instruction waiting for resolution
>>> from the nest", which follows the form used by HWSYNC_COMPLETION_STALL_CPI,
>>> below.
>>>
>>>> +        "MetricExpr": "PM_CMPL_STALL_STCX / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "STCX_COMPLETION_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a LWSYNC instruction waiting to complete",
>>>> +        "MetricExpr": "PM_CMPL_STALL_LWSYNC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "LWSYNC_COMPLETION_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a HWSYNC instruction stuck at finish waiting for a response from the L2",
>>>> +        "MetricExpr": "PM_CMPL_STALL_HWSYNC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "HWSYNC_COMPLETION_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction required special handling before completion",
>>>> +        "MetricExpr": "PM_CMPL_STALL_SPECIAL / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "SPECIAL_COMPLETION_STALL_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because fetch was being held, so there was nothing in the pipeline for this thread",
>>>> +        "MetricExpr": "PM_DISP_STALL_FETCH / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_FETCH_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because of power management",
>>>> +        "MetricExpr": "PM_DISP_STALL_HELD_HALT_CYC / PM_RUN_INST_CMPL",
>>>> +        "MetricGroup": "CPI",
>>>> +        "MetricName": "DISPATCHED_HELD_HALT_CPI"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Percentage of flushes per completed run instruction",
>>>
>>> s/per completed run instruction/per instruction/
>>
>> I'm not sure we want to drop "completed" from this and all the following
>> descriptions.
>>
>> There is a meaningful distinction between completed and dispatched
>> instructions, I think it's useful to be explicit about which the event
>> is counting.
>>
>> I agree dropping "run" is a good idea: most people won't understand that
>> "run" means "non-idle", and I think they don't expect idle instructions to be
>> counted anyway.
>>
>> ...
>>>
>>>> +        "MetricExpr": "PM_RUN_INST_CMPL / PM_RUN_CYC",
>>>> +        "MetricGroup": "General",
>>>> +        "MetricName": "RUN_IPC"
>>>> +    },
>>>> +    {
>>>> +        "BriefDescription": "Average number of instructions completed per instruction group",
>>>
>>> s/completed//
>>
>> And here the meaning is different if you drop "completed".
> 
> All fair comments.  I am looking for consistency, but correctness trumps.
> 
> Regarding consistency, though, there are lots of occurrences like:
> |        "BriefDescription": "Average cycles per instruction when dispatch was stalled for any reason",
> |        "MetricExpr": "PM_DISP_STALL_CYC / PM_RUN_INST_CMPL",
> 
> Can we pick one phrase for all metrics where PM_RUN_INST_CMPL is used,
> perhaps?  "completed instructions" ?

Hi Paul/Michael,
      Sure, I will update the descriptions to use "completed
instructions" wherever the PM_RUN_INST_CMPL event is used.

Thanks,
Kajol Jain
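
A minimal sketch of how the descriptions still needing that change could
be listed, assuming the metrics file is a JSON array of objects with the
"BriefDescription", "MetricExpr" and "MetricName" keys shown in the patch.
The helper name inconsistent_descriptions and the exact phrase it checks
for are hypothetical and not part of perf; the first, third and fourth
sample entries are copied from the quoted patch, the second is a made-up
rewording.

```python
import json


def inconsistent_descriptions(metrics, event="PM_RUN_INST_CMPL",
                              phrase="completed instruction"):
    """Return the names of metrics whose MetricExpr uses `event` but whose
    BriefDescription does not contain `phrase` (hypothetical helper)."""
    flagged = []
    for metric in metrics:
        # Split on whitespace so PM_INST_CMPL is not mistaken for
        # PM_RUN_INST_CMPL (expressions in the patch are space separated).
        tokens = metric.get("MetricExpr", "").split()
        description = metric.get("BriefDescription", "").lower()
        if event in tokens and phrase not in description:
            flagged.append(metric["MetricName"])
    return flagged


sample = json.loads("""
[
    {
        "BriefDescription": "Average cycles per instruction when dispatch was stalled for any reason",
        "MetricExpr": "PM_DISP_STALL_CYC / PM_RUN_INST_CMPL",
        "MetricName": "DISPATCHED_CPI"
    },
    {
        "BriefDescription": "Average number of finished branches per completed instruction",
        "MetricExpr": "PM_BR_FIN / PM_RUN_INST_CMPL",
        "MetricName": "BRANCHES_PER_INST"
    },
    {
        "BriefDescription": "Average cycles per completed instruction",
        "MetricExpr": "PM_CYC / PM_INST_CMPL",
        "MetricName": "CYCLES_PER_INSTRUCTION"
    },
    {
        "BriefDescription": "Average number of completed run instructions per run cycle",
        "MetricExpr": "PM_RUN_INST_CMPL / PM_RUN_CYC",
        "MetricName": "RUN_IPC"
    }
]
""")

print(inconsistent_descriptions(sample))
```

For the real file, the sample list would be replaced by json.load() on
tools/perf/pmu-events/arch/powerpc/power10/metrics.json. The check is
textual only; whether "completed" is the right word for a given metric
still needs a human decision, as discussed above.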

> 
> PC
> 

* Re: linux-next: manual merge of the audit tree with the powerpc tree
From: Michael Ellerman @ 2021-10-26 10:55 UTC
  To: Stephen Rothwell, Paul Moore, PowerPC
  Cc: Richard Guy Briggs, Linux Next Mailing List,
	Linux Kernel Mailing List
In-Reply-To: <20211026133147.35d19e00@canb.auug.org.au>

Stephen Rothwell <sfr@canb.auug.org.au> writes:
> Hi all,
>
> Today's linux-next merge of the audit tree got conflicts in:
>
>   arch/powerpc/kernel/audit.c
>   arch/powerpc/kernel/compat_audit.c
>
> between commit:
>
>   566af8cda399 ("powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC")
>
> from the powerpc tree and commits:
>
>   42f355ef59a2 ("audit: replace magic audit syscall class numbers with macros")
>   1c30e3af8a79 ("audit: add support for the openat2 syscall")
>
> from the audit tree.

Thanks.

I guess this is OK, unless the audit folks disagree. I could revert the
powerpc commit and try it again later.

If I don't hear anything I'll leave it as-is.

cheers

^ permalink raw reply

* Re: [PATCH v5 1/2] ftrace: disable preemption when recursion locked
From: Steven Rostedt @ 2021-10-26 12:01 UTC (permalink / raw)
  To: 王贇
  Cc: Peter Zijlstra (Intel), Paul Walmsley, James E.J. Bottomley,
	Guo Ren, Jisheng Zhang, H. Peter Anvin, live-patching,
	linux-riscv, Miroslav Benes, Joe Lawrence, Helge Deller, x86,
	linux-csky, Ingo Molnar, Petr Mladek, Albert Ou, Jiri Kosina,
	Nicholas Piggin, Borislav Petkov, Josh Poimboeuf, Thomas Gleixner,
	linux-parisc, linux-kernel, Palmer Dabbelt, Masami Hiramatsu,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <18ba2a71-e12d-33f7-63fe-2857b2db022c@linux.alibaba.com>

On Tue, 26 Oct 2021 17:48:10 +0800
王贇 <yun.wang@linux.alibaba.com> wrote:

> > The two comments should be updated too since Steven removed the "bit == 0" 
> > trick.  
> 
> Could you please give more of a hint on what would be correct?
> 
> I get the point that bit will no longer be 0, there are only -1 or > 0 now
> so trace_test_and_set_recursion() will disable preemption on bit > 0 and
> trace_clear_recursion() will enable it since it should only be called when
> bit > 0 (I remember we could use a WARN_ON here now :-P).
> 
> >   
> >> @@ -178,7 +187,7 @@ static __always_inline void trace_clear_recursion(int bit)
> >>   * tracing recursed in the same context (normal vs interrupt),
> >>   *
> >>   * Returns: -1 if a recursion happened.
> >> - *           >= 0 if no recursion
> >> + *           > 0 if no recursion.
> >>   */
> >>  static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
> >>  							 unsigned long parent_ip)  
> > 
> > And this change would not be correct now.  
> 
> I thought it would no longer return 0, so I changed it to > 0; isn't that correct?

No, it is not. I removed the bit + 1 return value, which means it now
returns the actual bit, which is 0 or more.
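
As a rough userspace illustration of that return convention (not the
kernel implementation; sketch_trylock(), sketch_unlock() and
recursion_mask are hypothetical names, and preemption handling is left
out), a caller has to treat only a negative value as "recursion
detected", because 0 is a valid bit:

```c
#include <assert.h>

/* Hypothetical stand-in for the per-context recursion bits. */
static unsigned long recursion_mask;

/* Returns -1 if the bit is already held, otherwise the bit itself (0 or more). */
static int sketch_trylock(int bit)
{
	if (recursion_mask & (1UL << bit))
		return -1;
	recursion_mask |= 1UL << bit;
	return bit;
}

/* Must only be called with a value returned by a successful trylock. */
static void sketch_unlock(int bit)
{
	assert(bit >= 0);
	recursion_mask &= ~(1UL << bit);
}
```

A caller would therefore check "bit < 0" before bailing out; testing
"bit <= 0" or "!bit" would wrongly reject the case where bit 0 was taken.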

-- Steve

^ permalink raw reply

* [PATCH v1] powerpc/64s/interrupt: Fix check_return_regs_valid false positive
From: Nicholas Piggin @ 2021-10-26 12:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

check_return_regs_valid() can report a false positive if the return
regs are marked as norestart and they are an HSRR type interrupt,
because the low bit of regs->trap causes interrupt type matching
to fail.

This can occur, for example, on bare metal with a HV privileged doorbell
interrupt that causes a signal, but do_signal returns early because
get_signal() fails, and takes the "No signal to deliver" path. In this
case no signal was delivered so the return location is not changed so
return SRRs are not invalidated, yet set_trap_norestart is called, which
messes up the match. Building go-1.16.6 is known to reproduce this.

Fix it by using the TRAP() accessor which masks out the low bit.
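
The effect can be sketched in plain userspace C. This is only an
illustration under stated assumptions: struct sketch_regs, the
SKETCH_NORESTART_BIT value and the helper names are hypothetical
stand-ins, not the kernel's definitions. It shows why comparing the raw
trap value fails once the norestart flag is set, while the masked value
still matches:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: the norestart flag lives in the low bit of the trap value. */
#define SKETCH_NORESTART_BIT 0x1UL

struct sketch_regs {
	unsigned long trap;
};

/* Mimics an accessor that masks out the flag bit. */
static unsigned long sketch_trap(const struct sketch_regs *regs)
{
	return regs->trap & ~SKETCH_NORESTART_BIT;
}

/* Mimics marking the regs as norestart. */
static void sketch_set_norestart(struct sketch_regs *regs)
{
	regs->trap |= SKETCH_NORESTART_BIT;
}

/* Comparison on the raw field: breaks once the flag bit is set. */
static bool sketch_raw_match(const struct sketch_regs *regs, unsigned long type)
{
	return regs->trap == type;
}

/* Comparison through the masking accessor: unaffected by the flag bit. */
static bool sketch_masked_match(const struct sketch_regs *regs, unsigned long type)
{
	return sketch_trap(regs) == type;
}
```

With a trap value of 0xea0 and the flag set, the raw field reads 0xea1,
so only the masked comparison still identifies the interrupt type.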

Fixes: 6eaaf9de3599 ("powerpc/64s/interrupt: Check and fix srr_valid without crashing")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/interrupt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index de10a2697258..835b626cd476 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -266,7 +266,7 @@ static void check_return_regs_valid(struct pt_regs *regs)
 	if (trap_is_scv(regs))
 		return;

-	trap = regs->trap;
+	trap = TRAP(regs);
 	// EE in HV mode sets HSRRs like 0xea0
 	if (cpu_has_feature(CPU_FTR_HVMODE) && trap == INTERRUPT_EXTERNAL)
 		trap = 0xea0;
-- 
2.23.0

^ permalink raw reply related

* Re: [PATCH v2] macintosh/via-pmu-led: make disk activity usage a parameter.
From: Nathan Lynch @ 2021-10-26 13:08 UTC (permalink / raw)
  To: Hill Ma; +Cc: linuxppc-dev, linux-kernel, linux-doc
In-Reply-To: <20211026033254.1052-1-maahiuzeon@gmail.com>

Hello,

Hill Ma <maahiuzeon@gmail.com> writes:
> Whether to use the LED as a disk activity indicator is a user preference.
> Some like this usage while others find the LED too bright. So it
> might be a good idea to make this choice a runtime parameter rather
> than compile-time config.

Users already have the ability to change the LED behavior at runtime,
correct? I.e. they can do:

  echo none > /sys/class/leds/pmu-led::front/trigger

in their boot scripts. Granted, a kernel built with ADB_PMU_LED_DISK=y
will blink the LED on disk activity until user space is running. Is this
unsatisfactory?

> The default is set to disabled as OS X does not use the LED as a
> disk activity indicator.

This is long-standing behavior in Linux and OS X has been EOL on this
architecture for a decade, so this isn't much of a consideration at this
point. Seems more important to avoid surprising existing users and
distributions with a behavior change that makes additional work for
them. See below.

> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 43dc35fe5bc0..a656a51ba0a8 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -250,6 +250,12 @@
>  			Use timer override. For some broken Nvidia NF5 boards
>  			that require a timer override, but don't have HPET
>  
> +	adb_pmu_led_disk [PPC]
> +			Use front LED as disk LED by default. Only applies to
> +			PowerBook, iBook, PowerMac 7,2/7,3.
> +			Format: <bool>  (1/Y/y=enable, 0/N/n=disable)
> +			Default: disabled
> +
>  	add_efi_memmap	[EFI; X86] Include EFI memory map in
>  			kernel's map of available physical RAM.
>  
> diff --git a/drivers/macintosh/Kconfig b/drivers/macintosh/Kconfig
> index 5cdc361da37c..243215de563c 100644
> --- a/drivers/macintosh/Kconfig
> +++ b/drivers/macintosh/Kconfig
> @@ -78,16 +78,6 @@ config ADB_PMU_LED
>  	  behaviour of the old CONFIG_BLK_DEV_IDE_PMAC_BLINK, select this
>  	  and the disk LED trigger and configure appropriately through sysfs.
>  
> -config ADB_PMU_LED_DISK
> -	bool "Use front LED as DISK LED by default"
> -	depends on ADB_PMU_LED
> -	depends on LEDS_CLASS
> -	select LEDS_TRIGGERS
> -	select LEDS_TRIGGER_DISK
> -	help
> -	  This option makes the front LED default to the disk trigger
> -	  so that it blinks on disk activity.
> -

So, if I've been relying on CONFIG_ADB_PMU_LED_DISK=y and I upgrade to a
newer kernel with the proposed change, from my point of view the disk
activity LED has stopped working and I need to alter the bootloader
config or init scripts to restore the expected behavior. That seems
undesirable to me.

I don't think we rigidly enforce Kconfig backward compatibility, but
when it comes to a user-visible function on a legacy platform where
users and distros likely have their configurations figured out already,
it's probably best to avoid such changes.

^ permalink raw reply

* [PATCH] powerpc/xmon: fix task state output
From: Denis Kirjanov @ 2021-10-26 13:31 UTC (permalink / raw)
  To: linuxppc-dev

p_state has been unsigned since commit 2f064a59a11f, so the
"p_state < 0" check can never be true. Remove it.

The patch also compares against TASK_RUNNING instead of a literal 0.
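
For illustration only, here is a small userspace sketch of the resulting
state mapping. The SK_* constants and sketch_state_char() are
hypothetical stand-ins with illustrative values, not the kernel's
definitions, and only the first few states are covered:

```c
#include <assert.h>

/* Illustrative values; assumptions made for this sketch only. */
#define SK_TASK_RUNNING		0x0000u
#define SK_TASK_UNINTERRUPTIBLE	0x0002u
#define SK_TASK_STOPPED		0x0004u

static char sketch_state_char(unsigned int p_state)
{
	/*
	 * p_state is unsigned, so a "p_state < 0" branch would be dead
	 * code: even (unsigned int)-1 is a large positive value.
	 */
	return (p_state == SK_TASK_RUNNING) ? 'R' :
	       (p_state & SK_TASK_UNINTERRUPTIBLE) ? 'D' :
	       (p_state & SK_TASK_STOPPED) ? 'T' : '?';
}
```

Passing (unsigned int)-1 does not reach any "negative" case; it simply
has every bit set and is reported by the first matching bit test.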

Fixes: 2f064a59a11f ("sched: Change task_struct::state")
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
---
 arch/powerpc/xmon/xmon.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index dd8241c009e5..8b28ff9d98d1 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -3264,8 +3264,7 @@ static void show_task(struct task_struct *volatile tsk)
 	 * appropriate for calling from xmon. This could be moved
 	 * to a common, generic, routine used by both.
 	 */
-	state = (p_state == 0) ? 'R' :
-		(p_state < 0) ? 'U' :
+	state = (p_state == TASK_RUNNING) ? 'R' :
 		(p_state & TASK_UNINTERRUPTIBLE) ? 'D' :
 		(p_state & TASK_STOPPED) ? 'T' :
 		(p_state & TASK_TRACED) ? 'C' :
-- 
2.16.4


^ permalink raw reply related

* [PATCH v2] powerpc/boot: Set LANG=C in wrapper script
From: Christophe Leroy @ 2021-10-26 13:45 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

While trying to build a simple Image for the ACADIA platform, I got the
following error:

	  WRAP    arch/powerpc/boot/simpleImage.acadia
	INFO: Uncompressed kernel (size 0x6ae7d0) overlaps the address of the wrapper(0x400000)
	INFO: Fixing the link_address of wrapper to (0x700000)
	powerpc64-linux-gnu-ld : mode d'émulation non reconnu : -T
	Émulations prises en charge : elf64ppc elf32ppc elf32ppclinux elf32ppcsim elf64lppc elf32lppc elf32lppclinux elf32lppcsim
	make[1]: *** [arch/powerpc/boot/Makefile:424 : arch/powerpc/boot/simpleImage.acadia] Erreur 1
	make: *** [arch/powerpc/Makefile:285 : simpleImage.acadia] Erreur 2

Trying again with V=1 shows the following command

	powerpc64-linux-gnu-ld -m -T arch/powerpc/boot/zImage.lds -Ttext 0x700000 --no-dynamic-linker -o arch/powerpc/boot/simpleImage.acadia -Map wrapper.map arch/powerpc/boot/fixed-head.o arch/powerpc/boot/simpleboot.o ./zImage.3278022.o arch/powerpc/boot/wrapper.a

The argument of '-m' is missing.

This is due to the wrapper script calling 'objdump -p vmlinux' and
looking for 'file format', whereas the output of objdump is:

	vmlinux:     format de fichier elf32-powerpc

	En-tête de programme:
	    LOAD off    0x00010000 vaddr 0xc0000000 paddr 0x00000000 align 2**16
	         filesz 0x0069e1d4 memsz 0x006c128c flags rwx
	    NOTE off    0x0064591c vaddr 0xc063591c paddr 0x0063591c align 2**2
	         filesz 0x00000054 memsz 0x00000054 flags ---

Add LC_ALL=C at the beginning of the wrapper script in order to get the
output expected by the script:

	vmlinux:     file format elf32-powerpc

	Program Header:
	    LOAD off    0x00010000 vaddr 0xc0000000 paddr 0x00000000 align 2**16
	         filesz 0x0069e1d4 memsz 0x006c128c flags rwx
	    NOTE off    0x0064591c vaddr 0xc063591c paddr 0x0063591c align 2**2
	         filesz 0x00000054 memsz 0x00000054 flags ---

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v2: Use LC_ALL=C per Segher
---
 arch/powerpc/boot/wrapper | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/boot/wrapper b/arch/powerpc/boot/wrapper
index 1cd82564c996..9184eda780fd 100755
--- a/arch/powerpc/boot/wrapper
+++ b/arch/powerpc/boot/wrapper
@@ -26,6 +26,8 @@
 # Stop execution if any command fails
 set -e
 
+export LC_ALL=C
+
 # Allow for verbose output
 if [ "$V" = 1 ]; then
     set -x
-- 
2.31.1


^ permalink raw reply related
