* Re: [PATCH v2 2/3] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
From: Ravi Bangoria @ 2020-07-20 4:24 UTC (permalink / raw)
To: Pratik Rajesh Sampat
Cc: ego, mikey, pratik.r.sampat, linux-kernel, paulus, linuxppc-dev,
Ravi Bangoria
In-Reply-To: <20200710052207.12003-3-psampat@linux.ibm.com>
Hi Pratik,
On 7/10/20 10:52 AM, Pratik Rajesh Sampat wrote:
> Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
> stop levels < 4.
p10 has one more pair DAWR1/DAWRX1. Please include that as well.
Ravi
^ permalink raw reply
* Re: [PATCH v2 2/3] powerpc/powernv/idle: save-restore DAWR0, DAWRX0 for P10
From: Ravi Bangoria @ 2020-07-20 4:30 UTC (permalink / raw)
To: Nicholas Piggin, Pratik Rajesh Sampat
Cc: ego, mikey, pratik.r.sampat, linux-kernel, paulus, linuxppc-dev
In-Reply-To: <1594619458.45vrahx59w.astroid@bobo.none>
Hi Nick,
On 7/13/20 11:22 AM, Nicholas Piggin wrote:
> Excerpts from Pratik Rajesh Sampat's message of July 10, 2020 3:22 pm:
>> Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
>> stop levels < 4.
>> Therefore save the values of these SPRs before entering a "stop"
>> state and restore their values on wakeup.
>
> Hmm, where do you get this from? Documentation I see says DAWR is lost
> on POWER9 but not P10.
>
> Does idle thread even need to save DAWR, or does it get switched when
> going to a thread that has a watchpoint set?
I don't know how idle states works internally but IIUC, we need to save/restore
DAWRs. This is needed when user creates per-cpu watchpoint event.
Ravi
^ permalink raw reply
* [FIX PATCH] powerpc/prom: Enable Radix GTSE in cpu pa-features
From: Bharata B Rao @ 2020-07-20 4:42 UTC (permalink / raw)
To: linuxppc-dev; +Cc: aneesh.kumar, npiggin, Bharata B Rao, Qian Cai
From: Nicholas Piggin <npiggin@gmail.com>
When '029ab30b4c0a ("powerpc/mm: Enable radix GTSE only if supported.")'
made GTSE an MMU feature, it was enabled by default in
powerpc-cpu-features but was missed in pa-features. This causes
random memory corruption during boot of PowerNV kernels where
CONFIG_PPC_DT_CPU_FTRS isn't enabled.
Fixes: 029ab30b4c0a ("powerpc/mm: Enable radix GTSE only if supported.")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
arch/powerpc/kernel/prom.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 9cc49f265c86..a9594bad572a 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -163,7 +163,8 @@ static struct ibm_pa_feature {
{ .pabyte = 0, .pabit = 6, .cpu_features = CPU_FTR_NOEXECUTE },
{ .pabyte = 1, .pabit = 2, .mmu_features = MMU_FTR_CI_LARGE_PAGE },
#ifdef CONFIG_PPC_RADIX_MMU
- { .pabyte = 40, .pabit = 0, .mmu_features = MMU_FTR_TYPE_RADIX },
+ { .pabyte = 40, .pabit = 0,
+ .mmu_features = (MMU_FTR_TYPE_RADIX | MMU_FTR_GTSE) },
#endif
{ .pabyte = 1, .pabit = 1, .invert = 1, .cpu_features = CPU_FTR_NODSISRALIGN },
{ .pabyte = 5, .pabit = 0, .cpu_features = CPU_FTR_REAL_LE,
--
2.26.2
^ permalink raw reply related
* Re: [PATCH v2 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states
From: Gautham R Shenoy @ 2020-07-20 5:00 UTC (permalink / raw)
To: Pratik Rajesh Sampat
Cc: ego, linux-pm, daniel.lezcano, rjw, linuxppc-dev, npiggin, paulus,
linux-kselftest, shuah, srivatsa, linux-kernel
In-Reply-To: <20200717091801.29289-2-psampat@linux.ibm.com>
On Fri, Jul 17, 2020 at 02:48:00PM +0530, Pratik Rajesh Sampat wrote:
> Fire directed smp_call_function_single IPIs from a specified source
> CPU to the specified target CPU to reduce the noise we have to wade
> through in the trace log.
> The module is based on the idea written by Srivatsa Bhat and maintained
> by Vaidyanathan Srinivasan internally.
>
> Queue HR timer and measure jitter. Wakeup latency measurement for idle
> states using hrtimer. Echo a value in ns to timer_test_function and
> watch trace. A HRtimer will be queued and when it fires the expected
> wakeup vs actual wakeup is computes and delay printed in ns.
>
> Implemented as a module which utilizes debugfs so that it can be
> integrated with selftests.
>
> To include the module, check option and include as module
> kernel hacking -> Cpuidle latency selftests
>
> [srivatsa.bhat@linux.vnet.ibm.com: Initial implementation in
> cpidle/sysfs]
>
> [svaidy@linux.vnet.ibm.com: wakeup latency measurements using hrtimer
> and fix some of the time calculation]
>
> [ego@linux.vnet.ibm.com: Fix some whitespace and tab errors and
> increase the resolution of IPI wakeup]
>
> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
The debugfs module looks good to me.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
> drivers/cpuidle/Makefile | 1 +
> drivers/cpuidle/test-cpuidle_latency.c | 150 +++++++++++++++++++++++++
> lib/Kconfig.debug | 10 ++
> 3 files changed, 161 insertions(+)
> create mode 100644 drivers/cpuidle/test-cpuidle_latency.c
>
> diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> index f07800cbb43f..2ae05968078c 100644
> --- a/drivers/cpuidle/Makefile
> +++ b/drivers/cpuidle/Makefile
> @@ -8,6 +8,7 @@ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
> obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
> obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o
> obj-$(CONFIG_HALTPOLL_CPUIDLE) += cpuidle-haltpoll.o
> +obj-$(CONFIG_IDLE_LATENCY_SELFTEST) += test-cpuidle_latency.o
>
> ##################################################################################
> # ARM SoC drivers
> diff --git a/drivers/cpuidle/test-cpuidle_latency.c b/drivers/cpuidle/test-cpuidle_latency.c
> new file mode 100644
> index 000000000000..61574665e972
> --- /dev/null
> +++ b/drivers/cpuidle/test-cpuidle_latency.c
> @@ -0,0 +1,150 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Module-based API test facility for cpuidle latency using IPIs and timers
> + */
> +
> +#include <linux/debugfs.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +
> +/* IPI based wakeup latencies */
> +struct latency {
> + unsigned int src_cpu;
> + unsigned int dest_cpu;
> + ktime_t time_start;
> + ktime_t time_end;
> + u64 latency_ns;
> +} ipi_wakeup;
> +
> +static void measure_latency(void *info)
> +{
> + struct latency *v;
> + ktime_t time_diff;
> +
> + v = (struct latency *)info;
> + v->time_end = ktime_get();
> + time_diff = ktime_sub(v->time_end, v->time_start);
> + v->latency_ns = ktime_to_ns(time_diff);
> +}
> +
> +void run_smp_call_function_test(unsigned int cpu)
> +{
> + ipi_wakeup.src_cpu = smp_processor_id();
> + ipi_wakeup.dest_cpu = cpu;
> + ipi_wakeup.time_start = ktime_get();
> + smp_call_function_single(cpu, measure_latency, &ipi_wakeup, 1);
> +}
> +
> +/* Timer based wakeup latencies */
> +struct timer_data {
> + unsigned int src_cpu;
> + u64 timeout;
> + ktime_t time_start;
> + ktime_t time_end;
> + struct hrtimer timer;
> + u64 timeout_diff_ns;
> +} timer_wakeup;
> +
> +static enum hrtimer_restart timer_called(struct hrtimer *hrtimer)
> +{
> + struct timer_data *w;
> + ktime_t time_diff;
> +
> + w = container_of(hrtimer, struct timer_data, timer);
> + w->time_end = ktime_get();
> +
> + time_diff = ktime_sub(w->time_end, w->time_start);
> + time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout));
> + w->timeout_diff_ns = ktime_to_ns(time_diff);
> + return HRTIMER_NORESTART;
> +}
> +
> +static void run_timer_test(unsigned int ns)
> +{
> + hrtimer_init(&timer_wakeup.timer, CLOCK_MONOTONIC,
> + HRTIMER_MODE_REL);
> + timer_wakeup.timer.function = timer_called;
> + timer_wakeup.time_start = ktime_get();
> + timer_wakeup.src_cpu = smp_processor_id();
> + timer_wakeup.timeout = ns;
> +
> + hrtimer_start(&timer_wakeup.timer, ns_to_ktime(ns),
> + HRTIMER_MODE_REL_PINNED);
> +}
> +
> +static struct dentry *dir;
> +
> +static int cpu_read_op(void *data, u64 *value)
> +{
> + *value = ipi_wakeup.dest_cpu;
> + return 0;
> +}
> +
> +static int cpu_write_op(void *data, u64 value)
> +{
> + run_smp_call_function_test(value);
> + return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(ipi_ops, cpu_read_op, cpu_write_op, "%llu\n");
> +
> +static int timeout_read_op(void *data, u64 *value)
> +{
> + *value = timer_wakeup.timeout;
> + return 0;
> +}
> +
> +static int timeout_write_op(void *data, u64 value)
> +{
> + run_timer_test(value);
> + return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(timeout_ops, timeout_read_op, timeout_write_op, "%llu\n");
> +
> +static int __init latency_init(void)
> +{
> + struct dentry *temp;
> +
> + dir = debugfs_create_dir("latency_test", 0);
> + if (!dir) {
> + pr_alert("latency_test: failed to create /sys/kernel/debug/latency_test\n");
> + return -1;
> + }
> + temp = debugfs_create_file("ipi_cpu_dest",
> + 0666,
> + dir,
> + NULL,
> + &ipi_ops);
> + if (!temp) {
> + pr_alert("latency_test: failed to create /sys/kernel/debug/ipi_cpu_dest\n");
> + return -1;
> + }
> + debugfs_create_u64("ipi_latency_ns", 0444, dir, &ipi_wakeup.latency_ns);
> + debugfs_create_u32("ipi_cpu_src", 0444, dir, &ipi_wakeup.src_cpu);
> +
> + temp = debugfs_create_file("timeout_expected_ns",
> + 0666,
> + dir,
> + NULL,
> + &timeout_ops);
> + if (!temp) {
> + pr_alert("latency_test: failed to create /sys/kernel/debug/timeout_expected_ns\n");
> + return -1;
> + }
> + debugfs_create_u64("timeout_diff_ns", 0444, dir, &timer_wakeup.timeout_diff_ns);
> + debugfs_create_u32("timeout_cpu_src", 0444, dir, &timer_wakeup.src_cpu);
> + pr_info("Latency Test module loaded\n");
> + return 0;
> +}
> +
> +static void __exit latency_cleanup(void)
> +{
> + pr_info("Cleaning up Latency Test module.\n");
> + debugfs_remove_recursive(dir);
> +}
> +
> +module_init(latency_init);
> +module_exit(latency_cleanup);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("IBM Corporation");
> +MODULE_DESCRIPTION("Measuring idle latency for IPIs and Timers");
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index d74ac0fd6b2d..e2283790245a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1375,6 +1375,16 @@ config DEBUG_KOBJECT
> If you say Y here, some extra kobject debugging messages will be sent
> to the syslog.
>
> +config IDLE_LATENCY_SELFTEST
> + tristate "Cpuidle latency selftests"
> + depends on CPU_IDLE
> + help
> + This option provides a kernel module that runs tests using the IPI and
> + timers to measure latency.
> +
> + Say M if you want these self tests to build as a module.
> + Say N if you are unsure.
> +
> config DEBUG_KOBJECT_RELEASE
> bool "kobject release debugging"
> depends on DEBUG_OBJECTS_TIMERS
> --
> 2.25.4
>
^ permalink raw reply
* Re: [PATCH v2 13/16] scripts/kallsyms: move ignored symbol types to is_ignored_symbol()
From: Masahiro Yamada @ 2020-07-20 5:34 UTC (permalink / raw)
To: Finn Thain
Cc: linuxppc-dev, Linux Kernel Mailing List,
Linux Kbuild mailing list
In-Reply-To: <alpine.LNX.2.23.453.2007201132240.8@nippy.intranet>
On Mon, Jul 20, 2020 at 10:46 AM Finn Thain <fthain@telegraphics.com.au> wrote:
>
> On Sun, 24 Nov 2019, Masahiro Yamada wrote:
>
> > Collect the ignored patterns to is_ignored_symbol().
> >
> > Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
>
> This commit (887df76de67f5) caused a regression in my powerpc builds as it
> causes symbol names to disappear from backtraces:
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at kernel/smp.c:433 _einittext+0x3f9e5120/0x3feb71b8
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc7-pmac-00055-g887df76de67f5 #18
> NIP: c00aef68 LR: c00af114 CTR: c001272c
> REGS: c0705c40 TRAP: 0700 Not tainted (5.4.0-rc7-pmac-00055-g887df76de67f5)
> MSR: 00029032 <EE,ME,IR,DR,RI> CR: 42000044 XER: 00000000
>
> GPR00: 001f0100 c0705cf8 c06dc300 c070af1c c001258c 00000000 00000000 ef7fb5bc
> GPR08: 08800000 00000100 00000001 00000100 42000044 00000000 c0709040 00000004
> GPR16: 00000001 c06022b4 c058297c 00200002 ffff8cb9 00000000 c06d84a0 c0710000
> GPR24: c0710000 00000000 00000000 c070af1c c070af1c 00000000 c001258c 00000000
> NIP [c00aef68] _einittext+0x3f9e5120/0x3feb71b8
> LR [c00af114] _einittext+0x3f9e52cc/0x3feb71b8
> Call Trace:
> [c0705cf8] [ef006320] 0xef006320 (unreliable)
> [c0705d38] [c00af114] _einittext+0x3f9e52cc/0x3feb71b8
> [c0705d48] [c00af158] _einittext+0x3f9e5310/0x3feb71b8
> [c0705d68] [c0012768] _einittext+0x3f948920/0x3feb71b8
> [c0705d78] [c0092c04] _einittext+0x3f9c8dbc/0x3feb71b8
> [c0705d88] [c0092d18] _einittext+0x3f9c8ed0/0x3feb71b8
> [c0705da8] [c0093a2c] _einittext+0x3f9c9be4/0x3feb71b8
> [c0705de8] [c0580224] _einittext+0x3feb63dc/0x3feb71b8
> [c0705e48] [c00382ec] _einittext+0x3f96e4a4/0x3feb71b8
> [c0705e58] [c000d2a0] _einittext+0x3f943458/0x3feb71b8
> [c0705e88] [c001353c] _einittext+0x3f9496f4/0x3feb71b8
> --- interrupt: 901 at _einittext+0x3f941058/0x3feb71b8
> LR = _einittext+0x3f941058/0x3feb71b8
> [c0705f50] [c06cc214] 0xc06cc214 (unreliable)
> [c0705f60] [c057fa20] _einittext+0x3feb5bd8/0x3feb71b8
> [c0705f70] [c005de48] _einittext+0x3f994000/0x3feb71b8
> [c0705f90] [c005e050] _einittext+0x3f994208/0x3feb71b8
> [c0705fa0] [c0004cc8] _einittext+0x3f93ae80/0x3feb71b8
> [c0705fb0] [c069a36c] _einittext+0x3ffd0524/0x40000000
> [c0705ff0] [00003500] 0x3500
> Instruction dump:
> 7c0803a6 7fa5eb78 7d808120 7ea6ab78 baa10014 38210040 4bfffbb0 7f64db78
> 7f85e378 484b31b1 7c601b78 4bfffdf4 <0fe00000> 4bfffd60 9421ffe0 7c0802a6
> ---[ end trace a06fef4788747c72 ]---
>
>
> Prior to that (e.g. 97261e1e2240f), I get backtraces like this:
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at kernel/smp.c:433 smp_call_function_many+0x318/0x320
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc7-pmac-00054-g97261e1e2240f #20
> NIP: c00aef68 LR: c00af114 CTR: c001272c
> REGS: c075dc40 TRAP: 0700 Not tainted (5.4.0-rc7-pmac-00054-g97261e1e2240f)
> MSR: 00029032 <EE,ME,IR,DR,RI> CR: 42000044 XER: 00000000
>
> GPR00: 001f0100 c075dcf8 c0733300 c0762f1c c001258c 00000000 00000000 ef7fb5bc
> GPR08: 04800000 00000100 00000001 00000100 42000044 00000000 c0761040 00000004
> GPR16: 00000001 c0658e58 c058297c 00200002 ffff8cb9 00000000 c072f4a0 c0760000
> GPR24: c0760000 00000000 00000000 c0762f1c c0762f1c 00000000 c001258c 00000000
> NIP [c00aef68] smp_call_function_many+0x318/0x320
> LR [c00af114] smp_call_function+0x34/0x44
> Call Trace:
> [c075dcf8] [ef006320] 0xef006320 (unreliable)
> [c075dd38] [c00af114] smp_call_function+0x34/0x44
> [c075dd48] [c00af158] on_each_cpu+0x1c/0x4c
> [c075dd68] [c0012768] tau_timeout_smp+0x3c/0x4c
> [c075dd78] [c0092c04] call_timer_fn.isra.26+0x20/0x84
> [c075dd88] [c0092d18] expire_timers+0xb0/0xc0
> [c075dda8] [c0093a2c] run_timer_softirq+0xa4/0x1a4
> [c075dde8] [c0580224] __do_softirq+0x11c/0x280
> [c075de48] [c00382ec] irq_exit+0xc0/0xd4
> [c075de58] [c000d2a0] timer_interrupt+0x154/0x260
> [c075de88] [c001353c] ret_from_except+0x0/0x14
> --- interrupt: 901 at arch_cpu_idle+0x24/0x78
> LR = arch_cpu_idle+0x24/0x78
> [c075df50] [c0723214] 0xc0723214 (unreliable)
> [c075df60] [c057fa20] default_idle_call+0x38/0x58
> [c075df70] [c005de48] do_idle+0xd4/0x17c
> [c075df90] [c005e054] cpu_startup_entry+0x24/0x28
> [c075dfa0] [c0004cc8] rest_init+0xa8/0xbc
> [c075dfb0] [c06f136c] start_kernel+0x40c/0x420
> [c075dff0] [00003500] 0x3500
> Instruction dump:
> 7c0803a6 7fa5eb78 7d808120 7ea6ab78 baa10014 38210040 4bfffbb0 7f64db78
> 7f85e378 484b31b1 7c601b78 4bfffdf4 <0fe00000> 4bfffd60 9421ffe0 7c0802a6
> ---[ end trace 784c7f15ecd23941 ]---
>
> Has anyone else observed these problems (either the WARNING from
> smp_call_function_many() or the missing symbol names)?
>
> What is the best way to fix this? Should I upgrade binutils?
I got a similar report before.
I'd like to know whether or not
this is the same issue as fixed by
7883a14339299773b2ce08dcfd97c63c199a9289
Does your problem happen on the latest kernel?
Which version of binutils are you using?
--
Best Regards
Masahiro Yamada
^ permalink raw reply
* Re: [FIX PATCH] powerpc/prom: Enable Radix GTSE in cpu pa-features
From: Nicholas Piggin @ 2020-07-20 5:38 UTC (permalink / raw)
To: Bharata B Rao, linuxppc-dev; +Cc: aneesh.kumar, Qian Cai
In-Reply-To: <20200720044258.863574-1-bharata@linux.ibm.com>
Excerpts from Bharata B Rao's message of July 20, 2020 2:42 pm:
> From: Nicholas Piggin <npiggin@gmail.com>
>
> When '029ab30b4c0a ("powerpc/mm: Enable radix GTSE only if supported.")'
> made GTSE an MMU feature, it was enabled by default in
> powerpc-cpu-features but was missed in pa-features. This causes
> random memory corruption during boot of PowerNV kernels where
> CONFIG_PPC_DT_CPU_FTRS isn't enabled.
Thanks for writing this up, I got a bit bogged down with other things.
> Fixes: 029ab30b4c0a ("powerpc/mm: Enable radix GTSE only if supported.")
> Reported-by: Qian Cai <cai@lca.pw>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> ---
> arch/powerpc/kernel/prom.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 9cc49f265c86..a9594bad572a 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -163,7 +163,8 @@ static struct ibm_pa_feature {
> { .pabyte = 0, .pabit = 6, .cpu_features = CPU_FTR_NOEXECUTE },
> { .pabyte = 1, .pabit = 2, .mmu_features = MMU_FTR_CI_LARGE_PAGE },
> #ifdef CONFIG_PPC_RADIX_MMU
> - { .pabyte = 40, .pabit = 0, .mmu_features = MMU_FTR_TYPE_RADIX },
> + { .pabyte = 40, .pabit = 0,
> + .mmu_features = (MMU_FTR_TYPE_RADIX | MMU_FTR_GTSE) },
It might look better like this:
{ .pabyte = 1, .pabit = 2, .mmu_features = MMU_FTR_CI_LARGE_PAGE },
#ifdef CONFIG_PPC_RADIX_MMU
{ .pabyte = 40, .pabit = 0, .mmu_features = MMU_FTR_TYPE_RADIX },
{ .pabyte = 40, .pabit = 0, .mmu_features = MMU_FTR_TYPE_RADIX |
MMU_FTR_GTSE },
#endif
{ .pabyte = 1, .pabit = 1, .invert = 1, .cpu_features = CPU_FTR_NODSISRALIGN },
But that's bikeshedding a bit and the optional bits already put it out
of alignment.
Thanks,
Nick
^ permalink raw reply
* Re: [PATCH 10/11] powerpc/smp: Implement cpu_to_coregroup_id
From: Srikar Dronamraju @ 2020-07-20 5:48 UTC (permalink / raw)
To: Gautham R Shenoy
Cc: Nathan Lynch, Oliver OHalloran, Michael Neuling, Michael Ellerman,
Anton Blanchard, linuxppc-dev, Nick Piggin
In-Reply-To: <20200717082652.GF32531@in.ibm.com>
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-17 13:56:53]:
> On Tue, Jul 14, 2020 at 10:06:23AM +0530, Srikar Dronamraju wrote:
> > Lookup the coregroup id from the associativity array.
> >
> > If unable to detect the coregroup id, fallback on the core id.
> > This way, ensure sched_domain degenerates and an extra sched domain is
> > not created.
> >
> > Ideally this function should have been implemented in
> > arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
> > don't need to find the primary domain again.
> >
> > If the device-tree mentions more than one coregroup, then kernel
> > implements only the last or the smallest coregroup, which currently
> > corresponds to the penultimate domain in the device-tree.
> >
> > Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> > Cc: Michael Ellerman <michaele@au1.ibm.com>
> > Cc: Nick Piggin <npiggin@au1.ibm.com>
> > Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Neuling <mikey@linux.ibm.com>
> > Cc: Anton Blanchard <anton@au1.ibm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> > arch/powerpc/mm/numa.c | 17 +++++++++++++++++
> > 1 file changed, 17 insertions(+)
> >
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index d9ab9da85eab..4e85564ef62a 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -1697,6 +1697,23 @@ static const struct proc_ops topology_proc_ops = {
> >
> > int cpu_to_coregroup_id(int cpu)
> > {
> > + __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> > + int index;
> > +
>
> It would be good to have an assert here to ensure that we are calling
> this function only when coregroups are enabled.
>
> Else, we may end up returning the penultimate index which maps to the
> chip-id.
>
We have a check below exactly for the same reason. Please look below.
>
>
> > + if (cpu < 0 || cpu > nr_cpu_ids)
> > + return -1;
> > +
> > + if (!firmware_has_feature(FW_FEATURE_VPHN))
> > + goto out;
> > +
> > + if (vphn_get_associativity(cpu, associativity))
> > + goto out;
> > +
> > + index = of_read_number(associativity, 1);
> > + if ((index > min_common_depth + 1) && coregroup_enabled)
> > + return of_read_number(&associativity[index - 1], 1);
See ^above.
index would be the all the domains in the associativity array,
min_common_depth would be where the primary domain or the chip-id is
defined. So we are reading the penultimate domain if and only if the
min_common_depth isn't the primary domain aka chip-id.
What other check /assertions can we add?
> > +
> > +out:
> > return cpu_to_core_id(cpu);
> > }
> >
> > --
> > 2.17.1
> >
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply
* Re: [PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.
From: Vaidyanathan Srinivasan @ 2020-07-20 5:48 UTC (permalink / raw)
To: Gautham R. Shenoy
Cc: Nathan Lynch, Michael Neuling, linux-pm, linux-kernel,
Nicholas Piggin, linuxppc-dev
In-Reply-To: <1594120299-31389-1-git-send-email-ego@linux.vnet.ibm.com>
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-07 16:41:34]:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>
> Hi,
>
> On pseries Dedicated Linux LPARs, apart from the polling snooze idle
> state, we currently have the CEDE idle state which cedes the CPU to
> the hypervisor with latency-hint = 0.
>
> However, the PowerVM hypervisor supports additional extended CEDE
> states, which can be queried through the "ibm,get-systems-parameter"
> rtas-call with the CEDE_LATENCY_TOKEN. The hypervisor maps these
> extended CEDE states to appropriate platform idle-states in order to
> provide energy-savings as well as shifting power to the active
> units. On existing pseries LPARs today we have extended CEDE with
> latency-hints {1,2} supported.
>
> In Patches 1-3 of this patchset, we add the code to parse the CEDE
> latency records provided by the hypervisor. We use this information to
> determine the wakeup latency of the regular CEDE (which we have been
> so far hardcoding to 10us while experimentally it is much lesser ~
> 1us), by looking at the wakeup latency provided by the hypervisor for
> Extended CEDE states. Since the platform currently advertises Extended
> CEDE 1 to have wakeup latency of 2us, we can be sure that the wakeup
> latency of the regular CEDE is no more than this.
>
> Patch 4 (currently marked as RFC), expose the extended CEDE states
> parsed above to the cpuidle framework, provided that they can wakeup
> on an interrupt. On current platforms only Extended CEDE 1 fits the
> bill, but this is going to change in future platforms where even
> Extended CEDE 2 may be responsive to external interrupts.
>
> Patch 5 (currently marked as RFC), filters out Extended CEDE 1 since
> it offers no added advantage over the normal CEDE.
>
> With Patches 1-3, we see an improvement in the single-threaded
> performance on ebizzy.
>
> 2 ebizzy threads bound to the same big-core. 25% improvement in the
> avg records/s (higher the better) with patches 1-3.
> x without_patches
> * with_patches
> N Min Max Median Avg Stddev
> x 10 2491089 5834307 5398375 4244335 1596244.9
> * 10 2893813 5834474 5832448 5327281.3 1055941.4
>
> We do not observe any major regression in either the context_switch2
> benchmark or the schbench benchmark
>
> context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
> small cores). We observe a minor 0.14% regression in the number of
> context-switches (higher is better).
> x without_patch
> * with_patch
> N Min Max Median Avg Stddev
> x 500 348872 362236 354712 354745.69 2711.827
> * 500 349422 361452 353942 354215.4 2576.9258
>
> context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
> improvement in the number of context-switches (higher is better).
> x without_patch
> * with_patch
> N Min Max Median Avg Stddev
> x 500 287956 294940 288896 288977.23 646.59295
> * 500 288300 294646 289582 290064.76 1161.9992
>
> schbench:
> No major difference could be seen until the 99.9th percentile.
>
> Without-patch
> Latency percentiles (usec)
> 50.0th: 29
> 75.0th: 39
> 90.0th: 49
> 95.0th: 59
> *99.0th: 13104
> 99.5th: 14672
> 99.9th: 15824
> min=0, max=17993
>
> With-patch:
> Latency percentiles (usec)
> 50.0th: 29
> 75.0th: 40
> 90.0th: 50
> 95.0th: 61
> *99.0th: 13648
> 99.5th: 14768
> 99.9th: 15664
> min=0, max=29812
This patch series mainly cleans up the CEDE latency discovery and
prepares to add different cpuidle states in virtualised environment.
This helps in improving SMT folding speeds and also power savings and
power shifting with newer platform firmware.
The current benefit is primarily from faster SMT folding and resulting
single performance achieved by updating the platform firmware provided
heuristics in the cpuidle states.
--Vaidy
^ permalink raw reply
* Re: [PATCH v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
From: Gautham R Shenoy @ 2020-07-20 5:52 UTC (permalink / raw)
To: Pratik Rajesh Sampat
Cc: ego, linux-pm, daniel.lezcano, rjw, linuxppc-dev, npiggin, paulus,
linux-kselftest, shuah, srivatsa, linux-kernel
In-Reply-To: <20200717091801.29289-3-psampat@linux.ibm.com>
Hi Pratik,
On Fri, Jul 17, 2020 at 02:48:01PM +0530, Pratik Rajesh Sampat wrote:
> This patch adds support to trace IPI based and timer based wakeup
> latency from idle states
>
> Latches onto the test-cpuidle_latency kernel module using the debugfs
> interface to send IPIs or schedule a timer based event, which in-turn
> populates the debugfs with the latency measurements.
>
> Currently for the IPI and timer tests; first disable all idle states
> and then test for latency measurements incrementally enabling each state
>
> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
A few comments below.
> ---
> tools/testing/selftests/Makefile | 1 +
> tools/testing/selftests/cpuidle/Makefile | 6 +
> tools/testing/selftests/cpuidle/cpuidle.sh | 257 +++++++++++++++++++++
> tools/testing/selftests/cpuidle/settings | 1 +
> 4 files changed, 265 insertions(+)
> create mode 100644 tools/testing/selftests/cpuidle/Makefile
> create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
> create mode 100644 tools/testing/selftests/cpuidle/settings
>
[..skip..]
> +
> +ins_mod()
> +{
> + if [ ! -f "$MODULE" ]; then
> + printf "$MODULE module does not exist. Exitting\n"
If the module has been compiled into the kernel (due to a
localyesconfig, for instance), then it is unlikely that we will find
it in /lib/modules. Perhaps you want to check if the debugfs
directories created by the module exist, and if so, print a message
saying that the modules is already loaded or some such?
> + exit $ksft_skip
> + fi
> + printf "Inserting $MODULE module\n\n"
> + insmod $MODULE
> + if [ $? != 0 ]; then
> + printf "Insmod $MODULE failed\n"
> + exit $ksft_skip
> + fi
> +}
> +
> +compute_average()
> +{
> + arr=("$@")
> + sum=0
> + size=${#arr[@]}
> + for i in "${arr[@]}"
> + do
> + sum=$((sum + i))
> + done
> + avg=$((sum/size))
It would be good to assert that "size" isn't 0 here.
> +}
> +
> +# Disable all stop states
> +disable_idle()
> +{
> + for ((cpu=0; cpu<NUM_CPUS; cpu++))
> + do
> + for ((state=0; state<NUM_STATES; state++))
> + do
> + echo 1 > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable
So, on offlined CPUs, we won't see
/sys/devices/system/cpu/cpu$cpu/cpuidle/state$state directory. You
should probably perform this operation only on online CPUs.
> + done
> + done
> +}
> +
> +# Perform operation on each CPU for the given state
> +# $1 - Operation: enable (0) / disable (1)
> +# $2 - State to enable
> +op_state()
> +{
> + for ((cpu=0; cpu<NUM_CPUS; cpu++))
> + do
> + echo $1 > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$2/disable
Ditto
> + done
> +}
This is a helper function. For better readability of the main code you
can define the following wrappers and use them.
cpuidle_enable_state()
{
state=$1
op_state 1 $state
}
cpuidle_disable_state()
{
state=$1
op_state 0 $state
}
> +
[..snip..]
> +run_ipi_tests()
> +{
> + extract_latency
> + disable_idle
> + declare -a avg_arr
> + echo -e "--IPI Latency Test---" >> $LOG
> +
> + echo -e "--Baseline IPI Latency measurement: CPU Busy--" >> $LOG
> + printf "%s %10s %12s\n" "SRC_CPU" "DEST_CPU" "IPI_Latency(ns)" >> $LOG
> + for ((cpu=0; cpu<NUM_CPUS; cpu++))
> + do
> + ipi_test_once "baseline" $cpu
> + printf "%-3s %10s %12s\n" $src_cpu $cpu $ipi_latency >> $LOG
> + avg_arr+=($ipi_latency)
> + done
> + compute_average "${avg_arr[@]}"
> + echo -e "Baseline Average IPI latency(ns): $avg" >> $LOG
> +
> + for ((state=0; state<NUM_STATES; state++))
> + do
> + unset avg_arr
> + echo -e "---Enabling state: $state---" >> $LOG
> + op_state 0 $state
> + printf "%s %10s %12s\n" "SRC_CPU" "DEST_CPU" "IPI_Latency(ns)" >> $LOG
> + for ((cpu=0; cpu<NUM_CPUS; cpu++))
> + do
If a CPU is offline, then we should skip it here.
> + # Running IPI test and logging results
> + sleep 1
> + ipi_test_once "test" $cpu
> + printf "%-3s %10s %12s\n" $src_cpu $cpu $ipi_latency >> $LOG
> + avg_arr+=($ipi_latency)
> + done
> + compute_average "${avg_arr[@]}"
> + echo -e "Expected IPI latency(ns): ${latency_arr[$state]}" >> $LOG
> + echo -e "Observed Average IPI latency(ns): $avg" >> $LOG
> + op_state 1 $state
> + done
> +}
> +
> +# Extract the residency in microseconds and convert to nanoseconds.
> +# Add 100 ns so that the timer stays for a little longer than the residency
> +extract_residency()
> +{
> + for ((state=0; state<NUM_STATES; state++))
> + do
> + residency=$(($(cat /sys/devices/system/cpu/cpu0/cpuidle/state$state/residency) * 1000 + 200))
> + residency_arr+=($residency)
> + done
> +}
> +
> +# Run the Timeout test
> +# $1 run for baseline - busy cpu or regular environment
> +# $2 destination cpu
> +# $3 timeout
> +timeout_test_once()
> +{
> + dest_cpu=$2
> + if [ "$1" = "baseline" ]; then
> + # Keep the CPU busy
> + taskset -c $dest_cpu cat /dev/random > /dev/null &
> + task_pid=$!
> + # Wait for the workload to achieve 100% CPU usage
> + sleep 1
> + fi
> + taskset -c $dest_cpu echo $3 > /sys/kernel/debug/latency_test/timeout_expected_ns
> + # Wait for the result to populate
> + sleep 0.1
> + timeout_diff=$(cat /sys/kernel/debug/latency_test/timeout_diff_ns)
> + src_cpu=$(cat /sys/kernel/debug/latency_test/timeout_cpu_src)
> + if [ "$1" = "baseline" ]; then
> + kill $task_pid
> + wait $task_pid 2>/dev/null
> + fi
> +}
> +
> +run_timeout_tests()
> +{
> + extract_residency
> + disable_idle
> + declare -a avg_arr
> + echo -e "\n--Timeout Latency Test--" >> $LOG
> +
> + echo -e "--Baseline Timeout Latency measurement: CPU Busy--" >> $LOG
> + printf "%s %10s %10s\n" "Wakeup_src" "Baseline_delay(ns)">> $LOG
> + for ((cpu=0; cpu<NUM_CPUS; cpu++))
> + do
Again only perform this on online CPUs.
> + timeout_test_once "baseline" $cpu ${residency_arr[0]}
> + printf "%-3s %13s\n" $src_cpu $timeout_diff >> $LOG
> + avg_arr+=($timeout_diff)
> + done
> + compute_average "${avg_arr[@]}"
> + echo -e "Baseline Average timeout diff(ns): $avg" >> $LOG
> +
> + for ((state=0; state<NUM_STATES; state++))
> + do
> + echo -e "---Enabling state: $state---" >> $LOG
> + op_state 0 $state
> + printf "%s %10s %10s\n" "Wakeup_src" "Baseline_delay(ns)" "Delay(ns)" >> $LOG
> + unset avg_arr
> + for ((cpu=0; cpu<NUM_CPUS; cpu++))
> + do
Ditto.
> + timeout_test_once "test" $cpu ${residency_arr[$state]}
> + printf "%-3s %13s %18s\n" $src_cpu $baseline_timeout_diff $timeout_diff >> $LOG
> + avg_arr+=($timeout_diff)
> + done
> + compute_average "${avg_arr[@]}"
> + echo -e "Expected timeout(ns): ${residency_arr[$state]}" >> $LOG
> + echo -e "Observed Average timeout diff(ns): $avg" >> $LOG
> + op_state 1 $state
> + done
> +}
> +
--
Thanks and Regards
gautham.
^ permalink raw reply
* Re: [PATCH 1/5] cpuidle-pseries: Set the latency-hint before entering CEDE
From: Vaidyanathan Srinivasan @ 2020-07-20 5:55 UTC (permalink / raw)
To: Gautham R. Shenoy
Cc: Nathan Lynch, Michael Neuling, linux-pm, linux-kernel,
Nicholas Piggin, linuxppc-dev
In-Reply-To: <1594120299-31389-2-git-send-email-ego@linux.vnet.ibm.com>
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-07 16:41:35]:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>
> As per the PAPR, each H_CEDE call is associated with a latency-hint to
> be passed in the VPA field "cede_latency_hint". The CEDE states that
> we were implicitly entering so far is CEDE with latency-hint = 0.
>
> This patch explicitly sets the latency hint corresponding to the CEDE
> state that we are currently entering. While at it, we save the
> previous hint, to be restored once we wakeup from CEDE. This will be
> required in the future when we expose extended-cede states through the
> cpuidle framework, where each of them will have a different
> cede-latency hint.
>
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> ---
> drivers/cpuidle/cpuidle-pseries.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index 4a37252..39d4bb6 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -105,19 +105,27 @@ static void check_and_cede_processor(void)
> }
> }
>
> +#define NR_CEDE_STATES 1 /* CEDE with latency-hint 0 */
> +#define NR_DEDICATED_STATES (NR_CEDE_STATES + 1) /* Includes snooze */
> +
> +u8 cede_latency_hint[NR_DEDICATED_STATES];
> static int dedicated_cede_loop(struct cpuidle_device *dev,
> struct cpuidle_driver *drv,
> int index)
> {
> + u8 old_latency_hint;
>
> pseries_idle_prolog();
> get_lppaca()->donate_dedicated_cpu = 1;
> + old_latency_hint = get_lppaca()->cede_latency_hint;
> + get_lppaca()->cede_latency_hint = cede_latency_hint[index];
>
> HMT_medium();
> check_and_cede_processor();
>
> local_irq_disable();
> get_lppaca()->donate_dedicated_cpu = 0;
> + get_lppaca()->cede_latency_hint = old_latency_hint;
>
> pseries_idle_epilog();
>
> @@ -149,7 +157,7 @@ static int shared_cede_loop(struct cpuidle_device *dev,
> /*
> * States for dedicated partition case.
> */
> -static struct cpuidle_state dedicated_states[] = {
> +static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = {
> { /* Snooze */
> .name = "snooze",
> .desc = "snooze",
Saving and restoring the current cede hint value helps in maintaining
compatibility with other parts of the kernel. Over long term we can
make cpuidle driver deterministically set the CEDE hint at each
invocation of H_CEDE call so that we do not have to do multiple
redundant save-restore.
This is a reasonable start to cleanup this cupidle subsystem on PAPR
guests.
--Vaidy
^ permalink raw reply
* Re: [PATCH 09/11] Powerpc/smp: Create coregroup domain
From: Srikar Dronamraju @ 2020-07-20 6:02 UTC (permalink / raw)
To: Gautham R Shenoy
Cc: Nathan Lynch, Oliver OHalloran, Michael Neuling, Michael Ellerman,
Anton Blanchard, linuxppc-dev, Nick Piggin
In-Reply-To: <20200717081926.GD32531@in.ibm.com>
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-17 13:49:26]:
> On Tue, Jul 14, 2020 at 10:06:22AM +0530, Srikar Dronamraju wrote:
> > Add percpu coregroup maps and masks to create coregroup domain.
> > If a coregroup doesn't exist, the coregroup domain will be degenerated
> > in favour of SMT/CACHE domain.
> >
> > Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> > Cc: Michael Ellerman <michaele@au1.ibm.com>
> > Cc: Nick Piggin <npiggin@au1.ibm.com>
> > Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Neuling <mikey@linux.ibm.com>
> > Cc: Anton Blanchard <anton@au1.ibm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> > arch/powerpc/include/asm/topology.h | 10 ++++++++
> > arch/powerpc/kernel/smp.c | 37 +++++++++++++++++++++++++++++
> > arch/powerpc/mm/numa.c | 5 ++++
> > 3 files changed, 52 insertions(+)
> >
> > @@ -950,6 +972,11 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
> > cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
> >
> > + if (has_coregroup_support())
> > + cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
> > + else
> > + powerpc_topology[mc_idx].mask = cpu_bigcore_mask;
> > +
>
> The else part could be moved to the common function where we are
> modifying the other attributes of the topology array.
>
My intent is to make all the changes to the topology attributes in
smp_prepare_cpus. It makes more sense to change the attributes immediately
after we define / detect a particular topology change.
The only thing that is left out currently is shared_cache,
We should be able to detect shared_cache also around this time. Just that it
needs more cleanups. Once we do update the topology attributes even for
shared_cache here itself.
> > init_big_cores();
> > if (has_big_cores) {
> > cpumask_set_cpu(boot_cpuid,
> > @@ -1241,6 +1268,8 @@ static void remove_cpu_from_masks(int cpu)
> > set_cpus_unrelated(cpu, i, cpu_sibling_mask);
> > if (has_big_cores)
> > set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
> > + if (has_coregroup_support())
> > + set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
> > }
> > }
> > #endif
> > @@ -1301,6 +1330,14 @@ static void add_cpu_to_masks(int cpu)
> > add_cpu_to_smallcore_masks(cpu);
> > update_mask_by_l2(cpu, cpu_l2_cache_mask);
> >
> > + if (has_coregroup_support()) {
> > + cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu));
> > + for_each_cpu(i, cpu_online_mask) {
> > + if (cpu_to_coregroup_id(cpu) == cpu_to_coregroup_id(i))
> > + set_cpus_related(cpu, i, cpu_coregroup_mask);
> > + }
> > + }
> > +
> > if (pkg_id == -1) {
> > struct cpumask *(*mask)(int) = cpu_sibling_mask;
> >
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index a43eab455be4..d9ab9da85eab 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -1695,6 +1695,11 @@ static const struct proc_ops topology_proc_ops = {
> > .proc_release = single_release,
> > };
> >
> > +int cpu_to_coregroup_id(int cpu)
> > +{
> > + return cpu_to_core_id(cpu);
> > +}
>
>
> So, if has_coregroup_support() returns true, then since the core_group
> identification is currently done through the core-id, the
> coregroup_mask is going to be the same as the
> cpu_core_mask/cpu_cpu_mask. Thus, we will be degenerating the DIE
> domain. Right ? Instead we could have kept the core-group to be a
> single bigcore by default, so that those domains can get degenerated
> preserving the legacy SMT, DIE, NUMA hierarchy.
>
>
I think you have confused between cpu_core_mask and cpu_to_core_id.
cpu_to_core_id() returns a unique id for every SMT8 thread group.
If coregroup_support is true and the system doesn't support coregroup, then
We would not be degenerating the DIE but degenerating new MC domain only.
This is because the MC domain and the previous domain (SHARED_CACHE/BIGCORE/
SMT) would match the MC domain.
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply
* Re: [PATCH v6 03/23] powerpc/book3s64/pkeys: pkeys are supported only on hash on book3s.
From: Michael Ellerman @ 2020-07-20 6:05 UTC (permalink / raw)
To: Aneesh Kumar K.V, linuxppc-dev; +Cc: Aneesh Kumar K.V, linuxram, bauerman
In-Reply-To: <20200709032946.881753-4-aneesh.kumar@linux.ibm.com>
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> Move them to hash specific file and add BUG() for radix path.
> ---
> .../powerpc/include/asm/book3s/64/hash-pkey.h | 32 ++++++++++++++++
> arch/powerpc/include/asm/book3s/64/pkeys.h | 25 +++++++++++++
> arch/powerpc/include/asm/pkeys.h | 37 ++++---------------
> 3 files changed, 64 insertions(+), 30 deletions(-)
> create mode 100644 arch/powerpc/include/asm/book3s/64/hash-pkey.h
> create mode 100644 arch/powerpc/include/asm/book3s/64/pkeys.h
This isn't signed-off.
I assume you meant to, so I added it.
cheers
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-pkey.h b/arch/powerpc/include/asm/book3s/64/hash-pkey.h
> new file mode 100644
> index 000000000000..795010897e5d
^ permalink raw reply
* Re: [PATCH 2/5] cpuidle-pseries: Add function to parse extended CEDE records
From: Vaidyanathan Srinivasan @ 2020-07-20 6:09 UTC (permalink / raw)
To: Gautham R. Shenoy
Cc: Nathan Lynch, Michael Neuling, linux-pm, linux-kernel,
Nicholas Piggin, linuxppc-dev
In-Reply-To: <1594120299-31389-3-git-send-email-ego@linux.vnet.ibm.com>
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-07 16:41:36]:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>
> Currently we use CEDE with latency-hint 0 as the only other idle state
> on a dedicated LPAR apart from the polling "snooze" state.
>
> The platform might support additional extended CEDE idle states, which
> can be discovered through the "ibm,get-system-parameter" rtas-call
> made with CEDE_LATENCY_TOKEN.
>
> This patch adds a function to obtain information about the extended
> CEDE idle states from the platform and parse the contents to populate
> an array of extended CEDE states. These idle states thus discovered
> will be added to the cpuidle framework in the next patch.
>
> dmesg on a POWER9 LPAR, demonstrating the output of parsing the
> extended CEDE latency parameters.
>
> [ 5.913180] xcede : xcede_record_size = 10
> [ 5.913183] xcede : Record 0 : hint = 1, latency =0x400 tb-ticks, Wake-on-irq = 1
> [ 5.913188] xcede : Record 1 : hint = 2, latency =0x3e8000 tb-ticks, Wake-on-irq = 0
> [ 5.913193] cpuidle : Skipping the 2 Extended CEDE idle states
>
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
>
> ---
> drivers/cpuidle/cpuidle-pseries.c | 129 +++++++++++++++++++++++++++++++++++++-
> 1 file changed, 127 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index 39d4bb6..c13549b 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -21,6 +21,7 @@
> #include <asm/runlatch.h>
> #include <asm/idle.h>
> #include <asm/plpar_wrappers.h>
> +#include <asm/rtas.h>
>
> struct cpuidle_driver pseries_idle_driver = {
> .name = "pseries_idle",
> @@ -105,9 +106,120 @@ static void check_and_cede_processor(void)
> }
> }
>
> -#define NR_CEDE_STATES 1 /* CEDE with latency-hint 0 */
> +struct xcede_latency_records {
> + u8 latency_hint;
> + u64 wakeup_latency_tb_ticks;
> + u8 responsive_to_irqs;
> +};
> +
> +/*
> + * XCEDE : Extended CEDE states discovered through the
> + * "ibm,get-systems-parameter" rtas-call with the token
> + * CEDE_LATENCY_TOKEN
> + */
> +#define MAX_XCEDE_STATES 4
> +#define XCEDE_LATENCY_RECORD_SIZE 10
> +#define XCEDE_LATENCY_PARAM_MAX_LENGTH (2 + 2 + \
> + (MAX_XCEDE_STATES * XCEDE_LATENCY_RECORD_SIZE))
> +
> +#define CEDE_LATENCY_TOKEN 45
> +
> +#define NR_CEDE_STATES (MAX_XCEDE_STATES + 1) /* CEDE with latency-hint 0 */
> #define NR_DEDICATED_STATES (NR_CEDE_STATES + 1) /* Includes snooze */
>
> +struct xcede_latency_records xcede_records[MAX_XCEDE_STATES];
> +unsigned int nr_xcede_records;
> +char xcede_parameters[XCEDE_LATENCY_PARAM_MAX_LENGTH];
> +
> +static int parse_cede_parameters(void)
> +{
> + int ret = -1, i;
> + u16 payload_length;
> + u8 xcede_record_size;
> + u32 total_xcede_records_size;
> + char *payload;
> +
> + memset(xcede_parameters, 0, XCEDE_LATENCY_PARAM_MAX_LENGTH);
> +
> + ret = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
> + NULL, CEDE_LATENCY_TOKEN, __pa(xcede_parameters),
> + XCEDE_LATENCY_PARAM_MAX_LENGTH);
> +
> + if (ret) {
> + pr_err("xcede: Error parsing CEDE_LATENCY_TOKEN\n");
> + return ret;
> + }
> +
> + payload_length = be16_to_cpu(*(__be16 *)(&xcede_parameters[0]));
> + payload = &xcede_parameters[2];
> +
> + /*
> + * If the platform supports the cede latency settings
> + * information system parameter it must provide the following
> + * information in the NULL terminated parameter string:
> + *
> + * a. The first byte is the length ???N??? of each cede
> + * latency setting record minus one (zero indicates a length
> + * of 1 byte).
> + *
> + * b. For each supported cede latency setting a cede latency
> + * setting record consisting of the first ???N??? bytes as per
> + * the following table.
> + *
> + * -----------------------------
> + * | Field | Field |
> + * | Name | Length |
> + * -----------------------------
> + * | Cede Latency | 1 Byte |
> + * | Specifier Value | |
> + * -----------------------------
> + * | Maximum wakeup | |
> + * | latency in | 8 Bytes|
> + * | tb-ticks | |
> + * -----------------------------
> + * | Responsive to | |
> + * | external | 1 Byte |
> + * | interrupts | |
> + * -----------------------------
> + *
> + * This version has cede latency record size = 10.
> + */
> + xcede_record_size = (u8)payload[0] + 1;
This is standard PAPR interface that has been defined long time ago.
However, new H_CEDE hints that map to new platform features will
appear in the same interface and Linux needs to prepare and be ready
to check and exploit the new hints if they are useful for the given
setup.
> +
> + if (xcede_record_size != XCEDE_LATENCY_RECORD_SIZE) {
> + pr_err("xcede : Expected record-size %d. Observed size %d.\n",
> + XCEDE_LATENCY_RECORD_SIZE, xcede_record_size);
> + return -EINVAL;
> + }
> +
> + pr_info("xcede : xcede_record_size = %d\n", xcede_record_size);
> +
> + /*
> + * Since the payload_length includes the last NULL byte and
> + * the xcede_record_size, the remaining bytes correspond to
> + * array of all cede_latency settings.
> + */
> + total_xcede_records_size = payload_length - 2;
> + nr_xcede_records = total_xcede_records_size / xcede_record_size;
> +
> + payload++;
> + for (i = 0; i < nr_xcede_records; i++) {
> + struct xcede_latency_records *record = &xcede_records[i];
> +
> + record->latency_hint = (u8)payload[0];
> + record->wakeup_latency_tb_ticks =
> + be64_to_cpu(*(__be64 *)(&payload[1]));
> + record->responsive_to_irqs = (u8)payload[9];
> + payload += xcede_record_size;
> + pr_info("xcede : Record %d : hint = %u, latency =0x%llx tb-ticks, Wake-on-irq = %u\n",
> + i, record->latency_hint,
> + record->wakeup_latency_tb_ticks,
> + record->responsive_to_irqs);
> + }
> +
> + return 0;
> +}
> +
> u8 cede_latency_hint[NR_DEDICATED_STATES];
> static int dedicated_cede_loop(struct cpuidle_device *dev,
> struct cpuidle_driver *drv,
> @@ -238,6 +350,19 @@ static int pseries_cpuidle_driver_init(void)
> return 0;
> }
>
> +static int add_pseries_idle_states(void)
> +{
> + int nr_states = 2; /* By default we have snooze, CEDE */
> +
> + if (parse_cede_parameters())
> + return nr_states;
> +
> + pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n",
> + nr_xcede_records);
> +
> + return nr_states;
More logic will be added to this function in the subsequent patches to
actually make use of the information that is obtained from the platform
firmware.
--Vaidy
^ permalink raw reply
* Re: [PATCH v6 03/23] powerpc/book3s64/pkeys: pkeys are supported only on hash on book3s.
From: Aneesh Kumar K.V @ 2020-07-20 6:15 UTC (permalink / raw)
To: Michael Ellerman, linuxppc-dev; +Cc: linuxram, bauerman
In-Reply-To: <87h7u2u3ep.fsf@mpe.ellerman.id.au>
On 7/20/20 11:35 AM, Michael Ellerman wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> Move them to hash specific file and add BUG() for radix path.
>> ---
>> .../powerpc/include/asm/book3s/64/hash-pkey.h | 32 ++++++++++++++++
>> arch/powerpc/include/asm/book3s/64/pkeys.h | 25 +++++++++++++
>> arch/powerpc/include/asm/pkeys.h | 37 ++++---------------
>> 3 files changed, 64 insertions(+), 30 deletions(-)
>> create mode 100644 arch/powerpc/include/asm/book3s/64/hash-pkey.h
>> create mode 100644 arch/powerpc/include/asm/book3s/64/pkeys.h
>
> This isn't signed-off.
>
Not sure how i missed that.
> I assume you meant to, so I added it.
>
yes. Thanks for taking care of that.
-aneesh
^ permalink raw reply
* Re: [FIX PATCH] powerpc/prom: Enable Radix GTSE in cpu pa-features
From: Bharata B Rao @ 2020-07-20 6:17 UTC (permalink / raw)
To: Nicholas Piggin; +Cc: Qian Cai, aneesh.kumar, linuxppc-dev
In-Reply-To: <1595223290.jz1cmk38dz.astroid@bobo.none>
On Mon, Jul 20, 2020 at 03:38:29PM +1000, Nicholas Piggin wrote:
> Excerpts from Bharata B Rao's message of July 20, 2020 2:42 pm:
> > diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> > index 9cc49f265c86..a9594bad572a 100644
> > --- a/arch/powerpc/kernel/prom.c
> > +++ b/arch/powerpc/kernel/prom.c
> > @@ -163,7 +163,8 @@ static struct ibm_pa_feature {
> > { .pabyte = 0, .pabit = 6, .cpu_features = CPU_FTR_NOEXECUTE },
> > { .pabyte = 1, .pabit = 2, .mmu_features = MMU_FTR_CI_LARGE_PAGE },
> > #ifdef CONFIG_PPC_RADIX_MMU
> > - { .pabyte = 40, .pabit = 0, .mmu_features = MMU_FTR_TYPE_RADIX },
> > + { .pabyte = 40, .pabit = 0,
> > + .mmu_features = (MMU_FTR_TYPE_RADIX | MMU_FTR_GTSE) },
>
> It might look better like this:
>
> { .pabyte = 1, .pabit = 2, .mmu_features = MMU_FTR_CI_LARGE_PAGE },
> #ifdef CONFIG_PPC_RADIX_MMU
> { .pabyte = 40, .pabit = 0, .mmu_features = MMU_FTR_TYPE_RADIX },
> { .pabyte = 40, .pabit = 0, .mmu_features = MMU_FTR_TYPE_RADIX |
> MMU_FTR_GTSE },
> #endif
> { .pabyte = 1, .pabit = 1, .invert = 1, .cpu_features = CPU_FTR_NODSISRALIGN },
>
> But that's bikeshedding a bit and the optional bits already put it out
> of alignment.
Here it is...
From 1be7f3f8b43503740431b7bdf585e488ecdeb48f Mon Sep 17 00:00:00 2001
From: Nicholas Piggin <npiggin@gmail.com>
Date: Mon, 20 Jul 2020 09:05:05 +0530
Subject: [FIX PATCH] powerpc/prom: Enable Radix GTSE in cpu pa-features
When '029ab30b4c0a ("powerpc/mm: Enable radix GTSE only if supported.")'
made GTSE an MMU feature, it was enabled by default in
powerpc-cpu-features but was missed in pa-features. This causes
random memory corruption during boot of PowerNV kernels if
CONFIG_PPC_DT_CPU_FTRS isn't enabled.
Fixes: 029ab30b4c0a ("powerpc/mm: Enable radix GTSE only if supported.")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
arch/powerpc/kernel/prom.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 9cc49f265c86..dae30e805e42 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -163,7 +163,8 @@ static struct ibm_pa_feature {
{ .pabyte = 0, .pabit = 6, .cpu_features = CPU_FTR_NOEXECUTE },
{ .pabyte = 1, .pabit = 2, .mmu_features = MMU_FTR_CI_LARGE_PAGE },
#ifdef CONFIG_PPC_RADIX_MMU
- { .pabyte = 40, .pabit = 0, .mmu_features = MMU_FTR_TYPE_RADIX },
+ { .pabyte = 40, .pabit = 0, .mmu_features = MMU_FTR_TYPE_RADIX |
+ MMU_FTR_GTSE },
#endif
{ .pabyte = 1, .pabit = 1, .invert = 1, .cpu_features = CPU_FTR_NODSISRALIGN },
{ .pabyte = 5, .pabit = 0, .cpu_features = CPU_FTR_REAL_LE,
--
2.26.2
^ permalink raw reply related
* Re: [PATCH 06/11] powerpc/smp: Generalize 2nd sched domain
From: Srikar Dronamraju @ 2020-07-20 6:19 UTC (permalink / raw)
To: Gautham R Shenoy
Cc: Nathan Lynch, Oliver OHalloran, Michael Neuling, Michael Ellerman,
Anton Blanchard, linuxppc-dev, Nick Piggin
In-Reply-To: <20200717063755.GA32531@in.ibm.com>
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-17 12:07:55]:
> On Tue, Jul 14, 2020 at 10:06:19AM +0530, Srikar Dronamraju wrote:
> > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > same as SMT domain. However we could generalize this domain such that it
> > could mean either be a "CACHE" domain or a "BIGCORE" domain.
> >
> > While setting up the "CACHE" domain, check if shared_cache is already
> > set.
> >
> > Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> > Cc: Michael Ellerman <michaele@au1.ibm.com>
> > Cc: Nick Piggin <npiggin@au1.ibm.com>
> > Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Neuling <mikey@linux.ibm.com>
> > Cc: Anton Blanchard <anton@au1.ibm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> > @@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
> > }
> > #endif
> >
> > +static const struct cpumask *cpu_bigcore_mask(int cpu)
> > +{
> > + return cpu_core_mask(cpu);
>
> It should be cpu_smt_mask() if we want the redundant big-core to be
> degenerated in favour of the SMT level on P8, no? Because
> cpu_core_mask refers to all the CPUs that are in the same chip.
>
Right, but it cant be cpu_smt_mask since cpu_smt_mask is only enabled in
CONFIG_SCHED_SMT. I was looking at using sibling_map, but we have to careful
for power9 / PowerNV mode. Guess that should be fine.
> > +}
> > +
> > static struct sched_domain_topology_level powerpc_topology[] = {
> > #ifdef CONFIG_SCHED_SMT
> > { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
> > #endif
> > - { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
> > + { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
> > { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > { NULL, },
> > };
> > @@ -1319,7 +1326,6 @@ static void add_cpu_to_masks(int cpu)
> > void start_secondary(void *unused)
> > {
> > unsigned int cpu = smp_processor_id();
> > - struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> >
> > mmgrab(&init_mm);
> > current->active_mm = &init_mm;
> > @@ -1345,14 +1351,20 @@ void start_secondary(void *unused)
> > /* Update topology CPU masks */
> > add_cpu_to_masks(cpu);
> >
> > - if (has_big_cores)
> > - sibling_mask = cpu_smallcore_mask;
> > /*
> > * Check for any shared caches. Note that this must be done on a
> > * per-core basis because one core in the pair might be disabled.
> > */
> > - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> > - shared_caches = true;
> > + if (!shared_caches) {
> > + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> > + struct cpumask *mask = cpu_l2_cache_mask(cpu);
> > +
> > + if (has_big_cores)
> > + sibling_mask = cpu_smallcore_mask;
> > +
> > + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> > + shared_caches = true;
>
> Shouldn't we use cpumask_subset() here ?
Wouldn't cpumask_subset should return 1 if both are same?
We dont want to have shared_caches set if both the masks are equal.
>
> > + }
> >
> > set_numa_node(numa_cpu_lookup_table[cpu]);
> > set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> > @@ -1390,6 +1402,14 @@ void __init smp_cpus_done(unsigned int max_cpus)
> > smp_ops->bringup_done();
> >
> > dump_numa_cpu_topology();
> > + if (shared_caches) {
> > + pr_info("Using shared cache scheduler topology\n");
> > + powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> > +#ifdef CONFIG_SCHED_DEBUG
> > + powerpc_topology[bigcore_idx].name = "CACHE";
> > +#endif
> > + powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
> > + }
>
>
> I would much rather that we have all the topology-fixups done in one
> function.
>
> fixup_topology(void) {
> if (has_big_core)
> powerpc_topology[smt_idx].mask = smallcore_smt_mask;
>
> if (shared_caches) {
> const char *name = "CACHE";
> powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> strlcpy(powerpc_topology[bigcore_idx].name, name,
> strlen(name));
> powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
> }
>
> /* Any other changes to the topology structure here */
We could do this.
>
> And also as an optimization, get rid of degenerate structures here
> itself so that we don't pay additional penalty while building the
> sched-domains each time.
>
Yes this is definitely in plan, but slightly later in time.
Thanks for the review and comments.
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply
* Re: [PATCH 3/5] dma-mapping: make support for dma ops optional
From: Christoph Hellwig @ 2020-07-20 6:20 UTC (permalink / raw)
To: Guenter Roeck
Cc: Daniel Borkmann, Alexey Kardashevskiy, Greg Kroah-Hartman,
linuxppc-dev, linux-kernel, iommu, Jesper Dangaard Brouer,
Robin Murphy, Christoph Hellwig
In-Reply-To: <20200718171714.GA237687@roeck-us.net>
On Sat, Jul 18, 2020 at 10:17:14AM -0700, Guenter Roeck wrote:
> On Wed, Jul 08, 2020 at 05:24:47PM +0200, Christoph Hellwig wrote:
> > Avoid the overhead of the dma ops support for tiny builds that only
> > use the direct mapping.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
>
> For ppc:pmac32_defconfig and other configurations, this patch results in:
Fixed and force pushed.
^ permalink raw reply
* Re: [PATCH 3/5] cpuidle-pseries : Fixup exit latency for CEDE(0)
From: Vaidyanathan Srinivasan @ 2020-07-20 6:24 UTC (permalink / raw)
To: Gautham R. Shenoy
Cc: Nathan Lynch, Michael Neuling, linux-pm, linux-kernel,
Nicholas Piggin, linuxppc-dev
In-Reply-To: <1594120299-31389-4-git-send-email-ego@linux.vnet.ibm.com>
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-07 16:41:37]:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>
> We are currently assuming that CEDE(0) has exit latency 10us, since
> there is no way for us to query from the platform. However, if the
> wakeup latency of an Extended CEDE state is smaller than 10us, then we
> can be sure that the exit latency of CEDE(0) cannot be more than that.
> that.
>
> In this patch, we fix the exit latency of CEDE(0) if we discover an
> Extended CEDE state with wakeup latency smaller than 10us. The new
> value is 1us lesser than the smallest wakeup latency among the
> Extended CEDE states.
>
> Benchmark results:
>
> ebizzy:
> 2 ebizzy threads bound to the same big-core. 25% improvement in the
> avg records/s with patch.
> x without_patch
> * with_patch
> N Min Max Median Avg Stddev
> x 10 2491089 5834307 5398375 4244335 1596244.9
> * 10 2893813 5834474 5832448 5327281.3 1055941.4
>
> context_switch2 :
> There is no major regression observed with this patch as seen from the
> context_switch2 benchmark.
>
> context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
> small cores). We observe a minor 0.14% regression in the number of
> context-switches (higher is better).
> x without_patch
> * with_patch
> N Min Max Median Avg Stddev
> x 500 348872 362236 354712 354745.69 2711.827
> * 500 349422 361452 353942 354215.4 2576.9258
>
> context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
> improvement in the number of context-switches (higher is better).
> x without_patch
> * with_patch
> N Min Max Median Avg Stddev
> x 500 287956 294940 288896 288977.23 646.59295
> * 500 288300 294646 289582 290064.76 1161.9992
>
> schbench:
> No major difference could be seen until the 99.9th percentile.
>
> Without-patch
> Latency percentiles (usec)
> 50.0th: 29
> 75.0th: 39
> 90.0th: 49
> 95.0th: 59
> *99.0th: 13104
> 99.5th: 14672
> 99.9th: 15824
> min=0, max=17993
>
> With-patch:
> Latency percentiles (usec)
> 50.0th: 29
> 75.0th: 40
> 90.0th: 50
> 95.0th: 61
> *99.0th: 13648
> 99.5th: 14768
> 99.9th: 15664
> min=0, max=29812
>
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> ---
> drivers/cpuidle/cpuidle-pseries.c | 34 ++++++++++++++++++++++++++++++++--
> 1 file changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index c13549b..502f906 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -353,12 +353,42 @@ static int pseries_cpuidle_driver_init(void)
> static int add_pseries_idle_states(void)
> {
> int nr_states = 2; /* By default we have snooze, CEDE */
> + int i;
> + u64 min_latency_us = dedicated_states[1].exit_latency; /* CEDE latency */
>
> if (parse_cede_parameters())
> return nr_states;
>
> - pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n",
> - nr_xcede_records);
> + for (i = 0; i < nr_xcede_records; i++) {
> + u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks;
> + u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
> +
> + if (latency_us < min_latency_us)
> + min_latency_us = latency_us;
> + }
> +
> + /*
> + * We are currently assuming that CEDE(0) has exit latency
> + * 10us, since there is no way for us to query from the
> + * platform.
> + *
> + * However, if the wakeup latency of an Extended CEDE state is
> + * smaller than 10us, then we can be sure that CEDE(0)
> + * requires no more than that.
> + *
> + * Perform the fix-up.
> + */
> + if (min_latency_us < dedicated_states[1].exit_latency) {
> + u64 cede0_latency = min_latency_us - 1;
> +
> + if (cede0_latency <= 0)
> + cede0_latency = min_latency_us;
> +
> + dedicated_states[1].exit_latency = cede0_latency;
> + dedicated_states[1].target_residency = 10 * (cede0_latency);
> + pr_info("cpuidle : Fixed up CEDE exit latency to %llu us\n",
> + cede0_latency);
> + }
As per PAPR spec the CEDE hints are in increasing order of exit
latency. Hence a given state's exit latency cannot exceed the one
following it. The quirk is such that the first one (hint 0) is
implicit and hence we have to use the above logic to extract its
characteristics.
--Vaidy
^ permalink raw reply
* Re: [PATCH 4/5] cpuidle-pseries : Include extended CEDE states in cpuidle framework
From: Vaidyanathan Srinivasan @ 2020-07-20 6:31 UTC (permalink / raw)
To: Gautham R. Shenoy
Cc: Nathan Lynch, Michael Neuling, linux-pm, linux-kernel,
Nicholas Piggin, linuxppc-dev
In-Reply-To: <1594120299-31389-5-git-send-email-ego@linux.vnet.ibm.com>
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-07 16:41:38]:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>
> This patch exposes those extended CEDE states to the cpuidle framework
> which are responsive to external interrupts and do not need an H_PROD.
>
> Since as per the PAPR, all the extended CEDE states are non-responsive
> to timers, we indicate this to the cpuidle subsystem via the
> CPUIDLE_FLAG_TIMER_STOP flag for all those extende CEDE states which
> can wake up on external interrupts.
>
> With the patch, we are able to see the extended CEDE state with
> latency hint = 1 exposed via the cpuidle framework.
>
> $ cpupower idle-info
> CPUidle driver: pseries_idle
> CPUidle governor: menu
> analyzing CPU 0:
>
> Number of idle states: 3
> Available idle states: snooze CEDE XCEDE1
> snooze:
> Flags/Description: snooze
> Latency: 0
> Usage: 33429446
> Duration: 27006062
> CEDE:
> Flags/Description: CEDE
> Latency: 1
> Usage: 10272
> Duration: 110786770
> XCEDE1:
> Flags/Description: XCEDE1
> Latency: 12
> Usage: 26445
> Duration: 1436433815
>
> Benchmark results:
> TLDR: Over all we do not see any additional benefit from having XCEDE1 over
> CEDE.
>
> ebizzy :
> 2 threads bound to a big-core. With this patch, we see a 3.39%
> regression compared to with only CEDE0 latency fixup.
> x With only CEDE0 latency fixup
> * With CEDE0 latency fixup + CEDE1
> N Min Max Median Avg Stddev
> x 10 2893813 5834474 5832448 5327281.3 1055941.4
> * 10 2907329 5834923 5831398 5146614.6 1193874.8
>
> context_switch2:
> With the context_switch2 there are no observable regressions in the
> results.
>
> context_switch2 CPU0 CPU1 (Same Big-core, different small-cores).
> No difference with and without patch.
> x without_patch
> * with_patch
> N Min Max Median Avg Stddev
> x 500 343644 348778 345444 345584.02 1035.1658
> * 500 344310 347646 345776 345877.22 802.19501
>
> context_switch2 CPU0 CPU8 (different big-cores). Minor 0.05% improvement
> with patch
> x without_patch
> * with_patch
> N Min Max Median Avg Stddev
> x 500 287562 288756 288162 288134.76 262.24328
> * 500 287874 288960 288306 288274.66 187.57034
>
> schbench:
> No regressions observed with schbench
>
> Without Patch:
> Latency percentiles (usec)
> 50.0th: 29
> 75.0th: 40
> 90.0th: 50
> 95.0th: 61
> *99.0th: 13648
> 99.5th: 14768
> 99.9th: 15664
> min=0, max=29812
>
> With Patch:
> Latency percentiles (usec)
> 50.0th: 30
> 75.0th: 40
> 90.0th: 51
> 95.0th: 59
> *99.0th: 13616
> 99.5th: 14512
> 99.9th: 15696
> min=0, max=15996
>
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> ---
> drivers/cpuidle/cpuidle-pseries.c | 50 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 50 insertions(+)
>
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index 502f906..6f893cd 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -362,9 +362,59 @@ static int add_pseries_idle_states(void)
> for (i = 0; i < nr_xcede_records; i++) {
> u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks;
> u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
> + char name[CPUIDLE_NAME_LEN];
> + unsigned int latency_hint = xcede_records[i].latency_hint;
> + u64 residency_us;
> +
> + if (!xcede_records[i].responsive_to_irqs) {
> + pr_info("cpuidle : Skipping XCEDE%d. Not responsive to IRQs\n",
> + latency_hint);
> + continue;
> + }
>
> if (latency_us < min_latency_us)
> min_latency_us = latency_us;
> + snprintf(name, CPUIDLE_NAME_LEN, "XCEDE%d", latency_hint);
> +
> + /*
> + * As per the section 14.14.1 of PAPR version 2.8.1
> + * says that alling H_CEDE with the value of the cede
> + * latency specifier set greater than zero allows the
> + * processor timer facility to be disabled (so as not
> + * to cause gratuitous wake-ups - the use of H_PROD,
> + * or other external interrupt is required to wake the
> + * processor in this case).
> + *
> + * So, inform the cpuidle-subsystem that the timer
> + * will be stopped for these states.
> + *
> + * Also, bump up the latency by 10us, since cpuidle
> + * would use timer-offload framework which will need
> + * to send an IPI to wakeup a CPU whose timer has
> + * expired.
> + */
> + if (latency_hint > 0) {
> + dedicated_states[nr_states].flags = CPUIDLE_FLAG_TIMER_STOP;
> + latency_us += 10;
> + }
> +
> + /*
> + * Thumb rule : Reside in the XCEDE state for at least
> + * 10x the time required to enter and exit that state.
> + */
> + residency_us = latency_us * 10;
> +
> + strlcpy(dedicated_states[nr_states].name, (const char *)name,
> + CPUIDLE_NAME_LEN);
> + strlcpy(dedicated_states[nr_states].desc, (const char *)name,
> + CPUIDLE_NAME_LEN);
> + dedicated_states[nr_states].exit_latency = latency_us;
> + dedicated_states[nr_states].target_residency = residency_us;
> + dedicated_states[nr_states].enter = &dedicated_cede_loop;
> + cede_latency_hint[nr_states] = latency_hint;
> + pr_info("cpuidle : Added %s. latency-hint = %d\n",
> + name, latency_hint);
> + nr_states++;
This patch demonstrates the various use cases of the previous patches
in the series that helps interface with the platform firmware better.
On current platforms these benefits are very limited, but the
framework built by the previous patches helps Linux exploit new and
enhanced idle states that will be available on newer platform and
firmware.
--Vaidy
^ permalink raw reply
* Re: [PATCH 5/5] cpuidle-pseries: Block Extended CEDE(1) which adds no additional value.
From: Vaidyanathan Srinivasan @ 2020-07-20 6:34 UTC (permalink / raw)
To: Gautham R. Shenoy
Cc: Nathan Lynch, Michael Neuling, linux-pm, linux-kernel,
Nicholas Piggin, linuxppc-dev
In-Reply-To: <1594120299-31389-6-git-send-email-ego@linux.vnet.ibm.com>
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-07 16:41:39]:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>
> The Extended CEDE state with latency-hint = 1 is only different from
> normal CEDE (with latency-hint = 0) in that a CPU in Extended CEDE(1)
> does not wakeup on timer events. Both CEDE and Extended CEDE(1) map to
> the same hardware idle state. Since we already get SMT folding from
> the normal CEDE, the Extended CEDE(1) doesn't provide any additional
> value. This patch blocks Extended CEDE(1).
>
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> ---
> drivers/cpuidle/cpuidle-pseries.c | 57 ++++++++++++++++++++++++++++++++++++---
> 1 file changed, 54 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index 6f893cd..be0b8b2 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -350,6 +350,43 @@ static int pseries_cpuidle_driver_init(void)
> return 0;
> }
>
> +#define XCEDE1_HINT 1
> +#define ERR_NO_VALUE_ADD (-1)
> +#define ERR_NO_EE_WAKEUP (-2)
> +
> +/*
> + * Returns 0 if the Extende CEDE state with @hint is not blocked in
> + * cpuidle framework.
> + *
> + * Returns ERR_NO_EE_WAKEUP if the Extended CEDE state is blocked due
> + * to not being responsive to external interrupts.
> + *
> + * Returns ERR_NO_VALUE_ADD if the Extended CEDE state does not provide
> + * added value addition over the normal CEDE.
> + */
> +static int cpuidle_xcede_blocked(u8 hint, u64 latency_us, u8 responsive_to_irqs)
> +{
> +
> + /*
> + * We will only allow extended CEDE states that are responsive
> + * to irqs do not require an H_PROD to be woken up.
> + */
> + if (!responsive_to_irqs)
> + return ERR_NO_EE_WAKEUP;
> +
> + /*
> + * We already obtain SMT folding benefits from CEDE (which is
> + * CEDE with hint 0). Furthermore, CEDE is also responsive to
> + * timer-events, while XCEDE1 requires an external
> + * interrupt/H_PROD to be woken up. Hence, block XCEDE1 since
> + * it adds no further value.
> + */
> + if (hint == XCEDE1_HINT)
> + return ERR_NO_VALUE_ADD;
> +
> + return 0;
> +}
> +
> static int add_pseries_idle_states(void)
> {
> int nr_states = 2; /* By default we have snooze, CEDE */
> @@ -365,15 +402,29 @@ static int add_pseries_idle_states(void)
> char name[CPUIDLE_NAME_LEN];
> unsigned int latency_hint = xcede_records[i].latency_hint;
> u64 residency_us;
> + int rc;
> +
> + if (latency_us < min_latency_us)
> + min_latency_us = latency_us;
> +
> + rc = cpuidle_xcede_blocked(latency_hint, latency_us,
> + xcede_records[i].responsive_to_irqs);
>
> - if (!xcede_records[i].responsive_to_irqs) {
> + if (rc) {
> + switch (rc) {
> + case ERR_NO_VALUE_ADD:
> + pr_info("cpuidle : Skipping XCEDE%d. No additional value-add\n",
> + latency_hint);
> + break;
> + case ERR_NO_EE_WAKEUP:
> pr_info("cpuidle : Skipping XCEDE%d. Not responsive to IRQs\n",
> latency_hint);
> + break;
> + }
> +
> continue;
> }
>
> - if (latency_us < min_latency_us)
> - min_latency_us = latency_us;
> snprintf(name, CPUIDLE_NAME_LEN, "XCEDE%d", latency_hint);
>
> /*
We need these heuristics to select/reject idle states exposed by
platform firmware to Linux primarily because not all states are really
useful to Linux on a given setup.
--Vaidy
^ permalink raw reply
* Re: [PATCH 05/11] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Srikar Dronamraju @ 2020-07-20 6:45 UTC (permalink / raw)
To: Gautham R Shenoy
Cc: Nathan Lynch, Oliver OHalloran, Michael Neuling, Michael Ellerman,
Anton Blanchard, linuxppc-dev, Nick Piggin
In-Reply-To: <20200717060011.GE25851@in.ibm.com>
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-17 11:30:11]:
> Hi Srikar,
>
> On Tue, Jul 14, 2020 at 10:06:18AM +0530, Srikar Dronamraju wrote:
> > Current code assumes that cpumask of cpus sharing a l2-cache mask will
> > always be a superset of cpu_sibling_mask.
> >
> > Lets stop that assumption.
> >
> > Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> > Cc: Michael Ellerman <michaele@au1.ibm.com>
> > Cc: Nick Piggin <npiggin@au1.ibm.com>
> > Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Neuling <mikey@linux.ibm.com>
> > Cc: Anton Blanchard <anton@au1.ibm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> > arch/powerpc/kernel/smp.c | 28 +++++++++++++++-------------
> > 1 file changed, 15 insertions(+), 13 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 7d430fc536cc..875f57e41355 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1198,6 +1198,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
> > struct device_node *l2_cache, *np;
> > int i;
> >
> > + cpumask_set_cpu(cpu, mask_fn(cpu));
>
> It would be good to comment why do we need to do set the CPU in the
> l2-mask if we don't have a l2cache domain.
>
Good Catch,
We should move this after the cpu_to_l2cache.
> > l2_cache = cpu_to_l2cache(cpu);
> > if (!l2_cache)
> > return false;
> > @@ -1284,29 +1285,30 @@ static void add_cpu_to_masks(int cpu)
> > * add it to it's own thread sibling mask.
> > */
> > cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
> > + cpumask_set_cpu(cpu, cpu_core_mask(cpu));
> >
> > for (i = first_thread; i < first_thread + threads_per_core; i++)
> > if (cpu_online(i))
> > set_cpus_related(i, cpu, cpu_sibling_mask);
> >
> > add_cpu_to_smallcore_masks(cpu);
> > - /*
> > - * Copy the thread sibling mask into the cache sibling mask
> > - * and mark any CPUs that share an L2 with this CPU.
> > - */
> > - for_each_cpu(i, cpu_sibling_mask(cpu))
> > - set_cpus_related(cpu, i, cpu_l2_cache_mask);
> > update_mask_by_l2(cpu, cpu_l2_cache_mask);
> >
> > - /*
> > - * Copy the cache sibling mask into core sibling mask and mark
> > - * any CPUs on the same chip as this CPU.
> > - */
> > - for_each_cpu(i, cpu_l2_cache_mask(cpu))
> > - set_cpus_related(cpu, i, cpu_core_mask);
> > + if (pkg_id == -1) {
> > + struct cpumask *(*mask)(int) = cpu_sibling_mask;
> > +
> > + /*
> > + * Copy the sibling mask into core sibling mask and
> > + * mark any CPUs on the same chip as this CPU.
> > + */
> > + if (shared_caches)
> > + mask = cpu_l2_cache_mask;
> > +
>
>
> Now that we decoupling the containment relationship between
> sibling_mask and l2-cache mask, should we set all the CPUs that are
> both in cpu_sibling_mask(cpu) as well as cpu_l2_mask(cpu) in
> cpu_core_mask ?
>
Are you saying instead of setting this cpu in this cpu_core_mask, can we set
all the cpus in the mask in cpu_core_mask?
Currently we dont know if any of the cpus of the mask were already set or
not. Plus we need to anyway update cpumask of all other cpus to says they
are related. So setting a mask instead of cpu at a time will not change
anything for our side.
> > + for_each_cpu(i, mask(cpu))
> > + set_cpus_related(cpu, i, cpu_core_mask);
> >
> > - if (pkg_id == -1)
> > return;
> > + }
> >
> > for_each_cpu(i, cpu_online_mask)
> > if (get_physical_package_id(i) == pkg_id)
> > --
> > 2.17.1
> >
> --
> Thanks and Regards
> gautham.
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply
* Re: [PATCH v4 10/10] powerpc/watchpoint: Remove 512 byte boundary
From: Jordan Niethe @ 2020-07-20 6:54 UTC (permalink / raw)
To: Ravi Bangoria
Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, oleg,
Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, fweisbec,
pedromfc, naveen.n.rao, linuxppc-dev, mingo
In-Reply-To: <20200717040958.70561-11-ravi.bangoria@linux.ibm.com>
On Fri, Jul 17, 2020 at 2:11 PM Ravi Bangoria
<ravi.bangoria@linux.ibm.com> wrote:
>
> Power10 has removed 512 bytes boundary from match criteria. i.e. The watch
> range can cross 512 bytes boundary.
It looks like this change is not mentioned in ISA v3.1 Book III 9.4
Data Address Watchpoint. It could be useful to mention that in the
commit message.
Also I wonder if could add a test for this to the ptrace-hwbreak selftest?
>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
> ---
> arch/powerpc/kernel/hw_breakpoint.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
> index c55e67bab271..1f4a1efa0074 100644
> --- a/arch/powerpc/kernel/hw_breakpoint.c
> +++ b/arch/powerpc/kernel/hw_breakpoint.c
> @@ -418,8 +418,9 @@ static int hw_breakpoint_validate_len(struct arch_hw_breakpoint *hw)
>
> if (dawr_enabled()) {
> max_len = DAWR_MAX_LEN;
> - /* DAWR region can't cross 512 bytes boundary */
> - if (ALIGN_DOWN(start_addr, SZ_512) != ALIGN_DOWN(end_addr - 1, SZ_512))
> + /* DAWR region can't cross 512 bytes boundary on p10 predecessors */
> + if (!cpu_has_feature(CPU_FTR_ARCH_31) &&
> + (ALIGN_DOWN(start_addr, SZ_512) != ALIGN_DOWN(end_addr - 1, SZ_512)))
> return -EINVAL;
> } else if (IS_ENABLED(CONFIG_PPC_8xx)) {
> /* 8xx can setup a range without limitation */
> --
> 2.26.2
>
^ permalink raw reply
* Re: [PATCH 04/11] powerpc/smp: Enable small core scheduling sooner
From: Srikar Dronamraju @ 2020-07-20 7:20 UTC (permalink / raw)
To: Gautham R Shenoy
Cc: Nathan Lynch, Oliver OHalloran, Michael Neuling, Michael Ellerman,
Anton Blanchard, linuxppc-dev, Nick Piggin
In-Reply-To: <20200717054821.GD25851@in.ibm.com>
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-17 11:18:21]:
> On Tue, Jul 14, 2020 at 10:06:17AM +0530, Srikar Dronamraju wrote:
> > Enable small core scheduling as soon as we detect that we are in a
> > system that supports thread group. Doing so would avoid a redundant
> > check.
> >
> > Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> > Cc: Michael Ellerman <michaele@au1.ibm.com>
> > Cc: Nick Piggin <npiggin@au1.ibm.com>
> > Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Neuling <mikey@linux.ibm.com>
> > Cc: Anton Blanchard <anton@au1.ibm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>
> I don't see a problem with this.
>
> However, since we are now going to be maintaining a single topology
> structure, wouldn't it be better to collate all the changes being made
> to the mask_functions/flags/names of this structure within a single
> function so that it becomes easier to keep track of what all changes
> are going into the topology and why are we doing it?
>
My intent was to move the topology updates early as soon as they are
detected. Currently the shared_cache want cannot be detected early.
But I think its possible to detect shared_cache early with some cleanups.
And if we do, then we should be able to call this up pretty early.
>
> > ---
> > arch/powerpc/kernel/smp.c | 12 ++++++------
> > 1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 24529f6134aa..7d430fc536cc 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -892,6 +892,12 @@ static int init_big_cores(void)
> > }
> >
> > has_big_cores = true;
> > +
> > +#ifdef CONFIG_SCHED_SMT
> > + pr_info("Big cores detected. Using small core scheduling\n");
> > + powerpc_topology[0].mask = smallcore_smt_mask;
> > +#endif
> > +
> > return 0;
> > }
> >
> > @@ -1383,12 +1389,6 @@ void __init smp_cpus_done(unsigned int max_cpus)
> >
> > dump_numa_cpu_topology();
> >
> > -#ifdef CONFIG_SCHED_SMT
> > - if (has_big_cores) {
> > - pr_info("Big cores detected but using small core scheduling\n");
> > - powerpc_topology[0].mask = smallcore_smt_mask;
> > - }
> > -#endif
> > set_sched_topology(powerpc_topology);
> > }
> >
> > --
> > 2.17.1
> >
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply
* Re: [PATCH 04/11] powerpc/smp: Enable small core scheduling sooner
From: Jordan Niethe @ 2020-07-20 7:47 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
Michael Ellerman, Anton Blanchard, linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-5-srikar@linux.vnet.ibm.com>
On Tue, Jul 14, 2020 at 2:44 PM Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> Enable small core scheduling as soon as we detect that we are in a
> system that supports thread group. Doing so would avoid a redundant
> check.
>
> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: Michael Ellerman <michaele@au1.ibm.com>
> Cc: Nick Piggin <npiggin@au1.ibm.com>
> Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@linux.ibm.com>
> Cc: Anton Blanchard <anton@au1.ibm.com>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> arch/powerpc/kernel/smp.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 24529f6134aa..7d430fc536cc 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -892,6 +892,12 @@ static int init_big_cores(void)
> }
>
> has_big_cores = true;
> +
> +#ifdef CONFIG_SCHED_SMT
> + pr_info("Big cores detected. Using small core scheduling\n");
Why change the wording from "Big cores detected but using small core
scheduling\n"?
> + powerpc_topology[0].mask = smallcore_smt_mask;
> +#endif
> +
> return 0;
> }
>
> @@ -1383,12 +1389,6 @@ void __init smp_cpus_done(unsigned int max_cpus)
>
> dump_numa_cpu_topology();
>
> -#ifdef CONFIG_SCHED_SMT
> - if (has_big_cores) {
> - pr_info("Big cores detected but using small core scheduling\n");
> - powerpc_topology[0].mask = smallcore_smt_mask;
> - }
> -#endif
> set_sched_topology(powerpc_topology);
> }
>
> --
> 2.17.1
>
^ permalink raw reply
* [PATCH v3] powerpc/pseries: Avoid using addr_to_pfn in realmode
From: Ganesh Goudar @ 2020-07-20 8:03 UTC (permalink / raw)
To: mpe, linuxppc-dev; +Cc: mahesh, Ganesh Goudar, npiggin, aneesh.kumar
When an UE or memory error exception is encountered the MCE handler
tries to find the pfn using addr_to_pfn() which takes effective
address as an argument, later pfn is used to poison the page where
memory error occurred, recent rework in this area made addr_to_pfn
to run in realmode, which can be fatal as it may try to access
memory outside RMO region.
To fix this have separate functions for realmode and virtual mode
handling and let addr_to_pfn to run in virtual mode.
Without this fix following kernel crash is seen on hitting UE.
[ 485.128036] Oops: Kernel access of bad area, sig: 11 [#1]
[ 485.128040] LE SMP NR_CPUS=2048 NUMA pSeries
[ 485.128047] Modules linked in:
[ 485.128067] CPU: 15 PID: 6536 Comm: insmod Kdump: loaded Tainted: G OE 5.7.0 #22
[ 485.128074] NIP: c00000000009b24c LR: c0000000000398d8 CTR: c000000000cd57c0
[ 485.128078] REGS: c000000003f1f970 TRAP: 0300 Tainted: G OE (5.7.0)
[ 485.128082] MSR: 8000000000001003 <SF,ME,RI,LE> CR: 28008284 XER: 00000001
[ 485.128088] CFAR: c00000000009b190 DAR: c0000001fab00000 DSISR: 40000000 IRQMASK: 1
[ 485.128088] GPR00: 0000000000000001 c000000003f1fbf0 c000000001634300 0000b0fa01000000
[ 485.128088] GPR04: d000000002220000 0000000000000000 00000000fab00000 0000000000000022
[ 485.128088] GPR08: c0000001fab00000 0000000000000000 c0000001fab00000 c000000003f1fc14
[ 485.128088] GPR12: 0000000000000008 c000000003ff5880 d000000002100008 0000000000000000
[ 485.128088] GPR16: 000000000000ff20 000000000000fff1 000000000000fff2 d0000000021a1100
[ 485.128088] GPR20: d000000002200000 c00000015c893c50 c000000000d49b28 c00000015c893c50
[ 485.128088] GPR24: d0000000021a0d08 c0000000014e5da8 d0000000021a0818 000000000000000a
[ 485.128088] GPR28: 0000000000000008 000000000000000a c0000000017e2970 000000000000000a
[ 485.128125] NIP [c00000000009b24c] __find_linux_pte+0x11c/0x310
[ 485.128130] LR [c0000000000398d8] addr_to_pfn+0x138/0x170
[ 485.128133] Call Trace:
[ 485.128135] Instruction dump:
[ 485.128138] 3929ffff 7d4a3378 7c883c36 7d2907b4 794a1564 7d294038 794af082 3900ffff
[ 485.128144] 79291f24 790af00e 78e70020 7d095214 <7c69502a> 2fa30000 419e011c 70690040
[ 485.128152] ---[ end trace d34b27e29ae0e340 ]---
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
---
V2: Leave bare metal code and save_mce_event as is.
V3: Have separate functions for realmode and virtual mode handling.
---
arch/powerpc/platforms/pseries/ras.c | 119 ++++++++++++++++-----------
1 file changed, 70 insertions(+), 49 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index f3736fcd98fc..32fe3fad86b8 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -522,18 +522,55 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
return 0; /* need to perform reset */
}
+static int mce_handle_err_realmode(int disposition, u8 error_type)
+{
+#ifdef CONFIG_PPC_BOOK3S_64
+ if (disposition == RTAS_DISP_NOT_RECOVERED) {
+ switch (error_type) {
+ case MC_ERROR_TYPE_SLB:
+ case MC_ERROR_TYPE_ERAT:
+ /*
+ * Store the old slb content in paca before flushing.
+ * Print this when we go to virtual mode.
+ * There are chances that we may hit MCE again if there
+ * is a parity error on the SLB entry we trying to read
+ * for saving. Hence limit the slb saving to single
+ * level of recursion.
+ */
+ if (local_paca->in_mce == 1)
+ slb_save_contents(local_paca->mce_faulty_slbs);
+ flush_and_reload_slb();
+ disposition = RTAS_DISP_FULLY_RECOVERED;
+ break;
+ default:
+ break;
+ }
+ } else if (disposition == RTAS_DISP_LIMITED_RECOVERY) {
+ /* Platform corrected itself but could be degraded */
+ pr_err("MCE: limited recovery, system may be degraded\n");
+ disposition = RTAS_DISP_FULLY_RECOVERED;
+ }
+#endif
+ return disposition;
+}
-static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
+static int mce_handle_err_virtmode(struct pt_regs *regs,
+ struct rtas_error_log *errp,
+ struct pseries_mc_errorlog *mce_log,
+ int disposition)
{
struct mce_error_info mce_err = { 0 };
- unsigned long eaddr = 0, paddr = 0;
- struct pseries_errorlog *pseries_log;
- struct pseries_mc_errorlog *mce_log;
- int disposition = rtas_error_disposition(errp);
int initiator = rtas_error_initiator(errp);
int severity = rtas_error_severity(errp);
+ unsigned long eaddr = 0, paddr = 0;
u8 error_type, err_sub_type;
+ if (!mce_log)
+ goto out;
+
+ error_type = mce_log->error_type;
+ err_sub_type = rtas_mc_error_sub_type(mce_log);
+
if (initiator == RTAS_INITIATOR_UNKNOWN)
mce_err.initiator = MCE_INITIATOR_UNKNOWN;
else if (initiator == RTAS_INITIATOR_CPU)
@@ -572,18 +609,7 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
mce_err.error_type = MCE_ERROR_TYPE_UNKNOWN;
mce_err.error_class = MCE_ECLASS_UNKNOWN;
- if (!rtas_error_extended(errp))
- goto out;
-
- pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
- if (pseries_log == NULL)
- goto out;
-
- mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
- error_type = mce_log->error_type;
- err_sub_type = rtas_mc_error_sub_type(mce_log);
-
- switch (mce_log->error_type) {
+ switch (error_type) {
case MC_ERROR_TYPE_UE:
mce_err.error_type = MCE_ERROR_TYPE_UE;
mce_common_process_ue(regs, &mce_err);
@@ -683,37 +709,32 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
mce_err.error_type = MCE_ERROR_TYPE_UNKNOWN;
break;
}
+out:
+ save_mce_event(regs, disposition == RTAS_DISP_FULLY_RECOVERED,
+ &mce_err, regs->nip, eaddr, paddr);
+ return disposition;
+}
-#ifdef CONFIG_PPC_BOOK3S_64
- if (disposition == RTAS_DISP_NOT_RECOVERED) {
- switch (error_type) {
- case MC_ERROR_TYPE_SLB:
- case MC_ERROR_TYPE_ERAT:
- /*
- * Store the old slb content in paca before flushing.
- * Print this when we go to virtual mode.
- * There are chances that we may hit MCE again if there
- * is a parity error on the SLB entry we trying to read
- * for saving. Hence limit the slb saving to single
- * level of recursion.
- */
- if (local_paca->in_mce == 1)
- slb_save_contents(local_paca->mce_faulty_slbs);
- flush_and_reload_slb();
- disposition = RTAS_DISP_FULLY_RECOVERED;
- break;
- default:
- break;
- }
- } else if (disposition == RTAS_DISP_LIMITED_RECOVERY) {
- /* Platform corrected itself but could be degraded */
- printk(KERN_ERR "MCE: limited recovery, system may "
- "be degraded\n");
- disposition = RTAS_DISP_FULLY_RECOVERED;
- }
-#endif
+static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
+{
+ struct pseries_errorlog *pseries_log;
+ struct pseries_mc_errorlog *mce_log = NULL;
+ int disposition = rtas_error_disposition(errp);
+ u8 error_type, err_sub_type;
+
+ if (!rtas_error_extended(errp))
+ goto out;
+
+ pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+ if (!pseries_log)
+ goto out;
+
+ mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+ error_type = mce_log->error_type;
+ err_sub_type = rtas_mc_error_sub_type(mce_log);
+
+ disposition = mce_handle_err_realmode(disposition, error_type);
-out:
/*
* Enable translation as we will be accessing per-cpu variables
* in save_mce_event() which may fall outside RMO region, also
@@ -724,10 +745,10 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
* Note: All the realmode handling like flushing SLB entries for
* SLB multihit is done by now.
*/
+out:
mtmsr(mfmsr() | MSR_IR | MSR_DR);
- save_mce_event(regs, disposition == RTAS_DISP_FULLY_RECOVERED,
- &mce_err, regs->nip, eaddr, paddr);
-
+ disposition = mce_handle_err_virtmode(regs, errp, mce_log,
+ disposition);
return disposition;
}
--
2.17.2
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox