LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v3 1/3] powerpc/powernv/idle: Replace CPU features checks with PVR checks
From: Nicholas Piggin @ 2020-07-20  0:00 UTC (permalink / raw)
  To: benh, ego, linux-kernel, linuxppc-dev, mikey, mpe, paulus,
	pratik.r.sampat, Pratik Rajesh Sampat, svaidy
In-Reply-To: <20200717185306.60607-2-psampat@linux.ibm.com>

Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am:
> As the idle framework's architecture is incomplete, hence instead of
> checking for just the processor type advertised in the device tree CPU
> features; check for the Processor Version Register (PVR) so that finer
> granularity can be leveraged while making processor checks.
> 
> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
> ---
>  arch/powerpc/platforms/powernv/idle.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index 2dd467383a88..f62904f70fc6 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -92,7 +92,7 @@ static int pnv_save_sprs_for_deep_states(void)
>  		if (rc != 0)
>  			return rc;
>  
> -		if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> +		if (pvr_version_is(PVR_POWER9)) {
>  			rc = opal_slw_set_reg(pir, P9_STOP_SPR_MSR, msr_val);
>  			if (rc)
>  				return rc;
> @@ -116,7 +116,7 @@ static int pnv_save_sprs_for_deep_states(void)
>  				return rc;
>  
>  			/* Only p8 needs to set extra HID regiters */
> -			if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
> +			if (!pvr_version_is(PVR_POWER9)) {
>  
>  				rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
>  				if (rc != 0)

What I think you should do is keep using CPU_FTR_ARCH_300 for this stuff 
which is written for power9 and we know is running on power9, because 
that's a faster test (static branch and does not have to read PVR. And
then...

> @@ -1205,7 +1205,7 @@ static void __init pnv_probe_idle_states(void)
>  		return;
>  	}
>  
> -	if (cpu_has_feature(CPU_FTR_ARCH_300))
> +	if (pvr_version_is(PVR_POWER9))
>  		pnv_power9_idle_init();
>  
>  	for (i = 0; i < nr_pnv_idle_states; i++)

Here is where you would put the version check. Once we have code that 
can also handle P10 (either by testing CPU_FTR_ARCH_31, or by adding
an entirely new power10 idle function), then you can add the P10 version
check here.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH v3 3/3] powerpc/powernv/idle: Exclude mfspr on HID1, 4, 5 on P9 and above
From: Nicholas Piggin @ 2020-07-20  0:01 UTC (permalink / raw)
  To: benh, ego, linux-kernel, linuxppc-dev, mikey, mpe, paulus,
	pratik.r.sampat, Pratik Rajesh Sampat, svaidy
In-Reply-To: <20200717185306.60607-4-psampat@linux.ibm.com>

Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am:
> POWER9 onwards the support for the registers HID1, HID4, HID5 has been
> receded.
> Although mfspr on the above registers worked in Power9, In Power10
> simulator is unrecognized. Moving their assignment under the
> check for machines lower than Power9
> 
> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> ---
>  arch/powerpc/platforms/powernv/idle.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index d439e11af101..d24d6671f3e8 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void)
>  	 */
>  	uint64_t lpcr_val	= mfspr(SPRN_LPCR);
>  	uint64_t hid0_val	= mfspr(SPRN_HID0);
> -	uint64_t hid1_val	= mfspr(SPRN_HID1);
> -	uint64_t hid4_val	= mfspr(SPRN_HID4);
> -	uint64_t hid5_val	= mfspr(SPRN_HID5);
>  	uint64_t hmeer_val	= mfspr(SPRN_HMEER);
>  	uint64_t msr_val = MSR_IDLE;
>  	uint64_t psscr_val = pnv_deepest_stop_psscr_val;
> @@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void)
>  
>  			/* Only p8 needs to set extra HID regiters */
>  			if (!pvr_version_is(PVR_POWER9)) {
> +				uint64_t hid1_val = mfspr(SPRN_HID1);
> +				uint64_t hid4_val = mfspr(SPRN_HID4);
> +				uint64_t hid5_val = mfspr(SPRN_HID5);
>  
>  				rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
>  				if (rc != 0)
> -- 
> 2.25.4
> 
> 

^ permalink raw reply

* io_uring kthread_use_mm / mmget_not_zero possible abuse
From: Nicholas Piggin @ 2020-07-20  0:38 UTC (permalink / raw)
  To: Jens Axboe, David S. Miller
  Cc: linux-arch, linux-mm, sparclinux, linuxppc-dev, linux-kernel

When I last looked at this (predating io_uring), as far as I remember it was 
not permitted to actually switch to (use_mm) an mm user context that was 
pinned with mmget_not_zero. Those pins were only allowed to look at page 
tables, vmas, etc., but not actually run the CPU in that mm context.

sparc/kernel/smp_64.c depends heavily on this, e.g.,

void smp_flush_tlb_mm(struct mm_struct *mm)
{
        u32 ctx = CTX_HWBITS(mm->context);
        int cpu = get_cpu();

        if (atomic_read(&mm->mm_users) == 1) {
                cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
                goto local_flush_and_out;
        }

        smp_cross_call_masked(&xcall_flush_tlb_mm,
                              ctx, 0, 0,
                              mm_cpumask(mm));

local_flush_and_out:
        __flush_tlb_mm(ctx, SECONDARY_CONTEXT);

        put_cpu();
}

If a kthread comes in concurrently between the mm_users test and the 
mm_cpumask reset, and does mmget_not_zero(); kthread_use_mm() then we have 
another CPU switched to mm context but not in the mm_cpumask. It's then 
possible for our thread to schedule on that CPU and not go through a 
switch_mm (because kthread_unuse_mm will make it lazy, then we can switch 
back to our user thread and un-lazy it).

powerpc has something similar.

I don't think this is documented anywhere and certainly isn't checked for 
unfortunately, so I don't really blame io_uring.

The simplest fix is for io_uring to carry mm_users references. If that can't 
be done or we decide to lift the limitation on mmget_not_zero references, we 
can come up with a way to synchronize things.

On powerpc for example, we IPI all targets in mm_cpumask before clearing 
them, so we could disable interrupts while kthread_use_mm does the mm switch 
sequence, and have the IPI handler check that current->mm hasn't been set to 
mm, for example.

sparc is a bit harder because it doesn't IPI targets if it thinks it can 
avoid it. But powerpc found that just doing one IPI isn't a big burden here 
so maybe we change sparc to do that too. I would be inclined to fix this 
mmget_not_zero quirk if we can, unless someone has a very good way to test 
and enforce it, it'll just happen again.

Comments?

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code
From: Oliver O'Halloran @ 2020-07-20  1:15 UTC (permalink / raw)
  To: Greg Thelen; +Cc: Linux Kernel Mailing List, Paul Mackerras, linuxppc-dev
In-Reply-To: <xr93sgdo7i2t.fsf@gthelen.svl.corp.google.com>

On Sun, Jul 19, 2020 at 5:13 AM Greg Thelen <gthelen@google.com> wrote:
>
> Oliver O'Halloran <oohall@gmail.com> wrote:
>
> > On Mon, Jun 15, 2020 at 9:33 AM Greg Thelen <gthelen@google.com> wrote:
> >>
> >> Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
> >> configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
> >> only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
> >> CONFIG_IOMMU_API see:
> >>   arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 'pnv_ioda_setup_bus_dma' defined but not used
> >>
> >> Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code.
> >
> > Doh! Thanks for the fix.
> >
> > Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
>
> Is there anything else needed from me on this patch?
> Given that it fixes a 5.8 commit I figured it'd be 5.8 material.

Oh sorry, I completely forgot about this patch. I sent another series
that included a more-or-less identical fix after the kbuild robot sent
a reminder:

http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=187630&state=*

That's current in powerpc/next, but if it's causing a build break then
I agree it should probably go into 5.8 too.

Oliver

^ permalink raw reply

* Re: [PATCH v4 06/10] powerpc/watchpoint: Set CPU_FTR_DAWR1 based on pa-features bit
From: Jordan Niethe @ 2020-07-20  1:39 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, fweisbec,
	pedromfc, naveen.n.rao, linuxppc-dev, mingo
In-Reply-To: <20200717040958.70561-7-ravi.bangoria@linux.ibm.com>

On Fri, Jul 17, 2020 at 2:10 PM Ravi Bangoria
<ravi.bangoria@linux.ibm.com> wrote:
>
> As per the PAPR, bit 0 of byte 64 in pa-features property indicates
> availability of 2nd DAWR registers. i.e. If this bit is set, 2nd
> DAWR is present, otherwise not. Host generally uses "cpu-features",
> which masks "pa-features". But "cpu-features" are still not used for
> guests and thus this change is mostly applicable for guests only.
>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
I checked those PAPR values are correct and checked running a powernv
kernel in p10 mambo with dt_cpu_ftrs=off and it does set the
CPU_FTR_DAWR1 bit.
(using p10 skiboot).
Tested-by: Jordan Niethe <jniethe5@gmail.com>
> ---
>  arch/powerpc/kernel/prom.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 9cc49f265c86..c76c09b97bc8 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -175,6 +175,8 @@ static struct ibm_pa_feature {
>          */
>         { .pabyte = 22, .pabit = 0, .cpu_features = CPU_FTR_TM_COMP,
>           .cpu_user_ftrs2 = PPC_FEATURE2_HTM_COMP | PPC_FEATURE2_HTM_NOSC_COMP },
> +
> +       { .pabyte = 64, .pabit = 0, .cpu_features = CPU_FTR_DAWR1 },
>  };
>
>  static void __init scan_features(unsigned long node, const unsigned char *ftrs,
> --
> 2.26.2
>

^ permalink raw reply

* Re: [PATCH v2 13/16] scripts/kallsyms: move ignored symbol types to is_ignored_symbol()
From: Finn Thain @ 2020-07-20  1:46 UTC (permalink / raw)
  To: Masahiro Yamada; +Cc: linuxppc-dev, linux-kernel, linux-kbuild
In-Reply-To: <20191123160444.11251-14-yamada.masahiro@socionext.com>

On Sun, 24 Nov 2019, Masahiro Yamada wrote:

> Collect the ignored patterns to is_ignored_symbol().
> 
> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>

This commit (887df76de67f5) caused a regression in my powerpc builds as it 
causes symbol names to disappear from backtraces:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/smp.c:433 _einittext+0x3f9e5120/0x3feb71b8
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc7-pmac-00055-g887df76de67f5 #18
NIP:  c00aef68 LR: c00af114 CTR: c001272c
REGS: c0705c40 TRAP: 0700   Not tainted  (5.4.0-rc7-pmac-00055-g887df76de67f5)
MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 42000044  XER: 00000000

GPR00: 001f0100 c0705cf8 c06dc300 c070af1c c001258c 00000000 00000000 ef7fb5bc 
GPR08: 08800000 00000100 00000001 00000100 42000044 00000000 c0709040 00000004 
GPR16: 00000001 c06022b4 c058297c 00200002 ffff8cb9 00000000 c06d84a0 c0710000 
GPR24: c0710000 00000000 00000000 c070af1c c070af1c 00000000 c001258c 00000000 
NIP [c00aef68] _einittext+0x3f9e5120/0x3feb71b8
LR [c00af114] _einittext+0x3f9e52cc/0x3feb71b8
Call Trace:
[c0705cf8] [ef006320] 0xef006320 (unreliable)
[c0705d38] [c00af114] _einittext+0x3f9e52cc/0x3feb71b8
[c0705d48] [c00af158] _einittext+0x3f9e5310/0x3feb71b8
[c0705d68] [c0012768] _einittext+0x3f948920/0x3feb71b8
[c0705d78] [c0092c04] _einittext+0x3f9c8dbc/0x3feb71b8
[c0705d88] [c0092d18] _einittext+0x3f9c8ed0/0x3feb71b8
[c0705da8] [c0093a2c] _einittext+0x3f9c9be4/0x3feb71b8
[c0705de8] [c0580224] _einittext+0x3feb63dc/0x3feb71b8
[c0705e48] [c00382ec] _einittext+0x3f96e4a4/0x3feb71b8
[c0705e58] [c000d2a0] _einittext+0x3f943458/0x3feb71b8
[c0705e88] [c001353c] _einittext+0x3f9496f4/0x3feb71b8
--- interrupt: 901 at _einittext+0x3f941058/0x3feb71b8
    LR = _einittext+0x3f941058/0x3feb71b8
[c0705f50] [c06cc214] 0xc06cc214 (unreliable)
[c0705f60] [c057fa20] _einittext+0x3feb5bd8/0x3feb71b8
[c0705f70] [c005de48] _einittext+0x3f994000/0x3feb71b8
[c0705f90] [c005e050] _einittext+0x3f994208/0x3feb71b8
[c0705fa0] [c0004cc8] _einittext+0x3f93ae80/0x3feb71b8
[c0705fb0] [c069a36c] _einittext+0x3ffd0524/0x40000000
[c0705ff0] [00003500] 0x3500
Instruction dump:
7c0803a6 7fa5eb78 7d808120 7ea6ab78 baa10014 38210040 4bfffbb0 7f64db78 
7f85e378 484b31b1 7c601b78 4bfffdf4 <0fe00000> 4bfffd60 9421ffe0 7c0802a6 
---[ end trace a06fef4788747c72 ]---


Prior to that (e.g. 97261e1e2240f), I get backtraces like this:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/smp.c:433 smp_call_function_many+0x318/0x320
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc7-pmac-00054-g97261e1e2240f #20
NIP:  c00aef68 LR: c00af114 CTR: c001272c
REGS: c075dc40 TRAP: 0700   Not tainted  (5.4.0-rc7-pmac-00054-g97261e1e2240f)
MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 42000044  XER: 00000000

GPR00: 001f0100 c075dcf8 c0733300 c0762f1c c001258c 00000000 00000000 ef7fb5bc 
GPR08: 04800000 00000100 00000001 00000100 42000044 00000000 c0761040 00000004 
GPR16: 00000001 c0658e58 c058297c 00200002 ffff8cb9 00000000 c072f4a0 c0760000 
GPR24: c0760000 00000000 00000000 c0762f1c c0762f1c 00000000 c001258c 00000000 
NIP [c00aef68] smp_call_function_many+0x318/0x320
LR [c00af114] smp_call_function+0x34/0x44
Call Trace:
[c075dcf8] [ef006320] 0xef006320 (unreliable)
[c075dd38] [c00af114] smp_call_function+0x34/0x44
[c075dd48] [c00af158] on_each_cpu+0x1c/0x4c
[c075dd68] [c0012768] tau_timeout_smp+0x3c/0x4c
[c075dd78] [c0092c04] call_timer_fn.isra.26+0x20/0x84
[c075dd88] [c0092d18] expire_timers+0xb0/0xc0
[c075dda8] [c0093a2c] run_timer_softirq+0xa4/0x1a4
[c075dde8] [c0580224] __do_softirq+0x11c/0x280
[c075de48] [c00382ec] irq_exit+0xc0/0xd4
[c075de58] [c000d2a0] timer_interrupt+0x154/0x260
[c075de88] [c001353c] ret_from_except+0x0/0x14
--- interrupt: 901 at arch_cpu_idle+0x24/0x78
    LR = arch_cpu_idle+0x24/0x78
[c075df50] [c0723214] 0xc0723214 (unreliable)
[c075df60] [c057fa20] default_idle_call+0x38/0x58
[c075df70] [c005de48] do_idle+0xd4/0x17c
[c075df90] [c005e054] cpu_startup_entry+0x24/0x28
[c075dfa0] [c0004cc8] rest_init+0xa8/0xbc
[c075dfb0] [c06f136c] start_kernel+0x40c/0x420
[c075dff0] [00003500] 0x3500
Instruction dump:
7c0803a6 7fa5eb78 7d808120 7ea6ab78 baa10014 38210040 4bfffbb0 7f64db78 
7f85e378 484b31b1 7c601b78 4bfffdf4 <0fe00000> 4bfffd60 9421ffe0 7c0802a6 
---[ end trace 784c7f15ecd23941 ]---

Has anyone else observed these problems (either the WARNING from 
smp_call_function_many() or the missing symbol names)?

What is the best way to fix this? Should I upgrade binutils?

^ permalink raw reply

* Re: [PATCH v4 07/10] powerpc/watchpoint: Rename current H_SET_MODE DAWR macro
From: Jordan Niethe @ 2020-07-20  1:50 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, fweisbec,
	pedromfc, naveen.n.rao, linuxppc-dev, mingo
In-Reply-To: <20200717040958.70561-8-ravi.bangoria@linux.ibm.com>

On Fri, Jul 17, 2020 at 2:11 PM Ravi Bangoria
<ravi.bangoria@linux.ibm.com> wrote:
>
> Current H_SET_MODE hcall macro name for setting/resetting DAWR0 is
> H_SET_MODE_RESOURCE_SET_DAWR. Add suffix 0 to macro name as well.
>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reviewed-by: Jordan Niethe <jniethe5@gmail.com>
> ---
>  arch/powerpc/include/asm/hvcall.h         | 2 +-
>  arch/powerpc/include/asm/plpar_wrappers.h | 2 +-
>  arch/powerpc/kvm/book3s_hv.c              | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> index 43486e773bd6..b785e9f0071c 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -355,7 +355,7 @@
>
>  /* Values for 2nd argument to H_SET_MODE */
>  #define H_SET_MODE_RESOURCE_SET_CIABR          1
> -#define H_SET_MODE_RESOURCE_SET_DAWR           2
> +#define H_SET_MODE_RESOURCE_SET_DAWR0          2
>  #define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE    3
>  #define H_SET_MODE_RESOURCE_LE                 4
>
> diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
> index 4293c5d2ddf4..d12c3680d946 100644
> --- a/arch/powerpc/include/asm/plpar_wrappers.h
> +++ b/arch/powerpc/include/asm/plpar_wrappers.h
> @@ -312,7 +312,7 @@ static inline long plpar_set_ciabr(unsigned long ciabr)
>
>  static inline long plpar_set_watchpoint0(unsigned long dawr0, unsigned long dawrx0)
>  {
> -       return plpar_set_mode(0, H_SET_MODE_RESOURCE_SET_DAWR, dawr0, dawrx0);
> +       return plpar_set_mode(0, H_SET_MODE_RESOURCE_SET_DAWR0, dawr0, dawrx0);
>  }
>
>  static inline long plpar_signal_sys_reset(long cpu)
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 6bf66649ab92..7ad692c2d7c7 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -764,7 +764,7 @@ static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, unsigned long mflags,
>                         return H_P3;
>                 vcpu->arch.ciabr  = value1;
>                 return H_SUCCESS;
> -       case H_SET_MODE_RESOURCE_SET_DAWR:
> +       case H_SET_MODE_RESOURCE_SET_DAWR0:
>                 if (!kvmppc_power8_compatible(vcpu))
>                         return H_P2;
>                 if (!ppc_breakpoint_available())
> --
> 2.26.2
>

^ permalink raw reply

* Re: [PATCH v2 4/4] mm/vmalloc: Hugepage vmalloc mappings
From: Zefan Li @ 2020-07-20  2:02 UTC (permalink / raw)
  To: Nicholas Piggin, linux-mm
  Cc: linux-arch, Will Deacon, Catalin Marinas, x86, linux-kernel,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Thomas Gleixner,
	linuxppc-dev, linux-arm-kernel
In-Reply-To: <20200413125303.423864-5-npiggin@gmail.com>

> +static int vmap_pages_range_noflush(unsigned long start, unsigned long end,
> +				    pgprot_t prot, struct page **pages,
> +				    unsigned int page_shift)
> +{
> +	if (page_shift == PAGE_SIZE) {

Is this a typo of PAGE_SHIFT?

> +		return vmap_small_pages_range_noflush(start, end, prot, pages);
> +	} else {
> +		unsigned long addr = start;
> +		unsigned int i, nr = (end - start) >> page_shift;
> +
> +		for (i = 0; i < nr; i++) {
> +			int err;
> +
> +			err = vmap_range_noflush(addr,
> +					addr + (1UL << page_shift),
> +					__pa(page_address(pages[i])), prot,
> +					page_shift);
> +			if (err)
> +				return err;
> +
> +			addr += 1UL << page_shift;
> +		}
> +
> +		return 0;
> +	}
> +}
> +

^ permalink raw reply

* Re: [PATCH v2 4/4] mm/vmalloc: Hugepage vmalloc mappings
From: Nicholas Piggin @ 2020-07-20  2:49 UTC (permalink / raw)
  To: linux-mm, Zefan Li
  Cc: linux-arch, Will Deacon, Catalin Marinas, x86, linux-kernel,
	Ingo Molnar, Borislav,  Petkov, H. Peter Anvin, Thomas,  Gleixner,
	linuxppc-dev, linux-arm-kernel
In-Reply-To: <0e43e743-7c78-fb86-6c36-f42e6184d32c@huawei.com>

Excerpts from Zefan Li's message of July 20, 2020 12:02 pm:
>> +static int vmap_pages_range_noflush(unsigned long start, unsigned long end,
>> +				    pgprot_t prot, struct page **pages,
>> +				    unsigned int page_shift)
>> +{
>> +	if (page_shift == PAGE_SIZE) {
> 
> Is this a typo of PAGE_SHIFT?

Oh good catch, yeah that'll always be going via the one-at-a-time route 
and slow down the small page vmaps. Will fix.

Thanks,
Nick

> 
>> +		return vmap_small_pages_range_noflush(start, end, prot, pages);
>> +	} else {
>> +		unsigned long addr = start;
>> +		unsigned int i, nr = (end - start) >> page_shift;
>> +
>> +		for (i = 0; i < nr; i++) {
>> +			int err;
>> +
>> +			err = vmap_range_noflush(addr,
>> +					addr + (1UL << page_shift),
>> +					__pa(page_address(pages[i])), prot,
>> +					page_shift);
>> +			if (err)
>> +				return err;
>> +
>> +			addr += 1UL << page_shift;
>> +		}
>> +
>> +		return 0;
>> +	}
>> +}
>> +
> 

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Nicholas Piggin @ 2020-07-20  3:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-arch, Jens Axboe, Arnd Bergmann, Peter Zijlstra, x86,
	linux-kernel, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <1314561373.18530.1594993363050.JavaMail.zimbra@efficios.com>

Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm:
> ----- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npiggin@gmail.com wrote:
> [...]
>> 
>> membarrier does replace barrier instructions on remote CPUs, which do
>> order accesses performed by the kernel on the user address space. So
>> membarrier should too I guess.
>> 
>> Normal process context accesses like read(2) will do so because they
>> don't get filtered out from IPIs, but kernel threads using the mm may
>> not.
> 
> But it should not be an issue, because membarrier's ordering is only with respect
> to submit and completion of io_uring requests, which are performed through
> system calls from the context of user-space threads, which are called from the
> right mm.

Is that true? Can io completions be written into an address space via a
kernel thread? I don't know the io_uring code well but it looks like 
that's asynchonously using the user mm context.

How about other memory accesses via kthread_use_mm? Presumably there is 
still ordering requirement there for membarrier, so I really think
it's a fragile interface with no real way for the user to know how 
kernel threads may use its mm for any particular reason, so membarrier
should synchronize all possible kernel users as well.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v4 09/10] powerpc/watchpoint: Return available watchpoints dynamically
From: Jordan Niethe @ 2020-07-20  3:42 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, fweisbec,
	pedromfc, naveen.n.rao, linuxppc-dev, mingo
In-Reply-To: <20200717040958.70561-10-ravi.bangoria@linux.ibm.com>

On Fri, Jul 17, 2020 at 2:11 PM Ravi Bangoria
<ravi.bangoria@linux.ibm.com> wrote:
>
> So far Book3S Powerpc supported only one watchpoint. Power10 is
> introducing 2nd DAWR. Enable 2nd DAWR support for Power10.
> Availability of 2nd DAWR will depend on CPU_FTR_DAWR1.
>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/cputable.h      | 4 +++-
>  arch/powerpc/include/asm/hw_breakpoint.h | 5 +++--
>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
> index 3445c86e1f6f..36a0851a7a9b 100644
> --- a/arch/powerpc/include/asm/cputable.h
> +++ b/arch/powerpc/include/asm/cputable.h
> @@ -633,7 +633,9 @@ enum {
>   * Maximum number of hw breakpoint supported on powerpc. Number of
>   * breakpoints supported by actual hw might be less than this.
>   */
> -#define HBP_NUM_MAX    1
> +#define HBP_NUM_MAX    2
> +#define HBP_NUM_ONE    1
> +#define HBP_NUM_TWO    2
I wonder if these defines are necessary - has it any advantage over
just using the literal?
>
>  #endif /* !__ASSEMBLY__ */
>
> diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h
> index cb424799da0d..d4eab1694bcd 100644
> --- a/arch/powerpc/include/asm/hw_breakpoint.h
> +++ b/arch/powerpc/include/asm/hw_breakpoint.h
> @@ -5,10 +5,11 @@
>   * Copyright 2010, IBM Corporation.
>   * Author: K.Prasad <prasad@linux.vnet.ibm.com>
>   */
> -
Was removing this line deliberate?
>  #ifndef _PPC_BOOK3S_64_HW_BREAKPOINT_H
>  #define _PPC_BOOK3S_64_HW_BREAKPOINT_H
>
> +#include <asm/cpu_has_feature.h>
> +
>  #ifdef __KERNEL__
>  struct arch_hw_breakpoint {
>         unsigned long   address;
> @@ -46,7 +47,7 @@ struct arch_hw_breakpoint {
>
>  static inline int nr_wp_slots(void)
>  {
> -       return HBP_NUM_MAX;
> +       return cpu_has_feature(CPU_FTR_DAWR1) ? HBP_NUM_TWO : HBP_NUM_ONE;
So it'd be something like:
+       return cpu_has_feature(CPU_FTR_DAWR1) ? HBP_NUM_MAX : 1;
But thinking that there might be more slots added in the future, it
may be better to make the number of slots a variable that is set
during the init and then have this function return that.
>  }
>
>  #ifdef CONFIG_HAVE_HW_BREAKPOINT
> --
> 2.26.2
>

^ permalink raw reply

* Re: [PATCH v2 2/3] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
From: Ravi Bangoria @ 2020-07-20  4:24 UTC (permalink / raw)
  To: Pratik Rajesh Sampat
  Cc: ego, mikey, pratik.r.sampat, linux-kernel, paulus, linuxppc-dev,
	Ravi Bangoria
In-Reply-To: <20200710052207.12003-3-psampat@linux.ibm.com>

Hi Pratik,

On 7/10/20 10:52 AM, Pratik Rajesh Sampat wrote:
> Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
> stop levels < 4.

p10 has one more pair DAWR1/DAWRX1. Please include that as well.

Ravi

^ permalink raw reply

* Re: [PATCH v2 2/3] powerpc/powernv/idle: save-restore DAWR0, DAWRX0 for P10
From: Ravi Bangoria @ 2020-07-20  4:30 UTC (permalink / raw)
  To: Nicholas Piggin, Pratik Rajesh Sampat
  Cc: ego, mikey, pratik.r.sampat, linux-kernel, paulus, linuxppc-dev
In-Reply-To: <1594619458.45vrahx59w.astroid@bobo.none>

Hi Nick,

On 7/13/20 11:22 AM, Nicholas Piggin wrote:
> Excerpts from Pratik Rajesh Sampat's message of July 10, 2020 3:22 pm:
>> Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
>> stop levels < 4.
>> Therefore save the values of these SPRs before entering a  "stop"
>> state and restore their values on wakeup.
> 
> Hmm, where do you get this from? Documentation I see says DAWR is lost
> on POWER9 but not P10.
> 
> Does idle thread even need to save DAWR, or does it get switched when
> going to a thread that has a watchpoint set?

I don't know how idle states works internally but IIUC, we need to save/restore
DAWRs. This is needed when user creates per-cpu watchpoint event.

Ravi

^ permalink raw reply

* [FIX PATCH] powerpc/prom: Enable Radix GTSE in cpu pa-features
From: Bharata B Rao @ 2020-07-20  4:42 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: aneesh.kumar, npiggin, Bharata B Rao, Qian Cai

From: Nicholas Piggin <npiggin@gmail.com>

When '029ab30b4c0a ("powerpc/mm: Enable radix GTSE only if supported.")'
made GTSE an MMU feature, it was enabled by default in
powerpc-cpu-features but was missed in pa-features. This causes
random memory corruption during boot of PowerNV kernels where
CONFIG_PPC_DT_CPU_FTRS isn't enabled.

Fixes: 029ab30b4c0a ("powerpc/mm: Enable radix GTSE only if supported.")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
 arch/powerpc/kernel/prom.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 9cc49f265c86..a9594bad572a 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -163,7 +163,8 @@ static struct ibm_pa_feature {
 	{ .pabyte = 0,  .pabit = 6, .cpu_features  = CPU_FTR_NOEXECUTE },
 	{ .pabyte = 1,  .pabit = 2, .mmu_features  = MMU_FTR_CI_LARGE_PAGE },
 #ifdef CONFIG_PPC_RADIX_MMU
-	{ .pabyte = 40, .pabit = 0, .mmu_features  = MMU_FTR_TYPE_RADIX },
+	{ .pabyte = 40, .pabit = 0,
+	  .mmu_features  = (MMU_FTR_TYPE_RADIX | MMU_FTR_GTSE) },
 #endif
 	{ .pabyte = 1,  .pabit = 1, .invert = 1, .cpu_features = CPU_FTR_NODSISRALIGN },
 	{ .pabyte = 5,  .pabit = 0, .cpu_features  = CPU_FTR_REAL_LE,
-- 
2.26.2


^ permalink raw reply related

* Re: [PATCH v2 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states
From: Gautham R Shenoy @ 2020-07-20  5:00 UTC (permalink / raw)
  To: Pratik Rajesh Sampat
  Cc: ego, linux-pm, daniel.lezcano, rjw, linuxppc-dev, npiggin, paulus,
	linux-kselftest, shuah, srivatsa, linux-kernel
In-Reply-To: <20200717091801.29289-2-psampat@linux.ibm.com>

On Fri, Jul 17, 2020 at 02:48:00PM +0530, Pratik Rajesh Sampat wrote:
> Fire directed smp_call_function_single IPIs from a specified source
> CPU to the specified target CPU to reduce the noise we have to wade
> through in the trace log.
> The module is based on the idea written by Srivatsa Bhat and maintained
> by Vaidyanathan Srinivasan internally.
> 
> Queue HR timer and measure jitter. Wakeup latency measurement for idle
> states using hrtimer.  Echo a value in ns to timer_test_function and
> watch trace. A HRtimer will be queued and when it fires the expected
> wakeup vs actual wakeup is computes and delay printed in ns.
> 
> Implemented as a module which utilizes debugfs so that it can be
> integrated with selftests.
> 
> To include the module, check option and include as module
> kernel hacking -> Cpuidle latency selftests
> 
> [srivatsa.bhat@linux.vnet.ibm.com: Initial implementation in
>  cpidle/sysfs]
> 
> [svaidy@linux.vnet.ibm.com: wakeup latency measurements using hrtimer
>  and fix some of the time calculation]
> 
> [ego@linux.vnet.ibm.com: Fix some whitespace and tab errors and
>  increase the resolution of IPI wakeup]
> 
> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>


The debugfs module looks good to me.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>


> ---
>  drivers/cpuidle/Makefile               |   1 +
>  drivers/cpuidle/test-cpuidle_latency.c | 150 +++++++++++++++++++++++++
>  lib/Kconfig.debug                      |  10 ++
>  3 files changed, 161 insertions(+)
>  create mode 100644 drivers/cpuidle/test-cpuidle_latency.c
> 
> diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> index f07800cbb43f..2ae05968078c 100644
> --- a/drivers/cpuidle/Makefile
> +++ b/drivers/cpuidle/Makefile
> @@ -8,6 +8,7 @@ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
>  obj-$(CONFIG_DT_IDLE_STATES)		  += dt_idle_states.o
>  obj-$(CONFIG_ARCH_HAS_CPU_RELAX)	  += poll_state.o
>  obj-$(CONFIG_HALTPOLL_CPUIDLE)		  += cpuidle-haltpoll.o
> +obj-$(CONFIG_IDLE_LATENCY_SELFTEST)	  += test-cpuidle_latency.o
> 
>  ##################################################################################
>  # ARM SoC drivers
> diff --git a/drivers/cpuidle/test-cpuidle_latency.c b/drivers/cpuidle/test-cpuidle_latency.c
> new file mode 100644
> index 000000000000..61574665e972
> --- /dev/null
> +++ b/drivers/cpuidle/test-cpuidle_latency.c
> @@ -0,0 +1,150 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Module-based API test facility for cpuidle latency using IPIs and timers
> + */
> +
> +#include <linux/debugfs.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +
> +/* IPI based wakeup latencies */
> +struct latency {
> +	unsigned int src_cpu;
> +	unsigned int dest_cpu;
> +	ktime_t time_start;
> +	ktime_t time_end;
> +	u64 latency_ns;
> +} ipi_wakeup;
> +
> +static void measure_latency(void *info)
> +{
> +	struct latency *v;
> +	ktime_t time_diff;
> +
> +	v = (struct latency *)info;
> +	v->time_end = ktime_get();
> +	time_diff = ktime_sub(v->time_end, v->time_start);
> +	v->latency_ns = ktime_to_ns(time_diff);
> +}
> +
> +void run_smp_call_function_test(unsigned int cpu)
> +{
> +	ipi_wakeup.src_cpu = smp_processor_id();
> +	ipi_wakeup.dest_cpu = cpu;
> +	ipi_wakeup.time_start = ktime_get();
> +	smp_call_function_single(cpu, measure_latency, &ipi_wakeup, 1);
> +}
> +
> +/* Timer based wakeup latencies */
> +struct timer_data {
> +	unsigned int src_cpu;
> +	u64 timeout;
> +	ktime_t time_start;
> +	ktime_t time_end;
> +	struct hrtimer timer;
> +	u64 timeout_diff_ns;
> +} timer_wakeup;
> +
> +static enum hrtimer_restart timer_called(struct hrtimer *hrtimer)
> +{
> +	struct timer_data *w;
> +	ktime_t time_diff;
> +
> +	w = container_of(hrtimer, struct timer_data, timer);
> +	w->time_end = ktime_get();
> +
> +	time_diff = ktime_sub(w->time_end, w->time_start);
> +	time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout));
> +	w->timeout_diff_ns = ktime_to_ns(time_diff);
> +	return HRTIMER_NORESTART;
> +}
> +
> +static void run_timer_test(unsigned int ns)
> +{
> +	hrtimer_init(&timer_wakeup.timer, CLOCK_MONOTONIC,
> +		     HRTIMER_MODE_REL);
> +	timer_wakeup.timer.function = timer_called;
> +	timer_wakeup.time_start = ktime_get();
> +	timer_wakeup.src_cpu = smp_processor_id();
> +	timer_wakeup.timeout = ns;
> +
> +	hrtimer_start(&timer_wakeup.timer, ns_to_ktime(ns),
> +		      HRTIMER_MODE_REL_PINNED);
> +}
> +
> +static struct dentry *dir;
> +
> +static int cpu_read_op(void *data, u64 *value)
> +{
> +	*value = ipi_wakeup.dest_cpu;
> +	return 0;
> +}
> +
> +static int cpu_write_op(void *data, u64 value)
> +{
> +	run_smp_call_function_test(value);
> +	return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(ipi_ops, cpu_read_op, cpu_write_op, "%llu\n");
> +
> +static int timeout_read_op(void *data, u64 *value)
> +{
> +	*value = timer_wakeup.timeout;
> +	return 0;
> +}
> +
> +static int timeout_write_op(void *data, u64 value)
> +{
> +	run_timer_test(value);
> +	return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(timeout_ops, timeout_read_op, timeout_write_op, "%llu\n");
> +
> +static int __init latency_init(void)
> +{
> +	struct dentry *temp;
> +
> +	dir = debugfs_create_dir("latency_test", 0);
> +	if (!dir) {
> +		pr_alert("latency_test: failed to create /sys/kernel/debug/latency_test\n");
> +		return -1;
> +	}
> +	temp = debugfs_create_file("ipi_cpu_dest",
> +				   0666,
> +				   dir,
> +				   NULL,
> +				   &ipi_ops);
> +	if (!temp) {
> +		pr_alert("latency_test: failed to create /sys/kernel/debug/ipi_cpu_dest\n");
> +		return -1;
> +	}
> +	debugfs_create_u64("ipi_latency_ns", 0444, dir, &ipi_wakeup.latency_ns);
> +	debugfs_create_u32("ipi_cpu_src", 0444, dir, &ipi_wakeup.src_cpu);
> +
> +	temp = debugfs_create_file("timeout_expected_ns",
> +				   0666,
> +				   dir,
> +				   NULL,
> +				   &timeout_ops);
> +	if (!temp) {
> +		pr_alert("latency_test: failed to create /sys/kernel/debug/timeout_expected_ns\n");
> +		return -1;
> +	}
> +	debugfs_create_u64("timeout_diff_ns", 0444, dir, &timer_wakeup.timeout_diff_ns);
> +	debugfs_create_u32("timeout_cpu_src", 0444, dir, &timer_wakeup.src_cpu);
> +	pr_info("Latency Test module loaded\n");
> +	return 0;
> +}
> +
> +static void __exit latency_cleanup(void)
> +{
> +	pr_info("Cleaning up Latency Test module.\n");
> +	debugfs_remove_recursive(dir);
> +}
> +
> +module_init(latency_init);
> +module_exit(latency_cleanup);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("IBM Corporation");
> +MODULE_DESCRIPTION("Measuring idle latency for IPIs and Timers");
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index d74ac0fd6b2d..e2283790245a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1375,6 +1375,16 @@ config DEBUG_KOBJECT
>  	  If you say Y here, some extra kobject debugging messages will be sent
>  	  to the syslog.
> 
> +config IDLE_LATENCY_SELFTEST
> +	tristate "Cpuidle latency selftests"
> +	depends on CPU_IDLE
> +	help
> +	  This option provides a kernel module that runs tests using the IPI and
> +	  timers to measure latency.
> +
> +	  Say M if you want these self tests to build as a module.
> +	  Say N if you are unsure.
> +
>  config DEBUG_KOBJECT_RELEASE
>  	bool "kobject release debugging"
>  	depends on DEBUG_OBJECTS_TIMERS
> -- 
> 2.25.4
> 

^ permalink raw reply

* Re: [PATCH v2 13/16] scripts/kallsyms: move ignored symbol types to is_ignored_symbol()
From: Masahiro Yamada @ 2020-07-20  5:34 UTC (permalink / raw)
  To: Finn Thain
  Cc: linuxppc-dev, Linux Kernel Mailing List,
	Linux Kbuild mailing list
In-Reply-To: <alpine.LNX.2.23.453.2007201132240.8@nippy.intranet>

On Mon, Jul 20, 2020 at 10:46 AM Finn Thain <fthain@telegraphics.com.au> wrote:
>
> On Sun, 24 Nov 2019, Masahiro Yamada wrote:
>
> > Collect the ignored patterns to is_ignored_symbol().
> >
> > Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
>
> This commit (887df76de67f5) caused a regression in my powerpc builds as it
> causes symbol names to disappear from backtraces:
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at kernel/smp.c:433 _einittext+0x3f9e5120/0x3feb71b8
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc7-pmac-00055-g887df76de67f5 #18
> NIP:  c00aef68 LR: c00af114 CTR: c001272c
> REGS: c0705c40 TRAP: 0700   Not tainted  (5.4.0-rc7-pmac-00055-g887df76de67f5)
> MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 42000044  XER: 00000000
>
> GPR00: 001f0100 c0705cf8 c06dc300 c070af1c c001258c 00000000 00000000 ef7fb5bc
> GPR08: 08800000 00000100 00000001 00000100 42000044 00000000 c0709040 00000004
> GPR16: 00000001 c06022b4 c058297c 00200002 ffff8cb9 00000000 c06d84a0 c0710000
> GPR24: c0710000 00000000 00000000 c070af1c c070af1c 00000000 c001258c 00000000
> NIP [c00aef68] _einittext+0x3f9e5120/0x3feb71b8
> LR [c00af114] _einittext+0x3f9e52cc/0x3feb71b8
> Call Trace:
> [c0705cf8] [ef006320] 0xef006320 (unreliable)
> [c0705d38] [c00af114] _einittext+0x3f9e52cc/0x3feb71b8
> [c0705d48] [c00af158] _einittext+0x3f9e5310/0x3feb71b8
> [c0705d68] [c0012768] _einittext+0x3f948920/0x3feb71b8
> [c0705d78] [c0092c04] _einittext+0x3f9c8dbc/0x3feb71b8
> [c0705d88] [c0092d18] _einittext+0x3f9c8ed0/0x3feb71b8
> [c0705da8] [c0093a2c] _einittext+0x3f9c9be4/0x3feb71b8
> [c0705de8] [c0580224] _einittext+0x3feb63dc/0x3feb71b8
> [c0705e48] [c00382ec] _einittext+0x3f96e4a4/0x3feb71b8
> [c0705e58] [c000d2a0] _einittext+0x3f943458/0x3feb71b8
> [c0705e88] [c001353c] _einittext+0x3f9496f4/0x3feb71b8
> --- interrupt: 901 at _einittext+0x3f941058/0x3feb71b8
>     LR = _einittext+0x3f941058/0x3feb71b8
> [c0705f50] [c06cc214] 0xc06cc214 (unreliable)
> [c0705f60] [c057fa20] _einittext+0x3feb5bd8/0x3feb71b8
> [c0705f70] [c005de48] _einittext+0x3f994000/0x3feb71b8
> [c0705f90] [c005e050] _einittext+0x3f994208/0x3feb71b8
> [c0705fa0] [c0004cc8] _einittext+0x3f93ae80/0x3feb71b8
> [c0705fb0] [c069a36c] _einittext+0x3ffd0524/0x40000000
> [c0705ff0] [00003500] 0x3500
> Instruction dump:
> 7c0803a6 7fa5eb78 7d808120 7ea6ab78 baa10014 38210040 4bfffbb0 7f64db78
> 7f85e378 484b31b1 7c601b78 4bfffdf4 <0fe00000> 4bfffd60 9421ffe0 7c0802a6
> ---[ end trace a06fef4788747c72 ]---
>
>
> Prior to that (e.g. 97261e1e2240f), I get backtraces like this:
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at kernel/smp.c:433 smp_call_function_many+0x318/0x320
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc7-pmac-00054-g97261e1e2240f #20
> NIP:  c00aef68 LR: c00af114 CTR: c001272c
> REGS: c075dc40 TRAP: 0700   Not tainted  (5.4.0-rc7-pmac-00054-g97261e1e2240f)
> MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 42000044  XER: 00000000
>
> GPR00: 001f0100 c075dcf8 c0733300 c0762f1c c001258c 00000000 00000000 ef7fb5bc
> GPR08: 04800000 00000100 00000001 00000100 42000044 00000000 c0761040 00000004
> GPR16: 00000001 c0658e58 c058297c 00200002 ffff8cb9 00000000 c072f4a0 c0760000
> GPR24: c0760000 00000000 00000000 c0762f1c c0762f1c 00000000 c001258c 00000000
> NIP [c00aef68] smp_call_function_many+0x318/0x320
> LR [c00af114] smp_call_function+0x34/0x44
> Call Trace:
> [c075dcf8] [ef006320] 0xef006320 (unreliable)
> [c075dd38] [c00af114] smp_call_function+0x34/0x44
> [c075dd48] [c00af158] on_each_cpu+0x1c/0x4c
> [c075dd68] [c0012768] tau_timeout_smp+0x3c/0x4c
> [c075dd78] [c0092c04] call_timer_fn.isra.26+0x20/0x84
> [c075dd88] [c0092d18] expire_timers+0xb0/0xc0
> [c075dda8] [c0093a2c] run_timer_softirq+0xa4/0x1a4
> [c075dde8] [c0580224] __do_softirq+0x11c/0x280
> [c075de48] [c00382ec] irq_exit+0xc0/0xd4
> [c075de58] [c000d2a0] timer_interrupt+0x154/0x260
> [c075de88] [c001353c] ret_from_except+0x0/0x14
> --- interrupt: 901 at arch_cpu_idle+0x24/0x78
>     LR = arch_cpu_idle+0x24/0x78
> [c075df50] [c0723214] 0xc0723214 (unreliable)
> [c075df60] [c057fa20] default_idle_call+0x38/0x58
> [c075df70] [c005de48] do_idle+0xd4/0x17c
> [c075df90] [c005e054] cpu_startup_entry+0x24/0x28
> [c075dfa0] [c0004cc8] rest_init+0xa8/0xbc
> [c075dfb0] [c06f136c] start_kernel+0x40c/0x420
> [c075dff0] [00003500] 0x3500
> Instruction dump:
> 7c0803a6 7fa5eb78 7d808120 7ea6ab78 baa10014 38210040 4bfffbb0 7f64db78
> 7f85e378 484b31b1 7c601b78 4bfffdf4 <0fe00000> 4bfffd60 9421ffe0 7c0802a6
> ---[ end trace 784c7f15ecd23941 ]---
>
> Has anyone else observed these problems (either the WARNING from
> smp_call_function_many() or the missing symbol names)?
>
> What is the best way to fix this? Should I upgrade binutils?



I got a similar report before.

I'd like to know whether or not
this is the same issue as fixed by
7883a14339299773b2ce08dcfd97c63c199a9289


Does your problem happen on the latest kernel?
Which version of binutils are you using?


-- 
Best Regards
Masahiro Yamada

^ permalink raw reply

* Re: [FIX PATCH] powerpc/prom: Enable Radix GTSE in cpu pa-features
From: Nicholas Piggin @ 2020-07-20  5:38 UTC (permalink / raw)
  To: Bharata B Rao, linuxppc-dev; +Cc: aneesh.kumar, Qian Cai
In-Reply-To: <20200720044258.863574-1-bharata@linux.ibm.com>

Excerpts from Bharata B Rao's message of July 20, 2020 2:42 pm:
> From: Nicholas Piggin <npiggin@gmail.com>
> 
> When '029ab30b4c0a ("powerpc/mm: Enable radix GTSE only if supported.")'
> made GTSE an MMU feature, it was enabled by default in
> powerpc-cpu-features but was missed in pa-features. This causes
> random memory corruption during boot of PowerNV kernels where
> CONFIG_PPC_DT_CPU_FTRS isn't enabled.

Thanks for writing this up, I got a bit bogged down with other things.

> Fixes: 029ab30b4c0a ("powerpc/mm: Enable radix GTSE only if supported.")
> Reported-by: Qian Cai <cai@lca.pw>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> ---
>  arch/powerpc/kernel/prom.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 9cc49f265c86..a9594bad572a 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -163,7 +163,8 @@ static struct ibm_pa_feature {
>  	{ .pabyte = 0,  .pabit = 6, .cpu_features  = CPU_FTR_NOEXECUTE },
>  	{ .pabyte = 1,  .pabit = 2, .mmu_features  = MMU_FTR_CI_LARGE_PAGE },
>  #ifdef CONFIG_PPC_RADIX_MMU
> -	{ .pabyte = 40, .pabit = 0, .mmu_features  = MMU_FTR_TYPE_RADIX },
> +	{ .pabyte = 40, .pabit = 0,
> +	  .mmu_features  = (MMU_FTR_TYPE_RADIX | MMU_FTR_GTSE) },

It might look better like this:

        { .pabyte = 1,  .pabit = 2, .mmu_features  = MMU_FTR_CI_LARGE_PAGE },
#ifdef CONFIG_PPC_RADIX_MMU
        { .pabyte = 40, .pabit = 0, .mmu_features  = MMU_FTR_TYPE_RADIX },
        { .pabyte = 40, .pabit = 0, .mmu_features  = MMU_FTR_TYPE_RADIX |
                                                     MMU_FTR_GTSE },
#endif
  	{ .pabyte = 1,  .pabit = 1, .invert = 1, .cpu_features = CPU_FTR_NODSISRALIGN },

But that's bikeshedding a bit and the optional bits already put it out 
of alignment.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 10/11] powerpc/smp: Implement cpu_to_coregroup_id
From: Srikar Dronamraju @ 2020-07-20  5:48 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Nathan Lynch, Oliver OHalloran, Michael Neuling, Michael Ellerman,
	Anton Blanchard, linuxppc-dev, Nick Piggin
In-Reply-To: <20200717082652.GF32531@in.ibm.com>

* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-17 13:56:53]:

> On Tue, Jul 14, 2020 at 10:06:23AM +0530, Srikar Dronamraju wrote:
> > Lookup the coregroup id from the associativity array.
> > 
> > If unable to detect the coregroup id, fallback on the core id.
> > This way, ensure sched_domain degenerates and an extra sched domain is
> > not created.
> > 
> > Ideally this function should have been implemented in
> > arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
> > don't need to find the primary domain again.
> > 
> > If the device-tree mentions more than one coregroup, then kernel
> > implements only the last or the smallest coregroup, which currently
> > corresponds to the penultimate domain in the device-tree.
> > 
> > Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> > Cc: Michael Ellerman <michaele@au1.ibm.com>
> > Cc: Nick Piggin <npiggin@au1.ibm.com>
> > Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Neuling <mikey@linux.ibm.com>
> > Cc: Anton Blanchard <anton@au1.ibm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/mm/numa.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index d9ab9da85eab..4e85564ef62a 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -1697,6 +1697,23 @@ static const struct proc_ops topology_proc_ops = {
> > 
> >  int cpu_to_coregroup_id(int cpu)
> >  {
> > +	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> > +	int index;
> > +
> 
> It would be good to have an assert here to ensure that we are calling
> this function only when coregroups are enabled.
> 
> Else, we may end up returning the penultimate index which maps to the
> chip-id.
> 

We have a check below exactly for the same reason. Please look below.

> 
> 
> > +	if (cpu < 0 || cpu > nr_cpu_ids)
> > +		return -1;
> > +
> > +	if (!firmware_has_feature(FW_FEATURE_VPHN))
> > +		goto out;
> > +
> > +	if (vphn_get_associativity(cpu, associativity))
> > +		goto out;
> > +
> > +	index = of_read_number(associativity, 1);
> > +	if ((index > min_common_depth + 1) && coregroup_enabled)
> > +		return of_read_number(&associativity[index - 1], 1);

See ^above.

index would be the all the domains in the associativity array, 
min_common_depth would be where the primary domain or the chip-id is
defined. So we are reading the penultimate domain if and only if the
min_common_depth isn't the primary domain aka chip-id. 

What other check /assertions can we add?


> > +
> > +out:
> >  	return cpu_to_core_id(cpu);
> >  }
> > 
> > -- 
> > 2.17.1
> > 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.
From: Vaidyanathan Srinivasan @ 2020-07-20  5:48 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Nathan Lynch, Michael Neuling, linux-pm, linux-kernel,
	Nicholas Piggin, linuxppc-dev
In-Reply-To: <1594120299-31389-1-git-send-email-ego@linux.vnet.ibm.com>

* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-07 16:41:34]:

> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> Hi,
> 
> On pseries Dedicated Linux LPARs, apart from the polling snooze idle
> state, we currently have the CEDE idle state which cedes the CPU to
> the hypervisor with latency-hint = 0.
> 
> However, the PowerVM hypervisor supports additional extended CEDE
> states, which can be queried through the "ibm,get-systems-parameter"
> rtas-call with the CEDE_LATENCY_TOKEN. The hypervisor maps these
> extended CEDE states to appropriate platform idle-states in order to
> provide energy-savings as well as shifting power to the active
> units. On existing pseries LPARs today we have extended CEDE with
> latency-hints {1,2} supported.
> 
> In Patches 1-3 of this patchset, we add the code to parse the CEDE
> latency records provided by the hypervisor. We use this information to
> determine the wakeup latency of the regular CEDE (which we have been
> so far hardcoding to 10us while experimentally it is much lesser ~
> 1us), by looking at the wakeup latency provided by the hypervisor for
> Extended CEDE states. Since the platform currently advertises Extended
> CEDE 1 to have wakeup latency of 2us, we can be sure that the wakeup
> latency of the regular CEDE is no more than this.
> 
> Patch 4 (currently marked as RFC), expose the extended CEDE states
> parsed above to the cpuidle framework, provided that they can wakeup
> on an interrupt. On current platforms only Extended CEDE 1 fits the
> bill, but this is going to change in future platforms where even
> Extended CEDE 2 may be responsive to external interrupts.
> 
> Patch 5 (currently marked as RFC), filters out Extended CEDE 1 since
> it offers no added advantage over the normal CEDE.
> 
> With Patches 1-3, we see an improvement in the single-threaded
> performance on ebizzy.
> 
> 2 ebizzy threads bound to the same big-core. 25% improvement in the
> avg records/s (higher the better) with patches 1-3.
> x without_patches
> * with_patches
>     N           Min           Max        Median           Avg        Stddev
> x  10       2491089       5834307       5398375       4244335     1596244.9
> *  10       2893813       5834474       5832448     5327281.3     1055941.4
> 
> We do not observe any major regression in either the context_switch2
> benchmark or the schbench benchmark
> 
> context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
> small cores). We observe a minor 0.14% regression in the number of
> context-switches (higher is better).
> x without_patch
> * with_patch
>     N           Min           Max        Median           Avg        Stddev
> x 500        348872        362236        354712     354745.69      2711.827
> * 500        349422        361452        353942      354215.4     2576.9258
> 
> context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
> improvement in the number of context-switches (higher is better).
> x without_patch
> * with_patch
>     N           Min           Max        Median           Avg        Stddev
> x 500        287956        294940        288896     288977.23     646.59295
> * 500        288300        294646        289582     290064.76     1161.9992
> 
> schbench:
> No major difference could be seen until the 99.9th percentile.
> 
> Without-patch
> Latency percentiles (usec)
> 	50.0th: 29
> 	75.0th: 39
> 	90.0th: 49
> 	95.0th: 59
> 	*99.0th: 13104
> 	99.5th: 14672
> 	99.9th: 15824
> 	min=0, max=17993
> 
> With-patch:
> Latency percentiles (usec)
> 	50.0th: 29
> 	75.0th: 40
> 	90.0th: 50
> 	95.0th: 61
> 	*99.0th: 13648
> 	99.5th: 14768
> 	99.9th: 15664
> 	min=0, max=29812

This patch series mainly cleans up the CEDE latency discovery and
prepares to add different cpuidle states in virtualised environment.
This helps in improving SMT folding speeds and also power savings and
power shifting with newer platform firmware.

The current benefit is primarily from faster SMT folding and resulting
single performance achieved by updating the platform firmware provided
heuristics in the cpuidle states.

--Vaidy




^ permalink raw reply

* Re: [PATCH v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
From: Gautham R Shenoy @ 2020-07-20  5:52 UTC (permalink / raw)
  To: Pratik Rajesh Sampat
  Cc: ego, linux-pm, daniel.lezcano, rjw, linuxppc-dev, npiggin, paulus,
	linux-kselftest, shuah, srivatsa, linux-kernel
In-Reply-To: <20200717091801.29289-3-psampat@linux.ibm.com>

Hi Pratik,


On Fri, Jul 17, 2020 at 02:48:01PM +0530, Pratik Rajesh Sampat wrote:
> This patch adds support to trace IPI based and timer based wakeup
> latency from idle states
> 
> Latches onto the test-cpuidle_latency kernel module using the debugfs
> interface to send IPIs or schedule a timer based event, which in-turn
> populates the debugfs with the latency measurements.
> 
> Currently for the IPI and timer tests; first disable all idle states
> and then test for latency measurements incrementally enabling each state
> 
> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>

A few comments below.

> ---
>  tools/testing/selftests/Makefile           |   1 +
>  tools/testing/selftests/cpuidle/Makefile   |   6 +
>  tools/testing/selftests/cpuidle/cpuidle.sh | 257 +++++++++++++++++++++
>  tools/testing/selftests/cpuidle/settings   |   1 +
>  4 files changed, 265 insertions(+)
>  create mode 100644 tools/testing/selftests/cpuidle/Makefile
>  create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
>  create mode 100644 tools/testing/selftests/cpuidle/settings
> 

[..skip..]

> +
> +ins_mod()
> +{
> +	if [ ! -f "$MODULE" ]; then
> +		printf "$MODULE module does not exist. Exitting\n"

If the module has been compiled into the kernel (due to a
localyesconfig, for instance), then it is unlikely that we will find
it in /lib/modules. Perhaps you want to check if the debugfs
directories created by the module exist, and if so, print a message
saying that the modules is already loaded or some such?

> +		exit $ksft_skip
> +	fi
> +	printf "Inserting $MODULE module\n\n"
> +	insmod $MODULE
> +	if [ $? != 0 ]; then
> +		printf "Insmod $MODULE failed\n"
> +		exit $ksft_skip
> +	fi
> +}
> +
> +compute_average()
> +{
> +	arr=("$@")
> +	sum=0
> +	size=${#arr[@]}
> +	for i in "${arr[@]}"
> +	do
> +		sum=$((sum + i))
> +	done
> +	avg=$((sum/size))

It would be good to assert that "size" isn't 0 here.

> +}
> +
> +# Disable all stop states
> +disable_idle()
> +{
> +	for ((cpu=0; cpu<NUM_CPUS; cpu++))
> +	do
> +		for ((state=0; state<NUM_STATES; state++))
> +		do
> +			echo 1 > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable

So, on offlined CPUs, we won't see
/sys/devices/system/cpu/cpu$cpu/cpuidle/state$state directory. You
should probably perform this operation only on online CPUs.


> +		done
> +	done
> +}
> +
> +# Perform operation on each CPU for the given state
> +# $1 - Operation: enable (0) / disable (1)
> +# $2 - State to enable
> +op_state()
> +{
> +	for ((cpu=0; cpu<NUM_CPUS; cpu++))
> +	do
> +		echo $1 > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$2/disable


Ditto

> +	done
> +}

This is a helper function. For better readability of the main code you
can define the following wrappers and use them.


cpuidle_enable_state()
{
	state=$1
	op_state 1 $state
}

cpuidle_disable_state()
{
	state=$1
	op_state 0 $state

}


> +

[..snip..]

> +run_ipi_tests()
> +{
> +        extract_latency
> +        disable_idle
> +        declare -a avg_arr
> +        echo -e "--IPI Latency Test---" >> $LOG
> +
> +		echo -e "--Baseline IPI Latency measurement: CPU Busy--" >> $LOG
> +		printf "%s %10s %12s\n" "SRC_CPU" "DEST_CPU" "IPI_Latency(ns)" >> $LOG
> +		for ((cpu=0; cpu<NUM_CPUS; cpu++))
> +		do
> +			ipi_test_once "baseline" $cpu
> +			printf "%-3s %10s %12s\n" $src_cpu $cpu $ipi_latency >> $LOG
> +			avg_arr+=($ipi_latency)
> +		done
> +		compute_average "${avg_arr[@]}"
> +		echo -e "Baseline Average IPI latency(ns): $avg" >> $LOG
> +
> +        for ((state=0; state<NUM_STATES; state++))
> +        do
> +			unset avg_arr
> +			echo -e "---Enabling state: $state---" >> $LOG
> +			op_state 0 $state
> +			printf "%s %10s %12s\n" "SRC_CPU" "DEST_CPU" "IPI_Latency(ns)" >> $LOG
> +			for ((cpu=0; cpu<NUM_CPUS; cpu++))
> +			do

If a CPU is offline, then we should skip it here.

> +				# Running IPI test and logging results
> +				sleep 1
> +				ipi_test_once "test" $cpu
> +				printf "%-3s %10s %12s\n" $src_cpu $cpu $ipi_latency >> $LOG
> +				avg_arr+=($ipi_latency)
> +			done
> +			compute_average "${avg_arr[@]}"
> +			echo -e "Expected IPI latency(ns): ${latency_arr[$state]}" >> $LOG
> +			echo -e "Observed Average IPI latency(ns): $avg" >> $LOG
> +			op_state 1 $state
> +        done
> +}
> +
> +# Extract the residency in microseconds and convert to nanoseconds.
> +# Add 100 ns so that the timer stays for a little longer than the residency
> +extract_residency()
> +{
> +	for ((state=0; state<NUM_STATES; state++))
> +	do
> +		residency=$(($(cat /sys/devices/system/cpu/cpu0/cpuidle/state$state/residency) * 1000 + 200))
> +		residency_arr+=($residency)
> +	done
> +}
> +
> +# Run the Timeout test
> +# $1 run for baseline - busy cpu or regular environment
> +# $2 destination cpu
> +# $3 timeout
> +timeout_test_once()
> +{
> +	dest_cpu=$2
> +	if [ "$1" = "baseline" ]; then
> +		# Keep the CPU busy
> +		taskset -c $dest_cpu cat /dev/random > /dev/null &
> +		task_pid=$!
> +		# Wait for the workload to achieve 100% CPU usage
> +		sleep 1
> +	fi
> +	taskset -c $dest_cpu echo $3 > /sys/kernel/debug/latency_test/timeout_expected_ns
> +	# Wait for the result to populate
> +	sleep 0.1
> +	timeout_diff=$(cat /sys/kernel/debug/latency_test/timeout_diff_ns)
> +	src_cpu=$(cat /sys/kernel/debug/latency_test/timeout_cpu_src)
> +	if [ "$1" = "baseline" ]; then
> +		kill $task_pid
> +		wait $task_pid 2>/dev/null
> +	fi
> +}
> +
> +run_timeout_tests()
> +{
> +	extract_residency
> +	disable_idle
> +	declare -a avg_arr
> +	echo -e "\n--Timeout Latency Test--" >> $LOG
> +
> +	echo -e "--Baseline Timeout Latency measurement: CPU Busy--" >> $LOG
> +	printf "%s %10s %10s\n" "Wakeup_src" "Baseline_delay(ns)">> $LOG
> +	for ((cpu=0; cpu<NUM_CPUS; cpu++))
> +	do

Again only perform this on online CPUs.

> +		timeout_test_once "baseline" $cpu ${residency_arr[0]}
> +		printf "%-3s %13s\n" $src_cpu $timeout_diff >> $LOG
> +		avg_arr+=($timeout_diff)
> +	done
> +	compute_average "${avg_arr[@]}"
> +	echo -e "Baseline Average timeout diff(ns): $avg" >> $LOG
> +
> +	for ((state=0; state<NUM_STATES; state++))
> +	do

> +		echo -e "---Enabling state: $state---" >> $LOG
> +		op_state 0 $state
> +		printf "%s %10s %10s\n" "Wakeup_src" "Baseline_delay(ns)" "Delay(ns)" >> $LOG
> +		unset avg_arr
> +		for ((cpu=0; cpu<NUM_CPUS; cpu++))
> +		do

Ditto.

> +			timeout_test_once "test" $cpu ${residency_arr[$state]}
> +			printf "%-3s %13s %18s\n" $src_cpu $baseline_timeout_diff $timeout_diff >> $LOG
> +			avg_arr+=($timeout_diff)
> +		done
> +		compute_average "${avg_arr[@]}"
> +		echo -e "Expected timeout(ns): ${residency_arr[$state]}" >> $LOG
> +		echo -e "Observed Average timeout diff(ns): $avg" >> $LOG
> +		op_state 1 $state
> +	done
> +}
> +

--
Thanks and Regards
gautham.

^ permalink raw reply

* Re: [PATCH 1/5] cpuidle-pseries: Set the latency-hint before entering CEDE
From: Vaidyanathan Srinivasan @ 2020-07-20  5:55 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Nathan Lynch, Michael Neuling, linux-pm, linux-kernel,
	Nicholas Piggin, linuxppc-dev
In-Reply-To: <1594120299-31389-2-git-send-email-ego@linux.vnet.ibm.com>

* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-07 16:41:35]:

> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> As per the PAPR, each H_CEDE call is associated with a latency-hint to
> be passed in the VPA field "cede_latency_hint". The CEDE states that
> we were implicitly entering so far is CEDE with latency-hint = 0.
> 
> This patch explicitly sets the latency hint corresponding to the CEDE
> state that we are currently entering. While at it, we save the
> previous hint, to be restored once we wakeup from CEDE. This will be
> required in the future when we expose extended-cede states through the
> cpuidle framework, where each of them will have a different
> cede-latency hint.
> 
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>


> ---
>  drivers/cpuidle/cpuidle-pseries.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index 4a37252..39d4bb6 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -105,19 +105,27 @@ static void check_and_cede_processor(void)
>  	}
>  }
> 
> +#define NR_CEDE_STATES		1  /* CEDE with latency-hint 0 */
> +#define NR_DEDICATED_STATES	(NR_CEDE_STATES + 1) /* Includes snooze */
> +
> +u8 cede_latency_hint[NR_DEDICATED_STATES];
>  static int dedicated_cede_loop(struct cpuidle_device *dev,
>  				struct cpuidle_driver *drv,
>  				int index)
>  {
> +	u8 old_latency_hint;
> 
>  	pseries_idle_prolog();
>  	get_lppaca()->donate_dedicated_cpu = 1;
> +	old_latency_hint = get_lppaca()->cede_latency_hint;
> +	get_lppaca()->cede_latency_hint = cede_latency_hint[index];
> 
>  	HMT_medium();
>  	check_and_cede_processor();
> 
>  	local_irq_disable();
>  	get_lppaca()->donate_dedicated_cpu = 0;
> +	get_lppaca()->cede_latency_hint = old_latency_hint;
> 
>  	pseries_idle_epilog();
> 
> @@ -149,7 +157,7 @@ static int shared_cede_loop(struct cpuidle_device *dev,
>  /*
>   * States for dedicated partition case.
>   */
> -static struct cpuidle_state dedicated_states[] = {
> +static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = {
>  	{ /* Snooze */
>  		.name = "snooze",
>  		.desc = "snooze",


Saving and restoring the current cede hint value helps in maintaining
compatibility with other parts of the kernel.  Over long term we can
make cpuidle driver deterministically set the CEDE hint at each
invocation of H_CEDE call so that we do not have to do multiple
redundant save-restore.

This is a reasonable start to cleanup this cupidle subsystem on PAPR
guests. 

--Vaidy


^ permalink raw reply

* Re: [PATCH 09/11] Powerpc/smp: Create coregroup domain
From: Srikar Dronamraju @ 2020-07-20  6:02 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Nathan Lynch, Oliver OHalloran, Michael Neuling, Michael Ellerman,
	Anton Blanchard, linuxppc-dev, Nick Piggin
In-Reply-To: <20200717081926.GD32531@in.ibm.com>

* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-17 13:49:26]:

> On Tue, Jul 14, 2020 at 10:06:22AM +0530, Srikar Dronamraju wrote:
> > Add percpu coregroup maps and masks to create coregroup domain.
> > If a coregroup doesn't exist, the coregroup domain will be degenerated
> > in favour of SMT/CACHE domain.
> > 
> > Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> > Cc: Michael Ellerman <michaele@au1.ibm.com>
> > Cc: Nick Piggin <npiggin@au1.ibm.com>
> > Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Neuling <mikey@linux.ibm.com>
> > Cc: Anton Blanchard <anton@au1.ibm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/include/asm/topology.h | 10 ++++++++
> >  arch/powerpc/kernel/smp.c           | 37 +++++++++++++++++++++++++++++
> >  arch/powerpc/mm/numa.c              |  5 ++++
> >  3 files changed, 52 insertions(+)
> > 
> > @@ -950,6 +972,11 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> >  	cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
> >  	cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
> > 
> > +	if (has_coregroup_support())
> > +		cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
> > +	else
> > +		powerpc_topology[mc_idx].mask = cpu_bigcore_mask;
> > +
> 
> The else part could be moved to the common function where we are
> modifying the other attributes of the topology array.
> 

My intent is to make all the changes to the topology attributes in
smp_prepare_cpus. It makes more sense to change the attributes immediately
after we define / detect a particular topology change.

The only thing that is left out currently is shared_cache,
We should be able to detect shared_cache also around this time. Just that it
needs more cleanups. Once we do update the topology attributes even for
shared_cache here itself. 

> >  	init_big_cores();
> >  	if (has_big_cores) {
> >  		cpumask_set_cpu(boot_cpuid,
> > @@ -1241,6 +1268,8 @@ static void remove_cpu_from_masks(int cpu)
> >  		set_cpus_unrelated(cpu, i, cpu_sibling_mask);
> >  		if (has_big_cores)
> >  			set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
> > +		if (has_coregroup_support())
> > +			set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
> >  	}
> >  }
> >  #endif
> > @@ -1301,6 +1330,14 @@ static void add_cpu_to_masks(int cpu)
> >  	add_cpu_to_smallcore_masks(cpu);
> >  	update_mask_by_l2(cpu, cpu_l2_cache_mask);
> > 
> > +	if (has_coregroup_support()) {
> > +		cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu));
> > +		for_each_cpu(i, cpu_online_mask) {
> > +			if (cpu_to_coregroup_id(cpu) == cpu_to_coregroup_id(i))
> > +				set_cpus_related(cpu, i, cpu_coregroup_mask);
> > +		}
> > +	}
> > +
> >  	if (pkg_id == -1) {
> >  		struct cpumask *(*mask)(int) = cpu_sibling_mask;
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index a43eab455be4..d9ab9da85eab 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -1695,6 +1695,11 @@ static const struct proc_ops topology_proc_ops = {
> >  	.proc_release	= single_release,
> >  };
> > 
> > +int cpu_to_coregroup_id(int cpu)
> > +{
> > +	return cpu_to_core_id(cpu);
> > +}
> 
> 
> So, if has_coregroup_support() returns true, then since the core_group
> identification is currently done through the core-id, the
> coregroup_mask is going to be the same as the
> cpu_core_mask/cpu_cpu_mask. Thus, we will be degenerating the DIE
> domain. Right ? Instead we could have kept the core-group to be a
> single bigcore by default, so that those domains can get degenerated
> preserving the legacy SMT, DIE, NUMA hierarchy.
> 
> 

I think you have confused between cpu_core_mask and cpu_to_core_id.
cpu_to_core_id() returns a unique id for every SMT8 thread group.
If coregroup_support is true and the system doesn't support coregroup, then
We would not be degenerating the DIE but degenerating new MC domain only.
This is because the MC domain and the previous domain (SHARED_CACHE/BIGCORE/ 
SMT) would match the MC domain.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH v6 03/23] powerpc/book3s64/pkeys: pkeys are supported only on hash on book3s.
From: Michael Ellerman @ 2020-07-20  6:05 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev; +Cc: Aneesh Kumar K.V, linuxram, bauerman
In-Reply-To: <20200709032946.881753-4-aneesh.kumar@linux.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> Move them to hash specific file and add BUG() for radix path.
> ---
>  .../powerpc/include/asm/book3s/64/hash-pkey.h | 32 ++++++++++++++++
>  arch/powerpc/include/asm/book3s/64/pkeys.h    | 25 +++++++++++++
>  arch/powerpc/include/asm/pkeys.h              | 37 ++++---------------
>  3 files changed, 64 insertions(+), 30 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/book3s/64/hash-pkey.h
>  create mode 100644 arch/powerpc/include/asm/book3s/64/pkeys.h

This isn't signed-off.

I assume you meant to, so I added it.

cheers

> diff --git a/arch/powerpc/include/asm/book3s/64/hash-pkey.h b/arch/powerpc/include/asm/book3s/64/hash-pkey.h
> new file mode 100644
> index 000000000000..795010897e5d

^ permalink raw reply

* Re: [PATCH 2/5] cpuidle-pseries: Add function to parse extended CEDE records
From: Vaidyanathan Srinivasan @ 2020-07-20  6:09 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Nathan Lynch, Michael Neuling, linux-pm, linux-kernel,
	Nicholas Piggin, linuxppc-dev
In-Reply-To: <1594120299-31389-3-git-send-email-ego@linux.vnet.ibm.com>

* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-07 16:41:36]:

> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> Currently we use CEDE with latency-hint 0 as the only other idle state
> on a dedicated LPAR apart from the polling "snooze" state.
> 
> The platform might support additional extended CEDE idle states, which
> can be discovered through the "ibm,get-system-parameter" rtas-call
> made with CEDE_LATENCY_TOKEN.
> 
> This patch adds a function to obtain information about the extended
> CEDE idle states from the platform and parse the contents to populate
> an array of extended CEDE states. These idle states thus discovered
> will be added to the cpuidle framework in the next patch.
> 
> dmesg on a POWER9 LPAR, demonstrating the output of parsing the
> extended CEDE latency parameters.
> 
> [    5.913180] xcede : xcede_record_size = 10
> [    5.913183] xcede : Record 0 : hint = 1, latency =0x400 tb-ticks, Wake-on-irq = 1
> [    5.913188] xcede : Record 1 : hint = 2, latency =0x3e8000 tb-ticks, Wake-on-irq = 0
> [    5.913193] cpuidle : Skipping the 2 Extended CEDE idle states
> 
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>

>
> ---
>  drivers/cpuidle/cpuidle-pseries.c | 129 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 127 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index 39d4bb6..c13549b 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -21,6 +21,7 @@
>  #include <asm/runlatch.h>
>  #include <asm/idle.h>
>  #include <asm/plpar_wrappers.h>
> +#include <asm/rtas.h>
> 
>  struct cpuidle_driver pseries_idle_driver = {
>  	.name             = "pseries_idle",
> @@ -105,9 +106,120 @@ static void check_and_cede_processor(void)
>  	}
>  }
> 
> -#define NR_CEDE_STATES		1  /* CEDE with latency-hint 0 */
> +struct xcede_latency_records {
> +	u8  latency_hint;
> +	u64 wakeup_latency_tb_ticks;
> +	u8  responsive_to_irqs;
> +};
> +
> +/*
> + * XCEDE : Extended CEDE states discovered through the
> + *         "ibm,get-systems-parameter" rtas-call with the token
> + *         CEDE_LATENCY_TOKEN
> + */
> +#define MAX_XCEDE_STATES		4
> +#define	XCEDE_LATENCY_RECORD_SIZE	10
> +#define XCEDE_LATENCY_PARAM_MAX_LENGTH	(2 + 2 + \
> +					(MAX_XCEDE_STATES * XCEDE_LATENCY_RECORD_SIZE))
> +
> +#define CEDE_LATENCY_TOKEN		45
> +
> +#define NR_CEDE_STATES		(MAX_XCEDE_STATES + 1) /* CEDE with latency-hint 0 */
>  #define NR_DEDICATED_STATES	(NR_CEDE_STATES + 1) /* Includes snooze */
> 
> +struct xcede_latency_records xcede_records[MAX_XCEDE_STATES];
> +unsigned int nr_xcede_records;
> +char xcede_parameters[XCEDE_LATENCY_PARAM_MAX_LENGTH];
> +
> +static int parse_cede_parameters(void)
> +{
> +	int ret = -1, i;
> +	u16 payload_length;
> +	u8 xcede_record_size;
> +	u32 total_xcede_records_size;
> +	char *payload;
> +
> +	memset(xcede_parameters, 0, XCEDE_LATENCY_PARAM_MAX_LENGTH);
> +
> +	ret = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
> +			NULL, CEDE_LATENCY_TOKEN, __pa(xcede_parameters),
> +			XCEDE_LATENCY_PARAM_MAX_LENGTH);
> +
> +	if (ret) {
> +		pr_err("xcede: Error parsing CEDE_LATENCY_TOKEN\n");
> +		return ret;
> +	}
> +
> +	payload_length = be16_to_cpu(*(__be16 *)(&xcede_parameters[0]));
> +	payload = &xcede_parameters[2];
> +
> +	/*
> +	 * If the platform supports the cede latency settings
> +	 * information system parameter it must provide the following
> +	 * information in the NULL terminated parameter string:
> +	 *
> +	 * a. The first byte is the length ???N??? of each cede
> +	 *    latency setting record minus one (zero indicates a length
> +	 *    of 1 byte).
> +	 *
> +	 * b. For each supported cede latency setting a cede latency
> +	 *    setting record consisting of the first ???N??? bytes as per
> +	 *    the following table.
> +	 *
> +	 *	-----------------------------
> +	 *	| Field           | Field  |
> +	 *	| Name            | Length |
> +	 *	-----------------------------
> +	 *	| Cede Latency    | 1 Byte |
> +	 *	| Specifier Value |        |
> +	 *	-----------------------------
> +	 *	| Maximum wakeup  |        |
> +	 *	| latency in      | 8 Bytes|
> +	 *	| tb-ticks        |        |
> +	 *	-----------------------------
> +	 *	| Responsive to   |        |
> +	 *	| external        | 1 Byte |
> +	 *	| interrupts      |        |
> +	 *	-----------------------------
> +	 *
> +	 * This version has cede latency record size = 10.
> +	 */
> +	xcede_record_size = (u8)payload[0] + 1;

This is standard PAPR interface that has been defined long time ago.
However, new H_CEDE hints that map to new platform features will
appear in the same interface and Linux needs to prepare and be ready
to check and exploit the new hints if they are useful for the given
setup.


> +
> +	if (xcede_record_size != XCEDE_LATENCY_RECORD_SIZE) {
> +		pr_err("xcede : Expected record-size %d. Observed size %d.\n",
> +		       XCEDE_LATENCY_RECORD_SIZE, xcede_record_size);
> +		return -EINVAL;
> +	}
> +
> +	pr_info("xcede : xcede_record_size = %d\n", xcede_record_size);
> +
> +	/*
> +	 * Since the payload_length includes the last NULL byte and
> +	 * the xcede_record_size, the remaining bytes correspond to
> +	 * array of all cede_latency settings.
> +	 */
> +	total_xcede_records_size = payload_length - 2;
> +	nr_xcede_records = total_xcede_records_size / xcede_record_size;
> +
> +	payload++;
> +	for (i = 0; i < nr_xcede_records; i++) {
> +		struct xcede_latency_records *record = &xcede_records[i];
> +
> +		record->latency_hint = (u8)payload[0];
> +		record->wakeup_latency_tb_ticks  =
> +			be64_to_cpu(*(__be64 *)(&payload[1]));
> +		record->responsive_to_irqs = (u8)payload[9];
> +		payload += xcede_record_size;
> +		pr_info("xcede : Record %d : hint = %u, latency =0x%llx tb-ticks, Wake-on-irq = %u\n",
> +			i, record->latency_hint,
> +			record->wakeup_latency_tb_ticks,
> +			record->responsive_to_irqs);
> +	}
> +
> +	return 0;
> +}
> +
>  u8 cede_latency_hint[NR_DEDICATED_STATES];
>  static int dedicated_cede_loop(struct cpuidle_device *dev,
>  				struct cpuidle_driver *drv,
> @@ -238,6 +350,19 @@ static int pseries_cpuidle_driver_init(void)
>  	return 0;
>  }
> 
> +static int add_pseries_idle_states(void)
> +{
> +	int nr_states = 2; /* By default we have snooze, CEDE */
> +
> +	if (parse_cede_parameters())
> +		return nr_states;
> +
> +	pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n",
> +		nr_xcede_records);
> +
> +	return nr_states;

More logic will be added to this function in the subsequent patches to
actually make use of the information that is obtained from the platform
firmware.

--Vaidy


^ permalink raw reply

* Re: [PATCH v6 03/23] powerpc/book3s64/pkeys: pkeys are supported only on hash on book3s.
From: Aneesh Kumar K.V @ 2020-07-20  6:15 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: linuxram, bauerman
In-Reply-To: <87h7u2u3ep.fsf@mpe.ellerman.id.au>

On 7/20/20 11:35 AM, Michael Ellerman wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> Move them to hash specific file and add BUG() for radix path.
>> ---
>>   .../powerpc/include/asm/book3s/64/hash-pkey.h | 32 ++++++++++++++++
>>   arch/powerpc/include/asm/book3s/64/pkeys.h    | 25 +++++++++++++
>>   arch/powerpc/include/asm/pkeys.h              | 37 ++++---------------
>>   3 files changed, 64 insertions(+), 30 deletions(-)
>>   create mode 100644 arch/powerpc/include/asm/book3s/64/hash-pkey.h
>>   create mode 100644 arch/powerpc/include/asm/book3s/64/pkeys.h
> 
> This isn't signed-off.
> 

Not sure how i missed that.

> I assume you meant to, so I added it.
> 


yes. Thanks for taking care of that.

-aneesh


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox