LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v2 1/3] powerpc/numa: Print debug statements only when required
From: Srikar Dronamraju @ 2021-08-26  4:47 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot, Peter Zijlstra,
	Geetika Moolchandani, Valentin Schneider, Laurent Dufour,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <87r1ehbrux.fsf@mpe.ellerman.id.au>

* Michael Ellerman <mpe@ellerman.id.au> [2021-08-25 23:01:42]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > * Laurent Dufour <ldufour@linux.ibm.com> [2021-08-23 11:21:33]:
> >> Le 21/08/2021 à 12:25, Srikar Dronamraju a écrit :
> >> > Currently, a debug message gets printed every time an attempt to
> >> > add(remove) a CPU. However this is redundant if the CPU is already added
> >> > (removed) from the node.
> >> > 
> >
> > Its a fair point.
> >
> > Michael,
> >
> > Do you want me to resend this patch with s/pr_err/pr_warn for the above
> > line?
> 
> I think what I'd prefer is if we stopped using this custom dbg() stuff
> in numa.c, and cleaned up all the messages to use pr_xxx().
> 
> Those debug statements only appear if you boot with numa=debug, which is
> not documented anywhere and I had completely forgotten existed TBH.
> 
> These days there's CONFIG_DYNAMIC_DEBUG for turning on/off messages,
> which is much more flexible.
> 
> So can we drop the numa=debug bits, and convert all the dbg()s to
> pr_debug().
> 
> And then do a pass converting all the printk("NUMA: ") to pr_xxx() which
> will get "numa:" from pr_fmt().
> 
> cheers

Okay, will do the needful.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH] powerpc: Make set_endian() return EINVAL when not supporting little endian
From: Christophe Leroy @ 2021-08-26  4:55 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <87bl5kc1os.fsf@mpe.ellerman.id.au>

Le 26/08/2021 à 05:41, Michael Ellerman a écrit :
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>> There is no point in modifying MSR_LE bit on CPUs not supporting
>> little endian.
> 
> Isn't that an ABI break?

Or an ABI fix ? I don't know.

My first thought was that all other 32 bits architectures were returning -EINVAL, but looking at the 
man page of prctl, it is explicit that this is powerpc only.

> 
> set_endian(PR_ENDIAN_BIG) should work on a big endian CPU, even if it
> does nothing useful.

Fair enough. But shouldn't in that case get_endian() return PR_ENDIAN_BIG instead of returning EINVAL ?

We can do one or the other, but I think it should at least be consistant between them, shouldn't it ?

Christophe

^ permalink raw reply

* Re: [PATCH] powerpc: Redefine HMT_xxx macros as empty on PPC32
From: Michael Ellerman @ 2021-08-26  4:57 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <c5a07fadea33d640ad10cecf0ac8faaec1c524e0.1629898474.git.christophe.leroy@csgroup.eu>

Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> HMT_xxx macros are macros for adjusting thread priority
> (hardware multi-threading) are macros inherited from PPC64
> via commit 5f7c690728ac ("[PATCH] powerpc: Merged ppc_asm.h")
>
> Those instructions are pointless on PPC32, but some common
> fonctions like arch_cpu_idle() use them.
>
> So make them empty on PPC32 to avoid those instructions.

I guess we're pretty sure no 32-bit CPUs do anything with those.

e6500 can run in 32-bit mode, and is 2-way threaded AIUI, so it's
*possible* it could use them.

But I can't find any mention of those special nops in the e6500 core
manual. And actually it does have documentation about thread priority
registers, PPR32, TPRI0/1, but it says they're not used in e6500.

So I guess this seems safe, I'll pick it up.

cheers

> diff --git a/arch/powerpc/include/asm/vdso/processor.h b/arch/powerpc/include/asm/vdso/processor.h
> index e072577bc7c0..8d79f994b4aa 100644
> --- a/arch/powerpc/include/asm/vdso/processor.h
> +++ b/arch/powerpc/include/asm/vdso/processor.h
> @@ -5,12 +5,21 @@
>  #ifndef __ASSEMBLY__
>  
>  /* Macros for adjusting thread priority (hardware multi-threading) */
> +#ifdef CONFIG_PPC64
>  #define HMT_very_low()		asm volatile("or 31, 31, 31	# very low priority")
>  #define HMT_low()		asm volatile("or 1, 1, 1	# low priority")
>  #define HMT_medium_low()	asm volatile("or 6, 6, 6	# medium low priority")
>  #define HMT_medium()		asm volatile("or 2, 2, 2	# medium priority")
>  #define HMT_medium_high()	asm volatile("or 5, 5, 5	# medium high priority")
>  #define HMT_high()		asm volatile("or 3, 3, 3	# high priority")
> +#else
> +#define HMT_very_low()
> +#define HMT_low()
> +#define HMT_medium_low()
> +#define HMT_medium()
> +#define HMT_medium_high()
> +#define HMT_high()
> +#endif
>  
>  #ifdef CONFIG_PPC64
>  #define cpu_relax()	do { HMT_low(); HMT_medium(); barrier(); } while (0)
> -- 
> 2.25.0

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto
From: Christophe Leroy @ 2021-08-26  6:37 UTC (permalink / raw)
  To: Michael Ellerman, Nathan Chancellor
  Cc: llvm, linux-kernel, clang-built-linux, Paul Mackerras,
	linuxppc-dev
In-Reply-To: <87h7fcc2m4.fsf@mpe.ellerman.id.au>



Le 26/08/2021 à 05:21, Michael Ellerman a écrit :
> Nathan Chancellor <nathan@kernel.org> writes:
>> On Tue, Apr 13, 2021 at 04:38:10PM +0000, Christophe Leroy wrote:
>>> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
>>> flexibility to GCC.
> ...
>>
>> This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
>> flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
>> klist_add_tail to trigger over and over on boot when compiling with
>> clang:

...

> 
> This patch seems to fix it. Not sure if that's just papering over it though.
> 
> diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
> index 1ee0f22313ee..75fcb4370d96 100644
> --- a/arch/powerpc/include/asm/bug.h
> +++ b/arch/powerpc/include/asm/bug.h
> @@ -119,7 +119,7 @@ __label_warn_on:						\
>   								\
>   			WARN_ENTRY(PPC_TLNEI " %4, 0",		\
>   				   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),	\
> -				   __label_warn_on, "r" (x));	\
> +				   __label_warn_on, "r" (!!(x))); \
>   			break;					\
>   __label_warn_on:						\
>   			__ret_warn_on = true;			\
> 
> 

But for a simple WARN_ON() call:

void test(unsigned long b)
{
	WARN_ON(b);
}

Without your change with GCC you get:

00000000000012d0 <.test>:
     12d0:	0b 03 00 00 	tdnei   r3,0
     12d4:	4e 80 00 20 	blr


With the !! change you get:

00000000000012d0 <.test>:
     12d0:	31 23 ff ff 	addic   r9,r3,-1
     12d4:	7d 29 19 10 	subfe   r9,r9,r3
     12d8:	0b 09 00 00 	tdnei   r9,0
     12dc:	4e 80 00 20 	blr

Christophe

^ permalink raw reply

* [PATCH v2 0/3] powerpc/smp: Misc fixes
From: Srikar Dronamraju @ 2021-08-26 10:03 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, linuxppc-dev,
	Valentin Schneider

Changelog : v1 -> v2:
v1: https://lore.kernel.org/linuxppc-dev/20210821092419.167454-1-srikar@linux.vnet.ibm.com/t/#u``
[ patch 1: Updated to use DIV_ROUND_UP instead of max to handle more situations ]
[ patch 2: updated changelog to make it more generic powerpc issue ]

The 1st patch fixes a regression which causes a crash when booted with
nr_cpus=2.

The 2nd patch fixes a regression where lscpu on PowerVM reports more
number of sockets that are available.

The 3rd patch updates the fallback when L2-cache properties are not
explicitly exposed to be the cpu_sibling_mask.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>

Srikar Dronamraju (3):
  powerpc/smp: Fix a crash while booting kvm guest with nr_cpus=2
  powerpc/smp: Update cpu_core_map on all PowerPc systems
  powerpc/smp: Enable CACHE domain for shared processor

 arch/powerpc/kernel/smp.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

-- 
2.18.2


^ permalink raw reply

* [PATCH v2 1/3] powerpc/smp: Fix a crash while booting kvm guest with nr_cpus=2
From: Srikar Dronamraju @ 2021-08-26 10:03 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Aneesh Kumar K . V,
	Valentin Schneider, linuxppc-dev, Ingo Molnar
In-Reply-To: <20210826100401.412519-1-srikar@linux.vnet.ibm.com>

Aneesh reported a crash with a fairly recent upstream kernel when
booting kernel whose commandline was appended with nr_cpus=2

1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c000000008a67bd0]
    pc: c00000000002557c: cpu_to_chip_id+0x3c/0x100
    lr: c000000000058380: start_secondary+0x460/0xb00
    sp: c000000008a67e70
   msr: 8000000000001033
   dar: 10
 dsisr: 80000
  current = 0xc00000000891bb00
  paca    = 0xc0000018ff981f80   irqmask: 0x03   irq_happened: 0x01
    pid   = 0, comm = swapper/1
Linux version 5.13.0-rc3-15704-ga050a6d2b7e8 (kvaneesh@ltc-boston8) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #433 SMP Tue May 25 02:38:49 CDT 2021
1:mon> t
[link register   ] c000000000058380 start_secondary+0x460/0xb00
[c000000008a67e70] c000000008a67eb0 (unreliable)
[c000000008a67eb0] c0000000000589d4 start_secondary+0xab4/0xb00
[c000000008a67f90] c00000000000c654 start_secondary_prolog+0x10/0x14

Current code assumes that num_possible_cpus() is always greater than
threads_per_core. However this may not be true when using nr_cpus=2 or
similar options. Handle the case where num_possible_cpus() is not an
exact multiple of  threads_per_core.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: c1e53367dab1 ("powerpc/smp: Cache CPU to chip lookup")
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Debugged-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog: v1 -> v2:
v1: - https://lore.kernel.org/linuxppc-dev/20210821092419.167454-2-srikar@linux.vnet.ibm.com/t/#u
Handled comment from Gautham Shenoy
[ Updated to use DIV_ROUND_UP instead of max to handle more situations ]

 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6c6e4d934d86..bf11b3c4eb28 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1074,7 +1074,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 	}
 
 	if (cpu_to_chip_id(boot_cpuid) != -1) {
-		int idx = num_possible_cpus() / threads_per_core;
+		int idx = DIV_ROUND_UP(num_possible_cpus(), threads_per_core);
 
 		/*
 		 * All threads of a core will all belong to the same core,
-- 
2.18.2


^ permalink raw reply related

* [PATCH v2 2/3] powerpc/smp: Update cpu_core_map on all PowerPc systems
From: Srikar Dronamraju @ 2021-08-26 10:04 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Aneesh Kumar K . V,
	Valentin Schneider, linuxppc-dev, Ingo Molnar
In-Reply-To: <20210826100401.412519-1-srikar@linux.vnet.ibm.com>

lscpu() uses core_siblings to list the number of sockets in the
system. core_siblings is set using topology_core_cpumask.

While optimizing the powerpc bootup path, Commit 4ca234a9cbd7
("powerpc/smp: Stop updating cpu_core_mask").  it was found that
updating cpu_core_mask() ended up taking a lot of time. It was thought
that on Powerpc, cpu_core_mask() would always be same as
cpu_cpu_mask() i.e number of sockets will always be equal to number of
nodes. As an optimization, cpu_core_mask() was made a snapshot of
cpu_cpu_mask().

However that was found to be false with PowerPc KVM guests, where each
node could have more than one socket. So with Commit c47f892d7aa6
("powerpc/smp: Reintroduce cpu_core_mask"), cpu_core_mask was updated
based on chip_id but in an optimized way using some mask manipulations
and chip_id caching.

However on non-PowerNV and non-pseries KVM guests (i.e not
implementing cpu_to_chip_id(), continued to use a copy of
cpu_cpu_mask().

There are two issues that were noticed on such systems
1. lscpu would report one extra socket.
On a IBM,9009-42A (aka zz system) which has only 2 chips/ sockets/
nodes, lscpu would report
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              160
On-line CPU(s) list: 0-159
Thread(s) per core:  8
Core(s) per socket:  6
Socket(s):           3                <--------------
NUMA node(s):        2
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

2. Currently cpu_cpu_mask is updated when a core is
added/removed. However its not updated when smt mode switching or on
CPUs are explicitly offlined. However all other percpu masks are
updated to ensure only active/online CPUs are in the masks.
This results in build_sched_domain traces since there will be CPUs in
cpu_cpu_mask() but those CPUs are not present in SMT / CACHE / MC /
NUMA domains. A loop of threads running smt mode switching and core
add/remove will soon show this trace.
Hence cpu_cpu_mask has to be update at smt mode switch.

This will have impact on cpu_core_mask(). cpu_core_mask() is a
snapshot of cpu_cpu_mask. Different CPUs within the same socket will
end up having different cpu_core_masks since they are snapshots at
different points of time. This means when lscpu will start reporting
many more sockets than the actual number of sockets/ nodes / chips.

Different ways to handle this problem:
A. Update the snapshot aka cpu_core_mask for all CPUs whenever
   cpu_cpu_mask is updated. This would a non-optimal solution.
B. Instead of a cpumask_var_t, make cpu_core_map a cpumask pointer
   pointing to cpu_cpu_mask. However percpu cpumask pointer is frowned
   upon and we need a clean way to handle PowerPc KVM guest which is
   not a snapshot.
C. Update cpu_core_masks all PowerPc systems like in PowerPc KVM
guests using mask manipulations. This approach is relatively simple
and unifies with the existing code.
D. On top of 3, we could also resurrect get_physical_package_id which
   could return a nid for the said CPU. However this is not needed at this
   time.

Option C is the preferred approach for now.

While this is somewhat a revert of Commit 4ca234a9cbd7 ("powerpc/smp:
Stop updating cpu_core_mask").

1. Plain revert has some conflicts
2. For chip_id == -1, the cpu_core_mask is made identical to
cpu_cpu_mask, unlike previously where cpu_core_mask was set to a core
if chip_id doesn't exist.

This goes by the principle that if chip_id is not exposed, then
sockets / chip / node share the same set of CPUs.

With the fix, lscpu o/p would be
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              160
On-line CPU(s) list: 0-159
Thread(s) per core:  8
Core(s) per socket:  6
Socket(s):           2                     <--------------
NUMA node(s):        2
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: 4ca234a9cbd7 ("powerpc/smp: Stop updating cpu_core_mask")
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog : v1 -> v2:
v1:https://lore.kernel.org/linuxppc-dev/20210821092419.167454-3-srikar@linux.vnet.ibm.com/t/#u
Handled comments from Michael Ellerman
[ updated changelog to make it more generic powerpc issue ]

 arch/powerpc/kernel/smp.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index bf11b3c4eb28..b5bd5a4708d0 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1492,6 +1492,7 @@ static void add_cpu_to_masks(int cpu)
 	 * add it to it's own thread sibling mask.
 	 */
 	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
+	cpumask_set_cpu(cpu, cpu_core_mask(cpu));

 	for (i = first_thread; i < first_thread + threads_per_core; i++)
 		if (cpu_online(i))
@@ -1509,11 +1510,6 @@ static void add_cpu_to_masks(int cpu)
 	if (chip_id_lookup_table && ret)
 		chip_id = cpu_to_chip_id(cpu);

-	if (chip_id == -1) {
-		cpumask_copy(per_cpu(cpu_core_map, cpu), cpu_cpu_mask(cpu));
-		goto out;
-	}
-
 	if (shared_caches)
 		submask_fn = cpu_l2_cache_mask;

@@ -1523,6 +1519,10 @@ static void add_cpu_to_masks(int cpu)
 	/* Skip all CPUs already part of current CPU core mask */
 	cpumask_andnot(mask, cpu_online_mask, cpu_core_mask(cpu));

+	/* If chip_id is -1; limit the cpu_core_mask to within DIE*/
+	if (chip_id == -1)
+		cpumask_and(mask, mask, cpu_cpu_mask(cpu));
+
 	for_each_cpu(i, mask) {
 		if (chip_id == cpu_to_chip_id(i)) {
 			or_cpumasks_related(cpu, i, submask_fn, cpu_core_mask);
@@ -1532,7 +1532,6 @@ static void add_cpu_to_masks(int cpu)
 		}
 	}

-out:
 	free_cpumask_var(mask);
 }

-- 
2.18.2

^ permalink raw reply related

* [PATCH v2 3/3] powerpc/smp: Enable CACHE domain for shared processor
From: Srikar Dronamraju @ 2021-08-26 10:04 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, linuxppc-dev,
	Valentin Schneider
In-Reply-To: <20210826100401.412519-1-srikar@linux.vnet.ibm.com>

Currently CACHE domain is not enabled on shared processor mode PowerVM
LPARS. On PowerVM systems, 'ibm,thread-group' device-tree property 2
under cpu-device-node indicates which all CPUs share L2-cache. However
'ibm,thread-group' device-tree property 2 is a relatively new property.
In absence of 'ibm,thread-group' property 2, 'l2-cache' device property
under cpu-device-node could help system to identify CPUs sharing L2-cache.
However this property is not exposed by PhyP in shared processor mode
configurations.

In absence of properties that inform OS about which CPUs share L2-cache,
fallback on core boundary.

Here are some stats from Power9 shared LPAR with the changes.

$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           3
NUMA node(s):        2
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   16-23
NUMA node1 CPU(s):   0-15,24-31
Physical sockets:    2
Physical chips:      1
Physical cores/chip: 10

Before patch
$ grep -r . /sys/kernel/debug/sched/domains/cpu0/domain*/name
Before
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA

After
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA

$  awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/00000000,//g'
Before
domain0 00000055
domain0 000000aa
domain0 00005500
domain0 0000aa00
domain0 00550000
domain0 00aa0000
domain0 55000000
domain0 aa000000
domain1 00ff0000
domain1 ff00ffff
domain2 ffffffff

After
domain0 00000055
domain0 000000aa
domain0 00005500
domain0 0000aa00
domain0 00550000
domain0 00aa0000
domain0 55000000
domain0 aa000000
domain1 000000ff
domain1 0000ff00
domain1 00ff0000
domain1 ff000000
domain2 ff00ffff
domain2 ffffffff
domain3 ffffffff

(Lower is better)
perf stat -a -r 5 -n perf bench sched pipe  | tail -n 2
Before
           153.798 +- 0.142 seconds time elapsed  ( +-  0.09% )

After
           111.545 +- 0.652 seconds time elapsed  ( +-  0.58% )

which is an improvement of 27.47%

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index b5bd5a4708d0..b31b8ca3ae2e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1365,7 +1365,7 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
 	l2_cache = cpu_to_l2cache(cpu);
 	if (!l2_cache || !*mask) {
 		/* Assume only core siblings share cache with this CPU */
-		for_each_cpu(i, submask_fn(cpu))
+		for_each_cpu(i, cpu_sibling_mask(cpu))
 			set_cpus_related(cpu, i, cpu_l2_cache_mask);
 
 		return false;
-- 
2.18.2


^ permalink raw reply related

* [PATCH v3 0/5] Updates to powerpc for robust CPU online/offline
From: Srikar Dronamraju @ 2021-08-26 10:05 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider

Changelog v2 -> v3:
v2: https://lore.kernel.org/linuxppc-dev/20210821102535.169643-1-srikar@linux.vnet.ibm.com/t/#u
Add patch 1: to drop dbg and numa=debug (Suggested by Michael Ellerman)
Add patch 2: to convert printk to pr_xxx (Suggested by Michael Ellerman)
	Use pr_warn instead of pr_debug(WARNING) (Suggested by Laurent)

Changelog v1 -> v2:
Moved patch to this series: powerpc/numa: Fill distance_lookup_table for offline nodes
fixed a missing prototype warning

Scheduler expects unique number of node distances to be available
at boot. It uses node distance to calculate this unique node
distances. On Power Servers, node distances for offline nodes is not
available. However, Power Servers already knows unique possible node
distances. Fake the offline node's distance_lookup_table entries so
that all possible node distances are updated.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However the same system when only two nodes are online at boot, then
distance info from numactl will look like
node distances:
node   0   1
  0:  10  20
  1:  20  10

With the faked numa distance at boot, the node distance table will look
like
node   0   1   2
  0:  10  20  40
  1:  20  10  40
  2:  40  40  10

The actual distance will be populated once the nodes are onlined.

Also when simultaneously running CPU online/offline with CPU
add/remove in a loop, we see a WARNING messages.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898 build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag
raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink
pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs
libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth
dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c0000000001caac8 LR: c0000000001caac4 CTR: 00000000007088ec
REGS: c00000005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48828222  XER: 00000009
CFAR: c0000000001ea698 IRQMASK: 0
GPR00: c0000000001caac4 c00000005596f4c0 c000000001c4a400 0000000000000036
GPR04: 00000000fffdffff c00000005596f1d0 0000000000000027 c0000018cfd07f90
GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000018fe68ffe8
GPR12: 0000000000008000 c00000001e9d1880 c00000013a047200 0000000000000800
GPR16: c000000001d3c7d0 0000000000000240 0000000000000048 c000000010aacd18
GPR20: 0000000000000001 c000000010aacc18 c00000013a047c00 c000000139ec2400
GPR24: 0000000000000280 c000000139ec2520 c000000136c1b400 c000000001c93060
GPR28: c00000013a047c20 c000000001d3c6c0 c000000001c978a0 000000000000000d
NIP [c0000000001caac8] build_sched_domains+0xd48/0x1720
LR [c0000000001caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c00000005596f4c0] [c0000000001caac4] build_sched_domains+0xd44/0x1720 (unreliable)
[c00000005596f670] [c0000000001cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c00000005596f710] [c0000000002804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c00000005596f810] [c000000000283e60] rebuild_sched_domains+0x40/0x70
[c00000005596f840] [c000000000284124] cpuset_hotplug_workfn+0x294/0xf10
[c00000005596fc60] [c000000000175040] process_one_work+0x290/0x590
[c00000005596fd00] [c0000000001753c8] worker_thread+0x88/0x620
[c00000005596fda0] [c000000000181704] kthread+0x194/0x1a0
[c00000005596fe10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 60000000 2fa30800 409e0028 80fe0000 e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 60000000 <0fe00000> 39000000 38e00000 38c00000

This was because cpu_cpu_mask() was not getting updated on CPU
online/offline but would be only updated when add/remove of CPUs.
Other cpumasks get updated both on CPU online/offline and add/remove
Update cpu_cpu_mask() on CPU online/offline too.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>

Srikar Dronamraju (5):
  powerpc/numa: Drop dbg in favour of pr_debug
  powerpc/numa: convert printk to pr_xxx
  powerpc/numa: Print debug statements only when required
  powerpc/numa: Update cpu_cpu_map on CPU online/offline
  powerpc/numa: Fill distance_lookup_table for offline nodes

 arch/powerpc/include/asm/topology.h |  12 +++
 arch/powerpc/kernel/smp.c           |   3 +
 arch/powerpc/mm/numa.c              | 120 ++++++++++++++++++++--------
 3 files changed, 103 insertions(+), 32 deletions(-)

-- 
2.18.2


^ permalink raw reply

* [PATCH v5 2/5] powerpc/numa: convert printk to pr_xxx
From: Srikar Dronamraju @ 2021-08-26 10:05 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider
In-Reply-To: <20210826100521.412639-1-srikar@linux.vnet.ibm.com>

Convert the remaining printk to pr_xxx
One advantage would be all prints will now have prefix "numa:" from
pr_fmt().

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
[ convert printk(KERN_ERR) to pr_warn : Suggested by Laurent Dufour ]
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5e9b777a1151..9af38b1c618b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -154,8 +154,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 	if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
 		cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
 	} else {
-		printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
-		       cpu, node);
+		pr_warn("WARNING: cpu %lu not found in node %d\n", cpu, node);
 	}
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
@@ -326,8 +325,7 @@ static int __init find_min_common_depth(void)
 		depth = of_read_number(distance_ref_points, 1);
 	} else {
 		if (distance_ref_points_depth < 2) {
-			printk(KERN_WARNING "NUMA: "
-				"short ibm,associativity-reference-points\n");
+			pr_warn("short ibm,associativity-reference-points\n");
 			goto err;
 		}
 
@@ -339,7 +337,7 @@ static int __init find_min_common_depth(void)
 	 * MAX_DISTANCE_REF_POINTS domains.
 	 */
 	if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
-		printk(KERN_WARNING "NUMA: distance array capped at "
+		pr_warn("distance array capped at "
 			"%d entries\n", MAX_DISTANCE_REF_POINTS);
 		distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
 	}
@@ -701,7 +699,7 @@ static int __init parse_numa_properties(void)
 	unsigned long i;
 
 	if (numa_enabled == 0) {
-		printk(KERN_WARNING "NUMA disabled by user\n");
+		pr_warn("disabled by user\n");
 		return -1;
 	}
 
@@ -716,7 +714,7 @@ static int __init parse_numa_properties(void)
 		return min_common_depth;
 	}
 
-	pr_debug("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+	pr_debug("associativity depth for CPU/Memory: %d\n", min_common_depth);
 
 	/*
 	 * Even though we connect cpus to numa domains later in SMP
@@ -808,10 +806,8 @@ static void __init setup_nonnuma(void)
 	unsigned int nid = 0;
 	int i;
 
-	printk(KERN_DEBUG "Top of RAM: 0x%lx, Total RAM: 0x%lx\n",
-	       top_of_ram, total_ram);
-	printk(KERN_DEBUG "Memory hole size: %ldMB\n",
-	       (top_of_ram - total_ram) >> 20);
+	pr_debug("Top of RAM: 0x%lx, Total RAM: 0x%lx\n", top_of_ram, total_ram);
+	pr_debug("Memory hole size: %ldMB\n", (top_of_ram - total_ram) >> 20);
 
 	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
 		fake_numa_create_new_node(end_pfn, &nid);
-- 
2.18.2


^ permalink raw reply related

* [PATCH v3 3/5] powerpc/numa: Print debug statements only when required
From: Srikar Dronamraju @ 2021-08-26 10:05 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider
In-Reply-To: <20210826100521.412639-1-srikar@linux.vnet.ibm.com>

Currently, a debug message gets printed every time an attempt to
add(remove) a CPU. However this is redundant if the CPU is already added
(removed) from the node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9af38b1c618b..6655ecdeddef 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -138,10 +138,10 @@ static void map_cpu_to_node(int cpu, int node)
 {
 	update_numa_cpu_lookup_table(cpu, node);
 
-	pr_debug("adding cpu %d to node %d\n", cpu, node);
-
-	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
+	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
+		pr_debug("adding cpu %d to node %d\n", cpu, node);
 		cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
+	}
 }
 
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
@@ -149,10 +149,9 @@ static void unmap_cpu_from_node(unsigned long cpu)
 {
 	int node = numa_cpu_lookup_table[cpu];
 
-	pr_debug("removing cpu %lu from node %d\n", cpu, node);
-
 	if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
 		cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
+		pr_debug("removing cpu %lu from node %d\n", cpu, node);
 	} else {
 		pr_warn("WARNING: cpu %lu not found in node %d\n", cpu, node);
 	}
-- 
2.18.2


^ permalink raw reply related

* [PATCH v3 4/5] powerpc/numa: Update cpu_cpu_map on CPU online/offline
From: Srikar Dronamraju @ 2021-08-26 10:05 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider
In-Reply-To: <20210826100521.412639-1-srikar@linux.vnet.ibm.com>

cpu_cpu_map holds all the CPUs in the DIE. However in PowerPC, when
onlining/offlining of CPUs, this mask doesn't get updated.  This mask
is however updated when CPUs are added/removed. So when both
operations like online/offline of CPUs and adding/removing of CPUs are
done simultaneously, then cpumaps end up broken.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag
udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag
bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq
uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg
ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash
dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c0000000001caac8 LR: c0000000001caac4 CTR: 00000000007088ec
REGS: c00000005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48828222  XER:
00000009
CFAR: c0000000001ea698 IRQMASK: 0
GPR00: c0000000001caac4 c00000005596f4c0 c000000001c4a400 0000000000000036
GPR04: 00000000fffdffff c00000005596f1d0 0000000000000027 c0000018cfd07f90
GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000018fe68ffe8
GPR12: 0000000000008000 c00000001e9d1880 c00000013a047200 0000000000000800
GPR16: c000000001d3c7d0 0000000000000240 0000000000000048 c000000010aacd18
GPR20: 0000000000000001 c000000010aacc18 c00000013a047c00 c000000139ec2400
GPR24: 0000000000000280 c000000139ec2520 c000000136c1b400 c000000001c93060
GPR28: c00000013a047c20 c000000001d3c6c0 c000000001c978a0 000000000000000d
NIP [c0000000001caac8] build_sched_domains+0xd48/0x1720
LR [c0000000001caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c00000005596f4c0] [c0000000001caac4] build_sched_domains+0xd44/0x1720 (unreliable)
[c00000005596f670] [c0000000001cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c00000005596f710] [c0000000002804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c00000005596f810] [c000000000283e60] rebuild_sched_domains+0x40/0x70
[c00000005596f840] [c000000000284124] cpuset_hotplug_workfn+0x294/0xf10
[c00000005596fc60] [c000000000175040] process_one_work+0x290/0x590
[c00000005596fd00] [c0000000001753c8] worker_thread+0x88/0x620
[c00000005596fda0] [c000000000181704] kthread+0x194/0x1a0
[c00000005596fe10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 60000000 2fa30800 409e0028 80fe0000 e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 60000000 <0fe00000> 39000000 38e00000 38c00000

Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU
online/offline.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/topology.h | 12 ++++++++++++
 arch/powerpc/kernel/smp.c           |  3 +++
 arch/powerpc/mm/numa.c              |  7 ++-----
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..2f0a4d7b95f6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -65,6 +65,11 @@ static inline int early_cpu_to_node(int cpu)
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
 
+extern void map_cpu_to_node(int cpu, int node);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void unmap_cpu_from_node(unsigned long cpu);
+#endif /* CONFIG_HOTPLUG_CPU */
+
 #else
 
 static inline int early_cpu_to_node(int cpu) { return 0; }
@@ -93,6 +98,13 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 	return first_online_node;
 }
 
+#ifdef CONFIG_SMP
+static inline void map_cpu_to_node(int cpu, int node) {}
+#ifdef CONFIG_HOTPLUG_CPU
+static inline void unmap_cpu_from_node(unsigned long cpu) {}
+#endif /* CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_SMP */
+
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index b31b8ca3ae2e..d947a4fd753c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1407,6 +1407,8 @@ static void remove_cpu_from_masks(int cpu)
 	struct cpumask *(*mask_fn)(int) = cpu_sibling_mask;
 	int i;
 
+	unmap_cpu_from_node(cpu);
+
 	if (shared_caches)
 		mask_fn = cpu_l2_cache_mask;
 
@@ -1491,6 +1493,7 @@ static void add_cpu_to_masks(int cpu)
 	 * This CPU will not be in the online mask yet so we need to manually
 	 * add it to it's own thread sibling mask.
 	 */
+	map_cpu_to_node(cpu, cpu_to_node(cpu));
 	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
 	cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 6655ecdeddef..87ade2f56f45 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -134,7 +134,7 @@ static void reset_numa_cpu_lookup_table(void)
 		numa_cpu_lookup_table[cpu] = -1;
 }
 
-static void map_cpu_to_node(int cpu, int node)
+void map_cpu_to_node(int cpu, int node)
 {
 	update_numa_cpu_lookup_table(cpu, node);
 
@@ -145,7 +145,7 @@ static void map_cpu_to_node(int cpu, int node)
 }
 
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
-static void unmap_cpu_from_node(unsigned long cpu)
+void unmap_cpu_from_node(unsigned long cpu)
 {
 	int node = numa_cpu_lookup_table[cpu];
 
@@ -592,9 +592,6 @@ static int ppc_numa_cpu_prepare(unsigned int cpu)
 
 static int ppc_numa_cpu_dead(unsigned int cpu)
 {
-#ifdef CONFIG_HOTPLUG_CPU
-	unmap_cpu_from_node(cpu);
-#endif
 	return 0;
 }
 
-- 
2.18.2


^ permalink raw reply related

* [PATCH v3 5/5] powerpc/numa: Fill distance_lookup_table for offline nodes
From: Srikar Dronamraju @ 2021-08-26 10:05 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider,
	kernel test robot
In-Reply-To: <20210826100521.412639-1-srikar@linux.vnet.ibm.com>

Scheduler expects unique number of node distances to be available at
boot. It uses node distance to calculate this unique node distances.
On POWER, node distances for offline nodes is not available. However,
POWER already knows unique possible node distances. Fake the offline
node's distance_lookup_table entries so that all possible node
distances are updated.

However this only needs to be done if the number of unique node
distances that can be computed for online nodes is less than the
number of possible unique node distances as represented by
distance_ref_points_depth. When the node is actually onlined,
distance_lookup_table will be updated with actual entries.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 70 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

Changelog:
v1: https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-srikar@linux.vnet.ibm.com/t/#u
[ Fixed a missing prototype warning Reported-by: kernel test robot <lkp@intel.com>]

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 87ade2f56f45..afa2ede4ac53 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -849,6 +849,75 @@ void __init dump_numa_cpu_topology(void)
 	}
 }
 
+/*
+ * Scheduler expects unique number of node distances to be available at
+ * boot. It uses node distance to calculate this unique node distances. On
+ * POWER, node distances for offline nodes is not available. However, POWER
+ * already knows unique possible node distances. Fake the offline node's
+ * distance_lookup_table entries so that all possible node distances are
+ * updated.
+ */
+static void __init fake_update_distance_lookup_table(void)
+{
+	unsigned long distance_map;
+	int i, nr_levels, nr_depth, node;
+
+	if (!numa_enabled)
+		return;
+
+	if (!form1_affinity)
+		return;
+
+	/*
+	 * distance_ref_points_depth lists the unique numa domains
+	 * available. However it ignore LOCAL_DISTANCE. So add +1
+	 * to get the actual number of unique distances.
+	 */
+	nr_depth = distance_ref_points_depth + 1;
+
+	WARN_ON(nr_depth > sizeof(distance_map));
+
+	bitmap_zero(&distance_map, nr_depth);
+	bitmap_set(&distance_map, 0, 1);
+
+	for_each_online_node(node) {
+		int nd, distance = LOCAL_DISTANCE;
+
+		if (node == first_online_node)
+			continue;
+
+		nd = __node_distance(node, first_online_node);
+		for (i = 0; i < nr_depth; i++, distance *= 2) {
+			if (distance == nd) {
+				bitmap_set(&distance_map, i, 1);
+				break;
+			}
+		}
+		nr_levels = bitmap_weight(&distance_map, nr_depth);
+		if (nr_levels == nr_depth)
+			return;
+	}
+
+	for_each_node(node) {
+		if (node_online(node))
+			continue;
+
+		i = find_first_zero_bit(&distance_map, nr_depth);
+		if (i >= nr_depth || i == 0) {
+			pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
+			return;
+		}
+
+		bitmap_set(&distance_map, i, 1);
+		while (i--)
+			distance_lookup_table[node][i] = node;
+
+		nr_levels = bitmap_weight(&distance_map, nr_depth);
+		if (nr_levels == nr_depth)
+			return;
+	}
+}
+
 /* Initialize NODE_DATA for a node on the local memory */
 static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
@@ -964,6 +1033,7 @@ void __init mem_topology_setup(void)
 		 */
 		numa_setup_cpu(cpu);
 	}
+	fake_update_distance_lookup_table();
 }
 
 void __init initmem_init(void)
-- 
2.18.2


^ permalink raw reply related

* [PATCH v5 1/5] powerpc/numa: Drop dbg in favour of pr_debug
From: Srikar Dronamraju @ 2021-08-26 10:05 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider
In-Reply-To: <20210826100521.412639-1-srikar@linux.vnet.ibm.com>

PowerPc supported numa=debug which is not documented. This option was
used to print early debug output. However something more flexible can be
achieved by using CONFIG_DYNAMIC_DEBUG.

Hence drop dbg (and numa=debug) in favour of pr_debug

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..5e9b777a1151 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -40,9 +40,6 @@ static int numa_enabled = 1;
 
 static char *cmdline __initdata;
 
-static int numa_debug;
-#define dbg(args...) if (numa_debug) { printk(KERN_INFO args); }
-
 int numa_cpu_lookup_table[NR_CPUS];
 cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
 struct pglist_data *node_data[MAX_NUMNODES];
@@ -79,7 +76,7 @@ static void __init setup_node_to_cpumask_map(void)
 		alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
 
 	/* cpumask_of_node() will now work */
-	dbg("Node to cpumask map for %u nodes\n", nr_node_ids);
+	pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
 }
 
 static int __init fake_numa_create_new_node(unsigned long end_pfn,
@@ -123,7 +120,7 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
 		cmdline = p;
 		fake_nid++;
 		*nid = fake_nid;
-		dbg("created new fake_node with id %d\n", fake_nid);
+		pr_debug("created new fake_node with id %d\n", fake_nid);
 		return 1;
 	}
 	return 0;
@@ -141,7 +138,7 @@ static void map_cpu_to_node(int cpu, int node)
 {
 	update_numa_cpu_lookup_table(cpu, node);
 
-	dbg("adding cpu %d to node %d\n", cpu, node);
+	pr_debug("adding cpu %d to node %d\n", cpu, node);
 
 	if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
 		cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
@@ -152,7 +149,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 {
 	int node = numa_cpu_lookup_table[cpu];
 
-	dbg("removing cpu %lu from node %d\n", cpu, node);
+	pr_debug("removing cpu %lu from node %d\n", cpu, node);
 
 	if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
 		cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
@@ -313,7 +310,7 @@ static int __init find_min_common_depth(void)
 					&distance_ref_points_depth);
 
 	if (!distance_ref_points) {
-		dbg("NUMA: ibm,associativity-reference-points not found.\n");
+		pr_debug("NUMA: ibm,associativity-reference-points not found.\n");
 		goto err;
 	}
 
@@ -321,7 +318,7 @@ static int __init find_min_common_depth(void)
 
 	if (firmware_has_feature(FW_FEATURE_OPAL) ||
 	    firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
-		dbg("Using form 1 affinity\n");
+		pr_debug("Using form 1 affinity\n");
 		form1_affinity = 1;
 	}
 
@@ -719,7 +716,7 @@ static int __init parse_numa_properties(void)
 		return min_common_depth;
 	}
 
-	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+	pr_debug("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
 
 	/*
 	 * Even though we connect cpus to numa domains later in SMP
@@ -1014,9 +1011,6 @@ static int __init early_numa(char *p)
 	if (strstr(p, "off"))
 		numa_enabled = 0;
 
-	if (strstr(p, "debug"))
-		numa_debug = 1;
-
 	p = strstr(p, "fake=");
 	if (p)
 		cmdline = p + strlen("fake=");
@@ -1179,7 +1173,7 @@ static long vphn_get_associativity(unsigned long cpu,
 
 	switch (rc) {
 	case H_SUCCESS:
-		dbg("VPHN hcall succeeded. Reset polling...\n");
+		pr_debug("VPHN hcall succeeded. Reset polling...\n");
 		goto out;
 
 	case H_FUNCTION:
-- 
2.18.2


^ permalink raw reply related

* [PATCH 1/2] ASoC: fsl_rpmsg: add soc specific data structure
From: Shengjiu Wang @ 2021-08-26 10:57 UTC (permalink / raw)
  To: timur, nicoleotsuka, Xiubo.Lee, festevam, broonie, perex, tiwai,
	alsa-devel
  Cc: linuxppc-dev, linux-kernel

Each platform has different supported rates and
formats, so add soc specific data for each platform.
This soc specific data is attached with compatible string.

Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
---
 sound/soc/fsl/fsl_rpmsg.c | 47 +++++++++++++++++++++++++++++++++++----
 sound/soc/fsl/fsl_rpmsg.h | 12 ++++++++++
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/sound/soc/fsl/fsl_rpmsg.c b/sound/soc/fsl/fsl_rpmsg.c
index d60f4dac6c1b..bda6cc96071d 100644
--- a/sound/soc/fsl/fsl_rpmsg.c
+++ b/sound/soc/fsl/fsl_rpmsg.c
@@ -138,11 +138,42 @@ static const struct snd_soc_component_driver fsl_component = {
 	.name           = "fsl-rpmsg",
 };
 
+static const struct fsl_rpmsg_soc_data imx7ulp_data = {
+	.rates = SNDRV_PCM_RATE_8000 | SNDRV_PCM_RATE_16000 |
+		 SNDRV_PCM_RATE_48000,
+	.formats = SNDRV_PCM_FMTBIT_S16_LE,
+};
+
+static const struct fsl_rpmsg_soc_data imx8mm_data = {
+	.rates = SNDRV_PCM_RATE_KNOT,
+	.formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+		   SNDRV_PCM_FMTBIT_S32_LE | SNDRV_PCM_FMTBIT_DSD_U8 |
+		   SNDRV_PCM_FMTBIT_DSD_U16_LE | SNDRV_PCM_FMTBIT_DSD_U32_LE,
+};
+
+static const struct fsl_rpmsg_soc_data imx8mn_data = {
+	.rates = SNDRV_PCM_RATE_32000 | SNDRV_PCM_RATE_44100 |
+		 SNDRV_PCM_RATE_48000 | SNDRV_PCM_RATE_88200 |
+		 SNDRV_PCM_RATE_96000 | SNDRV_PCM_RATE_176400 |
+		 SNDRV_PCM_RATE_192000,
+	.formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+		   SNDRV_PCM_FMTBIT_S32_LE,
+};
+
+static const struct fsl_rpmsg_soc_data imx8mp_data = {
+	.rates = SNDRV_PCM_RATE_32000 | SNDRV_PCM_RATE_44100 |
+		 SNDRV_PCM_RATE_48000 | SNDRV_PCM_RATE_88200 |
+		 SNDRV_PCM_RATE_96000 | SNDRV_PCM_RATE_176400 |
+		 SNDRV_PCM_RATE_192000,
+	.formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+		   SNDRV_PCM_FMTBIT_S32_LE,
+};
+
 static const struct of_device_id fsl_rpmsg_ids[] = {
-	{ .compatible = "fsl,imx7ulp-rpmsg-audio"},
-	{ .compatible = "fsl,imx8mm-rpmsg-audio"},
-	{ .compatible = "fsl,imx8mn-rpmsg-audio"},
-	{ .compatible = "fsl,imx8mp-rpmsg-audio"},
+	{ .compatible = "fsl,imx7ulp-rpmsg-audio", .data = &imx7ulp_data},
+	{ .compatible = "fsl,imx8mm-rpmsg-audio", .data = &imx8mm_data},
+	{ .compatible = "fsl,imx8mn-rpmsg-audio", .data = &imx8mn_data},
+	{ .compatible = "fsl,imx8mp-rpmsg-audio", .data = &imx8mp_data},
 	{ /* sentinel */ }
 };
 MODULE_DEVICE_TABLE(of, fsl_rpmsg_ids);
@@ -157,6 +188,14 @@ static int fsl_rpmsg_probe(struct platform_device *pdev)
 	if (!rpmsg)
 		return -ENOMEM;
 
+	rpmsg->soc_data = of_device_get_match_data(&pdev->dev);
+	if (rpmsg->soc_data) {
+		fsl_rpmsg_dai.playback.rates = rpmsg->soc_data->rates;
+		fsl_rpmsg_dai.capture.rates = rpmsg->soc_data->rates;
+		fsl_rpmsg_dai.playback.formats = rpmsg->soc_data->formats;
+		fsl_rpmsg_dai.capture.formats = rpmsg->soc_data->formats;
+	}
+
 	if (of_property_read_bool(np, "fsl,enable-lpa")) {
 		rpmsg->enable_lpa = 1;
 		rpmsg->buffer_size = LPA_LARGE_BUFFER_SIZE;
diff --git a/sound/soc/fsl/fsl_rpmsg.h b/sound/soc/fsl/fsl_rpmsg.h
index 4f5b49eb18d8..b04086fbf828 100644
--- a/sound/soc/fsl/fsl_rpmsg.h
+++ b/sound/soc/fsl/fsl_rpmsg.h
@@ -6,6 +6,16 @@
 #ifndef __FSL_RPMSG_H
 #define __FSL_RPMSG_H
 
+/*
+ * struct fsl_rpmsg_soc_data
+ * @rates: supported rates
+ * @formats: supported formats
+ */
+struct fsl_rpmsg_soc_data {
+	int rates;
+	u64 formats;
+};
+
 /*
  * struct fsl_rpmsg - rpmsg private data
  *
@@ -15,6 +25,7 @@
  * @pll8k: parent clock for multiple of 8kHz frequency
  * @pll11k: parent clock for multiple of 11kHz frequency
  * @card_pdev: Platform_device pointer to register a sound card
+ * @soc_data: soc specific data
  * @mclk_streams: Active streams that are using baudclk
  * @force_lpa: force enable low power audio routine if condition satisfy
  * @enable_lpa: enable low power audio routine according to dts setting
@@ -27,6 +38,7 @@ struct fsl_rpmsg {
 	struct clk *pll8k;
 	struct clk *pll11k;
 	struct platform_device *card_pdev;
+	const struct fsl_rpmsg_soc_data *soc_data;
 	unsigned int mclk_streams;
 	int force_lpa;
 	int enable_lpa;
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH 1/2] ASoC: fsl_rpmsg: add soc specific data structure
From: Fabio Estevam @ 2021-08-26 11:52 UTC (permalink / raw)
  To: Shengjiu Wang
  Cc: Linux-ALSA, Timur Tabi, Xiubo Li, linux-kernel, Takashi Iwai,
	Jaroslav Kysela, Nicolin Chen, Mark Brown, linuxppc-dev
In-Reply-To: <1629975460-17990-1-git-send-email-shengjiu.wang@nxp.com>

Hi Shengjiu,

On Thu, Aug 26, 2021 at 8:19 AM Shengjiu Wang <shengjiu.wang@nxp.com> wrote:

> +       rpmsg->soc_data = of_device_get_match_data(&pdev->dev);
> +       if (rpmsg->soc_data) {

This check is not necessary, because rpmsg->soc_data is always non-NULL.

Other than that:

Reviewed-by: Fabio Estevam <festevam@gmail.com>

^ permalink raw reply

* [PATCH 0/3] powerpc/microwatt: Device tree and config updates
From: Joel Stanley @ 2021-08-26 12:26 UTC (permalink / raw)
  To: Michael Ellerman, Paul Mackerras, Anton Blanchard,
	Michael Neuling
  Cc: linuxppc-dev

This enables the liteeth network device for microwatt which will be
merged in v5.15.

It also turns on some options so the microwatt defconfig can be used to
boot a userspace with systemd.

Joel Stanley (3):
  powerpc/microwatt: Add Ethernet to device tree
  powerpc/configs/microwattt: Enable Liteeth
  powerpc/configs/microwatt: Enable options for systemd

 arch/powerpc/boot/dts/microwatt.dts      | 12 ++++++++++++
 arch/powerpc/configs/microwatt_defconfig |  6 +++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

-- 
2.33.0


^ permalink raw reply

* [PATCH 1/3] powerpc/microwatt: Add Ethernet to device tree
From: Joel Stanley @ 2021-08-26 12:26 UTC (permalink / raw)
  To: Michael Ellerman, Paul Mackerras, Anton Blanchard,
	Michael Neuling
  Cc: linuxppc-dev
In-Reply-To: <20210826122653.3236867-1-joel@jms.id.au>

The liteeth network device is used in the Microwatt soc.

Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 arch/powerpc/boot/dts/microwatt.dts | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/boot/dts/microwatt.dts b/arch/powerpc/boot/dts/microwatt.dts
index 974abbdda249..65b270a90f94 100644
--- a/arch/powerpc/boot/dts/microwatt.dts
+++ b/arch/powerpc/boot/dts/microwatt.dts
@@ -127,6 +127,18 @@ UART0: serial@2000 {
 			fifo-size = <16>;
 			interrupts = <0x10 0x1>;
 		};
+
+		ethernet@8020000 {
+			compatible = "litex,liteeth";
+			reg = <0x8021000 0x100
+				0x8020800 0x100
+				0x8030000 0x2000>;
+			reg-names = "mac", "mido", "buffer";
+			litex,rx-slots = <2>;
+			litex,tx-slots = <2>;
+			litex,slot-size = <0x800>;
+			interrupts = <0x11 0x1>;
+		};
 	};
 
 	chosen {
-- 
2.33.0


^ permalink raw reply related

* [PATCH 2/3] powerpc/configs/microwattt: Enable Liteeth
From: Joel Stanley @ 2021-08-26 12:26 UTC (permalink / raw)
  To: Michael Ellerman, Paul Mackerras, Anton Blanchard,
	Michael Neuling
  Cc: linuxppc-dev
In-Reply-To: <20210826122653.3236867-1-joel@jms.id.au>

Liteeth is the network device used by Microwatt.

Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 arch/powerpc/configs/microwatt_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/microwatt_defconfig b/arch/powerpc/configs/microwatt_defconfig
index a08b739123da..2f8b1fec9a16 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -53,6 +53,7 @@ CONFIG_MTD_SPI_NOR=y
 CONFIG_BLK_DEV_LOOP=y
 CONFIG_BLK_DEV_RAM=y
 CONFIG_NETDEVICES=y
+CONFIG_LITEX_LITEETH=y
 # CONFIG_WLAN is not set
 # CONFIG_INPUT is not set
 # CONFIG_SERIO is not set
-- 
2.33.0


^ permalink raw reply related

* [PATCH 3/3] powerpc/configs/microwatt: Enable options for systemd
From: Joel Stanley @ 2021-08-26 12:26 UTC (permalink / raw)
  To: Michael Ellerman, Paul Mackerras, Anton Blanchard,
	Michael Neuling
  Cc: linuxppc-dev
In-Reply-To: <20210826122653.3236867-1-joel@jms.id.au>

When booting with systemd these options are required.

This increases the image by about 50KB, or 2%.

Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 arch/powerpc/configs/microwatt_defconfig | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/configs/microwatt_defconfig b/arch/powerpc/configs/microwatt_defconfig
index 2f8b1fec9a16..4a4924cd056e 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -5,6 +5,7 @@ CONFIG_PREEMPT_VOLUNTARY=y
 CONFIG_TICK_CPU_ACCOUNTING=y
 CONFIG_LOG_BUF_SHIFT=16
 CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=12
+CONFIG_CGROUPS=y
 CONFIG_BLK_DEV_INITRD=y
 CONFIG_CC_OPTIMIZE_FOR_SIZE=y
 CONFIG_KALLSYMS_ALL=y
@@ -77,8 +78,10 @@ CONFIG_SPI_SPIDEV=y
 CONFIG_EXT4_FS=y
 # CONFIG_FILE_LOCKING is not set
 # CONFIG_DNOTIFY is not set
-# CONFIG_INOTIFY_USER is not set
+CONFIG_AUTOFS_FS=y
+CONFIG_TMPFS=y
 # CONFIG_MISC_FILESYSTEMS is not set
+CONFIG_CRYPTO_SHA256=y
 # CONFIG_CRYPTO_HW is not set
 # CONFIG_XZ_DEC_X86 is not set
 # CONFIG_XZ_DEC_IA64 is not set
-- 
2.33.0


^ permalink raw reply related

* Re: [PATCH v3] PCI: Move pci_dev_is/assign_added() to pci.h
From: Niklas Schnelle @ 2021-08-26 12:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-arch, linux-s390, linux-kernel, Paul Mackerras, linux-pci,
	Bjorn Helgaas, linuxppc-dev
In-Reply-To: <20210825190444.GA3593752@bjorn-Precision-5520>

On Wed, 2021-08-25 at 14:04 -0500, Bjorn Helgaas wrote:
> On Mon, Aug 23, 2021 at 12:53:39PM +0200, Niklas Schnelle wrote:
> > On Fri, 2021-08-20 at 17:37 -0500, Bjorn Helgaas wrote:
> > > On Tue, Jul 20, 2021 at 05:01:45PM +0200, Niklas Schnelle wrote:
> > > > The helper function pci_dev_is_added() from drivers/pci/pci.h is used in
> > > > PCI arch code of both s390 and powerpc leading to awkward relative
> > > > includes. Move it to the global include/linux/pci.h and get rid of these
> > > > includes just for that one function.
> > > 
> > > I agree the includes are awkward.
> > > 
> > > But the arch code *using* pci_dev_is_added() seems awkward, too.
> > 
> > See below for my interpretation why s390 has some driver like
> > functionality in its arch code which isn't necessarily awkward.
> > 
> > Independent from that I have found pci_dev_is_added() as the only way
> > deal with the case that one might be looking at a struct pci_dev
> > reference that has been removed via pci_stop_and_remove_bus_device() or
> > has never been fully scanned. This is quite useful when handling error
> > events which on s390 are part of the adapter event mechanism shared
> > with channel I/O devices.
> > 
> > > AFAICS, in powerpc, pci_dev_is_added() is only used by
> > > pnv_pci_ioda_fixup_iov() and pseries_pci_fixup_iov_resources().  Those
> > > are only called from pcibios_add_device(), which is only called from
> > > pci_device_add().
> > > 
> > > Is it even possible for pci_dev_is_added() to be true in that path?
> 
> If the pci_dev_is_added() in powerpc is unreachable, we can remove it
> and at least reduce this to an s390-only problem.

Ok. I might be missing something but I agree it does look these are
called from within pcibios_add_device() only so pci_dev_is_added() can
never be true. This looks pretty clear as pci_dev_assign_added() is
called after pcibios_add_device() in pci_bus_add_device().

> 
> > > s390 uses pci_dev_is_added() in recover_store()
> > 
> > I'm actually looking into this as I'm working on an s390 implementation
> > of the PCI recovery flow described in Documentation/PCI/pci-error-
> > recovery.rst that would also call pci_dev_is_added() because when we
> > get a platform notification of a PCI reset done by firmware it may be
> > that the struct pci_dev is going away i.e. we still have a ref count
> > but it is not added to the PCI bus anymore. And pci_dev_is_added() is
> > the only way I've found to check for this state.
> > 
> > > , but I don't know what
> > > that is (looks like a sysfs file, but it's not documented) or why s390
> > > is the only arch that does this.
> > 
> > Good point about this not being documented, I'll look into adding docs.
> > 
> > This is a sysfs attribute that basically removes the pci_dev and re-
> > adds it. This has the complication that since the attribute sits at
> > /sys/bus/pci/devices/<dev>/recover it deletes its own parent directory
> > which requires extra caution and means concurrent accesses block on
> > pci_lock_rescan_remove() instead of a kernfs lock.
> > Long story short when concurrently triggering the attribute one thread
> > proceeds into the pci_lock_rescan_remove() section and does the
> > removal, while others would block on pci_lock_rescan_remove(). Now when
> > the threads unblock the removal is done. In this case there is a new
> > struct pci_dev found in the rescan but the previously blocked threads
> > still have references to the old struct pci_dev which was removed and
> > as far as I could tell can only be distinguished by checking
> > pci_dev_is_added().
> 
> Is this locking issue different from concurrently writing to
> /sys/.../remove on other architectures?

In principle it is very similar except that we re-scan and thus the
removed pdev may co-exist with a new pdev for the same actual device if
there are other references to the pdev.

There is however also a significant difference in locking that fixes a
possible deadlock that I just confirmed also affects /sys/../remove
where it is hidden by a lockdep ignore, see lockdep splash below. 

For /sys/../recover this was fixed by my commit dd712f0a53ca
("s390/pci: Fix possible deadlock in  recover_store()") which also
introduced the need for pci_dev_is_added().

The change in the above commit is very similar to what is done in
drivers/scsi/scsi_sysfs.c:sdev_store_delete() which also has a self
deletion and in commit 0ee223b2e1f6 ("scsi: core: Avoid that SCSI
device removal through sysfs triggers a deadlock") fixed a very very
similar lock order problem.

Like the SCSI code I use sysfs_break_active_protection() which allows a
concurrent access to /sys/../recover to advance into recover_store().
It may then block on pci_lock_rescan_remove() instead of the kernfs
lock it would block on with just delete_remove_file_self(). Thus we
have to handle unblocking from pci_lock_rescan_remove() while holding
the stale struct pci_dev of the removed device. Together with the re-
scan this means I have to be able to determine after unblocking if I'm
looking at the old pdev.

I just tested that when removing the lockdep ignore from remove_store()
and do /sys/../power and /sys/../remove I do hit the same lock order
inversion issue there for remove. Note that unlike in the commit
introducing the ignore this is a completely flat hiearchy:

[  234.196509] ======================================================
[  234.196511] WARNING: possible circular locking dependency detected
[  234.196512] 5.14.0-rc7-08025-gf6d7568b37df-dirty #18 Not tainted
[  234.196514] ------------------------------------------------------
[  234.196516] zsh/1214 is trying to acquire lock:
[  234.196518] 000000013f296e30 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x48
[  234.196529]
               but task is already holding lock:
[  234.196530] 000000009b6c3a10 (kn->active#363){++++}-{0:0}, at: sysfs_remove_file_self+0x42/0x78
[  234.196538]
               which lock already depends on the new lock.

[  234.196539]
               the existing dependency chain (in reverse order) is:
[  234.196541]
               -> #1 (kn->active#363){++++}-{0:0}:
[  234.196545]        validate_chain+0x9ca/0xde8
[  234.196548]        __lock_acquire+0x64c/0xc40
[  234.196550]        lock_acquire.part.0+0xec/0x258
[  234.196552]        lock_acquire+0xb0/0x200
[  234.196554]        kernfs_drain+0x17a/0x1c8
[  234.196556]        __kernfs_remove+0x1f4/0x230
[  234.196558]        kernfs_remove_by_name_ns+0x5c/0xa8
[  234.196560]        remove_files+0x46/0x90
[  234.196563]        sysfs_remove_group+0x5a/0xb8
[  234.196565]        sysfs_remove_groups+0x46/0x68
[  234.196567]        device_remove_attrs+0x90/0xb8
[  234.196598]        device_del+0x192/0x3f8
[  234.196600]        pci_remove_bus_device+0x8a/0x128
[  234.196602]        pci_stop_and_remove_bus_device_locked+0x3a/0x48
[  234.196604]        zpci_bus_remove_device+0x68/0xa8
[  234.196607]        zpci_deconfigure_device+0x3a/0xe0
[  234.196610]        power_write_file+0x7c/0x130
[  234.196615]        kernfs_fop_write_iter+0x13e/0x1e0
[  234.196617]        new_sync_write+0x100/0x190
[  234.196620]        vfs_write+0x21e/0x2d0
[  234.196622]        ksys_write+0x6c/0xf8
[  234.196624]        __do_syscall+0x1c2/0x1f0
[  234.196628]        system_call+0x78/0xa0
[  234.196632]
               -> #0 (pci_rescan_remove_lock){+.+.}-{3:3}:
[  234.196636]        check_noncircular+0x168/0x188
[  234.196637]        check_prev_add+0xe0/0xed8
[  234.196639]        validate_chain+0x9ca/0xde8
[  234.196641]        __lock_acquire+0x64c/0xc40
[  234.196642]        lock_acquire.part.0+0xec/0x258
[  234.196644]        lock_acquire+0xb0/0x200
[  234.196646]        __mutex_lock+0xa2/0x8d8
[  234.196648]        mutex_lock_nested+0x32/0x40
[  234.196649]        pci_stop_and_remove_bus_device_locked+0x26/0x48
[  234.196652]        remove_store+0x7a/0x88
[  234.196654]        kernfs_fop_write_iter+0x13e/0x1e0
[  234.196656]        new_sync_write+0x100/0x190
[  234.196657]        vfs_write+0x21e/0x2d0
[  234.196659]        ksys_write+0x6c/0xf8
[  234.196661]        __do_syscall+0x1c2/0x1f0
[  234.196663]        system_call+0x78/0xa0
[  234.196665]
               other info that might help us debug this:

[  234.196666]  Possible unsafe locking scenario:

[  234.196668]        CPU0                    CPU1
[  234.196669]        ----                    ----
[  234.196670]   lock(kn->active#363);
[  234.196672]                                lock(pci_rescan_remove_lock);
[  234.196674]                                lock(kn->active#363);
[  234.196677]   lock(pci_rescan_remove_lock);
[  234.196679]
                *** DEADLOCK ***

[  234.196680] 3 locks held by zsh/1214:
[  234.196682]  #0: 0000000099d44498 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x6c/0xf8
[  234.196688]  #1: 00000000b214ee90 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x102/0x1e0
[  234.196693]  #2: 000000009b6c3a10 (kn->active#363){++++}-{0:0}, at: sysfs_remove_file_self+0x42/0x78
[  234.196699]
               stack backtrace:
[  234.196701] CPU: 6 PID: 1214 Comm: zsh Not tainted 5.14.0-rc7-08025-gf6d7568b37df-dirty #18
[  234.196703] Hardware name: IBM 8561 T01 703 (LPAR)
[  234.196705] Call Trace:
[  234.196706]  [<000000013ebba5d0>] show_stack+0x90/0xf8
[  234.196711]  [<000000013ebcc28e>] dump_stack_lvl+0x8e/0xc8
[  234.196714]  [<000000013dfbe908>] check_noncircular+0x168/0x188
[  234.196716]  [<000000013dfbf9c8>] check_prev_add+0xe0/0xed8
[  234.196718]  [<000000013dfc118a>] validate_chain+0x9ca/0xde8
[  234.196720]  [<000000013dfc4184>] __lock_acquire+0x64c/0xc40
[  234.196721]  [<000000013dfc2d8c>] lock_acquire.part.0+0xec/0x258
[  234.196724]  [<000000013dfc2fa8>] lock_acquire+0xb0/0x200
[  234.196725]  [<000000013ebdac7a>] __mutex_lock+0xa2/0x8d8
[  234.196727]  [<000000013ebdb4e2>] mutex_lock_nested+0x32/0x40
[  234.196729]  [<000000013e7daaf6>] pci_stop_and_remove_bus_device_locked+0x26/0x48
[  234.196732]  [<000000013e7e753a>] remove_store+0x7a/0x88
[  234.196734]  [<000000013e35b4be>] kernfs_fop_write_iter+0x13e/0x1e0
[  234.196736]  [<000000013e264098>] new_sync_write+0x100/0x190
[  234.196738]  [<000000013e266f06>] vfs_write+0x21e/0x2d0
[  234.196740]  [<000000013e267224>] ksys_write+0x6c/0xf8
[  234.196742]  [<000000013ebcf9da>] __do_syscall+0x1c2/0x1f0
[  234.196744]  [<000000013ebe2238>] system_call+0x78/0xa0


> 
> > > Maybe we should make powerpc and s390 less special?
> > 
> > On s390, as I see it, the reason for this is that all of the PCI
> > functionality is directly defined in the Architecture as special CPU
> > instructions which are kind of hypercalls but also an ISA extension.
> > 
> > These instructions range from the basic PCI memory accesses (no real
> > MMIO) to enumeration of the devices and on to reporting of hot-plug and
> > and resets/recovery events. Importantly we do not have any kind of
> > direct access to a real or virtual PCI controller and the architecture
> > has no concept of a comparable entity.
> > 
> > So in my opinion while there is some of the functionality of a PCI
> > controller in arch/s390/pci the cut off between controller
> > functionality and arch support isn't clear at all and exposing PCI
> > support as CPU instructions doesn't map well to the controller concept.
> > 
> > That said, in principle I'm open to moving some of that into
> > drivers/pci/controller/ if you think that would improve things and we
> > can find a good argument what should go where. One possible cut off
> > would be to have arch/s390/pci/ provide wrappers to the PCI
> > instructions but move all their uses to  e.g.
> > drivers/pci/controller/s390/. This would of course be a major
> > refactoring and none of that code would be useful on any other
> > architecture but it would move a lot the accesses to PCI common code
> > functionality out of the arch code.
> 
> Looks like hotplug is already in drivers/pci/hotplug/s390_pci_hpc.c.
> 
> Might be worth considering putting the other PCI core-ish code in
> drivers/pci as well, though it doesn't feel urgent to me.  Maybe a
> good internship or mentoring project.

I agree

> 
> I'm not sure this juggling around is worth it basically to just clean
> up the include path.  The downside to me is exposing
> pci_dev_is_added() to outside the PCI core, because I don't want
> to encourage any other users.

Hmm, for me the big question how to then handle the need for
pci_dev_is_added() in my upcoming recovery code, that would be a new
user outside the PCI core unless I move arch/s390/pci/pci_event.c to
drivers/pci.

That said I'm not sure I understand yet what makes pci_dev_is_added()
unsuitable for outside the PCI core. I do understand that struct
pci_dev::priv_flags is supposed to be private to PCI drivers but on the
other hand the functionality of checking whether a PCI device has been
added to a PCI bus seems rather basic and useful whenever working with
a PCI device.

> 
> > > > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > > > ---
> > > > Since v1 (and bad v2):
> > > > - Fixed accidental removal of PCI_DPC_RECOVERED, PCI_DPC_RECOVERING
> > > >   defines and also move these to include/linux/pci.h
> > > > 
> > > >  arch/powerpc/platforms/powernv/pci-sriov.c |  3 ---
> > > >  arch/powerpc/platforms/pseries/setup.c     |  1 -
> > > >  arch/s390/pci/pci_sysfs.c                  |  2 --
> > > >  drivers/pci/hotplug/acpiphp_glue.c         |  1 -
> > > >  drivers/pci/pci.h                          | 15 ---------------
> > > >  include/linux/pci.h                        | 15 +++++++++++++++
> > > >  6 files changed, 15 insertions(+), 22 deletions(-)
> > > > 
> > > > diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c b/arch/powerpc/platforms/powernv/pci-sriov.c
> > > > index 28aac933a439..2e0ca5451e85 100644
> > > > --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> > > > +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> > > > @@ -9,9 +9,6 @@
> > > >  
> > > >  #include "pci.h"
> > > >  
> > > > -/* for pci_dev_is_added() */
> > > > -#include "../../../../drivers/pci/pci.h"
> > > > 
> > .. snip ..
> > 


^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32
From: Segher Boessenkool @ 2021-08-26 12:49 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linux-kernel, Paul Mackerras, linuxppc-dev
In-Reply-To: <1629946707.f6ptz0tgle.astroid@bobo.none>

Hi!

On Thu, Aug 26, 2021 at 01:26:14PM +1000, Nicholas Piggin wrote:
> Excerpts from Segher Boessenkool's message of August 19, 2021 1:06 am:
> > On Fri, Aug 13, 2021 at 04:08:13PM +1000, Nicholas Piggin wrote:
> >> This one possibly the branches end up in predictors, whereas conditional 
> >> trap is always just speculated not to hit. Branches may also have a
> >> throughput limit on execution whereas trap could be more (1 per cycle
> >> vs 4 per cycle on POWER9).
> > 
> > I thought only *taken* branches are just one per cycle?
> 
> Taken branches are fetched by the front end at one per cycle (assuming 
> they hit the BTAC), but all branches have to be executed by BR at one 
> per cycle

This is not true.  (Simple) predicted not-taken conditional branches are
just folded out, never hit the issue queues.  And they are fetched as
many together as fit in a fetch group, can complete without limits as
well.

The BTAC is a frontend thing, used for target address prediction.  It
does not limit execution.

Correctly predicted simple conditional branches just get their prediction
validated (and that is not done in the execution units).  Incorrectly
predicted branches the same, but those cause a redirect and refetch.

> > Internally *all* traps are conditional, in GCC.  It also can optimise
> > them quite well.  There must be something in the kernel macros that
> > prevents good optimisation.
> 
> I did take a look at it at one point.
> 
> One problem is that the kernel needs the address of the trap instruction 
> to create the entry for it. The other problem is that __builtin_trap 
> does not return so it can't be used for WARN. LLVM at least seems to 
> have a __builtin_debugtrap which does return.

This is <https://gcc.gnu.org/PR99299>.

> The first problem seems like the show stopper though. AFAIKS it would 
> need a special builtin support that does something to create the table
> entry, or a guarantee that we could put an inline asm right after the
> builtin as a recognized pattern and that would give us the instruction
> following the trap.

I'm not quite sure what this means.  Can't you always just put a

bla:	asm("");

in there, and use the address of "bla"?  If not, you need to say a lot
more about what you actually want to do :-/


Segher

^ permalink raw reply

* Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
From: Michael Ellerman @ 2021-08-26 13:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider,
	kernel test robot
In-Reply-To: <20210821102535.169643-4-srikar@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> Scheduler expects unique number of node distances to be available at
> boot.

I think it needs "the number of unique node distances" ?

> It uses node distance to calculate this unique node distances.

It iterates over all pairs of nodes and records node_distance() for that
pair, and then calculates the set of unique distances.

> On POWER, node distances for offline nodes is not available. However,
> POWER already knows unique possible node distances.

I think it would be more accurate to say PAPR rather than POWER there.
It's PAPR that defines the way we determine distances and imposes that
limitation.

> Fake the offline node's distance_lookup_table entries so that all
> possible node distances are updated.

Does this work if we have a single node offline at boot?

Say we start with:

node distances:
node   0   1
  0:  10  20
  1:  20  10

And node 2 is offline at boot. We can only initialise that nodes entries
in the distance_lookup_table:

		while (i--)
			distance_lookup_table[node][i] = node;

By filling them all with 2 that causes node_distance(2, X) to return the
maximum distance for all other nodes X, because we won't break out of
the loop in __node_distance():

	for (i = 0; i < distance_ref_points_depth; i++) {
		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
			break;

		/* Double the distance for each NUMA level */
		distance *= 2;
	}

If distance_ref_points_depth was 4 we'd return 160.

That'd leave us with 3 unique distances at boot, 10, 20, 160.

But when node 2 comes online it might introduce more than 1 new distance
value, eg. it could be that the actual distances are:

node distances:
node   0   1   2
  0:  10  20  40
  1:  20  10  80
  2:  40  80  10

ie. we now have 4 distances, 10, 20, 40, 80.

What am I missing?

> However this only needs to be done if the number of unique node
> distances that can be computed for online nodes is less than the
> number of possible unique node distances as represented by
> distance_ref_points_depth.

Looking at a few machines they all have distance_ref_points_depth = 2.

So maybe that explains it, in practice we only see 10, 20, 40.


> When the node is actually onlined, distance_lookup_table will be
> updated with actual entries.

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> Cc: Laurent Dufour <ldufour@linux.ibm.com>
> Cc: kernel test robot <lkp@intel.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 70 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
>
> Changelog:
> v1: https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-srikar@linux.vnet.ibm.com/t/#u
> [ Fixed a missing prototype warning Reported-by: kernel test robot <lkp@intel.com>]
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 3c124928a16d..0ee79a08c9e1 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -856,6 +856,75 @@ void __init dump_numa_cpu_topology(void)
>  	}
>  }
>  
> +/*
> + * Scheduler expects unique number of node distances to be available at
> + * boot. It uses node distance to calculate this unique node distances. On
> + * POWER, node distances for offline nodes is not available. However, POWER
> + * already knows unique possible node distances. Fake the offline node's
> + * distance_lookup_table entries so that all possible node distances are
> + * updated.
> + */

> +static void __init fake_update_distance_lookup_table(void)
> +{
> +	unsigned long distance_map;
> +	int i, nr_levels, nr_depth, node;

Are they distances, depths, or levels? :)

Bit more consistency in the variable names might make the code easier to
follow.

> +
> +	if (!numa_enabled)
> +		return;
> +
> +	if (!form1_affinity)
> +		return;

That doesn't exist since Aneesh's FORM2 series, so that will need a
rebase, and possibly some more rework to interact with that series.

> +	/*
> +	 * distance_ref_points_depth lists the unique numa domains
> +	 * available. However it ignore LOCAL_DISTANCE. So add +1
> +	 * to get the actual number of unique distances.
> +	 */
> +	nr_depth = distance_ref_points_depth + 1;

num_depths would be a better name IMHO.

> +
> +	WARN_ON(nr_depth > sizeof(distance_map));

Warn but then continue, and corrupt something on the stack? Seems like a
bad idea :)

I guess it's too early to use bitmap_alloc(). But can we at least return
if nr_depth is too big.

> +
> +	bitmap_zero(&distance_map, nr_depth);
> +	bitmap_set(&distance_map, 0, 1);
> +
> +	for_each_online_node(node) {
> +		int nd, distance = LOCAL_DISTANCE;
> +
> +		if (node == first_online_node)
> +			continue;
> +
> +		nd = __node_distance(node, first_online_node);
> +		for (i = 0; i < nr_depth; i++, distance *= 2) {

		for (i = 0, distance = LOCAL_DISTANCE; i < nr_depth; i++, distance *= 2) {

Could make it clearer what the for loop is doing I think.

> +			if (distance == nd) {
> +				bitmap_set(&distance_map, i, 1);
> +				break;
> +			}
> +		}
> +		nr_levels = bitmap_weight(&distance_map, nr_depth);
> +		if (nr_levels == nr_depth)
> +			return;
> +	}
> +
> +	for_each_node(node) {
> +		if (node_online(node))
> +			continue;
> +
> +		i = find_first_zero_bit(&distance_map, nr_depth);
> +		if (i >= nr_depth || i == 0) {

Neither of those can happen can they?

We checked the bitmap weight in the previous for loop, or at the bottom
of this one, and returned if we'd filled the map already.

And we set bit zero explicitly with bitmap_set().

> +			pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
> +			return;
> +		}
> +
> +		bitmap_set(&distance_map, i, 1);
> +		while (i--)
> +			distance_lookup_table[node][i] = node;

That leaves distance_lookup_table[node][i+1] and so on uninitialised, or
initialised to zero because it's static, is that OK?

> +		nr_levels = bitmap_weight(&distance_map, nr_depth);
> +		if (nr_levels == nr_depth)
> +			return;
> +	}
> +}
> +
>  /* Initialize NODE_DATA for a node on the local memory */
>  static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>  {
> @@ -971,6 +1040,7 @@ void __init mem_topology_setup(void)
>  		 */
>  		numa_setup_cpu(cpu);
>  	}
> +	fake_update_distance_lookup_table();
>  }
>  
>  void __init initmem_init(void)


cheers

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto
From: Segher Boessenkool @ 2021-08-26 13:47 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: llvm, linux-kernel, Nathan Chancellor, clang-built-linux,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <3fad8702-278a-d9f9-1882-6958ce570bcc@csgroup.eu>

On Thu, Aug 26, 2021 at 08:37:09AM +0200, Christophe Leroy wrote:
> Le 26/08/2021 à 05:21, Michael Ellerman a écrit :
> >This patch seems to fix it. Not sure if that's just papering over it 
> >though.
> >
> >diff --git a/arch/powerpc/include/asm/bug.h 
> >b/arch/powerpc/include/asm/bug.h
> >index 1ee0f22313ee..75fcb4370d96 100644
> >--- a/arch/powerpc/include/asm/bug.h
> >+++ b/arch/powerpc/include/asm/bug.h
> >@@ -119,7 +119,7 @@ __label_warn_on:					 \
> >  								\
> >  			WARN_ENTRY(PPC_TLNEI " %4, 0",		\
> >  				   BUGFLAG_WARNING | 
> >  				   BUGFLAG_TAINT(TAINT_WARN),	\
> >-				   __label_warn_on, "r" (x));	\
> >+				   __label_warn_on, "r" (!!(x))); \
> >  			break;					\
> >  __label_warn_on:						\
> >  			__ret_warn_on = true;			\
> 
> But for a simple WARN_ON() call:
> 
> void test(unsigned long b)
> {
> 	WARN_ON(b);
> }
> 
> Without your change with GCC you get:
> 
> 00000000000012d0 <.test>:
>     12d0:	0b 03 00 00 	tdnei   r3,0
>     12d4:	4e 80 00 20 	blr
> 
> 
> With the !! change you get:
> 
> 00000000000012d0 <.test>:
>     12d0:	31 23 ff ff 	addic   r9,r3,-1
>     12d4:	7d 29 19 10 	subfe   r9,r9,r3
>     12d8:	0b 09 00 00 	tdnei   r9,0
>     12dc:	4e 80 00 20 	blr

That is because the asm (unlike the builtin) cannot be optimised by the
compiler.


Segher

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32
From: Nicholas Piggin @ 2021-08-26 13:57 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: linux-kernel, Paul Mackerras, linuxppc-dev
In-Reply-To: <20210826124901.GY1583@gate.crashing.org>

Excerpts from Segher Boessenkool's message of August 26, 2021 10:49 pm:
> Hi!
> 
> On Thu, Aug 26, 2021 at 01:26:14PM +1000, Nicholas Piggin wrote:
>> Excerpts from Segher Boessenkool's message of August 19, 2021 1:06 am:
>> > On Fri, Aug 13, 2021 at 04:08:13PM +1000, Nicholas Piggin wrote:
>> >> This one possibly the branches end up in predictors, whereas conditional 
>> >> trap is always just speculated not to hit. Branches may also have a
>> >> throughput limit on execution whereas trap could be more (1 per cycle
>> >> vs 4 per cycle on POWER9).
>> > 
>> > I thought only *taken* branches are just one per cycle?
>> 
>> Taken branches are fetched by the front end at one per cycle (assuming 
>> they hit the BTAC), but all branches have to be executed by BR at one 
>> per cycle
> 
> This is not true.  (Simple) predicted not-taken conditional branches are
> just folded out, never hit the issue queues.  And they are fetched as
> many together as fit in a fetch group, can complete without limits as
> well.

No, they are all dispatched and issue to the BRU for execution. It's 
trivial to construct a test of a lot of not taken branches in a row
and time a loop of it to see it executes at 1 cycle per branch.

> The BTAC is a frontend thing, used for target address prediction.  It
> does not limit execution.

I didn't say it did.

> Correctly predicted simple conditional branches just get their prediction
> validated (and that is not done in the execution units).  Incorrectly
> predicted branches the same, but those cause a redirect and refetch.

How could it validate prediction without issuing? It wouldn't know when
sources are ready.

> 
>> > Internally *all* traps are conditional, in GCC.  It also can optimise
>> > them quite well.  There must be something in the kernel macros that
>> > prevents good optimisation.
>> 
>> I did take a look at it at one point.
>> 
>> One problem is that the kernel needs the address of the trap instruction 
>> to create the entry for it. The other problem is that __builtin_trap 
>> does not return so it can't be used for WARN. LLVM at least seems to 
>> have a __builtin_debugtrap which does return.
> 
> This is <https://gcc.gnu.org/PR99299>.

Aha.

>> The first problem seems like the show stopper though. AFAIKS it would 
>> need a special builtin support that does something to create the table
>> entry, or a guarantee that we could put an inline asm right after the
>> builtin as a recognized pattern and that would give us the instruction
>> following the trap.
> 
> I'm not quite sure what this means.  Can't you always just put a
> 
> bla:	asm("");
> 
> in there, and use the address of "bla"?

Not AFAIKS. Put it where?

> If not, you need to say a lot
> more about what you actually want to do :-/

We need to put the address (or relative offset) of the trap instruction
into an entry in a different section. Basically this:

   __builtin_trap();
   asm ("1:                               \n\t"
        "    .section __bug_table,\"aw\"  \n\t"
        "2:  .4byte 1b - 2b - 4           \n\t"
        "    .previous");

Where the 1: label must follow immediately after the trap instruction, 
and where the asm must be emitted even for the case of no-return traps 
(but anything following the asm could be omitted).

Thanks,
Nick

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox