LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
From: Michael Ellerman @ 2021-09-23 11:17 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	kernel test robot, Peter Zijlstra, Geetika Moolchandani,
	Valentin Schneider, Laurent Dufour, linuxppc-dev, Ingo Molnar
In-Reply-To: <20210901102206.GO21942@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2021-08-26 23:36:53]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > Scheduler expects unique number of node distances to be available at
>> > boot.
...
>
>> > Fake the offline node's distance_lookup_table entries so that all
>> > possible node distances are updated.
>> 
>> Does this work if we have a single node offline at boot?
>> 
>
> It should.
>
>> Say we start with:
>> 
>> node distances:
>> node   0   1
>>   0:  10  20
>>   1:  20  10
>> 
>> And node 2 is offline at boot. We can only initialise that nodes entries
>> in the distance_lookup_table:
>> 
>> 		while (i--)
>> 			distance_lookup_table[node][i] = node;
>> 
>> By filling them all with 2 that causes node_distance(2, X) to return the
>> maximum distance for all other nodes X, because we won't break out of
>> the loop in __node_distance():
>> 
>> 	for (i = 0; i < distance_ref_points_depth; i++) {
>> 		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
>> 			break;
>> 
>> 		/* Double the distance for each NUMA level */
>> 		distance *= 2;
>> 	}
>> 
>> If distance_ref_points_depth was 4 we'd return 160.
>
> As you already know, distance 10, 20, .. are defined by Powerpc, form1
> affinity. PAPR doesn't define actual distances, it only provides us the
> associativity. If there are distance_ref_points_depth is 4,
> (distance_ref_points_depth doesn't take local distance into consideration)
> 10, 20, 40, 80, 160.
>
>> 
>> That'd leave us with 3 unique distances at boot, 10, 20, 160.
>> 
>
> So if there are unique distances, then the distances as per the current
> code has to be 10, 20, 40, 80.. I dont see a way in which we have a break in
> the series. like having 160 without 80.

I'm confused what you mean there.

If we have a node that's offline at boot then we get 160 for that node,
that's just the result of having no info for it, so we never break out
of the for loop.

So if we have two nodes, one hop apart, and then an offline node we get
10, 20, 160.

Or if you're using depth = 3 then it's 10, 20, 80.

>> But when node 2 comes online it might introduce more than 1 new distance
>> value, eg. it could be that the actual distances are:
>> 
>> node distances:
>> node   0   1   2
>>   0:  10  20  40
>>   1:  20  10  80
>>   2:  40  80  10
>> 
>> ie. we now have 4 distances, 10, 20, 40, 80.
>> 
>> What am I missing?
>
> As I said above, I am not sure how we can have a break in the series.
> If distance_ref_points_depth is 3, the distances has to be 10,20,40,80 as
> atleast for form1 affinity.

I agree for depth 3 we have to see 10, 20, 40, 80. But nothing
guarantees we see each value (other than 10).

We can have two nodes one hop apart, so we have 10 and 20, then a third
node is added 3 hops away, so we get 10, 20, 80.

The real problem is that the third node could be 3 hops from node 0
and 2 hops from node 1, and so the addition of the third node causes
two new distance values (40 & 80) to be required.

I think maybe what you're saying is that in practice we don't see setups
like that. But I don't know if I'm happy with a solution that doesn't
work in the general case, and relies on the particular properties of our
current set of systems.

Possibly we just need to detect that case and WARN about it. The only
problem is we won't know until the system is already up and running, ie.
we can't know at boot that the onlining of the third node will cause 2
new distance values to be added.

cheers

^ permalink raw reply

* Re: [PATCH 3/3] memblock: cleanup memblock_free interface
From: Mike Rapoport @ 2021-09-23 12:01 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-efi, kvm, linux-sh, linux-mips, linux-mm, sparclinux,
	linux-riscv, linux-s390, kasan-dev, xen-devel, linux-snps-arc,
	devicetree, linux-um, linux-arm-kernel, Linus Torvalds, linux-usb,
	linux-kernel, iommu, linux-alpha, Andrew Morton, linuxppc-dev,
	Mike Rapoport
In-Reply-To: <1101e3c7-fcb7-a632-8e22-47f4a01ea02e@csgroup.eu>

On Thu, Sep 23, 2021 at 11:47:48AM +0200, Christophe Leroy wrote:
> 
> 
> Le 23/09/2021 à 09:43, Mike Rapoport a écrit :
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > For ages memblock_free() interface dealt with physical addresses even
> > despite the existence of memblock_alloc_xx() functions that return a
> > virtual pointer.
> > 
> > Introduce memblock_phys_free() for freeing physical ranges and repurpose
> > memblock_free() to free virtual pointers to make the following pairing
> > abundantly clear:
> > 
> > 	int memblock_phys_free(phys_addr_t base, phys_addr_t size);
> > 	phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size);
> > 
> > 	void *memblock_alloc(phys_addr_t size, phys_addr_t align);
> > 	void memblock_free(void *ptr, size_t size);
> > 
> > Replace intermediate memblock_free_ptr() with memblock_free() and drop
> > unnecessary aliases memblock_free_early() and memblock_free_early_nid().
> > 
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> 
> > diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
> > index 1a04e5bdf655..37826d8c4f74 100644
> > --- a/arch/s390/kernel/smp.c
> > +++ b/arch/s390/kernel/smp.c
> > @@ -723,7 +723,7 @@ void __init smp_save_dump_cpus(void)
> >   			/* Get the CPU registers */
> >   			smp_save_cpu_regs(sa, addr, is_boot_cpu, page);
> >   	}
> > -	memblock_free(page, PAGE_SIZE);
> > +	memblock_phys_free(page, PAGE_SIZE);
> >   	diag_amode31_ops.diag308_reset();
> >   	pcpu_set_smt(0);
> >   }
> > @@ -880,7 +880,7 @@ void __init smp_detect_cpus(void)
> >   	/* Add CPUs present at boot */
> >   	__smp_rescan_cpus(info, true);
> > -	memblock_free_early((unsigned long)info, sizeof(*info));
> > +	memblock_free(info, sizeof(*info));
> >   }
> >   /*
> 
> I'm a bit lost. IIUC memblock_free_early() and memblock_free() where
> identical.

Yes, they were, but all calls to memblock_free_early() were using
__pa(vaddr) because they had a virtual address at hand.

> In the first hunk memblock_free() gets replaced by memblock_phys_free()
> In the second hunk memblock_free_early() gets replaced by memblock_free()

In the first hunk the memory is allocated with memblock_phys_alloc() and we
have a physical range to free. In the second hunk the memory is allocated
with memblock_alloc() and we are freeing a virtual pointer.
 
> I think it would be easier to follow if you could split it in several
> patches:

It was an explicit request from Linus to make it a single commit:

  but the actual commit can and should be just a single commit that just
  fixes 'memblock_free()' to have sane interfaces.

I don't feel strongly about splitting it (except my laziness really
objects), but I don't think doing the conversion in several steps worth the
churn.

> - First patch: Create memblock_phys_free() and change all relevant
> memblock_free() to memblock_phys_free() - Or change memblock_free() to
> memblock_phys_free() and make memblock_free() an alias of it.
> - Second patch: Make memblock_free_ptr() become memblock_free() and change
> all remaining callers to the new semantics (IIUC memblock_free(__pa(ptr))
> becomes memblock_free(ptr) and make memblock_free_ptr() an alias of
> memblock_free()
> - Fourth patch: Replace and drop memblock_free_ptr()
> - Fifth patch: Drop memblock_free_early() and memblock_free_early_nid() (All
> users should have been upgraded to memblock_free_phys() in patch 1 or
> memblock_free() in patch 2)
> 
> Christophe

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/powermac: add missing g5_phy_disable_cpu1() declaration
From: Michael Ellerman @ 2021-09-23 13:21 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Benjamin Herrenschmidt, Paul Mackerras,
	Rob Herring, Frank Rowand, Krzysztof Kozlowski, Christophe Leroy,
	Aneesh Kumar K.V, linuxppc-dev, linux-kernel, devicetree
In-Reply-To: <20210922160436.130931-1-krzysztof.kozlowski@canonical.com>

Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> writes:
> g5_phy_disable_cpu1() is used outside of platforms/powermac/feature.c,
> so it should have a declaration to fix W=1 warning:
>
>   arch/powerpc/platforms/powermac/feature.c:1533:6:
>     error: no previous prototype for ‘g5_phy_disable_cpu1’ [-Werror=missing-prototypes]
>
> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
>
> ---
>
> Changes since v1:
> 1. Drop declaration in powermac/smp.c

Sorry I missed v1, so I'm going to ask for more changes :}

>  arch/powerpc/include/asm/pmac_feature.h | 4 ++++

Putting it here exposes it to the whole kernel, but it's only needed
inside arch/powerpc/platforms/powermac.

The right place for the prototype is arch/powerpc/platforms/powermac/pmac.h,
which is for platform internal prototypes.

>  arch/powerpc/platforms/powermac/smp.c   | 2 --
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pmac_feature.h b/arch/powerpc/include/asm/pmac_feature.h
> index e08e829261b6..7703e5bf1203 100644
> --- a/arch/powerpc/include/asm/pmac_feature.h
> +++ b/arch/powerpc/include/asm/pmac_feature.h
> @@ -143,6 +143,10 @@
>   */
>  struct device_node;
>  
> +#ifdef CONFIG_PPC64
> +void g5_phy_disable_cpu1(void);
> +#endif /* CONFIG_PPC64 */

The ifdef around the prototype doesn't gain much, and is extra visual
noise, so I'd rather without it.

cheers

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/powermac: add missing g5_phy_disable_cpu1() declaration
From: Krzysztof Kozlowski @ 2021-09-23 13:32 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Rob Herring, Frank Rowand, Christophe Leroy, Aneesh Kumar K.V,
	linuxppc-dev, linux-kernel, devicetree
In-Reply-To: <87v92rl95e.fsf@mpe.ellerman.id.au>

On 23/09/2021 15:21, Michael Ellerman wrote:
> Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> writes:
>> g5_phy_disable_cpu1() is used outside of platforms/powermac/feature.c,
>> so it should have a declaration to fix W=1 warning:
>>
>>   arch/powerpc/platforms/powermac/feature.c:1533:6:
>>     error: no previous prototype for ‘g5_phy_disable_cpu1’ [-Werror=missing-prototypes]
>>
>> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
>>
>> ---
>>
>> Changes since v1:
>> 1. Drop declaration in powermac/smp.c
> 
> Sorry I missed v1, so I'm going to ask for more changes :}
> 
>>  arch/powerpc/include/asm/pmac_feature.h | 4 ++++
> 
> Putting it here exposes it to the whole kernel, but it's only needed
> inside arch/powerpc/platforms/powermac.
> 
> The right place for the prototype is arch/powerpc/platforms/powermac/pmac.h,
> which is for platform internal prototypes.

I'll fix it up.

> 
>>  arch/powerpc/platforms/powermac/smp.c   | 2 --
>>  2 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/pmac_feature.h b/arch/powerpc/include/asm/pmac_feature.h
>> index e08e829261b6..7703e5bf1203 100644
>> --- a/arch/powerpc/include/asm/pmac_feature.h
>> +++ b/arch/powerpc/include/asm/pmac_feature.h
>> @@ -143,6 +143,10 @@
>>   */
>>  struct device_node;
>>  
>> +#ifdef CONFIG_PPC64
>> +void g5_phy_disable_cpu1(void);
>> +#endif /* CONFIG_PPC64 */
> 
> The ifdef around the prototype doesn't gain much, and is extra visual
> noise, so I'd rather without it.


Sure.


Best regards,
Krzysztof

^ permalink raw reply

* Re: [PATCH 3/3] memblock: cleanup memblock_free interface
From: Christophe Leroy @ 2021-09-23 13:54 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-efi, kvm, linux-sh, linux-mips, linux-mm, sparclinux,
	linux-riscv, linux-s390, kasan-dev, xen-devel, linux-snps-arc,
	devicetree, linux-um, linux-arm-kernel, Linus Torvalds, linux-usb,
	linux-kernel, iommu, linux-alpha, Andrew Morton, linuxppc-dev,
	Mike Rapoport
In-Reply-To: <YUxsgN/uolhn1Ok+@linux.ibm.com>



Le 23/09/2021 à 14:01, Mike Rapoport a écrit :
> On Thu, Sep 23, 2021 at 11:47:48AM +0200, Christophe Leroy wrote:
>>
>>
>> Le 23/09/2021 à 09:43, Mike Rapoport a écrit :
>>> From: Mike Rapoport <rppt@linux.ibm.com>
>>>
>>> For ages memblock_free() interface dealt with physical addresses even
>>> despite the existence of memblock_alloc_xx() functions that return a
>>> virtual pointer.
>>>
>>> Introduce memblock_phys_free() for freeing physical ranges and repurpose
>>> memblock_free() to free virtual pointers to make the following pairing
>>> abundantly clear:
>>>
>>> 	int memblock_phys_free(phys_addr_t base, phys_addr_t size);
>>> 	phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size);
>>>
>>> 	void *memblock_alloc(phys_addr_t size, phys_addr_t align);
>>> 	void memblock_free(void *ptr, size_t size);
>>>
>>> Replace intermediate memblock_free_ptr() with memblock_free() and drop
>>> unnecessary aliases memblock_free_early() and memblock_free_early_nid().
>>>
>>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>>> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
>>> ---
>>
>>> diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
>>> index 1a04e5bdf655..37826d8c4f74 100644
>>> --- a/arch/s390/kernel/smp.c
>>> +++ b/arch/s390/kernel/smp.c
>>> @@ -723,7 +723,7 @@ void __init smp_save_dump_cpus(void)
>>>    			/* Get the CPU registers */
>>>    			smp_save_cpu_regs(sa, addr, is_boot_cpu, page);
>>>    	}
>>> -	memblock_free(page, PAGE_SIZE);
>>> +	memblock_phys_free(page, PAGE_SIZE);
>>>    	diag_amode31_ops.diag308_reset();
>>>    	pcpu_set_smt(0);
>>>    }
>>> @@ -880,7 +880,7 @@ void __init smp_detect_cpus(void)
>>>    	/* Add CPUs present at boot */
>>>    	__smp_rescan_cpus(info, true);
>>> -	memblock_free_early((unsigned long)info, sizeof(*info));
>>> +	memblock_free(info, sizeof(*info));
>>>    }
>>>    /*
>>
>> I'm a bit lost. IIUC memblock_free_early() and memblock_free() where
>> identical.
> 
> Yes, they were, but all calls to memblock_free_early() were using
> __pa(vaddr) because they had a virtual address at hand.

I'm still not following. In the above memblock_free_early() was taking 
(unsigned long)info . Was it a bug ? It looks odd to hide bug fixes in 
such a big patch, should that bug fix go in patch 2 ?

> 
>> In the first hunk memblock_free() gets replaced by memblock_phys_free()
>> In the second hunk memblock_free_early() gets replaced by memblock_free()
> 
> In the first hunk the memory is allocated with memblock_phys_alloc() and we
> have a physical range to free. In the second hunk the memory is allocated
> with memblock_alloc() and we are freeing a virtual pointer.
>   
>> I think it would be easier to follow if you could split it in several
>> patches:
> 
> It was an explicit request from Linus to make it a single commit:
> 
>    but the actual commit can and should be just a single commit that just
>    fixes 'memblock_free()' to have sane interfaces.
> 
> I don't feel strongly about splitting it (except my laziness really
> objects), but I don't think doing the conversion in several steps worth the
> churn.

The commit is quite big (55 files changed, approx 100 lines modified).

If done in the right order the change should be minimal.

It is rather not-easy to follow and review when a function that was 
existing (namely memblock_free() ) disappears and re-appears in the same 
commit but to do something different.

You do:
- memblock_free() ==> memblock_phys_free()
- memblock_free_ptr() ==> memblock_free()

At least you could split in two patches, the advantage would be that 
between first and second patch memblock() doesn't exist anymore so you 
can check you really don't have anymore user.

> 
>> - First patch: Create memblock_phys_free() and change all relevant
>> memblock_free() to memblock_phys_free() - Or change memblock_free() to
>> memblock_phys_free() and make memblock_free() an alias of it.
>> - Second patch: Make memblock_free_ptr() become memblock_free() and change
>> all remaining callers to the new semantics (IIUC memblock_free(__pa(ptr))
>> becomes memblock_free(ptr) and make memblock_free_ptr() an alias of
>> memblock_free()
>> - Fourth patch: Replace and drop memblock_free_ptr()
>> - Fifth patch: Drop memblock_free_early() and memblock_free_early_nid() (All
>> users should have been upgraded to memblock_free_phys() in patch 1 or
>> memblock_free() in patch 2)
>>
>> Christophe
> 

^ permalink raw reply

* Re: [Bug 213837] "Kernel panic - not syncing: corrupted stack end detected inside scheduler" at building via distcc on a G5
From: Michael Ellerman @ 2021-09-23 14:05 UTC (permalink / raw)
  To: bugzilla-daemon, linuxppc-dev
In-Reply-To: <bug-213837-206035-ilyQ6qrWP6@https.bugzilla.kernel.org/>

bugzilla-daemon@bugzilla.kernel.org writes:
> https://bugzilla.kernel.org/show_bug.cgi?id=213837
>
> --- Comment #7 from Erhard F. (erhard_f@mailbox.org) ---
> Created attachment 298919
>   --> https://bugzilla.kernel.org/attachment.cgi?id=298919&action=edit
> dmesg (5.15-rc2 + patch, PowerMac G5 11,2)
>
> (In reply to mpe from comment #6)
>> Can you try this patch, it might help us work out what is corrupting the
>> stack.
> With your patch applied to recent v5.15-rc2 the output looks like this:
>
> [...]
> stack corrupted? stack end = 0xc000000029fdc000
> stack: c000000029fdbc00: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
...

> Can't make much sense out of it but hopefully you can. ;)

Thanks. Obvious isn't it? ;)

  stack corrupted? stack end = 0xc000000029fdc000
  stack: c000000029fdbc00: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbc10: 00000ddc 7c000010 cccccccc cccccccc  ....|...........
  stack: c000000029fdbc20: 29fc4e41 673d4bb3 5a5a5a5a 5a5a5a5a  ).NAg=K.ZZZZZZZZ
  stack: c000000029fdbc30: cccccccc cccccccc 00000ddc 8e000010  ................
  stack: c000000029fdbc40: cccccccc cccccccc 41fc4e41 673d41a3  ........A.NAg=A.
  stack: c000000029fdbc50: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbc60: 00000ddc 8e00000c cccccccc cccccccc  ................
  stack: c000000029fdbc70: 79fc4e41 673d4dab 5a5a5a5a 5a5a5a5a  y.NAg=M.ZZZZZZZZ
  stack: c000000029fdbc80: cccccccc cccccccc 00000ddc 90000008  ................
  stack: c000000029fdbc90: cccccccc cccccccc 91fc4e41 673d4573  ..........NAg=Es
  stack: c000000029fdbca0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbcb0: 00000dd7 ac000016 cccccccc cccccccc  ................
  stack: c000000029fdbcc0: c9fc4e41 673d4203 5a5a5a5a 5a5a5a5a  ..NAg=B.ZZZZZZZZ
  stack: c000000029fdbcd0: cccccccc cccccccc 00000ddc 6c000004  ............l...
  stack: c000000029fdbce0: cccccccc cccccccc e1fc4e41 673d474b  ..........NAg=GK
  stack: c000000029fdbcf0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbd00: 00000ddc 88000000 cccccccc cccccccc  ................
  stack: c000000029fdbd10: 19fd4e41 673d4143 5a5a5a5a 5a5a5a5a  ..NAg=ACZZZZZZZZ
  stack: c000000029fdbd20: cccccccc cccccccc 00000ddb 6c00000e  ............l...
  stack: c000000029fdbd30: cccccccc cccccccc 31fd4e41 673d4f43  ........1.NAg=OC
  stack: c000000029fdbd40: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbd50: 00000ddc 8e000008 cccccccc cccccccc  ................
  stack: c000000029fdbd60: 69fd4e41 673d407b 5a5a5a5a 5a5a5a5a  i.NAg=@{ZZZZZZZZ
  stack: c000000029fdbd70: cccccccc cccccccc 00000ddc 92000008  ................
  stack: c000000029fdbd80: cccccccc cccccccc 81fd4e41 673d4633  ..........NAg=F3
  stack: c000000029fdbd90: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbda0: 00000ddb 42000018 cccccccc cccccccc  ....B...........
  stack: c000000029fdbdb0: b9fd4e41 673d42fb 5a5a5a5a 5a5a5a5a  ..NAg=B.ZZZZZZZZ
  stack: c000000029fdbdc0: cccccccc cccccccc 00000ddc 7e000018  ............~...
  stack: c000000029fdbdd0: cccccccc cccccccc d1fd4e41 673d4a1b  ..........NAg=J.
  stack: c000000029fdbde0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbdf0: 00000ddc 8e000004 cccccccc cccccccc  ................
  stack: c000000029fdbe00: 09fe4e41 673d4ee3 5a5a5a5a 5a5a5a5a  ..NAg=N.ZZZZZZZZ
  stack: c000000029fdbe10: cccccccc cccccccc 00000dd9 7200001c  ............r...
  stack: c000000029fdbe20: cccccccc cccccccc 21fe4e41 673d4fa3  ........!.NAg=O.

That's slab data.

It's not clear what the actual data is, but because you booted with
slub_debug=FZP we can see the red zones and poison.

The cccccccc is SLUB_RED_ACTIVE, and 5a5a5a5a is POISON_INUSE (see poison.h)


  stack: c000000029fdbe30: c0000000 29fdbeb0 cccccccc cccccccc  ....)...........

But then here we have an obvious pointer (big endian FTW).

And it points nearby, just slightly higher in memory, so that looks
suspiciously like a stack back chain pointer. There's more similar
values if you look further.

But we shouldn't be seeing the stack yet, it's meant to start (end) at
c000000029fdc000 ...

  stack: c000000029fdbe40: 00000ddc 94000000 cccccccc cccccccc  ................
  stack: c000000029fdbe50: 59fe4e41 673d4933 5a5a5a5a 5a5a5a5a  Y.NAg=I3ZZZZZZZZ
  stack: c000000029fdbe60: cccccccc cccccccc 00000dd9 60000024  ............`..$
  stack: c000000029fdbe70: cccccccc cccccccc 71fe4e41 673d416b  ........q.NAg=Ak
  stack: c000000029fdbe80: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbe90: 00000ddc 6000000c cccccccc cccccccc  ....`...........
  stack: c000000029fdbea0: c0000000 29fdbf20 00000000 00000002  ....).. ........
  stack: c000000029fdbeb0: c0000000 29fdbf30 00000ddc 7e00001c ....)..0....~...     <---
  stack: c000000029fdbec0: c0000000 29fdbf40 c1fe4e41 673d4723  ....)..@..NAg=G#
  stack: c000000029fdbed0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbee0: c0000000 29fdbf60 cccccccc cccccccc  ....)..`........
  stack: c000000029fdbef0: c0000000 29fdbf70 5a5a5a5a 5a5a5a5a  ....)..pZZZZZZZZ
  stack: c000000029fdbf00: cccccccc cccccccc 00000ddc 60000010  ............`...
  stack: c000000029fdbf10: c0000000 29fdbf90 00000000 00000002  ....)...........
  stack: c000000029fdbf20: c0000000 29fdbf01 001d3029 96167689  ....).....0)..v.
  stack: c000000029fdbf30: c0000000 29fdbfc0 c0000004 7f6f1800 ....)........o..     <---
  stack: c000000029fdbf40: c0000000 29fdbfc0 5a5a5a5a 5a5a5a5a  ....)...ZZZZZZZZ
  stack: c000000029fdbf50: c0000000 000ea33c 00000000 00000000  .......<........
  stack: c000000029fdbf60: c0000000 29fdbfe0 c0000000 05cdb700  ....)...........
  stack: c000000029fdbf70: c0000000 29fdbff0 cccccccc cccccccc  ....)...........
  stack: c000000029fdbf80: c0000000 000ea33c 00000000 00328780  .......<.....2..
  stack: c000000029fdbf90: c0000000 29fdc010 001d3029 96167689  ....).....0)..v.
  stack: c000000029fdbfa0: c0000000 29fdc020 00000000 000008e4  ....).. ........
  stack: c000000029fdbfb0: 00000000 00000201 001d3029 96167689  ..........0)..v.
  stack: c000000029fdbfc0: c0000000 29fdc040 cccccccc cccccccc ....)..@........     <---
  stack: c000000029fdbfd0: c0000000 000c2344 001d3029 96167689  ......#D..0)..v.
  stack: c000000029fdbfe0: c0000000 29fdc001 001d3029 96167689  ....).....0)..v.
  stack: c000000029fdbff0: c0000000 29fdc080 00000088 554c539a  ....).......ULS.

... which is here:

  stack: c000000029fdc000: c0000000 000c1d9c 001d3029 96167689  ..........0)..v.
  stack: c000000029fdc010: c0000000 29fdc0d0 c0000004 7f6f1700  ....)........o..
  stack: c000000029fdc020: c0000000 29fdc0a0 c0000000 05cdb580  ....)...........
  stack: c000000029fdc030: c0000000 29fdc0b0 c0000004 7f6f1700  ....)........o..
  stack: c000000029fdc040: c0000000 29fdc0c0 00000000 00000001  ....)...........


So it looks like you have actually overran your stack, rather than
something else clobbering your stack.

Can you attach your System.map for that exact kernel? We might be able
to work out what functions we were in when we overran.

You could also try changing CONFIG_THREAD_SHIFT to 15, that might keep
the system running a bit longer and give us some other clues.

cheers

^ permalink raw reply

* [Bug 213837] "Kernel panic - not syncing: corrupted stack end detected inside scheduler" at building via distcc on a G5
From: bugzilla-daemon @ 2021-09-23 14:05 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-213837-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=213837

--- Comment #8 from mpe@ellerman.id.au ---
bugzilla-daemon@bugzilla.kernel.org writes:
> https://bugzilla.kernel.org/show_bug.cgi?id=213837
>
> --- Comment #7 from Erhard F. (erhard_f@mailbox.org) ---
> Created attachment 298919
>   --> https://bugzilla.kernel.org/attachment.cgi?id=298919&action=edit
> dmesg (5.15-rc2 + patch, PowerMac G5 11,2)
>
> (In reply to mpe from comment #6)
>> Can you try this patch, it might help us work out what is corrupting the
>> stack.
> With your patch applied to recent v5.15-rc2 the output looks like this:
>
> [...]
> stack corrupted? stack end = 0xc000000029fdc000
> stack: c000000029fdbc00: 5a5a5a5a 5a5a5a5a cccccccc cccccccc 
> ZZZZZZZZ........
...

> Can't make much sense out of it but hopefully you can. ;)

Thanks. Obvious isn't it? ;)

  stack corrupted? stack end = 0xc000000029fdc000
  stack: c000000029fdbc00: 5a5a5a5a 5a5a5a5a cccccccc cccccccc 
ZZZZZZZZ........
  stack: c000000029fdbc10: 00000ddc 7c000010 cccccccc cccccccc 
....|...........
  stack: c000000029fdbc20: 29fc4e41 673d4bb3 5a5a5a5a 5a5a5a5a 
).NAg=K.ZZZZZZZZ
  stack: c000000029fdbc30: cccccccc cccccccc 00000ddc 8e000010 
................
  stack: c000000029fdbc40: cccccccc cccccccc 41fc4e41 673d41a3 
........A.NAg=A.
  stack: c000000029fdbc50: 5a5a5a5a 5a5a5a5a cccccccc cccccccc 
ZZZZZZZZ........
  stack: c000000029fdbc60: 00000ddc 8e00000c cccccccc cccccccc 
................
  stack: c000000029fdbc70: 79fc4e41 673d4dab 5a5a5a5a 5a5a5a5a 
y.NAg=M.ZZZZZZZZ
  stack: c000000029fdbc80: cccccccc cccccccc 00000ddc 90000008 
................
  stack: c000000029fdbc90: cccccccc cccccccc 91fc4e41 673d4573 
..........NAg=Es
  stack: c000000029fdbca0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc 
ZZZZZZZZ........
  stack: c000000029fdbcb0: 00000dd7 ac000016 cccccccc cccccccc 
................
  stack: c000000029fdbcc0: c9fc4e41 673d4203 5a5a5a5a 5a5a5a5a 
..NAg=B.ZZZZZZZZ
  stack: c000000029fdbcd0: cccccccc cccccccc 00000ddc 6c000004 
............l...
  stack: c000000029fdbce0: cccccccc cccccccc e1fc4e41 673d474b 
..........NAg=GK
  stack: c000000029fdbcf0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc 
ZZZZZZZZ........
  stack: c000000029fdbd00: 00000ddc 88000000 cccccccc cccccccc 
................
  stack: c000000029fdbd10: 19fd4e41 673d4143 5a5a5a5a 5a5a5a5a 
..NAg=ACZZZZZZZZ
  stack: c000000029fdbd20: cccccccc cccccccc 00000ddb 6c00000e 
............l...
  stack: c000000029fdbd30: cccccccc cccccccc 31fd4e41 673d4f43 
........1.NAg=OC
  stack: c000000029fdbd40: 5a5a5a5a 5a5a5a5a cccccccc cccccccc 
ZZZZZZZZ........
  stack: c000000029fdbd50: 00000ddc 8e000008 cccccccc cccccccc 
................
  stack: c000000029fdbd60: 69fd4e41 673d407b 5a5a5a5a 5a5a5a5a 
i.NAg=@{ZZZZZZZZ
  stack: c000000029fdbd70: cccccccc cccccccc 00000ddc 92000008 
................
  stack: c000000029fdbd80: cccccccc cccccccc 81fd4e41 673d4633 
..........NAg=F3
  stack: c000000029fdbd90: 5a5a5a5a 5a5a5a5a cccccccc cccccccc 
ZZZZZZZZ........
  stack: c000000029fdbda0: 00000ddb 42000018 cccccccc cccccccc 
....B...........
  stack: c000000029fdbdb0: b9fd4e41 673d42fb 5a5a5a5a 5a5a5a5a 
..NAg=B.ZZZZZZZZ
  stack: c000000029fdbdc0: cccccccc cccccccc 00000ddc 7e000018 
............~...
  stack: c000000029fdbdd0: cccccccc cccccccc d1fd4e41 673d4a1b 
..........NAg=J.
  stack: c000000029fdbde0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc 
ZZZZZZZZ........
  stack: c000000029fdbdf0: 00000ddc 8e000004 cccccccc cccccccc 
................
  stack: c000000029fdbe00: 09fe4e41 673d4ee3 5a5a5a5a 5a5a5a5a 
..NAg=N.ZZZZZZZZ
  stack: c000000029fdbe10: cccccccc cccccccc 00000dd9 7200001c 
............r...
  stack: c000000029fdbe20: cccccccc cccccccc 21fe4e41 673d4fa3 
........!.NAg=O.

That's slab data.

It's not clear what the actual data is, but because you booted with
slub_debug=FZP we can see the red zones and poison.

The cccccccc is SLUB_RED_ACTIVE, and 5a5a5a5a is POISON_INUSE (see poison.h)


  stack: c000000029fdbe30: c0000000 29fdbeb0 cccccccc cccccccc 
....)...........

But then here we have an obvious pointer (big endian FTW).

And it points nearby, just slightly higher in memory, so that looks
suspiciously like a stack back chain pointer. There's more similar
values if you look further.

But we shouldn't be seeing the stack yet, it's meant to start (end) at
c000000029fdc000 ...

  stack: c000000029fdbe40: 00000ddc 94000000 cccccccc cccccccc 
................
  stack: c000000029fdbe50: 59fe4e41 673d4933 5a5a5a5a 5a5a5a5a 
Y.NAg=I3ZZZZZZZZ
  stack: c000000029fdbe60: cccccccc cccccccc 00000dd9 60000024 
............`..$
  stack: c000000029fdbe70: cccccccc cccccccc 71fe4e41 673d416b 
........q.NAg=Ak
  stack: c000000029fdbe80: 5a5a5a5a 5a5a5a5a cccccccc cccccccc 
ZZZZZZZZ........
  stack: c000000029fdbe90: 00000ddc 6000000c cccccccc cccccccc 
....`...........
  stack: c000000029fdbea0: c0000000 29fdbf20 00000000 00000002  ....)..
........
  stack: c000000029fdbeb0: c0000000 29fdbf30 00000ddc 7e00001c ....)..0....~...
    <---
  stack: c000000029fdbec0: c0000000 29fdbf40 c1fe4e41 673d4723 
....)..@..NAg=G#
  stack: c000000029fdbed0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc 
ZZZZZZZZ........
  stack: c000000029fdbee0: c0000000 29fdbf60 cccccccc cccccccc 
....)..`........
  stack: c000000029fdbef0: c0000000 29fdbf70 5a5a5a5a 5a5a5a5a 
....)..pZZZZZZZZ
  stack: c000000029fdbf00: cccccccc cccccccc 00000ddc 60000010 
............`...
  stack: c000000029fdbf10: c0000000 29fdbf90 00000000 00000002 
....)...........
  stack: c000000029fdbf20: c0000000 29fdbf01 001d3029 96167689 
....).....0)..v.
  stack: c000000029fdbf30: c0000000 29fdbfc0 c0000004 7f6f1800 ....)........o..
    <---
  stack: c000000029fdbf40: c0000000 29fdbfc0 5a5a5a5a 5a5a5a5a 
....)...ZZZZZZZZ
  stack: c000000029fdbf50: c0000000 000ea33c 00000000 00000000 
.......<........
  stack: c000000029fdbf60: c0000000 29fdbfe0 c0000000 05cdb700 
....)...........
  stack: c000000029fdbf70: c0000000 29fdbff0 cccccccc cccccccc 
....)...........
  stack: c000000029fdbf80: c0000000 000ea33c 00000000 00328780 
.......<.....2..
  stack: c000000029fdbf90: c0000000 29fdc010 001d3029 96167689 
....).....0)..v.
  stack: c000000029fdbfa0: c0000000 29fdc020 00000000 000008e4  ....)..
........
  stack: c000000029fdbfb0: 00000000 00000201 001d3029 96167689 
..........0)..v.
  stack: c000000029fdbfc0: c0000000 29fdc040 cccccccc cccccccc ....)..@........
    <---
  stack: c000000029fdbfd0: c0000000 000c2344 001d3029 96167689 
......#D..0)..v.
  stack: c000000029fdbfe0: c0000000 29fdc001 001d3029 96167689 
....).....0)..v.
  stack: c000000029fdbff0: c0000000 29fdc080 00000088 554c539a 
....).......ULS.

... which is here:

  stack: c000000029fdc000: c0000000 000c1d9c 001d3029 96167689 
..........0)..v.
  stack: c000000029fdc010: c0000000 29fdc0d0 c0000004 7f6f1700 
....)........o..
  stack: c000000029fdc020: c0000000 29fdc0a0 c0000000 05cdb580 
....)...........
  stack: c000000029fdc030: c0000000 29fdc0b0 c0000004 7f6f1700 
....)........o..
  stack: c000000029fdc040: c0000000 29fdc0c0 00000000 00000001 
....)...........


So it looks like you have actually overran your stack, rather than
something else clobbering your stack.

Can you attach your System.map for that exact kernel? We might be able
to work out what functions we were in when we overran.

You could also try changing CONFIG_THREAD_SHIFT to 15, that might keep
the system running a bit longer and give us some other clues.

cheers

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.
You are watching someone on the CC list of the bug.

^ permalink raw reply

* Re: [PATCH 3/9] xen/x86: make "earlyprintk=xen" work better for PVH Dom0
From: Juergen Gross @ 2021-09-23 14:05 UTC (permalink / raw)
  To: Jan Beulich, Boris Ostrovsky
  Cc: xen-devel@lists.xenproject.org, Stefano Stabellini,
	linuxppc-dev@lists.ozlabs.org, lkml
In-Reply-To: <ecf17c7b-09a4-29a7-6951-1e0b0dda3c67@suse.com>


[-- Attachment #1.1.1: Type: text/plain, Size: 529 bytes --]

On 07.09.21 12:09, Jan Beulich wrote:
> The xen_hvm_early_write() path better wouldn't be taken in this case;
> while port 0xE9 can be used, the hypercall path is quite a bit more
> efficient. Put that first, as it may also work for DomU-s (see also
> xen_raw_console_write()).
> 
> While there also bail from the function when the first
> domU_write_console() failed - later ones aren't going to succeed.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3135 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply

* Re: [PATCH 5/9] xen/x86: make "earlyprintk=xen" work for HVM/PVH DomU
From: Juergen Gross @ 2021-09-23 14:08 UTC (permalink / raw)
  To: Jan Beulich, Boris Ostrovsky
  Cc: xen-devel@lists.xenproject.org, Stefano Stabellini,
	linuxppc-dev@lists.ozlabs.org, lkml
In-Reply-To: <5e9ff16e-85f0-50ea-f053-37e17351a0cc@suse.com>


[-- Attachment #1.1.1: Type: text/plain, Size: 336 bytes --]

On 07.09.21 12:10, Jan Beulich wrote:
> xenboot_write_console() is dealing with these quite fine so I don't see
> why xenboot_console_setup() would return -ENOENT in this case.
> 
> Adjust documentation accordingly.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3135 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply

* [PATCH] KVM: PPC: Book3S HV: Use GLOBAL_TOC for kvmppc_h_set_dabr/xdabr()
From: Michael Ellerman @ 2021-09-23 15:10 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, npiggin

kvmppc_h_set_dabr(), and kvmppc_h_set_xdabr() which jumps into
it, need to use _GLOBAL_TOC to setup the kernel TOC pointer, because
kvmppc_h_set_dabr() uses LOAD_REG_ADDR() to load dawr_force_enable.

When called from hcall_try_real_mode() we have the kernel TOC in r2,
established near the start of kvmppc_interrupt_hv(), so there is no
issue.

But they can also be called from kvmppc_pseries_do_hcall() which is
module code, so the access ends up happening with the kvm-hv module's
r2, which will not point at dawr_force_enable and could even cause a
fault.

With the current code layout and compilers we haven't observed a fault
in practice, the load hits somewhere in kvm-hv.ko and silently returns
some bogus value.

Note that we we expect p8/p9 guests to use the DAWR, but SLOF uses
h_set_dabr() to test if sc1 works correctly, see SLOF's
lib/libhvcall/brokensc1.c.

Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 90484425a1e6..30a8a07cff18 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1999,7 +1999,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 	.globl	hcall_real_table_end
 hcall_real_table_end:
 
-_GLOBAL(kvmppc_h_set_xdabr)
+_GLOBAL_TOC(kvmppc_h_set_xdabr)
 EXPORT_SYMBOL_GPL(kvmppc_h_set_xdabr)
 	andi.	r0, r5, DABRX_USER | DABRX_KERNEL
 	beq	6f
@@ -2009,7 +2009,7 @@ EXPORT_SYMBOL_GPL(kvmppc_h_set_xdabr)
 6:	li	r3, H_PARAMETER
 	blr
 
-_GLOBAL(kvmppc_h_set_dabr)
+_GLOBAL_TOC(kvmppc_h_set_dabr)
 EXPORT_SYMBOL_GPL(kvmppc_h_set_dabr)
 	li	r5, DABRX_USER | DABRX_KERNEL
 3:
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH 0/3] memblock: cleanup memblock_free interface
From: Linus Torvalds @ 2021-09-23 16:01 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: devicetree, linux-efi, Mike Rapoport, KVM list, linux-s390,
	Linux-sh list, linux-um, Linux Kernel Mailing List, kasan-dev,
	open list:BROADCOM NVRAM DRIVER, Linux-MM, iommu, linux-usb,
	alpha, linux-sparc, xen-devel, Andrew Morton,
	open list:SYNOPSYS ARC ARCHITECTURE, linuxppc-dev, linux-riscv,
	Linux ARM
In-Reply-To: <20210923074335.12583-1-rppt@kernel.org>

On Thu, Sep 23, 2021 at 12:43 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> The core change is in the third patch that makes memblock_free() a
> counterpart of memblock_alloc() and adds memblock_phys_alloc() to be a

^^^^^^^^^^^^^^^^^^^
> counterpart of memblock_phys_alloc().

That should be 'memblock_phys_free()'

HOWEVER.

The real reason I'm replying is that this patch is horribly buggy, and
will cause subtle problems that are nasty to debug.

You need to be a LOT more careful.

From a trivial check - exactly because I looked at doing it with a
script, and decided it's not so easy - I found cases like this:

-               memblock_free(__pa(paca_ptrs) + new_ptrs_size,
+               memblock_free(paca_ptrs + new_ptrs_size,

which is COMPLETELY wrong.

Why? Because now that addition is done as _pointer_ addition, not as
an integer addition, and the end result is something completely
different.

pcac_ptrs is of type 'struct paca_struct **', so when you add
new_ptrs_size to it, it will add it in terms of that many pointers,
not that many bytes.

You need to use some smarter scripting, or some way to validate it.

And no, making the scripting just replace '__pa(x)' with '(void *)(x)'
- which _would_ be mindless and get the same result - is not
acceptable either, because it avoids one of the big improvements from
using the right interface, namely having compiler type checking (and
saner code that people understand).

So NAK. No broken automated scripting patches.

               Linus

^ permalink raw reply

* Re: [PATCH 0/2] kvm: fix KVM_MAX_VCPU_ID handling
From: Paolo Bonzini @ 2021-09-23 16:21 UTC (permalink / raw)
  To: Juergen Gross, kvm, x86, linux-kernel, linux-doc, linux-mips,
	kvm-ppc, linuxppc-dev, linux-kselftest
  Cc: Shuah Khan, Thomas Bogendoerfer, Wanpeng Li, Jonathan Corbet,
	Sean Christopherson, Joerg Roedel, Huacai Chen,
	Aleksandar Markovic, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Paul Mackerras, Vitaly Kuznetsov, Shuah Khan, Thomas Gleixner,
	Jim Mattson
In-Reply-To: <20210913135745.13944-1-jgross@suse.com>

On 13/09/21 15:57, Juergen Gross wrote:
> Revert commit 76b4f357d0e7d8f6f00 which was based on wrong reasoning
> and rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS in order to avoid the
> same issue in future.
> 
> Juergen Gross (2):
>    x86/kvm: revert commit 76b4f357d0e7d8f6f00
>    kvm: rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS
> 
>   Documentation/virt/kvm/devices/xics.rst            | 2 +-
>   Documentation/virt/kvm/devices/xive.rst            | 2 +-
>   arch/mips/kvm/mips.c                               | 2 +-
>   arch/powerpc/include/asm/kvm_book3s.h              | 2 +-
>   arch/powerpc/include/asm/kvm_host.h                | 4 ++--
>   arch/powerpc/kvm/book3s_xive.c                     | 2 +-
>   arch/powerpc/kvm/powerpc.c                         | 2 +-
>   arch/x86/include/asm/kvm_host.h                    | 2 +-
>   arch/x86/kvm/ioapic.c                              | 2 +-
>   arch/x86/kvm/ioapic.h                              | 4 ++--
>   arch/x86/kvm/x86.c                                 | 2 +-
>   include/linux/kvm_host.h                           | 4 ++--
>   tools/testing/selftests/kvm/kvm_create_max_vcpus.c | 2 +-
>   virt/kvm/kvm_main.c                                | 2 +-
>   14 files changed, 17 insertions(+), 17 deletions(-)
> 

Queued, thanks.

Paolo


^ permalink raw reply

* [Bug 213837] "Kernel panic - not syncing: corrupted stack end detected inside scheduler" at building via distcc on a G5
From: bugzilla-daemon @ 2021-09-23 16:29 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-213837-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=213837

--- Comment #9 from Erhard F. (erhard_f@mailbox.org) ---
Created attachment 298933
  --> https://bugzilla.kernel.org/attachment.cgi?id=298933&action=edit
System.map (5.15-rc2 + patch, PowerMac G5 11,2)

(In reply to mpe from comment #8)
> So it looks like you have actually overran your stack, rather than
> something else clobbering your stack.
> 
> Can you attach your System.map for that exact kernel? We might be able
> to work out what functions we were in when we overran.
> 
> You could also try changing CONFIG_THREAD_SHIFT to 15, that might keep
> the system running a bit longer and give us some other clues.
> 
> cheers
Hm, interesting...

What I do to trigger this bug is building llvm-12 on the G5 via distcc (on the
other side is a 16-core Opteron) and MAKEOPTS="-j10 -l3". As the G5 got 16 GiB
RAM building runs in a zstd-compressed ext2 filesystem (/sbin/zram-init -d1 -s2
-azstd -text2 -orelatime -m1777 -Lvar_tmp_dir 49152 /var/tmp). Most of the time
the bug is triggered very shortly after the actual building starts via meson.
At this time the build directory /var/tmp/portage occupies about 800 MiB.

Also sometimes I don't get a proper stack trace via netconsole but this:
BUG: unable to handle kernel data access on write at 0xc000000037c82040
BUG: unable to handle kernel data access on write at 0xc000000037c80000

Please find the relevant System.map attached. I'll do another kernel build with
CONFIG_THREAD_SHIFT=15 and see if anything changes.

Thanks for investigating this!

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.
You are watching someone on the CC list of the bug.

^ permalink raw reply

* [PATCH v2 0/9] cxl_pci refactor for reusability
From: Ben Widawsky @ 2021-09-23 17:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ben Widawsky, Andrew Donnellan, linux-pci, linuxppc-dev, iommu,
	Bjorn Helgaas, David E. Box, Frederic Barrat, Lu Baolu,
	David Woodhouse, Kan Liang

Changes since v1 [3]:
- get_max_afu_index() updated to new interface (Dan)
- siov_find_pci_dvsec updated to new interface (Dan)
- remove unnecessary pci request/release regions with refactor (Dan)
- move @base into cxl_register_map (Dan)
- Expanded the Cc list to include other users of PCIe DVSEC. (Dan)

Three spots are not converted to use the new PCI core DVSEC helper as the
changes are non-trivial.
- find_dvsec_afu_ctrl (ocxl)
- pmt_pci_probe (David)
- intel_uncore_has_discovery_tables (Kan)

The interdiff is below. A range-diff can be generated against v1[3] if desired.
Discussion occurred partially offlist between Dan and myself. To summarize, Dan
contends that patch 4, "cxl/pci: Refactor cxl_pci_setup_regs" should either be
rewrittn, or have a precursor patch that cxl_pci will still scan all register
blocks, but only claim the specified types. While the current diff is somewhat
complex, I contend that this is unneeded churn because we can easily test this
change in our existing regression testing. It also ultimately results in a
simpler function in the cxl_port patch series when cxl_pci stops trying to map
the component register block (loop is removed entirely). A tiebreak here would
be great :-)

---

Original commit message:

Provide the ability to obtain CXL register blocks as discrete functionality.
This functionality will become useful for other CXL drivers that need access to
CXL register blocks. It is also in line with other additions to core which moves
register mapping functionality.

At the introduction of the CXL driver the only user of CXL MMIO was cxl_pci
(then known as cxl_mem). As the driver has evolved it is clear that cxl_pci will
not be the only entity that needs access to CXL MMIO. This series stops short of
moving the generalized functionality into cxl_core for the sake of getting eyes
on the important foundational bits sooner rather than later. The ultimate plan
is to move much of the code into cxl_core.

Via review of two previous patches [1] & [2] it has been suggested that the bits
which are being used for DVSEC enumeration move into PCI core. As CXL core is
soon going to require these, let's try to get the ball rolling now on making
that happen.

---

[1]: https://lore.kernel.org/linux-pci/20210913190131.xiiszmno46qie7v5@intel.com/
[2]: https://lore.kernel.org/linux-cxl/20210920225638.1729482-1-ben.widawsky@intel.com/
[3]: https://lore.kernel.org/linux-cxl/20210921220459.2437386-1-ben.widawsky@intel.com/


Ben Widawsky (9):
  cxl: Convert "RBI" to enum
  cxl/pci: Remove dev_dbg for unknown register blocks
  cxl/pci: Remove pci request/release regions
  cxl/pci: Refactor cxl_pci_setup_regs
  cxl/pci: Make more use of cxl_register_map
  PCI: Add pci_find_dvsec_capability to find designated VSEC
  cxl/pci: Use pci core's DVSEC functionality
  ocxl: Use pci core's DVSEC functionality
  iommu/vt-d: Use pci core's DVSEC functionality

 arch/powerpc/platforms/powernv/ocxl.c |   3 +-
 drivers/cxl/cxl.h                     |   1 +
 drivers/cxl/pci.c                     | 176 ++++++++++++--------------
 drivers/cxl/pci.h                     |  14 +-
 drivers/iommu/intel/iommu.c           |  15 +--
 drivers/misc/ocxl/config.c            |  13 +-
 drivers/pci/pci.c                     |  32 +++++
 include/linux/pci.h                   |   1 +
 8 files changed, 127 insertions(+), 128 deletions(-)

Interdiff against v1:
diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
index 9105efcf242a..28b009b46464 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -107,7 +107,8 @@ static int get_max_afu_index(struct pci_dev *dev, int *afu_idx)
 	int pos;
 	u32 val;
 
-	pos = find_dvsec_from_pos(dev, OCXL_DVSEC_FUNC_ID, 0);
+	pos = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM,
+					OCXL_DVSEC_FUNC_ID);
 	if (!pos)
 		return -ESRCH;
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 7d6b011dd963..3b128ce71564 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -140,6 +140,7 @@ struct cxl_device_reg_map {
 };
 
 struct cxl_register_map {
+	void __iomem *base;
 	u64 block_offset;
 	u8 reg_type;
 	u8 barno;
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 040379f727ad..79d4d9b16d83 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -306,9 +306,8 @@ static int cxl_pci_setup_mailbox(struct cxl_mem *cxlm)
 	return 0;
 }
 
-static void __iomem *cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map)
+static int cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map)
 {
-	void __iomem *addr;
 	int bar = map->barno;
 	struct device *dev = cxlm->dev;
 	struct pci_dev *pdev = to_pci_dev(dev);
@@ -318,24 +317,25 @@ static void __iomem *cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_regis
 	if (pci_resource_len(pdev, bar) < offset) {
 		dev_err(dev, "BAR%d: %pr: too small (offset: %#llx)\n", bar,
 			&pdev->resource[bar], (unsigned long long)offset);
-		return IOMEM_ERR_PTR(-ENXIO);
+		return -ENXIO;
 	}
 
-	addr = pci_iomap(pdev, bar, 0);
-	if (!addr) {
+	map->base = pci_iomap(pdev, bar, 0);
+	if (!map->base) {
 		dev_err(dev, "failed to map registers\n");
-		return addr;
+		return PTR_ERR(map->base);
 	}
 
 	dev_dbg(dev, "Mapped CXL Memory Device resource bar %u @ %#llx\n",
 		bar, offset);
 
-	return addr;
+	return 0;
 }
 
-static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, void __iomem *base)
+static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map)
 {
-	pci_iounmap(to_pci_dev(cxlm->dev), base);
+	pci_iounmap(to_pci_dev(cxlm->dev), map->base);
+	map->base = 0;
 }
 
 static int cxl_pci_dvsec(struct pci_dev *pdev, int dvsec)
@@ -343,10 +343,9 @@ static int cxl_pci_dvsec(struct pci_dev *pdev, int dvsec)
 	return pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL, dvsec);
 }
 
-static int cxl_probe_regs(struct cxl_mem *cxlm, void __iomem *base,
-			  struct cxl_register_map *map)
+static int cxl_probe_regs(struct cxl_mem *cxlm, struct cxl_register_map *map)
 {
-	void __iomem *offset = base + map->block_offset;
+	void __iomem *base = map->base + map->block_offset;
 	struct cxl_component_reg_map *comp_map;
 	struct cxl_device_reg_map *dev_map;
 	struct device *dev = cxlm->dev;
@@ -354,7 +353,7 @@ static int cxl_probe_regs(struct cxl_mem *cxlm, void __iomem *base,
 	switch (map->reg_type) {
 	case CXL_REGLOC_RBI_COMPONENT:
 		comp_map = &map->component_map;
-		cxl_probe_component_regs(dev, offset, comp_map);
+		cxl_probe_component_regs(dev, base, comp_map);
 		if (!comp_map->hdm_decoder.valid) {
 			dev_err(dev, "HDM decoder registers not found\n");
 			return -ENXIO;
@@ -364,7 +363,7 @@ static int cxl_probe_regs(struct cxl_mem *cxlm, void __iomem *base,
 		break;
 	case CXL_REGLOC_RBI_MEMDEV:
 		dev_map = &map->device_map;
-		cxl_probe_device_regs(dev, offset, dev_map);
+		cxl_probe_device_regs(dev, base, dev_map);
 		if (!dev_map->status.valid || !dev_map->mbox.valid ||
 		    !dev_map->memdev.valid) {
 			dev_err(dev, "registers not found: %s%s%s\n",
@@ -418,8 +417,6 @@ static int find_register_block(struct pci_dev *pdev, enum cxl_regloc_type type,
 	int regloc, i, rc = -ENODEV;
 	u32 regloc_size, regblocks;
 
-	memset(map, 0, sizeof(*map));
-
 	regloc = cxl_pci_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC_DVSEC_ID);
 	if (!regloc)
 		return -ENXIO;
@@ -450,8 +447,6 @@ static int find_register_block(struct pci_dev *pdev, enum cxl_regloc_type type,
 		}
 	}
 
-	pci_release_mem_regions(pdev);
-
 	return rc;
 }
 
@@ -473,12 +468,8 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
 	const enum cxl_regloc_type types[] = { CXL_REGLOC_RBI_MEMDEV,
 					       CXL_REGLOC_RBI_COMPONENT };
 
-	if (pci_request_mem_regions(pdev, pci_name(pdev)))
-		return -ENODEV;
-
 	for (i = 0; i < ARRAY_SIZE(types); i++) {
 		struct cxl_register_map map;
-		void __iomem *base;
 
 		rc = find_register_block(pdev, types[i], &map);
 		if (rc) {
@@ -489,14 +480,12 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
 			break;
 		}
 
-		base = cxl_pci_map_regblock(cxlm, &map);
-		if (!base) {
-			rc = -ENOMEM;
+		rc = cxl_pci_map_regblock(cxlm, &map);
+		if (rc)
 			break;
-		}
 
-		rc = cxl_probe_regs(cxlm, base, &map);
-		cxl_pci_unmap_regblock(cxlm, base);
+		rc = cxl_probe_regs(cxlm, &map);
+		cxl_pci_unmap_regblock(cxlm, &map);
 		if (rc)
 			break;
 
@@ -507,7 +496,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
 		}
 	}
 
-	pci_release_mem_regions(pdev);
 	return rc;
 }
 
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d75f59ae28e6..30c97181f0ae 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5398,20 +5398,7 @@ static int intel_iommu_disable_sva(struct device *dev)
  */
 static int siov_find_pci_dvsec(struct pci_dev *pdev)
 {
-	int pos;
-	u16 vendor, id;
-
-	pos = pci_find_next_ext_capability(pdev, 0, 0x23);
-	while (pos) {
-		pci_read_config_word(pdev, pos + 4, &vendor);
-		pci_read_config_word(pdev, pos + 8, &id);
-		if (vendor == PCI_VENDOR_ID_INTEL && id == 5)
-			return pos;
-
-		pos = pci_find_next_ext_capability(pdev, pos, 0x23);
-	}
-
-	return 0;
+	return pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_INTEL, 5);
 }
 
 static bool

base-commit: 18f2a85f5e7db7468530be5048fee5b9cc7a8013
-- 
2.33.0


^ permalink raw reply related

* [PATCH v2 1/9] cxl: Convert "RBI" to enum
From: Ben Widawsky @ 2021-09-23 17:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ben Widawsky, Andrew Donnellan, linux-pci, linuxppc-dev, iommu,
	Bjorn Helgaas, David E. Box, Frederic Barrat, Lu Baolu,
	David Woodhouse, Kan Liang
In-Reply-To: <20210923172647.72738-1-ben.widawsky@intel.com>

In preparation for passing around the Register Block Indicator (RBI) as
a parameter, it is desirable to convert the type to an enum so that the
interface can use a well defined type checked parameter.

As a result of this change, should future versions of the spec add
sparsely defined identifiers, it could become a problem if checking for
invalid identifiers since the code currently checks for the max
identifier. This is not an issue with current spec, and the algorithm to
obtain the register blocks will change before those possible additions
are made.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/pci.h | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/cxl/pci.h b/drivers/cxl/pci.h
index 8c1a58813816..7d3e4bf06b45 100644
--- a/drivers/cxl/pci.h
+++ b/drivers/cxl/pci.h
@@ -20,13 +20,15 @@
 #define CXL_REGLOC_BIR_MASK GENMASK(2, 0)
 
 /* Register Block Identifier (RBI) */
-#define CXL_REGLOC_RBI_MASK GENMASK(15, 8)
-#define CXL_REGLOC_RBI_EMPTY 0
-#define CXL_REGLOC_RBI_COMPONENT 1
-#define CXL_REGLOC_RBI_VIRT 2
-#define CXL_REGLOC_RBI_MEMDEV 3
-#define CXL_REGLOC_RBI_TYPES CXL_REGLOC_RBI_MEMDEV + 1
+enum cxl_regloc_type {
+	CXL_REGLOC_RBI_EMPTY = 0,
+	CXL_REGLOC_RBI_COMPONENT,
+	CXL_REGLOC_RBI_VIRT,
+	CXL_REGLOC_RBI_MEMDEV,
+	CXL_REGLOC_RBI_TYPES
+};
 
+#define CXL_REGLOC_RBI_MASK GENMASK(15, 8)
 #define CXL_REGLOC_ADDR_MASK GENMASK(31, 16)
 
 #endif /* __CXL_PCI_H__ */
-- 
2.33.0


^ permalink raw reply related

* [PATCH v2 2/9] cxl/pci: Remove dev_dbg for unknown register blocks
From: Ben Widawsky @ 2021-09-23 17:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ben Widawsky, Andrew Donnellan, linux-pci, linuxppc-dev, iommu,
	Bjorn Helgaas, David E. Box, Frederic Barrat, Lu Baolu,
	David Woodhouse, Kan Liang
In-Reply-To: <20210923172647.72738-1-ben.widawsky@intel.com>

While interesting to driver developers, the dev_dbg message doesn't do
much except clutter up logs. This information should be attainable
through sysfs, and someday lspci like utilities. This change
additionally helps reduce the LOC in a subsequent patch to refactor some
of cxl_pci register mapping.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/pci.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 64180f46c895..ccc7c2573ddc 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -475,9 +475,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
 		cxl_decode_register_block(reg_lo, reg_hi, &bar, &offset,
 					  &reg_type);
 
-		dev_dbg(dev, "Found register block in bar %u @ 0x%llx of type %u\n",
-			bar, offset, reg_type);
-
 		/* Ignore unknown register block types */
 		if (reg_type > CXL_REGLOC_RBI_MEMDEV)
 			continue;
-- 
2.33.0


^ permalink raw reply related

* [PATCH v2 3/9] cxl/pci: Remove pci request/release regions
From: Ben Widawsky @ 2021-09-23 17:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ben Widawsky, Andrew Donnellan, linux-pci, linuxppc-dev, iommu,
	Bjorn Helgaas, David E. Box, Frederic Barrat, Dan Williams,
	Lu Baolu, David Woodhouse, Kan Liang
In-Reply-To: <20210923172647.72738-1-ben.widawsky@intel.com>

Quoting Dan, "... the request + release regions should probably just be
dropped. It's not like any of the register enumeration would collide
with someone else who already has the registers mapped. The collision
only comes when the registers are mapped for their final usage, and that
will have more precision in the request."

Recommended-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/pci.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index ccc7c2573ddc..7256c236fdb3 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -453,9 +453,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
 		return -ENXIO;
 	}
 
-	if (pci_request_mem_regions(pdev, pci_name(pdev)))
-		return -ENODEV;
-
 	/* Get the size of the Register Locator DVSEC */
 	pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, &regloc_size);
 	regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size);
@@ -499,8 +496,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
 		n_maps++;
 	}
 
-	pci_release_mem_regions(pdev);
-
 	for (i = 0; i < n_maps; i++) {
 		ret = cxl_map_regs(cxlm, &maps[i]);
 		if (ret)
-- 
2.33.0


^ permalink raw reply related

* [PATCH v2 4/9] cxl/pci: Refactor cxl_pci_setup_regs
From: Ben Widawsky @ 2021-09-23 17:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ben Widawsky, Andrew Donnellan, linux-pci, linuxppc-dev, iommu,
	Bjorn Helgaas, David E. Box, Frederic Barrat, Lu Baolu,
	David Woodhouse, Kan Liang
In-Reply-To: <20210923172647.72738-1-ben.widawsky@intel.com>

In preparation for moving parts of register mapping to cxl_core, the
cxl_pci driver is refactored to utilize a new helper to find register
blocks by type.

cxl_pci scanned through all register blocks and mapping the ones that
the driver will use. This logic is inverted so that the driver
specifically requests the register blocks from a new helper. Under the
hood, the same implementation of scanning through all register locator
DVSEC entries exists.

There are 2 behavioral changes (#2 is arguable):
1. A dev_err is introduced if cxl_map_regs fails.
2. The previous logic would try to map component registers and device
   registers multiple times if there were present and keep the mapping
   of the last one found (furthest offset in the register locator).
   While this is disallowed in the spec, CXL 2.0 8.1.9: "Each register
   block identifier shall only occur once in the Register Locator DVSEC
   structure" it was how the driver would respond to the spec violation.
   The new logic will take the first found register block by type and
   move on.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

---
Changes since v1:
---
 drivers/cxl/pci.c | 126 +++++++++++++++++++++++++---------------------
 1 file changed, 70 insertions(+), 56 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 7256c236fdb3..bbbacbc94fbf 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -428,6 +428,45 @@ static void cxl_decode_register_block(u32 reg_lo, u32 reg_hi,
 	*reg_type = FIELD_GET(CXL_REGLOC_RBI_MASK, reg_lo);
 }
 
+static int find_register_block(struct pci_dev *pdev, enum cxl_regloc_type type,
+			       struct cxl_register_map *map)
+{
+	int regloc, i, rc = -ENODEV;
+	u32 regloc_size, regblocks;
+
+	regloc = cxl_pci_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC_DVSEC_ID);
+	if (!regloc)
+		return -ENXIO;
+
+	pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, &regloc_size);
+	regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size);
+
+	regloc += PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET;
+	regblocks = (regloc_size - PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET) / 8;
+
+	for (i = 0; i < regblocks; i++, regloc += 8) {
+		u32 reg_lo, reg_hi;
+		u8 reg_type, bar;
+		u64 offset;
+
+		pci_read_config_dword(pdev, regloc, &reg_lo);
+		pci_read_config_dword(pdev, regloc + 4, &reg_hi);
+
+		cxl_decode_register_block(reg_lo, reg_hi, &bar, &offset,
+					  &reg_type);
+
+		if (reg_type == type) {
+			map->barno = bar;
+			map->block_offset = offset;
+			map->reg_type = reg_type;
+			rc = 0;
+			break;
+		}
+	}
+
+	return rc;
+}
+
 /**
  * cxl_pci_setup_regs() - Setup necessary MMIO.
  * @cxlm: The CXL memory device to communicate with.
@@ -440,69 +479,44 @@ static void cxl_decode_register_block(u32 reg_lo, u32 reg_hi,
  */
 static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
 {
-	void __iomem *base;
-	u32 regloc_size, regblocks;
-	int regloc, i, n_maps, ret = 0;
+	int rc, i;
 	struct device *dev = cxlm->dev;
 	struct pci_dev *pdev = to_pci_dev(dev);
-	struct cxl_register_map *map, maps[CXL_REGLOC_RBI_TYPES];
+	const enum cxl_regloc_type types[] = { CXL_REGLOC_RBI_MEMDEV,
+					       CXL_REGLOC_RBI_COMPONENT };
 
-	regloc = cxl_pci_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC_DVSEC_ID);
-	if (!regloc) {
-		dev_err(dev, "register location dvsec not found\n");
-		return -ENXIO;
-	}
+	for (i = 0; i < ARRAY_SIZE(types); i++) {
+		struct cxl_register_map map;
+		void __iomem *base;
 
-	/* Get the size of the Register Locator DVSEC */
-	pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, &regloc_size);
-	regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size);
-
-	regloc += PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET;
-	regblocks = (regloc_size - PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET) / 8;
-
-	for (i = 0, n_maps = 0; i < regblocks; i++, regloc += 8) {
-		u32 reg_lo, reg_hi;
-		u8 reg_type;
-		u64 offset;
-		u8 bar;
-
-		pci_read_config_dword(pdev, regloc, &reg_lo);
-		pci_read_config_dword(pdev, regloc + 4, &reg_hi);
-
-		cxl_decode_register_block(reg_lo, reg_hi, &bar, &offset,
-					  &reg_type);
-
-		/* Ignore unknown register block types */
-		if (reg_type > CXL_REGLOC_RBI_MEMDEV)
-			continue;
-
-		base = cxl_pci_map_regblock(cxlm, bar, offset);
-		if (!base)
-			return -ENOMEM;
-
-		map = &maps[n_maps];
-		map->barno = bar;
-		map->block_offset = offset;
-		map->reg_type = reg_type;
-
-		ret = cxl_probe_regs(cxlm, base + offset, map);
-
-		/* Always unmap the regblock regardless of probe success */
-		cxl_pci_unmap_regblock(cxlm, base);
-
-		if (ret)
-			return ret;
-
-		n_maps++;
-	}
-
-	for (i = 0; i < n_maps; i++) {
-		ret = cxl_map_regs(cxlm, &maps[i]);
-		if (ret)
+		rc = find_register_block(pdev, types[i], &map);
+		if (rc) {
+			dev_err(dev, "Couldn't find %s register block\n",
+				types[i] == CXL_REGLOC_RBI_MEMDEV ?
+					      "device" :
+					      "component");
 			break;
+		}
+
+		base = cxl_pci_map_regblock(cxlm, map.barno, map.block_offset);
+		if (!base) {
+			rc = -ENOMEM;
+			break;
+		}
+
+		rc = cxl_probe_regs(cxlm, base + map.block_offset, &map);
+		cxl_pci_unmap_regblock(cxlm, base);
+		if (rc)
+			break;
+
+		rc = cxl_map_regs(cxlm, &map);
+		if (rc) {
+			dev_err(dev, "Failed to map CXL registers\n");
+			break;
+		}
 	}
 
-	return ret;
+	return rc;
 }
 
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
-- 
2.33.0


^ permalink raw reply related

* [PATCH v2 5/9] cxl/pci: Make more use of cxl_register_map
From: Ben Widawsky @ 2021-09-23 17:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ben Widawsky, Andrew Donnellan, linux-pci, linuxppc-dev, iommu,
	Bjorn Helgaas, David E. Box, Frederic Barrat, Lu Baolu,
	David Woodhouse, Kan Liang
In-Reply-To: <20210923172647.72738-1-ben.widawsky@intel.com>

The structure exists to pass around information about register mapping.
Using it more extensively cleans up many existing functions.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/cxl.h |  1 +
 drivers/cxl/pci.c | 36 +++++++++++++++++-------------------
 2 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 7d6b011dd963..3b128ce71564 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -140,6 +140,7 @@ struct cxl_device_reg_map {
 };
 
 struct cxl_register_map {
+	void __iomem *base;
 	u64 block_offset;
 	u8 reg_type;
 	u8 barno;
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index bbbacbc94fbf..5eaf2736f779 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -306,35 +306,36 @@ static int cxl_pci_setup_mailbox(struct cxl_mem *cxlm)
 	return 0;
 }
 
-static void __iomem *cxl_pci_map_regblock(struct cxl_mem *cxlm,
-					  u8 bar, u64 offset)
+static int cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map)
 {
-	void __iomem *addr;
+	int bar = map->barno;
 	struct device *dev = cxlm->dev;
 	struct pci_dev *pdev = to_pci_dev(dev);
+	resource_size_t offset = map->block_offset;
 
 	/* Basic sanity check that BAR is big enough */
 	if (pci_resource_len(pdev, bar) < offset) {
 		dev_err(dev, "BAR%d: %pr: too small (offset: %#llx)\n", bar,
 			&pdev->resource[bar], (unsigned long long)offset);
-		return IOMEM_ERR_PTR(-ENXIO);
+		return -ENXIO;
 	}
 
-	addr = pci_iomap(pdev, bar, 0);
-	if (!addr) {
+	map->base = pci_iomap(pdev, bar, 0);
+	if (!map->base) {
 		dev_err(dev, "failed to map registers\n");
-		return addr;
+		return PTR_ERR(map->base);
 	}
 
 	dev_dbg(dev, "Mapped CXL Memory Device resource bar %u @ %#llx\n",
 		bar, offset);
 
-	return addr;
+	return 0;
 }
 
-static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, void __iomem *base)
+static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map)
 {
-	pci_iounmap(to_pci_dev(cxlm->dev), base);
+	pci_iounmap(to_pci_dev(cxlm->dev), map->base);
+	map->base = 0;
 }
 
 static int cxl_pci_dvsec(struct pci_dev *pdev, int dvsec)
@@ -360,9 +361,9 @@ static int cxl_pci_dvsec(struct pci_dev *pdev, int dvsec)
 	return 0;
 }
 
-static int cxl_probe_regs(struct cxl_mem *cxlm, void __iomem *base,
-			  struct cxl_register_map *map)
+static int cxl_probe_regs(struct cxl_mem *cxlm, struct cxl_register_map *map)
 {
+	void __iomem *base = map->base + map->block_offset;
 	struct cxl_component_reg_map *comp_map;
 	struct cxl_device_reg_map *dev_map;
 	struct device *dev = cxlm->dev;
@@ -487,7 +488,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
 
 	for (i = 0; i < ARRAY_SIZE(types); i++) {
 		struct cxl_register_map map;
-		void __iomem *base;
 
 		rc = find_register_block(pdev, types[i], &map);
 		if (rc) {
@@ -498,14 +498,12 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
 			break;
 		}
 
-		base = cxl_pci_map_regblock(cxlm, map.barno, map.block_offset);
-		if (!base) {
-			rc = -ENOMEM;
+		rc = cxl_pci_map_regblock(cxlm, &map);
+		if (rc)
 			break;
-		}
 
-		rc = cxl_probe_regs(cxlm, base + map.block_offset, &map);
-		cxl_pci_unmap_regblock(cxlm, base);
+		rc = cxl_probe_regs(cxlm, &map);
+		cxl_pci_unmap_regblock(cxlm, &map);
 		if (rc)
 			break;
 
-- 
2.33.0


^ permalink raw reply related

* [PATCH v2 6/9] PCI: Add pci_find_dvsec_capability to find designated VSEC
From: Ben Widawsky @ 2021-09-23 17:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ben Widawsky, Andrew Donnellan, linux-pci, Frederic Barrat, iommu,
	Bjorn Helgaas, David E . Box, Jonathan Cameron, Bjorn Helgaas,
	Dan Williams, Kan Liang, linuxppc-dev, David Woodhouse, Lu Baolu
In-Reply-To: <20210923172647.72738-1-ben.widawsky@intel.com>

Add pci_find_dvsec_capability to locate a Designated Vendor-Specific
Extended Capability with the specified DVSEC ID.

The Designated Vendor-Specific Extended Capability (DVSEC) allows one or
more vendor specific capabilities that aren't tied to the vendor ID of
the PCI component.

DVSEC is critical for both the Compute Express Link (CXL) driver as well
as the driver for OpenCAPI coherent accelerator (OCXL).

Cc: David E. Box <david.e.box@linux.intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/pci/pci.c   | 32 ++++++++++++++++++++++++++++++++
 include/linux/pci.h |  1 +
 2 files changed, 33 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ce2ab62b64cf..94ac86ff28b0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -732,6 +732,38 @@ u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap)
 }
 EXPORT_SYMBOL_GPL(pci_find_vsec_capability);
 
+/**
+ * pci_find_dvsec_capability - Find DVSEC for vendor
+ * @dev: PCI device to query
+ * @vendor: Vendor ID to match for the DVSEC
+ * @dvsec: Designated Vendor-specific capability ID
+ *
+ * If DVSEC has Vendor ID @vendor and DVSEC ID @dvsec return the capability
+ * offset in config space; otherwise return 0.
+ */
+u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec)
+{
+	int pos;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);
+	if (!pos)
+		return 0;
+
+	while (pos) {
+		u16 v, id;
+
+		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &v);
+		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &id);
+		if (vendor == v && dvsec == id)
+			return pos;
+
+		pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_find_dvsec_capability);
+
 /**
  * pci_find_parent_resource - return resource region of parent bus of given
  *			      region
diff --git a/include/linux/pci.h b/include/linux/pci.h
index cd8aa6fce204..c93ccfa4571b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1130,6 +1130,7 @@ u16 pci_find_ext_capability(struct pci_dev *dev, int cap);
 u16 pci_find_next_ext_capability(struct pci_dev *dev, u16 pos, int cap);
 struct pci_bus *pci_find_next_bus(const struct pci_bus *from);
 u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap);
+u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec);
 
 u64 pci_get_dsn(struct pci_dev *dev);
 
-- 
2.33.0


^ permalink raw reply related

* [PATCH v2 7/9] cxl/pci: Use pci core's DVSEC functionality
From: Ben Widawsky @ 2021-09-23 17:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ben Widawsky, Andrew Donnellan, linux-pci, linuxppc-dev, iommu,
	Bjorn Helgaas, David E. Box, Frederic Barrat, Lu Baolu,
	David Woodhouse, Kan Liang
In-Reply-To: <20210923172647.72738-1-ben.widawsky@intel.com>

Reduce maintenance burden of DVSEC query implementation by using the
centralized PCI core implementation.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/pci.c | 20 +-------------------
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 5eaf2736f779..79d4d9b16d83 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -340,25 +340,7 @@ static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, struct cxl_register_map
 
 static int cxl_pci_dvsec(struct pci_dev *pdev, int dvsec)
 {
-	int pos;
-
-	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DVSEC);
-	if (!pos)
-		return 0;
-
-	while (pos) {
-		u16 vendor, id;
-
-		pci_read_config_word(pdev, pos + PCI_DVSEC_HEADER1, &vendor);
-		pci_read_config_word(pdev, pos + PCI_DVSEC_HEADER2, &id);
-		if (vendor == PCI_DVSEC_VENDOR_ID_CXL && dvsec == id)
-			return pos;
-
-		pos = pci_find_next_ext_capability(pdev, pos,
-						   PCI_EXT_CAP_ID_DVSEC);
-	}
-
-	return 0;
+	return pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL, dvsec);
 }
 
 static int cxl_probe_regs(struct cxl_mem *cxlm, struct cxl_register_map *map)
-- 
2.33.0


^ permalink raw reply related

* [PATCH v2 8/9] ocxl: Use pci core's DVSEC functionality
From: Ben Widawsky @ 2021-09-23 17:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ben Widawsky, Andrew Donnellan, David Woodhouse, iommu,
	Bjorn Helgaas, David E. Box, linux-pci, Frederic Barrat, Lu Baolu,
	linuxppc-dev, Kan Liang
In-Reply-To: <20210923172647.72738-1-ben.widawsky@intel.com>

Reduce maintenance burden of DVSEC query implementation by using the
centralized PCI core implementation.

There are two obvious places to simply drop in the new core
implementation. There remains find_dvsec_from_pos() which would benefit
from using a core implementation. As that change is less trivial it is
reserved for later.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Acked-by: Frederic Barrat <fbarrat@linux.ibm.com> (v1)
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 arch/powerpc/platforms/powernv/ocxl.c |  3 ++-
 drivers/misc/ocxl/config.c            | 13 +------------
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
index 9105efcf242a..28b009b46464 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -107,7 +107,8 @@ static int get_max_afu_index(struct pci_dev *dev, int *afu_idx)
 	int pos;
 	u32 val;
 
-	pos = find_dvsec_from_pos(dev, OCXL_DVSEC_FUNC_ID, 0);
+	pos = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM,
+					OCXL_DVSEC_FUNC_ID);
 	if (!pos)
 		return -ESRCH;
 
diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index a68738f38252..e401a51596b9 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -33,18 +33,7 @@
 
 static int find_dvsec(struct pci_dev *dev, int dvsec_id)
 {
-	int vsec = 0;
-	u16 vendor, id;
-
-	while ((vsec = pci_find_next_ext_capability(dev, vsec,
-						    OCXL_EXT_CAP_ID_DVSEC))) {
-		pci_read_config_word(dev, vsec + OCXL_DVSEC_VENDOR_OFFSET,
-				&vendor);
-		pci_read_config_word(dev, vsec + OCXL_DVSEC_ID_OFFSET, &id);
-		if (vendor == PCI_VENDOR_ID_IBM && id == dvsec_id)
-			return vsec;
-	}
-	return 0;
+	return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM, dvsec_id);
 }
 
 static int find_dvsec_afu_ctrl(struct pci_dev *dev, u8 afu_idx)
-- 
2.33.0


^ permalink raw reply related

* [PATCH v2 9/9] iommu/vt-d: Use pci core's DVSEC functionality
From: Ben Widawsky @ 2021-09-23 17:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ben Widawsky, Andrew Donnellan, linux-pci, linuxppc-dev, iommu,
	Bjorn Helgaas, David E. Box, Frederic Barrat, Kan Liang,
	David Woodhouse, Lu Baolu
In-Reply-To: <20210923172647.72738-1-ben.widawsky@intel.com>

Reduce maintenance burden of DVSEC query implementation by using the
centralized PCI core implementation.

Cc: iommu@lists.linux-foundation.org
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/iommu/intel/iommu.c | 15 +--------------
 1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d75f59ae28e6..30c97181f0ae 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5398,20 +5398,7 @@ static int intel_iommu_disable_sva(struct device *dev)
  */
 static int siov_find_pci_dvsec(struct pci_dev *pdev)
 {
-	int pos;
-	u16 vendor, id;
-
-	pos = pci_find_next_ext_capability(pdev, 0, 0x23);
-	while (pos) {
-		pci_read_config_word(pdev, pos + 4, &vendor);
-		pci_read_config_word(pdev, pos + 8, &id);
-		if (vendor == PCI_VENDOR_ID_INTEL && id == 5)
-			return pos;
-
-		pos = pci_find_next_ext_capability(pdev, pos, 0x23);
-	}
-
-	return 0;
+	return pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_INTEL, 5);
 }
 
 static bool
-- 
2.33.0


^ permalink raw reply related

* Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
From: Srikar Dronamraju @ 2021-09-23 17:57 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	kernel test robot, Peter Zijlstra, Geetika Moolchandani,
	Valentin Schneider, Laurent Dufour, linuxppc-dev, Ingo Molnar
In-Reply-To: <871r5fmth6.fsf@mpe.ellerman.id.au>

* Michael Ellerman <mpe@ellerman.id.au> [2021-09-23 21:17:25]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > * Michael Ellerman <mpe@ellerman.id.au> [2021-08-26 23:36:53]:
> >
> >> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> >> > Scheduler expects unique number of node distances to be available at
> >> > boot.
> ...
> >
> >> > Fake the offline node's distance_lookup_table entries so that all
> >> > possible node distances are updated.
> >>
> >> Does this work if we have a single node offline at boot?
> >>
> >
> > It should.
> >
> >> Say we start with:
> >>
> >> node distances:
> >> node   0   1
> >>   0:  10  20
> >>   1:  20  10
> >>
> >> And node 2 is offline at boot. We can only initialise that nodes entries
> >> in the distance_lookup_table:
> >>
> >> 		while (i--)
> >> 			distance_lookup_table[node][i] = node;
> >>
> >> By filling them all with 2 that causes node_distance(2, X) to return the
> >> maximum distance for all other nodes X, because we won't break out of
> >> the loop in __node_distance():
> >>
> >> 	for (i = 0; i < distance_ref_points_depth; i++) {
> >> 		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
> >> 			break;
> >>
> >> 		/* Double the distance for each NUMA level */
> >> 		distance *= 2;
> >> 	}
> >>
> >> If distance_ref_points_depth was 4 we'd return 160.
> >
> > As you already know, distance 10, 20, .. are defined by Powerpc, form1
> > affinity. PAPR doesn't define actual distances, it only provides us the
> > associativity. If there are distance_ref_points_depth is 4,
> > (distance_ref_points_depth doesn't take local distance into consideration)
> > 10, 20, 40, 80, 160.
> >
> >>
> >> That'd leave us with 3 unique distances at boot, 10, 20, 160.
> >>
> >
> > So if there are unique distances, then the distances as per the current
> > code has to be 10, 20, 40, 80.. I dont see a way in which we have a break in
> > the series. like having 160 without 80.
>
> I'm confused what you mean there.
>

At the outset, if we have a better probable solution, do let me know, I am
willing to try that too.

> If we have a node that's offline at boot then we get 160 for that node,
> that's just the result of having no info for it, so we never break out
> of the for loop.
>
> So if we have two nodes, one hop apart, and then an offline node we get
> 10, 20, 160.
>
> Or if you're using depth = 3 then it's 10, 20, 80.
>

My understanding is as below:

device-tree provides the max hops by way of
ibm,associativity-reference-points. This is mapped to
distance_ref_points_depth in Linux-powerpc.

Now Linux-powerpc encodes hops as (dis-regarding local distance) 20, 40, 80,
160, 320 ...
So if the distance_ref_points_depth is 3, then the hops are 20, 40, 80.

Do you disagree?


> >> But when node 2 comes online it might introduce more than 1 new distance
> >> value, eg. it could be that the actual distances are:
> >>
> >> node distances:
> >> node   0   1   2
> >>   0:  10  20  40
> >>   1:  20  10  80
> >>   2:  40  80  10
> >>
> >> ie. we now have 4 distances, 10, 20, 40, 80.
> >>
> >> What am I missing?
> >
> > As I said above, I am not sure how we can have a break in the series.
> > If distance_ref_points_depth is 3, the distances has to be 10,20,40,80 as
> > atleast for form1 affinity.
>
> I agree for depth 3 we have to see 10, 20, 40, 80. But nothing
> guarantees we see each value (other than 10).

The hop distances are not from the device-tree, the device-tree only gives
us the max hops possible. Linux-powerpc is actually hard-coding the
distances which each hop distance being 2x the previous.
So we may not see any nodes at a particular hop, but we know maximum hops.
And if distance_ref_points_depth is 3, then hops are 20, 40, 80 only.

>
> We can have two nodes one hop apart, so we have 10 and 20, then a third
> node is added 3 hops away, so we get 10, 20, 80.
>

> The real problem is that the third node could be 3 hops from node 0
> and 2 hops from node 1, and so the addition of the third node causes
> two new distance values (40 & 80) to be required.

So here the max hops as given by device-tree is 3. So we know that we are
looking for max-distance of 80 by way of distance_ref_points_depth.

Even if the 3rd node was at 4 hops, we would already know the max distance
of 160, by way of distance_ref_points_depth. However in the most unlikely
scenarios where the number of possible nodes are less than the
distance_ref_points_depth(aka max hops) + there are CPUless/memoryless nodes
we may not have initialized to the right distances.

>
> I think maybe what you're saying is that in practice we don't see setups
> like that. But I don't know if I'm happy with a solution that doesn't
> work in the general case, and relies on the particular properties of our
> current set of systems.
>

But our current set of systems are having a problem (Systems can likely
crash on adding a CPU to a node.)  The only other way I can think of is the
previous approach were we ask scheduler hook which tells how many unique
node distances are possible. But then it was stuck down because, we didnt
want to add a hook just for one arch.

However isn't this is much much better than the current situation we are in?
i.e This is not going to cause any regression for the other setups.

> Possibly we just need to detect that case and WARN about it. The only
> problem is we won't know until the system is already up and running, ie.
> we can't know at boot that the onlining of the third node will cause 2
> new distance values to be added.
>

Yes, We should be able to detect this very easily.
At the end of the function (v2 or v3) if cur_depth != max_depth then we
havent initialized all possible node distances.

> cheers

--
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH] powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()
From: Srikar Dronamraju @ 2021-09-23 18:02 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Nathan Lynch, linuxppc-dev, npiggin
In-Reply-To: <874kabn40z.fsf@mpe.ellerman.id.au>

* Michael Ellerman <mpe@ellerman.id.au> [2021-09-23 17:29:32]:

> Nathan Lynch <nathanl@linux.ibm.com> writes:
> > Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> >
> >> * Nathan Lynch <nathanl@linux.ibm.com> [2021-09-22 11:01:12]:
> >>
> >>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> >>> > * Nathan Lynch <nathanl@linux.ibm.com> [2021-09-20 22:12:13]:
> >>> >
> >>> >> vcpu_is_preempted() can be used outside of preempt-disabled critical
> >>> >> sections, yielding warnings such as:
> >>> >> 
> >>> >> BUG: using smp_processor_id() in preemptible [00000000] code: systemd-udevd/185
> >>> >> caller is rwsem_spin_on_owner+0x1cc/0x2d0
> >>> >> CPU: 1 PID: 185 Comm: systemd-udevd Not tainted 5.15.0-rc2+ #33
> >>> >> Call Trace:
> >>> >> [c000000012907ac0] [c000000000aa30a8] dump_stack_lvl+0xac/0x108 (unreliable)
> >>> >> [c000000012907b00] [c000000001371f70] check_preemption_disabled+0x150/0x160
> >>> >> [c000000012907b90] [c0000000001e0e8c] rwsem_spin_on_owner+0x1cc/0x2d0
> >>> >> [c000000012907be0] [c0000000001e1408] rwsem_down_write_slowpath+0x478/0x9a0
> >>> >> [c000000012907ca0] [c000000000576cf4] filename_create+0x94/0x1e0
> >>> >> [c000000012907d10] [c00000000057ac08] do_symlinkat+0x68/0x1a0
> >>> >> [c000000012907d70] [c00000000057ae18] sys_symlink+0x58/0x70
> >>> >> [c000000012907da0] [c00000000002e448] system_call_exception+0x198/0x3c0
> >>> >> [c000000012907e10] [c00000000000c54c] system_call_common+0xec/0x250
> >>> >> 
> >>> >> The result of vcpu_is_preempted() is always subject to invalidation by
> >>> >> events inside and outside of Linux; it's just a best guess at a point in
> >>> >> time. Use raw_smp_processor_id() to avoid such warnings.
> >>> >
> >>> > Typically smp_processor_id() and raw_smp_processor_id() except for the
> >>> > CONFIG_DEBUG_PREEMPT.
> >>> 
> >>> Sorry, I don't follow...
> >>
> >> I meant, Unless CONFIG_DEBUG_PREEMPT, smp_processor_id() is defined as
> >> raw_processor_id().
> >>
> >>> 
> >>> > In the CONFIG_DEBUG_PREEMPT case, smp_processor_id()
> >>> > is actually debug_smp_processor_id(), which does all the checks.
> >>> 
> >>> Yes, OK.
> >>> 
> >>> > I believe these checks in debug_smp_processor_id() are only valid for x86
> >>> > case (aka cases were they have __smp_processor_id() defined.)
> >>> 
> >>> Hmm, I am under the impression that the checks in
> >>> debug_smp_processor_id() are valid regardless of whether the arch
> >>> overrides __smp_processor_id().
> >>
> >> From include/linux/smp.h
> >>
> >> /*
> >>  * Allow the architecture to differentiate between a stable and unstable read.
> >>  * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
> >>  * regular asm read for the stable.
> >>  */
> >> #ifndef __smp_processor_id
> >> #define __smp_processor_id(x) raw_smp_processor_id(x)
> >> #endif
> >>
> >> As far as I see, only x86 has a definition of __smp_processor_id.
> >> So for archs like Powerpc, __smp_processor_id(), is always
> >> defined as raw_smp_processor_id(). Right?
> >
> > Sure, yes.
> >
> >> I would think debug_smp_processor_id() would be useful if __smp_processor_id()
> >> is different from raw_smp_processor_id(). Do note debug_smp_processor_id() 
> >> calls raw_smp_processor_id().
> 
> Agree.
> 
> > I do not think the utility of debug_smp_processor_id() is related to
> > whether the arch defines __smp_processor_id().
> >
> >> Or can I understand how debug_smp_processor_id() is useful if
> >> __smp_processor_id() is defined as raw_smp_processor_id()?
> 
> debug_smp_processor_id() is useful on powerpc, as well as other arches,
> because it checks that we're in a context where the processor id won't
> change out from under us.
> 
> eg. something like this is unsafe:
> 
>   int counts[NR_CPUS];
>   int tmp, cpu;
>   
>   cpu = smp_processor_id();
>   tmp = counts[cpu];
>   				<- preempted here and migrated to another CPU
>   counts[cpu] = tmp + 1;
> 

If lets say the above call was replaced by raw_smp_processor_id(), how would
it avoid the preemption / migration to another CPU?

Replacing it with raw_smp_processor_id() may avoid, the debug splat but the
underlying issue would still remain as is. No?

> 
> > So, for powerpc with DEBUG_PREEMPT unset, a call to smp_procesor_id()
> > expands to __smp_processor_id() which expands to raw_smp_processor_id(),
> > avoiding the preempt safety checks. This is working as intended.
> >
> > For powerpc with DEBUG_PREEMPT=y, a call to smp_processor_id() expands
> > to the out of line call to debug_smp_processor_id(), which calls
> > raw_smp_processor_id() and performs the checks, warning if called in an
> > inappropriate context, as seen here. Also working as intended.
> >
> > AFAICT __smp_processor_id() is a performance/codegen-oriented hook, and
> > not really related to the debug facility. Please see 9ed7d75b2f09d
> > ("x86/percpu: Relax smp_processor_id()").
> 
> Yeah good find.
> 
> cheers

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox