LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v2 1/4] powerpc: Remove flush_instruction_cache for book3s/32
From: Michael Ellerman @ 2020-09-09 13:27 UTC (permalink / raw)
  To: hch, Michael Ellerman, Christophe Leroy, Paul Mackerras,
	Benjamin Herrenschmidt
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <50098f49877cea0f46730a9df82dcabf84160e4b.1597384512.git.christophe.leroy@csgroup.eu>

On Fri, 14 Aug 2020 05:56:24 +0000 (UTC), Christophe Leroy wrote:
> The only callers of flush_instruction_cache() are:
> 
> arch/powerpc/kernel/swsusp_booke.S:	bl flush_instruction_cache
> arch/powerpc/mm/nohash/40x.c:	flush_instruction_cache();
> arch/powerpc/mm/nohash/44x.c:	flush_instruction_cache();
> arch/powerpc/mm/nohash/fsl_booke.c:	flush_instruction_cache();
> arch/powerpc/platforms/44x/machine_check.c:			flush_instruction_cache();
> arch/powerpc/platforms/44x/machine_check.c:		flush_instruction_cache();
> 
> [...]

Applied to powerpc/next.

[1/4] powerpc: Remove flush_instruction_cache for book3s/32
      https://git.kernel.org/powerpc/c/e426ab39f41045a4c163031272b2f48d944b69c0
[2/4] powerpc: Move flush_instruction_cache() prototype in asm/cacheflush.h
      https://git.kernel.org/powerpc/c/f663f3312051402d32952c44d156a20c0b854753
[3/4] powerpc: Rewrite 4xx flush_cache_instruction() in C
      https://git.kernel.org/powerpc/c/de39b19452e784de5f90ae899851ab29a29bb42c
[4/4] powerpc: Rewrite FSL_BOOKE flush_cache_instruction() in C
      https://git.kernel.org/powerpc/c/704dfe931df951895dea98bd1d9cacbb601b6451

cheers

^ permalink raw reply

* Re: [PATCH v1] powerpc/process: Remove unnecessary #ifdef CONFIG_FUNCTION_GRAPH_TRACER
From: Michael Ellerman @ 2020-09-09 13:27 UTC (permalink / raw)
  To: Christophe Leroy, Paul Mackerras, Michael Ellerman,
	Benjamin Herrenschmidt
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <9d11143d4e27ba8274369a926968756917584868.1597643153.git.christophe.leroy@csgroup.eu>

On Mon, 17 Aug 2020 05:46:39 +0000 (UTC), Christophe Leroy wrote:
> ftrace_graph_ret_addr() is always defined and returns 'ip' when
> CONFIG_FUNCTION GRAPH_TRACER is not set.
> 
> So the #ifdef is not needed, remove it.

Applied to powerpc/next.

[1/1] powerpc/process: Remove unnecessary #ifdef CONFIG_FUNCTION_GRAPH_TRACER
      https://git.kernel.org/powerpc/c/353bce211e00d183344f464ba1ee0e1ffb0e2a6c

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/irq: Drop forward declaration of struct irqaction
From: Michael Ellerman @ 2020-09-09 13:27 UTC (permalink / raw)
  To: Christophe Leroy, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <e0bcdabac45fcd26c02d7df273bd4a5827c6033d.1596716375.git.christophe.leroy@csgroup.eu>

On Thu, 6 Aug 2020 12:19:46 +0000 (UTC), Christophe Leroy wrote:
> Since the commit identified below, the forward declaration of
> struct irqaction is useless. Drop it.

Applied to powerpc/next.

[1/1] powerpc/irq: Drop forward declaration of struct irqaction
      https://git.kernel.org/powerpc/c/b134cfc3e3276ccd5d29e39de5c848a45b08e410

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/32s: Fix assembler warning about r0
From: Michael Ellerman @ 2020-09-09 13:27 UTC (permalink / raw)
  To: Christophe Leroy, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <2b69ac8e1cddff6f808fc7415907179eab4aae9e.1596693679.git.christophe.leroy@csgroup.eu>

On Thu, 6 Aug 2020 06:01:42 +0000 (UTC), Christophe Leroy wrote:
> The assembler says:
>   arch/powerpc/kernel/head_32.S:1095: Warning: invalid register expression
> 
> It's objecting to the use of r0 as the RA argument. That's because
> when RA = 0 the literal value 0 is used, rather than the content of
> r0, making the use of r0 in the source potentially confusing.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/32s: Fix assembler warning about r0
      https://git.kernel.org/powerpc/c/b51ba4fe2e134b631f9c8f45423707aab71449b5

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/hwirq: Remove stale forward irq_chip declaration
From: Michael Ellerman @ 2020-09-09 13:27 UTC (permalink / raw)
  To: Christophe Leroy, Paul Mackerras, Michael Ellerman,
	Benjamin Herrenschmidt
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <fbe58d27cf128d5fe581e4510ded8701858f268e.1596716328.git.christophe.leroy@csgroup.eu>

On Thu, 6 Aug 2020 12:19:06 +0000 (UTC), Christophe Leroy wrote:
> Since commit identified below, the forward declaration of
> struct irq_chip is useless (was struct hw_interrupt_type at that time)
> 
> Remove it, together with the associated comment.

Applied to powerpc/next.

[1/1] powerpc/hwirq: Remove stale forward irq_chip declaration
      https://git.kernel.org/powerpc/c/169b9afee572853522901b7cbf34842c0494a887

cheers

^ permalink raw reply

* Re: [PATCH 1/2] powerpc/fpu: Drop cvt_fd() and cvt_df()
From: Michael Ellerman @ 2020-09-09 13:27 UTC (permalink / raw)
  To: Christophe Leroy, Paul Mackerras, Michael Ellerman,
	Benjamin Herrenschmidt
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <d5641ada199b8dd2af16ad00a66084cf974f2704.1596716418.git.christophe.leroy@csgroup.eu>

On Thu, 6 Aug 2020 12:20:34 +0000 (UTC), Christophe Leroy wrote:
> Those two functions have been unused since commit identified below.
> Drop them.

Applied to powerpc/next.

[1/2] powerpc/fpu: Drop cvt_fd() and cvt_df()
      https://git.kernel.org/powerpc/c/63442de4301188129e1fcff144fbfb966ad5eb19
[2/2] powerpc: drop hard_reset_now() and poweroff_now() declaration
      https://git.kernel.org/powerpc/c/82eb1792426f8a171cdaa6cfccb63c39f55bc9bd

cheers

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/perf: consolidate GPCI hcall structs into asm/hvcall.h
From: Michael Ellerman @ 2020-09-09 13:27 UTC (permalink / raw)
  To: Scott Cheloha, linuxppc-dev; +Cc: Nathan Lynch, Tyrel Datwylder
In-Reply-To: <20200727184605.2945095-1-cheloha@linux.ibm.com>

On Mon, 27 Jul 2020 13:46:04 -0500, Scott Cheloha wrote:
> The H_GetPerformanceCounterInfo (GPCI) hypercall input/output structs are
> useful to modules outside of perf/, so move them into asm/hvcall.h to live
> alongside the other powerpc hypercall structs.
> 
> Leave the perf-specific GPCI stuff in perf/hv-gpci.h.

Applied to powerpc/next.

[1/2] powerpc/perf: consolidate GPCI hcall structs into asm/hvcall.h
      https://git.kernel.org/powerpc/c/59562b5c33d6ff3685509ed58b2ed3c5b5712704
[2/2] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
      https://git.kernel.org/powerpc/c/5d1bc776428f34941a6237afb9454061b5b5e1e1

cheers

^ permalink raw reply

* Re: [PATCH v3] pseries/drmem: don't cache node id in drmem_lmb struct
From: Michael Ellerman @ 2020-09-09 13:27 UTC (permalink / raw)
  To: Scott Cheloha, linuxppc-dev
  Cc: Nathan Lynch, Laurent Dufour, Michal Suchanek, Rick Lindsley,
	David Hildenbrand
In-Reply-To: <20200811015115.63677-1-cheloha@linux.ibm.com>

On Mon, 10 Aug 2020 20:51:15 -0500, Scott Cheloha wrote:
> At memory hot-remove time we can retrieve an LMB's nid from its
> corresponding memory_block.  There is no need to store the nid
> in multiple locations.
> 
> Note that lmb_to_memblock() uses find_memory_block() to get the
> corresponding memory_block.  As find_memory_block() runs in sub-linear
> time this approach is negligibly slower than what we do at present.
> 
> [...]

Applied to powerpc/next.

[1/1] pseries/drmem: don't cache node id in drmem_lmb struct
      https://git.kernel.org/powerpc/c/e5e179aa3a39c818db8fbc2dce8d2cd24adaf657

cheers

^ permalink raw reply

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
From: Gerald Schaefer @ 2020-09-09 12:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens,
	Arnd Bergmann, John Hubbard, Jeff Dike, linux-um, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML,
	Andrew Morton, Linus Torvalds, Mike Rapoport
In-Reply-To: <0dbc6ec8-45ea-0853-4856-2bc1e661a5a5@intel.com>

On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

Sorry, I guess my previous reply does not really explain "what the heck
is dynamic page table folding?".

On s390, we can have different number of page table levels for different
processes / mms. We always start with 3 levels, and update dynamically
on process demand to 4 or 5 levels, hence the dynamic folding. Still,
the PxD_SIZE/SHIFT is defined statically, so that e.g. pXd_addr_end() will
not reflect this dynamic behavior.

For the various pagetable walkers using pXd_addr_end() (w/o READ_ONCE
logic) this is no problem. With static folding, iteration over the folded
levels will always happen at pgd level (top-level folding). For s390,
we stay at the respective level and iterate there (dynamic middle-level
folding), only return to pgd level if there really were 5 levels.

This only works well as long there are real pagetable pointers involved,
that can also be used for iteration. For gup_fast, or any other future
pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
There are pointers involved to local pXd values on the stack, because of
the READ_ONCE logic, and our middle-level iteration will suddenly iterate
over such stack pointers instead of pagetable pointers.

This will be addressed by making the pXd_addr_end() dynamic, for which
we need to see the pXd value in order to determine its level / type.

^ permalink raw reply

* Re: [PATCH v2 3/7] mm/memory_hotplug: prepare passing flags to add_memory() and friends
From: David Hildenbrand @ 2020-09-09 11:51 UTC (permalink / raw)
  To: Michael Ellerman, Greg Kroah-Hartman
  Cc: linux-hyperv, Michal Hocko, Michael S. Tsirkin, Jason Wang,
	Pingfan Liu, virtualization, linux-mm, Paul Mackerras,
	K. Y. Srinivasan, Boris Ostrovsky, linux-s390, Wei Liu,
	Stefano Stabellini, Dave Jiang, Baoquan He, Jason Gunthorpe,
	Vishal Verma, linux-acpi, xen-devel, Heiko Carstens, Len Brown,
	Nathan Lynch, Vasily Gorbik, Leonardo Bras, Haiyang Zhang,
	Stephen Hemminger, Dan Williams, Christian Borntraeger,
	Juergen Gross, Pankaj Gupta, Libor Pechacek, linux-nvdimm,
	Rafael J. Wysocki, linux-kernel, Wei Yang, Oliver O'Halloran,
	Andrew Morton, linuxppc-dev
In-Reply-To: <5145c5c4-d9c0-85a8-7e0b-ccfa03eb0427@redhat.com>

On 09.09.20 13:37, David Hildenbrand wrote:
> On 09.09.20 13:24, Michael Ellerman wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>> On 09.09.20 09:17, Greg Kroah-Hartman wrote:
>>>> On Tue, Sep 08, 2020 at 10:10:08PM +0200, David Hildenbrand wrote:
>>>>> We soon want to pass flags, e.g., to mark added System RAM resources.
>>>>> mergeable. Prepare for that.
>>>>
>>>> What are these random "flags", and how do we know what should be passed
>>>> to them?
>>>>
>>>> Why not make this an enumerated type so that we know it all works
>>>> properly, like the GPF_* flags are?  Passing around a random unsigned
>>>> long feels very odd/broken...
>>>
>>> Agreed, an enum (mhp_flags) seems to give a better hint what can
>>> actually be passed. Thanks!
>>
>> You probably know this but ...
>>
>> Just using a C enum doesn't get you any type safety.
>>
>> You can get some checking via sparse by using __bitwise, which is what
>> gfp_t does. You don't actually have to use an enum for that, it works
>> with #defines also.
> 
> Yeah, we seem to be using different approaches. And there is always a
> way to mess things up :)
> 
> gfp_t is one (extreme) example, enum memblock_flags is another example.
> I tend to prefer an enum in this particular case, because it's simple
> and at least tells the user which values are expected.
> 

Gave it another try, looks like mhp_t (like gfp_t) is actually nicer.

-- 
Thanks,

David / dhildenb


^ permalink raw reply

* Re: [PATCH v2 3/7] mm/memory_hotplug: prepare passing flags to add_memory() and friends
From: David Hildenbrand @ 2020-09-09 11:37 UTC (permalink / raw)
  To: Michael Ellerman, Greg Kroah-Hartman
  Cc: linux-hyperv, Michal Hocko, Michael S. Tsirkin, Jason Wang,
	Pingfan Liu, virtualization, linux-mm, Paul Mackerras,
	K. Y. Srinivasan, Boris Ostrovsky, linux-s390, Wei Liu,
	Stefano Stabellini, Dave Jiang, Baoquan He, Jason Gunthorpe,
	Vishal Verma, linux-acpi, xen-devel, Heiko Carstens, Len Brown,
	Nathan Lynch, Vasily Gorbik, Leonardo Bras, Haiyang Zhang,
	Stephen Hemminger, Dan Williams, Christian Borntraeger,
	Juergen Gross, Pankaj Gupta, Libor Pechacek, linux-nvdimm,
	Rafael J. Wysocki, linux-kernel, Wei Yang, Oliver O'Halloran,
	Andrew Morton, linuxppc-dev
In-Reply-To: <87eenbry5p.fsf@mpe.ellerman.id.au>

On 09.09.20 13:24, Michael Ellerman wrote:
> David Hildenbrand <david@redhat.com> writes:
>> On 09.09.20 09:17, Greg Kroah-Hartman wrote:
>>> On Tue, Sep 08, 2020 at 10:10:08PM +0200, David Hildenbrand wrote:
>>>> We soon want to pass flags, e.g., to mark added System RAM resources.
>>>> mergeable. Prepare for that.
>>>
>>> What are these random "flags", and how do we know what should be passed
>>> to them?
>>>
>>> Why not make this an enumerated type so that we know it all works
>>> properly, like the GPF_* flags are?  Passing around a random unsigned
>>> long feels very odd/broken...
>>
>> Agreed, an enum (mhp_flags) seems to give a better hint what can
>> actually be passed. Thanks!
> 
> You probably know this but ...
> 
> Just using a C enum doesn't get you any type safety.
> 
> You can get some checking via sparse by using __bitwise, which is what
> gfp_t does. You don't actually have to use an enum for that, it works
> with #defines also.

Yeah, we seem to be using different approaches. And there is always a
way to mess things up :)

gfp_t is one (extreme) example, enum memblock_flags is another example.
I tend to prefer an enum in this particular case, because it's simple
and at least tells the user which values are expected.

Thoughts?

> 
> Or you can wrap the flag in a struct, the way atomic_t does, and then
> the compiler will prevent passing plain integers in place of your custom
> type.



-- 
Thanks,

David / dhildenb


^ permalink raw reply

* Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes
From: Gerald Schaefer @ 2020-09-09 11:36 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-s390@vger.kernel.org, Aneesh Kumar K.V, linux-mm,
	Vineet Gupta, akpm, linux-snps-arc@lists.infradead.org,
	linuxppc-dev, linux-riscv, Gerald Schaefer
In-Reply-To: <d4199cd4-e042-7a05-8a78-703eae958589@arm.com>

On Wed, 9 Sep 2020 13:38:25 +0530
Anshuman Khandual <anshuman.khandual@arm.com> wrote:

> 
> 
> On 09/04/2020 08:56 PM, Gerald Schaefer wrote:
> > On Fri, 4 Sep 2020 12:18:05 +0530
> > Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> > 
> >>
> >>
> >> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> >>> This patch series includes fixes for debug_vm_pgtable test code so that
> >>> they follow page table updates rules correctly. The first two patches introduce
> >>> changes w.r.t ppc64. The patches are included in this series for completeness. We can
> >>> merge them via ppc64 tree if required.
> >>>
> >>> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy
> >>> page table update rules.
> >>>
> >>> These tests are broken w.r.t page table update rules and results in kernel
> >>> crash as below. 
> >>>
> >>> [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
> >>> cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0]
> >>>     pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380
> >>>     lr: c0000000005eeeec: pte_update+0x11c/0x190
> >>>     sp: c000000c6d1e7950
> >>>    msr: 8000000002029033
> >>>   current = 0xc000000c6d172c80
> >>>   paca    = 0xc000000003ba0000   irqmask: 0x03   irq_happened: 0x01
> >>>     pid   = 1, comm = swapper/0
> >>> kernel BUG at arch/powerpc/mm/pgtable.c:304!
> >>> [link register   ] c0000000005eeeec pte_update+0x11c/0x190
> >>> [c000000c6d1e7950] 0000000000000001 (unreliable)
> >>> [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190
> >>> [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8
> >>> [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338
> >>> [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0
> >>> [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4
> >>> [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160
> >>> [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
> >>>
> >>> With DEBUG_VM disabled
> >>>
> >>> [   20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000
> >>> [   20.530183] Faulting instruction address: 0xc0000000000df330
> >>> cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700]
> >>>     pc: c0000000000df330: memset+0x68/0x104
> >>>     lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> >>>     sp: c000000c6d19f990
> >>>    msr: 8000000002009033
> >>>    dar: 0
> >>>   current = 0xc000000c6d177480
> >>>   paca    = 0xc00000001ec4f400   irqmask: 0x03   irq_happened: 0x01
> >>>     pid   = 1, comm = swapper/0
> >>> [link register   ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> >>> [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
> >>> [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378
> >>> [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244
> >>> [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0
> >>> [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4
> >>> [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160
> >>> [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
> >>>
> >>> Changes from v3:
> >>> * Address review feedback
> >>> * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure.
> >>
> >> This version
> >>
> >> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE)
> >> - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed
> >> - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well
> > 
> > When I quickly tested v3, it worked fine, but now it turned out to
> > only work fine "sometimes", both v3 and v4. I need to look into it
> > further, but so far it seems related to the hugetlb_advanced_tests().
> > 
> > I guess there was already some discussion on this test, but we did
> > not receive all of the thread(s). Please always add at least
> > linux-s390@vger.kernel.org and maybe myself and Vasily Gorbik <gor@linux.ibm.com>
> > for further discussions.
> 
> IIRC, the V3 series previously had all these addresses copied properly
> but this version once again missed copying all required addresses.

I also had issues with the de.ibm.com address, which might also have
made some mails disappear, and others might simply have been overlooked
be me. Don't bother, my bad.

> 
> > 
> > That being said, sorry for duplications, this might already have been
> > discussed. Preliminary analysis showed that it only seems to go wrong
> > for certain random vaddr values. I cannot make any sense of that yet,
> > but what seems strange to me is that the hugetlb_advanced_tests()
> > take a (real) pte_t pointer as input, and also use that for all
> > kinds of operations (set_huge_pte_at, huge_ptep_get_and_clear, etc.).
> > 
> > Although all the hugetlb code in the kernel is (mis)using pte_t
> > pointers instead of the correct pmd/pud_t pointers like THP, that
> > is just for historic reasons. The pointers will actually never point
> > to a real pte_t (i.e. page table entry), but of course to a pmd
> > or pud entry, depending on hugepage size.
> 
> HugeTLB logically operates on a PTE entry irrespective of it's real
> page table level position. Nonetheless, IIUC, vaddr here should have
> been aligned to real page table level in which the entry is being
> mapped currently.

That goes back to the time where only x86 had hugepages, and they
have the same layout for pte/pmd/etc entries, so it simply didn't
matter that the code (mis)used pte pointers / entries. But even for
x86, the hugetlb pte pointers would never have pointed to real ptes,
but pmds instead. That's why I call it misuse.

s390 is very sensitive to page table level, and we can also determine
the level from the entry value, which is used for some primitives.
Others have implicit assumptions and calculations, which go wrong
if a wrong level is passed in, like in this case for
huge_ptep_get_and_clear(). Simply aligning vaddr / pfn will not
be enough to fix this for s390, it has to be a pmd/pud pointer.
Or, as you already mentioned, the result of huge_pte_alloc().

Furthermore, the pmd and pte layout are different, so we simply cannot
use any pte_xxx primitives for hugepages. That was the reason for
introducing huge_ptep_get(), which will do an implicit conversion
from the real pmd/pud entry to a "fake" pte entry, which can then
be used with such pte_xxx primitives. Before writing it back in
set_huge_pte_at() we then do the reverse conversion to a proper
pmd/pud again.

^ permalink raw reply

* Re: [PATCH v2 3/7] mm/memory_hotplug: prepare passing flags to add_memory() and friends
From: Michael Ellerman @ 2020-09-09 11:24 UTC (permalink / raw)
  To: David Hildenbrand, Greg Kroah-Hartman
  Cc: linux-hyperv, Michal Hocko, Michael S. Tsirkin, Jason Wang,
	Pingfan Liu, virtualization, linux-mm, Paul Mackerras,
	K. Y. Srinivasan, Boris Ostrovsky, linux-s390, Wei Liu,
	Stefano Stabellini, Dave Jiang, Baoquan He, Jason Gunthorpe,
	Vishal Verma, linux-acpi, xen-devel, Heiko Carstens, Len Brown,
	Nathan Lynch, Vasily Gorbik, Leonardo Bras, Haiyang Zhang,
	Stephen Hemminger, Dan Williams, Christian Borntraeger,
	Juergen Gross, Pankaj Gupta, Libor Pechacek, linux-nvdimm,
	Rafael J. Wysocki, linux-kernel, Wei Yang, Oliver O'Halloran,
	Andrew Morton, linuxppc-dev
In-Reply-To: <3bc5b464-3229-d442-714a-ec33b5728ac6@redhat.com>

David Hildenbrand <david@redhat.com> writes:
> On 09.09.20 09:17, Greg Kroah-Hartman wrote:
>> On Tue, Sep 08, 2020 at 10:10:08PM +0200, David Hildenbrand wrote:
>>> We soon want to pass flags, e.g., to mark added System RAM resources.
>>> mergeable. Prepare for that.
>> 
>> What are these random "flags", and how do we know what should be passed
>> to them?
>> 
>> Why not make this an enumerated type so that we know it all works
>> properly, like the GPF_* flags are?  Passing around a random unsigned
>> long feels very odd/broken...
>
> Agreed, an enum (mhp_flags) seems to give a better hint what can
> actually be passed. Thanks!

You probably know this but ...

Just using a C enum doesn't get you any type safety.

You can get some checking via sparse by using __bitwise, which is what
gfp_t does. You don't actually have to use an enum for that, it works
with #defines also.

Or you can wrap the flag in a struct, the way atomic_t does, and then
the compiler will prevent passing plain integers in place of your custom
type.

cheers

^ permalink raw reply

* Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes
From: Gerald Schaefer @ 2020-09-09 11:16 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-s390@vger.kernel.org, Anshuman Khandual, linux-mm,
	Vineet Gupta, akpm, linux-snps-arc@lists.infradead.org,
	linuxppc-dev, linux-riscv, Gerald Schaefer
In-Reply-To: <87wo134h3s.fsf@linux.ibm.com>

On Wed, 09 Sep 2020 11:38:39 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:

> Gerald Schaefer <gerald.schaefer@linux.ibm.com> writes:
> 
> > On Fri, 4 Sep 2020 18:01:15 +0200
> > Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> >
> > [...]
> >> 
> >> BTW2, a quick test with this change (so far) made the issues on s390
> >> go away:
> >> 
> >> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void)
> >>         spin_unlock(ptl);
> >> 
> >>  #ifndef CONFIG_PPC_BOOK3S_64
> >> -       hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> >> +       hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, prot);
> >>  #endif
> >> 
> >>         spin_lock(&mm->page_table_lock);
> >> 
> >> That would more match the "pte_t pointer" usage for hugetlb code,
> >> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
> >> but I think the root cause is the pte_t pointer.
> >> 
> >> Not entirely sure though if that would really be the correct fix.
> >> I somehow lost whatever little track I had about what these tests
> >> really want to check, and if that would still be valid with that
> >> change.
> >
> > Uh oh, wasn't aware that this (or some predecessor) already went
> > upstream, and broke our debug kernel today.
> 
> Not sure i followed the above. Are you finding that s390 kernel crash
> after this patch series or the original patchset? As noted in my patch
> the hugetlb test is broken and we should fix that. A quick fix is to
> comment out that test for s390 too as i have done for PPC64.

We see it with both, it basically is broken since there is a hugetlb
test using real pte pointers. It doesn't always show, depending on
random vaddr, so it slipped through earlier testing.

I guess we also would have had one or the other chance to notice
that earlier, through better review, or better reading of previous
mails. I must admit that I neglected this a bit.

^ permalink raw reply

* Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes
From: Gerald Schaefer @ 2020-09-09 11:10 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-s390@vger.kernel.org, Aneesh Kumar K.V, linux-mm,
	Vineet Gupta, akpm, linux-snps-arc@lists.infradead.org,
	linuxppc-dev, linux-riscv, Gerald Schaefer
In-Reply-To: <ac85fd77-bca1-3318-de0e-eea40558cbbd@arm.com>

On Wed, 9 Sep 2020 13:45:48 +0530
Anshuman Khandual <anshuman.khandual@arm.com> wrote:

[...]
> > 
> > That would more match the "pte_t pointer" usage for hugetlb code,
> > i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
> > but I think the root cause is the pte_t pointer.
> 
> Ideally, the pte_t pointer used here should be from huge_pte_alloc()
> not from pte_alloc_map_lock() as the case currently.

Ah, good point. I assumed that this would also always return casted
pmd etc. pointers, and never pte pointers. Unfortunately, that doesn't
seem to be true for all architectures, e.g. ia64, parisc, (some) powerpc,
where they really do a pte_alloc_map() for some reason.

I guess that means you cannot simply cast the pmd pointer, as suggested,
although I really do not understand how any architecture can work with
real ptes for hugepages. But that's fair, s390 also does some things that
nobody would expect or understand for other architectures...

So, for using huge_pte_alloc() you'd also need some size, maybe
iterating over hstates with for_each_hstate() could be an option,
if they are already initialized at that point. Then you have the
size(s) with huge_page_size(hstate) and can actually call the
hugetlb tests for all supported sizes, and with proper pointer
from huge_pte_alloc().

^ permalink raw reply

* Re: Flushing transparent hugepages
From: Aneesh Kumar K.V @ 2020-09-09 10:26 UTC (permalink / raw)
  To: Matthew Wilcox, linux-arch
  Cc: Thomas Bogendoerfer, Will Deacon, Vineet Gupta, Russell King,
	linux-mips, linux-mm, Paul Mackerras, Catalin Marinas, sparclinux,
	linux-snps-arc, linuxppc-dev, David S. Miller, linux-arm-kernel
In-Reply-To: <20200818150736.GQ17456@casper.infradead.org>

Matthew Wilcox <willy@infradead.org> writes:

> PowerPC has special handling of hugetlbfs pages.  Well, that's what
> the config option says, but actually it handles THP as well.  If
> the config option is enabled.
>
> #ifdef CONFIG_HUGETLB_PAGE
>         if (PageCompound(page)) {
>                 flush_dcache_icache_hugepage(page);
>                 return;
>         }
> #endif

I do have a change posted sometime back to avoid that confusion.
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200320103256.229365-1-aneesh.kumar@linux.ibm.com/

But IIUC we use the head page flags (PG_arch_1) to track whether we need
the flush or not.

>
> By the way, THPs can be mapped askew -- that is, at an offset which
> means you can't use a PMD to map a PMD sized page.
>
> Anyway, we don't really have consensus between the various architectures
> on how to handle either THPs or hugetlb pages.  It's not contemplated
> in Documentation/core-api/cachetlb.rst so there's no real surprise
> we've diverged.
>
> What would you _like_ to see?  Would you rather flush_dcache_page()
> were called once for each subpage, or would you rather maintain
> the page-needs-flushing state once per compound page?  We could also
> introduce flush_dcache_thp() if some architectures would prefer it one
> way and one the other, although that brings into question what to do
> for hugetlbfs pages.
>
> It might not be a bad idea to centralise the handling of all this stuff
> somewhere.  Sounds like the kind of thing Arnd would like to do ;-) I'll
> settle for getting enough clear feedback about what the various arch
> maintainers want that I can write a documentation update for cachetlb.rst.

^ permalink raw reply

* Re: [PATCH kernel] powerpc/dma: Fix dma_map_ops::get_required_mask
From: Alexey Kardashevskiy @ 2020-09-09  9:36 UTC (permalink / raw)
  To: Christoph Hellwig, Cédric Le Goater, Michael Ellerman,
	Oliver OHalloran, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20200909075849.GA12282@lst.de>



On 09/09/2020 17:58, Christoph Hellwig wrote:
> On Tue, Sep 08, 2020 at 11:10:03PM +1000, Alexey Kardashevskiy wrote:
>>>> a-ha, this makes more sense, thanks. Then I guess we need to revert that
>>>> one bit from yours f1565c24b596, do not we?
>>>
>>> Why?  The was the original intent of the API, but now we also use
>>> internally to check the addressing capabilities.
>>
>> The bigger mask the better, no? As it is now, it's limited by the window 
>> size which happens to be bigger than 4GB but smaller then full 64bit (48bit 
>> on my system)
> 
> Yes, the bigger mask is better.  But I don't see why you'd want to
> revert the dma bypass code for that entirely.
> 

I want dma_get_required_mask() to return the bigger mask always.

Now it depends on (in dma_alloc_direct()):
1. dev->dma_ops_bypass: set via pci_set_(coherent_)dma_mask();
2. dev->coherent_dma_mask - the same;
3. dev->bus_dma_limit - usually not set at all.

So until we set the mask, dma_get_required_mask() returns smaller mask.
So aacraid and likes (which calls dma_get_required_mask() before setting
it) will remain prone for breaks.


[forgot to cc: other folks last time, fixed now]

-- 
Alexey

^ permalink raw reply

* Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes
From: Anshuman Khandual @ 2020-09-09  8:38 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: linux-s390@vger.kernel.org, Aneesh Kumar K.V, linux-mm,
	Vineet Gupta, akpm, linux-snps-arc@lists.infradead.org,
	linuxppc-dev, linux-riscv, Gerald Schaefer
In-Reply-To: <20200904195346.6d57ff9f@thinkpad>

On 09/04/2020 11:23 PM, Gerald Schaefer wrote:
> On Fri, 4 Sep 2020 18:01:15 +0200
> Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> 
>> On Fri, 4 Sep 2020 17:26:47 +0200
>> Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
>>
>>> On Fri, 4 Sep 2020 12:18:05 +0530
>>> Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>>>
>>>>
>>>>
>>>> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
>>>>> This patch series includes fixes for debug_vm_pgtable test code so that
>>>>> they follow page table updates rules correctly. The first two patches introduce
>>>>> changes w.r.t ppc64. The patches are included in this series for completeness. We can
>>>>> merge them via ppc64 tree if required.
>>>>>
>>>>> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy
>>>>> page table update rules.
>>>>>
>>>>> These tests are broken w.r.t page table update rules and results in kernel
>>>>> crash as below. 
>>>>>
>>>>> [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
>>>>> cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0]
>>>>>     pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380
>>>>>     lr: c0000000005eeeec: pte_update+0x11c/0x190
>>>>>     sp: c000000c6d1e7950
>>>>>    msr: 8000000002029033
>>>>>   current = 0xc000000c6d172c80
>>>>>   paca    = 0xc000000003ba0000   irqmask: 0x03   irq_happened: 0x01
>>>>>     pid   = 1, comm = swapper/0
>>>>> kernel BUG at arch/powerpc/mm/pgtable.c:304!
>>>>> [link register   ] c0000000005eeeec pte_update+0x11c/0x190
>>>>> [c000000c6d1e7950] 0000000000000001 (unreliable)
>>>>> [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190
>>>>> [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8
>>>>> [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338
>>>>> [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0
>>>>> [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4
>>>>> [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160
>>>>> [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
>>>>>
>>>>> With DEBUG_VM disabled
>>>>>
>>>>> [   20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000
>>>>> [   20.530183] Faulting instruction address: 0xc0000000000df330
>>>>> cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700]
>>>>>     pc: c0000000000df330: memset+0x68/0x104
>>>>>     lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
>>>>>     sp: c000000c6d19f990
>>>>>    msr: 8000000002009033
>>>>>    dar: 0
>>>>>   current = 0xc000000c6d177480
>>>>>   paca    = 0xc00000001ec4f400   irqmask: 0x03   irq_happened: 0x01
>>>>>     pid   = 1, comm = swapper/0
>>>>> [link register   ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0
>>>>> [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
>>>>> [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378
>>>>> [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244
>>>>> [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0
>>>>> [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4
>>>>> [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160
>>>>> [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
>>>>>
>>>>> Changes from v3:
>>>>> * Address review feedback
>>>>> * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure.
>>>>
>>>> This version
>>>>
>>>> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE)
>>>> - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed
>>>> - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well
>>>
>>> When I quickly tested v3, it worked fine, but now it turned out to
>>> only work fine "sometimes", both v3 and v4. I need to look into it
>>> further, but so far it seems related to the hugetlb_advanced_tests().
>>>
>>> I guess there was already some discussion on this test, but we did
>>> not receive all of the thread(s). Please always add at least
>>> linux-s390@vger.kernel.org and maybe myself and Vasily Gorbik <gor@linux.ibm.com>
>>> for further discussions.
>>
>> BTW, with myself I mean the new address gerald.schaefer@linux.ibm.com.
>> The old gerald.schaefer@de.ibm.com seems to work (again), but is not
>> very reliable.
>>
>> BTW2, a quick test with this change (so far) made the issues on s390
>> go away:
>>
>> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void)
>>         spin_unlock(ptl);
>>
>>  #ifndef CONFIG_PPC_BOOK3S_64
>> -       hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>> +       hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, prot);
>>  #endif
>>
>>         spin_lock(&mm->page_table_lock);
>>
>> That would more match the "pte_t pointer" usage for hugetlb code,
>> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
>> but I think the root cause is the pte_t pointer.
>>
>> Not entirely sure though if that would really be the correct fix.
>> I somehow lost whatever little track I had about what these tests
>> really want to check, and if that would still be valid with that
>> change.
> 
> Another potential issue, apparently not for s390, but maybe for
> others, is that the vaddr passed to hugetlb_advanced_tests() is
> also not pmd/pud size aligned, like you did in pmd/pud_advanced_tests().
> 
> I guess for the hugetlb_advanced_tests() you need to choose if
> you want to test pmd or pud hugepages, and accordingly prepare
> the *ptep, pfn and vaddr input. If you only check for CONFIG_HUGETLB_PAGE,
> then probably only pmd hugepages would be safe, there might be
> architectures only supporting one hugepage size.

I guess preparing for PMD based HugeTLB tests should be sufficient
for now, which can be improved later on to cover other levels.

> 
> So, for s390, at least the ptep input value is a problem. Still
> need to better understand how it goes wrong, but it seems to be
> fixed when using proper pmdp, and also works with pudp.
> 
> For others, especially the apparent issues on ppc64, the other
> non-hugepage aligned input pfn and vaddr might also be an issue,
> e.g. power at least seems to use the vaddr in its set_huge_pte_at()
> implementation for some pmd_off(mm, addr) calculation.
> 
> Again, sorry if this was already discussed, I missed most of it
> and honestly didn't properly look at the scarce mails that we did
> receive...

Sure, will consider these points and try improve tests afterwards.

^ permalink raw reply

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
From: Christophe Leroy @ 2020-09-09  8:38 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Arnd Bergmann, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Andrey Ryabinin, Gerald Schaefer, Jeff Dike,
	Vasily Gorbik, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power, Mike Rapoport
In-Reply-To: <20200908141554.GA20558@oc3871087118.ibm.com>

On Tue, 2020-09-08 at 16:15 +0200, Alexander Gordeev wrote:
> On Tue, Sep 08, 2020 at 10:16:49AM +0200, Christophe Leroy wrote:
> > >Yes, and also two more sources :/
> > >	arch/powerpc/mm/kasan/8xx.c
> > >	arch/powerpc/mm/kasan/kasan_init_32.c
> > >
> > >But these two are not quite obvious wrt pgd_addr_end() used
> > >while traversing pmds. Could you please clarify a bit?
> > >
> > >
> > >diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
> > >index 2784224..89c5053 100644
> > >--- a/arch/powerpc/mm/kasan/8xx.c
> > >+++ b/arch/powerpc/mm/kasan/8xx.c
> > >@@ -15,8 +15,8 @@
> > >  	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
> > >  		pte_basic_t *new;
> > >-		k_next = pgd_addr_end(k_cur, k_end);
> > >-		k_next = pgd_addr_end(k_next, k_end);
> > >+		k_next = pmd_addr_end(k_cur, k_end);
> > >+		k_next = pmd_addr_end(k_next, k_end);
> > 
> > No, I don't think so.
> > On powerpc32 we have only two levels, so pgd and pmd are more or
> > less the same.
> > But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h
> > is a no-op, so I don't think it will work.
> > 
> > It is likely that this function should iterate on pgd, then you get
> > pmd = pmd_offset(pud_offset(p4d_offset(pgd)));
> 
> It looks like the code iterates over single pmd table while using
> pgd_addr_end() only to skip all the middle levels and bail out
> from the loop.
> 
> I would be wary for switching from pmds to pgds, since we are
> trying to minimize impact (especially functional) and the
> rework does not seem that obvious.
> 

I've just tested the following change, it works and should fix the
oddity:

diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224054f8..8e53ddf57b84 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -9,11 +9,12 @@
 static int __init
 kasan_init_shadow_8M(unsigned long k_start, unsigned long k_end, void
*block)
 {
-	pmd_t *pmd = pmd_off_k(k_start);
+	pgd_t *pgd = pgd_offset_k(k_start);
 	unsigned long k_cur, k_next;
 
-	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block
+= SZ_8M) {
+	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pgd += 2, block
+= SZ_8M) {
 		pte_basic_t *new;
+		pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, k_cur), k_cur),
k_cur);
 
 		k_next = pgd_addr_end(k_cur, k_end);
 		k_next = pgd_addr_end(k_next, k_end);
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c
b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb294046e00e..e5f524fa71a7 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -30,13 +30,12 @@ static void __init kasan_populate_pte(pte_t *ptep,
pgprot_t prot)
 
 int __init kasan_init_shadow_page_tables(unsigned long k_start,
unsigned long k_end)
 {
-	pmd_t *pmd;
+	pgd_t *pgd = pgd_offset_k(k_start);
 	unsigned long k_cur, k_next;
 
-	pmd = pmd_off_k(k_start);
-
-	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
+	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pgd++) {
 		pte_t *new;
+		pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, k_cur), k_cur),
k_cur);
 
 		k_next = pgd_addr_end(k_cur, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
@@ -189,16 +188,18 @@ void __init kasan_early_init(void)
 	unsigned long addr = KASAN_SHADOW_START;
 	unsigned long end = KASAN_SHADOW_END;
 	unsigned long next;
-	pmd_t *pmd = pmd_off_k(addr);
+	pgd_t *pgd = pgd_offset_k(addr);
 
 	BUILD_BUG_ON(KASAN_SHADOW_START & ~PGDIR_MASK);
 
 	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
 	do {
+		pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, addr), addr),
addr);
+
 		next = pgd_addr_end(addr, end);
 		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
-	} while (pmd++, addr = next, addr != end);
+	} while (pgd++, addr = next, addr != end);
 
 	if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE))
 		kasan_early_hash_table();
---
Christophe


^ permalink raw reply related

* Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes
From: Anshuman Khandual @ 2020-09-09  8:15 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: linux-s390@vger.kernel.org, Aneesh Kumar K.V, linux-mm,
	Vineet Gupta, akpm, linux-snps-arc@lists.infradead.org,
	linuxppc-dev, linux-riscv, Gerald Schaefer
In-Reply-To: <20200904180115.07ee5f00@thinkpad>



On 09/04/2020 09:31 PM, Gerald Schaefer wrote:
> On Fri, 4 Sep 2020 17:26:47 +0200
> Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> 
>> On Fri, 4 Sep 2020 12:18:05 +0530
>> Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>>
>>>
>>>
>>> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
>>>> This patch series includes fixes for debug_vm_pgtable test code so that
>>>> they follow page table updates rules correctly. The first two patches introduce
>>>> changes w.r.t ppc64. The patches are included in this series for completeness. We can
>>>> merge them via ppc64 tree if required.
>>>>
>>>> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy
>>>> page table update rules.
>>>>
>>>> These tests are broken w.r.t page table update rules and results in kernel
>>>> crash as below. 
>>>>
>>>> [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
>>>> cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0]
>>>>     pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380
>>>>     lr: c0000000005eeeec: pte_update+0x11c/0x190
>>>>     sp: c000000c6d1e7950
>>>>    msr: 8000000002029033
>>>>   current = 0xc000000c6d172c80
>>>>   paca    = 0xc000000003ba0000   irqmask: 0x03   irq_happened: 0x01
>>>>     pid   = 1, comm = swapper/0
>>>> kernel BUG at arch/powerpc/mm/pgtable.c:304!
>>>> [link register   ] c0000000005eeeec pte_update+0x11c/0x190
>>>> [c000000c6d1e7950] 0000000000000001 (unreliable)
>>>> [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190
>>>> [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8
>>>> [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338
>>>> [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0
>>>> [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4
>>>> [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160
>>>> [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
>>>>
>>>> With DEBUG_VM disabled
>>>>
>>>> [   20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000
>>>> [   20.530183] Faulting instruction address: 0xc0000000000df330
>>>> cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700]
>>>>     pc: c0000000000df330: memset+0x68/0x104
>>>>     lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
>>>>     sp: c000000c6d19f990
>>>>    msr: 8000000002009033
>>>>    dar: 0
>>>>   current = 0xc000000c6d177480
>>>>   paca    = 0xc00000001ec4f400   irqmask: 0x03   irq_happened: 0x01
>>>>     pid   = 1, comm = swapper/0
>>>> [link register   ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0
>>>> [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
>>>> [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378
>>>> [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244
>>>> [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0
>>>> [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4
>>>> [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160
>>>> [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
>>>>
>>>> Changes from v3:
>>>> * Address review feedback
>>>> * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure.
>>>
>>> This version
>>>
>>> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE)
>>> - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed
>>> - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well
>>
>> When I quickly tested v3, it worked fine, but now it turned out to
>> only work fine "sometimes", both v3 and v4. I need to look into it
>> further, but so far it seems related to the hugetlb_advanced_tests().
>>
>> I guess there was already some discussion on this test, but we did
>> not receive all of the thread(s). Please always add at least
>> linux-s390@vger.kernel.org and maybe myself and Vasily Gorbik <gor@linux.ibm.com>
>> for further discussions.
> 
> BTW, with myself I mean the new address gerald.schaefer@linux.ibm.com.
> The old gerald.schaefer@de.ibm.com seems to work (again), but is not
> very reliable.

Sure, noted.

> 
> BTW2, a quick test with this change (so far) made the issues on s390
> go away:
> 
> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void)
>         spin_unlock(ptl);
>  
>  #ifndef CONFIG_PPC_BOOK3S_64
> -       hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> +       hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, prot);
>  #endif
>  
>         spin_lock(&mm->page_table_lock);
> 
> That would more match the "pte_t pointer" usage for hugetlb code,
> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
> but I think the root cause is the pte_t pointer.

Ideally, the pte_t pointer used here should be from huge_pte_alloc()
not from pte_alloc_map_lock() as the case currently.

> 
> Not entirely sure though if that would really be the correct fix.
> I somehow lost whatever little track I had about what these tests
> really want to check, and if that would still be valid with that
> change.
> 

^ permalink raw reply

* Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes
From: Anshuman Khandual @ 2020-09-09  8:08 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: linux-s390@vger.kernel.org, Aneesh Kumar K.V, linux-mm,
	Vineet Gupta, akpm, linux-snps-arc@lists.infradead.org,
	linuxppc-dev, linux-riscv, Gerald Schaefer
In-Reply-To: <20200904172647.002113d3@thinkpad>



On 09/04/2020 08:56 PM, Gerald Schaefer wrote:
> On Fri, 4 Sep 2020 12:18:05 +0530
> Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> 
>>
>>
>> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
>>> This patch series includes fixes for debug_vm_pgtable test code so that
>>> they follow page table updates rules correctly. The first two patches introduce
>>> changes w.r.t ppc64. The patches are included in this series for completeness. We can
>>> merge them via ppc64 tree if required.
>>>
>>> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy
>>> page table update rules.
>>>
>>> These tests are broken w.r.t page table update rules and results in kernel
>>> crash as below. 
>>>
>>> [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
>>> cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0]
>>>     pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380
>>>     lr: c0000000005eeeec: pte_update+0x11c/0x190
>>>     sp: c000000c6d1e7950
>>>    msr: 8000000002029033
>>>   current = 0xc000000c6d172c80
>>>   paca    = 0xc000000003ba0000   irqmask: 0x03   irq_happened: 0x01
>>>     pid   = 1, comm = swapper/0
>>> kernel BUG at arch/powerpc/mm/pgtable.c:304!
>>> [link register   ] c0000000005eeeec pte_update+0x11c/0x190
>>> [c000000c6d1e7950] 0000000000000001 (unreliable)
>>> [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190
>>> [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8
>>> [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338
>>> [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0
>>> [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4
>>> [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160
>>> [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
>>>
>>> With DEBUG_VM disabled
>>>
>>> [   20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000
>>> [   20.530183] Faulting instruction address: 0xc0000000000df330
>>> cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700]
>>>     pc: c0000000000df330: memset+0x68/0x104
>>>     lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
>>>     sp: c000000c6d19f990
>>>    msr: 8000000002009033
>>>    dar: 0
>>>   current = 0xc000000c6d177480
>>>   paca    = 0xc00000001ec4f400   irqmask: 0x03   irq_happened: 0x01
>>>     pid   = 1, comm = swapper/0
>>> [link register   ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0
>>> [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
>>> [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378
>>> [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244
>>> [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0
>>> [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4
>>> [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160
>>> [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
>>>
>>> Changes from v3:
>>> * Address review feedback
>>> * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure.
>>
>> This version
>>
>> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE)
>> - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed
>> - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well
> 
> When I quickly tested v3, it worked fine, but now it turned out to
> only work fine "sometimes", both v3 and v4. I need to look into it
> further, but so far it seems related to the hugetlb_advanced_tests().
> 
> I guess there was already some discussion on this test, but we did
> not receive all of the thread(s). Please always add at least
> linux-s390@vger.kernel.org and maybe myself and Vasily Gorbik <gor@linux.ibm.com>
> for further discussions.

IIRC, the V3 series previously had all these addresses copied properly
but this version once again missed copying all required addresses.

> 
> That being said, sorry for duplications, this might already have been
> discussed. Preliminary analysis showed that it only seems to go wrong
> for certain random vaddr values. I cannot make any sense of that yet,
> but what seems strange to me is that the hugetlb_advanced_tests()
> take a (real) pte_t pointer as input, and also use that for all
> kinds of operations (set_huge_pte_at, huge_ptep_get_and_clear, etc.).
> 
> Although all the hugetlb code in the kernel is (mis)using pte_t
> pointers instead of the correct pmd/pud_t pointers like THP, that
> is just for historic reasons. The pointers will actually never point
> to a real pte_t (i.e. page table entry), but of course to a pmd
> or pud entry, depending on hugepage size.

HugeTLB logically operates on a PTE entry irrespective of it's real
page table level position. Nonetheless, IIUC, vaddr here should have
been aligned to real page table level in which the entry is being
mapped currently.

> 
> What is passed in as ptep to hugetlb_advanced_tests() seems to be
> the result from the previous ptep = pte_alloc_map(mm, pmdp, vaddr),
> so I would expect that it points to a real page table entry. Need
> to investigate further, but IIUC, using such a pointer for adding
> large pte entries (i.e. pmd/pud entries) at least feels very wrong
> to me, and I assume it is related to the issues we see on s390.
Will look into this further.

> 
> We actually see different issues, e.g. once a panic directly in
> hugetlb_advanced_tests() -> huge_ptep_get_and_clear(), but also
> indirect symptoms after debug_vm_pgtable() completes, like this:
> 
> [   10.533901] BUG task_struct (Not tainted): Padding overwritten. 0x0000000019f798c7-0x0000000019f798c7 @offset=30087
> 
> Last but not least, what I said about the pte vs. pmd/pud of
> course also should apply to the hugetlb_basic_tests(), although
> they are not directly using a pte_t pointer, and especially
> also not writing to it. Still, the pte_aligned pfn parameter
> is not guaranteed to also be pmd/pud_aligned, which doesn't
> feel right.
hugetlb_basic_tests() does not directly operate on real page table
entries. But I do see the point wrt using pmd_aligned pfn instead.
I will look into this in detail and send out something after this
series settles down.

> 
> So, for now, until this is sorted out, I guess we also need
> to exclude s390 at least from the hugetlb_advanced_tests().
> The hugetlb_basic_tests() seem to work fine so far (probably
> by chance :-))

^ permalink raw reply

* Re: [PATCH v2 3/7] mm/memory_hotplug: prepare passing flags to add_memory() and friends
From: David Hildenbrand @ 2020-09-09  7:28 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-hyperv, Michal Hocko, Michael S. Tsirkin, Jason Wang,
	Pingfan Liu, virtualization, linux-mm, Paul Mackerras,
	K. Y. Srinivasan, Boris Ostrovsky, linux-s390, Wei Liu,
	Stefano Stabellini, Dave Jiang, Baoquan He, Jason Gunthorpe,
	linux-acpi, xen-devel, Heiko Carstens, Len Brown, Nathan Lynch,
	Vasily Gorbik, Leonardo Bras, Haiyang Zhang, Stephen Hemminger,
	Dan Williams, Christian Borntraeger, Juergen Gross, Pankaj Gupta,
	Libor Pechacek, linux-nvdimm, Rafael J. Wysocki, linux-kernel,
	Wei Yang, Vishal Verma, Oliver O'Halloran, Andrew Morton,
	linuxppc-dev
In-Reply-To: <20200909071759.GD435421@kroah.com>

On 09.09.20 09:17, Greg Kroah-Hartman wrote:
> On Tue, Sep 08, 2020 at 10:10:08PM +0200, David Hildenbrand wrote:
>> We soon want to pass flags, e.g., to mark added System RAM resources.
>> mergeable. Prepare for that.
> 
> What are these random "flags", and how do we know what should be passed
> to them?
> 
> Why not make this an enumerated type so that we know it all works
> properly, like the GPF_* flags are?  Passing around a random unsigned
> long feels very odd/broken...

Agreed, an enum (mhp_flags) seems to give a better hint what can
actually be passed. Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply

* Re: [PATCH v2 3/7] mm/memory_hotplug: prepare passing flags to add_memory() and friends
From: Greg Kroah-Hartman @ 2020-09-09  7:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-hyperv, Michal Hocko, Michael S. Tsirkin, Jason Wang,
	Pingfan Liu, virtualization, linux-mm, Paul Mackerras,
	K. Y. Srinivasan, Boris Ostrovsky, linux-s390, Wei Liu,
	Stefano Stabellini, Dave Jiang, Baoquan He, Jason Gunthorpe,
	linux-acpi, xen-devel, Heiko Carstens, Len Brown, Nathan Lynch,
	Vasily Gorbik, Leonardo Bras, Haiyang Zhang, Stephen Hemminger,
	Dan Williams, Christian Borntraeger, Juergen Gross, Pankaj Gupta,
	Libor Pechacek, linux-nvdimm, Rafael J. Wysocki, linux-kernel,
	Wei Yang, Vishal Verma, Oliver O'Halloran, Andrew Morton,
	linuxppc-dev
In-Reply-To: <20200908201012.44168-4-david@redhat.com>

On Tue, Sep 08, 2020 at 10:10:08PM +0200, David Hildenbrand wrote:
> We soon want to pass flags, e.g., to mark added System RAM resources.
> mergeable. Prepare for that.

What are these random "flags", and how do we know what should be passed
to them?

Why not make this an enumerated type so that we know it all works
properly, like the GPF_* flags are?  Passing around a random unsigned
long feels very odd/broken...

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v1 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification
From: Christophe Leroy @ 2020-09-09  6:20 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, npiggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <871rjb5vv4.fsf@linux.ibm.com>



Le 09/09/2020 à 08:04, Aneesh Kumar K.V a écrit :
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> 
>> search_exception_tables() is an heavy operation, we have to avoid it.
>> When KUAP is selected, we'll know the fault has been blocked by KUAP.
>> Otherwise, it behaves just as if the address was already in the TLBs
>> and no fault was generated.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> ---
>>   arch/powerpc/mm/fault.c | 20 +++++---------------
>>   1 file changed, 5 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>> index 525e0c2b5406..edde169ba3a6 100644
>> --- a/arch/powerpc/mm/fault.c
>> +++ b/arch/powerpc/mm/fault.c
>> @@ -214,24 +214,14 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
>>   	if (address >= TASK_SIZE)
>>   		return true;
>>   
>> -	if (!is_exec && (error_code & DSISR_PROTFAULT) &&
>> -	    !search_exception_tables(regs->nip)) {
>> +	// Read/write fault blocked by KUAP is bad, it can never succeed.
>> +	if (bad_kuap_fault(regs, address, is_write)) {
>>   		pr_crit_ratelimited("Kernel attempted to access user page (%lx) - exploit attempt? (uid: %d)\n",
>> -				    address,
>> -				    from_kuid(&init_user_ns, current_uid()));
>> -	}
>> -
>> -	// Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
>> -	if (!search_exception_tables(regs->nip))
>> -		return true;
> 
> We still need to keep this ? Without that we detect the lack of
> exception tables pretty late.

Is that a problem at all to detect the lack of exception tables late ?
That case is very unlikely and will lead to failure anyway. So, is it 
worth impacting performance of the likely case which will always have an 
exception table and where we expect the exception to run as fast as 
possible ?

The other architectures I have looked at (arm64 and x86) only have the 
exception table search together with the down_read_trylock(&mm->mmap_sem).

Christophe

^ permalink raw reply

* Re: [PATCH v2] powerpc/pci: unmap legacy INTx interrupts when a PHB is removed
From: Alexey Kardashevskiy @ 2020-09-09  6:16 UTC (permalink / raw)
  To: Cédric Le Goater, Michael Ellerman
  Cc: Oliver O'Halloran, linuxppc-dev
In-Reply-To: <20200807101854.844619-1-clg@kaod.org>



On 07/08/2020 20:18, Cédric Le Goater wrote:
> When a passthrough IO adapter is removed from a pseries machine using
> hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
> guest OS to clear all page table entries related to the adapter. If
> some are still present, the RTAS call which isolates the PCI slot
> returns error 9001 "valid outstanding translations" and the removal of
> the IO adapter fails. This is because when the PHBs are scanned, Linux
> maps automatically the INTx interrupts in the Linux interrupt number
> space but these are never removed.
> 
> To solve this problem, we introduce a PPC platform specific
> pcibios_remove_bus() routine which clears all interrupt mappings when
> the bus is removed. This also clears the associated page table entries
> of the ESB pages when using XIVE.
> 
> For this purpose, we record the logical interrupt numbers of the
> mapped interrupt under the PHB structure and let pcibios_remove_bus()
> do the clean up.
> 
> Since some PCI adapters, like GPUs, use the "interrupt-map" property
> to describe interrupt mappings other than the legacy INTx interrupts,
> we can not restrict the size of the mapping array to PCI_NUM_INTX. The
> number of interrupt mappings is computed from the "interrupt-map"
> property and the mapping array is allocated accordingly.
> 
> Cc: "Oliver O'Halloran" <oohall@gmail.com>
> Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

I thought we could reuse some of the common OF code for the DT parsing
but we cannot (easily) so it is good as it is:

Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>


> ---
> 
>  Changes since v2:
> 
>  - merged 2 patches.
>  
>  arch/powerpc/include/asm/pci-bridge.h |   6 ++
>  arch/powerpc/kernel/pci-common.c      | 114 ++++++++++++++++++++++++++
>  2 files changed, 120 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
> index b92e81b256e5..ca75cf264ddf 100644
> --- a/arch/powerpc/include/asm/pci-bridge.h
> +++ b/arch/powerpc/include/asm/pci-bridge.h
> @@ -48,6 +48,9 @@ struct pci_controller_ops {
>  
>  /*
>   * Structure of a PCI controller (host bridge)
> + *
> + * @irq_count: number of interrupt mappings
> + * @irq_map: interrupt mappings
>   */
>  struct pci_controller {
>  	struct pci_bus *bus;
> @@ -127,6 +130,9 @@ struct pci_controller {
>  
>  	void *private_data;
>  	struct npu *npu;
> +
> +	unsigned int irq_count;
> +	unsigned int *irq_map;
>  };
>  
>  /* These are used for config access before all the PCI probing
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index be108616a721..deb831f0ae13 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -353,6 +353,115 @@ struct pci_controller *pci_find_controller_for_domain(int domain_nr)
>  	return NULL;
>  }
>  
> +/*
> + * Assumption is made on the interrupt parent. All interrupt-map
> + * entries are considered to have the same parent.
> + */
> +static int pcibios_irq_map_count(struct pci_controller *phb)
> +{
> +	const __be32 *imap;
> +	int imaplen;
> +	struct device_node *parent;
> +	u32 intsize, addrsize, parintsize, paraddrsize;
> +
> +	if (of_property_read_u32(phb->dn, "#interrupt-cells", &intsize))
> +		return 0;
> +	if (of_property_read_u32(phb->dn, "#address-cells", &addrsize))
> +		return 0;
> +
> +	imap = of_get_property(phb->dn, "interrupt-map", &imaplen);
> +	if (!imap) {
> +		pr_debug("%pOF : no interrupt-map\n", phb->dn);
> +		return 0;
> +	}
> +	imaplen /= sizeof(u32);
> +	pr_debug("%pOF : imaplen=%d\n", phb->dn, imaplen);
> +
> +	if (imaplen < (addrsize + intsize + 1))
> +		return 0;
> +
> +	imap += intsize + addrsize;
> +	parent = of_find_node_by_phandle(be32_to_cpup(imap));
> +	if (!parent) {
> +		pr_debug("%pOF : no imap parent found !\n", phb->dn);
> +		return 0;
> +	}
> +
> +	if (of_property_read_u32(parent, "#interrupt-cells", &parintsize)) {
> +		pr_debug("%pOF : parent lacks #interrupt-cells!\n", phb->dn);
> +		return 0;
> +	}
> +
> +	if (of_property_read_u32(parent, "#address-cells", &paraddrsize))
> +		paraddrsize = 0;
> +
> +	return imaplen / (addrsize + intsize + 1 + paraddrsize + parintsize);
> +}
> +
> +static void pcibios_irq_map_init(struct pci_controller *phb)
> +{
> +	phb->irq_count = pcibios_irq_map_count(phb);
> +	if (phb->irq_count < PCI_NUM_INTX)
> +		phb->irq_count = PCI_NUM_INTX;
> +
> +	pr_debug("%pOF : interrupt map #%d\n", phb->dn, phb->irq_count);
> +
> +	phb->irq_map = kcalloc(phb->irq_count, sizeof(unsigned int),
> +			       GFP_KERNEL);
> +}
> +
> +static void pci_irq_map_register(struct pci_dev *pdev, unsigned int virq)
> +{
> +	struct pci_controller *phb = pci_bus_to_host(pdev->bus);
> +	int i;
> +
> +	if (!phb->irq_map)
> +		return;
> +
> +	for (i = 0; i < phb->irq_count; i++) {
> +		/*
> +		 * Look for an empty or an equivalent slot, as INTx
> +		 * interrupts can be shared between adapters.
> +		 */
> +		if (phb->irq_map[i] == virq || !phb->irq_map[i]) {
> +			phb->irq_map[i] = virq;
> +			break;
> +		}
> +	}
> +
> +	if (i == phb->irq_count)
> +		pr_err("PCI:%s all platform interrupts mapped\n",
> +		       pci_name(pdev));
> +}
> +
> +/*
> + * Clearing the mapped interrupts will also clear the underlying
> + * mappings of the ESB pages of the interrupts when under XIVE. It is
> + * a requirement of PowerVM to clear all memory mappings before
> + * removing a PHB.
> + */
> +static void pci_irq_map_dispose(struct pci_bus *bus)
> +{
> +	struct pci_controller *phb = pci_bus_to_host(bus);
> +	int i;
> +
> +	if (!phb->irq_map)
> +		return;
> +
> +	pr_debug("PCI: Clearing interrupt mappings for PHB %04x:%02x...\n",
> +		 pci_domain_nr(bus), bus->number);
> +	for (i = 0; i < phb->irq_count; i++)
> +		irq_dispose_mapping(phb->irq_map[i]);
> +
> +	kfree(phb->irq_map);
> +}
> +
> +void pcibios_remove_bus(struct pci_bus *bus)
> +{
> +	pci_irq_map_dispose(bus);
> +}
> +EXPORT_SYMBOL_GPL(pcibios_remove_bus);
> +
>  /*
>   * Reads the interrupt pin to determine if interrupt is use by card.
>   * If the interrupt is used, then gets the interrupt line from the
> @@ -401,6 +510,8 @@ static int pci_read_irq_line(struct pci_dev *pci_dev)
>  
>  	pci_dev->irq = virq;
>  
> +	/* Record all interrut mappings for later removal of a PHB */
> +	pci_irq_map_register(pci_dev, virq);
>  	return 0;
>  }
>  
> @@ -1554,6 +1665,9 @@ void pcibios_scan_phb(struct pci_controller *hose)
>  
>  	pr_debug("PCI: Scanning PHB %pOF\n", node);
>  
> +	/* Allocate interrupt mappings array */
> +	pcibios_irq_map_init(hose);
> +
>  	/* Get some IO space for the new PHB */
>  	pcibios_setup_phb_io_space(hose);
>  
> 

-- 
Alexey

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox