LinuxPPC-Dev Archive on lore.kernel.org


* Re: [PATCH v9 0/6] add support for relative references in special sections
From: Ard Biesheuvel @ 2018-07-01 17:39 UTC (permalink / raw)
  To: Thomas Gleixner, the arch/x86 maintainers, Ingo Molnar
  Cc: Linux Kernel Mailing List, Arnd Bergmann, Kees Cook,
	Michael Ellerman, Thomas Garnier, Serge E. Hallyn, Bjorn Helgaas,
	Benjamin Herrenschmidt, Russell King, Paul Mackerras,
	Catalin Marinas, Petr Mladek, James Morris, Andrew Morton,
	Nicolas Pitre, Josh Poimboeuf, Steven Rostedt, Sergey Senozhatsky,
	Linus Torvalds, Jessica Yu, linux-arm-kernel, linuxppc-dev,
	Will Deacon
In-Reply-To: <20180627151510.GE30631@arm.com>

On 27 June 2018 at 17:15, Will Deacon <will.deacon@arm.com> wrote:
> Hi Ard,
>
> On Tue, Jun 26, 2018 at 08:27:55PM +0200, Ard Biesheuvel wrote:
>> This adds support for emitting special sections such as initcall arrays,
>> PCI fixups and tracepoints as relative references rather than absolute
>> references. This reduces the size by 50% on 64-bit architectures, but
>> more importantly, it removes the need for carrying relocation metadata
>> for these sections in relocatable kernels (e.g., for KASLR) that needs
>> to be fixed up at boot time. On arm64, this reduces the vmlinux footprint
>> of such a reference by 8x (8 byte absolute reference + 24 byte RELA entry
>> vs 4 byte relative reference)
>>
>> Patch #3 was sent out before as a single patch. This series supersedes
>> the previous submission. This version makes relative ksymtab entries
>> dependent on the new Kconfig symbol HAVE_ARCH_PREL32_RELOCATIONS rather
>> than trying to infer from kbuild test robot replies for which architectures
>> it should be blacklisted.
>>
>> Patch #1 introduces the new Kconfig symbol HAVE_ARCH_PREL32_RELOCATIONS,
>> and sets it for the main architectures that are expected to benefit the
>> most from this feature, i.e., 64-bit architectures or ones that use
>> runtime relocations.
>>
>> Patch #2 adds support for #define'ing __DISABLE_EXPORTS to get rid of
>> ksymtab/kcrctab sections in decompressor and EFI stub objects when
>> rebuilding existing C files to run in a different context.
>
> I had a small question on patch 3, but it's really for my understanding.
> So, for patches 1-3:
>
> Reviewed-by: Will Deacon <will.deacon@arm.com>
>

Thanks all.

Thomas, Ingo,

Except for the below tweak against patch #3 for powerpc, which may
apparently get confused by an input section called .discard without
any suffixes, this series is good to go, but requires your ack to
proceed, so I would like to ask you to share your comments and/or
objections. Also, any suggestions or recommendations regarding the
route these patches should take are highly appreciated.

Ard.


diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 2d9c63f41031..61c844d4ab48 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -287,7 +287,7 @@ unsigned long read_word_at_a_time(const void *addr)
  * visible to the compiler.
  */
 #define __ADDRESSABLE(sym) \
-       static void * __attribute__((section(".discard"), used))        \
+       static void * __attribute__((section(".discard.addressable"), used)) \
                __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;

 /**

^ permalink raw reply related
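
A note on the relative references described in the cover letter quoted
above. What follows is a minimal sketch under stated assumptions, not the
series' implementation: the names rel_entry, rel_entry_set and
rel_entry_to_ptr are hypothetical. Each entry stores a signed 32-bit offset
from its own address, so it takes 4 bytes instead of 8, and its value stays
valid when the whole image is moved, so no RELA entry has to be carried for
it. With the arm64 figures from the cover letter that is (8 + 24) / 4 = 8x.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table entry: a signed 32-bit offset from the entry itself. */
struct rel_entry {
	int32_t offset;
};

/* The entry is half the size of a 64-bit absolute pointer. */
static_assert(sizeof(struct rel_entry) == 4, "relative entry must be 4 bytes");

/* Resolve the reference: address of the offset field plus the stored offset. */
static inline void *rel_entry_to_ptr(const struct rel_entry *e)
{
	return (void *)((uintptr_t)&e->offset + (intptr_t)e->offset);
}

/*
 * Encode a reference to target. Returns 0 on success, or -1 if the target
 * is not within the +/- 2 GiB range that a 32-bit offset can express.
 */
static inline int rel_entry_set(struct rel_entry *e, const void *target)
{
	intptr_t delta = (intptr_t)((uintptr_t)target - (uintptr_t)&e->offset);

	if (delta < INT32_MIN || delta > INT32_MAX)
		return -1;
	e->offset = (int32_t)delta;
	return 0;
}
```

In this sketch the encoding happens at run time only to keep the example
self-contained. The point of the series is that the toolchain emits the
offset at build time, so nothing is left to fix up at boot.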

* Re: [PATCH v2 1/1] powerpc/pseries: fix EEH recovery of some IOV devices
From: Michael Ellerman @ 2018-07-02  0:59 UTC (permalink / raw)
  To: Sam Bobroff, linuxppc-dev, linux-pci; +Cc: bhelgaas, bryantly
In-Reply-To: <7598ffeb48c16c88a34937ad93b18f806222b8df.1527208281.git.sbobroff@linux.ibm.com>

Sam Bobroff <sbobroff@linux.ibm.com> writes:

> EEH recovery currently fails on pSeries for some IOV capable PCI
> devices, if CONFIG_PCI_IOV is on and the hypervisor doesn't provide
> certain device tree properties for the device. (Found on an IOV
> capable device using the ipr driver.)
>
> Recovery fails in pci_enable_resources() at the check on r->parent,
> because r->flags is set and r->parent is not.  This state is due to
> sriov_init() setting the start, end and flags members of the IOV BARs
> but the parent not being set later in
> pseries_pci_fixup_iov_resources(), because the
> "ibm,open-sriov-vf-bar-info" property is missing.
>
> Correct this by zeroing the resource flags for IOV BARs when they
> can't be configured.
>
> Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
> ---
> Hi,
>
> This is a fix to allow EEH recovery to succeed in a specific situation,
> which I've tried to explain in the commit message.
>
> As with the RFC version, the IOV BARs are disabled by setting the resource
> flags to 0 but the other fields are now left as-is because that is what is done
> elsewhere (see sriov_init() and __pci_read_base()).
>
> I've also examined the concern raised by Bjorn Helgaas, that VFs could be
> enabled later after the BARs are disabled, and it already seems safe: enabling
> VFs (on pseries) depends on another device tree property,
> "ibm,number-of-configurable-vfs" as well as support for the RTAS function
> "ibm_map_pes". Since these are all part of the hypervisor's support for IOV it
> seems unlikely that we would ever see some of them but not all. (None are
> currently provided by QEMU/KVM.) (Additionally, the ipr driver on which the EEH
> recovery failure was discovered doesn't even seem to have SR-IOV support so it
> certainly can't enable VFs.)

Can you fold/reword the above into the change log, it seems like useful
detail.

cheers

^ permalink raw reply

* Re: [PATCH] powerpc: mpc5200: Remove VLA usage
From: Michael Ellerman @ 2018-07-02  1:33 UTC (permalink / raw)
  To: Kees Cook, Arnd Bergmann
  Cc: linuxppc-dev, Anatolij Gustschin, Paul Mackerras,
	Linux Kernel Mailing List
In-Reply-To: <CAGXu5jJz0zP_DvGqZsBjKZ_vY0=xKoUxCLVxXgUPAbcXS6aMNg@mail.gmail.com>

Kees Cook <keescook@chromium.org> writes:

> On Fri, Jun 29, 2018 at 2:02 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>> On Fri, Jun 29, 2018 at 8:53 PM, Kees Cook <keescook@chromium.org> wrote:
>>> In the quest to remove all stack VLA usage from the kernel[1], this
>>> switches to using a stack size large enough for the saved routine and
>>> adds a sanity check.
>>>
>>> [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
>>>
>>> Signed-off-by: Kees Cook <keescook@chromium.org>
>>
>> This seems particularly nice: not only does it avoid the dynamic stack
>> allocation, it also makes sure the new 0x500 handler doesn't overflow
>> into the 0x600 exception handler.
>>
>> It would help to explain how you arrived at that '256 byte' number in
>> the changelog though.
>
> Honestly, I just counted instructions, multiplied by 8 and rounded up
> to the next nearest power of 2, and the result felt right for giving
> some level of flexibility for code growth before tripping the WARN. :P
>
> I'm happy to adjust, of course. :)

What if we write it:

       char saved_0x500[0x600 - 0x500];

Hopefully the compiler is smart enough not to generate a VLA for that :)

cheers

^ permalink raw reply
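
A note on the suggestion above. The sketch below is an illustration, not
the mpc5200 code: routine_fits is a hypothetical helper. The bound
0x600 - 0x500 is an integer constant expression, so the array has a fixed
size of 256 bytes and is not a VLA. Declaring it at file scope, where C
does not permit VLAs at all, makes the compiler confirm that.

```c
#include <assert.h>
#include <stddef.h>

/*
 * File scope: variable length arrays are not allowed here, so this
 * declaration only compiles because the bound is a constant expression.
 */
static char saved_0x500[0x600 - 0x500];

/* The gap between the 0x500 and 0x600 vectors is exactly 256 bytes. */
static_assert(sizeof(saved_0x500) == 256, "0x500 handler slot is 256 bytes");

/* Hypothetical sanity check: does a routine of routine_len bytes fit? */
static inline int routine_fits(size_t routine_len)
{
	return routine_len <= sizeof(saved_0x500);
}
```

Writing the size as the difference of the two vector addresses documents
where the number comes from, which also answers the request to explain
the 256-byte figure in the changelog.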

* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: David Gibson @ 2018-07-02  4:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras
In-Reply-To: <20180629170747.471bea35@aik.ozlabs.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 5797 bytes --]

On Fri, Jun 29, 2018 at 05:07:47PM +1000, Alexey Kardashevskiy wrote:
> On Fri, 29 Jun 2018 15:18:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On Fri, 29 Jun 2018 14:57:02 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> > > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote:  
> > > > On Fri, 29 Jun 2018 14:12:41 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote:    
> > > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > > > > > an IOMMU page is contained in the physical page so the PCI hardware won't
> > > > > > get access to unassigned host memory.
> > > > > > 
> > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated
> > > > > > code) so the user space can pin memory backed with 64k pages and create
> > > > > > a hardware TCE table with a bigger page size. We were lucky so far and
> > > > > > did not hit this yet as the very first time the mapping happens
> > > > > > we do not have tbl::it_userspace allocated yet and fall back to
> > > > > > the userspace which in turn calls VFIO IOMMU driver and that fails
> > > > > > because of the check in vfio_iommu_spapr_tce.c which is really
> > > > > > sustainable solution.
> > > > > > 
> > > > > > This stores the smallest preregistered page size in the preregistered
> > > > > > region descriptor and changes the mm_iommu_xxx API to check this against
> > > > > > the IOMMU page size.
> > > > > > 
> > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > > > ---
> > > > > > Changes:
> > > > > > v2:
> > > > > > * explicitly check for compound pages before calling compound_order()
> > > > > > 
> > > > > > ---
> > > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> > > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > > > > > for IOMMU pages without checking the mmu pagesize and this will fail
> > > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > > > > > 
> > > > > > With the change, mapping will fail in KVM and the guest will print:
> > > > > > 
> > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
> > > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
> > > > > > mlx5_core 0000:00:00.0: failed to map direct window for
> > > > > > /pci@800000020000000/ethernet@0: -1      
> > > > > 
> > > > > [snip]    
> > > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > >  		struct mm_iommu_table_group_mem_t **pmem)
> > > > > >  {
> > > > > >  	struct mm_iommu_table_group_mem_t *mem;
> > > > > > -	long i, j, ret = 0, locked_entries = 0;
> > > > > > +	long i, j, ret = 0, locked_entries = 0, pageshift;
> > > > > >  	struct page *page = NULL;
> > > > > >  
> > > > > >  	mutex_lock(&mem_list_mutex);
> > > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > >  		goto unlock_exit;
> > > > > >  	}
> > > > > >  
> > > >  > > +	mem->pageshift = 30; /* start from 1G pages - the biggest we have */      
> > > > > 
> > > > > What about 16G pages on an HPT system?    
> > > > 
> > > > 
> > > > Below in the loop mem->pageshift will reduce to the biggest actual size
> > > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually
> > > > pinned, no loss there.    
> > > 
> > > Are you saying that 16G IOMMU pages aren't supported?  Or that there's
> > > some reason a guest can never use them?  
> > 
> > 
> > ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will
> > lift the limit up to 64 then, easier this way.
> 
> 
> Ah, no, rather this as the upper limit:
> 
> mem->pageshift = ilog2(entries) + PAGE_SHIFT;

I can't make sense of this comment in context.  I see how you're
computing the minimum page size in the reserved region.

My question is about what the "maximum minimum" is - the starting
value from which you calculate.  Currently it's 1G, but I can't
immediately see a reason that 16G is impossible here.

> @entries here is a number of system pages being pinned in that
> function.
> 
> 
> 
> > 
> > >   
> > > > > >  	for (i = 0; i < entries; ++i) {
> > > > > >  		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> > > > > >  					1/* pages */, 1/* iswrite */, &page)) {
> > > > > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > >  			}
> > > > > >  		}
> > > > > >  populate:
> > > > > > +		pageshift = PAGE_SHIFT;
> > > > > > +		if (PageCompound(page))
> > > > > > +			pageshift += compound_order(compound_head(page));
> > > > > > +		mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift);      
> > > > > 
> > > > > Why not make mem->pageshift and pageshift local the same type to avoid
> > > > > the min_t() ?    
> > > > 
> > > > I was under impression min() is deprecated (misinterpret checkpatch.pl
> > > > may be) and therefore did not pay attention to it. I can fix this and
> > > > repost if there is no other question.    
> > > 
> > > Hm, it's possible.  
> > 
> > Nah, tried min(), compiles fine.
> 
> 
> 



-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5
From: Michael Ellerman @ 2018-07-02  4:16 UTC (permalink / raw)
  To: Linus Torvalds, Larry Finger
  Cc: Matthew Wilcox, Kirill A. Shutemov, Vlastimil Babka,
	Christoph Lameter, Dave Hansen, Jerome Glisse, Lai Jiangshan,
	Martin Schwidefsky, Pekka Enberg, Randy Dunlap, Andrey Ryabinin,
	Andrew Morton, Benjamin Herrenschmidt, Paul Mackerras, ppc-dev,
	Linux Kernel Mailing List
In-Reply-To: <CA+55aFzZ7PND2Xvz9wB1jaCmp0rBMTSmJtKiFwSeOWy9iLSd8Q@mail.gmail.com>

Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger <Larry.Finger@lwfinger.net> wrote:
>>
>> I have more information regarding this BUG. Line 700 of page-flags.h is the
>> macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
>> the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
>> in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
>> include/linux/mm.h. I also added a printk call to PageTable() that logs
>> page->page_type. The routine was called twice. The first had page_type of
>> 0xfffffbff, which would have been expected for a page table. The second call had
>> 0xffffffff, which led to the BUG.
>
> So it looks to me like the tear-down of the page tables first found a
> page that is indeed a page table, and cleared the page table bit
> (well, it set it - the bits are reversed).
...
>
> That said, can some ppc person who knows the 32-bit ppc code and maybe
> knows what that "interrupt: 700" means talk about that oddity in the
> trace, please?

I think everyone else answered your questions here, and it should be
fixed now in your tree.

Larry let me know if you're still seeing a crash with 4.18-rc3.

cheers

^ permalink raw reply
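
A note on the "bits are reversed" remark above. The following is a
simplified model, not the kernel's page-flags.h: struct page_model and the
three helpers are hypothetical, and the mask 0x400 is inferred from the
reported value 0xfffffbff. A type is marked by clearing its bit in a field
that otherwise reads 0xffffffff, so 0xfffffbff means "is a page table" and
0xffffffff means "no type set".

```c
#include <stdint.h>

/* Assumed constants, chosen to match the values reported in the thread. */
#define MODEL_PAGE_TYPE_BASE	0xf0000000u
#define MODEL_PG_TABLE		0x00000400u

/* Hypothetical stand-in for the page_type field of struct page. */
struct page_model {
	uint32_t page_type;
};

/* A page has the type when its bit is CLEAR and the base bits are set. */
static inline int model_page_is_table(const struct page_model *p)
{
	return (p->page_type & (MODEL_PAGE_TYPE_BASE | MODEL_PG_TABLE)) ==
		MODEL_PAGE_TYPE_BASE;
}

/* Marking the page as a page table clears the bit. */
static inline void model_set_page_table(struct page_model *p)
{
	p->page_type &= ~MODEL_PG_TABLE;
}

/* Removing the type sets the bit again. */
static inline void model_clear_page_table(struct page_model *p)
{
	p->page_type |= MODEL_PG_TABLE;
}
```

In this model a second teardown of the same page sees 0xffffffff, fails
the type check, and corresponds to the VM_BUG_ON_PAGE that Larry hit.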

* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Alexey Kardashevskiy @ 2018-07-02  4:33 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras
In-Reply-To: <20180702040852.GW3422@umbus.fritz.box>

On Mon, 2 Jul 2018 14:08:52 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Jun 29, 2018 at 05:07:47PM +1000, Alexey Kardashevskiy wrote:
> > On Fri, 29 Jun 2018 15:18:20 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> > > On Fri, 29 Jun 2018 14:57:02 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote:    
> > > > > On Fri, 29 Jun 2018 14:12:41 +1000
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >       
> > > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote:      
> > > > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > > > > > > an IOMMU page is contained in the physical page so the PCI hardware won't
> > > > > > > get access to unassigned host memory.
> > > > > > > 
> > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated
> > > > > > > code) so the user space can pin memory backed with 64k pages and create
> > > > > > > a hardware TCE table with a bigger page size. We were lucky so far and
> > > > > > > did not hit this yet as the very first time the mapping happens
> > > > > > > we do not have tbl::it_userspace allocated yet and fall back to
> > > > > > > the userspace which in turn calls VFIO IOMMU driver and that fails
> > > > > > > because of the check in vfio_iommu_spapr_tce.c which is really
> > > > > > > sustainable solution.
> > > > > > > 
> > > > > > > This stores the smallest preregistered page size in the preregistered
> > > > > > > region descriptor and changes the mm_iommu_xxx API to check this against
> > > > > > > the IOMMU page size.
> > > > > > > 
> > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > > > > ---
> > > > > > > Changes:
> > > > > > > v2:
> > > > > > > * explicitly check for compound pages before calling compound_order()
> > > > > > > 
> > > > > > > ---
> > > > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> > > > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > > > > > > for IOMMU pages without checking the mmu pagesize and this will fail
> > > > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > > > > > > 
> > > > > > > With the change, mapping will fail in KVM and the guest will print:
> > > > > > > 
> > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
> > > > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
> > > > > > > mlx5_core 0000:00:00.0: failed to map direct window for
> > > > > > > /pci@800000020000000/ethernet@0: -1        
> > > > > > 
> > > > > > [snip]      
> > > > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > > >  		struct mm_iommu_table_group_mem_t **pmem)
> > > > > > >  {
> > > > > > >  	struct mm_iommu_table_group_mem_t *mem;
> > > > > > > -	long i, j, ret = 0, locked_entries = 0;
> > > > > > > +	long i, j, ret = 0, locked_entries = 0, pageshift;
> > > > > > >  	struct page *page = NULL;
> > > > > > >  
> > > > > > >  	mutex_lock(&mem_list_mutex);
> > > > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > > >  		goto unlock_exit;
> > > > > > >  	}
> > > > > > >  
> > > > >  > > +	mem->pageshift = 30; /* start from 1G pages - the biggest we have */        
> > > > > > 
> > > > > > What about 16G pages on an HPT system?      
> > > > > 
> > > > > 
> > > > > Below in the loop mem->pageshift will reduce to the biggest actual size
> > > > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually
> > > > > pinned, no loss there.      
> > > > 
> > > > Are you saying that 16G IOMMU pages aren't supported?  Or that there's
> > > > some reason a guest can never use them?    
> > > 
> > > 
> > > ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will
> > > lift the limit up to 64 then, easier this way.  
> > 
> > 
> > Ah, no, rather this as the upper limit:
> > 
> > mem->pageshift = ilog2(entries) + PAGE_SHIFT;  
> 
> I can't make sense of this comment in context.  I see how you're
> computing the minimum page size in the reserved region.
> 
> My question is about what the "maximum minimum" is - the starting
> value from which you calculate.  Currently it's 1G, but I can't
> immediately see a reason that 16G is impossible here.


16GB is impossible if the chunk we are preregistering here is smaller
than that, for example, the entire guest ram is 4GB. If that is the
case and we try mapping a 16GB IOMMU page, this should fail as I do not
really know what happens to the memory between 4GB..16GB.

imho if not that, then 1<<64 would make a good upper limit.



> 
> > @entries here is a number of system pages being pinned in that
> > function.
> > 
> > 
> >   
> > >   
> > > >     
> > > > > > >  	for (i = 0; i < entries; ++i) {
> > > > > > >  		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> > > > > > >  					1/* pages */, 1/* iswrite */, &page)) {
> > > > > > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > > >  			}
> > > > > > >  		}
> > > > > > >  populate:
> > > > > > > +		pageshift = PAGE_SHIFT;
> > > > > > > +		if (PageCompound(page))
> > > > > > > +			pageshift += compound_order(compound_head(page));
> > > > > > > +		mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift);        
> > > > > > 
> > > > > > Why not make mem->pageshift and pageshift local the same type to avoid
> > > > > > the min_t() ?      
> > > > > 
> > > > > I was under impression min() is deprecated (misinterpret checkpatch.pl
> > > > > may be) and therefore did not pay attention to it. I can fix this and
> > > > > repost if there is no other question.      
> > > > 
> > > > Hm, it's possible.    
> > > 
> > > Nah, tried min(), compiles fine.  
> > 
> > 
> >   
> 
> 
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson



--
Alexey

^ permalink raw reply
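
A note on the page shift calculation discussed above. This is a user-space
model under stated assumptions, not the patch itself: MODEL_PAGE_SHIFT,
model_ilog2, region_min_pageshift and iommu_page_fits are hypothetical
names, the system page size is assumed to be 64K, and the per-page compound
order is passed in as a plain array. It starts from the upper limit Alexey
proposes, ilog2(entries) + PAGE_SHIFT, and lowers it to the smallest
backing page found.

```c
#include <stddef.h>

/* Assumption for this model: 64K system pages. */
#define MODEL_PAGE_SHIFT 16

/* floor(log2(v)) for v > 0; returns 0 for v == 0. */
static unsigned int model_ilog2(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * compound_orders[i] is the compound order of the page backing system
 * page i (0 for a page that is not compound). Returns the smallest
 * backing page shift in the region, capped by the size of the region.
 */
static unsigned int region_min_pageshift(const unsigned int *compound_orders,
					 unsigned long entries)
{
	unsigned int pageshift = model_ilog2(entries) + MODEL_PAGE_SHIFT;
	unsigned long i;

	for (i = 0; i < entries; ++i) {
		unsigned int cur = MODEL_PAGE_SHIFT + compound_orders[i];

		if (cur < pageshift)
			pageshift = cur;
	}
	return pageshift;
}

/* An IOMMU page may be mapped only if it is no bigger than that minimum. */
static int iommu_page_fits(unsigned int iommu_pageshift,
			   unsigned int region_pageshift)
{
	return iommu_pageshift <= region_pageshift;
}
```

With this starting value there is no fixed constant such as 30 to outgrow.
A region backed entirely by 16G pages would keep a shift of 34, provided
the region itself is at least that large.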

* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: David Gibson @ 2018-07-02  4:52 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras
In-Reply-To: <20180702143244.3a08ebe2@aik.ozlabs.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 7092 bytes --]

On Mon, Jul 02, 2018 at 02:33:30PM +1000, Alexey Kardashevskiy wrote:
> On Mon, 2 Jul 2018 14:08:52 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Jun 29, 2018 at 05:07:47PM +1000, Alexey Kardashevskiy wrote:
> > > On Fri, 29 Jun 2018 15:18:20 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >   
> > > > On Fri, 29 Jun 2018 14:57:02 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >   
> > > > > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote:    
> > > > > > On Fri, 29 Jun 2018 14:12:41 +1000
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > >       
> > > > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote:      
> > > > > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > > > > > > > an IOMMU page is contained in the physical page so the PCI hardware won't
> > > > > > > > get access to unassigned host memory.
> > > > > > > > 
> > > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated
> > > > > > > > code) so the user space can pin memory backed with 64k pages and create
> > > > > > > > a hardware TCE table with a bigger page size. We were lucky so far and
> > > > > > > > did not hit this yet as the very first time the mapping happens
> > > > > > > > we do not have tbl::it_userspace allocated yet and fall back to
> > > > > > > > the userspace which in turn calls VFIO IOMMU driver and that fails
> > > > > > > > because of the check in vfio_iommu_spapr_tce.c which is really
> > > > > > > > sustainable solution.
> > > > > > > > 
> > > > > > > > This stores the smallest preregistered page size in the preregistered
> > > > > > > > region descriptor and changes the mm_iommu_xxx API to check this against
> > > > > > > > the IOMMU page size.
> > > > > > > > 
> > > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > > > > > ---
> > > > > > > > Changes:
> > > > > > > > v2:
> > > > > > > > * explicitly check for compound pages before calling compound_order()
> > > > > > > > 
> > > > > > > > ---
> > > > > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> > > > > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > > > > > > > for IOMMU pages without checking the mmu pagesize and this will fail
> > > > > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > > > > > > > 
> > > > > > > > With the change, mapping will fail in KVM and the guest will print:
> > > > > > > > 
> > > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
> > > > > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
> > > > > > > > mlx5_core 0000:00:00.0: failed to map direct window for
> > > > > > > > /pci@800000020000000/ethernet@0: -1        
> > > > > > > 
> > > > > > > [snip]      
> > > > > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > > > >  		struct mm_iommu_table_group_mem_t **pmem)
> > > > > > > >  {
> > > > > > > >  	struct mm_iommu_table_group_mem_t *mem;
> > > > > > > > -	long i, j, ret = 0, locked_entries = 0;
> > > > > > > > +	long i, j, ret = 0, locked_entries = 0, pageshift;
> > > > > > > >  	struct page *page = NULL;
> > > > > > > >  
> > > > > > > >  	mutex_lock(&mem_list_mutex);
> > > > > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > > > >  		goto unlock_exit;
> > > > > > > >  	}
> > > > > > > >  
> > > > > >  > > +	mem->pageshift = 30; /* start from 1G pages - the biggest we have */        
> > > > > > > 
> > > > > > > What about 16G pages on an HPT system?      
> > > > > > 
> > > > > > 
> > > > > > Below in the loop mem->pageshift will reduce to the biggest actual size
> > > > > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually
> > > > > > pinned, no loss there.      
> > > > > 
> > > > > Are you saying that 16G IOMMU pages aren't supported?  Or that there's
> > > > > some reason a guest can never use them?    
> > > > 
> > > > 
> > > > ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will
> > > > lift the limit up to 64 then, easier this way.  
> > > 
> > > 
> > > Ah, no, rather this as the upper limit:
> > > 
> > > mem->pageshift = ilog2(entries) + PAGE_SHIFT;  
> > 
> > I can't make sense of this comment in context.  I see how you're
> > computing the minimum page size in the reserved region.
> > 
> > My question is about what the "maximum minimum" is - the starting
> > value from which you calculate.  Currently it's 1G, but I can't
> > immediately see a reason that 16G is impossible here.
> 
> 
> 16GB is impossible if the chunk we are preregistering here is smaller
> than that, for example, the entire guest ram is 4GB.

Of course.  Just like it was for 1GiB if you had a 512MiB guest, for
example.  I'm talking about a case where you have a guest that's
>=16GiB and you *have* allocated 16GiB hugepages to back it.

> If that is the
> case and we try mapping a 16GB IOMMU page, this should fail as I do not
> really know what happens to the memory between 4GB..16GB.
> 
> imho if not that, then 1<<64 would make a good upper limit.
> 
> 
> 
> > 
> > > @entries here is a number of system pages being pinned in that
> > > function.
> > > 
> > > 
> > >   
> > > >   
> > > > >     
> > > > > > > >  	for (i = 0; i < entries; ++i) {
> > > > > > > >  		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> > > > > > > >  					1/* pages */, 1/* iswrite */, &page)) {
> > > > > > > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > > > >  			}
> > > > > > > >  		}
> > > > > > > >  populate:
> > > > > > > > +		pageshift = PAGE_SHIFT;
> > > > > > > > +		if (PageCompound(page))
> > > > > > > > +			pageshift += compound_order(compound_head(page));
> > > > > > > > +		mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift);        
> > > > > > > 
> > > > > > > Why not make mem->pageshift and pageshift local the same type to avoid
> > > > > > > the min_t() ?      
> > > > > > 
> > > > > > I was under impression min() is deprecated (misinterpret checkpatch.pl
> > > > > > may be) and therefore did not pay attention to it. I can fix this and
> > > > > > repost if there is no other question.      
> > > > > 
> > > > > Hm, it's possible.    
> > > > 
> > > > Nah, tried min(), compiles fine.  
> > > 
> > > 
> > >   
> > 
> > 
> > 
> 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v5 0/7] powerpc/pseries: Machine check handler improvements.
From: Mahesh J Salgaonkar @ 2018-07-02  5:45 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Nicholas Piggin, Michal Suchanek, Michael Ellerman, stable,
	Aneesh Kumar K.V, Nicholas Piggin, Aneesh Kumar K.V,
	Laurent Dufour, Michal Suchanek

This patch series includes some improvements to the machine check handler
for pseries. Patch 1 fixes a buffer overrun issue if the rtas extended error
log size is greater than RTAS_ERROR_LOG_MAX.
Patch 2 fixes an issue where the machine check handler crashes the
kernel while accessing a vmalloc-ed buffer in nmi context.
Patch 3 fixes an endian bug while restoring r3 in the MCE handler.
Patch 4 defines the MCE error event section.
Patch 5 implements a real mode mce handler and flushes the SLBs on SLB error.
Patch 6 displays the MCE error details on the console.
Patch 7 saves and dumps the SLB contents on SLB MCE errors to improve
debuggability.

Change in V5:
- Use min_t instead of max_t.
- Fix an issue reported by kbuild test robot and addressed review comments.

Change in V4:
- Flush the SLBs in real mode mce handler to handle SLB errors for entry 0.
- Allocate buffers per cpu to hold rtas error log and old slb contents.
- Defer the logging of rtas error log to irq work queue.

Change in V3:
- Moved patch 5 to patch 2

Change in V2:
- patch 3: Display additional info (NIP and task info) in MCE error details.
- patch 5: Fix endian bug while restoring r3 in MCE handler.

---

Mahesh Salgaonkar (7):
      powerpc/pseries: Avoid using the size greater than
      powerpc/pseries: Defer the logging of rtas error to irq work queue.
      powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
      powerpc/pseries: Define MCE error event section.
      powerpc/pseries: flush SLB contents on SLB MCE errors.
      powerpc/pseries: Display machine check error details.
      powerpc/pseries: Dump the SLB contents on SLB MCE errors.


 arch/powerpc/include/asm/book3s/64/mmu-hash.h |    8 +
 arch/powerpc/include/asm/machdep.h            |    1 
 arch/powerpc/include/asm/paca.h               |    4 
 arch/powerpc/include/asm/rtas.h               |  116 ++++++++++++
 arch/powerpc/kernel/exceptions-64s.S          |   42 ++++
 arch/powerpc/kernel/mce.c                     |   16 +-
 arch/powerpc/mm/slb.c                         |   63 +++++++
 arch/powerpc/platforms/powernv/opal.c         |    1 
 arch/powerpc/platforms/pseries/pseries.h      |    1 
 arch/powerpc/platforms/pseries/ras.c          |  241 +++++++++++++++++++++++--
 arch/powerpc/platforms/pseries/setup.c        |   27 +++
 11 files changed, 499 insertions(+), 21 deletions(-)

--
Signature

^ permalink raw reply

* [PATCH v5 1/7] powerpc/pseries: Avoid using the size greater than
From: Mahesh J Salgaonkar @ 2018-07-02  5:46 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Michal Suchanek, Nicholas Piggin, Aneesh Kumar K.V,
	Laurent Dufour, Michal Suchanek
In-Reply-To: <153051022088.30541.5610525713141009848.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

The global mce data buffer that is used to copy the rtas error log is 2048
(RTAS_ERROR_LOG_MAX) bytes in size. Before the copy we read
extended_log_length from the rtas error log header, then use the max of
extended_log_length and RTAS_ERROR_LOG_MAX as the size of data to be copied.
Ideally the platform (phyp) will never send an extended error log with
size > 2048. But if that happens, we risk a buffer overrun
and corruption. Fix this by using min_t instead.
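
As an illustration only, the clamping described above can be sketched in
plain userspace C. The names LOG_BUF_MAX and copy_log_clamped() are
hypothetical stand-ins chosen for this sketch and are not part of the
patch; only the min-instead-of-max idea comes from the change itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOG_BUF_MAX 2048	/* hypothetical stand-in for RTAS_ERROR_LOG_MAX */

/*
 * Hypothetical helper: copy a log of reported_len bytes into a fixed
 * LOG_BUF_MAX-byte buffer and return the number of bytes copied.
 */
static size_t copy_log_clamped(unsigned char *dst, const unsigned char *src,
			       size_t reported_len)
{
	/* min, not max: never copy more than the destination can hold */
	size_t len = reported_len < LOG_BUF_MAX ? reported_len : LOG_BUF_MAX;

	assert(dst != NULL && src != NULL);
	memset(dst, 0, LOG_BUF_MAX);
	memcpy(dst, src, len);
	return len;
}
```

With max in place of min, a reported length above 2048 would be passed
straight to memcpy and overrun the destination.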

Fixes: d368514c3097 ("powerpc: Fix corruption when grabbing FWNMI data")
Reported-by: Michal Suchanek <msuchanek@suse.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/ras.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 5e1ef9150182..ef104144d4bc 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -371,7 +371,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
 		int len, error_log_length;
 
 		error_log_length = 8 + rtas_error_extended_log_length(h);
-		len = max_t(int, error_log_length, RTAS_ERROR_LOG_MAX);
+		len = min_t(int, error_log_length, RTAS_ERROR_LOG_MAX);
 		memset(global_mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
 		memcpy(global_mce_data_buf, h, len);
 		errhdr = (struct rtas_error_log *)global_mce_data_buf;


* [PATCH v5 2/7] powerpc/pseries: Defer the logging of rtas error to irq work queue.
From: Mahesh J Salgaonkar @ 2018-07-02  5:46 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: stable, Nicholas Piggin, Aneesh Kumar K.V, Laurent Dufour,
	Michal Suchanek
In-Reply-To: <153051022088.30541.5610525713141009848.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

rtas_log_buf is a buffer that holds RTAS event data communicated
to the kernel by the hypervisor. This buffer is then used to pass RTAS event
data to user space through proc fs. The buffer is allocated from the vmalloc
(non-linear mapping) area.

On a Machine check interrupt, register r3 points to the RTAS extended event
log passed by the hypervisor that contains the MCE event. The pseries
machine check handler then logs this error into rtas_log_buf. Since
rtas_log_buf is a vmalloc-ed (non-linear) buffer, we end up taking a
page fault (vector 0x300) while accessing it. Since the machine check
interrupt handler runs in NMI context, we cannot afford to take any
page fault. Page faults are not honored in NMI context and cause a
kernel panic. Apart from that, as Nick pointed out, pSeries_log_error()
also takes a spin_lock while logging the error, which is not safe in NMI
context. It may end up in a deadlock if we get another MCE before releasing
the lock. Fix this by deferring the logging of the rtas error to an irq work
queue.

The current implementation uses two different buffers to hold the rtas error
log, depending on whether an extended log is provided or not. This makes it a
bit difficult to identify which buffer has valid data that needs to be logged
later in irq work. Simplify this by using a single buffer, one per paca, and
copy the rtas log to it irrespective of whether an extended log is provided or
not. Allocate this buffer below the RMA region so that it can be accessed
in the real mode mce handler.
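
A minimal userspace sketch of the one-buffer-per-cpu layout, assuming a
single contiguous chunk carved into fixed-size slices. LOG_BUF_MAX and
percpu_log_slice() are hypothetical names for this illustration; the
patch itself stores the resulting pointer in each cpu's paca.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_BUF_MAX 2048	/* hypothetical stand-in for RTAS_ERROR_LOG_MAX */

/*
 * Hypothetical helper: return the slice of one contiguous chunk that
 * belongs to the given cpu. The chunk is assumed to hold
 * nr_cpus * LOG_BUF_MAX bytes.
 */
static uint8_t *percpu_log_slice(uint8_t *chunk, unsigned int nr_cpus,
				 unsigned int cpu)
{
	assert(chunk != NULL && cpu < nr_cpus);
	return chunk + (size_t)LOG_BUF_MAX * cpu;
}
```

Because every cpu writes only to its own slice, the handler does not need
to decide between two differently sized buffers at logging time.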

Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
Cc: stable@vger.kernel.org
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/paca.h        |    3 ++
 arch/powerpc/platforms/pseries/ras.c   |   47 ++++++++++++++++++++++----------
 arch/powerpc/platforms/pseries/setup.c |   16 +++++++++++
 3 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 3f109a3e3edb..b441fef53077 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -251,6 +251,9 @@ struct paca_struct {
 	void *rfi_flush_fallback_area;
 	u64 l1d_flush_size;
 #endif
+#ifdef CONFIG_PPC_PSERIES
+	u8 *mce_data_buf;		/* buffer to hold per cpu rtas errlog */
+#endif /* CONFIG_PPC_PSERIES */
 } ____cacheline_aligned;
 
 extern void copy_mm_to_paca(struct mm_struct *mm);
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index ef104144d4bc..14a46b07ab2f 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -22,6 +22,7 @@
 #include <linux/of.h>
 #include <linux/fs.h>
 #include <linux/reboot.h>
+#include <linux/irq_work.h>
 
 #include <asm/machdep.h>
 #include <asm/rtas.h>
@@ -32,11 +33,13 @@
 static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX];
 static DEFINE_SPINLOCK(ras_log_buf_lock);
 
-static char global_mce_data_buf[RTAS_ERROR_LOG_MAX];
-static DEFINE_PER_CPU(__u64, mce_data_buf);
-
 static int ras_check_exception_token;
 
+static void mce_process_errlog_event(struct irq_work *work);
+static struct irq_work mce_errlog_process_work = {
+	.func = mce_process_errlog_event,
+};
+
 #define EPOW_SENSOR_TOKEN	9
 #define EPOW_SENSOR_INDEX	0
 
@@ -330,16 +333,20 @@ static irqreturn_t ras_error_interrupt(int irq, void *dev_id)
 	((((A) >= 0x7000) && ((A) < 0x7ff0)) || \
 	(((A) >= rtas.base) && ((A) < (rtas.base + rtas.size - 16))))
 
+static inline struct rtas_error_log *fwnmi_get_errlog(void)
+{
+	return (struct rtas_error_log *)local_paca->mce_data_buf;
+}
+
 /*
  * Get the error information for errors coming through the
  * FWNMI vectors.  The pt_regs' r3 will be updated to reflect
  * the actual r3 if possible, and a ptr to the error log entry
  * will be returned if found.
  *
- * If the RTAS error is not of the extended type, then we put it in a per
- * cpu 64bit buffer. If it is the extended type we use global_mce_data_buf.
+ * Use one buffer mce_data_buf per cpu to store RTAS error.
  *
- * The global_mce_data_buf does not have any locks or protection around it,
+ * The mce_data_buf does not have any locks or protection around it,
  * if a second machine check comes in, or a system reset is done
  * before we have logged the error, then we will get corruption in the
  * error log.  This is preferable over holding off on calling
@@ -349,7 +356,7 @@ static irqreturn_t ras_error_interrupt(int irq, void *dev_id)
 static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
 {
 	unsigned long *savep;
-	struct rtas_error_log *h, *errhdr = NULL;
+	struct rtas_error_log *h;
 
 	/* Mask top two bits */
 	regs->gpr[3] &= ~(0x3UL << 62);
@@ -362,22 +369,20 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
 	savep = __va(regs->gpr[3]);
 	regs->gpr[3] = savep[0];	/* restore original r3 */
 
-	/* If it isn't an extended log we can use the per cpu 64bit buffer */
 	h = (struct rtas_error_log *)&savep[1];
+	/* Use the per cpu buffer from paca to store rtas error log */
+	memset(local_paca->mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
 	if (!rtas_error_extended(h)) {
-		memcpy(this_cpu_ptr(&mce_data_buf), h, sizeof(__u64));
-		errhdr = (struct rtas_error_log *)this_cpu_ptr(&mce_data_buf);
+		memcpy(local_paca->mce_data_buf, h, sizeof(__u64));
 	} else {
 		int len, error_log_length;
 
 		error_log_length = 8 + rtas_error_extended_log_length(h);
 		len = min_t(int, error_log_length, RTAS_ERROR_LOG_MAX);
-		memset(global_mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
-		memcpy(global_mce_data_buf, h, len);
-		errhdr = (struct rtas_error_log *)global_mce_data_buf;
+		memcpy(local_paca->mce_data_buf, h, len);
 	}
 
-	return errhdr;
+	return (struct rtas_error_log *)local_paca->mce_data_buf;
 }
 
 /* Call this when done with the data returned by FWNMI_get_errinfo.
@@ -422,6 +427,17 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
 	return 0; /* need to perform reset */
 }
 
+/*
+ * Process MCE rtas errlog event.
+ */
+static void mce_process_errlog_event(struct irq_work *work)
+{
+	struct rtas_error_log *err;
+
+	err = fwnmi_get_errlog();
+	log_error((char *)err, ERR_TYPE_RTAS_LOG, 0);
+}
+
 /*
  * See if we can recover from a machine check exception.
  * This is only called on power4 (or above) and only via
@@ -466,7 +482,8 @@ static int recover_mce(struct pt_regs *regs, struct rtas_error_log *err)
 		recovered = 1;
 	}
 
-	log_error((char *)err, ERR_TYPE_RTAS_LOG, 0);
+	/* Queue irq work to log this rtas event later. */
+	irq_work_queue(&mce_errlog_process_work);
 
 	return recovered;
 }
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index fdb32e056ef4..60a067a6e743 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -41,6 +41,7 @@
 #include <linux/root_dev.h>
 #include <linux/of.h>
 #include <linux/of_pci.h>
+#include <linux/memblock.h>
 
 #include <asm/mmu.h>
 #include <asm/processor.h>
@@ -101,6 +102,9 @@ static void pSeries_show_cpuinfo(struct seq_file *m)
 static void __init fwnmi_init(void)
 {
 	unsigned long system_reset_addr, machine_check_addr;
+	u8 *mce_data_buf;
+	unsigned int i;
+	int nr_cpus = num_possible_cpus();
 
 	int ibm_nmi_register = rtas_token("ibm,nmi-register");
 	if (ibm_nmi_register == RTAS_UNKNOWN_SERVICE)
@@ -114,6 +118,18 @@ static void __init fwnmi_init(void)
 	if (0 == rtas_call(ibm_nmi_register, 2, 1, NULL, system_reset_addr,
 				machine_check_addr))
 		fwnmi_active = 1;
+
+	/*
+	 * Allocate a chunk for per cpu buffer to hold rtas errorlog.
+	 * It will be used in real mode mce handler, hence it needs to be
+	 * below RMA.
+	 */
+	mce_data_buf = __va(memblock_alloc_base(RTAS_ERROR_LOG_MAX * nr_cpus,
+					RTAS_ERROR_LOG_MAX, ppc64_rma_size));
+	for_each_possible_cpu(i) {
+		paca_ptrs[i]->mce_data_buf = mce_data_buf +
+						(RTAS_ERROR_LOG_MAX * i);
+	}
 }
 
 static void pseries_8259_cascade(struct irq_desc *desc)


* [PATCH v5 3/7] powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.
From: Mahesh J Salgaonkar @ 2018-07-02  5:46 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: stable, Nicholas Piggin, Nicholas Piggin, Aneesh Kumar K.V,
	Laurent Dufour, Michal Suchanek
In-Reply-To: <153051022088.30541.5610525713141009848.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

During a Machine Check interrupt on the pseries platform, register r3 points
to the RTAS extended event log passed by the hypervisor. Since the hypervisor
uses r3 to pass the pointer to the rtas log, it stores the original r3 value
at the start of the memory (first 8 bytes) pointed to by r3. Since the
hypervisor stores this info and the rtas log is in BE format, linux should
make sure to restore the r3 value in the correct endian format.

Without this patch, when the MCE handler returns after recovery to the code
that caused the MCE, it may end up with a Data SLB access interrupt for an
invalid address, followed by a kernel panic or hang.

[   62.878965] Severe Machine check interrupt [Recovered]
[   62.878968]   NIP [d00000000ca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[   62.878969]   Initiator: CPU
[   62.878970]   Error type: SLB [Multihit]
[   62.878971]     Effective address: d00000000ca70000
cpu 0xa: Vector: 380 (Data SLB Access) at [c0000000fc7775b0]
    pc: c0000000009694c0: vsnprintf+0x80/0x480
    lr: c0000000009698e0: vscnprintf+0x20/0x60
    sp: c0000000fc777830
   msr: 8000000002009033
   dar: a803a30c000000d0
  current = 0xc00000000bc9ef00
  paca    = 0xc00000001eca5c00	 softe: 3	 irq_happened: 0x01
    pid   = 8860, comm = insmod
[c0000000fc7778b0] c0000000009698e0 vscnprintf+0x20/0x60
[c0000000fc7778e0] c00000000016b6c4 vprintk_emit+0xb4/0x4b0
[c0000000fc777960] c00000000016d40c vprintk_func+0x5c/0xd0
[c0000000fc777980] c00000000016cbb4 printk+0x38/0x4c
[c0000000fc7779a0] d00000000ca301c0 init_module+0x1c0/0x338 [bork_kernel]
[c0000000fc777a40] c00000000000d9c4 do_one_initcall+0x54/0x230
[c0000000fc777b00] c0000000001b3b74 do_init_module+0x8c/0x248
[c0000000fc777b90] c0000000001b2478 load_module+0x12b8/0x15b0
[c0000000fc777d30] c0000000001b29e8 sys_finit_module+0xa8/0x110
[c0000000fc777e30] c00000000000b204 system_call+0x58/0x6c
--- Exception: c00 (System Call) at 00007fff8bda0644
SP (7fffdfbfe980) is in userspace

This patch fixes this issue.

Fixes: a08a53ea4c97 ("powerpc/le: Enable RTAS events support")
Cc: stable@vger.kernel.org
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/ras.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 14a46b07ab2f..851ce326874a 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -367,7 +367,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
 	}
 
 	savep = __va(regs->gpr[3]);
-	regs->gpr[3] = savep[0];	/* restore original r3 */
+	regs->gpr[3] = be64_to_cpu(savep[0]);	/* restore original r3 */
 
 	h = (struct rtas_error_log *)&savep[1];
 	/* Use the per cpu buffer from paca to store rtas error log */


* [PATCH v5 4/7] powerpc/pseries: Define MCE error event section.
From: Mahesh J Salgaonkar @ 2018-07-02  5:46 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Laurent Dufour,
	Michal Suchanek
In-Reply-To: <153051022088.30541.5610525713141009848.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

On pseries, the machine check error details are part of the RTAS extended
event log, passed under the Machine check exception section. This patch adds
the definition of the rtas MCE event section and related helper
functions.
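
As a rough sketch of how such helpers decode a packed status byte, the
example below follows the soft error layout used in this patch (top bit:
effective address provided, low two bits: sub type). The macro and
function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SOFT_ERR_EA_PROVIDED	0x80	/* top bit: effective address provided */
#define SOFT_ERR_SUBTYPE_MASK	0x03	/* low two bits: SLB/ERAT/TLB sub type */

/* Hypothetical helper: extract the sub type from the status byte. */
static uint8_t soft_err_sub_type(uint8_t soft_err_type)
{
	return soft_err_type & SOFT_ERR_SUBTYPE_MASK;
}

/* Hypothetical helper: non-zero if the effective address field is valid. */
static int soft_err_has_ea(uint8_t soft_err_type)
{
	return (soft_err_type & SOFT_ERR_EA_PROVIDED) != 0;
}
```

A status byte of 0x81, for instance, carries a valid effective address
and sub type 1.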

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/rtas.h |  111 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 111 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index ec9dd79398ee..ceeed2dd489b 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -185,6 +185,13 @@ static inline uint8_t rtas_error_disposition(const struct rtas_error_log *elog)
 	return (elog->byte1 & 0x18) >> 3;
 }
 
+static inline
+void rtas_set_disposition_recovered(struct rtas_error_log *elog)
+{
+	elog->byte1 &= ~0x18;
+	elog->byte1 |= (RTAS_DISP_FULLY_RECOVERED << 3);
+}
+
 static inline uint8_t rtas_error_extended(const struct rtas_error_log *elog)
 {
 	return (elog->byte1 & 0x04) >> 2;
@@ -275,6 +282,7 @@ inline uint32_t rtas_ext_event_company_id(struct rtas_ext_event_log_v6 *ext_log)
 #define PSERIES_ELOG_SECT_ID_CALL_HOME		(('C' << 8) | 'H')
 #define PSERIES_ELOG_SECT_ID_USER_DEF		(('U' << 8) | 'D')
 #define PSERIES_ELOG_SECT_ID_HOTPLUG		(('H' << 8) | 'P')
+#define PSERIES_ELOG_SECT_ID_MCE		(('M' << 8) | 'C')
 
 /* Vendor specific Platform Event Log Format, Version 6, section header */
 struct pseries_errorlog {
@@ -326,6 +334,109 @@ struct pseries_hp_errorlog {
 #define PSERIES_HP_ELOG_ID_DRC_COUNT	3
 #define PSERIES_HP_ELOG_ID_DRC_IC	4
 
+/* RTAS pseries MCE errorlog section */
+#pragma pack(push, 1)
+struct pseries_mc_errorlog {
+	__be32	fru_id;
+	__be32	proc_id;
+	uint8_t	error_type;
+	union {
+		struct {
+			uint8_t	ue_err_type;
+			/* XXXXXXXX
+			 * X		1: Permanent or Transient UE.
+			 *  X		1: Effective address provided.
+			 *   X		1: Logical address provided.
+			 *    XX	2: Reserved.
+			 *      XXX	3: Type of UE error.
+			 */
+			uint8_t	reserved_1[6];
+			__be64	effective_address;
+			__be64	logical_address;
+		} ue_error;
+		struct {
+			uint8_t	soft_err_type;
+			/* XXXXXXXX
+			 * X		1: Effective address provided.
+			 *  XXXXX	5: Reserved.
+			 *       XX	2: Type of SLB/ERAT/TLB error.
+			 */
+			uint8_t	reserved_1[6];
+			__be64	effective_address;
+			uint8_t	reserved_2[8];
+		} soft_error;
+	} u;
+};
+#pragma pack(pop)
+
+/* RTAS pseries MCE error types */
+#define PSERIES_MC_ERROR_TYPE_UE		0x00
+#define PSERIES_MC_ERROR_TYPE_SLB		0x01
+#define PSERIES_MC_ERROR_TYPE_ERAT		0x02
+#define PSERIES_MC_ERROR_TYPE_TLB		0x04
+#define PSERIES_MC_ERROR_TYPE_D_CACHE		0x05
+#define PSERIES_MC_ERROR_TYPE_I_CACHE		0x07
+
+/* RTAS pseries MCE error sub types */
+#define PSERIES_MC_ERROR_UE_INDETERMINATE		0
+#define PSERIES_MC_ERROR_UE_IFETCH			1
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_IFETCH	2
+#define PSERIES_MC_ERROR_UE_LOAD_STORE			3
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_LOAD_STORE	4
+
+#define PSERIES_MC_ERROR_SLB_PARITY		0
+#define PSERIES_MC_ERROR_SLB_MULTIHIT		1
+#define PSERIES_MC_ERROR_SLB_INDETERMINATE	2
+
+#define PSERIES_MC_ERROR_ERAT_PARITY		1
+#define PSERIES_MC_ERROR_ERAT_MULTIHIT		2
+#define PSERIES_MC_ERROR_ERAT_INDETERMINATE	3
+
+#define PSERIES_MC_ERROR_TLB_PARITY		1
+#define PSERIES_MC_ERROR_TLB_MULTIHIT		2
+#define PSERIES_MC_ERROR_TLB_INDETERMINATE	3
+
+static inline uint8_t rtas_mc_error_type(const struct pseries_mc_errorlog *mlog)
+{
+	return mlog->error_type;
+}
+
+static inline uint8_t rtas_mc_error_sub_type(
+					const struct pseries_mc_errorlog *mlog)
+{
+	switch (mlog->error_type) {
+	case	PSERIES_MC_ERROR_TYPE_UE:
+		return (mlog->u.ue_error.ue_err_type & 0x07);
+	case	PSERIES_MC_ERROR_TYPE_SLB:
+	case	PSERIES_MC_ERROR_TYPE_ERAT:
+	case	PSERIES_MC_ERROR_TYPE_TLB:
+		return (mlog->u.soft_error.soft_err_type & 0x03);
+	default:
+		return 0;
+	}
+}
+
+static inline uint64_t rtas_mc_get_effective_addr(
+					const struct pseries_mc_errorlog *mlog)
+{
+	uint64_t addr = 0;
+
+	switch (mlog->error_type) {
+	case	PSERIES_MC_ERROR_TYPE_UE:
+		if (mlog->u.ue_error.ue_err_type & 0x40)
+			addr = mlog->u.ue_error.effective_address;
+		break;
+	case	PSERIES_MC_ERROR_TYPE_SLB:
+	case	PSERIES_MC_ERROR_TYPE_ERAT:
+	case	PSERIES_MC_ERROR_TYPE_TLB:
+		if (mlog->u.soft_error.soft_err_type & 0x80)
+			addr = mlog->u.soft_error.effective_address;
+	default:
+		break;
+	}
+	return be64_to_cpu(addr);
+}
+
 struct pseries_errorlog *get_pseries_errorlog(struct rtas_error_log *log,
 					      uint16_t section_id);
 


* [PATCH v5 5/7] powerpc/pseries: flush SLB contents on SLB MCE errors.
From: Mahesh J Salgaonkar @ 2018-07-02  5:47 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Laurent Dufour,
	Michal Suchanek
In-Reply-To: <153051022088.30541.5610525713141009848.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

On pseries, the system currently crashes if we get a machine check
exception due to SLB errors. These are soft errors and can be fixed by
flushing the SLBs, so the kernel can continue to function instead of
crashing. We do this in real mode before turning on the MMU; otherwise
we would run into nested machine checks. This patch fetches the
rtas error log in real mode and flushes the SLBs on SLB errors.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |    1 
 arch/powerpc/include/asm/machdep.h            |    1 
 arch/powerpc/kernel/exceptions-64s.S          |   42 +++++++++++++++++++++
 arch/powerpc/kernel/mce.c                     |   16 +++++++-
 arch/powerpc/mm/slb.c                         |    6 +++
 arch/powerpc/platforms/powernv/opal.c         |    1 
 arch/powerpc/platforms/pseries/pseries.h      |    1 
 arch/powerpc/platforms/pseries/ras.c          |   51 +++++++++++++++++++++++++
 arch/powerpc/platforms/pseries/setup.c        |    1 
 9 files changed, 116 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 50ed64fba4ae..cc00a7088cf3 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -487,6 +487,7 @@ extern void hpte_init_native(void);
 
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
+extern void slb_flush_and_rebolt_realmode(void);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index ffe7c71e1132..fe447e0d4140 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -108,6 +108,7 @@ struct machdep_calls {
 
 	/* Early exception handlers called in realmode */
 	int		(*hmi_exception_early)(struct pt_regs *regs);
+	int		(*machine_check_early)(struct pt_regs *regs);
 
 	/* Called during machine check exception to retrive fixup address. */
 	bool		(*mce_check_early_recovery)(struct pt_regs *regs);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index f283958129f2..0038596b7906 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -332,6 +332,9 @@ TRAMP_REAL_BEGIN(machine_check_pSeries)
 machine_check_fwnmi:
 	SET_SCRATCH0(r13)		/* save r13 */
 	EXCEPTION_PROLOG_0(PACA_EXMC)
+BEGIN_FTR_SECTION
+	b	machine_check_pSeries_early
+END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
 machine_check_pSeries_0:
 	EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
 	/*
@@ -343,6 +346,45 @@ machine_check_pSeries_0:
 
 TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
 
+TRAMP_REAL_BEGIN(machine_check_pSeries_early)
+BEGIN_FTR_SECTION
+	EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
+	mr	r10,r1			/* Save r1 */
+	ld	r1,PACAMCEMERGSP(r13)	/* Use MC emergency stack */
+	subi	r1,r1,INT_FRAME_SIZE	/* alloc stack frame		*/
+	mfspr	r11,SPRN_SRR0		/* Save SRR0 */
+	mfspr	r12,SPRN_SRR1		/* Save SRR1 */
+	EXCEPTION_PROLOG_COMMON_1()
+	EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
+	EXCEPTION_PROLOG_COMMON_3(0x200)
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */
+
+	/* Move original SRR0 and SRR1 into the respective regs */
+	ld	r9,_MSR(r1)
+	mtspr	SPRN_SRR1,r9
+	ld	r3,_NIP(r1)
+	mtspr	SPRN_SRR0,r3
+	ld	r9,_CTR(r1)
+	mtctr	r9
+	ld	r9,_XER(r1)
+	mtxer	r9
+	ld	r9,_LINK(r1)
+	mtlr	r9
+	REST_GPR(0, r1)
+	REST_8GPRS(2, r1)
+	REST_GPR(10, r1)
+	ld	r11,_CCR(r1)
+	mtcr	r11
+	REST_GPR(11, r1)
+	REST_2GPRS(12, r1)
+	/* restore original r1. */
+	ld	r1,GPR1(r1)
+	SET_SCRATCH0(r13)		/* save r13 */
+	EXCEPTION_PROLOG_0(PACA_EXMC)
+	b	machine_check_pSeries_0
+END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
+
 EXC_COMMON_BEGIN(machine_check_common)
 	/*
 	 * Machine check is different because we use a different
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index efdd16a79075..221271c96a57 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -488,9 +488,21 @@ long machine_check_early(struct pt_regs *regs)
 {
 	long handled = 0;
 
-	__this_cpu_inc(irq_stat.mce_exceptions);
+	/*
+	 * For pSeries we count mce when we go into virtual mode machine
+	 * check handler. Hence skip it. Also, We can't access per cpu
+	 * variables in real mode for LPAR.
+	 */
+	if (early_cpu_has_feature(CPU_FTR_HVMODE))
+		__this_cpu_inc(irq_stat.mce_exceptions);
 
-	if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
+	/*
+	 * See if platform is capable of handling machine check.
+	 * Otherwise fallthrough and allow CPU to handle this machine check.
+	 */
+	if (ppc_md.machine_check_early)
+		handled = ppc_md.machine_check_early(regs);
+	else if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
 		handled = cur_cpu_spec->machine_check_early(regs);
 	return handled;
 }
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 66577cc66dc9..5b1813b98358 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -145,6 +145,12 @@ void slb_flush_and_rebolt(void)
 	get_paca()->slb_cache_ptr = 0;
 }
 
+void slb_flush_and_rebolt_realmode(void)
+{
+	__slb_flush_and_rebolt();
+	get_paca()->slb_cache_ptr = 0;
+}
+
 void slb_vmalloc_update(void)
 {
 	unsigned long vflags;
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 48fbb41af5d1..ed548d40a9e1 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -417,7 +417,6 @@ static int opal_recover_mce(struct pt_regs *regs,
 
 	if (!(regs->msr & MSR_RI)) {
 		/* If MSR_RI isn't set, we cannot recover */
-		pr_err("Machine check interrupt unrecoverable: MSR(RI=0)\n");
 		recovered = 0;
 	} else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
 		/* Platform corrected itself */
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 60db2ee511fb..3611db5dd583 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -24,6 +24,7 @@ struct pt_regs;
 
 extern int pSeries_system_reset_exception(struct pt_regs *regs);
 extern int pSeries_machine_check_exception(struct pt_regs *regs);
+extern int pSeries_machine_check_realmode(struct pt_regs *regs);
 
 #ifdef CONFIG_SMP
 extern void smp_init_pseries(void);
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 851ce326874a..9aa7885e0148 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -427,6 +427,35 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
 	return 0; /* need to perform reset */
 }
 
+static int mce_handle_error(struct rtas_error_log *errp)
+{
+	struct pseries_errorlog *pseries_log;
+	struct pseries_mc_errorlog *mce_log;
+	int disposition = rtas_error_disposition(errp);
+	uint8_t error_type;
+
+	if (!rtas_error_extended(errp))
+		goto out;
+
+	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+	if (pseries_log == NULL)
+		goto out;
+
+	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+	error_type = rtas_mc_error_type(mce_log);
+
+	if ((disposition == RTAS_DISP_NOT_RECOVERED) &&
+			(error_type == PSERIES_MC_ERROR_TYPE_SLB)) {
+		/* Store the old slb content someplace. */
+		slb_flush_and_rebolt_realmode();
+		disposition = RTAS_DISP_FULLY_RECOVERED;
+		rtas_set_disposition_recovered(errp);
+	}
+
+out:
+	return disposition;
+}
+
 /*
  * Process MCE rtas errlog event.
  */
@@ -503,11 +532,31 @@ int pSeries_machine_check_exception(struct pt_regs *regs)
 	struct rtas_error_log *errp;
 
 	if (fwnmi_active) {
-		errp = fwnmi_get_errinfo(regs);
 		fwnmi_release_errinfo();
+		errp = fwnmi_get_errlog();
 		if (errp && recover_mce(regs, errp))
 			return 1;
 	}
 
 	return 0;
 }
+
+int pSeries_machine_check_realmode(struct pt_regs *regs)
+{
+	struct rtas_error_log *errp;
+	int disposition;
+
+	if (fwnmi_active) {
+		errp = fwnmi_get_errinfo(regs);
+		/*
+		 * Call to fwnmi_release_errinfo() in real mode causes kernel
+		 * to panic. Hence we will call it as soon as we go into
+		 * virtual mode.
+		 */
+		disposition = mce_handle_error(errp);
+		if (disposition == RTAS_DISP_FULLY_RECOVERED)
+			return 1;
+	}
+
+	return 0;
+}
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 60a067a6e743..249b02bc5c41 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -999,6 +999,7 @@ define_machine(pseries) {
 	.calibrate_decr		= generic_calibrate_decr,
 	.progress		= rtas_progress,
 	.system_reset_exception = pSeries_system_reset_exception,
+	.machine_check_early	= pSeries_machine_check_realmode,
 	.machine_check_exception = pSeries_machine_check_exception,
 #ifdef CONFIG_KEXEC_CORE
 	.machine_kexec          = pSeries_machine_kexec,


* [PATCH v5 6/7] powerpc/pseries: Display machine check error details.
From: Mahesh J Salgaonkar @ 2018-07-02  5:47 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Laurent Dufour,
	Michal Suchanek
In-Reply-To: <153051022088.30541.5610525713141009848.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Extract the MCE error details from the RTAS extended log and display them on
the console.

With this patch you should now see mce logs like below:

[  142.371818] Severe Machine check interrupt [Recovered]
[  142.371822]   NIP [d00000000ca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[  142.371822]   Initiator: CPU
[  142.371823]   Error type: SLB [Multihit]
[  142.371824]     Effective address: d00000000ca70000
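
The strings in brackets come from small lookup tables indexed by the
decoded type and sub type. A minimal sketch of a bounds-checked lookup,
in the spirit of the VAL_TO_STRING macro in this patch, is shown below;
slb_type_name() is a hypothetical wrapper written for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* SLB sub error names, indexed by sub type value 0x0, 0x1, 0x2 */
static const char * const slb_types[] = {
	"Parity",
	"Multihit",
	"Indeterminate",
};

/* Hypothetical wrapper: never index past the end of the table. */
static const char *slb_type_name(unsigned int val)
{
	return val < ARRAY_SIZE(slb_types) ? slb_types[val] : "Unknown";
}
```

An out-of-range value from firmware then prints "Unknown" instead of
reading past the table.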

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/rtas.h      |    5 +
 arch/powerpc/platforms/pseries/ras.c |  131 ++++++++++++++++++++++++++++++++++
 2 files changed, 136 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index ceeed2dd489b..26bc3d5c4992 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -197,6 +197,11 @@ static inline uint8_t rtas_error_extended(const struct rtas_error_log *elog)
 	return (elog->byte1 & 0x04) >> 2;
 }
 
+static inline uint8_t rtas_error_initiator(const struct rtas_error_log *elog)
+{
+	return (elog->byte2 & 0xf0) >> 4;
+}
+
 #define rtas_error_type(x)	((x)->byte3)
 
 static inline
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 9aa7885e0148..7d4d2b8bc019 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -427,6 +427,135 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
 	return 0; /* need to perform reset */
 }
 
+#define VAL_TO_STRING(ar, val)	((val < ARRAY_SIZE(ar)) ? ar[val] : "Unknown")
+
+static void pseries_print_mce_info(struct pt_regs *regs,
+						struct rtas_error_log *errp)
+{
+	const char *level, *sevstr;
+	struct pseries_errorlog *pseries_log;
+	struct pseries_mc_errorlog *mce_log;
+	uint8_t error_type, err_sub_type;
+	uint64_t addr;
+	uint8_t initiator = rtas_error_initiator(errp);
+	int disposition = rtas_error_disposition(errp);
+
+	static const char * const initiators[] = {
+		"Unknown",
+		"CPU",
+		"PCI",
+		"ISA",
+		"Memory",
+		"Power Mgmt",
+	};
+	static const char * const mc_err_types[] = {
+		"UE",
+		"SLB",
+		"ERAT",
+		"TLB",
+		"D-Cache",
+		"Unknown",
+		"I-Cache",
+	};
+	static const char * const mc_ue_types[] = {
+		"Indeterminate",
+		"Instruction fetch",
+		"Page table walk ifetch",
+		"Load/Store",
+		"Page table walk Load/Store",
+	};
+
+	/* SLB sub errors valid values are 0x0, 0x1, 0x2 */
+	static const char * const mc_slb_types[] = {
+		"Parity",
+		"Multihit",
+		"Indeterminate",
+	};
+
+	/* TLB and ERAT sub errors valid values are 0x1, 0x2, 0x3 */
+	static const char * const mc_soft_types[] = {
+		"Unknown",
+		"Parity",
+		"Multihit",
+		"Indeterminate",
+	};
+
+	if (!rtas_error_extended(errp)) {
+		pr_err("Machine check interrupt: Missing extended error log\n");
+		return;
+	}
+
+	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+	if (pseries_log == NULL)
+		return;
+
+	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+
+	error_type = rtas_mc_error_type(mce_log);
+	err_sub_type = rtas_mc_error_sub_type(mce_log);
+
+	switch (rtas_error_severity(errp)) {
+	case RTAS_SEVERITY_NO_ERROR:
+		level = KERN_INFO;
+		sevstr = "Harmless";
+		break;
+	case RTAS_SEVERITY_WARNING:
+		level = KERN_WARNING;
+		sevstr = "";
+		break;
+	case RTAS_SEVERITY_ERROR:
+	case RTAS_SEVERITY_ERROR_SYNC:
+		level = KERN_ERR;
+		sevstr = "Severe";
+		break;
+	case RTAS_SEVERITY_FATAL:
+	default:
+		level = KERN_ERR;
+		sevstr = "Fatal";
+		break;
+	}
+
+	printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
+		disposition == RTAS_DISP_FULLY_RECOVERED ?
+		"Recovered" : "Not recovered");
+	if (user_mode(regs)) {
+		printk("%s  NIP: [%016lx] PID: %d Comm: %s\n", level,
+			regs->nip, current->pid, current->comm);
+	} else {
+		printk("%s  NIP [%016lx]: %pS\n", level, regs->nip,
+			(void *)regs->nip);
+	}
+	printk("%s  Initiator: %s\n", level,
+				VAL_TO_STRING(initiators, initiator));
+
+	switch (error_type) {
+	case PSERIES_MC_ERROR_TYPE_UE:
+		printk("%s  Error type: %s [%s]\n", level,
+			VAL_TO_STRING(mc_err_types, error_type),
+			VAL_TO_STRING(mc_ue_types, err_sub_type));
+		break;
+	case PSERIES_MC_ERROR_TYPE_SLB:
+		printk("%s  Error type: %s [%s]\n", level,
+			VAL_TO_STRING(mc_err_types, error_type),
+			VAL_TO_STRING(mc_slb_types, err_sub_type));
+		break;
+	case PSERIES_MC_ERROR_TYPE_ERAT:
+	case PSERIES_MC_ERROR_TYPE_TLB:
+		printk("%s  Error type: %s [%s]\n", level,
+			VAL_TO_STRING(mc_err_types, error_type),
+			VAL_TO_STRING(mc_soft_types, err_sub_type));
+		break;
+	default:
+		printk("%s  Error type: %s\n", level,
+			VAL_TO_STRING(mc_err_types, error_type));
+		break;
+	}
+
+	addr = rtas_mc_get_effective_addr(mce_log);
+	if (addr)
+		printk("%s    Effective address: %016llx\n", level, addr);
+}
+
 static int mce_handle_error(struct rtas_error_log *errp)
 {
 	struct pseries_errorlog *pseries_log;
@@ -481,6 +610,8 @@ static int recover_mce(struct pt_regs *regs, struct rtas_error_log *err)
 	int recovered = 0;
 	int disposition = rtas_error_disposition(err);
 
+	pseries_print_mce_info(regs, err);
+
 	if (!(regs->msr & MSR_RI)) {
 		/* If MSR_RI isn't set, we cannot recover */
 		recovered = 0;


* [PATCH v5 7/7] powerpc/pseries: Dump the SLB contents on SLB MCE errors.
From: Mahesh J Salgaonkar @ 2018-07-02  5:47 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Aneesh Kumar K.V, Michael Ellerman, Nicholas Piggin,
	Aneesh Kumar K.V, Laurent Dufour, Michal Suchanek
In-Reply-To: <153051022088.30541.5610525713141009848.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

If we get a machine check exception due to an SLB error, dump the
current SLB contents, which are very helpful in debugging the root cause
of SLB errors. Introduce a dedicated per-cpu buffer to hold the faulty
SLB entries. In real mode, the MCE handler saves the old SLB contents
into this buffer, which is accessible through the paca, and prints them
out later in virtual mode.

With this patch the console will log SLB contents like below on SLB MCE
errors:

[ 3022.938065] SLB contents of cpu 0x3
[ 3022.938066] 00 c000000008000000 400ea1b217000500
[ 3022.938067]   1T  ESID=   c00000  VSID=      ea1b217 LLP:100
[ 3022.938068] 01 d000000008000000 400d43642f000510
[ 3022.938069]   1T  ESID=   d00000  VSID=      d43642f LLP:110
[ 3022.938070] 05 f000000008000000 400a86c85f000500
[ 3022.938071]   1T  ESID=   f00000  VSID=      a86c85f LLP:100
[ 3022.938072] 06 00007f0008000000 400a628b13000d90
[ 3022.938073]   1T  ESID=       7f  VSID=      a628b13 LLP:110
[ 3022.938074] 07 0000000018000000 000b7979f523fd90
[ 3022.938075]  256M ESID=        1  VSID=   b7979f523f LLP:110
[ 3022.938076] 08 c000000008000000 400ea1b217000510
[ 3022.938076]   1T  ESID=   c00000  VSID=      ea1b217 LLP:110
[ 3022.938077] 09 c000000008000000 400ea1b217000510
[ 3022.938078]   1T  ESID=   c00000  VSID=      ea1b217 LLP:110
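
The decoded lines in the log above can be reproduced outside the kernel.
What follows is a minimal userspace sketch, not part of the patch. The
constant values are assumptions copied by hand to mirror the hash MMU
definitions (SLB_ESID_V, SLB_VSID_B, SLB_VSID_LLP and the shifts) and
should be checked against the tree; slb_decode(), slb_print() and
struct slb_decoded are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed values, intended to mirror the kernel's hash MMU definitions. */
#define SLB_ESID_V        0x0000000008000000ULL /* entry is valid */
#define SLB_VSID_B        0xc000000000000000ULL /* segment size field */
#define SLB_VSID_B_1T     0x4000000000000000ULL /* 1T segment */
#define SLB_VSID_LLP      0x0000000000000130ULL /* L and LP bits */
#define SLB_VSID_SHIFT    12
#define SLB_VSID_SHIFT_1T 24
#define SID_SHIFT         28 /* 256M segments */
#define SID_SHIFT_1T      40 /* 1T segments */

/* Hypothetical container for one decoded entry. */
struct slb_decoded {
	bool valid;
	bool is_1t;
	uint64_t esid;
	uint64_t vsid;
	uint64_t llp;
};

/* Decode one raw (esid, vsid) pair as read by slbmfee/slbmfev. */
struct slb_decoded slb_decode(uint64_t e, uint64_t v)
{
	struct slb_decoded d = { 0 };

	d.valid = (e & SLB_ESID_V) != 0;
	if (!d.valid)
		return d;

	d.is_1t = (v & SLB_VSID_B_1T) != 0;
	d.llp = v & SLB_VSID_LLP;
	if (d.is_1t) {
		d.esid = e >> SID_SHIFT_1T;
		d.vsid = (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T;
	} else {
		d.esid = e >> SID_SHIFT;
		d.vsid = (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT;
	}
	return d;
}

/* Print one entry in roughly the same layout as the log above. */
void slb_print(int idx, uint64_t e, uint64_t v)
{
	struct slb_decoded d = slb_decode(e, v);

	printf("%02d %016llx %016llx\n", idx,
	       (unsigned long long)e, (unsigned long long)v);
	if (!d.valid)
		return;
	printf("%s ESID=%9llx  VSID=%13llx LLP:%3llx\n",
	       d.is_1t ? "  1T " : " 256M",
	       (unsigned long long)d.esid,
	       (unsigned long long)d.vsid,
	       (unsigned long long)d.llp);
}
```

Feeding entry 00 from the log (c000000008000000 400ea1b217000500) to
slb_decode() gives a 1T segment, and entry 07 (0000000018000000
000b7979f523fd90) gives a 256M segment, under the assumed constants.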

Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |    7 +++
 arch/powerpc/include/asm/paca.h               |    1 
 arch/powerpc/mm/slb.c                         |   57 +++++++++++++++++++++++++
 arch/powerpc/platforms/pseries/ras.c          |   10 ++++
 arch/powerpc/platforms/pseries/setup.c        |   10 ++++
 5 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index cc00a7088cf3..5a3fe282076d 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -485,9 +485,16 @@ static inline void hpte_init_pseries(void) { }
 
 extern void hpte_init_native(void);
 
+struct slb_entry {
+	u64	esid;
+	u64	vsid;
+};
+
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
 extern void slb_flush_and_rebolt_realmode(void);
+extern void slb_save_contents(struct slb_entry *slb_ptr);
+extern void slb_dump_contents(struct slb_entry *slb_ptr);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index b441fef53077..653f87c69423 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -253,6 +253,7 @@ struct paca_struct {
 #endif
 #ifdef CONFIG_PPC_PSERIES
 	u8 *mce_data_buf;		/* buffer to hold per cpu rtas errlog */
+	struct slb_entry *mce_faulty_slbs;
 #endif /* CONFIG_PPC_PSERIES */
 } ____cacheline_aligned;
 
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 5b1813b98358..476ab0b1d4e8 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -151,6 +151,63 @@ void slb_flush_and_rebolt_realmode(void)
 	get_paca()->slb_cache_ptr = 0;
 }
 
+void slb_save_contents(struct slb_entry *slb_ptr)
+{
+	int i;
+	unsigned long e, v;
+
+	if (!slb_ptr)
+		return;
+
+	for (i = 0; i < mmu_slb_size; i++) {
+		asm volatile("slbmfee  %0,%1" : "=r" (e) : "r" (i));
+		asm volatile("slbmfev  %0,%1" : "=r" (v) : "r" (i));
+		slb_ptr->esid = e;
+		slb_ptr->vsid = v;
+		slb_ptr++;
+	}
+}
+
+void slb_dump_contents(struct slb_entry *slb_ptr)
+{
+	int i;
+	unsigned long e, v;
+	unsigned long llp;
+
+	if (!slb_ptr)
+		return;
+
+	pr_err("SLB contents of cpu 0x%x\n", smp_processor_id());
+
+	for (i = 0; i < mmu_slb_size; i++) {
+		e = slb_ptr->esid;
+		v = slb_ptr->vsid;
+		slb_ptr++;
+
+		if (!e && !v)
+			continue;
+
+		pr_err("%02d %016lx %016lx\n", i, e, v);
+
+		if (!(e & SLB_ESID_V)) {
+			pr_err("\n");
+			continue;
+		}
+		llp = v & SLB_VSID_LLP;
+		if (v & SLB_VSID_B_1T) {
+			pr_err("  1T  ESID=%9lx  VSID=%13lx LLP:%3lx\n",
+				GET_ESID_1T(e),
+				(v & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T,
+				llp);
+		} else {
+			pr_err(" 256M ESID=%9lx  VSID=%13lx LLP:%3lx\n",
+				GET_ESID(e),
+				(v & ~SLB_VSID_B) >> SLB_VSID_SHIFT,
+				llp);
+		}
+	}
+}
+
 void slb_vmalloc_update(void)
 {
 	unsigned long vflags;
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 7d4d2b8bc019..d33c88e65fa1 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -515,6 +515,10 @@ static void pseries_print_mce_info(struct pt_regs *regs,
 		break;
 	}
 
+	/* Display faulty slb contents for SLB errors. */
+	if (error_type == PSERIES_MC_ERROR_TYPE_SLB)
+		slb_dump_contents(local_paca->mce_faulty_slbs);
+
 	printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
 		disposition == RTAS_DISP_FULLY_RECOVERED ?
 		"Recovered" : "Not recovered");
@@ -575,7 +579,11 @@ static int mce_handle_error(struct rtas_error_log *errp)
 
 	if ((disposition == RTAS_DISP_NOT_RECOVERED) &&
 			(error_type == PSERIES_MC_ERROR_TYPE_SLB)) {
-		/* Store the old slb content someplace. */
+		/*
+		 * Store the old slb content in paca before flushing. Print
+		 * this when we go to virtual mode.
+		 */
+		slb_save_contents(local_paca->mce_faulty_slbs);
 		slb_flush_and_rebolt_realmode();
 		disposition = RTAS_DISP_FULLY_RECOVERED;
 		rtas_set_disposition_recovered(errp);
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 249b02bc5c41..76d15e46a152 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -105,6 +105,9 @@ static void __init fwnmi_init(void)
 	u8 *mce_data_buf;
 	unsigned int i;
 	int nr_cpus = num_possible_cpus();
+	struct slb_entry *slb_ptr;
+	size_t size;
+
 
 	int ibm_nmi_register = rtas_token("ibm,nmi-register");
 	if (ibm_nmi_register == RTAS_UNKNOWN_SERVICE)
@@ -130,6 +133,13 @@ static void __init fwnmi_init(void)
 		paca_ptrs[i]->mce_data_buf = mce_data_buf +
 						(RTAS_ERROR_LOG_MAX * i);
 	}
+
+	/* Allocate per cpu slb area to save old slb contents during MCE */
+	size = sizeof(struct slb_entry) * mmu_slb_size * nr_cpus;
+	slb_ptr = __va(memblock_alloc_base(size, sizeof(struct slb_entry),
+							ppc64_rma_size));
+	for_each_possible_cpu(i)
+		paca_ptrs[i]->mce_faulty_slbs = slb_ptr + (mmu_slb_size * i);
 }
 
 static void pseries_8259_cascade(struct irq_desc *desc)


* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Alexey Kardashevskiy @ 2018-07-02  6:32 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras
In-Reply-To: <20180702045243.GX3422@umbus.fritz.box>


On Mon, 2 Jul 2018 14:52:43 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Jul 02, 2018 at 02:33:30PM +1000, Alexey Kardashevskiy wrote:
> > On Mon, 2 Jul 2018 14:08:52 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Fri, Jun 29, 2018 at 05:07:47PM +1000, Alexey Kardashevskiy wrote:  
> > > > On Fri, 29 Jun 2018 15:18:20 +1000
> > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > >     
> > > > > On Fri, 29 Jun 2018 14:57:02 +1000
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote:      
> > > > > > > On Fri, 29 Jun 2018 14:12:41 +1000
> > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > >         
> > > > > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote:        
> > > > > > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > > > > > > > > an IOMMU page is contained in the physical page so the PCI hardware won't
> > > > > > > > > get access to unassigned host memory.
> > > > > > > > > 
> > > > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated
> > > > > > > > > code) so the user space can pin memory backed with 64k pages and create
> > > > > > > > > a hardware TCE table with a bigger page size. We were lucky so far and
> > > > > > > > > did not hit this yet as the very first time the mapping happens
> > > > > > > > > we do not have tbl::it_userspace allocated yet and fall back to
> > > > > > > > > the userspace which in turn calls VFIO IOMMU driver and that fails
> > > > > > > > > because of the check in vfio_iommu_spapr_tce.c which is really
> > > > > > > > > sustainable solution.
> > > > > > > > > 
> > > > > > > > > This stores the smallest preregistered page size in the preregistered
> > > > > > > > > region descriptor and changes the mm_iommu_xxx API to check this against
> > > > > > > > > the IOMMU page size.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > > > > > > ---
> > > > > > > > > Changes:
> > > > > > > > > v2:
> > > > > > > > > * explicitly check for compound pages before calling compound_order()
> > > > > > > > > 
> > > > > > > > > ---
> > > > > > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> > > > > > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > > > > > > > > for IOMMU pages without checking the mmu pagesize and this will fail
> > > > > > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > > > > > > > > 
> > > > > > > > > With the change, mapping will fail in KVM and the guest will print:
> > > > > > > > > 
> > > > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
> > > > > > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
> > > > > > > > > mlx5_core 0000:00:00.0: failed to map direct window for
> > > > > > > > > /pci@800000020000000/ethernet@0: -1          
> > > > > > > > 
> > > > > > > > [snip]        
> > > > > > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > > > > >  		struct mm_iommu_table_group_mem_t **pmem)
> > > > > > > > >  {
> > > > > > > > >  	struct mm_iommu_table_group_mem_t *mem;
> > > > > > > > > -	long i, j, ret = 0, locked_entries = 0;
> > > > > > > > > +	long i, j, ret = 0, locked_entries = 0, pageshift;
> > > > > > > > >  	struct page *page = NULL;
> > > > > > > > >  
> > > > > > > > >  	mutex_lock(&mem_list_mutex);
> > > > > > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > > > > >  		goto unlock_exit;
> > > > > > > > >  	}
> > > > > > > > >  
> > > > > > > > > > +	mem->pageshift = 30; /* start from 1G pages - the biggest we have */
> > > > > > > > 
> > > > > > > > What about 16G pages on an HPT system?        
> > > > > > > 
> > > > > > > 
> > > > > > > Below in the loop mem->pageshift will reduce to the biggest actual size
> > > > > > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually
> > > > > > > pinned, no loss there.        
> > > > > > 
> > > > > > Are you saying that 16G IOMMU pages aren't supported?  Or that there's
> > > > > > some reason a guest can never use them?      
> > > > > 
> > > > > 
> > > > > ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will
> > > > > lift the limit up to 64 then, easier this way.    
> > > > 
> > > > 
> > > > Ah, no, rather this as the upper limit:
> > > > 
> > > > mem->pageshift = ilog2(entries) + PAGE_SHIFT;    
> > > 
> > > I can't make sense of this comment in context.  I see how you're
> > > computing the minimum page size in the reserved region.
> > > 
> > > My question is about what the "maximum minimum" is - the starting
> > > value from which you calculate.  Currently it's 1G, but I can't
> > > immediately see a reason that 16G is impossible here.  
> > 
> > 
> > 16GB is impossible if the chunk we are preregistering here is smaller
> > than that, for example, the entire guest ram is 4GB.  
> 
> Of course.  Just like it was for 1GiB if you had a 512MiB guest, for
> example.  I'm talking about a case where you have a guest that's
> >=16GiB and you *have* allocated 16GiB hugepages to back it.  


Then, assuming we are preregistering the entire RAM as a single chunk, the
"maximum minimum" will be initialized as ">=16GiB" (but floor-aligned
to a power of two) before the pinning loop and then reduced to the actual
page size inside the loop. I feel like I am missing something in the
question, what is it?
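
The calculation described here can be sketched in plain C. This is only
an illustration under stated assumptions, not the code from the patch:
PAGE_SHIFT is assumed to be 16 (64K base pages), backing_shift[] is a
hypothetical array holding the page shift of whatever backs each base
page, and ilog2_u64() and min_pageshift() are made-up helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 16 /* assumption: 64K base pages */

/* floor(log2(n)) for n > 0 */
unsigned int ilog2_u64(uint64_t n)
{
	unsigned int r = 0;

	assert(n > 0);
	while (n >>= 1)
		r++;
	return r;
}

/*
 * entries:          number of base pages in the preregistered chunk
 * backing_shift[i]: page shift of the (possibly compound) page that
 *                   backs the i-th base page
 *
 * Start from the largest page that could fit in the chunk at all
 * (the "maximum minimum") and lower it to the smallest backing page
 * actually seen while walking the chunk.
 */
unsigned int min_pageshift(const unsigned int *backing_shift, size_t entries)
{
	unsigned int pageshift = ilog2_u64(entries) + PAGE_SHIFT;
	size_t i;

	for (i = 0; i < entries; i++) {
		if (backing_shift[i] < pageshift)
			pageshift = backing_shift[i];
	}
	return pageshift;
}
```

With this starting value, a chunk smaller than its backing page is capped
by the chunk size, and a single small page anywhere in the chunk lowers
the result for the whole region.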




--
Alexey



* Re: [PATCH 1/3] powerpc/powernv/pci: Track largest available TCE order per PHB
From: Alexey Kardashevskiy @ 2018-07-02  7:32 UTC (permalink / raw)
  To: Russell Currey; +Cc: linuxppc-dev, benh, alistair, tpearson
In-Reply-To: <20180629073437.4060-2-ruscur@russell.cc>

On Fri, 29 Jun 2018 17:34:35 +1000
Russell Currey <ruscur@russell.cc> wrote:

> Knowing the largest possible TCE size of a PHB is useful, so get it out
> of the device tree.  This relies on the property being added in OPAL.
> 
> It is assumed that any PHB4 or later machine would be running firmware
> that implemented this property, and otherwise assumed to be PHB3, which
> has a maximum TCE order of 28 bits or 256MB TCEs.
> 
> This is used later in the series.
> 
> Signed-off-by: Russell Currey <ruscur@russell.cc>
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++++++++++++++++
>  arch/powerpc/platforms/powernv/pci.h      |  3 +++
>  2 files changed, 19 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 5bd0eb6681bc..17c590087279 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -3873,11 +3873,13 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  	struct resource r;
>  	const __be64 *prop64;
>  	const __be32 *prop32;
> +	struct property *prop;
>  	int len;
>  	unsigned int segno;
>  	u64 phb_id;
>  	void *aux;
>  	long rc;
> +	u32 val;
>  
>  	if (!of_device_is_available(np))
>  		return;
> @@ -4016,6 +4018,20 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  	}
>  	phb->ioda.pe_array = aux + pemap_off;
>  
> +	phb->ioda.max_tce_order = 0;
> +	/* Get TCE order from the DT.  If it's not present, assume P8 */
> +	if (!of_get_property(np, "ibm,supported-tce-sizes", NULL)) {
> +		phb->ioda.max_tce_order = 28; /* assume P8 256mb TCEs */
> +	} else {
> +		of_property_for_each_u32(np, "ibm,supported-tce-sizes", prop,
> +					 prop32, val) {
> +			if (val > phb->ioda.max_tce_order)
> +				phb->ioda.max_tce_order = val;
> +		}
> +		pr_debug("PHB%llx Found max TCE order of %d bits\n",
> +			 phb->opal_id, phb->ioda.max_tce_order);
> +	}


pnv_ioda_parse_tce_sizes() does this, use it. It even reports 256MB pages for P8 as in v4.18-rc3. And since this is going to be used once per device driver bind operation, there is no need at all to cache it, just call ilog2(pnv_ioda_parse_tce_sizes()) whenever you want to know the maximum page size.
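
A rough sketch of that suggestion, assuming only that the helper returns
a bitmask of supported page sizes (as the SZ_16M handling in it
suggests). The SZ_* values are the usual power-of-two sizes, and
tce_mask_to_max_order() is a hypothetical name standing in for
ilog2() applied to the returned mask:

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K   0x00001000ULL
#define SZ_64K  0x00010000ULL
#define SZ_2M   0x00200000ULL
#define SZ_16M  0x01000000ULL
#define SZ_256M 0x10000000ULL
#define SZ_1G   0x40000000ULL

/*
 * Each supported page size is a single bit in the mask, so the index
 * of the highest set bit is the largest supported TCE order.
 * The mask must not be empty.
 */
unsigned int tce_mask_to_max_order(uint64_t mask)
{
	unsigned int order = 0;

	assert(mask != 0);
	while (mask >>= 1)
		order++;
	return order;
}
```

For a mask of 4K, 64K, 16M and 256M this yields 28, which matches the
value the patch hardcodes for the missing-property case.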


> +
>  	/*
>  	 * Choose PE number for root bus, which shouldn't have
>  	 * M64 resources consumed by its child devices. To pick
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index eada4b6068cb..c9952def5e93 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -173,6 +173,9 @@ struct pnv_phb {
>  		struct list_head	pe_list;
>  		struct mutex            pe_list_mutex;
>  
> +		/* Largest supported TCE order bits */
> +		uint8_t			max_tce_order;
> +
>  		/* Reverse map of PEs, indexed by {bus, devfn} */
>  		unsigned int		pe_rmap[0x10000];
>  	} ioda;
> -- 
> 2.17.1
> 



--
Alexey


* Re: [PATCH 1/3] powerpc/powernv/pci: Track largest available TCE order per PHB
From: Alexey Kardashevskiy @ 2018-07-02  7:34 UTC (permalink / raw)
  To: Russell Currey; +Cc: linuxppc-dev, benh, alistair, tpearson
In-Reply-To: <20180702173256.67254e00@aik.ozlabs.ibm.com>

On Mon, 2 Jul 2018 17:32:56 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On Fri, 29 Jun 2018 17:34:35 +1000
> Russell Currey <ruscur@russell.cc> wrote:
> 
> > Knowing the largest possible TCE size of a PHB is useful, so get it
> > out of the device tree.  This relies on the property being added in
> > OPAL.
> > 
> > It is assumed that any PHB4 or later machine would be running
> > firmware that implemented this property, and otherwise assumed to
> > be PHB3, which has a maximum TCE order of 28 bits or 256MB TCEs.
> > 
> > This is used later in the series.
> > 
> > Signed-off-by: Russell Currey <ruscur@russell.cc>
> > ---
> >  arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++++++++++++++++
> >  arch/powerpc/platforms/powernv/pci.h      |  3 +++
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> > index 5bd0eb6681bc..17c590087279 100644
> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > @@ -3873,11 +3873,13 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
> >  	struct resource r;
> >  	const __be64 *prop64;
> >  	const __be32 *prop32;
> > +	struct property *prop;
> >  	int len;
> >  	unsigned int segno;
> >  	u64 phb_id;
> >  	void *aux;
> >  	long rc;
> > +	u32 val;
> >  
> >  	if (!of_device_is_available(np))
> >  		return;
> > @@ -4016,6 +4018,20 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
> >  	}
> >  	phb->ioda.pe_array = aux + pemap_off;
> >  
> > +	phb->ioda.max_tce_order = 0;
> > +	/* Get TCE order from the DT.  If it's not present, assume P8 */
> > +	if (!of_get_property(np, "ibm,supported-tce-sizes", NULL)) {
> > +		phb->ioda.max_tce_order = 28; /* assume P8 256mb TCEs */
> > +	} else {
> > +		of_property_for_each_u32(np, "ibm,supported-tce-sizes", prop,
> > +					 prop32, val) {
> > +			if (val > phb->ioda.max_tce_order)
> > +				phb->ioda.max_tce_order = val;
> > +		}
> > +		pr_debug("PHB%llx Found max TCE order of %d bits\n",
> > +			 phb->opal_id, phb->ioda.max_tce_order);
> > +	}
> 
> 
> pnv_ioda_parse_tce_sizes() does this, use it. It even reports 256MB
> pages for P8 as in v4.18-rc3.


Ah, no, not in rc3, my bad. I'll post it soon.


--
Alexey


* [PATCH kernel] powerpc/powernv/ioda2: Add 256M IOMMU page size to the default POWER8 case
From: Alexey Kardashevskiy @ 2018-07-02  7:42 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Russell Currey, linux-kernel

The sketchy bypass uses 256M pages so add this page size as well.

This should cause no behavioral change but will be used later.

Fixes: 477afd6ea6 "powerpc/ioda: Use ibm,supported-tce-sizes for IOMMU page size mask"
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5bd0eb6..557c11d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2925,7 +2925,7 @@ static unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb)
 		/* Add 16M for POWER8 by default */
 		if (cpu_has_feature(CPU_FTR_ARCH_207S) &&
 				!cpu_has_feature(CPU_FTR_ARCH_300))
-			mask |= SZ_16M;
+			mask |= SZ_16M | SZ_256M;
 		return mask;
 	}
 
-- 
2.11.0


* [PATCH v4 01/11] macintosh/via-pmu: Fix section mismatch warning
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

The pmu_init() function has the __init qualifier, but the ops struct
that holds a pointer to it does not. This causes a build warning.
The driver works fine because the pointer is only dereferenced early.

The function is so small that there's negligible benefit from using
the __init qualifier. Remove it to fix the warning, consistent with
the other ADB drivers.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
 drivers/macintosh/via-pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index 25c1ce811053..f8a2c917201f 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -378,7 +378,7 @@ static int pmu_probe(void)
 	return vias == NULL? -ENODEV: 0;
 }
 
-static int __init pmu_init(void)
+static int pmu_init(void)
 {
 	if (vias == NULL)
 		return -ENODEV;
-- 
2.16.4


* [PATCH v4 02/11] macintosh/via-pmu: Add missing mmio accessors
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

Add missing in_8() accessors to init_pmu() and pmu_sr_intr().

This fixes several sparse warnings:
drivers/macintosh/via-pmu.c:536:29: warning: dereference of noderef expression
drivers/macintosh/via-pmu.c:537:33: warning: dereference of noderef expression
drivers/macintosh/via-pmu.c:1455:17: warning: dereference of noderef expression
drivers/macintosh/via-pmu.c:1456:69: warning: dereference of noderef expression
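
For readers unfamiliar with the pattern, the read-modify-write through
accessors can be sketched in userspace as follows. This is an
illustration only: sim_in_8() and sim_out_8() are hypothetical stand-ins
for the kernel's in_8()/out_8() (which also take care of access
ordering), the register indices are simplified, and the TACK/TREQ bit
values are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TACK 0x08 /* assumed: transfer acknowledge (input) */
#define TREQ 0x10 /* assumed: transfer request (output) */

/* Simplified register indices; the real driver uses a register stride. */
enum { B = 0, DIRB = 1 };

/* Hypothetical stand-ins for in_8()/out_8(). */
uint8_t sim_in_8(const volatile uint8_t *addr)
{
	return *addr;
}

void sim_out_8(volatile uint8_t *addr, uint8_t val)
{
	*addr = val;
}

/* Negate TREQ. Set TACK to input and TREQ to output. */
void sim_init_lines(volatile uint8_t *via)
{
	assert(via != NULL);
	sim_out_8(&via[B], (uint8_t)(sim_in_8(&via[B]) | TREQ));
	sim_out_8(&via[DIRB],
		  (uint8_t)((sim_in_8(&via[DIRB]) | TREQ) & ~TACK));
}
```

Every read and every write of the register block goes through an
accessor, so there is no plain dereference of the __iomem pointer left
for sparse to complain about.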

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
 drivers/macintosh/via-pmu.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index f8a2c917201f..ba41220f618e 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -534,8 +534,9 @@ init_pmu(void)
 	int timeout;
 	struct adb_request req;
 
-	out_8(&via[B], via[B] | TREQ);			/* negate TREQ */
-	out_8(&via[DIRB], (via[DIRB] | TREQ) & ~TACK);	/* TACK in, TREQ out */
+	/* Negate TREQ. Set TACK to input and TREQ to output. */
+	out_8(&via[B], in_8(&via[B]) | TREQ);
+	out_8(&via[DIRB], (in_8(&via[DIRB]) | TREQ) & ~TACK);
 
 	pmu_request(&req, NULL, 2, PMU_SET_INTR_MASK, pmu_intr_mask);
 	timeout =  100000;
@@ -1418,8 +1419,8 @@ pmu_sr_intr(void)
 	struct adb_request *req;
 	int bite = 0;
 
-	if (via[B] & TREQ) {
-		printk(KERN_ERR "PMU: spurious SR intr (%x)\n", via[B]);
+	if (in_8(&via[B]) & TREQ) {
+		printk(KERN_ERR "PMU: spurious SR intr (%x)\n", in_8(&via[B]));
 		out_8(&via[IFR], SR_INT);
 		return NULL;
 	}
-- 
2.16.4


* [PATCH v4 00/11] macintosh: Resolve various PMU driver problems
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel

This series of patches has the following aims.

1) Eliminate duplicated code. Linux presently has two drivers for
   the 68HC05-based PMU devices found in Macs: via-pmu and via-pmu68k.
   There's no value in having separate PMU drivers for each architecture.

2) Avoid further work on via-pmu68k that's not needed for via-pmu.

3) Fix some bugs in the via-pmu driver.

4) Enable the /dev/pmu and /proc/pmu/* userspace APIs on m68k Macs
   by adopting via-pmu.

5) Improve stability on early 100-series PowerBooks by loading no PMU
   driver at all. Neither via-pmu nor via-pmu68k supports the early
   M50753-based PMU device found in these models.

6) Assist the out-of-tree NuBus PowerMac port to support PMU designs
   shared with the m68k Mac port (e.g. PowerBooks 190 and 5300).

This patch series has been regression tested on various PowerBooks
(190, 520, 3400, Pismo G3) and PowerMacs (Beige G3, G5). These patches
did not affect userland utilities. (Note that there is a userland-
visible change to the contents of /proc/pmu/interrupts.)

Changed since v1:
1) Added blank lines after 'break' statements in patch 10.
2) Improved patch description for patch 3.
3) Added reviewed-by tags.
4) Split patch 8 to make code review easier.

Changed since v2:
1) Added reviewed-by tag.
2) Retained PMU_68K_V1 and PMU_68K_V2 symbols.

Changed since v3:
1) Rebased on v4.18-rc2.
2) Omitted patch 10/12, since these RTC changes now conflict with mainline.
   It will be reworked once the mainline m68k/powerpc RTC code stabilizes.

Finn Thain (11):
  macintosh/via-pmu: Fix section mismatch warning
  macintosh/via-pmu: Add missing mmio accessors
  macintosh/via-pmu: Don't clear shift register interrupt flag twice
  macintosh/via-pmu: Enhance state machine with new 'uninitialized'
    state
  macintosh/via-pmu: Replace via pointer with via1 and via2 pointers
  macintosh/via-pmu: Add support for m68k PowerBooks
  macintosh/via-pmu: Explicitly specify CONFIG_PPC_PMAC dependencies
  macintosh/via-pmu68k: Don't load driver on unsupported hardware
  macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver
  macintosh/via-pmu: Clean up interrupt statistics
  macintosh/via-pmu: Disambiguate interrupt statistics

 arch/m68k/configs/mac_defconfig   |   2 +-
 arch/m68k/configs/multi_defconfig |   2 +-
 arch/m68k/mac/config.c            |   2 +-
 arch/m68k/mac/misc.c              |  54 +--
 drivers/macintosh/Kconfig         |  19 +-
 drivers/macintosh/Makefile        |   1 -
 drivers/macintosh/adb.c           |   2 +-
 drivers/macintosh/via-pmu.c       | 346 ++++++++++------
 drivers/macintosh/via-pmu68k.c    | 850 --------------------------------------
 include/uapi/linux/pmu.h          |   4 +-
 10 files changed, 235 insertions(+), 1047 deletions(-)
 delete mode 100644 drivers/macintosh/via-pmu68k.c

-- 
2.16.4


* [PATCH v4 03/11] macintosh/via-pmu: Don't clear shift register interrupt flag twice
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

The shift register interrupt flag gets cleared in via_pmu_interrupt()
and once again in pmu_sr_intr(). Fix this theoretical race condition.
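
One way to picture the theoretical race is a small simulation of a
flag register whose bits are cleared by acknowledging them. Everything
here is hypothetical (struct sim_via, the helper names, and the SR_INT
bit value); it only models the ordering problem, not the hardware.

```c
#include <stdint.h>

#define SR_INT 0x04 /* assumed bit for the shift register flag */

/* Simulated interrupt flag register. */
struct sim_via {
	uint8_t ifr;
};

/* A new event sets the flag. */
void sim_raise(struct sim_via *v, uint8_t bits)
{
	v->ifr |= bits;
}

/* Acknowledging clears the flag. */
void sim_ack(struct sim_via *v, uint8_t bits)
{
	v->ifr &= (uint8_t)~bits;
}

/* Returns nonzero if the second event is still pending afterwards. */
int second_event_survives(int ack_twice)
{
	struct sim_via v = { 0 };

	sim_raise(&v, SR_INT);	/* first event */
	sim_ack(&v, SR_INT);	/* top-level handler acknowledges it */
	sim_raise(&v, SR_INT);	/* second event arrives in between */
	if (ack_twice)
		sim_ack(&v, SR_INT);	/* redundant ack wipes the new event */
	return (v.ifr & SR_INT) != 0;
}
```

With a single acknowledgement the second event stays pending; with the
redundant one it is silently dropped.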

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
 drivers/macintosh/via-pmu.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index ba41220f618e..c313ddfdb17a 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -1421,7 +1421,6 @@ pmu_sr_intr(void)
 
 	if (in_8(&via[B]) & TREQ) {
 		printk(KERN_ERR "PMU: spurious SR intr (%x)\n", in_8(&via[B]));
-		out_8(&via[IFR], SR_INT);
 		return NULL;
 	}
 	/* The ack may not yet be low when we get the interrupt */
-- 
2.16.4


* [PATCH v4 04/11] macintosh/via-pmu: Enhance state machine with new 'uninitialized' state
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

On 68k Macs, the via/vias pointer can't be used to determine whether
the PMU driver has been initialized. For portability, add a new state
to indicate whether find_via_pmu() succeeded.

After find_via_pmu() executes, testing vias == NULL is equivalent to
testing via == NULL. Replace these tests with pmu_state == uninitialized
which is simpler and more consistent. No functional change.
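
The idea can be sketched in isolation as below. This is a simplified
illustration, not the driver code: the sketch_* names are hypothetical,
only some of the states are listed, and it assumes the success path of
the probe moves the state to idle.

```c
#include <errno.h>

/*
 * The zero value doubles as "not initialized", so a statically
 * allocated state variable starts out in the right state.
 */
enum sketch_pmu_state {
	uninitialized = 0,
	idle,
	sending,
	intack,
};

static volatile enum sketch_pmu_state sketch_state;

/* Assumed: called once the hardware has been found and mapped. */
void sketch_mark_found(void)
{
	sketch_state = idle;
}

/* Assumed: called on the failure path of the probe. */
void sketch_mark_failed(void)
{
	sketch_state = uninitialized;
}

/* Replaces pointer tests such as "vias == NULL". */
int sketch_probe(void)
{
	return sketch_state == uninitialized ? -ENODEV : 0;
}
```

Every entry point can then use the same test, regardless of which
pointer a given platform uses to reach the hardware.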

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/macintosh/via-pmu.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index c313ddfdb17a..6a6f1666712e 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -114,6 +114,7 @@ static volatile unsigned char __iomem *via;
 #define CB1_INT		0x10		/* transition on CB1 input */
 
 static volatile enum pmu_state {
+	uninitialized = 0,
 	idle,
 	sending,
 	intack,
@@ -274,7 +275,7 @@ int __init find_via_pmu(void)
 	u64 taddr;
 	const u32 *reg;
 
-	if (via)
+	if (pmu_state != uninitialized)
 		return 1;
 	vias = of_find_node_by_name(NULL, "via-pmu");
 	if (vias == NULL)
@@ -369,20 +370,19 @@ int __init find_via_pmu(void)
  fail:
 	of_node_put(vias);
 	vias = NULL;
+	pmu_state = uninitialized;
 	return 0;
 }
 
 #ifdef CONFIG_ADB
 static int pmu_probe(void)
 {
-	return vias == NULL? -ENODEV: 0;
+	return pmu_state == uninitialized ? -ENODEV : 0;
 }
 
 static int pmu_init(void)
 {
-	if (vias == NULL)
-		return -ENODEV;
-	return 0;
+	return pmu_state == uninitialized ? -ENODEV : 0;
 }
 #endif /* CONFIG_ADB */
 
@@ -397,7 +397,7 @@ static int __init via_pmu_start(void)
 {
 	unsigned int irq;
 
-	if (vias == NULL)
+	if (pmu_state == uninitialized)
 		return -ENODEV;
 
 	batt_req.complete = 1;
@@ -463,7 +463,7 @@ arch_initcall(via_pmu_start);
  */
 static int __init via_pmu_dev_init(void)
 {
-	if (vias == NULL)
+	if (pmu_state == uninitialized)
 		return -ENODEV;
 
 #ifdef CONFIG_PMAC_BACKLIGHT
@@ -929,7 +929,7 @@ static int pmu_send_request(struct adb_request *req, int sync)
 {
 	int i, ret;
 
-	if ((vias == NULL) || (!pmu_fully_inited)) {
+	if (pmu_state == uninitialized || !pmu_fully_inited) {
 		req->complete = 1;
 		return -ENXIO;
 	}
@@ -1023,7 +1023,7 @@ static int __pmu_adb_autopoll(int devs)
 
 static int pmu_adb_autopoll(int devs)
 {
-	if ((vias == NULL) || (!pmu_fully_inited) || !pmu_has_adb)
+	if (pmu_state == uninitialized || !pmu_fully_inited || !pmu_has_adb)
 		return -ENXIO;
 
 	adb_dev_map = devs;
@@ -1036,7 +1036,7 @@ static int pmu_adb_reset_bus(void)
 	struct adb_request req;
 	int save_autopoll = adb_dev_map;
 
-	if ((vias == NULL) || (!pmu_fully_inited) || !pmu_has_adb)
+	if (pmu_state == uninitialized || !pmu_fully_inited || !pmu_has_adb)
 		return -ENXIO;
 
 	/* anyone got a better idea?? */
@@ -1072,7 +1072,7 @@ pmu_request(struct adb_request *req, void (*done)(struct adb_request *),
 	va_list list;
 	int i;
 
-	if (vias == NULL)
+	if (pmu_state == uninitialized)
 		return -ENXIO;
 
 	if (nbytes < 0 || nbytes > 32) {
@@ -1097,7 +1097,7 @@ pmu_queue_request(struct adb_request *req)
 	unsigned long flags;
 	int nsend;
 
-	if (via == NULL) {
+	if (pmu_state == uninitialized) {
 		req->complete = 1;
 		return -ENXIO;
 	}
@@ -1210,7 +1210,7 @@ pmu_start(void)
 void
 pmu_poll(void)
 {
-	if (!via)
+	if (pmu_state == uninitialized)
 		return;
 	if (disable_poll)
 		return;
@@ -1220,7 +1220,7 @@ pmu_poll(void)
 void
 pmu_poll_adb(void)
 {
-	if (!via)
+	if (pmu_state == uninitialized)
 		return;
 	if (disable_poll)
 		return;
@@ -1235,7 +1235,7 @@ pmu_poll_adb(void)
 void
 pmu_wait_complete(struct adb_request *req)
 {
-	if (!via)
+	if (pmu_state == uninitialized)
 		return;
 	while((pmu_state != idle && pmu_state != locked) || !req->complete)
 		via_pmu_interrupt(0, NULL);
@@ -1251,7 +1251,7 @@ pmu_suspend(void)
 {
 	unsigned long flags;
 
-	if (!via)
+	if (pmu_state == uninitialized)
 		return;
 	
 	spin_lock_irqsave(&pmu_lock, flags);
@@ -1282,7 +1282,7 @@ pmu_resume(void)
 {
 	unsigned long flags;
 
-	if (!via || (pmu_suspended < 1))
+	if (pmu_state == uninitialized || pmu_suspended < 1)
 		return;
 
 	spin_lock_irqsave(&pmu_lock, flags);
@@ -1644,7 +1644,7 @@ pmu_enable_irled(int on)
 {
 	struct adb_request req;
 
-	if (vias == NULL)
+	if (pmu_state == uninitialized)
 		return ;
 	if (pmu_kind == PMU_KEYLARGO_BASED)
 		return ;
@@ -1659,7 +1659,7 @@ pmu_restart(void)
 {
 	struct adb_request req;
 
-	if (via == NULL)
+	if (pmu_state == uninitialized)
 		return;
 
 	local_irq_disable();
@@ -1684,7 +1684,7 @@ pmu_shutdown(void)
 {
 	struct adb_request req;
 
-	if (via == NULL)
+	if (pmu_state == uninitialized)
 		return;
 
 	local_irq_disable();
@@ -1712,7 +1712,7 @@ pmu_shutdown(void)
 int
 pmu_present(void)
 {
-	return via != NULL;
+	return pmu_state != uninitialized;
 }
 
 #if defined(CONFIG_SUSPEND) && defined(CONFIG_PPC32)
@@ -2378,7 +2378,7 @@ static struct miscdevice pmu_device = {
 
 static int pmu_device_init(void)
 {
-	if (!via)
+	if (pmu_state == uninitialized)
 		return 0;
 	if (misc_register(&pmu_device) < 0)
 		printk(KERN_ERR "via-pmu: cannot register misc device.\n");
-- 
2.16.4

* [PATCH v4 05/11] macintosh/via-pmu: Replace via pointer with via1 and via2 pointers
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

On most PowerPC Macs, the PMU driver uses the shift register and
IO port B from a single VIA chip.

On 68k and early PowerPC PowerBooks, the driver uses the shift register
from one VIA chip together with IO port B from another.

Replace the via pointer with via1 (shift register) and via2 (IO port B)
to accommodate this. For the CONFIG_PPC_PMAC case, set via1 = via2 so
there is no functional change.
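
As a rough user-space illustration of the split described above (not the
driver code itself), the sketch below keeps one pointer for the chip that
supplies the shift register and another for the chip that supplies IO
port B. All sk_* names, the register indices and the bit values are
assumptions made for this example; the real driver uses out_8()/in_8()
on ioremap()ed registers spaced 0x200 bytes apart.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative register indices and bit values (assumptions for this sketch). */
enum { SK_B, SK_SR, SK_ACR, SK_NREGS };
#define SK_TREQ   0x10	/* transfer request line on port B, active low */
#define SK_SR_OUT 0x10	/* shift register direction: output */
#define SK_SR_EXT 0x0c	/* shift register clocked externally */

/* sk_via1 supplies the shift register; sk_via2 supplies IO port B. */
static volatile uint8_t *sk_via1;
static volatile uint8_t *sk_via2;

/* Pass the same pointer twice to model the single-chip case. */
static void sk_attach(volatile uint8_t *shift_chip, volatile uint8_t *portb_chip)
{
	assert(shift_chip && portb_chip);
	sk_via1 = shift_chip;
	sk_via2 = portb_chip;
}

/* Loosely mirrors send_byte(): shift register via sk_via1, TREQ via sk_via2. */
static void sk_send_byte(uint8_t x)
{
	sk_via1[SK_ACR] |= SK_SR_OUT | SK_SR_EXT;
	sk_via1[SK_SR] = x;
	sk_via2[SK_B] &= (uint8_t)~SK_TREQ;	/* assert TREQ */
}
```

Calling sk_attach(chip, chip) with one register block corresponds to the
via1 = via2 case, so the same send path serves both hardware layouts.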

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/macintosh/via-pmu.c | 142 +++++++++++++++++++++-----------------------
 1 file changed, 69 insertions(+), 73 deletions(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index 6a6f1666712e..2557f3e49f18 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -76,7 +76,6 @@
 #define BATTERY_POLLING_COUNT	2
 
 static DEFINE_MUTEX(pmu_info_proc_mutex);
-static volatile unsigned char __iomem *via;
 
 /* VIA registers - spaced 0x200 bytes apart */
 #define RS		0x200		/* skip between registers */
@@ -145,6 +144,8 @@ static struct device_node *vias;
 static int pmu_kind = PMU_UNKNOWN;
 static int pmu_fully_inited;
 static int pmu_has_adb;
+static volatile unsigned char __iomem *via1;
+static volatile unsigned char __iomem *via2;
 static struct device_node *gpio_node;
 static unsigned char __iomem *gpio_reg;
 static int gpio_irq = 0;
@@ -340,14 +341,14 @@ int __init find_via_pmu(void)
 	} else
 		pmu_kind = PMU_UNKNOWN;
 
-	via = ioremap(taddr, 0x2000);
-	if (via == NULL) {
+	via1 = via2 = ioremap(taddr, 0x2000);
+	if (via1 == NULL) {
 		printk(KERN_ERR "via-pmu: Can't map address !\n");
 		goto fail_via_remap;
 	}
 	
-	out_8(&via[IER], IER_CLR | 0x7f);	/* disable all intrs */
-	out_8(&via[IFR], 0x7f);			/* clear IFR */
+	out_8(&via1[IER], IER_CLR | 0x7f);	/* disable all intrs */
+	out_8(&via1[IFR], 0x7f);			/* clear IFR */
 
 	pmu_state = idle;
 
@@ -362,8 +363,8 @@ int __init find_via_pmu(void)
 	return 1;
 
  fail_init:
-	iounmap(via);
-	via = NULL;
+	iounmap(via1);
+	via1 = via2 = NULL;
  fail_via_remap:
 	iounmap(gpio_reg);
 	gpio_reg = NULL;
@@ -437,7 +438,7 @@ static int __init via_pmu_start(void)
 	}
 
 	/* Enable interrupts */
-	out_8(&via[IER], IER_SET | SR_INT | CB1_INT);
+	out_8(&via1[IER], IER_SET | SR_INT | CB1_INT);
 
 	pmu_fully_inited = 1;
 
@@ -535,8 +536,8 @@ init_pmu(void)
 	struct adb_request req;
 
 	/* Negate TREQ. Set TACK to input and TREQ to output. */
-	out_8(&via[B], in_8(&via[B]) | TREQ);
-	out_8(&via[DIRB], (in_8(&via[DIRB]) | TREQ) & ~TACK);
+	out_8(&via2[B], in_8(&via2[B]) | TREQ);
+	out_8(&via2[DIRB], (in_8(&via2[DIRB]) | TREQ) & ~TACK);
 
 	pmu_request(&req, NULL, 2, PMU_SET_INTR_MASK, pmu_intr_mask);
 	timeout =  100000;
@@ -1137,7 +1138,7 @@ wait_for_ack(void)
 	 * reported
 	 */
 	int timeout = 4000;
-	while ((in_8(&via[B]) & TACK) == 0) {
+	while ((in_8(&via2[B]) & TACK) == 0) {
 		if (--timeout < 0) {
 			printk(KERN_ERR "PMU not responding (!ack)\n");
 			return;
@@ -1151,23 +1152,19 @@ wait_for_ack(void)
 static inline void
 send_byte(int x)
 {
-	volatile unsigned char __iomem *v = via;
-
-	out_8(&v[ACR], in_8(&v[ACR]) | SR_OUT | SR_EXT);
-	out_8(&v[SR], x);
-	out_8(&v[B], in_8(&v[B]) & ~TREQ);		/* assert TREQ */
-	(void)in_8(&v[B]);
+	out_8(&via1[ACR], in_8(&via1[ACR]) | SR_OUT | SR_EXT);
+	out_8(&via1[SR], x);
+	out_8(&via2[B], in_8(&via2[B]) & ~TREQ);	/* assert TREQ */
+	(void)in_8(&via2[B]);
 }
 
 static inline void
 recv_byte(void)
 {
-	volatile unsigned char __iomem *v = via;
-
-	out_8(&v[ACR], (in_8(&v[ACR]) & ~SR_OUT) | SR_EXT);
-	in_8(&v[SR]);		/* resets SR */
-	out_8(&v[B], in_8(&v[B]) & ~TREQ);
-	(void)in_8(&v[B]);
+	out_8(&via1[ACR], (in_8(&via1[ACR]) & ~SR_OUT) | SR_EXT);
+	in_8(&via1[SR]);		/* resets SR */
+	out_8(&via2[B], in_8(&via2[B]) & ~TREQ);
+	(void)in_8(&via2[B]);
 }
 
 static inline void
@@ -1270,7 +1267,7 @@ pmu_suspend(void)
 		if (!adb_int_pending && pmu_state == idle && !req_awaiting_reply) {
 			if (gpio_irq >= 0)
 				disable_irq_nosync(gpio_irq);
-			out_8(&via[IER], CB1_INT | IER_CLR);
+			out_8(&via1[IER], CB1_INT | IER_CLR);
 			spin_unlock_irqrestore(&pmu_lock, flags);
 			break;
 		}
@@ -1294,7 +1291,7 @@ pmu_resume(void)
 	adb_int_pending = 1;
 	if (gpio_irq >= 0)
 		enable_irq(gpio_irq);
-	out_8(&via[IER], CB1_INT | IER_SET);
+	out_8(&via1[IER], CB1_INT | IER_SET);
 	spin_unlock_irqrestore(&pmu_lock, flags);
 	pmu_poll();
 }
@@ -1419,20 +1416,20 @@ pmu_sr_intr(void)
 	struct adb_request *req;
 	int bite = 0;
 
-	if (in_8(&via[B]) & TREQ) {
-		printk(KERN_ERR "PMU: spurious SR intr (%x)\n", in_8(&via[B]));
+	if (in_8(&via2[B]) & TREQ) {
+		printk(KERN_ERR "PMU: spurious SR intr (%x)\n", in_8(&via2[B]));
 		return NULL;
 	}
 	/* The ack may not yet be low when we get the interrupt */
-	while ((in_8(&via[B]) & TACK) != 0)
+	while ((in_8(&via2[B]) & TACK) != 0)
 			;
 
 	/* if reading grab the byte, and reset the interrupt */
 	if (pmu_state == reading || pmu_state == reading_intr)
-		bite = in_8(&via[SR]);
+		bite = in_8(&via1[SR]);
 
 	/* reset TREQ and wait for TACK to go high */
-	out_8(&via[B], in_8(&via[B]) | TREQ);
+	out_8(&via2[B], in_8(&via2[B]) | TREQ);
 	wait_for_ack();
 
 	switch (pmu_state) {
@@ -1533,17 +1530,17 @@ via_pmu_interrupt(int irq, void *arg)
 	++disable_poll;
 	
 	for (;;) {
-		intr = in_8(&via[IFR]) & (SR_INT | CB1_INT);
+		intr = in_8(&via1[IFR]) & (SR_INT | CB1_INT);
 		if (intr == 0)
 			break;
 		handled = 1;
 		if (++nloop > 1000) {
 			printk(KERN_DEBUG "PMU: stuck in intr loop, "
 			       "intr=%x, ier=%x pmu_state=%d\n",
-			       intr, in_8(&via[IER]), pmu_state);
+			       intr, in_8(&via1[IER]), pmu_state);
 			break;
 		}
-		out_8(&via[IFR], intr);
+		out_8(&via1[IFR], intr);
 		if (intr & CB1_INT) {
 			adb_int_pending = 1;
 			pmu_irq_stats[0]++;
@@ -1725,29 +1722,29 @@ static u32 save_via[8];
 static void
 save_via_state(void)
 {
-	save_via[0] = in_8(&via[ANH]);
-	save_via[1] = in_8(&via[DIRA]);
-	save_via[2] = in_8(&via[B]);
-	save_via[3] = in_8(&via[DIRB]);
-	save_via[4] = in_8(&via[PCR]);
-	save_via[5] = in_8(&via[ACR]);
-	save_via[6] = in_8(&via[T1CL]);
-	save_via[7] = in_8(&via[T1CH]);
+	save_via[0] = in_8(&via1[ANH]);
+	save_via[1] = in_8(&via1[DIRA]);
+	save_via[2] = in_8(&via1[B]);
+	save_via[3] = in_8(&via1[DIRB]);
+	save_via[4] = in_8(&via1[PCR]);
+	save_via[5] = in_8(&via1[ACR]);
+	save_via[6] = in_8(&via1[T1CL]);
+	save_via[7] = in_8(&via1[T1CH]);
 }
 static void
 restore_via_state(void)
 {
-	out_8(&via[ANH], save_via[0]);
-	out_8(&via[DIRA], save_via[1]);
-	out_8(&via[B], save_via[2]);
-	out_8(&via[DIRB], save_via[3]);
-	out_8(&via[PCR], save_via[4]);
-	out_8(&via[ACR], save_via[5]);
-	out_8(&via[T1CL], save_via[6]);
-	out_8(&via[T1CH], save_via[7]);
-	out_8(&via[IER], IER_CLR | 0x7f);	/* disable all intrs */
-	out_8(&via[IFR], 0x7f);				/* clear IFR */
-	out_8(&via[IER], IER_SET | SR_INT | CB1_INT);
+	out_8(&via1[ANH],  save_via[0]);
+	out_8(&via1[DIRA], save_via[1]);
+	out_8(&via1[B],    save_via[2]);
+	out_8(&via1[DIRB], save_via[3]);
+	out_8(&via1[PCR],  save_via[4]);
+	out_8(&via1[ACR],  save_via[5]);
+	out_8(&via1[T1CL], save_via[6]);
+	out_8(&via1[T1CH], save_via[7]);
+	out_8(&via1[IER], IER_CLR | 0x7f);	/* disable all intrs */
+	out_8(&via1[IFR], 0x7f);			/* clear IFR */
+	out_8(&via1[IER], IER_SET | SR_INT | CB1_INT);
 }
 
 #define	GRACKLE_PM	(1<<7)
@@ -2389,33 +2386,33 @@ device_initcall(pmu_device_init);
 
 #ifdef DEBUG_SLEEP
 static inline void 
-polled_handshake(volatile unsigned char __iomem *via)
+polled_handshake(void)
 {
-	via[B] &= ~TREQ; eieio();
-	while ((via[B] & TACK) != 0)
+	via2[B] &= ~TREQ; eieio();
+	while ((via2[B] & TACK) != 0)
 		;
-	via[B] |= TREQ; eieio();
-	while ((via[B] & TACK) == 0)
+	via2[B] |= TREQ; eieio();
+	while ((via2[B] & TACK) == 0)
 		;
 }
 
 static inline void 
-polled_send_byte(volatile unsigned char __iomem *via, int x)
+polled_send_byte(int x)
 {
-	via[ACR] |= SR_OUT | SR_EXT; eieio();
-	via[SR] = x; eieio();
-	polled_handshake(via);
+	via1[ACR] |= SR_OUT | SR_EXT; eieio();
+	via1[SR] = x; eieio();
+	polled_handshake();
 }
 
 static inline int
-polled_recv_byte(volatile unsigned char __iomem *via)
+polled_recv_byte(void)
 {
 	int x;
 
-	via[ACR] = (via[ACR] & ~SR_OUT) | SR_EXT; eieio();
-	x = via[SR]; eieio();
-	polled_handshake(via);
-	x = via[SR]; eieio();
+	via1[ACR] = (via1[ACR] & ~SR_OUT) | SR_EXT; eieio();
+	x = via1[SR]; eieio();
+	polled_handshake();
+	x = via1[SR]; eieio();
 	return x;
 }
 
@@ -2424,7 +2421,6 @@ pmu_polled_request(struct adb_request *req)
 {
 	unsigned long flags;
 	int i, l, c;
-	volatile unsigned char __iomem *v = via;
 
 	req->complete = 1;
 	c = req->data[0];
@@ -2436,21 +2432,21 @@ pmu_polled_request(struct adb_request *req)
 	while (pmu_state != idle)
 		pmu_poll();
 
-	while ((via[B] & TACK) == 0)
+	while ((via2[B] & TACK) == 0)
 		;
-	polled_send_byte(v, c);
+	polled_send_byte(c);
 	if (l < 0) {
 		l = req->nbytes - 1;
-		polled_send_byte(v, l);
+		polled_send_byte(l);
 	}
 	for (i = 1; i <= l; ++i)
-		polled_send_byte(v, req->data[i]);
+		polled_send_byte(req->data[i]);
 
 	l = pmu_data_len[c][1];
 	if (l < 0)
-		l = polled_recv_byte(v);
+		l = polled_recv_byte();
 	for (i = 0; i < l; ++i)
-		req->reply[i + req->reply_len] = polled_recv_byte(v);
+		req->reply[i + req->reply_len] = polled_recv_byte();
 
 	if (req->done)
 		(*req->done)(req);
-- 
2.16.4
