LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07  6:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Benjamin Herrenschmidt, Christoph Hellwig, Will Deacon,
	Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe, linuxram, haren,
	paulus, srikar, robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <20180807002857-mutt-send-email-mst@kernel.org>

On Tue, Aug 07, 2018 at 12:46:34AM +0300, Michael S. Tsirkin wrote:
> Well we have the RFC for that - the switch to using DMA ops unconditionally isn't
> problematic itself IMHO, for now that RFC is blocked
> by its perfromance overhead for now but Christoph says
> he's trying to remove that for direct mappings,
> so we should hopefully be able to get there in X weeks.

The direct calls to dma_direct_ops aren't going to help you with legacy
virtio, given that virtio is specified to deal with physical addresses,
while dma-direct is not in many cases.

It would however help with the case where qemu always sets the platform
dma flag, as we'd avoid the indirect calls for that.

^ permalink raw reply

* Re: [PATCH] powerpc/mm: Don't report PUDs as memory leaks when using kmemleak
From: Christophe Leroy @ 2018-08-06 17:06 UTC (permalink / raw)
  To: Michael Ellerman, pmenzel; +Cc: linuxppc-dev, aneesh.kumar
In-Reply-To: <20180719143316.23486-1-mpe@ellerman.id.au>

Hello,

I've got the same issue on PPC32 with hugepages:

unreferenced object 0xc30f8200 (size 512):
   comm "mmap", pid 374, jiffies 4872494 (age 627.630s)
   hex dump (first 32 bytes):
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
   backtrace:
     [<e32b68da>] huge_pte_alloc+0xdc/0x1f8
     [<9e0df1e1>] hugetlb_fault+0x560/0x8f8
     [<7938ec6c>] follow_hugetlb_page+0x14c/0x44c
     [<afbdb405>] __get_user_pages+0x1c4/0x3dc
     [<b8fd7cd9>] __mm_populate+0xac/0x140
     [<3215421e>] vm_mmap_pgoff+0xb4/0xb8
     [<c148db69>] ksys_mmap_pgoff+0xcc/0x1fc
     [<4fcd760f>] ret_from_syscall+0x0/0x38

Used the following app to trigger it:

#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>

int main()
{
	long *p;

	p = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE, MAP_PRIVATE | 
MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE, -1, 0);
	*p = 0xdeadbeef;

	for (;;)
		pause();

	exit(0);
}

Christophe

On 07/19/2018 02:33 PM, Michael Ellerman wrote:
> Paul Menzel reported that kmemleak was producing reports such as:
> 
>    unreferenced object 0xc0000000f8b80000 (size 16384):
>      comm "init", pid 1, jiffies 4294937416 (age 312.240s)
>      hex dump (first 32 bytes):
>        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>      backtrace:
>        [<00000000d997deb7>] __pud_alloc+0x80/0x190
>        [<0000000087f2e8a3>] move_page_tables+0xbac/0xdc0
>        [<00000000091e51c2>] shift_arg_pages+0xc0/0x210
>        [<00000000ab88670c>] setup_arg_pages+0x22c/0x2a0
>        [<0000000060871529>] load_elf_binary+0x41c/0x1648
>        [<00000000ecd9d2d4>] search_binary_handler.part.11+0xbc/0x280
>        [<0000000034e0cdd7>] __do_execve_file.isra.13+0x73c/0x940
>        [<000000005f953a6e>] sys_execve+0x58/0x70
>        [<000000009700a858>] system_call+0x5c/0x70
> 
> Indicating that a PUD was being leaked.
> 
> However what's really happening is that kmemleak is not able to
> recognise the references from the PGD to the PUD, because they are not
> fully qualified pointers.
> 
> We can confirm that in xmon, eg:
> 
> Find the task struct for pid 1 "init":
>    0:mon> P
>         task_struct     ->thread.ksp    PID   PPID S  P CMD
>    c0000001fe7c0000 c0000001fe803960      1      0 S 13 systemd
> 
> Dump virtual address 0 to find the PGD:
>    0:mon> dv 0 c0000001fe7c0000
>    pgd  @ 0xc0000000f8b01000
> 
> Dump the memory of the PGD:
>    0:mon> d c0000000f8b01000
>    c0000000f8b01000 00000000f8b90000 0000000000000000  |................|
>    c0000000f8b01010 0000000000000000 0000000000000000  |................|
>    c0000000f8b01020 0000000000000000 0000000000000000  |................|
>    c0000000f8b01030 0000000000000000 00000000f8b80000  |................|
>                                      ^^^^^^^^^^^^^^^^
> 
> There we can see the reference to our supposedly leaked PUD. But
> because it's missing the leading 0xc, kmemleak won't recognise it.
> 
> We can confirm it's still in use by translating an address that is
> mapped via it:
>    0:mon> dv 7fff94000000 c0000001fe7c0000
>    pgd  @ 0xc0000000f8b01000
>    pgdp @ 0xc0000000f8b01038 = 0x00000000f8b80000 <--
>    pudp @ 0xc0000000f8b81ff8 = 0x00000000037c4000
>    pmdp @ 0xc0000000037c5ca0 = 0x00000000fbd89000
>    ptep @ 0xc0000000fbd89000 = 0xc0800001d5ce0386
>    Maps physical address = 0x00000001d5ce0000
>    Flags = Accessed Dirty Read Write
> 
> The fix is fairly simple. We need to tell kmemleak to ignore PUD
> allocations and never report them as leaks. We can also tell it not to
> scan the PGD, because it will never find pointers in there. However it
> will still notice if we allocate a PGD and then leak it.
> 
> Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>   arch/powerpc/include/asm/book3s/64/pgalloc.h | 23 +++++++++++++++++++++--
>   1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> index 01ee40f11f3a..76234a14b97d 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> @@ -9,6 +9,7 @@
>   
>   #include <linux/slab.h>
>   #include <linux/cpumask.h>
> +#include <linux/kmemleak.h>
>   #include <linux/percpu.h>
>   
>   struct vmemmap_backing {
> @@ -82,6 +83,13 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
>   
>   	pgd = kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE),
>   			       pgtable_gfp_flags(mm, GFP_KERNEL));
> +	/*
> +	 * Don't scan the PGD for pointers, it contains references to PUDs but
> +	 * those references are not full pointers and so can't be recognised by
> +	 * kmemleak.
> +	 */
> +	kmemleak_no_scan(pgd);
> +
>   	/*
>   	 * With hugetlb, we don't clear the second half of the page table.
>   	 * If we share the same slab cache with the pmd or pud level table,
> @@ -110,8 +118,19 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
>   
>   static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
>   {
> -	return kmem_cache_alloc(PGT_CACHE(PUD_CACHE_INDEX),
> -		pgtable_gfp_flags(mm, GFP_KERNEL));
> +	pud_t *pud;
> +
> +	pud = kmem_cache_alloc(PGT_CACHE(PUD_CACHE_INDEX),
> +			       pgtable_gfp_flags(mm, GFP_KERNEL));
> +	/*
> +	 * Tell kmemleak to ignore the PUD, that means don't scan it for
> +	 * pointers and don't consider it a leak. PUDs are typically only
> +	 * referred to by their PGD, but kmemleak is not able to recognise those
> +	 * as pointers, leading to false leak reports.
> +	 */
> +	kmemleak_ignore(pud);
> +
> +	return pud;
>   }
>   
>   static inline void pud_free(struct mm_struct *mm, pud_t *pud)
> 

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07  6:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christoph Hellwig, Michael S. Tsirkin, Will Deacon,
	Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe, linuxram, haren,
	paulus, srikar, robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <6c707d6d33ac25a42265c2e9b521c2416d72c739.camel@kernel.crashing.org>

On Tue, Aug 07, 2018 at 05:52:12AM +1000, Benjamin Herrenschmidt wrote:
> > It is your job to write a coherent interface specification that does
> > not depend on the used components.  The hypervisor might be PAPR,
> > Linux + qemu, VMware, Hyperv or something so secret that you'd have
> > to shoot me if you had to tell me.  The guest might be Linux, FreeBSD,
> > AIX, OS400 or a Hipster project of the day in Rust.  As long as we
> > properly specify the interface it simplify does not matter.
> 
> That's the point Christoph. The interface is today's interface. It does
> NOT change. That information is not part of the interface.
> 
> It's the VM itself that is stashing away its memory in a secret place,
> and thus needs to do bounce buffering. There is no change to the virtio
> interface per-se.

Any guest that doesn't know about your magic limited adressing is simply
not going to work, so we need to communicate that fact.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07  6:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael S. Tsirkin, Christoph Hellwig, Will Deacon,
	Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe, linuxram, haren,
	paulus, srikar, robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <93518075238a07e9f011774d89bdc652c083f1ba.camel@kernel.crashing.org>

On Tue, Aug 07, 2018 at 08:13:56AM +1000, Benjamin Herrenschmidt wrote:
> It would be indeed ideal if all we had to do was setup some kind of
> bus_dma_mask on all PCI devices and have virtio automagically insert
> swiotlb when necessary.

For 4.20 I plan to remove the swiotlb ops and instead do the bounce
buffering in the common code, including a direct call to the direct
ops to avoid retpoline overhead.  For that you still need a flag
in virtio that instead of blindly working physical addresses it needs
to be treated like a real device in terms of DMA.

And for powerpc to make use of that I need to get the dma series I
posted last week reviewed and included, otherwise powerpc will have
to be excepted (like arm, where rmk didn't like the way the code
was factored, everything else has already been taken care of).

https://lists.linuxfoundation.org/pipermail/iommu/2018-July/028989.html

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07  6:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Benjamin Herrenschmidt, Christoph Hellwig, Will Deacon,
	Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe, linuxram, haren,
	paulus, srikar, robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <20180807024503-mutt-send-email-mst@kernel.org>

On Tue, Aug 07, 2018 at 02:45:25AM +0300, Michael S. Tsirkin wrote:
> > > I think that's where Christoph might have specific ideas about it.
> > 
> > OK well, assuming Christoph can solve the direct case in a way that
> > also work for the virtio !iommu case, we still want some bit of logic
> > somewhere that will "switch" to swiotlb based ops if the DMA mask is
> > limited.
> > 
> > You mentioned an RFC for that ? Do you happen to have a link ?
> 
> No but Christoph did I think.

Do you mean the direct map retpoline mitigation?  It is here:

https://www.spinics.net/lists/netdev/msg495413.html

https://www.spinics.net/lists/netdev/msg495785.html

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-07  6:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael S. Tsirkin, Will Deacon, Anshuman Khandual,
	virtualization, linux-kernel, linuxppc-dev, aik, robh, joe,
	elfring, david, jasowang, mpe, linuxram, haren, paulus, srikar,
	robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <20180807062117.GD32709@infradead.org>

On Mon, 2018-08-06 at 23:21 -0700, Christoph Hellwig wrote:
> On Tue, Aug 07, 2018 at 05:52:12AM +1000, Benjamin Herrenschmidt wrote:
> > > It is your job to write a coherent interface specification that does
> > > not depend on the used components.  The hypervisor might be PAPR,
> > > Linux + qemu, VMware, Hyperv or something so secret that you'd have
> > > to shoot me if you had to tell me.  The guest might be Linux, FreeBSD,
> > > AIX, OS400 or a Hipster project of the day in Rust.  As long as we
> > > properly specify the interface it simplify does not matter.
> > 
> > That's the point Christoph. The interface is today's interface. It does
> > NOT change. That information is not part of the interface.
> > 
> > It's the VM itself that is stashing away its memory in a secret place,
> > and thus needs to do bounce buffering. There is no change to the virtio
> > interface per-se.
> 
> Any guest that doesn't know about your magic limited adressing is simply
> not going to work, so we need to communicate that fact.

The guest does. It's the guest itself that initiates it. That's my
point, it's not a factor of the hypervisor, which is unchanged in that
area.

It's the guest itself, that makes the decision early on, to stash it's
memory away in a secure place, and thus needs to establish some kind of
bouce buffering via a few left over "insecure" pages.

It's all done by the guest: initiated by the guest and controlled by
the guest.

That's why I don't see why this specifically needs to involve the
hypervisor side, and thus a VIRTIO feature bit.

Note that I can make it so that the same DMA ops (basically standard
swiotlb ops without arch hacks) work for both "direct virtio" and
"normal PCI" devices.

The trick is simply in the arch to setup the iommu to map the swiotlb
bounce buffer pool 1:1 in the iommu, so the iommu essentially can be
ignored without affecting the physical addresses.

If I do that, *all* I need is a way, from the guest itself (again, the
other side dosn't know anything about it), to force virtio to use the
DMA ops as if there was an iommu, that is, use whatever dma ops were
setup by the platform for the pci device.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07  6:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Benjamin Herrenschmidt, Christoph Hellwig, Will Deacon,
	Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe, linuxram, haren,
	paulus, srikar, robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <20180806233024-mutt-send-email-mst@kernel.org>

On Mon, Aug 06, 2018 at 11:35:39PM +0300, Michael S. Tsirkin wrote:
> > As I said replying to Christoph, we are "leaking" into the interface
> > something here that is really what's the VM is doing to itself, which
> > is to stash its memory away in an inaccessible place.
> > 
> > Cheers,
> > Ben.
> 
> I think Christoph merely objects to the specific implementation.  If
> instead you do something like tweak dev->bus_dma_mask for the virtio
> device I think he won't object.

As long as we also document how dev->bus_dma_mask is tweaked for this
particular virtual bus, yes.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-07  6:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael S. Tsirkin, Will Deacon, Anshuman Khandual,
	virtualization, linux-kernel, linuxppc-dev, aik, robh, joe,
	elfring, david, jasowang, mpe, linuxram, haren, paulus, srikar,
	robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <20180807062731.GA23159@infradead.org>

On Mon, 2018-08-06 at 23:27 -0700, Christoph Hellwig wrote:
> On Tue, Aug 07, 2018 at 08:13:56AM +1000, Benjamin Herrenschmidt wrote:
> > It would be indeed ideal if all we had to do was setup some kind of
> > bus_dma_mask on all PCI devices and have virtio automagically insert
> > swiotlb when necessary.
> 
> For 4.20 I plan to remove the swiotlb ops and instead do the bounce
> buffering in the common code, including a direct call to the direct
> ops to avoid retpoline overhead.  For that you still need a flag
> in virtio that instead of blindly working physical addresses it needs
> to be treated like a real device in terms of DMA.

But you will still call the swiotlb infrastructure, right ? IE, I sitll
need to control where/how the swiotlb "pool" is allocated.
> 
> And for powerpc to make use of that I need to get the dma series I
> posted last week reviewed and included, otherwise powerpc will have
> to be excepted (like arm, where rmk didn't like the way the code
> was factored, everything else has already been taken care of).
> 
> https://lists.linuxfoundation.org/pipermail/iommu/2018-July/028989.html

Yes, I saw your series. I'm just back from a week of travel, I plan to
start reviewing it this week if Michael doesn't beat me to it.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH 1/2] powerpc: Allow memory that has been hot-removed to be hot-added
From: Michael Neuling @ 2018-08-07  6:52 UTC (permalink / raw)
  To: Rashmica Gupta, linuxppc-dev, mpe, benh, paulus, bsingharora
In-Reply-To: <20180803060601.724-1-rashmica.g@gmail.com>

On Fri, 2018-08-03 at 16:06 +1000, Rashmica Gupta wrote:
> This patch allows the memory removed by memtrace to be readded to the
> kernel. So now you don't have to reboot your system to add the memory
> back to the kernel or to have a different amount of memory removed.
>=20
> Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>

Tested-by: Michael Neuling <mikey@neuling.org>

> ---
> To remove 1GB from each node:
> echo 1073741824  >  /sys/kernel/debug/powerpc/memtrace/enable
>=20
> To add this memory back and remove 2GB:
> echo 2147483648 >  /sys/kernel/debug/powerpc/memtrace/enable=20
>=20
> To just re-add memory:
> echo 0  >  /sys/kernel/debug/powerpc/memtrace/enable=20

Nice... thanks!

Mikey

>=20
>=20
>  arch/powerpc/platforms/powernv/memtrace.c | 93 +++++++++++++++++++++++++=
+++
> ---
>  1 file changed, 86 insertions(+), 7 deletions(-)
>=20
> diff --git a/arch/powerpc/platforms/powernv/memtrace.c
> b/arch/powerpc/platforms/powernv/memtrace.c
> index b99283df8584..51fe0862dcab 100644
> --- a/arch/powerpc/platforms/powernv/memtrace.c
> +++ b/arch/powerpc/platforms/powernv/memtrace.c
> @@ -206,8 +206,11 @@ static int memtrace_init_debugfs(void)
> =20
>  		snprintf(ent->name, 16, "%08x", ent->nid);
>  		dir =3D debugfs_create_dir(ent->name, memtrace_debugfs_dir);
> -		if (!dir)
> +		if (!dir) {
> +			pr_err("Failed to create debugfs directory for node
> %d\n",
> +				ent->nid);
>  			return -1;
> +		}
> =20
>  		ent->dir =3D dir;
>  		debugfs_create_file("trace", 0400, dir, ent, &memtrace_fops);
> @@ -218,18 +221,94 @@ static int memtrace_init_debugfs(void)
>  	return ret;
>  }
> =20
> +static int online_mem_block(struct memory_block *mem, void *arg)
> +{
> +	return device_online(&mem->dev);
> +}
> +
> +/*
> + * Iterate through the chunks of memory we have removed from the kernel
> + * and attempt to add them back to the kernel.
> + */
> +static int memtrace_online(void)
> +{
> +	int i, ret =3D 0;
> +	struct memtrace_entry *ent;
> +
> +	for (i =3D memtrace_array_nr - 1; i >=3D 0; i--) {
> +		ent =3D &memtrace_array[i];
> +
> +		/* We have onlined this chunk previously */
> +		if (ent->nid =3D=3D -1)
> +			continue;
> +
> +		/* Remove from io mappings */
> +		if (ent->mem) {
> +			iounmap(ent->mem);
> +			ent->mem =3D 0;
> +		}
> +
> +		if (add_memory(ent->nid, ent->start, ent->size)) {
> +			pr_err("Failed to add trace memory to node %d\n",
> +					 ent->nid);
> +			ret +=3D 1;
> +			continue;
> +		}
> +
> +		/*
> +		 * If kernel isn't compiled with the auto online option
> +		 * we need to online the memory ourselves.
> +		 */
> +		if (!memhp_auto_online) {
> +			walk_memory_range(PFN_DOWN(ent->start),
> +				PFN_UP(ent->start + ent->size - 1),
> +				NULL, online_mem_block);
> +		}
> +
> +		/*
> +		 * Memory was added successfully so clean up references to it
> +		 * so on reentry we can tell that this chunk was added.
> +		 */
> +		debugfs_remove_recursive(ent->dir);
> +		pr_info("Added trace memory back to node %d\n", ent->nid);
> +		ent->size =3D ent->start =3D ent->nid =3D -1;
> +	}
> +	if (ret)
> +		return ret;
> +
> +	/* If all chunks of memory were added successfully, reset globals */
> +	kfree(memtrace_array);
> +	memtrace_array =3D NULL;
> +	memtrace_size =3D 0;
> +	memtrace_array_nr =3D 0;
> +	return 0;
> +
> +}
> +
>  static int memtrace_enable_set(void *data, u64 val)
>  {
> -	if (memtrace_size)
> +	uint64_t bytes;
> +
> +	/*
> +	 * Don't attempt to do anything if size isn't aligned to a memory
> +	 * block or equal to zero.
> +	 */
> +	bytes =3D memory_block_size_bytes();
> +	if (val & (bytes - 1)) {
> +		pr_err("Value must be aligned with 0x%llx\n", bytes);
>  		return -EINVAL;
> +	}
> =20
> -	if (!val)
> -		return -EINVAL;
> +	/* Re-add/online previously removed/offlined memory */
> +	if (memtrace_size) {
> +		if (memtrace_online())
> +			return -EAGAIN;
> +	}
> =20
> -	/* Make sure size is aligned to a memory block */
> -	if (val & (memory_block_size_bytes() - 1))
> -		return -EINVAL;
> +	if (!val)
> +		return 0;
> =20
> +	/* Offline and remove memory */
>  	if (memtrace_init_regions_runtime(val))
>  		return -EINVAL;
> =20

^ permalink raw reply

* Re: [PATCH 1/3] powerpc: make CPU selection logic generic in Makefile
From: Michael Ellerman @ 2018-08-07  7:18 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras, npiggin
  Cc: linux-kernel, linuxppc-dev
In-Reply-To: <273f8ed3e980b9385c6e1b31e17f890ea08ce33c.1528365638.git.christophe.leroy@c-s.fr>

Christophe Leroy <christophe.leroy@c-s.fr> writes:

> At the time being, when adding a new CPU for selection, both
> Kconfig.cputype and Makefile have to be modified.
>
> This patch moves into Kconfig.cputype the name of the CPU to me
> passed to the -mcpu= argument.
>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  arch/powerpc/Makefile                  |  8 +-------
>  arch/powerpc/platforms/Kconfig.cputype | 15 +++++++++++++++
>  2 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index 9704ab360d39..9a5642552abc 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -175,13 +175,7 @@ ifdef CONFIG_MPROFILE_KERNEL
>      endif
>  endif
>  
> -CFLAGS-$(CONFIG_CELL_CPU) += $(call cc-option,-mcpu=cell)
> -CFLAGS-$(CONFIG_POWER5_CPU) += $(call cc-option,-mcpu=power5)
> -CFLAGS-$(CONFIG_POWER6_CPU) += $(call cc-option,-mcpu=power6)
> -CFLAGS-$(CONFIG_POWER7_CPU) += $(call cc-option,-mcpu=power7)
> -CFLAGS-$(CONFIG_POWER8_CPU) += $(call cc-option,-mcpu=power8)
> -CFLAGS-$(CONFIG_POWER9_CPU) += $(call cc-option,-mcpu=power9)
> -CFLAGS-$(CONFIG_PPC_8xx) += $(call cc-option,-mcpu=860)
> +CFLAGS-$(CONFIG_SPECIAL_CPU_BOOL) += $(call cc-option,-mcpu=$(CONFIG_SPECIAL_CPU))

This looks good.

I'll rename it from "SPECIAL_CPU" to "TARGET_CPU" because that's the
terminology used in the GCC docs, eg:

    -mcpu=name
           Specify the name of the target processor, optionally suffixed by one or more feature modifiers.


cheers

^ permalink raw reply

* Re: [PATCH v3 00/14] Implement use of HW assistance on TLB table walk on 8xx
From: Christophe LEROY @ 2018-08-07  7:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	aneesh.kumar
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <cover.1527608397.git.christophe.leroy@c-s.fr>

Hi

Please note that I'm currently deeply reworking this serie in order to 
make it more clear hence more simple to review. So don't consider taking 
it yet (except maybe the first patch of the serie which is a bugfix).

Christophe

Le 29/05/2018 à 17:50, Christophe Leroy a écrit :
> The purpose of this serie is to implement hardware assistance for TLB table walk
> on the 8xx.
> 
> First part is to make L1 entries and L2 entries independant.
> For that, we need to alter ioremap functions in order to handle GUARD attribute
> at the PGD/PMD level.
> 
> Last part is to reuse PTE fragment implemented on PPC64 in order to
> not waste 16k Pages for page tables as only 4k are used.
> 
> Tested successfully on 8xx.
> 
> Successfull compilation on following defconfigs (v3):
> ppc64_defconfig
> ppc64e_defconfig
> 
> Successfull compilation on following defconfigs (v2):
> ppc64_defconfig
> ppc64e_defconfig
> pseries_defconfig
> pmac32_defconfig
> linkstation_defconfig
> corenet32_smp_defconfig
> ppc40x_defconfig
> storcenter_defconfig
> ppc44x_defconfig
> 
> Changes in v3:
>   - Fixed an issue in the 09/14 when CONFIG_PIN_TLB_TEXT was not enabled
>   - Added performance measurement in the 09/14 commit log
>   - Rebased on latest 'powerpc/merge' tree, which conflicted with 13/14
> 
> Changes in v2:
>   - Removed the 3 first patchs which have been applied already
>   - Fixed compilation errors reported by Michael
>   - Squashed the commonalisation of ioremap functions into a single patch
>   - Fixed the use of pte_fragment
>   - Added a patch optimising perf counting of TLB misses and instructions
> 
> Christophe Leroy (14):
>    Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for
>      CONFIG_SWAP"
>    powerpc: move io mapping functions into ioremap.c
>    powerpc: make ioremap_bot common to PPC32 and PPC64
>    powerpc: common ioremap functions.
>    powerpc: use _ALIGN_DOWN macro for VMALLOC_BASE
>    powerpc/nohash32: allow setting GUARDED attribute in the PMD directly
>    powerpc/8xx: set GUARDED attribute in the PMD directly
>    powerpc/8xx: Remove PTE_ATOMIC_UPDATES
>    powerpc/mm: Use hardware assistance in TLB handlers on the 8xx
>    powerpc/8xx: reunify TLB handler routines
>    powerpc/8xx: Free up SPRN_SPRG_SCRATCH2
>    powerpc/mm: Make pte_fragment_alloc() common to PPC32 and PPC64
>    powerpc/mm: Use pte_fragment_alloc() on 8xx
>    powerpc/8xx: Move SW perf counters in first 32kb of memory
> 
>   arch/powerpc/include/asm/book3s/32/pgtable.h |  29 +-
>   arch/powerpc/include/asm/book3s/64/pgtable.h |   2 +
>   arch/powerpc/include/asm/highmem.h           |  11 -
>   arch/powerpc/include/asm/hugetlb.h           |   4 +-
>   arch/powerpc/include/asm/machdep.h           |   2 +-
>   arch/powerpc/include/asm/mmu-8xx.h           |  38 +--
>   arch/powerpc/include/asm/mmu_context.h       |  53 ++++
>   arch/powerpc/include/asm/nohash/32/pgalloc.h |  56 +++-
>   arch/powerpc/include/asm/nohash/32/pgtable.h |  77 +++--
>   arch/powerpc/include/asm/nohash/32/pte-8xx.h |   6 +-
>   arch/powerpc/include/asm/nohash/pgtable.h    |   4 +
>   arch/powerpc/include/asm/page.h              |   2 +-
>   arch/powerpc/include/asm/pgtable-types.h     |   4 +
>   arch/powerpc/kernel/head_8xx.S               | 409 +++++++++++----------------
>   arch/powerpc/mm/8xx_mmu.c                    |  12 +-
>   arch/powerpc/mm/Makefile                     |   2 +-
>   arch/powerpc/mm/dma-noncoherent.c            |   2 +-
>   arch/powerpc/mm/dump_linuxpagetables.c       |  32 ++-
>   arch/powerpc/mm/hugetlbpage.c                |  12 +
>   arch/powerpc/mm/init_32.c                    |   6 +-
>   arch/powerpc/mm/{pgtable_64.c => ioremap.c}  | 239 ++++++----------
>   arch/powerpc/mm/mem.c                        |  16 +-
>   arch/powerpc/mm/mmu_context_book3s64.c       |  44 ---
>   arch/powerpc/mm/mmu_context_nohash.c         |   4 +
>   arch/powerpc/mm/pgtable-book3s64.c           |  72 -----
>   arch/powerpc/mm/pgtable.c                    |  82 ++++++
>   arch/powerpc/mm/pgtable_32.c                 | 167 +++--------
>   arch/powerpc/mm/pgtable_64.c                 | 177 ------------
>   arch/powerpc/platforms/Kconfig.cputype       |  19 ++
>   29 files changed, 657 insertions(+), 926 deletions(-)
>   copy arch/powerpc/mm/{pgtable_64.c => ioremap.c} (53%)
> 

^ permalink raw reply

* Re: [PATCH v6 00/11] hugetlb: Factorize hugetlb architecture primitives
From: Ingo Molnar @ 2018-08-07  9:54 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: linux-mm, mike.kravetz, linux, catalin.marinas, will.deacon,
	tony.luck, fenghua.yu, ralf, paul.burton, jhogan, jejb, deller,
	benh, paulus, mpe, ysato, dalias, davem, tglx, mingo, hpa, x86,
	arnd, linux-arm-kernel, linux-kernel, linux-ia64, linux-mips,
	linux-parisc, linuxppc-dev, linux-sh, sparclinux, linux-arch
In-Reply-To: <20180806175711.24438-1-alex@ghiti.fr>


* Alexandre Ghiti <alex@ghiti.fr> wrote:

> [CC linux-mm for inclusion in -mm tree]                                          
>                                                                                  
> In order to reduce copy/paste of functions across architectures and then         
> make riscv hugetlb port (and future ports) simpler and smaller, this             
> patchset intends to factorize the numerous hugetlb primitives that are           
> defined across all the architectures.                                            
>                                                                                  
> Except for prepare_hugepage_range, this patchset moves the versions that         
> are just pass-through to standard pte primitives into                            
> asm-generic/hugetlb.h by using the same #ifdef semantic that can be              
> found in asm-generic/pgtable.h, i.e. __HAVE_ARCH_***.                            
>                                                                                  
> s390 architecture has not been tackled in this serie since it does not           
> use asm-generic/hugetlb.h at all.                                                
>                                                                                  
> This patchset has been compiled on all addressed architectures with              
> success (except for parisc, but the problem does not come from this              
> series).                                                                         
>                                                                                  
> v6:                                                                              
>   - Remove nohash/32 and book3s/32 powerpc specific implementations in
>     order to use the generic ones.                                                        
>   - Add all the Reviewed-by, Acked-by and Tested-by in the commits,              
>     thanks to everyone.                                                          
>                                                                                  
> v5:                                                                              
>   As suggested by Mike Kravetz, no need to move the #include                     
>   <asm-generic/hugetlb.h> for arm and x86 architectures, let it live at          
>   the top of the file.                                                           
>                                                                                  
> v4:                                                                              
>   Fix powerpc build error due to misplacing of #include                          
>   <asm-generic/hugetlb.h> outside of #ifdef CONFIG_HUGETLB_PAGE, as              
>   pointed by Christophe Leroy.                                                   
>                                                                                  
> v1, v2, v3:                                                                      
>   Same version, just problems with email provider and misuse of                  
>   --batch-size option of git send-email
> 
> Alexandre Ghiti (11):
>   hugetlb: Harmonize hugetlb.h arch specific defines with pgtable.h
>   hugetlb: Introduce generic version of hugetlb_free_pgd_range
>   hugetlb: Introduce generic version of set_huge_pte_at
>   hugetlb: Introduce generic version of huge_ptep_get_and_clear
>   hugetlb: Introduce generic version of huge_ptep_clear_flush
>   hugetlb: Introduce generic version of huge_pte_none
>   hugetlb: Introduce generic version of huge_pte_wrprotect
>   hugetlb: Introduce generic version of prepare_hugepage_range
>   hugetlb: Introduce generic version of huge_ptep_set_wrprotect
>   hugetlb: Introduce generic version of huge_ptep_set_access_flags
>   hugetlb: Introduce generic version of huge_ptep_get
> 
>  arch/arm/include/asm/hugetlb-3level.h        | 32 +---------
>  arch/arm/include/asm/hugetlb.h               | 30 ----------
>  arch/arm64/include/asm/hugetlb.h             | 39 +++---------
>  arch/ia64/include/asm/hugetlb.h              | 47 ++-------------
>  arch/mips/include/asm/hugetlb.h              | 40 +++----------
>  arch/parisc/include/asm/hugetlb.h            | 33 +++--------
>  arch/powerpc/include/asm/book3s/32/pgtable.h |  6 --
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  1 +
>  arch/powerpc/include/asm/hugetlb.h           | 43 ++------------
>  arch/powerpc/include/asm/nohash/32/pgtable.h |  6 --
>  arch/powerpc/include/asm/nohash/64/pgtable.h |  1 +
>  arch/sh/include/asm/hugetlb.h                | 54 ++---------------
>  arch/sparc/include/asm/hugetlb.h             | 40 +++----------
>  arch/x86/include/asm/hugetlb.h               | 69 ----------------------
>  include/asm-generic/hugetlb.h                | 88 +++++++++++++++++++++++++++-
>  15 files changed, 135 insertions(+), 394 deletions(-)

The x86 bits look good to me (assuming it's all tested on all relevant architectures, etc.)

Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply

* Re: Several suspected memory leaks
From: Catalin Marinas @ 2018-08-07 11:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: pmenzel, Paul Mackerras, mpe, linuxppc-dev,
	Linux Kernel Mailing List
In-Reply-To: <69a6086d25c17b03ecf3ab7ed34fe941fd993410.camel@kernel.crashing.org>

(catching up with emails)

On Wed, 11 Jul 2018 at 00:40, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2018-07-10 at 17:17 +0200, Paul Menzel wrote:
> > On a the IBM S822LC (8335-GTA) with Ubuntu 18.04 I built Linux master
> > =E2=80=93 4.18-rc4+, commit 092150a2 (Merge branch 'for-linus'
> > of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid) =E2=80=93 w=
ith
> > kmemleak. Several issues are found.
>
> Some of these are completely uninteresting though and look like
> kmemleak bugs to me :-)
>
> >     [<00000000bc285bbf>] __pud_alloc+0x80/0x270
> >     [<0000000007135d64>] hash__map_kernel_page+0x30c/0x4d0
> >     [<0000000071677858>] __ioremap_at+0x108/0x140
> >     [<000000000023e921>] __ioremap_caller+0x130/0x180
> >     [<000000009dbc3923>] icp_native_init_one_node+0x5cc/0x760
> >     [<0000000015f3168a>] icp_native_init+0x70/0x13c
> >     [<00000000655550ed>] xics_init+0x38/0x1ac
> >     [<0000000088dbf9d1>] pnv_init_IRQ+0x30/0x5c
>
> This is the interrupt controller mapping its registers, why on earth
> would that be considered a leak ? kmemleak needs to learn to ignore
> kernel page tables allocations.

Indeed, that's just a false positive for powerpc. Kmemleak ignores
page allocations and most architectures use __get_free_pages() for the
page table. In this particular case, the powerpc code uses
kmem_cache_alloc() and that's tracked by kmemleak. Since the pgd
stores the __pa(pud), kmemleak doesn't detect this pointer and reports
it as a leak. To work around this, you can pass SLAB_NOLEAKTRACE to
kmem_cache_create() in pgtable_cache_add()
(arch/powerpc/mm/init-common.c).

--=20
Catalin

^ permalink raw reply

* [PATCH] powerpc/64: Disable irq restore warning for now
From: Michael Ellerman @ 2018-08-07 11:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: npiggin

We recently added a warning in arch_local_irq_restore() to check that
the soft masking state matches reality.

Unfortunately it trips in a few places, which are not entirely trivial
to fix. The key problem is if we're doing function_graph tracing of
restore_math(), the warning pops and then seems to recurse. It's not
entirely clear because the system continuously oopses on all CPUs,
with the output interleaved and unreadable.

It's also been observed on a G5 coming out of idle.

Until we can fix those cases disable the warning for now.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/kernel/irq.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index ca941f1e83a9..916ddc4aac44 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -261,9 +261,16 @@ notrace void arch_local_irq_restore(unsigned long mask)
 	 */
 	irq_happened = get_irq_happened();
 	if (!irq_happened) {
-#ifdef CONFIG_PPC_IRQ_SOFT_MASK_DEBUG
-		WARN_ON(!(mfmsr() & MSR_EE));
-#endif
+		/*
+		 * FIXME. Here we'd like to be able to do:
+		 *
+		 * #ifdef CONFIG_PPC_IRQ_SOFT_MASK_DEBUG
+		 *   WARN_ON(!(mfmsr() & MSR_EE));
+		 * #endif
+		 *
+		 * But currently it hits in a few paths, we should fix those and
+		 * enable the warning.
+		 */
 		return;
 	}

-- 
2.14.1

^ permalink raw reply related

* Re: [RFC PATCH 0/3] New device-tree format and Opal based idle save-restore
From: Michael Ellerman @ 2018-08-07 12:15 UTC (permalink / raw)
  To: Akshay Adiga, linux-kernel, linuxppc-dev
  Cc: npiggin, benh, ego, huntbag, Akshay Adiga
In-Reply-To: <20180802045132.12432-1-akshay.adiga@linux.vnet.ibm.com>

Akshay Adiga <akshay.adiga@linux.vnet.ibm.com> writes:

> Previously if a older kernel runs on a newer firmware, it may enable
> all available states irrespective of its capability of handling it.
> New device tree format adds a compatible flag, so that only kernel
> which has the capability to handle the version of stop state will enable
> it.
>
> Older kernel will still see stop0 and stop0_lite in older format and we
> will depricate it after some time.
>
> 1) Idea is to bump up the version string in firmware if we find a bug or
> regression in stop states. A fix will be provided in linux which would
> now know about the bumped up version of stop states, where as kernel
> without fixes would ignore the states.
>
> 2) Slowly deprecate cpuidle/cpuhotplug threshold which is hard-coded
> into cpuidle-powernv driver. Instead use compatible strings to indicate
> if idle state is suitable for cpuidle and hotplug.
>
> New idle state device tree format :
>        power-mgt {
>             ...
>          ibm,enabled-stop-levels = <0xec000000>;
>          ibm,cpu-idle-state-psscr-mask = <0x0 0x3003ff 0x0 0x3003ff>;
>          ibm,cpu-idle-state-latencies-ns = <0x3e8 0x7d0>;
>          ibm,cpu-idle-state-psscr = <0x0 0x330 0x0 0x300330>;
>          ibm,cpu-idle-state-flags = <0x100000 0x101000>;
>          ibm,cpu-idle-state-residency-ns = <0x2710 0x4e20>;
>          ibm,idle-states {
>                      stop4 {
>                          flags = <0x207000>;
>                          compatible = "ibm,state-v1",
> 				      "cpuidle",
> 				      "opal-supported";
>                          psscr-mask = <0x0 0x3003ff>;
>                          handle = <0x102>;
>                          latency-ns = <0x186a0>;
>                          residency-ns = <0x989680>;
>                          psscr = <0x0 0x300374>;
>                   };
>                     ...
>                     stop11 {
>                      ...
>                          compatible = "ibm,state-v1",
> 				      "cpuoffline",
> 				      "opal-supported";
>                          ...
>                   };
>              };
>
> Skiboot patch-set for device-tree is posted here :
> https://patchwork.ozlabs.org/project/skiboot/list/?series=58934

I don't see a device tree binding documented anywhere?

There is an existing binding defined for ARM chips, presumably it
doesn't do everything we need. But are there good reasons why we are not
using it as a base?

See: Documentation/devicetree/bindings/arm/idle-states.txt


The way you're using compatible is not really consistent with its
traditional meaning.

eg, you have multiple states with:

                compatible = "ibm,state-v1",
                            "cpuoffline",
                            "opal-supported";


This would typically mean that all those state are all "compatible" with
some semantics defined by the name "ibm,state-v1". What you're trying to
say (I think) is that each state is "version 1" of *that state*. And
only kernels that understand version 1 should use the state.

And "cpuoffline" and "opal-supported" definitely don't belong in
compatible AFAICS, they should simply be boolean properties of the node.

cheers

^ permalink raw reply

* Re: [PATCH] misc: ibmvsm: Fix wrong assignment of return code
From: Michael Ellerman @ 2018-08-07 12:28 UTC (permalink / raw)
  To: Bryant G. Ly, gregkh; +Cc: linuxppc-dev, linux-kernel, Bryant G. Ly
In-Reply-To: <1533562260-11456-1-git-send-email-bryantly@linux.vnet.ibm.com>

"Bryant G. Ly" <bryantly@linux.vnet.ibm.com> writes:

> From: "Bryant G. Ly" <bryantly@linux.ibm.com>
>
> Currently the assignment is flipped and rc is always 0.

If you'd left rc uninitialised at the start of the function the compiler
would have caught it for you.

And what is the consequence of the bug? Nothing, complete system crash,
subtle data corruption?

Also this should be tagged:

Fixes: 0eca353e7ae7 ("misc: IBM Virtual Management Channel Driver (VMC)")

cheers

> diff --git a/drivers/misc/ibmvmc.c b/drivers/misc/ibmvmc.c
> index 8f82bb9..b8aaa68 100644
> --- a/drivers/misc/ibmvmc.c
> +++ b/drivers/misc/ibmvmc.c
> @@ -2131,7 +2131,7 @@ static int ibmvmc_init_crq_queue(struct crq_server_adapter *adapter)
>  	retrc = plpar_hcall_norets(H_REG_CRQ,
>  				   vdev->unit_address,
>  				   queue->msg_token, PAGE_SIZE);
> -	retrc = rc;
> +	rc = retrc;
>  
>  	if (rc == H_RESOURCE)
>  		rc = ibmvmc_reset_crq_queue(adapter);

^ permalink raw reply

* Re: [PATCH] of/fdt: Remove PPC32 longtrail hack in memory scan
From: Michael Ellerman @ 2018-08-07 12:34 UTC (permalink / raw)
  To: Rob Herring; +Cc: devicetree, Frank Rowand, Paul Mackerras, linuxppc-dev
In-Reply-To: <CAL_JsqJQKb-0V0rHnonE5QfCrJwf8APiLNoRtEYei+H_YiC4qA@mail.gmail.com>

Rob Herring <robh+dt@kernel.org> writes:
> On Mon, Jul 30, 2018 at 4:47 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>> Rob Herring <robh+dt@kernel.org> writes:
>> > On Thu, Jul 26, 2018 at 11:36 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>> >> When the OF code was originally made common by Grant in commit
>> >> 51975db0b733 ("of/flattree: merge early_init_dt_scan_memory() common
>> >> code") (Feb 2010), the common code inherited a hack to handle
>> >> PPC "longtrail" machines, which had a "memory@0" node with no
>> >> device_type.
>> >>
>> >> That check was then made to only apply to PPC32 in b44aa25d20e2 ("of:
>> >> Handle memory@0 node on PPC32 only") (May 2014).
>> >>
>> >> But according to Paul Mackerras the "longtrail" machines are long
>> >> dead, if they were ever seen in the wild at all. If someone does still
>> >> have one, we can handle this firmware wart in powerpc platform code.
>> >>
>> >> So remove the hack once and for all.
>> >
>> > Yay. I guess Power Macs and other quirks will never die...
>>
>> Not soon.
>>
>> In base.c I see:
>>  - the hack in arch_find_n_match_cpu_physical_id()
>>    - we should just move that into arch code, it's a __weak arch hook
>>      after all.
>
> Except then we'd have to export __of_find_n_match_cpu_property. I
> somewhat prefer it like it is because arch specific functions tend to
> encourage duplication. But if the implementation is completely
> different like sparc, then yes, a separate implementation makes sense.

OK I'll leave it as-is then.

>>  - a PPC hack in of_alias_scan(), I guess we need to retain that
>>    behaviour, but it's pretty minor anyway.
>
> It would be nice to know what platform(s) needs this as I don't have a
> clue.

Yeah.

> It would also be nice if I had some dumps of DTs for some of
> these OpenFirmware systems.

Yeah it would. What if I threw some in a git tree somewhere?

I guess fully unpacked is the most useful form, ie. just a direct copy
from /proc/device-tree.

cheers

^ permalink raw reply

* [PATCH] powerpc/64s: idle_power4 fix PACA_IRQ_HARD_DIS accounting
From: Nicholas Piggin @ 2018-08-07 13:21 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

When idle_power4 hard disables interrupts then finds a soft pending
interrupt, it returns with interrupts hard disabled but without
PACA_IRQ_HARD_DIS set. Commit 9b81c0211c ("powerpc/64s: make
PACA_IRQ_HARD_DIS track MSR[EE] closely") added a warning for that
condition.

Fix this by adding the PACA_IRQ_HARD_DIS for that case.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

---
This was half tested by modifying power4_idle to work on later CPUs
because I have no G5.

 arch/powerpc/kernel/idle_power4.S | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/idle_power4.S b/arch/powerpc/kernel/idle_power4.S
index dd7471fe20bd..a09b3c7ca176 100644
--- a/arch/powerpc/kernel/idle_power4.S
+++ b/arch/powerpc/kernel/idle_power4.S
@@ -32,6 +32,8 @@ END_FTR_SECTION_IFCLR(CPU_FTR_CAN_NAP)
 	cmpwi	0,r4,0
 	beqlr
 
+	/* This sequence is similar to prep_irq_for_idle() */
+
 	/* Hard disable interrupts */
 	mfmsr	r7
 	rldicl	r0,r7,48,1
@@ -41,10 +43,15 @@ END_FTR_SECTION_IFCLR(CPU_FTR_CAN_NAP)
 	/* Check if something happened while soft-disabled */
 	lbz	r0,PACAIRQHAPPENED(r13)
 	cmpwi	cr0,r0,0
-	bnelr
+	bne-	2f
 
-	/* Soft-enable interrupts */
+	/*
+	 * Soft-enable interrupts. This will make power4_fixup_nap return
+	 * to our caller with interrupts enabled (soft and hard). The caller
+	 * can cope with either interrupts disabled or enabled upon return.
+	 */
 #ifdef CONFIG_TRACE_IRQFLAGS
+	/* Tell the tracer interrupts are on, because idle responds to them. */
 	mflr	r0
 	std	r0,16(r1)
 	stdu    r1,-128(r1)
@@ -73,3 +80,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
 	isync
 	b	1b
 
+2:	/* Return if an interrupt had happened while soft disabled */
+	/* Set the HARD_DIS flag because interrupts are now hard disabled */
+	ori	r0,r0,PACA_IRQ_HARD_DIS
+	stb	r0,PACAIRQHAPPENED(r13)
+	blr
-- 
2.17.0

^ permalink raw reply related

* [PATCH v2] powerpc/tm: Print 64-bits MSR
From: Breno Leitao @ 2018-08-07 13:35 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, Breno Leitao
In-Reply-To: <5b85006b6695bd94f4818357be607036e9df8696.camel@neuling.org>

On a kernel TM Bad thing program exception, the Machine State Register
(MSR) is not being properly displayed. The exception code dumps a 32-bits
value but MSR is a 64 bits register for all platforms that have HTM
enabled.

This patch dumps the MSR value as a 64-bits value instead of 32 bits. In
order to do so, the 'reason' variable could not be used, since it trimmed
MSR to 32-bits (int).

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 arch/powerpc/kernel/traps.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 0e17dcb48720..cd561fd89532 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1402,7 +1402,7 @@ void program_check_exception(struct pt_regs *regs)
 			goto bail;
 		} else {
 			printk(KERN_EMERG "Unexpected TM Bad Thing exception "
-			       "at %lx (msr 0x%x)\n", regs->nip, reason);
+			       "at %lx (msr 0x%lx)\n", regs->nip, regs->msr);
 			die("Unrecoverable exception", regs, SIGABRT);
 		}
 	}
-- 
2.16.3

^ permalink raw reply related

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07 13:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christoph Hellwig, Michael S. Tsirkin, Will Deacon,
	Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe, linuxram, haren,
	paulus, srikar, robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <aa59c7f8556bd4b332394a1dcf2d4a8faf3dc4a2.camel@kernel.crashing.org>

On Tue, Aug 07, 2018 at 04:42:44PM +1000, Benjamin Herrenschmidt wrote:
> Note that I can make it so that the same DMA ops (basically standard
> swiotlb ops without arch hacks) work for both "direct virtio" and
> "normal PCI" devices.
> 
> The trick is simply in the arch to setup the iommu to map the swiotlb
> bounce buffer pool 1:1 in the iommu, so the iommu essentially can be
> ignored without affecting the physical addresses.
> 
> If I do that, *all* I need is a way, from the guest itself (again, the
> other side dosn't know anything about it), to force virtio to use the
> DMA ops as if there was an iommu, that is, use whatever dma ops were
> setup by the platform for the pci device.

In that case just setting VIRTIO_F_IOMMU_PLATFORM in the flags should
do the work (even if that isn't strictly what the current definition
of the flag actually means).  On the qemu side you'll need to make
sure you have a way to set VIRTIO_F_IOMMU_PLATFORM without emulating
an iommu, but with code to take dma offsets into account if your
plaform has any (various power plaforms seem to have them, not sure
if it affects your config).

^ permalink raw reply

* [PATCH v3] selftests/powerpc: Kill child processes on SIGINT
From: Breno Leitao @ 2018-08-07 14:15 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, Breno Leitao, Gustavo Romero
In-Reply-To: <a4f9f03b-51dd-55d4-cfec-9ae48fbbd9d6@debian.org>

There are some powerpc selftests, as tm/tm-unavailable, that run for a long
period (>120 seconds), and if it is interrupted, as pressing CRTL-C
(SIGINT), the foreground process (harness) dies but the child process and
threads continue to execute (with PPID = 1 now) in background.

In this case, you'd think the whole test exited, but there are remaining
threads and processes being executed in background. Sometimes these
zombies processes are doing annoying things, as consuming the whole CPU or
dumping things to STDOUT.

This patch fixes this problem by attaching an empty signal handler to
SIGINT in the harness process. This handler will interrupt (EINTR) the
parent process waitpid() call, letting the code to follow through the
normal flow, which will kill all the processes in the child process group.

This patch also fixes a typo.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
---
 tools/testing/selftests/powerpc/harness.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/powerpc/harness.c b/tools/testing/selftests/powerpc/harness.c
index 66d31de60b9a..9d7166dfad1e 100644
--- a/tools/testing/selftests/powerpc/harness.c
+++ b/tools/testing/selftests/powerpc/harness.c
@@ -85,13 +85,13 @@ int run_test(int (test_function)(void), char *name)
 	return status;
 }

-static void alarm_handler(int signum)
+static void sig_handler(int signum)
 {
-	/* Jut wake us up from waitpid */
+	/* Just wake us up from waitpid */
 }

-static struct sigaction alarm_action = {
-	.sa_handler = alarm_handler,
+static struct sigaction sig_action = {
+	.sa_handler = sig_handler,
 };

 void test_harness_set_timeout(uint64_t time)
@@ -106,8 +106,14 @@ int test_harness(int (test_function)(void), char *name)
 	test_start(name);
 	test_set_git_version(GIT_VERSION);

-	if (sigaction(SIGALRM, &alarm_action, NULL)) {
-		perror("sigaction");
+	if (sigaction(SIGINT, &sig_action, NULL)) {
+		perror("sigaction (sigint)");
+		test_error(name);
+		return 1;
+	}
+
+	if (sigaction(SIGALRM, &sig_action, NULL)) {
+		perror("sigaction (sigalrm)");
 		test_error(name);
 		return 1;
 	}
-- 
2.16.3

^ permalink raw reply related

* [PATCH v7 0/9] powerpc/pseries: Machine check handler improvements.
From: Mahesh J Salgaonkar @ 2018-08-07 14:15 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Nicholas Piggin, Michal Suchanek, Michael Ellerman, stable,
	Aneesh Kumar K.V, Aneesh Kumar K.V, Michal Suchanek,
	Ananth Narayan, Nicholas Piggin, Laurent Dufour, Michael Ellerman

This patch series includes some improvement to Machine check handler
for pseries. Patch 1 fixes a buffer overrun issue if rtas extended error
log size is greater than RTAS_ERROR_LOG_MAX.
Patch 2 fixes an issue where machine check handler crashes
kernel while accessing vmalloc-ed buffer while in nmi context.
Patch 3 fixes endain bug while restoring of r3 in MCE handler.
Patch 5 implements a real mode mce handler and flushes the SLBs on SLB error.
Patch 6 display's the MCE error details on console.
Patch 7 saves and dumps the SLB contents on SLB MCE errors to improve the
debugability.
Patch 8 adds sysctl knob for recovery action on recovered MCEs.
Patch 9 consolidates mce early real mode handling code.

Change in V7:
- Fold Michal's patch into patch 5
- Handle MSR_RI=0 and evil context case in MC handler in patch 5.
- Patch 7: Print slb cache ptr value and slb cache data.
- Move patch 8 to patch 9.
- Introduce patch 8 add sysctl knob for recovery action on recovered MCEs.

Change in V6:
- Introduce patch 8 to consolidate early real mode handling code.
- Address Nick's comment on erroneous hunk.

Change in V5:
- Use min_t instead of max_t.
- Fix an issue reported by kbuild test robot and address review comments.

Change in V4:
- Flush the SLBs in real mode mce handler to handle SLB errors for entry 0.
- Allocate buffers per cpu to hold rtas error log and old slb contents.
- Defer the logging of rtas error log to irq work queue.

Change in V3:
- Moved patch 5 to patch 2

Change in V2:
- patch 3: Display additional info (NIP and task info) in MCE error details.
- patch 5: Fix endain bug while restoring of r3 in MCE handler.
---

Mahesh Salgaonkar (9):
      powerpc/pseries: Avoid using the size greater than RTAS_ERROR_LOG_MAX.
      powerpc/pseries: Defer the logging of rtas error to irq work queue.
      powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.
      powerpc/pseries: Define MCE error event section.
      powerpc/pseries: flush SLB contents on SLB MCE errors.
      powerpc/pseries: Display machine check error details.
      powerpc/pseries: Dump the SLB contents on SLB MCE errors.
      powerpc/mce: Add sysctl control for recovery action on MCE.
      powernv/pseries: consolidate code for mce early handling.


 arch/powerpc/include/asm/book3s/64/mmu-hash.h |    8 +
 arch/powerpc/include/asm/machdep.h            |    1 
 arch/powerpc/include/asm/mce.h                |    2 
 arch/powerpc/include/asm/paca.h               |    7 +
 arch/powerpc/include/asm/rtas.h               |  116 ++++++++++++
 arch/powerpc/kernel/exceptions-64s.S          |   42 ++++
 arch/powerpc/kernel/mce.c                     |   73 +++++++-
 arch/powerpc/kernel/traps.c                   |    3 
 arch/powerpc/mm/slb.c                         |   79 ++++++++
 arch/powerpc/platforms/powernv/setup.c        |   15 ++
 arch/powerpc/platforms/pseries/pseries.h      |    1 
 arch/powerpc/platforms/pseries/ras.c          |  242 +++++++++++++++++++++++--
 arch/powerpc/platforms/pseries/setup.c        |   27 +++
 13 files changed, 588 insertions(+), 28 deletions(-)

--
Signature

^ permalink raw reply

* [PATCH v7 1/9] powerpc/pseries: Avoid using the size greater than RTAS_ERROR_LOG_MAX.
From: Mahesh J Salgaonkar @ 2018-08-07 14:16 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Michal Suchanek, Aneesh Kumar K.V, Michal Suchanek,
	Ananth Narayan, Nicholas Piggin, Laurent Dufour, Michael Ellerman
In-Reply-To: <153365127532.14256.1965469477086140841.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

The global mce data buffer that used to copy rtas error log is of 2048
(RTAS_ERROR_LOG_MAX) bytes in size. Before the copy we read
extended_log_length from rtas error log header, then use max of
extended_log_length and RTAS_ERROR_LOG_MAX as a size of data to be copied.
Ideally the platform (phyp) will never send extended error log with
size > 2048. But if that happens, then we have a risk of buffer overrun
and corruption. Fix this by using min_t instead.

Fixes: d368514c3097 ("powerpc: Fix corruption when grabbing FWNMI data")
Reported-by: Michal Suchanek <msuchanek@suse.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/ras.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 5e1ef9150182..ef104144d4bc 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -371,7 +371,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
 		int len, error_log_length;
 
 		error_log_length = 8 + rtas_error_extended_log_length(h);
-		len = max_t(int, error_log_length, RTAS_ERROR_LOG_MAX);
+		len = min_t(int, error_log_length, RTAS_ERROR_LOG_MAX);
 		memset(global_mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
 		memcpy(global_mce_data_buf, h, len);
 		errhdr = (struct rtas_error_log *)global_mce_data_buf;

^ permalink raw reply related

* [PATCH v7 2/9] powerpc/pseries: Defer the logging of rtas error to irq work queue.
From: Mahesh J Salgaonkar @ 2018-08-07 14:16 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: stable, Nicholas Piggin, Aneesh Kumar K.V, Michal Suchanek,
	Ananth Narayan, Nicholas Piggin, Laurent Dufour, Michael Ellerman
In-Reply-To: <153365127532.14256.1965469477086140841.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

rtas_log_buf is a buffer to hold RTAS event data that are communicated
to kernel by hypervisor. This buffer is then used to pass RTAS event
data to user through proc fs. This buffer is allocated from vmalloc
(non-linear mapping) area.

On Machine check interrupt, register r3 points to RTAS extended event
log passed by hypervisor that contains the MCE event. The pseries
machine check handler then logs this error into rtas_log_buf. The
rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
page fault (vector 0x300) while accessing it. Since machine check
interrupt handler runs in NMI context we can not afford to take any
page fault. Page faults are not honored in NMI context and causes
kernel panic. Apart from that, as Nick pointed out, pSeries_log_error()
also takes a spin_lock while logging error which is not safe in NMI
context. It may endup in deadlock if we get another MCE before releasing
the lock. Fix this by deferring the logging of rtas error to irq work queue.

Current implementation uses two different buffers to hold rtas error log
depending on whether extended log is provided or not. This makes bit
difficult to identify which buffer has valid data that needs to logged
later in irq work. Simplify this using single buffer, one per paca, and
copy rtas log to it irrespective of whether extended log is provided or
not. Allocate this buffer below RMA region so that it can be accessed
in real mode mce handler.

Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
Cc: stable@vger.kernel.org
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/paca.h        |    3 ++
 arch/powerpc/platforms/pseries/ras.c   |   47 ++++++++++++++++++++++----------
 arch/powerpc/platforms/pseries/setup.c |   16 +++++++++++
 3 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 6d34bd71139d..7f22929ce915 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -252,6 +252,9 @@ struct paca_struct {
 	void *rfi_flush_fallback_area;
 	u64 l1d_flush_size;
 #endif
+#ifdef CONFIG_PPC_PSERIES
+	u8 *mce_data_buf;		/* buffer to hold per cpu rtas errlog */
+#endif /* CONFIG_PPC_PSERIES */
 } ____cacheline_aligned;
 
 extern void copy_mm_to_paca(struct mm_struct *mm);
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index ef104144d4bc..14a46b07ab2f 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -22,6 +22,7 @@
 #include <linux/of.h>
 #include <linux/fs.h>
 #include <linux/reboot.h>
+#include <linux/irq_work.h>
 
 #include <asm/machdep.h>
 #include <asm/rtas.h>
@@ -32,11 +33,13 @@
 static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX];
 static DEFINE_SPINLOCK(ras_log_buf_lock);
 
-static char global_mce_data_buf[RTAS_ERROR_LOG_MAX];
-static DEFINE_PER_CPU(__u64, mce_data_buf);
-
 static int ras_check_exception_token;
 
+static void mce_process_errlog_event(struct irq_work *work);
+static struct irq_work mce_errlog_process_work = {
+	.func = mce_process_errlog_event,
+};
+
 #define EPOW_SENSOR_TOKEN	9
 #define EPOW_SENSOR_INDEX	0
 
@@ -330,16 +333,20 @@ static irqreturn_t ras_error_interrupt(int irq, void *dev_id)
 	((((A) >= 0x7000) && ((A) < 0x7ff0)) || \
 	(((A) >= rtas.base) && ((A) < (rtas.base + rtas.size - 16))))
 
+static inline struct rtas_error_log *fwnmi_get_errlog(void)
+{
+	return (struct rtas_error_log *)local_paca->mce_data_buf;
+}
+
 /*
  * Get the error information for errors coming through the
  * FWNMI vectors.  The pt_regs' r3 will be updated to reflect
  * the actual r3 if possible, and a ptr to the error log entry
  * will be returned if found.
  *
- * If the RTAS error is not of the extended type, then we put it in a per
- * cpu 64bit buffer. If it is the extended type we use global_mce_data_buf.
+ * Use one buffer mce_data_buf per cpu to store RTAS error.
  *
- * The global_mce_data_buf does not have any locks or protection around it,
+ * The mce_data_buf does not have any locks or protection around it,
  * if a second machine check comes in, or a system reset is done
  * before we have logged the error, then we will get corruption in the
  * error log.  This is preferable over holding off on calling
@@ -349,7 +356,7 @@ static irqreturn_t ras_error_interrupt(int irq, void *dev_id)
 static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
 {
 	unsigned long *savep;
-	struct rtas_error_log *h, *errhdr = NULL;
+	struct rtas_error_log *h;
 
 	/* Mask top two bits */
 	regs->gpr[3] &= ~(0x3UL << 62);
@@ -362,22 +369,20 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
 	savep = __va(regs->gpr[3]);
 	regs->gpr[3] = savep[0];	/* restore original r3 */
 
-	/* If it isn't an extended log we can use the per cpu 64bit buffer */
 	h = (struct rtas_error_log *)&savep[1];
+	/* Use the per cpu buffer from paca to store rtas error log */
+	memset(local_paca->mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
 	if (!rtas_error_extended(h)) {
-		memcpy(this_cpu_ptr(&mce_data_buf), h, sizeof(__u64));
-		errhdr = (struct rtas_error_log *)this_cpu_ptr(&mce_data_buf);
+		memcpy(local_paca->mce_data_buf, h, sizeof(__u64));
 	} else {
 		int len, error_log_length;
 
 		error_log_length = 8 + rtas_error_extended_log_length(h);
 		len = min_t(int, error_log_length, RTAS_ERROR_LOG_MAX);
-		memset(global_mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
-		memcpy(global_mce_data_buf, h, len);
-		errhdr = (struct rtas_error_log *)global_mce_data_buf;
+		memcpy(local_paca->mce_data_buf, h, len);
 	}
 
-	return errhdr;
+	return (struct rtas_error_log *)local_paca->mce_data_buf;
 }
 
 /* Call this when done with the data returned by FWNMI_get_errinfo.
@@ -422,6 +427,17 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
 	return 0; /* need to perform reset */
 }
 
+/*
+ * Process MCE rtas errlog event.
+ */
+static void mce_process_errlog_event(struct irq_work *work)
+{
+	struct rtas_error_log *err;
+
+	err = fwnmi_get_errlog();
+	log_error((char *)err, ERR_TYPE_RTAS_LOG, 0);
+}
+
 /*
  * See if we can recover from a machine check exception.
  * This is only called on power4 (or above) and only via
@@ -466,7 +482,8 @@ static int recover_mce(struct pt_regs *regs, struct rtas_error_log *err)
 		recovered = 1;
 	}
 
-	log_error((char *)err, ERR_TYPE_RTAS_LOG, 0);
+	/* Queue irq work to log this rtas event later. */
+	irq_work_queue(&mce_errlog_process_work);
 
 	return recovered;
 }
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 139f0af6c3d9..b42087cd8c6b 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -41,6 +41,7 @@
 #include <linux/root_dev.h>
 #include <linux/of.h>
 #include <linux/of_pci.h>
+#include <linux/memblock.h>
 
 #include <asm/mmu.h>
 #include <asm/processor.h>
@@ -101,6 +102,9 @@ static void pSeries_show_cpuinfo(struct seq_file *m)
 static void __init fwnmi_init(void)
 {
 	unsigned long system_reset_addr, machine_check_addr;
+	u8 *mce_data_buf;
+	unsigned int i;
+	int nr_cpus = num_possible_cpus();
 
 	int ibm_nmi_register = rtas_token("ibm,nmi-register");
 	if (ibm_nmi_register == RTAS_UNKNOWN_SERVICE)
@@ -114,6 +118,18 @@ static void __init fwnmi_init(void)
 	if (0 == rtas_call(ibm_nmi_register, 2, 1, NULL, system_reset_addr,
 				machine_check_addr))
 		fwnmi_active = 1;
+
+	/*
+	 * Allocate a chunk for per cpu buffer to hold rtas errorlog.
+	 * It will be used in real mode mce handler, hence it needs to be
+	 * below RMA.
+	 */
+	mce_data_buf = __va(memblock_alloc_base(RTAS_ERROR_LOG_MAX * nr_cpus,
+					RTAS_ERROR_LOG_MAX, ppc64_rma_size));
+	for_each_possible_cpu(i) {
+		paca_ptrs[i]->mce_data_buf = mce_data_buf +
+						(RTAS_ERROR_LOG_MAX * i);
+	}
 }
 
 static void pseries_8259_cascade(struct irq_desc *desc)

^ permalink raw reply related

* [PATCH v7 3/9] powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.
From: Mahesh J Salgaonkar @ 2018-08-07 14:16 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: stable, Nicholas Piggin, Aneesh Kumar K.V, Michal Suchanek,
	Ananth Narayan, Nicholas Piggin, Laurent Dufour, Michael Ellerman
In-Reply-To: <153365127532.14256.1965469477086140841.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

During Machine Check interrupt on pseries platform, register r3 points
RTAS extended event log passed by hypervisor. Since hypervisor uses r3
to pass pointer to rtas log, it stores the original r3 value at the
start of the memory (first 8 bytes) pointed by r3. Since hypervisor
stores this info and rtas log is in BE format, linux should make
sure to restore r3 value in correct endian format.

Without this patch when MCE handler, after recovery, returns to code that
that caused the MCE may end up with Data SLB access interrupt for invalid
address followed by kernel panic or hang.

[   62.878965] Severe Machine check interrupt [Recovered]
[   62.878968]   NIP [d00000000ca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[   62.878969]   Initiator: CPU
[   62.878970]   Error type: SLB [Multihit]
[   62.878971]     Effective address: d00000000ca70000
cpu 0xa: Vector: 380 (Data SLB Access) at [c0000000fc7775b0]
    pc: c0000000009694c0: vsnprintf+0x80/0x480
    lr: c0000000009698e0: vscnprintf+0x20/0x60
    sp: c0000000fc777830
   msr: 8000000002009033
   dar: a803a30c000000d0
  current = 0xc00000000bc9ef00
  paca    = 0xc00000001eca5c00	 softe: 3	 irq_happened: 0x01
    pid   = 8860, comm = insmod
[c0000000fc7778b0] c0000000009698e0 vscnprintf+0x20/0x60
[c0000000fc7778e0] c00000000016b6c4 vprintk_emit+0xb4/0x4b0
[c0000000fc777960] c00000000016d40c vprintk_func+0x5c/0xd0
[c0000000fc777980] c00000000016cbb4 printk+0x38/0x4c
[c0000000fc7779a0] d00000000ca301c0 init_module+0x1c0/0x338 [bork_kernel]
[c0000000fc777a40] c00000000000d9c4 do_one_initcall+0x54/0x230
[c0000000fc777b00] c0000000001b3b74 do_init_module+0x8c/0x248
[c0000000fc777b90] c0000000001b2478 load_module+0x12b8/0x15b0
[c0000000fc777d30] c0000000001b29e8 sys_finit_module+0xa8/0x110
[c0000000fc777e30] c00000000000b204 system_call+0x58/0x6c
--- Exception: c00 (System Call) at 00007fff8bda0644
SP (7fffdfbfe980) is in userspace

This patch fixes this issue.

Fixes: a08a53ea4c97 ("powerpc/le: Enable RTAS events support")
Cc: stable@vger.kernel.org
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/ras.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 14a46b07ab2f..851ce326874a 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -367,7 +367,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
 	}
 
 	savep = __va(regs->gpr[3]);
-	regs->gpr[3] = savep[0];	/* restore original r3 */
+	regs->gpr[3] = be64_to_cpu(savep[0]);	/* restore original r3 */
 
 	h = (struct rtas_error_log *)&savep[1];
 	/* Use the per cpu buffer from paca to store rtas error log */

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox