LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: Fwd: [powerpc/Baremetal]Kernel OOPS while executing memory hotplug on Power8 baremetal
From: Jens Axboe @ 2018-06-07 15:40 UTC (permalink / raw)
  To: vrbagal1, Bart Van Assche, kent.overstreet, snitzer, linux-block
  Cc: linux-scsi, linuxppc-dev, sachinp, Linuxppc-dev
In-Reply-To: <b1b67ad6-b300-7def-5851-b11baa6edf97@kernel.dk>

On 6/7/18 8:45 AM, Jens Axboe wrote:
> On 6/7/18 4:37 AM, vrbagal1 wrote:
>> On 2018-06-07 13:12, Bart Van Assche wrote:
>>> On Thu, 2018-06-07 at 12:56 +0530, Venkat Rao B wrote:
>>>> On Thursday 07 June 2018 12:46 PM, Bart Van Assche wrote:
>>>>> On Thu, 2018-06-07 at 12:38 +0530, vrbagal1 wrote:
>>>>>> Observing Kernel oops and machine reboots while executing memory hotplug
>>>>>> test case, on Power8 Baremetal machine.
>>>>>>
>>>>>> I see this is introduced some where between rc6 and 4.17.
>>>>>
>>>>> Please provide the exact versions (git commit IDs) of the kernel versions
>>>>> you have tested.
>>>>
>>>> Commit Id ---> 5037be168f
>>>
>>> The reason I was asking for the commit ID is because I saw that 
>>> clone_endio()
>>> occurs in the oops which means that the dm driver is involved. An 
>>> important fix
>>> for the dm driver went upstream recently, namely d37753540568 ("dm: Use 
>>> kzalloc
>>> for all structs with embedded biosets/mempools"). Can you double check 
>>> whether
>>> that commit it present in your tree? If it is not present, please 
>>> update to the
>>> latest master and retest. If it is present, please report how to 
>>> reproduce
>>> this oops to Kent Overstreet, Jens Axboe, linux-block and Mike Snitzer.
>>>
>>> Thanks,
>>>
>>> Bart.
>>
>>
>> Yes, the fix is present in the tree, which I have tested.
>>
>> Steps to reproduce:
>>
>> Step1: Clone and Install avocado git clone 
>> https://github.com/avocado-framework/avocado.git
>> Step2: Clone 
>> https://github.com/avocado-framework-tests/avocado-misc-tests.git
>>         Test case is 
>> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/memhotplug.py
>> Step3: Command to run the test is avocado run 
>> avocado-misc-tests/memory/memhotplug.py
> 
> Can you try with the below? Not a fully formed fix since I'd prefer
> if the dm bioset copy stuff was changed instead, but worth a shot.

This is closer to an actual fix, please try that instead.


diff --git a/block/bio.c b/block/bio.c
index 595663e0281a..0616d86b15c6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1967,6 +1967,21 @@ int bioset_init(struct bio_set *bs,
 }
 EXPORT_SYMBOL(bioset_init);
 
+int bioset_init_from_src(struct bio_set *new, struct bio_set *src)
+{
+	unsigned int pool_size = src->bio_pool.min_nr;
+	int flags;
+
+	flags = 0;
+	if (src->bvec_pool.min_nr)
+		flags |= BIOSET_NEED_BVECS;
+	if (src->rescue_workqueue)
+		flags |= BIOSET_NEED_RESCUER;
+
+	return bioset_init(new, pool_size, src->front_pad, flags);
+}
+EXPORT_SYMBOL(bioset_init_from_src);
+
 #ifdef CONFIG_BLK_CGROUP
 
 /**
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 98dff36b89a3..20a8d63754bf 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1953,9 +1953,10 @@ static void free_dev(struct mapped_device *md)
 	kvfree(md);
 }
 
-static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
+static int __bind_mempools(struct mapped_device *md, struct dm_table *t)
 {
 	struct dm_md_mempools *p = dm_table_get_md_mempools(t);
+	int ret = 0;
 
 	if (dm_table_bio_based(t)) {
 		/*
@@ -1982,13 +1983,16 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
 	       bioset_initialized(&md->bs) ||
 	       bioset_initialized(&md->io_bs));
 
-	md->bs = p->bs;
-	memset(&p->bs, 0, sizeof(p->bs));
-	md->io_bs = p->io_bs;
-	memset(&p->io_bs, 0, sizeof(p->io_bs));
+	ret = bioset_init_from_src(&md->bs, &p->bs);
+	if (ret)
+		goto out;
+	ret = bioset_init_from_src(&md->io_bs, &p->io_bs);
+	if (ret)
+		bioset_exit(&md->bs);
 out:
 	/* mempool bind completed, no longer need any mempools in the table */
 	dm_table_free_md_mempools(t);
+	return ret;
 }
 
 /*
@@ -2033,6 +2037,7 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 	struct request_queue *q = md->queue;
 	bool request_based = dm_table_request_based(t);
 	sector_t size;
+	int ret;
 
 	lockdep_assert_held(&md->suspend_lock);
 
@@ -2068,7 +2073,11 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 		md->immutable_target = dm_table_get_immutable_target(t);
 	}
 
-	__bind_mempools(md, t);
+	ret = __bind_mempools(md, t);
+	if (ret) {
+		old_map = ERR_PTR(ret);
+		goto out;
+	}
 
 	old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
 	rcu_assign_pointer(md->map, (void *)t);
@@ -2078,6 +2087,7 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 	if (old_map)
 		dm_sync_table(md);
 
+out:
 	return old_map;
 }
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 810a8bee8f85..307682ac2f31 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -417,6 +417,7 @@ enum {
 extern int bioset_init(struct bio_set *, unsigned int, unsigned int, int flags);
 extern void bioset_exit(struct bio_set *);
 extern int biovec_init_pool(mempool_t *pool, int pool_entries);
+extern int bioset_init_from_src(struct bio_set *new, struct bio_set *src);
 
 extern struct bio *bio_alloc_bioset(gfp_t, unsigned int, struct bio_set *);
 extern void bio_put(struct bio *);

-- 
Jens Axboe

^ permalink raw reply related

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-06-07 16:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anshuman Khandual, Ram Pai, robh, aik, jasowang, linux-kernel,
	virtualization, joe, linuxppc-dev, elfring, david, cohuck,
	pawel.moll, Tom Lendacky, Rustad, Mark D
In-Reply-To: <20180607052306.GA1532@infradead.org>

On Wed, Jun 06, 2018 at 10:23:06PM -0700, Christoph Hellwig wrote:
> On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> > Pls work on a long term solution. Short term needs can be served by
> > enabling the iommu platform in qemu.
> 
> So, I spent some time looking at converting virtio to dma ops overrides,
> and the current virtio spec, and the sad through I have to tell is that
> both the spec and the Linux implementation are complete and utterly fucked
> up.

Let me restate it: DMA API has support for a wide range of hardware, and
hardware based virtio implementations likely won't benefit from all of
it.

And given virtio right now is optimized for specific workloads, improving
portability without regressing performance isn't easy.

I think it's unsurprising since it started a strictly a guest/host
mechanism.  People did implement offloads on specific platforms though,
and they are known to work. To improve portability even further,
we might need to make spec and code changes.

I'm not really sympathetic to people complaining that they can't even
set a flag in qemu though. If that's the case the stack in question is
way too inflexible.



> Both in the flag naming and the implementation there is an implication
> of DMA API == IOMMU, which is fundamentally wrong.

Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.

It's possible that some setups will benefit from a more
fine-grained approach where some aspects of the DMA
API are bypassed, others aren't.

This seems to be what was being asked for in this thread,
with comments claiming IOMMU flag adds too much overhead.


> The DMA API does a few different things:
> 
>  a) address translation
> 
> 	This does include IOMMUs.  But it also includes random offsets
> 	between PCI bars and system memory that we see on various
> 	platforms.

I don't think you mean bars. That's unrelated to DMA.

>  Worse so some of these offsets might be based on
> 	banks, e.g. on the broadcom bmips platform.  It also deals
> 	with bitmask in physical addresses related to memory encryption
> 	like AMD SEV.  I'd be really curious how for example the
> 	Intel virtio based NIC is going to work on any of those
> 	plaforms.

SEV guys report that they just set the iommu flag and then it all works.
I guess if there's translation we can think of this as a kind of iommu.
Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?

And apparently some people complain that just setting that flag makes
qemu check translation on each access with an unacceptable performance
overhead.  Forcing same behaviour for everyone on general principles
even without the flag is unlikely to make them happy.

>   b) coherency
> 
> 	On many architectures DMA is not cache coherent, and we need
> 	to invalidate and/or write back cache lines before doing
> 	DMA.  Again, I wonder how this is every going to work with
> 	hardware based virtio implementations.


You mean dma_Xmb and friends?
There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
for that.


>  Even worse I think this
> 	is actually broken at least for VIVT event for virtualized
> 	implementations.  E.g. a KVM guest is going to access memory
> 	using different virtual addresses than qemu, vhost might throw
> 	in another different address space.

I don't really know what VIVT is. Could you help me please?

>   c) bounce buffering
> 
> 	Many DMA implementations can not address all physical memory
> 	due to addressing limitations.  In such cases we copy the
> 	DMA memory into a known addressable bounc buffer and DMA
> 	from there.

Don't do it then?


>   d) flushing write combining buffers or similar
> 
> 	On some hardware platforms we need workarounds to e.g. read
> 	from a certain mmio address to make sure DMA can actually
> 	see memory written by the host.

I guess it isn't an issue as long as WC isn't actually used.
It will become an issue when virtio spec adds some WC capability -
I suspect we can ignore this for now.

> 
> All of this is bypassed by virtio by default despite generally being
> platform issues, not particular to a given device.

It's both a device and a platform issue. A PV device is often more like
another CPU than like a PCI device.



-- 
MST

^ permalink raw reply

* Re: [v2 PATCH 0/5] powerpc/pseries: Machien check handler improvements.
From: Mahesh Jagannath Salgaonkar @ 2018-06-07 16:32 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev, Laurent Dufour, Aneesh Kumar K.V
In-Reply-To: <20180607204554.05565a56@roar.ozlabs.ibm.com>

On 06/07/2018 04:15 PM, Nicholas Piggin wrote:
> On Thu, 07 Jun 2018 15:36:25 +0530
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
> 
>> This patch series includes some improvement to Machine check handler
>> for pseries. Patch 1 fixes an issue where machine check handler crashes
>> kernel while accessing vmalloc-ed buffer while in nmi context.
>> Patch 3 dumps the SLB contents on SLB MCE errors to improve the debugability.
>> Patch 4 display's the MCE error details on console.
>>
>> Change in V2:
>> - patch 4: Display additional info (NIP and task info) in MCE error details.
>> - patch 5: Fix endain bug while restoring of r3 in MCE handler.
>>
>> ---
>>
>> Mahesh Salgaonkar (5):
>>       powerpc/pseries: convert rtas_log_buf to linear allocation.
>>       powerpc/pseries: Define MCE error event section.
>>       powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.
>>       powerpc/pseries: Display machine check error details.
>>       powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.
> 
> These look good, should patch 5 be moved to patch 2 and the first 2
> patches marked for stable?

Yup. Will move patch 5 to 2nd position.

> 
> Do you also plan to dump SLB contents for bare metal MCEs?

Yes. That's the plan. Will do that separately.

Thanks,
-Mahesh.

^ permalink raw reply

* Re: [RFC PATCH -tip v5 18/27] powerpc/kprobes: Don't call the ->break_handler() in arm kprobes code
From: Naveen N. Rao @ 2018-06-07 16:37 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrew Morton, H . Peter Anvin, linux-arch, linux-kernel,
	linuxppc-dev, Ingo Molnar, Ingo Molnar, Paul Mackerras,
	Steven Rostedt, Thomas Gleixner
In-Reply-To: <20180607232802.c5fcea960e94ef2f3cd4cde8@kernel.org>

Masami Hiramatsu wrote:
> On Thu, 07 Jun 2018 17:07:00 +0530
> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> wrote:
>=20
>> Masami Hiramatsu wrote:
>> > Don't call the ->break_handler() from the arm kprobes code,
>> 					    ^^^ powerpc
>>=20
>> > because it was only used by jprobes which got removed.
>> >=20
>> > This also makes skip_singlestep() a static function since
>> > only ftrace-kprobe.c is using this function.
>> >=20
>> > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
>> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> > Cc: Paul Mackerras <paulus@samba.org>
>> > Cc: Michael Ellerman <mpe@ellerman.id.au>
>> > Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
>> > Cc: linuxppc-dev@lists.ozlabs.org
>> > ---
>> >  arch/powerpc/include/asm/kprobes.h   |   10 ----------
>> >  arch/powerpc/kernel/kprobes-ftrace.c |   16 +++-------------
>> >  arch/powerpc/kernel/kprobes.c        |   31 +++++++++++--------------=
------
>> >  3 files changed, 14 insertions(+), 43 deletions(-)
>>=20
>> With 2 small comments...
>=20
> 2 ? or 1 ?

Two, with one in the commit log above :)

- Naveen

=

^ permalink raw reply

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple
In-Reply-To: <20180607084420.29513-1-aik@ozlabs.ru>

On Thu,  7 Jun 2018 18:44:15 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Here is an rfc of some patches adding psaa-through support
> for NVIDIA V100 GPU found in some POWER9 boxes.
> 
> The example P9 system has 6 GPUs, each accompanied with 2 bridges
> representing the hardware links (aka NVLink2):
> 
>  4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
>  4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 
> ^^ the number is an IOMMU group ID.

Can we back up and discuss whether the IOMMU grouping of NVLink
connected devices makes sense?  AIUI we have a PCI view of these
devices and from that perspective they're isolated.  That's the view of
the device used to generate the grouping.  However, not visible to us,
these devices are interconnected via NVLink.  What isolation properties
does NVLink provide given that its entire purpose for existing seems to
be to provide a high performance link for p2p between devices?
 
> Each bridge represents an additional hardware interface called "NVLink2",
> it is not a PCI link but separate but. The design inherits from original
> NVLink from POWER8.
> 
> The new feature of V100 is 16GB of cache coherent memory on GPU board.
> This memory is presented to the host via the device tree and remains offline
> until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> bridges above) and the nvidia-persistenced daemon then onlines it.
> The memory remains online as long as nvidia-persistenced is running, when
> it stops, it offlines the memory.
> 
> The amount of GPUs suggest passing them through to a guest. However,
> in order to do so we cannot use the NVIDIA driver so we have a host with
> a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> with no page structs backing this window and we cannot touch this memory
> before the NVIDIA driver configures it in a host or a guest as
> HMI (hardware management interrupt?) occurs.

Having a lot of GPUs only suggests assignment to a guest if there's
actually isolation provided between those GPUs.  Otherwise we'd need to
assign them as one big group, which gets a lot less useful.  Thanks,

Alex

> On the example system the GPU RAM windows are located at:
> 0x0400 0000 0000
> 0x0420 0000 0000
> 0x0440 0000 0000
> 0x2400 0000 0000
> 0x2420 0000 0000
> 0x2440 0000 0000
> 
> So the complications are:
> 
> 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> to VFIO-to-userspace or guest-to-host-physical translations till
> the driver trains it (i.e. nvidia-persistenced has started), otherwise
> prefetching happens and HMI occurs; I am trying to get this changed
> somehow;
> 
> 2. since it appears as normal cache coherent memory, it will be used
> for DMA which means it has to be pinned and mapped in the host. Having
> no page structs makes it different from the usual case - we only need
> translate user addresses to host physical and map GPU RAM memory but
> pinning is not required.
> 
> This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> register this memory as a KVM memory slot and present memory nodes to
> the guest. Unless NVIDIA provides an userspace driver, this is no use
> for things like DPDK.
> 
> 
> There is another problem which the series does not address but worth
> mentioning - it is not strictly necessary to map GPU RAM to the guest
> exactly where it is in the host (I tested this to some extent), we still
> might want to represent the memory at the same offset as on the host
> which increases the size of a TCE table needed to cover such a huge
> window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
> I am addressing this in a separate patchset by allocating indirect TCE
> levels on demand and using 16MB IOMMU pages in the guest as we can now
> back emulated pages with the smaller hardware ones.
> 
> 
> This is an RFC. Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (5):
>   vfio/spapr_tce: Simplify page contained test
>   powerpc/iommu_context: Change referencing in API
>   powerpc/iommu: Do not pin memory of a memory device
>   vfio_pci: Allow mapping extra regions
>   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> 
>  drivers/vfio/pci/Makefile              |   1 +
>  arch/powerpc/include/asm/mmu_context.h |   5 +-
>  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
>  include/uapi/linux/vfio.h              |   3 +
>  arch/powerpc/kernel/iommu.c            |   8 +-
>  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
>  drivers/vfio/pci/vfio_pci.c            |  19 +++-
>  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
>  drivers/vfio/pci/Kconfig               |   4 +
>  10 files changed, 319 insertions(+), 34 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 

^ permalink raw reply

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple
In-Reply-To: <20180607084420.29513-6-aik@ozlabs.ru>

On Thu,  7 Jun 2018 18:44:20 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Some POWER9 chips come with special NVLink2 links which provide
> cacheable memory access to the RAM physically located on NVIDIA GPU.
> This memory is presented to a host via the device tree but remains
> offline until the NVIDIA driver onlines it.
> 
> This exports this RAM to the userspace as a new region so
> the NVIDIA driver in the guest can train these links and online GPU RAM.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/Makefile           |   1 +
>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>  include/uapi/linux/vfio.h           |   3 +
>  drivers/vfio/pci/vfio_pci.c         |   9 ++
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/Kconfig            |   4 +
>  6 files changed, 215 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..7115b9b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>  	return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82..2fe8227 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/* NVIDIA GPU NV2 */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)

You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
subtype.  Each vendor has their own address space of sub-types.

> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7bddf1e..38c9475 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		}
>  	}
>  
> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> +	    pdev->device == 0x1db1 &&
> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {

Can't we do better than check this based on device ID?  Perhaps PCIe
capability hints at this?

Is it worthwhile to continue with assigning the device in the !ENABLED
case?  For instance, maybe it would be better to provide a weak
definition of vfio_pci_nvlink2_init() that would cause us to fail here
if we don't have this device specific support enabled.  I realize
you're following the example set forth for IGD, but those regions are
optional, for better or worse.

> +		ret = vfio_pci_nvlink2_init(vdev);
> +		if (ret)
> +			dev_warn(&vdev->pdev->dev,
> +				 "Failed to setup NVIDIA NV2 RAM region\n");
> +	}
> +
>  	vfio_pci_probe_mmaps(vdev);
>  
>  	return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 0000000..451c5cb
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.
> + *
> + * Derived from original vfio_pci_igd.c:
> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
> + *	Author: Alex Williamson <alex.williamson@redhat.com>
> + */
> +
> +#include <linux/io.h>
> +#include <linux/pci.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
> +
> +#include "vfio_pci_private.h"
> +
> +struct vfio_pci_nvlink2_data {
> +	unsigned long gpu_hpa;
> +	unsigned long useraddr;
> +	unsigned long size;
> +	struct mm_struct *mm;
> +	struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	void *base = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	if (iswrite) {
> +		if (copy_from_user(base + pos, buf, count))
> +			return -EFAULT;
> +	} else {
> +		if (copy_to_user(buf, base + pos, count))
> +			return -EFAULT;
> +	}
> +	*ppos += count;
> +
> +	return count;
> +}
> +
> +static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region)
> +{
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	long ret;
> +
> +	ret = mm_iommu_put(data->mm, data->mem);
> +	WARN_ON(ret);
> +
> +	mmdrop(data->mm);
> +	kfree(data);
> +}
> +
> +static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_region *region = vma->vm_private_data;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	int ret;
> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
> +	unsigned long vm_pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
> +
> +	ret = vm_insert_pfn(vma, vmf->address, pfn);
> +	/* TODO: make it a tracepoint */
> +	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
> +		 vmf->address, pfn << PAGE_SHIFT, ret);
> +	if (ret)
> +		return VM_FAULT_SIGSEGV;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
> +	.fault = vfio_pci_nvlink2_mmap_fault,
> +};
> +
> +static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
> +{
> +	long ret;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +
> +	if (data->useraddr)
> +		return -EPERM;
> +
> +	if (vma->vm_end - vma->vm_start > data->size)
> +		return -EINVAL;
> +
> +	vma->vm_private_data = region;
> +	vma->vm_flags |= VM_PFNMAP;
> +	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
> +
> +	/*
> +	 * Calling mm_iommu_newdev() here once as the region is not
> +	 * registered yet and therefore right initialization will happen now.
> +	 * Other places will use mm_iommu_find() which returns
> +	 * registered @mem and does not go gup().
> +	 */
> +	data->useraddr = vma->vm_start;
> +	data->mm = current->mm;
> +	atomic_inc(&data->mm->mm_count);
> +	ret = mm_iommu_newdev(data->mm, data->useraddr,
> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
> +			data->gpu_hpa, &data->mem);
> +
> +	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
> +			data->useraddr, data->gpu_hpa,
> +			vma->vm_end - vma->vm_start, ret);
> +
> +	return ret;
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
> +	.rw = vfio_pci_nvlink2_rw,
> +	.release = vfio_pci_nvlink2_release,
> +	.mmap = vfio_pci_nvlink2_mmap,
> +};
> +
> +int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	int len = 0, ret;
> +	struct device_node *npu_node, *mem_node;
> +	struct pci_dev *npu_dev;
> +	uint32_t *mem_phandle, *val;
> +	struct vfio_pci_nvlink2_data *data;
> +
> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
> +	if (!npu_dev)
> +		return -EINVAL;
> +
> +	npu_node = pci_device_to_OF_node(npu_dev);
> +	if (!npu_node)
> +		return -EINVAL;
> +
> +	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
> +	if (!mem_phandle)
> +		return -EINVAL;
> +
> +	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
> +	if (!mem_node)
> +		return -EINVAL;
> +
> +	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
> +	if (!val || len != 2 * sizeof(uint64_t))
> +		return -EINVAL;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
> +			be32_to_cpu(val[1]);
> +	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
> +			be32_to_cpu(val[3]);
> +
> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
> +			data->gpu_hpa + data->size - 1);
> +
> +	ret = vfio_pci_register_dev_region(vdev,
> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
> +			&vfio_pci_nvlink2_regops, data->size,
> +			VFIO_REGION_INFO_FLAG_READ, data);
> +	if (ret)
> +		kfree(data);
> +
> +	return ret;
> +}
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 24ee260..2725bc8 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>  config VFIO_PCI_IGD
>  	depends on VFIO_PCI
>  	def_bool y if X86
> +
> +config VFIO_PCI_NVLINK2
> +	depends on VFIO_PCI
> +	def_bool y if PPC_POWERNV

As written, this also depends on PPC_POWERNV (or at least TCE), it's not
a portable implementation that we could re-use on X86 or ARM or any
other platform if hardware appeared for it.  Can we improve that as
well to make this less POWER specific?  Thanks,

Alex

^ permalink raw reply

* Re: [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple
In-Reply-To: <20180607084420.29513-5-aik@ozlabs.ru>

On Thu,  7 Jun 2018 18:44:19 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

What's an "extra region", -ENOCOMMITLOG

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/vfio_pci_private.h |  3 +++
>  drivers/vfio/pci/vfio_pci.c         | 10 ++++++++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
>  		      size_t count, loff_t *ppos, bool iswrite);
>  	void	(*release)(struct vfio_pci_device *vdev,
>  			   struct vfio_pci_region *region);
> +	int	(*mmap)(struct vfio_pci_device *vdev,
> +			struct vfio_pci_region *region,
> +			struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 3729937..7bddf1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1123,10 +1123,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  		return -EINVAL;
>  	if ((vma->vm_flags & VM_SHARED) == 0)
>  		return -EINVAL;
> +	if (index >= VFIO_PCI_NUM_REGIONS) {
> +		int regnum = index - VFIO_PCI_NUM_REGIONS;
> +		struct vfio_pci_region *region = vdev->region + regnum;
> +
> +		if (region && region->ops && region->ops->mmap)
> +			return region->ops->mmap(vdev, region, vma);
> +		return -EINVAL;
> +	}
>  	if (index >= VFIO_PCI_ROM_REGION_INDEX)
>  		return -EINVAL;
> -	if (!vdev->bar_mmap_supported[index])
> -		return -EINVAL;

This seems unrelated.  Thanks,

Alex
  
>  	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>  	req_len = vma->vm_end - vma->vm_start;

^ permalink raw reply

* [v3 PATCH 0/5] powerpc/pseries: Machien check handler improvements.
From: Mahesh J Salgaonkar @ 2018-06-07 17:27 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Michael Ellerman, stable, Aneesh Kumar K.V, Aneesh Kumar K.V,
	Michael Ellerman, Laurent Dufour, Nicholas Piggin

This patch series includes some improvement to Machine check handler
for pseries. Patch 1 fixes an issue where machine check handler crashes
kernel while accessing vmalloc-ed buffer while in nmi context.
Patch 2 fixes endain bug while restoring of r3 in MCE handler.
Patch 4 dumps the SLB contents on SLB MCE errors to improve the debugability.
Patch 5 display's the MCE error details on console.

CHange in V3:
- Moved patch 5 to patch 2

Change in V2:
- patch 3: Display additional info (NIP and task info) in MCE error details.
- patch 5: Fix endain bug while restoring of r3 in MCE handler.

---

Mahesh Salgaonkar (5):
      powerpc/pseries: convert rtas_log_buf to linear allocation.
      powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.
      powerpc/pseries: Define MCE error event section.
      powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.
      powerpc/pseries: Display machine check error details.


 arch/powerpc/include/asm/book3s/64/mmu-hash.h |    1 
 arch/powerpc/include/asm/rtas.h               |  109 ++++++++++++++++++
 arch/powerpc/kernel/rtasd.c                   |    2 
 arch/powerpc/mm/slb.c                         |   35 ++++++
 arch/powerpc/platforms/pseries/ras.c          |  155 +++++++++++++++++++++++++
 5 files changed, 299 insertions(+), 3 deletions(-)

--
Signature

^ permalink raw reply

* [v3 PATCH 1/5] powerpc/pseries: convert rtas_log_buf to linear allocation.
From: Mahesh J Salgaonkar @ 2018-06-07 17:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: stable, Aneesh Kumar K.V, Aneesh Kumar K.V, Michael Ellerman,
	Laurent Dufour, Nicholas Piggin
In-Reply-To: <152839244928.25118.15100234720683911223.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

rtas_log_buf is a buffer to hold RTAS event data that are communicated
to kernel by hypervisor. This buffer is then used to pass RTAS event
data to user through proc fs. This buffer is allocated from vmalloc
(non-linear mapping) area.

On Machine check interrupt, register r3 points to RTAS extended event
log passed by hypervisor that contains the MCE event. The pseries
machine check handler then logs this error into rtas_log_buf. The
rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
page fault (vector 0x300) while accessing it. Since machine check
interrupt handler runs in NMI context we can not afford to take any
page fault. Page faults are not honored in NMI context and causes
kernel panic. This patch fixes this issue by allocating rtas_log_buf
using kmalloc.

Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
Cc: stable@vger.kernel.org
Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/rtasd.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
index f915db93cd42..3957d4ae2ba2 100644
--- a/arch/powerpc/kernel/rtasd.c
+++ b/arch/powerpc/kernel/rtasd.c
@@ -559,7 +559,7 @@ static int __init rtas_event_scan_init(void)
 	rtas_error_log_max = rtas_get_error_log_max();
 	rtas_error_log_buffer_max = rtas_error_log_max + sizeof(int);
 
-	rtas_log_buf = vmalloc(rtas_error_log_buffer_max*LOG_NUMBER);
+	rtas_log_buf = kmalloc(rtas_error_log_buffer_max*LOG_NUMBER, GFP_KERNEL);
 	if (!rtas_log_buf) {
 		printk(KERN_ERR "rtasd: no memory\n");
 		return -ENOMEM;

^ permalink raw reply related

* [v3 PATCH 2/5] powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.
From: Mahesh J Salgaonkar @ 2018-06-07 17:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: stable, Aneesh Kumar K.V, Michael Ellerman, Laurent Dufour,
	Nicholas Piggin
In-Reply-To: <152839244928.25118.15100234720683911223.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

During Machine Check interrupt on pseries platform, register r3 points
RTAS extended event log passed by hypervisor. Since hypervisor uses r3
to pass pointer to rtas log, it stores the original r3 value at the
start of the memory (first 8 bytes) pointed by r3. Since hypervisor
stores this info and rtas log is in BE format, linux should make
sure to restore r3 value in correct endian format.

Without this patch when MCE handler, after recovery, returns to code that
that caused the MCE may end up with Data SLB access interrupt for invalid
address followed by kernel panic or hang.

[   62.878965] Severe Machine check interrupt [Recovered]
[   62.878968]   NIP [d00000000ca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[   62.878969]   Initiator: CPU
[   62.878970]   Error type: SLB [Multihit]
[   62.878971]     Effective address: d00000000ca70000
cpu 0xa: Vector: 380 (Data SLB Access) at [c0000000fc7775b0]
    pc: c0000000009694c0: vsnprintf+0x80/0x480
    lr: c0000000009698e0: vscnprintf+0x20/0x60
    sp: c0000000fc777830
   msr: 8000000002009033
   dar: a803a30c000000d0
  current = 0xc00000000bc9ef00
  paca    = 0xc00000001eca5c00	 softe: 3	 irq_happened: 0x01
    pid   = 8860, comm = insmod
[c0000000fc7778b0] c0000000009698e0 vscnprintf+0x20/0x60
[c0000000fc7778e0] c00000000016b6c4 vprintk_emit+0xb4/0x4b0
[c0000000fc777960] c00000000016d40c vprintk_func+0x5c/0xd0
[c0000000fc777980] c00000000016cbb4 printk+0x38/0x4c
[c0000000fc7779a0] d00000000ca301c0 init_module+0x1c0/0x338 [bork_kernel]
[c0000000fc777a40] c00000000000d9c4 do_one_initcall+0x54/0x230
[c0000000fc777b00] c0000000001b3b74 do_init_module+0x8c/0x248
[c0000000fc777b90] c0000000001b2478 load_module+0x12b8/0x15b0
[c0000000fc777d30] c0000000001b29e8 sys_finit_module+0xa8/0x110
[c0000000fc777e30] c00000000000b204 system_call+0x58/0x6c
--- Exception: c00 (System Call) at 00007fff8bda0644
SP (7fffdfbfe980) is in userspace

This patch fixes this issue.

Fixes: a08a53ea4c97 ("powerpc/le: Enable RTAS events support")
Cc: stable@vger.kernel.org
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/ras.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 5e1ef9150182..2edc673be137 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -360,7 +360,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
 	}
 
 	savep = __va(regs->gpr[3]);
-	regs->gpr[3] = savep[0];	/* restore original r3 */
+	regs->gpr[3] = be64_to_cpu(savep[0]);	/* restore original r3 */
 
 	/* If it isn't an extended log we can use the per cpu 64bit buffer */
 	h = (struct rtas_error_log *)&savep[1];

^ permalink raw reply related

* [v3 PATCH 3/5] powerpc/pseries: Define MCE error event section.
From: Mahesh J Salgaonkar @ 2018-06-07 17:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Aneesh Kumar K.V, Michael Ellerman, Laurent Dufour,
	Nicholas Piggin
In-Reply-To: <152839244928.25118.15100234720683911223.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

On pseries, the machine check error details are part of RTAS extended
event log passed under Machine check exception section. This patch adds
the definition of rtas MCE event section and related helper
functions.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/rtas.h |  104 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index ec9dd79398ee..3f2fba7ef23b 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -275,6 +275,7 @@ inline uint32_t rtas_ext_event_company_id(struct rtas_ext_event_log_v6 *ext_log)
 #define PSERIES_ELOG_SECT_ID_CALL_HOME		(('C' << 8) | 'H')
 #define PSERIES_ELOG_SECT_ID_USER_DEF		(('U' << 8) | 'D')
 #define PSERIES_ELOG_SECT_ID_HOTPLUG		(('H' << 8) | 'P')
+#define PSERIES_ELOG_SECT_ID_MCE		(('M' << 8) | 'C')
 
 /* Vendor specific Platform Event Log Format, Version 6, section header */
 struct pseries_errorlog {
@@ -326,6 +327,109 @@ struct pseries_hp_errorlog {
 #define PSERIES_HP_ELOG_ID_DRC_COUNT	3
 #define PSERIES_HP_ELOG_ID_DRC_IC	4
 
+/* RTAS pseries MCE errorlog section */
+#pragma pack(push, 1)
+struct pseries_mc_errorlog {
+	__be32	fru_id;
+	__be32	proc_id;
+	uint8_t	error_type;
+	union {
+		struct {
+			uint8_t	ue_err_type;
+			/* XXXXXXXX
+			 * X		1: Permanent or Transient UE.
+			 *  X		1: Effective address provided.
+			 *   X		1: Logical address provided.
+			 *    XX	2: Reserved.
+			 *      XXX	3: Type of UE error.
+			 */
+			uint8_t	reserved_1[6];
+			__be64	effective_address;
+			__be64	logical_address;
+		} ue_error;
+		struct {
+			uint8_t	soft_err_type;
+			/* XXXXXXXX
+			 * X		1: Effective address provided.
+			 *  XXXXX	5: Reserved.
+			 *       XX	2: Type of SLB/ERAT/TLB error.
+			 */
+			uint8_t	reserved_1[6];
+			__be64	effective_address;
+			uint8_t	reserved_2[8];
+		} soft_error;
+	} u;
+};
+#pragma pack(pop)
+
+/* RTAS pseries MCE error types */
+#define PSERIES_MC_ERROR_TYPE_UE		0x00
+#define PSERIES_MC_ERROR_TYPE_SLB		0x01
+#define PSERIES_MC_ERROR_TYPE_ERAT		0x02
+#define PSERIES_MC_ERROR_TYPE_TLB		0x04
+#define PSERIES_MC_ERROR_TYPE_D_CACHE		0x05
+#define PSERIES_MC_ERROR_TYPE_I_CACHE		0x07
+
+/* RTAS pseries MCE error sub types */
+#define PSERIES_MC_ERROR_UE_INDETERMINATE		0
+#define PSERIES_MC_ERROR_UE_IFETCH			1
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_IFETCH	2
+#define PSERIES_MC_ERROR_UE_LOAD_STORE			3
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_LOAD_STORE	4
+
+#define PSERIES_MC_ERROR_SLB_PARITY		0
+#define PSERIES_MC_ERROR_SLB_MULTIHIT		1
+#define PSERIES_MC_ERROR_SLB_INDETERMINATE	2
+
+#define PSERIES_MC_ERROR_ERAT_PARITY		1
+#define PSERIES_MC_ERROR_ERAT_MULTIHIT		2
+#define PSERIES_MC_ERROR_ERAT_INDETERMINATE	3
+
+#define PSERIES_MC_ERROR_TLB_PARITY		1
+#define PSERIES_MC_ERROR_TLB_MULTIHIT		2
+#define PSERIES_MC_ERROR_TLB_INDETERMINATE	3
+
+static inline uint8_t rtas_mc_error_type(const struct pseries_mc_errorlog *mlog)
+{
+	return mlog->error_type;
+}
+
+static inline uint8_t rtas_mc_error_sub_type(
+					const struct pseries_mc_errorlog *mlog)
+{
+	switch (mlog->error_type) {
+	case	PSERIES_MC_ERROR_TYPE_UE:
+		return (mlog->u.ue_error.ue_err_type & 0x07);
+	case	PSERIES_MC_ERROR_TYPE_SLB:
+	case	PSERIES_MC_ERROR_TYPE_ERAT:
+	case	PSERIES_MC_ERROR_TYPE_TLB:
+		return (mlog->u.soft_error.soft_err_type & 0x03);
+	default:
+		return 0;
+	}
+}
+
+static inline uint64_t rtas_mc_get_effective_addr(
+					const struct pseries_mc_errorlog *mlog)
+{
+	uint64_t addr = 0;
+
+	switch (mlog->error_type) {
+	case	PSERIES_MC_ERROR_TYPE_UE:
+		if (mlog->u.ue_error.ue_err_type & 0x40)
+			addr = mlog->u.ue_error.effective_address;
+		break;
+	case	PSERIES_MC_ERROR_TYPE_SLB:
+	case	PSERIES_MC_ERROR_TYPE_ERAT:
+	case	PSERIES_MC_ERROR_TYPE_TLB:
+		if (mlog->u.soft_error.soft_err_type & 0x80)
+			addr = mlog->u.soft_error.effective_address;
+	default:
+		break;
+	}
+	return be64_to_cpu(addr);
+}
+
 struct pseries_errorlog *get_pseries_errorlog(struct rtas_error_log *log,
 					      uint16_t section_id);
 

^ permalink raw reply related

* [v3 PATCH 4/5] powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.
From: Mahesh J Salgaonkar @ 2018-06-07 17:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Aneesh Kumar K.V, Michael Ellerman, Aneesh Kumar K.V,
	Michael Ellerman, Laurent Dufour, Nicholas Piggin
In-Reply-To: <152839244928.25118.15100234720683911223.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

If we get a machine check exceptions due to SLB errors then dump the
current SLB contents which will be very much helpful in debugging the
root cause of SLB errors. On pseries, as of today system crashes on SLB
errors. These are soft errors and can be fixed by flushing the SLBs so
the kernel can continue to function instead of system crash. This patch
fixes that also.

With this patch the console will log SLB contents like below on SLB MCE
errors:

[  822.711728] slb contents:
[  822.711730] 00 c000000008000000 400ea1b217000500
[  822.711731]   1T  ESID=   c00000  VSID=      ea1b217 LLP:100
[  822.711732] 01 d000000008000000 400d43642f000510
[  822.711733]   1T  ESID=   d00000  VSID=      d43642f LLP:110
[  822.711734] 09 f000000008000000 400a86c85f000500
[  822.711736]   1T  ESID=   f00000  VSID=      a86c85f LLP:100
[  822.711737] 10 00007f0008000000 400d1f26e3000d90
[  822.711738]   1T  ESID=       7f  VSID=      d1f26e3 LLP:110
[  822.711739] 11 0000000018000000 000e3615f520fd90
[  822.711740]  256M ESID=        1  VSID=   e3615f520f LLP:110
[  822.711740] 12 d000000008000000 400d43642f000510
[  822.711741]   1T  ESID=   d00000  VSID=      d43642f LLP:110
[  822.711742] 13 d000000008000000 400d43642f000510
[  822.711743]   1T  ESID=   d00000  VSID=      d43642f LLP:110


Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |    1 +
 arch/powerpc/mm/slb.c                         |   35 +++++++++++++++++++++++++
 arch/powerpc/platforms/pseries/ras.c          |   29 ++++++++++++++++++++-
 3 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 50ed64fba4ae..c0da68927235 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -487,6 +487,7 @@ extern void hpte_init_native(void);
 
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
+extern void slb_dump_contents(void);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 66577cc66dc9..799aa117cec3 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -145,6 +145,41 @@ void slb_flush_and_rebolt(void)
 	get_paca()->slb_cache_ptr = 0;
 }
 
+void slb_dump_contents(void)
+{
+	int i;
+	unsigned long e, v;
+	unsigned long llp;
+
+	pr_err("slb contents:\n");
+	for (i = 0; i < mmu_slb_size; i++) {
+		asm volatile("slbmfee  %0,%1" : "=r" (e) : "r" (i));
+		asm volatile("slbmfev  %0,%1" : "=r" (v) : "r" (i));
+
+		if (!e && !v)
+			continue;
+
+		pr_err("%02d %016lx %016lx", i, e, v);
+
+		if (!(e & SLB_ESID_V)) {
+			pr_err("\n");
+			continue;
+		}
+		llp = v & SLB_VSID_LLP;
+		if (v & SLB_VSID_B_1T) {
+			pr_err("  1T  ESID=%9lx  VSID=%13lx LLP:%3lx\n",
+				GET_ESID_1T(e),
+				(v & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T,
+				llp);
+		} else {
+			pr_err(" 256M ESID=%9lx  VSID=%13lx LLP:%3lx\n",
+				GET_ESID(e),
+				(v & ~SLB_VSID_B) >> SLB_VSID_SHIFT,
+				llp);
+		}
+	}
+}
+
 void slb_vmalloc_update(void)
 {
 	unsigned long vflags;
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 2edc673be137..e56759d92356 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -422,6 +422,31 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
 	return 0; /* need to perform reset */
 }
 
+static int mce_handle_error(struct rtas_error_log *errp)
+{
+	struct pseries_errorlog *pseries_log;
+	struct pseries_mc_errorlog *mce_log;
+	int disposition = rtas_error_disposition(errp);
+	uint8_t error_type;
+
+	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+	if (pseries_log == NULL)
+		goto out;
+
+	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+	error_type = rtas_mc_error_type(mce_log);
+
+	if ((disposition == RTAS_DISP_NOT_RECOVERED) &&
+			(error_type == PSERIES_MC_ERROR_TYPE_SLB)) {
+		slb_dump_contents();
+		slb_flush_and_rebolt();
+		disposition = RTAS_DISP_FULLY_RECOVERED;
+	}
+
+out:
+	return disposition;
+}
+
 /*
  * See if we can recover from a machine check exception.
  * This is only called on power4 (or above) and only via
@@ -434,7 +459,9 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
 static int recover_mce(struct pt_regs *regs, struct rtas_error_log *err)
 {
 	int recovered = 0;
-	int disposition = rtas_error_disposition(err);
+	int disposition;
+
+	disposition = mce_handle_error(err);
 
 	if (!(regs->msr & MSR_RI)) {
 		/* If MSR_RI isn't set, we cannot recover */

^ permalink raw reply related

* [v3 PATCH 5/5] powerpc/pseries: Display machine check error details.
From: Mahesh J Salgaonkar @ 2018-06-07 17:29 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Aneesh Kumar K.V, Michael Ellerman, Laurent Dufour,
	Nicholas Piggin
In-Reply-To: <152839244928.25118.15100234720683911223.stgit@jupiter.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Extract the MCE error details from RTAS extended log and display it to
console.

With this patch you should now see mce logs like below:

[  142.371818] Severe Machine check interrupt [Recovered]
[  142.371822]   NIP [d00000000ca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[  142.371822]   Initiator: CPU
[  142.371823]   Error type: SLB [Multihit]
[  142.371824]     Effective address: d00000000ca70000

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/rtas.h      |    5 +
 arch/powerpc/platforms/pseries/ras.c |  128 +++++++++++++++++++++++++++++++++-
 2 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 3f2fba7ef23b..8100a95c133a 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -190,6 +190,11 @@ static inline uint8_t rtas_error_extended(const struct rtas_error_log *elog)
 	return (elog->byte1 & 0x04) >> 2;
 }
 
+static inline uint8_t rtas_error_initiator(const struct rtas_error_log *elog)
+{
+	return (elog->byte2 & 0xf0) >> 4;
+}
+
 #define rtas_error_type(x)	((x)->byte3)
 
 static inline
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index e56759d92356..cd9446980092 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -422,7 +422,130 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
 	return 0; /* need to perform reset */
 }
 
-static int mce_handle_error(struct rtas_error_log *errp)
+#define VAL_TO_STRING(ar, val)	((val < ARRAY_SIZE(ar)) ? ar[val] : "Unknown")
+
+static void pseries_print_mce_info(struct pt_regs *regs,
+				struct rtas_error_log *errp, int disposition)
+{
+	const char *level, *sevstr;
+	struct pseries_errorlog *pseries_log;
+	struct pseries_mc_errorlog *mce_log;
+	uint8_t error_type, err_sub_type;
+	uint8_t initiator = rtas_error_initiator(errp);
+	uint64_t addr;
+
+	static const char * const initiators[] = {
+		"Unknown",
+		"CPU",
+		"PCI",
+		"ISA",
+		"Memory",
+		"Power Mgmt",
+	};
+	static const char * const mc_err_types[] = {
+		"UE",
+		"SLB",
+		"ERAT",
+		"TLB",
+		"D-Cache",
+		"Unknown",
+		"I-Cache",
+	};
+	static const char * const mc_ue_types[] = {
+		"Indeterminate",
+		"Instruction fetch",
+		"Page table walk ifetch",
+		"Load/Store",
+		"Page table walk Load/Store",
+	};
+
+	/* SLB sub errors valid values are 0x0, 0x1, 0x2 */
+	static const char * const mc_slb_types[] = {
+		"Parity",
+		"Multihit",
+		"Indeterminate",
+	};
+
+	/* TLB and ERAT sub errors valid values are 0x1, 0x2, 0x3 */
+	static const char * const mc_soft_types[] = {
+		"Unknown",
+		"Parity",
+		"Multihit",
+		"Indeterminate",
+	};
+
+	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+	if (pseries_log == NULL)
+		return;
+
+	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+
+	error_type = rtas_mc_error_type(mce_log);
+	err_sub_type = rtas_mc_error_sub_type(mce_log);
+
+	switch (rtas_error_severity(errp)) {
+	case RTAS_SEVERITY_NO_ERROR:
+		level = KERN_INFO;
+		sevstr = "Harmless";
+		break;
+	case RTAS_SEVERITY_WARNING:
+		level = KERN_WARNING;
+		sevstr = "";
+		break;
+	case RTAS_SEVERITY_ERROR:
+	case RTAS_SEVERITY_ERROR_SYNC:
+		level = KERN_ERR;
+		sevstr = "Severe";
+		break;
+	case RTAS_SEVERITY_FATAL:
+	default:
+		level = KERN_ERR;
+		sevstr = "Fatal";
+		break;
+	}
+
+	printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
+		disposition == RTAS_DISP_FULLY_RECOVERED ?
+		"Recovered" : "Not recovered");
+	if (user_mode(regs)) {
+		printk("%s  NIP: [%016lx] PID: %d Comm: %s\n", level,
+			regs->nip, current->pid, current->comm);
+	} else {
+		printk("%s  NIP [%016lx]: %pS\n", level, regs->nip,
+			(void *)regs->nip);
+	}
+	printk("%s  Initiator: %s\n", level,
+				VAL_TO_STRING(initiators, initiator));
+
+	switch (error_type) {
+	case PSERIES_MC_ERROR_TYPE_UE:
+		printk("%s  Error type: %s [%s]\n", level,
+			VAL_TO_STRING(mc_err_types, error_type),
+			VAL_TO_STRING(mc_ue_types, err_sub_type));
+		break;
+	case PSERIES_MC_ERROR_TYPE_SLB:
+		printk("%s  Error type: %s [%s]\n", level,
+			VAL_TO_STRING(mc_err_types, error_type),
+			VAL_TO_STRING(mc_slb_types, err_sub_type));
+		break;
+	case PSERIES_MC_ERROR_TYPE_ERAT:
+	case PSERIES_MC_ERROR_TYPE_TLB:
+		printk("%s  Error type: %s [%s]\n", level,
+			VAL_TO_STRING(mc_err_types, error_type),
+			VAL_TO_STRING(mc_soft_types, err_sub_type));
+		break;
+	default:
+		printk("%s  Error type: %s\n", level,
+			VAL_TO_STRING(mc_err_types, error_type));
+		break;
+	}
+
+	addr = rtas_mc_get_effective_addr(mce_log);
+	if (addr)
+		printk("%s    Effective address: %016llx\n", level, addr);
+}
+
+static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
 {
 	struct pseries_errorlog *pseries_log;
 	struct pseries_mc_errorlog *mce_log;
@@ -442,6 +565,7 @@ static int mce_handle_error(struct rtas_error_log *errp)
 		slb_flush_and_rebolt();
 		disposition = RTAS_DISP_FULLY_RECOVERED;
 	}
+	pseries_print_mce_info(regs, errp, disposition);
 
 out:
 	return disposition;
@@ -461,7 +585,7 @@ static int recover_mce(struct pt_regs *regs, struct rtas_error_log *err)
 	int recovered = 0;
 	int disposition;
 
-	disposition = mce_handle_error(err);
+	disposition = mce_handle_error(regs, err);
 
 	if (!(regs->msr & MSR_RI)) {
 		/* If MSR_RI isn't set, we cannot recover */

^ permalink raw reply related

* UBSAN: Undefined behaviour in ../include/linux/percpu_counter.h:137:13
From: Mathieu Malaterre @ 2018-06-07 18:26 UTC (permalink / raw)
  To: linuxppc-dev

Hi there,

I have a reproducible UBSAN appearing in dmesg after a while on my G4
(*). Could anyone suggest a way to diagnose the actual root issue here
(or is it just a false positive) ?

Thanks,

(*)
[41877.514338] ================================================================================
[41877.514364] UBSAN: Undefined behaviour in
../include/linux/percpu_counter.h:137:13
[41877.514373] signed integer overflow:
[41877.514378] 9223352809007201260 + 41997676517838 cannot be
represented in type 'long long int'
[41877.514389] CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0+ #54
[41877.514394] Call Trace:
[41877.514411] [dffedd30] [c047a5f8] ubsan_epilogue+0x18/0x4c (unreliable)
[41877.514422] [dffedd40] [c047af98] handle_overflow+0xbc/0xdc
[41877.514437] [dffeddc0] [c043aaa8] cfq_completed_request+0x560/0x1234
[41877.514446] [dffede40] [c03f595c] __blk_put_request+0xb0/0x2dc
[41877.514460] [dffede80] [c05aa41c] scsi_end_request+0x19c/0x344
[41877.514469] [dffedeb0] [c05abba0] scsi_io_completion+0x4b4/0x854
[41877.514482] [dffedf10] [c040604c] blk_done_softirq+0xe4/0x1e0
[41877.514496] [dffedf60] [c07eef84] __do_softirq+0x16c/0x5f0
[41877.514508] [dffedfd0] [c0065160] irq_exit+0x110/0x1a8
[41877.514520] [dffedff0] [c001646c] call_do_irq+0x24/0x3c
[41877.514533] [c0ce5e80] [c0009a2c] do_IRQ+0x98/0x1a0
[41877.514541] [c0ce5eb0] [c001b93c] ret_from_except+0x0/0x14
[41877.514549] --- interrupt: 501 at arch_cpu_idle+0x30/0x78
                   LR = arch_cpu_idle+0x30/0x78
[41877.514558] [c0ce5f70] [c0ce4000] 0xc0ce4000 (unreliable)
[41877.514570] [c0ce5f80] [c00a3928] do_idle+0xc4/0x158
[41877.514577] [c0ce5fb0] [c00a3b74] cpu_startup_entry+0x24/0x28
[41877.514585] [c0ce5fc0] [c0988820] start_kernel+0x47c/0x490
[41877.514592] [c0ce5ff0] [00003444] 0x3444
[41877.514597] ================================================================================
[41886.390210] ================================================================================
[41886.390236] UBSAN: Undefined behaviour in
../include/linux/percpu_counter.h:137:13
[41886.390245] signed integer overflow:
[41886.390250] 9223366156262940402 + 42006563339289 cannot be
represented in type 'long long int'
[41886.390260] CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0+ #54
[41886.390265] Call Trace:
[41886.390282] [dffedd30] [c047a5f8] ubsan_epilogue+0x18/0x4c (unreliable)
[41886.390293] [dffedd40] [c047af98] handle_overflow+0xbc/0xdc
[41886.390309] [dffeddc0] [c043a8c4] cfq_completed_request+0x37c/0x1234
[41886.390317] [dffede40] [c03f595c] __blk_put_request+0xb0/0x2dc
[41886.390331] [dffede80] [c05aa41c] scsi_end_request+0x19c/0x344
[41886.390340] [dffedeb0] [c05abba0] scsi_io_completion+0x4b4/0x854
[41886.390353] [dffedf10] [c040604c] blk_done_softirq+0xe4/0x1e0
[41886.390367] [dffedf60] [c07eef84] __do_softirq+0x16c/0x5f0
[41886.390379] [dffedfd0] [c0065160] irq_exit+0x110/0x1a8
[41886.390391] [dffedff0] [c001646c] call_do_irq+0x24/0x3c
[41886.390404] [c0ce5e80] [c0009a2c] do_IRQ+0x98/0x1a0
[41886.390411] [c0ce5eb0] [c001b93c] ret_from_except+0x0/0x14
[41886.390420] --- interrupt: 501 at arch_cpu_idle+0x30/0x78
                   LR = arch_cpu_idle+0x30/0x78
[41886.390429] [c0ce5f70] [c0ce4000] 0xc0ce4000 (unreliable)
[41886.390441] [c0ce5f80] [c00a3928] do_idle+0xc4/0x158
[41886.390449] [c0ce5fb0] [c00a3b74] cpu_startup_entry+0x24/0x28
[41886.390457] [c0ce5fc0] [c0988820] start_kernel+0x47c/0x490
[41886.390463] [c0ce5ff0] [00003444] 0x3444
[41886.390468] ================================================================================

^ permalink raw reply

* Re: [RFC PATCH 0/4] powerpc/pseries: Machien check handler improvements.
From: Michal Suchánek @ 2018-06-07 19:43 UTC (permalink / raw)
  To: Mahesh J Salgaonkar; +Cc: linuxppc-dev, Laurent Dufour, Aneesh Kumar K.V
In-Reply-To: <152825972491.24560.4626614298142965406.stgit@jupiter.in.ibm.com>

On Wed, 06 Jun 2018 10:06:23 +0530
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:

> This patch series includes some improvement to Machine check handler
> for pseries. Patch 1 fixes an issue where machine check handler
> crashes kernel while accessing vmalloc-ed buffer while in nmi context.
> Patch 3 dumps the SLB contents on SLB MCE errors to improve the
> debugability. Patch 4 display's the MCE error details on console.
>=20
> ---
>=20
> Mahesh Salgaonkar (4):
>       powerpc/pseries: convert rtas_log_buf to linear allocation.
>       powerpc/pseries: Define MCE error event section.
>       powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.
>       powerpc/pseries: Display machine check error details.

Tested-by: Michal Such=C3=A1nek <msuchanek@suse.de>

Thanks

Michal

^ permalink raw reply

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
From: Benjamin Herrenschmidt @ 2018-06-07 21:54 UTC (permalink / raw)
  To: Alex Williamson, Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Ram Pai, kvm,
	Alistair Popple
In-Reply-To: <20180607110409.5057ebac@w520.home>

On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> 
> Can we back up and discuss whether the IOMMU grouping of NVLink
> connected devices makes sense?  AIUI we have a PCI view of these
> devices and from that perspective they're isolated.  That's the view of
> the device used to generate the grouping.  However, not visible to us,
> these devices are interconnected via NVLink.  What isolation properties
> does NVLink provide given that its entire purpose for existing seems to
> be to provide a high performance link for p2p between devices?

Not entire. On POWER chips, we also have an nvlink between the device
and the CPU which is running significantly faster than PCIe.

But yes, there are cross-links and those should probably be accounted
for in the grouping.

> > Each bridge represents an additional hardware interface called "NVLink2",
> > it is not a PCI link but separate but. The design inherits from original
> > NVLink from POWER8.
> > 
> > The new feature of V100 is 16GB of cache coherent memory on GPU board.
> > This memory is presented to the host via the device tree and remains offline
> > until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> > bridges above) and the nvidia-persistenced daemon then onlines it.
> > The memory remains online as long as nvidia-persistenced is running, when
> > it stops, it offlines the memory.
> > 
> > The amount of GPUs suggest passing them through to a guest. However,
> > in order to do so we cannot use the NVIDIA driver so we have a host with
> > a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> > with no page structs backing this window and we cannot touch this memory
> > before the NVIDIA driver configures it in a host or a guest as
> > HMI (hardware management interrupt?) occurs.
> 
> Having a lot of GPUs only suggests assignment to a guest if there's
> actually isolation provided between those GPUs.  Otherwise we'd need to
> assign them as one big group, which gets a lot less useful.  Thanks,
> 
> Alex
> 
> > On the example system the GPU RAM windows are located at:
> > 0x0400 0000 0000
> > 0x0420 0000 0000
> > 0x0440 0000 0000
> > 0x2400 0000 0000
> > 0x2420 0000 0000
> > 0x2440 0000 0000
> > 
> > So the complications are:
> > 
> > 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> > to VFIO-to-userspace or guest-to-host-physical translations till
> > the driver trains it (i.e. nvidia-persistenced has started), otherwise
> > prefetching happens and HMI occurs; I am trying to get this changed
> > somehow;
> > 
> > 2. since it appears as normal cache coherent memory, it will be used
> > for DMA which means it has to be pinned and mapped in the host. Having
> > no page structs makes it different from the usual case - we only need
> > translate user addresses to host physical and map GPU RAM memory but
> > pinning is not required.
> > 
> > This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> > register this memory as a KVM memory slot and present memory nodes to
> > the guest. Unless NVIDIA provides an userspace driver, this is no use
> > for things like DPDK.
> > 
> > 
> > There is another problem which the series does not address but worth
> > mentioning - it is not strictly necessary to map GPU RAM to the guest
> > exactly where it is in the host (I tested this to some extent), we still
> > might want to represent the memory at the same offset as on the host
> > which increases the size of a TCE table needed to cover such a huge
> > window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
> > I am addressing this in a separate patchset by allocating indirect TCE
> > levels on demand and using 16MB IOMMU pages in the guest as we can now
> > back emulated pages with the smaller hardware ones.
> > 
> > 
> > This is an RFC. Please comment. Thanks.
> > 
> > 
> > 
> > Alexey Kardashevskiy (5):
> >   vfio/spapr_tce: Simplify page contained test
> >   powerpc/iommu_context: Change referencing in API
> >   powerpc/iommu: Do not pin memory of a memory device
> >   vfio_pci: Allow mapping extra regions
> >   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> > 
> >  drivers/vfio/pci/Makefile              |   1 +
> >  arch/powerpc/include/asm/mmu_context.h |   5 +-
> >  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
> >  include/uapi/linux/vfio.h              |   3 +
> >  arch/powerpc/kernel/iommu.c            |   8 +-
> >  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
> >  drivers/vfio/pci/vfio_pci.c            |  19 +++-
> >  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
> >  drivers/vfio/pci/Kconfig               |   4 +
> >  10 files changed, 319 insertions(+), 34 deletions(-)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> > 

^ permalink raw reply

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
From: Alex Williamson @ 2018-06-07 22:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, kvm-ppc,
	Ram Pai, kvm, Alistair Popple
In-Reply-To: <e35a7bbea8b82c17f93eb6eb438df38a94097f2d.camel@kernel.crashing.org>

On Fri, 08 Jun 2018 07:54:02 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > 
> > Can we back up and discuss whether the IOMMU grouping of NVLink
> > connected devices makes sense?  AIUI we have a PCI view of these
> > devices and from that perspective they're isolated.  That's the view of
> > the device used to generate the grouping.  However, not visible to us,
> > these devices are interconnected via NVLink.  What isolation properties
> > does NVLink provide given that its entire purpose for existing seems to
> > be to provide a high performance link for p2p between devices?  
> 
> Not entire. On POWER chips, we also have an nvlink between the device
> and the CPU which is running significantly faster than PCIe.
> 
> But yes, there are cross-links and those should probably be accounted
> for in the grouping.

Then after we fix the grouping, can we just let the host driver manage
this coherent memory range and expose vGPUs to guests?  The use case of
assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
convince NVIDIA to support more than a single vGPU per VM though)
Thanks,

Alex

^ permalink raw reply

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
From: Benjamin Herrenschmidt @ 2018-06-07 23:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, kvm-ppc,
	Ram Pai, kvm, Alistair Popple
In-Reply-To: <20180607161541.21df6434@w520.home>

On Thu, 2018-06-07 at 16:15 -0600, Alex Williamson wrote:
> On Fri, 08 Jun 2018 07:54:02 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > > 
> > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > connected devices makes sense?  AIUI we have a PCI view of these
> > > devices and from that perspective they're isolated.  That's the view of
> > > the device used to generate the grouping.  However, not visible to us,
> > > these devices are interconnected via NVLink.  What isolation properties
> > > does NVLink provide given that its entire purpose for existing seems to
> > > be to provide a high performance link for p2p between devices?  
> > 
> > Not entire. On POWER chips, we also have an nvlink between the device
> > and the CPU which is running significantly faster than PCIe.
> > 
> > But yes, there are cross-links and those should probably be accounted
> > for in the grouping.
> 
> Then after we fix the grouping, can we just let the host driver manage
> this coherent memory range and expose vGPUs to guests?  The use case of
> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> convince NVIDIA to support more than a single vGPU per VM though)
> Thanks,

I don't know about "vGPUs" and what nVidia may be cooking in that area.

The patched from Alexey allow for passing through the full thing, but
they aren't trivial (there are additional issues, I'm not sure how
covered they are, as we need to pay with the mapping attributes of
portions of the GPU memory on the host side...).

Note: The cross-links are only per-socket so that would be 2 groups of
3.

We *can* allow individual GPUs to be passed through, either if somebody
designs a system without cross links, or if the user is ok with the
security risk as the guest driver will not enable them if it doesn't
"find" both sides of them.

Cheers,
Ben.

^ permalink raw reply

* Re: linux-next: manual merge of the powerpc tree with the kbuild tree
From: Stephen Rothwell @ 2018-06-07 23:52 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Michael Ellerman, Benjamin Herrenschmidt, PowerPC,
	Linux-Next Mailing List, Linux Kernel Mailing List,
	Nicholas Piggin, Naveen N. Rao
In-Reply-To: <20180531093216.5c91aa2b@canb.auug.org.au>

[-- Attachment #1: Type: text/plain, Size: 1690 bytes --]

Hi all,

On Thu, 31 May 2018 09:32:16 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Today's linux-next merge of the powerpc tree got a conflict in:
> 
>   arch/powerpc/kernel/module_64.c
> 
> between commit:
> 
>   06aeb9e3f2bc ("powerpc/kbuild: move -mprofile-kernel check to Kconfig")
> 
> from the kbuild tree and commit:
> 
>   250122baed29 ("powerpc64/module: Tighten detection of mcount call sites with -mprofile-kernel")
> 
> from the powerpc tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc arch/powerpc/kernel/module_64.c
> index 55bccc315e1a,f7667e2ebfcb..000000000000
> --- a/arch/powerpc/kernel/module_64.c
> +++ b/arch/powerpc/kernel/module_64.c
> @@@ -462,9 -466,12 +466,12 @@@ static unsigned long stub_for_addr(cons
>   	return (unsigned long)&stubs[i];
>   }
>   
>  -#ifdef CC_USING_MPROFILE_KERNEL
>  +#ifdef CONFIG_MPROFILE_KERNEL
> - static bool is_early_mcount_callsite(u32 *instruction)
> + static bool is_mprofile_mcount_callsite(const char *name, u32 *instruction)
>   {
> + 	if (strcmp("_mcount", name))
> + 		return false;
> + 
>   	/*
>   	 * Check if this is one of the -mprofile-kernel sequences.
>   	 */

This is now a conflict between the kbuild tree and Linus' tree.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
From: Alex Williamson @ 2018-06-08  0:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, kvm-ppc,
	Ram Pai, kvm, Alistair Popple
In-Reply-To: <33590885d138195c8ede78b588ddb03b132267fd.camel@kernel.crashing.org>

On Fri, 08 Jun 2018 09:20:30 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Thu, 2018-06-07 at 16:15 -0600, Alex Williamson wrote:
> > On Fri, 08 Jun 2018 07:54:02 +1000
> > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >   
> > > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> > > > 
> > > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > connected devices makes sense?  AIUI we have a PCI view of these
> > > > devices and from that perspective they're isolated.  That's the view of
> > > > the device used to generate the grouping.  However, not visible to us,
> > > > these devices are interconnected via NVLink.  What isolation properties
> > > > does NVLink provide given that its entire purpose for existing seems to
> > > > be to provide a high performance link for p2p between devices?    
> > > 
> > > Not entire. On POWER chips, we also have an nvlink between the device
> > > and the CPU which is running significantly faster than PCIe.
> > > 
> > > But yes, there are cross-links and those should probably be accounted
> > > for in the grouping.  
> > 
> > Then after we fix the grouping, can we just let the host driver manage
> > this coherent memory range and expose vGPUs to guests?  The use case of
> > assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > convince NVIDIA to support more than a single vGPU per VM though)
> > Thanks,  
> 
> I don't know about "vGPUs" and what nVidia may be cooking in that area.
> 
> The patched from Alexey allow for passing through the full thing, but
> they aren't trivial (there are additional issues, I'm not sure how
> covered they are, as we need to pay with the mapping attributes of
> portions of the GPU memory on the host side...).
> 
> Note: The cross-links are only per-socket so that would be 2 groups of
> 3.
> 
> We *can* allow individual GPUs to be passed through, either if somebody
> designs a system without cross links, or if the user is ok with the
> security risk as the guest driver will not enable them if it doesn't
> "find" both sides of them.

If GPUs are not isolated and we cannot prevent them from probing each
other via these links, then I think we have an obligation to configure
grouping in a way that doesn't rely on a benevolent userspace.  Thanks,

Alex

^ permalink raw reply

* Re: [RFC PATCH -tip v5 18/27] powerpc/kprobes: Don't call the ->break_handler() in arm kprobes code
From: Masami Hiramatsu @ 2018-06-08  0:42 UTC (permalink / raw)
  To: Naveen N. Rao
  Cc: Andrew Morton, H . Peter Anvin, linux-arch, linux-kernel,
	linuxppc-dev, Ingo Molnar, Ingo Molnar, Paul Mackerras,
	Steven Rostedt, Thomas Gleixner
In-Reply-To: <1528389409.62iyaqw9yn.naveen@linux.ibm.com>

On Thu, 07 Jun 2018 22:07:26 +0530
"Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> wrote:

> Masami Hiramatsu wrote:
> > On Thu, 07 Jun 2018 17:07:00 +0530
> > "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> wrote:
> > 
> >> Masami Hiramatsu wrote:
> >> > Don't call the ->break_handler() from the arm kprobes code,
> >> 					    ^^^ powerpc
> >> 
> >> > because it was only used by jprobes which got removed.
> >> > 
> >> > This also makes skip_singlestep() a static function since
> >> > only ftrace-kprobe.c is using this function.
> >> > 
> >> > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> >> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> >> > Cc: Paul Mackerras <paulus@samba.org>
> >> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> >> > Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
> >> > Cc: linuxppc-dev@lists.ozlabs.org
> >> > ---
> >> >  arch/powerpc/include/asm/kprobes.h   |   10 ----------
> >> >  arch/powerpc/kernel/kprobes-ftrace.c |   16 +++-------------
> >> >  arch/powerpc/kernel/kprobes.c        |   31 +++++++++++--------------------
> >> >  3 files changed, 14 insertions(+), 43 deletions(-)
> >> 
> >> With 2 small comments...
> > 
> > 2 ? or 1 ?
> 
> Two, with one in the commit log above :)

Oops, sorry I missed it. yeah, the comment above is my mistake. I'll fix it.

Thanks!


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
From: Benjamin Herrenschmidt @ 2018-06-08  0:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, kvm-ppc,
	Ram Pai, kvm, Alistair Popple
In-Reply-To: <20180607183417.3ff2acf1@w520.home>

On Thu, 2018-06-07 at 18:34 -0600, Alex Williamson wrote:
> > We *can* allow individual GPUs to be passed through, either if somebody
> > designs a system without cross links, or if the user is ok with the
> > security risk as the guest driver will not enable them if it doesn't
> > "find" both sides of them.
> 
> If GPUs are not isolated and we cannot prevent them from probing each
> other via these links, then I think we have an obligation to configure
> grouping in a way that doesn't rely on a benevolent userspace.  Thanks,

Well, it's a user decision, no ? Like how we used to let the user
decide whether to pass-through things that have LSIs shared out of
their domain.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
From: Alex Williamson @ 2018-06-08  1:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, kvm-ppc,
	Ram Pai, kvm, Alistair Popple
In-Reply-To: <f6fbd7996a2edfc0bdb8363cb6384caeae7d47b9.camel@kernel.crashing.org>

On Fri, 08 Jun 2018 10:58:54 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Thu, 2018-06-07 at 18:34 -0600, Alex Williamson wrote:
> > > We *can* allow individual GPUs to be passed through, either if somebody
> > > designs a system without cross links, or if the user is ok with the
> > > security risk as the guest driver will not enable them if it doesn't
> > > "find" both sides of them.  
> > 
> > If GPUs are not isolated and we cannot prevent them from probing each
> > other via these links, then I think we have an obligation to configure
> > grouping in a way that doesn't rely on a benevolent userspace.  Thanks,  
> 
> Well, it's a user decision, no ? Like how we used to let the user
> decide whether to pass-through things that have LSIs shared out of
> their domain.

No, users don't get to pinky swear they'll be good.  The kernel creates
IOMMU groups assuming the worst case isolation and malicious users.
Its the kernel's job to protect itself from users and to protect users
from each other.  Anything else is unsupportable.  The only way to
bypass the default grouping is to modify the kernel.  Thanks,

Alex

^ permalink raw reply

* Re: [v3 PATCH 1/5] powerpc/pseries: convert rtas_log_buf to linear allocation.
From: Nicholas Piggin @ 2018-06-08  1:31 UTC (permalink / raw)
  To: Mahesh J Salgaonkar
  Cc: linuxppc-dev, stable, Aneesh Kumar K.V, Michael Ellerman,
	Laurent Dufour
In-Reply-To: <152839248387.25118.579137029955034015.stgit@jupiter.in.ibm.com>

On Thu, 07 Jun 2018 22:58:11 +0530
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:

> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> 
> rtas_log_buf is a buffer to hold RTAS event data that are communicated
> to kernel by hypervisor. This buffer is then used to pass RTAS event
> data to user through proc fs. This buffer is allocated from vmalloc
> (non-linear mapping) area.
> 
> On Machine check interrupt, register r3 points to RTAS extended event
> log passed by hypervisor that contains the MCE event. The pseries
> machine check handler then logs this error into rtas_log_buf. The
> rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
> page fault (vector 0x300) while accessing it. Since machine check
> interrupt handler runs in NMI context we can not afford to take any
> page fault. Page faults are not honored in NMI context and causes
> kernel panic. This patch fixes this issue by allocating rtas_log_buf
> using kmalloc.
> 
> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
> Cc: stable@vger.kernel.org
> Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/rtasd.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
> index f915db93cd42..3957d4ae2ba2 100644
> --- a/arch/powerpc/kernel/rtasd.c
> +++ b/arch/powerpc/kernel/rtasd.c
> @@ -559,7 +559,7 @@ static int __init rtas_event_scan_init(void)
>  	rtas_error_log_max = rtas_get_error_log_max();
>  	rtas_error_log_buffer_max = rtas_error_log_max + sizeof(int);
>  
> -	rtas_log_buf = vmalloc(rtas_error_log_buffer_max*LOG_NUMBER);
> +	rtas_log_buf = kmalloc(rtas_error_log_buffer_max*LOG_NUMBER, GFP_KERNEL);

Does this have to be in the RMA region if it's to be accessed with
relocation off in the guest?

A comment about it being accessed with relocation off might be helpful
too.

Thanks,
Nick

^ permalink raw reply

* Re: [v3 PATCH 2/5] powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.
From: Nicholas Piggin @ 2018-06-08  1:33 UTC (permalink / raw)
  To: Mahesh J Salgaonkar
  Cc: linuxppc-dev, stable, Aneesh Kumar K.V, Michael Ellerman,
	Laurent Dufour
In-Reply-To: <152839249913.25118.1191250274945665204.stgit@jupiter.in.ibm.com>

On Thu, 07 Jun 2018 22:58:33 +0530
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:

> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> 
> During Machine Check interrupt on pseries platform, register r3 points
> RTAS extended event log passed by hypervisor. Since hypervisor uses r3
> to pass pointer to rtas log, it stores the original r3 value at the
> start of the memory (first 8 bytes) pointed by r3. Since hypervisor
> stores this info and rtas log is in BE format, linux should make
> sure to restore r3 value in correct endian format.
> 
> Without this patch when MCE handler, after recovery, returns to code that
> that caused the MCE may end up with Data SLB access interrupt for invalid
> address followed by kernel panic or hang.
> 
> [   62.878965] Severe Machine check interrupt [Recovered]
> [   62.878968]   NIP [d00000000ca301b8]: init_module+0x1b8/0x338 [bork_kernel]
> [   62.878969]   Initiator: CPU
> [   62.878970]   Error type: SLB [Multihit]
> [   62.878971]     Effective address: d00000000ca70000
> cpu 0xa: Vector: 380 (Data SLB Access) at [c0000000fc7775b0]
>     pc: c0000000009694c0: vsnprintf+0x80/0x480
>     lr: c0000000009698e0: vscnprintf+0x20/0x60
>     sp: c0000000fc777830
>    msr: 8000000002009033
>    dar: a803a30c000000d0
>   current = 0xc00000000bc9ef00
>   paca    = 0xc00000001eca5c00	 softe: 3	 irq_happened: 0x01
>     pid   = 8860, comm = insmod
> [c0000000fc7778b0] c0000000009698e0 vscnprintf+0x20/0x60
> [c0000000fc7778e0] c00000000016b6c4 vprintk_emit+0xb4/0x4b0
> [c0000000fc777960] c00000000016d40c vprintk_func+0x5c/0xd0
> [c0000000fc777980] c00000000016cbb4 printk+0x38/0x4c
> [c0000000fc7779a0] d00000000ca301c0 init_module+0x1c0/0x338 [bork_kernel]
> [c0000000fc777a40] c00000000000d9c4 do_one_initcall+0x54/0x230
> [c0000000fc777b00] c0000000001b3b74 do_init_module+0x8c/0x248
> [c0000000fc777b90] c0000000001b2478 load_module+0x12b8/0x15b0
> [c0000000fc777d30] c0000000001b29e8 sys_finit_module+0xa8/0x110
> [c0000000fc777e30] c00000000000b204 system_call+0x58/0x6c
> --- Exception: c00 (System Call) at 00007fff8bda0644
> SP (7fffdfbfe980) is in userspace
> 
> This patch fixes this issue.

LGTM

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> 
> Fixes: a08a53ea4c97 ("powerpc/le: Enable RTAS events support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/ras.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
> index 5e1ef9150182..2edc673be137 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -360,7 +360,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
>  	}
>  
>  	savep = __va(regs->gpr[3]);
> -	regs->gpr[3] = savep[0];	/* restore original r3 */
> +	regs->gpr[3] = be64_to_cpu(savep[0]);	/* restore original r3 */
>  
>  	/* If it isn't an extended log we can use the per cpu 64bit buffer */
>  	h = (struct rtas_error_log *)&savep[1];
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox