LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 6/6] powerpc/traps: Show instructions on exceptions
From: Murilo Opsfelder Araujo @ 2018-08-01 21:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alastair D'Silva, Andrew Donnellan, Balbir Singh,
	Benjamin Herrenschmidt, Christophe Leroy, Cyril Bur,
	Eric W . Biederman, Joe Perches, Michael Ellerman,
	Michael Neuling, Murilo Opsfelder Araujo, Nicholas Piggin,
	Paul Mackerras, Simon Guo, Sukadev Bhattiprolu, Tobin C . Harding,
	linuxppc-dev, Segher Boessenkool
In-Reply-To: <20180801213320.11203-1-muriloo@linux.ibm.com>

Call show_user_instructions() in arch/powerpc/kernel/traps.c to dump
instructions at faulty location, useful to debugging.

Before this patch, an unhandled signal message looked like:

  pandafault[10524]: segfault (11) at 100007d0 nip 1000061c lr 7fffbd295100 code 2 in pandafault[10000000+10000]

After this patch, it looks like:

  pandafault[10524]: segfault (11) at 100007d0 nip 1000061c lr 7fffbd295100 code 2 in pandafault[10000000+10000]
  pandafault[10524]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
  pandafault[10524]: code: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040

Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
---
 arch/powerpc/kernel/traps.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index bcefbb1ee771..8494b0ff4904 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -70,6 +70,7 @@
 #include <asm/hmi.h>
 #include <sysdev/fsl_pci.h>
 #include <asm/kprobes.h>
+#include <asm/stacktrace.h>
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC_CORE)
 int (*__debugger)(struct pt_regs *regs) __read_mostly;
@@ -337,6 +338,8 @@ static void show_signal_msg(int signr, struct pt_regs *regs, int code,
 	print_vma_addr(KERN_CONT " in ", regs->nip);
 
 	pr_cont("\n");
+
+	show_user_instructions(regs);
 }
 
 void _exception_pkey(int signr, struct pt_regs *regs, int code,
-- 
2.17.1

^ permalink raw reply related

* [PATCH v4 3/6] powerpc/traps: Use %lx format in show_signal_msg()
From: Murilo Opsfelder Araujo @ 2018-08-01 21:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alastair D'Silva, Andrew Donnellan, Balbir Singh,
	Benjamin Herrenschmidt, Christophe Leroy, Cyril Bur,
	Eric W . Biederman, Joe Perches, Michael Ellerman,
	Michael Neuling, Murilo Opsfelder Araujo, Nicholas Piggin,
	Paul Mackerras, Simon Guo, Sukadev Bhattiprolu, Tobin C . Harding,
	linuxppc-dev, Segher Boessenkool
In-Reply-To: <20180801213320.11203-1-muriloo@linux.ibm.com>

Use %lx format to print registers.  This avoids having two different
formats and avoids checking for MSR_64BIT, improving readability of the
function.

Even though we could have used %px, which is functionally equivalent to %lx
as per Documentation/core-api/printk-formats.rst, it is not semantically
correct because the data printed are not pointers.  And using %px requires
casting data to (void *).

Besides that, %lx matches the format used in show_regs().

Before this patch:

  pandafault[4808]: unhandled signal 11 at 0000000010000718 nip 0000000010000574 lr 00007fff935e7a6c code 2

After this patch:

  pandafault[4732]: unhandled signal 11 at 10000718 nip 10000574 lr 7fff86697a6c code 2

Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
---
 arch/powerpc/kernel/traps.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 367534b41617..4503e22f6ba5 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -311,20 +311,15 @@ static bool show_unhandled_signals_ratelimited(void)
 static void show_signal_msg(int signr, struct pt_regs *regs, int code,
 			    unsigned long addr)
 {
-	const char fmt32[] = KERN_INFO "%s[%d]: unhandled signal %d " \
-		"at %08lx nip %08lx lr %08lx code %x\n";
-	const char fmt64[] = KERN_INFO "%s[%d]: unhandled signal %d " \
-		"at %016lx nip %016lx lr %016lx code %x\n";
-
 	if (!unhandled_signal(current, signr))
 		return;
 
 	if (!show_unhandled_signals_ratelimited())
 		return;
 
-	printk(regs->msr & MSR_64BIT ? fmt64 : fmt32,
-	       current->comm, current->pid, signr,
-	       addr, regs->nip, regs->link, code);
+	pr_info("%s[%d]: unhandled signal %d at %lx nip %lx lr %lx code %x\n",
+		current->comm, current->pid, signr,
+		addr, regs->nip, regs->link, code);
 }
 
 void _exception_pkey(int signr, struct pt_regs *regs, int code,
-- 
2.17.1

^ permalink raw reply related

* [PATCH v4 2/6] powerpc/traps: Use an explicit ratelimit state for show_signal_msg()
From: Murilo Opsfelder Araujo @ 2018-08-01 21:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alastair D'Silva, Andrew Donnellan, Balbir Singh,
	Benjamin Herrenschmidt, Christophe Leroy, Cyril Bur,
	Eric W . Biederman, Joe Perches, Michael Ellerman,
	Michael Neuling, Murilo Opsfelder Araujo, Nicholas Piggin,
	Paul Mackerras, Simon Guo, Sukadev Bhattiprolu, Tobin C . Harding,
	linuxppc-dev, Segher Boessenkool
In-Reply-To: <20180801213320.11203-1-muriloo@linux.ibm.com>

Replace printk_ratelimited() by printk() and a default rate limit
burst to limit displaying unhandled signals messages.

This will allow us to call print_vma_addr() in a future patch, which
does not work with printk_ratelimited().

Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
---
 arch/powerpc/kernel/traps.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index cbd3dc365193..367534b41617 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -301,6 +301,13 @@ void user_single_step_siginfo(struct task_struct *tsk,
 	info->si_addr = (void __user *)regs->nip;
 }
 
+static bool show_unhandled_signals_ratelimited(void)
+{
+	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+	return show_unhandled_signals && __ratelimit(&rs);
+}
+
 static void show_signal_msg(int signr, struct pt_regs *regs, int code,
 			    unsigned long addr)
 {
@@ -309,11 +316,15 @@ static void show_signal_msg(int signr, struct pt_regs *regs, int code,
 	const char fmt64[] = KERN_INFO "%s[%d]: unhandled signal %d " \
 		"at %016lx nip %016lx lr %016lx code %x\n";
 
-	if (show_unhandled_signals && unhandled_signal(current, signr)) {
-		printk_ratelimited(regs->msr & MSR_64BIT ? fmt64 : fmt32,
-				   current->comm, current->pid, signr,
-				   addr, regs->nip, regs->link, code);
-	}
+	if (!unhandled_signal(current, signr))
+		return;
+
+	if (!show_unhandled_signals_ratelimited())
+		return;
+
+	printk(regs->msr & MSR_64BIT ? fmt64 : fmt32,
+	       current->comm, current->pid, signr,
+	       addr, regs->nip, regs->link, code);
 }
 
 void _exception_pkey(int signr, struct pt_regs *regs, int code,
-- 
2.17.1

^ permalink raw reply related

* [PATCH v4 1/6] powerpc/traps: Print unhandled signals in a separate function
From: Murilo Opsfelder Araujo @ 2018-08-01 21:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alastair D'Silva, Andrew Donnellan, Balbir Singh,
	Benjamin Herrenschmidt, Christophe Leroy, Cyril Bur,
	Eric W . Biederman, Joe Perches, Michael Ellerman,
	Michael Neuling, Murilo Opsfelder Araujo, Nicholas Piggin,
	Paul Mackerras, Simon Guo, Sukadev Bhattiprolu, Tobin C . Harding,
	linuxppc-dev, Segher Boessenkool
In-Reply-To: <20180801213320.11203-1-muriloo@linux.ibm.com>

Isolate the logic of printing unhandled signals out of _exception_pkey().
No functional change, only code rearrangement.

Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
---
 arch/powerpc/kernel/traps.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 0e17dcb48720..cbd3dc365193 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -301,26 +301,32 @@ void user_single_step_siginfo(struct task_struct *tsk,
 	info->si_addr = (void __user *)regs->nip;
 }
 
+static void show_signal_msg(int signr, struct pt_regs *regs, int code,
+			    unsigned long addr)
+{
+	const char fmt32[] = KERN_INFO "%s[%d]: unhandled signal %d " \
+		"at %08lx nip %08lx lr %08lx code %x\n";
+	const char fmt64[] = KERN_INFO "%s[%d]: unhandled signal %d " \
+		"at %016lx nip %016lx lr %016lx code %x\n";
+
+	if (show_unhandled_signals && unhandled_signal(current, signr)) {
+		printk_ratelimited(regs->msr & MSR_64BIT ? fmt64 : fmt32,
+				   current->comm, current->pid, signr,
+				   addr, regs->nip, regs->link, code);
+	}
+}
 
 void _exception_pkey(int signr, struct pt_regs *regs, int code,
-		unsigned long addr, int key)
+		     unsigned long addr, int key)
 {
 	siginfo_t info;
-	const char fmt32[] = KERN_INFO "%s[%d]: unhandled signal %d " \
-			"at %08lx nip %08lx lr %08lx code %x\n";
-	const char fmt64[] = KERN_INFO "%s[%d]: unhandled signal %d " \
-			"at %016lx nip %016lx lr %016lx code %x\n";
 
 	if (!user_mode(regs)) {
 		die("Exception in kernel mode", regs, signr);
 		return;
 	}
 
-	if (show_unhandled_signals && unhandled_signal(current, signr)) {
-		printk_ratelimited(regs->msr & MSR_64BIT ? fmt64 : fmt32,
-				   current->comm, current->pid, signr,
-				   addr, regs->nip, regs->link, code);
-	}
+	show_signal_msg(signr, regs, code, addr);
 
 	if (arch_irqs_disabled() && !arch_irq_disabled_regs(regs))
 		local_irq_enable();
-- 
2.17.1

^ permalink raw reply related

* [PATCH v4 0/6] powerpc: Modernize unhandled signals message
From: Murilo Opsfelder Araujo @ 2018-08-01 21:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alastair D'Silva, Andrew Donnellan, Balbir Singh,
	Benjamin Herrenschmidt, Christophe Leroy, Cyril Bur,
	Eric W . Biederman, Joe Perches, Michael Ellerman,
	Michael Neuling, Murilo Opsfelder Araujo, Nicholas Piggin,
	Paul Mackerras, Simon Guo, Sukadev Bhattiprolu, Tobin C . Harding,
	linuxppc-dev, Segher Boessenkool

Hi, everyone.

This series was inspired by the need to modernize and display more
informative messages about unhandled signals.

The "unhandled signal NN" is not very informative.  We thought it would
be helpful adding a human-readable message describing what the signal
number means, printing the VMA address, and dumping the instructions.

Before this series:

  pandafault32[4724]: unhandled signal 11 at 100005e4 nip 10000444 lr 0fe31ef4 code 2

  pandafault64[4725]: unhandled signal 11 at 0000000010000718 nip 0000000010000574 lr 00007fff7faa7a6c code 2

After this series:

  pandafault32[4753]: segfault (11) at 100005e4 nip 10000444 lr fe31ef4 code 2 in pandafault32[10000000+10000]
  pandafault32[4753]: code: 4bffff3c 60000000 60420000 4bffff30 9421ffd0 93e1002c 7c3f0b78 3d201000
  pandafault32[4753]: code: 392905e4 913f0008 813f0008 39400048 <99490000> 39200000 7d234b78 397f0030

  pandafault64[4754]: segfault (11) at 10000718 nip 10000574 lr 7fffb0007a6c code 2 in pandafault64[10000000+10000]
  pandafault64[4754]: code: e8010010 7c0803a6 4bfffef4 4bfffef0 fbe1fff8 f821ffb1 7c3f0b78 3d22fffe
  pandafault64[4754]: code: 39298818 f93f0030 e93f0030 39400048 <99490000> 39200000 7d234b78 383f0050

Link to v3:

  https://lore.kernel.org/lkml/20180731145020.14009-1-muriloo@linux.ibm.com/

v3..v4:

  - Added new show_user_instructions() based on the existing
    show_instructions()
  - Updated commit messages
  - Replaced signals names table with a tiny function that returns a
    literal string for each signal number

Cheers!

Murilo Opsfelder Araujo (6):
  powerpc/traps: Print unhandled signals in a separate function
  powerpc/traps: Use an explicit ratelimit state for show_signal_msg()
  powerpc/traps: Use %lx format in show_signal_msg()
  powerpc/traps: Print VMA for unhandled signals
  powerpc: Add show_user_instructions()
  powerpc/traps: Show instructions on exceptions

 arch/powerpc/include/asm/stacktrace.h | 13 +++++++
 arch/powerpc/kernel/process.c         | 40 +++++++++++++++++++++
 arch/powerpc/kernel/traps.c           | 52 +++++++++++++++++++++------
 3 files changed, 95 insertions(+), 10 deletions(-)
 create mode 100644 arch/powerpc/include/asm/stacktrace.h

-- 
2.17.1

^ permalink raw reply

* Re: phandle_cache vs of_detach_node (was Re: [PATCH] powerpc/mobility: Fix node detach/rename problem)
From: Michael Bringmann @ 2018-08-01 17:26 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <87d0v2f9hp.fsf@concordia.ellerman.id.au>

On 08/01/2018 09:26 AM, Michael Ellerman wrote:
> Frank Rowand <frowand.list@gmail.com> writes:
>> On 07/31/18 07:17, Rob Herring wrote:
>>> On Tue, Jul 31, 2018 at 12:34 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>>>>
>>>> Hi Rob/Frank,
>>>>
>>>> I think we might have a problem with the phandle_cache not interacting
>>>> well with of_detach_node():
>>>
>>> Probably needs a similar fix as this commit did for overlays:
>>>
>>> commit b9952b5218added5577e4a3443969bc20884cea9
>>> Author: Frank Rowand <frank.rowand@sony.com>
>>> Date:   Thu Jul 12 14:00:07 2018 -0700
>>>
>>>     of: overlay: update phandle cache on overlay apply and remove
>>>
>>>     A comment in the review of the patch adding the phandle cache said that
>>>     the cache would have to be updated when modules are applied and removed.
>>>     This patch implements the cache updates.
>>>
>>>     Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of
>>> of_find_node_by_phandle()")
>>>     Reported-by: Alan Tull <atull@kernel.org>
>>>     Suggested-by: Alan Tull <atull@kernel.org>
>>>     Signed-off-by: Frank Rowand <frank.rowand@sony.com>
>>>     Signed-off-by: Rob Herring <robh@kernel.org>
>>
>> Agreed.  Sorry about missing the of_detach_node() case.
>>
>>
>>> Really what we need here is an "invalidate phandle" function rather
>>> than free and re-allocate the whole damn cache.
>>
>> The big hammer approach was chosen to avoid the race conditions that
>> would otherwise occur.  OF does not have a locking strategy that
>> would be able to protect against the races.
>>
>> We could maybe implement a slightly smaller hammer by (1) disabling
>> the cache, (2) invalidate a phandle entry in the cache, (3) re-enable
>> the cache.  That is an off the cuff thought - I would have to look
>> a little bit more carefully to make sure it would work.
>>
>> But I don't see a need to add the complexity of the smaller hammer
>> or the bigger hammer of proper locking _unless_ we start seeing that
>> the cache is being freed and re-allocated frequently.  For overlays
>> I don't expect the high frequency because it happens on a per overlay
>> removal basis (not per node removal basis).  For of_detach_node() the
>> event _is_ on a per node removal basis.  Michael, do you expect node
>> removals to be a frequent event with low latency being important?  If
>> so, a rough guess on what the frequency would be?
> 
> No in practice we expect them to be fairly rare and in the thousands at
> most I think.
> 
> TBH you could just disable the phandle_cache on the first detach and
> that would be fine by me. I don't think we've ever noticed phandle
> lookups being much of a problem for us on our hardware.

I ran a couple of migrations with the calls to

of_free_phandle_cache() ... of_populate_phandle_cache()

bracketing the body of

arch/powerpc/platforms/pseries/mobility.c:post_mobility_fixup()

with a patch like,

diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index e245a88..efc9442 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -22,6 +22,9 @@
 #include <asm/rtas.h>
 #include "pseries.h"

+extern int of_free_phandle_cache(void);
+extern void of_populate_phandle_cache(void);
+
 static struct kobject *mobility_kobj;

 struct update_props_workarea {
@@ -343,6 +346,8 @@ void post_mobility_fixup(void)
                rc = rtas_call(activate_fw_token, 0, 1, NULL);
        } while (rtas_busy_delay(rc));

+       of_free_phandle_cache();
+
        if (rc)
                printk(KERN_ERR "Post-mobility activate-fw failed: %d\n", rc);

@@ -354,6 +359,8 @@ void post_mobility_fixup(void)
        /* Possibly switch to a new RFI flush type */
        pseries_setup_rfi_flush();

+       of_populate_phandle_cache();
+
        return;
 }

and did not observe the warnings/crashes that have been
plaguing me.  We will need to move their prototypes out
of drivers/of/of_private.h to include/linux/of.h, though.

> 
> cheers
> 

Michael
 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@linux.vnet.ibm.com

^ permalink raw reply related

* Re: [PATCH] powerpc/pasemi: Seach for PCI root bus by compatible property
From: Olof Johansson @ 2018-08-01 16:17 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Christian Zigotzky, Darren Stevens, linuxppc-dev
In-Reply-To: <877elaf92a.fsf@concordia.ellerman.id.au>

On Wed, Aug 1, 2018 at 7:36 AM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> Christian Zigotzky <chzigotzky@xenosoft.de> writes:
>
>> Just for info: I tested it on my Nemo board today and it works.
>
> Awesome thanks.
>
> That's probably sufficient to merge it, and if it breaks anything we can
> always revert it.

Should be fine, all known boards have the properties in question, and
I doubt anyone has anything but Nemo and Electra/Chitra boards out
there.

It's a virtual root bus, so all boards have it irrespective of what
PCIe is brought out.

(I should hook up my test system and get it into the CI cycle again,
maybe this fall).


Acked-by: Olof Johansson <olof@lixom.net>


-Olof

^ permalink raw reply

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
From: Alex Williamson @ 2018-08-01 16:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, kvm-ppc,
	Ram Pai, kvm, Alistair Popple
In-Reply-To: <38e4455c-e3b8-b867-1771-0f4ccf9d31d8@ozlabs.ru>

On Wed, 1 Aug 2018 18:37:35 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 01/08/2018 00:29, Alex Williamson wrote:
> > On Tue, 31 Jul 2018 14:03:35 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 31/07/2018 02:29, Alex Williamson wrote:  
> >>> On Mon, 30 Jul 2018 18:58:49 +1000
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
> >>>> After some local discussions, it was pointed out that force disabling
> >>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
> >>>> enable it so malicious guests cannot penetrate good ones (or a host)
> >>>> unless a good guest enabled the link but won't happen with a well
> >>>> behaving guest. And if two guests became malicious, then can still only
> >>>> harm each other, and so can they via other ways such network. This is
> >>>> different from PCIe as once PCIe link is unavoidably enabled, a well
> >>>> behaving device cannot firewall itself from peers as it is up to the
> >>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
> >>>> has means to protect itself, just like a guest can run "firewalld" for
> >>>> network.
> >>>>
> >>>> Although it would be a nice feature to have an extra barrier between
> >>>> GPUs, is inability to block the links in hypervisor still a blocker for
> >>>> V100 pass through?    
> >>>
> >>> How is the NVLink configured by the guest, is it 'on'/'off' or are
> >>> specific routes configured?     
> >>
> >> The GPU-GPU links need not to be blocked and need to be enabled
> >> (==trained) by a driver in the guest. There are no routes between GPUs
> >> in NVLink fabric, these are direct links, it is just a switch on each
> >> side, both switches need to be on for a link to work.  
> > 
> > Ok, but there is at least the possibility of multiple direct links per
> > GPU, the very first diagram I find of NVlink shows 8 interconnected
> > GPUs:
> > 
> > https://www.nvidia.com/en-us/data-center/nvlink/  
> 
> Out design is like the left part of the picture but it is just a detail.

Unless we can specifically identify a direct link vs a mesh link, we
shouldn't be making assumptions about the degree of interconnect.
 
> > So if each switch enables one direct, point to point link, how does the
> > guest know which links to open for which peer device?  
> 
> It uses PCI config space on GPUs to discover the topology.

So do we need to virtualize this config space if we're going to
virtualize the topology?

> > And of course
> > since we can't see the spec, a security audit is at best hearsay :-\  
> 
> Yup, the exact discovery protocol is hidden.

It could be reverse engineered...

> >> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> >> is controlled via the emulated PCI bridges which I pass through together
> >> with the GPU.  
> > 
> > So there's a special emulated switch, is that how the guest knows which
> > GPUs it can enable NVLinks to?  
> 
> Since it only has PCI config space (there is nothing relevant in the
> device tree at all), I assume (double checking with the NVIDIA folks
> now) the guest driver enables them all, tests which pair works and
> disables the ones which do not. This gives a malicious guest a tiny
> window of opportunity to break into a good guest. Hm :-/

Let's not minimize that window, that seems like a prime candidate for
an exploit.

> >>> If the former, then isn't a non-malicious
> >>> guest still susceptible to a malicious guest?    
> >>
> >> A non-malicious guest needs to turn its switch on for a link to a GPU
> >> which belongs to a malicious guest.  
> > 
> > Actual security, or obfuscation, will we ever know...  
> >>>> If the latter, how is  
> >>> routing configured by the guest given that the guest view of the
> >>> topology doesn't match physical hardware?  Are these routes
> >>> deconfigured by device reset?  Are they part of the save/restore
> >>> state?  Thanks,    
> > 
> > Still curious what happens to these routes on reset.  Can a later user
> > of a GPU inherit a device where the links are already enabled?  Thanks,  
> 
> I am told that the GPU reset disables links. As a side effect, we get an
> HMI (a hardware fault which reset the host machine) when trying
> accessing the GPU RAM which indicates that the link is down as the
> memory is only accessible via the nvlink. We have special fencing code
> in our host firmware (skiboot) to fence this memory on PCI reset so
> reading from it returns zeroes instead of HMIs.

What sort of reset is required for this?  Typically we rely on
secondary bus reset for GPUs, but it would be a problem if GPUs were to
start implementing FLR and nobody had a spec to learn that FLR maybe
didn't disable the link.  The better approach to me still seems to be
virtualizing these NVLink config registers to an extent that the user
can only enabling links where they have ownership of both ends of the
connection.  Thanks,

Alex

^ permalink raw reply

* Re: [PATCH v3 9/9] powerpc/traps: Add line prefix in show_instructions()
From: Murilo Opsfelder Araujo @ 2018-08-01 15:03 UTC (permalink / raw)
  To: Christophe LEROY
  Cc: linux-kernel, Alastair D'Silva, Andrew Donnellan,
	Balbir Singh, Benjamin Herrenschmidt, Cyril Bur,
	Eric W . Biederman, Joe Perches, Michael Ellerman,
	Michael Neuling, Nicholas Piggin, Paul Mackerras, Simon Guo,
	Sukadev Bhattiprolu, Tobin C . Harding, linuxppc-dev
In-Reply-To: <c7e0fe17-8c4e-4ccc-dadf-9b320ca5a1d2@c-s.fr>

Hi, Christophe.

On Wed, Aug 01, 2018 at 08:41:15AM +0200, Christophe LEROY wrote:
>
>
> Le 31/07/2018 à 16:50, Murilo Opsfelder Araujo a écrit :
> > Remove "Instruction dump:" line by adding a prefix to display current->comm
> > and current->pid, along with the instructions dump.
> >
> > The prefix can serve as a glue that links the instructions dump to its
> > originator, allowing messages to be interleaved in the logs.
> >
> > Before this patch, a page fault looked like:
> >
> >    pandafault[10524]: segfault (11) at 100007d0 nip 1000061c lr 7fffbd295100 code 2 in pandafault[10000000+10000]
> >    Instruction dump:
> >    4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
> >    392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
> >
> > After this patch, it looks like:
> >
> >    pandafault[10850]: segfault (11) at 100007d0 nip 1000061c lr 7fff9f3e5100 code 2 in pandafault[10000000+10000]
> >    pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
> >    pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
> >
> > Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
>
> Does the script scripts/decode_stacktrace.sh also works with this format
> change ?

I've got more feedback from Michael about this.  A better approach would be
having a new show_user_instructions(), a slightly modified version of
show_instructions(), that can be called from within show_signal_msg().

Since _exception_pkey() dies if the exception is in kernel mode, we'll be
safe to call the new show_user_instructions(), without interfering in
scripts/decode_stacktrace.sh.

Cheers
Murilo

^ permalink raw reply

* Re: [PATCH] powerpc/4xx: Fix error return path in ppc4xx_msi_probe()
From: Christoph Hellwig @ 2018-08-01 14:55 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Michael Ellerman, linuxppc-dev, linux-kernel, Robin Murphy,
	Christoph Hellwig
In-Reply-To: <1533001454-8751-1-git-send-email-linux@roeck-us.net>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v3 5/9] powerpc/traps: Print signal name for unhandled signals
From: Murilo Opsfelder Araujo @ 2018-08-01 14:44 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Joe Perches, Christophe LEROY, linux-kernel, Michael Neuling,
	Simon Guo, Nicholas Piggin, Paul Mackerras, Eric W . Biederman,
	Andrew Donnellan, Alastair D'Silva, Sukadev Bhattiprolu,
	linuxppc-dev, Cyril Bur, Tobin C . Harding
In-Reply-To: <20180801074903.GG16221@gate.crashing.org>

Hi, Segher.

On Wed, Aug 01, 2018 at 02:49:03AM -0500, Segher Boessenkool wrote:
> On Wed, Aug 01, 2018 at 12:03:50AM -0700, Joe Perches wrote:
> > On Wed, 2018-08-01 at 08:37 +0200, Christophe LEROY wrote:
> > > Le 31/07/2018 à 16:50, Murilo Opsfelder Araujo a écrit :
> > > I would suggest to instead use a function like this:
> > >
> > > static const char *signame(int signr)
> > > {
> > > 	if (signr == SIGBUS)
> > > 		return "bus error";
> > > 	if (signr == SIGFPE)
> > > 		return "floating point exception";
> > > 	if (signr == SIGILL)
> > > 		return "illegal instruction";
> > > 	if (signr == SIGILL)
> > > 		return "segfault";
> > > 	if (signr == SIGTRAP)
> > > 		return "unhandled trap";
> > > 	return "unknown signal";
> > > }
> >
> > trivia:
> >
> > Unless the if tests are ordered most to least likely,
> > perhaps it would be better to use a switch/case and
> > let the compiler decide.
>
> That would also show there are two entries for SIGILL (here and in the
> original patch), one of them very wrong.

Good catch.  I'll take care of that in my next respin.

> Check the table with psignal or something?

Nice suggestion.  Thanks!

Cheers
Murilo

^ permalink raw reply

* Re: [PATCH v3 5/9] powerpc/traps: Print signal name for unhandled signals
From: Murilo Opsfelder Araujo @ 2018-08-01 14:42 UTC (permalink / raw)
  To: Joe Perches
  Cc: Christophe LEROY, linux-kernel, Alastair D'Silva,
	Andrew Donnellan, Balbir Singh, Benjamin Herrenschmidt, Cyril Bur,
	Eric W . Biederman, Michael Ellerman, Michael Neuling,
	Nicholas Piggin, Paul Mackerras, Simon Guo, Sukadev Bhattiprolu,
	Tobin C . Harding, linuxppc-dev
In-Reply-To: <afa25468e62ae3259878d8a95e9ac96f6be9f88a.camel@perches.com>

On Wed, Aug 01, 2018 at 12:03:50AM -0700, Joe Perches wrote:
> On Wed, 2018-08-01 at 08:37 +0200, Christophe LEROY wrote:
> > Le 31/07/2018 à 16:50, Murilo Opsfelder Araujo a écrit :
> > > This adds a human-readable name in the unhandled signal message.
> > > Before this patch, a page fault looked like:
> > >    pandafault[6303]: unhandled signal 11 at 100007d0 nip 1000061c lr 7fff93c55100 code 2 in pandafault[10000000+10000]
> > > After this patch, a page fault looks like:
> > >    pandafault[6352]: segfault (11) at 13a2a09f8 nip 13a2a086c lr 7fffb63e5100 code 2 in pandafault[13a2a0000+10000]
> ]]
> > > diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> []
> > > @@ -96,6 +96,41 @@ EXPORT_SYMBOL(__debugger_fault_handler);
> > >   #define TM_DEBUG(x...) do { } while(0)
> > >   #endif
> > >
> > > +static const char *signames[SIGRTMIN + 1] = {
> > > +	"UNKNOWN",
> > > +	"SIGHUP",			// 1
> > > +	"SIGINT",			// 2
> []
> > I don't think is is worth having that full table when we only use a few
> > of them. (As discussed in v1 https://patchwork.ozlabs.org/patch/948802/)
> >
> > I would suggest to instead use a function like this:
> >
> > static const char *signame(int signr)
> > {
> > 	if (signr == SIGBUS)
> > 		return "bus error";
> > 	if (signr == SIGFPE)
> > 		return "floating point exception";
> > 	if (signr == SIGILL)
> > 		return "illegal instruction";
> > 	if (signr == SIGILL)
> > 		return "segfault";
> > 	if (signr == SIGTRAP)
> > 		return "unhandled trap";
> > 	return "unknown signal";
> > }
>
> trivia:
>
> Unless the if tests are ordered most to least likely,
> perhaps it would be better to use a switch/case and
> let the compiler decide.
>
> 	switch (signr) {
> 	case SIGBUS:	return "bus error";
> 	case SIGFPE:	return "floating point exception";
> 	case SIGILL:	return "illegal instruction";
> 	case SIGSEGV:	return "segfault";
> 	case SIGTRAP:	return "unhandled trap";
> 	}
> 	return "unknown signal";
> }
>

Hi, Joe, Christophe.

That's a nice enhancement.  I'll do that in my next respin.

Cheers
Murilo

^ permalink raw reply

* Re: [PATCH] perf tools: allow overriding MAX_NR_CPUS at compile time
From: Arnaldo Carvalho de Melo @ 2018-08-01 14:40 UTC (permalink / raw)
  To: Christophe LEROY
  Cc: Peter Zijlstra, Ingo Molnar, Alexander Shishkin, linux-kernel,
	linuxppc-dev
In-Reply-To: <dc8f1911-fa9a-c1c8-0dd5-fd33236da69a@c-s.fr>

Em Wed, Aug 01, 2018 at 11:37:30AM +0200, Christophe LEROY escreveu:
> 
> 
> Le 03/05/2018 à 15:40, Arnaldo Carvalho de Melo a écrit :
> > Em Fri, Sep 22, 2017 at 01:20:43PM +0200, Christophe Leroy escreveu:
> > > After update of kernel, perf tool doesn't run anymore on my
> > > 32MB RAM powerpc board, but still runs on a 128MB RAM board:
> > 
> > Cleaning up my inbox, found this one, simple enough, still applies,
> > applied.
> 
> Did you finally apply it ? I can't see it in linux-next. Will it be merged
> into 4.19 ?

Sure, applied it finally, 

- Arnaldo

^ permalink raw reply

* Re: [PATCH] powerpc/pasemi: Seach for PCI root bus by compatible property
From: Michael Ellerman @ 2018-08-01 14:36 UTC (permalink / raw)
  To: Christian Zigotzky, Darren Stevens, linuxppc-dev; +Cc: Olof Johansson
In-Reply-To: <fde71219-5e0e-cc12-a31f-258ce547f471@xenosoft.de>

Christian Zigotzky <chzigotzky@xenosoft.de> writes:

> Just for info: I tested it on my Nemo board today and it works.

Awesome thanks.

That's probably sufficient to merge it, and if it breaks anything we can
always revert it.

cheers

> On 31 July 2018 at 2:04PM, Michael Ellerman wrote:
>> Michael Ellerman <mpe@ellerman.id.au> writes:
>>> Darren Stevens <darren@stevens-zone.net> writes:
>>>
>>>> Pasemi arch code finds the root of the PCI-e bus by searching the
>>>> device-tree for a node called 'pxp'. But the root bus has a
>>>> compatible property of 'pasemi,rootbus' so search for that instead.
>>>>
>>>> Signed-off-by: Darren Stevens <darren@stevens-zone.net>
>>>> ---
>>>>
>>>> This works on the Amigaone X1000, I don't know if this method of
>>>> finding the pci bus was there bcause of earlier firmwares.
>>> Does anyone have another pasemi board they can test this on?
>>>
>>> The last time I plugged mine in it popped the power supply and took out
>>> power to half the office :) - I haven't had a chance to try it since.
>> I actually I remembered I have a device tree lying around from an electra.
>>
>> It has:
>>
>>    [I] home:pxp@0,80000000(7)(I)> lsprop name compatible
>>    name             "pxp"
>>    compatible       "pasemi,rootbus"
>>                     "pa-pxp"
>>
>>
>> So it looks like the patch would work fine on it at least.
>>
>> cheers
>>
>>>> diff --git a/arch/powerpc/platforms/pasemi/pci.c b/arch/powerpc/platforms/pasemi/pci.c
>>>> index c7c8607..be62380 100644
>>>> --- a/arch/powerpc/platforms/pasemi/pci.c
>>>> +++ b/arch/powerpc/platforms/pasemi/pci.c
>>>> @@ -216,6 +216,7 @@ static int __init pas_add_bridge(struct device_node *dev)
>>>>   void __init pas_pci_init(void)
>>>>   {
>>>>      struct device_node *np, *root;
>>>> +   int res;
>>>>   
>>>>      root = of_find_node_by_path("/");
>>>>      if (!root) {
>>>> @@ -226,11 +227,11 @@ void __init pas_pci_init(void)
>>>>   
>>>>      pci_set_flags(PCI_SCAN_ALL_PCIE_DEVS);
>>>>   
>>>> -   for (np = NULL; (np = of_get_next_child(root, np)) != NULL;)
>>>> -       if (np->name && !strcmp(np->name, "pxp") && !pas_add_bridge(np))
>>>> -           of_node_get(np);
>>>> -
>>>> -   of_node_put(root);
>>>> +   np = of_find_compatible_node(root, NULL, "pasemi,rootbus");
>>>> +   if (np) {
>>>> +       res = pas_add_bridge(np);
>>>> +       of_node_put(np);
>>>> +   }
>>>>   }
>>>>   
>>>>   void __iomem *pasemi_pci_getcfgaddr(struct pci_dev *dev, int offset)

^ permalink raw reply

* Re: [PATCH] powerpc/mobility: Fix node detach/rename problem
From: Michael Ellerman @ 2018-08-01 14:35 UTC (permalink / raw)
  To: Tyrel Datwyler, Michael Bringmann, linuxppc-dev
  Cc: Nathan Fontenot, Thomas Falcon, John Allen
In-Reply-To: <8bf95369-2625-146a-34d4-d051322e06ff@linux.vnet.ibm.com>

Tyrel Datwyler <tyreld@linux.vnet.ibm.com> writes:
> On 07/30/2018 11:42 PM, Michael Ellerman wrote:
>> Tyrel Datwyler <tyreld@linux.vnet.ibm.com> writes:
>>> On 07/29/2018 06:11 AM, Michael Bringmann wrote:
>>>> During LPAR migration, the content of the device tree/sysfs may
>>>> be updated including deletion and replacement of nodes in the
>>>> tree.  When nodes are added to the internal node structures, they
>>>> are appended in FIFO order to a list of nodes maintained by the
>>>> OF code APIs.  When nodes are removed from the device tree, they
>>>> are marked OF_DETACHED, but not actually deleted from the system
>>>> to allow for pointers cached elsewhere in the kernel.  The order
>>>> and content of the entries in the list of nodes is not altered,
>>>> though.
>>>>
>>>> During LPAR migration some common nodes are deleted and re-added
>>>> e.g. "ibm,platform-facilities".  If a node is re-added to the OF
>>>> node lists, the of_attach_node function checks to make sure that
>>>> the name + ibm,phandle of the to-be-added data is unique.  As the
>>>> previous copy of a re-added node is not modified beyond the addition
>>>> of a bit flag, the code (1) finds the old copy, (2) prints a WARNING
>>>> notice to the console, (3) renames the to-be-added node to avoid
>>>> filename collisions within a directory, and (3) adds entries to
>>>> the sysfs/kernfs.
>>>
>>> So, this patch actually just band aids over the real problem. This is
>>> a long standing problem with several PFO drivers leaking references.
>>> The issue here is that, during the device tree update that follows a
>>> migration. the update of the ibm,platform-facilities node and friends
>>> below are always deleted and re-added on the destination lpar and
>>> subsequently the leaked references prevent the devices nodes from
>>> every actually being properly cleaned up after detach. Thus, leading
>>> to the issue you are observing.
>
> So, it was the end of the day, and I kind of glossed over the issue
> Michael was trying to address with an issue that I remembered that had
> been around for awhile.

Sure, I know that feeling :)

>> Leaking references shouldn't affect the node being detached from the
>> tree though.
>
> No, but it does prevent it from being freed from sysfs which leads to
> the sysfs entry renaming that happens when another node with the same
> name is attached.

OK.

>> See of_detach_node() calling __of_detach_node(), none of that depends on
>> the refcount.
>> 
>> It's only the actual freeing of the node, in of_node_release() that is
>> prevented by leaked reference counts.
>
> Right, but if we did refcount correctly there is the new problem that
> the node is freed and now the phandle cache points at freed memory in
> the case were no invalidation is done.

Yeah that's bad.

I guess no one is supposed to lookup that phandle anymore, but it's
super fragile.

>> So I agree we need to do a better job with the reference counting, but I
>> don't see how it is causing the problem here
>
> Now in regards to the phandle caching somehow I missed that code going
> into OF earlier this year. That should have had at least some
> discussion from our side based on the fact that it is built on dtc
> compiler assumption that there are a set number of phandles that are
> allocated starting at 1..n such that they are monotonically
> increasing. That has a nice fixed size with O(1) lookup time. Phyp
> doesn't guarantee any sort of niceness with nicely ordered phandles.
> From what I've seen there are a subset of phandles that decrease from
> (-1) monotonically, and then there are a bunch of random values for
> cpus and IOAs. Thinking on it might not be that big of a deal as we
> just end up in the case where we have a phandle collision one makes it
> into the cache and is optimized while the other doesn't. On another
> note, they clearly hit a similar issue during overlay attach and
> remove, and as Rob pointed out their solution to handle it is a full
> cache invalidation followed by rescanning the whole tree to rebuild
> it. Seeing as our dynamic lifecycle is node by node this definitely
> adds some overhead.

Yeah I also should have noticed it going in, but despite being
subscribed to the devicetree list I didn't spot it in the flood.

AIUI the 1..n assumption is not about correctness but if our phandles
violate that we may not be getting any benefit.

Anyway yeah lots of things to look at tomorrow :)

cheers

^ permalink raw reply

* Re: phandle_cache vs of_detach_node (was Re: [PATCH] powerpc/mobility: Fix node detach/rename problem)
From: Michael Ellerman @ 2018-08-01 14:26 UTC (permalink / raw)
  To: Frank Rowand, Rob Herring; +Cc: mwb, linuxppc-dev, devicetree
In-Reply-To: <85014a16-cd2c-0137-72e5-54d4bcdf788a@gmail.com>

Frank Rowand <frowand.list@gmail.com> writes:
> On 07/31/18 07:17, Rob Herring wrote:
>> On Tue, Jul 31, 2018 at 12:34 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>>>
>>> Hi Rob/Frank,
>>>
>>> I think we might have a problem with the phandle_cache not interacting
>>> well with of_detach_node():
>> 
>> Probably needs a similar fix as this commit did for overlays:
>> 
>> commit b9952b5218added5577e4a3443969bc20884cea9
>> Author: Frank Rowand <frank.rowand@sony.com>
>> Date:   Thu Jul 12 14:00:07 2018 -0700
>> 
>>     of: overlay: update phandle cache on overlay apply and remove
>> 
>>     A comment in the review of the patch adding the phandle cache said that
>>     the cache would have to be updated when modules are applied and removed.
>>     This patch implements the cache updates.
>> 
>>     Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of
>> of_find_node_by_phandle()")
>>     Reported-by: Alan Tull <atull@kernel.org>
>>     Suggested-by: Alan Tull <atull@kernel.org>
>>     Signed-off-by: Frank Rowand <frank.rowand@sony.com>
>>     Signed-off-by: Rob Herring <robh@kernel.org>
>
> Agreed.  Sorry about missing the of_detach_node() case.
>
>
>> Really what we need here is an "invalidate phandle" function rather
>> than free and re-allocate the whole damn cache.
>
> The big hammer approach was chosen to avoid the race conditions that
> would otherwise occur.  OF does not have a locking strategy that
> would be able to protect against the races.
>
> We could maybe implement a slightly smaller hammer by (1) disabling
> the cache, (2) invalidate a phandle entry in the cache, (3) re-enable
> the cache.  That is an off the cuff thought - I would have to look
> a little bit more carefully to make sure it would work.
>
> But I don't see a need to add the complexity of the smaller hammer
> or the bigger hammer of proper locking _unless_ we start seeing that
> the cache is being freed and re-allocated frequently.  For overlays
> I don't expect the high frequency because it happens on a per overlay
> removal basis (not per node removal basis).  For of_detach_node() the
> event _is_ on a per node removal basis.  Michael, do you expect node
> removals to be a frequent event with low latency being important?  If
> so, a rough guess on what the frequency would be?

No in practice we expect them to be fairly rare and in the thousands at
most I think.

TBH you could just disable the phandle_cache on the first detach and
that would be fine by me. I don't think we've ever noticed phandle
lookups being much of a problem for us on our hardware.

cheers

^ permalink raw reply

* Re: [PATCH v3 9/9] powerpc/traps: Add line prefix in show_instructions()
From: Michael Ellerman @ 2018-08-01 14:14 UTC (permalink / raw)
  To: Christophe LEROY, Murilo Opsfelder Araujo, linux-kernel
  Cc: Alastair D'Silva, Andrew Donnellan, Balbir Singh,
	Benjamin Herrenschmidt, Cyril Bur, Eric W . Biederman,
	Joe Perches, Michael Neuling, Nicholas Piggin, Paul Mackerras,
	Simon Guo, Sukadev Bhattiprolu, Tobin C . Harding, linuxppc-dev
In-Reply-To: <c7e0fe17-8c4e-4ccc-dadf-9b320ca5a1d2@c-s.fr>

Christophe LEROY <christophe.leroy@c-s.fr> writes:

> Le 31/07/2018 =C3=A0 16:50, Murilo Opsfelder Araujo a =C3=A9crit=C2=A0:
>> Remove "Instruction dump:" line by adding a prefix to display current->c=
omm
>> and current->pid, along with the instructions dump.
>>=20
>> The prefix can serve as a glue that links the instructions dump to its
>> originator, allowing messages to be interleaved in the logs.
>>=20
>> Before this patch, a page fault looked like:
>>=20
>>    pandafault[10524]: segfault (11) at 100007d0 nip 1000061c lr 7fffbd29=
5100 code 2 in pandafault[10000000+10000]
>>    Instruction dump:
>>    4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22ff=
fe
>>    392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f=
0040
>>=20
>> After this patch, it looks like:
>>=20
>>    pandafault[10850]: segfault (11) at 100007d0 nip 1000061c lr 7fff9f3e=
5100 code 2 in pandafault[10000000+10000]
>>    pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8=
 f821ffc1 7c3f0b78 3d22fffe
>>    pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <9949000=
0> 39200000 7d234b78 383f0040
>>=20
>> Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
>
> Does the script scripts/decode_stacktrace.sh also works with this format=
=20
> change ?

Did it work before? I've never used it.

Would be great if it did work though.

cheers

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/pseries: Wait for completion of hotplug events during PRRN handling
From: Michael Ellerman @ 2018-08-01 13:16 UTC (permalink / raw)
  To: John Allen; +Cc: linuxppc-dev, nfont
In-Reply-To: <20180723152223.mydr5dovcv26domv@p50>

John Allen <jallen@linux.ibm.com> writes:

> On Mon, Jul 23, 2018 at 11:41:24PM +1000, Michael Ellerman wrote:
>>John Allen <jallen@linux.ibm.com> writes:
>>
>>> While handling PRRN events, the time to handle the actual hotplug events
>>> dwarfs the time it takes to perform the device tree updates and queue the
>>> hotplug events. In the case that PRRN events are being queued continuously,
>>> hotplug events have been observed to be queued faster than the kernel can
>>> actually handle them. This patch avoids the problem by waiting for a
>>> hotplug request to complete before queueing more hotplug events.

Have you tested this patch in isolation, ie. not with patch 1?

>>So do we need the hotplug work queue at all? Can we just call
>>handle_dlpar_errorlog() directly?
>>
>>Or are we using the work queue to serialise things? And if so would a
>>mutex be better?
>
> Right, the workqueue is meant to serialize all hotplug events and it 
> gets used for more than just PRRN events. I believe the motivation for 
> using the workqueue over a mutex is that KVM guests initiate hotplug 
> events through the hotplug interrupt and can queue fairly large requests 
> meaning that in this scenario, waiting for a lock would block interrupts
> for a while.

OK, but that just means that path needs to schedule work to run later.

> Using the workqueue allows us to serialize hotplug events 
> from different sources in the same way without worrying about the 
> context in which the event is generated.

A lock would be so much simpler.

It looks like we have three callers of queue_hotplug_event(), the dlpar
code, the mobility code and the ras interrupt.

The dlpar code already waits synchronously:

  init_completion(&hotplug_done);
  queue_hotplug_event(hp_elog, &hotplug_done, &rc);
  wait_for_completion(&hotplug_done);

You're changing mobility to do the same (this patch), leaving only the
ras interrupt that actually queues work and returns.


So it really seems like a mutex would do the trick, and the ras
interrupt would be the only case that needs to schedule work for later.

cheers

^ permalink raw reply

* Re: [RESEND PATCH 6/6] arm64: enable RapidIO menu in Kconfig
From: Alex Bounine @ 2018-08-01 13:15 UTC (permalink / raw)
  To: Christoph Hellwig, Alexei Colin
  Cc: Catalin Marinas, Will Deacon, Andrew Morton, Russell King,
	John Paul Walters, x86, linuxppc-dev, linux-mips,
	linux-arm-kernel, linux-kernel
In-Reply-To: <20180801095404.GA17585@infradead.org>

On 2018-08-01 05:54 AM, Christoph Hellwig wrote:
> On Tue, Jul 31, 2018 at 10:29:54AM -0400, Alexei Colin wrote:
>> Platforms with a PCI bus will be offered the RapidIO menu since they may
>> be want support for a RapidIO PCI device. Platforms without a PCI bus
>> that might include a RapidIO IP block will need to "select HAS_RAPIDIO"
>> in the platform-/machine-specific "config ARCH_*" Kconfig entry.
>>
>> Tested that kernel builds for arm64 with RapidIO subsystem and
>> switch drivers enabled, also that the modules load successfully
>> on a custom Aarch64 Qemu model.
> 
> As said before, please include it from drivers/Kconfig so that _all_
> architectures supporting PCI (or other Rapidio attachements) get it
> and not some arbitrary selection of architectures.
> 
As it was replied earlier this is not a random selection of 
architectures but only ones that implement support for RapidIO as system 
bus. If other architectures choose to adopt RapidIO we will include them 
as well.

On some platforms RapidIO can be the only system bus available replacing 
PCI/PCIe or RapidIO can coexist with PCIe.

As it is done now, RapidIO is configured in "Bus Options" (x86/PPC) or 
"Bus Support" (ARMs) sub-menu and from system configuration option it 
should be kept this way.

Current location of RAPIDIO configuration option is familiar to users of 
PowerPC and x86 platforms, and is similarly available in some ARM 
manufacturers kernel code trees.

drivers/Kconfig will be used for configuring drivers for peripheral 
RapidIO devices if/when such device drivers will be published.

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/pseries: Avoid blocking rtas polling handling multiple PRRN events
From: Michael Ellerman @ 2018-08-01 13:02 UTC (permalink / raw)
  To: John Allen; +Cc: linuxppc-dev, nfont
In-Reply-To: <20180723150548.6syfciqtkohec66r@p50>

Hi John,

I'm still not sure about this one.

John Allen <jallen@linux.ibm.com> writes:
> On Mon, Jul 23, 2018 at 11:27:56PM +1000, Michael Ellerman wrote:
>>Hi John,
>>
>>I'm a bit puzzled by this one.
>>
>>John Allen <jallen@linux.ibm.com> writes:
>>> When a PRRN event is being handled and another PRRN event comes in, the
>>> second event will block rtas polling waiting on the first to complete,
>>> preventing any further rtas events from being handled. This can be
>>> especially problematic in case that PRRN events are continuously being
>>> queued in which case rtas polling gets indefinitely blocked completely.
>>>
>>> This patch introduces a mutex that prevents any subsequent PRRN events from
>>> running while there is a prrn event being handled, allowing rtas polling to
>>> continue normally.
>>>
>>> Signed-off-by: John Allen <jallen@linux.ibm.com>
>>> ---
>>> v2:
>>>   -Unlock prrn_lock when PRRN operations are complete, not after handler is
>>>    scheduled.
>>>   -Remove call to flush_work, the previous broken method of serializing
>>>    PRRN events.
>>> ---
>>>  arch/powerpc/kernel/rtasd.c | 10 +++++++---
>>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
>>> index 44d66c33d59d..845fc5aec178 100644
>>> --- a/arch/powerpc/kernel/rtasd.c
>>> +++ b/arch/powerpc/kernel/rtasd.c
>>> @@ -284,15 +286,17 @@ static void prrn_work_fn(struct work_struct *work)
>>>  	 */
>>>  	pseries_devicetree_update(-prrn_update_scope);
>>>  	numa_update_cpu_topology(false);
>>> +	mutex_unlock(&prrn_lock);
>>>  }
>>>
>>>  static DECLARE_WORK(prrn_work, prrn_work_fn);
>>>
>>>  static void prrn_schedule_update(u32 scope)
>>>  {
>>> -	flush_work(&prrn_work);
>>
>>This seems like it's actually the core of the change. Previously we were
>>basically blocking on the flush before continuing.
>
> The idea here is to replace the blocking flush_work with a non-blocking 
> mutex. So rather than waiting on the running PRRN event to complete, we 
> bail out since a PRRN event is already running.

OK, but why is it OK to bail out?

The firmware sent you an error log asking you to do something, with a
scope value that has some meaning, and now you're just going to drop
that on the floor?

Maybe it is OK to just drop these events? Or maybe you're saying that
because the system is crashing under the load of too many events it's OK
to drop the events in this case.

> The situation this is 
> meant to address is flooding the workqueue with PRRN events, which like 
> the situation in patch 2/2, these can be queued up faster than they can 
> actually be handled.

I'm not really sure why this is a problem though.

The current code synchronously processes the events, so there should
only ever be one in flight.

I guess the issue is that each one can queue multiple events on the
hotplug work queue?

But still, we have terabytes of RAM, we should be able to queue a lot
of events before it becomes a problem.

So what exactly is getting flooded, what's the symptom?

If the queuing of the hotplug events is the problem, then why don't we
stop doing that? We could just process them synchronously from the PRRN
update, that would naturally throttle them.

cheers

^ permalink raw reply

* Re: [PATCH v5 00/11] hugetlb: Factorize hugetlb architecture primitives
From: Alexandre Ghiti @ 2018-08-01 11:50 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, mike.kravetz, linux, catalin.marinas, will.deacon,
	tony.luck, fenghua.yu, ralf, paul.burton, jhogan, jejb, deller,
	benh, paulus, mpe, ysato, dalias, davem, tglx, mingo, hpa, x86,
	arnd, linux-arm-kernel, linux-kernel, linux-ia64, linux-mips,
	linux-parisc, linuxppc-dev, linux-sh, sparclinux, linux-arch
In-Reply-To: <20180731160640.11306628@doriath>

On 07/31/2018 10:06 PM, Luiz Capitulino wrote:
> On Tue, 31 Jul 2018 06:01:44 +0000
> Alexandre Ghiti <alex@ghiti.fr> wrote:
>
>> [CC linux-mm for inclusion in -mm tree]
>>
>> In order to reduce copy/paste of functions across architectures and then
>> make riscv hugetlb port (and future ports) simpler and smaller, this
>> patchset intends to factorize the numerous hugetlb primitives that are
>> defined across all the architectures.
> [...]
>
>>   15 files changed, 139 insertions(+), 382 deletions(-)
> I imagine you're mostly interested in non-x86 review at this point, but
> as this series is looking amazing:
>
> Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>

It's always good to have another feedback :)
Thanks for your review Luiz,

Alex

^ permalink raw reply

* Re: [PATCH] powerpc/powernv/opal: Use standard interrupts property when available
From: Michael Ellerman @ 2018-08-01 11:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev; +Cc: Stewart Smith
In-Reply-To: <1523344570.11062.65.camel@kernel.crashing.org>

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> diff --git a/arch/powerpc/platforms/powernv/opal-irqchip.c b/arch/powerpc/platforms/powernv/opal-irqchip.c
> index 9d1b8c0aaf93..46785eaf625d 100644
> --- a/arch/powerpc/platforms/powernv/opal-irqchip.c
> +++ b/arch/powerpc/platforms/powernv/opal-irqchip.c
> @@ -216,67 +214,91 @@ int __init opal_event_init(void)
...
>  
>  	/* Install interrupt handlers */
>  	for (i = 0; i < opal_irq_count; i++) {
> -		unsigned int virq;
> -		char *name;
> +		struct resource *r = &opal_irqs[i];
> +		const char *name;
>  
> -		/* Get hardware and virtual IRQ */
> -		virq = irq_create_mapping(NULL, irqs[i]);
> -		if (!virq) {
> -			pr_warn("Failed to map irq 0x%x\n", irqs[i]);
> -			continue;
> -		}
> -
> -		if (names[i] && strlen(names[i]))
> -			name = kasprintf(GFP_KERNEL, "opal-%s", names[i]);
> +		/* Prefix name */
> +		if (r->name)
> +			name = kasprintf(GFP_KERNEL, "opal-%s", r->name);
>  		else
>  			name = kasprintf(GFP_KERNEL, "opal");

I'm getting:

root@p85:/proc/irq# find . -name '*opal*'
...
./63/opal-ipmi
./227/opal-
./228/opal-
./231/opal-
./232/opal-
./247/opal-
./248/opal-
./483/opal-
./484/opal-
./487/opal-
./488/opal-
./500/opal-
./510/opal-
./511/opal-
./512/opal-


I fixed it with:

diff --git a/arch/powerpc/platforms/powernv/opal-irqchip.c b/arch/powerpc/platforms/powernv/opal-irqchip.c
index 41c3303157f7..a2d067925140 100644
--- a/arch/powerpc/platforms/powernv/opal-irqchip.c
+++ b/arch/powerpc/platforms/powernv/opal-irqchip.c
@@ -275,7 +275,7 @@ int __init opal_event_init(void)
 		const char *name;
 
 		/* Prefix name */
-		if (r->name)
+		if (r->name && strlen(r->name))
 			name = kasprintf(GFP_KERNEL, "opal-%s", r->name);
 		else
 			name = kasprintf(GFP_KERNEL, "opal");


cheers

^ permalink raw reply related

* Re: Infinite looping observed in __offline_pages
From: Michal Hocko @ 2018-08-01 11:20 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: John Allen, n-horiguchi, linuxppc-dev, linux-kernel,
	kamezawa.hiroyu, mgorman
In-Reply-To: <87r2jifimk.fsf@concordia.ellerman.id.au>

On Wed 01-08-18 21:09:39, Michael Ellerman wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> > On Wed 25-07-18 13:11:15, John Allen wrote:
> > [...]
> >> Does a failure in do_migrate_range indicate that the range is unmigratable
> >> and the loop in __offline_pages should terminate and goto failed_removal? Or
> >> should we allow a certain number of retrys before we
> >> give up on migrating the range?
> >
> > Unfortunatelly not. Migration code doesn't tell a difference between
> > ephemeral and permanent failures.
> 
> What's to stop an ephemeral failure happening repeatedly?

If there is a short term pin on the page that prevents the migration
then the holder of the pin should realease it and the next retry will
succeed the migration. If the page gets freed on the way then it will
not be reallocated because they are isolated already. I can only see
complete OOM to be the reason to fail allocation of the target place
as the migration failure and that is highly unlikely and sooner or later
trigger the oom killer and release some memory.

The biggest problem here is that we cannot tell ephemeral and long term
pins...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: Infinite looping observed in __offline_pages
From: Michael Ellerman @ 2018-08-01 11:09 UTC (permalink / raw)
  To: Michal Hocko, John Allen
  Cc: n-horiguchi, linuxppc-dev, linux-kernel, kamezawa.hiroyu, mgorman
In-Reply-To: <20180725200336.GP28386@dhcp22.suse.cz>

Michal Hocko <mhocko@kernel.org> writes:
> On Wed 25-07-18 13:11:15, John Allen wrote:
> [...]
>> Does a failure in do_migrate_range indicate that the range is unmigratable
>> and the loop in __offline_pages should terminate and goto failed_removal? Or
>> should we allow a certain number of retrys before we
>> give up on migrating the range?
>
> Unfortunatelly not. Migration code doesn't tell a difference between
> ephemeral and permanent failures.

What's to stop an ephemeral failure happening repeatedly?

cheers

^ permalink raw reply

* RE: [PATCH 3/3] ptp_qoriq: convert to use module parameters for initialization
From: Y.b. Lu @ 2018-08-01 10:10 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev@vger.kernel.org, Madalin-cristian Bucur, Rob Herring,
	Shawn Guo, David S . Miller, devicetree@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20180801061515.krn52iaq2kqdi3wo@localhost>

Hi Richard,

> -----Original Message-----
> From: Richard Cochran [mailto:richardcochran@gmail.com]
> Sent: Wednesday, August 1, 2018 2:15 PM
> To: Y.b. Lu <yangbo.lu@nxp.com>
> Cc: netdev@vger.kernel.org; Madalin-cristian Bucur
> <madalin.bucur@nxp.com>; Rob Herring <robh+dt@kernel.org>; Shawn Guo
> <shawnguo@kernel.org>; David S . Miller <davem@davemloft.net>;
> devicetree@vger.kernel.org; linuxppc-dev@lists.ozlabs.org;
> linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 3/3] ptp_qoriq: convert to use module parameters for
> initialization
>=20
> On Wed, Aug 01, 2018 at 04:36:40AM +0000, Y.b. Lu wrote:
>=20
> > Could I add a function to calculate a set of default register values
> > to initialize ptp timer when dts method failed to get required
> > properties in driver?
>=20
> Yes, it would be ideal if the driver can pick correct values automaticall=
y.
>=20
> However, the frequency on the FIPER outputs can't be configured
> automatically, and we don't have an API for the user to choose this.

[Y.b. Lu] Thanks a lot. Just let me send out the v2 patch for your reviewin=
g.

>=20
> > I think this will be useful. The ptp timer on new platforms (you may
> > see two dts patches in this patchset. Many platforms will be
> > affected.) will work without these dts properties. If user want
> > specific setting, they can set dts properties.
>=20
> Sure.
>=20
> Thanks,
> Richard

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox