LinuxPPC-Dev Archive on lore.kernel.org
* Re: [PATCH v3 3/3] powerpc/64s: feature: Work around inline asm issues
From: Segher Boessenkool @ 2020-11-23 20:08 UTC
  To: Bill Wendling; +Cc: Nick Desaulniers, linuxppc-dev
In-Reply-To: <CAGG=3QXR=Yfh8PNa4m-kQLTBP4YKD8OGm_6fSUgeasQ1ar9b2g@mail.gmail.com>

On Mon, Nov 23, 2020 at 12:01:01PM -0800, Bill Wendling wrote:
> On Mon, Nov 23, 2020 at 11:58 AM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > > On Sun, Nov 22, 2020 at 10:36 PM Segher Boessenkool
> > > <segher@kernel.crashing.org> wrote:
> > > > "true" (as a result of a comparison) in as is -1, not 1.
> >
> > On Mon, Nov 23, 2020 at 11:43:11AM -0800, Bill Wendling wrote:
> > > What Segher said. :-) Also, if you reverse the comparison, you'll get
> > > a build error.
> >
> > But that means your patch is the wrong way around?
> >
> > -       .ifgt (label##4b- label##3b)-(label##2b- label##1b);    \
> > -       .error "Feature section else case larger than body";    \
> > -       .endif;                                                 \
> > +       .org . - ((label##4b-label##3b) > (label##2b-label##1b)); \
> >
> > It should be a + in that last line, not a -.
> 
> I said so in a follow up email.

Yeah, and that arrived a second after I pressed "send" :-)

> > Was this tested?
> >
> Please don't be insulting. Anyone can make an error.

Absolutely, but it is just a question.  It seems you could improve that
testing!  It helps you yourself most of all ;-)


Segher


* Re: [PATCH v3 3/3] powerpc/64s: feature: Work around inline asm issues
From: Bill Wendling @ 2020-11-23 20:01 UTC
  To: Segher Boessenkool; +Cc: Nick Desaulniers, linuxppc-dev
In-Reply-To: <20201123195622.GI2672@gate.crashing.org>

On Mon, Nov 23, 2020 at 11:58 AM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> > On Sun, Nov 22, 2020 at 10:36 PM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> > > "true" (as a result of a comparison) in as is -1, not 1.
>
> On Mon, Nov 23, 2020 at 11:43:11AM -0800, Bill Wendling wrote:
> > What Segher said. :-) Also, if you reverse the comparison, you'll get
> > a build error.
>
> But that means your patch is the wrong way around?
>
> -       .ifgt (label##4b- label##3b)-(label##2b- label##1b);    \
> -       .error "Feature section else case larger than body";    \
> -       .endif;                                                 \
> +       .org . - ((label##4b-label##3b) > (label##2b-label##1b)); \
>
> It should be a + in that last line, not a -.

I said so in a follow up email.

> Was this tested?
>
Please don't be insulting. Anyone can make an error.

-bw


* Re: [PATCH v3 3/3] powerpc/64s: feature: Work around inline asm issues
From: Segher Boessenkool @ 2020-11-23 19:56 UTC
  To: Bill Wendling; +Cc: Nick Desaulniers, linuxppc-dev
In-Reply-To: <CAGG=3QVjSAwU+ebvH=Lk5YVMxW7=ThvkJXGPw+95nYxxuurMig@mail.gmail.com>

(Please don't top-post.)

> On Sun, Nov 22, 2020 at 10:36 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > "true" (as a result of a comparison) in as is -1, not 1.

On Mon, Nov 23, 2020 at 11:43:11AM -0800, Bill Wendling wrote:
> What Segher said. :-) Also, if you reverse the comparison, you'll get
> a build error.

But that means your patch is the wrong way around?

-	.ifgt (label##4b- label##3b)-(label##2b- label##1b);	\
-	.error "Feature section else case larger than body";	\
-	.endif;							\
+	.org . - ((label##4b-label##3b) > (label##2b-label##1b)); \

It should be a + in that last line, not a -.  Was this tested?


Segher


* Re: [PATCH v3 3/3] powerpc/64s: feature: Work around inline asm issues
From: Bill Wendling @ 2020-11-23 19:53 UTC
  To: Segher Boessenkool; +Cc: Nick Desaulniers, linuxppc-dev
In-Reply-To: <CAGG=3QVjSAwU+ebvH=Lk5YVMxW7=ThvkJXGPw+95nYxxuurMig@mail.gmail.com>

After looking at this, I suspect that the correct change should be:

  .org . + ((label##4b-label##3b) > (label##2b-label##1b));

I'm sorry about that. I can submit another version of the patch.

-bw

On Mon, Nov 23, 2020 at 11:43 AM Bill Wendling <morbo@google.com> wrote:
>
> What Segher said. :-) Also, if you reverse the comparison, you'll get
> a build error.
>
> On Sun, Nov 22, 2020 at 10:36 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > On Mon, Nov 23, 2020 at 04:44:56PM +1100, Michael Ellerman wrote:
> > > If I hard code:
> > >
> > >       .org . - (1);
> > >
> > > It fails as expected.
> > >
> > > But if I hard code:
> > >
> > >       .org . - (1 > 0);
> > >
> > > It builds?
> >
> > "true" (as a result of a comparison) in as is -1, not 1.
> >
> >
> > Segher
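
The arithmetic behind this exchange can be modelled in a few lines of C. This is only a sketch: it assumes the two documented GNU as behaviours discussed above (a true comparison evaluates to -1, and .org may never move the location counter backwards), and the helper names and byte sizes are hypothetical.

```c
#include <stdbool.h>

/* GNU as evaluates a true comparison to -1 and a false one to 0. */
static long gas_greater(long a, long b)
{
	return a > b ? -1 : 0;
}

/* .org may advance the location counter or leave it alone, never move it back. */
static bool org_accepted(long dot, long target)
{
	return target >= dot;
}

/* Models ".org . - (else_size > body_size)", the form in v3 of the patch. */
static bool v3_check_builds(long dot, long else_size, long body_size)
{
	return org_accepted(dot, dot - gas_greater(else_size, body_size));
}

/* Models ".org . + (else_size > body_size)", the corrected form. */
static bool fixed_check_builds(long dot, long else_size, long body_size)
{
	return org_accepted(dot, dot + gas_greater(else_size, body_size));
}
```

With an else case of 8 bytes and a body of 4 bytes, v3_check_builds() returns true (the counter moves forward by one, so nothing is reported), while fixed_check_builds() returns false, which is the build error the check is meant to produce.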


* Re: [PATCH v3 3/3] powerpc/64s: feature: Work around inline asm issues
From: Bill Wendling @ 2020-11-23 19:43 UTC
  To: Segher Boessenkool; +Cc: Nick Desaulniers, linuxppc-dev
In-Reply-To: <20201123063432.GG2672@gate.crashing.org>

What Segher said. :-) Also, if you reverse the comparison, you'll get
a build error.

On Sun, Nov 22, 2020 at 10:36 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Mon, Nov 23, 2020 at 04:44:56PM +1100, Michael Ellerman wrote:
> > If I hard code:
> >
> >       .org . - (1);
> >
> > It fails as expected.
> >
> > But if I hard code:
> >
> >       .org . - (1 > 0);
> >
> > It builds?
>
> "true" (as a result of a comparison) in as is -1, not 1.
>
>
> Segher


* Re: [PATCH v3 2/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too
From: Oleg Nesterov @ 2020-11-23 18:01 UTC
  To: Christophe Leroy
  Cc: Christophe Leroy, Madhavan Srinivasan, linuxppc-dev,
	Nicholas Piggin, linux-kernel, Paul Mackerras, Al Viro,
	Aneesh Kumar K.V, Jan Kratochvil
In-Reply-To: <20201119224347.GC5138@redhat.com>

Christophe, et al,

So what?

Are you going to push your change or should I re-send 1-2 without
whitespace cleanups?

On 11/19, Oleg Nesterov wrote:
>
> On 11/19, Christophe Leroy wrote:
> >
> > I think the following should work, and not require the first patch (compile
> > tested only).
> >
> > --- a/arch/powerpc/kernel/ptrace/ptrace-view.c
> > +++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
> > @@ -234,9 +234,21 @@ static int gpr_get(struct task_struct *target, const
> > struct user_regset *regset,
> >  	BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
> >  		     offsetof(struct pt_regs, msr) + sizeof(long));
> > 
> > +#ifdef CONFIG_PPC64
> > +	membuf_write(&to, &target->thread.regs->orig_gpr3,
> > > +		     offsetof(struct pt_regs, softe) - offsetof(struct pt_regs, orig_gpr3));
> > +	membuf_store(&to, 1UL);
> > +
> > +	BUILD_BUG_ON(offsetof(struct pt_regs, trap) !=
> > +		     offsetof(struct pt_regs, softe) + sizeof(long));
> > +
> > +	membuf_write(&to, &target->thread.regs->trap,
> > +		     sizeof(struct user_pt_regs) - offsetof(struct pt_regs, trap));
> > +#else
> >  	membuf_write(&to, &target->thread.regs->orig_gpr3,
> >  			sizeof(struct user_pt_regs) -
> >  			offsetof(struct pt_regs, orig_gpr3));
> > +#endif
> >  	return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
> >  				 sizeof(struct user_pt_regs));
> >  }
> 
> Probably yes.
> 
> This mirrors the previous patch I sent (https://lore.kernel.org/lkml/20190917143753.GA12300@redhat.com/)
> and this is exactly what I tried to avoid, we can make a simpler fix now.
> 
> But let me repeat, I agree with any fix even if imo my version simplifies the code; just
> commit this change and let's forget this problem.
> 
> Oleg.
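
The copy pattern in the quoted CONFIG_PPC64 branch can be sketched in userspace C. This is a minimal illustration only: struct fake_regs and its field order are hypothetical stand-ins for the tail of struct pt_regs, and memcpy() stands in for membuf_write() and membuf_store().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for part of struct pt_regs; field order is illustrative. */
struct fake_regs {
	unsigned long msr;
	unsigned long orig_gpr3;
	unsigned long ctr;
	unsigned long softe;
	unsigned long trap;
	unsigned long dar;
};

/*
 * Copy everything from orig_gpr3 to the end of the struct into out,
 * but always report softe as 1, whatever the struct holds.
 * Returns the number of bytes written.
 */
static size_t copy_tail_softe_one(const struct fake_regs *regs, unsigned char *out)
{
	const unsigned char *base = (const unsigned char *)regs;
	const unsigned long one = 1UL;
	size_t first = offsetof(struct fake_regs, orig_gpr3);
	size_t head = offsetof(struct fake_regs, softe) - first;
	size_t tail = sizeof(*regs) - offsetof(struct fake_regs, trap);
	unsigned char *p = out;

	memcpy(p, base + first, head);		/* orig_gpr3 up to, not including, softe */
	p += head;
	memcpy(p, &one, sizeof(one));		/* hard-wired softe value */
	p += sizeof(one);
	memcpy(p, base + offsetof(struct fake_regs, trap), tail);	/* trap to the end */
	p += tail;

	return (size_t)(p - out);
}
```

The three-step copy only works if trap immediately follows softe, which is what the BUILD_BUG_ON() in the quoted patch asserts for the real structure.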



* Re: [PATCH v2 04/19] powerpc/perf: move perf irq/nmi handling details into traps.c
From: Athira Rajeev @ 2020-11-23 17:54 UTC
  To: Nicholas Piggin; +Cc: linuxppc-dev
In-Reply-To: <20201111094410.3038123-5-npiggin@gmail.com>



> On 11-Nov-2020, at 3:13 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> This is required in order to allow more significant differences between
> NMI type interrupt handlers and regular asynchronous handlers.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> arch/powerpc/kernel/traps.c      | 31 +++++++++++++++++++++++++++-
> arch/powerpc/perf/core-book3s.c  | 35 ++------------------------------
> arch/powerpc/perf/core-fsl-emb.c | 25 -----------------------
> 3 files changed, 32 insertions(+), 59 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index 902fcbd1a778..7dda72eb97cc 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -1919,11 +1919,40 @@ void vsx_unavailable_tm(struct pt_regs *regs)
> }
> #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
> 
> -void performance_monitor_exception(struct pt_regs *regs)
> +static void performance_monitor_exception_nmi(struct pt_regs *regs)
> +{
> +	nmi_enter();
> +
> +	__this_cpu_inc(irq_stat.pmu_irqs);
> +
> +	perf_irq(regs);
> +
> +	nmi_exit();
> +}
> +
> +static void performance_monitor_exception_async(struct pt_regs *regs)
> {
> +	irq_enter();
> +
> 	__this_cpu_inc(irq_stat.pmu_irqs);
> 
> 	perf_irq(regs);
> +
> +	irq_exit();
> +}
> +
> +void performance_monitor_exception(struct pt_regs *regs)
> +{
> +	/*
> +	 * On 64-bit, if perf interrupts hit in a local_irq_disable
> +	 * (soft-masked) region, we consider them as NMIs. This is required to
> +	 * prevent hash faults on user addresses when reading callchains (and
> +	 * looks better from an irq tracing perspective).
> +	 */
> +	if (IS_ENABLED(CONFIG_PPC64) && unlikely(arch_irq_disabled_regs(regs)))
> +		performance_monitor_exception_nmi(regs);
> +	else
> +		performance_monitor_exception_async(regs);
> }
> 
> #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 08643cba1494..9fd8cae09218 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -109,10 +109,6 @@ static inline void perf_read_regs(struct pt_regs *regs)
> {
> 	regs->result = 0;
> }
> -static inline int perf_intr_is_nmi(struct pt_regs *regs)
> -{
> -	return 0;
> -}
> 
> static inline int siar_valid(struct pt_regs *regs)
> {
> @@ -328,15 +324,6 @@ static inline void perf_read_regs(struct pt_regs *regs)
> 	regs->result = use_siar;
> }
> 
> -/*
> - * If interrupts were soft-disabled when a PMU interrupt occurs, treat
> - * it as an NMI.
> - */
> -static inline int perf_intr_is_nmi(struct pt_regs *regs)
> -{
> -	return (regs->softe & IRQS_DISABLED);
> -}
> -

Hi Nick,

arch_irq_disabled_regs() checks whether regs->softe has IRQS_DISABLED set.
core-book3s uses the same logic in perf_intr_is_nmi() to decide whether the interrupt is an NMI. With the
changes in this patch, if I understood correctly, the irq/nmi handling is done in traps.c
rather than in the PMI interrupt handler. But could you please help me understand
what the perf weirdness (sometimes NMI, sometimes not) mentioned in the cover
letter is, and how this change fixes it?

Thanks
Athira
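
For reference, the dispatch that the patch moves into traps.c can be sketched as a small classifier. This is only an illustration: struct fake_pt_regs, the is_ppc64 parameter (standing in for IS_ENABLED(CONFIG_PPC64)) and the IRQS_DISABLED value used here are assumptions, not the kernel's definitions.

```c
#include <stdbool.h>

#define IRQS_DISABLED 1UL	/* bit value assumed for illustration */

/* Hypothetical stand-in for the one pt_regs field that matters here. */
struct fake_pt_regs {
	unsigned long softe;
};

enum pmi_path { PMI_ASYNC, PMI_NMI };

/* Same test as the removed perf_intr_is_nmi(): was the PMI taken soft-masked? */
static bool irq_disabled_regs(const struct fake_pt_regs *regs)
{
	return (regs->softe & IRQS_DISABLED) != 0;
}

/* Soft-masked on 64-bit means the NMI path; everything else is a regular async interrupt. */
static enum pmi_path classify_pmi(const struct fake_pt_regs *regs, bool is_ppc64)
{
	if (is_ppc64 && irq_disabled_regs(regs))
		return PMI_NMI;
	return PMI_ASYNC;
}
```

The decision itself is unchanged by the patch; what moves is where nmi_enter()/irq_enter() and their exit counterparts are called.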

> /*
>  * On processors like P7+ that have the SIAR-Valid bit, marked instructions
>  * must be sampled only if the SIAR-valid bit is set.
> @@ -2224,7 +2211,6 @@ static void __perf_event_interrupt(struct pt_regs *regs)
> 	struct perf_event *event;
> 	unsigned long val[8];
> 	int found, active;
> -	int nmi;
> 
> 	if (cpuhw->n_limited)
> 		freeze_limited_counters(cpuhw, mfspr(SPRN_PMC5),
> @@ -2232,18 +2218,6 @@ static void __perf_event_interrupt(struct pt_regs *regs)
> 
> 	perf_read_regs(regs);
> 
> -	/*
> -	 * If perf interrupts hit in a local_irq_disable (soft-masked) region,
> -	 * we consider them as NMIs. This is required to prevent hash faults on
> -	 * user addresses when reading callchains. See the NMI test in
> -	 * do_hash_page.
> -	 */
> -	nmi = perf_intr_is_nmi(regs);
> -	if (nmi)
> -		nmi_enter();
> -	else
> -		irq_enter();
> -
> 	/* Read all the PMCs since we'll need them a bunch of times */
> 	for (i = 0; i < ppmu->n_counter; ++i)
> 		val[i] = read_pmc(i + 1);
> @@ -2289,8 +2263,8 @@ static void __perf_event_interrupt(struct pt_regs *regs)
> 			}
> 		}
> 	}
> -	if (!found && !nmi && printk_ratelimit())
> -		printk(KERN_WARNING "Can't find PMC that caused IRQ\n");
> +	if (unlikely(!found) && !arch_irq_disabled_regs(regs))
> +		printk_ratelimited(KERN_WARNING "Can't find PMC that caused IRQ\n");
> 
> 	/*
> 	 * Reset MMCR0 to its normal value.  This will set PMXE and
> @@ -2300,11 +2274,6 @@ static void __perf_event_interrupt(struct pt_regs *regs)
> 	 * we get back out of this interrupt.
> 	 */
> 	write_mmcr0(cpuhw, cpuhw->mmcr.mmcr0);
> -
> -	if (nmi)
> -		nmi_exit();
> -	else
> -		irq_exit();
> }
> 
> static void perf_event_interrupt(struct pt_regs *regs)
> diff --git a/arch/powerpc/perf/core-fsl-emb.c b/arch/powerpc/perf/core-fsl-emb.c
> index e0e7e276bfd2..ee721f420a7b 100644
> --- a/arch/powerpc/perf/core-fsl-emb.c
> +++ b/arch/powerpc/perf/core-fsl-emb.c
> @@ -31,19 +31,6 @@ static atomic_t num_events;
> /* Used to avoid races in calling reserve/release_pmc_hardware */
> static DEFINE_MUTEX(pmc_reserve_mutex);
> 
> -/*
> - * If interrupts were soft-disabled when a PMU interrupt occurs, treat
> - * it as an NMI.
> - */
> -static inline int perf_intr_is_nmi(struct pt_regs *regs)
> -{
> -#ifdef __powerpc64__
> -	return (regs->softe & IRQS_DISABLED);
> -#else
> -	return 0;
> -#endif
> -}
> -
> static void perf_event_interrupt(struct pt_regs *regs);
> 
> /*
> @@ -659,13 +646,6 @@ static void perf_event_interrupt(struct pt_regs *regs)
> 	struct perf_event *event;
> 	unsigned long val;
> 	int found = 0;
> -	int nmi;
> -
> -	nmi = perf_intr_is_nmi(regs);
> -	if (nmi)
> -		nmi_enter();
> -	else
> -		irq_enter();
> 
> 	for (i = 0; i < ppmu->n_counter; ++i) {
> 		event = cpuhw->event[i];
> @@ -690,11 +670,6 @@ static void perf_event_interrupt(struct pt_regs *regs)
> 	mtmsr(mfmsr() | MSR_PMM);
> 	mtpmr(PMRN_PMGC0, PMGC0_PMIE | PMGC0_FCECE);
> 	isync();
> -
> -	if (nmi)
> -		nmi_exit();
> -	else
> -		irq_exit();
> }
> 
> void hw_perf_event_setup(int cpu)
> -- 
> 2.23.0
> 



* Re: Linux kernel: powerpc: RTAS calls can be used to compromise kernel integrity
From: Andrew Donnellan @ 2020-11-23 14:41 UTC
  To: oss-security, linuxppc-dev
In-Reply-To: <09cb1e1e-c71b-83a3-4c04-4e47e7c85342@linux.ibm.com>

On 9/10/20 12:20 pm, Andrew Donnellan wrote:
> The Linux kernel for powerpc has an issue with the Run-Time Abstraction 
> Services (RTAS) interface, allowing root (or CAP_SYS_ADMIN users) in a 
> VM to overwrite some parts of memory, including kernel memory.
> 
> This issue impacts guests running on top of PowerVM or KVM hypervisors 
> (pseries platform), and does *not* impact bare-metal machines (powernv 
> platform).
CVE-2020-27777 has been assigned.

-- 
Andrew Donnellan              OzLabs, ADL Canberra
ajd@linux.ibm.com             IBM Australia Limited


* Re: [PATCH] powerpc/perf: Fix crash with 'is_sier_available' when pmu is not set
From: Athira Rajeev @ 2020-11-23 13:32 UTC
  To: Michael Ellerman; +Cc: sachinp, Madhavan Srinivasan, linuxppc-dev
In-Reply-To: <877dqc1ftj.fsf@mpe.ellerman.id.au>



> On 23-Nov-2020, at 4:49 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> 
> Hi Athira,
> 
> Athira Rajeev <atrajeev@linux.vnet.ibm.com> writes:
>> On systems without any platform specific PMU driver support registered or
>> Generic Compat PMU support registered,
> 
> The compat PMU is registered just like other PMUs, so I don't see how we
> can crash like this if the compat PMU is active?
> 
> ie. if we're using the compat PMU then ppmu will be non-NULL and point
> to generic_compat_pmu.

Hi Michael,

Thanks for checking the patch.

The crash happens on systems which have neither compat PMU support nor a
platform specific PMU registered. This happens when the distro has neither the PMU
driver support for that platform nor the generic "compat-mode" performance monitoring
driver support.

In such cases, since the compat PMU is inactive, ppmu is not set, which
results in the crash. Sorry for the confusion with my first line. I will correct it.
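
The failure mode and the guard can be reproduced in a userspace sketch. Everything here is a stand-in: struct fake_pmu is hypothetical, and the PPMU_HAS_SIER value is an assumption chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define PPMU_HAS_SIER 0x40u	/* flag value assumed for illustration */

/* Hypothetical stand-in for the PMU description structure. */
struct fake_pmu {
	unsigned int flags;
};

/* Stays NULL until some PMU driver (platform specific or compat) registers. */
static struct fake_pmu *ppmu;

static bool is_sier_available(void)
{
	/* Without this check, the flags read below dereferences NULL. */
	if (!ppmu)
		return false;

	return (ppmu->flags & PPMU_HAS_SIER) != 0;
}
```

With no driver registered the function now reports false instead of faulting, which matches the behaviour described for the patch.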

> 
>> running 'perf record' with
>> --intr-regs  will crash ( perf record -I <workload> ).
>> 
>> The relevant portion from crash logs and Call Trace:
>> 
>> Unable to handle kernel paging request for data at address 0x00000068
>> Faulting instruction address: 0xc00000000013eb18
>> Oops: Kernel access of bad area, sig: 11 [#1]
>> CPU: 2 PID: 13435 Comm: kill Kdump: loaded Not tainted 4.18.0-193.el8.ppc64le #1
>> NIP:  c00000000013eb18 LR: c000000000139f2c CTR: c000000000393d80
>> REGS: c0000004a07ab4f0 TRAP: 0300   Not tainted  (4.18.0-193.el8.ppc64le)
>> NIP [c00000000013eb18] is_sier_available+0x18/0x30
>> LR [c000000000139f2c] perf_reg_value+0x6c/0xb0
>> Call Trace:
>> [c0000004a07ab770] [c0000004a07ab7c8] 0xc0000004a07ab7c8 (unreliable)
>> [c0000004a07ab7a0] [c0000000003aa77c] perf_output_sample+0x60c/0xac0
>> [c0000004a07ab840] [c0000000003ab3f0] perf_event_output_forward+0x70/0xb0
>> [c0000004a07ab8c0] [c00000000039e208] __perf_event_overflow+0x88/0x1a0
>> [c0000004a07ab910] [c00000000039e42c] perf_swevent_hrtimer+0x10c/0x1d0
>> [c0000004a07abc50] [c000000000228b9c] __hrtimer_run_queues+0x17c/0x480
>> [c0000004a07abcf0] [c00000000022aaf4] hrtimer_interrupt+0x144/0x520
>> [c0000004a07abdd0] [c00000000002a864] timer_interrupt+0x104/0x2f0
>> [c0000004a07abe30] [c0000000000091c4] decrementer_common+0x114/0x120
>> 
>> When perf record session started with "-I" option, capture registers
>                          ^
>                          is
> 
>> via intr-regs,
> 
> "intr-regs" is just the full name for the -I option, so that kind of
> repeats itself.
> 
>> on each sample ‘is_sier_available()'i is called to check
>                                      ^
>                                      extra i
> 
> The single quotes around is_sier_available() aren't necessary IMO.
> 
>> for the SIER ( Sample Instruction Event Register) availability in the
>                ^
>                stray space
>> platform. This function in core-book3s access 'ppmu->flags'. If platform
>                                               ^                 ^
>                                               es                a
>> specific pmu driver is not registered, ppmu is set to null and accessing
>           ^                                            ^
>           PMU                                          NULL
>> its members results in crash. Patch fixes this by returning false in
>                        ^
>                        a
>> 'is_sier_available()' if 'ppmu' is not set.
> 
> Use the imperative mood for the last sentence which says what the patch
> does:
> 
>  Fix the crash by returning false in is_sier_available() if ppmu is not set.

Sure,  I will make all these changes as suggested.

Thanks
Athira
> 
> 
>> Fixes: 333804dc3b7a ("powerpc/perf: Update perf_regs structure to include SIER")
>> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
>> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/perf/core-book3s.c | 3 +++
>> 1 file changed, 3 insertions(+)
>> 
>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>> index 08643cb..1de4770 100644
>> --- a/arch/powerpc/perf/core-book3s.c
>> +++ b/arch/powerpc/perf/core-book3s.c
>> @@ -137,6 +137,9 @@ static void pmao_restore_workaround(bool ebb) { }
>> 
>> bool is_sier_available(void)
>> {
>> +	if (!ppmu)
>> +		return false;
>> +
>> 	if (ppmu->flags & PPMU_HAS_SIER)
>> 		return true;
>> 
>> -- 
>> 1.8.3.1
> 
> 
> cheers



* Re: [PATCH] powerpc/perf: Fix crash with 'is_sier_available' when pmu is not set
From: Michael Ellerman @ 2020-11-23 11:19 UTC
  To: Athira Rajeev; +Cc: sachinp, maddy, linuxppc-dev
In-Reply-To: <1606124997-3358-1-git-send-email-atrajeev@linux.vnet.ibm.com>

Hi Athira,

Athira Rajeev <atrajeev@linux.vnet.ibm.com> writes:
> On systems without any platform specific PMU driver support registered or
> Generic Compat PMU support registered,

The compat PMU is registered just like other PMUs, so I don't see how we
can crash like this if the compat PMU is active?

ie. if we're using the compat PMU then ppmu will be non-NULL and point
to generic_compat_pmu.

> running 'perf record' with
> --intr-regs  will crash ( perf record -I <workload> ).
>
> The relevant portion from crash logs and Call Trace:
>
> Unable to handle kernel paging request for data at address 0x00000068
> Faulting instruction address: 0xc00000000013eb18
> Oops: Kernel access of bad area, sig: 11 [#1]
> CPU: 2 PID: 13435 Comm: kill Kdump: loaded Not tainted 4.18.0-193.el8.ppc64le #1
> NIP:  c00000000013eb18 LR: c000000000139f2c CTR: c000000000393d80
> REGS: c0000004a07ab4f0 TRAP: 0300   Not tainted  (4.18.0-193.el8.ppc64le)
> NIP [c00000000013eb18] is_sier_available+0x18/0x30
> LR [c000000000139f2c] perf_reg_value+0x6c/0xb0
> Call Trace:
> [c0000004a07ab770] [c0000004a07ab7c8] 0xc0000004a07ab7c8 (unreliable)
> [c0000004a07ab7a0] [c0000000003aa77c] perf_output_sample+0x60c/0xac0
> [c0000004a07ab840] [c0000000003ab3f0] perf_event_output_forward+0x70/0xb0
> [c0000004a07ab8c0] [c00000000039e208] __perf_event_overflow+0x88/0x1a0
> [c0000004a07ab910] [c00000000039e42c] perf_swevent_hrtimer+0x10c/0x1d0
> [c0000004a07abc50] [c000000000228b9c] __hrtimer_run_queues+0x17c/0x480
> [c0000004a07abcf0] [c00000000022aaf4] hrtimer_interrupt+0x144/0x520
> [c0000004a07abdd0] [c00000000002a864] timer_interrupt+0x104/0x2f0
> [c0000004a07abe30] [c0000000000091c4] decrementer_common+0x114/0x120
>
> When perf record session started with "-I" option, capture registers
                          ^
                          is

> via intr-regs,

"intr-regs" is just the full name for the -I option, so that kind of
repeats itself.

> on each sample ‘is_sier_available()'i is called to check
                                      ^
                                      extra i

The single quotes around is_sier_available() aren't necessary IMO.

> for the SIER ( Sample Instruction Event Register) availability in the
                ^
                stray space
> platform. This function in core-book3s access 'ppmu->flags'. If platform
                                               ^                 ^
                                               es                a
> specific pmu driver is not registered, ppmu is set to null and accessing
           ^                                            ^
           PMU                                          NULL
> its members results in crash. Patch fixes this by returning false in
                        ^
                        a
> 'is_sier_available()' if 'ppmu' is not set.

Use the imperative mood for the last sentence which says what the patch
does:

  Fix the crash by returning false in is_sier_available() if ppmu is not set.


> Fixes: 333804dc3b7a ("powerpc/perf: Update perf_regs structure to include SIER")
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> ---
>  arch/powerpc/perf/core-book3s.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 08643cb..1de4770 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -137,6 +137,9 @@ static void pmao_restore_workaround(bool ebb) { }
>  
>  bool is_sier_available(void)
>  {
> +	if (!ppmu)
> +		return false;
> +
>  	if (ppmu->flags & PPMU_HAS_SIER)
>  		return true;
>  
> -- 
> 1.8.3.1


cheers


* Re: [PATCH 1/3] perf/core: Flush PMU internal buffers for per-CPU events
From: Michael Ellerman @ 2020-11-23 11:00 UTC
  To: Namhyung Kim, Liang, Kan
  Cc: Ian Rogers, Andi Kleen, Peter Zijlstra, linuxppc-dev,
	linux-kernel, Stephane Eranian, Paul Mackerras,
	Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Gabriel Marin
In-Reply-To: <CAM9d7chbQE=zkqYsNFMv+uWEYWdXcGD=fNYT_R2ondwR5zVvaQ@mail.gmail.com>

Namhyung Kim <namhyung@kernel.org> writes:
> Hi Peter and Kan,
>
> (Adding PPC folks)
>
> On Tue, Nov 17, 2020 at 2:01 PM Namhyung Kim <namhyung@kernel.org> wrote:
>>
>> Hello,
>>
>> On Thu, Nov 12, 2020 at 4:54 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> >
>> >
>> >
>> > On 11/11/2020 11:25 AM, Peter Zijlstra wrote:
>> > > On Mon, Nov 09, 2020 at 09:49:31AM -0500, Liang, Kan wrote:
>> > >
>> > >> - When the large PEBS was introduced (9c964efa4330), the sched_task() should
>> > >> be invoked to flush the PEBS buffer in each context switch. However, The
>> > >> perf_sched_events in account_event() is not updated accordingly. The
>> > >> perf_event_task_sched_* never be invoked for a pure per-CPU context. Only
>> > >> per-task event works.
>> > >>     At that time, the perf_pmu_sched_task() is outside of
>> > >> perf_event_context_sched_in/out. It means that perf has to double
>> > >> perf_pmu_disable() for per-task event.
>> > >
>> > >> - The patch 1 tries to fix broken per-CPU events. The CPU context cannot be
>> > >> retrieved from the task->perf_event_ctxp. So it has to be tracked in the
>> > >> sched_cb_list. Yes, the code is very similar to the original codes, but it
>> > >> is actually the new code for per-CPU events. The optimization for per-task
>> > >> events is still kept.
>> > >>    For the case, which has both a CPU context and a task context, yes, the
>> > >> __perf_pmu_sched_task() in this patch is not invoked. Because the
>> > >> sched_task() only need to be invoked once in a context switch. The
>> > >> sched_task() will be eventually invoked in the task context.
>> > >
>> > > The thing is; your first two patches rely on PERF_ATTACH_SCHED_CB and
>> > > only set that for large pebs. Are you sure the other users (Intel LBR
>> > > and PowerPC BHRB) don't need it?
>> >
>> > I didn't set it for LBR, because the perf_sched_events is always enabled
>> > for LBR. But, yes, we should explicitly set the PERF_ATTACH_SCHED_CB
>> > for LBR.
>> >
>> >         if (has_branch_stack(event))
>> >                 inc = true;
>> >
>> > >
>> > > If they indeed do not require the pmu::sched_task() callback for CPU
>> > > events, then I still think the whole perf_sched_cb_{inc,dec}() interface
>> >
>> > No, LBR requires the pmu::sched_task() callback for CPU events.
>> >
>> > Now, The LBR registers have to be reset in sched in even for CPU events.
>> >
>> > To fix the shorter LBR callstack issue for CPU events, we also need to
>> > save/restore LBRs in pmu::sched_task().
>> > https://lore.kernel.org/lkml/1578495789-95006-4-git-send-email-kan.liang@linux.intel.com/
>> >
>> > > is confusing at best.
>> > >
>> > > Can't we do something like this instead?
>> > >
>> > I think the below patch may have two issues.
>> > - PERF_ATTACH_SCHED_CB is required for LBR (maybe PowerPC BHRB as well) now.
>> > - We may disable the large PEBS later if not all PEBS events support
>> > large PEBS. The PMU need a way to notify the generic code to decrease
>> > the nr_sched_task.
>>
>> Any updates on this?  I've reviewed and tested Kan's patches
>> and they all look good.
>>
>> Maybe we can talk to PPC folks to confirm the BHRB case?
>
> Can we move this forward?  I saw patch 3/3 also adds PERF_ATTACH_SCHED_CB
> for PowerPC too.  But it'd be nice if ppc folks can confirm the change.

Sorry I've read the whole thread, but I'm still not entirely sure I
understand the question.

cheers


* Re: [PATCH] powerpc/perf: Fix crash with 'is_sier_available' when pmu is not set
From: Sachin Sant @ 2020-11-23 10:55 UTC
  To: Athira Rajeev; +Cc: maddy, linuxppc-dev
In-Reply-To: <1606124997-3358-1-git-send-email-atrajeev@linux.vnet.ibm.com>

> When perf record session started with "-I" option, capture registers
> via intr-regs, on each sample ‘is_sier_available()'i is called to check
> for the SIER ( Sample Instruction Event Register) availability in the
> platform. This function in core-book3s access 'ppmu->flags'. If platform
> specific pmu driver is not registered, ppmu is set to null and accessing
> its members results in crash. Patch fixes this by returning false in
> 'is_sier_available()' if 'ppmu' is not set.
> 
> Fixes: 333804dc3b7a ("powerpc/perf: Update perf_regs structure to include SIER")
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>

Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>

Thanks
-Sachin


* Re: [PATCH V2 4/5] ocxl: Add mmu notifier
From: Frederic Barrat @ 2020-11-23 10:40 UTC
  To: Christophe Lombard, linuxppc-dev, fbarrat, ajd
In-Reply-To: <20201120173241.59229-5-clombard@linux.vnet.ibm.com>



On 20/11/2020 18:32, Christophe Lombard wrote:
> Add invalidate_range mmu notifier, when required (ATSD access of MMIO
> registers is available), to initiate TLB invalidation commands.
> For the time being, the ATSD0 set of registers is used by default.
> 
> The pasid and bdf values have to be configured in the Process Element
> Entry.
> The PEE must be set up to match the BDF/PASID of the AFU.
> 
> Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
> ---
>   drivers/misc/ocxl/link.c | 58 +++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 57 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
> index 20444db8a2bb..100bdfe9ec37 100644
> --- a/drivers/misc/ocxl/link.c
> +++ b/drivers/misc/ocxl/link.c
> @@ -2,8 +2,10 @@
>   // Copyright 2017 IBM Corp.
>   #include <linux/sched/mm.h>
>   #include <linux/mutex.h>
> +#include <linux/mm.h>
>   #include <linux/mm_types.h>
>   #include <linux/mmu_context.h>
> +#include <linux/mmu_notifier.h>
>   #include <asm/copro.h>
>   #include <asm/pnv-ocxl.h>
>   #include <asm/xive.h>
> @@ -33,6 +35,7 @@
> 
>   #define SPA_PE_VALID		0x80000000
> 
> +struct ocxl_link;
> 
>   struct pe_data {
>   	struct mm_struct *mm;
> @@ -41,6 +44,8 @@ struct pe_data {
>   	/* opaque pointer to be passed to the above callback */
>   	void *xsl_err_data;
>   	struct rcu_head rcu;
> +	struct ocxl_link *link;
> +	struct mmu_notifier mmu_notifier;
>   };
> 
>   struct spa {
> @@ -83,6 +88,8 @@ struct ocxl_link {
>   	int domain;
>   	int bus;
>   	int dev;
> +	void __iomem *arva;     /* ATSD register virtual address */
> +	spinlock_t atsd_lock;   /* to serialize shootdowns */
>   	atomic_t irq_available;
>   	struct spa *spa;
>   	void *platform_data;
> @@ -403,6 +410,11 @@ static int alloc_link(struct pci_dev *dev, int PE_mask, struct ocxl_link **out_l
>   	if (rc)
>   		goto err_xsl_irq;
> 
> +	rc = pnv_ocxl_map_lpar(dev, mfspr(SPRN_LPID), 0,
> +					  &link->arva);
> +	if (!rc)
> +		spin_lock_init(&link->atsd_lock);
> +


We could use a comment to say that if arva = 0, then we don't need mmio 
shootdowns and we rely on hardware snooping.

Also, we could always initialize the spin lock; it doesn't hurt and makes
the code more readable.

   Fred
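
A rough sketch of what the two suggestions could look like, in userspace C. All names are hypothetical: fake_map_lpar() stands in for pnv_ocxl_map_lpar(), lock_ready stands in for spin_lock_init() having run, and clearing arva on failure is an assumption of this sketch rather than something the patch states.

```c
#include <stdbool.h>
#include <stddef.h>

struct fake_link {
	void *arva;		/* NULL: no MMIO shootdowns needed, rely on hardware snooping */
	bool lock_ready;	/* stands in for spin_lock_init() having been called */
};

/* Hypothetical mapping helper: returns 0 and sets *arva when ATSD registers exist. */
static int fake_map_lpar(bool atsd_available, void **arva)
{
	static int mmio_window;	/* dummy target so the pointer is non-NULL */

	if (!atsd_available) {
		*arva = NULL;
		return -1;
	}
	*arva = &mmio_window;
	return 0;
}

static void fake_link_setup(struct fake_link *link, bool atsd_available)
{
	/* Initialize the lock unconditionally; it is harmless when never taken. */
	link->lock_ready = true;

	/*
	 * A mapping failure is not fatal: arva stays NULL, which means
	 * no MMIO shootdowns are needed and hardware snooping is relied on.
	 */
	(void)fake_map_lpar(atsd_available, &link->arva);
}

/* The notifier is only worth registering when there is a register window to write to. */
static bool needs_mmu_notifier(const struct fake_link *link)
{
	return link->arva != NULL;
}
```

Keeping the lock initialization outside the conditional means later code only ever has to test arva, never the state of the lock.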


>   	*out_link = link;
>   	return 0;
> 
> @@ -454,6 +466,11 @@ static void release_xsl(struct kref *ref)
>   {
>   	struct ocxl_link *link = container_of(ref, struct ocxl_link, ref);
> 
> +	if (link->arva) {
> +		pnv_ocxl_unmap_lpar(&link->arva);
> +		link->arva = NULL;
> +	}
> +
>   	list_del(&link->list);
>   	/* call platform code before releasing data */
>   	pnv_ocxl_spa_release(link->platform_data);
> @@ -470,6 +487,26 @@ void ocxl_link_release(struct pci_dev *dev, void *link_handle)
>   }
>   EXPORT_SYMBOL_GPL(ocxl_link_release);
> 
> +static void invalidate_range(struct mmu_notifier *mn,
> +			     struct mm_struct *mm,
> +			     unsigned long start, unsigned long end)
> +{
> +	struct pe_data *pe_data = container_of(mn, struct pe_data, mmu_notifier);
> +	struct ocxl_link *link = pe_data->link;
> +	unsigned long addr, pid, page_size = PAGE_SIZE;
> +
> +	pid = mm->context.id;
> +
> +	spin_lock(&link->atsd_lock);
> +	for (addr = start; addr < end; addr += page_size)
> +		pnv_ocxl_tlb_invalidate(&link->arva, pid, addr);
> +	spin_unlock(&link->atsd_lock);
> +}
> +
> +static const struct mmu_notifier_ops ocxl_mmu_notifier_ops = {
> +	.invalidate_range = invalidate_range,
> +};
> +
>   static u64 calculate_cfg_state(bool kernel)
>   {
>   	u64 state;
> @@ -526,6 +563,8 @@ int ocxl_link_add_pe(void *link_handle, int pasid, u32 pidr, u32 tidr,
>   	pe_data->mm = mm;
>   	pe_data->xsl_err_cb = xsl_err_cb;
>   	pe_data->xsl_err_data = xsl_err_data;
> +	pe_data->link = link;
> +	pe_data->mmu_notifier.ops = &ocxl_mmu_notifier_ops;
> 
>   	memset(pe, 0, sizeof(struct ocxl_process_element));
>   	pe->config_state = cpu_to_be64(calculate_cfg_state(pidr == 0));
> @@ -542,8 +581,16 @@ int ocxl_link_add_pe(void *link_handle, int pasid, u32 pidr, u32 tidr,
>   	 * by the nest MMU. If we have a kernel context, TLBIs are
>   	 * already global.
>   	 */
> -	if (mm)
> +	if (mm) {
>   		mm_context_add_copro(mm);
> +		if (link->arva) {
> +			/* Use MMIO registers for the TLB Invalidate
> +			 * operations.
> +			 */
> +			mmu_notifier_register(&pe_data->mmu_notifier, mm);
> +		}
> +	}
> +
>   	/*
>   	 * Barrier is to make sure PE is visible in the SPA before it
>   	 * is used by the device. It also helps with the global TLBI
> @@ -674,6 +721,15 @@ int ocxl_link_remove_pe(void *link_handle, int pasid)
>   		WARN(1, "Couldn't find pe data when removing PE\n");
>   	} else {
>   		if (pe_data->mm) {
> +			if (link->arva) {
> +				mmu_notifier_unregister(&pe_data->mmu_notifier,
> +							pe_data->mm);
> +				spin_lock(&link->atsd_lock);
> +				pnv_ocxl_tlb_invalidate(&link->arva,
> +							pe_data->mm->context.id,
> +							0ull);
> +				spin_unlock(&link->atsd_lock);
> +			}
>   			mm_context_remove_copro(pe_data->mm);
>   			mmdrop(pe_data->mm);
>   		}
> 

^ permalink raw reply

* Re: [PATCH V2 3/5] ocxl: Update the Process Element Entry
From: Frederic Barrat @ 2020-11-23 10:38 UTC (permalink / raw)
  To: Christophe Lombard, linuxppc-dev, fbarrat, ajd
In-Reply-To: <20201120173241.59229-4-clombard@linux.vnet.ibm.com>



On 20/11/2020 18:32, Christophe Lombard wrote:
> To complete the MMIO based mechanism, the fields: PASID, bus, device and
> function of the Process Element Entry have to be filled. (See
> OpenCAPI Power Platform Architecture document)
> 
>                     Hypervisor Process Element Entry
> Word
>      0 1 .... 7  8  ...... 12  13 ..15  16.... 19  20 ........... 31
> 0                  OSL Configuration State (0:31)
> 1                  OSL Configuration State (32:63)
> 2               PASID                      |    Reserved
> 3       Bus   |   Device    |Function |        Reserved
> 4                             Reserved
> 5                             Reserved
> 6                               ....
> 
> Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
> ---
>   drivers/misc/ocxl/context.c       | 4 +++-
>   drivers/misc/ocxl/link.c          | 4 +++-
>   drivers/misc/ocxl/ocxl_internal.h | 4 +++-
>   drivers/scsi/cxlflash/ocxl_hw.c   | 6 ++++--
>   include/misc/ocxl.h               | 2 +-
>   5 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
> index c21f65a5c762..9eb0d93b01c6 100644
> --- a/drivers/misc/ocxl/context.c
> +++ b/drivers/misc/ocxl/context.c
> @@ -70,6 +70,7 @@ int ocxl_context_attach(struct ocxl_context *ctx, u64 amr, struct mm_struct *mm)
>   {
>   	int rc;
>   	unsigned long pidr = 0;
> +	struct pci_dev *dev;
> 
>   	// Locks both status & tidr
>   	mutex_lock(&ctx->status_mutex);
> @@ -81,8 +82,9 @@ int ocxl_context_attach(struct ocxl_context *ctx, u64 amr, struct mm_struct *mm)
>   	if (mm)
>   		pidr = mm->context.id;
> 
> +	dev = to_pci_dev(ctx->afu->fn->dev.parent);
>   	rc = ocxl_link_add_pe(ctx->afu->fn->link, ctx->pasid, pidr, ctx->tidr,
> -			      amr, mm, xsl_fault_error, ctx);
> +			      amr, pci_dev_id(dev), mm, xsl_fault_error, ctx);
>   	if (rc)
>   		goto out;
> 
> diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
> index fd73d3bc0eb6..20444db8a2bb 100644
> --- a/drivers/misc/ocxl/link.c
> +++ b/drivers/misc/ocxl/link.c
> @@ -494,7 +494,7 @@ static u64 calculate_cfg_state(bool kernel)
>   }
> 
>   int ocxl_link_add_pe(void *link_handle, int pasid, u32 pidr, u32 tidr,
> -		u64 amr, struct mm_struct *mm,
> +		u64 amr, u64 bdf, struct mm_struct *mm,


bdf could/should be a u16, since that's per the PCI spec.

   Fred


>   		void (*xsl_err_cb)(void *data, u64 addr, u64 dsisr),
>   		void *xsl_err_data)
>   {
> @@ -529,6 +529,8 @@ int ocxl_link_add_pe(void *link_handle, int pasid, u32 pidr, u32 tidr,
> 
>   	memset(pe, 0, sizeof(struct ocxl_process_element));
>   	pe->config_state = cpu_to_be64(calculate_cfg_state(pidr == 0));
> +	pe->pasid = cpu_to_be32(pasid << (31 - 19));
> +	pe->bdf = cpu_to_be32(bdf << (31 - 15));
>   	pe->lpid = cpu_to_be32(mfspr(SPRN_LPID));
>   	pe->pid = cpu_to_be32(pidr);
>   	pe->tid = cpu_to_be32(tidr);
> diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h
> index 0bad0a123af6..c9ce2af21d6f 100644
> --- a/drivers/misc/ocxl/ocxl_internal.h
> +++ b/drivers/misc/ocxl/ocxl_internal.h
> @@ -84,7 +84,9 @@ struct ocxl_context {
> 
>   struct ocxl_process_element {
>   	__be64 config_state;
> -	__be32 reserved1[11];
> +	__be32 pasid;
> +	__be32 bdf;
> +	__be32 reserved1[9];
>   	__be32 lpid;
>   	__be32 tid;
>   	__be32 pid;
> diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
> index e4e0d767b98e..244fc27215dc 100644
> --- a/drivers/scsi/cxlflash/ocxl_hw.c
> +++ b/drivers/scsi/cxlflash/ocxl_hw.c
> @@ -329,6 +329,7 @@ static int start_context(struct ocxlflash_context *ctx)
>   	struct ocxl_hw_afu *afu = ctx->hw_afu;
>   	struct ocxl_afu_config *acfg = &afu->acfg;
>   	void *link_token = afu->link_token;
> +	struct pci_dev *pdev = afu->pdev;
>   	struct device *dev = afu->dev;
>   	bool master = ctx->master;
>   	struct mm_struct *mm;
> @@ -360,8 +361,9 @@ static int start_context(struct ocxlflash_context *ctx)
>   		mm = current->mm;
>   	}
> 
> -	rc = ocxl_link_add_pe(link_token, ctx->pe, pid, 0, 0, mm,
> -			      ocxlflash_xsl_fault, ctx);
> +	rc = ocxl_link_add_pe(link_token, ctx->pe, pid, 0, 0,
> +			      pci_dev_id(pdev), mm, ocxlflash_xsl_fault,
> +			      ctx);
>   	if (unlikely(rc)) {
>   		dev_err(dev, "%s: ocxl_link_add_pe failed rc=%d\n",
>   			__func__, rc);
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index e013736e275d..d0f101f428dd 100644
> --- a/include/misc/ocxl.h
> +++ b/include/misc/ocxl.h
> @@ -447,7 +447,7 @@ void ocxl_link_release(struct pci_dev *dev, void *link_handle);
>    * defined
>    */
>   int ocxl_link_add_pe(void *link_handle, int pasid, u32 pidr, u32 tidr,
> -		u64 amr, struct mm_struct *mm,
> +		u64 amr, u64 bdf, struct mm_struct *mm,
>   		void (*xsl_err_cb)(void *data, u64 addr, u64 dsisr),
>   		void *xsl_err_data);
> 

^ permalink raw reply

* Re: [PATCH V2 2/5] ocxl: Initiate a TLB invalidate command
From: Frederic Barrat @ 2020-11-23 10:37 UTC (permalink / raw)
  To: Christophe Lombard, linuxppc-dev, fbarrat, ajd
In-Reply-To: <20201120173241.59229-3-clombard@linux.vnet.ibm.com>



On 20/11/2020 18:32, Christophe Lombard wrote:
> When a TLB Invalidate is required for the Logical Partition, the following
> sequence has to be performed:
> 
> 1. Load MMIO ATSD AVA register with the necessary value, if required.
> 2. Write the MMIO ATSD launch register to initiate the TLB Invalidate
>     command.
> 3. Poll the MMIO ATSD status register to determine when the TLB Invalidate
>     has been completed.
> 
> Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/pnv-ocxl.h   | 50 ++++++++++++++++++++++++
>   arch/powerpc/platforms/powernv/ocxl.c | 55 +++++++++++++++++++++++++++
>   2 files changed, 105 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/pnv-ocxl.h b/arch/powerpc/include/asm/pnv-ocxl.h
> index 3f38aed7100c..9c90e87e7659 100644
> --- a/arch/powerpc/include/asm/pnv-ocxl.h
> +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> @@ -3,12 +3,59 @@
>   #ifndef _ASM_PNV_OCXL_H
>   #define _ASM_PNV_OCXL_H
> 
> +#include <linux/bitfield.h>
>   #include <linux/pci.h>
> 
>   #define PNV_OCXL_TL_MAX_TEMPLATE        63
>   #define PNV_OCXL_TL_BITS_PER_RATE       4
>   #define PNV_OCXL_TL_RATE_BUF_SIZE       ((PNV_OCXL_TL_MAX_TEMPLATE+1) * PNV_OCXL_TL_BITS_PER_RATE / 8)
> 
> +#define PNV_OCXL_ATSD_TIMEOUT		1
> +
> +/* TLB Management Instructions */
> +#define PNV_OCXL_ATSD_LNCH		0x00
> +/* Radix Invalidate */
> +#define   PNV_OCXL_ATSD_LNCH_R		PPC_BIT(0)
> +/* Radix Invalidation Control
> + * 0b00 Just invalidate TLB.
> + * 0b01 Invalidate just Page Walk Cache.
> + * 0b10 Invalidate TLB, Page Walk Cache, and any
> + * caching of Partition and Process Table Entries.
> + */
> +#define   PNV_OCXL_ATSD_LNCH_RIC	PPC_BITMASK(1, 2)
> +/* Number and Page Size of translations to be invalidated */
> +#define   PNV_OCXL_ATSD_LNCH_LP		PPC_BITMASK(3, 10)
> +/* Invalidation Criteria
> + * 0b00 Invalidate just the target VA.
> + * 0b01 Invalidate matching PID.
> + */
> +#define   PNV_OCXL_ATSD_LNCH_IS		PPC_BITMASK(11, 12)
> +/* 0b1: Process Scope, 0b0: Partition Scope */
> +#define   PNV_OCXL_ATSD_LNCH_PRS	PPC_BIT(13)
> +/* Invalidation Flag */
> +#define   PNV_OCXL_ATSD_LNCH_B		PPC_BIT(14)
> +/* Actual Page Size to be invalidated
> + * 000 4KB
> + * 101 64KB
> + * 001 2MB
> + * 010 1GB
> + */
> +#define   PNV_OCXL_ATSD_LNCH_AP		PPC_BITMASK(15, 17)
> +/* Defines the large page select
> + * L=0b0 for 4KB pages
> + * L=0b1 for large pages)
> + */
> +#define   PNV_OCXL_ATSD_LNCH_L		PPC_BIT(18)
> +/* Process ID */
> +#define   PNV_OCXL_ATSD_LNCH_PID	PPC_BITMASK(19, 38)
> +/* NoFlush – Assumed to be 0b0 */
> +#define   PNV_OCXL_ATSD_LNCH_F		PPC_BIT(39)
> +#define   PNV_OCXL_ATSD_LNCH_OCAPI_SLBI	PPC_BIT(40)
> +#define   PNV_OCXL_ATSD_LNCH_OCAPI_SINGLETON	PPC_BIT(41)
> +#define PNV_OCXL_ATSD_AVA		0x08
> +#define   PNV_OCXL_ATSD_AVA_AVA		PPC_BITMASK(0, 51)
> +#define PNV_OCXL_ATSD_STAT		0x10
> +
>   int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16 *enabled, u16 *supported);
>   int pnv_ocxl_get_pasid_count(struct pci_dev *dev, int *count);
> 
> @@ -31,4 +78,7 @@ int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle);
>   int pnv_ocxl_map_lpar(struct pci_dev *dev, uint64_t lparid,
>   		      uint64_t lpcr, void __iomem **arva);
>   void pnv_ocxl_unmap_lpar(void __iomem **arva);
> +void pnv_ocxl_tlb_invalidate(void __iomem **arva,
> +			     unsigned long pid,
> +			     unsigned long addr);
>   #endif /* _ASM_PNV_OCXL_H */
> diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
> index bc20cf867900..07878496954b 100644
> --- a/arch/powerpc/platforms/powernv/ocxl.c
> +++ b/arch/powerpc/platforms/powernv/ocxl.c
> @@ -531,3 +531,58 @@ void pnv_ocxl_unmap_lpar(void __iomem **arva)
>   	iounmap(*arva);
>   }
>   EXPORT_SYMBOL_GPL(pnv_ocxl_unmap_lpar);
> +
> +void pnv_ocxl_tlb_invalidate(void __iomem **arva,


Similarly to the previous patch, there's no reason why arva should be a 
double-pointer.


> +			     unsigned long pid,
> +			     unsigned long addr)
> +{
> +	unsigned long timeout = jiffies + (HZ * PNV_OCXL_ATSD_TIMEOUT);
> +	uint64_t val = 0ull;
> +	int pend;
> +
> +	if (!(*arva))
> +		return;
> +
> +	if (addr) {
> +		/* load Abbreviated Virtual Address register with
> +		 * the necessary value
> +		 */
> +		val |= FIELD_PREP(PNV_OCXL_ATSD_AVA_AVA, addr >> (63-51));
> +		out_be64(*arva + PNV_OCXL_ATSD_AVA, val);
> +	}
> +
> +	/* Write access initiates a shoot down to initiate the
> +	 * TLB Invalidate command
> +	 */
> +	val = PNV_OCXL_ATSD_LNCH_R;
> +	if (addr) {
> +		val |= FIELD_PREP(PNV_OCXL_ATSD_LNCH_RIC, 0b00);
> +		val |= FIELD_PREP(PNV_OCXL_ATSD_LNCH_IS, 0b00);
> +	} else {
> +		val |= FIELD_PREP(PNV_OCXL_ATSD_LNCH_RIC, 0b10);
> +		val |= FIELD_PREP(PNV_OCXL_ATSD_LNCH_IS, 0b01);
> +		val |= PNV_OCXL_ATSD_LNCH_OCAPI_SINGLETON;
> +	}
> +	val |= PNV_OCXL_ATSD_LNCH_PRS;
> +	val |= FIELD_PREP(PNV_OCXL_ATSD_LNCH_AP, 0b101);



So we hard-code a page size of 64k, while the mmu notifier loops over
PAGE_SIZE. It would be cleaner to pass the page size as an argument and
encode AP based on it.

   Fred


> +	val |= FIELD_PREP(PNV_OCXL_ATSD_LNCH_PID, pid);
> +	out_be64(*arva + PNV_OCXL_ATSD_LNCH, val);
> +
> +	/* Poll the ATSD status register to determine when the
> +	 * TLB Invalidate has been completed.
> +	 */
> +	val = in_be64(*arva + PNV_OCXL_ATSD_STAT);
> +	pend = val >> 63;
> +
> +	while (pend) {
> +		if (time_after_eq(jiffies, timeout)) {
> +			pr_err("%s - Timeout while reading XTS MMIO ATSD status register (val=%#llx, pidr=0x%lx)\n",
> +			       __func__, val, pid);
> +			return;
> +		}
> +		cpu_relax();
> +		val = in_be64(*arva + PNV_OCXL_ATSD_STAT);
> +		pend = val >> 63;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(pnv_ocxl_tlb_invalidate);
> 

^ permalink raw reply

* Re: [PATCH V2 1/5] ocxl: Assign a register set to a Logical Partition
From: Frederic Barrat @ 2020-11-23 10:35 UTC (permalink / raw)
  To: Christophe Lombard, linuxppc-dev, fbarrat, ajd
In-Reply-To: <20201120173241.59229-2-clombard@linux.vnet.ibm.com>



On 20/11/2020 18:32, Christophe Lombard wrote:
> Platform specific function to assign a register set to a Logical Partition.
> The "ibm,mmio-atsd" property, provided by the firmware, contains the 16
> base ATSD physical addresses (ATSD0 through ATSD15) of the set of MMIO
> registers (XTS MMIO ATSDx LPARID/AVA/launch/status register).
> 
> For the time being, the ATSD0 set of registers is used by default.
> 
> Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/pnv-ocxl.h   |  3 ++
>   arch/powerpc/platforms/powernv/ocxl.c | 48 +++++++++++++++++++++++++++
>   2 files changed, 51 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/pnv-ocxl.h b/arch/powerpc/include/asm/pnv-ocxl.h
> index d37ededca3ee..3f38aed7100c 100644
> --- a/arch/powerpc/include/asm/pnv-ocxl.h
> +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> @@ -28,4 +28,7 @@ int pnv_ocxl_spa_setup(struct pci_dev *dev, void *spa_mem, int PE_mask, void **p
>   void pnv_ocxl_spa_release(void *platform_data);
>   int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle);
> 
> +int pnv_ocxl_map_lpar(struct pci_dev *dev, uint64_t lparid,
> +		      uint64_t lpcr, void __iomem **arva);
> +void pnv_ocxl_unmap_lpar(void __iomem **arva);
>   #endif /* _ASM_PNV_OCXL_H */
> diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
> index ecdad219d704..bc20cf867900 100644
> --- a/arch/powerpc/platforms/powernv/ocxl.c
> +++ b/arch/powerpc/platforms/powernv/ocxl.c
> @@ -483,3 +483,51 @@ int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle)
>   	return rc;
>   }
>   EXPORT_SYMBOL_GPL(pnv_ocxl_spa_remove_pe_from_cache);
> +
> +int pnv_ocxl_map_lpar(struct pci_dev *dev, uint64_t lparid,
> +		      uint64_t lpcr, void __iomem **arva)
> +{
> +	struct pci_controller *hose = pci_bus_to_host(dev->bus);
> +	struct pnv_phb *phb = hose->private_data;
> +	u64 mmio_atsd;
> +	int rc;
> +
> +	/* ATSD physical address.
> +	 * ATSD LAUNCH register: write access initiates a shoot down to
> +	 * initiate the TLB Invalidate command.
> +	 */
> +	rc = of_property_read_u64_index(hose->dn, "ibm,mmio-atsd",
> +					0, &mmio_atsd);
> +	if (rc) {
> +		dev_info(&dev->dev, "No available ATSD found\n");
> +		return rc;
> +	}
> +
> +	/* Assign a register set to a Logical Partition and MMIO ATSD
> +	 * LPARID register to the required value.
> +	 */
> +	if (mmio_atsd)


If we don't have the "ibm,mmio-atsd" property, i.e. on P9, then we've
already exited above. So why not treat a null mmio_atsd as an error?


> +		rc = opal_npu_map_lpar(phb->opal_id, pci_dev_id(dev),
> +				       lparid, lpcr);
> +	if (rc) {
> +		dev_err(&dev->dev, "Error mapping device to LPAR: %d\n", rc);
> +		return rc;
> +	}
> +
> +	if (mmio_atsd) {


Same here


> +		*arva = ioremap(mmio_atsd, 24);
> +		if (!(*arva)) {
> +			dev_warn(&dev->dev, "ioremap failed - mmio_atsd: %#llx\n", mmio_atsd);
> +			rc = -ENOMEM;
> +		}
> +	}
> +
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(pnv_ocxl_map_lpar);
> +
> +void pnv_ocxl_unmap_lpar(void __iomem **arva)


The arva argument doesn't need to be a double pointer. A plain void * is enough.

   Fred


> +{
> +	iounmap(*arva);
> +}
> +EXPORT_SYMBOL_GPL(pnv_ocxl_unmap_lpar);
> 

^ permalink raw reply

* [PATCH] powerpc/perf: Fix crash with 'is_sier_available' when pmu is not set
From: Athira Rajeev @ 2020-11-23  9:49 UTC (permalink / raw)
  To: mpe; +Cc: sachinp, maddy, linuxppc-dev

On systems with neither a platform-specific PMU driver nor the Generic
Compat PMU registered, running 'perf record' with --intr-regs
(perf record -I <workload>) will crash.

The relevant portion from crash logs and Call Trace:

Unable to handle kernel paging request for data at address 0x00000068
Faulting instruction address: 0xc00000000013eb18
Oops: Kernel access of bad area, sig: 11 [#1]
CPU: 2 PID: 13435 Comm: kill Kdump: loaded Not tainted 4.18.0-193.el8.ppc64le #1
NIP:  c00000000013eb18 LR: c000000000139f2c CTR: c000000000393d80
REGS: c0000004a07ab4f0 TRAP: 0300   Not tainted  (4.18.0-193.el8.ppc64le)
NIP [c00000000013eb18] is_sier_available+0x18/0x30
LR [c000000000139f2c] perf_reg_value+0x6c/0xb0
Call Trace:
[c0000004a07ab770] [c0000004a07ab7c8] 0xc0000004a07ab7c8 (unreliable)
[c0000004a07ab7a0] [c0000000003aa77c] perf_output_sample+0x60c/0xac0
[c0000004a07ab840] [c0000000003ab3f0] perf_event_output_forward+0x70/0xb0
[c0000004a07ab8c0] [c00000000039e208] __perf_event_overflow+0x88/0x1a0
[c0000004a07ab910] [c00000000039e42c] perf_swevent_hrtimer+0x10c/0x1d0
[c0000004a07abc50] [c000000000228b9c] __hrtimer_run_queues+0x17c/0x480
[c0000004a07abcf0] [c00000000022aaf4] hrtimer_interrupt+0x144/0x520
[c0000004a07abdd0] [c00000000002a864] timer_interrupt+0x104/0x2f0
[c0000004a07abe30] [c0000000000091c4] decrementer_common+0x114/0x120

When a perf record session is started with the "-I" option to capture
registers via intr-regs, 'is_sier_available()' is called on each sample
to check whether the SIER (Sampled Instruction Event Register) is
available on the platform. This function in core-book3s accesses
'ppmu->flags'. If no platform-specific PMU driver is registered, ppmu is
NULL and accessing its members results in a crash. Fix this by returning
false from 'is_sier_available()' if 'ppmu' is not set.

Fixes: 333804dc3b7a ("powerpc/perf: Update perf_regs structure to include SIER")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
---
 arch/powerpc/perf/core-book3s.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 08643cb..1de4770 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -137,6 +137,9 @@ static void pmao_restore_workaround(bool ebb) { }
 
 bool is_sier_available(void)
 {
+	if (!ppmu)
+		return false;
+
 	if (ppmu->flags & PPMU_HAS_SIER)
 		return true;
 
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH v4] dt-bindings: misc: convert fsl,qoriq-mc from txt to YAML
From: Laurentiu Tudor @ 2020-11-23  9:00 UTC (permalink / raw)
  To: robh+dt, leoyang.li, corbet, linux-arm-kernel, devicetree,
	linux-kernel, netdev, linux-doc
  Cc: ioana.ciornei, Ionut-robert Aron, kuba, linuxppc-dev, davem,
	Laurentiu Tudor

From: Ionut-robert Aron <ionut-robert.aron@nxp.com>

Convert fsl,qoriq-mc to YAML in order to automate the verification
process of dts files. In addition, update MAINTAINERS accordingly
and, while at it, add some missing files.

Signed-off-by: Ionut-robert Aron <ionut-robert.aron@nxp.com>
[laurentiu.tudor@nxp.com: update MAINTAINERS, updates & fixes in schema]
Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
---
Changes in v4:
 - use $ref to point to fsl,qoriq-mc-dpmac binding

Changes in v3:
 - dropped duplicated "fsl,qoriq-mc-dpmac" schema and replaced with
   reference to it
 - fixed a dt_binding_check warning

Changes in v2:
 - fixed errors reported by yamllint
 - dropped multiple unnecessary quotes
 - used schema instead of text in description
 - added constraints on dpmac reg property

 .../devicetree/bindings/misc/fsl,qoriq-mc.txt | 196 ------------------
 .../bindings/misc/fsl,qoriq-mc.yaml           | 186 +++++++++++++++++
 .../ethernet/freescale/dpaa2/overview.rst     |   5 +-
 MAINTAINERS                                   |   4 +-
 4 files changed, 193 insertions(+), 198 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt
 create mode 100644 Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml

diff --git a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt
deleted file mode 100644
index 7b486d4985dc..000000000000
--- a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt
+++ /dev/null
@@ -1,196 +0,0 @@
-* Freescale Management Complex
-
-The Freescale Management Complex (fsl-mc) is a hardware resource
-manager that manages specialized hardware objects used in
-network-oriented packet processing applications. After the fsl-mc
-block is enabled, pools of hardware resources are available, such as
-queues, buffer pools, I/O interfaces. These resources are building
-blocks that can be used to create functional hardware objects/devices
-such as network interfaces, crypto accelerator instances, L2 switches,
-etc.
-
-For an overview of the DPAA2 architecture and fsl-mc bus see:
-Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
-
-As described in the above overview, all DPAA2 objects in a DPRC share the
-same hardware "isolation context" and a 10-bit value called an ICID
-(isolation context id) is expressed by the hardware to identify
-the requester.
-
-The generic 'iommus' property is insufficient to describe the relationship
-between ICIDs and IOMMUs, so an iommu-map property is used to define
-the set of possible ICIDs under a root DPRC and how they map to
-an IOMMU.
-
-For generic IOMMU bindings, see
-Documentation/devicetree/bindings/iommu/iommu.txt.
-
-For arm-smmu binding, see:
-Documentation/devicetree/bindings/iommu/arm,smmu.yaml.
-
-The MSI writes are accompanied by sideband data which is derived from the ICID.
-The msi-map property is used to associate the devices with both the ITS
-controller and the sideband data which accompanies the writes.
-
-For generic MSI bindings, see
-Documentation/devicetree/bindings/interrupt-controller/msi.txt.
-
-For GICv3 and GIC ITS bindings, see:
-Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml.
-
-Required properties:
-
-    - compatible
-        Value type: <string>
-        Definition: Must be "fsl,qoriq-mc".  A Freescale Management Complex
-                    compatible with this binding must have Block Revision
-                    Registers BRR1 and BRR2 at offset 0x0BF8 and 0x0BFC in
-                    the MC control register region.
-
-    - reg
-        Value type: <prop-encoded-array>
-        Definition: A standard property.  Specifies one or two regions
-                    defining the MC's registers:
-
-                       -the first region is the command portal for the
-                        this machine and must always be present
-
-                       -the second region is the MC control registers. This
-                        region may not be present in some scenarios, such
-                        as in the device tree presented to a virtual machine.
-
-    - ranges
-        Value type: <prop-encoded-array>
-        Definition: A standard property.  Defines the mapping between the child
-                    MC address space and the parent system address space.
-
-                    The MC address space is defined by 3 components:
-                       <region type> <offset hi> <offset lo>
-
-                    Valid values for region type are
-                       0x0 - MC portals
-                       0x1 - QBMAN portals
-
-    - #address-cells
-        Value type: <u32>
-        Definition: Must be 3.  (see definition in 'ranges' property)
-
-    - #size-cells
-        Value type: <u32>
-        Definition: Must be 1.
-
-Sub-nodes:
-
-        The fsl-mc node may optionally have dpmac sub-nodes that describe
-        the relationship between the Ethernet MACs which belong to the MC
-        and the Ethernet PHYs on the system board.
-
-        The dpmac nodes must be under a node named "dpmacs" which contains
-        the following properties:
-
-            - #address-cells
-              Value type: <u32>
-              Definition: Must be present if dpmac sub-nodes are defined and must
-                          have a value of 1.
-
-            - #size-cells
-              Value type: <u32>
-              Definition: Must be present if dpmac sub-nodes are defined and must
-                          have a value of 0.
-
-        These nodes must have the following properties:
-
-            - compatible
-              Value type: <string>
-              Definition: Must be "fsl,qoriq-mc-dpmac".
-
-            - reg
-              Value type: <prop-encoded-array>
-              Definition: Specifies the id of the dpmac.
-
-            - phy-handle
-              Value type: <phandle>
-              Definition: Specifies the phandle to the PHY device node associated
-                          with the this dpmac.
-Optional properties:
-
-- iommu-map: Maps an ICID to an IOMMU and associated iommu-specifier
-  data.
-
-  The property is an arbitrary number of tuples of
-  (icid-base,iommu,iommu-base,length).
-
-  Any ICID i in the interval [icid-base, icid-base + length) is
-  associated with the listed IOMMU, with the iommu-specifier
-  (i - icid-base + iommu-base).
-
-- msi-map: Maps an ICID to a GIC ITS and associated msi-specifier
-  data.
-
-  The property is an arbitrary number of tuples of
-  (icid-base,gic-its,msi-base,length).
-
-  Any ICID in the interval [icid-base, icid-base + length) is
-  associated with the listed GIC ITS, with the msi-specifier
-  (i - icid-base + msi-base).
-
-Deprecated properties:
-
-    - msi-parent
-        Value type: <phandle>
-        Definition: Describes the MSI controller node handling message
-                    interrupts for the MC. When there is no translation
-                    between the ICID and deviceID this property can be used
-                    to describe the MSI controller used by the devices on the
-                    mc-bus.
-                    The use of this property for mc-bus is deprecated. Please
-                    use msi-map.
-
-Example:
-
-        smmu: iommu@5000000 {
-               compatible = "arm,mmu-500";
-               #iommu-cells = <1>;
-               stream-match-mask = <0x7C00>;
-               ...
-        };
-
-        gic: interrupt-controller@6000000 {
-               compatible = "arm,gic-v3";
-               ...
-        }
-        its: gic-its@6020000 {
-               compatible = "arm,gic-v3-its";
-               msi-controller;
-               ...
-        };
-
-        fsl_mc: fsl-mc@80c000000 {
-                compatible = "fsl,qoriq-mc";
-                reg = <0x00000008 0x0c000000 0 0x40>,    /* MC portal base */
-                      <0x00000000 0x08340000 0 0x40000>; /* MC control reg */
-                /* define map for ICIDs 23-64 */
-                iommu-map = <23 &smmu 23 41>;
-                /* define msi map for ICIDs 23-64 */
-                msi-map = <23 &its 23 41>;
-                #address-cells = <3>;
-                #size-cells = <1>;
-
-                /*
-                 * Region type 0x0 - MC portals
-                 * Region type 0x1 - QBMAN portals
-                 */
-                ranges = <0x0 0x0 0x0 0x8 0x0c000000 0x4000000
-                          0x1 0x0 0x0 0x8 0x18000000 0x8000000>;
-
-                dpmacs {
-                    #address-cells = <1>;
-                    #size-cells = <0>;
-
-                    dpmac@1 {
-                        compatible = "fsl,qoriq-mc-dpmac";
-                        reg = <1>;
-                        phy-handle = <&mdio0_phy0>;
-                    }
-                }
-        };
diff --git a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
new file mode 100644
index 000000000000..f45e21872e4f
--- /dev/null
+++ b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
@@ -0,0 +1,186 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+# Copyright 2020 NXP
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/misc/fsl,qoriq-mc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+maintainers:
+  - Laurentiu Tudor <laurentiu.tudor@nxp.com>
+
+title: Freescale Management Complex
+
+description: |
+  The Freescale Management Complex (fsl-mc) is a hardware resource
+  manager that manages specialized hardware objects used in
+  network-oriented packet processing applications. After the fsl-mc
+  block is enabled, pools of hardware resources are available, such as
+  queues, buffer pools, I/O interfaces. These resources are building
+  blocks that can be used to create functional hardware objects/devices
+  such as network interfaces, crypto accelerator instances, L2 switches,
+  etc.
+
+  For an overview of the DPAA2 architecture and fsl-mc bus see:
+  Documentation/networking/device_drivers/freescale/dpaa2/overview.rst
+
+  As described in the above overview, all DPAA2 objects in a DPRC share the
+  same hardware "isolation context" and a 10-bit value called an ICID
+  (isolation context id) is expressed by the hardware to identify
+  the requester.
+
+  The generic 'iommus' property is insufficient to describe the relationship
+  between ICIDs and IOMMUs, so an iommu-map property is used to define
+  the set of possible ICIDs under a root DPRC and how they map to
+  an IOMMU.
+
+  For generic IOMMU bindings, see:
+  Documentation/devicetree/bindings/iommu/iommu.txt.
+
+  For arm-smmu binding, see:
+  Documentation/devicetree/bindings/iommu/arm,smmu.yaml.
+
+  MC firmware binary images can be found here:
+  https://github.com/NXP/qoriq-mc-binary
+
+properties:
+  compatible:
+    const: fsl,qoriq-mc
+    description:
+      A Freescale Management Complex compatible with this binding must have
+      Block Revision Registers BRR1 and BRR2 at offset 0x0BF8 and 0x0BFC in
+      the MC control register region.
+
+  reg:
+    minItems: 1
+    items:
+      - description: the command portal for this machine
+      - description:
+          MC control registers. This region may not be present in some
+          scenarios, such as in the device tree presented to a virtual
+          machine.
+
+  ranges:
+    description: |
+      A standard property. Defines the mapping between the child MC address
+      space and the parent system address space.
+
+      The MC address space is defined by 3 components:
+                <region type> <offset hi> <offset lo>
+
+      Valid values for region type are:
+                  0x0 - MC portals
+                  0x1 - QBMAN portals
+
+  '#address-cells':
+    const: 3
+
+  '#size-cells':
+    const: 1
+
+  dpmacs:
+    type: object
+    description:
+      The fsl-mc node may optionally have dpmac sub-nodes that describe the
+      relationship between the Ethernet MACs which belong to the MC and the
+      Ethernet PHYs on the system board.
+
+    properties:
+      '#address-cells':
+        const: 1
+
+      '#size-cells':
+        const: 0
+
+    patternProperties:
+      "^(dpmac@[0-9a-f]+)|(ethernet@[0-9a-f]+)$":
+        type: object
+
+        $ref: /schemas/net/fsl,qoriq-mc-dpmac.yaml#
+
+  iommu-map:
+    description: |
+      Maps an ICID to an IOMMU and associated iommu-specifier data.
+
+      The property is an arbitrary number of tuples of
+      (icid-base, iommu, iommu-base, length).
+
+      Any ICID i in the interval [icid-base, icid-base + length) is
+      associated with the listed IOMMU, with the iommu-specifier
+      (i - icid-base + iommu-base).
+
+  msi-map:
+    description: |
+      Maps an ICID to a GIC ITS and associated msi-specifier data.
+
+      The property is an arbitrary number of tuples of
+      (icid-base, gic-its, msi-base, length).
+
+      Any ICID i in the interval [icid-base, icid-base + length) is
+      associated with the listed GIC ITS, with the msi-specifier
+      (i - icid-base + msi-base).
+
+  msi-parent:
+    deprecated: true
+    description:
+      Points to the MSI controller node handling message interrupts for the MC.
+
+required:
+  - compatible
+  - reg
+  - iommu-map
+  - msi-map
+  - ranges
+  - '#address-cells'
+  - '#size-cells'
+
+additionalProperties: false
+
+examples:
+  - |
+    soc {
+      #address-cells = <2>;
+      #size-cells = <2>;
+
+      smmu: iommu@5000000 {
+        compatible = "arm,mmu-500";
+        #global-interrupts = <1>;
+        #iommu-cells = <1>;
+        reg = <0 0x5000000 0 0x800000>;
+        stream-match-mask = <0x7c00>;
+        interrupts = <0 13 4>,
+                     <0 146 4>, <0 147 4>,
+                     <0 148 4>, <0 149 4>,
+                     <0 150 4>, <0 151 4>,
+                     <0 152 4>, <0 153 4>;
+      };
+
+      fsl_mc: fsl-mc@80c000000 {
+        compatible = "fsl,qoriq-mc";
+        reg = <0x00000008 0x0c000000 0 0x40>,    /* MC portal base */
+              <0x00000000 0x08340000 0 0x40000>; /* MC control reg */
+        /* define map for ICIDs 23-63 */
+        iommu-map = <23 &smmu 23 41>;
+        /* define msi map for ICIDs 23-63 */
+        msi-map = <23 &its 23 41>;
+        #address-cells = <3>;
+        #size-cells = <1>;
+
+        /*
+        * Region type 0x0 - MC portals
+        * Region type 0x1 - QBMAN portals
+        */
+        ranges = <0x0 0x0 0x0 0x8 0x0c000000 0x4000000
+                  0x1 0x0 0x0 0x8 0x18000000 0x8000000>;
+
+        dpmacs {
+          #address-cells = <1>;
+          #size-cells = <0>;
+
+          ethernet@1 {
+            compatible = "fsl,qoriq-mc-dpmac";
+            reg = <1>;
+            phy-handle = <&mdio0_phy0>;
+          };
+        };
+      };
+    };
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
index d638b5a8aadd..b3261c5871cc 100644
--- a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
@@ -28,6 +28,9 @@ interfaces, an L2 switch, or accelerator instances.
 The MC provides memory-mapped I/O command interfaces (MC portals)
 which DPAA2 software drivers use to operate on DPAA2 objects.
 
+MC firmware binary images can be found here:
+https://github.com/NXP/qoriq-mc-binary
+
 The diagram below shows an overview of the DPAA2 resource management
 architecture::
 
@@ -338,7 +341,7 @@ Key functions include:
   a bind of the root DPRC to the DPRC driver
 
 The binding for the MC-bus device-tree node can be consulted at
-*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt*.
+*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml*.
 The sysfs bind/unbind interfaces for the MC-bus can be consulted at
 *Documentation/ABI/testing/sysfs-bus-fsl-mc*.
 
diff --git a/MAINTAINERS b/MAINTAINERS
index b516bb34a8d5..e0ce6e2b663c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14409,9 +14409,11 @@ M:	Stuart Yoder <stuyoder@gmail.com>
 M:	Laurentiu Tudor <laurentiu.tudor@nxp.com>
 L:	linux-kernel@vger.kernel.org
 S:	Maintained
-F:	Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt
+F:	Documentation/devicetree/bindings/misc/fsl,dpaa2-console.yaml
+F:	Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
 F:	Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
 F:	drivers/bus/fsl-mc/
+F:	include/linux/fsl/mc.h
 
 QT1010 MEDIA DRIVER
 M:	Antti Palosaari <crope@iki.fi>
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH v2] m68k: Fix WARNING splat in pmac_zilog driver
From: Geert Uytterhoeven @ 2020-11-23  7:41 UTC (permalink / raw)
  To: Finn Thain
  Cc: linuxppc-dev, Linux Kernel Mailing List, stable, linux-m68k,
	Paul Mackerras, open list:SERIAL DRIVERS, Greg Kroah-Hartman,
	Jiri Slaby, Joshua Thompson
In-Reply-To: <0c0fe1e4f11ccec202d4df09ea7d9d98155d101a.1606001297.git.fthain@telegraphics.com.au>

On Sun, Nov 22, 2020 at 12:40 AM Finn Thain <fthain@telegraphics.com.au> wrote:
> Don't add platform resources that won't be used. This avoids a
> recently-added warning from the driver core, that can show up on a
> multi-platform kernel when !MACH_IS_MAC.
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at drivers/base/platform.c:224 platform_get_irq_optional+0x8e/0xce
> 0 is an invalid IRQ number
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper Not tainted 5.9.0-multi #1
> Stack from 004b3f04:
>         004b3f04 00462c2f 00462c2f 004b3f20 0002e128 004754db 004b6ad4 004b3f4c
>         0002e19c 004754f7 000000e0 00285ba0 00000009 00000000 004b3f44 ffffffff
>         004754db 004b3f64 004b3f74 00285ba0 004754f7 000000e0 00000009 004754db
>         004fdf0c 005269e2 004fdf0c 00000000 004b3f88 00285cae 004b6964 00000000
>         004fdf0c 004b3fac 0051cc68 004b6964 00000000 004b6964 00000200 00000000
>         0051cc3e 0023c18a 004b3fc0 0051cd8a 004fdf0c 00000002 0052b43c 004b3fc8
> Call Trace: [<0002e128>] __warn+0xa6/0xd6
>  [<0002e19c>] warn_slowpath_fmt+0x44/0x76
>  [<00285ba0>] platform_get_irq_optional+0x8e/0xce
>  [<00285ba0>] platform_get_irq_optional+0x8e/0xce
>  [<00285cae>] platform_get_irq+0x12/0x4c
>  [<0051cc68>] pmz_init_port+0x2a/0xa6
>  [<0051cc3e>] pmz_init_port+0x0/0xa6
>  [<0023c18a>] strlen+0x0/0x22
>  [<0051cd8a>] pmz_probe+0x34/0x88
>  [<0051cde6>] pmz_console_init+0x8/0x28
>  [<00511776>] console_init+0x1e/0x28
>  [<0005a3bc>] printk+0x0/0x16
>  [<0050a8a6>] start_kernel+0x368/0x4ce
>  [<005094f8>] _sinittext+0x4f8/0xc48
> random: get_random_bytes called from print_oops_end_marker+0x56/0x80 with crng_init=0
> ---[ end trace 392d8e82eed68d6c ]---
>
> Commit a85a6c86c25b ("driver core: platform: Clarify that IRQ 0 is invalid"),
> which introduced the WARNING, suggests that testing for irq == 0 is
> undesirable. Instead of that comparison, just test for resource existence.
>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Joshua Thompson <funaho@jurai.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Jiri Slaby <jirislaby@kernel.org>
> Cc: stable@vger.kernel.org # v5.8+
> References: commit a85a6c86c25b ("driver core: platform: Clarify that IRQ 0 is invalid")
> Reported-by: Laurent Vivier <laurent@vivier.eu>
> Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
i.e. will queue in the m68k for-v5.11 branch.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* linux-next: build failure in Linus' tree
From: Stephen Rothwell @ 2020-11-23  7:40 UTC (permalink / raw)
  To: Michael Ellerman, PowerPC
  Cc: Linux Next Mailing List, Linux Kernel Mailing List,
	Nicholas Piggin, Daniel Axtens

Hi all,

After merging most of the trees, today's linux-next build (powerpc64
allnoconfig) failed like this:

In file included from arch/powerpc/include/asm/kup.h:18,
                 from arch/powerpc/include/asm/uaccess.h:9,
                 from include/linux/uaccess.h:11,
                 from include/linux/sched/task.h:11,
                 from include/linux/sched/signal.h:9,
                 from include/linux/rcuwait.h:6,
                 from include/linux/percpu-rwsem.h:7,
                 from include/linux/fs.h:33,
                 from include/linux/compat.h:17,
                 from arch/powerpc/kernel/asm-offsets.c:14:
arch/powerpc/include/asm/book3s/64/kup-radix.h:66:1: warning: data definition has no type or storage class
   66 | DECLARE_STATIC_KEY_FALSE(uaccess_flush_key);
      | ^~~~~~~~~~~~~~~~~~~~~~~~
arch/powerpc/include/asm/book3s/64/kup-radix.h:66:1: error: type defaults to 'int' in declaration of 'DECLARE_STATIC_KEY_FALSE' [-Werror=implicit-int]
arch/powerpc/include/asm/book3s/64/kup-radix.h:66:1: warning: parameter names (without types) in function declaration
arch/powerpc/include/asm/book3s/64/kup-radix.h: In function 'prevent_user_access':
arch/powerpc/include/asm/book3s/64/kup-radix.h:180:6: error: implicit declaration of function 'static_branch_unlikely' [-Werror=implicit-function-declaration]
  180 |  if (static_branch_unlikely(&uaccess_flush_key))
      |      ^~~~~~~~~~~~~~~~~~~~~~
arch/powerpc/include/asm/book3s/64/kup-radix.h:180:30: error: 'uaccess_flush_key' undeclared (first use in this function)
  180 |  if (static_branch_unlikely(&uaccess_flush_key))
      |                              ^~~~~~~~~~~~~~~~~
arch/powerpc/include/asm/book3s/64/kup-radix.h:180:30: note: each undeclared identifier is reported only once for each function it appears in
arch/powerpc/include/asm/book3s/64/kup-radix.h: In function 'prevent_user_access_return':
arch/powerpc/include/asm/book3s/64/kup-radix.h:189:30: error: 'uaccess_flush_key' undeclared (first use in this function)
  189 |  if (static_branch_unlikely(&uaccess_flush_key))
      |                              ^~~~~~~~~~~~~~~~~
arch/powerpc/include/asm/book3s/64/kup-radix.h: In function 'restore_user_access':
arch/powerpc/include/asm/book3s/64/kup-radix.h:198:30: error: 'uaccess_flush_key' undeclared (first use in this function)
  198 |  if (static_branch_unlikely(&uaccess_flush_key) && flags == AMR_KUAP_BLOCKED)
      |                              ^~~~~~~~~~~~~~~~~

Caused by commit

  9a32a7e78bd0 ("powerpc/64s: flush L1D after user accesses")

I have applied the following patch for today:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 23 Nov 2020 18:35:02 +1100
Subject: [PATCH] powerpc/64s: using DECLARE_STATIC_KEY_FALSE needs
 linux/jump_label.h

Fixes: 9a32a7e78bd0 ("powerpc/64s: flush L1D after user accesses")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 arch/powerpc/include/asm/book3s/64/kup-radix.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/kup-radix.h b/arch/powerpc/include/asm/book3s/64/kup-radix.h
index 28716e2f13e3..a39e2d193fdc 100644
--- a/arch/powerpc/include/asm/book3s/64/kup-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/kup-radix.h
@@ -63,6 +63,8 @@
 
 #else /* !__ASSEMBLY__ */
 
+#include <linux/jump_label.h>
+
 DECLARE_STATIC_KEY_FALSE(uaccess_flush_key);
 
 #ifdef CONFIG_PPC_KUAP
-- 
2.29.2

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply related

* Re: [PATCH v3 3/3] powerpc/64s: feature: Work around inline asm issues
From: Segher Boessenkool @ 2020-11-23  6:34 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Nick Desaulniers, linuxppc-dev, Bill Wendling
In-Reply-To: <87d0041vaf.fsf@mpe.ellerman.id.au>

On Mon, Nov 23, 2020 at 04:44:56PM +1100, Michael Ellerman wrote:
> If I hard code:
> 
> 	.org . - (1);
> 
> It fails as expected.
> 
> But if I hard code:
> 
> 	.org . - (1 > 0);
> 
> It builds?

"true" (as a result of a comparison) in as is -1, not 1.


Segher

^ permalink raw reply

* Re: [PATCH v3 3/3] powerpc/64s: feature: Work around inline asm issues
From: Michael Ellerman @ 2020-11-23  5:44 UTC (permalink / raw)
  To: Bill Wendling, linuxppc-dev; +Cc: Nick Desaulniers, Bill Wendling
In-Reply-To: <20201120224034.191382-4-morbo@google.com>

Hi Bill,

Bill Wendling <morbo@google.com> writes:
> The clang toolchain treats inline assembly a bit differently than
> straight assembly code. In particular, inline assembly doesn't have the
> complete context available to resolve expressions. This is intentional
> to avoid divergence in the resulting assembly code.
>
> We can work around this issue by borrowing a workaround done for ARM,
> i.e. not directly testing the labels themselves, but by moving the
> current output pointer by a value that should always be zero. If this
> value is not null, then we will trigger a backward move, which is
> explicitly forbidden.
>
> Signed-off-by: Bill Wendling <morbo@google.com>
> ---
>  arch/powerpc/include/asm/feature-fixups.h | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/feature-fixups.h b/arch/powerpc/include/asm/feature-fixups.h
> index b0af97add751..f81036518edb 100644
> --- a/arch/powerpc/include/asm/feature-fixups.h
> +++ b/arch/powerpc/include/asm/feature-fixups.h
> @@ -36,6 +36,18 @@ label##2:						\
>  	.align 2;					\
>  label##3:
>  
> +/*
> + * If the .org directive fails, it means that the feature instructions
> + * are smaller than the alternate instructions. This used to be written
> + * as
> + *
> + * .ifgt (label##4b-label##3b) - (label##2b-label##1b)
> + *      .error "Feature section else case larger than body"
> + * .endif
> + *
> + * but clang's assembler complains about the expression being non-absolute
> + * when the code appears in an inline assembly statement.
> + */
>  #define MAKE_FTR_SECTION_ENTRY(msk, val, label, sect)		\
>  label##4:							\
>  	.popsection;						\
> @@ -48,12 +60,9 @@ label##5:							\
>  	FTR_ENTRY_OFFSET label##2b-label##5b;			\
>  	FTR_ENTRY_OFFSET label##3b-label##5b;			\
>  	FTR_ENTRY_OFFSET label##4b-label##5b;			\
> -	.ifgt (label##4b- label##3b)-(label##2b- label##1b);	\
> -	.error "Feature section else case larger than body";	\
> -	.endif;							\
> +	.org . - ((label##4b-label##3b) > (label##2b-label##1b)); \
>  	.popsection;

When I have an oversize alt section this doesn't seem to give me any
error using binutils?

If I hard code:

	.org . - (1);

It fails as expected.

But if I hard code:

	.org . - (1 > 0);

It builds?

cheers

^ permalink raw reply

* Re: Fwd: Petitboot for PS3
From: Geoff Levand @ 2020-11-22 18:13 UTC (permalink / raw)
  To: Carlos Eduardo de Paula; +Cc: linuxppc-dev@lists.ozlabs.org, petitboot
In-Reply-To: <CADnnUqcAGigWgKQtu6=tud=V0-7f2aYqNP3MjP=-bRonD1R7_w@mail.gmail.com>

Hi Carlos,

On 11/19/20 1:07 PM, Carlos Eduardo de Paula wrote:
> In the petitboot shell I was able to set the timeout for booting an image by using ps3-bl-options (which itself uses ps3-flash-util), but if I use these utilities in my booted Linux, I get a "magic_num failed" error and can't do any flash operations. I tried loading the ps3flash module, and the /dev/ps3flash device appears, but I still can't set it. Also, in Linux the ps3vflash etc. devices don't show up. Any tips on accessing the flash from booted Linux?

Your kernel needs to be built with CONFIG_PS3_FLASH set, as it is
with ps3_defconfig.  I guess this is not your problem though since
it seems you can open and read the ps3flash device, but get an
error in the data returned.

You could add the line '#define DEBUG' to the very top of
'drivers/char/ps3flash.c' to print some driver debug output to
the console.

That "magic_num failed" message is coming from the routine
os_area_header_verify() here:
 
 https://git.kernel.org/pub/scm/linux/kernel/git/geoff/ps3-utils.git/tree/lib/flash.c#n321

Which I guess in your case is called by os_area_header_read().

My recommendation is to compare the os_area header from petitboot
and from the booted kernel to see how it compares.  You could use
the os_area_db_dump_header() routine from ps3-utils (show-settings).
Another thing is to add some code to the built kernel to dump the
os_area header for comparison to the petitboot data.

> Another question, I generated a kernel patch from your tree diff'ing stock 5.8.0 from your 5.8.0, then I fetched mainline 5.9.8 and applied this patch, built it and added to yaboot.
> 
> The kernel boots fine but I don't get network although the interface is there and after some minutes, I get a kernel oops in the gelic driver. Here's a print from the error.
> 20201119_151317.jpg

Not sure about that error, but if I were to guess, you are running out of memory...

-Geoff

^ permalink raw reply

* [PATCH kernel v2] vfio/pci/nvlink2: Do not attempt NPU2 setup on POWER8NVL NPU
From: Alexey Kardashevskiy @ 2020-11-22  7:39 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, kvm-ppc, Alex Williamson, stable,
	Leonardo Augusto Guimarães Garcia, David Gibson

We execute certain NPU2 setup code (such as mapping an LPID to a device
in NPU2) unconditionally if an Nvlink bridge is detected. However this
cannot succeed on POWER8NVL machines as the init helpers return an error
other than ENODEV, which means the device is there and setup failed, so
vfio_pci_enable() fails and pass through is not possible.

This changes the two NPU2 related init helpers to return -ENODEV if
there is no "memory-region" device tree property as this is
the distinction between NPU and NPU2.

Tested on
- POWER9 pvr=004e1201, Ubuntu 19.04 host, Ubuntu 18.04 vm,
  NVIDIA GV100 10de:1db1 driver 418.39
- POWER8 pvr=004c0100, RHEL 7.6 host, Ubuntu 16.10 vm,
  NVIDIA P100 10de:15f9 driver 396.47

Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
Cc: stable@vger.kernel.org # 5.0
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* updated commit log with tested configs and replaced P8+ with POWER8NVL for clarity
---
 drivers/vfio/pci/vfio_pci_nvlink2.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
index 65c61710c0e9..9adcf6a8f888 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2.c
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -231,7 +231,7 @@ int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
 		return -EINVAL;
 
 	if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
-		return -EINVAL;
+		return -ENODEV;
 
 	mem_node = of_find_node_by_phandle(mem_phandle);
 	if (!mem_node)
@@ -393,7 +393,7 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 	int ret;
 	struct vfio_pci_npu2_data *data;
 	struct device_node *nvlink_dn;
-	u32 nvlink_index = 0;
+	u32 nvlink_index = 0, mem_phandle = 0;
 	struct pci_dev *npdev = vdev->pdev;
 	struct device_node *npu_node = pci_device_to_OF_node(npdev);
 	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
@@ -408,6 +408,9 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 	if (!pnv_pci_get_gpu_dev(vdev->pdev))
 		return -ENODEV;
 
+	if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
+		return -ENODEV;
+
 	/*
 	 * NPU2 normally has 8 ATSD registers (for concurrency) and 6 links
 	 * so we can allocate one register per link, using nvlink index as
-- 
2.17.1


^ permalink raw reply related

* [PATCH kernel v2] powerpc/powernv/npu: Do not attempt NPU2 setup on POWER8NVL NPU
From: Alexey Kardashevskiy @ 2020-11-22  7:38 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, stable, kvm-ppc,
	Leonardo Augusto Guimarães Garcia, David Gibson

We execute certain NPU2 setup code (such as mapping an LPID to a device
in NPU2) unconditionally if an Nvlink bridge is detected. However this
cannot succeed on POWER8NVL machines and errors appear in dmesg. This is
harmless as skiboot returns an error and the only place we check it is
vfio-pci but that code does not get called on P8+ either.

This adds a check that the pnv_npu2_xxx helpers are called on a machine
with an NPU2, which initializes pnv_phb::npu in pnv_npu2_init();
pnv_phb::npu==NULL on POWER8/NVL (Naples).

While at it, fix a NULL dereference in pnv_npu_peers_take_ownership()/
pnv_npu_peers_release_ownership(), which occurs when GPUs on the mentioned
P8s cause EEH. That happens if "vfio-pci" disables devices using the D3
power state; vfio-pci's disable_idle_d3 module parameter controls this and
must be set on Naples. The EEH handling clears the entire pnv_ioda_pe
struct in pnv_ioda_free_pe(), hence the NULL dereference. We cannot recover
from that, but at least we stop crashing.

Tested on
- POWER9 pvr=004e1201, Ubuntu 19.04 host, Ubuntu 18.04 vm,
  NVIDIA GV100 10de:1db1 driver 418.39
- POWER8 pvr=004c0100, RHEL 7.6 host, Ubuntu 16.10 vm,
  NVIDIA P100 10de:15f9 driver 396.47

Fixes: 1b785611e119 ("powerpc/powernv/npu: Add release_ownership hook")
Cc: stable@vger.kernel.org # 5.0
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* added checks for !pe->table_group.ops and updated commit log
* added tested configurations
---
 arch/powerpc/platforms/powernv/npu-dma.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index abeaa533b976..b711dc3262a3 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -385,7 +385,8 @@ static void pnv_npu_peers_take_ownership(struct iommu_table_group *table_group)
 	for (i = 0; i < npucomp->pe_num; ++i) {
 		struct pnv_ioda_pe *pe = npucomp->pe[i];
 
-		if (!pe->table_group.ops->take_ownership)
+		if (!pe->table_group.ops ||
+		    !pe->table_group.ops->take_ownership)
 			continue;
 		pe->table_group.ops->take_ownership(&pe->table_group);
 	}
@@ -401,7 +402,8 @@ static void pnv_npu_peers_release_ownership(
 	for (i = 0; i < npucomp->pe_num; ++i) {
 		struct pnv_ioda_pe *pe = npucomp->pe[i];
 
-		if (!pe->table_group.ops->release_ownership)
+		if (!pe->table_group.ops ||
+		    !pe->table_group.ops->release_ownership)
 			continue;
 		pe->table_group.ops->release_ownership(&pe->table_group);
 	}
@@ -623,6 +625,11 @@ int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
 		return -ENODEV;
 
 	hose = pci_bus_to_host(npdev->bus);
+	if (hose->npu == NULL) {
+		dev_info_once(&npdev->dev, "Nvlink1 does not support contexts");
+		return 0;
+	}
+
 	nphb = hose->private_data;
 
 	dev_dbg(&gpdev->dev, "Map LPAR opalid=%llu lparid=%u\n",
@@ -670,6 +677,11 @@ int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev)
 		return -ENODEV;
 
 	hose = pci_bus_to_host(npdev->bus);
+	if (hose->npu == NULL) {
+		dev_info_once(&npdev->dev, "Nvlink1 does not support contexts");
+		return 0;
+	}
+
 	nphb = hose->private_data;
 
 	dev_dbg(&gpdev->dev, "destroy context opalid=%llu\n",
-- 
2.17.1


^ permalink raw reply related

