LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: Endless soft-lockups for compiling workload since next-20200519
From: Peter Zijlstra @ 2020-05-21  9:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Linux Kernel Mailing List, Qian Cai,
	Borislav Petkov, Thomas Gleixner, linuxppc-dev
In-Reply-To: <20200521004035.GA15455@lenoir>

On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> > On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > > Just a head up. Repeatedly compiling kernels for a while would trigger
> > > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > > .config are in,
> > 
> > Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> > not seen anything like that myself. Let me go have a look.
> > 
> > 
> > In as far as the logs are readable (they're a wrapped mess, please don't
> > do that!), they contain very little useful, as is typical with IPIs :/
> > 
> > > [ 1167.993773][    C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > > flush_smp_call_function_queue+0x1fa/0x2e0
> 
> So I've tried to think of a race that could produce that and here is
> the only thing I could come up with. It's a bit complicated unfortunately:

This:

>         smp_call_function_single_async() {             smp_call_function_single_async() {
>             // verified csd->flags != CSD_LOCK             // verified csd->flags != CSD_LOCK
>             csd->flags = CSD_LOCK                          csd->flags = CSD_LOCK

concurrent smp_call_function_single_async() using the same csd is what
I'm looking at as well. Now in the ILB case there is an easy cure:

(because there is only a single ilb target)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f94cf52783..b6d8a7b991f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
 	 * is idle. And the softirq performing nohz idle load balance
 	 * will be run before returning from the IPI.
 	 */
-	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
+	smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
 }
 
 /*

Qian, can you give that a spin?

But I'm still not convinced of your scenario:

>                                                    kick_ilb() {
>                                                        atomic_fetch_or(...., nohz_flags(0))

>         atomic_fetch_or(...., nohz_flags(0))           #VMENTER
>         smp_call_function_single_async() {             smp_call_function_single_async() {
>             // verified csd->flags != CSD_LOCK             // verified csd->flags != CSD_LOCK
>             csd->flags = CSD_LOCK                          csd->flags = CSD_LOCK

Note that we check the return value of atomic_fetch_or() and bail if
someone else set a flag in KICK_MASK before us.

Aah, I suppose you're saying this can happen when:

  !(flags & NOHZ_KICK_MASK)

? That's not supposed to happen though.


Anyway, let me go stare at the remove wake-up case, because i'm afraid
that might have the same problem too...

^ permalink raw reply related

* Re: [PATCH] input: i8042: Remove special PowerPC handling
From: Michael Ellerman @ 2020-05-21  7:39 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: kbuild test robot, linux-kernel, clang-built-linux,
	Paul Mackerras, linux-input, Nathan Chancellor, linuxppc-dev
In-Reply-To: <20200520171618.GT89269@dtor-ws>

Dmitry Torokhov <dmitry.torokhov@gmail.com> writes:
> Hi Michael,
>
> On Wed, May 20, 2020 at 04:07:00PM +1000, Michael Ellerman wrote:
>> [ + Dmitry & linux-input ]
>> 
>> Nathan Chancellor <natechancellor@gmail.com> writes:
>> > This causes a build error with CONFIG_WALNUT because kb_cs and kb_data
>> > were removed in commit 917f0af9e5a9 ("powerpc: Remove arch/ppc and
>> > include/asm-ppc").
>> >
>> > ld.lld: error: undefined symbol: kb_cs
>> >> referenced by i8042-ppcio.h:28 (drivers/input/serio/i8042-ppcio.h:28)
>> >> input/serio/i8042.o:(__i8042_command) in archive drivers/built-in.a
>> >> referenced by i8042-ppcio.h:28 (drivers/input/serio/i8042-ppcio.h:28)
>> >> input/serio/i8042.o:(__i8042_command) in archive drivers/built-in.a
>> >> referenced by i8042-ppcio.h:28 (drivers/input/serio/i8042-ppcio.h:28)
>> >> input/serio/i8042.o:(__i8042_command) in archive drivers/built-in.a
>> >
>> > ld.lld: error: undefined symbol: kb_data
>> >> referenced by i8042.c:309 (drivers/input/serio/i8042.c:309)
>> >> input/serio/i8042.o:(__i8042_command) in archive drivers/built-in.a
>> >> referenced by i8042-ppcio.h:33 (drivers/input/serio/i8042-ppcio.h:33)
>> >> input/serio/i8042.o:(__i8042_command) in archive drivers/built-in.a
>> >> referenced by i8042.c:319 (drivers/input/serio/i8042.c:319)
>> >> input/serio/i8042.o:(__i8042_command) in archive drivers/built-in.a
>> >> referenced 15 more times
>> >
>> > Presumably since nobody has noticed this for the last 12 years, there is
>> > not anyone actually trying to use this driver so we can just remove this
>> > special walnut code and use the generic header so it builds for all
>> > configurations.
>> >
>> > Fixes: 917f0af9e5a9 ("powerpc: Remove arch/ppc and include/asm-ppc")
>> > Reported-by: kbuild test robot <lkp@intel.com>
>> > Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
>> > ---
>> >  drivers/input/serio/i8042-ppcio.h | 57 -------------------------------
>> >  drivers/input/serio/i8042.h       |  2 --
>> >  2 files changed, 59 deletions(-)
>> >  delete mode 100644 drivers/input/serio/i8042-ppcio.h
>> 
>> This LGTM.
>> 
>> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
>> 
>> I assumed drivers/input/serio would be pretty quiet, but there's
>> actually some commits to it in linux-next. So perhaps this should go via
>> the input tree.
>> 
>> Dmitry do you want to take this, or should I take it via powerpc?
>> 
>> Original patch is here:
>>   https://lore.kernel.org/lkml/20200518181043.3363953-1-natechancellor@gmail.com
>
> I'm fine with you taking it through powerpc.
>
> Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Thanks.

> Also, while I have your attention ;), could you please ack or take
> https://lore.kernel.org/lkml/20191002214854.GA114387@dtor-ws/ as I
> believe this is the last user or input_polled_dev API and I would like
> to drop it from the tree.

Ooof. Sorry, you are very patient :)

I have put it in my next-test.

In future feel free to send me mail off-list, or ping me on irc, if I
haven't picked something up after several months!

cheers

^ permalink raw reply

* [PATCH] powerpc/4xx: Don't unmap NULL mbase
From: Michael Ellerman @ 2020-05-21  7:26 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: huhai

From: huhai <huhai@tj.kylinos.cn>

Signed-off-by: huhai <huhai@tj.kylinos.cn>
---
 arch/powerpc/platforms/4xx/pci.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/4xx/pci.c b/arch/powerpc/platforms/4xx/pci.c
index e6e2adcc7b64..c13d64c3b019 100644
--- a/arch/powerpc/platforms/4xx/pci.c
+++ b/arch/powerpc/platforms/4xx/pci.c
@@ -1242,7 +1242,7 @@ static void __init ppc460sx_pciex_check_link(struct ppc4xx_pciex_port *port)
 	if (mbase == NULL) {
 		printk(KERN_ERR "%pOF: Can't map internal config space !",
 			port->node);
-		goto done;
+		return;
 	}
 
 	while (attempt && (0 == (in_le32(mbase + PECFG_460SX_DLLSTA)
@@ -1252,9 +1252,7 @@ static void __init ppc460sx_pciex_check_link(struct ppc4xx_pciex_port *port)
 	}
 	if (attempt)
 		port->link = 1;
-done:
 	iounmap(mbase);
-
 }
 
 static struct ppc4xx_pciex_hwops ppc460sx_pcie_hwops __initdata = {
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH 2/3] powerpc/pci: unmap legacy INTx interrupts of passthrough IO adapters
From: Cédric Le Goater @ 2020-05-21  7:13 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Carol L Soto, Oliver O'Halloran, linuxppc-dev, Greg Kurz,
	Alexey Kardashevskiy
In-Reply-To: <20200429075122.1216388-3-clg@kaod.org>

On 4/29/20 9:51 AM, Cédric Le Goater wrote:
> When a passthrough IO adapter is removed from a pseries machine using
> hash MMU and the XIVE interrupt mode, the POWER hypervisor, pHyp,
> expects the guest OS to have cleared all page table entries related to
> the adapter. If some are still present, the RTAS call which isolates
> the PCI slot returns error 9001 "valid outstanding translations" and
> the removal of the IO adapter fails.
> 
> INTx interrupt numbers need special care because Linux maps the
> interrupts automatically in the Linux interrupt number space if they
> are presented in the device tree node describing the IO adapter. These
> interrupts are not un-mapped automatically and in case of an hot-plug
> adapter, the PCI hot-plug layer needs to handle the cleanup to make
> sure that all the page table entries of the XIVE ESB pages are
> cleared.
> 
> Cc: "Oliver O'Halloran" <oohall@gmail.com>
> Signed-off-by: Cédric Le Goater <clg@kaod.org>


We did some tests with differnt passthrough adapters under LPAPRs (PowerVM) 
and KVM guests P8 (HPT) and P9 (HPT+Radix). Baremetal behaves correctly. 
But I would feel more comfortable if someone could give a try (PATCH1-2) 
on a different system.


Thanks,

C. 


> ---
>  arch/powerpc/kernel/pci-hotplug.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
> index bf83f76563a3..9e9c6befd7ea 100644
> --- a/arch/powerpc/kernel/pci-hotplug.c
> +++ b/arch/powerpc/kernel/pci-hotplug.c
> @@ -57,6 +57,8 @@ void pcibios_release_device(struct pci_dev *dev)
>  	struct pci_controller *phb = pci_bus_to_host(dev->bus);
>  	struct pci_dn *pdn = pci_get_pdn(dev);
>  
> +	irq_dispose_mapping(dev->irq);
> +
>  	eeh_remove_device(dev);
>  
>  	if (phb->controller_ops.release_device)
> 


^ permalink raw reply

* Re: [PATCH 0/2] powerpc: Remove support for ppc405/440 Xilinx platforms
From: Michael Ellerman @ 2020-05-21  7:02 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Kate Stewart, Mark Rutland, Desnes A. Nunes do Rosario,
	Geert Uytterhoeven, open list:DOCUMENTATION,
	ALSA Development Mailing List, dri-devel, Jaroslav Kysela,
	Richard Fontana, Paul Mackerras, Miquel Raynal,
	Mauro Carvalho Chehab, Fabio Estevam, Sasha Levin,
	Stephen Rothwell, Jonathan Corbet, Masahiro Yamada, Takashi Iwai,
	YueHaibing, Michal Simek, Krzysztof Kozlowski, Linux ARM,
	Leonardo Bras, DTML, Andrew Donnellan, Bartlomiej Zolnierkiewicz,
	Marc Zyngier, Alistair Popple, linuxppc-dev, Nicholas Piggin,
	Alexios Zavras, Mark Brown, git, Linux Fbdev development list,
	Jonathan Cameron, Thomas Gleixner, Andy Shevchenko,
	Allison Randal, Christophe Leroy, Michal Simek, Wei Hu,
	Christian Lamparter, Greg Kroah-Hartman, Nick Desaulniers,
	linux-kernel@vger.kernel.org, Armijn Hemel, Rob Herring,
	Enrico Weigelt, David S. Miller, Thiago Jung Bauermann
In-Reply-To: <CAK8P3a1XmeeP7FKfNwXZO8cXyJ_U_Jr0kjOaGZ6F=7OcoZ+0nw@mail.gmail.com>

Arnd Bergmann <arnd@arndb.de> writes:
> +On Wed, Apr 8, 2020 at 2:04 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>> Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
>> > On Fri, 2020-04-03 at 15:59 +1100, Michael Ellerman wrote:
>> >> Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
>> > IBM still put 40x cores inside POWER chips no ?
>>
>> Oh yeah that's true. I guess most folks don't know that, or that they
>> run RHEL on them.
>
> Is there a reason for not having those dts files in mainline then?
> If nothing else, it would document what machines are still being
> used with future kernels.

Sorry that part was a joke :D  Those chips don't run Linux.

cheers

^ permalink raw reply

* Re: [RESEND PATCH v7 4/5] ndctl/papr_scm, uapi: Add support for PAPR nvdimm specific methods
From: Michael Ellerman @ 2020-05-21  6:48 UTC (permalink / raw)
  To: Ira Weiny, Vaibhav Jain
  Cc: Aneesh Kumar K . V, linuxppc-dev, linux-kernel, Steven Rostedt,
	linux-nvdimm
In-Reply-To: <20200520153209.GC3660833@iweiny-DESK2.sc.intel.com>

Ira Weiny <ira.weiny@intel.com> writes:
> On Wed, May 20, 2020 at 12:30:57AM +0530, Vaibhav Jain wrote:
>> Introduce support for Papr nvDimm Specific Methods (PDSM) in papr_scm
>> modules and add the command family to the white list of NVDIMM command
>> sets. Also advertise support for ND_CMD_CALL for the dimm
>> command mask and implement necessary scaffolding in the module to
>> handle ND_CMD_CALL ioctl and PDSM requests that we receive.
...
>> + *
>> + * Payload Version:
>> + *
>> + * A 'payload_version' field is present in PDSM header that indicates a specific
>> + * version of the structure present in PDSM Payload for a given PDSM command.
>> + * This provides backward compatibility in case the PDSM Payload structure
>> + * evolves and different structures are supported by 'papr_scm' and 'libndctl'.
>> + *
>> + * When sending a PDSM Payload to 'papr_scm', 'libndctl' should send the version
>> + * of the payload struct it supports via 'payload_version' field. The 'papr_scm'
>> + * module when servicing the PDSM envelope checks the 'payload_version' and then
>> + * uses 'payload struct version' == MIN('payload_version field',
>> + * 'max payload-struct-version supported by papr_scm') to service the PDSM.
>> + * After servicing the PDSM, 'papr_scm' put the negotiated version of payload
>> + * struct in returned 'payload_version' field.
>
> FWIW many people believe using a size rather than version is more sustainable.
> It is expected that new payload structures are larger (more features) than the
> previous payload structure.
>
> I can't find references at the moment through.

I think clone_args is a good modern example:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/sched.h#n88

cheers

^ permalink raw reply

* Re: [PATCH v2] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT
From: Ram Pai @ 2020-05-21  6:08 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: linux-kernel, kvm-ppc, groug, paulus, sukadev, linuxppc-dev
In-Reply-To: <20200520174308.77820-1-ldufour@linux.ibm.com>

On Wed, May 20, 2020 at 07:43:08PM +0200, Laurent Dufour wrote:
> The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
> Hcalls") added checks of secure bit of SRR1 to filter out the Hcall
> reserved to the Ultravisor.
> 
> However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
> context of the VM calling UV_ESM. This allows the Hypervisor to return to
> the guest without going through the Ultravisor. Thus the Secure bit of SRR1
> is not set in that particular case.
> 
> In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be
> filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
> not set in that case.
> 
> Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>


Reviewed-by: Ram Pai <linuxram@us.ibm.com>

> ---
>  arch/powerpc/kvm/book3s_hv.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 93493f0cbfe8..6ad1a3b14300 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1099,9 +1099,12 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>  			ret = kvmppc_h_svm_init_done(vcpu->kvm);
>  		break;
>  	case H_SVM_INIT_ABORT:
> -		ret = H_UNSUPPORTED;
> -		if (kvmppc_get_srr1(vcpu) & MSR_S)
> -			ret = kvmppc_h_svm_init_abort(vcpu->kvm);
> +		/*
> +		 * Even if that call is made by the Ultravisor, the SSR1 value
> +		 * is the guest context one, with the secure bit clear as it has
> +		 * not yet been secured. So we can't check it here.
> +		 */

Frankly speaking, the comment above when read in isolation; i.e without
the delete code above, feels out of place.  The reasoning for change is
anyway captured in the changelog.  So, I think, we should delete this
comment.

Also the comment above assumes the Ultravisor will call H_SVM_INIT_ABORT
with SRR1(S) bit not set; which may or may not be true.  Regardless of
who and how H_SVM_INIT_ABORT is called, we should just call
kvmppc_h_svm_init_abort() and let it deal with the complexities.


RP

^ permalink raw reply

* Re: linux-next: manual merge of the rcu tree with the powerpc tree
From: Stephen Rothwell @ 2020-05-21  4:51 UTC (permalink / raw)
  To: Michael Ellerman, PowerPC, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin
  Cc: Linux Next Mailing List, Nicholas Piggin,
	Linux Kernel Mailing List, Paul E. McKenney
In-Reply-To: <20200519172316.3b37cbae@canb.auug.org.au>

[-- Attachment #1: Type: text/plain, Size: 1122 bytes --]

Hi all,

On Tue, 19 May 2020 17:23:16 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Today's linux-next merge of the rcu tree got a conflict in:
> 
>   arch/powerpc/kernel/traps.c
> 
> between commit:
> 
>   116ac378bb3f ("powerpc/64s: machine check interrupt update NMI accounting")
> 
> from the powerpc tree and commit:
> 
>   187416eeb388 ("hardirq/nmi: Allow nested nmi_enter()")
> 
> from the rcu tree.
> 
> I fixed it up (I used the powerpc tree version for now) and can carry the
> fix as necessary. This is now fixed as far as linux-next is concerned,
> but any non trivial conflicts should be mentioned to your upstream
> maintainer when your tree is submitted for merging.  You may also want
> to consider cooperating with the maintainer of the conflicting tree to
> minimise any particularly complex conflicts.

This is now a conflict between the powerpc commit and commit

  69ea03b56ed2 ("hardirq/nmi: Allow nested nmi_enter()")

from the tip tree.  I assume that the rcu and tip trees are sharing
some patches (but not commits) :-(

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH] soc: fsl: qe: Replace one-element array and use struct_size() helper
From: Kees Cook @ 2020-05-21  3:24 UTC (permalink / raw)
  To: Li Yang
  Cc: Gustavo A. R. Silva, lkml, Gustavo A. R. Silva, linuxppc-dev,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
	Qiang Zhao
In-Reply-To: <CADRPPNR-Croux9FgnrQJJmdF2jNnuAmC+2xMJSgSbkbRv9u8Mw@mail.gmail.com>

On Wed, May 20, 2020 at 06:52:21PM -0500, Li Yang wrote:
> On Mon, May 18, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote:
> > Hm, looking at this code, I see a few other things that need to be
> > fixed:
> >
> > 1) drivers/tty/serial/ucc_uart.c does not do a be32_to_cpu() conversion
> >    on the length test (understandably, a little-endian system has never run
> >    this code since it's ppc specific), but it's still wrong:
> >
> >         if (firmware->header.length != fw->size) {
> >
> >    compare to the firmware loader:
> >
> >         length = be32_to_cpu(hdr->length);
> >
> > 2) drivers/soc/fsl/qe/qe.c does not perform bounds checking on the
> >    per-microcode offsets, so the uploader might send data outside the
> >    firmware buffer. Perhaps:
> 
> We do validate the CRC for each microcode, it is unlikely the CRC
> check can pass if the offset or length is not correct.  But you are
> probably right that it will be safer to check the boundary and fail

Right, but a malicious firmware file could still match CRC but trick the
kernel code.

> quicker before we actually start the CRC check.  Will you come up with
> a formal patch or you want us to deal with it?

It sounds like Gustavo will be sending one, though I don't think either
of us have the hardware to test it with, so if you could do that part,
that would be great! :)

-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH v3] powerpc/64: Option to use ELF V2 ABI for big-endian kernels
From: Nicholas Piggin @ 2020-05-21  2:23 UTC (permalink / raw)
  To: Michael Ellerman, Segher Boessenkool; +Cc: linuxppc-dev
In-Reply-To: <20200518121917.GJ31009@gate.crashing.org>

Excerpts from Segher Boessenkool's message of May 18, 2020 10:19 pm:
> Hi!
> 
> On Mon, May 18, 2020 at 04:35:22PM +1000, Michael Ellerman wrote:
>> Nicholas Piggin <npiggin@gmail.com> writes:
>> > Provide an option to build big-endian kernels using the ELF V2 ABI. This works
>> > on GCC and clang (since about 2014). it is is not officially supported by the
>> > GNU toolchain, but it can give big-endian kernels  some useful advantages of
>> > the V2 ABI (e.g., less stack usage).
> 
>> This doesn't build with clang:
>> 
>>   /tmp/aesp8-ppc-dad624.s: Assembler messages:
>>   /tmp/aesp8-ppc-dad624.s: Error: .size expression for aes_p8_set_encrypt_key does not evaluate to a constant
> 
> What does this assembler code that clang doesn't swallow look like?  Is
> that valid code?  Etc.

The .size directive calculation is .text - .opd because the preprocessor 
isn't passing -mabi=elfv2 which makes our _GLOBAL function entry macro 
use the v1 convention. I guess I got the 64-bit vdso wrong as well, it 
should remain in ELFv1.

Thanks,
Nick

^ permalink raw reply

* [PATCH v3 7/7] powerpc: Add POWER10 architected mode
From: Alistair Popple @ 2020-05-21  1:43 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: mikey, npiggin, aneesh.kumar, Alistair Popple,
	Cédric Le Goater
In-Reply-To: <20200521014341.29095-1-alistair@popple.id.au>

PVR value of 0x0F000006 means we are arch v3.1 compliant (i.e. POWER10).
This is used by phyp and kvm when booting as a pseries guest to detect
the presence of new P10 features and to enable the appropriate hwcap and
facility bits.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/cputable.h   | 15 ++++++++++++--
 arch/powerpc/include/asm/mmu.h        |  1 +
 arch/powerpc/include/asm/prom.h       |  1 +
 arch/powerpc/kernel/cpu_setup_power.S | 20 ++++++++++++++++--
 arch/powerpc/kernel/cputable.c        | 30 +++++++++++++++++++++++++++
 arch/powerpc/kernel/prom_init.c       | 12 +++++++++--
 6 files changed, 73 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index 1559dbf72842..bac2252c839e 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -468,6 +468,17 @@ static inline void cpu_feature_keys_init(void) { }
 #define CPU_FTRS_POWER9_DD2_2 (CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD2_1 | \
 			       CPU_FTR_P9_TM_HV_ASSIST | \
 			       CPU_FTR_P9_TM_XER_SO_BUG)
+#define CPU_FTRS_POWER10 (CPU_FTR_LWSYNC | \
+	    CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_ARCH_206 |\
+	    CPU_FTR_MMCRA | CPU_FTR_SMT | \
+	    CPU_FTR_COHERENT_ICACHE | \
+	    CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
+	    CPU_FTR_DSCR | CPU_FTR_SAO  | \
+	    CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
+	    CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
+	    CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_ARCH_207S | \
+	    CPU_FTR_TM_COMP | CPU_FTR_ARCH_300 | CPU_FTR_PKEY | \
+	    CPU_FTR_ARCH_31)
 #define CPU_FTRS_CELL	(CPU_FTR_LWSYNC | \
 	    CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -486,14 +497,14 @@ static inline void cpu_feature_keys_init(void) { }
 #define CPU_FTRS_POSSIBLE	\
 	    (CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | CPU_FTRS_POWER8 | \
 	     CPU_FTR_ALTIVEC_COMP | CPU_FTR_VSX_COMP | CPU_FTRS_POWER9 | \
-	     CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2)
+	     CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
 #else
 #define CPU_FTRS_POSSIBLE	\
 	    (CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | \
 	     CPU_FTRS_POWER6 | CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | \
 	     CPU_FTRS_POWER8 | CPU_FTRS_CELL | CPU_FTRS_PA6T | \
 	     CPU_FTR_VSX_COMP | CPU_FTR_ALTIVEC_COMP | CPU_FTRS_POWER9 | \
-	     CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2)
+	     CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
 #endif /* CONFIG_CPU_LITTLE_ENDIAN */
 #endif
 #else
diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index cf2a08bfd5cd..f4ac25d4df05 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -122,6 +122,7 @@
 #define MMU_FTRS_POWER7		MMU_FTRS_POWER6
 #define MMU_FTRS_POWER8		MMU_FTRS_POWER6
 #define MMU_FTRS_POWER9		MMU_FTRS_POWER6
+#define MMU_FTRS_POWER10	MMU_FTRS_POWER6
 #define MMU_FTRS_CELL		MMU_FTRS_DEFAULT_HPTE_ARCH_V2 | \
 				MMU_FTR_CI_LARGE_PAGE
 #define MMU_FTRS_PA6T		MMU_FTRS_DEFAULT_HPTE_ARCH_V2 | \
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 94e3fd54f2c8..324a13351749 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -117,6 +117,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV1_PPC_2_07		0x01	/* set if we support PowerPC 2.07 */
 
 #define OV1_PPC_3_00		0x80	/* set if we support PowerPC 3.00 */
+#define OV1_PPC_3_1			0x40	/* set if we support PowerPC 3.1 */
 
 /* Option vector 2: Open Firmware options supported */
 #define OV2_REAL_MODE		0x20	/* set if we want OF in real mode */
diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index a460298c7ddb..f3730cf904fa 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -91,10 +91,15 @@ _GLOBAL(__restore_cpu_power8)
 	mtlr	r11
 	blr
 
+_GLOBAL(__setup_cpu_power10)
+	mflr	r11
+	bl	__init_FSCR_P10
+	b	1f
+
 _GLOBAL(__setup_cpu_power9)
 	mflr	r11
 	bl	__init_FSCR
-	bl	__init_PMU
+1:	bl	__init_PMU
 	bl	__init_hvmode_206
 	mtlr	r11
 	beqlr
@@ -116,10 +121,15 @@ _GLOBAL(__setup_cpu_power9)
 	mtlr	r11
 	blr
 
+_GLOBAL(__restore_cpu_power10)
+	mflr	r11
+	bl	__init_FSCR_P10
+	b	1f
+
 _GLOBAL(__restore_cpu_power9)
 	mflr	r11
 	bl	__init_FSCR
-	bl	__init_PMU
+1:	bl	__init_PMU
 	mfmsr	r3
 	rldicl.	r0,r3,4,63
 	mtlr	r11
@@ -182,6 +192,12 @@ __init_LPCR_ISA300:
 	isync
 	blr
 
+__init_FSCR_P10:
+	mfspr	r3,SPRN_FSCR
+	ori	r3,r3,FSCR_TAR|FSCR_DSCR|FSCR_EBB|FSCR_PREFIX
+	mtspr	SPRN_FSCR,r3
+	blr
+
 __init_FSCR:
 	mfspr	r3,SPRN_FSCR
 	ori	r3,r3,FSCR_TAR|FSCR_DSCR|FSCR_EBB
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 13eba2eb46fe..a17eeb311cdb 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -70,6 +70,8 @@ extern void __setup_cpu_power8(unsigned long offset, struct cpu_spec* spec);
 extern void __restore_cpu_power8(void);
 extern void __setup_cpu_power9(unsigned long offset, struct cpu_spec* spec);
 extern void __restore_cpu_power9(void);
+extern void __setup_cpu_power10(unsigned long offset, struct cpu_spec* spec);
+extern void __restore_cpu_power10(void);
 extern long __machine_check_early_realmode_p7(struct pt_regs *regs);
 extern long __machine_check_early_realmode_p8(struct pt_regs *regs);
 extern long __machine_check_early_realmode_p9(struct pt_regs *regs);
@@ -119,6 +121,10 @@ extern void __restore_cpu_e6500(void);
 				 PPC_FEATURE2_ARCH_3_00 | \
 				 PPC_FEATURE2_HAS_IEEE128 | \
 				 PPC_FEATURE2_DARN )
+#define COMMON_USER_POWER10	COMMON_USER_POWER9
+#define COMMON_USER2_POWER10	(COMMON_USER2_POWER9 | \
+				 PPC_FEATURE2_ARCH_3_1 | \
+				 PPC_FEATURE2_MMA)
 
 #ifdef CONFIG_PPC_BOOK3E_64
 #define COMMON_USER_BOOKE	(COMMON_USER_PPC64 | PPC_FEATURE_BOOKE)
@@ -127,6 +133,14 @@ extern void __restore_cpu_e6500(void);
 				 PPC_FEATURE_BOOKE)
 #endif
 
+#ifdef CONFIG_PPC64
+static void setup_cpu_power10(unsigned long offset, struct cpu_spec* spec)
+{
+	__setup_cpu_power10(offset, spec);
+	current->thread.fscr |= FSCR_PREFIX;
+}
+#endif
+
 static struct cpu_spec __initdata cpu_specs[] = {
 #ifdef CONFIG_PPC_BOOK3S_64
 	{	/* PPC970 */
@@ -367,6 +381,22 @@ static struct cpu_spec __initdata cpu_specs[] = {
 		.cpu_restore		= __restore_cpu_power9,
 		.platform		= "power9",
 	},
+	{	/* 3.1-compliant processor, i.e. Power10 "architected" mode */
+		.pvr_mask		= 0xffffffff,
+		.pvr_value		= 0x0f000006,
+		.cpu_name		= "POWER10 (architected)",
+		.cpu_features		= CPU_FTRS_POWER10,
+		.cpu_user_features	= COMMON_USER_POWER10,
+		.cpu_user_features2	= COMMON_USER2_POWER10,
+		.mmu_features		= MMU_FTRS_POWER10,
+		.icache_bsize		= 128,
+		.dcache_bsize		= 128,
+		.oprofile_type		= PPC_OPROFILE_INVALID,
+		.oprofile_cpu_type	= "ppc64/ibm-compat-v1",
+		.cpu_setup		= setup_cpu_power10,
+		.cpu_restore		= __restore_cpu_power10,
+		.platform		= "power10",
+	},
 	{	/* Power7 */
 		.pvr_mask		= 0xffff0000,
 		.pvr_value		= 0x003f0000,
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index e3a9fde51c4f..5f15b10eb007 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -920,7 +920,7 @@ struct option_vector6 {
 } __packed;
 
 struct ibm_arch_vec {
-	struct { u32 mask, val; } pvrs[12];
+	struct { u32 mask, val; } pvrs[14];
 
 	u8 num_vectors;
 
@@ -973,6 +973,14 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 			.mask = cpu_to_be32(0xffff0000), /* POWER9 */
 			.val  = cpu_to_be32(0x004e0000),
 		},
+		{
+			.mask = cpu_to_be32(0xffff0000), /* POWER10 */
+			.val  = cpu_to_be32(0x00800000),
+		},
+		{
+			.mask = cpu_to_be32(0xffffffff), /* all 3.1-compliant */
+			.val  = cpu_to_be32(0x0f000006),
+		},
 		{
 			.mask = cpu_to_be32(0xffffffff), /* all 3.00-compliant */
 			.val  = cpu_to_be32(0x0f000005),
@@ -1002,7 +1010,7 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 		.byte1 = 0,
 		.arch_versions = OV1_PPC_2_00 | OV1_PPC_2_01 | OV1_PPC_2_02 | OV1_PPC_2_03 |
 				 OV1_PPC_2_04 | OV1_PPC_2_05 | OV1_PPC_2_06 | OV1_PPC_2_07,
-		.arch_versions3 = OV1_PPC_3_00,
+		.arch_versions3 = OV1_PPC_3_00 | OV1_PPC_3_1,
 	},
 
 	.vec2_len = VECTOR_LENGTH(sizeof(struct option_vector2)),
-- 
2.20.1


^ permalink raw reply related

* [PATCH v3 6/7] powerpc/dt_cpu_ftrs: Add MMA feature
From: Alistair Popple @ 2020-05-21  1:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, npiggin, aneesh.kumar, Alistair Popple
In-Reply-To: <20200521014341.29095-1-alistair@popple.id.au>

Matrix multiple assist (MMA) is a new feature added to ISAv3.1 and
POWER10. Support on powernv can be selected via a firmware CPU device
tree feature which enables it via a PCR bit.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
---
 arch/powerpc/include/asm/reg.h    |  3 ++-
 arch/powerpc/kernel/dt_cpu_ftrs.c | 17 ++++++++++++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index dd20af367b57..88e6c78100d9 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -481,7 +481,8 @@
 #define   PCR_VEC_DIS	(__MASK(63-0))	/* Vec. disable (bit NA since POWER8) */
 #define   PCR_VSX_DIS	(__MASK(63-1))	/* VSX disable (bit NA since POWER8) */
 #define   PCR_TM_DIS	(__MASK(63-2))	/* Trans. memory disable (POWER8) */
-#define   PCR_HIGH_BITS	(PCR_VEC_DIS | PCR_VSX_DIS | PCR_TM_DIS)
+#define   PCR_MMA_DIS	(__MASK(63-3)) /* Matrix-Multiply Accelerator */
+#define   PCR_HIGH_BITS	(PCR_MMA_DIS | PCR_VEC_DIS | PCR_VSX_DIS | PCR_TM_DIS)
 /*
  * These bits are used in the function kvmppc_set_arch_compat() to specify and
  * determine both the compatibility level which we want to emulate and the
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 93c340906aad..0a41fce34165 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -75,6 +75,7 @@ static struct {
 	u64	lpcr_clear;
 	u64	hfscr;
 	u64	fscr;
+	u64	pcr;
 } system_registers;
 
 static void (*init_pmu_registers)(void);
@@ -102,7 +103,7 @@ static void __restore_cpu_cpufeatures(void)
 	if (hv_mode) {
 		mtspr(SPRN_LPID, 0);
 		mtspr(SPRN_HFSCR, system_registers.hfscr);
-		mtspr(SPRN_PCR, PCR_MASK);
+		mtspr(SPRN_PCR, system_registers.pcr);
 	}
 	mtspr(SPRN_FSCR, system_registers.fscr);
 
@@ -555,6 +556,18 @@ static int __init feat_enable_large_ci(struct dt_cpu_feature *f)
 	return 1;
 }
 
+static int __init feat_enable_mma(struct dt_cpu_feature *f)
+{
+	u64 pcr;
+
+	feat_enable(f);
+	pcr = mfspr(SPRN_PCR);
+	pcr &= ~PCR_MMA_DIS;
+	mtspr(SPRN_PCR, pcr);
+
+	return 1;
+}
+
 struct dt_cpu_feature_match {
 	const char *name;
 	int (*enable)(struct dt_cpu_feature *f);
@@ -629,6 +642,7 @@ static struct dt_cpu_feature_match __initdata
 	{"vector-binary16", feat_enable, 0},
 	{"wait-v3", feat_enable, 0},
 	{"prefix-instructions", feat_enable, 0},
+	{"matrix-multiply-assist", feat_enable_mma, 0},
 };
 
 static bool __initdata using_dt_cpu_ftrs;
@@ -779,6 +793,7 @@ static void __init cpufeatures_setup_finished(void)
 	system_registers.lpcr = mfspr(SPRN_LPCR);
 	system_registers.hfscr = mfspr(SPRN_HFSCR);
 	system_registers.fscr = mfspr(SPRN_FSCR);
+	system_registers.pcr = mfspr(SPRN_PCR);
 
 	pr_info("final cpu/mmu features = 0x%016lx 0x%08x\n",
 		cur_cpu_spec->cpu_features, cur_cpu_spec->mmu_features);
-- 
2.20.1


^ permalink raw reply related

* [PATCH v3 5/7] powerpc/dt_cpu_ftrs: Enable Prefixed Instructions
From: Alistair Popple @ 2020-05-21  1:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, npiggin, aneesh.kumar, Alistair Popple
In-Reply-To: <20200521014341.29095-1-alistair@popple.id.au>

Prefix instructions have their own FSCR bit which needs to be enabled
via a CPU feature. The kernel will save the FSCR for problem state but
it needs to be enabled initially.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
---
 arch/powerpc/kernel/dt_cpu_ftrs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 677190f70cac..93c340906aad 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -628,6 +628,7 @@ static struct dt_cpu_feature_match __initdata
 	{"vector-binary128", feat_enable, 0},
 	{"vector-binary16", feat_enable, 0},
 	{"wait-v3", feat_enable, 0},
+	{"prefix-instructions", feat_enable, 0},
 };
 
 static bool __initdata using_dt_cpu_ftrs;
-- 
2.20.1


^ permalink raw reply related

* [PATCH v3 4/7] powerpc/dt_cpu_ftrs: Set current thread fscr bits
From: Alistair Popple @ 2020-05-21  1:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, npiggin, aneesh.kumar, Alistair Popple
In-Reply-To: <20200521014341.29095-1-alistair@popple.id.au>

Setting the FSCR bit directly in the SPR only sets it during the initial
boot and early init of the kernel but not for the init process or any
subsequent kthreads. This is because the thread_struct for those is
copied from the current thread_struct setup at boot which doesn't
reflect any changes made to the FSCR during cpu feature detection. This
patch ensures the current thread state is updated to match once feature
detection is complete.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
---
 arch/powerpc/kernel/dt_cpu_ftrs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index b5e21264d168..677190f70cac 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -170,6 +170,7 @@ static int __init feat_try_enable_unknown(struct dt_cpu_feature *f)
 		u64 fscr = mfspr(SPRN_FSCR);
 		fscr |= 1UL << f->fscr_bit_nr;
 		mtspr(SPRN_FSCR, fscr);
+		current->thread.fscr |= 1UL << f->fscr_bit_nr;
 	} else {
 		/* Does not have a known recipe */
 		return 0;
@@ -205,6 +206,7 @@ static int __init feat_enable(struct dt_cpu_feature *f)
 			u64 fscr = mfspr(SPRN_FSCR);
 			fscr |= 1UL << f->fscr_bit_nr;
 			mtspr(SPRN_FSCR, fscr);
+			current->thread.fscr |= 1UL << f->fscr_bit_nr;
 		}
 	}
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH v3 3/7] powerpc/dt_cpu_ftrs: Advertise support for ISA v3.1 if selected
From: Alistair Popple @ 2020-05-21  1:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, npiggin, aneesh.kumar, Alistair Popple
In-Reply-To: <20200521014341.29095-1-alistair@popple.id.au>

On powernv hardware support for ISAv3.1 is advertised via a cpu feature
bit in the device tree. This patch enables the associated HWCAP bit if
the device tree indicates ISAv3.1 is available.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
---
 arch/powerpc/kernel/dt_cpu_ftrs.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 36bc0d5c4f3a..b5e21264d168 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -26,6 +26,7 @@
 /* Device-tree visible constants follow */
 #define ISA_V2_07B      2070
 #define ISA_V3_0B       3000
+#define ISA_V3_1        3100
 
 #define USABLE_PR               (1U << 0)
 #define USABLE_OS               (1U << 1)
@@ -654,6 +655,11 @@ static void __init cpufeatures_setup_start(u32 isa)
 		cur_cpu_spec->cpu_features |= CPU_FTR_ARCH_300;
 		cur_cpu_spec->cpu_user_features2 |= PPC_FEATURE2_ARCH_3_00;
 	}
+
+	if (isa >= 3100) {
+		cur_cpu_spec->cpu_features |= CPU_FTR_ARCH_31;
+		cur_cpu_spec->cpu_user_features2 |= PPC_FEATURE2_ARCH_3_1;
+	}
 }
 
 static bool __init cpufeatures_process_feature(struct dt_cpu_feature *f)
-- 
2.20.1


^ permalink raw reply related

* [PATCH v3 2/7] powerpc: Add support for ISA v3.1
From: Alistair Popple @ 2020-05-21  1:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, npiggin, aneesh.kumar, Alistair Popple
In-Reply-To: <20200521014341.29095-1-alistair@popple.id.au>

Newer ISA versions are enabled by clearing all bits in the PCR
associated with previous versions of the ISA. Enable ISA v3.1 support
by updating the PCR mask to include ISA v3.0. This ensures all PCR
bits corresponding to earlier architecture versions get cleared
thereby enabling ISA v3.1 if supported by the hardware.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
---
 arch/powerpc/include/asm/cputable.h | 1 +
 arch/powerpc/include/asm/reg.h      | 3 ++-
 arch/powerpc/kvm/book3s_hv.c        | 3 ---
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index c67b94f3334c..1559dbf72842 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -213,6 +213,7 @@ static inline void cpu_feature_keys_init(void) { }
 #define CPU_FTR_P9_TIDR			LONG_ASM_CONST(0x0000800000000000)
 #define CPU_FTR_P9_TLBIE_ERAT_BUG	LONG_ASM_CONST(0x0001000000000000)
 #define CPU_FTR_P9_RADIX_PREFETCH_BUG	LONG_ASM_CONST(0x0002000000000000)
+#define CPU_FTR_ARCH_31			LONG_ASM_CONST(0x0004000000000000)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 054f8a71d686..dd20af367b57 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -487,10 +487,11 @@
  * determine both the compatibility level which we want to emulate and the
  * compatibility level which the host is capable of emulating.
  */
+#define   PCR_ARCH_300	0x10		/* Architecture 3.00 */
 #define   PCR_ARCH_207	0x8		/* Architecture 2.07 */
 #define   PCR_ARCH_206	0x4		/* Architecture 2.06 */
 #define   PCR_ARCH_205	0x2		/* Architecture 2.05 */
-#define   PCR_LOW_BITS	(PCR_ARCH_207 | PCR_ARCH_206 | PCR_ARCH_205)
+#define   PCR_LOW_BITS	(PCR_ARCH_207 | PCR_ARCH_206 | PCR_ARCH_205 | PCR_ARCH_300)
 #define   PCR_MASK	~(PCR_HIGH_BITS | PCR_LOW_BITS)	/* PCR Reserved Bits */
 #define	SPRN_HEIR	0x153	/* Hypervisor Emulated Instruction Register */
 #define SPRN_TLBINDEXR	0x154	/* P7 TLB control register */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index db07199f0977..a0cf17597838 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -344,9 +344,6 @@ static void kvmppc_set_pvr_hv(struct kvm_vcpu *vcpu, u32 pvr)
 	vcpu->arch.pvr = pvr;
 }
 
-/* Dummy value used in computing PCR value below */
-#define PCR_ARCH_300	(PCR_ARCH_207 << 1)
-
 static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 {
 	unsigned long host_pcr_bit = 0, guest_pcr_bit = 0;
-- 
2.20.1


^ permalink raw reply related

* [PATCH v3 1/7] powerpc: Add new HWCAP bits
From: Alistair Popple @ 2020-05-21  1:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, npiggin, aneesh.kumar, Alistair Popple
In-Reply-To: <20200521014341.29095-1-alistair@popple.id.au>

POWER10 introduces two new architectural features - ISAv3.1 and matrix
multiply assist (MMA) instructions. Userspace detects the presence
of these features via two HWCAP bits introduced in this patch. These
bits have been agreed to by the compiler and binutils team.

According to ISAv3.1 MMA is an optional feature and software that makes
use of it should first check for availability via this HWCAP bit and use
alternate code paths if unavailable.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Tested-by: Michael Neuling <mikey@neuling.org>
---
 arch/powerpc/include/uapi/asm/cputable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/cputable.h b/arch/powerpc/include/uapi/asm/cputable.h
index 540592034740..731b97dc2d15 100644
--- a/arch/powerpc/include/uapi/asm/cputable.h
+++ b/arch/powerpc/include/uapi/asm/cputable.h
@@ -50,6 +50,8 @@
 #define PPC_FEATURE2_DARN		0x00200000 /* darn random number insn */
 #define PPC_FEATURE2_SCV		0x00100000 /* scv syscall */
 #define PPC_FEATURE2_HTM_NO_SUSPEND	0x00080000 /* TM w/out suspended state */
+#define PPC_FEATURE2_ARCH_3_1		0x00040000 /* ISA 3.1 */
+#define PPC_FEATURE2_MMA		0x00020000 /* Matrix Multiply Assist */
 
 /*
  * IMPORTANT!
-- 
2.20.1


^ permalink raw reply related

* [PATCH v3 0/7] Base support for POWER10
From: Alistair Popple @ 2020-05-21  1:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, npiggin, aneesh.kumar, Alistair Popple

This series brings together several previously posted patches required for
POWER10 support and introduces a new patch enabling POWER10 architected
mode to enable booting as a POWER10 pseries guest.

It includes support for enabling facilities related to MMA and prefix
instructions.

Changes from v2:
  - s/accumulate/assist/
  - Updated commit messages

Changes from v1:
  - Two bug-fixes to enable prefix and MMA on pseries
  - Minor updates to commit message wording
  - Fixes a build error when CONFIG_KVM_BOOK3S_64_HV is enabled

Alistair Popple (7):
  powerpc: Add new HWCAP bits
  powerpc: Add support for ISA v3.1
  powerpc/dt_cpu_ftrs: Advertise support for ISA v3.1 if selected
  powerpc/dt_cpu_ftrs: Set current thread fscr bits
  powerpc/dt_cpu_ftrs: Enable Prefixed Instructions
  powerpc/dt_cpu_ftrs: Add MMA feature
  powerpc: Add POWER10 architected mode

 arch/powerpc/include/asm/cputable.h      | 16 +++++++++++--
 arch/powerpc/include/asm/mmu.h           |  1 +
 arch/powerpc/include/asm/prom.h          |  1 +
 arch/powerpc/include/asm/reg.h           |  6 +++--
 arch/powerpc/include/uapi/asm/cputable.h |  2 ++
 arch/powerpc/kernel/cpu_setup_power.S    | 20 ++++++++++++++--
 arch/powerpc/kernel/cputable.c           | 30 ++++++++++++++++++++++++
 arch/powerpc/kernel/dt_cpu_ftrs.c        | 26 +++++++++++++++++++-
 arch/powerpc/kernel/prom_init.c          | 12 ++++++++--
 arch/powerpc/kvm/book3s_hv.c             |  3 ---
 10 files changed, 105 insertions(+), 12 deletions(-)

-- 
2.20.1


^ permalink raw reply

* Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()
From: Alexander Duyck @ 2020-05-21  1:29 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
	linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
	Steffen Klassert, linux-s390, Herbert Xu, Jonathan Corbet,
	Jason Gunthorpe, Zi Yan, Robert Elliott, Pavel Tatashin,
	Shile Zhang, Josh Triplett, Alex Williamson, Kirill Tkhai,
	Dan Williams, Randy Dunlap, LKML, linux-crypto, Tejun Heo,
	Andrew Morton, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)
In-Reply-To: <20200520182645.1658949-6-daniel.m.jordan@oracle.com>

On Wed, May 20, 2020 at 11:27 AM Daniel Jordan
<daniel.m.jordan@oracle.com> wrote:
>
> Deferred struct page init is a significant bottleneck in kernel boot.
> Optimizing it maximizes availability for large-memory systems and allows
> spinning up short-lived VMs as needed without having to leave them
> running.  It also benefits bare metal machines hosting VMs that are
> sensitive to downtime.  In projects such as VMM Fast Restart[1], where
> guest state is preserved across kexec reboot, it helps prevent
> application and network timeouts in the guests.
>
> Multithread to take full advantage of system memory bandwidth.
>
> The maximum number of threads is capped at the number of CPUs on the
> node because speedups always improve with additional threads on every
> system tested, and at this phase of boot, the system is otherwise idle
> and waiting on page init to finish.
>
> Helper threads operate on section-aligned ranges to both avoid false
> sharing when setting the pageblock's migrate type and to avoid accessing
> uninitialized buddy pages, though max order alignment is enough for the
> latter.
>
> The minimum chunk size is also a section.  There was benefit to using
> multiple threads even on relatively small memory (1G) systems, and this
> is the smallest size that the alignment allows.
>
> The time (milliseconds) is the slowest node to initialize since boot
> blocks until all nodes finish.  intel_pstate is loaded in active mode
> without hwp and with turbo enabled, and intel_idle is active as well.
>
>     Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
>       2 nodes * 26 cores * 2 threads = 104 CPUs
>       384G/node = 768G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   4078.0 (  9.0)         --   1779.0 (  8.7)
>        2% (  1)       1.4%   4021.3 (  2.9)       3.4%   1717.7 (  7.8)
>       12% (  6)      35.1%   2644.7 ( 35.3)      80.8%    341.0 ( 35.5)
>       25% ( 13)      38.7%   2498.0 ( 34.2)      89.1%    193.3 ( 32.3)
>       37% ( 19)      39.1%   2482.0 ( 25.2)      90.1%    175.3 ( 31.7)
>       50% ( 26)      38.8%   2495.0 (  8.7)      89.1%    193.7 (  3.5)
>       75% ( 39)      39.2%   2478.0 ( 21.0)      90.3%    172.7 ( 26.7)
>      100% ( 52)      40.0%   2448.0 (  2.0)      91.9%    143.3 (  1.5)
>
>     Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
>       1 node * 16 cores * 2 threads = 32 CPUs
>       192G/node = 192G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   1996.0 ( 18.0)         --   1104.3 (  6.7)
>        3% (  1)       1.4%   1968.0 (  3.0)       2.7%   1074.7 (  9.0)
>       12% (  4)      40.1%   1196.0 ( 22.7)      72.4%    305.3 ( 16.8)
>       25% (  8)      47.4%   1049.3 ( 17.2)      84.2%    174.0 ( 10.6)
>       37% ( 12)      48.3%   1032.0 ( 14.9)      86.8%    145.3 (  2.5)
>       50% ( 16)      48.9%   1020.3 (  2.5)      88.0%    133.0 (  1.7)
>       75% ( 24)      49.1%   1016.3 (  8.1)      88.4%    128.0 (  1.7)
>      100% ( 32)      49.4%   1009.0 (  8.5)      88.6%    126.3 (  0.6)
>
>     Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
>       2 nodes * 18 cores * 2 threads = 72 CPUs
>       128G/node = 256G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   1682.7 (  6.7)         --    630.0 (  4.6)
>        3% (  1)       0.4%   1676.0 (  2.0)       0.7%    625.3 (  3.2)
>       12% (  4)      25.8%   1249.0 (  1.0)      68.2%    200.3 (  1.2)
>       25% (  9)      30.0%   1178.0 (  5.2)      79.7%    128.0 (  3.5)
>       37% ( 13)      30.6%   1167.7 (  3.1)      81.3%    117.7 (  1.2)
>       50% ( 18)      30.6%   1167.3 (  2.3)      81.4%    117.0 (  1.0)
>       75% ( 27)      31.0%   1161.3 (  4.6)      82.5%    110.0 (  6.9)
>      100% ( 36)      32.1%   1142.0 (  3.6)      85.7%     90.0 (  1.0)
>
>     AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
>       1 node * 8 cores * 2 threads = 16 CPUs
>       64G/node = 64G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   1003.7 ( 16.6)         --    243.3 (  8.1)
>        6% (  1)       1.4%    990.0 (  4.6)       1.2%    240.3 (  1.5)
>       12% (  2)      11.4%    889.3 ( 16.7)      44.5%    135.0 (  3.0)
>       25% (  4)      16.8%    835.3 (  9.0)      65.8%     83.3 (  2.5)
>       37% (  6)      18.6%    816.7 ( 17.6)      70.4%     72.0 (  1.0)
>       50% (  8)      18.2%    821.0 (  5.0)      70.7%     71.3 (  1.2)
>       75% ( 12)      19.0%    813.3 (  5.0)      71.8%     68.7 (  2.1)
>      100% ( 16)      19.8%    805.3 ( 10.8)      76.4%     57.3 ( 15.9)
>
> Server-oriented distros that enable deferred page init sometimes run in
> small VMs, and they still benefit even though the fraction of boot time
> saved is smaller:
>
>     AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
>       1 node * 2 cores * 2 threads = 4 CPUs
>       16G/node = 16G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --    722.3 (  9.5)         --     50.7 (  0.6)
>       25% (  1)      -3.3%    746.3 (  4.7)      -2.0%     51.7 (  1.2)
>       50% (  2)       0.2%    721.0 ( 11.3)      29.6%     35.7 (  4.9)
>       75% (  3)      -0.3%    724.3 ( 11.2)      48.7%     26.0 (  0.0)
>      100% (  4)       3.0%    700.3 ( 13.6)      55.9%     22.3 (  0.6)
>
>     Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
>       1 node * 2 cores * 2 threads = 4 CPUs
>       14G/node = 14G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --    673.0 (  6.9)         --     57.0 (  1.0)
>       25% (  1)      -0.6%    677.3 ( 19.8)       1.8%     56.0 (  1.0)
>       50% (  2)       3.4%    650.0 (  3.6)      36.8%     36.0 (  5.2)
>       75% (  3)       4.2%    644.7 (  7.6)      56.1%     25.0 (  1.0)
>      100% (  4)       5.3%    637.0 (  5.6)      63.2%     21.0 (  0.0)
>
> On Josh's 96-CPU and 192G memory system:
>
>     Without this patch series:
>     [    0.487132] node 0 initialised, 23398907 pages in 292ms
>     [    0.499132] node 1 initialised, 24189223 pages in 304ms
>     ...
>     [    0.629376] Run /sbin/init as init process
>
>     With this patch series:
>     [    0.227868] node 0 initialised, 23398907 pages in 28ms
>     [    0.230019] node 1 initialised, 24189223 pages in 28ms
>     ...
>     [    0.361069] Run /sbin/init as init process
>
> [1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
>  mm/Kconfig      |  6 ++---
>  mm/page_alloc.c | 60 ++++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 58 insertions(+), 8 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c1acc34c1c358..04c1da3f9f44c 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -750,13 +750,13 @@ config DEFERRED_STRUCT_PAGE_INIT
>         depends on SPARSEMEM
>         depends on !NEED_PER_CPU_KM
>         depends on 64BIT
> +       select PADATA
>         help
>           Ordinarily all struct pages are initialised during early boot in a
>           single thread. On very large machines this can take a considerable
>           amount of time. If this option is set, large machines will bring up
> -         a subset of memmap at boot and then initialise the rest in parallel
> -         by starting one-off "pgdatinitX" kernel thread for each node X. This
> -         has a potential performance impact on processes running early in the
> +         a subset of memmap at boot and then initialise the rest in parallel.
> +         This has a potential performance impact on tasks running early in the
>           lifetime of the system until these kthreads finish the
>           initialisation.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d0c0d9364aa6d..9cb780e8dec78 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -68,6 +68,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/nmi.h>
>  #include <linux/psi.h>
> +#include <linux/padata.h>
>
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
>         return nr_pages;
>  }
>
> +struct definit_args {
> +       struct zone *zone;
> +       atomic_long_t nr_pages;
> +};
> +
> +static void __init
> +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> +                          void *arg)
> +{
> +       unsigned long spfn, epfn, nr_pages = 0;
> +       struct definit_args *args = arg;
> +       struct zone *zone = args->zone;
> +       u64 i;
> +
> +       deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
> +
> +       /*
> +        * Initialize and free pages in MAX_ORDER sized increments so that we
> +        * can avoid introducing any issues with the buddy allocator.
> +        */
> +       while (spfn < end_pfn) {
> +               nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +               cond_resched();
> +       }
> +
> +       atomic_long_add(nr_pages, &args->nr_pages);
> +}
> +

Personally I would get rid of nr_pages entirely. It isn't worth the
cache thrash to have this atomic variable bouncing around. You could
probably just have this function return void since all nr_pages is
used for is a pr_info  statement at the end of the initialization
which will be completely useless now anyway since we really have the
threads running in parallel anyway.

We only really need the nr_pages logic in deferred_grow_zone in order
to track if we have freed enough pages to allow us to go back to what
we were doing.

>  /* Initialise remaining memory on a node */
>  static int __init deferred_init_memmap(void *data)
>  {
>         pg_data_t *pgdat = data;
>         const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>         unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> -       unsigned long first_init_pfn, flags;
> +       unsigned long first_init_pfn, flags, epfn_align;
>         unsigned long start = jiffies;
>         struct zone *zone;
> -       int zid;
> +       int zid, max_threads;
>         u64 i;
>
>         /* Bind memory initialisation thread to a local node if possible */
> @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
>                 goto zone_empty;
>
>         /*
> -        * Initialize and free pages in MAX_ORDER sized increments so
> -        * that we can avoid introducing any issues with the buddy
> -        * allocator.
> +        * More CPUs always led to greater speedups on tested systems, up to
> +        * all the nodes' CPUs.  Use all since the system is otherwise idle now.
>          */
> +       max_threads = max(cpumask_weight(cpumask), 1u);
> +
>         while (spfn < epfn) {
> +               epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> +
> +               if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> +                   epfn_align - spfn >= PAGES_PER_SECTION) {
> +                       struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> +                       struct padata_mt_job job = {
> +                               .thread_fn   = deferred_init_memmap_chunk,
> +                               .fn_arg      = &arg,
> +                               .start       = spfn,
> +                               .size        = epfn_align - spfn,
> +                               .align       = PAGES_PER_SECTION,
> +                               .min_chunk   = PAGES_PER_SECTION,
> +                               .max_threads = max_threads,
> +                       };
> +
> +                       padata_do_multithreaded(&job);
> +                       nr_pages += atomic_long_read(&arg.nr_pages);
> +                       spfn = epfn_align;
> +               }
> +
>                 nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>                 cond_resched();
>         }

This doesn't look right. You are basically adding threads in addition
to calls to deferred_init_maxorder. In addition you are spawning one
job per section instead of per range. Really you should be going for
something more along the lines of:

        while (spfn < epfn) {
                unsigned long epfn_align = ALIGN(epfn,
PAGE_PER_SECTION);
                struct definit_args arg = { zone, ATOMIC_LONG_INIT(0)
};
                struct padata_mt_job job = {
                        .thread_fn   = deferred_init_memmap_chunk,
                        .fn_arg      = &arg,
                        .start       = spfn,
                        .size        = epfn_align - spfn,
                        .align       = PAGES_PER_SECTION,
                        .min_chunk   = PAGES_PER_SECTION,
                        .max_threads = max_threads,
                };

                padata_do_multithreaded(&job);

                for_each_free_mem_pfn_range_in_zone_from(i, zone,
spfn, epfn) {
                        if (epfn_align <= spfn)
                                break;
                }
        }

This should accomplish the same thing, but much more efficiently. The
only thing you really lose is the tracking of nr_pages which really
doesn't add anything anyway since the value could shift around
depending on how many times deferred_grow_zone got called anyway.

Also the spfn should already be sectioned aligned, or at least be in a
new section unrelated to the one we just scheduled, so there is no
need for the extra checks you had.

^ permalink raw reply

* Re: Endless soft-lockups for compiling workload since next-20200519
From: Frederic Weisbecker @ 2020-05-21  0:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Linux Kernel Mailing List, Qian Cai,
	Borislav Petkov, Thomas Gleixner, linuxppc-dev
In-Reply-To: <20200520125056.GC325280@hirez.programming.kicks-ass.net>

On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > Just a head up. Repeatedly compiling kernels for a while would trigger
> > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > .config are in,
> 
> Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> not seen anything like that myself. Let me go have a look.
> 
> 
> In as far as the logs are readable (they're a wrapped mess, please don't
> do that!), they contain very little useful, as is typical with IPIs :/
> 
> > [ 1167.993773][    C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > flush_smp_call_function_queue+0x1fa/0x2e0

So I've tried to think of a race that could produce that and here is
the only thing I could come up with. It's a bit complicated unfortunately:

CPU 0                                              CPU 1
-----                                              -----

tick {
    trigger_load_balance() {
        raise_softirq(SCHED_SOFTIRQ);
        //but nohz_flags(0) = 0
    }
                                                   kick_ilb() {
                                                       atomic_fetch_or(...., nohz_flags(0))
    softirq() {                                        #VMEXIT or anything that could stop a CPU for a while
        run_rebalance_domain() {
            nohz_idle_balance() {
                atomic_andnot(NOHZ_KICK_MASK, nohz_flag(0))
            }
         }
     }
}

// schedule
nohz_newidle_balance() {
    kick_ilb() { // pick current CPU
        atomic_fetch_or(...., nohz_flags(0))           #VMENTER
        smp_call_function_single_async() {             smp_call_function_single_async() {
            // verified csd->flags != CSD_LOCK             // verified csd->flags != CSD_LOCK
            csd->flags = CSD_LOCK                          csd->flags = CSD_LOCK
            //execute in place                             //queue and send IPI
            csd->flags = 0
            nohz_csd_func()
	}
    }
}


IPI�{
    flush_smp_call_function_queue() {
        csd_unlock() {
            WARN_ON(csd->flags != CSD_LOCK) <---------!!!!!



The root cause here would be that trigger_load_balance() unconditionally raise
the softirq. And I have to confess I'm not clear why since the softirq is
essentially a no-op when nohz_flags() is 0.

Thanks.

^ permalink raw reply

* Re: [PATCH] soc: fsl: qe: Replace one-element array and use struct_size() helper
From: Gustavo A. R. Silva @ 2020-05-21  0:01 UTC (permalink / raw)
  To: Li Yang
  Cc: Kees Cook, Gustavo A. R. Silva, lkml, linuxppc-dev,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
	Qiang Zhao
In-Reply-To: <CADRPPNR-Croux9FgnrQJJmdF2jNnuAmC+2xMJSgSbkbRv9u8Mw@mail.gmail.com>

On Wed, May 20, 2020 at 06:52:21PM -0500, Li Yang wrote:
> On Mon, May 18, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Mon, May 18, 2020 at 05:19:04PM -0500, Gustavo A. R. Silva wrote:
> > > The current codebase makes use of one-element arrays in the following
> > > form:
> > >
> > > struct something {
> > >     int length;
> > >     u8 data[1];
> > > };
> > >
> > > struct something *instance;
> > >
> > > instance = kmalloc(sizeof(*instance) + size, GFP_KERNEL);
> > > instance->length = size;
> > > memcpy(instance->data, source, size);
> > >
> > > but the preferred mechanism to declare variable-length types such as
> > > these ones is a flexible array member[1][2], introduced in C99:
> > >
> > > struct foo {
> > >         int stuff;
> > >         struct boo array[];
> > > };
> > >
> > > By making use of the mechanism above, we will get a compiler warning
> > > in case the flexible array does not occur last in the structure, which
> > > will help us prevent some kind of undefined behavior bugs from being
> > > inadvertently introduced[3] to the codebase from now on. So, replace
> > > the one-element array with a flexible-array member.
> > >
> > > Also, make use of the new struct_size() helper to properly calculate the
> > > size of struct qe_firmware.
> > >
> > > This issue was found with the help of Coccinelle and, audited and fixed
> > > _manually_.
> > >
> > > [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
> > > [2] https://github.com/KSPP/linux/issues/21
> > > [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
> > >
> > > Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
> > > ---
> > >  drivers/soc/fsl/qe/qe.c | 4 ++--
> > >  include/soc/fsl/qe/qe.h | 2 +-
> > >  2 files changed, 3 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/soc/fsl/qe/qe.c b/drivers/soc/fsl/qe/qe.c
> > > index 447146861c2c1..2df20d6f85fa4 100644
> > > --- a/drivers/soc/fsl/qe/qe.c
> > > +++ b/drivers/soc/fsl/qe/qe.c
> > > @@ -448,7 +448,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> > >       unsigned int i;
> > >       unsigned int j;
> > >       u32 crc;
> > > -     size_t calc_size = sizeof(struct qe_firmware);
> > > +     size_t calc_size;
> > >       size_t length;
> > >       const struct qe_header *hdr;
> > >
> > > @@ -480,7 +480,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> > >       }
> > >
> > >       /* Validate the length and check if there's a CRC */
> > > -     calc_size += (firmware->count - 1) * sizeof(struct qe_microcode);
> > > +     calc_size = struct_size(firmware, microcode, firmware->count);
> > >
> > >       for (i = 0; i < firmware->count; i++)
> > >               /*
> > > diff --git a/include/soc/fsl/qe/qe.h b/include/soc/fsl/qe/qe.h
> > > index e282ac01ec081..3feddfec9f87d 100644
> > > --- a/include/soc/fsl/qe/qe.h
> > > +++ b/include/soc/fsl/qe/qe.h
> > > @@ -307,7 +307,7 @@ struct qe_firmware {
> > >               u8 revision;            /* The microcode version revision */
> > >               u8 padding;             /* Reserved, for alignment */
> > >               u8 reserved[4];         /* Reserved, for future expansion */
> > > -     } __attribute__ ((packed)) microcode[1];
> > > +     } __packed microcode[];
> > >       /* All microcode binaries should be located here */
> > >       /* CRC32 should be located here, after the microcode binaries */
> > >  } __attribute__ ((packed));
> > > --
> > > 2.26.2
> > >
> >
> > Hm, looking at this code, I see a few other things that need to be
> > fixed:
> >
> > 1) drivers/tty/serial/ucc_uart.c does not do a be32_to_cpu() conversion
> >    on the length test (understandably, a little-endian system has never run
> >    this code since it's ppc specific), but it's still wrong:
> >
> >         if (firmware->header.length != fw->size) {
> >
> >    compare to the firmware loader:
> >
> >         length = be32_to_cpu(hdr->length);
> >
> > 2) drivers/soc/fsl/qe/qe.c does not perform bounds checking on the
> >    per-microcode offsets, so the uploader might send data outside the
> >    firmware buffer. Perhaps:
> 
> We do validate the CRC for each microcode, it is unlikely the CRC
> check can pass if the offset or length is not correct.  But you are
> probably right that it will be safer to check the boundary and fail
> quicker before we actually start the CRC check.  Will you come up with
> a formal patch or you want us to deal with it?
> 

Li,

I will send a proper patch for this.

Thanks
--
Gustavo

> >
> >
> > diff --git a/drivers/soc/fsl/qe/qe.c b/drivers/soc/fsl/qe/qe.c
> > index 447146861c2c..c4e0bc452f03 100644
> > --- a/drivers/soc/fsl/qe/qe.c
> > +++ b/drivers/soc/fsl/qe/qe.c
> > @@ -451,6 +451,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> >         size_t calc_size = sizeof(struct qe_firmware);
> >         size_t length;
> >         const struct qe_header *hdr;
> > +       void *firmware_end;
> >
> >         if (!firmware) {
> >                 printk(KERN_ERR "qe-firmware: invalid pointer\n");
> > @@ -491,19 +492,39 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> >                 calc_size += sizeof(__be32) *
> >                         be32_to_cpu(firmware->microcode[i].count);
> >
> > -       /* Validate the length */
> > +       /* Validate total length */
> >         if (length != calc_size + sizeof(__be32)) {
> >                 printk(KERN_ERR "qe-firmware: invalid length\n");
> >                 return -EPERM;
> >         }
> >
> >         /* Validate the CRC */
> > -       crc = be32_to_cpu(*(__be32 *)((void *)firmware + calc_size));
> > +       firmware_end = (void *)firmware + calc_size;
> > +       crc = be32_to_cpu(*(__be32 *)firmware_end);
> >         if (crc != crc32(0, firmware, calc_size)) {
> >                 printk(KERN_ERR "qe-firmware: firmware CRC is invalid\n");
> >                 return -EIO;
> >         }
> >
> > +       /* Validate ucode lengths and offsets */
> > +       for (i = 0; i < firmware->count; i++) {
> > +               const struct qe_microcode *ucode = &firmware->microcode[i];
> > +               __be32 *code;
> > +               size_t count;
> > +
> > +               if (!ucode->code_offset)
> > +                       continue;
> > +
> > +               code = (void *)firmware + be32_to_cpu(ucode->code_offset);
> > +               count = be32_to_cpu(ucode->count) * sizeof(*code);
> > +
> > +               if (code < firmware || code >= firmware_end ||
> > +                   code + count < firmware || code + count >= firmware_end) {
> > +                       printk(KERN_ERR "qe-firmware: invalid ucode offset\n");
> > +                       return -EIO;
> > +               }
> > +       }
> > +
> >         /*
> >          * If the microcode calls for it, split the I-RAM.
> >          */
> >
> >
> > I haven't tested this.
> >
> >
> > --
> > Kees Cook

^ permalink raw reply

* Re: [PATCH] soc: fsl: qe: Replace one-element array and use struct_size() helper
From: Li Yang @ 2020-05-20 23:52 UTC (permalink / raw)
  To: Kees Cook
  Cc: Gustavo A. R. Silva, lkml, Gustavo A. R. Silva, linuxppc-dev,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
	Qiang Zhao
In-Reply-To: <202005181529.C0CB448FBB@keescook>

On Mon, May 18, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Mon, May 18, 2020 at 05:19:04PM -0500, Gustavo A. R. Silva wrote:
> > The current codebase makes use of one-element arrays in the following
> > form:
> >
> > struct something {
> >     int length;
> >     u8 data[1];
> > };
> >
> > struct something *instance;
> >
> > instance = kmalloc(sizeof(*instance) + size, GFP_KERNEL);
> > instance->length = size;
> > memcpy(instance->data, source, size);
> >
> > but the preferred mechanism to declare variable-length types such as
> > these ones is a flexible array member[1][2], introduced in C99:
> >
> > struct foo {
> >         int stuff;
> >         struct boo array[];
> > };
> >
> > By making use of the mechanism above, we will get a compiler warning
> > in case the flexible array does not occur last in the structure, which
> > will help us prevent some kind of undefined behavior bugs from being
> > inadvertently introduced[3] to the codebase from now on. So, replace
> > the one-element array with a flexible-array member.
> >
> > Also, make use of the new struct_size() helper to properly calculate the
> > size of struct qe_firmware.
> >
> > This issue was found with the help of Coccinelle and, audited and fixed
> > _manually_.
> >
> > [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
> > [2] https://github.com/KSPP/linux/issues/21
> > [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
> >
> > Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
> > ---
> >  drivers/soc/fsl/qe/qe.c | 4 ++--
> >  include/soc/fsl/qe/qe.h | 2 +-
> >  2 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/soc/fsl/qe/qe.c b/drivers/soc/fsl/qe/qe.c
> > index 447146861c2c1..2df20d6f85fa4 100644
> > --- a/drivers/soc/fsl/qe/qe.c
> > +++ b/drivers/soc/fsl/qe/qe.c
> > @@ -448,7 +448,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> >       unsigned int i;
> >       unsigned int j;
> >       u32 crc;
> > -     size_t calc_size = sizeof(struct qe_firmware);
> > +     size_t calc_size;
> >       size_t length;
> >       const struct qe_header *hdr;
> >
> > @@ -480,7 +480,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> >       }
> >
> >       /* Validate the length and check if there's a CRC */
> > -     calc_size += (firmware->count - 1) * sizeof(struct qe_microcode);
> > +     calc_size = struct_size(firmware, microcode, firmware->count);
> >
> >       for (i = 0; i < firmware->count; i++)
> >               /*
> > diff --git a/include/soc/fsl/qe/qe.h b/include/soc/fsl/qe/qe.h
> > index e282ac01ec081..3feddfec9f87d 100644
> > --- a/include/soc/fsl/qe/qe.h
> > +++ b/include/soc/fsl/qe/qe.h
> > @@ -307,7 +307,7 @@ struct qe_firmware {
> >               u8 revision;            /* The microcode version revision */
> >               u8 padding;             /* Reserved, for alignment */
> >               u8 reserved[4];         /* Reserved, for future expansion */
> > -     } __attribute__ ((packed)) microcode[1];
> > +     } __packed microcode[];
> >       /* All microcode binaries should be located here */
> >       /* CRC32 should be located here, after the microcode binaries */
> >  } __attribute__ ((packed));
> > --
> > 2.26.2
> >
>
> Hm, looking at this code, I see a few other things that need to be
> fixed:
>
> 1) drivers/tty/serial/ucc_uart.c does not do a be32_to_cpu() conversion
>    on the length test (understandably, a little-endian system has never run
>    this code since it's ppc specific), but it's still wrong:
>
>         if (firmware->header.length != fw->size) {
>
>    compare to the firmware loader:
>
>         length = be32_to_cpu(hdr->length);
>
> 2) drivers/soc/fsl/qe/qe.c does not perform bounds checking on the
>    per-microcode offsets, so the uploader might send data outside the
>    firmware buffer. Perhaps:

We do validate the CRC for each microcode, it is unlikely the CRC
check can pass if the offset or length is not correct.  But you are
probably right that it will be safer to check the boundary and fail
quicker before we actually start the CRC check.  Will you come up with
a formal patch or you want us to deal with it?

>
>
> diff --git a/drivers/soc/fsl/qe/qe.c b/drivers/soc/fsl/qe/qe.c
> index 447146861c2c..c4e0bc452f03 100644
> --- a/drivers/soc/fsl/qe/qe.c
> +++ b/drivers/soc/fsl/qe/qe.c
> @@ -451,6 +451,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
>         size_t calc_size = sizeof(struct qe_firmware);
>         size_t length;
>         const struct qe_header *hdr;
> +       void *firmware_end;
>
>         if (!firmware) {
>                 printk(KERN_ERR "qe-firmware: invalid pointer\n");
> @@ -491,19 +492,39 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
>                 calc_size += sizeof(__be32) *
>                         be32_to_cpu(firmware->microcode[i].count);
>
> -       /* Validate the length */
> +       /* Validate total length */
>         if (length != calc_size + sizeof(__be32)) {
>                 printk(KERN_ERR "qe-firmware: invalid length\n");
>                 return -EPERM;
>         }
>
>         /* Validate the CRC */
> -       crc = be32_to_cpu(*(__be32 *)((void *)firmware + calc_size));
> +       firmware_end = (void *)firmware + calc_size;
> +       crc = be32_to_cpu(*(__be32 *)firmware_end);
>         if (crc != crc32(0, firmware, calc_size)) {
>                 printk(KERN_ERR "qe-firmware: firmware CRC is invalid\n");
>                 return -EIO;
>         }
>
> +       /* Validate ucode lengths and offsets */
> +       for (i = 0; i < firmware->count; i++) {
> +               const struct qe_microcode *ucode = &firmware->microcode[i];
> +               __be32 *code;
> +               size_t count;
> +
> +               if (!ucode->code_offset)
> +                       continue;
> +
> +               code = (void *)firmware + be32_to_cpu(ucode->code_offset);
> +               count = be32_to_cpu(ucode->count) * sizeof(*code);
> +
> +               if (code < firmware || code >= firmware_end ||
> +                   code + count < firmware || code + count >= firmware_end) {
> +                       printk(KERN_ERR "qe-firmware: invalid ucode offset\n");
> +                       return -EIO;
> +               }
> +       }
> +
>         /*
>          * If the microcode calls for it, split the I-RAM.
>          */
>
>
> I haven't tested this.
>
>
> --
> Kees Cook

^ permalink raw reply

* Re: [PATCH v6 09/11] PCI: layerscape: Add EP mode support for ls1088a and ls2088a
From: Rob Herring @ 2020-05-20 20:50 UTC (permalink / raw)
  To: Xiaowei Bao
  Cc: devicetree, andrew.murray, lorenzo.pieralisi, jingoohan1,
	linux-pci, Zhiqiang.Hou, linux-kernel, kishon, Minghuan.Lian,
	robh+dt, mingkai.hu, gustavo.pimentel, bhelgaas, shawnguo,
	roy.zang, linuxppc-dev, leoyang.li, linux-arm-kernel, amurray
In-Reply-To: <20200314033038.24844-10-xiaowei.bao@nxp.com>

On Sat, 14 Mar 2020 11:30:36 +0800, Xiaowei Bao wrote:
> Add PCIe EP mode support for ls1088a and ls2088a, there are some
> difference between LS1 and LS2 platform, so refactor the code of
> the EP driver.
> 
> Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com>
> ---
> v2:
>  - This is a new patch for supporting the ls1088a and ls2088a platform.
> v3:
>  - Adjust the some struct assignment order in probe function.
> v4:
>  - No change.
> v5:
>  - No change.
> v6:
>  - No change.
> 
>  drivers/pci/controller/dwc/pci-layerscape-ep.c | 72 +++++++++++++++++++-------
>  1 file changed, 53 insertions(+), 19 deletions(-)
> 

Reviewed-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: [PATCH v6 07/11] PCI: layerscape: Modify the way of getting capability with different PEX
From: Rob Herring @ 2020-05-20 20:45 UTC (permalink / raw)
  To: Xiaowei Bao
  Cc: devicetree, andrew.murray, lorenzo.pieralisi, roy.zang,
	gustavo.pimentel, Zhiqiang.Hou, linux-kernel, kishon,
	Minghuan.Lian, jingoohan1, robh+dt, mingkai.hu, linux-pci,
	bhelgaas, shawnguo, leoyang.li, linuxppc-dev, linux-arm-kernel,
	amurray
In-Reply-To: <20200314033038.24844-8-xiaowei.bao@nxp.com>

On Sat, 14 Mar 2020 11:30:34 +0800, Xiaowei Bao wrote:
> The different PCIe controller in one board may be have different
> capability of MSI or MSIX, so change the way of getting the MSI
> capability, make it more flexible.
> 
> Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com>
> ---
> v2:
>  - Remove the repeated assignment code.
> v3:
>  - Use ep_func msi_cap and msix_cap to decide the msi_capable and
>    msix_capable of pci_epc_features struct.
> v4:
>  - No change.
> v5:
>  - No change.
> v6:
>  - No change.
> 
>  drivers/pci/controller/dwc/pci-layerscape-ep.c | 31 +++++++++++++++++++-------
>  1 file changed, 23 insertions(+), 8 deletions(-)
> 

Reviewed-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: [PATCH v6 04/11] PCI: designware-ep: Modify MSI and MSIX CAP way of finding
From: Rob Herring @ 2020-05-20 20:45 UTC (permalink / raw)
  To: Xiaowei Bao
  Cc: roy.zang, lorenzo.pieralisi, devicetree, jingoohan1, Zhiqiang.Hou,
	linuxppc-dev, linux-pci, linux-kernel, leoyang.li, Minghuan.Lian,
	linux-arm-kernel, gustavo.pimentel, bhelgaas, andrew.murray,
	kishon, shawnguo, mingkai.hu, amurray
In-Reply-To: <20200314033038.24844-5-xiaowei.bao@nxp.com>

On Sat, Mar 14, 2020 at 11:30:31AM +0800, Xiaowei Bao wrote:
> Each PF of EP device should have it's own MSI or MSIX capabitily

s/it's/its/

> struct, so create a dw_pcie_ep_func struct and remove the msi_cap
> and msix_cap to this struct from dw_pcie_ep, and manage the PFs
> with a list.
> 
> Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com>
> ---
> v3:
>  - This is a new patch, to fix the issue of MSI and MSIX CAP way of
>    finding.
> v4:
>  - Correct some word of commit message.
> v5:
>  - No change.
> v6:
>  - Fix up the compile error.
> 
>  drivers/pci/controller/dwc/pcie-designware-ep.c | 135 +++++++++++++++++++++---
>  drivers/pci/controller/dwc/pcie-designware.h    |  18 +++-
>  2 files changed, 134 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> index 933bb89..fb915f2 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> @@ -19,6 +19,19 @@ void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
>  	pci_epc_linkup(epc);
>  }
>  
> +struct dw_pcie_ep_func *
> +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no)
> +{
> +	struct dw_pcie_ep_func *ep_func;
> +
> +	list_for_each_entry(ep_func, &ep->func_list, list) {
> +		if (ep_func->func_no == func_no)
> +			return ep_func;
> +	}
> +
> +	return NULL;
> +}
> +
>  static unsigned int dw_pcie_ep_func_select(struct dw_pcie_ep *ep, u8 func_no)
>  {
>  	unsigned int func_offset = 0;
> @@ -59,6 +72,47 @@ void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar)
>  		__dw_pcie_ep_reset_bar(pci, func_no, bar, 0);
>  }
>  
> +static u8 __dw_pcie_ep_find_next_cap(struct dw_pcie_ep *ep, u8 func_no,
> +		u8 cap_ptr, u8 cap)
> +{
> +	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> +	unsigned int func_offset = 0;
> +	u8 cap_id, next_cap_ptr;
> +	u16 reg;
> +
> +	if (!cap_ptr)
> +		return 0;
> +
> +	func_offset = dw_pcie_ep_func_select(ep, func_no);
> +
> +	reg = dw_pcie_readw_dbi(pci, func_offset + cap_ptr);
> +	cap_id = (reg & 0x00ff);
> +
> +	if (cap_id > PCI_CAP_ID_MAX)
> +		return 0;
> +
> +	if (cap_id == cap)
> +		return cap_ptr;
> +
> +	next_cap_ptr = (reg & 0xff00) >> 8;
> +	return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap);
> +}
> +
> +static u8 dw_pcie_ep_find_capability(struct dw_pcie_ep *ep, u8 func_no, u8 cap)
> +{
> +	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> +	unsigned int func_offset = 0;
> +	u8 next_cap_ptr;
> +	u16 reg;
> +
> +	func_offset = dw_pcie_ep_func_select(ep, func_no);
> +
> +	reg = dw_pcie_readw_dbi(pci, func_offset + PCI_CAPABILITY_LIST);
> +	next_cap_ptr = (reg & 0x00ff);
> +
> +	return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap);
> +}
> +
>  static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 func_no,
>  				   struct pci_epf_header *hdr)
>  {
> @@ -246,13 +300,18 @@ static int dw_pcie_ep_get_msi(struct pci_epc *epc, u8 func_no)
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>  	u32 val, reg;
>  	unsigned int func_offset = 0;
> +	struct dw_pcie_ep_func *ep_func;
>  
> -	if (!ep->msi_cap)
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
> +
> +	if (!ep_func->msi_cap)
>  		return -EINVAL;

if (!ep_func || !ep_func->msi_cap)
  return -EINVAL;

>  
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
>  	val = dw_pcie_readw_dbi(pci, reg);
>  	if (!(val & PCI_MSI_FLAGS_ENABLE))
>  		return -EINVAL;
> @@ -268,13 +327,18 @@ static int dw_pcie_ep_set_msi(struct pci_epc *epc, u8 func_no, u8 interrupts)
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>  	u32 val, reg;
>  	unsigned int func_offset = 0;
> +	struct dw_pcie_ep_func *ep_func;
> +
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
>  
> -	if (!ep->msi_cap)
> +	if (!ep_func->msi_cap)
>  		return -EINVAL;

Same here.

>  
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
>  	val = dw_pcie_readw_dbi(pci, reg);
>  	val &= ~PCI_MSI_FLAGS_QMASK;
>  	val |= (interrupts << 1) & PCI_MSI_FLAGS_QMASK;
> @@ -291,13 +355,18 @@ static int dw_pcie_ep_get_msix(struct pci_epc *epc, u8 func_no)
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>  	u32 val, reg;
>  	unsigned int func_offset = 0;
> +	struct dw_pcie_ep_func *ep_func;
> +
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
>  
> -	if (!ep->msix_cap)
> +	if (!ep_func->msix_cap)
>  		return -EINVAL;

Same here for msix.

>  
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> -	reg = ep->msix_cap + func_offset + PCI_MSIX_FLAGS;
> +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_FLAGS;
>  	val = dw_pcie_readw_dbi(pci, reg);
>  	if (!(val & PCI_MSIX_FLAGS_ENABLE))
>  		return -EINVAL;
> @@ -313,13 +382,18 @@ static int dw_pcie_ep_set_msix(struct pci_epc *epc, u8 func_no, u16 interrupts)
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>  	u32 val, reg;
>  	unsigned int func_offset = 0;
> +	struct dw_pcie_ep_func *ep_func;
>  
> -	if (!ep->msix_cap)
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
> +
> +	if (!ep_func->msix_cap)
>  		return -EINVAL;

And here.

>  
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> -	reg = ep->msix_cap + func_offset + PCI_MSIX_FLAGS;
> +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_FLAGS;
>  	val = dw_pcie_readw_dbi(pci, reg);
>  	val &= ~PCI_MSIX_FLAGS_QSIZE;
>  	val |= interrupts;
> @@ -404,6 +478,7 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
>  			     u8 interrupt_num)
>  {
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> +	struct dw_pcie_ep_func *ep_func;
>  	struct pci_epc *epc = ep->epc;
>  	unsigned int aligned_offset;
>  	unsigned int func_offset = 0;
> @@ -413,25 +488,29 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
>  	bool has_upper;
>  	int ret;
>  
> -	if (!ep->msi_cap)
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
> +
> +	if (!ep_func->msi_cap)
>  		return -EINVAL;

And here.

>  
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
>  	/* Raise MSI per the PCI Local Bus Specification Revision 3.0, 6.8.1. */
> -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
>  	msg_ctrl = dw_pcie_readw_dbi(pci, reg);
>  	has_upper = !!(msg_ctrl & PCI_MSI_FLAGS_64BIT);
> -	reg = ep->msi_cap + func_offset + PCI_MSI_ADDRESS_LO;
> +	reg = ep_func->msi_cap + func_offset + PCI_MSI_ADDRESS_LO;
>  	msg_addr_lower = dw_pcie_readl_dbi(pci, reg);
>  	if (has_upper) {
> -		reg = ep->msi_cap + func_offset + PCI_MSI_ADDRESS_HI;
> +		reg = ep_func->msi_cap + func_offset + PCI_MSI_ADDRESS_HI;
>  		msg_addr_upper = dw_pcie_readl_dbi(pci, reg);
> -		reg = ep->msi_cap + func_offset + PCI_MSI_DATA_64;
> +		reg = ep_func->msi_cap + func_offset + PCI_MSI_DATA_64;
>  		msg_data = dw_pcie_readw_dbi(pci, reg);
>  	} else {
>  		msg_addr_upper = 0;
> -		reg = ep->msi_cap + func_offset + PCI_MSI_DATA_32;
> +		reg = ep_func->msi_cap + func_offset + PCI_MSI_DATA_32;
>  		msg_data = dw_pcie_readw_dbi(pci, reg);
>  	}
>  	aligned_offset = msg_addr_lower & (epc->mem->page_size - 1);
> @@ -467,6 +546,7 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
>  			      u16 interrupt_num)
>  {
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> +	struct dw_pcie_ep_func *ep_func;
>  	struct pci_epc *epc = ep->epc;
>  	u16 tbl_offset, bir;
>  	unsigned int func_offset = 0;
> @@ -477,9 +557,16 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
>  	void __iomem *msix_tbl;
>  	int ret;
>  
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
> +
> +	if (!ep_func->msix_cap)
> +		return -EINVAL;

And here.

> +
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> -	reg = ep->msix_cap + func_offset + PCI_MSIX_TABLE;
> +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_TABLE;
>  	tbl_offset = dw_pcie_readl_dbi(pci, reg);
>  	bir = (tbl_offset & PCI_MSIX_TABLE_BIR);
>  	tbl_offset &= PCI_MSIX_TABLE_OFFSET;
> @@ -558,6 +645,7 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
>  	int i;
>  	int ret;
>  	u32 reg;
> +	u8 func_no;
>  	void *addr;
>  	u8 hdr_type;
>  	unsigned int nbars;
> @@ -566,6 +654,9 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>  	struct device *dev = pci->dev;
>  	struct device_node *np = dev->of_node;
> +	struct dw_pcie_ep_func *ep_func;
> +
> +	INIT_LIST_HEAD(&ep->func_list);
>  
>  	if (!pci->dbi_base || !pci->dbi_base2) {
>  		dev_err(dev, "dbi_base/dbi_base2 is not populated\n");
> @@ -632,9 +723,19 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
>  	if (ret < 0)
>  		epc->max_functions = 1;
>  
> -	ep->msi_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSI);
> +	for (func_no = 0; func_no < epc->max_functions; func_no++) {
> +		ep_func = devm_kzalloc(dev, sizeof(*ep_func), GFP_KERNEL);

Why do you need a list if you allocate all the functions at once? You 
could just do an array. Or do the allocations as needed and keep the 
list. I assume all functions aren't always used.

> +		if (!ep_func)
> +			return -ENOMEM;
>  
> -	ep->msix_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSIX);
> +		ep_func->func_no = func_no;
> +		ep_func->msi_cap = dw_pcie_ep_find_capability(ep, func_no,
> +							      PCI_CAP_ID_MSI);
> +		ep_func->msix_cap = dw_pcie_ep_find_capability(ep, func_no,
> +							       PCI_CAP_ID_MSIX);
> +
> +		list_add_tail(&ep_func->list, &ep->func_list);
> +	}
>  
>  	if (ep->ops->ep_init)
>  		ep->ops->ep_init(ep);
> diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
> index cb32afa..dd9b7b4 100644
> --- a/drivers/pci/controller/dwc/pcie-designware.h
> +++ b/drivers/pci/controller/dwc/pcie-designware.h
> @@ -230,8 +230,16 @@ struct dw_pcie_ep_ops {
>  	unsigned int (*func_conf_select)(struct dw_pcie_ep *ep, u8 func_no);
>  };
>  
> +struct dw_pcie_ep_func {
> +	struct list_head	list;
> +	u8			func_no;
> +	u8			msi_cap;	/* MSI capability offset */
> +	u8			msix_cap;	/* MSI-X capability offset */
> +};
> +
>  struct dw_pcie_ep {
>  	struct pci_epc		*epc;
> +	struct list_head	func_list;
>  	const struct dw_pcie_ep_ops *ops;
>  	phys_addr_t		phys_base;
>  	size_t			addr_size;
> @@ -244,8 +252,6 @@ struct dw_pcie_ep {
>  	u32			num_ob_windows;
>  	void __iomem		*msi_mem;
>  	phys_addr_t		msi_mem_phys;
> -	u8			msi_cap;	/* MSI capability offset */
> -	u8			msix_cap;	/* MSI-X capability offset */
>  };
>  
>  struct dw_pcie_ops {
> @@ -437,6 +443,8 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
>  int dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep *ep, u8 func_no,
>  				       u16 interrupt_num);
>  void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar);
> +struct dw_pcie_ep_func *
> +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no);
>  #else
>  static inline void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
>  {
> @@ -478,5 +486,11 @@ static inline int dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep *ep,
>  static inline void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar)
>  {
>  }
> +
> +static inline struct dw_pcie_ep_func *
> +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no)
> +{
> +	return NULL;
> +}
>  #endif
>  #endif /* _PCIE_DESIGNWARE_H */
> -- 
> 2.9.5
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox