LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] powerpc/bug: Cast to unsigned long before passing to inline asm
From: Michael Ellerman @ 2021-09-01  7:17 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: nathan, linuxppc-dev
In-Reply-To: <20210831213432.GF1583@gate.crashing.org>

Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Tue, Aug 31, 2021 at 11:27:20PM +1000, Michael Ellerman wrote:
>> Nathan filed an LLVM bug [2], in which Eli Friedman explained that "if
>> you pass a value of a type that's narrower than a register to an inline
>> asm, the high bits are undefined". In this case we are passing a bool
>> to the inline asm, which is a single bit value, and so the compiler
>> thinks it can leave the high bits of r30 unmasked, because only the
>> value of bit 0 matters.
>> 
>> Because the inline asm compares the full width of the register (32 or
>> 64-bit) we need to ensure the value passed to the inline asm has all
>> bits defined. So fix it by casting to long.
>> 
>> We also already cast to long for the similar case in BUG_ENTRY(), which
>> it turns out was added to fix a similar bug in 2005 in commit
>> 32818c2eb6b8 ("[PATCH] ppc64: Fix issue with gcc 4.0 compiled kernels").
>
> That points to <https://gcc.gnu.org/PR23422>, which shows the correct
> explanation.

That's a pity because I don't understand that explanation ^_^

Why does sign extension matter when we're comparing against zero?

> The code as it was did **not** pass a bool.  It perhaps passed an int
> (so many macros in play, it is hard to tell for sure, but it is int or
> long int, perhaps unsigned (which does not change anything here).

I don't understand that. It definitely is passing a bool at the source
level. Are you saying it's getting promoted somehow?

It expands to:

      asm goto(
              "1:   "
              "tdnei"
              "
              " " % 4,
              0 " "\n " ".section __ex_table,\"a\";"
                                              " "
                                              ".balign 4;"
                                              " "
                                              ".long (1b) - . ;"
                                              " "
                                              ".long (%l[__label_warn_on]) - . ;"
                                              " "
                                              ".previous"
                                              " "
                                              ".section __bug_table,\"aw\"\n"
                                              "2:\t.4byte 1b - 2b, %0 - 2b\n"
                                              "\t.short %1, %2\n"
                                              ".org 2b+%3\n"
                                              ".previous\n"
              :
              : "i"("lib/klist.c"), "i"(62),
                "i"((1 << 0) | ((9) << 8)),
                "i"(sizeof(struct bug_entry)),
                "r"(knode_dead(knode))
                ^^^^^^^^^^^^^^^^^^^^^
              :
              : __label_warn_on);

And knode_dead() returns bool:

  static bool knode_dead(struct klist_node *knode)
  {
  	return (unsigned long)knode->n_klist & KNODE_DEAD;
  }

So in my book that means the type there is bool. But I'm not a compiler
guy so I guessing I'm missing something.

> But td wants a 64-bit register, not a 32-bit one (which are the only two
> possibilities for the "r" constraint on PowerPC).  The cast to "long" is
> fine for powerpc64, but it has nothing to do with "narrower than a
> register".

If it's 32-bit vs 64-bit, and the clang explanation is correct, then
we'd expect the low 32-bits of the value passed to the asm to have the
correct value, ie. have been masked with KNODE_DEAD.

> If this is not the correct explanation for LLVM, it needs to solve some
> other bug.

OK. I just need to get this fixed in the kernel, so I'll do a new
version with a changelog that is ~= "shrug not sure what's going on" and
merge that. Then we can argue about what is really going on later :)

cheeers

^ permalink raw reply

* [PATCH v3] powerpc/32: Add support for out-of-line static calls
From: Christophe Leroy @ 2021-09-01  8:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: Peter Zijlstra, linux-kernel, Steven Rostedt, Jason Baron,
	Josh Poimboeuf, linuxppc-dev, Ard Biesheuvel

Add support for out-of-line static calls on PPC32. This change
improve performance of calls to global function pointers by
using direct calls instead of indirect calls.

The trampoline is initialy populated with a 'blr' or branch to target,
followed by an unreachable long jump sequence.

In order to cater with parallele execution, the trampoline needs to
be updated in a way that ensures it remains consistent at all time.
This means we can't use the traditional lis/addi to load r12 with
the target address, otherwise there would be a window during which
the first instruction contains the upper part of the new target
address while the second instruction still contains the lower part of
the old target address. To avoid that the target address is stored
just after the 'bctr' and loaded from there with a single instruction.

Then, depending on the target distance, arch_static_call_transform()
will either replace the first instruction by a direct 'bl <target>' or
'nop' in order to have the trampoline fall through the long jump
sequence.

For the special case of __static_call_return0(), to avoid the risk of
a far branch, a version of it is inlined at the end of the trampoline.

Performancewise the long jump sequence is probably not better than
the indirect calls set by GCC when we don't use static calls, but
such calls are unlikely to be required on powerpc32: With most
configurations the kernel size is far below 32 Mbytes so only
modules may happen to be too far. And even modules are likely to
be close enough as they are allocated below the kernel core and
as close as possible of the kernel text.

static_call selftest is running successfully with this change.

With this patch, __do_irq() has the following sequence to trace
irq entries:

	c0004a00 <__SCT__tp_func_irq_entry>:
	c0004a00:	48 00 00 e0 	b       c0004ae0 <__traceiter_irq_entry>
	c0004a04:	3d 80 c0 00 	lis     r12,-16384
	c0004a08:	81 8c 4a 1c 	lwz     r12,18972(r12)
	c0004a0c:	7d 89 03 a6 	mtctr   r12
	c0004a10:	4e 80 04 20 	bctr
	c0004a14:	38 60 00 00 	li      r3,0
	c0004a18:	4e 80 00 20 	blr
	c0004a1c:	00 00 00 00 	.long 0x0
...
	c0005654 <__do_irq>:
...
	c0005664:	7c 7f 1b 78 	mr      r31,r3
...
	c00056a0:	81 22 00 00 	lwz     r9,0(r2)
	c00056a4:	39 29 00 01 	addi    r9,r9,1
	c00056a8:	91 22 00 00 	stw     r9,0(r2)
	c00056ac:	3d 20 c0 af 	lis     r9,-16209
	c00056b0:	81 29 74 cc 	lwz     r9,29900(r9)
	c00056b4:	2c 09 00 00 	cmpwi   r9,0
	c00056b8:	41 82 00 10 	beq     c00056c8 <__do_irq+0x74>
	c00056bc:	80 69 00 04 	lwz     r3,4(r9)
	c00056c0:	7f e4 fb 78 	mr      r4,r31
	c00056c4:	4b ff f3 3d 	bl      c0004a00 <__SCT__tp_func_irq_entry>

Before this patch, __do_irq() was doing the following to trace irq
entries:

	c0005700 <__do_irq>:
...
	c0005710:	7c 7e 1b 78 	mr      r30,r3
...
	c000574c:	93 e1 00 0c 	stw     r31,12(r1)
	c0005750:	81 22 00 00 	lwz     r9,0(r2)
	c0005754:	39 29 00 01 	addi    r9,r9,1
	c0005758:	91 22 00 00 	stw     r9,0(r2)
	c000575c:	3d 20 c0 af 	lis     r9,-16209
	c0005760:	83 e9 f4 cc 	lwz     r31,-2868(r9)
	c0005764:	2c 1f 00 00 	cmpwi   r31,0
	c0005768:	41 82 00 24 	beq     c000578c <__do_irq+0x8c>
	c000576c:	81 3f 00 00 	lwz     r9,0(r31)
	c0005770:	80 7f 00 04 	lwz     r3,4(r31)
	c0005774:	7d 29 03 a6 	mtctr   r9
	c0005778:	7f c4 f3 78 	mr      r4,r30
	c000577c:	4e 80 04 21 	bctrl
	c0005780:	85 3f 00 0c 	lwzu    r9,12(r31)
	c0005784:	2c 09 00 00 	cmpwi   r9,0
	c0005788:	40 82 ff e4 	bne     c000576c <__do_irq+0x6c>

Behind the fact of now using a direct 'bl' instead of a
'load/mtctr/bctr' sequence, we can also see that we get one less
register on the stack.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: Adding the special case of __static_call_return0()

v2: Use indirect load in long jump sequence to cater with parallele execution and preemption.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/Kconfig                   |  1 +
 arch/powerpc/include/asm/static_call.h | 28 +++++++++++++++++++
 arch/powerpc/kernel/Makefile           |  2 +-
 arch/powerpc/kernel/static_call.c      | 37 ++++++++++++++++++++++++++
 4 files changed, 67 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/static_call.h
 create mode 100644 arch/powerpc/kernel/static_call.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 36b72d972568..a0fe69d8ec83 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -247,6 +247,7 @@ config PPC
 	select HAVE_SOFTIRQ_ON_OWN_STACK
 	select HAVE_STACKPROTECTOR		if PPC32 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r2)
 	select HAVE_STACKPROTECTOR		if PPC64 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r13)
+	select HAVE_STATIC_CALL			if PPC32
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_VIRT_CPU_ACCOUNTING
 	select HUGETLB_PAGE_SIZE_VARIABLE	if PPC_BOOK3S_64 && HUGETLB_PAGE
diff --git a/arch/powerpc/include/asm/static_call.h b/arch/powerpc/include/asm/static_call.h
new file mode 100644
index 000000000000..0a0bc79bd1fa
--- /dev/null
+++ b/arch/powerpc/include/asm/static_call.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_STATIC_CALL_H
+#define _ASM_POWERPC_STATIC_CALL_H
+
+#define __PPC_SCT(name, inst)					\
+	asm(".pushsection .text, \"ax\"				\n"	\
+	    ".align 5						\n"	\
+	    ".globl " STATIC_CALL_TRAMP_STR(name) "		\n"	\
+	    STATIC_CALL_TRAMP_STR(name) ":			\n"	\
+	    inst "						\n"	\
+	    "	lis	12,2f@ha				\n"	\
+	    "	lwz	12,2f@l(12)				\n"	\
+	    "	mtctr	12					\n"	\
+	    "	bctr						\n"	\
+	    "1:	li	3, 0					\n"	\
+	    "	blr						\n"	\
+	    "2:	.long 0						\n"	\
+	    ".type " STATIC_CALL_TRAMP_STR(name) ", @function	\n"	\
+	    ".size " STATIC_CALL_TRAMP_STR(name) ", . - " STATIC_CALL_TRAMP_STR(name) " \n" \
+	    ".popsection					\n")
+
+#define PPC_SCT_RET0		20		/* Offset of label 1 */
+#define PPC_SCT_DATA		28		/* Offset of label 2 */
+
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)	__PPC_SCT(name, "b " #func)
+#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)	__PPC_SCT(name, "blr")
+
+#endif /* _ASM_POWERPC_STATIC_CALL_H */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 7be36c1e1db6..0e3640e14eb1 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -106,7 +106,7 @@ extra-y				+= vmlinux.lds
 
 obj-$(CONFIG_RELOCATABLE)	+= reloc_$(BITS).o
 
-obj-$(CONFIG_PPC32)		+= entry_32.o setup_32.o early_32.o
+obj-$(CONFIG_PPC32)		+= entry_32.o setup_32.o early_32.o static_call.o
 obj-$(CONFIG_PPC64)		+= dma-iommu.o iommu.o
 obj-$(CONFIG_KGDB)		+= kgdb.o
 obj-$(CONFIG_BOOTX_TEXT)	+= btext.o
diff --git a/arch/powerpc/kernel/static_call.c b/arch/powerpc/kernel/static_call.c
new file mode 100644
index 000000000000..863a7aa24650
--- /dev/null
+++ b/arch/powerpc/kernel/static_call.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/memory.h>
+#include <linux/static_call.h>
+
+#include <asm/code-patching.h>
+
+void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
+{
+	int err;
+	bool is_ret0 = (func == __static_call_return0);
+	unsigned long target = (unsigned long)(is_ret0 ? tramp + PPC_SCT_RET0 : func);
+	bool is_short = is_offset_in_branch_range((long)target - (long)tramp);
+
+	if (!tramp)
+		return;
+
+	mutex_lock(&text_mutex);
+
+	if (func && !is_short) {
+		err = patch_instruction(tramp + PPC_SCT_DATA, ppc_inst(target));
+		if (err)
+			goto out;
+	}
+
+	if (!func)
+		err = patch_instruction(tramp, ppc_inst(PPC_RAW_BLR()));
+	else if (is_short)
+		err = patch_branch(tramp, target, 0);
+	else
+		err = patch_instruction(tramp, ppc_inst(PPC_RAW_NOP()));
+out:
+	mutex_unlock(&text_mutex);
+
+	if (err)
+		panic("%s: patching failed %pS at %pS\n", __func__, func, tramp);
+}
+EXPORT_SYMBOL_GPL(arch_static_call_transform);
-- 
2.25.0


^ permalink raw reply related

* Re: [PATCH kernel] KVM: PPC: Book3S HV: Make unique debugfs nodename
From: Alexey Kardashevskiy @ 2021-09-01  8:43 UTC (permalink / raw)
  To: Fabiano Rosas, Michael Ellerman; +Cc: linuxppc-dev, linux-kernel, kvm-ppc
In-Reply-To: <a1be1913-f564-924b-1750-03efa859a0b1@ozlabs.ru>



On 24/08/2021 18:37, Alexey Kardashevskiy wrote:
> 
> 
> On 18/08/2021 08:20, Fabiano Rosas wrote:
>> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
>>
>>> On 07/07/2021 14:13, Alexey Kardashevskiy wrote:
>>
>>> alternatively move this debugfs stuff under the platform-independent
>>> directory, how about that?
>>
>> That's a good idea. I only now realized we have two separate directories
>> for the same guest:
>>
>> $ ls /sys/kernel/debug/kvm/ | grep $pid
>> 19062-11
>> vm19062
>>
>> Looks like we would have to implement kvm_arch_create_vcpu_debugfs for
>> the vcpu information and add a similar hook for the vm.
> 
> Something like that. From the git history, it looks like the ppc folder 
> was added first and then the generic kvm folder was added but apparently 
> they did not notice the ppc one due to natural reasons :)
> 
> If you are not too busy, can you please merge the ppc one into the 
> generic one and post the patch, so we won't need to fix these 
> duplication warnings again? Thanks,



Turns out it is not that straight forward as I thought as the common KVM 
debugfs entry is created after PPC HV KVM created its own and there is 
no obvious way to change the order (no "post init" hook in kvmppc_ops).

Also, unlike the common KVM debugfs setup, we do not allocate structures 
to support debugfs nodes so we do not leak anything to bother with a 
mutex like 85cd39af14f4 did.

So I'd stick to the original patch to reduce the noise in the dmesg, and 
it also exposes lpid which I find rather useful for finding the right 
partition scope tree in partition_tb.

Michael?


> 
> 
> 
>>>> ---
>>>>    arch/powerpc/kvm/book3s_hv.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c 
>>>> b/arch/powerpc/kvm/book3s_hv.c
>>>> index 1d1fcc290fca..0223ddc0eed0 100644
>>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>>> @@ -5227,7 +5227,7 @@ static int kvmppc_core_init_vm_hv(struct kvm 
>>>> *kvm)
>>>>        /*
>>>>         * Create a debugfs directory for the VM
>>>>         */
>>>> -    snprintf(buf, sizeof(buf), "vm%d", current->pid);
>>>> +    snprintf(buf, sizeof(buf), "vm%d-lp%ld", current->pid, lpid);
>>>>        kvm->arch.debugfs_dir = debugfs_create_dir(buf, 
>>>> kvm_debugfs_dir);
>>>>        kvmppc_mmu_debugfs_init(kvm);
>>>>        if (radix_enabled())
>>>>
> 

-- 
Alexey

^ permalink raw reply

* [PATCH kernel] KVM: PPC: Book3S: Suppress warnings when allocating too big memory slots
From: Alexey Kardashevskiy @ 2021-09-01  8:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, kvm-ppc

The userspace can trigger "vmalloc size %lu allocation failure: exceeds
total pages" via the KVM_SET_USER_MEMORY_REGION ioctl.

This silences the warning by checking the limit before calling vzalloc()
and returns ENOMEM if failed.

This does not call underlying valloc helpers as __vmalloc_node() is only
exported when CONFIG_TEST_VMALLOC_MODULE and __vmalloc_node_range() is not
exported at all.

Spotted by syzkaller.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/book3s_hv.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 474c0cfde384..a59f1cccbcf9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4830,8 +4830,12 @@ static int kvmppc_core_prepare_memory_region_hv(struct kvm *kvm,
 	unsigned long npages = mem->memory_size >> PAGE_SHIFT;
 
 	if (change == KVM_MR_CREATE) {
-		slot->arch.rmap = vzalloc(array_size(npages,
-					  sizeof(*slot->arch.rmap)));
+		unsigned long cb = array_size(npages, sizeof(*slot->arch.rmap));
+
+		if ((cb >> PAGE_SHIFT) > totalram_pages())
+			return -ENOMEM;
+
+		slot->arch.rmap = vzalloc(cb);
 		if (!slot->arch.rmap)
 			return -ENOMEM;
 	}
-- 
2.30.2


^ permalink raw reply related

* [PATCH kernel] KVM: PPC: Book3S: Suppress failed alloc warning in H_COPY_TOFROM_GUEST
From: Alexey Kardashevskiy @ 2021-09-01  8:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, kvm-ppc

H_COPY_TOFROM_GUEST is an hcall for an upper level VM to access its nested
VMs memory. The userspace can trigger WARN_ON_ONCE(!(gfp & __GFP_NOWARN))
in __alloc_pages() by constructing a tiny VM which only does
H_COPY_TOFROM_GUEST with a too big GPR9 (number of bytes to copy).

This silences the warning by adding __GFP_NOWARN.

Spotted by syzkaller.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/book3s_hv_nested.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index e57c08b968c0..a2e34efb8d31 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -580,7 +580,7 @@ long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu)
 	if (eaddr & (0xFFFUL << 52))
 		return H_PARAMETER;
 
-	buf = kzalloc(n, GFP_KERNEL);
+	buf = kzalloc(n, GFP_KERNEL | __GFP_NOWARN);
 	if (!buf)
 		return H_NO_MEM;
 
-- 
2.30.2


^ permalink raw reply related

* Re: [PATCH v3] powerpc/32: Add support for out-of-line static calls
From: Peter Zijlstra @ 2021-09-01  8:52 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-kernel, Steven Rostedt, Jason Baron, Paul Mackerras,
	Josh Poimboeuf, linuxppc-dev, Ard Biesheuvel
In-Reply-To: <6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu>

On Wed, Sep 01, 2021 at 08:30:21AM +0000, Christophe Leroy wrote:
> Add support for out-of-line static calls on PPC32. This change
> improve performance of calls to global function pointers by
> using direct calls instead of indirect calls.
> 
> The trampoline is initialy populated with a 'blr' or branch to target,
> followed by an unreachable long jump sequence.
> 
> In order to cater with parallele execution, the trampoline needs to
> be updated in a way that ensures it remains consistent at all time.
> This means we can't use the traditional lis/addi to load r12 with
> the target address, otherwise there would be a window during which
> the first instruction contains the upper part of the new target
> address while the second instruction still contains the lower part of
> the old target address. To avoid that the target address is stored
> just after the 'bctr' and loaded from there with a single instruction.
> 
> Then, depending on the target distance, arch_static_call_transform()
> will either replace the first instruction by a direct 'bl <target>' or
> 'nop' in order to have the trampoline fall through the long jump
> sequence.
> 
> For the special case of __static_call_return0(), to avoid the risk of
> a far branch, a version of it is inlined at the end of the trampoline.

(also, it's in the same line, so it avoids another cachemiss and it
nicely fills the hole you had in your 32byte chunk)

> Performancewise the long jump sequence is probably not better than
> the indirect calls set by GCC when we don't use static calls, but
> such calls are unlikely to be required on powerpc32: With most
> configurations the kernel size is far below 32 Mbytes so only
> modules may happen to be too far. And even modules are likely to
> be close enough as they are allocated below the kernel core and
> as close as possible of the kernel text.
> 
> static_call selftest is running successfully with this change.

Nice!, I'd ask if you'd tried PREEMPT_DYNAMIC, since that should really
stress the thing, but I see that also requires GENERIC_ENTRY and you
don't have that. Alas.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply

* Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
From: Srikar Dronamraju @ 2021-09-01 10:22 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	kernel test robot, Peter Zijlstra, Geetika Moolchandani,
	Valentin Schneider, Laurent Dufour, linuxppc-dev, Ingo Molnar
In-Reply-To: <875yvsba4q.fsf@mpe.ellerman.id.au>

* Michael Ellerman <mpe@ellerman.id.au> [2021-08-26 23:36:53]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > Scheduler expects unique number of node distances to be available at
> > boot.
> 
> I think it needs "the number of unique node distances" ?
> 
> > It uses node distance to calculate this unique node distances.
> 
> It iterates over all pairs of nodes and records node_distance() for that
> pair, and then calculates the set of unique distances.
> 
> > On POWER, node distances for offline nodes is not available. However,
> > POWER already knows unique possible node distances.
> 
> I think it would be more accurate to say PAPR rather than POWER there.
> It's PAPR that defines the way we determine distances and imposes that
> limitation.
> 

Okay, will do all the necessary modifications as suggested above.

> > Fake the offline node's distance_lookup_table entries so that all
> > possible node distances are updated.
> 
> Does this work if we have a single node offline at boot?
> 

It should.

> Say we start with:
> 
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
> 
> And node 2 is offline at boot. We can only initialise that nodes entries
> in the distance_lookup_table:
> 
> 		while (i--)
> 			distance_lookup_table[node][i] = node;
> 
> By filling them all with 2 that causes node_distance(2, X) to return the
> maximum distance for all other nodes X, because we won't break out of
> the loop in __node_distance():
> 
> 	for (i = 0; i < distance_ref_points_depth; i++) {
> 		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
> 			break;
> 
> 		/* Double the distance for each NUMA level */
> 		distance *= 2;
> 	}
> 
> If distance_ref_points_depth was 4 we'd return 160.

As you already know, distance 10, 20, .. are defined by Powerpc, form1
affinity. PAPR doesn't define actual distances, it only provides us the
associativity. If there are distance_ref_points_depth is 4,
(distance_ref_points_depth doesn't take local distance into consideration)
10, 20, 40, 80, 160.

> 
> That'd leave us with 3 unique distances at boot, 10, 20, 160.
> 

So if there are unique distances, then the distances as per the current
code has to be 10, 20, 40, 80.. I dont see a way in which we have a break in
the series. like having 160 without 80.

> But when node 2 comes online it might introduce more than 1 new distance
> value, eg. it could be that the actual distances are:
> 
> node distances:
> node   0   1   2
>   0:  10  20  40
>   1:  20  10  80
>   2:  40  80  10
> 
> ie. we now have 4 distances, 10, 20, 40, 80.
> 
> What am I missing?
> 

As I said above, I am not sure how we can have a break in the series.
If distance_ref_points_depth is 3, the distances has to be 10,20,40,80 as
atleast for form1 affinity.

> > However this only needs to be done if the number of unique node
> > distances that can be computed for online nodes is less than the
> > number of possible unique node distances as represented by
> > distance_ref_points_depth.
> 
> Looking at a few machines they all have distance_ref_points_depth = 2.
> 
> So maybe that explains it, in practice we only see 10, 20, 40.
> 
> > When the node is actually onlined, distance_lookup_table will be
> > updated with actual entries.
> 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> > Cc: Laurent Dufour <ldufour@linux.ibm.com>
> > Cc: kernel test robot <lkp@intel.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/mm/numa.c | 70 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 70 insertions(+)
> >
> > Changelog:
> > v1: https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-srikar@linux.vnet.ibm.com/t/#u
> > [ Fixed a missing prototype warning Reported-by: kernel test robot <lkp@intel.com>]
> >
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 3c124928a16d..0ee79a08c9e1 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -856,6 +856,75 @@ void __init dump_numa_cpu_topology(void)
> >  	}
> >  }
> >  
> > +/*
> > + * Scheduler expects unique number of node distances to be available at
> > + * boot. It uses node distance to calculate this unique node distances. On
> > + * POWER, node distances for offline nodes is not available. However, POWER
> > + * already knows unique possible node distances. Fake the offline node's
> > + * distance_lookup_table entries so that all possible node distances are
> > + * updated.
> > + */
> 
> > +static void __init fake_update_distance_lookup_table(void)
> > +{
> > +	unsigned long distance_map;
> > +	int i, nr_levels, nr_depth, node;
> 
> Are they distances, depths, or levels? :)
> 
> Bit more consistency in the variable names might make the code easier to
> follow.
> 
> > +
> > +	if (!numa_enabled)
> > +		return;
> > +
> > +	if (!form1_affinity)
> > +		return;
> 
> That doesn't exist since Aneesh's FORM2 series, so that will need a
> rebase, and possibly some more rework to interact with that series.
> 

We only have to handle for form1, so it should be easier to handle.

> > +	/*
> > +	 * distance_ref_points_depth lists the unique numa domains
> > +	 * available. However it ignore LOCAL_DISTANCE. So add +1
> > +	 * to get the actual number of unique distances.
> > +	 */
> > +	nr_depth = distance_ref_points_depth + 1;
> 
> num_depths would be a better name IMHO.

Okay,
s/nr_depth/num_depths/g
s/nr_level/depth/g

> 
> > +
> > +	WARN_ON(nr_depth > sizeof(distance_map));
> 
> Warn but then continue, and corrupt something on the stack? Seems like a
> bad idea :)
> 
> I guess it's too early to use bitmap_alloc(). But can we at least return
> if nr_depth is too big.

Yes, we can't use bitmap_alloc here.
Now should we continue if nr_depth is greater than sizeof(distance_map)

If we don't and return immediately, then we can end up not creating enough
scheduler domains and may later on lead to build_sched_domain OOPs, when we
online nodes.

However if don't return, chance of surviving when the domains are actually
onlined is more.
We could probably reset nr_depth to be same as sizeof(distance_map).

That said, I think we are too far away from nr_depths being anywhere closer
to sizeof(long). So I am okay either way.

> 
> > +
> > +	bitmap_zero(&distance_map, nr_depth);
> > +	bitmap_set(&distance_map, 0, 1);
> > +
> > +	for_each_online_node(node) {
> > +		int nd, distance = LOCAL_DISTANCE;
> > +
> > +		if (node == first_online_node)
> > +			continue;
> > +
> > +		nd = __node_distance(node, first_online_node);
> > +		for (i = 0; i < nr_depth; i++, distance *= 2) {
> 
> 		for (i = 0, distance = LOCAL_DISTANCE; i < nr_depth; i++, distance *= 2) {
> 
> Could make it clearer what the for loop is doing I think.
> 
> > +			if (distance == nd) {
> > +				bitmap_set(&distance_map, i, 1);
> > +				break;
> > +			}
> > +		}
> > +		nr_levels = bitmap_weight(&distance_map, nr_depth);
> > +		if (nr_levels == nr_depth)
> > +			return;
> > +	}
> > +
> > +	for_each_node(node) {
> > +		if (node_online(node))
> > +			continue;
> > +
> > +		i = find_first_zero_bit(&distance_map, nr_depth);
> > +		if (i >= nr_depth || i == 0) {
> 
> Neither of those can happen can they?
> 
> We checked the bitmap weight in the previous for loop, or at the bottom
> of this one, and returned if we'd filled the map already.
> 
> And we set bit zero explicitly with bitmap_set().
> 

Agree, I can drop the hunk.

> > +			pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
> > +			return;
> > +		}


> > +
> > +		bitmap_set(&distance_map, i, 1);
> > +		while (i--)
> > +			distance_lookup_table[node][i] = node;
> 
> That leaves distance_lookup_table[node][i+1] and so on uninitialised, or
> initialised to zero because it's static, is that OK?

Yes, this should be fine, because we are only interested in finding number
of unique numa distances, By the time actual distances come and overwrite,
we would no more use these fake distances.

But if you are comfortable with updating for all the depths, I can update
too.

> 
> > +		nr_levels = bitmap_weight(&distance_map, nr_depth);
> > +		if (nr_levels == nr_depth)
> > +			return;
> > +	}
> > +}
> > +
> >  /* Initialize NODE_DATA for a node on the local memory */
> >  static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
> >  {
> > @@ -971,6 +1040,7 @@ void __init mem_topology_setup(void)
> >  		 */
> >  		numa_setup_cpu(cpu);
> >  	}
> > +	fake_update_distance_lookup_table();
> >  }
> >  
> >  void __init initmem_init(void)
> 
> 
> cheers

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [next-20210820][ppc][nvme/raid] pci unbind WARNS at fs/kernfs/dir.c:1524 kernfs_remove_by_name_ns+
From: Abdul Haleem @ 2021-09-01 10:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Brian King, linux-next, linuxppc-dev, linux-kernel, linux-scsi
In-Reply-To: <YSfBwj1zo/w2GCH4@infradead.org>

Thanks Christoph,

The problem is fixed with below commit and tested on next-20210823

Tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>

On 8/26/21 10:00 PM, Christoph Hellwig wrote:

> Are you sure this is the very latest linux-next?  This should hav been
> fixed by:
>
>      "block: add back the bd_holder_dir reference in bd_link_disk_holder"

-- 
Regard's

Abdul Haleem
IBM Linux Technology Center


^ permalink raw reply

* [next-20210827][ppc][multipathd] INFO: task hung in dm_table_add_target
From: Abdul Haleem @ 2021-09-01 11:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: axboe, sachinp, jack, linux-scsi, dm-devel, linux-next, dougmill,
	Brian King, linuxppc-dev, hch

Greeting's

multiple task hung while adding the vfc disk back to the multipath on my 
powerpc box running linux-next kernel

Test:
$ multipathd -k"remove path sde"
$ multipathd -k"add path sde"

After above command multipathd task hung with continuous call traces on 
console, it requires reboot to stop call traces

systemd-udevd[180359]: Process '/sbin/mdadm -I /dev/dm-8' failed with 
exit code 1.
multipathd[180272]: mpathf: load table [0 62914560 multipath 1 
queue_if_no_path 1 alua 2 1 round-robin 0 2 1 8:64 1 65:0 1 round-robin 
0 2 1 8:144 1 65:80 1]
systemd[1]: Started Device-Mapper Multipath Device Controller.
multipathd[180272]: mpatha: sdl - tur checker reports path is up
multipathd[180272]: 8:176: reinstated
multipathd[180272]: mpatha: remaining active paths: 2
multipathd[180272]: sde: remove path (operator)
multipathd[180272]: mpathf: load table [0 62914560 multipath 1 
queue_if_no_path 1 alua 2 1 round-robin 0 1 1 65:0 1 round-robin 0 2 1 
8:144 1 65:80 1]
multipathd[180272]: sde [8:64]: path removed from map mpathf
multipathd[180272]: sdq: remove path (operator)
multipathd[180272]: mpathf: load table [0 62914560 multipath 1 
queue_if_no_path 1 alua 1 1 round-robin 0 2 1 8:144 1 65:80 1]
multipathd[180272]: sdq [65:0]: path removed from map mpathf
multipathd[180272]: sdj: remove path (operator)
multipathd[180272]: mpathf: load table [0 62914560 multipath 1 
queue_if_no_path 1 alua 1 1 round-robin 0 1 1 65:80 1]
multipathd[180272]: sdj [8:144]: path removed from map mpathf
multipathd[180272]: sdv: remove path (operator)
multipathd[180272]: mpathf: map in use
multipathd[180272]: mpathf: can't flush
multipathd[180272]: mpathf: load table [0 62914560 multipath 1 
queue_if_no_path 0 0 0]
multipathd[180272]: sdv [65:80]: path removed from map mpathf
systemd[1]: Starting system activity accounting tool...
systemd[1]: sysstat-collect.service: Succeeded.
systemd[1]: Started system activity accounting tool.
systemd-udevd[1156]: seq 5678 '/devices/virtual/block/dm-10' is taking a 
long time
systemd-udevd[1156]: seq 5678 '/devices/virtual/block/dm-10' killed
multipathd[180272]: sde: add path (operator)
systemd[1]: Stopping Device-Mapper Multipath Device Controller...
systemd[1]: multipathd.service: State 'stop-sigterm' timed out. Killing.
systemd[1]: multipathd.service: Killing process 180272 (multipathd) with 
signal SIGKILL.
systemd[1]: multipathd.service: Processes still around after SIGKILL. 
Ignoring.
dbus-daemon[1920]: [system] Activating via systemd: service 
name='net.reactivated.Fprint' unit='fprintd.service' requested by 
':1.255' (uid=0 pid=95173 comm="/bin/login -p --      ")
systemd[1]: Starting Fingerprint Authentication Daemon...
dbus-daemon[1920]: [system] Successfully activated service 
'net.reactivated.Fprint'
systemd[1]: Started Fingerprint Authentication Daemon.
systemd-logind[2032]: New session 6 of user root.
systemd[1]: Started Session 6 of user root.
kernel: INFO: task multipathd:180274 blocked for more than 122 seconds.
kernel:      Not tainted 5.14.0-rc7-next-20210827-autotest #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
kernel: task:multipathd      state:D stack:    0 pid:180274 ppid:     1 
flags:0x00040082
kernel: Call Trace:
kernel: [c00000002aaa72c0] [c00000002aaa73a0] 0xc00000002aaa73a0 
(unreliable)
kernel: [c00000002aaa74b0] [c00000000001e638] __switch_to+0x278/0x490
kernel: [c00000002aaa7510] [c000000000c89cfc] __schedule+0x31c/0xa10
kernel: [c00000002aaa75d0] [c000000000c8a468] schedule+0x78/0x130
kernel: [c00000002aaa7600] [c000000000c8aa08] 
schedule_preempt_disabled+0x18/0x30
kernel: [c00000002aaa7620] [c000000000c8cf0c] 
__mutex_lock.isra.11+0x36c/0x700
kernel: [c00000002aaa76b0] [c00000000067efe4] bd_link_disk_holder+0x34/0x270
kernel: [c00000002aaa7700] [c0080000003e3b78] 
dm_get_table_device+0x1e0/0x2c0 [dm_mod]
kernel: [c00000002aaa77a0] [c0080000003e7e48] dm_get_device+0x130/0x2f0 
[dm_mod]
kernel: [c00000002aaa7850] [c008000000745234] multipath_ctr+0x9bc/0xff0 
[dm_multipath]
kernel: [c00000002aaa79d0] [c0080000003e883c] 
dm_table_add_target+0x1a4/0x420 [dm_mod]
kernel: [c00000002aaa7a90] [c0080000003ee874] table_load+0x15c/0x4a0 
[dm_mod]
kernel: [c00000002aaa7b40] [c0080000003f1454] ctl_ioctl+0x27c/0x770 [dm_mod]
kernel: [c00000002aaa7d40] [c0080000003f1960] dm_ctl_ioctl+0x18/0x30 
[dm_mod]
kernel: [c00000002aaa7d60] [c000000000481198] sys_ioctl+0xf8/0x150
kernel: [c00000002aaa7db0] [c00000000002ff28] 
system_call_exception+0x158/0x2c0
kernel: [c00000002aaa7e10] [c00000000000c764] system_call_common+0xf4/0x258
kernel: --- interrupt: c00 at 0x7fffa06f4480
kernel: NIP:  00007fffa06f4480 LR: 00007fffa0ad6714 CTR: 0000000000000000
kernel: REGS: c00000002aaa7e80 TRAP: 0c00   Not tainted 
(5.14.0-rc7-next-20210827-autotest)
kernel: MSR:  800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE>  CR: 24042204  
XER: 00000000
kernel: IRQMASK: 0 #012GPR00: 0000000000000036 00007fff9fd4c3a0 
00007fffa07e7100 0000000000000005 #012GPR04: 00000000c138fd09 
00007fff9806a0c0 00007fffa0ad9f18 00007fff9fd4a298 #012GPR08: 
0000000000000005 0000000000000000 0000000000000000 0000000000000000 
#012GPR12: 0000000000000000 00007fff9fd56300 00007fff9806a0c0 
00007fffa0ad9c80 #012GPR16: 00007fffa0ad9c80 00007fffa0ad9c80 
00007fffa0b13670 0000000000000000 #012GPR20: 00007fffa0ae3260 
00007fffa0b12040 00007fff9806a0f0 00007fff9800eca0 #012GPR24: 
00007fffa0ad9c80 00007fffa0ad9c80 00007fffa0ad9c80 0000000000000000 
#012GPR28: 00007fffa0ad9c80 00007fffa0ad9c80 0000000000000000 
00007fffa0ad9c80
kernel: NIP [00007fffa06f4480] 0x7fffa06f4480
kernel: LR [00007fffa0ad6714] 0x7fffa0ad6714
kernel: --- interrupt: c00
kernel: INFO: task systemd-udevd:180404 blocked for more than 122 seconds.
kernel:      Not tainted 5.14.0-rc7-next-20210827-autotest #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
kernel: task:systemd-udevd   state:D stack:    0 pid:180404 ppid:  1156 
flags:0x00042482
kernel: Call Trace:

Before test
------------
$multipath -ll
mpathf (360050768108001b3a8000000000000e8) dm-10 IBM,2145
size=30G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 1:0:0:4 sde 8:64  active ready running
| `- 2:0:0:4 sdq 65:0  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
   |- 1:0:1:4 sdj 8:144 active ready running
   `- 2:0:1:4 sdv 65:80 active ready running
$

After test fail
------------
$ multipath -ll
mpathf (360050768108001b3a8000000000000e8) dm-10 ##,##
size=30G features='1 queue_if_no_path' hwhandler='0' wp=rw
$

-- 
Regard's

Abdul Haleem
IBM Linux Technology Center


^ permalink raw reply

* [PATCH v2] powerpc/bug: Cast to unsigned long before passing to inline asm
From: Michael Ellerman @ 2021-09-01 11:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: nathan

In commit 1e688dd2a3d6 ("powerpc/bug: Provide better flexibility to
WARN_ON/__WARN_FLAGS() with asm goto") we changed WARN_ON(). Previously
it would take the warning condition, x, and double negate it before
converting the result to int, and passing that int to the underlying
inline asm. ie:

  #define WARN_ON(x) ({
  	int __ret_warn_on = !!(x);
  	if (__builtin_constant_p(__ret_warn_on)) {
  	...
  	} else {
  		BUG_ENTRY(PPC_TLNEI " %4, 0",
  			  BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),
  			  "r" (__ret_warn_on));

The asm then does a full register width comparison with zero and traps
if it is non-zero (PPC_TLNEI).

The new code instead passes the full expression, x, with some arbitrary
type, to the inline asm:

  #define WARN_ON(x) ({
	...
	do {
		if (__builtin_constant_p((x))) {
		...
		} else {
			...
			WARN_ENTRY(PPC_TLNEI " %4, 0",
				   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),
				   __label_warn_on, "r" (x));

As reported[1] by Nathan, when building with clang this can cause
spurious warnings to fire repeatedly at boot:

  WARNING: CPU: 0 PID: 1 at lib/klist.c:62 .klist_add_tail+0x3c/0x110
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W         5.14.0-rc7-next-20210825 #1
  NIP:  c0000000007ff81c LR: c00000000090a038 CTR: 0000000000000000
  REGS: c0000000073c32a0 TRAP: 0700   Tainted: G        W          (5.14.0-rc7-next-20210825)
  MSR:  8000000002029032 <SF,VEC,EE,ME,IR,DR,RI>  CR: 22000a40  XER: 00000000
  CFAR: c00000000090a034 IRQMASK: 0
  GPR00: c00000000090a038 c0000000073c3540 c000000001be3200 0000000000000001
  GPR04: c0000000072d65c0 0000000000000000 c0000000091ba798 c0000000091bb0a0
  GPR08: 0000000000000001 0000000000000000 c000000008581918 fffffffffffffc00
  GPR12: 0000000044000240 c000000001dd0000 c000000000012300 0000000000000000
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR24: 0000000000000000 c0000000017e3200 0000000000000000 c000000001a0e778
  GPR28: c0000000072d65b0 c0000000072d65a8 c000000007de72c8 c0000000073c35d0
  NIP .klist_add_tail+0x3c/0x110
  LR  .bus_add_driver+0x148/0x290
  Call Trace:
    0xc0000000073c35d0 (unreliable)
    .bus_add_driver+0x148/0x290
    .driver_register+0xb8/0x190
    .__hid_register_driver+0x70/0xd0
    .redragon_driver_init+0x34/0x58
    .do_one_initcall+0x130/0x3b0
    .do_initcall_level+0xd8/0x188
    .do_initcalls+0x7c/0xdc
    .kernel_init_freeable+0x178/0x21c
    .kernel_init+0x34/0x220
    .ret_from_kernel_thread+0x58/0x60
  Instruction dump:
  fba10078 7c7d1b78 38600001 fb810070 3b9d0008 fbc10080 7c9e2378 389d0018
  fb9d0008 fb9d0010 90640000 fbdd0000 <0b1e0000> e87e0018 28230000 41820024

The instruction dump shows that we are trapping because r30 is not zero:
  tdnei   r30,0

Where r30 = c000000007de72c8

The WARN_ON() comes from:

  static void knode_set_klist(struct klist_node *knode, struct klist *klist)
  {
  	knode->n_klist = klist;
  	/* no knode deserves to start its life dead */
  	WARN_ON(knode_dead(knode));
  		      ^^^^^^^^^^^^^^^^^

Where:

  #define KNODE_DEAD		1LU

  static bool knode_dead(struct klist_node *knode)
  {
  	return (unsigned long)knode->n_klist & KNODE_DEAD;
  }

The full disassembly shows that clang has not generated any code to
apply the "& KNODE_DEAD" to the n_klist pointer, which is surprising.

Nathan filed an LLVM bug [2], in which Eli Friedman explained that clang
believes it is only passing a single bit to the asm (ie. a bool) and so
the mask of bit 0 with 1 can be omitted, and suggested that if we want
the full 64-bit value passed to the inline asm we should cast to a
64-bit type (or 32-bit on 32-bits).

In fact we already do that for BUG_ENTRY(), which was added to fix a
possibly similar bug in 2005 in commit 32818c2eb6b8 ("[PATCH] ppc64: Fix
issue with gcc 4.0 compiled kernels").

So cast the value we pass to the inline asm to long.

For GCC this appears to have no effect on code generation, other than
causing sign extension in some cases.

[1]: http://lore.kernel.org/r/YSa1O4fcX1nNKqN/@Ryzen-9-3900X.localdomain
[2]: https://bugs.llvm.org/show_bug.cgi?id=51634

Fixes: 1e688dd2a3d6 ("powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/include/asm/bug.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

v2: Reword the change log a bit to hopefully make it clearer that it's clang that believes
it only needs to pass a single bit for bool, whether that's correct behaviour can be
discussed on the list at a later date :)

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 1ee0f22313ee..02c08d1492f8 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -119,7 +119,8 @@ __label_warn_on:						\
 								\
 			WARN_ENTRY(PPC_TLNEI " %4, 0",		\
 				   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),	\
-				   __label_warn_on, "r" (x));	\
+				   __label_warn_on,		\
+				   "r" ((__force long)(x)));	\
 			break;					\
 __label_warn_on:						\
 			__ret_warn_on = true;			\
-- 
2.25.1


^ permalink raw reply related

* Re: [next-20210827][ppc][multipathd] INFO: task hung in dm_table_add_target
From: Christoph Hellwig @ 2021-09-01 13:36 UTC (permalink / raw)
  To: Abdul Haleem
  Cc: axboe, sachinp, jack, linux-scsi, linux-kernel, dm-devel,
	linux-next, dougmill, Brian King, linuxppc-dev, hch
In-Reply-To: <68dde454-965a-0c44-374a-a0ca277150ee@linux.vnet.ibm.com>

On Wed, Sep 01, 2021 at 04:47:26PM +0530, Abdul Haleem wrote:
> Greeting's
>
> multiple task hung while adding the vfc disk back to the multipath on my 
> powerpc box running linux-next kernel

Can you retry to reproduce this with lockdep enabled to see if there
is anything interesting holding this lock?

^ permalink raw reply

* Re: [PATCH kernel] KVM: PPC: Book3S HV: Make unique debugfs nodename
From: Fabiano Rosas @ 2021-09-01 14:23 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel, kvm-ppc
In-Reply-To: <2fe01488-5a9b-785e-7c05-1d527dead18d@ozlabs.ru>

Alexey Kardashevskiy <aik@ozlabs.ru> writes:

> On 24/08/2021 18:37, Alexey Kardashevskiy wrote:
>> 
>> 
>> On 18/08/2021 08:20, Fabiano Rosas wrote:
>>> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
>>>
>>>> On 07/07/2021 14:13, Alexey Kardashevskiy wrote:
>>>
>>>> alternatively move this debugfs stuff under the platform-independent
>>>> directory, how about that?
>>>
>>> That's a good idea. I only now realized we have two separate directories
>>> for the same guest:
>>>
>>> $ ls /sys/kernel/debug/kvm/ | grep $pid
>>> 19062-11
>>> vm19062
>>>
>>> Looks like we would have to implement kvm_arch_create_vcpu_debugfs for
>>> the vcpu information and add a similar hook for the vm.
>> 
>> Something like that. From the git history, it looks like the ppc folder 
>> was added first and then the generic kvm folder was added but apparently 
>> they did not notice the ppc one due to natural reasons :)
>> 
>> If you are not too busy, can you please merge the ppc one into the 
>> generic one and post the patch, so we won't need to fix these 
>> duplication warnings again? Thanks,
>
>
>
> Turns out it is not that straight forward as I thought as the common KVM 
> debugfs entry is created after PPC HV KVM created its own and there is 
> no obvious way to change the order (no "post init" hook in
> kvmppc_ops).

That is why I mentioned creating a hook similar to
kvm_create_vcpu_debugfs in the common KVM code. kvm_create_vm_debugfs or
something.

Alternatively, maybe kvm_create_vm_debugfs could be moved earlier into
kvm_create_vm, before kvm_arch_post_init_vm and we could move our code
into kvm_arch_post_init_vm.

>
> Also, unlike the common KVM debugfs setup, we do not allocate structures 
> to support debugfs nodes so we do not leak anything to bother with a 
> mutex like 85cd39af14f4 did.
>
> So I'd stick to the original patch to reduce the noise in the dmesg, and 
> it also exposes lpid which I find rather useful for finding the right 
> partition scope tree in partition_tb.
>
> Michael?
>
>
>> 
>> 
>> 
>>>>> ---
>>>>>    arch/powerpc/kvm/book3s_hv.c | 2 +-
>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c 
>>>>> b/arch/powerpc/kvm/book3s_hv.c
>>>>> index 1d1fcc290fca..0223ddc0eed0 100644
>>>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>>>> @@ -5227,7 +5227,7 @@ static int kvmppc_core_init_vm_hv(struct kvm 
>>>>> *kvm)
>>>>>        /*
>>>>>         * Create a debugfs directory for the VM
>>>>>         */
>>>>> -    snprintf(buf, sizeof(buf), "vm%d", current->pid);
>>>>> +    snprintf(buf, sizeof(buf), "vm%d-lp%ld", current->pid, lpid);
>>>>>        kvm->arch.debugfs_dir = debugfs_create_dir(buf, 
>>>>> kvm_debugfs_dir);
>>>>>        kvmppc_mmu_debugfs_init(kvm);
>>>>>        if (radix_enabled())
>>>>>
>> 

^ permalink raw reply

* Re: [PATCH kernel] KVM: PPC: Book3S: Suppress failed alloc warning in H_COPY_TOFROM_GUEST
From: Fabiano Rosas @ 2021-09-01 14:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev; +Cc: Alexey Kardashevskiy, kvm-ppc
In-Reply-To: <20210901084550.1658699-1-aik@ozlabs.ru>

Alexey Kardashevskiy <aik@ozlabs.ru> writes:

> H_COPY_TOFROM_GUEST is an hcall for an upper level VM to access its nested
> VMs memory. The userspace can trigger WARN_ON_ONCE(!(gfp & __GFP_NOWARN))
> in __alloc_pages() by constructing a tiny VM which only does
> H_COPY_TOFROM_GUEST with a too big GPR9 (number of bytes to copy).
>
> This silences the warning by adding __GFP_NOWARN.
>
> Spotted by syzkaller.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---

Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>

>  arch/powerpc/kvm/book3s_hv_nested.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> index e57c08b968c0..a2e34efb8d31 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -580,7 +580,7 @@ long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu)
>  	if (eaddr & (0xFFFUL << 52))
>  		return H_PARAMETER;
>
> -	buf = kzalloc(n, GFP_KERNEL);
> +	buf = kzalloc(n, GFP_KERNEL | __GFP_NOWARN);
>  	if (!buf)
>  		return H_NO_MEM;

^ permalink raw reply

* Re: [PATCH] powerpc/bug: Cast to unsigned long before passing to inline asm
From: Segher Boessenkool @ 2021-09-01 14:44 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: nathan, linuxppc-dev
In-Reply-To: <87tuj43gu1.fsf@mpe.ellerman.id.au>

On Wed, Sep 01, 2021 at 05:17:26PM +1000, Michael Ellerman wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > On Tue, Aug 31, 2021 at 11:27:20PM +1000, Michael Ellerman wrote:
> >> Nathan filed an LLVM bug [2], in which Eli Friedman explained that "if
> >> you pass a value of a type that's narrower than a register to an inline
> >> asm, the high bits are undefined". In this case we are passing a bool
> >> to the inline asm, which is a single bit value, and so the compiler
> >> thinks it can leave the high bits of r30 unmasked, because only the
> >> value of bit 0 matters.
> >> 
> >> Because the inline asm compares the full width of the register (32 or
> >> 64-bit) we need to ensure the value passed to the inline asm has all
> >> bits defined. So fix it by casting to long.
> >> 
> >> We also already cast to long for the similar case in BUG_ENTRY(), which
> >> it turns out was added to fix a similar bug in 2005 in commit
> >> 32818c2eb6b8 ("[PATCH] ppc64: Fix issue with gcc 4.0 compiled kernels").
> >
> > That points to <https://gcc.gnu.org/PR23422>, which shows the correct
> > explanation.
> 
> That's a pity because I don't understand that explanation ^_^
> 
> Why does sign extension matter when we're comparing against zero?

The "td" insn wants a 64-bit quantity.  You have to provide one, the
compiler will not do an extend itself, it does not try to understand the
asm template in any way.

> > The code as it was did **not** pass a bool.  It perhaps passed an int
> > (so many macros in play, it is hard to tell for sure, but it is int or
> > long int, perhaps unsigned (which does not change anything here).
> 
> I don't understand that. It definitely is passing a bool at the source
> level. Are you saying it's getting promoted somehow?
> 
> It expands to:

<snip>

> And knode_dead() returns bool:
> 
>   static bool knode_dead(struct klist_node *knode)
>   {
>   	return (unsigned long)knode->n_klist & KNODE_DEAD;
>   }
> 
> So in my book that means the type there is bool. But I'm not a compiler
> guy so I guessing I'm missing something.

I was confused by all the macros and stuff.  And "bool" in the kernel
means "_Bool" now (so it is a character type, with GCC).

> > If this is not the correct explanation for LLVM, it needs to solve some
> > other bug.
> 
> OK. I just need to get this fixed in the kernel, so I'll do a new
> version with a changelog that is ~= "shrug not sure what's going on" and
> merge that. Then we can argue about what is really going on later :)

One thing you should probably do is never pass expressions as asm
operands that are "r".  Instead, make a temporary var and assign to that,
so it will have the type you want, without being able to forget to add
a cast :-)


Segher

^ permalink raw reply

* Re: [PATCH kernel] KVM: PPC: Book3S: Suppress warnings when allocating too big memory slots
From: Fabiano Rosas @ 2021-09-01 14:59 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev; +Cc: Alexey Kardashevskiy, kvm-ppc
In-Reply-To: <20210901084512.1658628-1-aik@ozlabs.ru>

Alexey Kardashevskiy <aik@ozlabs.ru> writes:

> The userspace can trigger "vmalloc size %lu allocation failure: exceeds
> total pages" via the KVM_SET_USER_MEMORY_REGION ioctl.
>
> This silences the warning by checking the limit before calling vzalloc()
> and returns ENOMEM if failed.
>
> This does not call underlying valloc helpers as __vmalloc_node() is only
> exported when CONFIG_TEST_VMALLOC_MODULE and __vmalloc_node_range() is not
> exported at all.
>
> Spotted by syzkaller.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/kvm/book3s_hv.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 474c0cfde384..a59f1cccbcf9 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4830,8 +4830,12 @@ static int kvmppc_core_prepare_memory_region_hv(struct kvm *kvm,
>  	unsigned long npages = mem->memory_size >> PAGE_SHIFT;
>
>  	if (change == KVM_MR_CREATE) {
> -		slot->arch.rmap = vzalloc(array_size(npages,
> -					  sizeof(*slot->arch.rmap)));
> +		unsigned long cb = array_size(npages, sizeof(*slot->arch.rmap));

What does cb mean?

> +
> +		if ((cb >> PAGE_SHIFT) > totalram_pages())
> +			return -ENOMEM;
> +
> +		slot->arch.rmap = vzalloc(cb);
>  		if (!slot->arch.rmap)
>  			return -ENOMEM;
>  	}

^ permalink raw reply

* Re: [PATCH kernel] KVM: PPC: Book3S HV: Make unique debugfs nodename
From: Fabiano Rosas @ 2021-09-01 15:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel, kvm-ppc
In-Reply-To: <87lf4gv0hf.fsf@linux.ibm.com>

Fabiano Rosas <farosas@linux.ibm.com> writes:

> That is why I mentioned creating a hook similar to
> kvm_create_vcpu_debugfs in the common KVM code. kvm_create_vm_debugfs or
> something.

s/kvm/kvm_arch/


^ permalink raw reply

* [PATCH v1 1/2] powerpc/64s: system call scv tabort fix for corrupt irq soft-mask state
From: Nicholas Piggin @ 2021-09-01 16:18 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Eirik Fuller, Nicholas Piggin

If a system call is made with a transaction active, the kernel
immediately aborts it and returns. scv system calls disable irqs even
earlier in their interrupt handler, and tabort_syscall does not fix this
up.

This can result in irq soft-mask state being messed up on the next
kernel entry, and crashing at BUG_ON(arch_irq_disabled_regs(regs)) in
the kernel exit handlers, or possibly worse.

Fix this by having tabort_syscall setting irq soft-mask back to enabled.

Reported-by: Eirik Fuller <efuller@redhat.com>
Fixes: 7fa95f9adaee7 ("powerpc/64s: system call support for scv/rfscv instructions")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/interrupt_64.S | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/interrupt_64.S b/arch/powerpc/kernel/interrupt_64.S
index d4212d2ff0b5..44f99df36fb2 100644
--- a/arch/powerpc/kernel/interrupt_64.S
+++ b/arch/powerpc/kernel/interrupt_64.S
@@ -438,6 +438,10 @@ _ASM_NOKPROBE_SYMBOL(tabort_syscall)
 	li	r9, (TM_CAUSE_SYSCALL|TM_CAUSE_PERSISTENT)
 	TABORT(R9)
 
+	/* scv has disabled irqs so must re-enable. sc just remains enabled */
+	li	r9,IRQS_ENABLED
+	stb	r9,PACAIRQSOFTMASK(r13)
+
 	/*
 	 * Return directly to userspace. We have corrupted user register state,
 	 * but userspace will never see that register state. Execution will
-- 
2.23.0


^ permalink raw reply related

* [PATCH v1 2/2] selftests/powerpc: Add scv versions of the basic TM syscall tests
From: Nicholas Piggin @ 2021-09-01 16:18 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Eirik Fuller, Nicholas Piggin
In-Reply-To: <20210901161810.1411015-1-npiggin@gmail.com>

The basic TM vs syscall test code hard codes an sc instruction for the
system call, which fails to cover scv even when the userspace libc has
support for it.

Duplicate the tests with hard coded scv variants so both are tested
when possible.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 .../selftests/powerpc/tm/tm-syscall-asm.S     | 46 +++++++++++++++++++
 .../testing/selftests/powerpc/tm/tm-syscall.c | 36 ++++++++++++---
 2 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S b/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
index bd1ca25febe4..849316831e6a 100644
--- a/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
+++ b/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
@@ -2,6 +2,10 @@
 #include <ppc-asm.h>
 #include <asm/unistd.h>
 
+/* ppc-asm.h does not define r0 or r1 */
+#define r0 0
+#define r1 1
+
 	.text
 FUNC_START(getppid_tm_active)
 	tbegin.
@@ -26,3 +30,45 @@ FUNC_START(getppid_tm_suspended)
 1:
 	li	r3, -1
 	blr
+
+FUNC_START(getppid_scv_tm_active)
+	mflr	r0
+	std	r0,16(r1)
+	stdu	r1,-32(r1)
+	tbegin.
+	beq 1f
+	li	r0, __NR_getppid
+	scv	0
+	tend.
+	addi	r1,r1,32
+	ld	r0,16(r1)
+	mtlr	r0
+	blr
+1:
+	li	r3, -1
+	addi	r1,r1,32
+	ld	r0,16(r1)
+	mtlr	r0
+	blr
+
+FUNC_START(getppid_scv_tm_suspended)
+	mflr	r0
+	std	r0,16(r1)
+	stdu	r1,-32(r1)
+	tbegin.
+	beq 1f
+	li	r0, __NR_getppid
+	tsuspend.
+	scv	0
+	tresume.
+	tend.
+	addi	r1,r1,32
+	ld	r0,16(r1)
+	mtlr	r0
+	blr
+1:
+	li	r3, -1
+	addi	r1,r1,32
+	ld	r0,16(r1)
+	mtlr	r0
+	blr
diff --git a/tools/testing/selftests/powerpc/tm/tm-syscall.c b/tools/testing/selftests/powerpc/tm/tm-syscall.c
index becb8207b432..9a822208680e 100644
--- a/tools/testing/selftests/powerpc/tm/tm-syscall.c
+++ b/tools/testing/selftests/powerpc/tm/tm-syscall.c
@@ -19,24 +19,37 @@
 #include "utils.h"
 #include "tm.h"
 
+#ifndef PPC_FEATURE2_SCV
+#define PPC_FEATURE2_SCV               0x00100000 /* scv syscall */
+#endif
+
 extern int getppid_tm_active(void);
 extern int getppid_tm_suspended(void);
+extern int getppid_scv_tm_active(void);
+extern int getppid_scv_tm_suspended(void);
 
 unsigned retries = 0;
 
 #define TEST_DURATION 10 /* seconds */
 #define TM_RETRIES 100
 
-pid_t getppid_tm(bool suspend)
+pid_t getppid_tm(bool scv, bool suspend)
 {
 	int i;
 	pid_t pid;
 
 	for (i = 0; i < TM_RETRIES; i++) {
-		if (suspend)
-			pid = getppid_tm_suspended();
-		else
-			pid = getppid_tm_active();
+		if (suspend) {
+			if (scv)
+				pid = getppid_scv_tm_suspended();
+			else
+				pid = getppid_tm_suspended();
+		} else {
+			if (scv)
+				pid = getppid_scv_tm_active();
+			else
+				pid = getppid_tm_active();
+		}
 
 		if (pid >= 0)
 			return pid;
@@ -82,15 +95,24 @@ int tm_syscall(void)
 		 * Test a syscall within a suspended transaction and verify
 		 * that it succeeds.
 		 */
-		FAIL_IF(getppid_tm(true) == -1); /* Should succeed. */
+		FAIL_IF(getppid_tm(false, true) == -1); /* Should succeed. */
 
 		/*
 		 * Test a syscall within an active transaction and verify that
 		 * it fails with the correct failure code.
 		 */
-		FAIL_IF(getppid_tm(false) != -1);  /* Should fail... */
+		FAIL_IF(getppid_tm(false, false) != -1);  /* Should fail... */
 		FAIL_IF(!failure_is_persistent()); /* ...persistently... */
 		FAIL_IF(!failure_is_syscall());    /* ...with code syscall. */
+
+		/* Now do it all again with scv if it is available. */
+		if (have_hwcap2(PPC_FEATURE2_SCV)) {
+			FAIL_IF(getppid_tm(true, true) == -1); /* Should succeed. */
+			FAIL_IF(getppid_tm(true, false) != -1);  /* Should fail... */
+			FAIL_IF(!failure_is_persistent()); /* ...persistently... */
+			FAIL_IF(!failure_is_syscall());    /* ...with code syscall. */
+		}
+
 		gettimeofday(&now, 0);
 	}
 
-- 
2.23.0


^ permalink raw reply related

* [PATCH v2 1/2] powerpc/64s: system call scv tabort fix for corrupt irq soft-mask state
From: Nicholas Piggin @ 2021-09-01 16:54 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Eirik Fuller, Nicholas Piggin

If a system call is made with a transaction active, the kernel
immediately aborts it and returns. scv system calls disable irqs even
earlier in their interrupt handler, and tabort_syscall does not fix this
up.

This can result in irq soft-mask state being messed up on the next
kernel entry, and crashing at BUG_ON(arch_irq_disabled_regs(regs)) in
the kernel exit handlers, or possibly worse.

Fix this by having tabort_syscall setting irq soft-mask back to enabled
(which requires MSR[EE] be disabled first).

Reported-by: Eirik Fuller <efuller@redhat.com>
Fixes: 7fa95f9adaee7 ("powerpc/64s: system call support for scv/rfscv instructions")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---

Tested the wrong kernel before sending v1 and missed a bug, sorry.

 arch/powerpc/kernel/interrupt_64.S | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/interrupt_64.S b/arch/powerpc/kernel/interrupt_64.S
index d4212d2ff0b5..9c31d65b4851 100644
--- a/arch/powerpc/kernel/interrupt_64.S
+++ b/arch/powerpc/kernel/interrupt_64.S
@@ -428,16 +428,22 @@ RESTART_TABLE(.Lsyscall_rst_start, .Lsyscall_rst_end, syscall_restart)
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 tabort_syscall:
 _ASM_NOKPROBE_SYMBOL(tabort_syscall)
-	/* Firstly we need to enable TM in the kernel */
+	/* We need to enable TM in the kernel, and disable EE (for scv) */
 	mfmsr	r10
 	li	r9, 1
 	rldimi	r10, r9, MSR_TM_LG, 63-MSR_TM_LG
+	LOAD_REG_IMMEDIATE(r9, MSR_EE)
+	andc	r10, r10, r9
 	mtmsrd	r10, 0
 
 	/* tabort, this dooms the transaction, nothing else */
 	li	r9, (TM_CAUSE_SYSCALL|TM_CAUSE_PERSISTENT)
 	TABORT(R9)
 
+	/* scv has disabled irqs so must re-enable. sc just remains enabled */
+	li	r9,IRQS_ENABLED
+	stb	r9,PACAIRQSOFTMASK(r13)
+
 	/*
 	 * Return directly to userspace. We have corrupted user register state,
 	 * but userspace will never see that register state. Execution will
-- 
2.23.0


^ permalink raw reply related

* [PATCH v2 2/2] selftests/powerpc: Add scv versions of the basic TM syscall tests
From: Nicholas Piggin @ 2021-09-01 16:54 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Eirik Fuller, Nicholas Piggin
In-Reply-To: <20210901165418.1412891-1-npiggin@gmail.com>

The basic TM vs syscall test code hard codes an sc instruction for the
system call, which fails to cover scv even when the userspace libc has
support for it.

Duplicate the tests with hard coded scv variants so both are tested
when possible.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 .../selftests/powerpc/tm/tm-syscall-asm.S     | 46 +++++++++++++++++++
 .../testing/selftests/powerpc/tm/tm-syscall.c | 36 ++++++++++++---
 2 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S b/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
index bd1ca25febe4..849316831e6a 100644
--- a/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
+++ b/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
@@ -2,6 +2,10 @@
 #include <ppc-asm.h>
 #include <asm/unistd.h>
 
+/* ppc-asm.h does not define r0 or r1 */
+#define r0 0
+#define r1 1
+
 	.text
 FUNC_START(getppid_tm_active)
 	tbegin.
@@ -26,3 +30,45 @@ FUNC_START(getppid_tm_suspended)
 1:
 	li	r3, -1
 	blr
+
+FUNC_START(getppid_scv_tm_active)
+	mflr	r0
+	std	r0,16(r1)
+	stdu	r1,-32(r1)
+	tbegin.
+	beq 1f
+	li	r0, __NR_getppid
+	scv	0
+	tend.
+	addi	r1,r1,32
+	ld	r0,16(r1)
+	mtlr	r0
+	blr
+1:
+	li	r3, -1
+	addi	r1,r1,32
+	ld	r0,16(r1)
+	mtlr	r0
+	blr
+
+FUNC_START(getppid_scv_tm_suspended)
+	mflr	r0
+	std	r0,16(r1)
+	stdu	r1,-32(r1)
+	tbegin.
+	beq 1f
+	li	r0, __NR_getppid
+	tsuspend.
+	scv	0
+	tresume.
+	tend.
+	addi	r1,r1,32
+	ld	r0,16(r1)
+	mtlr	r0
+	blr
+1:
+	li	r3, -1
+	addi	r1,r1,32
+	ld	r0,16(r1)
+	mtlr	r0
+	blr
diff --git a/tools/testing/selftests/powerpc/tm/tm-syscall.c b/tools/testing/selftests/powerpc/tm/tm-syscall.c
index becb8207b432..9a822208680e 100644
--- a/tools/testing/selftests/powerpc/tm/tm-syscall.c
+++ b/tools/testing/selftests/powerpc/tm/tm-syscall.c
@@ -19,24 +19,37 @@
 #include "utils.h"
 #include "tm.h"
 
+#ifndef PPC_FEATURE2_SCV
+#define PPC_FEATURE2_SCV               0x00100000 /* scv syscall */
+#endif
+
 extern int getppid_tm_active(void);
 extern int getppid_tm_suspended(void);
+extern int getppid_scv_tm_active(void);
+extern int getppid_scv_tm_suspended(void);
 
 unsigned retries = 0;
 
 #define TEST_DURATION 10 /* seconds */
 #define TM_RETRIES 100
 
-pid_t getppid_tm(bool suspend)
+pid_t getppid_tm(bool scv, bool suspend)
 {
 	int i;
 	pid_t pid;
 
 	for (i = 0; i < TM_RETRIES; i++) {
-		if (suspend)
-			pid = getppid_tm_suspended();
-		else
-			pid = getppid_tm_active();
+		if (suspend) {
+			if (scv)
+				pid = getppid_scv_tm_suspended();
+			else
+				pid = getppid_tm_suspended();
+		} else {
+			if (scv)
+				pid = getppid_scv_tm_active();
+			else
+				pid = getppid_tm_active();
+		}
 
 		if (pid >= 0)
 			return pid;
@@ -82,15 +95,24 @@ int tm_syscall(void)
 		 * Test a syscall within a suspended transaction and verify
 		 * that it succeeds.
 		 */
-		FAIL_IF(getppid_tm(true) == -1); /* Should succeed. */
+		FAIL_IF(getppid_tm(false, true) == -1); /* Should succeed. */
 
 		/*
 		 * Test a syscall within an active transaction and verify that
 		 * it fails with the correct failure code.
 		 */
-		FAIL_IF(getppid_tm(false) != -1);  /* Should fail... */
+		FAIL_IF(getppid_tm(false, false) != -1);  /* Should fail... */
 		FAIL_IF(!failure_is_persistent()); /* ...persistently... */
 		FAIL_IF(!failure_is_syscall());    /* ...with code syscall. */
+
+		/* Now do it all again with scv if it is available. */
+		if (have_hwcap2(PPC_FEATURE2_SCV)) {
+			FAIL_IF(getppid_tm(true, true) == -1); /* Should succeed. */
+			FAIL_IF(getppid_tm(true, false) != -1);  /* Should fail... */
+			FAIL_IF(!failure_is_persistent()); /* ...persistently... */
+			FAIL_IF(!failure_is_syscall());    /* ...with code syscall. */
+		}
+
 		gettimeofday(&now, 0);
 	}
 
-- 
2.23.0


^ permalink raw reply related

* Re: [PATCH v2 2/2] selftests/powerpc: Add scv versions of the basic TM syscall tests
From: Christophe Leroy @ 2021-09-01 17:15 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Eirik Fuller
In-Reply-To: <20210901165418.1412891-2-npiggin@gmail.com>



Le 01/09/2021 à 18:54, Nicholas Piggin a écrit :
> The basic TM vs syscall test code hard codes an sc instruction for the
> system call, which fails to cover scv even when the userspace libc has
> support for it.
> 
> Duplicate the tests with hard coded scv variants so both are tested
> when possible.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   .../selftests/powerpc/tm/tm-syscall-asm.S     | 46 +++++++++++++++++++
>   .../testing/selftests/powerpc/tm/tm-syscall.c | 36 ++++++++++++---
>   2 files changed, 75 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S b/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
> index bd1ca25febe4..849316831e6a 100644
> --- a/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
> +++ b/tools/testing/selftests/powerpc/tm/tm-syscall-asm.S
> @@ -2,6 +2,10 @@
>   #include <ppc-asm.h>
>   #include <asm/unistd.h>
>   
> +/* ppc-asm.h does not define r0 or r1 */
> +#define r0 0
> +#define r1 1
> +

See https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/ppc-asm.h

It doesn't not define r1 but it defines r0.

And it defines 'sp' as register 1.

>   	.text
>   FUNC_START(getppid_tm_active)
>   	tbegin.
> @@ -26,3 +30,45 @@ FUNC_START(getppid_tm_suspended)
>   1:
>   	li	r3, -1
>   	blr
> +
> +FUNC_START(getppid_scv_tm_active)
> +	mflr	r0
> +	std	r0,16(r1)
> +	stdu	r1,-32(r1)
> +	tbegin.
> +	beq 1f
> +	li	r0, __NR_getppid
> +	scv	0
> +	tend.
> +	addi	r1,r1,32
> +	ld	r0,16(r1)
> +	mtlr	r0
> +	blr
> +1:
> +	li	r3, -1
> +	addi	r1,r1,32
> +	ld	r0,16(r1)
> +	mtlr	r0
> +	blr
> +
> +FUNC_START(getppid_scv_tm_suspended)
> +	mflr	r0
> +	std	r0,16(r1)
> +	stdu	r1,-32(r1)
> +	tbegin.
> +	beq 1f
> +	li	r0, __NR_getppid
> +	tsuspend.
> +	scv	0
> +	tresume.
> +	tend.
> +	addi	r1,r1,32
> +	ld	r0,16(r1)
> +	mtlr	r0
> +	blr
> +1:
> +	li	r3, -1
> +	addi	r1,r1,32
> +	ld	r0,16(r1)
> +	mtlr	r0
> +	blr
> diff --git a/tools/testing/selftests/powerpc/tm/tm-syscall.c b/tools/testing/selftests/powerpc/tm/tm-syscall.c
> index becb8207b432..9a822208680e 100644
> --- a/tools/testing/selftests/powerpc/tm/tm-syscall.c
> +++ b/tools/testing/selftests/powerpc/tm/tm-syscall.c
> @@ -19,24 +19,37 @@
>   #include "utils.h"
>   #include "tm.h"
>   
> +#ifndef PPC_FEATURE2_SCV
> +#define PPC_FEATURE2_SCV               0x00100000 /* scv syscall */
> +#endif
> +
>   extern int getppid_tm_active(void);
>   extern int getppid_tm_suspended(void);
> +extern int getppid_scv_tm_active(void);
> +extern int getppid_scv_tm_suspended(void);
>   
>   unsigned retries = 0;
>   
>   #define TEST_DURATION 10 /* seconds */
>   #define TM_RETRIES 100
>   
> -pid_t getppid_tm(bool suspend)
> +pid_t getppid_tm(bool scv, bool suspend)
>   {
>   	int i;
>   	pid_t pid;
>   
>   	for (i = 0; i < TM_RETRIES; i++) {
> -		if (suspend)
> -			pid = getppid_tm_suspended();
> -		else
> -			pid = getppid_tm_active();
> +		if (suspend) {
> +			if (scv)
> +				pid = getppid_scv_tm_suspended();
> +			else
> +				pid = getppid_tm_suspended();
> +		} else {
> +			if (scv)
> +				pid = getppid_scv_tm_active();
> +			else
> +				pid = getppid_tm_active();
> +		}
>   
>   		if (pid >= 0)
>   			return pid;
> @@ -82,15 +95,24 @@ int tm_syscall(void)
>   		 * Test a syscall within a suspended transaction and verify
>   		 * that it succeeds.
>   		 */
> -		FAIL_IF(getppid_tm(true) == -1); /* Should succeed. */
> +		FAIL_IF(getppid_tm(false, true) == -1); /* Should succeed. */
>   
>   		/*
>   		 * Test a syscall within an active transaction and verify that
>   		 * it fails with the correct failure code.
>   		 */
> -		FAIL_IF(getppid_tm(false) != -1);  /* Should fail... */
> +		FAIL_IF(getppid_tm(false, false) != -1);  /* Should fail... */
>   		FAIL_IF(!failure_is_persistent()); /* ...persistently... */
>   		FAIL_IF(!failure_is_syscall());    /* ...with code syscall. */
> +
> +		/* Now do it all again with scv if it is available. */
> +		if (have_hwcap2(PPC_FEATURE2_SCV)) {
> +			FAIL_IF(getppid_tm(true, true) == -1); /* Should succeed. */
> +			FAIL_IF(getppid_tm(true, false) != -1);  /* Should fail... */
> +			FAIL_IF(!failure_is_persistent()); /* ...persistently... */
> +			FAIL_IF(!failure_is_syscall());    /* ...with code syscall. */
> +		}
> +
>   		gettimeofday(&now, 0);
>   	}
>   
> 

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/64s: system call scv tabort fix for corrupt irq soft-mask state
From: Christophe Leroy @ 2021-09-01 17:21 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Eirik Fuller
In-Reply-To: <20210901165418.1412891-1-npiggin@gmail.com>



Le 01/09/2021 à 18:54, Nicholas Piggin a écrit :
> If a system call is made with a transaction active, the kernel
> immediately aborts it and returns. scv system calls disable irqs even
> earlier in their interrupt handler, and tabort_syscall does not fix this
> up.
> 
> This can result in irq soft-mask state being messed up on the next
> kernel entry, and crashing at BUG_ON(arch_irq_disabled_regs(regs)) in
> the kernel exit handlers, or possibly worse.
> 
> Fix this by having tabort_syscall setting irq soft-mask back to enabled
> (which requires MSR[EE] be disabled first).
> 
> Reported-by: Eirik Fuller <efuller@redhat.com>
> Fixes: 7fa95f9adaee7 ("powerpc/64s: system call support for scv/rfscv instructions")
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> 
> Tested the wrong kernel before sending v1 and missed a bug, sorry.
> 
>   arch/powerpc/kernel/interrupt_64.S | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/interrupt_64.S b/arch/powerpc/kernel/interrupt_64.S
> index d4212d2ff0b5..9c31d65b4851 100644
> --- a/arch/powerpc/kernel/interrupt_64.S
> +++ b/arch/powerpc/kernel/interrupt_64.S
> @@ -428,16 +428,22 @@ RESTART_TABLE(.Lsyscall_rst_start, .Lsyscall_rst_end, syscall_restart)
>   #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>   tabort_syscall:
>   _ASM_NOKPROBE_SYMBOL(tabort_syscall)
> -	/* Firstly we need to enable TM in the kernel */
> +	/* We need to enable TM in the kernel, and disable EE (for scv) */
>   	mfmsr	r10
>   	li	r9, 1
>   	rldimi	r10, r9, MSR_TM_LG, 63-MSR_TM_LG
> +	LOAD_REG_IMMEDIATE(r9, MSR_EE)
> +	andc	r10, r10, r9

Why not use 'rlwinm' to mask out MSR_EE ?

Something like

	rlwinm	r10, r10, 0, ~MSR_EE

>   	mtmsrd	r10, 0
>   
>   	/* tabort, this dooms the transaction, nothing else */
>   	li	r9, (TM_CAUSE_SYSCALL|TM_CAUSE_PERSISTENT)
>   	TABORT(R9)
>   
> +	/* scv has disabled irqs so must re-enable. sc just remains enabled */
> +	li	r9,IRQS_ENABLED
> +	stb	r9,PACAIRQSOFTMASK(r13)
> +
>   	/*
>   	 * Return directly to userspace. We have corrupted user register state,
>   	 * but userspace will never see that register state. Execution will
> 

^ permalink raw reply

* [PATCH 0/5] KVM: PPC: Book3S: Modules cleanup and unification
From: Fabiano Rosas @ 2021-09-01 17:33 UTC (permalink / raw)
  To: kvm-ppc; +Cc: linuxppc-dev, npiggin, david

This series merges our three kvm modules kvm.ko, kvm-hv.ko and
kvm-pr.ko into one kvm.ko module.

The main reason for this is to deal with the issue that kvm.ko can be
loaded on its own without any of the other modules present. This can
happen if one or both of the modules fail to init or if the user loads
kvm.ko only.

With only kvm.ko loaded, the userspace can call any of the KVM ioctls
which will fail more or less gracefully depending on what kind of
verification we do in powerpc.c.

Instead of adding a check to every entry point or finding a hack to
link the modules so that when one fails (hv/pr), the other (kvm)
exits, I think it is cleaner to just make them all a single module.

The two KVM implementations are already selected by Kconfig options,
so the only thing that changes is that they are now in the same
module. I also kept kvm-hv and kvm-pr as aliases to kvm, so that
people don't get too surprised with the change.

There is a possible issue with the larger module size for kernel
builds that should support both HV-only and PR-only environments, but
PR is usually not used in production so I'm not sure if that is a real
issue.

Patches 1,2,3 are standalone cleanups.
Patches 4,5 are the unification work.

Fabiano Rosas (5):
  KVM: PPC: Book3S HV: Check return value of kvmppc_radix_init
  KVM: PPC: Book3S HV: Delay setting of kvm ops
  KVM: PPC: Book3S HV: Free allocated memory if module init fails
  KVM: PPC: Book3S: Unify kvm-hv and kvm-pr modules
  KVM: PPC: Book3S: Stop exporting non-builtin symbols

 arch/powerpc/configs/powernv_defconfig |  2 +-
 arch/powerpc/configs/ppc64_defconfig   |  2 +-
 arch/powerpc/configs/pseries_defconfig |  2 +-
 arch/powerpc/kvm/Kconfig               | 72 ++++++++++++--------------
 arch/powerpc/kvm/Makefile              | 11 ++--
 arch/powerpc/kvm/book3s.c              | 61 ++++++++++++++--------
 arch/powerpc/kvm/book3s.h              | 19 +++++++
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  3 --
 arch/powerpc/kvm/book3s_64_vio.c       |  3 --
 arch/powerpc/kvm/book3s_hv.c           | 38 ++++++++------
 arch/powerpc/kvm/book3s_pr.c           | 13 -----
 arch/powerpc/kvm/book3s_rtas.c         |  1 -
 arch/powerpc/kvm/book3s_xics.c         |  4 --
 arch/powerpc/kvm/book3s_xive.c         |  6 ---
 arch/powerpc/kvm/emulate.c             |  1 -
 arch/powerpc/kvm/powerpc.c             | 14 -----
 kernel/irq/irqdesc.c                   |  2 +-
 17 files changed, 125 insertions(+), 129 deletions(-)

-- 
2.29.2


^ permalink raw reply

* [PATCH 3/5] KVM: PPC: Book3S HV: Free allocated memory if module init fails
From: Fabiano Rosas @ 2021-09-01 17:33 UTC (permalink / raw)
  To: kvm-ppc; +Cc: linuxppc-dev, npiggin, david
In-Reply-To: <20210901173357.3183658-1-farosas@linux.ibm.com>

The module's exit function is not called when the init fails, we need
to do cleanup before returning.

Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_hv.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index fe20074faa61..a7b82eb438f5 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5963,7 +5963,7 @@ static int kvmppc_book3s_init_hv(void)
 
 	r = kvm_init_subcore_bitmap();
 	if (r)
-		return r;
+		goto err;
 
 	/*
 	 * We need a way of accessing the XICS interrupt controller,
@@ -5978,7 +5978,8 @@ static int kvmppc_book3s_init_hv(void)
 		np = of_find_compatible_node(NULL, NULL, "ibm,opal-intc");
 		if (!np) {
 			pr_err("KVM-HV: Cannot determine method for accessing XICS\n");
-			return -ENODEV;
+			r = -ENODEV;
+			goto err;
 		}
 		/* presence of intc confirmed - node can be dropped again */
 		of_node_put(np);
@@ -5991,12 +5992,12 @@ static int kvmppc_book3s_init_hv(void)
 
 	r = kvmppc_mmu_hv_init();
 	if (r)
-		return r;
+		goto err;
 
 	if (kvmppc_radix_possible()) {
 		r = kvmppc_radix_init();
 		if (r)
-			return r;
+			goto err;
 	}
 
 	r = kvmppc_uvmem_init();
@@ -6009,6 +6010,12 @@ static int kvmppc_book3s_init_hv(void)
 	kvmppc_hv_ops = &kvm_ops_hv;
 
 	return 0;
+
+err:
+	kvmhv_nested_exit();
+	kvmppc_radix_exit();
+
+	return r;
 }
 
 static void kvmppc_book3s_exit_hv(void)
-- 
2.29.2


^ permalink raw reply related

* [PATCH 4/5] KVM: PPC: Book3S: Unify kvm-hv and kvm-pr modules
From: Fabiano Rosas @ 2021-09-01 17:33 UTC (permalink / raw)
  To: kvm-ppc; +Cc: linuxppc-dev, npiggin, david
In-Reply-To: <20210901173357.3183658-1-farosas@linux.ibm.com>

Our three virtualization modules (kvm, kvm-hv, kvm-pr) can be
loaded/unloaded in such a way that could leave kvm.ko loaded without
kvm-hv.ko or kvm-pr.ko. That means the userspace could continue to
issue ioctls to KVM while there is no code on our side to service
them.

We currently select at config time which module will be built, either
kvm-hv, kvm-pr or both, with the kvm module being always present. For
32 bits the kvm module is the only one present and contains the code
from kvm-pr in it. We can do the same for 64 bit and keep hv and pr
inside kvm.ko.

With this, we do not lose the ability of running two guests at the
same time, each using a different implementation (although only POWER
bare-metal with Hash MMU supports HV and PR at the same time).

We lose the ability of loading and unloading the code, which means any
Linux installation that wants to support *both* KVM-HV guests on POWER
bare-metal and KVM-PR nested guests on top of PowerVM would have a
binary with a larger memory footprint (assuming PR is only used when
HV is not present).

This patch alters the build to output only one module at all times and
relies on the Kconfig to select which implementation will be present.

The module init code was rearranged to initialize and cleanup both
implementations or only one of them.

The Kconfig was altered to provide one boolean for each implementation
(KVM_BOOK3S_PR_POSSIBLE, KVM_BOOK3S_HV_POSSIBLE), while keeping the
old tristate for the kvm.ko module (KVM_BOOK3S_64). The old
KVM_BOOK3S_32 already selects KVM_BOOK3S_PR_POSSIBLE, so nothing
changes there.

Two module aliases were created to facilitate the introduction of the
new scheme so `modprobe kvm-hv`and `modprobe kvm-pr` are the same as
`modprobe kvm`.

The license macro was removed because it is already included by
virt/kvm/kvm_main.c.

Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>

---
Here's a summary of the environments that can run KVM:

* 64 bit powernv w/Radix              KVM-HV
* 64 bit powernv w/Hash               KVM-HV        KVM-PR
  32 bit powernv w/Hash                             KVM-PR
  32 bit powernv w/Radix              N/A
* 64 bit pseries w/Radix on powernv   KVM-HV nested
* 64 bit pseries w/Hash  on powernv                 KVM-PR nested
  32 bit pseries w/Hash  on powernv                 KVM-PR nested
* 64 bit pseries w/Radix on PowerVM   N/A
* 64 bit pseries w/Hash  on PowerVM                 KVM-PR nested

Lines with an asterisk were tested with all combinations for the
module and the HV/PR configs.
---
 arch/powerpc/configs/powernv_defconfig |  2 +-
 arch/powerpc/configs/ppc64_defconfig   |  2 +-
 arch/powerpc/configs/pseries_defconfig |  2 +-
 arch/powerpc/kvm/Kconfig               | 72 ++++++++++++--------------
 arch/powerpc/kvm/Makefile              | 11 ++--
 arch/powerpc/kvm/book3s.c              | 46 +++++++++++++---
 arch/powerpc/kvm/book3s.h              | 19 +++++++
 arch/powerpc/kvm/book3s_hv.c           | 10 +---
 arch/powerpc/kvm/book3s_pr.c           | 13 -----
 kernel/irq/irqdesc.c                   |  2 +-
 10 files changed, 105 insertions(+), 74 deletions(-)

diff --git a/arch/powerpc/configs/powernv_defconfig b/arch/powerpc/configs/powernv_defconfig
index 8bfeea6c7de7..853b95685d9f 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -341,7 +341,7 @@ CONFIG_CRYPTO_DEV_NX=y
 CONFIG_CRYPTO_DEV_VMX=y
 CONFIG_VIRTUALIZATION=y
 CONFIG_KVM_BOOK3S_64=m
-CONFIG_KVM_BOOK3S_64_HV=m
+CONFIG_KVM_BOOK3S_HV_POSSIBLE=y
 CONFIG_VHOST_NET=m
 CONFIG_PRINTK_TIME=y
 CONFIG_PRINTK_CALLER=y
diff --git a/arch/powerpc/configs/ppc64_defconfig b/arch/powerpc/configs/ppc64_defconfig
index 0ad2291337a7..46ace457def6 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -61,7 +61,7 @@ CONFIG_PCCARD=y
 CONFIG_ELECTRA_CF=y
 CONFIG_VIRTUALIZATION=y
 CONFIG_KVM_BOOK3S_64=m
-CONFIG_KVM_BOOK3S_64_HV=m
+CONFIG_KVM_BOOK3S_HV_POSSIBLE=y
 CONFIG_VHOST_NET=m
 CONFIG_KPROBES=y
 CONFIG_JUMP_LABEL=y
diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig
index b183629f1bcf..6804b1a6bef1 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -321,7 +321,7 @@ CONFIG_CRYPTO_DEV_NX_ENCRYPT=m
 CONFIG_CRYPTO_DEV_VMX=y
 CONFIG_VIRTUALIZATION=y
 CONFIG_KVM_BOOK3S_64=m
-CONFIG_KVM_BOOK3S_64_HV=m
+CONFIG_KVM_BOOK3S_HV_POSSIBLE=y
 CONFIG_VHOST_NET=m
 CONFIG_PRINTK_TIME=y
 CONFIG_PRINTK_CALLER=y
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index e45644657d49..2d080f57ce3b 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -41,12 +41,43 @@ config KVM_BOOK3S_64_HANDLER
 	select PPC_DAWR_FORCE_ENABLE
 
 config KVM_BOOK3S_PR_POSSIBLE
-	bool
+	bool "KVM support without using hypervisor mode in host"
+	depends on KVM_BOOK3S_64 || KVM_BOOK3S_32
 	select KVM_MMIO
 	select MMU_NOTIFIER
+	help
+	  Support running guest kernels in virtual machines on processors
+	  without using hypervisor mode in the host, by running the
+	  guest in user mode (problem state) and emulating all
+	  privileged instructions and registers.
+
+	  This is not as fast as using hypervisor mode, but works on
+	  where hypervisor mode is not available or not usable,
+	  and can emulate processors that are different from the host
+	  processor, including emulating 32-bit processors on a 64-bit
+	  host.
+
 
 config KVM_BOOK3S_HV_POSSIBLE
-	bool
+	bool "KVM for POWER7 and later using hypervisor mode in host"
+	depends on KVM_BOOK3S_64
+	select MMU_NOTIFIER
+	select CMA
+	help
+	  Support running unmodified book3s_64 guest kernels in
+	  virtual machines on POWER7 and newer processors that have
+	  hypervisor mode available to the host.
+
+	  If you say Y here, KVM will use the hardware virtualization
+	  facilities of POWER7 (and later) processors, meaning that
+	  guest operating systems will run at full hardware speed
+	  using supervisor and user modes.  However, this also means
+	  that KVM is not usable under PowerVM (pHyp), is only usable
+	  on POWER7 or later processors, and cannot emulate a
+	  different processor from the host processor.
+
+	  If unsure, say N.
+
 
 config KVM_BOOK3S_32
 	tristate "KVM support for PowerPC book3s_32 processors"
@@ -80,43 +111,6 @@ config KVM_BOOK3S_64
 
 	  If unsure, say N.
 
-config KVM_BOOK3S_64_HV
-	tristate "KVM for POWER7 and later using hypervisor mode in host"
-	depends on KVM_BOOK3S_64 && PPC_POWERNV
-	select KVM_BOOK3S_HV_POSSIBLE
-	select MMU_NOTIFIER
-	select CMA
-	help
-	  Support running unmodified book3s_64 guest kernels in
-	  virtual machines on POWER7 and newer processors that have
-	  hypervisor mode available to the host.
-
-	  If you say Y here, KVM will use the hardware virtualization
-	  facilities of POWER7 (and later) processors, meaning that
-	  guest operating systems will run at full hardware speed
-	  using supervisor and user modes.  However, this also means
-	  that KVM is not usable under PowerVM (pHyp), is only usable
-	  on POWER7 or later processors, and cannot emulate a
-	  different processor from the host processor.
-
-	  If unsure, say N.
-
-config KVM_BOOK3S_64_PR
-	tristate "KVM support without using hypervisor mode in host"
-	depends on KVM_BOOK3S_64
-	select KVM_BOOK3S_PR_POSSIBLE
-	help
-	  Support running guest kernels in virtual machines on processors
-	  without using hypervisor mode in the host, by running the
-	  guest in user mode (problem state) and emulating all
-	  privileged instructions and registers.
-
-	  This is not as fast as using hypervisor mode, but works on
-	  machines where hypervisor mode is not available or not usable,
-	  and can emulate processors that are different from the host
-	  processor, including emulating 32-bit processors on a 64-bit
-	  host.
-
 config KVM_BOOK3S_HV_EXIT_TIMING
 	bool "Detailed timing for hypervisor real-mode code"
 	depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 583c14ef596e..704380065df3 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -108,6 +108,14 @@ kvm-book3s_64-module-objs := \
 	book3s_rtas.o \
 	$(kvm-book3s_64-objs-y)
 
+ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+kvm-book3s_64-module-objs += $(kvm-hv-y)
+endif
+
+ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
+kvm-book3s_64-module-objs += $(kvm-pr-y)
+endif
+
 kvm-objs-$(CONFIG_KVM_BOOK3S_64) := $(kvm-book3s_64-module-objs)
 
 kvm-book3s_32-objs := \
@@ -134,7 +142,4 @@ obj-$(CONFIG_KVM_E500MC) += kvm.o
 obj-$(CONFIG_KVM_BOOK3S_64) += kvm.o
 obj-$(CONFIG_KVM_BOOK3S_32) += kvm.o
 
-obj-$(CONFIG_KVM_BOOK3S_64_PR) += kvm-pr.o
-obj-$(CONFIG_KVM_BOOK3S_64_HV) += kvm-hv.o
-
 obj-y += $(kvm-book3s_64-builtin-objs-y)
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 79833f78d1da..c381bb17b0f9 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -1065,6 +1065,24 @@ int kvm_irq_map_chip_pin(struct kvm *kvm, unsigned irqchip, unsigned pin)
 
 #endif /* CONFIG_KVM_XICS */
 
+static int kvmppc_init(void)
+{
+	int r;
+
+	r = kvmppc_book3s_init_hv();
+	if (r)
+		pr_err("KVM-HV: could not load (%d)\n", r);
+
+	r = kvmppc_book3s_init_pr();
+	if (r)
+		pr_err("KVM-PR: could not load (%d)\n", r);
+
+	if (!kvmppc_hv_ops && !kvmppc_pr_ops)
+		return -EINVAL;
+
+	return 0;
+}
+
 static int kvmppc_book3s_init(void)
 {
 	int r;
@@ -1072,9 +1090,10 @@ static int kvmppc_book3s_init(void)
 	r = kvm_init(NULL, sizeof(struct kvm_vcpu), 0, THIS_MODULE);
 	if (r)
 		return r;
-#ifdef CONFIG_KVM_BOOK3S_32_HANDLER
-	r = kvmppc_book3s_init_pr();
-#endif
+
+	r = kvmppc_init();
+	if (r)
+		goto exit;
 
 #ifdef CONFIG_KVM_XICS
 #ifdef CONFIG_KVM_XIVE
@@ -1087,22 +1106,35 @@ static int kvmppc_book3s_init(void)
 #endif
 		kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
 #endif
+	return 0;
+
+exit:
+	kvm_exit();
 	return r;
 }
 
 static void kvmppc_book3s_exit(void)
 {
-#ifdef CONFIG_KVM_BOOK3S_32_HANDLER
 	kvmppc_book3s_exit_pr();
-#endif
+	kvmppc_book3s_exit_hv();
 	kvm_exit();
 }
 
 module_init(kvmppc_book3s_init);
 module_exit(kvmppc_book3s_exit);
 
-/* On 32bit this is our one and only kernel module */
-#ifdef CONFIG_KVM_BOOK3S_32_HANDLER
 MODULE_ALIAS_MISCDEV(KVM_MINOR);
 MODULE_ALIAS("devname:kvm");
+
+/*
+ * Whether we use KVM-HV or KVM-PR is dependent exclusively on the
+ * config options. These aliases are here only to ease the transition
+ * to the one module model we have now.
+ */
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+MODULE_ALIAS("kvm-hv");
+#endif
+
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
+MODULE_ALIAS("kvm-pr");
 #endif
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 740e51def5a5..d21772904971 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -22,8 +22,27 @@ extern int kvmppc_core_emulate_mtspr_pr(struct kvm_vcpu *vcpu,
 					int sprn, ulong spr_val);
 extern int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu,
 					int sprn, ulong *spr_val);
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
 extern int kvmppc_book3s_init_pr(void);
 extern void kvmppc_book3s_exit_pr(void);
+#else
+static inline int kvmppc_book3s_init_pr(void)
+{
+	return 0;
+}
+static inline void kvmppc_book3s_exit_pr(void) {}
+#endif
+
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+extern int kvmppc_book3s_init_hv(void);
+extern void kvmppc_book3s_exit_hv(void);
+#else
+static inline int kvmppc_book3s_init_hv(void)
+{
+	return 0;
+}
+static inline void kvmppc_book3s_exit_hv(void) {}
+#endif
 
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 extern void kvmppc_emulate_tabort(struct kvm_vcpu *vcpu, int ra_val);
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index a7b82eb438f5..6ba9545ef58e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5941,7 +5941,7 @@ static int kvmppc_radix_possible(void)
 	return cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled();
 }
 
-static int kvmppc_book3s_init_hv(void)
+int kvmppc_book3s_init_hv(void)
 {
 	int r;
 
@@ -6018,7 +6018,7 @@ static int kvmppc_book3s_init_hv(void)
 	return r;
 }
 
-static void kvmppc_book3s_exit_hv(void)
+void kvmppc_book3s_exit_hv(void)
 {
 	kvmppc_uvmem_free();
 	kvmppc_free_host_rm_ops();
@@ -6027,9 +6027,3 @@ static void kvmppc_book3s_exit_hv(void)
 	kvmppc_hv_ops = NULL;
 	kvmhv_nested_exit();
 }
-
-module_init(kvmppc_book3s_init_hv);
-module_exit(kvmppc_book3s_exit_hv);
-MODULE_LICENSE("GPL");
-MODULE_ALIAS_MISCDEV(KVM_MINOR);
-MODULE_ALIAS("devname:kvm");
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 6bc9425acb32..1d017c4b3eb4 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -2100,16 +2100,3 @@ void kvmppc_book3s_exit_pr(void)
 	kvmppc_pr_ops = NULL;
 	kvmppc_mmu_hpte_sysexit();
 }
-
-/*
- * We only support separate modules for book3s 64
- */
-#ifdef CONFIG_PPC_BOOK3S_64
-
-module_init(kvmppc_book3s_init_pr);
-module_exit(kvmppc_book3s_exit_pr);
-
-MODULE_LICENSE("GPL");
-MODULE_ALIAS_MISCDEV(KVM_MINOR);
-MODULE_ALIAS("devname:kvm");
-#endif
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 4e3c29bb603c..7c9c9e794c01 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -352,7 +352,7 @@ struct irq_desc *irq_to_desc(unsigned int irq)
 {
 	return radix_tree_lookup(&irq_desc_tree, irq);
 }
-#ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
+#ifdef CONFIG_KVM_BOOK3S_64_MODULE
 EXPORT_SYMBOL_GPL(irq_to_desc);
 #endif
 
-- 
2.29.2


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox