LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Arnd Bergmann @ 2020-07-24  8:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Albert Ou, Alex Ghiti, Atish Patra, Anup Patel,
	linux-kernel@vger.kernel.org, Linux-MM, Palmer Dabbelt, Zong Li,
	Paul Walmsley, Paul Mackerras, linux-riscv, linuxppc-dev
In-Reply-To: <418d5f3d3f42bbc79c5cf30e18ec89edfe2dbd26.camel@kernel.crashing.org>

On Fri, Jul 24, 2020 at 12:34 AM Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Thu, 2020-07-23 at 01:21 -0400, Alex Ghiti wrote:
> > > works fine with huge pages, what is your problem there ? You rely on
> > > punching small-page size holes in there ?
> > >
> >
> > ARCH_HAS_STRICT_KERNEL_RWX prevents the use of a hugepage for the kernel
> > mapping in the direct mapping as it sets different permissions to
> > different part of the kernel (data, text..etc).
>
> Ah ok, that can be solved in a couple of ways...
>
> One is to use the linker script to ensure those sections are linked
> HUGE_PAGE_SIZE appart and moved appropriately by early boot code. One
> is to selectively degrade just those huge pages.
>
> I'm not familiar with the RiscV MMU (I should probably go have a look)
> but if it's a classic radix tree with huge pages at PUD/PMD level, then
> you could just degrade the one(s) that cross those boundaries.

That would work, but if the system can otherwise use 1GB-sized pages,
that might mean degrading the first gigabyte into a mix of 2MB pages
and 4KB pages.

If the kernel is in vmalloc space and vmap is able to use 2MB pages
for contiguous chunks of the mapping, you get a somewhat better
TLB usage.

However, this also means that a writable mapping exists in the
linear mapping for any executable part of the kernel (.text in
both vmlinux and modules). Do we have that on other architectures
as well, or is this something that ought to be prevented with
STRICT_KERNEL_RWX/STRICT_MODULE_RWX?

     Arnd

^ permalink raw reply

* Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
From: Will Deacon @ 2020-07-24  8:16 UTC (permalink / raw)
  To: peterz
  Cc: linux-arch, Boqun Feng, linux-kernel, Nicholas Piggin,
	virtualization, Ingo Molnar, kvm-ppc, Waiman Long, linuxppc-dev
In-Reply-To: <20200723184759.GS119549@hirez.programming.kicks-ass.net>

On Thu, Jul 23, 2020 at 08:47:59PM +0200, peterz@infradead.org wrote:
> On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote:
> > BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch?
> > I will have to update the patch to fix the reported 0-day test problem, but
> > I want to collect other feedback before sending out v3.
> 
> I want to say I hate it all, it adds instructions to a path we spend an
> aweful lot of time optimizing without really getting anything back for
> it.
> 
> Will, how do you feel about it?

I can see it potentially being useful for debugging, but I hate the
limitation to 256 CPUs. Even arm64 is hitting that now.

Also, you're talking ~1% gains here. I think our collective time would
be better spent off reviewing the CNA series and trying to make it more
deterministic.

Will

^ permalink raw reply

* Re: [v3 12/15] powerpc/perf: Add support for outputting extended regs in perf intr_regs
From: Athira Rajeev @ 2020-07-24  8:25 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Gautham R Shenoy, Michael Neuling, maddy, kvm, kajoljain, kvm-ppc,
	svaidyan, Jiri Olsa, linuxppc-dev
In-Reply-To: <20200723145609.GA1943963@kernel.org>



> On 23-Jul-2020, at 8:26 PM, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> 
> Em Thu, Jul 23, 2020 at 11:14:16AM +0530, kajoljain escreveu:
>> 
>> 
>> On 7/21/20 11:32 AM, kajoljain wrote:
>>> 
>>> 
>>> On 7/17/20 8:08 PM, Athira Rajeev wrote:
>>>> From: Anju T Sudhakar <anju@linux.vnet.ibm.com>
>>>> 
>>>> Add support for perf extended register capability in powerpc.
>>>> The capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to indicate the
>>>> PMU which support extended registers. The generic code define the mask
>>>> of extended registers as 0 for non supported architectures.
>>>> 
>>>> Patch adds extended regs support for power9 platform by
>>>> exposing MMCR0, MMCR1 and MMCR2 registers.
>>>> 
>>>> REG_RESERVED mask needs update to include extended regs.
>>>> `PERF_REG_EXTENDED_MASK`, contains mask value of the supported registers,
>>>> is defined at runtime in the kernel based on platform since the supported
>>>> registers may differ from one processor version to another and hence the
>>>> MASK value.
>>>> 
>>>> with patch
>>>> ----------
>>>> 
>>>> available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
>>>> r12 r13 r14 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26
>>>> r27 r28 r29 r30 r31 nip msr orig_r3 ctr link xer ccr softe
>>>> trap dar dsisr sier mmcra mmcr0 mmcr1 mmcr2
>>>> 
>>>> PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0
>>>> ... intr regs: mask 0xffffffffffff ABI 64-bit
>>>> .... r0    0xc00000000012b77c
>>>> .... r1    0xc000003fe5e03930
>>>> .... r2    0xc000000001b0e000
>>>> .... r3    0xc000003fdcddf800
>>>> .... r4    0xc000003fc7880000
>>>> .... r5    0x9c422724be
>>>> .... r6    0xc000003fe5e03908
>>>> .... r7    0xffffff63bddc8706
>>>> .... r8    0x9e4
>>>> .... r9    0x0
>>>> .... r10   0x1
>>>> .... r11   0x0
>>>> .... r12   0xc0000000001299c0
>>>> .... r13   0xc000003ffffc4800
>>>> .... r14   0x0
>>>> .... r15   0x7fffdd8b8b00
>>>> .... r16   0x0
>>>> .... r17   0x7fffdd8be6b8
>>>> .... r18   0x7e7076607730
>>>> .... r19   0x2f
>>>> .... r20   0xc00000001fc26c68
>>>> .... r21   0xc0002041e4227e00
>>>> .... r22   0xc00000002018fb60
>>>> .... r23   0x1
>>>> .... r24   0xc000003ffec4d900
>>>> .... r25   0x80000000
>>>> .... r26   0x0
>>>> .... r27   0x1
>>>> .... r28   0x1
>>>> .... r29   0xc000000001be1260
>>>> .... r30   0x6008010
>>>> .... r31   0xc000003ffebb7218
>>>> .... nip   0xc00000000012b910
>>>> .... msr   0x9000000000009033
>>>> .... orig_r3 0xc00000000012b86c
>>>> .... ctr   0xc0000000001299c0
>>>> .... link  0xc00000000012b77c
>>>> .... xer   0x0
>>>> .... ccr   0x28002222
>>>> .... softe 0x1
>>>> .... trap  0xf00
>>>> .... dar   0x0
>>>> .... dsisr 0x80000000000
>>>> .... sier  0x0
>>>> .... mmcra 0x80000000000
>>>> .... mmcr0 0x82008090
>>>> .... mmcr1 0x1e000000
>>>> .... mmcr2 0x0
>>>> ... thread: perf:4784
>>>> 
>>>> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
>>>> [Defined PERF_REG_EXTENDED_MASK at run time to add support for different platforms ]
>>>> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
>>>> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
>>>> ---
>>> 
>>> Patch looks good to me.
>>> 
>>> Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
>> 
>> Hi Arnaldo and Jiri,
>> 	 Please let me know if you have any comments on these patches. Can you pull/ack these
>> patches if they seems fine to you.
> 
> Can you please clarify something here, I think I saw a kernel build bot
> complaint followed by a fix, in these cases I think, for reviewer's
> sake, that this would entail a v4 patchkit? One that has no such build
> issues?
> 
> Or have I got something wrong?

Hi Arnaldo,

yes you are right, I will send version 4 as a new series with changes to add support for extended regs and including fix for the build issue.
Thanks for your response.

Athira 

> 
> - Arnaldo


^ permalink raw reply

* Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper
From: Jarkko Sakkinen @ 2020-07-24  7:36 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Catalin Marinas, Kefeng Wang, Paul Mackerras, Zong Li, Andi Kleen,
	Paul Burton, Vincent Whitchurch, Petr Mladek, Brian Gerst,
	Andy Lutomirski, Thomas Gleixner, Jiri Kosina, Anup Patel,
	Linux Kernel Mailing List, Philipp Rudo, Torsten Duwe,
	Masami Hiramatsu, Andrew Morton, Mark Rutland,
	James E.J. Bottomley, Vincent Chen, Omar Sandoval, open list:S390,
	Joe Lawrence, Helge Deller, John Fastabend, Anil S Keshavamurthy,
	Yonghong Song, Iurii Zaikin, Andrii Nakryiko, Thomas Huth,
	Vasily Gorbik, moderated list:ARM PORT, Daniel Axtens,
	Damien Le Moal, Martin KaFai Lau, Song Liu, Josh Poimboeuf,
	Heiko Carstens, Alexei Starovoitov, Atish Patra, Will Deacon,
	Daniel Borkmann, Masahiro Yamada, Nayna Jain, Ley Foon Tan,
	Christian Borntraeger, Sami Tolvanen, Naveen N. Rao, Mao Han,
	Marco Elver, Steven Rostedt, Babu Moger, Borislav Petkov,
	Greentime Hu, Ben Dooks, Guan Xuetao, Thomas Bogendoerfer,
	open list:PARISC ARCHITECTURE, Jessica Yu,
	open list:BPF JIT for MIPS 32-BIT AND 64-BIT, David S. Miller,
	Thiago Jung Bauermann, Peter Zijlstra, David Howells,
	Amit Daniel Kachhap, Sandipan Das, H. Peter Anvin,
	open list:SPARC + UltraSPARC sparc/sparc64,
	open list:RISC-V ARCHITECTURE, Miroslav Benes, Jiri Olsa,
	Tiezhu Yang, Vincenzo Frascino, Anders Roxell, Sven Schnelle,
	maintainer:X86 ARCHITECTURE 32-BIT AND 64-BIT, Russell King,
	Mike Rapoport, Ingo Molnar, Albert Ou, Paul E. McKenney,
	Paul Walmsley, KP Singh, Dmitry Vyukov, Nick Hu,
	open list:BPF JIT for MIPS 32-BIT AND 64-BIT, open list:MIPS,
	Palmer Dabbelt, open list:LINUX FOR POWERPC 32-BIT AND 64-BIT
In-Reply-To: <CAMj1kXGJhqC+asc6JUNeEkRsHYTzNQVe4-65vKqigbW03gO9Jg@mail.gmail.com>

On Thu, Jul 23, 2020 at 03:42:09PM +0300, Ard Biesheuvel wrote:
> On Thu, 23 Jul 2020 at 04:52, Jarkko Sakkinen
> <jarkko.sakkinen@linux.intel.com> wrote:
> >
> > On Thu, Jul 16, 2020 at 06:49:09PM +0200, Christophe Leroy wrote:
> > > Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> a écrit :
> > >
> > > > Rename module_alloc() to text_alloc() and module_memfree() to
> > > > text_memfree(), and move them to kernel/text.c, which is unconditionally
> > > > compiled to the kernel proper. This allows kprobes, ftrace and bpf to
> > > > allocate space for executable code without requiring to compile the modules
> > > > support (CONFIG_MODULES=y) in.
> > >
> > > You are not changing enough in powerpc to have this work.
> > > On powerpc 32 bits (6xx), when STRICT_KERNEL_RWX is selected, the vmalloc
> > > space is set to NX (no exec) at segment level (ie by 256Mbytes zone) unless
> > > CONFIG_MODULES is selected.
> > >
> > > Christophe
> >
> > This has been deduced down to:
> >
> > https://lore.kernel.org/lkml/20200717030422.679972-1-jarkko.sakkinen@linux.intel.com/
> >
> > I.e. not intruding PPC anymore :-)
> >
> 
> Ok, so after the elaborate discussion we had between Jessica, Russell,
> Peter, Will, Mark, you and myself, where we pointed out that
> a) a single text_alloc() abstraction for bpf, kprobes and ftrace does
> not fit other architectures very well, and
> b) that module_alloc() is not suitable as a default to base text_alloc() on,

In the latest iteration (v5) it is conditionally available only if arch
defines and fallback has been removed.

> you went ahead and implemented that anyway, but only cc'ing Peter,
> akpm, Masami and the mm list this time?

No problems with that. Actually each patch gets everything that
get_maintainer.pl gives with a cc cmd script, not just the ones
explicitly listed in the patch.

Should I explicitly CC you to the next version? I'm happy to grow
the list when requested.

> Sorry, but that is not how it works. Once people get pulled into a
> discussion, you cannot dismiss them or their feedback like that and go
> off and do your own thing anyway. Generic features like this are
> tricky to get right, and it will likely take many iterations and input
> from many different people.

Sure. I'm not expecting this move quickly.

I don't think I've at least purposely done that. As you said it's tricky
to get this right.

/Jarkko

^ permalink raw reply

* Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper
From: Jessica Yu @ 2020-07-24  8:05 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Catalin Marinas, Kefeng Wang, Paul Mackerras, Zong Li, Andi Kleen,
	Paul Burton, Vincent Whitchurch, Petr Mladek, Brian Gerst,
	Andy Lutomirski, Thomas Gleixner, Jiri Kosina, Anup Patel,
	Linux Kernel Mailing List, Philipp Rudo, Torsten Duwe,
	Masami Hiramatsu, Andrew Morton, Mark Rutland,
	James E.J. Bottomley, Vincent Chen, Omar Sandoval, open list:S390,
	Joe Lawrence, Helge Deller, John Fastabend, Anil S Keshavamurthy,
	Yonghong Song, Iurii Zaikin, Andrii Nakryiko, Thomas Huth,
	Vasily Gorbik, moderated list:ARM PORT, Daniel Axtens,
	Damien Le Moal, Martin KaFai Lau, Song Liu, Josh Poimboeuf,
	Heiko Carstens, Alexei Starovoitov, Atish Patra, Will Deacon,
	Daniel Borkmann, Masahiro Yamada, Nayna Jain, Ley Foon Tan,
	Christian Borntraeger, Sami Tolvanen, Naveen N. Rao, Mao Han,
	Marco Elver, Steven Rostedt, Babu Moger, Borislav Petkov,
	Greentime Hu, Ben Dooks, Guan Xuetao, Tiezhu Yang,
	Thomas Bogendoerfer, open list:PARISC ARCHITECTURE,
	open list:BPF JIT for MIPS 32-BIT AND 64-BIT, David S. Miller,
	Thiago Jung Bauermann, Peter Zijlstra, David Howells,
	Amit Daniel Kachhap, Sandipan Das, H. Peter Anvin,
	open list:SPARC + UltraSPARC sparc/sparc64,
	open list:RISC-V ARCHITECTURE, Miroslav Benes, Jiri Olsa,
	Ard Biesheuvel, Vincenzo Frascino, Anders Roxell, Sven Schnelle,
	maintainer:X86 ARCHITECTURE 32-BIT AND 64-BIT, Russell King,
	Mike Rapoport, Ingo Molnar, Albert Ou, Paul E. McKenney,
	Paul Walmsley, KP Singh, Dmitry Vyukov, Nick Hu,
	open list:BPF JIT for MIPS 32-BIT AND 64-BIT, open list:MIPS,
	Palmer Dabbelt, open list:LINUX FOR POWERPC 32-BIT AND 64-BIT
In-Reply-To: <20200724073621.GC1872662@linux.intel.com>

+++ Jarkko Sakkinen [24/07/20 10:36 +0300]:
>On Thu, Jul 23, 2020 at 03:42:09PM +0300, Ard Biesheuvel wrote:
>> On Thu, 23 Jul 2020 at 04:52, Jarkko Sakkinen
>> <jarkko.sakkinen@linux.intel.com> wrote:
>> >
>> > On Thu, Jul 16, 2020 at 06:49:09PM +0200, Christophe Leroy wrote:
>> > > Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> a écrit :
>> > >
>> > > > Rename module_alloc() to text_alloc() and module_memfree() to
>> > > > text_memfree(), and move them to kernel/text.c, which is unconditionally
>> > > > compiled to the kernel proper. This allows kprobes, ftrace and bpf to
>> > > > allocate space for executable code without requiring to compile the modules
>> > > > support (CONFIG_MODULES=y) in.
>> > >
>> > > You are not changing enough in powerpc to have this work.
>> > > On powerpc 32 bits (6xx), when STRICT_KERNEL_RWX is selected, the vmalloc
>> > > space is set to NX (no exec) at segment level (ie by 256Mbytes zone) unless
>> > > CONFIG_MODULES is selected.
>> > >
>> > > Christophe
>> >
>> > This has been deduced down to:
>> >
>> > https://lore.kernel.org/lkml/20200717030422.679972-1-jarkko.sakkinen@linux.intel.com/
>> >
>> > I.e. not intruding PPC anymore :-)
>> >
>>
>> Ok, so after the elaborate discussion we had between Jessica, Russell,
>> Peter, Will, Mark, you and myself, where we pointed out that
>> a) a single text_alloc() abstraction for bpf, kprobes and ftrace does
>> not fit other architectures very well, and
>> b) that module_alloc() is not suitable as a default to base text_alloc() on,
>
>In the latest iteration (v5) it is conditionally available only if arch
>defines and fallback has been removed.
>
>> you went ahead and implemented that anyway, but only cc'ing Peter,
>> akpm, Masami and the mm list this time?
>
>No problems with that. Actually each patch gets everything that
>get_maintainer.pl gives with a cc cmd script, not just the ones
>explicitly listed in the patch.
>
>Should I explicitly CC you to the next version? I'm happy to grow
>the list when requested.

Yes, please CC everybody that was part of the discussion last time
especially during v2, and please use a consistent CC list for the
whole patchset. It is difficult to review when you only receive patch
1 out of 6 with no mention of text_alloc() anywhere and without being
CC'd on the cover letter.

Jessica

^ permalink raw reply

* [PATCH] KVM: PPC: Book3S HV: rework secure mem slot dropping
From: Laurent Dufour @ 2020-07-24  8:35 UTC (permalink / raw)
  To: bharata, linuxram
  Cc: cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev, linuxppc-dev,
	bauerman, david
In-Reply-To: <20200724030337.GC1082478@in.ibm.com>

When a secure memslot is dropped, all the pages backed in the secure
device (aka really backed by secure memory by the Ultravisor)
should be paged out to a normal page. Previously, this was
achieved by triggering the page fault mechanism which is calling
kvmppc_svm_page_out() on each pages.

This can't work when hot unplugging a memory slot because the memory
slot is flagged as invalid and gfn_to_pfn() is then not trying to access
the page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out() it seems
simpler to call directly instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call to __kvmppc_svm_page_out() is made.  As
__kvmppc_svm_page_out needs the vma pointer to migrate the pages,
the VMA is fetched in a lazy way, to not trigger find_vma() all
the time. In addition, the mmap_sem is held in read mode during
that time, not in write mode since the virual memory layout is not
impacted, and kvm->arch.uvmem_lock prevents concurrent operation
on the secure device.

Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
	[modified the changelog description]
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
        [modified check on the VMA in kvmppc_uvmem_drop_pages]
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 53 ++++++++++++++++++++----------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index c772e921f769..5dd3e9acdcab 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -632,35 +632,54 @@ static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
  * fault on them, do fault time migration to replace the device PTEs in
  * QEMU page table with normal PTEs from newly allocated pages.
  */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 			     struct kvm *kvm, bool skip_page_out)
 {
 	int i;
 	struct kvmppc_uvmem_page_pvt *pvt;
-	unsigned long pfn, uvmem_pfn;
-	unsigned long gfn = free->base_gfn;
+	struct page *uvmem_page;
+	struct vm_area_struct *vma = NULL;
+	unsigned long uvmem_pfn, gfn;
+	unsigned long addr, end;
+
+	mmap_read_lock(kvm->mm);
+
+	addr = slot->userspace_addr;
+	end = addr + (slot->npages * PAGE_SIZE);
 
-	for (i = free->npages; i; --i, ++gfn) {
-		struct page *uvmem_page;
+	gfn = slot->base_gfn;
+	for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+		/* Fetch the VMA if addr is not in the latest fetched one */
+		if (!vma || addr >= vma->vm_end) {
+			vma = find_vma_intersection(kvm->mm, addr, addr+1);
+			if (!vma) {
+				pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+				break;
+			}
+		}
 
 		mutex_lock(&kvm->arch.uvmem_lock);
-		if (!kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+
+		if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+			uvmem_page = pfn_to_page(uvmem_pfn);
+			pvt = uvmem_page->zone_device_data;
+			pvt->skip_page_out = skip_page_out;
+			pvt->remove_gfn = true;
+
+			if (__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
+						  PAGE_SHIFT, kvm, pvt->gpa))
+				pr_err("Can't page out gpa:0x%lx addr:0x%lx\n",
+				       pvt->gpa, addr);
+		} else {
+			/* Remove the shared flag if any */
 			kvmppc_gfn_remove(gfn, kvm);
-			mutex_unlock(&kvm->arch.uvmem_lock);
-			continue;
 		}
 
-		uvmem_page = pfn_to_page(uvmem_pfn);
-		pvt = uvmem_page->zone_device_data;
-		pvt->skip_page_out = skip_page_out;
-		pvt->remove_gfn = true;
 		mutex_unlock(&kvm->arch.uvmem_lock);
-
-		pfn = gfn_to_pfn(kvm, gfn);
-		if (is_error_noslot_pfn(pfn))
-			continue;
-		kvm_release_pfn_clean(pfn);
 	}
+
+	mmap_read_unlock(kvm->mm);
 }
 
 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
-- 
2.27.0


^ permalink raw reply related

* Re: [PATCH 2/5] powerpc: Allow 4096 bytes of stack expansion for the signal frame
From: Michael Ellerman @ 2020-07-24  9:20 UTC (permalink / raw)
  To: Daniel Axtens, linuxppc-dev; +Cc: linux-arch, hughd, linux-kernel
In-Reply-To: <87blk6tkuv.fsf@dja-thinkpad.axtens.net>

Daniel Axtens <dja@axtens.net> writes:
> Hi Michael,
>
> Unfortunately, this patch doesn't completely solve the problem.
>
> Trying the original reproducer, I'm still able to trigger the crash even
> with this patch, although not 100% of the time. (If I turn ASLR off
> outside of tmux it reliably crashes, if I turn ASLR off _inside_ of tmux
> it reliably succeeds; all of this is on a serial console.)
>
> ./foo 1241000 & sleep 1; killall -USR1 foo; echo ok
>
> If I add some debugging information, I see that I'm getting
> address + 4096 = 7fffffed0fa0
> gpr1 =           7fffffed1020
>
> So address + 4096 is 0x80 bytes below the 4k window. I haven't been able
> to figure out why, gdb gives me a NIP in __kernel_sigtramp_rt64 but I
> don't know what to make of that.

Thanks for testing.

I looked at it again this morning and it's fairly obvious when it's not
11pm :)

We need space for struct rt_sigframe as well as another 128 bytes,
which is __SIGNAL_FRAMESIZE. It's actually mentioned in the comment
above struct rt_sigframe.

I'll send a v2.

> P.S. I don't know what your policy on linking to kernel bugzilla is, but
> if you want:
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=205183

In general I prefer to keep things clean with just a single Link: tag
pointing to the archive of the patch submission.

That can then contain further links and other info, and has the
advantage that people can reply to the patch submission in the future to
add information to the thread that wasn't known at the time of the
commit.

cheers

^ permalink raw reply

* [PATCH v2 1/5] selftests/powerpc: Add test of stack expansion logic
From: Michael Ellerman @ 2020-07-24  9:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-kernel, dja

We have custom stack expansion checks that it turns out are extremely
badly tested and contain bugs, surprise. So add some tests that
exercise the code and capture the current boundary conditions.

The signal test currently fails on 64-bit kernels because the 2048
byte allowance for the signal frame is too small, we will fix that in
a subsequent patch.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---

v2:
 - Concentrate on used stack around the 1MB size, as that's where our
   custom logic kicks in.
 - Increment the used stack size by 64 so we can exercise the case
   where we overflow the page by less than 128 (__SIGNAL_FRAMESIZE).

---
 tools/testing/selftests/powerpc/mm/.gitignore |   2 +
 tools/testing/selftests/powerpc/mm/Makefile   |   9 +-
 .../powerpc/mm/stack_expansion_ldst.c         | 233 ++++++++++++++++++
 .../powerpc/mm/stack_expansion_signal.c       | 118 +++++++++
 tools/testing/selftests/powerpc/pmu/lib.h     |   1 +
 5 files changed, 362 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
 create mode 100644 tools/testing/selftests/powerpc/mm/stack_expansion_signal.c

diff --git a/tools/testing/selftests/powerpc/mm/.gitignore b/tools/testing/selftests/powerpc/mm/.gitignore
index 8d041f508a51..52308f42b7de 100644
--- a/tools/testing/selftests/powerpc/mm/.gitignore
+++ b/tools/testing/selftests/powerpc/mm/.gitignore
@@ -8,3 +8,5 @@ large_vm_fork_separation
 bad_accesses
 tlbie_test
 pkey_exec_prot
+stack_expansion_ldst
+stack_expansion_signal
diff --git a/tools/testing/selftests/powerpc/mm/Makefile b/tools/testing/selftests/powerpc/mm/Makefile
index 5a86d59441dc..6cd772e0e374 100644
--- a/tools/testing/selftests/powerpc/mm/Makefile
+++ b/tools/testing/selftests/powerpc/mm/Makefile
@@ -3,7 +3,9 @@
 	$(MAKE) -C ../
 
 TEST_GEN_PROGS := hugetlb_vs_thp_test subpage_prot segv_errors wild_bctr \
-		  large_vm_fork_separation bad_accesses pkey_exec_prot
+		  large_vm_fork_separation bad_accesses pkey_exec_prot stack_expansion_signal \
+		  stack_expansion_ldst
+
 TEST_GEN_PROGS_EXTENDED := tlbie_test
 TEST_GEN_FILES := tempfile
 
@@ -17,6 +19,11 @@ $(OUTPUT)/large_vm_fork_separation: CFLAGS += -m64
 $(OUTPUT)/bad_accesses: CFLAGS += -m64
 $(OUTPUT)/pkey_exec_prot: CFLAGS += -m64
 
+$(OUTPUT)/stack_expansion_signal: ../utils.c ../pmu/lib.c
+
+$(OUTPUT)/stack_expansion_ldst: CFLAGS += -fno-stack-protector
+$(OUTPUT)/stack_expansion_ldst: ../utils.c
+
 $(OUTPUT)/tempfile:
 	dd if=/dev/zero of=$@ bs=64k count=1
 
diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
new file mode 100644
index 000000000000..0587e11437f5
--- /dev/null
+++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
@@ -0,0 +1,233 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test that loads/stores expand the stack segment, or trigger a SEGV, in
+ * various conditions.
+ *
+ * Based on test code by Tom Lane.
+ */
+
+#undef NDEBUG
+#include <assert.h>
+
+#include <err.h>
+#include <errno.h>
+#include <stdio.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/resource.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#define _KB (1024)
+#define _MB (1024 * 1024)
+
+volatile char *stack_top_ptr;
+volatile unsigned long stack_top_sp;
+volatile char c;
+
+enum access_type {
+	LOAD,
+	STORE,
+};
+
+/*
+ * Consume stack until the stack pointer is below @target_sp, then do an access
+ * (load or store) at offset @delta from either the base of the stack or the
+ * current stack pointer.
+ */
+__attribute__ ((noinline))
+int consume_stack(unsigned long target_sp, unsigned long stack_high, int delta, enum access_type type)
+{
+	unsigned long target;
+	char stack_cur;
+
+	if ((unsigned long)&stack_cur > target_sp)
+		return consume_stack(target_sp, stack_high, delta, type);
+	else {
+		// We don't really need this, but without it GCC might not
+		// generate a recursive call above.
+		stack_top_ptr = &stack_cur;
+
+#ifdef __powerpc__
+		asm volatile ("mr %[sp], %%r1" : [sp] "=r" (stack_top_sp));
+#else
+		asm volatile ("mov %%rsp, %[sp]" : [sp] "=r" (stack_top_sp));
+#endif
+
+		// Kludge, delta < 0 indicates relative to SP
+		if (delta < 0)
+			target = stack_top_sp + delta;
+		else
+			target = stack_high - delta + 1;
+
+		volatile char *p = (char *)target;
+
+		if (type == STORE)
+			*p = c;
+		else
+			c = *p;
+
+		// Do something to prevent the stack frame being popped prior to
+		// our access above.
+		getpid();
+	}
+
+	return 0;
+}
+
+static int search_proc_maps(char *needle, unsigned long *low, unsigned long *high)
+{
+	unsigned long start, end;
+	static char buf[4096];
+	char name[128];
+	FILE *f;
+	int rc;
+
+	f = fopen("/proc/self/maps", "r");
+	if (!f) {
+		perror("fopen");
+		return -1;
+	}
+
+	while (fgets(buf, sizeof(buf), f)) {
+		rc = sscanf(buf, "%lx-%lx %*c%*c%*c%*c %*x %*d:%*d %*d %127s\n",
+			    &start, &end, name);
+		if (rc == 2)
+			continue;
+
+		if (rc != 3) {
+			printf("sscanf errored\n");
+			rc = -1;
+			break;
+		}
+
+		if (strstr(name, needle)) {
+			*low = start;
+			*high = end - 1;
+			rc = 0;
+			break;
+		}
+	}
+
+	fclose(f);
+
+	return rc;
+}
+
+int child(unsigned int stack_used, int delta, enum access_type type)
+{
+	unsigned long low, stack_high;
+
+	assert(search_proc_maps("[stack]", &low, &stack_high) == 0);
+
+	assert(consume_stack(stack_high - stack_used, stack_high, delta, type) == 0);
+
+	printf("Access OK: %s delta %-7d used size 0x%06x stack high 0x%lx top_ptr %p top sp 0x%lx actual used 0x%lx\n",
+	       type == LOAD ? "load" : "store", delta, stack_used, stack_high,
+	       stack_top_ptr, stack_top_sp, stack_high - stack_top_sp + 1);
+
+	return 0;
+}
+
+static int test_one(unsigned int stack_used, int delta, enum access_type type)
+{
+	pid_t pid;
+	int rc;
+
+	pid = fork();
+	if (pid == 0)
+		exit(child(stack_used, delta, type));
+
+	assert(waitpid(pid, &rc, 0) != -1);
+
+	if (WIFEXITED(rc) && WEXITSTATUS(rc) == 0)
+		return 0;
+
+	// We don't expect a non-zero exit that's not a signal
+	assert(!WIFEXITED(rc));
+
+	printf("Faulted:   %s delta %-7d used size 0x%06x signal %d\n",
+	       type == LOAD ? "load" : "store", delta, stack_used,
+	       WTERMSIG(rc));
+
+	return 1;
+}
+
+// This is fairly arbitrary but is well below any of the targets below,
+// so that the delta between the stack pointer and the target is large.
+#define DEFAULT_SIZE	(32 * _KB)
+
+static void test_one_type(enum access_type type, unsigned long page_size, unsigned long rlim_cur)
+{
+	assert(test_one(DEFAULT_SIZE, 512 * _KB, type) == 0);
+
+	// powerpc has a special case to allow up to 1MB
+	assert(test_one(DEFAULT_SIZE, 1 * _MB, type) == 0);
+
+#ifdef __powerpc__
+	// This fails on powerpc because it's > 1MB and is not a stdu &
+	// not close to r1
+	assert(test_one(DEFAULT_SIZE, 1 * _MB + 8, type) != 0);
+#else
+	assert(test_one(DEFAULT_SIZE, 1 * _MB + 8, type) == 0);
+#endif
+
+#ifdef __powerpc__
+	// Accessing way past the stack pointer is not allowed on powerpc
+	assert(test_one(DEFAULT_SIZE, rlim_cur, type) != 0);
+#else
+	// We should be able to access anywhere within the rlimit
+	assert(test_one(DEFAULT_SIZE, rlim_cur, type) == 0);
+#endif
+
+	// But if we go past the rlimit it should fail
+	assert(test_one(DEFAULT_SIZE, rlim_cur + 1, type) != 0);
+
+	// Above 1MB powerpc only allows accesses within 2048 bytes of
+	// r1 for accesses that aren't stdu
+	assert(test_one(1 * _MB + page_size - 128, -2048, type) == 0);
+#ifdef __powerpc__
+	assert(test_one(1 * _MB + page_size - 128, -2049, type) != 0);
+#else
+	assert(test_one(1 * _MB + page_size - 128, -2049, type) == 0);
+#endif
+
+	// By consuming 2MB of stack we test the stdu case
+	assert(test_one(2 * _MB + page_size - 128, -2048, type) == 0);
+}
+
+static int test(void)
+{
+	unsigned long page_size;
+	struct rlimit rlimit;
+
+	page_size = getpagesize();
+	getrlimit(RLIMIT_STACK, &rlimit);
+	printf("Stack rlimit is 0x%lx\n", rlimit.rlim_cur);
+
+	printf("Testing loads ...\n");
+	test_one_type(LOAD, page_size, rlimit.rlim_cur);
+	printf("Testing stores ...\n");
+	test_one_type(STORE, page_size, rlimit.rlim_cur);
+
+	printf("All OK\n");
+
+	return 0;
+}
+
+#ifdef __powerpc__
+#include "utils.h"
+
+int main(void)
+{
+	return test_harness(test, "stack_expansion_ldst");
+}
+#else
+int main(void)
+{
+	return test();
+}
+#endif
diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_signal.c b/tools/testing/selftests/powerpc/mm/stack_expansion_signal.c
new file mode 100644
index 000000000000..c8b32a29e274
--- /dev/null
+++ b/tools/testing/selftests/powerpc/mm/stack_expansion_signal.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test that signal delivery is able to expand the stack segment without
+ * triggering a SEGV.
+ *
+ * Based on test code by Tom Lane.
+ */
+
+#include <err.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "../pmu/lib.h"
+#include "utils.h"
+
+#define _KB (1024)
+#define _MB (1024 * 1024)
+
+static char *stack_base_ptr;
+static char *stack_top_ptr;
+
+static volatile sig_atomic_t sig_occurred = 0;
+
+static void sigusr1_handler(int signal_arg)
+{
+	sig_occurred = 1;
+}
+
+static int consume_stack(unsigned int stack_size, union pipe write_pipe)
+{
+	char stack_cur;
+
+	if ((stack_base_ptr - &stack_cur) < stack_size)
+		return consume_stack(stack_size, write_pipe);
+	else {
+		stack_top_ptr = &stack_cur;
+
+		FAIL_IF(notify_parent(write_pipe));
+
+		while (!sig_occurred)
+			barrier();
+	}
+
+	return 0;
+}
+
+static int child(unsigned int stack_size, union pipe write_pipe)
+{
+	struct sigaction act;
+	char stack_base;
+
+	act.sa_handler = sigusr1_handler;
+	sigemptyset(&act.sa_mask);
+	act.sa_flags = 0;
+	if (sigaction(SIGUSR1, &act, NULL) < 0)
+		err(1, "sigaction");
+
+	stack_base_ptr = (char *) (((size_t) &stack_base + 65535) & ~65535UL);
+
+	FAIL_IF(consume_stack(stack_size, write_pipe));
+
+	printf("size 0x%06x: OK, stack base %p top %p (%zx used)\n",
+		stack_size, stack_base_ptr, stack_top_ptr,
+		stack_base_ptr - stack_top_ptr);
+
+	return 0;
+}
+
+static int test_one_size(unsigned int stack_size)
+{
+	union pipe read_pipe, write_pipe;
+	pid_t pid;
+
+	FAIL_IF(pipe(read_pipe.fds) == -1);
+	FAIL_IF(pipe(write_pipe.fds) == -1);
+
+	pid = fork();
+	if (pid == 0) {
+		close(read_pipe.read_fd);
+		close(write_pipe.write_fd);
+		exit(child(stack_size, read_pipe));
+	}
+
+	close(read_pipe.write_fd);
+	close(write_pipe.read_fd);
+	FAIL_IF(sync_with_child(read_pipe, write_pipe));
+
+	kill(pid, SIGUSR1);
+
+	FAIL_IF(wait_for_child(pid));
+
+	close(read_pipe.read_fd);
+	close(write_pipe.write_fd);
+
+	return 0;
+}
+
+int test(void)
+{
+	unsigned int i, size;
+
+	// Test with used stack from 1MB - 64K to 1MB + 64K
+	// Increment by 64 to get more coverage of odd sizes
+	for (i = 0; i < (128 * _KB); i += 64) {
+		size = i + (1 * _MB) - (64 * _KB);
+		FAIL_IF(test_one_size(size));
+	}
+
+	return 0;
+}
+
+int main(void)
+{
+	return test_harness(test, "stack_expansion_signal");
+}
diff --git a/tools/testing/selftests/powerpc/pmu/lib.h b/tools/testing/selftests/powerpc/pmu/lib.h
index fa12e7d0b4d3..bf1bec013bbb 100644
--- a/tools/testing/selftests/powerpc/pmu/lib.h
+++ b/tools/testing/selftests/powerpc/pmu/lib.h
@@ -6,6 +6,7 @@
 #ifndef __SELFTESTS_POWERPC_PMU_LIB_H
 #define __SELFTESTS_POWERPC_PMU_LIB_H
 
+#include <stdbool.h>
 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>
-- 
2.25.1


^ permalink raw reply related

* [PATCH v2 2/5] powerpc: Allow 4224 bytes of stack expansion for the signal frame
From: Michael Ellerman @ 2020-07-24  9:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-kernel, dja
In-Reply-To: <20200724092528.1578671-1-mpe@ellerman.id.au>

We have powerpc specific logic in our page fault handling to decide if
an access to an unmapped address below the stack pointer should expand
the stack VMA.

The code was originally added in 2004 "ported from 2.4". The rough
logic is that the stack is allowed to grow to 1MB with no extra
checking. Over 1MB the access must be within 2048 bytes of the stack
pointer, or be from a user instruction that updates the stack pointer.

The 2048 byte allowance below the stack pointer is there to cover the
288 byte "red zone" as well as the "about 1.5kB" needed by the signal
delivery code.

Unfortunately since then the signal frame has expanded, and is now
4224 bytes on 64-bit kernels with transactional memory enabled. This
means if a process has consumed more than 1MB of stack, and its stack
pointer lies less than 4224 bytes from the next page boundary, signal
delivery will fault when trying to expand the stack and the process
will see a SEGV.

The total size of the signal frame is the size of struct rt_sigframe
(which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on
64-bit).

The 2048 byte allowance was correct until 2008 as the signal frame
was:

struct rt_sigframe {
        struct ucontext    uc;                           /*     0  1440 */
        /* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
        long unsigned int          _unused[2];           /*  1440    16 */
        unsigned int               tramp[6];             /*  1456    24 */
        struct siginfo *           pinfo;                /*  1480     8 */
        void *                     puc;                  /*  1488     8 */
        struct siginfo     info;                         /*  1496   128 */
        /* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
        char                       abigap[288];          /*  1624   288 */

        /* size: 1920, cachelines: 15, members: 7 */
        /* padding: 8 */
};

1920 + 128 = 2048

Then in commit ce48b2100785 ("powerpc: Add VSX context save/restore,
ptrace and signal support") (Jul 2008) the signal frame expanded to
2304 bytes:

struct rt_sigframe {
        struct ucontext    uc;                           /*     0  1696 */	<--
        /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
        long unsigned int          _unused[2];           /*  1696    16 */
        unsigned int               tramp[6];             /*  1712    24 */
        struct siginfo *           pinfo;                /*  1736     8 */
        void *                     puc;                  /*  1744     8 */
        struct siginfo     info;                         /*  1752   128 */
        /* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
        char                       abigap[288];          /*  1880   288 */

        /* size: 2176, cachelines: 17, members: 7 */
        /* padding: 8 */
};

2176 + 128 = 2304

At this point we should have been exposed to the bug, though as far as
I know it was never reported. I no longer have a system old enough to
easily test on.

Then in 2010 commit 320b2b8de126 ("mm: keep a guard page below a
grow-down stack segment") caused our stack expansion code to never
trigger, as there was always a VMA found for a write up to PAGE_SIZE
below r1.

That meant the bug was hidden as we continued to expand the signal
frame in commit 2b0a576d15e0 ("powerpc: Add new transactional memory
state to the signal context") (Feb 2013):

struct rt_sigframe {
        struct ucontext    uc;                           /*     0  1696 */
        /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
        struct ucontext    uc_transact;                  /*  1696  1696 */	<--
        /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
        long unsigned int          _unused[2];           /*  3392    16 */
        unsigned int               tramp[6];             /*  3408    24 */
        struct siginfo *           pinfo;                /*  3432     8 */
        void *                     puc;                  /*  3440     8 */
        struct siginfo     info;                         /*  3448   128 */
        /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
        char                       abigap[288];          /*  3576   288 */

        /* size: 3872, cachelines: 31, members: 8 */
        /* padding: 8 */
        /* last cacheline: 32 bytes */
};

3872 + 128 = 4000

And commit 573ebfa6601f ("powerpc: Increase stack redzone for 64-bit
userspace to 512 bytes") (Feb 2014):

struct rt_sigframe {
        struct ucontext    uc;                           /*     0  1696 */
        /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
        struct ucontext    uc_transact;                  /*  1696  1696 */
        /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
        long unsigned int          _unused[2];           /*  3392    16 */
        unsigned int               tramp[6];             /*  3408    24 */
        struct siginfo *           pinfo;                /*  3432     8 */
        void *                     puc;                  /*  3440     8 */
        struct siginfo     info;                         /*  3448   128 */
        /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
        char                       abigap[512];          /*  3576   512 */	<--

        /* size: 4096, cachelines: 32, members: 8 */
        /* padding: 8 */
};

4096 + 128 = 4224

Then finally in 2017, commit 1be7107fbe18 ("mm: larger stack guard
gap, between vmas") exposed us to the existing bug, because it changed
the stack VMA to be the correct/real size, meaning our stack expansion
code is now triggered.

Fix it by increasing the allowance to 4224 bytes.

Hard-coding 4224 is obviously unsafe against future expansions of the
signal frame in the same way as the existing code. We can't easily use
sizeof() because the signal frame structure is not in a header. We
will either fix that, or rip out all the custom stack expansion
checking logic entirely.

Fixes: ce48b2100785 ("powerpc: Add VSX context save/restore, ptrace and signal support")
Cc: stable@vger.kernel.org # v2.6.27+
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---

v2: Account for the extra 128 bytes of __SIGNAL_FRAMESIZE, making the
    total size 4224, as noticed by dja.

See also https://bugzilla.kernel.org/show_bug.cgi?id=205183
---
 arch/powerpc/mm/fault.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 641fc5f3d7dd..3ebb1792e636 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -267,6 +267,9 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
 	return false;
 }
 
+// This comes from 64-bit struct rt_sigframe + __SIGNAL_FRAMESIZE
+#define SIGFRAME_MAX_SIZE	(4096 + 128)
+
 static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address,
 				struct vm_area_struct *vma, unsigned int flags,
 				bool *must_retry)
@@ -274,7 +277,7 @@ static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address,
 	/*
 	 * N.B. The POWER/Open ABI allows programs to access up to
 	 * 288 bytes below the stack pointer.
-	 * The kernel signal delivery code writes up to about 1.5kB
+	 * The kernel signal delivery code writes a bit over 4KB
 	 * below the stack pointer (r1) before decrementing it.
 	 * The exec code can write slightly over 640kB to the stack
 	 * before setting the user r1.  Thus we allow the stack to
@@ -299,7 +302,7 @@ static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address,
 		 * between the last mapped region and the stack will
 		 * expand the stack rather than segfaulting.
 		 */
-		if (address + 2048 >= uregs->gpr[1])
+		if (address + SIGFRAME_MAX_SIZE >= uregs->gpr[1])
 			return false;
 
 		if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) &&
-- 
2.25.1


^ permalink raw reply related

* [PATCH v2 3/5] selftests/powerpc: Update the stack expansion test
From: Michael Ellerman @ 2020-07-24  9:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-kernel, dja
In-Reply-To: <20200724092528.1578671-1-mpe@ellerman.id.au>

Update the stack expansion load/store test to take into account the
new allowance of 4224 bytes below the stack pointer.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 .../selftests/powerpc/mm/stack_expansion_ldst.c        | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

v2: Update for change of size to 4224.

diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
index 0587e11437f5..8dbfb51acf0f 100644
--- a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
+++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
@@ -186,17 +186,17 @@ static void test_one_type(enum access_type type, unsigned long page_size, unsign
 	// But if we go past the rlimit it should fail
 	assert(test_one(DEFAULT_SIZE, rlim_cur + 1, type) != 0);
 
-	// Above 1MB powerpc only allows accesses within 2048 bytes of
+	// Above 1MB powerpc only allows accesses within 4224 bytes of
 	// r1 for accesses that aren't stdu
-	assert(test_one(1 * _MB + page_size - 128, -2048, type) == 0);
+	assert(test_one(1 * _MB + page_size - 128, -4224, type) == 0);
 #ifdef __powerpc__
-	assert(test_one(1 * _MB + page_size - 128, -2049, type) != 0);
+	assert(test_one(1 * _MB + page_size - 128, -4225, type) != 0);
 #else
-	assert(test_one(1 * _MB + page_size - 128, -2049, type) == 0);
+	assert(test_one(1 * _MB + page_size - 128, -4225, type) == 0);
 #endif
 
 	// By consuming 2MB of stack we test the stdu case
-	assert(test_one(2 * _MB + page_size - 128, -2048, type) == 0);
+	assert(test_one(2 * _MB + page_size - 128, -4224, type) == 0);
 }
 
 static int test(void)
-- 
2.25.1


^ permalink raw reply related

* [PATCH v2 4/5] powerpc/mm: Remove custom stack expansion checking
From: Michael Ellerman @ 2020-07-24  9:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-kernel, dja
In-Reply-To: <20200724092528.1578671-1-mpe@ellerman.id.au>

We have powerpc specific logic in our page fault handling to decide if
an access to an unmapped address below the stack pointer should expand
the stack VMA.

The logic aims to prevent userspace from doing bad accesses below the
stack pointer. However as long as the stack is < 1MB in size, we allow
all accesses without further checks. Adding some debug I see that I
can do a full kernel build and LTP run, and not a single process has
used more than 1MB of stack. So for the majority of processes the
logic never even fires.

We also recently found a nasty bug in this code which could cause
userspace programs to be killed during signal delivery. It went
unnoticed presumably because most processes use < 1MB of stack.

The generic mm code has also grown support for stack guard pages since
this code was originally written, so the most heinous case of the
stack expanding into other mappings is now handled for us.

Finally although some other arches have special logic in this path,
from what I can tell none of x86, arm64, arm and s390 impose any extra
checks other than those in expand_stack().

So drop our complicated logic and like other architectures just let
the stack expand as long as its within the rlimit.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/mm/fault.c | 109 ++--------------------------------------
 1 file changed, 5 insertions(+), 104 deletions(-)

v2: no change just rebased.

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 3ebb1792e636..925a7231abb3 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -42,39 +42,7 @@
 #include <asm/kup.h>
 #include <asm/inst.h>
 
-/*
- * Check whether the instruction inst is a store using
- * an update addressing form which will update r1.
- */
-static bool store_updates_sp(struct ppc_inst inst)
-{
-	/* check for 1 in the rA field */
-	if (((ppc_inst_val(inst) >> 16) & 0x1f) != 1)
-		return false;
-	/* check major opcode */
-	switch (ppc_inst_primary_opcode(inst)) {
-	case OP_STWU:
-	case OP_STBU:
-	case OP_STHU:
-	case OP_STFSU:
-	case OP_STFDU:
-		return true;
-	case OP_STD:	/* std or stdu */
-		return (ppc_inst_val(inst) & 3) == 1;
-	case OP_31:
-		/* check minor opcode */
-		switch ((ppc_inst_val(inst) >> 1) & 0x3ff) {
-		case OP_31_XOP_STDUX:
-		case OP_31_XOP_STWUX:
-		case OP_31_XOP_STBUX:
-		case OP_31_XOP_STHUX:
-		case OP_31_XOP_STFSUX:
-		case OP_31_XOP_STFDUX:
-			return true;
-		}
-	}
-	return false;
-}
+
 /*
  * do_page_fault error handling helpers
  */
@@ -267,57 +235,6 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
 	return false;
 }
 
-// This comes from 64-bit struct rt_sigframe + __SIGNAL_FRAMESIZE
-#define SIGFRAME_MAX_SIZE	(4096 + 128)
-
-static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address,
-				struct vm_area_struct *vma, unsigned int flags,
-				bool *must_retry)
-{
-	/*
-	 * N.B. The POWER/Open ABI allows programs to access up to
-	 * 288 bytes below the stack pointer.
-	 * The kernel signal delivery code writes a bit over 4KB
-	 * below the stack pointer (r1) before decrementing it.
-	 * The exec code can write slightly over 640kB to the stack
-	 * before setting the user r1.  Thus we allow the stack to
-	 * expand to 1MB without further checks.
-	 */
-	if (address + 0x100000 < vma->vm_end) {
-		struct ppc_inst __user *nip = (struct ppc_inst __user *)regs->nip;
-		/* get user regs even if this fault is in kernel mode */
-		struct pt_regs *uregs = current->thread.regs;
-		if (uregs == NULL)
-			return true;
-
-		/*
-		 * A user-mode access to an address a long way below
-		 * the stack pointer is only valid if the instruction
-		 * is one which would update the stack pointer to the
-		 * address accessed if the instruction completed,
-		 * i.e. either stwu rs,n(r1) or stwux rs,r1,rb
-		 * (or the byte, halfword, float or double forms).
-		 *
-		 * If we don't check this then any write to the area
-		 * between the last mapped region and the stack will
-		 * expand the stack rather than segfaulting.
-		 */
-		if (address + SIGFRAME_MAX_SIZE >= uregs->gpr[1])
-			return false;
-
-		if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) &&
-		    access_ok(nip, sizeof(*nip))) {
-			struct ppc_inst inst;
-
-			if (!probe_user_read_inst(&inst, nip))
-				return !store_updates_sp(inst);
-			*must_retry = true;
-		}
-		return true;
-	}
-	return false;
-}
-
 #ifdef CONFIG_PPC_MEM_KEYS
 static bool access_pkey_error(bool is_write, bool is_exec, bool is_pkey,
 			      struct vm_area_struct *vma)
@@ -483,7 +400,6 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	int is_user = user_mode(regs);
 	int is_write = page_fault_is_write(error_code);
 	vm_fault_t fault, major = 0;
-	bool must_retry = false;
 	bool kprobe_fault = kprobe_page_fault(regs, 11);
 
 	if (unlikely(debugger_fault_handler(regs) || kprobe_fault))
@@ -572,30 +488,15 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	vma = find_vma(mm, address);
 	if (unlikely(!vma))
 		return bad_area(regs, address);
-	if (likely(vma->vm_start <= address))
-		goto good_area;
-	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
-		return bad_area(regs, address);
 
-	/* The stack is being expanded, check if it's valid */
-	if (unlikely(bad_stack_expansion(regs, address, vma, flags,
-					 &must_retry))) {
-		if (!must_retry)
+	if (unlikely(vma->vm_start > address)) {
+		if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
 			return bad_area(regs, address);
 
-		mmap_read_unlock(mm);
-		if (fault_in_pages_readable((const char __user *)regs->nip,
-					    sizeof(unsigned int)))
-			return bad_area_nosemaphore(regs, address);
-		goto retry;
+		if (unlikely(expand_stack(vma, address)))
+			return bad_area(regs, address);
 	}
 
-	/* Try to expand it */
-	if (unlikely(expand_stack(vma, address)))
-		return bad_area(regs, address);
-
-good_area:
-
 #ifdef CONFIG_PPC_MEM_KEYS
 	if (unlikely(access_pkey_error(is_write, is_exec,
 				       (error_code & DSISR_KEYFAULT), vma)))
-- 
2.25.1


^ permalink raw reply related

* [PATCH v2 5/5] selftests/powerpc: Remove powerpc special cases from stack expansion test
From: Michael Ellerman @ 2020-07-24  9:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-kernel, dja
In-Reply-To: <20200724092528.1578671-1-mpe@ellerman.id.au>

Now that the powerpc code behaves the same as other architectures we
can drop the special cases we had.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 .../powerpc/mm/stack_expansion_ldst.c         | 41 +++----------------
 1 file changed, 5 insertions(+), 36 deletions(-)

v2: no change just rebased.

diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
index 8dbfb51acf0f..ed9143990888 100644
--- a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
+++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
@@ -56,13 +56,7 @@ int consume_stack(unsigned long target_sp, unsigned long stack_high, int delta,
 #else
 		asm volatile ("mov %%rsp, %[sp]" : [sp] "=r" (stack_top_sp));
 #endif
-
-		// Kludge, delta < 0 indicates relative to SP
-		if (delta < 0)
-			target = stack_top_sp + delta;
-		else
-			target = stack_high - delta + 1;
-
+		target = stack_high - delta + 1;
 		volatile char *p = (char *)target;
 
 		if (type == STORE)
@@ -162,41 +156,16 @@ static int test_one(unsigned int stack_used, int delta, enum access_type type)
 
 static void test_one_type(enum access_type type, unsigned long page_size, unsigned long rlim_cur)
 {
-	assert(test_one(DEFAULT_SIZE, 512 * _KB, type) == 0);
+	unsigned long delta;
 
-	// powerpc has a special case to allow up to 1MB
-	assert(test_one(DEFAULT_SIZE, 1 * _MB, type) == 0);
-
-#ifdef __powerpc__
-	// This fails on powerpc because it's > 1MB and is not a stdu &
-	// not close to r1
-	assert(test_one(DEFAULT_SIZE, 1 * _MB + 8, type) != 0);
-#else
-	assert(test_one(DEFAULT_SIZE, 1 * _MB + 8, type) == 0);
-#endif
-
-#ifdef __powerpc__
-	// Accessing way past the stack pointer is not allowed on powerpc
-	assert(test_one(DEFAULT_SIZE, rlim_cur, type) != 0);
-#else
 	// We should be able to access anywhere within the rlimit
+	for (delta = page_size; delta <= rlim_cur; delta += page_size)
+		assert(test_one(DEFAULT_SIZE, delta, type) == 0);
+
 	assert(test_one(DEFAULT_SIZE, rlim_cur, type) == 0);
-#endif
 
 	// But if we go past the rlimit it should fail
 	assert(test_one(DEFAULT_SIZE, rlim_cur + 1, type) != 0);
-
-	// Above 1MB powerpc only allows accesses within 4224 bytes of
-	// r1 for accesses that aren't stdu
-	assert(test_one(1 * _MB + page_size - 128, -4224, type) == 0);
-#ifdef __powerpc__
-	assert(test_one(1 * _MB + page_size - 128, -4225, type) != 0);
-#else
-	assert(test_one(1 * _MB + page_size - 128, -4225, type) == 0);
-#endif
-
-	// By consuming 2MB of stack we test the stdu case
-	assert(test_one(2 * _MB + page_size - 128, -4224, type) == 0);
 }
 
 static int test(void)
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH 2/2] powerpc/64s: system call support for scv/rfscv instructions
From: Michael Ellerman @ 2020-07-24 10:45 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: linux-api, musl, linuxppc-dev, Nicholas Piggin, libc-dev
In-Reply-To: <20200723184814.Horde.pk5BO9iFqyGX5D4TW5wqmg1@messagerie.si.c-s.fr>

Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> Michael Ellerman <mpe@ellerman.id.au> a écrit :
>
>> Nicholas Piggin <npiggin@gmail.com> writes:
>>> diff --git a/arch/powerpc/include/asm/ppc-opcode.h  
>>> b/arch/powerpc/include/asm/ppc-opcode.h
>>> index 2a39c716c343..b2bdc4de1292 100644
>>> --- a/arch/powerpc/include/asm/ppc-opcode.h
>>> +++ b/arch/powerpc/include/asm/ppc-opcode.h
>>> @@ -257,6 +257,7 @@
>>>  #define PPC_INST_MFVSRD			0x7c000066
>>>  #define PPC_INST_MTVSRD			0x7c000166
>>>  #define PPC_INST_SC			0x44000002
>>> +#define PPC_INST_SCV			0x44000001
>> ...
>>> @@ -411,6 +412,7 @@
>> ...
>>> +#define __PPC_LEV(l)	(((l) & 0x7f) << 5)
>>
>> These conflicted and didn't seem to be used so I dropped them.
>>
>>> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
>>> index 5abe98216dc2..161bfccbc309 100644
>>> --- a/arch/powerpc/lib/sstep.c
>>> +++ b/arch/powerpc/lib/sstep.c
>>> @@ -3378,6 +3382,16 @@ int emulate_step(struct pt_regs *regs,  
>>> struct ppc_inst instr)
>>>  		regs->msr = MSR_KERNEL;
>>>  		return 1;
>>>
>>> +	case SYSCALL_VECTORED_0:	/* scv 0 */
>>> +		regs->gpr[9] = regs->gpr[13];
>>> +		regs->gpr[10] = MSR_KERNEL;
>>> +		regs->gpr[11] = regs->nip + 4;
>>> +		regs->gpr[12] = regs->msr & MSR_MASK;
>>> +		regs->gpr[13] = (unsigned long) get_paca();
>>> +		regs->nip = (unsigned long) &system_call_vectored_emulate;
>>> +		regs->msr = MSR_KERNEL;
>>> +		return 1;
>>> +
>>
>> This broke the ppc64e build:
>>
>>   ld: arch/powerpc/lib/sstep.o:(.toc+0x0): undefined reference to  
>> `system_call_vectored_emulate'
>>   make[1]: *** [/home/michael/linux/Makefile:1139: vmlinux] Error 1
>>
>> I wrapped it in #ifdef CONFIG_PPC64_BOOK3S.
>
> You mean CONFIG_PPC_BOOK3S_64 ?

I hope so ...

#### ## ####.

Will send a fixup. Thanks for noticing.

cheers

^ permalink raw reply

* Re: [PATCH v 1/1] powerpc/64s: allow for clang's objdump differences
From: Michael Ellerman @ 2020-07-24 10:48 UTC (permalink / raw)
  To: Bill Wendling, Benjamin Herrenschmidt, Paul Mackerras
  Cc: linuxppc-dev, Bill Wendling
In-Reply-To: <20200724001605.3718561-1-morbo@google.com>

Hi Bill,

Bill Wendling <morbo@google.com> writes:
> Clang's objdump emits slightly different output from GNU's objdump,
> causing a list of warnings to be emitted during relocatable builds.
> E.g., clang's objdump emits this:
>
>    c000000000000004: 2c 00 00 48  b  0xc000000000000030
>    ...
>    c000000000005c6c: 10 00 82 40  bf 2, 0xc000000000005c7c
>
> while GNU objdump emits:
>
>    c000000000000004: 2c 00 00 48  b    c000000000000030 <__start+0x30>
>    ...
>    c000000000005c6c: 10 00 82 40  bne  c000000000005c7c <masked_interrupt+0x3c>
>
> Adjust llvm-objdump's output to remove the extraneous '0x' and convert
> 'bf' and 'bt' to 'bne' and 'beq' resp. to more closely match GNU
> objdump's output.
>
> Note that clang's objdump doesn't yet output the relocation symbols on
> PPC.
>
> Signed-off-by: Bill Wendling <morbo@google.com>
> ---
>  arch/powerpc/tools/unrel_branch_check.sh | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/tools/unrel_branch_check.sh b/arch/powerpc/tools/unrel_branch_check.sh
> index 77114755dc6f..71ce86b68d18 100755
> --- a/arch/powerpc/tools/unrel_branch_check.sh
> +++ b/arch/powerpc/tools/unrel_branch_check.sh
> @@ -31,6 +31,9 @@ grep -e "^c[0-9a-f]*:[[:space:]]*\([0-9a-f][0-9a-f][[:space:]]\)\{4\}[[:space:]]
>  grep -v '\<__start_initialization_multiplatform>' |
>  grep -v -e 'b.\?.\?ctr' |
>  grep -v -e 'b.\?.\?lr' |
> +sed 's/\bbt.\?[[:space:]]*[[:digit:]][[:digit:]]*,/beq/' |
> +sed 's/\bbf.\?[[:space:]]*[[:digit:]][[:digit:]]*,/bne/' |
> +sed 's/[[:space:]]0x/ /' |
>  sed 's/://' |

I know you followed the example in the script of just doing everything
as a separate entry in the pipeline, but I think we could consolidate
all the seds into one?

eg:

sed -e 's/\bbt.\?[[:space:]]*[[:digit:]][[:digit:]]*,/beq/' \
    -e 's/\bbf.\?[[:space:]]*[[:digit:]][[:digit:]]*,/bne/' \
    -e 's/[[:space:]]0x/ /' \
    -e 's/://' |

Does that work?

cheers

^ permalink raw reply

* [PATCH v2] powerpc/numa: Limit possible nodes to within num_possible_nodes
From: Srikar Dronamraju @ 2020-07-24 10:58 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Tyrel Datwyler, Srikar Dronamraju, Nicholas Piggin,
	linuxppc-dev

MAX_NUMNODES is a theoretical maximum number of nodes thats is supported
by the kernel. Device tree properties exposes the number of possible
nodes on the current platform. The kernel would detected this and would
use it for most of its resource allocations.  If the platform now
increases the nodes to over what was already exposed, then it may lead
to inconsistencies. Hence limit it to the already exposed nodes.

Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>
Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Changelog v1 -> v2:
v1: https://lore.kernel.org/linuxppc-dev/20200715120534.3673-1-srikar@linux.vnet.ibm.com/t/#u
	Use nr_node_ids instead of num_possible_nodes() When nodes are
	sparse like in PowerNV, nr_node_ids gets the right value unlike
	num_possible_nodes()

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index e437a9ac4956..383359272270 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -221,7 +221,7 @@ static void initialize_distance_lookup_table(int nid,
 	}
 }
 
-/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
+/* Returns nid in the range [0..nr_node_ids], or -1 if no useful numa
  * info is found.
  */
 static int associativity_to_nid(const __be32 *associativity)
@@ -235,7 +235,7 @@ static int associativity_to_nid(const __be32 *associativity)
 		nid = of_read_number(&associativity[min_common_depth], 1);
 
 	/* POWER4 LPAR uses 0xffff as invalid node */
-	if (nid == 0xffff || nid >= MAX_NUMNODES)
+	if (nid == 0xffff || nid >= nr_node_ids)
 		nid = NUMA_NO_NODE;
 
 	if (nid > 0 &&
@@ -448,7 +448,7 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
 		nid = of_read_number(&aa.arrays[index], 1);
 
-		if (nid == 0xffff || nid >= MAX_NUMNODES)
+		if (nid == 0xffff || nid >= nr_node_ids)
 			nid = default_nid;
 
 		if (nid > 0) {
-- 
2.17.1


^ permalink raw reply related

* Re: [v3 13/15] tools/perf: Add perf tools support for extended register capability in powerpc
From: Ravi Bangoria @ 2020-07-24 11:02 UTC (permalink / raw)
  To: Athira Rajeev
  Cc: ego, mikey, maddy, kvm, Ravi Bangoria, kvm-ppc, svaidyan, acme,
	jolsa, linuxppc-dev
In-Reply-To: <1594996707-3727-14-git-send-email-atrajeev@linux.vnet.ibm.com>

Hi Athira,

On 7/17/20 8:08 PM, Athira Rajeev wrote:
> From: Anju T Sudhakar <anju@linux.vnet.ibm.com>
> 
> Add extended regs to sample_reg_mask in the tool side to use
> with `-I?` option. Perf tools side uses extended mask to display
> the platform supported register names (with -I? option) to the user
> and also send this mask to the kernel to capture the extended registers
> in each sample. Hence decide the mask value based on the processor
> version.
> 
> Currently definitions for `mfspr`, `SPRN_PVR` are part of
> `arch/powerpc/util/header.c`. Move this to a header file so that
> these definitions can be re-used in other source files as well.

It seems this patch has a regression.

Without this patch:

   $ sudo ./perf record -I
   ^C[ perf record: Woken up 1 times to write data ]
   [ perf record: Captured and wrote 0.458 MB perf.data (318 samples) ]

With this patch:

   $ sudo ./perf record -I
   Error:
   dummy:HG: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'

Ravi

^ permalink raw reply

* RE: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register
From: Michael Ellerman @ 2020-07-24 11:49 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, aik, kvm-ppc, bharata, sathnaga, sukadev, linuxppc-dev,
	bauerman, david
In-Reply-To: <20200722074929.GI7339@oc0525413822.ibm.com>

Ram Pai <linuxram@us.ibm.com> writes:
> On Wed, Jul 22, 2020 at 12:06:06PM +1000, Michael Ellerman wrote:
>> Ram Pai <linuxram@us.ibm.com> writes:
>> > An instruction accessing a mmio address, generates a HDSI fault.  This fault is
>> > appropriately handled by the Hypervisor.  However in the case of secureVMs, the
>> > fault is delivered to the ultravisor.
>> >
>> > Unfortunately the Ultravisor has no correct-way to fetch the faulting
>> > instruction. The PEF architecture does not allow Ultravisor to enable MMU
>> > translation. Walking the two level page table to read the instruction can race
>> > with other vcpus modifying the SVM's process scoped page table.
>> 
>> You're trying to read the guest's kernel text IIUC, that mapping should
>> be stable. Possibly permissions on it could change over time, but the
>> virtual -> real mapping should not.
>
> Actually the code does not capture the address of the instruction in the
> sprg0 register. It captures the instruction itself. So should the mapping
> matter?
>> 
>> > This problem can be correctly solved with some help from the kernel.
>> >
>> > Capture the faulting instruction in SPRG0 register, before executing the
>> > faulting instruction. This enables the ultravisor to easily procure the
>> > faulting instruction and emulate it.
>> 
>> This is not something I'm going to merge. Sorry.
>
> Ok. Will consider other approaches.

To elaborate ...

You've basically invented a custom ucall ABI. But a really strange one
which takes an instruction as its first parameter in SPRG0, and then
subsequent parameters in any GPR depending on what the instruction was.

The UV should either emulate the instruction, which means the guest
should not be expected to do anything other than execute the
instruction. Or it should be done with a proper ucall that the guest
explicitly makes with a well defined ABI.

cheers

^ permalink raw reply

* Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes
From: Michael Ellerman @ 2020-07-24 11:52 UTC (permalink / raw)
  To: bharata; +Cc: Nathan Lynch, Aneesh Kumar K.V, linuxppc-dev, david
In-Reply-To: <20200722060506.GO7902@in.ibm.com>

Bharata B Rao <bharata@linux.ibm.com> writes:
> On Tue, Jul 21, 2020 at 10:25:58PM +1000, Michael Ellerman wrote:
>> Bharata B Rao <bharata@linux.ibm.com> writes:
>> > On Tue, Jul 21, 2020 at 11:45:20AM +1000, Michael Ellerman wrote:
>> >> Nathan Lynch <nathanl@linux.ibm.com> writes:
>> >> > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> >> >> This is the next version of the fixes for memory unplug on radix.
>> >> >> The issues and the fix are described in the actual patches.
>> >> >
>> >> > I guess this isn't actually causing problems at runtime right now, but I
>> >> > notice calls to resize_hpt_for_hotplug() from arch_add_memory() and
>> >> > arch_remove_memory(), which ought to be mmu-agnostic:
>> >> >
>> >> > int __ref arch_add_memory(int nid, u64 start, u64 size,
>> >> > 			  struct mhp_params *params)
>> >> > {
>> >> > 	unsigned long start_pfn = start >> PAGE_SHIFT;
>> >> > 	unsigned long nr_pages = size >> PAGE_SHIFT;
>> >> > 	int rc;
>> >> >
>> >> > 	resize_hpt_for_hotplug(memblock_phys_mem_size());
>> >> >
>> >> > 	start = (unsigned long)__va(start);
>> >> > 	rc = create_section_mapping(start, start + size, nid,
>> >> > 				    params->pgprot);
>> >> > ...
>> >> 
>> >> Hmm well spotted.
>> >> 
>> >> That does return early if the ops are not setup:
>> >> 
>> >> int resize_hpt_for_hotplug(unsigned long new_mem_size)
>> >> {
>> >> 	unsigned target_hpt_shift;
>> >> 
>> >> 	if (!mmu_hash_ops.resize_hpt)
>> >> 		return 0;
>> >> 
>> >> 
>> >> And:
>> >> 
>> >> void __init hpte_init_pseries(void)
>> >> {
>> >> 	...
>> >> 	if (firmware_has_feature(FW_FEATURE_HPT_RESIZE))
>> >> 		mmu_hash_ops.resize_hpt = pseries_lpar_resize_hpt;
>> >> 
>> >> And that comes in via ibm,hypertas-functions:
>> >> 
>> >> 	{FW_FEATURE_HPT_RESIZE,		"hcall-hpt-resize"},
>> >> 
>> >> 
>> >> But firmware is not necessarily going to add/remove that call based on
>> >> whether we're using hash/radix.
>> >
>> > Correct but hpte_init_pseries() will not be called for radix guests.
>> 
>> Yeah, duh. You'd think the function name would have been a sufficient
>> clue for me :)
>> 
>> >> So I think a follow-up patch is needed to make this more robust.
>> >> 
>> >> Aneesh/Bharata what platform did you test this series on? I'm curious
>> >> how this didn't break.
>> >
>> > I have tested memory hotplug/unplug for radix guest on zz platform and
>> > sanity-tested this for hash guest on P8.
>> >
>> > As noted above, mmu_hash_ops.resize_hpt will not be set for radix
>> > guest and hence we won't see any breakage.
>> 
>> OK.
>> 
>> That's probably fine as it is then. Or maybe just a comment in
>> resize_hpt_for_hotplug() pointing out that resize_hpt will be NULL if
>> we're using radix.
>
> Or we could move these calls to hpt-only routines like below?

That looks like it would be equivalent, and would nicely isolate those
calls in hash specific code. So yeah I think that's worth sending as a
proper patch, even better if you can test it.

> David - Do you remember if there was any particular reason to have
> these two hpt-resize calls within powerpc-generic memory hotplug code?

I think the HPT resizing was developed before or concurrently with the
radix support, so I would guess it was just not something we thought
about at the time.

cheers

> diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
> index c89b32443cff..1e6fa371cc38 100644
> --- a/arch/powerpc/include/asm/sparsemem.h
> +++ b/arch/powerpc/include/asm/sparsemem.h
> @@ -17,12 +17,6 @@ extern int create_section_mapping(unsigned long start, unsigned long end,
>  				  int nid, pgprot_t prot);
>  extern int remove_section_mapping(unsigned long start, unsigned long end);
>  
> -#ifdef CONFIG_PPC_BOOK3S_64
> -extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
> -#else
> -static inline int resize_hpt_for_hotplug(unsigned long new_mem_size) { return 0; }
> -#endif
> -
>  #ifdef CONFIG_NUMA
>  extern int hot_add_scn_to_nid(unsigned long scn_addr);
>  #else
> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
> index eec6f4e5e481..5daf53ec7600 100644
> --- a/arch/powerpc/mm/book3s64/hash_utils.c
> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> @@ -787,7 +787,7 @@ static unsigned long __init htab_get_table_size(void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int resize_hpt_for_hotplug(unsigned long new_mem_size)
> +static int resize_hpt_for_hotplug(unsigned long new_mem_size)
>  {
>  	unsigned target_hpt_shift;
>  
> @@ -821,6 +821,8 @@ int hash__create_section_mapping(unsigned long start, unsigned long end,
>  		return -1;
>  	}
>  
> +	resize_hpt_for_hotplug(memblock_phys_mem_size());
> +
>  	rc = htab_bolt_mapping(start, end, __pa(start),
>  			       pgprot_val(prot), mmu_linear_psize,
>  			       mmu_kernel_ssize);
> @@ -838,6 +840,10 @@ int hash__remove_section_mapping(unsigned long start, unsigned long end)
>  	int rc = htab_remove_mapping(start, end, mmu_linear_psize,
>  				     mmu_kernel_ssize);
>  	WARN_ON(rc < 0);
> +
> +	if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
> +		pr_warn("Hash collision while resizing HPT\n");
> +
>  	return rc;
>  }
>  #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index c2c11eb8dcfc..9dafc636588f 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -127,8 +127,6 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	int rc;
>  
> -	resize_hpt_for_hotplug(memblock_phys_mem_size());
> -
>  	start = (unsigned long)__va(start);
>  	rc = create_section_mapping(start, start + size, nid,
>  				    params->pgprot);
> @@ -161,9 +159,6 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
>  	 * hit that section of memory
>  	 */
>  	vm_unmap_aliases();
> -
> -	if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
> -		pr_warn("Hash collision while resizing HPT\n");
>  }
>  #endif
>  
> -- 
> 2.26.2

^ permalink raw reply

* Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes
From: Bharata B Rao @ 2020-07-24 12:17 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Nathan Lynch, Aneesh Kumar K.V, linuxppc-dev, david
In-Reply-To: <87mu3pp1u9.fsf@mpe.ellerman.id.au>

On Fri, Jul 24, 2020 at 09:52:14PM +1000, Michael Ellerman wrote:
> Bharata B Rao <bharata@linux.ibm.com> writes:
> > On Tue, Jul 21, 2020 at 10:25:58PM +1000, Michael Ellerman wrote:
> >> Bharata B Rao <bharata@linux.ibm.com> writes:
> >> > On Tue, Jul 21, 2020 at 11:45:20AM +1000, Michael Ellerman wrote:
> >> >> Nathan Lynch <nathanl@linux.ibm.com> writes:
> >> >> > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> >> >> >> This is the next version of the fixes for memory unplug on radix.
> >> >> >> The issues and the fix are described in the actual patches.
> >> >> >
> >> >> > I guess this isn't actually causing problems at runtime right now, but I
> >> >> > notice calls to resize_hpt_for_hotplug() from arch_add_memory() and
> >> >> > arch_remove_memory(), which ought to be mmu-agnostic:
> >> >> >
> >> >> > int __ref arch_add_memory(int nid, u64 start, u64 size,
> >> >> > 			  struct mhp_params *params)
> >> >> > {
> >> >> > 	unsigned long start_pfn = start >> PAGE_SHIFT;
> >> >> > 	unsigned long nr_pages = size >> PAGE_SHIFT;
> >> >> > 	int rc;
> >> >> >
> >> >> > 	resize_hpt_for_hotplug(memblock_phys_mem_size());
> >> >> >
> >> >> > 	start = (unsigned long)__va(start);
> >> >> > 	rc = create_section_mapping(start, start + size, nid,
> >> >> > 				    params->pgprot);
> >> >> > ...
> >> >> 
> >> >> Hmm well spotted.
> >> >> 
> >> >> That does return early if the ops are not setup:
> >> >> 
> >> >> int resize_hpt_for_hotplug(unsigned long new_mem_size)
> >> >> {
> >> >> 	unsigned target_hpt_shift;
> >> >> 
> >> >> 	if (!mmu_hash_ops.resize_hpt)
> >> >> 		return 0;
> >> >> 
> >> >> 
> >> >> And:
> >> >> 
> >> >> void __init hpte_init_pseries(void)
> >> >> {
> >> >> 	...
> >> >> 	if (firmware_has_feature(FW_FEATURE_HPT_RESIZE))
> >> >> 		mmu_hash_ops.resize_hpt = pseries_lpar_resize_hpt;
> >> >> 
> >> >> And that comes in via ibm,hypertas-functions:
> >> >> 
> >> >> 	{FW_FEATURE_HPT_RESIZE,		"hcall-hpt-resize"},
> >> >> 
> >> >> 
> >> >> But firmware is not necessarily going to add/remove that call based on
> >> >> whether we're using hash/radix.
> >> >
> >> > Correct but hpte_init_pseries() will not be called for radix guests.
> >> 
> >> Yeah, duh. You'd think the function name would have been a sufficient
> >> clue for me :)
> >> 
> >> >> So I think a follow-up patch is needed to make this more robust.
> >> >> 
> >> >> Aneesh/Bharata what platform did you test this series on? I'm curious
> >> >> how this didn't break.
> >> >
> >> > I have tested memory hotplug/unplug for radix guest on zz platform and
> >> > sanity-tested this for hash guest on P8.
> >> >
> >> > As noted above, mmu_hash_ops.resize_hpt will not be set for radix
> >> > guest and hence we won't see any breakage.
> >> 
> >> OK.
> >> 
> >> That's probably fine as it is then. Or maybe just a comment in
> >> resize_hpt_for_hotplug() pointing out that resize_hpt will be NULL if
> >> we're using radix.
> >
> > Or we could move these calls to hpt-only routines like below?
> 
> That looks like it would be equivalent, and would nicely isolate those
> calls in hash specific code. So yeah I think that's worth sending as a
> proper patch, even better if you can test it.

Sure I will send it as a proper patch. I did test minimal hotplug/unplug
for hash guest with that patch, will do more extensive test and resend.

> 
> > David - Do you remember if there was any particular reason to have
> > these two hpt-resize calls within powerpc-generic memory hotplug code?
> 
> I think the HPT resizing was developed before or concurrently with the
> radix support, so I would guess it was just not something we thought
> about at the time.

Right.

Regards,
Bharata.

^ permalink raw reply

* Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
From: Pingfan Liu @ 2020-07-24 12:24 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: cheloha, Kexec Mailing List, ldufour, linuxppc-dev, Hari Bathini
In-Reply-To: <87imee1hvt.fsf@linux.ibm.com>

On Thu, Jul 23, 2020 at 9:27 PM Nathan Lynch <nathanl@linux.ibm.com> wrote:
>
> Pingfan Liu <kernelfans@gmail.com> writes:
> > A bug is observed on pseries by taking the following steps on rhel:
> > -1. drmgr -c mem -r -q 5
> > -2. echo c > /proc/sysrq-trigger
> >
> > And then, the failure looks like:
> > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
> > kdump: saving vmcore-dmesg.txt
> > kdump: saving vmcore-dmesg.txt complete
> > kdump: saving vmcore
> >  Checking for memory holes                         : [  0.0 %] /                   Checking for memory holes                         : [100.0 %] |                   Excluding unnecessary pages                       : [100.0 %] \                   Copying data                                      : [  0.3 %] -          eta: 38s[   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba400000 access=0x8000000000000004 current=makedumpfile
> > [   44.337663] hash-mmu:     trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc000000050000504
> > [   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba400000 access=0x8000000000000004 current=makedumpfile
> > [   44.337692] hash-mmu:     trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc000000050000504
> > [   44.337708] makedumpfile[469]: unhandled signal 7 at 00007fffba400000 nip 00007fffbbc4d7fc lr 000000011356ca3c code 2
> > [   44.338548] Core dump to |/bin/false pipe failed
> > /lib/kdump-lib-initramfs.sh: line 98:   469 Bus error               $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
> > kdump: saving vmcore failed
> >
> > * Root cause *
> >   After analyzing, it turns out that in the current implementation,
> > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt updating as
> > the code __remove_memory() comes before drmem_update_dt().
> > So in kdump kernel, when read_from_oldmem() resorts to
> > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to
> > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as it
> > can be observed "Bus error"
> >
> > From a viewpoint of listener and publisher, the publisher notifies the
> > listener before data is ready.  This introduces a problem where udev
> > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
> > updating. And in capture kernel, makedumpfile will access the memory based
> > on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.
> >
> > * Fix *
> >   In order to fix this issue, update dt before __remove_memory(), and
> > accordingly the same rule in hot-add path.
> >
> > This will introduce extra dt updating payload for each involved lmb when hotplug.
> > But it should be fine since drmem_update_dt() is memory based operation and
> > hotplug is not a hot path.
>
> This is great analysis but the performance implications of the change
> are grave. The add/remove paths here are already O(n) where n is the
> quantity of memory assigned to the LP, this change would make it O(n^2):
>
> dlpar_memory_add_by_count
>   for_each_drmem_lmb             <--
>     dlpar_add_lmb
>       drmem_update_dt(_v1|_v2)
>         for_each_drmem_lmb       <--
>
> Memory add/remove isn't a hot path but quadratic runtime complexity
> isn't acceptable. Its current performance is bad enough that I have
Yes, the quadratic runtime complexity sounds terrible.
And I am curious about the bug. Does the system have thousands of lmb?

> internal bugs open on it.
>
> Not to mention we leak memory every time drmem_update_dt is called
> because we can't safely free device tree properties :-(
Do you know what block us to free it?
>
> Also note that this sort of reverts (fixes?) 063b8b1251fd
> ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
> request").
Yes. And now, I think I need to bring up another method to fix it.

Thanks,
Pingfan

^ permalink raw reply

* Re: [v3 12/15] powerpc/perf: Add support for outputting extended regs in perf intr_regs
From: Ravi Bangoria @ 2020-07-24 12:26 UTC (permalink / raw)
  To: Athira Rajeev
  Cc: ego, mikey, maddy, kvm, Ravi Bangoria, kvm-ppc, svaidyan, acme,
	jolsa, linuxppc-dev
In-Reply-To: <1594996707-3727-13-git-send-email-atrajeev@linux.vnet.ibm.com>

Hi Athira,

> +/* Function to return the extended register values */
> +static u64 get_ext_regs_value(int idx)
> +{
> +	switch (idx) {
> +	case PERF_REG_POWERPC_MMCR0:
> +		return mfspr(SPRN_MMCR0);
> +	case PERF_REG_POWERPC_MMCR1:
> +		return mfspr(SPRN_MMCR1);
> +	case PERF_REG_POWERPC_MMCR2:
> +		return mfspr(SPRN_MMCR2);
> +	default: return 0;
> +	}
> +}
> +
>   u64 perf_reg_value(struct pt_regs *regs, int idx)
>   {
> -	if (WARN_ON_ONCE(idx >= PERF_REG_POWERPC_MAX))
> -		return 0;
> +	u64 PERF_REG_EXTENDED_MAX;

PERF_REG_EXTENDED_MAX should be initialized. otherwise ...

> +
> +	if (cpu_has_feature(CPU_FTR_ARCH_300))
> +		PERF_REG_EXTENDED_MAX = PERF_REG_MAX_ISA_300;
>   
>   	if (idx == PERF_REG_POWERPC_SIER &&
>   	   (IS_ENABLED(CONFIG_FSL_EMB_PERF_EVENT) ||
> @@ -85,6 +103,16 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>   	    IS_ENABLED(CONFIG_PPC32)))
>   		return 0;
>   
> +	if (idx >= PERF_REG_POWERPC_MAX && idx < PERF_REG_EXTENDED_MAX)
> +		return get_ext_regs_value(idx);

On non p9/p10 machine, PERF_REG_EXTENDED_MAX may contain random value which will
allow user to pass this if condition unintentionally.

Neat: PERF_REG_EXTENDED_MAX is a local variable so it should be in lowercase.
Any specific reason to define it in capital?

Ravi

^ permalink raw reply

* [PATCH v4 0/6] powerpc: queued spinlocks and rwlocks
From: Nicholas Piggin @ 2020-07-24 13:14 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: linux-arch, Michal Suchánek, Peter Zijlstra, Boqun Feng,
	linux-kernel, Nicholas Piggin, virtualization, Ingo Molnar,
	kvm-ppc, Waiman Long, Will Deacon

Updated with everybody's feedback (thanks all), and more performance
results.

What I've found is I might have been measuring the worst load point for
the paravirt case, and by looking at a range of loads it's clear that
queued spinlocks are overall better even on PV, doubly so when you look
at the generally much improved worst case latencies.

I have defaulted it to N even though I'm less concerned about the PV
numbers now, just because I think it needs more stress testing. But
it's very nicely selectable so should be low risk to include.

All in all this is a very cool technology and great results especially
on the big systems but even on smaller ones there are nice gains. Thanks
Waiman and everyone who developed it.

Thanks,
Nick

Nicholas Piggin (6):
  powerpc/pseries: move some PAPR paravirt functions to their own file
  powerpc: move spinlock implementation to simple_spinlock
  powerpc/64s: implement queued spinlocks and rwlocks
  powerpc/pseries: implement paravirt qspinlocks for SPLPAR
  powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the
    lock hint
  powerpc: implement smp_cond_load_relaxed

 arch/powerpc/Kconfig                          |  15 +
 arch/powerpc/include/asm/Kbuild               |   1 +
 arch/powerpc/include/asm/atomic.h             |  28 ++
 arch/powerpc/include/asm/barrier.h            |  14 +
 arch/powerpc/include/asm/paravirt.h           |  87 +++++
 arch/powerpc/include/asm/qspinlock.h          |  91 ++++++
 arch/powerpc/include/asm/qspinlock_paravirt.h |   7 +
 arch/powerpc/include/asm/simple_spinlock.h    | 288 ++++++++++++++++
 .../include/asm/simple_spinlock_types.h       |  21 ++
 arch/powerpc/include/asm/spinlock.h           | 308 +-----------------
 arch/powerpc/include/asm/spinlock_types.h     |  17 +-
 arch/powerpc/lib/Makefile                     |   3 +
 arch/powerpc/lib/locks.c                      |  12 +-
 arch/powerpc/platforms/pseries/Kconfig        |   9 +-
 arch/powerpc/platforms/pseries/setup.c        |   4 +-
 include/asm-generic/qspinlock.h               |   4 +
 16 files changed, 588 insertions(+), 321 deletions(-)
 create mode 100644 arch/powerpc/include/asm/paravirt.h
 create mode 100644 arch/powerpc/include/asm/qspinlock.h
 create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
 create mode 100644 arch/powerpc/include/asm/simple_spinlock.h
 create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h

-- 
2.23.0


^ permalink raw reply

* [PATCH v4 1/6] powerpc/pseries: move some PAPR paravirt functions to their own file
From: Nicholas Piggin @ 2020-07-24 13:14 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: linux-arch, Michal Suchánek, Peter Zijlstra, Boqun Feng,
	linux-kernel, Nicholas Piggin, virtualization, Ingo Molnar,
	kvm-ppc, Waiman Long, Will Deacon
In-Reply-To: <20200724131423.1362108-1-npiggin@gmail.com>

These functions will be used by queued spinlock implementation,
and may be useful elsewhere too, so move them out of spinlock.h.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/paravirt.h | 59 +++++++++++++++++++++++++++++
 arch/powerpc/include/asm/spinlock.h | 24 +-----------
 arch/powerpc/lib/locks.c            | 12 +++---
 3 files changed, 66 insertions(+), 29 deletions(-)
 create mode 100644 arch/powerpc/include/asm/paravirt.h

diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
new file mode 100644
index 000000000000..339e8533464b
--- /dev/null
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _ASM_POWERPC_PARAVIRT_H
+#define _ASM_POWERPC_PARAVIRT_H
+
+#include <linux/jump_label.h>
+#include <asm/smp.h>
+#ifdef CONFIG_PPC64
+#include <asm/paca.h>
+#include <asm/hvcall.h>
+#endif
+
+#ifdef CONFIG_PPC_SPLPAR
+DECLARE_STATIC_KEY_FALSE(shared_processor);
+
+static inline bool is_shared_processor(void)
+{
+	return static_branch_unlikely(&shared_processor);
+}
+
+/* If bit 0 is set, the cpu has been preempted */
+static inline u32 yield_count_of(int cpu)
+{
+	__be32 yield_count = READ_ONCE(lppaca_of(cpu).yield_count);
+	return be32_to_cpu(yield_count);
+}
+
+static inline void yield_to_preempted(int cpu, u32 yield_count)
+{
+	plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), yield_count);
+}
+#else
+static inline bool is_shared_processor(void)
+{
+	return false;
+}
+
+static inline u32 yield_count_of(int cpu)
+{
+	return 0;
+}
+
+extern void ___bad_yield_to_preempted(void);
+static inline void yield_to_preempted(int cpu, u32 yield_count)
+{
+	___bad_yield_to_preempted(); /* This would be a bug */
+}
+#endif
+
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	if (!is_shared_processor())
+		return false;
+	if (yield_count_of(cpu) & 1)
+		return true;
+	return false;
+}
+
+#endif /* _ASM_POWERPC_PARAVIRT_H */
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 2d620896cdae..79be9bb10bbb 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -15,11 +15,10 @@
  *
  * (the type definitions are in asm/spinlock_types.h)
  */
-#include <linux/jump_label.h>
 #include <linux/irqflags.h>
+#include <asm/paravirt.h>
 #ifdef CONFIG_PPC64
 #include <asm/paca.h>
-#include <asm/hvcall.h>
 #endif
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
@@ -35,18 +34,6 @@
 #define LOCK_TOKEN	1
 #endif
 
-#ifdef CONFIG_PPC_PSERIES
-DECLARE_STATIC_KEY_FALSE(shared_processor);
-
-#define vcpu_is_preempted vcpu_is_preempted
-static inline bool vcpu_is_preempted(int cpu)
-{
-	if (!static_branch_unlikely(&shared_processor))
-		return false;
-	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
-}
-#endif
-
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return lock.slock == 0;
@@ -110,15 +97,6 @@ static inline void splpar_spin_yield(arch_spinlock_t *lock) {};
 static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 #endif
 
-static inline bool is_shared_processor(void)
-{
-#ifdef CONFIG_PPC_SPLPAR
-	return static_branch_unlikely(&shared_processor);
-#else
-	return false;
-#endif
-}
-
 static inline void spin_yield(arch_spinlock_t *lock)
 {
 	if (is_shared_processor())
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index 6440d5943c00..04165b7a163f 100644
--- a/arch/powerpc/lib/locks.c
+++ b/arch/powerpc/lib/locks.c
@@ -27,14 +27,14 @@ void splpar_spin_yield(arch_spinlock_t *lock)
 		return;
 	holder_cpu = lock_value & 0xffff;
 	BUG_ON(holder_cpu >= NR_CPUS);
-	yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count);
+
+	yield_count = yield_count_of(holder_cpu);
 	if ((yield_count & 1) == 0)
 		return;		/* virtual cpu is currently running */
 	rmb();
 	if (lock->slock != lock_value)
 		return;		/* something has changed */
-	plpar_hcall_norets(H_CONFER,
-		get_hard_smp_processor_id(holder_cpu), yield_count);
+	yield_to_preempted(holder_cpu, yield_count);
 }
 EXPORT_SYMBOL_GPL(splpar_spin_yield);
 
@@ -53,13 +53,13 @@ void splpar_rw_yield(arch_rwlock_t *rw)
 		return;		/* no write lock at present */
 	holder_cpu = lock_value & 0xffff;
 	BUG_ON(holder_cpu >= NR_CPUS);
-	yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count);
+
+	yield_count = yield_count_of(holder_cpu);
 	if ((yield_count & 1) == 0)
 		return;		/* virtual cpu is currently running */
 	rmb();
 	if (rw->lock != lock_value)
 		return;		/* something has changed */
-	plpar_hcall_norets(H_CONFER,
-		get_hard_smp_processor_id(holder_cpu), yield_count);
+	yield_to_preempted(holder_cpu, yield_count);
 }
 #endif
-- 
2.23.0


^ permalink raw reply related

* [PATCH v4 2/6] powerpc: move spinlock implementation to simple_spinlock
From: Nicholas Piggin @ 2020-07-24 13:14 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: linux-arch, Michal Suchánek, Peter Zijlstra, Boqun Feng,
	linux-kernel, Nicholas Piggin, virtualization, Ingo Molnar,
	kvm-ppc, Waiman Long, Will Deacon
In-Reply-To: <20200724131423.1362108-1-npiggin@gmail.com>

To prepare for queued spinlocks. This is a simple rename except to update
preprocessor guard name and a file reference.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/simple_spinlock.h    | 288 ++++++++++++++++++
 .../include/asm/simple_spinlock_types.h       |  21 ++
 arch/powerpc/include/asm/spinlock.h           | 285 +----------------
 arch/powerpc/include/asm/spinlock_types.h     |  12 +-
 4 files changed, 311 insertions(+), 295 deletions(-)
 create mode 100644 arch/powerpc/include/asm/simple_spinlock.h
 create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h

diff --git a/arch/powerpc/include/asm/simple_spinlock.h b/arch/powerpc/include/asm/simple_spinlock.h
new file mode 100644
index 000000000000..fe6cff7df48e
--- /dev/null
+++ b/arch/powerpc/include/asm/simple_spinlock.h
@@ -0,0 +1,288 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _ASM_POWERPC_SIMPLE_SPINLOCK_H
+#define _ASM_POWERPC_SIMPLE_SPINLOCK_H
+
+/*
+ * Simple spin lock operations.  
+ *
+ * Copyright (C) 2001-2004 Paul Mackerras <paulus@au.ibm.com>, IBM
+ * Copyright (C) 2001 Anton Blanchard <anton@au.ibm.com>, IBM
+ * Copyright (C) 2002 Dave Engebretsen <engebret@us.ibm.com>, IBM
+ *	Rework to support virtual processors
+ *
+ * Type of int is used as a full 64b word is not necessary.
+ *
+ * (the type definitions are in asm/simple_spinlock_types.h)
+ */
+#include <linux/irqflags.h>
+#include <asm/paravirt.h>
+#include <asm/paca.h>
+#include <asm/synch.h>
+#include <asm/ppc-opcode.h>
+
+#ifdef CONFIG_PPC64
+/* use 0x800000yy when locked, where yy == CPU number */
+#ifdef __BIG_ENDIAN__
+#define LOCK_TOKEN	(*(u32 *)(&get_paca()->lock_token))
+#else
+#define LOCK_TOKEN	(*(u32 *)(&get_paca()->paca_index))
+#endif
+#else
+#define LOCK_TOKEN	1
+#endif
+
+static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+{
+	return lock.slock == 0;
+}
+
+static inline int arch_spin_is_locked(arch_spinlock_t *lock)
+{
+	smp_mb();
+	return !arch_spin_value_unlocked(*lock);
+}
+
+/*
+ * This returns the old value in the lock, so we succeeded
+ * in getting the lock if the return value is 0.
+ */
+static inline unsigned long __arch_spin_trylock(arch_spinlock_t *lock)
+{
+	unsigned long tmp, token;
+
+	token = LOCK_TOKEN;
+	__asm__ __volatile__(
+"1:	" PPC_LWARX(%0,0,%2,1) "\n\
+	cmpwi		0,%0,0\n\
+	bne-		2f\n\
+	stwcx.		%1,0,%2\n\
+	bne-		1b\n"
+	PPC_ACQUIRE_BARRIER
+"2:"
+	: "=&r" (tmp)
+	: "r" (token), "r" (&lock->slock)
+	: "cr0", "memory");
+
+	return tmp;
+}
+
+static inline int arch_spin_trylock(arch_spinlock_t *lock)
+{
+	return __arch_spin_trylock(lock) == 0;
+}
+
+/*
+ * On a system with shared processors (that is, where a physical
+ * processor is multiplexed between several virtual processors),
+ * there is no point spinning on a lock if the holder of the lock
+ * isn't currently scheduled on a physical processor.  Instead
+ * we detect this situation and ask the hypervisor to give the
+ * rest of our timeslice to the lock holder.
+ *
+ * So that we can tell which virtual processor is holding a lock,
+ * we put 0x80000000 | smp_processor_id() in the lock when it is
+ * held.  Conveniently, we have a word in the paca that holds this
+ * value.
+ */
+
+#if defined(CONFIG_PPC_SPLPAR)
+/* We only yield to the hypervisor if we are in shared processor mode */
+void splpar_spin_yield(arch_spinlock_t *lock);
+void splpar_rw_yield(arch_rwlock_t *lock);
+#else /* SPLPAR */
+static inline void splpar_spin_yield(arch_spinlock_t *lock) {};
+static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
+#endif
+
+static inline void spin_yield(arch_spinlock_t *lock)
+{
+	if (is_shared_processor())
+		splpar_spin_yield(lock);
+	else
+		barrier();
+}
+
+static inline void rw_yield(arch_rwlock_t *lock)
+{
+	if (is_shared_processor())
+		splpar_rw_yield(lock);
+	else
+		barrier();
+}
+
+static inline void arch_spin_lock(arch_spinlock_t *lock)
+{
+	while (1) {
+		if (likely(__arch_spin_trylock(lock) == 0))
+			break;
+		do {
+			HMT_low();
+			if (is_shared_processor())
+				splpar_spin_yield(lock);
+		} while (unlikely(lock->slock != 0));
+		HMT_medium();
+	}
+}
+
+static inline
+void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
+{
+	unsigned long flags_dis;
+
+	while (1) {
+		if (likely(__arch_spin_trylock(lock) == 0))
+			break;
+		local_save_flags(flags_dis);
+		local_irq_restore(flags);
+		do {
+			HMT_low();
+			if (is_shared_processor())
+				splpar_spin_yield(lock);
+		} while (unlikely(lock->slock != 0));
+		HMT_medium();
+		local_irq_restore(flags_dis);
+	}
+}
+#define arch_spin_lock_flags arch_spin_lock_flags
+
+static inline void arch_spin_unlock(arch_spinlock_t *lock)
+{
+	__asm__ __volatile__("# arch_spin_unlock\n\t"
+				PPC_RELEASE_BARRIER: : :"memory");
+	lock->slock = 0;
+}
+
+/*
+ * Read-write spinlocks, allowing multiple readers
+ * but only one writer.
+ *
+ * NOTE! it is quite common to have readers in interrupts
+ * but no interrupt writers. For those circumstances we
+ * can "mix" irq-safe locks - any writer needs to get a
+ * irq-safe write-lock, but readers can get non-irqsafe
+ * read-locks.
+ */
+
+#ifdef CONFIG_PPC64
+#define __DO_SIGN_EXTEND	"extsw	%0,%0\n"
+#define WRLOCK_TOKEN		LOCK_TOKEN	/* it's negative */
+#else
+#define __DO_SIGN_EXTEND
+#define WRLOCK_TOKEN		(-1)
+#endif
+
+/*
+ * This returns the old value in the lock + 1,
+ * so we got a read lock if the return value is > 0.
+ */
+static inline long __arch_read_trylock(arch_rwlock_t *rw)
+{
+	long tmp;
+
+	__asm__ __volatile__(
+"1:	" PPC_LWARX(%0,0,%1,1) "\n"
+	__DO_SIGN_EXTEND
+"	addic.		%0,%0,1\n\
+	ble-		2f\n"
+"	stwcx.		%0,0,%1\n\
+	bne-		1b\n"
+	PPC_ACQUIRE_BARRIER
+"2:"	: "=&r" (tmp)
+	: "r" (&rw->lock)
+	: "cr0", "xer", "memory");
+
+	return tmp;
+}
+
+/*
+ * This returns the old value in the lock,
+ * so we got the write lock if the return value is 0.
+ */
+static inline long __arch_write_trylock(arch_rwlock_t *rw)
+{
+	long tmp, token;
+
+	token = WRLOCK_TOKEN;
+	__asm__ __volatile__(
+"1:	" PPC_LWARX(%0,0,%2,1) "\n\
+	cmpwi		0,%0,0\n\
+	bne-		2f\n"
+"	stwcx.		%1,0,%2\n\
+	bne-		1b\n"
+	PPC_ACQUIRE_BARRIER
+"2:"	: "=&r" (tmp)
+	: "r" (token), "r" (&rw->lock)
+	: "cr0", "memory");
+
+	return tmp;
+}
+
+static inline void arch_read_lock(arch_rwlock_t *rw)
+{
+	while (1) {
+		if (likely(__arch_read_trylock(rw) > 0))
+			break;
+		do {
+			HMT_low();
+			if (is_shared_processor())
+				splpar_rw_yield(rw);
+		} while (unlikely(rw->lock < 0));
+		HMT_medium();
+	}
+}
+
+static inline void arch_write_lock(arch_rwlock_t *rw)
+{
+	while (1) {
+		if (likely(__arch_write_trylock(rw) == 0))
+			break;
+		do {
+			HMT_low();
+			if (is_shared_processor())
+				splpar_rw_yield(rw);
+		} while (unlikely(rw->lock != 0));
+		HMT_medium();
+	}
+}
+
+static inline int arch_read_trylock(arch_rwlock_t *rw)
+{
+	return __arch_read_trylock(rw) > 0;
+}
+
+static inline int arch_write_trylock(arch_rwlock_t *rw)
+{
+	return __arch_write_trylock(rw) == 0;
+}
+
+static inline void arch_read_unlock(arch_rwlock_t *rw)
+{
+	long tmp;
+
+	__asm__ __volatile__(
+	"# read_unlock\n\t"
+	PPC_RELEASE_BARRIER
+"1:	lwarx		%0,0,%1\n\
+	addic		%0,%0,-1\n"
+"	stwcx.		%0,0,%1\n\
+	bne-		1b"
+	: "=&r"(tmp)
+	: "r"(&rw->lock)
+	: "cr0", "xer", "memory");
+}
+
+static inline void arch_write_unlock(arch_rwlock_t *rw)
+{
+	__asm__ __volatile__("# write_unlock\n\t"
+				PPC_RELEASE_BARRIER: : :"memory");
+	rw->lock = 0;
+}
+
+#define arch_spin_relax(lock)	spin_yield(lock)
+#define arch_read_relax(lock)	rw_yield(lock)
+#define arch_write_relax(lock)	rw_yield(lock)
+
+/* See include/linux/spinlock.h */
+#define smp_mb__after_spinlock()   smp_mb()
+
+#endif /* _ASM_POWERPC_SIMPLE_SPINLOCK_H */
diff --git a/arch/powerpc/include/asm/simple_spinlock_types.h b/arch/powerpc/include/asm/simple_spinlock_types.h
new file mode 100644
index 000000000000..0f3cdd8faa95
--- /dev/null
+++ b/arch/powerpc/include/asm/simple_spinlock_types.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SIMPLE_SPINLOCK_TYPES_H
+#define _ASM_POWERPC_SIMPLE_SPINLOCK_TYPES_H
+
+#ifndef __LINUX_SPINLOCK_TYPES_H
+# error "please don't include this file directly"
+#endif
+
+typedef struct {
+	volatile unsigned int slock;
+} arch_spinlock_t;
+
+#define __ARCH_SPIN_LOCK_UNLOCKED	{ 0 }
+
+typedef struct {
+	volatile signed int lock;
+} arch_rwlock_t;
+
+#define __ARCH_RW_LOCK_UNLOCKED		{ 0 }
+
+#endif /* _ASM_POWERPC_SIMPLE_SPINLOCK_TYPES_H */
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 79be9bb10bbb..21357fe05fe0 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -3,290 +3,7 @@
 #define __ASM_SPINLOCK_H
 #ifdef __KERNEL__
 
-/*
- * Simple spin lock operations.  
- *
- * Copyright (C) 2001-2004 Paul Mackerras <paulus@au.ibm.com>, IBM
- * Copyright (C) 2001 Anton Blanchard <anton@au.ibm.com>, IBM
- * Copyright (C) 2002 Dave Engebretsen <engebret@us.ibm.com>, IBM
- *	Rework to support virtual processors
- *
- * Type of int is used as a full 64b word is not necessary.
- *
- * (the type definitions are in asm/spinlock_types.h)
- */
-#include <linux/irqflags.h>
-#include <asm/paravirt.h>
-#ifdef CONFIG_PPC64
-#include <asm/paca.h>
-#endif
-#include <asm/synch.h>
-#include <asm/ppc-opcode.h>
-
-#ifdef CONFIG_PPC64
-/* use 0x800000yy when locked, where yy == CPU number */
-#ifdef __BIG_ENDIAN__
-#define LOCK_TOKEN	(*(u32 *)(&get_paca()->lock_token))
-#else
-#define LOCK_TOKEN	(*(u32 *)(&get_paca()->paca_index))
-#endif
-#else
-#define LOCK_TOKEN	1
-#endif
-
-static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
-{
-	return lock.slock == 0;
-}
-
-static inline int arch_spin_is_locked(arch_spinlock_t *lock)
-{
-	smp_mb();
-	return !arch_spin_value_unlocked(*lock);
-}
-
-/*
- * This returns the old value in the lock, so we succeeded
- * in getting the lock if the return value is 0.
- */
-static inline unsigned long __arch_spin_trylock(arch_spinlock_t *lock)
-{
-	unsigned long tmp, token;
-
-	token = LOCK_TOKEN;
-	__asm__ __volatile__(
-"1:	" PPC_LWARX(%0,0,%2,1) "\n\
-	cmpwi		0,%0,0\n\
-	bne-		2f\n\
-	stwcx.		%1,0,%2\n\
-	bne-		1b\n"
-	PPC_ACQUIRE_BARRIER
-"2:"
-	: "=&r" (tmp)
-	: "r" (token), "r" (&lock->slock)
-	: "cr0", "memory");
-
-	return tmp;
-}
-
-static inline int arch_spin_trylock(arch_spinlock_t *lock)
-{
-	return __arch_spin_trylock(lock) == 0;
-}
-
-/*
- * On a system with shared processors (that is, where a physical
- * processor is multiplexed between several virtual processors),
- * there is no point spinning on a lock if the holder of the lock
- * isn't currently scheduled on a physical processor.  Instead
- * we detect this situation and ask the hypervisor to give the
- * rest of our timeslice to the lock holder.
- *
- * So that we can tell which virtual processor is holding a lock,
- * we put 0x80000000 | smp_processor_id() in the lock when it is
- * held.  Conveniently, we have a word in the paca that holds this
- * value.
- */
-
-#if defined(CONFIG_PPC_SPLPAR)
-/* We only yield to the hypervisor if we are in shared processor mode */
-void splpar_spin_yield(arch_spinlock_t *lock);
-void splpar_rw_yield(arch_rwlock_t *lock);
-#else /* SPLPAR */
-static inline void splpar_spin_yield(arch_spinlock_t *lock) {};
-static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
-#endif
-
-static inline void spin_yield(arch_spinlock_t *lock)
-{
-	if (is_shared_processor())
-		splpar_spin_yield(lock);
-	else
-		barrier();
-}
-
-static inline void rw_yield(arch_rwlock_t *lock)
-{
-	if (is_shared_processor())
-		splpar_rw_yield(lock);
-	else
-		barrier();
-}
-
-static inline void arch_spin_lock(arch_spinlock_t *lock)
-{
-	while (1) {
-		if (likely(__arch_spin_trylock(lock) == 0))
-			break;
-		do {
-			HMT_low();
-			if (is_shared_processor())
-				splpar_spin_yield(lock);
-		} while (unlikely(lock->slock != 0));
-		HMT_medium();
-	}
-}
-
-static inline
-void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
-{
-	unsigned long flags_dis;
-
-	while (1) {
-		if (likely(__arch_spin_trylock(lock) == 0))
-			break;
-		local_save_flags(flags_dis);
-		local_irq_restore(flags);
-		do {
-			HMT_low();
-			if (is_shared_processor())
-				splpar_spin_yield(lock);
-		} while (unlikely(lock->slock != 0));
-		HMT_medium();
-		local_irq_restore(flags_dis);
-	}
-}
-#define arch_spin_lock_flags arch_spin_lock_flags
-
-static inline void arch_spin_unlock(arch_spinlock_t *lock)
-{
-	__asm__ __volatile__("# arch_spin_unlock\n\t"
-				PPC_RELEASE_BARRIER: : :"memory");
-	lock->slock = 0;
-}
-
-/*
- * Read-write spinlocks, allowing multiple readers
- * but only one writer.
- *
- * NOTE! it is quite common to have readers in interrupts
- * but no interrupt writers. For those circumstances we
- * can "mix" irq-safe locks - any writer needs to get a
- * irq-safe write-lock, but readers can get non-irqsafe
- * read-locks.
- */
-
-#ifdef CONFIG_PPC64
-#define __DO_SIGN_EXTEND	"extsw	%0,%0\n"
-#define WRLOCK_TOKEN		LOCK_TOKEN	/* it's negative */
-#else
-#define __DO_SIGN_EXTEND
-#define WRLOCK_TOKEN		(-1)
-#endif
-
-/*
- * This returns the old value in the lock + 1,
- * so we got a read lock if the return value is > 0.
- */
-static inline long __arch_read_trylock(arch_rwlock_t *rw)
-{
-	long tmp;
-
-	__asm__ __volatile__(
-"1:	" PPC_LWARX(%0,0,%1,1) "\n"
-	__DO_SIGN_EXTEND
-"	addic.		%0,%0,1\n\
-	ble-		2f\n"
-"	stwcx.		%0,0,%1\n\
-	bne-		1b\n"
-	PPC_ACQUIRE_BARRIER
-"2:"	: "=&r" (tmp)
-	: "r" (&rw->lock)
-	: "cr0", "xer", "memory");
-
-	return tmp;
-}
-
-/*
- * This returns the old value in the lock,
- * so we got the write lock if the return value is 0.
- */
-static inline long __arch_write_trylock(arch_rwlock_t *rw)
-{
-	long tmp, token;
-
-	token = WRLOCK_TOKEN;
-	__asm__ __volatile__(
-"1:	" PPC_LWARX(%0,0,%2,1) "\n\
-	cmpwi		0,%0,0\n\
-	bne-		2f\n"
-"	stwcx.		%1,0,%2\n\
-	bne-		1b\n"
-	PPC_ACQUIRE_BARRIER
-"2:"	: "=&r" (tmp)
-	: "r" (token), "r" (&rw->lock)
-	: "cr0", "memory");
-
-	return tmp;
-}
-
-static inline void arch_read_lock(arch_rwlock_t *rw)
-{
-	while (1) {
-		if (likely(__arch_read_trylock(rw) > 0))
-			break;
-		do {
-			HMT_low();
-			if (is_shared_processor())
-				splpar_rw_yield(rw);
-		} while (unlikely(rw->lock < 0));
-		HMT_medium();
-	}
-}
-
-static inline void arch_write_lock(arch_rwlock_t *rw)
-{
-	while (1) {
-		if (likely(__arch_write_trylock(rw) == 0))
-			break;
-		do {
-			HMT_low();
-			if (is_shared_processor())
-				splpar_rw_yield(rw);
-		} while (unlikely(rw->lock != 0));
-		HMT_medium();
-	}
-}
-
-static inline int arch_read_trylock(arch_rwlock_t *rw)
-{
-	return __arch_read_trylock(rw) > 0;
-}
-
-static inline int arch_write_trylock(arch_rwlock_t *rw)
-{
-	return __arch_write_trylock(rw) == 0;
-}
-
-static inline void arch_read_unlock(arch_rwlock_t *rw)
-{
-	long tmp;
-
-	__asm__ __volatile__(
-	"# read_unlock\n\t"
-	PPC_RELEASE_BARRIER
-"1:	lwarx		%0,0,%1\n\
-	addic		%0,%0,-1\n"
-"	stwcx.		%0,0,%1\n\
-	bne-		1b"
-	: "=&r"(tmp)
-	: "r"(&rw->lock)
-	: "cr0", "xer", "memory");
-}
-
-static inline void arch_write_unlock(arch_rwlock_t *rw)
-{
-	__asm__ __volatile__("# write_unlock\n\t"
-				PPC_RELEASE_BARRIER: : :"memory");
-	rw->lock = 0;
-}
-
-#define arch_spin_relax(lock)	spin_yield(lock)
-#define arch_read_relax(lock)	rw_yield(lock)
-#define arch_write_relax(lock)	rw_yield(lock)
-
-/* See include/linux/spinlock.h */
-#define smp_mb__after_spinlock()   smp_mb()
+#include <asm/simple_spinlock.h>
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_SPINLOCK_H */
diff --git a/arch/powerpc/include/asm/spinlock_types.h b/arch/powerpc/include/asm/spinlock_types.h
index 87adaf13b7e8..3906f52dae65 100644
--- a/arch/powerpc/include/asm/spinlock_types.h
+++ b/arch/powerpc/include/asm/spinlock_types.h
@@ -6,16 +6,6 @@
 # error "please don't include this file directly"
 #endif
 
-typedef struct {
-	volatile unsigned int slock;
-} arch_spinlock_t;
-
-#define __ARCH_SPIN_LOCK_UNLOCKED	{ 0 }
-
-typedef struct {
-	volatile signed int lock;
-} arch_rwlock_t;
-
-#define __ARCH_RW_LOCK_UNLOCKED		{ 0 }
+#include <asm/simple_spinlock_types.h>
 
 #endif
-- 
2.23.0


^ permalink raw reply related

* [PATCH v4 3/6] powerpc/64s: implement queued spinlocks and rwlocks
From: Nicholas Piggin @ 2020-07-24 13:14 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: linux-arch, Michal Suchánek, Peter Zijlstra, Boqun Feng,
	linux-kernel, Nicholas Piggin, virtualization, Ingo Molnar,
	kvm-ppc, Waiman Long, Will Deacon
In-Reply-To: <20200724131423.1362108-1-npiggin@gmail.com>

These have shown significantly improved performance and fairness when
spinlock contention is moderate to high on very large systems.

With this series including subsequent patches, on a 16 socket 1536 thread
POWER9, a stress test such as same-file open/close from all CPUs gets big
speedups, 11620op/s aggregate with simple spinlocks vs 384158op/s (33x
faster), where the difference in throughput between the fastest and slowest
thread goes from 7x to 1.4x.

Thanks to the fast path being identical in terms of atomics and barriers
(after a subsequent optimisation patch), single threaded performance is not
changed (no measurable difference).

On smaller systems, performance and fairness seems to be generally improved.
Using dbench on tmpfs as a test (that starts to run into kernel spinlock
contention), a 2-socket OpenPOWER POWER9 system was tested with bare metal
and KVM guest configurations. Results can be found here:

https://github.com/linuxppc/issues/issues/305#issuecomment-663487453

Observations are:

- Queued spinlocks are equal when contention is insignificant, as expected
  and as measured with microbenchmarks.

- When there is contention, on bare metal queued spinlocks have better
  throughput and max latency at all points.

- When virtualised, queued spinlocks are slightly worse approaching peak
  throughput, but significantly better throughput and max latency at all
  points beyond peak, until queued spinlock maximum latency rises when
  clients are 2x vCPUs.

The regressions haven't been analysed very well yet, there are a lot of
things that can be tuned, particularly the paravirtualised locking, but the
numbers already look like a good net win even on relatively small systems.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/Kconfig                      | 15 ++++++++++++++
 arch/powerpc/include/asm/Kbuild           |  1 +
 arch/powerpc/include/asm/qspinlock.h      | 25 +++++++++++++++++++++++
 arch/powerpc/include/asm/spinlock.h       |  5 +++++
 arch/powerpc/include/asm/spinlock_types.h |  5 +++++
 arch/powerpc/lib/Makefile                 |  3 +++
 include/asm-generic/qspinlock.h           |  2 ++
 7 files changed, 56 insertions(+)
 create mode 100644 arch/powerpc/include/asm/qspinlock.h

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 9fa23eb320ff..641946052d67 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -145,6 +145,8 @@ config PPC
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
+	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
+	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WEAK_RELEASE_ACQUIRE
 	select BINFMT_ELF
@@ -490,6 +492,19 @@ config HOTPLUG_CPU
 
 	  Say N if you are unsure.
 
+config PPC_QUEUED_SPINLOCKS
+	bool "Queued spinlocks"
+	depends on SMP
+	help
+	  Say Y here to use to use queued spinlocks which give better
+	  scalability and fairness on large SMP and NUMA systems without
+	  harming single threaded performance.
+
+	  This option is currently experimental, the code is more complex
+	  and less tested so it defaults to "N" for the moment.
+
+	  If unsure, say "N".
+
 config ARCH_CPU_PROBE_RELEASE
 	def_bool y
 	depends on HOTPLUG_CPU
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index dadbcf3a0b1e..27c2268dfd6c 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -6,5 +6,6 @@ generated-y += syscall_table_spu.h
 generic-y += export.h
 generic-y += local64.h
 generic-y += mcs_spinlock.h
+generic-y += qrwlock.h
 generic-y += vtime.h
 generic-y += early_ioremap.h
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
new file mode 100644
index 000000000000..c49e33e24edd
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_QSPINLOCK_H
+#define _ASM_POWERPC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#define _Q_PENDING_LOOPS	(1 << 9) /* not tuned */
+
+#define smp_mb__after_spinlock()   smp_mb()
+
+static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
+{
+	/*
+	 * This barrier was added to simple spinlocks by commit 51d7d5205d338,
+	 * but it should now be possible to remove it, asm arm64 has done with
+	 * commit c6f5d02b6a0f.
+	 */
+	smp_mb();
+	return atomic_read(&lock->val);
+}
+#define queued_spin_is_locked queued_spin_is_locked
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_POWERPC_QSPINLOCK_H */
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 21357fe05fe0..434615f1d761 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -3,7 +3,12 @@
 #define __ASM_SPINLOCK_H
 #ifdef __KERNEL__
 
+#ifdef CONFIG_PPC_QUEUED_SPINLOCKS
+#include <asm/qspinlock.h>
+#include <asm/qrwlock.h>
+#else
 #include <asm/simple_spinlock.h>
+#endif
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_SPINLOCK_H */
diff --git a/arch/powerpc/include/asm/spinlock_types.h b/arch/powerpc/include/asm/spinlock_types.h
index 3906f52dae65..c5d742f18021 100644
--- a/arch/powerpc/include/asm/spinlock_types.h
+++ b/arch/powerpc/include/asm/spinlock_types.h
@@ -6,6 +6,11 @@
 # error "please don't include this file directly"
 #endif
 
+#ifdef CONFIG_PPC_QUEUED_SPINLOCKS
+#include <asm-generic/qspinlock_types.h>
+#include <asm-generic/qrwlock_types.h>
+#else
 #include <asm/simple_spinlock_types.h>
+#endif
 
 #endif
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 5e994cda8e40..d66a645503eb 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -41,7 +41,10 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o copypage_power7.o \
 obj64-y	+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
 	   memcpy_64.o memcpy_mcsafe_64.o
 
+ifndef CONFIG_PPC_QUEUED_SPINLOCKS
 obj64-$(CONFIG_SMP)	+= locks.o
+endif
+
 obj64-$(CONFIG_ALTIVEC)	+= vmx-helper.o
 obj64-$(CONFIG_KPROBES_SANITY_TEST)	+= test_emulate_step.o \
 					   test_emulate_step_exec_instr.o
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index fde943d180e0..fb0a814d4395 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -12,6 +12,7 @@
 
 #include <asm-generic/qspinlock_types.h>
 
+#ifndef queued_spin_is_locked
 /**
  * queued_spin_is_locked - is the spinlock locked?
  * @lock: Pointer to queued spinlock structure
@@ -25,6 +26,7 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 	 */
 	return atomic_read(&lock->val);
 }
+#endif
 
 /**
  * queued_spin_value_unlocked - is the spinlock structure unlocked?
-- 
2.23.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox