LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v11 01/26] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
From: Randy Dunlap @ 2018-05-17 16:36 UTC (permalink / raw)
  To: Laurent Dufour, akpm, mhocko, peterz, kirill, ak, dave, jack,
	Matthew Wilcox, khandual, aneesh.kumar, benh, mpe, paulus,
	Thomas Gleixner, Ingo Molnar, hpa, Will Deacon,
	Sergey Senozhatsky, sergey.senozhatsky.work, Andrea Arcangeli,
	Alexei Starovoitov, kemi.wang, Daniel Jordan, David Rientjes,
	Jerome Glisse, Ganesh Mahendran, Minchan Kim, Punit Agrawal,
	vinayak menon, Yang Shi
  Cc: linux-kernel, linux-mm, haren, npiggin, bsingharora, paulmck,
	Tim Chen, linuxppc-dev, x86
In-Reply-To: <1526555193-7242-2-git-send-email-ldufour@linux.vnet.ibm.com>

Hi,

On 05/17/2018 04:06 AM, Laurent Dufour wrote:
> This configuration variable will be used to build the code needed to
> handle speculative page fault.
> 
> By default it is turned off, and activated depending on architecture
> support, ARCH_HAS_PTE_SPECIAL, SMP and MMU.
> 
> The architecture support is needed since the speculative page fault handler
> is called from the architecture's page faulting code, and some code has to
> be added there to handle the speculative handler.
> 
> The dependency on ARCH_HAS_PTE_SPECIAL is required because vm_normal_page()
> does processing that is not compatible with the speculative handling in the
> case ARCH_HAS_PTE_SPECIAL is not set.
> 
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> ---
>  mm/Kconfig | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 1d0888c5b97a..a38796276113 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -761,3 +761,25 @@ config GUP_BENCHMARK
>  
>  config ARCH_HAS_PTE_SPECIAL
>  	bool
> +
> +config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> +       def_bool n
> +
> +config SPECULATIVE_PAGE_FAULT
> +       bool "Speculative page faults"
> +       default y
> +       depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> +       depends on ARCH_HAS_PTE_SPECIAL && MMU && SMP
> +       help
> +         Try to handle user space page faults without holding the mmap_sem.
> +
> +	 This should allow better concurrency for massively threaded process

	                                                             processes

> +	 since the page fault handler will not wait for other threads memory

	                                                      thread's

> +	 layout change to be done, assuming that this change is done in another
> +	 part of the process's memory space. This type of page fault is named
> +	 speculative page fault.
> +
> +	 If the speculative page fault fails because of a concurrency is

	                                     because a concurrency is

> +	 detected or because underlying PMD or PTE tables are not yet
> +	 allocating, it is failing its processing and a classic page fault

	 allocated, the speculative page fault fails and a classic page fault

> +	 is then tried.


Also, all of the help text (below the "help" line) should be indented by
1 tab + 2 spaces (in coding-style.rst).


-- 
~Randy

^ permalink raw reply

* Re: [PATCH 07/14] powerpc: Add support for restartable sequences
From: Mathieu Desnoyers @ 2018-05-17 15:28 UTC (permalink / raw)
  To: Boqun Feng, Will Deacon
  Cc: Peter Zijlstra, Paul E. McKenney, Andy Lutomirski, Dave Watson,
	linux-kernel, linux-api, Paul Turner, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Michael Kerrisk, Joel Fernandes,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev
In-Reply-To: <20180517011949.GA1121@tardis>

----- On May 16, 2018, at 9:19 PM, Boqun Feng boqun.feng@gmail.com wrote:

> On Wed, May 16, 2018 at 04:13:16PM -0400, Mathieu Desnoyers wrote:
>> ----- On May 16, 2018, at 12:18 PM, Peter Zijlstra peterz@infradead.org wrote:
>> 
>> > On Mon, Apr 30, 2018 at 06:44:26PM -0400, Mathieu Desnoyers wrote:
>> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> >> index c32a181a7cbb..ed21a777e8c6 100644
>> >> --- a/arch/powerpc/Kconfig
>> >> +++ b/arch/powerpc/Kconfig
>> >> @@ -223,6 +223,7 @@ config PPC
>> >>  	select HAVE_SYSCALL_TRACEPOINTS
>> >>  	select HAVE_VIRT_CPU_ACCOUNTING
>> >>  	select HAVE_IRQ_TIME_ACCOUNTING
>> >> +	select HAVE_RSEQ
>> >>  	select IRQ_DOMAIN
>> >>  	select IRQ_FORCED_THREADING
>> >>  	select MODULES_USE_ELF_RELA
>> >> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
>> >> index 61db86ecd318..d3bb3aaaf5ac 100644
>> >> --- a/arch/powerpc/kernel/signal.c
>> >> +++ b/arch/powerpc/kernel/signal.c
>> >> @@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
>> >>  	/* Re-enable the breakpoints for the signal stack */
>> >>  	thread_change_pc(tsk, tsk->thread.regs);
>> >>  
>> >> +	rseq_signal_deliver(tsk->thread.regs);
>> >> +
>> >>  	if (is32) {
>> >>          	if (ksig.ka.sa.sa_flags & SA_SIGINFO)
>> >>  			ret = handle_rt_signal32(&ksig, oldset, tsk);
>> >> @@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long
>> >> thread_info_flags)
>> >>  	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
>> >>  		clear_thread_flag(TIF_NOTIFY_RESUME);
>> >>  		tracehook_notify_resume(regs);
>> >> +		rseq_handle_notify_resume(regs);
>> >>  	}
>> >>  
>> >>  	user_enter();
>> > 
>> > Again no rseq_syscall().
>> 
>> Same question for PowerPC as for ARM:
>> 
>> Considering that rseq_syscall is implemented as follows:
>> 
>> +void rseq_syscall(struct pt_regs *regs)
>> +{
>> +       unsigned long ip = instruction_pointer(regs);
>> +       struct task_struct *t = current;
>> +       struct rseq_cs rseq_cs;
>> +
>> +       if (!t->rseq)
>> +               return;
>> +       if (!access_ok(VERIFY_READ, t->rseq, sizeof(*t->rseq)) ||
>> +           rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
>> +               force_sig(SIGSEGV, t);
>> +}
>> 
>> and that x86 calls it from syscall_return_slowpath() (which AFAIU is
>> now used in the fast-path since KPTI), I wonder where we should call
> 
> So we actually detect this after the syscall takes effect, right? I
> wonder whether this could be problematic, because "disallowing syscall"
> in rseq areas may means the syscall won't take effect to some people, I
> guess?
> 
>> this on PowerPC ?  I was under the impression that PowerPC return to
>> userspace fast-path was not calling C code unless work flags were set,
>> but I might be wrong.
>> 
> 
> I think you're right. So we have to introduce callsite to rseq_syscall()
> in syscall path, something like:
> 
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 51695608c68b..a25734a96640 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -222,6 +222,9 @@ system_call_exit:
> 	mtmsrd	r11,1
> #endif /* CONFIG_PPC_BOOK3E */
> 
> +	addi    r3,r1,STACK_FRAME_OVERHEAD
> +	bl	rseq_syscall
> +
> 	ld	r9,TI_FLAGS(r12)
> 	li	r11,-MAX_ERRNO
> 	andi.
> 		r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
> 
> But I think it's important for us to first decide where (before or after
> the syscall) we do the detection.

As Peter said, we don't really care whether it's on syscall entry or exit, as
long as the process gets killed when the erroneous use is detected. I think doing
it on syscall exit is a bit easier because we can clearly access the userspace
TLS, which AFAIU may be less straightforward on syscall entry.

We may want to add #ifdef CONFIG_DEBUG_RSEQ / #endif around the code you
proposed above, so it's only compiled in if CONFIG_DEBUG_RSEQ=y.

On the ARM leg of the email thread, Will Deacon suggests to test whether current->rseq
is non-NULL before calling rseq_syscall(). I wonder if this added check is justified
as the assembly level, considering that this is just a debugging option. We already do
that check at the very beginning of rseq_syscall().

Thoughts ?

Thanks,

Mathieu

> 
> Regards,
> Boqun
> 
>> Thoughts ?
>> 
>> Thanks!
>> 
>> Mathieu
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Failure to allocate HTAB for guest - CMA allocation failures?
From: Daniel Axtens @ 2018-05-17 15:13 UTC (permalink / raw)
  To: linuxppc-dev

Hi,

I have reports from a user who is experiencing intermittent issues
with qemu being unable to allocate memory for the guest HPT. We see:

libvirtError: internal error: process exited while connecting to monitor: Unexpected error in spapr_alloc_htab() at /build/qemu-UwnbKa/qemu-2.5+dfsg/hw/ppc/spapr.c:1030:
qemu-system-ppc64le: Failed to allocate HTAB of requested size, try with smaller maxmem

and in the kernel logs:

[10103945.040498] alloc_contig_range: 19127 callbacks suppressed
[10103945.040502] alloc_contig_range: [7a5d00, 7a6500) PFNs busy
[10103945.040526] alloc_contig_range: [7a5d00, 7a6504) PFNs busy
[10103945.040548] alloc_contig_range: [7a5d00, 7a6508) PFNs busy
[10103945.040569] alloc_contig_range: [7a5d00, 7a650c) PFNs busy
[10103945.040591] alloc_contig_range: [7a5d00, 7a6510) PFNs busy
[10103945.040612] alloc_contig_range: [7a5d00, 7a6514) PFNs busy
[10103945.040634] alloc_contig_range: [7a5d00, 7a6518) PFNs busy
[10103945.040655] alloc_contig_range: [7a5d00, 7a651c) PFNs busy
[10103945.040676] alloc_contig_range: [7a5d00, 7a6520) PFNs busy
[10103945.040698] alloc_contig_range: [7a5d00, 7a6524) PFNs busy

I understand that this is caused when the request for an appropriately
sized and aligned piece of contiguous host memory for the guest hash
page table cannot be satisfied from the CMA. The user was attempting
to start a 16GB guest, so if I can read qemu code correctly, it would
be asking for 128MB of contiguous memory.

The CMA is pretty large - this is taken from /proc/meminfo some time
after the allocation failure:

CmaTotal: 26853376 kB
CmaFree: 4024448 kB

(The CMA is ~25GB, the host has 512GB of RAM.)

My guess is that the CMA has become fragmented (the machine had 112
days of uptime) and that was interfering with the ability of the
kernel to service the request?

Some googling suggests that these sorts of failures have been seen
before:

 * [1] is a Launchpad bug mirrored from the IBM Bugzilla that talks
   about this issue especially in the context of PCI passthrough
   leading to more memory being pinned. No PCI passthrough is
   occurring in this case.

 * [2] is from Red Hat - it seems to be especially focussed on
   particularly huge guests and memory hotplug. I don't think either
   of those apply here either.

I noticed from [1] that there is a patch from Balbir that apparently
helps when VFIO is used - 2e5bbb5461f1 ("KVM: PPC: Book3S HV: Migrate
pinned pages out of CMA"). The user is running a 4.4 kernel with this
backported. There's also reference to some work Alexey was doing to
unpin pages in a more timely fashion. It looks like that stalled, and
I can't see anything else particularly relevant in the kernel tree
between then and now - although I may well be missing stuff.

So:

 - have I missed anything obvious here/have I gone completely wrong in
   my analysis somewhere?

 - have I missed any great changes since 4.4 that would fix this?

 - is there any ongoing work in increasing CMA availability?

 - I noticed in arch/powerpc/kvm/book3s_hv_builtin.c,
   kvm_cma_resv_ratio is defined as a boot parameter. By default 5% of
   host memory is reserved for CMA. Presumably increasing this will
   increase the likelihood that the kernel can service a request for
   contiguous memory. Are there any recommended tunings here?

 - is there anything else the user could try?

Thanks!

Regards,
Daniel

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632045
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1304300

^ permalink raw reply

* Re: [2/2] powerpc/powernv: Fix NVRAM sleep in invalid context when crashing
From: Michael Ellerman @ 2018-05-17 14:55 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: stable, Nicholas Piggin
In-Reply-To: <20180514155947.22753-3-npiggin@gmail.com>

On Mon, 2018-05-14 at 15:59:47 UTC, Nicholas Piggin wrote:
> Similarly to opal_event_shutdown, opal_nvram_write can be called in
> the crash path with irqs disabled. Special case the delay to avoid
> sleeping in invalid context.
> 
> Cc: stable@vger.kernel.org # v3.2
> Fixes: 3b8070335f ("powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops")
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/c1d2a31397ec51f0370f6bd17b19b3

cheers

^ permalink raw reply

* Re: [v4, 1/2] cxl: Set the PBCQ Tunnel BAR register when enabling capi mode
From: Michael Ellerman @ 2018-05-17 14:54 UTC (permalink / raw)
  To: Philippe Bergheaud, linuxppc-dev
  Cc: fbarrat, clombard, Philippe Bergheaud, benh
In-Reply-To: <20180514082736.3699-1-felix@linux.ibm.com>

On Mon, 2018-05-14 at 08:27:35 UTC, Philippe Bergheaud wrote:
> Skiboot used to set the default Tunnel BAR register value when capi mode
> was enabled. This approach was ok for the cxl driver, but prevented other
> drivers from choosing different values.
> 
> Skiboot versions > 5.11 will not set the default value any longer. This
> patch modifies the cxl driver to set/reset the Tunnel BAR register when
> entering/exiting the cxl mode, with pnv_pci_set_tunnel_bar().
> 
> That should work with old skiboot (since we are re-writing the value
> already set) and new skiboot.
> 
> Signed-off-by: Philippe Bergheaud <felix@linux.ibm.com>
> Reviewed-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/401dca8cbd14fc4b32d93499dcd12a

cheers

^ permalink raw reply

* Re: [PATCH] Revert "powerpc/64: Fix checksum folding in csum_add()"
From: Segher Boessenkool @ 2018-05-17 14:38 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Scott Wood, Shile Zhang, linuxppc-dev, linux-kernel
In-Reply-To: <20180410063437.217D2653BC@po15720vm.idsi0.si.c-s.fr>

On Tue, Apr 10, 2018 at 08:34:37AM +0200, Christophe Leroy wrote:
> This reverts commit 6ad966d7303b70165228dba1ee8da1a05c10eefe.
> 
> That commit was pointless, because csum_add() sums two 32 bits
> values, so the sum is 0x1fffffffe at the maximum.
> And then when adding upper part (1) and lower part (0xfffffffe),
> the result is 0xffffffff which doesn't carry.
> Any lower value will not carry either.
> 
> And behind the fact that this commit is useless, it also kills the
> whole purpose of having an arch specific inline csum_add()
> because the resulting code gets even worse than what is obtained
> with the generic implementation of csum_add()

:-)

> And the reverted implementation for PPC64 gives:
> 
> 0000000000000240 <.csum_add>:
>  240:	7c 84 1a 14 	add     r4,r4,r3
>  244:	78 80 00 22 	rldicl  r0,r4,32,32
>  248:	7c 80 22 14 	add     r4,r0,r4
>  24c:	78 83 00 20 	clrldi  r3,r4,32
>  250:	4e 80 00 20 	blr

If you really, really, *really* want to optimise this you could
make it:

	rldimi r3,r3,0,32
	rldimi r4,r4,0,32
	add r3,r3,r4
	srdi r3,r3,32
	blr

which is the same size, but has a shorter critical path length.  Very
analogous to how you fold 64->32.


Segher

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm
From: Segher Boessenkool @ 2018-05-17 14:27 UTC (permalink / raw)
  To: Christophe LEROY
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, linux-kernel
In-Reply-To: <74ce3f30-6c06-e884-f1ea-1539edbf1a74@c-s.fr>

On Thu, May 17, 2018 at 03:27:37PM +0200, Christophe LEROY wrote:
> Le 17/05/2018 à 15:15, Segher Boessenkool a écrit :
> >>I guess we've been enabling this for all 32-bit targets for ever so it
> >>must be a reasonable option.
> >
> >On 603, load multiple (and string) are one cycle slower than doing all the
> >loads separately, and store is essentially the same as separate stores.
> >On 7xx and 7xxx both loads and stores are one cycle slower as multiple
> >than as separate insns.
> 
> That's in theory when the instructions are already in the cache.
> 
> But loading several instructions into the cache takes time.

Yes, of course, that's why I wrote:

> >load/store multiple are nice for saving/storing registers.

:-)


Segher

^ permalink raw reply

* Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32
From: Christophe LEROY @ 2018-05-17 14:21 UTC (permalink / raw)
  To: Michael Ellerman, Nicholas Piggin
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	linux-kernel
In-Reply-To: <87wow2mm2b.fsf@concordia.ellerman.id.au>



Le 17/05/2018 à 15:46, Michael Ellerman a écrit :
> Nicholas Piggin <npiggin@gmail.com> writes:
> 
>> On Thu, 17 May 2018 12:04:13 +0200 (CEST)
>> Christophe Leroy <christophe.leroy@c-s.fr> wrote:
>>
>>> commit 87a156fb18fe1 ("Align hot loops of some string functions")
>>> degraded the performance of string functions by adding useless
>>> nops
>>>
>>> A simple benchmark on an 8xx calling 100000x a memchr() that
>>> matches the first byte runs in 41668 TB ticks before this patch
>>> and in 35986 TB ticks after this patch. So this gives an
>>> improvement of approx 10%
>>>
>>> Another benchmark doing the same with a memchr() matching the 128th
>>> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
>>> after this patch, so regardless on the number of loops, removing
>>> those useless nops improves the test by 5683 TB ticks.
>>>
>>> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
>>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>>> ---
>>>   Was sent already as part of a serie optimising string functions.
>>>   Resending on itself as it is independent of the other changes in the
>>> serie
>>>
>>>   arch/powerpc/lib/string.S | 6 ++++++
>>>   1 file changed, 6 insertions(+)
>>>
>>> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
>>> index a787776822d8..a026d8fa8a99 100644
>>> --- a/arch/powerpc/lib/string.S
>>> +++ b/arch/powerpc/lib/string.S
>>> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>>>   	mtctr	r5
>>>   	addi	r6,r3,-1
>>>   	addi	r4,r4,-1
>>> +#ifdef CONFIG_PPC64
>>>   	.balign 16
>>> +#endif
>>>   1:	lbzu	r0,1(r4)
>>>   	cmpwi	0,r0,0
>>>   	stbu	r0,1(r6)
>>
>> The ifdefs are a bit ugly, but you can't argue with the numbers. These
>> alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
>> the ifetch performance when you have such a loop (although there is
>> always a tradeoff for a single iteration).
>>
>> Would it make sense to define that for 32-bit as well, and you could use
>> it here instead of the ifdefs? Small CPUs could just use 0.
> 
> Can we do it with a macro in the header, eg. like:
> 
> #ifdef CONFIG_PPC64
> #define IFETCH_BALIGN	.balign IFETCH_ALIGN_BYTES
> #endif
> 
> ...
> 
>    	addi	r4,r4,-1
>    	IFETCH_BALIGN
>    1:	lbzu	r0,1(r4)
> 
> 

Why not just define IFETCH_ALIGN_SHIFT for PPC32 as well in asm/cache.h 
?, then replace the .balign 16 by .balign IFETCH_ALIGN_BYTES (or .align 
IFETCH_ALIGN_SHIFT) ?

Christophe

> cheers
> 

^ permalink raw reply

* Re: [PATCH v4 3/4] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp()
From: Michael Ellerman @ 2018-05-17 14:13 UTC (permalink / raw)
  To: wei.guo.simon, linuxppc-dev
  Cc: Paul Mackerras, Naveen N.  Rao, Cyril Bur, Simon Guo
In-Reply-To: <1526459661-17323-4-git-send-email-wei.guo.simon@gmail.com>

wei.guo.simon@gmail.com writes:
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> This patch is based on the previous VMX patch on memcmp().
>
> To optimize ppc64 memcmp() with VMX instruction, we need to think about
> the VMX penalty brought with: If kernel uses VMX instruction, it needs
> to save/restore current thread's VMX registers. There are 32 x 128 bits
> VMX registers in PPC, which means 32 x 16 = 512 bytes for load and store.
>
> The major concern regarding the memcmp() performance in kernel is KSM,
> who will use memcmp() frequently to merge identical pages. So it will
> make sense to take some measures/enhancement on KSM to see whether any
> improvement can be done here.  Cyril Bur indicates that the memcmp() for
> KSM has a higher possibility to fail (unmatch) early in previous bytes
> in following mail.
> 	https://patchwork.ozlabs.org/patch/817322/#1773629
> And I am taking a follow-up on this with this patch.
>
> Per some testing, it shows KSM memcmp() will fail early at previous 32
> bytes.  More specifically:
>     - 76% cases will fail/unmatch before 16 bytes;
>     - 83% cases will fail/unmatch before 32 bytes;
>     - 84% cases will fail/unmatch before 64 bytes;
> So 32 bytes looks a better choice than other bytes for pre-checking.
>
> This patch adds a 32 bytes pre-checking firstly before jumping into VMX
> operations, to avoid the unnecessary VMX penalty. And the testing shows
> ~20% improvement on memcmp() average execution time with this patch.
>
> The detail data and analysis is at:
> https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md
>
> Any suggestion is welcome.

Thanks for digging into that, really great work.

I'm inclined to make this not depend on KSM though. It seems like a good
optimisation to do in general.

So can we just call it the 'pre-check' or something, and always do it?

cheers

> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index 6303bbf..df2eec0 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -405,6 +405,35 @@ _GLOBAL(memcmp)
>  	/* Enter with src/dst addrs has the same offset with 8 bytes
>  	 * align boundary
>  	 */
> +
> +#ifdef CONFIG_KSM
> +	/* KSM will always compare at page boundary so it falls into
> +	 * .Lsameoffset_vmx_cmp.
> +	 *
> +	 * There is an optimization for KSM based on following fact:
> +	 * KSM pages memcmp() prones to fail early at the first bytes. In
> +	 * a statisis data, it shows 76% KSM memcmp() fails at the first
> +	 * 16 bytes, and 83% KSM memcmp() fails at the first 32 bytes, 84%
> +	 * KSM memcmp() fails at the first 64 bytes.
> +	 *
> +	 * Before applying VMX instructions which will lead to 32x128bits VMX
> +	 * regs load/restore penalty, let us compares the first 32 bytes
> +	 * so that we can catch the ~80% fail cases.
> +	 */
> +
> +	li	r0,4
> +	mtctr	r0
> +.Lksm_32B_loop:
> +	LD	rA,0,r3
> +	LD	rB,0,r4
> +	cmpld	cr0,rA,rB
> +	addi	r3,r3,8
> +	addi	r4,r4,8
> +	bne     cr0,.LcmpAB_lightweight
> +	addi	r5,r5,-8
> +	bdnz	.Lksm_32B_loop
> +#endif
> +
>  	ENTER_VMX_OPS
>  	beq     cr1,.Llong_novmx_cmp
>  
> -- 
> 1.8.3.1

^ permalink raw reply

* Re: [PATCH 2/2] selftests/powerpc: Add core file test for Protection Key registers
From: Michael Ellerman @ 2018-05-17 14:03 UTC (permalink / raw)
  To: Thiago Jung Bauermann, linuxppc-dev
  Cc: linux-kselftest, linux-kernel, Ram Pai, Thiago Jung Bauermann
In-Reply-To: <20180223183344.21038-2-bauerman@linux.vnet.ibm.com>

Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> writes:

> This test verifies that the AMR, IAMR and UAMOR are being written to a
> process' core file.
>
> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
> ---
>  tools/testing/selftests/powerpc/ptrace/Makefile    |   5 +-
>  tools/testing/selftests/powerpc/ptrace/core-pkey.c | 460 +++++++++++++++++++++

Also failing w/out pkeys:

  test: core_pkey
  tags: git_version:52e7d87
  [FAIL] Test FAILED on line 117
  [FAIL] Test FAILED on line 265
  failure: core_pkey


cheers

^ permalink raw reply

* Re: [PATCH 1/2] selftests/powerpc: Add ptrace tests for Protection Key registers
From: Michael Ellerman @ 2018-05-17 14:03 UTC (permalink / raw)
  To: Thiago Jung Bauermann, linuxppc-dev
  Cc: linux-kselftest, linux-kernel, Ram Pai, Thiago Jung Bauermann
In-Reply-To: <20180223183344.21038-1-bauerman@linux.vnet.ibm.com>

Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> writes:

> This test exercises read and write access to the AMR, IAMR and UAMOR.
>
> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
> ---
>  tools/testing/selftests/powerpc/include/reg.h      |   1 +
>  tools/testing/selftests/powerpc/ptrace/Makefile    |   5 +-
>  tools/testing/selftests/powerpc/ptrace/child.h     | 130 ++++++++
>  .../testing/selftests/powerpc/ptrace/ptrace-pkey.c | 326 +++++++++++++++++++++

This is failing on machines without pkeys:

  test: ptrace_pkey
  tags: git_version:52e7d87
  [FAIL] Test FAILED on line 117
  [FAIL] Test FAILED on line 191
  failure: ptrace_pkey


I think the first fail is in the child here:

int ptrace_read_regs(pid_t child, unsigned long type, unsigned long regs[],
		     int n)
{
	struct iovec iov;
	long ret;

	FAIL_IF(start_trace(child));

	iov.iov_base = regs;
	iov.iov_len = n * sizeof(unsigned long);

	ret = ptrace(PTRACE_GETREGSET, child, type, &iov);
	FAIL_IF(ret != 0);


Which makes sense.

The test needs to skip if pkeys are not available/enabled. Using the
availability of the REGSET might actually be a nice way to detect that,
because it's read-only.

cheers

^ permalink raw reply

* Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes
From: Segher Boessenkool @ 2018-05-17 13:55 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel
In-Reply-To: <8a6f90d882c8b60e5fa0826cd23dd70a92075659.1526553552.git.christophe.leroy@c-s.fr>

On Thu, May 17, 2018 at 12:49:58PM +0200, Christophe Leroy wrote:
> In my 8xx configuration, I get 208 calls to memcmp()
> Within those 208 calls, about half of them have constant sizes,
> 46 have a size of 8, 17 have a size of 16, only a few have a
> size over 16. Other fixed sizes are mostly 4, 6 and 10.
> 
> This patch inlines calls to memcmp() when size
> is constant and lower than or equal to 16
> 
> In my 8xx configuration, this reduces the number of calls
> to memcmp() from 208 to 123
> 
> The following table shows the number of TB timeticks to perform
> a constant size memcmp() before and after the patch depending on
> the size
> 
> 	Before	After	Improvement
> 01:	 7577	 5682	25%
> 02:	41668	 5682	86%
> 03:	51137	13258	74%
> 04:	45455	 5682	87%
> 05:	58713	13258	77%
> 06:	58712	13258	77%
> 07:	68183	20834	70%
> 08:	56819	15153	73%
> 09:	70077	28411	60%
> 10:	70077	28411	60%
> 11:	79546	35986	55%
> 12:	68182	28411	58%
> 13:	81440	35986	55%
> 14:	81440	39774	51%
> 15:	94697	43562	54%
> 16:	79546	37881	52%

Could you show results with a more recent GCC?  What version was this?

What is this really measuring?  I doubt it takes 7577 (or 5682) timebase
ticks to do a 1-byte memcmp, which is just 3 instructions after all.


Segher

^ permalink raw reply

* Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32
From: Michael Ellerman @ 2018-05-17 13:46 UTC (permalink / raw)
  To: Nicholas Piggin, Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	linux-kernel
In-Reply-To: <20180517231938.4b1b8172@roar.ozlabs.ibm.com>

Nicholas Piggin <npiggin@gmail.com> writes:

> On Thu, 17 May 2018 12:04:13 +0200 (CEST)
> Christophe Leroy <christophe.leroy@c-s.fr> wrote:
>
>> commit 87a156fb18fe1 ("Align hot loops of some string functions")
>> degraded the performance of string functions by adding useless
>> nops
>> 
>> A simple benchmark on an 8xx calling 100000x a memchr() that
>> matches the first byte runs in 41668 TB ticks before this patch
>> and in 35986 TB ticks after this patch. So this gives an
>> improvement of approx 10%
>> 
>> Another benchmark doing the same with a memchr() matching the 128th
>> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
>> after this patch, so regardless on the number of loops, removing
>> those useless nops improves the test by 5683 TB ticks.
>> 
>> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>>  Was sent already as part of a serie optimising string functions.
>>  Resending on itself as it is independent of the other changes in the
>> serie
>> 
>>  arch/powerpc/lib/string.S | 6 ++++++
>>  1 file changed, 6 insertions(+)
>> 
>> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
>> index a787776822d8..a026d8fa8a99 100644
>> --- a/arch/powerpc/lib/string.S
>> +++ b/arch/powerpc/lib/string.S
>> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>>  	mtctr	r5
>>  	addi	r6,r3,-1
>>  	addi	r4,r4,-1
>> +#ifdef CONFIG_PPC64
>>  	.balign 16
>> +#endif
>>  1:	lbzu	r0,1(r4)
>>  	cmpwi	0,r0,0
>>  	stbu	r0,1(r6)
>
> The ifdefs are a bit ugly, but you can't argue with the numbers. These
> alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
> the ifetch performance when you have such a loop (although there is
> always a tradeoff for a single iteration).
>
> Would it make sense to define that for 32-bit as well, and you could use
> it here instead of the ifdefs? Small CPUs could just use 0.

Can we do it with a macro in the header, eg. like:

#ifdef CONFIG_PPC64
#define IFETCH_BALIGN	.balign IFETCH_ALIGN_BYTES
#endif

...

  	addi	r4,r4,-1
  	IFETCH_BALIGN
  1:	lbzu	r0,1(r4)


cheers

^ permalink raw reply

* Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes
From: Benjamin Herrenschmidt @ 2018-05-17 13:44 UTC (permalink / raw)
  To: Christophe LEROY, Mathieu Malaterre
  Cc: Paul Mackerras, Michael Ellerman, linuxppc-dev, LKML
In-Reply-To: <c184c807-dac0-0617-f449-492f5b6eeec1@c-s.fr>

On Thu, 2018-05-17 at 15:21 +0200, Christophe LEROY wrote:
> > > +static inline int __memcmp8(const void *p, const void *q, int off)
> > > +{
> > > +       s64 tmp = be64_to_cpu(*(u64*)(p + off)) - be64_to_cpu(*(u64*)(q + off));
> > 
> > I always assumed 64bits unaligned access would trigger an exception.
> > Is this correct ?
> 
> As far as I know, an unaligned access will only occur when the operand 
> of lmw, stmw, lwarx, or stwcx. is not aligned.
> 
> Maybe that's different for PPC64 ?

It's very implementation specific.

Recent ppc64 chips generally don't trap (unless it's cache inhibited
space). Earlier variants might trap on page boundaries or segment
boundaries. Some embedded parts are less forgiving... some earlier
POWER chips will trap on unaligned in LE mode...

I wouldn't worry too much about it though. I think if 8xx shows an
improvement then it's probably fine everywhere else :-)

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm
From: Benjamin Herrenschmidt @ 2018-05-17 13:39 UTC (permalink / raw)
  To: Michael Ellerman, Christophe Leroy, Paul Mackerras
  Cc: linux-kernel, linuxppc-dev
In-Reply-To: <87zi0ymqj6.fsf@concordia.ellerman.id.au>

On Thu, 2018-05-17 at 22:10 +1000, Michael Ellerman wrote:
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
> > arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
> > in order to use multiple word instructions in functions entry/exit
> 
> True, though that could be a lot simpler because the MULTIPLEWORD value
> is only used for PPC32, which is always big endian. I'll send a patch
> for that.

There have been known cases of 4xx LE ports though none ever made it
upstream ...

> > The patch does the same for the asm parts, for consistency
> > 
> > On processors like the 8xx on which insn fetching is pretty slow,
> > this speeds up registers save/restore
> 
> OK. I've always heard that they should be avoided, but that's coming
> from 64-bit land.
> 
> I guess we've been enabling this for all 32-bit targets for ever so it
> must be a reasonable option.
> 
> > Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> > ---
> >  v2: Swapped both patches in the serie to reduce number of impacted
> >      lines and added the same modification in ppc_save_regs()
> > 
> >  arch/powerpc/include/asm/ppc_asm.h  |  5 +++++
> >  arch/powerpc/kernel/misc.S          | 10 ++++++++++
> >  arch/powerpc/kernel/ppc_save_regs.S |  4 ++++
> >  3 files changed, 19 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
> > index 13f7f4c0e1ea..4bb765d0b758 100644
> > --- a/arch/powerpc/include/asm/ppc_asm.h
> > +++ b/arch/powerpc/include/asm/ppc_asm.h
> > @@ -80,11 +80,16 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
> >  #else
> >  #define SAVE_GPR(n, base)	stw	n,GPR0+4*(n)(base)
> >  #define REST_GPR(n, base)	lwz	n,GPR0+4*(n)(base)
> > +#ifdef CONFIG_CPU_BIG_ENDIAN
> > +#define SAVE_NVGPRS(base)	stmw	13, GPR0+4*13(base)
> > +#define REST_NVGPRS(base)	lmw	13, GPR0+4*13(base)
> > +#else
> >  #define SAVE_NVGPRS(base)	SAVE_GPR(13, base); SAVE_8GPRS(14, base); \
> >  				SAVE_10GPRS(22, base)
> >  #define REST_NVGPRS(base)	REST_GPR(13, base); REST_8GPRS(14, base); \
> >  				REST_10GPRS(22, base)
> 
> There is no 32-bit little endian, so this is basically dead code now.
> 
> Maybe there'll be a 32-bit LE port one day, but if so we can put the
> code back then.
> 
> So I'll just drop the else case.
> 
> >  #endif
> > +#endif
> >  
> >  #define SAVE_2GPRS(n, base)	SAVE_GPR(n, base); SAVE_GPR(n+1, base)
> >  #define SAVE_4GPRS(n, base)	SAVE_2GPRS(n, base); SAVE_2GPRS(n+2, base)
> > diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
> > index 746ee0320ad4..a316d90a5c26 100644
> > --- a/arch/powerpc/kernel/misc.S
> > +++ b/arch/powerpc/kernel/misc.S
> > @@ -49,6 +49,10 @@ _GLOBAL(setjmp)
> >  	PPC_STL	r0,0(r3)
> >  	PPC_STL	r1,SZL(r3)
> >  	PPC_STL	r2,2*SZL(r3)
> > +#if defined(CONFIG_PPC32) && defined(CONFIG_CPU_BIG_ENDIAN)
> 
> And this could just be:
> 
> #ifdef CONFIG_PPC32
> 
> > +	mfcr	r12
> > +	stmw	r12, 3*SZL(r3)
> > +#else
> >  	mfcr	r0
> >  	PPC_STL	r0,3*SZL(r3)
> >  	PPC_STL	r13,4*SZL(r3)
> > @@ -70,10 +74,15 @@ _GLOBAL(setjmp)
> >  	PPC_STL	r29,20*SZL(r3)
> >  	PPC_STL	r30,21*SZL(r3)
> >  	PPC_STL	r31,22*SZL(r3)
> > +#endif
> 
> It's a pity to end up with this basically split in half by ifdefs for
> 32/64-bit, but maybe we can clean that up later.
> 
> cheers

^ permalink raw reply

* Re: [PATCH v6] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()
From: Nicholas Piggin @ 2018-05-17 13:36 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel
In-Reply-To: <20180517105930.71B1D6F937@po14934vm.idsi0.si.c-s.fr>

On Thu, 17 May 2018 12:59:29 +0200 (CEST)
Christophe Leroy <christophe.leroy@c-s.fr> wrote:

> Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
> userspace instruction miss") has shown that limiting the read of
> faulting instruction to likely cases improves performance.
> 
> This patch goes further into this direction by limiting the read
> of the faulting instruction to the only cases where it is definitly
> needed.
> 
> On an MPC885, with the same benchmark app as in the commit referred
> above, we see a reduction of 4000 dTLB misses (approx 3%):
> 
> Before the patch:
>  Performance counter stats for './fault 500' (10 runs):
> 
>          720495838      cpu-cycles		( +-  0.04% )
>             141769      dTLB-load-misses	( +-  0.02% )
>              52722      iTLB-load-misses	( +-  0.01% )
>              19611      faults			( +-  0.02% )
> 
>        5.750535176 seconds time elapsed		( +-  0.16% )
> 
> With the patch:
>  Performance counter stats for './fault 500' (10 runs):
> 
>          717669123      cpu-cycles		( +-  0.02% )
>             137344      dTLB-load-misses	( +-  0.03% )
>              52731      iTLB-load-misses	( +-  0.01% )
>              19614      faults			( +-  0.03% )
> 
>        5.728423115 seconds time elapsed		( +-  0.14% )
> 
> The proper work of the huge stack expansion was tested with the
> following app:
> 
> int main(int argc, char **argv)
> {
> 	char buf[1024 * 1025];
> 
> 	sprintf(buf, "Hello world !\n");
> 	printf(buf);
> 
> 	exit(0);
> }
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  v6: Rebased on latest powerpc/merge branch ; Using __get_user_inatomic() instead of get_user() in order
>      to move it inside the semaphored area. That removes all the complexity of the patch.
> 
>  v5: Reworked to fit after Benh do_fault improvement and rebased on top of powerpc/merge (65152902e43fef)
> 
>  v4: Rebased on top of powerpc/next (f718d426d7e42e) and doing access_ok() verification before __get_user_xxx()
> 
>  v3: Do a first try with pagefault disabled before releasing the semaphore
> 
>  v2: Changes 'if (cond1) if (cond2)' by 'if (cond1 && cond2)'
> 
>  arch/powerpc/mm/fault.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index c01d627e687a..a7d5cc76a8ce 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -72,8 +72,18 @@ static inline bool notify_page_fault(struct pt_regs *regs)
>  static bool store_updates_sp(struct pt_regs *regs)
>  {
>  	unsigned int inst;
> +	int ret;
>  
> -	if (get_user(inst, (unsigned int __user *)regs->nip))
> +	/*
> +	 * Using get_user_in_atomic() as reading code around nip can result in
> +	 * fault, which may cause a deadlock when called with mmap_sem held,
> +	 * however since we are reading the instruction that generated the DSI
> +	 * we are handling, the page is necessarily already present.
> +	 */
> +	pagefault_disable();
> +	ret = __get_user_inatomic(inst, (unsigned int __user *)regs->nip);
> +	pagefault_enable();
> +	if (ret)
>  		return false;

Problem is that the page can be removed from page tables between
taking the fault and reading the address here.

That case would be so rare that it should be fine to do a big hammer
fix like drop the mmap_sem, do a fault_in_pages_readable, and then
restart from taking the mmap_sem again.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v2 1/5] powerpc/lib: move PPC32 specific functions out of string.S
From: Segher Boessenkool @ 2018-05-17 13:33 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel
In-Reply-To: <68d0167111d06a8b21d7e52140bf399d79979e3b.1526553552.git.christophe.leroy@c-s.fr>

On Thu, May 17, 2018 at 12:49:50PM +0200, Christophe Leroy wrote:
> In preparation of optimisation patches, move PPC32 specific
> memcmp() and __clear_user() into string_32.S

> --- /dev/null
> +++ b/arch/powerpc/lib/string_32.S
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * String handling functions for PowerPC32
> + *
> + * Copyright (C) 2018 CS Systemes d'Information
> + *
> + * Author: Christophe Leroy <christophe.leroy@c-s.fr>
> + *
> + */

That is wrong; that code is a plain copy, from 1996 already.


Segher

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm
From: Christophe LEROY @ 2018-05-17 13:27 UTC (permalink / raw)
  To: Segher Boessenkool, Michael Ellerman
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	linux-kernel
In-Reply-To: <20180517131550.GR17342@gate.crashing.org>



Le 17/05/2018 à 15:15, Segher Boessenkool a écrit :
> On Thu, May 17, 2018 at 10:10:21PM +1000, Michael Ellerman wrote:
>> Christophe Leroy <christophe.leroy@c-s.fr> writes:
>>> arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
>>> in order to use multiple word instructions in functions entry/exit
>>
>> True, though that could be a lot simpler because the MULTIPLEWORD value
>> is only used for PPC32, which is always big endian. I'll send a patch
>> for that.
> 
> Do you mean in the kernel?  Many 32-bit processors can do LE, and many
> do not implement multiple or string insns in LE mode.
> 
>>> The patch does the same for the asm parts, for consistency
>>>
>>> On processors like the 8xx on which insn fetching is pretty slow,
>>> this speeds up registers save/restore
>>
>> OK. I've always heard that they should be avoided, but that's coming
>> from 64-bit land.
>>
>> I guess we've been enabling this for all 32-bit targets for ever so it
>> must be a reasonable option.
> 
> On 603, load multiple (and string) are one cycle slower than doing all the
> loads separately, and store is essentially the same as separate stores.
> On 7xx and 7xxx both loads and stores are one cycle slower as multiple
> than as separate insns.

That's in theory when the instructions are already in the cache.

But loading several instructions into the cache takes time.

Christophe

> 
> load/store multiple are nice for saving/storing registers.
> 
> 
> Segher
> 

^ permalink raw reply

* Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32
From: Nicholas Piggin @ 2018-05-17 13:26 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel
In-Reply-To: <20180517100413.856096F938@po14934vm.idsi0.si.c-s.fr>

On Thu, 17 May 2018 12:04:13 +0200 (CEST)
Christophe Leroy <christophe.leroy@c-s.fr> wrote:

> commit 87a156fb18fe1 ("Align hot loops of some string functions")
> degraded the performance of string functions by adding useless
> nops
> 
> A simple benchmark on an 8xx calling 100000x a memchr() that
> matches the first byte runs in 41668 TB ticks before this patch
> and in 35986 TB ticks after this patch. So this gives an
> improvement of approx 10%
> 
> Another benchmark doing the same with a memchr() matching the 128th
> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
> after this patch, so regardless on the number of loops, removing
> those useless nops improves the test by 5683 TB ticks.
> 
> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  Was sent already as part of a serie optimising string functions.
>  Resending on itself as it is independent of the other changes in the
> serie
> 
>  arch/powerpc/lib/string.S | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
> index a787776822d8..a026d8fa8a99 100644
> --- a/arch/powerpc/lib/string.S
> +++ b/arch/powerpc/lib/string.S
> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>  	mtctr	r5
>  	addi	r6,r3,-1
>  	addi	r4,r4,-1
> +#ifdef CONFIG_PPC64
>  	.balign 16
> +#endif
>  1:	lbzu	r0,1(r4)
>  	cmpwi	0,r0,0
>  	stbu	r0,1(r6)

The ifdefs are a bit ugly, but you can't argue with the numbers. These
alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
the ifetch performance when you have such a loop (although there is
always a tradeoff for a single iteration).

Would it make sense to define that for 32-bit as well, and you could use
it here instead of the ifdefs? Small CPUs could just use 0.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes
From: Christophe LEROY @ 2018-05-17 13:21 UTC (permalink / raw)
  To: Mathieu Malaterre
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, LKML
In-Reply-To: <CA+7wUsyEBW-sKiwP_KJpwuWC0JGi_Cg-xnxcnw0d_Wc7-GWeeA@mail.gmail.com>



Le 17/05/2018 à 15:03, Mathieu Malaterre a écrit :
> On Thu, May 17, 2018 at 12:49 PM, Christophe Leroy
> <christophe.leroy@c-s.fr> wrote:
>> In my 8xx configuration, I get 208 calls to memcmp()
>> Within those 208 calls, about half of them have constant sizes,
>> 46 have a size of 8, 17 have a size of 16, only a few have a
>> size over 16. Other fixed sizes are mostly 4, 6 and 10.
>>
>> This patch inlines calls to memcmp() when size
>> is constant and lower than or equal to 16
>>
>> In my 8xx configuration, this reduces the number of calls
>> to memcmp() from 208 to 123
>>
>> The following table shows the number of TB timeticks to perform
>> a constant size memcmp() before and after the patch depending on
>> the size
>>
>>          Before  After   Improvement
>> 01:      7577    5682   25%
>> 02:     41668    5682   86%
>> 03:     51137   13258   74%
>> 04:     45455    5682   87%
>> 05:     58713   13258   77%
>> 06:     58712   13258   77%
>> 07:     68183   20834   70%
>> 08:     56819   15153   73%
>> 09:     70077   28411   60%
>> 10:     70077   28411   60%
>> 11:     79546   35986   55%
>> 12:     68182   28411   58%
>> 13:     81440   35986   55%
>> 14:     81440   39774   51%
>> 15:     94697   43562   54%
>> 16:     79546   37881   52%
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>>   arch/powerpc/include/asm/string.h | 46 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 46 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
>> index 35f1aaad9b50..80cf0f9605dd 100644
>> --- a/arch/powerpc/include/asm/string.h
>> +++ b/arch/powerpc/include/asm/string.h
>> @@ -4,6 +4,8 @@
>>
>>   #ifdef __KERNEL__
>>
>> +#include <linux/kernel.h>
>> +
>>   #define __HAVE_ARCH_STRNCPY
>>   #define __HAVE_ARCH_STRNCMP
>>   #define __HAVE_ARCH_MEMSET
>> @@ -51,10 +53,54 @@ static inline int strncmp(const char *p, const char *q, __kernel_size_t size)
>>          return __strncmp(p, q, size);
>>   }
>>
>> +static inline int __memcmp1(const void *p, const void *q, int off)
> 
> Does that change anything if you change void* to char* pointer ? I
> find void* arithmetic hard to read.

Yes that's not the same

void* means you can use any pointer, for instance pointers to two 
structs you want to compare.

char* will force users to cast their pointers to char*

> 
>> +{
>> +       return *(u8*)(p + off) - *(u8*)(q + off);
>> +}
>> +
>> +static inline int __memcmp2(const void *p, const void *q, int off)
>> +{
>> +       return be16_to_cpu(*(u16*)(p + off)) - be16_to_cpu(*(u16*)(q + off));
>> +}
>> +
>> +static inline int __memcmp4(const void *p, const void *q, int off)
>> +{
>> +       return be32_to_cpu(*(u32*)(p + off)) - be32_to_cpu(*(u32*)(q + off));
>> +}
>> +
>> +static inline int __memcmp8(const void *p, const void *q, int off)
>> +{
>> +       s64 tmp = be64_to_cpu(*(u64*)(p + off)) - be64_to_cpu(*(u64*)(q + off));
> 
> I always assumed 64bits unaligned access would trigger an exception.
> Is this correct ?

As far as I know, an unaligned access will only occur when the operand 
of lmw, stmw, lwarx, or stwcx. is not aligned.

Maybe that's different for PPC64 ?

Christophe

> 
>> +       return tmp >> 32 ? : (int)tmp;
>> +}
>> +
>> +static inline int __memcmp_cst(const void *p,const void *q,__kernel_size_t size)
>> +{
>> +       if (size == 1)
>> +               return __memcmp1(p, q, 0);
>> +       if (size == 2)
>> +               return __memcmp2(p, q, 0);
>> +       if (size == 3)
>> +               return __memcmp2(p, q, 0) ? : __memcmp1(p, q, 2);
>> +       if (size == 4)
>> +               return __memcmp4(p, q, 0);
>> +       if (size == 5)
>> +               return __memcmp4(p, q, 0) ? : __memcmp1(p, q, 4);
>> +       if (size == 6)
>> +               return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4);
>> +       if (size == 7)
>> +               return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4) ? : __memcmp1(p, q, 6);
>> +       return __memcmp8(p, q, 0);
>> +}
>> +
>>   static inline int memcmp(const void *p,const void *q,__kernel_size_t size)
>>   {
>>          if (unlikely(!size))
>>                  return 0;
>> +       if (__builtin_constant_p(size) && size <= 8)
>> +               return __memcmp_cst(p, q, size);
>> +       if (__builtin_constant_p(size) && size <= 16)
>> +               return __memcmp8(p, q, 0) ? : __memcmp_cst(p + 8, q + 8, size - 8);
>>          return __memcmp(p, q, size);
>>   }
>>
>> --
>> 2.13.3
>>

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm
From: Segher Boessenkool @ 2018-05-17 13:15 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, linux-kernel
In-Reply-To: <87zi0ymqj6.fsf@concordia.ellerman.id.au>

On Thu, May 17, 2018 at 10:10:21PM +1000, Michael Ellerman wrote:
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
> > arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
> > in order to use multiple word instructions in functions entry/exit
> 
> True, though that could be a lot simpler because the MULTIPLEWORD value
> is only used for PPC32, which is always big endian. I'll send a patch
> for that.

Do you mean in the kernel?  Many 32-bit processors can do LE, and many
do not implement multiple or string insns in LE mode.

> > The patch does the same for the asm parts, for consistency
> >
> > On processors like the 8xx on which insn fetching is pretty slow,
> > this speeds up registers save/restore
> 
> OK. I've always heard that they should be avoided, but that's coming
> from 64-bit land.
> 
> I guess we've been enabling this for all 32-bit targets for ever so it
> must be a reasonable option.

On 603, load multiple (and string) are one cycle slower than doing all the
loads separately, and store is essentially the same as separate stores.
On 7xx and 7xxx both loads and stores are one cycle slower as multiple
than as separate insns.

load/store multiple are nice for saving/storing registers.


Segher

^ permalink raw reply

* Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes
From: Mathieu Malaterre @ 2018-05-17 13:03 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, LKML
In-Reply-To: <8a6f90d882c8b60e5fa0826cd23dd70a92075659.1526553552.git.christophe.leroy@c-s.fr>

On Thu, May 17, 2018 at 12:49 PM, Christophe Leroy
<christophe.leroy@c-s.fr> wrote:
> In my 8xx configuration, I get 208 calls to memcmp()
> Within those 208 calls, about half of them have constant sizes,
> 46 have a size of 8, 17 have a size of 16, only a few have a
> size over 16. Other fixed sizes are mostly 4, 6 and 10.
>
> This patch inlines calls to memcmp() when size
> is constant and lower than or equal to 16
>
> In my 8xx configuration, this reduces the number of calls
> to memcmp() from 208 to 123
>
> The following table shows the number of TB timeticks to perform
> a constant size memcmp() before and after the patch depending on
> the size
>
>         Before  After   Improvement
> 01:      7577    5682   25%
> 02:     41668    5682   86%
> 03:     51137   13258   74%
> 04:     45455    5682   87%
> 05:     58713   13258   77%
> 06:     58712   13258   77%
> 07:     68183   20834   70%
> 08:     56819   15153   73%
> 09:     70077   28411   60%
> 10:     70077   28411   60%
> 11:     79546   35986   55%
> 12:     68182   28411   58%
> 13:     81440   35986   55%
> 14:     81440   39774   51%
> 15:     94697   43562   54%
> 16:     79546   37881   52%
>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  arch/powerpc/include/asm/string.h | 46 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
> index 35f1aaad9b50..80cf0f9605dd 100644
> --- a/arch/powerpc/include/asm/string.h
> +++ b/arch/powerpc/include/asm/string.h
> @@ -4,6 +4,8 @@
>
>  #ifdef __KERNEL__
>
> +#include <linux/kernel.h>
> +
>  #define __HAVE_ARCH_STRNCPY
>  #define __HAVE_ARCH_STRNCMP
>  #define __HAVE_ARCH_MEMSET
> @@ -51,10 +53,54 @@ static inline int strncmp(const char *p, const char *q, __kernel_size_t size)
>         return __strncmp(p, q, size);
>  }
>
> +static inline int __memcmp1(const void *p, const void *q, int off)

Does that change anything if you change void* to char* pointer ? I
find void* arithmetic hard to read.

> +{
> +       return *(u8*)(p + off) - *(u8*)(q + off);
> +}
> +
> +static inline int __memcmp2(const void *p, const void *q, int off)
> +{
> +       return be16_to_cpu(*(u16*)(p + off)) - be16_to_cpu(*(u16*)(q + off));
> +}
> +
> +static inline int __memcmp4(const void *p, const void *q, int off)
> +{
> +       return be32_to_cpu(*(u32*)(p + off)) - be32_to_cpu(*(u32*)(q + off));
> +}
> +
> +static inline int __memcmp8(const void *p, const void *q, int off)
> +{
> +       s64 tmp = be64_to_cpu(*(u64*)(p + off)) - be64_to_cpu(*(u64*)(q + off));

I always assumed 64bits unaligned access would trigger an exception.
Is this correct ?

> +       return tmp >> 32 ? : (int)tmp;
> +}
> +
> +static inline int __memcmp_cst(const void *p,const void *q,__kernel_size_t size)
> +{
> +       if (size == 1)
> +               return __memcmp1(p, q, 0);
> +       if (size == 2)
> +               return __memcmp2(p, q, 0);
> +       if (size == 3)
> +               return __memcmp2(p, q, 0) ? : __memcmp1(p, q, 2);
> +       if (size == 4)
> +               return __memcmp4(p, q, 0);
> +       if (size == 5)
> +               return __memcmp4(p, q, 0) ? : __memcmp1(p, q, 4);
> +       if (size == 6)
> +               return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4);
> +       if (size == 7)
> +               return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4) ? : __memcmp1(p, q, 6);
> +       return __memcmp8(p, q, 0);
> +}
> +
>  static inline int memcmp(const void *p,const void *q,__kernel_size_t size)
>  {
>         if (unlikely(!size))
>                 return 0;
> +       if (__builtin_constant_p(size) && size <= 8)
> +               return __memcmp_cst(p, q, size);
> +       if (__builtin_constant_p(size) && size <= 16)
> +               return __memcmp8(p, q, 0) ? : __memcmp_cst(p + 8, q + 8, size - 8);
>         return __memcmp(p, q, size);
>  }
>
> --
> 2.13.3
>

^ permalink raw reply

* Re: [PATCH 0/3] Add support to disable sensor groups in P9
From: Guenter Roeck @ 2018-05-17 12:38 UTC (permalink / raw)
  To: Shilpasri G Bhat; +Cc: mpe, linuxppc-dev, linux-hwmon, linux-kernel, stewart
In-Reply-To: <2ceb6b30-60b7-d985-ee0d-6c5784eda273@linux.vnet.ibm.com>

On 05/16/2018 11:10 PM, Shilpasri G Bhat wrote:
> 
> 
> On 05/15/2018 08:32 PM, Guenter Roeck wrote:
>> On Thu, Mar 22, 2018 at 04:24:32PM +0530, Shilpasri G Bhat wrote:
>>> This patch series adds support to enable/disable OCC based
>>> inband-sensor groups at runtime. The environmental sensor groups are
>>> managed in HWMON and the remaining platform specific sensor groups are
>>> managed in /sys/firmware/opal.
>>>
>>> The firmware changes required for this patch is posted below:
>>> https://lists.ozlabs.org/pipermail/skiboot/2018-March/010812.html
>>>
>>
>> Sorry for not getting back earlier. This is a tough one.
>>
> 
> Thanks for the reply. I have tried to answer your questions according to my
> understanding below:
> 
>> Key problem is that you are changing the ABI with those new attributes.
>> On top of that, the attributes _do_ make some sense (many chips support
>> enabling/disabling of individual sensors), suggesting that those or
>> similar attributes may or even should at some point be added to the ABI.
>>
>> At the same time, returning "0" as measurement values when sensors are
>> disabled does not seem like a good idea, since "0" is a perfectly valid
>> measurement, at least for most sensors.
> 
> I agree.
> 
>>
>> Given that, we need to have a discussion about adding _enable attributes to
>> the ABI
> 
>> what is the scope,
> IIUC the scope should be RW and the attribute is defined for each supported
> sensor group
> 

That is _your_ need. I am not aware of any other chip where a per-sensor group
attribute would make sense. The discussion we need has to extend beyond the need
of a single chip.

Guenter

>> when should the attributes exist and when not,
> We control this currently via device-tree
> 
>> do we want/need power_enable or powerX_enable or both, and so on), and
> We need power_enable right now
> 
>> what to return if a sensor is disabled (such as -ENODATA).
> -ENODATA sounds good.
> 
> Thanks and Regards,
> Shilpa
> 
>   Once we have an
>> agreement, we can continue with an implementation.
>>
>> Guenter
>>
>>> Shilpasri G Bhat (3):
>>>    powernv:opal-sensor-groups: Add support to enable sensor groups
>>>    hwmon: ibmpowernv: Add attributes to enable/disable sensor groups
>>>    powernv: opal-sensor-groups: Add attributes to disable/enable sensors
>>>
>>>   .../ABI/testing/sysfs-firmware-opal-sensor-groups  |  34 ++++++
>>>   Documentation/hwmon/ibmpowernv                     |  31 ++++-
>>>   arch/powerpc/include/asm/opal-api.h                |   4 +-
>>>   arch/powerpc/include/asm/opal.h                    |   2 +
>>>   .../powerpc/platforms/powernv/opal-sensor-groups.c | 104 ++++++++++++-----
>>>   arch/powerpc/platforms/powernv/opal-wrappers.S     |   1 +
>>>   drivers/hwmon/ibmpowernv.c                         | 127 +++++++++++++++++++--
>>>   7 files changed, 265 insertions(+), 38 deletions(-)
>>>   create mode 100644 Documentation/ABI/testing/sysfs-firmware-opal-sensor-groups
>>>
>>> -- 
>>> 1.8.3.1
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-hwmon" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
> 

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm
From: Michael Ellerman @ 2018-05-17 12:10 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras
  Cc: linux-kernel, linuxppc-dev
In-Reply-To: <a94e38e1dbe0c38ce5b363fac29ab6e65fd9985a.1523984745.git.christophe.leroy@c-s.fr>

Christophe Leroy <christophe.leroy@c-s.fr> writes:
> arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
> in order to use multiple word instructions in functions entry/exit

True, though that could be a lot simpler because the MULTIPLEWORD value
is only used for PPC32, which is always big endian. I'll send a patch
for that.

> The patch does the same for the asm parts, for consistency
>
> On processors like the 8xx on which insn fetching is pretty slow,
> this speeds up registers save/restore

OK. I've always heard that they should be avoided, but that's coming
from 64-bit land.

I guess we've been enabling this for all 32-bit targets for ever so it
must be a reasonable option.

> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  v2: Swapped both patches in the serie to reduce number of impacted
>      lines and added the same modification in ppc_save_regs()
>
>  arch/powerpc/include/asm/ppc_asm.h  |  5 +++++
>  arch/powerpc/kernel/misc.S          | 10 ++++++++++
>  arch/powerpc/kernel/ppc_save_regs.S |  4 ++++
>  3 files changed, 19 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
> index 13f7f4c0e1ea..4bb765d0b758 100644
> --- a/arch/powerpc/include/asm/ppc_asm.h
> +++ b/arch/powerpc/include/asm/ppc_asm.h
> @@ -80,11 +80,16 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
>  #else
>  #define SAVE_GPR(n, base)	stw	n,GPR0+4*(n)(base)
>  #define REST_GPR(n, base)	lwz	n,GPR0+4*(n)(base)
> +#ifdef CONFIG_CPU_BIG_ENDIAN
> +#define SAVE_NVGPRS(base)	stmw	13, GPR0+4*13(base)
> +#define REST_NVGPRS(base)	lmw	13, GPR0+4*13(base)
> +#else
>  #define SAVE_NVGPRS(base)	SAVE_GPR(13, base); SAVE_8GPRS(14, base); \
>  				SAVE_10GPRS(22, base)
>  #define REST_NVGPRS(base)	REST_GPR(13, base); REST_8GPRS(14, base); \
>  				REST_10GPRS(22, base)

There is no 32-bit little endian, so this is basically dead code now.

Maybe there'll be a 32-bit LE port one day, but if so we can put the
code back then.

So I'll just drop the else case.

>  #endif
> +#endif
>  
>  #define SAVE_2GPRS(n, base)	SAVE_GPR(n, base); SAVE_GPR(n+1, base)
>  #define SAVE_4GPRS(n, base)	SAVE_2GPRS(n, base); SAVE_2GPRS(n+2, base)
> diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
> index 746ee0320ad4..a316d90a5c26 100644
> --- a/arch/powerpc/kernel/misc.S
> +++ b/arch/powerpc/kernel/misc.S
> @@ -49,6 +49,10 @@ _GLOBAL(setjmp)
>  	PPC_STL	r0,0(r3)
>  	PPC_STL	r1,SZL(r3)
>  	PPC_STL	r2,2*SZL(r3)
> +#if defined(CONFIG_PPC32) && defined(CONFIG_CPU_BIG_ENDIAN)

And this could just be:

#ifdef CONFIG_PPC32

> +	mfcr	r12
> +	stmw	r12, 3*SZL(r3)
> +#else
>  	mfcr	r0
>  	PPC_STL	r0,3*SZL(r3)
>  	PPC_STL	r13,4*SZL(r3)
> @@ -70,10 +74,15 @@ _GLOBAL(setjmp)
>  	PPC_STL	r29,20*SZL(r3)
>  	PPC_STL	r30,21*SZL(r3)
>  	PPC_STL	r31,22*SZL(r3)
> +#endif

It's a pity to end up with this basically split in half by ifdefs for
32/64-bit, but maybe we can clean that up later.

cheers

^ permalink raw reply

* [PATCH v11 24/26] x86/mm: add speculative pagefault handling
From: Laurent Dufour @ 2018-05-17 11:06 UTC (permalink / raw)
  To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
	khandual, aneesh.kumar, benh, mpe, paulus, Thomas Gleixner,
	Ingo Molnar, hpa, Will Deacon, Sergey Senozhatsky,
	sergey.senozhatsky.work, Andrea Arcangeli, Alexei Starovoitov,
	kemi.wang, Daniel Jordan, David Rientjes, Jerome Glisse,
	Ganesh Mahendran, Minchan Kim, Punit Agrawal, vinayak menon,
	Yang Shi
  Cc: linux-kernel, linux-mm, haren, npiggin, bsingharora, paulmck,
	Tim Chen, linuxppc-dev, x86
In-Reply-To: <1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com>

From: Peter Zijlstra <peterz@infradead.org>

Try a speculative fault before acquiring mmap_sem, if it returns with
VM_FAULT_RETRY continue with the mmap_sem acquisition and do the
traditional fault.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
 handle_speculative_fault()]
[Retry with usual fault path in the case VM_ERROR is returned by
 handle_speculative_fault(). This allows signal to be delivered]
[Don't build SPF call if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Handle memory protection key fault]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/x86/mm/fault.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fd84edf82252..11944bfc805a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1224,7 +1224,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	struct mm_struct *mm;
 	int fault, major = 0;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
-	u32 pkey;
+	u32 pkey, *pt_pkey = &pkey;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1314,6 +1314,27 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		flags |= FAULT_FLAG_INSTRUCTION;
 
 	/*
+	 * Do not try to do a speculative page fault if the fault was due to
+	 * protection keys since it can't be resolved.
+	 */
+	if (!(error_code & X86_PF_PK)) {
+		fault = handle_speculative_fault(mm, address, flags);
+		if (fault != VM_FAULT_RETRY) {
+			perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
+			/*
+			 * Do not advertise for the pkey value since we don't
+			 * know it.
+			 * This is not a matter as we checked for X86_PF_PK
+			 * earlier, so we should not handle pkey fault here,
+			 * but to be sure that mm_fault_error() callees will
+			 * not try to use it, we invalidate the pointer.
+			 */
+			pt_pkey = NULL;
+			goto done;
+		}
+	}
+
+	/*
 	 * When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in
 	 * the kernel and should generate an OOPS.  Unfortunately, in the
@@ -1427,8 +1448,10 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	}
 
 	up_read(&mm->mmap_sem);
+
+done:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
-		mm_fault_error(regs, error_code, address, &pkey, fault);
+		mm_fault_error(regs, error_code, address, pt_pkey, fault);
 		return;
 	}
 
-- 
2.7.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox