LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Failure to allocate HTAB for guest - CMA allocation failures?
From: Daniel Axtens @ 2018-05-17 15:13 UTC (permalink / raw)
  To: linuxppc-dev

Hi,

I have reports from a user who is experiencing intermittent issues
with qemu being unable to allocate memory for the guest HPT. We see:

libvirtError: internal error: process exited while connecting to monitor: Unexpected error in spapr_alloc_htab() at /build/qemu-UwnbKa/qemu-2.5+dfsg/hw/ppc/spapr.c:1030:
qemu-system-ppc64le: Failed to allocate HTAB of requested size, try with smaller maxmem

and in the kernel logs:

[10103945.040498] alloc_contig_range: 19127 callbacks suppressed
[10103945.040502] alloc_contig_range: [7a5d00, 7a6500) PFNs busy
[10103945.040526] alloc_contig_range: [7a5d00, 7a6504) PFNs busy
[10103945.040548] alloc_contig_range: [7a5d00, 7a6508) PFNs busy
[10103945.040569] alloc_contig_range: [7a5d00, 7a650c) PFNs busy
[10103945.040591] alloc_contig_range: [7a5d00, 7a6510) PFNs busy
[10103945.040612] alloc_contig_range: [7a5d00, 7a6514) PFNs busy
[10103945.040634] alloc_contig_range: [7a5d00, 7a6518) PFNs busy
[10103945.040655] alloc_contig_range: [7a5d00, 7a651c) PFNs busy
[10103945.040676] alloc_contig_range: [7a5d00, 7a6520) PFNs busy
[10103945.040698] alloc_contig_range: [7a5d00, 7a6524) PFNs busy

I understand that this is caused when the request for an appropriately
sized and aligned piece of contiguous host memory for the guest hash
page table cannot be satisfied from the CMA. The user was attempting
to start a 16GB guest, so if I can read qemu code correctly, it would
be asking for 128MB of contiguous memory.

The CMA is pretty large - this is taken from /proc/meminfo some time
after the allocation failure:

CmaTotal: 26853376 kB
CmaFree: 4024448 kB

(The CMA is ~25GB, the host has 512GB of RAM.)

My guess is that the CMA has become fragmented (the machine had 112
days of uptime) and that was interfering with the ability of the
kernel to service the request?

Some googling suggests that these sorts of failures have been seen
before:

 * [1] is a Launchpad bug mirrored from the IBM Bugzilla that talks
   about this issue especially in the context of PCI passthrough
   leading to more memory being pinned. No PCI passthrough is
   occurring in this case.

 * [2] is from Red Hat - it seems to be especially focussed on
   particularly huge guests and memory hotplug. I don't think either
   of those apply here either.

I noticed from [1] that there is a patch from Balbir that apparently
helps when VFIO is used - 2e5bbb5461f1 ("KVM: PPC: Book3S HV: Migrate
pinned pages out of CMA"). The user is running a 4.4 kernel with this
backported. There's also reference to some work Alexey was doing to
unpin pages in a more timely fashion. It looks like that stalled, and
I can't see anything else particularly relevant in the kernel tree
between then and now - although I may well be missing stuff.

So:

 - have I missed anything obvious here/have I gone completely wrong in
   my analysis somewhere?

 - have I missed any great changes since 4.4 that would fix this?

 - is there any ongoing work in increasing CMA availability?

 - I noticed in arch/powerpc/kvm/book3s_hv_builtin.c,
   kvm_cma_resv_ratio is defined as a boot parameter. By default 5% of
   host memory is reserved for CMA. Presumably increasing this will
   increase the likelihood that the kernel can service a request for
   contiguous memory. Are there any recommended tunings here?

 - is there anything else the user could try?

Thanks!

Regards,
Daniel

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632045
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1304300

^ permalink raw reply

* Re: [2/2] powerpc/powernv: Fix NVRAM sleep in invalid context when crashing
From: Michael Ellerman @ 2018-05-17 14:55 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: stable, Nicholas Piggin
In-Reply-To: <20180514155947.22753-3-npiggin@gmail.com>

On Mon, 2018-05-14 at 15:59:47 UTC, Nicholas Piggin wrote:
> Similarly to opal_event_shutdown, opal_nvram_write can be called in
> the crash path with irqs disabled. Special case the delay to avoid
> sleeping in invalid context.
> 
> Cc: stable@vger.kernel.org # v3.2
> Fixes: 3b8070335f ("powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops")
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/c1d2a31397ec51f0370f6bd17b19b3

cheers

^ permalink raw reply

* Re: [v4, 1/2] cxl: Set the PBCQ Tunnel BAR register when enabling capi mode
From: Michael Ellerman @ 2018-05-17 14:54 UTC (permalink / raw)
  To: Philippe Bergheaud, linuxppc-dev
  Cc: fbarrat, clombard, Philippe Bergheaud, benh
In-Reply-To: <20180514082736.3699-1-felix@linux.ibm.com>

On Mon, 2018-05-14 at 08:27:35 UTC, Philippe Bergheaud wrote:
> Skiboot used to set the default Tunnel BAR register value when capi mode
> was enabled. This approach was ok for the cxl driver, but prevented other
> drivers from choosing different values.
> 
> Skiboot versions > 5.11 will not set the default value any longer. This
> patch modifies the cxl driver to set/reset the Tunnel BAR register when
> entering/exiting the cxl mode, with pnv_pci_set_tunnel_bar().
> 
> That should work with old skiboot (since we are re-writing the value
> already set) and new skiboot.
> 
> Signed-off-by: Philippe Bergheaud <felix@linux.ibm.com>
> Reviewed-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/401dca8cbd14fc4b32d93499dcd12a

cheers

^ permalink raw reply

* Re: [PATCH] Revert "powerpc/64: Fix checksum folding in csum_add()"
From: Segher Boessenkool @ 2018-05-17 14:38 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Scott Wood, Shile Zhang, linuxppc-dev, linux-kernel
In-Reply-To: <20180410063437.217D2653BC@po15720vm.idsi0.si.c-s.fr>

On Tue, Apr 10, 2018 at 08:34:37AM +0200, Christophe Leroy wrote:
> This reverts commit 6ad966d7303b70165228dba1ee8da1a05c10eefe.
> 
> That commit was pointless, because csum_add() sums two 32 bits
> values, so the sum is 0x1fffffffe at the maximum.
> And then when adding upper part (1) and lower part (0xfffffffe),
> the result is 0xffffffff which doesn't carry.
> Any lower value will not carry either.
> 
> And behind the fact that this commit is useless, it also kills the
> whole purpose of having an arch specific inline csum_add()
> because the resulting code gets even worse than what is obtained
> with the generic implementation of csum_add()

:-)

> And the reverted implementation for PPC64 gives:
> 
> 0000000000000240 <.csum_add>:
>  240:	7c 84 1a 14 	add     r4,r4,r3
>  244:	78 80 00 22 	rldicl  r0,r4,32,32
>  248:	7c 80 22 14 	add     r4,r0,r4
>  24c:	78 83 00 20 	clrldi  r3,r4,32
>  250:	4e 80 00 20 	blr

If you really, really, *really* want to optimise this you could
make it:

	rldimi r3,r3,0,32
	rldimi r4,r4,0,32
	add r3,r3,r4
	srdi r3,r3,32
	blr

which is the same size, but has a shorter critical path length.  Very
analogous to how you fold 64->32.


Segher

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm
From: Segher Boessenkool @ 2018-05-17 14:27 UTC (permalink / raw)
  To: Christophe LEROY
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, linux-kernel
In-Reply-To: <74ce3f30-6c06-e884-f1ea-1539edbf1a74@c-s.fr>

On Thu, May 17, 2018 at 03:27:37PM +0200, Christophe LEROY wrote:
> Le 17/05/2018 à 15:15, Segher Boessenkool a écrit :
> >>I guess we've been enabling this for all 32-bit targets for ever so it
> >>must be a reasonable option.
> >
> >On 603, load multiple (and string) are one cycle slower than doing all the
> >loads separately, and store is essentially the same as separate stores.
> >On 7xx and 7xxx both loads and stores are one cycle slower as multiple
> >than as separate insns.
> 
> That's in theory when the instructions are already in the cache.
> 
> But loading several instructions into the cache takes time.

Yes, of course, that's why I wrote:

> >load/store multiple are nice for saving/storing registers.

:-)


Segher

^ permalink raw reply

* Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32
From: Christophe LEROY @ 2018-05-17 14:21 UTC (permalink / raw)
  To: Michael Ellerman, Nicholas Piggin
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	linux-kernel
In-Reply-To: <87wow2mm2b.fsf@concordia.ellerman.id.au>



Le 17/05/2018 à 15:46, Michael Ellerman a écrit :
> Nicholas Piggin <npiggin@gmail.com> writes:
> 
>> On Thu, 17 May 2018 12:04:13 +0200 (CEST)
>> Christophe Leroy <christophe.leroy@c-s.fr> wrote:
>>
>>> commit 87a156fb18fe1 ("Align hot loops of some string functions")
>>> degraded the performance of string functions by adding useless
>>> nops
>>>
>>> A simple benchmark on an 8xx calling 100000x a memchr() that
>>> matches the first byte runs in 41668 TB ticks before this patch
>>> and in 35986 TB ticks after this patch. So this gives an
>>> improvement of approx 10%
>>>
>>> Another benchmark doing the same with a memchr() matching the 128th
>>> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
>>> after this patch, so regardless on the number of loops, removing
>>> those useless nops improves the test by 5683 TB ticks.
>>>
>>> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
>>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>>> ---
>>>   Was sent already as part of a serie optimising string functions.
>>>   Resending on itself as it is independent of the other changes in the
>>> serie
>>>
>>>   arch/powerpc/lib/string.S | 6 ++++++
>>>   1 file changed, 6 insertions(+)
>>>
>>> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
>>> index a787776822d8..a026d8fa8a99 100644
>>> --- a/arch/powerpc/lib/string.S
>>> +++ b/arch/powerpc/lib/string.S
>>> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>>>   	mtctr	r5
>>>   	addi	r6,r3,-1
>>>   	addi	r4,r4,-1
>>> +#ifdef CONFIG_PPC64
>>>   	.balign 16
>>> +#endif
>>>   1:	lbzu	r0,1(r4)
>>>   	cmpwi	0,r0,0
>>>   	stbu	r0,1(r6)
>>
>> The ifdefs are a bit ugly, but you can't argue with the numbers. These
>> alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
>> the ifetch performance when you have such a loop (although there is
>> always a tradeoff for a single iteration).
>>
>> Would it make sense to define that for 32-bit as well, and you could use
>> it here instead of the ifdefs? Small CPUs could just use 0.
> 
> Can we do it with a macro in the header, eg. like:
> 
> #ifdef CONFIG_PPC64
> #define IFETCH_BALIGN	.balign IFETCH_ALIGN_BYTES
> #endif
> 
> ...
> 
>    	addi	r4,r4,-1
>    	IFETCH_BALIGN
>    1:	lbzu	r0,1(r4)
> 
> 

Why not just define IFETCH_ALIGN_SHIFT for PPC32 as well in asm/cache.h 
?, then replace the .balign 16 by .balign IFETCH_ALIGN_BYTES (or .align 
IFETCH_ALIGN_SHIFT) ?

Christophe

> cheers
> 

^ permalink raw reply

* Re: [PATCH v4 3/4] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp()
From: Michael Ellerman @ 2018-05-17 14:13 UTC (permalink / raw)
  To: wei.guo.simon, linuxppc-dev
  Cc: Paul Mackerras, Naveen N.  Rao, Cyril Bur, Simon Guo
In-Reply-To: <1526459661-17323-4-git-send-email-wei.guo.simon@gmail.com>

wei.guo.simon@gmail.com writes:
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> This patch is based on the previous VMX patch on memcmp().
>
> To optimize ppc64 memcmp() with VMX instruction, we need to think about
> the VMX penalty brought with: If kernel uses VMX instruction, it needs
> to save/restore current thread's VMX registers. There are 32 x 128 bits
> VMX registers in PPC, which means 32 x 16 = 512 bytes for load and store.
>
> The major concern regarding the memcmp() performance in kernel is KSM,
> who will use memcmp() frequently to merge identical pages. So it will
> make sense to take some measures/enhancement on KSM to see whether any
> improvement can be done here.  Cyril Bur indicates that the memcmp() for
> KSM has a higher possibility to fail (unmatch) early in previous bytes
> in following mail.
> 	https://patchwork.ozlabs.org/patch/817322/#1773629
> And I am taking a follow-up on this with this patch.
>
> Per some testing, it shows KSM memcmp() will fail early at previous 32
> bytes.  More specifically:
>     - 76% cases will fail/unmatch before 16 bytes;
>     - 83% cases will fail/unmatch before 32 bytes;
>     - 84% cases will fail/unmatch before 64 bytes;
> So 32 bytes looks a better choice than other bytes for pre-checking.
>
> This patch adds a 32 bytes pre-checking firstly before jumping into VMX
> operations, to avoid the unnecessary VMX penalty. And the testing shows
> ~20% improvement on memcmp() average execution time with this patch.
>
> The detail data and analysis is at:
> https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md
>
> Any suggestion is welcome.

Thanks for digging into that, really great work.

I'm inclined to make this not depend on KSM though. It seems like a good
optimisation to do in general.

So can we just call it the 'pre-check' or something, and always do it?

cheers

> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index 6303bbf..df2eec0 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -405,6 +405,35 @@ _GLOBAL(memcmp)
>  	/* Enter with src/dst addrs has the same offset with 8 bytes
>  	 * align boundary
>  	 */
> +
> +#ifdef CONFIG_KSM
> +	/* KSM will always compare at page boundary so it falls into
> +	 * .Lsameoffset_vmx_cmp.
> +	 *
> +	 * There is an optimization for KSM based on following fact:
> +	 * KSM pages memcmp() prones to fail early at the first bytes. In
> +	 * a statisis data, it shows 76% KSM memcmp() fails at the first
> +	 * 16 bytes, and 83% KSM memcmp() fails at the first 32 bytes, 84%
> +	 * KSM memcmp() fails at the first 64 bytes.
> +	 *
> +	 * Before applying VMX instructions which will lead to 32x128bits VMX
> +	 * regs load/restore penalty, let us compares the first 32 bytes
> +	 * so that we can catch the ~80% fail cases.
> +	 */
> +
> +	li	r0,4
> +	mtctr	r0
> +.Lksm_32B_loop:
> +	LD	rA,0,r3
> +	LD	rB,0,r4
> +	cmpld	cr0,rA,rB
> +	addi	r3,r3,8
> +	addi	r4,r4,8
> +	bne     cr0,.LcmpAB_lightweight
> +	addi	r5,r5,-8
> +	bdnz	.Lksm_32B_loop
> +#endif
> +
>  	ENTER_VMX_OPS
>  	beq     cr1,.Llong_novmx_cmp
>  
> -- 
> 1.8.3.1

^ permalink raw reply

* Re: [PATCH 2/2] selftests/powerpc: Add core file test for Protection Key registers
From: Michael Ellerman @ 2018-05-17 14:03 UTC (permalink / raw)
  To: Thiago Jung Bauermann, linuxppc-dev
  Cc: linux-kselftest, linux-kernel, Ram Pai, Thiago Jung Bauermann
In-Reply-To: <20180223183344.21038-2-bauerman@linux.vnet.ibm.com>

Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> writes:

> This test verifies that the AMR, IAMR and UAMOR are being written to a
> process' core file.
>
> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
> ---
>  tools/testing/selftests/powerpc/ptrace/Makefile    |   5 +-
>  tools/testing/selftests/powerpc/ptrace/core-pkey.c | 460 +++++++++++++++++++++

Also failing w/out pkeys:

  test: core_pkey
  tags: git_version:52e7d87
  [FAIL] Test FAILED on line 117
  [FAIL] Test FAILED on line 265
  failure: core_pkey


cheers

^ permalink raw reply

* Re: [PATCH 1/2] selftests/powerpc: Add ptrace tests for Protection Key registers
From: Michael Ellerman @ 2018-05-17 14:03 UTC (permalink / raw)
  To: Thiago Jung Bauermann, linuxppc-dev
  Cc: linux-kselftest, linux-kernel, Ram Pai, Thiago Jung Bauermann
In-Reply-To: <20180223183344.21038-1-bauerman@linux.vnet.ibm.com>

Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> writes:

> This test exercises read and write access to the AMR, IAMR and UAMOR.
>
> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
> ---
>  tools/testing/selftests/powerpc/include/reg.h      |   1 +
>  tools/testing/selftests/powerpc/ptrace/Makefile    |   5 +-
>  tools/testing/selftests/powerpc/ptrace/child.h     | 130 ++++++++
>  .../testing/selftests/powerpc/ptrace/ptrace-pkey.c | 326 +++++++++++++++++++++

This is failing on machines without pkeys:

  test: ptrace_pkey
  tags: git_version:52e7d87
  [FAIL] Test FAILED on line 117
  [FAIL] Test FAILED on line 191
  failure: ptrace_pkey


I think the first fail is in the child here:

int ptrace_read_regs(pid_t child, unsigned long type, unsigned long regs[],
		     int n)
{
	struct iovec iov;
	long ret;

	FAIL_IF(start_trace(child));

	iov.iov_base = regs;
	iov.iov_len = n * sizeof(unsigned long);

	ret = ptrace(PTRACE_GETREGSET, child, type, &iov);
	FAIL_IF(ret != 0);


Which makes sense.

The test needs to skip if pkeys are not available/enabled. Using the
availability of the REGSET might actually be a nice way to detect that,
because it's read-only.

cheers

^ permalink raw reply

* Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes
From: Segher Boessenkool @ 2018-05-17 13:55 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel
In-Reply-To: <8a6f90d882c8b60e5fa0826cd23dd70a92075659.1526553552.git.christophe.leroy@c-s.fr>

On Thu, May 17, 2018 at 12:49:58PM +0200, Christophe Leroy wrote:
> In my 8xx configuration, I get 208 calls to memcmp()
> Within those 208 calls, about half of them have constant sizes,
> 46 have a size of 8, 17 have a size of 16, only a few have a
> size over 16. Other fixed sizes are mostly 4, 6 and 10.
> 
> This patch inlines calls to memcmp() when size
> is constant and lower than or equal to 16
> 
> In my 8xx configuration, this reduces the number of calls
> to memcmp() from 208 to 123
> 
> The following table shows the number of TB timeticks to perform
> a constant size memcmp() before and after the patch depending on
> the size
> 
> 	Before	After	Improvement
> 01:	 7577	 5682	25%
> 02:	41668	 5682	86%
> 03:	51137	13258	74%
> 04:	45455	 5682	87%
> 05:	58713	13258	77%
> 06:	58712	13258	77%
> 07:	68183	20834	70%
> 08:	56819	15153	73%
> 09:	70077	28411	60%
> 10:	70077	28411	60%
> 11:	79546	35986	55%
> 12:	68182	28411	58%
> 13:	81440	35986	55%
> 14:	81440	39774	51%
> 15:	94697	43562	54%
> 16:	79546	37881	52%

Could you show results with a more recent GCC?  What version was this?

What is this really measuring?  I doubt it takes 7577 (or 5682) timebase
ticks to do a 1-byte memcmp, which is just 3 instructions after all.


Segher

^ permalink raw reply

* Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32
From: Michael Ellerman @ 2018-05-17 13:46 UTC (permalink / raw)
  To: Nicholas Piggin, Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	linux-kernel
In-Reply-To: <20180517231938.4b1b8172@roar.ozlabs.ibm.com>

Nicholas Piggin <npiggin@gmail.com> writes:

> On Thu, 17 May 2018 12:04:13 +0200 (CEST)
> Christophe Leroy <christophe.leroy@c-s.fr> wrote:
>
>> commit 87a156fb18fe1 ("Align hot loops of some string functions")
>> degraded the performance of string functions by adding useless
>> nops
>> 
>> A simple benchmark on an 8xx calling 100000x a memchr() that
>> matches the first byte runs in 41668 TB ticks before this patch
>> and in 35986 TB ticks after this patch. So this gives an
>> improvement of approx 10%
>> 
>> Another benchmark doing the same with a memchr() matching the 128th
>> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
>> after this patch, so regardless on the number of loops, removing
>> those useless nops improves the test by 5683 TB ticks.
>> 
>> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>>  Was sent already as part of a serie optimising string functions.
>>  Resending on itself as it is independent of the other changes in the
>> serie
>> 
>>  arch/powerpc/lib/string.S | 6 ++++++
>>  1 file changed, 6 insertions(+)
>> 
>> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
>> index a787776822d8..a026d8fa8a99 100644
>> --- a/arch/powerpc/lib/string.S
>> +++ b/arch/powerpc/lib/string.S
>> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>>  	mtctr	r5
>>  	addi	r6,r3,-1
>>  	addi	r4,r4,-1
>> +#ifdef CONFIG_PPC64
>>  	.balign 16
>> +#endif
>>  1:	lbzu	r0,1(r4)
>>  	cmpwi	0,r0,0
>>  	stbu	r0,1(r6)
>
> The ifdefs are a bit ugly, but you can't argue with the numbers. These
> alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
> the ifetch performance when you have such a loop (although there is
> always a tradeoff for a single iteration).
>
> Would it make sense to define that for 32-bit as well, and you could use
> it here instead of the ifdefs? Small CPUs could just use 0.

Can we do it with a macro in the header, eg. like:

#ifdef CONFIG_PPC64
#define IFETCH_BALIGN	.balign IFETCH_ALIGN_BYTES
#endif

...

  	addi	r4,r4,-1
  	IFETCH_BALIGN
  1:	lbzu	r0,1(r4)


cheers

^ permalink raw reply

* Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes
From: Benjamin Herrenschmidt @ 2018-05-17 13:44 UTC (permalink / raw)
  To: Christophe LEROY, Mathieu Malaterre
  Cc: Paul Mackerras, Michael Ellerman, linuxppc-dev, LKML
In-Reply-To: <c184c807-dac0-0617-f449-492f5b6eeec1@c-s.fr>

On Thu, 2018-05-17 at 15:21 +0200, Christophe LEROY wrote:
> > > +static inline int __memcmp8(const void *p, const void *q, int off)
> > > +{
> > > +       s64 tmp = be64_to_cpu(*(u64*)(p + off)) - be64_to_cpu(*(u64*)(q + off));
> > 
> > I always assumed 64bits unaligned access would trigger an exception.
> > Is this correct ?
> 
> As far as I know, an unaligned access will only occur when the operand 
> of lmw, stmw, lwarx, or stwcx. is not aligned.
> 
> Maybe that's different for PPC64 ?

It's very implementation specific.

Recent ppc64 chips generally don't trap (unless it's cache inhibited
space). Earlier variants might trap on page boundaries or segment
boundaries. Some embedded parts are less forgiving... some earlier
POWER chips will trap on unaligned in LE mode...

I wouldn't worry too much about it though. I think if 8xx shows an
improvement then it's probably fine everywhere else :-)

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm
From: Benjamin Herrenschmidt @ 2018-05-17 13:39 UTC (permalink / raw)
  To: Michael Ellerman, Christophe Leroy, Paul Mackerras
  Cc: linux-kernel, linuxppc-dev
In-Reply-To: <87zi0ymqj6.fsf@concordia.ellerman.id.au>

On Thu, 2018-05-17 at 22:10 +1000, Michael Ellerman wrote:
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
> > arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
> > in order to use multiple word instructions in functions entry/exit
> 
> True, though that could be a lot simpler because the MULTIPLEWORD value
> is only used for PPC32, which is always big endian. I'll send a patch
> for that.

There have been known cases of 4xx LE ports though none ever made it
upstream ...

> > The patch does the same for the asm parts, for consistency
> > 
> > On processors like the 8xx on which insn fetching is pretty slow,
> > this speeds up registers save/restore
> 
> OK. I've always heard that they should be avoided, but that's coming
> from 64-bit land.
> 
> I guess we've been enabling this for all 32-bit targets for ever so it
> must be a reasonable option.
> 
> > Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> > ---
> >  v2: Swapped both patches in the serie to reduce number of impacted
> >      lines and added the same modification in ppc_save_regs()
> > 
> >  arch/powerpc/include/asm/ppc_asm.h  |  5 +++++
> >  arch/powerpc/kernel/misc.S          | 10 ++++++++++
> >  arch/powerpc/kernel/ppc_save_regs.S |  4 ++++
> >  3 files changed, 19 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
> > index 13f7f4c0e1ea..4bb765d0b758 100644
> > --- a/arch/powerpc/include/asm/ppc_asm.h
> > +++ b/arch/powerpc/include/asm/ppc_asm.h
> > @@ -80,11 +80,16 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
> >  #else
> >  #define SAVE_GPR(n, base)	stw	n,GPR0+4*(n)(base)
> >  #define REST_GPR(n, base)	lwz	n,GPR0+4*(n)(base)
> > +#ifdef CONFIG_CPU_BIG_ENDIAN
> > +#define SAVE_NVGPRS(base)	stmw	13, GPR0+4*13(base)
> > +#define REST_NVGPRS(base)	lmw	13, GPR0+4*13(base)
> > +#else
> >  #define SAVE_NVGPRS(base)	SAVE_GPR(13, base); SAVE_8GPRS(14, base); \
> >  				SAVE_10GPRS(22, base)
> >  #define REST_NVGPRS(base)	REST_GPR(13, base); REST_8GPRS(14, base); \
> >  				REST_10GPRS(22, base)
> 
> There is no 32-bit little endian, so this is basically dead code now.
> 
> Maybe there'll be a 32-bit LE port one day, but if so we can put the
> code back then.
> 
> So I'll just drop the else case.
> 
> >  #endif
> > +#endif
> >  
> >  #define SAVE_2GPRS(n, base)	SAVE_GPR(n, base); SAVE_GPR(n+1, base)
> >  #define SAVE_4GPRS(n, base)	SAVE_2GPRS(n, base); SAVE_2GPRS(n+2, base)
> > diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
> > index 746ee0320ad4..a316d90a5c26 100644
> > --- a/arch/powerpc/kernel/misc.S
> > +++ b/arch/powerpc/kernel/misc.S
> > @@ -49,6 +49,10 @@ _GLOBAL(setjmp)
> >  	PPC_STL	r0,0(r3)
> >  	PPC_STL	r1,SZL(r3)
> >  	PPC_STL	r2,2*SZL(r3)
> > +#if defined(CONFIG_PPC32) && defined(CONFIG_CPU_BIG_ENDIAN)
> 
> And this could just be:
> 
> #ifdef CONFIG_PPC32
> 
> > +	mfcr	r12
> > +	stmw	r12, 3*SZL(r3)
> > +#else
> >  	mfcr	r0
> >  	PPC_STL	r0,3*SZL(r3)
> >  	PPC_STL	r13,4*SZL(r3)
> > @@ -70,10 +74,15 @@ _GLOBAL(setjmp)
> >  	PPC_STL	r29,20*SZL(r3)
> >  	PPC_STL	r30,21*SZL(r3)
> >  	PPC_STL	r31,22*SZL(r3)
> > +#endif
> 
> It's a pity to end up with this basically split in half by ifdefs for
> 32/64-bit, but maybe we can clean that up later.
> 
> cheers

^ permalink raw reply

* Re: [PATCH v6] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()
From: Nicholas Piggin @ 2018-05-17 13:36 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel
In-Reply-To: <20180517105930.71B1D6F937@po14934vm.idsi0.si.c-s.fr>

On Thu, 17 May 2018 12:59:29 +0200 (CEST)
Christophe Leroy <christophe.leroy@c-s.fr> wrote:

> Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
> userspace instruction miss") has shown that limiting the read of
> faulting instruction to likely cases improves performance.
> 
> This patch goes further into this direction by limiting the read
> of the faulting instruction to the only cases where it is definitly
> needed.
> 
> On an MPC885, with the same benchmark app as in the commit referred
> above, we see a reduction of 4000 dTLB misses (approx 3%):
> 
> Before the patch:
>  Performance counter stats for './fault 500' (10 runs):
> 
>          720495838      cpu-cycles		( +-  0.04% )
>             141769      dTLB-load-misses	( +-  0.02% )
>              52722      iTLB-load-misses	( +-  0.01% )
>              19611      faults			( +-  0.02% )
> 
>        5.750535176 seconds time elapsed		( +-  0.16% )
> 
> With the patch:
>  Performance counter stats for './fault 500' (10 runs):
> 
>          717669123      cpu-cycles		( +-  0.02% )
>             137344      dTLB-load-misses	( +-  0.03% )
>              52731      iTLB-load-misses	( +-  0.01% )
>              19614      faults			( +-  0.03% )
> 
>        5.728423115 seconds time elapsed		( +-  0.14% )
> 
> The proper work of the huge stack expansion was tested with the
> following app:
> 
> int main(int argc, char **argv)
> {
> 	char buf[1024 * 1025];
> 
> 	sprintf(buf, "Hello world !\n");
> 	printf(buf);
> 
> 	exit(0);
> }
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  v6: Rebased on latest powerpc/merge branch ; Using __get_user_inatomic() instead of get_user() in order
>      to move it inside the semaphored area. That removes all the complexity of the patch.
> 
>  v5: Reworked to fit after Benh do_fault improvement and rebased on top of powerpc/merge (65152902e43fef)
> 
>  v4: Rebased on top of powerpc/next (f718d426d7e42e) and doing access_ok() verification before __get_user_xxx()
> 
>  v3: Do a first try with pagefault disabled before releasing the semaphore
> 
>  v2: Changes 'if (cond1) if (cond2)' by 'if (cond1 && cond2)'
> 
>  arch/powerpc/mm/fault.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index c01d627e687a..a7d5cc76a8ce 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -72,8 +72,18 @@ static inline bool notify_page_fault(struct pt_regs *regs)
>  static bool store_updates_sp(struct pt_regs *regs)
>  {
>  	unsigned int inst;
> +	int ret;
>  
> -	if (get_user(inst, (unsigned int __user *)regs->nip))
> +	/*
> +	 * Using get_user_in_atomic() as reading code around nip can result in
> +	 * fault, which may cause a deadlock when called with mmap_sem held,
> +	 * however since we are reading the instruction that generated the DSI
> +	 * we are handling, the page is necessarily already present.
> +	 */
> +	pagefault_disable();
> +	ret = __get_user_inatomic(inst, (unsigned int __user *)regs->nip);
> +	pagefault_enable();
> +	if (ret)
>  		return false;

Problem is that the page can be removed from page tables between
taking the fault and reading the address here.

That case would be so rare that it should be fine to do a big hammer
fix like drop the mmap_sem, do a fault_in_pages_readable, and then
restart from taking the mmap_sem again.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v2 1/5] powerpc/lib: move PPC32 specific functions out of string.S
From: Segher Boessenkool @ 2018-05-17 13:33 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel
In-Reply-To: <68d0167111d06a8b21d7e52140bf399d79979e3b.1526553552.git.christophe.leroy@c-s.fr>

On Thu, May 17, 2018 at 12:49:50PM +0200, Christophe Leroy wrote:
> In preparation of optimisation patches, move PPC32 specific
> memcmp() and __clear_user() into string_32.S

> --- /dev/null
> +++ b/arch/powerpc/lib/string_32.S
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * String handling functions for PowerPC32
> + *
> + * Copyright (C) 2018 CS Systemes d'Information
> + *
> + * Author: Christophe Leroy <christophe.leroy@c-s.fr>
> + *
> + */

That is wrong; that code is a plain copy, from 1996 already.


Segher

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm
From: Christophe LEROY @ 2018-05-17 13:27 UTC (permalink / raw)
  To: Segher Boessenkool, Michael Ellerman
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	linux-kernel
In-Reply-To: <20180517131550.GR17342@gate.crashing.org>



Le 17/05/2018 à 15:15, Segher Boessenkool a écrit :
> On Thu, May 17, 2018 at 10:10:21PM +1000, Michael Ellerman wrote:
>> Christophe Leroy <christophe.leroy@c-s.fr> writes:
>>> arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
>>> in order to use multiple word instructions in functions entry/exit
>>
>> True, though that could be a lot simpler because the MULTIPLEWORD value
>> is only used for PPC32, which is always big endian. I'll send a patch
>> for that.
> 
> Do you mean in the kernel?  Many 32-bit processors can do LE, and many
> do not implement multiple or string insns in LE mode.
> 
>>> The patch does the same for the asm parts, for consistency
>>>
>>> On processors like the 8xx on which insn fetching is pretty slow,
>>> this speeds up registers save/restore
>>
>> OK. I've always heard that they should be avoided, but that's coming
>> from 64-bit land.
>>
>> I guess we've been enabling this for all 32-bit targets for ever so it
>> must be a reasonable option.
> 
> On 603, load multiple (and string) are one cycle slower than doing all the
> loads separately, and store is essentially the same as separate stores.
> On 7xx and 7xxx both loads and stores are one cycle slower as multiple
> than as separate insns.

That's in theory when the instructions are already in the cache.

But loading several instructions into the cache takes time.

Christophe

> 
> load/store multiple are nice for saving/storing registers.
> 
> 
> Segher
> 

^ permalink raw reply

* Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32
From: Nicholas Piggin @ 2018-05-17 13:26 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel
In-Reply-To: <20180517100413.856096F938@po14934vm.idsi0.si.c-s.fr>

On Thu, 17 May 2018 12:04:13 +0200 (CEST)
Christophe Leroy <christophe.leroy@c-s.fr> wrote:

> commit 87a156fb18fe1 ("Align hot loops of some string functions")
> degraded the performance of string functions by adding useless
> nops
> 
> A simple benchmark on an 8xx calling 100000x a memchr() that
> matches the first byte runs in 41668 TB ticks before this patch
> and in 35986 TB ticks after this patch. So this gives an
> improvement of approx 10%
> 
> Another benchmark doing the same with a memchr() matching the 128th
> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
> after this patch, so regardless on the number of loops, removing
> those useless nops improves the test by 5683 TB ticks.
> 
> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  Was sent already as part of a serie optimising string functions.
>  Resending on itself as it is independent of the other changes in the
> serie
> 
>  arch/powerpc/lib/string.S | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
> index a787776822d8..a026d8fa8a99 100644
> --- a/arch/powerpc/lib/string.S
> +++ b/arch/powerpc/lib/string.S
> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>  	mtctr	r5
>  	addi	r6,r3,-1
>  	addi	r4,r4,-1
> +#ifdef CONFIG_PPC64
>  	.balign 16
> +#endif
>  1:	lbzu	r0,1(r4)
>  	cmpwi	0,r0,0
>  	stbu	r0,1(r6)

The ifdefs are a bit ugly, but you can't argue with the numbers. These
alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
the ifetch performance when you have such a loop (although there is
always a tradeoff for a single iteration).

Would it make sense to define that for 32-bit as well, and you could use
it here instead of the ifdefs? Small CPUs could just use 0.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes
From: Christophe LEROY @ 2018-05-17 13:21 UTC (permalink / raw)
  To: Mathieu Malaterre
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, LKML
In-Reply-To: <CA+7wUsyEBW-sKiwP_KJpwuWC0JGi_Cg-xnxcnw0d_Wc7-GWeeA@mail.gmail.com>



Le 17/05/2018 à 15:03, Mathieu Malaterre a écrit :
> On Thu, May 17, 2018 at 12:49 PM, Christophe Leroy
> <christophe.leroy@c-s.fr> wrote:
>> In my 8xx configuration, I get 208 calls to memcmp()
>> Within those 208 calls, about half of them have constant sizes,
>> 46 have a size of 8, 17 have a size of 16, only a few have a
>> size over 16. Other fixed sizes are mostly 4, 6 and 10.
>>
>> This patch inlines calls to memcmp() when size
>> is constant and lower than or equal to 16
>>
>> In my 8xx configuration, this reduces the number of calls
>> to memcmp() from 208 to 123
>>
>> The following table shows the number of TB timeticks to perform
>> a constant size memcmp() before and after the patch depending on
>> the size
>>
>>          Before  After   Improvement
>> 01:      7577    5682   25%
>> 02:     41668    5682   86%
>> 03:     51137   13258   74%
>> 04:     45455    5682   87%
>> 05:     58713   13258   77%
>> 06:     58712   13258   77%
>> 07:     68183   20834   70%
>> 08:     56819   15153   73%
>> 09:     70077   28411   60%
>> 10:     70077   28411   60%
>> 11:     79546   35986   55%
>> 12:     68182   28411   58%
>> 13:     81440   35986   55%
>> 14:     81440   39774   51%
>> 15:     94697   43562   54%
>> 16:     79546   37881   52%
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>>   arch/powerpc/include/asm/string.h | 46 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 46 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
>> index 35f1aaad9b50..80cf0f9605dd 100644
>> --- a/arch/powerpc/include/asm/string.h
>> +++ b/arch/powerpc/include/asm/string.h
>> @@ -4,6 +4,8 @@
>>
>>   #ifdef __KERNEL__
>>
>> +#include <linux/kernel.h>
>> +
>>   #define __HAVE_ARCH_STRNCPY
>>   #define __HAVE_ARCH_STRNCMP
>>   #define __HAVE_ARCH_MEMSET
>> @@ -51,10 +53,54 @@ static inline int strncmp(const char *p, const char *q, __kernel_size_t size)
>>          return __strncmp(p, q, size);
>>   }
>>
>> +static inline int __memcmp1(const void *p, const void *q, int off)
> 
> Does that change anything if you change void* to char* pointer ? I
> find void* arithmetic hard to read.

Yes that's not the same

void* means you can use any pointer, for instance pointers to two 
structs you want to compare.

char* will force users to cast their pointers to char*

> 
>> +{
>> +       return *(u8*)(p + off) - *(u8*)(q + off);
>> +}
>> +
>> +static inline int __memcmp2(const void *p, const void *q, int off)
>> +{
>> +       return be16_to_cpu(*(u16*)(p + off)) - be16_to_cpu(*(u16*)(q + off));
>> +}
>> +
>> +static inline int __memcmp4(const void *p, const void *q, int off)
>> +{
>> +       return be32_to_cpu(*(u32*)(p + off)) - be32_to_cpu(*(u32*)(q + off));
>> +}
>> +
>> +static inline int __memcmp8(const void *p, const void *q, int off)
>> +{
>> +       s64 tmp = be64_to_cpu(*(u64*)(p + off)) - be64_to_cpu(*(u64*)(q + off));
> 
> I always assumed 64bits unaligned access would trigger an exception.
> Is this correct ?

As far as I know, an unaligned access will only occur when the operand 
of lmw, stmw, lwarx, or stwcx. is not aligned.

Maybe that's different for PPC64 ?

Christophe

> 
>> +       return tmp >> 32 ? : (int)tmp;
>> +}
>> +
>> +static inline int __memcmp_cst(const void *p,const void *q,__kernel_size_t size)
>> +{
>> +       if (size == 1)
>> +               return __memcmp1(p, q, 0);
>> +       if (size == 2)
>> +               return __memcmp2(p, q, 0);
>> +       if (size == 3)
>> +               return __memcmp2(p, q, 0) ? : __memcmp1(p, q, 2);
>> +       if (size == 4)
>> +               return __memcmp4(p, q, 0);
>> +       if (size == 5)
>> +               return __memcmp4(p, q, 0) ? : __memcmp1(p, q, 4);
>> +       if (size == 6)
>> +               return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4);
>> +       if (size == 7)
>> +               return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4) ? : __memcmp1(p, q, 6);
>> +       return __memcmp8(p, q, 0);
>> +}
>> +
>>   static inline int memcmp(const void *p,const void *q,__kernel_size_t size)
>>   {
>>          if (unlikely(!size))
>>                  return 0;
>> +       if (__builtin_constant_p(size) && size <= 8)
>> +               return __memcmp_cst(p, q, size);
>> +       if (__builtin_constant_p(size) && size <= 16)
>> +               return __memcmp8(p, q, 0) ? : __memcmp_cst(p + 8, q + 8, size - 8);
>>          return __memcmp(p, q, size);
>>   }
>>
>> --
>> 2.13.3
>>

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm
From: Segher Boessenkool @ 2018-05-17 13:15 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, linux-kernel
In-Reply-To: <87zi0ymqj6.fsf@concordia.ellerman.id.au>

On Thu, May 17, 2018 at 10:10:21PM +1000, Michael Ellerman wrote:
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
> > arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
> > in order to use multiple word instructions in functions entry/exit
> 
> True, though that could be a lot simpler because the MULTIPLEWORD value
> is only used for PPC32, which is always big endian. I'll send a patch
> for that.

Do you mean in the kernel?  Many 32-bit processors can do LE, and many
do not implement multiple or string insns in LE mode.

> > The patch does the same for the asm parts, for consistency
> >
> > On processors like the 8xx on which insn fetching is pretty slow,
> > this speeds up registers save/restore
> 
> OK. I've always heard that they should be avoided, but that's coming
> from 64-bit land.
> 
> I guess we've been enabling this for all 32-bit targets for ever so it
> must be a reasonable option.

On 603, load multiple (and string) are one cycle slower than doing all the
loads separately, and store is essentially the same as separate stores.
On 7xx and 7xxx both loads and stores are one cycle slower as multiple
than as separate insns.

load/store multiple are nice for saving/storing registers.


Segher

^ permalink raw reply

* Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes
From: Mathieu Malaterre @ 2018-05-17 13:03 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, LKML
In-Reply-To: <8a6f90d882c8b60e5fa0826cd23dd70a92075659.1526553552.git.christophe.leroy@c-s.fr>

On Thu, May 17, 2018 at 12:49 PM, Christophe Leroy
<christophe.leroy@c-s.fr> wrote:
> In my 8xx configuration, I get 208 calls to memcmp()
> Within those 208 calls, about half of them have constant sizes,
> 46 have a size of 8, 17 have a size of 16, only a few have a
> size over 16. Other fixed sizes are mostly 4, 6 and 10.
>
> This patch inlines calls to memcmp() when size
> is constant and lower than or equal to 16
>
> In my 8xx configuration, this reduces the number of calls
> to memcmp() from 208 to 123
>
> The following table shows the number of TB timeticks to perform
> a constant size memcmp() before and after the patch depending on
> the size
>
>         Before  After   Improvement
> 01:      7577    5682   25%
> 02:     41668    5682   86%
> 03:     51137   13258   74%
> 04:     45455    5682   87%
> 05:     58713   13258   77%
> 06:     58712   13258   77%
> 07:     68183   20834   70%
> 08:     56819   15153   73%
> 09:     70077   28411   60%
> 10:     70077   28411   60%
> 11:     79546   35986   55%
> 12:     68182   28411   58%
> 13:     81440   35986   55%
> 14:     81440   39774   51%
> 15:     94697   43562   54%
> 16:     79546   37881   52%
>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  arch/powerpc/include/asm/string.h | 46 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
> index 35f1aaad9b50..80cf0f9605dd 100644
> --- a/arch/powerpc/include/asm/string.h
> +++ b/arch/powerpc/include/asm/string.h
> @@ -4,6 +4,8 @@
>
>  #ifdef __KERNEL__
>
> +#include <linux/kernel.h>
> +
>  #define __HAVE_ARCH_STRNCPY
>  #define __HAVE_ARCH_STRNCMP
>  #define __HAVE_ARCH_MEMSET
> @@ -51,10 +53,54 @@ static inline int strncmp(const char *p, const char *q, __kernel_size_t size)
>         return __strncmp(p, q, size);
>  }
>
> +static inline int __memcmp1(const void *p, const void *q, int off)

Does that change anything if you change void* to char* pointer ? I
find void* arithmetic hard to read.

> +{
> +       return *(u8*)(p + off) - *(u8*)(q + off);
> +}
> +
> +static inline int __memcmp2(const void *p, const void *q, int off)
> +{
> +       return be16_to_cpu(*(u16*)(p + off)) - be16_to_cpu(*(u16*)(q + off));
> +}
> +
> +static inline int __memcmp4(const void *p, const void *q, int off)
> +{
> +       return be32_to_cpu(*(u32*)(p + off)) - be32_to_cpu(*(u32*)(q + off));
> +}
> +
> +static inline int __memcmp8(const void *p, const void *q, int off)
> +{
> +       s64 tmp = be64_to_cpu(*(u64*)(p + off)) - be64_to_cpu(*(u64*)(q + off));

I always assumed 64bits unaligned access would trigger an exception.
Is this correct ?

> +       return tmp >> 32 ? : (int)tmp;
> +}
> +
> +static inline int __memcmp_cst(const void *p,const void *q,__kernel_size_t size)
> +{
> +       if (size == 1)
> +               return __memcmp1(p, q, 0);
> +       if (size == 2)
> +               return __memcmp2(p, q, 0);
> +       if (size == 3)
> +               return __memcmp2(p, q, 0) ? : __memcmp1(p, q, 2);
> +       if (size == 4)
> +               return __memcmp4(p, q, 0);
> +       if (size == 5)
> +               return __memcmp4(p, q, 0) ? : __memcmp1(p, q, 4);
> +       if (size == 6)
> +               return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4);
> +       if (size == 7)
> +               return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4) ? : __memcmp1(p, q, 6);
> +       return __memcmp8(p, q, 0);
> +}
> +
>  static inline int memcmp(const void *p,const void *q,__kernel_size_t size)
>  {
>         if (unlikely(!size))
>                 return 0;
> +       if (__builtin_constant_p(size) && size <= 8)
> +               return __memcmp_cst(p, q, size);
> +       if (__builtin_constant_p(size) && size <= 16)
> +               return __memcmp8(p, q, 0) ? : __memcmp_cst(p + 8, q + 8, size - 8);
>         return __memcmp(p, q, size);
>  }
>
> --
> 2.13.3
>

^ permalink raw reply

* Re: [PATCH 0/3] Add support to disable sensor groups in P9
From: Guenter Roeck @ 2018-05-17 12:38 UTC (permalink / raw)
  To: Shilpasri G Bhat; +Cc: mpe, linuxppc-dev, linux-hwmon, linux-kernel, stewart
In-Reply-To: <2ceb6b30-60b7-d985-ee0d-6c5784eda273@linux.vnet.ibm.com>

On 05/16/2018 11:10 PM, Shilpasri G Bhat wrote:
> 
> 
> On 05/15/2018 08:32 PM, Guenter Roeck wrote:
>> On Thu, Mar 22, 2018 at 04:24:32PM +0530, Shilpasri G Bhat wrote:
>>> This patch series adds support to enable/disable OCC based
>>> inband-sensor groups at runtime. The environmental sensor groups are
>>> managed in HWMON and the remaining platform specific sensor groups are
>>> managed in /sys/firmware/opal.
>>>
>>> The firmware changes required for this patch is posted below:
>>> https://lists.ozlabs.org/pipermail/skiboot/2018-March/010812.html
>>>
>>
>> Sorry for not getting back earlier. This is a tough one.
>>
> 
> Thanks for the reply. I have tried to answer your questions according to my
> understanding below:
> 
>> Key problem is that you are changing the ABI with those new attributes.
>> On top of that, the attributes _do_ make some sense (many chips support
>> enabling/disabling of individual sensors), suggesting that those or
>> similar attributes may or even should at some point be added to the ABI.
>>
>> At the same time, returning "0" as measurement values when sensors are
>> disabled does not seem like a good idea, since "0" is a perfectly valid
>> measurement, at least for most sensors.
> 
> I agree.
> 
>>
>> Given that, we need to have a discussion about adding _enable attributes to
>> the ABI
> 
>> what is the scope,
> IIUC the scope should be RW and the attribute is defined for each supported
> sensor group
> 

That is _your_ need. I am not aware of any other chip where a per-sensor group
attribute would make sense. The discussion we need has to extend beyond the need
of a single chip.

Guenter

>> when should the attributes exist and when not,
> We control this currently via device-tree
> 
>> do we want/need power_enable or powerX_enable or both, and so on), and
> We need power_enable right now
> 
>> what to return if a sensor is disabled (such as -ENODATA).
> -ENODATA sounds good.
> 
> Thanks and Regards,
> Shilpa
> 
>   Once we have an
>> agreement, we can continue with an implementation.
>>
>> Guenter
>>
>>> Shilpasri G Bhat (3):
>>>    powernv:opal-sensor-groups: Add support to enable sensor groups
>>>    hwmon: ibmpowernv: Add attributes to enable/disable sensor groups
>>>    powernv: opal-sensor-groups: Add attributes to disable/enable sensors
>>>
>>>   .../ABI/testing/sysfs-firmware-opal-sensor-groups  |  34 ++++++
>>>   Documentation/hwmon/ibmpowernv                     |  31 ++++-
>>>   arch/powerpc/include/asm/opal-api.h                |   4 +-
>>>   arch/powerpc/include/asm/opal.h                    |   2 +
>>>   .../powerpc/platforms/powernv/opal-sensor-groups.c | 104 ++++++++++++-----
>>>   arch/powerpc/platforms/powernv/opal-wrappers.S     |   1 +
>>>   drivers/hwmon/ibmpowernv.c                         | 127 +++++++++++++++++++--
>>>   7 files changed, 265 insertions(+), 38 deletions(-)
>>>   create mode 100644 Documentation/ABI/testing/sysfs-firmware-opal-sensor-groups
>>>
>>> -- 
>>> 1.8.3.1
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-hwmon" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
> 

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm
From: Michael Ellerman @ 2018-05-17 12:10 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras
  Cc: linux-kernel, linuxppc-dev
In-Reply-To: <a94e38e1dbe0c38ce5b363fac29ab6e65fd9985a.1523984745.git.christophe.leroy@c-s.fr>

Christophe Leroy <christophe.leroy@c-s.fr> writes:
> arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
> in order to use multiple word instructions in functions entry/exit

True, though that could be a lot simpler because the MULTIPLEWORD value
is only used for PPC32, which is always big endian. I'll send a patch
for that.

> The patch does the same for the asm parts, for consistency
>
> On processors like the 8xx on which insn fetching is pretty slow,
> this speeds up registers save/restore

OK. I've always heard that they should be avoided, but that's coming
from 64-bit land.

I guess we've been enabling this for all 32-bit targets for ever so it
must be a reasonable option.

> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  v2: Swapped both patches in the serie to reduce number of impacted
>      lines and added the same modification in ppc_save_regs()
>
>  arch/powerpc/include/asm/ppc_asm.h  |  5 +++++
>  arch/powerpc/kernel/misc.S          | 10 ++++++++++
>  arch/powerpc/kernel/ppc_save_regs.S |  4 ++++
>  3 files changed, 19 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
> index 13f7f4c0e1ea..4bb765d0b758 100644
> --- a/arch/powerpc/include/asm/ppc_asm.h
> +++ b/arch/powerpc/include/asm/ppc_asm.h
> @@ -80,11 +80,16 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
>  #else
>  #define SAVE_GPR(n, base)	stw	n,GPR0+4*(n)(base)
>  #define REST_GPR(n, base)	lwz	n,GPR0+4*(n)(base)
> +#ifdef CONFIG_CPU_BIG_ENDIAN
> +#define SAVE_NVGPRS(base)	stmw	13, GPR0+4*13(base)
> +#define REST_NVGPRS(base)	lmw	13, GPR0+4*13(base)
> +#else
>  #define SAVE_NVGPRS(base)	SAVE_GPR(13, base); SAVE_8GPRS(14, base); \
>  				SAVE_10GPRS(22, base)
>  #define REST_NVGPRS(base)	REST_GPR(13, base); REST_8GPRS(14, base); \
>  				REST_10GPRS(22, base)

There is no 32-bit little endian, so this is basically dead code now.

Maybe there'll be a 32-bit LE port one day, but if so we can put the
code back then.

So I'll just drop the else case.

>  #endif
> +#endif
>  
>  #define SAVE_2GPRS(n, base)	SAVE_GPR(n, base); SAVE_GPR(n+1, base)
>  #define SAVE_4GPRS(n, base)	SAVE_2GPRS(n, base); SAVE_2GPRS(n+2, base)
> diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
> index 746ee0320ad4..a316d90a5c26 100644
> --- a/arch/powerpc/kernel/misc.S
> +++ b/arch/powerpc/kernel/misc.S
> @@ -49,6 +49,10 @@ _GLOBAL(setjmp)
>  	PPC_STL	r0,0(r3)
>  	PPC_STL	r1,SZL(r3)
>  	PPC_STL	r2,2*SZL(r3)
> +#if defined(CONFIG_PPC32) && defined(CONFIG_CPU_BIG_ENDIAN)

And this could just be:

#ifdef CONFIG_PPC32

> +	mfcr	r12
> +	stmw	r12, 3*SZL(r3)
> +#else
>  	mfcr	r0
>  	PPC_STL	r0,3*SZL(r3)
>  	PPC_STL	r13,4*SZL(r3)
> @@ -70,10 +74,15 @@ _GLOBAL(setjmp)
>  	PPC_STL	r29,20*SZL(r3)
>  	PPC_STL	r30,21*SZL(r3)
>  	PPC_STL	r31,22*SZL(r3)
> +#endif

It's a pity to end up with this basically split in half by ifdefs for
32/64-bit, but maybe we can clean that up later.

cheers

^ permalink raw reply

* [PATCH v11 24/26] x86/mm: add speculative pagefault handling
From: Laurent Dufour @ 2018-05-17 11:06 UTC (permalink / raw)
  To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
	khandual, aneesh.kumar, benh, mpe, paulus, Thomas Gleixner,
	Ingo Molnar, hpa, Will Deacon, Sergey Senozhatsky,
	sergey.senozhatsky.work, Andrea Arcangeli, Alexei Starovoitov,
	kemi.wang, Daniel Jordan, David Rientjes, Jerome Glisse,
	Ganesh Mahendran, Minchan Kim, Punit Agrawal, vinayak menon,
	Yang Shi
  Cc: linux-kernel, linux-mm, haren, npiggin, bsingharora, paulmck,
	Tim Chen, linuxppc-dev, x86
In-Reply-To: <1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com>

From: Peter Zijlstra <peterz@infradead.org>

Try a speculative fault before acquiring mmap_sem, if it returns with
VM_FAULT_RETRY continue with the mmap_sem acquisition and do the
traditional fault.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
 handle_speculative_fault()]
[Retry with usual fault path in the case VM_ERROR is returned by
 handle_speculative_fault(). This allows signal to be delivered]
[Don't build SPF call if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Handle memory protection key fault]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/x86/mm/fault.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fd84edf82252..11944bfc805a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1224,7 +1224,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	struct mm_struct *mm;
 	int fault, major = 0;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
-	u32 pkey;
+	u32 pkey, *pt_pkey = &pkey;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1314,6 +1314,27 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		flags |= FAULT_FLAG_INSTRUCTION;
 
 	/*
+	 * Do not try to do a speculative page fault if the fault was due to
+	 * protection keys since it can't be resolved.
+	 */
+	if (!(error_code & X86_PF_PK)) {
+		fault = handle_speculative_fault(mm, address, flags);
+		if (fault != VM_FAULT_RETRY) {
+			perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
+			/*
+			 * Do not advertise for the pkey value since we don't
+			 * know it.
+			 * This is not a matter as we checked for X86_PF_PK
+			 * earlier, so we should not handle pkey fault here,
+			 * but to be sure that mm_fault_error() callees will
+			 * not try to use it, we invalidate the pointer.
+			 */
+			pt_pkey = NULL;
+			goto done;
+		}
+	}
+
+	/*
 	 * When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in
 	 * the kernel and should generate an OOPS.  Unfortunately, in the
@@ -1427,8 +1448,10 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	}
 
 	up_read(&mm->mmap_sem);
+
+done:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
-		mm_fault_error(regs, error_code, address, &pkey, fault);
+		mm_fault_error(regs, error_code, address, pt_pkey, fault);
 		return;
 	}
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH v11 26/26] arm64/mm: add speculative page fault
From: Laurent Dufour @ 2018-05-17 11:06 UTC (permalink / raw)
  To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
	khandual, aneesh.kumar, benh, mpe, paulus, Thomas Gleixner,
	Ingo Molnar, hpa, Will Deacon, Sergey Senozhatsky,
	sergey.senozhatsky.work, Andrea Arcangeli, Alexei Starovoitov,
	kemi.wang, Daniel Jordan, David Rientjes, Jerome Glisse,
	Ganesh Mahendran, Minchan Kim, Punit Agrawal, vinayak menon,
	Yang Shi
  Cc: linux-kernel, linux-mm, haren, npiggin, bsingharora, paulmck,
	Tim Chen, linuxppc-dev, x86
In-Reply-To: <1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com>

From: Mahendran Ganesh <opensource.ganesh@gmail.com>

This patch enables the speculative page fault on the arm64
architecture.

I completed spf porting in 4.9. From the test result,
we can see app launching time improved by about 10% in average.
For the apps which have more than 50 threads, 15% or even more
improvement can be got.

Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>

[handle_speculative_fault() is no more returning the vma pointer]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/arm64/mm/fault.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 91c53a7d2575..fb9f840367f9 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -411,6 +411,16 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
 	/*
+	 * let's try a speculative page fault without grabbing the
+	 * mmap_sem.
+	 */
+	fault = handle_speculative_fault(mm, addr, mm_flags);
+	if (fault != VM_FAULT_RETRY) {
+		perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, addr);
+		goto done;
+	}
+
+	/*
 	 * As per x86, we may deadlock here. However, since the kernel only
 	 * validly references user space from well defined areas of the code,
 	 * we can bug out early if this is from code which shouldn't.
@@ -460,6 +470,8 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	}
 	up_read(&mm->mmap_sem);
 
+done:
+
 	/*
 	 * Handle the "normal" (no error) case first.
 	 */
-- 
2.7.4

^ permalink raw reply related

* [PATCH v11 25/26] powerpc/mm: add speculative page fault
From: Laurent Dufour @ 2018-05-17 11:06 UTC (permalink / raw)
  To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
	khandual, aneesh.kumar, benh, mpe, paulus, Thomas Gleixner,
	Ingo Molnar, hpa, Will Deacon, Sergey Senozhatsky,
	sergey.senozhatsky.work, Andrea Arcangeli, Alexei Starovoitov,
	kemi.wang, Daniel Jordan, David Rientjes, Jerome Glisse,
	Ganesh Mahendran, Minchan Kim, Punit Agrawal, vinayak menon,
	Yang Shi
  Cc: linux-kernel, linux-mm, haren, npiggin, bsingharora, paulmck,
	Tim Chen, linuxppc-dev, x86
In-Reply-To: <1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com>

This patch enable the speculative page fault on the PowerPC
architecture.

This will try a speculative page fault without holding the mmap_sem,
if it returns with VM_FAULT_RETRY, the mmap_sem is acquired and the
traditional page fault processing is done.

The speculative path is only tried for multithreaded process as there is no
risk of contention on the mmap_sem otherwise.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/powerpc/mm/fault.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index ef268d5d9db7..d7b5742ffb2b 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -465,6 +465,21 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (is_exec)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
+	/*
+	 * Try speculative page fault before grabbing the mmap_sem.
+	 * The Page fault is done if VM_FAULT_RETRY is not returned.
+	 * But if the memory protection keys are active, we don't know if the
+	 * fault is due to key mistmatch or due to a classic protection check.
+	 * To differentiate that, we will need the VMA we no more have, so
+	 * let's retry with the mmap_sem held.
+	 */
+	fault = handle_speculative_fault(mm, address, flags);
+	if (fault != VM_FAULT_RETRY && (IS_ENABLED(CONFIG_PPC_MEM_KEYS) &&
+					fault != VM_FAULT_SIGSEGV)) {
+		perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
+		goto done;
+	}
+
 	/* When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
 	 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -565,6 +580,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	up_read(&current->mm->mmap_sem);
 
+done:
 	if (unlikely(fault & VM_FAULT_ERROR))
 		return mm_fault_error(regs, address, fault);
 
-- 
2.7.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox