linux-arch.vger.kernel.org archive mirror
From: David Laight <David.Laight@ACULAB.COM>
To: 'Robin Murphy' <robin.murphy@arm.com>,
	Will Deacon <will@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	"kernel-team@android.com" <kernel-team@android.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Peter Zijlstra <peterz@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Segher Boessenkool <segher@kernel.crashing.org>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Luc Van Oostenryck <luc.vanoostenryck@gmail.com>,
	Arnd Bergmann <arnd@arndb.de>,
	Peter Oberparleiter <oberpar@linux.ibm.com>,
	Masahiro Yamada <masahiroy@kernel.org>,
	Nick Desaulniers <ndesaulniers@google.com>
Subject: RE: [PATCH v4 05/11] arm64: csum: Disable KASAN for do_csum()
Date: Fri, 24 Apr 2020 13:04:04 +0000	[thread overview]
Message-ID: <cdfcda98bce54632953cae5e05305dc7@AcuMS.aculab.com> (raw)
In-Reply-To: <a4ddb547-ea46-d79d-3088-a97b9a033997@arm.com>

From: Robin Murphy
> Sent: 24 April 2020 12:01
> On 2020-04-24 10:41 am, David Laight wrote:
> > From: Robin Murphy
> >> Sent: 22 April 2020 12:02
> > ..
> >> Sure - I have a nagging feeling that it could still do better WRT
> >> pipelining the loads anyway, so I'm happy to come back and reconsider
> >> the local codegen later. It certainly doesn't deserve to stand in the
> >> way of cross-arch rework.
> >
> > How fast does that loop actually run?
> 
> I've not characterised it in detail, but faster than any of the other
> attempts so far ;)
...
> The aim here is to minimise load bandwidth - most Arm cores can slurp 16
> bytes from L1 in a single load as quickly as any smaller amount, so
> nibbling away in little 32-bit chunks would result in up to 4x more load
> cycles.

The x86 'problem' is that 'adc' takes two clocks and the serial
dependency through the carry flag means you can only sum 4 bytes/clock
regardless of the memory accesses.

> Yes, the C code looks ridiculous, but the other trick is that
> most of those operations don't actually exist. Since a __uint128_t is
> really backed by any two 64-bit GPRs - or if you're careful, one 64-bit
> GPR and the carry flag - all those shifts and rotations are in fact
> resolved by register allocation, so what we end up with is a very neat
> loop of essentially just loads and 64-bit accumulation:
> 
> ...
>   138:   a94030c3        ldp     x3, x12, [x6]
>   13c:   a9412cc8        ldp     x8, x11, [x6, #16]
>   140:   a94228c4        ldp     x4, x10, [x6, #32]
>   144:   a94324c7        ldp     x7, x9, [x6, #48]
>   148:   ab03018d        adds    x13, x12, x3
>   14c:   510100a5        sub     w5, w5, #0x40
>   150:   9a0c0063        adc     x3, x3, x12
>   154:   ab08016c        adds    x12, x11, x8
>   158:   9a0b0108        adc     x8, x8, x11
>   15c:   ab04014b        adds    x11, x10, x4
>   160:   9a0a0084        adc     x4, x4, x10
>   164:   ab07012a        adds    x10, x9, x7
>   168:   9a0900e7        adc     x7, x7, x9
>   16c:   ab080069        adds    x9, x3, x8
>   170:   9a080063        adc     x3, x3, x8
>   174:   ab070088        adds    x8, x4, x7
>   178:   9a070084        adc     x4, x4, x7
>   17c:   910100c6        add     x6, x6, #0x40
>   180:   ab040067        adds    x7, x3, x4
>   184:   9a040063        adc     x3, x3, x4
>   188:   ab010064        adds    x4, x3, x1
>   18c:   9a030023        adc     x3, x1, x3
>   190:   710100bf        cmp     w5, #0x40
>   194:   aa0303e1        mov     x1, x3
>   198:   54fffd0c        b.gt    138 <do_csum+0xd8>
> ...
> 
> Instruction-wise, that's about as good as it can get short of
> maintaining multiple accumulators and moving the pairwise folding out of
> the loop. The main thing that I think is still left on the table is that
> the load-to-use distances are pretty short and there's clearly scope to
> spread out and amortise the load cycles better, which stands to benefit
> both big and little cores.
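For anyone following along, the __uint128_t trick Robin describes
boils down to something like this minimal sketch (not the actual arm64
do_csum): the low half of the 128-bit add becomes the 'adds', folding
the high half back in becomes the 'adc', and no shift instructions
survive into the output.

```c
#include <stdint.h>
#include <stddef.h>

typedef unsigned __int128 u128;	/* GCC/Clang extension, as in the kernel */

/* The low 64 bits of the 128-bit add are the wrapped sum and the high
 * 64 bits are the carry out; adding them together yields the
 * end-around-carry result, with the "shift" resolved by register
 * allocation rather than an actual instruction. */
uint64_t csum_u128(const uint64_t *p, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		u128 t = (u128)sum + p[i];

		sum = (uint64_t)t + (uint64_t)(t >> 64);
	}
	return sum;
}
```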

I realised most of the C would disappear - it was just hard to see
what the result would be.
Looking at the above there are 8 (64-bit) loads and 16 adds.
(Plus 2 adds for the loop control - it should only need one -
and a spare register move.)
Without multiple carry flags the best you are going to get
is one add instruction and one 'save the carry flag' instruction
for each 64-bit word.
The trick then is to arrange the code to avoid register dependency
chains so that the instructions can run in parallel.
I think you lose a little at the bottom of the loop above when you
add into the global sum - it might be faster with 2 running sums.
Actually one load and 4 adds every clock might even be possible -
if the cpu can issue and execute that many.
But that would require a horrid interleaved loop.
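To make the dependency-chain point concrete, a hypothetical 4-way
split (independent running sums, folded once after the loop) would
look something like this sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* End-around-carry add of two 64-bit words. */
static inline uint64_t csum_add(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < a);
}

/* Four independent accumulators: the add/carry pairs for different
 * lanes have no register dependencies on each other, so they can
 * issue in parallel; n is assumed a multiple of 4 for simplicity. */
uint64_t csum_4way(const uint64_t *p, size_t n)
{
	uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;

	for (size_t i = 0; i < n; i += 4) {
		s0 = csum_add(s0, p[i]);
		s1 = csum_add(s1, p[i + 1]);
		s2 = csum_add(s2, p[i + 2]);
		s3 = csum_add(s3, p[i + 3]);
	}
	return csum_add(csum_add(s0, s1), csum_add(s2, s3));
}
```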

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


Thread overview: 46+ messages
2020-04-21 15:15 [PATCH v4 00/11] Rework READ_ONCE() to improve codegen Will Deacon
2020-04-21 15:15 ` [PATCH v4 01/11] compiler/gcc: Raise minimum GCC version for kernel builds to 4.8 Will Deacon
2020-04-21 17:15   ` Masahiro Yamada
2020-04-21 15:15 ` [PATCH v4 02/11] netfilter: Avoid assigning 'const' pointer to non-const pointer Will Deacon
2020-04-21 15:15 ` [PATCH v4 03/11] net: tls: " Will Deacon
2020-04-21 15:15 ` [PATCH v4 04/11] fault_inject: Don't rely on "return value" from WRITE_ONCE() Will Deacon
2020-04-21 15:15 ` [PATCH v4 05/11] arm64: csum: Disable KASAN for do_csum() Will Deacon
2020-04-22  9:49   ` Mark Rutland
2020-04-22 10:41     ` Will Deacon
2020-04-22 11:01       ` Robin Murphy
2020-04-24  9:41         ` David Laight
2020-04-24 11:00           ` Robin Murphy
2020-04-24 13:04             ` David Laight [this message]
2020-04-21 15:15 ` [PATCH v4 06/11] READ_ONCE: Simplify implementations of {READ,WRITE}_ONCE() Will Deacon
2020-04-22  9:51   ` Mark Rutland
2020-04-21 15:15 ` [PATCH v4 07/11] READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses Will Deacon
2020-04-24 16:31   ` Jann Horn
2020-04-24 17:11     ` Will Deacon
2020-04-24 17:43       ` Peter Zijlstra
2020-04-21 15:15 ` [PATCH v4 08/11] READ_ONCE: Drop pointer qualifiers when reading from scalar types Will Deacon
2020-04-22 10:25   ` Rasmus Villemoes
2020-04-22 11:48     ` Segher Boessenkool
2020-04-22 13:11       ` Will Deacon
2020-04-22 14:54   ` Will Deacon
2020-04-21 15:15 ` [PATCH v4 09/11] locking/barriers: Use '__unqual_scalar_typeof' for load-acquire macros Will Deacon
2020-04-21 15:15 ` [PATCH v4 10/11] arm64: barrier: Use '__unqual_scalar_typeof' for acquire/release macros Will Deacon
2020-04-21 15:15 ` [PATCH v4 11/11] gcov: Remove old GCC 3.4 support Will Deacon
2020-04-21 17:19   ` Masahiro Yamada
2020-04-21 18:42 ` [PATCH v4 00/11] Rework READ_ONCE() to improve codegen Linus Torvalds
2020-04-22  8:18   ` Will Deacon
2020-04-22 11:37     ` Peter Zijlstra
2020-04-22 12:26       ` Will Deacon
2020-04-24 13:42         ` Will Deacon
2020-04-24 15:54           ` Marco Elver
2020-04-24 16:52             ` Will Deacon
