From: David Laight <David.Laight@ACULAB.COM>
To: 'Mateusz Guzik' <mjguzik@gmail.com>,
"torvalds@linux-foundation.org" <torvalds@linux-foundation.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
"bp@alien8.de" <bp@alien8.de>
Subject: RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Date: Fri, 1 Sep 2023 13:33:03 +0000
Message-ID: <27ba3536633c4e43b65f1dcd0a82c0de@AcuMS.aculab.com>
In-Reply-To: <20230830140315.2666490-1-mjguzik@gmail.com>
From: Mateusz Guzik
> Sent: 30 August 2023 15:03
...
> Hand-rolled mov loops executing in this case are quite pessimal compared
> to rep movsq for bigger sizes. While the upper limit depends on uarch,
> everyone is well south of 1KB AFAICS and sizes bigger than that are
> common.
That unrolled loop is pretty pessimal and very much 1980s.
It should be pretty easy to write a copy loop that runs
at one copy (8 bytes) per clock on a modern desktop x86.
I think that matches 'rep movsq'.
(It will run slower on Atom-based CPUs.)
A very simple copy loop needs (using negative offsets
from the end of the buffer):
	a memory read
	a memory write
	an increment
	a jnz
Doing all of those every clock is well within the CPU's
capabilities, but I've never managed a 1-clock loop.
So you need to unroll once (and only once) to copy 8 bytes/clock.
So for copies that are a multiple of 16 bytes, something like:
	# dst in %rdi, src in %rsi, len in %rdx (multiple of 16)
	add	%rdx, %rdi	# point %rdi at the end of dst
	add	%rdx, %rsi	# point %rsi at the end of src
	neg	%rdx		# index runs from -len up to 0
1:
	mov	0(%rsi, %rdx), %rcx
	mov	%rcx, 0(%rdi, %rdx)
	mov	8(%rsi, %rdx), %rcx
	mov	%rcx, 8(%rdi, %rdx)
	add	$16, %rdx
	jnz	1b
is likely to execute an iteration every two clocks.
The memory reads and writes all get queued up and will happen
at some point - so memory latency doesn't matter at all.
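The negative-index loop can be rendered in C for testing (a sketch; the function name is illustrative, and memcpy() stands in for the 8-byte register moves to keep the accesses well-defined):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy 'len' bytes (len a non-zero multiple of 16): index i runs from
 * -len up towards 0 against end-of-buffer pointers, two 8-byte moves
 * per iteration - the same shape as the asm loop. */
static void copy_neg_index(char *dst, const char *src, size_t len)
{
	const char *src_end = src + len;
	char *dst_end = dst + len;

	for (ptrdiff_t i = -(ptrdiff_t)len; i != 0; i += 16) {
		uint64_t a, b;

		memcpy(&a, src_end + i, 8);
		memcpy(&b, src_end + i + 8, 8);
		memcpy(dst_end + i, &a, 8);
		memcpy(dst_end + i + 8, &b, 8);
	}
}
```

The point of indexing from -len is that the loop-terminating compare comes for free: the 'add $16, %rdx' sets the flags that jnz tests.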
For copies (over 16 bytes) that aren't a multiple of
16 it is probably worth just copying the first 16 bytes
and then doing 16-byte copies that align with the end
of the buffer - copying some bytes twice.
(Or even copy the last 16 bytes first and then copy
aligned with the start.)
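The overlap trick above can be sketched as (assuming non-overlapping src/dst and len >= 16; memcpy() stands in for the 16-byte chunk moves, and the function name is illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy 'len' bytes (len >= 16): copy the first 16 bytes, then copy
 * 16-byte chunks whose offsets are aligned with the END of the buffer.
 * When len isn't a multiple of 16 the first end-aligned chunk starts
 * at offset len % 16 (< 16), so it re-copies part of the head - cheap
 * compared to a byte-granular tail. */
static void copy_head_then_tail(char *dst, const char *src, size_t len)
{
	size_t off;

	memcpy(dst, src, 16);		/* head: first 16 bytes */

	off = len % 16;			/* first end-aligned chunk */
	if (off == 0)
		off = 16;
	for (; off < len; off += 16)
		memcpy(dst + off, src + off, 16);
}
```

Every chunk is a full 16-byte move, so the inner loop never has to branch on the remainder.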
I'm also not at all sure how much it is worth optimising
misaligned transfers.
An IP-checksum function (which only does reads) is only
just measurably slower for misaligned buffers -
less than 1 clock per cache line.
I think you can get an idea of what happens from looking
at the PCIe TLPs generated for misaligned transfers and
assuming that memory requests get much the same treatment.
Last time I checked (on an i7-7700) misaligned transfers
were done in 8-byte chunks (SSE/AVX) and accesses that
crossed cache-line boundaries were split in two.
Since the CPU can issue two reads per clock, not all of the
split reads (to cache) will take an extra clock.
(Which sort of matches what we see.)
OTOH misaligned writes that cross a cache-line boundary
probably always take a 1-clock penalty.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)