RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS

From: David Laight <David.Laight@ACULAB.COM>
To: 'Mateusz Guzik' <mjguzik@gmail.com>
Cc: "torvalds@linux-foundation.org" <torvalds@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	"bp@alien8.de" <bp@alien8.de>
Subject: RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Date: Sun, 3 Sep 2023 20:42:55 +0000
Message-ID: <9a5dd401bf154a0aace0e5f781a3580c@AcuMS.aculab.com>
In-Reply-To: <CAGudoHHUWZNz0OU5yCqOBkeifSYKhm4y6WO1x+q5pDPt1j3+GA@mail.gmail.com>

...
> When I was playing with this stuff about 5 years ago I found 32-byte
> loops to be optimal for uarchs of the period (Skylake, Broadwell,
> Haswell and so on), but only up to a point where rep wins.

Does 'rep movsq' ever actually win?
(Unless you find one of the ERMS (or similar) versions.)
IIRC it only ever does one iteration per clock - and you
should be able to match that with a carefully constructed loop.

Many years ago I got my Athlon-700 to execute a copy loop
as fast as 'rep movs' - but the setup times were longer.

The killer for 'rep movs' setup was always P4-netburst - over 40 clocks.
But I think some of the more recent CPUs are still in double figures
(apart from some optimised copies).
So I'm not actually sure you should ever need to switch
from an open-coded loop to 'rep movsq' - but I've not tried to write it.
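
For illustration only, a minimal C sketch of the kind of 32-byte-per-iteration
copy loop being discussed. The name copy_unrolled32 is hypothetical and does
not come from the patch; the kernel's user-copy code is hand-written assembly
with fault handling, which this sketch omits. It makes no claim about
throughput on any particular CPU.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): copy in 32-byte blocks,
 * i.e. four 64-bit loads followed by four 64-bit stores per iteration,
 * then finish any remaining tail one byte at a time.
 * The fixed-size memcpy() calls are the portable way to express
 * unaligned 64-bit accesses; compilers normally turn each into one mov.
 */
static void copy_unrolled32(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len >= 32) {
		uint64_t w0, w1, w2, w3;

		memcpy(&w0, s, 8);
		memcpy(&w1, s + 8, 8);
		memcpy(&w2, s + 16, 8);
		memcpy(&w3, s + 24, 8);
		memcpy(d, &w0, 8);
		memcpy(d + 8, &w1, 8);
		memcpy(d + 16, &w2, 8);
		memcpy(d + 24, &w3, 8);
		s += 32;
		d += 32;
		len -= 32;
	}
	/* Tail: fewer than 32 bytes left. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
}
```

Whether a loop like this matches 'rep movsq' on a given part would have to be
measured; the sketch only shows the loop shape.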

I did have to unroll the ip-cksum loop 4 times (as):
+       asm(    "       bt    $4, %[len]\n"
+               "       jnc   10f\n"
+               "       add   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   16(%[len]), %[len]\n"
+               "10:    jecxz 20f\n"   // %[len] is %rcx
+               "       adc   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   32(%[len]), %[len_tmp]\n"
+               "       adc   16(%[buff], %[len]), %[sum_0]\n"
+               "       adc   24(%[buff], %[len]), %[sum_1]\n"
+               "       mov   %[len_tmp], %[len]\n"
+               "       jmp   10b\n"
+               "20:    adc   %[sum_0], %[sum]\n"
+               "       adc   %[sum_1], %[sum]\n"
+               "       adc   $0, %[sum]\n"
in order to get one adc every clock.
But that was only needed because of the strange loop required to carry
the carry flag across iterations (the 'loop' instruction is OK on AMD
CPUs, but not on Intel).
A similar loop using adox and adcx will beat one read per clock
provided it is unrolled again.
(IIRC I got to about 12 bytes/clock.)
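
As a rough portable illustration of what the asm above accumulates, here is a
C sketch of a 64-bit ones-complement sum with two accumulators. The names
add_wrap and csum64_two_acc are hypothetical. This is not the same instruction
sequence: the asm threads a single carry flag through alternating adc
instructions, whereas the sketch folds each carry back in immediately. Both
give the same value modulo 2^64 - 1. The sketch assumes len is a multiple of
16 and ignores any trailing bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add with end-around carry: a carry out of bit 63 is added back in. */
static uint64_t add_wrap(uint64_t sum, uint64_t val)
{
	sum += val;
	return sum + (sum < val);	/* (sum < val) is 1 iff the add wrapped */
}

/*
 * Hypothetical helper: sum a buffer as 64-bit words into two
 * accumulators, then combine them, mirroring sum_0 and sum_1 above.
 * Assumes len is a multiple of 16.
 */
static uint64_t csum64_two_acc(const unsigned char *buff, size_t len)
{
	uint64_t sum_0 = 0, sum_1 = 0;

	for (size_t i = 0; i + 16 <= len; i += 16) {
		uint64_t w0, w1;

		memcpy(&w0, buff + i, 8);
		memcpy(&w1, buff + i + 8, 8);
		sum_0 = add_wrap(sum_0, w0);
		sum_1 = add_wrap(sum_1, w1);
	}
	return add_wrap(sum_0, sum_1);
}
```

A compiler is free to generate something quite different from the hand-written
adc chain, so this says nothing about the one-adc-per-clock figure.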

	David
