From: David Laight <david.laight.linux@gmail.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Herton Krzesinski <hkrzesin@redhat.com>,
x86@kernel.org, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
linux-kernel@vger.kernel.org, torvalds@linux-foundation.org,
olichtne@redhat.com, atomasov@redhat.com, aokuliar@redhat.com
Subject: Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when without FSRM/ERMS)
Date: Thu, 20 Mar 2025 21:31:45 +0000
Message-ID: <20250320213145.6d016e21@pumpkin>
In-Reply-To: <CAGudoHEDAzOrndyJeb7L95cMPmHHk0O5wk=tRCe35FE6GVYs1w@mail.gmail.com>
On Thu, 20 Mar 2025 19:02:21 +0100
Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Thu, Mar 20, 2025 at 6:51 PM Herton Krzesinski <hkrzesin@redhat.com> wrote:
> >
> > On Thu, Mar 20, 2025 at 11:36 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
...
> > > That said, have you experimented with aligning the target to 16 bytes
> > > or more bytes?
> >
> > Yes I tried to do 32-byte write aligned on an old Xeon (Sandy Bridge based)
> > and got no improvement at least in the specific benchmark I'm doing here.
> > Also after your question here I tried 16-byte/32-byte on the AMD cpu as
> > well and got no difference from the 8-byte alignment, same bench as well.
> > I tried to do 8-byte alignment for the ERMS case on Intel and got no
> > difference on the systems I tested. I'm not saying it may not improve in
> > some other case, just that in my specific testing I couldn't tell/measure
> > any improvement.
> >
>
> oof, I would not go as far back as Sandy Bridge. ;)
It is a boundary point.
Agner's tables (fairly reliable) have:
Sandy Bridge (page 222)
    MOVS        5        4
    REP MOVS    2n       1.5n     worst case
    REP MOVS    3/16B    1/16B    best case
Agner's tables have the same values for Ivy Bridge, which you'd sort of
expect since Ivy Bridge is a minor update.
Haswell jumps to 1/32B.
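As a rough worked example of what the best-case figures above mean for a whole copy, here is a minimal sketch. It assumes the "1/16B" and "1/32B" entries are steady-state costs of one clock per 16 or 32 bytes and ignores any 'rep movs' startup cost; the helper name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Steady-state estimate only: clocks = ceil(bytes / bytes_per_clock).
 * bytes_per_clock would be 16 for the Sandy/Ivy Bridge best case and
 * 32 for Haswell. Startup overhead is deliberately not modelled.
 */
static size_t best_case_clocks(size_t bytes, size_t bytes_per_clock)
{
    assert(bytes_per_clock != 0);
    return (bytes + bytes_per_clock - 1) / bytes_per_clock;
}
```

On that model a 4096-byte copy is about 256 clocks at 16 bytes/clock and about 128 clocks at 32 bytes/clock, which is the gap the Haswell change opens up.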
I didn't test Sandy Bridge (I've got one, powered off), but did test Ivy Bridge.
Neither the source nor the destination alignment made any difference at all.
As I said earlier, the only alignment that made any difference was 32-byte
aligning the destination on Haswell (and later).
That is needed to get 32 bytes/clock rather than 16 bytes/clock.
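For illustration, aligning the destination comes down to copying a short head first so that the bulk copy starts on a 32-byte boundary. The sketch below is plain userspace C, not the kernel's copy_user_generic; both helper names are hypothetical, and memcpy() merely stands in for the bulk 'rep movs'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Bytes to copy before 'dst' reaches the next 'align'-byte boundary
 * (0 if it is already aligned). 'align' must be a power of two.
 */
static size_t head_bytes(const void *dst, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size_t)(-(uintptr_t)dst & (align - 1));
}

/*
 * Hypothetical helper: copy the unaligned head, then the rest.
 * If any bytes remain after the head, the second copy starts on a
 * 32-byte aligned destination address.
 */
static void *copy_dst_aligned32(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t head = head_bytes(d, 32);

    if (head > len)
        head = len;
    memcpy(d, s, head);                      /* unaligned head */
    memcpy(d + head, s + head, len - head);  /* bulk part */
    return dst;
}
```

For a destination at 0x1008 the head is 24 bytes, so the bulk copy starts at 0x1020. Whether the extra head copy pays for itself depends on the CPU and the copy length, which is what the measurements above are about.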
>
> I think Skylake is the oldest yeller to worry about, if one insists on it.
>
> That said, if memory serves right these bufs like to be misaligned to
> weird extents, it very well may be in your tests aligning to 8 had a
> side effect of aligning it to 16 even.
>
> > >
> > > Moreover, I have some recollection that there were uarchs with ERMS
> > > which also liked the target to be aligned -- as in perhaps this should
> > > be done regardless of FSRM?
Dunno, the only report is some AMD cpu being slow with misaligned writes.
But that is the copy loop, not 'rep movsq'.
I don't have one to test.
> >
> > Where I tested I didn't see improvements but may be there is some case,
> > but I didn't have any.
> >
> > >
> > > And most importantly memset, memcpy and clear_user would all use a
> > > revamp and they are missing rep handling for bigger sizes (I verified
> > > they *do* show up). Not only that, but memcpy uses overlapping stores
> > > while memset just loops over stuff.
> > >
> > > I intended to sort it out long time ago and maybe will find some time
> > > now that I got reminded of it, but I would be delighted if it got
> > > picked up.
> > >
> > > Hacking this up is just some screwing around, the real time consuming
> > > part is the benchmarking so I completely understand if you are not
> > > interested.
> >
> > Yes, the most time you spend is on benchmarking. May be later I could
> > try to take a look but will not put any promises on it.
I found I needed to use the performance counter to get a proper cycle count,
reading the register directly to avoid all the 'library' overhead,
with lfence/mfence on both sides of the cycle count read.
After subtracting the overhead of a 'null function' I could measure the
number of clocks each operation took.
So I could tell when I was actually getting 32 bytes copied per clock.
(Or, when testing the IP checksum code, the number of bytes/clock: that can get to 12.)
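A minimal userspace sketch of that measurement pattern follows. It is not the harness used for the numbers above, and it makes several assumptions: it reads the TSC with __rdtsc() as a stand-in for the performance counter (the TSC counts reference ticks, not core clocks, so a real test would read the cycle counter instead), it takes the minimum over several runs to filter out interrupts, and all function names are made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc(), _mm_lfence(); x86 with GCC/Clang assumed */

/* Counter read with lfence on both sides so it cannot be reordered
 * with the code under test. ASSUMPTION: TSC used in place of the
 * performance counter. */
static inline uint64_t fenced_count(void)
{
    uint64_t t;

    _mm_lfence();
    t = __rdtsc();
    _mm_lfence();
    return t;
}

/* The 'null function' whose cost is subtracted as overhead. */
static void null_op(void *arg)
{
    (void)arg;
}

/* Smallest tick count seen for one call of fn(arg) over 'runs' tries. */
static uint64_t min_ticks(void (*fn)(void *), void *arg, int runs)
{
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < runs; i++) {
        uint64_t t0 = fenced_count();
        fn(arg);
        uint64_t t1 = fenced_count();

        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Ticks for fn(arg) with the null-function overhead subtracted. */
static uint64_t net_ticks(void (*fn)(void *), void *arg, int runs)
{
    uint64_t overhead = min_ticks(null_op, NULL, runs);
    uint64_t total = min_ticks(fn, arg, runs);

    return total > overhead ? total - overhead : 0;
}
```

Wrapping the copy under test in a small void (*)(void *) function and dividing the copy length by net_ticks() gives a bytes-per-tick figure. With the real cycle counter in place of the TSC, that is the bytes/clock number discussed above.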
David
> >
>
> Now I'm curious enough what's up here. If I don't run out of steam,
> I'm gonna cover memset and memcpy myself.
>