From: David Laight <david.laight.linux@gmail.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: Sohil Mehta <sohil.mehta@intel.com>,
x86@kernel.org, Dave Hansen <dave.hansen@linux.intel.com>,
Tony Luck <tony.luck@intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Arnaldo Carvalho de Melo <acme@kernel.org>,
Namhyung Kim <namhyung@kernel.org>,
Mark Rutland <mark.rutland@arm.com>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
Adrian Hunter <adrian.hunter@intel.com>,
Kan Liang <kan.liang@linux.intel.com>,
Thomas Gleixner <tglx@linutronix.de>,
Borislav Petkov <bp@alien8.de>, "H . Peter Anvin" <hpa@zytor.com>,
"Rafael J . Wysocki" <rafael@kernel.org>,
Len Brown <lenb@kernel.org>, Andy Lutomirski <luto@kernel.org>,
Viresh Kumar <viresh.kumar@linaro.org>,
Fenghua Yu <fenghua.yu@intel.com>,
Jean Delvare <jdelvare@suse.com>,
Guenter Roeck <linux@roeck-us.net>,
Zhang Rui <rui.zhang@intel.com>,
Andrew Cooper <andrew.cooper3@citrix.com>,
linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pm@vger.kernel.org,
linux-hwmon@vger.kernel.org
Subject: Re: [PATCH v2 04/17] x86/cpu/intel: Fix the movsl alignment preference for extended Families
Date: Tue, 11 Feb 2025 21:45:09 +0000 [thread overview]
Message-ID: <20250211214509.281dc3be@pumpkin> (raw)
In-Reply-To: <5b954a96-1034-467d-a5dc-3d3f7bc112a1@intel.com>
On Tue, 11 Feb 2025 12:26:48 -0800
Dave Hansen <dave.hansen@intel.com> wrote:
> We should really rename intel_workarounds() to make it more clear that
> it's 32-bit only. But I digress...
>
> On 2/11/25 11:43, Sohil Mehta wrote:
> > The alignment preference for 32-bit movsl based bulk memory move has
> > been 8-byte for a long time. However this preference is only set for
> > Family 6 and 15 processors.
> >
> > Extend the preference to upcoming Family numbers 18 and 19 to maintain
> > legacy behavior. Also, use a VFM based check instead of switching based
> > on Family numbers. Refresh the comment to reflect the new check.
> "Legacy behavior" is not important here. If anyone is running 32-bit
> kernel binaries on their brand new CPUs they (as far as I know) have a
> few screws loose. They don't care about performance or security and we
> shouldn't care _for_ them.
>
> If the code yielded the "wrong" movsl_mask.mask for 18/19, it wouldn't
> matter one bit.
>
> The thing that _does_ matter is someone auditing to figure out whether
> the code comprehends families>15 or whether it would break in horrible
> ways. The new check is shorter and it's more obvious that it will work
> forever.
For any Intel non-Atom processor since Ivy Bridge, the only alignment
that makes a real difference is aligning the destination to a 32-byte boundary.
That makes the copy twice as fast (32 bytes/clock rather than 16).
The source alignment never matters.
(I've got access to one of the later 64-bit 8-core Atoms - but can't
remember how it behaves.)
For short (IIRC 1..32 byte) transfers the cost is constant.
The cost depends on the cpu; Ivy Bridge is something like 40 clocks,
lower for later cpus.
(Unlike the P4, where the overhead is some 163 clocks.)
It also makes no difference whether you do 'rep movsb' or 'rep movsq'.
For any of those cpus I'm not sure it is ever worth using anything
other than 'rep movsb' unless the length is known to be very short,
likely a multiple of 4/8, and preferably constant.
Doing a function call and one or two mispredictable branches will
soon eat into the overhead, not to mention displacing code from the I-cache.
Unless you are micro-optimising a very hot path it really isn't worth
doing anything else.
OTOH even some recent AMD cpus are reported not to have FSRM and will
execute 'rep movsb' slowly.
I did 'discover' that code at the weekend; just the memory load to
get the mask is going to slow things down.
Running a benchmark it'll be in cache and the branch predictor
will remember what you are doing.
Come in 'cold cache' and (IIRC) Intel cpus have a 50% chance of predicting
a branch taken (no static prediction - e.g. backwards taken).
Even for normal memory accesses I've not seen any significant slowdown
for misaligned accesses.
Ones that cross a cache line might end up being 2 uops, but the cpu
can do two reads/clock (with a following wind) and it is hard to write
a code loop that gets close to sustaining that.
I'll have tested the IP checksum (adc loop) code with misaligned buffers;
I don't remember a significant slowdown even for the version that does
three memory reads every two clocks (which seems to be the limit).
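For reference, the arithmetic that adc loop implements is the ordinary
ones-complement IP checksum. A portable C sketch (my own illustration -
the kernel's x86 version is hand-written asm and much faster) that is
deliberately safe for misaligned buffers:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Ones-complement IP checksum over 'len' bytes.
 * memcpy() of each 16-bit word keeps the loads legal (and cheap,
 * per the discussion above) even when 'buf' is misaligned. */
static uint16_t ip_checksum(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t sum = 0;

	while (len >= 2) {
		uint16_t w;

		memcpy(&w, p, 2);	/* misaligned-safe 16-bit load */
		sum += w;
		p += 2;
		len -= 2;
	}
	if (len)			/* odd trailing byte */
		sum += *p;

	/* Fold the carries back in, then take the ones complement. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

The wide accumulator plus the fold at the end is the C equivalent of
letting adc carry between words in the asm loop.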
I actually suspect that any copies that matter are aligned, so the cost
of the check far outweighs the benefit across all the calls.
One optimisation that seems to be absent: if you are doing a
register copy loop, then any trailing bytes can be copied by doing
a misaligned copy of the last word (and I mean word, not 16 bits)
of the buffer - copying a few bytes twice.
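A sketch of that trick (my own illustration, with memcpy() standing in
for the per-word register moves a real implementation would use):

```c
#include <stddef.h>
#include <string.h>

/* Copy 8 bytes at a time; finish the tail with ONE misaligned copy
 * of the buffer's final 8 bytes, overlapping the last full word.
 * A few bytes get copied twice, but the byte-loop tail is gone. */
static void copy_with_tail(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t i;

	if (len < 8) {			/* too short for the trick */
		memcpy(d, s, len);
		return;
	}
	for (i = 0; i + 8 <= len; i += 8)
		memcpy(d + i, s + i, 8);	/* whole-word copies */
	if (i != len)				/* overlapping tail word */
		memcpy(d + len - 8, s + len - 8, 8);
}
```

Since copies never run backwards here, re-writing the overlap region
is harmless, and the tail costs one extra word move instead of up to
seven predicted-badly byte iterations.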
David