* [PATCH 0/2] x86: ERMS follow-on
@ 2025-09-25 10:45 Jan Beulich
2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich
` (3 more replies)
0 siblings, 4 replies; 26+ messages in thread
From: Jan Beulich @ 2025-09-25 10:45 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné
Patch 1 is definitely intended for 4.21, to limit possible performance fallout
from the ERMS series. Patch 2 may be too intrusive to consider taking at this
point.
1: x86/AMD: avoid REP MOVSB for Zen3/4
2: x86: guard synthetic feature and bug enumerators
Jan
^ permalink raw reply [flat|nested] 26+ messages in thread* [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-09-25 10:45 [PATCH 0/2] x86: ERMS follow-on Jan Beulich @ 2025-09-25 10:46 ` Jan Beulich 2025-09-25 12:18 ` Teddy Astie 2025-10-08 16:33 ` Andrew Cooper 2025-09-25 10:48 ` [PATCH 2/2] x86: guard synthetic feature and bug enumerators Jan Beulich ` (2 subsequent siblings) 3 siblings, 2 replies; 26+ messages in thread From: Jan Beulich @ 2025-09-25 10:46 UTC (permalink / raw) To: xen-devel@lists.xenproject.org Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko Along with Zen2 (which doesn't expose ERMS), both families reportedly suffer from sub-optimal aliasing detection when deciding whether REP MOVSB can actually be carried out the accelerated way. Therefore we want to avoid its use in the common case (memset(), copy_page_hot()). Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> --- Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going to be good enough. --- a/xen/arch/x86/copy_page.S +++ b/xen/arch/x86/copy_page.S @@ -57,6 +57,6 @@ END(copy_page_cold) .endm FUNC(copy_page_hot) - ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_ERMS + ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_XEN_REP_MOVSB RET END(copy_page_hot) --- a/xen/arch/x86/cpu/amd.c +++ b/xen/arch/x86/cpu/amd.c @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu check_syscfg_dram_mod_en(); + if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS) + && c->family != 0x19 /* Zen3/4 */) + setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB); + amd_log_freq(c); } --- a/xen/arch/x86/cpu/intel.c +++ b/xen/arch/x86/cpu/intel.c @@ -684,6 +684,9 @@ static void cf_check init_intel(struct c */ if (c == &boot_cpu_data && c->vfm == INTEL_SKYLAKE_X) setup_clear_cpu_cap(X86_FEATURE_CLWB); + + if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS)) + setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB); } const struct cpu_dev __initconst_cf_clobber intel_cpu_dev = { --- a/xen/arch/x86/include/asm/cpufeatures.h +++ b/xen/arch/x86/include/asm/cpufeatures.h @@ -7,7 +7,7 @@ #define FSCAPINTS FEATURESET_NR_ENTRIES /* Synthetic words follow the featureset words. */ -#define X86_NR_SYNTH 1 +#define X86_NR_SYNTH 2 #define X86_SYNTH(x) (FSCAPINTS * 32 + (x)) /* Synthetic features */ @@ -43,6 +43,7 @@ XEN_CPUFEATURE(IBPB_ENTRY_PV, X86_SY XEN_CPUFEATURE(IBPB_ENTRY_HVM, X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */ XEN_CPUFEATURE(PDX_COMPRESSION, X86_SYNTH(31)) /* PDX compression */ +XEN_CPUFEATURE(XEN_REP_MOVSB, X86_SYNTH(32)) /* REP MOVSB used for memcpy() and alike. */ /* Bug words follow the synthetic words. */ #define X86_NR_BUG 1 --- a/xen/arch/x86/memcpy.S +++ b/xen/arch/x86/memcpy.S @@ -10,7 +10,7 @@ FUNC(memcpy) * precautions were taken). */ ALTERNATIVE "and $7, %edx; shr $3, %rcx", \ - STR(rep movsb; RET), X86_FEATURE_ERMS + STR(rep movsb; RET), X86_FEATURE_XEN_REP_MOVSB rep movsq or %edx, %ecx jz 1f ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich @ 2025-09-25 12:18 ` Teddy Astie 2025-09-25 12:58 ` Jan Beulich 2025-09-29 23:35 ` Jason Andryuk 2025-10-08 16:33 ` Andrew Cooper 1 sibling, 2 replies; 26+ messages in thread From: Teddy Astie @ 2025-09-25 12:18 UTC (permalink / raw) To: Jan Beulich, xen-devel Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko Le 25/09/2025 à 12:48, Jan Beulich a écrit : > Along with Zen2 (which doesn't expose ERMS), both families reportedly > suffer from sub-optimal aliasing detection when deciding whether REP MOVSB > can actually be carried out the accelerated way. Therefore we want to > avoid its use in the common case (memset(), copy_page_hot()). s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC) > > Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> > Signed-off-by: Jan Beulich <jbeulich@suse.com> > --- > Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going > to be good enough. > This probably wants to be checked with benchmarks of rep movsb vs rep movsq+b (current non-ERMS algorithm). If the issue also occurs with rep movsq, it may be preferable to keep rep movsb even considering this issue. > --- a/xen/arch/x86/copy_page.S > +++ b/xen/arch/x86/copy_page.S > @@ -57,6 +57,6 @@ END(copy_page_cold) > .endm > > FUNC(copy_page_hot) > - ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_ERMS > + ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_XEN_REP_MOVSB > RET > END(copy_page_hot) > --- a/xen/arch/x86/cpu/amd.c > +++ b/xen/arch/x86/cpu/amd.c > @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu > > check_syscfg_dram_mod_en(); > > + if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS) > + && c->family != 0x19 /* Zen3/4 */) > + setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB); > + May it be fixed through a (future ?) microcode update, especially since rep movs is microcoded on these archs ? > amd_log_freq(c); > } > > --- a/xen/arch/x86/cpu/intel.c > +++ b/xen/arch/x86/cpu/intel.c > @@ -684,6 +684,9 @@ static void cf_check init_intel(struct c > */ > if (c == &boot_cpu_data && c->vfm == INTEL_SKYLAKE_X) > setup_clear_cpu_cap(X86_FEATURE_CLWB); > + > + if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS)) > + setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB); > } > > const struct cpu_dev __initconst_cf_clobber intel_cpu_dev = { > --- a/xen/arch/x86/include/asm/cpufeatures.h > +++ b/xen/arch/x86/include/asm/cpufeatures.h > @@ -7,7 +7,7 @@ > #define FSCAPINTS FEATURESET_NR_ENTRIES > > /* Synthetic words follow the featureset words. */ > -#define X86_NR_SYNTH 1 > +#define X86_NR_SYNTH 2 > #define X86_SYNTH(x) (FSCAPINTS * 32 + (x)) > > /* Synthetic features */ > @@ -43,6 +43,7 @@ XEN_CPUFEATURE(IBPB_ENTRY_PV, X86_SY > XEN_CPUFEATURE(IBPB_ENTRY_HVM, X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */ > XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */ > XEN_CPUFEATURE(PDX_COMPRESSION, X86_SYNTH(31)) /* PDX compression */ > +XEN_CPUFEATURE(XEN_REP_MOVSB, X86_SYNTH(32)) /* REP MOVSB used for memcpy() and alike. */ > > /* Bug words follow the synthetic words. */ > #define X86_NR_BUG 1 > --- a/xen/arch/x86/memcpy.S > +++ b/xen/arch/x86/memcpy.S > @@ -10,7 +10,7 @@ FUNC(memcpy) > * precautions were taken). > */ > ALTERNATIVE "and $7, %edx; shr $3, %rcx", \ > - STR(rep movsb; RET), X86_FEATURE_ERMS > + STR(rep movsb; RET), X86_FEATURE_XEN_REP_MOVSB > rep movsq > or %edx, %ecx > jz 1f > > Teddy -- Teddy Astie | Vates XCP-ng Developer XCP-ng & Xen Orchestra - Vates solutions web: https://vates.tech ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-09-25 12:18 ` Teddy Astie @ 2025-09-25 12:58 ` Jan Beulich 2025-09-30 13:03 ` Teddy Astie 2025-09-29 23:35 ` Jason Andryuk 1 sibling, 1 reply; 26+ messages in thread From: Jan Beulich @ 2025-09-25 12:58 UTC (permalink / raw) To: Teddy Astie Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko, xen-devel On 25.09.2025 14:18, Teddy Astie wrote: > Le 25/09/2025 à 12:48, Jan Beulich a écrit : >> Along with Zen2 (which doesn't expose ERMS), both families reportedly >> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB >> can actually be carried out the accelerated way. Therefore we want to >> avoid its use in the common case (memset(), copy_page_hot()). > > s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC) Oops, yes. >> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> >> --- >> Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going >> to be good enough. > > This probably wants to be checked with benchmarks of rep movsb vs rep > movsq+b (current non-ERMS algorithm). If the issue also occurs with rep > movsq, it may be preferable to keep rep movsb even considering this issue. Why? Then REP MOVSB is 8 times slower than REP MOVSQ. >> --- a/xen/arch/x86/cpu/amd.c >> +++ b/xen/arch/x86/cpu/amd.c >> @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu >> >> check_syscfg_dram_mod_en(); >> >> + if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS) >> + && c->family != 0x19 /* Zen3/4 */) >> + setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB); >> + > > May it be fixed through a (future ?) microcode update, especially since > rep movs is microcoded on these archs ? I don't know, but I also don't expect that to happen. Jan ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-09-25 12:58 ` Jan Beulich @ 2025-09-30 13:03 ` Teddy Astie 2025-10-07 12:19 ` Jan Beulich 0 siblings, 1 reply; 26+ messages in thread From: Teddy Astie @ 2025-09-30 13:03 UTC (permalink / raw) To: Jan Beulich Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko, xen-devel Le 25/09/2025 à 15:02, Jan Beulich a écrit : > On 25.09.2025 14:18, Teddy Astie wrote: >> Le 25/09/2025 à 12:48, Jan Beulich a écrit : >>> Along with Zen2 (which doesn't expose ERMS), both families reportedly >>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB >>> can actually be carried out the accelerated way. Therefore we want to >>> avoid its use in the common case (memset(), copy_page_hot()). >> >> s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC) > > Oops, yes. > >>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> >>> Signed-off-by: Jan Beulich <jbeulich@suse.com> >>> --- >>> Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going >>> to be good enough. >> >> This probably wants to be checked with benchmarks of rep movsb vs rep >> movsq+b (current non-ERMS algorithm). If the issue also occurs with rep >> movsq, it may be preferable to keep rep movsb even considering this issue. > > Why? Then REP MOVSB is 8 times slower than REP MOVSQ. > It doesn't match my observations while quickly benching rep movsb vs rep movsq+b (fallback) with varying alignments/sizes on Zen3/4 (Ryzen and EPYC). It's very sensitive to size and aligment, but in many (but not all) cases, rep movsb is significantly faster than rep movsq+b. The worst cases (mentioned bug) are much slower in both cases, though rep movsq+b tend to perform better in these cases. So unfortunately it's not as simple as rep movsb being (almost) always slower, especially with the varied copy sizes and aligments that does grant_copy. That's what I would prefer having more data to have a better picture. >>> --- a/xen/arch/x86/cpu/amd.c >>> +++ b/xen/arch/x86/cpu/amd.c >>> @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu >>> >>> check_syscfg_dram_mod_en(); >>> >>> + if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS) >>> + && c->family != 0x19 /* Zen3/4 */) >>> + setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB); >>> + >> >> May it be fixed through a (future ?) microcode update, especially since >> rep movs is microcoded on these archs ? > > I don't know, but I also don't expect that to happen. > > Jan > Teddy -- Teddy Astie | Vates XCP-ng Developer XCP-ng & Xen Orchestra - Vates solutions web: https://vates.tech ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-09-30 13:03 ` Teddy Astie @ 2025-10-07 12:19 ` Jan Beulich 0 siblings, 0 replies; 26+ messages in thread From: Jan Beulich @ 2025-10-07 12:19 UTC (permalink / raw) To: Teddy Astie Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko, xen-devel On 30.09.2025 15:03, Teddy Astie wrote: > Le 25/09/2025 à 15:02, Jan Beulich a écrit : >> On 25.09.2025 14:18, Teddy Astie wrote: >>> Le 25/09/2025 à 12:48, Jan Beulich a écrit : >>>> Along with Zen2 (which doesn't expose ERMS), both families reportedly >>>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB >>>> can actually be carried out the accelerated way. Therefore we want to >>>> avoid its use in the common case (memset(), copy_page_hot()). >>> >>> s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC) >> >> Oops, yes. >> >>>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com> >>>> --- >>>> Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going >>>> to be good enough. >>> >>> This probably wants to be checked with benchmarks of rep movsb vs rep >>> movsq+b (current non-ERMS algorithm). If the issue also occurs with rep >>> movsq, it may be preferable to keep rep movsb even considering this issue. >> >> Why? Then REP MOVSB is 8 times slower than REP MOVSQ. >> > > It doesn't match my observations while quickly benching rep movsb vs rep > movsq+b (fallback) with varying alignments/sizes on Zen3/4 (Ryzen and EPYC). > > It's very sensitive to size and aligment, but in many (but not all) > cases, rep movsb is significantly faster than rep movsq+b. The worst > cases (mentioned bug) are much slower in both cases, though rep movsq+b > tend to perform better in these cases. Which is what the patch here is trying to address. > So unfortunately it's not as simple as rep movsb being (almost) always > slower, especially with the varied copy sizes and aligments that does > grant_copy. That's what I would prefer having more data to have a better > picture. Well, what I would have preferred is some actual written down description of the aliasing issue. I'm unaware of such; the patch is solely based on what Andrew has been telling me verbally (piecemeal). I've tried to reflect this in how the description is written. What you suggest would, aiui, entail more complicated decision logic in the memcpy() implementation, which (at least for now) we'd like to avoid. Jan ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-09-25 12:18 ` Teddy Astie 2025-09-25 12:58 ` Jan Beulich @ 2025-09-29 23:35 ` Jason Andryuk 2025-10-08 11:20 ` Roger Pau Monné 1 sibling, 1 reply; 26+ messages in thread From: Jason Andryuk @ 2025-09-29 23:35 UTC (permalink / raw) To: Teddy Astie, Jan Beulich, xen-devel Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko On 2025-09-25 08:18, Teddy Astie wrote: > Le 25/09/2025 à 12:48, Jan Beulich a écrit : >> Along with Zen2 (which doesn't expose ERMS), both families reportedly >> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB >> can actually be carried out the accelerated way. Therefore we want to >> avoid its use in the common case (memset(), copy_page_hot()). > > s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC) > >> >> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> With Teddy's suggested change: Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Thanks, Jason ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-09-29 23:35 ` Jason Andryuk @ 2025-10-08 11:20 ` Roger Pau Monné 2025-10-08 16:06 ` Jan Beulich 0 siblings, 1 reply; 26+ messages in thread From: Roger Pau Monné @ 2025-10-08 11:20 UTC (permalink / raw) To: Jason Andryuk, Teddy Astie Cc: Jan Beulich, xen-devel, Andrew Cooper, Oleksii Kurochko On Mon, Sep 29, 2025 at 07:35:53PM -0400, Jason Andryuk wrote: > On 2025-09-25 08:18, Teddy Astie wrote: > > Le 25/09/2025 à 12:48, Jan Beulich a écrit : > > > Along with Zen2 (which doesn't expose ERMS), both families reportedly > > > suffer from sub-optimal aliasing detection when deciding whether REP MOVSB > > > can actually be carried out the accelerated way. Therefore we want to > > > avoid its use in the common case (memset(), copy_page_hot()). > > > > s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC) > > > > > > > > Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> > > > Signed-off-by: Jan Beulich <jbeulich@suse.com> > > With Teddy's suggested change: > > Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> It would be nice to have some actual figures whether this makes any difference though. Teddy, I think Vates had been doing some testing in this regard, do you think you could measure whether the patch makes any noticeable difference in PV network traffic for example? (as that's a heavy user of grant copy). Thanks, Roger. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-10-08 11:20 ` Roger Pau Monné @ 2025-10-08 16:06 ` Jan Beulich 2025-10-09 7:20 ` Oleksii Kurochko 0 siblings, 1 reply; 26+ messages in thread From: Jan Beulich @ 2025-10-08 16:06 UTC (permalink / raw) To: Oleksii Kurochko Cc: xen-devel, Andrew Cooper, Roger Pau Monné, Jason Andryuk, Teddy Astie On 08.10.2025 13:20, Roger Pau Monné wrote: > On Mon, Sep 29, 2025 at 07:35:53PM -0400, Jason Andryuk wrote: >> On 2025-09-25 08:18, Teddy Astie wrote: >>> Le 25/09/2025 à 12:48, Jan Beulich a écrit : >>>> Along with Zen2 (which doesn't expose ERMS), both families reportedly >>>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB >>>> can actually be carried out the accelerated way. Therefore we want to >>>> avoid its use in the common case (memset(), copy_page_hot()). >>> >>> s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC) >>> >>>> >>>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com> >> >> With Teddy's suggested change: >> >> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> > > Acked-by: Roger Pau Monné <roger.pau@citrix.com> May I ask for a release-ack here, seeing that it alters behavior that went in close before the freeze? Jan ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-10-08 16:06 ` Jan Beulich @ 2025-10-09 7:20 ` Oleksii Kurochko 0 siblings, 0 replies; 26+ messages in thread From: Oleksii Kurochko @ 2025-10-09 7:20 UTC (permalink / raw) To: Jan Beulich Cc: xen-devel, Andrew Cooper, Roger Pau Monné, Jason Andryuk, Teddy Astie [-- Attachment #1: Type: text/plain, Size: 1076 bytes --] On 10/8/25 6:06 PM, Jan Beulich wrote: > On 08.10.2025 13:20, Roger Pau Monné wrote: >> On Mon, Sep 29, 2025 at 07:35:53PM -0400, Jason Andryuk wrote: >>> On 2025-09-25 08:18, Teddy Astie wrote: >>>> Le 25/09/2025 à 12:48, Jan Beulich a écrit : >>>>> Along with Zen2 (which doesn't expose ERMS), both families reportedly >>>>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB >>>>> can actually be carried out the accelerated way. Therefore we want to >>>>> avoid its use in the common case (memset(), copy_page_hot()). >>>> s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC) >>>> >>>>> Reported-by: Andrew Cooper<andrew.cooper3@citrix.com> >>>>> Signed-off-by: Jan Beulich<jbeulich@suse.com> >>> With Teddy's suggested change: >>> >>> Reviewed-by: Jason Andryuk<jason.andryuk@amd.com> >> Acked-by: Roger Pau Monné<roger.pau@citrix.com> > May I ask for a release-ack here, seeing that it alters behavior that went in > close before the freeze? Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com> ~ Oleksii [-- Attachment #2: Type: text/html, Size: 2551 bytes --] ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich 2025-09-25 12:18 ` Teddy Astie @ 2025-10-08 16:33 ` Andrew Cooper 2025-10-09 7:27 ` Jan Beulich 1 sibling, 1 reply; 26+ messages in thread From: Andrew Cooper @ 2025-10-08 16:33 UTC (permalink / raw) To: Jan Beulich, xen-devel@lists.xenproject.org Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko On 25/09/2025 11:46 am, Jan Beulich wrote: > Along with Zen2 (which doesn't expose ERMS), both families reportedly > suffer from sub-optimal aliasing detection when deciding whether REP MOVSB > can actually be carried out the accelerated way. Therefore we want to > avoid its use in the common case (memset(), copy_page_hot()). > > Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> > Signed-off-by: Jan Beulich <jbeulich@suse.com> > --- > Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going > to be good enough. In the problem case, MOVSQ is 8 times less bad than MOVSB, but they're both slower than alternative algorithms. > > --- a/xen/arch/x86/copy_page.S > +++ b/xen/arch/x86/copy_page.S > @@ -57,6 +57,6 @@ END(copy_page_cold) > .endm > > FUNC(copy_page_hot) > - ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_ERMS > + ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_XEN_REP_MOVSB > RET > END(copy_page_hot) Hmm. Overall I think this patch is an improvement. But, for any copy_page variants, we know both pointers are 4k aligned, so will not tickle the problem case. This does mess with the naming of the synthetic feature. > --- a/xen/arch/x86/cpu/amd.c > +++ b/xen/arch/x86/cpu/amd.c > @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu > > check_syscfg_dram_mod_en(); > > + if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS) > + && c->family != 0x19 /* Zen3/4 */) Even if this is Linux style, && on the previous line please. ~Andrew ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-10-08 16:33 ` Andrew Cooper @ 2025-10-09 7:27 ` Jan Beulich 0 siblings, 0 replies; 26+ messages in thread From: Jan Beulich @ 2025-10-09 7:27 UTC (permalink / raw) To: Andrew Cooper Cc: Roger Pau Monné, Oleksii Kurochko, xen-devel@lists.xenproject.org On 08.10.2025 18:33, Andrew Cooper wrote: > On 25/09/2025 11:46 am, Jan Beulich wrote: >> --- a/xen/arch/x86/copy_page.S >> +++ b/xen/arch/x86/copy_page.S >> @@ -57,6 +57,6 @@ END(copy_page_cold) >> .endm >> >> FUNC(copy_page_hot) >> - ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_ERMS >> + ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_XEN_REP_MOVSB >> RET >> END(copy_page_hot) > > Hmm. > > Overall I think this patch is an improvement. > > But, for any copy_page variants, we know both pointers are 4k aligned, > so will not tickle the problem case. Then I fear I still haven't understood the bad "may overlap" condition. I thought that with the low 12 bits all identical in a page-copy, this case would _specifically_ trigger the bad behavior. > This does mess with the naming of the synthetic feature. Short of better naming suggestions, I would keep it as is. >> --- a/xen/arch/x86/cpu/amd.c >> +++ b/xen/arch/x86/cpu/amd.c >> @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu >> >> check_syscfg_dram_mod_en(); >> >> + if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS) >> + && c->family != 0x19 /* Zen3/4 */) > > Even if this is Linux style, && on the previous line please. No, precisely because it is Linux style. If and when we change the file to Xen style (which probably we should), such operators would move. Jan ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] x86: guard synthetic feature and bug enumerators 2025-09-25 10:45 [PATCH 0/2] x86: ERMS follow-on Jan Beulich 2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich @ 2025-09-25 10:48 ` Jan Beulich 2025-09-29 23:36 ` Jason Andryuk 2025-10-08 17:19 ` Andrew Cooper 2025-10-13 13:06 ` [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich [not found] ` <693449f18cc4480ea2cb2161a9361354@DM4PR03MB7015.namprd03.prod.outlook.com> 3 siblings, 2 replies; 26+ messages in thread From: Jan Beulich @ 2025-09-25 10:48 UTC (permalink / raw) To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné While adding new enumerators one may overlook the (rare) need to bump X86_NR_{SYNTH,BUG}. Guard against that happening by adding respective checking. The use of BUILD_BUG_ON_ZERO(), however, entails a number of other changes, as the expansion may not appear in the assembly produced. Furthermore inputs to file-scope asm() are only supported in gcc15 (or newer). No difference in generated code (debug info, however, grows quite a bit). An implication from the changes is that users of the alternatives patching macros may not use unnamed asm() input operands anymore, as the "injected" new operands would break numbering expectations. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/include/asm/alternative.h +++ b/xen/arch/x86/include/asm/alternative.h @@ -70,12 +70,12 @@ extern void alternative_instructions(voi alt_repl_len(n2)) "-" alt_orig_len) #define ALTINSTR_ENTRY(feature, num) \ - " .if (" STR(feature & ~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \ + " .if (%c" #feature " & " STR(~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \ " .error \"alternative feature outside of featureset range\"\n" \ " .endif\n" \ " .long .LXEN%=_orig_s - .\n" /* label */ \ " .long " alt_repl_s(num)" - .\n" /* new instruction */ \ - " .word " STR(feature) "\n" /* feature bit */ \ + " .word %c" #feature "\n" /* feature bit */ \ " .byte " alt_orig_len "\n" /* source len */ \ " .byte " alt_repl_len(num) "\n" /* replacement len */ \ " .byte " alt_pad_len "\n" /* padding len */ \ @@ -127,14 +127,14 @@ extern void alternative_instructions(voi */ #define alternative(oldinstr, newinstr, feature) \ asm_inline volatile ( \ - ALTERNATIVE(oldinstr, newinstr, feature) \ - ::: "memory" ) + ALTERNATIVE(oldinstr, newinstr, [feat]) \ + :: [feat] "i" (feature) : "memory" ) #define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \ asm_inline volatile ( \ - ALTERNATIVE_2(oldinstr, newinstr1, feature1, \ - newinstr2, feature2) \ - ::: "memory" ) + ALTERNATIVE_2(oldinstr, newinstr1, [feat1], \ + newinstr2, [feat2]) \ + :: [feat1] "i" (feature1), [feat2] "i" (feature2) : "memory" ) /* * Alternative inline assembly with input. @@ -148,14 +148,14 @@ extern void alternative_instructions(voi */ #define alternative_input(oldinstr, newinstr, feature, input...) \ asm_inline volatile ( \ - ALTERNATIVE(oldinstr, newinstr, feature) \ - :: input ) + ALTERNATIVE(oldinstr, newinstr, [feat]) \ + :: [feat] "i" (feature), ## input ) /* Like alternative_input, but with a single output argument */ #define alternative_io(oldinstr, newinstr, feature, output, input...) \ asm_inline volatile ( \ - ALTERNATIVE(oldinstr, newinstr, feature) \ - : output : input ) + ALTERNATIVE(oldinstr, newinstr, [feat]) \ + : output : [feat] "i" (feature), ## input ) /* * This is similar to alternative_io. But it has two features and @@ -168,9 +168,9 @@ extern void alternative_instructions(voi #define alternative_io_2(oldinstr, newinstr1, feature1, newinstr2, \ feature2, output, input...) \ asm_inline volatile ( \ - ALTERNATIVE_2(oldinstr, newinstr1, feature1, \ - newinstr2, feature2) \ - : output : input ) + ALTERNATIVE_2(oldinstr, newinstr1, [feat1], \ + newinstr2, [feat2]) \ + : output : [feat1] "i" (feature1), [feat2] "i" (feature2), ## input ) /* Use this macro(s) if you need more than one output parameter. */ #define ASM_OUTPUT2(a...) a --- a/xen/arch/x86/include/asm/cpufeatures.h +++ b/xen/arch/x86/include/asm/cpufeatures.h @@ -6,9 +6,16 @@ /* Number of capability words covered by the featureset words. */ #define FSCAPINTS FEATURESET_NR_ENTRIES +#if !defined(__ASSEMBLY__) && __GNUC__ >= 15 +#include <xen/macros.h> +#define X86_CHECK_FEAT_NR(x, n) BUILD_BUG_ON_ZERO((x) / 32 >= X86_NR_ ## n) +#else +#define X86_CHECK_FEAT_NR(x, n) 0 +#endif + /* Synthetic words follow the featureset words. */ #define X86_NR_SYNTH 2 -#define X86_SYNTH(x) (FSCAPINTS * 32 + (x)) +#define X86_SYNTH(x) (FSCAPINTS * 32 + (x) + X86_CHECK_FEAT_NR(x, SYNTH)) /* Synthetic features */ XEN_CPUFEATURE(CONSTANT_TSC, X86_SYNTH( 0)) /* TSC ticks at a constant rate */ @@ -47,7 +54,8 @@ XEN_CPUFEATURE(XEN_REP_MOVSB, X86_SY /* Bug words follow the synthetic words. */ #define X86_NR_BUG 1 -#define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x)) +#define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x) + \ + X86_CHECK_FEAT_NR(x, BUG)) #define X86_BUG_FPU_PTRS X86_BUG( 0) /* (F)X{SAVE,RSTOR} doesn't save/restore FOP/FIP/FDP. */ #define X86_BUG_NULL_SEG X86_BUG( 1) /* NULL-ing a selector preserves the base and limit. */ --- a/xen/arch/x86/include/asm/cpufeatureset.h +++ b/xen/arch/x86/include/asm/cpufeatureset.h @@ -12,8 +12,13 @@ enum { }; #undef XEN_CPUFEATURE +#if __GNUC__ >= 15 +#define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", %c0" \ + :: "i" (X86_FEATURE_##name)); +#else #define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", " \ __stringify(value)); +#endif #include <public/arch-x86/cpufeatureset.h> #include <asm/cpufeatures.h> --- a/xen/arch/x86/include/asm/pdx.h +++ b/xen/arch/x86/include/asm/pdx.h @@ -13,9 +13,9 @@ asm_inline goto ( \ ALTERNATIVE( \ "", \ - "jmp %l0", \ - ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) \ - : : : : label ) + "jmp %l1", \ + [feat]) \ + : : [feat] "i" (ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) : : label ) static inline unsigned long pfn_to_pdx(unsigned long pfn) { --- a/xen/arch/x86/include/asm/spec_ctrl.h +++ b/xen/arch/x86/include/asm/spec_ctrl.h @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_ /* (ab)use alternative_input() to specify clobbers. */ alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET, - : "rax", "rcx"); + "i" (0) : "rax", "rcx"); } extern int8_t opt_ibpb_ctxt_switch; @@ -163,7 +163,7 @@ static always_inline void __spec_ctrl_en * (ab)use alternative_input() to specify clobbers. */ alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_FEATURE_SC_RSB_IDLE, - : "rax", "rcx"); + "i" (0) : "rax", "rcx"); } /* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */ --- a/xen/arch/x86/include/asm/tsc.h +++ b/xen/arch/x86/include/asm/tsc.h @@ -29,8 +29,7 @@ static inline uint64_t rdtsc_ordered(voi alternative_io_2("lfence; rdtsc", "mfence; rdtsc", X86_FEATURE_MFENCE_RDTSC, "rdtscp", X86_FEATURE_RDTSCP, - ASM_OUTPUT2("=a" (low), "=d" (high), "=c" (aux)), - /* no inputs */); + ASM_OUTPUT2("=a" (low), "=d" (high), "=c" (aux))); return (high << 32) | low; } ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators 2025-09-25 10:48 ` [PATCH 2/2] x86: guard synthetic feature and bug enumerators Jan Beulich @ 2025-09-29 23:36 ` Jason Andryuk 2025-10-07 12:22 ` Jan Beulich 2025-10-08 17:19 ` Andrew Cooper 1 sibling, 1 reply; 26+ messages in thread From: Jason Andryuk @ 2025-09-29 23:36 UTC (permalink / raw) To: Jan Beulich, xen-devel@lists.xenproject.org Cc: Andrew Cooper, Roger Pau Monné On 2025-09-25 06:48, Jan Beulich wrote: > While adding new enumerators one may overlook the (rare) need to bump > X86_NR_{SYNTH,BUG}. Guard against that happening by adding respective > checking. The use of BUILD_BUG_ON_ZERO(), however, entails a number of > other changes, as the expansion may not appear in the assembly produced. > Furthermore inputs to file-scope asm() are only supported in gcc15 (or > newer). > > No difference in generated code (debug info, however, grows quite a bit). > > An implication from the changes is that users of the alternatives patching > macros may not use unnamed asm() input operands anymore, as the "injected" > new operands would break numbering expectations. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> > > --- a/xen/arch/x86/include/asm/alternative.h > +++ b/xen/arch/x86/include/asm/alternative.h > @@ -70,12 +70,12 @@ extern void alternative_instructions(voi > alt_repl_len(n2)) "-" alt_orig_len) > > #define ALTINSTR_ENTRY(feature, num) \ > - " .if (" STR(feature & ~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \ > + " .if (%c" #feature " & " STR(~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \ > " .error \"alternative feature outside of featureset range\"\n" \ > " .endif\n" \ > " .long .LXEN%=_orig_s - .\n" /* label */ \ > " .long " alt_repl_s(num)" - .\n" /* new instruction */ \ > - " .word " STR(feature) "\n" /* feature bit */ \ > + " .word %c" #feature "\n" /* feature bit */ \ > " .byte " alt_orig_len "\n" /* source len */ \ > " .byte " alt_repl_len(num) "\n" /* replacement len */ \ > " .byte " alt_pad_len "\n" /* padding len */ \ > @@ -127,14 +127,14 @@ extern void alternative_instructions(voi > */ > #define alternative(oldinstr, newinstr, feature) \ > asm_inline volatile ( \ > - ALTERNATIVE(oldinstr, newinstr, feature) \ > - ::: "memory" ) > + ALTERNATIVE(oldinstr, newinstr, [feat]) \ > + :: [feat] "i" (feature) : "memory" ) > > #define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \ > asm_inline volatile ( \ > - ALTERNATIVE_2(oldinstr, newinstr1, feature1, \ > - newinstr2, feature2) \ > - ::: "memory" ) > + ALTERNATIVE_2(oldinstr, newinstr1, [feat1], \ > + newinstr2, [feat2]) \ > + :: [feat1] "i" (feature1), [feat2] "i" (feature2) : "memory" ) > > /* > * Alternative inline assembly with input. > @@ -148,14 +148,14 @@ extern void alternative_instructions(voi > */ > #define alternative_input(oldinstr, newinstr, feature, input...) \ > asm_inline volatile ( \ > - ALTERNATIVE(oldinstr, newinstr, feature) \ > - :: input ) > + ALTERNATIVE(oldinstr, newinstr, [feat]) \ > + :: [feat] "i" (feature), ## input ) > > /* Like alternative_input, but with a single output argument */ > #define alternative_io(oldinstr, newinstr, feature, output, input...) \ > asm_inline volatile ( \ > - ALTERNATIVE(oldinstr, newinstr, feature) \ > - : output : input ) > + ALTERNATIVE(oldinstr, newinstr, [feat]) \ > + : output : [feat] "i" (feature), ## input ) > > /* > * This is similar to alternative_io. But it has two features and > @@ -168,9 +168,9 @@ extern void alternative_instructions(voi > #define alternative_io_2(oldinstr, newinstr1, feature1, newinstr2, \ > feature2, output, input...) \ > asm_inline volatile ( \ > - ALTERNATIVE_2(oldinstr, newinstr1, feature1, \ > - newinstr2, feature2) \ > - : output : input ) > + ALTERNATIVE_2(oldinstr, newinstr1, [feat1], \ > + newinstr2, [feat2]) \ > + : output : [feat1] "i" (feature1), [feat2] "i" (feature2), ## input ) > > /* Use this macro(s) if you need more than one output parameter. */ > #define ASM_OUTPUT2(a...) a > --- a/xen/arch/x86/include/asm/cpufeatures.h > +++ b/xen/arch/x86/include/asm/cpufeatures.h > @@ -6,9 +6,16 @@ > /* Number of capability words covered by the featureset words. */ > #define FSCAPINTS FEATURESET_NR_ENTRIES > > +#if !defined(__ASSEMBLY__) && __GNUC__ >= 15 > +#include <xen/macros.h> > +#define X86_CHECK_FEAT_NR(x, n) BUILD_BUG_ON_ZERO((x) / 32 >= X86_NR_ ## n) > +#else > +#define X86_CHECK_FEAT_NR(x, n) 0 > +#endif > + > /* Synthetic words follow the featureset words. */ > #define X86_NR_SYNTH 2 > -#define X86_SYNTH(x) (FSCAPINTS * 32 + (x)) > +#define X86_SYNTH(x) (FSCAPINTS * 32 + (x) + X86_CHECK_FEAT_NR(x, SYNTH)) > > /* Synthetic features */ > XEN_CPUFEATURE(CONSTANT_TSC, X86_SYNTH( 0)) /* TSC ticks at a constant rate */ > @@ -47,7 +54,8 @@ XEN_CPUFEATURE(XEN_REP_MOVSB, X86_SY > > /* Bug words follow the synthetic words. */ > #define X86_NR_BUG 1 > -#define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x)) > +#define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x) + \ > + X86_CHECK_FEAT_NR(x, BUG)) > > #define X86_BUG_FPU_PTRS X86_BUG( 0) /* (F)X{SAVE,RSTOR} doesn't save/restore FOP/FIP/FDP. */ > #define X86_BUG_NULL_SEG X86_BUG( 1) /* NULL-ing a selector preserves the base and limit. */ > --- a/xen/arch/x86/include/asm/cpufeatureset.h > +++ b/xen/arch/x86/include/asm/cpufeatureset.h > @@ -12,8 +12,13 @@ enum { > }; > #undef XEN_CPUFEATURE > > +#if __GNUC__ >= 15 > +#define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", %c0" \ > + :: "i" (X86_FEATURE_##name)); > +#else > #define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", " \ > __stringify(value)); > +#endif > #include <public/arch-x86/cpufeatureset.h> > #include <asm/cpufeatures.h> > > --- a/xen/arch/x86/include/asm/pdx.h > +++ b/xen/arch/x86/include/asm/pdx.h > @@ -13,9 +13,9 @@ > asm_inline goto ( \ > ALTERNATIVE( \ > "", \ > - "jmp %l0", \ > - ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) \ > - : : : : label ) > + "jmp %l1", \ > + [feat]) \ > + : : [feat] "i" (ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) : : label ) > > static inline unsigned long pfn_to_pdx(unsigned long pfn) > { > --- a/xen/arch/x86/include/asm/spec_ctrl.h > +++ b/xen/arch/x86/include/asm/spec_ctrl.h > @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_ > > /* (ab)use alternative_input() to specify clobbers. */ > alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET, > - : "rax", "rcx"); > + "i" (0) : "rax", "rcx"); "i" (0) is to work around the trailing comma in alternative_input() and does nothing? Thanks, Jason > } > > extern int8_t opt_ibpb_ctxt_switch; > @@ -163,7 +163,7 @@ static always_inline void __spec_ctrl_en > * (ab)use alternative_input() to specify clobbers. > */ > alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_FEATURE_SC_RSB_IDLE, > - : "rax", "rcx"); > + "i" (0) : "rax", "rcx"); > } > > /* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */ > --- a/xen/arch/x86/include/asm/tsc.h > +++ b/xen/arch/x86/include/asm/tsc.h > @@ -29,8 +29,7 @@ static inline uint64_t rdtsc_ordered(voi > alternative_io_2("lfence; rdtsc", > "mfence; rdtsc", X86_FEATURE_MFENCE_RDTSC, > "rdtscp", X86_FEATURE_RDTSCP, > - ASM_OUTPUT2("=a" (low), "=d" (high), "=c" (aux)), > - /* no inputs */); > + ASM_OUTPUT2("=a" (low), "=d" (high), "=c" (aux))); > > return (high << 32) | low; > } > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators 2025-09-29 23:36 ` Jason Andryuk @ 2025-10-07 12:22 ` Jan Beulich 2025-10-07 19:38 ` Jason Andryuk 0 siblings, 1 reply; 26+ messages in thread From: Jan Beulich @ 2025-10-07 12:22 UTC (permalink / raw) To: Jason Andryuk Cc: Andrew Cooper, Roger Pau Monné, xen-devel@lists.xenproject.org On 30.09.2025 01:36, Jason Andryuk wrote: > On 2025-09-25 06:48, Jan Beulich wrote: >> --- a/xen/arch/x86/include/asm/spec_ctrl.h >> +++ b/xen/arch/x86/include/asm/spec_ctrl.h >> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_ >> >> /* (ab)use alternative_input() to specify clobbers. */ >> alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET, >> - : "rax", "rcx"); >> + "i" (0) : "rax", "rcx"); > > "i" (0) is to work around the trailing comma in alternative_input() and > does nothing? Yes. If more such "uses" appeared, we may want to introduce some kind of abstraction. Jan ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators 2025-10-07 12:22 ` Jan Beulich @ 2025-10-07 19:38 ` Jason Andryuk 2025-10-08 5:56 ` Jan Beulich 0 siblings, 1 reply; 26+ messages in thread From: Jason Andryuk @ 2025-10-07 19:38 UTC (permalink / raw) To: Jan Beulich Cc: Andrew Cooper, Roger Pau Monné, xen-devel@lists.xenproject.org On 2025-10-07 08:22, Jan Beulich wrote: > On 30.09.2025 01:36, Jason Andryuk wrote: >> On 2025-09-25 06:48, Jan Beulich wrote: >>> --- a/xen/arch/x86/include/asm/spec_ctrl.h >>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h >>> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_ >>> >>> /* (ab)use alternative_input() to specify clobbers. */ >>> alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET, >>> - : "rax", "rcx"); >>> + "i" (0) : "rax", "rcx"); >> >> "i" (0) is to work around the trailing comma in alternative_input() and >> does nothing? > > Yes. If more such "uses" appeared, we may want to introduce some kind of > abstraction. Thanks for confirming. Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Though I also wondered if just #define X86_BUG_MAX/X86_SYNTH_MAX combined with a BUILD_BUG_ON might be good enough. Your approach avoids the extra define but is more complicated. Anyway, just a thought. Regards, Jason ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators 2025-10-07 19:38 ` Jason Andryuk @ 2025-10-08 5:56 ` Jan Beulich 2025-10-11 0:30 ` Jason Andryuk 0 siblings, 1 reply; 26+ messages in thread From: Jan Beulich @ 2025-10-08 5:56 UTC (permalink / raw) To: Jason Andryuk Cc: Andrew Cooper, Roger Pau Monné, xen-devel@lists.xenproject.org On 07.10.2025 21:38, Jason Andryuk wrote: > On 2025-10-07 08:22, Jan Beulich wrote: >> On 30.09.2025 01:36, Jason Andryuk wrote: >>> On 2025-09-25 06:48, Jan Beulich wrote: >>>> --- a/xen/arch/x86/include/asm/spec_ctrl.h >>>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h >>>> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_ >>>> >>>> /* (ab)use alternative_input() to specify clobbers. */ >>>> alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET, >>>> - : "rax", "rcx"); >>>> + "i" (0) : "rax", "rcx"); >>> >>> "i" (0) is to work around the trailing comma in alternative_input() and >>> does nothing? >> >> Yes. If more such "uses" appeared, we may want to introduce some kind of >> abstraction. > > Thanks for confirming. > > Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Thanks. > Though I also wondered if just #define X86_BUG_MAX/X86_SYNTH_MAX > combined with a BUILD_BUG_ON might be good enough. Your approach avoids > the extra define but is more complicated. Anyway, just a thought. How would that end up simplifying things? IOW what would the BUILD_BUG_ON() look like that you're thinking about? After all X86_{SYNTH,BUG}_MAX aren't meaningfully different from X86_NR_{SYNTH,BUG}. Jan ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators 2025-10-08 5:56 ` Jan Beulich @ 2025-10-11 0:30 ` Jason Andryuk 2025-10-13 6:39 ` Jan Beulich 0 siblings, 1 reply; 26+ messages in thread From: Jason Andryuk @ 2025-10-11 0:30 UTC (permalink / raw) To: Jan Beulich Cc: Andrew Cooper, Roger Pau Monné, xen-devel@lists.xenproject.org On 2025-10-08 01:56, Jan Beulich wrote: > On 07.10.2025 21:38, Jason Andryuk wrote: >> On 2025-10-07 08:22, Jan Beulich wrote: >>> On 30.09.2025 01:36, Jason Andryuk wrote: >>>> On 2025-09-25 06:48, Jan Beulich wrote: >>>>> --- a/xen/arch/x86/include/asm/spec_ctrl.h >>>>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h >>>>> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_ >>>>> >>>>> /* (ab)use alternative_input() to specify clobbers. */ >>>>> alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET, >>>>> - : "rax", "rcx"); >>>>> + "i" (0) : "rax", "rcx"); >>>> >>>> "i" (0) is to work around the trailing comma in alternative_input() and >>>> does nothing? >>> >>> Yes. If more such "uses" appeared, we may want to introduce some kind of >>> abstraction. >> >> Thanks for confirming. >> >> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> > > Thanks. > >> Though I also wondered if just #define X86_BUG_MAX/X86_SYNTH_MAX >> combined with a BUILD_BUG_ON might be good enough. Your approach avoids >> the extra define but is more complicated. Anyway, just a thought. > > How would that end up simplifying things? IOW what would the BUILD_BUG_ON() > look like that you're thinking about? After all X86_{SYNTH,BUG}_MAX aren't > meaningfully different from X86_NR_{SYNTH,BUG}. Originally, I was thinking something like XEN_CPUFEATURE(PDX_COMPRESSION, X86_SYNTH(31)) /* PDX compression */ +#define X86_SYNTH_MAX 31 /* Bump when adding flags */ and: BUILD_BUG_ON( ((X86_SYNTH_MAX / 32) + 1) > X86_NR_SYNTH ) Not automated, but adding a new flag should make it obvious X86_SYNTH_MAX should increase. But as you point out the redundancy of X86_{SYNTH,BUG}_MAX and X86_NR_{SYNTH,BUG}. But we could re-arrange to make X86_NR_{SYNTH,BUG} calculated from X86_{SYNTH,BUG}_MAX like below. Again, it's not automated, but it should make it harder to miss increasing the value. Regards, Jason diff --git i/xen/arch/x86/include/asm/cpufeatures.h w/xen/arch/x86/include/asm/cpufeatures.h index 0a98676c16..724eb1599f 100644 --- i/xen/arch/x86/include/asm/cpufeatures.h +++ w/xen/arch/x86/include/asm/cpufeatures.h @@ -7,7 +7,6 @@ #define FSCAPINTS FEATURESET_NR_ENTRIES /* Synthetic words follow the featureset words. */ -#define X86_NR_SYNTH 1 #define X86_SYNTH(x) (FSCAPINTS * 32 + (x)) /* Synthetic features */ @@ -43,9 +42,10 @@ XEN_CPUFEATURE(IBPB_ENTRY_PV, X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen for XEN_CPUFEATURE(IBPB_ENTRY_HVM, X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */ XEN_CPUFEATURE(PDX_COMPRESSION, X86_SYNTH(31)) /* PDX compression */ +#define X86_SYNTH_MAX 31 /* Bump when adding new flags. */ +#define X86_NR_SYNTH ((X86_SYNTH_MAX / 32) + 1) /* Bug words follow the synthetic words. */ -#define X86_NR_BUG 1 #define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x)) #define X86_BUG_FPU_PTRS X86_BUG( 0) /* (F)X{SAVE,RSTOR} doesn't save/restore FOP/FIP/FDP. */ @@ -62,6 +62,8 @@ XEN_CPUFEATURE(PDX_COMPRESSION, X86_SYNTH(31)) /* PDX compression */ #define X86_SPEC_BHB_TSX X86_BUG(19) /* Use clear_bhb_tsx for BHI mitigation. */ #define X86_SPEC_BHB_LOOPS X86_BUG(20) /* Use clear_bhb_loops for BHI mitigation.*/ #define X86_SPEC_BHB_LOOPS_LONG X86_BUG(21) /* Upgrade clear_bhb_loops to the "long" sequence. */ +#define X86_BUX_MAX 21 /* Bump when adding new flags. */ +#define X86_NR_BUG ((X86_BUG_MAX / 32) + 1) /* Total number of capability words, inc synth and bug words. */ #define NCAPINTS (FSCAPINTS + X86_NR_SYNTH + X86_NR_BUG) /* N 32-bit words worth of info */ ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators 2025-10-11 0:30 ` Jason Andryuk @ 2025-10-13 6:39 ` Jan Beulich 0 siblings, 0 replies; 26+ messages in thread From: Jan Beulich @ 2025-10-13 6:39 UTC (permalink / raw) To: Jason Andryuk Cc: Andrew Cooper, Roger Pau Monné, xen-devel@lists.xenproject.org On 11.10.2025 02:30, Jason Andryuk wrote: > On 2025-10-08 01:56, Jan Beulich wrote: >> On 07.10.2025 21:38, Jason Andryuk wrote: >>> On 2025-10-07 08:22, Jan Beulich wrote: >>>> On 30.09.2025 01:36, Jason Andryuk wrote: >>>>> On 2025-09-25 06:48, Jan Beulich wrote: >>>>>> --- a/xen/arch/x86/include/asm/spec_ctrl.h >>>>>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h >>>>>> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_ >>>>>> >>>>>> /* (ab)use alternative_input() to specify clobbers. */ >>>>>> alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET, >>>>>> - : "rax", "rcx"); >>>>>> + "i" (0) : "rax", "rcx"); >>>>> >>>>> "i" (0) is to work around the trailing comma in alternative_input() and >>>>> does nothing? >>>> >>>> Yes. If more such "uses" appeared, we may want to introduce some kind of >>>> abstraction. >>> >>> Thanks for confirming. >>> >>> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> >> >> Thanks. >> >>> Though I also wondered if just #define X86_BUG_MAX/X86_SYNTH_MAX >>> combined with a BUILD_BUG_ON might be good enough. Your approach avoids >>> the extra define but is more complicated. Anyway, just a thought. >> >> How would that end up simplifying things? IOW what would the BUILD_BUG_ON() >> look like that you're thinking about? After all X86_{SYNTH,BUG}_MAX aren't >> meaningfully different from X86_NR_{SYNTH,BUG}. > > Originally, I was thinking something like > XEN_CPUFEATURE(PDX_COMPRESSION, X86_SYNTH(31)) /* PDX compression */ > +#define X86_SYNTH_MAX 31 /* Bump when adding flags */ I don't really like this. Especially as presented is creates an ambiguity: Would one need to increase the value upon any flag addition, or only upon adding a new flag with a value divisible by 32. (From the patch fragment you appended it's clear the latter is meant, but that's not clear here, and when one learns to routinely ignore the comment, there's a risk of also ignoring it when it shouldn't be ignored. Whereas if the bump was required every time a new flag was added, I would dislike the unnecessary churn. In the end - yes, there's a reason why I did things the way done, even if there are other aspects to it which are (for the patch itself, but imo not once it would have gone in) not overly nice. Jan ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators 2025-09-25 10:48 ` [PATCH 2/2] x86: guard synthetic feature and bug enumerators Jan Beulich 2025-09-29 23:36 ` Jason Andryuk @ 2025-10-08 17:19 ` Andrew Cooper 2025-10-09 7:45 ` Jan Beulich 1 sibling, 1 reply; 26+ messages in thread From: Andrew Cooper @ 2025-10-08 17:19 UTC (permalink / raw) To: Jan Beulich, xen-devel@lists.xenproject.org Cc: Andrew Cooper, Roger Pau Monné On 25/09/2025 11:48 am, Jan Beulich wrote: > While adding new enumerators one may overlook the (rare) need to bump > X86_NR_{SYNTH,BUG}. Guard against that happening by adding respective > checking. The use of BUILD_BUG_ON_ZERO(), however, entails a number of > other changes, as the expansion may not appear in the assembly produced. > Furthermore inputs to file-scope asm() are only supported in gcc15 (or > newer). > > No difference in generated code (debug info, however, grows quite a bit). > > An implication from the changes is that users of the alternatives patching > macros may not use unnamed asm() input operands anymore, as the "injected" > new operands would break numbering expectations. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> > > --- a/xen/arch/x86/include/asm/alternative.h > +++ b/xen/arch/x86/include/asm/alternative.h > @@ -70,12 +70,12 @@ extern void alternative_instructions(voi > alt_repl_len(n2)) "-" alt_orig_len) > > #define ALTINSTR_ENTRY(feature, num) \ > - " .if (" STR(feature & ~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \ > + " .if (%c" #feature " & " STR(~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \ > " .error \"alternative feature outside of featureset range\"\n" \ > " .endif\n" \ > " .long .LXEN%=_orig_s - .\n" /* label */ \ > " .long " alt_repl_s(num)" - .\n" /* new instruction */ \ > - " .word " STR(feature) "\n" /* feature bit */ \ > + " .word %c" #feature "\n" /* feature bit */ \ > " .byte " alt_orig_len "\n" /* source len */ \ > " .byte " alt_repl_len(num) "\n" /* replacement len */ \ > " .byte " alt_pad_len "\n" /* padding len */ \ > @@ -127,14 +127,14 @@ extern void alternative_instructions(voi > */ > #define alternative(oldinstr, newinstr, feature) \ > asm_inline volatile ( \ > - ALTERNATIVE(oldinstr, newinstr, feature) \ > - ::: "memory" ) > + ALTERNATIVE(oldinstr, newinstr, [feat]) \ > + :: [feat] "i" (feature) : "memory" ) I don't understand. How is this related to putting the guard in place? > --- a/xen/arch/x86/include/asm/cpufeatureset.h > +++ b/xen/arch/x86/include/asm/cpufeatureset.h > @@ -12,8 +12,13 @@ enum { > }; > #undef XEN_CPUFEATURE > > +#if __GNUC__ >= 15 > +#define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", %c0" \ > + :: "i" (X86_FEATURE_##name)); > +#else > #define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", " \ > __stringify(value)); > +#endif Again - why are we making a no-op change for the sake of it? > #include <public/arch-x86/cpufeatureset.h> > #include <asm/cpufeatures.h> > > --- a/xen/arch/x86/include/asm/pdx.h > +++ b/xen/arch/x86/include/asm/pdx.h > @@ -13,9 +13,9 @@ > asm_inline goto ( \ > ALTERNATIVE( \ > "", \ > - "jmp %l0", \ > - ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) \ > - : : : : label ) > + "jmp %l1", \ > + [feat]) \ > + : : [feat] "i" (ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) : : label ) Not a bug in this change, but the pre-existing use of positional labels is something I was expecting not to introduce at all seeing as we started cleanly with named labels. The jmp wants to be: "jmp %l" #label to cope with the fact it's a macro parameter too. > > static inline unsigned long pfn_to_pdx(unsigned long pfn) > { > --- a/xen/arch/x86/include/asm/spec_ctrl.h > +++ b/xen/arch/x86/include/asm/spec_ctrl.h > @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_ > > /* (ab)use alternative_input() to specify clobbers. */ > alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET, > - : "rax", "rcx"); > + "i" (0) : "rax", "rcx"); > } As the comment says, this is already an abuse of the macro for a purpose it wasn't intended for. Now requiring an extra "nop" parameter to get the abuse to compile is too far. It can turn into a plain ALTERNATIVE(), and then both disappear. ~Andrew ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators 2025-10-08 17:19 ` Andrew Cooper @ 2025-10-09 7:45 ` Jan Beulich 2025-10-09 7:51 ` Jan Beulich 0 siblings, 1 reply; 26+ messages in thread From: Jan Beulich @ 2025-10-09 7:45 UTC (permalink / raw) To: Andrew Cooper; +Cc: Roger Pau Monné, xen-devel@lists.xenproject.org On 08.10.2025 19:19, Andrew Cooper wrote: > On 25/09/2025 11:48 am, Jan Beulich wrote: >> --- a/xen/arch/x86/include/asm/alternative.h >> +++ b/xen/arch/x86/include/asm/alternative.h >> @@ -70,12 +70,12 @@ extern void alternative_instructions(voi >> alt_repl_len(n2)) "-" alt_orig_len) >> >> #define ALTINSTR_ENTRY(feature, num) \ >> - " .if (" STR(feature & ~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \ >> + " .if (%c" #feature " & " STR(~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \ >> " .error \"alternative feature outside of featureset range\"\n" \ >> " .endif\n" \ >> " .long .LXEN%=_orig_s - .\n" /* label */ \ >> " .long " alt_repl_s(num)" - .\n" /* new instruction */ \ >> - " .word " STR(feature) "\n" /* feature bit */ \ >> + " .word %c" #feature "\n" /* feature bit */ \ >> " .byte " alt_orig_len "\n" /* source len */ \ >> " .byte " alt_repl_len(num) "\n" /* replacement len */ \ >> " .byte " alt_pad_len "\n" /* padding len */ \ >> @@ -127,14 +127,14 @@ extern void alternative_instructions(voi >> */ >> #define alternative(oldinstr, newinstr, feature) \ >> asm_inline volatile ( \ >> - ALTERNATIVE(oldinstr, newinstr, feature) \ >> - ::: "memory" ) >> + ALTERNATIVE(oldinstr, newinstr, [feat]) \ >> + :: [feat] "i" (feature) : "memory" ) > > I don't understand. How is this related to putting the guard in place? The change here is needed to fit the change to ALTINSTR_ENTRY() above. That change in turn is needed because #define X86_SYNTH(x) (FSCAPINTS * 32 + (x) + X86_CHECK_FEAT_NR(x, SYNTH)) with #define X86_CHECK_FEAT_NR(x, n) BUILD_BUG_ON_ZERO((x) / 32 >= X86_NR_ ## n) now needs to be evaluated by the compiler. If we used stringification as before, the assembler would get to see BUILD_BUG_ON_ZERO(). >> --- a/xen/arch/x86/include/asm/cpufeatureset.h >> +++ b/xen/arch/x86/include/asm/cpufeatureset.h >> @@ -12,8 +12,13 @@ enum { >> }; >> #undef XEN_CPUFEATURE >> >> +#if __GNUC__ >= 15 >> +#define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", %c0" \ >> + :: "i" (X86_FEATURE_##name)); >> +#else >> #define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", " \ >> __stringify(value)); >> +#endif > > Again - why are we making a no-op change for the sake of it? See above. >> --- a/xen/arch/x86/include/asm/pdx.h >> +++ b/xen/arch/x86/include/asm/pdx.h >> @@ -13,9 +13,9 @@ >> asm_inline goto ( \ >> ALTERNATIVE( \ >> "", \ >> - "jmp %l0", \ >> - ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) \ >> - : : : : label ) >> + "jmp %l1", \ >> + [feat]) \ >> + : : [feat] "i" (ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) : : label ) > > Not a bug in this change, but the pre-existing use of positional labels > is something I was expecting not to introduce at all seeing as we > started cleanly with named labels. > > The jmp wants to be: > > "jmp %l" #label > > to cope with the fact it's a macro parameter too. Unrelated change? I can of course do the adjustment in a separate prereq patch, but then it would have been nice if you had commented along these lines before that code actually had gone in. That said, isn't it at least bad practice to not expose the label use to the compiler? To avoid using positional operands, shouldn't we rather name the operand, and then use "jmp %l[whatever_the_name]"? That's a change I could see as being justified to do right here, rather than in a separate patch. >> --- a/xen/arch/x86/include/asm/spec_ctrl.h >> +++ b/xen/arch/x86/include/asm/spec_ctrl.h >> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_ >> >> /* (ab)use alternative_input() to specify clobbers. */ >> alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET, >> - : "rax", "rcx"); >> + "i" (0) : "rax", "rcx"); >> } > > As the comment says, this is already an abuse of the macro for a purpose > it wasn't intended for. > > Now requiring an extra "nop" parameter to get the abuse to compile is > too far. It can turn into a plain ALTERNATIVE(), and then both disappear. Except that, for the reasons explained further up, I'd rather not see new explicit uses of ALTERNATIVE() appear. In the end it's not clear which one is the lesser evil. Maybe we should gain alternative_clobber()? Jan ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators 2025-10-09 7:45 ` Jan Beulich @ 2025-10-09 7:51 ` Jan Beulich 0 siblings, 0 replies; 26+ messages in thread From: Jan Beulich @ 2025-10-09 7:51 UTC (permalink / raw) To: Andrew Cooper; +Cc: Roger Pau Monné, xen-devel@lists.xenproject.org On 09.10.2025 09:45, Jan Beulich wrote: > On 08.10.2025 19:19, Andrew Cooper wrote: >> On 25/09/2025 11:48 am, Jan Beulich wrote: >>> --- a/xen/arch/x86/include/asm/pdx.h >>> +++ b/xen/arch/x86/include/asm/pdx.h >>> @@ -13,9 +13,9 @@ >>> asm_inline goto ( \ >>> ALTERNATIVE( \ >>> "", \ >>> - "jmp %l0", \ >>> - ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) \ >>> - : : : : label ) >>> + "jmp %l1", \ >>> + [feat]) \ >>> + : : [feat] "i" (ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) : : label ) >> >> Not a bug in this change, but the pre-existing use of positional labels >> is something I was expecting not to introduce at all seeing as we >> started cleanly with named labels. >> >> The jmp wants to be: >> >> "jmp %l" #label >> >> to cope with the fact it's a macro parameter too. > > Unrelated change? I can of course do the adjustment in a separate prereq > patch, but then it would have been nice if you had commented along these > lines before that code actually had gone in. > > That said, isn't it at least bad practice to not expose the label use to > the compiler? To avoid using positional operands, shouldn't we rather > name the operand, and then use "jmp %l[whatever_the_name]"? That's a > change I could see as being justified to do right here, rather than in a > separate patch. Hmm, no, labels can't be named; they are their own names. I.e. what I think we want here is "jmp %l[" #label "]". Jan ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4 2025-09-25 10:45 [PATCH 0/2] x86: ERMS follow-on Jan Beulich 2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich 2025-09-25 10:48 ` [PATCH 2/2] x86: guard synthetic feature and bug enumerators Jan Beulich @ 2025-10-13 13:06 ` Jan Beulich [not found] ` <693449f18cc4480ea2cb2161a9361354@DM4PR03MB7015.namprd03.prod.outlook.com> 3 siblings, 0 replies; 26+ messages in thread From: Jan Beulich @ 2025-10-13 13:06 UTC (permalink / raw) To: xen-devel@lists.xenproject.org Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko Along with Zen2 (which doesn't expose ERMS), both families reportedly suffer from sub-optimal aliasing detection when deciding whether REP MOVSB can actually be carried out the accelerated way. Therefore we want to avoid its use in the common case of memcpy(); copy_page_hot() is fine, as its two pointers are always going to be having the same low 5 bits. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com> --- Trying to amend the comment in cpufeatures.h (e.g. "..., i.e. ERMS minus the Zen3/4 pointer aliasing issue") makes it get longish, so I kept it at the shortened form. --- v2: Leave page copying alone. --- a/xen/arch/x86/cpu/amd.c +++ b/xen/arch/x86/cpu/amd.c @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu check_syscfg_dram_mod_en(); + if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS) + && c->family != 0x19 /* Zen3/4 */) + setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB); + amd_log_freq(c); } --- a/xen/arch/x86/cpu/intel.c +++ b/xen/arch/x86/cpu/intel.c @@ -684,6 +684,9 @@ static void cf_check init_intel(struct c */ if (c == &boot_cpu_data && c->vfm == INTEL_SKYLAKE_X) setup_clear_cpu_cap(X86_FEATURE_CLWB); + + if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS)) + setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB); } const struct cpu_dev __initconst_cf_clobber intel_cpu_dev = { --- a/xen/arch/x86/include/asm/cpufeatures.h +++ b/xen/arch/x86/include/asm/cpufeatures.h @@ -7,7 +7,7 @@ #define FSCAPINTS FEATURESET_NR_ENTRIES /* Synthetic words follow the featureset words. */ -#define X86_NR_SYNTH 1 +#define X86_NR_SYNTH 2 #define X86_SYNTH(x) (FSCAPINTS * 32 + (x)) /* Synthetic features */ @@ -43,6 +43,7 @@ XEN_CPUFEATURE(IBPB_ENTRY_PV, X86_SY XEN_CPUFEATURE(IBPB_ENTRY_HVM, X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */ XEN_CPUFEATURE(PDX_COMPRESSION, X86_SYNTH(31)) /* PDX compression */ +XEN_CPUFEATURE(XEN_REP_MOVSB, X86_SYNTH(32)) /* REP MOVSB used for memcpy() */ /* Bug words follow the synthetic words. */ #define X86_NR_BUG 1 --- a/xen/arch/x86/memcpy.S +++ b/xen/arch/x86/memcpy.S @@ -10,7 +10,7 @@ FUNC(memcpy) * precautions were taken). */ ALTERNATIVE "and $7, %edx; shr $3, %rcx", \ - STR(rep movsb; RET), X86_FEATURE_ERMS + STR(rep movsb; RET), X86_FEATURE_XEN_REP_MOVSB rep movsq or %edx, %ecx jz 1f ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <693449f18cc4480ea2cb2161a9361354@DM4PR03MB7015.namprd03.prod.outlook.com>]
* Re: [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4 [not found] ` <693449f18cc4480ea2cb2161a9361354@DM4PR03MB7015.namprd03.prod.outlook.com> @ 2026-01-06 21:07 ` Andrew Cooper 2026-01-07 7:22 ` Jan Beulich 0 siblings, 1 reply; 26+ messages in thread From: Andrew Cooper @ 2026-01-06 21:07 UTC (permalink / raw) To: Jan Beulich, xen-devel@lists.xenproject.org Cc: Andrew Cooper, Andrew Cooper, Roger Pau Monne, Oleksii Kurochko On 13/10/2025 2:06 pm, Jan Beulich wrote: > Along with Zen2 (which doesn't expose ERMS), both families reportedly > suffer from sub-optimal aliasing detection when deciding whether REP MOVSB > can actually be carried out the accelerated way. Therefore we want to > avoid its use in the common case of memcpy(); copy_page_hot() is fine, as > its two pointers are always going to be having the same low 5 bits. I think this could be a bit clearer. How about this: ---8<--- Zen2 (which doesn't expose ERMS) through Zen4 have sub-optimal aliasing detection for REP MOVS, and fall back to a unit-at-a-time loop when the two pointers have differing bottom 5 bits. While both forms are affected, this makes REP MOVSB 8 times slower than REP MOVSQ. memcpy() has a high likelihood of encountering this slowpath, so avoid using REP MOVSB. This undoes the ERMS optimisation added in commit d6397bd0e11c which turns out to be an anti-optimisation on these microarchitectures. However, retain the use of ERMS-based REP MOVSB in other cases such as copy_page_hot() where there parameter alignment is known to avoid the slowpath. ---8<--- ? This at least gets us back to the 4.20 behaviour. ~Andrew ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4 2026-01-06 21:07 ` Andrew Cooper @ 2026-01-07 7:22 ` Jan Beulich 2026-01-07 10:42 ` Andrew Cooper 0 siblings, 1 reply; 26+ messages in thread From: Jan Beulich @ 2026-01-07 7:22 UTC (permalink / raw) To: Andrew Cooper Cc: Andrew Cooper, Roger Pau Monne, Oleksii Kurochko, xen-devel@lists.xenproject.org On 06.01.2026 22:07, Andrew Cooper wrote: > On 13/10/2025 2:06 pm, Jan Beulich wrote: >> Along with Zen2 (which doesn't expose ERMS), both families reportedly >> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB >> can actually be carried out the accelerated way. Therefore we want to >> avoid its use in the common case of memcpy(); copy_page_hot() is fine, as >> its two pointers are always going to be having the same low 5 bits. > > I think this could be a bit clearer. How about this: > > ---8<--- > Zen2 (which doesn't expose ERMS) through Zen4 have sub-optimal aliasing > detection for REP MOVS, and fall back to a unit-at-a-time loop when the > two pointers have differing bottom 5 bits. While both forms are > affected, this makes REP MOVSB 8 times slower than REP MOVSQ. > > memcpy() has a high likelihood of encountering this slowpath, so avoid > using REP MOVSB. This undoes the ERMS optimisation added in commit > d6397bd0e11c which turns out to be an anti-optimisation on these > microarchitectures. > > However, retain the use of ERMS-based REP MOVSB in other cases such as > copy_page_hot() where there parameter alignment is known to avoid the > slowpath. > ---8<--- > > ? Fine with me; changed. Do I take this as an okay-to-commit? Jan ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4 2026-01-07 7:22 ` Jan Beulich @ 2026-01-07 10:42 ` Andrew Cooper 0 siblings, 0 replies; 26+ messages in thread From: Andrew Cooper @ 2026-01-07 10:42 UTC (permalink / raw) To: Jan Beulich Cc: Andrew Cooper, Andrew Cooper, Roger Pau Monne, Oleksii Kurochko, xen-devel@lists.xenproject.org On 07/01/2026 7:22 am, Jan Beulich wrote: > On 06.01.2026 22:07, Andrew Cooper wrote: >> On 13/10/2025 2:06 pm, Jan Beulich wrote: >>> Along with Zen2 (which doesn't expose ERMS), both families reportedly >>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB >>> can actually be carried out the accelerated way. Therefore we want to >>> avoid its use in the common case of memcpy(); copy_page_hot() is fine, as >>> its two pointers are always going to be having the same low 5 bits. >> I think this could be a bit clearer. How about this: >> >> ---8<--- >> Zen2 (which doesn't expose ERMS) through Zen4 have sub-optimal aliasing >> detection for REP MOVS, and fall back to a unit-at-a-time loop when the >> two pointers have differing bottom 5 bits. While both forms are >> affected, this makes REP MOVSB 8 times slower than REP MOVSQ. >> >> memcpy() has a high likelihood of encountering this slowpath, so avoid >> using REP MOVSB. This undoes the ERMS optimisation added in commit >> d6397bd0e11c which turns out to be an anti-optimisation on these >> microarchitectures. >> >> However, retain the use of ERMS-based REP MOVSB in other cases such as >> copy_page_hot() where there parameter alignment is known to avoid the >> slowpath. >> ---8<--- >> >> ? > Fine with me; changed. Do I take this as an okay-to-commit? Yeah - with something to this effect, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Sorry it took so long. ~Andrew ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2026-01-07 10:42 UTC | newest]
Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-25 10:45 [PATCH 0/2] x86: ERMS follow-on Jan Beulich
2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich
2025-09-25 12:18 ` Teddy Astie
2025-09-25 12:58 ` Jan Beulich
2025-09-30 13:03 ` Teddy Astie
2025-10-07 12:19 ` Jan Beulich
2025-09-29 23:35 ` Jason Andryuk
2025-10-08 11:20 ` Roger Pau Monné
2025-10-08 16:06 ` Jan Beulich
2025-10-09 7:20 ` Oleksii Kurochko
2025-10-08 16:33 ` Andrew Cooper
2025-10-09 7:27 ` Jan Beulich
2025-09-25 10:48 ` [PATCH 2/2] x86: guard synthetic feature and bug enumerators Jan Beulich
2025-09-29 23:36 ` Jason Andryuk
2025-10-07 12:22 ` Jan Beulich
2025-10-07 19:38 ` Jason Andryuk
2025-10-08 5:56 ` Jan Beulich
2025-10-11 0:30 ` Jason Andryuk
2025-10-13 6:39 ` Jan Beulich
2025-10-08 17:19 ` Andrew Cooper
2025-10-09 7:45 ` Jan Beulich
2025-10-09 7:51 ` Jan Beulich
2025-10-13 13:06 ` [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich
[not found] ` <693449f18cc4480ea2cb2161a9361354@DM4PR03MB7015.namprd03.prod.outlook.com>
2026-01-06 21:07 ` Andrew Cooper
2026-01-07 7:22 ` Jan Beulich
2026-01-07 10:42 ` Andrew Cooper
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.