[PATCH 0/2] x86: ERMS follow-on

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] x86: ERMS follow-on
@ 2025-09-25 10:45 Jan Beulich
  2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Jan Beulich @ 2025-09-25 10:45 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

Patch 1 is definitely intended for 4.21, to limit possible performance fallout
from the ERMS series. Patch 2 may be too intrusive to consider taking at this
point.

1: x86/AMD: avoid REP MOVSB for Zen3/4
2: x86: guard synthetic feature and bug enumerators

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-09-25 10:45 [PATCH 0/2] x86: ERMS follow-on Jan Beulich
@ 2025-09-25 10:46 ` Jan Beulich
  2025-09-25 12:18   ` Teddy Astie
  2025-10-08 16:33   ` Andrew Cooper
  2025-09-25 10:48 ` [PATCH 2/2] x86: guard synthetic feature and bug enumerators Jan Beulich
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 26+ messages in thread
From: Jan Beulich @ 2025-09-25 10:46 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko

Along with Zen2 (which doesn't expose ERMS), both families reportedly
suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
can actually be carried out the accelerated way. Therefore we want to
avoid its use in the common case (memset(), copy_page_hot()).

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going
to be good enough.

--- a/xen/arch/x86/copy_page.S
+++ b/xen/arch/x86/copy_page.S
@@ -57,6 +57,6 @@ END(copy_page_cold)
         .endm
 
 FUNC(copy_page_hot)
-        ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_ERMS
+        ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_XEN_REP_MOVSB
         RET
 END(copy_page_hot)
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu
 
 	check_syscfg_dram_mod_en();
 
+	if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS)
+	    && c->family != 0x19 /* Zen3/4 */)
+		setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB);
+
 	amd_log_freq(c);
 }
 
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -684,6 +684,9 @@ static void cf_check init_intel(struct c
 	 */
 	if (c == &boot_cpu_data && c->vfm == INTEL_SKYLAKE_X)
 		setup_clear_cpu_cap(X86_FEATURE_CLWB);
+
+	if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS))
+		setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB);
 }
 
 const struct cpu_dev __initconst_cf_clobber intel_cpu_dev = {
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -7,7 +7,7 @@
 #define FSCAPINTS FEATURESET_NR_ENTRIES
 
 /* Synthetic words follow the featureset words. */
-#define X86_NR_SYNTH 1
+#define X86_NR_SYNTH 2
 #define X86_SYNTH(x) (FSCAPINTS * 32 + (x))
 
 /* Synthetic features */
@@ -43,6 +43,7 @@ XEN_CPUFEATURE(IBPB_ENTRY_PV,     X86_SY
 XEN_CPUFEATURE(IBPB_ENTRY_HVM,    X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */
 XEN_CPUFEATURE(USE_VMCALL,        X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */
 XEN_CPUFEATURE(PDX_COMPRESSION,   X86_SYNTH(31)) /* PDX compression */
+XEN_CPUFEATURE(XEN_REP_MOVSB,     X86_SYNTH(32)) /* REP MOVSB used for memcpy() and alike. */
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
--- a/xen/arch/x86/memcpy.S
+++ b/xen/arch/x86/memcpy.S
@@ -10,7 +10,7 @@ FUNC(memcpy)
          * precautions were taken).
          */
         ALTERNATIVE "and $7, %edx; shr $3, %rcx", \
-                    STR(rep movsb; RET), X86_FEATURE_ERMS
+                    STR(rep movsb; RET), X86_FEATURE_XEN_REP_MOVSB
         rep movsq
         or      %edx, %ecx
         jz      1f



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich
@ 2025-09-25 12:18   ` Teddy Astie
  2025-09-25 12:58     ` Jan Beulich
  2025-09-29 23:35     ` Jason Andryuk
  2025-10-08 16:33   ` Andrew Cooper
  1 sibling, 2 replies; 26+ messages in thread
From: Teddy Astie @ 2025-09-25 12:18 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko

Le 25/09/2025 à 12:48, Jan Beulich a écrit :
> Along with Zen2 (which doesn't expose ERMS), both families reportedly
> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
> can actually be carried out the accelerated way. Therefore we want to
> avoid its use in the common case (memset(), copy_page_hot()).

s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC)

>
> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going
> to be good enough.
>

This probably wants to be checked with benchmarks of rep movsb vs rep
movsq+b (current non-ERMS algorithm). If the issue also occurs with rep
movsq, it may be preferable to keep rep movsb even considering this issue.

> --- a/xen/arch/x86/copy_page.S
> +++ b/xen/arch/x86/copy_page.S
> @@ -57,6 +57,6 @@ END(copy_page_cold)
>           .endm
>
>   FUNC(copy_page_hot)
> -        ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_ERMS
> +        ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_XEN_REP_MOVSB
>           RET
>   END(copy_page_hot)
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu
>
>   	check_syscfg_dram_mod_en();
>
> +	if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS)
> +	    && c->family != 0x19 /* Zen3/4 */)
> +		setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB);
> +

May it be fixed through a (future ?) microcode update, especially since
rep movs is microcoded on these archs ?

>   	amd_log_freq(c);
>   }
>
> --- a/xen/arch/x86/cpu/intel.c
> +++ b/xen/arch/x86/cpu/intel.c
> @@ -684,6 +684,9 @@ static void cf_check init_intel(struct c
>   	 */
>   	if (c == &boot_cpu_data && c->vfm == INTEL_SKYLAKE_X)
>   		setup_clear_cpu_cap(X86_FEATURE_CLWB);
> +
> +	if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS))
> +		setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB);
>   }
>
>   const struct cpu_dev __initconst_cf_clobber intel_cpu_dev = {
> --- a/xen/arch/x86/include/asm/cpufeatures.h
> +++ b/xen/arch/x86/include/asm/cpufeatures.h
> @@ -7,7 +7,7 @@
>   #define FSCAPINTS FEATURESET_NR_ENTRIES
>
>   /* Synthetic words follow the featureset words. */
> -#define X86_NR_SYNTH 1
> +#define X86_NR_SYNTH 2
>   #define X86_SYNTH(x) (FSCAPINTS * 32 + (x))
>
>   /* Synthetic features */
> @@ -43,6 +43,7 @@ XEN_CPUFEATURE(IBPB_ENTRY_PV,     X86_SY
>   XEN_CPUFEATURE(IBPB_ENTRY_HVM,    X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */
>   XEN_CPUFEATURE(USE_VMCALL,        X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */
>   XEN_CPUFEATURE(PDX_COMPRESSION,   X86_SYNTH(31)) /* PDX compression */
> +XEN_CPUFEATURE(XEN_REP_MOVSB,     X86_SYNTH(32)) /* REP MOVSB used for memcpy() and alike. */
>
>   /* Bug words follow the synthetic words. */
>   #define X86_NR_BUG 1
> --- a/xen/arch/x86/memcpy.S
> +++ b/xen/arch/x86/memcpy.S
> @@ -10,7 +10,7 @@ FUNC(memcpy)
>            * precautions were taken).
>            */
>           ALTERNATIVE "and $7, %edx; shr $3, %rcx", \
> -                    STR(rep movsb; RET), X86_FEATURE_ERMS
> +                    STR(rep movsb; RET), X86_FEATURE_XEN_REP_MOVSB
>           rep movsq
>           or      %edx, %ecx
>           jz      1f
>
>

Teddy


--
Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-09-25 12:18   ` Teddy Astie
@ 2025-09-25 12:58     ` Jan Beulich
  2025-09-30 13:03       ` Teddy Astie
  2025-09-29 23:35     ` Jason Andryuk
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2025-09-25 12:58 UTC (permalink / raw)
  To: Teddy Astie
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko, xen-devel

On 25.09.2025 14:18, Teddy Astie wrote:
> Le 25/09/2025 à 12:48, Jan Beulich a écrit :
>> Along with Zen2 (which doesn't expose ERMS), both families reportedly
>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
>> can actually be carried out the accelerated way. Therefore we want to
>> avoid its use in the common case (memset(), copy_page_hot()).
> 
> s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC)

Oops, yes.

>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going
>> to be good enough.
> 
> This probably wants to be checked with benchmarks of rep movsb vs rep 
> movsq+b (current non-ERMS algorithm). If the issue also occurs with rep 
> movsq, it may be preferable to keep rep movsb even considering this issue.

Why? Then REP MOVSB is 8 times slower than REP MOVSQ.

>> --- a/xen/arch/x86/cpu/amd.c
>> +++ b/xen/arch/x86/cpu/amd.c
>> @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu
>>   
>>   	check_syscfg_dram_mod_en();
>>   
>> +	if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS)
>> +	    && c->family != 0x19 /* Zen3/4 */)
>> +		setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB);
>> +
> 
> May it be fixed through a (future ?) microcode update, especially since 
> rep movs is microcoded on these archs ?

I don't know, but I also don't expect that to happen.

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-09-25 12:58     ` Jan Beulich
@ 2025-09-30 13:03       ` Teddy Astie
  2025-10-07 12:19         ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Teddy Astie @ 2025-09-30 13:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko, xen-devel

Le 25/09/2025 à 15:02, Jan Beulich a écrit :
> On 25.09.2025 14:18, Teddy Astie wrote:
>> Le 25/09/2025 à 12:48, Jan Beulich a écrit :
>>> Along with Zen2 (which doesn't expose ERMS), both families reportedly
>>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
>>> can actually be carried out the accelerated way. Therefore we want to
>>> avoid its use in the common case (memset(), copy_page_hot()).
>>
>> s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC)
>
> Oops, yes.
>
>>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going
>>> to be good enough.
>>
>> This probably wants to be checked with benchmarks of rep movsb vs rep
>> movsq+b (current non-ERMS algorithm). If the issue also occurs with rep
>> movsq, it may be preferable to keep rep movsb even considering this issue.
>
> Why? Then REP MOVSB is 8 times slower than REP MOVSQ.
>

It doesn't match my observations while quickly benching rep movsb vs rep
movsq+b (fallback) with varying alignments/sizes on Zen3/4 (Ryzen and EPYC).

It's very sensitive to size and aligment, but in many (but not all)
cases, rep movsb is significantly faster than rep movsq+b. The worst
cases (mentioned bug) are much slower in both cases, though rep movsq+b
tend to perform better in these cases.

So unfortunately it's not as simple as rep movsb being (almost) always
slower, especially with the varied copy sizes and aligments that does
grant_copy. That's what I would prefer having more data to have a better
picture.

>>> --- a/xen/arch/x86/cpu/amd.c
>>> +++ b/xen/arch/x86/cpu/amd.c
>>> @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu
>>>
>>>    	check_syscfg_dram_mod_en();
>>>
>>> +	if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS)
>>> +	    && c->family != 0x19 /* Zen3/4 */)
>>> +		setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB);
>>> +
>>
>> May it be fixed through a (future ?) microcode update, especially since
>> rep movs is microcoded on these archs ?
>
> I don't know, but I also don't expect that to happen.
>
> Jan
>

Teddy


--
Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-09-30 13:03       ` Teddy Astie
@ 2025-10-07 12:19         ` Jan Beulich
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2025-10-07 12:19 UTC (permalink / raw)
  To: Teddy Astie
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko, xen-devel

On 30.09.2025 15:03, Teddy Astie wrote:
> Le 25/09/2025 à 15:02, Jan Beulich a écrit :
>> On 25.09.2025 14:18, Teddy Astie wrote:
>>> Le 25/09/2025 à 12:48, Jan Beulich a écrit :
>>>> Along with Zen2 (which doesn't expose ERMS), both families reportedly
>>>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
>>>> can actually be carried out the accelerated way. Therefore we want to
>>>> avoid its use in the common case (memset(), copy_page_hot()).
>>>
>>> s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC)
>>
>> Oops, yes.
>>
>>>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going
>>>> to be good enough.
>>>
>>> This probably wants to be checked with benchmarks of rep movsb vs rep
>>> movsq+b (current non-ERMS algorithm). If the issue also occurs with rep
>>> movsq, it may be preferable to keep rep movsb even considering this issue.
>>
>> Why? Then REP MOVSB is 8 times slower than REP MOVSQ.
>>
> 
> It doesn't match my observations while quickly benching rep movsb vs rep 
> movsq+b (fallback) with varying alignments/sizes on Zen3/4 (Ryzen and EPYC).
> 
> It's very sensitive to size and aligment, but in many (but not all) 
> cases, rep movsb is significantly faster than rep movsq+b. The worst 
> cases (mentioned bug) are much slower in both cases, though rep movsq+b 
> tend to perform better in these cases.

Which is what the patch here is trying to address.

> So unfortunately it's not as simple as rep movsb being (almost) always 
> slower, especially with the varied copy sizes and aligments that does 
> grant_copy. That's what I would prefer having more data to have a better 
> picture.

Well, what I would have preferred is some actual written down description
of the aliasing issue. I'm unaware of such; the patch is solely based on
what Andrew has been telling me verbally (piecemeal). I've tried to reflect
this in how the description is written.

What you suggest would, aiui, entail more complicated decision logic in
the memcpy() implementation, which (at least for now) we'd like to avoid.

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-09-25 12:18   ` Teddy Astie
  2025-09-25 12:58     ` Jan Beulich
@ 2025-09-29 23:35     ` Jason Andryuk
  2025-10-08 11:20       ` Roger Pau Monné
  1 sibling, 1 reply; 26+ messages in thread
From: Jason Andryuk @ 2025-09-29 23:35 UTC (permalink / raw)
  To: Teddy Astie, Jan Beulich, xen-devel
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko

On 2025-09-25 08:18, Teddy Astie wrote:
> Le 25/09/2025 à 12:48, Jan Beulich a écrit :
>> Along with Zen2 (which doesn't expose ERMS), both families reportedly
>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
>> can actually be carried out the accelerated way. Therefore we want to
>> avoid its use in the common case (memset(), copy_page_hot()).
> 
> s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC)
> 
>>
>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>

With Teddy's suggested change:

Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>

Thanks,
Jason


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-09-29 23:35     ` Jason Andryuk
@ 2025-10-08 11:20       ` Roger Pau Monné
  2025-10-08 16:06         ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Roger Pau Monné @ 2025-10-08 11:20 UTC (permalink / raw)
  To: Jason Andryuk, Teddy Astie
  Cc: Jan Beulich, xen-devel, Andrew Cooper, Oleksii Kurochko

On Mon, Sep 29, 2025 at 07:35:53PM -0400, Jason Andryuk wrote:
> On 2025-09-25 08:18, Teddy Astie wrote:
> > Le 25/09/2025 à 12:48, Jan Beulich a écrit :
> > > Along with Zen2 (which doesn't expose ERMS), both families reportedly
> > > suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
> > > can actually be carried out the accelerated way. Therefore we want to
> > > avoid its use in the common case (memset(), copy_page_hot()).
> > 
> > s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC)
> > 
> > > 
> > > Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> > > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> With Teddy's suggested change:
> 
> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

It would be nice to have some actual figures whether this makes any
difference though.  Teddy, I think Vates had been doing some testing
in this regard, do you think you could measure whether the patch makes
any noticeable difference in PV network traffic for example?  (as
that's a heavy user of grant copy).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-10-08 11:20       ` Roger Pau Monné
@ 2025-10-08 16:06         ` Jan Beulich
  2025-10-09  7:20           ` Oleksii Kurochko
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2025-10-08 16:06 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: xen-devel, Andrew Cooper, Roger Pau Monné, Jason Andryuk,
	Teddy Astie

On 08.10.2025 13:20, Roger Pau Monné wrote:
> On Mon, Sep 29, 2025 at 07:35:53PM -0400, Jason Andryuk wrote:
>> On 2025-09-25 08:18, Teddy Astie wrote:
>>> Le 25/09/2025 à 12:48, Jan Beulich a écrit :
>>>> Along with Zen2 (which doesn't expose ERMS), both families reportedly
>>>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
>>>> can actually be carried out the accelerated way. Therefore we want to
>>>> avoid its use in the common case (memset(), copy_page_hot()).
>>>
>>> s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC)
>>>
>>>>
>>>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> With Teddy's suggested change:
>>
>> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

May I ask for a release-ack here, seeing that it alters behavior that went in
close before the freeze?

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-10-08 16:06         ` Jan Beulich
@ 2025-10-09  7:20           ` Oleksii Kurochko
  0 siblings, 0 replies; 26+ messages in thread
From: Oleksii Kurochko @ 2025-10-09  7:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Roger Pau Monné, Jason Andryuk,
	Teddy Astie

[-- Attachment #1: Type: text/plain, Size: 1076 bytes --]


On 10/8/25 6:06 PM, Jan Beulich wrote:
> On 08.10.2025 13:20, Roger Pau Monné wrote:
>> On Mon, Sep 29, 2025 at 07:35:53PM -0400, Jason Andryuk wrote:
>>> On 2025-09-25 08:18, Teddy Astie wrote:
>>>> Le 25/09/2025 à 12:48, Jan Beulich a écrit :
>>>>> Along with Zen2 (which doesn't expose ERMS), both families reportedly
>>>>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
>>>>> can actually be carried out the accelerated way. Therefore we want to
>>>>> avoid its use in the common case (memset(), copy_page_hot()).
>>>> s/memset/memcpy (memset probably uses rep stosb which is not affected IIUC)
>>>>
>>>>> Reported-by: Andrew Cooper<andrew.cooper3@citrix.com>
>>>>> Signed-off-by: Jan Beulich<jbeulich@suse.com>
>>> With Teddy's suggested change:
>>>
>>> Reviewed-by: Jason Andryuk<jason.andryuk@amd.com>
>> Acked-by: Roger Pau Monné<roger.pau@citrix.com>
> May I ask for a release-ack here, seeing that it alters behavior that went in
> close before the freeze?

Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>

~ Oleksii

[-- Attachment #2: Type: text/html, Size: 2551 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich
  2025-09-25 12:18   ` Teddy Astie
@ 2025-10-08 16:33   ` Andrew Cooper
  2025-10-09  7:27     ` Jan Beulich
  1 sibling, 1 reply; 26+ messages in thread
From: Andrew Cooper @ 2025-10-08 16:33 UTC (permalink / raw)
  To: Jan Beulich, xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko

On 25/09/2025 11:46 am, Jan Beulich wrote:
> Along with Zen2 (which doesn't expose ERMS), both families reportedly
> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
> can actually be carried out the accelerated way. Therefore we want to
> avoid its use in the common case (memset(), copy_page_hot()).
>
> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Question is whether merely avoiding REP MOVSB (but not REP MOVSQ) is going
> to be good enough.

In the problem case, MOVSQ is 8 times less bad than MOVSB, but they're
both slower than alternative algorithms.

>
> --- a/xen/arch/x86/copy_page.S
> +++ b/xen/arch/x86/copy_page.S
> @@ -57,6 +57,6 @@ END(copy_page_cold)
>          .endm
>  
>  FUNC(copy_page_hot)
> -        ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_ERMS
> +        ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_XEN_REP_MOVSB
>          RET
>  END(copy_page_hot)

Hmm.

Overall I think this patch is an improvement.

But, for any copy_page variants, we know both pointers are 4k aligned,
so will not tickle the problem case.

This does mess with the naming of the synthetic feature.

> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu
>  
>  	check_syscfg_dram_mod_en();
>  
> +	if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS)
> +	    && c->family != 0x19 /* Zen3/4 */)

Even if this is Linux style, && on the previous line please.

~Andrew


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-10-08 16:33   ` Andrew Cooper
@ 2025-10-09  7:27     ` Jan Beulich
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2025-10-09  7:27 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Oleksii Kurochko,
	xen-devel@lists.xenproject.org

On 08.10.2025 18:33, Andrew Cooper wrote:
> On 25/09/2025 11:46 am, Jan Beulich wrote:
>> --- a/xen/arch/x86/copy_page.S
>> +++ b/xen/arch/x86/copy_page.S
>> @@ -57,6 +57,6 @@ END(copy_page_cold)
>>          .endm
>>  
>>  FUNC(copy_page_hot)
>> -        ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_ERMS
>> +        ALTERNATIVE copy_page_movsq, copy_page_movsb, X86_FEATURE_XEN_REP_MOVSB
>>          RET
>>  END(copy_page_hot)
> 
> Hmm.
> 
> Overall I think this patch is an improvement.
> 
> But, for any copy_page variants, we know both pointers are 4k aligned,
> so will not tickle the problem case.

Then I fear I still haven't understood the bad "may overlap" condition.
I thought that with the low 12 bits all identical in a page-copy, this
case would _specifically_ trigger the bad behavior.

> This does mess with the naming of the synthetic feature.

Short of better naming suggestions, I would keep it as is.

>> --- a/xen/arch/x86/cpu/amd.c
>> +++ b/xen/arch/x86/cpu/amd.c
>> @@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu
>>  
>>  	check_syscfg_dram_mod_en();
>>  
>> +	if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS)
>> +	    && c->family != 0x19 /* Zen3/4 */)
> 
> Even if this is Linux style, && on the previous line please.

No, precisely because it is Linux style. If and when we change the file
to Xen style (which probably we should), such operators would move.

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/2] x86: guard synthetic feature and bug enumerators
  2025-09-25 10:45 [PATCH 0/2] x86: ERMS follow-on Jan Beulich
  2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich
@ 2025-09-25 10:48 ` Jan Beulich
  2025-09-29 23:36   ` Jason Andryuk
  2025-10-08 17:19   ` Andrew Cooper
  2025-10-13 13:06 ` [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich
       [not found] ` <693449f18cc4480ea2cb2161a9361354@DM4PR03MB7015.namprd03.prod.outlook.com>
  3 siblings, 2 replies; 26+ messages in thread
From: Jan Beulich @ 2025-09-25 10:48 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

While adding new enumerators one may overlook the (rare) need to bump
X86_NR_{SYNTH,BUG}. Guard against that happening by adding respective
checking. The use of BUILD_BUG_ON_ZERO(), however, entails a number of
other changes, as the expansion may not appear in the assembly produced.
Furthermore inputs to file-scope asm() are only supported in gcc15 (or
newer).

No difference in generated code (debug info, however, grows quite a bit).

An implication from the changes is that users of the alternatives patching
macros may not use unnamed asm() input operands anymore, as the "injected"
new operands would break numbering expectations.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -70,12 +70,12 @@ extern void alternative_instructions(voi
                     alt_repl_len(n2)) "-" alt_orig_len)
 
 #define ALTINSTR_ENTRY(feature, num)                                    \
-        " .if (" STR(feature & ~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \
+        " .if (%c" #feature " & " STR(~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \
         " .error \"alternative feature outside of featureset range\"\n" \
         " .endif\n"                                                     \
         " .long .LXEN%=_orig_s - .\n"             /* label           */ \
         " .long " alt_repl_s(num)" - .\n"         /* new instruction */ \
-        " .word " STR(feature) "\n"               /* feature bit     */ \
+        " .word %c" #feature "\n"                 /* feature bit     */ \
         " .byte " alt_orig_len "\n"               /* source len      */ \
         " .byte " alt_repl_len(num) "\n"          /* replacement len */ \
         " .byte " alt_pad_len "\n"                /* padding len     */ \
@@ -127,14 +127,14 @@ extern void alternative_instructions(voi
  */
 #define alternative(oldinstr, newinstr, feature)                        \
     asm_inline volatile (                                               \
-        ALTERNATIVE(oldinstr, newinstr, feature)                        \
-        ::: "memory" )
+        ALTERNATIVE(oldinstr, newinstr, [feat])                         \
+        :: [feat] "i" (feature) : "memory" )
 
 #define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \
     asm_inline volatile (                                               \
-        ALTERNATIVE_2(oldinstr, newinstr1, feature1,                    \
-                      newinstr2, feature2)                              \
-        ::: "memory" )
+        ALTERNATIVE_2(oldinstr, newinstr1, [feat1],                     \
+                      newinstr2, [feat2])                               \
+        ::  [feat1] "i" (feature1), [feat2] "i" (feature2) : "memory" )
 
 /*
  * Alternative inline assembly with input.
@@ -148,14 +148,14 @@ extern void alternative_instructions(voi
  */
 #define alternative_input(oldinstr, newinstr, feature, input...)        \
     asm_inline volatile (                                               \
-        ALTERNATIVE(oldinstr, newinstr, feature)                        \
-        :: input )
+        ALTERNATIVE(oldinstr, newinstr, [feat])                         \
+        :: [feat] "i" (feature), ## input  )
 
 /* Like alternative_input, but with a single output argument */
 #define alternative_io(oldinstr, newinstr, feature, output, input...)   \
     asm_inline volatile (                                               \
-        ALTERNATIVE(oldinstr, newinstr, feature)                        \
-        : output : input )
+        ALTERNATIVE(oldinstr, newinstr, [feat])                         \
+        : output : [feat] "i" (feature), ## input )
 
 /*
  * This is similar to alternative_io. But it has two features and
@@ -168,9 +168,9 @@ extern void alternative_instructions(voi
 #define alternative_io_2(oldinstr, newinstr1, feature1, newinstr2,      \
                          feature2, output, input...)                    \
     asm_inline volatile (                                               \
-        ALTERNATIVE_2(oldinstr, newinstr1, feature1,                    \
-                      newinstr2, feature2)                              \
-        : output : input )
+        ALTERNATIVE_2(oldinstr, newinstr1, [feat1],                     \
+                      newinstr2, [feat2])                               \
+        : output : [feat1] "i" (feature1), [feat2] "i" (feature2), ## input )
 
 /* Use this macro(s) if you need more than one output parameter. */
 #define ASM_OUTPUT2(a...) a
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -6,9 +6,16 @@
 /* Number of capability words covered by the featureset words. */
 #define FSCAPINTS FEATURESET_NR_ENTRIES
 
+#if !defined(__ASSEMBLY__) && __GNUC__ >= 15
+#include <xen/macros.h>
+#define X86_CHECK_FEAT_NR(x, n) BUILD_BUG_ON_ZERO((x) / 32 >= X86_NR_ ## n)
+#else
+#define X86_CHECK_FEAT_NR(x, n) 0
+#endif
+
 /* Synthetic words follow the featureset words. */
 #define X86_NR_SYNTH 2
-#define X86_SYNTH(x) (FSCAPINTS * 32 + (x))
+#define X86_SYNTH(x) (FSCAPINTS * 32 + (x) + X86_CHECK_FEAT_NR(x, SYNTH))
 
 /* Synthetic features */
 XEN_CPUFEATURE(CONSTANT_TSC,      X86_SYNTH( 0)) /* TSC ticks at a constant rate */
@@ -47,7 +54,8 @@ XEN_CPUFEATURE(XEN_REP_MOVSB,     X86_SY
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
-#define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x))
+#define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x) + \
+                    X86_CHECK_FEAT_NR(x, BUG))
 
 #define X86_BUG_FPU_PTRS          X86_BUG( 0) /* (F)X{SAVE,RSTOR} doesn't save/restore FOP/FIP/FDP. */
 #define X86_BUG_NULL_SEG          X86_BUG( 1) /* NULL-ing a selector preserves the base and limit. */
--- a/xen/arch/x86/include/asm/cpufeatureset.h
+++ b/xen/arch/x86/include/asm/cpufeatureset.h
@@ -12,8 +12,13 @@ enum {
 };
 #undef XEN_CPUFEATURE
 
+#if __GNUC__ >= 15
+#define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", %c0" \
+                                         :: "i" (X86_FEATURE_##name));
+#else
 #define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", " \
                                          __stringify(value));
+#endif
 #include <public/arch-x86/cpufeatureset.h>
 #include <asm/cpufeatures.h>
 
--- a/xen/arch/x86/include/asm/pdx.h
+++ b/xen/arch/x86/include/asm/pdx.h
@@ -13,9 +13,9 @@
     asm_inline goto (                               \
         ALTERNATIVE(                                \
             "",                                     \
-            "jmp %l0",                              \
-            ALT_NOT(X86_FEATURE_PDX_COMPRESSION))   \
-        : : : : label )
+            "jmp %l1",                              \
+            [feat])                                 \
+        : : [feat] "i" (ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) : : label )
 
 static inline unsigned long pfn_to_pdx(unsigned long pfn)
 {
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_
 
     /* (ab)use alternative_input() to specify clobbers. */
     alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET,
-                      : "rax", "rcx");
+                      "i" (0) : "rax", "rcx");
 }
 
 extern int8_t opt_ibpb_ctxt_switch;
@@ -163,7 +163,7 @@ static always_inline void __spec_ctrl_en
      * (ab)use alternative_input() to specify clobbers.
      */
     alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_FEATURE_SC_RSB_IDLE,
-                      : "rax", "rcx");
+                      "i" (0) : "rax", "rcx");
 }
 
 /* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
--- a/xen/arch/x86/include/asm/tsc.h
+++ b/xen/arch/x86/include/asm/tsc.h
@@ -29,8 +29,7 @@ static inline uint64_t rdtsc_ordered(voi
     alternative_io_2("lfence; rdtsc",
                      "mfence; rdtsc", X86_FEATURE_MFENCE_RDTSC,
                      "rdtscp",        X86_FEATURE_RDTSCP,
-                     ASM_OUTPUT2("=a" (low), "=d" (high), "=c" (aux)),
-                     /* no inputs */);
+                     ASM_OUTPUT2("=a" (low), "=d" (high), "=c" (aux)));
 
     return (high << 32) | low;
 }



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators
  2025-09-25 10:48 ` [PATCH 2/2] x86: guard synthetic feature and bug enumerators Jan Beulich
@ 2025-09-29 23:36   ` Jason Andryuk
  2025-10-07 12:22     ` Jan Beulich
  2025-10-08 17:19   ` Andrew Cooper
  1 sibling, 1 reply; 26+ messages in thread
From: Jason Andryuk @ 2025-09-29 23:36 UTC (permalink / raw)
  To: Jan Beulich, xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné

On 2025-09-25 06:48, Jan Beulich wrote:
> While adding new enumerators one may overlook the (rare) need to bump
> X86_NR_{SYNTH,BUG}. Guard against that happening by adding respective
> checking. The use of BUILD_BUG_ON_ZERO(), however, entails a number of
> other changes, as the expansion may not appear in the assembly produced.
> Furthermore inputs to file-scope asm() are only supported in gcc15 (or
> newer).
> 
> No difference in generated code (debug info, however, grows quite a bit).
> 
> An implication from the changes is that users of the alternatives patching
> macros may not use unnamed asm() input operands anymore, as the "injected"
> new operands would break numbering expectations.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/arch/x86/include/asm/alternative.h
> +++ b/xen/arch/x86/include/asm/alternative.h
> @@ -70,12 +70,12 @@ extern void alternative_instructions(voi
>                       alt_repl_len(n2)) "-" alt_orig_len)
>   
>   #define ALTINSTR_ENTRY(feature, num)                                    \
> -        " .if (" STR(feature & ~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \
> +        " .if (%c" #feature " & " STR(~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \
>           " .error \"alternative feature outside of featureset range\"\n" \
>           " .endif\n"                                                     \
>           " .long .LXEN%=_orig_s - .\n"             /* label           */ \
>           " .long " alt_repl_s(num)" - .\n"         /* new instruction */ \
> -        " .word " STR(feature) "\n"               /* feature bit     */ \
> +        " .word %c" #feature "\n"                 /* feature bit     */ \
>           " .byte " alt_orig_len "\n"               /* source len      */ \
>           " .byte " alt_repl_len(num) "\n"          /* replacement len */ \
>           " .byte " alt_pad_len "\n"                /* padding len     */ \
> @@ -127,14 +127,14 @@ extern void alternative_instructions(voi
>    */
>   #define alternative(oldinstr, newinstr, feature)                        \
>       asm_inline volatile (                                               \
> -        ALTERNATIVE(oldinstr, newinstr, feature)                        \
> -        ::: "memory" )
> +        ALTERNATIVE(oldinstr, newinstr, [feat])                         \
> +        :: [feat] "i" (feature) : "memory" )
>   
>   #define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \
>       asm_inline volatile (                                               \
> -        ALTERNATIVE_2(oldinstr, newinstr1, feature1,                    \
> -                      newinstr2, feature2)                              \
> -        ::: "memory" )
> +        ALTERNATIVE_2(oldinstr, newinstr1, [feat1],                     \
> +                      newinstr2, [feat2])                               \
> +        ::  [feat1] "i" (feature1), [feat2] "i" (feature2) : "memory" )
>   
>   /*
>    * Alternative inline assembly with input.
> @@ -148,14 +148,14 @@ extern void alternative_instructions(voi
>    */
>   #define alternative_input(oldinstr, newinstr, feature, input...)        \
>       asm_inline volatile (                                               \
> -        ALTERNATIVE(oldinstr, newinstr, feature)                        \
> -        :: input )
> +        ALTERNATIVE(oldinstr, newinstr, [feat])                         \
> +        :: [feat] "i" (feature), ## input  )
>   
>   /* Like alternative_input, but with a single output argument */
>   #define alternative_io(oldinstr, newinstr, feature, output, input...)   \
>       asm_inline volatile (                                               \
> -        ALTERNATIVE(oldinstr, newinstr, feature)                        \
> -        : output : input )
> +        ALTERNATIVE(oldinstr, newinstr, [feat])                         \
> +        : output : [feat] "i" (feature), ## input )
>   
>   /*
>    * This is similar to alternative_io. But it has two features and
> @@ -168,9 +168,9 @@ extern void alternative_instructions(voi
>   #define alternative_io_2(oldinstr, newinstr1, feature1, newinstr2,      \
>                            feature2, output, input...)                    \
>       asm_inline volatile (                                               \
> -        ALTERNATIVE_2(oldinstr, newinstr1, feature1,                    \
> -                      newinstr2, feature2)                              \
> -        : output : input )
> +        ALTERNATIVE_2(oldinstr, newinstr1, [feat1],                     \
> +                      newinstr2, [feat2])                               \
> +        : output : [feat1] "i" (feature1), [feat2] "i" (feature2), ## input )
>   
>   /* Use this macro(s) if you need more than one output parameter. */
>   #define ASM_OUTPUT2(a...) a
> --- a/xen/arch/x86/include/asm/cpufeatures.h
> +++ b/xen/arch/x86/include/asm/cpufeatures.h
> @@ -6,9 +6,16 @@
>   /* Number of capability words covered by the featureset words. */
>   #define FSCAPINTS FEATURESET_NR_ENTRIES
>   
> +#if !defined(__ASSEMBLY__) && __GNUC__ >= 15
> +#include <xen/macros.h>
> +#define X86_CHECK_FEAT_NR(x, n) BUILD_BUG_ON_ZERO((x) / 32 >= X86_NR_ ## n)
> +#else
> +#define X86_CHECK_FEAT_NR(x, n) 0
> +#endif
> +
>   /* Synthetic words follow the featureset words. */
>   #define X86_NR_SYNTH 2
> -#define X86_SYNTH(x) (FSCAPINTS * 32 + (x))
> +#define X86_SYNTH(x) (FSCAPINTS * 32 + (x) + X86_CHECK_FEAT_NR(x, SYNTH))
>   
>   /* Synthetic features */
>   XEN_CPUFEATURE(CONSTANT_TSC,      X86_SYNTH( 0)) /* TSC ticks at a constant rate */
> @@ -47,7 +54,8 @@ XEN_CPUFEATURE(XEN_REP_MOVSB,     X86_SY
>   
>   /* Bug words follow the synthetic words. */
>   #define X86_NR_BUG 1
> -#define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x))
> +#define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x) + \
> +                    X86_CHECK_FEAT_NR(x, BUG))
>   
>   #define X86_BUG_FPU_PTRS          X86_BUG( 0) /* (F)X{SAVE,RSTOR} doesn't save/restore FOP/FIP/FDP. */
>   #define X86_BUG_NULL_SEG          X86_BUG( 1) /* NULL-ing a selector preserves the base and limit. */
> --- a/xen/arch/x86/include/asm/cpufeatureset.h
> +++ b/xen/arch/x86/include/asm/cpufeatureset.h
> @@ -12,8 +12,13 @@ enum {
>   };
>   #undef XEN_CPUFEATURE
>   
> +#if __GNUC__ >= 15
> +#define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", %c0" \
> +                                         :: "i" (X86_FEATURE_##name));
> +#else
>   #define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", " \
>                                            __stringify(value));
> +#endif
>   #include <public/arch-x86/cpufeatureset.h>
>   #include <asm/cpufeatures.h>
>   
> --- a/xen/arch/x86/include/asm/pdx.h
> +++ b/xen/arch/x86/include/asm/pdx.h
> @@ -13,9 +13,9 @@
>       asm_inline goto (                               \
>           ALTERNATIVE(                                \
>               "",                                     \
> -            "jmp %l0",                              \
> -            ALT_NOT(X86_FEATURE_PDX_COMPRESSION))   \
> -        : : : : label )
> +            "jmp %l1",                              \
> +            [feat])                                 \
> +        : : [feat] "i" (ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) : : label )
>   
>   static inline unsigned long pfn_to_pdx(unsigned long pfn)
>   {
> --- a/xen/arch/x86/include/asm/spec_ctrl.h
> +++ b/xen/arch/x86/include/asm/spec_ctrl.h
> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_
>   
>       /* (ab)use alternative_input() to specify clobbers. */
>       alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET,
> -                      : "rax", "rcx");
> +                      "i" (0) : "rax", "rcx");

"i" (0) is to work around the trailing comma in alternative_input() and 
does nothing?

Thanks,
Jason

>   }
>   
>   extern int8_t opt_ibpb_ctxt_switch;
> @@ -163,7 +163,7 @@ static always_inline void __spec_ctrl_en
>        * (ab)use alternative_input() to specify clobbers.
>        */
>       alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_FEATURE_SC_RSB_IDLE,
> -                      : "rax", "rcx");
> +                      "i" (0) : "rax", "rcx");
>   }
>   
>   /* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
> --- a/xen/arch/x86/include/asm/tsc.h
> +++ b/xen/arch/x86/include/asm/tsc.h
> @@ -29,8 +29,7 @@ static inline uint64_t rdtsc_ordered(voi
>       alternative_io_2("lfence; rdtsc",
>                        "mfence; rdtsc", X86_FEATURE_MFENCE_RDTSC,
>                        "rdtscp",        X86_FEATURE_RDTSCP,
> -                     ASM_OUTPUT2("=a" (low), "=d" (high), "=c" (aux)),
> -                     /* no inputs */);
> +                     ASM_OUTPUT2("=a" (low), "=d" (high), "=c" (aux)));
>   
>       return (high << 32) | low;
>   }
> 
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators
  2025-09-29 23:36   ` Jason Andryuk
@ 2025-10-07 12:22     ` Jan Beulich
  2025-10-07 19:38       ` Jason Andryuk
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2025-10-07 12:22 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	xen-devel@lists.xenproject.org

On 30.09.2025 01:36, Jason Andryuk wrote:
> On 2025-09-25 06:48, Jan Beulich wrote:
>> --- a/xen/arch/x86/include/asm/spec_ctrl.h
>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h
>> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_
>>   
>>       /* (ab)use alternative_input() to specify clobbers. */
>>       alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET,
>> -                      : "rax", "rcx");
>> +                      "i" (0) : "rax", "rcx");
> 
> "i" (0) is to work around the trailing comma in alternative_input() and 
> does nothing?

Yes. If more such "uses" appeared, we may want to introduce some kind of
abstraction.

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators
  2025-10-07 12:22     ` Jan Beulich
@ 2025-10-07 19:38       ` Jason Andryuk
  2025-10-08  5:56         ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Andryuk @ 2025-10-07 19:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	xen-devel@lists.xenproject.org

On 2025-10-07 08:22, Jan Beulich wrote:
> On 30.09.2025 01:36, Jason Andryuk wrote:
>> On 2025-09-25 06:48, Jan Beulich wrote:
>>> --- a/xen/arch/x86/include/asm/spec_ctrl.h
>>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h
>>> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_
>>>    
>>>        /* (ab)use alternative_input() to specify clobbers. */
>>>        alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET,
>>> -                      : "rax", "rcx");
>>> +                      "i" (0) : "rax", "rcx");
>>
>> "i" (0) is to work around the trailing comma in alternative_input() and
>> does nothing?
> 
> Yes. If more such "uses" appeared, we may want to introduce some kind of
> abstraction.

Thanks for confirming.

Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>

Though I also wondered if just #define X86_BUG_MAX/X86_SYNTH_MAX 
combined with a BUILD_BUG_ON might be good enough.  Your approach avoids 
the extra define but is more complicated.  Anyway, just a thought.

Regards,
Jason


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators
  2025-10-07 19:38       ` Jason Andryuk
@ 2025-10-08  5:56         ` Jan Beulich
  2025-10-11  0:30           ` Jason Andryuk
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2025-10-08  5:56 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	xen-devel@lists.xenproject.org

On 07.10.2025 21:38, Jason Andryuk wrote:
> On 2025-10-07 08:22, Jan Beulich wrote:
>> On 30.09.2025 01:36, Jason Andryuk wrote:
>>> On 2025-09-25 06:48, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/include/asm/spec_ctrl.h
>>>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h
>>>> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_
>>>>    
>>>>        /* (ab)use alternative_input() to specify clobbers. */
>>>>        alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET,
>>>> -                      : "rax", "rcx");
>>>> +                      "i" (0) : "rax", "rcx");
>>>
>>> "i" (0) is to work around the trailing comma in alternative_input() and
>>> does nothing?
>>
>> Yes. If more such "uses" appeared, we may want to introduce some kind of
>> abstraction.
> 
> Thanks for confirming.
> 
> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>

Thanks.

> Though I also wondered if just #define X86_BUG_MAX/X86_SYNTH_MAX 
> combined with a BUILD_BUG_ON might be good enough.  Your approach avoids 
> the extra define but is more complicated.  Anyway, just a thought.

How would that end up simplifying things? IOW what would the BUILD_BUG_ON()
look like that you're thinking about? After all X86_{SYNTH,BUG}_MAX aren't
meaningfully different from X86_NR_{SYNTH,BUG}.

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators
  2025-10-08  5:56         ` Jan Beulich
@ 2025-10-11  0:30           ` Jason Andryuk
  2025-10-13  6:39             ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Andryuk @ 2025-10-11  0:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	xen-devel@lists.xenproject.org

On 2025-10-08 01:56, Jan Beulich wrote:
> On 07.10.2025 21:38, Jason Andryuk wrote:
>> On 2025-10-07 08:22, Jan Beulich wrote:
>>> On 30.09.2025 01:36, Jason Andryuk wrote:
>>>> On 2025-09-25 06:48, Jan Beulich wrote:
>>>>> --- a/xen/arch/x86/include/asm/spec_ctrl.h
>>>>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h
>>>>> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_
>>>>>     
>>>>>         /* (ab)use alternative_input() to specify clobbers. */
>>>>>         alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET,
>>>>> -                      : "rax", "rcx");
>>>>> +                      "i" (0) : "rax", "rcx");
>>>>
>>>> "i" (0) is to work around the trailing comma in alternative_input() and
>>>> does nothing?
>>>
>>> Yes. If more such "uses" appeared, we may want to introduce some kind of
>>> abstraction.
>>
>> Thanks for confirming.
>>
>> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
> 
> Thanks.
> 
>> Though I also wondered if just #define X86_BUG_MAX/X86_SYNTH_MAX
>> combined with a BUILD_BUG_ON might be good enough.  Your approach avoids
>> the extra define but is more complicated.  Anyway, just a thought.
> 
> How would that end up simplifying things? IOW what would the BUILD_BUG_ON()
> look like that you're thinking about? After all X86_{SYNTH,BUG}_MAX aren't
> meaningfully different from X86_NR_{SYNTH,BUG}.

Originally, I was thinking something like
  XEN_CPUFEATURE(PDX_COMPRESSION,   X86_SYNTH(31)) /* PDX compression */
+#define X86_SYNTH_MAX 31 /* Bump when adding flags */

and:
BUILD_BUG_ON( ((X86_SYNTH_MAX / 32) + 1) > X86_NR_SYNTH )

Not automated, but adding a new flag should make it obvious 
X86_SYNTH_MAX should increase.

But as you point out the redundancy of X86_{SYNTH,BUG}_MAX and 
X86_NR_{SYNTH,BUG}.  But we could re-arrange to make X86_NR_{SYNTH,BUG} 
calculated from X86_{SYNTH,BUG}_MAX like below.

Again, it's not automated, but it should make it harder to miss 
increasing the value.

Regards,
Jason

diff --git i/xen/arch/x86/include/asm/cpufeatures.h 
w/xen/arch/x86/include/asm/cpufeatures.h
index 0a98676c16..724eb1599f 100644
--- i/xen/arch/x86/include/asm/cpufeatures.h
+++ w/xen/arch/x86/include/asm/cpufeatures.h
@@ -7,7 +7,6 @@
  #define FSCAPINTS FEATURESET_NR_ENTRIES

  /* Synthetic words follow the featureset words. */
-#define X86_NR_SYNTH 1
  #define X86_SYNTH(x) (FSCAPINTS * 32 + (x))

  /* Synthetic features */
@@ -43,9 +42,10 @@ XEN_CPUFEATURE(IBPB_ENTRY_PV,     X86_SYNTH(28)) /* 
MSR_PRED_CMD used by Xen for
  XEN_CPUFEATURE(IBPB_ENTRY_HVM,    X86_SYNTH(29)) /* MSR_PRED_CMD used 
by Xen for HVM */
  XEN_CPUFEATURE(USE_VMCALL,        X86_SYNTH(30)) /* Use VMCALL instead 
of VMMCALL */
  XEN_CPUFEATURE(PDX_COMPRESSION,   X86_SYNTH(31)) /* PDX compression */
+#define X86_SYNTH_MAX             31 /* Bump when adding new flags. */
+#define X86_NR_SYNTH              ((X86_SYNTH_MAX / 32) + 1)

  /* Bug words follow the synthetic words. */
-#define X86_NR_BUG 1
  #define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x))

  #define X86_BUG_FPU_PTRS          X86_BUG( 0) /* (F)X{SAVE,RSTOR} 
doesn't save/restore FOP/FIP/FDP. */
@@ -62,6 +62,8 @@ XEN_CPUFEATURE(PDX_COMPRESSION,   X86_SYNTH(31)) /* 
PDX compression */
  #define X86_SPEC_BHB_TSX          X86_BUG(19) /* Use clear_bhb_tsx for 
BHI mitigation. */
  #define X86_SPEC_BHB_LOOPS        X86_BUG(20) /* Use clear_bhb_loops 
for BHI mitigation.*/
  #define X86_SPEC_BHB_LOOPS_LONG   X86_BUG(21) /* Upgrade 
clear_bhb_loops to the "long" sequence. */
+#define X86_BUX_MAX               21 /* Bump when adding new flags. */
+#define X86_NR_BUG                ((X86_BUG_MAX / 32) + 1)

  /* Total number of capability words, inc synth and bug words. */
  #define NCAPINTS (FSCAPINTS + X86_NR_SYNTH + X86_NR_BUG) /* N 32-bit 
words worth of info */



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators
  2025-10-11  0:30           ` Jason Andryuk
@ 2025-10-13  6:39             ` Jan Beulich
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2025-10-13  6:39 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	xen-devel@lists.xenproject.org

On 11.10.2025 02:30, Jason Andryuk wrote:
> On 2025-10-08 01:56, Jan Beulich wrote:
>> On 07.10.2025 21:38, Jason Andryuk wrote:
>>> On 2025-10-07 08:22, Jan Beulich wrote:
>>>> On 30.09.2025 01:36, Jason Andryuk wrote:
>>>>> On 2025-09-25 06:48, Jan Beulich wrote:
>>>>>> --- a/xen/arch/x86/include/asm/spec_ctrl.h
>>>>>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h
>>>>>> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_
>>>>>>     
>>>>>>         /* (ab)use alternative_input() to specify clobbers. */
>>>>>>         alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET,
>>>>>> -                      : "rax", "rcx");
>>>>>> +                      "i" (0) : "rax", "rcx");
>>>>>
>>>>> "i" (0) is to work around the trailing comma in alternative_input() and
>>>>> does nothing?
>>>>
>>>> Yes. If more such "uses" appeared, we may want to introduce some kind of
>>>> abstraction.
>>>
>>> Thanks for confirming.
>>>
>>> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
>>
>> Thanks.
>>
>>> Though I also wondered if just #define X86_BUG_MAX/X86_SYNTH_MAX
>>> combined with a BUILD_BUG_ON might be good enough.  Your approach avoids
>>> the extra define but is more complicated.  Anyway, just a thought.
>>
>> How would that end up simplifying things? IOW what would the BUILD_BUG_ON()
>> look like that you're thinking about? After all X86_{SYNTH,BUG}_MAX aren't
>> meaningfully different from X86_NR_{SYNTH,BUG}.
> 
> Originally, I was thinking something like
>   XEN_CPUFEATURE(PDX_COMPRESSION,   X86_SYNTH(31)) /* PDX compression */
> +#define X86_SYNTH_MAX 31 /* Bump when adding flags */

I don't really like this. Especially as presented is creates an ambiguity:
Would one need to increase the value upon any flag addition, or only upon
adding a new flag with a value divisible by 32. (From the patch fragment
you appended it's clear the latter is meant, but that's not clear here,
and when one learns to routinely ignore the comment, there's a risk of also
ignoring it when it shouldn't be ignored. Whereas if the bump was required
every time a new flag was added, I would dislike the unnecessary churn. In
the end - yes, there's a reason why I did things the way done, even if
there are other aspects to it which are (for the patch itself, but imo not
once it would have gone in) not overly nice.

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators
  2025-09-25 10:48 ` [PATCH 2/2] x86: guard synthetic feature and bug enumerators Jan Beulich
  2025-09-29 23:36   ` Jason Andryuk
@ 2025-10-08 17:19   ` Andrew Cooper
  2025-10-09  7:45     ` Jan Beulich
  1 sibling, 1 reply; 26+ messages in thread
From: Andrew Cooper @ 2025-10-08 17:19 UTC (permalink / raw)
  To: Jan Beulich, xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné

On 25/09/2025 11:48 am, Jan Beulich wrote:
> While adding new enumerators one may overlook the (rare) need to bump
> X86_NR_{SYNTH,BUG}. Guard against that happening by adding respective
> checking. The use of BUILD_BUG_ON_ZERO(), however, entails a number of
> other changes, as the expansion may not appear in the assembly produced.
> Furthermore inputs to file-scope asm() are only supported in gcc15 (or
> newer).
>
> No difference in generated code (debug info, however, grows quite a bit).
>
> An implication from the changes is that users of the alternatives patching
> macros may not use unnamed asm() input operands anymore, as the "injected"
> new operands would break numbering expectations.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/include/asm/alternative.h
> +++ b/xen/arch/x86/include/asm/alternative.h
> @@ -70,12 +70,12 @@ extern void alternative_instructions(voi
>                      alt_repl_len(n2)) "-" alt_orig_len)
>  
>  #define ALTINSTR_ENTRY(feature, num)                                    \
> -        " .if (" STR(feature & ~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \
> +        " .if (%c" #feature " & " STR(~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \
>          " .error \"alternative feature outside of featureset range\"\n" \
>          " .endif\n"                                                     \
>          " .long .LXEN%=_orig_s - .\n"             /* label           */ \
>          " .long " alt_repl_s(num)" - .\n"         /* new instruction */ \
> -        " .word " STR(feature) "\n"               /* feature bit     */ \
> +        " .word %c" #feature "\n"                 /* feature bit     */ \
>          " .byte " alt_orig_len "\n"               /* source len      */ \
>          " .byte " alt_repl_len(num) "\n"          /* replacement len */ \
>          " .byte " alt_pad_len "\n"                /* padding len     */ \
> @@ -127,14 +127,14 @@ extern void alternative_instructions(voi
>   */
>  #define alternative(oldinstr, newinstr, feature)                        \
>      asm_inline volatile (                                               \
> -        ALTERNATIVE(oldinstr, newinstr, feature)                        \
> -        ::: "memory" )
> +        ALTERNATIVE(oldinstr, newinstr, [feat])                         \
> +        :: [feat] "i" (feature) : "memory" )

I don't understand.  How is this related to putting the guard in place?


> --- a/xen/arch/x86/include/asm/cpufeatureset.h
> +++ b/xen/arch/x86/include/asm/cpufeatureset.h
> @@ -12,8 +12,13 @@ enum {
>  };
>  #undef XEN_CPUFEATURE
>  
> +#if __GNUC__ >= 15
> +#define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", %c0" \
> +                                         :: "i" (X86_FEATURE_##name));
> +#else
>  #define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", " \
>                                           __stringify(value));
> +#endif

Again - why are we making a no-op change for the sake of it?

>  #include <public/arch-x86/cpufeatureset.h>
>  #include <asm/cpufeatures.h>
>  
> --- a/xen/arch/x86/include/asm/pdx.h
> +++ b/xen/arch/x86/include/asm/pdx.h
> @@ -13,9 +13,9 @@
>      asm_inline goto (                               \
>          ALTERNATIVE(                                \
>              "",                                     \
> -            "jmp %l0",                              \
> -            ALT_NOT(X86_FEATURE_PDX_COMPRESSION))   \
> -        : : : : label )
> +            "jmp %l1",                              \
> +            [feat])                                 \
> +        : : [feat] "i" (ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) : : label )

Not a bug in this change, but the pre-existing use of positional labels
is something I was expecting not to introduce at all seeing as we
started cleanly with named labels.

The jmp wants to be:

  "jmp %l" #label

to cope with the fact it's a macro parameter too.

>  
>  static inline unsigned long pfn_to_pdx(unsigned long pfn)
>  {
> --- a/xen/arch/x86/include/asm/spec_ctrl.h
> +++ b/xen/arch/x86/include/asm/spec_ctrl.h
> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_
>  
>      /* (ab)use alternative_input() to specify clobbers. */
>      alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET,
> -                      : "rax", "rcx");
> +                      "i" (0) : "rax", "rcx");
>  }

As the comment says, this is already an abuse of the macro for a purpose
it wasn't intended for.

Now requiring an extra "nop" parameter to get the abuse to compile is
too far.  It can turn into a plain ALTERNATIVE(), and then both disappear.

~Andrew


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators
  2025-10-08 17:19   ` Andrew Cooper
@ 2025-10-09  7:45     ` Jan Beulich
  2025-10-09  7:51       ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2025-10-09  7:45 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, xen-devel@lists.xenproject.org

On 08.10.2025 19:19, Andrew Cooper wrote:
> On 25/09/2025 11:48 am, Jan Beulich wrote:
>> --- a/xen/arch/x86/include/asm/alternative.h
>> +++ b/xen/arch/x86/include/asm/alternative.h
>> @@ -70,12 +70,12 @@ extern void alternative_instructions(voi
>>                      alt_repl_len(n2)) "-" alt_orig_len)
>>  
>>  #define ALTINSTR_ENTRY(feature, num)                                    \
>> -        " .if (" STR(feature & ~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \
>> +        " .if (%c" #feature " & " STR(~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \
>>          " .error \"alternative feature outside of featureset range\"\n" \
>>          " .endif\n"                                                     \
>>          " .long .LXEN%=_orig_s - .\n"             /* label           */ \
>>          " .long " alt_repl_s(num)" - .\n"         /* new instruction */ \
>> -        " .word " STR(feature) "\n"               /* feature bit     */ \
>> +        " .word %c" #feature "\n"                 /* feature bit     */ \
>>          " .byte " alt_orig_len "\n"               /* source len      */ \
>>          " .byte " alt_repl_len(num) "\n"          /* replacement len */ \
>>          " .byte " alt_pad_len "\n"                /* padding len     */ \
>> @@ -127,14 +127,14 @@ extern void alternative_instructions(voi
>>   */
>>  #define alternative(oldinstr, newinstr, feature)                        \
>>      asm_inline volatile (                                               \
>> -        ALTERNATIVE(oldinstr, newinstr, feature)                        \
>> -        ::: "memory" )
>> +        ALTERNATIVE(oldinstr, newinstr, [feat])                         \
>> +        :: [feat] "i" (feature) : "memory" )
> 
> I don't understand.  How is this related to putting the guard in place?

The change here is needed to fit the change to ALTINSTR_ENTRY() above. That
change in turn is needed because

#define X86_SYNTH(x) (FSCAPINTS * 32 + (x) + X86_CHECK_FEAT_NR(x, SYNTH))

with

#define X86_CHECK_FEAT_NR(x, n) BUILD_BUG_ON_ZERO((x) / 32 >= X86_NR_ ## n)

now needs to be evaluated by the compiler. If we used stringification as
before, the assembler would get to see BUILD_BUG_ON_ZERO().

>> --- a/xen/arch/x86/include/asm/cpufeatureset.h
>> +++ b/xen/arch/x86/include/asm/cpufeatureset.h
>> @@ -12,8 +12,13 @@ enum {
>>  };
>>  #undef XEN_CPUFEATURE
>>  
>> +#if __GNUC__ >= 15
>> +#define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", %c0" \
>> +                                         :: "i" (X86_FEATURE_##name));
>> +#else
>>  #define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", " \
>>                                           __stringify(value));
>> +#endif
> 
> Again - why are we making a no-op change for the sake of it?

See above.

>> --- a/xen/arch/x86/include/asm/pdx.h
>> +++ b/xen/arch/x86/include/asm/pdx.h
>> @@ -13,9 +13,9 @@
>>      asm_inline goto (                               \
>>          ALTERNATIVE(                                \
>>              "",                                     \
>> -            "jmp %l0",                              \
>> -            ALT_NOT(X86_FEATURE_PDX_COMPRESSION))   \
>> -        : : : : label )
>> +            "jmp %l1",                              \
>> +            [feat])                                 \
>> +        : : [feat] "i" (ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) : : label )
> 
> Not a bug in this change, but the pre-existing use of positional labels
> is something I was expecting not to introduce at all seeing as we
> started cleanly with named labels.
> 
> The jmp wants to be:
> 
>   "jmp %l" #label
> 
> to cope with the fact it's a macro parameter too.

Unrelated change? I can of course do the adjustment in a separate prereq
patch, but then it would have been nice if you had commented along these
lines before that code actually had gone in.

That said, isn't it at least bad practice to not expose the label use to
the compiler? To avoid using positional operands, shouldn't we rather
name the operand, and then use "jmp %l[whatever_the_name]"? That's a
change I could see as being justified to do right here, rather than in a
separate patch.

>> --- a/xen/arch/x86/include/asm/spec_ctrl.h
>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h
>> @@ -73,7 +73,7 @@ static always_inline void spec_ctrl_new_
>>  
>>      /* (ab)use alternative_input() to specify clobbers. */
>>      alternative_input("", "DO_OVERWRITE_RSB xu=%=", X86_BUG_IBPB_NO_RET,
>> -                      : "rax", "rcx");
>> +                      "i" (0) : "rax", "rcx");
>>  }
> 
> As the comment says, this is already an abuse of the macro for a purpose
> it wasn't intended for.
> 
> Now requiring an extra "nop" parameter to get the abuse to compile is
> too far.  It can turn into a plain ALTERNATIVE(), and then both disappear.

Except that, for the reasons explained further up, I'd rather not see new
explicit uses of ALTERNATIVE() appear. In the end it's not clear which
one is the lesser evil. Maybe we should gain alternative_clobber()?

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] x86: guard synthetic feature and bug enumerators
  2025-10-09  7:45     ` Jan Beulich
@ 2025-10-09  7:51       ` Jan Beulich
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2025-10-09  7:51 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, xen-devel@lists.xenproject.org

On 09.10.2025 09:45, Jan Beulich wrote:
> On 08.10.2025 19:19, Andrew Cooper wrote:
>> On 25/09/2025 11:48 am, Jan Beulich wrote:
>>> --- a/xen/arch/x86/include/asm/pdx.h
>>> +++ b/xen/arch/x86/include/asm/pdx.h
>>> @@ -13,9 +13,9 @@
>>>      asm_inline goto (                               \
>>>          ALTERNATIVE(                                \
>>>              "",                                     \
>>> -            "jmp %l0",                              \
>>> -            ALT_NOT(X86_FEATURE_PDX_COMPRESSION))   \
>>> -        : : : : label )
>>> +            "jmp %l1",                              \
>>> +            [feat])                                 \
>>> +        : : [feat] "i" (ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) : : label )
>>
>> Not a bug in this change, but the pre-existing use of positional labels
>> is something I was expecting not to introduce at all seeing as we
>> started cleanly with named labels.
>>
>> The jmp wants to be:
>>
>>   "jmp %l" #label
>>
>> to cope with the fact it's a macro parameter too.
> 
> Unrelated change? I can of course do the adjustment in a separate prereq
> patch, but then it would have been nice if you had commented along these
> lines before that code actually had gone in.
> 
> That said, isn't it at least bad practice to not expose the label use to
> the compiler? To avoid using positional operands, shouldn't we rather
> name the operand, and then use "jmp %l[whatever_the_name]"? That's a
> change I could see as being justified to do right here, rather than in a
> separate patch.

Hmm, no, labels can't be named; they are their own names. I.e. what I think
we want here is "jmp %l[" #label "]".

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4
  2025-09-25 10:45 [PATCH 0/2] x86: ERMS follow-on Jan Beulich
  2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich
  2025-09-25 10:48 ` [PATCH 2/2] x86: guard synthetic feature and bug enumerators Jan Beulich
@ 2025-10-13 13:06 ` Jan Beulich
       [not found] ` <693449f18cc4480ea2cb2161a9361354@DM4PR03MB7015.namprd03.prod.outlook.com>
  3 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2025-10-13 13:06 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko

Along with Zen2 (which doesn't expose ERMS), both families reportedly
suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
can actually be carried out the accelerated way. Therefore we want to
avoid its use in the common case of memcpy(); copy_page_hot() is fine, as
its two pointers are always going to be having the same low 5 bits.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
---
Trying to amend the comment in cpufeatures.h (e.g. "..., i.e. ERMS minus
the Zen3/4 pointer aliasing issue") makes it get longish, so I kept it at
the shortened form.
---
v2: Leave page copying alone.

--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -1386,6 +1386,10 @@ static void cf_check init_amd(struct cpu
 
 	check_syscfg_dram_mod_en();
 
+	if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS)
+	    && c->family != 0x19 /* Zen3/4 */)
+		setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB);
+
 	amd_log_freq(c);
 }
 
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -684,6 +684,9 @@ static void cf_check init_intel(struct c
 	 */
 	if (c == &boot_cpu_data && c->vfm == INTEL_SKYLAKE_X)
 		setup_clear_cpu_cap(X86_FEATURE_CLWB);
+
+	if (c == &boot_cpu_data && cpu_has(c, X86_FEATURE_ERMS))
+		setup_force_cpu_cap(X86_FEATURE_XEN_REP_MOVSB);
 }
 
 const struct cpu_dev __initconst_cf_clobber intel_cpu_dev = {
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -7,7 +7,7 @@
 #define FSCAPINTS FEATURESET_NR_ENTRIES
 
 /* Synthetic words follow the featureset words. */
-#define X86_NR_SYNTH 1
+#define X86_NR_SYNTH 2
 #define X86_SYNTH(x) (FSCAPINTS * 32 + (x))
 
 /* Synthetic features */
@@ -43,6 +43,7 @@ XEN_CPUFEATURE(IBPB_ENTRY_PV,     X86_SY
 XEN_CPUFEATURE(IBPB_ENTRY_HVM,    X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */
 XEN_CPUFEATURE(USE_VMCALL,        X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */
 XEN_CPUFEATURE(PDX_COMPRESSION,   X86_SYNTH(31)) /* PDX compression */
+XEN_CPUFEATURE(XEN_REP_MOVSB,     X86_SYNTH(32)) /* REP MOVSB used for memcpy() */
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
--- a/xen/arch/x86/memcpy.S
+++ b/xen/arch/x86/memcpy.S
@@ -10,7 +10,7 @@ FUNC(memcpy)
          * precautions were taken).
          */
         ALTERNATIVE "and $7, %edx; shr $3, %rcx", \
-                    STR(rep movsb; RET), X86_FEATURE_ERMS
+                    STR(rep movsb; RET), X86_FEATURE_XEN_REP_MOVSB
         rep movsq
         or      %edx, %ecx
         jz      1f


^ permalink raw reply	[flat|nested] 26+ messages in thread

[parent not found: <693449f18cc4480ea2cb2161a9361354@DM4PR03MB7015.namprd03.prod.outlook.com>]

* Re: [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4
       [not found] ` <693449f18cc4480ea2cb2161a9361354@DM4PR03MB7015.namprd03.prod.outlook.com>
@ 2026-01-06 21:07   ` Andrew Cooper
  2026-01-07  7:22     ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Cooper @ 2026-01-06 21:07 UTC (permalink / raw)
  To: Jan Beulich, xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Andrew Cooper, Roger Pau Monne, Oleksii Kurochko

On 13/10/2025 2:06 pm, Jan Beulich wrote:
> Along with Zen2 (which doesn't expose ERMS), both families reportedly
> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
> can actually be carried out the accelerated way. Therefore we want to
> avoid its use in the common case of memcpy(); copy_page_hot() is fine, as
> its two pointers are always going to be having the same low 5 bits.

I think this could be a bit clearer.  How about this:

---8<---
Zen2 (which doesn't expose ERMS) through Zen4 have sub-optimal aliasing
detection for REP MOVS, and fall back to a unit-at-a-time loop when the
two pointers have differing bottom 5 bits.  While both forms are
affected, this makes REP MOVSB 8 times slower than REP MOVSQ.

memcpy() has a high likelihood of encountering this slowpath, so avoid
using REP MOVSB.  This undoes the ERMS optimisation added in commit
d6397bd0e11c which turns out to be an anti-optimisation on these
microarchitectures.

However, retain the use of ERMS-based REP MOVSB in other cases such as
copy_page_hot() where there parameter alignment is known to avoid the
slowpath.
---8<---

?

This at least gets us back to the 4.20 behaviour.

~Andrew


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4
  2026-01-06 21:07   ` Andrew Cooper
@ 2026-01-07  7:22     ` Jan Beulich
  2026-01-07 10:42       ` Andrew Cooper
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2026-01-07  7:22 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Andrew Cooper, Roger Pau Monne, Oleksii Kurochko,
	xen-devel@lists.xenproject.org

On 06.01.2026 22:07, Andrew Cooper wrote:
> On 13/10/2025 2:06 pm, Jan Beulich wrote:
>> Along with Zen2 (which doesn't expose ERMS), both families reportedly
>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
>> can actually be carried out the accelerated way. Therefore we want to
>> avoid its use in the common case of memcpy(); copy_page_hot() is fine, as
>> its two pointers are always going to be having the same low 5 bits.
> 
> I think this could be a bit clearer.  How about this:
> 
> ---8<---
> Zen2 (which doesn't expose ERMS) through Zen4 have sub-optimal aliasing
> detection for REP MOVS, and fall back to a unit-at-a-time loop when the
> two pointers have differing bottom 5 bits.  While both forms are
> affected, this makes REP MOVSB 8 times slower than REP MOVSQ.
> 
> memcpy() has a high likelihood of encountering this slowpath, so avoid
> using REP MOVSB.  This undoes the ERMS optimisation added in commit
> d6397bd0e11c which turns out to be an anti-optimisation on these
> microarchitectures.
> 
> However, retain the use of ERMS-based REP MOVSB in other cases such as
> copy_page_hot() where there parameter alignment is known to avoid the
> slowpath.
> ---8<---
> 
> ?

Fine with me; changed. Do I take this as an okay-to-commit?

Jan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4
  2026-01-07  7:22     ` Jan Beulich
@ 2026-01-07 10:42       ` Andrew Cooper
  0 siblings, 0 replies; 26+ messages in thread
From: Andrew Cooper @ 2026-01-07 10:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Andrew Cooper, Roger Pau Monne, Oleksii Kurochko,
	xen-devel@lists.xenproject.org

On 07/01/2026 7:22 am, Jan Beulich wrote:
> On 06.01.2026 22:07, Andrew Cooper wrote:
>> On 13/10/2025 2:06 pm, Jan Beulich wrote:
>>> Along with Zen2 (which doesn't expose ERMS), both families reportedly
>>> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
>>> can actually be carried out the accelerated way. Therefore we want to
>>> avoid its use in the common case of memcpy(); copy_page_hot() is fine, as
>>> its two pointers are always going to be having the same low 5 bits.
>> I think this could be a bit clearer.  How about this:
>>
>> ---8<---
>> Zen2 (which doesn't expose ERMS) through Zen4 have sub-optimal aliasing
>> detection for REP MOVS, and fall back to a unit-at-a-time loop when the
>> two pointers have differing bottom 5 bits.  While both forms are
>> affected, this makes REP MOVSB 8 times slower than REP MOVSQ.
>>
>> memcpy() has a high likelihood of encountering this slowpath, so avoid
>> using REP MOVSB.  This undoes the ERMS optimisation added in commit
>> d6397bd0e11c which turns out to be an anti-optimisation on these
>> microarchitectures.
>>
>> However, retain the use of ERMS-based REP MOVSB in other cases such as
>> copy_page_hot() where there parameter alignment is known to avoid the
>> slowpath.
>> ---8<---
>>
>> ?
> Fine with me; changed. Do I take this as an okay-to-commit?

Yeah - with something to this effect, Reviewed-by: Andrew Cooper
<andrew.cooper3@citrix.com>

Sorry it took so long.

~Andrew


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2026-01-07 10:42 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-25 10:45 [PATCH 0/2] x86: ERMS follow-on Jan Beulich
2025-09-25 10:46 ` [PATCH for-4.21 1/2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich
2025-09-25 12:18   ` Teddy Astie
2025-09-25 12:58     ` Jan Beulich
2025-09-30 13:03       ` Teddy Astie
2025-10-07 12:19         ` Jan Beulich
2025-09-29 23:35     ` Jason Andryuk
2025-10-08 11:20       ` Roger Pau Monné
2025-10-08 16:06         ` Jan Beulich
2025-10-09  7:20           ` Oleksii Kurochko
2025-10-08 16:33   ` Andrew Cooper
2025-10-09  7:27     ` Jan Beulich
2025-09-25 10:48 ` [PATCH 2/2] x86: guard synthetic feature and bug enumerators Jan Beulich
2025-09-29 23:36   ` Jason Andryuk
2025-10-07 12:22     ` Jan Beulich
2025-10-07 19:38       ` Jason Andryuk
2025-10-08  5:56         ` Jan Beulich
2025-10-11  0:30           ` Jason Andryuk
2025-10-13  6:39             ` Jan Beulich
2025-10-08 17:19   ` Andrew Cooper
2025-10-09  7:45     ` Jan Beulich
2025-10-09  7:51       ` Jan Beulich
2025-10-13 13:06 ` [PATCH for-4.21 v2] x86/AMD: avoid REP MOVSB for Zen3/4 Jan Beulich
     [not found] ` <693449f18cc4480ea2cb2161a9361354@DM4PR03MB7015.namprd03.prod.outlook.com>
2026-01-06 21:07   ` Andrew Cooper
2026-01-07  7:22     ` Jan Beulich
2026-01-07 10:42       ` Andrew Cooper

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.