* x86/fpu: Inaccurate AVX-512 Usage Tracking via arch_status
@ 2025-10-27 7:50 chuang
2025-10-27 14:26 ` Dave Hansen
0 siblings, 1 reply; 5+ messages in thread
From: chuang @ 2025-10-27 7:50 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, open list
Dear FPU/x86 Maintainers,
I am writing to report an issue concerning the accuracy of AVX-512
usage tracking, specifically when querying the information via
'/proc/<pid>/arch_status' on systems supporting the instruction set.
This report references the mechanism introduced by the following
patch: https://lore.kernel.org/all/20190117183822.31333-1-aubrey.li@intel.com/T/#u
I have validated the patch's effect in modern environments supporting
AVX-512 (e.g., Intel Xeon Gold, AMD Zen4) and found that the tracking
mechanism does not accurately reflect the actual AVX-512 instruction
usage by the process.
Test Environment:
- CPU: Intel Xeon Gold (AVX-512 supported)
- Test Program: periodic_wake.c (verified via objdump to contain no
AVX-512 instructions)
- Test Goal: To compare AVX-512 execution status as reported by perf
PMU versus procfs arch_status.
perf PMU:

$ perf stat -e instructions,cycles,fp_arith_inst_retired.512b_packed_double,fp_arith_inst_retired.512b_packed_single,fp_arith_inst_retired.8_flops,fp_arith_inst_retired2.128bit_packed_bf16,fp_arith_inst_retired2.256bit_packed_bf16,fp_arith_inst_retired2.512bit_packed_bf16 ./periodic_wake > /dev/null
^C./periodic_wake: Interrupt

 Performance counter stats for './periodic_wake':

         2,329,116      instructions                              #    2.86  insn per cycle    (33.57%)
           814,040      cycles                                                                 (56.61%)
                 0      fp_arith_inst_retired.512b_packed_double                                (9.82%)
     <not counted>      fp_arith_inst_retired.512b_packed_single                                (0.00%)
     <not counted>      fp_arith_inst_retired.8_flops                                           (0.00%)
     <not counted>      fp_arith_inst_retired2.128bit_packed_bf16                               (0.00%)
     <not counted>      fp_arith_inst_retired2.256bit_packed_bf16                               (0.00%)
     <not counted>      fp_arith_inst_retired2.512bit_packed_bf16                               (0.00%)

       1.366220977 seconds time elapsed

       0.000000000 seconds user
       0.002253000 seconds sys
procfs arch_status:
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 44
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 64
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 91
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 50
Based on the observed behavior and a review of the referenced patch,
my hypothesis is:
On AVX-512 capable systems, the implementation appears to record the
current timestamp into 'task->thread.fpu.avx512_timestamp' upon any
task switch, irrespective of whether the task has actually executed an
AVX-512 instruction.
This continuous updating of the timestamp, even for non-AVX-512 tasks,
results in misleading non-zero values for AVX512_elapsed_ms, rendering
the mechanism ineffective for accurately determining if a task is
actively utilizing AVX-512.
Could you please confirm if this analysis is correct and advise on the
appropriate next steps to resolve this discrepancy?
'periodic_wake.c':

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>

/* Wakeup interval: 100 milliseconds */
#define INTERVAL_MS 100

int main(void)
{
    /* Convert milliseconds to nanoseconds */
    long interval_ns = (long)INTERVAL_MS * 1000000L;

    /* timespec structs used for nanosleep */
    struct timespec requested;
    struct timespec remaining;

    /* Initialize the requested time structure */
    requested.tv_sec = 0;
    requested.tv_nsec = interval_ns;

    printf("C Periodic Wakeup Program started (Interval: %dms, %.9ldns). "
           "Press Ctrl+C to stop.\n", INTERVAL_MS, interval_ns);

    long long counter = 0;
    while (1) {
        counter++;

        /* Print current wakeup information */
        printf("Wakeup #%lld: Continuing execution.\n", counter);

        /*
         * Use nanosleep for high-precision sleep. If nanosleep is
         * interrupted by a signal (e.g., Ctrl+C), it returns -1 and
         * stores the unslept time in 'remaining'. To maintain accurate
         * periodicity, we re-sleep for the remaining time if an
         * interruption occurs.
         */
        remaining = requested;

        int result;
        do {
            /* Sleep */
            result = nanosleep(&remaining, &remaining);

            /* Check return value */
            if (result == -1) {
                if (errno == EINTR) {
                    /* Interrupted by a signal (e.g., debugger or Ctrl+C):
                     * loop continues, using the time left in 'remaining'. */
                    printf("[Interrupted] nanosleep was interrupted by a "
                           "signal, sleeping for remaining %.3fms\n",
                           (double)remaining.tv_nsec / 1000000.0);
                } else {
                    /* Other error: print error and exit */
                    perror("nanosleep error");
                    return 1;
                }
            }
        } while (result == -1 && errno == EINTR);
        /* nanosleep returned 0: continue to the next loop iteration. */
    }

    return 0; /* Theoretically unreachable */
}
Thank you for your time and assistance.
Best regards,
* Re: x86/fpu: Inaccurate AVX-512 Usage Tracking via arch_status
2025-10-27 7:50 x86/fpu: Inaccurate AVX-512 Usage Tracking via arch_status chuang
@ 2025-10-27 14:26 ` Dave Hansen
2025-10-30 6:56 ` chuang
0 siblings, 1 reply; 5+ messages in thread
From: Dave Hansen @ 2025-10-27 14:26 UTC (permalink / raw)
To: chuang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, open list
[-- Attachment #1: Type: text/plain, Size: 813 bytes --]
On 10/27/25 00:50, chuang wrote:
> On AVX-512 capable systems, the implementation appears to record the
> current timestamp into 'task->thread.fpu.avx512_timestamp' upon any
> task switch, irrespective of whether the task has actually executed an
> AVX-512 instruction.
The timestamp update ultimately has _zero_ to do with executing
AVX-512 instructions. It's all about the state in the ZMM registers, not
AVX-512 instructions.
Those registers are inherited at fork and I don't see avx512_timestamp
being zeroed anywhere. So I suspect what you are seeing is that some
_parent_ used AVX512, and its children are getting stuck with
avx512_timestamp.
You could probably confirm this by dumping ->avx512_timestamp in
fpu_clone().
Or, try the attached patch and see if it makes things work more like
you'd expect.
[-- Attachment #2: avx512-ts-reset.patch --]
[-- Type: text/x-patch, Size: 467 bytes --]
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 1f71cc135e9a..2a8c159de5e2 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -648,6 +648,8 @@ int fpu_clone(struct task_struct *dst, u64 clone_flags, bool minimal,
 	/* The new task's FPU state cannot be valid in the hardware. */
 	dst_fpu->last_cpu = -1;
 
+	dst_fpu->avx512_timestamp = 0;
+
 	fpstate_reset(dst_fpu);
 
 	if (!cpu_feature_enabled(X86_FEATURE_FPU))
* Re: x86/fpu: Inaccurate AVX-512 Usage Tracking via arch_status
2025-10-27 14:26 ` Dave Hansen
@ 2025-10-30 6:56 ` chuang
2025-10-30 15:00 ` Dave Hansen
0 siblings, 1 reply; 5+ messages in thread
From: chuang @ 2025-10-30 6:56 UTC (permalink / raw)
To: Dave Hansen
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, open list
On Mon, Oct 27, 2025 at 10:26 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/27/25 00:50, chuang wrote:
> > On AVX-512 capable systems, the implementation appears to record the
> > current timestamp into 'task->thread.fpu.avx512_timestamp' upon any
> > task switch, irrespective of whether the task has actually executed an
> > AVX-512 instruction.
>
> The timestamp update ultimately has _zero_ to do with executing
> AVX-512 instructions. It's all about the state in the ZMM registers, not
> AVX-512 instructions.
Got it, thanks.
> Those registers are inherited at fork and I don't see avx512_timestamp
> being zeroed anywhere. So I suspect what you are seeing is that some
> _parent_ used AVX512, and its children are getting stuck with
> avx512_timestamp.
I have tested with the attached patch. The behavior remains the same
as previously reported: after fpu_clone(), the new process still has a
non-zero avx512_timestamp, and it continues to be updated in
subsequent task switches, irrespective of AVX-512 instruction
execution.
I traced the code path within fpu_clone(): In fpu_clone() ->
save_fpregs_to_fpstate(), since my current Intel CPU supports XSAVE,
the call to os_xsave() results in the XFEATURE_Hi16_ZMM bit being
set/enabled in xsave.header.xfeatures. This then causes
update_avx_timestamp() to update fpu->avx512_timestamp. The same flow
occurs in __switch_to() -> switch_fpu_prepare().
Given this, is the issue related to my specific Intel Xeon Gold? Is
the CPU continuously indicating that the AVX-512 state is in use?
> You could probably confirm this by dumping ->avx512_timestamp in
> fpu_clone().
>
> Or, try the attached patch and see if it makes things work more like
> you'd expect.
Best regards,
* Re: x86/fpu: Inaccurate AVX-512 Usage Tracking via arch_status
2025-10-30 6:56 ` chuang
@ 2025-10-30 15:00 ` Dave Hansen
2025-11-09 3:19 ` chuang
0 siblings, 1 reply; 5+ messages in thread
From: Dave Hansen @ 2025-10-30 15:00 UTC (permalink / raw)
To: chuang
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, open list
On 10/29/25 23:56, chuang wrote:
...
> I traced the code path within fpu_clone(): In fpu_clone() ->
> save_fpregs_to_fpstate(), since my current Intel CPU supports XSAVE,
> the call to os_xsave() results in the XFEATURE_Hi16_ZMM bit being
> set/enabled in xsave.header.xfeatures. This then causes
> update_avx_timestamp() to update fpu->avx512_timestamp. The same flow
> occurs in __switch_to() -> switch_fpu_prepare().
So that points more in the direction of the AVX-512 not getting
initialized. fpu_flush_thread() either isn't getting called or isn't
doing its job at execve(). *Or*, there's something subtle in your test
case that's causing AVX-512 to get tracked as non-init after execve().
> Given this, is the issue related to my specific Intel Xeon Gold? Is
> the CPU continuously indicating that the AVX-512 state is in use?
As much as I love to blame the hardware, I don't think we're quite there
yet. We've literally had software bugs in the past that had this exact
same behavior: AVX-512 state was tracked as non-init when it was never used.
Any chance you could figure out where you first see XFEATURE_Hi16_ZMM in
xfeatures? The tracepoints in here might help:
/sys/kernel/debug/tracing/events/x86_fpu
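One possible way to catch the first appearance (a sketch only: assumes
root, tracefs mounted at the usual debugfs path, and a kernel exposing
the x86_fpu trace events; paths and event names vary by version):

```shell
# Sketch: enable all x86_fpu tracepoints, run the test program briefly,
# then inspect the captured events for the target task.
cd /sys/kernel/debug/tracing

echo 1 > events/x86_fpu/enable   # enable every x86_fpu tracepoint
echo > trace                     # clear the ring buffer

./periodic_wake > /dev/null &
sleep 2
kill %1

# The event lines include an xfeatures bitmask; Hi16_ZMM is bit 7 (0x80),
# so look for the first event where that bit is set for the task's PID.
grep periodic_wake trace | head -20
```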
Is there any rhyme or reason for which tasks see avx512_timestamp
getting set? Is it just your test program? Or other random tasks on the
system?
* Re: x86/fpu: Inaccurate AVX-512 Usage Tracking via arch_status
2025-10-30 15:00 ` Dave Hansen
@ 2025-11-09 3:19 ` chuang
0 siblings, 0 replies; 5+ messages in thread
From: chuang @ 2025-11-09 3:19 UTC (permalink / raw)
To: Dave Hansen
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, open list
Thank you for the ongoing discussion.
On Thu, Oct 30, 2025 at 11:00 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/29/25 23:56, chuang wrote:
> ...
> > I traced the code path within fpu_clone(): In fpu_clone() ->
> > save_fpregs_to_fpstate(), since my current Intel CPU supports XSAVE,
> > the call to os_xsave() results in the XFEATURE_Hi16_ZMM bit being
> > set/enabled in xsave.header.xfeatures. This then causes
> > update_avx_timestamp() to update fpu->avx512_timestamp. The same flow
> > occurs in __switch_to() -> switch_fpu_prepare().
>
> So that points more in the direction of the AVX-512 not getting
> initialized. fpu_flush_thread() either isn't getting called or isn't
> doing its job at execve(). *Or*, there's something subtle in your test
> case that's causing AVX-512 to get tracked as non-init after execve().
My analysis suggests a dependency on the glibc implementation. Since
glibc 2.24[1], AVX-512 instructions have been used to optimize memory
routines such as memcpy/memmove[2] and memset[3]. With earlier glibc
versions, the /proc/<pid>/arch_status mechanism could therefore reflect
genuine application-level AVX-512 usage more accurately.
The disassembly of my test binary (while_sleep_static[4]) confirms the
activation of these glibc optimizations:
00000000004109d0 <__memcpy_avx512_no_vzeroupper>:
4109d0: f3 0f 1e fa endbr64
4109d4: 48 89 f8 mov %rdi,%rax
4109d7: 48 8d 0c 16 lea (%rsi,%rdx,1),%rcx
4109db: 4c 8d 0c 17 lea (%rdi,%rdx,1),%r9
4109df: 48 81 fa 00 02 00 00 cmp $0x200,%rdx
4109e6: 0f 87 5d 01 00 00    ja     410b49 <__memcpy_avx512_no_vzeroupper+0x179>
4109ec: 48 83 fa 10          cmp    $0x10,%rdx
4109f0: 0f 86 0f 01 00 00    jbe    410b05 <__memcpy_avx512_no_vzeroupper+0x135>
4109f6: 48 81 fa 00 01 00 00 cmp    $0x100,%rdx
4109fd: 72 6f                jb     410a6e <__memcpy_avx512_no_vzeroupper+0x9e>
4109ff: 62 f1 7c 48 10 06 vmovups (%rsi),%zmm0
410a05: 62 f1 7c 48 10 4e 01 vmovups 0x40(%rsi),%zmm1
410a0c: 62 f1 7c 48 10 56 02 vmovups 0x80(%rsi),%zmm2
410a13: 62 f1 7c 48 10 5e 03 vmovups 0xc0(%rsi),%zmm3
410a1a: 62 f1 7c 48 10 61 fc vmovups -0x100(%rcx),%zmm4
410a21: 62 f1 7c 48 10 69 fd vmovups -0xc0(%rcx),%zmm5
410a28: 62 f1 7c 48 10 71 fe vmovups -0x80(%rcx),%zmm6
410a2f: 62 f1 7c 48 10 79 ff vmovups -0x40(%rcx),%zmm7
410a36: 62 f1 7c 48 11 07 vmovups %zmm0,(%rdi)
410a3c: 62 f1 7c 48 11 4f 01 vmovups %zmm1,0x40(%rdi)
410a43: 62 f1 7c 48 11 57 02 vmovups %zmm2,0x80(%rdi)
410a4a: 62 f1 7c 48 11 5f 03 vmovups %zmm3,0xc0(%rdi)
410a51: 62 d1 7c 48 11 61 fc vmovups %zmm4,-0x100(%r9)
Specifically, I tested on an Intel(R) Xeon(R) Gold 6271C and an AMD
EPYC 9W24 96-Core processor. The Intel PMU failed to capture the
AVX-512 usage at all, while the AMD CPU showed partial usage in some
scenarios. This likely relates to the PMU event definitions, which
count only computational instructions and appear not to include
data-movement instructions such as vmovups.
The descriptions for the relevant PMU events (as exemplified by Intel
CPU) are as follows:
fp_arith_inst_retired.512b_packed_double
[Number of SSE/AVX computational 512-bit packed double precision
floating-point instructions retired; some instructions will count
twice as noted below. Each count represents 8 computation operations,
one for each element. Applies to SSE* and AVX* packed double precision
floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14
SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice
as they perform 2 calculations per element]
fp_arith_inst_retired.512b_packed_single
[Number of SSE/AVX computational 512-bit packed single precision
floating-point instructions retired; some instructions will count
twice as noted below. Each count represents 16 computation operations,
one for each element. Applies to SSE* and AVX* packed single precision
floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14
SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice
as they perform 2 calculations per element]
fp_arith_inst_retired.8_flops
[Number of SSE/AVX computational 256-bit packed single precision and
512-bit packed double precision FP instructions retired; some
instructions will count twice as noted below. Each count represents 8
computation operations, 1 for each element. Applies to SSE* and AVX*
packed single precision and double precision FP instructions: ADD SUB
HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RSQRT14 RCP RCP14 DPP
FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2
calculations per element]
Overall, the /proc/<pid>/arch_status mechanism reliably reflects
AVX-512 register usage.
We are utilizing AVX-512 enabled infrastructure within a Kubernetes
(k8s) environment and require a mechanism for monitoring the
utilization of this instruction set.
The current /proc/<pid>/arch_status file reliably indicates AVX-512
usage for a single process. However, in containerized environments
(like Kubernetes Pods), this forces us to monitor every single process
within the cgroup, which is highly inefficient and creates significant
performance overhead for monitoring.
To solve this scaling issue, we are exploring the possibility of
aggregating this usage data. Would it be feasible to extend the
AVX-512 activation status tracking to the cgroup level?
[1]: https://sourceware.org/git/?p=glibc.git;a=log;h=refs/tags/glibc-2.24;pg=1
[2]: https://sourceware.org/git/?p=glibc.git;a=commit;h=c867597bff2562180a18da4b8dba89d24e8b65c4
[3]: https://sourceware.org/git/?p=glibc.git;a=commit;h=5e8c5bb1ac83aa2577d64d82467a653fa413f7ce
[4]: while_sleep_static
// gcc -O3 -static while_sleep.c -o while_sleep_static
// glibc > 2.24
#include <unistd.h>
int main()
{
while(1) {
sleep(1);
}
}
>
> > Given this, is the issue related to my specific Intel Xeon Gold? Is
> > the CPU continuously indicating that the AVX-512 state is in use?
> As much as I love to blame the hardware, I don't think we're quite there
> yet. We've literally had software bugs in the past that had this exact
> same behavior: AVX-512 state was tracked as non-init when it was never used.
>
> Any chance you could figure out where you first see XFEATURE_Hi16_ZMM in
> xfeatures? The tracepoints in here might help:
>
> /sys/kernel/debug/tracing/events/x86_fpu
>
> Is there any rhyme or reason for which tasks see avx512_timestamp
> getting set? Is it just your test program? Or other random tasks on the
> system?