public inbox for linux-rt-users@vger.kernel.org
 help / color / mirror / Atom feed
* Large(ish) variance induced by SCHED_FIFO
@ 2025-09-04 23:45 Marc Gonzalez
  2025-09-05  8:25 ` Daniel Wagner
       [not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
  0 siblings, 2 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-04 23:45 UTC (permalink / raw)
  To: linux-rt-users
  Cc: Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
	Daniel Wagner, Clark Williams, Pavel Machek, Luis Goncalves,
	John McCalpin

Hello everyone,

I am currently *NOT* running a PREEMPT_RT kernel.

I am trying to get reproducible / deterministic / low variance results
using a vendor kernel (6.8.0-79-generic #79-Ubuntu SMP PREEMPT_DYNAMIC).

I have removed all sources of noise / variance I could think of:

- SMT is disabled in BIOS
lscpu reports "Thread(s) per core: 1"
cat /sys/devices/system/cpu/smt/active returns 0

- Kernel boots in single user mode
No GUI, just a text terminal
Nothing running other than system daemons & bash

- Kernel runs with threadirqs command-line parameter
ISRs run as SCHED_FIFO processes with prio 50
$ ps -e -o pid,cls,pri,args --sort pri
    PID CLS PRI COMMAND
     18  FF 139 [migration/0]
     23  FF 139 [migration/1]
     29  FF 139 [migration/2]
     35  FF 139 [migration/3]
     19  FF  90 [idle_inject/0]
     22  FF  90 [idle_inject/1]
     28  FF  90 [idle_inject/2]
     34  FF  90 [idle_inject/3]
     54  FF  90 [irq/9-acpi]
     62  FF  90 [watchdogd]
     68  FF  90 [irq/24-PCIe PME]
     69  FF  90 [irq/24-pciehp]
     70  FF  90 [irq/24-s-pciehp]
     71  FF  90 [irq/25-PCIe PME]
     74  FF  90 [irq/8-rtc0]
     75  FF  90 [irq/16-ehci_hcd:usb1]
     82  FF  90 [irq/23-ehci_hcd:usb2]
    162  FF  90 [irq/26-xhci_hcd]
    164  FF  90 [irq/28-ahci[0000:00:1f.2]]
    377  FF  90 [irq/18-i801_smbus]
    384  FF  90 [irq/29-mei_me]
    411  FF  90 [card1-crtc0]
    412  FF  90 [card1-crtc1]
    413  FF  90 [irq/30-i915]
    463  FF  90 [irq/31-snd_hda_intel:card0]
    466  FF  90 [irq/32-snd_hda_intel:card1]
    335  FF  41 [psimon]
   1076  FF  41 [psimon]

- Frequency of CPUs is "pinned" at 3 GHz
No "turbo boost" (frequency varies with temperature).
Slightly below nominal freq to avoid temp throttling.
(Intel Core i5-4590 CPU @ 3.30GHz)
for I in 0 1 2 3; do
  echo userspace > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor
  echo   3000000 > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed
done

- Program is pinned to CPU #1
Avoid any cache ping pong.
Avoid CPU #0 in case it handles more interrupts.
$ taskset -c 1 ./a.out

And the last thing I wanted to do is:

- Run benchmark as SCHED_FIFO 99 to make sure NOTHING could interrupt it
bench is strictly computational:
no system calls, no I/O, no sleeping, just number crunching.

However, I noticed that the results have MORE variance when
the program runs as SCHED_FIFO than when it runs as SCHED_OTHER.

Here is my testcase:

Toy benchmark:
# define N (1 << 30)
int main(void)
{
	for (volatile int i = 0; i < N; ++i);
	return 0;
}

Compiled with gcc -O3 toy.c

Then I run this script as root:
#!/bin/bash

CMD="taskset -c 1 ./a.out"
if [ "$1" = "fifo" ]; then CMD="chrt -f 99 $CMD"; fi

for I in $(seq 1 30); do
  T0=$(date "+%s.%N")
  $CMD
  T1=$(date "+%s.%N")
  echo "$T1-$T0" | bc -l
done

Once without fifo (SCHED_OTHER), once with fifo (SCHED_FIFO).

(Full results at the end of the message)

For SCHED_OTHER:
MIN_RUNTIME=2.158045784 seconds
MAX_RUNTIME=2.158515466 seconds
i.e. all runs are within +/- 109 ppm (MAX-MIN)/(MAX+MIN)

For SCHED_FIFO:
MIN_RUNTIME=2.247394425 seconds
MAX_RUNTIME=2.305229091 seconds
i.e. all runs are within +/- 12700 ppm
i.e. 100 times WORSE than SCHED_OTHER... :(

HOWEVER, if we sort SCHED_FIFO results, we clearly see 3 "islands":
first run-time is the minimum outlier = 2.247394425
21 results in [2.253636818, 2.254797674] => 257 ppm
 8 results in [2.304188238, 2.305229091] => 226 ppm
Still not as good as SCHED_OTHER but nowhere near as bad.


Questions I have no answers for at this time:

What could explain this behavior?
Why several "islands" of results when running as SCHED_FIFO 99?
Why is the variance still greater within an island?

Thanks for reading me this far!! :)

Regards


Raw results for reference:

SCHED_OTHER
2.158171245
2.158127633
2.158129497
2.158109551
2.158089224
2.158515466
2.158083084
2.158127232
2.158282178
2.158135800
2.158123891
2.158045784
2.158174865
2.158141835
2.158184503
2.158118620
2.158096101
2.158150349
2.158150868
2.158076309
2.158179210
2.158134907
2.158202927
2.158108137
2.158182418
2.158088121
2.158095718
2.158165650
2.158134169
2.158119635

SCHED_FIFO
2.247394425
2.254726778
2.305229091
2.253719228
2.254658938
2.254744883
2.304188238
2.254670366
2.254683699
2.254683326
2.304284105
2.253755218
2.254756793
2.304288106
2.254756981
2.253636818
2.254797674
2.304240533
2.254630310
2.254659655
2.254696007
2.304234345
2.254739714
2.254766850
2.254620088
2.304224451
2.254608929
2.254732058
2.304227047
2.254665694



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-04 23:45 Large(ish) variance induced by SCHED_FIFO Marc Gonzalez
@ 2025-09-05  8:25 ` Daniel Wagner
  2025-09-05 13:23   ` Marc Gonzalez
       [not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-05  8:25 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

On Fri, Sep 05, 2025 at 01:45:31AM +0200, Marc Gonzalez wrote:
> Then I run this script as root:
> #!/bin/bash
> 
> CMD="taskset -c 1 ./a.out"
> if [ "$1" = "fifo" ]; then CMD="chrt -f 99 $CMD"; fi
> 
> for I in $(seq 1 30); do
>   T0=$(date "+%s.%N")
>   $CMD
>   T1=$(date "+%s.%N")
>   echo "$T1-$T0" | bc -l
> done

This setup is forking processes in order to get timestamps. This
gives you uncontrollable variance in the measurement. Use clock_gettime
inside your test program.

Also you might want to reduce the prio to 98.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05  8:25 ` Daniel Wagner
@ 2025-09-05 13:23   ` Marc Gonzalez
  2025-09-05 14:03     ` John Ogness
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 13:23 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

Hello Daniel :)

On 05/09/2025 10:25, Daniel Wagner wrote:

> On Fri, Sep 05, 2025 at 01:45:31AM +0200, Marc Gonzalez wrote:
>
>> Then I run this script as root:
>> #!/bin/bash
>>
>> CMD="taskset -c 1 ./a.out"
>> if [ "$1" = "fifo" ]; then CMD="chrt -f 99 $CMD"; fi
>>
>> for I in $(seq 1 30); do
>>   T0=$(date "+%s.%N")
>>   $CMD
>>   T1=$(date "+%s.%N")
>>   echo "$T1-$T0" | bc -l
>> done
> 
> This setup is forking processes in order to get timestamps. This
> gives you uncontrollable variance in the measurement. Use clock_gettime
> inside your test program.

I had already tested your suggestion, and observed it did not
make any difference. I will test again nonetheless.

(I used CLOCK_MONOTONIC_RAW.)

NB: if the large variance were induced by getting timestamps outside
the program, then SCHED_OTHER runs should be similarly impacted
(however, they are not).

> Also you might want to reduce the prio to 98.

May I ask why?

I'm guessing it's bad(TM) to potentially block the migration task?

I will lower the priority to 90.

Here is my current test code:

#include <stdio.h>
#include <time.h>
#define N (1 << 30)
#define G (1000*1000*1000L)
int main(void)
{
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
	for (volatile int i = 0; i < N; ++i);
	clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
	long diff = (t1.tv_sec*G + t1.tv_nsec) - (t0.tv_sec*G + t0.tv_nsec);
	printf("%ld\n", diff);
	return 0;
}

associated script:

#!/bin/bash
CMD="taskset -c 1 ./a.out"
if [ "$1" = "fifo" ]; then CMD="chrt -f 90 $CMD"; fi
for I in $(seq 1 30); do $CMD; done


The results are the same as before:

SCHED_OTHER
MIN=2156085411 MAX=2156133420
All within 11 ppm of the average (EXCELLENT RESULTS)

SCHED_FIFO
MIN=2205859200 (first run again, how strange)
MAX=2304868902
All within 21950 ppm of the average (TERRIBLE RESULTS)
21950 ppm = 2.2%

Looking at the distribution of SCHED_FIFO again, I notice a pattern:
first run-time is the minimum outlier = 2.20 seconds
21 (again!!) results around 2.25 seconds
8 (again!!) results around 2.30 seconds

BRAIN DUMP:
"First" run is meaningless.
First since what? Previous batch of runs?
In that case, sleeping between runs should work around the issue?

It looks like there is some kind of 50 millisecond quantum
involved somewhere somehow.

$ grep HZ /boot/config-6.8.0-79-generic 
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000

(Not sure what it means to have CONFIG_NO_HZ_FULL && CONFIG_HZ_1000)

This is driving me crazy.
I'm all ears if anyone has other suggestions.

Regards


Raw results for reference:

SCHED_OTHER
2156120840
2156110092
2156111812
2156117629
2156111235
2156104159
2156109850
2156120644
2156100701
2156085456
2156133420
2156130993
2156087623
2156120236
2156114137
2156099823
2156122243
2156095888
2156114046
2156102910
2156120532
2156102464
2156100703
2156125228
2156119672
2156110619
2156131331
2156085411
2156098780
2156121458

SCHED_FIFO
2205859200
2254828687
2254866765
2304868902
2254849969
2253908545
2254811821
2304845457
2253814019
2254834900
2304858026
2254845905
2253835350
2254815775
2304860984
2253824572
2254816525
2254800880
2303875283
2254826189
2254813961
2304850479
2254832102
2253816865
2254820633
2304847552
2253841768
2254807812
2254810662
2303848123

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05 13:23   ` Marc Gonzalez
@ 2025-09-05 14:03     ` John Ogness
  2025-09-05 15:34       ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-05 14:03 UTC (permalink / raw)
  To: Marc Gonzalez, Daniel Wagner
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

On 2025-09-05, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
> (I used CLOCK_MONOTONIC_RAW.)

Not really relevant to your issue, but you should always use
CLOCK_MONOTONIC.

> It looks like there is some kind of 50 millisecond quantum
> involved somewhere somehow.

Did you forget to disable RT throttling?

$ sudo sysctl -w kernel.sched_rt_runtime_us=-1

If you check the kernel logs, you would see a message about throttling:

"sched: RT throttling activated"

John Ogness

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05 14:03     ` John Ogness
@ 2025-09-05 15:34       ` Marc Gonzalez
  2025-09-05 16:07         ` John Ogness
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 15:34 UTC (permalink / raw)
  To: John Ogness, Daniel Wagner
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

On 05/09/2025 16:03, John Ogness wrote:

> On 2025-09-05, Marc Gonzalez wrote:
>
>> (I used CLOCK_MONOTONIC_RAW.)
> 
> Not really relevant to your issue, but you should always use CLOCK_MONOTONIC.

Could you provide some insight as to why CLOCK_MONOTONIC is preferable?

https://www.man7.org/linux/man-pages/man3/clock_gettime.3.html

CLOCK_MONOTONIC

A non-settable system-wide clock that represents monotonic time since
(as described by POSIX) "some unspecified point in the past". On Linux,
that point corresponds to the number of seconds that the system has been
running since it was booted.

The CLOCK_MONOTONIC clock is not affected by discontinuous jumps in the
system time (e.g., if the system administrator manually changes the clock),
but is affected by the incremental adjustments performed by adjtime(3) and NTP.
This clock does not count time that the system is suspended.
All CLOCK_MONOTONIC variants guarantee that the time returned by
consecutive calls will not go backwards, but successive calls may
(depending on the architecture) return identical (not-increased)
time values.


CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)

Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time
that is not subject to NTP adjustments or the incremental adjustments performed
by adjtime(3). This clock does not count time that the system is suspended.


CLOCK_MONOTONIC_RAW seemed less susceptible to "interference" :)

>> It looks like there is some kind of 50 millisecond quantum
>> involved somewhere somehow.
> 
> Did you forget to disable RT throttling?
> 
> $ sudo sysctl -w kernel.sched_rt_runtime_us=-1
> 
> If you check the kernel logs, you would see a message about throttling:
> 
> "sched: RT throttling activated"

YES!!!

I didn't think to look at dmesg :(

And I had looked at sched-related proc nodes...

$ grep "" /proc/sys/kernel/sched_*
/proc/sys/kernel/sched_autogroup_enabled:1
/proc/sys/kernel/sched_cfs_bandwidth_slice_us:5000
/proc/sys/kernel/sched_deadline_period_max_us:4194304
/proc/sys/kernel/sched_deadline_period_min_us:100
/proc/sys/kernel/sched_rr_timeslice_ms:100
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000
/proc/sys/kernel/sched_schedstats:0
/proc/sys/kernel/sched_util_clamp_max:1024
/proc/sys/kernel/sched_util_clamp_min:1024
/proc/sys/kernel/sched_util_clamp_min_rt_default:1024

... but my brain didn't register the obvious clue,
although it was right there: 1000000 - 950000 = 50 ms !!

Now everything works as expected, even in multi-user mode:

$ sudo ./run_benchmark.sh fifo
2155899243
2155930350
2155925344
2155545153
2155520262
2155520647
2155531815
2155586328
2155501088
2155887042
2155937889
2155990740
2155939605
2155509468
2155898441
2155615316
2155498933
2155968150
2155493295
2155496287
2155949449
2155488499
2155932900
2155498135
2155531294
2156047123
2156035893
2155926392
2155917489
2155482792

EXCEPT... that my mouse is sluggish as molasses when the script
runs on CPU1 (but not when it runs on other cores).

I'm probably starving an important ISR :)

[migration/0] pid 18's current affinity list: 0
[migration/1] pid 23's current affinity list: 1
[migration/2] pid 29's current affinity list: 2
[migration/3] pid 35's current affinity list: 3
[idle_inject/0] pid 19's current affinity list: 0
[idle_inject/1] pid 22's current affinity list: 1
[idle_inject/2] pid 28's current affinity list: 2
[idle_inject/3] pid 34's current affinity list: 3
[irq/9-acpi] pid 54's current affinity list: 1
[watchdogd] pid 62's current affinity list: 0-3
[irq/24-PCIe PME] pid 68's current affinity list: 2
[irq/24-pciehp] pid 69's current affinity list: 2
[irq/24-s-pciehp] pid 70's current affinity list: 2
[irq/25-PCIe PME] pid 71's current affinity list: 3
[irq/8-rtc0] pid 74's current affinity list: 0
[irq/23-ehci_hcd:usb1] pid 75's current affinity list: 1
[irq/16-ehci_hcd:usb2] pid 81's current affinity list: 2
[irq/26-xhci_hcd] pid 159's current affinity list: 1
[irq/27-ahci[0000:00:1f.2]] pid 162's current affinity list: 1
[irq/18-i801_smbus] pid 388's current affinity list: 1
[irq/29-mei_me] pid 411's current affinity list: 2
[card1-crtc0] pid 461's current affinity list: 0-3
[card1-crtc1] pid 462's current affinity list: 0-3
[irq/30-i915] pid 463's current affinity list: 3
[irq/31-snd_hda_intel:card0] pid 514's current affinity list: 0
[irq/32-snd_hda_intel:card1] pid 518's current affinity list: 1
[psimon] pid 334's current affinity list: 0-3
[psimon] pid 1075's current affinity list: 0-3

Most of the USB ISRs run on core 1, no wonder my USB mouse feels limp :)

Thanks Daniel & John for pointing me in the right direction!

Regards



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05 15:34       ` Marc Gonzalez
@ 2025-09-05 16:07         ` John Ogness
  2025-09-05 17:40           ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-05 16:07 UTC (permalink / raw)
  To: Marc Gonzalez, Daniel Wagner
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

On 2025-09-05, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
>>> (I used CLOCK_MONOTONIC_RAW.)
>> 
>> Not really relevant to your issue, but you should always use CLOCK_MONOTONIC.
>
> Could you provide some insight as to why CLOCK_MONOTONIC is
> preferable?

CLOCK_MONOTONIC is tracking in time units as defined by
humans. CLOCK_MONOTONIC_RAW is tracking in time units defined by some
hardware.

John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05 16:07         ` John Ogness
@ 2025-09-05 17:40           ` Marc Gonzalez
  2025-09-08  9:36             ` John Ogness
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 17:40 UTC (permalink / raw)
  To: John Ogness, Daniel Wagner
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

On 05/09/2025 18:07, John Ogness wrote:

> On 2025-09-05, Marc Gonzalez wrote:
>
>> Could you provide some insight as to why CLOCK_MONOTONIC is preferable?
> 
> CLOCK_MONOTONIC is tracking in time units as defined by humans.
> CLOCK_MONOTONIC_RAW is tracking in time units defined by some hardware.

I'm not sure I understand what that implies.

Eventually, clock_gettime stores whatever it measured in struct timespec,
with tv_sec & tv_nsec, right?

Are you saying that if I have 2 identical timespecs,
one from CLOCK_MONOTONIC, one from CLOCK_MONOTONIC_RAW,
they might not measure the same time interval?

Are you perhaps saying that NTP can (ever so slightly) tweak
a clock to bring its frequency closer to "reality", but if
CLOCK_MONOTONIC_RAW ignores such tweaks, it might count time
slightly too fast or too slow?

Regards



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Large(ish) variance induced by SCHED_FIFO
       [not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
@ 2025-09-05 21:05   ` Marc Gonzalez
  0 siblings, 0 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 21:05 UTC (permalink / raw)
  To: Leon Woestenberg
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

Hello Leon :)

On 05/09/2025 18:18, Leon Woestenberg wrote:

> On Fri, 5 Sep 2025 at 01:45, Marc Gonzalez wrote:
> 
>> I am trying to get reproducible / deterministic / low variance results
>> using a vendor kernel (6.8.0-79-generic #79-Ubuntu SMP PREEMPT_DYNAMIC).
> 
> Is that your end goal or is your question maybe an XY question?
> If so please describe X.

I need a rudimentary (but accurate) benchmark framework,
to measure the impact of micro-optimizations on codes.
These optimizations are expected to make programs 0-3% faster,
so the framework should have the lowest variance possible,
otherwise I can't discern the result from the noise.

> Or just include the measurement inside the program using CLOCK_MONOTONIC.

Daniel suggested that as well, but it was not the source of the problem.

John pointed out the culprit:
/proc/sys/kernel/sched_rt_runtime_us

https://www.kernel.org/doc/html/latest/scheduler/sched-rt-group.html

Regards



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05 17:40           ` Marc Gonzalez
@ 2025-09-08  9:36             ` John Ogness
  2025-09-08 15:42               ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-08  9:36 UTC (permalink / raw)
  To: Marc Gonzalez, Daniel Wagner
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

On 2025-09-05, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
>>> Could you provide some insight as to why CLOCK_MONOTONIC is preferable?
>> 
>> CLOCK_MONOTONIC is tracking in time units as defined by humans.
>> CLOCK_MONOTONIC_RAW is tracking in time units defined by some hardware.
>
> I'm not sure I understand what that implies.
>
> Eventually, clock_gettime stores whatever it measured in struct timespec,
> with tv_sec & tv_nsec, right?

I would phrase it differently, but generally speaking, correct.

> Are you saying that if I have 2 identical timespecs,
> one from CLOCK_MONOTONIC, one from CLOCK_MONOTONIC_RAW,
> they might not measure the same time interval?

I assume you mean 2 _sets_ of timespecs (so that you have 2 intervals to
compare). Correct.

> Are you perhaps saying that NTP can (ever so slightly) tweak
> a clock to bring its frequency closer to "reality", but if
> CLOCK_MONOTONIC_RAW ignores such tweaks, it might count time
> slightly too fast or too slow?

Correct.

This is also valid for other clocks. For example, the kernel logs
(printk) and ftrace default to the CPU local clock. These timestamps
cannot be compared to CLOCK_MONOTONIC timestamps... just as
CLOCK_MONOTONIC_RAW timestamps cannot be compared to CLOCK_MONOTONIC
timestamps.

There are still reasons why CLOCK_MONOTONIC_RAW might be
interesting. For example, if you want a very stable timesource to
compare intervals, but do not care so much about the real world time
lengths of those intervals (i.e. where is the greatest latency vs. what
is the value of the greatest latency). Although even here, I doubt
CLOCK_MONOTONIC_RAW has a practical advantage over CLOCK_MONOTONIC.

John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-08  9:36             ` John Ogness
@ 2025-09-08 15:42               ` Marc Gonzalez
  2025-09-08 16:02                 ` Daniel Wagner
  2025-09-08 16:52                 ` [EXT] Re: Large(ish) variance induced by SCHED_FIFO Rui Sousa
  0 siblings, 2 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-08 15:42 UTC (permalink / raw)
  To: John Ogness, Leon Woestenberg, Daniel Wagner
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

On 08/09/2025 11:36, John Ogness wrote:

> There are still reasons why CLOCK_MONOTONIC_RAW might be
> interesting. For example, if you want a very stable time source to
> compare intervals, but do not care so much about the real world time
> lengths of those intervals (i.e. where is the greatest latency vs. what
> is the value of the greatest latency). Although even here, I doubt
> CLOCK_MONOTONIC_RAW has a practical advantage over CLOCK_MONOTONIC.

In fact, I'm just trying to compare the run-time of 2 minor
variations of the same program (testing micro-optimizations).

Absolute run-time is not really important, what I really want
to know is: does v2 run faster or slower than v1?

This is the framework I'm using at this point:

#include <stdio.h>
#include <time.h>
extern void my_code(void);

static long loop(int log2)
{
	int n = 1 << log2;
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < n; ++i) my_code();
	clock_gettime(CLOCK_MONOTONIC, &t1);
	long d = (t1.tv_sec - t0.tv_sec)*1000000000L + (t1.tv_nsec - t0.tv_nsec);
	long t = d >> log2;
	return t;
}

int main(void)
{
	long t, min = loop(4);
	for (int i = 0; i < 20; ++i)
		if ((t = loop(8)) < min) min = t;
	printf("MIN=%ld\n", min);
	return 0;
}

Basically:
- warm up the caches
- run my_code() 256 times && compute average run-time
- repeat 20 times to find MINIMUM average run-time

When my_code() is a trivial computational loop such as:

	mov $(1<<12), %eax
1:	dec %ecx
	dec %ecx
	dec %eax
	jnz 1b
	ret

Then running the benchmark 1000 times returns the same value 1000 times:
MIN=2737


Obviously, the program I'm working on is a bit more complex, but barely:
- no system calls, no library calls
- just simple bit twiddling
- tiny code, tiny data structures, everything fits in L1
$ size a.out
   text	   data	    bss	    dec	    hex	filename
   8549	    632	   1072	  10253	   280d	a.out

When I run the benchmark 1000 times, there are some large outliers:
MIN_MIN=2502
MAX_MIN=2774

NOTE: 95% of the results are below 2536.
But the top 1% (worst 10) are really bad (2646-2774)

How to get repeatable results?

Random 10% outliers break the ability to measure the impact
of micro-optimizations expected to provide 0-3% improvements :(

For reference, the script launching the benchmark does:

echo     -1 > /proc/sys/kernel/sched_rt_runtime_us
for I in 0 1 2 3; do echo userspace > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor; done
sleep 0.25
for I in 0 1 2 3; do echo   3000000 > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed; done
sleep 0.25

for I in $(seq 1 1000); do
chrt -f 99 taskset -c 2 ./a.out < $1
done

for I in 0 1 2 3; do
echo schedutil > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor
done
echo 950000 > /proc/sys/kernel/sched_rt_runtime_us


I've run out of ideas to identify other sources of variance.
(I ran everything in single user mode with nothing else running.)
Perhaps with perf I could identify the source of stalls or bubbles?

Hoping someone can point me in the right direction.

Regards



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-08 15:42               ` Marc Gonzalez
@ 2025-09-08 16:02                 ` Daniel Wagner
  2025-09-08 18:40                   ` Marc Gonzalez
  2025-09-08 16:52                 ` [EXT] Re: Large(ish) variance induced by SCHED_FIFO Rui Sousa
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-08 16:02 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: John Ogness, Leon Woestenberg, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Daniel Wagner,
	Clark Williams, Pavel Machek, Luis Goncalves, John McCalpin

On Mon, Sep 08, 2025 at 05:42:58PM +0200, Marc Gonzalez wrote:
> I've run out of ideas to identify other sources of variance.
> (I ran everything in single user mode with nothing else running.)
> Perhaps with perf I could identify the source of stalls or bubbles?
> 
> Hoping someone can point me in the right direction.

There are a few guides out there explaining the common pitfalls. I
suppose you have already found them. Just in case you missed Frederic's
excellent blog on CPU isolation:

https://www.suse.com/c/cpu-isolation-introduction-part-1/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [EXT] Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-08 15:42               ` Marc Gonzalez
  2025-09-08 16:02                 ` Daniel Wagner
@ 2025-09-08 16:52                 ` Rui Sousa
  2025-09-08 17:03                   ` Marc Gonzalez
  1 sibling, 1 reply; 30+ messages in thread
From: Rui Sousa @ 2025-09-08 16:52 UTC (permalink / raw)
  To: Marc Gonzalez, John Ogness, Leon Woestenberg, Daniel Wagner
  Cc: linux-rt-users@vger.kernel.org, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

Hi Marc,

Just replace your last setup script line with:
echo -1 > /proc/sys/kernel/sched_rt_runtime_us

You are running a realtime task for a long while, but you have RT throttling
enabled. At some point it will trigger, and you have no control of when.
I believe you are seeing regular tasks being scheduled in the middle of
the benchmark measurement.

Disabling throttling may block your entire system while the benchmark is
running, so lower the priority a bit and use CPU affinity to land the
benchmark process on a core that is not usually used.

Br
Rui

> -----Original Message-----
> From: Marc Gonzalez <marc.w.gonzalez@free.fr>
> Sent: Monday, September 8, 2025 17:43
> To: John Ogness <john.ogness@linutronix.de>; Leon Woestenberg
> <leon@sidebranch.com>; Daniel Wagner <dwagner@suse.de>
> Cc: linux-rt-users@vger.kernel.org; Steven Rostedt <rostedt@goodmis.org>;
> Thomas Gleixner <tglx@linutronix.de>; Sebastian Andrzej Siewior
> <bigeasy@linutronix.de>; Daniel Wagner <daniel.wagner@suse.com>; Clark
> Williams <williams@redhat.com>; Pavel Machek <pavel@denx.de>; Luis
> Goncalves <lgoncalv@redhat.com>; John McCalpin <mccalpin@tacc.utexas.edu>
> Subject: [EXT] Re: Large(ish) variance induced by SCHED_FIFO
> 
> [...]


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [EXT] Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-08 16:52                 ` [EXT] Re: Large(ish) variance induced by SCHED_FIFO Rui Sousa
@ 2025-09-08 17:03                   ` Marc Gonzalez
  0 siblings, 0 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-08 17:03 UTC (permalink / raw)
  To: Rui Sousa, John Ogness, Leon Woestenberg, Daniel Wagner
  Cc: linux-rt-users@vger.kernel.org, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

On 08/09/2025 18:52, Rui Sousa wrote:

> Just replace your last setup script line with:
> echo -1 > /proc/sys/kernel/sched_rt_runtime_us

Launch script consists of:

echo     -1 > /proc/sys/kernel/sched_rt_runtime_us
# RUN RT BENCHMARK
echo 950000 > /proc/sys/kernel/sched_rt_runtime_us


https://www.kernel.org/doc/html/latest/scheduler/sched-rt-group.html

"A run time of -1 specifies runtime == period, ie. no limit."

1) Disable RT throttling
2) Run RT benchmark
3) Enable RT throttling

> I believe you are seeing regular tasks being scheduled
> in the middle of the benchmark measurement.

I don't think so.

RT throttling is disabled.
The benchmark runs as SCHED_FIFO 99.
Nothing can interrupt it, except interrupts :)

Am checking Daniel's suggestion.

Regards



* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-08 16:02                 ` Daniel Wagner
@ 2025-09-08 18:40                   ` Marc Gonzalez
  2025-09-09 11:08                     ` Unexplained variance in run-time of trivial program Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-08 18:40 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: John Ogness, Leon Woestenberg, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

On 08/09/2025 18:02, Daniel Wagner wrote:

> On Mon, Sep 08, 2025 at 05:42:58PM +0200, Marc Gonzalez wrote:
>
>> I've run out of ideas to identify other sources of variance.
>> (I ran everything in single user mode with nothing else running.)
>> Perhaps with perf I could identify the source of stalls or bubbles?
>>
>> Hoping someone can point me in the right direction.
> 
> There are a few guides out there explaining the common pitfalls. I
> suppose you have already found them. Just in case you missed Frederic's
> excellent blog on CPU isolation:
> 
> https://www.suse.com/c/cpu-isolation-introduction-part-1/

Thanks for this great link!

Is nohz_full=X a superset of isolcpus=X or are they orthogonal?

(I see that /usr/sbin/irqbalance is running on my system.)

Regards



* Re: Unexplained variance in run-time of trivial program
  2025-09-08 18:40                   ` Marc Gonzalez
@ 2025-09-09 11:08                     ` Marc Gonzalez
  2025-09-09 11:21                       ` Daniel Wagner
  2025-09-09 12:34                       ` John Ogness
  0 siblings, 2 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 11:08 UTC (permalink / raw)
  To: linux-rt-users
  Cc: Daniel Wagner, John Ogness, Leon Woestenberg, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu

On 08/09/2025 20:40, Marc Gonzalez wrote:

> On 08/09/2025 18:02, Daniel Wagner wrote:
> 
>> On Mon, Sep 08, 2025 at 05:42:58PM +0200, Marc Gonzalez wrote:
>>
>>> I've run out of ideas to identify other sources of variance.
>>> (I ran everything in single user mode with nothing else running.)
>>> Perhaps with perf I could identify the source of stalls or bubbles?
>>>
>>> Hoping someone can point me in the right direction.
>>
>> There are a few guides out there explaining the common pitfalls. I
>> suppose you have already found them. Just in case you missed Frederic's
>> excellent blog on CPU isolation:
>>
>> https://www.suse.com/c/cpu-isolation-introduction-part-1/
> 
> Thanks for this great link!
> 
> Is nohz_full=X a superset of isolcpus=X or are they orthogonal?

I am now more confused than ever.

Following Frederic's guide:
https://www.suse.com/c/cpu-isolation-practical-example-part-5/

If I didn't mess up, CPU3 is now supposed to be fully isolated
on my 4-core system:

$ sudo dmesg | grep "Kernel command line"
[    0.038445] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-79-generic root=UUID=2aff6d63-a8c6-4ebb-af69-9d2df2852cb0 ro log_buf_len=16M nohz_full=3 isolcpus=3 quiet splash vt.handoff=7

I ran echo "0-2" > $IRQ/smp_affinity_list for every IRQ.

IRQ=0   AFTER SMP_AFF=0-3 EFF_AFF=0
IRQ=1   AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=2   AFTER SMP_AFF=0-3 EFF_AFF=
IRQ=3   AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=4   AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=5   AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=6   AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=7   AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=8   AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=9   AFTER SMP_AFF=0-2 EFF_AFF=1
IRQ=10  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=11  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=12  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=13  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=14  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=15  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=16  AFTER SMP_AFF=0-2 EFF_AFF=1
IRQ=18  AFTER SMP_AFF=0-2 EFF_AFF=1
IRQ=23  AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=24  AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=25  AFTER SMP_AFF=0-2 EFF_AFF=3
IRQ=26  AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=27  AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=28  AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=29  AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=30  AFTER SMP_AFF=0-2 EFF_AFF=3
IRQ=31  AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=32  AFTER SMP_AFF=0-2 EFF_AFF=1

I note that IRQs 25 & 30 are still effectively pinned to CPU3,
despite the smp_aff setting.

$ cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       
  0:         27          0          0          0   IO-APIC   2-edge      timer
  8:          0          0          0          0   IO-APIC   8-edge      rtc0
  9:          0          4          0          0   IO-APIC   9-fasteoi   acpi
 16:          0         25          0          0   IO-APIC  16-fasteoi   ehci_hcd:usb1
 18:          0          5          0          0   IO-APIC  18-fasteoi   i801_smbus
 23:          0          0         29          0   IO-APIC  23-fasteoi   ehci_hcd:usb2
 24:          0          0          0          0  PCI-MSI-0000:00:1c.0   0-edge      PCIe PME, pciehp
 25:          0          0          0          0  PCI-MSI-0000:00:1c.3   0-edge      PCIe PME
 26:      23328          0          0      37250  PCI-MSI-0000:00:1f.2   0-edge      ahci[0000:00:1f.2]
 27:      86091          0          0          0  PCI-MSI-0000:00:14.0   0-edge      xhci_hcd
 28:          0          0      51308          0  PCI-MSIX-0000:02:00.0  0-edge      enp2s0
 29:          0          0         22          0  PCI-MSI-0000:00:16.0   0-edge      mei_me
 30:          0          0          0        604  PCI-MSI-0000:00:1b.0   0-edge      snd_hda_intel:card0
 31:     198664          0          0          0  PCI-MSI-0000:00:02.0   0-edge      i915
 32:          0        631          0          0  PCI-MSI-0000:00:03.0   0-edge      snd_hda_intel:card1


My benchmark is a trivial piece of code:
- no system calls, no library calls
- just simple bit twiddling
- tiny code, tiny data structures, everything fits in L1
$ size a.out
   text	   data	    bss	    dec	    hex	filename
   8549	    632	   1072	  10253	   280d	a.out

Undisturbed, it runs in 2500 nanoseconds
(with all cores set to tick at 3 GHz).

I run it 2^16 times, so the expected run-time is
2500*2^16 = 163840000 ns = 163.84 ms

Yet, I've observed AVERAGE run-times of
2600, 2700, and even 2800 (!!) nanoseconds.

I captured a full trace of this occurrence with
echo *:* > $TRACING/set_event
cat $TRACING/per_cpu/cpu$CPU/trace > trace.$CPU

# tracer: nop
#
# entries-in-buffer/entries-written: 9743/9743   #P:4
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
          <idle>-0       [003] dN.1.  4606.627286: cpu_idle: state=4294967295 cpu_id=3
          <idle>-0       [003] dN.1.  4606.627288: rcu_utilization: Start context switch
          <idle>-0       [003] dN.1.  4606.627288: rcu_utilization: End context switch
          <idle>-0       [003] d..2.  4606.627290: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=taskset next_pid=8129 next_prio=0
          <idle>-0       [003] d..2.  4606.627291: tlb_flush: pages:-1 reason:flush on task switch (0)
           a.out-8129    [003] ...1.  4606.627304: mm_page_free: page=000000001fef04a8 pfn=0x16a586 order=1
           ...
           a.out-8129    [003] d....  4606.628019: user_enter: 
           a.out-8129    [003] d....  4606.811832: user_exit: 


Program starts at 4606.627304
Program setup lasts until 4606.628019 (~1500 recorded events)
=> roughly 700 microseconds spent before main()
which I'm not taking into account in my measurements.

Then program ends at 4606.811832
=> 183.813 ms instead of the expected 163.84 ms

Almost exactly 20 ms in excess.
Could this be a hint?
But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.

I guess either my time source is incorrect.
(Next slide in Frederic's guide)
OR there is something wonky going on inside CPU3.

Looks like IRQs are not the issue here?
No need to investigate why I can't pin IRQs 25 & 30
elsewhere than CPU3?

Regards



* Re: Unexplained variance in run-time of trivial program
  2025-09-09 11:08                     ` Unexplained variance in run-time of trivial program Marc Gonzalez
@ 2025-09-09 11:21                       ` Daniel Wagner
  2025-09-09 12:42                         ` Marc Gonzalez
  2025-09-09 12:34                       ` John Ogness
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-09 11:21 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: linux-rt-users, John Ogness, Leon Woestenberg, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu

> I note that IRQs 25 & 30 are still effectively pinned to CPU3,
> despite the smp_aff setting.
> 
> $ cat /proc/interrupts 
>            CPU0       CPU1       CPU2       CPU3       
>   0:         27          0          0          0   IO-APIC   2-edge      timer
>   8:          0          0          0          0   IO-APIC   8-edge      rtc0
>   9:          0          4          0          0   IO-APIC   9-fasteoi   acpi
>  16:          0         25          0          0   IO-APIC  16-fasteoi   ehci_hcd:usb1
>  18:          0          5          0          0   IO-APIC  18-fasteoi   i801_smbus
>  23:          0          0         29          0   IO-APIC  23-fasteoi   ehci_hcd:usb2
>  24:          0          0          0          0  PCI-MSI-0000:00:1c.0   0-edge      PCIe PME, pciehp
>  25:          0          0          0          0  PCI-MSI-0000:00:1c.3   0-edge      PCIe PME
>  26:      23328          0          0      37250  PCI-MSI-0000:00:1f.2   0-edge      ahci[0000:00:1f.2]
>  27:      86091          0          0          0  PCI-MSI-0000:00:14.0   0-edge      xhci_hcd
>  28:          0          0      51308          0  PCI-MSIX-0000:02:00.0  0-edge      enp2s0
>  29:          0          0         22          0  PCI-MSI-0000:00:16.0   0-edge      mei_me
>  30:          0          0          0        604  PCI-MSI-0000:00:1b.0   0-edge      snd_hda_intel:card0
>  31:     198664          0          0          0  PCI-MSI-0000:00:02.0   0-edge      i915
>  32:          0        631          0          0  PCI-MSI-0000:00:03.0   0-edge      snd_hda_intel:card1

Many drivers are not isolcpus aware. At least the sound driver could be
unloaded for your test, I suppose.

> Almost exactly 20 ms in excess.
> Could this be a hint?
> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.

Ensure your clock source is working correctly and...

> I guess either my time source is incorrect.
> (Next slide in Frederic's guide)
> OR there is something wonky going on inside CPU3.

...there is no SMI running on this CPU, and no power management running
(also in the BIOS settings)

> Looks like IRQs are not the issue here?

IRQ handling should be visible in the trace.


* Re: Unexplained variance in run-time of trivial program
  2025-09-09 11:08                     ` Unexplained variance in run-time of trivial program Marc Gonzalez
  2025-09-09 11:21                       ` Daniel Wagner
@ 2025-09-09 12:34                       ` John Ogness
  2025-09-09 14:08                         ` Marc Gonzalez
  1 sibling, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-09 12:34 UTC (permalink / raw)
  To: Marc Gonzalez, linux-rt-users
  Cc: Daniel Wagner, Leon Woestenberg, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Clark Williams, Pavel Machek,
	Luis Goncalves, John McCalpin, Frederic Weisbecker, Ingo Molnar,
	Masami Hiramatsu

On 2025-09-09, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
> I captured a full trace of this occurrence with
> echo *:* > $TRACING/set_event
> cat $TRACING/per_cpu/cpu$CPU/trace > trace.$CPU

...

> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.

I recommend using the function tracer:

# echo function > $TRACING/current_tracer

It should not affect your timings for pure-userspace code. This should
catch any unintentional kernel activity.

John


* Re: Unexplained variance in run-time of trivial program
  2025-09-09 11:21                       ` Daniel Wagner
@ 2025-09-09 12:42                         ` Marc Gonzalez
  2025-09-09 14:23                           ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 12:42 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: linux-rt-users, John Ogness, Leon Woestenberg, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu

On 09/09/2025 13:21, Daniel Wagner wrote:

>> I note that IRQs 25 & 30 are still effectively pinned to CPU3,
>> despite the smp_aff setting.
>>
>> $ cat /proc/interrupts 
>>            CPU0       CPU1       CPU2       CPU3       
>>   0:         27          0          0          0   IO-APIC   2-edge      timer
>>   8:          0          0          0          0   IO-APIC   8-edge      rtc0
>>   9:          0          4          0          0   IO-APIC   9-fasteoi   acpi
>>  16:          0         25          0          0   IO-APIC  16-fasteoi   ehci_hcd:usb1
>>  18:          0          5          0          0   IO-APIC  18-fasteoi   i801_smbus
>>  23:          0          0         29          0   IO-APIC  23-fasteoi   ehci_hcd:usb2
>>  24:          0          0          0          0  PCI-MSI-0000:00:1c.0   0-edge      PCIe PME, pciehp
>>  25:          0          0          0          0  PCI-MSI-0000:00:1c.3   0-edge      PCIe PME
>>  26:      23328          0          0      37250  PCI-MSI-0000:00:1f.2   0-edge      ahci[0000:00:1f.2]
>>  27:      86091          0          0          0  PCI-MSI-0000:00:14.0   0-edge      xhci_hcd
>>  28:          0          0      51308          0  PCI-MSIX-0000:02:00.0  0-edge      enp2s0
>>  29:          0          0         22          0  PCI-MSI-0000:00:16.0   0-edge      mei_me
>>  30:          0          0          0        604  PCI-MSI-0000:00:1b.0   0-edge      snd_hda_intel:card0
>>  31:     198664          0          0          0  PCI-MSI-0000:00:02.0   0-edge      i915
>>  32:          0        631          0          0  PCI-MSI-0000:00:03.0   0-edge      snd_hda_intel:card1
> 
> Many drivers are not isolcpus aware. At least the sound driver could be
> unloaded for your test, I suppose.

I don't quite understand.
Do these drivers explicitly request that their ISR run on CPU3?
Why doesn't the kernel just run these ISRs on a non-isolated core?

>> Almost exactly 20 ms in excess.
>> Could this be a hint?
>> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.
> 
> Ensure your clock source is working correctly and...
> 
>> I guess either my time source is incorrect.
>> (Next slide in Frederic's guide)
>> OR there is something wonky going on inside CPU3.
> 
> ...there is no SMI running on this CPU, and no power management running
> (also in the BIOS settings)

Actually, I'm 99.9% sure that clock source accuracy & SMM are red herrings.

If I replace my code with this one:

	mov $(1<<12), %eax
1:	dec %ecx
	dec %ecx
	dec %eax
	jnz 1b

which runs in ~2735 nanoseconds at 3 Ghz
(2 cycles per iteration * 4096 iterations = 8192 cycles
8192 cycles / 3 = 2730 nanoseconds)

Running this trivial baseline benchmark 2^16 times should take
2735 ns * 2^16 = 179.241 ms

And if I run the loop 1000 times, and sort by run-time, I observe
MIN=179.374211 ms
MAX=179.406745 ms
So worst-case is only 166 microseconds worse than expected
(contrast this to 20 MILLIseconds for my code, 120 times worse)

It seems something is randomly stalling the pipeline on CPU3.
This feels like thermal throttling maybe?
Is that supposed to be logged somewhere?
But why would my program throttle & not the trivial baseline benchmark?

Regards



* Re: Unexplained variance in run-time of trivial program
  2025-09-09 12:34                       ` John Ogness
@ 2025-09-09 14:08                         ` Marc Gonzalez
  2025-09-09 18:33                           ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 14:08 UTC (permalink / raw)
  To: John Ogness, linux-rt-users
  Cc: Daniel Wagner, Leon Woestenberg, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Clark Williams, Pavel Machek,
	Luis Goncalves, John McCalpin, Frederic Weisbecker, Ingo Molnar,
	Masami Hiramatsu

On 09/09/2025 14:34, John Ogness wrote:

> On 2025-09-09, Marc Gonzalez wrote:
> 
>> I captured a full trace of this occurrence with
>> echo *:* > $TRACING/set_event
>> cat $TRACING/per_cpu/cpu$CPU/trace > trace.$CPU
> 
> ...
> 
>> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.
> 
> I recommend using the function tracer:
> 
> # echo function > $TRACING/current_tracer
> 
> It should not affect your timings for pure-userspace code. This should
> catch any unintentional kernel activity.

OK. This time, I used:
echo function > $TRACING/current_tracer
cat $TRACING/trace > trace.all
(which produced a 12.65 MB file)

I dumped a problematic run where T=2722 instead of T=2500
i.e. total run-time = 178.389 ms instead of 163.84 ms

Looking for a.out-23683 until the log goes quiet...

  a.out-23683   [003] d.... 14939.982368: user_enter: 
  a.out-23683   [003] d.... 14939.982368: vtime_user_enter <-__ct_user_enter
  a.out-23683   [003] d.... 14939.982368: get_vtime_delta <-vtime_user_enter
  NO OCCURRENCE of a.out or [003] between these 2 points
  a.out-23683   [003] d.... 14940.160764: vtime_user_exit <-__ct_user_exit
  a.out-23683   [003] d.... 14940.160765: get_vtime_delta <-vtime_user_exit
  a.out-23683   [003] d.... 14940.160765: account_user_time <-vtime_user_exit

14940.160764 - 14939.982368 = 178.396 ms
(which corresponds to the total run-time)
Are the vtime_user_enter/exit tied to clock_gettime?
(Probably not, since they're not balanced)

It's starting to look like the problem is on the CPU itself,
not the kernel interrupting the CPU.

Might be time to break out perf?

My hunch is frequency dropped to ~2.7 GHz for the duration of the benchmark.

Maybe I should lower the frequency to 2 GHz.
But again, why would my code force throttling and not the toy code?
One possible reason is my code reaches 3.5 IPC, while the toy code
remains at 1.5 IPC (with micro-op fusion).

It's a stretch, but easy to test.

I will measure CPU cycles, to see if the increased run-time
corresponds with a change in CPU cycles.

Regards



* Re: Unexplained variance in run-time of trivial program
  2025-09-09 12:42                         ` Marc Gonzalez
@ 2025-09-09 14:23                           ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2025-09-09 14:23 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Daniel Wagner, linux-rt-users, John Ogness, Leon Woestenberg,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu

On Tue, 9 Sep 2025 14:42:54 +0200
Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:

> > ...there is no SMI running on this CPU, and no power management running
> > (also in the BIOS settings)  
> 
> Actually, I'm 99.9% sure that clock source accuracy & SMM are red herrings.

You may want to run the osnoise or hwlat tracer to see if SMM or anything
else the kernel does could cause issues with your tests.

-- Steve


* Re: Unexplained variance in run-time of trivial program
  2025-09-09 14:08                         ` Marc Gonzalez
@ 2025-09-09 18:33                           ` Marc Gonzalez
       [not found]                             ` <CAMjhiJjMOQN-nWd+KP4JBBNHf20M+J2fXAuTTvowXctJgvGOcQ@mail.gmail.com>
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 18:33 UTC (permalink / raw)
  To: John Ogness, linux-rt-users
  Cc: Daniel Wagner, Leon Woestenberg, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Clark Williams, Pavel Machek,
	Luis Goncalves, John McCalpin, Frederic Weisbecker, Ingo Molnar,
	Masami Hiramatsu

On 9/9/25 16:08, Marc Gonzalez wrote:

> My hunch is frequency dropped to ~2.7 GHz for the duration of the benchmark.
> 
> Maybe I should lower the frequency to 2 GHz.
> But again, why would my code force throttling and not the toy code?
> One possible reason is my code reaches 3.5 IPC, while the toy code
> remains at 1.5 IPC (with micro-op fusion).
> 
> It's a stretch, but easy to test.
> 
> I will measure CPU cycles, to see if the increased run-time
> corresponds with a change in CPU cycles.

# dmesg | grep -i tsc
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 3292.563 MHz processor
[    0.038014] TSC deadline timer available
[    0.093396] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2f75db292c8, max_idle_ns: 440795321216 ns
[    0.178415] clocksource: Switched to clocksource tsc-early
[    1.272272] tsc: Refined TSC clocksource calibration: 3292.377 MHz
[    1.272280] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2f752b31fa5, max_idle_ns: 440795265170 ns
[    1.272305] clocksource: Switched to clocksource tsc

$ lscpu | grep -o '[a-z_]*tsc[a-z_]*'
tsc
rdtscp
constant_tsc
nonstop_tsc
tsc_deadline_timer
tsc_adjust

static void loop(int log2)
{
	int n = 1 << log2;
	struct timespec t0, t1;
	clock_gettime(CLK, &t0);
	long c0 = __rdtsc();
	for (int i = 0; i < n; ++i) my_code();
	long c1 = __rdtsc();
	clock_gettime(CLK, &t1);
	long d = (t1.tv_sec - t0.tv_sec)*1000000000L + (t1.tv_nsec - t0.tv_nsec);
	long t = d >> log2;
	long c = c1-c0;
	long f = c*1000000/d;
	printf("D=%ld C=%ld F=%ld T=%ld N=%d\n", d, c, f, t, n);
}

(Looks like RDTSCP might have been preferable to RDTSC.)

I lowered the frequency of all cores to 1.5 GHz

MIN RUNTIME: D=327944847 C=1079700683 F=3292323 T=5004 N=65536
MAX RUNTIME: D=369923901 C=1217909284 F=3292323 T=5644 N=65536

Looks like the TSC ticks at 3.3 GHz (core nominal freq) even
if I manually lower the core frequency. I need to figure out
how 'perf stat' does it. Probably reads a different MSR.

Changing /sys/devices/system/clocksource/clocksource0/current_clocksource
to hpet doesn't really change anything.

TODO: find cycle count MSR.



* Re: Unexplained variance in run-time of trivial program
       [not found]                             ` <CAMjhiJjMOQN-nWd+KP4JBBNHf20M+J2fXAuTTvowXctJgvGOcQ@mail.gmail.com>
@ 2025-09-09 20:30                               ` Marc Gonzalez
  2025-09-10  7:59                                 ` Daniel Wagner
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 20:30 UTC (permalink / raw)
  To: Leon Woestenberg
  Cc: John Ogness, linux-rt-users, Daniel Wagner, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu

On 9/9/25 21:34, Leon Woestenberg wrote:

> Basically, you are reverse-tuning a multicore CPU system,
> that is heavily optimized for time-sharing and overall
> system throughput, back to act like bare-metal single-application.
> You can get close, but you will never reach the determinism
> of, say, a Cortex-R(ealtime) core.

This ignores the fact that I've been testing 2 programs:

- P1, toy code that just calls dec 3 times in a tight loop
- P2, a few memory accesses, everything fits in L1

I then loop P1 & P2 2^16 times to compute their average run-time.

Over thousands of runs, RUNTIME_P1 is ALWAYS the same (within 30 ppm)

Over thousands of runs, RUNTIME_P2 varies by up to 12 PERCENT (!!) = 122k ppm

Since P1 runs fine without ANY jitter, the problem cannot be
something inherent to the system itself, such as
- SMM
- IRQs firing randomly
- drifting, jittery, jumping time source

The problem appears to be something inherent to P2.
(Either the whole core is slowed down perhaps by thermal throttling,
or something is causing huge pipeline stalls.)

Regards



* Re: Unexplained variance in run-time of trivial program
  2025-09-09 20:30                               ` Marc Gonzalez
@ 2025-09-10  7:59                                 ` Daniel Wagner
  2025-09-11 21:29                                   ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-10  7:59 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu

On Tue, Sep 09, 2025 at 10:30:49PM +0200, Marc Gonzalez wrote:
> On 9/9/25 21:34, Leon Woestenberg wrote:
> 
> > Basically, you are reverse-tuning a multicore CPU system,
> > that is heavily optimized for time-sharing and overall
> > system throughput, back to act like bare-metal single-application.
> > You can get close, but you will never reach the determinism
> > of, say, a Cortex-R(ealtime) core.
> 
> This ignores the fact that I've been testing 2 programs:
>
> - P1, toy code that just calls dec 3 times in a tight loop
> - P2, a few memory accesses, everything fits in L1
> 
> I then loop P1 & P2 2^16 times to compute their average run-time.
> 
> Over thousands of runs, RUNTIME_P1 is ALWAYS the same (within 30 ppm)
> 
> Over thousands of runs, RUNTIME_P2 varies by up to 12 PERCENT (!!) =
> 122k ppm

Modern CPUs are highly complex systems with their microarchitecture.
The result is non-deterministic runtime behavior. I can't tell whether
what you are measuring is within the ballpark of normal
non-deterministic runtime variance. If you don't want to read through
the CPU manufacturer's documentation, you could check whether the
Spectre and related side-channel attack papers give some clues on
timing behavior.


* Re: Unexplained variance in run-time of trivial program
  2025-09-10  7:59                                 ` Daniel Wagner
@ 2025-09-11 21:29                                   ` Marc Gonzalez
  2025-09-11 22:15                                     ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-11 21:29 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu, agner

On 9/10/25 09:59, Daniel Wagner wrote:

> On Tue, Sep 09, 2025 at 10:30:49PM +0200, Marc Gonzalez wrote:
>
>> On 9/9/25 21:34, Leon Woestenberg wrote:
>>
>>> Basically, you are reverse-tuning a multicore CPU system,
>>> that is heavily optimized for time-sharing and overall
>>> system throughput, back to act like bare-metal single-application.
>>> You can get close, but you will never reach the determinism
>>> of, say, a Cortex-R(ealtime) core.
>>
>> This ignores the fact that I've been testing 2 programs:
>>
>> - P1, toy code that just calls dec 3 times in a tight loop
>> - P2, a few memory accesses, everything fits in L1
>>
>> I then loop P1 & P2 2^16 times to compute their average run-time.
>>
>> Over thousands of runs, RUNTIME_P1 is ALWAYS the same (within 30 ppm)
>>
>> Over thousands of runs, RUNTIME_P2 varies by up to 12 PERCENT (!!) = 122k ppm
> 
> Modern CPUs are highly complex systems with their microarchitecture.
> The result is non-deterministic runtime behavior. I can't tell whether
> what you are measuring is within the ballpark of normal
> non-deterministic runtime variance. If you don't want to read through
> the CPU manufacturer's documentation, you could check whether the
> Spectre and related side-channel attack papers give some clues on
> timing behavior.

I am well aware of the complexity of modern micro-architectures ;)

I actually reported a bug in the original K7 micro-architecture
25 years ago.

And Agner Fog's microarchitecture.pdf is always on my bedside table.
https://www.agner.org/optimize/microarchitecture.pdf

In an attempt to figure out the origin of the chaos, I broke out perf.
I would say I found many more questions than answers...

Running 3.3 GHz Haswell at 1.5 GHz
https://perfmon-events.intel.com/index.html?pltfrm=haswell_server.html
https://oprofile.sourceforge.io/docs/intel-haswell-events.php


GOOD RUN

D,C,Cb,F,T,N
1311453543 4317733068 4317732972 3292326 5002 262144

        7512849605      uops_executed.core                                                    
          31445738      uops_executed.stall_cycles                                            
         164966498      uops_retired.stall_cycles                                             
         574110144      topdown-fetch-bubbles                                                 
          25679080      cycle_activity.cycles_no_execute                                      
        6808318537      inst_retired.any                                                      
           6268865      int_misc.recovery_cycles                                              
           6268818      int_misc.recovery_cycles_any                                          

       1,313036051 seconds time elapsed


BAD RUN

D,C,Cb,F,T,N
1436503721 4729437143 4729436743 3292325 5479 262144

        7647384031      uops_executed.core                                                    
         113200734      uops_executed.stall_cycles                                            
         343460280      uops_retired.stall_cycles                                             
         787367726      topdown-fetch-bubbles                                                 
         106926806      cycle_activity.cycles_no_execute                                      
        6808330091      inst_retired.any                                                      
          17369470      int_misc.recovery_cycles                                              
          17369430      int_misc.recovery_cycles_any                                          

       1,438057158 seconds time elapsed


Remarks (in no particular order)

- Run-time increased by 9.5% (125 ms / 1.311 s)

- inst_retired.any remains (nearly) constant
BUT uops_executed.core increased by 1.8%
=> Where are those extra uops coming from???
Speculatively executed uops discarded on some flush event?

- 343460280-164966498 = 178.5M more stall cycles at retire stage
at 1.5 GHz = 119 ms
This alone could explain the degraded performance

- 113200734-31445738 = 81.7M more stall cycles at execute stage
some are probably dupes with retire stalls

- Not sure what the int_misc.recovery_* events measure???
They're documented as:
"This event counts the number of cycles spent waiting for a
recovery after an event such as a processor nuke, JEClear, assist,
hle/rtm abort etc."
and
"Core cycles the allocator was stalled due to recovery from earlier
clear event for any thread running on the physical core (e.g.
misprediction or memory nuke)."
=> In my case, they're probably measuring the same thing.
Weird that the description sounds a bit different.
Need to read up on processor nuke, memory nuke, machine clear event,
JEClear, assist, HLE/RTM...
I'll throw in MACHINE_CLEARS.CYCLES in the next run.
"Cycles there was a Nuke. Account for both thread-specific and All Thread Nukes."


I would love for someone to chime in & cast some light
on my utter confusion :)

Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Unexplained variance in run-time of trivial program
  2025-09-11 21:29                                   ` Marc Gonzalez
@ 2025-09-11 22:15                                     ` Marc Gonzalez
  2025-09-12  7:44                                       ` Daniel Wagner
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-11 22:15 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu, agner

On 9/11/25 23:29, Marc Gonzalez wrote:

> - Not sure what the int_misc.recovery_* events measure???
> They're documented as:
> "This event counts the number of cycles spent waiting for a
> recovery after an event such as a processor nuke, JEClear, assist,
> hle/rtm abort etc."
> and
> "Core cycles the allocator was stalled due to recovery from earlier
> clear event for any thread running on the physical core (e.g.
> misprediction or memory nuke)."
> => In my case, they're probably measuring the same thing.
> Weird that the description sounds a bit different.
> Need to read up on processor nuke, memory nuke, machine clear event,
> JEClear, assist, HLE/RTM...

Daniel, seems you were spot on when mentioning side channel attacks.

https://www.usenix.org/system/files/sec21-ragab.pdf

I need to truly read & understand this paper.

GOOD
D,C,Cb,F,T,N
1311384211 4317505417 4317505021 3292326 5002 262144
        6808311015      inst_retired.any                                                      
        7515242772      uops_executed.core                                                    
          31190466      uops_executed.stall_cycles                                            
          25467763      cycle_activity.cycles_no_execute                                      
         165247175      uops_retired.stall_cycles                                             
           6335672      int_misc.recovery_cycles                                              
            296086      MACHINE_CLEARS.CYCLES                                                 

BAD
D,C,Cb,F,T,N
1460302963 4807793027 4807792639 3292325 5570 262144
        6808333218      inst_retired.any                                                      
        7656792683      uops_executed.core                                                    
         107194058      uops_executed.stall_cycles                                            
         102852971      cycle_activity.cycles_no_execute                                      
         341106967      uops_retired.stall_cycles                                             
          15062277      int_misc.recovery_cycles                                              
           2124362      MACHINE_CLEARS.CYCLES                                                 


Random reference
https://community.intel.com/t5/Analyzers/Regarding-machine-clears/m-p/953984



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Unexplained variance in run-time of trivial program
  2025-09-11 22:15                                     ` Marc Gonzalez
@ 2025-09-12  7:44                                       ` Daniel Wagner
  2025-09-13 10:09                                         ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-12  7:44 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu, agner

On Fri, Sep 12, 2025 at 12:15:12AM +0200, Marc Gonzalez wrote:
> On 9/11/25 23:29, Marc Gonzalez wrote:
> 
> > - Not sure what the int_misc.recovery_* events measure???
> > They're documented as:
> > "This event counts the number of cycles spent waiting for a
> > recovery after an event such as a processor nuke, JEClear, assist,
> > hle/rtm abort etc."
> > and
> > "Core cycles the allocator was stalled due to recovery from earlier
> > clear event for any thread running on the physical core (e.g.
> > misprediction or memory nuke)."
> > => In my case, they're probably measuring the same thing.
> > Weird that the description sounds a bit different.
> > Need to read up on processor nuke, memory nuke, machine clear event,
> > JEClear, assist, HLE/RTM...
> 
> Daniel, seems you were spot on when mentioning side channel attacks.

I'm sorry if I sounded patronizing; that was not my intent. After we
ruled out OS noise, the only thing left was that the runtime variance
comes from the CPU itself.

> https://www.usenix.org/system/files/sec21-ragab.pdf
> 
> I need to truly read & understand this paper.

I skimmed over it; it could explain what you are observing. The question,
obviously, is which part of the 'bad' code is confusing the CPU :)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Unexplained variance in run-time of trivial program
  2025-09-12  7:44                                       ` Daniel Wagner
@ 2025-09-13 10:09                                         ` Marc Gonzalez
  2025-09-15 16:30                                           ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-13 10:09 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu, agner

On 9/12/25 09:44, Daniel Wagner wrote:

> On Fri, Sep 12, 2025 at 12:15:12AM +0200, Marc Gonzalez wrote:
>
>> - Not sure what the int_misc.recovery_* events measure???
>> They're documented as:
>> "This event counts the number of cycles spent waiting for a
>> recovery after an event such as a processor nuke, JEClear, assist,
>> hle/rtm abort etc."
>> and
>> "Core cycles the allocator was stalled due to recovery from earlier
>> clear event for any thread running on the physical core (e.g.
>> misprediction or memory nuke)."
>> => In my case, they're probably measuring the same thing.
>> Weird that the description sounds a bit different.
>> Need to read up on processor nuke, memory nuke, machine clear event,
>> JEClear, assist, HLE/RTM...
>>
>> Daniel, seems you were spot on when mentioning side channel attacks.
> 
> I'm sorry if I sounded patronizing, that was not my intent. After we
> ruled out the OS noise the only thing left was that the runtime variance
> is from the CPU itself.

I appreciate all the insights you've shared thus far! :)
(As well as those from other participants!)

>> https://www.usenix.org/system/files/sec21-ragab.pdf
>>
>> I need to truly read & understand this paper.
> 
> Skimmed over it, it could explain what you are observing. The question
> obviously is which part of the 'bad' code is confusing the CPU :)

GOOD RUN
D,C,Cb,F,T,N
1310881180 4315846773 4315846665 3292324 5000 262144
        6808319919      inst_retired.any                                                      
        7513464787      uops_executed.core                                                    
          30496248      uops_executed.stall_cycles                                            
         164691876      uops_retired.stall_cycles                                             
           6292081      int_misc.recovery_cycles                                              
            292758      machine_clears.count                                                  
            292049      machine_clears.memory_ordering                                        

BAD RUN
D,C,Cb,F,T,N
1417213267 4665926496 4665926108 3292324 5406 262144
        6808316250      inst_retired.any                                                      
        7614270474      uops_executed.core                                                    
          85148157      uops_executed.stall_cycles                                            
         285284384      uops_retired.stall_cycles                                             
          13520285      int_misc.recovery_cycles                                              
           1536963      machine_clears.count                                                  
           1536308      machine_clears.memory_ordering                                        

Notes:

- A "good" run gets 300k machine clears.
- A  "bad" run gets 1.5M machine clears.
- All machine clears are memory ordering machine clears.
No self-modifying code, no floating point code.
Probably a few MCs from page faults.
(TODO: measure only MCs from my code, not setup code)
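
Back-of-the-envelope, the deltas above put a rough price on each extra
machine clear (a sketch, assuming ALL the extra cycles are attributable
to the extra clears, which is optimistic):

```python
# Counters from the good/bad runs above.
good_clears,   bad_clears   =    292758,    1536963
good_recovery, bad_recovery =   6292081,   13520285
good_cycles,   bad_cycles   = 4315846773, 4665926496

extra_clears   = bad_clears   - good_clears      # 1244205 extra clears
extra_recovery = bad_recovery - good_recovery    # 7228204 extra recovery cycles

# Counted recovery cycles per extra clear: tiny.
print(extra_recovery / extra_clears)             # ~5.8 cycles/clear

# Total extra cycles per extra clear: the real damage, presumably the
# refetch/re-execute after each pipeline flush, not the counted recovery.
print((bad_cycles - good_cycles) / extra_clears) # ~281 cycles/clear
```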

If I understand correctly, a memory ordering machine clear
only happens when different cores access the same memory?
I'm running single-threaded code on an otherwise idle system.
I'm thinking that Haswell might lump Memory Ordering AND
Memory Disambiguation MCs together?

Is that a reasonable assumption?

Not sure what's non-deterministic about the loads in my code.
Might be some kind of "catastrophic" chain of events where
one mispredict generates a cascade of subsequent mispredicts.

I will keep digging :)

Regards


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Unexplained variance in run-time of trivial program
  2025-09-13 10:09                                         ` Marc Gonzalez
@ 2025-09-15 16:30                                           ` Marc Gonzalez
  2025-09-16  7:50                                             ` Ahmed S. Darwish
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-15 16:30 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu, agner

On 9/13/25 12:09, Marc Gonzalez wrote:

> GOOD RUN
> D,C,Cb,F,T,N
> 1310881180 4315846773 4315846665 3292324 5000 262144
>         6808319919      inst_retired.any                                                      
>         7513464787      uops_executed.core                                                    
>           30496248      uops_executed.stall_cycles                                            
>          164691876      uops_retired.stall_cycles                                             
>            6292081      int_misc.recovery_cycles                                              
>             292758      machine_clears.count                                                  
>             292049      machine_clears.memory_ordering                                        
> 
> BAD RUN
> D,C,Cb,F,T,N
> 1417213267 4665926496 4665926108 3292324 5406 262144
>         6808316250      inst_retired.any                                                      
>         7614270474      uops_executed.core                                                    
>           85148157      uops_executed.stall_cycles                                            
>          285284384      uops_retired.stall_cycles                                             
>           13520285      int_misc.recovery_cycles                                              
>            1536963      machine_clears.count                                                  
>            1536308      machine_clears.memory_ordering                                        
> 
> Notes:
> 
> - A "good" run gets 300k machine clears.
> - A  "bad" run gets 1.5M machine clears.
> - All machine clears are memory ordering machine clears.
> No self-modifying code, no floating point code.
> Probably a few MCs from page faults.
> (TODO: measure only MCs from my code, not setup code)
> 
> If I understand correctly, a memory ordering machine clear
> only happens when different cores access the same memory?
> I'm running single-threaded code on an otherwise idle system.
> I'm thinking that Haswell might lump Memory Ordering AND
> Memory Disambiguation MCs together?
> 
> Is that a reasonable assumption?
> 
> Not sure what's non-deterministic about the loads in my code.
> Might be some kind of "catastrophic" chain of events where
> one mispredict generates a cascade of subsequent mispredicts.
> 
> I will keep digging :)

I now have a better understanding of the issue :)

To recap, I'm benchmarking a fairly trivial piece of code.
But I observe a large variance in the run-time of said code
(from N to 1.15N nanoseconds).

What's more, the variance REMAINS when I run the code
far longer (by looping the same execution).

Therefore
- Either the perturbation is periodic, and adds noise
whatever the duration of the benchmark,
- OR the perturbation only manifests in SOME RUNS.

However, some runs complete without any perturbation,
even very long ones.
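
The two hypotheses are distinguishable from per-iteration timestamps:
periodic noise inflates the tail of every run's distribution, while a
per-run condition splits the distribution of whole-run means into modes.
A sketch with synthetic numbers (BASE, the 15% penalty and the 1-in-4
hit rate are all made up for illustration):

```python
import random
random.seed(0)

BASE   = 5000   # nominal per-iteration time, ns (made up)
N_ITER = 1000   # iterations per run

def run(perturbed: bool) -> list:
    """One benchmark run: per-iteration timings, +15% if perturbed."""
    scale = 1.15 if perturbed else 1.0
    return [int(BASE * scale) + random.randint(-5, 5) for _ in range(N_ITER)]

# Hypothesis: the perturbation hits only SOME runs (here, 1 in 4).
runs  = [run(perturbed=(i % 4 == 0)) for i in range(100)]
means = sorted(sum(r) / len(r) for r in runs)

# Whole-run means split into two clearly separated modes; periodic
# noise would instead shift every run's mean by a similar amount.
print(means[0], means[-1])   # ~5000 vs ~5750
```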

Runs where perturbations are observed have a much higher
incidence of MEMORY ORDERING MEMORY NUKES. These types
of events happen when different cores share memory
(true sharing or false sharing). Since my code is purely
single-threaded, it must be false sharing. At this point,
I'm guessing that the address randomization creates
ONLY FOR SOME RUNS a condition where my program lives
at similar virtual addresses to another program active
on a different core, which causes MEMORY NUKES.

I can work around the issue:
INSTEAD OF running 1M instances of my code in 1 program,
I run 1000 programs, each running 1000 instances of my code,
and take the minimum run-time of the 1000.
(There is a bit more overhead in forking so many programs,
but I really need reproducible results.)

Now I observe less than 0.2% variance.
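
For reference, the many-short-processes trick can be sketched like this
(the child command below is a stand-in for the real benchmark binary;
each process gets a fresh address-space layout, so an unlucky layout
only pollutes its own measurement):

```python
import subprocess
import sys

# Stand-in for the real benchmark: time a small fixed workload and
# print the elapsed nanoseconds. (Illustrative, not the actual code.)
CHILD = [sys.executable, "-c",
         "import time; t = time.perf_counter_ns(); sum(range(100000)); "
         "print(time.perf_counter_ns() - t)"]

def min_of_runs(n: int) -> int:
    """Run n independent processes, keep the minimum run-time (ns)."""
    times = []
    for _ in range(n):
        out = subprocess.run(CHILD, capture_output=True, text=True, check=True)
        times.append(int(out.stdout))
    return min(times)

print(min_of_runs(20))   # minimum over 20 fresh address-space layouts
```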

I will also try disabling address randomization,
to test my hypothesis.

Has anyone ever dealt with this issue?

Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Unexplained variance in run-time of trivial program
  2025-09-15 16:30                                           ` Marc Gonzalez
@ 2025-09-16  7:50                                             ` Ahmed S. Darwish
  2025-09-25 11:38                                               ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Ahmed S. Darwish @ 2025-09-16  7:50 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Daniel Wagner, Leon Woestenberg, John Ogness, linux-rt-users,
	Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
	Clark Williams, Pavel Machek, Luis Goncalves, John McCalpin,
	Frederic Weisbecker, Ingo Molnar, Masami Hiramatsu, agner

On Mon, 15 Sep 2025, Marc Gonzalez wrote:
>
> These types of events happen when different cores share memory (true
> sharing or false sharing). Since my code is purely single-threaded, it
> must be false sharing.
>

I don't know if you're just having "fun" with the CPU or optimizing a
real-world use-case...

With that in mind, maybe the following helps: understand the topology of
your x86 CPU and turn off the cores that might be causing a "noisy
neighbor" problem.

If you were able to track it down, and your CPU supports class-of-service
resource monitoring and allocation, then check
Documentation/filesystems/resctrl.rst.

Good luck,

--
Ahmed S. Darwish
Linutronix GmbH

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Unexplained variance in run-time of trivial program
  2025-09-16  7:50                                             ` Ahmed S. Darwish
@ 2025-09-25 11:38                                               ` Marc Gonzalez
  0 siblings, 0 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-25 11:38 UTC (permalink / raw)
  To: Ahmed S. Darwish
  Cc: Daniel Wagner, Leon Woestenberg, John Ogness, linux-rt-users,
	Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
	Clark Williams, Pavel Machek, Luis Goncalves, John McCalpin,
	Frederic Weisbecker, Ingo Molnar, Masami Hiramatsu, agner

On 16/09/2025 09:50, Ahmed S. Darwish wrote:

> On Mon, 15 Sep 2025, Marc Gonzalez wrote:
> 
>> These types of events happen when different cores share memory (true
>> sharing or false sharing). Since my code is purely single-threaded, it
>> must be false sharing.
> 
> I don't know if you're just having "fun" with the CPU or optimizing a
> real-world use-case...

I'm writing a sudoku solver in my free time.
So, having fun on a real-world use-case :)

> With that in mind, maybe the following helps: understand the topology of
> your x86 CPU and turn off the cores that might be causing a "noisy
> neighbor" problem.

Hmmm... I would assume a fairly simple topology.
https://www.techpowerup.com/cpu-specs/core-i5-4590.c1750
https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Desktop_processors

This CPU does not support SMT, so L1 & L2 are exclusive to each core.
L3 is shared between all cores IIUC.

> If you were able to track it down, and your CPU supports class-of-service
> resource monitoring and allocation, then check
> Documentation/filesystems/resctrl.rst.

I have not yet found a good way to minimize the noise from
memory ordering memory nukes. (For the time being, I'm running
the benchmark hundreds of times, and taking the minimum run time.)

https://docs.kernel.org/filesystems/resctrl.html

$ grep 'l2\|l3\|llc\|mb' /proc/cpuinfo
/* returns nothing */

Looks like Haswell didn't support those nice features :(

Thanks for the suggestion.

I might update to Zen 5 or Zen 6 at some point.

Regards


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-09-25 11:38 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-04 23:45 Large(ish) variance induced by SCHED_FIFO Marc Gonzalez
2025-09-05  8:25 ` Daniel Wagner
2025-09-05 13:23   ` Marc Gonzalez
2025-09-05 14:03     ` John Ogness
2025-09-05 15:34       ` Marc Gonzalez
2025-09-05 16:07         ` John Ogness
2025-09-05 17:40           ` Marc Gonzalez
2025-09-08  9:36             ` John Ogness
2025-09-08 15:42               ` Marc Gonzalez
2025-09-08 16:02                 ` Daniel Wagner
2025-09-08 18:40                   ` Marc Gonzalez
2025-09-09 11:08                     ` Unexplained variance in run-time of trivial program Marc Gonzalez
2025-09-09 11:21                       ` Daniel Wagner
2025-09-09 12:42                         ` Marc Gonzalez
2025-09-09 14:23                           ` Steven Rostedt
2025-09-09 12:34                       ` John Ogness
2025-09-09 14:08                         ` Marc Gonzalez
2025-09-09 18:33                           ` Marc Gonzalez
     [not found]                             ` <CAMjhiJjMOQN-nWd+KP4JBBNHf20M+J2fXAuTTvowXctJgvGOcQ@mail.gmail.com>
2025-09-09 20:30                               ` Marc Gonzalez
2025-09-10  7:59                                 ` Daniel Wagner
2025-09-11 21:29                                   ` Marc Gonzalez
2025-09-11 22:15                                     ` Marc Gonzalez
2025-09-12  7:44                                       ` Daniel Wagner
2025-09-13 10:09                                         ` Marc Gonzalez
2025-09-15 16:30                                           ` Marc Gonzalez
2025-09-16  7:50                                             ` Ahmed S. Darwish
2025-09-25 11:38                                               ` Marc Gonzalez
2025-09-08 16:52                 ` [EXT] Re: Large(ish) variance induced by SCHED_FIFO Rui Sousa
2025-09-08 17:03                   ` Marc Gonzalez
     [not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
2025-09-05 21:05   ` Marc Gonzalez

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox