* Large(ish) variance induced by SCHED_FIFO
@ 2025-09-04 23:45 Marc Gonzalez
2025-09-05 8:25 ` Daniel Wagner
[not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
0 siblings, 2 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-04 23:45 UTC (permalink / raw)
To: linux-rt-users
Cc: Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
Daniel Wagner, Clark Williams, Pavel Machek, Luis Goncalves,
John McCalpin
Hello everyone,
I am currently *NOT* running a PREEMPT_RT kernel.
I am trying to get reproducible / deterministic / low variance results
using a vendor kernel (6.8.0-79-generic #79-Ubuntu SMP PREEMPT_DYNAMIC).
I have removed all sources of noise / variance I could think of:
- SMT is disabled in BIOS
lscpu reports "Thread(s) per core: 1"
cat /sys/devices/system/cpu/smt/active returns 0
- Kernel boots in single user mode
No GUI, just a text terminal
Nothing running other than system daemons & bash
- Kernel runs with threadirqs command-line parameter
ISRs run as SCHED_FIFO processes with prio 50
$ ps -e -o pid,cls,pri,args --sort pri
PID CLS PRI COMMAND
18 FF 139 [migration/0]
23 FF 139 [migration/1]
29 FF 139 [migration/2]
35 FF 139 [migration/3]
19 FF 90 [idle_inject/0]
22 FF 90 [idle_inject/1]
28 FF 90 [idle_inject/2]
34 FF 90 [idle_inject/3]
54 FF 90 [irq/9-acpi]
62 FF 90 [watchdogd]
68 FF 90 [irq/24-PCIe PME]
69 FF 90 [irq/24-pciehp]
70 FF 90 [irq/24-s-pciehp]
71 FF 90 [irq/25-PCIe PME]
74 FF 90 [irq/8-rtc0]
75 FF 90 [irq/16-ehci_hcd:usb1]
82 FF 90 [irq/23-ehci_hcd:usb2]
162 FF 90 [irq/26-xhci_hcd]
164 FF 90 [irq/28-ahci[0000:00:1f.2]]
377 FF 90 [irq/18-i801_smbus]
384 FF 90 [irq/29-mei_me]
411 FF 90 [card1-crtc0]
412 FF 90 [card1-crtc1]
413 FF 90 [irq/30-i915]
463 FF 90 [irq/31-snd_hda_intel:card0]
466 FF 90 [irq/32-snd_hda_intel:card1]
335 FF 41 [psimon]
1076 FF 41 [psimon]
- Frequency of CPUs is "pinned" at 3 GHz
No "turbo boost" (boost frequency varies with temperature).
Pinned slightly below nominal frequency to avoid thermal throttling.
(Intel Core i5-4590 CPU @ 3.30GHz)
for I in 0 1 2 3; do
echo userspace > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor
echo 3000000 > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed
done
- Program is pinned to CPU #1
Avoid any cache ping pong.
Avoid CPU #0 in case it handles more interrupts.
$ taskset -c 1 ./a.out
And the last thing I wanted to do is:
- Run benchmark as SCHED_FIFO 99 to make sure NOTHING could interrupt it
bench is strictly computational:
no system calls, no I/O, no sleeping, just number crunching.
However, I noticed that the results have MORE variance when
the program runs as SCHED_FIFO than when it runs as SCHED_OTHER.
Here is my testcase:
Toy benchmark:
# define N (1 << 30)
int main(void)
{
for (volatile int i = 0; i < N; ++i);
return 0;
}
Compiled with gcc -O3 toy.c
Then I run this script as root:
#!/bin/bash
CMD="taskset -c 1 ./a.out"
if [ "$1" = "fifo" ]; then CMD="chrt -f 99 $CMD"; fi
for I in $(seq 1 30); do
T0=$(date "+%s.%N")
$CMD
T1=$(date "+%s.%N")
echo "$T1-$T0" | bc -l
done
Once without fifo (SCHED_OTHER), once with fifo (SCHED_FIFO).
(Full results at the end of the message)
For SCHED_OTHER:
MIN_RUNTIME=2.158045784 seconds
MAX_RUNTIME=2.158515466 seconds
i.e. all runs are within +/- 109 ppm (MAX-MIN)/(MAX+MIN)
For SCHED_FIFO:
MIN_RUNTIME=2.247394425 seconds
MAX_RUNTIME=2.305229091 seconds
i.e. all runs are within +/- 12700 ppm
i.e. 100 times WORSE than SCHED_OTHER... :(
HOWEVER, if we sort SCHED_FIFO results, we clearly see 3 "islands":
first run-time is the minimum outlier = 2.247394425
21 results in [2.253636818, 2.254797674] => 257 ppm
8 results in [2.304188238, 2.305229091] => 226 ppm
Still not as good as SCHED_OTHER, but nowhere near as bad.
Questions I have no answers for at this time:
What could explain this behavior?
Why several "islands" of results when running as SCHED_FIFO 99?
Why is the variance within each island still greater than under SCHED_OTHER?
Thanks for reading me this far!! :)
Regards
Raw results for reference:
SCHED_OTHER
2.158171245
2.158127633
2.158129497
2.158109551
2.158089224
2.158515466
2.158083084
2.158127232
2.158282178
2.158135800
2.158123891
2.158045784
2.158174865
2.158141835
2.158184503
2.158118620
2.158096101
2.158150349
2.158150868
2.158076309
2.158179210
2.158134907
2.158202927
2.158108137
2.158182418
2.158088121
2.158095718
2.158165650
2.158134169
2.158119635
SCHED_FIFO
2.247394425
2.254726778
2.305229091
2.253719228
2.254658938
2.254744883
2.304188238
2.254670366
2.254683699
2.254683326
2.304284105
2.253755218
2.254756793
2.304288106
2.254756981
2.253636818
2.254797674
2.304240533
2.254630310
2.254659655
2.254696007
2.304234345
2.254739714
2.254766850
2.254620088
2.304224451
2.254608929
2.254732058
2.304227047
2.254665694
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Large(ish) variance induced by SCHED_FIFO
2025-09-04 23:45 Large(ish) variance induced by SCHED_FIFO Marc Gonzalez
@ 2025-09-05 8:25 ` Daniel Wagner
2025-09-05 13:23 ` Marc Gonzalez
[not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
1 sibling, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-05 8:25 UTC (permalink / raw)
To: Marc Gonzalez
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
On Fri, Sep 05, 2025 at 01:45:31AM +0200, Marc Gonzalez wrote:
> Then I run this script as root:
> #!/bin/bash
>
> CMD="taskset -c 1 ./a.out"
> if [ "$1" = "fifo" ]; then CMD="chrt -f 99 $CMD"; fi
>
> for I in $(seq 1 30); do
> T0=$(date "+%s.%N")
> $CMD
> T1=$(date "+%s.%N")
> echo "$T1-$T0" | bc -l
> done
This setup forks processes in order to get timestamps. This
gives you uncontrollable variance in the measurement. Use clock_gettime
inside your test program.
Also you might want to reduce the prio to 98.
* Re: Large(ish) variance induced by SCHED_FIFO
2025-09-05 8:25 ` Daniel Wagner
@ 2025-09-05 13:23 ` Marc Gonzalez
2025-09-05 14:03 ` John Ogness
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 13:23 UTC (permalink / raw)
To: Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
Hello Daniel :)
On 05/09/2025 10:25, Daniel Wagner wrote:
> On Fri, Sep 05, 2025 at 01:45:31AM +0200, Marc Gonzalez wrote:
>
>> Then I run this script as root:
>> #!/bin/bash
>>
>> CMD="taskset -c 1 ./a.out"
>> if [ "$1" = "fifo" ]; then CMD="chrt -f 99 $CMD"; fi
>>
>> for I in $(seq 1 30); do
>> T0=$(date "+%s.%N")
>> $CMD
>> T1=$(date "+%s.%N")
>> echo "$T1-$T0" | bc -l
>> done
>
> This setup is forking processes in order to get timestamps. This
> gives you uncontrollable variance in the measurement. Use clock_gettime
> inside your test program.
I had already tested your suggestion, and observed it did not
make any difference. I will test again nonetheless.
(I used CLOCK_MONOTONIC_RAW.)
NB: if the large variance were induced by getting timestamps outside
the program, then SCHED_OTHER runs should be similarly impacted
(however, they are not).
> Also you might want to reduce the prio to 98.
May I ask why?
I'm guessing it's bad(TM) to potentially block the migration task?
I will lower the priority to 90.
Here is my current test code:
#include <stdio.h>
#include <time.h>
#define N (1 << 30)
#define G (1000*1000*1000L)
int main(void)
{
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
for (volatile int i = 0; i < N; ++i);
clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
long diff = (t1.tv_sec*G + t1.tv_nsec) - (t0.tv_sec*G + t0.tv_nsec);
printf("%ld\n", diff);
return 0;
}
associated script:
#!/bin/bash
CMD="taskset -c 1 ./a.out"
if [ "$1" = "fifo" ]; then CMD="chrt -f 90 $CMD"; fi
for I in $(seq 1 30); do $CMD; done
The results are the same as before:
SCHED_OTHER
MIN=2156085411 MAX=2156133420
All within 11 ppm of the average (EXCELLENT RESULTS)
SCHED_FIFO
MIN=2205859200 (first run again, how strange)
MAX=2304868902
All within 21950 ppm of the average (TERRIBLE RESULTS)
21950 ppm = 2.2%
Looking at the distribution of SCHED_FIFO again, I notice a pattern:
first run-time is the minimum outlier = 2.20 seconds
21 (again!!) results around 2.25 seconds
8 (again!!) results around 2.30 seconds
BRAIN DUMP:
"First" run is meaningless.
First since what? Previous batch of runs?
In that case, sleeping between runs should work around the issue?
It looks like there is some kind of 50 millisecond quantum
involved somewhere somehow.
$ grep HZ /boot/config-6.8.0-79-generic
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
(Not sure what it means to have CONFIG_NO_HZ_FULL && CONFIG_HZ_1000)
This is driving me crazy.
I'm all ears if anyone has other suggestions.
Regards
Raw results for reference:
SCHED_OTHER
2156120840
2156110092
2156111812
2156117629
2156111235
2156104159
2156109850
2156120644
2156100701
2156085456
2156133420
2156130993
2156087623
2156120236
2156114137
2156099823
2156122243
2156095888
2156114046
2156102910
2156120532
2156102464
2156100703
2156125228
2156119672
2156110619
2156131331
2156085411
2156098780
2156121458
SCHED_FIFO
2205859200
2254828687
2254866765
2304868902
2254849969
2253908545
2254811821
2304845457
2253814019
2254834900
2304858026
2254845905
2253835350
2254815775
2304860984
2253824572
2254816525
2254800880
2303875283
2254826189
2254813961
2304850479
2254832102
2253816865
2254820633
2304847552
2253841768
2254807812
2254810662
2303848123
* Re: Large(ish) variance induced by SCHED_FIFO
2025-09-05 13:23 ` Marc Gonzalez
@ 2025-09-05 14:03 ` John Ogness
2025-09-05 15:34 ` Marc Gonzalez
0 siblings, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-05 14:03 UTC (permalink / raw)
To: Marc Gonzalez, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
On 2025-09-05, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
> (I used CLOCK_MONOTONIC_RAW.)
Not really relevant to your issue, but you should always use
CLOCK_MONOTONIC.
> It looks like there is some kind of 50 millisecond quantum
> involved somewhere somehow.
Did you forget to disable RT throttling?
$ sudo sysctl -w kernel.sched_rt_runtime_us=-1
If you check the kernel logs, you would see a message about throttling:
"sched: RT throttling activated"
John Ogness
* Re: Large(ish) variance induced by SCHED_FIFO
2025-09-05 14:03 ` John Ogness
@ 2025-09-05 15:34 ` Marc Gonzalez
2025-09-05 16:07 ` John Ogness
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 15:34 UTC (permalink / raw)
To: John Ogness, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
On 05/09/2025 16:03, John Ogness wrote:
> On 2025-09-05, Marc Gonzalez wrote:
>
>> (I used CLOCK_MONOTONIC_RAW.)
>
> Not really relevant to your issue, but you should always use CLOCK_MONOTONIC.
Could you provide some insight as to why CLOCK_MONOTONIC is preferable?
https://www.man7.org/linux/man-pages/man3/clock_gettime.3.html
CLOCK_MONOTONIC
A non-settable system-wide clock that represents monotonic time since
(as described by POSIX) "some unspecified point in the past". On Linux,
that point corresponds to the number of seconds that the system has been
running since it was booted.
The CLOCK_MONOTONIC clock is not affected by discontinuous jumps in the
system time (e.g., if the system administrator manually changes the clock),
but is affected by the incremental adjustments performed by adjtime(3) and NTP.
This clock does not count time that the system is suspended.
All CLOCK_MONOTONIC variants guarantee that the time returned by
consecutive calls will not go backwards, but successive calls may
(depending on the architecture) return identical (not-increased)
time values.
CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time
that is not subject to NTP adjustments or the incremental adjustments performed
by adjtime(3). This clock does not count time that the system is suspended.
CLOCK_MONOTONIC_RAW seemed less susceptible to "interference" :)
>> It looks like there is some kind of 50 millisecond quantum
>> involved somewhere somehow.
>
> Did you forget to disable RT throttling?
>
> $ sudo sysctl -w kernel.sched_rt_runtime_us=-1
>
> If you check the kernel logs, you would see a message about throttling:
>
> "sched: RT throttling activated"
YES!!!
I didn't think to look at dmesg :(
And I had looked at sched-related proc nodes...
$ grep "" /proc/sys/kernel/sched_*
/proc/sys/kernel/sched_autogroup_enabled:1
/proc/sys/kernel/sched_cfs_bandwidth_slice_us:5000
/proc/sys/kernel/sched_deadline_period_max_us:4194304
/proc/sys/kernel/sched_deadline_period_min_us:100
/proc/sys/kernel/sched_rr_timeslice_ms:100
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000
/proc/sys/kernel/sched_schedstats:0
/proc/sys/kernel/sched_util_clamp_max:1024
/proc/sys/kernel/sched_util_clamp_min:1024
/proc/sys/kernel/sched_util_clamp_min_rt_default:1024
... but my brain didn't register the obvious clue,
although it was right there: 1000000 - 950000 = 50 ms !!
Now everything works as expected, even in multi-user mode:
$ sudo ./run_benchmark.sh fifo
2155899243
2155930350
2155925344
2155545153
2155520262
2155520647
2155531815
2155586328
2155501088
2155887042
2155937889
2155990740
2155939605
2155509468
2155898441
2155615316
2155498933
2155968150
2155493295
2155496287
2155949449
2155488499
2155932900
2155498135
2155531294
2156047123
2156035893
2155926392
2155917489
2155482792
EXCEPT... that my mouse is sluggish as molasses when the script
runs on CPU1 (but not when it runs on other cores).
I'm probably starving an important ISR :)
[migration/0] pid 18's current affinity list: 0
[migration/1] pid 23's current affinity list: 1
[migration/2] pid 29's current affinity list: 2
[migration/3] pid 35's current affinity list: 3
[idle_inject/0] pid 19's current affinity list: 0
[idle_inject/1] pid 22's current affinity list: 1
[idle_inject/2] pid 28's current affinity list: 2
[idle_inject/3] pid 34's current affinity list: 3
[irq/9-acpi] pid 54's current affinity list: 1
[watchdogd] pid 62's current affinity list: 0-3
[irq/24-PCIe PME] pid 68's current affinity list: 2
[irq/24-pciehp] pid 69's current affinity list: 2
[irq/24-s-pciehp] pid 70's current affinity list: 2
[irq/25-PCIe PME] pid 71's current affinity list: 3
[irq/8-rtc0] pid 74's current affinity list: 0
[irq/23-ehci_hcd:usb1] pid 75's current affinity list: 1
[irq/16-ehci_hcd:usb2] pid 81's current affinity list: 2
[irq/26-xhci_hcd] pid 159's current affinity list: 1
[irq/27-ahci[0000:00:1f.2]] pid 162's current affinity list: 1
[irq/18-i801_smbus] pid 388's current affinity list: 1
[irq/29-mei_me] pid 411's current affinity list: 2
[card1-crtc0] pid 461's current affinity list: 0-3
[card1-crtc1] pid 462's current affinity list: 0-3
[irq/30-i915] pid 463's current affinity list: 3
[irq/31-snd_hda_intel:card0] pid 514's current affinity list: 0
[irq/32-snd_hda_intel:card1] pid 518's current affinity list: 1
[psimon] pid 334's current affinity list: 0-3
[psimon] pid 1075's current affinity list: 0-3
Most of the USB ISRs run on core 1, no wonder my USB mouse feels limp :)
Thanks Daniel & John for pointing me in the right direction!
Regards
* Re: Large(ish) variance induced by SCHED_FIFO
2025-09-05 15:34 ` Marc Gonzalez
@ 2025-09-05 16:07 ` John Ogness
2025-09-05 17:40 ` Marc Gonzalez
0 siblings, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-05 16:07 UTC (permalink / raw)
To: Marc Gonzalez, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
On 2025-09-05, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
>>> (I used CLOCK_MONOTONIC_RAW.)
>>
>> Not really relevant to your issue, but you should always use CLOCK_MONOTONIC.
>
> Could you provide some insight as to why CLOCK_MONOTONIC is
> preferable?
CLOCK_MONOTONIC is tracking in time units as defined by
humans. CLOCK_MONOTONIC_RAW is tracking in time units defined by some
hardware.
John
* Re: Large(ish) variance induced by SCHED_FIFO
2025-09-05 16:07 ` John Ogness
@ 2025-09-05 17:40 ` Marc Gonzalez
2025-09-08 9:36 ` John Ogness
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 17:40 UTC (permalink / raw)
To: John Ogness, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
On 05/09/2025 18:07, John Ogness wrote:
> On 2025-09-05, Marc Gonzalez wrote:
>
>> Could you provide some insight as to why CLOCK_MONOTONIC is preferable?
>
> CLOCK_MONOTONIC is tracking in time units as defined by humans.
> CLOCK_MONOTONIC_RAW is tracking in time units defined by some hardware.
I'm not sure I understand what that implies.
Eventually, clock_gettime stores whatever it measured in struct timespec,
with tv_sec & tv_nsec, right?
Are you saying that if I have 2 identical timespecs,
one from CLOCK_MONOTONIC, one from CLOCK_MONOTONIC_RAW,
they might not measure the same time interval?
Are you perhaps saying that NTP can (ever so slightly) tweak
a clock to bring its frequency closer to "reality", but if
CLOCK_MONOTONIC_RAW ignores such tweaks, it might count time
slightly too fast or too slow?
Regards
* Re: Large(ish) variance induced by SCHED_FIFO
[not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
@ 2025-09-05 21:05 ` Marc Gonzalez
0 siblings, 0 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 21:05 UTC (permalink / raw)
To: Leon Woestenberg
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
Hello Leon :)
On 05/09/2025 18:18, Leon Woestenberg wrote:
> On Fri, 5 Sep 2025 at 01:45, Marc Gonzalez wrote:
>
>> I am trying to get reproducible / deterministic / low variance results
>> using a vendor kernel (6.8.0-79-generic #79-Ubuntu SMP PREEMPT_DYNAMIC).
>
> Is that your end goal or is your question maybe an XY question?
> If so please describe X.
I need a rudimentary (but accurate) benchmark framework,
to measure the impact of micro-optimizations on codes.
These optimizations are expected to make programs 0-3% faster,
so the framework should have the lowest variance possible,
otherwise I can't discern the result from the noise.
> Or just include the measurement inside the program using CLOCK_MONOTONIC.
Daniel suggested that as well, but it was not the source of the problem.
John pointed out the culprit:
/proc/sys/kernel/sched_rt_runtime_us
https://www.kernel.org/doc/html/latest/scheduler/sched-rt-group.html
Regards
* Re: Large(ish) variance induced by SCHED_FIFO
2025-09-05 17:40 ` Marc Gonzalez
@ 2025-09-08 9:36 ` John Ogness
2025-09-08 15:42 ` Marc Gonzalez
0 siblings, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-08 9:36 UTC (permalink / raw)
To: Marc Gonzalez, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
On 2025-09-05, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
>>> Could you provide some insight as to why CLOCK_MONOTONIC is preferable?
>>
>> CLOCK_MONOTONIC is tracking in time units as defined by humans.
>> CLOCK_MONOTONIC_RAW is tracking in time units defined by some hardware.
>
> I'm not sure I understand what that implies.
>
> Eventually, clock_gettime stores whatever it measured in struct timespec,
> with tv_sec & tv_nsec, right?
I would phrase it differently, but generally speaking, correct.
> Are you saying that if I have 2 identical timespecs,
> one from CLOCK_MONOTONIC, one from CLOCK_MONOTONIC_RAW,
> they might not measure the same time interval?
I assume you mean 2 _sets_ of timespecs (so that you have 2 intervals to
compare). Correct.
> Are you perhaps saying that NTP can (ever so slightly) tweak
> a clock to bring its frequency closer to "reality", but if
> CLOCK_MONOTONIC_RAW ignores such tweaks, it might count time
> slightly too fast or too slow?
Correct.
This is also valid for other clocks. For example, the kernel logs
(printk) and ftrace default to the CPU local clock. These timestamps
cannot be compared to CLOCK_MONOTONIC timestamps... just as
CLOCK_MONOTONIC_RAW timestamps cannot be compared to CLOCK_MONOTONIC
timestamps.
There are still reasons why CLOCK_MONOTONIC_RAW might be
interesting. For example, if you want a very stable timesource to
compare intervals, but do not care so much about the real world time
lengths of those intervals (i.e. where is the greatest latency vs. what
is the value of the greatest latency). Although even here, I doubt
CLOCK_MONOTONIC_RAW has a practical advantage over CLOCK_MONOTONIC.
John
* Re: Large(ish) variance induced by SCHED_FIFO
2025-09-08 9:36 ` John Ogness
@ 2025-09-08 15:42 ` Marc Gonzalez
2025-09-08 16:02 ` Daniel Wagner
2025-09-08 16:52 ` [EXT] Re: Large(ish) variance induced by SCHED_FIFO Rui Sousa
0 siblings, 2 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-08 15:42 UTC (permalink / raw)
To: John Ogness, Leon Woestenberg, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
On 08/09/2025 11:36, John Ogness wrote:
> There are still reasons why CLOCK_MONOTONIC_RAW might be
> interesting. For example, if you want a very stable time source to
> compare intervals, but do not care so much about the real world time
> lengths of those intervals (i.e. where is the greatest latency vs. what
> is the value of the greatest latency). Although even here, I doubt
> CLOCK_MONOTONIC_RAW has a practical advantage over CLOCK_MONOTONIC.
In fact, I'm just trying to compare the run-time of 2 minor
variations of the same program (testing micro-optimizations).
Absolute run-time is not really important, what I really want
to know is: does v2 run faster or slower than v1?
This is the framework I'm using at this point:
#include <stdio.h>
#include <time.h>
extern void my_code(void);
static long loop(int log2)
{
int n = 1 << log2;
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int i = 0; i < n; ++i) my_code();
clock_gettime(CLOCK_MONOTONIC, &t1);
long d = (t1.tv_sec - t0.tv_sec)*1000000000L + (t1.tv_nsec - t0.tv_nsec);
long t = d >> log2;
return t;
}
int main(void)
{
long t, min = loop(4);
for (int i = 0; i < 20; ++i)
if ((t = loop(8)) < min) min = t;
printf("MIN=%ld\n", min);
return 0;
}
Basically:
- warm up the caches
- run my_code() 256 times && compute average run-time
- repeat 20 times to find MINIMUM average run-time
When my_code() is a trivial computational loop such as:
mov $(1<<12), %eax
1: dec %ecx
dec %ecx
dec %eax
jnz 1b
ret
Then running the benchmark 1000 times returns the same value 1000 times:
MIN=2737
Obviously, the program I'm working on is a bit more complex, but barely:
- no system calls, no library calls
- just simple bit twiddling
- tiny code, tiny data structures, everything fits in L1
$ size a.out
text data bss dec hex filename
8549 632 1072 10253 280d a.out
When I run the benchmark 1000 times, there are some large outliers:
MIN_MIN=2502
MAX_MIN=2774
NOTE: 95% of the results are below 2536.
But the top 1% (worst 10) are really bad (2646-2774)
How to get repeatable results?
Random outliers up to 10% slower break the ability to measure
the impact of micro-optimizations expected to provide 0-3% improvements :(
For reference, the script launching the benchmark does:
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
for I in 0 1 2 3; do echo userspace > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor; done
sleep 0.25
for I in 0 1 2 3; do echo 3000000 > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed; done
sleep 0.25
for I in $(seq 1 1000); do
chrt -f 99 taskset -c 2 ./a.out < $1
done
for I in 0 1 2 3; do
echo schedutil > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor
done
echo 950000 > /proc/sys/kernel/sched_rt_runtime_us
I've run out of ideas to identify other sources of variance.
(I ran everything in single user mode with nothing else running.)
Perhaps with perf I could identify the source of stalls or bubbles?
Hoping someone can point me in the right direction.
Regards
* Re: Large(ish) variance induced by SCHED_FIFO
2025-09-08 15:42 ` Marc Gonzalez
@ 2025-09-08 16:02 ` Daniel Wagner
2025-09-08 18:40 ` Marc Gonzalez
2025-09-08 16:52 ` [EXT] Re: Large(ish) variance induced by SCHED_FIFO Rui Sousa
1 sibling, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-08 16:02 UTC (permalink / raw)
To: Marc Gonzalez
Cc: John Ogness, Leon Woestenberg, linux-rt-users, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Daniel Wagner,
Clark Williams, Pavel Machek, Luis Goncalves, John McCalpin
On Mon, Sep 08, 2025 at 05:42:58PM +0200, Marc Gonzalez wrote:
> I've run out of ideas to identify other sources of variance.
> (I ran everything in single user mode with nothing else running.)
> Perhaps with perf I could identify the source of stalls or bubbles?
>
> Hoping someone can point me in the right direction.
There are a few guides out there explaining the common pitfalls. I
suppose you have already found them. Just in case you missed Frederic's
excellent blog on CPU isolation:
https://www.suse.com/c/cpu-isolation-introduction-part-1/
* RE: [EXT] Re: Large(ish) variance induced by SCHED_FIFO
2025-09-08 15:42 ` Marc Gonzalez
2025-09-08 16:02 ` Daniel Wagner
@ 2025-09-08 16:52 ` Rui Sousa
2025-09-08 17:03 ` Marc Gonzalez
1 sibling, 1 reply; 30+ messages in thread
From: Rui Sousa @ 2025-09-08 16:52 UTC (permalink / raw)
To: Marc Gonzalez, John Ogness, Leon Woestenberg, Daniel Wagner
Cc: linux-rt-users@vger.kernel.org, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
Hi Marc,
Just replace your last setup script line with:
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
?
You are running a realtime task for a long while, but you have rt throttling enabled. At some point it will trigger, and you have no control over when. I believe you are seeing regular tasks
being scheduled in the middle of the benchmark measurement.
Disabling throttling may block your entire system while the benchmark is running, so lower the priority a bit and use affinity to land the benchmark process in a core that is not usually used?
Br
Rui
> -----Original Message-----
> From: Marc Gonzalez <marc.w.gonzalez@free.fr>
> Sent: Monday, September 8, 2025 17:43
> To: John Ogness <john.ogness@linutronix.de>; Leon Woestenberg
> <leon@sidebranch.com>; Daniel Wagner <dwagner@suse.de>
> Cc: linux-rt-users@vger.kernel.org; Steven Rostedt <rostedt@goodmis.org>;
> Thomas Gleixner <tglx@linutronix.de>; Sebastian Andrzej Siewior
> <bigeasy@linutronix.de>; Daniel Wagner <daniel.wagner@suse.com>; Clark
> Williams <williams@redhat.com>; Pavel Machek <pavel@denx.de>; Luis
> Goncalves <lgoncalv@redhat.com>; John McCalpin <mccalpin@tacc.utexas.edu>
> Subject: [EXT] Re: Large(ish) variance induced by SCHED_FIFO
>
>
>
> On 08/09/2025 11:36, John Ogness wrote:
>
> > There are still reasons why CLOCK_MONOTONIC_RAW might be
> > interesting. For example, if you want a very stable time source to
> > compare intervals, but do not care so much about the real world time
> > lengths of those intervals (i.e. where is the greatest latency vs. what
> > is the value of the greatest latency). Although even here, I doubt
> > CLOCK_MONOTONIC_RAW has a practical advantage over
> CLOCK_MONOTONIC.
>
> In fact, I'm just trying to compare the run-time of 2 minor
> variations of the same program (testing micro-optimizations).
>
> Absolute run-time is not really important, what I really want
> to know is: does v2 run faster or slower than v1?
>
> This is the framework I'm using at this point:
>
> #include <stdio.h>
> #include <time.h>
> extern void my_code(void);
>
> static long loop(int log2)
> {
> int n = 1 << log2;
> struct timespec t0, t1;
> clock_gettime(CLOCK_MONOTONIC, &t0);
> for (int i = 0; i < n; ++i) my_code();
> clock_gettime(CLOCK_MONOTONIC, &t1);
> long d = (t1.tv_sec - t0.tv_sec)*1000000000L + (t1.tv_nsec - t0.tv_nsec);
> long t = d >> log2;
> return t;
> }
>
> int main(void)
> {
> long t, min = loop(4);
> for (int i = 0; i < 20; ++i)
> if ((t = loop(8)) < min) min = t;
> printf("MIN=%ld\n", min);
> return 0;
> }
>
> Basically:
> - warm up the caches
> - run my_code() 256 times && compute average run-time
> - repeat 20 times to find MINIMUM average run-time
>
> When my_code() is a trivial computational loop such as:
>
> mov $(1<<12), %eax
> 1: dec %ecx
> dec %ecx
> dec %eax
> jnz 1b
> ret
>
> Then running the benchmark 1000 times returns the same value 1000 times:
> MIN=2737
>
>
> Obviously, the program I'm working on is a bit more complex, but barely:
> - no system calls, no library calls
> - just simple bit twiddling
> - tiny code, tiny data structures, everything fits in L1
> $ size a.out
> text data bss dec hex filename
> 8549 632 1072 10253 280d a.out
>
> When I run the benchmark 1000 times, there are some large outliers:
> MIN_MIN=2502
> MAX_MIN=2774
>
> NOTE: 95% of the results are below 2536.
> But the top 1% (worst 10) are really bad (2646-2774)
>
> How to get repeatable results?
>
> Random 10% outliers break the ability to measure the impact
> of micro-optimizations expected to provide 0-3% improvements :(
>
> For reference, the script launching the benchmark does:
>
> echo -1 > /proc/sys/kernel/sched_rt_runtime_us
> for I in 0 1 2 3; do echo userspace >
> /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor; done
> sleep 0.25
> for I in 0 1 2 3; do echo 3000000 >
> /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed; done
> sleep 0.25
>
> for I in $(seq 1 1000); do
> chrt -f 99 taskset -c 2 ./a.out < $1
> done
>
> for I in 0 1 2 3; do
> echo schedutil > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor
> done
> echo 950000 > /proc/sys/kernel/sched_rt_runtime_us
>
>
> I've run out of ideas to identify other sources of variance.
> (I ran everything in single user mode with nothing else running.)
> Perhaps with perf I could identify the source of stalls or bubbles?
>
> Hoping someone can point me in the right direction.
>
> Regards
>
>
* Re: [EXT] Re: Large(ish) variance induced by SCHED_FIFO
2025-09-08 16:52 ` [EXT] Re: Large(ish) variance induced by SCHED_FIFO Rui Sousa
@ 2025-09-08 17:03 ` Marc Gonzalez
0 siblings, 0 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-08 17:03 UTC (permalink / raw)
To: Rui Sousa, John Ogness, Leon Woestenberg, Daniel Wagner
Cc: linux-rt-users@vger.kernel.org, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
On 08/09/2025 18:52, Rui Sousa wrote:
> Just replace your last setup script line with:
> echo -1 > /proc/sys/kernel/sched_rt_runtime_us
Launch script consists of:
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
# RUN RT BENCHMARK
echo 950000 > /proc/sys/kernel/sched_rt_runtime_us
https://www.kernel.org/doc/html/latest/scheduler/sched-rt-group.html
"A run time of -1 specifies runtime == period, ie. no limit."
1) Disable RT throttling
2) Run RT benchmark
3) Enable RT throttling
> I believe you are seeing regular tasks being scheduled
> in the middle of the benchmark measurement.
I don't think so.
RT throttling is disabled.
The benchmark runs as SCHED_FIFO 99.
Nothing can interrupt it, except interrupts :)
Am checking Daniel's suggestion.
Regards
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Large(ish) variance induced by SCHED_FIFO
2025-09-08 16:02 ` Daniel Wagner
@ 2025-09-08 18:40 ` Marc Gonzalez
2025-09-09 11:08 ` Unexplained variance in run-time of trivial program Marc Gonzalez
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-08 18:40 UTC (permalink / raw)
To: Daniel Wagner
Cc: John Ogness, Leon Woestenberg, linux-rt-users, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin
On 08/09/2025 18:02, Daniel Wagner wrote:
> On Mon, Sep 08, 2025 at 05:42:58PM +0200, Marc Gonzalez wrote:
>
>> I've run out of ideas to identify other sources of variance.
>> (I ran everything in single user mode with nothing else running.)
>> Perhaps with perf I could identify the source of stalls or bubbles?
>>
>> Hoping someone can point me in the right direction.
>
> There are few guides out there, explaining the common pitfalls. I
> suppose you have already found them. Just in case you missed Frederic's
> excellent blog on CPU isolation:
>
> https://www.suse.com/c/cpu-isolation-introduction-part-1/
Thanks for this great link!
Is nohz_full=X a superset of isolcpus=X or are they orthogonal?
(I see that /usr/sbin/irqbalance is running on my system.)
Regards
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
2025-09-08 18:40 ` Marc Gonzalez
@ 2025-09-09 11:08 ` Marc Gonzalez
2025-09-09 11:21 ` Daniel Wagner
2025-09-09 12:34 ` John Ogness
0 siblings, 2 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 11:08 UTC (permalink / raw)
To: linux-rt-users
Cc: Daniel Wagner, John Ogness, Leon Woestenberg, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
Ingo Molnar, Masami Hiramatsu
On 08/09/2025 20:40, Marc Gonzalez wrote:
> On 08/09/2025 18:02, Daniel Wagner wrote:
>
>> On Mon, Sep 08, 2025 at 05:42:58PM +0200, Marc Gonzalez wrote:
>>
>>> I've run out of ideas to identify other sources of variance.
>>> (I ran everything in single user mode with nothing else running.)
>>> Perhaps with perf I could identify the source of stalls or bubbles?
>>>
>>> Hoping someone can point me in the right direction.
>>
>> There are few guides out there, explaining the common pitfalls. I
>> suppose you have already found them. Just in case you missed Frederic's
>> excellent blog on CPU isolation:
>>
>> https://www.suse.com/c/cpu-isolation-introduction-part-1/
>
> Thanks for this great link!
>
> Is nohz_full=X a superset of isolcpus=X or are they orthogonal?
I am now more confused than ever.
Following Frederic's guide:
https://www.suse.com/c/cpu-isolation-practical-example-part-5/
If I didn't mess up, CPU3 is now supposed to be fully isolated
on my 4-core system:
$ sudo dmesg | grep "Kernel command line"
[ 0.038445] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-79-generic root=UUID=2aff6d63-a8c6-4ebb-af69-9d2df2852cb0 ro log_buf_len=16M nohz_full=3 isolcpus=3 quiet splash vt.handoff=7
I ran echo "0-2" > $IRQ/smp_affinity_list for every IRQ.
IRQ=0 AFTER SMP_AFF=0-3 EFF_AFF=0
IRQ=1 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=2 AFTER SMP_AFF=0-3 EFF_AFF=
IRQ=3 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=4 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=5 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=6 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=7 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=8 AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=9 AFTER SMP_AFF=0-2 EFF_AFF=1
IRQ=10 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=11 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=12 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=13 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=14 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=15 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=16 AFTER SMP_AFF=0-2 EFF_AFF=1
IRQ=18 AFTER SMP_AFF=0-2 EFF_AFF=1
IRQ=23 AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=24 AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=25 AFTER SMP_AFF=0-2 EFF_AFF=3
IRQ=26 AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=27 AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=28 AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=29 AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=30 AFTER SMP_AFF=0-2 EFF_AFF=3
IRQ=31 AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=32 AFTER SMP_AFF=0-2 EFF_AFF=1
I note that IRQs 25 & 30 are still effectively pinned to CPU3,
despite the smp_aff setting.
$ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 27 0 0 0 IO-APIC 2-edge timer
8: 0 0 0 0 IO-APIC 8-edge rtc0
9: 0 4 0 0 IO-APIC 9-fasteoi acpi
16: 0 25 0 0 IO-APIC 16-fasteoi ehci_hcd:usb1
18: 0 5 0 0 IO-APIC 18-fasteoi i801_smbus
23: 0 0 29 0 IO-APIC 23-fasteoi ehci_hcd:usb2
24: 0 0 0 0 PCI-MSI-0000:00:1c.0 0-edge PCIe PME, pciehp
25: 0 0 0 0 PCI-MSI-0000:00:1c.3 0-edge PCIe PME
26: 23328 0 0 37250 PCI-MSI-0000:00:1f.2 0-edge ahci[0000:00:1f.2]
27: 86091 0 0 0 PCI-MSI-0000:00:14.0 0-edge xhci_hcd
28: 0 0 51308 0 PCI-MSIX-0000:02:00.0 0-edge enp2s0
29: 0 0 22 0 PCI-MSI-0000:00:16.0 0-edge mei_me
30: 0 0 0 604 PCI-MSI-0000:00:1b.0 0-edge snd_hda_intel:card0
31: 198664 0 0 0 PCI-MSI-0000:00:02.0 0-edge i915
32: 0 631 0 0 PCI-MSI-0000:00:03.0 0-edge snd_hda_intel:card1
My benchmark is a trivial piece of code:
- no system calls, no library calls
- just simple bit twiddling
- tiny code, tiny data structures, everything fits in L1
$ size a.out
text data bss dec hex filename
8549 632 1072 10253 280d a.out
Undisturbed, it runs in 2500 nanoseconds
(with all cores set to tick at 3 GHz).
I run it 2^16 times, so the expected run-time is
2500*2^16 = 163840000 ns = 163.84 ms
Yet, I've observed AVERAGE run-times of
2600, 2700, and even 2800 (!!) nanoseconds.
I captured a full trace of this occurrence with
echo *:* > $TRACING/set_event
cat $TRACING/per_cpu/cpu$CPU/trace > trace.$CPU
# tracer: nop
#
# entries-in-buffer/entries-written: 9743/9743 #P:4
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<idle>-0 [003] dN.1. 4606.627286: cpu_idle: state=4294967295 cpu_id=3
<idle>-0 [003] dN.1. 4606.627288: rcu_utilization: Start context switch
<idle>-0 [003] dN.1. 4606.627288: rcu_utilization: End context switch
<idle>-0 [003] d..2. 4606.627290: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=taskset next_pid=8129 next_prio=0
<idle>-0 [003] d..2. 4606.627291: tlb_flush: pages:-1 reason:flush on task switch (0)
a.out-8129 [003] ...1. 4606.627304: mm_page_free: page=000000001fef04a8 pfn=0x16a586 order=1
...
a.out-8129 [003] d.... 4606.628019: user_enter:
a.out-8129 [003] d.... 4606.811832: user_exit:
Program starts at 4606.627304
Program setup lasts until 4606.628019 (~1500 recorded events)
=> roughly 700 microseconds spent before main()
which I'm not taking into account in my measurements.
Then program ends at 4606.811832
=> 183.813 ms instead of the expected 163.84 ms
Almost exactly 20 ms in excess.
Could this be a hint?
But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.
I guess either my time source is incorrect
(next slide in Frederic's guide),
OR there is something wonky going on inside CPU3.
Looks like IRQs are not the issue here?
No need to investigate why I can't pin IRQs 25 & 30
elsewhere than CPU3?
Regards
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
2025-09-09 11:08 ` Unexplained variance in run-time of trivial program Marc Gonzalez
@ 2025-09-09 11:21 ` Daniel Wagner
2025-09-09 12:42 ` Marc Gonzalez
2025-09-09 12:34 ` John Ogness
1 sibling, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-09 11:21 UTC (permalink / raw)
To: Marc Gonzalez
Cc: linux-rt-users, John Ogness, Leon Woestenberg, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
Ingo Molnar, Masami Hiramatsu
> I note that IRQs 25 & 30 are still effectively pinned to CPU3,
> despite the smp_aff setting.
>
> $ cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3
> 0: 27 0 0 0 IO-APIC 2-edge timer
> 8: 0 0 0 0 IO-APIC 8-edge rtc0
> 9: 0 4 0 0 IO-APIC 9-fasteoi acpi
> 16: 0 25 0 0 IO-APIC 16-fasteoi ehci_hcd:usb1
> 18: 0 5 0 0 IO-APIC 18-fasteoi i801_smbus
> 23: 0 0 29 0 IO-APIC 23-fasteoi ehci_hcd:usb2
> 24: 0 0 0 0 PCI-MSI-0000:00:1c.0 0-edge PCIe PME, pciehp
> 25: 0 0 0 0 PCI-MSI-0000:00:1c.3 0-edge PCIe PME
> 26: 23328 0 0 37250 PCI-MSI-0000:00:1f.2 0-edge ahci[0000:00:1f.2]
> 27: 86091 0 0 0 PCI-MSI-0000:00:14.0 0-edge xhci_hcd
> 28: 0 0 51308 0 PCI-MSIX-0000:02:00.0 0-edge enp2s0
> 29: 0 0 22 0 PCI-MSI-0000:00:16.0 0-edge mei_me
> 30: 0 0 0 604 PCI-MSI-0000:00:1b.0 0-edge snd_hda_intel:card0
> 31: 198664 0 0 0 PCI-MSI-0000:00:02.0 0-edge i915
> 32: 0 631 0 0 PCI-MSI-0000:00:03.0 0-edge snd_hda_intel:card1
Many drivers are not isolcpus-aware. At least the sound driver could be
unloaded for your test, I suppose.
> Almost exactly 20 ms in excess.
> Could this be a hint?
> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.
Ensure your clock source is working correctly and...
> I guess either my time source is incorrect.
> (Next slide in Frederic's guide)
> OR there is something wonky going on inside CPU3.
...there is no SMI firing on this CPU, and no power management active
(check the BIOS settings as well)
> Looks like IRQs are not the issue here?
IRQ handling should be visible in the trace.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
2025-09-09 11:08 ` Unexplained variance in run-time of trivial program Marc Gonzalez
2025-09-09 11:21 ` Daniel Wagner
@ 2025-09-09 12:34 ` John Ogness
2025-09-09 14:08 ` Marc Gonzalez
1 sibling, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-09 12:34 UTC (permalink / raw)
To: Marc Gonzalez, linux-rt-users
Cc: Daniel Wagner, Leon Woestenberg, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Clark Williams, Pavel Machek,
Luis Goncalves, John McCalpin, Frederic Weisbecker, Ingo Molnar,
Masami Hiramatsu
On 2025-09-09, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
> I captured a full trace of this occurrence with
> echo *:* > $TRACING/set_event
> cat $TRACING/per_cpu/cpu$CPU/trace > trace.$CPU
...
> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.
I recommend using the function tracer:
# echo function > $TRACING/current_tracer
It should not affect your timings for pure-userspace code. This should
catch any unintentional kernel activity.
John
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
2025-09-09 11:21 ` Daniel Wagner
@ 2025-09-09 12:42 ` Marc Gonzalez
2025-09-09 14:23 ` Steven Rostedt
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 12:42 UTC (permalink / raw)
To: Daniel Wagner
Cc: linux-rt-users, John Ogness, Leon Woestenberg, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
Ingo Molnar, Masami Hiramatsu
On 09/09/2025 13:21, Daniel Wagner wrote:
>> I note that IRQs 25 & 30 are still effectively pinned to CPU3,
>> despite the smp_aff setting.
>>
>> $ cat /proc/interrupts
>> CPU0 CPU1 CPU2 CPU3
>> 0: 27 0 0 0 IO-APIC 2-edge timer
>> 8: 0 0 0 0 IO-APIC 8-edge rtc0
>> 9: 0 4 0 0 IO-APIC 9-fasteoi acpi
>> 16: 0 25 0 0 IO-APIC 16-fasteoi ehci_hcd:usb1
>> 18: 0 5 0 0 IO-APIC 18-fasteoi i801_smbus
>> 23: 0 0 29 0 IO-APIC 23-fasteoi ehci_hcd:usb2
>> 24: 0 0 0 0 PCI-MSI-0000:00:1c.0 0-edge PCIe PME, pciehp
>> 25: 0 0 0 0 PCI-MSI-0000:00:1c.3 0-edge PCIe PME
>> 26: 23328 0 0 37250 PCI-MSI-0000:00:1f.2 0-edge ahci[0000:00:1f.2]
>> 27: 86091 0 0 0 PCI-MSI-0000:00:14.0 0-edge xhci_hcd
>> 28: 0 0 51308 0 PCI-MSIX-0000:02:00.0 0-edge enp2s0
>> 29: 0 0 22 0 PCI-MSI-0000:00:16.0 0-edge mei_me
>> 30: 0 0 0 604 PCI-MSI-0000:00:1b.0 0-edge snd_hda_intel:card0
>> 31: 198664 0 0 0 PCI-MSI-0000:00:02.0 0-edge i915
>> 32: 0 631 0 0 PCI-MSI-0000:00:03.0 0-edge snd_hda_intel:card1
>
> Many drivers are not isolcpus-aware. At least the sound driver could be
> unloaded for your test, I suppose.
I don't quite understand.
Do these drivers explicitly request that their ISR run on CPU3?
Why doesn't the kernel just run these ISRs on a non-isolated core?
>> Almost exactly 20 ms in excess.
>> Could this be a hint?
>> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.
>
> Ensure your clock source is working correctly and...
>
>> I guess either my time source is incorrect.
>> (Next slide in Frederic's guide)
>> OR there is something wonky going on inside CPU3.
>
> ...there is no SMI running on this CPU, and no power management running
> (also in the BIOS settings)
Actually, I'm 99.9% sure that clock source accuracy & SMM are red herrings.
If I replace my code with this one:
mov $(1<<12), %eax
1: dec %ecx
dec %ecx
dec %eax
jnz 1b
which runs in ~2735 nanoseconds at 3 GHz
(2 cycles per iteration * 4096 iterations = 8192 cycles
8192 cycles / 3 = 2730 nanoseconds)
Running this trivial baseline benchmark 2^16 times should take
2735 ns * 2^16 = 179.241 ms
And if I run the loop 1000 times, and sort by run-time, I observe
MIN=179.374211 ms
MAX=179.406745 ms
So worst-case is only 166 microseconds worse than expected
(contrast this to 20 MILLIseconds for my code, 120 times worse)
It seems something is randomly stalling the pipeline on CPU3.
This feels like thermal throttling maybe?
Is that supposed to be logged somewhere?
But why would my program throttle & not the trivial baseline benchmark?
Regards
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
2025-09-09 12:34 ` John Ogness
@ 2025-09-09 14:08 ` Marc Gonzalez
2025-09-09 18:33 ` Marc Gonzalez
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 14:08 UTC (permalink / raw)
To: John Ogness, linux-rt-users
Cc: Daniel Wagner, Leon Woestenberg, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Clark Williams, Pavel Machek,
Luis Goncalves, John McCalpin, Frederic Weisbecker, Ingo Molnar,
Masami Hiramatsu
On 09/09/2025 14:34, John Ogness wrote:
> On 2025-09-09, Marc Gonzalez wrote:
>
>> I captured a full trace of this occurrence with
>> echo *:* > $TRACING/set_event
>> cat $TRACING/per_cpu/cpu$CPU/trace > trace.$CPU
>
> ...
>
>> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.
>
> I recommend using the function tracer:
>
> # echo function > $TRACING/current_tracer
>
> It should not affect your timings for pure-userspace code. This should
> catch any unintentional kernel activity.
OK. This time, I used:
echo function > $TRACING/current_tracer
cat $TRACING/trace > trace.all
(which produced a 12.65 MB file)
I dumped a problematic run where T=2722 instead of T=2500
i.e. total run-time = 178.389 ms instead of 163.84 ms
Looking for a.out-23683 until the log goes quiet...
a.out-23683 [003] d.... 14939.982368: user_enter:
a.out-23683 [003] d.... 14939.982368: vtime_user_enter <-__ct_user_enter
a.out-23683 [003] d.... 14939.982368: get_vtime_delta <-vtime_user_enter
NO OCCURRENCE of a.out or [003] between these 2 points
a.out-23683 [003] d.... 14940.160764: vtime_user_exit <-__ct_user_exit
a.out-23683 [003] d.... 14940.160765: get_vtime_delta <-vtime_user_exit
a.out-23683 [003] d.... 14940.160765: account_user_time <-vtime_user_exit
14940.160764 - 14939.982368 = 178.396 ms
(which corresponds to the total run-time)
Are the vtime_user_enter/exit tied to clock_gettime?
(Probably not, since they're not balanced)
It's starting to look like the problem is on the CPU itself,
not the kernel interrupting the CPU.
Might be time to break out perf?
My hunch is frequency dropped to ~2.7 GHz for the duration of the benchmark.
Maybe I should lower the frequency to 2 GHz.
But again, why would my code force throttling and not the toy code?
One possible reason is my code reaches 3.5 IPC, while the toy code
remains at 1.5 IPC (with micro-op fusion).
It's a stretch, but easy to test.
I will measure CPU cycles, to see if the increased run-time
corresponds with a change in CPU cycles.
Regards
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
2025-09-09 12:42 ` Marc Gonzalez
@ 2025-09-09 14:23 ` Steven Rostedt
0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2025-09-09 14:23 UTC (permalink / raw)
To: Marc Gonzalez
Cc: Daniel Wagner, linux-rt-users, John Ogness, Leon Woestenberg,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
Ingo Molnar, Masami Hiramatsu
On Tue, 9 Sep 2025 14:42:54 +0200
Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
> > ...there is no SMI running on this CPU, and no power management running
> > (also in the BIOS settings)
>
> Actually, I'm 99.9% sure that clock source accuracy & SMM are red herrings.
You may want to run the osnoise or hwlat tracer to see if SMM or anything
else the kernel does could cause issues with your tests.
-- Steve
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
2025-09-09 14:08 ` Marc Gonzalez
@ 2025-09-09 18:33 ` Marc Gonzalez
[not found] ` <CAMjhiJjMOQN-nWd+KP4JBBNHf20M+J2fXAuTTvowXctJgvGOcQ@mail.gmail.com>
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 18:33 UTC (permalink / raw)
To: John Ogness, linux-rt-users
Cc: Daniel Wagner, Leon Woestenberg, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Clark Williams, Pavel Machek,
Luis Goncalves, John McCalpin, Frederic Weisbecker, Ingo Molnar,
Masami Hiramatsu
On 9/9/25 16:08, Marc Gonzalez wrote:
> My hunch is frequency dropped to ~2.7 GHz for the duration of the benchmark.
>
> Maybe I should lower the frequency to 2 GHz.
> But again, why would my code force throttling and not the toy code?
> One possible reason is my code reaches 3.5 IPC, while the toy code
> remains at 1.5 IPC (with micro-op fusion).
>
> It's a stretch, but easy to test.
>
> I will measure CPU cycles, to see if the increased run-time
> corresponds with a change in CPU cycles.
# dmesg | grep -i tsc
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 3292.563 MHz processor
[ 0.038014] TSC deadline timer available
[ 0.093396] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2f75db292c8, max_idle_ns: 440795321216 ns
[ 0.178415] clocksource: Switched to clocksource tsc-early
[ 1.272272] tsc: Refined TSC clocksource calibration: 3292.377 MHz
[ 1.272280] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2f752b31fa5, max_idle_ns: 440795265170 ns
[ 1.272305] clocksource: Switched to clocksource tsc
$ lscpu | grep -o '[a-z_]*tsc[a-z_]*'
tsc
rdtscp
constant_tsc
nonstop_tsc
tsc_deadline_timer
tsc_adjust
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc */

/* CLK is defined elsewhere in the harness, e.g. CLOCK_MONOTONIC_RAW */
static void loop(int log2)
{
int n = 1 << log2;
struct timespec t0, t1;
clock_gettime(CLK, &t0);
long c0 = __rdtsc();
for (int i = 0; i < n; ++i) my_code();
long c1 = __rdtsc();
clock_gettime(CLK, &t1);
long d = (t1.tv_sec - t0.tv_sec)*1000000000L + (t1.tv_nsec - t0.tv_nsec);
long t = d >> log2;
long c = c1-c0;
long f = c*1000000/d;
printf("D=%ld C=%ld F=%ld T=%ld N=%d\n", d, c, f, t, n);
}
(Looks like RDTSCP might have been preferable over RDTSC)
I lowered the frequency of all cores to 1.5 GHz
MIN RUNTIME: D=327944847 C=1079700683 F=3292323 T=5004 N=65536
MAX RUNTIME: D=369923901 C=1217909284 F=3292323 T=5644 N=65536
Looks like the TSC ticks at 3.3 GHz (core nominal freq) even
if I manually lower the core frequency. That is in fact what the
constant_tsc/nonstop_tsc flags above guarantee: the TSC is invariant
and does not follow the core clock. I need to figure out how
'perf stat' counts actual core cycles. Probably reads a different MSR.
Changing /sys/devices/system/clocksource/clocksource0/current_clocksource
to hpet doesn't really change anything.
TODO: find cycle count MSR.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
[not found] ` <CAMjhiJjMOQN-nWd+KP4JBBNHf20M+J2fXAuTTvowXctJgvGOcQ@mail.gmail.com>
@ 2025-09-09 20:30 ` Marc Gonzalez
2025-09-10 7:59 ` Daniel Wagner
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 20:30 UTC (permalink / raw)
To: Leon Woestenberg
Cc: John Ogness, linux-rt-users, Daniel Wagner, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
Ingo Molnar, Masami Hiramatsu
On 9/9/25 21:34, Leon Woestenberg wrote:
> Basically, you are reverse-tuning a multicore CPU system,
> that is heavily optimized for time-sharing and overall
> system throughput, back to act like bare-metal single-application.
> You can get close, but you will never reach the determinism
> of, say, a Cortex-R(ealtime) Core.
This ignores the fact that I've been testing 2 programs:
- P1, toy code that just calls dec 3 times in a tight loop
- P2, a few memory accesses, everything fits in L1
I then loop P1 & P2 2^16 times to compute their average run-time.
Over thousands of runs, RUNTIME_P1 is ALWAYS the same (within 30 ppm)
Over thousands of runs, RUNTIME_P2 varies by up to 12 PERCENT (!!) = 122k ppm
Since P1 runs fine without ANY jitter, the problem cannot be
something inherent to the system itself, such as
- SMM
- IRQs firing randomly
- drifting, jittery, jumping time source
The problem appears to be something inherent to P2.
(Either the whole core is slowed down perhaps by thermal throttling,
or something is causing huge pipeline stalls.)
Regards
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
2025-09-09 20:30 ` Marc Gonzalez
@ 2025-09-10 7:59 ` Daniel Wagner
2025-09-11 21:29 ` Marc Gonzalez
0 siblings, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-10 7:59 UTC (permalink / raw)
To: Marc Gonzalez
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
Ingo Molnar, Masami Hiramatsu
On Tue, Sep 09, 2025 at 10:30:49PM +0200, Marc Gonzalez wrote:
> On 9/9/25 21:34, Leon Woestenberg wrote:
>
> > Basically, you are reverse-tuning a multicore CPU system,
> > that is heavily optimized for time-sharing and overall
> > system throughput, back to act like bare-metal single-application.
> > You can get close, but you will never reach the determinism
> > of, say, a Cortex-R(ealtime) Core.
>
> This ignores the fact that I've been testing 2 programs:
>
> - P1, toy code that just calls dec 3 times in a tight loop
> - P2, a few memory accesses, everything fits in L1
>
> I then loop P1 & P2 2^16 times to compute their average run-time.
>
> Over thousands of runs, RUNTIME_P1 is ALWAYS the same (within 30 ppm)
>
> Over thousands of runs, RUNTIME_P2 varies by up to 12 PERCENT (!!) =
> 122k ppm
Modern CPUs, with their micro-architectures, are highly complex
systems. The result is non-deterministic runtime behavior. I can't tell
whether what you are measuring is within the ballpark of normal
non-deterministic runtime variance. If you don't want to read through
the CPU manufacturer's documentation, you could check whether the
Spectre and related side-channel attack papers give some clues on
timing behavior.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
2025-09-10 7:59 ` Daniel Wagner
@ 2025-09-11 21:29 ` Marc Gonzalez
2025-09-11 22:15 ` Marc Gonzalez
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-11 21:29 UTC (permalink / raw)
To: Daniel Wagner
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
Ingo Molnar, Masami Hiramatsu, agner
On 9/10/25 09:59, Daniel Wagner wrote:
> On Tue, Sep 09, 2025 at 10:30:49PM +0200, Marc Gonzalez wrote:
>
>> On 9/9/25 21:34, Leon Woestenberg wrote:
>>
>>> Basically, you are reverse-tuning a multicore CPU system,
>>> that is heavily optimized for time-sharing and overall
>>> system throughput, back to act like bare-metal single-application.
>>> You can get close, but you will never reach the determinism
>>> of, say, a Cortex-R(ealtime) Core.
>>
>> This ignores the fact that I've been testing 2 programs:
>>
>> - P1, toy code that just calls dec 3 times in a tight loop
>> - P2, a few memory accesses, everything fits in L1
>>
>> I then loop P1 & P2 2^16 times to compute their average run-time.
>>
>> Over thousands of runs, RUNTIME_P1 is ALWAYS the same (within 30 ppm)
>>
>> Over thousands of runs, RUNTIME_P2 varies by up to 12 PERCENT (!!) = 122k ppm
>
> Modern CPUs, with their micro-architectures, are highly complex
> systems. The result is non-deterministic runtime behavior. I can't tell
> whether what you are measuring is within the ballpark of normal
> non-deterministic runtime variance. If you don't want to read through
> the CPU manufacturer's documentation, you could check whether the
> Spectre and related side-channel attack papers give some clues on
> timing behavior.
I am well aware of the complexity of modern micro-architectures ;)
I actually reported a bug in the original K7 micro-architecture
25 years ago.
And Agner Fog's microarchitecture.pdf is always on my bedside table.
https://www.agner.org/optimize/microarchitecture.pdf
In an attempt to figure out the origin of the chaos, I broke out perf.
I would say I found many more questions than answers...
Running 3.3 GHz Haswell at 1.5 GHz
https://perfmon-events.intel.com/index.html?pltfrm=haswell_server.html
https://oprofile.sourceforge.io/docs/intel-haswell-events.php
GOOD RUN
D,C,Cb,F,T,N
1311453543 4317733068 4317732972 3292326 5002 262144
7512849605 uops_executed.core
31445738 uops_executed.stall_cycles
164966498 uops_retired.stall_cycles
574110144 topdown-fetch-bubbles
25679080 cycle_activity.cycles_no_execute
6808318537 inst_retired.any
6268865 int_misc.recovery_cycles
6268818 int_misc.recovery_cycles_any
1,313036051 seconds time elapsed
BAD RUN
D,C,Cb,F,T,N
1436503721 4729437143 4729436743 3292325 5479 262144
7647384031 uops_executed.core
113200734 uops_executed.stall_cycles
343460280 uops_retired.stall_cycles
787367726 topdown-fetch-bubbles
106926806 cycle_activity.cycles_no_execute
6808330091 inst_retired.any
17369470 int_misc.recovery_cycles
17369430 int_misc.recovery_cycles_any
1,438057158 seconds time elapsed
Remarks (in no particular order)
- Run-time increased by 9.5% (125 ms / 1.311 s)
- inst_retired.any remains (nearly) constant
BUT uops_executed.core increased by 1.8%
=> Where are those extra uops coming from???
Speculatively executed uops discarded on some flush event?
- 343460280-164966498 = 178.5M more stall cycles at retire stage
at 1.5 GHz = 119 ms
This alone could explain the degraded performance
- 113200734-31445738 = 81.7M more stall cycles at execute stage
some are probably dupes with retire stalls
- Not sure what the int_misc.recovery_* events measure???
They're documented as:
"This event counts the number of cycles spent waiting for a
recovery after an event such as a processor nuke, JEClear, assist,
hle/rtm abort etc."
and
"Core cycles the allocator was stalled due to recovery from earlier
clear event for any thread running on the physical core (e.g.
misprediction or memory nuke)."
=> In my case, they're probably measuring the same thing.
Weird that the description sounds a bit different.
Need to read up on processor nuke, memory nuke, machine clear event,
JEClear, assist, HLE/RTM...
I'll throw in MACHINE_CLEARS.CYCLES in the next run.
"Cycles there was a Nuke. Account for both thread-specific and All Thread Nukes."
I would love for someone to chime in & cast some light
on my utter confusion :)
Regards
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
2025-09-11 21:29 ` Marc Gonzalez
@ 2025-09-11 22:15 ` Marc Gonzalez
2025-09-12 7:44 ` Daniel Wagner
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-11 22:15 UTC (permalink / raw)
To: Daniel Wagner
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
Ingo Molnar, Masami Hiramatsu, agner
On 9/11/25 23:29, Marc Gonzalez wrote:
> - Not sure what the int_misc.recovery_* events measure???
> They're documented as:
> "This event counts the number of cycles spent waiting for a
> recovery after an event such as a processor nuke, JEClear, assist,
> hle/rtm abort etc."
> and
> "Core cycles the allocator was stalled due to recovery from earlier
> clear event for any thread running on the physical core (e.g.
> misprediction or memory nuke)."
> => In my case, they're probably measuring the same thing.
> Weird that the description sounds a bit different.
> Need to read up on processor nuke, memory nuke, machine clear event,
> JEClear, assist, HLE/RTM...
Daniel, seems you were spot on when mentioning side channel attacks.
https://www.usenix.org/system/files/sec21-ragab.pdf
I need to truly read & understand this paper.
GOOD
D,C,Cb,F,T,N
1311384211 4317505417 4317505021 3292326 5002 262144
6808311015 inst_retired.any
7515242772 uops_executed.core
31190466 uops_executed.stall_cycles
25467763 cycle_activity.cycles_no_execute
165247175 uops_retired.stall_cycles
6335672 int_misc.recovery_cycles
296086 MACHINE_CLEARS.CYCLES
BAD
D,C,Cb,F,T,N
1460302963 4807793027 4807792639 3292325 5570 262144
6808333218 inst_retired.any
7656792683 uops_executed.core
107194058 uops_executed.stall_cycles
102852971 cycle_activity.cycles_no_execute
341106967 uops_retired.stall_cycles
15062277 int_misc.recovery_cycles
2124362 MACHINE_CLEARS.CYCLES
Random reference
https://community.intel.com/t5/Analyzers/Regarding-machine-clears/m-p/953984
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
2025-09-11 22:15 ` Marc Gonzalez
@ 2025-09-12 7:44 ` Daniel Wagner
2025-09-13 10:09 ` Marc Gonzalez
0 siblings, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-12 7:44 UTC (permalink / raw)
To: Marc Gonzalez
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
Ingo Molnar, Masami Hiramatsu, agner
On Fri, Sep 12, 2025 at 12:15:12AM +0200, Marc Gonzalez wrote:
> On 9/11/25 23:29, Marc Gonzalez wrote:
>
> > - Not sure what the int_misc.recovery_* events measure???
> > They're documented as:
> > "This event counts the number of cycles spent waiting for a
> > recovery after an event such as a processor nuke, JEClear, assist,
> > hle/rtm abort etc."
> > and
> > "Core cycles the allocator was stalled due to recovery from earlier
> > clear event for any thread running on the physical core (e.g.
> > misprediction or memory nuke)."
> > => In my case, they're probably measuring the same thing.
> > Weird that the description sounds a bit different.
> > Need to read up on processor nuke, memory nuke, machine clear event,
> > JEClear, assist, HLE/RTM...
>
> Daniel, seems you were spot on when mentioning side channel attacks.
I'm sorry if I sounded patronizing, that was not my intent. After we
ruled out OS noise, the only thing left was that the runtime variance
comes from the CPU itself.
> https://www.usenix.org/system/files/sec21-ragab.pdf
>
> I need to truly read & understand this paper.
Skimmed over it, it could explain what you are observing. The question
obviously is which part of the 'bad' code is confusing the CPU :)
* Re: Unexplained variance in run-time of trivial program
2025-09-12 7:44 ` Daniel Wagner
@ 2025-09-13 10:09 ` Marc Gonzalez
2025-09-15 16:30 ` Marc Gonzalez
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-13 10:09 UTC (permalink / raw)
To: Daniel Wagner
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
Ingo Molnar, Masami Hiramatsu, agner
On 9/12/25 09:44, Daniel Wagner wrote:
> On Fri, Sep 12, 2025 at 12:15:12AM +0200, Marc Gonzalez wrote:
>
>> - Not sure what the int_misc.recovery_* events measure???
>> They're documented as:
>> "This event counts the number of cycles spent waiting for a
>> recovery after an event such as a processor nuke, JEClear, assist,
>> hle/rtm abort etc."
>> and
>> "Core cycles the allocator was stalled due to recovery from earlier
>> clear event for any thread running on the physical core (e.g.
>> misprediction or memory nuke)."
>> => In my case, they're probably measuring the same thing.
>> Weird that the description sounds a bit different.
>> Need to read up on processor nuke, memory nuke, machine clear event,
>> JEClear, assist, HLE/RTM...
>>
>> Daniel, seems you were spot on when mentioning side channel attacks.
>
> I'm sorry if I sounded patronizing, that was not my intent. After we
> ruled out OS noise, the only thing left was that the runtime variance
> comes from the CPU itself.
I appreciate all the insights you've shared thus far! :)
(As well as those from other participants!)
>> https://www.usenix.org/system/files/sec21-ragab.pdf
>>
>> I need to truly read & understand this paper.
>
> Skimmed over it, it could explain what you are observing. The question
> obviously is which part of the 'bad' code is confusing the CPU :)
GOOD RUN
D,C,Cb,F,T,N
1310881180 4315846773 4315846665 3292324 5000 262144
6808319919 inst_retired.any
7513464787 uops_executed.core
30496248 uops_executed.stall_cycles
164691876 uops_retired.stall_cycles
6292081 int_misc.recovery_cycles
292758 machine_clears.count
292049 machine_clears.memory_ordering
BAD RUN
D,C,Cb,F,T,N
1417213267 4665926496 4665926108 3292324 5406 262144
6808316250 inst_retired.any
7614270474 uops_executed.core
85148157 uops_executed.stall_cycles
285284384 uops_retired.stall_cycles
13520285 int_misc.recovery_cycles
1536963 machine_clears.count
1536308 machine_clears.memory_ordering
Notes:
- A "good" run gets 300k machine clears.
- A "bad" run gets 1.5M machine clears.
- All machine clears are memory ordering machine clears.
No self-modifying code, no floating point code.
Probably a few MCs from page faults.
(TODO: measure only MCs from my code, not setup code)
If I understand correctly, a memory ordering machine clear
only happens when different cores access the same memory?
I'm running single-threaded code on an otherwise idle system.
I'm thinking that Haswell might lump Memory Ordering AND
Memory Disambiguation MCs?
Is that a reasonable assumption?
Not sure what's non-deterministic about the loads in my code.
Might be some kind of "catastrophic" chain of events where
one mispredict generates a cascade of subsequent mispredicts.
I will keep digging :)
Regards
* Re: Unexplained variance in run-time of trivial program
2025-09-13 10:09 ` Marc Gonzalez
@ 2025-09-15 16:30 ` Marc Gonzalez
2025-09-16 7:50 ` Ahmed S. Darwish
0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-15 16:30 UTC (permalink / raw)
To: Daniel Wagner
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
Ingo Molnar, Masami Hiramatsu, agner
On 9/13/25 12:09, Marc Gonzalez wrote:
> GOOD RUN
> D,C,Cb,F,T,N
> 1310881180 4315846773 4315846665 3292324 5000 262144
> 6808319919 inst_retired.any
> 7513464787 uops_executed.core
> 30496248 uops_executed.stall_cycles
> 164691876 uops_retired.stall_cycles
> 6292081 int_misc.recovery_cycles
> 292758 machine_clears.count
> 292049 machine_clears.memory_ordering
>
> BAD RUN
> D,C,Cb,F,T,N
> 1417213267 4665926496 4665926108 3292324 5406 262144
> 6808316250 inst_retired.any
> 7614270474 uops_executed.core
> 85148157 uops_executed.stall_cycles
> 285284384 uops_retired.stall_cycles
> 13520285 int_misc.recovery_cycles
> 1536963 machine_clears.count
> 1536308 machine_clears.memory_ordering
>
> Notes:
>
> - A "good" run gets 300k machine clears.
> - A "bad" run gets 1.5M machine clears.
> - All machine clears are memory ordering machine clears.
> No self-modifying code, no floating point code.
> Probably a few MCs from page faults.
> (TODO: measure only MCs from my code, not setup code)
>
> If I understand correctly, a memory ordering machine clear
> only happens when different cores access the same memory?
> I'm running single-threaded code on an otherwise idle system.
> I'm thinking that Haswell might lump Memory Ordering AND
> Memory Disambiguation MCs?
>
> Is that a reasonable assumption?
>
> Not sure what's non-deterministic about the loads in my code.
> Might be some kind of "catastrophic" chain of events where
> one mispredict generates a cascade of subsequent mispredicts.
>
> I will keep digging :)
I now have a better understanding of the issue :)
To recap, I'm benchmarking a fairly trivial piece of code.
But I observe a large variance in the run-time of said code
(from N to 1.15N nanoseconds).
What's more, the variance REMAINS when I run the code
far longer (by looping the same execution).
Therefore:
- Either the perturbation is periodic, and adds noise
regardless of the duration of the benchmark,
- OR the perturbation only manifests in SOME RUNS.
Since some runs complete without any perturbation,
even very long ones, the second option seems more likely.
Runs where perturbations are observed have a much higher
incidence of MEMORY ORDERING MEMORY NUKES. These types
of events happen when different cores share memory
(true sharing or false sharing). Since my code is purely
single-threaded, it must be false sharing. At this point,
I'm guessing that address randomization creates,
ONLY FOR SOME RUNS, a condition where my program lives
at virtual addresses similar to those of another program
active on a different core, which causes MEMORY NUKES.
I can work around the issue:
INSTEAD OF running 1M instances of my code in 1 program,
I RUN 1000 programs of 1000 instances each,
and take the minimum run-time of the 1000.
(There is a bit more overhead in forking so many programs,
but I really need reproducible results.)
Now I observe less than 0.2% variance.
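The scheme can be sketched with a small shell driver; `bench` below is a placeholder no-op standing in for the real benchmark binary, and the run count is kept small for illustration:

```shell
#!/bin/sh
# Sketch: run the benchmark as many short-lived processes and keep
# the minimum wall-clock time across all of them.
bench() { :; }      # placeholder: replace with the real benchmark
RUNS=10             # Marc used ~1000 runs
min=""
i=0
while [ "$i" -lt "$RUNS" ]; do
    t0=$(date +%s%N)    # nanoseconds since epoch (GNU date)
    bench
    t1=$(date +%s%N)
    dt=$((t1 - t0))
    if [ -z "$min" ] || [ "$dt" -lt "$min" ]; then
        min=$dt
    fi
    i=$((i + 1))
done
echo "min run-time: $min ns"
```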
I will also try disabling address randomization,
to test my hypothesis.
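One way to test that without a system-wide change is setarch -R, which disables randomization for a single process only (a hedged sketch; `head -1 /proc/self/maps` stands in for the benchmark):

```shell
#!/bin/sh
# Sketch: check the ASLR setting, then run a process with
# randomization disabled (setarch -R sets ADDR_NO_RANDOMIZE,
# no root needed).
cat /proc/sys/kernel/randomize_va_space   # 2 = full randomization (default)
setarch "$(uname -m)" -R sh -c 'head -1 /proc/self/maps'
setarch "$(uname -m)" -R sh -c 'head -1 /proc/self/maps'
# With -R, the two mappings above should print identical addresses.
```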
Has anyone ever dealt with this issue?
Regards
* Re: Unexplained variance in run-time of trivial program
2025-09-15 16:30 ` Marc Gonzalez
@ 2025-09-16 7:50 ` Ahmed S. Darwish
2025-09-25 11:38 ` Marc Gonzalez
0 siblings, 1 reply; 30+ messages in thread
From: Ahmed S. Darwish @ 2025-09-16 7:50 UTC (permalink / raw)
To: Marc Gonzalez
Cc: Daniel Wagner, Leon Woestenberg, John Ogness, linux-rt-users,
Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
Clark Williams, Pavel Machek, Luis Goncalves, John McCalpin,
Frederic Weisbecker, Ingo Molnar, Masami Hiramatsu, agner
On Mon, 15 Sep 2025, Marc Gonzalez wrote:
>
> These types of events happen when different cores share memory (true
> sharing or false sharing). Since my code is purely single-threaded, it
> must be false sharing.
>
I don't know if you're just having "fun" with the CPU or optimizing a
real-world use-case...
With that in mind, maybe the following helps: understand the topology of
your x86 CPU and turn off the cores that might be causing a "noisy
neighbor" problem.
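A hedged sketch of that: lscpu -e maps logical CPUs to cores and cache sharing, and a CPU can then be taken offline through sysfs (the CPU number below is only an example):

```shell
#!/bin/sh
# Sketch: show which CPUs share which cores/caches, then (as root)
# take a suspect CPU offline to silence a noisy neighbor.
lscpu -e=CPU,CORE,SOCKET,CACHE
# As root, any CPU other than the boot CPU can be offlined:
#   echo 0 > /sys/devices/system/cpu/cpu3/online   # offline CPU 3
#   cat /sys/devices/system/cpu/cpu3/online        # now reads 0
```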
If you are able to root-cause it, and your CPU supports class-of-service
resource monitoring and allocation, then check
Documentation/filesystems/resctrl.rst.
Good luck,
--
Ahmed S. Darwish
Linutronix GmbH
* Re: Unexplained variance in run-time of trivial program
2025-09-16 7:50 ` Ahmed S. Darwish
@ 2025-09-25 11:38 ` Marc Gonzalez
0 siblings, 0 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-25 11:38 UTC (permalink / raw)
To: Ahmed S. Darwish
Cc: Daniel Wagner, Leon Woestenberg, John Ogness, linux-rt-users,
Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
Clark Williams, Pavel Machek, Luis Goncalves, John McCalpin,
Frederic Weisbecker, Ingo Molnar, Masami Hiramatsu, agner
On 16/09/2025 09:50, Ahmed S. Darwish wrote:
> On Mon, 15 Sep 2025, Marc Gonzalez wrote:
>
>> These types of events happen when different cores share memory (true
>> sharing or false sharing). Since my code is purely single-threaded, it
>> must be false sharing.
>
> I don't know if you're just having "fun" with the CPU or optimizing a
> real-world use-case...
I'm writing a sudoku solver in my free time.
So, having fun on a real-world use-case :)
> With that in mind, maybe the following helps: understand the topology of
> your x86 CPU and turn off the cores that might be causing a "noisy
> neighbor" problem.
Hmmm... I would assume a fairly simple topology.
https://www.techpowerup.com/cpu-specs/core-i5-4590.c1750
https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Desktop_processors
This CPU does not support SMT, so L1 & L2 are exclusive to each core.
L3 is shared between all cores IIUC.
> If you were able to root it down, and your CPU supports class of service
> resource monitoring and allocation, then check
> Documentation/filesystems/resctrl.rst.
I have not yet found a good way to minimize the noise from
memory ordering memory nukes. (For the time being, I'm running
the benchmark hundreds of times and taking the minimum run time.)
https://docs.kernel.org/filesystems/resctrl.html
$ grep 'l2\|l3\|llc\|mb' /proc/cpuinfo
/* returns nothing */
Looks like Haswell didn't support those nice features :(
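For what it's worth, the RDT features are reported under more specific flag names than that pattern; a hedged sketch of a closer check:

```shell
#!/bin/sh
# Sketch: look for the actual resctrl (Intel RDT) CPUID flag names
# rather than substrings like "l3" or "mb".
grep -om1 'cat_l3\|cat_l2\|cqm_llc\|cqm_occup_llc\|mba' /proc/cpuinfo \
    || echo "no RDT features reported"
```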
Thanks for the suggestion.
I might update to Zen 5 or Zen 6 at some point.
Regards
end of thread, other threads:[~2025-09-25 11:38 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-04 23:45 Large(ish) variance induced by SCHED_FIFO Marc Gonzalez
2025-09-05 8:25 ` Daniel Wagner
2025-09-05 13:23 ` Marc Gonzalez
2025-09-05 14:03 ` John Ogness
2025-09-05 15:34 ` Marc Gonzalez
2025-09-05 16:07 ` John Ogness
2025-09-05 17:40 ` Marc Gonzalez
2025-09-08 9:36 ` John Ogness
2025-09-08 15:42 ` Marc Gonzalez
2025-09-08 16:02 ` Daniel Wagner
2025-09-08 18:40 ` Marc Gonzalez
2025-09-09 11:08 ` Unexplained variance in run-time of trivial program Marc Gonzalez
2025-09-09 11:21 ` Daniel Wagner
2025-09-09 12:42 ` Marc Gonzalez
2025-09-09 14:23 ` Steven Rostedt
2025-09-09 12:34 ` John Ogness
2025-09-09 14:08 ` Marc Gonzalez
2025-09-09 18:33 ` Marc Gonzalez
[not found] ` <CAMjhiJjMOQN-nWd+KP4JBBNHf20M+J2fXAuTTvowXctJgvGOcQ@mail.gmail.com>
2025-09-09 20:30 ` Marc Gonzalez
2025-09-10 7:59 ` Daniel Wagner
2025-09-11 21:29 ` Marc Gonzalez
2025-09-11 22:15 ` Marc Gonzalez
2025-09-12 7:44 ` Daniel Wagner
2025-09-13 10:09 ` Marc Gonzalez
2025-09-15 16:30 ` Marc Gonzalez
2025-09-16 7:50 ` Ahmed S. Darwish
2025-09-25 11:38 ` Marc Gonzalez
2025-09-08 16:52 ` [EXT] Re: Large(ish) variance induced by SCHED_FIFO Rui Sousa
2025-09-08 17:03 ` Marc Gonzalez
[not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
2025-09-05 21:05 ` Marc Gonzalez
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox