* Large(ish) variance induced by SCHED_FIFO
@ 2025-09-04 23:45 Marc Gonzalez
2025-09-05 8:25 ` Daniel Wagner
[not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
0 siblings, 2 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-04 23:45 UTC (permalink / raw)
To: linux-rt-users
Cc: Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
Daniel Wagner, Clark Williams, Pavel Machek, Luis Goncalves,
John McCalpin
Hello everyone,
I am currently *NOT* running a PREEMPT_RT kernel.
I am trying to get reproducible / deterministic / low variance results
using a vendor kernel (6.8.0-79-generic #79-Ubuntu SMP PREEMPT_DYNAMIC).
I have removed all sources of noise / variance I could think of:
- SMT is disabled in BIOS
lscpu reports "Thread(s) per core: 1"
cat /sys/devices/system/cpu/smt/active returns 0
- Kernel boots in single user mode
No GUI, just a text terminal
Nothing running other than system daemons & bash
- Kernel runs with threadirqs command-line parameter
ISRs run as SCHED_FIFO processes with prio 50
$ ps -e -o pid,cls,pri,args --sort pri
PID CLS PRI COMMAND
18 FF 139 [migration/0]
23 FF 139 [migration/1]
29 FF 139 [migration/2]
35 FF 139 [migration/3]
19 FF 90 [idle_inject/0]
22 FF 90 [idle_inject/1]
28 FF 90 [idle_inject/2]
34 FF 90 [idle_inject/3]
54 FF 90 [irq/9-acpi]
62 FF 90 [watchdogd]
68 FF 90 [irq/24-PCIe PME]
69 FF 90 [irq/24-pciehp]
70 FF 90 [irq/24-s-pciehp]
71 FF 90 [irq/25-PCIe PME]
74 FF 90 [irq/8-rtc0]
75 FF 90 [irq/16-ehci_hcd:usb1]
82 FF 90 [irq/23-ehci_hcd:usb2]
162 FF 90 [irq/26-xhci_hcd]
164 FF 90 [irq/28-ahci[0000:00:1f.2]]
377 FF 90 [irq/18-i801_smbus]
384 FF 90 [irq/29-mei_me]
411 FF 90 [card1-crtc0]
412 FF 90 [card1-crtc1]
413 FF 90 [irq/30-i915]
463 FF 90 [irq/31-snd_hda_intel:card0]
466 FF 90 [irq/32-snd_hda_intel:card1]
335 FF 41 [psimon]
1076 FF 41 [psimon]
- Frequency of CPUs is "pinned" at 3 GHz
No "turbo boost" (frequency varies with temperature).
Slightly below nominal freq to avoid temp throttling.
(Intel Core i5-4590 CPU @ 3.30GHz)
for I in 0 1 2 3; do
echo userspace > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor
echo 3000000 > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed
done
- Program is pinned to CPU #1
Avoid any cache ping pong.
Avoid CPU #0 in case it handles more interrupts.
$ taskset -c 1 ./a.out
And the last thing I wanted to do was:
- Run benchmark as SCHED_FIFO 99 to make sure NOTHING could interrupt it
bench is strictly computational:
no system calls, no I/O, no sleeping, just number crunching.
However, I noticed that the results have MORE variance when
the program runs as SCHED_FIFO than when it runs as SCHED_OTHER.
Here is my testcase:
Toy benchmark:
#define N (1 << 30)

int main(void)
{
	for (volatile int i = 0; i < N; ++i);
	return 0;
}
Compiled with gcc -O3 toy.c
Then I run this script as root:
#!/bin/bash
CMD="taskset -c 1 ./a.out"
if [ "$1" = "fifo" ]; then CMD="chrt -f 99 $CMD"; fi
for I in $(seq 1 30); do
	T0=$(date "+%s.%N")
	$CMD
	T1=$(date "+%s.%N")
	echo "$T1-$T0" | bc -l
done
Once without fifo (SCHED_OTHER), once with fifo (SCHED_FIFO).
(Full results at the end of the message)
For SCHED_OTHER:
MIN_RUNTIME=2.158045784 seconds
MAX_RUNTIME=2.158515466 seconds
i.e. all runs are within +/- 109 ppm (MAX-MIN)/(MAX+MIN)
For SCHED_FIFO:
MIN_RUNTIME=2.247394425 seconds
MAX_RUNTIME=2.305229091 seconds
i.e. all runs are within +/- 12700 ppm
i.e. 100 times WORSE than SCHED_OTHER... :(
HOWEVER, if we sort SCHED_FIFO results, we clearly see 3 "islands":
first run-time is the minimum outlier = 2.247394425
21 results in [2.253636818, 2.254797674] => 257 ppm
8 results in [2.304188238, 2.305229091] => 226 ppm
Still not as good as SCHED_OTHER but nowhere near as bad.
Questions I have no answers for at this time:
What could explain this behavior?
Why several "islands" of results when running as SCHED_FIFO 99?
Why is the variance still greater within an island?
Thanks for reading this far!! :)
Regards
Raw results for reference:
SCHED_OTHER
2.158171245
2.158127633
2.158129497
2.158109551
2.158089224
2.158515466
2.158083084
2.158127232
2.158282178
2.158135800
2.158123891
2.158045784
2.158174865
2.158141835
2.158184503
2.158118620
2.158096101
2.158150349
2.158150868
2.158076309
2.158179210
2.158134907
2.158202927
2.158108137
2.158182418
2.158088121
2.158095718
2.158165650
2.158134169
2.158119635
SCHED_FIFO
2.247394425
2.254726778
2.305229091
2.253719228
2.254658938
2.254744883
2.304188238
2.254670366
2.254683699
2.254683326
2.304284105
2.253755218
2.254756793
2.304288106
2.254756981
2.253636818
2.254797674
2.304240533
2.254630310
2.254659655
2.254696007
2.304234345
2.254739714
2.254766850
2.254620088
2.304224451
2.254608929
2.254732058
2.304227047
2.254665694
^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-04 23:45 Large(ish) variance induced by SCHED_FIFO Marc Gonzalez
@ 2025-09-05  8:25 ` Daniel Wagner
  2025-09-05 13:23   ` Marc Gonzalez
  [not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-05 8:25 UTC (permalink / raw)
To: Marc Gonzalez
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
    Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin

On Fri, Sep 05, 2025 at 01:45:31AM +0200, Marc Gonzalez wrote:
> Then I run this script as root:
> #!/bin/bash
>
> CMD="taskset -c 1 ./a.out"
> if [ "$1" = "fifo" ]; then CMD="chrt -f 99 $CMD"; fi
>
> for I in $(seq 1 30); do
> T0=$(date "+%s.%N")
> $CMD
> T1=$(date "+%s.%N")
> echo "$T1-$T0" | bc -l
> done

This setup is forking processes in order to get timestamps. This
gives you uncontrollable variance in the measurement. Use clock_gettime
inside your test program.

Also you might want to reduce the prio to 98.
* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05  8:25 ` Daniel Wagner
@ 2025-09-05 13:23   ` Marc Gonzalez
  2025-09-05 14:03     ` John Ogness
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 13:23 UTC (permalink / raw)
To: Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
    Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin

Hello Daniel :)

On 05/09/2025 10:25, Daniel Wagner wrote:

> On Fri, Sep 05, 2025 at 01:45:31AM +0200, Marc Gonzalez wrote:
>
>> Then I run this script as root:
>> #!/bin/bash
>>
>> CMD="taskset -c 1 ./a.out"
>> if [ "$1" = "fifo" ]; then CMD="chrt -f 99 $CMD"; fi
>>
>> for I in $(seq 1 30); do
>> T0=$(date "+%s.%N")
>> $CMD
>> T1=$(date "+%s.%N")
>> echo "$T1-$T0" | bc -l
>> done
>
> This setup is forking processes in order to get timestamps. This
> gives you uncontrollable variance in the measurement. Use clock_gettime
> inside your test program.

I had already tested your suggestion, and observed it did not make
any difference. I will test again nonetheless.
(I used CLOCK_MONOTONIC_RAW.)

NB: if the large variance were induced by getting timestamps outside
the program, then SCHED_OTHER runs should be similarly impacted
(however, they are not).

> Also you might want to reduce the prio to 98.

May I ask why? I'm guessing it's bad(TM) to potentially block
the migration task? I will lower the priority to 90.
Here is my current test code:

#include <stdio.h>
#include <time.h>

#define N (1 << 30)
#define G (1000*1000*1000L)

int main(void)
{
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
	for (volatile int i = 0; i < N; ++i);
	clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
	long diff = (t1.tv_sec*G + t1.tv_nsec) - (t0.tv_sec*G + t0.tv_nsec);
	printf("%ld\n", diff);
	return 0;
}

associated script:

#!/bin/bash
CMD="taskset -c 1 ./a.out"
if [ "$1" = "fifo" ]; then CMD="chrt -f 90 $CMD"; fi
for I in $(seq 1 30); do $CMD; done

The results are the same as before:

SCHED_OTHER
MIN=2156085411
MAX=2156133420
All within 11 ppm of the average (EXCELLENT RESULTS)

SCHED_FIFO
MIN=2205859200 (first run again, how strange)
MAX=2304868902
All within 21950 ppm of the average (TERRIBLE RESULTS)
21950 ppm = 2.2%

Looking at the distribution of SCHED_FIFO again, I notice a pattern:
first run-time is the minimum outlier = 2.20 seconds
21 (again!!) results around 2.25 seconds
8 (again!!) results around 2.30 seconds

BRAIN DUMP:

"First" run is meaningless. First since what? Previous batch of runs?
In that case, sleeping between runs should work around the issue?

It looks like there is some kind of 50 millisecond quantum
involved somewhere somehow.

$ grep HZ /boot/config-6.8.0-79-generic
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000

(Not sure what it means to have CONFIG_NO_HZ_FULL && CONFIG_HZ_1000)

This is driving me crazy.
I'm all ears if anyone has other suggestions.
Regards

Raw results for reference:

SCHED_OTHER
2156120840
2156110092
2156111812
2156117629
2156111235
2156104159
2156109850
2156120644
2156100701
2156085456
2156133420
2156130993
2156087623
2156120236
2156114137
2156099823
2156122243
2156095888
2156114046
2156102910
2156120532
2156102464
2156100703
2156125228
2156119672
2156110619
2156131331
2156085411
2156098780
2156121458

SCHED_FIFO
2205859200
2254828687
2254866765
2304868902
2254849969
2253908545
2254811821
2304845457
2253814019
2254834900
2304858026
2254845905
2253835350
2254815775
2304860984
2253824572
2254816525
2254800880
2303875283
2254826189
2254813961
2304850479
2254832102
2253816865
2254820633
2304847552
2253841768
2254807812
2254810662
2303848123
* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05 13:23   ` Marc Gonzalez
@ 2025-09-05 14:03     ` John Ogness
  2025-09-05 15:34       ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-05 14:03 UTC (permalink / raw)
To: Marc Gonzalez, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
    Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin

On 2025-09-05, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:

> (I used CLOCK_MONOTONIC_RAW.)

Not really relevant to your issue, but you should always use
CLOCK_MONOTONIC.

> It looks like there is some kind of 50 millisecond quantum
> involved somewhere somehow.

Did you forget to disable RT throttling?

$ sudo sysctl -w kernel.sched_rt_runtime_us=-1

If you check the kernel logs, you would see a message about throttling:

"sched: RT throttling activated"

John Ogness
* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05 14:03     ` John Ogness
@ 2025-09-05 15:34       ` Marc Gonzalez
  2025-09-05 16:07         ` John Ogness
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 15:34 UTC (permalink / raw)
To: John Ogness, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
    Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin

On 05/09/2025 16:03, John Ogness wrote:

> On 2025-09-05, Marc Gonzalez wrote:
>
>> (I used CLOCK_MONOTONIC_RAW.)
>
> Not really relevant to your issue, but you should always use CLOCK_MONOTONIC.

Could you provide some insight as to why CLOCK_MONOTONIC is preferable?

https://www.man7.org/linux/man-pages/man3/clock_gettime.3.html

CLOCK_MONOTONIC
    A non-settable system-wide clock that represents monotonic time
    since (as described by POSIX) "some unspecified point in the
    past". On Linux, that point corresponds to the number of seconds
    that the system has been running since it was booted.
    The CLOCK_MONOTONIC clock is not affected by discontinuous jumps
    in the system time (e.g., if the system administrator manually
    changes the clock), but is affected by the incremental adjustments
    performed by adjtime(3) and NTP. This clock does not count time
    that the system is suspended. All CLOCK_MONOTONIC variants
    guarantee that the time returned by consecutive calls will not go
    backwards, but successive calls may (depending on the
    architecture) return identical (not-increased) time values.

CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
    Similar to CLOCK_MONOTONIC, but provides access to a raw
    hardware-based time that is not subject to NTP adjustments or the
    incremental adjustments performed by adjtime(3). This clock does
    not count time that the system is suspended.

CLOCK_MONOTONIC_RAW seemed less susceptible to "interference" :)

> It looks like there is some kind of 50 millisecond quantum
> involved somewhere somehow.
>
> Did you forget to disable RT throttling?
>
> $ sudo sysctl -w kernel.sched_rt_runtime_us=-1
>
> If you check the kernel logs, you would see a message about throttling:
>
> "sched: RT throttling activated"

YES!!! I didn't think to look at dmesg :(

And I had looked at sched-related proc nodes...

$ grep "" /proc/sys/kernel/sched_*
/proc/sys/kernel/sched_autogroup_enabled:1
/proc/sys/kernel/sched_cfs_bandwidth_slice_us:5000
/proc/sys/kernel/sched_deadline_period_max_us:4194304
/proc/sys/kernel/sched_deadline_period_min_us:100
/proc/sys/kernel/sched_rr_timeslice_ms:100
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000
/proc/sys/kernel/sched_schedstats:0
/proc/sys/kernel/sched_util_clamp_max:1024
/proc/sys/kernel/sched_util_clamp_min:1024
/proc/sys/kernel/sched_util_clamp_min_rt_default:1024

... but my brain didn't register the obvious clue, although
it was right there: 1000000 - 950000 = 50000 us = 50 ms !!

Now everything works as expected, even in multi-user mode:

$ sudo ./run_benchmark.sh fifo
2155899243
2155930350
2155925344
2155545153
2155520262
2155520647
2155531815
2155586328
2155501088
2155887042
2155937889
2155990740
2155939605
2155509468
2155898441
2155615316
2155498933
2155968150
2155493295
2155496287
2155949449
2155488499
2155932900
2155498135
2155531294
2156047123
2156035893
2155926392
2155917489
2155482792

EXCEPT... that my mouse is sluggish as molasses when the script
runs on CPU1 (but not when it runs on other cores).
I'm probably starving an important ISR :)

[migration/0] pid 18's current affinity list: 0
[migration/1] pid 23's current affinity list: 1
[migration/2] pid 29's current affinity list: 2
[migration/3] pid 35's current affinity list: 3
[idle_inject/0] pid 19's current affinity list: 0
[idle_inject/1] pid 22's current affinity list: 1
[idle_inject/2] pid 28's current affinity list: 2
[idle_inject/3] pid 34's current affinity list: 3
[irq/9-acpi] pid 54's current affinity list: 1
[watchdogd] pid 62's current affinity list: 0-3
[irq/24-PCIe PME] pid 68's current affinity list: 2
[irq/24-pciehp] pid 69's current affinity list: 2
[irq/24-s-pciehp] pid 70's current affinity list: 2
[irq/25-PCIe PME] pid 71's current affinity list: 3
[irq/8-rtc0] pid 74's current affinity list: 0
[irq/23-ehci_hcd:usb1] pid 75's current affinity list: 1
[irq/16-ehci_hcd:usb2] pid 81's current affinity list: 2
[irq/26-xhci_hcd] pid 159's current affinity list: 1
[irq/27-ahci[0000:00:1f.2]] pid 162's current affinity list: 1
[irq/18-i801_smbus] pid 388's current affinity list: 1
[irq/29-mei_me] pid 411's current affinity list: 2
[card1-crtc0] pid 461's current affinity list: 0-3
[card1-crtc1] pid 462's current affinity list: 0-3
[irq/30-i915] pid 463's current affinity list: 3
[irq/31-snd_hda_intel:card0] pid 514's current affinity list: 0
[irq/32-snd_hda_intel:card1] pid 518's current affinity list: 1
[psimon] pid 334's current affinity list: 0-3
[psimon] pid 1075's current affinity list: 0-3

Most of the USB ISRs run on core 1, no wonder my USB mouse feels limp :)

Thanks Daniel & John for pointing me in the right direction!

Regards
* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05 15:34       ` Marc Gonzalez
@ 2025-09-05 16:07         ` John Ogness
  2025-09-05 17:40           ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-05 16:07 UTC (permalink / raw)
To: Marc Gonzalez, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
    Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin

On 2025-09-05, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:

>>> (I used CLOCK_MONOTONIC_RAW.)
>>
>> Not really relevant to your issue, but you should always use CLOCK_MONOTONIC.
>
> Could you provide some insight as to why CLOCK_MONOTONIC is
> preferable?

CLOCK_MONOTONIC is tracking in time units as defined by humans.
CLOCK_MONOTONIC_RAW is tracking in time units defined by some hardware.

John
* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05 16:07         ` John Ogness
@ 2025-09-05 17:40           ` Marc Gonzalez
  2025-09-08  9:36             ` John Ogness
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 17:40 UTC (permalink / raw)
To: John Ogness, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
    Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin

On 05/09/2025 18:07, John Ogness wrote:

> On 2025-09-05, Marc Gonzalez wrote:
>
>> Could you provide some insight as to why CLOCK_MONOTONIC is preferable?
>
> CLOCK_MONOTONIC is tracking in time units as defined by humans.
> CLOCK_MONOTONIC_RAW is tracking in time units defined by some hardware.

I'm not sure I understand what that implies.

Eventually, clock_gettime stores whatever it measured in struct timespec,
with tv_sec & tv_nsec, right?

Are you saying that if I have 2 identical timespecs,
one from CLOCK_MONOTONIC, one from CLOCK_MONOTONIC_RAW,
they might not measure the same time interval?

Are you perhaps saying that NTP can (ever so slightly) tweak
a clock to bring its frequency closer to "reality", but if
CLOCK_MONOTONIC_RAW ignores such tweaks, it might count time
slightly too fast or too slow?

Regards
* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-05 17:40           ` Marc Gonzalez
@ 2025-09-08  9:36             ` John Ogness
  2025-09-08 15:42               ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-08 9:36 UTC (permalink / raw)
To: Marc Gonzalez, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
    Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin

On 2025-09-05, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:

>>> Could you provide some insight as to why CLOCK_MONOTONIC is preferable?
>>
>> CLOCK_MONOTONIC is tracking in time units as defined by humans.
>> CLOCK_MONOTONIC_RAW is tracking in time units defined by some hardware.
>
> I'm not sure I understand what that implies.
>
> Eventually, clock_gettime stores whatever it measured in struct timespec,
> with tv_sec & tv_nsec, right?

I would phrase it differently, but generally speaking, correct.

> Are you saying that if I have 2 identical timespecs,
> one from CLOCK_MONOTONIC, one from CLOCK_MONOTONIC_RAW,
> they might not measure the same time interval?

I assume you mean 2 _sets_ of timespecs (so that you have 2 intervals
to compare). Correct.

> Are you perhaps saying that NTP can (ever so slightly) tweak
> a clock to bring its frequency closer to "reality", but if
> CLOCK_MONOTONIC_RAW ignores such tweaks, it might count time
> slightly too fast or too slow?

Correct. This is also valid for other clocks. For example, the kernel
logs (printk) and ftrace default to the CPU local clock. These
timestamps cannot be compared to CLOCK_MONOTONIC timestamps... just as
CLOCK_MONOTONIC_RAW timestamps cannot be compared to CLOCK_MONOTONIC
timestamps.

There are still reasons why CLOCK_MONOTONIC_RAW might be
interesting. For example, if you want a very stable time source to
compare intervals, but do not care so much about the real world time
lengths of those intervals (i.e. where is the greatest latency vs. what
is the value of the greatest latency). Although even here, I doubt
CLOCK_MONOTONIC_RAW has a practical advantage over CLOCK_MONOTONIC.

John
* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-08  9:36             ` John Ogness
@ 2025-09-08 15:42               ` Marc Gonzalez
  2025-09-08 16:02                 ` Daniel Wagner
  2025-09-08 16:52                 ` [EXT] Re: Large(ish) variance induced by SCHED_FIFO Rui Sousa
  0 siblings, 2 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-08 15:42 UTC (permalink / raw)
To: John Ogness, Leon Woestenberg, Daniel Wagner
Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
    Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin

On 08/09/2025 11:36, John Ogness wrote:

> There are still reasons why CLOCK_MONOTONIC_RAW might be
> interesting. For example, if you want a very stable time source to
> compare intervals, but do not care so much about the real world time
> lengths of those intervals (i.e. where is the greatest latency vs. what
> is the value of the greatest latency). Although even here, I doubt
> CLOCK_MONOTONIC_RAW has a practical advantage over CLOCK_MONOTONIC.

In fact, I'm just trying to compare the run-time of 2 minor variations
of the same program (testing micro-optimizations).

Absolute run-time is not really important; what I really want to know
is: does v2 run faster or slower than v1?
This is the framework I'm using at this point:

#include <stdio.h>
#include <time.h>

extern void my_code(void);

static long loop(int log2)
{
	int n = 1 << log2;
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < n; ++i) my_code();
	clock_gettime(CLOCK_MONOTONIC, &t1);
	long d = (t1.tv_sec - t0.tv_sec)*1000000000L + (t1.tv_nsec - t0.tv_nsec);
	long t = d >> log2;
	return t;
}

int main(void)
{
	long t, min = loop(4);
	for (int i = 0; i < 20; ++i)
		if ((t = loop(8)) < min) min = t;
	printf("MIN=%ld\n", min);
	return 0;
}

Basically:
- warm up the caches
- run my_code() 256 times && compute average run-time
- repeat 20 times to find MINIMUM average run-time

When my_code() is a trivial computational loop such as:

	mov $(1<<12), %eax
1:	dec %ecx
	dec %ecx
	dec %eax
	jnz 1b
	ret

Then running the benchmark 1000 times returns the same value 1000 times:
MIN=2737

Obviously, the program I'm working on is a bit more complex, but barely:
- no system calls, no library calls
- just simple bit twiddling
- tiny code, tiny data structures, everything fits in L1

$ size a.out
   text    data     bss     dec     hex filename
   8549     632    1072   10253    280d a.out

When I run the benchmark 1000 times, there are some large outliers:
MIN_MIN=2502
MAX_MIN=2774

NOTE: 95% of the results are below 2536.
But the top 1% (worst 10) are really bad (2646-2774)

How to get repeatable results?
Random 10% outliers break the ability to measure the impact of
micro-optimizations expected to provide 0-3% improvements :(

For reference, the script launching the benchmark does:

echo -1 > /proc/sys/kernel/sched_rt_runtime_us

for I in 0 1 2 3; do
	echo userspace > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor
done
sleep 0.25
for I in 0 1 2 3; do
	echo 3000000 > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed
done
sleep 0.25

for I in $(seq 1 1000); do
	chrt -f 99 taskset -c 2 ./a.out < $1
done

for I in 0 1 2 3; do
	echo schedutil > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor
done
echo 950000 > /proc/sys/kernel/sched_rt_runtime_us

I've run out of ideas to identify other sources of variance.
(I ran everything in single user mode with nothing else running.)
Perhaps with perf I could identify the source of stalls or bubbles?

Hoping someone can point me in the right direction.

Regards
* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-08 15:42               ` Marc Gonzalez
@ 2025-09-08 16:02                 ` Daniel Wagner
  2025-09-08 18:40                   ` Marc Gonzalez
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-08 16:02 UTC (permalink / raw)
To: Marc Gonzalez
Cc: John Ogness, Leon Woestenberg, linux-rt-users, Steven Rostedt,
    Thomas Gleixner, Sebastian Andrzej Siewior, Daniel Wagner,
    Clark Williams, Pavel Machek, Luis Goncalves, John McCalpin

On Mon, Sep 08, 2025 at 05:42:58PM +0200, Marc Gonzalez wrote:

> I've run out of ideas to identify other sources of variance.
> (I ran everything in single user mode with nothing else running.)
> Perhaps with perf I could identify the source of stalls or bubbles?
>
> Hoping someone can point me in the right direction.

There are a few guides out there, explaining the common pitfalls. I
suppose you have already found them. Just in case you missed Frederic's
excellent blog on CPU isolation:

https://www.suse.com/c/cpu-isolation-introduction-part-1/
* Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-08 16:02                 ` Daniel Wagner
@ 2025-09-08 18:40                   ` Marc Gonzalez
  2025-09-09 11:08                     ` Unexplained variance in run-time of trivial program Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-08 18:40 UTC (permalink / raw)
To: Daniel Wagner
Cc: John Ogness, Leon Woestenberg, linux-rt-users, Steven Rostedt,
    Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin

On 08/09/2025 18:02, Daniel Wagner wrote:

> On Mon, Sep 08, 2025 at 05:42:58PM +0200, Marc Gonzalez wrote:
>
>> I've run out of ideas to identify other sources of variance.
>> (I ran everything in single user mode with nothing else running.)
>> Perhaps with perf I could identify the source of stalls or bubbles?
>>
>> Hoping someone can point me in the right direction.
>
> There are few guides out there, explaining the common pitfalls. I
> suppose you have already found them. Just in case you missed Frederic's
> excellent blog on CPU isolation:
>
> https://www.suse.com/c/cpu-isolation-introduction-part-1/

Thanks for this great link!

Is nohz_full=X a superset of isolcpus=X or are they orthogonal?
(I see that /usr/sbin/irqbalance is running on my system.)

Regards
* Re: Unexplained variance in run-time of trivial program
  2025-09-08 18:40                   ` Marc Gonzalez
@ 2025-09-09 11:08                     ` Marc Gonzalez
  2025-09-09 11:21                       ` Daniel Wagner
  2025-09-09 12:34                       ` John Ogness
  1 sibling, 2 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 11:08 UTC (permalink / raw)
To: linux-rt-users
Cc: Daniel Wagner, John Ogness, Leon Woestenberg, Steven Rostedt,
    Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
    Ingo Molnar, Masami Hiramatsu

On 08/09/2025 20:40, Marc Gonzalez wrote:

> On 08/09/2025 18:02, Daniel Wagner wrote:
>
>> On Mon, Sep 08, 2025 at 05:42:58PM +0200, Marc Gonzalez wrote:
>>
>>> I've run out of ideas to identify other sources of variance.
>>> (I ran everything in single user mode with nothing else running.)
>>> Perhaps with perf I could identify the source of stalls or bubbles?
>>>
>>> Hoping someone can point me in the right direction.
>>
>> There are few guides out there, explaining the common pitfalls. I
>> suppose you have already found them. Just in case you missed Frederic's
>> excellent blog on CPU isolation:
>>
>> https://www.suse.com/c/cpu-isolation-introduction-part-1/
>
> Thanks for this great link!
>
> Is nohz_full=X a superset of isolcpus=X or are they orthogonal?

I am now more confused than ever.

Following Frederic's guide:
https://www.suse.com/c/cpu-isolation-practical-example-part-5/

If I didn't mess up, CPU3 is now supposed to be fully isolated
on my 4-core system:

$ sudo dmesg | grep "Kernel command line"
[    0.038445] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-79-generic root=UUID=2aff6d63-a8c6-4ebb-af69-9d2df2852cb0 ro log_buf_len=16M nohz_full=3 isolcpus=3 quiet splash vt.handoff=7

I ran echo "0-2" > $IRQ/smp_affinity_list for every IRQ.
IRQ=0  AFTER SMP_AFF=0-3 EFF_AFF=0
IRQ=1  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=2  AFTER SMP_AFF=0-3 EFF_AFF=
IRQ=3  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=4  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=5  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=6  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=7  AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=8  AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=9  AFTER SMP_AFF=0-2 EFF_AFF=1
IRQ=10 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=11 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=12 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=13 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=14 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=15 AFTER SMP_AFF=0-2 EFF_AFF=0-2
IRQ=16 AFTER SMP_AFF=0-2 EFF_AFF=1
IRQ=18 AFTER SMP_AFF=0-2 EFF_AFF=1
IRQ=23 AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=24 AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=25 AFTER SMP_AFF=0-2 EFF_AFF=3
IRQ=26 AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=27 AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=28 AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=29 AFTER SMP_AFF=0-2 EFF_AFF=2
IRQ=30 AFTER SMP_AFF=0-2 EFF_AFF=3
IRQ=31 AFTER SMP_AFF=0-2 EFF_AFF=0
IRQ=32 AFTER SMP_AFF=0-2 EFF_AFF=1

I note that IRQs 25 & 30 are still effectively pinned to CPU3,
despite the smp_aff setting.
$ cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
  0:          27          0          0          0  IO-APIC    2-edge      timer
  8:           0          0          0          0  IO-APIC    8-edge      rtc0
  9:           0          4          0          0  IO-APIC    9-fasteoi   acpi
 16:           0         25          0          0  IO-APIC   16-fasteoi   ehci_hcd:usb1
 18:           0          5          0          0  IO-APIC   18-fasteoi   i801_smbus
 23:           0          0         29          0  IO-APIC   23-fasteoi   ehci_hcd:usb2
 24:           0          0          0          0  PCI-MSI-0000:00:1c.0    0-edge  PCIe PME, pciehp
 25:           0          0          0          0  PCI-MSI-0000:00:1c.3    0-edge  PCIe PME
 26:       23328          0          0      37250  PCI-MSI-0000:00:1f.2    0-edge  ahci[0000:00:1f.2]
 27:       86091          0          0          0  PCI-MSI-0000:00:14.0    0-edge  xhci_hcd
 28:           0          0      51308          0  PCI-MSIX-0000:02:00.0   0-edge  enp2s0
 29:           0          0         22          0  PCI-MSI-0000:00:16.0    0-edge  mei_me
 30:           0          0          0        604  PCI-MSI-0000:00:1b.0    0-edge  snd_hda_intel:card0
 31:      198664          0          0          0  PCI-MSI-0000:00:02.0    0-edge  i915
 32:           0        631          0          0  PCI-MSI-0000:00:03.0    0-edge  snd_hda_intel:card1

My benchmark is a trivial piece of code:
- no system calls, no library calls
- just simple bit twiddling
- tiny code, tiny data structures, everything fits in L1

$ size a.out
   text    data     bss     dec     hex filename
   8549     632    1072   10253    280d a.out

Undisturbed, it runs in 2500 nanoseconds (with all cores set to tick at 3 GHz).
I run it 2^16 times, so the expected run-time is
2500*2^16 = 163840000 ns = 163.84 ms

Yet, I've observed AVERAGE run-times of 2600, 2700, and even 2800 (!!)
nanoseconds.

I captured a full trace of this occurrence with

echo *:* > $TRACING/set_event
cat $TRACING/per_cpu/cpu$CPU/trace > trace.$CPU

# tracer: nop
#
# entries-in-buffer/entries-written: 9743/9743   #P:4
#
#                              _-----=> irqs-off/BH-disabled
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| / _-=> migrate-disable
#                            |||| /     delay
#           TASK-PID   CPU#  |||||  TIMESTAMP  FUNCTION
#              | |       |   |||||     |         |
          <idle>-0     [003] dN.1.  4606.627286: cpu_idle: state=4294967295 cpu_id=3
          <idle>-0     [003] dN.1.  4606.627288: rcu_utilization: Start context switch
          <idle>-0     [003] dN.1.  4606.627288: rcu_utilization: End context switch
          <idle>-0     [003] d..2.  4606.627290: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=taskset next_pid=8129 next_prio=0
          <idle>-0     [003] d..2.  4606.627291: tlb_flush: pages:-1 reason:flush on task switch (0)
           a.out-8129  [003] ...1.  4606.627304: mm_page_free: page=000000001fef04a8 pfn=0x16a586 order=1
...
           a.out-8129  [003] d....  4606.628019: user_enter:
           a.out-8129  [003] d....  4606.811832: user_exit:

Program starts at 4606.627304
Program setup lasts until 4606.628019 (~1500 recorded events)
=> roughly 700 microseconds spent before main()
which I'm not taking into account in my measurements.

Then program ends at 4606.811832
=> 183.813 ms instead of the expected 163.84 ms

Almost exactly 20 ms in excess.
Could this be a hint?

But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.

I guess either my time source is incorrect.
(Next slide in Frederic's guide)
OR there is something wonky going on inside CPU3.

Looks like IRQs are not the issue here?
No need to investigate why I can't pin IRQs 25 & 30 elsewhere than CPU3?

Regards
* Re: Unexplained variance in run-time of trivial program
  2025-09-09 11:08 ` Unexplained variance in run-time of trivial program Marc Gonzalez
@ 2025-09-09 11:21 ` Daniel Wagner
  2025-09-09 12:42   ` Marc Gonzalez
  2025-09-09 12:34 ` John Ogness
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-09 11:21 UTC (permalink / raw)
To: Marc Gonzalez
Cc: linux-rt-users, John Ogness, Leon Woestenberg, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu

> I note that IRQs 25 & 30 are still effectively pinned to CPU3,
> despite the smp_aff setting.
>
> $ cat /proc/interrupts
>             CPU0   CPU1   CPU2   CPU3
>    0:         27      0      0      0  IO-APIC    2-edge      timer
>    8:          0      0      0      0  IO-APIC    8-edge      rtc0
>    9:          0      4      0      0  IO-APIC    9-fasteoi   acpi
>   16:          0     25      0      0  IO-APIC   16-fasteoi   ehci_hcd:usb1
>   18:          0      5      0      0  IO-APIC   18-fasteoi   i801_smbus
>   23:          0      0     29      0  IO-APIC   23-fasteoi   ehci_hcd:usb2
>   24:          0      0      0      0  PCI-MSI-0000:00:1c.0    0-edge  PCIe PME, pciehp
>   25:          0      0      0      0  PCI-MSI-0000:00:1c.3    0-edge  PCIe PME
>   26:      23328      0      0  37250  PCI-MSI-0000:00:1f.2    0-edge  ahci[0000:00:1f.2]
>   27:      86091      0      0      0  PCI-MSI-0000:00:14.0    0-edge  xhci_hcd
>   28:          0      0  51308      0  PCI-MSIX-0000:02:00.0   0-edge  enp2s0
>   29:          0      0     22      0  PCI-MSI-0000:00:16.0    0-edge  mei_me
>   30:          0      0      0    604  PCI-MSI-0000:00:1b.0    0-edge  snd_hda_intel:card0
>   31:     198664      0      0      0  PCI-MSI-0000:00:02.0    0-edge  i915
>   32:          0    631      0      0  PCI-MSI-0000:00:03.0    0-edge  snd_hda_intel:card1

Many drivers are not isolcpus aware. At least the sound driver could be
unloaded for your test, I suppose.

> Almost exactly 20 ms in excess.
> Could this be a hint?
> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.

Ensure your clock source is working correctly and...

> I guess either my time source is incorrect
> (next slide in Frederic's guide)
> OR there is something wonky going on inside CPU3.

...there is no SMI running on this CPU, and no power management running
(also check the BIOS settings).

> Looks like IRQs are not the issue here?

IRQ handling should be visible in the trace.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
  2025-09-09 11:21 ` Daniel Wagner
@ 2025-09-09 12:42   ` Marc Gonzalez
  2025-09-09 14:23     ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 12:42 UTC (permalink / raw)
To: Daniel Wagner
Cc: linux-rt-users, John Ogness, Leon Woestenberg, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu

On 09/09/2025 13:21, Daniel Wagner wrote:

>> I note that IRQs 25 & 30 are still effectively pinned to CPU3,
>> despite the smp_aff setting.
>>
>> $ cat /proc/interrupts
>> [ /proc/interrupts table quoted above trimmed for brevity ]
>
> Many drivers are not isolcpus aware. At least the sound driver could be
> unloaded for your test, I suppose.

I don't quite understand.

Do these drivers explicitly request that their ISR run on CPU3?
Why doesn't the kernel just run these ISRs on a non-isolated core?

>> Almost exactly 20 ms in excess.
>> Could this be a hint?
>> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.
>
> Ensure your clock source is working correctly and...
>
>> I guess either my time source is incorrect
>> (next slide in Frederic's guide)
>> OR there is something wonky going on inside CPU3.
>
> ...there is no SMI running on this CPU, and no power management running
> (also check the BIOS settings).

Actually, I'm 99.9% sure that clock source accuracy & SMM are red herrings.

If I replace my code with this one:

	mov $(1<<12), %eax
1:	dec %ecx
	dec %ecx
	dec %eax
	jnz 1b

which runs in ~2735 nanoseconds at 3 GHz
(2 cycles per iteration * 4096 iterations = 8192 cycles;
 8192 cycles / 3 = 2730 nanoseconds)

Running this trivial baseline benchmark 2^16 times should take
2735 ns * 2^16 = 179.241 ms

And if I run the loop 1000 times, and sort by run-time, I observe

MIN=179.374211 ms
MAX=179.406745 ms

So the worst case is only 166 microseconds worse than expected
(contrast this with 20 MILLIseconds for my code, 120 times worse).

It seems something is randomly stalling the pipeline on CPU3.
This feels like thermal throttling maybe?
Is that supposed to be logged somewhere?
But why would my program throttle & not the trivial baseline benchmark?

Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
  2025-09-09 12:42 ` Marc Gonzalez
@ 2025-09-09 14:23   ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2025-09-09 14:23 UTC (permalink / raw)
To: Marc Gonzalez
Cc: Daniel Wagner, linux-rt-users, John Ogness, Leon Woestenberg,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu

On Tue, 9 Sep 2025 14:42:54 +0200
Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:

>> ...there is no SMI running on this CPU, and no power management running
>> (also check the BIOS settings).
>
> Actually, I'm 99.9% sure that clock source accuracy & SMM are red herrings.

You may want to run the osnoise or hwlat tracer to see if SMM or
anything else the kernel does could cause issues with your tests.

-- Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
  2025-09-09 11:08 ` Unexplained variance in run-time of trivial program Marc Gonzalez
  2025-09-09 11:21 ` Daniel Wagner
@ 2025-09-09 12:34 ` John Ogness
  2025-09-09 14:08   ` Marc Gonzalez
  1 sibling, 1 reply; 30+ messages in thread
From: John Ogness @ 2025-09-09 12:34 UTC (permalink / raw)
To: Marc Gonzalez, linux-rt-users
Cc: Daniel Wagner, Leon Woestenberg, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Clark Williams, Pavel Machek,
	Luis Goncalves, John McCalpin, Frederic Weisbecker, Ingo Molnar,
	Masami Hiramatsu

On 2025-09-09, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
> I captured a full trace of this occurrence with
>   echo '*:*' > $TRACING/set_event
>   cat $TRACING/per_cpu/cpu$CPU/trace > trace.$CPU

...

> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.

I recommend using the function tracer:

  # echo function > $TRACING/current_tracer

It should not affect your timings for pure-userspace code. This should
catch any unintentional kernel activity.

John

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
  2025-09-09 12:34 ` John Ogness
@ 2025-09-09 14:08   ` Marc Gonzalez
  2025-09-09 18:33     ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 14:08 UTC (permalink / raw)
To: John Ogness, linux-rt-users
Cc: Daniel Wagner, Leon Woestenberg, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Clark Williams, Pavel Machek,
	Luis Goncalves, John McCalpin, Frederic Weisbecker, Ingo Molnar,
	Masami Hiramatsu

On 09/09/2025 14:34, John Ogness wrote:
> On 2025-09-09, Marc Gonzalez wrote:
>
>> I captured a full trace of this occurrence with
>>   echo '*:*' > $TRACING/set_event
>>   cat $TRACING/per_cpu/cpu$CPU/trace > trace.$CPU
>
> ...
>
>> But there is absolutely NOTHING traced between 4606.628019 & 4606.811832.
>
> I recommend using the function tracer:
>
>   # echo function > $TRACING/current_tracer
>
> It should not affect your timings for pure-userspace code. This should
> catch any unintentional kernel activity.

OK. This time, I used:

  echo function > $TRACING/current_tracer
  cat $TRACING/trace > trace.all

(which produced a 12.65 MB file)

I dumped a problematic run where T=2722 instead of 2500,
i.e. total run-time = 178.389 ms instead of 163.84 ms.

Looking for a.out-23683 until the log goes quiet...

  a.out-23683 [003] d.... 14939.982368: user_enter:
  a.out-23683 [003] d.... 14939.982368: vtime_user_enter <-__ct_user_enter
  a.out-23683 [003] d.... 14939.982368: get_vtime_delta <-vtime_user_enter

NO OCCURRENCE of a.out or [003] between these 2 points

  a.out-23683 [003] d.... 14940.160764: vtime_user_exit <-__ct_user_exit
  a.out-23683 [003] d.... 14940.160765: get_vtime_delta <-vtime_user_exit
  a.out-23683 [003] d.... 14940.160765: account_user_time <-vtime_user_exit

14940.160764 - 14939.982368 = 178.396 ms
(which corresponds to the total run-time)

Are the vtime_user_enter/exit events tied to clock_gettime?
(Probably not, since they're not balanced)

It's starting to look like the problem is on the CPU itself,
not the kernel interrupting the CPU.
Might be time to break out perf?

My hunch is the frequency dropped to ~2.7 GHz for the duration of the benchmark.

Maybe I should lower the frequency to 2 GHz.
But again, why would my code force throttling and not the toy code?
One possible reason is my code reaches 3.5 IPC, while the toy code
remains at 1.5 IPC (with micro-op fusion).

It's a stretch, but easy to test.

I will measure CPU cycles, to see if the increased run-time
corresponds with a change in CPU cycles.

Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
  2025-09-09 14:08 ` Marc Gonzalez
@ 2025-09-09 18:33   ` Marc Gonzalez
       [not found]     ` <CAMjhiJjMOQN-nWd+KP4JBBNHf20M+J2fXAuTTvowXctJgvGOcQ@mail.gmail.com>
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 18:33 UTC (permalink / raw)
To: John Ogness, linux-rt-users
Cc: Daniel Wagner, Leon Woestenberg, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Clark Williams, Pavel Machek,
	Luis Goncalves, John McCalpin, Frederic Weisbecker, Ingo Molnar,
	Masami Hiramatsu

On 9/9/25 16:08, Marc Gonzalez wrote:

> My hunch is the frequency dropped to ~2.7 GHz for the duration of the benchmark.
>
> Maybe I should lower the frequency to 2 GHz.
> But again, why would my code force throttling and not the toy code?
> One possible reason is my code reaches 3.5 IPC, while the toy code
> remains at 1.5 IPC (with micro-op fusion).
>
> It's a stretch, but easy to test.
>
> I will measure CPU cycles, to see if the increased run-time
> corresponds with a change in CPU cycles.

# dmesg | grep -i tsc
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 3292.563 MHz processor
[    0.038014] TSC deadline timer available
[    0.093396] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2f75db292c8, max_idle_ns: 440795321216 ns
[    0.178415] clocksource: Switched to clocksource tsc-early
[    1.272272] tsc: Refined TSC clocksource calibration: 3292.377 MHz
[    1.272280] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2f752b31fa5, max_idle_ns: 440795265170 ns
[    1.272305] clocksource: Switched to clocksource tsc

$ lscpu | grep -o '[a-z_]*tsc[a-z_]*'
tsc
rdtscp
constant_tsc
nonstop_tsc
tsc_deadline_timer
tsc_adjust

static void loop(int log2)
{
	int n = 1 << log2;
	struct timespec t0, t1;

	clock_gettime(CLK, &t0);
	long c0 = __rdtsc();
	for (int i = 0; i < n; ++i)
		my_code();
	long c1 = __rdtsc();
	clock_gettime(CLK, &t1);

	long d = (t1.tv_sec - t0.tv_sec)*1000000000L + (t1.tv_nsec - t0.tv_nsec);
	long t = d >> log2;
	long c = c1 - c0;
	long f = c*1000000/d;
	printf("D=%ld C=%ld F=%ld T=%ld N=%d\n", d, c, f, t, n);
}

(Looks like RDTSCP might have been preferable over RDTSC)

I lowered the frequency of all cores to 1.5 GHz.

MIN RUNTIME: D=327944847 C=1079700683 F=3292323 T=5004 N=65536
MAX RUNTIME: D=369923901 C=1217909284 F=3292323 T=5644 N=65536

Looks like the TSC ticks at 3.3 GHz (the core's nominal frequency)
even if I manually lower the core frequency.
I need to figure out how 'perf stat' does it.
Probably reads a different MSR.

Changing /sys/devices/system/clocksource/clocksource0/current_clocksource
to hpet doesn't really change anything.

TODO: find cycle count MSR.

^ permalink raw reply	[flat|nested] 30+ messages in thread
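Regarding the RDTSCP parenthetical above: the usual pattern is a fenced RDTSC to open the timed region and RDTSCP to close it. This is a sketch of that convention, not code from this thread; the exact fencing choices are my assumption (they follow the common recommendation of bracketing the region so neighboring instructions cannot drift into it).

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>

/* Open the timed region: LFENCE before and after RDTSC keeps earlier
 * and later instructions from being reordered across the read. */
static inline uint64_t tsc_begin(void)
{
	_mm_lfence();
	uint64_t t = __rdtsc();
	_mm_lfence();
	return t;
}

/* Close the timed region: RDTSCP waits for all preceding instructions
 * to complete before reading the counter; the trailing LFENCE stops
 * later instructions from starting before the read retires. */
static inline uint64_t tsc_end(void)
{
	unsigned aux;	/* IA32_TSC_AUX: identifies the CPU the read ran on */
	uint64_t t = __rdtscp(&aux);
	_mm_lfence();
	return t;
}
#endif
```

Note that with constant_tsc/nonstop_tsc (as in the lscpu output above), both instructions tick at the nominal frequency regardless of the current core frequency, which is exactly the behavior observed here.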
[parent not found: <CAMjhiJjMOQN-nWd+KP4JBBNHf20M+J2fXAuTTvowXctJgvGOcQ@mail.gmail.com>]
* Re: Unexplained variance in run-time of trivial program
       [not found] ` <CAMjhiJjMOQN-nWd+KP4JBBNHf20M+J2fXAuTTvowXctJgvGOcQ@mail.gmail.com>
@ 2025-09-09 20:30   ` Marc Gonzalez
  2025-09-10  7:59     ` Daniel Wagner
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-09 20:30 UTC (permalink / raw)
To: Leon Woestenberg
Cc: John Ogness, linux-rt-users, Daniel Wagner, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu

On 9/9/25 21:34, Leon Woestenberg wrote:

> Basically, you are reverse-tuning a multicore CPU system,
> that is heavily optimized for time-sharing and overall
> system throughput, back to act like bare-metal single-application.
> You can get close, but you will never reach the determinism
> of, say, a Cortex-R(ealtime) core.

This ignores the fact that I've been testing 2 programs:

- P1, toy code that just calls dec 3 times in a tight loop
- P2, a few memory accesses, everything fits in L1

I then loop P1 & P2 2^16 times to compute their average run-time.

Over thousands of runs, RUNTIME_P1 is ALWAYS the same (within 30 ppm).

Over thousands of runs, RUNTIME_P2 varies by up to 12 PERCENT (!!) = 122k ppm.

Since P1 runs fine without ANY jitter, the problem cannot be something
inherent to the system itself, such as:

- SMM
- IRQs firing randomly
- a drifting, jittery, jumping time source

The problem appears to be something inherent to P2.
(Either the whole core is slowed down, perhaps by thermal throttling,
or something is causing huge pipeline stalls.)

Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
  2025-09-09 20:30 ` Marc Gonzalez
@ 2025-09-10  7:59   ` Daniel Wagner
  2025-09-11 21:29     ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-10 7:59 UTC (permalink / raw)
To: Marc Gonzalez
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu

On Tue, Sep 09, 2025 at 10:30:49PM +0200, Marc Gonzalez wrote:

> On 9/9/25 21:34, Leon Woestenberg wrote:
>
>> Basically, you are reverse-tuning a multicore CPU system,
>> that is heavily optimized for time-sharing and overall
>> system throughput, back to act like bare-metal single-application.
>> You can get close, but you will never reach the determinism
>> of, say, a Cortex-R(ealtime) core.
>
> This ignores the fact that I've been testing 2 programs:
>
> - P1, toy code that just calls dec 3 times in a tight loop
> - P2, a few memory accesses, everything fits in L1
>
> I then loop P1 & P2 2^16 times to compute their average run-time.
>
> Over thousands of runs, RUNTIME_P1 is ALWAYS the same (within 30 ppm).
>
> Over thousands of runs, RUNTIME_P2 varies by up to 12 PERCENT (!!) = 122k ppm.

Modern CPUs are highly complex systems with their micro-architecture.
The result is non-deterministic runtime behavior. I can't tell whether
what you are measuring is in the ballpark of normal non-deterministic
runtime variance. If you don't want to read through the CPU
manufacturer's documentation, you could check whether the Spectre and
related side-channel attack papers give some clues on timing behavior?

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
  2025-09-10  7:59 ` Daniel Wagner
@ 2025-09-11 21:29   ` Marc Gonzalez
  2025-09-11 22:15     ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-11 21:29 UTC (permalink / raw)
To: Daniel Wagner
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu, agner

On 9/10/25 09:59, Daniel Wagner wrote:

> On Tue, Sep 09, 2025 at 10:30:49PM +0200, Marc Gonzalez wrote:
>
>> This ignores the fact that I've been testing 2 programs:
>>
>> - P1, toy code that just calls dec 3 times in a tight loop
>> - P2, a few memory accesses, everything fits in L1
>>
>> I then loop P1 & P2 2^16 times to compute their average run-time.
>>
>> Over thousands of runs, RUNTIME_P1 is ALWAYS the same (within 30 ppm).
>>
>> Over thousands of runs, RUNTIME_P2 varies by up to 12 PERCENT (!!) = 122k ppm.
>
> Modern CPUs are highly complex systems with their micro-architecture.
> The result is non-deterministic runtime behavior. I can't tell whether
> what you are measuring is in the ballpark of normal non-deterministic
> runtime variance. If you don't want to read through the CPU
> manufacturer's documentation, you could check whether the Spectre and
> related side-channel attack papers give some clues on timing behavior?

I am well aware of the complexity of modern micro-architectures ;)
I actually reported a bug in the original K7 micro-architecture 25 years ago.
And Agner Fog's microarchitecture.pdf is always on my bedside table.
https://www.agner.org/optimize/microarchitecture.pdf

In an attempt to figure out the origin of the chaos, I broke out perf.
I would say I found many more questions than answers...

Running a 3.3 GHz Haswell at 1.5 GHz.
https://perfmon-events.intel.com/index.html?pltfrm=haswell_server.html
https://oprofile.sourceforge.io/docs/intel-haswell-events.php

GOOD RUN
D,C,Cb,F,T,N
1311453543 4317733068 4317732972 3292326 5002 262144
 7512849605 uops_executed.core
   31445738 uops_executed.stall_cycles
  164966498 uops_retired.stall_cycles
  574110144 topdown-fetch-bubbles
   25679080 cycle_activity.cycles_no_execute
 6808318537 inst_retired.any
    6268865 int_misc.recovery_cycles
    6268818 int_misc.recovery_cycles_any
1,313036051 seconds time elapsed

BAD RUN
D,C,Cb,F,T,N
1436503721 4729437143 4729436743 3292325 5479 262144
 7647384031 uops_executed.core
  113200734 uops_executed.stall_cycles
  343460280 uops_retired.stall_cycles
  787367726 topdown-fetch-bubbles
  106926806 cycle_activity.cycles_no_execute
 6808330091 inst_retired.any
   17369470 int_misc.recovery_cycles
   17369430 int_misc.recovery_cycles_any
1,438057158 seconds time elapsed

Remarks (in no particular order):

- Run-time increased by 9.5% (125 ms / 1.311 s)

- inst_retired.any remains (nearly) constant,
  BUT uops_executed.core increased by 1.8%
  => Where are those extra uops coming from???
     Speculatively executed uops discarded on some flush event?

- 343460280 - 164966498 = 178.5M more stall cycles at the retire stage;
  at 1.5 GHz that is 119 ms.
  This alone could explain the degraded performance.

- 113200734 - 31445738 = 81.7M more stall cycles at the execute stage;
  some are probably dupes of the retire stalls.

- Not sure what the int_misc.recovery_* events measure???
  They're documented as:
  "This event counts the number of cycles spent waiting for a recovery
   after an event such as a processor nuke, JEClear, assist, hle/rtm
   abort etc."
  and
  "Core cycles the allocator was stalled due to recovery from earlier
   clear event for any thread running on the physical core (e.g.
   misprediction or memory nuke)."
  => In my case, they're probably measuring the same thing.
     Weird that the descriptions sound a bit different.

Need to read up on processor nuke, memory nuke, machine clear event,
JEClear, assist, HLE/RTM...

I'll throw in MACHINE_CLEARS.CYCLES in the next run:
"Cycles there was a Nuke. Account for both thread-specific and
 All Thread Nukes."

I would love for someone to chime in & cast some light
on my utter confusion :)

Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread
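[Editor's note] The counters above were collected over the whole process, setup code included. One way to scope a counter to just the measured loop is to wrap it with perf_event_open(2) instead of running `perf stat` on the program. This is a sketch; the raw encoding 0x02c3 (umask 0x02, event 0xc3 = MACHINE_CLEARS.MEMORY_ORDERING) is Haswell-specific and should be checked against the event tables for other models.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

/* Open a raw hardware counter on the calling thread, initially disabled. */
static int perf_raw_open(uint64_t raw_config)
{
	struct perf_event_attr pe;

	memset(&pe, 0, sizeof pe);
	pe.type = PERF_TYPE_RAW;
	pe.size = sizeof pe;
	pe.config = raw_config;		/* e.g. 0x02c3 on Haswell */
	pe.disabled = 1;
	pe.exclude_kernel = 1;
	pe.exclude_hv = 1;
	return (int)syscall(SYS_perf_event_open, &pe, 0, -1, -1, 0);
}

/* Count raw_config events across one call of fn().
 * Returns -1 if perf access is denied (see perf_event_paranoid). */
static int64_t count_event(uint64_t raw_config, void (*fn)(void))
{
	int fd = perf_raw_open(raw_config);
	int64_t count = -1;

	if (fd < 0)
		return -1;
	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	fn();				/* only this region is counted */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &count, sizeof count) != sizeof count)
		count = -1;
	close(fd);
	return count;
}
```

Counting only around the benchmark loop would answer the "measure only MCs from my code, not setup code" question raised later in the thread.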
* Re: Unexplained variance in run-time of trivial program
  2025-09-11 21:29 ` Marc Gonzalez
@ 2025-09-11 22:15   ` Marc Gonzalez
  2025-09-12  7:44     ` Daniel Wagner
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-11 22:15 UTC (permalink / raw)
To: Daniel Wagner
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu, agner

On 9/11/25 23:29, Marc Gonzalez wrote:

> - Not sure what the int_misc.recovery_* events measure???
>   They're documented as:
>   "This event counts the number of cycles spent waiting for a recovery
>    after an event such as a processor nuke, JEClear, assist, hle/rtm
>    abort etc."
>   and
>   "Core cycles the allocator was stalled due to recovery from earlier
>    clear event for any thread running on the physical core (e.g.
>    misprediction or memory nuke)."
>   => In my case, they're probably measuring the same thing.
>      Weird that the descriptions sound a bit different.
>
> Need to read up on processor nuke, memory nuke, machine clear event,
> JEClear, assist, HLE/RTM...

Daniel, seems you were spot on when mentioning side-channel attacks.

https://www.usenix.org/system/files/sec21-ragab.pdf

I need to truly read & understand this paper.

GOOD
D,C,Cb,F,T,N
1311384211 4317505417 4317505021 3292326 5002 262144
 6808311015 inst_retired.any
 7515242772 uops_executed.core
   31190466 uops_executed.stall_cycles
   25467763 cycle_activity.cycles_no_execute
  165247175 uops_retired.stall_cycles
    6335672 int_misc.recovery_cycles
     296086 MACHINE_CLEARS.CYCLES

BAD
D,C,Cb,F,T,N
1460302963 4807793027 4807792639 3292325 5570 262144
 6808333218 inst_retired.any
 7656792683 uops_executed.core
  107194058 uops_executed.stall_cycles
  102852971 cycle_activity.cycles_no_execute
  341106967 uops_retired.stall_cycles
   15062277 int_misc.recovery_cycles
    2124362 MACHINE_CLEARS.CYCLES

Random reference:
https://community.intel.com/t5/Analyzers/Regarding-machine-clears/m-p/953984

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
  2025-09-11 22:15 ` Marc Gonzalez
@ 2025-09-12  7:44   ` Daniel Wagner
  2025-09-13 10:09     ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Wagner @ 2025-09-12 7:44 UTC (permalink / raw)
To: Marc Gonzalez
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu, agner

On Fri, Sep 12, 2025 at 12:15:12AM +0200, Marc Gonzalez wrote:
> On 9/11/25 23:29, Marc Gonzalez wrote:
>
>> - Not sure what the int_misc.recovery_* events measure???
>>   [...]
>>   Need to read up on processor nuke, memory nuke, machine clear event,
>>   JEClear, assist, HLE/RTM...
>
> Daniel, seems you were spot on when mentioning side-channel attacks.

I'm sorry if I sounded patronizing, that was not my intent. After we
ruled out the OS noise, the only thing left was that the runtime variance
comes from the CPU itself.

> https://www.usenix.org/system/files/sec21-ragab.pdf
>
> I need to truly read & understand this paper.

Skimmed over it, it could explain what you are observing. The question
obviously is which part of the 'bad' code is confusing the CPU :)

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
  2025-09-12  7:44 ` Daniel Wagner
@ 2025-09-13 10:09   ` Marc Gonzalez
  2025-09-15 16:30     ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-13 10:09 UTC (permalink / raw)
To: Daniel Wagner
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu, agner

On 9/12/25 09:44, Daniel Wagner wrote:

> On Fri, Sep 12, 2025 at 12:15:12AM +0200, Marc Gonzalez wrote:
>
>>> - Not sure what the int_misc.recovery_* events measure???
>>>   [...]
>>>   Need to read up on processor nuke, memory nuke, machine clear event,
>>>   JEClear, assist, HLE/RTM...
>>
>> Daniel, seems you were spot on when mentioning side-channel attacks.
>
> I'm sorry if I sounded patronizing, that was not my intent. After we
> ruled out the OS noise, the only thing left was that the runtime variance
> comes from the CPU itself.

I appreciate all the insights you've shared thus far! :)
(As well as those from other participants!)

>> https://www.usenix.org/system/files/sec21-ragab.pdf
>>
>> I need to truly read & understand this paper.
>
> Skimmed over it, it could explain what you are observing. The question
> obviously is which part of the 'bad' code is confusing the CPU :)

GOOD RUN
D,C,Cb,F,T,N
1310881180 4315846773 4315846665 3292324 5000 262144
 6808319919 inst_retired.any
 7513464787 uops_executed.core
   30496248 uops_executed.stall_cycles
  164691876 uops_retired.stall_cycles
    6292081 int_misc.recovery_cycles
     292758 machine_clears.count
     292049 machine_clears.memory_ordering

BAD RUN
D,C,Cb,F,T,N
1417213267 4665926496 4665926108 3292324 5406 262144
 6808316250 inst_retired.any
 7614270474 uops_executed.core
   85148157 uops_executed.stall_cycles
  285284384 uops_retired.stall_cycles
   13520285 int_misc.recovery_cycles
    1536963 machine_clears.count
    1536308 machine_clears.memory_ordering

Notes:

- A "good" run gets 300k machine clears.
- A "bad" run gets 1.5M machine clears.
- All machine clears are memory ordering machine clears.
  No self-modifying code, no floating-point code.
  Probably a few MCs from page faults.
  (TODO: measure only MCs from my code, not setup code)

If I understand correctly, a memory ordering machine clear
only happens when different cores access the same memory?
I'm running single-threaded code on an otherwise idle system.
I'm thinking Haswell might lump Memory Ordering AND
Memory Disambiguation MCs into a single event.

Is that a reasonable assumption?

Not sure what's non-deterministic about the loads in my code.
Might be some kind of "catastrophic" chain of events where
one mispredict generates a cascade of subsequent mispredicts.

I will keep digging :)

Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread
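[Editor's note] On the Memory Ordering vs. Memory Disambiguation question: a single-threaded pattern that can confuse the partial-address alias check is a store immediately followed by a load exactly 4 KiB away, since both share address bits 0-11. The sketch below is my construction (not Marc's code) and is functionally trivial; the point is only the address pattern, which perf can then correlate with events such as ld_blocks_partial.address_alias or machine_clears.memory_ordering. Whether a given core reports this as an alias stall or lumps it into machine clears is model-dependent, which is exactly the assumption being questioned above.

```c
#include <stdint.h>

enum { PAGE = 4096, ITERS = 1 << 16 };

static uint8_t buf[3 * PAGE];

/* Store to one page, then load from the same offset in the next page:
 * the two addresses differ by exactly 4 KiB, so the low-12-bit check
 * used for store forwarding / memory disambiguation sees a potential
 * alias on every iteration, even though the accesses never overlap. */
static uint64_t aliasing_kernel(void)
{
	uint8_t *store_side = buf;		/* buf[0 .. PAGE-1] */
	uint8_t *load_side  = buf + PAGE;	/* same page offsets */
	uint64_t sum = 0;

	for (int i = 0; i < ITERS; i++) {
		store_side[i % PAGE] = (uint8_t)i;	/* store ... */
		sum += load_side[i % PAGE];		/* ... load 4 KiB away */
	}
	return sum;	/* load_side is never written, so sum stays 0 */
}
```

Comparing perf counters for this kernel against a variant where `load_side` is offset by, say, PAGE + 64 (breaking the low-bit match) would show whether the events in question track false aliasing.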
* Re: Unexplained variance in run-time of trivial program
  2025-09-13 10:09 ` Marc Gonzalez
@ 2025-09-15 16:30   ` Marc Gonzalez
  2025-09-16  7:50     ` Ahmed S. Darwish
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-15 16:30 UTC (permalink / raw)
To: Daniel Wagner
Cc: Leon Woestenberg, John Ogness, linux-rt-users, Steven Rostedt,
	Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
	Ingo Molnar, Masami Hiramatsu, agner

On 9/13/25 12:09, Marc Gonzalez wrote:

> GOOD RUN
> D,C,Cb,F,T,N
> 1310881180 4315846773 4315846665 3292324 5000 262144
>  6808319919 inst_retired.any
>  7513464787 uops_executed.core
>    30496248 uops_executed.stall_cycles
>   164691876 uops_retired.stall_cycles
>     6292081 int_misc.recovery_cycles
>      292758 machine_clears.count
>      292049 machine_clears.memory_ordering
>
> BAD RUN
> D,C,Cb,F,T,N
> 1417213267 4665926496 4665926108 3292324 5406 262144
>  6808316250 inst_retired.any
>  7614270474 uops_executed.core
>    85148157 uops_executed.stall_cycles
>   285284384 uops_retired.stall_cycles
>    13520285 int_misc.recovery_cycles
>     1536963 machine_clears.count
>     1536308 machine_clears.memory_ordering
>
> Notes:
>
> - A "good" run gets 300k machine clears.
> - A "bad" run gets 1.5M machine clears.
> - All machine clears are memory ordering machine clears.
>   No self-modifying code, no floating-point code.
>   Probably a few MCs from page faults.
>   (TODO: measure only MCs from my code, not setup code)
>
> If I understand correctly, a memory ordering machine clear
> only happens when different cores access the same memory?
> I'm running single-threaded code on an otherwise idle system.
> I'm thinking Haswell might lump Memory Ordering AND
> Memory Disambiguation MCs into a single event.
>
> Is that a reasonable assumption?
>
> Not sure what's non-deterministic about the loads in my code.
> Might be some kind of "catastrophic" chain of events where
> one mispredict generates a cascade of subsequent mispredicts.
>
> I will keep digging :)

I now have a better understanding of the issue :)

To recap, I'm benchmarking a fairly trivial piece of code,
but I observe a large variance in the run-time of said code
(from N to 1.15N nanoseconds).

What's more, the variance REMAINS when I run the code far longer
(by looping the same execution). Therefore:

- Either the perturbation is periodic, and adds noise whatever
  the duration of the benchmark,
- OR the perturbation only manifests in SOME RUNS.

However, some runs complete without any perturbation, even very long ones.

Runs where perturbations are observed have a much higher incidence of
MEMORY ORDERING MEMORY NUKES. These types of events happen when
different cores share memory (true sharing or false sharing).
Since my code is purely single-threaded, it must be false sharing.

At this point, I'm guessing that address randomization creates,
ONLY FOR SOME RUNS, a condition where my program lives at virtual
addresses similar to those of another program active on a different
core, which causes MEMORY NUKES.

I can work around the issue:
INSTEAD OF running 1M instances of my code in 1 program,
I RUN 1000 programs each running 1000 instances of my code,
and take the minimum run-time of the 1000.
(There is a bit more overhead in forking so many programs,
but I really need reproducible results.)

Now I observe less than 0.2% variance.

I will also try disabling address randomization, to test my hypothesis.

Has anyone ever dealt with this issue?

Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
  2025-09-15 16:30 ` Marc Gonzalez
@ 2025-09-16  7:50   ` Ahmed S. Darwish
  2025-09-25 11:38     ` Marc Gonzalez
  0 siblings, 1 reply; 30+ messages in thread
From: Ahmed S. Darwish @ 2025-09-16 7:50 UTC (permalink / raw)
To: Marc Gonzalez
Cc: Daniel Wagner, Leon Woestenberg, John Ogness, linux-rt-users,
	Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
	Clark Williams, Pavel Machek, Luis Goncalves, John McCalpin,
	Frederic Weisbecker, Ingo Molnar, Masami Hiramatsu, agner

On Mon, 15 Sep 2025, Marc Gonzalez wrote:
>
> These types of events happen when different cores share memory (true
> sharing or false sharing). Since my code is purely single-threaded, it
> must be false sharing.
>

I don't know if you're just having "fun" with the CPU or optimizing a
real-world use-case...

With that in mind, maybe the following helps: understand the topology of
your x86 CPU and turn off the cores that might be causing a "noisy
neighbor" problem.

If you are able to track it down, and your CPU supports class-of-service
resource monitoring and allocation, then check
Documentation/filesystems/resctrl.rst.

Good luck,

--
Ahmed S. Darwish
Linutronix GmbH

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Unexplained variance in run-time of trivial program
  2025-09-16  7:50               ` Ahmed S. Darwish
@ 2025-09-25 11:38                 ` Marc Gonzalez
  0 siblings, 0 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-25 11:38 UTC (permalink / raw)
  To: Ahmed S. Darwish
  Cc: Daniel Wagner, Leon Woestenberg, John Ogness, linux-rt-users,
	Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
	Clark Williams, Pavel Machek, Luis Goncalves, John McCalpin,
	Frederic Weisbecker, Ingo Molnar, Masami Hiramatsu, agner

On 16/09/2025 09:50, Ahmed S. Darwish wrote:
> On Mon, 15 Sep 2025, Marc Gonzalez wrote:
>
>> These types of events happen when different cores share memory (true
>> sharing or false sharing). Since my code is purely single-threaded, it
>> must be false sharing.
>
> I don't know if you're just having "fun" with the CPU or optimizing a
> real-world use-case...

I'm writing a sudoku solver in my free time.
So, having fun on a real-world use-case :)

> With that in mind, maybe the following helps: understand the topology of
> your x86 CPU and turn off the cores that might be causing a "noisy
> neighbor" problem.

Hmmm... I would assume a fairly simple topology.

https://www.techpowerup.com/cpu-specs/core-i5-4590.c1750
https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Desktop_processors

This CPU does not support SMT, so L1 & L2 are exclusive to each core.
L3 is shared between all cores IIUC.

> If you were able to root it down, and your CPU supports class of service
> resource monitoring and allocation, then check
> Documentation/filesystems/resctrl.rst.

I have not yet found a good way to minimize the noise from
memory-ordering machine clears (memory nukes).
(For the time being, I'm running the benchmark hundreds of times,
and taking the minimum run time.)

https://docs.kernel.org/filesystems/resctrl.html

$ grep 'l2\|l3\|llc\|mb' /proc/cpuinfo
/* returns nothing */

Looks like Haswell didn't support those nice features :(

Thanks for the suggestion.
I might update to Zen 5 or Zen 6 at some point.

Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread
* RE: [EXT] Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-08 15:42           ` Marc Gonzalez
  2025-09-08 16:02             ` Daniel Wagner
@ 2025-09-08 16:52             ` Rui Sousa
  2025-09-08 17:03               ` Marc Gonzalez
  1 sibling, 1 reply; 30+ messages in thread
From: Rui Sousa @ 2025-09-08 16:52 UTC (permalink / raw)
  To: Marc Gonzalez, John Ogness, Leon Woestenberg, Daniel Wagner
  Cc: linux-rt-users@vger.kernel.org, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

Hi Marc,

Just replace your last setup script line with:
echo -1 > /proc/sys/kernel/sched_rt_runtime_us ?

You are running a realtime task for a long while, but you have RT
throttling enabled. At some point it will trigger, and you have no
control of when. I believe you are seeing regular tasks being
scheduled in the middle of the benchmark measurement.

Disabling throttling may block your entire system while the benchmark
is running, so lower the priority a bit and use affinity to land the
benchmark process on a core that is not usually used?

Br
Rui

> -----Original Message-----
> From: Marc Gonzalez <marc.w.gonzalez@free.fr>
> Sent: Monday, September 8, 2025 17:43
> To: John Ogness <john.ogness@linutronix.de>; Leon Woestenberg
> <leon@sidebranch.com>; Daniel Wagner <dwagner@suse.de>
> Cc: linux-rt-users@vger.kernel.org; Steven Rostedt <rostedt@goodmis.org>;
> Thomas Gleixner <tglx@linutronix.de>; Sebastian Andrzej Siewior
> <bigeasy@linutronix.de>; Daniel Wagner <daniel.wagner@suse.com>; Clark
> Williams <williams@redhat.com>; Pavel Machek <pavel@denx.de>; Luis
> Goncalves <lgoncalv@redhat.com>; John McCalpin <mccalpin@tacc.utexas.edu>
> Subject: [EXT] Re: Large(ish) variance induced by SCHED_FIFO
>
> Caution: This is an external email. Please take care when clicking links
> or opening attachments. When in doubt, report the message using the
> 'Report this email' button
>
>
> On 08/09/2025 11:36, John Ogness wrote:
>
> > There are still reasons why CLOCK_MONOTONIC_RAW might be
> > interesting. For example, if you want a very stable time source to
> > compare intervals, but do not care so much about the real world time
> > lengths of those intervals (i.e. where is the greatest latency vs. what
> > is the value of the greatest latency). Although even here, I doubt
> > CLOCK_MONOTONIC_RAW has a practical advantage over CLOCK_MONOTONIC.
>
> In fact, I'm just trying to compare the run-time of 2 minor
> variations of the same program (testing micro-optimizations).
>
> Absolute run-time is not really important, what I really want
> to know is: does v2 run faster or slower than v1?
>
> This is the framework I'm using at this point:
>
> #include <stdio.h>
> #include <time.h>
> extern void my_code(void);
>
> static long loop(int log2)
> {
> 	int n = 1 << log2;
> 	struct timespec t0, t1;
> 	clock_gettime(CLOCK_MONOTONIC, &t0);
> 	for (int i = 0; i < n; ++i) my_code();
> 	clock_gettime(CLOCK_MONOTONIC, &t1);
> 	long d = (t1.tv_sec - t0.tv_sec)*1000000000L + (t1.tv_nsec - t0.tv_nsec);
> 	long t = d >> log2;
> 	return t;
> }
>
> int main(void)
> {
> 	long t, min = loop(4);
> 	for (int i = 0; i < 20; ++i)
> 		if ((t = loop(8)) < min) min = t;
> 	printf("MIN=%ld\n", min);
> 	return 0;
> }
>
> Basically:
> - warm up the caches
> - run my_code() 256 times && compute average run-time
> - repeat 20 times to find MINIMUM average run-time
>
> When my_code() is a trivial computational loop such as:
>
> 	mov $(1<<12), %eax
> 1:	dec %ecx
> 	dec %ecx
> 	dec %eax
> 	jnz 1b
> 	ret
>
> Then running the benchmark 1000 times returns the same value 1000 times:
> MIN=2737
>
>
> Obviously, the program I'm working on is a bit more complex, but barely:
> - no system calls, no library calls
> - just simple bit twiddling
> - tiny code, tiny data structures, everything fits in L1
>
> $ size a.out
>    text	   data	    bss	    dec	    hex	filename
>    8549	    632	   1072	  10253	   280d	a.out
>
> When I run the benchmark 1000 times, there are some large outliers:
> MIN_MIN=2502
> MAX_MIN=2774
>
> NOTE: 95% of the results are below 2536.
> But the top 1% (worst 10) are really bad (2646-2774)
>
> How to get repeatable results?
>
> Random 10% outliers break the ability to measure the impact
> of micro-optimizations expected to provide 0-3% improvements :(
>
> For reference, the script launching the benchmark does:
>
> echo -1 > /proc/sys/kernel/sched_rt_runtime_us
> for I in 0 1 2 3; do echo userspace > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor; done
> sleep 0.25
> for I in 0 1 2 3; do echo 3000000 > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed; done
> sleep 0.25
>
> for I in $(seq 1 1000); do
> 	chrt -f 99 taskset -c 2 ./a.out < $1
> done
>
> for I in 0 1 2 3; do
> 	echo schedutil > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor
> done
> echo 950000 > /proc/sys/kernel/sched_rt_runtime_us
>
>
> I've run out of ideas to identify other sources of variance.
> (I ran everything in single user mode with nothing else running.)
> Perhaps with perf I could identify the source of stalls or bubbles?
>
> Hoping someone can point me in the right direction.
>
> Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [EXT] Re: Large(ish) variance induced by SCHED_FIFO
  2025-09-08 16:52             ` [EXT] Re: Large(ish) variance induced by SCHED_FIFO Rui Sousa
@ 2025-09-08 17:03               ` Marc Gonzalez
  0 siblings, 0 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-08 17:03 UTC (permalink / raw)
  To: Rui Sousa, John Ogness, Leon Woestenberg, Daniel Wagner
  Cc: linux-rt-users@vger.kernel.org, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

On 08/09/2025 18:52, Rui Sousa wrote:

> Just replace your last setup script line with:
> echo -1 > /proc/sys/kernel/sched_rt_runtime_us

The launch script consists of:

echo -1 > /proc/sys/kernel/sched_rt_runtime_us
# RUN RT BENCHMARK
echo 950000 > /proc/sys/kernel/sched_rt_runtime_us

https://www.kernel.org/doc/html/latest/scheduler/sched-rt-group.html
"A run time of -1 specifies runtime == period, ie. no limit."

1) Disable RT throttling
2) Run RT benchmark
3) Enable RT throttling

> I believe you are seeing regular tasks being scheduled
> in the middle of the benchmark measurement.

I don't think so. RT throttling is disabled.
The benchmark runs as SCHED_FIFO 99.
Nothing can interrupt it, except interrupts :)

Am checking Daniel's suggestion.

Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread
[parent not found: <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>]
* Re: Large(ish) variance induced by SCHED_FIFO
       [not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
@ 2025-09-05 21:05   ` Marc Gonzalez
  0 siblings, 0 replies; 30+ messages in thread
From: Marc Gonzalez @ 2025-09-05 21:05 UTC (permalink / raw)
  To: Leon Woestenberg
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Daniel Wagner, Clark Williams,
	Pavel Machek, Luis Goncalves, John McCalpin

Hello Leon :)

On 05/09/2025 18:18, Leon Woestenberg wrote:
> On Fri, 5 Sep 2025 at 01:45, Marc Gonzalez wrote:
>
>> I am trying to get reproducible / deterministic / low variance results
>> using a vendor kernel (6.8.0-79-generic #79-Ubuntu SMP PREEMPT_DYNAMIC).
>
> Is that your end goal or is your question maybe an XY question?
> If so please describe X.

I need a rudimentary (but accurate) benchmark framework, to measure
the impact of micro-optimizations on code. These optimizations are
expected to make programs 0-3% faster, so the framework should have
the lowest variance possible, otherwise I can't discern the result
from the noise.

> Or just include the measurement inside the program using CLOCK_MONOTONIC.

David suggested that as well, but it was not the source of the problem.
John pointed out the culprit: /proc/sys/kernel/sched_rt_runtime_us
https://www.kernel.org/doc/html/latest/scheduler/sched-rt-group.html

Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread
end of thread, other threads:[~2025-09-25 11:38 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-04 23:45 Large(ish) variance induced by SCHED_FIFO Marc Gonzalez
2025-09-05 8:25 ` Daniel Wagner
2025-09-05 13:23 ` Marc Gonzalez
2025-09-05 14:03 ` John Ogness
2025-09-05 15:34 ` Marc Gonzalez
2025-09-05 16:07 ` John Ogness
2025-09-05 17:40 ` Marc Gonzalez
2025-09-08 9:36 ` John Ogness
2025-09-08 15:42 ` Marc Gonzalez
2025-09-08 16:02 ` Daniel Wagner
2025-09-08 18:40 ` Marc Gonzalez
2025-09-09 11:08 ` Unexplained variance in run-time of trivial program Marc Gonzalez
2025-09-09 11:21 ` Daniel Wagner
2025-09-09 12:42 ` Marc Gonzalez
2025-09-09 14:23 ` Steven Rostedt
2025-09-09 12:34 ` John Ogness
2025-09-09 14:08 ` Marc Gonzalez
2025-09-09 18:33 ` Marc Gonzalez
[not found] ` <CAMjhiJjMOQN-nWd+KP4JBBNHf20M+J2fXAuTTvowXctJgvGOcQ@mail.gmail.com>
2025-09-09 20:30 ` Marc Gonzalez
2025-09-10 7:59 ` Daniel Wagner
2025-09-11 21:29 ` Marc Gonzalez
2025-09-11 22:15 ` Marc Gonzalez
2025-09-12 7:44 ` Daniel Wagner
2025-09-13 10:09 ` Marc Gonzalez
2025-09-15 16:30 ` Marc Gonzalez
2025-09-16 7:50 ` Ahmed S. Darwish
2025-09-25 11:38 ` Marc Gonzalez
2025-09-08 16:52 ` [EXT] Re: Large(ish) variance induced by SCHED_FIFO Rui Sousa
2025-09-08 17:03 ` Marc Gonzalez
[not found] ` <CAMjhiJiVDmnx+pgpvtAy=KarHexjsYs+T9tSkpGvppjqFdtmiw@mail.gmail.com>
2025-09-05 21:05 ` Marc Gonzalez
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox