public inbox for linux-rt-users@vger.kernel.org
* Unexplained variance in run-time of simple program (part 2)
From: Marc Gonzalez @ 2026-03-26 15:24 UTC
  To: linux-rt-users
  Cc: Leon Woestenberg, John Ogness, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Clark Williams, Pavel Machek,
	Luis Goncalves, John McCalpin, Frederic Weisbecker, Ingo Molnar,
	Masami Hiramatsu, Ahmed S. Darwish, agner, Dirk Beyer,
	Philipp Wendler

Hello (again) everyone,

Past discussion:
Large(ish) variance induced by SCHED_FIFO / Unexplained variance in run-time of trivial program
https://lore.kernel.org/linux-rt-users/0d87e3c3-8de1-4d98-802e-a292f63f1bf1@free.fr/

SYNOPSIS:
I have a simple(*) program.
I want to know how long the program runs.

(*) By simple, I mean:
- no system calls, no library calls, just simple bit twiddling
- tiny code, small(ish) dataset
(the main function uses ~900 bytes of stack & recurses 40-60 times)

GOAL: Run the program 25,000 times. Get the SAME(ish) cycle count 25,000 times.

Running kernel v6.8 on Haswell i5-4590 3.3 GHz

I have removed "all" sources of noise / jitter / variance in the system:

A) kernel boots with:
threadirqs irqaffinity=0-2 nohz=on nohz_full=3 isolcpus=3 rcu_nocbs=3 nosmt mitigations=off single
i.e.
- Expose ISRs as regular processes
- No ISRs on CPU3
- No timer interrupt on CPU3
- No RCU callbacks on CPU3
- 1 thread per core
- No side-channel mitigations
- Single user mode, no GUI, only 1 terminal
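
To confirm that the parameters in A) actually reached the running kernel, the command line can be read back from /proc/cmdline. What follows is only a minimal sketch: has_param() and missing_boot_params() are hypothetical helper names, not part of the benchmark. It checks that each parameter is present as a whole token. It does not check that the kernel honoured it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if `param` appears as a whole space-separated token in `cmdline`. */
static bool has_param(const char *cmdline, const char *param)
{
	assert(cmdline != NULL && param != NULL);
	size_t len = strlen(param);
	if (len == 0)
		return false;
	for (const char *p = cmdline; (p = strstr(p, param)) != NULL; p++) {
		bool starts = (p == cmdline) || p[-1] == ' ';
		bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
		if (starts && ends)
			return true;
	}
	return false;
}

/* Number of expected parameters absent from /proc/cmdline, or -1 on I/O error. */
static int missing_boot_params(const char *const *expected, size_t n)
{
	char buf[4096];
	FILE *f = fopen("/proc/cmdline", "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof buf, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	int missing = 0;
	for (size_t i = 0; i < n; i++) {
		if (!has_param(buf, expected[i])) {
			fprintf(stderr, "missing boot parameter: %s\n", expected[i]);
			missing++;
		}
	}
	return missing;
}
```

Token-exact matching matters here: a plain substring search would accept nohz_full=31 when looking for nohz_full=3.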

B) before program runs:
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
for I in 0 1 2 3; do echo userspace > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor; done
for I in 0 1 2 3; do echo   2000000 > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed; done
sleep 0.5
i.e.
- Let SCHED_FIFO program monopolize a CPU
- Pin CPU frequency to 2 GHz to avoid thermal throttling & disable turbo-boost
- Give these settings time to settle

C) start the benchmark:
for I in $(seq 1 25000); do chrt -f 99 taskset -c 3 ./bench; done
i.e.
- Run as SCHED_FIFO 99 = nothing can interrupt the benchmark
- Run the program on isolated CPU 3 where nothing else is running
$ ps -eo psr,cls,pri,cmd --sort psr,pri
  3  FF 139 [migration/3]
  3  FF  90 [idle_inject/3]
  3  TS  19 [cpuhp/3]
  3  TS  19 [ksoftirqd/3]
  3  TS  19 [kworker/3:0-events]
  3  TS  19 [kworker/3:1]
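
For reference, the affinity and scheduling policy that chrt and taskset apply in C) can also be requested from inside the program. This is a rough sketch, not what the benchmark does: pin_to_cpu(), make_fifo() and isolate_self() are hypothetical names, and the calls need the same privileges as chrt -f.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single CPU. Returns 0 on success, -1 on error. */
static int pin_to_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof set, &set);
}

/* Switch the calling thread to SCHED_FIFO at the given priority (1..99). */
static int make_fifo(int prio)
{
	struct sched_param sp = { .sched_priority = prio };
	return sched_setscheduler(0, SCHED_FIFO, &sp);
}

/* Rough in-process equivalent of `chrt -f <prio> taskset -c <cpu>`. */
static int isolate_self(int cpu, int prio)
{
	if (pin_to_cpu(cpu) != 0)
		return -1;
	return make_fifo(prio);
}
```

Calling isolate_self(3, 99) at the top of main() would mirror the command line in C). Both system calls set errno on failure, so perror() can report why a request was refused.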

D) prepare to run the timed code:
	u64 v[1+4];
	int main_fd = open_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, -1);
	open_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, main_fd);
	open_event(PERF_TYPE_RAW, UOPS_EXECUTED, main_fd);
	open_event(PERF_TYPE_RAW, EXEC_STALLS, main_fd);

	void *ctx = init_ctx();
	solve_grid(ctx); // warm up all types of caches

	ioctl(main_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
	solve_grid(ctx);
	/* read() returns ssize_t: compare as signed so that -1 (error) is caught */
	if (read(main_fd, v, sizeof v) != (ssize_t)sizeof v) return 2;

	printf("%lu %lu %lu %lu\n", v[1], v[2], v[3], v[4]);

- PERF_EVENT_IOC_RESET resets all counters to 0, so we're only measuring the actual program, not any setup/teardown system code.
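
The open_event() helper is not shown in D). One plausible shape for it, given the three-argument calls and the u64 v[1+4] read buffer, is a thin wrapper around perf_event_open(2) with PERF_FORMAT_GROUP. This is a sketch under those assumptions: the attribute choices (exclude_kernel, exclude_hv) and the read_group() helper are guesses, not the code that was actually run.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open one counter for the calling thread on whatever CPU it runs on.
 * group_fd == -1 creates a group leader; otherwise the event joins that group. */
static int open_event(uint32_t type, uint64_t config, int group_fd)
{
	struct perf_event_attr attr;
	memset(&attr, 0, sizeof attr);
	attr.size = sizeof attr;
	attr.type = type;
	attr.config = config;
	attr.read_format = PERF_FORMAT_GROUP; /* one read() returns nr + all values */
	attr.exclude_kernel = 1;              /* assumption: count user space only */
	attr.exclude_hv = 1;
	return (int)syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
}

/* Read a whole group: out[0] = number of events, out[1..n_events] = values.
 * Returns 0 on success, -1 on a short read, an error, or a count mismatch. */
static int read_group(int leader_fd, uint64_t *out, size_t n_events)
{
	assert(out != NULL && n_events > 0);
	size_t want = (1 + n_events) * sizeof *out;
	ssize_t got = read(leader_fd, out, want);
	if (got != (ssize_t)want || out[0] != n_events)
		return -1;
	return 0;
}
```

With this layout v[0] holds the event count (4 here), which is why the printf in D) starts at v[1].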

The results are unexpected, disappointing, frustrating...


  AA     BB     CC    DD
$ head -5 sorted.RES.5
108018 186124 256147 23195
108412 186124 257228 23275
108637 186124 258963 23245
109103 186124 258598 23507
109167 186124 259715 23425

$ tail -5 sorted.RES.5
123824 186124 266546 30949
124755 186122 266494 31749
124773 186124 264435 30966
126273 186122 267967 32376
130967 186124 284301 33597

AA = PERF_COUNT_HW_CPU_CYCLES
BB = PERF_COUNT_HW_INSTRUCTIONS
CC = UOPS_EXECUTED
DD = EXEC_STALLS

The program seems to run in ~108k cycles at best, but unexplained perturbations can delay
it by up to 23k cycles = 21% (108k + 23k = 131k in the worst observed case).

BEST CASE vs WORST CASE
108018 186124 256147 23195
130967 186124 284301 33597

Run-time: +21%
I_count: identical
uop_count: +11%
exec_stalls: +45%
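
The percentages above can be re-derived from the two rows. A small sketch, with the best and worst rows copied from the results and hypothetical helper names (pct_increase, ipc, print_comparison):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Relative increase of `worst` over `best`, in percent. */
static double pct_increase(uint64_t best, uint64_t worst)
{
	assert(best > 0);
	return 100.0 * ((double)worst - (double)best) / (double)best;
}

/* Instructions retired per core cycle. */
static double ipc(uint64_t instructions, uint64_t cycles)
{
	assert(cycles > 0);
	return (double)instructions / (double)cycles;
}

static void print_comparison(void)
{
	/* Columns: cycles, instructions, uops executed, execution stalls. */
	const uint64_t best[4]  = { 108018, 186124, 256147, 23195 };
	const uint64_t worst[4] = { 130967, 186124, 284301, 33597 };
	static const char *const name[4] = {
		"cycles", "instructions", "uops_executed", "exec_stalls"
	};

	for (int i = 0; i < 4; i++)
		printf("%-14s %+6.1f%%\n", name[i], pct_increase(best[i], worst[i]));
	printf("IPC: best %.2f, worst %.2f\n",
	       ipc(best[1], best[0]), ipc(worst[1], worst[0]));
}
```

The computed increases round to the +21%, +11% and +45% quoted above. The instruction count is unchanged, so the whole slowdown shows up as a drop in instructions per cycle.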

I don't see these wild deviations when I test toy programs that don't touch memory
or only touch 1 word on the stack. So this seems to be memory-related?
But everything fits in L1...
Could some activity on other CPUs be forcing cache-coherence shenanigans?
I'm stumped :(
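
One way to probe the memory hypothesis would be to count L1D read misses alongside the existing events. The generic cache events of perf_event_open(2) use a packed config value. The sketch below only builds that value. hw_cache_config() and l1d_read_miss_config() are hypothetical names, and whether an extra event still fits in the counter group on this CPU is not checked here.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>

/* Encode a PERF_TYPE_HW_CACHE config: cache id | (op << 8) | (result << 16). */
static uint64_t hw_cache_config(unsigned cache, unsigned op, unsigned result)
{
	assert(cache < PERF_COUNT_HW_CACHE_MAX);
	assert(op < PERF_COUNT_HW_CACHE_OP_MAX);
	assert(result < PERF_COUNT_HW_CACHE_RESULT_MAX);
	return (uint64_t)cache | ((uint64_t)op << 8) | ((uint64_t)result << 16);
}

/* Config value for "L1 data cache, read accesses, misses". */
static uint64_t l1d_read_miss_config(void)
{
	return hw_cache_config(PERF_COUNT_HW_CACHE_L1D,
	                       PERF_COUNT_HW_CACHE_OP_READ,
	                       PERF_COUNT_HW_CACHE_RESULT_MISS);
}
```

Assuming open_event() takes (type, config, group_fd) as the calls in D) suggest, the event could be added with open_event(PERF_TYPE_HW_CACHE, l1d_read_miss_config(), main_fd), with the read buffer grown by one slot.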

Would appreciate any insight.
Will re-read the previous thread for anything I might have missed.

Regards

