* Unexplained variance in run-time of simple program (part 2)
From: Marc Gonzalez @ 2026-03-26 15:24 UTC (permalink / raw)
To: linux-rt-users
Cc: Leon Woestenberg, John Ogness, Steven Rostedt, Thomas Gleixner,
Sebastian Andrzej Siewior, Clark Williams, Pavel Machek,
Luis Goncalves, John McCalpin, Frederic Weisbecker, Ingo Molnar,
Masami Hiramatsu, Ahmed S. Darwish, Agner Fog, Dirk Beyer,
Philipp Wendler
Hello (again) everyone,
Past discussion:
Large(ish) variance induced by SCHED_FIFO / Unexplained variance in run-time of trivial program
https://lore.kernel.org/linux-rt-users/0d87e3c3-8de1-4d98-802e-a292f63f1bf1@free.fr/
SYNOPSIS:
I have a simple(*) program.
I want to know how long the program runs.
(*) By simple, I mean:
- no system calls, no library calls, just simple bit twiddling
- tiny code, small(ish) dataset
(the main function uses ~900 bytes of stack & recurses 40-60 times)
GOAL: Run the program 25,000 times. Get the SAME(ish) cycle count 25,000 times.
Running kernel v6.8 on Haswell i5-4590 3.3 GHz
I have removed "all" sources of noise / jitter / variance in the system:
A) kernel boots with:
threadirqs irqaffinity=0-2 nohz=on nohz_full=3 isolcpus=3 rcu_nocbs=3 nosmt mitigations=off single
i.e.
- Expose ISRs as regular processes
- No ISRs on CPU3
- No timer interrupt on CPU3
- No RCU callbacks on CPU3
- 1 thread per core
- No side-channel mitigations
- Single user mode, no GUI, only 1 terminal
B) before program runs:
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
for I in 0 1 2 3; do echo userspace > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor; done
for I in 0 1 2 3; do echo 2000000 > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed; done
sleep 0.5
i.e.
- Let SCHED_FIFO program monopolize a CPU
- Pin CPU frequency to 2 GHz to avoid thermal throttling & disable turbo-boost
- Give these settings time to settle
C) start the benchmark:
for I in $(seq 1 25000); do chrt -f 99 taskset -c 3 ./bench; done
i.e.
- Run as SCHED_FIFO 99 = nothing (except a few per-CPU kernel threads, e.g. migration/3) can preempt the benchmark
- Run the program on isolated CPU 3 where nothing else is running
$ ps -eo psr,cls,pri,cmd --sort psr,pri
3 FF 139 [migration/3]
3 FF 90 [idle_inject/3]
3 TS 19 [cpuhp/3]
3 TS 19 [ksoftirqd/3]
3 TS 19 [kworker/3:0-events]
3 TS 19 [kworker/3:1]
D) prepare to run the timed code:
u64 v[1+4]; // v[0] = number of events (PERF_FORMAT_GROUP read layout), v[1..4] = counter values
int main_fd = open_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, -1);
open_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, main_fd);
open_event(PERF_TYPE_RAW, UOPS_EXECUTED, main_fd);
open_event(PERF_TYPE_RAW, EXEC_STALLS, main_fd);
void *ctx = init_ctx();
solve_grid(ctx); // warm up all types of caches
ioctl(main_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
solve_grid(ctx);
if (read(main_fd, v, sizeof v) != sizeof v) return 2; // != not <: read() returns ssize_t, and -1 compared against size_t would go unnoticed
printf("%lu %lu %lu %lu\n", v[1], v[2], v[3], v[4]);
- PERF_EVENT_IOC_RESET resets all counters to 0, so we're only measuring the actual program, not any setup/teardown system code.
The results are unexpected, disappointing, frustrating...
AA BB CC DD
$ head -5 sorted.RES.5
108018 186124 256147 23195
108412 186124 257228 23275
108637 186124 258963 23245
109103 186124 258598 23507
109167 186124 259715 23425
$ tail -5 sorted.RES.5
123824 186124 266546 30949
124755 186122 266494 31749
124773 186124 264435 30966
126273 186122 267967 32376
130967 186124 284301 33597
AA = PERF_COUNT_HW_CPU_CYCLES
BB = PERF_COUNT_HW_INSTRUCTIONS
CC = UOPS_EXECUTED
DD = EXEC_STALLS
It seems the program runs in ~108k cycles, but unexplained perturbations can delay
the program by up to 23k cycles = 21% (108k + 23k = 131k in the worst observed case)
BEST CASE vs WORST CASE
108018 186124 256147 23195
130967 186124 284301 33597
Run-time: +21%
I_count: identical
uop_count: +11%
exec_stalls: +45%
I don't see these wild deviations when I test toy programs that don't touch memory
or only touch 1 word on the stack. So this seems to be memory-related?
But everything fits in L1...
Could there be some activity on other CPUs that force cache-coherence shenanigans?
I'm stumped :(
Would appreciate any insight.
Will re-read the previous thread for anything I might have missed.
Regards
* Re: Unexplained variance in run-time of simple program (part 2)
From: Marc Gonzalez @ 2026-03-26 19:09 UTC (permalink / raw)
To: linux-rt-users
Cc: Daniel Wagner, Leon Woestenberg, John Ogness, Steven Rostedt,
    Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
    Ingo Molnar, Masami Hiramatsu, Ahmed S. Darwish, Agner Fog,
    Dirk Beyer, Philipp Wendler

[ Add Daniel Wagner + use different address for John McCalpin ]

On 26/03/2026 16:24, Marc Gonzalez wrote:
> [ original message quoted in full; snipped ]
* Re: Unexplained variance in run-time of simple program (part 2)
From: Marc Gonzalez @ 2026-04-07 0:38 UTC (permalink / raw)
To: linux-rt-users
Cc: Daniel Wagner, Leon Woestenberg, John Ogness, Steven Rostedt,
    Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
    Pavel Machek, Luis Goncalves, John McCalpin, Frederic Weisbecker,
    Ingo Molnar, Masami Hiramatsu, Ahmed S. Darwish, Agner Fog,
    Dirk Beyer, Philipp Wendler, Matt Godbolt

On 26/03/2026 16:24, Marc Gonzalez wrote:
> Past discussion:
> Large(ish) variance induced by SCHED_FIFO / Unexplained variance in run-time of trivial program
> https://lore.kernel.org/linux-rt-users/0d87e3c3-8de1-4d98-802e-a292f63f1bf1@free.fr/
>
> SYNOPSIS:
> I have a simple(*) program.
> I just want to know how long the program takes to run.

I probably need to start from the absolutely MOST simple program
possible, then work from there.

spin:
	mov ecx, 1
	shl ecx, 10
	xor eax, eax
loop:
	times 60 inc eax
	dec ecx
	jnz loop
	ret

I.e. just a long dependency chain of 61440 increment instructions.

Experimental setup:

- Boot kernel 6.8 with
  nohz_full=3 rcu_nocbs=3 isolcpus=nohz,domain,managed_irq,3 irqaffinity=0-2 nosmt mitigations=off nosoftlockup tsc=reliable log_buf_len=16M single

- Prepare system with:
  echo -1 > /proc/sys/kernel/sched_rt_runtime_us
  for I in 0 1 2 3; do echo userspace > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_governor; done
  for I in 0 1 2 3; do echo 2000000 > /sys/devices/system/cpu/cpu$I/cpufreq/scaling_setspeed; done

- Call spin() 10M times, recording the following events every time:
  HW_CPU_CYCLES, HW_INSTRUCTIONS, UOPS_EXECUTED, EXEC_STALLS

Cycle-count distribution:

63500: 470202
63625: 8747244
63750: 770281
63875: 12143
64000: 105
64125: 16
64250: 3
64375: 2
64500: 1
64875: 1
68375: 1
91250: 1

This looks good, as far as I can tell.

 4.70% within [63500, 63625[
87.47% within [63625, 63750[
 7.70% within [63750, 63875[
 0.12% within [63875, 64000[

Covers 99.99% of samples.

Therefore, I think I would get very stable results by simply:
- running 100 iterations of the code
- discarding the worst 10 (20? 50?) outliers (what about the best outliers?)
- taking the arithmetic mean (or the median?)

I note that the benchmark overhead itself seems to be ~3000 cycles
(ioctl to reset the event counters + read to copy the event counters
to user space). 3000 cycles is a whopping 5% of what I'm trying to
measure. It might make sense to call spin() a few times (2? 4? 10?)
to lower the overhead's impact...

As always, happy to read anyone's input / insight into the process :)

Regards
* Re: Unexplained variance in run-time of simple program (part 2)
From: Marc Gonzalez @ 2026-04-07 13:52 UTC (permalink / raw)
To: John D. McCalpin
Cc: linux-rt-users, Daniel Wagner, Leon Woestenberg, John Ogness,
    Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
    Clark Williams, Pavel Machek, Luis Goncalves, Frederic Weisbecker,
    Ingo Molnar, Masami Hiramatsu, Ahmed S. Darwish, Agner Fog,
    Dirk Beyer, Philipp Wendler, Matt Godbolt

Hello Doctor Bandwidth,

I was secretly hoping you would chime in! :)

You've moved to Spain if I understand correctly?

On 07/04/2026 10:37, John D. McCalpin wrote:

> Going back to the example discussed [in 2025-09], it looks like
> the benchmark takes ~2500 ns and that you are executing the
> benchmark 2^16 times. Does that mean you are launching a new
> process 2^16 times? If so, exactly what mechanism are you using?
> The variation you are reporting of 0.1 to 0.3 microseconds per
> execution seems very small to me. Are you sure that the variation
> is in the "execution time", or could it be associated with the
> launching of the execution? If the benchmark code is being
> executed 2^16 times in a loop in a single execution, then the
> situation is quite different and performance counters should be
> very effective at determining the cause of the variability.

I understand that process creation is a costly operation.
I do run the benchmark several times from the same process.
(I plan on revisiting your above remarks after I digest them.)

> For the performance counter side of things, I would recommend
> programming the performance counters externally to the benchmark
> process and using inline RDPMC instructions to get the counter
> values (inside the benchmark executable).

Why do you suggest programming the PMCs from outside the benchmark
process? (Is it perhaps because it's simpler than using the kernel API?)

It never occurred to me that I could read the PMCs from user-space!

This is what I've been doing until now:

int open_event(u64 type, u64 config, int fd)
{
	struct perf_event_attr attr = {
		.type = type,
		.size = sizeof(attr),
		.config = config,
		.read_format = PERF_FORMAT_GROUP,
	};

	return syscall(SYS_perf_event_open, &attr, 0, 3, fd, 0);
}

int main_fd = open_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, -1);
open_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, main_fd);
open_event(PERF_TYPE_RAW, UOPS_EXECUTED, main_fd);
open_event(PERF_TYPE_RAW, EXEC_STALLS, main_fd);

for (int i = 0; i < 1000000; ++i)
{
	ioctl(main_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
	for (int n = 0; n < N; ++n) spin(ctx);
	if (read(main_fd, v, sizeof v) != sizeof v) return 2;
	printf("%lu %lu %lu %lu\n", v[1]/N, v[2]/N, v[3]/N, v[4]/N);
}

The problem with that technique is that the PMCs are "polluted" by
the exit from ioctl & the entry into read.

Reading the PMCs from user-space with RDPMC solves this issue!

Found this interesting suggestion from 10 years ago:
https://community.intel.com/t5/Software-Tuning-Performance/How-to-read-performance-counters-by-rdpmc-instruction/m-p/1009043
I think you might be familiar with the responder ;)

> I am not certain what the lowest overhead mechanism might be for
> getting those performance counter values out of the program — I
> would probably try attaching to a persistent System V shared memory
> segment and simply storing the values in memory for later
> post-processing.

Are you implying that simply using printf might disturb the caches
from the write calls? (I redirect the output to a tmpfs.)

> On Intel SKX processors I measured the overhead of an RDPMC
> instruction at as low as ~25 cycles, with RDTSCP instructions taking
> a little bit longer (~40 cycles), but most importantly inline RDPMC
> does not involve any uncontrolled code paths or any hardware
> accesses outside the core. RDTSCP is similar, but might be
> accessing off-core resources — the timing depends on both the core
> and uncore clock frequencies.

Thanks for the great suggestion. I'll be reading up on RDPMC.

Why do you think I would need RDTSCP?
At the moment, I just pin all cores at 2 GHz & count cycles using PMCs.

Reference for myself:
https://www.codestudy.net/blog/difference-between-rdtscp-rdtsc-memory-and-cpuid-rdtsc/

Regards
* Re: Unexplained variance in run-time of simple program (part 2)
From: John D. McCalpin @ 2026-04-08 9:29 UTC (permalink / raw)
To: Marc Gonzalez
Cc: linux-rt-users, Daniel Wagner, Leon Woestenberg, John Ogness,
    Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
    Clark Williams, Pavel Machek, Luis Goncalves, Frederic Weisbecker,
    Ingo Molnar, Masami Hiramatsu, Ahmed S. Darwish, Agner Fog,
    Dirk Beyer, Philipp Wendler, Matt Godbolt

Responses inline.

> On Apr 7, 2026, at 3:52 PM, Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
>
> Hello Doctor Bandwidth,
>
> I was secretly hoping you would chime in! :)
>
> You've moved to Spain if I understand correctly?

I moved to Spain last year and have been working at the Barcelona
Supercomputing Center for about 9 months.

> I understand that process creation is a costly operation.
> I do run the benchmark several times from the same process.
> (I plan on revisiting your above remarks after I digest them.)

I am still not clear on exactly what you mean by "executing the
benchmark 2^16 times". Are you forking a new process 2^16 times,
or is a single process executing a block of code 2^16 times?
The difference is critical for determining where to look for
sources of performance variability.

> It never occurred to me that I could read the PMCs from user-space!
>
> [ perf_event setup code snipped ]
>
> The problem with that technique is that the PMCs are "polluted" by
> the exit from ioctl & the entry into read.
>
> Reading the PMCs from user-space with RDPMC solves this issue!

The perf_event subsystem is able to read the performance counters
without requiring a kernel transition at every read, but internally
the process is rather complex. For fine-grain measurements on code
that (1) can be bound to a single logical processor and (2) be read
frequently enough that overflows can be unambiguously detected,
there is no reason not to use the RDPMC instruction directly.

(Recent versions of the Linux kernel have started disabling the
RDPMC instruction for user mode by clearing CR4.PCE by default, and
only enabling it when the perf_event library wants to use it. This
irritates me more than I can easily express, but it can be overridden
(in the kernels I have been using) by echoing a value of "2" to
/sys/devices/cpu/rdpmc.)

> Are you implying that simply using printf might disturb the caches
> from the write calls? (I redirect the output to a tmpfs.)

The degree of disturbance depends on the size of the "benchmark"
being measured. If the code executes millions of instructions
between output points, then it is probably OK to just use printf(),
but if you are reading performance counters around loops of a
thousand cycles then you need to be more careful about how the data
is output (especially if the code under test is measuring access to
the caches).

A technique I commonly use is to allocate a small output buffer
(much smaller than the L1 DCache) and zero it immediately before the
first performance counter reads so that it will be resident in the
L1 cache in a writable state. I then run a block of iterations and
save the performance counter values in the buffer. Once the buffer
is filled I copy the contents to a larger buffer in memory, re-zero
the small output buffer, and start the next block of iterations.
If I want to be careful not to cause extra cache pollution, I copy
the small output buffer to the large output buffer using streaming
stores.

This approach is useful for getting ensembles of performance counter
results for repetitions of an identical code loop, or for collecting
performance counter results for the same loop with different loop
bounds, strides, or offsets. Changing the performance counter
programming is always expensive (since the WRMSR instruction can
only be executed in kernel mode), so I either change those
externally to my program, or internally but only around very large
(millions of cycles) blocks of tests. If the counters need to be
changed at finer granularity, the benchmark can be embedded in a
loadable kernel module, but the execution time of the required
process binding and WRMSR instructions is still substantial.

> Thanks for the great suggestion. I'll be reading up on RDPMC.
>
> Why do you think I would need RDTSCP?
> At the moment, I just pin all cores at 2 GHz & count cycles using PMCs.

RDTSCP is useful to have in the toolbox because (1) it is almost
never disabled (via CR4.TSD), (2) it has lower overhead than other
timers, (3) it has some ordering restrictions on execution, and
(4) the instruction also returns the contents of MSR IA32_TSC_AUX
in the %ecx register. The interfaces that I use are available at
https://github.com/jdmccalpin/low-overhead-timers.

There is one exception to (2), and that is using one of the CYCLES
events in the programmable core performance counters. On SKX
processors, for example, the programmable event CPU_CLK_UNHALTED can
be read in about 24 cycles, while RDTSCP takes about 40 cycles and
RDTSC is in the middle somewhere. CPU_CLK_UNHALTED is quite useful
for interval measurements inside a benchmark, because the processor
is almost always going to be active all the time during the
benchmark execution, and "core cycles" is often the most convenient
unit of measurement. (I often set the core frequency to match the
TSC frequency to minimize the probability of unwanted frequency
transitions and to make the arithmetic easier. You still need to
watch out for unexpected stalls and frequency transitions if you are
using 256-bit or 512-bit SIMD instructions.)

RDTSCP is "half-ordered" with respect to other instructions: it will
not execute until all prior instructions have executed, but it does
not prevent future instructions from being started earlier than the
RDTSCP executes. The CPUID instruction in the reference below is
*very expensive* and not recommended for fine-grain timing. If you
want to ensure that no instructions have started execution before
you read the timer, then use the sequence:

	mfence
	lfence
	rdtscp   (or rdtsc in this particular case)

before the code being monitored, and the sequence:

	rdtscp
	lfence

after the code being monitored. I very rarely find a need for the
trailing lfence. It prevents later instructions from starting early
(which could add cycles before the RDTSCP instruction grabs the
TSC), but I have never found a case where this biased the results by
an amount that was visible above the background noise. (RDTSC and
RDTSCP are microcoded instructions that have variable execution
times because they need to communicate with off-core resources to
obtain a TSC value that is independent of the CPU core frequency and
the history of CPU active and halted cycles.)
On Linux systems, IA32_TSC_AUX is programmed to contain the socket
number and logical processor number in concatenated 12-bit fields,
and the RDTSCP instruction guarantees that the TSC value (returned
as two 32-bit values in registers %eax and %edx) and the
IA32_TSC_AUX value were collected atomically on the same logical
processor. So RDTSCP provides an extremely low-overhead mechanism to
determine the logical processor number where the process is
currently executing. As a simple hardware instruction, it won't do
anything that might encourage the OS to consider re-scheduling the
process (while calling a system library routine like
sched_getaffinity() might). This can be very helpful when trying to
debug process affinity problems, e.g. in software environments with
multiple agents able to make affinity requests (SLURM + OpenMP
runtime + MPI runtime).

> Reference for myself:
> https://www.codestudy.net/blog/difference-between-rdtscp-rdtsc-memory-and-cpuid-rdtsc/
>
> Regards
* Re: Unexplained variance in run-time of simple program (part 2)
From: Marc Gonzalez @ 2026-04-10 17:16 UTC (permalink / raw)
To: John D. McCalpin
Cc: linux-rt-users, Daniel Wagner, Leon Woestenberg, John Ogness,
    Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
    Clark Williams, Pavel Machek, Luis Goncalves, Frederic Weisbecker,
    Ingo Molnar, Masami Hiramatsu, Ahmed S. Darwish, Agner Fog,
    Dirk Beyer, Philipp Wendler, Matt Godbolt

On 08/04/2026 11:29, John D. McCalpin wrote:

> I am still not clear on exactly what you mean by "executing the
> benchmark 2^16 times". Are you forking a new process 2^16 times,
> or is a single process executing a block of code 2^16 times?
> The difference is critical for determining where to look for
> sources of performance variability.

Based on your very insightful feedback (thanks again!)
I currently use this wrapper:

#include <stdlib.h>
#include <stdio.h>
#include <immintrin.h>

typedef unsigned int u32;

// Ignore id: process will be pinned to core 3
// Ignore hi: t1-t0 < 2^32 cycles (1.3 seconds)
static inline u32 rdtscp(void)
{
	u32 hi, lo, id;
	asm volatile("rdtscp" : "=d"(hi), "=a"(lo), "=c"(id));
	return lo;
}

extern void init(void);
extern void spin(void); // THE CODE UNDER BENCHMARK

static u32 v[4];

int main(int argc, char **argv)
{
	if (argc != 3) return 1;

	const u32 N = strtoul(argv[1], NULL, 10);
	const u32 S = strtoul(argv[2], NULL, 10);
	void *buf = aligned_alloc(4096, N * sizeof v);

	for (u32 n = 0; n < N; ++n)
	{
		for (u32 i = 0; i < 4; ++i)
		{
			u32 t0 = rdtscp();
			for (u32 s = 0; s < S; ++s) spin();
			u32 t1 = rdtscp();
			v[i] = t1 - t0;
		}
		// Non-temporal 16-byte store: move results to memory without polluting the caches
		_mm_stream_si128((__m128i *)((char *)buf + n * sizeof v),
		                 _mm_loadu_si128((const __m128i *)v));
	}

	for (u32 i = 0; i < N * 4; ++i) printf("%u\n", ((u32 *)buf)[i]);
	return 0;
}

N times: time the code 4 times, and transfer the 4 results to memory
using a non-temporal store. Once measuring is done, print all the results.
Would it be more efficient to write a full cache line with movntdq?
(64 bytes vs 16 bytes)

The baseline code (to check for correct results) is:

spin:
	xor eax, eax
	mov ecx, 1000
loop:
	times 60 inc eax ; 120 bytes
	dec ecx
	jnz loop ; 8-bit offset
	ret

I ran the following experiments:

./run.sh ./baseline 200000 1
./run.sh ./baseline 200000 2
./run.sh ./baseline 200000 4

The results as histograms (in 50-cycle steps):

$ cat histo.1
99000: 199999
99100: 1

$ cat histo.2
197950: 85290
198000: 114706
198050: 1
198200: 2
198250: 1

$ cat histo.4
395900: 110104
395950: 89891
396000: 1
396150: 2
396200: 1
396450: 1

1st col: cycles @ 3.3 GHz (the TSC frequency; the cores themselves are pinned at 2 GHz)
2nd col: number of times the result fell in that interval

I think this is looking pretty good!
I just need to test on my actual code that uses L1$.

Regards