linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Richter <tmricht@linux.ibm.com>
To: James Clark <james.clark@linaro.org>,
	Anubhav Shelat <ashelat@redhat.com>
Cc: mpetlan@redhat.com, acme@kernel.org, namhyung@kernel.org,
	irogers@google.com, linux-perf-users@vger.kernel.org,
	peterz@infradead.org, mingo@redhat.com, mark.rutland@arm.com,
	alexander.shishkin@linux.intel.com, jolsa@kernel.org,
	adrian.hunter@intel.com, kan.liang@linux.intel.com,
	dapeng1.mi@linux.intel.com
Subject: Re: [PATCH] Revert "perf test: Allow tolerance for leader sampling test"
Date: Wed, 12 Nov 2025 11:47:39 +0100	[thread overview]
Message-ID: <0862c069-8c12-42f0-9a9b-2580daad56d1@linux.ibm.com> (raw)
In-Reply-To: <42a066c9-8be3-432c-8cd2-5311c26e7b36@linaro.org>

On 11/11/25 15:03, James Clark wrote:

.....

>>
>> James,
>> the counter above being always 0 come from the fact that the group uses modifier :u
>> for user space, the cycles event are invoked without :u modifier.
>> On s390 you need to use -e "{cycles,cycles}:S"
> 
> I'm not following why :u should make a difference. I see from your explanation below that you have two different types of counter and the first event runs on the buffered one and the other on a simple PMU counter, but how does that translate to userspace filtering making a difference?
> 

Well, s390 counters are 
 # ls -l /sys/devices/cpum_cf/events/*CPU_CYCLES
-r--r--r-- 1 root root 4096 Nov 12 11:10 /sys/devices/cpum_cf/events/CPU_CYCLES
-r--r--r-- 1 root root 4096 Nov 12 11:10 /sys/devices/cpum_cf/events/PROBLEM_STATE_CPU_CYCLES
 # cat /sys/devices/cpum_cf/events/*CPU_CYCLES
 event=0x0000
 event=0x0020
 # 

The generic event cycles is translated to CPU_CYCLES and user space cycles are counted with
a different counter named PROBLEM_STATE_CPU_CYCLES

So 2 different counters, in 2 different counter sets, one started one not started.

>> I disagree with your assessment completely. I have spent quite some time
>> debugging this and come to a different finding. I executed this command
>> on the latest kernel:
>>
>>   # uname -a
>>   Linux b83lp65.lnxne.boe 6.18.0-rc4d-perf+ #84 SMP Fri Nov  7 09:24:44 CET 2025 s390x GNU/Linux
>>   # /root/linux/tools/perf/perf record -e "{cycles,cycles}:S" \
>>          -- taskset -c 0 /root/linux/tools/perf/perf test -w brstack 5000000 2>/dev/null
>>
>> I also had your suggestion patched into the perf script code:
>>   # git diff
>>   diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
>> index 09af486c83e4..212bca671a49 100644
>> --- a/tools/perf/util/session.c
>> +++ b/tools/perf/util/session.c
>> @@ -1212,9 +1212,9 @@ static int deliver_sample_value(struct evlist *evlist,
>>          /*
>>           * There's no reason to deliver sample
>>           * for zero period, bail out.
>> -        */
>>          if (!sample->period)
>>                  return 0;
>> +        */
>>
>>          evsel = container_of(sid->evsel, struct evsel, core);
>>          return tool->sample(tool, event, sample, evsel, machine);
>>   #
>>
>> The output of perf script is now (always in pairs of 2 lines):
>> > taskset    7005   332.357459:    1377000 cycles:       3ff9db4611a find_module_idx+0x8a (/usr/lib64/libc.so.6)
>> taskset    7005   332.357459:    1948445 cycles:       3ff9db4611a find_module_idx+0x8a (/usr/lib64/libc.so.6)
>>     perf    7005   332.368619:    1377000 cycles:       3ffe03a4fa6 xas_reload+0x36 ([kernel.kallsyms])
>>     perf    7005   332.368619:   61344023 cycles:       3ffe03a4fa6 xas_reload+0x36 ([kernel.kallsyms])
>>     perf    7005   332.368624:    1377000 cycles:       3ffe0b53624 mas_empty_area_rev+0x3c4 ([kernel.kallsyms])
>>     perf    7005   332.368624:      25448 cycles:       3ffe0b53624 mas_empty_area_rev+0x3c4 ([kernel.kallsyms])
>>     perf    7005   332.368626:    1377000 cycles:       3ffadf81606 _dl_catch_exception+0xb6 (/usr/lib/ld64.so.1)
>>     perf    7005   332.368626:      12182 cycles:       3ffadf81606 _dl_catch_exception+0xb6 (/usr/lib/ld64.so.1)
>>     perf    7005   332.368628:    1377000 cycles:       3ffadf8c9b2 _dl_add_to_namespace_list+0x42 (/usr/lib/ld64.so.1)
>>     perf    7005   332.368628:      11392 cycles:       3ffadf8c9b2 _dl_add_to_namespace_list+0x42 (/usr/lib/ld64.so.1)
>>     perf    7005   332.368630:    1377000 cycles:       3ffe0b4f800 mas_prev+0x0 ([kernel.kallsyms])
>>     perf    7005   332.368630:      11476 cycles:       3ffe0b4f800 mas_prev+0x0 ([kernel.kallsyms])
>>
> 
> The above patch won't make a difference to this output because none of them are 0. You get exactly the same output with or without the patch.
> 
> I only showed that this part of the code was hiding the original issue where most of the second counter values were 0. That seems to not be the case if you remove :u as you found, but isn't related to this part.
> 

Agreed

> I kind of see why :u results in 0 counts, because you're only counting the time between when the buffered counter fills up and interrupted and when you generated the other samples. But I don't get why that has to be the case. Couldn't you leave the other counters running and then they would count everything between samples?
> 
>>
>> Now to the debugging and investigation:
>> 1. With command
>>         perf record -e '{cycles,cycles}:S' -- ....
>>     the first cycles event start sampling.
>>     On s390 this sets up sampling with a frequency of 4000 Hz.
>>     This translates to hardware sample rate of 1377000 instructions per
>>     micro-second to meet a frequency of 4000 HZ.
> 
> 1377000 instructions per sample? I'm assuming per micro-second is a mistake.

Here is the maths:
# cat /proc/service_levels 
CPU-MF: Counter facility: version=3.8 authorization=002f
CPU-MF: Sampling facility: min_rate=22032 max_rate=180464112 cpu_speed=5508
CPU-MF: Sampling facility: mode=basic sample_size=32
CPU-MF: Sampling facility: mode=diagnostic sample_size=171
...

So the cpu_speed of 5508 (machine type z17) means 5508 CPU cycles per micro second. Document SA23-2260-08 page 2-35
5508 * 1000000 = 5508000000b / 4000 --> 1377000 CPU cycles between hardware write a sample.

> 
> As in, it sets the counter to roll over and take a sample every 1377000 cycles (instructions?), which ends up being 4000 samples per second.
> 
>>
>> 2. With first event cycles now sampling into a hardware buffer, an interrupt
>>     is triggered each time a sampling buffer gets full. The interrupt handler
>>     is then invoked and debug output shows the processing of samples.
>>     The size of one hardware sample is 32 bytes. With an interrupt triggered
>>     when the hardware buffer page of 4KB gets full, the interrupt handler
>>     processes 128 samples.
>>     (This is taken from s390 specific fast debug data gathering)
>>     2025-11-07 14:35:51.977248  000003ffe013cbfa  perf_event_count_update event->count 0x0 count 0x1502e8
>>     2025-11-07 14:35:51.977248  000003ffe013cbfa  perf_event_count_update event->count 0x1502e8 count 0x1502e8
>>     2025-11-07 14:35:51.977248  000003ffe013cbfa  perf_event_count_update event->count 0x2a05d0 count 0x1502e8
>>     2025-11-07 14:35:51.977252  000003ffe013cbfa  perf_event_count_update event->count 0x3f08b8 count 0x1502e8
>>     2025-11-07 14:35:51.977252  000003ffe013cbfa  perf_event_count_update event->count 0x540ba0 count 0x1502e8
>>     2025-11-07 14:35:51.977253  000003ffe013cbfa  perf_event_count_update event->count 0x690e88 count 0x1502e8
>>     2025-11-07 14:35:51.977254  000003ffe013cbfa  perf_event_count_update event->count 0x7e1170 count 0x1502e8
>>     2025-11-07 14:35:51.977254  000003ffe013cbfa  perf_event_count_update event->count 0x931458 count 0x1502e8
>>     2025-11-07 14:35:51.977254  000003ffe013cbfa  perf_event_count_update event->count 0xa81740 count 0x1502e8
>>
>> 3. The value is constantly increasing by the number of instructions executed
>>     to generate a sample entry.  This is the first line of the pairs of lines.
>>     count 0x1502e8 --> 1377000
>>
>>     # perf script | grep 1377000 | wc -l
>>     214
>>     # perf script | wc -l
>>     428
>>     #
>>     That is 428 lines in total, and half of the lines contain value 1377000.
>>
>> 4. The second event cycles is opened against the counting PMU, which is an
>>     independent PMU and is not interrupt driven.  Once enabled it runs in the
>>     backgroud and keeps running, incrementing silently about 400+
>>     counters. The counter values are read via assembly instructions.
>>
>>     This second counter PMU's read call back function is called when the
>>     interrupt handler of the sampling facility processes each sample. The
>>     function call sequence is:
>>
>>     perf_event_overflow()
>>     +--> __perf_event_overflow()
>>          +--> __perf_event_output()
>>                 +--> perf_output_sample()
>>                      +--> perf_output_read()
>>                           +--> perf_output_read_group()
>>                                    for_each_sibling_event(sub, leader) {
>>                  values[n++] = perf_event_count(sub, self);
>>                  printk("%s sub %p values %#lx\n", __func__, sub, values[n-1]);
>>                                    }
>>
>>     The last function perf_event_count() is invoked on the second event cylces
>>     *on* the counting PMU. An added printk statement shows the following lines
>>     in the dmesg output:
>>
>>     # dmesg|grep perf_output_read_group |head -10
>>     [  332.368620] perf_output_read_group sub 00000000d80b7c1f values 0x3a80917 (1)
>>     [  332.368624] perf_output_read_group sub 00000000d80b7c1f values 0x3a86c7f (2)
>>     [  332.368627] perf_output_read_group sub 00000000d80b7c1f values 0x3a89c15 (3)
>>     [  332.368629] perf_output_read_group sub 00000000d80b7c1f values 0x3a8c895 (4)
>>     [  332.368631] perf_output_read_group sub 00000000d80b7c1f values 0x3a8f569 (5)
>>     [  332.368633] perf_output_read_group sub 00000000d80b7c1f values 0x3a9204b
>>     [  332.368635] perf_output_read_group sub 00000000d80b7c1f values 0x3a94790
>>     [  332.368637] perf_output_read_group sub 00000000d80b7c1f values 0x3a9704b
>>     [  332.368638] perf_output_read_group sub 00000000d80b7c1f values 0x3a99888
>>     #
>>
>>    This correlates with the output of
>>     # perf report -D | grep 'id 00000000000000'|head -10
>>     ..... id 0000000000000006, value 00000000001502e8, lost 0
>>     ..... id 000000000000000e, value 0000000003a80917, lost 0 --> line (1) above
>>     ..... id 0000000000000006, value 00000000002a05d0, lost 0
>>     ..... id 000000000000000e, value 0000000003a86c7f, lost 0 --> line (2) above
>>     ..... id 0000000000000006, value 00000000003f08b8, lost 0
>>     ..... id 000000000000000e, value 0000000003a89c15, lost 0 --> line (3) above
>>     ..... id 0000000000000006, value 0000000000540ba0, lost 0
>>     ..... id 000000000000000e, value 0000000003a8c895, lost 0 --> line (4) above
>>     ..... id 0000000000000006, value 0000000000690e88, lost 0
>>     ..... id 000000000000000e, value 0000000003a8f569, lost 0 --> line (5) above
>>
>> Summary:
>> - Above command starts the CPU sampling facility, with runs interrupt
>>    driven when a 4KB page is full. An interrupt processes the 128 samples
>>    and calls eventually perf_output_read_group() for each sample to save it
>>    in the event's ring buffer.
>>
>> - At that time the CPU counting facility is invoked to read the value of
>>    the event cycles. This value is saved as the second value in the
>>    sample_read structure.
>>
>> - The first and odd lines in the perf script output displays the period
>>    value between 2 samples being created by hardware. It is the number
>>    of instructions executes before the hardware writes a sample.
>>
>> - The second and even lines in the perf script output displays the number
>>    of CPU cycles needed to process each sample and save it in the event's ring
>>    buffer.

....

>> Thanks a lot.
> 
> I agree, it looks like the test in its current form isn't valid and shouldn't be run on s390.
> 
> Does this buffered hardware counter only support the cycles event? Or does it support any event but there is only one instance of it?
> 
> Because if it only supports the cycles event then we can just change the test to count branches which would force it to run on a simple counter instead. If it supports any event then I think the test (or sample read as a whole) can never work.
> 
> You'd still probably want to update the PMU to reject sample read events that run on the buffered counter, whether it's only for cycles or not. That would prevent people from being confused outside of the test as well.
> 
We can disable this leader sampling test for s390, because we only have one event.
Or I change the command line invocation to this:

# git diff tests/shell/record.sh
diff --git a/tools/perf/tests/shell/record.sh b/tools/perf/tests/shell/record.sh
index 0f5841c479e7..4d2557b85606 100755
--- a/tools/perf/tests/shell/record.sh
+++ b/tools/perf/tests/shell/record.sh
@@ -260,7 +260,7 @@ test_uid() {
 
 test_leader_sampling() {
   echo "Basic leader sampling test"
-  if ! perf record -o "${perfdata}" -e "{cycles,cycles}:Su" -- \
+  if ! perf record -o "${perfdata}" -e "{cpum_sf/SF_CYCLES_BASIC/,cycles}:S" -- \
     perf test -w brstack 2> /dev/null
   then
     echo "Leader sampling [Failed record]"
@@ -432,18 +432,18 @@ if [[ $default_fd_limit -lt $min_fd_limit ]]; then
        ulimit -Sn $min_fd_limit
 fi

It installs the sampling using a different perf_event::attr::type and
perf_event::attr::config combination for sampling. It now operates on
two sampling event which run in sync.

Leads to this result:
# for i in $(seq 10); do ./perf test 124; done
124: perf record tests                                               : Ok
124: perf record tests                                               : Ok
124: perf record tests                                               : Ok
124: perf record tests                                               : Ok
124: perf record tests                                               : Ok
124: perf record tests                                               : Ok
124: perf record tests                                               : Ok
124: perf record tests                                               : Ok
124: perf record tests                                               : Ok
124: perf record tests                                               : Ok
# 

Should I submit a patch for that?

Thanks
-- 
Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany
--
IBM Deutschland Research & Development GmbH

Vorsitzender des Aufsichtsrats: Wolfgang Wendt

Geschäftsführung: David Faller

Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294

  reply	other threads:[~2025-11-12 10:50 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-23 13:24 [PATCH] Revert "perf test: Allow tolerance for leader sampling test" Anubhav Shelat
2025-10-24  8:40 ` Thomas Richter
2025-10-24 17:21   ` Anubhav Shelat
2025-10-27 10:27     ` Thomas Richter
2025-10-28 11:30     ` James Clark
2025-10-28 12:55       ` Thomas Richter
2025-10-28 15:23         ` James Clark
2025-10-29  7:37           ` Thomas Richter
2025-10-29  9:25             ` James Clark
2025-11-11 11:22               ` Thomas Richter
2025-11-11 14:03                 ` James Clark
2025-11-12 10:47                   ` Thomas Richter [this message]
2025-11-13 11:58                     ` James Clark
2025-10-30 13:52         ` Anubhav Shelat
2025-10-30 14:19           ` James Clark
2025-10-30 15:22             ` Anubhav Shelat

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0862c069-8c12-42f0-9a9b-2580daad56d1@linux.ibm.com \
    --to=tmricht@linux.ibm.com \
    --cc=acme@kernel.org \
    --cc=adrian.hunter@intel.com \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=ashelat@redhat.com \
    --cc=dapeng1.mi@linux.intel.com \
    --cc=irogers@google.com \
    --cc=james.clark@linaro.org \
    --cc=jolsa@kernel.org \
    --cc=kan.liang@linux.intel.com \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@redhat.com \
    --cc=mpetlan@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).