From: Namhyung Kim <namhyung@kernel.org>
To: Howard Chu <howardchu95@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>,
Ian Rogers <irogers@google.com>,
Kan Liang <kan.liang@linux.intel.com>,
Jiri Olsa <jolsa@kernel.org>,
Adrian Hunter <adrian.hunter@intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
linux-perf-users@vger.kernel.org, bpf@vger.kernel.org,
Song Liu <song@kernel.org>
Subject: Re: [PATCH v2] perf trace: Implement syscall summary in BPF
Date: Thu, 20 Mar 2025 22:54:05 -0700 [thread overview]
Message-ID: <Z9z-_ftY23KcZ6c1@google.com> (raw)
In-Reply-To: <CAH0uvojj=-BE93VeuxK1LWEEBkYXT_BsRAf17gb-34jFRwnDww@mail.gmail.com>
On Thu, Mar 20, 2025 at 07:35:01PM -0700, Howard Chu wrote:
> Hello again Namhyung,
>
> As funny as it sounds, I have too much homework this week. I had to
> break the review into two parts. Sorry.
Thanks for taking your time!
>
> 1) Maybe just '--bpf-summary' instead?
>
> First of all, is '-s --bpf-summary' is it ergonomic? Why not drop the
> -s and just --bpf-summary since the option has 'summary' in its name.
> Another reason being,
> sudo ./perf trace -S --bpf-summary --summary-mode=total -- sleep 1
> sudo ./perf trace -s --bpf-summary --summary-mode=total -- sleep 1
> are the same (-S will emit no output to stdout).
Hmm.. it looks like a bug, will take a look. Maybe --bpf-summary is
redundant, but I think we can make it default later so that -s/-S can
use BPF without the option excplicitly. Then this option can be used
to disable it (--no-bpf-summary) for whatever reasons.
>
> 2) Anomaly observed when playing around
>
> sudo ./perf trace -s --bpf-summary --summary-mode=total -- sleep 1
> this gave me 10000 events
>
> sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> while this gave me 1000 events
>
> I guess it's something to do with the lost events?
No, as you said in the previous message it didn't support process
targets yet. I plan to disable it without -a for now.
>
> 3) Wrong stddev values
> Please compare these two outputs
>
> perf $ sudo ./perf trace -as --summary-mode=total -- sleep 1
>
> Summary of events:
>
> total, 11290 events
>
> syscall calls errors total min avg
> max stddev
> (msec) (msec) (msec)
> (msec) (%)
> --------------- -------- ------ -------- --------- ---------
> --------- ------
> mq_open 214 71 16073.976 0.000 75.112
> 250.120 9.91%
> futex 1296 195 11592.060 0.000 8.944
> 907.590 13.59%
> epoll_wait 479 0 4262.456 0.000 8.899
> 496.568 20.34%
> poll 241 0 2545.090 0.000 10.561
> 607.894 33.33%
> ppoll 330 0 1713.676 0.000 5.193
> 410.143 26.45%
> migrate_pages 45 0 1031.915 0.000 22.931
> 147.830 20.70%
> clock_nanosleep 2 0 1000.106 0.000 500.053
> 1000.106 100.00%
> swapoff 340 0 909.827 0.000 2.676
> 50.117 22.76%
> pselect6 5 0 604.816 0.000 120.963
> 604.808 100.00%
> readlinkat 26 3 501.205 0.000 19.277
> 499.998 99.75%
>
> perf $ sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
>
> Summary of events:
>
> total, 880 events
>
> syscall calls errors total min avg
> max stddev
> (msec) (msec) (msec)
> (msec) (%)
> --------------- -------- ------ -------- --------- ---------
> --------- ------
> futex 219 46 2326.400 0.001 10.623
> 243.028 337.77%
> mq_open 19 8 2001.347 0.003 105.334
> 250.356 117.26%
> poll 6 1 1002.512 0.002 167.085
> 1002.496 223.60%
> clock_nanosleep 1 0 1000.147 1000.147 1000.147
> 1000.147 0.00%
> swapoff 43 0 953.251 0.001 22.169
> 50.390 112.37%
> migrate_pages 43 0 933.727 0.004 21.715
> 49.149 106.68%
> ppoll 32 0 838.035 0.002 26.189
> 331.222 252.10%
> epoll_pwait 5 0 499.578 0.001 99.916
> 499.565 199.99%
> nanosleep 1 0 10.149 10.149 10.149
> 10.149 0.00%
> epoll_wait 10 0 3.449 0.003 0.345
> 0.815 88.02%
> readlinkat 25 3 1.424 0.006 0.057
> 0.080 41.76%
> recvmsg 61 0 1.326 0.016 0.022
> 0.052 21.71%
> execve 6 5 1.100 0.002 0.183
> 1.078 218.21%
>
> I would say stddev here is a little off. The reason is:
>
> On Mon, Mar 17, 2025 at 11:08 AM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > When -s/--summary option is used, it doesn't need (augmented) arguments
> > of syscalls. Let's skip the augmentation and load another small BPF
> > program to collect the statistics in the kernel instead of copying the
> > data to the ring-buffer to calculate the stats in userspace. This will
> > be much more light-weight than the existing approach and remove any lost
> > events.
> >
> > Let's add a new option --bpf-summary to control this behavior. I cannot
> > make it default because there's no way to get e_machine in the BPF which
> > is needed for detecting different ABIs like 32-bit compat mode.
> >
> > No functional changes intended except for no more LOST events. :)
> >
> > $ sudo perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> >
> > Summary of events:
> >
> > total, 2824 events
> >
> > syscall calls errors total min avg max stddev
> > (msec) (msec) (msec) (msec) (%)
> > --------------- -------- ------ -------- --------- --------- --------- ------
> > futex 372 18 4373.773 0.000 11.757 997.715 660.42%
> > poll 241 0 2757.963 0.000 11.444 997.758 580.34%
> > epoll_wait 161 0 2460.854 0.000 15.285 325.189 260.73%
> > ppoll 19 0 1298.652 0.000 68.350 667.172 281.46%
> > clock_nanosleep 1 0 1000.093 0.000 1000.093 1000.093 0.00%
> > epoll_pwait 16 0 192.787 0.000 12.049 173.994 348.73%
> > nanosleep 6 0 50.926 0.000 8.488 10.210 43.96%
> > ...
> >
> > Cc: Howard Chu <howardchu95@gmail.com>
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---
> > v2)
> > * rebased on top of Ian's e_machine changes
> > * add --bpf-summary option
> > * support per-thread summary
> > * add stddev calculation (Howard)
> <SNIP>
> > +static double rel_stddev(struct syscall_stats *stat)
> > +{
> > + double variance, average;
> > +
> > + if (stat->count < 2)
> > + return 0;
> > +
> > + average = (double)stat->total_time / stat->count;
> > +
> > + variance = stat->squared_sum;
> > + variance -= (stat->total_time * stat->total_time) / stat->count;
> > + variance /= stat->count;
>
> isn't it 'variance /= stat->count - 1' because we used Bessel's
> correction? (Link:
> https://en.wikipedia.org/wiki/Bessel%27s_correction), that is to use n
> - 1 instead of n, this is what's done in stat.c.
>
> * (\Sum n_i^2) - ((\Sum n_i)^2)/n
> * s^2 = -------------------------------
> * n - 1
>
> and the lines down here are unfortunately incorrect
> + variance = stat->squared_sum;
> + variance -= (stat->total_time * stat->total_time) / stat->count;
> + variance /= stat->count;
> +
> + return 100 * sqrt(variance) / average;
>
> variance /= stat->count - 1; will get you variance, but I think we
> need variance mean.
> Link: https://en.wikipedia.org/wiki/Standard_deviation#Relationship_between_standard_deviation_and_mean
>
> it holds that:
> variance(mean) = variance / N
>
> so you are losing a '/ stat->count'
You're right, thanks for pointing that out.
>
> And with all due respect, although it makes total sense in
> engineering, mathematically, I find variance = stat->squared_sum,
> variance -= ... these accumulated calculations on variable 'variance'
> a little weird... because readers may find difficult to determine at
> which point it becomes the actual 'variance'
>
> with clarity in mind:
>
> diff --git a/tools/perf/util/bpf-trace-summary.c
> b/tools/perf/util/bpf-trace-summary.c
> index 5ae9feca244d..a435b4037082 100644
> --- a/tools/perf/util/bpf-trace-summary.c
> +++ b/tools/perf/util/bpf-trace-summary.c
> @@ -62,18 +62,18 @@ struct syscall_node {
>
> static double rel_stddev(struct syscall_stats *stat)
> {
> - double variance, average;
> + double variance, average, squared_total;
>
> if (stat->count < 2)
> return 0;
>
> average = (double)stat->total_time / stat->count;
>
> - variance = stat->squared_sum;
> - variance -= (stat->total_time * stat->total_time) / stat->count;
> - variance /= stat->count;
> + squared_total = stat->total_time * stat->total_time;
> + variance = (stat->squared_sum - squared_total / stat->count) /
> (stat->count - 1);
> + stddev_mean = sqrt(variance / stat->count);
>
> - return 100 * sqrt(variance) / average;
> + return 100 * stddev_mean / average;
> }
Can it be like this?
diff --git a/tools/perf/util/bpf-trace-summary.c b/tools/perf/util/bpf-trace-summary.c
index a91d42447e850a59..c897fb017914960c 100644
--- a/tools/perf/util/bpf-trace-summary.c
+++ b/tools/perf/util/bpf-trace-summary.c
@@ -71,9 +71,9 @@ static double rel_stddev(struct syscall_stats *stat)
variance = stat->squared_sum;
variance -= (stat->total_time * stat->total_time) / stat->count;
- variance /= stat->count;
+ variance /= stat->count - 1;
- return 100 * sqrt(variance) / average;
+ return 100 * sqrt(variance / stat->count) / average;
}
struct syscall_data {
>
> btw I haven't checked the legal range for stddev_mean, so I can be wrong.
> <SNIP>
> > +static int update_total_stats(struct hashmap *hash, struct syscall_key *map_key,
> > + struct syscall_stats *map_data)
> > +{
> > + struct syscall_data *data;
> > + struct syscall_stats *stat;
> > +
> > + if (!hashmap__find(hash, map_key, &data)) {
> > + data = zalloc(sizeof(*data));
> > + if (data == NULL)
> > + return -ENOMEM;
> > +
> > + data->nodes = zalloc(sizeof(*data->nodes));
> > + if (data->nodes == NULL) {
> > + free(data);
> > + return -ENOMEM;
> > + }
> > +
> > + data->nr_nodes = 1;
> > + data->key = map_key->nr;
> > + data->nodes->syscall_nr = data->key;
> Wow, aggressive. I guess you want it to behave like a single value
> when it is SYSCALL_AGGR_CPU, and an array when it is
> SYSCALL_AGGR_THREAD. Do you mind adding a comment about it?
>
> so it's
>
> (cpu, syscall_nr) -> data -> {node}
> (tid, syscall_nr) -> data -> [node1, node2, node3]
Right, will add comments.
>
>
> > +
> > + if (hashmap__add(hash, data->key, data) < 0) {
> > + free(data->nodes);
> > + free(data);
> > + return -ENOMEM;
> > + }
> > + }
> > +
> > + /* update total stats for this syscall */
> > + data->nr_events += map_data->count;
> > + data->total_time += map_data->total_time;
> > +
> > + /* This is sum of the same syscall from different CPUs */
> > + stat = &data->nodes->stats;
> > +
> > + stat->total_time += map_data->total_time;
> > + stat->squared_sum += map_data->squared_sum;
> > + stat->count += map_data->count;
> > + stat->error += map_data->error;
> > +
> > + if (stat->max_time < map_data->max_time)
> > + stat->max_time = map_data->max_time;
> > + if (stat->min_time > map_data->min_time)
> > + stat->min_time = map_data->min_time;
> > +
> > + return 0;
> > +}
> > +
> > +static int print_total_stats(struct syscall_data **data, int nr_data, FILE *fp)
> > +{
> > + int printed = 0;
> > + int nr_events = 0;
> > +
> > + for (int i = 0; i < nr_data; i++)
> > + nr_events += data[i]->nr_events;
> > +
> > + printed += fprintf(fp, " total, %d events\n\n", nr_events);
> > +
> > + printed += fprintf(fp, " syscall calls errors total min avg max stddev\n");
> > + printed += fprintf(fp, " (msec) (msec) (msec) (msec) (%%)\n");
> > + printed += fprintf(fp, " --------------- -------- ------ -------- --------- --------- --------- ------\n");
> > +
> > + for (int i = 0; i < nr_data; i++)
> > + printed += print_common_stats(data[i], fp);
> > +
> > + printed += fprintf(fp, "\n\n");
> > + return printed;
> > +}
> > +
> > +int trace_print_bpf_summary(FILE *fp)
> > +{
> > + struct bpf_map *map = skel->maps.syscall_stats_map;
> > + struct syscall_key *prev_key, key;
> > + struct syscall_data **data = NULL;
> > + struct hashmap schash;
> > + struct hashmap_entry *entry;
> > + int nr_data = 0;
> > + int printed = 0;
> > + int i;
> > + size_t bkt;
> > +
> > + hashmap__init(&schash, sc_node_hash, sc_node_equal, /*ctx=*/NULL);
> > +
> > + printed = fprintf(fp, "\n Summary of events:\n\n");
> > +
> > + /* get stats from the bpf map */
> > + prev_key = NULL;
> > + while (!bpf_map__get_next_key(map, prev_key, &key, sizeof(key))) {
> > + struct syscall_stats stat;
> > +
> > + if (!bpf_map__lookup_elem(map, &key, sizeof(key), &stat, sizeof(stat), 0)) {
> > + if (skel->rodata->aggr_mode == SYSCALL_AGGR_THREAD)
> > + update_thread_stats(&schash, &key, &stat);
> > + else
> > + update_total_stats(&schash, &key, &stat);
> > + }
> > +
> > + prev_key = &key;
> > + }
> > +
> > + nr_data = hashmap__size(&schash);
> > + data = calloc(nr_data, sizeof(*data));
> > + if (data == NULL)
> > + goto out;
> > +
> > + i = 0;
> > + hashmap__for_each_entry(&schash, entry, bkt)
> > + data[i++] = entry->pvalue;
> > +
> > + qsort(data, nr_data, sizeof(*data), datacmp);
>
> Here syscall_data is sorted for AGGR_THREAD and AGGR_CPU, meaning the
> thread who has the higher total syscall period will be printed first.
> This is an awesome side effect but it is not the behavior of 'sudo
> ./perf trace -as -- sleep 1' without the --bpf-summary option. If it
> is not too trivial, maybe consider documenting this behavior? But it
> may be too verbose so Idk.
The original behavior is random ordering and now it's ordered by total
time. But still we can think it's randomly ordered. I believe we need
to maintain strict ordering by total time and then fixed the existing
code. Once that happen I can add the documentation.
Thanks for your careful review!
Namhyung
>
> sudo ./perf trace -as -- sleep 1
>
> ClientModuleMan (4956), 16 events, 0.1%
>
> syscall calls errors total min avg
> max stddev
> (msec) (msec) (msec)
> (msec) (%)
> --------------- -------- ------ -------- --------- ---------
> --------- ------
> futex 8 4 750.234 0.000 93.779
> 250.105 48.79%
>
>
> CHTTPClientThre (15720), 16 events, 0.1%
>
> syscall calls errors total min avg
> max stddev
> (msec) (msec) (msec)
> (msec) (%)
> --------------- -------- ------ -------- --------- ---------
> --------- ------
> futex 8 4 1000.425 0.000 125.053
> 750.317 75.59%
>
> The order is random for the command above.
>
> Thanks,
> Howard
next prev parent reply other threads:[~2025-03-21 5:54 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-17 18:08 [PATCH v2] perf trace: Implement syscall summary in BPF Namhyung Kim
2025-03-17 18:37 ` Ian Rogers
2025-03-20 6:07 ` Namhyung Kim
2025-03-19 19:00 ` Howard Chu
2025-03-19 22:07 ` Namhyung Kim
2025-03-19 22:19 ` Howard Chu
2025-03-19 23:39 ` Howard Chu
2025-03-20 0:17 ` Namhyung Kim
2025-03-21 2:35 ` Howard Chu
2025-03-21 5:54 ` Namhyung Kim [this message]
2025-03-21 17:19 ` Howard Chu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z9z-_ftY23KcZ6c1@google.com \
--to=namhyung@kernel.org \
--cc=acme@kernel.org \
--cc=adrian.hunter@intel.com \
--cc=bpf@vger.kernel.org \
--cc=howardchu95@gmail.com \
--cc=irogers@google.com \
--cc=jolsa@kernel.org \
--cc=kan.liang@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-perf-users@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=peterz@infradead.org \
--cc=song@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.