All of lore.kernel.org
 help / color / mirror / Atom feed
From: Namhyung Kim <namhyung@kernel.org>
To: Howard Chu <howardchu95@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>,
	Ian Rogers <irogers@google.com>,
	Kan Liang <kan.liang@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-perf-users@vger.kernel.org, bpf@vger.kernel.org,
	Song Liu <song@kernel.org>
Subject: Re: [PATCH v2] perf trace: Implement syscall summary in BPF
Date: Thu, 20 Mar 2025 22:54:05 -0700	[thread overview]
Message-ID: <Z9z-_ftY23KcZ6c1@google.com> (raw)
In-Reply-To: <CAH0uvojj=-BE93VeuxK1LWEEBkYXT_BsRAf17gb-34jFRwnDww@mail.gmail.com>

On Thu, Mar 20, 2025 at 07:35:01PM -0700, Howard Chu wrote:
> Hello again Namhyung,
> 
> As funny as it sounds, I have too much homework this week. I had to
> break the review into two parts. Sorry.

Thanks for taking your time!

> 
> 1) Maybe just '--bpf-summary' instead?
> 
> First of all, is '-s --bpf-summary' is it ergonomic? Why not drop the
> -s and just --bpf-summary since the option has 'summary' in its name.
> Another reason being,
> sudo ./perf trace -S --bpf-summary --summary-mode=total -- sleep 1
> sudo ./perf trace -s --bpf-summary --summary-mode=total -- sleep 1
> are the same (-S will emit no output to stdout).

Hmm.. it looks like a bug, will take a look.  Maybe --bpf-summary is
redundant, but I think we can make it default later so that -s/-S can
use BPF without the option excplicitly.  Then this option can be used
to disable it (--no-bpf-summary) for whatever reasons.

> 
> 2) Anomaly observed when playing around
> 
> sudo ./perf trace -s --bpf-summary --summary-mode=total -- sleep 1
> this gave me 10000 events
> 
> sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> while this gave me 1000 events
> 
> I guess it's something to do with the lost events?

No, as you said in the previous message it didn't support process
targets yet.  I plan to disable it without -a for now.

> 
> 3) Wrong stddev values
> Please compare these two outputs
> 
> perf $ sudo ./perf trace -as --summary-mode=total -- sleep 1
> 
>  Summary of events:
> 
>  total, 11290 events
> 
>    syscall            calls  errors  total       min       avg
> max       stddev
>                                      (msec)    (msec)    (msec)
> (msec)        (%)
>    --------------- --------  ------ -------- --------- ---------
> ---------     ------
>    mq_open              214     71 16073.976     0.000    75.112
> 250.120      9.91%
>    futex               1296    195 11592.060     0.000     8.944
> 907.590     13.59%
>    epoll_wait           479      0  4262.456     0.000     8.899
> 496.568     20.34%
>    poll                 241      0  2545.090     0.000    10.561
> 607.894     33.33%
>    ppoll                330      0  1713.676     0.000     5.193
> 410.143     26.45%
>    migrate_pages         45      0  1031.915     0.000    22.931
> 147.830     20.70%
>    clock_nanosleep        2      0  1000.106     0.000   500.053
> 1000.106    100.00%
>    swapoff              340      0   909.827     0.000     2.676
> 50.117     22.76%
>    pselect6               5      0   604.816     0.000   120.963
> 604.808    100.00%
>    readlinkat            26      3   501.205     0.000    19.277
> 499.998     99.75%
> 
> perf $ sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> 
>  Summary of events:
> 
>  total, 880 events
> 
>    syscall            calls  errors  total       min       avg
> max       stddev
>                                      (msec)    (msec)    (msec)
> (msec)        (%)
>    --------------- --------  ------ -------- --------- ---------
> ---------     ------
>    futex                219     46  2326.400     0.001    10.623
> 243.028    337.77%
>    mq_open               19      8  2001.347     0.003   105.334
> 250.356    117.26%
>    poll                   6      1  1002.512     0.002   167.085
> 1002.496    223.60%
>    clock_nanosleep        1      0  1000.147  1000.147  1000.147
> 1000.147      0.00%
>    swapoff               43      0   953.251     0.001    22.169
> 50.390    112.37%
>    migrate_pages         43      0   933.727     0.004    21.715
> 49.149    106.68%
>    ppoll                 32      0   838.035     0.002    26.189
> 331.222    252.10%
>    epoll_pwait            5      0   499.578     0.001    99.916
> 499.565    199.99%
>    nanosleep              1      0    10.149    10.149    10.149
> 10.149      0.00%
>    epoll_wait            10      0     3.449     0.003     0.345
> 0.815     88.02%
>    readlinkat            25      3     1.424     0.006     0.057
> 0.080     41.76%
>    recvmsg               61      0     1.326     0.016     0.022
> 0.052     21.71%
>    execve                 6      5     1.100     0.002     0.183
> 1.078    218.21%
> 
> I would say stddev here is a little off. The reason is:
> 
> On Mon, Mar 17, 2025 at 11:08 AM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > When -s/--summary option is used, it doesn't need (augmented) arguments
> > of syscalls.  Let's skip the augmentation and load another small BPF
> > program to collect the statistics in the kernel instead of copying the
> > data to the ring-buffer to calculate the stats in userspace.  This will
> > be much more light-weight than the existing approach and remove any lost
> > events.
> >
> > Let's add a new option --bpf-summary to control this behavior.  I cannot
> > make it default because there's no way to get e_machine in the BPF which
> > is needed for detecting different ABIs like 32-bit compat mode.
> >
> > No functional changes intended except for no more LOST events. :)
> >
> >   $ sudo perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> >
> >    Summary of events:
> >
> >    total, 2824 events
> >
> >      syscall            calls  errors  total       min       avg       max       stddev
> >                                        (msec)    (msec)    (msec)    (msec)        (%)
> >      --------------- --------  ------ -------- --------- --------- ---------     ------
> >      futex                372     18  4373.773     0.000    11.757   997.715    660.42%
> >      poll                 241      0  2757.963     0.000    11.444   997.758    580.34%
> >      epoll_wait           161      0  2460.854     0.000    15.285   325.189    260.73%
> >      ppoll                 19      0  1298.652     0.000    68.350   667.172    281.46%
> >      clock_nanosleep        1      0  1000.093     0.000  1000.093  1000.093      0.00%
> >      epoll_pwait           16      0   192.787     0.000    12.049   173.994    348.73%
> >      nanosleep              6      0    50.926     0.000     8.488    10.210     43.96%
> >      ...
> >
> > Cc: Howard Chu <howardchu95@gmail.com>
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---
> > v2)
> >  * rebased on top of Ian's e_machine changes
> >  * add --bpf-summary option
> >  * support per-thread summary
> >  * add stddev calculation  (Howard)
> <SNIP>
> > +static double rel_stddev(struct syscall_stats *stat)
> > +{
> > +       double variance, average;
> > +
> > +       if (stat->count < 2)
> > +               return 0;
> > +
> > +       average = (double)stat->total_time / stat->count;
> > +
> > +       variance = stat->squared_sum;
> > +       variance -= (stat->total_time * stat->total_time) / stat->count;
> > +       variance /= stat->count;
> 
> isn't it 'variance /= stat->count - 1' because we used Bessel's
> correction? (Link:
> https://en.wikipedia.org/wiki/Bessel%27s_correction), that is to use n
> - 1 instead of n, this is what's done in stat.c.
> 
>  *       (\Sum n_i^2) - ((\Sum n_i)^2)/n
>  * s^2 = -------------------------------
>  *                  n - 1
> 
> and the lines down here are unfortunately incorrect
> + variance = stat->squared_sum;
> + variance -= (stat->total_time * stat->total_time) / stat->count;
> + variance /= stat->count;
> +
> + return 100 * sqrt(variance) / average;
> 
> variance /= stat->count - 1; will get you variance, but I think we
> need variance mean.
> Link: https://en.wikipedia.org/wiki/Standard_deviation#Relationship_between_standard_deviation_and_mean
> 
> it holds that:
> variance(mean) = variance / N
> 
> so you are losing a '/ stat->count'

You're right, thanks for pointing that out.

> 
> And with all due respect, although it makes total sense in
> engineering, mathematically, I find variance = stat->squared_sum,
> variance -= ... these accumulated calculations on variable 'variance'
> a little weird... because readers may find difficult to determine at
> which point it becomes the actual 'variance'
> 
> with clarity in mind:
> 
> diff --git a/tools/perf/util/bpf-trace-summary.c
> b/tools/perf/util/bpf-trace-summary.c
> index 5ae9feca244d..a435b4037082 100644
> --- a/tools/perf/util/bpf-trace-summary.c
> +++ b/tools/perf/util/bpf-trace-summary.c
> @@ -62,18 +62,18 @@ struct syscall_node {
> 
>  static double rel_stddev(struct syscall_stats *stat)
>  {
> -       double variance, average;
> +       double variance, average, squared_total;
> 
>         if (stat->count < 2)
>                 return 0;
> 
>         average = (double)stat->total_time / stat->count;
> 
> -       variance = stat->squared_sum;
> -       variance -= (stat->total_time * stat->total_time) / stat->count;
> -       variance /= stat->count;
> +       squared_total = stat->total_time * stat->total_time;
> +       variance = (stat->squared_sum - squared_total / stat->count) /
> (stat->count - 1);
> +       stddev_mean = sqrt(variance / stat->count);
> 
> -       return 100 * sqrt(variance) / average;
> +       return 100 * stddev_mean / average;
>  }

Can it be like this?

diff --git a/tools/perf/util/bpf-trace-summary.c b/tools/perf/util/bpf-trace-summary.c
index a91d42447e850a59..c897fb017914960c 100644
--- a/tools/perf/util/bpf-trace-summary.c
+++ b/tools/perf/util/bpf-trace-summary.c
@@ -71,9 +71,9 @@ static double rel_stddev(struct syscall_stats *stat)
 
        variance = stat->squared_sum;
        variance -= (stat->total_time * stat->total_time) / stat->count;
-       variance /= stat->count;
+       variance /= stat->count - 1;
 
-       return 100 * sqrt(variance) / average;
+       return 100 * sqrt(variance / stat->count) / average;
 }
 
 struct syscall_data {

> 
> btw I haven't checked the legal range for stddev_mean, so I can be wrong.
> <SNIP>
> > +static int update_total_stats(struct hashmap *hash, struct syscall_key *map_key,
> > +                             struct syscall_stats *map_data)
> > +{
> > +       struct syscall_data *data;
> > +       struct syscall_stats *stat;
> > +
> > +       if (!hashmap__find(hash, map_key, &data)) {
> > +               data = zalloc(sizeof(*data));
> > +               if (data == NULL)
> > +                       return -ENOMEM;
> > +
> > +               data->nodes = zalloc(sizeof(*data->nodes));
> > +               if (data->nodes == NULL) {
> > +                       free(data);
> > +                       return -ENOMEM;
> > +               }
> > +
> > +               data->nr_nodes = 1;
> > +               data->key = map_key->nr;
> > +               data->nodes->syscall_nr = data->key;
> Wow, aggressive. I guess you want it to behave like a single value
> when it is SYSCALL_AGGR_CPU, and an array when it is
> SYSCALL_AGGR_THREAD. Do you mind adding a comment about it?
> 
> so it's
> 
> (cpu, syscall_nr) -> data -> {node}
> (tid, syscall_nr) -> data -> [node1, node2, node3]

Right, will add comments.

> 
> 
> > +
> > +               if (hashmap__add(hash, data->key, data) < 0) {
> > +                       free(data->nodes);
> > +                       free(data);
> > +                       return -ENOMEM;
> > +               }
> > +       }
> > +
> > +       /* update total stats for this syscall */
> > +       data->nr_events += map_data->count;
> > +       data->total_time += map_data->total_time;
> > +
> > +       /* This is sum of the same syscall from different CPUs */
> > +       stat = &data->nodes->stats;
> > +
> > +       stat->total_time += map_data->total_time;
> > +       stat->squared_sum += map_data->squared_sum;
> > +       stat->count += map_data->count;
> > +       stat->error += map_data->error;
> > +
> > +       if (stat->max_time < map_data->max_time)
> > +               stat->max_time = map_data->max_time;
> > +       if (stat->min_time > map_data->min_time)
> > +               stat->min_time = map_data->min_time;
> > +
> > +       return 0;
> > +}
> > +
> > +static int print_total_stats(struct syscall_data **data, int nr_data, FILE *fp)
> > +{
> > +       int printed = 0;
> > +       int nr_events = 0;
> > +
> > +       for (int i = 0; i < nr_data; i++)
> > +               nr_events += data[i]->nr_events;
> > +
> > +       printed += fprintf(fp, " total, %d events\n\n", nr_events);
> > +
> > +       printed += fprintf(fp, "   syscall            calls  errors  total       min       avg       max       stddev\n");
> > +       printed += fprintf(fp, "                                     (msec)    (msec)    (msec)    (msec)        (%%)\n");
> > +       printed += fprintf(fp, "   --------------- --------  ------ -------- --------- --------- ---------     ------\n");
> > +
> > +       for (int i = 0; i < nr_data; i++)
> > +               printed += print_common_stats(data[i], fp);
> > +
> > +       printed += fprintf(fp, "\n\n");
> > +       return printed;
> > +}
> > +
> > +int trace_print_bpf_summary(FILE *fp)
> > +{
> > +       struct bpf_map *map = skel->maps.syscall_stats_map;
> > +       struct syscall_key *prev_key, key;
> > +       struct syscall_data **data = NULL;
> > +       struct hashmap schash;
> > +       struct hashmap_entry *entry;
> > +       int nr_data = 0;
> > +       int printed = 0;
> > +       int i;
> > +       size_t bkt;
> > +
> > +       hashmap__init(&schash, sc_node_hash, sc_node_equal, /*ctx=*/NULL);
> > +
> > +       printed = fprintf(fp, "\n Summary of events:\n\n");
> > +
> > +       /* get stats from the bpf map */
> > +       prev_key = NULL;
> > +       while (!bpf_map__get_next_key(map, prev_key, &key, sizeof(key))) {
> > +               struct syscall_stats stat;
> > +
> > +               if (!bpf_map__lookup_elem(map, &key, sizeof(key), &stat, sizeof(stat), 0)) {
> > +                       if (skel->rodata->aggr_mode == SYSCALL_AGGR_THREAD)
> > +                               update_thread_stats(&schash, &key, &stat);
> > +                       else
> > +                               update_total_stats(&schash, &key, &stat);
> > +               }
> > +
> > +               prev_key = &key;
> > +       }
> > +
> > +       nr_data = hashmap__size(&schash);
> > +       data = calloc(nr_data, sizeof(*data));
> > +       if (data == NULL)
> > +               goto out;
> > +
> > +       i = 0;
> > +       hashmap__for_each_entry(&schash, entry, bkt)
> > +               data[i++] = entry->pvalue;
> > +
> > +       qsort(data, nr_data, sizeof(*data), datacmp);
> 
> Here syscall_data is sorted for AGGR_THREAD and AGGR_CPU, meaning the
> thread who has the higher total syscall period will be printed first.
> This is an awesome side effect but it is not the behavior of 'sudo
> ./perf trace -as -- sleep 1' without the --bpf-summary option. If it
> is not too trivial, maybe consider documenting this behavior? But it
> may be too verbose so Idk.

The original behavior is random ordering and now it's ordered by total
time.  But still we can think it's randomly ordered.  I believe we need
to maintain strict ordering by total time and then fixed the existing
code.  Once that happen I can add the documentation.

Thanks for your careful review!
Namhyung

> 
> sudo ./perf trace -as -- sleep 1
> 
>  ClientModuleMan (4956), 16 events, 0.1%
> 
>    syscall            calls  errors  total       min       avg
> max       stddev
>                                      (msec)    (msec)    (msec)
> (msec)        (%)
>    --------------- --------  ------ -------- --------- ---------
> ---------     ------
>    futex                  8      4   750.234     0.000    93.779
> 250.105     48.79%
> 
> 
>  CHTTPClientThre (15720), 16 events, 0.1%
> 
>    syscall            calls  errors  total       min       avg
> max       stddev
>                                      (msec)    (msec)    (msec)
> (msec)        (%)
>    --------------- --------  ------ -------- --------- ---------
> ---------     ------
>    futex                  8      4  1000.425     0.000   125.053
> 750.317     75.59%
> 
> The order is random for the command above.
> 
> Thanks,
> Howard

  reply	other threads:[~2025-03-21  5:54 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-17 18:08 [PATCH v2] perf trace: Implement syscall summary in BPF Namhyung Kim
2025-03-17 18:37 ` Ian Rogers
2025-03-20  6:07   ` Namhyung Kim
2025-03-19 19:00 ` Howard Chu
2025-03-19 22:07   ` Namhyung Kim
2025-03-19 22:19     ` Howard Chu
2025-03-19 23:39       ` Howard Chu
2025-03-20  0:17         ` Namhyung Kim
2025-03-21  2:35 ` Howard Chu
2025-03-21  5:54   ` Namhyung Kim [this message]
2025-03-21 17:19     ` Howard Chu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z9z-_ftY23KcZ6c1@google.com \
    --to=namhyung@kernel.org \
    --cc=acme@kernel.org \
    --cc=adrian.hunter@intel.com \
    --cc=bpf@vger.kernel.org \
    --cc=howardchu95@gmail.com \
    --cc=irogers@google.com \
    --cc=jolsa@kernel.org \
    --cc=kan.liang@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=song@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.