From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f181.google.com (mail-pl1-f181.google.com [209.85.214.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AD3FB1362; Mon, 24 Mar 2025 01:49:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.181 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742780943; cv=none; b=oMoFnmkSpJFi6EB66o7lRTygQLEL4L8AdX7AFTBtUdrEVsH74NtXQxM6TXveYbNR82FlSOCc8P5N1xKiytA7kY4oPXTSTbKETVqOcK1A0wVlD0Iw/l6pqZj3LsOmUVvMgQohncjMGwSfjt8/oCRBHrZsPQKEJmKTlReKYR3B6/M= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742780943; c=relaxed/simple; bh=YV3k6Xdqyx4/ZDEUvtVUbTKz4NlDiLBTYIM4gctsH9w=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=uExk9bmB0TosCdDHDZUwOdcluX1SnAejzR4D+mSJTIlTyR7qDKrQmNFRxli5yXv6NqxYCe+oolqlQ7bTlEty0ORt+RvEbUsT93GGr4PbnQPA447QR3ChIVaUJ502CcEIyliII7oF3e/EC/+ZD8K+hdTVpYtiCJGeOkN711+w8Sk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=ETEkej1t; arc=none smtp.client-ip=209.85.214.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="ETEkej1t" Received: by mail-pl1-f181.google.com with SMTP id d9443c01a7336-227a8cdd241so17192215ad.3; Sun, 23 Mar 2025 18:49:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1742780941; x=1743385741; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=FXOQskcOJsN7Rmd723/wQRi5Q5/ziF69kqzxyWTLnlg=; b=ETEkej1tUPTduJH+M+hbWu5z+/hcGF9hBrP44SV9VpeLJjKmJ9DLQo7PuLJALcg+uG uEpp2k5NvF3aJsmFlK589DBYYxkXns6aw+bro/NZ0XBppdMCshKktABmscxSxmVP7SB3 rrNIHHuU23lA8yK+nQnjhQdSEzx/nvFbAcv4pd3EbYwASYvowhtgGsxLbIVqYcJzcyIq 74+wz/civqMU/gNjQcBgXNfTWRHbjNjOst2GWsimI5fQ+qhd2yX8ZKIvYdicQZDwZWW7 zIWA0ke29bsWcNBC8nYXCzCOREhPkyXiIwLL15gdg4I/arGsuwJaXdJO7YsWITky2fRx Gtjw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742780941; x=1743385741; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=FXOQskcOJsN7Rmd723/wQRi5Q5/ziF69kqzxyWTLnlg=; b=DV9/nkz3+9MhO5QkEbCT7OtCW9SK8+Qj/gT67RtKa0XIPAo9svjFIrpvVHhsrcnYWI 6gr0D4zm6Ylpks7zvWOm0LGkq6eLwM1xNRZiHkuoQenI+Azjyk7WqJseARoFM+dg/fcY +JlauWo3jQj87vj0V4/m+AnPGOjZV4K58xuOdKoQQmmWBo+8RVynNnhoj96kgs4nUUpn 8lo7VFKufWucKwNnDtIVsc5oPLxkHNX3qQ5ZjGn//KQxKKQAmVayXJbVgpmCbAvJte7W AmZ3Ydjf5d6maOdnaJmPCLZbw5bRvRUIny8DB5UzNhKNarf8fCdaPjqp3MvrdlyEXnlN wFAQ== X-Forwarded-Encrypted: i=1; AJvYcCVNNBmsSBkDxoL1HIiEUEWbB7At88VV/CI/60weaVEZjJPk3UGWnkLBV07H+oSc6jEY5R0=@vger.kernel.org, AJvYcCVpwPu0nhBrW8KE+GqcJhqudkS4m6Ew6D8mw/Tpl53LnrDH1No9TEcnbJ9r1MVgJp5+U3eZPdxblpJTxg26@vger.kernel.org, AJvYcCWGMsXhsfuvqqG8iKtaa/fjdBTErVwvWzH8SjbORhDbNskec9vtl8LWdaj/A3mYRYCQ3rgRopfdNAgGk3Z1JLFcZA==@vger.kernel.org X-Gm-Message-State: AOJu0YzX9IRm227AHIIXEqe2eCNxa9WoHOV+pyV3uVQ434QVoTHew567 U5uWW47ZYJoT/6nH5XZvTcyqlUiNMbmQ2eBVoze+L2jb81+QdSRk X-Gm-Gg: ASbGncv4Dg1uJ0lIfQMLzCWPkGQOkcAfLGChwr1eql86vPEBzLfRcqCxMBjPLnvazaI PLyLt3E2kMqJJa4d881zpuDRBYkn6Us/IU620ixM9/nBYvkw1K1I4cPMkEikMgB+Hi08L8YJQLZ qIWfzJg20wpNI/LR8EYp6G48QrRHplM+E8/QHLkZF/lWCiB9d0QvzR4JtQyZ9ngBxKU7Zcz1hnv ZKyQwTFHSIpLJtPAOnLaGG5ZAPl8IrjNTr3Pfjo94X/xG1MJSBFhUwcNFA2upaotsjdACMg/hZv io92FMi1oy6wqYovFyB4ZevizuQhFUnu2Gr2H0MEgZ+wK63zNcP58ubD4dG8PCFYJxT3ywkViUz RtPLsNh5wCsCNPA== X-Google-Smtp-Source: AGHT+IFFx/16qR05KKlDPtLD2SFdvl1sDcaa4WdsceekmmAJR82UCRJCOT1JxXLSuuP22jYSfdE6OA== X-Received: by 2002:a05:6a20:3942:b0:1f5:7007:9ea5 with SMTP id adf61e73a8af0-1fe42f2a027mr18737452637.2.1742780940623; Sun, 23 Mar 2025 18:49:00 -0700 (PDT) Received: from gmail.com (c-73-202-46-50.hsd1.ca.comcast.net. [73.202.46.50]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-af8a280e572sm5891967a12.32.2025.03.23.18.48.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 23 Mar 2025 18:49:00 -0700 (PDT) Date: Sun, 23 Mar 2025 18:48:57 -0700 From: Howard Chu To: Namhyung Kim , Arnaldo Carvalho de Melo , Ian Rogers , Kan Liang Cc: Jiri Olsa , Adrian Hunter , Peter Zijlstra , Ingo Molnar , LKML , linux-perf-users@vger.kernel.org, Song Liu , bpf@vger.kernel.org Subject: Re: [PATCH v3] perf trace: Implement syscall summary in BPF Message-ID: References: <20250321184255.2809370-1-namhyung@kernel.org> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20250321184255.2809370-1-namhyung@kernel.org> Hello Namhyung, On Fri, Mar 21, 2025 at 11:42:55AM -0700, Namhyung Kim wrote: > When -s/--summary option is used, it doesn't need (augmented) arguments > of syscalls. Let's skip the augmentation and load another small BPF > program to collect the statistics in the kernel instead of copying the > data to the ring-buffer to calculate the stats in userspace. This will > be much more light-weight than the existing approach and remove any lost > events. > > Let's add a new option --bpf-summary to control this behavior. I cannot > make it default because there's no way to get e_machine in the BPF which > is needed for detecting different ABIs like 32-bit compat mode. > > No functional changes intended except for no more LOST events. :) But > it only works with -a/--all-cpus for now. > > $ sudo ./perf trace -as --summary-mode=total --bpf-summary sleep 1 > > Summary of events: > > total, 6194 events > > syscall calls errors total min avg max stddev > (msec) (msec) (msec) (msec) (%) > --------------- -------- ------ -------- --------- --------- --------- ------ > epoll_wait 561 0 4530.843 0.000 8.076 520.941 18.75% > futex 693 45 4317.231 0.000 6.230 500.077 21.98% > poll 300 0 1040.109 0.000 3.467 120.928 17.02% > clock_nanosleep 1 0 1000.172 1000.172 1000.172 1000.172 0.00% > ppoll 360 0 872.386 0.001 2.423 253.275 41.91% > epoll_pwait 14 0 384.349 0.001 27.453 380.002 98.79% > pselect6 14 0 108.130 7.198 7.724 8.206 0.85% > nanosleep 39 0 43.378 0.069 1.112 10.084 44.23% > ... > > Cc: Howard Chu > Signed-off-by: Namhyung Kim > --- > v3) > * support -S/--with-summary option too (Howard) It gave me segfault somehow. (gdb) bt #0 sighandler_dump_stack (sig=32767) at util/debug.c:322 #1 #2 0x00005555556d2383 in hashmap_find () #3 0x000055555567474a in thread__update_stats (thread=0x5555564acc60, ttrace=0x5555564ad7a0, id=0, sample=0x7fffffff8f10, err=1, trace=0x7fffffffb1a0) at builtin-trace.c:2616 #4 0x00005555556757cb in trace__sys_exit (trace=0x7fffffffb1a0, evsel=0x5555561879d0, event=0x7fffed980000, sample=0x7fffffff8f10) at builtin-trace.c:2924 #5 0x0000555555677e1b in trace__handle_event (trace=0x7fffffffb1a0, event=0x7fffed980000, sample=0x7fffffff8f10) at builtin-trace.c:3619 #6 0x00005555556796fb in __trace__deliver_event (trace=0x7fffffffb1a0, event=0x7fffed980000) at builtin-trace.c:4173 #7 0x0000555555679859 in trace__deliver_event (trace=0x7fffffffb1a0, event=0x7fffed980000) at builtin-trace.c:4201 #8 0x000055555567abed in trace__run (trace=0x7fffffffb1a0, argc=2, argv=0x7fffffffeb30) at builtin-trace.c:4590 #9 0x000055555567f102 in cmd_trace (argc=2, argv=0x7fffffffeb30) at builtin-trace.c:5803 #10 0x0000555555685252 in run_builtin (p=0x5555560eaf28 , argc=7, argv=0x7fffffffeb30) at perf.c:351 #11 0x00005555556854fd in handle_internal_command (argc=7, argv=0x7fffffffeb30) at perf.c:404 #12 0x000055555568565e in run_argv (argcp=0x7fffffffe91c, argv=0x7fffffffe910) at perf.c:448 #13 0x00005555556859af in main (argc=7, argv=0x7fffffffeb30) at perf.c:556 > * make it work only with -a/--all-cpus (Howard) > * fix stddev calculation (Howard) > * add some comments about syscall_data (Howard) > > v2) > * Rebased on top of Ian's e_machine changes > * add --bpf-summary option > * support per-thread summary > * add stddev calculation (Howard) > > tools/perf/Documentation/perf-trace.txt | 6 + > tools/perf/Makefile.perf | 2 +- > tools/perf/builtin-trace.c | 51 ++- > tools/perf/util/Build | 1 + > tools/perf/util/bpf-trace-summary.c | 347 ++++++++++++++++++ > .../perf/util/bpf_skel/syscall_summary.bpf.c | 118 ++++++ > tools/perf/util/bpf_skel/syscall_summary.h | 25 ++ > tools/perf/util/trace.h | 37 ++ > 8 files changed, 574 insertions(+), 13 deletions(-) > create mode 100644 tools/perf/util/bpf-trace-summary.c > create mode 100644 tools/perf/util/bpf_skel/syscall_summary.bpf.c > create mode 100644 tools/perf/util/bpf_skel/syscall_summary.h > create mode 100644 tools/perf/util/trace.h > > diff --git a/tools/perf/Documentation/perf-trace.txt b/tools/perf/Documentation/perf-trace.txt > index 887dc37773d0f4d6..a8a0d8c33438fef7 100644 > --- a/tools/perf/Documentation/perf-trace.txt > +++ b/tools/perf/Documentation/perf-trace.txt > @@ -251,6 +251,12 @@ the thread executes on the designated CPUs. Default is to monitor all CPUs. > pretty-printing serves as a fallback to hand-crafted pretty printers, as the latter can > better pretty-print integer flags and struct pointers. > > +--bpf-summary:: > + Collect system call statistics in BPF. This is only for live mode and > + works well with -s/--summary option where no argument information is > + required. > + > + > PAGEFAULTS > ---------- > > diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf > index f3cd8de15d1a2681..d7a7e0c68fc10b8b 100644 > --- a/tools/perf/Makefile.perf > +++ b/tools/perf/Makefile.perf > @@ -1206,7 +1206,7 @@ SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h > SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h > SKELETONS += $(SKEL_OUT)/off_cpu.skel.h $(SKEL_OUT)/lock_contention.skel.h > SKELETONS += $(SKEL_OUT)/kwork_trace.skel.h $(SKEL_OUT)/sample_filter.skel.h > -SKELETONS += $(SKEL_OUT)/kwork_top.skel.h > +SKELETONS += $(SKEL_OUT)/kwork_top.skel.h $(SKEL_OUT)/syscall_summary.skel.h > SKELETONS += $(SKEL_OUT)/bench_uprobe.skel.h > SKELETONS += $(SKEL_OUT)/augmented_raw_syscalls.skel.h > > diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c > index 439e152186daf38b..71822161956827a7 100644 > --- a/tools/perf/builtin-trace.c > +++ b/tools/perf/builtin-trace.c > @@ -55,6 +55,7 @@ > #include "util/thread_map.h" > #include "util/stat.h" > #include "util/tool.h" > +#include "util/trace.h" > #include "util/util.h" > #include "trace/beauty/beauty.h" > #include "trace-event.h" > @@ -141,12 +142,6 @@ struct syscall_fmt { > bool hexret; > }; > > -enum summary_mode { > - SUMMARY__NONE = 0, > - SUMMARY__BY_TOTAL, > - SUMMARY__BY_THREAD, > -}; > - > struct trace { > struct perf_tool tool; > struct { > @@ -205,7 +200,7 @@ struct trace { > } stats; > unsigned int max_stack; > unsigned int min_stack; > - enum summary_mode summary_mode; > + enum trace_summary_mode summary_mode; > int raw_augmented_syscalls_args_size; > bool raw_augmented_syscalls; > bool fd_path_disabled; > @@ -234,6 +229,7 @@ struct trace { > bool force; > bool vfs_getname; > bool force_btf; > + bool summary_bpf; > int trace_pgfaults; > char *perfconfig_events; > struct { > @@ -4371,6 +4367,14 @@ static int trace__run(struct trace *trace, int argc, const char **argv) > > trace->live = true; > > + if (trace->summary_bpf) { > + if (trace_prepare_bpf_summary(trace->summary_mode) < 0) > + goto out_delete_evlist; > + > + if (trace->summary_only) > + goto create_maps; > + } > + > if (!trace->raw_augmented_syscalls) { > if (trace->trace_syscalls && trace__add_syscall_newtp(trace)) > goto out_error_raw_syscalls; > @@ -4429,6 +4433,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv) > if (trace->cgroup) > evlist__set_default_cgroup(trace->evlist, trace->cgroup); > > +create_maps: > err = evlist__create_maps(evlist, &trace->opts.target); > if (err < 0) { > fprintf(trace->output, "Problems parsing the target to trace, check your options!\n"); > @@ -4441,7 +4446,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv) > goto out_delete_evlist; > } > > - if (trace->summary_mode == SUMMARY__BY_TOTAL) { > + if (trace->summary_mode == SUMMARY__BY_TOTAL && !trace->summary_bpf) { > trace->syscall_stats = alloc_syscall_stats(); > if (trace->syscall_stats == NULL) > goto out_delete_evlist; > @@ -4527,9 +4532,11 @@ static int trace__run(struct trace *trace, int argc, const char **argv) > if (err < 0) > goto out_error_apply_filters; > > - err = evlist__mmap(evlist, trace->opts.mmap_pages); > - if (err < 0) > - goto out_error_mmap; > + if (!trace->summary_only || !trace->summary_bpf) { > + err = evlist__mmap(evlist, trace->opts.mmap_pages); > + if (err < 0) > + goto out_error_mmap; > + } > > if (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay) > evlist__enable(evlist); > @@ -4542,6 +4549,9 @@ static int trace__run(struct trace *trace, int argc, const char **argv) > evlist__enable(evlist); > } > > + if (trace->summary_bpf) > + trace_start_bpf_summary(); > + > trace->multiple_threads = perf_thread_map__pid(evlist->core.threads, 0) == -1 || > perf_thread_map__nr(evlist->core.threads) > 1 || > evlist__first(evlist)->core.attr.inherit; > @@ -4609,12 +4619,17 @@ static int trace__run(struct trace *trace, int argc, const char **argv) > > evlist__disable(evlist); > > + if (trace->summary_bpf) > + trace_end_bpf_summary(); > + > if (trace->sort_events) > ordered_events__flush(&trace->oe.data, OE_FLUSH__FINAL); > > if (!err) { > if (trace->summary) { > - if (trace->summary_mode == SUMMARY__BY_TOTAL) > + if (trace->summary_bpf) > + trace_print_bpf_summary(trace->output); > + else if (trace->summary_mode == SUMMARY__BY_TOTAL) > trace__fprintf_total_summary(trace, trace->output); > else > trace__fprintf_thread_summary(trace, trace->output); > @@ -4630,6 +4645,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv) > } > > out_delete_evlist: > + trace_cleanup_bpf_summary(); > delete_syscall_stats(trace->syscall_stats); > trace__symbols__exit(trace); > evlist__free_syscall_tp_fields(evlist); > @@ -5465,6 +5481,7 @@ int cmd_trace(int argc, const char **argv) > "start"), > OPT_BOOLEAN(0, "force-btf", &trace.force_btf, "Prefer btf_dump general pretty printer" > "to customized ones"), > + OPT_BOOLEAN(0, "bpf-summary", &trace.summary_bpf, "Summary syscall stats in BPF"), > OPTS_EVSWITCH(&trace.evswitch), > OPT_END() > }; > @@ -5556,6 +5573,16 @@ int cmd_trace(int argc, const char **argv) > goto skip_augmentation; > } > > + if (trace.summary_bpf) { > + if (!trace.opts.target.system_wide) { > + /* TODO: Add filters in the BPF to support other targets. */ > + pr_err("Error: --bpf-summary only works for system-wide mode.\n"); > + goto out; > + } > + if (trace.summary_only) > + goto skip_augmentation; > + } > + > trace.skel = augmented_raw_syscalls_bpf__open(); > if (!trace.skel) { > pr_debug("Failed to open augmented syscalls BPF skeleton"); > diff --git a/tools/perf/util/Build b/tools/perf/util/Build > index 034a6603d5a8e8b0..ba4201a6f3c69753 100644 > --- a/tools/perf/util/Build > +++ b/tools/perf/util/Build > @@ -171,6 +171,7 @@ perf-util-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o > perf-util-$(CONFIG_PERF_BPF_SKEL) += bpf-filter.o > perf-util-$(CONFIG_PERF_BPF_SKEL) += bpf-filter-flex.o > perf-util-$(CONFIG_PERF_BPF_SKEL) += bpf-filter-bison.o > +perf-util-$(CONFIG_PERF_BPF_SKEL) += bpf-trace-summary.o > perf-util-$(CONFIG_PERF_BPF_SKEL) += btf.o > > ifeq ($(CONFIG_LIBTRACEEVENT),y) > diff --git a/tools/perf/util/bpf-trace-summary.c b/tools/perf/util/bpf-trace-summary.c > new file mode 100644 > index 0000000000000000..1eadc00c46056747 > --- /dev/null > +++ b/tools/perf/util/bpf-trace-summary.c > @@ -0,0 +1,347 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#include > +#include > +#include > +#include > + > +#include "dwarf-regs.h" /* for EM_HOST */ > +#include "syscalltbl.h" > +#include "util/hashmap.h" > +#include "util/trace.h" > +#include "util/util.h" > +#include > +#include > +#include /* reallocarray */ > + > +#include "bpf_skel/syscall_summary.h" > +#include "bpf_skel/syscall_summary.skel.h" > + > + > +static struct syscall_summary_bpf *skel; > + > +int trace_prepare_bpf_summary(enum trace_summary_mode mode) > +{ > + skel = syscall_summary_bpf__open(); > + if (skel == NULL) { > + fprintf(stderr, "failed to open syscall summary bpf skeleton\n"); > + return -1; > + } > + > + if (mode == SUMMARY__BY_THREAD) > + skel->rodata->aggr_mode = SYSCALL_AGGR_THREAD; > + else > + skel->rodata->aggr_mode = SYSCALL_AGGR_CPU; > + > + if (syscall_summary_bpf__load(skel) < 0) { > + fprintf(stderr, "failed to load syscall summary bpf skeleton\n"); > + return -1; > + } > + > + if (syscall_summary_bpf__attach(skel) < 0) { > + fprintf(stderr, "failed to attach syscall summary bpf skeleton\n"); > + return -1; > + } > + > + return 0; > +} > + > +void trace_start_bpf_summary(void) > +{ > + skel->bss->enabled = 1; > +} > + > +void trace_end_bpf_summary(void) > +{ > + skel->bss->enabled = 0; > +} > + > +struct syscall_node { > + int syscall_nr; > + struct syscall_stats stats; > +}; > + > +static double rel_stddev(struct syscall_stats *stat) > +{ > + double variance, average; > + > + if (stat->count < 2) > + return 0; > + > + average = (double)stat->total_time / stat->count; > + > + variance = stat->squared_sum; > + variance -= (stat->total_time * stat->total_time) / stat->count; > + variance /= stat->count - 1; > + > + return 100 * sqrt(variance / stat->count) / average; > +} > + > +/* > + * The syscall_data is to maintain syscall stats ordered by total time. > + * It support different summary mode like per-thread or global. nit: It supports different summary modes like per-thread or global. > + * > + * For per-thread stats, it uses two-level data strurcture - > + * syscall_data is keyed by TID and has an array of nodes which > + * represents each syscall for the thread. > + * > + * For global stats, it's still two-level technically but we don't need > + * per-cpu analysis so it's keyed by the syscall number to combine stats > + * from different CPUs. And syscall_data always has a syscall_node so > + * it can effectively work as flat hierarchy. > + */ > +struct syscall_data { > + int key; /* tid if AGGR_THREAD, syscall-nr if AGGR_CPU */ > + int nr_events; > + int nr_nodes; > + u64 total_time; > + struct syscall_node *nodes; > +}; > + > +static int datacmp(const void *a, const void *b) > +{ > + const struct syscall_data * const *sa = a; > + const struct syscall_data * const *sb = b; > + > + return (*sa)->total_time > (*sb)->total_time ? -1 : 1; > +} > + > +static int nodecmp(const void *a, const void *b) > +{ > + const struct syscall_node *na = a; > + const struct syscall_node *nb = b; > + > + return na->stats.total_time > nb->stats.total_time ? -1 : 1; > +} > + > +static size_t sc_node_hash(long key, void *ctx __maybe_unused) > +{ > + return key; > +} > + > +static bool sc_node_equal(long key1, long key2, void *ctx __maybe_unused) > +{ > + return key1 == key2; > +} > + > +static int print_common_stats(struct syscall_data *data, FILE *fp) > +{ > + int printed = 0; > + > + for (int i = 0; i < data->nr_nodes; i++) { > + struct syscall_node *node = &data->nodes[i]; > + struct syscall_stats *stat = &node->stats; > + double total = (double)(stat->total_time) / NSEC_PER_MSEC; > + double min = (double)(stat->min_time) / NSEC_PER_MSEC; > + double max = (double)(stat->max_time) / NSEC_PER_MSEC; > + double avg = total / stat->count; > + const char *name; > + > + /* TODO: support other ABIs */ > + name = syscalltbl__name(EM_HOST, node->syscall_nr); > + if (name) > + printed += fprintf(fp, " %-15s", name); > + else > + printed += fprintf(fp, " syscall:%-7d", node->syscall_nr); > + > + printed += fprintf(fp, " %8u %6u %9.3f %9.3f %9.3f %9.3f %9.2f%%\n", > + stat->count, stat->error, total, min, avg, max, > + rel_stddev(stat)); > + } > + return printed; > +} > + > +static int update_thread_stats(struct hashmap *hash, struct syscall_key *map_key, > + struct syscall_stats *map_data) > +{ > + struct syscall_data *data; > + struct syscall_node *nodes; > + > + if (!hashmap__find(hash, map_key->cpu_or_tid, &data)) { > + data = zalloc(sizeof(*data)); > + if (data == NULL) > + return -ENOMEM; > + > + data->key = map_key->cpu_or_tid; > + if (hashmap__add(hash, data->key, data) < 0) { > + free(data); > + return -ENOMEM; > + } > + } > + > + /* update thread total stats */ > + data->nr_events += map_data->count; > + data->total_time += map_data->total_time; > + > + nodes = reallocarray(data->nodes, data->nr_nodes + 1, sizeof(*nodes)); > + if (nodes == NULL) > + return -ENOMEM; > + > + data->nodes = nodes; > + nodes = &data->nodes[data->nr_nodes++]; > + nodes->syscall_nr = map_key->nr; > + > + /* each thread has an entry for each syscall, just use the stat */ > + memcpy(&nodes->stats, map_data, sizeof(*map_data)); > + return 0; > +} > + > +static int print_thread_stat(struct syscall_data *data, FILE *fp) > +{ > + int printed = 0; > + > + qsort(data->nodes, data->nr_nodes, sizeof(*data->nodes), nodecmp); > + > + printed += fprintf(fp, " thread (%d), ", data->key); > + printed += fprintf(fp, "%d events\n\n", data->nr_events); > + > + printed += fprintf(fp, " syscall calls errors total min avg max stddev\n"); > + printed += fprintf(fp, " (msec) (msec) (msec) (msec) (%%)\n"); > + printed += fprintf(fp, " --------------- -------- ------ -------- --------- --------- --------- ------\n"); > + > + printed += print_common_stats(data, fp); > + printed += fprintf(fp, "\n\n"); > + > + return printed; > +} > + > +static int print_thread_stats(struct syscall_data **data, int nr_data, FILE *fp) > +{ > + int printed = 0; > + > + for (int i = 0; i < nr_data; i++) > + printed += print_thread_stat(data[i], fp); > + > + return printed; > +} > + > +static int update_total_stats(struct hashmap *hash, struct syscall_key *map_key, > + struct syscall_stats *map_data) > +{ > + struct syscall_data *data; > + struct syscall_stats *stat; > + > + if (!hashmap__find(hash, map_key, &data)) { There is a bug that I missed in my previous review, this should be map_key->nr. As I mentioned in v2's review, I discovered an anomaly, it reads: "there is a difference between sudo ./perf trace -as --summary-mode=total -- sleep 1 sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1 perf $ sudo ./perf trace -as --summary-mode=total -- sleep 1 Summary of events: total, 15354 events perf $ sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1 Summary of events: total, 1319 events without the --bpf-summary perf trace gave more events, and it ran slower" Turns out we are losing events because of finding with the wrong key. The value of key 'syscall_nr' is actually present in the hashmap so if hashmap__add() with the duplicated key, the function failed. It won't cause memory leak because it returns right away. This bug can be displayed using this diff: diff --git a/tools/perf/util/bpf-trace-summary.c b/tools/perf/util/bpf-trace-summary.c index 1eadc00c4605..5fab33a2de3a 100644 --- a/tools/perf/util/bpf-trace-summary.c +++ b/tools/perf/util/bpf-trace-summary.c @@ -214,13 +214,21 @@ static int print_thread_stats(struct syscall_data **data, int nr_data, FILE *fp) return printed; } +int cnt; + static int update_total_stats(struct hashmap *hash, struct syscall_key *map_key, struct syscall_stats *map_data) { struct syscall_data *data; struct syscall_stats *stat; + cnt += map_data->count; if (!hashmap__find(hash, map_key, &data)) { + if (hashmap__find(hash, map_key->nr, &data)) { + printf("duplicated key\n"); + printf("map_key->nr %d\n", map_key->nr); + printf("key %d nr_events %d nr_nodes %d total_time %lu nodes %p\n", data->key, data->nr_events, data->nr_nodes, data->total_time, data->nodes); + } data = zalloc(sizeof(*data)); if (data == NULL) return -ENOMEM; @@ -238,6 +246,7 @@ static int update_total_stats(struct hashmap *hash, struct syscall_key *map_key, if (hashmap__add(hash, data->key, data) < 0) { free(data->nodes); free(data); + printf("failed\n"); return -ENOMEM; } } @@ -270,6 +279,7 @@ static int print_total_stats(struct syscall_data **data, int nr_data, FILE *fp) for (int i = 0; i < nr_data; i++) nr_events += data[i]->nr_events; + printf("actual number of events: %d\n", cnt); printed += fprintf(fp, " total, %d events\n\n", nr_events); printed += fprintf(fp, " syscall calls errors total min avg max stddev\n"); > + data = zalloc(sizeof(*data)); > + if (data == NULL) > + return -ENOMEM; > + > + data->nodes = zalloc(sizeof(*data->nodes)); > + if (data->nodes == NULL) { > + free(data); > + return -ENOMEM; > + } > + > + data->nr_nodes = 1; > + data->key = map_key->nr; > + data->nodes->syscall_nr = data->key; > + > + if (hashmap__add(hash, data->key, data) < 0) { > + free(data->nodes); > + free(data); > + return -ENOMEM; > + } > + } Thanks for recommending me mutt, this is the first review using mutt, hope it reaches places. Thanks, Howard