From: Namhyung Kim <namhyung@kernel.org>
To: Howard Chu <howardchu95@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>,
Ian Rogers <irogers@google.com>,
Kan Liang <kan.liang@linux.intel.com>,
Jiri Olsa <jolsa@kernel.org>,
Adrian Hunter <adrian.hunter@intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
linux-perf-users@vger.kernel.org
Subject: Re: [RFC/PATCH] perf trace: Add --syscall-period option
Date: Wed, 18 Dec 2024 21:37:36 -0800 [thread overview]
Message-ID: <Z2OxIBW0t4kLBAFW@google.com> (raw)
In-Reply-To: <CAH0uvog3XkndPWvifVjhrV1GoM4O462e_3Xxr_=r9yCWhnKGmg@mail.gmail.com>
On Tue, Dec 17, 2024 at 09:40:21PM -0800, Howard Chu wrote:
> Hello Namhyung,
>
> Thanks for the patch.
>
> On Wed, Dec 11, 2024 at 2:21 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > This option is to implement the event sampling for system calls.
> > The period is given in msec as it shows the time in msec.
> >
> > # perf trace -C 0 --syscall-period 100 sleep 1
> > ? ( ): fleetspeakd/1828559 ... [continued]: futex()) = -1 ETIMEDOUT (Connection timed out)
> > 0.050 (100.247 ms): gnome-shell/572531 recvmsg(fd: 10<socket:[3355761]>, msg: 0x7ffef8b39d20) = 40
> > 100.357 (100.149 ms): pipewire-pulse/572245 read(fd: 5<anon_inode:[eventfd]>, buf: 0x7ffc0b9dc8f0, count: 8) = 8
> > 200.553 (100.268 ms): NetworkManager/3424 epoll_wait(epfd: 19<anon_inode:[eventpoll]>, events: 0x5607b85bb880, maxevents: 6) = 0
> > 300.876 ( ): mon/4932 poll(ufds: 0x7fa392784df0, nfds: 1, timeout_msecs: 100) ...
> > 400.901 ( 0.025 ms): TaskCon~ller #/620145 futex(uaddr: 0x7f3fc596fa00, op: WAKE|PRIVATE_FLAG, val: 1) = 0
> > 300.876 (100.123 ms): mon/4932 ... [continued]: poll()) = 0 (Timeout)
> > 500.901 ( 0.012 ms): evdefer/2/2335122 futex(uaddr: 0x5640baac5198, op: WAKE|PRIVATE_FLAG, val: 1) = 0
> > 602.701 ( 0.017 ms): Compositor/1992200 futex(uaddr: 0x7f1a51dfdd40, op: WAKE|PRIVATE_FLAG, val: 1) = 0
> > 705.589 ( 0.017 ms): JS Watchdog/947933 futex(uaddr: 0x7f4cac1d4240, op: WAKE|PRIVATE_FLAG, val: 1) = 0
> > 812.667 ( 0.027 ms): fix/1985151 futex(uaddr: 0xc0008f7148, op: WAKE|PRIVATE_FLAG, val: 1) = 1
> > 912.807 ( 0.017 ms): Xorg/572315 setitimer(value: 0x7ffc375d6ba0) = 0
> >
> > The timestamp is kept in a per-cpu array and the allowed task is saved
> > in a hash map.
> >
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---
> > tools/perf/Documentation/perf-trace.txt | 6 ++
> > tools/perf/builtin-trace.c | 5 ++
> > .../bpf_skel/augmented_raw_syscalls.bpf.c | 67 ++++++++++++++++++-
> > 3 files changed, 76 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/perf/Documentation/perf-trace.txt b/tools/perf/Documentation/perf-trace.txt
> > index 6e0cc50bbc13fc7f..9f338a8c5357a67e 100644
> > --- a/tools/perf/Documentation/perf-trace.txt
> > +++ b/tools/perf/Documentation/perf-trace.txt
> > @@ -241,6 +241,12 @@ the thread executes on the designated CPUs. Default is to monitor all CPUs.
> > printing using the existing 'perf trace' syscall arg beautifiers to map integer
> > arguments to strings (pid to comm, syscall id to syscall name, etc).
> >
> > +--syscall-period::
> > + Trace a system call in the given period (in msec). This implements
> > + sampling for syscalls in order to reduce the monitoring overhead.
> > + For example, setting the sysall period to 100 (msec) means it will
> > + sample a syscall and next one after 100 msec.
> > +
> >
> > PAGEFAULTS
> > ----------
> > diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> > index 3c46de1a8d79bfe6..789eb0ffd5f90b61 100644
> > --- a/tools/perf/builtin-trace.c
> > +++ b/tools/perf/builtin-trace.c
> > @@ -185,6 +185,7 @@ struct trace {
> > } stats;
> > unsigned int max_stack;
> > unsigned int min_stack;
> > + unsigned long sample_period_ms;
> > int raw_augmented_syscalls_args_size;
> > bool raw_augmented_syscalls;
> > bool fd_path_disabled;
> > @@ -5219,6 +5220,7 @@ int cmd_trace(int argc, const char **argv)
> > "start"),
> > OPT_BOOLEAN(0, "force-btf", &trace.force_btf, "Prefer btf_dump general pretty printer"
> > "to customized ones"),
> > + OPT_ULONG(0, "syscall-period", &trace.sample_period_ms, "syscall sampling period in ms"),
> > OPTS_EVSWITCH(&trace.evswitch),
> > OPT_END()
> > };
> > @@ -5326,6 +5328,9 @@ int cmd_trace(int argc, const char **argv)
> > bpf_program__set_autoattach(prog, /*autoattach=*/false);
> > }
> >
> > + if (trace.sample_period_ms)
> > + trace.skel->rodata->sample_period = trace.sample_period_ms * NSEC_PER_MSEC;
> > +
> > err = augmented_raw_syscalls_bpf__load(trace.skel);
> >
> > if (err < 0) {
> > diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> > index 4a62ed593e84edf8..12272620dcd73700 100644
> > --- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> > +++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> > @@ -113,6 +113,22 @@ struct pids_filtered {
> > __uint(max_entries, 64);
> > } pids_filtered SEC(".maps");
> >
> > +struct sample_timestamp {
> > + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> > + __type(key, int);
> > + __type(value, __u64);
> > + __uint(max_entries, 1);
> > +} sample_timestamp SEC(".maps");
> > +
> > +struct sample_filtered {
> > + __uint(type, BPF_MAP_TYPE_HASH);
> > + __type(key, pid_t);
> > + __type(value, bool);
> > + __uint(max_entries, MAX_CPUS);
> > +} sample_filtered SEC(".maps");
>
> Do you think MAX_CPUS provides enough entries? If there are MAX_CPUS
> syscalls that have entered but haven't exited, new syscalls cannot
> write to the map, which results in the new syscall's sys_exit not
> being collected. This can lead to trace logs like the following:
>
> 210.947 ( ): tmux: server/4399 poll(ufds: 0x579c819eac40,
> nfds: 51, timeout_msecs: 9)
>
> which doesn't have a duration, because perf trace doesn't know when it
> exits. For example, a poll system call can take a long time to exit,
> if there are MAX_CPUS number of polls, they will occupy all the space
> in the sample_filtered map.
>
> > + if (prev_ts) {
> > + if ((*prev_ts + sample_period) > timestamp)
> > + return false;
> > + *prev_ts = timestamp;
> > + } else {
> > + bpf_map_update_elem(&sample_timestamp, &idx, ×tamp, BPF_ANY);
> > + }
> > +
> > + bpf_map_update_elem(&sample_filtered, &pid, &ok, BPF_ANY);
>
> and during polling, when a new syscall enters, this
> bpf_map_update_elem() will return an error and fail to insert, and in
> sys_exit(), this:
>
> > + if (!bpf_map_lookup_elem(&sample_filtered, &pid))
> > + return false;
>
> returns false, and creates a sample with no duration.
>
> In order to eliminate samples with no duration, there should be no
> more than MAX_CPUS simultaneous syscalls, but I haven't yet created a
> scenario where there are MAX_CPUS polls at the same time. (such
> creation needs some tricks, because --syscall-period=1 allows maximum
> <number-of-cpus> number of syscalls every millisecond)
>
> That said, I agree that MAX_CPUS (4096) is generally sufficient for
> the sample_filtered map in most situations.
Right, usually the number of CPUs is far less than that, but I remember
there's a system with more than 2K CPUs. I can add a command line
option to control the number of entries or change the default using the
actual number of CPUs in the system.
>
> Aside from this edge case, this patch looks good to me, I tested it with:
>
> perf trace --syscall-period=500
>
> 544.863 ( 0.008 ms): sshd/869528 rt_sigprocmask(how: UNBLOCK, nset:
> 0x7fffef5f71f0, oset: 0x7fffef5f7270, sigsetsize: 8) = 0
> 545.034 ( 0.007 ms): tmux: server/4399 ioctl(fd: 54</dev/ptmx>,
> cmd: FIONREAD, arg: 0x7ffe020ec758) = 0
> 593.054 (100.214 ms): gnome-shell/2419 poll(ufds: 0x6236d31eb850,
> nfds: 15, timeout_msecs: 100) = 0 (Timeout)
> 1073.251 ( 0.059 ms): gmain/38955 inotify_add_watch(fd:
> 4<anon_inode:inotify>, pathname:
> "/var/lib/fwupd/metadata/lvfs-testing", mask: 16789454) = -1 ENOENT
> (No such file or directory)
> 1073.641 ( 0.014 ms): tmux: server/4399 ioctl(fd: 54</dev/ptmx>,
> cmd: FIONREAD, arg: 0x7ffe020ec758) = 0
> 1074.588 ( 0.013 ms): sshd/869528 rt_sigprocmask(how: UNBLOCK, nset:
> 0x7fffef5f71f0, oset: 0x7fffef5f7270, sigsetsize: 8) = 0
> 1094.651 (100.370 ms): gnome-shell/2419 poll(ufds: 0x6236d31eb850,
> nfds: 15, timeout_msecs: 100) = 0 (Timeout)
> 1596.817 (100.363 ms): gnome-shell/2419 poll(ufds: 0x6236d31eb850,
> nfds: 15, timeout_msecs: 100)
>
> The timestamp makes sense for the 500 ms period.
Thanks for the review and test!
Namhyung
prev parent reply other threads:[~2024-12-19 5:37 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-11 22:21 [RFC/PATCH] perf trace: Add --syscall-period option Namhyung Kim
2024-12-11 22:52 ` Arnaldo Carvalho de Melo
2024-12-12 1:34 ` Ian Rogers
2024-12-12 1:47 ` Howard Chu
2024-12-12 4:28 ` Ian Rogers
2024-12-13 2:08 ` Namhyung Kim
2024-12-18 5:40 ` Howard Chu
2024-12-19 5:37 ` Namhyung Kim [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z2OxIBW0t4kLBAFW@google.com \
--to=namhyung@kernel.org \
--cc=acme@kernel.org \
--cc=adrian.hunter@intel.com \
--cc=howardchu95@gmail.com \
--cc=irogers@google.com \
--cc=jolsa@kernel.org \
--cc=kan.liang@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-perf-users@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=peterz@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.