* [PATCH v2] net: tracepoint: fix print wrong sysctl_mem value
@ 2020-09-08 2:09 Dust Li
2020-09-09 2:49 ` David Miller
0 siblings, 1 reply; 3+ messages in thread
From: Dust Li @ 2020-09-08 2:09 UTC (permalink / raw)
To: Jakub Kicinski, David S . Miller, Eric Dumazet, Satoru Moriya; +Cc: netdev
sysctl_mem is an point, and tracepoint entry do not support
been visited like an array. Use an __array(3) to store sysctl_mem
instead.
tracpoint output with and without this fix:
- without fix:
28821.074 sock:sock_exceed_buf_limit:proto:UDP
sysctl_mem=-1741233440,19,322156906942464 allocated=19 sysctl_rmem=4096
rmem_alloc=75008 sysctl_wmem=4096 wmem_alloc=1 wmem_queued=0
kind=SK_MEM_RECV
- with fix:
2126.136 sock:sock_exceed_buf_limit:proto:UDP
sysctl_mem=18,122845,184266 allocated=19 sysctl_rmem=4096
rmem_alloc=73728 sysctl_wmem=4096 wmem_alloc=1 wmem_queued=0
kind=SK_MEM_RECV
v2: use __array(3) instead of 3 long type to store sysctl_mem
Fixes: 3847ce32aea9fdf ("core: add tracepoints for queueing skb to rcvbuf")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
---
include/trace/events/sock.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index a966d4b5ab37..914e58938480 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
TP_STRUCT__entry(
__array(char, name, 32)
- __field(long *, sysctl_mem)
+ __array(long, sysctl_mem, 3)
__field(long, allocated)
__field(int, sysctl_rmem)
__field(int, rmem_alloc)
@@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
TP_fast_assign(
strncpy(__entry->name, prot->name, 32);
- __entry->sysctl_mem = prot->sysctl_mem;
+ __entry->sysctl_mem[0] = prot->sysctl_mem[0];
+ __entry->sysctl_mem[1] = prot->sysctl_mem[1];
+ __entry->sysctl_mem[2] = prot->sysctl_mem[2];
__entry->allocated = allocated;
__entry->sysctl_rmem = sk_get_rmem0(sk, prot);
__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
--
2.19.1.3.ge56e4f7
^ permalink raw reply related [flat|nested] 3+ messages in thread* Re: [PATCH v2] net: tracepoint: fix print wrong sysctl_mem value 2020-09-08 2:09 [PATCH v2] net: tracepoint: fix print wrong sysctl_mem value Dust Li @ 2020-09-09 2:49 ` David Miller 2020-09-10 14:34 ` dust.li 0 siblings, 1 reply; 3+ messages in thread From: David Miller @ 2020-09-09 2:49 UTC (permalink / raw) To: dust.li; +Cc: kuba, edumazet, satoru.moriya, netdev From: Dust Li <dust.li@linux.alibaba.com> Date: Tue, 8 Sep 2020 10:09:39 +0800 > @@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit, > > TP_STRUCT__entry( > __array(char, name, 32) > - __field(long *, sysctl_mem) > + __array(long, sysctl_mem, 3) > __field(long, allocated) > __field(int, sysctl_rmem) > __field(int, rmem_alloc) > @@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit, > > TP_fast_assign( > strncpy(__entry->name, prot->name, 32); > - __entry->sysctl_mem = prot->sysctl_mem; > + __entry->sysctl_mem[0] = prot->sysctl_mem[0]; > + __entry->sysctl_mem[1] = prot->sysctl_mem[1]; > + __entry->sysctl_mem[2] = prot->sysctl_mem[2]; I can't understand at all why the current code doesn't work. We assign a pointer to entry->sysctl_mem and then print out the three words pointed to by that. It's so wasteful to copy this over every tracepoint entry so the pointer approach is very desirable. ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH v2] net: tracepoint: fix print wrong sysctl_mem value 2020-09-09 2:49 ` David Miller @ 2020-09-10 14:34 ` dust.li 0 siblings, 0 replies; 3+ messages in thread From: dust.li @ 2020-09-10 14:34 UTC (permalink / raw) To: David Miller; +Cc: kuba, edumazet, satoru.moriya, netdev On Tue, Sep 08, 2020 at 07:49:52PM -0700, David Miller wrote: >From: Dust Li <dust.li@linux.alibaba.com> >Date: Tue, 8 Sep 2020 10:09:39 +0800 > >> @@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit, >> >> TP_STRUCT__entry( >> __array(char, name, 32) >> - __field(long *, sysctl_mem) >> + __array(long, sysctl_mem, 3) >> __field(long, allocated) >> __field(int, sysctl_rmem) >> __field(int, rmem_alloc) >> @@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit, >> >> TP_fast_assign( >> strncpy(__entry->name, prot->name, 32); >> - __entry->sysctl_mem = prot->sysctl_mem; >> + __entry->sysctl_mem[0] = prot->sysctl_mem[0]; >> + __entry->sysctl_mem[1] = prot->sysctl_mem[1]; >> + __entry->sysctl_mem[2] = prot->sysctl_mem[2]; > >I can't understand at all why the current code doesn't work. > >We assign a pointer to entry->sysctl_mem and then print out the >three words pointed to by that. > >It's so wasteful to copy this over every tracepoint entry so >the pointer approach is very desirable. Thanks for your reply! I took a close look at the code generated by tracepoint and found the problem is not the tracepoint itself, but `perf trace`. My previous output was got by running: `perf trace -e sock:sock_exceed_buf_limit` This time, I tried directly read from /sys/kernel/debug/tracing/trace, and everything is right :) So I checked the code of perf tool, and found the foundamatal difference is `perf trace` did the string formatting in the userspace, but raw ftrace did it in the kernel. When using `perf trace`, the kernel passes the string format and the data to perf using the perf ringbuffer, and no one in the kernel will try to visit the pointer sysctl_mem is pointed to, so the the userspace perf got the original pointer of sysctl_mem and tries to do the formating, which result in the wrong data in the commit log. The key call trace when using `perf trace` is this: trace_sock_exceed_buf_limit() --> perf_trace_sock_exceed_buf_limit() { ... perf_fetch_caller_regs(__regs); { strncpy(entry->name, prot->name, 32); entry->sysctl_mem = prot->sysctl_mem; entry->allocated = allocated; entry->sysctl_rmem = sk_get_rmem0(sk, prot); entry->rmem_alloc = atomic_read(&sk->sk_backlog.rmem_alloc); entry->sysctl_wmem = sk_get_wmem0(sk, prot); entry->wmem_alloc = refcount_read(&sk->sk_wmem_alloc); entry->wmem_queued = sk->sk_wmem_queued; entry->kind = kind; } perf_trace_run_bpf_submit(entry, __entry_size, rctx, event_call, \ __count, __regs, head, __task); } Here *entry* is directly passed in to perf_trace_run_bpf_submit() as raw data, and perf_trace_run_bpf_submit() won't do string formatting but just pass them to the userspace perf, which will finnally did the formatting, but it's already too late to get sysctl_mem[x]. So, any pointer dereference in tracepoint entry should failed in `perf trace`. Thanks. Dust.Li ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2020-09-10 21:03 UTC | newest] Thread overview: 3+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2020-09-08 2:09 [PATCH v2] net: tracepoint: fix print wrong sysctl_mem value Dust Li 2020-09-09 2:49 ` David Miller 2020-09-10 14:34 ` dust.li
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).