From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.7 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, UNPARSEABLE_RELAY,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 228A0C43461 for ; Thu, 10 Sep 2020 21:03:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CFB2221D81 for ; Thu, 10 Sep 2020 21:03:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731191AbgIJOiJ (ORCPT ); Thu, 10 Sep 2020 10:38:09 -0400 Received: from out30-42.freemail.mail.aliyun.com ([115.124.30.42]:52939 "EHLO out30-42.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730729AbgIJOfk (ORCPT ); Thu, 10 Sep 2020 10:35:40 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R941e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04423;MF=dust.li@linux.alibaba.com;NM=1;PH=DS;RN=5;SR=0;TI=SMTPD_---0U8W8kz0_1599748489; Received: from localhost(mailfrom:dust.li@linux.alibaba.com fp:SMTPD_---0U8W8kz0_1599748489) by smtp.aliyun-inc.com(127.0.0.1); Thu, 10 Sep 2020 22:34:49 +0800 Date: Thu, 10 Sep 2020 22:34:49 +0800 From: "dust.li" To: David Miller Cc: kuba@kernel.org, edumazet@google.com, satoru.moriya@hds.com, netdev@vger.kernel.org Subject: Re: [PATCH v2] net: tracepoint: fix print wrong sysctl_mem value Message-ID: <20200910143449.GA26740@linux.alibaba.com> Reply-To: dust.li@linux.alibaba.com References: <20200908020939.7653-1-dust.li@linux.alibaba.com> <20200908.194952.784011770473577866.davem@davemloft.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200908.194952.784011770473577866.davem@davemloft.net> Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On Tue, Sep 08, 2020 at 07:49:52PM -0700, David Miller wrote: >From: Dust Li >Date: Tue, 8 Sep 2020 10:09:39 +0800 > >> @@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit, >> >> TP_STRUCT__entry( >> __array(char, name, 32) >> - __field(long *, sysctl_mem) >> + __array(long, sysctl_mem, 3) >> __field(long, allocated) >> __field(int, sysctl_rmem) >> __field(int, rmem_alloc) >> @@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit, >> >> TP_fast_assign( >> strncpy(__entry->name, prot->name, 32); >> - __entry->sysctl_mem = prot->sysctl_mem; >> + __entry->sysctl_mem[0] = prot->sysctl_mem[0]; >> + __entry->sysctl_mem[1] = prot->sysctl_mem[1]; >> + __entry->sysctl_mem[2] = prot->sysctl_mem[2]; > >I can't understand at all why the current code doesn't work. > >We assign a pointer to entry->sysctl_mem and then print out the >three words pointed to by that. > >It's so wasteful to copy this over every tracepoint entry so >the pointer approach is very desirable. Thanks for your reply! I took a close look at the code generated by tracepoint and found the problem is not the tracepoint itself, but `perf trace`. My previous output was got by running: `perf trace -e sock:sock_exceed_buf_limit` This time, I tried directly read from /sys/kernel/debug/tracing/trace, and everything is right :) So I checked the code of perf tool, and found the foundamatal difference is `perf trace` did the string formatting in the userspace, but raw ftrace did it in the kernel. When using `perf trace`, the kernel passes the string format and the data to perf using the perf ringbuffer, and no one in the kernel will try to visit the pointer sysctl_mem is pointed to, so the the userspace perf got the original pointer of sysctl_mem and tries to do the formating, which result in the wrong data in the commit log. The key call trace when using `perf trace` is this: trace_sock_exceed_buf_limit() --> perf_trace_sock_exceed_buf_limit() { ... perf_fetch_caller_regs(__regs); { strncpy(entry->name, prot->name, 32); entry->sysctl_mem = prot->sysctl_mem; entry->allocated = allocated; entry->sysctl_rmem = sk_get_rmem0(sk, prot); entry->rmem_alloc = atomic_read(&sk->sk_backlog.rmem_alloc); entry->sysctl_wmem = sk_get_wmem0(sk, prot); entry->wmem_alloc = refcount_read(&sk->sk_wmem_alloc); entry->wmem_queued = sk->sk_wmem_queued; entry->kind = kind; } perf_trace_run_bpf_submit(entry, __entry_size, rctx, event_call, \ __count, __regs, head, __task); } Here *entry* is directly passed in to perf_trace_run_bpf_submit() as raw data, and perf_trace_run_bpf_submit() won't do string formatting but just pass them to the userspace perf, which will finnally did the formatting, but it's already too late to get sysctl_mem[x]. So, any pointer dereference in tracepoint entry should failed in `perf trace`. Thanks. Dust.Li