Date: Fri, 17 Apr 2020 16:02:09 +0100 (BST)
From: Alan Maguire
To: Yonghong Song
cc: David Ahern, Andrii Nakryiko, bpf@vger.kernel.org, Martin KaFai Lau,
    netdev@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann,
    kernel-team@fb.com
Subject: Re: [RFC PATCH bpf-next v2 00/17] bpf: implement bpf based dumping
    of kernel data structures
References: <20200415192740.4082659-1-yhs@fb.com>
    <40e427e2-5b15-e9aa-e2cb-42dc1b53d047@gmail.com>

On Wed, 15 Apr 2020, Yonghong Song wrote:

>
>
> On 4/15/20 7:23 PM, David Ahern wrote:
> > On 4/15/20 1:27 PM, Yonghong Song wrote:
> >>
> >> As there are some discussions regarding to the kernel interface/steps to
> >> create file/anonymous dumpers, I think it will be beneficial for
> >> discussion with this work in progress.
> >>
> >> Motivation:
> >>   The current way to dump kernel data structures mostly:
> >>     1. /proc system
> >>     2. various specific tools like "ss" which requires kernel support.
> >>     3. drgn
> >>   The drawback for the first two is that whenever you want to dump more,
> >>   you need to change the kernel. For example, Martin wants to dump socket local
> >
> > If kernel support is needed for bpfdump of kernel data structures, you
> > are not really solving the kernel support problem. i.e., to dump
> > ipv4_route's you need to modify the relevant proc show function.
>
> Yes, as mentioned two paragraphs below. kernel change is required.
> The tradeoff is that this is a one-time investment. Once kernel change
> is in place, printing new fields (in most cases except new fields
> which need additional locks etc.) no need for kernel change any more.
>

One thing I struggled with initially when reading the cover
letter was understanding how BPF dumper programs get run.
Patch 7 deals with that, I think, and the answer seems to be to
add seq file infrastructure alongside the existing one, which
executes the BPF dumper programs where appropriate.
Have I got this right?
I guess more lightweight methods such as instrumenting functions
associated with an existing /proc dumper are a bit too messy?

Thanks!

Alan

> >
> >
> >> storage with "ss". Kernel change is needed for it to work ([1]).
> >> This is also the direct motivation for this work.
> >>
> >> drgn ([2]) solves this problem nicely and no kernel change is needed.
> >> But since drgn is not able to verify the validity of a particular
> >> pointer value, it might present the wrong results in rare cases.
> >>
> >> In this patch set, we introduce bpf based dumping. Initial kernel
> >> changes are still needed, but a data structure change will not require
> >> kernel changes any more. bpf program itself is used to adapt to new
> >> data structure changes. This will give certain flexibility with
> >> guaranteed correctness.
> >>
> >> Here, kernel seq_ops is used to facilitate dumping, similar to current
> >> /proc and many other lossless kernel dumping facilities.
> >>
> >> User Interfaces:
> >>   1. A new mount file system, bpfdump at /sys/kernel/bpfdump is
> >>      introduced. Different from /sys/fs/bpf, this is a single user
> >>      mount. Mount command can be:
> >>        mount -t bpfdump bpfdump /sys/kernel/bpfdump
> >>   2. Kernel bpf dumpable data structures are represented as directories
> >>      under /sys/kernel/bpfdump, e.g.,
> >>        /sys/kernel/bpfdump/ipv6_route/
> >>        /sys/kernel/bpfdump/netlink/
> >
> > The names of bpfdump fs entries do not match actual data structure names
> > - e.g., there is no ipv6_route struct. On the one hand that is a good
> > thing since structure names can change, but that also means a mapping is
> > needed between the dumper filesystem entries and what you get for context.
>
> Yes, the later bpftool patch implements a new command to dump such
> information.
>
>   $ bpftool dumper show target
>   target                  prog_ctx_type
>   task                    bpfdump__task
>   task/file               bpfdump__task_file
>   bpf_map                 bpfdump__bpf_map
>   ipv6_route              bpfdump__ipv6_route
>   netlink                 bpfdump__netlink
>
> in vmlinux.h generated by vmlinux BTF, we have
>
>   struct bpf_dump_meta {
>         struct seq_file *seq;
>         u64 session_id;
>         u64 seq_num;
>   };
>
>   struct bpfdump__ipv6_route {
>         struct bpf_dump_meta *meta;
>         struct fib6_info *rt;
>   };
>
> Here, bpfdump__ipv6_route is the bpf program context type.
> Users can write the bpf program based on this.
>
> >
> > Further, what is the expectation in terms of stable API for these fs
> > entries? Entries in the context can change. Data structure names can
> > change. Entries in the structs can change. All of that breaks the idea
> > of stable programs that are compiled once and run for all future
> > releases. When structs change, those programs will break - and
> > structures will change.
>
> Yes, the API (ctx) we presented to bpf program is indeed unstable.
> CO-RE should help to a certain extent, but if some fields are gone, e.g.,
> bpf program will need to be rewritten for that particular kernel version, or
> kernel bpfdump infrastructure can be enhanced to
> change its ctx structure to have more information to the program
> for that kernel version. In summary, I agree with you that this is
> an unstable API similar to other tracing program
> since it accesses kernel internal data structures.
>
> >
> > What does bpfdumper provide that you can not do with a tracepoint on a
> > relevant function and then putting a program on the tracepoint? ie., why
> > not just put a tracepoint in the relevant dump functions.
>
> In my very beginning to explore bpfdump, kprobe to "show" function is
> one of options. But quickly we realized that we actually do not want
> to just piggyback on "show" function, but want to replace it with
> bpf. This will be useful in following different use cases:
>   1. first catable dumper file, similar to /proc/net/ipv6_route,
>      we want /sys/kernel/bpfdump/ipv6_route/my_dumper and you can cat
>      to get it.
>
>      Using kprobe when you are doing `cat /proc/net/ipv6_route`
>      is complicated. You probably need an application which
>      runs through `cat /proc/net/ipv6_route` and discards its output,
>      and at the same time gets the result from bpf program
>      (filtered by pid since somebody may run
>      `cat /proc/net/ipv6_route` at the same time). You may use
>      perf ring_buffer to send the result back to the application.
>
>      note that perf ring buffer may lose records for whatever
>      reason and seq_ops are implemented not to lose records
>      by built-in retries.
>
>      Using kprobe approach above is complicated and for each dumper
>      you need an application. We would like it to be just catable
>      with minimum user overhead to create such a dumper.
>
>   2. second, anonymous dumper, kprobe/tracepoint will incur
>      original overhead of seq_printf per object. but user may
>      be only interested in a very small portion of information.
>      In such cases, bpf program directly doing filtering in
>      the kernel can potentially speed up a lot if there are a lot of
>      records to traverse.
>
>   3. for data structures which do not have catable dumpers
>      for example task, hopefully, as demonstrated in this patch set,
>      kernel implementation and writing a bpf program are not
>      too hard. This especially enables people to do in-kernel
>      filtering which is the strength of the bpf.
>
> >
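
For readers following the thread, the flow discussed above (a per-object
context handed to a dumper program, which filters and formats into the
seq buffer) can be modelled in plain userspace C. This is only a rough
sketch under stated assumptions, not the patch set's code: struct seq_buf
and struct route are hypothetical stand-ins for struct seq_file and
struct fib6_info, and run_dump() and low_metric_prog() are invented names.
Only the shape of the two context structs follows the vmlinux.h excerpt
quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace stand-in for struct seq_file. */
struct seq_buf {
	char   data[256];
	size_t len;
};

/* Hypothetical stand-in for struct fib6_info; fields are invented. */
struct route {
	const char *dst;
	uint32_t    metric;
};

/* Same shape as bpf_dump_meta in the quoted vmlinux.h excerpt. */
struct dump_meta {
	struct seq_buf *seq;
	uint64_t        session_id;
	uint64_t        seq_num;
};

/* Same shape as bpfdump__ipv6_route in the quoted excerpt. */
struct dump_ctx {
	struct dump_meta *meta;
	struct route     *rt;
};

typedef int (*dump_prog_t)(struct dump_ctx *ctx);

/*
 * Hypothetical dumper "program": it filters first and only formats the
 * objects it cares about, so skipped objects cost no formatting work.
 */
static int low_metric_prog(struct dump_ctx *ctx)
{
	struct seq_buf *seq;
	size_t room;
	int n;

	assert(ctx && ctx->meta && ctx->meta->seq && ctx->rt);
	if (ctx->rt->metric >= 100)
		return 0;	/* filtered out, nothing is emitted */

	seq = ctx->meta->seq;
	room = sizeof(seq->data) - seq->len;
	n = snprintf(seq->data + seq->len, room, "%s metric %u\n",
		     ctx->rt->dst, (unsigned int)ctx->rt->metric);
	if (n < 0 || (size_t)n >= room) {
		seq->data[seq->len] = '\0';	/* drop the partial record */
		return -1;			/* out of buffer space */
	}
	seq->len += (size_t)n;
	return 0;
}

/* Walk every object and call the program once per object. */
static int run_dump(struct route *routes, size_t count,
		    dump_prog_t prog, struct seq_buf *out)
{
	struct dump_meta meta = { .seq = out, .session_id = 1, .seq_num = 0 };
	size_t i;

	for (i = 0; i < count; i++) {
		struct dump_ctx ctx = { .meta = &meta, .rt = &routes[i] };
		int err = prog(&ctx);

		if (err)
			return err;
		meta.seq_num++;
	}
	return 0;
}
```

Calling run_dump() over an array of routes with low_metric_prog leaves in
the buffer only the records the program chose to print. A real seq_file
implementation would retry with a larger buffer on overflow rather than
return an error; that part is deliberately left out of the sketch.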