BPF List
 help / color / mirror / Atom feed
From: Yonghong Song <yhs@fb.com>
To: Martin KaFai Lau <kafai@fb.com>, Daniel Borkmann <daniel@iogearbox.net>
Cc: <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	David Miller <davem@davemloft.net>, <kernel-team@fb.com>,
	<netdev@vger.kernel.org>, <eric.dumazet@gmail.com>
Subject: Re: [PATCH bpf-next v2 4/4] bpf: inet_diag: Dump bpf_sk_storages in inet_diag_dump()
Date: Wed, 26 Feb 2020 22:16:16 -0800	[thread overview]
Message-ID: <b3c06021-a5a4-191c-e301-a30ad6954191@fb.com> (raw)
In-Reply-To: <20200227013448.srxy5kkpve7yheln@kafai-mbp>



On 2/26/20 5:34 PM, Martin KaFai Lau wrote:
> On Wed, Feb 26, 2020 at 06:21:33PM +0100, Daniel Borkmann wrote:
>> On 2/26/20 12:04 AM, Martin KaFai Lau wrote:
>>> This patch will dump out the bpf_sk_storages of a sk
>>> if the request has the INET_DIAG_REQ_SK_BPF_STORAGES nlattr.
>>>
>>> An array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD can be specified in
>>> INET_DIAG_REQ_SK_BPF_STORAGES to select which bpf_sk_storage to dump.
>>> If no map_fd is specified, all bpf_sk_storages of a sk will be dumped.
>>>
>>> bpf_sk_storages can be added to the system at runtime.  It is difficult
>>> to find a proper static value for cb->min_dump_alloc.
>>>
>>> This patch learns the nlattr size required to dump the bpf_sk_storages
>>> of a sk.  If it happens to be the very first nlmsg of a dump and it
>>> cannot fit the needed bpf_sk_storages,  it will try to expand the
>>> skb by "pskb_expand_head()".
>>>
>>> Instead of expanding it in inet_sk_diag_fill(), it is expanded at a
>>> sleepable context in __inet_diag_dump() so __GFP_DIRECT_RECLAIM can
>>> be used.  In __inet_diag_dump(), it will retry as long as the
>>> skb is empty and the cb->min_dump_alloc becomes larger than before.
>>> cb->min_dump_alloc is bounded by KMALLOC_MAX_SIZE.  The min_dump_alloc
>>> is also changed from 'u16' to 'u32' to accommodate a sk that may have
>>> a few large bpf_sk_storages.
>>>
>>> The updated cb->min_dump_alloc will also be used to allocate the skb in
>>> the next dump.  This logic already exists in netlink_dump().
>>>
>>> Here is the sample output of a locally modified 'ss' and it could be made
>>> more readable by using BTF later:
>>> [root@arch-fb-vm1 ~]# ss --bpf-map-id 14 --bpf-map-id 13 -t6an 'dst [::1]:8989'
>>> State Recv-Q Send-Q Local Address:Port  Peer Address:PortProcess
>>> ESTAB 0      0              [::1]:51072        [::1]:8989
>>> 	 bpf_map_id:14 value:[ 3feb ]
>>> 	 bpf_map_id:13 value:[ 3f ]
>>> ESTAB 0      0              [::1]:51070        [::1]:8989
>>> 	 bpf_map_id:14 value:[ 3feb ]
>>> 	 bpf_map_id:13 value:[ 3f ]
>>>
>>> [root@arch-fb-vm1 ~]# ~/devshare/github/iproute2/misc/ss --bpf-maps -t6an 'dst [::1]:8989'
>>> State         Recv-Q         Send-Q                   Local Address:Port                    Peer Address:Port         Process
>>> ESTAB         0              0                                [::1]:51072                          [::1]:8989
>>> 	 bpf_map_id:14 value:[ 3feb ]
>>> 	 bpf_map_id:13 value:[ 3f ]
>>> 	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
>>> ESTAB         0              0                                [::1]:51070                          [::1]:8989
>>> 	 bpf_map_id:14 value:[ 3feb ]
>>> 	 bpf_map_id:13 value:[ 3f ]
>>> 	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
>>>
>>> Acked-by: Song Liu <songliubraving@fb.com>
>>> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>>
>> Hmm, the whole approach is not too pleasant to be honest. I can see why you need
>> it since the regular sk_storage lookup only takes sock fd as a key and you don't
>> have it otherwise available from outside, but then dumping up to KMALLOC_MAX_SIZE
>> via netlink skb is not a great experience either. :( Also, are we planning to add
>> the BTF dump there in addition to bpftool? Thus resulting in two different lookup
>> APIs and several tools needed for introspection instead of one? :/ Also, how do we
>> dump task local storage maps in future? Does it need a third lookup interface?
>>
>> In your commit logs I haven't read on other approaches and why they won't work;
>> I was wondering, given sockets are backed by inodes, couldn't we have a variant
>> of iget_locked() (minus the alloc_inode() part from there) where you pass in ino
>> number to eventually get to the socket and then dump the map value associated with
>> it the regular way from bpf() syscall?
> Thanks for the feedback!
> 
> I think (1) dumping all sk(s) in a system is different from
> (2) dumping all sk of a bpf_sk_storage_map or lookup a particular
> sk from a bpf_sk_storage_map.
> 
> This patch is doing (1).  I believe it is useful to make the commonly used
> tools like "ss" (which already shows many useful information of a sk)
> to be able to introspect a kernel struct extended by bpf instead of
> limiting to only the bpftool can show the bpf extended data.
> The plan is to move the bpftool/btf_dumper.c to libbpf.  The libbpf's
> btf_dumper print out format is still TBD and the current target is the drgn
> like format instead of the current semi-json like plain-txt printout.  As
> more kernel struct may be extensible by bpf, having it in libbpf will be
> useful going forward.
> 
> Re: future kernel struct extended by bpf
> For doing (1), I think we can ride on the existing API to iterate them also.
> bpf can extend a kernel struct but should not stop the current
> iteration API from working or seeing them.  That includes the current
> seq_file API.  The mid-term goal is to extend the seq_file by
> attaching a bpf_prog to (e.g. /proc/net/tcp) to do filtering
> and printing.  Yonghong is looking into that.

Right. I am working on a design and prototype to use bpf programs
to dump certain kernel internal data structures. The high level
idea is to create some /sys path to encode "what" to dump, e.g.,
/sys/kernel/bpfdump/tcp6_sockets/ for tcp6_sockets and "add" bpf
program(s) to that path to specify "how" to dump.

I hope to share the design and prototype soon.

  reply	other threads:[~2020-02-27  6:16 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-25 23:04 [PATCH bpf-next v2 0/4] Provide bpf_sk_storage data in INET_DIAG Martin KaFai Lau
2020-02-25 23:04 ` [PATCH bpf-next v2 1/4] inet_diag: Refactor inet_sk_diag_fill(), dump(), and dump_one() Martin KaFai Lau
2020-02-25 23:04 ` [PATCH bpf-next v2 2/4] inet_diag: Move the INET_DIAG_REQ_BYTECODE nlattr to cb->data Martin KaFai Lau
2020-02-25 23:04 ` [PATCH bpf-next v2 3/4] bpf: INET_DIAG support in bpf_sk_storage Martin KaFai Lau
2020-02-25 23:04 ` [PATCH bpf-next v2 4/4] bpf: inet_diag: Dump bpf_sk_storages in inet_diag_dump() Martin KaFai Lau
2020-02-26 17:21   ` Daniel Borkmann
2020-02-27  1:34     ` Martin KaFai Lau
2020-02-27  6:16       ` Yonghong Song [this message]
2020-02-27 23:45       ` Daniel Borkmann
2020-02-28  2:57         ` Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b3c06021-a5a4-191c-e301-a30ad6954191@fb.com \
    --to=yhs@fb.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=eric.dumazet@gmail.com \
    --cc=kafai@fb.com \
    --cc=kernel-team@fb.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox