Re: [Patch bpf-next v6 4/8] skmsg: move sk_redir from TCP_SKB_CB to skb

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jakub Sitnicki <jakub@cloudflare.com>
To: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>,
	duanxiongchun@bytedance.com,
	Dongdong Wang <wangdongdong.6@bytedance.com>,
	Jiang Wang <jiang.wang@bytedance.com>,
	Cong Wang <cong.wang@bytedance.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Lorenz Bauer <lmb@cloudflare.com>,
	John Fastabend <john.fastabend@gmail.com>
Subject: Re: [Patch bpf-next v6 4/8] skmsg: move sk_redir from TCP_SKB_CB to skb
Date: Tue, 23 Feb 2021 18:52:58 +0100	[thread overview]
Message-ID: <875z2i4qo5.fsf@cloudflare.com> (raw)
In-Reply-To: <CAM_iQpVS_sJy=sM31pHZVi6njZEAa7Hv_Bkt2sB7JcAjFw3guw@mail.gmail.com>

On Mon, Feb 22, 2021 at 08:27 PM CET, Cong Wang wrote:
> On Mon, Feb 22, 2021 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> On Sat, Feb 20, 2021 at 06:29 AM CET, Cong Wang wrote:
>> > From: Cong Wang <cong.wang@bytedance.com>
>> >
>> > Currently TCP_SKB_CB() is hard-coded in skmsg code, it certainly
>> > does not work for any other non-TCP protocols. We can move them to
>> > skb ext, but it introduces a memory allocation on fast path.
>> >
>> > Fortunately, we only need to a word-size to store all the information,
>> > because the flags actually only contains 1 bit so can be just packed
>> > into the lowest bit of the "pointer", which is stored as unsigned
>> > long.
>> >
>> > Inside struct sk_buff, '_skb_refdst' can be reused because skb dst is
>> > no longer needed after ->sk_data_ready() so we can just drop it.
>> >
>> > Cc: Daniel Borkmann <daniel@iogearbox.net>
>> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
>> > Cc: Lorenz Bauer <lmb@cloudflare.com>
>> > Acked-by: John Fastabend <john.fastabend@gmail.com>
>> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
>> > ---
>>
>> LGTM. I have some questions (below) that would help me confirm if I
>> understand the changes, and what could be improved, if anything.
>>
>> Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
>>
>> >  include/linux/skbuff.h |  3 +++
>> >  include/linux/skmsg.h  | 35 +++++++++++++++++++++++++++++++++++
>> >  include/net/tcp.h      | 19 -------------------
>> >  net/core/skmsg.c       | 32 ++++++++++++++++++++------------
>> >  net/core/sock_map.c    |  8 ++------
>> >  5 files changed, 60 insertions(+), 37 deletions(-)
>> >
>> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> > index 6d0a33d1c0db..bd84f799c952 100644
>> > --- a/include/linux/skbuff.h
>> > +++ b/include/linux/skbuff.h
>> > @@ -755,6 +755,9 @@ struct sk_buff {
>> >                       void            (*destructor)(struct sk_buff *skb);
>> >               };
>> >               struct list_head        tcp_tsorted_anchor;
>> > +#ifdef CONFIG_NET_SOCK_MSG
>> > +             unsigned long           _sk_redir;
>> > +#endif
>> >       };
>> >
>> >  #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
>> > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
>> > index e3bb712af257..fc234d507fd7 100644
>> > --- a/include/linux/skmsg.h
>> > +++ b/include/linux/skmsg.h
>> > @@ -459,4 +459,39 @@ static inline bool sk_psock_strp_enabled(struct sk_psock *psock)
>> >               return false;
>> >       return !!psock->saved_data_ready;
>> >  }
>> > +
>> > +#if IS_ENABLED(CONFIG_NET_SOCK_MSG)
>> > +static inline bool skb_bpf_ingress(const struct sk_buff *skb)
>> > +{
>> > +     unsigned long sk_redir = skb->_sk_redir;
>> > +
>> > +     return sk_redir & BPF_F_INGRESS;
>> > +}
>> > +
>> > +static inline void skb_bpf_set_ingress(struct sk_buff *skb)
>> > +{
>> > +     skb->_sk_redir |= BPF_F_INGRESS;
>> > +}
>> > +
>> > +static inline void skb_bpf_set_redir(struct sk_buff *skb, struct sock *sk_redir,
>> > +                                  bool ingress)
>> > +{
>> > +     skb->_sk_redir = (unsigned long)sk_redir;
>> > +     if (ingress)
>> > +             skb->_sk_redir |= BPF_F_INGRESS;
>> > +}
>> > +
>> > +static inline struct sock *skb_bpf_redirect_fetch(const struct sk_buff *skb)
>> > +{
>> > +     unsigned long sk_redir = skb->_sk_redir;
>> > +
>> > +     sk_redir &= ~0x1UL;
>>
>> We're using the enum when setting the bit flag, but a hardcoded constant
>> when masking it. ~BPF_F_INGRESS would be more consistent here.
>
> Well, here we need a mask, not a bit, but we don't have a mask yet,
> hence I just use hard-coded 0x1. Does #define BPF_F_MASK 0x1UL
> look any better?

Based on what I've seen around, mask for sanitizing tagged pointers is
usually derived from the flag(s). For instance:

#define SKB_DST_NOREF	1UL
#define SKB_DST_PTRMASK	~(SKB_DST_NOREF)

#define SK_USER_DATA_NOCOPY	1UL
#define SK_USER_DATA_BPF	2UL	/* Managed by BPF */
#define SK_USER_DATA_PTRMASK	~(SK_USER_DATA_NOCOPY | SK_USER_DATA_BPF)

Using ~(BPF_F_INGRESS) expression would be like substituting mask
definition.

[..]

>> > diff --git a/include/net/tcp.h b/include/net/tcp.h
>> > index 947ef5da6867..075de26f449d 100644
>> > --- a/include/net/tcp.h
>> > +++ b/include/net/tcp.h
>> > @@ -883,30 +883,11 @@ struct tcp_skb_cb {
>> >                       struct inet6_skb_parm   h6;
>> >  #endif
>> >               } header;       /* For incoming skbs */
>> > -             struct {
>> > -                     __u32 flags;
>> > -                     struct sock *sk_redir;
>> > -             } bpf;
>> >       };
>> >  };
>> >
>> >  #define TCP_SKB_CB(__skb)    ((struct tcp_skb_cb *)&((__skb)->cb[0]))
>> >
>> > -static inline bool tcp_skb_bpf_ingress(const struct sk_buff *skb)
>> > -{
>> > -     return TCP_SKB_CB(skb)->bpf.flags & BPF_F_INGRESS;
>> > -}
>> > -
>> > -static inline struct sock *tcp_skb_bpf_redirect_fetch(struct sk_buff *skb)
>> > -{
>> > -     return TCP_SKB_CB(skb)->bpf.sk_redir;
>> > -}
>> > -
>> > -static inline void tcp_skb_bpf_redirect_clear(struct sk_buff *skb)
>> > -{
>> > -     TCP_SKB_CB(skb)->bpf.sk_redir = NULL;
>> > -}
>> > -
>> >  extern const struct inet_connection_sock_af_ops ipv4_specific;
>> >
>> >  #if IS_ENABLED(CONFIG_IPV6)
>> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> > index 2d8bbb3fd87c..05b5af09ff42 100644
>> > --- a/net/core/skmsg.c
>> > +++ b/net/core/skmsg.c
>> > @@ -494,6 +494,8 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
>> >  static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
>> >                              u32 off, u32 len, bool ingress)
>> >  {
>> > +     skb_bpf_redirect_clear(skb);
>>
>> This is called to avoid leaking state in skb->_skb_refdst. Correct?
>
> This is to teach kfree_skb() not to consider it as a valid _skb_refdst.

OK

>
>>
>> I'm wondering why we're doing it every time sk_psock_handle_skb() gets
>> invoked from the do/while loop in sk_psock_backlog(), instead of doing
>> it once after reading ingress flag with skb_bpf_ingress()?
>
> It should also work, I don't see much difference here, as we almost
> always process a full skb, that is, ret == skb->len.

OK

>
>
>>
>> > +
>> >       if (!ingress) {
>> >               if (!sock_writeable(psock->sk))
>> >                       return -EAGAIN;
>> > @@ -525,7 +527,7 @@ static void sk_psock_backlog(struct work_struct *work)
>> >               len = skb->len;
>> >               off = 0;
>> >  start:
>> > -             ingress = tcp_skb_bpf_ingress(skb);
>> > +             ingress = skb_bpf_ingress(skb);
>> >               do {
>> >                       ret = -EIO;
>> >                       if (likely(psock->sk->sk_socket))
>> > @@ -631,7 +633,12 @@ void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
>> >
>> >  static void sk_psock_zap_ingress(struct sk_psock *psock)
>> >  {
>> > -     __skb_queue_purge(&psock->ingress_skb);
>> > +     struct sk_buff *skb;
>> > +
>> > +     while ((skb = __skb_dequeue(&psock->ingress_skb)) != NULL) {
>> > +             skb_bpf_redirect_clear(skb);
>>
>> I believe we clone the skb before enqueuing it psock->ingress_skb.
>> Clone happens either in sk_psock_verdict_recv() or in __strp_recv().
>> There are not other users holding a ref, so clearing the redirect seems
>> unneeded. Unless I'm missing something?
>
> Yes, skb dst is also cloned:
>
>  980 static void __copy_skb_header(struct sk_buff *new, const struct
> sk_buff *old)
>  981 {
>  982         new->tstamp             = old->tstamp;
>  983         /* We do not copy old->sk */
>  984         new->dev                = old->dev;
>  985         memcpy(new->cb, old->cb, sizeof(old->cb));
>  986         skb_dst_copy(new, old);
>
> Also, if without this, dst_release() would complain again. I was not smart
> enough to add it in the beginning, dst_release() taught me this lesson. ;)

OK, I think I follow you now.

Alternatively we could clear _skb_refdest after clone, but before
enqueuing the skb in ingress_skb. And only for when we're redirecting.

I believe that would be in sk_psock_skb_redirect, right before skb_queue_tail.

>
>>
>> > +             kfree_skb(skb);
>> > +     }
>> >       __sk_psock_purge_ingress_msg(psock);
>> >  }
>> >
>> > @@ -752,7 +759,7 @@ static void sk_psock_skb_redirect(struct sk_buff *skb)
>> >       struct sk_psock *psock_other;
>> >       struct sock *sk_other;
>> >
>> > -     sk_other = tcp_skb_bpf_redirect_fetch(skb);
>> > +     sk_other = skb_bpf_redirect_fetch(skb);
>> >       /* This error is a buggy BPF program, it returned a redirect
>> >        * return code, but then didn't set a redirect interface.
>> >        */
>> > @@ -802,9 +809,10 @@ int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb)
>> >                * TLS context.
>> >                */
>> >               skb->sk = psock->sk;
>> > -             tcp_skb_bpf_redirect_clear(skb);
>> > +             skb_dst_drop(skb);
>> > +             skb_bpf_redirect_clear(skb);
>>
>> After skb_dst_drop(), skb->_skb_refdst is clear. So it seems the
>> redirect_clear() is not needed. But I'm guessing it is being invoked
>> to communicate the intention?
>
> Technically true, but I prefer to call them explicitly, not to rely on the
> fact skb->_skb_refdst shares the same storage with skb->_sk_redir,
> which would also require some comments to explain.
>

OK

next prev parent reply	other threads:[~2021-02-23 17:54 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-20  5:29 [Patch bpf-next v6 0/8] sock_map: clean up and refactor code for BPF_SK_SKB_VERDICT Cong Wang
2021-02-20  5:29 ` [Patch bpf-next v6 1/8] bpf: clean up sockmap related Kconfigs Cong Wang
2021-02-22  8:51   ` Jakub Sitnicki
2021-02-22 23:23     ` Cong Wang
2021-02-20  5:29 ` [Patch bpf-next v6 2/8] skmsg: get rid of struct sk_psock_parser Cong Wang
2021-02-20  5:29 ` [Patch bpf-next v6 3/8] bpf: compute data_end dynamically with JIT code Cong Wang
2021-02-20  5:29 ` [Patch bpf-next v6 4/8] skmsg: move sk_redir from TCP_SKB_CB to skb Cong Wang
2021-02-22 12:20   ` Jakub Sitnicki
2021-02-22 19:27     ` Cong Wang
2021-02-23 17:52       ` Jakub Sitnicki [this message]
2021-02-23 18:04         ` Cong Wang
2021-02-23 18:36           ` Jakub Sitnicki
2021-02-20  5:29 ` [Patch bpf-next v6 5/8] sock_map: rename skb_parser and skb_verdict Cong Wang
2021-02-22 12:28   ` Jakub Sitnicki
2021-02-22 19:32     ` Cong Wang
2021-02-20  5:29 ` [Patch bpf-next v6 6/8] sock_map: make sock_map_prog_update() static Cong Wang
2021-02-22 12:29   ` Jakub Sitnicki
2021-02-20  5:29 ` [Patch bpf-next v6 7/8] skmsg: make __sk_psock_purge_ingress_msg() static Cong Wang
2021-02-22 12:30   ` Jakub Sitnicki
2021-02-20  5:29 ` [Patch bpf-next v6 8/8] skmsg: get rid of sk_psock_bpf_run() Cong Wang
2021-02-22 12:31   ` Jakub Sitnicki
2021-02-22 12:32 ` [Patch bpf-next v6 0/8] sock_map: clean up and refactor code for BPF_SK_SKB_VERDICT Jakub Sitnicki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875z2i4qo5.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=bpf@vger.kernel.org \
    --cc=cong.wang@bytedance.com \
    --cc=daniel@iogearbox.net \
    --cc=duanxiongchun@bytedance.com \
    --cc=jiang.wang@bytedance.com \
    --cc=john.fastabend@gmail.com \
    --cc=lmb@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    --cc=wangdongdong.6@bytedance.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.