All of lore.kernel.org
 help / color / mirror / Atom feed
From: Martin KaFai Lau <martin.lau@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <kernel-team@fb.com>,
	Network Development <netdev@vger.kernel.org>
Subject: Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
Date: Fri, 23 Sep 2022 10:46:43 -0700	[thread overview]
Message-ID: <27c7725a-738a-2227-5e47-ab2afab29348@linux.dev> (raw)
In-Reply-To: <CAADnVQKFdpiQFxgF253V5XmtjnrVXcZ14sxT_Q3vOQ97WxScMQ@mail.gmail.com>

On 9/23/22 8:26 AM, Alexei Starovoitov wrote:
> On Thu, Sep 22, 2022 at 6:11 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
>>> On Thu, Sep 22, 2022 at 3:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>>>
>>>> From: Martin KaFai Lau <martin.lau@kernel.org>
>>>>
>>>> When a bad bpf prog '.init' calls
>>>> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>>>>
>>>> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
>>>> ... => .init => bpf_setsockopt(tcp_cc).
>>>>
>>>> It was prevented by the prog->active counter before but the prog->active
>>>> detection cannot be used in struct_ops as explained in the earlier
>>>> patch of the set.
>>>>
>>>> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
>>>> in order to break the loop.  This is done by checking the
>>>> previous bpf_run_ctx has saved the same sk pointer in the
>>>> bpf_cookie.
>>>>
>>>> Note that this essentially limits only the first '.init' can
>>>> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
>>>> does not support ECN) and the second '.init' cannot fallback to
>>>> another cc.  This applies even the second
>>>> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
>>>>
>>>> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
>>>> ---
>>>>    include/linux/filter.h |  3 +++
>>>>    net/core/filter.c      |  4 ++--
>>>>    net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
>>>>    3 files changed, 59 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>>> index 98e28126c24b..9942ecc68a45 100644
>>>> --- a/include/linux/filter.h
>>>> +++ b/include/linux/filter.h
>>>> @@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
>>>>    bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
>>>>    void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
>>>>
>>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>> +                   char *optval, int optlen);
>>>> +
>>>>    u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>>>>    #define __bpf_call_base_args \
>>>>           ((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index f4cea3ff994a..e56a1ebcf1bc 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
>>>>           return -EINVAL;
>>>>    }
>>>>
>>>> -static int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>> -                          char *optval, int optlen)
>>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>> +                   char *optval, int optlen)
>>>>    {
>>>>           if (sk_fullsock(sk))
>>>>                   sock_owned_by_me(sk);
>>>> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
>>>> index 6da16ae6a962..a9f2cab5ffbc 100644
>>>> --- a/net/ipv4/bpf_tcp_ca.c
>>>> +++ b/net/ipv4/bpf_tcp_ca.c
>>>> @@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
>>>>           .arg2_type      = ARG_ANYTHING,
>>>>    };
>>>>
>>>> +BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
>>>> +          int, optname, char *, optval, int, optlen)
>>>> +{
>>>> +       struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
>>>> +       int ret;
>>>> +
>>>> +       if (optname != TCP_CONGESTION)
>>>> +               return _bpf_setsockopt(sk, level, optname, optval, optlen);
>>>> +
>>>> +       run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
>>>> +       if (unlikely(run_ctx->saved_run_ctx &&
>>>> +                    run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
>>>> +               saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
>>>> +               /* It stops this looping
>>>> +                *
>>>> +                * .init => bpf_setsockopt(tcp_cc) => .init =>
>>>> +                * bpf_setsockopt(tcp_cc)" => .init => ....
>>>> +                *
>>>> +                * The second bpf_setsockopt(tcp_cc) is not allowed
>>>> +                * in order to break the loop when both .init
>>>> +                * are the same bpf prog.
>>>> +                *
>>>> +                * This applies even the second bpf_setsockopt(tcp_cc)
>>>> +                * does not cause a loop.  This limits only the first
>>>> +                * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
>>>> +                * pick a fallback cc (eg. peer does not support ECN)
>>>> +                * and the second '.init' cannot fallback to
>>>> +                * another cc.
>>>> +                */
>>>> +               if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
>>>> +                       return -EBUSY;
>>>> +       }
>>>> +
>>>> +       run_ctx->bpf_cookie = (uintptr_t)sk;
>>>> +       ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
>>>> +       run_ctx->bpf_cookie = 0;
>>>
>>> Instead of adding 4 bytes for enum in patch 3
>>> (which will be 8 bytes due to alignment)
>>> and abusing bpf_cookie here
>>> (which struct_ops bpf prog might eventually read and be surprised
>>> to find sk pointer in there)
>>> how about adding 'struct task_struct *saved_current' as another arg
>>> to bpf_tramp_run_ctx ?
>>> Always store the current task in there in prog_entry_struct_ops
>>> and then compare it here in this specialized bpf_init_ops_setsockopt?
>>>
>>> Or maybe always check in enter_prog_struct_ops:
>>> if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
>>> run_ctx)->saved_current == current) // goto out since recursion?
>>> it will prevent issues in case we don't know about and will
>>> address the good recursion case as explained in patch 1?
>>> I'm assuming 2nd ssthresh runs in a different task..
>>> Or is it actually the same task?
>>
>> The 2nd ssthresh() should run in the same task but different sk.  The
>> first ssthresh(sk[1]) was run in_task() context and then got
>> interrupted.  The softirq then handles the rcv path which just happens
>> to also call ssthresh(sk[2]) in the unlikely pkt-loss case. It is like
>> ssthresh(sk[1]) => softirq => ssthresh(sk[2]).
>>
>> The tcp-cc ops can recur but cannot recur on the same sk because it
>> requires to hold the sk lock, so the patch remembers what was the
>> previous sk to ensure it does not recur on the same sk.  Then it needs
>> to peek into the previous run ctx which may not always be
>> bpf_trump_run_ctx.  eg. a cg bpf prog (with bpf_cg_run_ctx) can call
>> bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc") which then will call the
>> a_bpf_tcp_cc->init().  It needs a bpf_run_ctx_type so it can safely peek
>> the previous bpf_run_ctx.
> 
> got it.
> 
>>
>> Since struct_ops is the only one that needs to peek into the previous
>> run_ctx (through tramp_run_ctx->saved_run_ctx),  instead of adding 4
>> bytes to the bpf_run_ctx, one idea just came to my mind is to use one
>> bit in the tramp_run_ctx->saved_run_ctx pointer itsef.  Something like
>> this if it reuses the bpf_cookie (probably missed some int/ptr type
>> casting):
>>
>> #define BPF_RUN_CTX_STRUCT_OPS_BIT 1UL
>>
>> u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
>>                                       struct bpf_tramp_run_ctx *run_ctx)
>>           __acquires(RCU)
>> {
>>          rcu_read_lock();
>>          migrate_disable();
>>
>>          run_ctx->saved_run_ctx = bpf_set_run_ctx((&run_ctx->run_ctx) |
>>                                          BPF_RUN_CTX_STRUCT_OPS_BIT);
>>
>>           return bpf_prog_start_time();
>> }
>>
>> BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
>>              int, optname, char *, optval, int, optlen)
>> {
>>          /* ... */
>>          if (unlikely((run_ctx->saved_run_ctx &
>>                          BPF_RUN_CTX_STRUCT_OPS_BIT) && ...) {
>>                  /* ... */
>>                  if (bpf_cookie == (uintptr_t)sk)
>>                          return -EBUSY;
>>          }
>>
>> }
> 
> that should work, but don't you need to loop through all previous
> run_ctx and check all with BPF_RUN_CTX_STRUCT_OPS_BIT type ?
> Since run_ctx is saved in the task and we have preemptible
> rpgos there could be tracing prog in the chain:
> struct_ops_run_ctx->tracing_run_ctx->struct_ops_run_ctx
> where 1st and last have the same 'sk'.


This interleave of different run_ctx could happen.  My understanding is 
the 'struct_ops_run_ctx' can only be created when the tcp stack is 
calling the 'bpf_tcp_cc->init()' (or other cc ops).  In the above case, 
the first and second struct_ops_run_ctx are interleaved with a 
tracing_run_ctx.  Each of these two struct_ops_run_ctx was created from 
a different 'bpf_tcp_cc->init()' call by the kernel tcp stack.  They 
cannot be called with the same sk and changing that sk at the same time 
like this.  Otherwise, the kernel stack has a bug.

  reply	other threads:[~2022-09-23 17:46 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-22 22:56 [PATCH bpf-next 0/5] bpf: Remove recursion check for struct_ops prog Martin KaFai Lau
2022-09-22 22:56 ` [PATCH bpf-next 1/5] bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline Martin KaFai Lau
2022-09-22 22:56 ` [PATCH bpf-next 2/5] bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() Martin KaFai Lau
2022-09-22 22:56 ` [PATCH bpf-next 3/5] bpf: Add bpf_run_ctx_type Martin KaFai Lau
2022-09-22 22:56 ` [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself Martin KaFai Lau
2022-09-23  0:12   ` Alexei Starovoitov
2022-09-23  1:11     ` Martin KaFai Lau
2022-09-23  1:59       ` Hao Luo
2022-09-23 15:26       ` Alexei Starovoitov
2022-09-23 17:46         ` Martin KaFai Lau [this message]
2022-09-23 18:27           ` Andrii Nakryiko
2022-09-23 18:30             ` Andrii Nakryiko
2022-09-23 18:48               ` Martin KaFai Lau
2022-09-22 22:56 ` [PATCH bpf-next 5/5] selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) Martin KaFai Lau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=27c7725a-738a-2227-5e47-ab2afab29348@linux.dev \
    --to=martin.lau@linux.dev \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kernel-team@fb.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.