From: Martin KaFai Lau <martin.lau@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Kernel Team <kernel-team@fb.com>,
Network Development <netdev@vger.kernel.org>
Subject: Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
Date: Fri, 23 Sep 2022 10:46:43 -0700
Message-ID: <27c7725a-738a-2227-5e47-ab2afab29348@linux.dev>
In-Reply-To: <CAADnVQKFdpiQFxgF253V5XmtjnrVXcZ14sxT_Q3vOQ97WxScMQ@mail.gmail.com>
On 9/23/22 8:26 AM, Alexei Starovoitov wrote:
> On Thu, Sep 22, 2022 at 6:11 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
>>> On Thu, Sep 22, 2022 at 3:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>>>
>>>> From: Martin KaFai Lau <martin.lau@kernel.org>
>>>>
>>>> When a bad bpf prog '.init' calls
>>>> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>>>>
>>>> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
>>>> ... => .init => bpf_setsockopt(tcp_cc).
>>>>
>>>> This was previously prevented by the prog->active counter, but the
>>>> prog->active detection cannot be used in struct_ops, as explained in
>>>> the earlier patch of this set.
>>>>
>>>> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed,
>>>> in order to break the loop. This is done by checking whether the
>>>> previous bpf_run_ctx has saved the same sk pointer in its
>>>> bpf_cookie.
>>>>
>>>> Note that this essentially means only the first '.init' can call
>>>> bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. the peer
>>>> does not support ECN) and the second '.init' cannot fall back to
>>>> another cc. This applies even when the second
>>>> bpf_setsockopt(TCP_CONGESTION) would not cause a loop.
>>>>
>>>> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
>>>> ---
>>>>   include/linux/filter.h |  3 +++
>>>>   net/core/filter.c      |  4 ++--
>>>>   net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
>>>>   3 files changed, 59 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>>> index 98e28126c24b..9942ecc68a45 100644
>>>> --- a/include/linux/filter.h
>>>> +++ b/include/linux/filter.h
>>>> @@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
>>>>  bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
>>>>  void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
>>>>
>>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>> +		    char *optval, int optlen);
>>>> +
>>>>  u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>>>>  #define __bpf_call_base_args \
>>>>  	((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index f4cea3ff994a..e56a1ebcf1bc 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
>>>>  	return -EINVAL;
>>>>  }
>>>>
>>>> -static int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>> -			   char *optval, int optlen)
>>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>> +		    char *optval, int optlen)
>>>>  {
>>>>  	if (sk_fullsock(sk))
>>>>  		sock_owned_by_me(sk);
>>>> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
>>>> index 6da16ae6a962..a9f2cab5ffbc 100644
>>>> --- a/net/ipv4/bpf_tcp_ca.c
>>>> +++ b/net/ipv4/bpf_tcp_ca.c
>>>> @@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
>>>>  	.arg2_type	= ARG_ANYTHING,
>>>>  };
>>>>
>>>> +BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
>>>> +	   int, optname, char *, optval, int, optlen)
>>>> +{
>>>> +	struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
>>>> +	int ret;
>>>> +
>>>> +	if (optname != TCP_CONGESTION)
>>>> +		return _bpf_setsockopt(sk, level, optname, optval, optlen);
>>>> +
>>>> +	run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
>>>> +	if (unlikely(run_ctx->saved_run_ctx &&
>>>> +		     run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
>>>> +		saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
>>>> +		/* This stops the loop:
>>>> +		 *
>>>> +		 * .init => bpf_setsockopt(tcp_cc) => .init =>
>>>> +		 * bpf_setsockopt(tcp_cc) => .init => ....
>>>> +		 *
>>>> +		 * The second bpf_setsockopt(tcp_cc) is not allowed,
>>>> +		 * in order to break the loop when both .init
>>>> +		 * are the same bpf prog.
>>>> +		 *
>>>> +		 * This applies even when the second
>>>> +		 * bpf_setsockopt(tcp_cc) would not cause a loop.
>>>> +		 * It means only the first '.init' can call
>>>> +		 * bpf_setsockopt(TCP_CONGESTION) to pick a fallback
>>>> +		 * cc (eg. the peer does not support ECN) and the
>>>> +		 * second '.init' cannot fall back to another cc.
>>>> +		 */
>>>> +		if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
>>>> +			return -EBUSY;
>>>> +	}
>>>> +
>>>> +	run_ctx->bpf_cookie = (uintptr_t)sk;
>>>> +	ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
>>>> +	run_ctx->bpf_cookie = 0;
>>>
>>> Instead of adding 4 bytes for the enum in patch 3
>>> (which will be 8 bytes due to alignment)
>>> and abusing bpf_cookie here
>>> (which a struct_ops bpf prog might eventually read and be surprised
>>> to find an sk pointer in there),
>>> how about adding 'struct task_struct *saved_current' as another
>>> member of bpf_tramp_run_ctx ?
>>> Always store the current task in there in prog_entry_struct_ops
>>> and then compare it here in this specialized bpf_init_ops_setsockopt?
>>>
>>> Or maybe always check in enter_prog_struct_ops:
>>> if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
>>>                  run_ctx)->saved_current == current) // goto out since recursion?
>>> it will prevent issues in cases we don't know about and will
>>> address the good recursion case as explained in patch 1?
>>> I'm assuming the 2nd ssthresh runs in a different task..
>>> Or is it actually the same task?
>>
>> The 2nd ssthresh() should run in the same task but with a different
>> sk. The first ssthresh(sk[1]) ran in in_task() context and then got
>> interrupted. The softirq then handles the rcv path, which just happens
>> to also call ssthresh(sk[2]) in the unlikely pkt-loss case. It is like
>> ssthresh(sk[1]) => softirq => ssthresh(sk[2]).
>>
>> The tcp-cc ops can recur but cannot recur on the same sk because they
>> require holding the sk lock, so the patch remembers what the previous
>> sk was to ensure it does not recur on the same sk. Then it needs to
>> peek into the previous run ctx, which may not always be a
>> bpf_tramp_run_ctx. eg. a cg bpf prog (with bpf_cg_run_ctx) can call
>> bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc"), which will then call
>> a_bpf_tcp_cc->init(). It needs a bpf_run_ctx_type so it can safely
>> peek at the previous bpf_run_ctx.
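>>
>> That cg-prog case looks roughly like this (illustrative call chain;
>> "a_bpf_tcp_cc" is just the example name from above):
>>
>>     cg bpf prog runs (bpf_cg_run_ctx on top of current->bpf_ctx)
>>       => bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION, "a_bpf_tcp_cc")
>>         => tcp_set_congestion_control()
>>           => a_bpf_tcp_cc->init(sk)  /* pushes a bpf_tramp_run_ctx */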
>
> got it.
>
>>
>> Since struct_ops is the only one that needs to peek into the previous
>> run_ctx (through tramp_run_ctx->saved_run_ctx), instead of adding 4
>> bytes to the bpf_run_ctx, one idea that just came to my mind is to use
>> one bit in the tramp_run_ctx->saved_run_ctx pointer itself. Something
>> like this if it reuses the bpf_cookie (int/ptr casts spelled out):
>>
>> #define BPF_RUN_CTX_STRUCT_OPS_BIT	1UL
>>
>> u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
>>					 struct bpf_tramp_run_ctx *run_ctx)
>>	__acquires(RCU)
>> {
>>	rcu_read_lock();
>>	migrate_disable();
>>
>>	run_ctx->saved_run_ctx =
>>		bpf_set_run_ctx((struct bpf_run_ctx *)
>>				((uintptr_t)&run_ctx->run_ctx |
>>				 BPF_RUN_CTX_STRUCT_OPS_BIT));
>>
>>	return bpf_prog_start_time();
>> }
>>
>> BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
>>	   int, optname, char *, optval, int, optlen)
>> {
>>	/* ... */
>>	if (unlikely(((uintptr_t)run_ctx->saved_run_ctx &
>>		      BPF_RUN_CTX_STRUCT_OPS_BIT) && ...)) {
>>		/* ... */
>>		if (bpf_cookie == (uintptr_t)sk)
>>			return -EBUSY;
>>	}
>> }
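>>
>> Any consumer of saved_run_ctx would then have to mask the bit off
>> before dereferencing the pointer, along the lines of (untested):
>>
>>	saved = (struct bpf_run_ctx *)
>>		((uintptr_t)run_ctx->saved_run_ctx &
>>		 ~BPF_RUN_CTX_STRUCT_OPS_BIT);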
>
> that should work, but don't you need to loop through all previous
> run_ctx and check all with BPF_RUN_CTX_STRUCT_OPS_BIT type ?
> Since run_ctx is saved in the task and we have preemptible
> progs there could be a tracing prog in the chain:
> struct_ops_run_ctx->tracing_run_ctx->struct_ops_run_ctx
> where the 1st and last have the same 'sk'.
This interleaving of different run_ctx could happen. My understanding
is that a 'struct_ops_run_ctx' can only be created when the tcp stack
is calling 'bpf_tcp_cc->init()' (or another cc ops). In the above case,
the first and second struct_ops_run_ctx are interleaved with a
tracing_run_ctx, but each of the two struct_ops_run_ctx was created by
a different 'bpf_tcp_cc->init()' call from the kernel tcp stack. The
tcp stack cannot call them with the same sk, both modifying that sk at
the same time. Otherwise, the kernel tcp stack has a bug.
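
Just to spell out what a full chain walk would look like if it were
ever needed (untested sketch; it assumes the bpf_run_ctx_type from
patch 3 and can only follow consecutive struct_ops run_ctx, since eg.
bpf_cg_run_ctx has no saved_run_ctx link to walk through):

	static bool sk_seen_in_run_ctx_chain(struct bpf_run_ctx *ctx,
					     struct sock *sk)
	{
		struct bpf_tramp_run_ctx *tramp_ctx;

		/* Walk back through consecutive struct_ops run_ctx and
		 * check whether any of them is operating on the same sk.
		 */
		while (ctx && ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS) {
			tramp_ctx = container_of(ctx, struct bpf_tramp_run_ctx,
						 run_ctx);
			if (tramp_ctx->bpf_cookie == (uintptr_t)sk)
				return true;
			ctx = tramp_ctx->saved_run_ctx;
		}
		return false;
	}

But given the above, the single-level check in this patch should be
sufficient.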