Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU

BPF List
 help / color / mirror / Atom feed

From: Martin KaFai Lau <martin.lau@linux.dev>
To: Zijian Zhang <zijianzhang@bytedance.com>,
	Amery Hung <amery.hung@bytedance.com>
Cc: bpf@vger.kernel.org, edumazet@google.com, davem@davemloft.net,
	kuba@kernel.org, pabeni@redhat.com, dsahern@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jolsa@kernel.org, mykolal@fb.com,
	shuah@kernel.org, xiyou.wangcong@gmail.com,
	wangdongdong.6@bytedance.com, zhoufeng.zf@bytedance.com
Subject: Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
Date: Tue, 3 Sep 2024 15:38:35 -0700	[thread overview]
Message-ID: <d89a6a41-c109-4033-8eba-1e11c3c6d1f6@linux.dev> (raw)
In-Reply-To: <3e387788-1f5a-4159-9ff5-e53e897ae41c@bytedance.com>

On 8/30/24 2:02 PM, Zijian Zhang wrote:
> On 8/28/24 6:00 PM, Martin KaFai Lau wrote:
>> On 8/28/24 4:01 PM, Zijian Zhang wrote:
>>> On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
>>>> On 8/26/24 6:37 PM, zijianzhang@bytedance.com wrote:
>>>>> From: Amery Hung <amery.hung@bytedance.com>
>>>>>
>>>>> This series prevents sockops users from accidentally causing packet
>>>>> drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
>>>>> reserves different option lengths in tcp_sendmsg().
>>>>>
>>>>> Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
>>>>> reserve a space in tcp_send_mss(), which will return the MSS for TSO.
>>>>> Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
>>>>> again to calculate the actual tcp_option_size and skb_push() the total
>>>>> header size.
>>>>>
>>>>> skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
>>>>> derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
>>>>> reserved opt size is smaller than the actual header size, the len of the
>>>>> skb can exceed the MTU. As a result, ip(6)_fragment will drop the
>>>>> packet if skb->ignore_df is not set.
>>>>>
>>>>> To prevent this accidental packet drop, we need to make sure the
>>>>> second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
>>>>> not more than the first time. 
>>>>
>>>> iiuc, it is a bug in the bpf prog itself that did not reserve the same 
>>>> header length and caused a drop. It is not the only drop case though for an 
>>>> incorrect bpf prog. There are other cases where a bpf prog can accidentally 
>>>> drop a packet.
>>>>
>>>> Do you have an actual use case that the bpf prog cannot reserve the correct 
>>>> header length for the same sk ?
>>>
>>> That's right, it's the bug of the bpf prog itself. We are trying to have
>>> the error reported earlier in eBPF program, instead of successfully
>>> returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
>>> at the end because of it.
>>>
>>> By adding this patch, the `remaining` variable passed to the
>>> bpf_skops_hdr_opt_len will be more precise, it takes the previously
>>> reserved size into account. As a result, if users accidentally set an
>>> option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
>>> will return -ENOSPC instead of 0.
>>
>> Putting aside it adds more checks, this adds something pretty unique to the 
>> bpf header option comparing with other dynamic options like sack. e.g. For tp- 
>> >mss_cache, I assume it won't change since the earlier tcp_current_mss() was 
>> called?
>>
> 
> Agreed, this check is very unique comparing with other dynamic options.
> How about moving the check into function bpf_skops_hdr_opt_len? It
> already has some logic to check if the context is in tcp_current_mss.
> Does it look more reasonable?

Yes, it probably may be better to put that into the bpf_skops_hdr_opt_len(). 
This is implementation details which is something for later after figuring out 
if the reserved_opt_spc change is correct and needed.

> 
> `reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb)`
> For tp->mss_cache here, yes, I also think it won't change,

This needs more details and clear explanation on why it is the case and why the 
existing regular bpf prog will continue to work.

afaik, mss_cache and tcp_skb_seglen() must be in-sync and the same between calls 
for this to work . From thinking about it, it should be but I haven't looked at 
all cases. Missing "- size" does not help the confidence here also.

Also, when "skb != NULL", I don't understand why the "MAX_TCP_OPTION_SPACE - 
size" is still being considered. I am likely missing something here. If the 
above formula is correct, why reserved_opt_spc is not always used for the "skb 
!= NULL" case and still need to be compared with the "MAX_TCP_OPTION_SPACE - size"?

> 
> I am trying to get the reserved bpf hdr size. Considering other dynamic
> options, this equation is not precise, and may change it to
> `reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb) - size`?

"- size" is needed. The test did not catch it?

> 
> Or, not sure if adding a field in tcp_sock to precisely cache the
> reserved bpf hdr size is a good idea?

imo, adding one field in tcp_sock for this is overkill.

It seems your bpf prog is randomly reserving space. If this is the case, giving 
a fail signal in bpf_reserve_hdr_opt() does not improve the success rate for 
your random bpf prog to reserve header. I don't think adding a field or the 
change in this patch really solves anything in your randomly reserving space use 
case ?

> 
>>>
>>> We have a use case where we add options to some packets kind of randomly
>>> for the purpose of sampling, and accidentally set a larger option size
>>> than the reserved size. It is the problem of ourselves and takes us
>>> some effort to troubleshoot the root cause.
>>>
>>> If bpf_sock_ops_reserve_hdr_opt can return an error in this case, it
>>> could be helpful for users to avoid this mistake.
>>
>> The bpf_sk_storage can be used to decide if a sk has been sampled.
>>
>> Also, with bpf_cast_to_kern_ctx and bpf_rdonly_cast, all the checks in this 
>> patch can be done in the bpf prog itself.
>>
> 
> Thanks for pointing this out! Agreed, the check can be implemented in
> eBPF progs.

next prev parent reply	other threads:[~2024-09-03 22:38 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-27  1:37 [PATCH bpf-next v2 0/2] prevent bpf_reserve_hdr_opt() from growing skb larger than MTU zijianzhang
2024-08-27  1:37 ` [PATCH bpf-next v2 1/2] bpf: tcp: " zijianzhang
2024-08-28 21:29   ` Martin KaFai Lau
2024-08-28 23:01     ` Zijian Zhang
2024-08-29  1:00       ` Martin KaFai Lau
2024-08-30 21:02         ` Zijian Zhang
2024-09-03 22:38           ` Martin KaFai Lau [this message]
2024-09-05 18:19             ` Zijian Zhang
2024-09-05 19:38               ` Martin KaFai Lau
2024-09-05 20:20                 ` Zijian Zhang
2024-09-05 21:07                   ` Martin KaFai Lau
2024-08-29 16:46     ` Cong Wang
2024-08-30  0:20       ` Martin KaFai Lau
2024-08-27  1:37 ` [PATCH bpf-next v2 2/2] bpf: selftests: reserve smaller tcp header options than the actual size zijianzhang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d89a6a41-c109-4033-8eba-1e11c3c6d1f6@linux.dev \
    --to=martin.lau@linux.dev \
    --cc=amery.hung@bytedance.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=eddyz87@gmail.com \
    --cc=edumazet@google.com \
    --cc=haoluo@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=mykolal@fb.com \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    --cc=shuah@kernel.org \
    --cc=wangdongdong.6@bytedance.com \
    --cc=xiyou.wangcong@gmail.com \
    --cc=yonghong.song@linux.dev \
    --cc=zhoufeng.zf@bytedance.com \
    --cc=zijianzhang@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox