public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
From: Martin KaFai Lau <martin.lau@linux.dev>
To: yonghong.song@linux.dev
Cc: bpf@vger.kernel.org, ast@kernel.org, song@kernel.org,
	kernel-team@meta.com, andrii@kernel.org, sdf@google.com,
	sinquersw@gmail.com, kuifeng@meta.com, thinker.li@gmail.com,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Subject: Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
Date: Thu, 17 Aug 2023 15:56:40 -0700	[thread overview]
Message-ID: <fc66a75f-312e-8efd-2da9-d55e2391ea90@linux.dev> (raw)
In-Reply-To: <8e0fd701-0128-06a5-004e-c82a562691ee@linux.dev>

On 8/17/23 2:37 PM, Yonghong Song wrote:
> 
> 
> On 8/17/23 1:41 PM, Martin KaFai Lau wrote:
>> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
>>> But I think we have to step back. Why do we need this whole thing in the 
>>> first place?
>>> _why_  sockopt bpf progs needs to read and write user memory?
>>>
>>> Yes there is one page limit, but what is the use case to actually read and write
>>> beyond that? iptables sockopt was mentioned, but I don't think bpf prog can do
>>> anything useful with iptables binary blobs. They are hard enough for kernel 
>>> to parse.
>>
>> Usually the bpf prog is only interested in a very small number of optnames and 
>> no need to read the optval at all for most cases. The max size for our use 
>> cases is 16 bytes. The kernel currently is kind of doing it the opposite and 
>> always assumes the bpf prog needing to use the optval, allocate kernel memory 
>> and copy_from_user such that the non-sleepable bpf program can read/write it.
>>
>> The bpf prog usually checks the optname and then just returns for most cases:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>      if (ctx->optname != MY_INTERNAL_SOCKOPT)
>>          return 1;
>>
>>      /* change the ctx->optval and return to user space ... */
>> }
>>
>> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE memory and 
>> copy the first PAGE_SIZE data from the user optval. We used to ask the bpf 
>> prog to explicitly set the optlen to 0 for > PAGE_SIZE case even it has not 
>> looked at the optval. Otherwise, the kernel used to conclude that the bpf prog 
>> had set an invalid optlen because optlen is larger than the optval_end - 
>> optval and returned -EFAULT incorrectly to the end-user.
>>
>> The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due to an 
>> internal kernel PAGE_SIZE limitation:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>>          /* only do that for ctx->optlen > PAGE_SIZE.
>>           * otherwise, the following cgroup bpf prog will
>>           * not be able to use the optval that it may
>>           * be interested.
>>           */
>>          if (ctx->optlen > PAGE_SIZE)
>>              ctx->optlen = 0;
>>          return 1;
>>      }
>>
>>      /* change the ctx->optval and return to user space ... */
>> }
>>
>> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't EFAULT 
>> for {g,s}setsockopt with wrong optlen").
>>
>> Later, it was reported that an optname (NETLINK_LIST_MEMBERSHIPS) that the 
>> kernel allows a user passing NULL optval and using the optlen returned by 
>> getsockopt to learn the buffer space required. The bpf prog then needs to 
>> remove the optlen > PAGE_SIZE check and set optlen to 0 for _all_ optnames 
>> that it is not interested while risking the following cgroup prog may not be 
>> able to use some of the optval:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>>
>>          /* Do that for all optname that you are not interested.
>>           * The latter cgroup bpf will not be able to use the optval.
>>           */
>>           ctx->optlen = 0;
>>          return 1;
>>      }
>>
>>      /* chage the ctx->optval and return to user space ... */
>> }
>>
>> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't EFAULT 
>> for getsockopt with optval=NULL").
>>
>> To avoid other potential optname cases that may run into similar situation and 
>> requires the bpf prog work around it again like the above, it needs a way to 
>> track whether a bpf prog has changed the optval in runtime. If it is not 
>> changed, use the result from the kernel 
> 
> Can we add a field in bpf_sockopt uapi struct so bpf_prog can set it
> if optval is changed?

This new interface should work. If there is an old-existing prog staying with 
the old interface (didn't set this bool but changed the optval) in the cgroup 
prog array, it probably end up not improving anything also?

or the verifier can enforce setting this bool in runtime when writing to optval? 
do not know how demanding the verifier change is. I am not sure if this would be 
an overkill for the verifier.

> 
> struct bpf_sockopt {
>          __bpf_md_ptr(struct bpf_sock *, sk);
>          __bpf_md_ptr(void *, optval);
>          __bpf_md_ptr(void *, optval_end);
> 
>          __s32   level;
>          __s32   optname;
>          __s32   optlen;
>          __s32   retval;
> };
> 
>> set/getsockopt. If reading/writing to optval is done through a kfunc, this can 
>> be tracked. The kfunc can also directly read/write the user memory in optval, 
>> avoid the pre-alloc kernel memory and the PAGE_SIZE limit but this is a minor 
>> point.


  reply	other threads:[~2023-08-17 22:56 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-15 17:47 [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt thinker.li
2023-08-15 17:47 ` [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt thinker.li
2023-08-15 20:58   ` Stanislav Fomichev
2023-08-15 21:04     ` Kui-Feng Lee
2023-08-15 17:47 ` [RFC bpf-next v3 2/5] libbpf: add sleepable sections for {get,set}sockopt() thinker.li
2023-08-15 17:47 ` [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval thinker.li
2023-08-17  0:55   ` Martin KaFai Lau
2023-08-17 18:10     ` Kui-Feng Lee
2023-08-17  1:17   ` Alexei Starovoitov
2023-08-17 18:12     ` Kui-Feng Lee
2023-08-15 17:47 ` [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT thinker.li
2023-08-17  1:25   ` Alexei Starovoitov
2023-08-17 19:00     ` Kui-Feng Lee
2023-08-17 19:43       ` Alexei Starovoitov
2023-08-18  0:14         ` Kui-Feng Lee
2023-08-17 20:41     ` Martin KaFai Lau
2023-08-17 21:37       ` Yonghong Song
2023-08-17 22:56         ` Martin KaFai Lau [this message]
2023-08-17 21:46       ` Alexei Starovoitov
2023-08-17 22:45         ` Martin KaFai Lau
2023-08-15 17:47 ` [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type thinker.li
2023-08-15 20:57   ` Stanislav Fomichev
2023-08-15 23:37     ` Kui-Feng Lee
2023-08-16  0:03       ` Kui-Feng Lee
2023-08-17  1:13         ` Martin KaFai Lau
2023-08-17 18:36           ` Kui-Feng Lee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fc66a75f-312e-8efd-2da9-d55e2391ea90@linux.dev \
    --to=martin.lau@linux.dev \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=kernel-team@meta.com \
    --cc=kuifeng@meta.com \
    --cc=sdf@google.com \
    --cc=sinquersw@gmail.com \
    --cc=song@kernel.org \
    --cc=thinker.li@gmail.com \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox