public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
From: Yonghong Song <yonghong.song@linux.dev>
To: Martin KaFai Lau <martin.lau@linux.dev>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, ast@kernel.org, song@kernel.org,
	kernel-team@meta.com, andrii@kernel.org, sdf@google.com,
	sinquersw@gmail.com, kuifeng@meta.com, thinker.li@gmail.com
Subject: Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
Date: Thu, 17 Aug 2023 14:37:54 -0700	[thread overview]
Message-ID: <8e0fd701-0128-06a5-004e-c82a562691ee@linux.dev> (raw)
In-Reply-To: <f903808f-13c3-c440-c720-2051fe6ec4fe@linux.dev>



On 8/17/23 1:41 PM, Martin KaFai Lau wrote:
> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
>> But I think we have to step back. Why do we need this whole thing in 
>> the first place?
>> _why_  sockopt bpf progs needs to read and write user memory?
>>
>> Yes there is one page limit, but what is the use case to actually read 
>> and write
>> beyond that? iptables sockopt was mentioned, but I don't think bpf 
>> prog can do
>> anything useful with iptables binary blobs. They are hard enough for 
>> kernel to parse.
> 
> Usually the bpf prog is only interested in a very small number of 
> optnames and no need to read the optval at all for most cases. The max 
> size for our use cases is 16 bytes. The kernel currently is kind of 
> doing it the opposite and always assumes the bpf prog needing to use the 
> optval, allocate kernel memory and copy_from_user such that the 
> non-sleepable bpf program can read/write it.
> 
> The bpf prog usually checks the optname and then just returns for most 
> cases:
> 
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>      if (ctx->optname != MY_INTERNAL_SOCKOPT)
>          return 1;
> 
>      /* change the ctx->optval and return to user space ... */
> }
> 
> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE 
> memory and copy the first PAGE_SIZE data from the user optval. We used 
> to ask the bpf prog to explicitly set the optlen to 0 for > PAGE_SIZE 
> case even it has not looked at the optval. Otherwise, the kernel used to 
> conclude that the bpf prog had set an invalid optlen because optlen is 
> larger than the optval_end - optval and returned -EFAULT incorrectly to 
> the end-user.
> 
> The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due 
> to an internal kernel PAGE_SIZE limitation:
> 
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>          /* only do that for ctx->optlen > PAGE_SIZE.
>           * otherwise, the following cgroup bpf prog will
>           * not be able to use the optval that it may
>           * be interested.
>           */
>          if (ctx->optlen > PAGE_SIZE)
>              ctx->optlen = 0;
>          return 1;
>      }
> 
>      /* change the ctx->optval and return to user space ... */
> }
> 
> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't 
> EFAULT for {g,s}setsockopt with wrong optlen").
> 
> Later, it was reported that an optname (NETLINK_LIST_MEMBERSHIPS) that 
> the kernel allows a user passing NULL optval and using the optlen 
> returned by getsockopt to learn the buffer space required. The bpf prog 
> then needs to remove the optlen > PAGE_SIZE check and set optlen to 0 
> for _all_ optnames that it is not interested while risking the following 
> cgroup prog may not be able to use some of the optval:
> 
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
> 
>          /* Do that for all optname that you are not interested.
>           * The latter cgroup bpf will not be able to use the optval.
>           */
>           ctx->optlen = 0;
>          return 1;
>      }
> 
>      /* chage the ctx->optval and return to user space ... */
> }
> 
> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't 
> EFAULT for getsockopt with optval=NULL").
> 
> To avoid other potential optname cases that may run into similar 
> situation and requires the bpf prog work around it again like the above, 
> it needs a way to track whether a bpf prog has changed the optval in 
> runtime. If it is not changed, use the result from the kernel 

Can we add a field in bpf_sockopt uapi struct so bpf_prog can set it
if optval is changed?

struct bpf_sockopt {
         __bpf_md_ptr(struct bpf_sock *, sk);
         __bpf_md_ptr(void *, optval);
         __bpf_md_ptr(void *, optval_end);

         __s32   level;
         __s32   optname;
         __s32   optlen;
         __s32   retval;
};

> set/getsockopt. If reading/writing to optval is done through a kfunc, 
> this can be tracked. The kfunc can also directly read/write the user 
> memory in optval, avoid the pre-alloc kernel memory and the PAGE_SIZE 
> limit but this is a minor point.

  reply	other threads:[~2023-08-17 21:38 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-15 17:47 [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt thinker.li
2023-08-15 17:47 ` [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt thinker.li
2023-08-15 20:58   ` Stanislav Fomichev
2023-08-15 21:04     ` Kui-Feng Lee
2023-08-15 17:47 ` [RFC bpf-next v3 2/5] libbpf: add sleepable sections for {get,set}sockopt() thinker.li
2023-08-15 17:47 ` [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval thinker.li
2023-08-17  0:55   ` Martin KaFai Lau
2023-08-17 18:10     ` Kui-Feng Lee
2023-08-17  1:17   ` Alexei Starovoitov
2023-08-17 18:12     ` Kui-Feng Lee
2023-08-15 17:47 ` [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT thinker.li
2023-08-17  1:25   ` Alexei Starovoitov
2023-08-17 19:00     ` Kui-Feng Lee
2023-08-17 19:43       ` Alexei Starovoitov
2023-08-18  0:14         ` Kui-Feng Lee
2023-08-17 20:41     ` Martin KaFai Lau
2023-08-17 21:37       ` Yonghong Song [this message]
2023-08-17 22:56         ` Martin KaFai Lau
2023-08-17 21:46       ` Alexei Starovoitov
2023-08-17 22:45         ` Martin KaFai Lau
2023-08-15 17:47 ` [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type thinker.li
2023-08-15 20:57   ` Stanislav Fomichev
2023-08-15 23:37     ` Kui-Feng Lee
2023-08-16  0:03       ` Kui-Feng Lee
2023-08-17  1:13         ` Martin KaFai Lau
2023-08-17 18:36           ` Kui-Feng Lee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8e0fd701-0128-06a5-004e-c82a562691ee@linux.dev \
    --to=yonghong.song@linux.dev \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=kernel-team@meta.com \
    --cc=kuifeng@meta.com \
    --cc=martin.lau@linux.dev \
    --cc=sdf@google.com \
    --cc=sinquersw@gmail.com \
    --cc=song@kernel.org \
    --cc=thinker.li@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox