From: Yonghong Song <yonghong.song@linux.dev>
To: Martin KaFai Lau <martin.lau@linux.dev>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, ast@kernel.org, song@kernel.org,
kernel-team@meta.com, andrii@kernel.org, sdf@google.com,
sinquersw@gmail.com, kuifeng@meta.com, thinker.li@gmail.com
Subject: Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
Date: Thu, 17 Aug 2023 14:37:54 -0700 [thread overview]
Message-ID: <8e0fd701-0128-06a5-004e-c82a562691ee@linux.dev> (raw)
In-Reply-To: <f903808f-13c3-c440-c720-2051fe6ec4fe@linux.dev>
On 8/17/23 1:41 PM, Martin KaFai Lau wrote:
> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
>> But I think we have to step back. Why do we need this whole thing in
>> the first place?
>> _why_ sockopt bpf progs needs to read and write user memory?
>>
>> Yes there is one page limit, but what is the use case to actually read
>> and write
>> beyond that? iptables sockopt was mentioned, but I don't think bpf
>> prog can do
>> anything useful with iptables binary blobs. They are hard enough for
>> kernel to parse.
>
> Usually the bpf prog is only interested in a very small number of
> optnames and no need to read the optval at all for most cases. The max
> size for our use cases is 16 bytes. The kernel currently is kind of
> doing it the opposite and always assumes the bpf prog needing to use the
> optval, allocate kernel memory and copy_from_user such that the
> non-sleepable bpf program can read/write it.
>
> The bpf prog usually checks the optname and then just returns for most
> cases:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
> if (ctx->optname != MY_INTERNAL_SOCKOPT)
> return 1;
>
> /* change the ctx->optval and return to user space ... */
> }
>
> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE
> memory and copy the first PAGE_SIZE data from the user optval. We used
> to ask the bpf prog to explicitly set the optlen to 0 for > PAGE_SIZE
> case even it has not looked at the optval. Otherwise, the kernel used to
> conclude that the bpf prog had set an invalid optlen because optlen is
> larger than the optval_end - optval and returned -EFAULT incorrectly to
> the end-user.
>
> The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due
> to an internal kernel PAGE_SIZE limitation:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
> if (ctx->optname != MY_INTERNAL_SOCKOPT) {
> /* only do that for ctx->optlen > PAGE_SIZE.
> * otherwise, the following cgroup bpf prog will
> * not be able to use the optval that it may
> * be interested.
> */
> if (ctx->optlen > PAGE_SIZE)
> ctx->optlen = 0;
> return 1;
> }
>
> /* change the ctx->optval and return to user space ... */
> }
>
> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't
> EFAULT for {g,s}setsockopt with wrong optlen").
>
> Later, it was reported that an optname (NETLINK_LIST_MEMBERSHIPS) that
> the kernel allows a user passing NULL optval and using the optlen
> returned by getsockopt to learn the buffer space required. The bpf prog
> then needs to remove the optlen > PAGE_SIZE check and set optlen to 0
> for _all_ optnames that it is not interested while risking the following
> cgroup prog may not be able to use some of the optval:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
> if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>
> /* Do that for all optname that you are not interested.
> * The latter cgroup bpf will not be able to use the optval.
> */
> ctx->optlen = 0;
> return 1;
> }
>
> /* chage the ctx->optval and return to user space ... */
> }
>
> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't
> EFAULT for getsockopt with optval=NULL").
>
> To avoid other potential optname cases that may run into similar
> situation and requires the bpf prog work around it again like the above,
> it needs a way to track whether a bpf prog has changed the optval in
> runtime. If it is not changed, use the result from the kernel
Can we add a field in bpf_sockopt uapi struct so bpf_prog can set it
if optval is changed?
struct bpf_sockopt {
__bpf_md_ptr(struct bpf_sock *, sk);
__bpf_md_ptr(void *, optval);
__bpf_md_ptr(void *, optval_end);
__s32 level;
__s32 optname;
__s32 optlen;
__s32 retval;
};
> set/getsockopt. If reading/writing to optval is done through a kfunc,
> this can be tracked. The kfunc can also directly read/write the user
> memory in optval, avoid the pre-alloc kernel memory and the PAGE_SIZE
> limit but this is a minor point.
next prev parent reply other threads:[~2023-08-17 21:38 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-15 17:47 [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt thinker.li
2023-08-15 17:47 ` [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt thinker.li
2023-08-15 20:58 ` Stanislav Fomichev
2023-08-15 21:04 ` Kui-Feng Lee
2023-08-15 17:47 ` [RFC bpf-next v3 2/5] libbpf: add sleepable sections for {get,set}sockopt() thinker.li
2023-08-15 17:47 ` [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval thinker.li
2023-08-17 0:55 ` Martin KaFai Lau
2023-08-17 18:10 ` Kui-Feng Lee
2023-08-17 1:17 ` Alexei Starovoitov
2023-08-17 18:12 ` Kui-Feng Lee
2023-08-15 17:47 ` [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT thinker.li
2023-08-17 1:25 ` Alexei Starovoitov
2023-08-17 19:00 ` Kui-Feng Lee
2023-08-17 19:43 ` Alexei Starovoitov
2023-08-18 0:14 ` Kui-Feng Lee
2023-08-17 20:41 ` Martin KaFai Lau
2023-08-17 21:37 ` Yonghong Song [this message]
2023-08-17 22:56 ` Martin KaFai Lau
2023-08-17 21:46 ` Alexei Starovoitov
2023-08-17 22:45 ` Martin KaFai Lau
2023-08-15 17:47 ` [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type thinker.li
2023-08-15 20:57 ` Stanislav Fomichev
2023-08-15 23:37 ` Kui-Feng Lee
2023-08-16 0:03 ` Kui-Feng Lee
2023-08-17 1:13 ` Martin KaFai Lau
2023-08-17 18:36 ` Kui-Feng Lee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8e0fd701-0128-06a5-004e-c82a562691ee@linux.dev \
--to=yonghong.song@linux.dev \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=kernel-team@meta.com \
--cc=kuifeng@meta.com \
--cc=martin.lau@linux.dev \
--cc=sdf@google.com \
--cc=sinquersw@gmail.com \
--cc=song@kernel.org \
--cc=thinker.li@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox