public inbox for bpf@vger.kernel.org
* [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt
@ 2023-08-15 17:47 thinker.li
From: thinker.li @ 2023-08-15 17:47 UTC
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf,
	yonghong.song
  Cc: sinquersw, kuifeng, Kui-Feng Lee

From: Kui-Feng Lee <thinker.li@gmail.com>

Major Changes from v2:

 - Add test cases mixing sleepable and non-sleepable BPF programs.

 - Don't expose bpf_sockopt_kern.flags to BPF programs.

 - Rename kfuncs to *_sockopt_dynptr_*().

Major Changes from v1:

 - Add bpf_so_optval_copy_to() and bpf_so_optval_copy_to_r() to copy
   data from a dynptr or raw buffer to the optval of a context, either
   in the kernel or user space, to simplify BPF programs.

 - Allow at most one dynptr initialized by bpf_so_optval_from() to
   exist at any moment.  This simplifies the memory management of the
   optval buffer in the kernel.

 - Fix the issue of bpf_prog_array_free() by replacing it with
   bpf_prog_array_free_sleepable().


Make BPF programs attached to cgroup/{get,set}sockopt hooks sleepable
so that they can call bpf_copy_from_user() and bpf_copy_to_user(), the
latter being a new kfunc.

The Issue with CGroup {get,set}sockopt Hooks
============================================

When {get,set}sockopt is called from user space, optval is a pointer
to a buffer. The format of the buffer depends on the level and
optname, and its size is specified by optlen. User space programs use
the buffer to pass values to setsockopt and to retrieve values from
getsockopt.
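
As a reminder of how optval and optlen are used from user space, here
is a minimal sketch. The function name keepalive_roundtrip() and the
choice of SO_KEEPALIVE are illustrative assumptions and are not taken
from this series.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical helper: enable SO_KEEPALIVE on a fresh TCP socket and
 * read the option back.  Returns the value reported by getsockopt()
 * (non-zero means enabled), or -1 on any error.
 */
int keepalive_roundtrip(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int on = 1;                   /* setsockopt: optval points here */
	int got = 0;                  /* getsockopt: optval points here */
	socklen_t len = sizeof(got);  /* in: buffer size, out: bytes written */

	if (fd < 0)
		return -1;

	/* optval = &on, optlen = sizeof(on) */
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
		close(fd);
		return -1;
	}

	/* optlen is passed by pointer so the kernel can report the size */
	if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &got, &len) < 0 ||
	    len != sizeof(got)) {
		close(fd);
		return -1;
	}

	close(fd);
	return got;
}
```

Both calls pass through the cgroup {get,set}sockopt hooks when BPF
programs are attached to the caller's cgroup.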

The problem is that BPF programs running under the RCU read lock
cannot access buffers located in user space. These programs are
non-sleepable, and accessing user space memory with copy_from_user()
or copy_to_user() can trigger a page fault, which may sleep.

To work around this, the kernel copies the buffer specified by optval
and optlen into kernel space before passing it to the cgroup
{get,set}sockopt hooks. After the hooks have run, the content of the
kernel buffer is copied back to user space if necessary.

Programs may send a significant amount of data to the kernel in the
buffer that optval points to. One example is iptables, which can send
several megabytes. However, BPF programs on the hooks can see only the
first PAGE_SIZE bytes of the buffer. The optlen value that BPF
programs observe may appear to be PAGE_SIZE even though the real value
is larger. For getsockopt(), on the other hand, optlen represents the
amount of data retrieved. Additionally, both the buffer content and
optlen can be modified by BPF programs.

As a result, the kernel may wrongly change the value of optlen
returned to user space to PAGE_SIZE. This can happen because the
kernel cannot tell whether the value was set by a BPF program or by
the kernel itself.

To work around it, various hacks have been applied; for example, commit
d8fe449a9c51 ("bpf: Don't return EINVAL from {get,set}sockopt when
optlen > PAGE_SIZE") and commit 29ebbba7d461 ("bpf: Don't EFAULT for
{g,s}setsockopt with wrong optlen").
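
The truncation and the resulting ambiguity described in this section
can be illustrated with a small user space model. This is only a
sketch of the behavior as described in the text, not kernel code: all
model_* names are hypothetical, and MODEL_PAGE_SIZE assumes 4 KiB
pages.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for what a non-sleepable program can see. */
struct model_sockopt_ctx {
	unsigned char buf[MODEL_PAGE_SIZE];	/* kernel-side copy of optval */
	size_t optlen;				/* length visible to the program */
};

/* Copy at most one page of the caller's buffer into the context. */
void model_prepare_ctx(struct model_sockopt_ctx *ctx,
		       const void *user_optval, size_t user_optlen)
{
	size_t n = user_optlen < MODEL_PAGE_SIZE ? user_optlen
						 : MODEL_PAGE_SIZE;

	memcpy(ctx->buf, user_optval, n);
	ctx->optlen = n;
}

/* Returns 1 when optlen alone cannot tell a truncated buffer apart
 * from a buffer that really is exactly one page long.
 */
int model_optlen_is_ambiguous(const struct model_sockopt_ctx *ctx)
{
	return ctx->optlen == MODEL_PAGE_SIZE;
}
```

In this model a 10000-byte buffer and a 4096-byte buffer both end up
with optlen == 4096, so anything that looks only at the context cannot
know whether data was cut off.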

Make CGroup {get,set}sockopt Hooks Sleepable
============================================

The long-term solution is to make these hooks sleepable so that BPF
programs can call copy_from_user() and copy_to_user() through
bpf_copy_from_user() and bpf_copy_to_user(). This avoids the
manipulation of optval and optlen values and allows BPF programs to
access the complete contents of the buffer referenced by optval.

Mix Sleepable and Non-Sleepable Programs
========================================

Sleepable and non-sleepable programs can be installed on the same hook
at the same time. For programs that cannot sleep, the kernel first
copies data from the user buffer to a kernel buffer before invoking
the BPF programs. This copy introduces intricate interactions between
sleepable and non-sleepable programs.

For instance, because of the kernel copies made for non-sleepable
programs, a sleepable program may receive optval in either user space
or kernel space. The two cases require different handling to update
the buffer pointed to by optval. Consequently, sleepable programs can
become significantly more complex.

To simplify the programs, we introduce a set of kfuncs,
bpf_so_optval_copy_to() and bpf_so_optval_copy_to_r(), that copy data
to optval without requiring knowledge of the underlying details.
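
The idea of hiding the location of optval behind one copy routine can
be sketched in plain user space C. This is a toy model only: the
model_* names, the struct layout, and the return values are
hypothetical and do not reflect the signatures of the kfuncs in this
series.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tag recording where the optval buffer lives. */
enum model_optval_loc {
	MODEL_OPTVAL_KERNEL,
	MODEL_OPTVAL_USER,
};

struct model_optval {
	enum model_optval_loc loc;
	void *ptr;
	size_t len;
};

/* Stand-in for copy_to_user(); in this model it is a plain memcpy(). */
static int model_copy_to_user(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

/* Copy n bytes into optval without the caller checking where it lives.
 * Returns 0 on success, or -1 if the data does not fit.
 */
int model_optval_copy_to(struct model_optval *ov, const void *src, size_t n)
{
	if (n > ov->len)
		return -1;

	if (ov->loc == MODEL_OPTVAL_USER)
		return model_copy_to_user(ov->ptr, src, n);

	memcpy(ov->ptr, src, n);
	return 0;
}
```

A caller uses the same call for both cases, which is the
simplification the kfuncs are meant to give BPF programs.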

---
v1: https://lore.kernel.org/bpf/20230722052248.1062582-1-kuifeng@meta.com/
v2: https://lore.kernel.org/all/20230811043127.1318152-1-thinker.li@gmail.com/

Kui-Feng Lee (5):
  bpf: enable sleepable BPF programs attached to
    cgroup/{get,set}sockopt.
  libbpf: add sleepable sections for {get,set}sockopt()
  bpf: Prevent BPF programs from access the buffer pointed by
    user_optval.
  bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  selftests/bpf: Add test cases for sleepable BPF programs of the
    CGROUP_SOCKOPT type

 include/linux/bpf.h                           |  13 +-
 include/linux/filter.h                        |  10 +
 kernel/bpf/btf.c                              |   3 +
 kernel/bpf/cgroup.c                           | 226 +++++++++++----
 kernel/bpf/helpers.c                          | 197 ++++++++++++++
 kernel/bpf/verifier.c                         | 118 +++++---
 tools/lib/bpf/libbpf.c                        |   2 +
 .../testing/selftests/bpf/bpf_experimental.h  |  36 +++
 tools/testing/selftests/bpf/bpf_kfuncs.h      |  41 +++
 .../selftests/bpf/prog_tests/sockopt_sk.c     | 112 +++++++-
 .../testing/selftests/bpf/progs/sockopt_sk.c  | 257 ++++++++++++++++++
 .../selftests/bpf/verifier/sleepable.c        |   2 +-
 12 files changed, 929 insertions(+), 88 deletions(-)

-- 
2.34.1


Thread overview: 26+ messages
2023-08-15 17:47 [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt thinker.li
2023-08-15 17:47 ` [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt thinker.li
2023-08-15 20:58   ` Stanislav Fomichev
2023-08-15 21:04     ` Kui-Feng Lee
2023-08-15 17:47 ` [RFC bpf-next v3 2/5] libbpf: add sleepable sections for {get,set}sockopt() thinker.li
2023-08-15 17:47 ` [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval thinker.li
2023-08-17  0:55   ` Martin KaFai Lau
2023-08-17 18:10     ` Kui-Feng Lee
2023-08-17  1:17   ` Alexei Starovoitov
2023-08-17 18:12     ` Kui-Feng Lee
2023-08-15 17:47 ` [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT thinker.li
2023-08-17  1:25   ` Alexei Starovoitov
2023-08-17 19:00     ` Kui-Feng Lee
2023-08-17 19:43       ` Alexei Starovoitov
2023-08-18  0:14         ` Kui-Feng Lee
2023-08-17 20:41     ` Martin KaFai Lau
2023-08-17 21:37       ` Yonghong Song
2023-08-17 22:56         ` Martin KaFai Lau
2023-08-17 21:46       ` Alexei Starovoitov
2023-08-17 22:45         ` Martin KaFai Lau
2023-08-15 17:47 ` [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type thinker.li
2023-08-15 20:57   ` Stanislav Fomichev
2023-08-15 23:37     ` Kui-Feng Lee
2023-08-16  0:03       ` Kui-Feng Lee
2023-08-17  1:13         ` Martin KaFai Lau
2023-08-17 18:36           ` Kui-Feng Lee
