bpf.vger.kernel.org archive mirror
* [PATCH v1 bpf-next/net 0/8] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated.
@ 2025-08-22 22:17 Kuniyuki Iwashima
  2025-08-22 22:17 ` [PATCH v1 bpf-next/net 1/8] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-22 22:17 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

Some protocols (e.g., TCP, UDP) have their own memory accounting for
socket buffers and charge memory to global per-protocol counters such
as /proc/sys/net/ipv4/tcp_mem.

When running under a non-root cgroup, this memory is also charged to
the memcg as sock in memory.stat.

Sockets of such protocols are still subject to the global limits,
and thus can be affected by a noisy neighbour outside the cgroup.

This makes it difficult to accurately estimate and configure appropriate
global limits.

This series allows decoupling memcg from the global memory accounting
for sockets flagged as isolated by a BPF prog.

This simplifies the memcg configuration while keeping the global limits
within a reasonable range, which is only about 10% of the physical
memory by default.
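
As an example, a condensed version of the selftest prog in patch 8,
which marks every socket created in the cgroup as isolated:

  SEC("cgroup/sock_create")
  int sock_create(struct bpf_sock *ctx)
  {
          int flags = SK_BPF_MEMCG_SOCK_ISOLATED;

          bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
                         &flags, sizeof(flags));
          return 1;
  }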

Overview of the series:

  patch 1 is a prep patch
  patches 2 ~ 4 add a new bpf hook for accept()
  patches 5 & 6 introduce SK_BPF_MEMCG_SOCK_ISOLATED for bpf_setsockopt()
  patch 7 decouples memcg from sk_prot->memory_allocated based on the flag
  patch 8 adds a selftest


Kuniyuki Iwashima (8):
  tcp: Save lock_sock() for memcg in inet_csk_accept().
  bpf: Add a bpf hook in __inet_accept().
  libbpf: Support BPF_CGROUP_INET_SOCK_ACCEPT.
  bpftool: Support BPF_CGROUP_INET_SOCK_ACCEPT.
  bpf: Support bpf_setsockopt() for
    BPF_CGROUP_INET_SOCK_(CREATE|ACCEPT).
  bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_SOCK_ISOLATED.
  net-memcg: Allow decoupling memcg from global protocol memory
    accounting.
  selftest: bpf: Add test for SK_BPF_MEMCG_SOCK_ISOLATED.

 include/linux/bpf-cgroup-defs.h               |   1 +
 include/linux/bpf-cgroup.h                    |   4 +
 include/net/proto_memory.h                    |  15 +-
 include/net/sock.h                            |  48 ++++
 include/net/tcp.h                             |  10 +-
 include/uapi/linux/bpf.h                      |   7 +
 kernel/bpf/cgroup.c                           |   2 +
 kernel/bpf/syscall.c                          |   3 +
 net/core/filter.c                             |  75 +++++-
 net/core/sock.c                               |  64 ++++--
 net/ipv4/af_inet.c                            |  34 +++
 net/ipv4/inet_connection_sock.c               |  26 +--
 net/ipv4/tcp.c                                |   3 +-
 net/ipv4/tcp_output.c                         |  10 +-
 net/mptcp/protocol.c                          |   3 +-
 net/tls/tls_device.c                          |   4 +-
 tools/bpf/bpftool/cgroup.c                    |   6 +-
 tools/include/uapi/linux/bpf.h                |   7 +
 tools/lib/bpf/libbpf.c                        |   2 +
 .../selftests/bpf/prog_tests/sk_memcg.c       | 214 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/sk_memcg.c  |  29 +++
 21 files changed, 508 insertions(+), 59 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/sk_memcg.c

-- 
2.51.0.rc2.233.g662b1ed5c5-goog


* [PATCH v1 bpf-next/net 1/8] tcp: Save lock_sock() for memcg in inet_csk_accept().
  2025-08-22 22:17 [PATCH v1 bpf-next/net 0/8] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
@ 2025-08-22 22:17 ` Kuniyuki Iwashima
  2025-08-22 22:17 ` [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept() Kuniyuki Iwashima
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-22 22:17 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

If memcg is enabled, accept() acquires lock_sock() twice for each new
TCP/MPTCP socket in inet_csk_accept() and __inet_accept().

Let's move memcg operations from inet_csk_accept() to __inet_accept().

This makes it easier to add a BPF hook that covers sk_prot->memory_allocated
users (TCP, MPTCP, SCTP) in a single place.

Two notes:

1)
SCTP somehow allocates a new socket by sk_alloc() in sk->sk_prot->accept()
and clones fields manually, instead of using sk_clone_lock().

For SCTP, mem_cgroup_sk_alloc() has already been called before
__inet_accept(), so I added the protocol checks in __inet_accept(),
but these can be removed once SCTP uses sk_clone_lock().

2)
The single if block is split into two because we will add a new bpf
hook between the blocks, where a bpf prog can set a flag in sk->sk_memcg.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/ipv4/af_inet.c              | 22 ++++++++++++++++++++++
 net/ipv4/inet_connection_sock.c | 25 -------------------------
 2 files changed, 22 insertions(+), 25 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 76e38092cd8a..ae83ecda3983 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -753,6 +753,28 @@ EXPORT_SYMBOL(inet_stream_connect);
 
 void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *newsk)
 {
+	gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
+
+	/* TODO: use sk_clone_lock() in SCTP and remove protocol checks */
+	if (mem_cgroup_sockets_enabled &&
+	    (!IS_ENABLED(CONFIG_IP_SCTP) ||
+	     sk_is_tcp(newsk) || sk_is_mptcp(newsk))) {
+		mem_cgroup_sk_alloc(newsk);
+		kmem_cache_charge(newsk, gfp);
+	}
+
+	if (mem_cgroup_sk_enabled(newsk)) {
+		int amt;
+
+		/* The socket has not been accepted yet, no need
+		 * to look at newsk->sk_wmem_queued.
+		 */
+		amt = sk_mem_pages(newsk->sk_forward_alloc +
+				   atomic_read(&newsk->sk_rmem_alloc));
+		if (amt)
+			mem_cgroup_sk_charge(newsk, amt, gfp);
+	}
+
 	sock_rps_record_flow(newsk);
 	WARN_ON(!((1 << newsk->sk_state) &
 		  (TCPF_ESTABLISHED | TCPF_SYN_RECV |
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 0ef1eacd539d..ed10b959a906 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -708,31 +708,6 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 
 	release_sock(sk);
 
-	if (mem_cgroup_sockets_enabled) {
-		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
-		int amt = 0;
-
-		/* atomically get the memory usage, set and charge the
-		 * newsk->sk_memcg.
-		 */
-		lock_sock(newsk);
-
-		mem_cgroup_sk_alloc(newsk);
-		if (mem_cgroup_from_sk(newsk)) {
-			/* The socket has not been accepted yet, no need
-			 * to look at newsk->sk_wmem_queued.
-			 */
-			amt = sk_mem_pages(newsk->sk_forward_alloc +
-					   atomic_read(&newsk->sk_rmem_alloc));
-		}
-
-		if (amt)
-			mem_cgroup_sk_charge(newsk, amt, gfp);
-		kmem_cache_charge(newsk, gfp);
-
-		release_sock(newsk);
-	}
-
 	if (req)
 		reqsk_put(req);
 
-- 
2.51.0.rc2.233.g662b1ed5c5-goog


* [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().
  2025-08-22 22:17 [PATCH v1 bpf-next/net 0/8] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
  2025-08-22 22:17 ` [PATCH v1 bpf-next/net 1/8] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
@ 2025-08-22 22:17 ` Kuniyuki Iwashima
  2025-08-23 11:02   ` kernel test robot
  2025-08-25 17:57   ` Martin KaFai Lau
  2025-08-22 22:17 ` [PATCH v1 bpf-next/net 3/8] libbpf: Support BPF_CGROUP_INET_SOCK_ACCEPT Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-22 22:17 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

We will store a flag in sk->sk_memcg by bpf_setsockopt().

For a new child socket, memcg is not allocated until accept().

Let's add a new hook for BPF_PROG_TYPE_CGROUP_SOCK in
__inet_accept().

This hook cannot fail the syscall, as it does not support bpf_set_retval().
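
A minimal prog for this hook (using the section name added in patch 3
and mirroring the selftest in patch 8) looks like:

  SEC("cgroup/sock_accept")
  int sock_accept(struct bpf_sock *ctx)
  {
          return 1;
  }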

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/linux/bpf-cgroup-defs.h | 1 +
 include/linux/bpf-cgroup.h      | 4 ++++
 include/uapi/linux/bpf.h        | 1 +
 kernel/bpf/cgroup.c             | 2 ++
 kernel/bpf/syscall.c            | 3 +++
 net/ipv4/af_inet.c              | 2 ++
 tools/include/uapi/linux/bpf.h  | 1 +
 7 files changed, 14 insertions(+)

diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
index c9e6b26abab6..c9053fdbda5e 100644
--- a/include/linux/bpf-cgroup-defs.h
+++ b/include/linux/bpf-cgroup-defs.h
@@ -47,6 +47,7 @@ enum cgroup_bpf_attach_type {
 	CGROUP_INET6_GETSOCKNAME,
 	CGROUP_UNIX_GETSOCKNAME,
 	CGROUP_INET_SOCK_RELEASE,
+	CGROUP_INET_SOCK_ACCEPT,
 	CGROUP_LSM_START,
 	CGROUP_LSM_END = CGROUP_LSM_START + CGROUP_LSM_NUM - 1,
 	MAX_CGROUP_BPF_ATTACH_TYPE
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index aedf573bdb42..4b0e835bbab7 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -67,6 +67,7 @@ to_cgroup_bpf_attach_type(enum bpf_attach_type attach_type)
 	CGROUP_ATYPE(CGROUP_INET6_GETSOCKNAME);
 	CGROUP_ATYPE(CGROUP_UNIX_GETSOCKNAME);
 	CGROUP_ATYPE(CGROUP_INET_SOCK_RELEASE);
+	CGROUP_ATYPE(CGROUP_INET_SOCK_ACCEPT);
 	default:
 		return CGROUP_BPF_ATTACH_TYPE_INVALID;
 	}
@@ -225,6 +226,9 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
 #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)				       \
 	BPF_CGROUP_RUN_SK_PROG(sk, CGROUP_INET_SOCK_CREATE)
 
+#define BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(sk)			       \
+	BPF_CGROUP_RUN_SK_PROG(sk, CGROUP_INET_SOCK_ACCEPT)
+
 #define BPF_CGROUP_RUN_PROG_INET_SOCK_RELEASE(sk)			       \
 	BPF_CGROUP_RUN_SK_PROG(sk, CGROUP_INET_SOCK_RELEASE)
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 233de8677382..80df246d4741 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1133,6 +1133,7 @@ enum bpf_attach_type {
 	BPF_NETKIT_PEER,
 	BPF_TRACE_KPROBE_SESSION,
 	BPF_TRACE_UPROBE_SESSION,
+	BPF_CGROUP_INET_SOCK_ACCEPT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 180b630279b9..dee9ae0c2a9a 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -2724,6 +2724,7 @@ cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		switch (prog->expected_attach_type) {
 		case BPF_CGROUP_INET_INGRESS:
 		case BPF_CGROUP_INET_EGRESS:
+		case BPF_CGROUP_INET_SOCK_ACCEPT:
 		case BPF_CGROUP_SOCK_OPS:
 		case BPF_CGROUP_UDP4_RECVMSG:
 		case BPF_CGROUP_UDP6_RECVMSG:
@@ -2742,6 +2743,7 @@ cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		switch (prog->expected_attach_type) {
 		case BPF_CGROUP_INET_INGRESS:
 		case BPF_CGROUP_INET_EGRESS:
+		case BPF_CGROUP_INET_SOCK_ACCEPT:
 		case BPF_CGROUP_SOCK_OPS:
 		case BPF_CGROUP_UDP4_RECVMSG:
 		case BPF_CGROUP_UDP6_RECVMSG:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0fbfa8532c39..23a801da230c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2640,6 +2640,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 		switch (expected_attach_type) {
 		case BPF_CGROUP_INET_SOCK_CREATE:
+		case BPF_CGROUP_INET_SOCK_ACCEPT:
 		case BPF_CGROUP_INET_SOCK_RELEASE:
 		case BPF_CGROUP_INET4_POST_BIND:
 		case BPF_CGROUP_INET6_POST_BIND:
@@ -4194,6 +4195,7 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 	case BPF_CGROUP_INET_EGRESS:
 		return BPF_PROG_TYPE_CGROUP_SKB;
 	case BPF_CGROUP_INET_SOCK_CREATE:
+	case BPF_CGROUP_INET_SOCK_ACCEPT:
 	case BPF_CGROUP_INET_SOCK_RELEASE:
 	case BPF_CGROUP_INET4_POST_BIND:
 	case BPF_CGROUP_INET6_POST_BIND:
@@ -4515,6 +4517,7 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_CGROUP_INET_INGRESS:
 	case BPF_CGROUP_INET_EGRESS:
 	case BPF_CGROUP_INET_SOCK_CREATE:
+	case BPF_CGROUP_INET_SOCK_ACCEPT:
 	case BPF_CGROUP_INET_SOCK_RELEASE:
 	case BPF_CGROUP_INET4_BIND:
 	case BPF_CGROUP_INET6_BIND:
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index ae83ecda3983..ab613abdfaa4 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -763,6 +763,8 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
 		kmem_cache_charge(newsk, gfp);
 	}
 
+	BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
+
 	if (mem_cgroup_sk_enabled(newsk)) {
 		int amt;
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 233de8677382..80df246d4741 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1133,6 +1133,7 @@ enum bpf_attach_type {
 	BPF_NETKIT_PEER,
 	BPF_TRACE_KPROBE_SESSION,
 	BPF_TRACE_UPROBE_SESSION,
+	BPF_CGROUP_INET_SOCK_ACCEPT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.51.0.rc2.233.g662b1ed5c5-goog


* [PATCH v1 bpf-next/net 3/8] libbpf: Support BPF_CGROUP_INET_SOCK_ACCEPT.
  2025-08-22 22:17 [PATCH v1 bpf-next/net 0/8] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
  2025-08-22 22:17 ` [PATCH v1 bpf-next/net 1/8] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
  2025-08-22 22:17 ` [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept() Kuniyuki Iwashima
@ 2025-08-22 22:17 ` Kuniyuki Iwashima
  2025-08-22 22:17 ` [PATCH v1 bpf-next/net 4/8] bpftool: " Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-22 22:17 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

Let's support the new attach type so that a cgroup prog can hook
into __inet_accept().
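
A prog can then use SEC("cgroup/sock_accept") and be attached with the
existing cgroup API, e.g. as the selftest in patch 8 does:

  link = bpf_program__attach_cgroup(skel->progs.sock_accept, cgroup_fd);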

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 tools/lib/bpf/libbpf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 8f5a81b672e1..c1b28a3e6d6f 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -81,6 +81,7 @@ static const char * const attach_type_name[] = {
 	[BPF_CGROUP_INET_INGRESS]	= "cgroup_inet_ingress",
 	[BPF_CGROUP_INET_EGRESS]	= "cgroup_inet_egress",
 	[BPF_CGROUP_INET_SOCK_CREATE]	= "cgroup_inet_sock_create",
+	[BPF_CGROUP_INET_SOCK_ACCEPT]	= "cgroup_inet_sock_accept",
 	[BPF_CGROUP_INET_SOCK_RELEASE]	= "cgroup_inet_sock_release",
 	[BPF_CGROUP_SOCK_OPS]		= "cgroup_sock_ops",
 	[BPF_CGROUP_DEVICE]		= "cgroup_device",
@@ -9584,6 +9585,7 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("cgroup_skb/egress",	CGROUP_SKB, BPF_CGROUP_INET_EGRESS, SEC_ATTACHABLE_OPT),
 	SEC_DEF("cgroup/skb",		CGROUP_SKB, 0, SEC_NONE),
 	SEC_DEF("cgroup/sock_create",	CGROUP_SOCK, BPF_CGROUP_INET_SOCK_CREATE, SEC_ATTACHABLE),
+	SEC_DEF("cgroup/sock_accept",	CGROUP_SOCK, BPF_CGROUP_INET_SOCK_ACCEPT, SEC_ATTACHABLE),
 	SEC_DEF("cgroup/sock_release",	CGROUP_SOCK, BPF_CGROUP_INET_SOCK_RELEASE, SEC_ATTACHABLE),
 	SEC_DEF("cgroup/sock",		CGROUP_SOCK, BPF_CGROUP_INET_SOCK_CREATE, SEC_ATTACHABLE_OPT),
 	SEC_DEF("cgroup/post_bind4",	CGROUP_SOCK, BPF_CGROUP_INET4_POST_BIND, SEC_ATTACHABLE),
-- 
2.51.0.rc2.233.g662b1ed5c5-goog


* [PATCH v1 bpf-next/net 4/8] bpftool: Support BPF_CGROUP_INET_SOCK_ACCEPT.
  2025-08-22 22:17 [PATCH v1 bpf-next/net 0/8] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2025-08-22 22:17 ` [PATCH v1 bpf-next/net 3/8] libbpf: Support BPF_CGROUP_INET_SOCK_ACCEPT Kuniyuki Iwashima
@ 2025-08-22 22:17 ` Kuniyuki Iwashima
  2025-08-22 22:18 ` [PATCH v1 bpf-next/net 5/8] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_(CREATE|ACCEPT) Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-22 22:17 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

Let's support the new attach type so that a cgroup prog can hook
into __inet_accept().

Now we can specify BPF_CGROUP_INET_SOCK_ACCEPT as
cgroup_inet_sock_accept:

  # bpftool cgroup attach /sys/fs/cgroup/test \
      cgroup_inet_sock_accept pinned /sys/fs/bpf/sk_memcg_accept

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 tools/bpf/bpftool/cgroup.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c
index 944ebe21a216..593dabcf1578 100644
--- a/tools/bpf/bpftool/cgroup.c
+++ b/tools/bpf/bpftool/cgroup.c
@@ -48,7 +48,8 @@ static const int cgroup_attach_types[] = {
 	BPF_CGROUP_SYSCTL,
 	BPF_CGROUP_GETSOCKOPT,
 	BPF_CGROUP_SETSOCKOPT,
-	BPF_LSM_CGROUP
+	BPF_LSM_CGROUP,
+	BPF_CGROUP_INET_SOCK_ACCEPT,
 };
 
 #define HELP_SPEC_ATTACH_FLAGS						\
@@ -68,7 +69,8 @@ static const int cgroup_attach_types[] = {
 	"                        cgroup_unix_sendmsg | cgroup_udp4_recvmsg |\n" \
 	"                        cgroup_udp6_recvmsg | cgroup_unix_recvmsg |\n" \
 	"                        cgroup_sysctl | cgroup_getsockopt |\n" \
-	"                        cgroup_setsockopt | cgroup_inet_sock_release }"
+	"                        cgroup_setsockopt | cgroup_inet_sock_release |\n" \
+	"                        cgroup_inet_sock_accept }"
 
 static unsigned int query_flags;
 static struct btf *btf_vmlinux;
-- 
2.51.0.rc2.233.g662b1ed5c5-goog


* [PATCH v1 bpf-next/net 5/8] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_(CREATE|ACCEPT).
  2025-08-22 22:17 [PATCH v1 bpf-next/net 0/8] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2025-08-22 22:17 ` [PATCH v1 bpf-next/net 4/8] bpftool: " Kuniyuki Iwashima
@ 2025-08-22 22:18 ` Kuniyuki Iwashima
  2025-08-23 23:58   ` kernel test robot
  2025-08-22 22:18 ` [PATCH v1 bpf-next/net 6/8] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_SOCK_ISOLATED Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-22 22:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

We will store a flag in sk->sk_memcg by bpf_setsockopt() during
socket() and accept().

BPF_CGROUP_INET_SOCK_CREATE and BPF_CGROUP_INET_SOCK_ACCEPT are
invoked by __cgroup_bpf_run_filter_sk(), which passes a pointer to
struct sock to the bpf prog as void *ctx.

However, there is no bpf_func_proto for bpf_setsockopt() that accepts
the ctx as a pointer to struct sock.

Let's add new bpf_setsockopt() variants and support them in two
attach types.

Note that __inet_accept() runs under lock_sock() but inet_create() does
not, so BPF_CGROUP_INET_SOCK_ACCEPT gets the variant that assumes the
socket is already locked, while BPF_CGROUP_INET_SOCK_CREATE gets the
variant that takes the lock itself.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/core/filter.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 63f3baee2daf..aa17c7ed5aed 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5743,6 +5743,40 @@ static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_5(bpf_sock_setsockopt, struct sock *, sk, int, level,
+	   int, optname, char *, optval, int, optlen)
+{
+	return __bpf_setsockopt(sk, level, optname, optval, optlen);
+}
+
+const struct bpf_func_proto bpf_sock_setsockopt_proto = {
+	.func		= bpf_sock_setsockopt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
+	.arg5_type	= ARG_CONST_SIZE,
+};
+
+BPF_CALL_5(bpf_unlocked_sock_setsockopt, struct sock *, sk, int, level,
+	   int, optname, char *, optval, int, optlen)
+{
+	return _bpf_setsockopt(sk, level, optname, optval, optlen);
+}
+
+const struct bpf_func_proto bpf_unlocked_sock_setsockopt_proto = {
+	.func		= bpf_unlocked_sock_setsockopt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
+	.arg5_type	= ARG_CONST_SIZE,
+};
+
 static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock,
 				int optname, const u8 **start)
 {
@@ -8051,6 +8085,15 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sk_storage_get_cg_sock_proto;
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
+	case BPF_FUNC_setsockopt:
+		switch (prog->expected_attach_type) {
+		case BPF_CGROUP_INET_SOCK_CREATE:
+			return &bpf_unlocked_sock_setsockopt_proto;
+		case BPF_CGROUP_INET_SOCK_ACCEPT:
+			return &bpf_sock_setsockopt_proto;
+		default:
+			return NULL;
+		}
 	default:
 		return bpf_base_func_proto(func_id, prog);
 	}
-- 
2.51.0.rc2.233.g662b1ed5c5-goog


* [PATCH v1 bpf-next/net 6/8] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_SOCK_ISOLATED.
  2025-08-22 22:17 [PATCH v1 bpf-next/net 0/8] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2025-08-22 22:18 ` [PATCH v1 bpf-next/net 5/8] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_(CREATE|ACCEPT) Kuniyuki Iwashima
@ 2025-08-22 22:18 ` Kuniyuki Iwashima
  2025-08-23 15:38   ` kernel test robot
  2025-08-22 22:18 ` [PATCH v1 bpf-next/net 7/8] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
  2025-08-22 22:18 ` [PATCH v1 bpf-next/net 8/8] selftest: bpf: Add test for SK_BPF_MEMCG_SOCK_ISOLATED Kuniyuki Iwashima
  7 siblings, 1 reply; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-22 22:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

We will decouple sockets from the global protocol memory accounting
if sockets have SK_BPF_MEMCG_SOCK_ISOLATED.

This can be flagged via bpf_setsockopt() during socket() or accept():

  flags = SK_BPF_MEMCG_SOCK_ISOLATED;
  bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
                 &flags, sizeof(flags));

Given that sk->sk_memcg can be accessed in the fast path, it would
be preferable to place the flag field in the same cache line as
sk->sk_memcg.

However, struct sock does not have such a 1-byte hole.

Let's store the flag in the lowest bit of sk->sk_memcg and add
helpers to set and check the bit.
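
This works because struct mem_cgroup is at least word-aligned, so the
lowest bit of the pointer is always zero.  A sketch of the tagging
scheme implemented by the helpers below:

  val   = (unsigned long)sk->sk_memcg;
  flags = val & SK_BPF_MEMCG_FLAG_MASK;                       /* tag bit */
  memcg = (struct mem_cgroup *)(val & SK_BPF_MEMCG_PTR_MASK); /* pointer */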

In the next patch, if mem_cgroup_sk_isolated() returns true,
the socket will not be charged to sk->sk_prot->memory_allocated.

Note that we do not support other hooks because UDP charges memory
under sk->sk_receive_queue.lock instead of lock_sock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/sock.h             | 48 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/bpf.h       |  6 +++++
 net/core/filter.c              | 32 ++++++++++++++++++++++-
 tools/include/uapi/linux/bpf.h |  6 +++++
 4 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 63a6a48afb48..fb33a7af7c9a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2596,10 +2596,39 @@ static inline gfp_t gfp_memcg_charge(void)
 	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
 }
 
+#define SK_BPF_MEMCG_FLAG_MASK	(SK_BPF_MEMCG_FLAG_MAX - 1)
+#define SK_BPF_MEMCG_PTR_MASK	~SK_BPF_MEMCG_FLAG_MASK
+
 #ifdef CONFIG_MEMCG
+static inline void mem_cgroup_sk_set_flags(struct sock *sk, unsigned short flags)
+{
+	unsigned long val = (unsigned long)sk->sk_memcg;
+
+	val |= flags;
+	sk->sk_memcg = (struct mem_cgroup *)val;
+}
+
+static inline unsigned short mem_cgroup_sk_get_flags(const struct sock *sk)
+{
+#ifdef CONFIG_BPF_SYSCALL
+	unsigned long val = (unsigned long)sk->sk_memcg;
+
+	return val & SK_BPF_MEMCG_FLAG_MASK;
+#else
+	return 0;
+#endif
+}
+
 static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
 {
+#ifdef CONFIG_BPF_SYSCALL
+	unsigned long val = (unsigned long)sk->sk_memcg;
+
+	val &= SK_BPF_MEMCG_PTR_MASK;
+	return (struct mem_cgroup *)val;
+#else
 	return sk->sk_memcg;
+#endif
 }
 
 static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
@@ -2607,6 +2636,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 	return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
 }
 
+static inline bool mem_cgroup_sk_isolated(const struct sock *sk)
+{
+	return mem_cgroup_sk_get_flags(sk) & SK_BPF_MEMCG_SOCK_ISOLATED;
+}
+
 static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
@@ -2624,6 +2658,15 @@ static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
 	return false;
 }
 #else
+static inline void mem_cgroup_sk_set_flag(struct sock *sk, unsigned short flags)
+{
+}
+
+static inline unsigned short mem_cgroup_sk_get_flags(const struct sock *sk)
+{
+	return 0;
+}
+
 static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
 {
 	return NULL;
@@ -2634,6 +2677,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 	return false;
 }
 
+static inline bool mem_cgroup_sk_isolated(const struct sock *sk)
+{
+	return false;
+}
+
 static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
 {
 	return false;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 80df246d4741..9657496e0f3c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7183,6 +7183,7 @@ enum {
 	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
 	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
 	SK_BPF_CB_FLAGS		= 1009, /* Get or set sock ops flags in socket */
+	SK_BPF_MEMCG_FLAGS	= 1010, /* Get or Set flags saved in sk->sk_memcg */
 };
 
 enum {
@@ -7205,6 +7206,11 @@ enum {
 						 */
 };
 
+enum {
+	SK_BPF_MEMCG_SOCK_ISOLATED	= (1UL << 0),
+	SK_BPF_MEMCG_FLAG_MAX		= (1UL << 1),
+};
+
 struct bpf_perf_event_value {
 	__u64 counter;
 	__u64 enabled;
diff --git a/net/core/filter.c b/net/core/filter.c
index aa17c7ed5aed..d8a9f73095fb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5267,6 +5267,31 @@ static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
 	return 0;
 }
 
+static int sk_bpf_set_get_memcg_flags(struct sock *sk, int *optval, bool getopt)
+{
+	if (!mem_cgroup_sk_enabled(sk))
+		return -EOPNOTSUPP;
+
+	if (getopt) {
+		*optval = mem_cgroup_sk_get_flags(sk);
+		return 0;
+	}
+
+	/* Don't allow once sk has been published to userspace.
+	 * INET_CREATE is called without lock_sock() but with sk_socket
+	 * INET_ACCEPT is called with lock_sock() but without sk_socket
+	 */
+	if (sock_owned_by_user_nocheck(sk) && sk->sk_socket)
+		return -EBUSY;
+
+	if (*optval <= 0 || *optval >= SK_BPF_MEMCG_FLAG_MAX)
+		return -EINVAL;
+
+	mem_cgroup_sk_set_flags(sk, *optval);
+
+	return 0;
+}
+
 static int sol_socket_sockopt(struct sock *sk, int optname,
 			      char *optval, int *optlen,
 			      bool getopt)
@@ -5284,6 +5309,7 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
 	case SO_BINDTOIFINDEX:
 	case SO_TXREHASH:
 	case SK_BPF_CB_FLAGS:
+	case SK_BPF_MEMCG_FLAGS:
 		if (*optlen != sizeof(int))
 			return -EINVAL;
 		break;
@@ -5293,8 +5319,12 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
 		return -EINVAL;
 	}
 
-	if (optname == SK_BPF_CB_FLAGS)
+	switch (optname) {
+	case SK_BPF_CB_FLAGS:
 		return sk_bpf_set_get_cb_flags(sk, optval, getopt);
+	case SK_BPF_MEMCG_FLAGS:
+		return sk_bpf_set_get_memcg_flags(sk, (int *)optval, getopt);
+	}
 
 	if (getopt) {
 		if (optname == SO_BINDTODEVICE)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 80df246d4741..9657496e0f3c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7183,6 +7183,7 @@ enum {
 	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
 	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
 	SK_BPF_CB_FLAGS		= 1009, /* Get or set sock ops flags in socket */
+	SK_BPF_MEMCG_FLAGS	= 1010, /* Get or Set flags saved in sk->sk_memcg */
 };
 
 enum {
@@ -7205,6 +7206,11 @@ enum {
 						 */
 };
 
+enum {
+	SK_BPF_MEMCG_SOCK_ISOLATED	= (1UL << 0),
+	SK_BPF_MEMCG_FLAG_MAX		= (1UL << 1),
+};
+
 struct bpf_perf_event_value {
 	__u64 counter;
 	__u64 enabled;
-- 
2.51.0.rc2.233.g662b1ed5c5-goog


* [PATCH v1 bpf-next/net 7/8] net-memcg: Allow decoupling memcg from global protocol memory accounting.
  2025-08-22 22:17 [PATCH v1 bpf-next/net 0/8] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (5 preceding siblings ...)
  2025-08-22 22:18 ` [PATCH v1 bpf-next/net 6/8] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_SOCK_ISOLATED Kuniyuki Iwashima
@ 2025-08-22 22:18 ` Kuniyuki Iwashima
  2025-08-22 22:18 ` [PATCH v1 bpf-next/net 8/8] selftest: bpf: Add test for SK_BPF_MEMCG_SOCK_ISOLATED Kuniyuki Iwashima
  7 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-22 22:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_prot->memory_allocated.

When running under a non-root cgroup, this memory is also charged to the
memcg as "sock" in memory.stat.

Even when a memcg controls memory usage, sockets of such protocols are
still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).

This makes it difficult to accurately estimate and configure appropriate
global limits, especially in multi-tenant environments.

If all workloads were guaranteed to be controlled under memcg, the issue
could be worked around by setting tcp_mem[0-2] to UINT_MAX.

In reality, this assumption does not always hold, and processes not
controlled by memcg lose the seatbelt and can consume memory up to
the global limit, becoming a noisy neighbour.

Let's decouple sockets in memcg from the global per-protocol memory
accounting if sockets have SK_BPF_MEMCG_SOCK_ISOLATED in sk->sk_memcg.

This simplifies memcg configuration while keeping the global limits
within a reasonable range.

If mem_cgroup_sk_isolated(sk) returns true, the per-protocol memory
accounting is skipped.

In __inet_accept(), we need to reclaim counts that are already charged
for child sockets because we do not allocate sk->sk_memcg until accept().

Note that trace_sock_exceed_buf_limit() will always show 0 as the
accounted amount for isolated sockets, but the real usage can be
obtained via memory.stat.

Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.

Setup:

  # mkdir /sys/fs/cgroup/test
  # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
  # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"

Without bpf prog:

  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 22642688
  # cat /proc/net/sockstat| grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376
  # ss -tn | head -n 5
  State Recv-Q Send-Q Local Address:Port  Peer Address:Port
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53188
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:49972
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53868
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53554
  # nstat | grep Pressure || echo no pressure
  TcpExtTCPMemoryPressures        1                  0.0

With bpf prog in the next patch:

  # bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg_create type cgroup/sock_create
  # bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg_accept type cgroup/sock_accept
  # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/sk_memcg_create
  # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_accept pinned /sys/fs/bpf/sk_memcg_accept
  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 2757468160
  # cat /proc/net/sockstat | grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0
  # ss -tn | head -n 5
  State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:49026
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:45630
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:44870
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:45274
  # nstat | grep Pressure || echo no pressure
  no pressure

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/proto_memory.h      | 15 ++++++--
 include/net/tcp.h               | 10 ++++--
 net/core/sock.c                 | 64 ++++++++++++++++++++++-----------
 net/ipv4/af_inet.c              | 12 ++++++-
 net/ipv4/inet_connection_sock.c |  1 +
 net/ipv4/tcp.c                  |  3 +-
 net/ipv4/tcp_output.c           | 10 ++++--
 net/mptcp/protocol.c            |  3 +-
 net/tls/tls_device.c            |  4 ++-
 9 files changed, 90 insertions(+), 32 deletions(-)

diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 8e91a8fa31b5..8e8432b13515 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,13 +31,22 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_isolated(sk))
+			return false;
+	}
 
 	return !!READ_ONCE(*sk->sk_prot->memory_pressure);
 }
 
+static inline bool sk_should_enter_memory_pressure(struct sock *sk)
+{
+	return !mem_cgroup_sk_enabled(sk) || !mem_cgroup_sk_isolated(sk);
+}
+
 static inline long
 proto_memory_allocated(const struct proto *prot)
 {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2936b8175950..0191a4585bba 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,9 +275,13 @@ extern unsigned long tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_isolated(sk))
+			return false;
+	}
 
 	return READ_ONCE(tcp_memory_pressure);
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 8002ac6293dc..be5574f9a025 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1046,17 +1046,21 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	if (!charged)
 		return -ENOMEM;
 
-	/* pre-charge to forward_alloc */
-	sk_memory_allocated_add(sk, pages);
-	allocated = sk_memory_allocated(sk);
-	/* If the system goes into memory pressure with this
-	 * precharge, give up and return error.
-	 */
-	if (allocated > sk_prot_mem_limits(sk, 1)) {
-		sk_memory_allocated_sub(sk, pages);
-		mem_cgroup_sk_uncharge(sk, pages);
-		return -ENOMEM;
+	if (!mem_cgroup_sk_isolated(sk)) {
+		/* pre-charge to forward_alloc */
+		sk_memory_allocated_add(sk, pages);
+		allocated = sk_memory_allocated(sk);
+
+		/* If the system goes into memory pressure with this
+		 * precharge, give up and return error.
+		 */
+		if (allocated > sk_prot_mem_limits(sk, 1)) {
+			sk_memory_allocated_sub(sk, pages);
+			mem_cgroup_sk_uncharge(sk, pages);
+			return -ENOMEM;
+		}
 	}
+
 	sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
 
 	WRITE_ONCE(sk->sk_reserved_mem,
@@ -3153,8 +3157,11 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 	if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
 		return true;
 
-	sk_enter_memory_pressure(sk);
+	if (sk_should_enter_memory_pressure(sk))
+		sk_enter_memory_pressure(sk);
+
 	sk_stream_moderate_sndbuf(sk);
+
 	return false;
 }
 EXPORT_SYMBOL(sk_page_frag_refill);
@@ -3267,18 +3274,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	bool memcg_enabled = false, charged = false;
 	struct proto *prot = sk->sk_prot;
-	long allocated;
-
-	sk_memory_allocated_add(sk, amt);
-	allocated = sk_memory_allocated(sk);
+	long allocated = 0;
 
 	if (mem_cgroup_sk_enabled(sk)) {
+		bool isolated = mem_cgroup_sk_isolated(sk);
+
 		memcg_enabled = true;
 		charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
-		if (!charged)
+
+		if (isolated && charged)
+			return 1;
+
+		if (!charged) {
+			if (!isolated) {
+				sk_memory_allocated_add(sk, amt);
+				allocated = sk_memory_allocated(sk);
+			}
+
 			goto suppress_allocation;
+		}
 	}
 
+	sk_memory_allocated_add(sk, amt);
+	allocated = sk_memory_allocated(sk);
+
 	/* Under limit. */
 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
@@ -3357,7 +3376,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 
 	trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
 
-	sk_memory_allocated_sub(sk, amt);
+	if (allocated)
+		sk_memory_allocated_sub(sk, amt);
 
 	if (charged)
 		mem_cgroup_sk_uncharge(sk, amt);
@@ -3396,11 +3416,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
  */
 void __sk_mem_reduce_allocated(struct sock *sk, int amount)
 {
-	sk_memory_allocated_sub(sk, amount);
-
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_uncharge(sk, amount);
 
+		if (mem_cgroup_sk_isolated(sk))
+			return;
+	}
+
+	sk_memory_allocated_sub(sk, amount);
+
 	if (sk_under_global_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index ab613abdfaa4..e92dfca0a0ff 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -95,6 +95,7 @@
 #include <net/checksum.h>
 #include <net/ip.h>
 #include <net/protocol.h>
+#include <net/proto_memory.h>
 #include <net/arp.h>
 #include <net/route.h>
 #include <net/ip_fib.h>
@@ -773,8 +774,17 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
 		 */
 		amt = sk_mem_pages(newsk->sk_forward_alloc +
 				   atomic_read(&newsk->sk_rmem_alloc));
-		if (amt)
+		if (amt) {
+			/* This amt is already charged globally to
+			 * sk_prot->memory_allocated due to lack of
+			 * sk_memcg until accept(), thus we need to
+			 * reclaim it here if newsk is isolated.
+			 */
+			if (mem_cgroup_sk_isolated(newsk))
+				sk_memory_allocated_sub(newsk, amt);
+
 			mem_cgroup_sk_charge(newsk, amt, gfp);
+		}
 	}
 
 	sock_rps_record_flow(newsk);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index ed10b959a906..f8dd53d40dcf 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -22,6 +22,7 @@
 #include <net/tcp.h>
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
+#include <net/proto_memory.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 /* match_sk*_wildcard == true:  IPV6_ADDR_ANY equals to any IPv6 addresses
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 71a956fbfc55..dcbd49e2f8af 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -908,7 +908,8 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
 		}
 		__kfree_skb(skb);
 	} else {
-		sk->sk_prot->enter_memory_pressure(sk);
+		if (sk_should_enter_memory_pressure(sk))
+			tcp_enter_memory_pressure(sk);
 		sk_stream_moderate_sndbuf(sk);
 	}
 	return NULL;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dfbac0876d96..f7aa86661219 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3574,12 +3574,18 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
 	delta = size - sk->sk_forward_alloc;
 	if (delta <= 0)
 		return;
+
 	amt = sk_mem_pages(delta);
 	sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
-	sk_memory_allocated_add(sk, amt);
 
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() | __GFP_NOFAIL);
+
+		if (mem_cgroup_sk_isolated(sk))
+			return;
+	}
+
+	sk_memory_allocated_add(sk, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9a287b75c1b3..f7487e22a3f8 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -16,6 +16,7 @@
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
 #include <net/protocol.h>
+#include <net/proto_memory.h>
 #include <net/tcp_states.h>
 #if IS_ENABLED(CONFIG_MPTCP_IPV6)
 #include <net/transp_v6.h>
@@ -1016,7 +1017,7 @@ static void mptcp_enter_memory_pressure(struct sock *sk)
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
-		if (first)
+		if (first && sk_should_enter_memory_pressure(ssk))
 			tcp_enter_memory_pressure(ssk);
 		sk_stream_moderate_sndbuf(ssk);
 
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index f672a62a9a52..6696ef837116 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -35,6 +35,7 @@
 #include <linux/netdevice.h>
 #include <net/dst.h>
 #include <net/inet_connection_sock.h>
+#include <net/proto_memory.h>
 #include <net/tcp.h>
 #include <net/tls.h>
 #include <linux/skbuff_ref.h>
@@ -371,7 +372,8 @@ static int tls_do_allocation(struct sock *sk,
 	if (!offload_ctx->open_record) {
 		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
 						   sk->sk_allocation))) {
-			READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
+			if (sk_should_enter_memory_pressure(sk))
+				READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
 			sk_stream_moderate_sndbuf(sk);
 			return -ENOMEM;
 		}
-- 
2.51.0.rc2.233.g662b1ed5c5-goog


* [PATCH v1 bpf-next/net 8/8] selftest: bpf: Add test for SK_BPF_MEMCG_SOCK_ISOLATED.
  2025-08-22 22:17 [PATCH v1 bpf-next/net 0/8] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2025-08-22 22:18 ` [PATCH v1 bpf-next/net 7/8] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
@ 2025-08-22 22:18 ` Kuniyuki Iwashima
  7 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-22 22:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

The test does the following for IPv4/IPv6 x TCP/UDP sockets
with/without BPF prog.

  1. Create socket pairs
  2. Send a bunch of data that require more than 1000 pages
  3. Read memory_allocated from the 3rd column in /proc/net/protocols
  4. Check if unread data is charged to memory_allocated

If the BPF prog is attached, memory_allocated should not change,
but we allow a small error (up to 10 pages) in case the test is run
concurrently with other tests using TCP/UDP sockets.
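
For reference, the parsed line looks like this (illustrative values;
the test skips the 9-char protocol name column and reads the third
numeric field, memory):

  protocol  size sockets  memory press maxhdr  slab module ...
  TCP       2176    2006    5376 no       320  yes  kernel ...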

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 .../selftests/bpf/prog_tests/sk_memcg.c       | 214 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/sk_memcg.c  |  29 +++
 2 files changed, 243 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/sk_memcg.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sk_memcg.c b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
new file mode 100644
index 000000000000..486e58277eec
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2025 Google LLC */
+
+#include <test_progs.h>
+#include "sk_memcg.skel.h"
+#include "network_helpers.h"
+
+#define NR_SOCKETS	128
+#define NR_SEND		128
+
+struct test_case {
+	char name[10]; /* protocols (%-9s) in /proc/net/protocols, see proto_seq_printf(). */
+	int family;
+	int type;
+	int (*create_sockets)(struct test_case *test_case, int sk[], int len);
+};
+
+static int tcp_create_sockets(struct test_case *test_case, int sk[], int len)
+{
+	int server, i;
+
+	server = start_server(test_case->family, test_case->type, NULL, 0, 0);
+	ASSERT_GE(server, 0, "start_server");
+
+	for (i = 0; i < len / 2; i++) {
+		sk[i * 2] = connect_to_fd(server, 0);
+		if (!ASSERT_GE(sk[i * 2], 0, "connect_to_fd"))
+			return sk[i * 2];
+
+		sk[i * 2 + 1] = accept(server, NULL, NULL);
+		if (!ASSERT_GE(sk[i * 2 + 1], 0, "accept"))
+			return sk[i * 2 + 1];
+	}
+
+	close(server);
+
+	return 0;
+}
+
+static int udp_create_sockets(struct test_case *test_case, int sk[], int len)
+{
+	int i, err, rcvbuf = 1024 * NR_SEND;
+
+	for (i = 0; i < len / 2; i++) {
+		sk[i * 2] = start_server(test_case->family, test_case->type, NULL, 0, 0);
+		if (!ASSERT_GE(sk[i * 2], 0, "start_server"))
+			return sk[i * 2];
+
+		sk[i * 2 + 1] = connect_to_fd(sk[i * 2], 0);
+		if (!ASSERT_GE(sk[i * 2 + 1], 0, "connect_to_fd"))
+			return sk[i * 2 + 1];
+
+		err = connect_fd_to_fd(sk[i * 2], sk[i * 2 + 1], 0);
+		if (!ASSERT_EQ(err, 0, "connect_fd_to_fd"))
+			return err;
+
+		err = setsockopt(sk[i * 2], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
+		if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
+			return err;
+
+		err = setsockopt(sk[i * 2 + 1], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
+		if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
+			return err;
+	}
+
+	return 0;
+}
+
+static int get_memory_allocated(struct test_case *test_case)
+{
+	long memory_allocated = -1;
+	char *line = NULL;
+	size_t unused;
+	FILE *f;
+
+	f = fopen("/proc/net/protocols", "r");
+	if (!ASSERT_OK_PTR(f, "fopen"))
+		goto out;
+
+	while (getline(&line, &unused, f) != -1) {
+		unsigned int unused_0;
+		int unused_1;
+		int ret;
+
+		if (strncmp(line, test_case->name, sizeof(test_case->name)))
+			continue;
+
+		ret = sscanf(line + sizeof(test_case->name), "%4u %6d  %6ld",
+			     &unused_0, &unused_1, &memory_allocated);
+		ASSERT_EQ(ret, 3, "sscanf");
+		break;
+	}
+
+	ASSERT_NEQ(memory_allocated, -1, "get_memory_allocated");
+
+	free(line);
+	fclose(f);
+out:
+	return memory_allocated;
+}
+
+static int check_isolated(struct test_case *test_case, bool isolated)
+{
+	long memory_allocated[2];
+	int sk[NR_SOCKETS] = {};
+	char buf[1024] = {};
+	int err = -1, i, j;
+
+	memory_allocated[0] = get_memory_allocated(test_case);
+	if (!ASSERT_GE(memory_allocated[0], 0, "memory_allocated[0]"))
+		goto out;
+
+	err = test_case->create_sockets(test_case, sk, ARRAY_SIZE(sk));
+	if (err)
+		goto close;
+
+	/* Must allocate pages >= net.core.mem_pcpu_rsv */
+	for (i = 0; i < ARRAY_SIZE(sk); i++) {
+		for (j = 0; j < NR_SEND; j++) {
+			int bytes = send(sk[i], buf, sizeof(buf), 0);
+
+			ASSERT_EQ(bytes, sizeof(buf), "send");
+		}
+	}
+
+	memory_allocated[1] = get_memory_allocated(test_case);
+	if (!ASSERT_GE(memory_allocated[1], 0, "memory_allocated[1]"))
+		goto close;
+
+	if (isolated)
+		ASSERT_LE(memory_allocated[1], memory_allocated[0] + 10, "isolated");
+	else
+		ASSERT_GT(memory_allocated[1], memory_allocated[0] + 1000, "not isolated");
+
+close:
+	for (i = 0; i < ARRAY_SIZE(sk); i++)
+		close(sk[i]);
+
+	/* Let RCU destruct sockets */
+	sleep(1);
+out:
+	return err;
+}
+
+void run_test(struct test_case *test_case)
+{
+	struct sk_memcg *skel;
+	int cgroup, err;
+
+	skel = sk_memcg__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	cgroup = test__join_cgroup("/sk_memcg");
+	if (!ASSERT_GE(cgroup, 0, "join_cgroup"))
+		goto destroy_skel;
+
+	err = check_isolated(test_case, false);
+	if (!ASSERT_EQ(err, 0, "check_isolated(false)"))
+		goto close_cgroup;
+
+	skel->links.sock_create = bpf_program__attach_cgroup(skel->progs.sock_create, cgroup);
+	if (!ASSERT_OK_PTR(skel->links.sock_create, "attach_cgroup(sock_create)"))
+		goto close_cgroup;
+
+	skel->links.sock_accept = bpf_program__attach_cgroup(skel->progs.sock_accept, cgroup);
+	if (!ASSERT_OK_PTR(skel->links.sock_accept, "attach_cgroup(sock_accept)"))
+		goto close_cgroup;
+
+	err = check_isolated(test_case, true);
+	ASSERT_EQ(err, 0, "check_isolated(true)");
+
+close_cgroup:
+	close(cgroup);
+destroy_skel:
+	sk_memcg__destroy(skel);
+}
+
+struct test_case test_cases[] = {
+	{
+		.name = "TCP       ",
+		.family = AF_INET,
+		.type = SOCK_STREAM,
+		.create_sockets = tcp_create_sockets,
+	},
+	{
+		.name = "UDP       ",
+		.family = AF_INET,
+		.type = SOCK_DGRAM,
+		.create_sockets = udp_create_sockets,
+	},
+	{
+		.name = "TCPv6     ",
+		.family = AF_INET6,
+		.type = SOCK_STREAM,
+		.create_sockets = tcp_create_sockets,
+	},
+	{
+		.name = "UDPv6     ",
+		.family = AF_INET6,
+		.type = SOCK_DGRAM,
+		.create_sockets = udp_create_sockets,
+	},
+};
+
+void test_sk_memcg(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
+		test__start_subtest(test_cases[i].name);
+		run_test(&test_cases[i]);
+	}
+}
diff --git a/tools/testing/selftests/bpf/progs/sk_memcg.c b/tools/testing/selftests/bpf/progs/sk_memcg.c
new file mode 100644
index 000000000000..8a43e05be14b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sk_memcg.c
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2025 Google LLC */
+
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_helpers.h>
+
+void isolate_memcg(struct bpf_sock *ctx)
+{
+	int flags = SK_BPF_MEMCG_SOCK_ISOLATED;
+
+	bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
+		       &flags, sizeof(flags));
+}
+
+SEC("cgroup/sock_create")
+int sock_create(struct bpf_sock *ctx)
+{
+	isolate_memcg(ctx);
+	return 1;
+}
+
+SEC("cgroup/sock_accept")
+int sock_accept(struct bpf_sock *ctx)
+{
+	isolate_memcg(ctx);
+	return 1;
+}
+
+char LICENSE[] SEC("license") = "GPL";
-- 
2.51.0.rc2.233.g662b1ed5c5-goog


* Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().
  2025-08-22 22:17 ` [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept() Kuniyuki Iwashima
@ 2025-08-23 11:02   ` kernel test robot
  2025-08-25 17:57   ` Martin KaFai Lau
  1 sibling, 0 replies; 20+ messages in thread
From: kernel test robot @ 2025-08-23 11:02 UTC (permalink / raw)
  To: Kuniyuki Iwashima, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau
  Cc: oe-kbuild-all, John Fastabend, Stanislav Fomichev,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima, bpf, netdev

Hi Kuniyuki,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/net]

url:    https://github.com/intel-lab-lkp/linux/commits/Kuniyuki-Iwashima/tcp-Save-lock_sock-for-memcg-in-inet_csk_accept/20250823-062322
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git net
patch link:    https://lore.kernel.org/r/20250822221846.744252-3-kuniyu%40google.com
patch subject: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().
config: x86_64-buildonly-randconfig-001-20250823 (https://download.01.org/0day-ci/archive/20250823/202508231842.8DrS8EwE-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250823/202508231842.8DrS8EwE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508231842.8DrS8EwE-lkp@intel.com/

All errors (new ones prefixed by >>):

   net/ipv4/af_inet.c: In function '__inet_accept':
>> net/ipv4/af_inet.c:766:9: error: implicit declaration of function 'BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT'; did you mean 'BPF_CGROUP_RUN_PROG_INET_SOCK_RELEASE'? [-Werror=implicit-function-declaration]
     766 |         BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         |         BPF_CGROUP_RUN_PROG_INET_SOCK_RELEASE
   cc1: some warnings being treated as errors


vim +766 net/ipv4/af_inet.c

   753	
   754	void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *newsk)
   755	{
   756		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
   757	
   758		/* TODO: use sk_clone_lock() in SCTP and remove protocol checks */
   759		if (mem_cgroup_sockets_enabled &&
   760		    (!IS_ENABLED(CONFIG_IP_SCTP) ||
   761		     sk_is_tcp(newsk) || sk_is_mptcp(newsk))) {
   762			mem_cgroup_sk_alloc(newsk);
   763			kmem_cache_charge(newsk, gfp);
   764		}
   765	
 > 766		BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
   767	
   768		if (mem_cgroup_sk_enabled(newsk)) {
   769			int amt;
   770	
   771			/* The socket has not been accepted yet, no need
   772			 * to look at newsk->sk_wmem_queued.
   773			 */
   774			amt = sk_mem_pages(newsk->sk_forward_alloc +
   775					   atomic_read(&newsk->sk_rmem_alloc));
   776			if (amt)
   777				mem_cgroup_sk_charge(newsk, amt, gfp);
   778		}
   779	
   780		sock_rps_record_flow(newsk);
   781		WARN_ON(!((1 << newsk->sk_state) &
   782			  (TCPF_ESTABLISHED | TCPF_SYN_RECV |
   783			   TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2 |
   784			   TCPF_CLOSING | TCPF_CLOSE_WAIT |
   785			   TCPF_CLOSE)));
   786	
   787		if (test_bit(SOCK_SUPPORT_ZC, &sock->flags))
   788			set_bit(SOCK_SUPPORT_ZC, &newsock->flags);
   789		sock_graft(newsk, newsock);
   790	
   791		newsock->state = SS_CONNECTED;
   792	}
   793	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH v1 bpf-next/net 6/8] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_SOCK_ISOLATED.
  2025-08-22 22:18 ` [PATCH v1 bpf-next/net 6/8] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_SOCK_ISOLATED Kuniyuki Iwashima
@ 2025-08-23 15:38   ` kernel test robot
  0 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2025-08-23 15:38 UTC (permalink / raw)
  To: Kuniyuki Iwashima, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau
  Cc: oe-kbuild-all, John Fastabend, Stanislav Fomichev,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima, bpf, netdev

Hi Kuniyuki,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/net]

url:    https://github.com/intel-lab-lkp/linux/commits/Kuniyuki-Iwashima/tcp-Save-lock_sock-for-memcg-in-inet_csk_accept/20250823-062322
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git net
patch link:    https://lore.kernel.org/r/20250822221846.744252-7-kuniyu%40google.com
patch subject: [PATCH v1 bpf-next/net 6/8] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_SOCK_ISOLATED.
config: arc-randconfig-002-20250823 (https://download.01.org/0day-ci/archive/20250823/202508232331.rxOqu50j-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 12.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250823/202508232331.rxOqu50j-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508232331.rxOqu50j-lkp@intel.com/

All errors (new ones prefixed by >>):

   net/core/filter.c: In function 'sk_bpf_set_get_memcg_flags':
>> net/core/filter.c:5290:9: error: implicit declaration of function 'mem_cgroup_sk_set_flags'; did you mean 'mem_cgroup_sk_get_flags'? [-Werror=implicit-function-declaration]
    5290 |         mem_cgroup_sk_set_flags(sk, *optval);
         |         ^~~~~~~~~~~~~~~~~~~~~~~
         |         mem_cgroup_sk_get_flags
   cc1: some warnings being treated as errors


vim +5290 net/core/filter.c

  5269	
  5270	static int sk_bpf_set_get_memcg_flags(struct sock *sk, int *optval, bool getopt)
  5271	{
  5272		if (!mem_cgroup_sk_enabled(sk))
  5273			return -EOPNOTSUPP;
  5274	
  5275		if (getopt) {
  5276			*optval = mem_cgroup_sk_get_flags(sk);
  5277			return 0;
  5278		}
  5279	
  5280		/* Don't allow once sk has been published to userspace.
  5281		 * INET_CREATE is called without lock_sock() but with sk_socket
  5282		 * INET_ACCEPT is called with lock_sock() but without sk_socket
  5283		 */
  5284		if (sock_owned_by_user_nocheck(sk) && sk->sk_socket)
  5285			return -EBUSY;
  5286	
  5287		if (*optval <= 0 || *optval >= SK_BPF_MEMCG_FLAG_MAX)
  5288			return -EINVAL;
  5289	
> 5290		mem_cgroup_sk_set_flags(sk, *optval);
  5291	
  5292		return 0;
  5293	}
  5294	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 bpf-next/net 5/8] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_(CREATE|ACCEPT).
  2025-08-22 22:18 ` [PATCH v1 bpf-next/net 5/8] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_(CREATE|ACCEPT) Kuniyuki Iwashima
@ 2025-08-23 23:58   ` kernel test robot
  0 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2025-08-23 23:58 UTC (permalink / raw)
  To: Kuniyuki Iwashima, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau
  Cc: oe-kbuild-all, John Fastabend, Stanislav Fomichev,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima, bpf, netdev

Hi Kuniyuki,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/net]

url:    https://github.com/intel-lab-lkp/linux/commits/Kuniyuki-Iwashima/tcp-Save-lock_sock-for-memcg-in-inet_csk_accept/20250823-062322
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git net
patch link:    https://lore.kernel.org/r/20250822221846.744252-6-kuniyu%40google.com
patch subject: [PATCH v1 bpf-next/net 5/8] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_(CREATE|ACCEPT).
config: arm-randconfig-r131-20250824 (https://download.01.org/0day-ci/archive/20250824/202508240731.UPB4k4Uo-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 8.5.0
reproduce: (https://download.01.org/0day-ci/archive/20250824/202508240731.UPB4k4Uo-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508240731.UPB4k4Uo-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   net/core/filter.c:6322:9: sparse: sparse: switch with no cases
   net/core/filter.c:6363:9: sparse: sparse: switch with no cases
   net/core/filter.c:1440:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sock_filter const *filter @@     got struct sock_filter [noderef] __user *filter @@
   net/core/filter.c:1440:39: sparse:     expected struct sock_filter const *filter
   net/core/filter.c:1440:39: sparse:     got struct sock_filter [noderef] __user *filter
   net/core/filter.c:1518:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sock_filter const *filter @@     got struct sock_filter [noderef] __user *filter @@
   net/core/filter.c:1518:39: sparse:     expected struct sock_filter const *filter
   net/core/filter.c:1518:39: sparse:     got struct sock_filter [noderef] __user *filter
>> net/core/filter.c:5752:29: sparse: sparse: symbol 'bpf_sock_setsockopt_proto' was not declared. Should it be static?
>> net/core/filter.c:5769:29: sparse: sparse: symbol 'bpf_unlocked_sock_setsockopt_proto' was not declared. Should it be static?
   net/core/filter.c:11166:31: sparse: sparse: symbol 'cg_skb_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11172:27: sparse: sparse: symbol 'cg_skb_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11216:31: sparse: sparse: symbol 'cg_sock_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11222:27: sparse: sparse: symbol 'cg_sock_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11225:31: sparse: sparse: symbol 'cg_sock_addr_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11231:27: sparse: sparse: symbol 'cg_sock_addr_prog_ops' was not declared. Should it be static?
   net/core/filter.c:1948:43: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __wsum [usertype] diff @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1948:43: sparse:     expected restricted __wsum [usertype] diff
   net/core/filter.c:1948:43: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1951:36: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __be16 [usertype] old @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1951:36: sparse:     expected restricted __be16 [usertype] old
   net/core/filter.c:1951:36: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1951:42: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be16 [usertype] new @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1951:42: sparse:     expected restricted __be16 [usertype] new
   net/core/filter.c:1951:42: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1954:36: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __be32 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1954:36: sparse:     expected restricted __be32 [usertype] from
   net/core/filter.c:1954:36: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1954:42: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be32 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1954:42: sparse:     expected restricted __be32 [usertype] to
   net/core/filter.c:1954:42: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:2000:59: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __wsum [usertype] diff @@     got unsigned long long [usertype] to @@
   net/core/filter.c:2000:59: sparse:     expected restricted __wsum [usertype] diff
   net/core/filter.c:2000:59: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:2003:52: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be16 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:2003:52: sparse:     expected restricted __be16 [usertype] from
   net/core/filter.c:2003:52: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:2003:58: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __be16 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:2003:58: sparse:     expected restricted __be16 [usertype] to
   net/core/filter.c:2003:58: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:2006:52: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be32 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:2006:52: sparse:     expected restricted __be32 [usertype] from
   net/core/filter.c:2006:52: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:2006:58: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __be32 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:2006:58: sparse:     expected restricted __be32 [usertype] to
   net/core/filter.c:2006:58: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:2073:35: sparse: sparse: incorrect type in return expression (different base types) @@     expected unsigned long long @@     got restricted __wsum [usertype] csum @@
   net/core/filter.c:2073:35: sparse:     expected unsigned long long
   net/core/filter.c:2073:35: sparse:     got restricted __wsum [usertype] csum

vim +/bpf_sock_setsockopt_proto +5752 net/core/filter.c

  5751	
> 5752	const struct bpf_func_proto bpf_sock_setsockopt_proto = {
  5753		.func		= bpf_sock_setsockopt,
  5754		.gpl_only	= false,
  5755		.ret_type	= RET_INTEGER,
  5756		.arg1_type	= ARG_PTR_TO_CTX,
  5757		.arg2_type	= ARG_ANYTHING,
  5758		.arg3_type	= ARG_ANYTHING,
  5759		.arg4_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
  5760		.arg5_type	= ARG_CONST_SIZE,
  5761	};
  5762	
  5763	BPF_CALL_5(bpf_unlocked_sock_setsockopt, struct sock *, sk, int, level,
  5764		   int, optname, char *, optval, int, optlen)
  5765	{
  5766		return _bpf_setsockopt(sk, level, optname, optval, optlen);
  5767	}
  5768	
> 5769	const struct bpf_func_proto bpf_unlocked_sock_setsockopt_proto = {
  5770		.func		= bpf_unlocked_sock_setsockopt,
  5771		.gpl_only	= false,
  5772		.ret_type	= RET_INTEGER,
  5773		.arg1_type	= ARG_PTR_TO_CTX,
  5774		.arg2_type	= ARG_ANYTHING,
  5775		.arg3_type	= ARG_ANYTHING,
  5776		.arg4_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
  5777		.arg5_type	= ARG_CONST_SIZE,
  5778	};
  5779	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().
  2025-08-22 22:17 ` [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept() Kuniyuki Iwashima
  2025-08-23 11:02   ` kernel test robot
@ 2025-08-25 17:57   ` Martin KaFai Lau
  2025-08-25 18:14     ` Kuniyuki Iwashima
  1 sibling, 1 reply; 20+ messages in thread
From: Martin KaFai Lau @ 2025-08-25 17:57 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On 8/22/25 3:17 PM, Kuniyuki Iwashima wrote:
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index ae83ecda3983..ab613abdfaa4 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -763,6 +763,8 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
>   		kmem_cache_charge(newsk, gfp);
>   	}
>   
> +	BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
> +
>   	if (mem_cgroup_sk_enabled(newsk)) {
>   		int amt;
>   
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 233de8677382..80df246d4741 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1133,6 +1133,7 @@ enum bpf_attach_type {
>   	BPF_NETKIT_PEER,
>   	BPF_TRACE_KPROBE_SESSION,
>   	BPF_TRACE_UPROBE_SESSION,
> +	BPF_CGROUP_INET_SOCK_ACCEPT,

Instead of adding another hook, can the SK_BPF_MEMCG_SOCK_ISOLATED bit be 
inherited from the listener?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().
  2025-08-25 17:57   ` Martin KaFai Lau
@ 2025-08-25 18:14     ` Kuniyuki Iwashima
  2025-08-25 23:14       ` Martin KaFai Lau
  0 siblings, 1 reply; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-25 18:14 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Mon, Aug 25, 2025 at 10:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 8/22/25 3:17 PM, Kuniyuki Iwashima wrote:
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index ae83ecda3983..ab613abdfaa4 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -763,6 +763,8 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
> >               kmem_cache_charge(newsk, gfp);
> >       }
> >
> > +     BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
> > +
> >       if (mem_cgroup_sk_enabled(newsk)) {
> >               int amt;
> >
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 233de8677382..80df246d4741 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -1133,6 +1133,7 @@ enum bpf_attach_type {
> >       BPF_NETKIT_PEER,
> >       BPF_TRACE_KPROBE_SESSION,
> >       BPF_TRACE_UPROBE_SESSION,
> > +     BPF_CGROUP_INET_SOCK_ACCEPT,
>
> Instead of adding another hook, can the SK_BPF_MEMCG_SOCK_ISOLATED bit be
> inherited from the listener?

Since e876ecc67db80 and d752a4986532c, we defer memcg allocation to
accept() because the child socket could be created in irq context
under an unrelated cgroup.  There was another reason as well: if the
listener was created in the root cgroup and then passed to a process
under a cgroup, child sockets would never have sk_memcg if sk_memcg
were inherited.

So, the child's memcg is not always the same as the listener's, and
we cannot rely on the listener's sk_memcg.
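
Concretely, the timing looks roughly like this (sketch only; the call
sites are simplified and the helper placement is illustrative):

	/* softirq: the 3WHS completes and the child is cloned */
	newsk = sk_clone_lock(listener, GFP_ATOMIC);
	/* `current` here is whatever task the softirq interrupted,
	 * so deriving newsk->sk_memcg from it could charge an
	 * unrelated cgroup; sk_memcg is left NULL at this point.
	 */

	/* later, process context: accept(2) in the accepting task */
	mem_cgroup_sk_alloc(newsk);	/* `current` is the right task now */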

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().
  2025-08-25 18:14     ` Kuniyuki Iwashima
@ 2025-08-25 23:14       ` Martin KaFai Lau
  2025-08-26  0:23         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 20+ messages in thread
From: Martin KaFai Lau @ 2025-08-25 23:14 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On 8/25/25 11:14 AM, Kuniyuki Iwashima wrote:
> On Mon, Aug 25, 2025 at 10:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 8/22/25 3:17 PM, Kuniyuki Iwashima wrote:
>>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>>> index ae83ecda3983..ab613abdfaa4 100644
>>> --- a/net/ipv4/af_inet.c
>>> +++ b/net/ipv4/af_inet.c
>>> @@ -763,6 +763,8 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
>>>                kmem_cache_charge(newsk, gfp);
>>>        }
>>>
>>> +     BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
>>> +
>>>        if (mem_cgroup_sk_enabled(newsk)) {
>>>                int amt;
>>>
>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>> index 233de8677382..80df246d4741 100644
>>> --- a/tools/include/uapi/linux/bpf.h
>>> +++ b/tools/include/uapi/linux/bpf.h
>>> @@ -1133,6 +1133,7 @@ enum bpf_attach_type {
>>>        BPF_NETKIT_PEER,
>>>        BPF_TRACE_KPROBE_SESSION,
>>>        BPF_TRACE_UPROBE_SESSION,
>>> +     BPF_CGROUP_INET_SOCK_ACCEPT,
>>
>> Instead of adding another hook, can the SK_BPF_MEMCG_SOCK_ISOLATED bit be
>> inherited from the listener?
> 
> Since e876ecc67db80 and d752a4986532c , we defer memcg allocation to
> accept() because the child socket could be created during irq context with
> unrelated cgroup.  This had another reason; if the listener was created in the
> root cgroup and passed to a process under cgroup, child sockets would never
> have sk_memcg if sk_memcg was inherited.
> 
> So, the child's memcg is not always the same one with the listener's, and
> we cannot rely on the listener's sk_memcg.

I didn't mean to inherit the entire sk_memcg pointer. I meant to only inherit 
the SK_BPF_MEMCG_SOCK_ISOLATED bit.

If it can only be done at accept, there is already an existing
SEC("lsm_cgroup/socket_accept") hook. Take a look at
tools/testing/selftests/bpf/progs/lsm_cgroup.c. The lsm socket_accept
hook doesn't have access to "newsock->sk", but it should have access
to "sock->sk": do bpf_setsockopt() there and then have "newsock->sk"
inherit the bit (?)

There are already quite enough cgroup-sk style hooks. I would prefer not to add 
another cgroup attach_type and instead see if some of the existing ones can be 
reused. There is also SEC("lsm/sock_graft").
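
For reference, a rough sketch of that shape (untested; it assumes
bpf_setsockopt() is usable from this lsm_cgroup hook and reuses the
SK_BPF_MEMCG_FLAGS / SK_BPF_MEMCG_SOCK_ISOLATED names from this
series):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define SOL_SOCKET 1	/* normally pulled in from bpf_tracing_net.h */

SEC("lsm_cgroup/socket_accept")
int BPF_PROG(mark_isolated, struct socket *sock, struct socket *newsock)
{
	int flags = SK_BPF_MEMCG_SOCK_ISOLATED;

	/* newsock->sk is not grafted yet, so flag the listener
	 * (sock->sk) and rely on the kernel copying the bit to the
	 * child at accept time.
	 */
	bpf_setsockopt(sock->sk, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
		       &flags, sizeof(flags));

	return 1;	/* allow; lsm_cgroup uses the cgroup-style return convention */
}

char _license[] SEC("license") = "GPL";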

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().
  2025-08-25 23:14       ` Martin KaFai Lau
@ 2025-08-26  0:23         ` Kuniyuki Iwashima
  2025-08-26 20:06           ` Martin KaFai Lau
  0 siblings, 1 reply; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-26  0:23 UTC (permalink / raw)
  To: martin.lau
  Cc: almasrymina, andrii, ast, bpf, daniel, davem, edumazet, hannes,
	john.fastabend, kuba, kuni1840, kuniyu, mhocko, ncardwell, netdev,
	pabeni, roman.gushchin, sdf, shakeel.butt, willemb

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Mon, 25 Aug 2025 16:14:35 -0700
> On 8/25/25 11:14 AM, Kuniyuki Iwashima wrote:
> > On Mon, Aug 25, 2025 at 10:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 8/22/25 3:17 PM, Kuniyuki Iwashima wrote:
> >>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> >>> index ae83ecda3983..ab613abdfaa4 100644
> >>> --- a/net/ipv4/af_inet.c
> >>> +++ b/net/ipv4/af_inet.c
> >>> @@ -763,6 +763,8 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
> >>>                kmem_cache_charge(newsk, gfp);
> >>>        }
> >>>
> >>> +     BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
> >>> +
> >>>        if (mem_cgroup_sk_enabled(newsk)) {
> >>>                int amt;
> >>>
> >>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> >>> index 233de8677382..80df246d4741 100644
> >>> --- a/tools/include/uapi/linux/bpf.h
> >>> +++ b/tools/include/uapi/linux/bpf.h
> >>> @@ -1133,6 +1133,7 @@ enum bpf_attach_type {
> >>>        BPF_NETKIT_PEER,
> >>>        BPF_TRACE_KPROBE_SESSION,
> >>>        BPF_TRACE_UPROBE_SESSION,
> >>> +     BPF_CGROUP_INET_SOCK_ACCEPT,
> >>
> >> Instead of adding another hook, can the SK_BPF_MEMCG_SOCK_ISOLATED bit be
> >> inherited from the listener?
> > 
> > Since e876ecc67db80 and d752a4986532c , we defer memcg allocation to
> > accept() because the child socket could be created during irq context with
> > unrelated cgroup.  This had another reason; if the listener was created in the
> > root cgroup and passed to a process under cgroup, child sockets would never
> > have sk_memcg if sk_memcg was inherited.
> > 
> > So, the child's memcg is not always the same one with the listener's, and
> > we cannot rely on the listener's sk_memcg.
> 
> I didn't mean to inherit the entire sk_memcg pointer. I meant to only inherit 
> the SK_BPF_MEMCG_SOCK_ISOLATED bit.

I didn't want to let the flag sit alone on a socket that never goes
through accept() (error-prone, though it works because we always check
mem_cgroup_from_sk() before the bit), and I wanted to check
mem_cgroup_sk_enabled() in setsockopt(), but if we don't care about
that, it is doable with other hooks, e.g. PASSIVE_ESTABLISHED_CB or
bpf_iter.

---8<---
diff --git a/net/core/filter.c b/net/core/filter.c
index a78356682442..9ef458fe706e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5269,7 +5269,7 @@ static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
 
 static int sk_bpf_set_get_memcg_flags(struct sock *sk, int *optval, bool getopt)
 {
-	if (!mem_cgroup_sk_enabled(sk))
+	if (!sk_has_account(sk))
 		return -EOPNOTSUPP;
 
 	if (getopt) {
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e92dfca0a0ff..efae15d04306 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -760,7 +760,10 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
 	if (mem_cgroup_sockets_enabled &&
 	    (!IS_ENABLED(CONFIG_IP_SCTP) ||
 	     sk_is_tcp(newsk) || sk_is_mptcp(newsk))) {
+		unsigned short flags = mem_cgroup_sk_get_flags(newsk);
+
 		mem_cgroup_sk_alloc(newsk);
+		mem_cgroup_sk_set_flags(newsk, flags);
 		kmem_cache_charge(newsk, gfp);
 	}
 
---8<---


> 
> If it can only be done at accept, there is already an existing 
> SEC("lsm_cgroup/socket_accept") hook. Take a look at 
> tools/testing/selftests/bpf/progs/lsm_cgroup.c. The lsm socket_accept doesn't 
> have access to the "newsock->sk" but it should have access to the "sock->sk", do 
> bpf_setsockopt and then inherit by the newsock->sk (?)
> 
> There are already quite enough cgroup-sk style hooks. I would prefer not to add 
> another cgroup attach_type and instead see if some of the existing ones can be 
> reused. There is also SEC("lsm/sock_graft").

We need to do the fixup below, so lsm_cgroup/socket_accept won't work
if we keep the code in __inet_accept().  We could move this after
lsm/sock_graft within __inet_accept().

if (mem_cgroup_sk_isolated(newsk))
	sk_memory_allocated_sub(newsk, amt);

But then we cannot distinguish it from the other hooks (lock_sock() &&
sk->sk_socket != NULL), and ultimately the fixup must be done
dynamically in setsockopt().

I didn't want to place that in setsockopt() because it means caring
about weird scenarios that introduce unnecessary complexity.

e.g. bpf_setsockopt() fixes up once, new data comes in before accept(),
and we need to handle another fixup in accept() or close(); in the
latter case, we need to check the bit alone, even when sk_memcg itself
is NULL.
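
For concreteness, the dynamic fixup I want to avoid would look
something like this in the setsockopt path (hypothetical sketch; the
function is invented):

static int sk_bpf_set_memcg_isolated(struct sock *sk)
{
	int amt = sk_mem_pages(sk->sk_forward_alloc +
			       atomic_read(&sk->sk_rmem_alloc));

	/* Anything already charged to the global counter before the
	 * flag flipped has to be backed out here, and close() would
	 * need a matching check keyed on the bit alone, even when
	 * sk_memcg itself is NULL.
	 */
	if (amt)
		sk_memory_allocated_sub(sk, amt);

	mem_cgroup_sk_set_flags(sk, SK_BPF_MEMCG_SOCK_ISOLATED);

	return 0;
}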

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().
  2025-08-26  0:23         ` Kuniyuki Iwashima
@ 2025-08-26 20:06           ` Martin KaFai Lau
  2025-08-26 21:08             ` Kuniyuki Iwashima
  0 siblings, 1 reply; 20+ messages in thread
From: Martin KaFai Lau @ 2025-08-26 20:06 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: almasrymina, andrii, ast, bpf, daniel, davem, edumazet, hannes,
	john.fastabend, kuba, kuni1840, mhocko, ncardwell, netdev, pabeni,
	roman.gushchin, sdf, shakeel.butt, willemb

On 8/25/25 5:23 PM, Kuniyuki Iwashima wrote:
> From: Martin KaFai Lau <martin.lau@linux.dev>
> Date: Mon, 25 Aug 2025 16:14:35 -0700
>> On 8/25/25 11:14 AM, Kuniyuki Iwashima wrote:
>>> On Mon, Aug 25, 2025 at 10:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 8/22/25 3:17 PM, Kuniyuki Iwashima wrote:
>>>>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>>>>> index ae83ecda3983..ab613abdfaa4 100644
>>>>> --- a/net/ipv4/af_inet.c
>>>>> +++ b/net/ipv4/af_inet.c
>>>>> @@ -763,6 +763,8 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
>>>>>                 kmem_cache_charge(newsk, gfp);
>>>>>         }
>>>>>
>>>>> +     BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
>>>>> +
>>>>>         if (mem_cgroup_sk_enabled(newsk)) {
>>>>>                 int amt;
>>>>>
>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>> index 233de8677382..80df246d4741 100644
>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>> @@ -1133,6 +1133,7 @@ enum bpf_attach_type {
>>>>>         BPF_NETKIT_PEER,
>>>>>         BPF_TRACE_KPROBE_SESSION,
>>>>>         BPF_TRACE_UPROBE_SESSION,
>>>>> +     BPF_CGROUP_INET_SOCK_ACCEPT,
>>>>
>>>> Instead of adding another hook, can the SK_BPF_MEMCG_SOCK_ISOLATED bit be
>>>> inherited from the listener?
>>>
>>> Since e876ecc67db80 and d752a4986532c , we defer memcg allocation to
>>> accept() because the child socket could be created during irq context with
>>> unrelated cgroup.  This had another reason; if the listener was created in the
>>> root cgroup and passed to a process under cgroup, child sockets would never
>>> have sk_memcg if sk_memcg was inherited.
>>>
>>> So, the child's memcg is not always the same one with the listener's, and
>>> we cannot rely on the listener's sk_memcg.
>>
>> I didn't mean to inherit the entire sk_memcg pointer. I meant to only inherit
>> the SK_BPF_MEMCG_SOCK_ISOLATED bit.
> 
> I didn't want to let the flag remain alone without accept() (error-prone
> but works because we always check mem_cgroup_from_sk() before the bit)
> and wanted to check mem_cgroup_sk_enabled() in setsockopt(), but if we
> don't care, it will be doable with other hooks, PASSIVE_ESTABLISHED_CB
> or bpf_iter etc.

I think this could be a surprise to the user. imo, it is an
implementation detail that a bit of a pointer is used for the
setsockopt purpose, and a reasonable one for perf reasons, but it does
not necessarily need to affect what the user can expect from
setsockopt on a listener. From the user's pov, isn't the usual
expectation that what is set with setsockopt on the listener gets
copied to the child? Besides, the listener and the acceptor being in
different processes is one of the use cases, but not the only one.

> 
> ---8<---
> diff --git a/net/core/filter.c b/net/core/filter.c
> index a78356682442..9ef458fe706e 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5269,7 +5269,7 @@ static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
>   
>   static int sk_bpf_set_get_memcg_flags(struct sock *sk, int *optval, bool getopt)
>   {
> -	if (!mem_cgroup_sk_enabled(sk))
> +	if (!sk_has_account(sk))
>   		return -EOPNOTSUPP;
>   
>   	if (getopt) {
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index e92dfca0a0ff..efae15d04306 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -760,7 +760,10 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
>   	if (mem_cgroup_sockets_enabled &&
>   	    (!IS_ENABLED(CONFIG_IP_SCTP) ||
>   	     sk_is_tcp(newsk) || sk_is_mptcp(newsk))) {
> +		unsigned short flags = mem_cgroup_sk_get_flags(newsk);
> +
>   		mem_cgroup_sk_alloc(newsk);
> +		mem_cgroup_sk_set_flags(newsk, flags);
>   		kmem_cache_charge(newsk, gfp);
>   	}
>   
> ---8<---
> 
> 
>>
>> If it can only be done at accept, there is already an existing
>> SEC("lsm_cgroup/socket_accept") hook. Take a look at
>> tools/testing/selftests/bpf/progs/lsm_cgroup.c. The lsm socket_accept doesn't
>> have access to the "newsock->sk" but it should have access to the "sock->sk", do
>> bpf_setsockopt and then inherit by the newsock->sk (?)
>>
>> There are already quite enough cgroup-sk style hooks. I would prefer not to add
>> another cgroup attach_type and instead see if some of the existing ones can be
>> reused. There is also SEC("lsm/sock_graft").
> 
> We need to do fixup below, so lsm_cgroup/socket_accept won't work
> if we keep the code in __inet_accept().  We can move this after
> lsm/sock_graft within __inet_accept().
> 
> if (mem_cgroup_sk_isolated(newsk))
> 	sk_memory_allocated_sub(newsk, amt);

If I read it correctly, lsm_cgroup/sock_graft should work, but it
would need to do the above sk_memory_allocated_sub() after the
sock_graft and ...

> But then, we cannot distinguish it with other hooks (lock_sock() &&
> sk->sk_socket != NULL), and finally the fixup must be done dynamically
> in setsockopt().

... it would need a way to disallow the SK_BPF_MEMCG_SOCK_ISOLATED bit
being changed once the socket fd is visible to the user. The current
approach is to use the observation about owned_by_user and
sk->sk_socket in the create and accept hooks. [ unrelated, I am not
sure about the owned_by_user check considering sol_socket_sockopt can
be called from bh ].

If it is needed, there are other ways to stop the
SK_BPF_MEMCG_SOCK_ISOLATED bit from being changed again once the fd is
visible to the user, e.g. there are still bits left in the sk_memcg
pointer that could be used to freeze it at runtime. If doing it
statically (i.e. at prog load time), the func proto lookup can
probably return a different setsockopt_proto that understands
SK_BPF_MEMCG_FLAGS.
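
A sketch of the runtime variant, assuming the flag bits live in the
low bits of sk->sk_memcg as in this series (the FROZEN bit and the
helper name are made up):

#define SK_MEMCG_SOCK_ISOLATED	1UL
#define SK_MEMCG_FROZEN		2UL	/* hypothetical: set once the fd is published */
#define SK_MEMCG_FLAG_MASK	3UL

static int mem_cgroup_sk_try_set_flags(struct sock *sk, unsigned long flags)
{
	unsigned long val = (unsigned long)sk->sk_memcg;

	if (val & SK_MEMCG_FROZEN)
		return -EBUSY;	/* already visible to userspace */

	val = (val & ~SK_MEMCG_FLAG_MASK) | (flags & SK_MEMCG_FLAG_MASK);
	sk->sk_memcg = (struct mem_cgroup *)val;

	return 0;
}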

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().
  2025-08-26 20:06           ` Martin KaFai Lau
@ 2025-08-26 21:08             ` Kuniyuki Iwashima
  2025-08-26 22:02               ` Martin KaFai Lau
  0 siblings, 1 reply; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-26 21:08 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: almasrymina, andrii, ast, bpf, daniel, davem, edumazet, hannes,
	john.fastabend, kuba, kuni1840, mhocko, ncardwell, netdev, pabeni,
	roman.gushchin, sdf, shakeel.butt, willemb

On Tue, Aug 26, 2025 at 1:07 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 8/25/25 5:23 PM, Kuniyuki Iwashima wrote:
> > From: Martin KaFai Lau <martin.lau@linux.dev>
> > Date: Mon, 25 Aug 2025 16:14:35 -0700
> >> On 8/25/25 11:14 AM, Kuniyuki Iwashima wrote:
> >>> On Mon, Aug 25, 2025 at 10:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>
> >>>> On 8/22/25 3:17 PM, Kuniyuki Iwashima wrote:
> >>>>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> >>>>> index ae83ecda3983..ab613abdfaa4 100644
> >>>>> --- a/net/ipv4/af_inet.c
> >>>>> +++ b/net/ipv4/af_inet.c
> >>>>> @@ -763,6 +763,8 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
> >>>>>                 kmem_cache_charge(newsk, gfp);
> >>>>>         }
> >>>>>
> >>>>> +     BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
> >>>>> +
> >>>>>         if (mem_cgroup_sk_enabled(newsk)) {
> >>>>>                 int amt;
> >>>>>
> >>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> >>>>> index 233de8677382..80df246d4741 100644
> >>>>> --- a/tools/include/uapi/linux/bpf.h
> >>>>> +++ b/tools/include/uapi/linux/bpf.h
> >>>>> @@ -1133,6 +1133,7 @@ enum bpf_attach_type {
> >>>>>         BPF_NETKIT_PEER,
> >>>>>         BPF_TRACE_KPROBE_SESSION,
> >>>>>         BPF_TRACE_UPROBE_SESSION,
> >>>>> +     BPF_CGROUP_INET_SOCK_ACCEPT,
> >>>>
> >>>> Instead of adding another hook, can the SK_BPF_MEMCG_SOCK_ISOLATED bit be
> >>>> inherited from the listener?
> >>>
> >>> Since e876ecc67db80 and d752a4986532c , we defer memcg allocation to
> >>> accept() because the child socket could be created during irq context with
> >>> unrelated cgroup.  This had another reason; if the listener was created in the
> >>> root cgroup and passed to a process under cgroup, child sockets would never
> >>> have sk_memcg if sk_memcg was inherited.
> >>>
> >>> So, the child's memcg is not always the same one with the listener's, and
> >>> we cannot rely on the listener's sk_memcg.
> >>
> >> I didn't mean to inherit the entire sk_memcg pointer. I meant to only inherit
> >> the SK_BPF_MEMCG_SOCK_ISOLATED bit.
> >
> > I didn't want to let the flag remain alone without accept() (error-prone
> > but works because we always check mem_cgroup_from_sk() before the bit)
> > and wanted to check mem_cgroup_sk_enabled() in setsockopt(), but if we
> > don't care, it will be doable with other hooks, PASSIVE_ESTABLISHED_CB
> > or bpf_iter etc.
>
> I think this could be a surprise to the user. imo, this is the implementation
> details that a bit of a pointer is used for the setsockopt purpose and a right
> one for perf reason. It does not necessary need to affect what the user can
> expect from setsockopt in listener. From the user pov, what the user can usually
> expect from setsockopt in the listener and gets copied to child?

Ah, somehow I was fixated on allowing the flag to be set before
sk_memcg is set and didn't think of inheriting the flag from the
listener.

Inheriting the flag from the listener and only allowing setsockopt()
at socket() time would be the simplest way.
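
i.e. something like this in __inet_accept() (sketch; reading the
listener's flags via sock->sk is an assumption, and elided context is
marked with ...):

	if (mem_cgroup_sockets_enabled && ...) {
		/* inherit only the flag bits from the listener */
		unsigned short flags = mem_cgroup_sk_get_flags(sock->sk);

		mem_cgroup_sk_alloc(newsk);
		mem_cgroup_sk_set_flags(newsk, flags);
		kmem_cache_charge(newsk, gfp);
	}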


> Beside, the
> listener and the accept-or on different processes is one of the use case but not
> the only use case.
>
> >
> > ---8<---
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index a78356682442..9ef458fe706e 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5269,7 +5269,7 @@ static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
> >
> >   static int sk_bpf_set_get_memcg_flags(struct sock *sk, int *optval, bool getopt)
> >   {
> > -     if (!mem_cgroup_sk_enabled(sk))
> > +     if (!sk_has_account(sk))
> >               return -EOPNOTSUPP;
> >
> >       if (getopt) {
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index e92dfca0a0ff..efae15d04306 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -760,7 +760,10 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
> >       if (mem_cgroup_sockets_enabled &&
> >           (!IS_ENABLED(CONFIG_IP_SCTP) ||
> >            sk_is_tcp(newsk) || sk_is_mptcp(newsk))) {
> > +             unsigned short flags = mem_cgroup_sk_get_flags(newsk);
> > +
> >               mem_cgroup_sk_alloc(newsk);
> > +             mem_cgroup_sk_set_flags(newsk, flags);
> >               kmem_cache_charge(newsk, gfp);
> >       }
> >
> > ---8<---
> >
> >
> >>
> >> If it can only be done at accept, there is already an existing
> >> SEC("lsm_cgroup/socket_accept") hook. Take a look at
> >> tools/testing/selftests/bpf/progs/lsm_cgroup.c. The lsm socket_accept doesn't
> >> have access to the "newsock->sk" but it should have access to the "sock->sk", do
> >> bpf_setsockopt and then inherit by the newsock->sk (?)
> >>
> >> There are already quite enough cgroup-sk style hooks. I would prefer not to add
> >> another cgroup attach_type and instead see if some of the existing ones can be
> >> reused. There is also SEC("lsm/sock_graft").
> >
> > We need to do fixup below, so lsm_cgroup/socket_accept won't work
> > if we keep the code in __inet_accept().  We can move this after
> > lsm/sock_graft within __inet_accept().
> >
> > if (mem_cgroup_sk_isolated(newsk))
> >       sk_memory_allocated_sub(newsk, amt);
>
> If I read it correctly, lsm_cgroup/sock_graft should work but need to do the
> above sk_memory_allocated_sub() after the sock_graft and ...
>   >
> > But then, we cannot distinguish it with other hooks (lock_sock() &&
> > sk->sk_socket != NULL), and finally the fixup must be done dynamically
> > in setsockopt().
>
> ... need a way to disallow this SK_BPF_MEMCG_SOCK_ISOLATED bit being changed
> once the socket fd is visible to the user. The current approach is to use the
> observation in the owned_by_user and sk->sk_socket in the create and accept
> hook. [ unrelated, I am not sure about the owned_by_user check considering
> sol_socket_sockopt can be called from bh ].

[ my expectation was that bh checks sock_owned_by_user() before
  processing packets and before entering anywhere bpf_setsockopt()
  can be called ]

>
> If it is needed, there are other ways to stop the SK_BPF_MEMCG_SOCK_ISOLATED
> being changed again once the fd is visible to user. e.g. there are still bits
> left in the sk_memcg pointer to freeze it at runtime. If doing it statically
> (i.e. at prog load time), it can probably return a different setsockopt_proto
> that can understand the SK_BPF_MEMCG_FLAGS.

I was thinking a kind of the latter, passing caller info to general
__bpf_setsockopt(), and gave up as it was ugly, but wrapping it
as different setsockopt_proto sounds good.

Then, we don't need to care about how to limit the caller context.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().
  2025-08-26 21:08             ` Kuniyuki Iwashima
@ 2025-08-26 22:02               ` Martin KaFai Lau
  2025-08-26 23:10                 ` Kuniyuki Iwashima
  0 siblings, 1 reply; 20+ messages in thread
From: Martin KaFai Lau @ 2025-08-26 22:02 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: almasrymina, andrii, ast, bpf, daniel, davem, edumazet, hannes,
	john.fastabend, kuba, kuni1840, mhocko, ncardwell, netdev, pabeni,
	roman.gushchin, sdf, shakeel.butt, willemb

On 8/26/25 2:08 PM, Kuniyuki Iwashima wrote:
>> ... need a way to disallow this SK_BPF_MEMCG_SOCK_ISOLATED bit being changed
>> once the socket fd is visible to the user. The current approach is to use the
>> observation in the owned_by_user and sk->sk_socket in the create and accept
>> hook. [ unrelated, I am not sure about the owned_by_user check considering
>> sol_socket_sockopt can be called from bh ].
> 
> [ my expectation was bh checks sock_owned_by_user() before
>    processing packets and entering where bpf_setsockopt() can
>    be called ]

hmm... so if a bpf prog is run in bh, owned_by_user should be false
and the bh bpf prog can continue to do
bpf_setsockopt(SK_BPF_MEMCG_FLAGS). I was looking at this comment in
v1 and v2: "Don't allow once sk has been published to userspace."
Regardless, it seems that v3 allows other bpf hooks to do
bpf_setsockopt(SK_BPF_MEMCG_FLAGS)? If so, this point may no longer be
relevant.

> 
>>
>> If it is needed, there are other ways to stop the SK_BPF_MEMCG_SOCK_ISOLATED
>> being changed again once the fd is visible to user. e.g. there are still bits
>> left in the sk_memcg pointer to freeze it at runtime. If doing it statically
>> (i.e. at prog load time), it can probably return a different setsockopt_proto
>> that can understand the SK_BPF_MEMCG_FLAGS.
> 
> I was thinking a kind of the latter, passing caller info to general
> __bpf_setsockopt(), and gave up as it was ugly, but wrapping it
> as different setsockopt_proto sounds good.
> 
> Then, we don't need to care about how to limit the caller context.
Passing caller info to a helper is doable, but it is not pretty and
needs verifier changes (a kfunc would be easier), so I wouldn't go
there either. fyi only, take a look at bpf_timer_set_callback.
However, all five args of bpf_setsockopt are used, so the same trick
won't work.

Right, wrapping it into a different setsockopt_proto is an option;
admittedly it is not pretty either, but it should work. Let's not go
there yet. It seems the v3 design has changed a bit; let's discuss and
explore there.
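
For reference, if it ever becomes necessary, the wrapped proto could
mirror the protos quoted earlier in the thread (sketch; the names are
invented):

BPF_CALL_5(bpf_sock_create_setsockopt, struct sock *, sk, int, level,
	   int, optname, char *, optval, int, optlen)
{
	/* Only progs attached where the flag may still change would
	 * get this proto, so SK_BPF_MEMCG_FLAGS is handled here and
	 * everything else falls through to the generic path.
	 */
	if (level == SOL_SOCKET && optname == SK_BPF_MEMCG_FLAGS) {
		if (optlen != sizeof(int))
			return -EINVAL;
		return sk_bpf_set_get_memcg_flags(sk, (int *)optval, false);
	}

	return _bpf_setsockopt(sk, level, optname, optval, optlen);
}

static const struct bpf_func_proto bpf_sock_create_setsockopt_proto = {
	.func		= bpf_sock_create_setsockopt,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_CTX,
	.arg2_type	= ARG_ANYTHING,
	.arg3_type	= ARG_ANYTHING,
	.arg4_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
	.arg5_type	= ARG_CONST_SIZE,
};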

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().
  2025-08-26 22:02               ` Martin KaFai Lau
@ 2025-08-26 23:10                 ` Kuniyuki Iwashima
  0 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-26 23:10 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: almasrymina, andrii, ast, bpf, daniel, davem, edumazet, hannes,
	john.fastabend, kuba, kuni1840, mhocko, ncardwell, netdev, pabeni,
	roman.gushchin, sdf, shakeel.butt, willemb

On Tue, Aug 26, 2025 at 3:02 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 8/26/25 2:08 PM, Kuniyuki Iwashima wrote:
> >> ... need a way to disallow this SK_BPF_MEMCG_SOCK_ISOLATED bit being changed
> >> once the socket fd is visible to the user. The current approach is to use the
> >> observation in the owned_by_user and sk->sk_socket in the create and accept
> >> hook. [ unrelated, I am not sure about the owned_by_user check considering
> >> sol_socket_sockopt can be called from bh ].
> >
> > [ my expectation was bh checks sock_owned_by_user() before
> >    processing packets and entering where bpf_setsockopt() can
> >    be called ]
>
> hmm... so if a bpf prog is run in bh, owned_by_user should be false and the bh
> bpf prog can continue to do the bpf_setsockopt(SK_BPF_MEMCG_FLAGS). I was
> looking at this comment in v1 and v2, "Don't allow once sk has been published to
> userspace.". Regardless, it seems that v3 allows other bpf hooks to do the
> bpf_setsockopt(SK_BPF_MEMCG_FLAGS)?, so not sure if this point is still relevant.

In v3, it's subtle to limit the hooks that see sk->sk_memcg to
unlocked hooks, i.e. socket(2); if there is an unlocked place with a
non-NULL sk_memcg in _bh context, we will still need to use a
dedicated setsockopt_proto.

sk_clone_lock() and reuseport_migrate_sock() in inet_csk_listen_stop()
are the only places where we don't check sock_owned_by_user().

sk_clone_lock()'s path is fine as sk_memcg is NULL until accept(), and
sk_reuseport_func_proto() doesn't allow setsockopt() for now (though
that is error-prone against future changes), but I may be missing
something.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread (newest: 2025-08-26 23:10 UTC)

Thread overview: 20+ messages
2025-08-22 22:17 [PATCH v1 bpf-next/net 0/8] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
2025-08-22 22:17 ` [PATCH v1 bpf-next/net 1/8] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
2025-08-22 22:17 ` [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept() Kuniyuki Iwashima
2025-08-23 11:02   ` kernel test robot
2025-08-25 17:57   ` Martin KaFai Lau
2025-08-25 18:14     ` Kuniyuki Iwashima
2025-08-25 23:14       ` Martin KaFai Lau
2025-08-26  0:23         ` Kuniyuki Iwashima
2025-08-26 20:06           ` Martin KaFai Lau
2025-08-26 21:08             ` Kuniyuki Iwashima
2025-08-26 22:02               ` Martin KaFai Lau
2025-08-26 23:10                 ` Kuniyuki Iwashima
2025-08-22 22:17 ` [PATCH v1 bpf-next/net 3/8] libbpf: Support BPF_CGROUP_INET_SOCK_ACCEPT Kuniyuki Iwashima
2025-08-22 22:17 ` [PATCH v1 bpf-next/net 4/8] bpftool: " Kuniyuki Iwashima
2025-08-22 22:18 ` [PATCH v1 bpf-next/net 5/8] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_(CREATE|ACCEPT) Kuniyuki Iwashima
2025-08-23 23:58   ` kernel test robot
2025-08-22 22:18 ` [PATCH v1 bpf-next/net 6/8] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_SOCK_ISOLATED Kuniyuki Iwashima
2025-08-23 15:38   ` kernel test robot
2025-08-22 22:18 ` [PATCH v1 bpf-next/net 7/8] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
2025-08-22 22:18 ` [PATCH v1 bpf-next/net 8/8] selftest: bpf: Add test for SK_BPF_MEMCG_SOCK_ISOLATED Kuniyuki Iwashima
