* [PATCH v8 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated.
@ 2025-09-10 19:19 Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
` (5 more replies)
0 siblings, 6 replies; 10+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-10 19:19 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau
Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
Some protocols (e.g., TCP, UDP) have their own memory accounting for
socket buffers and charge memory to global per-protocol counters such
as /proc/net/ipv4/tcp_mem.
If the socket has sk->sk_memcg, this memory is also charged to the memcg
as sock in memory.stat.
We do not need to pay costs for two orthogonal memory accounting
mechanisms.
This series allows decoupling memcg from the global memory accounting
(memcg + tcp_mem -> memcg) if socket is configured as such by sysctl
or BPF prog.
Overview of the series:
patch 1 & 2 prepares for decoupling memcg from sk_prot->memory_allocated
based on the SK_MEMCG_EXCLUSIVE flag
patch 3 introduces net.core.memcg_exclusive
patch 4 & 5 supports flagging SK_MEMCG_EXCLUSIVE via bpf_setsockopt()
patch 6 is selftest
Changes:
v8:
* Patch 3: Fix build failure when CONFIG_NET=n
v7: https://lore.kernel.org/netdev/20250909204632.3994767-1-kuniyu@google.com/
* Rename s/ISOLATED/EXCLUSIVE/
* Add patch 3 (net.core.memcg_exclusive sysctl)
* Reorder the core patch 2 before sysctl + bpf changes
* Patch 6
* Add test for sysctl
v6: https://lore.kernel.org/netdev/20250908223750.3375376-1-kuniyu@google.com/
* Patch 4
* Update commit message
* Patch 5
* Trace sk_prot->memory_allocated + sk_prot->memory_per_cpu_fw_alloc
v5: https://lore.kernel.org/netdev/20250903190238.2511885-1-kuniyu@google.com/
* Patch 2
* Rename new variants to bpf_sock_create_{get,set}sockopt()
* Patch 3
* Limit getsockopt() to BPF_CGROUP_INET_SOCK_CREATE
* Patch 5
* Use kern_sync_rcu()
* Double NR_SEND to 128
v4: https://lore.kernel.org/netdev/20250829010026.347440-1-kuniyu@google.com/
* Patch 2
* Use __bpf_setsockopt() instead of _bpf_setsockopt()
* Add getsockopt() for a cgroup with multiple bpf progs running
* Patch 3
* Only allow inet_create() to set flags
* Inherit flags from listener to child in sk_clone_lock()
* Support clearing flags
* Patch 5
* Only use inet_create() hook
* Test bpf_getsockopt()
* Add serial_ prefix
* Reduce sleep() and the amount of sent data
v3: https://lore.kernel.org/netdev/20250826183940.3310118-1-kuniyu@google.com/
* Drop patches for accept() hook
* Patch 1
* Merge if blocks
* Patch2
* Drop bpf_func_proto for accept()
* Patch 3
* Allow flagging without sk->sk_memcg
* Inherit SK_BPF_MEMCG_SOCK_ISOLATED in __inet_accept()
v2: https://lore.kernel.org/bpf/20250825204158.2414402-1-kuniyu@google.com/
* Patch 2
* Define BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT() when CONFIG_CGROUP_BPF=n
* Patch 5
* Make 2 new bpf_func_proto static
* Patch 6
* s/mem_cgroup_sk_set_flag/mem_cgroup_sk_set_flags/ when CONFIG_MEMCG=n
* Use finer CONFIG_CGROUP_BPF instead of CONFIG_BPF_SYSCALL for ifdef
v1: https://lore.kernel.org/netdev/20250822221846.744252-1-kuniyu@google.com/
Kuniyuki Iwashima (6):
tcp: Save lock_sock() for memcg in inet_csk_accept().
net-memcg: Allow decoupling memcg from global protocol memory
accounting.
net-memcg: Introduce net.core.memcg_exclusive sysctl.
bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE.
selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.
Documentation/admin-guide/sysctl/net.rst | 9 +
include/net/netns/core.h | 3 +
include/net/proto_memory.h | 15 +-
include/net/sock.h | 47 +++-
include/net/tcp.h | 10 +-
include/uapi/linux/bpf.h | 6 +
mm/memcontrol.c | 15 +-
net/core/filter.c | 82 ++++++
net/core/sock.c | 65 +++--
net/core/sysctl_net_core.c | 11 +
net/ipv4/af_inet.c | 37 +++
net/ipv4/inet_connection_sock.c | 26 +-
net/ipv4/tcp.c | 3 +-
net/ipv4/tcp_output.c | 10 +-
net/mptcp/protocol.c | 3 +-
net/tls/tls_device.c | 4 +-
tools/include/uapi/linux/bpf.h | 6 +
.../selftests/bpf/prog_tests/sk_memcg.c | 261 ++++++++++++++++++
tools/testing/selftests/bpf/progs/sk_memcg.c | 146 ++++++++++
19 files changed, 701 insertions(+), 58 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_memcg.c
create mode 100644 tools/testing/selftests/bpf/progs/sk_memcg.c
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept().
2025-09-10 19:19 [PATCH v8 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
@ 2025-09-10 19:19 ` Kuniyuki Iwashima
2025-09-17 6:37 ` kernel test robot
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
` (4 subsequent siblings)
5 siblings, 1 reply; 10+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-10 19:19 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau
Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
If memcg is enabled, accept() acquires lock_sock() twice for each new
TCP/MPTCP socket in inet_csk_accept() and __inet_accept().
Let's move memcg operations from inet_csk_accept() to __inet_accept().
Note that SCTP somehow allocates a new socket by sk_alloc() in
sk->sk_prot->accept() and clones fields manually, instead of using
sk_clone_lock().
mem_cgroup_sk_alloc() is called for SCTP before __inet_accept(),
so I added the protocol check in __inet_accept(), but this can be
removed once SCTP uses sk_clone_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
---
v3: Don't split if blocks
---
net/ipv4/af_inet.c | 23 +++++++++++++++++++++++
net/ipv4/inet_connection_sock.c | 25 -------------------------
2 files changed, 23 insertions(+), 25 deletions(-)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 76e38092cd8a..d42757f74c6e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -753,6 +753,29 @@ EXPORT_SYMBOL(inet_stream_connect);
void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *newsk)
{
+ /* TODO: use sk_clone_lock() in SCTP and remove protocol checks */
+ if (mem_cgroup_sockets_enabled &&
+ (!IS_ENABLED(CONFIG_IP_SCTP) ||
+ sk_is_tcp(newsk) || sk_is_mptcp(newsk))) {
+ gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
+
+ mem_cgroup_sk_alloc(newsk);
+
+ if (mem_cgroup_from_sk(newsk)) {
+ int amt;
+
+ /* The socket has not been accepted yet, no need
+ * to look at newsk->sk_wmem_queued.
+ */
+ amt = sk_mem_pages(newsk->sk_forward_alloc +
+ atomic_read(&newsk->sk_rmem_alloc));
+ if (amt)
+ mem_cgroup_sk_charge(newsk, amt, gfp);
+ }
+
+ kmem_cache_charge(newsk, gfp);
+ }
+
sock_rps_record_flow(newsk);
WARN_ON(!((1 << newsk->sk_state) &
(TCPF_ESTABLISHED | TCPF_SYN_RECV |
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 0ef1eacd539d..ed10b959a906 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -708,31 +708,6 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
release_sock(sk);
- if (mem_cgroup_sockets_enabled) {
- gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
- int amt = 0;
-
- /* atomically get the memory usage, set and charge the
- * newsk->sk_memcg.
- */
- lock_sock(newsk);
-
- mem_cgroup_sk_alloc(newsk);
- if (mem_cgroup_from_sk(newsk)) {
- /* The socket has not been accepted yet, no need
- * to look at newsk->sk_wmem_queued.
- */
- amt = sk_mem_pages(newsk->sk_forward_alloc +
- atomic_read(&newsk->sk_rmem_alloc));
- }
-
- if (amt)
- mem_cgroup_sk_charge(newsk, amt, gfp);
- kmem_cache_charge(newsk, gfp);
-
- release_sock(newsk);
- }
-
if (req)
reqsk_put(req);
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v8 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-09-10 19:19 [PATCH v8 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
@ 2025-09-10 19:19 ` Kuniyuki Iwashima
2025-09-17 19:18 ` Martin KaFai Lau
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
` (3 subsequent siblings)
5 siblings, 1 reply; 10+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-10 19:19 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau
Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_proto->memory_allocated.
If a socket has sk->sk_memcg, this memory is also charged to memcg as
"sock" in memory.stat.
We do not need to pay costs for two orthogonal memory accounting
mechanisms. A microbenchmark result is in the subsequent bpf patch.
Let's decouple sockets under memcg from the global per-protocol memory
accounting if mem_cgroup_sk_exclusive() returns true.
Note that this does NOT disable memcg, but rather the per-protocol one.
mem_cgroup_sk_exclusive() starts to return true in the following patches,
and then, the per-protocol memory accounting will be skipped.
In __inet_accept(), we need to reclaim counts that are already charged
for child sockets because we do not allocate sk->sk_memcg until accept().
trace_sock_exceed_buf_limit() will always show 0 as accounted for the
memcg-exclusive sockets, but this can be obtained in memory.stat.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Nacked-by: Johannes Weiner <hannes@cmpxchg.org>
---
v7: Reorder before sysctl & bpf patches
v6: Update commit message
---
include/net/proto_memory.h | 15 ++++++--
include/net/sock.h | 10 ++++++
include/net/tcp.h | 10 ++++--
net/core/sock.c | 64 ++++++++++++++++++++++-----------
net/ipv4/af_inet.c | 12 ++++++-
net/ipv4/inet_connection_sock.c | 1 +
net/ipv4/tcp.c | 3 +-
net/ipv4/tcp_output.c | 10 ++++--
net/mptcp/protocol.c | 3 +-
net/tls/tls_device.c | 4 ++-
10 files changed, 100 insertions(+), 32 deletions(-)
diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 72d4ec413ab5..4383cb4cb2d2 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,13 +31,22 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
if (!sk->sk_prot->memory_pressure)
return false;
- if (mem_cgroup_sk_enabled(sk) &&
- mem_cgroup_sk_under_memory_pressure(sk))
- return true;
+ if (mem_cgroup_sk_enabled(sk)) {
+ if (mem_cgroup_sk_under_memory_pressure(sk))
+ return true;
+
+ if (mem_cgroup_sk_exclusive(sk))
+ return false;
+ }
return !!READ_ONCE(*sk->sk_prot->memory_pressure);
}
+static inline bool sk_should_enter_memory_pressure(struct sock *sk)
+{
+ return !mem_cgroup_sk_enabled(sk) || !mem_cgroup_sk_exclusive(sk);
+}
+
static inline long
proto_memory_allocated(const struct proto *prot)
{
diff --git a/include/net/sock.h b/include/net/sock.h
index 63a6a48afb48..66501ab670eb 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2607,6 +2607,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
}
+static inline bool mem_cgroup_sk_exclusive(const struct sock *sk)
+{
+ return false;
+}
+
static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
{
struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
@@ -2634,6 +2639,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
return false;
}
+static inline bool mem_cgroup_sk_exclusive(const struct sock *sk)
+{
+ return false;
+}
+
static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
{
return false;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2936b8175950..225f6bac06c3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,9 +275,13 @@ extern unsigned long tcp_memory_pressure;
/* optimized version of sk_under_memory_pressure() for TCP sockets */
static inline bool tcp_under_memory_pressure(const struct sock *sk)
{
- if (mem_cgroup_sk_enabled(sk) &&
- mem_cgroup_sk_under_memory_pressure(sk))
- return true;
+ if (mem_cgroup_sk_enabled(sk)) {
+ if (mem_cgroup_sk_under_memory_pressure(sk))
+ return true;
+
+ if (mem_cgroup_sk_exclusive(sk))
+ return false;
+ }
return READ_ONCE(tcp_memory_pressure);
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 8002ac6293dc..814966309b0e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1046,17 +1046,21 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
if (!charged)
return -ENOMEM;
- /* pre-charge to forward_alloc */
- sk_memory_allocated_add(sk, pages);
- allocated = sk_memory_allocated(sk);
- /* If the system goes into memory pressure with this
- * precharge, give up and return error.
- */
- if (allocated > sk_prot_mem_limits(sk, 1)) {
- sk_memory_allocated_sub(sk, pages);
- mem_cgroup_sk_uncharge(sk, pages);
- return -ENOMEM;
+ if (!mem_cgroup_sk_exclusive(sk)) {
+ /* pre-charge to forward_alloc */
+ sk_memory_allocated_add(sk, pages);
+ allocated = sk_memory_allocated(sk);
+
+ /* If the system goes into memory pressure with this
+ * precharge, give up and return error.
+ */
+ if (allocated > sk_prot_mem_limits(sk, 1)) {
+ sk_memory_allocated_sub(sk, pages);
+ mem_cgroup_sk_uncharge(sk, pages);
+ return -ENOMEM;
+ }
}
+
sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
WRITE_ONCE(sk->sk_reserved_mem,
@@ -3153,8 +3157,11 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
return true;
- sk_enter_memory_pressure(sk);
+ if (sk_should_enter_memory_pressure(sk))
+ sk_enter_memory_pressure(sk);
+
sk_stream_moderate_sndbuf(sk);
+
return false;
}
EXPORT_SYMBOL(sk_page_frag_refill);
@@ -3267,18 +3274,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
{
bool memcg_enabled = false, charged = false;
struct proto *prot = sk->sk_prot;
- long allocated;
-
- sk_memory_allocated_add(sk, amt);
- allocated = sk_memory_allocated(sk);
+ long allocated = 0;
if (mem_cgroup_sk_enabled(sk)) {
+ bool exclusive = mem_cgroup_sk_exclusive(sk);
+
memcg_enabled = true;
charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
- if (!charged)
+
+ if (exclusive && charged)
+ return 1;
+
+ if (!charged) {
+ if (!exclusive) {
+ sk_memory_allocated_add(sk, amt);
+ allocated = sk_memory_allocated(sk);
+ }
+
goto suppress_allocation;
+ }
}
+ sk_memory_allocated_add(sk, amt);
+ allocated = sk_memory_allocated(sk);
+
/* Under limit. */
if (allocated <= sk_prot_mem_limits(sk, 0)) {
sk_leave_memory_pressure(sk);
@@ -3357,7 +3376,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
- sk_memory_allocated_sub(sk, amt);
+ if (allocated)
+ sk_memory_allocated_sub(sk, amt);
if (charged)
mem_cgroup_sk_uncharge(sk, amt);
@@ -3396,11 +3416,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
*/
void __sk_mem_reduce_allocated(struct sock *sk, int amount)
{
- sk_memory_allocated_sub(sk, amount);
-
- if (mem_cgroup_sk_enabled(sk))
+ if (mem_cgroup_sk_enabled(sk)) {
mem_cgroup_sk_uncharge(sk, amount);
+ if (mem_cgroup_sk_exclusive(sk))
+ return;
+ }
+
+ sk_memory_allocated_sub(sk, amount);
+
if (sk_under_global_memory_pressure(sk) &&
(sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index d42757f74c6e..52d060bc9009 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -95,6 +95,7 @@
#include <net/checksum.h>
#include <net/ip.h>
#include <net/protocol.h>
+#include <net/proto_memory.h>
#include <net/arp.h>
#include <net/route.h>
#include <net/ip_fib.h>
@@ -769,8 +770,17 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
*/
amt = sk_mem_pages(newsk->sk_forward_alloc +
atomic_read(&newsk->sk_rmem_alloc));
- if (amt)
+ if (amt) {
+ /* This amt is already charged globally to
+ * sk_prot->memory_allocated due to lack of
+ * sk_memcg until accept(), thus we need to
+ * reclaim it here if newsk is isolated.
+ */
+ if (mem_cgroup_sk_exclusive(newsk))
+ sk_memory_allocated_sub(newsk, amt);
+
mem_cgroup_sk_charge(newsk, amt, gfp);
+ }
}
kmem_cache_charge(newsk, gfp);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index ed10b959a906..f8dd53d40dcf 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -22,6 +22,7 @@
#include <net/tcp.h>
#include <net/sock_reuseport.h>
#include <net/addrconf.h>
+#include <net/proto_memory.h>
#if IS_ENABLED(CONFIG_IPV6)
/* match_sk*_wildcard == true: IPV6_ADDR_ANY equals to any IPv6 addresses
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 71a956fbfc55..dcbd49e2f8af 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -908,7 +908,8 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
}
__kfree_skb(skb);
} else {
- sk->sk_prot->enter_memory_pressure(sk);
+ if (sk_should_enter_memory_pressure(sk))
+ tcp_enter_memory_pressure(sk);
sk_stream_moderate_sndbuf(sk);
}
return NULL;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dfbac0876d96..4b6a7250a9c2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3574,12 +3574,18 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
delta = size - sk->sk_forward_alloc;
if (delta <= 0)
return;
+
amt = sk_mem_pages(delta);
sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
- sk_memory_allocated_add(sk, amt);
- if (mem_cgroup_sk_enabled(sk))
+ if (mem_cgroup_sk_enabled(sk)) {
mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() | __GFP_NOFAIL);
+
+ if (mem_cgroup_sk_exclusive(sk))
+ return;
+ }
+
+ sk_memory_allocated_add(sk, amt);
}
/* Send a FIN. The caller locks the socket for us.
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9a287b75c1b3..f7487e22a3f8 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -16,6 +16,7 @@
#include <net/inet_common.h>
#include <net/inet_hashtables.h>
#include <net/protocol.h>
+#include <net/proto_memory.h>
#include <net/tcp_states.h>
#if IS_ENABLED(CONFIG_MPTCP_IPV6)
#include <net/transp_v6.h>
@@ -1016,7 +1017,7 @@ static void mptcp_enter_memory_pressure(struct sock *sk)
mptcp_for_each_subflow(msk, subflow) {
struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
- if (first)
+ if (first && sk_should_enter_memory_pressure(ssk))
tcp_enter_memory_pressure(ssk);
sk_stream_moderate_sndbuf(ssk);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index f672a62a9a52..6696ef837116 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -35,6 +35,7 @@
#include <linux/netdevice.h>
#include <net/dst.h>
#include <net/inet_connection_sock.h>
+#include <net/proto_memory.h>
#include <net/tcp.h>
#include <net/tls.h>
#include <linux/skbuff_ref.h>
@@ -371,7 +372,8 @@ static int tls_do_allocation(struct sock *sk,
if (!offload_ctx->open_record) {
if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
sk->sk_allocation))) {
- READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
+ if (sk_should_enter_memory_pressure(sk))
+ READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
sk_stream_moderate_sndbuf(sk);
return -ENOMEM;
}
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v8 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl.
2025-09-10 19:19 [PATCH v8 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
@ 2025-09-10 19:19 ` Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE Kuniyuki Iwashima
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-10 19:19 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau
Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
If net.core.memcg_exclusive is 1 when sk->sk_memcg is allocated,
the socket is flagged with SK_MEMCG_EXCLUSIVE internally and skips
the global per-protocol memory accounting.
OTOH, for accept()ed child sockets, this flag is inherited from
the listening socket in sk_clone_lock() and set in __inet_accept().
This is to preserve the decision by BPF which will be supported later.
Given sk->sk_memcg can be accessed in the fast path, it would
be preferable to place the flag field in the same cache line as
sk->sk_memcg.
However, struct sock does not have such a 1-byte hole.
Let's store the flag in the lowest bit of sk->sk_memcg and check
it in mem_cgroup_sk_exclusive().
Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.
Setup:
# mkdir /sys/fs/cgroup/test
# echo $$ >> /sys/fs/cgroup/test/cgroup.procs
# sysctl -q net.ipv4.tcp_mem="1000 1000 1000"
Without net.core.memcg_exclusive, charged to memcg & tcp_mem:
# prlimit -n=524288:524288 bash -c "python3 pressure.py" &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 22642688 <-------------------------------------- charged to memcg
# cat /proc/net/sockstat| grep TCP
TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53188
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:49972
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53868
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53554
# nstat | grep Pressure || echo no pressure
TcpExtTCPMemoryPressures 1 0.0
With net.core.memcg_exclusive=1, only charged to memcg:
# sysctl -q net.core.memcg_exclusive=1
# prlimit -n=524288:524288 bash -c "python3 pressure.py" &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 2757468160 <------------------------------------ charged to memcg
# cat /proc/net/sockstat | grep TCP
TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:49026
ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:45630
ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:44870
ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:45274
# nstat | grep Pressure || echo no pressure
no pressure
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v8: Fix build failure when CONFIG_NET=n
---
Documentation/admin-guide/sysctl/net.rst | 9 ++++++
include/net/netns/core.h | 3 ++
include/net/sock.h | 39 ++++++++++++++++++++++--
mm/memcontrol.c | 12 +++++++-
net/core/sock.c | 1 +
net/core/sysctl_net_core.c | 11 +++++++
net/ipv4/af_inet.c | 4 +++
7 files changed, 76 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 2ef50828aff1..7272194dcf45 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -212,6 +212,15 @@ mem_pcpu_rsv
Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
+memcg_exclusive
+---------------
+
+Skip charging socket buffers to the per-protocol global memory accounting
+(controlled by net.ipv4.tcp_mem, etc) if they are already charged to the
+cgroup memory controller ("sock" in memory.stat file).
+
+Default: 0
+
rmem_default
------------
diff --git a/include/net/netns/core.h b/include/net/netns/core.h
index 9b36f0ff0c20..ec511088e67d 100644
--- a/include/net/netns/core.h
+++ b/include/net/netns/core.h
@@ -16,6 +16,9 @@ struct netns_core {
int sysctl_optmem_max;
u8 sysctl_txrehash;
u8 sysctl_tstamp_allow_data;
+#ifdef CONFIG_MEMCG
+ u8 sysctl_memcg_exclusive;
+#endif
#ifdef CONFIG_PROC_FS
struct prot_inuse __percpu *prot_inuse;
diff --git a/include/net/sock.h b/include/net/sock.h
index 66501ab670eb..0f597f4deaa3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2596,10 +2596,36 @@ static inline gfp_t gfp_memcg_charge(void)
return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
}
+enum {
+ SK_MEMCG_EXCLUSIVE = (1UL << 0),
+ SK_MEMCG_FLAG_MAX = (1UL << 1),
+};
+
+#define SK_MEMCG_FLAG_MASK (SK_MEMCG_FLAG_MAX - 1)
+#define SK_MEMCG_PTR_MASK ~SK_MEMCG_FLAG_MASK
+
#ifdef CONFIG_MEMCG
static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
{
- return sk->sk_memcg;
+ unsigned long val = (unsigned long)sk->sk_memcg;
+
+ val &= SK_MEMCG_PTR_MASK;
+ return (struct mem_cgroup *)val;
+}
+
+static inline void mem_cgroup_sk_set_flags(struct sock *sk, unsigned short flags)
+{
+ unsigned long val = (unsigned long)mem_cgroup_from_sk(sk);
+
+ val |= flags;
+ sk->sk_memcg = (struct mem_cgroup *)val;
+}
+
+static inline unsigned short mem_cgroup_sk_get_flags(const struct sock *sk)
+{
+ unsigned long val = (unsigned long)sk->sk_memcg;
+
+ return val & SK_MEMCG_FLAG_MASK;
}
static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
@@ -2609,7 +2635,7 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
static inline bool mem_cgroup_sk_exclusive(const struct sock *sk)
{
- return false;
+ return mem_cgroup_sk_get_flags(sk) & SK_MEMCG_EXCLUSIVE;
}
static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
@@ -2634,6 +2660,15 @@ static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
return NULL;
}
+static inline void mem_cgroup_sk_set_flags(struct sock *sk, unsigned short flags)
+{
+}
+
+static inline unsigned short mem_cgroup_sk_get_flags(const struct sock *sk)
+{
+ return 0;
+}
+
static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
{
return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df3e9205c9e6..88028af8ac28 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4995,6 +4995,16 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
EXPORT_SYMBOL(memcg_sockets_enabled_key);
+static void mem_cgroup_sk_set(struct sock *sk, struct mem_cgroup *memcg)
+{
+ sk->sk_memcg = memcg;
+
+#ifdef CONFIG_NET
+ if (READ_ONCE(sock_net(sk)->core.sysctl_memcg_exclusive))
+ mem_cgroup_sk_set_flags(sk, SK_MEMCG_EXCLUSIVE);
+#endif
+}
+
void mem_cgroup_sk_alloc(struct sock *sk)
{
struct mem_cgroup *memcg;
@@ -5013,7 +5023,7 @@ void mem_cgroup_sk_alloc(struct sock *sk)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg1_tcpmem_active(memcg))
goto out;
if (css_tryget(&memcg->css))
- sk->sk_memcg = memcg;
+ mem_cgroup_sk_set(sk, memcg);
out:
rcu_read_unlock();
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 814966309b0e..348e599c3fbc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2519,6 +2519,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
#ifdef CONFIG_MEMCG
/* sk->sk_memcg will be populated at accept() time */
newsk->sk_memcg = NULL;
+ mem_cgroup_sk_set_flags(newsk, mem_cgroup_sk_get_flags(sk));
#endif
cgroup_sk_clone(&newsk->sk_cgrp_data);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 8cf04b57ade1..c8b5fc3b8435 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -676,6 +676,17 @@ static struct ctl_table netns_core_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE
},
+#ifdef CONFIG_MEMCG
+ {
+ .procname = "memcg_exclusive",
+ .data = &init_net.core.sysctl_memcg_exclusive,
+ .maxlen = sizeof(u8),
+ .mode = 0644,
+ .proc_handler = proc_dou8vec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE
+ },
+#endif
/* sysctl_core_net_init() will set the values after this
* to readonly in network namespaces
*/
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 52d060bc9009..3fbb076ffd36 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -759,12 +759,16 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
(!IS_ENABLED(CONFIG_IP_SCTP) ||
sk_is_tcp(newsk) || sk_is_mptcp(newsk))) {
gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
+ unsigned short flags;
+ flags = mem_cgroup_sk_get_flags(newsk);
mem_cgroup_sk_alloc(newsk);
if (mem_cgroup_from_sk(newsk)) {
int amt;
+ mem_cgroup_sk_set_flags(newsk, flags);
+
/* The socket has not been accepted yet, no need
* to look at newsk->sk_wmem_queued.
*/
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v8 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
2025-09-10 19:19 [PATCH v8 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (2 preceding siblings ...)
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
@ 2025-09-10 19:19 ` Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE Kuniyuki Iwashima
5 siblings, 0 replies; 10+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-10 19:19 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau
Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
We will support flagging SK_MEMCG_EXCLUSIVE via bpf_setsockopt() at
the BPF_CGROUP_INET_SOCK_CREATE hook.
BPF_CGROUP_INET_SOCK_CREATE is invoked by __cgroup_bpf_run_filter_sk()
that passes a pointer to struct sock to the bpf prog as void *ctx.
But there are no bpf_func_proto for bpf_setsockopt() that receives
the ctx as a pointer to struct sock.
Also, bpf_getsockopt() will be necessary for a cgroup with multiple
bpf progs running.
Let's add new bpf_setsockopt() and bpf_getsockopt() variants for
BPF_CGROUP_INET_SOCK_CREATE.
Note that inet_create() is not under lock_sock() and has the same
semantics with bpf_lsm_unlocked_sockopt_hooks.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v5: Rename new variants to bpf_sock_create_{get,set}sockopt().
v4:
* Use __bpf_setsockopt() instead of _bpf_setsockopt()
* Add getsockopt() for a cgroup with multiple bpf progs running
v3: Remove bpf_func_proto for accept()
v2: Make 2 new bpf_func_proto static
---
net/core/filter.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index 63f3baee2daf..31b259f02ee9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5723,6 +5723,40 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
.arg5_type = ARG_CONST_SIZE,
};
+BPF_CALL_5(bpf_sock_create_setsockopt, struct sock *, sk, int, level,
+ int, optname, char *, optval, int, optlen)
+{
+ return __bpf_setsockopt(sk, level, optname, optval, optlen);
+}
+
+static const struct bpf_func_proto bpf_sock_create_setsockopt_proto = {
+ .func = bpf_sock_create_setsockopt,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_ANYTHING,
+ .arg4_type = ARG_PTR_TO_MEM | MEM_RDONLY,
+ .arg5_type = ARG_CONST_SIZE,
+};
+
+BPF_CALL_5(bpf_sock_create_getsockopt, struct sock *, sk, int, level,
+ int, optname, char *, optval, int, optlen)
+{
+ return __bpf_getsockopt(sk, level, optname, optval, optlen);
+}
+
+static const struct bpf_func_proto bpf_sock_create_getsockopt_proto = {
+ .func = bpf_sock_create_getsockopt,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_ANYTHING,
+ .arg4_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg5_type = ARG_CONST_SIZE,
+};
+
BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
int, level, int, optname, char *, optval, int, optlen)
{
@@ -8051,6 +8085,20 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_sk_storage_get_cg_sock_proto;
case BPF_FUNC_ktime_get_coarse_ns:
return &bpf_ktime_get_coarse_ns_proto;
+ case BPF_FUNC_setsockopt:
+ switch (prog->expected_attach_type) {
+ case BPF_CGROUP_INET_SOCK_CREATE:
+ return &bpf_sock_create_setsockopt_proto;
+ default:
+ return NULL;
+ }
+ case BPF_FUNC_getsockopt:
+ switch (prog->expected_attach_type) {
+ case BPF_CGROUP_INET_SOCK_CREATE:
+ return &bpf_sock_create_getsockopt_proto;
+ default:
+ return NULL;
+ }
default:
return bpf_base_func_proto(func_id, prog);
}
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v8 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE.
2025-09-10 19:19 [PATCH v8 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (3 preceding siblings ...)
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE Kuniyuki Iwashima
@ 2025-09-10 19:19 ` Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE Kuniyuki Iwashima
5 siblings, 0 replies; 10+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-10 19:19 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau
Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
If a socket has sk->sk_memcg with SK_MEMCG_EXCLUSIVE, it is decoupled
from the global protocol memory accounting.
This is controlled by net.core.memcg_exclusive sysctl, but it lacks
flexibility.
Let's support flagging (and clearing) SK_MEMCG_EXCLUSIVE via
bpf_setsockopt() at the BPF_CGROUP_INET_SOCK_CREATE hook.
u32 flags = SK_BPF_MEMCG_EXCLUSIVE;
bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
&flags, sizeof(flags));
As with net.core.memcg_exclusive, this is inherited to child sockets,
and BPF always takes precedence over sysctl at socket(2) and accept(2).
SK_BPF_MEMCG_FLAGS is only supported at BPF_CGROUP_INET_SOCK_CREATE
and not supported on other hooks for some reasons:
1. UDP charges memory under sk->sk_receive_queue.lock instead
of lock_sock()
2. For TCP child sockets, memory accounting is adjusted only in
__inet_accept() which sk->sk_memcg allocation is deferred to
3. Modifying the flag after skb is charged to sk requires such
adjustment during bpf_setsockopt() and complicates the logic
unnecessarily
We can support other hooks later if a real use case justifies that.
Most changes are inline and hard to trace, but a microbenchmark on
__sk_mem_raise_allocated() during neper/tcp_stream showed that more
samples completed faster with SK_MEMCG_EXCLUSIVE. This will be more
visible under tcp_mem pressure.
# bpftrace -e 'kprobe:__sk_mem_raise_allocated { @start[tid] = nsecs; }
kretprobe:__sk_mem_raise_allocated /@start[tid]/
{ @end[tid] = nsecs - @start[tid]; @times = hist(@end[tid]); delete(@start[tid]); }'
# tcp_stream -6 -F 1000 -N -T 256
Without bpf prog:
[128, 256) 3846 | |
[256, 512) 1505326 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 1371006 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[1K, 2K) 198207 |@@@@@@ |
[2K, 4K) 31199 |@ |
With bpf prog in the next patch:
(must be attached before tcp_stream)
# bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg type cgroup/sock_create
# bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/sk_memcg
[128, 256) 6413 | |
[256, 512) 1868425 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 1101697 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[1K, 2K) 117031 |@@@@ |
[2K, 4K) 11773 | |
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v7:
* Update commit message.
v5:
* Limit getsockopt() to BPF_CGROUP_INET_SOCK_CREATE
v4:
* Only allow inet_create() to set flags
* Inherit flags from listener to child in sk_clone_lock()
* Support clearing flags
v3:
* Allow setting flags without sk->sk_memcg in sk_bpf_set_get_memcg_flags()
* Preserve flags in __inet_accept()
v2:
* s/mem_cgroup_sk_set_flag/mem_cgroup_sk_set_flags/ when CONFIG_MEMCG=n
* Use CONFIG_CGROUP_BPF instead of CONFIG_BPF_SYSCALL for ifdef
---
include/uapi/linux/bpf.h | 6 ++++++
mm/memcontrol.c | 3 +++
net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 6 ++++++
4 files changed, 49 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 233de8677382..35e3ce40ac90 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7182,6 +7182,7 @@ enum {
TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */
TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
SK_BPF_CB_FLAGS = 1009, /* Get or set sock ops flags in socket */
+ SK_BPF_MEMCG_FLAGS = 1010, /* Get or Set flags saved in sk->sk_memcg */
};
enum {
@@ -7204,6 +7205,11 @@ enum {
*/
};
+enum {
+ SK_BPF_MEMCG_EXCLUSIVE = (1UL << 0),
+ SK_BPF_MEMCG_FLAG_MAX = (1UL << 1),
+};
+
struct bpf_perf_event_value {
__u64 counter;
__u64 enabled;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 88028af8ac28..b7d405b57e23 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4997,6 +4997,9 @@ EXPORT_SYMBOL(memcg_sockets_enabled_key);
static void mem_cgroup_sk_set(struct sock *sk, struct mem_cgroup *memcg)
{
+ BUILD_BUG_ON((unsigned short)SK_MEMCG_EXCLUSIVE != SK_BPF_MEMCG_EXCLUSIVE);
+ BUILD_BUG_ON((unsigned short)SK_MEMCG_FLAG_MAX != SK_BPF_MEMCG_FLAG_MAX);
+
sk->sk_memcg = memcg;
#ifdef CONFIG_NET
diff --git a/net/core/filter.c b/net/core/filter.c
index 31b259f02ee9..df2496120076 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5723,9 +5723,39 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
.arg5_type = ARG_CONST_SIZE,
};
+static int sk_bpf_set_get_memcg_flags(struct sock *sk,
+ char *optval, int optlen,
+ bool getopt)
+{
+ u32 flags;
+
+ if (optlen != sizeof(u32))
+ return -EINVAL;
+
+ if (!sk_has_account(sk))
+ return -EOPNOTSUPP;
+
+ if (getopt) {
+ *(u32 *)optval = mem_cgroup_sk_get_flags(sk);
+ return 0;
+ }
+
+ flags = *(u32 *)optval;
+ if (flags >= SK_BPF_MEMCG_FLAG_MAX)
+ return -EINVAL;
+
+ mem_cgroup_sk_set_flags(sk, flags);
+
+ return 0;
+}
+
BPF_CALL_5(bpf_sock_create_setsockopt, struct sock *, sk, int, level,
int, optname, char *, optval, int, optlen)
{
+ if (IS_ENABLED(CONFIG_MEMCG) &&
+ level == SOL_SOCKET && optname == SK_BPF_MEMCG_FLAGS)
+ return sk_bpf_set_get_memcg_flags(sk, optval, optlen, false);
+
return __bpf_setsockopt(sk, level, optname, optval, optlen);
}
@@ -5743,6 +5773,10 @@ static const struct bpf_func_proto bpf_sock_create_setsockopt_proto = {
BPF_CALL_5(bpf_sock_create_getsockopt, struct sock *, sk, int, level,
int, optname, char *, optval, int, optlen)
{
+ if (IS_ENABLED(CONFIG_MEMCG) &&
+ level == SOL_SOCKET && optname == SK_BPF_MEMCG_FLAGS)
+ return sk_bpf_set_get_memcg_flags(sk, optval, optlen, true);
+
return __bpf_getsockopt(sk, level, optname, optval, optlen);
}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 233de8677382..35e3ce40ac90 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7182,6 +7182,7 @@ enum {
TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */
TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
SK_BPF_CB_FLAGS = 1009, /* Get or set sock ops flags in socket */
+ SK_BPF_MEMCG_FLAGS = 1010, /* Get or Set flags saved in sk->sk_memcg */
};
enum {
@@ -7204,6 +7205,11 @@ enum {
*/
};
+enum {
+ SK_BPF_MEMCG_EXCLUSIVE = (1UL << 0),
+ SK_BPF_MEMCG_FLAG_MAX = (1UL << 1),
+};
+
struct bpf_perf_event_value {
__u64 counter;
__u64 enabled;
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v8 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.
2025-09-10 19:19 [PATCH v8 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (4 preceding siblings ...)
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE Kuniyuki Iwashima
@ 2025-09-10 19:19 ` Kuniyuki Iwashima
5 siblings, 0 replies; 10+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-10 19:19 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau
Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
The test does the following for IPv4/IPv6 x TCP/UDP sockets
with/without SK_MEMCG_EXCLUSIVE, which can be turned on by
net.core.memcg_exclusive or bpf_setsockopt(SK_BPF_MEMCG_EXCLUSIVE).
1. Create socket pairs
2. Send a bunch of data that requires more than 1024 pages
3. Read memory_allocated from sk->sk_prot->memory_allocated and
sk->sk_prot->memory_per_cpu_fw_alloc
4. Check if unread data is charged to memory_allocated
If SK_MEMCG_EXCLUSIVE is set, memory_allocated should not be
changed, but we allow a small error (up to 10 pages) in case
other processes on the host use some amounts of TCP/UDP memory.
The amount of allocated pages are buffered to per-cpu variable
{tcp,udp}_memory_per_cpu_fw_alloc up to +/- net.core.mem_pcpu_rsv
before reported to {tcp,udp}_memory_allocated.
At 3., memory_allocated is calculated from the 2 variables twice
at fentry and fexit of socket create function to check if the per-cpu
value is drained during calculation. In that case, 3. is retried.
We use kern_sync_rcu() for UDP because UDP recv queue is destroyed
after RCU grace period.
The test takes ~2s on QEMU (64 CPUs) w/ KVM but takes 6s w/o KVM.
# time ./test_progs -t sk_memcg
#370/1 sk_memcg/TCP :OK
#370/2 sk_memcg/UDP :OK
#370/3 sk_memcg/TCPv6:OK
#370/4 sk_memcg/UDPv6:OK
#370 sk_memcg:OK
Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
real 0m1.623s
user 0m0.165s
sys 0m0.366s
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v7:
* Add test for sysctl
v6:
* Trace sk_prot->memory_allocated + sk_prot->memory_per_cpu_fw_alloc
v5:
* Use kern_sync_rcu()
* Double NR_SEND to 128
v4:
* Only use inet_create() hook
* Test bpf_getsockopt()
* Add serial_ prefix
* Reduce sleep() and the amount of sent data
---
.../selftests/bpf/prog_tests/sk_memcg.c | 261 ++++++++++++++++++
tools/testing/selftests/bpf/progs/sk_memcg.c | 146 ++++++++++
2 files changed, 407 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_memcg.c
create mode 100644 tools/testing/selftests/bpf/progs/sk_memcg.c
diff --git a/tools/testing/selftests/bpf/prog_tests/sk_memcg.c b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
new file mode 100644
index 000000000000..777fb81e9365
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
@@ -0,0 +1,261 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2025 Google LLC */
+
+#include <test_progs.h>
+#include "sk_memcg.skel.h"
+#include "network_helpers.h"
+
+#define NR_SOCKETS 64
+#define NR_SEND 128
+#define BUF_SINGLE 1024
+#define BUF_TOTAL (BUF_SINGLE * NR_SEND)
+
+struct test_case {
+ char name[8];
+ int family;
+ int type;
+ int (*create_sockets)(struct test_case *test_case, int sk[], int len);
+ long (*get_memory_allocated)(struct test_case *test_case, struct sk_memcg *skel);
+};
+
+static int tcp_create_sockets(struct test_case *test_case, int sk[], int len)
+{
+ int server, i;
+
+ server = start_server(test_case->family, test_case->type, NULL, 0, 0);
+ ASSERT_GE(server, 0, "start_server_str");
+
+ for (i = 0; i < len / 2; i++) {
+ sk[i * 2] = connect_to_fd(server, 0);
+ if (!ASSERT_GE(sk[i * 2], 0, "connect_to_fd"))
+ return sk[i * 2];
+
+ sk[i * 2 + 1] = accept(server, NULL, NULL);
+ if (!ASSERT_GE(sk[i * 2 + 1], 0, "accept"))
+ return sk[i * 2 + 1];
+ }
+
+ close(server);
+
+ return 0;
+}
+
+static int udp_create_sockets(struct test_case *test_case, int sk[], int len)
+{
+ int i, err, rcvbuf = BUF_TOTAL;
+
+ for (i = 0; i < len / 2; i++) {
+ sk[i * 2] = start_server(test_case->family, test_case->type, NULL, 0, 0);
+ if (!ASSERT_GE(sk[i * 2], 0, "start_server"))
+ return sk[i * 2];
+
+ sk[i * 2 + 1] = connect_to_fd(sk[i * 2], 0);
+ if (!ASSERT_GE(sk[i * 2 + 1], 0, "connect_to_fd"))
+ return sk[i * 2 + 1];
+
+ err = connect_fd_to_fd(sk[i * 2], sk[i * 2 + 1], 0);
+ if (!ASSERT_EQ(err, 0, "connect_fd_to_fd"))
+ return err;
+
+ err = setsockopt(sk[i * 2], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
+ if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
+ return err;
+
+ err = setsockopt(sk[i * 2 + 1], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
+ if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
+ return err;
+ }
+
+ return 0;
+}
+
+static long get_memory_allocated(struct test_case *test_case,
+ bool *activated, bool *stable,
+ long *memory_allocated)
+{
+ *stable = false;
+
+ do {
+ *activated = true;
+
+ /* AF_INET and AF_INET6 share the same memory_allocated.
+ * tcp_init_sock() is called by AF_INET and AF_INET6,
+ * but udp_lib_init_sock() is inline.
+ */
+ socket(AF_INET, test_case->type, 0);
+ } while (!*stable);
+
+ return *memory_allocated;
+}
+
+static long tcp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
+{
+ return get_memory_allocated(test_case,
+ &skel->bss->tcp_activated,
+ &skel->bss->tcp_stable,
+ &skel->bss->tcp_memory_allocated);
+}
+
+static long udp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
+{
+ return get_memory_allocated(test_case,
+ &skel->bss->udp_activated,
+ &skel->bss->udp_stable,
+ &skel->bss->udp_memory_allocated);
+}
+
+static int check_exclusive(struct test_case *test_case,
+ struct sk_memcg *skel, bool exclusive)
+{
+ char buf[BUF_SINGLE] = {};
+ long memory_allocated[2];
+ int sk[NR_SOCKETS] = {};
+ int err, i, j;
+
+ err = test_case->create_sockets(test_case, sk, ARRAY_SIZE(sk));
+ if (err)
+ goto close;
+
+ memory_allocated[0] = test_case->get_memory_allocated(test_case, skel);
+
+ /* allocate pages >= 1024 */
+ for (i = 0; i < ARRAY_SIZE(sk); i++) {
+ for (j = 0; j < NR_SEND; j++) {
+ int bytes = send(sk[i], buf, sizeof(buf), 0);
+
+ /* Avoid too noisy logs when something failed. */
+ if (bytes != sizeof(buf)) {
+ ASSERT_EQ(bytes, sizeof(buf), "send");
+ if (bytes < 0) {
+ err = bytes;
+ goto close;
+ }
+ }
+ }
+ }
+
+ memory_allocated[1] = test_case->get_memory_allocated(test_case, skel);
+
+ if (exclusive)
+ ASSERT_LE(memory_allocated[1], memory_allocated[0] + 10, "exclusive");
+ else
+ ASSERT_GT(memory_allocated[1], memory_allocated[0] + 1024, "not exclusive");
+
+close:
+ for (i = 0; i < ARRAY_SIZE(sk); i++)
+ close(sk[i]);
+
+ if (test_case->type == SOCK_DGRAM) {
+ /* UDP recv queue is destroyed after RCU grace period.
+ * With one kern_sync_rcu(), memory_allocated[0] of the
+ * isoalted case often matches with memory_allocated[1]
+ * of the preceding non-exclusive case.
+ */
+ kern_sync_rcu();
+ kern_sync_rcu();
+ }
+
+ return err;
+}
+
+void run_test(struct test_case *test_case)
+{
+ struct nstoken *nstoken;
+ struct sk_memcg *skel;
+ int cgroup, err;
+
+ skel = sk_memcg__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load"))
+ return;
+
+ skel->bss->nr_cpus = libbpf_num_possible_cpus();
+
+ err = sk_memcg__attach(skel);
+ if (!ASSERT_OK(err, "attach"))
+ goto destroy_skel;
+
+ cgroup = test__join_cgroup("/sk_memcg");
+ if (!ASSERT_GE(cgroup, 0, "join_cgroup"))
+ goto destroy_skel;
+
+ err = make_netns("sk_memcg");
+ if (!ASSERT_EQ(err, 0, "make_netns"))
+ goto close_cgroup;
+
+ nstoken = open_netns("sk_memcg");
+ if (!ASSERT_OK_PTR(nstoken, "open_netns"))
+ goto remove_netns;
+
+ err = check_exclusive(test_case, skel, false);
+ if (!ASSERT_EQ(err, 0, "test_exclusive(false)"))
+ goto close_netns;
+
+ err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "1");
+ if (!ASSERT_EQ(err, 0, "write_sysctl(1)"))
+ goto close_netns;
+
+ err = check_exclusive(test_case, skel, true);
+ if (!ASSERT_EQ(err, 0, "test_exclusive(true by sysctl)"))
+ goto close_netns;
+
+ err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "0");
+ if (!ASSERT_EQ(err, 0, "write_sysctl(0)"))
+ goto close_netns;
+
+ skel->links.sock_create = bpf_program__attach_cgroup(skel->progs.sock_create, cgroup);
+ if (!ASSERT_OK_PTR(skel->links.sock_create, "attach_cgroup(sock_create)"))
+ goto close_netns;
+
+ err = check_exclusive(test_case, skel, true);
+ ASSERT_EQ(err, 0, "test_exclusive(true by bpf)");
+
+close_netns:
+ close_netns(nstoken);
+remove_netns:
+ remove_netns("sk_memcg");
+close_cgroup:
+ close(cgroup);
+destroy_skel:
+ sk_memcg__destroy(skel);
+}
+
+struct test_case test_cases[] = {
+ {
+ .name = "TCP ",
+ .family = AF_INET,
+ .type = SOCK_STREAM,
+ .create_sockets = tcp_create_sockets,
+ .get_memory_allocated = tcp_get_memory_allocated,
+ },
+ {
+ .name = "UDP ",
+ .family = AF_INET,
+ .type = SOCK_DGRAM,
+ .create_sockets = udp_create_sockets,
+ .get_memory_allocated = udp_get_memory_allocated,
+ },
+ {
+ .name = "TCPv6",
+ .family = AF_INET6,
+ .type = SOCK_STREAM,
+ .create_sockets = tcp_create_sockets,
+ .get_memory_allocated = tcp_get_memory_allocated,
+ },
+ {
+ .name = "UDPv6",
+ .family = AF_INET6,
+ .type = SOCK_DGRAM,
+ .create_sockets = udp_create_sockets,
+ .get_memory_allocated = udp_get_memory_allocated,
+ },
+};
+
+void serial_test_sk_memcg(void)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
+ test__start_subtest(test_cases[i].name);
+ run_test(&test_cases[i]);
+ }
+}
diff --git a/tools/testing/selftests/bpf/progs/sk_memcg.c b/tools/testing/selftests/bpf/progs/sk_memcg.c
new file mode 100644
index 000000000000..6b1a928a0c90
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sk_memcg.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2025 Google LLC */
+
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <errno.h>
+
+extern int tcp_memory_per_cpu_fw_alloc __ksym;
+extern int udp_memory_per_cpu_fw_alloc __ksym;
+
+int nr_cpus;
+bool tcp_activated, tcp_stable, udp_activated, udp_stable;
+long tcp_memory_allocated, udp_memory_allocated;
+static struct sock *tcp_sk_tracing, *udp_sk_tracing;
+
+struct sk_prot {
+ long *memory_allocated;
+ int *memory_per_cpu_fw_alloc;
+};
+
+static int drain_memory_per_cpu_fw_alloc(__u32 i, struct sk_prot *sk_prot_ctx)
+{
+ int *memory_per_cpu_fw_alloc;
+
+ memory_per_cpu_fw_alloc = bpf_per_cpu_ptr(sk_prot_ctx->memory_per_cpu_fw_alloc, i);
+ if (memory_per_cpu_fw_alloc)
+ *sk_prot_ctx->memory_allocated += *memory_per_cpu_fw_alloc;
+
+ return 0;
+}
+
+static long get_memory_allocated(struct sock *_sk, int *memory_per_cpu_fw_alloc)
+{
+ struct sock *sk = bpf_core_cast(_sk, struct sock);
+ struct sk_prot sk_prot_ctx;
+ long memory_allocated;
+
+ /* net_aligned_data.{tcp,udp}_memory_allocated was not available. */
+ memory_allocated = sk->__sk_common.skc_prot->memory_allocated->counter;
+
+ sk_prot_ctx.memory_allocated = &memory_allocated;
+ sk_prot_ctx.memory_per_cpu_fw_alloc = memory_per_cpu_fw_alloc;
+
+ bpf_loop(nr_cpus, drain_memory_per_cpu_fw_alloc, &sk_prot_ctx, 0);
+
+ return memory_allocated;
+}
+
+static void fentry_init_sock(struct sock *sk, struct sock **sk_tracing,
+ long *memory_allocated, int *memory_per_cpu_fw_alloc,
+ bool *activated)
+{
+ if (!*activated)
+ return;
+
+ if (__sync_val_compare_and_swap(sk_tracing, NULL, sk))
+ return;
+
+ *activated = false;
+ *memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
+}
+
+static void fexit_init_sock(struct sock *sk, struct sock **sk_tracing,
+ long *memory_allocated, int *memory_per_cpu_fw_alloc,
+ bool *stable)
+{
+ long new_memory_allocated;
+
+ if (sk != *sk_tracing)
+ return;
+
+ new_memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
+ if (new_memory_allocated == *memory_allocated)
+ *stable = true;
+
+ *sk_tracing = NULL;
+}
+
+SEC("fentry/tcp_init_sock")
+int BPF_PROG(fentry_tcp_init_sock, struct sock *sk)
+{
+ fentry_init_sock(sk, &tcp_sk_tracing,
+ &tcp_memory_allocated, &tcp_memory_per_cpu_fw_alloc,
+ &tcp_activated);
+ return 0;
+}
+
+SEC("fexit/tcp_init_sock")
+int BPF_PROG(fexit_tcp_init_sock, struct sock *sk)
+{
+ fexit_init_sock(sk, &tcp_sk_tracing,
+ &tcp_memory_allocated, &tcp_memory_per_cpu_fw_alloc,
+ &tcp_stable);
+ return 0;
+}
+
+SEC("fentry/udp_init_sock")
+int BPF_PROG(fentry_udp_init_sock, struct sock *sk)
+{
+ fentry_init_sock(sk, &udp_sk_tracing,
+ &udp_memory_allocated, &udp_memory_per_cpu_fw_alloc,
+ &udp_activated);
+ return 0;
+}
+
+SEC("fexit/udp_init_sock")
+int BPF_PROG(fexit_udp_init_sock, struct sock *sk)
+{
+ fexit_init_sock(sk, &udp_sk_tracing,
+ &udp_memory_allocated, &udp_memory_per_cpu_fw_alloc,
+ &udp_stable);
+ return 0;
+}
+
+SEC("cgroup/sock_create")
+int sock_create(struct bpf_sock *ctx)
+{
+ u32 flags = SK_BPF_MEMCG_EXCLUSIVE;
+ int err;
+
+ err = bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
+ &flags, sizeof(flags));
+ if (err)
+ goto err;
+
+ flags = 0;
+
+ err = bpf_getsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
+ &flags, sizeof(flags));
+ if (err)
+ goto err;
+
+ if (flags != SK_BPF_MEMCG_EXCLUSIVE) {
+ err = -EINVAL;
+ goto err;
+ }
+
+ return 1;
+
+err:
+ bpf_set_retval(err);
+ return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept().
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
@ 2025-09-17 6:37 ` kernel test robot
2025-09-17 7:03 ` Kuniyuki Iwashima
0 siblings, 1 reply; 10+ messages in thread
From: kernel test robot @ 2025-09-17 6:37 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: oe-lkp, lkp, Shakeel Butt, netdev, ltp, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Neal Cardwell, Willem de Bruijn, Mina Almasry,
Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, oliver.sang
Hello,
kernel test robot noticed "BUG:KASAN:slab-out-of-bounds_in__inet_accept" on:
commit: d465aa09942825d93a377c3715c464e8f6827f13 ("[PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept().")
url: https://github.com/intel-lab-lkp/linux/commits/Kuniyuki-Iwashima/tcp-Save-lock_sock-for-memcg-in-inet_csk_accept/20250911-032312
base: https://git.kernel.org/cgit/linux/kernel/git/bpf/bpf-next.git net
patch link: https://lore.kernel.org/all/20250910192057.1045711-2-kuniyu@google.com/
patch subject: [PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept().
in testcase: ltp
version: ltp-x86_64-c6660a3e0-1_20250913
with following parameters:
test: net.features
config: x86_64-rhel-9.4-ltp
compiler: gcc-14
test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4790T CPU @ 2.70GHz (Haswell) with 16G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202509171359.658ddb38-lkp@intel.com
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250917/202509171359.658ddb38-lkp@intel.com
we saw a lot of "BUG:KASAN:slab-out-of-bounds_in__inet_accept" issue in dmesg
uploaded to above link, below is just one example:
[ 468.984291][T30180] ==================================================================
[ 468.992753][T30180] BUG: KASAN: slab-out-of-bounds in __inet_accept+0x5c6/0x640
[ 469.000550][T30180] Read of size 1 at addr ffff88810df4ea20 by task netstress/30180
[ 469.008720][T30180]
[ 469.011389][T30180] CPU: 0 UID: 0 PID: 30180 Comm: netstress Not tainted 6.17.0-rc2-00437-gd465aa099428 #1 PREEMPT(voluntary)
[ 469.011393][T30180] Hardware name: Gigabyte Technology Co., Ltd. Z97X-UD5H/Z97X-UD5H, BIOS F9 04/21/2015
[ 469.011395][T30180] Call Trace:
[ 469.011396][T30180] <TASK>
[ 469.011398][T30180] dump_stack_lvl+0x47/0x70
[ 469.011403][T30180] print_address_description+0x88/0x320
[ 469.011408][T30180] ? __inet_accept+0x5c6/0x640
[ 469.011410][T30180] print_report+0x106/0x1f4
[ 469.011413][T30180] ? __inet_accept+0x5c6/0x640
[ 469.011415][T30180] ? __inet_accept+0x5c6/0x640
[ 469.011417][T30180] kasan_report+0xb5/0xf0
[ 469.011421][T30180] ? __inet_accept+0x5c6/0x640
[ 469.011424][T30180] __inet_accept+0x5c6/0x640
[ 468.992753][T30180] BUG: KASAN: slab-out-of-bounds in __inet_accept+0x5c6/0x640
[ 469.011427][T30180] inet_accept+0xe2/0x170
[ 469.000550][T30180] Read of size 1 at addr ffff88810df4ea20 by task netstress/30180
[ 469.011430][T30180] do_accept+0x2e5/0x480
[ 469.008720][T30180]
[ 469.011434][T30180] ? folio_xchg_last_cpupid+0xc5/0x130
[ 469.011389][T30180] CPU: 0 UID: 0 PID: 30180 Comm: netstress Not tainted 6.17.0-rc2-00437-gd465aa099428 #1 PREEMPT(voluntary)
[ 469.011393][T30180] Hardware name: Gigabyte Technology Co., Ltd. Z97X-UD5H/Z97X-UD5H, BIOS F9 04/21/2015
[ 469.011437][T30180] ? __pfx_do_accept+0x10/0x10
[ 469.011395][T30180] Call Trace:
[ 469.011441][T30180] ? _raw_spin_lock+0x80/0xe0
[ 469.011396][T30180] <TASK>
[ 469.011444][T30180] ? __pfx__raw_spin_lock+0x10/0x10
[ 469.011398][T30180] dump_stack_lvl+0x47/0x70
[ 469.011447][T30180] ? alloc_fd+0x266/0x410
[ 469.011403][T30180] print_address_description+0x88/0x320
[ 469.011451][T30180] __sys_accept4+0xc4/0x150
[ 469.011454][T30180] ? __pfx___sys_accept4+0x10/0x10
[ 469.011458][T30180] __x64_sys_accept+0x70/0xb0
[ 469.011461][T30180] do_syscall_64+0x7b/0x2c0
[ 469.011466][T30180] ? __pfx___handle_mm_fault+0x10/0x10
[ 469.011468][T30180] ? __pfx_css_rstat_updated+0x10/0x10
[ 469.011471][T30180] ? count_memcg_events+0x253/0x3f0
[ 469.011475][T30180] ? handle_mm_fault+0x382/0x6c0
[ 469.011478][T30180] ? do_user_addr_fault+0x820/0xd60
[ 469.011482][T30180] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 469.011485][T30180] RIP: 0033:0x7f9c169c4687
[ 469.011488][T30180] Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
[ 469.011490][T30180] RSP: 002b:00007ffff0036ac0 EFLAGS: 00000202 ORIG_RAX: 000000000000002b
[ 469.011494][T30180] RAX: ffffffffffffffda RBX: 00007f9c16932740 RCX: 00007f9c169c4687
[ 469.011496][T30180] RDX: 00007ffff0036b14 RSI: 00007ffff0036b20 RDI: 0000000000000006
[ 469.011498][T30180] RBP: 0000562f1b4e85a0 R08: 0000000000000000 R09: 0000000000000000
[ 469.011500][T30180] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffff0036b18
[ 469.011501][T30180] R13: 00007ffff0036b20 R14: 00007ffff0036b14 R15: 0000562f1b4d3e5f
[ 469.011504][T30180] </TASK>
[ 469.011505][T30180]
[ 469.257645][T30180] The buggy address belongs to the object at ffff88810df4e800
[ 469.257645][T30180] which belongs to the cache SCTPv6 of size 1536
[ 469.271959][T30180] The buggy address is located 544 bytes inside of
[ 469.271959][T30180] allocated 1536-byte region [ffff88810df4e800, ffff88810df4ee00)
[ 469.286795][T30180]
[ 469.289353][T30180] The buggy address belongs to the physical page:
[ 469.296000][T30180] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10df48
[ 469.305055][T30180] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 469.313790][T30180] memcg:ffff888223ff8201
[ 469.318241][T30180] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[ 469.326258][T30180] page_type: f5(slab)
[ 469.011408][T30180] ? __inet_accept+0x5c6/0x640
[ 469.330466][T30180] raw: 0017ffffc0000040 ffff888101e08640 dead000000000122 0000000000000000
[ 469.011410][T30180] print_report+0x106/0x1f4
[ 469.339270][T30180] raw: 0000000000000000 0000000080130013 00000000f5000000 ffff888223ff8201
[ 469.011413][T30180] ? __inet_accept+0x5c6/0x640
[ 469.348078][T30180] head: 0017ffffc0000040 ffff888101e08640 dead000000000122 0000000000000000
[ 469.011415][T30180] ? __inet_accept+0x5c6/0x640
[ 469.356993][T30180] head: 0000000000000000 0000000080130013 00000000f5000000 ffff888223ff8201
[ 469.011417][T30180] kasan_report+0xb5/0xf0
[ 469.365914][T30180] head: 0017ffffc0000003 ffffea000437d201 00000000ffffffff 00000000ffffffff
[ 469.011421][T30180] ? __inet_accept+0x5c6/0x640
[ 469.374851][T30180] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
[ 469.011424][T30180] __inet_accept+0x5c6/0x640
[ 469.383788][T30180] page dumped because: kasan: bad access detected
[ 469.011427][T30180] inet_accept+0xe2/0x170
[ 469.390449][T30180]
[ 469.011430][T30180] do_accept+0x2e5/0x480
[ 469.011434][T30180] ? folio_xchg_last_cpupid+0xc5/0x130
[ 469.393031][T30180] Memory state around the buggy address:
[ 469.011437][T30180] ? __pfx_do_accept+0x10/0x10
[ 469.398939][T30180] ffff88810df4e900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 469.011441][T30180] ? _raw_spin_lock+0x80/0xe0
[ 469.407261][T30180] ffff88810df4e980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 469.011444][T30180] ? __pfx__raw_spin_lock+0x10/0x10
[ 469.415589][T30180] >ffff88810df4ea00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 469.011447][T30180] ? alloc_fd+0x266/0x410
[ 469.423933][T30180] ^
[ 469.011451][T30180] __sys_accept4+0xc4/0x150
[ 469.429308][T30180] ffff88810df4ea80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 469.011454][T30180] ? __pfx___sys_accept4+0x10/0x10
[ 469.437670][T30180] ffff88810df4eb00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 469.011458][T30180] __x64_sys_accept+0x70/0xb0
[ 469.446024][T30180] ==================================================================
[ 469.011461][T30180] do_syscall_64+0x7b/0x2c0
[ 469.454415][T30180] Disabling lock debugging due to kernel taint
[ 469.011466][T30180] ? __pfx___handle_mm_fault+0x10/0x10
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept().
2025-09-17 6:37 ` kernel test robot
@ 2025-09-17 7:03 ` Kuniyuki Iwashima
0 siblings, 0 replies; 10+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-17 7:03 UTC (permalink / raw)
To: kernel test robot
Cc: oe-lkp, lkp, Shakeel Butt, netdev, ltp, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Neal Cardwell, Willem de Bruijn, Mina Almasry,
Kuniyuki Iwashima, bpf
On Tue, Sep 16, 2025 at 11:37 PM kernel test robot
<oliver.sang@intel.com> wrote:
>
>
> Hello,
>
> kernel test robot noticed "BUG:KASAN:slab-out-of-bounds_in__inet_accept" on:
>
> commit: d465aa09942825d93a377c3715c464e8f6827f13 ("[PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept().")
> url: https://github.com/intel-lab-lkp/linux/commits/Kuniyuki-Iwashima/tcp-Save-lock_sock-for-memcg-in-inet_csk_accept/20250911-032312
> base: https://git.kernel.org/cgit/linux/kernel/git/bpf/bpf-next.git net
> patch link: https://lore.kernel.org/all/20250910192057.1045711-2-kuniyu@google.com/
> patch subject: [PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept().
>
> in testcase: ltp
> version: ltp-x86_64-c6660a3e0-1_20250913
> with following parameters:
>
> test: net.features
>
>
>
> config: x86_64-rhel-9.4-ltp
> compiler: gcc-14
> test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4790T CPU @ 2.70GHz (Haswell) with 16G memory
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202509171359.658ddb38-lkp@intel.com
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20250917/202509171359.658ddb38-lkp@intel.com
>
>
> we saw a lot of "BUG:KASAN:slab-out-of-bounds_in__inet_accept" issue in dmesg
> uploaded to above link, below is just one example:
>
>
> [ 468.984291][T30180] ==================================================================
> [ 468.992753][T30180] BUG: KASAN: slab-out-of-bounds in __inet_accept+0x5c6/0x640
Oh I misused sk_is_mptcp() which assumes that sk_is_tcp()
is always true and should not be used if sk_is_tcp() is false for
SCTP, so sk_is_mptcp() test was unnecessary
I'll remove it, thanks!
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v8 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
@ 2025-09-17 19:18 ` Martin KaFai Lau
0 siblings, 0 replies; 10+ messages in thread
From: Martin KaFai Lau @ 2025-09-17 19:18 UTC (permalink / raw)
To: Kuniyuki Iwashima, Shakeel Butt, Johannes Weiner
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, Michal Hocko, Roman Gushchin,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Neal Cardwell, Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima,
bpf, netdev
On 9/10/25 12:19 PM, Kuniyuki Iwashima wrote:
> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
>
> If a socket has sk->sk_memcg, this memory is also charged to memcg as
> "sock" in memory.stat.
>
> We do not need to pay costs for two orthogonal memory accounting
> mechanisms. A microbenchmark result is in the subsequent bpf patch.
>
> Let's decouple sockets under memcg from the global per-protocol memory
> accounting if mem_cgroup_sk_exclusive() returns true.
>
> Note that this does NOT disable memcg, but rather the per-protocol one.
The bpf side changes look reasonable. I believe the commit message in v8 has
clarified the reason and its behavior. memcg reviewers, please take a look.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-09-17 19:18 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-10 19:19 [PATCH v8 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
2025-09-17 6:37 ` kernel test robot
2025-09-17 7:03 ` Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
2025-09-17 19:18 ` Martin KaFai Lau
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE Kuniyuki Iwashima
2025-09-10 19:19 ` [PATCH v8 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE Kuniyuki Iwashima
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).