Date: Fri, 22 Aug 2025 22:18:02 +0000
From: Kuniyuki Iwashima <kuniyu@google.com>
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau
Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
 Roman Gushchin, Shakeel Butt, "David S. Miller", Eric Dumazet,
 Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
 Mina Almasry, Kuniyuki Iwashima, bpf@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20250822221846.744252-1-kuniyu@google.com>
References: <20250822221846.744252-1-kuniyu@google.com>
Message-ID: <20250822221846.744252-8-kuniyu@google.com>
Subject: [PATCH v1 bpf-next/net 7/8] net-memcg: Allow decoupling memcg from global protocol memory accounting.

Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_prot->memory_allocated.  When running under a non-root cgroup,
this memory is also charged to the memcg as "sock" in memory.stat.

Even when a memcg controls memory usage, sockets of such protocols are
still subject to the global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
This makes it difficult to accurately estimate and configure appropriate
global limits, especially in multi-tenant environments.

If all workloads were guaranteed to be controlled under memcg, the issue
could be worked around by setting tcp_mem[0~2] to UINT_MAX.  In reality,
this assumption does not always hold, and a process not controlled by
memcg would lose that seatbelt and could consume memory up to the global
limit, becoming a noisy neighbour.

Let's decouple sockets in memcg from the global per-protocol memory
accounting if they have SK_BPF_MEMCG_SOCK_ISOLATED set in sk->sk_memcg.
This simplifies memcg configuration while keeping the global limits
within a reasonable range.

If mem_cgroup_sk_isolated(sk) returns true, the per-protocol memory
accounting is skipped.

In __inet_accept(), we need to reclaim counts that were already charged
to the global counter for child sockets, because we do not allocate
sk->sk_memcg until accept().

Note that trace_sock_exceed_buf_limit() will always show 0 as accounted
for isolated sockets, but the amount charged to the memcg can be
obtained via memory.stat.

Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.
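The script itself is not included in this series; purely as an
illustration, a loosely equivalent C program could look like the sketch
below (the name pressure.c, the connection count, and the chunk size
are made up here, and error handling is omitted):

	/* pressure.c: hypothetical C equivalent of the pressure.py used
	 * below.  Creates many loopback TCP connections and send()s
	 * without ever recv()ing, so unread data accumulates and socket
	 * memory accounting is exercised.
	 */
	#include <arpa/inet.h>
	#include <fcntl.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	#define NR_CONNS	1000

	int main(void)
	{
		struct sockaddr_in addr = { .sin_family = AF_INET };
		socklen_t len = sizeof(addr);
		char buf[1000];
		int i, lfd;

		lfd = socket(AF_INET, SOCK_STREAM, 0);
		addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
		bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
		listen(lfd, NR_CONNS);
		getsockname(lfd, (struct sockaddr *)&addr, &len);	/* ephemeral port */

		memset(buf, 'a', sizeof(buf));

		for (i = 0; i < NR_CONNS; i++) {
			int cfd = socket(AF_INET, SOCK_STREAM, 0);

			connect(cfd, (struct sockaddr *)&addr, sizeof(addr));
			accept(lfd, NULL, NULL);	/* never recv() on this side */

			/* Queue data until the send buffer is full. */
			fcntl(cfd, F_SETFL, O_NONBLOCK);
			while (send(cfd, buf, sizeof(buf), 0) > 0)
				;
		}

		pause();	/* keep the sockets alive for inspection */
		return 0;
	}

Each iteration leaves one connection with a full send buffer on the
client side and a never-drained receive queue on the accepted side,
which is what drives the "sock" and memory_allocated counters in the
results below.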
Setup:

  # mkdir /sys/fs/cgroup/test
  # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
  # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"

Without bpf prog:

  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &

  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 22642688

  # cat /proc/net/sockstat | grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376

  # ss -tn | head -n 5
  State Recv-Q Send-Q Local Address:Port Peer Address:Port
  ESTAB 2000   0      127.0.0.1:34479    127.0.0.1:53188
  ESTAB 2000   0      127.0.0.1:34479    127.0.0.1:49972
  ESTAB 2000   0      127.0.0.1:34479    127.0.0.1:53868
  ESTAB 2000   0      127.0.0.1:34479    127.0.0.1:53554

  # nstat | grep Pressure || echo no pressure
  TcpExtTCPMemoryPressures        1          0.0

With bpf prog in the next patch:

  # bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg_create type cgroup/sock_create
  # bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg_accept type cgroup/sock_accept
  # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/sk_memcg_create
  # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_accept pinned /sys/fs/bpf/sk_memcg_accept

  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &

  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 2757468160

  # cat /proc/net/sockstat | grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0

  # ss -tn | head -n 5
  State Recv-Q Send-Q Local Address:Port Peer Address:Port
  ESTAB 111000 0      127.0.0.1:36019    127.0.0.1:49026
  ESTAB 110000 0      127.0.0.1:36019    127.0.0.1:45630
  ESTAB 110000 0      127.0.0.1:36019    127.0.0.1:44870
  ESTAB 111000 0      127.0.0.1:36019    127.0.0.1:45274

  # nstat | grep Pressure || echo no pressure
  no pressure

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/proto_memory.h      | 15 ++++++--
 include/net/tcp.h               | 10 ++++--
 net/core/sock.c                 | 64 ++++++++++++++++++++++-----------
 net/ipv4/af_inet.c              | 12 ++++++-
 net/ipv4/inet_connection_sock.c |  1 +
 net/ipv4/tcp.c                  |  3 +-
 net/ipv4/tcp_output.c           | 10 ++++--
 net/mptcp/protocol.c            |  3 +-
 net/tls/tls_device.c            |  4 ++-
 9 files changed, 90 insertions(+), 32 deletions(-)

diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 8e91a8fa31b5..8e8432b13515 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,13 +31,22 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_isolated(sk))
+			return false;
+	}
 
 	return !!READ_ONCE(*sk->sk_prot->memory_pressure);
 }
 
+static inline bool sk_should_enter_memory_pressure(struct sock *sk)
+{
+	return !mem_cgroup_sk_enabled(sk) || !mem_cgroup_sk_isolated(sk);
+}
+
 static inline long
 proto_memory_allocated(const struct proto *prot)
 {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2936b8175950..0191a4585bba 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,9 +275,13 @@ extern unsigned long tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_isolated(sk))
+			return false;
+	}
 
 	return READ_ONCE(tcp_memory_pressure);
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 8002ac6293dc..be5574f9a025 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1046,17 +1046,21 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	if (!charged)
 		return -ENOMEM;
 
-	/* pre-charge to forward_alloc */
-	sk_memory_allocated_add(sk, pages);
-	allocated = sk_memory_allocated(sk);
-	/* If the system goes into memory pressure with this
-	 * precharge, give up and return error.
-	 */
-	if (allocated > sk_prot_mem_limits(sk, 1)) {
-		sk_memory_allocated_sub(sk, pages);
-		mem_cgroup_sk_uncharge(sk, pages);
-		return -ENOMEM;
+	if (!mem_cgroup_sk_isolated(sk)) {
+		/* pre-charge to forward_alloc */
+		sk_memory_allocated_add(sk, pages);
+		allocated = sk_memory_allocated(sk);
+
+		/* If the system goes into memory pressure with this
+		 * precharge, give up and return error.
+		 */
+		if (allocated > sk_prot_mem_limits(sk, 1)) {
+			sk_memory_allocated_sub(sk, pages);
+			mem_cgroup_sk_uncharge(sk, pages);
+			return -ENOMEM;
+		}
 	}
+
 	sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
 	WRITE_ONCE(sk->sk_reserved_mem,
@@ -3153,8 +3157,11 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 	if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
 		return true;
 
-	sk_enter_memory_pressure(sk);
+	if (sk_should_enter_memory_pressure(sk))
+		sk_enter_memory_pressure(sk);
+
 	sk_stream_moderate_sndbuf(sk);
+
 	return false;
 }
 EXPORT_SYMBOL(sk_page_frag_refill);
@@ -3267,18 +3274,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	bool memcg_enabled = false, charged = false;
 	struct proto *prot = sk->sk_prot;
-	long allocated;
-
-	sk_memory_allocated_add(sk, amt);
-	allocated = sk_memory_allocated(sk);
+	long allocated = 0;
 
 	if (mem_cgroup_sk_enabled(sk)) {
+		bool isolated = mem_cgroup_sk_isolated(sk);
+
 		memcg_enabled = true;
 		charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
-		if (!charged)
+
+		if (isolated && charged)
+			return 1;
+
+		if (!charged) {
+			if (!isolated) {
+				sk_memory_allocated_add(sk, amt);
+				allocated = sk_memory_allocated(sk);
+			}
+
 			goto suppress_allocation;
+		}
 	}
 
+	sk_memory_allocated_add(sk, amt);
+	allocated = sk_memory_allocated(sk);
+
 	/* Under limit. */
 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
@@ -3357,7 +3376,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 
 	trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
 
-	sk_memory_allocated_sub(sk, amt);
+	if (allocated)
+		sk_memory_allocated_sub(sk, amt);
 
 	if (charged)
 		mem_cgroup_sk_uncharge(sk, amt);
@@ -3396,11 +3416,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
  */
 void __sk_mem_reduce_allocated(struct sock *sk, int amount)
 {
-	sk_memory_allocated_sub(sk, amount);
-
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_uncharge(sk, amount);
 
+		if (mem_cgroup_sk_isolated(sk))
+			return;
+	}
+
+	sk_memory_allocated_sub(sk, amount);
+
 	if (sk_under_global_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index ab613abdfaa4..e92dfca0a0ff 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -95,6 +95,7 @@
 #include
 #include
 #include
+#include <net/proto_memory.h>
 #include
 #include
 #include
@@ -773,8 +774,17 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
 		 */
 		amt = sk_mem_pages(newsk->sk_forward_alloc +
 				   atomic_read(&newsk->sk_rmem_alloc));
-		if (amt)
+		if (amt) {
+			/* This amt is already charged globally to
+			 * sk_prot->memory_allocated due to lack of
+			 * sk_memcg until accept(), thus we need to
+			 * reclaim it here if newsk is isolated.
+			 */
+			if (mem_cgroup_sk_isolated(newsk))
+				sk_memory_allocated_sub(newsk, amt);
+
 			mem_cgroup_sk_charge(newsk, amt, gfp);
+		}
 	}
 
 	sock_rps_record_flow(newsk);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index ed10b959a906..f8dd53d40dcf 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include <net/proto_memory.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 /* match_sk*_wildcard == true: IPV6_ADDR_ANY equals to any IPv6 addresses
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 71a956fbfc55..dcbd49e2f8af 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -908,7 +908,8 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
 		}
 		__kfree_skb(skb);
 	} else {
-		sk->sk_prot->enter_memory_pressure(sk);
+		if (sk_should_enter_memory_pressure(sk))
+			tcp_enter_memory_pressure(sk);
 		sk_stream_moderate_sndbuf(sk);
 	}
 	return NULL;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dfbac0876d96..f7aa86661219 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3574,12 +3574,18 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
 	delta = size - sk->sk_forward_alloc;
 	if (delta <= 0)
 		return;
+
 	amt = sk_mem_pages(delta);
 	sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
-	sk_memory_allocated_add(sk, amt);
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() |
 				     __GFP_NOFAIL);
+
+		if (mem_cgroup_sk_isolated(sk))
+			return;
+	}
+
+	sk_memory_allocated_add(sk, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9a287b75c1b3..f7487e22a3f8 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include <net/proto_memory.h>
 #include
 #if IS_ENABLED(CONFIG_MPTCP_IPV6)
 #include
@@ -1016,7 +1017,7 @@ static void mptcp_enter_memory_pressure(struct sock *sk)
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
-		if (first)
+		if (first && sk_should_enter_memory_pressure(ssk))
 			tcp_enter_memory_pressure(ssk);
 
 		sk_stream_moderate_sndbuf(ssk);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index f672a62a9a52..6696ef837116 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include <net/proto_memory.h>
 #include
 #include
 #include
@@ -371,7 +372,8 @@ static int tls_do_allocation(struct sock *sk,
 	if (!offload_ctx->open_record) {
 		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
 						   sk->sk_allocation))) {
-			READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
+			if (sk_should_enter_memory_pressure(sk))
+				READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
 			sk_stream_moderate_sndbuf(sk);
 			return -ENOMEM;
 		}
-- 
2.51.0.rc2.233.g662b1ed5c5-goog