* [PATCH v2 net-next 0/2] account for TCP memory pressure signaled by cgroup
@ 2025-07-14 14:36 Daniel Sedlak
2025-07-14 14:36 ` [PATCH v2 net-next 1/2] tcp: account for " Daniel Sedlak
2025-07-14 14:36 ` [PATCH v2 net-next 2/2] mm/vmpressure: add tracepoint for socket pressure detection Daniel Sedlak
0 siblings, 2 replies; 13+ messages in thread
From: Daniel Sedlak @ 2025-07-14 14:36 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima,
David Ahern, Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm,
netdev
Cc: Daniel Sedlak, linux-trace-kernel, linux-doc, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
Greetings kernel wizards,
To summarize the commit messages: the attached patches are the result
of our recent debugging sessions, where we were hunting sudden TCP
throughput drops from tens of Gbps to a few Mbps that lasted for a
nondeterministic amount of time.
The first patch adds a new counter for netstat, so users can notice that
the sockets are marked as under pressure. The second patch adds a new
tracepoint so it is easier to pinpoint which exact cgroup is having
difficulties.
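For illustration, the new counter then shows up next to the existing
TCP pressure counters, so a quick check could look like this (a sketch;
the counter name is the one introduced in patch 1/2):
  nstat -az | grep TCPCgroupSocketPressure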
Individual commit messages should contain more detailed reasoning.
Thanks!
Daniel
Changes:
v1 -> v2:
Add tracepoint
Link: https://lore.kernel.org/netdev/20250707105205.222558-1-daniel.sedlak@cdn77.com/
Daniel Sedlak (1):
tcp: account for memory pressure signaled by cgroup
Matyas Hurtik (1):
mm/vmpressure: add tracepoint for socket pressure detection
.../networking/net_cachelines/snmp.rst | 1 +
include/net/tcp.h | 14 ++++++-----
include/trace/events/memcg.h | 25 +++++++++++++++++++
include/uapi/linux/snmp.h | 1 +
mm/vmpressure.c | 3 +++
net/ipv4/proc.c | 1 +
6 files changed, 39 insertions(+), 6 deletions(-)
cc: linux-trace-kernel@vger.kernel.org
cc: linux-doc@vger.kernel.org
cc: Steven Rostedt <rostedt@goodmis.org>
cc: Masami Hiramatsu <mhiramat@kernel.org>
cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
base-commit: e96ee511c906c59b7c4e6efd9d9b33917730e000
--
2.39.5
* [PATCH v2 net-next 1/2] tcp: account for memory pressure signaled by cgroup
2025-07-14 14:36 [PATCH v2 net-next 0/2] account for TCP memory pressure signaled by cgroup Daniel Sedlak
@ 2025-07-14 14:36 ` Daniel Sedlak
2025-07-16 16:49 ` Shakeel Butt
2025-07-14 14:36 ` [PATCH v2 net-next 2/2] mm/vmpressure: add tracepoint for socket pressure detection Daniel Sedlak
1 sibling, 1 reply; 13+ messages in thread
From: Daniel Sedlak @ 2025-07-14 14:36 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima,
David Ahern, Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm,
netdev
Cc: Daniel Sedlak, Matyas Hurtik
This patch is the result of our long-standing debugging sessions that
started with "networking is slow": TCP network throughput suddenly
dropped from tens of Gbps to a few Mbps, and we could not see anything
in the kernel log or netstat counters.
Currently, we have two memory pressure counters for TCP sockets [1],
which we manipulate only when the memory pressure is signaled through
the proto struct [2]. However, the memory pressure can also be signaled
through the cgroup memory subsystem, which we do not reflect in the
netstat counters. In the end, when the cgroup memory subsystem signals
that it is under pressure, we silently reduce the advertised TCP window
with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
throughput reduction.
So this patch adds a new counter to account for memory pressure
signaled by the memory cgroup, so it is much easier to spot.
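For context, the clamp mentioned above boils down to roughly the
following (a simplified sketch of tcp_adjust_rcv_ssthresh(); the
unused-reserved-memory handling is omitted):
  static inline void tcp_adjust_rcv_ssthresh(struct sock *sk)
  {
  	struct tcp_sock *tp = tcp_sk(sk);
  	/* Under memory pressure, cap the receive window threshold at
  	 * 4 full-sized segments; subsequent ACKs then advertise a tiny
  	 * window and throughput collapses.
  	 */
  	tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
  }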
Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
Co-developed-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
---
Documentation/networking/net_cachelines/snmp.rst | 1 +
include/net/tcp.h | 14 ++++++++------
include/uapi/linux/snmp.h | 1 +
net/ipv4/proc.c | 1 +
4 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
index bd44b3eebbef..ed17ff84e39c 100644
--- a/Documentation/networking/net_cachelines/snmp.rst
+++ b/Documentation/networking/net_cachelines/snmp.rst
@@ -76,6 +76,7 @@ unsigned_long LINUX_MIB_TCPABORTONLINGER
unsigned_long LINUX_MIB_TCPABORTFAILED
unsigned_long LINUX_MIB_TCPMEMORYPRESSURES
unsigned_long LINUX_MIB_TCPMEMORYPRESSURESCHRONO
+unsigned_long LINUX_MIB_TCPCGROUPSOCKETPRESSURE
unsigned_long LINUX_MIB_TCPSACKDISCARD
unsigned_long LINUX_MIB_TCPDSACKIGNOREDOLD
unsigned_long LINUX_MIB_TCPDSACKIGNOREDNOUNDO
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 761c4a0ad386..aae3efe24282 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -267,6 +267,11 @@ extern long sysctl_tcp_mem[3];
#define TCP_RACK_STATIC_REO_WND 0x2 /* Use static RACK reo wnd */
#define TCP_RACK_NO_DUPTHRESH 0x4 /* Do not use DUPACK threshold in RACK */
+#define TCP_INC_STATS(net, field) SNMP_INC_STATS((net)->mib.tcp_statistics, field)
+#define __TCP_INC_STATS(net, field) __SNMP_INC_STATS((net)->mib.tcp_statistics, field)
+#define TCP_DEC_STATS(net, field) SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
+#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
+
extern atomic_long_t tcp_memory_allocated;
DECLARE_PER_CPU(int, tcp_memory_per_cpu_fw_alloc);
@@ -277,8 +282,10 @@ extern unsigned long tcp_memory_pressure;
static inline bool tcp_under_memory_pressure(const struct sock *sk)
{
if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
- mem_cgroup_under_socket_pressure(sk->sk_memcg))
+ mem_cgroup_under_socket_pressure(sk->sk_memcg)) {
+ TCP_INC_STATS(sock_net(sk), LINUX_MIB_TCPCGROUPSOCKETPRESSURE);
return true;
+ }
return READ_ONCE(tcp_memory_pressure);
}
@@ -316,11 +323,6 @@ bool tcp_check_oom(const struct sock *sk, int shift);
extern struct proto tcp_prot;
-#define TCP_INC_STATS(net, field) SNMP_INC_STATS((net)->mib.tcp_statistics, field)
-#define __TCP_INC_STATS(net, field) __SNMP_INC_STATS((net)->mib.tcp_statistics, field)
-#define TCP_DEC_STATS(net, field) SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
-#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
-
void tcp_tsq_work_init(void);
int tcp_v4_err(struct sk_buff *skb, u32);
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 1d234d7e1892..9e8d1a5e56a9 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -231,6 +231,7 @@ enum
LINUX_MIB_TCPABORTFAILED, /* TCPAbortFailed */
LINUX_MIB_TCPMEMORYPRESSURES, /* TCPMemoryPressures */
LINUX_MIB_TCPMEMORYPRESSURESCHRONO, /* TCPMemoryPressuresChrono */
+ LINUX_MIB_TCPCGROUPSOCKETPRESSURE, /* TCPCgroupSocketPressure */
LINUX_MIB_TCPSACKDISCARD, /* TCPSACKDiscard */
LINUX_MIB_TCPDSACKIGNOREDOLD, /* TCPSACKIgnoredOld */
LINUX_MIB_TCPDSACKIGNOREDNOUNDO, /* TCPSACKIgnoredNoUndo */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index ea2f01584379..0bcec9a51fb0 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -235,6 +235,7 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPAbortFailed", LINUX_MIB_TCPABORTFAILED),
SNMP_MIB_ITEM("TCPMemoryPressures", LINUX_MIB_TCPMEMORYPRESSURES),
SNMP_MIB_ITEM("TCPMemoryPressuresChrono", LINUX_MIB_TCPMEMORYPRESSURESCHRONO),
+ SNMP_MIB_ITEM("TCPCgroupSocketPressure", LINUX_MIB_TCPCGROUPSOCKETPRESSURE),
SNMP_MIB_ITEM("TCPSACKDiscard", LINUX_MIB_TCPSACKDISCARD),
SNMP_MIB_ITEM("TCPDSACKIgnoredOld", LINUX_MIB_TCPDSACKIGNOREDOLD),
SNMP_MIB_ITEM("TCPDSACKIgnoredNoUndo", LINUX_MIB_TCPDSACKIGNOREDNOUNDO),
--
2.39.5
* [PATCH v2 net-next 2/2] mm/vmpressure: add tracepoint for socket pressure detection
2025-07-14 14:36 [PATCH v2 net-next 0/2] account for TCP memory pressure signaled by cgroup Daniel Sedlak
2025-07-14 14:36 ` [PATCH v2 net-next 1/2] tcp: account for " Daniel Sedlak
@ 2025-07-14 14:36 ` Daniel Sedlak
2025-07-14 18:02 ` Kuniyuki Iwashima
1 sibling, 1 reply; 13+ messages in thread
From: Daniel Sedlak @ 2025-07-14 14:36 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima,
David Ahern, Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm,
netdev
Cc: Matyas Hurtik, Daniel Sedlak
From: Matyas Hurtik <matyas.hurtik@cdn77.com>
When the vmpressure function marks all sockets within a particular
cgroup as under pressure, it can silently reduce network throughput
significantly. This socket pressure is not currently signaled in any way
to the users, and it is difficult to detect which cgroup is under socket
pressure.
This patch adds a new tracepoint that is called when a cgroup is under
socket pressure.
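A minimal usage sketch (assuming tracefs is mounted at
/sys/kernel/tracing and the event is created under the memcg group this
header belongs to; the sample values are illustrative):
  echo 1 > /sys/kernel/tracing/events/memcg/memcg_socket_under_pressure/enable
  cat /sys/kernel/tracing/trace_pipe
  # ... memcg_socket_under_pressure: memcg_id=1234 scanned=512 reclaimed=32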
Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
Co-developed-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
---
include/trace/events/memcg.h | 25 +++++++++++++++++++++++++
mm/vmpressure.c | 3 +++
2 files changed, 28 insertions(+)
diff --git a/include/trace/events/memcg.h b/include/trace/events/memcg.h
index dfe2f51019b4..19a51db73913 100644
--- a/include/trace/events/memcg.h
+++ b/include/trace/events/memcg.h
@@ -100,6 +100,31 @@ TRACE_EVENT(memcg_flush_stats,
__entry->force, __entry->needs_flush)
);
+TRACE_EVENT(memcg_socket_under_pressure,
+
+ TP_PROTO(const struct mem_cgroup *memcg, unsigned long scanned,
+ unsigned long reclaimed),
+
+ TP_ARGS(memcg, scanned, reclaimed),
+
+ TP_STRUCT__entry(
+ __field(u64, id)
+ __field(unsigned long, scanned)
+ __field(unsigned long, reclaimed)
+ ),
+
+ TP_fast_assign(
+ __entry->id = cgroup_id(memcg->css.cgroup);
+ __entry->scanned = scanned;
+ __entry->reclaimed = reclaimed;
+ ),
+
+ TP_printk("memcg_id=%llu scanned=%lu reclaimed=%lu",
+ __entry->id,
+ __entry->scanned,
+ __entry->reclaimed)
+);
+
#endif /* _TRACE_MEMCG_H */
/* This part must be outside protection */
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index bd5183dfd879..aa9583066731 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -21,6 +21,8 @@
#include <linux/printk.h>
#include <linux/vmpressure.h>
+#include <trace/events/memcg.h>
+
/*
* The window size (vmpressure_win) is the number of scanned pages before
* we try to analyze scanned/reclaimed ratio. So the window is used as a
@@ -317,6 +319,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
* pressure events can occur.
*/
WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
+ trace_memcg_socket_under_pressure(memcg, scanned, reclaimed);
}
}
}
--
2.39.5
* Re: [PATCH v2 net-next 2/2] mm/vmpressure: add tracepoint for socket pressure detection
2025-07-14 14:36 ` [PATCH v2 net-next 2/2] mm/vmpressure: add tracepoint for socket pressure detection Daniel Sedlak
@ 2025-07-14 18:02 ` Kuniyuki Iwashima
2025-07-15 7:01 ` Daniel Sedlak
0 siblings, 1 reply; 13+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-14 18:02 UTC (permalink / raw)
To: Daniel Sedlak
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern,
Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm, netdev,
Matyas Hurtik, Daniel Sedlak
On Mon, Jul 14, 2025 at 7:37 AM Daniel Sedlak <daniel.sedlak@cdn77.com> wrote:
>
> From: Matyas Hurtik <matyas.hurtik@cdn77.com>
>
> When the vmpressure function marks all sockets within a particular
> cgroup as under pressure, it can silently reduce network throughput
> significantly. This socket pressure is not currently signaled in any way
> to the users, and it is difficult to detect which cgroup is under socket
> pressure.
>
> This patch adds a new tracepoint that is called when a cgroup is under
> socket pressure.
>
> Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> Co-developed-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
> Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
> ---
> include/trace/events/memcg.h | 25 +++++++++++++++++++++++++
> mm/vmpressure.c | 3 +++
> 2 files changed, 28 insertions(+)
>
> diff --git a/include/trace/events/memcg.h b/include/trace/events/memcg.h
> index dfe2f51019b4..19a51db73913 100644
> --- a/include/trace/events/memcg.h
> +++ b/include/trace/events/memcg.h
> @@ -100,6 +100,31 @@ TRACE_EVENT(memcg_flush_stats,
> __entry->force, __entry->needs_flush)
> );
>
> +TRACE_EVENT(memcg_socket_under_pressure,
> +
> + TP_PROTO(const struct mem_cgroup *memcg, unsigned long scanned,
> + unsigned long reclaimed),
> +
> + TP_ARGS(memcg, scanned, reclaimed),
> +
> + TP_STRUCT__entry(
> + __field(u64, id)
> + __field(unsigned long, scanned)
> + __field(unsigned long, reclaimed)
> + ),
> +
> + TP_fast_assign(
> + __entry->id = cgroup_id(memcg->css.cgroup);
> + __entry->scanned = scanned;
> + __entry->reclaimed = reclaimed;
> + ),
> +
> + TP_printk("memcg_id=%llu scanned=%lu reclaimed=%lu",
> + __entry->id,
Maybe a noob question: How can we translate the memcg ID
to the /sys/fs/cgroup/... path ?
It would be nice to place this patch first and have the description of
patch 2 explain how to use the new stat with this tracepoint.
> + __entry->scanned,
> + __entry->reclaimed)
> +);
> +
> #endif /* _TRACE_MEMCG_H */
>
> /* This part must be outside protection */
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index bd5183dfd879..aa9583066731 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -21,6 +21,8 @@
> #include <linux/printk.h>
> #include <linux/vmpressure.h>
>
> +#include <trace/events/memcg.h>
> +
> /*
> * The window size (vmpressure_win) is the number of scanned pages before
> * we try to analyze scanned/reclaimed ratio. So the window is used as a
> @@ -317,6 +319,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> * pressure events can occur.
> */
> WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> + trace_memcg_socket_under_pressure(memcg, scanned, reclaimed);
This is triggered only when we enter the memory pressure state
and not when we leave the state, right ? Is it possible to issue
such an event ?
> }
> }
> }
> --
> 2.39.5
>
* Re: [PATCH v2 net-next 2/2] mm/vmpressure: add tracepoint for socket pressure detection
2025-07-14 18:02 ` Kuniyuki Iwashima
@ 2025-07-15 7:01 ` Daniel Sedlak
2025-07-15 17:17 ` Kuniyuki Iwashima
0 siblings, 1 reply; 13+ messages in thread
From: Daniel Sedlak @ 2025-07-15 7:01 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern,
Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm, netdev,
Matyas Hurtik, Daniel Sedlak
Hi Kuniyuki,
On 7/14/25 8:02 PM, Kuniyuki Iwashima wrote:
>> +TRACE_EVENT(memcg_socket_under_pressure,
>> +
>> + TP_PROTO(const struct mem_cgroup *memcg, unsigned long scanned,
>> + unsigned long reclaimed),
>> +
>> + TP_ARGS(memcg, scanned, reclaimed),
>> +
>> + TP_STRUCT__entry(
>> + __field(u64, id)
>> + __field(unsigned long, scanned)
>> + __field(unsigned long, reclaimed)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->id = cgroup_id(memcg->css.cgroup);
>> + __entry->scanned = scanned;
>> + __entry->reclaimed = reclaimed;
>> + ),
>> +
>> + TP_printk("memcg_id=%llu scanned=%lu reclaimed=%lu",
>> + __entry->id,
>
> Maybe a noob question: How can we translate the memcg ID
> to the /sys/fs/cgroup/... path ?
IMO this should really be named `cgroup_id` instead of `memcg_id`, but
we kept the latter for consistency with the rest of the file.
To find the cgroup path you can use:
- find /sys/fs/cgroup/ -inum `memcg_id`, which will print the path to
the affected cgroup.
- or you can use bpftrace tracepoint hooks, where there is a helper
function [1].
Or we could put the cgroup path in the tracepoint instead of the ID, but
I feel it could be too much overhead; the paths can be pretty long.
Link: https://bpftrace.org/docs/latest#functions-cgroup_path [1]
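For example, an untested bpftrace sketch using that helper could look
like:
  bpftrace -e 'tracepoint:memcg:memcg_socket_under_pressure {
      printf("%s\n", cgroup_path(args->id));
  }'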
> It would be nice to place this patch first and have the description of
> patch 2 explain how to use the new stat with this tracepoint.
Sure, I can do that. However, I am unsure how good an idea it is to
cross-reference commits, since each may go through a different tree,
because each commit is for a different subsystem. They would have to go
through one tree, right?
>> + __entry->scanned,
>> + __entry->reclaimed)
>> +);
>> +
>> #endif /* _TRACE_MEMCG_H */
>>
>> /* This part must be outside protection */
>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> index bd5183dfd879..aa9583066731 100644
>> --- a/mm/vmpressure.c
>> +++ b/mm/vmpressure.c
>> @@ -21,6 +21,8 @@
>> #include <linux/printk.h>
>> #include <linux/vmpressure.h>
>>
>> +#include <trace/events/memcg.h>
>> +
>> /*
>> * The window size (vmpressure_win) is the number of scanned pages before
>> * we try to analyze scanned/reclaimed ratio. So the window is used as a
>> @@ -317,6 +319,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
>> * pressure events can occur.
>> */
>> WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
>> + trace_memcg_socket_under_pressure(memcg, scanned, reclaimed);
>
> This is triggered only when we enter the memory pressure state
> and not when we leave the state, right ? Is it possible to issue
> such an event ?
AFAIK, the API currently used in the vmpressure function does not have
any notion of entering or leaving the socket memory pressure state. It
only periodically re-arms the socket pressure for a specific duration,
so the current tracepoint is called when the socket pressure is
re-armed.
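Roughly, the mechanism looks like this (a simplified sketch based on
the hunk above and mem_cgroup_under_socket_pressure(); the memcg
hierarchy walk is omitted):
  /* vmpressure(): (re-)arm the pressure window for ~1 second */
  WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
  /* socket path: under pressure while the window has not expired */
  if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
  	return true;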
I am not sure how feasible it is to rewrite it so we have enter and
leave logic. I am not that familiar with those code paths.
Thanks!
Daniel
* Re: [PATCH v2 net-next 2/2] mm/vmpressure: add tracepoint for socket pressure detection
2025-07-15 7:01 ` Daniel Sedlak
@ 2025-07-15 17:17 ` Kuniyuki Iwashima
2025-07-15 17:46 ` Kuniyuki Iwashima
0 siblings, 1 reply; 13+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-15 17:17 UTC (permalink / raw)
To: Daniel Sedlak
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern,
Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm, netdev,
Matyas Hurtik, Daniel Sedlak
On Tue, Jul 15, 2025 at 12:01 AM Daniel Sedlak <daniel.sedlak@cdn77.com> wrote:
>
> Hi Kuniyuki,
>
> On 7/14/25 8:02 PM, Kuniyuki Iwashima wrote:
> >> +TRACE_EVENT(memcg_socket_under_pressure,
> >> +
> >> + TP_PROTO(const struct mem_cgroup *memcg, unsigned long scanned,
> >> + unsigned long reclaimed),
> >> +
> >> + TP_ARGS(memcg, scanned, reclaimed),
> >> +
> >> + TP_STRUCT__entry(
> >> + __field(u64, id)
> >> + __field(unsigned long, scanned)
> >> + __field(unsigned long, reclaimed)
> >> + ),
> >> +
> >> + TP_fast_assign(
> >> + __entry->id = cgroup_id(memcg->css.cgroup);
> >> + __entry->scanned = scanned;
> >> + __entry->reclaimed = reclaimed;
> >> + ),
> >> +
> >> + TP_printk("memcg_id=%llu scanned=%lu reclaimed=%lu",
> >> + __entry->id,
> >
> > Maybe a noob question: How can we translate the memcg ID
> > to the /sys/fs/cgroup/... path ?
>
> IMO this should really be named `cgroup_id` instead of `memcg_id`, but
> we kept the latter for consistency with the rest of the file.
>
> To find the cgroup path you can use:
> - find /sys/fs/cgroup/ -inum `memcg_id`, which will print the path to
> the affected cgroup.
> - or you can use bpftrace tracepoint hooks, where there is a helper
> function [1].
Thanks, this is good to know and worth noting in the commit message.
>
> Or we could put the cgroup path in the tracepoint instead of the ID, but
> I feel it could be too much overhead; the paths can be pretty long.
Agree, the ID is good enough given we can find the cgroup with a one-liner.
>
> Link: https://bpftrace.org/docs/latest#functions-cgroup_path [1]
> > It would be nice to place this patch first and have the description of
> > patch 2 explain how to use the new stat with this tracepoint.
>
> Sure, I can do that. However, I am unsure how good an idea it is to
> cross-reference commits, since each may go through a different tree,
> because each commit is for a different subsystem. They would have to go
> through one tree, right?
Right. Probably you can just assume both patches will be merged: post
the tracepoint patch to the mm ML first, then add its lore.kernel.org
link and a how-to to the stat patch and post it to the netdev ML.
>
>
> >> + __entry->scanned,
> >> + __entry->reclaimed)
> >> +);
> >> +
> >> #endif /* _TRACE_MEMCG_H */
> >>
> >> /* This part must be outside protection */
> >> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> >> index bd5183dfd879..aa9583066731 100644
> >> --- a/mm/vmpressure.c
> >> +++ b/mm/vmpressure.c
> >> @@ -21,6 +21,8 @@
> >> #include <linux/printk.h>
> >> #include <linux/vmpressure.h>
> >>
> >> +#include <trace/events/memcg.h>
> >> +
> >> /*
> >> * The window size (vmpressure_win) is the number of scanned pages before
> >> * we try to analyze scanned/reclaimed ratio. So the window is used as a
> >> @@ -317,6 +319,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> >> * pressure events can occur.
> >> */
> >> WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> >> + trace_memcg_socket_under_pressure(memcg, scanned, reclaimed);
> >
> > This is triggered only when we enter the memory pressure state
> > and not when we leave the state, right ? Is it possible to issue
> > such an event ?
>
> AFAIK, the API currently used in the vmpressure function does not have
> any notion of entering or leaving the socket memory pressure state. It
> only periodically re-arms the socket pressure for a specific duration,
> so the current tracepoint is called when the socket pressure is
> re-armed.
>
> I am not sure how feasible it is to rewrite it so we have enter and
> leave logic. I am not that familiar with those code paths.
Unlike tcp mem pressure, the memcg pressure only persists for one
second, so it wouldn't make sense to add a "leave" event, and I guess
the assumption is that socket_pressure will keep being updated while
under memory pressure.
* Re: [PATCH v2 net-next 2/2] mm/vmpressure: add tracepoint for socket pressure detection
2025-07-15 17:17 ` Kuniyuki Iwashima
@ 2025-07-15 17:46 ` Kuniyuki Iwashima
2025-07-16 8:47 ` Daniel Sedlak
0 siblings, 1 reply; 13+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-15 17:46 UTC (permalink / raw)
To: Daniel Sedlak
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern,
Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm, netdev,
Matyas Hurtik, Daniel Sedlak
On Tue, Jul 15, 2025 at 10:17 AM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Tue, Jul 15, 2025 at 12:01 AM Daniel Sedlak <daniel.sedlak@cdn77.com> wrote:
> >
> > Hi Kuniyuki,
> >
> > On 7/14/25 8:02 PM, Kuniyuki Iwashima wrote:
> > >> +TRACE_EVENT(memcg_socket_under_pressure,
> > >> +
> > >> + TP_PROTO(const struct mem_cgroup *memcg, unsigned long scanned,
> > >> + unsigned long reclaimed),
> > >> +
> > >> + TP_ARGS(memcg, scanned, reclaimed),
> > >> +
> > >> + TP_STRUCT__entry(
> > >> + __field(u64, id)
> > >> + __field(unsigned long, scanned)
> > >> + __field(unsigned long, reclaimed)
> > >> + ),
> > >> +
> > >> + TP_fast_assign(
> > >> + __entry->id = cgroup_id(memcg->css.cgroup);
> > >> + __entry->scanned = scanned;
> > >> + __entry->reclaimed = reclaimed;
> > >> + ),
> > >> +
> > >> + TP_printk("memcg_id=%llu scanned=%lu reclaimed=%lu",
> > >> + __entry->id,
> > >
> > > Maybe a noob question: How can we translate the memcg ID
> > > to the /sys/fs/cgroup/... path ?
> >
> > IMO this should really be named `cgroup_id` instead of `memcg_id`, but
> > we kept the latter for consistency with the rest of the file.
> >
> > To find the cgroup path you can use:
> > - find /sys/fs/cgroup/ -inum `memcg_id`, which will print the path to
> > the affected cgroup.
> > - or you can use bpftrace tracepoint hooks, where there is a helper
> > function [1].
>
> Thanks, this is good to know and worth noting in the commit message.
>
> >
> > Or we could put the cgroup path in the tracepoint instead of the ID, but
> > I feel it could be too much overhead; the paths can be pretty long.
>
> Agree, the ID is good enough given we can find the cgroup with a one-liner.
>
> >
> > Link: https://bpftrace.org/docs/latest#functions-cgroup_path [1]
> > > It would be nice to place this patch first and have the description of
> > > patch 2 explain how to use the new stat with this tracepoint.
> >
> > Sure, I can do that. However, I am unsure how good an idea it is to
> > cross-reference commits, since each may go through a different tree,
> > because each commit is for a different subsystem. They would have to go
> > through one tree, right?
>
> Right.
Sorry, I meant to say that the two patches don't need to go through a
single tree together; you can post them separately, as each change is
independent.
> Probably you can just assume both patches will be merged: post the
> tracepoint patch to the mm ML first, then add its lore.kernel.org link
> and a how-to to the stat patch and post it to the netdev ML.
* Re: [PATCH v2 net-next 2/2] mm/vmpressure: add tracepoint for socket pressure detection
2025-07-15 17:46 ` Kuniyuki Iwashima
@ 2025-07-16 8:47 ` Daniel Sedlak
0 siblings, 0 replies; 13+ messages in thread
From: Daniel Sedlak @ 2025-07-16 8:47 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern,
Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm, netdev,
Matyas Hurtik, Daniel Sedlak
Hi Kuniyuki,
On 7/15/25 7:46 PM, Kuniyuki Iwashima wrote:
>>>> Maybe a noob question: How can we translate the memcg ID
>>>> to the /sys/fs/cgroup/... path ?
>>>
>>> IMO this should really be named `cgroup_id` instead of `memcg_id`, but
>>> we kept the latter for consistency with the rest of the file.
>>>
>>> To find the cgroup path you can use:
>>> - find /sys/fs/cgroup/ -inum `memcg_id`, which will print the path to
>>> the affected cgroup.
>>> - or you can use bpftrace tracepoint hooks, where there is a helper
>>> function [1].
>>
>> Thanks, this is good to know and worth noting in the commit message.
Sure, I will add it in v3.
>>>> It would be nice to place this patch first and have the description of
>>>> patch 2 explain how to use the new stat with this tracepoint.
>>>
>>> Sure, I can do that. However, I am unsure how good an idea it is to
>>> cross-reference commits, since each may go through a different tree,
>>> because each commit is for a different subsystem. They would have to go
>>> through one tree, right?
>>
>> Right.
>
> Sorry, I meant to say that the two patches don't need to go through a
> single tree together; you can post them separately, as each change is
> independent.
Just to make sure we are on the same page: are you suggesting first
posting this second patch to mm and, once (if) it gets merged,
referencing it from the first patch and sending the first patch to
netdev?
Thanks!
Daniel
* Re: [PATCH v2 net-next 1/2] tcp: account for memory pressure signaled by cgroup
2025-07-14 14:36 ` [PATCH v2 net-next 1/2] tcp: account for " Daniel Sedlak
@ 2025-07-16 16:49 ` Shakeel Butt
2025-07-16 18:07 ` Kuniyuki Iwashima
0 siblings, 1 reply; 13+ messages in thread
From: Shakeel Butt @ 2025-07-16 16:49 UTC (permalink / raw)
To: Daniel Sedlak
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Matyas Hurtik
On Mon, Jul 14, 2025 at 04:36:12PM +0200, Daniel Sedlak wrote:
> This patch is the result of our long-standing debugging sessions that
> started with "networking is slow": TCP network throughput suddenly
> dropped from tens of Gbps to a few Mbps, and we could not see anything
> in the kernel log or netstat counters.
>
> Currently, we have two memory pressure counters for TCP sockets [1],
> which we manipulate only when the memory pressure is signaled through
> the proto struct [2]. However, the memory pressure can also be signaled
> through the cgroup memory subsystem, which we do not reflect in the
> netstat counters. In the end, when the cgroup memory subsystem signals
> that it is under pressure, we silently reduce the advertised TCP window
> with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
> throughput reduction.
>
> So this patch adds a new counter to account for memory pressure
> signaled by the memory cgroup, so it is much easier to spot.
>
> Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
> Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
> Co-developed-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
> ---
> Documentation/networking/net_cachelines/snmp.rst | 1 +
> include/net/tcp.h | 14 ++++++++------
> include/uapi/linux/snmp.h | 1 +
> net/ipv4/proc.c | 1 +
> 4 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
> index bd44b3eebbef..ed17ff84e39c 100644
> --- a/Documentation/networking/net_cachelines/snmp.rst
> +++ b/Documentation/networking/net_cachelines/snmp.rst
> @@ -76,6 +76,7 @@ unsigned_long LINUX_MIB_TCPABORTONLINGER
> unsigned_long LINUX_MIB_TCPABORTFAILED
> unsigned_long LINUX_MIB_TCPMEMORYPRESSURES
> unsigned_long LINUX_MIB_TCPMEMORYPRESSURESCHRONO
> +unsigned_long LINUX_MIB_TCPCGROUPSOCKETPRESSURE
> unsigned_long LINUX_MIB_TCPSACKDISCARD
> unsigned_long LINUX_MIB_TCPDSACKIGNOREDOLD
> unsigned_long LINUX_MIB_TCPDSACKIGNOREDNOUNDO
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 761c4a0ad386..aae3efe24282 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -267,6 +267,11 @@ extern long sysctl_tcp_mem[3];
> #define TCP_RACK_STATIC_REO_WND 0x2 /* Use static RACK reo wnd */
> #define TCP_RACK_NO_DUPTHRESH 0x4 /* Do not use DUPACK threshold in RACK */
>
> +#define TCP_INC_STATS(net, field) SNMP_INC_STATS((net)->mib.tcp_statistics, field)
> +#define __TCP_INC_STATS(net, field) __SNMP_INC_STATS((net)->mib.tcp_statistics, field)
> +#define TCP_DEC_STATS(net, field) SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
> +#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
> +
> extern atomic_long_t tcp_memory_allocated;
> DECLARE_PER_CPU(int, tcp_memory_per_cpu_fw_alloc);
>
> @@ -277,8 +282,10 @@ extern unsigned long tcp_memory_pressure;
> static inline bool tcp_under_memory_pressure(const struct sock *sk)
> {
> if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
> - mem_cgroup_under_socket_pressure(sk->sk_memcg))
> + mem_cgroup_under_socket_pressure(sk->sk_memcg)) {
> + TCP_INC_STATS(sock_net(sk), LINUX_MIB_TCPCGROUPSOCKETPRESSURE);
> return true;
Incrementing it here will give this stat very different semantics
compared to LINUX_MIB_TCPMEMORYPRESSURES. Here the increments mean the
number of times the kernel checks whether a given socket is under memcg
pressure for a net namespace. Is that what we want?
* Re: [PATCH v2 net-next 1/2] tcp: account for memory pressure signaled by cgroup
2025-07-16 16:49 ` Shakeel Butt
@ 2025-07-16 18:07 ` Kuniyuki Iwashima
2025-07-16 18:37 ` Shakeel Butt
2025-07-17 15:31 ` Daniel Sedlak
0 siblings, 2 replies; 13+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-16 18:07 UTC (permalink / raw)
To: Shakeel Butt
Cc: Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Matyas Hurtik
On Wed, Jul 16, 2025 at 9:50 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jul 14, 2025 at 04:36:12PM +0200, Daniel Sedlak wrote:
> > This patch is the result of our long-standing debugging sessions that
> > started with "networking is slow": TCP network throughput suddenly
> > dropped from tens of Gbps to a few Mbps, and we could not see anything
> > in the kernel log or netstat counters.
> >
> > Currently, we have two memory pressure counters for TCP sockets [1],
> > which we manipulate only when the memory pressure is signaled through
> > the proto struct [2]. However, the memory pressure can also be signaled
> > through the cgroup memory subsystem, which we do not reflect in the
> > netstat counters. In the end, when the cgroup memory subsystem signals
> > that it is under pressure, we silently reduce the advertised TCP window
> > with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
> > throughput reduction.
> >
> > So this patch adds a new counter to account for memory pressure
> > signaled by the memory cgroup, so it is much easier to spot.
> >
> > Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
> > Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
> > Co-developed-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> > Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> > Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
> > ---
> > Documentation/networking/net_cachelines/snmp.rst | 1 +
> > include/net/tcp.h | 14 ++++++++------
> > include/uapi/linux/snmp.h | 1 +
> > net/ipv4/proc.c | 1 +
> > 4 files changed, 11 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
> > index bd44b3eebbef..ed17ff84e39c 100644
> > --- a/Documentation/networking/net_cachelines/snmp.rst
> > +++ b/Documentation/networking/net_cachelines/snmp.rst
> > @@ -76,6 +76,7 @@ unsigned_long LINUX_MIB_TCPABORTONLINGER
> > unsigned_long LINUX_MIB_TCPABORTFAILED
> > unsigned_long LINUX_MIB_TCPMEMORYPRESSURES
> > unsigned_long LINUX_MIB_TCPMEMORYPRESSURESCHRONO
> > +unsigned_long LINUX_MIB_TCPCGROUPSOCKETPRESSURE
> > unsigned_long LINUX_MIB_TCPSACKDISCARD
> > unsigned_long LINUX_MIB_TCPDSACKIGNOREDOLD
> > unsigned_long LINUX_MIB_TCPDSACKIGNOREDNOUNDO
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index 761c4a0ad386..aae3efe24282 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -267,6 +267,11 @@ extern long sysctl_tcp_mem[3];
> > #define TCP_RACK_STATIC_REO_WND 0x2 /* Use static RACK reo wnd */
> > #define TCP_RACK_NO_DUPTHRESH 0x4 /* Do not use DUPACK threshold in RACK */
> >
> > +#define TCP_INC_STATS(net, field) SNMP_INC_STATS((net)->mib.tcp_statistics, field)
> > +#define __TCP_INC_STATS(net, field) __SNMP_INC_STATS((net)->mib.tcp_statistics, field)
> > +#define TCP_DEC_STATS(net, field) SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
> > +#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
> > +
> > extern atomic_long_t tcp_memory_allocated;
> > DECLARE_PER_CPU(int, tcp_memory_per_cpu_fw_alloc);
> >
> > @@ -277,8 +282,10 @@ extern unsigned long tcp_memory_pressure;
> > static inline bool tcp_under_memory_pressure(const struct sock *sk)
> > {
> > if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
> > - mem_cgroup_under_socket_pressure(sk->sk_memcg))
> > + mem_cgroup_under_socket_pressure(sk->sk_memcg)) {
> > + TCP_INC_STATS(sock_net(sk), LINUX_MIB_TCPCGROUPSOCKETPRESSURE);
> > return true;
>
> Incrementing it here will give a very different semantic to this stat
> compared to LINUX_MIB_TCPMEMORYPRESSURES. Here the increments mean the
> number of times the kernel check if a given socket is under memcg
> pressure for a net namespace. Is that what we want?
I'm trying to decouple sk_memcg from the global tcp_memory_allocated
as you and Wei planned before, and the two accountings have had
different semantics from day 1 and will keep them, so a new stat having
different semantics would be fine.
But I think a per-memcg stat like memory.stat.XXX would be a better fit
than a per-netns one, because one netns can be shared by multiple
cgroups and multiple sockets in the same cgroup can be spread across
multiple netns.
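E.g. the consumer side would then be a per-cgroup read (the counter
name below is purely hypothetical):
  grep sock_pressure /sys/fs/cgroup/<group>/memory.stat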
* Re: [PATCH v2 net-next 1/2] tcp: account for memory pressure signaled by cgroup
2025-07-16 18:07 ` Kuniyuki Iwashima
@ 2025-07-16 18:37 ` Shakeel Butt
2025-07-17 15:31 ` Daniel Sedlak
1 sibling, 0 replies; 13+ messages in thread
From: Shakeel Butt @ 2025-07-16 18:37 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Matyas Hurtik
On Wed, Jul 16, 2025 at 11:07:44AM -0700, Kuniyuki Iwashima wrote:
> On Wed, Jul 16, 2025 at 9:50 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Mon, Jul 14, 2025 at 04:36:12PM +0200, Daniel Sedlak wrote:
> > > This patch is the result of our long-standing debugging sessions that
> > > started with "networking is slow": TCP network throughput suddenly
> > > dropped from tens of Gbps to a few Mbps, and we could not see anything
> > > in the kernel log or netstat counters.
> > >
> > > Currently, we have two memory pressure counters for TCP sockets [1],
> > > which we manipulate only when the memory pressure is signaled through
> > > the proto struct [2]. However, the memory pressure can also be signaled
> > > through the cgroup memory subsystem, which we do not reflect in the
> > > netstat counters. In the end, when the cgroup memory subsystem signals
> > > that it is under pressure, we silently reduce the advertised TCP window
> > > with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
> > > throughput reduction.
> > >
> > > So this patch adds a new counter to account for memory pressure
> > > signaled by the memory cgroup, so it is much easier to spot.
> > >
> > > Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
> > > Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
> > > Co-developed-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> > > Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> > > Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
> > > ---
> > > Documentation/networking/net_cachelines/snmp.rst | 1 +
> > > include/net/tcp.h | 14 ++++++++------
> > > include/uapi/linux/snmp.h | 1 +
> > > net/ipv4/proc.c | 1 +
> > > 4 files changed, 11 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
> > > index bd44b3eebbef..ed17ff84e39c 100644
> > > --- a/Documentation/networking/net_cachelines/snmp.rst
> > > +++ b/Documentation/networking/net_cachelines/snmp.rst
> > > @@ -76,6 +76,7 @@ unsigned_long LINUX_MIB_TCPABORTONLINGER
> > > unsigned_long LINUX_MIB_TCPABORTFAILED
> > > unsigned_long LINUX_MIB_TCPMEMORYPRESSURES
> > > unsigned_long LINUX_MIB_TCPMEMORYPRESSURESCHRONO
> > > +unsigned_long LINUX_MIB_TCPCGROUPSOCKETPRESSURE
> > > unsigned_long LINUX_MIB_TCPSACKDISCARD
> > > unsigned_long LINUX_MIB_TCPDSACKIGNOREDOLD
> > > unsigned_long LINUX_MIB_TCPDSACKIGNOREDNOUNDO
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index 761c4a0ad386..aae3efe24282 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -267,6 +267,11 @@ extern long sysctl_tcp_mem[3];
> > > #define TCP_RACK_STATIC_REO_WND 0x2 /* Use static RACK reo wnd */
> > > #define TCP_RACK_NO_DUPTHRESH 0x4 /* Do not use DUPACK threshold in RACK */
> > >
> > > +#define TCP_INC_STATS(net, field) SNMP_INC_STATS((net)->mib.tcp_statistics, field)
> > > +#define __TCP_INC_STATS(net, field) __SNMP_INC_STATS((net)->mib.tcp_statistics, field)
> > > +#define TCP_DEC_STATS(net, field) SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
> > > +#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
> > > +
> > > extern atomic_long_t tcp_memory_allocated;
> > > DECLARE_PER_CPU(int, tcp_memory_per_cpu_fw_alloc);
> > >
> > > @@ -277,8 +282,10 @@ extern unsigned long tcp_memory_pressure;
> > > static inline bool tcp_under_memory_pressure(const struct sock *sk)
> > > {
> > > if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
> > > - mem_cgroup_under_socket_pressure(sk->sk_memcg))
> > > + mem_cgroup_under_socket_pressure(sk->sk_memcg)) {
> > > + TCP_INC_STATS(sock_net(sk), LINUX_MIB_TCPCGROUPSOCKETPRESSURE);
> > > return true;
> >
> > Incrementing it here will give this stat very different semantics
> > compared to LINUX_MIB_TCPMEMORYPRESSURES. Here the increments mean the
> > number of times the kernel checks whether a given socket is under memcg
> > pressure for a net namespace. Is that what we want?
>
> I'm trying to decouple sk_memcg from the global tcp_memory_allocated
> as you and Wei planned before, and the two accountings have had
> different semantics from day 1 and will keep them, so a new stat having
> different semantics would be fine.
>
> But I think a per-memcg stat like memory.stat.XXX would be a better fit
> than a per-netns one, because one netns can be shared by multiple
> cgroups and multiple sockets in the same cgroup can be spread across
> multiple netns.
Yeah, it makes much more sense to have a memcg stat for memcg-based
socket pressure.
* Re: [PATCH v2 net-next 1/2] tcp: account for memory pressure signaled by cgroup
2025-07-16 18:07 ` Kuniyuki Iwashima
2025-07-16 18:37 ` Shakeel Butt
@ 2025-07-17 15:31 ` Daniel Sedlak
2025-07-17 17:26 ` Kuniyuki Iwashima
1 sibling, 1 reply; 13+ messages in thread
From: Daniel Sedlak @ 2025-07-17 15:31 UTC (permalink / raw)
To: Kuniyuki Iwashima, Shakeel Butt
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern,
Andrew Morton, Yosry Ahmed, linux-mm, netdev, Matyas Hurtik
On 7/16/25 8:07 PM, Kuniyuki Iwashima wrote:
>> Incrementing it here will give this stat very different semantics
>> compared to LINUX_MIB_TCPMEMORYPRESSURES. Here the increments mean the
>> number of times the kernel checks whether a given socket is under memcg
>> pressure for a net namespace. Is that what we want?
>
> I'm trying to decouple sk_memcg from the global tcp_memory_allocated
> as you and Wei planned before, and the two accountings have had
> different semantics from day 1 and will keep them, so a new stat having
> different semantics would be fine.
>
> But I think a per-memcg stat like memory.stat.XXX would be a better fit
> than a per-netns one, because one netns can be shared by multiple
> cgroups and multiple sockets in the same cgroup can be spread across
> multiple netns.
I can move the counter to memory.stat.XXX in place of this patch, if no
one has started working on that yet.
Or are you suggesting keeping this change and also adding it to
memory.stat.XXX?
Thanks!
Daniel
* Re: [PATCH v2 net-next 1/2] tcp: account for memory pressure signaled by cgroup
2025-07-17 15:31 ` Daniel Sedlak
@ 2025-07-17 17:26 ` Kuniyuki Iwashima
0 siblings, 0 replies; 13+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-17 17:26 UTC (permalink / raw)
To: Daniel Sedlak
Cc: Shakeel Butt, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Matyas Hurtik
On Thu, Jul 17, 2025 at 8:31 AM Daniel Sedlak <daniel.sedlak@cdn77.com> wrote:
>
> On 7/16/25 8:07 PM, Kuniyuki Iwashima wrote:
> >> Incrementing it here will give this stat very different semantics
> >> compared to LINUX_MIB_TCPMEMORYPRESSURES. Here the increments mean the
> >> number of times the kernel checks whether a given socket is under memcg
> >> pressure for a net namespace. Is that what we want?
> >
> > I'm trying to decouple sk_memcg from the global tcp_memory_allocated
> > as you and Wei planned before, and the two accountings have had
> > different semantics from day 1 and will keep them, so a new stat having
> > different semantics would be fine.
> >
> > But I think a per-memcg stat like memory.stat.XXX would be a better fit
> > than a per-netns one, because one netns can be shared by multiple
> > cgroups and multiple sockets in the same cgroup can be spread across
> > multiple netns.
>
> I can move the counter to memory.stat.XXX in place of this patch
Please do so. Per-netns stats could be confusing in the setups described
above.
Thanks!