* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-05 6:44 [PATCH v4] memcg: expose socket memory pressure in a cgroup Daniel Sedlak
@ 2025-08-05 15:54 ` Kuniyuki Iwashima
2025-08-05 23:02 ` Shakeel Butt
2025-08-09 18:32 ` Tejun Heo
2 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-05 15:54 UTC (permalink / raw)
To: Daniel Sedlak
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern,
Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm, netdev,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik
On Mon, Aug 4, 2025 at 11:50 PM Daniel Sedlak <daniel.sedlak@cdn77.com> wrote:
>
> This patch is a result of our long-standing debug sessions, where it all
> started as "networking is slow", and TCP network throughput suddenly
> dropped from tens of Gbps to few Mbps, and we could not see anything in
> the kernel log or netstat counters.
>
> Currently, we have two memory pressure counters for TCP sockets [1],
> which we manipulate only when the memory pressure is signalled through
> the proto struct [2]. However, the memory pressure can also be signaled
> through the cgroup memory subsystem, which we do not reflect in the
> netstat counters. In the end, when the cgroup memory subsystem signals
> that it is under pressure, we silently reduce the advertised TCP window
> with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
> throughput reduction.
>
> Keep in mind that when the cgroup memory subsystem signals the socket
> memory pressure, it affects all sockets used in that cgroup.
>
> This patch exposes a new file for each cgroup in sysfs which signals
> the cgroup socket memory pressure. The file is accessible in
> the following path.
>
> /sys/fs/cgroup/**/<cgroup name>/memory.net.socket_pressure
>
> The output value is a cumulative sum of microseconds spent
> under pressure for that particular cgroup.
>
> Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
> Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
> Co-developed-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
> ---
> Changes:
> v3 -> v4:
> - Add documentation
> - Expose pressure as cummulative counter in microseconds
> - Link to v3: https://lore.kernel.org/netdev/20250722071146.48616-1-daniel.sedlak@cdn77.com/
>
> v2 -> v3:
> - Expose the socket memory pressure on the cgroups instead of netstat
> - Split patch
> - Link to v2: https://lore.kernel.org/netdev/20250714143613.42184-1-daniel.sedlak@cdn77.com/
>
> v1 -> v2:
> - Add tracepoint
> - Link to v1: https://lore.kernel.org/netdev/20250707105205.222558-1-daniel.sedlak@cdn77.com/
>
> Documentation/admin-guide/cgroup-v2.rst | 7 +++++++
> include/linux/memcontrol.h | 2 ++
> mm/memcontrol.c | 15 +++++++++++++++
> mm/vmpressure.c | 9 ++++++++-
> 4 files changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 0cc35a14afbe..c810b449fb3d 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1884,6 +1884,13 @@ The following nested keys are defined.
> Shows pressure stall information for memory. See
> :ref:`Documentation/accounting/psi.rst <psi>` for details.
>
> + memory.net.socket_pressure
> + A read-only single value file showing how many microseconds
> + all sockets within that cgroup spent under pressure.
> +
> + Note that when the sockets are under pressure, the networking
> + throughput can be significantly degraded.
> +
>
> Usage Guidelines
> ~~~~~~~~~~~~~~~~
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 87b6688f124a..6a1cb9a99b88 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -252,6 +252,8 @@ struct mem_cgroup {
> * where socket memory is accounted/charged separately.
> */
> unsigned long socket_pressure;
> + /* exported statistic for memory.net.socket_pressure */
> + unsigned long socket_pressure_duration;
>
> int kmemcg_id;
> /*
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 902da8a9c643..8e299d94c073 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3758,6 +3758,7 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
> INIT_LIST_HEAD(&memcg->swap_peaks);
> spin_lock_init(&memcg->peaks_lock);
> memcg->socket_pressure = jiffies;
> + memcg->socket_pressure_duration = 0;
> memcg1_memcg_init(memcg);
> memcg->kmemcg_id = -1;
> INIT_LIST_HEAD(&memcg->objcg_list);
> @@ -4647,6 +4648,15 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> return nbytes;
> }
>
> +static int memory_socket_pressure_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> + seq_printf(m, "%lu\n", READ_ONCE(memcg->socket_pressure_duration));
> +
> + return 0;
> +}
> +
> static struct cftype memory_files[] = {
> {
> .name = "current",
> @@ -4718,6 +4728,11 @@ static struct cftype memory_files[] = {
> .flags = CFTYPE_NS_DELEGATABLE,
> .write = memory_reclaim,
> },
> + {
> + .name = "net.socket_pressure",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_socket_pressure_show,
> + },
> { } /* terminate */
> };
>
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index bd5183dfd879..1e767cd8aa08 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -308,6 +308,8 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> level = vmpressure_calc_level(scanned, reclaimed);
>
> if (level > VMPRESSURE_LOW) {
> + unsigned long socket_pressure;
> + unsigned long jiffies_diff;
> /*
> * Let the socket buffer allocator know that
> * we are having trouble reclaiming LRU pages.
> @@ -316,7 +318,12 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> * asserted for a second in which subsequent
> * pressure events can occur.
> */
> - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> + socket_pressure = jiffies + HZ;
> +
> + jiffies_diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> + memcg->socket_pressure_duration += jiffies_to_usecs(jiffies_diff);
WRITE_ONCE() is needed here.
> +
> + WRITE_ONCE(memcg->socket_pressure, socket_pressure);
> }
> }
> }
>
> base-commit: e96ee511c906c59b7c4e6efd9d9b33917730e000
> --
> 2.39.5
>
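A minimal sketch of what that comment asks for, keeping the field names from
the v4 hunk quoted above (later in the thread the counter becomes an atomic
instead): the store has to be annotated because memory_socket_pressure_show()
reads the counter locklessly with READ_ONCE().

        /* Pair with the lockless READ_ONCE() reader in the show path;
         * a plain "+=" is a data race from KCSAN's point of view.
         */
        WRITE_ONCE(memcg->socket_pressure_duration,
                   memcg->socket_pressure_duration + jiffies_to_usecs(jiffies_diff));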
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-05 6:44 [PATCH v4] memcg: expose socket memory pressure in a cgroup Daniel Sedlak
2025-08-05 15:54 ` Kuniyuki Iwashima
@ 2025-08-05 23:02 ` Shakeel Butt
2025-08-06 19:20 ` Kuniyuki Iwashima
2025-08-07 10:42 ` Daniel Sedlak
2025-08-09 18:32 ` Tejun Heo
2 siblings, 2 replies; 25+ messages in thread
From: Shakeel Butt @ 2025-08-05 23:02 UTC (permalink / raw)
To: Daniel Sedlak
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik
On Tue, Aug 05, 2025 at 08:44:29AM +0200, Daniel Sedlak wrote:
> This patch is a result of our long-standing debug sessions, where it all
> started as "networking is slow", and TCP network throughput suddenly
> dropped from tens of Gbps to few Mbps, and we could not see anything in
> the kernel log or netstat counters.
>
> Currently, we have two memory pressure counters for TCP sockets [1],
> which we manipulate only when the memory pressure is signalled through
> the proto struct [2]. However, the memory pressure can also be signaled
> through the cgroup memory subsystem, which we do not reflect in the
> netstat counters. In the end, when the cgroup memory subsystem signals
> that it is under pressure, we silently reduce the advertised TCP window
> with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
> throughput reduction.
>
> Keep in mind that when the cgroup memory subsystem signals the socket
> memory pressure, it affects all sockets used in that cgroup.
>
> This patch exposes a new file for each cgroup in sysfs which signals
> the cgroup socket memory pressure. The file is accessible in
> the following path.
>
> /sys/fs/cgroup/**/<cgroup name>/memory.net.socket_pressure
Let's keep the name concise. Maybe memory.net.pressure?
>
> The output value is a cumulative sum of microseconds spent
> under pressure for that particular cgroup.
>
> Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
> Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
> Co-developed-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
> ---
> Changes:
> v3 -> v4:
> - Add documentation
> - Expose pressure as cummulative counter in microseconds
> - Link to v3: https://lore.kernel.org/netdev/20250722071146.48616-1-daniel.sedlak@cdn77.com/
>
> v2 -> v3:
> - Expose the socket memory pressure on the cgroups instead of netstat
> - Split patch
> - Link to v2: https://lore.kernel.org/netdev/20250714143613.42184-1-daniel.sedlak@cdn77.com/
>
> v1 -> v2:
> - Add tracepoint
> - Link to v1: https://lore.kernel.org/netdev/20250707105205.222558-1-daniel.sedlak@cdn77.com/
>
> Documentation/admin-guide/cgroup-v2.rst | 7 +++++++
> include/linux/memcontrol.h | 2 ++
> mm/memcontrol.c | 15 +++++++++++++++
> mm/vmpressure.c | 9 ++++++++-
> 4 files changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 0cc35a14afbe..c810b449fb3d 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1884,6 +1884,13 @@ The following nested keys are defined.
> Shows pressure stall information for memory. See
> :ref:`Documentation/accounting/psi.rst <psi>` for details.
>
> + memory.net.socket_pressure
> + A read-only single value file showing how many microseconds
> + all sockets within that cgroup spent under pressure.
> +
> + Note that when the sockets are under pressure, the networking
> + throughput can be significantly degraded.
> +
>
> Usage Guidelines
> ~~~~~~~~~~~~~~~~
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 87b6688f124a..6a1cb9a99b88 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -252,6 +252,8 @@ struct mem_cgroup {
> * where socket memory is accounted/charged separately.
> */
> unsigned long socket_pressure;
> + /* exported statistic for memory.net.socket_pressure */
> + unsigned long socket_pressure_duration;
I think atomic_long_t would be better.
>
> int kmemcg_id;
> /*
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 902da8a9c643..8e299d94c073 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3758,6 +3758,7 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
> INIT_LIST_HEAD(&memcg->swap_peaks);
> spin_lock_init(&memcg->peaks_lock);
> memcg->socket_pressure = jiffies;
> + memcg->socket_pressure_duration = 0;
> memcg1_memcg_init(memcg);
> memcg->kmemcg_id = -1;
> INIT_LIST_HEAD(&memcg->objcg_list);
> @@ -4647,6 +4648,15 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> return nbytes;
> }
>
> +static int memory_socket_pressure_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> + seq_printf(m, "%lu\n", READ_ONCE(memcg->socket_pressure_duration));
> +
> + return 0;
> +}
> +
> static struct cftype memory_files[] = {
> {
> .name = "current",
> @@ -4718,6 +4728,11 @@ static struct cftype memory_files[] = {
> .flags = CFTYPE_NS_DELEGATABLE,
> .write = memory_reclaim,
> },
> + {
> + .name = "net.socket_pressure",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_socket_pressure_show,
> + },
> { } /* terminate */
> };
>
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index bd5183dfd879..1e767cd8aa08 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -308,6 +308,8 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> level = vmpressure_calc_level(scanned, reclaimed);
>
> if (level > VMPRESSURE_LOW) {
> + unsigned long socket_pressure;
> + unsigned long jiffies_diff;
> /*
> * Let the socket buffer allocator know that
> * we are having trouble reclaiming LRU pages.
> @@ -316,7 +318,12 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> * asserted for a second in which subsequent
> * pressure events can occur.
> */
> - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> + socket_pressure = jiffies + HZ;
> +
> + jiffies_diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> + memcg->socket_pressure_duration += jiffies_to_usecs(jiffies_diff);
KCSAN will complain about this. I think we can use atomic_long_add(); we
don't need a variant with strict ordering.
> +
> + WRITE_ONCE(memcg->socket_pressure, socket_pressure);
> }
> }
> }
>
> base-commit: e96ee511c906c59b7c4e6efd9d9b33917730e000
> --
> 2.39.5
>
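A sketch of the atomic_long_t variant suggested above, reusing the field name
from the v4 patch; this is illustrative only, not the form that was finally
settled on later in the thread.

        /* include/linux/memcontrol.h */
        /* exported statistic for memory.net.socket_pressure */
        atomic_long_t socket_pressure_duration;

        /* mm/vmpressure.c: atomic_long_add() is an unordered RMW,
         * matching the "no strict ordering needed" note above.
         */
        atomic_long_add(jiffies_to_usecs(jiffies_diff),
                        &memcg->socket_pressure_duration);

        /* mm/memcontrol.c: lockless read side */
        seq_printf(m, "%ld\n", atomic_long_read(&memcg->socket_pressure_duration));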
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-05 23:02 ` Shakeel Butt
@ 2025-08-06 19:20 ` Kuniyuki Iwashima
2025-08-06 21:54 ` Shakeel Butt
2025-08-07 10:42 ` Daniel Sedlak
1 sibling, 1 reply; 25+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-06 19:20 UTC (permalink / raw)
To: Shakeel Butt
Cc: Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik
On Tue, Aug 5, 2025 at 4:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Aug 05, 2025 at 08:44:29AM +0200, Daniel Sedlak wrote:
> > This patch is a result of our long-standing debug sessions, where it all
> > started as "networking is slow", and TCP network throughput suddenly
> > dropped from tens of Gbps to few Mbps, and we could not see anything in
> > the kernel log or netstat counters.
> >
> > Currently, we have two memory pressure counters for TCP sockets [1],
> > which we manipulate only when the memory pressure is signalled through
> > the proto struct [2]. However, the memory pressure can also be signaled
> > through the cgroup memory subsystem, which we do not reflect in the
> > netstat counters. In the end, when the cgroup memory subsystem signals
> > that it is under pressure, we silently reduce the advertised TCP window
> > with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
> > throughput reduction.
> >
> > Keep in mind that when the cgroup memory subsystem signals the socket
> > memory pressure, it affects all sockets used in that cgroup.
> >
> > This patch exposes a new file for each cgroup in sysfs which signals
> > the cgroup socket memory pressure. The file is accessible in
> > the following path.
> >
> > /sys/fs/cgroup/**/<cgroup name>/memory.net.socket_pressure
>
> let's keep the name concise. Maybe memory.net.pressure?
>
> >
> > The output value is a cumulative sum of microseconds spent
> > under pressure for that particular cgroup.
> >
> > Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
> > Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
> > Co-developed-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> > Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
> > Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
> > ---
> > Changes:
> > v3 -> v4:
> > - Add documentation
> > - Expose pressure as cummulative counter in microseconds
> > - Link to v3: https://lore.kernel.org/netdev/20250722071146.48616-1-daniel.sedlak@cdn77.com/
> >
> > v2 -> v3:
> > - Expose the socket memory pressure on the cgroups instead of netstat
> > - Split patch
> > - Link to v2: https://lore.kernel.org/netdev/20250714143613.42184-1-daniel.sedlak@cdn77.com/
> >
> > v1 -> v2:
> > - Add tracepoint
> > - Link to v1: https://lore.kernel.org/netdev/20250707105205.222558-1-daniel.sedlak@cdn77.com/
> >
> > Documentation/admin-guide/cgroup-v2.rst | 7 +++++++
> > include/linux/memcontrol.h | 2 ++
> > mm/memcontrol.c | 15 +++++++++++++++
> > mm/vmpressure.c | 9 ++++++++-
> > 4 files changed, 32 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 0cc35a14afbe..c810b449fb3d 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1884,6 +1884,13 @@ The following nested keys are defined.
> > Shows pressure stall information for memory. See
> > :ref:`Documentation/accounting/psi.rst <psi>` for details.
> >
> > + memory.net.socket_pressure
> > + A read-only single value file showing how many microseconds
> > + all sockets within that cgroup spent under pressure.
> > +
> > + Note that when the sockets are under pressure, the networking
> > + throughput can be significantly degraded.
> > +
> >
> > Usage Guidelines
> > ~~~~~~~~~~~~~~~~
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 87b6688f124a..6a1cb9a99b88 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -252,6 +252,8 @@ struct mem_cgroup {
> > * where socket memory is accounted/charged separately.
> > */
> > unsigned long socket_pressure;
> > + /* exported statistic for memory.net.socket_pressure */
> > + unsigned long socket_pressure_duration;
>
> I think atomic_long_t would be better.
>
> >
> > int kmemcg_id;
> > /*
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 902da8a9c643..8e299d94c073 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3758,6 +3758,7 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
> > INIT_LIST_HEAD(&memcg->swap_peaks);
> > spin_lock_init(&memcg->peaks_lock);
> > memcg->socket_pressure = jiffies;
> > + memcg->socket_pressure_duration = 0;
> > memcg1_memcg_init(memcg);
> > memcg->kmemcg_id = -1;
> > INIT_LIST_HEAD(&memcg->objcg_list);
> > @@ -4647,6 +4648,15 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > return nbytes;
> > }
> >
> > +static int memory_socket_pressure_show(struct seq_file *m, void *v)
> > +{
> > + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> > +
> > + seq_printf(m, "%lu\n", READ_ONCE(memcg->socket_pressure_duration));
> > +
> > + return 0;
> > +}
> > +
> > static struct cftype memory_files[] = {
> > {
> > .name = "current",
> > @@ -4718,6 +4728,11 @@ static struct cftype memory_files[] = {
> > .flags = CFTYPE_NS_DELEGATABLE,
> > .write = memory_reclaim,
> > },
> > + {
> > + .name = "net.socket_pressure",
> > + .flags = CFTYPE_NOT_ON_ROOT,
> > + .seq_show = memory_socket_pressure_show,
> > + },
> > { } /* terminate */
> > };
> >
> > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> > index bd5183dfd879..1e767cd8aa08 100644
> > --- a/mm/vmpressure.c
> > +++ b/mm/vmpressure.c
> > @@ -308,6 +308,8 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> > level = vmpressure_calc_level(scanned, reclaimed);
> >
> > if (level > VMPRESSURE_LOW) {
> > + unsigned long socket_pressure;
> > + unsigned long jiffies_diff;
> > /*
> > * Let the socket buffer allocator know that
> > * we are having trouble reclaiming LRU pages.
> > @@ -316,7 +318,12 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> > * asserted for a second in which subsequent
> > * pressure events can occur.
> > */
> > - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> > + socket_pressure = jiffies + HZ;
> > +
> > + jiffies_diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> > + memcg->socket_pressure_duration += jiffies_to_usecs(jiffies_diff);
>
> KCSAN will complain about this. I think we can use atomic_long_add() and
> don't need the one with strict ordering.
Judging from the atomic_ suggestion, vmpressure() can apparently be called
concurrently for the same memcg. Should we then protect socket_pressure and
the duration with the same lock instead of mixing WRITE/READ_ONCE() and
atomics? Otherwise jiffies_diff could be incorrect (although the error is
smaller than HZ).
>
> > +
> > + WRITE_ONCE(memcg->socket_pressure, socket_pressure);
> > }
> > }
> > }
> >
> > base-commit: e96ee511c906c59b7c4e6efd9d9b33917730e000
> > --
> > 2.39.5
> >
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-06 19:20 ` Kuniyuki Iwashima
@ 2025-08-06 21:54 ` Shakeel Butt
2025-08-06 22:01 ` Kuniyuki Iwashima
0 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2025-08-06 21:54 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik
On Wed, Aug 06, 2025 at 12:20:25PM -0700, Kuniyuki Iwashima wrote:
> > > - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> > > + socket_pressure = jiffies + HZ;
> > > +
> > > + jiffies_diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> > > + memcg->socket_pressure_duration += jiffies_to_usecs(jiffies_diff);
> >
> > KCSAN will complain about this. I think we can use atomic_long_add() and
> > don't need the one with strict ordering.
>
> Assuming from atomic_ that vmpressure() could be called concurrently
> for the same memcg, should we protect socket_pressure and duration
> within the same lock instead of mixing WRITE/READ_ONCE() and
> atomic? Otherwise jiffies_diff could be incorrect (the error is smaller
> than HZ though).
>
Yeah, good point. Also, this field needs to be hierarchical. So, with a lock,
something like the following is needed:
if (!spin_trylock(memcg->net_pressure_lock))
        return;

socket_pressure = jiffies + HZ;
diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);

if (diff) {
        WRITE_ONCE(memcg->socket_pressure, socket_pressure);
        // mod_memcg_state(memcg, MEMCG_NET_PRESSURE, diff);
        // OR
        // while (memcg) {
        //         memcg->sk_pressure_duration += diff;
        //         memcg = parent_mem_cgroup(memcg);
        // }
}

spin_unlock(memcg->net_pressure_lock);
Regarding the hierarchical accounting, we can avoid the rstat infrastructure,
as this code path is not really performance critical.
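Fleshing that pseudocode out a little, it could look roughly like the sketch
below. Note the assumptions: net_pressure_lock is a proposed new per-memcg
spinlock that does not exist in v4, the open-coded loop is used instead of the
mod_memcg_state() alternative, the early return relies on nothing following
this block in vmpressure(), and the plain adds on the ancestors are exactly
what the next reply questions.

        if (level > VMPRESSURE_LOW) {
                struct mem_cgroup *iter;
                unsigned long socket_pressure, diff;

                /* proposed per-memcg spinlock, not in the v4 patch */
                if (!spin_trylock(&memcg->net_pressure_lock))
                        return;

                socket_pressure = jiffies + HZ;
                diff = min_t(unsigned long,
                             socket_pressure - memcg->socket_pressure, HZ);
                if (diff) {
                        WRITE_ONCE(memcg->socket_pressure, socket_pressure);
                        /* hierarchical: charge this memcg and its ancestors */
                        for (iter = memcg; iter; iter = parent_mem_cgroup(iter))
                                iter->socket_pressure_duration +=
                                        jiffies_to_usecs(diff);
                }
                spin_unlock(&memcg->net_pressure_lock);
        }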
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-06 21:54 ` Shakeel Butt
@ 2025-08-06 22:01 ` Kuniyuki Iwashima
2025-08-06 23:34 ` Shakeel Butt
0 siblings, 1 reply; 25+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-06 22:01 UTC (permalink / raw)
To: Shakeel Butt
Cc: Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik
On Wed, Aug 6, 2025 at 2:54 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Aug 06, 2025 at 12:20:25PM -0700, Kuniyuki Iwashima wrote:
> > > > - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> > > > + socket_pressure = jiffies + HZ;
> > > > +
> > > > + jiffies_diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> > > > + memcg->socket_pressure_duration += jiffies_to_usecs(jiffies_diff);
> > >
> > > KCSAN will complain about this. I think we can use atomic_long_add() and
> > > don't need the one with strict ordering.
> >
> > Assuming from atomic_ that vmpressure() could be called concurrently
> > for the same memcg, should we protect socket_pressure and duration
> > within the same lock instead of mixing WRITE/READ_ONCE() and
> > atomic? Otherwise jiffies_diff could be incorrect (the error is smaller
> > than HZ though).
> >
>
> Yeah good point. Also this field needs to be hierarchical. So, with lock
> something like following is needed:
>
> if (!spin_trylock(memcg->net_pressure_lock))
> return;
>
> socket_pressure = jiffies + HZ;
> diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
READ_ONCE() should be unnecessary here.
>
> if (diff) {
> WRITE_ONCE(memcg->socket_pressure, socket_pressure);
> // mod_memcg_state(memcg, MEMCG_NET_PRESSURE, diff);
> // OR
> // while (memcg) {
> // memcg->sk_pressure_duration += diff;
> // memcg = parent_mem_cgroup(memcg);
The parents' sk_pressure_duration is not protected by the lock taken with
trylock. Maybe we need another, global mutex if we want the hierarchical
accounting?
> // }
> }
>
> spin_unlock(memcg->net_pressure_lock);
>
> Regarding the hierarchical, we can avoid rstat infra as this code path
> is not really performance critical.
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-06 22:01 ` Kuniyuki Iwashima
@ 2025-08-06 23:34 ` Shakeel Butt
2025-08-06 23:40 ` Kuniyuki Iwashima
2025-08-07 10:22 ` Daniel Sedlak
0 siblings, 2 replies; 25+ messages in thread
From: Shakeel Butt @ 2025-08-06 23:34 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik
On Wed, Aug 06, 2025 at 03:01:44PM -0700, Kuniyuki Iwashima wrote:
> On Wed, Aug 6, 2025 at 2:54 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Wed, Aug 06, 2025 at 12:20:25PM -0700, Kuniyuki Iwashima wrote:
> > > > > - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> > > > > + socket_pressure = jiffies + HZ;
> > > > > +
> > > > > + jiffies_diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> > > > > + memcg->socket_pressure_duration += jiffies_to_usecs(jiffies_diff);
> > > >
> > > > KCSAN will complain about this. I think we can use atomic_long_add() and
> > > > don't need the one with strict ordering.
> > >
> > > Assuming from atomic_ that vmpressure() could be called concurrently
> > > for the same memcg, should we protect socket_pressure and duration
> > > within the same lock instead of mixing WRITE/READ_ONCE() and
> > > atomic? Otherwise jiffies_diff could be incorrect (the error is smaller
> > > than HZ though).
> > >
> >
> > Yeah good point. Also this field needs to be hierarchical. So, with lock
> > something like following is needed:
> >
> > if (!spin_trylock(memcg->net_pressure_lock))
> > return;
> >
> > socket_pressure = jiffies + HZ;
> > diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
>
> READ_ONCE() should be unnecessary here.
>
> >
> > if (diff) {
> > WRITE_ONCE(memcg->socket_pressure, socket_pressure);
> > // mod_memcg_state(memcg, MEMCG_NET_PRESSURE, diff);
> > // OR
> > // while (memcg) {
> > // memcg->sk_pressure_duration += diff;
> > // memcg = parent_mem_cgroup(memcg);
>
> The parents' sk_pressure_duration is not protected by the lock
> taken by trylock. Maybe we need another global mutex if we want
> the hierarchy ?
We don't really need lock protection for sk_pressure_duration. The lock
only gives us a consistent value of diff. Once we have computed the
diff, we can add it to sk_pressure_duration of a memcg and all of its
ancestors without the lock.
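Concretely, assuming the duration counter becomes an atomic_long_t as
suggested earlier in the thread, the propagation step alone would be just:

        /* diff was computed consistently above (e.g. under the trylock);
         * the per-level additions themselves need no lock.
         */
        for (; memcg; memcg = parent_mem_cgroup(memcg))
                atomic_long_add(diff, &memcg->socket_pressure_duration);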
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-06 23:34 ` Shakeel Butt
@ 2025-08-06 23:40 ` Kuniyuki Iwashima
2025-08-06 23:51 ` Shakeel Butt
2025-08-07 10:22 ` Daniel Sedlak
1 sibling, 1 reply; 25+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-06 23:40 UTC (permalink / raw)
To: Shakeel Butt
Cc: Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik
On Wed, Aug 6, 2025 at 4:34 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Aug 06, 2025 at 03:01:44PM -0700, Kuniyuki Iwashima wrote:
> > On Wed, Aug 6, 2025 at 2:54 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Wed, Aug 06, 2025 at 12:20:25PM -0700, Kuniyuki Iwashima wrote:
> > > > > > - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> > > > > > + socket_pressure = jiffies + HZ;
> > > > > > +
> > > > > > + jiffies_diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> > > > > > + memcg->socket_pressure_duration += jiffies_to_usecs(jiffies_diff);
> > > > >
> > > > > KCSAN will complain about this. I think we can use atomic_long_add() and
> > > > > don't need the one with strict ordering.
> > > >
> > > > Assuming from atomic_ that vmpressure() could be called concurrently
> > > > for the same memcg, should we protect socket_pressure and duration
> > > > within the same lock instead of mixing WRITE/READ_ONCE() and
> > > > atomic? Otherwise jiffies_diff could be incorrect (the error is smaller
> > > > than HZ though).
> > > >
> > >
> > > Yeah good point. Also this field needs to be hierarchical. So, with lock
> > > something like following is needed:
> > >
> > > if (!spin_trylock(memcg->net_pressure_lock))
> > > return;
> > >
> > > socket_pressure = jiffies + HZ;
> > > diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> >
> > READ_ONCE() should be unnecessary here.
> >
> > >
> > > if (diff) {
> > > WRITE_ONCE(memcg->socket_pressure, socket_pressure);
> > > // mod_memcg_state(memcg, MEMCG_NET_PRESSURE, diff);
> > > // OR
> > > // while (memcg) {
> > > // memcg->sk_pressure_duration += diff;
> > > // memcg = parent_mem_cgroup(memcg);
> >
> > The parents' sk_pressure_duration is not protected by the lock
> > taken by trylock. Maybe we need another global mutex if we want
> > the hierarchy ?
>
> We don't really need lock protection for sk_pressure_duration. The lock
> is only giving us consistent value of diff. Once we have computed the
> diff, we can add it to sk_pressure_duration of a memcg and all of its
> ancestor without lock.
Maybe I'm wrong, but I was assuming two vmpressure() calls running
concurrently for cgroup-C and cgroup-D, where one updates cgroup-C's
duration and the other updates both C's and D's durations:
cgroup-A -> cgroup-B -> cgroup-C -> cgroup-D
Could that happen? Even if it can, we could use atomic ops.
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-06 23:40 ` Kuniyuki Iwashima
@ 2025-08-06 23:51 ` Shakeel Butt
0 siblings, 0 replies; 25+ messages in thread
From: Shakeel Butt @ 2025-08-06 23:51 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik
On Wed, Aug 06, 2025 at 04:40:45PM -0700, Kuniyuki Iwashima wrote:
> On Wed, Aug 6, 2025 at 4:34 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Wed, Aug 06, 2025 at 03:01:44PM -0700, Kuniyuki Iwashima wrote:
> > > On Wed, Aug 6, 2025 at 2:54 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Wed, Aug 06, 2025 at 12:20:25PM -0700, Kuniyuki Iwashima wrote:
> > > > > > > - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> > > > > > > + socket_pressure = jiffies + HZ;
> > > > > > > +
> > > > > > > + jiffies_diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> > > > > > > + memcg->socket_pressure_duration += jiffies_to_usecs(jiffies_diff);
> > > > > >
> > > > > > KCSAN will complain about this. I think we can use atomic_long_add() and
> > > > > > don't need the one with strict ordering.
> > > > >
> > > > > Assuming from atomic_ that vmpressure() could be called concurrently
> > > > > for the same memcg, should we protect socket_pressure and duration
> > > > > within the same lock instead of mixing WRITE/READ_ONCE() and
> > > > > atomic? Otherwise jiffies_diff could be incorrect (the error is smaller
> > > > > than HZ though).
> > > > >
> > > >
> > > > Yeah good point. Also this field needs to be hierarchical. So, with lock
> > > > something like following is needed:
> > > >
> > > > if (!spin_trylock(memcg->net_pressure_lock))
> > > > return;
> > > >
> > > > socket_pressure = jiffies + HZ;
> > > > diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> > >
> > > READ_ONCE() should be unnecessary here.
> > >
> > > >
> > > > if (diff) {
> > > > WRITE_ONCE(memcg->socket_pressure, socket_pressure);
> > > > // mod_memcg_state(memcg, MEMCG_NET_PRESSURE, diff);
> > > > // OR
> > > > // while (memcg) {
> > > > // memcg->sk_pressure_duration += diff;
> > > > // memcg = parent_mem_cgroup(memcg);
> > >
> > > The parents' sk_pressure_duration is not protected by the lock
> > > taken by trylock. Maybe we need another global mutex if we want
> > > the hierarchy ?
> >
> > We don't really need lock protection for sk_pressure_duration. The lock
> > is only giving us consistent value of diff. Once we have computed the
> > diff, we can add it to sk_pressure_duration of a memcg and all of its
> > ancestor without lock.
>
> Maybe I'm wrong but I was assuming two vmpressure() called
> concurrently for cgroup-C and cgroup-D, and one updates
> cgroup-C's duration and another updates C&D duration.
>
> cgroup-A -> cgroup-B -> cgroup-C -> cgroup-D
>
> Could that happen ? Even if it's yes, we could use atomic ops.
I am not following the hierarchy you are describing, but yes, concurrent
updates to sk_pressure_duration can happen, and a simple atomic_add is good
enough without any locking.
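A self-contained userspace model of that point (C11 atomics standing in for
atomic_long_add(); not kernel code): with several concurrent adders, a plain
counter loses updates, while the relaxed atomic counter does not, which is all
the duration accounting needs.

        /* gcc -O2 -pthread counter_race.c */
        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define THREADS 4
        #define ITERS   1000000

        static unsigned long plain_sum;   /* plain "+=": racy, updates can be lost */
        static atomic_ulong atomic_sum;   /* atomic add: correct without any lock */

        static void *worker(void *arg)
        {
                (void)arg;
                for (int i = 0; i < ITERS; i++) {
                        plain_sum += 1;
                        atomic_fetch_add_explicit(&atomic_sum, 1,
                                                  memory_order_relaxed);
                }
                return NULL;
        }

        int main(void)
        {
                pthread_t t[THREADS];

                for (int i = 0; i < THREADS; i++)
                        pthread_create(&t[i], NULL, worker, NULL);
                for (int i = 0; i < THREADS; i++)
                        pthread_join(t[i], NULL);

                /* plain_sum typically ends up short of the expected total */
                printf("plain:  %lu (expected %d)\n", plain_sum, THREADS * ITERS);
                printf("atomic: %lu\n", atomic_load(&atomic_sum));
                return 0;
        }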
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-06 23:34 ` Shakeel Butt
2025-08-06 23:40 ` Kuniyuki Iwashima
@ 2025-08-07 10:22 ` Daniel Sedlak
2025-08-07 20:52 ` Shakeel Butt
1 sibling, 1 reply; 25+ messages in thread
From: Daniel Sedlak @ 2025-08-07 10:22 UTC (permalink / raw)
To: Shakeel Butt, Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern,
Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, cgroups, Tejun Heo,
Michal Koutný, Matyas Hurtik
On 8/7/25 1:34 AM, Shakeel Butt wrote:
> On Wed, Aug 06, 2025 at 03:01:44PM -0700, Kuniyuki Iwashima wrote:
>> On Wed, Aug 6, 2025 at 2:54 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>
>>> On Wed, Aug 06, 2025 at 12:20:25PM -0700, Kuniyuki Iwashima wrote:
>>>>>> - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
>>>>>> + socket_pressure = jiffies + HZ;
>>>>>> +
>>>>>> + jiffies_diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
>>>>>> + memcg->socket_pressure_duration += jiffies_to_usecs(jiffies_diff);
>>>>>
>>>>> KCSAN will complain about this. I think we can use atomic_long_add() and
>>>>> don't need the one with strict ordering.
Thanks for the KCSAN recommendation; I didn't know about this sanitizer.
>>>>
>>>> Assuming from atomic_ that vmpressure() could be called concurrently
>>>> for the same memcg, should we protect socket_pressure and duration
>>>> within the same lock instead of mixing WRITE/READ_ONCE() and
>>>> atomic? Otherwise jiffies_diff could be incorrect (the error is smaller
>>>> than HZ though).
>>>>
>>>
>>> Yeah good point. Also this field needs to be hierarchical. So, with lock
>>> something like following is needed:
Thanks for the snippet; I will incorporate it.
>>>
>>> if (!spin_trylock(memcg->net_pressure_lock))
>>> return;
>>>
>>> socket_pressure = jiffies + HZ;
>>> diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
>>
>> READ_ONCE() should be unnecessary here.
>>
>>>
>>> if (diff) {
>>> WRITE_ONCE(memcg->socket_pressure, socket_pressure);
>>> // mod_memcg_state(memcg, MEMCG_NET_PRESSURE, diff);
>>> // OR
>>> // while (memcg) {
>>> // memcg->sk_pressure_duration += diff;
>>> // memcg = parent_mem_cgroup(memcg);
>>
>> The parents' sk_pressure_duration is not protected by the lock
>> taken by trylock. Maybe we need another global mutex if we want
>> the hierarchy ?
>
> We don't really need lock protection for sk_pressure_duration. The lock
By this you mean that we don't need the possible new global lock or the
local memcg->net_pressure_lock?
> is only giving us consistent value of diff. Once we have computed the
> diff, we can add it to sk_pressure_duration of a memcg and all of its
> ancestor without lock.
Thanks!
Daniel
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-07 10:22 ` Daniel Sedlak
@ 2025-08-07 20:52 ` Shakeel Butt
2025-08-14 16:27 ` Matyas Hurtik
0 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2025-08-07 20:52 UTC (permalink / raw)
To: Daniel Sedlak
Cc: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik
On Thu, Aug 07, 2025 at 12:22:01PM +0200, Daniel Sedlak wrote:
> On 8/7/25 1:34 AM, Shakeel Butt wrote:
> > On Wed, Aug 06, 2025 at 03:01:44PM -0700, Kuniyuki Iwashima wrote:
> > > On Wed, Aug 6, 2025 at 2:54 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Wed, Aug 06, 2025 at 12:20:25PM -0700, Kuniyuki Iwashima wrote:
> > > > > > > - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> > > > > > > + socket_pressure = jiffies + HZ;
> > > > > > > +
> > > > > > > + jiffies_diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> > > > > > > + memcg->socket_pressure_duration += jiffies_to_usecs(jiffies_diff);
> > > > > >
> > > > > > KCSAN will complain about this. I think we can use atomic_long_add() and
> > > > > > don't need the one with strict ordering.
>
> Thanks for the KCSAN recommendation, I didn't know about this sanitizer.
>
> > > > >
> > > > > Assuming from atomic_ that vmpressure() could be called concurrently
> > > > > for the same memcg, should we protect socket_pressure and duration
> > > > > within the same lock instead of mixing WRITE/READ_ONCE() and
> > > > > atomic? Otherwise jiffies_diff could be incorrect (the error is smaller
> > > > > than HZ though).
> > > > >
> > > >
> > > > Yeah good point. Also this field needs to be hierarchical. So, with lock
> > > > something like following is needed:
>
> Thanks for the snippet, will incorporate it.
Cool, though note that it is just pseudocode, so be careful when using it.
>
> > > >
> > > > if (!spin_trylock(memcg->net_pressure_lock))
> > > > return;
> > > >
> > > > socket_pressure = jiffies + HZ;
> > > > diff = min(socket_pressure - READ_ONCE(memcg->socket_pressure), HZ);
> > >
> > > READ_ONCE() should be unnecessary here.
> > >
> > > >
> > > > if (diff) {
> > > > WRITE_ONCE(memcg->socket_pressure, socket_pressure);
> > > > // mod_memcg_state(memcg, MEMCG_NET_PRESSURE, diff);
> > > > // OR
> > > > // while (memcg) {
> > > > // memcg->sk_pressure_duration += diff;
> > > > // memcg = parent_mem_cgroup(memcg);
> > >
> > > The parents' sk_pressure_duration is not protected by the lock
> > > taken by trylock. Maybe we need another global mutex if we want
> > > the hierarchy ?
> >
> > We don't really need lock protection for sk_pressure_duration. The lock
>
> By this you mean that we don't need the possible new global lock or the
> local memcg->net_pressure_lock?
We definitely don't need a global lock. For memcg->net_pressure_lock, we
need to be very clear why we need this lock. Basically, we are doing an RMW
on memcg->socket_pressure, and we want to know 'consistently' how much
further we are pushing memcg->socket_pressure; in other words, a consistent
value of diff. The lock is one way to get that consistent diff. We could
also play some atomic-ops trick to get the consistent value without a lock,
but I don't think that complexity is worth it. A third option might be to
drop our consistency requirement for diff, as the error is bounded by HZ
(as mentioned by Kuniyuki), but this code path is not performance critical
enough to make that option worth it. So, option 1, a simple
memcg->net_pressure_lock, is good enough.
We don't need memcg->net_pressure_lock's protection for
sk_pressure_duration of the memcg and its ancestors if additions to
sk_pressure_duration are atomic.
Someone might notice that we are doing an upward traversal for
memcg->sk_pressure_duration but not for memcg->socket_pressure. We can
do that too if we decide to optimize mem_cgroup_under_socket_pressure().
However, that is orthogonal work, and the consistency requirement would
change as well.
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-07 20:52 ` Shakeel Butt
@ 2025-08-14 16:27 ` Matyas Hurtik
2025-08-14 17:31 ` Shakeel Butt
0 siblings, 1 reply; 25+ messages in thread
From: Matyas Hurtik @ 2025-08-14 16:27 UTC (permalink / raw)
To: Shakeel Butt, Daniel Sedlak
Cc: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
cgroups, Tejun Heo, Michal Koutný
On 8/7/25 10:52 PM, Shakeel Butt wrote:
> We definitely don't need a global lock. For memcg->net_pressure_lock, we
> need to be very clear why we need this lock. Basically we are doing RMW
> on memcg->socket_pressure and we want known 'consistently' how much
> further we are pushing memcg->socket_pressure. In other words the
> consistent value of diff. The lock is one way to get that consistent
> diff. We can also play some atomic ops trick to get the consistent value
> without lock but I don't think that complexity is worth it.
Hello,
I tried implementing the second option, making the diff consistent using
atomics.
Would something like this work?
if (level > VMPRESSURE_LOW) {
        unsigned long new_socket_pressure;
        unsigned long old_socket_pressure;
        unsigned long duration_to_add;

        /*
         * Let the socket buffer allocator know that
         * we are having trouble reclaiming LRU pages.
         *
         * For hysteresis keep the pressure state
         * asserted for a second in which subsequent
         * pressure events can occur.
         */
        new_socket_pressure = jiffies + HZ;
        old_socket_pressure = atomic_long_xchg(
                &memcg->socket_pressure, new_socket_pressure);

        duration_to_add = jiffies_to_usecs(
                min(new_socket_pressure - old_socket_pressure, HZ));

        do {
                atomic_long_add(duration_to_add, &memcg->socket_pressure_duration);
        } while ((memcg = parent_mem_cgroup(memcg)));
}
memcg->socket_pressure would need to be changed into atomic_long_t,
but we avoid adding the memcg->net_pressure_lock.
> We don't need memcg->net_pressure_lock's protection for
> sk_pressure_duration of the memcg and its ancestors if additions to
> sk_pressure_duration are atomic.
Regarding the hierarchical propagation, I noticed during testing that
vmpressure() was sometimes called with memcgs (created for systemd oneshot
services) that were, at that time, no longer present in the /sys/fs/cgroup
tree. This then made their parents' counters a lot larger than just the sum
of the subtree plus the parent's own value. Is this behavior correct?
Thanks,
Matyas
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-14 16:27 ` Matyas Hurtik
@ 2025-08-14 17:31 ` Shakeel Butt
2025-08-14 17:43 ` Shakeel Butt
0 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2025-08-14 17:31 UTC (permalink / raw)
To: Matyas Hurtik
Cc: Daniel Sedlak, Kuniyuki Iwashima, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
Neal Cardwell, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm,
netdev, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, cgroups, Tejun Heo, Michal Koutný
On Thu, Aug 14, 2025 at 06:27:22PM +0200, Matyas Hurtik wrote:
> On 8/7/25 10:52 PM, Shakeel Butt wrote:
>
> > We definitely don't need a global lock. For memcg->net_pressure_lock, we
> > need to be very clear why we need this lock. Basically we are doing RMW
> > on memcg->socket_pressure and we want known 'consistently' how much
> > further we are pushing memcg->socket_pressure. In other words the
> > consistent value of diff. The lock is one way to get that consistent
> > diff. We can also play some atomic ops trick to get the consistent value
> > without lock but I don't think that complexity is worth it.
>
> Hello,
>
>
> I tried implementing the second option, making the diff consistent using
> atomics.
> Would something like this work?
>
> if (level > VMPRESSURE_LOW) {
> unsigned long new_socket_pressure;
> unsigned long old_socket_pressure;
> unsigned long duration_to_add;
> /*
> * Let the socket buffer allocator know that
> * we are having trouble reclaiming LRU pages.
> *
> * For hysteresis keep the pressure state
> * asserted for a second in which subsequent
> * pressure events can occur.
> */
> new_socket_pressure = jiffies + HZ;
Add an if condition here: if old_socket_pressure is already equal to
new_socket_pressure, skip all of the following.
> old_socket_pressure = atomic_long_xchg(
> &memcg->socket_pressure, new_socket_pressure);
>
> duration_to_add = jiffies_to_usecs(
> min(new_socket_pressure - old_socket_pressure, HZ));
Here, if duration_to_add is zero, skip the upward traversal that follows.
>
> do {
> atomic_long_add(duration_to_add, &memcg->socket_pressure_duration);
> } while ((memcg = parent_mem_cgroup(memcg)));
> }
>
> memcg->socket_pressure would need to be changed into atomic_long_t,
> but we avoid adding the memcg->net_pressure_lock.
Awesome, seems fine to me.
>
> > We don't need memcg->net_pressure_lock's protection for
> > sk_pressure_duration of the memcg and its ancestors if additions to
> > sk_pressure_duration are atomic.
>
> With regards to the hierarchical propagation I noticed during testing that
> vmpressure() was sometimes called with memcgs, created for systemd oneshot
> services, that were at that time no longer present in the /sys/fs/cgroup
> tree.
> This then made their parent counters a lot larger than just sum of the
> subtree
> plus value of self. Would this behavior be correct?
>
That is intentional. You can see a couple of other similar monotonically
increasing stats in memory.stat, like workingset refaults and demotes.
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-14 17:31 ` Shakeel Butt
@ 2025-08-14 17:43 ` Shakeel Butt
0 siblings, 0 replies; 25+ messages in thread
From: Shakeel Butt @ 2025-08-14 17:43 UTC (permalink / raw)
To: Matyas Hurtik
Cc: Daniel Sedlak, Kuniyuki Iwashima, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
Neal Cardwell, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm,
netdev, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, cgroups, Tejun Heo, Michal Koutný
On Thu, Aug 14, 2025 at 10:31:39AM -0700, Shakeel Butt wrote:
> On Thu, Aug 14, 2025 at 06:27:22PM +0200, Matyas Hurtik wrote:
> > On 8/7/25 10:52 PM, Shakeel Butt wrote:
> >
> > > We definitely don't need a global lock. For memcg->net_pressure_lock, we
> > > need to be very clear why we need this lock. Basically we are doing RMW
> > > on memcg->socket_pressure and we want known 'consistently' how much
> > > further we are pushing memcg->socket_pressure. In other words the
> > > consistent value of diff. The lock is one way to get that consistent
> > > diff. We can also play some atomic ops trick to get the consistent value
> > > without lock but I don't think that complexity is worth it.
> >
> > Hello,
> >
> >
> > I tried implementing the second option, making the diff consistent using
> > atomics.
> > Would something like this work?
> >
> > if (level > VMPRESSURE_LOW) {
> > unsigned long new_socket_pressure;
> > unsigned long old_socket_pressure;
> > unsigned long duration_to_add;
> > /*
> > * Let the socket buffer allocator know that
> > * we are having trouble reclaiming LRU pages.
> > *
> > * For hysteresis keep the pressure state
> > * asserted for a second in which subsequent
> > * pressure events can occur.
> > */
> > new_socket_pressure = jiffies + HZ;
>
> Add an if condition here if old_socket_pressure is already equal to
> the new_socket_pressure and skip all of the following.
>
> > old_socket_pressure = atomic_long_xchg(
> > &memcg->socket_pressure, new_socket_pressure);
> >
> > duration_to_add = jiffies_to_usecs(
One more point: this jiffies_to_usecs() can be done on the read side,
i.e. keep socket_pressure_duration in jiffies.
Also, TJ suggested exposing this new stat in memory.stat. That will be
easy: just add a seq_buf_printf() at the end of memcg_stat_format().
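Folding those follow-ups into Matyas's snippet (skip when the timestamp is
already current, skip the walk when there is nothing to add, keep the counter
in jiffies and convert on the read side), the update path might end up roughly
like below; socket_pressure as an atomic_long_t is still only the proposal
from this thread, not merged code.

        if (level > VMPRESSURE_LOW) {
                unsigned long new_sp = jiffies + HZ;
                unsigned long old_sp, diff;

                /* skip the xchg if another reclaimer already pushed
                 * socket_pressure to this jiffy
                 */
                if ((unsigned long)atomic_long_read(&memcg->socket_pressure) != new_sp) {
                        old_sp = atomic_long_xchg(&memcg->socket_pressure, new_sp);
                        /* counter stays in jiffies; converted on the read side */
                        diff = min_t(unsigned long, new_sp - old_sp, HZ);
                        /* skip the upward walk when there is nothing to add */
                        if (diff)
                                do {
                                        atomic_long_add(diff,
                                                        &memcg->socket_pressure_duration);
                                } while ((memcg = parent_mem_cgroup(memcg)));
                }
        }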
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-05 23:02 ` Shakeel Butt
2025-08-06 19:20 ` Kuniyuki Iwashima
@ 2025-08-07 10:42 ` Daniel Sedlak
1 sibling, 0 replies; 25+ messages in thread
From: Daniel Sedlak @ 2025-08-07 10:42 UTC (permalink / raw)
To: Shakeel Butt
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima,
David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik
On 8/6/25 1:02 AM, Shakeel Butt wrote:
> On Tue, Aug 05, 2025 at 08:44:29AM +0200, Daniel Sedlak wrote:
>> This patch is a result of our long-standing debug sessions, where it all
>> started as "networking is slow", and TCP network throughput suddenly
>> dropped from tens of Gbps to few Mbps, and we could not see anything in
>> the kernel log or netstat counters.
>>
>> Currently, we have two memory pressure counters for TCP sockets [1],
>> which we manipulate only when the memory pressure is signalled through
>> the proto struct [2]. However, the memory pressure can also be signaled
>> through the cgroup memory subsystem, which we do not reflect in the
>> netstat counters. In the end, when the cgroup memory subsystem signals
>> that it is under pressure, we silently reduce the advertised TCP window
>> with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
>> throughput reduction.
>>
>> Keep in mind that when the cgroup memory subsystem signals the socket
>> memory pressure, it affects all sockets used in that cgroup.
>>
>> This patch exposes a new file for each cgroup in sysfs which signals
>> the cgroup socket memory pressure. The file is accessible in
>> the following path.
>>
>> /sys/fs/cgroup/**/<cgroup name>/memory.net.socket_pressure
>
> let's keep the name concise. Maybe memory.net.pressure?
>
Sure, I can rename it for v5.
Thanks!
Daniel
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-05 6:44 [PATCH v4] memcg: expose socket memory pressure in a cgroup Daniel Sedlak
2025-08-05 15:54 ` Kuniyuki Iwashima
2025-08-05 23:02 ` Shakeel Butt
@ 2025-08-09 18:32 ` Tejun Heo
2025-08-11 21:31 ` Shakeel Butt
2025-08-13 12:03 ` Michal Koutný
2 siblings, 2 replies; 25+ messages in thread
From: Tejun Heo @ 2025-08-09 18:32 UTC (permalink / raw)
To: Daniel Sedlak
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima,
David Ahern, Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm,
netdev, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, cgroups, Michal Koutný, Matyas Hurtik
Hello,
On Tue, Aug 05, 2025 at 08:44:29AM +0200, Daniel Sedlak wrote:
> This patch exposes a new file for each cgroup in sysfs which signals
> the cgroup socket memory pressure. The file is accessible in
> the following path.
>
> /sys/fs/cgroup/**/<cgroup name>/memory.net.socket_pressure
>
> The output value is a cumulative sum of microseconds spent
> under pressure for that particular cgroup.
I'm not sure the "pressure" name fits best when the content is a
duration. Note that in the memory.pressure file, the main content is
time-averaged percentages, which are the "pressure" numbers. Can this be an
entry in memory.stat whose name signifies that it's a duration? net_throttle_us
or something like that?
Also, as Shakeel already pointed out, this would need to be accumulated
hierarchically. The tricky thing is determining how the accumulation should
work. Hierarchical summing up is simple and we can use the usual rstat
propagation; however, that would deviate from how pressure durations are
propagated for .pressure metrics, where each cgroup tracks all / some
contention states in its descendants. For simplicity's sake and if the
number ends up in memory.stat, I think simple summing up should be fine as
long as it's so noted in the documentation. Note that this semantic
difference would be another reason to avoid the "pressure" name.
Thanks.
--
tejun
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-09 18:32 ` Tejun Heo
@ 2025-08-11 21:31 ` Shakeel Butt
2025-08-13 12:03 ` Michal Koutný
1 sibling, 0 replies; 25+ messages in thread
From: Shakeel Butt @ 2025-08-11 21:31 UTC (permalink / raw)
To: Tejun Heo
Cc: Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
Kuniyuki Iwashima, David Ahern, Andrew Morton, Yosry Ahmed,
linux-mm, netdev, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, cgroups, Michal Koutný, Matyas Hurtik
On Sat, Aug 09, 2025 at 08:32:22AM -1000, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 05, 2025 at 08:44:29AM +0200, Daniel Sedlak wrote:
> > This patch exposes a new file for each cgroup in sysfs which signals
> > the cgroup socket memory pressure. The file is accessible in
> > the following path.
> >
> > /sys/fs/cgroup/**/<cgroup name>/memory.net.socket_pressure
> >
> > The output value is a cumulative sum of microseconds spent
> > under pressure for that particular cgroup.
>
> I'm not sure the pressure name fits the best when the content is the
> duration. Note that in the memory.pressure file, the main content is
> time-averaged percentages which are the "pressure" numbers. Can this be an
> entry in memory.stat which signifies that it's a duration? net_throttle_us
> or something like that?
Good point, and this can definitely be a metric exposed through
memory.stat. At the moment the metrics in memory.stat are either amounts
of bytes or numbers of pages. A time duration would be the first of its
kind and will need some work to become part of rstat; alternatively, we
can keep it separate from rstat, with a manual upward sync on the update
side, as that side is not performance critical (the read side seems to be
the performance-critical one for this stat).
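A sketch of that memory.stat route, using the net_throttle_us key Tejun
floated and the memcg_stat_format() hook mentioned elsewhere in the thread;
neither the name nor the exact conversion is settled.

        /* appended at the tail of memcg_stat_format() in mm/memcontrol.c;
         * the counter is kept in jiffies and converted here on read
         * (jiffies_to_usecs() returns unsigned int, so a 64-bit conversion
         * may be preferable for a cumulative counter).
         */
        seq_buf_printf(s, "net_throttle_us %u\n",
                       jiffies_to_usecs(atomic_long_read(&memcg->socket_pressure_duration)));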
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-09 18:32 ` Tejun Heo
2025-08-11 21:31 ` Shakeel Butt
@ 2025-08-13 12:03 ` Michal Koutný
2025-08-13 18:03 ` Tejun Heo
1 sibling, 1 reply; 25+ messages in thread
From: Michal Koutný @ 2025-08-13 12:03 UTC (permalink / raw)
To: Tejun Heo, Daniel Sedlak
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima,
David Ahern, Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm,
netdev, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, cgroups, Matyas Hurtik
On Sat, Aug 09, 2025 at 08:32:22AM -1000, Tejun Heo <tj@kernel.org> wrote:
> Also, as Shakeel already pointed out, this would need to be accumulated
> hierarchically. The tricky thing is determining how the accumulation should
> work. Hierarchical summing up is simple and we can use the usual rstat
> propagation; however, that would deviate from how pressure durations are
> propagated for .pressure metrics, where each cgroup tracks all / some
> contention states in its descendants. For simplicity's sake and if the
> number ends up in memory.stat, I think simple summing up should be fine as
> long as it's so noted in the documentation. Note that this semantic
> difference would be another reason to avoid the "pressure" name.
One more point to clarify -- should the value include throttling from
ancestors or not? (I think both are fine, but) this semantic should also
be described in the docs. I.e. the current proposal is
value = sum_children + self
and if you see that C's value is 0, it doesn't mean its sockets
weren't subject to throttling. It just means you also need to check the
values in C's ancestors. Does that work?
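For example (hypothetical numbers): if C is the only child of P, C can
report 0 while P reports 2000000; sockets in C could still have been
throttled for up to 2 seconds, because the pressure was signalled on P
and only shows up there.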
Thanks,
Michal
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-13 12:03 ` Michal Koutný
@ 2025-08-13 18:03 ` Tejun Heo
2025-08-20 16:51 ` Matyas Hurtik
0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2025-08-13 18:03 UTC (permalink / raw)
To: Michal Koutný
Cc: Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
Kuniyuki Iwashima, David Ahern, Andrew Morton, Shakeel Butt,
Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko,
Roman Gushchin, Muchun Song, cgroups, Matyas Hurtik
Hello,
On Wed, Aug 13, 2025 at 02:03:28PM +0200, Michal Koutný wrote:
...
> One more point to clarify -- should the value include throttling from
> ancestors or not? (I think both are fine, but) this semantic should also
> be described in the docs. I.e. the current proposal is
> value = sum_children + self
> and if you see that C's value is 0, it doesn't mean its sockets
> weren't subject to throttling. It just means you also need to check the
> values in C's ancestors. Does that work?
I was thinking more that it would account for all throttled durations, but
it's true that we only count locally originating events for e.g.
memory.events::low or pids.events::max. Hmm... I'm unsure. So, for events, I
think local sources make sense as they track which limits are triggering
where. However, I'm not sure that translates well to throttle duration, which
is closer to pressure metrics than to event counters. We don't distinguish the
sources of contention when presenting pressure metrics, after all.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-13 18:03 ` Tejun Heo
@ 2025-08-20 16:51 ` Matyas Hurtik
2025-08-20 19:03 ` Tejun Heo
0 siblings, 1 reply; 25+ messages in thread
From: Matyas Hurtik @ 2025-08-20 16:51 UTC (permalink / raw)
To: Tejun Heo, Michal Koutný
Cc: Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell,
Kuniyuki Iwashima, David Ahern, Andrew Morton, Shakeel Butt,
Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko,
Roman Gushchin, Muchun Song, cgroups
Hello,
On 8/13/25 8:03 PM, Tejun Heo wrote:
> On Wed, Aug 13, 2025 at 02:03:28PM +0200, Michal Koutný wrote:
> ...
>> One more point to clarify -- should the value include throttling from
>> ancestors or not? (I think both are fine, but) this semantic should also
>> be described in the docs. I.e. the current proposal is
>> value = sum_children + self
>> and if you see that C's value is 0, it doesn't mean its sockets
>> weren't subject to throttling. It just means you also need to check the
>> values in C's ancestors. Does that work?
> I was thinking more that it would account for all throttled durations, but
> it's true that we only count locally originating events for e.g.
> memory.events::low or pids.events::max. Hmm... I'm unsure. So, for events, I
> think local sources make sense as they track which limits are triggering
> where. However, I'm not sure that translates well to throttle duration, which
> is closer to pressure metrics than to event counters. We don't distinguish the
> sources of contention when presenting pressure metrics, after all.
I think calculating the value using self and ancestors would better match
the logic in mem_cgroup_under_socket_pressure(), and it would avoid the
issue Michal outlined without relying on an explanation in the docs -
checking a single value per cgroup to confirm whether sockets belonging
to that cgroup were being throttled seems more intuitive to me.
If we were to have the write side of the stat in vmpressure() look
something like this:

	unsigned long new_socket_pressure, old_socket_pressure;
	unsigned long duration_to_add;

	/* Push the pressure deadline one second out and fetch the old one. */
	new_socket_pressure = jiffies + HZ;
	old_socket_pressure = atomic_long_xchg(&memcg->socket_pressure,
					       new_socket_pressure);
	/* Time since the previous signal, capped at the HZ-long window. */
	duration_to_add = jiffies_to_usecs(min(new_socket_pressure -
					       old_socket_pressure,
					       (unsigned long)HZ));
	atomic_long_add(duration_to_add, &memcg->socket_pressure_duration);
And the read side:

	u64 total_duration = 0;

	/* Sum our own counter and those of our ancestors (root excluded). */
	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
		total_duration += atomic_long_read(&memcg->socket_pressure_duration);
Would that work?
There would be an issue with the reported value possibly being larger
than the real duration of the throttling, due to overlapping
intervals of socket_pressure with some ancestor. Is that a problem?
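For instance (hypothetical numbers): if this cgroup and its parent are
signalled at the same time and each accumulates a full 1-second window,
the read side would report roughly 2 seconds even though the sockets were
throttled for only about 1 second of wall-clock time.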
Thanks,
Matyas
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-20 16:51 ` Matyas Hurtik
@ 2025-08-20 19:03 ` Tejun Heo
2025-08-20 19:31 ` Shakeel Butt
2025-08-20 20:37 ` Matyas Hurtik
0 siblings, 2 replies; 25+ messages in thread
From: Tejun Heo @ 2025-08-20 19:03 UTC (permalink / raw)
To: Matyas Hurtik
Cc: Michal Koutný, Daniel Sedlak, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
Neal Cardwell, Kuniyuki Iwashima, David Ahern, Andrew Morton,
Shakeel Butt, Yosry Ahmed, linux-mm, netdev, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, cgroups
Hello,
On Wed, Aug 20, 2025 at 06:51:07PM +0200, Matyas Hurtik wrote:
...
> And the read side:
>
> 	u64 total_duration = 0;
>
> 	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
> 		total_duration += atomic_long_read(&memcg->socket_pressure_duration);
> Would that work?
This doesn't make sense to me. Why would a child report the numbers from its
ancestors?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-20 19:03 ` Tejun Heo
@ 2025-08-20 19:31 ` Shakeel Butt
2025-08-20 20:37 ` Matyas Hurtik
1 sibling, 0 replies; 25+ messages in thread
From: Shakeel Butt @ 2025-08-20 19:31 UTC (permalink / raw)
To: Tejun Heo
Cc: Matyas Hurtik, Michal Koutný, Daniel Sedlak, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima, David Ahern,
Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, cgroups
On Wed, Aug 20, 2025 at 09:03:11AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 20, 2025 at 06:51:07PM +0200, Matyas Hurtik wrote:
> ...
> > And the read side:
> > total_duration = 0;
> > for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
> > total_duration += atomic_long_read(&memcg->socket_pressure_duration);
> > Would that work?
>
> This doesn't make sense to me. Why would a child report the numbers from its
> ancestors?
I think Matyas might have wanted to do a tree traversal under the given
memcg and by mistake did this upward traversal. However, that will miss
the already deleted memcgs under the given memcg.
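I.e. something like the sketch below, still with the hypothetical
socket_pressure_duration field (and note such a walk only covers
descendants that still exist):

	u64 total = 0;
	struct mem_cgroup *iter;

	/* Pre-order walk of the subtree rooted at @memcg, @memcg included. */
	for (iter = mem_cgroup_iter(memcg, NULL, NULL); iter;
	     iter = mem_cgroup_iter(memcg, iter, NULL))
		total += atomic_long_read(&iter->socket_pressure_duration);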
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-20 19:03 ` Tejun Heo
2025-08-20 19:31 ` Shakeel Butt
@ 2025-08-20 20:37 ` Matyas Hurtik
2025-08-20 21:34 ` Shakeel Butt
1 sibling, 1 reply; 25+ messages in thread
From: Matyas Hurtik @ 2025-08-20 20:37 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Koutný, Daniel Sedlak, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
Neal Cardwell, Kuniyuki Iwashima, David Ahern, Andrew Morton,
Shakeel Butt, Yosry Ahmed, linux-mm, netdev, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, cgroups
Hello,
On 8/20/25 9:03 PM, Tejun Heo wrote:
> On Wed, Aug 20, 2025 at 06:51:07PM +0200, Matyas Hurtik wrote:
>> And the read side:
>>
>> 	u64 total_duration = 0;
>>
>> 	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
>> 		total_duration += atomic_long_read(&memcg->socket_pressure_duration);
>>
>> Would that work?
> This doesn't make sense to me. Why would a child report the numbers
> from its ancestors?
The result of mem_cgroup_under_socket_pressure() depends on
whether the cgroup itself or any of its ancestors has had
socket_pressure set.
So any duration during which an ancestor is throttled also
means the child is throttled.
By summing our own and our ancestors' socket_pressure_duration
we should get our total time being throttled
(possibly more, because of overlaps).
Thanks,
Matyas
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-20 20:37 ` Matyas Hurtik
@ 2025-08-20 21:34 ` Shakeel Butt
2025-08-21 18:44 ` Matyas Hurtik
0 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2025-08-20 21:34 UTC (permalink / raw)
To: Matyas Hurtik
Cc: Tejun Heo, Michal Koutný, Daniel Sedlak, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima, David Ahern,
Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, cgroups
On Wed, Aug 20, 2025 at 10:37:49PM +0200, Matyas Hurtik wrote:
> Hello,
>
> On 8/20/25 9:03 PM, Tejun Heo wrote:
> > On Wed, Aug 20, 2025 at 06:51:07PM +0200, Matyas Hurtik wrote:
> > > And the read side:
> > >
> > > 	u64 total_duration = 0;
> > >
> > > 	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
> > > 		total_duration += atomic_long_read(&memcg->socket_pressure_duration);
> > >
> > > Would that work?
> > This doesn't make sense to me. Why would a child report the numbers from
> > its ancestors?
>
> The result of mem_cgroup_under_socket_pressure() depends on whether
> the cgroup itself or any of its ancestors has had socket_pressure set.
>
> So any duration during which an ancestor is throttled also means the
> child is throttled.
>
> By summing our own and our ancestors' socket_pressure_duration we
> should get our total time being throttled (possibly more, because of
> overlaps).
This is not how memcg stats (and their semantics) work, and maybe that is
not what you want. In the memcg stats semantics, for a given memcg the
socket_pressure_duration metric is not the stall duration faced by
sockets in that memcg; instead it is the stall duration caused by the
memcg and its descendants. If that is not what we want, we need to do
something different and orthogonal to memcg stats.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] memcg: expose socket memory pressure in a cgroup
2025-08-20 21:34 ` Shakeel Butt
@ 2025-08-21 18:44 ` Matyas Hurtik
0 siblings, 0 replies; 25+ messages in thread
From: Matyas Hurtik @ 2025-08-21 18:44 UTC (permalink / raw)
To: Shakeel Butt
Cc: Tejun Heo, Michal Koutný, Daniel Sedlak, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima, David Ahern,
Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, cgroups
Hello,
On 8/20/25 11:34 PM, Shakeel Butt wrote:
> On Wed, Aug 20, 2025 at 10:37:49PM +0200, Matyas Hurtik wrote:
>> The result of mem_cgroup_under_socket_pressure() depends on whether
>> the cgroup itself or any of its ancestors has had socket_pressure
>> set. So any duration during which an ancestor is throttled also means
>> the child is throttled. By summing our own and our ancestors'
>> socket_pressure_duration we should get our total time being throttled
>> (possibly more, because of overlaps).
> This is not how memcg stats (and their semantics) work, and maybe that
> is not what you want. In the memcg stats semantics, for a given memcg
> the socket_pressure_duration metric is not the stall duration faced by
> sockets in that memcg; instead it is the stall duration caused by the
> memcg and its descendants. If that is not what we want, we need to do
> something different and orthogonal to memcg stats.
By memcg stats, do you mean only the contents of the memory.stat file?
Would it be semantically consistent if we were to put it into
a separate file (like memory.net.throttled) instead?
Just to summarize the proposed methods of hierarchical propagation:
1) None - keeping the reported duration local to that cgroup:
value = self
Would not be too out of place, since memory.events.local
already does not accumulate hierarchically.
To determine whether sockets in a memcg were throttled,
we would traverse the /sys/fs/cgroup/ hierarchy from root to
the cgroup of interest and sum those local durations.
2) Propagating the duration upwards (using rstat or a simple iteration
towards the root memcg during write):
value = self + sum of children
Most semantically consistent with the other exposed stat files.
Could be added as an entry in memory.stat.
Since the pressure gets applied from ancestors to children
(see mem_cgroup_under_socket_pressure()), determining the duration of
throttling for sockets in some cgroup would be hardest in this variant.
It would involve iterating from the root to the examined cgroup,
at each node subtracting the values of its children from that node's
value; the sum of those differences would correspond to the total
duration throttled.
3) Propagating the duration downwards (write only locally,
read traversing the hierarchy upwards):
value = self + sum of ancestors
Mirrors the logic used in mem_cgroup_under_socket_pressure();
an increase in the reported value for a memcg would coincide with more
throttling being applied to the sockets of that memcg.
I think that variant 3 would be the most useful for diagnosing
when this socket throttling happens in a certain memcg.
I'm not sure if I understand the use case of variant 2.
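To make variant 3 concrete, the read side for such a separate file could
look roughly like the sketch below (the callback and field names are made
up, so please take it only as an illustration):

	/* Hypothetical read handler for a memory.net.throttled file:
	 * variant 3, i.e. self plus all non-root ancestors. */
	static u64 memory_net_throttled_read_u64(struct cgroup_subsys_state *css,
						 struct cftype *cft)
	{
		struct mem_cgroup *memcg = mem_cgroup_from_css(css);
		u64 total = 0;

		for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
			total += atomic_long_read(&memcg->socket_pressure_duration);

		return total;
	}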
Thanks,
Matyas
^ permalink raw reply [flat|nested] 25+ messages in thread