* [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
@ 2025-11-10 10:19 Leon Huang Fu
2025-11-10 11:28 ` Michal Hocko
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: Leon Huang Fu @ 2025-11-10 10:19 UTC (permalink / raw)
To: linux-mm
Cc: tj, mkoutny, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, akpm, joel.granados, jack, laoar.shao, mclapinski,
kyle.meyer, corbet, lance.yang, leon.huangfu, linux-doc,
linux-kernel, cgroups
Memory cgroup statistics are updated asynchronously with periodic
flushing to reduce overhead. The current implementation uses a flush
threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
determining when to aggregate per-CPU memory cgroup statistics. On
systems with high core counts, this threshold can become very large
(e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
statistics when userspace reads memory.stat files.
This is particularly problematic for monitoring and management tools
that rely on reasonably fresh statistics, as they may observe data
that is thousands of updates out of date.
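For reference, a rough model of the batching described above (illustrative
only, not the kernel implementation; names are simplified):

  #define MEMCG_CHARGE_BATCH 64

  static long pending_updates;            /* sum of unflushed per-CPU updates */

  /* Updates are folded into the aggregated stats only once the pending
   * total exceeds MEMCG_CHARGE_BATCH * num_online_cpus(). */
  static int stats_need_flush(int nr_online_cpus)
  {
          return pending_updates > (long)MEMCG_CHARGE_BATCH * nr_online_cpus;
  }

  /* e.g. with 256 online CPUs, 64 * 256 = 16,384 updates may accumulate
   * before readers of memory.stat see fresh numbers. */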
Introduce a new write-only file, memory.stat_refresh, that allows
userspace to explicitly trigger an immediate flush of memory statistics.
Writing any value to this file forces a synchronous flush via
__mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
descendants, ensuring that subsequent reads of memory.stat and
memory.numa_stat reflect current data.
This approach follows the pattern established by /proc/sys/vm/stat_refresh
and memory.peak, where the written value is ignored, keeping the
interface simple and consistent with existing kernel APIs.
Usage example:
echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
cat /sys/fs/cgroup/mygroup/memory.stat
The feature is available in both cgroup v1 and v2 for consistency.
Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com>
---
v2 -> v3:
- Flush stats by memory.stat_refresh (per Michal)
- https://lore.kernel.org/linux-mm/20251105074917.94531-1-leon.huangfu@shopee.com/
v1 -> v2:
- Flush stats when writing to the file (per Michal).
- https://lore.kernel.org/linux-mm/20251104031908.77313-1-leon.huangfu@shopee.com/
Documentation/admin-guide/cgroup-v2.rst | 21 +++++++++++++++++--
mm/memcontrol-v1.c | 4 ++++
mm/memcontrol-v1.h | 2 ++
mm/memcontrol.c | 27 ++++++++++++++++++-------
4 files changed, 45 insertions(+), 9 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3345961c30ac..ca079932f957 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1337,7 +1337,7 @@ PAGE_SIZE multiple when read back.
cgroup is within its effective low boundary, the cgroup's
memory won't be reclaimed unless there is no reclaimable
memory available in unprotected cgroups.
- Above the effective low boundary (or
+ Above the effective low boundary (or
effective min boundary if it is higher), pages are reclaimed
proportionally to the overage, reducing reclaim pressure for
smaller overages.
@@ -1785,6 +1785,23 @@ The following nested keys are defined.
up if hugetlb usage is accounted for in memory.current (i.e.
cgroup is mounted with the memory_hugetlb_accounting option).
+ memory.stat_refresh
+ A write-only file which exists on non-root cgroups.
+
+ Writing any value to this file forces an immediate flush of
+ memory statistics for this cgroup and its descendants. This
+ ensures subsequent reads of memory.stat and memory.numa_stat
+ reflect the most current data.
+
+ This is useful on high-core count systems where per-CPU caching
+ can lead to stale statistics, or when precise memory usage
+ information is needed for monitoring or debugging purposes.
+
+ Example::
+
+ echo 1 > memory.stat_refresh
+ cat memory.stat
+
memory.numa_stat
A read-only nested-keyed file which exists on non-root cgroups.
@@ -2173,7 +2190,7 @@ of the two is enforced.
cgroup writeback requires explicit support from the underlying
filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
-btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
+btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
attributed to the root cgroup.
There are inherent differences in memory and writeback management
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff742..c3eac9b1f1be 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -2041,6 +2041,10 @@ struct cftype mem_cgroup_legacy_files[] = {
.name = "stat",
.seq_show = memory_stat_show,
},
+ {
+ .name = "stat_refresh",
+ .write = memory_stat_refresh_write,
+ },
{
.name = "force_empty",
.write = mem_cgroup_force_empty_write,
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 6358464bb416..a14d4d74c9aa 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -29,6 +29,8 @@ void drain_all_stock(struct mem_cgroup *root_memcg);
unsigned long memcg_events(struct mem_cgroup *memcg, int event);
unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
int memory_stat_show(struct seq_file *m, void *v);
+ssize_t memory_stat_refresh_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off);
void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bfc986da3289..19ef4b971d8d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -610,6 +610,15 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
css_rstat_flush(&memcg->css);
}
+static void memcg_flush_stats(struct mem_cgroup *memcg, bool force)
+{
+ if (mem_cgroup_disabled())
+ return;
+
+ memcg = memcg ?: root_mem_cgroup;
+ __mem_cgroup_flush_stats(memcg, force);
+}
+
/*
* mem_cgroup_flush_stats - flush the stats of a memory cgroup subtree
* @memcg: root of the subtree to flush
@@ -621,13 +630,7 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
*/
void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
{
- if (mem_cgroup_disabled())
- return;
-
- if (!memcg)
- memcg = root_mem_cgroup;
-
- __mem_cgroup_flush_stats(memcg, false);
+ memcg_flush_stats(memcg, false);
}
void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
@@ -4530,6 +4533,12 @@ int memory_stat_show(struct seq_file *m, void *v)
return 0;
}
+ssize_t memory_stat_refresh_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off)
+{
+ memcg_flush_stats(mem_cgroup_from_css(of_css(of)), true);
+ return nbytes;
+}
+
#ifdef CONFIG_NUMA
static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
int item)
@@ -4666,6 +4675,10 @@ static struct cftype memory_files[] = {
.name = "stat",
.seq_show = memory_stat_show,
},
+ {
+ .name = "stat_refresh",
+ .write = memory_stat_refresh_write,
+ },
#ifdef CONFIG_NUMA
{
.name = "numa_stat",
--
2.51.2
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-10 10:19 [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing Leon Huang Fu
@ 2025-11-10 11:28 ` Michal Hocko
2025-11-11 6:12 ` Leon Huang Fu
2025-11-10 11:52 ` Harry Yoo
` (2 subsequent siblings)
3 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-11-10 11:28 UTC (permalink / raw)
To: Leon Huang Fu
Cc: linux-mm, tj, mkoutny, hannes, roman.gushchin, shakeel.butt,
muchun.song, akpm, joel.granados, jack, laoar.shao, mclapinski,
kyle.meyer, corbet, lance.yang, linux-doc, linux-kernel, cgroups
On Mon 10-11-25 18:19:48, Leon Huang Fu wrote:
> Memory cgroup statistics are updated asynchronously with periodic
> flushing to reduce overhead. The current implementation uses a flush
> threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
> determining when to aggregate per-CPU memory cgroup statistics. On
> systems with high core counts, this threshold can become very large
> (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
> statistics when userspace reads memory.stat files.
>
> This is particularly problematic for monitoring and management tools
> that rely on reasonably fresh statistics, as they may observe data
> that is thousands of updates out of date.
>
> Introduce a new write-only file, memory.stat_refresh, that allows
> userspace to explicitly trigger an immediate flush of memory statistics.
> Writing any value to this file forces a synchronous flush via
> __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
> descendants, ensuring that subsequent reads of memory.stat and
> memory.numa_stat reflect current data.
>
> This approach follows the pattern established by /proc/sys/vm/stat_refresh
> and memory.peak, where the written value is ignored, keeping the
> interface simple and consistent with existing kernel APIs.
>
> Usage example:
> echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
> cat /sys/fs/cgroup/mygroup/memory.stat
>
> The feature is available in both cgroup v1 and v2 for consistency.
>
> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> v2 -> v3:
> - Flush stats by memory.stat_refresh (per Michal)
> - https://lore.kernel.org/linux-mm/20251105074917.94531-1-leon.huangfu@shopee.com/
>
> v1 -> v2:
> - Flush stats when write the file (per Michal).
> - https://lore.kernel.org/linux-mm/20251104031908.77313-1-leon.huangfu@shopee.com/
>
> Documentation/admin-guide/cgroup-v2.rst | 21 +++++++++++++++++--
> mm/memcontrol-v1.c | 4 ++++
> mm/memcontrol-v1.h | 2 ++
> mm/memcontrol.c | 27 ++++++++++++++++++-------
> 4 files changed, 45 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3345961c30ac..ca079932f957 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1337,7 +1337,7 @@ PAGE_SIZE multiple when read back.
> cgroup is within its effective low boundary, the cgroup's
> memory won't be reclaimed unless there is no reclaimable
> memory available in unprotected cgroups.
> - Above the effective low boundary (or
> + Above the effective low boundary (or
> effective min boundary if it is higher), pages are reclaimed
> proportionally to the overage, reducing reclaim pressure for
> smaller overages.
> @@ -1785,6 +1785,23 @@ The following nested keys are defined.
> up if hugetlb usage is accounted for in memory.current (i.e.
> cgroup is mounted with the memory_hugetlb_accounting option).
>
> + memory.stat_refresh
> + A write-only file which exists on non-root cgroups.
> +
> + Writing any value to this file forces an immediate flush of
> + memory statistics for this cgroup and its descendants. This
> + ensures subsequent reads of memory.stat and memory.numa_stat
> + reflect the most current data.
> +
> + This is useful on high-core count systems where per-CPU caching
> + can lead to stale statistics, or when precise memory usage
> + information is needed for monitoring or debugging purposes.
> +
> + Example::
> +
> + echo 1 > memory.stat_refresh
> + cat memory.stat
> +
> memory.numa_stat
> A read-only nested-keyed file which exists on non-root cgroups.
>
> @@ -2173,7 +2190,7 @@ of the two is enforced.
>
> cgroup writeback requires explicit support from the underlying
> filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
> -btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
> +btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
> attributed to the root cgroup.
>
> There are inherent differences in memory and writeback management
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 6eed14bff742..c3eac9b1f1be 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -2041,6 +2041,10 @@ struct cftype mem_cgroup_legacy_files[] = {
> .name = "stat",
> .seq_show = memory_stat_show,
> },
> + {
> + .name = "stat_refresh",
> + .write = memory_stat_refresh_write,
> + },
> {
> .name = "force_empty",
> .write = mem_cgroup_force_empty_write,
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 6358464bb416..a14d4d74c9aa 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -29,6 +29,8 @@ void drain_all_stock(struct mem_cgroup *root_memcg);
> unsigned long memcg_events(struct mem_cgroup *memcg, int event);
> unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
> int memory_stat_show(struct seq_file *m, void *v);
> +ssize_t memory_stat_refresh_write(struct kernfs_open_file *of, char *buf,
> + size_t nbytes, loff_t off);
>
> void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
> struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bfc986da3289..19ef4b971d8d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -610,6 +610,15 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
> css_rstat_flush(&memcg->css);
> }
>
> +static void memcg_flush_stats(struct mem_cgroup *memcg, bool force)
> +{
> + if (mem_cgroup_disabled())
> + return;
> +
> + memcg = memcg ?: root_mem_cgroup;
> + __mem_cgroup_flush_stats(memcg, force);
> +}
> +
> /*
> * mem_cgroup_flush_stats - flush the stats of a memory cgroup subtree
> * @memcg: root of the subtree to flush
> @@ -621,13 +630,7 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
> */
> void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> {
> - if (mem_cgroup_disabled())
> - return;
> -
> - if (!memcg)
> - memcg = root_mem_cgroup;
> -
> - __mem_cgroup_flush_stats(memcg, false);
> + memcg_flush_stats(memcg, false);
> }
>
> void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
> @@ -4530,6 +4533,12 @@ int memory_stat_show(struct seq_file *m, void *v)
> return 0;
> }
>
> +ssize_t memory_stat_refresh_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off)
> +{
> + memcg_flush_stats(mem_cgroup_from_css(of_css(of)), true);
> + return nbytes;
> +}
> +
> #ifdef CONFIG_NUMA
> static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
> int item)
> @@ -4666,6 +4675,10 @@ static struct cftype memory_files[] = {
> .name = "stat",
> .seq_show = memory_stat_show,
> },
> + {
> + .name = "stat_refresh",
> + .write = memory_stat_refresh_write,
> + },
> #ifdef CONFIG_NUMA
> {
> .name = "numa_stat",
> --
> 2.51.2
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-10 10:19 [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing Leon Huang Fu
2025-11-10 11:28 ` Michal Hocko
@ 2025-11-10 11:52 ` Harry Yoo
2025-11-11 6:12 ` Leon Huang Fu
2025-11-10 13:50 ` Michal Koutný
2025-11-11 19:10 ` Waiman Long
3 siblings, 1 reply; 21+ messages in thread
From: Harry Yoo @ 2025-11-10 11:52 UTC (permalink / raw)
To: Leon Huang Fu
Cc: linux-mm, tj, mkoutny, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, akpm, joel.granados, jack, laoar.shao,
mclapinski, kyle.meyer, corbet, lance.yang, linux-doc,
linux-kernel, cgroups
On Mon, Nov 10, 2025 at 06:19:48PM +0800, Leon Huang Fu wrote:
> Memory cgroup statistics are updated asynchronously with periodic
> flushing to reduce overhead. The current implementation uses a flush
> threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
> determining when to aggregate per-CPU memory cgroup statistics. On
> systems with high core counts, this threshold can become very large
> (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
> statistics when userspace reads memory.stat files.
>
> This is particularly problematic for monitoring and management tools
> that rely on reasonably fresh statistics, as they may observe data
> that is thousands of updates out of date.
>
> Introduce a new write-only file, memory.stat_refresh, that allows
> userspace to explicitly trigger an immediate flush of memory statistics.
>
> Writing any value to this file forces a synchronous flush via
> __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
> descendants, ensuring that subsequent reads of memory.stat and
> memory.numa_stat reflect current data.
>
> This approach follows the pattern established by /proc/sys/vm/stat_refresh
> and memory.peak, where the written value is ignored, keeping the
> interface simple and consistent with existing kernel APIs.
>
> Usage example:
> echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
> cat /sys/fs/cgroup/mygroup/memory.stat
>
> The feature is available in both cgroup v1 and v2 for consistency.
>
> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com>
> ---
> v2 -> v3:
> - Flush stats by memory.stat_refresh (per Michal)
> - https://lore.kernel.org/linux-mm/20251105074917.94531-1-leon.huangfu@shopee.com/
>
> v1 -> v2:
> - Flush stats when write the file (per Michal).
> - https://lore.kernel.org/linux-mm/20251104031908.77313-1-leon.huangfu@shopee.com/
>
> Documentation/admin-guide/cgroup-v2.rst | 21 +++++++++++++++++--
> mm/memcontrol-v1.c | 4 ++++
> mm/memcontrol-v1.h | 2 ++
> mm/memcontrol.c | 27 ++++++++++++++++++-------
> 4 files changed, 45 insertions(+), 9 deletions(-)
Hi Leon, I have a few questions on the patch.
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3345961c30ac..ca079932f957 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1337,7 +1337,7 @@ PAGE_SIZE multiple when read back.
> cgroup is within its effective low boundary, the cgroup's
> memory won't be reclaimed unless there is no reclaimable
> memory available in unprotected cgroups.
> - Above the effective low boundary (or
> + Above the effective low boundary (or
Is this whitespace change? it looks the same as before.
> effective min boundary if it is higher), pages are reclaimed
> proportionally to the overage, reducing reclaim pressure for
> smaller overages.
> @@ -1785,6 +1785,23 @@ The following nested keys are defined.
> up if hugetlb usage is accounted for in memory.current (i.e.
> cgroup is mounted with the memory_hugetlb_accounting option).
>
> + memory.stat_refresh
> + A write-only file which exists on non-root cgroups.
Why don't we create the file for the root cgroup?
> + Writing any value to this file forces an immediate flush of
> + memory statistics for this cgroup and its descendants. This
> + ensures subsequent reads of memory.stat and memory.numa_stat
> + reflect the most current data.
> +
> + This is useful on high-core count systems where per-CPU caching
> + can lead to stale statistics, or when precise memory usage
> + information is needed for monitoring or debugging purposes.
> +
> + Example::
> +
> + echo 1 > memory.stat_refresh
> + cat memory.stat
> +
> memory.numa_stat
> A read-only nested-keyed file which exists on non-root cgroups.
>
> @@ -2173,7 +2190,7 @@ of the two is enforced.
>
> cgroup writeback requires explicit support from the underlying
> filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
> -btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
> +btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
> attributed to the root cgroup.
Same here, not sure what's changed...
> There are inherent differences in memory and writeback management
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 6358464bb416..a14d4d74c9aa 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -4666,6 +4675,10 @@ static struct cftype memory_files[] = {
> .name = "stat",
> .seq_show = memory_stat_show,
> },
> + {
> + .name = "stat_refresh",
> + .write = memory_stat_refresh_write,
I think we should use the CFTYPE_NOT_ON_ROOT flag to avoid creating
the file for the root cgroup if that's intended?
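Something like this (untested), assuming the root restriction is what we
want here:

	{
		.name = "stat_refresh",
		.flags = CFTYPE_NOT_ON_ROOT,
		.write = memory_stat_refresh_write,
	},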
--
Cheers,
Harry / Hyeonggon
> + },
> #ifdef CONFIG_NUMA
> {
> .name = "numa_stat",
> --
> 2.51.2
>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-10 10:19 [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing Leon Huang Fu
2025-11-10 11:28 ` Michal Hocko
2025-11-10 11:52 ` Harry Yoo
@ 2025-11-10 13:50 ` Michal Koutný
2025-11-10 16:04 ` Tejun Heo
` (3 more replies)
2025-11-11 19:10 ` Waiman Long
3 siblings, 4 replies; 21+ messages in thread
From: Michal Koutný @ 2025-11-10 13:50 UTC (permalink / raw)
To: Leon Huang Fu
Cc: linux-mm, tj, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, akpm, joel.granados, jack, laoar.shao, mclapinski,
kyle.meyer, corbet, lance.yang, linux-doc, linux-kernel, cgroups
[-- Attachment #1: Type: text/plain, Size: 3119 bytes --]
Hello Leon.
On Mon, Nov 10, 2025 at 06:19:48PM +0800, Leon Huang Fu <leon.huangfu@shopee.com> wrote:
> Memory cgroup statistics are updated asynchronously with periodic
> flushing to reduce overhead. The current implementation uses a flush
> threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
> determining when to aggregate per-CPU memory cgroup statistics. On
> systems with high core counts, this threshold can become very large
> (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
> statistics when userspace reads memory.stat files.
>
> This is particularly problematic for monitoring and management tools
> that rely on reasonably fresh statistics, as they may observe data
> that is thousands of updates out of date.
>
> Introduce a new write-only file, memory.stat_refresh, that allows
> userspace to explicitly trigger an immediate flush of memory statistics.
I think it's worth thinking twice when introducing a new file like
this...
> Writing any value to this file forces a synchronous flush via
> __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
> descendants, ensuring that subsequent reads of memory.stat and
> memory.numa_stat reflect current data.
>
> This approach follows the pattern established by /proc/sys/vm/stat_refresh
> and memory.peak, where the written value is ignored, keeping the
> interface simple and consistent with existing kernel APIs.
>
> Usage example:
> echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
> cat /sys/fs/cgroup/mygroup/memory.stat
>
> The feature is available in both cgroup v1 and v2 for consistency.
First, I find the motivation by the testcase (not real world) weak when
considering such an API change (e.g. real world would be confined to
fewer CPUs or there'd be other "traffic" causing flushes making this a
non-issue, we don't know here).
Second, this is open to everyone (non-root) who mkdir's their cgroups.
Then why not make it the default memory.stat behavior? (Tongue-in-cheek,
but [*].)
With this change, we admit the implementation (async flushing) and leak
it to the users which is hard to take back. Why should we continue doing
any implicit in-kernel flushing afterwards?
Next, v1 and v2 haven't been consistent since introduction of v2 (unlike
some other controllers that share code or even cftypes between v1 and
v2). So I'd avoid introducing a new file to V1 API.
When looking for analogies, I admittedly like memory.reclaim's
O_NONBLOCK better (than /proc/sys/vm/stat_refresh). That would be an
argument for flushing by default, mentioned above [*].
Also, this undercuts the hooking of rstat flushing into BPF. I think the
attempts were given up too early (I read about the verifier vs
seq_file). Have you tried bypassing bailout from
__mem_cgroup_flush_stats via trace_memcg_flush_stats?
All in all, I'd like to have more backing data on insufficiency of (all
the) rstat optimizations before opening explicit flushes like this
(especially when it's meant to be exposed by BPF already).
Thanks,
Michal
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-10 13:50 ` Michal Koutný
@ 2025-11-10 16:04 ` Tejun Heo
2025-11-11 6:27 ` Leon Huang Fu
2025-11-11 1:00 ` Chen Ridong
` (2 subsequent siblings)
3 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2025-11-10 16:04 UTC (permalink / raw)
To: Michal Koutný
Cc: Leon Huang Fu, linux-mm, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, akpm, joel.granados, jack, laoar.shao,
mclapinski, kyle.meyer, corbet, lance.yang, linux-doc,
linux-kernel, cgroups
Hello,
On Mon, Nov 10, 2025 at 02:50:11PM +0100, Michal Koutný wrote:
> All in all, I'd like to have more backing data on insufficiency of (all
> the) rstat optimizations before opening explicit flushes like this
> (especially when it's meant to be exposed by BPF already).
+1. If the current behavior introduces errors too significant to ignore, I'd
much rather see it fixed from the implementation side rather than exposing
internal operation details like this.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-10 13:50 ` Michal Koutný
2025-11-10 16:04 ` Tejun Heo
@ 2025-11-11 1:00 ` Chen Ridong
2025-11-11 6:44 ` Leon Huang Fu
2025-11-11 6:13 ` Leon Huang Fu
2025-11-11 8:10 ` Michal Hocko
3 siblings, 1 reply; 21+ messages in thread
From: Chen Ridong @ 2025-11-11 1:00 UTC (permalink / raw)
To: Michal Koutný, Leon Huang Fu
Cc: linux-mm, tj, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, akpm, joel.granados, jack, laoar.shao, mclapinski,
kyle.meyer, corbet, lance.yang, linux-doc, linux-kernel, cgroups
On 2025/11/10 21:50, Michal Koutný wrote:
> Hello Leon.
>
> On Mon, Nov 10, 2025 at 06:19:48PM +0800, Leon Huang Fu <leon.huangfu@shopee.com> wrote:
>> Memory cgroup statistics are updated asynchronously with periodic
>> flushing to reduce overhead. The current implementation uses a flush
>> threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
>> determining when to aggregate per-CPU memory cgroup statistics. On
>> systems with high core counts, this threshold can become very large
>> (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
>> statistics when userspace reads memory.stat files.
>>
We have encountered this problem multiple times when running LTP tests. It can easily occur when
using a 64K page size.
error:
memcg_stat_rss 10 TFAIL: rss is 0, 266240 expected
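Roughly, the failing check amounts to the following (a simplified sketch,
not the actual LTP source; the cgroup path is made up for illustration,
sizes assume a 4K page size, and the process is assumed to already be in
the test cgroup):

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t sz = 266240;             /* 65 pages at 4K */
          char line[128];
          FILE *f;
          char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  return 1;
          memset(p, 1, sz);               /* fault the anonymous pages in */

          /* Read "rss" from the cgroup's memory.stat right away; with
           * deferred flushing this can still report 0. */
          f = fopen("/sys/fs/cgroup/memory/test/memory.stat", "r");
          if (!f)
                  return 1;
          while (fgets(line, sizeof(line), f))
                  if (!strncmp(line, "rss ", 4))
                          printf("%s", line);
          fclose(f);
          return 0;
  }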
>> This is particularly problematic for monitoring and management tools
>> that rely on reasonably fresh statistics, as they may observe data
>> that is thousands of updates out of date.
>>
>> Introduce a new write-only file, memory.stat_refresh, that allows
>> userspace to explicitly trigger an immediate flush of memory statistics.
>
> I think it's worth thinking twice when introducing a new file like
> this...
>
>> Writing any value to this file forces a synchronous flush via
>> __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
>> descendants, ensuring that subsequent reads of memory.stat and
>> memory.numa_stat reflect current data.
>>
>> This approach follows the pattern established by /proc/sys/vm/stat_refresh
>> and memory.peak, where the written value is ignored, keeping the
>> interface simple and consistent with existing kernel APIs.
>>
>> Usage example:
>> echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
>> cat /sys/fs/cgroup/mygroup/memory.stat
>>
>> The feature is available in both cgroup v1 and v2 for consistency.
>
> First, I find the motivation by the testcase (not real world) weak when
> considering such an API change (e.g. real world would be confined to
> fewer CPUs or there'd be other "traffic" causing flushes making this a
> non-issue, we don't know here).
>
> Second, this is open to everyone (non-root) who mkdir's their cgroups.
> Then why not make it the default memory.stat behavior? (Tongue-in-cheek,
> but [*].)
>
> With this change, we admit the implementation (async flushing) and leak
> it to the users which is hard to take back. Why should we continue doing
> any implicit in-kernel flushing afterwards?
>
> Next, v1 and v2 haven't been consistent since introduction of v2 (unlike
> some other controllers that share code or even cftypes between v1 and
> v2). So I'd avoid introducing a new file to V1 API.
>
We encountered this problem in v1; I think this is a common problem that should be fixed.
> When looking for analogies, I admittedly like memory.reclaim's
> O_NONBLOCK better (than /proc/sys/vm/stat_refresh). That would be an
> argument for flushing by default mentioned abovee [*]).
>
> Also, this undercuts the hooking of rstat flushing into BPF. I think the
> attempts were given up too early (I read about the verifier vs
> seq_file). Have you tried bypassing bailout from
> __mem_cgroup_flush_stats via trace_memcg_flush_stats?
>
>
> All in all, I'd like to have more backing data on insufficiency of (all
> the) rstat optimizations before opening explicit flushes like this
> (especially when it's meant to be exposed by BPF already).
>
> Thanks,
> Michal
>
>
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-10 11:28 ` Michal Hocko
@ 2025-11-11 6:12 ` Leon Huang Fu
0 siblings, 0 replies; 21+ messages in thread
From: Leon Huang Fu @ 2025-11-11 6:12 UTC (permalink / raw)
To: mhocko
Cc: akpm, cgroups, corbet, hannes, jack, joel.granados, kyle.meyer,
lance.yang, laoar.shao, leon.huangfu, linux-doc, linux-kernel,
linux-mm, mclapinski, mkoutny, muchun.song, roman.gushchin,
shakeel.butt, tj
On Mon, Nov 10, 2025 at 7:28 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 10-11-25 18:19:48, Leon Huang Fu wrote:
> > Memory cgroup statistics are updated asynchronously with periodic
> > flushing to reduce overhead. The current implementation uses a flush
> > threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
> > determining when to aggregate per-CPU memory cgroup statistics. On
> > systems with high core counts, this threshold can become very large
> > (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
> > statistics when userspace reads memory.stat files.
> >
> > This is particularly problematic for monitoring and management tools
> > that rely on reasonably fresh statistics, as they may observe data
> > that is thousands of updates out of date.
> >
> > Introduce a new write-only file, memory.stat_refresh, that allows
> > userspace to explicitly trigger an immediate flush of memory statistics.
> > Writing any value to this file forces a synchronous flush via
> > __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
> > descendants, ensuring that subsequent reads of memory.stat and
> > memory.numa_stat reflect current data.
> >
> > This approach follows the pattern established by /proc/sys/vm/stat_refresh
> > and memory.peak, where the written value is ignored, keeping the
> > interface simple and consistent with existing kernel APIs.
> >
> > Usage example:
> > echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
> > cat /sys/fs/cgroup/mygroup/memory.stat
> >
> > The feature is available in both cgroup v1 and v2 for consistency.
> >
> > Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Thanks!
>
Thank you for your review.
Thanks,
Leon
[...]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-10 11:52 ` Harry Yoo
@ 2025-11-11 6:12 ` Leon Huang Fu
0 siblings, 0 replies; 21+ messages in thread
From: Leon Huang Fu @ 2025-11-11 6:12 UTC (permalink / raw)
To: harry.yoo
Cc: akpm, cgroups, corbet, hannes, jack, joel.granados, kyle.meyer,
lance.yang, laoar.shao, leon.huangfu, linux-doc, linux-kernel,
linux-mm, mclapinski, mhocko, mkoutny, muchun.song,
roman.gushchin, shakeel.butt, tj
Hi Harry,
On Mon, Nov 10, 2025 at 7:52 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Mon, Nov 10, 2025 at 06:19:48PM +0800, Leon Huang Fu wrote:
> > Memory cgroup statistics are updated asynchronously with periodic
> > flushing to reduce overhead. The current implementation uses a flush
> > threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
> > determining when to aggregate per-CPU memory cgroup statistics. On
> > systems with high core counts, this threshold can become very large
> > (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
> > statistics when userspace reads memory.stat files.
> >
> > This is particularly problematic for monitoring and management tools
> > that rely on reasonably fresh statistics, as they may observe data
> > that is thousands of updates out of date.
> >
> > Introduce a new write-only file, memory.stat_refresh, that allows
> > userspace to explicitly trigger an immediate flush of memory statistics.
> >
> > Writing any value to this file forces a synchronous flush via
> > __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
> > descendants, ensuring that subsequent reads of memory.stat and
> > memory.numa_stat reflect current data.
> >
> > This approach follows the pattern established by /proc/sys/vm/stat_refresh
> > and memory.peak, where the written value is ignored, keeping the
> > interface simple and consistent with existing kernel APIs.
> >
> > Usage example:
> > echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
> > cat /sys/fs/cgroup/mygroup/memory.stat
> >
> > The feature is available in both cgroup v1 and v2 for consistency.
> >
> > Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com>
> > ---
> > v2 -> v3:
> > - Flush stats by memory.stat_refresh (per Michal)
> > - https://lore.kernel.org/linux-mm/20251105074917.94531-1-leon.huangfu@shopee.com/
> >
> > v1 -> v2:
> > - Flush stats when write the file (per Michal).
> > - https://lore.kernel.org/linux-mm/20251104031908.77313-1-leon.huangfu@shopee.com/
> >
> > Documentation/admin-guide/cgroup-v2.rst | 21 +++++++++++++++++--
> > mm/memcontrol-v1.c | 4 ++++
> > mm/memcontrol-v1.h | 2 ++
> > mm/memcontrol.c | 27 ++++++++++++++++++-------
> > 4 files changed, 45 insertions(+), 9 deletions(-)
>
> Hi Leon, I have a few questions on the patch.
>
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 3345961c30ac..ca079932f957 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1337,7 +1337,7 @@ PAGE_SIZE multiple when read back.
> > cgroup is within its effective low boundary, the cgroup's
> > memory won't be reclaimed unless there is no reclaimable
> > memory available in unprotected cgroups.
> > - Above the effective low boundary (or
> > + Above the effective low boundary (or
>
> Is this whitespace change? it looks the same as before.
>
Yes, that hunk just trims the trailing whitespace.
If you'd prefer to avoid the churn, I'm happy to drop it from the series.
> > effective min boundary if it is higher), pages are reclaimed
> > proportionally to the overage, reducing reclaim pressure for
> > smaller overages.
> > @@ -1785,6 +1785,23 @@ The following nested keys are defined.
> > up if hugetlb usage is accounted for in memory.current (i.e.
> > cgroup is mounted with the memory_hugetlb_accounting option).
> >
> > + memory.stat_refresh
> > + A write-only file which exists on non-root cgroups.
>
> Why don't we create the file for the root cgroup?
>
Thanks for pointing that out—I copied the wording from the memory.stat section without double-checking.
All three files, memory.{stat,numa_stat,stat_refresh}, are created for the root cgroup.
> > + Writing any value to this file forces an immediate flush of
> > + memory statistics for this cgroup and its descendants. This
> > + ensures subsequent reads of memory.stat and memory.numa_stat
> > + reflect the most current data.
> > +
> > + This is useful on high-core count systems where per-CPU caching
> > + can lead to stale statistics, or when precise memory usage
> > + information is needed for monitoring or debugging purposes.
> > +
> > + Example::
> > +
> > + echo 1 > memory.stat_refresh
> > + cat memory.stat
> > +
> > memory.numa_stat
> > A read-only nested-keyed file which exists on non-root cgroups.
> >
> > @@ -2173,7 +2190,7 @@ of the two is enforced.
> >
> > cgroup writeback requires explicit support from the underlying
> > filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
> > -btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
> > +btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
> > attributed to the root cgroup.
>
> Same here, not sure what's changed...
That's just trimming the trailing whitespace.
>
> > There are inherent differences in memory and writeback management
> > diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> > index 6358464bb416..a14d4d74c9aa 100644
> > --- a/mm/memcontrol-v1.h
> > +++ b/mm/memcontrol-v1.h
> > @@ -4666,6 +4675,10 @@ static struct cftype memory_files[] = {
> > .name = "stat",
> > .seq_show = memory_stat_show,
> > },
> > + {
> > + .name = "stat_refresh",
> > + .write = memory_stat_refresh_write,
>
> I think we should use the CFTYPE_NOT_ON_ROOT flag to avoid creating
> the file for the root cgroup if that's intended?
>
I kept memory.stat_refresh aligned with the existing memory.stat entry, so
I left CFTYPE_NOT_ON_ROOT unset.
That said, the documentation is behind the current behavior; I'll update
it to spell out that the files exist on the root cgroup too.
Thanks,
Leon
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-10 13:50 ` Michal Koutný
2025-11-10 16:04 ` Tejun Heo
2025-11-11 1:00 ` Chen Ridong
@ 2025-11-11 6:13 ` Leon Huang Fu
2025-11-11 18:52 ` Tejun Heo
2025-11-11 19:01 ` Michal Koutný
2025-11-11 8:10 ` Michal Hocko
3 siblings, 2 replies; 21+ messages in thread
From: Leon Huang Fu @ 2025-11-11 6:13 UTC (permalink / raw)
To: mkoutny
Cc: akpm, cgroups, corbet, hannes, jack, joel.granados, kyle.meyer,
lance.yang, laoar.shao, leon.huangfu, linux-doc, linux-kernel,
linux-mm, mclapinski, mhocko, muchun.song, roman.gushchin,
shakeel.butt, tj
On Mon, Nov 10, 2025 at 9:50 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello Leon.
Hi Michal,
>
> On Mon, Nov 10, 2025 at 06:19:48PM +0800, Leon Huang Fu <leon.huangfu@shopee.com> wrote:
> > Memory cgroup statistics are updated asynchronously with periodic
> > flushing to reduce overhead. The current implementation uses a flush
> > threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
> > determining when to aggregate per-CPU memory cgroup statistics. On
> > systems with high core counts, this threshold can become very large
> > (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
> > statistics when userspace reads memory.stat files.
> >
> > This is particularly problematic for monitoring and management tools
> > that rely on reasonably fresh statistics, as they may observe data
> > that is thousands of updates out of date.
> >
> > Introduce a new write-only file, memory.stat_refresh, that allows
> > userspace to explicitly trigger an immediate flush of memory statistics.
>
> I think it's worth thinking twice when introducing a new file like
> this...
>
> > Writing any value to this file forces a synchronous flush via
> > __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
> > descendants, ensuring that subsequent reads of memory.stat and
> > memory.numa_stat reflect current data.
> >
> > This approach follows the pattern established by /proc/sys/vm/stat_refresh
> > and memory.peak, where the written value is ignored, keeping the
> > interface simple and consistent with existing kernel APIs.
> >
> > Usage example:
> > echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
> > cat /sys/fs/cgroup/mygroup/memory.stat
> >
> > The feature is available in both cgroup v1 and v2 for consistency.
>
> First, I find the motivation by the testcase (not real world) weak when
> considering such an API change (e.g. real world would be confined to
> fewer CPUs or there'd be other "traffic" causing flushes making this a
> non-issue, we don't know here).
Fewer CPUs?
We are going to run kernels on 224/256-core machines, and the flush threshold
is 16,384 on a 256-core machine. That means we will often have stale
statistics, and we will need a way to improve the stats accuracy.
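For concreteness, that is MEMCG_CHARGE_BATCH * num_online_cpus() = 64 * 224
= 14,336 (or 64 * 256 = 16,384) updates that can sit unflushed before the
threshold is crossed.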
>
> Second, this is open to everyone (non-root) who mkdir's their cgroups.
> Then why not make it the default memory.stat behavior? (Tongue-in-cheek,
> but [*].)
>
> With this change, we admit the implementation (async flushing) and leak
> it to the users which is hard to take back. Why should we continue doing
> any implicit in-kernel flushing afterwards?
If the concern is that we're papering over a suboptimal flush path, I'm happy
to take a closer look. I'll review both the synchronous and asynchronous
flushing paths to see how to improve it.
>
> Next, v1 and v2 haven't been consistent since introduction of v2 (unlike
> some other controllers that share code or even cftypes between v1 and
> v2). So I'd avoid introducing a new file to V1 API.
>
> When looking for analogies, I admittedly like memory.reclaim's
> O_NONBLOCK better (than /proc/sys/vm/stat_refresh). That would be an
> argument for flushing by default mentioned abovee [*]).
>
> Also, this undercuts the hooking of rstat flushing into BPF. I think the
> attempts were given up too early (I read about the verifier vs
> seq_file). Have you tried bypassing bailout from
> __mem_cgroup_flush_stats via trace_memcg_flush_stats?
>
I tried "tp_btf/memcg_flush_stats", but it didn't work:
10: (85) call css_rstat_flush#80218
program must be sleepable to call sleepable kfunc css_rstat_flush
The bpf code and the error message are attached at last section.
>
> All in all, I'd like to have more backing data on insufficiency of (all
> the) rstat optimizations before opening explicit flushes like this
> (especially when it's meant to be exposed by BPF already).
>
It's proving non-trivial to capture a persuasive delta. The global worker
already flushes rstat every two seconds (2UL*HZ), so the window where
userspace can observe stale numbers is short.
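(For context, the periodic flush I mean looks roughly like this in
mm/memcontrol.c; a simplified sketch, the exact form may differ between
versions:)

  static void flush_memcg_stats_dwork(struct work_struct *w);
  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
  #define FLUSH_TIME (2UL * HZ)

  static void flush_memcg_stats_dwork(struct work_struct *w)
  {
          /* force a flush of the whole tree, then re-arm the 2s timer */
          __mem_cgroup_flush_stats(root_mem_cgroup, true);
          queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
  }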
[...]
Thanks,
Leon
---
#include "vmlinux.h"
#include "bpf_helpers.h"
#include "bpf_tracing.h"
char _license[] SEC("license") = "GPL";
extern void css_rstat_flush(struct cgroup_subsys_state *css) __weak __ksym;
SEC("tp_btf/memcg_flush_stats")
int BPF_PROG(memcg_flush_stats, struct mem_cgroup *memcg, s64 stats_updates, bool force, bool needs_flush)
{
if (!force || !needs_flush) {
css_rstat_flush(&memcg->css);
__bpf_vprintk("memcg_flush_stats: memcg id=%d, stats_updates=%lld, force=%d, needs_flush=%d\n",
memcg->id.id, stats_updates, force, needs_flush);
}
return 0;
}
---
permission denied:
0: R1=ctx() R10=fp0
; int BPF_PROG(memcg_flush_stats, struct mem_cgroup *memcg, s64 stats_updates, bool force, bool needs_flush) @ memcg.c:13
0: (79) r6 = *(u64 *)(r1 +24) ; R1=ctx() R6_w=scalar()
1: (79) r9 = *(u64 *)(r1 +16) ; R1=ctx() R9_w=scalar()
; if (!force || !needs_flush) { @ memcg.c:15
2: (15) if r9 == 0x0 goto pc+1 ; R9_w=scalar(umin=1)
3: (55) if r6 != 0x0 goto pc+27 ; R6_w=0
4: (b7) r3 = 0 ; R3_w=0
; int BPF_PROG(memcg_flush_stats, struct mem_cgroup *memcg, s64 stats_updates, bool force, bool needs_flush) @ memcg.c:13
5: (79) r7 = *(u64 *)(r1 +0)
func 'memcg_flush_stats' arg0 has btf_id 623 type STRUCT 'mem_cgroup'
6: R1=ctx() R7_w=trusted_ptr_mem_cgroup()
6: (bf) r2 = r7 ; R2_w=trusted_ptr_mem_cgroup() R7_w=trusted_ptr_mem_cgroup()
7: (0f) r2 += r3 ; R2_w=trusted_ptr_mem_cgroup() R3_w=0
8: (79) r8 = *(u64 *)(r1 +8) ; R1=ctx() R8_w=scalar()
; css_rstat_flush(&memcg->css); @ memcg.c:16
9: (bf) r1 = r2 ; R1_w=trusted_ptr_mem_cgroup() R2_w=trusted_ptr_mem_cgroup()
10: (85) call css_rstat_flush#80218
program must be sleepable to call sleepable kfunc css_rstat_flush
processed 11 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-10 16:04 ` Tejun Heo
@ 2025-11-11 6:27 ` Leon Huang Fu
0 siblings, 0 replies; 21+ messages in thread
From: Leon Huang Fu @ 2025-11-11 6:27 UTC (permalink / raw)
To: tj
Cc: akpm, cgroups, corbet, hannes, jack, joel.granados, kyle.meyer,
lance.yang, laoar.shao, leon.huangfu, linux-doc, linux-kernel,
linux-mm, mclapinski, mhocko, mkoutny, muchun.song,
roman.gushchin, shakeel.butt
On Tue, Nov 11, 2025 at 12:04 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
Hi Tejun,
>
On Mon, Nov 10, 2025 at 02:50:11PM +0100, Michal Koutný wrote:
> > All in all, I'd like to have more backing data on insufficiency of (all
> > the) rstat optimizations before opening explicit flushes like this
> > (especially when it's meant to be exposed by BPF already).
>
> +1. If the current behavior introduces errors too significant to ignore, I'd
> much rather see it fixed from the implementation side rather than exposing
> internal operation details like this.
>
I haven't observed any significant errors with the current behavior.
That said, I agree that we should focus on improving the flushing
implementation to enhance stats accuracy on high-core-count systems.
I'll review both the synchronous and asynchronous flushing paths to see
where we can tighten things up.
Thanks,
Leon
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-11 1:00 ` Chen Ridong
@ 2025-11-11 6:44 ` Leon Huang Fu
2025-11-12 0:56 ` Chen Ridong
0 siblings, 1 reply; 21+ messages in thread
From: Leon Huang Fu @ 2025-11-11 6:44 UTC (permalink / raw)
To: chenridong
Cc: akpm, cgroups, corbet, hannes, jack, joel.granados, kyle.meyer,
lance.yang, laoar.shao, leon.huangfu, linux-doc, linux-kernel,
linux-mm, mclapinski, mhocko, mkoutny, muchun.song,
roman.gushchin, shakeel.butt, tj
On Tue, Nov 11, 2025 at 9:00 AM Chen Ridong <chenridong@huaweicloud.com> wrote:
>
>
>
> On 2025/11/10 21:50, Michal Koutný wrote:
>> Hello Leon.
Hi Ridong,
>>
>> On Mon, Nov 10, 2025 at 06:19:48PM +0800, Leon Huang Fu <leon.huangfu@shopee.com> wrote:
>>> Memory cgroup statistics are updated asynchronously with periodic
>>> flushing to reduce overhead. The current implementation uses a flush
>>> threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
>>> determining when to aggregate per-CPU memory cgroup statistics. On
>>> systems with high core counts, this threshold can become very large
>>> (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
>>> statistics when userspace reads memory.stat files.
>>>
>
> We have encountered this problem multiple times when running LTP tests. It can easily occur when
> using a 64K page size.
>
> error:
> memcg_stat_rss 10 TFAIL: rss is 0, 266240 expected
>
Have you encountered this problem in the real world?
>>> This is particularly problematic for monitoring and management tools
>>> that rely on reasonably fresh statistics, as they may observe data
>>> that is thousands of updates out of date.
>>>
>>> Introduce a new write-only file, memory.stat_refresh, that allows
>>> userspace to explicitly trigger an immediate flush of memory statistics.
>>
[...]
>>
>> Next, v1 and v2 haven't been consistent since introduction of v2 (unlike
>> some other controllers that share code or even cftypes between v1 and
>> v2). So I'd avoid introducing a new file to V1 API.
>>
>
> We encountered this problem in v1, I think this is a common problem should be fixed.
Thanks for pointing that out.
Thanks,
Leon
[...]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-10 13:50 ` Michal Koutný
` (2 preceding siblings ...)
2025-11-11 6:13 ` Leon Huang Fu
@ 2025-11-11 8:10 ` Michal Hocko
3 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2025-11-11 8:10 UTC (permalink / raw)
To: Michal Koutný
Cc: Leon Huang Fu, linux-mm, tj, hannes, roman.gushchin, shakeel.butt,
muchun.song, akpm, joel.granados, jack, laoar.shao, mclapinski,
kyle.meyer, corbet, lance.yang, linux-doc, linux-kernel, cgroups
On Mon 10-11-25 14:50:11, Michal Koutny wrote:
> Hello Leon.
>
> On Mon, Nov 10, 2025 at 06:19:48PM +0800, Leon Huang Fu <leon.huangfu@shopee.com> wrote:
> > Memory cgroup statistics are updated asynchronously with periodic
> > flushing to reduce overhead. The current implementation uses a flush
> > threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
> > determining when to aggregate per-CPU memory cgroup statistics. On
> > systems with high core counts, this threshold can become very large
> > (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
> > statistics when userspace reads memory.stat files.
> >
> > This is particularly problematic for monitoring and management tools
> > that rely on reasonably fresh statistics, as they may observe data
> > that is thousands of updates out of date.
> >
> > Introduce a new write-only file, memory.stat_refresh, that allows
> > userspace to explicitly trigger an immediate flush of memory statistics.
>
> I think it's worth thinking twice when introducing a new file like
> this...
>
> > Writing any value to this file forces a synchronous flush via
> > __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
> > descendants, ensuring that subsequent reads of memory.stat and
> > memory.numa_stat reflect current data.
> >
> > This approach follows the pattern established by /proc/sys/vm/stat_refresh
> > and memory.peak, where the written value is ignored, keeping the
> > interface simple and consistent with existing kernel APIs.
> >
> > Usage example:
> > echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
> > cat /sys/fs/cgroup/mygroup/memory.stat
> >
> > The feature is available in both cgroup v1 and v2 for consistency.
>
> First, I find the motivation by the testcase (not real world) weak when
> considering such an API change (e.g. real world would be confined to
> fewer CPUs or there'd be other "traffic" causing flushes making this a
> non-issue, we don't know here).
I do agree that the current justification is rather weak.
> Second, this is open to everyone (non-root) who mkdir's their cgroups.
> Then why not make it the default memory.stat behavior? (Tongue-in-cheek,
> but [*].)
>
> With this change, we admit the implementation (async flushing) and leak
> it to the users which is hard to take back. Why should we continue doing
> any implicit in-kernel flushing afterwards?
In theory you are correct but I think it is also good to recognize the
reality. Keeping accurate stats is _expensive_ and we are always
struggling to keep a balance between accuracy and runtime overhead. Yet
there will always be those couple of special cases that would like to have
precision we do not want to pay for in the general case.
We have recognized that in the /proc/vmstat case already without much added
maintenance burden. This seems a very similar case. If there is a general
consensus that we want to outsource all those special cases into BPF
then fine (I guess) but I believe the BPF approach is fighting a completely
different problem (data formatting overhead rather than accuracy).
All that being said I do agree that we should have a more real usecase
than LTP test to justify a new interface. I am personally not convinced
about BPF-only way to address this fundamental precision-vs-overhead
battle.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-11 6:13 ` Leon Huang Fu
@ 2025-11-11 18:52 ` Tejun Heo
2025-11-11 19:01 ` Michal Koutný
1 sibling, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2025-11-11 18:52 UTC (permalink / raw)
To: Leon Huang Fu
Cc: mkoutny, akpm, cgroups, corbet, hannes, jack, joel.granados,
kyle.meyer, lance.yang, laoar.shao, linux-doc, linux-kernel,
linux-mm, mclapinski, mhocko, muchun.song, roman.gushchin,
shakeel.butt
On Tue, Nov 11, 2025 at 02:13:42PM +0800, Leon Huang Fu wrote:
> We are going to run kernels on 224/256 cores machines, and the flush threshold
> is 16384 on a 256-core machine. That means we will have stale statistics often,
> and we will need a way to improve the stats accuracy.
The thing is that these machines are already common and going to be more and
more common. These are cases that aren't all that special, so I really hope
this could be solved in a more generic manner.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-11 6:13 ` Leon Huang Fu
2025-11-11 18:52 ` Tejun Heo
@ 2025-11-11 19:01 ` Michal Koutný
1 sibling, 0 replies; 21+ messages in thread
From: Michal Koutný @ 2025-11-11 19:01 UTC (permalink / raw)
To: Leon Huang Fu
Cc: akpm, cgroups, corbet, hannes, jack, joel.granados, kyle.meyer,
lance.yang, laoar.shao, linux-doc, linux-kernel, linux-mm,
mclapinski, mhocko, muchun.song, roman.gushchin, shakeel.butt, tj
[-- Attachment #1: Type: text/plain, Size: 1530 bytes --]
On Tue, Nov 11, 2025 at 02:13:42PM +0800, Leon Huang Fu <leon.huangfu@shopee.com> wrote:
> Fewer CPUs?
Your surprise makes me realize I confused this with something else [1]
where harnessing the job to a subset of CPUs (e.g. with cpuset) would
reduce the accumulated error. But memory.stat's threshold is static (and
stricter affinity would actually render the threshold relatively worse).
> We are going to run kernels on 224/256 cores machines, and the flush threshold
> is 16384 on a 256-core machine. That means we will have stale statistics often,
> and we will need a way to improve the stats accuracy.
(The theory behind the threshold is that you'd also need to amortize
proportionally more updates.)
> The bpf code and the error message are attached at last section.
(Thanks, wondering about it...)
>
> >
> > All in all, I'd like to have more backing data on insufficiency of (all
> > the) rstat optimizations before opening explicit flushes like this
> > (especially when it's meant to be exposed by BPF already).
> >
>
> It's proving non-trivial to capture a persuasive delta. The global worker
> already flushes rstat every two seconds (2UL*HZ), so the window where
> userspace can observe stale numbers is short.
This is the important bit -- even though you can only see it rarely, are
you referring to the LTP failures, or do you have some consumer of the
stats that fails terribly with the imprecise numbers?
Thanks,
Michal
[1] Per-cpu stocks that affect memory.current.
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-10 10:19 [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing Leon Huang Fu
` (2 preceding siblings ...)
2025-11-10 13:50 ` Michal Koutný
@ 2025-11-11 19:10 ` Waiman Long
2025-11-11 19:47 ` Michal Hocko
3 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2025-11-11 19:10 UTC (permalink / raw)
To: Leon Huang Fu, linux-mm
Cc: tj, mkoutny, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, akpm, joel.granados, jack, laoar.shao, mclapinski,
kyle.meyer, corbet, lance.yang, linux-doc, linux-kernel, cgroups
On 11/10/25 5:19 AM, Leon Huang Fu wrote:
> Memory cgroup statistics are updated asynchronously with periodic
> flushing to reduce overhead. The current implementation uses a flush
> threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
> determining when to aggregate per-CPU memory cgroup statistics. On
> systems with high core counts, this threshold can become very large
> (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
> statistics when userspace reads memory.stat files.
>
> This is particularly problematic for monitoring and management tools
> that rely on reasonably fresh statistics, as they may observe data
> that is thousands of updates out of date.
>
> Introduce a new write-only file, memory.stat_refresh, that allows
> userspace to explicitly trigger an immediate flush of memory statistics.
> Writing any value to this file forces a synchronous flush via
> __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
> descendants, ensuring that subsequent reads of memory.stat and
> memory.numa_stat reflect current data.
>
> This approach follows the pattern established by /proc/sys/vm/stat_refresh
> and memory.peak, where the written value is ignored, keeping the
> interface simple and consistent with existing kernel APIs.
>
> Usage example:
> echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
> cat /sys/fs/cgroup/mygroup/memory.stat
>
> The feature is available in both cgroup v1 and v2 for consistency.
>
> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com>
> ---
> v2 -> v3:
> - Flush stats by memory.stat_refresh (per Michal)
> - https://lore.kernel.org/linux-mm/20251105074917.94531-1-leon.huangfu@shopee.com/
>
> v1 -> v2:
> - Flush stats when write the file (per Michal).
> - https://lore.kernel.org/linux-mm/20251104031908.77313-1-leon.huangfu@shopee.com/
>
> Documentation/admin-guide/cgroup-v2.rst | 21 +++++++++++++++++--
> mm/memcontrol-v1.c | 4 ++++
> mm/memcontrol-v1.h | 2 ++
> mm/memcontrol.c | 27 ++++++++++++++++++-------
> 4 files changed, 45 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3345961c30ac..ca079932f957 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1337,7 +1337,7 @@ PAGE_SIZE multiple when read back.
> cgroup is within its effective low boundary, the cgroup's
> memory won't be reclaimed unless there is no reclaimable
> memory available in unprotected cgroups.
> - Above the effective low boundary (or
> + Above the effective low boundary (or
> effective min boundary if it is higher), pages are reclaimed
> proportionally to the overage, reducing reclaim pressure for
> smaller overages.
> @@ -1785,6 +1785,23 @@ The following nested keys are defined.
> up if hugetlb usage is accounted for in memory.current (i.e.
> cgroup is mounted with the memory_hugetlb_accounting option).
>
> + memory.stat_refresh
> + A write-only file which exists on non-root cgroups.
> +
> + Writing any value to this file forces an immediate flush of
> + memory statistics for this cgroup and its descendants. This
> + ensures subsequent reads of memory.stat and memory.numa_stat
> + reflect the most current data.
> +
> + This is useful on high-core count systems where per-CPU caching
> + can lead to stale statistics, or when precise memory usage
> + information is needed for monitoring or debugging purposes.
> +
> + Example::
> +
> + echo 1 > memory.stat_refresh
> + cat memory.stat
> +
> memory.numa_stat
> A read-only nested-keyed file which exists on non-root cgroups.
>
> @@ -2173,7 +2190,7 @@ of the two is enforced.
>
> cgroup writeback requires explicit support from the underlying
> filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
> -btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
> +btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
> attributed to the root cgroup.
>
> There are inherent differences in memory and writeback management
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 6eed14bff742..c3eac9b1f1be 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -2041,6 +2041,10 @@ struct cftype mem_cgroup_legacy_files[] = {
> .name = "stat",
> .seq_show = memory_stat_show,
> },
> + {
> + .name = "stat_refresh",
> + .write = memory_stat_refresh_write,
> + },
> {
> .name = "force_empty",
> .write = mem_cgroup_force_empty_write,
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 6358464bb416..a14d4d74c9aa 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -29,6 +29,8 @@ void drain_all_stock(struct mem_cgroup *root_memcg);
> unsigned long memcg_events(struct mem_cgroup *memcg, int event);
> unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
> int memory_stat_show(struct seq_file *m, void *v);
> +ssize_t memory_stat_refresh_write(struct kernfs_open_file *of, char *buf,
> + size_t nbytes, loff_t off);
>
> void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
> struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bfc986da3289..19ef4b971d8d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -610,6 +610,15 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
> css_rstat_flush(&memcg->css);
> }
>
> +static void memcg_flush_stats(struct mem_cgroup *memcg, bool force)
> +{
> + if (mem_cgroup_disabled())
> + return;
> +
> + memcg = memcg ?: root_mem_cgroup;
> + __mem_cgroup_flush_stats(memcg, force);
> +}
Shouldn't we impose a limit on how frequently this memcg_flush_stats()
function can be called, say at most a few times per second, to prevent
abuse from user space, since stat flushing is expensive? If we decide to
implement this new API, we should guard against it being used as a kind
of user-space DoS vector.
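For illustration, a minimal sketch of such a cap (hypothetical, not part of
the posted series; the handler body and the two-per-second numbers are
assumptions, only the prototype and memcg_flush_stats() appear in the diff):

#include <linux/ratelimit.h>

/* Allow at most two refreshes per second system-wide (made-up numbers). */
static DEFINE_RATELIMIT_STATE(stat_refresh_ratelimit, HZ, 2);

ssize_t memory_stat_refresh_write(struct kernfs_open_file *of, char *buf,
                                  size_t nbytes, loff_t off)
{
        struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));

        /* Silently coalesce refreshes that arrive faster than the limit. */
        if (__ratelimit(&stat_refresh_ratelimit))
                memcg_flush_stats(memcg, true);

        return nbytes;
}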
Cheers,
Longman
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-11 19:10 ` Waiman Long
@ 2025-11-11 19:47 ` Michal Hocko
2025-11-11 20:44 ` Waiman Long
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-11-11 19:47 UTC (permalink / raw)
To: Waiman Long
Cc: Leon Huang Fu, linux-mm, tj, mkoutny, hannes, roman.gushchin,
shakeel.butt, muchun.song, akpm, joel.granados, jack, laoar.shao,
mclapinski, kyle.meyer, corbet, lance.yang, linux-doc,
linux-kernel, cgroups
On Tue 11-11-25 14:10:28, Waiman Long wrote:
[...]
> > +static void memcg_flush_stats(struct mem_cgroup *memcg, bool force)
> > +{
> > + if (mem_cgroup_disabled())
> > + return;
> > +
> > + memcg = memcg ?: root_mem_cgroup;
> > + __mem_cgroup_flush_stats(memcg, force);
> > +}
>
> Shouldn't we impose a limit in term of how frequently this
> memcg_flush_stats() function can be called like at most a few times per
This effectively invalidates the primary purpose of the interface, which
is to provide a way to get an as-fresh-as-possible value, AFAICS.
> second to prevent abuse from user space as stat flushing is expensive? We
> should prevent some kind of user space DoS attack by using this new API if
> we decide to implement it.
What exactly would be an attack vector?
--
Michal Hocko
SUSE Labs
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-11 19:47 ` Michal Hocko
@ 2025-11-11 20:44 ` Waiman Long
2025-11-11 21:01 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2025-11-11 20:44 UTC (permalink / raw)
To: Michal Hocko, Waiman Long
Cc: Leon Huang Fu, linux-mm, tj, mkoutny, hannes, roman.gushchin,
shakeel.butt, muchun.song, akpm, joel.granados, jack, laoar.shao,
mclapinski, kyle.meyer, corbet, lance.yang, linux-doc,
linux-kernel, cgroups
On 11/11/25 2:47 PM, Michal Hocko wrote:
> On Tue 11-11-25 14:10:28, Waiman Long wrote:
> [...]
>>> +static void memcg_flush_stats(struct mem_cgroup *memcg, bool force)
>>> +{
>>> + if (mem_cgroup_disabled())
>>> + return;
>>> +
>>> + memcg = memcg ?: root_mem_cgroup;
>>> + __mem_cgroup_flush_stats(memcg, force);
>>> +}
>> Shouldn't we impose a limit in term of how frequently this
>> memcg_flush_stats() function can be called like at most a few times per
> This effectivelly invalidates the primary purpose of the interface to
> provide a method to get as-fresh-as-possible value AFAICS.
>
>> second to prevent abuse from user space as stat flushing is expensive? We
>> should prevent some kind of user space DoS attack by using this new API if
>> we decide to implement it.
> What exactly would be an attack vector?
Just repeatedly write a string to the new cgroup file. It will then call
css_rstat_flush() repeatedly. It is not a real DoS attack, but it can
still consume a lot of CPU time and slow down other tasks.
Cheers,
Longman
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-11 20:44 ` Waiman Long
@ 2025-11-11 21:01 ` Michal Hocko
2025-11-12 14:02 ` Michal Koutný
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-11-11 21:01 UTC (permalink / raw)
To: Waiman Long
Cc: Leon Huang Fu, linux-mm, tj, mkoutny, hannes, roman.gushchin,
shakeel.butt, muchun.song, akpm, joel.granados, jack, laoar.shao,
mclapinski, kyle.meyer, corbet, lance.yang, linux-doc,
linux-kernel, cgroups
On Tue 11-11-25 15:44:07, Waiman Long wrote:
>
> On 11/11/25 2:47 PM, Michal Hocko wrote:
> > On Tue 11-11-25 14:10:28, Waiman Long wrote:
> > [...]
> > > > +static void memcg_flush_stats(struct mem_cgroup *memcg, bool force)
> > > > +{
> > > > + if (mem_cgroup_disabled())
> > > > + return;
> > > > +
> > > > + memcg = memcg ?: root_mem_cgroup;
> > > > + __mem_cgroup_flush_stats(memcg, force);
> > > > +}
> > > Shouldn't we impose a limit in term of how frequently this
> > > memcg_flush_stats() function can be called like at most a few times per
> > This effectivelly invalidates the primary purpose of the interface to
> > provide a method to get as-fresh-as-possible value AFAICS.
> >
> > > second to prevent abuse from user space as stat flushing is expensive? We
> > > should prevent some kind of user space DoS attack by using this new API if
> > > we decide to implement it.
> > What exactly would be an attack vector?
>
> just repeatedly write a string to the new cgroup file. It will then call
> css_rstat_flush() repeatedly. It is not a real DoS attack, but it can still
> consume a lot of cpu time and slow down other tasks.
How does that differ from writing a limit that causes constant memory
reclaim from a workload you craft, producing constant CPU activity and,
even worse, lock contention?
I guess the answer is that you do not let untrusted entities create
cgroup hierarchies, or allow them to modify or generally have write
access to control files. Or am I missing something?
--
Michal Hocko
SUSE Labs
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-11 6:44 ` Leon Huang Fu
@ 2025-11-12 0:56 ` Chen Ridong
2025-11-12 14:02 ` Michal Koutný
0 siblings, 1 reply; 21+ messages in thread
From: Chen Ridong @ 2025-11-12 0:56 UTC (permalink / raw)
To: Leon Huang Fu
Cc: akpm, cgroups, corbet, hannes, jack, joel.granados, kyle.meyer,
lance.yang, laoar.shao, linux-doc, linux-kernel, linux-mm,
mclapinski, mhocko, mkoutny, muchun.song, roman.gushchin,
shakeel.butt, tj
On 2025/11/11 14:44, Leon Huang Fu wrote:
> On Tue, Nov 11, 2025 at 9:00 AM Chen Ridong <chenridong@huaweicloud.com> wrote:
>>
>>
>>
>> On 2025/11/10 21:50, Michal Koutný wrote:
>>> Hello Leon.
>
> Hi Ridong,
>
>>>
>>> On Mon, Nov 10, 2025 at 06:19:48PM +0800, Leon Huang Fu <leon.huangfu@shopee.com> wrote:
>>>> Memory cgroup statistics are updated asynchronously with periodic
>>>> flushing to reduce overhead. The current implementation uses a flush
>>>> threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
>>>> determining when to aggregate per-CPU memory cgroup statistics. On
>>>> systems with high core counts, this threshold can become very large
>>>> (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
>>>> statistics when userspace reads memory.stat files.
>>>>
>>
>> We have encountered this problem multiple times when running LTP tests. It can easily occur when
>> using a 64K page size.
>>
>> error:
>> memcg_stat_rss 10 TFAIL: rss is 0, 266240 expected
>>
>
> Have you encountered this problem in real world?
>
Do you mean whether we've encountered this issue in production? We haven't so far.
However, it fails the LTP test quite easily; the error logs above come directly from
LTP. The issue occurs because the threshold isn't reached, so the reported RSS value
is 0. We tried increasing the amount of memory allocated by the LTP case, but that
wasn't the right solution.
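A rough sketch of the kind of check that trips over this (hypothetical and
simplified, not the actual LTP case; the cgroup name and the anon-touching
helper are placeholders, 260 KiB matches the 266240 bytes in the error above):

  mkdir /sys/fs/cgroup/memory/ltp_like                      # cgroup v1 hierarchy
  echo $$ > /sys/fs/cgroup/memory/ltp_like/tasks
  ./touch_anon 266240 &                                     # assumed helper: maps and writes 260 KiB of anon memory
  sleep 0.1
  grep '^rss ' /sys/fs/cgroup/memory/ltp_like/memory.stat   # can still print "rss 0" before a flush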
>>>> This is particularly problematic for monitoring and management tools
>>>> that rely on reasonably fresh statistics, as they may observe data
>>>> that is thousands of updates out of date.
>>>>
>>>> Introduce a new write-only file, memory.stat_refresh, that allows
>>>> userspace to explicitly trigger an immediate flush of memory statistics.
>>>
> [...]
>>>
>>> Next, v1 and v2 haven't been consistent since introduction of v2 (unlike
>>> some other controllers that share code or even cftypes between v1 and
>>> v2). So I'd avoid introducing a new file to V1 API.
>>>
>>
>> We encountered this problem in v1, I think this is a common problem should be fixed.
>
> Thanks for pointing that out.
>
> Thanks,
> Leon
>
> [...]
--
Best regards,
Ridong
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-11 21:01 ` Michal Hocko
@ 2025-11-12 14:02 ` Michal Koutný
0 siblings, 0 replies; 21+ messages in thread
From: Michal Koutný @ 2025-11-12 14:02 UTC (permalink / raw)
To: Michal Hocko
Cc: Waiman Long, Leon Huang Fu, linux-mm, tj, hannes, roman.gushchin,
shakeel.butt, muchun.song, akpm, joel.granados, jack, laoar.shao,
mclapinski, kyle.meyer, corbet, lance.yang, linux-doc,
linux-kernel, cgroups
On Tue, Nov 11, 2025 at 10:01:37PM +0100, Michal Hocko <mhocko@suse.com> wrote:
> How does that differ from writing a limit that would cause a constant
> memory reclaim from a worklad that you craft and cause a constant CPU
> activity and even worse lock contention?
>
> I guess the answer is that you do not let untrusted entities to create
> cgroup hierarchies and allow to modify or generally have a write access
> to control files. Or am I missing something?
This used to apply in cgroup v1, but the v2 controller APIs are meant to
be available to anyone (e.g. rootless containers).
So yes, if it turns out that the isolation can be substantially bypassed
by reclaim, I think that should be solved by some rework.
memory.stat_refresh is different because it doesn't exist yet, so its
impact on isolation doesn't even need to be solved :-p (any more than
memory.stat's does).
---
That's also why memory.stat_refresh is different from the single global
vm/stat_refresh (which is easily constrained to root's monitoring tools).
And despite this precedent, I don't like the approach of two independent
invocations (write(2)+read(2)) when the intention [1] is to obtain
precise data (at least) at the time of the read(2).
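The usage pattern under discussion boils down to two separate syscalls,
roughly (a hypothetical user-space sketch; the cgroup path is an example
and most error handling is omitted):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        int wfd = open("/sys/fs/cgroup/test/memory.stat_refresh", O_WRONLY);
        int rfd = open("/sys/fs/cgroup/test/memory.stat", O_RDONLY);

        if (wfd < 0 || rfd < 0)
                return 1;
        if (write(wfd, "1", 1) < 0)             /* write(2): force the flush   */
                return 1;
        if (read(rfd, buf, sizeof(buf)) < 0)    /* read(2): fetch the snapshot */
                return 1;

        /* Updates charged between the two calls are already unflushed again. */
        close(rfd);
        close(wfd);
        return 0;
}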
Cheers,
Michal
[1] I guess. I'd still wait to hear what the actual usefulness here is,
besides fixing LTP.
* Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
2025-11-12 0:56 ` Chen Ridong
@ 2025-11-12 14:02 ` Michal Koutný
0 siblings, 0 replies; 21+ messages in thread
From: Michal Koutný @ 2025-11-12 14:02 UTC (permalink / raw)
To: Chen Ridong
Cc: Leon Huang Fu, akpm, cgroups, corbet, hannes, jack, joel.granados,
kyle.meyer, lance.yang, laoar.shao, linux-doc, linux-kernel,
linux-mm, mclapinski, mhocko, muchun.song, roman.gushchin,
shakeel.butt, tj
On Wed, Nov 12, 2025 at 08:56:28AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
> However, this fails the LTP test quite easily. The error logs come directly from LTP. The issue
> occurs because the threshold isn’t reached, resulting in an RSS value of 0. We tried increasing the
> memory allocated by the LTP case, but that wasn’t the right solution.
You touched on a slightly different cause (other than async flushing):
there are other fields/stats that are quantified in units of the (base)
page size. I.e. I think that might need a different/more central
solution, e.g. working with absolute (not page-size-relative) thresholds
if absolute precision is needed.
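To put numbers on the page-size effect (using the figures from the patch
description and treating each pending update as one page-sized event,
which is a simplification since updates are weighted by their magnitude):

  threshold        = 64 * 256       = 16384 pending page-sized updates
  max error (4K)   = 16384 * 4 KiB  = 64 MiB
  max error (64K)  = 16384 * 64 KiB = 1 GiB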
Thanks,
Michal
Thread overview: 21+ messages
2025-11-10 10:19 [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing Leon Huang Fu
2025-11-10 11:28 ` Michal Hocko
2025-11-11 6:12 ` Leon Huang Fu
2025-11-10 11:52 ` Harry Yoo
2025-11-11 6:12 ` Leon Huang Fu
2025-11-10 13:50 ` Michal Koutný
2025-11-10 16:04 ` Tejun Heo
2025-11-11 6:27 ` Leon Huang Fu
2025-11-11 1:00 ` Chen Ridong
2025-11-11 6:44 ` Leon Huang Fu
2025-11-12 0:56 ` Chen Ridong
2025-11-12 14:02 ` Michal Koutný
2025-11-11 6:13 ` Leon Huang Fu
2025-11-11 18:52 ` Tejun Heo
2025-11-11 19:01 ` Michal Koutný
2025-11-11 8:10 ` Michal Hocko
2025-11-11 19:10 ` Waiman Long
2025-11-11 19:47 ` Michal Hocko
2025-11-11 20:44 ` Waiman Long
2025-11-11 21:01 ` Michal Hocko
2025-11-12 14:02 ` Michal Koutný