* [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
@ 2025-11-05  7:49 Leon Huang Fu
  2025-11-05  8:19 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Leon Huang Fu @ 2025-11-05  7:49 UTC (permalink / raw)
  To: linux-mm
  Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
	joel.granados, jack, laoar.shao, mclapinski, kyle.meyer, corbet,
	lance.yang, leon.huangfu, linux-doc, linux-kernel, cgroups

On high-core count systems, memory cgroup statistics can become stale
due to per-CPU caching and deferred aggregation. Monitoring tools and
management applications sometimes need guaranteed up-to-date statistics
at specific points in time to make accurate decisions.

This patch adds write handlers to both memory.stat and memory.numa_stat
files to allow userspace to explicitly force an immediate flush of
memory statistics. When "1" is written to either file, it triggers
__mem_cgroup_flush_stats(memcg, true), which unconditionally flushes
all pending statistics for the cgroup and its descendants.

The write operation validates the input and only accepts the value "1",
returning -EINVAL for any other input.

Usage example:
  # Force immediate flush before reading critical statistics
  echo 1 > /sys/fs/cgroup/mygroup/memory.stat
  cat /sys/fs/cgroup/mygroup/memory.stat

This provides several benefits:

1. On-demand accuracy: Tools can flush only when needed, avoiding
   continuous overhead

2. Targeted flushing: Allows flushing specific cgroups when precision
   is required for particular workloads

3. Integration flexibility: Monitoring scripts can decide when to pay
   the flush cost based on their specific accuracy requirements

The implementation is shared between cgroup v1 and v2 interfaces,
with memory_stat_write() providing the common validation and flush
logic. Both memory.stat and memory.numa_stat use the same write
handler since they both benefit from forcing accurate statistics.

Documentation is updated to reflect that these files are now read-write
instead of read-only, with a clear explanation of the write behavior.

Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com>
---
v1 -> v2:
  - Flush stats when writing to the file (per Michal).
  - https://lore.kernel.org/linux-mm/20251104031908.77313-1-leon.huangfu@shopee.com/

 Documentation/admin-guide/cgroup-v2.rst | 31 +++++++++++++++++--------
 mm/memcontrol-v1.c                      |  2 ++
 mm/memcontrol-v1.h                      |  1 +
 mm/memcontrol.c                         | 13 +++++++++++
 4 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3345961c30ac..2a4a81d2cc2f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1337,7 +1337,7 @@ PAGE_SIZE multiple when read back.
 	cgroup is within its effective low boundary, the cgroup's
 	memory won't be reclaimed unless there is no reclaimable
 	memory available in unprotected cgroups.
-	Above the effective low	boundary (or
+	Above the effective low	boundary (or
 	effective min boundary if it is higher), pages are reclaimed
 	proportionally to the overage, reducing reclaim pressure for
 	smaller overages.
@@ -1525,11 +1525,17 @@ The following nested keys are defined.
 	generated on this file reflects only the local events.

   memory.stat
-	A read-only flat-keyed file which exists on non-root cgroups.
+	A read-write flat-keyed file which exists on non-root cgroups.

-	This breaks down the cgroup's memory footprint into different
-	types of memory, type-specific details, and other information
-	on the state and past events of the memory management system.
+	Reading this file breaks down the cgroup's memory footprint into
+	different types of memory, type-specific details, and other
+	information on the state and past events of the memory management
+	system.
+
+	Writing the value "1" to this file forces an immediate flush of
+	memory statistics for this cgroup and its descendants, improving
+	the accuracy of subsequent reads. Any other value will result in
+	an error.

 	All memory amounts are in bytes.

@@ -1786,11 +1792,16 @@ The following nested keys are defined.
 		cgroup is mounted with the memory_hugetlb_accounting option).

   memory.numa_stat
-	A read-only nested-keyed file which exists on non-root cgroups.
+	A read-write nested-keyed file which exists on non-root cgroups.
+
+	Reading this file breaks down the cgroup's memory footprint into
+	different types of memory, type-specific details, and other
+	information per node on the state of the memory management system.

-	This breaks down the cgroup's memory footprint into different
-	types of memory, type-specific details, and other information
-	per node on the state of the memory management system.
+	Writing the value "1" to this file forces an immediate flush of
+	memory statistics for this cgroup and its descendants, improving
+	the accuracy of subsequent reads. Any other value will result in
+	an error.

 	This is useful for providing visibility into the NUMA locality
 	information within an memcg since the pages are allowed to be
@@ -2173,7 +2184,7 @@ of the two is enforced.

 cgroup writeback requires explicit support from the underlying
 filesystem.  Currently, cgroup writeback is implemented on ext2, ext4,
-btrfs, f2fs, and xfs.  On other filesystems, all writeback IOs are
+btrfs, f2fs, and xfs.  On other filesystems, all writeback IOs are
 attributed to the root cgroup.

 There are inherent differences in memory and writeback management
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff742..8cab6b52424b 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -2040,6 +2040,7 @@ struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "stat",
 		.seq_show = memory_stat_show,
+		.write_u64 = memory_stat_write,
 	},
 	{
 		.name = "force_empty",
@@ -2078,6 +2079,7 @@ struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "numa_stat",
 		.seq_show = memcg_numa_stat_show,
+		.write_u64 = memory_stat_write,
 	},
 #endif
 	{
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 6358464bb416..1c92d58330aa 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -29,6 +29,7 @@ void drain_all_stock(struct mem_cgroup *root_memcg);
 unsigned long memcg_events(struct mem_cgroup *memcg, int event);
 unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
 int memory_stat_show(struct seq_file *m, void *v);
+int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val);

 void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
 struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c34029e92bab..d6a5d872fbcb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }

+int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	if (val != 1)
+		return -EINVAL;
+
+	if (css)
+		css_rstat_flush(css);
+
+	return 0;
+}
+
 #ifdef CONFIG_NUMA
 static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
 						     int item)
@@ -4666,11 +4677,13 @@ static struct cftype memory_files[] = {
 	{
 		.name = "stat",
 		.seq_show = memory_stat_show,
+		.write_u64 = memory_stat_write,
 	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",
 		.seq_show = memory_numa_stat_show,
+		.write_u64 = memory_stat_write,
 	},
 #endif
 	{
--
2.51.2


* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-05  7:49 [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file Leon Huang Fu
@ 2025-11-05  8:19 ` Michal Hocko
  2025-11-05  8:39   ` Lance Yang
  2025-11-06  1:19 ` Shakeel Butt
  2025-11-06 17:02 ` JP Kobryn
  2 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2025-11-05  8:19 UTC (permalink / raw)
  To: Leon Huang Fu
  Cc: linux-mm, hannes, roman.gushchin, shakeel.butt, muchun.song, akpm,
	joel.granados, jack, laoar.shao, mclapinski, kyle.meyer, corbet,
	lance.yang, linux-doc, linux-kernel, cgroups

On Wed 05-11-25 15:49:16, Leon Huang Fu wrote:
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 6eed14bff742..8cab6b52424b 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -2040,6 +2040,7 @@ struct cftype mem_cgroup_legacy_files[] = {
>  	{
>  		.name = "stat",
>  		.seq_show = memory_stat_show,
> +		.write_u64 = memory_stat_write,
>  	},
>  	{
>  		.name = "force_empty",
> @@ -2078,6 +2079,7 @@ struct cftype mem_cgroup_legacy_files[] = {
>  	{
>  		.name = "numa_stat",
>  		.seq_show = memcg_numa_stat_show,
> +		.write_u64 = memory_stat_write,
>  	},

Any reason you are not using .write like the others? Also, is there any
reason why a specific value is required? /proc/sys/vm/stat_refresh, which
does something similar, ignores the value, and the memcg.peak write handler,
which resets the peak value, ignores it as well. It is true that a specific
value allows for future extensions, but I guess it would be better to be
consistent with the others here.

One last thing to consider is whether this should follow the
/proc/sys/vm/stat_refresh path and have a single file to flush them all,
or have per-file flushing. I do not have a strong preference, but
considering both are doing the same thing, it makes sense to go the
stat_refresh path.

In any case, thanks for considering the explicit flushing path, which is
IMHO much better than flush tuning, which would become really hard for
admins to wrap their heads around, especially when dealing with large
fleets of machines to maintain.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-05  8:19 ` Michal Hocko
@ 2025-11-05  8:39   ` Lance Yang
  2025-11-05  8:51     ` Leon Huang Fu
  0 siblings, 1 reply; 14+ messages in thread
From: Lance Yang @ 2025-11-05  8:39 UTC (permalink / raw)
  To: Michal Hocko, Leon Huang Fu
  Cc: linux-mm, hannes, roman.gushchin, shakeel.butt, muchun.song, akpm,
	joel.granados, jack, laoar.shao, mclapinski, kyle.meyer, corbet,
	linux-doc, linux-kernel, cgroups



On 2025/11/5 16:19, Michal Hocko wrote:
> On Wed 05-11-25 15:49:16, Leon Huang Fu wrote:
>> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
>> index 6eed14bff742..8cab6b52424b 100644
>> --- a/mm/memcontrol-v1.c
>> +++ b/mm/memcontrol-v1.c
>> @@ -2040,6 +2040,7 @@ struct cftype mem_cgroup_legacy_files[] = {
>>   	{
>>   		.name = "stat",
>>   		.seq_show = memory_stat_show,
>> +		.write_u64 = memory_stat_write,
>>   	},
>>   	{
>>   		.name = "force_empty",
>> @@ -2078,6 +2079,7 @@ struct cftype mem_cgroup_legacy_files[] = {
>>   	{
>>   		.name = "numa_stat",
>>   		.seq_show = memcg_numa_stat_show,
>> +		.write_u64 = memory_stat_write,
>>   	},
> 
> Any reason you are not using .write like others? Also is there any
> reason why a specific value is required. /proc/sys/vm/stat_refresh which does
> something similar ignores the value. Also memcg.peak write handler which
> resets the peak value ignores it. It is true that a specific value
> allows for future extensions but I guess it would be better to be
> consistent with others here.
> 
> One last thing to consider is whether this should follow
> /proc/sys/vm/stat_refresh path and have a single file to flush them all
> or have a per file flushing. I do not have a strong preference but
> considering both are doing the same thing it makes sense to go
> stat_refresh path.

+1

IMHO, a dedicated file like memory.stat_refresh is a much better approach ;)

It's cleaner, simpler to use, and much more intuitive for users.

> 
> In any case, thanks for considering the explicit flushing path which is
> IMHO much better than flushing tunning which would become really hard
> for admins to wrap their heads around. Especially when dealing with
> large fleets of machines to maintain.


* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-05  8:39   ` Lance Yang
@ 2025-11-05  8:51     ` Leon Huang Fu
  0 siblings, 0 replies; 14+ messages in thread
From: Leon Huang Fu @ 2025-11-05  8:51 UTC (permalink / raw)
  To: Lance Yang
  Cc: Michal Hocko, linux-mm, hannes, roman.gushchin, shakeel.butt,
	muchun.song, akpm, joel.granados, jack, laoar.shao, mclapinski,
	kyle.meyer, corbet, linux-doc, linux-kernel, cgroups

On Wed, Nov 5, 2025 at 4:39 PM Lance Yang <lance.yang@linux.dev> wrote:
> On 2025/11/5 16:19, Michal Hocko wrote:
> > On Wed 05-11-25 15:49:16, Leon Huang Fu wrote:
> >> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> >> index 6eed14bff742..8cab6b52424b 100644
> >> --- a/mm/memcontrol-v1.c
> >> +++ b/mm/memcontrol-v1.c
> >> @@ -2040,6 +2040,7 @@ struct cftype mem_cgroup_legacy_files[] = {
> >>      {
> >>              .name = "stat",
> >>              .seq_show = memory_stat_show,
> >> +            .write_u64 = memory_stat_write,
> >>      },
> >>      {
> >>              .name = "force_empty",
> >> @@ -2078,6 +2079,7 @@ struct cftype mem_cgroup_legacy_files[] = {
> >>      {
> >>              .name = "numa_stat",
> >>              .seq_show = memcg_numa_stat_show,
> >> +            .write_u64 = memory_stat_write,
> >>      },
> >
> > Any reason you are not using .write like others? Also is there any
> > reason why a specific value is required. /proc/sys/vm/stat_refresh which does
> > something similar ignores the value. Also memcg.peak write handler which
> > resets the peak value ignores it. It is true that a specific value
> > allows for future extensions but I guess it would be better to be
> > consistent with others here.
> >
> > One last thing to consider is whether this should follow
> > /proc/sys/vm/stat_refresh path and have a single file to flush them all
> > or have a per file flushing. I do not have a strong preference but
> > considering both are doing the same thing it makes sense to go
> > stat_refresh path.
>
> +1
>
> IMHO, a dedicated file like memory.stat_refresh is a much better approach ;)
>
> It's cleaner, simpler to use, and much more intuitive for users.
>

Agreed. Thank you both for the feedback.

You're right that following the /proc/sys/vm/stat_refresh pattern makes
more sense here. A dedicated memory.stat_refresh file has several advantages:

1) It provides a clear, explicit interface for the refresh operation rather
    than overloading existing stat files with write capability
2) It's more consistent with the existing kernel patterns - stat_refresh
    ignores the written value, and memory.peak also ignores it for reset
3) It's more intuitive for users - the purpose is immediately clear from
    the filename

For the next revision, I'll introduce a dedicated memory.stat_refresh file
that ignores the written value (similar to stat_refresh and memory.peak).
This will work for both cgroup v1 and v2.
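
A minimal, untested sketch of what such a handler could look like (the
function name and cftype wiring below are placeholders, not the final v3
code; it reuses the existing of_css() and css_rstat_flush() helpers and
ignores the written value, like /proc/sys/vm/stat_refresh and memory.peak):

/* Hypothetical sketch only, not the actual v3 implementation. */
static ssize_t memory_stat_refresh_write(struct kernfs_open_file *of,
					 char *buf, size_t nbytes, loff_t off)
{
	struct cgroup_subsys_state *css = of_css(of);

	/* Flush pending stats for this cgroup and its descendants. */
	css_rstat_flush(css);

	return nbytes;
}

/* ... hooked up via cftype, e.g.:
 *	{
 *		.name = "stat_refresh",
 *		.write = memory_stat_refresh_write,
 *	},
 */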

Thanks,
Leon

[...]

* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-05  7:49 [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file Leon Huang Fu
  2025-11-05  8:19 ` Michal Hocko
@ 2025-11-06  1:19 ` Shakeel Butt
  2025-11-06  3:30   ` Leon Huang Fu
  2025-11-06 17:02 ` JP Kobryn
  2 siblings, 1 reply; 14+ messages in thread
From: Shakeel Butt @ 2025-11-06  1:19 UTC (permalink / raw)
  To: Leon Huang Fu
  Cc: linux-mm, hannes, mhocko, roman.gushchin, muchun.song, akpm,
	joel.granados, jack, laoar.shao, mclapinski, kyle.meyer, corbet,
	lance.yang, linux-doc, linux-kernel, cgroups, Yosry Ahmed,
	JP Kobryn

+Yosry, JP

On Wed, Nov 05, 2025 at 03:49:16PM +0800, Leon Huang Fu wrote:
> On high-core count systems, memory cgroup statistics can become stale
> due to per-CPU caching and deferred aggregation. Monitoring tools and
> management applications sometimes need guaranteed up-to-date statistics
> at specific points in time to make accurate decisions.

Can you explain a bit more about your environment where you are seeing
stale stats? More specifically, how often are the management applications
reading the memcg stats, and are these applications reading memcg stats
for each node of the cgroup tree?

We force flush all the memcg stats at root level every 2 seconds, but it
seems like that is not enough for your case. I am fine with an explicit
way for users to flush the memcg stats. That way, only users who want it
have to pay for the flush cost.


* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-06  1:19 ` Shakeel Butt
@ 2025-11-06  3:30   ` Leon Huang Fu
  2025-11-06  5:35     ` JP Kobryn
  2025-11-06 23:55     ` Shakeel Butt
  0 siblings, 2 replies; 14+ messages in thread
From: Leon Huang Fu @ 2025-11-06  3:30 UTC (permalink / raw)
  To: shakeel.butt
  Cc: akpm, cgroups, corbet, hannes, inwardvessel, jack, joel.granados,
	kyle.meyer, lance.yang, laoar.shao, leon.huangfu, linux-doc,
	linux-kernel, linux-mm, mclapinski, mhocko, muchun.song,
	roman.gushchin, yosry.ahmed

On Thu, Nov 6, 2025 at 9:19 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> +Yosry, JP
>
> On Wed, Nov 05, 2025 at 03:49:16PM +0800, Leon Huang Fu wrote:
> > On high-core count systems, memory cgroup statistics can become stale
> > due to per-CPU caching and deferred aggregation. Monitoring tools and
> > management applications sometimes need guaranteed up-to-date statistics
> > at specific points in time to make accurate decisions.
>
> Can you explain a bit more on your environment where you are seeing
> stale stats? More specifically, how often the management applications
> are reading the memcg stats and if these applications are reading memcg
> stats for each nodes of the cgroup tree.
>
> We force flush all the memcg stats at root level every 2 seconds but it
> seems like that is not enough for your case. I am fine with an explicit
> way for users to flush the memcg stats. In that way only users who want
> to has to pay for the flush cost.
>

Thanks for the feedback. I encountered this issue while running the LTP
memcontrol02 test case [1] on a 256-core server with the 6.6.y kernel on XFS,
where it consistently failed.

I was aware that Yosry had improved the memory statistics refresh mechanism
in "mm: memcg: subtree stats flushing and thresholds" [2], so I attempted to
backport that patchset to 6.6.y [3]. However, even on the 6.15.0-061500-generic
kernel with those improvements, the test still fails intermittently on XFS.

I've created a simplified reproducer that mirrors the LTP test behavior. The
test allocates 50 MiB of page cache and then verifies that memory.current and
memory.stat's "file" field are approximately equal (within 5% tolerance).

The failure pattern looks like:

  After alloc: memory.current=52690944, memory.stat.file=48496640, size=52428800
  Checks: current>=size=OK, file>0=OK, current~=file(5%)=FAIL

Here's the reproducer code and test script (attached below for reference).

To reproduce on XFS:
  sudo ./run.sh --xfs
  for i in {1..100}; do sudo ./run.sh --run; echo "==="; sleep 0.1; done
  sudo ./run.sh --cleanup

The test fails sporadically, typically a few times out of 100 runs, confirming
that the improved flush isn't sufficient for this workload pattern.

I agree that providing an explicit flush mechanism allows users who need
guaranteed accuracy to pay the cost only when necessary, rather than imposing
more aggressive global flushing on all users.

Thanks,
Leon

---

Links:
[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/controllers/memcg/memcontrol02.c
[2] https://lore.kernel.org/all/20231129032154.3710765-1-yosryahmed@google.com/
[3] https://lore.kernel.org/linux-mm/20251103075135.20254-1-leon.huangfu@shopee.com/

---

Reproducer code (pagecache_50m_demo.c):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int write_str(const char *path, const char *s) {
    int fd = open(path, O_WRONLY | O_CLOEXEC);
    if (fd < 0) { perror(path); return -1; }
    ssize_t w = write(fd, s, strlen(s));
    if (w != (ssize_t)strlen(s)) { perror("write"); close(fd); return -1; }
    close(fd);
    return 0;
}

static long read_long_file(const char *path) {
    FILE *f = fopen(path, "re");
    if (!f) { perror(path); return -1; }
    long v = -1;
    if (fscanf(f, "%ld", &v) != 1) v = -1;
    fclose(f);
    return v;
}

static long read_stat_field_bytes(const char *path, const char *key) {
    FILE *f = fopen(path, "re");
    if (!f) { perror(path); return -1; }
    char *line = NULL; size_t n = 0; long val = -1;
    while (getline(&line, &n, f) > 0) {
        if (strncmp(line, key, strlen(key)) == 0) {
            char *p = line + strlen(key);
            while (*p == ' ' || *p == '\t') p++;
            val = strtoll(p, NULL, 10);
            break;
        }
    }
    free(line);
    fclose(f);
    return val;
}

static int make_memcg_child(char *cg, size_t cg_sz) {
    const char *root = "/sys/fs/cgroup";
    int n = snprintf(cg, cg_sz, "%s/memcg_demo_%d", root, getpid());
    if (n < 0 || n >= (int)cg_sz) {
        fprintf(stderr, "cg path too long\n"); return -1;
    }
    if (mkdir(cg, 0755) && errno != EEXIST) { perror("mkdir cg"); return -1; }

    // best-effort enable memory controller on parent
    char parent[PATH_MAX];
    strncpy(parent, cg, sizeof(parent));
    parent[sizeof(parent)-1] = '\0';
    char *s = strrchr(parent, '/');
    if (s && s != parent) {
        *s = '\0';
        char stc[PATH_MAX];
        n = snprintf(stc, sizeof(stc), "%s/cgroup.subtree_control", parent);
        if (n >= 0 && n < (int)sizeof(stc)) (void)write_str(stc, "+memory");
    }

    char procs[PATH_MAX];
    n = snprintf(procs, sizeof(procs), "%s/cgroup.procs", cg);
    if (n < 0 || n >= (int)sizeof(procs)) { fprintf(stderr, "path too long\n"); return -1; }
    char pidbuf[32]; snprintf(pidbuf, sizeof(pidbuf), "%d", getpid());
    if (write_str(procs, pidbuf) < 0) return -1;
    return 0;
}

/* Exact mirror of LTP alloc_pagecache() behavior */
static inline void alloc_pagecache_exact(const int fd, size_t size)
{
    char buf[BUFSIZ];
    size_t i;

    if (lseek(fd, 0, SEEK_END) == (off_t)-1) {
        perror("lseek"); exit(1);
    }
    for (i = 0; i < size; i += sizeof(buf)) {
        ssize_t w = write(fd, buf, sizeof(buf));
        if (w < 0) { perror("write"); exit(1); }
        if ((size_t)w != sizeof(buf)) { fprintf(stderr, "short write\n"); exit(1); }
    }
}

static bool approx_equal(long a, long b, double tol_frac) {
    long diff = labs(a - b);
    long lim = (long)((double)b * tol_frac);
    return diff <= lim;
}

int main(int argc, char **argv) {
    const char *mnt = (argc >= 2) ? argv[1] : ".";
    char cg[PATH_MAX];
    if (make_memcg_child(cg, sizeof(cg)) < 0) {
        fprintf(stderr, "Failed to setup memcg (need root? cgroup v2?)\n");
        return 1;
    }
    printf("Created cg %s\n", cg);

    char p_current[PATH_MAX], p_stat[PATH_MAX];
    int n = snprintf(p_current, sizeof(p_current), "%s/memory.current", cg);
    if (n < 0 || n >= (int)sizeof(p_current)) { fprintf(stderr, "path too long\n"); return 1; }
    n = snprintf(p_stat, sizeof(p_stat), "%s/memory.stat", cg);
    if (n < 0 || n >= (int)sizeof(p_stat)) { fprintf(stderr, "path too long\n"); return 1; }

    char filepath[PATH_MAX];
    n = snprintf(filepath, sizeof(filepath), "%s/tmpfile", mnt);
    if (n < 0 || n >= (int)sizeof(filepath)) { fprintf(stderr, "file path too long\n"); return 1; }

    int fd = open(filepath, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open tmpfile"); return 1; }

    long current = read_long_file(p_current);
    printf("Created temp file: memory.current=%ld\n", current);

    const size_t size = 50UL * 1024 * 1024; // 50 MiB
    alloc_pagecache_exact(fd, size);

    // No fsyncs; small wait to reduce XFS timing flakiness
    //fsync(fd);
    //usleep(2200000);

    current = read_long_file(p_current);
    long file_bytes = read_stat_field_bytes(p_stat, "file");
    printf("After alloc: memory.current=%ld, memory.stat.file=%ld, size=%zu\n",
           current, file_bytes, size);

    bool ge_size = (current >= (long)size);
    bool file_pos = (file_bytes > 0);

    // Approximate LTP's values_close(..., file_to_all_error) with 5% tolerance
    double tol = 0.05;
    bool approx = approx_equal(current, file_bytes, tol);

    printf("Checks: current>=size=%s, file>0=%s, current~=file(%.0f%%)=%s\n",
           ge_size ? "OK" : "FAIL",
           file_pos ? "OK" : "FAIL",
           tol * 100,
           approx ? "OK" : "FAIL");

    close(fd);
    return (ge_size && file_pos && approx) ? 0 : 2;
}

Build: gcc -O2 -Wall pagecache_50m_demo.c -o pagecache_50m_demo

Test runner (run.sh):

#!/usr/bin/env bash
set -euo pipefail

# Config (overridable via env)
SIZE_MB="${SIZE_MB:-500}"
IMG="${IMG:-/var/tmp/pagecache_demo.img}"
MNT="${MNT:-/mnt/pagecache_demo}"
DEMO_BIN="${DEMO_BIN:-./pagecache_50m_demo}"
STATE="${STATE:-/var/tmp/pagecache_demo.state}" # stores LOOP + FS type

usage() {
  echo "Usage:"
  echo "  sudo $0 --ext4         Prepare ext4 loopback FS and mount it at \$MNT"
  echo "  sudo $0 --xfs          Prepare xfs  loopback FS and mount it at \$MNT"
  echo "  sudo $0 --btrfs        Prepare btrfs loopback FS and mount it at \$MNT"
  echo "  sudo $0 --tmpfs        Prepare tmpfs mount at \$MNT (no image/loop)"
  echo "  sudo $0 --run          Run demo on mounted \$MNT"
  echo "  sudo $0 --cleanup      Unmount and remove loop/image/state"
  echo
  echo "Env overrides:"
  echo "  SIZE_MB (default 256), IMG, MNT, DEMO_BIN, STATE"
}

need_root() {
  if [[ ${EUID:-$(id -u)} -ne 0 ]]; then
    echo "Please run as root (sudo)"
    exit 1
  fi
}

have_cmd() { command -v "$1" > /dev/null 2>&1; }

save_state() {
  local loop="${1:-}" fstype="$2"
  printf 'LOOP=%q\nFSTYPE=%q\nIMG=%q\nMNT=%q\n' "$loop" "$fstype" "$IMG" "$MNT" > "$STATE"
}

load_state() {
  if [[ ! -f "$STATE" ]]; then
    echo "State not found: $STATE. Run --ext4/--xfs/--btrfs/--tmpfs first."
    exit 1
  fi
  # shellcheck disable=SC1090
  . "$STATE"
  : "${FSTYPE:?}" "${IMG:?}" "${MNT:?}"
  # LOOP may be empty for tmpfs
}

cleanup_mount() {
  set +e
  if mountpoint -q "$MNT"; then
    umount "$MNT" || umount -l "$MNT"
  fi
  if [[ -n "${LOOP:-}" ]] && [[ -b "${LOOP:-}" ]]; then
    losetup -d "$LOOP" 2> /dev/null || true
  else
    # Detach any loop using IMG as fallback
    if [[ -f "$IMG" ]]; then
      if losetup -j "$IMG" | grep -q .; then
        losetup -j "$IMG" | awk -F: '{print $1}' | xargs -r -n1 losetup -d
      fi
    fi
  fi
  rmdir "$MNT" 2> /dev/null || true
  set -e
}

cleanup_all() {
  echo "[*] Cleaning up..."
  if [[ -f "$STATE" ]]; then
    load_state || true
  fi
  cleanup_mount
  # For tmpfs there is no image; for others remove image
  if [[ "${FSTYPE:-}" != "tmpfs" ]]; then
    rm -f "$IMG"
  fi
  rm -f "$STATE"
  rmdir /sys/fs/cgroup/memcg_demo_* || true
  echo "[*] Done."
}

make_image() {
  echo "[*] Creating sparse image: $IMG (${SIZE_MB} MiB)"
  dd if=/dev/zero of="$IMG" bs=1M count=0 seek="$SIZE_MB" status=none
}

attach_loop() {
  # stdout returns loop device path only
  losetup -fP --show "$IMG"
}

ensure_loop_ready() {
  local dev="$1"
  if [[ -z "$dev" ]]; then
    echo "Failed to get loop device for $IMG"
    exit 1
  fi
  # udev settle
  for _ in {1..10}; do
    [[ -b "$dev" ]] && return 0
    sleep 0.1
  done
  echo "Loop device is not a block device: $dev"
  exit 1
}

mkfs_ext4() {
  have_cmd mkfs.ext4 || {
    echo "mkfs.ext4 not found"
    exit 1
  }
  echo "[*] Making ext4 on $1"
  mkfs.ext4 -F -q "$1"
}

mkfs_xfs() {
  have_cmd mkfs.xfs || {
    echo "mkfs.xfs not found (install xfsprogs)"
    exit 1
  }
  echo "[*] Making xfs on $1"
  mkfs.xfs -f -q "$1"
}

mkfs_btrfs() {
  have_cmd mkfs.btrfs || {
    echo "mkfs.btrfs not found (install btrfs-progs)"
    exit 1
  }
  echo "[*] Making btrfs on $1"
  mkfs.btrfs -f -q "$1"
}

mount_fs_dev() {
  mkdir -p "$MNT"
  echo "[*] Mounting $1 at $MNT"
  mount "$1" "$MNT"
  df -h "$MNT" || true
}

prepare_fs_loop() {
  need_root
  local fstype="$1" # ext4 | xfs | btrfs

  rm -f "$STATE"
  if mountpoint -q "$MNT" || losetup -j "$IMG" | grep -q . || [[ -f "$IMG" ]]; then
    echo "[*] Previous environment detected, cleaning first..."
    cleanup_all
  fi

  make_image
  local loop
  loop="$(attach_loop)"
  ensure_loop_ready "$loop"

  case "$fstype" in
    ext4) mkfs_ext4 "$loop" ;;
    xfs) mkfs_xfs "$loop" ;;
    btrfs) mkfs_btrfs "$loop" ;;
    *)
      echo "Unknown fs: $fstype"
      exit 1
      ;;
  esac

  mount_fs_dev "$loop"
  save_state "$loop" "$fstype"
  echo "[*] Prepared $fstype at $MNT (loop=$loop)"
}

prepare_tmpfs() {
  need_root
  rm -f "$STATE"
  if mountpoint -q "$MNT"; then
    echo "[*] Unmounting previous $MNT..."
    umount "$MNT" || umount -l "$MNT"
  fi
  mkdir -p "$MNT"
  echo "[*] Mounting tmpfs at $MNT (size=${SIZE_MB}m)"
  mount -t tmpfs -o "size=${SIZE_MB}m" tmpfs "$MNT"
  df -h "$MNT" || true
  save_state "" "tmpfs"
  echo "[*] Prepared tmpfs at $MNT"
}

run_demo() {
  need_root
  load_state
  if ! mountpoint -q "$MNT"; then
    echo "Mount point not mounted: $MNT. Did you run --$FSTYPE first?"
    exit 1
  fi
  if [[ ! -x "$DEMO_BIN" ]]; then
    echo "Demo binary not found or not executable: $DEMO_BIN"
    exit 1
  fi
  echo "[*] Running demo on $MNT: $DEMO_BIN $MNT"
  "$DEMO_BIN" "$MNT"
}

case "${1:-}" in
  --ext4) prepare_fs_loop ext4 ;;
  --xfs) prepare_fs_loop xfs ;;
  --btrfs) prepare_fs_loop btrfs ;;
  --tmpfs) prepare_tmpfs ;;
  --run) run_demo ;;
  --cleanup) cleanup_all ;;
  *)
    usage
    exit 1
    ;;
esac

* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-06  3:30   ` Leon Huang Fu
@ 2025-11-06  5:35     ` JP Kobryn
  2025-11-06  6:42       ` Leon Huang Fu
  2025-11-06 23:55     ` Shakeel Butt
  1 sibling, 1 reply; 14+ messages in thread
From: JP Kobryn @ 2025-11-06  5:35 UTC (permalink / raw)
  To: Leon Huang Fu, shakeel.butt
  Cc: akpm, cgroups, corbet, hannes, jack, joel.granados, kyle.meyer,
	lance.yang, laoar.shao, linux-doc, linux-kernel, linux-mm,
	mclapinski, mhocko, muchun.song, roman.gushchin, yosry.ahmed

On 11/5/25 7:30 PM, Leon Huang Fu wrote:
> On Thu, Nov 6, 2025 at 9:19 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> +Yosry, JP
>>
>> On Wed, Nov 05, 2025 at 03:49:16PM +0800, Leon Huang Fu wrote:
>>> On high-core count systems, memory cgroup statistics can become stale
>>> due to per-CPU caching and deferred aggregation. Monitoring tools and
>>> management applications sometimes need guaranteed up-to-date statistics
>>> at specific points in time to make accurate decisions.
>>
>> Can you explain a bit more on your environment where you are seeing
>> stale stats? More specifically, how often the management applications
>> are reading the memcg stats and if these applications are reading memcg
>> stats for each nodes of the cgroup tree.
>>
>> We force flush all the memcg stats at root level every 2 seconds but it
>> seems like that is not enough for your case. I am fine with an explicit
>> way for users to flush the memcg stats. In that way only users who want
>> to has to pay for the flush cost.
>>
> 
> Thanks for the feedback. I encountered this issue while running the LTP
> memcontrol02 test case [1] on a 256-core server with the 6.6.y kernel on XFS,
> where it consistently failed.
> 
> I was aware that Yosry had improved the memory statistics refresh mechanism
> in "mm: memcg: subtree stats flushing and thresholds" [2], so I attempted to
> backport that patchset to 6.6.y [3]. However, even on the 6.15.0-061500-generic
> kernel with those improvements, the test still fails intermittently on XFS.
> 

I'm not against this change, but it might be worth testing on a 6.16 or
later kernel. There were some changes that could affect your
measurements. One is that flushing was isolated to individual subsystems
[0] and the other is that updating stats became lockless [1].

[0] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/cgroup/rstat.c?h=v6.18-rc4&id=5da3bfa029d6809e192d112f39fca4dbe0137aaf
[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/cgroup/rstat.c?h=v6.18-rc4&id=36df6e3dbd7e7b074e55fec080012184e2fa3a46

* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-06  5:35     ` JP Kobryn
@ 2025-11-06  6:42       ` Leon Huang Fu
  0 siblings, 0 replies; 14+ messages in thread
From: Leon Huang Fu @ 2025-11-06  6:42 UTC (permalink / raw)
  To: inwardvessel
  Cc: akpm, cgroups, corbet, hannes, jack, joel.granados, kyle.meyer,
	lance.yang, laoar.shao, leon.huangfu, linux-doc, linux-kernel,
	linux-mm, mclapinski, mhocko, muchun.song, roman.gushchin,
	shakeel.butt, yosry.ahmed

>On 11/5/25 7:30 PM, Leon Huang Fu wrote:
>> On Thu, Nov 6, 2025 at 9:19 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>
>>> +Yosry, JP
>>>
>>> On Wed, Nov 05, 2025 at 03:49:16PM +0800, Leon Huang Fu wrote:
>>>> On high-core count systems, memory cgroup statistics can become stale
>>>> due to per-CPU caching and deferred aggregation. Monitoring tools and
>>>> management applications sometimes need guaranteed up-to-date statistics
>>>> at specific points in time to make accurate decisions.
>>>
>>> Can you explain a bit more on your environment where you are seeing
>>> stale stats? More specifically, how often the management applications
>>> are reading the memcg stats and if these applications are reading memcg
>>> stats for each nodes of the cgroup tree.
>>>
>>> We force flush all the memcg stats at root level every 2 seconds but it
>>> seems like that is not enough for your case. I am fine with an explicit
>>> way for users to flush the memcg stats. In that way only users who want
>>> to has to pay for the flush cost.
>>>
>>
>> Thanks for the feedback. I encountered this issue while running the LTP
>> memcontrol02 test case [1] on a 256-core server with the 6.6.y kernel on XFS,
>> where it consistently failed.
>>
>> I was aware that Yosry had improved the memory statistics refresh mechanism
>> in "mm: memcg: subtree stats flushing and thresholds" [2], so I attempted to
>> backport that patchset to 6.6.y [3]. However, even on the 6.15.0-061500-generic
>> kernel with those improvements, the test still fails intermittently on XFS.
>>
>
>I'm not against this change, but it might be worth testing on a 6.16 or
>later kernel. There were some changes that could affect your
>measurements. One is that flushing was isolated to individual subsystems
>[0] and the other is that updating stats became lockless [1].
>
>[0]
>https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/cgroup/rstat.c?h=v6.18-rc4&id=5da3bfa029d6809e192d112f39fca4dbe0137aaf
>[1]
>https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/cgroup/rstat.c?h=v6.18-rc4&id=36df6e3dbd7e7b074e55fec080012184e2fa3a46

Thanks for the suggestion! I've tested on kernel 6.17.7-061707-generic and
the results show the problem has actually gotten worse compared to
6.15.0-061500-generic.

Test results (100 runs each on the LTP memcontrol02 test scenario):

Kernel 6.15.0-061500-generic:
- Failures: 2/100 runs
- Failure rate: 2%

Kernel 6.17.7-061707-generic:
- Failures: 25/100 runs
- Failure rate: 25%

The increased failure rate with the newer kernel suggests that the lockless
stats updates and subsystem isolation changes, while improving performance,
may have reduced the implicit synchronization that was helping mask the
staleness issue in some cases.

This reinforces the need for an explicit flush mechanism (memory.stat_refresh)
to give users control when they need guaranteed up-to-date statistics.

Thanks,
Leon

* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-05  7:49 [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file Leon Huang Fu
  2025-11-05  8:19 ` Michal Hocko
  2025-11-06  1:19 ` Shakeel Butt
@ 2025-11-06 17:02 ` JP Kobryn
  2025-11-10  6:20   ` Leon Huang Fu
  2 siblings, 1 reply; 14+ messages in thread
From: JP Kobryn @ 2025-11-06 17:02 UTC (permalink / raw)
  To: Leon Huang Fu, linux-mm
  Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
	joel.granados, jack, laoar.shao, mclapinski, kyle.meyer, corbet,
	lance.yang, linux-doc, linux-kernel, cgroups

On 11/4/25 11:49 PM, Leon Huang Fu wrote:
> On high-core count systems, memory cgroup statistics can become stale
> due to per-CPU caching and deferred aggregation. Monitoring tools and
> management applications sometimes need guaranteed up-to-date statistics
> at specific points in time to make accurate decisions.
> 
> This patch adds write handlers to both memory.stat and memory.numa_stat
> files to allow userspace to explicitly force an immediate flush of
> memory statistics. When "1" is written to either file, it triggers
> __mem_cgroup_flush_stats(memcg, true), which unconditionally flushes
> all pending statistics for the cgroup and its descendants.
> 
> The write operation validates the input and only accepts the value "1",
> returning -EINVAL for any other input.
> 
> Usage example:
>    # Force immediate flush before reading critical statistics
>    echo 1 > /sys/fs/cgroup/mygroup/memory.stat
>    cat /sys/fs/cgroup/mygroup/memory.stat
> 
> This provides several benefits:
> 
> 1. On-demand accuracy: Tools can flush only when needed, avoiding
>     continuous overhead
> 
> 2. Targeted flushing: Allows flushing specific cgroups when precision
>     is required for particular workloads

I'm curious about your use case. Since you mention required precision,
are you planning on manually flushing before every read?

> 
> 3. Integration flexibility: Monitoring scripts can decide when to pay
>     the flush cost based on their specific accuracy requirements

[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c34029e92bab..d6a5d872fbcb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
>   	return 0;
>   }
> 
> +int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
> +{
> +	if (val != 1)
> +		return -EINVAL;
> +
> +	if (css)
> +		css_rstat_flush(css);

This is a kfunc. You can do this right now from a bpf program without
any kernel changes.


* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-06  3:30   ` Leon Huang Fu
  2025-11-06  5:35     ` JP Kobryn
@ 2025-11-06 23:55     ` Shakeel Butt
  2025-11-10  6:37       ` Leon Huang Fu
  1 sibling, 1 reply; 14+ messages in thread
From: Shakeel Butt @ 2025-11-06 23:55 UTC (permalink / raw)
  To: Leon Huang Fu
  Cc: akpm, cgroups, corbet, hannes, inwardvessel, jack, joel.granados,
	kyle.meyer, lance.yang, laoar.shao, linux-doc, linux-kernel,
	linux-mm, mclapinski, mhocko, muchun.song, roman.gushchin,
	yosry.ahmed

On Thu, Nov 06, 2025 at 11:30:45AM +0800, Leon Huang Fu wrote:
> On Thu, Nov 6, 2025 at 9:19 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > +Yosry, JP
> >
> > On Wed, Nov 05, 2025 at 03:49:16PM +0800, Leon Huang Fu wrote:
> > > On high-core count systems, memory cgroup statistics can become stale
> > > due to per-CPU caching and deferred aggregation. Monitoring tools and
> > > management applications sometimes need guaranteed up-to-date statistics
> > > at specific points in time to make accurate decisions.
> >
> > Can you explain a bit more on your environment where you are seeing
> > stale stats? More specifically, how often the management applications
> > are reading the memcg stats and if these applications are reading memcg
> > stats for each nodes of the cgroup tree.
> >
> > We force flush all the memcg stats at root level every 2 seconds but it
> > seems like that is not enough for your case. I am fine with an explicit
> > way for users to flush the memcg stats. In that way only users who want
> > to has to pay for the flush cost.
> >
> 
> Thanks for the feedback. I encountered this issue while running the LTP
> memcontrol02 test case [1] on a 256-core server with the 6.6.y kernel on XFS,
> where it consistently failed.
> 
> I was aware that Yosry had improved the memory statistics refresh mechanism
> in "mm: memcg: subtree stats flushing and thresholds" [2], so I attempted to
> backport that patchset to 6.6.y [3]. However, even on the 6.15.0-061500-generic
> kernel with those improvements, the test still fails intermittently on XFS.
> 
> I've created a simplified reproducer that mirrors the LTP test behavior. The
> test allocates 50 MiB of page cache and then verifies that memory.current and
> memory.stat's "file" field are approximately equal (within 5% tolerance).
> 
> The failure pattern looks like:
> 
>   After alloc: memory.current=52690944, memory.stat.file=48496640, size=52428800
>   Checks: current>=size=OK, file>0=OK, current~=file(5%)=FAIL
> 
> Here's the reproducer code and test script (attached below for reference).
> 
> To reproduce on XFS:
>   sudo ./run.sh --xfs
>   for i in {1..100}; do sudo ./run.sh --run; echo "==="; sleep 0.1; done
>   sudo ./run.sh --cleanup
> 
> The test fails sporadically, typically a few times out of 100 runs, confirming
> that the improved flush isn't sufficient for this workload pattern.

I was hoping that you had a real-world workload/scenario which is
facing this issue. For the test, a simple 'sleep 2' would be enough.
Anyway, that is not an argument against adding an interface for flushing.


* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-06 17:02 ` JP Kobryn
@ 2025-11-10  6:20   ` Leon Huang Fu
  2025-11-10 19:24     ` JP Kobryn
  0 siblings, 1 reply; 14+ messages in thread
From: Leon Huang Fu @ 2025-11-10  6:20 UTC (permalink / raw)
  To: inwardvessel
  Cc: akpm, cgroups, corbet, hannes, jack, joel.granados, kyle.meyer,
	lance.yang, laoar.shao, leon.huangfu, linux-doc, linux-kernel,
	linux-mm, mclapinski, mhocko, muchun.song, roman.gushchin,
	shakeel.butt

On Fri, Nov 7, 2025 at 1:02 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>
> On 11/4/25 11:49 PM, Leon Huang Fu wrote:
> > On high-core count systems, memory cgroup statistics can become stale
> > due to per-CPU caching and deferred aggregation. Monitoring tools and
> > management applications sometimes need guaranteed up-to-date statistics
> > at specific points in time to make accurate decisions.
> >
> > This patch adds write handlers to both memory.stat and memory.numa_stat
> > files to allow userspace to explicitly force an immediate flush of
> > memory statistics. When "1" is written to either file, it triggers
> > __mem_cgroup_flush_stats(memcg, true), which unconditionally flushes
> > all pending statistics for the cgroup and its descendants.
> >
> > The write operation validates the input and only accepts the value "1",
> > returning -EINVAL for any other input.
> >
> > Usage example:
> >    # Force immediate flush before reading critical statistics
> >    echo 1 > /sys/fs/cgroup/mygroup/memory.stat
> >    cat /sys/fs/cgroup/mygroup/memory.stat
> >
> > This provides several benefits:
> >
> > 1. On-demand accuracy: Tools can flush only when needed, avoiding
> >     continuous overhead
> >
> > 2. Targeted flushing: Allows flushing specific cgroups when precision
> >     is required for particular workloads
>
> I'm curious about your use case. Since you mention required precision,
> are you planning on manually flushing before every read?
>

Yes, for our use case, manual flushing before critical reads is necessary.
We're going to run on high-core count servers (224-256 cores), where the
per-CPU batching threshold (MEMCG_CHARGE_BATCH * num_online_cpus) can
accumulate up to 16,384 events (on 256 cores) before an automatic flush is
triggered. This means memory statistics are likely to be stale, often beyond
the acceptable tolerance for critical memory management decisions.

Our monitoring tools don't need to flush on every read - only when making
critical decisions like OOM adjustments, container placement, or resource
limit enforcement. The opt-in nature of this mechanism allows us to pay the
flush cost only when precision is truly required.
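
As a rough back-of-the-envelope bound (assuming MEMCG_CHARGE_BATCH is 64,
as in current kernels, and treating each pending event as a single 4 KiB
page, which is only an approximation of how the per-CPU counters are
weighted):

  64 events/CPU * 256 CPUs  = 16,384 pending events
  16,384 events * 4 KiB    ~= 64 MiB of potentially unreported charge

That is well above the 2.5 MiB tolerance (5% of 50 MiB) used by the
LTP-style check mentioned earlier in this thread.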

> >
> > 3. Integration flexibility: Monitoring scripts can decide when to pay
> >     the flush cost based on their specific accuracy requirements
>
> [...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c34029e92bab..d6a5d872fbcb 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
> >       return 0;
> >   }
> >
> > +int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
> > +{
> > +     if (val != 1)
> > +             return -EINVAL;
> > +
> > +     if (css)
> > +             css_rstat_flush(css);
>
> This is a kfunc. You can do this right now from a bpf program without
> any kernel changes.
>

While css_rstat_flush() is indeed available as a BPF kfunc, the practical
challenge is determining when to call it. The natural hook point would be
memory_stat_show() using fentry, but this runs into a BPF verifier
limitation: the function's 'struct seq_file *' argument doesn't provide a
trusted path to obtain the 'struct cgroup_subsys_state *css' pointer
required by css_rstat_flush().

I attempted to implement this via BPF (code below), but it fails
verification because deriving the css pointer through
seq->private->kn->parent->priv results in an untrusted scalar that the
verifier rejects for the kfunc call:

    R1 invalid mem access 'scalar'

The verifier error occurs because:
1. seq->private is rdonly_untrusted_mem
2. Dereferencing through kernfs_node internals produces untracked pointers
3. css_rstat_flush() requires a trusted css pointer per its kfunc definition

A direct userspace interface (memory.stat_refresh) avoids these verifier
limitations and provides a cleaner, more maintainable solution that doesn't
require BPF expertise or complex workarounds.

Thanks,
Leon

---


#include "vmlinux.h"

#include "bpf_helpers.h"
#include "bpf_tracing.h"

char _license[] SEC("license") = "GPL";

extern void css_rstat_flush(struct cgroup_subsys_state *css) __weak __ksym;

static inline struct cftype *of_cft(struct kernfs_open_file *of)
{
	return of->kn->priv;
}

struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
{
	struct cgroup *cgrp = of->kn->parent->priv;
	struct cftype *cft = of_cft(of);

	/*
	 * This is open and unprotected implementation of cgroup_css().
	 * seq_css() is only called from a kernfs file operation which has
	 * an active reference on the file.  Because all the subsystem
	 * files are drained before a css is disassociated with a cgroup,
	 * the matching css from the cgroup's subsys table is guaranteed to
	 * be and stay valid until the enclosing operation is complete.
	 */
	if (cft->ss)
		return cgrp->subsys[cft->ss->id];
	else
		return &cgrp->self;
}

static inline struct cgroup_subsys_state *seq_css(struct seq_file *seq)
{
	return of_css(seq->private);
}

SEC("fentry/memory_stat_show")
int BPF_PROG(memory_stat_show, struct seq_file *seq, void *v)
{
	struct cgroup_subsys_state *css = seq_css(seq);

	if (css)
		css_rstat_flush(css);

	return 0;
}

* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-06 23:55     ` Shakeel Butt
@ 2025-11-10  6:37       ` Leon Huang Fu
  2025-11-10 20:19         ` Yosry Ahmed
  0 siblings, 1 reply; 14+ messages in thread
From: Leon Huang Fu @ 2025-11-10  6:37 UTC (permalink / raw)
  To: shakeel.butt
  Cc: akpm, cgroups, corbet, hannes, inwardvessel, jack, joel.granados,
	kyle.meyer, lance.yang, laoar.shao, leon.huangfu, linux-doc,
	linux-kernel, linux-mm, mclapinski, mhocko, muchun.song,
	roman.gushchin, yosry.ahmed

On Fri, Nov 7, 2025 at 7:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Nov 06, 2025 at 11:30:45AM +0800, Leon Huang Fu wrote:
> > On Thu, Nov 6, 2025 at 9:19 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > +Yosry, JP
> > >
> > > On Wed, Nov 05, 2025 at 03:49:16PM +0800, Leon Huang Fu wrote:
> > > > On high-core count systems, memory cgroup statistics can become stale
> > > > due to per-CPU caching and deferred aggregation. Monitoring tools and
> > > > management applications sometimes need guaranteed up-to-date statistics
> > > > at specific points in time to make accurate decisions.
> > >
> > > Can you explain a bit more on your environment where you are seeing
> > > stale stats? More specifically, how often the management applications
> > > are reading the memcg stats and if these applications are reading memcg
> > > stats for each nodes of the cgroup tree.
> > >
> > > We force flush all the memcg stats at root level every 2 seconds but it
> > > seems like that is not enough for your case. I am fine with an explicit
> > > way for users to flush the memcg stats. In that way only users who want
> > > to has to pay for the flush cost.
> > >
> >
> > Thanks for the feedback. I encountered this issue while running the LTP
> > memcontrol02 test case [1] on a 256-core server with the 6.6.y kernel on XFS,
> > where it consistently failed.
> >
> > I was aware that Yosry had improved the memory statistics refresh mechanism
> > in "mm: memcg: subtree stats flushing and thresholds" [2], so I attempted to
> > backport that patchset to 6.6.y [3]. However, even on the 6.15.0-061500-generic
> > kernel with those improvements, the test still fails intermittently on XFS.
> >
> > I've created a simplified reproducer that mirrors the LTP test behavior. The
> > test allocates 50 MiB of page cache and then verifies that memory.current and
> > memory.stat's "file" field are approximately equal (within 5% tolerance).
> >
> > The failure pattern looks like:
> >
> >   After alloc: memory.current=52690944, memory.stat.file=48496640, size=52428800
> >   Checks: current>=size=OK, file>0=OK, current~=file(5%)=FAIL
> >
> > Here's the reproducer code and test script (attached below for reference).
> >
> > To reproduce on XFS:
> >   sudo ./run.sh --xfs
> >   for i in {1..100}; do sudo ./run.sh --run; echo "==="; sleep 0.1; done
> >   sudo ./run.sh --cleanup
> >
> > The test fails sporadically, typically a few times out of 100 runs, confirming
> > that the improved flush isn't sufficient for this workload pattern.
>
> I was hoping that you have a real world workload/scenario which is
> facing this issue. For the test a simple 'sleep 2' would be enough.
> Anyways that is not an argument against adding an inteface for flushing.
>

Fair point. I haven't encountered a production issue yet - this came up during
our kernel testing phase on high-core count servers (224-256 cores) before
deploying to production.

The LTP test failure was the indicator that prompted investigation. While
adding 'sleep 2' would fix the test, it highlights a broader concern: on these
high-core systems, the batching threshold (MEMCG_CHARGE_BATCH * num_online_cpus)
can accumulate 14K-16K events before auto-flush, potentially causing significant
staleness for workloads that need timely statistics.

We're planning to deploy container workloads on these servers where memory
statistics drive placement and resource management decisions. Having an explicit
flush interface would give us confidence that when precision matters (e.g.,
admission control, OOM decisions), we can get accurate stats on demand rather
than relying on timing or hoping the 2-second periodic flush happens when needed.

I understand this is more of a "preparing for future needs" rather than "fixing
current production breakage" situation. However, given the interface provides
opt-in control with no cost to users who don't need it, I believe it's a
reasonable addition. I'll prepare a v3 with the dedicated memory.stat_refresh
file as suggested.
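
Assuming v3 keeps the semantics discussed above (a dedicated file whose
written value is ignored), the intended usage would look roughly like:

  # force an immediate flush of this cgroup's (and descendants') stats
  echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
  cat /sys/fs/cgroup/mygroup/memory.stat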

Thanks,
Leon

* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-10  6:20   ` Leon Huang Fu
@ 2025-11-10 19:24     ` JP Kobryn
  0 siblings, 0 replies; 14+ messages in thread
From: JP Kobryn @ 2025-11-10 19:24 UTC (permalink / raw)
  To: Leon Huang Fu
  Cc: akpm, cgroups, corbet, hannes, jack, joel.granados, kyle.meyer,
	lance.yang, laoar.shao, linux-doc, linux-kernel, linux-mm,
	mclapinski, mhocko, muchun.song, roman.gushchin, shakeel.butt

On 11/9/25 10:20 PM, Leon Huang Fu wrote:
> On Fri, Nov 7, 2025 at 1:02 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> On 11/4/25 11:49 PM, Leon Huang Fu wrote:
>>> On high-core count systems, memory cgroup statistics can become stale
>>> due to per-CPU caching and deferred aggregation. Monitoring tools and
>>> management applications sometimes need guaranteed up-to-date statistics
>>> at specific points in time to make accurate decisions.
>>>
>>> This patch adds write handlers to both memory.stat and memory.numa_stat
>>> files to allow userspace to explicitly force an immediate flush of
>>> memory statistics. When "1" is written to either file, it triggers
>>> __mem_cgroup_flush_stats(memcg, true), which unconditionally flushes
>>> all pending statistics for the cgroup and its descendants.
>>>
>>> The write operation validates the input and only accepts the value "1",
>>> returning -EINVAL for any other input.
>>>
>>> Usage example:
>>>     # Force immediate flush before reading critical statistics
>>>     echo 1 > /sys/fs/cgroup/mygroup/memory.stat
>>>     cat /sys/fs/cgroup/mygroup/memory.stat
>>>
>>> This provides several benefits:
>>>
>>> 1. On-demand accuracy: Tools can flush only when needed, avoiding
>>>      continuous overhead
>>>
>>> 2. Targeted flushing: Allows flushing specific cgroups when precision
>>>      is required for particular workloads
>>
>> I'm curious about your use case. Since you mention required precision,
>> are you planning on manually flushing before every read?
>>
> 
> Yes, for our use case, manual flushing before critical reads is necessary.
> We're going to run on high-core count servers (224-256 cores), where the
> per-CPU batching threshold (MEMCG_CHARGE_BATCH * num_online_cpus) can
> accumulate up to 16,384 events (on 256 cores) before an automatic flush is
> triggered. This means memory statistics can be likely stale, often exceeding
> acceptable tolerance for critical memory management decisions.
> 
> Our monitoring tools don't need to flush on every read - only when making
> critical decisions like OOM adjustments, container placement, or resource
> limit enforcement. The opt-in nature of this mechanism allows us to pay the
> flush cost only when precision is truly required.
> 
>>>
>>> 3. Integration flexibility: Monitoring scripts can decide when to pay
>>>      the flush cost based on their specific accuracy requirements
>>
>> [...]
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index c34029e92bab..d6a5d872fbcb 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
>>>        return 0;
>>>    }
>>>
>>> +int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
>>> +{
>>> +     if (val != 1)
>>> +             return -EINVAL;
>>> +
>>> +     if (css)
>>> +             css_rstat_flush(css);
>>
>> This is a kfunc. You can do this right now from a bpf program without
>> any kernel changes.
>>
> 
> While css_rstat_flush() is indeed available as a BPF kfunc, the practical
> challenge is determining when to call it. The natural hook point would be
> memory_stat_show() using fentry, but this runs into a BPF verifier
> limitation: the function's 'struct seq_file *' argument doesn't provide a
> trusted path to obtain the 'struct cgroup_subsys_state *css' pointer
> required by css_rstat_flush().

Ok, I see this would only work on the css for base stats.

SEC("iter.s/cgroup")
int cgroup_memcg_query(struct bpf_iter__cgroup *ctx)
{
     struct cgroup *cgrp = ctx->cgroup;
     struct cgroup_subsys_state *css;

     if (!cgrp)
         return 1;

     /* example of flushing css for base cpu stats
      * css = container_of(cgrp, struct cgroup_subsys_state, cgroup);
      * if (!css)
      *     return 1;
      * css_rstat_flush(css);
      */

     /* get css for memcg stats */
     css = cgrp->subsys[memory_cgrp_id];
     if (!css)
         return 1;
     css_rstat_flush(css); <- confirm untrusted pointer arg error
     ...

> 
> I attempted to implement this via BPF (code below), but it fails
> verification because deriving the css pointer through
> seq->private->kn->parent->priv results in an untrusted scalar that the
> verifier rejects for the kfunc call:
> 
>      R1 invalid mem access 'scalar'
> 
> The verifier error occurs because:
> 1. seq->private is rdonly_untrusted_mem
> 2. Dereferencing through kernfs_node internals produces untracked pointers
> 3. css_rstat_flush() requires a trusted css pointer per its kfunc definition
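
For reference, the failing attempt presumably looks roughly like the sketch
below (a hypothetical reconstruction from the error chain above, not the
exact program from the earlier mail; the pointer walk just follows the
seq->private->kn->parent->priv path described there):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

extern void css_rstat_flush(struct cgroup_subsys_state *css) __ksym;

char LICENSE[] SEC("license") = "GPL";

SEC("fentry/memory_stat_show")
int BPF_PROG(flush_before_show, struct seq_file *seq, void *v)
{
	struct kernfs_open_file *of = seq->private;	/* rdonly_untrusted_mem */
	struct cgroup *cgrp = of->kn->parent->priv;	/* untracked from here */
	struct cgroup_subsys_state *css = cgrp->subsys[memory_cgrp_id];

	if (!css)
		return 0;

	css_rstat_flush(css);	/* rejected: R1 invalid mem access 'scalar' */
	return 0;
}
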
> 
> A direct userspace interface (memory.stat_refresh) avoids these verifier
> limitations and provides a cleaner, more maintainable solution that doesn't
> require BPF expertise or complex workarounds.

This is subjective. After hearing more about your use case and how you
mention making critical decisions, you should have a look at the work
being done on BPF OOM [0][1]. I think you would benefit from this
series. Specifically for your case it provides the ability to flush
memcg on demand and also fetch stats.

[0] 
https://lore.kernel.org/all/20251027231727.472628-1-roman.gushchin@linux.dev/
[1] 
https://lore.kernel.org/all/20251027232206.473085-2-roman.gushchin@linux.dev/


* Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
  2025-11-10  6:37       ` Leon Huang Fu
@ 2025-11-10 20:19         ` Yosry Ahmed
  0 siblings, 0 replies; 14+ messages in thread
From: Yosry Ahmed @ 2025-11-10 20:19 UTC (permalink / raw)
  To: Leon Huang Fu
  Cc: shakeel.butt, akpm, cgroups, corbet, hannes, inwardvessel, jack,
	joel.granados, kyle.meyer, lance.yang, laoar.shao, linux-doc,
	linux-kernel, linux-mm, mclapinski, mhocko, muchun.song,
	roman.gushchin

On Mon, Nov 10, 2025 at 02:37:57PM +0800, Leon Huang Fu wrote:
> On Fri, Nov 7, 2025 at 7:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Nov 06, 2025 at 11:30:45AM +0800, Leon Huang Fu wrote:
> > > On Thu, Nov 6, 2025 at 9:19 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > +Yosry, JP
> > > >
> > > > On Wed, Nov 05, 2025 at 03:49:16PM +0800, Leon Huang Fu wrote:
> > > > > On high-core count systems, memory cgroup statistics can become stale
> > > > > due to per-CPU caching and deferred aggregation. Monitoring tools and
> > > > > management applications sometimes need guaranteed up-to-date statistics
> > > > > at specific points in time to make accurate decisions.
> > > >
> > > > Can you explain a bit more on your environment where you are seeing
> > > > stale stats? More specifically, how often the management applications
> > > > are reading the memcg stats and if these applications are reading memcg
> > > > stats for each nodes of the cgroup tree.
> > > >
> > > > We force flush all the memcg stats at root level every 2 seconds but it
> > > > seems like that is not enough for your case. I am fine with an explicit
> > > > way for users to flush the memcg stats. In that way, only users who want
> > > > it have to pay the flush cost.
> > > >
> > >
> > > Thanks for the feedback. I encountered this issue while running the LTP
> > > memcontrol02 test case [1] on a 256-core server with the 6.6.y kernel on XFS,
> > > where it consistently failed.
> > >
> > > I was aware that Yosry had improved the memory statistics refresh mechanism
> > > in "mm: memcg: subtree stats flushing and thresholds" [2], so I attempted to
> > > backport that patchset to 6.6.y [3]. However, even on the 6.15.0-061500-generic
> > > kernel with those improvements, the test still fails intermittently on XFS.
> > >
> > > I've created a simplified reproducer that mirrors the LTP test behavior. The
> > > test allocates 50 MiB of page cache and then verifies that memory.current and
> > > memory.stat's "file" field are approximately equal (within 5% tolerance).
> > >
> > > The failure pattern looks like:
> > >
> > >   After alloc: memory.current=52690944, memory.stat.file=48496640, size=52428800
> > >   Checks: current>=size=OK, file>0=OK, current~=file(5%)=FAIL
> > >
> > > Here's the reproducer code and test script (attached below for reference).
> > >
> > > To reproduce on XFS:
> > >   sudo ./run.sh --xfs
> > >   for i in {1..100}; do sudo ./run.sh --run; echo "==="; sleep 0.1; done
> > >   sudo ./run.sh --cleanup
> > >
> > > The test fails sporadically, typically a few times out of 100 runs, confirming
> > > that the improved flush isn't sufficient for this workload pattern.
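
To make the check concrete, the comparison step is roughly the following (a
minimal sketch with a hypothetical cgroup path, not the attached reproducer):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CG "/sys/fs/cgroup/memcg_test"	/* hypothetical path */

int main(void)
{
	char key[64];
	long cur = -1, file = -1, val;
	FILE *f;

	f = fopen(CG "/memory.current", "r");
	if (f) {
		if (fscanf(f, "%ld", &cur) != 1)
			cur = -1;
		fclose(f);
	}

	f = fopen(CG "/memory.stat", "r");
	while (f && fscanf(f, "%63s %ld", key, &val) == 2) {
		if (!strcmp(key, "file")) {
			file = val;
			break;
		}
	}
	if (f)
		fclose(f);

	printf("memory.current=%ld memory.stat.file=%ld\n", cur, file);
	/* the check that fails sporadically: values agree within 5% */
	return labs(cur - file) <= cur / 20 ? 0 : 1;
}
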
> >
> > I was hoping that you had a real-world workload/scenario which is
> > facing this issue. For the test, a simple 'sleep 2' would be enough.
> > Anyways that is not an argument against adding an interface for flushing.
> >
> 
> Fair point. I haven't encountered a production issue yet - this came up during
> our kernel testing phase on high-core count servers (224-256 cores) before
> deploying to production.
> 
> The LTP test failure was the indicator that prompted investigation. While
> adding 'sleep 2' would fix the test, it highlights a broader concern: on these
> high-core systems, the batching threshold (MEMCG_CHARGE_BATCH * num_online_cpus)
> can accumulate 14K-16K events before auto-flush, potentially causing significant
> staleness for workloads that need timely statistics.

The thresholding is implemented as a tradeoff between expensive flushing
and accurate stats, and it aims to at least provide deterministic
behavior in terms of how much the stats can deviate.

That being said, it's understandable that some use cases require even
higher accuracy and are willing to pay the price, although I share
Shakeel's frustration that the driving motivation is tests where you can
sleep for 2 seconds or alter the tests to allow some bounded deviation.

The two alternatives I can think of are the synchronous flushing
interface, and some sort of tunable that determines the needed accuracy.
The latter sounds like it would be difficult to design properly and may
end up with some of the swappiness problems, so I think the synchronous
flushing interface is probably the way to go. This was also brought up
before when the thresholding was implemented.

If we ever change the stats implementation completely and lose the
concept of flushes/refreshes, the interface can just be a noop, and we
can document that writes are useless (or even print something in dmesg).

So no objections from me.

> 
> We're planning to deploy container workloads on these servers where memory
> statistics drive placement and resource management decisions. Having an explicit
> flush interface would give us confidence that when precision matters (e.g.,
> admission control, OOM decisions), we can get accurate stats on demand rather
> than relying on timing or hoping the 2-second periodic flush happens when needed.
> 
> I understand this is more of a "preparing for future needs" rather than "fixing
> current production breakage" situation. However, given the interface provides
> opt-in control with no cost to users who don't need it, I believe it's a
> reasonable addition. I'll prepare a v3 with the dedicated memory.stat_refresh
> file as suggested.
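
FWIW the dedicated file can probably stay as thin as the v2 handler. A rough
sketch, assuming the same write_u64 plumbing as in this patch (the name and
flags below are placeholders, not the actual v3):

static int memory_stat_refresh_write(struct cgroup_subsys_state *css,
				     struct cftype *cft, u64 val)
{
	if (val != 1)
		return -EINVAL;

	if (css)
		css_rstat_flush(css);
	return 0;
}

	/* added alongside the existing entries in memory_files[] */
	{
		.name = "stat_refresh",
		.flags = CFTYPE_NOT_ON_ROOT,
		.write_u64 = memory_stat_refresh_write,
	},
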
> 
> Thanks,
> Leon


end of thread, other threads:[~2025-11-10 20:19 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-05  7:49 [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file Leon Huang Fu
2025-11-05  8:19 ` Michal Hocko
2025-11-05  8:39   ` Lance Yang
2025-11-05  8:51     ` Leon Huang Fu
2025-11-06  1:19 ` Shakeel Butt
2025-11-06  3:30   ` Leon Huang Fu
2025-11-06  5:35     ` JP Kobryn
2025-11-06  6:42       ` Leon Huang Fu
2025-11-06 23:55     ` Shakeel Butt
2025-11-10  6:37       ` Leon Huang Fu
2025-11-10 20:19         ` Yosry Ahmed
2025-11-06 17:02 ` JP Kobryn
2025-11-10  6:20   ` Leon Huang Fu
2025-11-10 19:24     ` JP Kobryn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).