[PATCH 0/3] memcg: non-unified flushing for userspace stats

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] memcg: non-unified flushing for userspace stats
@ 2023-08-21 20:54 Yosry Ahmed
  2023-08-21 20:54 ` [PATCH 1/3] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
                   ` (3 more replies)
  0 siblings, 4 replies; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-21 20:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Yosry Ahmed

Most memcg flushing contexts using "unified" flushing, where only one
flusher is allowed at a time (others skip), and all flushers need to
flush the entire tree. This works well with high concurrency, which
mostly comes from in-kernel flushers (e.g. reclaim, refault, ..).

For userspace reads, unified flushing leads to non-deterministic stats
staleness and reading cost. This series clarifies and documents the
differences between unified and non-unified flushing (patches 1 & 2),
then opts userspace reads out of unified flushing (patch 3).

This patch series is a follow up on the discussion in [1]. That was a
patch that proposed that userspace reads wait for ongoing unified
flushers to complete before returning. There were concerns about the
latency that this introduces to userspace reads, especially with ongoing
reports of expensive stat reads even with unified flushing. Hence, this
series follows a different approach, by opting userspace reads out of
unified flushing completely. The cost of userspace reads are now
determinstic, and depend on the size of the subtree being read. This
should fix both the *sometimes* expensive reads (due to flushing the
entire tree) and occasional staless (due to skipping flushing).

I attempted to remove unified flushing completely, but noticed that
in-kernel flushers with high concurrency (e.g. hundreds of concurrent
reclaimers). This sort of concurrency is not expected from userspace
reads. More details about testing and some numbers in the last patch's
changelog.

[1]https://lore.kernel.org/lkml/20230809045810.1659356-1-yosryahmed@google.com/
[2]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

Yosry Ahmed (3):
  mm: memcg: properly name and document unified stats flushing
  mm: memcg: add a helper for non-unified stats flushing
  mm: memcg: use non-unified stats flushing for userspace reads

 include/linux/memcontrol.h |  8 ++---
 mm/memcontrol.c            | 74 +++++++++++++++++++++++++++-----------
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  4 +--
 4 files changed, 60 insertions(+), 28 deletions(-)

-- 
2.42.0.rc1.204.g551eb34607-goog

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/3] mm: memcg: properly name and document unified stats flushing
  2023-08-21 20:54 [PATCH 0/3] memcg: non-unified flushing for userspace stats Yosry Ahmed
@ 2023-08-21 20:54 ` Yosry Ahmed
  2023-08-21 20:54 ` [PATCH 2/3] mm: memcg: add a helper for non-unified " Yosry Ahmed
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-21 20:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Yosry Ahmed

Most contexts that flush memcg stats use "unified" flushing, where
basically all flushers attempt to flush the entire hierarchy, but only
one flusher is allowed at a time, others skip flushing.

This is needed because we need to flush the stats from paths such as
reclaim or refaults, which may have high concurrency, especially on
large systems. Serializing such performance-sensitive paths can
introduce regressions, hence, unified flushing offers a tradeoff between
stats staleness and the performance impact of flushing stats.

Document this properly and explicitly by renaming the common flushing
helper from do_flush_stats() to do_unified_stats_flush(), and adding
documentation to describe unified flushing. Additionally, rename
flushing APIs to add "try" in the name, which implies that flushing will
not always happen. Also add proper documentation.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/memcontrol.h |  8 +++---
 mm/memcontrol.c            | 53 ++++++++++++++++++++++++++------------
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  4 +--
 4 files changed, 43 insertions(+), 24 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 11810a2cfd2d..d517b0cc5221 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1034,8 +1034,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_ratelimited(void);
+void mem_cgroup_try_flush_stats(void);
+void mem_cgroup_try_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1519,11 +1519,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void mem_cgroup_flush_stats(void)
+static inline void mem_cgroup_try_flush_stats(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_ratelimited(void)
+static inline void mem_cgroup_try_flush_stats_ratelimited(void)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf57fe9318d5..c6150ea54d48 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 		/*
 		 * If stats_flush_threshold exceeds the threshold
 		 * (>num_online_cpus()), cgroup stats update will be triggered
-		 * in __mem_cgroup_flush_stats(). Increasing this var further
+		 * in mem_cgroup_try_flush_stats(). Increasing this var further
 		 * is redundant and simply adds overhead in atomic update.
 		 */
 		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
@@ -639,13 +639,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(void)
+/*
+ * do_unified_stats_flush - do a unified flush of memory cgroup statistics
+ *
+ * A unified flush tries to flush the entire hierarchy, but skips if there is
+ * another ongoing flush. This is meant for flushers that may have a lot of
+ * concurrency (e.g. reclaim, refault, etc), and should not be serialized to
+ * avoid slowing down performance-sensitive paths. A unified flush may skip, and
+ * hence may yield stale stats.
+ */
+static void do_unified_stats_flush(void)
 {
-	/*
-	 * We always flush the entire tree, so concurrent flushers can just
-	 * skip. This avoids a thundering herd problem on the rstat global lock
-	 * from memcg flushers (e.g. reclaim, refault, etc).
-	 */
 	if (atomic_read(&stats_flush_ongoing) ||
 	    atomic_xchg(&stats_flush_ongoing, 1))
 		return;
@@ -658,16 +662,31 @@ static void do_flush_stats(void)
 	atomic_set(&stats_flush_ongoing, 0);
 }
 
-void mem_cgroup_flush_stats(void)
+/*
+ * mem_cgroup_try_flush_stats - try to flush memory cgroup statistics
+ *
+ * Try to flush the stats of all memcgs that have stat updates since the last
+ * flush. We do not flush the stats if:
+ * - The magnitude of the pending updates is below a certain threshold.
+ * - There is another ongoing unified flush (see do_unified_stats_flush()).
+ *
+ * Hence, the stats may be stale, but ideally by less than FLUSH_TIME due to
+ * periodic flushing.
+ */
+void mem_cgroup_try_flush_stats(void)
 {
 	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
-		do_flush_stats();
+		do_unified_stats_flush();
 }
 
-void mem_cgroup_flush_stats_ratelimited(void)
+/*
+ * Like mem_cgroup_try_flush_stats(), but only flushes if the periodic flusher
+ * is late.
+ */
+void mem_cgroup_try_flush_stats_ratelimited(void)
 {
 	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
-		mem_cgroup_flush_stats();
+		mem_cgroup_try_flush_stats();
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
@@ -676,7 +695,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Always flush here so that flushing in latency-sensitive paths is
 	 * as cheap as possible.
 	 */
-	do_flush_stats();
+	do_unified_stats_flush();
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -1576,7 +1595,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4018,7 +4037,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4093,7 +4112,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4595,7 +4614,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6610,7 +6629,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c7c149cb8d66..457a18921fda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2923,7 +2923,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..affb8699e58d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -520,7 +520,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 	}
 
 	/* Flush stats (and potentially sleep) before holding RCU read lock */
-	mem_cgroup_flush_stats_ratelimited();
+	mem_cgroup_try_flush_stats_ratelimited();
 
 	rcu_read_lock();
 
@@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		struct lruvec *lruvec;
 		int i;
 
-		mem_cgroup_flush_stats();
+		mem_cgroup_try_flush_stats();
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
-- 
2.42.0.rc1.204.g551eb34607-goog



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/3] mm: memcg: add a helper for non-unified stats flushing
  2023-08-21 20:54 [PATCH 0/3] memcg: non-unified flushing for userspace stats Yosry Ahmed
  2023-08-21 20:54 ` [PATCH 1/3] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
@ 2023-08-21 20:54 ` Yosry Ahmed
  2023-08-22 13:01   ` Michal Koutný
  2023-08-21 20:54 ` [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
  2023-08-22 13:00 ` [PATCH 0/3] memcg: non-unified flushing for userspace stats Michal Koutný
  3 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-21 20:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Yosry Ahmed

Some contexts flush memcg stats outside of unified flushing, directly
using cgroup_rstat_flush(). Add a helper for non-unified flushing, a
counterpart for do_unified_stats_flush(), and use it in those contexts,
as well as in do_unified_stats_flush() itself.

This abstracts the rstat API and makes it easy to introduce
modifications to either unified or non-unified flushing functions
without changing callers.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c6150ea54d48..90f08b35fa77 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -639,6 +639,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
+/*
+ * do_stats_flush - do a flush of the memory cgroup statistics
+ * @memcg: memory cgroup to flush
+ *
+ * Only flushes the subtree of @memcg, does not skip under any conditions.
+ */
+static void do_stats_flush(struct mem_cgroup *memcg)
+{
+	cgroup_rstat_flush(memcg->css.cgroup);
+}
+
 /*
  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
  *
@@ -656,7 +667,7 @@ static void do_unified_stats_flush(void)
 
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 
-	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+	do_stats_flush(root_mem_cgroup);
 
 	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_flush_ongoing, 0);
@@ -7790,7 +7801,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 			break;
 		}
 
-		cgroup_rstat_flush(memcg->css.cgroup);
+		do_stats_flush(memcg);
 		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
 		if (pages < max)
 			continue;
@@ -7855,8 +7866,10 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 static u64 zswap_current_read(struct cgroup_subsys_state *css,
 			      struct cftype *cft)
 {
-	cgroup_rstat_flush(css->cgroup);
-	return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	do_stats_flush(memcg);
+	return memcg_page_state(memcg, MEMCG_ZSWAP_B);
 }
 
 static int zswap_max_show(struct seq_file *m, void *v)
-- 
2.42.0.rc1.204.g551eb34607-goog



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-21 20:54 [PATCH 0/3] memcg: non-unified flushing for userspace stats Yosry Ahmed
  2023-08-21 20:54 ` [PATCH 1/3] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
  2023-08-21 20:54 ` [PATCH 2/3] mm: memcg: add a helper for non-unified " Yosry Ahmed
@ 2023-08-21 20:54 ` Yosry Ahmed
  2023-08-22  9:06   ` Michal Hocko
  2023-08-22 13:00 ` [PATCH 0/3] memcg: non-unified flushing for userspace stats Michal Koutný
  3 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-21 20:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Yosry Ahmed

Unified flushing allows for great concurrency for paths that attempt to
flush the stats, at the expense of potential staleness and a single
flusher paying the extra cost of flushing the full tree.

This tradeoff makes sense for in-kernel flushers that may observe high
concurrency (e.g. reclaim, refault). For userspace readers, stale stats
may be unexpected and problematic, especially when such stats are used
for critical paths such as userspace OOM handling. Additionally, a
userspace reader will occasionally pay the cost of flushing the entire
hierarchy, which also causes problems in some cases [1].

Opt userspace reads out of unified flushing. This makes the cost of
reading the stats more predictable (proportional to the size of the
subtree), as well as the freshness of the stats. Since userspace readers
are not expected to have similar concurrency to in-kernel flushers,
serializing them among themselves and among in-kernel flushers should be
okay.

This was tested on a machine with 256 cpus by running a synthetic test
The script that creates 50 top-level cgroups, each with 5 children (250
leaf cgroups). Each leaf cgroup has 10 processes running that allocate
memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
unified flusher). Concurrently, one thread is spawned per-cgroup to read
the stats every second (including root, top-level, and leaf cgroups --
so total 251 threads). No regressions were observed in the total running
time; which means that non-unified userspace readers are not slowing
down in-kernel unified flushers:

Base (mm-unstable):

real	0m18.228s
user	0m9.463s
sys	60m15.879s

real	0m20.828s
user	0m8.535s
sys	70m12.364s

real	0m19.789s
user	0m9.177s
sys	66m10.798s

With this patch:

real	0m19.632s
user	0m8.608s
sys	64m23.483s

real	0m18.463s
user	0m7.465s
sys	60m34.089s

real	0m20.309s
user	0m7.754s
sys	68m2.392s

Additionally, the average latency for reading stats went from roughly
40ms to 5 ms, because we mostly read the stats of leaf cgroups in this
script, so we only have to flush one cgroup, instead of *sometimes*
flushing the entire tree with unified flushing.

[1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90f08b35fa77..d3b13a06224c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1606,7 +1606,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);

 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4048,7 +4048,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);

-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);

 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4123,7 +4123,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)

 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));

-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);

 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4625,7 +4625,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;

-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);

 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6640,7 +6640,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);

-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);

 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
-- 
2.42.0.rc1.204.g551eb34607-goog

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-21 20:54 ` [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
@ 2023-08-22  9:06   ` Michal Hocko
  2023-08-22 15:30     ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2023-08-22  9:06 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> Unified flushing allows for great concurrency for paths that attempt to
> flush the stats, at the expense of potential staleness and a single
> flusher paying the extra cost of flushing the full tree.
> 
> This tradeoff makes sense for in-kernel flushers that may observe high
> concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> may be unexpected and problematic, especially when such stats are used
> for critical paths such as userspace OOM handling. Additionally, a
> userspace reader will occasionally pay the cost of flushing the entire
> hierarchy, which also causes problems in some cases [1].
> 
> Opt userspace reads out of unified flushing. This makes the cost of
> reading the stats more predictable (proportional to the size of the
> subtree), as well as the freshness of the stats. Since userspace readers
> are not expected to have similar concurrency to in-kernel flushers,
> serializing them among themselves and among in-kernel flushers should be
> okay.
> 
> This was tested on a machine with 256 cpus by running a synthetic test
> The script that creates 50 top-level cgroups, each with 5 children (250
> leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> unified flusher). Concurrently, one thread is spawned per-cgroup to read
> the stats every second (including root, top-level, and leaf cgroups --
> so total 251 threads). No regressions were observed in the total running
> time; which means that non-unified userspace readers are not slowing
> down in-kernel unified flushers:

I have to admit I am rather confused by cgroup_rstat_flush (and
cgroup_rstat_flush_locked). The former says it can block but the later
doesn't ever block and even if it drops the cgroup_rstat_lock it merely
cond_rescheds or busy loops. How much of a contention and yielding can
you see with this patch? What is the worst case? How bad a random user
can make the situation by going crazy and trying to flush from many
different contexts?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3] memcg: non-unified flushing for userspace stats
  2023-08-21 20:54 [PATCH 0/3] memcg: non-unified flushing for userspace stats Yosry Ahmed
                   ` (2 preceding siblings ...)
  2023-08-21 20:54 ` [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
@ 2023-08-22 13:00 ` Michal Koutný
  2023-08-22 15:43   ` Yosry Ahmed
  3 siblings, 1 reply; 43+ messages in thread
From: Michal Koutný @ 2023-08-22 13:00 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1075 bytes --]

Hello.

On Mon, Aug 21, 2023 at 08:54:55PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> For userspace reads, unified flushing leads to non-deterministic stats
> staleness and reading cost.

I only skimed previous threads but I don't remember if it was resolved:
a) periodic flushing was too much spaced for user space readers (i.e. 2s
   delay is too much [1]),
b) periodic flushing didn't catch up (i.e. full tree flush can
   occassionaly take more than 2s) leading to extra staleness?

[1] Assuming that nr_cpus*MEMCG_CHARGE_BATCH error bound is also too
    much for userspace readers, correct?

> The cost of userspace reads are now determinstic, and depend on the
> size of the subtree being read. This should fix both the *sometimes*
> expensive reads (due to flushing the entire tree) and occasional
> staless (due to skipping flushing).

This is nice, thanks to the atomic removal in the commit 0a2dc6ac3329
("cgroup: remove cgroup_rstat_flush_atomic()"). I think the smaller
chunks with yielding could be universally better (last words :-).

Thanks,
Michal


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/3] mm: memcg: add a helper for non-unified stats flushing
  2023-08-21 20:54 ` [PATCH 2/3] mm: memcg: add a helper for non-unified " Yosry Ahmed
@ 2023-08-22 13:01   ` Michal Koutný
  2023-08-22 16:00     ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Koutný @ 2023-08-22 13:01 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 780 bytes --]

On Mon, Aug 21, 2023 at 08:54:57PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> +static void do_stats_flush(struct mem_cgroup *memcg)
> +{
> +	cgroup_rstat_flush(memcg->css.cgroup);
	if(memcg == root_mem_cgroup)
		atomic_set(&stats_flush_threshold, 0);
> +}
> +
>  /*
>   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
>   *
> @@ -656,7 +667,7 @@ static void do_unified_stats_flush(void)
>  
>  	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
>  
> -	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> +	do_stats_flush(root_mem_cgroup);
>  
  -	atomic_set(&stats_flush_threshold, 0);
>  	atomic_set(&stats_flush_ongoing, 0);

You may reset stats_flush_threshold to save the unified flushers some
work.

Michal


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-22  9:06   ` Michal Hocko
@ 2023-08-22 15:30     ` Yosry Ahmed
  2023-08-23  7:33       ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-22 15:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3773 bytes --]

On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > Unified flushing allows for great concurrency for paths that attempt to
> > flush the stats, at the expense of potential staleness and a single
> > flusher paying the extra cost of flushing the full tree.
> >
> > This tradeoff makes sense for in-kernel flushers that may observe high
> > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > may be unexpected and problematic, especially when such stats are used
> > for critical paths such as userspace OOM handling. Additionally, a
> > userspace reader will occasionally pay the cost of flushing the entire
> > hierarchy, which also causes problems in some cases [1].
> >
> > Opt userspace reads out of unified flushing. This makes the cost of
> > reading the stats more predictable (proportional to the size of the
> > subtree), as well as the freshness of the stats. Since userspace readers
> > are not expected to have similar concurrency to in-kernel flushers,
> > serializing them among themselves and among in-kernel flushers should be
> > okay.
> >
> > This was tested on a machine with 256 cpus by running a synthetic test
> > The script that creates 50 top-level cgroups, each with 5 children (250
> > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > the stats every second (including root, top-level, and leaf cgroups --
> > so total 251 threads). No regressions were observed in the total running
> > time; which means that non-unified userspace readers are not slowing
> > down in-kernel unified flushers:
>
> I have to admit I am rather confused by cgroup_rstat_flush (and
> cgroup_rstat_flush_locked). The former says it can block but the later
> doesn't ever block and even if it drops the cgroup_rstat_lock it merely
> cond_rescheds or busy loops. How much of a contention and yielding can
> you see with this patch? What is the worst case? How bad a random user
> can make the situation by going crazy and trying to flush from many
> different contexts?

Userspace readers (or more generically non-unified flushers) can all
collectively only block a single unified flusher at most.
Specifically, one userspace reader goes to flush and holds
cgroup_rstat_lock, meanwhile an in-kernel flusher (e.g. reclaim) goes
and tries to flush, and spins on cgroup_rstat_lock. Other in-kernel
(unified) flushers will just see another unified flusher in progress
and skip. So userspace can only block a single in-kernel reclaimer.
Not that it's not really that bad because:
(a) As you note, cgroup_rstat_flush() does not really "block", it's
cpu-bound. Even when it cond_resched()'s, it yields the lock first. So
it can't really hold anyone hostage for long.
(b) I assume a random user can only read their own stats, which should
be a relatively small subtree, quick to flush. I am assuming a random
user cannot read root's memory.stat (which is most expensive).
(c) Excessive flushing doesn't really build up because there will be
nothing to flush and the lock will be released very shortly after it's
held.

So to answer your question, I don't think a random user can really
affect the system in a significant way by constantly flushing. In
fact, in the test script (which I am now attaching, in case you're
interested), there are hundreds of threads that are reading stats of
different cgroups every 1s, and I don't see any negative effects on
in-kernel flushers in this case (reclaimers).

> --
> Michal Hocko
> SUSE Labs

[-- Attachment #2: stress.sh --]
[-- Type: text/x-sh, Size: 2967 bytes --]

#!/bin/bash

NR_L1_CGROUPS=50
NR_L2_CGROUPS_PER_L1=5
NR_L2_CGROUPS=$(( NR_L1_CGROUPS * NR_L2_CGROUPS_PER_L1 ))
NR_WORKERS_PER_CGROUP=10
WORKER_MB=10
NR_TOTAL_WORKERS=$(( NR_WORKERS_PER_CGROUP * (1 + NR_L1_CGROUPS + NR_L2_CGROUPS) ))
L1_CGROUP_LIMIT_MB=$(( NR_L2_CGROUPS_PER_L1 * NR_WORKERS_PER_CGROUP * WORKER_MB / 4 ))
TOTAL_MB=$(( NR_TOTAL_WORKERS * WORKER_MB ))
TMPFS=$(mktemp -d)
ROOT="/sys/fs/cgroup/"
ADMIN="/sys/fs/cgroup/admin"
ZRAM_DEV="/mnt/devtmpfs/zram0"

cleanup_cgroup() {
  local -r cg=$1
  local -r procs=$(cat $cg/cgroup.procs)
  if [[ -n $procs ]]; then
    kill -KILL $procs
    wait $procs 2>/dev/null
  fi
  while [[ -n $(cat $cg/cgroup.procs) ]]; do
    sleep 0.1
  done
  rmdir $cg
}

cleanup() {
  for i in $(seq $NR_L1_CGROUPS); do
    l1="$ROOT/parent$i"
    for j in $(seq $NR_L2_CGROUPS_PER_L1); do
      l2="$l1/child$j"
      cleanup_cgroup $l2
    done
    cleanup_cgroup $l1
  done

  cleanup_cgroup $ADMIN

  umount $TMPFS
  rm -rf $TMPFS

  swapoff $ZRAM_DEV
  echo 1 > "/sys/block/zram0/reset"
}
trap cleanup INT QUIT EXIT

run_stats_reader() {
  local -r cg_run=$1
  local -r cg_read=$2

  # read the stats every second until workers are done
  echo 0 > "$cg_run/cgroup.procs"
  while [[ $(ls $TMPFS) ]]; do
    cat "$cg_read/memory.stat" > /dev/null
    sleep 1
  done
}

run_worker() {
  local -r cg=$1
  local -r f=$2

  echo 0 > "$cg/cgroup.procs"
  rm -rf "$f"
  dd if=/dev/zero of="$f" bs=1M count=$WORKER_MB status=none
  cat "$f" > /dev/null
  rm "$f"
}

# Setup zram
echo $((TOTAL_MB << 20)) > "/sys/block/zram0/disksize"
mkswap $ZRAM_DEV
swapon $ZRAM_DEV
echo "Setup zram done"

# Mount tmpfs
mount -t tmpfs none $TMPFS

# Create admin cgroup
mkdir $ADMIN

# Create worker cgroups, set limits
echo "+memory" > "$ROOT/cgroup.subtree_control"
for i in $(seq $NR_L1_CGROUPS); do
  l1="$ROOT/parent$i"
  mkdir $l1
  echo "+memory" > "$l1/cgroup.subtree_control"
  for j in $(seq $NR_L2_CGROUPS_PER_L1); do
    l2="$l1/child$j"
    mkdir $l2
  done
  echo $(( L1_CGROUP_LIMIT_MB << 20)) > "$l1/memory.max"
done
echo "Setup $NR_L1_CGROUPS L1 cgroups with limit ${L1_CGROUP_LIMIT_MB}M done"
echo "Each L1 cgroup has $NR_L2_CGROUPS_PER_L1 L2 children"

# Start workers to allocate tmpfs memory
for i in $(seq $NR_L1_CGROUPS); do
  l1="$ROOT/parent$i"
  for j in $(seq $NR_L2_CGROUPS_PER_L1); do
    l2="$l1/child$j"
    for k in $(seq $NR_WORKERS_PER_CGROUP); do
      (run_worker "$l2" "$TMPFS/$i-$j-$k")&
    done
  done
done

# Start stat readers
(run_stats_reader "$ADMIN" "$ROOT")&
for i in $(seq $NR_L1_CGROUPS); do
  l1="$ROOT/parent$i"
  (run_stats_reader "$ADMIN" "$l1")&
  for j in $(seq $NR_L2_CGROUPS_PER_L1); do
    l2="$l1/child$j"
    # Ran stat readers in the admin cgroup as well as the cgroup itself
    (run_stats_reader "$ADMIN" "$l2")&
    (run_stats_reader "$l2" "$l2")&
  done
done

# Wait for workers
echo "Ran $NR_WORKERS_PER_CGROUP workers per L2 cgroup, each allocating ${WORKER_MB}M, waiting.."
wait


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3] memcg: non-unified flushing for userspace stats
  2023-08-22 13:00 ` [PATCH 0/3] memcg: non-unified flushing for userspace stats Michal Koutný
@ 2023-08-22 15:43   ` Yosry Ahmed
  0 siblings, 0 replies; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-22 15:43 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

On Tue, Aug 22, 2023 at 6:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Mon, Aug 21, 2023 at 08:54:55PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > For userspace reads, unified flushing leads to non-deterministic stats
> > staleness and reading cost.
>
> I only skimed previous threads but I don't remember if it was resolved:
> a) periodic flushing was too much spaced for user space readers (i.e. 2s
>    delay is too much [1]),
> b) periodic flushing didn't catch up (i.e. full tree flush can
>    occassionaly take more than 2s) leading to extra staleness?

The idea is that having stats anywhere between 0-2 seconds stale is
inconsistent, especially when compared to other values (such as
memory.usage_in_bytes), which can give an inconsistent and stale view
of the system. For some readers, 2s is too spaced (we have readers
that read every second). A time-only bound is scary because on large
systems a lot can happen in a second. There will always be a delay
anyway, but ideally we want to minimize it.

I think 2s is also not a strict bound (e.g. flushing is taking a lot
of time, a flush started but the cgroup you care about hasn't been
flushed yet, etc).

There is also the problem of variable cost of reading.

>
> [1] Assuming that nr_cpus*MEMCG_CHARGE_BATCH error bound is also too
>     much for userspace readers, correct?

I can't tell for sure to be honest, but given the continuously
increased number of cpus on modern systems, and the fact that
nr_cpus*MEMCG_CHARGE_BATCH can be localized in one cgroup or spread
across the hierarchy, I think it's better if we drop it for userspace
reads if possible.

>
> > The cost of userspace reads are now determinstic, and depend on the
> > size of the subtree being read. This should fix both the *sometimes*
> > expensive reads (due to flushing the entire tree) and occasional
> > staless (due to skipping flushing).
>
> This is nice, thanks to the atomic removal in the commit 0a2dc6ac3329
> ("cgroup: remove cgroup_rstat_flush_atomic()"). I think the smaller
> chunks with yielding could be universally better (last words :-).

I was hoping we can remove unified flushing completely, but stress
testing with hundreds of concurrent reclaimers shows a lot of
regression. Maybe it doesn't matter in practice, but I don't want to
pull that trigger :)

FWIW, with unified flushing we are getting great concurrency for
in-kernel flushers like reclaim or refault, but at the expense of
stats staleness. I really wonder what hidden cost we are paying due to
the stale stats. I would imagine any problems that manifest from this
would be difficult to tie back to the stale stats.

>
> Thanks,
> Michal
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/3] mm: memcg: add a helper for non-unified stats flushing
  2023-08-22 13:01   ` Michal Koutný
@ 2023-08-22 16:00     ` Yosry Ahmed
  2023-08-22 16:35       ` Michal Koutný
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-22 16:00 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

On Tue, Aug 22, 2023 at 6:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Mon, Aug 21, 2023 at 08:54:57PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > +static void do_stats_flush(struct mem_cgroup *memcg)
> > +{
> > +     cgroup_rstat_flush(memcg->css.cgroup);
>         if(memcg == root_mem_cgroup)
>                 atomic_set(&stats_flush_threshold, 0);
> > +}
> > +
> >  /*
> >   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> >   *
> > @@ -656,7 +667,7 @@ static void do_unified_stats_flush(void)
> >
> >       WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> >
> > -     cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> > +     do_stats_flush(root_mem_cgroup);
> >
>   -     atomic_set(&stats_flush_threshold, 0);
> >       atomic_set(&stats_flush_ongoing, 0);
>
> You may reset stats_flush_threshold to save the unified flushers some
> work.

We can probably also kick FLUSH_TIME forward as well. Perhaps I can
move both into do_stats_flush() then. If I understand correctly this
is what you mean?

static void do_stats_flush(struct mem_cgroup *memcg)
{
       if (mem_cgroup_is_root(memcg))
               WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);

       cgroup_rstat_flush(memcg->css.cgroup);

       if (mem_cgroup_is_root(memcg))
               atomic_set(&stats_flush_threshold, 0);
}

  static void do_unified_stats_flush(void)
  {
          if (atomic_read(&stats_flush_ongoing) ||
              atomic_xchg(&stats_flush_ongoing, 1))
                  return;

          do_stats_flush(root_mem_cgroup);
          atomic_set(&stats_flush_ongoing, 0);
  }

, or simplify it further by just resetting stats_flush_threshold
before we flush as well:

static void do_stats_flush(struct mem_cgroup *memcg)
{
       /* for unified flushing, root non-unified flushing can help as well */
       if (mem_cgroup_is_root(memcg)) {
               WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
               atomic_set(&stats_flush_threshold, 0);
       }

       cgroup_rstat_flush(memcg->css.cgroup);
}

  static void do_unified_stats_flush(void)
  {
          if (atomic_read(&stats_flush_ongoing) ||
              atomic_xchg(&stats_flush_ongoing, 1))
                  return;

          do_stats_flush(root_mem_cgroup);
          atomic_set(&stats_flush_ongoing, 0);
  }


What do you think?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/3] mm: memcg: add a helper for non-unified stats flushing
  2023-08-22 16:00     ` Yosry Ahmed
@ 2023-08-22 16:35       ` Michal Koutný
  2023-08-22 16:48         ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Koutný @ 2023-08-22 16:35 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 785 bytes --]

On Tue, Aug 22, 2023 at 09:00:06AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> We can probably also kick FLUSH_TIME forward as well.

True, they guard same work.

> Perhaps I can move both into do_stats_flush() then. If I understand
> correctly this is what you mean?

Yes.

> What do you think?

The latter is certainly better looking code.

I wasn't sure at first about moving stats_flush_threshold reset before
actual flush but on second thought it should not be a significant change
- readers: may skip flushing, the values that they read should still be
  below the error threshold,
- writers: may be slowed down a bit (because of conditional atomic write
  optimization in memcg_rstat_updates), presumably not on average
  though.

So the latter should work too.

Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/3] mm: memcg: add a helper for non-unified stats flushing
  2023-08-22 16:35       ` Michal Koutný
@ 2023-08-22 16:48         ` Yosry Ahmed
  0 siblings, 0 replies; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-22 16:48 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

On Tue, Aug 22, 2023 at 9:35 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Tue, Aug 22, 2023 at 09:00:06AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > We can probably also kick FLUSH_TIME forward as well.
>
> True, they guard same work.
>
> > Perhaps I can move both into do_stats_flush() then. If I understand
> > correctly this is what you mean?
>
> Yes.
>
> > What do you think?
>
> The latter is certainly better looking code.
>
> I wasn't sure at first about moving stats_flush_threshold reset before
> actual flush but on second thought it should not be a significant change
> - readers: may skip flushing, the values that they read should still be
>   below the error threshold,

Unified readers will skip anyway as there's an ongoing flush,
non-unified readers don't check the threshold anyway (with this patch
series). So for readers it should not be a change.

> - writers: may be slowed down a bit (because of conditional atomic write
>   optimization in memcg_rstat_updates), presumably not on average
>   though.

Yeah writers will start doing atomic writes once a flush starts
instead of once it ends. I don't think it will matter in practice
though. The optimization is only effective if we manage to surpass the
threshold before the periodic flusher (or any unified flusher) kicks
in and resets it. If we start doing atomic writes earlier, then we
will also stop earlier; the number of atomic writes should stay the
same.

I think the only difference will be the scenario where we start atomic
writes early but the periodic flush happens before we reach the
threshold, in which case we aren't doing a lot of updates anyway.

I hope this makes _any_ sense :)

>
> So the latter should work too.
>

I will include that in v2. I will wait for a bit of further review
comments on this version first.

Thanks for taking a look!

> Michal

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-22 15:30     ` Yosry Ahmed
@ 2023-08-23  7:33       ` Michal Hocko
  2023-08-23 14:55         ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2023-08-23  7:33 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
[...]
> So to answer your question, I don't think a random user can really
> affect the system in a significant way by constantly flushing. In
> fact, in the test script (which I am now attaching, in case you're
> interested), there are hundreds of threads that are reading stats of
> different cgroups every 1s, and I don't see any negative effects on
> in-kernel flushers in this case (reclaimers).

I suspect you have missed my point. Maybe I am just misunderstanding
the code but it seems to me that the lock dropping inside
cgroup_rstat_flush_locked effectivelly allows unbounded number of
contenders which is really dangerous when it is triggerable from the
userspace. The number of spinners at a moment is always bound by the
number CPUs but depending on timing many potential spinners might be
cond_rescheded and the worst time latency to complete can be really
high. Makes more sense?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-23  7:33       ` Michal Hocko
@ 2023-08-23 14:55         ` Yosry Ahmed
  2023-08-24  7:13           ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-23 14:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> [...]
> > So to answer your question, I don't think a random user can really
> > affect the system in a significant way by constantly flushing. In
> > fact, in the test script (which I am now attaching, in case you're
> > interested), there are hundreds of threads that are reading stats of
> > different cgroups every 1s, and I don't see any negative effects on
> > in-kernel flushers in this case (reclaimers).
>
> I suspect you have missed my point.

I suspect you are right :)

> Maybe I am just misunderstanding
> the code but it seems to me that the lock dropping inside
> cgroup_rstat_flush_locked effectivelly allows unbounded number of
> contenders which is really dangerous when it is triggerable from the
> userspace. The number of spinners at a moment is always bound by the
> number CPUs but depending on timing many potential spinners might be
> cond_rescheded and the worst time latency to complete can be really
> high. Makes more sense?

I think I understand better now. So basically because we might drop
the lock and resched, there can be nr_cpus spinners + other spinners
that are currently scheduled away, so these will need to wait to be
scheduled and then start spinning on the lock. This may happen for one
reader multiple times during its read, which is what can cause a high
worst case latency.

I hope I understood you correctly this time. Did I?

So the logic to give up the lock and sleep was introduced by commit
0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") in
4.18. It has been possible for userspace to trigger this scenario by
reading cpu.stat, which has been using rstat since then. On the memcg
side, it was also possible to trigger this behavior between commit
2d146aa3aa84 ("mm: memcontrol: switch to rstat") and commit
fd25a9e0e23b ("memcg: unify memcg stat flushing") (i.e between 5.13
and 5.16).

I am not sure there has been any problems from this, but perhaps Tejun
can answer this better than I can.

The way I think about it is that random userspace reads will mostly be
reading their subtrees, which is generally not very large (and can be
limited), so every individual read should be cheap enough. Also,
consequent flushes on overlapping subtrees will have very little to do
as there won't be many pending updates, they should also be very
cheap. So unless multiple jobs on the same machine are collectively
trying to act maliciously (purposefully or not) and concurrently spawn
multiple readers of different parts of the hierarchy (and maintain
enough activity to generate stat updates to flush), I don't think it's
a concern.

I also imagine (but haven't checked) that there is some locking at
some level that will throttle a malicious job that spawns multiple
readers to the same memory.stat file.

I hope this answers your question.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-23 14:55         ` Yosry Ahmed
@ 2023-08-24  7:13           ` Michal Hocko
  2023-08-24 18:15             ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2023-08-24  7:13 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > [...]
> > > So to answer your question, I don't think a random user can really
> > > affect the system in a significant way by constantly flushing. In
> > > fact, in the test script (which I am now attaching, in case you're
> > > interested), there are hundreds of threads that are reading stats of
> > > different cgroups every 1s, and I don't see any negative effects on
> > > in-kernel flushers in this case (reclaimers).
> >
> > I suspect you have missed my point.
> 
> I suspect you are right :)
> 
> 
> > Maybe I am just misunderstanding
> > the code but it seems to me that the lock dropping inside
> > cgroup_rstat_flush_locked effectivelly allows unbounded number of
> > contenders which is really dangerous when it is triggerable from the
> > userspace. The number of spinners at a moment is always bound by the
> > number CPUs but depending on timing many potential spinners might be
> > cond_rescheded and the worst time latency to complete can be really
> > high. Makes more sense?
> 
> I think I understand better now. So basically because we might drop
> the lock and resched, there can be nr_cpus spinners + other spinners
> that are currently scheduled away, so these will need to wait to be
> scheduled and then start spinning on the lock. This may happen for one
> reader multiple times during its read, which is what can cause a high
> worst case latency.
> 
> I hope I understood you correctly this time. Did I?

Yes. I would just add that this could also influence the worst case
latency for a different reader - so an adversary user can stall others.
Exposing a shared global lock in uncontrolable way over generally
available user interface is not really a great idea IMHO.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-24  7:13           ` Michal Hocko
@ 2023-08-24 18:15             ` Yosry Ahmed
  2023-08-24 18:50               ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-24 18:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > > [...]
> > > > So to answer your question, I don't think a random user can really
> > > > affect the system in a significant way by constantly flushing. In
> > > > fact, in the test script (which I am now attaching, in case you're
> > > > interested), there are hundreds of threads that are reading stats of
> > > > different cgroups every 1s, and I don't see any negative effects on
> > > > in-kernel flushers in this case (reclaimers).
> > >
> > > I suspect you have missed my point.
> >
> > I suspect you are right :)
> >
> >
> > > Maybe I am just misunderstanding
> > > the code but it seems to me that the lock dropping inside
> > > cgroup_rstat_flush_locked effectivelly allows unbounded number of
> > > contenders which is really dangerous when it is triggerable from the
> > > userspace. The number of spinners at a moment is always bound by the
> > > number CPUs but depending on timing many potential spinners might be
> > > cond_rescheded and the worst time latency to complete can be really
> > > high. Makes more sense?
> >
> > I think I understand better now. So basically because we might drop
> > the lock and resched, there can be nr_cpus spinners + other spinners
> > that are currently scheduled away, so these will need to wait to be
> > scheduled and then start spinning on the lock. This may happen for one
> > reader multiple times during its read, which is what can cause a high
> > worst case latency.
> >
> > I hope I understood you correctly this time. Did I?
>
> Yes. I would just add that this could also influence the worst case
> latency for a different reader - so an adversary user can stall others.

I can add that for v2 to the commit log, thanks.

> Exposing a shared global lock in uncontrolable way over generally
> available user interface is not really a great idea IMHO.

I think that's how it was always meant to be when it was designed. The
global rstat lock has always existed and was always available to
userspace readers. The memory controller took a different path at some
point with unified flushing, but that was mainly because of high
concurrency from in-kernel flushers, not because userspace readers
caused a problem. Outside of memcg, the core cgroup code has always
exercised this global lock when reading cpu.stat since rstat's
introduction. I assume there hasn't been any problems since it's still
there. I was hoping Tejun would confirm/deny this.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-24 18:15             ` Yosry Ahmed
@ 2023-08-24 18:50               ` Yosry Ahmed
  2023-08-25  7:05                 ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-24 18:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > > > [...]
> > > > > So to answer your question, I don't think a random user can really
> > > > > affect the system in a significant way by constantly flushing. In
> > > > > fact, in the test script (which I am now attaching, in case you're
> > > > > interested), there are hundreds of threads that are reading stats of
> > > > > different cgroups every 1s, and I don't see any negative effects on
> > > > > in-kernel flushers in this case (reclaimers).
> > > >
> > > > I suspect you have missed my point.
> > >
> > > I suspect you are right :)
> > >
> > >
> > > > Maybe I am just misunderstanding
> > > > the code but it seems to me that the lock dropping inside
> > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of
> > > > contenders which is really dangerous when it is triggerable from the
> > > > userspace. The number of spinners at a moment is always bound by the
> > > > number CPUs but depending on timing many potential spinners might be
> > > > cond_rescheded and the worst time latency to complete can be really
> > > > high. Makes more sense?
> > >
> > > I think I understand better now. So basically because we might drop
> > > the lock and resched, there can be nr_cpus spinners + other spinners
> > > that are currently scheduled away, so these will need to wait to be
> > > scheduled and then start spinning on the lock. This may happen for one
> > > reader multiple times during its read, which is what can cause a high
> > > worst case latency.
> > >
> > > I hope I understood you correctly this time. Did I?
> >
> > Yes. I would just add that this could also influence the worst case
> > latency for a different reader - so an adversary user can stall others.
>
> I can add that for v2 to the commit log, thanks.
>
> > Exposing a shared global lock in uncontrolable way over generally
> > available user interface is not really a great idea IMHO.
>
> I think that's how it was always meant to be when it was designed. The
> global rstat lock has always existed and was always available to
> userspace readers. The memory controller took a different path at some
> point with unified flushing, but that was mainly because of high
> concurrency from in-kernel flushers, not because userspace readers
> caused a problem. Outside of memcg, the core cgroup code has always
> exercised this global lock when reading cpu.stat since rstat's
> introduction. I assume there hasn't been any problems since it's still
> there. I was hoping Tejun would confirm/deny this.

One thing we can do to remedy this situation is to replace the global
rstat lock with a mutex, and drop the resched/lock dropping condition.
Tejun suggested this in the previous thread. This effectively reverts
0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
since now all the flushing contexts are sleepable.

My synthetic stress test does not show any regressions with mutexes,
and there is a small boost to reading latency (probably because we
stop dropping the lock / rescheduling). Not sure if we may start
seeing need_resched warnings on big flushes though.

One other concern that Shakeel pointed out to me is preemption. If
someone holding the mutex gets preempted this may starve other
waiters. We can disable preemption while we hold the mutex, not sure
if that's a common pattern though.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-24 18:50               ` Yosry Ahmed
@ 2023-08-25  7:05                 ` Michal Hocko
  2023-08-25 15:14                   ` Yosry Ahmed
  2023-08-29 18:44                   ` Tejun Heo
  0 siblings, 2 replies; 43+ messages in thread
From: Michal Hocko @ 2023-08-25  7:05 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Thu 24-08-23 11:50:51, Yosry Ahmed wrote:
> On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> > > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > > > > [...]
> > > > > > So to answer your question, I don't think a random user can really
> > > > > > affect the system in a significant way by constantly flushing. In
> > > > > > fact, in the test script (which I am now attaching, in case you're
> > > > > > interested), there are hundreds of threads that are reading stats of
> > > > > > different cgroups every 1s, and I don't see any negative effects on
> > > > > > in-kernel flushers in this case (reclaimers).
> > > > >
> > > > > I suspect you have missed my point.
> > > >
> > > > I suspect you are right :)
> > > >
> > > >
> > > > > Maybe I am just misunderstanding
> > > > > the code but it seems to me that the lock dropping inside
> > > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of
> > > > > contenders which is really dangerous when it is triggerable from the
> > > > > userspace. The number of spinners at a moment is always bound by the
> > > > > number CPUs but depending on timing many potential spinners might be
> > > > > cond_rescheded and the worst time latency to complete can be really
> > > > > high. Makes more sense?
> > > >
> > > > I think I understand better now. So basically because we might drop
> > > > the lock and resched, there can be nr_cpus spinners + other spinners
> > > > that are currently scheduled away, so these will need to wait to be
> > > > scheduled and then start spinning on the lock. This may happen for one
> > > > reader multiple times during its read, which is what can cause a high
> > > > worst case latency.
> > > >
> > > > I hope I understood you correctly this time. Did I?
> > >
> > > Yes. I would just add that this could also influence the worst case
> > > latency for a different reader - so an adversary user can stall others.
> >
> > I can add that for v2 to the commit log, thanks.
> >
> > > Exposing a shared global lock in uncontrolable way over generally
> > > available user interface is not really a great idea IMHO.
> >
> > I think that's how it was always meant to be when it was designed. The
> > global rstat lock has always existed and was always available to
> > userspace readers. The memory controller took a different path at some
> > point with unified flushing, but that was mainly because of high
> > concurrency from in-kernel flushers, not because userspace readers
> > caused a problem. Outside of memcg, the core cgroup code has always
> > exercised this global lock when reading cpu.stat since rstat's
> > introduction. I assume there hasn't been any problems since it's still
> > there.

I suspect nobody has just considered a malfunctioning or adversary
workloads so far.

> > I was hoping Tejun would confirm/deny this.

Yes, that would be interesting to hear.

> One thing we can do to remedy this situation is to replace the global
> rstat lock with a mutex, and drop the resched/lock dropping condition.
> Tejun suggested this in the previous thread. This effectively reverts
> 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
> since now all the flushing contexts are sleepable.

I would have a very daring question. Do we really need a global lock in
the first place? AFAIU this locks serializes (kinda as the lock can be
dropped midway) flushers and cgroup_rstat_flush_hold/release caller (a
single one ATM). I can see cgroup_base_stat_cputime_show would like to
have a consistent view on multiple stats but can we live without a
strong guarantee or to replace the lock with seqlock instead?

> My synthetic stress test does not show any regressions with mutexes,
> and there is a small boost to reading latency (probably because we
> stop dropping the lock / rescheduling). Not sure if we may start
> seeing need_resched warnings on big flushes though.

Reading 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
it seems the point of moving away from mutex was to have a more usable
API.

> One other concern that Shakeel pointed out to me is preemption. If
> someone holding the mutex gets preempted this may starve other
> waiters. We can disable preemption while we hold the mutex, not sure
> if that's a common pattern though.

No, not really. It is expected that holder of mutex can sleep and can be
preempted as well.

I might be wrong but the whole discussion so far suggests that the
global rstat lock should be reconsidered. From my personal experience
global locks easily triggerable from the userspace are just a receip for
problems. Stats reading shouldn't be interfering with the system runtime
as much as possible and they should be deterministic wrt runtime as
well.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-25  7:05                 ` Michal Hocko
@ 2023-08-25 15:14                   ` Yosry Ahmed
  2023-08-25 18:17                     ` Michal Hocko
  2023-08-28 15:47                     ` Michal Hocko
  2023-08-29 18:44                   ` Tejun Heo
  1 sibling, 2 replies; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-25 15:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 24-08-23 11:50:51, Yosry Ahmed wrote:
> > On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> > > > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > > > > > [...]
> > > > > > > So to answer your question, I don't think a random user can really
> > > > > > > affect the system in a significant way by constantly flushing. In
> > > > > > > fact, in the test script (which I am now attaching, in case you're
> > > > > > > interested), there are hundreds of threads that are reading stats of
> > > > > > > different cgroups every 1s, and I don't see any negative effects on
> > > > > > > in-kernel flushers in this case (reclaimers).
> > > > > >
> > > > > > I suspect you have missed my point.
> > > > >
> > > > > I suspect you are right :)
> > > > >
> > > > >
> > > > > > Maybe I am just misunderstanding
> > > > > > the code but it seems to me that the lock dropping inside
> > > > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of
> > > > > > contenders which is really dangerous when it is triggerable from the
> > > > > > userspace. The number of spinners at a moment is always bound by the
> > > > > > number CPUs but depending on timing many potential spinners might be
> > > > > > cond_rescheded and the worst time latency to complete can be really
> > > > > > high. Makes more sense?
> > > > >
> > > > > I think I understand better now. So basically because we might drop
> > > > > the lock and resched, there can be nr_cpus spinners + other spinners
> > > > > that are currently scheduled away, so these will need to wait to be
> > > > > scheduled and then start spinning on the lock. This may happen for one
> > > > > reader multiple times during its read, which is what can cause a high
> > > > > worst case latency.
> > > > >
> > > > > I hope I understood you correctly this time. Did I?
> > > >
> > > > Yes. I would just add that this could also influence the worst case
> > > > latency for a different reader - so an adversary user can stall others.
> > >
> > > I can add that for v2 to the commit log, thanks.
> > >
> > > > Exposing a shared global lock in uncontrolable way over generally
> > > > available user interface is not really a great idea IMHO.
> > >
> > > I think that's how it was always meant to be when it was designed. The
> > > global rstat lock has always existed and was always available to
> > > userspace readers. The memory controller took a different path at some
> > > point with unified flushing, but that was mainly because of high
> > > concurrency from in-kernel flushers, not because userspace readers
> > > caused a problem. Outside of memcg, the core cgroup code has always
> > > exercised this global lock when reading cpu.stat since rstat's
> > > introduction. I assume there hasn't been any problems since it's still
> > > there.
>
> I suspect nobody has just considered a malfunctioning or adversary
> workloads so far.

Perhaps that also means it's not a problem in practice, or at least I hope so :)

>
> > > I was hoping Tejun would confirm/deny this.
>
> Yes, that would be interesting to hear.
>
> > One thing we can do to remedy this situation is to replace the global
> > rstat lock with a mutex, and drop the resched/lock dropping condition.
> > Tejun suggested this in the previous thread. This effectively reverts
> > 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
> > since now all the flushing contexts are sleepable.
>
> I would have a very daring question. Do we really need a global lock in
> the first place? AFAIU this locks serializes (kinda as the lock can be
> dropped midway) flushers and cgroup_rstat_flush_hold/release caller (a
> single one ATM). I can see cgroup_base_stat_cputime_show would like to
> have a consistent view on multiple stats but can we live without a
> strong guarantee or to replace the lock with seqlock instead?

Unfortunately, it's more difficult than this. I thought about breaking
down that lock and falled back to this solution. See below.

>
> > My synthetic stress test does not show any regressions with mutexes,
> > and there is a small boost to reading latency (probably because we
> > stop dropping the lock / rescheduling). Not sure if we may start
> > seeing need_resched warnings on big flushes though.
>
> Reading 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
> it seems the point of moving away from mutex was to have a more usable
> API.

Right, we needed an atomic interface for flushing, but that later
turned out to cause some problems, so we reworked the code such that
we never have to flush atomically. Now we can go back to the mutex if
it makes things better, I am not really sure how much it helps though.

>
> > One other concern that Shakeel pointed out to me is preemption. If
> > someone holding the mutex gets preempted this may starve other
> > waiters. We can disable preemption while we hold the mutex, not sure
> > if that's a common pattern though.
>
> No, not really. It is expected that holder of mutex can sleep and can be
> preempted as well.

Maybe not for this specific case because it's a global mutex and
holding it for too long might cause problems? Is it bad to disable
preemption while holding a mutex?

>
> I might be wrong but the whole discussion so far suggests that the
> global rstat lock should be reconsidered. From my personal experience
> global locks easily triggerable from the userspace are just a receip for
> problems. Stats reading shouldn't be interfering with the system runtime
> as much as possible and they should be deterministic wrt runtime as
> well.

The problem is that the global lock also serializes the global
counters that we flush to. I will talk from the memcg flushing
perspective as that's what I am familiar with. I am not sure how much
this is transferable to other flushers.

On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
synchronizes access to multiple counters, for this discussion what's
most important are:
- The global stat counters of the memcg being flushed (e.g.
memcg->vmstats->state).
- The pending stat counters of the parent being flushed (e.g.
parent->vmstats->state_pending).

To get rid of this lock, we either need to use atomics for those
counters, or have fine-grained locks. I experimented a while back with
atomic and flushing was significantly more expensive. The number of
atomic operations would scale with O(# cgroups * # cpus) and can grow
unbounded.

The other option is fine-grained locks. In this case we would need to
lock both the memcg being flushed and its parent together. This can go
wrong with ordering, and for top-level memcgs the root memcg lock will
become the new global lock anyway. One way to overcome this is to
change the parent's pending stat counters to also be percpu. This will
increase the memory usage of the stat counters per-memcg, by hundreds
of bytes per cpu.

Let's assume that's okay, so we only need to lock one cgroup at a
time. There are more problems.

We also have a percpu lock (cgroup_rstat_cpu_lock), which we use to
lock the percpu tree which has the cgroups that have updates on this
cpu. It is held by both flushing contexts and updating contexts (hot
paths). Ideally we don't want to spin on a per-cgroup (non percpu)
lock while holding the percpu lock, as flushers of different cpus will
start blocking one another, as well as blocking updaters. On the other
hand, we need to hold percpu lock to pop a cgroup from that tree and
lock it. It's a chicken and egg problem. Also, if we release the
percpu lock while flushing, we open another can of worms:
(a) Concurrent updates can keep updating the tree putting us in an
endless flushing loop. We need some sort of generation tracking for
this.
(b) Concurrent flushing can flush a parent prematurely on the same cpu
as we are flushing a child, and not get the updates from the child.

One possible scheme to handle the above is as follows:
1. Hold the percpu lock, find the cgroup that needs to be flushed next.
2. Trylock that cgroup. If we succeed, we flush it with both the
percpu and the per-cgroup locks held.
3. If we fail, release the percpu lock and spin on the per-cgroup lock.
4. When we get the per-cgroup lock, take the percpu lock again, and
make sure that the locked cgroup is still the correct cgroup to flush.
If not, repeat.
5. Flush the cgroup, and go back to step 1 to get the next cgroup.

Of course this is complex and error-prone, and might introduce
significant overheads due to the number of locks we need to take
compared with what we currently have.

I guess what I am trying to say is, breaking down that lock is a major
surgery that might require re-designing or re-implementing some parts
of rstat. I would be extremely happy to be proven wrong. If we can
break down that lock then there is no need for unified flushing even
for in-kernel contexts, and we can all live happily ever after with
cheap(ish) and accurate stats flushing.

I really hope we can move forward with the problems at hand (sometimes
reads are expensive, sometimes reads are stale), and not block fixing
them until we can come up with an alternative to that global lock
(unless, of course, there is a simpler way of doing that).

Sorry for the very long reply :)

Thanks!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-25 15:14                   ` Yosry Ahmed
@ 2023-08-25 18:17                     ` Michal Hocko
  2023-08-25 18:21                       ` Yosry Ahmed
  2023-08-28 15:47                     ` Michal Hocko
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2023-08-25 18:17 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
> On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > I might be wrong but the whole discussion so far suggests that the
> > global rstat lock should be reconsidered. From my personal experience
> > global locks easily triggerable from the userspace are just a receip for
> > problems. Stats reading shouldn't be interfering with the system runtime
> > as much as possible and they should be deterministic wrt runtime as
> > well.
> 
> The problem is that the global lock also serializes the global
> counters that we flush to. I will talk from the memcg flushing
> perspective as that's what I am familiar with. I am not sure how much
> this is transferable to other flushers.
> 
> On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
> synchronizes access to multiple counters, for this discussion what's
> most important are:
> - The global stat counters of the memcg being flushed (e.g.
> memcg->vmstats->state).
> - The pending stat counters of the parent being flushed (e.g.
> parent->vmstats->state_pending).

I haven't digested the rest of the email yet (Friday brain, sorry) but I
do not think you are adressing this particular part so let me ask before
I dive more into the following. I really do not follow the serialization
requirement here because the lock doesn't really serialize the flushing,
does it? At least not in a sense of a single caller to do the flushing
atomicaly from other flushers. It is possible that the current flusher
simply drops the lock midway and another one retakes the lock and
performs the operation again. So what additional flushing
synchronization does it provide and why cannot parallel flushers simply
compete over pcp spinlocks?

So what am I missing?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-25 18:17                     ` Michal Hocko
@ 2023-08-25 18:21                       ` Yosry Ahmed
  2023-08-25 18:43                         ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-25 18:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
> > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > I might be wrong but the whole discussion so far suggests that the
> > > global rstat lock should be reconsidered. From my personal experience
> > > global locks easily triggerable from the userspace are just a receip for
> > > problems. Stats reading shouldn't be interfering with the system runtime
> > > as much as possible and they should be deterministic wrt runtime as
> > > well.
> >
> > The problem is that the global lock also serializes the global
> > counters that we flush to. I will talk from the memcg flushing
> > perspective as that's what I am familiar with. I am not sure how much
> > this is transferable to other flushers.
> >
> > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
> > synchronizes access to multiple counters, for this discussion what's
> > most important are:
> > - The global stat counters of the memcg being flushed (e.g.
> > memcg->vmstats->state).
> > - The pending stat counters of the parent being flushed (e.g.
> > parent->vmstats->state_pending).
>
> I haven't digested the rest of the email yet (Friday brain, sorry) but I
> do not think you are adressing this particular part so let me ask before
> I dive more into the following. I really do not follow the serialization
> requirement here because the lock doesn't really serialize the flushing,
> does it? At least not in a sense of a single caller to do the flushing
> atomicaly from other flushers. It is possible that the current flusher
> simply drops the lock midway and another one retakes the lock and
> performs the operation again. So what additional flushing
> synchronization does it provide and why cannot parallel flushers simply
> compete over pcp spinlocks?
>
> So what am I missing?

Those counters are non-atomic. The lock makes sure we don't have two
concurrent flushers updating the same counter locklessly and
non-atomically, which would be possible if we flush the same cgroup on
two different cpus in parallel.

> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-25 18:21                       ` Yosry Ahmed
@ 2023-08-25 18:43                         ` Michal Hocko
  2023-08-25 18:44                           ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2023-08-25 18:43 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Fri 25-08-23 11:21:16, Yosry Ahmed wrote:
> On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
> > > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > I might be wrong but the whole discussion so far suggests that the
> > > > global rstat lock should be reconsidered. From my personal experience
> > > > global locks easily triggerable from the userspace are just a receip for
> > > > problems. Stats reading shouldn't be interfering with the system runtime
> > > > as much as possible and they should be deterministic wrt runtime as
> > > > well.
> > >
> > > The problem is that the global lock also serializes the global
> > > counters that we flush to. I will talk from the memcg flushing
> > > perspective as that's what I am familiar with. I am not sure how much
> > > this is transferable to other flushers.
> > >
> > > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
> > > synchronizes access to multiple counters, for this discussion what's
> > > most important are:
> > > - The global stat counters of the memcg being flushed (e.g.
> > > memcg->vmstats->state).
> > > - The pending stat counters of the parent being flushed (e.g.
> > > parent->vmstats->state_pending).
> >
> > I haven't digested the rest of the email yet (Friday brain, sorry) but I
> > do not think you are adressing this particular part so let me ask before
> > I dive more into the following. I really do not follow the serialization
> > requirement here because the lock doesn't really serialize the flushing,
> > does it? At least not in a sense of a single caller to do the flushing
> > atomicaly from other flushers. It is possible that the current flusher
> > simply drops the lock midway and another one retakes the lock and
> > performs the operation again. So what additional flushing
> > synchronization does it provide and why cannot parallel flushers simply
> > compete over pcp spinlocks?
> >
> > So what am I missing?
> 
> Those counters are non-atomic. The lock makes sure we don't have two
> concurrent flushers updating the same counter locklessly and
> non-atomically, which would be possible if we flush the same cgroup on
> two different cpus in parallel.

pcp lock (cpu_lock) guarantees the very same, doesn't it?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-25 18:43                         ` Michal Hocko
@ 2023-08-25 18:44                           ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2023-08-25 18:44 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Fri 25-08-23 20:43:02, Michal Hocko wrote:
> On Fri 25-08-23 11:21:16, Yosry Ahmed wrote:
> > On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
> > > > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > I might be wrong but the whole discussion so far suggests that the
> > > > > global rstat lock should be reconsidered. From my personal experience
> > > > > global locks easily triggerable from the userspace are just a receip for
> > > > > problems. Stats reading shouldn't be interfering with the system runtime
> > > > > as much as possible and they should be deterministic wrt runtime as
> > > > > well.
> > > >
> > > > The problem is that the global lock also serializes the global
> > > > counters that we flush to. I will talk from the memcg flushing
> > > > perspective as that's what I am familiar with. I am not sure how much
> > > > this is transferable to other flushers.
> > > >
> > > > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
> > > > synchronizes access to multiple counters, for this discussion what's
> > > > most important are:
> > > > - The global stat counters of the memcg being flushed (e.g.
> > > > memcg->vmstats->state).
> > > > - The pending stat counters of the parent being flushed (e.g.
> > > > parent->vmstats->state_pending).
> > >
> > > I haven't digested the rest of the email yet (Friday brain, sorry) but I
> > > do not think you are adressing this particular part so let me ask before
> > > I dive more into the following. I really do not follow the serialization
> > > requirement here because the lock doesn't really serialize the flushing,
> > > does it? At least not in a sense of a single caller to do the flushing
> > > atomicaly from other flushers. It is possible that the current flusher
> > > simply drops the lock midway and another one retakes the lock and
> > > performs the operation again. So what additional flushing
> > > synchronization does it provide and why cannot parallel flushers simply
> > > compete over pcp spinlocks?
> > >
> > > So what am I missing?
> > 
> > Those counters are non-atomic. The lock makes sure we don't have two
> > concurrent flushers updating the same counter locklessly and
> > non-atomically, which would be possible if we flush the same cgroup on
> > two different cpus in parallel.
> 
> pcp lock (cpu_lock) guarantees the very same, doesn't it?

Nope, it doesn't. I really need to have a deeper look.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-25 15:14                   ` Yosry Ahmed
  2023-08-25 18:17                     ` Michal Hocko
@ 2023-08-28 15:47                     ` Michal Hocko
  2023-08-28 16:15                       ` Yosry Ahmed
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2023-08-28 15:47 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

Done my homework and studied the rstat code more (sorry should have done
that earlier).

On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
[...]
> I guess what I am trying to say is, breaking down that lock is a major
> surgery that might require re-designing or re-implementing some parts
> of rstat. I would be extremely happy to be proven wrong. If we can
> break down that lock then there is no need for unified flushing even
> for in-kernel contexts, and we can all live happily ever after with
> cheap(ish) and accurate stats flushing.

Yes, this seems like a big change and also over complicating the whole
thing. I am not sure this is worth it.

> I really hope we can move forward with the problems at hand (sometimes
> reads are expensive, sometimes reads are stale), and not block fixing
> them until we can come up with an alternative to that global lock
> (unless, of course, there is a simpler way of doing that).

Well, I really have to say that I do not like the notion that reading
stats is unpredictable. This just makes it really hard to use. If
the precision is to be sarificed then this should be preferable over
potentially high global lock contention. We already have that model in
place of /proc/vmstat (configurable timeout for flusher and a way to
flush explicitly). I appreciate you would like to have a better
precision but as you have explored the locking is really hard to get rid
of here.

So from my POV I would prefer to avoid flushing from the stats reading
path and implement force flushing by writing to stat file. If the 2s
flushing interval is considered to coarse I would be OK to allow setting
it from userspace. This way this would be more in line with /proc/vmstat
which seems to be working quite well.

If this is not accaptable or deemed a wrong approach long term then it
would be good to reonsider the current cgroup_rstat_lock at least.
Either by turning it into mutex or by dropping the yielding code which
can severly affect the worst case latency AFAIU.

> Sorry for the very long reply :)

Thanks for bearing with me and taking time to formulate all this!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-28 15:47                     ` Michal Hocko
@ 2023-08-28 16:15                       ` Yosry Ahmed
  2023-08-28 17:00                         ` Shakeel Butt
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-28 16:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Mon, Aug 28, 2023 at 8:47 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Done my homework and studied the rstat code more (sorry should have done
> that earlier).
>
> On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
> [...]
> > I guess what I am trying to say is, breaking down that lock is a major
> > surgery that might require re-designing or re-implementing some parts
> > of rstat. I would be extremely happy to be proven wrong. If we can
> > break down that lock then there is no need for unified flushing even
> > for in-kernel contexts, and we can all live happily ever after with
> > cheap(ish) and accurate stats flushing.
>
> Yes, this seems like a big change and also over complicating the whole
> thing. I am not sure this is worth it.
>
> > I really hope we can move forward with the problems at hand (sometimes
> > reads are expensive, sometimes reads are stale), and not block fixing
> > them until we can come up with an alternative to that global lock
> > (unless, of course, there is a simpler way of doing that).
>
> Well, I really have to say that I do not like the notion that reading
> stats is unpredictable. This just makes it really hard to use. If
> the precision is to be sarificed then this should be preferable over
> potentially high global lock contention. We already have that model in
> place of /proc/vmstat (configurable timeout for flusher and a way to
> flush explicitly). I appreciate you would like to have a better
> precision but as you have explored the locking is really hard to get rid
> of here.

Reading the stats *is* unpredictable today. In terms of
accuracy/staleness and cost. Avoiding the flush entirely on the read
path will surely make the cost very stable and cheap, but will make
accuracy even less predictable.

>
> So from my POV I would prefer to avoid flushing from the stats reading
> path and implement force flushing by writing to stat file. If the 2s
> flushing interval is considered to coarse I would be OK to allow setting
> it from userspace. This way this would be more in line with /proc/vmstat
> which seems to be working quite well.
>
> If this is not accaptable or deemed a wrong approach long term then it
> would be good to reonsider the current cgroup_rstat_lock at least.
> Either by turning it into mutex or by dropping the yielding code which
> can severly affect the worst case latency AFAIU.

Honestly I think it's better if we do it the other way around. We make
flushing on the stats reading path non-unified and deterministic. That
model also exists and is used for cpu.stat. If we find a problem with
the locking being held from userspace, we can then remove flushing
from the read path and add interface(s) to configure the periodic
flusher and do a force flush.

I would like to avoid introducing additional interfaces and
configuration knobs unless it's necessary. Also, if we remove the
flush entirely the cost will become really cheap. We will have a hard
time reversing that in the future if we want to change the
implementation.

IOW, moving forward with this change seems much more reversible than
adopting the /proc/vmstat model.

If using a mutex will make things better, we can do that now. It
doesn't introduce performance issues in my testing. My only concern is
someone sleeping or getting preempted while holding the mutex, so I
would prefer disabling preemption while we flush if that doesn't cause
problems.

Thanks!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-28 16:15                       ` Yosry Ahmed
@ 2023-08-28 17:00                         ` Shakeel Butt
  2023-08-28 17:07                           ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Shakeel Butt @ 2023-08-28 17:00 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Mon, Aug 28, 2023 at 9:15 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> >
> > Well, I really have to say that I do not like the notion that reading
> > stats is unpredictable. This just makes it really hard to use. If
> > the precision is to be sarificed then this should be preferable over
> > potentially high global lock contention. We already have that model in
> > place of /proc/vmstat (configurable timeout for flusher and a way to
> > flush explicitly). I appreciate you would like to have a better
> > precision but as you have explored the locking is really hard to get rid
> > of here.
>
> Reading the stats *is* unpredictable today. In terms of
> accuracy/staleness and cost. Avoiding the flush entirely on the read
> path will surely make the cost very stable and cheap, but will make
> accuracy even less predictable.
>

On average you would get the stats at most 2 second old, so I would
not say it is less predictable.

> >
> > So from my POV I would prefer to avoid flushing from the stats reading
> > path and implement force flushing by writing to stat file. If the 2s
> > flushing interval is considered to coarse I would be OK to allow setting
> > it from userspace. This way this would be more in line with /proc/vmstat
> > which seems to be working quite well.
> >
> > If this is not accaptable or deemed a wrong approach long term then it
> > would be good to reonsider the current cgroup_rstat_lock at least.
> > Either by turning it into mutex or by dropping the yielding code which
> > can severly affect the worst case latency AFAIU.
>
> Honestly I think it's better if we do it the other way around. We make
> flushing on the stats reading path non-unified and deterministic. That
> model also exists and is used for cpu.stat. If we find a problem with
> the locking being held from userspace, we can then remove flushing
> from the read path and add interface(s) to configure the periodic
> flusher and do a force flush.
>

Here I agree with you. Let's go with the approach which is easy to
undo for now. Though I prefer the new explicit interface for flushing,
that step would be very hard to undo. Let's reevaluate if the proposed
approach shows negative impact on production traffic and I think
Cloudflare folks can give us the results soon.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-28 17:00                         ` Shakeel Butt
@ 2023-08-28 17:07                           ` Yosry Ahmed
  2023-08-28 17:27                             ` Waiman Long
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-28 17:07 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel

On Mon, Aug 28, 2023 at 10:00 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, Aug 28, 2023 at 9:15 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > >
> > > Well, I really have to say that I do not like the notion that reading
> > > stats is unpredictable. This just makes it really hard to use. If
> > > the precision is to be sarificed then this should be preferable over
> > > potentially high global lock contention. We already have that model in
> > > place of /proc/vmstat (configurable timeout for flusher and a way to
> > > flush explicitly). I appreciate you would like to have a better
> > > precision but as you have explored the locking is really hard to get rid
> > > of here.
> >
> > Reading the stats *is* unpredictable today. In terms of
> > accuracy/staleness and cost. Avoiding the flush entirely on the read
> > path will surely make the cost very stable and cheap, but will make
> > accuracy even less predictable.
> >
>
> On average you would get the stats at most 2 second old, so I would
> not say it is less predictable.
>
> > >
> > > So from my POV I would prefer to avoid flushing from the stats reading
> > > path and implement force flushing by writing to stat file. If the 2s
> > > flushing interval is considered to coarse I would be OK to allow setting
> > > it from userspace. This way this would be more in line with /proc/vmstat
> > > which seems to be working quite well.
> > >
> > > If this is not accaptable or deemed a wrong approach long term then it
> > > would be good to reonsider the current cgroup_rstat_lock at least.
> > > Either by turning it into mutex or by dropping the yielding code which
> > > can severly affect the worst case latency AFAIU.
> >
> > Honestly I think it's better if we do it the other way around. We make
> > flushing on the stats reading path non-unified and deterministic. That
> > model also exists and is used for cpu.stat. If we find a problem with
> > the locking being held from userspace, we can then remove flushing
> > from the read path and add interface(s) to configure the periodic
> > flusher and do a force flush.
> >
>
> Here I agree with you. Let's go with the approach which is easy to
> undo for now. Though I prefer the new explicit interface for flushing,
> that step would be very hard to undo. Let's reevaluate if the proposed
> approach shows negative impact on production traffic and I think
> Cloudflare folks can give us the results soon.

Do you prefer we also switch to using a mutex (with preemption
disabled) to avoid the scenario Michal described where flushers give
up the lock and sleep resulting in an unbounded wait time in the worst
case?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-28 17:07                           ` Yosry Ahmed
@ 2023-08-28 17:27                             ` Waiman Long
  2023-08-28 17:28                               ` Yosry Ahmed
  2023-08-29  7:27                               ` Michal Hocko
  0 siblings, 2 replies; 43+ messages in thread
From: Waiman Long @ 2023-08-28 17:27 UTC (permalink / raw)
  To: Yosry Ahmed, Shakeel Butt
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
	linux-kernel


On 8/28/23 13:07, Yosry Ahmed wrote:
>
>> Here I agree with you. Let's go with the approach which is easy to
>> undo for now. Though I prefer the new explicit interface for flushing,
>> that step would be very hard to undo. Let's reevaluate if the proposed
>> approach shows negative impact on production traffic and I think
>> Cloudflare folks can give us the results soon.
> Do you prefer we also switch to using a mutex (with preemption
> disabled) to avoid the scenario Michal described where flushers give
> up the lock and sleep resulting in an unbounded wait time in the worst
> case?

Locking with mutex with preemption disabled is an oxymoron. Use spinlock 
if you want to have preemption disabled. The purpose of usiing mutex is 
to allow the lock owner to sleep, but you can't sleep with preemption 
disabled. You need to enable preemption first. You can disable 
preemption for a short time in a non-sleeping section of the lock 
critical section, but I would not recommend disabling preemption for the 
whole critical section.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-28 17:27                             ` Waiman Long
@ 2023-08-28 17:28                               ` Yosry Ahmed
  2023-08-28 17:35                                 ` Waiman Long
  2023-08-29  7:27                               ` Michal Hocko
  1 sibling, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-28 17:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Shakeel Butt, Michal Hocko, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote:
>
>
> On 8/28/23 13:07, Yosry Ahmed wrote:
> >
> >> Here I agree with you. Let's go with the approach which is easy to
> >> undo for now. Though I prefer the new explicit interface for flushing,
> >> that step would be very hard to undo. Let's reevaluate if the proposed
> >> approach shows negative impact on production traffic and I think
> >> Cloudflare folks can give us the results soon.
> > Do you prefer we also switch to using a mutex (with preemption
> > disabled) to avoid the scenario Michal described where flushers give
> > up the lock and sleep resulting in an unbounded wait time in the worst
> > case?
>
> Locking with mutex with preemption disabled is an oxymoron. Use spinlock
> if you want to have preemption disabled. The purpose of usiing mutex is
> to allow the lock owner to sleep, but you can't sleep with preemption
> disabled. You need to enable preemption first. You can disable
> preemption for a short time in a non-sleeping section of the lock
> critical section, but I would not recommend disabling preemption for the
> whole critical section.

I thought using a mutex with preemption disabled would at least allow
waiters to sleep rather than spin, is this not correct (or doesn't
matter) ?

>
> Cheers,
> Longman
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-28 17:28                               ` Yosry Ahmed
@ 2023-08-28 17:35                                 ` Waiman Long
  2023-08-28 17:43                                   ` Waiman Long
  0 siblings, 1 reply; 43+ messages in thread
From: Waiman Long @ 2023-08-28 17:35 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Shakeel Butt, Michal Hocko, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

On 8/28/23 13:28, Yosry Ahmed wrote:
> On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote:
>>
>> On 8/28/23 13:07, Yosry Ahmed wrote:
>>>> Here I agree with you. Let's go with the approach which is easy to
>>>> undo for now. Though I prefer the new explicit interface for flushing,
>>>> that step would be very hard to undo. Let's reevaluate if the proposed
>>>> approach shows negative impact on production traffic and I think
>>>> Cloudflare folks can give us the results soon.
>>> Do you prefer we also switch to using a mutex (with preemption
>>> disabled) to avoid the scenario Michal described where flushers give
>>> up the lock and sleep resulting in an unbounded wait time in the worst
>>> case?
>> Locking with mutex with preemption disabled is an oxymoron. Use spinlock
>> if you want to have preemption disabled. The purpose of usiing mutex is
>> to allow the lock owner to sleep, but you can't sleep with preemption
>> disabled. You need to enable preemption first. You can disable
>> preemption for a short time in a non-sleeping section of the lock
>> critical section, but I would not recommend disabling preemption for the
>> whole critical section.
> I thought using a mutex with preemption disabled would at least allow
> waiters to sleep rather than spin, is this not correct (or doesn't
> matter) ?

Because of optimistic spinning, a mutex lock waiter will only sleep if 
the lock holder sleep or when its time slice run out. So the waiters are 
likely to spin for quite a while before they go to sleep.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-28 17:35                                 ` Waiman Long
@ 2023-08-28 17:43                                   ` Waiman Long
  2023-08-28 18:35                                     ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Waiman Long @ 2023-08-28 17:43 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Shakeel Butt, Michal Hocko, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel


On 8/28/23 13:35, Waiman Long wrote:
> On 8/28/23 13:28, Yosry Ahmed wrote:
>> On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote:
>>>
>>> On 8/28/23 13:07, Yosry Ahmed wrote:
>>>>> Here I agree with you. Let's go with the approach which is easy to
>>>>> undo for now. Though I prefer the new explicit interface for 
>>>>> flushing,
>>>>> that step would be very hard to undo. Let's reevaluate if the 
>>>>> proposed
>>>>> approach shows negative impact on production traffic and I think
>>>>> Cloudflare folks can give us the results soon.
>>>> Do you prefer we also switch to using a mutex (with preemption
>>>> disabled) to avoid the scenario Michal described where flushers give
>>>> up the lock and sleep resulting in an unbounded wait time in the worst
>>>> case?
>>> Locking with mutex with preemption disabled is an oxymoron. Use 
>>> spinlock
>>> if you want to have preemption disabled. The purpose of usiing mutex is
>>> to allow the lock owner to sleep, but you can't sleep with preemption
>>> disabled. You need to enable preemption first. You can disable
>>> preemption for a short time in a non-sleeping section of the lock
>>> critical section, but I would not recommend disabling preemption for 
>>> the
>>> whole critical section.
>> I thought using a mutex with preemption disabled would at least allow
>> waiters to sleep rather than spin, is this not correct (or doesn't
>> matter) ?
>
> Because of optimistic spinning, a mutex lock waiter will only sleep if 
> the lock holder sleep or when its time slice run out. So the waiters 
> are likely to spin for quite a while before they go to sleep.

Perhaps you can add a mutex at the read side so that only 1 reader can 
contend with the global rstat spinlock at any time if this is a concern.

Regards,
Longman



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-28 17:43                                   ` Waiman Long
@ 2023-08-28 18:35                                     ` Yosry Ahmed
  0 siblings, 0 replies; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-28 18:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Shakeel Butt, Michal Hocko, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

On Mon, Aug 28, 2023 at 10:43 AM Waiman Long <longman@redhat.com> wrote:
>
>
> On 8/28/23 13:35, Waiman Long wrote:
> > On 8/28/23 13:28, Yosry Ahmed wrote:
> >> On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote:
> >>>
> >>> On 8/28/23 13:07, Yosry Ahmed wrote:
> >>>>> Here I agree with you. Let's go with the approach which is easy to
> >>>>> undo for now. Though I prefer the new explicit interface for
> >>>>> flushing,
> >>>>> that step would be very hard to undo. Let's reevaluate if the
> >>>>> proposed
> >>>>> approach shows negative impact on production traffic and I think
> >>>>> Cloudflare folks can give us the results soon.
> >>>> Do you prefer we also switch to using a mutex (with preemption
> >>>> disabled) to avoid the scenario Michal described where flushers give
> >>>> up the lock and sleep resulting in an unbounded wait time in the worst
> >>>> case?
> >>> Locking with mutex with preemption disabled is an oxymoron. Use
> >>> spinlock
> >>> if you want to have preemption disabled. The purpose of usiing mutex is
> >>> to allow the lock owner to sleep, but you can't sleep with preemption
> >>> disabled. You need to enable preemption first. You can disable
> >>> preemption for a short time in a non-sleeping section of the lock
> >>> critical section, but I would not recommend disabling preemption for
> >>> the
> >>> whole critical section.
> >> I thought using a mutex with preemption disabled would at least allow
> >> waiters to sleep rather than spin, is this not correct (or doesn't
> >> matter) ?
> >
> > Because of optimistic spinning, a mutex lock waiter will only sleep if
> > the lock holder sleep or when its time slice run out. So the waiters
> > are likely to spin for quite a while before they go to sleep.

I see. Thanks for the explanation.

>
> Perhaps you can add a mutex at the read side so that only 1 reader can
> contend with the global rstat spinlock at any time if this is a concern.

I guess we can keep it simple for now and add that later if needed.
For unified flushers we already can only have one. If we see a problem
from the stat reading side we can add a mutex there.

Thanks!


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-28 17:27                             ` Waiman Long
  2023-08-28 17:28                               ` Yosry Ahmed
@ 2023-08-29  7:27                               ` Michal Hocko
  2023-08-29 15:05                                 ` Waiman Long
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2023-08-29  7:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Yosry Ahmed, Shakeel Butt, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

On Mon 28-08-23 13:27:23, Waiman Long wrote:
> 
> On 8/28/23 13:07, Yosry Ahmed wrote:
> > 
> > > Here I agree with you. Let's go with the approach which is easy to
> > > undo for now. Though I prefer the new explicit interface for flushing,
> > > that step would be very hard to undo. Let's reevaluate if the proposed
> > > approach shows negative impact on production traffic and I think
> > > Cloudflare folks can give us the results soon.
> > Do you prefer we also switch to using a mutex (with preemption
> > disabled) to avoid the scenario Michal described where flushers give
> > up the lock and sleep resulting in an unbounded wait time in the worst
> > case?
> 
> Locking with mutex with preemption disabled is an oxymoron.

I believe Yosry wanted to disable preemption _after_ the lock is taken
to reduce the time spent while it is held. The idea to use the mutex is
to reduce spinning and more importantly to get rid of lock dropping
part. It is not really clear (but unlikely) we can drop it while
preserving the spinlock as the thing scales with O(#cgroups x #cpus)
in the worst case.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-29  7:27                               ` Michal Hocko
@ 2023-08-29 15:05                                 ` Waiman Long
  2023-08-29 15:17                                   ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Waiman Long @ 2023-08-29 15:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yosry Ahmed, Shakeel Butt, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

On 8/29/23 03:27, Michal Hocko wrote:
> On Mon 28-08-23 13:27:23, Waiman Long wrote:
>> On 8/28/23 13:07, Yosry Ahmed wrote:
>>>> Here I agree with you. Let's go with the approach which is easy to
>>>> undo for now. Though I prefer the new explicit interface for flushing,
>>>> that step would be very hard to undo. Let's reevaluate if the proposed
>>>> approach shows negative impact on production traffic and I think
>>>> Cloudflare folks can give us the results soon.
>>> Do you prefer we also switch to using a mutex (with preemption
>>> disabled) to avoid the scenario Michal described where flushers give
>>> up the lock and sleep resulting in an unbounded wait time in the worst
>>> case?
>> Locking with mutex with preemption disabled is an oxymoron.
> I believe Yosry wanted to disable preemption _after_ the lock is taken
> to reduce the time spent while it is held. The idea to use the mutex is
> to reduce spinning and more importantly to get rid of lock dropping
> part. It is not really clear (but unlikely) we can drop it while
> preserving the spinlock as the thing scales with O(#cgroups x #cpus)
> in the worst case.

As I have said later in my email, I am not against disabling preemption 
selectively on some parts of the lock critical section where preemption 
is undesirable. However, I am against disabling preemption for the whole 
duration of the code where the mutex lock is held as it defeats the 
purpose of using mutex in the first place.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-29 15:05                                 ` Waiman Long
@ 2023-08-29 15:17                                   ` Michal Hocko
  2023-08-29 16:04                                     ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2023-08-29 15:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Yosry Ahmed, Shakeel Butt, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

On Tue 29-08-23 11:05:28, Waiman Long wrote:
> On 8/29/23 03:27, Michal Hocko wrote:
> > On Mon 28-08-23 13:27:23, Waiman Long wrote:
> > > On 8/28/23 13:07, Yosry Ahmed wrote:
> > > > > Here I agree with you. Let's go with the approach which is easy to
> > > > > undo for now. Though I prefer the new explicit interface for flushing,
> > > > > that step would be very hard to undo. Let's reevaluate if the proposed
> > > > > approach shows negative impact on production traffic and I think
> > > > > Cloudflare folks can give us the results soon.
> > > > Do you prefer we also switch to using a mutex (with preemption
> > > > disabled) to avoid the scenario Michal described where flushers give
> > > > up the lock and sleep resulting in an unbounded wait time in the worst
> > > > case?
> > > Locking with mutex with preemption disabled is an oxymoron.
> > I believe Yosry wanted to disable preemption _after_ the lock is taken
> > to reduce the time spent while it is held. The idea to use the mutex is
> > to reduce spinning and more importantly to get rid of lock dropping
> > part. It is not really clear (but unlikely) we can drop it while
> > preserving the spinlock as the thing scales with O(#cgroups x #cpus)
> > in the worst case.
> 
> As I have said later in my email, I am not against disabling preemption
> selectively on some parts of the lock critical section where preemption is
> undesirable. However, I am against disabling preemption for the whole
> duration of the code where the mutex lock is held as it defeats the purpose
> of using mutex in the first place.

I certainly agree this is an antipattern.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-29 15:17                                   ` Michal Hocko
@ 2023-08-29 16:04                                     ` Yosry Ahmed
  0 siblings, 0 replies; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-29 16:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Waiman Long, Shakeel Butt, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm,
	cgroups, linux-kernel

On Tue, Aug 29, 2023 at 8:17 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 29-08-23 11:05:28, Waiman Long wrote:
> > On 8/29/23 03:27, Michal Hocko wrote:
> > > On Mon 28-08-23 13:27:23, Waiman Long wrote:
> > > > On 8/28/23 13:07, Yosry Ahmed wrote:
> > > > > > Here I agree with you. Let's go with the approach which is easy to
> > > > > > undo for now. Though I prefer the new explicit interface for flushing,
> > > > > > that step would be very hard to undo. Let's reevaluate if the proposed
> > > > > > approach shows negative impact on production traffic and I think
> > > > > > Cloudflare folks can give us the results soon.
> > > > > Do you prefer we also switch to using a mutex (with preemption
> > > > > disabled) to avoid the scenario Michal described where flushers give
> > > > > up the lock and sleep resulting in an unbounded wait time in the worst
> > > > > case?
> > > > Locking with mutex with preemption disabled is an oxymoron.
> > > I believe Yosry wanted to disable preemption _after_ the lock is taken
> > > to reduce the time spent while it is held. The idea to use the mutex is
> > > to reduce spinning and more importantly to get rid of lock dropping
> > > part. It is not really clear (but unlikely) we can drop it while
> > > preserving the spinlock as the thing scales with O(#cgroups x #cpus)
> > > in the worst case.
> >
> > As I have said later in my email, I am not against disabling preemption
> > selectively on some parts of the lock critical section where preemption is
> > undesirable. However, I am against disabling preemption for the whole
> > duration of the code where the mutex lock is held as it defeats the purpose
> > of using mutex in the first place.
>
> I certainly agree this is an antipattern.

So I guess the verdict is to avoid using a mutex here for now. I sent
a v2 which includes an additional small patch suggested by Michal
Koutny and an updated changelog for this patch to document this
discussion and possible alternatives we can do if things go wrong with
this approach:

https://lore.kernel.org/lkml/20230828233319.340712-1-yosryahmed@google.com/


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-25  7:05                 ` Michal Hocko
  2023-08-25 15:14                   ` Yosry Ahmed
@ 2023-08-29 18:44                   ` Tejun Heo
  2023-08-29 19:13                     ` Yosry Ahmed
  1 sibling, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2023-08-29 18:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm, cgroups,
	linux-kernel

Hello,

On Fri, Aug 25, 2023 at 09:05:46AM +0200, Michal Hocko wrote:
> > > I think that's how it was always meant to be when it was designed. The
> > > global rstat lock has always existed and was always available to
> > > userspace readers. The memory controller took a different path at some
> > > point with unified flushing, but that was mainly because of high
> > > concurrency from in-kernel flushers, not because userspace readers
> > > caused a problem. Outside of memcg, the core cgroup code has always
> > > exercised this global lock when reading cpu.stat since rstat's
> > > introduction. I assume there hasn't been any problems since it's still
> > > there.
> 
> I suspect nobody has just considered a malfunctioning or adversary
> workloads so far.
> 
> > > I was hoping Tejun would confirm/deny this.
> 
> Yes, that would be interesting to hear.

So, the assumptions in the original design were:

* Writers are high freq but readers are lower freq and can block.

* The global lock is mutex.

* Back-to-back reads won't have too much to do because it only has to flush
  what's been accumulated since the last flush which took place just before.

It's likely that the userspace side is gonna be just fine if we restore the
global lock to be a mutex and let them be. Most of the problems are caused
by trying to allow flushing from non-sleepable and kernel contexts. Would it
make sense to distinguish what can and can't wait and make the latter group
always use cached value? e.g. even in kernel, during oom kill, waiting
doesn't really matter and it can just wait to obtain the up-to-date numbers.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-29 18:44                   ` Tejun Heo
@ 2023-08-29 19:13                     ` Yosry Ahmed
  2023-08-29 19:36                       ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-29 19:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm, cgroups,
	linux-kernel

On Tue, Aug 29, 2023 at 11:44 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Aug 25, 2023 at 09:05:46AM +0200, Michal Hocko wrote:
> > > > I think that's how it was always meant to be when it was designed. The
> > > > global rstat lock has always existed and was always available to
> > > > userspace readers. The memory controller took a different path at some
> > > > point with unified flushing, but that was mainly because of high
> > > > concurrency from in-kernel flushers, not because userspace readers
> > > > caused a problem. Outside of memcg, the core cgroup code has always
> > > > exercised this global lock when reading cpu.stat since rstat's
> > > > introduction. I assume there hasn't been any problems since it's still
> > > > there.
> >
> > I suspect nobody has just considered a malfunctioning or adversary
> > workloads so far.
> >
> > > > I was hoping Tejun would confirm/deny this.
> >
> > Yes, that would be interesting to hear.
>
> So, the assumptions in the original design were:
>
> * Writers are high freq but readers are lower freq and can block.
>
> * The global lock is mutex.
>
> * Back-to-back reads won't have too much to do because it only has to flush
>   what's been accumulated since the last flush which took place just before.
>
> It's likely that the userspace side is gonna be just fine if we restore the
> global lock to be a mutex and let them be. Most of the problems are caused
> by trying to allow flushing from non-sleepable and kernel contexts.

So basically restore the flush without disabling preemption, and if a
userspace reader gets preempted while holding the mutex it's probably
okay because we won't have high concurrency among userspace readers?

I think Shakeel was worried that this may cause a priority inversion
where a low priority task is preempted while holding the mutex, and
prevents high priority tasks from acquiring it to read the stats and
take actions (e.g. userspace OOMs).

> Would it
> make sense to distinguish what can and can't wait and make the latter group
> always use cached value? e.g. even in kernel, during oom kill, waiting
> doesn't really matter and it can just wait to obtain the up-to-date numbers.

The problem with waiting for in-kernel flushers is that high
concurrency leads to terrible serialization. Running a stress test
with 100s of reclaimers where everyone waits shows ~ 3x slowdowns.

This patch series is indeed trying to formalize a distinction between
waiters who can wait and those who can't on the memcg side:

- Unified flushers always flush the entire tree and only flush if no
one else is flushing (no waiting), otherwise they use cached data and
hope the concurrent flushing helps. This is what we currently do for
most memcg contexts. This patch series opts userspace reads out of it.

- Non-unified flushers only flush the subtree they care about and they
wait if there are other flushers. This is what we currently do for
some zswap accounting code. This patch series opts userspace readers
into this.

The problem Michal is raising is that dropping the lock can lead to an
unbounded number of waiters and longer worst case latency. Especially
that this is directly influenced by userspace. Reintroducing the mutex
and removing the lock dropping code fixes that problem, but then if
the mutex holder gets preempted, we face another problem.

Personally I think there is a good chance there won't be userspace
latency problems due to the lock as usually there isn't high
concurrency among userspace readers, and we can deal with that problem
if/when it happens. So far no problem is happening for cpu.stat which
has the same potential problem.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-29 19:13                     ` Yosry Ahmed
@ 2023-08-29 19:36                       ` Tejun Heo
  2023-08-29 19:54                         ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2023-08-29 19:36 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm, cgroups,
	linux-kernel

Hello,

On Tue, Aug 29, 2023 at 12:13:31PM -0700, Yosry Ahmed wrote:
...
> > So, the assumptions in the original design were:
> >
> > * Writers are high freq but readers are lower freq and can block.
> >
> > * The global lock is mutex.
> >
> > * Back-to-back reads won't have too much to do because it only has to flush
> >   what's been accumulated since the last flush which took place just before.
> >
> > It's likely that the userspace side is gonna be just fine if we restore the
> > global lock to be a mutex and let them be. Most of the problems are caused
> > by trying to allow flushing from non-sleepable and kernel contexts.
> 
> So basically restore the flush without disabling preemption, and if a
> userspace reader gets preempted while holding the mutex it's probably
> okay because we won't have high concurrency among userspace readers?
> 
> I think Shakeel was worried that this may cause a priority inversion
> where a low priority task is preempted while holding the mutex, and
> prevents high priority tasks from acquiring it to read the stats and
> take actions (e.g. userspace OOMs).

We'll have to see but I'm not sure this is going to be a huge problem. The
most common way that priority inversions are resolved is through work
conservation - ie. the system just runs out of other things to do, so
whatever is blocking the system gets to run and unblocks it. It only really
becomes a problem when work conservation breaks down on CPU side which
happens if the one holding the resource is 1. blocked on IOs (usually
through memory allocation but can be anything) 2. throttled by cpu.max.

#1 is not a factor here. #2 is but that is a factor for everything in the
kernel and should really be solved from cpu.max side. So, I think in
practice, this should be fine, or at least not worse than anything else.

> > Would it
> > make sense to distinguish what can and can't wait and make the latter group
> > always use cached value? e.g. even in kernel, during oom kill, waiting
> > doesn't really matter and it can just wait to obtain the up-to-date numbers.
> 
> The problem with waiting for in-kernel flushers is that high
> concurrency leads to terrible serialization. Running a stress test
> with 100s of reclaimers where everyone waits shows ~ 3x slowdowns.
> 
> This patch series is indeed trying to formalize a distinction between
> waiters who can wait and those who can't on the memcg side:
> 
> - Unified flushers always flush the entire tree and only flush if no
> one else is flushing (no waiting), otherwise they use cached data and
> hope the concurrent flushing helps. This is what we currently do for
> most memcg contexts. This patch series opts userspace reads out of it.
> 
> - Non-unified flushers only flush the subtree they care about and they
> wait if there are other flushers. This is what we currently do for
> some zswap accounting code. This patch series opts userspace readers
> into this.
> 
> The problem Michal is raising is that dropping the lock can lead to an
> unbounded number of waiters and longer worst case latency. Especially
> that this is directly influenced by userspace. Reintroducing the mutex
> and removing the lock dropping code fixes that problem, but then if
> the mutex holder gets preempted, we face another problem.
> 
> Personally I think there is a good chance there won't be userspace
> latency problems due to the lock as usually there isn't high
> concurrency among userspace readers, and we can deal with that problem
> if/when it happens. So far no problem is happening for cpu.stat which
> has the same potential problem.

Maybe leave the global lock as-is and gate the userland flushers with a
mutex so that there's only ever one contenting on the rstat lock from
userland side?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-29 19:36                       ` Tejun Heo
@ 2023-08-29 19:54                         ` Yosry Ahmed
  2023-08-29 20:12                           ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-29 19:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm, cgroups,
	linux-kernel

On Tue, Aug 29, 2023 at 12:36 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Aug 29, 2023 at 12:13:31PM -0700, Yosry Ahmed wrote:
> ...
> > > So, the assumptions in the original design were:
> > >
> > > * Writers are high freq but readers are lower freq and can block.
> > >
> > > * The global lock is mutex.
> > >
> > > * Back-to-back reads won't have too much to do because it only has to flush
> > >   what's been accumulated since the last flush which took place just before.
> > >
> > > It's likely that the userspace side is gonna be just fine if we restore the
> > > global lock to be a mutex and let them be. Most of the problems are caused
> > > by trying to allow flushing from non-sleepable and kernel contexts.
> >
> > So basically restore the flush without disabling preemption, and if a
> > userspace reader gets preempted while holding the mutex it's probably
> > okay because we won't have high concurrency among userspace readers?
> >
> > I think Shakeel was worried that this may cause a priority inversion
> > where a low priority task is preempted while holding the mutex, and
> > prevents high priority tasks from acquiring it to read the stats and
> > take actions (e.g. userspace OOMs).
>
> We'll have to see but I'm not sure this is going to be a huge problem. The
> most common way that priority inversions are resolved is through work
> conservation - ie. the system just runs out of other things to do, so
> whatever is blocking the system gets to run and unblocks it. It only really
> becomes a problem when work conservation breaks down on CPU side which
> happens if the one holding the resource is 1. blocked on IOs (usually
> through memory allocation but can be anything) 2. throttled by cpu.max.
>
> #1 is not a factor here. #2 is but that is a factor for everything in the
> kernel and should really be solved from cpu.max side. So, I think in
> practice, this should be fine, or at least not worse than anything else.

If that's not a concern I can just append a patch that changes the
spinlock to a mutex to this series. Shakeel, wdyt?

>
> > > Would it
> > > make sense to distinguish what can and can't wait and make the latter group
> > > always use cached value? e.g. even in kernel, during oom kill, waiting
> > > doesn't really matter and it can just wait to obtain the up-to-date numbers.
> >
> > The problem with waiting for in-kernel flushers is that high
> > concurrency leads to terrible serialization. Running a stress test
> > with 100s of reclaimers where everyone waits shows ~ 3x slowdowns.
> >
> > This patch series is indeed trying to formalize a distinction between
> > waiters who can wait and those who can't on the memcg side:
> >
> > - Unified flushers always flush the entire tree and only flush if no
> > one else is flushing (no waiting), otherwise they use cached data and
> > hope the concurrent flushing helps. This is what we currently do for
> > most memcg contexts. This patch series opts userspace reads out of it.
> >
> > - Non-unified flushers only flush the subtree they care about and they
> > wait if there are other flushers. This is what we currently do for
> > some zswap accounting code. This patch series opts userspace readers
> > into this.
> >
> > The problem Michal is raising is that dropping the lock can lead to an
> > unbounded number of waiters and longer worst case latency. Especially
> > that this is directly influenced by userspace. Reintroducing the mutex
> > and removing the lock dropping code fixes that problem, but then if
> > the mutex holder gets preempted, we face another problem.
> >
> > Personally I think there is a good chance there won't be userspace
> > latency problems due to the lock as usually there isn't high
> > concurrency among userspace readers, and we can deal with that problem
> > if/when it happens. So far no problem is happening for cpu.stat which
> > has the same potential problem.
>
> Maybe leave the global lock as-is and gate the userland flushers with a
> mutex so that there's only ever one contenting on the rstat lock from
> userland side?

Waiman suggested this as well. We can do that for sure, although I
think we should wait until we are sure it's needed.

One question. If whoever is holding that mutex is either flushing with
the spinlock held or spinning (i.e not sleepable or preemptable),
wouldn't this be equivalent to just changing the spinlock with a mutex
and disable preemption while holding it?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-29 19:54                         ` Yosry Ahmed
@ 2023-08-29 20:12                           ` Tejun Heo
  2023-08-29 20:20                             ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2023-08-29 20:12 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm, cgroups,
	linux-kernel

Hello,

On Tue, Aug 29, 2023 at 12:54:06PM -0700, Yosry Ahmed wrote:
...
> > Maybe leave the global lock as-is and gate the userland flushers with a
> > mutex so that there's only ever one contenting on the rstat lock from
> > userland side?
> 
> Waiman suggested this as well. We can do that for sure, although I
> think we should wait until we are sure it's needed.
> 
> One question. If whoever is holding that mutex is either flushing with
> the spinlock held or spinning (i.e not sleepable or preemptable),
> wouldn't this be equivalent to just changing the spinlock with a mutex
> and disable preemption while holding it?

Well, it creates layering so that userspace can't flood the inner lock which
can cause contention issues for kernel side users. Not sleeping while
actively flushing is an side-effect too but the code at least doesn't look
as anti-patterny as disabling preemption right after grabbing a mutex.

I don't have a strong preference. As long as we stay away from introducing a
new user interface construct and can address the noticed scalability issues,
it should be fine. Note that there are other ways to address priority
inversions and contentions too - e.g. we can always bounce flushing to a
[kthread_]kworker and rate limit (or rather latency limit) how often
different classes of users can trigger flushing. I don't think we have to go
there yet but if the simpler meaures don't work out, there are still many
ways to solve the problem within the kernel.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-29 20:12                           ` Tejun Heo
@ 2023-08-29 20:20                             ` Yosry Ahmed
  2023-08-31  9:05                               ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-29 20:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm, cgroups,
	linux-kernel

On Tue, Aug 29, 2023 at 1:12 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Aug 29, 2023 at 12:54:06PM -0700, Yosry Ahmed wrote:
> ...
> > > Maybe leave the global lock as-is and gate the userland flushers with a
> > > mutex so that there's only ever one contenting on the rstat lock from
> > > userland side?
> >
> > Waiman suggested this as well. We can do that for sure, although I
> > think we should wait until we are sure it's needed.
> >
> > One question. If whoever is holding that mutex is either flushing with
> > the spinlock held or spinning (i.e not sleepable or preemptable),
> > wouldn't this be equivalent to just changing the spinlock with a mutex
> > and disable preemption while holding it?
>
> Well, it creates layering so that userspace can't flood the inner lock which
> can cause contention issues for kernel side users. Not sleeping while
> actively flushing is an side-effect too but the code at least doesn't look
> as anti-patterny as disabling preemption right after grabbing a mutex.

I see. At most one kernel side flusher will be spinning for the lock
at any given point anyway, but I guess having that one kernel side
flusher competing against one user side flusher is better competing
with N flushers.

I will add a mutex on the userspace read side then and spin a v3.
Hopefully this addresses Michal's concern as well. The lock dropping
logic will still exist for the inner lock, but when one userspace
reader drops the inner lock other readers won't be able to pick it up.

>
> I don't have a strong preference. As long as we stay away from introducing a
> new user interface construct and can address the noticed scalability issues,
> it should be fine. Note that there are other ways to address priority
> inversions and contentions too - e.g. we can always bounce flushing to a
> [kthread_]kworker and rate limit (or rather latency limit) how often
> different classes of users can trigger flushing. I don't think we have to go
> there yet but if the simpler meaures don't work out, there are still many
> ways to solve the problem within the kernel.

I whole-heartedly agree with the preference to fix the problem within
the kernel with minimal/none user space involvement.

Thanks!


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-29 20:20                             ` Yosry Ahmed
@ 2023-08-31  9:05                               ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2023-08-31  9:05 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm, cgroups,
	linux-kernel

On Tue 29-08-23 13:20:34, Yosry Ahmed wrote:
[...]
> I will add a mutex on the userspace read side then and spin a v3.
> Hopefully this addresses Michal's concern as well. The lock dropping
> logic will still exist for the inner lock, but when one userspace
> reader drops the inner lock other readers won't be able to pick it up.

Yes, that would minimize the risk of the worst case pathological
behavior.

> > I don't have a strong preference. As long as we stay away from introducing a
> > new user interface construct and can address the noticed scalability issues,
> > it should be fine. Note that there are other ways to address priority
> > inversions and contentions too - e.g. we can always bounce flushing to a
> > [kthread_]kworker and rate limit (or rather latency limit) how often
> > different classes of users can trigger flushing. I don't think we have to go
> > there yet but if the simpler meaures don't work out, there are still many
> > ways to solve the problem within the kernel.
> 
> I whole-heartedly agree with the preference to fix the problem within
> the kernel with minimal/none user space involvement.

Let's see. While I would love to see a solution that works for everybody
without explicit interface we have hit problems with locks involved in
stat files in the past.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2023-08-31  9:05 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-21 20:54 [PATCH 0/3] memcg: non-unified flushing for userspace stats Yosry Ahmed
2023-08-21 20:54 ` [PATCH 1/3] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
2023-08-21 20:54 ` [PATCH 2/3] mm: memcg: add a helper for non-unified " Yosry Ahmed
2023-08-22 13:01   ` Michal Koutný
2023-08-22 16:00     ` Yosry Ahmed
2023-08-22 16:35       ` Michal Koutný
2023-08-22 16:48         ` Yosry Ahmed
2023-08-21 20:54 ` [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
2023-08-22  9:06   ` Michal Hocko
2023-08-22 15:30     ` Yosry Ahmed
2023-08-23  7:33       ` Michal Hocko
2023-08-23 14:55         ` Yosry Ahmed
2023-08-24  7:13           ` Michal Hocko
2023-08-24 18:15             ` Yosry Ahmed
2023-08-24 18:50               ` Yosry Ahmed
2023-08-25  7:05                 ` Michal Hocko
2023-08-25 15:14                   ` Yosry Ahmed
2023-08-25 18:17                     ` Michal Hocko
2023-08-25 18:21                       ` Yosry Ahmed
2023-08-25 18:43                         ` Michal Hocko
2023-08-25 18:44                           ` Michal Hocko
2023-08-28 15:47                     ` Michal Hocko
2023-08-28 16:15                       ` Yosry Ahmed
2023-08-28 17:00                         ` Shakeel Butt
2023-08-28 17:07                           ` Yosry Ahmed
2023-08-28 17:27                             ` Waiman Long
2023-08-28 17:28                               ` Yosry Ahmed
2023-08-28 17:35                                 ` Waiman Long
2023-08-28 17:43                                   ` Waiman Long
2023-08-28 18:35                                     ` Yosry Ahmed
2023-08-29  7:27                               ` Michal Hocko
2023-08-29 15:05                                 ` Waiman Long
2023-08-29 15:17                                   ` Michal Hocko
2023-08-29 16:04                                     ` Yosry Ahmed
2023-08-29 18:44                   ` Tejun Heo
2023-08-29 19:13                     ` Yosry Ahmed
2023-08-29 19:36                       ` Tejun Heo
2023-08-29 19:54                         ` Yosry Ahmed
2023-08-29 20:12                           ` Tejun Heo
2023-08-29 20:20                             ` Yosry Ahmed
2023-08-31  9:05                               ` Michal Hocko
2023-08-22 13:00 ` [PATCH 0/3] memcg: non-unified flushing for userspace stats Michal Koutný
2023-08-22 15:43   ` Yosry Ahmed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).