* [PATCH v4 0/4] memcg: non-unified flushing for userspace stats
@ 2023-08-31 16:56 Yosry Ahmed
2023-08-31 16:56 ` [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Yosry Ahmed @ 2023-08-31 16:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Yosry Ahmed
Most memcg flushing contexts use "unified" flushing, where only one
flusher is allowed at a time (others skip), and all flushers need to
flush the entire tree. This works well under high concurrency, which
mostly comes from in-kernel flushers (e.g. reclaim, refault, ..).
For userspace reads, unified flushing leads to non-deterministic stats
staleness and reading cost. This series clarifies and documents the
differences between unified and non-unified flushing (patches 1 & 2),
then opts userspace reads out of unified flushing (patches 3 & 4).
This patch series is a follow up on the discussion in [1]. That was a
patch that proposed that userspace reads wait for ongoing unified
flushers to complete before returning. There were concerns about the
latency that this introduces to userspace reads, especially with ongoing
reports of expensive stat reads even with unified flushing. Hence, this
series follows a different approach, by opting userspace reads out of
unified flushing completely. The cost of userspace reads is now
deterministic, and depends on the size of the subtree being read. This
should fix both the *sometimes* expensive reads (due to flushing the
entire tree) and the occasional staleness (due to skipped flushes).
I attempted to remove unified flushing completely, but noticed that
in-kernel flushers with high concurrency (e.g. hundreds of concurrent
reclaimers) regress without it. This sort of concurrency is not expected
from userspace reads. More details about testing and some numbers are in
the last patch's changelog.
v3 -> v4:
- Fixed build error in the last patch with W=1 because of a missed
'static'.
v3: https://lore.kernel.org/lkml/20230830175335.1536008-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/
Yosry Ahmed (4):
mm: memcg: properly name and document unified stats flushing
mm: memcg: add a helper for non-unified stats flushing
mm: memcg: let non-unified root stats flushes help unified flushes
mm: memcg: use non-unified stats flushing for userspace reads
include/linux/memcontrol.h | 8 +--
mm/memcontrol.c | 106 +++++++++++++++++++++++++++----------
mm/vmscan.c | 2 +-
mm/workingset.c | 4 +-
4 files changed, 85 insertions(+), 35 deletions(-)
--
2.42.0.rc2.253.gd59a3bf2b4-goog
^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing
  2023-08-31 16:56 [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Yosry Ahmed
@ 2023-08-31 16:56 ` Yosry Ahmed
  [not found]   ` <20230831165611.2610118-2-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  [not found] ` <20230831165611.2610118-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2023-08-31 16:56 ` [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes Yosry Ahmed
  2 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2023-08-31 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Yosry Ahmed

Most contexts that flush memcg stats use "unified" flushing, where
basically all flushers attempt to flush the entire hierarchy, but only
one flusher is allowed at a time, others skip flushing.

This is needed because we need to flush the stats from paths such as
reclaim or refaults, which may have high concurrency, especially on
large systems. Serializing such performance-sensitive paths can
introduce regressions, hence, unified flushing offers a tradeoff between
stats staleness and the performance impact of flushing stats.

Document this properly and explicitly by renaming the common flushing
helper from do_flush_stats() to do_unified_stats_flush(), and adding
documentation to describe unified flushing. Additionally, rename
flushing APIs to add "try" in the name, which implies that flushing will
not always happen. Also add proper documentation.

No functional change intended.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/memcontrol.h |  8 ++---
 mm/memcontrol.c            | 61 +++++++++++++++++++++++++-------------
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  4 +--
 4 files changed, 47 insertions(+), 28 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 11810a2cfd2d..d517b0cc5221 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1034,8 +1034,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_ratelimited(void);
+void mem_cgroup_try_flush_stats(void);
+void mem_cgroup_try_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1519,11 +1519,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void mem_cgroup_flush_stats(void)
+static inline void mem_cgroup_try_flush_stats(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_ratelimited(void)
+static inline void mem_cgroup_try_flush_stats_ratelimited(void)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf57fe9318d5..2d0ec828a1c4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -588,7 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
-static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
+static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
 
@@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 		/*
 		 * If stats_flush_threshold exceeds the threshold
 		 * (>num_online_cpus()), cgroup stats update will be triggered
-		 * in __mem_cgroup_flush_stats(). Increasing this var further
+		 * in mem_cgroup_try_flush_stats(). Increasing this var further
 		 * is redundant and simply adds overhead in atomic update.
 		 */
 		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
@@ -639,15 +639,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(void)
+/*
+ * do_unified_stats_flush - do a unified flush of memory cgroup statistics
+ *
+ * A unified flush tries to flush the entire hierarchy, but skips if there is
+ * another ongoing flush. This is meant for flushers that may have a lot of
+ * concurrency (e.g. reclaim, refault, etc), and should not be serialized to
+ * avoid slowing down performance-sensitive paths. A unified flush may skip, and
+ * hence may yield stale stats.
+ */
+static void do_unified_stats_flush(void)
 {
-	/*
-	 * We always flush the entire tree, so concurrent flushers can just
-	 * skip. This avoids a thundering herd problem on the rstat global lock
-	 * from memcg flushers (e.g. reclaim, refault, etc).
-	 */
-	if (atomic_read(&stats_flush_ongoing) ||
-	    atomic_xchg(&stats_flush_ongoing, 1))
+	if (atomic_read(&stats_unified_flush_ongoing) ||
+	    atomic_xchg(&stats_unified_flush_ongoing, 1))
 		return;
 
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
@@ -655,19 +659,34 @@ static void do_flush_stats(void)
 	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
 
 	atomic_set(&stats_flush_threshold, 0);
-	atomic_set(&stats_flush_ongoing, 0);
+	atomic_set(&stats_unified_flush_ongoing, 0);
 }
 
-void mem_cgroup_flush_stats(void)
+/*
+ * mem_cgroup_try_flush_stats - try to flush memory cgroup statistics
+ *
+ * Try to flush the stats of all memcgs that have stat updates since the last
+ * flush. We do not flush the stats if:
+ * - The magnitude of the pending updates is below a certain threshold.
+ * - There is another ongoing unified flush (see do_unified_stats_flush()).
+ *
+ * Hence, the stats may be stale, but ideally by less than FLUSH_TIME due to
+ * periodic flushing.
+ */
+void mem_cgroup_try_flush_stats(void)
 {
 	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
-		do_flush_stats();
+		do_unified_stats_flush();
 }
 
-void mem_cgroup_flush_stats_ratelimited(void)
+/*
+ * Like mem_cgroup_try_flush_stats(), but only flushes if the periodic flusher
+ * is late.
+ */
+void mem_cgroup_try_flush_stats_ratelimited(void)
 {
 	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
-		mem_cgroup_flush_stats();
+		mem_cgroup_try_flush_stats();
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
@@ -676,7 +695,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Always flush here so that flushing in latency-sensitive paths is
 	 * as cheap as possible.
 	 */
-	do_flush_stats();
+	do_unified_stats_flush();
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -1576,7 +1595,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4018,7 +4037,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4093,7 +4112,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4595,7 +4614,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6610,7 +6629,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c7c149cb8d66..457a18921fda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2923,7 +2923,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..affb8699e58d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -520,7 +520,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 	}
 
 	/* Flush stats (and potentially sleep) before holding RCU read lock */
-	mem_cgroup_flush_stats_ratelimited();
+	mem_cgroup_try_flush_stats_ratelimited();
 
 	rcu_read_lock();
 
@@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	struct lruvec *lruvec;
 	int i;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 		pages += lruvec_page_state_local(lruvec,
--
2.42.0.rc2.253.gd59a3bf2b4-goog

^ permalink raw reply related	[flat|nested] 29+ messages in thread
[parent not found: <20230831165611.2610118-2-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing
  [not found] ` <20230831165611.2610118-2-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2023-09-04 14:44   ` Michal Hocko
  [not found]     ` <ZPXtVLNIXk8trj2k-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-04 14:44 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Thu 31-08-23 16:56:08, Yosry Ahmed wrote:
> Most contexts that flush memcg stats use "unified" flushing, where
> basically all flushers attempt to flush the entire hierarchy, but only
> one flusher is allowed at a time, others skip flushing.
>
> This is needed because we need to flush the stats from paths such as
> reclaim or refaults, which may have high concurrency, especially on
> large systems. Serializing such performance-sensitive paths can
> introduce regressions, hence, unified flushing offers a tradeoff between
> stats staleness and the performance impact of flushing stats.
>
> Document this properly and explicitly by renaming the common flushing
> helper from do_flush_stats() to do_unified_stats_flush(), and adding
> documentation to describe unified flushing. Additionally, rename
> flushing APIs to add "try" in the name, which implies that flushing will
> not always happen. Also add proper documentation.
>
> No functional change intended.
>
> Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

No objections to renaming but it would be really nice to simplify this.
It is just "funny" to see 4 different flushing methods (flush from
userspace, flush, try_flush_with_ratelimit_1 and try_flush_with_ratelimit_2).
This is all internal so I am not all that worried that this would get confused but it surely is rather convoluted. Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> > --- > include/linux/memcontrol.h | 8 ++--- > mm/memcontrol.c | 61 +++++++++++++++++++++++++------------- > mm/vmscan.c | 2 +- > mm/workingset.c | 4 +-- > 4 files changed, 47 insertions(+), 28 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 11810a2cfd2d..d517b0cc5221 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -1034,8 +1034,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, > return x; > } > > -void mem_cgroup_flush_stats(void); > -void mem_cgroup_flush_stats_ratelimited(void); > +void mem_cgroup_try_flush_stats(void); > +void mem_cgroup_try_flush_stats_ratelimited(void); > > void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, > int val); > @@ -1519,11 +1519,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, > return node_page_state(lruvec_pgdat(lruvec), idx); > } > > -static inline void mem_cgroup_flush_stats(void) > +static inline void mem_cgroup_try_flush_stats(void) > { > } > > -static inline void mem_cgroup_flush_stats_ratelimited(void) > +static inline void mem_cgroup_try_flush_stats_ratelimited(void) > { > } > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index cf57fe9318d5..2d0ec828a1c4 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -588,7 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) > static void flush_memcg_stats_dwork(struct work_struct *w); > static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork); > static DEFINE_PER_CPU(unsigned int, stats_updates); > -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0); > +static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0); > static atomic_t stats_flush_threshold = ATOMIC_INIT(0); > static 
u64 flush_next_time; > > @@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) > /* > * If stats_flush_threshold exceeds the threshold > * (>num_online_cpus()), cgroup stats update will be triggered > - * in __mem_cgroup_flush_stats(). Increasing this var further > + * in mem_cgroup_try_flush_stats(). Increasing this var further > * is redundant and simply adds overhead in atomic update. > */ > if (atomic_read(&stats_flush_threshold) <= num_online_cpus()) > @@ -639,15 +639,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) > } > } > > -static void do_flush_stats(void) > +/* > + * do_unified_stats_flush - do a unified flush of memory cgroup statistics > + * > + * A unified flush tries to flush the entire hierarchy, but skips if there is > + * another ongoing flush. This is meant for flushers that may have a lot of > + * concurrency (e.g. reclaim, refault, etc), and should not be serialized to > + * avoid slowing down performance-sensitive paths. A unified flush may skip, and > + * hence may yield stale stats. > + */ > +static void do_unified_stats_flush(void) > { > - /* > - * We always flush the entire tree, so concurrent flushers can just > - * skip. This avoids a thundering herd problem on the rstat global lock > - * from memcg flushers (e.g. reclaim, refault, etc). 
> - */ > - if (atomic_read(&stats_flush_ongoing) || > - atomic_xchg(&stats_flush_ongoing, 1)) > + if (atomic_read(&stats_unified_flush_ongoing) || > + atomic_xchg(&stats_unified_flush_ongoing, 1)) > return; > > WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME); > @@ -655,19 +659,34 @@ static void do_flush_stats(void) > cgroup_rstat_flush(root_mem_cgroup->css.cgroup); > > atomic_set(&stats_flush_threshold, 0); > - atomic_set(&stats_flush_ongoing, 0); > + atomic_set(&stats_unified_flush_ongoing, 0); > } > > -void mem_cgroup_flush_stats(void) > +/* > + * mem_cgroup_try_flush_stats - try to flush memory cgroup statistics > + * > + * Try to flush the stats of all memcgs that have stat updates since the last > + * flush. We do not flush the stats if: > + * - The magnitude of the pending updates is below a certain threshold. > + * - There is another ongoing unified flush (see do_unified_stats_flush()). > + * > + * Hence, the stats may be stale, but ideally by less than FLUSH_TIME due to > + * periodic flushing. > + */ > +void mem_cgroup_try_flush_stats(void) > { > if (atomic_read(&stats_flush_threshold) > num_online_cpus()) > - do_flush_stats(); > + do_unified_stats_flush(); > } > > -void mem_cgroup_flush_stats_ratelimited(void) > +/* > + * Like mem_cgroup_try_flush_stats(), but only flushes if the periodic flusher > + * is late. > + */ > +void mem_cgroup_try_flush_stats_ratelimited(void) > { > if (time_after64(jiffies_64, READ_ONCE(flush_next_time))) > - mem_cgroup_flush_stats(); > + mem_cgroup_try_flush_stats(); > } > > static void flush_memcg_stats_dwork(struct work_struct *w) > @@ -676,7 +695,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w) > * Always flush here so that flushing in latency-sensitive paths is > * as cheap as possible. 
> */ > - do_flush_stats(); > + do_unified_stats_flush(); > queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME); > } > > @@ -1576,7 +1595,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) > * > * Current memory state: > */ > - mem_cgroup_flush_stats(); > + mem_cgroup_try_flush_stats(); > > for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { > u64 size; > @@ -4018,7 +4037,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) > int nid; > struct mem_cgroup *memcg = mem_cgroup_from_seq(m); > > - mem_cgroup_flush_stats(); > + mem_cgroup_try_flush_stats(); > > for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { > seq_printf(m, "%s=%lu", stat->name, > @@ -4093,7 +4112,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) > > BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); > > - mem_cgroup_flush_stats(); > + mem_cgroup_try_flush_stats(); > > for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { > unsigned long nr; > @@ -4595,7 +4614,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, > struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); > struct mem_cgroup *parent; > > - mem_cgroup_flush_stats(); > + mem_cgroup_try_flush_stats(); > > *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); > *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); > @@ -6610,7 +6629,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v) > int i; > struct mem_cgroup *memcg = mem_cgroup_from_seq(m); > > - mem_cgroup_flush_stats(); > + mem_cgroup_try_flush_stats(); > > for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { > int nid; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index c7c149cb8d66..457a18921fda 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2923,7 +2923,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) > * Flush the memory cgroup stats, so that we read accurate per-memcg > * lruvec stats for 
heuristics. > */ > - mem_cgroup_flush_stats(); > + mem_cgroup_try_flush_stats(); > > /* > * Determine the scan balance between anon and file LRUs. > diff --git a/mm/workingset.c b/mm/workingset.c > index da58a26d0d4d..affb8699e58d 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -520,7 +520,7 @@ void workingset_refault(struct folio *folio, void *shadow) > } > > /* Flush stats (and potentially sleep) before holding RCU read lock */ > - mem_cgroup_flush_stats_ratelimited(); > + mem_cgroup_try_flush_stats_ratelimited(); > > rcu_read_lock(); > > @@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, > struct lruvec *lruvec; > int i; > > - mem_cgroup_flush_stats(); > + mem_cgroup_try_flush_stats(); > lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid)); > for (pages = 0, i = 0; i < NR_LRU_LISTS; i++) > pages += lruvec_page_state_local(lruvec, > -- > 2.42.0.rc2.253.gd59a3bf2b4-goog -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <ZPXtVLNIXk8trj2k-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing [not found] ` <ZPXtVLNIXk8trj2k-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2023-09-05 15:55 ` Yosry Ahmed 0 siblings, 0 replies; 29+ messages in thread From: Yosry Ahmed @ 2023-09-05 15:55 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon, Sep 4, 2023 at 7:44 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > On Thu 31-08-23 16:56:08, Yosry Ahmed wrote: > > Most contexts that flush memcg stats use "unified" flushing, where > > basically all flushers attempt to flush the entire hierarchy, but only > > one flusher is allowed at a time, others skip flushing. > > > > This is needed because we need to flush the stats from paths such as > > reclaim or refaults, which may have high concurrency, especially on > > large systems. Serializing such performance-sensitive paths can > > introduce regressions, hence, unified flushing offers a tradeoff between > > stats staleness and the performance impact of flushing stats. > > > > Document this properly and explicitly by renaming the common flushing > > helper from do_flush_stats() to do_unified_stats_flush(), and adding > > documentation to describe unified flushing. Additionally, rename > > flushing APIs to add "try" in the name, which implies that flushing will > > not always happen. Also add proper documentation. > > > > No functional change intended. > > > > Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > > No objections to renaming but it would be really nice to simplify this. > It is just "funny" to see 4 different flushing methods (flush from > userspace, flush, try_flush_with_ratelimit_1 and try_flush_with_ratelimit_2). 
> This is all internal so I am not all that worried that this would get > confused but it surely is rather convoluted. > > Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> Thanks! I tried to at least keep the naming consistent and add documentation, but I agree we can do better :) It's less obvious to me tbh where we can improve, but hopefully when someone new to this code comes complaining we'll know better what needs to be changed. > > > --- > > include/linux/memcontrol.h | 8 ++--- > > mm/memcontrol.c | 61 +++++++++++++++++++++++++------------- > > mm/vmscan.c | 2 +- > > mm/workingset.c | 4 +-- > > 4 files changed, 47 insertions(+), 28 deletions(-) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 11810a2cfd2d..d517b0cc5221 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -1034,8 +1034,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, > > return x; > > } > > > > -void mem_cgroup_flush_stats(void); > > -void mem_cgroup_flush_stats_ratelimited(void); > > +void mem_cgroup_try_flush_stats(void); > > +void mem_cgroup_try_flush_stats_ratelimited(void); > > > > void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, > > int val); > > @@ -1519,11 +1519,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, > > return node_page_state(lruvec_pgdat(lruvec), idx); > > } > > > > -static inline void mem_cgroup_flush_stats(void) > > +static inline void mem_cgroup_try_flush_stats(void) > > { > > } > > > > -static inline void mem_cgroup_flush_stats_ratelimited(void) > > +static inline void mem_cgroup_try_flush_stats_ratelimited(void) > > { > > } > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index cf57fe9318d5..2d0ec828a1c4 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -588,7 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) > > static void 
flush_memcg_stats_dwork(struct work_struct *w); > > static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork); > > static DEFINE_PER_CPU(unsigned int, stats_updates); > > -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0); > > +static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0); > > static atomic_t stats_flush_threshold = ATOMIC_INIT(0); > > static u64 flush_next_time; > > > > @@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) > > /* > > * If stats_flush_threshold exceeds the threshold > > * (>num_online_cpus()), cgroup stats update will be triggered > > - * in __mem_cgroup_flush_stats(). Increasing this var further > > + * in mem_cgroup_try_flush_stats(). Increasing this var further > > * is redundant and simply adds overhead in atomic update. > > */ > > if (atomic_read(&stats_flush_threshold) <= num_online_cpus()) > > @@ -639,15 +639,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) > > } > > } > > > > -static void do_flush_stats(void) > > +/* > > + * do_unified_stats_flush - do a unified flush of memory cgroup statistics > > + * > > + * A unified flush tries to flush the entire hierarchy, but skips if there is > > + * another ongoing flush. This is meant for flushers that may have a lot of > > + * concurrency (e.g. reclaim, refault, etc), and should not be serialized to > > + * avoid slowing down performance-sensitive paths. A unified flush may skip, and > > + * hence may yield stale stats. > > + */ > > +static void do_unified_stats_flush(void) > > { > > - /* > > - * We always flush the entire tree, so concurrent flushers can just > > - * skip. This avoids a thundering herd problem on the rstat global lock > > - * from memcg flushers (e.g. reclaim, refault, etc). 
> > - */ > > - if (atomic_read(&stats_flush_ongoing) || > > - atomic_xchg(&stats_flush_ongoing, 1)) > > + if (atomic_read(&stats_unified_flush_ongoing) || > > + atomic_xchg(&stats_unified_flush_ongoing, 1)) > > return; > > > > WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME); > > @@ -655,19 +659,34 @@ static void do_flush_stats(void) > > cgroup_rstat_flush(root_mem_cgroup->css.cgroup); > > > > atomic_set(&stats_flush_threshold, 0); > > - atomic_set(&stats_flush_ongoing, 0); > > + atomic_set(&stats_unified_flush_ongoing, 0); > > } > > > > -void mem_cgroup_flush_stats(void) > > +/* > > + * mem_cgroup_try_flush_stats - try to flush memory cgroup statistics > > + * > > + * Try to flush the stats of all memcgs that have stat updates since the last > > + * flush. We do not flush the stats if: > > + * - The magnitude of the pending updates is below a certain threshold. > > + * - There is another ongoing unified flush (see do_unified_stats_flush()). > > + * > > + * Hence, the stats may be stale, but ideally by less than FLUSH_TIME due to > > + * periodic flushing. > > + */ > > +void mem_cgroup_try_flush_stats(void) > > { > > if (atomic_read(&stats_flush_threshold) > num_online_cpus()) > > - do_flush_stats(); > > + do_unified_stats_flush(); > > } > > > > -void mem_cgroup_flush_stats_ratelimited(void) > > +/* > > + * Like mem_cgroup_try_flush_stats(), but only flushes if the periodic flusher > > + * is late. > > + */ > > +void mem_cgroup_try_flush_stats_ratelimited(void) > > { > > if (time_after64(jiffies_64, READ_ONCE(flush_next_time))) > > - mem_cgroup_flush_stats(); > > + mem_cgroup_try_flush_stats(); > > } > > > > static void flush_memcg_stats_dwork(struct work_struct *w) > > @@ -676,7 +695,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w) > > * Always flush here so that flushing in latency-sensitive paths is > > * as cheap as possible. 
> > */ > > - do_flush_stats(); > > + do_unified_stats_flush(); > > queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME); > > } > > > > @@ -1576,7 +1595,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) > > * > > * Current memory state: > > */ > > - mem_cgroup_flush_stats(); > > + mem_cgroup_try_flush_stats(); > > > > for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { > > u64 size; > > @@ -4018,7 +4037,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) > > int nid; > > struct mem_cgroup *memcg = mem_cgroup_from_seq(m); > > > > - mem_cgroup_flush_stats(); > > + mem_cgroup_try_flush_stats(); > > > > for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { > > seq_printf(m, "%s=%lu", stat->name, > > @@ -4093,7 +4112,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) > > > > BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); > > > > - mem_cgroup_flush_stats(); > > + mem_cgroup_try_flush_stats(); > > > > for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { > > unsigned long nr; > > @@ -4595,7 +4614,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, > > struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); > > struct mem_cgroup *parent; > > > > - mem_cgroup_flush_stats(); > > + mem_cgroup_try_flush_stats(); > > > > *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); > > *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); > > @@ -6610,7 +6629,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v) > > int i; > > struct mem_cgroup *memcg = mem_cgroup_from_seq(m); > > > > - mem_cgroup_flush_stats(); > > + mem_cgroup_try_flush_stats(); > > > > for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { > > int nid; > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index c7c149cb8d66..457a18921fda 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2923,7 +2923,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct 
scan_control *sc) > > * Flush the memory cgroup stats, so that we read accurate per-memcg > > * lruvec stats for heuristics. > > */ > > - mem_cgroup_flush_stats(); > > + mem_cgroup_try_flush_stats(); > > > > /* > > * Determine the scan balance between anon and file LRUs. > > diff --git a/mm/workingset.c b/mm/workingset.c > > index da58a26d0d4d..affb8699e58d 100644 > > --- a/mm/workingset.c > > +++ b/mm/workingset.c > > @@ -520,7 +520,7 @@ void workingset_refault(struct folio *folio, void *shadow) > > } > > > > /* Flush stats (and potentially sleep) before holding RCU read lock */ > > - mem_cgroup_flush_stats_ratelimited(); > > + mem_cgroup_try_flush_stats_ratelimited(); > > > > rcu_read_lock(); > > > > @@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, > > struct lruvec *lruvec; > > int i; > > > > - mem_cgroup_flush_stats(); > > + mem_cgroup_try_flush_stats(); > > lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid)); > > for (pages = 0, i = 0; i < NR_LRU_LISTS; i++) > > pages += lruvec_page_state_local(lruvec, > > -- > > 2.42.0.rc2.253.gd59a3bf2b4-goog > > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20230831165611.2610118-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* [PATCH v4 2/4] mm: memcg: add a helper for non-unified stats flushing
[not found] ` <20230831165611.2610118-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2023-08-31 16:56 ` Yosry Ahmed
[not found]   ` <20230831165611.2610118-3-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2023-08-31 16:56 ` [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
2023-08-31 17:18 ` [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Waiman Long
2 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2023-08-31 16:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Yosry Ahmed

Some contexts flush memcg stats outside of unified flushing, directly
using cgroup_rstat_flush(). Add a helper for non-unified flushing, a
counterpart for do_unified_stats_flush(), and use it in those contexts,
as well as in do_unified_stats_flush() itself.

This abstracts the rstat API and makes it easy to introduce
modifications to either unified or non-unified flushing functions
without changing callers.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 mm/memcontrol.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d0ec828a1c4..8c046feeaae7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -639,6 +639,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
+/*
+ * do_stats_flush - do a flush of the memory cgroup statistics
+ * @memcg: memory cgroup to flush
+ *
+ * Only flushes the subtree of @memcg, does not skip under any conditions.
+ */
+static void do_stats_flush(struct mem_cgroup *memcg)
+{
+	cgroup_rstat_flush(memcg->css.cgroup);
+}
+
 /*
  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
  *
@@ -656,7 +667,7 @@ static void do_unified_stats_flush(void)
 
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 
-	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+	do_stats_flush(root_mem_cgroup);
 
 	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_unified_flush_ongoing, 0);
@@ -7790,7 +7801,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 			break;
 		}
 
-		cgroup_rstat_flush(memcg->css.cgroup);
+		do_stats_flush(memcg);
 		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
 		if (pages < max)
 			continue;
@@ -7855,8 +7866,10 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 static u64 zswap_current_read(struct cgroup_subsys_state *css,
 			      struct cftype *cft)
 {
-	cgroup_rstat_flush(css->cgroup);
-	return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	do_stats_flush(memcg);
+	return memcg_page_state(memcg, MEMCG_ZSWAP_B);
 }
 
 static int zswap_max_show(struct seq_file *m, void *v)
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog

^ permalink raw reply related	[flat|nested] 29+ messages in thread
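[Editor's note] The distinction this patch draws between a direct subtree flush and the skip-if-ongoing "unified" path can be illustrated with a small userspace model. This is a hypothetical sketch using C11 atomics; the names mirror the kernel counters (stats_unified_flush_ongoing, stats_flush_threshold), but nothing below is kernel code:

```c
#include <assert.h>
#include <stdatomic.h>

/* Models of the kernel's unified-flush bookkeeping (hypothetical). */
static atomic_int flush_ongoing;     /* like stats_unified_flush_ongoing */
static atomic_long flush_threshold;  /* like stats_flush_threshold */

/*
 * Unified flushing: only one flusher may proceed at a time; everyone
 * else skips instead of waiting, and the winner flushes the whole tree.
 */
static int try_unified_flush(void (*flush_whole_tree)(void))
{
	int expected = 0;

	/* If a flush is already in flight, skip rather than block. */
	if (!atomic_compare_exchange_strong(&flush_ongoing, &expected, 1))
		return 0;

	flush_whole_tree();	/* the single winner pays for the full tree */
	atomic_store(&flush_threshold, 0);
	atomic_store(&flush_ongoing, 0);
	return 1;
}
```

A caller that loses the race simply returns and reads whatever stats are already aggregated, which is exactly the staleness this series tries to bound for userspace readers; do_stats_flush() is the opposite trade: it never skips and only touches the subtree.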
[parent not found: <20230831165611.2610118-3-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 2/4] mm: memcg: add a helper for non-unified stats flushing
[not found] ` <20230831165611.2610118-3-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2023-09-04 14:45 ` Michal Hocko
0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2023-09-04 14:45 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Thu 31-08-23 16:56:09, Yosry Ahmed wrote:
> Some contexts flush memcg stats outside of unified flushing, directly
> using cgroup_rstat_flush(). Add a helper for non-unified flushing, a
> counterpart for do_unified_stats_flush(), and use it in those contexts,
> as well as in do_unified_stats_flush() itself.
> 
> This abstracts the rstat API and makes it easy to introduce
> modifications to either unified or non-unified flushing functions
> without changing callers.
> 
> No functional change intended.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

> ---
>  mm/memcontrol.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2d0ec828a1c4..8c046feeaae7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -639,6 +639,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  	}
>  }
>  
> +/*
> + * do_stats_flush - do a flush of the memory cgroup statistics
> + * @memcg: memory cgroup to flush
> + *
> + * Only flushes the subtree of @memcg, does not skip under any conditions.
> + */
> +static void do_stats_flush(struct mem_cgroup *memcg)
> +{
> +	cgroup_rstat_flush(memcg->css.cgroup);
> +}
> +
>  /*
>   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
>   *
> @@ -656,7 +667,7 @@ static void do_unified_stats_flush(void)
>  
>  	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
>  
> -	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> +	do_stats_flush(root_mem_cgroup);
>  
>  	atomic_set(&stats_flush_threshold, 0);
>  	atomic_set(&stats_unified_flush_ongoing, 0);
> @@ -7790,7 +7801,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>  			break;
>  		}
>  
> -		cgroup_rstat_flush(memcg->css.cgroup);
> +		do_stats_flush(memcg);
>  		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
>  		if (pages < max)
>  			continue;
> @@ -7855,8 +7866,10 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
>  static u64 zswap_current_read(struct cgroup_subsys_state *css,
>  			      struct cftype *cft)
>  {
> -	cgroup_rstat_flush(css->cgroup);
> -	return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +	do_stats_flush(memcg);
> +	return memcg_page_state(memcg, MEMCG_ZSWAP_B);
> }
> 
> static int zswap_max_show(struct seq_file *m, void *v)
> -- 
> 2.42.0.rc2.253.gd59a3bf2b4-goog
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
* [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
[not found] ` <20230831165611.2610118-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2023-08-31 16:56 ` [PATCH v4 2/4] mm: memcg: add a helper for non-unified " Yosry Ahmed
@ 2023-08-31 16:56 ` Yosry Ahmed
[not found]   ` <20230831165611.2610118-5-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2023-08-31 17:18 ` [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Waiman Long
2 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2023-08-31 16:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Yosry Ahmed

Unified flushing allows for great concurrency for paths that attempt to
flush the stats, at the expense of potential staleness and a single
flusher paying the extra cost of flushing the full tree.

This tradeoff makes sense for in-kernel flushers that may observe high
concurrency (e.g. reclaim, refault). For userspace readers, stale stats
may be unexpected and problematic, especially when such stats are used
for critical paths such as userspace OOM handling. Additionally, a
userspace reader will occasionally pay the cost of flushing the entire
hierarchy, which also causes problems in some cases [1].

Opt userspace reads out of unified flushing. This makes the cost of
reading the stats more predictable (proportional to the size of the
subtree), as well as the freshness of the stats. Userspace readers are
not expected to have similar concurrency to in-kernel flushers,
serializing them among themselves and among in-kernel flushers should be
okay. Nonetheless, for extra safety, introduce a mutex when flushing for
userspace readers to make sure only a single userspace reader can compete
with in-kernel flushers at a time. This takes away userspace ability to
directly influence or hurt in-kernel lock contention.

An alternative is to remove flushing from the stats reading path
completely, and rely on the periodic flusher. This should be accompanied
by making the periodic flushing period tunable, and providing an
interface for userspace to force a flush, following a similar model to
/proc/vmstat. However, such a change will be hard to reverse if the
implementation needs to be changed because:
- The cost of reading stats will be very cheap and we won't be able to
  take that back easily.
- There are user-visible interfaces involved.

Hence, let's go with the change that's most reversible first and revisit
as needed.

This was tested on a machine with 256 cpus by running a synthetic test
script [2] that creates 50 top-level cgroups, each with 5 children (250
leaf cgroups). Each leaf cgroup has 10 processes running that allocate
memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
unified flusher). Concurrently, one thread is spawned per-cgroup to read
the stats every second (including root, top-level, and leaf cgroups --
so total 251 threads). No significant regressions were observed in the
total run time, which means that userspace readers are not significantly
affecting in-kernel flushers:

Base (mm-unstable):

real	0m22.500s
user	0m9.399s
sys	73m41.381s

real	0m22.749s
user	0m15.648s
sys	73m13.113s

real	0m22.466s
user	0m10.000s
sys	73m11.933s

With this patch:

real	0m23.092s
user	0m10.110s
sys	75m42.774s

real	0m22.277s
user	0m10.443s
sys	72m7.182s

real	0m24.127s
user	0m12.617s
sys	78m52.765s

[1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/
[2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/

Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 mm/memcontrol.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94d5a6751a9e..46a7abf71c73 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
+static DEFINE_MUTEX(stats_user_flush_mutex);
 static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
@@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
 	cgroup_rstat_flush(memcg->css.cgroup);
 }
 
+/*
+ * mem_cgroup_user_flush_stats - do a stats flush for a user read
+ * @memcg: memory cgroup to flush
+ *
+ * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
+ * the global rstat spinlock. This protects in-kernel flushers from userspace
+ * readers hogging the lock.
+ */
+static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
+{
+	mutex_lock(&stats_user_flush_mutex);
+	do_stats_flush(memcg);
+	mutex_unlock(&stats_user_flush_mutex);
+}
+
 /*
  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
  *
@@ -1608,7 +1624,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4050,7 +4066,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4125,7 +4141,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -6642,7 +6658,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog

^ permalink raw reply related	[flat|nested] 29+ messages in thread
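[Editor's note] The serialization scheme in this patch, at most one userspace reader inside the flush at a time, can be modeled in userspace with a pthread mutex. This is an illustrative sketch, not the kernel implementation; `flush_subtree` is a hypothetical stand-in for the rstat flush of the subtree being read:

```c
#include <assert.h>
#include <pthread.h>

/*
 * Model of mem_cgroup_user_flush_stats(): serialize userspace readers
 * on a dedicated mutex so that at most one of them competes with
 * in-kernel flushers for the underlying rstat lock at any time.
 */
static pthread_mutex_t user_flush_mutex = PTHREAD_MUTEX_INITIALIZER;

static void user_flush_stats(void (*flush_subtree)(void *), void *subtree)
{
	pthread_mutex_lock(&user_flush_mutex);
	flush_subtree(subtree);	/* flush only the subtree being read */
	pthread_mutex_unlock(&user_flush_mutex);
}
```

The design point being debated in the replies is visible here: the mutex bounds how many readers hit the shared lock, but readers queue on it with no bound on their own wait time.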
[parent not found: <20230831165611.2610118-5-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
[not found] ` <20230831165611.2610118-5-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2023-09-04 15:15 ` Michal Hocko
[not found]   ` <ZPX0kCKd4TaVLJY7-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-04 15:15 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> Unified flushing allows for great concurrency for paths that attempt to
> flush the stats, at the expense of potential staleness and a single
> flusher paying the extra cost of flushing the full tree.
> 
> This tradeoff makes sense for in-kernel flushers that may observe high
> concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> may be unexpected and problematic, especially when such stats are used
> for critical paths such as userspace OOM handling. Additionally, a
> userspace reader will occasionally pay the cost of flushing the entire
> hierarchy, which also causes problems in some cases [1].
> 
> Opt userspace reads out of unified flushing. This makes the cost of
> reading the stats more predictable (proportional to the size of the
> subtree), as well as the freshness of the stats. Userspace readers are
> not expected to have similar concurrency to in-kernel flushers,
> serializing them among themselves and among in-kernel flushers should be
> okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> userspace readers to make sure only a single userspace reader can compete
> with in-kernel flushers at a time. This takes away userspace ability to
> directly influence or hurt in-kernel lock contention.

I think it would be helpful to note that the primary reason this is a
concern is that the spinlock is dropped during flushing under
contention.

> An alternative is to remove flushing from the stats reading path
> completely, and rely on the periodic flusher. This should be accompanied
> by making the periodic flushing period tunable, and providing an
> interface for userspace to force a flush, following a similar model to
> /proc/vmstat. However, such a change will be hard to reverse if the
> implementation needs to be changed because:
> - The cost of reading stats will be very cheap and we won't be able to
>   take that back easily.
> - There are user-visible interfaces involved.
> 
> Hence, let's go with the change that's most reversible first and revisit
> as needed.
> 
> This was tested on a machine with 256 cpus by running a synthetic test
> script [2] that creates 50 top-level cgroups, each with 5 children (250
> leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> unified flusher). Concurrently, one thread is spawned per-cgroup to read
> the stats every second (including root, top-level, and leaf cgroups --
> so total 251 threads). No significant regressions were observed in the
> total run time, which means that userspace readers are not significantly
> affecting in-kernel flushers:
> 
> Base (mm-unstable):
> 
> real	0m22.500s
> user	0m9.399s
> sys	73m41.381s
> 
> real	0m22.749s
> user	0m15.648s
> sys	73m13.113s
> 
> real	0m22.466s
> user	0m10.000s
> sys	73m11.933s
> 
> With this patch:
> 
> real	0m23.092s
> user	0m10.110s
> sys	75m42.774s
> 
> real	0m22.277s
> user	0m10.443s
> sys	72m7.182s
> 
> real	0m24.127s
> user	0m12.617s
> sys	78m52.765s
> 
> [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/
> [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/
> 
> Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

OK, I can live with that but I still believe that locking involved in
the user interface only begs for issues later on as there is no control
over that lock contention other than the number of processes involved.
As it seems that we cannot make a consensus on this concern now and this
should be already helping existing workloads then let's just buy some
more time ;)

Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

Thanks!

> ---
>  mm/memcontrol.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 94d5a6751a9e..46a7abf71c73 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>  static DEFINE_PER_CPU(unsigned int, stats_updates);
> +static DEFINE_MUTEX(stats_user_flush_mutex);
>  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
>  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
>  static u64 flush_next_time;
> @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
>  	cgroup_rstat_flush(memcg->css.cgroup);
>  }
>  
> +/*
> + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> + * @memcg: memory cgroup to flush
> + *
> + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> + * the global rstat spinlock. This protects in-kernel flushers from userspace
> + * readers hogging the lock.

readers hogging the lock as do_stats_flush drops the spinlock under
contention.

> + */
> +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> +{
> +	mutex_lock(&stats_user_flush_mutex);
> +	do_stats_flush(memcg);
> +	mutex_unlock(&stats_user_flush_mutex);
> +}
> +
>  /*
>   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
>   *
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
[parent not found: <ZPX0kCKd4TaVLJY7-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
[not found] ` <ZPX0kCKd4TaVLJY7-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2023-09-05 15:57 ` Yosry Ahmed
2023-09-08  0:52 ` Wei Xu
1 sibling, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2023-09-05 15:57 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
>
> On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > Unified flushing allows for great concurrency for paths that attempt to
> > flush the stats, at the expense of potential staleness and a single
> > flusher paying the extra cost of flushing the full tree.
> >
> > This tradeoff makes sense for in-kernel flushers that may observe high
> > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > may be unexpected and problematic, especially when such stats are used
> > for critical paths such as userspace OOM handling. Additionally, a
> > userspace reader will occasionally pay the cost of flushing the entire
> > hierarchy, which also causes problems in some cases [1].
> >
> > Opt userspace reads out of unified flushing. This makes the cost of
> > reading the stats more predictable (proportional to the size of the
> > subtree), as well as the freshness of the stats. Userspace readers are
> > not expected to have similar concurrency to in-kernel flushers,
> > serializing them among themselves and among in-kernel flushers should be
> > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > userspace readers to make sure only a single userspace reader can compete
> > with in-kernel flushers at a time. This takes away userspace ability to
> > directly influence or hurt in-kernel lock contention.
>
> I think it would be helpful to note that the primary reason this is a
> concern is that the spinlock is dropped during flushing under
> contention.
>
> > An alternative is to remove flushing from the stats reading path
> > completely, and rely on the periodic flusher. This should be accompanied
> > by making the periodic flushing period tunable, and providing an
> > interface for userspace to force a flush, following a similar model to
> > /proc/vmstat. However, such a change will be hard to reverse if the
> > implementation needs to be changed because:
> > - The cost of reading stats will be very cheap and we won't be able to
> >   take that back easily.
> > - There are user-visible interfaces involved.
> >
> > Hence, let's go with the change that's most reversible first and revisit
> > as needed.
> >
> > This was tested on a machine with 256 cpus by running a synthetic test
> > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > the stats every second (including root, top-level, and leaf cgroups --
> > so total 251 threads). No significant regressions were observed in the
> > total run time, which means that userspace readers are not significantly
> > affecting in-kernel flushers:
> >
> > Base (mm-unstable):
> >
> > real	0m22.500s
> > user	0m9.399s
> > sys	73m41.381s
> >
> > real	0m22.749s
> > user	0m15.648s
> > sys	73m13.113s
> >
> > real	0m22.466s
> > user	0m10.000s
> > sys	73m11.933s
> >
> > With this patch:
> >
> > real	0m23.092s
> > user	0m10.110s
> > sys	75m42.774s
> >
> > real	0m22.277s
> > user	0m10.443s
> > sys	72m7.182s
> >
> > real	0m24.127s
> > user	0m12.617s
> > sys	78m52.765s
> >
> > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/
> > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>
> OK, I can live with that but I still believe that locking involved in
> the user interface only begs for issues later on as there is no control
> over that lock contention other than the number of processes involved.
> As it seems that we cannot make a consensus on this concern now and this
> should be already helping existing workloads then let's just buy some
> more time ;)
>
> Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

Thanks!

I agree, let's fix problems if/when they arise, maybe it will be just
fine :)

I will send a v5 collecting Ack's and augmenting the changelog and
comment below as you suggested (probably after we resolve patch 3).

>
> Thanks!

> > ---
> >  mm/memcontrol.c | 24 ++++++++++++++++++++----
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 94d5a6751a9e..46a7abf71c73 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > +static DEFINE_MUTEX(stats_user_flush_mutex);
> >  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
> >  	cgroup_rstat_flush(memcg->css.cgroup);
> >  }
> >  
> > +/*
> > + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> > + * @memcg: memory cgroup to flush
> > + *
> > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> > + * the global rstat spinlock. This protects in-kernel flushers from userspace
> > + * readers hogging the lock.
>
> readers hogging the lock as do_stats_flush drops the spinlock under
> contention.
>
> > + */
> > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> > +{
> > +	mutex_lock(&stats_user_flush_mutex);
> > +	do_stats_flush(memcg);
> > +	mutex_unlock(&stats_user_flush_mutex);
> > +}
> > +
> >  /*
> >  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> >  *
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
[not found] ` <ZPX0kCKd4TaVLJY7-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-09-05 15:57 ` Yosry Ahmed
@ 2023-09-08  0:52 ` Wei Xu
[not found]   ` <CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-09-11 13:11 ` Michal Hocko
1 sibling, 2 replies; 29+ messages in thread
From: Wei Xu @ 2023-09-08 0:52 UTC (permalink / raw)
To: Michal Hocko
Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo,
	Michal Koutný, Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Greg Thelen

On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
>
> On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > Unified flushing allows for great concurrency for paths that attempt to
> > flush the stats, at the expense of potential staleness and a single
> > flusher paying the extra cost of flushing the full tree.
> >
> > This tradeoff makes sense for in-kernel flushers that may observe high
> > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > may be unexpected and problematic, especially when such stats are used
> > for critical paths such as userspace OOM handling. Additionally, a
> > userspace reader will occasionally pay the cost of flushing the entire
> > hierarchy, which also causes problems in some cases [1].
> >
> > Opt userspace reads out of unified flushing. This makes the cost of
> > reading the stats more predictable (proportional to the size of the
> > subtree), as well as the freshness of the stats. Userspace readers are
> > not expected to have similar concurrency to in-kernel flushers,
> > serializing them among themselves and among in-kernel flushers should be
> > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > userspace readers to make sure only a single userspace reader can compete
> > with in-kernel flushers at a time. This takes away userspace ability to
> > directly influence or hurt in-kernel lock contention.
>
> I think it would be helpful to note that the primary reason this is a
> concern is that the spinlock is dropped during flushing under
> contention.
>
> > An alternative is to remove flushing from the stats reading path
> > completely, and rely on the periodic flusher. This should be accompanied
> > by making the periodic flushing period tunable, and providing an
> > interface for userspace to force a flush, following a similar model to
> > /proc/vmstat. However, such a change will be hard to reverse if the
> > implementation needs to be changed because:
> > - The cost of reading stats will be very cheap and we won't be able to
> >   take that back easily.
> > - There are user-visible interfaces involved.
> >
> > Hence, let's go with the change that's most reversible first and revisit
> > as needed.
> >
> > This was tested on a machine with 256 cpus by running a synthetic test
> > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > the stats every second (including root, top-level, and leaf cgroups --
> > so total 251 threads). No significant regressions were observed in the
> > total run time, which means that userspace readers are not significantly
> > affecting in-kernel flushers:
> >
> > Base (mm-unstable):
> >
> > real	0m22.500s
> > user	0m9.399s
> > sys	73m41.381s
> >
> > real	0m22.749s
> > user	0m15.648s
> > sys	73m13.113s
> >
> > real	0m22.466s
> > user	0m10.000s
> > sys	73m11.933s
> >
> > With this patch:
> >
> > real	0m23.092s
> > user	0m10.110s
> > sys	75m42.774s
> >
> > real	0m22.277s
> > user	0m10.443s
> > sys	72m7.182s
> >
> > real	0m24.127s
> > user	0m12.617s
> > sys	78m52.765s
> >
> > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/
> > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>
> OK, I can live with that but I still believe that locking involved in
> the user interface only begs for issues later on as there is no control
> over that lock contention other than the number of processes involved.
> As it seems that we cannot make a consensus on this concern now and this
> should be already helping existing workloads then let's just buy some
> more time ;)

Indeed, even though the new global mutex protects the kernel from the
userspace contention on the rstats spinlock, its current implementation
doesn't have any protection for the lock contention among the userspace
threads and can cause significant delays to memcg stats reads.

I tested this patch on a machine with 384 CPUs using a microbenchmark
that spawns 10K threads, each reading its memory.stat every 100
milliseconds. Most of memory.stat reads take 5ms-10ms in kernel, with
~5% reads even exceeding 1 second. This is a significant regression. In
comparison, without contention, each memory.stat read only takes
20us-50us in the kernel.

Almost all of the extra read time is spent on waiting for the new
mutex. The time to flush rstats only accounts for 10us-50us (This test
creates only 1K memory cgroups and doesn't generate any loads other
than these stat reader threads).

Here are some ideas to control the lock contention on the mutex and
reduce both the median and tail latencies of concurrent memcg stat
reads:

- Bring back the stats_flush_threshold check in
  mem_cgroup_try_flush_stats() to mem_cgroup_user_flush_stats(). This
  check provides a reasonable bound on the stats staleness while being
  able to filter out a large number of rstats flush requests, which
  reduces the contention on the mutex.

- When contended, upgrade the per-memcg rstats flush in
  mem_cgroup_user_flush_stats() to a root memcg flush and coalesce
  these contended flushes together. We can wait for the ongoing flush
  to complete and eliminate repeated flush requests.

- Wait for the mutex and the ongoing flush with a timeout. We should
  not use busy-wait, though. We can bail out to read the stats without
  a flush after enough wait. A long-stalled system call is much worse
  than somewhat stale stats in the corner cases in my opinion.

Wei Xu

> Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
>
> Thanks!

> > ---
> >  mm/memcontrol.c | 24 ++++++++++++++++++++----
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 94d5a6751a9e..46a7abf71c73 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > +static DEFINE_MUTEX(stats_user_flush_mutex);
> >  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
> >  	cgroup_rstat_flush(memcg->css.cgroup);
> >  }
> >  
> > +/*
> > + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> > + * @memcg: memory cgroup to flush
> > + *
> > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> > + * the global rstat spinlock. This protects in-kernel flushers from userspace
> > + * readers hogging the lock.
>
> readers hogging the lock as do_stats_flush drops the spinlock under
> contention.
>
> > + */
> > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> > +{
> > +	mutex_lock(&stats_user_flush_mutex);
> > +	do_stats_flush(memcg);
> > +	mutex_unlock(&stats_user_flush_mutex);
> > +}
> > +
> >  /*
> >  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> >  *
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread
[parent not found: <CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2023-09-08 1:02 ` Ivan Babrou [not found] ` <CABWYdi1WNp9f20nRFEExn8QB1MwP7QXwvD6Q8xHHuTO2SUTLkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 29+ messages in thread From: Ivan Babrou @ 2023-09-08 1:02 UTC (permalink / raw) To: Wei Xu Cc: Michal Hocko, Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Tejun Heo, Michal Koutný, Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Greg Thelen On Thu, Sep 7, 2023 at 5:52 PM Wei Xu <weixugc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote: > > On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > > > On Thu 31-08-23 16:56:11, Yosry Ahmed wrote: > > > Unified flushing allows for great concurrency for paths that attempt to > > > flush the stats, at the expense of potential staleness and a single > > > flusher paying the extra cost of flushing the full tree. > > > > > > This tradeoff makes sense for in-kernel flushers that may observe high > > > concurrency (e.g. reclaim, refault). For userspace readers, stale stats > > > may be unexpected and problematic, especially when such stats are used > > > for critical paths such as userspace OOM handling. Additionally, a > > > userspace reader will occasionally pay the cost of flushing the entire > > > hierarchy, which also causes problems in some cases [1]. > > > > > > Opt userspace reads out of unified flushing. This makes the cost of > > > reading the stats more predictable (proportional to the size of the > > > subtree), as well as the freshness of the stats. 
Userspace readers are > > > not expected to have similar concurrency to in-kernel flushers, > > > serializing them among themselves and among in-kernel flushers should be > > > okay. Nonetheless, for extra safety, introduce a mutex when flushing for > > > userspace readers to make sure only a single userspace reader can compete > > > with in-kernel flushers at a time. This takes away userspace ability to > > > directly influence or hurt in-kernel lock contention. > > > > I think it would be helpful to note that the primary reason this is a > > concern is that the spinlock is dropped during flushing under > > contention. > > > > > An alternative is to remove flushing from the stats reading path > > > completely, and rely on the periodic flusher. This should be accompanied > > > by making the periodic flushing period tunable, and providing an > > > interface for userspace to force a flush, following a similar model to > > > /proc/vmstat. However, such a change will be hard to reverse if the > > > implementation needs to be changed because: > > > - The cost of reading stats will be very cheap and we won't be able to > > > take that back easily. > > > - There are user-visible interfaces involved. > > > > > > Hence, let's go with the change that's most reversible first and revisit > > > as needed. > > > > > > This was tested on a machine with 256 cpus by running a synthetic test > > > script [2] that creates 50 top-level cgroups, each with 5 children (250 > > > leaf cgroups). Each leaf cgroup has 10 processes running that allocate > > > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel > > > unified flusher). Concurrently, one thread is spawned per-cgroup to read > > > the stats every second (including root, top-level, and leaf cgroups -- > > > so total 251 threads). 
No significant regressions were observed in the > > > total run time, which means that userspace readers are not significantly > > > affecting in-kernel flushers: > > > > > > Base (mm-unstable): > > > > > > real 0m22.500s > > > user 0m9.399s > > > sys 73m41.381s > > > > > > real 0m22.749s > > > user 0m15.648s > > > sys 73m13.113s > > > > > > real 0m22.466s > > > user 0m10.000s > > > sys 73m11.933s > > > > > > With this patch: > > > > > > real 0m23.092s > > > user 0m10.110s > > > sys 75m42.774s > > > > > > real 0m22.277s > > > user 0m10.443s > > > sys 72m7.182s > > > > > > real 0m24.127s > > > user 0m12.617s > > > sys 78m52.765s > > > > > > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ > > > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > > > > OK, I can live with that but I still believe that locking involved in > > the user interface only begs for issues later on as there is no control > > over that lock contention other than the number of processes involved. > > As it seems that we cannot make a consensus on this concern now and this > > should be already helping existing workloads then let's just buy some > > more time ;) > > Indeed, even though the new global mutex protects the kernel from the > userspace contention on the rstats spinlock, its current > implementation doesn't have any protection for the lock contention > among the userspace threads and can cause significant delays to memcg > stats reads. > > I tested this patch on a machine with 384 CPUs using a microbenchmark > that spawns 10K threads, each reading its memory.stat every 100 > milliseconds. Most of memory.stat reads take 5ms-10ms in kernel, with > ~5% reads even exceeding 1 second. This is a significant regression. 
> In comparison, without contention, each memory.stat read only takes > 20us-50us in the kernel. Almost all of the extra read time is spent > on waiting for the new mutex. The time to flush rstats only accounts > for 10us-50us (This test creates only 1K memory cgroups and doesn't > generate any loads other than these stat reader threads). > > Here are some ideas to control the lock contention on the mutex and > reduce both the median and tail latencies of concurrent memcg stat > reads: > > - Bring back the stats_flush_threshold check in > mem_cgroup_try_flush_stats() to mem_cgroup_user_flush_stats(). This > check provides a reasonable bound on the stats staleness while being > able to filter out a large number of rstats flush requests, which > reduces the contention on the mutex. > > - When contended, upgrade the per-memcg rstats flush in > mem_cgroup_user_flush_stats() to a root memcg flush and coalesce these > contended flushes together. We can wait for the ongoing flush to > complete and eliminate repeated flush requests. Full root memcg flush being slow is one of the issues that prompted this patch: * https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ I don't want us to regress in this regard. > - Wait for the mutex and the ongoing flush with a timeout. We should > not use busy-wait, though. We can bail out to read the stats without > a flush after enough wait. A long-stalled system call is much worse > than somewhat stale stats in the corner cases in my opinion. > > Wei Xu > > > Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> > > > > Thanks! 
> > > > > --- > > > mm/memcontrol.c | 24 ++++++++++++++++++++---- > > > 1 file changed, 20 insertions(+), 4 deletions(-) > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index 94d5a6751a9e..46a7abf71c73 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) > > > static void flush_memcg_stats_dwork(struct work_struct *w); > > > static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork); > > > static DEFINE_PER_CPU(unsigned int, stats_updates); > > > +static DEFINE_MUTEX(stats_user_flush_mutex); > > > static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0); > > > static atomic_t stats_flush_threshold = ATOMIC_INIT(0); > > > static u64 flush_next_time; > > > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg) > > > cgroup_rstat_flush(memcg->css.cgroup); > > > } > > > > > > +/* > > > + * mem_cgroup_user_flush_stats - do a stats flush for a user read > > > + * @memcg: memory cgroup to flush > > > + * > > > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate > > > + * the global rstat spinlock. This protects in-kernel flushers from userspace > > > + * readers hogging the lock. > > > > readers hogging the lock as do_stats_flush drops the spinlock under > > contention. > > > > > + */ > > > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg) > > > +{ > > > + mutex_lock(&stats_user_flush_mutex); > > > + do_stats_flush(memcg); > > > + mutex_unlock(&stats_user_flush_mutex); > > > +} > > > + > > > /* > > > * do_unified_stats_flush - do a unified flush of memory cgroup statistics > > > * > > -- > > Michal Hocko > > SUSE Labs > > ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <CABWYdi1WNp9f20nRFEExn8QB1MwP7QXwvD6Q8xHHuTO2SUTLkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <CABWYdi1WNp9f20nRFEExn8QB1MwP7QXwvD6Q8xHHuTO2SUTLkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2023-09-08 1:11 ` Yosry Ahmed 0 siblings, 0 replies; 29+ messages in thread From: Yosry Ahmed @ 2023-09-08 1:11 UTC (permalink / raw) To: Ivan Babrou Cc: Wei Xu, Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Tejun Heo, Michal Koutný, Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Greg Thelen On Thu, Sep 7, 2023 at 6:03 PM Ivan Babrou <ivan-lDpJ742SOEtZroRs9YW3xA@public.gmane.org> wrote: > > On Thu, Sep 7, 2023 at 5:52 PM Wei Xu <weixugc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote: > > > > On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > > > > > On Thu 31-08-23 16:56:11, Yosry Ahmed wrote: > > > > Unified flushing allows for great concurrency for paths that attempt to > > > > flush the stats, at the expense of potential staleness and a single > > > > flusher paying the extra cost of flushing the full tree. > > > > > > > > This tradeoff makes sense for in-kernel flushers that may observe high > > > > concurrency (e.g. reclaim, refault). For userspace readers, stale stats > > > > may be unexpected and problematic, especially when such stats are used > > > > for critical paths such as userspace OOM handling. Additionally, a > > > > userspace reader will occasionally pay the cost of flushing the entire > > > > hierarchy, which also causes problems in some cases [1]. > > > > > > > > Opt userspace reads out of unified flushing. This makes the cost of > > > > reading the stats more predictable (proportional to the size of the > > > > subtree), as well as the freshness of the stats. 
Userspace readers are > > > > not expected to have similar concurrency to in-kernel flushers, > > > > serializing them among themselves and among in-kernel flushers should be > > > > okay. Nonetheless, for extra safety, introduce a mutex when flushing for > > > > userspace readers to make sure only a single userspace reader can compete > > > > with in-kernel flushers at a time. This takes away userspace ability to > > > > directly influence or hurt in-kernel lock contention. > > > > > > I think it would be helpful to note that the primary reason this is a > > > concern is that the spinlock is dropped during flushing under > > > contention. > > > > > > > An alternative is to remove flushing from the stats reading path > > > > completely, and rely on the periodic flusher. This should be accompanied > > > > by making the periodic flushing period tunable, and providing an > > > > interface for userspace to force a flush, following a similar model to > > > > /proc/vmstat. However, such a change will be hard to reverse if the > > > > implementation needs to be changed because: > > > > - The cost of reading stats will be very cheap and we won't be able to > > > > take that back easily. > > > > - There are user-visible interfaces involved. > > > > > > > > Hence, let's go with the change that's most reversible first and revisit > > > > as needed. > > > > > > > > This was tested on a machine with 256 cpus by running a synthetic test > > > > script [2] that creates 50 top-level cgroups, each with 5 children (250 > > > > leaf cgroups). Each leaf cgroup has 10 processes running that allocate > > > > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel > > > > unified flusher). Concurrently, one thread is spawned per-cgroup to read > > > > the stats every second (including root, top-level, and leaf cgroups -- > > > > so total 251 threads). 
No significant regressions were observed in the > > > > total run time, which means that userspace readers are not significantly > > > > affecting in-kernel flushers: > > > > > > > > Base (mm-unstable): > > > > > > > > real 0m22.500s > > > > user 0m9.399s > > > > sys 73m41.381s > > > > > > > > real 0m22.749s > > > > user 0m15.648s > > > > sys 73m13.113s > > > > > > > > real 0m22.466s > > > > user 0m10.000s > > > > sys 73m11.933s > > > > > > > > With this patch: > > > > > > > > real 0m23.092s > > > > user 0m10.110s > > > > sys 75m42.774s > > > > > > > > real 0m22.277s > > > > user 0m10.443s > > > > sys 72m7.182s > > > > > > > > real 0m24.127s > > > > user 0m12.617s > > > > sys 78m52.765s > > > > > > > > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ > > > > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ > > > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > > > > > > OK, I can live with that but I still believe that locking involved in > > > the user interface only begs for issues later on as there is no control > > > over that lock contention other than the number of processes involved. > > > As it seems that we cannot make a consensus on this concern now and this > > > should be already helping existing workloads then let's just buy some > > > more time ;) > > > > Indeed, even though the new global mutex protects the kernel from the > > userspace contention on the rstats spinlock, its current > > implementation doesn't have any protection for the lock contention > > among the userspace threads and can cause significant delays to memcg > > stats reads. > > > > I tested this patch on a machine with 384 CPUs using a microbenchmark > > that spawns 10K threads, each reading its memory.stat every 100 > > milliseconds. 
Most of memory.stat reads take 5ms-10ms in kernel, with > > ~5% reads even exceeding 1 second. This is a significant regression. > > In comparison, without contention, each memory.stat read only takes > > 20us-50us in the kernel. Almost all of the extra read time is spent > > on waiting for the new mutex. The time to flush rstats only accounts > > for 10us-50us (This test creates only 1K memory cgroups and doesn't > > generate any loads other than these stat reader threads). > > > > Here are some ideas to control the lock contention on the mutex and > > reduce both the median and tail latencies of concurrent memcg stat > > reads: Thanks for the analysis, Wei! I will update the patch series based on your ideas to limit the contention on the userspace read mutex. > > > > > - Bring back the stats_flush_threshold check in > > mem_cgroup_try_flush_stats() to mem_cgroup_user_flush_stats(). This > > check provides a reasonable bound on the stats staleness while being > > able to filter out a large number of rstats flush requests, which > > reduces the contention on the mutex. > > > > - When contended, upgrade the per-memcg rstats flush in > > mem_cgroup_user_flush_stats() to a root memcg flush and coalesce these > > contended flushes together. We can wait for the ongoing flush to > > complete and eliminate repeated flush requests. > > Full root memcg flush being slow is one of the issues that prompted this patch: > > * https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ > > I don't want us to regress in this regard. It will only be a fallback if there is high concurrency among userspace reads, which will cause high contention on the mutex. In that case, the userspace reads will be slowed down by contention, which can be even worse than flush slowness as it is theoretically unbounded. I am working on a v5 now to incorporate Wei's suggestions. 
Would you be able to test that and verify that there are no regressions? > > > > - Wait for the mutex and the ongoing flush with a timeout. We should > > not use busy-wait, though. We can bail out to read the stats without > > a flush after enough wait. A long-stalled system call is much worse > > than somewhat stale stats in the corner cases in my opinion. > > > > Wei Xu > > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-08  0:52             ` Wei Xu
  [not found]                 ` <CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2023-09-11 13:11               ` Michal Hocko
  2023-09-11 19:15                 ` Wei Xu
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-11 13:11 UTC (permalink / raw)
  To: Wei Xu
  Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

On Thu 07-09-23 17:52:12, Wei Xu wrote:
[...]
> I tested this patch on a machine with 384 CPUs using a microbenchmark
> that spawns 10K threads, each reading its memory.stat every 100
> milliseconds.

This is a rather extreme case, but I wouldn't call it utterly insane
though.

> Most of memory.stat reads take 5ms-10ms in kernel, with
> ~5% reads even exceeding 1 second.

Just curious, what would the numbers look like if the mutex were removed
and those threads were contending on the existing spinlock, both with
the lock dropping in place and with it removed? Would you be willing to
give it a shot?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads 2023-09-11 13:11 ` Michal Hocko @ 2023-09-11 19:15 ` Wei Xu 2023-09-11 19:34 ` Michal Hocko 0 siblings, 1 reply; 29+ messages in thread From: Wei Xu @ 2023-09-11 19:15 UTC (permalink / raw) To: Michal Hocko Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen On Mon, Sep 11, 2023 at 6:11 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 07-09-23 17:52:12, Wei Xu wrote: > [...] > > I tested this patch on a machine with 384 CPUs using a microbenchmark > > that spawns 10K threads, each reading its memory.stat every 100 > > milliseconds. > > This is rather extreme case but I wouldn't call it utterly insane > though. > > > Most of memory.stat reads take 5ms-10ms in kernel, with > > ~5% reads even exceeding 1 second. > > Just curious, what would numbers look like if the mutex is removed and > those threads would be condending on the existing spinlock with lock > dropping in place and removed. Would you be willing to give it a shot? Without the mutex and with the spinlock only, the common read latency of memory.stat is still 5ms-10ms in kernel. There are very few reads (<0.003%) going above 10ms and none more than 1 second. > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads 2023-09-11 19:15 ` Wei Xu @ 2023-09-11 19:34 ` Michal Hocko 2023-09-11 20:01 ` Wei Xu 0 siblings, 1 reply; 29+ messages in thread From: Michal Hocko @ 2023-09-11 19:34 UTC (permalink / raw) To: Wei Xu Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen On Mon 11-09-23 12:15:24, Wei Xu wrote: > On Mon, Sep 11, 2023 at 6:11 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 07-09-23 17:52:12, Wei Xu wrote: > > [...] > > > I tested this patch on a machine with 384 CPUs using a microbenchmark > > > that spawns 10K threads, each reading its memory.stat every 100 > > > milliseconds. > > > > This is rather extreme case but I wouldn't call it utterly insane > > though. > > > > > Most of memory.stat reads take 5ms-10ms in kernel, with > > > ~5% reads even exceeding 1 second. > > > > Just curious, what would numbers look like if the mutex is removed and > > those threads would be condending on the existing spinlock with lock > > dropping in place and removed. Would you be willing to give it a shot? > > Without the mutex and with the spinlock only, the common read latency > of memory.stat is still 5ms-10ms in kernel. There are very few reads > (<0.003%) going above 10ms and none more than 1 second. Is this with the existing spinlock dropping and same 10k potentially contending readers? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads 2023-09-11 19:34 ` Michal Hocko @ 2023-09-11 20:01 ` Wei Xu 2023-09-11 20:21 ` Tejun Heo 0 siblings, 1 reply; 29+ messages in thread From: Wei Xu @ 2023-09-11 20:01 UTC (permalink / raw) To: Michal Hocko Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen On Mon, Sep 11, 2023 at 12:34 PM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 11-09-23 12:15:24, Wei Xu wrote: > > On Mon, Sep 11, 2023 at 6:11 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 07-09-23 17:52:12, Wei Xu wrote: > > > [...] > > > > I tested this patch on a machine with 384 CPUs using a microbenchmark > > > > that spawns 10K threads, each reading its memory.stat every 100 > > > > milliseconds. > > > > > > This is rather extreme case but I wouldn't call it utterly insane > > > though. > > > > > > > Most of memory.stat reads take 5ms-10ms in kernel, with > > > > ~5% reads even exceeding 1 second. > > > > > > Just curious, what would numbers look like if the mutex is removed and > > > those threads would be condending on the existing spinlock with lock > > > dropping in place and removed. Would you be willing to give it a shot? > > > > Without the mutex and with the spinlock only, the common read latency > > of memory.stat is still 5ms-10ms in kernel. There are very few reads > > (<0.003%) going above 10ms and none more than 1 second. > > Is this with the existing spinlock dropping and same 10k potentially > contending readers? Yes, it is the same test (10K contending readers). The kernel change is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats() so that the concurrent mem_cgroup_user_flush_stats() requests directly contend on cgroup_rstat_lock in cgroup_rstat_flush(). 
> -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-11 20:01                 ` Wei Xu
@ 2023-09-11 20:21                   ` Tejun Heo
  2023-09-11 20:28                     ` Yosry Ahmed
  [not found]                     ` <ZP92xP5rdKdeps7Z-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 2 replies; 29+ messages in thread
From: Tejun Heo @ 2023-09-11 20:21 UTC (permalink / raw)
  To: Wei Xu
  Cc: Michal Hocko, Yosry Ahmed, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou,
	Michal Koutný, Waiman Long, linux-mm, cgroups, linux-kernel,
	Greg Thelen

Hello,

On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> Yes, it is the same test (10K contending readers). The kernel change
> is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> so that the concurrent mem_cgroup_user_flush_stats() requests directly
> contend on cgroup_rstat_lock in cgroup_rstat_flush().

I don't think it'd be a good idea to twist rstat and other kernel internal
code to accommodate 10k parallel readers. If we want to support that, let's
explicitly support that by implementing better batching in the read path.
The only guarantee you need is that there has been at least one flush since
the read attempt started, so we can do something like the following in the
read path:

1. Grab a waiter lock. Remember the current timestamp.

2. Try to lock the flush mutex. If obtained, drop the waiter lock, flush.
   Regrab the waiter lock, update the latest flush time to my start time,
   wake up waiters on the waitqueue (maybe do custom wakeups based on
   start time?).

3. Release the waiter lock and sleep on the waitqueue.

4. When woken up, regrab the waiter lock, compare whether the latest flush
   timestamp is later than my start time; if so, return the latest result.
   If not, go back to #2.

Maybe the above isn't the best way to do it but you get the general idea.
When you have that many concurrent readers, most of them won't need to
actually flush.

Thanks.
-- tejun ^ permalink raw reply [flat|nested] 29+ messages in thread
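[Tejun's waitqueue scheme above can be modeled in userspace C. This is
a sketch under assumptions, not kernel code: a generation counter
replaces the timestamp, a pthread condition variable replaces the
waitqueue, and all identifiers are made up for illustration.]

```c
#include <pthread.h>
#include <stdatomic.h>

/* "pending" models unflushed deltas, "stats" the flushed totals. */
static pthread_mutex_t waiter_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t flush_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flush_done = PTHREAD_COND_INITIALIZER;
static unsigned long latest_flush_gen;	/* protected by waiter_lock */
static unsigned long next_gen = 1;	/* protected by waiter_lock */
static _Atomic long stats;
static _Atomic long pending;

static void do_flush(void)
{
	atomic_fetch_add(&stats, atomic_exchange(&pending, 0));
}

/* Return stats reflecting at least one flush started after entry. */
long read_stats_batched(void)
{
	long ret;

	pthread_mutex_lock(&waiter_lock);
	unsigned long my_gen = next_gen;	/* step 1: remember start */

	while (latest_flush_gen < my_gen) {
		if (pthread_mutex_trylock(&flush_mutex) == 0) {
			/* step 2: we are the flusher */
			unsigned long gen = next_gen++;

			pthread_mutex_unlock(&waiter_lock);
			do_flush();
			pthread_mutex_lock(&waiter_lock);
			latest_flush_gen = gen;
			pthread_mutex_unlock(&flush_mutex);
			pthread_cond_broadcast(&flush_done);
		} else {
			/* steps 3-4: sleep, recheck on wakeup */
			pthread_cond_wait(&flush_done, &waiter_lock);
		}
	}
	ret = atomic_load(&stats);
	pthread_mutex_unlock(&waiter_lock);
	return ret;
}
```

[A reader that loses the trylock race simply sleeps; whichever reader
holds the flush mutex publishes its generation and wakes everyone, so
under heavy concurrency most readers return a result somebody else
flushed for them.]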
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads 2023-09-11 20:21 ` Tejun Heo @ 2023-09-11 20:28 ` Yosry Ahmed [not found] ` <ZP92xP5rdKdeps7Z-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 1 sibling, 0 replies; 29+ messages in thread From: Yosry Ahmed @ 2023-09-11 20:28 UTC (permalink / raw) To: Tejun Heo Cc: Wei Xu, Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Michal Koutný, Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen On Mon, Sep 11, 2023 at 1:21 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote: > > Yes, it is the same test (10K contending readers). The kernel change > > is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats() > > so that the concurrent mem_cgroup_user_flush_stats() requests directly > > contend on cgroup_rstat_lock in cgroup_rstat_flush(). > > I don't think it'd be a good idea to twist rstat and other kernel internal > code to accommodate 10k parallel readers. If we want to support that, let's > explicitly support that by implementing better batching in the read path. > The only guarantee you need is that there has been at least one flush since > the read attempt started, so we can do sth like the following in the read > path: > > 1. Grab a waiter lock. Remember the current timestamp. > > 2. Try lock flush mutex. If obtained, drop the waiter lock, flush. Regrab > the waiter lock, update the latest flush time to my start time, wake up > waiters on the waitqueue (maybe do custom wakeups based on start time?). > > 3. Release the waiter lock and sleep on the waitqueue. > > 4. When woken up, regarb the waiter lock, compare whether the latest flush > timestamp is later than my start time, if so, return the latest result. > If not go back to #2. > > Maybe the above isn't the best way to do it but you get the general idea. 
> When you have that many concurrent readers, most of them won't need to > actually flush. I am testing something vaguely similar to this conceptually, but doesn't depend on timestamps. I replaced the mutex with a semaphore, and I added a fallback logic to unified flushing with a timeout: static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg) { static DEFINE_SEMAPHORE(user_flush_sem, 1); if (atomic_read(&stats_flush_order) <= STATS_FLUSH_THRESHOLD) return; if (!down_timeout(&user_flush_sem, msecs_to_jiffies(1))) { do_stats_flush(memcg); up(&user_flush_sem); } else { do_unified_stats_flush(true); } } In do_unified_stats_flush(), I added a wait argument. If set, the caller will wait for any ongoing flushers before returning (but it never attempts to flush, so no contention on the underlying rstat lock). I implemented this using completions. I am running some tests now, but this should make sure userspace read latency is bounded by 1ms + unified flush time. We basically attempt to flush our subtree only, if we can't after 1ms, we fallback to unified flushing. Another benefit I am seeing here is that I tried switching in-kernel flushers to also use the completion in do_unified_stats_flush(). Basically instead of skipping entirely when someone else is flushing, they just wait for them to finish (without being serialized or contending the lock). I see no regressions in my parallel reclaim benchmark. This should make sure no one ever skips a flush, while still avoiding too much serialization/contention. I suspect this should make reclaim heuristics (and other in-kernel flushers) slightly better. I will run Wei's benchmark next to see how userspace read latency is affected. > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <ZP92xP5rdKdeps7Z-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  [not found]                   ` <ZP92xP5rdKdeps7Z-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2023-09-12 11:03                     ` Michal Hocko
  [not found]                       ` <ZQBFZMRL8WmqRgrM-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-12 11:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Wei Xu, Yosry Ahmed, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou,
	Michal Koutný, Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Greg Thelen

On Mon 11-09-23 10:21:24, Tejun Heo wrote:
> Hello,
>
> On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> > Yes, it is the same test (10K contending readers). The kernel change
> > is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> > so that the concurrent mem_cgroup_user_flush_stats() requests directly
> > contend on cgroup_rstat_lock in cgroup_rstat_flush().
>
> I don't think it'd be a good idea to twist rstat and other kernel internal
> code to accommodate 10k parallel readers.

I didn't mean to suggest optimizing for this specific scenario. I was
mostly curious whether the pathological case of unbounded high latency
due to lock dropping is easy to trigger by a huge number of readers. It
seems it is not, and the mutex might not really be needed as a
prevention.

> If we want to support that, let's
> explicitly support that by implementing better batching in the read path.

Well, we need to be able to handle those situations because stat files
are generally readable and we do not want unrelated workloads to
influence each other heavily through this path.

[...]

> When you have that many concurrent readers, most of them won't need to
> actually flush.

Agreed!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
[parent not found: <ZQBFZMRL8WmqRgrM-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  [not found] ` <ZQBFZMRL8WmqRgrM-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2023-09-12 11:09   ` Yosry Ahmed
  0 siblings, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2023-09-12 11:09 UTC (permalink / raw)
To: Michal Hocko
Cc: Tejun Heo, Wei Xu, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Michal Koutný,
	Waiman Long, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Greg Thelen

On Tue, Sep 12, 2023 at 4:03 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
>
> On Mon 11-09-23 10:21:24, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> > > Yes, it is the same test (10K contending readers). The kernel change
> > > is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> > > so that the concurrent mem_cgroup_user_flush_stats() requests directly
> > > contend on cgroup_rstat_lock in cgroup_rstat_flush().
> >
> > I don't think it'd be a good idea to twist rstat and other kernel internal
> > code to accommodate 10k parallel readers.
>
> I didn't mean to suggest optimizing for this specific scenario. I was
> mostly curious whether the pathological case of unbound high latency due
> to lock dropping is easy to trigger by huge number of readers. It seems
> it is not and the mutex might not be really needed as a prevention.
>
> > If we want to support that, let's
> > explicitly support that by implementing better batching in the read path.
>
> Well, we need to be able to handle those situations because stat files
> are generally readable and we do not want unrelated workloads to
> influence each other heavily through this path.

I am working on a complete rework of this series based on the feedback
I got from Wei and the discussions here. I think I have something
simpler and more generic that doesn't proliferate the number of
flushing variants we have. I am running some tests right now and will
share it as soon as I can. It should address the high-concurrency use
case without adding a lot of complexity.

It basically involves a fast path where we only flush the needed
subtree if there's no contention, and a slow path where we coalesce all
flushing requests, and everyone just waits for a single flush to
complete (without spinning or contending on any locks). I am trying to
use this generic mechanism for both userspace reads and in-kernel
flushers. I am making sure in-kernel flushers do not regress.

>
> [...]
>
> > When you have that many concurrent readers, most of them won't need to
> > actually flush.
>
> Agreed!
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
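For readers following along, the fast-path/slow-path coalescing described above could look roughly like the following userspace C model (the names and structure are illustrative assumptions, not the eventual kernel patch): the first caller becomes the flusher, while concurrent callers sleep until the in-flight flush completes instead of contending on a lock.

```c
#include <pthread.h>

/* Hypothetical userspace model of coalesced flushing; illustrative only. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flush_done = PTHREAD_COND_INITIALIZER;
static int flushing;                  /* is a flush in flight? */
static unsigned long gen;             /* completed-flush generation */
static unsigned long real_flushes;    /* how many actual flushes ran */

void coalesced_flush(void)
{
	pthread_mutex_lock(&lock);
	if (!flushing) {
		/* Fast path: no contention, we become the flusher and
		 * would flush only the subtree we need. */
		flushing = 1;
		pthread_mutex_unlock(&lock);

		/* ... the expensive flush itself would run here,
		 * outside the lock ... */

		pthread_mutex_lock(&lock);
		real_flushes++;
		gen++;
		flushing = 0;
		pthread_cond_broadcast(&flush_done);  /* wake all waiters */
		pthread_mutex_unlock(&lock);
	} else {
		/* Slow path: coalesce with the in-flight flush; sleep
		 * until it completes, no spinning on any lock. */
		unsigned long target = gen + 1;
		while (gen < target)
			pthread_cond_wait(&flush_done, &lock);
		pthread_mutex_unlock(&lock);
	}
}
```

Under this sketch, many concurrent readers trigger only one real flush between them, which is the batching Tejun asked for; whether the in-flight flush is "fresh enough" for late arrivals is a policy question the sketch glosses over.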
* Re: [PATCH v4 0/4] memcg: non-unified flushing for userspace stats
  [not found] ` <20230831165611.2610118-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2023-08-31 16:56 ` [PATCH v4 2/4] mm: memcg: add a helper for non-unified stats flushing Yosry Ahmed
  2023-08-31 16:56 ` [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
@ 2023-08-31 17:18   ` Waiman Long
  2 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2023-08-31 17:18 UTC (permalink / raw)
To: Yosry Ahmed, Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 8/31/23 12:56, Yosry Ahmed wrote:
> Most memcg flushing contexts use "unified" flushing, where only one
> flusher is allowed at a time (others skip), and all flushers need to
> flush the entire tree. This works well with high concurrency, which
> mostly comes from in-kernel flushers (e.g. reclaim, refault, ..).
>
> For userspace reads, unified flushing leads to non-deterministic stats
> staleness and reading cost. This series clarifies and documents the
> differences between unified and non-unified flushing (patches 1 & 2),
> then opts userspace reads out of unified flushing (patch 3).
>
> This patch series is a follow-up on the discussion in [1]. That was a
> patch that proposed that userspace reads wait for ongoing unified
> flushers to complete before returning. There were concerns about the
> latency that this introduces to userspace reads, especially with ongoing
> reports of expensive stat reads even with unified flushing. Hence, this
> series follows a different approach, by opting userspace reads out of
> unified flushing completely. The cost of userspace reads is now
> deterministic, and depends on the size of the subtree being read. This
> should fix both the *sometimes* expensive reads (due to flushing the
> entire tree) and occasional staleness (due to skipping flushing).
>
> I attempted to remove unified flushing completely, but noticed that it
> hurts in-kernel flushers with high concurrency (e.g. hundreds of
> concurrent reclaimers). This sort of concurrency is not expected from
> userspace reads. More details about testing and some numbers in the
> last patch's changelog.
>
> v4 -> v5:
> - Fixed build error in the last patch with W=1 because of a missed
>   'static'.
>
> v4: https://lore.kernel.org/lkml/20230830175335.1536008-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/
>
> Yosry Ahmed (4):
>   mm: memcg: properly name and document unified stats flushing
>   mm: memcg: add a helper for non-unified stats flushing
>   mm: memcg: let non-unified root stats flushes help unified flushes
>   mm: memcg: use non-unified stats flushing for userspace reads
>
>  include/linux/memcontrol.h |   8 +--
>  mm/memcontrol.c            | 106 +++++++++++++++++++++++++++----------
>  mm/vmscan.c                |   2 +-
>  mm/workingset.c            |   4 +-
>  4 files changed, 85 insertions(+), 35 deletions(-)

LGTM

Acked-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread
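As a side note for readers, the "one flusher at a time, others skip" scheme the cover letter refers to can be modeled in userspace C with an atomic exchange, mirroring the kernel's stats_unified_flush_ongoing flag (the variable names below are stand-ins, not the kernel's exact code):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int unified_flush_ongoing;  /* models stats_unified_flush_ongoing */
static atomic_long pending;               /* pending update magnitude */
static long flushed_total;                /* stand-in for the flushed stats */

/* Returns true if this caller performed the flush, false if it skipped
 * because another unified flush was already in progress. */
bool unified_flush(void)
{
	if (atomic_exchange(&unified_flush_ongoing, 1))
		return false;                        /* someone else is flushing: skip */
	flushed_total += atomic_exchange(&pending, 0);   /* "flush" the whole tree */
	atomic_store(&unified_flush_ongoing, 0);
	return true;
}
```

The skip is what makes unified flushing cheap under high concurrency but non-deterministically stale for userspace readers: a skipped reader returns whatever was flushed last.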
* [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-08-31 16:56 [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Yosry Ahmed
  2023-08-31 16:56 ` [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
  [not found] ` <20230831165611.2610118-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2023-08-31 16:56 ` Yosry Ahmed
  2023-09-04 14:50   ` Michal Hocko
  2 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2023-08-31 16:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long,
	linux-mm, cgroups, linux-kernel, Yosry Ahmed

Unified flushing of memcg stats keeps track of the magnitude of pending
updates, and only allows a flush if that magnitude exceeds a threshold.
It also keeps track of the time at which ratelimited flushing should be
allowed as flush_next_time.

A non-unified flush on the root memcg has the same effect as a unified
flush, so let it help unified flushing by resetting pending updates and
kicking flush_next_time forward. Move the logic into the common
do_stats_flush() helper, and do it for all root flushes, unified or
not.

There is a subtle change here: we reset stats_flush_threshold before a
flush rather than after a flush. This is probably okay because:

(a) For flushers: only unified flushers check stats_flush_threshold, and
those flushers skip anyway if there is another unified flush ongoing.
Having them also skip if there is an ongoing non-unified root flush is
actually more consistent.

(b) For updaters: Resetting stats_flush_threshold early may lead to more
atomic updates of stats_flush_threshold, as we start updating it
earlier. This should not be significant in practice because we stop
updating stats_flush_threshold when it reaches the threshold anyway. If
we start early and stop early, the number of atomic updates remains the
same. The only difference is the scenario where we reset
stats_flush_threshold early, start doing atomic updates early, and then
the periodic flusher kicks in before we reach the threshold. In this
case, we will have done more atomic updates. However, since the
threshold wasn't reached, then we did not do a lot of updates anyway.

Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8c046feeaae7..94d5a6751a9e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -647,6 +647,11 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
  */
 static void do_stats_flush(struct mem_cgroup *memcg)
 {
+	/* for unified flushing, root non-unified flushing can help as well */
+	if (mem_cgroup_is_root(memcg)) {
+		WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
+		atomic_set(&stats_flush_threshold, 0);
+	}
 	cgroup_rstat_flush(memcg->css.cgroup);
 }
 
@@ -665,11 +670,8 @@ static void do_unified_stats_flush(void)
 	    atomic_xchg(&stats_unified_flush_ongoing, 1))
 		return;
 
-	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
-
 	do_stats_flush(root_mem_cgroup);
 
-	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_unified_flush_ongoing, 0);
 }
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-08-31 16:56 ` [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes Yosry Ahmed
@ 2023-09-04 14:50   ` Michal Hocko
  [not found]   ` <ZPXupwjewuLgksAI-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-04 14:50 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long,
	linux-mm, cgroups, linux-kernel

On Thu 31-08-23 16:56:10, Yosry Ahmed wrote:
> Unified flushing of memcg stats keeps track of the magnitude of pending
> updates, and only allows a flush if that magnitude exceeds a threshold.
> It also keeps track of the time at which ratelimited flushing should be
> allowed as flush_next_time.
>
> A non-unified flush on the root memcg has the same effect as a unified
> flush, so let it help unified flushing by resetting pending updates and
> kicking flush_next_time forward. Move the logic into the common
> do_stats_flush() helper, and do it for all root flushes, unified or
> not.

I have hard time to follow why we really want/need this. Does this cause
any observable changes to the behavior?

> There is a subtle change here: we reset stats_flush_threshold before a
> flush rather than after a flush. This is probably okay because:
>
> (a) For flushers: only unified flushers check stats_flush_threshold, and
> those flushers skip anyway if there is another unified flush ongoing.
> Having them also skip if there is an ongoing non-unified root flush is
> actually more consistent.
>
> (b) For updaters: Resetting stats_flush_threshold early may lead to more
> atomic updates of stats_flush_threshold, as we start updating it
> earlier. This should not be significant in practice because we stop
> updating stats_flush_threshold when it reaches the threshold anyway. If
> we start early and stop early, the number of atomic updates remains the
> same. The only difference is the scenario where we reset
> stats_flush_threshold early, start doing atomic updates early, and then
> the periodic flusher kicks in before we reach the threshold. In this
> case, we will have done more atomic updates. However, since the
> threshold wasn't reached, then we did not do a lot of updates anyway.
>
> Suggested-by: Michal Koutný <mkoutny@suse.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  mm/memcontrol.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8c046feeaae7..94d5a6751a9e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -647,6 +647,11 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>   */
>  static void do_stats_flush(struct mem_cgroup *memcg)
>  {
> +	/* for unified flushing, root non-unified flushing can help as well */
> +	if (mem_cgroup_is_root(memcg)) {
> +		WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> +		atomic_set(&stats_flush_threshold, 0);
> +	}
> 	cgroup_rstat_flush(memcg->css.cgroup);
>  }
>
> @@ -665,11 +670,8 @@ static void do_unified_stats_flush(void)
> 	    atomic_xchg(&stats_unified_flush_ongoing, 1))
> 		return;
>
> -	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> -
> 	do_stats_flush(root_mem_cgroup);
>
> -	atomic_set(&stats_flush_threshold, 0);
> 	atomic_set(&stats_unified_flush_ongoing, 0);
>  }
>
> --
> 2.42.0.rc2.253.gd59a3bf2b4-goog
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
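To illustrate point (b) above, here is a rough userspace C model of the threshold-gated flush (a hypothetical simplification; the names mirror but are not the kernel's exact code): updaters stop bumping the counter once the threshold is reached, so resetting it before rather than after a flush mostly shifts when that bumping restarts.

```c
#include <stdatomic.h>

#define FLUSH_THRESHOLD 64   /* stand-in for the kernel's threshold */

static atomic_long err;      /* models stats_flush_threshold: accumulated
                              * magnitude of unflushed updates */
static long flush_runs;      /* how many real flushes happened */

/* Called on every stat update; stops bumping once the threshold is
 * reached, bounding the atomic traffic (the check is deliberately
 * racy, since stats accuracy here is best-effort anyway). */
void stat_updated(long val)
{
	if (atomic_load(&err) < FLUSH_THRESHOLD)
		atomic_fetch_add(&err, val < 0 ? -val : val);
}

/* Flush only when the accumulated error justifies the cost; reset the
 * counter *before* flushing, as patch 3 does. */
void flush_if_needed(void)
{
	if (atomic_load(&err) < FLUSH_THRESHOLD)
		return;                /* error still tolerable: skip */
	atomic_store(&err, 0);         /* updaters may start bumping again */
	flush_runs++;                  /* stand-in for cgroup_rstat_flush() */
}
```

The extra atomic updates Michal asks about occur only in the window between the early reset and the flush completing, which is the scenario the changelog argues is rare and cheap.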
[parent not found: <ZPXupwjewuLgksAI-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  [not found] ` <ZPXupwjewuLgksAI-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2023-09-04 15:29   ` Michal Koutný
  2023-09-04 15:41     ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Koutný @ 2023-09-04 15:29 UTC (permalink / raw)
To: Michal Hocko
Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hello.

On Mon, Sep 04, 2023 at 04:50:15PM +0200, Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> I have hard time to follow why we really want/need this. Does this cause
> any observable changes to the behavior?

Behavior change depends on how much userspace triggers the root memcg
flush, from nothing to effectively offloading flushing to userspace
tasks. (Theory^^^)

It keeps stats_flush_threshold up to date representing global error
estimate so that error-tolerant readers may save their time and it keeps
the reasoning about the stats_flush_threshold effect simple.

Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-09-04 15:29 ` Michal Koutný
@ 2023-09-04 15:41   ` Michal Hocko
  [not found]   ` <ZPX6luPGqypp68+L-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-04 15:41 UTC (permalink / raw)
To: Michal Koutný
Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Mon 04-09-23 17:29:14, Michal Koutny wrote:
> Hello.
>
> On Mon, Sep 04, 2023 at 04:50:15PM +0200, Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> > I have hard time to follow why we really want/need this. Does this cause
> > any observable changes to the behavior?
>
> Behavior change depends on how much userspace triggers the root memcg
> flush, from nothing to effectively offloading flushing to userspace
> tasks. (Theory^^^)
>
> It keeps stats_flush_threshold up to date representing global error
> estimate so that error-tolerant readers may save their time and it keeps
> the reasoning about the stats_flush_threshold effect simple.

So it also creates an undocumented but userspace visible behavior.
Something that userspace might start depending on, right?
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
[parent not found: <ZPX6luPGqypp68+L-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  [not found] ` <ZPX6luPGqypp68+L-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2023-09-05 14:10   ` Michal Koutný
  2023-09-05 15:54     ` Yosry Ahmed
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Koutný @ 2023-09-05 14:10 UTC (permalink / raw)
To: Michal Hocko
Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Mon, Sep 04, 2023 at 05:41:10PM +0200, Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> So it also creates an undocumented but userspace visible behavior.
> Something that userspace might start depending on, right?

Yes but -
- depending on undocumented behavior is a mistake,
- breaking the dependency would manifest (in the case I imagine) as a
  performance regression (and if there are some users, the future can
  allow them configuring periodic kernel flush to compensate for that).

Or do you suggest these effects should be documented (that would require
deeper analysis of the actual effect)?

Thanks,
Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-09-05 14:10 ` Michal Koutný
@ 2023-09-05 15:54   ` Yosry Ahmed
  [not found]   ` <CAJD7tkaHVtMiMYFocNiABuyhPcqt77gei0UeaDq4J7V-=tMFYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2023-09-05 15:54 UTC (permalink / raw)
To: Michal Koutný
Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue, Sep 5, 2023 at 7:10 AM Michal Koutný <mkoutny-IBi9RG/b67k@public.gmane.org> wrote:
>
> On Mon, Sep 04, 2023 at 05:41:10PM +0200, Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> > So it also creates an undocumented but userspace visible behavior.
> > Something that userspace might start depending on, right?
>
> Yes but -
> - depending on undocumented behavior is a mistake,
> - breaking the dependency would manifest (in the case I imagine) as a
>   performance regression (and if there are some users, the future can
>   allow them configuring periodic kernel flush to compensate for that).

I think I am missing something. This change basically makes userspace
readers (for the root memcg) help out unified flushers, which are
in-kernel readers (e.g. reclaim) -- not the other way around.

How would that create a userspace visible behavior that a dependency
can be formed on? Users expecting reclaim to be faster right after
reading root stats? I would guess that would be too flaky to cause a
behavior that people can depend on tbh.

^ permalink raw reply	[flat|nested] 29+ messages in thread
[parent not found: <CAJD7tkaHVtMiMYFocNiABuyhPcqt77gei0UeaDq4J7V-=tMFYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  [not found] ` <CAJD7tkaHVtMiMYFocNiABuyhPcqt77gei0UeaDq4J7V-=tMFYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2023-09-05 16:07   ` Michal Koutný
  2023-09-12 11:03   ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Koutný @ 2023-09-05 16:07 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue, Sep 05, 2023 at 08:54:46AM -0700, Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> How would that create a userspace visible behavior that a dependency
> can be formed on?

A userspace process reading out root memory.stat more frequently than
in-kernel periodic flusher.

> Users expecting reclaim to be faster right after reading root stats?

Yes, that is what I had in mind.

> I would guess that would be too flaky to cause a behavior that people
> can depend on tbh.

I agree it's a weird dependency. As I wrote, nothing that would be hard
to take away.

Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  [not found] ` <CAJD7tkaHVtMiMYFocNiABuyhPcqt77gei0UeaDq4J7V-=tMFYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2023-09-05 16:07 ` Michal Koutný
@ 2023-09-12 11:03   ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2023-09-12 11:03 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Michal Koutný, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue 05-09-23 08:54:46, Yosry Ahmed wrote:
> On Tue, Sep 5, 2023 at 7:10 AM Michal Koutný <mkoutny-IBi9RG/b67k@public.gmane.org> wrote:
> >
> > On Mon, Sep 04, 2023 at 05:41:10PM +0200, Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> > > So it also creates an undocumented but userspace visible behavior.
> > > Something that userspace might start depending on, right?
> >
> > Yes but -
> > - depending on undocumented behavior is a mistake,
> > - breaking the dependency would manifest (in the case I imagine) as a
> >   performance regression (and if there are some users, the future can
> >   allow them configuring periodic kernel flush to compensate for that).
>
> I think I am missing something. This change basically makes userspace
> readers (for the root memcg) help out unified flushers, which are
> in-kernel readers (e.g. reclaim) -- not the other way around.
>
> How would that create a userspace visible behavior that a dependency
> can be formed on? Users expecting reclaim to be faster right after
> reading root stats? I would guess that would be too flaky to cause a
> behavior that people can depend on tbh.

Flaky or not, it might cause behavior difference and a subtle one. I can
imagine nohz and similar workloads wanting to (ab)use this to reduce
kernel footprint. If we really need this then make it obvious in the
changelog at least.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2023-09-12 11:09 UTC | newest]
Thread overview: 29+ messages
-- links below jump to the message on this page --
2023-08-31 16:56 [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Yosry Ahmed
2023-08-31 16:56 ` [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
[not found] ` <20230831165611.2610118-2-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2023-09-04 14:44 ` Michal Hocko
[not found] ` <ZPXtVLNIXk8trj2k-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-09-05 15:55 ` Yosry Ahmed
[not found] ` <20230831165611.2610118-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2023-08-31 16:56 ` [PATCH v4 2/4] mm: memcg: add a helper for non-unified " Yosry Ahmed
[not found] ` <20230831165611.2610118-3-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2023-09-04 14:45 ` Michal Hocko
2023-08-31 16:56 ` [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
[not found] ` <20230831165611.2610118-5-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2023-09-04 15:15 ` Michal Hocko
[not found] ` <ZPX0kCKd4TaVLJY7-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-09-05 15:57 ` Yosry Ahmed
2023-09-08 0:52 ` Wei Xu
[not found] ` <CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-09-08 1:02 ` Ivan Babrou
[not found] ` <CABWYdi1WNp9f20nRFEExn8QB1MwP7QXwvD6Q8xHHuTO2SUTLkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-09-08 1:11 ` Yosry Ahmed
2023-09-11 13:11 ` Michal Hocko
2023-09-11 19:15 ` Wei Xu
2023-09-11 19:34 ` Michal Hocko
2023-09-11 20:01 ` Wei Xu
2023-09-11 20:21 ` Tejun Heo
2023-09-11 20:28 ` Yosry Ahmed
[not found] ` <ZP92xP5rdKdeps7Z-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2023-09-12 11:03 ` Michal Hocko
[not found] ` <ZQBFZMRL8WmqRgrM-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-09-12 11:09 ` Yosry Ahmed
2023-08-31 17:18 ` [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Waiman Long
2023-08-31 16:56 ` [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes Yosry Ahmed
2023-09-04 14:50 ` Michal Hocko
[not found] ` <ZPXupwjewuLgksAI-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-09-04 15:29 ` Michal Koutný
2023-09-04 15:41 ` Michal Hocko
[not found] ` <ZPX6luPGqypp68+L-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-09-05 14:10 ` Michal Koutný
2023-09-05 15:54 ` Yosry Ahmed
[not found] ` <CAJD7tkaHVtMiMYFocNiABuyhPcqt77gei0UeaDq4J7V-=tMFYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-09-05 16:07 ` Michal Koutný
2023-09-12 11:03 ` Michal Hocko