* [PATCH 0/3] memcg: non-unified flushing for userspace stats
@ 2023-08-21 20:54 Yosry Ahmed
2023-08-21 20:54 ` [PATCH 1/3] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
` (3 more replies)
0 siblings, 4 replies; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-21 20:54 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups,
linux-kernel, Yosry Ahmed
Most memcg flushing contexts use "unified" flushing, where only one
flusher is allowed at a time (others skip), and all flushers need to
flush the entire tree. This works well with high concurrency, which
mostly comes from in-kernel flushers (e.g. reclaim, refault, ..).
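The "only one flusher at a time, others skip" gate can be modeled in plain
userspace C. This is a sketch only: `stats_flush_ongoing`,
`try_unified_flush()` and the counter are stand-ins for the kernel
internals, not real memcg API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the kernel's stats_flush_ongoing atomic. */
static atomic_int stats_flush_ongoing;

/* Counts completed flushes; each one covers the entire tree. */
static int full_tree_flushes;

/*
 * Model of a unified flush: the cheap load followed by an exchange lets
 * concurrent callers bail out instead of serializing on the flush.
 */
static bool try_unified_flush(void)
{
	if (atomic_load(&stats_flush_ongoing) ||
	    atomic_exchange(&stats_flush_ongoing, 1))
		return false;	/* another flusher is running: skip */

	full_tree_flushes++;	/* flush the whole hierarchy */

	atomic_store(&stats_flush_ongoing, 0);
	return true;
}
```

The skip is what keeps highly concurrent in-kernel callers cheap, and also
what makes the stats a userspace reader sees non-deterministically stale.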
For userspace reads, unified flushing leads to non-deterministic stats
staleness and reading cost. This series clarifies and documents the
differences between unified and non-unified flushing (patches 1 & 2),
then opts userspace reads out of unified flushing (patch 3).
This patch series is a follow up on the discussion in [1]. That was a
patch that proposed that userspace reads wait for ongoing unified
flushers to complete before returning. There were concerns about the
latency that this introduces to userspace reads, especially with ongoing
reports of expensive stat reads even with unified flushing. Hence, this
series follows a different approach, by opting userspace reads out of
unified flushing completely. The cost of userspace reads is now
deterministic, and depends on the size of the subtree being read. This
should fix both the *sometimes* expensive reads (due to flushing the
entire tree) and the occasional staleness (due to skipping flushing).
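With non-unified flushing, the work done by a read scales with the subtree
being read rather than the full hierarchy. A toy model of that claim,
counting cgroups visited in place of real flush work (`struct cg` and
`flush_cost()` are hypothetical names, not kernel structures):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a cgroup node; not the kernel's struct cgroup. */
struct cg {
	size_t nr_children;
	const struct cg **children;
};

/* Work done by a non-unified flush: visit only @cg's subtree. */
static size_t flush_cost(const struct cg *cg)
{
	size_t cost = 1;	/* the cgroup itself */

	for (size_t i = 0; i < cg->nr_children; i++)
		cost += flush_cost(cg->children[i]);
	return cost;
}
```

A read of a leaf cgroup costs one node; only a read of the root still pays
for the whole tree, which matches the "proportional to the size of the
subtree being read" behavior described above.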
I attempted to remove unified flushing completely, but noticed that
in-kernel flushers can have high concurrency (e.g. hundreds of concurrent
reclaimers), and serializing them would regress performance. This sort of
concurrency is not expected from userspace
reads. More details about testing and some numbers in the last patch's
changelog.
[1] https://lore.kernel.org/lkml/20230809045810.1659356-1-yosryahmed@google.com/
[2] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
Yosry Ahmed (3):
mm: memcg: properly name and document unified stats flushing
mm: memcg: add a helper for non-unified stats flushing
mm: memcg: use non-unified stats flushing for userspace reads
include/linux/memcontrol.h | 8 ++---
mm/memcontrol.c | 74 +++++++++++++++++++++++++++-----------
mm/vmscan.c | 2 +-
mm/workingset.c | 4 +--
4 files changed, 60 insertions(+), 28 deletions(-)
--
2.42.0.rc1.204.g551eb34607-goog
^ permalink raw reply [flat|nested] 43+ messages in thread* [PATCH 1/3] mm: memcg: properly name and document unified stats flushing 2023-08-21 20:54 [PATCH 0/3] memcg: non-unified flushing for userspace stats Yosry Ahmed @ 2023-08-21 20:54 ` Yosry Ahmed 2023-08-21 20:54 ` [PATCH 2/3] mm: memcg: add a helper for non-unified " Yosry Ahmed ` (2 subsequent siblings) 3 siblings, 0 replies; 43+ messages in thread From: Yosry Ahmed @ 2023-08-21 20:54 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel, Yosry Ahmed Most contexts that flush memcg stats use "unified" flushing, where basically all flushers attempt to flush the entire hierarchy, but only one flusher is allowed at a time, others skip flushing. This is needed because we need to flush the stats from paths such as reclaim or refaults, which may have high concurrency, especially on large systems. Serializing such performance-sensitive paths can introduce regressions, hence, unified flushing offers a tradeoff between stats staleness and the performance impact of flushing stats. Document this properly and explicitly by renaming the common flushing helper from do_flush_stats() to do_unified_stats_flush(), and adding documentation to describe unified flushing. Additionally, rename flushing APIs to add "try" in the name, which implies that flushing will not always happen. Also add proper documentation. No functional change intended. 
Signed-off-by: Yosry Ahmed <yosryahmed@google.com> --- include/linux/memcontrol.h | 8 +++--- mm/memcontrol.c | 53 ++++++++++++++++++++++++++------------ mm/vmscan.c | 2 +- mm/workingset.c | 4 +-- 4 files changed, 43 insertions(+), 24 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 11810a2cfd2d..d517b0cc5221 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1034,8 +1034,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, return x; } -void mem_cgroup_flush_stats(void); -void mem_cgroup_flush_stats_ratelimited(void); +void mem_cgroup_try_flush_stats(void); +void mem_cgroup_try_flush_stats_ratelimited(void); void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val); @@ -1519,11 +1519,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, return node_page_state(lruvec_pgdat(lruvec), idx); } -static inline void mem_cgroup_flush_stats(void) +static inline void mem_cgroup_try_flush_stats(void) { } -static inline void mem_cgroup_flush_stats_ratelimited(void) +static inline void mem_cgroup_try_flush_stats_ratelimited(void) { } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cf57fe9318d5..c6150ea54d48 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) /* * If stats_flush_threshold exceeds the threshold * (>num_online_cpus()), cgroup stats update will be triggered - * in __mem_cgroup_flush_stats(). Increasing this var further + * in mem_cgroup_try_flush_stats(). Increasing this var further * is redundant and simply adds overhead in atomic update. 
*/ if (atomic_read(&stats_flush_threshold) <= num_online_cpus()) @@ -639,13 +639,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) } } -static void do_flush_stats(void) +/* + * do_unified_stats_flush - do a unified flush of memory cgroup statistics + * + * A unified flush tries to flush the entire hierarchy, but skips if there is + * another ongoing flush. This is meant for flushers that may have a lot of + * concurrency (e.g. reclaim, refault, etc), and should not be serialized to + * avoid slowing down performance-sensitive paths. A unified flush may skip, and + * hence may yield stale stats. + */ +static void do_unified_stats_flush(void) { - /* - * We always flush the entire tree, so concurrent flushers can just - * skip. This avoids a thundering herd problem on the rstat global lock - * from memcg flushers (e.g. reclaim, refault, etc). - */ if (atomic_read(&stats_flush_ongoing) || atomic_xchg(&stats_flush_ongoing, 1)) return; @@ -658,16 +662,31 @@ static void do_flush_stats(void) atomic_set(&stats_flush_ongoing, 0); } -void mem_cgroup_flush_stats(void) +/* + * mem_cgroup_try_flush_stats - try to flush memory cgroup statistics + * + * Try to flush the stats of all memcgs that have stat updates since the last + * flush. We do not flush the stats if: + * - The magnitude of the pending updates is below a certain threshold. + * - There is another ongoing unified flush (see do_unified_stats_flush()). + * + * Hence, the stats may be stale, but ideally by less than FLUSH_TIME due to + * periodic flushing. + */ +void mem_cgroup_try_flush_stats(void) { if (atomic_read(&stats_flush_threshold) > num_online_cpus()) - do_flush_stats(); + do_unified_stats_flush(); } -void mem_cgroup_flush_stats_ratelimited(void) +/* + * Like mem_cgroup_try_flush_stats(), but only flushes if the periodic flusher + * is late. 
+ */ +void mem_cgroup_try_flush_stats_ratelimited(void) { if (time_after64(jiffies_64, READ_ONCE(flush_next_time))) - mem_cgroup_flush_stats(); + mem_cgroup_try_flush_stats(); } static void flush_memcg_stats_dwork(struct work_struct *w) @@ -676,7 +695,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w) * Always flush here so that flushing in latency-sensitive paths is * as cheap as possible. */ - do_flush_stats(); + do_unified_stats_flush(); queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME); } @@ -1576,7 +1595,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) * * Current memory state: */ - mem_cgroup_flush_stats(); + mem_cgroup_try_flush_stats(); for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { u64 size; @@ -4018,7 +4037,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) int nid; struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - mem_cgroup_flush_stats(); + mem_cgroup_try_flush_stats(); for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { seq_printf(m, "%s=%lu", stat->name, @@ -4093,7 +4112,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); - mem_cgroup_flush_stats(); + mem_cgroup_try_flush_stats(); for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { unsigned long nr; @@ -4595,7 +4614,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); struct mem_cgroup *parent; - mem_cgroup_flush_stats(); + mem_cgroup_try_flush_stats(); *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); @@ -6610,7 +6629,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v) int i; struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - mem_cgroup_flush_stats(); + mem_cgroup_try_flush_stats(); for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { int nid; diff --git 
a/mm/vmscan.c b/mm/vmscan.c index c7c149cb8d66..457a18921fda 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2923,7 +2923,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) * Flush the memory cgroup stats, so that we read accurate per-memcg * lruvec stats for heuristics. */ - mem_cgroup_flush_stats(); + mem_cgroup_try_flush_stats(); /* * Determine the scan balance between anon and file LRUs. diff --git a/mm/workingset.c b/mm/workingset.c index da58a26d0d4d..affb8699e58d 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -520,7 +520,7 @@ void workingset_refault(struct folio *folio, void *shadow) } /* Flush stats (and potentially sleep) before holding RCU read lock */ - mem_cgroup_flush_stats_ratelimited(); + mem_cgroup_try_flush_stats_ratelimited(); rcu_read_lock(); @@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, struct lruvec *lruvec; int i; - mem_cgroup_flush_stats(); + mem_cgroup_try_flush_stats(); lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid)); for (pages = 0, i = 0; i < NR_LRU_LISTS; i++) pages += lruvec_page_state_local(lruvec, -- 2.42.0.rc1.204.g551eb34607-goog ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 2/3] mm: memcg: add a helper for non-unified stats flushing 2023-08-21 20:54 [PATCH 0/3] memcg: non-unified flushing for userspace stats Yosry Ahmed 2023-08-21 20:54 ` [PATCH 1/3] mm: memcg: properly name and document unified stats flushing Yosry Ahmed @ 2023-08-21 20:54 ` Yosry Ahmed [not found] ` <20230821205458.1764662-3-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> [not found] ` <20230821205458.1764662-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2023-08-22 13:00 ` [PATCH 0/3] memcg: non-unified flushing for userspace stats Michal Koutný 3 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-21 20:54 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel, Yosry Ahmed Some contexts flush memcg stats outside of unified flushing, directly using cgroup_rstat_flush(). Add a helper for non-unified flushing, a counterpart for do_unified_stats_flush(), and use it in those contexts, as well as in do_unified_stats_flush() itself. This abstracts the rstat API and makes it easy to introduce modifications to either unified or non-unified flushing functions without changing callers. No functional change intended. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> --- mm/memcontrol.c | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c6150ea54d48..90f08b35fa77 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -639,6 +639,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) } } +/* + * do_stats_flush - do a flush of the memory cgroup statistics + * @memcg: memory cgroup to flush + * + * Only flushes the subtree of @memcg, does not skip under any conditions. 
+ */ +static void do_stats_flush(struct mem_cgroup *memcg) +{ + cgroup_rstat_flush(memcg->css.cgroup); +} + /* * do_unified_stats_flush - do a unified flush of memory cgroup statistics * @@ -656,7 +667,7 @@ static void do_unified_stats_flush(void) WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME); - cgroup_rstat_flush(root_mem_cgroup->css.cgroup); + do_stats_flush(root_mem_cgroup); atomic_set(&stats_flush_threshold, 0); atomic_set(&stats_flush_ongoing, 0); @@ -7790,7 +7801,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg) break; } - cgroup_rstat_flush(memcg->css.cgroup); + do_stats_flush(memcg); pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE; if (pages < max) continue; @@ -7855,8 +7866,10 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size) static u64 zswap_current_read(struct cgroup_subsys_state *css, struct cftype *cft) { - cgroup_rstat_flush(css->cgroup); - return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B); + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + do_stats_flush(memcg); + return memcg_page_state(memcg, MEMCG_ZSWAP_B); } static int zswap_max_show(struct seq_file *m, void *v) -- 2.42.0.rc1.204.g551eb34607-goog ^ permalink raw reply related [flat|nested] 43+ messages in thread
[parent not found: <20230821205458.1764662-3-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 2/3] mm: memcg: add a helper for non-unified stats flushing [not found] ` <20230821205458.1764662-3-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2023-08-22 13:01 ` Michal Koutný 2023-08-22 16:00 ` Yosry Ahmed 0 siblings, 1 reply; 43+ messages in thread From: Michal Koutný @ 2023-08-22 13:01 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 810 bytes --] On Mon, Aug 21, 2023 at 08:54:57PM +0000, Yosry Ahmed <yosryahmed-hpIqsD4AKldhl2p70BpVqQ@public.gmane.orgm> wrote: > +static void do_stats_flush(struct mem_cgroup *memcg) > +{ > + cgroup_rstat_flush(memcg->css.cgroup); if(memcg == root_mem_cgroup) atomic_set(&stats_flush_threshold, 0); > +} > + > /* > * do_unified_stats_flush - do a unified flush of memory cgroup statistics > * > @@ -656,7 +667,7 @@ static void do_unified_stats_flush(void) > > WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME); > > - cgroup_rstat_flush(root_mem_cgroup->css.cgroup); > + do_stats_flush(root_mem_cgroup); > - atomic_set(&stats_flush_threshold, 0); > atomic_set(&stats_flush_ongoing, 0); You may reset stats_flush_threshold to save the unified flushers some work. Michal [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 2/3] mm: memcg: add a helper for non-unified stats flushing 2023-08-22 13:01 ` Michal Koutný @ 2023-08-22 16:00 ` Yosry Ahmed [not found] ` <CAJD7tkaVuiMU-ifJiyH5d_W1hi9DnAymYJxzBxEKCVX+tU=OCA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-22 16:00 UTC (permalink / raw) To: Michal Koutný Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Tue, Aug 22, 2023 at 6:01 AM Michal Koutný <mkoutny@suse.com> wrote: > > On Mon, Aug 21, 2023 at 08:54:57PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote: > > +static void do_stats_flush(struct mem_cgroup *memcg) > > +{ > > + cgroup_rstat_flush(memcg->css.cgroup); > if(memcg == root_mem_cgroup) > atomic_set(&stats_flush_threshold, 0); > > +} > > + > > /* > > * do_unified_stats_flush - do a unified flush of memory cgroup statistics > > * > > @@ -656,7 +667,7 @@ static void do_unified_stats_flush(void) > > > > WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME); > > > > - cgroup_rstat_flush(root_mem_cgroup->css.cgroup); > > + do_stats_flush(root_mem_cgroup); > > > - atomic_set(&stats_flush_threshold, 0); > > atomic_set(&stats_flush_ongoing, 0); > > You may reset stats_flush_threshold to save the unified flushers some > work. We can probably also kick FLUSH_TIME forward as well. Perhaps I can move both into do_stats_flush() then. If I understand correctly this is what you mean? 
static void do_stats_flush(struct mem_cgroup *memcg) { if (mem_cgroup_is_root(memcg)) WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME); cgroup_rstat_flush(memcg->css.cgroup); if (mem_cgroup_is_root(memcg)) atomic_set(&stats_flush_threshold, 0); } static void do_unified_stats_flush(void) { if (atomic_read(&stats_flush_ongoing) || atomic_xchg(&stats_flush_ongoing, 1)) return; do_stats_flush(root_mem_cgroup); atomic_set(&stats_flush_ongoing, 0); } , or simplify it further by just resetting stats_flush_threshold before we flush as well: static void do_stats_flush(struct mem_cgroup *memcg) { /* for unified flushing, root non-unified flushing can help as well */ if (mem_cgroup_is_root(memcg)) { WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME); atomic_set(&stats_flush_threshold, 0); } cgroup_rstat_flush(memcg->css.cgroup); } static void do_unified_stats_flush(void) { if (atomic_read(&stats_flush_ongoing) || atomic_xchg(&stats_flush_ongoing, 1)) return; do_stats_flush(root_mem_cgroup); atomic_set(&stats_flush_ongoing, 0); } What do you think? ^ permalink raw reply [flat|nested] 43+ messages in thread
[parent not found: <CAJD7tkaVuiMU-ifJiyH5d_W1hi9DnAymYJxzBxEKCVX+tU=OCA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 2/3] mm: memcg: add a helper for non-unified stats flushing [not found] ` <CAJD7tkaVuiMU-ifJiyH5d_W1hi9DnAymYJxzBxEKCVX+tU=OCA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2023-08-22 16:35 ` Michal Koutný 2023-08-22 16:48 ` Yosry Ahmed 0 siblings, 1 reply; 43+ messages in thread From: Michal Koutný @ 2023-08-22 16:35 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 814 bytes --] On Tue, Aug 22, 2023 at 09:00:06AM -0700, Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote: > We can probably also kick FLUSH_TIME forward as well. True, they guard same work. > Perhaps I can move both into do_stats_flush() then. If I understand > correctly this is what you mean? Yes. > What do you think? The latter is certainly better looking code. I wasn't sure at first about moving stats_flush_threshold reset before actual flush but on second thought it should not be a significant change - readers: may skip flushing, the values that they read should still be below the error threshold, - writers: may be slowed down a bit (because of conditional atomic write optimization in memcg_rstat_updates), presumably not on average though. So the latter should work too. Michal [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 2/3] mm: memcg: add a helper for non-unified stats flushing 2023-08-22 16:35 ` Michal Koutný @ 2023-08-22 16:48 ` Yosry Ahmed 0 siblings, 0 replies; 43+ messages in thread From: Yosry Ahmed @ 2023-08-22 16:48 UTC (permalink / raw) To: Michal Koutný Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Tue, Aug 22, 2023 at 9:35 AM Michal Koutný <mkoutny@suse.com> wrote: > > On Tue, Aug 22, 2023 at 09:00:06AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote: > > We can probably also kick FLUSH_TIME forward as well. > > True, they guard same work. > > > Perhaps I can move both into do_stats_flush() then. If I understand > > correctly this is what you mean? > > Yes. > > > What do you think? > > The latter is certainly better looking code. > > I wasn't sure at first about moving stats_flush_threshold reset before > actual flush but on second thought it should not be a significant change > - readers: may skip flushing, the values that they read should still be > below the error threshold, Unified readers will skip anyway as there's an ongoing flush, non-unified readers don't check the threshold anyway (with this patch series). So for readers it should not be a change. > - writers: may be slowed down a bit (because of conditional atomic write > optimization in memcg_rstat_updates), presumably not on average > though. Yeah writers will start doing atomic writes once a flush starts instead of once it ends. I don't think it will matter in practice though. The optimization is only effective if we manage to surpass the threshold before the periodic flusher (or any unified flusher) kicks in and resets it. If we start doing atomic writes earlier, then we will also stop earlier; the number of atomic writes should stay the same. 
I think the only difference will be the scenario where we start atomic writes early but the periodic flush happens before we reach the threshold, in which case we aren't doing a lot of updates anyway. I hope this makes _any_ sense :) > > So the latter should work too. > I will include that in v2. I will wait for a bit of further review comments on this version first. Thanks for taking a look! > Michal ^ permalink raw reply [flat|nested] 43+ messages in thread
[parent not found: <20230821205458.1764662-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <20230821205458.1764662-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2023-08-21 20:54 ` Yosry Ahmed [not found] ` <20230821205458.1764662-4-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-21 20:54 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Yosry Ahmed Unified flushing allows for great concurrency for paths that attempt to flush the stats, at the expense of potential staleness and a single flusher paying the extra cost of flushing the full tree. This tradeoff makes sense for in-kernel flushers that may observe high concurrency (e.g. reclaim, refault). For userspace readers, stale stats may be unexpected and problematic, especially when such stats are used for critical paths such as userspace OOM handling. Additionally, a userspace reader will occasionally pay the cost of flushing the entire hierarchy, which also causes problems in some cases [1]. Opt userspace reads out of unified flushing. This makes the cost of reading the stats more predictable (proportional to the size of the subtree), as well as the freshness of the stats. Since userspace readers are not expected to have similar concurrency to in-kernel flushers, serializing them among themselves and among in-kernel flushers should be okay. This was tested on a machine with 256 cpus by running a synthetic test The script that creates 50 top-level cgroups, each with 5 children (250 leaf cgroups). Each leaf cgroup has 10 processes running that allocate memory beyond the cgroup limit, invoking reclaim (which is an in-kernel unified flusher). 
Concurrently, one thread is spawned per-cgroup to read the stats every second (including root, top-level, and leaf cgroups -- so total 251 threads). No regressions were observed in the total running time; which means that non-unified userspace readers are not slowing down in-kernel unified flushers: Base (mm-unstable): real 0m18.228s user 0m9.463s sys 60m15.879s real 0m20.828s user 0m8.535s sys 70m12.364s real 0m19.789s user 0m9.177s sys 66m10.798s With this patch: real 0m19.632s user 0m8.608s sys 64m23.483s real 0m18.463s user 0m7.465s sys 60m34.089s real 0m20.309s user 0m7.754s sys 68m2.392s Additionally, the average latency for reading stats went from roughly 40ms to 5 ms, because we mostly read the stats of leaf cgroups in this script, so we only have to flush one cgroup, instead of *sometimes* flushing the entire tree with unified flushing. [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- mm/memcontrol.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 90f08b35fa77..d3b13a06224c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1606,7 +1606,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) * * Current memory state: */ - mem_cgroup_try_flush_stats(); + do_stats_flush(memcg); for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { u64 size; @@ -4048,7 +4048,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) int nid; struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - mem_cgroup_try_flush_stats(); + do_stats_flush(memcg); for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { seq_printf(m, "%s=%lu", stat->name, @@ -4123,7 +4123,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); - 
mem_cgroup_try_flush_stats(); + do_stats_flush(memcg); for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { unsigned long nr; @@ -4625,7 +4625,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); struct mem_cgroup *parent; - mem_cgroup_try_flush_stats(); + do_stats_flush(memcg); *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); @@ -6640,7 +6640,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v) int i; struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - mem_cgroup_try_flush_stats(); + do_stats_flush(memcg); for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { int nid; -- 2.42.0.rc1.204.g551eb34607-goog ^ permalink raw reply related [flat|nested] 43+ messages in thread
[parent not found: <20230821205458.1764662-4-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <20230821205458.1764662-4-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2023-08-22 9:06 ` Michal Hocko [not found] ` <ZOR6eyYfJYlxdMet-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2023-08-22 9:06 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > Unified flushing allows for great concurrency for paths that attempt to > flush the stats, at the expense of potential staleness and a single > flusher paying the extra cost of flushing the full tree. > > This tradeoff makes sense for in-kernel flushers that may observe high > concurrency (e.g. reclaim, refault). For userspace readers, stale stats > may be unexpected and problematic, especially when such stats are used > for critical paths such as userspace OOM handling. Additionally, a > userspace reader will occasionally pay the cost of flushing the entire > hierarchy, which also causes problems in some cases [1]. > > Opt userspace reads out of unified flushing. This makes the cost of > reading the stats more predictable (proportional to the size of the > subtree), as well as the freshness of the stats. Since userspace readers > are not expected to have similar concurrency to in-kernel flushers, > serializing them among themselves and among in-kernel flushers should be > okay. > > This was tested on a machine with 256 cpus by running a synthetic test > The script that creates 50 top-level cgroups, each with 5 children (250 > leaf cgroups). Each leaf cgroup has 10 processes running that allocate > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel > unified flusher). 
Concurrently, one thread is spawned per-cgroup to read > the stats every second (including root, top-level, and leaf cgroups -- > so total 251 threads). No regressions were observed in the total running > time; which means that non-unified userspace readers are not slowing > down in-kernel unified flushers: I have to admit I am rather confused by cgroup_rstat_flush (and cgroup_rstat_flush_locked). The former says it can block but the later doesn't ever block and even if it drops the cgroup_rstat_lock it merely cond_rescheds or busy loops. How much of a contention and yielding can you see with this patch? What is the worst case? How bad a random user can make the situation by going crazy and trying to flush from many different contexts? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
[parent not found: <ZOR6eyYfJYlxdMet-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <ZOR6eyYfJYlxdMet-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2023-08-22 15:30 ` Yosry Ahmed [not found] ` <CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-22 15:30 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 3793 bytes --] On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > Unified flushing allows for great concurrency for paths that attempt to > > flush the stats, at the expense of potential staleness and a single > > flusher paying the extra cost of flushing the full tree. > > > > This tradeoff makes sense for in-kernel flushers that may observe high > > concurrency (e.g. reclaim, refault). For userspace readers, stale stats > > may be unexpected and problematic, especially when such stats are used > > for critical paths such as userspace OOM handling. Additionally, a > > userspace reader will occasionally pay the cost of flushing the entire > > hierarchy, which also causes problems in some cases [1]. > > > > Opt userspace reads out of unified flushing. This makes the cost of > > reading the stats more predictable (proportional to the size of the > > subtree), as well as the freshness of the stats. Since userspace readers > > are not expected to have similar concurrency to in-kernel flushers, > > serializing them among themselves and among in-kernel flushers should be > > okay. 
> > > > This was tested on a machine with 256 cpus by running a synthetic test > > The script that creates 50 top-level cgroups, each with 5 children (250 > > leaf cgroups). Each leaf cgroup has 10 processes running that allocate > > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel > > unified flusher). Concurrently, one thread is spawned per-cgroup to read > > the stats every second (including root, top-level, and leaf cgroups -- > > so total 251 threads). No regressions were observed in the total running > > time; which means that non-unified userspace readers are not slowing > > down in-kernel unified flushers: > > I have to admit I am rather confused by cgroup_rstat_flush (and > cgroup_rstat_flush_locked). The former says it can block but the later > doesn't ever block and even if it drops the cgroup_rstat_lock it merely > cond_rescheds or busy loops. How much of a contention and yielding can > you see with this patch? What is the worst case? How bad a random user > can make the situation by going crazy and trying to flush from many > different contexts? Userspace readers (or more generically non-unified flushers) can all collectively only block a single unified flusher at most. Specifically, one userspace reader goes to flush and holds cgroup_rstat_lock, meanwhile an in-kernel flusher (e.g. reclaim) goes and tries to flush, and spins on cgroup_rstat_lock. Other in-kernel (unified) flushers will just see another unified flusher in progress and skip. So userspace can only block a single in-kernel reclaimer. Not that it's not really that bad because: (a) As you note, cgroup_rstat_flush() does not really "block", it's cpu-bound. Even when it cond_resched()'s, it yields the lock first. So it can't really hold anyone hostage for long. (b) I assume a random user can only read their own stats, which should be a relatively small subtree, quick to flush. I am assuming a random user cannot read root's memory.stat (which is most expensive). 
(c) Excessive flushing doesn't really build up because there will be nothing to flush and the lock will be released very shortly after it's held. So to answer your question, I don't think a random user can really affect the system in a significant way by constantly flushing. In fact, in the test script (which I am now attaching, in case you're interested), there are hundreds of threads that are reading stats of different cgroups every 1s, and I don't see any negative effects on in-kernel flushers in this case (reclaimers). > -- > Michal Hocko > SUSE Labs

[-- Attachment #2: stress.sh --]
[-- Type: text/x-sh, Size: 2967 bytes --]

#!/bin/bash

NR_L1_CGROUPS=50
NR_L2_CGROUPS_PER_L1=5
NR_L2_CGROUPS=$(( NR_L1_CGROUPS * NR_L2_CGROUPS_PER_L1 ))
NR_WORKERS_PER_CGROUP=10
WORKER_MB=10
NR_TOTAL_WORKERS=$(( NR_WORKERS_PER_CGROUP * (1 + NR_L1_CGROUPS + NR_L2_CGROUPS) ))
L1_CGROUP_LIMIT_MB=$(( NR_L2_CGROUPS_PER_L1 * NR_WORKERS_PER_CGROUP * WORKER_MB / 4 ))
TOTAL_MB=$(( NR_TOTAL_WORKERS * WORKER_MB ))
TMPFS=$(mktemp -d)
ROOT="/sys/fs/cgroup/"
ADMIN="/sys/fs/cgroup/admin"
ZRAM_DEV="/mnt/devtmpfs/zram0"

cleanup_cgroup() {
    local -r cg=$1
    local -r procs=$(cat $cg/cgroup.procs)
    if [[ -n $procs ]]; then
        kill -KILL $procs
        wait $procs 2>/dev/null
    fi
    while [[ -n $(cat $cg/cgroup.procs) ]]; do
        sleep 0.1
    done
    rmdir $cg
}

cleanup() {
    for i in $(seq $NR_L1_CGROUPS); do
        l1="$ROOT/parent$i"
        for j in $(seq $NR_L2_CGROUPS_PER_L1); do
            l2="$l1/child$j"
            cleanup_cgroup $l2
        done
        cleanup_cgroup $l1
    done
    cleanup_cgroup $ADMIN
    umount $TMPFS
    rm -rf $TMPFS
    swapoff $ZRAM_DEV
    echo 1 > "/sys/block/zram0/reset"
}
trap cleanup INT QUIT EXIT

run_stats_reader() {
    local -r cg_run=$1
    local -r cg_read=$2
    # read the stats every second until workers are done
    echo 0 > "$cg_run/cgroup.procs"
    while [[ $(ls $TMPFS) ]]; do
        cat "$cg_read/memory.stat" > /dev/null
        sleep 1
    done
}

run_worker() {
    local -r cg=$1
    local -r f=$2
    echo 0 > "$cg/cgroup.procs"
    rm -rf "$f"
    dd if=/dev/zero of="$f" bs=1M count=$WORKER_MB status=none
    cat "$f" > /dev/null
    rm "$f"
}

# Setup zram
echo $((TOTAL_MB << 20)) > "/sys/block/zram0/disksize"
mkswap $ZRAM_DEV
swapon $ZRAM_DEV
echo "Setup zram done"

# Mount tmpfs
mount -t tmpfs none $TMPFS

# Create admin cgroup
mkdir $ADMIN

# Create worker cgroups, set limits
echo "+memory" > "$ROOT/cgroup.subtree_control"
for i in $(seq $NR_L1_CGROUPS); do
    l1="$ROOT/parent$i"
    mkdir $l1
    echo "+memory" > "$l1/cgroup.subtree_control"
    for j in $(seq $NR_L2_CGROUPS_PER_L1); do
        l2="$l1/child$j"
        mkdir $l2
    done
    echo $(( L1_CGROUP_LIMIT_MB << 20)) > "$l1/memory.max"
done
echo "Setup $NR_L1_CGROUPS L1 cgroups with limit ${L1_CGROUP_LIMIT_MB}M done"
echo "Each L1 cgroup has $NR_L2_CGROUPS_PER_L1 L2 children"

# Start workers to allocate tmpfs memory
for i in $(seq $NR_L1_CGROUPS); do
    l1="$ROOT/parent$i"
    for j in $(seq $NR_L2_CGROUPS_PER_L1); do
        l2="$l1/child$j"
        for k in $(seq $NR_WORKERS_PER_CGROUP); do
            (run_worker "$l2" "$TMPFS/$i-$j-$k")&
        done
    done
done

# Start stat readers
(run_stats_reader "$ADMIN" "$ROOT")&
for i in $(seq $NR_L1_CGROUPS); do
    l1="$ROOT/parent$i"
    (run_stats_reader "$ADMIN" "$l1")&
    for j in $(seq $NR_L2_CGROUPS_PER_L1); do
        l2="$l1/child$j"
        # Run stat readers in the admin cgroup as well as the cgroup itself
        (run_stats_reader "$ADMIN" "$l2")&
        (run_stats_reader "$l2" "$l2")&
    done
done

# Wait for workers
echo "Ran $NR_WORKERS_PER_CGROUP workers per L2 cgroup, each allocating ${WORKER_MB}M, waiting.."
wait

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2023-08-23 7:33 ` Michal Hocko 2023-08-23 14:55 ` Yosry Ahmed 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2023-08-23 7:33 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: [...] > So to answer your question, I don't think a random user can really > affect the system in a significant way by constantly flushing. In > fact, in the test script (which I am now attaching, in case you're > interested), there are hundreds of threads that are reading stats of > different cgroups every 1s, and I don't see any negative effects on > in-kernel flushers in this case (reclaimers). I suspect you have missed my point. Maybe I am just misunderstanding the code but it seems to me that the lock dropping inside cgroup_rstat_flush_locked effectivelly allows unbounded number of contenders which is really dangerous when it is triggerable from the userspace. The number of spinners at a moment is always bound by the number CPUs but depending on timing many potential spinners might be cond_rescheded and the worst time latency to complete can be really high. Makes more sense? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-23 7:33 ` Michal Hocko @ 2023-08-23 14:55 ` Yosry Ahmed 2023-08-24 7:13 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-23 14:55 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > [...] > > So to answer your question, I don't think a random user can really > > affect the system in a significant way by constantly flushing. In > > fact, in the test script (which I am now attaching, in case you're > > interested), there are hundreds of threads that are reading stats of > > different cgroups every 1s, and I don't see any negative effects on > > in-kernel flushers in this case (reclaimers). > > I suspect you have missed my point. I suspect you are right :) > Maybe I am just misunderstanding > the code but it seems to me that the lock dropping inside > cgroup_rstat_flush_locked effectivelly allows unbounded number of > contenders which is really dangerous when it is triggerable from the > userspace. The number of spinners at a moment is always bound by the > number CPUs but depending on timing many potential spinners might be > cond_rescheded and the worst time latency to complete can be really > high. Makes more sense? I think I understand better now. So basically because we might drop the lock and resched, there can be nr_cpus spinners + other spinners that are currently scheduled away, so these will need to wait to be scheduled and then start spinning on the lock. 
This may happen for one reader multiple times during its read, which is what can cause a high worst case latency. I hope I understood you correctly this time. Did I? So the logic to give up the lock and sleep was introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") in 4.18. It has been possible for userspace to trigger this scenario by reading cpu.stat, which has been using rstat since then. On the memcg side, it was also possible to trigger this behavior between commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat") and commit fd25a9e0e23b ("memcg: unify memcg stat flushing") (i.e. between 5.13 and 5.16). I am not sure there have been any problems from this, but perhaps Tejun can answer this better than I can. The way I think about it is that random userspace reads will mostly be reading their own subtrees, which are generally not very large (and can be limited), so every individual read should be cheap enough. Also, subsequent flushes on overlapping subtrees will have very little to do as there won't be many pending updates, so they should also be very cheap. So unless multiple jobs on the same machine are collectively trying to act maliciously (purposefully or not) and concurrently spawn multiple readers of different parts of the hierarchy (and maintain enough activity to generate stat updates to flush), I don't think it's a concern. I also imagine (but haven't checked) that there is some locking at some level that will throttle a malicious job that spawns multiple readers to the same memory.stat file. I hope this answers your question. > -- > Michal Hocko > SUSE Labs ^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-23 14:55 ` Yosry Ahmed @ 2023-08-24 7:13 ` Michal Hocko 2023-08-24 18:15 ` Yosry Ahmed 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2023-08-24 7:13 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Wed 23-08-23 07:55:40, Yosry Ahmed wrote: > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > [...] > > > So to answer your question, I don't think a random user can really > > > affect the system in a significant way by constantly flushing. In > > > fact, in the test script (which I am now attaching, in case you're > > > interested), there are hundreds of threads that are reading stats of > > > different cgroups every 1s, and I don't see any negative effects on > > > in-kernel flushers in this case (reclaimers). > > > > I suspect you have missed my point. > > I suspect you are right :) > > > > Maybe I am just misunderstanding > > the code but it seems to me that the lock dropping inside > > cgroup_rstat_flush_locked effectivelly allows unbounded number of > > contenders which is really dangerous when it is triggerable from the > > userspace. The number of spinners at a moment is always bound by the > > number CPUs but depending on timing many potential spinners might be > > cond_rescheded and the worst time latency to complete can be really > > high. Makes more sense? > > I think I understand better now. So basically because we might drop > the lock and resched, there can be nr_cpus spinners + other spinners > that are currently scheduled away, so these will need to wait to be > scheduled and then start spinning on the lock. 
This may happen for one > reader multiple times during its read, which is what can cause a high > worst case latency. > > I hope I understood you correctly this time. Did I? Yes. I would just add that this could also influence the worst case latency for a different reader - so an adversary user can stall others. Exposing a shared global lock in an uncontrollable way over a generally available user interface is not really a great idea IMHO. -- Michal Hocko SUSE Labs ^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-24 7:13 ` Michal Hocko @ 2023-08-24 18:15 ` Yosry Ahmed 2023-08-24 18:50 ` Yosry Ahmed 0 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-24 18:15 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote: > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > > [...] > > > > So to answer your question, I don't think a random user can really > > > > affect the system in a significant way by constantly flushing. In > > > > fact, in the test script (which I am now attaching, in case you're > > > > interested), there are hundreds of threads that are reading stats of > > > > different cgroups every 1s, and I don't see any negative effects on > > > > in-kernel flushers in this case (reclaimers). > > > > > > I suspect you have missed my point. > > > > I suspect you are right :) > > > > > > > Maybe I am just misunderstanding > > > the code but it seems to me that the lock dropping inside > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of > > > contenders which is really dangerous when it is triggerable from the > > > userspace. The number of spinners at a moment is always bound by the > > > number CPUs but depending on timing many potential spinners might be > > > cond_rescheded and the worst time latency to complete can be really > > > high. Makes more sense? > > > > I think I understand better now. 
So basically because we might drop > > the lock and resched, there can be nr_cpus spinners + other spinners > > that are currently scheduled away, so these will need to wait to be > > scheduled and then start spinning on the lock. This may happen for one > > reader multiple times during its read, which is what can cause a high > > worst case latency. > > > > I hope I understood you correctly this time. Did I? > > Yes. I would just add that this could also influence the worst case > latency for a different reader - so an adversary user can stall others. I can add that for v2 to the commit log, thanks. > Exposing a shared global lock in uncontrolable way over generally > available user interface is not really a great idea IMHO. I think that's how it was always meant to be when it was designed. The global rstat lock has always existed and was always available to userspace readers. The memory controller took a different path at some point with unified flushing, but that was mainly because of high concurrency from in-kernel flushers, not because userspace readers caused a problem. Outside of memcg, the core cgroup code has always exercised this global lock when reading cpu.stat since rstat's introduction. I assume there hasn't been any problems since it's still there. I was hoping Tejun would confirm/deny this. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-24 18:15 ` Yosry Ahmed @ 2023-08-24 18:50 ` Yosry Ahmed 2023-08-25 7:05 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-24 18:50 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote: > > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > > > [...] > > > > > So to answer your question, I don't think a random user can really > > > > > affect the system in a significant way by constantly flushing. In > > > > > fact, in the test script (which I am now attaching, in case you're > > > > > interested), there are hundreds of threads that are reading stats of > > > > > different cgroups every 1s, and I don't see any negative effects on > > > > > in-kernel flushers in this case (reclaimers). > > > > > > > > I suspect you have missed my point. > > > > > > I suspect you are right :) > > > > > > > > > > Maybe I am just misunderstanding > > > > the code but it seems to me that the lock dropping inside > > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of > > > > contenders which is really dangerous when it is triggerable from the > > > > userspace. The number of spinners at a moment is always bound by the > > > > number CPUs but depending on timing many potential spinners might be > > > > cond_rescheded and the worst time latency to complete can be really > > > > high. 
Makes more sense? > > > > > > I think I understand better now. So basically because we might drop > > > the lock and resched, there can be nr_cpus spinners + other spinners > > > that are currently scheduled away, so these will need to wait to be > > > scheduled and then start spinning on the lock. This may happen for one > > > reader multiple times during its read, which is what can cause a high > > > worst case latency. > > > > > > I hope I understood you correctly this time. Did I? > > > > Yes. I would just add that this could also influence the worst case > > latency for a different reader - so an adversary user can stall others. > > I can add that for v2 to the commit log, thanks. > > > Exposing a shared global lock in uncontrolable way over generally > > available user interface is not really a great idea IMHO. > > I think that's how it was always meant to be when it was designed. The > global rstat lock has always existed and was always available to > userspace readers. The memory controller took a different path at some > point with unified flushing, but that was mainly because of high > concurrency from in-kernel flushers, not because userspace readers > caused a problem. Outside of memcg, the core cgroup code has always > exercised this global lock when reading cpu.stat since rstat's > introduction. I assume there hasn't been any problems since it's still > there. I was hoping Tejun would confirm/deny this. One thing we can do to remedy this situation is to replace the global rstat lock with a mutex, and drop the resched/lock dropping condition. Tejun suggested this in the previous thread. This effectively reverts 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") since now all the flushing contexts are sleepable. My synthetic stress test does not show any regressions with mutexes, and there is a small boost to reading latency (probably because we stop dropping the lock / rescheduling). 
Not sure if we may start seeing need_resched warnings on big flushes though. One other concern that Shakeel pointed out to me is preemption. If someone holding the mutex gets preempted this may starve other waiters. We can disable preemption while we hold the mutex, not sure if that's a common pattern though. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-24 18:50 ` Yosry Ahmed @ 2023-08-25 7:05 ` Michal Hocko 2023-08-25 15:14 ` Yosry Ahmed [not found] ` <ZOhSyvDxAyYUJ45i-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 2 replies; 43+ messages in thread From: Michal Hocko @ 2023-08-25 7:05 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Thu 24-08-23 11:50:51, Yosry Ahmed wrote: > On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote: > > > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > > > > [...] > > > > > > So to answer your question, I don't think a random user can really > > > > > > affect the system in a significant way by constantly flushing. In > > > > > > fact, in the test script (which I am now attaching, in case you're > > > > > > interested), there are hundreds of threads that are reading stats of > > > > > > different cgroups every 1s, and I don't see any negative effects on > > > > > > in-kernel flushers in this case (reclaimers). > > > > > > > > > > I suspect you have missed my point. > > > > > > > > I suspect you are right :) > > > > > > > > > > > > > Maybe I am just misunderstanding > > > > > the code but it seems to me that the lock dropping inside > > > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of > > > > > contenders which is really dangerous when it is triggerable from the > > > > > userspace. 
The number of spinners at a moment is always bound by the > > > > > number CPUs but depending on timing many potential spinners might be > > > > > cond_rescheded and the worst time latency to complete can be really > > > > > high. Makes more sense? > > > > > > > > I think I understand better now. So basically because we might drop > > > > the lock and resched, there can be nr_cpus spinners + other spinners > > > > that are currently scheduled away, so these will need to wait to be > > > > scheduled and then start spinning on the lock. This may happen for one > > > > reader multiple times during its read, which is what can cause a high > > > > worst case latency. > > > > > > > > I hope I understood you correctly this time. Did I? > > > > > > Yes. I would just add that this could also influence the worst case > > > latency for a different reader - so an adversary user can stall others. > > > > I can add that for v2 to the commit log, thanks. > > > > > Exposing a shared global lock in uncontrolable way over generally > > > available user interface is not really a great idea IMHO. > > > > I think that's how it was always meant to be when it was designed. The > > global rstat lock has always existed and was always available to > > userspace readers. The memory controller took a different path at some > > point with unified flushing, but that was mainly because of high > > concurrency from in-kernel flushers, not because userspace readers > > caused a problem. Outside of memcg, the core cgroup code has always > > exercised this global lock when reading cpu.stat since rstat's > > introduction. I assume there hasn't been any problems since it's still > > there. I suspect nobody has just considered a malfunctioning or adversary workloads so far. > > I was hoping Tejun would confirm/deny this. Yes, that would be interesting to hear. > One thing we can do to remedy this situation is to replace the global > rstat lock with a mutex, and drop the resched/lock dropping condition. 
> Tejun suggested this in the previous thread. This effectively reverts > 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") > since now all the flushing contexts are sleepable. I would have a very daring question. Do we really need a global lock in the first place? AFAIU this lock serializes (kinda, as the lock can be dropped midway) flushers and cgroup_rstat_flush_hold/release callers (a single one ATM). I can see cgroup_base_stat_cputime_show would like to have a consistent view on multiple stats but can we live without a strong guarantee or replace the lock with a seqlock instead? > My synthetic stress test does not show any regressions with mutexes, > and there is a small boost to reading latency (probably because we > stop dropping the lock / rescheduling). Not sure if we may start > seeing need_resched warnings on big flushes though. Reading 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") it seems the point of moving away from the mutex was to have a more usable API. > One other concern that Shakeel pointed out to me is preemption. If > someone holding the mutex gets preempted this may starve other > waiters. We can disable preemption while we hold the mutex, not sure > if that's a common pattern though. No, not really. It is expected that the holder of a mutex can sleep and can be preempted as well. I might be wrong but the whole discussion so far suggests that the global rstat lock should be reconsidered. From my personal experience global locks easily triggerable from userspace are just a recipe for problems. Stats reading should interfere with the system runtime as little as possible, and it should be deterministic wrt runtime as well. -- Michal Hocko SUSE Labs ^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-25 7:05 ` Michal Hocko @ 2023-08-25 15:14 ` Yosry Ahmed [not found] ` <CAJD7tkYPyb+2zOKqctQw-vhuwYRg85e6v2Y44xWJofHZ+F+YQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> [not found] ` <ZOhSyvDxAyYUJ45i-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 1 sibling, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-25 15:14 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 24-08-23 11:50:51, Yosry Ahmed wrote: > > On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote: > > > > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > > > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > > > > > [...] > > > > > > > So to answer your question, I don't think a random user can really > > > > > > > affect the system in a significant way by constantly flushing. In > > > > > > > fact, in the test script (which I am now attaching, in case you're > > > > > > > interested), there are hundreds of threads that are reading stats of > > > > > > > different cgroups every 1s, and I don't see any negative effects on > > > > > > > in-kernel flushers in this case (reclaimers). > > > > > > > > > > > > I suspect you have missed my point. 
> > > > > > > > > > I suspect you are right :) > > > > > > > > > > > > > > > > Maybe I am just misunderstanding > > > > > > the code but it seems to me that the lock dropping inside > > > > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of > > > > > > contenders which is really dangerous when it is triggerable from the > > > > > > userspace. The number of spinners at a moment is always bound by the > > > > > > number CPUs but depending on timing many potential spinners might be > > > > > > cond_rescheded and the worst time latency to complete can be really > > > > > > high. Makes more sense? > > > > > > > > > > I think I understand better now. So basically because we might drop > > > > > the lock and resched, there can be nr_cpus spinners + other spinners > > > > > that are currently scheduled away, so these will need to wait to be > > > > > scheduled and then start spinning on the lock. This may happen for one > > > > > reader multiple times during its read, which is what can cause a high > > > > > worst case latency. > > > > > > > > > > I hope I understood you correctly this time. Did I? > > > > > > > > Yes. I would just add that this could also influence the worst case > > > > latency for a different reader - so an adversary user can stall others. > > > > > > I can add that for v2 to the commit log, thanks. > > > > > > > Exposing a shared global lock in uncontrolable way over generally > > > > available user interface is not really a great idea IMHO. > > > > > > I think that's how it was always meant to be when it was designed. The > > > global rstat lock has always existed and was always available to > > > userspace readers. The memory controller took a different path at some > > > point with unified flushing, but that was mainly because of high > > > concurrency from in-kernel flushers, not because userspace readers > > > caused a problem. 
Outside of memcg, the core cgroup code has always > > > exercised this global lock when reading cpu.stat since rstat's > > > introduction. I assume there hasn't been any problems since it's still > > > there. > > I suspect nobody has just considered a malfunctioning or adversary > workloads so far. Perhaps that also means it's not a problem in practice, or at least I hope so :) > > > > I was hoping Tejun would confirm/deny this. > > Yes, that would be interesting to hear. > > > One thing we can do to remedy this situation is to replace the global > > rstat lock with a mutex, and drop the resched/lock dropping condition. > > Tejun suggested this in the previous thread. This effectively reverts > > 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") > > since now all the flushing contexts are sleepable. > > I would have a very daring question. Do we really need a global lock in > the first place? AFAIU this locks serializes (kinda as the lock can be > dropped midway) flushers and cgroup_rstat_flush_hold/release caller (a > single one ATM). I can see cgroup_base_stat_cputime_show would like to > have a consistent view on multiple stats but can we live without a > strong guarantee or to replace the lock with seqlock instead? Unfortunately, it's more difficult than this. I thought about breaking down that lock and fell back to this solution. See below. > > > My synthetic stress test does not show any regressions with mutexes, > > and there is a small boost to reading latency (probably because we > > stop dropping the lock / rescheduling). Not sure if we may start > > seeing need_resched warnings on big flushes though. > > Reading 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") > it seems the point of moving away from mutex was to have a more usable > API. Right, we needed an atomic interface for flushing, but that later turned out to cause some problems, so we reworked the code such that we never have to flush atomically. 
Now we can go back to the mutex if it makes things better, I am not really sure how much it helps though. > > > One other concern that Shakeel pointed out to me is preemption. If > > someone holding the mutex gets preempted this may starve other > > waiters. We can disable preemption while we hold the mutex, not sure > > if that's a common pattern though. > > No, not really. It is expected that holder of mutex can sleep and can be > preempted as well. Maybe not for this specific case because it's a global mutex and holding it for too long might cause problems? Is it bad to disable preemption while holding a mutex? > > I might be wrong but the whole discussion so far suggests that the > global rstat lock should be reconsidered. From my personal experience > global locks easily triggerable from the userspace are just a receip for > problems. Stats reading shouldn't be interfering with the system runtime > as much as possible and they should be deterministic wrt runtime as > well. The problem is that the global lock also serializes the global counters that we flush to. I will talk from the memcg flushing perspective as that's what I am familiar with. I am not sure how much this is transferable to other flushers. On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock synchronizes access to multiple counters, for this discussion what's most important are: - The global stat counters of the memcg being flushed (e.g. memcg->vmstats->state). - The pending stat counters of the parent being flushed (e.g. parent->vmstats->state_pending). To get rid of this lock, we either need to use atomics for those counters, or have fine-grained locks. I experimented a while back with atomic and flushing was significantly more expensive. The number of atomic operations would scale with O(# cgroups * # cpus) and can grow unbounded. The other option is fine-grained locks. In this case we would need to lock both the memcg being flushed and its parent together. 
This can go wrong with lock ordering, and for top-level memcgs the root
memcg lock would become the new global lock anyway. One way to overcome
this is to change the parent's pending stat counters to also be percpu.
This would increase the memory usage of the stat counters per-memcg by
hundreds of bytes per cpu. Let's assume that's okay, so we only need to
lock one cgroup at a time.

There are more problems. We also have a percpu lock
(cgroup_rstat_cpu_lock), which we use to lock the percpu tree that holds
the cgroups that have updates on this cpu. It is held by both flushing
contexts and updating contexts (hot paths). Ideally we don't want to
spin on a per-cgroup (non-percpu) lock while holding the percpu lock, as
flushers of different cpus would start blocking one another, as well as
blocking updaters. On the other hand, we need to hold the percpu lock to
pop a cgroup from that tree and lock it. It's a chicken-and-egg problem.

Also, if we release the percpu lock while flushing, we open another can
of worms:
(a) Concurrent updates can keep updating the tree, putting us in an
    endless flushing loop. We need some sort of generation tracking for
    this.
(b) Concurrent flushing can flush a parent prematurely on the same cpu
    as we are flushing a child, and not get the updates from the child.

One possible scheme to handle the above is as follows:
1. Hold the percpu lock, find the cgroup that needs to be flushed next.
2. Trylock that cgroup. If we succeed, we flush it with both the percpu
   and the per-cgroup locks held.
3. If we fail, release the percpu lock and spin on the per-cgroup lock.
4. When we get the per-cgroup lock, take the percpu lock again, and make
   sure that the locked cgroup is still the correct cgroup to flush. If
   not, repeat.
5. Flush the cgroup, and go back to step 1 to get the next cgroup.

Of course this is complex and error-prone, and might introduce
significant overhead due to the number of locks we need to take compared
with what we currently have.
I guess what I am trying to say is, breaking down that lock is major
surgery that might require re-designing or re-implementing some parts of
rstat. I would be extremely happy to be proven wrong. If we can break
down that lock then there is no need for unified flushing even for
in-kernel contexts, and we can all live happily ever after with
cheap(ish) and accurate stats flushing.

I really hope we can move forward with the problems at hand (sometimes
reads are expensive, sometimes reads are stale), and not block fixing
them until we can come up with an alternative to that global lock
(unless, of course, there is a simpler way of doing that).

Sorry for the very long reply :) Thanks!

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  [not found] ` <CAJD7tkYPyb+2zOKqctQw-vhuwYRg85e6v2Y44xWJofHZ+F+YQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2023-08-25 18:17 ` Michal Hocko
  2023-08-25 18:21   ` Yosry Ahmed
  2023-08-28 15:47   ` Michal Hocko
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2023-08-25 18:17 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
> On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
[...]
> > I might be wrong but the whole discussion so far suggests that the
> > global rstat lock should be reconsidered. From my personal experience,
> > global locks easily triggerable from userspace are just a recipe for
> > problems. Stats reading should interfere with the system runtime as
> > little as possible, and it should be deterministic wrt runtime as well.
>
> The problem is that the global lock also serializes the global
> counters that we flush to. I will talk from the memcg flushing
> perspective as that's what I am familiar with. I am not sure how much
> this is transferable to other flushers.
>
> On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
> synchronizes access to multiple counters; for this discussion the most
> important are:
> - The global stat counters of the memcg being flushed (e.g.
>   memcg->vmstats->state).
> - The pending stat counters of the parent being flushed (e.g.
>   parent->vmstats->state_pending).

I haven't digested the rest of the email yet (Friday brain, sorry) but I
do not think you are addressing this particular part, so let me ask
before I dive more into the following.
I really do not follow the serialization requirement here, because the
lock doesn't really serialize the flushing, does it? At least not in the
sense of letting a single caller do the flushing atomically wrt other
flushers. It is possible that the current flusher simply drops the lock
midway and another one retakes the lock and performs the operation
again. So what additional flushing synchronization does it provide, and
why can't parallel flushers simply compete over the pcp spinlocks?

So what am I missing?
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-25 18:17 ` Michal Hocko @ 2023-08-25 18:21 ` Yosry Ahmed [not found] ` <CAJD7tka=60_vPMY9Tg8tH+55g-feV1B24VNmDpp_3iMHqrUh7Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-25 18:21 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote: > > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > I might be wrong but the whole discussion so far suggests that the > > > global rstat lock should be reconsidered. From my personal experience > > > global locks easily triggerable from the userspace are just a receip for > > > problems. Stats reading shouldn't be interfering with the system runtime > > > as much as possible and they should be deterministic wrt runtime as > > > well. > > > > The problem is that the global lock also serializes the global > > counters that we flush to. I will talk from the memcg flushing > > perspective as that's what I am familiar with. I am not sure how much > > this is transferable to other flushers. > > > > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock > > synchronizes access to multiple counters, for this discussion what's > > most important are: > > - The global stat counters of the memcg being flushed (e.g. > > memcg->vmstats->state). > > - The pending stat counters of the parent being flushed (e.g. > > parent->vmstats->state_pending). > > I haven't digested the rest of the email yet (Friday brain, sorry) but I > do not think you are adressing this particular part so let me ask before > I dive more into the following. 
I really do not follow the serialization > requirement here because the lock doesn't really serialize the flushing, > does it? At least not in a sense of a single caller to do the flushing > atomicaly from other flushers. It is possible that the current flusher > simply drops the lock midway and another one retakes the lock and > performs the operation again. So what additional flushing > synchronization does it provide and why cannot parallel flushers simply > compete over pcp spinlocks? > > So what am I missing? Those counters are non-atomic. The lock makes sure we don't have two concurrent flushers updating the same counter locklessly and non-atomically, which would be possible if we flush the same cgroup on two different cpus in parallel. > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <CAJD7tka=60_vPMY9Tg8tH+55g-feV1B24VNmDpp_3iMHqrUh7Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2023-08-25 18:43 ` Michal Hocko [not found] ` <ZOj2NeU5yYhrTZPJ-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2023-08-25 18:43 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Fri 25-08-23 11:21:16, Yosry Ahmed wrote: > On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > > > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote: > > > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > [...] > > > > I might be wrong but the whole discussion so far suggests that the > > > > global rstat lock should be reconsidered. From my personal experience > > > > global locks easily triggerable from the userspace are just a receip for > > > > problems. Stats reading shouldn't be interfering with the system runtime > > > > as much as possible and they should be deterministic wrt runtime as > > > > well. > > > > > > The problem is that the global lock also serializes the global > > > counters that we flush to. I will talk from the memcg flushing > > > perspective as that's what I am familiar with. I am not sure how much > > > this is transferable to other flushers. > > > > > > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock > > > synchronizes access to multiple counters, for this discussion what's > > > most important are: > > > - The global stat counters of the memcg being flushed (e.g. > > > memcg->vmstats->state). > > > - The pending stat counters of the parent being flushed (e.g. > > > parent->vmstats->state_pending). 
> > > > I haven't digested the rest of the email yet (Friday brain, sorry) but I > > do not think you are adressing this particular part so let me ask before > > I dive more into the following. I really do not follow the serialization > > requirement here because the lock doesn't really serialize the flushing, > > does it? At least not in a sense of a single caller to do the flushing > > atomicaly from other flushers. It is possible that the current flusher > > simply drops the lock midway and another one retakes the lock and > > performs the operation again. So what additional flushing > > synchronization does it provide and why cannot parallel flushers simply > > compete over pcp spinlocks? > > > > So what am I missing? > > Those counters are non-atomic. The lock makes sure we don't have two > concurrent flushers updating the same counter locklessly and > non-atomically, which would be possible if we flush the same cgroup on > two different cpus in parallel. pcp lock (cpu_lock) guarantees the very same, doesn't it? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <ZOj2NeU5yYhrTZPJ-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2023-08-25 18:44 ` Michal Hocko 0 siblings, 0 replies; 43+ messages in thread From: Michal Hocko @ 2023-08-25 18:44 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Fri 25-08-23 20:43:02, Michal Hocko wrote: > On Fri 25-08-23 11:21:16, Yosry Ahmed wrote: > > On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > > > > > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote: > > > > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > > [...] > > > > > I might be wrong but the whole discussion so far suggests that the > > > > > global rstat lock should be reconsidered. From my personal experience > > > > > global locks easily triggerable from the userspace are just a receip for > > > > > problems. Stats reading shouldn't be interfering with the system runtime > > > > > as much as possible and they should be deterministic wrt runtime as > > > > > well. > > > > > > > > The problem is that the global lock also serializes the global > > > > counters that we flush to. I will talk from the memcg flushing > > > > perspective as that's what I am familiar with. I am not sure how much > > > > this is transferable to other flushers. > > > > > > > > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock > > > > synchronizes access to multiple counters, for this discussion what's > > > > most important are: > > > > - The global stat counters of the memcg being flushed (e.g. > > > > memcg->vmstats->state). > > > > - The pending stat counters of the parent being flushed (e.g. > > > > parent->vmstats->state_pending). 
> > > > > > I haven't digested the rest of the email yet (Friday brain, sorry) but I > > > do not think you are adressing this particular part so let me ask before > > > I dive more into the following. I really do not follow the serialization > > > requirement here because the lock doesn't really serialize the flushing, > > > does it? At least not in a sense of a single caller to do the flushing > > > atomicaly from other flushers. It is possible that the current flusher > > > simply drops the lock midway and another one retakes the lock and > > > performs the operation again. So what additional flushing > > > synchronization does it provide and why cannot parallel flushers simply > > > compete over pcp spinlocks? > > > > > > So what am I missing? > > > > Those counters are non-atomic. The lock makes sure we don't have two > > concurrent flushers updating the same counter locklessly and > > non-atomically, which would be possible if we flush the same cgroup on > > two different cpus in parallel. > > pcp lock (cpu_lock) guarantees the very same, doesn't it? Nope, it doesn't. I really need to have a deeper look. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  [not found] ` <CAJD7tkYPyb+2zOKqctQw-vhuwYRg85e6v2Y44xWJofHZ+F+YQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2023-08-25 18:17 ` Michal Hocko
@ 2023-08-28 15:47 ` Michal Hocko
  2023-08-28 16:15   ` Yosry Ahmed
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2023-08-28 15:47 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA

Done my homework and studied the rstat code more (sorry, I should have
done that earlier).

On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
[...]
> I guess what I am trying to say is, breaking down that lock is major
> surgery that might require re-designing or re-implementing some parts
> of rstat. I would be extremely happy to be proven wrong. If we can
> break down that lock then there is no need for unified flushing even
> for in-kernel contexts, and we can all live happily ever after with
> cheap(ish) and accurate stats flushing.

Yes, this seems like a big change and also overcomplicates the whole
thing. I am not sure it is worth it.

> I really hope we can move forward with the problems at hand (sometimes
> reads are expensive, sometimes reads are stale), and not block fixing
> them until we can come up with an alternative to that global lock
> (unless, of course, there is a simpler way of doing that).

Well, I really have to say that I do not like the notion that reading
stats is unpredictable. This just makes it really hard to use. If
precision is to be sacrificed, then this should be preferable over
potentially high global lock contention. We already have that model in
place for /proc/vmstat (a configurable timeout for the flusher and a way
to flush explicitly).
I appreciate that you would like to have better precision, but as you
have explored, the locking is really hard to get rid of here.

So from my POV I would prefer to avoid flushing from the stats reading
path and implement force flushing by writing to the stat file. If the 2s
flushing interval is considered too coarse, I would be OK with allowing
it to be set from userspace. This way this would be more in line with
/proc/vmstat, which seems to be working quite well.

If this is not acceptable or deemed a wrong approach long term, then it
would be good to reconsider the current cgroup_rstat_lock at least,
either by turning it into a mutex or by dropping the yielding code,
which can severely affect the worst case latency AFAIU.

> Sorry for the very long reply :)

Thanks for bearing with me and taking the time to formulate all this!
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-28 15:47 ` Michal Hocko @ 2023-08-28 16:15 ` Yosry Ahmed 2023-08-28 17:00 ` Shakeel Butt 0 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-28 16:15 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Mon, Aug 28, 2023 at 8:47 AM Michal Hocko <mhocko@suse.com> wrote: > > Done my homework and studied the rstat code more (sorry should have done > that earlier). > > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote: > [...] > > I guess what I am trying to say is, breaking down that lock is a major > > surgery that might require re-designing or re-implementing some parts > > of rstat. I would be extremely happy to be proven wrong. If we can > > break down that lock then there is no need for unified flushing even > > for in-kernel contexts, and we can all live happily ever after with > > cheap(ish) and accurate stats flushing. > > Yes, this seems like a big change and also over complicating the whole > thing. I am not sure this is worth it. > > > I really hope we can move forward with the problems at hand (sometimes > > reads are expensive, sometimes reads are stale), and not block fixing > > them until we can come up with an alternative to that global lock > > (unless, of course, there is a simpler way of doing that). > > Well, I really have to say that I do not like the notion that reading > stats is unpredictable. This just makes it really hard to use. If > the precision is to be sarificed then this should be preferable over > potentially high global lock contention. We already have that model in > place of /proc/vmstat (configurable timeout for flusher and a way to > flush explicitly). I appreciate you would like to have a better > precision but as you have explored the locking is really hard to get rid > of here. 
Reading the stats *is* unpredictable today. In terms of accuracy/staleness and cost. Avoiding the flush entirely on the read path will surely make the cost very stable and cheap, but will make accuracy even less predictable. > > So from my POV I would prefer to avoid flushing from the stats reading > path and implement force flushing by writing to stat file. If the 2s > flushing interval is considered to coarse I would be OK to allow setting > it from userspace. This way this would be more in line with /proc/vmstat > which seems to be working quite well. > > If this is not accaptable or deemed a wrong approach long term then it > would be good to reonsider the current cgroup_rstat_lock at least. > Either by turning it into mutex or by dropping the yielding code which > can severly affect the worst case latency AFAIU. Honestly I think it's better if we do it the other way around. We make flushing on the stats reading path non-unified and deterministic. That model also exists and is used for cpu.stat. If we find a problem with the locking being held from userspace, we can then remove flushing from the read path and add interface(s) to configure the periodic flusher and do a force flush. I would like to avoid introducing additional interfaces and configuration knobs unless it's necessary. Also, if we remove the flush entirely the cost will become really cheap. We will have a hard time reversing that in the future if we want to change the implementation. IOW, moving forward with this change seems much more reversible than adopting the /proc/vmstat model. If using a mutex will make things better, we can do that now. It doesn't introduce performance issues in my testing. My only concern is someone sleeping or getting preempted while holding the mutex, so I would prefer disabling preemption while we flush if that doesn't cause problems. Thanks! ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-28 16:15 ` Yosry Ahmed @ 2023-08-28 17:00 ` Shakeel Butt [not found] ` <CALvZod7uxDd3Lrd3VwTTC-SDvqhdj2Ly-dYVswO=TBM=XTnkcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Shakeel Butt @ 2023-08-28 17:00 UTC (permalink / raw) To: Yosry Ahmed Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon, Aug 28, 2023 at 9:15 AM Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote: > [...] > > > > Well, I really have to say that I do not like the notion that reading > > stats is unpredictable. This just makes it really hard to use. If > > the precision is to be sarificed then this should be preferable over > > potentially high global lock contention. We already have that model in > > place of /proc/vmstat (configurable timeout for flusher and a way to > > flush explicitly). I appreciate you would like to have a better > > precision but as you have explored the locking is really hard to get rid > > of here. > > Reading the stats *is* unpredictable today. In terms of > accuracy/staleness and cost. Avoiding the flush entirely on the read > path will surely make the cost very stable and cheap, but will make > accuracy even less predictable. > On average you would get the stats at most 2 second old, so I would not say it is less predictable. > > > > So from my POV I would prefer to avoid flushing from the stats reading > > path and implement force flushing by writing to stat file. If the 2s > > flushing interval is considered to coarse I would be OK to allow setting > > it from userspace. This way this would be more in line with /proc/vmstat > > which seems to be working quite well. 
> > > > If this is not accaptable or deemed a wrong approach long term then it > > would be good to reonsider the current cgroup_rstat_lock at least. > > Either by turning it into mutex or by dropping the yielding code which > > can severly affect the worst case latency AFAIU. > > Honestly I think it's better if we do it the other way around. We make > flushing on the stats reading path non-unified and deterministic. That > model also exists and is used for cpu.stat. If we find a problem with > the locking being held from userspace, we can then remove flushing > from the read path and add interface(s) to configure the periodic > flusher and do a force flush. > Here I agree with you. Let's go with the approach which is easy to undo for now. Though I prefer the new explicit interface for flushing, that step would be very hard to undo. Let's reevaluate if the proposed approach shows negative impact on production traffic and I think Cloudflare folks can give us the results soon. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <CALvZod7uxDd3Lrd3VwTTC-SDvqhdj2Ly-dYVswO=TBM=XTnkcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2023-08-28 17:07 ` Yosry Ahmed 2023-08-28 17:27 ` Waiman Long 0 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-28 17:07 UTC (permalink / raw) To: Shakeel Butt Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon, Aug 28, 2023 at 10:00 AM Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote: > > On Mon, Aug 28, 2023 at 9:15 AM Yosry Ahmed <yosryahmed-hpIqsD4AKldhl2p70BpVqQ@public.gmane.orgm> wrote: > > > [...] > > > > > > Well, I really have to say that I do not like the notion that reading > > > stats is unpredictable. This just makes it really hard to use. If > > > the precision is to be sarificed then this should be preferable over > > > potentially high global lock contention. We already have that model in > > > place of /proc/vmstat (configurable timeout for flusher and a way to > > > flush explicitly). I appreciate you would like to have a better > > > precision but as you have explored the locking is really hard to get rid > > > of here. > > > > Reading the stats *is* unpredictable today. In terms of > > accuracy/staleness and cost. Avoiding the flush entirely on the read > > path will surely make the cost very stable and cheap, but will make > > accuracy even less predictable. > > > > On average you would get the stats at most 2 second old, so I would > not say it is less predictable. > > > > > > > So from my POV I would prefer to avoid flushing from the stats reading > > > path and implement force flushing by writing to stat file. If the 2s > > > flushing interval is considered to coarse I would be OK to allow setting > > > it from userspace. 
This way this would be more in line with /proc/vmstat > > > which seems to be working quite well. > > > > > > If this is not accaptable or deemed a wrong approach long term then it > > > would be good to reonsider the current cgroup_rstat_lock at least. > > > Either by turning it into mutex or by dropping the yielding code which > > > can severly affect the worst case latency AFAIU. > > > > Honestly I think it's better if we do it the other way around. We make > > flushing on the stats reading path non-unified and deterministic. That > > model also exists and is used for cpu.stat. If we find a problem with > > the locking being held from userspace, we can then remove flushing > > from the read path and add interface(s) to configure the periodic > > flusher and do a force flush. > > > > Here I agree with you. Let's go with the approach which is easy to > undo for now. Though I prefer the new explicit interface for flushing, > that step would be very hard to undo. Let's reevaluate if the proposed > approach shows negative impact on production traffic and I think > Cloudflare folks can give us the results soon. Do you prefer we also switch to using a mutex (with preemption disabled) to avoid the scenario Michal described where flushers give up the lock and sleep resulting in an unbounded wait time in the worst case? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-28 17:07 ` Yosry Ahmed
@ 2023-08-28 17:27 ` Waiman Long
  2023-08-28 17:28   ` Yosry Ahmed
  [not found]     ` <599b167c-deaf-4b92-aa8b-5767b8608483-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 2 replies; 43+ messages in thread
From: Waiman Long @ 2023-08-28 17:27 UTC (permalink / raw)
To: Yosry Ahmed, Shakeel Butt
Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel

On 8/28/23 13:07, Yosry Ahmed wrote:
>
>> Here I agree with you. Let's go with the approach which is easy to
>> undo for now. Though I prefer the new explicit interface for flushing,
>> that step would be very hard to undo. Let's reevaluate if the proposed
>> approach shows negative impact on production traffic and I think
>> Cloudflare folks can give us the results soon.
> Do you prefer we also switch to using a mutex (with preemption
> disabled) to avoid the scenario Michal described where flushers give
> up the lock and sleep resulting in an unbounded wait time in the worst
> case?

Locking a mutex with preemption disabled is an oxymoron. Use a spinlock
if you want to have preemption disabled. The purpose of using a mutex is
to allow the lock owner to sleep, but you can't sleep with preemption
disabled; you need to enable preemption first. You can disable
preemption for a short time in a non-sleeping section of the lock
critical section, but I would not recommend disabling preemption for the
whole critical section.

Cheers,
Longman
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-28 17:27 ` Waiman Long @ 2023-08-28 17:28 ` Yosry Ahmed [not found] ` <CAJD7tkZsGfYXkWM5aa67v3JytTO04LS7_x+ooMDK82cBZ-C8eQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> [not found] ` <599b167c-deaf-4b92-aa8b-5767b8608483-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-28 17:28 UTC (permalink / raw) To: Waiman Long Cc: Shakeel Butt, Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote: > > > On 8/28/23 13:07, Yosry Ahmed wrote: > > > >> Here I agree with you. Let's go with the approach which is easy to > >> undo for now. Though I prefer the new explicit interface for flushing, > >> that step would be very hard to undo. Let's reevaluate if the proposed > >> approach shows negative impact on production traffic and I think > >> Cloudflare folks can give us the results soon. > > Do you prefer we also switch to using a mutex (with preemption > > disabled) to avoid the scenario Michal described where flushers give > > up the lock and sleep resulting in an unbounded wait time in the worst > > case? > > Locking with mutex with preemption disabled is an oxymoron. Use spinlock > if you want to have preemption disabled. The purpose of usiing mutex is > to allow the lock owner to sleep, but you can't sleep with preemption > disabled. You need to enable preemption first. You can disable > preemption for a short time in a non-sleeping section of the lock > critical section, but I would not recommend disabling preemption for the > whole critical section. I thought using a mutex with preemption disabled would at least allow waiters to sleep rather than spin, is this not correct (or doesn't matter) ? > > Cheers, > Longman > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <CAJD7tkZsGfYXkWM5aa67v3JytTO04LS7_x+ooMDK82cBZ-C8eQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2023-08-28 17:35 ` Waiman Long [not found] ` <307cbcf6-dca2-0b5d-93e8-11368a931d2f-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Waiman Long @ 2023-08-28 17:35 UTC (permalink / raw) To: Yosry Ahmed Cc: Shakeel Butt, Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 8/28/23 13:28, Yosry Ahmed wrote: > On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> >> On 8/28/23 13:07, Yosry Ahmed wrote: >>>> Here I agree with you. Let's go with the approach which is easy to >>>> undo for now. Though I prefer the new explicit interface for flushing, >>>> that step would be very hard to undo. Let's reevaluate if the proposed >>>> approach shows negative impact on production traffic and I think >>>> Cloudflare folks can give us the results soon. >>> Do you prefer we also switch to using a mutex (with preemption >>> disabled) to avoid the scenario Michal described where flushers give >>> up the lock and sleep resulting in an unbounded wait time in the worst >>> case? >> Locking with mutex with preemption disabled is an oxymoron. Use spinlock >> if you want to have preemption disabled. The purpose of usiing mutex is >> to allow the lock owner to sleep, but you can't sleep with preemption >> disabled. You need to enable preemption first. You can disable >> preemption for a short time in a non-sleeping section of the lock >> critical section, but I would not recommend disabling preemption for the >> whole critical section. 
> I thought using a mutex with preemption disabled would at least allow
> waiters to sleep rather than spin, is this not correct (or doesn't
> matter)?

Because of optimistic spinning, a mutex lock waiter will only sleep if
the lock holder sleeps or when its time slice runs out. So the waiters
are likely to spin for quite a while before they go to sleep.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <307cbcf6-dca2-0b5d-93e8-11368a931d2f-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2023-08-28 17:43 ` Waiman Long 2023-08-28 18:35 ` Yosry Ahmed 0 siblings, 1 reply; 43+ messages in thread From: Waiman Long @ 2023-08-28 17:43 UTC (permalink / raw) To: Yosry Ahmed Cc: Shakeel Butt, Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 8/28/23 13:35, Waiman Long wrote: > On 8/28/23 13:28, Yosry Ahmed wrote: >> On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >>> >>> On 8/28/23 13:07, Yosry Ahmed wrote: >>>>> Here I agree with you. Let's go with the approach which is easy to >>>>> undo for now. Though I prefer the new explicit interface for >>>>> flushing, >>>>> that step would be very hard to undo. Let's reevaluate if the >>>>> proposed >>>>> approach shows negative impact on production traffic and I think >>>>> Cloudflare folks can give us the results soon. >>>> Do you prefer we also switch to using a mutex (with preemption >>>> disabled) to avoid the scenario Michal described where flushers give >>>> up the lock and sleep resulting in an unbounded wait time in the worst >>>> case? >>> Locking with mutex with preemption disabled is an oxymoron. Use >>> spinlock >>> if you want to have preemption disabled. The purpose of usiing mutex is >>> to allow the lock owner to sleep, but you can't sleep with preemption >>> disabled. You need to enable preemption first. You can disable >>> preemption for a short time in a non-sleeping section of the lock >>> critical section, but I would not recommend disabling preemption for >>> the >>> whole critical section. 
>> I thought using a mutex with preemption disabled would at least allow >> waiters to sleep rather than spin, is this not correct (or doesn't >> matter) ? > > Because of optimistic spinning, a mutex lock waiter will only sleep if > the lock holder sleep or when its time slice run out. So the waiters > are likely to spin for quite a while before they go to sleep. Perhaps you can add a mutex at the read side so that only 1 reader can contend with the global rstat spinlock at any time if this is a concern. Regards, Longman ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-28 17:43 ` Waiman Long @ 2023-08-28 18:35 ` Yosry Ahmed 0 siblings, 0 replies; 43+ messages in thread From: Yosry Ahmed @ 2023-08-28 18:35 UTC (permalink / raw) To: Waiman Long Cc: Shakeel Butt, Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Mon, Aug 28, 2023 at 10:43 AM Waiman Long <longman@redhat.com> wrote: > > > On 8/28/23 13:35, Waiman Long wrote: > > On 8/28/23 13:28, Yosry Ahmed wrote: > >> On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote: > >>> > >>> On 8/28/23 13:07, Yosry Ahmed wrote: > >>>>> Here I agree with you. Let's go with the approach which is easy to > >>>>> undo for now. Though I prefer the new explicit interface for > >>>>> flushing, > >>>>> that step would be very hard to undo. Let's reevaluate if the > >>>>> proposed > >>>>> approach shows negative impact on production traffic and I think > >>>>> Cloudflare folks can give us the results soon. > >>>> Do you prefer we also switch to using a mutex (with preemption > >>>> disabled) to avoid the scenario Michal described where flushers give > >>>> up the lock and sleep resulting in an unbounded wait time in the worst > >>>> case? > >>> Locking with mutex with preemption disabled is an oxymoron. Use > >>> spinlock > >>> if you want to have preemption disabled. The purpose of usiing mutex is > >>> to allow the lock owner to sleep, but you can't sleep with preemption > >>> disabled. You need to enable preemption first. You can disable > >>> preemption for a short time in a non-sleeping section of the lock > >>> critical section, but I would not recommend disabling preemption for > >>> the > >>> whole critical section. > >> I thought using a mutex with preemption disabled would at least allow > >> waiters to sleep rather than spin, is this not correct (or doesn't > >> matter) ? 
> > > > Because of optimistic spinning, a mutex lock waiter will only sleep if > > the lock holder sleep or when its time slice run out. So the waiters > > are likely to spin for quite a while before they go to sleep. I see. Thanks for the explanation. > > Perhaps you can add a mutex at the read side so that only 1 reader can > contend with the global rstat spinlock at any time if this is a concern. I guess we can keep it simple for now and add that later if needed. For unified flushers we already can only have one. If we see a problem from the stat reading side we can add a mutex there. Thanks! ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <599b167c-deaf-4b92-aa8b-5767b8608483-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2023-08-29 7:27 ` Michal Hocko [not found] ` <ZO2d7dT8gulMyb8g-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2023-08-29 7:27 UTC (permalink / raw) To: Waiman Long Cc: Yosry Ahmed, Shakeel Butt, Andrew Morton, Johannes Weiner, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon 28-08-23 13:27:23, Waiman Long wrote: > > On 8/28/23 13:07, Yosry Ahmed wrote: > > > > > Here I agree with you. Let's go with the approach which is easy to > > > undo for now. Though I prefer the new explicit interface for flushing, > > > that step would be very hard to undo. Let's reevaluate if the proposed > > > approach shows negative impact on production traffic and I think > > > Cloudflare folks can give us the results soon. > > Do you prefer we also switch to using a mutex (with preemption > > disabled) to avoid the scenario Michal described where flushers give > > up the lock and sleep resulting in an unbounded wait time in the worst > > case? > > Locking with mutex with preemption disabled is an oxymoron. I believe Yosry wanted to disable preemption _after_ the lock is taken to reduce the time spent while it is held. The idea to use the mutex is to reduce spinning and more importantly to get rid of lock dropping part. It is not really clear (but unlikely) we can drop it while preserving the spinlock as the thing scales with O(#cgroups x #cpus) in the worst case. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <ZO2d7dT8gulMyb8g-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2023-08-29 15:05 ` Waiman Long 2023-08-29 15:17 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: Waiman Long @ 2023-08-29 15:05 UTC (permalink / raw) To: Michal Hocko Cc: Yosry Ahmed, Shakeel Butt, Andrew Morton, Johannes Weiner, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 8/29/23 03:27, Michal Hocko wrote: > On Mon 28-08-23 13:27:23, Waiman Long wrote: >> On 8/28/23 13:07, Yosry Ahmed wrote: >>>> Here I agree with you. Let's go with the approach which is easy to >>>> undo for now. Though I prefer the new explicit interface for flushing, >>>> that step would be very hard to undo. Let's reevaluate if the proposed >>>> approach shows negative impact on production traffic and I think >>>> Cloudflare folks can give us the results soon. >>> Do you prefer we also switch to using a mutex (with preemption >>> disabled) to avoid the scenario Michal described where flushers give >>> up the lock and sleep resulting in an unbounded wait time in the worst >>> case? >> Locking with mutex with preemption disabled is an oxymoron. > I believe Yosry wanted to disable preemption _after_ the lock is taken > to reduce the time spent while it is held. The idea to use the mutex is > to reduce spinning and more importantly to get rid of lock dropping > part. It is not really clear (but unlikely) we can drop it while > preserving the spinlock as the thing scales with O(#cgroups x #cpus) > in the worst case. As I have said later in my email, I am not against disabling preemption selectively on some parts of the lock critical section where preemption is undesirable. 
However, I am against disabling preemption for the whole duration of the code where the mutex lock is held as it defeats the purpose of using mutex in the first place. Cheers, Longman ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-29 15:05 ` Waiman Long @ 2023-08-29 15:17 ` Michal Hocko [not found] ` <ZO4MBNzsbhsi7adb-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2023-08-29 15:17 UTC (permalink / raw) To: Waiman Long Cc: Yosry Ahmed, Shakeel Butt, Andrew Morton, Johannes Weiner, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel On Tue 29-08-23 11:05:28, Waiman Long wrote: > On 8/29/23 03:27, Michal Hocko wrote: > > On Mon 28-08-23 13:27:23, Waiman Long wrote: > > > On 8/28/23 13:07, Yosry Ahmed wrote: > > > > > Here I agree with you. Let's go with the approach which is easy to > > > > > undo for now. Though I prefer the new explicit interface for flushing, > > > > > that step would be very hard to undo. Let's reevaluate if the proposed > > > > > approach shows negative impact on production traffic and I think > > > > > Cloudflare folks can give us the results soon. > > > > Do you prefer we also switch to using a mutex (with preemption > > > > disabled) to avoid the scenario Michal described where flushers give > > > > up the lock and sleep resulting in an unbounded wait time in the worst > > > > case? > > > Locking with mutex with preemption disabled is an oxymoron. > > I believe Yosry wanted to disable preemption _after_ the lock is taken > > to reduce the time spent while it is held. The idea to use the mutex is > > to reduce spinning and more importantly to get rid of lock dropping > > part. It is not really clear (but unlikely) we can drop it while > > preserving the spinlock as the thing scales with O(#cgroups x #cpus) > > in the worst case. > > As I have said later in my email, I am not against disabling preemption > selectively on some parts of the lock critical section where preemption is > undesirable. 
However, I am against disabling preemption for the whole > duration of the code where the mutex lock is held as it defeats the purpose > of using mutex in the first place. I certainly agree this is an antipattern. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <ZO4MBNzsbhsi7adb-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2023-08-29 16:04 ` Yosry Ahmed 0 siblings, 0 replies; 43+ messages in thread From: Yosry Ahmed @ 2023-08-29 16:04 UTC (permalink / raw) To: Michal Hocko Cc: Waiman Long, Shakeel Butt, Andrew Morton, Johannes Weiner, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, Aug 29, 2023 at 8:17 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > On Tue 29-08-23 11:05:28, Waiman Long wrote: > > On 8/29/23 03:27, Michal Hocko wrote: > > > On Mon 28-08-23 13:27:23, Waiman Long wrote: > > > > On 8/28/23 13:07, Yosry Ahmed wrote: > > > > > > Here I agree with you. Let's go with the approach which is easy to > > > > > > undo for now. Though I prefer the new explicit interface for flushing, > > > > > > that step would be very hard to undo. Let's reevaluate if the proposed > > > > > > approach shows negative impact on production traffic and I think > > > > > > Cloudflare folks can give us the results soon. > > > > > Do you prefer we also switch to using a mutex (with preemption > > > > > disabled) to avoid the scenario Michal described where flushers give > > > > > up the lock and sleep resulting in an unbounded wait time in the worst > > > > > case? > > > > Locking with mutex with preemption disabled is an oxymoron. > > > I believe Yosry wanted to disable preemption _after_ the lock is taken > > > to reduce the time spent while it is held. The idea to use the mutex is > > > to reduce spinning and more importantly to get rid of lock dropping > > > part. It is not really clear (but unlikely) we can drop it while > > > preserving the spinlock as the thing scales with O(#cgroups x #cpus) > > > in the worst case. 
> > > > As I have said later in my email, I am not against disabling preemption > > selectively on some parts of the lock critical section where preemption is > > undesirable. However, I am against disabling preemption for the whole > > duration of the code where the mutex lock is held as it defeats the purpose > > of using mutex in the first place. > > I certainly agree this is an antipattern. So I guess the verdict is to avoid using a mutex here for now. I sent a v2 which includes an additional small patch suggested by Michal Koutny and an updated changelog for this patch to document this discussion and possible alternatives we can do if things go wrong with this approach: https://lore.kernel.org/lkml/20230828233319.340712-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  [not found] ` <ZOhSyvDxAyYUJ45i-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2023-08-29 18:44   ` Tejun Heo
  [not found]     ` <ZO48h7c9qwQxEPPA-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2023-08-29 18:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hello,

On Fri, Aug 25, 2023 at 09:05:46AM +0200, Michal Hocko wrote:
> > > I think that's how it was always meant to be when it was designed. The
> > > global rstat lock has always existed and was always available to
> > > userspace readers. The memory controller took a different path at some
> > > point with unified flushing, but that was mainly because of high
> > > concurrency from in-kernel flushers, not because userspace readers
> > > caused a problem. Outside of memcg, the core cgroup code has always
> > > exercised this global lock when reading cpu.stat since rstat's
> > > introduction. I assume there hasn't been any problems since it's still
> > > there.
>
> I suspect nobody has just considered a malfunctioning or adversary
> workloads so far.
>
> > > I was hoping Tejun would confirm/deny this.
>
> Yes, that would be interesting to hear.

So, the assumptions in the original design were:

* Writers are high freq but readers are lower freq and can block.

* The global lock is a mutex.

* Back-to-back reads won't have too much to do because it only has to flush
  what's been accumulated since the last flush which took place just before.

It's likely that the userspace side is gonna be just fine if we restore the
global lock to be a mutex and let them be. Most of the problems are caused
by trying to allow flushing from non-sleepable and kernel contexts.
Would it make sense to distinguish what can and can't wait and make the latter group always use cached value? e.g. even in kernel, during oom kill, waiting doesn't really matter and it can just wait to obtain the up-to-date numbers. Thanks. -- tejun ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <ZO48h7c9qwQxEPPA-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org> @ 2023-08-29 19:13 ` Yosry Ahmed 2023-08-29 19:36 ` Tejun Heo 0 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-29 19:13 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, Aug 29, 2023 at 11:44 AM Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > > Hello, > > On Fri, Aug 25, 2023 at 09:05:46AM +0200, Michal Hocko wrote: > > > > I think that's how it was always meant to be when it was designed. The > > > > global rstat lock has always existed and was always available to > > > > userspace readers. The memory controller took a different path at some > > > > point with unified flushing, but that was mainly because of high > > > > concurrency from in-kernel flushers, not because userspace readers > > > > caused a problem. Outside of memcg, the core cgroup code has always > > > > exercised this global lock when reading cpu.stat since rstat's > > > > introduction. I assume there hasn't been any problems since it's still > > > > there. > > > > I suspect nobody has just considered a malfunctioning or adversary > > workloads so far. > > > > > > I was hoping Tejun would confirm/deny this. > > > > Yes, that would be interesting to hear. > > So, the assumptions in the original design were: > > * Writers are high freq but readers are lower freq and can block. > > * The global lock is mutex. > > * Back-to-back reads won't have too much to do because it only has to flush > what's been accumulated since the last flush which took place just before. > > It's likely that the userspace side is gonna be just fine if we restore the > global lock to be a mutex and let them be. 
Most of the problems are caused > by trying to allow flushing from non-sleepable and kernel contexts. So basically restore the flush without disabling preemption, and if a userspace reader gets preempted while holding the mutex it's probably okay because we won't have high concurrency among userspace readers? I think Shakeel was worried that this may cause a priority inversion where a low priority task is preempted while holding the mutex, and prevents high priority tasks from acquiring it to read the stats and take actions (e.g. userspace OOMs). > Would it > make sense to distinguish what can and can't wait and make the latter group > always use cached value? e.g. even in kernel, during oom kill, waiting > doesn't really matter and it can just wait to obtain the up-to-date numbers. The problem with waiting for in-kernel flushers is that high concurrency leads to terrible serialization. Running a stress test with 100s of reclaimers where everyone waits shows ~ 3x slowdowns. This patch series is indeed trying to formalize a distinction between waiters who can wait and those who can't on the memcg side: - Unified flushers always flush the entire tree and only flush if no one else is flushing (no waiting), otherwise they use cached data and hope the concurrent flushing helps. This is what we currently do for most memcg contexts. This patch series opts userspace reads out of it. - Non-unified flushers only flush the subtree they care about and they wait if there are other flushers. This is what we currently do for some zswap accounting code. This patch series opts userspace readers into this. The problem Michal is raising is that dropping the lock can lead to an unbounded number of waiters and longer worst case latency. Especially that this is directly influenced by userspace. Reintroducing the mutex and removing the lock dropping code fixes that problem, but then if the mutex holder gets preempted, we face another problem. 
Personally I think there is a good chance there won't be userspace latency problems due to the lock as usually there isn't high concurrency among userspace readers, and we can deal with that problem if/when it happens. So far no problem is happening for cpu.stat which has the same potential problem. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads 2023-08-29 19:13 ` Yosry Ahmed @ 2023-08-29 19:36 ` Tejun Heo [not found] ` <ZO5IuULSCXMe9_pN-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Tejun Heo @ 2023-08-29 19:36 UTC (permalink / raw) To: Yosry Ahmed Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm, cgroups, linux-kernel Hello, On Tue, Aug 29, 2023 at 12:13:31PM -0700, Yosry Ahmed wrote: ... > > So, the assumptions in the original design were: > > > > * Writers are high freq but readers are lower freq and can block. > > > > * The global lock is mutex. > > > > * Back-to-back reads won't have too much to do because it only has to flush > > what's been accumulated since the last flush which took place just before. > > > > It's likely that the userspace side is gonna be just fine if we restore the > > global lock to be a mutex and let them be. Most of the problems are caused > > by trying to allow flushing from non-sleepable and kernel contexts. > > So basically restore the flush without disabling preemption, and if a > userspace reader gets preempted while holding the mutex it's probably > okay because we won't have high concurrency among userspace readers? > > I think Shakeel was worried that this may cause a priority inversion > where a low priority task is preempted while holding the mutex, and > prevents high priority tasks from acquiring it to read the stats and > take actions (e.g. userspace OOMs). We'll have to see but I'm not sure this is going to be a huge problem. The most common way that priority inversions are resolved is through work conservation - ie. the system just runs out of other things to do, so whatever is blocking the system gets to run and unblocks it. It only really becomes a problem when work conservation breaks down on CPU side which happens if the one holding the resource is 1. 
blocked on IOs (usually through memory allocation but can be anything) 2. throttled by cpu.max. #1 is not a factor here. #2 is but that is a factor for everything in the kernel and should really be solved from cpu.max side. So, I think in practice, this should be fine, or at least not worse than anything else. > > Would it > > make sense to distinguish what can and can't wait and make the latter group > > always use cached value? e.g. even in kernel, during oom kill, waiting > > doesn't really matter and it can just wait to obtain the up-to-date numbers. > > The problem with waiting for in-kernel flushers is that high > concurrency leads to terrible serialization. Running a stress test > with 100s of reclaimers where everyone waits shows ~ 3x slowdowns. > > This patch series is indeed trying to formalize a distinction between > waiters who can wait and those who can't on the memcg side: > > - Unified flushers always flush the entire tree and only flush if no > one else is flushing (no waiting), otherwise they use cached data and > hope the concurrent flushing helps. This is what we currently do for > most memcg contexts. This patch series opts userspace reads out of it. > > - Non-unified flushers only flush the subtree they care about and they > wait if there are other flushers. This is what we currently do for > some zswap accounting code. This patch series opts userspace readers > into this. > > The problem Michal is raising is that dropping the lock can lead to an > unbounded number of waiters and longer worst case latency. Especially > that this is directly influenced by userspace. Reintroducing the mutex > and removing the lock dropping code fixes that problem, but then if > the mutex holder gets preempted, we face another problem. > > Personally I think there is a good chance there won't be userspace > latency problems due to the lock as usually there isn't high > concurrency among userspace readers, and we can deal with that problem > if/when it happens. 
> So far no problem is happening for cpu.stat which
> has the same potential problem.

Maybe leave the global lock as-is and gate the userland flushers with a
mutex so that there's only ever one contending on the rstat lock from
the userland side?

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads [not found] ` <ZO5IuULSCXMe9_pN-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org> @ 2023-08-29 19:54 ` Yosry Ahmed [not found] ` <CAJD7tkYtnhemCLBqFqOVurfWEaCjKtyEM745JYRxFS0r5cpZwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 43+ messages in thread From: Yosry Ahmed @ 2023-08-29 19:54 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, Aug 29, 2023 at 12:36 PM Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > > Hello, > > On Tue, Aug 29, 2023 at 12:13:31PM -0700, Yosry Ahmed wrote: > ... > > > So, the assumptions in the original design were: > > > > > > * Writers are high freq but readers are lower freq and can block. > > > > > > * The global lock is mutex. > > > > > > * Back-to-back reads won't have too much to do because it only has to flush > > > what's been accumulated since the last flush which took place just before. > > > > > > It's likely that the userspace side is gonna be just fine if we restore the > > > global lock to be a mutex and let them be. Most of the problems are caused > > > by trying to allow flushing from non-sleepable and kernel contexts. > > > > So basically restore the flush without disabling preemption, and if a > > userspace reader gets preempted while holding the mutex it's probably > > okay because we won't have high concurrency among userspace readers? > > > > I think Shakeel was worried that this may cause a priority inversion > > where a low priority task is preempted while holding the mutex, and > > prevents high priority tasks from acquiring it to read the stats and > > take actions (e.g. userspace OOMs). > > We'll have to see but I'm not sure this is going to be a huge problem. 
The > most common way that priority inversions are resolved is through work > conservation - ie. the system just runs out of other things to do, so > whatever is blocking the system gets to run and unblocks it. It only really > becomes a problem when work conservation breaks down on CPU side which > happens if the one holding the resource is 1. blocked on IOs (usually > through memory allocation but can be anything) 2. throttled by cpu.max. > > #1 is not a factor here. #2 is but that is a factor for everything in the > kernel and should really be solved from cpu.max side. So, I think in > practice, this should be fine, or at least not worse than anything else. If that's not a concern I can just append a patch that changes the spinlock to a mutex to this series. Shakeel, wdyt? > > > > Would it > > > make sense to distinguish what can and can't wait and make the latter group > > > always use cached value? e.g. even in kernel, during oom kill, waiting > > > doesn't really matter and it can just wait to obtain the up-to-date numbers. > > > > The problem with waiting for in-kernel flushers is that high > > concurrency leads to terrible serialization. Running a stress test > > with 100s of reclaimers where everyone waits shows ~ 3x slowdowns. > > > > This patch series is indeed trying to formalize a distinction between > > waiters who can wait and those who can't on the memcg side: > > > > - Unified flushers always flush the entire tree and only flush if no > > one else is flushing (no waiting), otherwise they use cached data and > > hope the concurrent flushing helps. This is what we currently do for > > most memcg contexts. This patch series opts userspace reads out of it. > > > > - Non-unified flushers only flush the subtree they care about and they > > wait if there are other flushers. This is what we currently do for > > some zswap accounting code. This patch series opts userspace readers > > into this. 
> > > > The problem Michal is raising is that dropping the lock can lead to an > > unbounded number of waiters and longer worst case latency. Especially > > that this is directly influenced by userspace. Reintroducing the mutex > > and removing the lock dropping code fixes that problem, but then if > > the mutex holder gets preempted, we face another problem. > > > > Personally I think there is a good chance there won't be userspace > > latency problems due to the lock as usually there isn't high > > concurrency among userspace readers, and we can deal with that problem > > if/when it happens. So far no problem is happening for cpu.stat which > > has the same potential problem. > > Maybe leave the global lock as-is and gate the userland flushers with a > mutex so that there's only ever one contenting on the rstat lock from > userland side? Waiman suggested this as well. We can do that for sure, although I think we should wait until we are sure it's needed. One question. If whoever is holding that mutex is either flushing with the spinlock held or spinning (i.e not sleepable or preemptable), wouldn't this be equivalent to just changing the spinlock with a mutex and disable preemption while holding it? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-29 20:12 ` Tejun Heo
  2023-08-29 20:20   ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2023-08-29 20:12 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
    Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm, cgroups, linux-kernel

Hello,

On Tue, Aug 29, 2023 at 12:54:06PM -0700, Yosry Ahmed wrote:
...
> > Maybe leave the global lock as-is and gate the userland flushers with a
> > mutex so that there's only ever one contending on the rstat lock from
> > the userland side?
>
> Waiman suggested this as well. We can do that for sure, although I
> think we should wait until we are sure it's needed.
>
> One question. If whoever is holding that mutex is either flushing with
> the spinlock held or spinning (i.e. not sleepable or preemptible),
> wouldn't this be equivalent to just replacing the spinlock with a mutex
> and disabling preemption while holding it?

Well, it creates layering so that userspace can't flood the inner lock,
which can cause contention issues for kernel-side users. Not sleeping
while actively flushing is a side effect too, but the code at least
doesn't look as anti-patterny as disabling preemption right after
grabbing a mutex.

I don't have a strong preference. As long as we stay away from
introducing a new user interface construct and can address the noticed
scalability issues, it should be fine. Note that there are other ways to
address priority inversions and contention too - e.g. we can always
bounce flushing to a [kthread_]kworker and rate limit (or rather latency
limit) how often different classes of users can trigger flushing. I
don't think we have to go there yet, but if the simpler measures don't
work out, there are still many ways to solve the problem within the
kernel.

Thanks.

--
tejun
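The rate-limit (or latency-limit) idea Tejun sketches above can be illustrated with a toy userspace model. This is not kernel code; the class and all names are invented for illustration: a flush request only performs real work when a minimum interval has elapsed since the last flush, so any class of callers is bounded in how often it can trigger flushing.

```python
import threading
import time

class RatelimitedFlusher:
    """Toy model of latency-limited flushing: coalesce flush requests so
    that real flushes happen at most once per min_interval seconds."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.lock = threading.Lock()
        self.last_flush = float("-inf")
        self.flush_count = 0

    def maybe_flush(self, now=None):
        """Flush only if min_interval elapsed since the last flush.
        Returns True if a real flush was performed, False if skipped."""
        now = time.monotonic() if now is None else now
        with self.lock:
            if now - self.last_flush < self.min_interval:
                return False        # recent enough; reuse cached stats
            self.last_flush = now
            self.flush_count += 1   # a real flush would happen here
            return True

limiter = RatelimitedFlusher(min_interval=2.0)
print(limiter.maybe_flush(now=0.0),
      limiter.maybe_flush(now=1.0),
      limiter.maybe_flush(now=2.5))  # prints "True False True"
```

In the kernel the real flush would be bounced to a kworker instead of counted, and different user classes could get different intervals; the coalescing logic is the same.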
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-29 20:20   ` Yosry Ahmed
  0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-29 20:20 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
    Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm, cgroups, linux-kernel

On Tue, Aug 29, 2023 at 1:12 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Aug 29, 2023 at 12:54:06PM -0700, Yosry Ahmed wrote:
> ...
> > > Maybe leave the global lock as-is and gate the userland flushers with a
> > > mutex so that there's only ever one contending on the rstat lock from
> > > the userland side?
> >
> > Waiman suggested this as well. We can do that for sure, although I
> > think we should wait until we are sure it's needed.
> >
> > One question. If whoever is holding that mutex is either flushing with
> > the spinlock held or spinning (i.e. not sleepable or preemptible),
> > wouldn't this be equivalent to just replacing the spinlock with a mutex
> > and disabling preemption while holding it?
>
> Well, it creates layering so that userspace can't flood the inner lock,
> which can cause contention issues for kernel-side users. Not sleeping
> while actively flushing is a side effect too, but the code at least
> doesn't look as anti-patterny as disabling preemption right after
> grabbing a mutex.

I see. At most one kernel-side flusher will be spinning for the lock at
any given point anyway, but I guess having that one kernel-side flusher
competing against one userspace flusher is better than competing with N
flushers.

I will add a mutex on the userspace read side then and spin a v3.
Hopefully this addresses Michal's concern as well. The lock-dropping
logic will still exist for the inner lock, but when one userspace reader
drops the inner lock, other readers won't be able to pick it up.

> I don't have a strong preference. As long as we stay away from
> introducing a new user interface construct and can address the noticed
> scalability issues, it should be fine. Note that there are other ways to
> address priority inversions and contention too - e.g. we can always
> bounce flushing to a [kthread_]kworker and rate limit (or rather latency
> limit) how often different classes of users can trigger flushing. I
> don't think we have to go there yet, but if the simpler measures don't
> work out, there are still many ways to solve the problem within the
> kernel.

I wholeheartedly agree with the preference to fix the problem within the
kernel with minimal/no userspace involvement.

Thanks!
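The layering being agreed on here — an outer mutex taken only by userspace readers, in front of the shared inner lock — can be sketched as a toy userspace model. This is not the kernel implementation; all names are invented, and Python locks stand in for the rstat spinlock and the proposed mutex:

```python
import threading

inner_lock = threading.Lock()        # stands in for the global rstat spinlock
user_flush_mutex = threading.Lock()  # outer mutex, taken by userspace readers only

state = {"contenders": 0, "max_contenders": 0, "flushes": 0}
state_lock = threading.Lock()

def flush_stats():
    """Stand-in for the real subtree flush done under the inner lock."""
    with inner_lock:
        state["flushes"] += 1

def userspace_read():
    # The layering: serialize userspace readers up front, so at most one
    # of them ever contends on the inner lock against kernel-side flushers.
    with user_flush_mutex:
        with state_lock:
            state["contenders"] += 1
            state["max_contenders"] = max(state["max_contenders"],
                                          state["contenders"])
        flush_stats()
        with state_lock:
            state["contenders"] -= 1

threads = [threading.Thread(target=userspace_read) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(state["flushes"], state["max_contenders"])  # prints "8 1"
```

Eight concurrent readers all flush, but the outer mutex admits only one at a time to the inner lock, which is the property Tejun describes: userspace cannot flood the inner lock no matter how many readers pile up.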
* Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-31  9:05     ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2023-08-31 9:05 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Tejun Heo, Andrew Morton, Johannes Weiner, Roman Gushchin,
    Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm, cgroups, linux-kernel

On Tue 29-08-23 13:20:34, Yosry Ahmed wrote:
[...]
> I will add a mutex on the userspace read side then and spin a v3.
> Hopefully this addresses Michal's concern as well. The lock-dropping
> logic will still exist for the inner lock, but when one userspace
> reader drops the inner lock, other readers won't be able to pick it up.

Yes, that would minimize the risk of the worst-case pathological
behavior.

> > I don't have a strong preference. As long as we stay away from
> > introducing a new user interface construct and can address the noticed
> > scalability issues, it should be fine. Note that there are other ways
> > to address priority inversions and contention too - e.g. we can always
> > bounce flushing to a [kthread_]kworker and rate limit (or rather
> > latency limit) how often different classes of users can trigger
> > flushing. I don't think we have to go there yet, but if the simpler
> > measures don't work out, there are still many ways to solve the
> > problem within the kernel.
>
> I wholeheartedly agree with the preference to fix the problem within
> the kernel with minimal/no userspace involvement.

Let's see. While I would love to see a solution that works for everybody
without an explicit interface, we have hit problems with locks involved
in stat files in the past.

--
Michal Hocko
SUSE Labs
* Re: [PATCH 0/3] memcg: non-unified flushing for userspace stats
  2023-08-22 13:00 ` Michal Koutný
  2023-08-22 15:43   ` Yosry Ahmed
  3 siblings, 1 reply; 43+ messages in thread
From: Michal Koutný @ 2023-08-22 13:00 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
    Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel

Hello.

On Mon, Aug 21, 2023 at 08:54:55PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> For userspace reads, unified flushing leads to non-deterministic stats
> staleness and reading cost.

I only skimmed previous threads, but I don't remember if it was resolved
whether:
a) periodic flushing was too widely spaced for userspace readers (i.e. a
   2s delay is too much [1]), or
b) periodic flushing didn't catch up (i.e. a full tree flush can
   occasionally take more than 2s), leading to extra staleness?

[1] Assuming that the nr_cpus*MEMCG_CHARGE_BATCH error bound is also too
much for userspace readers, correct?

> The cost of userspace reads are now deterministic, and depend on the
> size of the subtree being read. This should fix both the *sometimes*
> expensive reads (due to flushing the entire tree) and occasional
> staleness (due to skipping flushing).

This is nice, thanks to the atomic removal in commit 0a2dc6ac3329
("cgroup: remove cgroup_rstat_flush_atomic()"). I think the smaller
chunks with yielding could be universally better (last words :-).

Thanks,
Michal
* Re: [PATCH 0/3] memcg: non-unified flushing for userspace stats
  2023-08-22 15:43   ` Yosry Ahmed
  0 siblings, 0 replies; 43+ messages in thread
From: Yosry Ahmed @ 2023-08-22 15:43 UTC (permalink / raw)
To: Michal Koutný
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
    Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm, cgroups, linux-kernel

On Tue, Aug 22, 2023 at 6:01 AM Michal Koutný wrote:
>
> Hello.
>
> On Mon, Aug 21, 2023 at 08:54:55PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > For userspace reads, unified flushing leads to non-deterministic stats
> > staleness and reading cost.
>
> I only skimmed previous threads, but I don't remember if it was resolved
> whether:
> a) periodic flushing was too widely spaced for userspace readers (i.e.
>    a 2s delay is too much [1]), or
> b) periodic flushing didn't catch up (i.e. a full tree flush can
>    occasionally take more than 2s), leading to extra staleness?

The idea is that having stats anywhere between 0-2 seconds stale is
inconsistent, especially when compared to other values (such as
memory.usage_in_bytes), which can give an inconsistent and stale view of
the system. For some readers, 2s is too widely spaced (we have readers
that read every second). A time-only bound is scary because on large
systems a lot can happen in a second. There will always be a delay
anyway, but ideally we want to minimize it. I think 2s is also not a
strict bound (e.g. flushing may be taking a lot of time, or a flush
started but the cgroup you care about hasn't been flushed yet, etc.).
There is also the problem of the variable cost of reading.

> [1] Assuming that the nr_cpus*MEMCG_CHARGE_BATCH error bound is also
> too much for userspace readers, correct?

I can't tell for sure, to be honest, but given the continuously
increasing number of CPUs on modern systems, and the fact that
nr_cpus*MEMCG_CHARGE_BATCH can be localized in one cgroup or spread
across the hierarchy, I think it's better if we drop it for userspace
reads if possible.

> > The cost of userspace reads are now deterministic, and depend on the
> > size of the subtree being read. This should fix both the *sometimes*
> > expensive reads (due to flushing the entire tree) and occasional
> > staleness (due to skipping flushing).
>
> This is nice, thanks to the atomic removal in commit 0a2dc6ac3329
> ("cgroup: remove cgroup_rstat_flush_atomic()"). I think the smaller
> chunks with yielding could be universally better (last words :-).

I was hoping we could remove unified flushing completely, but stress
testing with hundreds of concurrent reclaimers shows a lot of
regression. Maybe it doesn't matter in practice, but I don't want to
pull that trigger :)

FWIW, with unified flushing we are getting great concurrency for
in-kernel flushers like reclaim or refault, but at the expense of stats
staleness. I really wonder what hidden cost we are paying due to the
stale stats. I would imagine any problems that manifest from this would
be difficult to tie back to the stale stats.

> Thanks,
> Michal
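For a rough sense of the nr_cpus*MEMCG_CHARGE_BATCH bound discussed above, here is a back-of-the-envelope calculation. It assumes 4 KiB pages, MEMCG_CHARGE_BATCH = 64 (its value in the kernel at the time), and treats each batched event as one page, which is a simplification:

```python
# Worst-case per-counter staleness without a flush: each CPU can
# accumulate up to MEMCG_CHARGE_BATCH page-sized updates before the
# global error counter notices, so the bound scales with CPU count.
MEMCG_CHARGE_BATCH = 64   # per-CPU batch size (kernel constant)
PAGE_SIZE = 4096          # bytes, assuming 4 KiB pages

def worst_case_error_bytes(nr_cpus: int) -> int:
    return nr_cpus * MEMCG_CHARGE_BATCH * PAGE_SIZE

for cpus in (8, 64, 256):
    mib = worst_case_error_bytes(cpus) / (1 << 20)
    print(f"{cpus:>4} CPUs -> up to {mib:.0f} MiB error per counter")
```

On an 8-CPU machine the bound is a modest 2 MiB, but at 256 CPUs it grows to 64 MiB per counter, and as noted above that error can be concentrated in a single cgroup, which is why the bound looks too loose for userspace readers on large systems.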
Thread overview: 43+ messages
2023-08-21 20:54 [PATCH 0/3] memcg: non-unified flushing for userspace stats Yosry Ahmed
2023-08-21 20:54 ` [PATCH 1/3] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
2023-08-21 20:54 ` [PATCH 2/3] mm: memcg: add a helper for non-unified " Yosry Ahmed
[not found] ` <20230821205458.1764662-3-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2023-08-22 13:01 ` Michal Koutný
2023-08-22 16:00 ` Yosry Ahmed
[not found] ` <CAJD7tkaVuiMU-ifJiyH5d_W1hi9DnAymYJxzBxEKCVX+tU=OCA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-08-22 16:35 ` Michal Koutný
2023-08-22 16:48 ` Yosry Ahmed
[not found] ` <20230821205458.1764662-1-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2023-08-21 20:54 ` [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
[not found] ` <20230821205458.1764662-4-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2023-08-22 9:06 ` Michal Hocko
[not found] ` <ZOR6eyYfJYlxdMet-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-08-22 15:30 ` Yosry Ahmed
[not found] ` <CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-08-23 7:33 ` Michal Hocko
2023-08-23 14:55 ` Yosry Ahmed
2023-08-24 7:13 ` Michal Hocko
2023-08-24 18:15 ` Yosry Ahmed
2023-08-24 18:50 ` Yosry Ahmed
2023-08-25 7:05 ` Michal Hocko
2023-08-25 15:14 ` Yosry Ahmed
[not found] ` <CAJD7tkYPyb+2zOKqctQw-vhuwYRg85e6v2Y44xWJofHZ+F+YQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-08-25 18:17 ` Michal Hocko
2023-08-25 18:21 ` Yosry Ahmed
[not found] ` <CAJD7tka=60_vPMY9Tg8tH+55g-feV1B24VNmDpp_3iMHqrUh7Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-08-25 18:43 ` Michal Hocko
[not found] ` <ZOj2NeU5yYhrTZPJ-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-08-25 18:44 ` Michal Hocko
2023-08-28 15:47 ` Michal Hocko
2023-08-28 16:15 ` Yosry Ahmed
2023-08-28 17:00 ` Shakeel Butt
[not found] ` <CALvZod7uxDd3Lrd3VwTTC-SDvqhdj2Ly-dYVswO=TBM=XTnkcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-08-28 17:07 ` Yosry Ahmed
2023-08-28 17:27 ` Waiman Long
2023-08-28 17:28 ` Yosry Ahmed
[not found] ` <CAJD7tkZsGfYXkWM5aa67v3JytTO04LS7_x+ooMDK82cBZ-C8eQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-08-28 17:35 ` Waiman Long
[not found] ` <307cbcf6-dca2-0b5d-93e8-11368a931d2f-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2023-08-28 17:43 ` Waiman Long
2023-08-28 18:35 ` Yosry Ahmed
[not found] ` <599b167c-deaf-4b92-aa8b-5767b8608483-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2023-08-29 7:27 ` Michal Hocko
[not found] ` <ZO2d7dT8gulMyb8g-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-08-29 15:05 ` Waiman Long
2023-08-29 15:17 ` Michal Hocko
[not found] ` <ZO4MBNzsbhsi7adb-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-08-29 16:04 ` Yosry Ahmed
[not found] ` <ZOhSyvDxAyYUJ45i-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-08-29 18:44 ` Tejun Heo
[not found] ` <ZO48h7c9qwQxEPPA-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
2023-08-29 19:13 ` Yosry Ahmed
2023-08-29 19:36 ` Tejun Heo
[not found] ` <ZO5IuULSCXMe9_pN-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
2023-08-29 19:54 ` Yosry Ahmed
[not found] ` <CAJD7tkYtnhemCLBqFqOVurfWEaCjKtyEM745JYRxFS0r5cpZwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-08-29 20:12 ` Tejun Heo
2023-08-29 20:20 ` Yosry Ahmed
[not found] ` <CAJD7tkZn_7ppFB1B1V8tBEw12LXCnEOue2Beq6e19PkUAVHUSQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2023-08-31 9:05 ` Michal Hocko
2023-08-22 13:00 ` [PATCH 0/3] memcg: non-unified flushing for userspace stats Michal Koutný
2023-08-22 15:43 ` Yosry Ahmed