* [RFC PATCH 1/8] memcg: refactor updating memcg->moving_account
2023-07-20 7:08 ` Yosry Ahmed
@ 2023-07-20 7:08 ` Yosry Ahmed
-1 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Yosry Ahmed
memcg->moving_account is used to signal that a memcg move is taking
place, so that folio_memcg_lock() would start acquiring the per-memcg
move lock instead of just initiating an rcu read section.
Refactor incrementing and decrementing memcg->moving_account, together
with rcu synchornization and the elaborate comment into helpers, to
allow for reuse by incoming patches.
Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
mm/memcontrol.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..ffdb848f4003 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6305,16 +6305,26 @@ static const struct mm_walk_ops charge_walk_ops = {
.pmd_entry = mem_cgroup_move_charge_pte_range,
};
-static void mem_cgroup_move_charge(void)
+static void mem_cgroup_start_move_charge(struct mem_cgroup *memcg)
{
- lru_add_drain_all();
/*
* Signal folio_memcg_lock() to take the memcg's move_lock
* while we're moving its pages to another memcg. Then wait
* for already started RCU-only updates to finish.
*/
- atomic_inc(&mc.from->moving_account);
+ atomic_inc(&memcg->moving_account);
synchronize_rcu();
+}
+
+static void mem_cgroup_end_move_charge(struct mem_cgroup *memcg)
+{
+ atomic_dec(&memcg->moving_account);
+}
+
+static void mem_cgroup_move_charge(void)
+{
+ lru_add_drain_all();
+ mem_cgroup_start_move_charge(mc.from);
retry:
if (unlikely(!mmap_read_trylock(mc.mm))) {
/*
@@ -6334,7 +6344,7 @@ static void mem_cgroup_move_charge(void)
*/
walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL);
mmap_read_unlock(mc.mm);
- atomic_dec(&mc.from->moving_account);
+ mem_cgroup_end_move_charge(mc.from);
}
static void mem_cgroup_move_task(void)
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread* [RFC PATCH 1/8] memcg: refactor updating memcg->moving_account
@ 2023-07-20 7:08 ` Yosry Ahmed
0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups, Yosry Ahmed
memcg->moving_account is used to signal that a memcg move is taking
place, so that folio_memcg_lock() would start acquiring the per-memcg
move lock instead of just initiating an rcu read section.
Refactor incrementing and decrementing memcg->moving_account, together
with rcu synchornization and the elaborate comment into helpers, to
allow for reuse by incoming patches.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
mm/memcontrol.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..ffdb848f4003 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6305,16 +6305,26 @@ static const struct mm_walk_ops charge_walk_ops = {
.pmd_entry = mem_cgroup_move_charge_pte_range,
};
-static void mem_cgroup_move_charge(void)
+static void mem_cgroup_start_move_charge(struct mem_cgroup *memcg)
{
- lru_add_drain_all();
/*
* Signal folio_memcg_lock() to take the memcg's move_lock
* while we're moving its pages to another memcg. Then wait
* for already started RCU-only updates to finish.
*/
- atomic_inc(&mc.from->moving_account);
+ atomic_inc(&memcg->moving_account);
synchronize_rcu();
+}
+
+static void mem_cgroup_end_move_charge(struct mem_cgroup *memcg)
+{
+ atomic_dec(&memcg->moving_account);
+}
+
+static void mem_cgroup_move_charge(void)
+{
+ lru_add_drain_all();
+ mem_cgroup_start_move_charge(mc.from);
retry:
if (unlikely(!mmap_read_trylock(mc.mm))) {
/*
@@ -6334,7 +6344,7 @@ static void mem_cgroup_move_charge(void)
*/
walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL);
mmap_read_unlock(mc.mm);
- atomic_dec(&mc.from->moving_account);
+ mem_cgroup_end_move_charge(mc.from);
}
static void mem_cgroup_move_task(void)
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [RFC PATCH 2/8] mm: vmscan: add lruvec_for_each_list() helper
2023-07-20 7:08 ` Yosry Ahmed
@ 2023-07-20 7:08 ` Yosry Ahmed
-1 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Yosry Ahmed
This helper is used to provide a callback to be called for each lruvec
list. This abstracts different lruvec implementations (MGLRU vs. classic
LRUs). The helper is used by a following commit to iterate all folios in
all LRUs lists for memcg recharging.
Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
include/linux/swap.h | 8 ++++++++
mm/vmscan.c | 28 ++++++++++++++++++++++++++++
2 files changed, 36 insertions(+)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 456546443f1f..c0621deceb03 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -406,6 +406,14 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page,
struct vm_area_struct *vma);
/* linux/mm/vmscan.c */
+typedef bool (*lruvec_list_fn_t)(struct lruvec *lruvec,
+ struct list_head *list,
+ enum lru_list lru,
+ void *arg);
+extern void lruvec_for_each_list(struct lruvec *lruvec,
+ lruvec_list_fn_t fn,
+ void *arg);
+
extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568b..e7956000a3b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6254,6 +6254,34 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
#endif /* CONFIG_LRU_GEN */
+/*
+ * lruvec_for_each_list - make a callback for every folio list in the lruvec
+ * @lruvec: the lruvec to iterate lists in
+ * @fn: the callback to make for each list, iteration stops if it returns true
+ * @arg: argument to pass to @fn
+ */
+void lruvec_for_each_list(struct lruvec *lruvec, lruvec_list_fn_t fn, void *arg)
+{
+ enum lru_list lru;
+
+#ifdef CONFIG_LRU_GEN
+ if (lru_gen_enabled()) {
+ int gen, type, zone;
+
+ for_each_gen_type_zone(gen, type, zone) {
+ lru = type * LRU_INACTIVE_FILE;
+ if (fn(lruvec, &lruvec->lrugen.folios[gen][type][zone],
+ lru, arg))
+ break;
+ }
+ } else
+#endif
+ for_each_evictable_lru(lru) {
+ if (fn(lruvec, &lruvec->lists[lru], lru, arg))
+ break;
+ }
+}
+
static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread* [RFC PATCH 2/8] mm: vmscan: add lruvec_for_each_list() helper
@ 2023-07-20 7:08 ` Yosry Ahmed
0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups, Yosry Ahmed
This helper is used to provide a callback to be called for each lruvec
list. This abstracts different lruvec implementations (MGLRU vs. classic
LRUs). The helper is used by a following commit to iterate all folios in
all LRUs lists for memcg recharging.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
include/linux/swap.h | 8 ++++++++
mm/vmscan.c | 28 ++++++++++++++++++++++++++++
2 files changed, 36 insertions(+)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 456546443f1f..c0621deceb03 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -406,6 +406,14 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page,
struct vm_area_struct *vma);
/* linux/mm/vmscan.c */
+typedef bool (*lruvec_list_fn_t)(struct lruvec *lruvec,
+ struct list_head *list,
+ enum lru_list lru,
+ void *arg);
+extern void lruvec_for_each_list(struct lruvec *lruvec,
+ lruvec_list_fn_t fn,
+ void *arg);
+
extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568b..e7956000a3b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6254,6 +6254,34 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
#endif /* CONFIG_LRU_GEN */
+/*
+ * lruvec_for_each_list - make a callback for every folio list in the lruvec
+ * @lruvec: the lruvec to iterate lists in
+ * @fn: the callback to make for each list, iteration stops if it returns true
+ * @arg: argument to pass to @fn
+ */
+void lruvec_for_each_list(struct lruvec *lruvec, lruvec_list_fn_t fn, void *arg)
+{
+ enum lru_list lru;
+
+#ifdef CONFIG_LRU_GEN
+ if (lru_gen_enabled()) {
+ int gen, type, zone;
+
+ for_each_gen_type_zone(gen, type, zone) {
+ lru = type * LRU_INACTIVE_FILE;
+ if (fn(lruvec, &lruvec->lrugen.folios[gen][type][zone],
+ lru, arg))
+ break;
+ }
+ } else
+#endif
+ for_each_evictable_lru(lru) {
+ if (fn(lruvec, &lruvec->lists[lru], lru, arg))
+ break;
+ }
+}
+
static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [RFC PATCH 3/8] memcg: recharge mapped folios when a memcg is offlined
2023-07-20 7:08 ` Yosry Ahmed
@ 2023-07-20 7:08 ` Yosry Ahmed
-1 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Yosry Ahmed
When a cgroup is removed by userspace and its memcg goes offline,
attempt to recharge the pages charged to the memcg to other memcgs that
are actually using the folios. Recharging is done in an asynchronous
worker as it is an expensive operation and needs to acquire multiple
locks.
Recharge targets are identified by walking the rmap and checking the
memcgs of the processes mapping the folio, if any. We avoid an OOM kill
if we fail to charge the folio, to avoid inducing an OOM kill at a
seemingly random point in time in the target memcg.
If we fail for any reason (e.g. could not isolate all folios, could not
lock folio, target memcg reached its limit, etc), we reschedule the
worker to rety.
Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
include/linux/memcontrol.h | 6 +
mm/memcontrol.c | 230 +++++++++++++++++++++++++++++++++++++
2 files changed, 236 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5818af8eca5a..b41d69685ead 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -326,6 +326,12 @@ struct mem_cgroup {
struct lru_gen_mm_list mm_list;
#endif
+ /* async recharge of mapped folios for offline memcgs */
+ struct {
+ struct delayed_work dwork;
+ int retries;
+ } recharge_mapped_work;
+
struct mem_cgroup_per_node *nodeinfo[];
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ffdb848f4003..a46bc8f000c8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,6 +96,8 @@ static bool cgroup_memory_nobpf __ro_after_init;
static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
#endif
+static struct workqueue_struct *memcg_recharge_wq;
+
/* Whether legacy memory+swap accounting is active */
static bool do_memsw_account(void)
{
@@ -5427,6 +5429,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
return -ENOMEM;
}
+static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg);
+
static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -5455,6 +5459,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
drain_all_stock(memcg);
mem_cgroup_id_put(memcg);
+
+ memcg_recharge_mapped_folios(memcg);
}
static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
@@ -5487,6 +5493,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
vmpressure_cleanup(&memcg->vmpressure);
cancel_work_sync(&memcg->high_work);
+ cancel_delayed_work_sync(&memcg->recharge_mapped_work.dwork);
mem_cgroup_remove_from_trees(memcg);
free_shrinker_info(memcg);
mem_cgroup_free(memcg);
@@ -6367,6 +6374,219 @@ static void mem_cgroup_move_task(void)
}
#endif
+/* Returns true if recharging is successful */
+static bool mem_cgroup_recharge_folio(struct folio *folio,
+ struct mem_cgroup *new_memcg)
+{
+ struct mem_cgroup *old_memcg = folio_memcg(folio);
+ gfp_t gfp = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
+ long nr_pages = folio_nr_pages(folio);
+ int err = -1;
+
+ if (!new_memcg)
+ goto out;
+
+ err = try_charge(new_memcg, gfp, nr_pages);
+ if (err)
+ goto out;
+
+ err = mem_cgroup_move_account(&folio->page, folio_test_large(folio),
+ old_memcg, new_memcg);
+ cancel_charge(err ? new_memcg : old_memcg, nr_pages);
+out:
+ return err == 0;
+}
+
+struct folio_memcg_rmap_recharge_arg {
+ bool recharged;
+};
+
+static bool folio_memcg_rmap_recharge_one(struct folio *folio,
+ struct vm_area_struct *vma, unsigned long address, void *arg)
+{
+ struct folio_memcg_rmap_recharge_arg *recharge_arg = arg;
+ DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+ struct mem_cgroup *memcg;
+
+ /*
+ * page_vma_mapped_walk() is only needed to grab any pte lock to
+ * serialize with page_remove_rmap(), as folio_mapped() must remain
+ * stable during the move.
+ */
+ recharge_arg->recharged = false;
+ while (page_vma_mapped_walk(&pvmw)) {
+ memcg = get_mem_cgroup_from_mm(vma->vm_mm);
+ if (mem_cgroup_recharge_folio(folio, memcg))
+ recharge_arg->recharged = true;
+ mem_cgroup_put(memcg);
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+
+ /* stop the rmap walk if we were successful */
+ return !recharge_arg->recharged;
+}
+
+/* Returns true if recharging is successful */
+static bool folio_memcg_rmap_recharge(struct folio *folio)
+{
+ struct folio_memcg_rmap_recharge_arg arg = { .recharged = false };
+ struct rmap_walk_control rwc = {
+ .rmap_one = folio_memcg_rmap_recharge_one,
+ .arg = (void *)&arg,
+ .anon_lock = folio_lock_anon_vma_read,
+ .try_lock = true,
+ };
+
+ rmap_walk(folio, &rwc);
+ return arg.recharged;
+}
+
+static unsigned long lruvec_nr_local_mapped_pages(struct lruvec *lruvec)
+{
+ return lruvec_page_state_local(lruvec, NR_ANON_MAPPED) +
+ lruvec_page_state_local(lruvec, NR_FILE_MAPPED);
+}
+
+static unsigned long memcg_nr_local_mapped_pages(struct mem_cgroup *memcg)
+{
+ return memcg_page_state_local(memcg, NR_ANON_MAPPED) +
+ memcg_page_state_local(memcg, NR_FILE_MAPPED);
+}
+
+static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
+ struct list_head *list,
+ enum lru_list lru,
+ void *arg)
+{
+ int isolated_idx = NR_ISOLATED_ANON + is_file_lru(lru);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ unsigned long *nr_recharged = arg;
+ unsigned long nr_staged = 0;
+ LIST_HEAD(folios_skipped);
+ LIST_HEAD(folios_staged);
+ struct folio *folio;
+
+ /* Are we done with mapped pages on this node? */
+ if (!lruvec_nr_local_mapped_pages(lruvec))
+ return true;
+
+ /*
+ * Iterating the LRUs here is tricky, because we
+ * usually cannot iterate the entire list with the
+ * lock held, and the LRU can change once we release it.
+ *
+ * What we try to do is isolate as many folios as we can
+ * without hogging the lock or the cpu. We need to move
+ * all the folios we iterate off the list to try to
+ * avoid re-visiting them on retries.
+ *
+ * The folios we are interested in are moved to
+ * @folios_staged, and other folios are moved to
+ * @folios_skipped. Before releasing the lock, we splice
+ * @folios_skipped back into the beginning of the LRU.
+ * This essentially rotates the LRU, similar to reclaim,
+ * as lru_to_folio() fetches folios from the end of the
+ * LRU.
+ */
+ spin_lock_irq(&lruvec->lru_lock);
+ while (!list_empty(list) && !need_resched() &&
+ !spin_is_contended(&lruvec->lru_lock)) {
+ folio = lru_to_folio(list);
+ if (!folio_mapped(folio)) {
+ list_move(&folio->lru, &folios_skipped);
+ continue;
+ }
+
+ if (unlikely(!folio_try_get(folio))) {
+ list_move(&folio->lru, &folios_skipped);
+ continue;
+ }
+
+ if (!folio_test_clear_lru(folio)) {
+ list_move(&folio->lru, &folios_skipped);
+ folio_put(folio);
+ continue;
+ }
+
+ lruvec_del_folio(lruvec, folio);
+ list_add(&folio->lru, &folios_staged);
+ nr_staged += folio_nr_pages(folio);
+ }
+ list_splice(&folios_skipped, list);
+ spin_unlock_irq(&lruvec->lru_lock);
+
+ mod_lruvec_state(lruvec, isolated_idx, nr_staged);
+ mem_cgroup_start_move_charge(memcg);
+ while (!list_empty(&folios_staged)) {
+ folio = lru_to_folio(&folios_staged);
+ list_del(&folio->lru);
+
+ if (!folio_trylock(folio)) {
+ folio_putback_lru(folio);
+ continue;
+ }
+
+ if (folio_memcg_rmap_recharge(folio))
+ *nr_recharged += folio_nr_pages(folio);
+
+ folio_unlock(folio);
+ folio_putback_lru(folio);
+ cond_resched();
+ }
+ mem_cgroup_end_move_charge(memcg);
+ mod_lruvec_state(lruvec, isolated_idx, -nr_staged);
+ return false;
+}
+
+static void memcg_do_recharge_mapped_folios(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+ struct mem_cgroup *memcg = container_of(dwork, struct mem_cgroup,
+ recharge_mapped_work.dwork);
+ unsigned long nr_recharged = 0;
+ struct lruvec *lruvec;
+ int nid;
+
+ lru_add_drain_all();
+ for_each_node_state(nid, N_MEMORY) {
+ lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+ lruvec_for_each_list(lruvec, memcg_recharge_lruvec_list,
+ &nr_recharged);
+ }
+
+ /* Are we done with all mapped folios? */
+ if (!memcg_nr_local_mapped_pages(memcg))
+ return;
+
+ /* Did we make progress? reset retries */
+ if (nr_recharged > 0)
+ memcg->recharge_mapped_work.retries = 0;
+
+ /* Exponentially increase delay before each retry (from 1ms to ~32s) */
+ if (memcg->recharge_mapped_work.retries < MAX_RECLAIM_RETRIES)
+ queue_delayed_work(memcg_recharge_wq,
+ &memcg->recharge_mapped_work.dwork,
+ 1 << memcg->recharge_mapped_work.retries++);
+}
+
+static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg)
+{
+ /*
+ * We need to initialize the work here, even if we are not going to
+ * queue it, such that cancel_delayed_work_sync() in
+ * mem_cgroup_css_free() does not complain.
+ */
+ INIT_DELAYED_WORK(&memcg->recharge_mapped_work.dwork,
+ memcg_do_recharge_mapped_folios);
+
+ if (memcg_recharge_wq && memcg_nr_local_mapped_pages(memcg)) {
+ memcg->recharge_mapped_work.retries = 0;
+ queue_delayed_work(memcg_recharge_wq,
+ &memcg->recharge_mapped_work.dwork, 0);
+ }
+}
+
#ifdef CONFIG_LRU_GEN
static void mem_cgroup_attach(struct cgroup_taskset *tset)
{
@@ -7904,3 +8124,13 @@ static int __init mem_cgroup_swap_init(void)
subsys_initcall(mem_cgroup_swap_init);
#endif /* CONFIG_SWAP */
+
+static int __init memcg_recharge_wq_init(void)
+{
+ if (mem_cgroup_disabled())
+ return 0;
+ memcg_recharge_wq = alloc_workqueue("memcg_recharge", WQ_UNBOUND, 0);
+ WARN_ON(!memcg_recharge_wq);
+ return 0;
+}
+subsys_initcall(memcg_recharge_wq_init);
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread* [RFC PATCH 3/8] memcg: recharge mapped folios when a memcg is offlined
@ 2023-07-20 7:08 ` Yosry Ahmed
0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups, Yosry Ahmed
When a cgroup is removed by userspace and its memcg goes offline,
attempt to recharge the pages charged to the memcg to other memcgs that
are actually using the folios. Recharging is done in an asynchronous
worker as it is an expensive operation and needs to acquire multiple
locks.
Recharge targets are identified by walking the rmap and checking the
memcgs of the processes mapping the folio, if any. We avoid an OOM kill
if we fail to charge the folio, to avoid inducing an OOM kill at a
seemingly random point in time in the target memcg.
If we fail for any reason (e.g. could not isolate all folios, could not
lock folio, target memcg reached its limit, etc), we reschedule the
worker to rety.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
include/linux/memcontrol.h | 6 +
mm/memcontrol.c | 230 +++++++++++++++++++++++++++++++++++++
2 files changed, 236 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5818af8eca5a..b41d69685ead 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -326,6 +326,12 @@ struct mem_cgroup {
struct lru_gen_mm_list mm_list;
#endif
+ /* async recharge of mapped folios for offline memcgs */
+ struct {
+ struct delayed_work dwork;
+ int retries;
+ } recharge_mapped_work;
+
struct mem_cgroup_per_node *nodeinfo[];
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ffdb848f4003..a46bc8f000c8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,6 +96,8 @@ static bool cgroup_memory_nobpf __ro_after_init;
static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
#endif
+static struct workqueue_struct *memcg_recharge_wq;
+
/* Whether legacy memory+swap accounting is active */
static bool do_memsw_account(void)
{
@@ -5427,6 +5429,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
return -ENOMEM;
}
+static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg);
+
static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -5455,6 +5459,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
drain_all_stock(memcg);
mem_cgroup_id_put(memcg);
+
+ memcg_recharge_mapped_folios(memcg);
}
static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
@@ -5487,6 +5493,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
vmpressure_cleanup(&memcg->vmpressure);
cancel_work_sync(&memcg->high_work);
+ cancel_delayed_work_sync(&memcg->recharge_mapped_work.dwork);
mem_cgroup_remove_from_trees(memcg);
free_shrinker_info(memcg);
mem_cgroup_free(memcg);
@@ -6367,6 +6374,219 @@ static void mem_cgroup_move_task(void)
}
#endif
+/* Returns true if recharging is successful */
+static bool mem_cgroup_recharge_folio(struct folio *folio,
+ struct mem_cgroup *new_memcg)
+{
+ struct mem_cgroup *old_memcg = folio_memcg(folio);
+ gfp_t gfp = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
+ long nr_pages = folio_nr_pages(folio);
+ int err = -1;
+
+ if (!new_memcg)
+ goto out;
+
+ err = try_charge(new_memcg, gfp, nr_pages);
+ if (err)
+ goto out;
+
+ err = mem_cgroup_move_account(&folio->page, folio_test_large(folio),
+ old_memcg, new_memcg);
+ cancel_charge(err ? new_memcg : old_memcg, nr_pages);
+out:
+ return err == 0;
+}
+
+struct folio_memcg_rmap_recharge_arg {
+ bool recharged;
+};
+
+static bool folio_memcg_rmap_recharge_one(struct folio *folio,
+ struct vm_area_struct *vma, unsigned long address, void *arg)
+{
+ struct folio_memcg_rmap_recharge_arg *recharge_arg = arg;
+ DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+ struct mem_cgroup *memcg;
+
+ /*
+ * page_vma_mapped_walk() is only needed to grab any pte lock to
+ * serialize with page_remove_rmap(), as folio_mapped() must remain
+ * stable during the move.
+ */
+ recharge_arg->recharged = false;
+ while (page_vma_mapped_walk(&pvmw)) {
+ memcg = get_mem_cgroup_from_mm(vma->vm_mm);
+ if (mem_cgroup_recharge_folio(folio, memcg))
+ recharge_arg->recharged = true;
+ mem_cgroup_put(memcg);
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+
+ /* stop the rmap walk if we were successful */
+ return !recharge_arg->recharged;
+}
+
+/* Returns true if recharging is successful */
+static bool folio_memcg_rmap_recharge(struct folio *folio)
+{
+ struct folio_memcg_rmap_recharge_arg arg = { .recharged = false };
+ struct rmap_walk_control rwc = {
+ .rmap_one = folio_memcg_rmap_recharge_one,
+ .arg = (void *)&arg,
+ .anon_lock = folio_lock_anon_vma_read,
+ .try_lock = true,
+ };
+
+ rmap_walk(folio, &rwc);
+ return arg.recharged;
+}
+
+static unsigned long lruvec_nr_local_mapped_pages(struct lruvec *lruvec)
+{
+ return lruvec_page_state_local(lruvec, NR_ANON_MAPPED) +
+ lruvec_page_state_local(lruvec, NR_FILE_MAPPED);
+}
+
+static unsigned long memcg_nr_local_mapped_pages(struct mem_cgroup *memcg)
+{
+ return memcg_page_state_local(memcg, NR_ANON_MAPPED) +
+ memcg_page_state_local(memcg, NR_FILE_MAPPED);
+}
+
+static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
+ struct list_head *list,
+ enum lru_list lru,
+ void *arg)
+{
+ int isolated_idx = NR_ISOLATED_ANON + is_file_lru(lru);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ unsigned long *nr_recharged = arg;
+ unsigned long nr_staged = 0;
+ LIST_HEAD(folios_skipped);
+ LIST_HEAD(folios_staged);
+ struct folio *folio;
+
+ /* Are we done with mapped pages on this node? */
+ if (!lruvec_nr_local_mapped_pages(lruvec))
+ return true;
+
+ /*
+ * Iterating the LRUs here is tricky, because we
+ * usually cannot iterate the entire list with the
+ * lock held, and the LRU can change once we release it.
+ *
+ * What we try to do is isolate as many folios as we can
+ * without hogging the lock or the cpu. We need to move
+ * all the folios we iterate off the list to try to
+ * avoid re-visiting them on retries.
+ *
+ * The folios we are interested in are moved to
+ * @folios_staged, and other folios are moved to
+ * @folios_skipped. Before releasing the lock, we splice
+ * @folios_skipped back into the beginning of the LRU.
+ * This essentially rotates the LRU, similar to reclaim,
+ * as lru_to_folio() fetches folios from the end of the
+ * LRU.
+ */
+ spin_lock_irq(&lruvec->lru_lock);
+ while (!list_empty(list) && !need_resched() &&
+ !spin_is_contended(&lruvec->lru_lock)) {
+ folio = lru_to_folio(list);
+ if (!folio_mapped(folio)) {
+ list_move(&folio->lru, &folios_skipped);
+ continue;
+ }
+
+ if (unlikely(!folio_try_get(folio))) {
+ list_move(&folio->lru, &folios_skipped);
+ continue;
+ }
+
+ if (!folio_test_clear_lru(folio)) {
+ list_move(&folio->lru, &folios_skipped);
+ folio_put(folio);
+ continue;
+ }
+
+ lruvec_del_folio(lruvec, folio);
+ list_add(&folio->lru, &folios_staged);
+ nr_staged += folio_nr_pages(folio);
+ }
+ list_splice(&folios_skipped, list);
+ spin_unlock_irq(&lruvec->lru_lock);
+
+ mod_lruvec_state(lruvec, isolated_idx, nr_staged);
+ mem_cgroup_start_move_charge(memcg);
+ while (!list_empty(&folios_staged)) {
+ folio = lru_to_folio(&folios_staged);
+ list_del(&folio->lru);
+
+ if (!folio_trylock(folio)) {
+ folio_putback_lru(folio);
+ continue;
+ }
+
+ if (folio_memcg_rmap_recharge(folio))
+ *nr_recharged += folio_nr_pages(folio);
+
+ folio_unlock(folio);
+ folio_putback_lru(folio);
+ cond_resched();
+ }
+ mem_cgroup_end_move_charge(memcg);
+ mod_lruvec_state(lruvec, isolated_idx, -nr_staged);
+ return false;
+}
+
+static void memcg_do_recharge_mapped_folios(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+ struct mem_cgroup *memcg = container_of(dwork, struct mem_cgroup,
+ recharge_mapped_work.dwork);
+ unsigned long nr_recharged = 0;
+ struct lruvec *lruvec;
+ int nid;
+
+ lru_add_drain_all();
+ for_each_node_state(nid, N_MEMORY) {
+ lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+ lruvec_for_each_list(lruvec, memcg_recharge_lruvec_list,
+ &nr_recharged);
+ }
+
+ /* Are we done with all mapped folios? */
+ if (!memcg_nr_local_mapped_pages(memcg))
+ return;
+
+ /* Did we make progress? reset retries */
+ if (nr_recharged > 0)
+ memcg->recharge_mapped_work.retries = 0;
+
+ /* Exponentially increase delay before each retry (from 1ms to ~32s) */
+ if (memcg->recharge_mapped_work.retries < MAX_RECLAIM_RETRIES)
+ queue_delayed_work(memcg_recharge_wq,
+ &memcg->recharge_mapped_work.dwork,
+ 1 << memcg->recharge_mapped_work.retries++);
+}
+
+static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg)
+{
+ /*
+ * We need to initialize the work here, even if we are not going to
+ * queue it, such that cancel_delayed_work_sync() in
+ * mem_cgroup_css_free() does not complain.
+ */
+ INIT_DELAYED_WORK(&memcg->recharge_mapped_work.dwork,
+ memcg_do_recharge_mapped_folios);
+
+ if (memcg_recharge_wq && memcg_nr_local_mapped_pages(memcg)) {
+ memcg->recharge_mapped_work.retries = 0;
+ queue_delayed_work(memcg_recharge_wq,
+ &memcg->recharge_mapped_work.dwork, 0);
+ }
+}
+
#ifdef CONFIG_LRU_GEN
static void mem_cgroup_attach(struct cgroup_taskset *tset)
{
@@ -7904,3 +8124,13 @@ static int __init mem_cgroup_swap_init(void)
subsys_initcall(mem_cgroup_swap_init);
#endif /* CONFIG_SWAP */
+
+static int __init memcg_recharge_wq_init(void)
+{
+ if (mem_cgroup_disabled())
+ return 0;
+ memcg_recharge_wq = alloc_workqueue("memcg_recharge", WQ_UNBOUND, 0);
+ WARN_ON(!memcg_recharge_wq);
+ return 0;
+}
+subsys_initcall(memcg_recharge_wq_init);
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [RFC PATCH 4/8] memcg: support deferred memcg recharging
2023-07-20 7:08 ` Yosry Ahmed
@ 2023-07-20 7:08 ` Yosry Ahmed
-1 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Yosry Ahmed
The previous patch added support for memcg recharging for mapped folios
when a memcg goes offline. For unmapped folios, it is not straighforward
to find a rightful memcg to recharge the folio to. Hence, add support
for deferred recharging.
Deferred recharging provides a hook that can be added in data access
paths: folio_memcg_deferred_recharge().
folio_memcg_deferred_recharge() will check if the memcg that the folio
is charged to is offline. If so, it will queue an asynchronous worker to
attempt to recharge the folio to the memcg of the process accessing the
folio. An asynchronous worker is used for 2 reasons:
(a) Avoid expensive operations on the data access path.
(b) Acquring some locks (e.g. folio lock, lruvec lock) is not safe to do
from all contexts.
Deferring recharging will not cause an OOM kill in the target memcg. If
recharging fails for any reason, the worker reschedules itself to retry,
unless the folio is freed or the target memcg goes offline.
Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
include/linux/memcontrol.h | 6 ++
mm/memcontrol.c | 125 +++++++++++++++++++++++++++++++++++--
2 files changed, 126 insertions(+), 5 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b41d69685ead..59b653d4a76e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -956,6 +956,8 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
void folio_memcg_lock(struct folio *folio);
void folio_memcg_unlock(struct folio *folio);
+void folio_memcg_deferred_recharge(struct folio *folio);
+
void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
/* try to stablize folio_memcg() for all the pages in a memcg */
@@ -1461,6 +1463,10 @@ static inline void mem_cgroup_unlock_pages(void)
rcu_read_unlock();
}
+static inline void folio_memcg_deferred_recharge(struct folio *folio)
+{
+}
+
static inline void mem_cgroup_handle_over_high(void)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a46bc8f000c8..cf9fb51ecfcc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6398,6 +6398,7 @@ static bool mem_cgroup_recharge_folio(struct folio *folio,
}
struct folio_memcg_rmap_recharge_arg {
+ struct mem_cgroup *memcg;
bool recharged;
};
@@ -6415,10 +6416,12 @@ static bool folio_memcg_rmap_recharge_one(struct folio *folio,
*/
recharge_arg->recharged = false;
while (page_vma_mapped_walk(&pvmw)) {
- memcg = get_mem_cgroup_from_mm(vma->vm_mm);
+ memcg = recharge_arg->memcg ?:
+ get_mem_cgroup_from_mm(vma->vm_mm);
if (mem_cgroup_recharge_folio(folio, memcg))
recharge_arg->recharged = true;
- mem_cgroup_put(memcg);
+ if (!recharge_arg->memcg)
+ mem_cgroup_put(memcg);
page_vma_mapped_walk_done(&pvmw);
break;
}
@@ -6428,9 +6431,13 @@ static bool folio_memcg_rmap_recharge_one(struct folio *folio,
}
/* Returns true if recharging is successful */
-static bool folio_memcg_rmap_recharge(struct folio *folio)
+static bool folio_memcg_rmap_recharge(struct folio *folio,
+ struct mem_cgroup *memcg)
{
- struct folio_memcg_rmap_recharge_arg arg = { .recharged = false };
+ struct folio_memcg_rmap_recharge_arg arg = {
+ .recharged = false,
+ .memcg = memcg,
+ };
struct rmap_walk_control rwc = {
.rmap_one = folio_memcg_rmap_recharge_one,
.arg = (void *)&arg,
@@ -6527,7 +6534,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
continue;
}
- if (folio_memcg_rmap_recharge(folio))
+ if (folio_memcg_rmap_recharge(folio, NULL))
*nr_recharged += folio_nr_pages(folio);
folio_unlock(folio);
@@ -6587,6 +6594,114 @@ static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg)
}
}
+/* Result is only stable if @folio is locked */
+static bool should_do_deferred_recharge(struct folio *folio)
+{
+ struct mem_cgroup *memcg;
+ bool ret;
+
+ rcu_read_lock();
+ memcg = folio_memcg_rcu(folio);
+ ret = memcg && !!(memcg->css.flags & CSS_DYING);
+ rcu_read_unlock();
+
+ return ret;
+}
+
+struct deferred_recharge_work {
+ struct folio *folio;
+ struct mem_cgroup *memcg;
+ struct work_struct work;
+};
+
+static void folio_memcg_do_deferred_recharge(struct work_struct *work)
+{
+ struct deferred_recharge_work *recharge_work = container_of(work,
+ struct deferred_recharge_work, work);
+ struct folio *folio = recharge_work->folio;
+ struct mem_cgroup *new = recharge_work->memcg;
+ struct mem_cgroup *old;
+
+ /* We are holding the last ref to the folio, let it be freed */
+ if (unlikely(folio_ref_count(folio) == 1))
+ goto out;
+
+ if (!folio_isolate_lru(folio))
+ goto out;
+
+ if (unlikely(!folio_trylock(folio)))
+ goto out_putback;
+
+ /* @folio was already recharged since the worker was queued? */
+ if (unlikely(!should_do_deferred_recharge(folio)))
+ goto out_unlock;
+
+ /* @folio was already recharged to @new and it already went offline? */
+ old = folio_memcg(folio);
+ if (unlikely(old == new))
+ goto out_unlock;
+
+ /*
+ * folio_mapped() must remain stable during the move. If the folio is
+ * mapped, we must use rmap recharge to serialize against unmapping.
+ * Otherwise, if the folio is unmapped, the folio lock is held so this
+ * should prevent faults against the pagecache or swapcache to map it.
+ */
+ mem_cgroup_start_move_charge(old);
+ if (folio_mapped(folio))
+ folio_memcg_rmap_recharge(folio, new);
+ else
+ mem_cgroup_recharge_folio(folio, new);
+ mem_cgroup_end_move_charge(old);
+
+out_unlock:
+ folio_unlock(folio);
+out_putback:
+ folio_putback_lru(folio);
+out:
+ folio_put(folio);
+ mem_cgroup_put(new);
+ kfree(recharge_work);
+}
+
+/*
+ * Queue deferred work to recharge @folio to current's memcg if needed.
+ */
+void folio_memcg_deferred_recharge(struct folio *folio)
+{
+ struct deferred_recharge_work *recharge_work = NULL;
+ struct mem_cgroup *memcg = NULL;
+
+ /* racy check, the async worker will check again with @folio locked */
+ if (likely(!should_do_deferred_recharge(folio)))
+ return;
+
+ if (unlikely(!memcg_recharge_wq))
+ return;
+
+ if (unlikely(!folio_try_get(folio)))
+ return;
+
+ memcg = get_mem_cgroup_from_mm(current->mm);
+ if (!memcg)
+ goto fail;
+
+ recharge_work = kmalloc(sizeof(*recharge_work), GFP_ATOMIC);
+ if (!recharge_work)
+ goto fail;
+
+ /* we hold refs to both the folio and the memcg we are charging to */
+ recharge_work->folio = folio;
+ recharge_work->memcg = memcg;
+ INIT_WORK(&recharge_work->work, folio_memcg_do_deferred_recharge);
+ queue_work(memcg_recharge_wq, &recharge_work->work);
+ return;
+fail:
+ kfree(recharge_work);
+ mem_cgroup_put(memcg);
+ folio_put(folio);
+}
+
#ifdef CONFIG_LRU_GEN
static void mem_cgroup_attach(struct cgroup_taskset *tset)
{
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread* [RFC PATCH 4/8] memcg: support deferred memcg recharging
@ 2023-07-20 7:08 ` Yosry Ahmed
0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups, Yosry Ahmed
The previous patch added support for memcg recharging for mapped folios
when a memcg goes offline. For unmapped folios, it is not straighforward
to find a rightful memcg to recharge the folio to. Hence, add support
for deferred recharging.
Deferred recharging provides a hook that can be added in data access
paths: folio_memcg_deferred_recharge().
folio_memcg_deferred_recharge() will check if the memcg that the folio
is charged to is offline. If so, it will queue an asynchronous worker to
attempt to recharge the folio to the memcg of the process accessing the
folio. An asynchronous worker is used for 2 reasons:
(a) Avoid expensive operations on the data access path.
(b) Acquring some locks (e.g. folio lock, lruvec lock) is not safe to do
from all contexts.
Deferring recharging will not cause an OOM kill in the target memcg. If
recharging fails for any reason, the worker reschedules itself to retry,
unless the folio is freed or the target memcg goes offline.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
include/linux/memcontrol.h | 6 ++
mm/memcontrol.c | 125 +++++++++++++++++++++++++++++++++++--
2 files changed, 126 insertions(+), 5 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b41d69685ead..59b653d4a76e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -956,6 +956,8 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
void folio_memcg_lock(struct folio *folio);
void folio_memcg_unlock(struct folio *folio);
+void folio_memcg_deferred_recharge(struct folio *folio);
+
void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
/* try to stablize folio_memcg() for all the pages in a memcg */
@@ -1461,6 +1463,10 @@ static inline void mem_cgroup_unlock_pages(void)
rcu_read_unlock();
}
+static inline void folio_memcg_deferred_recharge(struct folio *folio)
+{
+}
+
static inline void mem_cgroup_handle_over_high(void)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a46bc8f000c8..cf9fb51ecfcc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6398,6 +6398,7 @@ static bool mem_cgroup_recharge_folio(struct folio *folio,
}
struct folio_memcg_rmap_recharge_arg {
+ struct mem_cgroup *memcg;
bool recharged;
};
@@ -6415,10 +6416,12 @@ static bool folio_memcg_rmap_recharge_one(struct folio *folio,
*/
recharge_arg->recharged = false;
while (page_vma_mapped_walk(&pvmw)) {
- memcg = get_mem_cgroup_from_mm(vma->vm_mm);
+ memcg = recharge_arg->memcg ?:
+ get_mem_cgroup_from_mm(vma->vm_mm);
if (mem_cgroup_recharge_folio(folio, memcg))
recharge_arg->recharged = true;
- mem_cgroup_put(memcg);
+ if (!recharge_arg->memcg)
+ mem_cgroup_put(memcg);
page_vma_mapped_walk_done(&pvmw);
break;
}
@@ -6428,9 +6431,13 @@ static bool folio_memcg_rmap_recharge_one(struct folio *folio,
}
/* Returns true if recharging is successful */
-static bool folio_memcg_rmap_recharge(struct folio *folio)
+static bool folio_memcg_rmap_recharge(struct folio *folio,
+ struct mem_cgroup *memcg)
{
- struct folio_memcg_rmap_recharge_arg arg = { .recharged = false };
+ struct folio_memcg_rmap_recharge_arg arg = {
+ .recharged = false,
+ .memcg = memcg,
+ };
struct rmap_walk_control rwc = {
.rmap_one = folio_memcg_rmap_recharge_one,
.arg = (void *)&arg,
@@ -6527,7 +6534,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
continue;
}
- if (folio_memcg_rmap_recharge(folio))
+ if (folio_memcg_rmap_recharge(folio, NULL))
*nr_recharged += folio_nr_pages(folio);
folio_unlock(folio);
@@ -6587,6 +6594,114 @@ static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg)
}
}
+/* Result is only stable if @folio is locked */
+static bool should_do_deferred_recharge(struct folio *folio)
+{
+ struct mem_cgroup *memcg;
+ bool ret;
+
+ rcu_read_lock();
+ memcg = folio_memcg_rcu(folio);
+ ret = memcg && !!(memcg->css.flags & CSS_DYING);
+ rcu_read_unlock();
+
+ return ret;
+}
+
+struct deferred_recharge_work {
+ struct folio *folio;
+ struct mem_cgroup *memcg;
+ struct work_struct work;
+};
+
+static void folio_memcg_do_deferred_recharge(struct work_struct *work)
+{
+ struct deferred_recharge_work *recharge_work = container_of(work,
+ struct deferred_recharge_work, work);
+ struct folio *folio = recharge_work->folio;
+ struct mem_cgroup *new = recharge_work->memcg;
+ struct mem_cgroup *old;
+
+ /* We are holding the last ref to the folio, let it be freed */
+ if (unlikely(folio_ref_count(folio) == 1))
+ goto out;
+
+ if (!folio_isolate_lru(folio))
+ goto out;
+
+ if (unlikely(!folio_trylock(folio)))
+ goto out_putback;
+
+ /* @folio was already recharged since the worker was queued? */
+ if (unlikely(!should_do_deferred_recharge(folio)))
+ goto out_unlock;
+
+ /* @folio was already recharged to @new and it already went offline? */
+ old = folio_memcg(folio);
+ if (unlikely(old == new))
+ goto out_unlock;
+
+ /*
+ * folio_mapped() must remain stable during the move. If the folio is
+ * mapped, we must use rmap recharge to serialize against unmapping.
+ * Otherwise, if the folio is unmapped, the folio lock is held so this
+ * should prevent faults against the pagecache or swapcache to map it.
+ */
+ mem_cgroup_start_move_charge(old);
+ if (folio_mapped(folio))
+ folio_memcg_rmap_recharge(folio, new);
+ else
+ mem_cgroup_recharge_folio(folio, new);
+ mem_cgroup_end_move_charge(old);
+
+out_unlock:
+ folio_unlock(folio);
+out_putback:
+ folio_putback_lru(folio);
+out:
+ folio_put(folio);
+ mem_cgroup_put(new);
+ kfree(recharge_work);
+}
+
+/*
+ * Queue deferred work to recharge @folio to current's memcg if needed.
+ */
+void folio_memcg_deferred_recharge(struct folio *folio)
+{
+ struct deferred_recharge_work *recharge_work = NULL;
+ struct mem_cgroup *memcg = NULL;
+
+ /* racy check, the async worker will check again with @folio locked */
+ if (likely(!should_do_deferred_recharge(folio)))
+ return;
+
+ if (unlikely(!memcg_recharge_wq))
+ return;
+
+ if (unlikely(!folio_try_get(folio)))
+ return;
+
+ memcg = get_mem_cgroup_from_mm(current->mm);
+ if (!memcg)
+ goto fail;
+
+ recharge_work = kmalloc(sizeof(*recharge_work), GFP_ATOMIC);
+ if (!recharge_work)
+ goto fail;
+
+ /* we hold refs to both the folio and the memcg we are charging to */
+ recharge_work->folio = folio;
+ recharge_work->memcg = memcg;
+ INIT_WORK(&recharge_work->work, folio_memcg_do_deferred_recharge);
+ queue_work(memcg_recharge_wq, &recharge_work->work);
+ return;
+fail:
+ kfree(recharge_work);
+ mem_cgroup_put(memcg);
+ folio_put(folio);
+}
+
#ifdef CONFIG_LRU_GEN
static void mem_cgroup_attach(struct cgroup_taskset *tset)
{
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [RFC PATCH 5/8] memcg: recharge folios when accessed or dirtied
2023-07-20 7:08 ` Yosry Ahmed
@ 2023-07-20 7:08 ` Yosry Ahmed
-1 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Yosry Ahmed
The previous patch provided support for deferred recharging of folios
when their memcgs go offline. This patch adds recharging hooks to
folio_mark_accessed() and folio_mark_dirty().
This should cover a variety of code paths where folios are accessed by
userspace.
The hook, folio_memcg_deferred_recharge() only checks if the folio is
charged to an offline memcg in the common fast path (i.e checks
folio->memcg_data). If yes, an asynchronous worker is queued to do the
actual work.
Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
mm/page-writeback.c | 2 ++
mm/swap.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d3f42009bb70..a644530d98c7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2785,6 +2785,8 @@ bool folio_mark_dirty(struct folio *folio)
{
struct address_space *mapping = folio_mapping(folio);
+ folio_memcg_deferred_recharge(folio);
+
if (likely(mapping)) {
/*
* readahead/folio_deactivate could remain
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..296c0b87c967 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -457,6 +457,8 @@ static void folio_inc_refs(struct folio *folio)
*/
void folio_mark_accessed(struct folio *folio)
{
+ folio_memcg_deferred_recharge(folio);
+
if (lru_gen_enabled()) {
folio_inc_refs(folio);
return;
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread* [RFC PATCH 5/8] memcg: recharge folios when accessed or dirtied
@ 2023-07-20 7:08 ` Yosry Ahmed
0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups, Yosry Ahmed
The previous patch provided support for deferred recharging of folios
when their memcgs go offline. This patch adds recharging hooks to
folio_mark_accessed() and folio_mark_dirty().
This should cover a variety of code paths where folios are accessed by
userspace.
The hook, folio_memcg_deferred_recharge() only checks if the folio is
charged to an offline memcg in the common fast path (i.e checks
folio->memcg_data). If yes, an asynchronous worker is queued to do the
actual work.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
mm/page-writeback.c | 2 ++
mm/swap.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d3f42009bb70..a644530d98c7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2785,6 +2785,8 @@ bool folio_mark_dirty(struct folio *folio)
{
struct address_space *mapping = folio_mapping(folio);
+ folio_memcg_deferred_recharge(folio);
+
if (likely(mapping)) {
/*
* readahead/folio_deactivate could remain
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..296c0b87c967 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -457,6 +457,8 @@ static void folio_inc_refs(struct folio *folio)
*/
void folio_mark_accessed(struct folio *folio)
{
+ folio_memcg_deferred_recharge(folio);
+
if (lru_gen_enabled()) {
folio_inc_refs(folio);
return;
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [RFC PATCH 6/8] memcg: add stats for offline memcgs recharging
2023-07-20 7:08 ` Yosry Ahmed
@ 2023-07-20 7:08 ` Yosry Ahmed
-1 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Yosry Ahmed
Add vm events for scanning pages for recharge, successfully recharging
pages, and cancelling a recharge due to failure to charge the target
memcg.
Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
include/linux/vm_event_item.h | 5 +++++
mm/memcontrol.c | 6 ++++++
mm/vmstat.c | 6 +++++-
3 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8abfa1240040..cd80c00c50c2 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -60,6 +60,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PAGEOUTRUN, PGROTATED,
DROP_PAGECACHE, DROP_SLAB,
OOM_KILL,
+#ifdef CONFIG_MEMCG
+ RECHARGE_PGSCANNED,
+ RECHARGE_PGMOVED,
+ RECHARGE_PGCANCELLED,
+#endif
#ifdef CONFIG_NUMA_BALANCING
NUMA_PTE_UPDATES,
NUMA_HUGE_PTE_UPDATES,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf9fb51ecfcc..2fe9c6f1be80 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6394,6 +6394,8 @@ static bool mem_cgroup_recharge_folio(struct folio *folio,
old_memcg, new_memcg);
cancel_charge(err ? new_memcg : old_memcg, nr_pages);
out:
+ count_vm_events(err ? RECHARGE_PGCANCELLED : RECHARGE_PGMOVED,
+ nr_pages);
return err == 0;
}
@@ -6469,6 +6471,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
int isolated_idx = NR_ISOLATED_ANON + is_file_lru(lru);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
unsigned long *nr_recharged = arg;
+ unsigned long nr_scanned = 0;
unsigned long nr_staged = 0;
LIST_HEAD(folios_skipped);
LIST_HEAD(folios_staged);
@@ -6505,6 +6508,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
continue;
}
+ nr_scanned += folio_nr_pages(folio);
if (unlikely(!folio_try_get(folio))) {
list_move(&folio->lru, &folios_skipped);
continue;
@@ -6543,6 +6547,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
}
mem_cgroup_end_move_charge(memcg);
mod_lruvec_state(lruvec, isolated_idx, -nr_staged);
+ count_vm_events(RECHARGE_PGSCANNED, nr_scanned);
return false;
}
@@ -6679,6 +6684,7 @@ void folio_memcg_deferred_recharge(struct folio *folio)
if (unlikely(!memcg_recharge_wq))
return;
+ count_vm_events(RECHARGE_PGSCANNED, folio_nr_pages(folio));
if (unlikely(!folio_try_get(folio)))
return;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b731d57996c5..e425a1aa7890 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1303,7 +1303,11 @@ const char * const vmstat_text[] = {
"drop_pagecache",
"drop_slab",
"oom_kill",
-
+#ifdef CONFIG_MEMCG
+ "recharge_pgs_scanned",
+ "recharge_pgs_moved",
+ "recharge_pgs_cancelled",
+#endif
#ifdef CONFIG_NUMA_BALANCING
"numa_pte_updates",
"numa_huge_pte_updates",
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread* [RFC PATCH 6/8] memcg: add stats for offline memcgs recharging
@ 2023-07-20 7:08 ` Yosry Ahmed
0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups, Yosry Ahmed
Add vm events for scanning pages for recharge, successfully recharging
pages, and cancelling a recharge due to failure to charge the target
memcg.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
include/linux/vm_event_item.h | 5 +++++
mm/memcontrol.c | 6 ++++++
mm/vmstat.c | 6 +++++-
3 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8abfa1240040..cd80c00c50c2 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -60,6 +60,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PAGEOUTRUN, PGROTATED,
DROP_PAGECACHE, DROP_SLAB,
OOM_KILL,
+#ifdef CONFIG_MEMCG
+ RECHARGE_PGSCANNED,
+ RECHARGE_PGMOVED,
+ RECHARGE_PGCANCELLED,
+#endif
#ifdef CONFIG_NUMA_BALANCING
NUMA_PTE_UPDATES,
NUMA_HUGE_PTE_UPDATES,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf9fb51ecfcc..2fe9c6f1be80 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6394,6 +6394,8 @@ static bool mem_cgroup_recharge_folio(struct folio *folio,
old_memcg, new_memcg);
cancel_charge(err ? new_memcg : old_memcg, nr_pages);
out:
+ count_vm_events(err ? RECHARGE_PGCANCELLED : RECHARGE_PGMOVED,
+ nr_pages);
return err == 0;
}
@@ -6469,6 +6471,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
int isolated_idx = NR_ISOLATED_ANON + is_file_lru(lru);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
unsigned long *nr_recharged = arg;
+ unsigned long nr_scanned = 0;
unsigned long nr_staged = 0;
LIST_HEAD(folios_skipped);
LIST_HEAD(folios_staged);
@@ -6505,6 +6508,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
continue;
}
+ nr_scanned += folio_nr_pages(folio);
if (unlikely(!folio_try_get(folio))) {
list_move(&folio->lru, &folios_skipped);
continue;
@@ -6543,6 +6547,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
}
mem_cgroup_end_move_charge(memcg);
mod_lruvec_state(lruvec, isolated_idx, -nr_staged);
+ count_vm_events(RECHARGE_PGSCANNED, nr_scanned);
return false;
}
@@ -6679,6 +6684,7 @@ void folio_memcg_deferred_recharge(struct folio *folio)
if (unlikely(!memcg_recharge_wq))
return;
+ count_vm_events(RECHARGE_PGSCANNED, folio_nr_pages(folio));
if (unlikely(!folio_try_get(folio)))
return;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b731d57996c5..e425a1aa7890 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1303,7 +1303,11 @@ const char * const vmstat_text[] = {
"drop_pagecache",
"drop_slab",
"oom_kill",
-
+#ifdef CONFIG_MEMCG
+ "recharge_pgs_scanned",
+ "recharge_pgs_moved",
+ "recharge_pgs_cancelled",
+#endif
#ifdef CONFIG_NUMA_BALANCING
"numa_pte_updates",
"numa_huge_pte_updates",
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
2023-07-20 7:08 ` Yosry Ahmed
@ 2023-07-20 7:08 ` Yosry Ahmed
-1 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Yosry Ahmed
Add a sysctl to enable/disable memory recharging for offline memcgs. Add
a config option to control whether or not it is enabled by default.
Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
include/linux/memcontrol.h | 2 ++
kernel/sysctl.c | 11 +++++++++++
mm/Kconfig | 12 ++++++++++++
mm/memcontrol.c | 9 ++++++++-
4 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 59b653d4a76e..ae9f09ee90cb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
#ifdef CONFIG_MEMCG
+extern int sysctl_recharge_offline_memcgs;
+
#define MEM_CGROUP_ID_SHIFT 16
#define MEM_CGROUP_ID_MAX USHRT_MAX
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 354a2d294f52..1735d1d95652 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
.extra2 = (void *)&mmap_rnd_compat_bits_max,
},
#endif
+#ifdef CONFIG_MEMCG
+ {
+ .procname = "recharge_offline_memcgs",
+ .data = &sysctl_recharge_offline_memcgs,
+ .maxlen = sizeof(sysctl_recharge_offline_memcgs),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
+#endif /* CONFIG_MEMCG */
{ }
};
diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..9462c4b598d9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1236,6 +1236,18 @@ config LOCK_MM_AND_FIND_VMA
bool
depends on !STACK_GROWSUP
+config MEMCG_RECHARGE_OFFLINE_ENABLED
+ bool "Recharge memory charged to offline memcgs"
+ depends on MEMCG
+ help
+ When a memory cgroup is removed by userspace, try to recharge any
+ memory still charged to it to avoid having it live on as an offline
+ memcg. Offline memcgs potentially consume memory and limit scalability
+ of some operations.
+
+ This option enables the above behavior by default. It can be override
+ at runtime through /proc/sys/vm/recharge_offline_memcgs.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2fe9c6f1be80..25cdb17eaaa3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,6 +96,9 @@ static bool cgroup_memory_nobpf __ro_after_init;
static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
#endif
+int sysctl_recharge_offline_memcgs __read_mostly = IS_ENABLED(
+ CONFIG_MEMCG_RECHARGE_OFFLINE_ENABLED);
+
static struct workqueue_struct *memcg_recharge_wq;
/* Whether legacy memory+swap accounting is active */
@@ -6592,7 +6595,8 @@ static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg)
INIT_DELAYED_WORK(&memcg->recharge_mapped_work.dwork,
memcg_do_recharge_mapped_folios);
- if (memcg_recharge_wq && memcg_nr_local_mapped_pages(memcg)) {
+ if (sysctl_recharge_offline_memcgs &&
+ memcg_recharge_wq && memcg_nr_local_mapped_pages(memcg)) {
memcg->recharge_mapped_work.retries = 0;
queue_delayed_work(memcg_recharge_wq,
&memcg->recharge_mapped_work.dwork, 0);
@@ -6605,6 +6609,9 @@ static bool should_do_deferred_recharge(struct folio *folio)
struct mem_cgroup *memcg;
bool ret;
+ if (!sysctl_recharge_offline_memcgs)
+ return false;
+
rcu_read_lock();
memcg = folio_memcg_rcu(folio);
ret = memcg && !!(memcg->css.flags & CSS_DYING);
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread* [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
@ 2023-07-20 7:08 ` Yosry Ahmed
0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups, Yosry Ahmed
Add a sysctl to enable/disable memory recharging for offline memcgs. Add
a config option to control whether or not it is enabled by default.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
include/linux/memcontrol.h | 2 ++
kernel/sysctl.c | 11 +++++++++++
mm/Kconfig | 12 ++++++++++++
mm/memcontrol.c | 9 ++++++++-
4 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 59b653d4a76e..ae9f09ee90cb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
#ifdef CONFIG_MEMCG
+extern int sysctl_recharge_offline_memcgs;
+
#define MEM_CGROUP_ID_SHIFT 16
#define MEM_CGROUP_ID_MAX USHRT_MAX
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 354a2d294f52..1735d1d95652 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
.extra2 = (void *)&mmap_rnd_compat_bits_max,
},
#endif
+#ifdef CONFIG_MEMCG
+ {
+ .procname = "recharge_offline_memcgs",
+ .data = &sysctl_recharge_offline_memcgs,
+ .maxlen = sizeof(sysctl_recharge_offline_memcgs),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
+#endif /* CONFIG_MEMCG */
{ }
};
diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..9462c4b598d9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1236,6 +1236,18 @@ config LOCK_MM_AND_FIND_VMA
bool
depends on !STACK_GROWSUP
+config MEMCG_RECHARGE_OFFLINE_ENABLED
+ bool "Recharge memory charged to offline memcgs"
+ depends on MEMCG
+ help
+ When a memory cgroup is removed by userspace, try to recharge any
+ memory still charged to it to avoid having it live on as an offline
+ memcg. Offline memcgs potentially consume memory and limit scalability
+ of some operations.
+
+ This option enables the above behavior by default. It can be override
+ at runtime through /proc/sys/vm/recharge_offline_memcgs.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2fe9c6f1be80..25cdb17eaaa3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,6 +96,9 @@ static bool cgroup_memory_nobpf __ro_after_init;
static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
#endif
+int sysctl_recharge_offline_memcgs __read_mostly = IS_ENABLED(
+ CONFIG_MEMCG_RECHARGE_OFFLINE_ENABLED);
+
static struct workqueue_struct *memcg_recharge_wq;
/* Whether legacy memory+swap accounting is active */
@@ -6592,7 +6595,8 @@ static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg)
INIT_DELAYED_WORK(&memcg->recharge_mapped_work.dwork,
memcg_do_recharge_mapped_folios);
- if (memcg_recharge_wq && memcg_nr_local_mapped_pages(memcg)) {
+ if (sysctl_recharge_offline_memcgs &&
+ memcg_recharge_wq && memcg_nr_local_mapped_pages(memcg)) {
memcg->recharge_mapped_work.retries = 0;
queue_delayed_work(memcg_recharge_wq,
&memcg->recharge_mapped_work.dwork, 0);
@@ -6605,6 +6609,9 @@ static bool should_do_deferred_recharge(struct folio *folio)
struct mem_cgroup *memcg;
bool ret;
+ if (!sysctl_recharge_offline_memcgs)
+ return false;
+
rcu_read_lock();
memcg = folio_memcg_rcu(folio);
ret = memcg && !!(memcg->css.flags & CSS_DYING);
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread[parent not found: <20230720070825.992023-8-yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
2023-07-20 7:08 ` Yosry Ahmed
@ 2023-07-20 18:13 ` Luis Chamberlain
-1 siblings, 0 replies; 56+ messages in thread
From: Luis Chamberlain @ 2023-07-20 18:13 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Tejun Heo,
Zefan Li, Yu Zhao, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
On Thu, Jul 20, 2023 at 07:08:24AM +0000, Yosry Ahmed wrote:
> Add a sysctl to enable/disable memory recharging for offline memcgs. Add
> a config option to control whether or not it is enabled by default.
>
> Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> include/linux/memcontrol.h | 2 ++
> kernel/sysctl.c | 11 +++++++++++
> mm/Kconfig | 12 ++++++++++++
> mm/memcontrol.c | 9 ++++++++-
> 4 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 59b653d4a76e..ae9f09ee90cb 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
>
> #ifdef CONFIG_MEMCG
>
> +extern int sysctl_recharge_offline_memcgs;
> +
> #define MEM_CGROUP_ID_SHIFT 16
> #define MEM_CGROUP_ID_MAX USHRT_MAX
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 354a2d294f52..1735d1d95652 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
> .extra2 = (void *)&mmap_rnd_compat_bits_max,
> },
> #endif
> +#ifdef CONFIG_MEMCG
> + {
> + .procname = "recharge_offline_memcgs",
> + .data = &sysctl_recharge_offline_memcgs,
> + .maxlen = sizeof(sysctl_recharge_offline_memcgs),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
> + },
> +#endif /* CONFIG_MEMCG */
> { }
> };
Please don't add any more sysctls to kernel/sysctl.c, git log that file
for a series of cleanups which show how to use your own and why we have
been doing that cleanup.
Luis
^ permalink raw reply [flat|nested] 56+ messages in thread* Re: [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
@ 2023-07-20 18:13 ` Luis Chamberlain
0 siblings, 0 replies; 56+ messages in thread
From: Luis Chamberlain @ 2023-07-20 18:13 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Tejun Heo,
Zefan Li, Yu Zhao, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups
On Thu, Jul 20, 2023 at 07:08:24AM +0000, Yosry Ahmed wrote:
> Add a sysctl to enable/disable memory recharging for offline memcgs. Add
> a config option to control whether or not it is enabled by default.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
> include/linux/memcontrol.h | 2 ++
> kernel/sysctl.c | 11 +++++++++++
> mm/Kconfig | 12 ++++++++++++
> mm/memcontrol.c | 9 ++++++++-
> 4 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 59b653d4a76e..ae9f09ee90cb 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
>
> #ifdef CONFIG_MEMCG
>
> +extern int sysctl_recharge_offline_memcgs;
> +
> #define MEM_CGROUP_ID_SHIFT 16
> #define MEM_CGROUP_ID_MAX USHRT_MAX
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 354a2d294f52..1735d1d95652 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
> .extra2 = (void *)&mmap_rnd_compat_bits_max,
> },
> #endif
> +#ifdef CONFIG_MEMCG
> + {
> + .procname = "recharge_offline_memcgs",
> + .data = &sysctl_recharge_offline_memcgs,
> + .maxlen = sizeof(sysctl_recharge_offline_memcgs),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
> + },
> +#endif /* CONFIG_MEMCG */
> { }
> };
Please don't add any more sysctls to kernel/sysctl.c, git log that file
for a series of cleanups which show how to use your own and why we have
been doing that cleanup.
Luis
^ permalink raw reply [flat|nested] 56+ messages in thread[parent not found: <ZLl5XA25BIlYyngD-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>]
* Re: [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
2023-07-20 18:13 ` Luis Chamberlain
@ 2023-07-20 18:24 ` Yosry Ahmed
-1 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 18:24 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Tejun Heo,
Zefan Li, Yu Zhao, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
On Thu, Jul 20, 2023 at 11:13 AM Luis Chamberlain <mcgrof-DgEjT+Ai2yhQFI55V6+gNQ@public.gmane.orgg> wrote:
>
> On Thu, Jul 20, 2023 at 07:08:24AM +0000, Yosry Ahmed wrote:
> > Add a sysctl to enable/disable memory recharging for offline memcgs. Add
> > a config option to control whether or not it is enabled by default.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > ---
> > include/linux/memcontrol.h | 2 ++
> > kernel/sysctl.c | 11 +++++++++++
> > mm/Kconfig | 12 ++++++++++++
> > mm/memcontrol.c | 9 ++++++++-
> > 4 files changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 59b653d4a76e..ae9f09ee90cb 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
> >
> > #ifdef CONFIG_MEMCG
> >
> > +extern int sysctl_recharge_offline_memcgs;
> > +
> > #define MEM_CGROUP_ID_SHIFT 16
> > #define MEM_CGROUP_ID_MAX USHRT_MAX
> >
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 354a2d294f52..1735d1d95652 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
> > .extra2 = (void *)&mmap_rnd_compat_bits_max,
> > },
> > #endif
> > +#ifdef CONFIG_MEMCG
> > + {
> > + .procname = "recharge_offline_memcgs",
> > + .data = &sysctl_recharge_offline_memcgs,
> > + .maxlen = sizeof(sysctl_recharge_offline_memcgs),
> > + .mode = 0644,
> > + .proc_handler = proc_dointvec_minmax,
> > + .extra1 = SYSCTL_ZERO,
> > + .extra2 = SYSCTL_ONE,
> > + },
> > +#endif /* CONFIG_MEMCG */
> > { }
> > };
>
> Please don't add any more sysctls to kernel/sysctl.c, git log that file
> for a series of cleanups which show how to use your own and why we have
> been doing that cleanup.
Thanks for pointing this out, I definitely missed it. Will do that in
the next version. I guess this will also reduce the reviewer churn if
I won't be touching kernel/sysctl.c?
>
> Luis
^ permalink raw reply [flat|nested] 56+ messages in thread* Re: [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
@ 2023-07-20 18:24 ` Yosry Ahmed
0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 18:24 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Tejun Heo,
Zefan Li, Yu Zhao, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups
On Thu, Jul 20, 2023 at 11:13 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Thu, Jul 20, 2023 at 07:08:24AM +0000, Yosry Ahmed wrote:
> > Add a sysctl to enable/disable memory recharging for offline memcgs. Add
> > a config option to control whether or not it is enabled by default.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> > include/linux/memcontrol.h | 2 ++
> > kernel/sysctl.c | 11 +++++++++++
> > mm/Kconfig | 12 ++++++++++++
> > mm/memcontrol.c | 9 ++++++++-
> > 4 files changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 59b653d4a76e..ae9f09ee90cb 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
> >
> > #ifdef CONFIG_MEMCG
> >
> > +extern int sysctl_recharge_offline_memcgs;
> > +
> > #define MEM_CGROUP_ID_SHIFT 16
> > #define MEM_CGROUP_ID_MAX USHRT_MAX
> >
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 354a2d294f52..1735d1d95652 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
> > .extra2 = (void *)&mmap_rnd_compat_bits_max,
> > },
> > #endif
> > +#ifdef CONFIG_MEMCG
> > + {
> > + .procname = "recharge_offline_memcgs",
> > + .data = &sysctl_recharge_offline_memcgs,
> > + .maxlen = sizeof(sysctl_recharge_offline_memcgs),
> > + .mode = 0644,
> > + .proc_handler = proc_dointvec_minmax,
> > + .extra1 = SYSCTL_ZERO,
> > + .extra2 = SYSCTL_ONE,
> > + },
> > +#endif /* CONFIG_MEMCG */
> > { }
> > };
>
> Please don't add any more sysctls to kernel/sysctl.c, git log that file
> for a series of cleanups which show how to use your own and why we have
> been doing that cleanup.
Thanks for pointing this out, I definitely missed it. Will do that in
the next version. I guess this will also reduce the reviewer churn if
I won't be touching kernel/sysctl.c?
>
> Luis
^ permalink raw reply [flat|nested] 56+ messages in thread* Re: [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
2023-07-20 18:24 ` Yosry Ahmed
@ 2023-07-20 18:30 ` Luis Chamberlain
-1 siblings, 0 replies; 56+ messages in thread
From: Luis Chamberlain @ 2023-07-20 18:30 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Tejun Heo,
Zefan Li, Yu Zhao, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
On Thu, Jul 20, 2023 at 11:24:20AM -0700, Yosry Ahmed wrote:
> On Thu, Jul 20, 2023 at 11:13 AM Luis Chamberlain <mcgrof-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> >
> > On Thu, Jul 20, 2023 at 07:08:24AM +0000, Yosry Ahmed wrote:
> > > Add a sysctl to enable/disable memory recharging for offline memcgs. Add
> > > a config option to control whether or not it is enabled by default.
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > > ---
> > > include/linux/memcontrol.h | 2 ++
> > > kernel/sysctl.c | 11 +++++++++++
> > > mm/Kconfig | 12 ++++++++++++
> > > mm/memcontrol.c | 9 ++++++++-
> > > 4 files changed, 33 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 59b653d4a76e..ae9f09ee90cb 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
> > >
> > > #ifdef CONFIG_MEMCG
> > >
> > > +extern int sysctl_recharge_offline_memcgs;
> > > +
> > > #define MEM_CGROUP_ID_SHIFT 16
> > > #define MEM_CGROUP_ID_MAX USHRT_MAX
> > >
> > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > index 354a2d294f52..1735d1d95652 100644
> > > --- a/kernel/sysctl.c
> > > +++ b/kernel/sysctl.c
> > > @@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
> > > .extra2 = (void *)&mmap_rnd_compat_bits_max,
> > > },
> > > #endif
> > > +#ifdef CONFIG_MEMCG
> > > + {
> > > + .procname = "recharge_offline_memcgs",
> > > + .data = &sysctl_recharge_offline_memcgs,
> > > + .maxlen = sizeof(sysctl_recharge_offline_memcgs),
> > > + .mode = 0644,
> > > + .proc_handler = proc_dointvec_minmax,
> > > + .extra1 = SYSCTL_ZERO,
> > > + .extra2 = SYSCTL_ONE,
> > > + },
> > > +#endif /* CONFIG_MEMCG */
> > > { }
> > > };
> >
> > Please don't add any more sysctls to kernel/sysctl.c, git log that file
> > for a series of cleanups which show how to use your own and why we have
> > been doing that cleanup.
>
> Thanks for pointing this out, I definitely missed it. Will do that in
> the next version. I guess this will also reduce the reviewer churn if
> I won't be touching kernel/sysctl.c?
Right, it means I don't have to care anymore about random sysctl knobs.
Let people knob it all up.
Luis
^ permalink raw reply [flat|nested] 56+ messages in thread* Re: [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
@ 2023-07-20 18:30 ` Luis Chamberlain
0 siblings, 0 replies; 56+ messages in thread
From: Luis Chamberlain @ 2023-07-20 18:30 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Tejun Heo,
Zefan Li, Yu Zhao, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups
On Thu, Jul 20, 2023 at 11:24:20AM -0700, Yosry Ahmed wrote:
> On Thu, Jul 20, 2023 at 11:13 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Thu, Jul 20, 2023 at 07:08:24AM +0000, Yosry Ahmed wrote:
> > > Add a sysctl to enable/disable memory recharging for offline memcgs. Add
> > > a config option to control whether or not it is enabled by default.
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > ---
> > > include/linux/memcontrol.h | 2 ++
> > > kernel/sysctl.c | 11 +++++++++++
> > > mm/Kconfig | 12 ++++++++++++
> > > mm/memcontrol.c | 9 ++++++++-
> > > 4 files changed, 33 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 59b653d4a76e..ae9f09ee90cb 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
> > >
> > > #ifdef CONFIG_MEMCG
> > >
> > > +extern int sysctl_recharge_offline_memcgs;
> > > +
> > > #define MEM_CGROUP_ID_SHIFT 16
> > > #define MEM_CGROUP_ID_MAX USHRT_MAX
> > >
> > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > index 354a2d294f52..1735d1d95652 100644
> > > --- a/kernel/sysctl.c
> > > +++ b/kernel/sysctl.c
> > > @@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
> > > .extra2 = (void *)&mmap_rnd_compat_bits_max,
> > > },
> > > #endif
> > > +#ifdef CONFIG_MEMCG
> > > + {
> > > + .procname = "recharge_offline_memcgs",
> > > + .data = &sysctl_recharge_offline_memcgs,
> > > + .maxlen = sizeof(sysctl_recharge_offline_memcgs),
> > > + .mode = 0644,
> > > + .proc_handler = proc_dointvec_minmax,
> > > + .extra1 = SYSCTL_ZERO,
> > > + .extra2 = SYSCTL_ONE,
> > > + },
> > > +#endif /* CONFIG_MEMCG */
> > > { }
> > > };
> >
> > Please don't add any more sysctls to kernel/sysctl.c, git log that file
> > for a series of cleanups which show how to use your own and why we have
> > been doing that cleanup.
>
> Thanks for pointing this out, I definitely missed it. Will do that in
> the next version. I guess this will also reduce the reviewer churn if
> I won't be touching kernel/sysctl.c?
Right, it means I don't have to care anymore about random sysctl knobs.
Let people knob it all up.
Luis
^ permalink raw reply [flat|nested] 56+ messages in thread
* [RFC PATCH 8/8] selftests: cgroup: test_memcontrol: add a selftest for memcg recharging
2023-07-20 7:08 ` Yosry Ahmed
@ 2023-07-20 7:08 ` Yosry Ahmed
-1 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Yosry Ahmed
When a memcg is removed, any mapped pages charged to it are recharged to
the memcg of the process(es) mapping them. Any remaining pages are
recharged using deferred recharge on the next time they are accessed or
ditied. Add a selftest that exercises these paths for shmem and normal
files:
- A page is recharged on offlining if it is already mapped into the
address space of a process in a different memcg.
- A page is recharged after offlining when written to by a process in a
different memcg (if the write results in dirtying the page).
- A page is recharged after offlining when read by a process in a
different memcg.
- A page is recharged after offlining when mapped by a process in a
different memcg.
Signed-off-by: Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
tools/testing/selftests/cgroup/cgroup_util.c | 14 +
tools/testing/selftests/cgroup/cgroup_util.h | 1 +
.../selftests/cgroup/test_memcontrol.c | 310 ++++++++++++++++++
3 files changed, 325 insertions(+)
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index e8bbbdb77e0d..e853b2a4db77 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -519,6 +519,20 @@ int is_swap_enabled(void)
return cnt > 1;
}
+
+int is_memcg_recharging_enabled(void)
+{
+ char buf[10];
+ bool enabled;
+
+ if (read_text("/proc/sys/vm/recharge_offline_memcgs",
+ buf, sizeof(buf)) <= 0)
+ return -1;
+
+ enabled = strtol(buf, NULL, 10);
+ return enabled;
+}
+
int set_oom_adj_score(int pid, int score)
{
char path[PATH_MAX];
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index c92df4e5d395..10c0fa36bfd7 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -49,6 +49,7 @@ extern int get_temp_fd(void);
extern int alloc_pagecache(int fd, size_t size);
extern int alloc_anon(const char *cgroup, void *arg);
extern int is_swap_enabled(void);
+extern int is_memcg_recharging_enabled(void);
extern int set_oom_adj_score(int pid, int score);
extern int cg_wait_for_proc_count(const char *cgroup, int count);
extern int cg_killall(const char *cgroup);
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index c7c9572003a8..4e1ea93e0a54 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -17,6 +17,8 @@
#include <netdb.h>
#include <errno.h>
#include <sys/mman.h>
+#include <sys/mount.h>
+#include <sched.h>
#include "../kselftest.h"
#include "cgroup_util.h"
@@ -1287,6 +1289,313 @@ static int test_memcg_oom_group_score_events(const char *root)
return ret;
}
+/* Map 50M from the beginning of a file */
+static int map_fd_50M_noexit(const char *cgroup, void *arg)
+{
+ size_t size = MB(50);
+ int ppid = getppid();
+ int fd = (long)arg;
+ char *memory;
+
+ memory = mmap(NULL, size, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
+ if (memory == MAP_FAILED) {
+ fprintf(stderr, "error: mmap, errno %d\n", errno);
+ return -1;
+ }
+
+ while (getppid() == ppid)
+ sleep(1);
+
+ munmap(memory, size);
+ return 0;
+}
+
+/*
+ * Write 50M to the beginning of a file.
+ * The file is sync'ed first to make sure any dirty pages are laundered before
+ * we dirty them again.
+ */
+static int write_fd_50M(const char *cgroup, void *arg)
+{
+ size_t size = MB(50);
+ int fd = (long)arg;
+ char buf[PAGE_SIZE];
+ int i;
+
+ fsync(fd);
+ lseek(fd, 0, SEEK_SET);
+ for (i = 0; i < size; i += sizeof(buf))
+ write(fd, buf, sizeof(buf));
+
+ return 0;
+}
+
+/* See write_fd_50M() */
+static int write_fd_50M_noexit(const char *cgroup, void *arg)
+{
+ int ppid = getppid();
+
+ write_fd_50M(cgroup, arg);
+
+ while (getppid() == ppid)
+ sleep(1);
+
+ return 0;
+}
+
+/* Read 50M from the beginning of a file */
+static int read_fd_50M_noexit(const char *cgroup, void *arg)
+{
+ size_t size = MB(50);
+ int ppid = getppid();
+ int fd = (long)arg;
+ char buf[PAGE_SIZE];
+ int i;
+
+ lseek(fd, 0, SEEK_SET);
+ for (i = 0; i < size; i += sizeof(buf))
+ read(fd, buf, sizeof(buf));
+
+ while (getppid() == ppid)
+ sleep(1);
+
+ return 0;
+}
+
+#define TEST_RECHARGE_DIR "/test-recharge"
+
+static int __test_memcg_recharge(const char *root, char *stat_name)
+{
+ char *parent = NULL, *child1 = NULL, *child2 = NULL;
+ long stat, prev, pstat, current;
+ int ret = KSFT_FAIL;
+ char file_path[256];
+ int i, pid;
+ struct {
+ int fd;
+ int (*before_fn)(const char *cgroup, void *arg);
+ int (*after_fn)(const char *cgroup, void *arg);
+ } test_files[] = {
+ /* test recharge for already mapped file */
+ {
+ .before_fn = map_fd_50M_noexit,
+ },
+ /* test recharge on new mapping after offline */
+ {
+ .after_fn = map_fd_50M_noexit,
+ },
+ /* test recharge on write after offline */
+ {
+ .after_fn = write_fd_50M_noexit,
+ },
+ /* test recharge on read after offline */
+ {
+ .after_fn = read_fd_50M_noexit,
+ }
+ };
+
+ parent = cg_name(root, "parent");
+ if (!parent)
+ goto cleanup;
+
+ if (cg_create(parent))
+ goto cleanup;
+
+ if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+ goto cleanup;
+
+ child1 = cg_name(parent, "child1");
+ if (!child1)
+ goto cleanup;
+
+ if (cg_create(child1))
+ goto cleanup;
+
+ child2 = cg_name(parent, "child2");
+ if (!child2)
+ goto cleanup;
+
+ if (cg_create(child2))
+ goto cleanup;
+
+ for (i = 0; i < ARRAY_SIZE(test_files); i++) {
+ long target = MB(50) * (i+1); /* 50MB per file */
+ int fd;
+
+ snprintf(file_path, sizeof(file_path), "%s/file%d",
+ TEST_RECHARGE_DIR, i);
+
+ fd = open(file_path, O_CREAT | O_RDWR);
+ if (fd < 0)
+ goto cleanup;
+
+ test_files[i].fd = fd;
+ if (cg_run(child1, write_fd_50M, (void *)(long) fd))
+ goto cleanup;
+
+ stat = 0;
+ do {
+ sleep(1);
+ prev = stat;
+ stat = cg_read_key_long(child1, "memory.stat",
+ stat_name);
+ } while (stat < target && stat > prev);
+
+ if (stat < target) {
+ fprintf(stderr, "error: child1 %s %ld < %ld",
+ stat_name, stat, target);
+ goto cleanup;
+ }
+
+ current = cg_read_long(child1, "memory.current");
+ if (current < target) {
+ fprintf(stderr, "error: child1 current %ld < %ld",
+ current, target);
+ goto cleanup;
+ }
+
+ if (test_files[i].before_fn) {
+ pid = cg_run_nowait(child2, test_files[i].before_fn,
+ (void *)(long)fd);
+ if (pid < 0)
+ goto cleanup;
+ /* make sure before_fn() finishes executing before offlining */
+ sleep(1);
+ }
+ }
+
+ current = cg_read_long(child2, "memory.current");
+ if (current > MB(1)) {
+ fprintf(stderr, "error: child2 current %ld > 1M\n", current);
+ goto cleanup;
+ }
+
+ stat = cg_read_key_long(child2, "memory.stat", stat_name);
+ if (stat > 0) {
+ fprintf(stderr, "error: child2 %s %ld > 0\n",
+ stat_name, stat);
+ goto cleanup;
+ }
+
+ if (cg_destroy(child1) < 0)
+ goto cleanup;
+
+ for (i = 0; i < ARRAY_SIZE(test_files); i++) {
+ long target = MB(50) * (i+1);
+ int fd = test_files[i].fd;
+
+ if (test_files[i].after_fn) {
+ pid = cg_run_nowait(child2, test_files[i].after_fn,
+ (void *)(long)fd);
+ if (pid < 0)
+ goto cleanup;
+ }
+
+ stat = 0;
+ do {
+ sleep(1);
+ prev = stat;
+ stat = cg_read_key_long(child2, "memory.stat",
+ stat_name);
+ } while (stat < target && stat > prev);
+
+ if (stat < target) {
+ fprintf(stderr, "error: child2 %s %ld < %ld\n",
+ stat_name, stat, target);
+ goto cleanup;
+ }
+
+ current = cg_read_long(child2, "memory.current");
+ if (current < target) {
+ fprintf(stderr, "error: child2 current %ld < %ld\n",
+ current, target);
+ goto cleanup;
+ }
+ }
+
+ pstat = cg_read_key_long(parent, "memory.stat", stat_name);
+ if (stat < pstat) {
+ fprintf(stderr, "error: recharged %s (%ld) < total (%ld)\n",
+ stat_name, stat, pstat);
+ goto cleanup;
+ }
+
+ ret = KSFT_PASS;
+cleanup:
+ if (child2) {
+ cg_destroy(child2);
+ free(child2);
+ }
+ if (child1) {
+ cg_destroy(child1);
+ free(child1);
+ }
+ if (parent) {
+ cg_destroy(parent);
+ free(parent);
+ }
+ for (i = 0; i < ARRAY_SIZE(test_files); i++) {
+ close(test_files[i].fd);
+ snprintf(file_path, sizeof(file_path), "%s/file%d",
+ TEST_RECHARGE_DIR, i);
+ remove(file_path);
+ }
+ return ret;
+}
+
+static int test_memcg_recharge(const char *root)
+{
+ int i, ret = KSFT_PASS;
+ struct {
+ char *mount_type, *stat_name;
+ } test_setups[] = {
+ /* test both shmem & normal files */
+ {
+ .mount_type = "tmpfs",
+ .stat_name = "shmem",
+ },
+ {
+ .stat_name = "file",
+ }
+ };
+
+ if (!is_memcg_recharging_enabled())
+ return KSFT_SKIP;
+
+ if (unshare(CLONE_NEWNS) < 0)
+ return KSFT_FAIL;
+
+ if (mount(NULL, "/", "", MS_REC | MS_PRIVATE, NULL) < 0)
+ return KSFT_FAIL;
+
+ for (i = 0; i < ARRAY_SIZE(test_setups); i++) {
+ int setup_ret = KSFT_FAIL;
+ char *mount_type = test_setups[i].mount_type;
+ char *stat_name = test_setups[i].stat_name;
+
+ if (mkdir(TEST_RECHARGE_DIR, 0777) < 0)
+ goto next;
+
+ if (mount_type &&
+ mount(NULL, TEST_RECHARGE_DIR, mount_type, 0, NULL) < 0)
+ goto next;
+
+ setup_ret = __test_memcg_recharge(root, stat_name);
+
+next:
+ if (mount_type)
+ umount(TEST_RECHARGE_DIR);
+ remove(TEST_RECHARGE_DIR);
+
+ if (setup_ret == KSFT_FAIL) {
+ ret = KSFT_FAIL;
+ break;
+ }
+ }
+ umount("/");
+ return ret;
+}
+
#define T(x) { x, #x }
struct memcg_test {
int (*fn)(const char *root);
@@ -1306,6 +1615,7 @@ struct memcg_test {
T(test_memcg_oom_group_leaf_events),
T(test_memcg_oom_group_parent_events),
T(test_memcg_oom_group_score_events),
+ T(test_memcg_recharge),
};
#undef T
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread* [RFC PATCH 8/8] selftests: cgroup: test_memcontrol: add a selftest for memcg recharging
@ 2023-07-20 7:08 ` Yosry Ahmed
0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 7:08 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt
Cc: Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups, Yosry Ahmed
When a memcg is removed, any mapped pages charged to it are recharged to
the memcg of the process(es) mapping them. Any remaining pages are
recharged using deferred recharge on the next time they are accessed or
ditied. Add a selftest that exercises these paths for shmem and normal
files:
- A page is recharged on offlining if it is already mapped into the
address space of a process in a different memcg.
- A page is recharged after offlining when written to by a process in a
different memcg (if the write results in dirtying the page).
- A page is recharged after offlining when read by a process in a
different memcg.
- A page is recharged after offlining when mapped by a process in a
different memcg.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
tools/testing/selftests/cgroup/cgroup_util.c | 14 +
tools/testing/selftests/cgroup/cgroup_util.h | 1 +
.../selftests/cgroup/test_memcontrol.c | 310 ++++++++++++++++++
3 files changed, 325 insertions(+)
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index e8bbbdb77e0d..e853b2a4db77 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -519,6 +519,20 @@ int is_swap_enabled(void)
return cnt > 1;
}
+
+int is_memcg_recharging_enabled(void)
+{
+ char buf[10];
+ bool enabled;
+
+ if (read_text("/proc/sys/vm/recharge_offline_memcgs",
+ buf, sizeof(buf)) <= 0)
+ return -1;
+
+ enabled = strtol(buf, NULL, 10);
+ return enabled;
+}
+
int set_oom_adj_score(int pid, int score)
{
char path[PATH_MAX];
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index c92df4e5d395..10c0fa36bfd7 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -49,6 +49,7 @@ extern int get_temp_fd(void);
extern int alloc_pagecache(int fd, size_t size);
extern int alloc_anon(const char *cgroup, void *arg);
extern int is_swap_enabled(void);
+extern int is_memcg_recharging_enabled(void);
extern int set_oom_adj_score(int pid, int score);
extern int cg_wait_for_proc_count(const char *cgroup, int count);
extern int cg_killall(const char *cgroup);
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index c7c9572003a8..4e1ea93e0a54 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -17,6 +17,8 @@
#include <netdb.h>
#include <errno.h>
#include <sys/mman.h>
+#include <sys/mount.h>
+#include <sched.h>
#include "../kselftest.h"
#include "cgroup_util.h"
@@ -1287,6 +1289,313 @@ static int test_memcg_oom_group_score_events(const char *root)
return ret;
}
+/* Map 50M from the beginning of a file */
+static int map_fd_50M_noexit(const char *cgroup, void *arg)
+{
+ size_t size = MB(50);
+ int ppid = getppid();
+ int fd = (long)arg;
+ char *memory;
+
+ memory = mmap(NULL, size, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
+ if (memory == MAP_FAILED) {
+ fprintf(stderr, "error: mmap, errno %d\n", errno);
+ return -1;
+ }
+
+ while (getppid() == ppid)
+ sleep(1);
+
+ munmap(memory, size);
+ return 0;
+}
+
+/*
+ * Write 50M to the beginning of a file.
+ * The file is sync'ed first to make sure any dirty pages are laundered before
+ * we dirty them again.
+ */
+static int write_fd_50M(const char *cgroup, void *arg)
+{
+ size_t size = MB(50);
+ int fd = (long)arg;
+ char buf[PAGE_SIZE];
+ int i;
+
+ fsync(fd);
+ lseek(fd, 0, SEEK_SET);
+ for (i = 0; i < size; i += sizeof(buf))
+ write(fd, buf, sizeof(buf));
+
+ return 0;
+}
+
+/* See write_fd_50M() */
+static int write_fd_50M_noexit(const char *cgroup, void *arg)
+{
+ int ppid = getppid();
+
+ write_fd_50M(cgroup, arg);
+
+ while (getppid() == ppid)
+ sleep(1);
+
+ return 0;
+}
+
+/* Read 50M from the beginning of a file */
+static int read_fd_50M_noexit(const char *cgroup, void *arg)
+{
+ size_t size = MB(50);
+ int ppid = getppid();
+ int fd = (long)arg;
+ char buf[PAGE_SIZE];
+ int i;
+
+ lseek(fd, 0, SEEK_SET);
+ for (i = 0; i < size; i += sizeof(buf))
+ read(fd, buf, sizeof(buf));
+
+ while (getppid() == ppid)
+ sleep(1);
+
+ return 0;
+}
+
+#define TEST_RECHARGE_DIR "/test-recharge"
+
+static int __test_memcg_recharge(const char *root, char *stat_name)
+{
+ char *parent = NULL, *child1 = NULL, *child2 = NULL;
+ long stat, prev, pstat, current;
+ int ret = KSFT_FAIL;
+ char file_path[256];
+ int i, pid;
+ struct {
+ int fd;
+ int (*before_fn)(const char *cgroup, void *arg);
+ int (*after_fn)(const char *cgroup, void *arg);
+ } test_files[] = {
+ /* test recharge for already mapped file */
+ {
+ .before_fn = map_fd_50M_noexit,
+ },
+ /* test recharge on new mapping after offline */
+ {
+ .after_fn = map_fd_50M_noexit,
+ },
+ /* test recharge on write after offline */
+ {
+ .after_fn = write_fd_50M_noexit,
+ },
+ /* test recharge on read after offline */
+ {
+ .after_fn = read_fd_50M_noexit,
+ }
+ };
+
+ parent = cg_name(root, "parent");
+ if (!parent)
+ goto cleanup;
+
+ if (cg_create(parent))
+ goto cleanup;
+
+ if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+ goto cleanup;
+
+ child1 = cg_name(parent, "child1");
+ if (!child1)
+ goto cleanup;
+
+ if (cg_create(child1))
+ goto cleanup;
+
+ child2 = cg_name(parent, "child2");
+ if (!child2)
+ goto cleanup;
+
+ if (cg_create(child2))
+ goto cleanup;
+
+ for (i = 0; i < ARRAY_SIZE(test_files); i++) {
+ long target = MB(50) * (i+1); /* 50MB per file */
+ int fd;
+
+ snprintf(file_path, sizeof(file_path), "%s/file%d",
+ TEST_RECHARGE_DIR, i);
+
+ fd = open(file_path, O_CREAT | O_RDWR);
+ if (fd < 0)
+ goto cleanup;
+
+ test_files[i].fd = fd;
+ if (cg_run(child1, write_fd_50M, (void *)(long) fd))
+ goto cleanup;
+
+ stat = 0;
+ do {
+ sleep(1);
+ prev = stat;
+ stat = cg_read_key_long(child1, "memory.stat",
+ stat_name);
+ } while (stat < target && stat > prev);
+
+ if (stat < target) {
+ fprintf(stderr, "error: child1 %s %ld < %ld",
+ stat_name, stat, target);
+ goto cleanup;
+ }
+
+ current = cg_read_long(child1, "memory.current");
+ if (current < target) {
+ fprintf(stderr, "error: child1 current %ld < %ld",
+ current, target);
+ goto cleanup;
+ }
+
+ if (test_files[i].before_fn) {
+ pid = cg_run_nowait(child2, test_files[i].before_fn,
+ (void *)(long)fd);
+ if (pid < 0)
+ goto cleanup;
+ /* make sure before_fn() finishes executing before offlining */
+ sleep(1);
+ }
+ }
+
+ current = cg_read_long(child2, "memory.current");
+ if (current > MB(1)) {
+ fprintf(stderr, "error: child2 current %ld > 1M\n", current);
+ goto cleanup;
+ }
+
+ stat = cg_read_key_long(child2, "memory.stat", stat_name);
+ if (stat > 0) {
+ fprintf(stderr, "error: child2 %s %ld > 0\n",
+ stat_name, stat);
+ goto cleanup;
+ }
+
+ if (cg_destroy(child1) < 0)
+ goto cleanup;
+
+ for (i = 0; i < ARRAY_SIZE(test_files); i++) {
+ long target = MB(50) * (i+1);
+ int fd = test_files[i].fd;
+
+ if (test_files[i].after_fn) {
+ pid = cg_run_nowait(child2, test_files[i].after_fn,
+ (void *)(long)fd);
+ if (pid < 0)
+ goto cleanup;
+ }
+
+ stat = 0;
+ do {
+ sleep(1);
+ prev = stat;
+ stat = cg_read_key_long(child2, "memory.stat",
+ stat_name);
+ } while (stat < target && stat > prev);
+
+ if (stat < target) {
+ fprintf(stderr, "error: child2 %s %ld < %ld\n",
+ stat_name, stat, target);
+ goto cleanup;
+ }
+
+ current = cg_read_long(child2, "memory.current");
+ if (current < target) {
+ fprintf(stderr, "error: child2 current %ld < %ld\n",
+ current, target);
+ goto cleanup;
+ }
+ }
+
+ pstat = cg_read_key_long(parent, "memory.stat", stat_name);
+ if (stat < pstat) {
+ fprintf(stderr, "error: recharged %s (%ld) < total (%ld)\n",
+ stat_name, stat, pstat);
+ goto cleanup;
+ }
+
+ ret = KSFT_PASS;
+cleanup:
+ if (child2) {
+ cg_destroy(child2);
+ free(child2);
+ }
+ if (child1) {
+ cg_destroy(child1);
+ free(child1);
+ }
+ if (parent) {
+ cg_destroy(parent);
+ free(parent);
+ }
+ for (i = 0; i < ARRAY_SIZE(test_files); i++) {
+ close(test_files[i].fd);
+ snprintf(file_path, sizeof(file_path), "%s/file%d",
+ TEST_RECHARGE_DIR, i);
+ remove(file_path);
+ }
+ return ret;
+}
+
+static int test_memcg_recharge(const char *root)
+{
+ int i, ret = KSFT_PASS;
+ struct {
+ char *mount_type, *stat_name;
+ } test_setups[] = {
+ /* test both shmem & normal files */
+ {
+ .mount_type = "tmpfs",
+ .stat_name = "shmem",
+ },
+ {
+ .stat_name = "file",
+ }
+ };
+
+ if (!is_memcg_recharging_enabled())
+ return KSFT_SKIP;
+
+ if (unshare(CLONE_NEWNS) < 0)
+ return KSFT_FAIL;
+
+ if (mount(NULL, "/", "", MS_REC | MS_PRIVATE, NULL) < 0)
+ return KSFT_FAIL;
+
+ for (i = 0; i < ARRAY_SIZE(test_setups); i++) {
+ int setup_ret = KSFT_FAIL;
+ char *mount_type = test_setups[i].mount_type;
+ char *stat_name = test_setups[i].stat_name;
+
+ if (mkdir(TEST_RECHARGE_DIR, 0777) < 0)
+ goto next;
+
+ if (mount_type &&
+ mount(NULL, TEST_RECHARGE_DIR, mount_type, 0, NULL) < 0)
+ goto next;
+
+ setup_ret = __test_memcg_recharge(root, stat_name);
+
+next:
+ if (mount_type)
+ umount(TEST_RECHARGE_DIR);
+ remove(TEST_RECHARGE_DIR);
+
+ if (setup_ret == KSFT_FAIL) {
+ ret = KSFT_FAIL;
+ break;
+ }
+ }
+ umount("/");
+ return ret;
+}
+
#define T(x) { x, #x }
struct memcg_test {
int (*fn)(const char *root);
@@ -1306,6 +1615,7 @@ struct memcg_test {
T(test_memcg_oom_group_leaf_events),
T(test_memcg_oom_group_parent_events),
T(test_memcg_oom_group_score_events),
+ T(test_memcg_recharge),
};
#undef T
--
2.41.0.255.g8b1d071c50-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
2023-07-20 7:08 ` Yosry Ahmed
@ 2023-07-20 15:35 ` Johannes Weiner
-1 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2023-07-20 15:35 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> This patch series implements the proposal in LSF/MM/BPF 2023 conference
> for reducing offline/zombie memcgs by memory recharging [1]. The main
> difference is that this series focuses on recharging and does not
> include eviction of any memory charged to offline memcgs.
>
> Two methods of recharging are proposed:
>
> (a) Recharging of mapped folios.
>
> When a memcg is offlined, queue an asynchronous worker that will walk
> the lruvec of the offline memcg and try to recharge any mapped folios to
> the memcg of one of the processes mapping the folio. The main assumption
> is that a process mapping the folio is the "rightful" owner of the
> memory.
>
> Currently, this is only supported for evictable folios, as the
> unevictable lru is imaginary and we cannot iterate the folios on it. A
> separate proposal [2] was made to revive the unevictable lru, which
> would allow recharging of unevictable folios.
>
> (b) Deferred recharging of folios.
>
> For folios that are unmapped, or mapped but we fail to recharge them
> with (a), we rely on deferred recharging. Simply put, any time a folio
> is accessed or dirtied by a userspace process, and that folio is charged
> to an offline memcg, we will try to recharge it to the memcg of the
> process accessing the folio. Again, we assume this process should be the
> "rightful" owner of the memory. This is also done asynchronously to avoid
> slowing down the data access path.
I'm super skeptical of this proposal.
Recharging *might* be the most desirable semantics from a user pov,
but only if it applies consistently to the whole memory footprint.
There is no mention of slab allocations such as inodes, dentries,
network buffers etc. which can be a significant part of a cgroup's
footprint. These are currently reparented. I don't think doing one
thing with half of the memory, and a totally different thing with the
other half upon cgroup deletion is going to be acceptable semantics.
It appears this also brings back the reliability issue that caused us
to deprecate charge moving. The recharge path has trylocks, LRU
isolation attempts, GFP_ATOMIC allocations. These introduce a variable
error rate into the relocation process, which causes pages that should
belong to the same domain to be scattered around all over the place.
It also means that zombie pinning still exists, but it's now even more
influenced by timing and race conditions, and so less predictable.
There are two issues being conflated here:
a) the problem of zombie cgroups, and
b) who controls resources that outlive the control domain.
For a), reparenting is still the most reasonable proposal. It's
reliable for one, but it also fixes the problem fully within the
established, user-facing semantics: resources that belong to a cgroup
also hierarchically belong to all ancestral groups; if those resources
outlive the last-level control domain, they continue to belong to the
parents. This is how it works today, and this is how it continues to
work with reparenting. The only difference is that those resources no
longer pin a dead cgroup anymore, but instead are physically linked to
the next online ancestor. Since dead cgroups have no effective control
parameters anymore, this is semantically equivalent - it's just a more
memory efficient implementation of the same exact thing.
b) is a discussion totally separate from this. We can argue what we
want this behavior to be, but I'd argue strongly that whatever we do
here should apply to all resources managed by the controller equally.
It could also be argued that if you don't want to lose control over a
set of resources, then maybe don't delete their control domain while
they are still alive and in use. For example, when restarting a
workload, and the new instance is expected to have largely the same
workingset, consider reusing the cgroup instead of making a new one.
For the zombie problem, I think we should merge Muchun's patches
ASAP. They've been proposed several times, they have Roman's reviews
and acks, and they do not change user-facing semantics. There is no
good reason not to merge them.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
@ 2023-07-20 15:35 ` Johannes Weiner
0 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2023-07-20 15:35 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups
On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> This patch series implements the proposal in LSF/MM/BPF 2023 conference
> for reducing offline/zombie memcgs by memory recharging [1]. The main
> difference is that this series focuses on recharging and does not
> include eviction of any memory charged to offline memcgs.
>
> Two methods of recharging are proposed:
>
> (a) Recharging of mapped folios.
>
> When a memcg is offlined, queue an asynchronous worker that will walk
> the lruvec of the offline memcg and try to recharge any mapped folios to
> the memcg of one of the processes mapping the folio. The main assumption
> is that a process mapping the folio is the "rightful" owner of the
> memory.
>
> Currently, this is only supported for evictable folios, as the
> unevictable lru is imaginary and we cannot iterate the folios on it. A
> separate proposal [2] was made to revive the unevictable lru, which
> would allow recharging of unevictable folios.
>
> (b) Deferred recharging of folios.
>
> For folios that are unmapped, or mapped but we fail to recharge them
> with (a), we rely on deferred recharging. Simply put, any time a folio
> is accessed or dirtied by a userspace process, and that folio is charged
> to an offline memcg, we will try to recharge it to the memcg of the
> process accessing the folio. Again, we assume this process should be the
> "rightful" owner of the memory. This is also done asynchronously to avoid
> slowing down the data access path.
I'm super skeptical of this proposal.
Recharging *might* be the most desirable semantics from a user pov,
but only if it applies consistently to the whole memory footprint.
There is no mention of slab allocations such as inodes, dentries,
network buffers etc. which can be a significant part of a cgroup's
footprint. These are currently reparented. I don't think doing one
thing with half of the memory, and a totally different thing with the
other half upon cgroup deletion is going to be acceptable semantics.
It appears this also brings back the reliability issue that caused us
to deprecate charge moving. The recharge path has trylocks, LRU
isolation attempts, GFP_ATOMIC allocations. These introduce a variable
error rate into the relocation process, which causes pages that should
belong to the same domain to be scattered around all over the place.
It also means that zombie pinning still exists, but it's now even more
influenced by timing and race conditions, and so less predictable.
There are two issues being conflated here:
a) the problem of zombie cgroups, and
b) who controls resources that outlive the control domain.
For a), reparenting is still the most reasonable proposal. It's
reliable for one, but it also fixes the problem fully within the
established, user-facing semantics: resources that belong to a cgroup
also hierarchically belong to all ancestral groups; if those resources
outlive the last-level control domain, they continue to belong to the
parents. This is how it works today, and this is how it continues to
work with reparenting. The only difference is that those resources no
longer pin a dead cgroup anymore, but instead are physically linked to
the next online ancestor. Since dead cgroups have no effective control
parameters anymore, this is semantically equivalent - it's just a more
memory efficient implementation of the same exact thing.
b) is a discussion totally separate from this. We can argue what we
want this behavior to be, but I'd argue strongly that whatever we do
here should apply to all resources managed by the controller equally.
It could also be argued that if you don't want to lose control over a
set of resources, then maybe don't delete their control domain while
they are still alive and in use. For example, when restarting a
workload, and the new instance is expected to have largely the same
workingset, consider reusing the cgroup instead of making a new one.
For the zombie problem, I think we should merge Muchun's patches
ASAP. They've been proposed several times, they have Roman's reviews
and acks, and they do not change user-facing semantics. There is no
good reason not to merge them.
^ permalink raw reply [flat|nested] 56+ messages in thread
[parent not found: <20230720153515.GA1003248-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
2023-07-20 15:35 ` Johannes Weiner
@ 2023-07-20 19:57 ` Tejun Heo
-1 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2023-07-20 19:57 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yosry Ahmed, Andrew Morton, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
On Thu, Jul 20, 2023 at 11:35:15AM -0400, Johannes Weiner wrote:
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use. For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.
Or just create a nesting layer so that there's a cgroup which represents the
persistent resources and a nested cgroup instance inside representing the
current instance.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
@ 2023-07-20 19:57 ` Tejun Heo
0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2023-07-20 19:57 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yosry Ahmed, Andrew Morton, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups
On Thu, Jul 20, 2023 at 11:35:15AM -0400, Johannes Weiner wrote:
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use. For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.
Or just create a nesting layer so that there's a cgroup which represents the
persistent resources and a nested cgroup instance inside representing the
current instance.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 56+ messages in thread
[parent not found: <ZLmRlTej8Tm82kXG-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>]
* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
2023-07-20 19:57 ` Tejun Heo
@ 2023-07-20 21:34 ` Yosry Ahmed
-1 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 21:34 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
On Thu, Jul 20, 2023 at 12:57 PM Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> On Thu, Jul 20, 2023 at 11:35:15AM -0400, Johannes Weiner wrote:
> > It could also be argued that if you don't want to lose control over a
> > set of resources, then maybe don't delete their control domain while
> > they are still alive and in use. For example, when restarting a
> > workload, and the new instance is expected to have largely the same
> > workingset, consider reusing the cgroup instead of making a new one.
>
> Or just create a nesting layer so that there's a cgroup which represents the
> persistent resources and a nested cgroup instance inside representing the
> current instance.
In practice it is not easy to know exactly which resources are shared
and used by which cgroups, especially in a large dynamic environment.
>
> Thanks.
>
> --
> tejun
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
@ 2023-07-20 21:34 ` Yosry Ahmed
0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 21:34 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups
On Thu, Jul 20, 2023 at 12:57 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Thu, Jul 20, 2023 at 11:35:15AM -0400, Johannes Weiner wrote:
> > It could also be argued that if you don't want to lose control over a
> > set of resources, then maybe don't delete their control domain while
> > they are still alive and in use. For example, when restarting a
> > workload, and the new instance is expected to have largely the same
> > workingset, consider reusing the cgroup instead of making a new one.
>
> Or just create a nesting layer so that there's a cgroup which represents the
> persistent resources and a nested cgroup instance inside representing the
> current instance.
In practice it is not easy to know exactly which resources are shared
and used by which cgroups, especially in a large dynamic environment.
>
> Thanks.
>
> --
> tejun
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
2023-07-20 21:34 ` Yosry Ahmed
(?)
@ 2023-07-20 22:11 ` Tejun Heo
[not found] ` <ZLmxLUNdxMi5s2Kq-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
-1 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2023-07-20 22:11 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups
Hello,
On Thu, Jul 20, 2023 at 02:34:16PM -0700, Yosry Ahmed wrote:
> > Or just create a nesting layer so that there's a cgroup which represents the
> > persistent resources and a nested cgroup instance inside representing the
> > current instance.
>
> In practice it is not easy to know exactly which resources are shared
> and used by which cgroups, especially in a large dynamic environment.
Yeah, that only covers when resource persistence is confined in a known
scope. That said, I have a hard time seeing how recharding once after cgroup
destruction can be a solution for the situations you describe. What if A
touches it once first, B constantly uses it but C only very occasionally and
after A dies C ends up owning it due to timing. This is very much possible
in a large dynamic environment but neither the initial or final situation is
satisfactory.
To solve the problems you're describing, you actually would have to
guarantee that memory pages are charged to the current majority user (or
maybe even spread across current active users). Maybe it can be argued that
this is a step towards that but it's a very partial step and at least would
need a technically viable direction that this development can follow.
On its own, AFAICS, I'm not sure the scope of problems it can actually solve
is justifiably greater than what can be achieved with simple nesting.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
2023-07-20 15:35 ` Johannes Weiner
@ 2023-07-20 21:33 ` Yosry Ahmed
-1 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 21:33 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
On Thu, Jul 20, 2023 at 8:35 AM Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
>
> On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> > This patch series implements the proposal in LSF/MM/BPF 2023 conference
> > for reducing offline/zombie memcgs by memory recharging [1]. The main
> > difference is that this series focuses on recharging and does not
> > include eviction of any memory charged to offline memcgs.
> >
> > Two methods of recharging are proposed:
> >
> > (a) Recharging of mapped folios.
> >
> > When a memcg is offlined, queue an asynchronous worker that will walk
> > the lruvec of the offline memcg and try to recharge any mapped folios to
> > the memcg of one of the processes mapping the folio. The main assumption
> > is that a process mapping the folio is the "rightful" owner of the
> > memory.
> >
> > Currently, this is only supported for evictable folios, as the
> > unevictable lru is imaginary and we cannot iterate the folios on it. A
> > separate proposal [2] was made to revive the unevictable lru, which
> > would allow recharging of unevictable folios.
> >
> > (b) Deferred recharging of folios.
> >
> > For folios that are unmapped, or mapped but we fail to recharge them
> > with (a), we rely on deferred recharging. Simply put, any time a folio
> > is accessed or dirtied by a userspace process, and that folio is charged
> > to an offline memcg, we will try to recharge it to the memcg of the
> > process accessing the folio. Again, we assume this process should be the
> > "rightful" owner of the memory. This is also done asynchronously to avoid
> > slowing down the data access path.
>
> I'm super skeptical of this proposal.
I expected this :)
>
> Recharging *might* be the most desirable semantics from a user pov,
> but only if it applies consistently to the whole memory footprint.
> There is no mention of slab allocations such as inodes, dentries,
> network buffers etc. which can be a significant part of a cgroup's
> footprint. These are currently reparented. I don't think doing one
> thing with half of the memory, and a totally different thing with the
> other half upon cgroup deletion is going to be acceptable semantics.
I think, as you say, recharging has the most desirable semantics
because the charge is maintained where it *should* be (with who is
actually using it). We simply cannot do that for kernel memory,
because we have no way of attributing it to a user. On the other hand,
we *can* attribute user memory to a user. Consistency is great, but
our inability to do (arguably) the right thing for one type of memory,
doesn't mean we shouldn't do it when we can. I would also argue that
user memory (anon/file pages) would commonly be the larger portion of
memory on a machine compared to kernel memory (e.g. slab).
>
> It appears this also brings back the reliability issue that caused us
> to deprecate charge moving. The recharge path has trylocks, LRU
> isolation attempts, GFP_ATOMIC allocations. These introduce a variable
> error rate into the relocation process,
Recharging is naturally best effort, because it's non-disruptive.
After a memcg dies, the kernel continuously tries to move the charges
away from it on every chance it gets. If it fails one time that's
fine, there will be other chances. Compared to the status quo, it is
definitely better than just leaving all the memory behind with the
zombie memcg. I would argue that over time (and accesses), most/all
memory should eventually get recharged. If not, something is not
working correctly, or a wrong assumption is being made.
> which causes pages that should
> belong to the same domain to be scattered around all over the place.
I strongly disagree with this point. Ideally, yes, memory charged to a
memcg would belong to the same domain. In practice, due to the first
touch charging semantics, this is far from the truth. For anonymous
memory, sure, they all belong to the same domain (mostly), the process
they belong to. But most of anonymous memory will go away when the
process dies anyway, the problem is mostly with shared resources (e.g.
file, tmpfs, ..). With file/tmpfs memory, the charging behavior is
random. The first memcg that touches a page gets charged for it.
Consequently, the file/tmpfs memory charged to a memcg would be a
mixture of pages from different files in different mounts, definitely
not a single domain. Perhaps with some workloads, where each memcg is
accessing different files, most memory charged to a memcg will belong
to the same domain, but in this case, recharging wouldn't move it away
anyway.
> It also means that zombie pinning still exists, but it's now even more
> influenced by timing and race conditions, and so less predictable.
It still exists, but it is improved. The kernel tries to move charges
away from zombies on every chance it gets instead of doing nothing
about it. It is less predictable, can't argue about this, but it can't
get worse, only better.
>
> There are two issues being conflated here:
>
> a) the problem of zombie cgroups, and
>
> b) who controls resources that outlive the control domain.
>
> For a), reparenting is still the most reasonable proposal. It's
> reliable for one, but it also fixes the problem fully within the
> established, user-facing semantics: resources that belong to a cgroup
> also hierarchically belong to all ancestral groups; if those resources
> outlive the last-level control domain, they continue to belong to the
> parents. This is how it works today, and this is how it continues to
> work with reparenting. The only difference is that those resources no
> longer pin a dead cgroup anymore, but instead are physically linked to
> the next online ancestor. Since dead cgroups have no effective control
> parameters anymore, this is semantically equivalent - it's just a more
> memory efficient implementation of the same exact thing.
I agree that reparenting is more deterministic and reliable, but there
are two major flaws off the top of my head:
(1) If a memcg touches a page one time and gets charged for it, the
charge is stuck in its hierarchy forever. It can get reparented, but
it will never be charged to whoever is actually using it again, unless
it is reclaimed and refaulted (in some cases).
Consider this hierarchy:
root
/ \
A B
\
C
Consider a case where memcg C touches a library file once, and gets
charged for some memory, and then dies. The memory gets reparente to
memcg B. Meanwhile, memcg A is continuously using the memory that
memcg B is charged for. memcg B would be indefinitely taxed by memcg
A. The only way out is if memcg B hit its limit, and the pages get
reclaimed, and then refaulted and recharged to memcg A. In some cases
(e.g. tmpfs), even then the memory would still get charged to memcg B.
There is no way to get rid of the charge until the resource itself is
freed.
This problem exists today, even without reparenting, with the
difference being that the charge will remain with C instead of B.
Recharging offers a better alternative where the charge will be
correctly moved to A, the "rightful" owner.
(2) In the above scenario, when memcg B dies, the memory will be
reparented to the root. That's even worse. Now memcg A is using memory
that is not accounted for anywhere, essentially an accounting leak.
From an admin perspective, the memory charged to root is system
overhead, it is lost capacity. For long-living systems, as memcgs are
created and destroyed for different workloads, memory will keep
accumulating at the root. The machine will keep leaking capacity over
time, and accounting becomes less and less accurate as more memory
becomes charged to the root.
>
> b) is a discussion totally separate from this.
I would argue that the zombie problem is (at least partially) an
artifact of the shared/sticky resources problem. If all resources are
used by one memcg and do not outlive it, we wouldn't have zombies.
> We can argue what we
> want this behavior to be, but I'd argue strongly that whatever we do
> here should apply to all resources managed by the controller equally.
User memory and kernel memory are very different in nature. Ideally
yeah, we want to treat all resources equally. But user memory is
naturally more attributable to users and easier to account correctly
than kernel memory.
>
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use.
This is easier said than done :) As I mentioned earlier, the charging
semantics are inherently indeterministic for shared resources (e.g.
file/tmpfs). The user cannot control or monitor which resources belong
to which control domain. Each memcg in the system could be charged for
one page from each file in a shared library for all that matters :)
> For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.
In a large fleet with many different jobs getting rescheduled and
restarted on different machines, it's really hard in practice to do
so. We can keep the same cgroup if the same workload is being
restarted on the same machine, sure, but most of the time there's a
new workload arriving or so. We can't reuse containers in this case.
>
> For the zombie problem, I think we should merge Muchun's patches
> ASAP. They've been proposed several times, they have Roman's reviews
> and acks, and they do not change user-facing semantics. There is no
> good reason not to merge them.
There are some, which I pointed out above.
All in all, I understand where you are coming from. Your concerns are
valid. Recharging is not a perfect approach, but it is arguably the
best we can do at this point. Being indeterministic sucks, but our
charging semantics are inherently indeterministic anyway.
^ permalink raw reply [flat|nested] 56+ messages in thread* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
@ 2023-07-20 21:33 ` Yosry Ahmed
0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-07-20 21:33 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups
On Thu, Jul 20, 2023 at 8:35 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> > This patch series implements the proposal in LSF/MM/BPF 2023 conference
> > for reducing offline/zombie memcgs by memory recharging [1]. The main
> > difference is that this series focuses on recharging and does not
> > include eviction of any memory charged to offline memcgs.
> >
> > Two methods of recharging are proposed:
> >
> > (a) Recharging of mapped folios.
> >
> > When a memcg is offlined, queue an asynchronous worker that will walk
> > the lruvec of the offline memcg and try to recharge any mapped folios to
> > the memcg of one of the processes mapping the folio. The main assumption
> > is that a process mapping the folio is the "rightful" owner of the
> > memory.
> >
> > Currently, this is only supported for evictable folios, as the
> > unevictable lru is imaginary and we cannot iterate the folios on it. A
> > separate proposal [2] was made to revive the unevictable lru, which
> > would allow recharging of unevictable folios.
> >
> > (b) Deferred recharging of folios.
> >
> > For folios that are unmapped, or mapped but we fail to recharge them
> > with (a), we rely on deferred recharging. Simply put, any time a folio
> > is accessed or dirtied by a userspace process, and that folio is charged
> > to an offline memcg, we will try to recharge it to the memcg of the
> > process accessing the folio. Again, we assume this process should be the
> > "rightful" owner of the memory. This is also done asynchronously to avoid
> > slowing down the data access path.
>
> I'm super skeptical of this proposal.
I expected this :)
>
> Recharging *might* be the most desirable semantics from a user pov,
> but only if it applies consistently to the whole memory footprint.
> There is no mention of slab allocations such as inodes, dentries,
> network buffers etc. which can be a significant part of a cgroup's
> footprint. These are currently reparented. I don't think doing one
> thing with half of the memory, and a totally different thing with the
> other half upon cgroup deletion is going to be acceptable semantics.
I think, as you say, recharging has the most desirable semantics
because the charge is maintained where it *should* be (with who is
actually using it). We simply cannot do that for kernel memory,
because we have no way of attributing it to a user. On the other hand,
we *can* attribute user memory to a user. Consistency is great, but
our inability to do (arguably) the right thing for one type of memory,
doesn't mean we shouldn't do it when we can. I would also argue that
user memory (anon/file pages) would commonly be the larger portion of
memory on a machine compared to kernel memory (e.g. slab).
>
> It appears this also brings back the reliability issue that caused us
> to deprecate charge moving. The recharge path has trylocks, LRU
> isolation attempts, GFP_ATOMIC allocations. These introduce a variable
> error rate into the relocation process,
Recharging is naturally best effort, because it's non-disruptive.
After a memcg dies, the kernel continuously tries to move the charges
away from it on every chance it gets. If it fails one time that's
fine, there will be other chances. Compared to the status quo, it is
definitely better than just leaving all the memory behind with the
zombie memcg. I would argue that over time (and accesses), most/all
memory should eventually get recharged. If not, something is not
working correctly, or a wrong assumption is being made.
> which causes pages that should
> belong to the same domain to be scattered around all over the place.
I strongly disagree with this point. Ideally, yes, memory charged to a
memcg would belong to the same domain. In practice, due to the first
touch charging semantics, this is far from the truth. For anonymous
memory, sure, they all belong to the same domain (mostly), the process
they belong to. But most of anonymous memory will go away when the
process dies anyway, the problem is mostly with shared resources (e.g.
file, tmpfs, ..). With file/tmpfs memory, the charging behavior is
random. The first memcg that touches a page gets charged for it.
Consequently, the file/tmpfs memory charged to a memcg would be a
mixture of pages from different files in different mounts, definitely
not a single domain. Perhaps with some workloads, where each memcg is
accessing different files, most memory charged to a memcg will belong
to the same domain, but in this case, recharging wouldn't move it away
anyway.
> It also means that zombie pinning still exists, but it's now even more
> influenced by timing and race conditions, and so less predictable.
It still exists, but it is improved. The kernel tries to move charges
away from zombies on every chance it gets instead of doing nothing
about it. It is less predictable, can't argue about this, but it can't
get worse, only better.
>
> There are two issues being conflated here:
>
> a) the problem of zombie cgroups, and
>
> b) who controls resources that outlive the control domain.
>
> For a), reparenting is still the most reasonable proposal. It's
> reliable for one, but it also fixes the problem fully within the
> established, user-facing semantics: resources that belong to a cgroup
> also hierarchically belong to all ancestral groups; if those resources
> outlive the last-level control domain, they continue to belong to the
> parents. This is how it works today, and this is how it continues to
> work with reparenting. The only difference is that those resources no
> longer pin a dead cgroup anymore, but instead are physically linked to
> the next online ancestor. Since dead cgroups have no effective control
> parameters anymore, this is semantically equivalent - it's just a more
> memory efficient implementation of the same exact thing.
I agree that reparenting is more deterministic and reliable, but there
are two major flaws off the top of my head:
(1) If a memcg touches a page one time and gets charged for it, the
charge is stuck in its hierarchy forever. It can get reparented, but
it will never be charged to whoever is actually using it again, unless
it is reclaimed and refaulted (in some cases).
Consider this hierarchy:
root
/ \
A B
\
C
Consider a case where memcg C touches a library file once, and gets
charged for some memory, and then dies. The memory gets reparente to
memcg B. Meanwhile, memcg A is continuously using the memory that
memcg B is charged for. memcg B would be indefinitely taxed by memcg
A. The only way out is if memcg B hit its limit, and the pages get
reclaimed, and then refaulted and recharged to memcg A. In some cases
(e.g. tmpfs), even then the memory would still get charged to memcg B.
There is no way to get rid of the charge until the resource itself is
freed.
This problem exists today, even without reparenting, with the
difference being that the charge will remain with C instead of B.
Recharging offers a better alternative where the charge will be
correctly moved to A, the "rightful" owner.
(2) In the above scenario, when memcg B dies, the memory will be
reparented to the root. That's even worse. Now memcg A is using memory
that is not accounted for anywhere, essentially an accounting leak.
From an admin perspective, the memory charged to root is system
overhead, it is lost capacity. For long-living systems, as memcgs are
created and destroyed for different workloads, memory will keep
accumulating at the root. The machine will keep leaking capacity over
time, and accounting becomes less and less accurate as more memory
becomes charged to the root.
>
> b) is a discussion totally separate from this.
I would argue that the zombie problem is (at least partially) an
artifact of the shared/sticky resources problem. If all resources are
used by one memcg and do not outlive it, we wouldn't have zombies.
> We can argue what we
> want this behavior to be, but I'd argue strongly that whatever we do
> here should apply to all resources managed by the controller equally.
User memory and kernel memory are very different in nature. Ideally
yeah, we want to treat all resources equally. But user memory is
naturally more attributable to users and easier to account correctly
than kernel memory.
>
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use.
This is easier said than done :) As I mentioned earlier, the charging
semantics are inherently indeterministic for shared resources (e.g.
file/tmpfs). The user cannot control or monitor which resources belong
to which control domain. Each memcg in the system could be charged for
one page from each file in a shared library for all that matters :)
> For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.
In a large fleet with many different jobs getting rescheduled and
restarted on different machines, it's really hard in practice to do
so. We can keep the same cgroup if the same workload is being
restarted on the same machine, sure, but most of the time there's a
new workload arriving or so. We can't reuse containers in this case.
>
> For the zombie problem, I think we should merge Muchun's patches
> ASAP. They've been proposed several times, they have Roman's reviews
> and acks, and they do not change user-facing semantics. There is no
> good reason not to merge them.
There are some, which I pointed out above.
All in all, I understand where you are coming from. Your concerns are
valid. Recharging is not a perfect approach, but it is arguably the
best we can do at this point. Being indeterministic sucks, but our
charging semantics are inherently indeterministic anyway.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
2023-07-20 15:35 ` Johannes Weiner
@ 2023-08-01 9:54 ` Michal Hocko
-1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2023-08-01 9:54 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yosry Ahmed, Andrew Morton, Roman Gushchin, Shakeel Butt,
Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
[Sorry for being late to this discussion]
On Thu 20-07-23 11:35:15, Johannes Weiner wrote:
[...]
> I'm super skeptical of this proposal.
Agreed.
> Recharging *might* be the most desirable semantics from a user pov,
> but only if it applies consistently to the whole memory footprint.
> There is no mention of slab allocations such as inodes, dentries,
> network buffers etc. which can be a significant part of a cgroup's
> footprint. These are currently reparented. I don't think doing one
> thing with half of the memory, and a totally different thing with the
> other half upon cgroup deletion is going to be acceptable semantics.
>
> It appears this also brings back the reliability issue that caused us
> to deprecate charge moving. The recharge path has trylocks, LRU
> isolation attempts, GFP_ATOMIC allocations. These introduce a variable
> error rate into the relocation process, which causes pages that should
> belong to the same domain to be scattered around all over the place.
> It also means that zombie pinning still exists, but it's now even more
> influenced by timing and race conditions, and so less predictable.
>
> There are two issues being conflated here:
>
> a) the problem of zombie cgroups, and
>
> b) who controls resources that outlive the control domain.
>
> For a), reparenting is still the most reasonable proposal. It's
> reliable for one, but it also fixes the problem fully within the
> established, user-facing semantics: resources that belong to a cgroup
> also hierarchically belong to all ancestral groups; if those resources
> outlive the last-level control domain, they continue to belong to the
> parents. This is how it works today, and this is how it continues to
> work with reparenting. The only difference is that those resources no
> longer pin a dead cgroup anymore, but instead are physically linked to
> the next online ancestor. Since dead cgroups have no effective control
> parameters anymore, this is semantically equivalent - it's just a more
> memory efficient implementation of the same exact thing.
>
> b) is a discussion totally separate from this. We can argue what we
> want this behavior to be, but I'd argue strongly that whatever we do
> here should apply to all resources managed by the controller equally.
>
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use. For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.
>
> For the zombie problem, I think we should merge Muchun's patches
> ASAP. They've been proposed several times, they have Roman's reviews
> and acks, and they do not change user-facing semantics. There is no
> good reason not to merge them.
Yes, fully agreed on both points. The problem with zombies is real but
reparenting should address it for a large part. Ownership is a different
problem. We have discussed that at LSFMM this year and in the past as
well I believe. What we probably need is a concept of taking an
ownership of the memory (something like madvise(MADV_OWN, range) or
fadvise for fd based resources). This would allow the caller to take
ownership of the said resource (like memcg charge of it).
I understand that would require some changes to existing workloads.
Whatever the interface will be, it has to be explicit otherwise we
are hitting problems with unaccounted resources that are sitting without
any actual ownership and an undeterministic and time dependeing hopping
over owners. In other words, nobody should be able to drop
responsibility of any object while it is still consuming resources.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
@ 2023-08-01 9:54 ` Michal Hocko
0 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2023-08-01 9:54 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yosry Ahmed, Andrew Morton, Roman Gushchin, Shakeel Butt,
Muchun Song, Matthew Wilcox (Oracle), Tejun Heo, Zefan Li,
Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
Greg Thelen, linux-kernel, linux-mm, cgroups
[Sorry for being late to this discussion]
On Thu 20-07-23 11:35:15, Johannes Weiner wrote:
[...]
> I'm super skeptical of this proposal.
Agreed.
> Recharging *might* be the most desirable semantics from a user pov,
> but only if it applies consistently to the whole memory footprint.
> There is no mention of slab allocations such as inodes, dentries,
> network buffers etc. which can be a significant part of a cgroup's
> footprint. These are currently reparented. I don't think doing one
> thing with half of the memory, and a totally different thing with the
> other half upon cgroup deletion is going to be acceptable semantics.
>
> It appears this also brings back the reliability issue that caused us
> to deprecate charge moving. The recharge path has trylocks, LRU
> isolation attempts, GFP_ATOMIC allocations. These introduce a variable
> error rate into the relocation process, which causes pages that should
> belong to the same domain to be scattered around all over the place.
> It also means that zombie pinning still exists, but it's now even more
> influenced by timing and race conditions, and so less predictable.
>
> There are two issues being conflated here:
>
> a) the problem of zombie cgroups, and
>
> b) who controls resources that outlive the control domain.
>
> For a), reparenting is still the most reasonable proposal. It's
> reliable for one, but it also fixes the problem fully within the
> established, user-facing semantics: resources that belong to a cgroup
> also hierarchically belong to all ancestral groups; if those resources
> outlive the last-level control domain, they continue to belong to the
> parents. This is how it works today, and this is how it continues to
> work with reparenting. The only difference is that those resources no
> longer pin a dead cgroup anymore, but instead are physically linked to
> the next online ancestor. Since dead cgroups have no effective control
> parameters anymore, this is semantically equivalent - it's just a more
> memory efficient implementation of the same exact thing.
>
> b) is a discussion totally separate from this. We can argue what we
> want this behavior to be, but I'd argue strongly that whatever we do
> here should apply to all resources managed by the controller equally.
>
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use. For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.
>
> For the zombie problem, I think we should merge Muchun's patches
> ASAP. They've been proposed several times, they have Roman's reviews
> and acks, and they do not change user-facing semantics. There is no
> good reason not to merge them.
Yes, fully agreed on both points. The problem with zombies is real but
reparenting should address it for a large part. Ownership is a different
problem. We have discussed that at LSFMM this year and in the past as
well I believe. What we probably need is a concept of taking an
ownership of the memory (something like madvise(MADV_OWN, range) or
fadvise for fd based resources). This would allow the caller to take
ownership of the said resource (like memcg charge of it).
I understand that would require some changes to existing workloads.
Whatever the interface will be, it has to be explicit otherwise we
are hitting problems with unaccounted resources that are sitting without
any actual ownership and an undeterministic and time dependeing hopping
over owners. In other words, nobody should be able to drop
responsibility of any object while it is still consuming resources.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 56+ messages in thread