* [PATCH V2 0/4] record non-slab shrinkers for non-root memcgs when kmem is disabled
@ 2026-03-10 3:12 Haifeng Xu
2026-03-10 3:12 ` [PATCH V2 1/4] mm: shrinker: add one more parameter in shrinker_id() Haifeng Xu
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Haifeng Xu @ 2026-03-10 3:12 UTC (permalink / raw)
To: akpm, david, roman.gushchin
Cc: zhengqi.arch, muchun.song, linux-mm, linux-kernel, Haifeng Xu
When registering new shrinkers, all memcgs need to expand their shrinker
info if the newly allocated id exceeds shrinker_nr_max. But if kmem is
disabled, only non-slab shrinkers are useful in memcg slab shrink, so in
that case it is enough to allocate non-slab shrinker info for non-root
memcgs. This saves a bit of memory and reduces the holding time of the
shrinker lock.
With this optimization, the finish time of pod creation in our internal
test is reduced from 150 seconds to 69 seconds. The test was done on
stable kernel 6.6.102.
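The idea can be illustrated with a small userspace sketch (the types,
names and flag values below are hypothetical stand-ins for illustration,
not the kernel code itself):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel types, for illustration only. */
#define SHRINKER_NONSLAB (1U << 2)

struct mem_cgroup { bool is_root; };

struct shrinker {
	unsigned int flags;
	int id;         /* index into the full per-memcg shrinker_info */
	int nonslab_id; /* denser index used when kmem is disabled */
};

/*
 * With kmem disabled, non-root memcgs only ever run SHRINKER_NONSLAB
 * shrinkers, so they can index a much smaller shrinker_info via
 * nonslab_id instead of the global id.
 */
static int pick_shrinker_id(bool kmem_online, const struct mem_cgroup *memcg,
			    const struct shrinker *s)
{
	if (!kmem_online && (s->flags & SHRINKER_NONSLAB) && !memcg->is_root)
		return s->nonslab_id;
	return s->id;
}
```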
Changes since V1:
- reuse shrinker_id() to retrieve the id of a shrinker; this also fixes
the build error without CONFIG_MEMCG.
https://lore.kernel.org/all/20260306075757.198887-1-haifeng.xu@shopee.com/
Haifeng Xu (4):
mm: shrinker: add one more parameter in shrinker_id()
mm: shrinker: move shrinker_id() code block below memcg_kmem_online()
mm: shrinker: optimize the allocation of shrinker_info when setting
cgroup_memory_nokmem
mm: shrinker: remove unnecessary check in shrink_slab_memcg()
include/linux/memcontrol.h | 194 +++++++++++++++++++------------------
include/linux/shrinker.h | 3 +
mm/huge_memory.c | 4 +-
mm/shrinker.c | 133 +++++++++++++++++++++----
4 files changed, 217 insertions(+), 117 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH V2 1/4] mm: shrinker: add one more parameter in shrinker_id()
2026-03-10 3:12 [PATCH V2 0/4] record non-slab shrinkers for non-root memcgs when kmem is disabled Haifeng Xu
@ 2026-03-10 3:12 ` Haifeng Xu
2026-03-10 3:12 ` [PATCH V2 2/4] mm: shrinker: move shrinker_id() code block below memcg_kmem_online() Haifeng Xu
` (2 subsequent siblings)
3 siblings, 0 replies; 10+ messages in thread
From: Haifeng Xu @ 2026-03-10 3:12 UTC (permalink / raw)
To: akpm, david, roman.gushchin
Cc: zhengqi.arch, muchun.song, linux-mm, linux-kernel, Haifeng Xu
Add a parameter to shrinker_id() that points to the target memory
cgroup. The following patch will use it to decide the shrinker's id.
No functional change here.
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
---
include/linux/memcontrol.h | 4 ++--
mm/huge_memory.c | 4 ++--
mm/shrinker.c | 12 ++++++++----
3 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 70b685a85bf4..a583dbc0adcc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1634,7 +1634,7 @@ void free_shrinker_info(struct mem_cgroup *memcg);
void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
void reparent_shrinker_deferred(struct mem_cgroup *memcg);
-static inline int shrinker_id(struct shrinker *shrinker)
+static inline int shrinker_id(struct mem_cgroup *memcg, struct shrinker *shrinker)
{
return shrinker->id;
}
@@ -1670,7 +1670,7 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg,
{
}
-static inline int shrinker_id(struct shrinker *shrinker)
+static inline int shrinker_id(struct mem_cgroup *memcg, struct shrinker *shrinker)
{
return -1;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e2746ea74ad..6050f8d71587 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4353,7 +4353,7 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
ds_queue->split_queue_len++;
if (memcg)
set_shrinker_bit(memcg, folio_nid(folio),
- shrinker_id(deferred_split_shrinker));
+ shrinker_id(memcg, deferred_split_shrinker));
}
split_queue_unlock_irqrestore(ds_queue, flags);
}
@@ -4509,7 +4509,7 @@ void reparent_deferred_split_queue(struct mem_cgroup *memcg)
ds_queue->split_queue_len = 0;
for_each_node(nid)
- set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
+ set_shrinker_bit(parent, nid, shrinker_id(parent, deferred_split_shrinker));
unlock:
spin_unlock(&parent_ds_queue->split_queue_lock);
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 7b61fc0ee78f..61dbb6afae52 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -255,11 +255,13 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
struct shrinker_info *info;
struct shrinker_info_unit *unit;
long nr_deferred;
+ int id;
rcu_read_lock();
+ id = shrinker_id(memcg, shrinker);
info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
- unit = info->unit[shrinker_id_to_index(shrinker->id)];
- nr_deferred = atomic_long_xchg(&unit->nr_deferred[shrinker_id_to_offset(shrinker->id)], 0);
+ unit = info->unit[shrinker_id_to_index(id)];
+ nr_deferred = atomic_long_xchg(&unit->nr_deferred[shrinker_id_to_offset(id)], 0);
rcu_read_unlock();
return nr_deferred;
@@ -271,12 +273,14 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
struct shrinker_info *info;
struct shrinker_info_unit *unit;
long nr_deferred;
+ int id;
rcu_read_lock();
+ id = shrinker_id(memcg, shrinker);
info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
- unit = info->unit[shrinker_id_to_index(shrinker->id)];
+ unit = info->unit[shrinker_id_to_index(id)];
nr_deferred =
- atomic_long_add_return(nr, &unit->nr_deferred[shrinker_id_to_offset(shrinker->id)]);
+ atomic_long_add_return(nr, &unit->nr_deferred[shrinker_id_to_offset(id)]);
rcu_read_unlock();
return nr_deferred;
--
2.43.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH V2 2/4] mm: shrinker: move shrinker_id() code block below memcg_kmem_online()
2026-03-10 3:12 [PATCH V2 0/4] record non-slab shrinkers for non-root memcgs when kmem is disabled Haifeng Xu
2026-03-10 3:12 ` [PATCH V2 1/4] mm: shrinker: add one more parameter in shrinker_id() Haifeng Xu
@ 2026-03-10 3:12 ` Haifeng Xu
2026-03-10 3:12 ` [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem Haifeng Xu
2026-03-10 3:12 ` [PATCH V2 4/4] mm: shrinker: remove unnecessary check in shrink_slab_memcg() Haifeng Xu
3 siblings, 0 replies; 10+ messages in thread
From: Haifeng Xu @ 2026-03-10 3:12 UTC (permalink / raw)
To: akpm, david, roman.gushchin
Cc: zhengqi.arch, muchun.song, linux-mm, linux-kernel, Haifeng Xu
The following patch will use memcg_kmem_online() to decide the
shrinker's id, so move the code block down to reduce diff noise there.
No functional change here.
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
---
include/linux/memcontrol.h | 188 ++++++++++++++++++-------------------
1 file changed, 94 insertions(+), 94 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a583dbc0adcc..ce7b5101bc02 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1582,100 +1582,6 @@ static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
#endif /* CONFIG_CGROUP_WRITEBACK */
-struct sock;
-#ifdef CONFIG_MEMCG
-extern struct static_key_false memcg_sockets_enabled_key;
-#define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
-
-void mem_cgroup_sk_alloc(struct sock *sk);
-void mem_cgroup_sk_free(struct sock *sk);
-void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk);
-bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages,
- gfp_t gfp_mask);
-void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages);
-
-#if BITS_PER_LONG < 64
-static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
-{
- u64 val = get_jiffies_64() + HZ;
- unsigned long flags;
-
- write_seqlock_irqsave(&memcg->socket_pressure_seqlock, flags);
- memcg->socket_pressure = val;
- write_sequnlock_irqrestore(&memcg->socket_pressure_seqlock, flags);
-}
-
-static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
-{
- unsigned int seq;
- u64 val;
-
- do {
- seq = read_seqbegin(&memcg->socket_pressure_seqlock);
- val = memcg->socket_pressure;
- } while (read_seqretry(&memcg->socket_pressure_seqlock, seq));
-
- return val;
-}
-#else
-static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
-{
- WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
-}
-
-static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
-{
- return READ_ONCE(memcg->socket_pressure);
-}
-#endif
-
-int alloc_shrinker_info(struct mem_cgroup *memcg);
-void free_shrinker_info(struct mem_cgroup *memcg);
-void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
-void reparent_shrinker_deferred(struct mem_cgroup *memcg);
-
-static inline int shrinker_id(struct mem_cgroup *memcg, struct shrinker *shrinker)
-{
- return shrinker->id;
-}
-#else
-#define mem_cgroup_sockets_enabled 0
-
-static inline void mem_cgroup_sk_alloc(struct sock *sk)
-{
-}
-
-static inline void mem_cgroup_sk_free(struct sock *sk)
-{
-}
-
-static inline void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
-{
-}
-
-static inline bool mem_cgroup_sk_charge(const struct sock *sk,
- unsigned int nr_pages,
- gfp_t gfp_mask)
-{
- return false;
-}
-
-static inline void mem_cgroup_sk_uncharge(const struct sock *sk,
- unsigned int nr_pages)
-{
-}
-
-static inline void set_shrinker_bit(struct mem_cgroup *memcg,
- int nid, int shrinker_id)
-{
-}
-
-static inline int shrinker_id(struct mem_cgroup *memcg, struct shrinker *shrinker)
-{
- return -1;
-}
-#endif
-
#ifdef CONFIG_MEMCG
bool mem_cgroup_kmem_disabled(void);
int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
@@ -1844,6 +1750,100 @@ static inline bool memcg_is_dying(struct mem_cgroup *memcg)
}
#endif /* CONFIG_MEMCG */
+struct sock;
+#ifdef CONFIG_MEMCG
+extern struct static_key_false memcg_sockets_enabled_key;
+#define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
+
+void mem_cgroup_sk_alloc(struct sock *sk);
+void mem_cgroup_sk_free(struct sock *sk);
+void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk);
+bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages,
+ gfp_t gfp_mask);
+void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages);
+
+#if BITS_PER_LONG < 64
+static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
+{
+ u64 val = get_jiffies_64() + HZ;
+ unsigned long flags;
+
+ write_seqlock_irqsave(&memcg->socket_pressure_seqlock, flags);
+ memcg->socket_pressure = val;
+ write_sequnlock_irqrestore(&memcg->socket_pressure_seqlock, flags);
+}
+
+static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
+{
+ unsigned int seq;
+ u64 val;
+
+ do {
+ seq = read_seqbegin(&memcg->socket_pressure_seqlock);
+ val = memcg->socket_pressure;
+ } while (read_seqretry(&memcg->socket_pressure_seqlock, seq));
+
+ return val;
+}
+#else
+static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
+{
+ WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
+}
+
+static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
+{
+ return READ_ONCE(memcg->socket_pressure);
+}
+#endif
+
+int alloc_shrinker_info(struct mem_cgroup *memcg);
+void free_shrinker_info(struct mem_cgroup *memcg);
+void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
+void reparent_shrinker_deferred(struct mem_cgroup *memcg);
+
+static inline int shrinker_id(struct mem_cgroup *memcg, struct shrinker *shrinker)
+{
+ return shrinker->id;
+}
+#else
+#define mem_cgroup_sockets_enabled 0
+
+static inline void mem_cgroup_sk_alloc(struct sock *sk)
+{
+}
+
+static inline void mem_cgroup_sk_free(struct sock *sk)
+{
+}
+
+static inline void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
+{
+}
+
+static inline bool mem_cgroup_sk_charge(const struct sock *sk,
+ unsigned int nr_pages,
+ gfp_t gfp_mask)
+{
+ return false;
+}
+
+static inline void mem_cgroup_sk_uncharge(const struct sock *sk,
+ unsigned int nr_pages)
+{
+}
+
+static inline void set_shrinker_bit(struct mem_cgroup *memcg,
+ int nid, int shrinker_id)
+{
+}
+
+static inline int shrinker_id(struct mem_cgroup *memcg, struct shrinker *shrinker)
+{
+ return -1;
+}
+#endif
+
#if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP)
bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
--
2.43.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem
2026-03-10 3:12 [PATCH V2 0/4] record non-slab shrinkers for non-root memcgs when kmem is disabled Haifeng Xu
2026-03-10 3:12 ` [PATCH V2 1/4] mm: shrinker: add one more parameter in shrinker_id() Haifeng Xu
2026-03-10 3:12 ` [PATCH V2 2/4] mm: shrinker: move shrinker_id() code block below memcg_kmem_online() Haifeng Xu
@ 2026-03-10 3:12 ` Haifeng Xu
2026-03-10 11:05 ` Usama Arif
2026-03-11 22:14 ` Dave Chinner
2026-03-10 3:12 ` [PATCH V2 4/4] mm: shrinker: remove unnecessary check in shrink_slab_memcg() Haifeng Xu
3 siblings, 2 replies; 10+ messages in thread
From: Haifeng Xu @ 2026-03-10 3:12 UTC (permalink / raw)
To: akpm, david, roman.gushchin
Cc: zhengqi.arch, muchun.song, linux-mm, linux-kernel, Haifeng Xu
When kmem is disabled, memcg slab shrink only calls non-slab shrinkers,
so it is enough to allocate shrinker info for non-slab shrinkers in
non-root memcgs.
Therefore, if memcg_kmem_online() is true, everything stays the same as
before. Otherwise, the root memcg allocates ids from shrinker_idr to
identify each shrinker, while non-root memcgs use nonslab_id to identify
non-slab shrinkers. The shrinker_info of a non-root memcg can be very
small because few shrinkers are marked SHRINKER_NONSLAB |
SHRINKER_MEMCG_AWARE. The time spent in expand_shrinker_info() is also
reduced a lot.
When setting a shrinker bit or updating nr_deferred, use nonslab_id for
a non-root memcg if the shrinker is marked SHRINKER_NONSLAB.
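The registration flow described above can be sketched in userspace (toy
array-based stand-ins for shrinker_idr/shrinker_nonslab_idr; names and
bounds here are hypothetical, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

#define SHRINKER_NONSLAB (1U << 2)
#define MAX_SHRINKERS 64

struct shrinker { unsigned int flags; int id; int nonslab_id; };

/* Toy stand-ins for shrinker_idr and shrinker_nonslab_idr. */
static struct shrinker *idr[MAX_SHRINKERS];
static struct shrinker *nonslab_idr[MAX_SHRINKERS];
static int idr_next, nonslab_next;

/*
 * Every shrinker gets an id; with kmem offline, SHRINKER_NONSLAB
 * shrinkers additionally get a denser nonslab_id, which is what keeps
 * the shrinker_info of non-root memcgs small.
 */
static int register_shrinker_ids(struct shrinker *s, bool kmem_online)
{
	s->id = idr_next;
	idr[idr_next++] = s;
	s->nonslab_id = -1;
	if (!kmem_online && (s->flags & SHRINKER_NONSLAB)) {
		s->nonslab_id = nonslab_next;
		nonslab_idr[nonslab_next++] = s;
	}
	return s->id;
}
```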
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
---
include/linux/memcontrol.h | 8 ++-
include/linux/shrinker.h | 3 +
mm/shrinker.c | 116 +++++++++++++++++++++++++++++++++----
3 files changed, 114 insertions(+), 13 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ce7b5101bc02..3edd6211aed2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1804,7 +1804,13 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg);
static inline int shrinker_id(struct mem_cgroup *memcg, struct shrinker *shrinker)
{
- return shrinker->id;
+ int id = shrinker->id;
+
+ if (!memcg_kmem_online() && (shrinker->flags & SHRINKER_NONSLAB) &&
+ memcg != root_mem_cgroup)
+ id = shrinker->nonslab_id;
+
+ return id;
}
#else
#define mem_cgroup_sockets_enabled 0
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 1a00be90d93a..df53008ed8b5 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -107,6 +107,9 @@ struct shrinker {
#ifdef CONFIG_MEMCG
/* ID in shrinker_idr */
int id;
+
+ /* ID in shrinker_nonslab_idr */
+ int nonslab_id;
#endif
#ifdef CONFIG_SHRINKER_DEBUG
int debugfs_id;
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 61dbb6afae52..68ea2d49495c 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -12,6 +12,7 @@ DEFINE_MUTEX(shrinker_mutex);
#ifdef CONFIG_MEMCG
static int shrinker_nr_max;
+static int shrinker_nonslab_nr_max;
static inline int shrinker_unit_size(int nr_items)
{
@@ -78,15 +79,25 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
{
int nid, ret = 0;
int array_size = 0;
+ int alloc_nr_max;
+
+ if (memcg_kmem_online()) {
+ alloc_nr_max = shrinker_nr_max;
+ } else {
+ if (memcg == root_mem_cgroup)
+ alloc_nr_max = shrinker_nr_max;
+ else
+ alloc_nr_max = shrinker_nonslab_nr_max;
+ }
mutex_lock(&shrinker_mutex);
- array_size = shrinker_unit_size(shrinker_nr_max);
+ array_size = shrinker_unit_size(alloc_nr_max);
for_each_node(nid) {
struct shrinker_info *info = kvzalloc_node(sizeof(*info) + array_size,
GFP_KERNEL, nid);
if (!info)
goto err;
- info->map_nr_max = shrinker_nr_max;
+ info->map_nr_max = alloc_nr_max;
if (shrinker_unit_alloc(info, NULL, nid)) {
kvfree(info);
goto err;
@@ -147,33 +158,47 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg, int new_size,
return 0;
}
-static int expand_shrinker_info(int new_id)
+static int expand_shrinker_info(int new_id, bool full, bool root)
{
int ret = 0;
int new_nr_max = round_up(new_id + 1, SHRINKER_UNIT_BITS);
int new_size, old_size = 0;
struct mem_cgroup *memcg;
+ struct mem_cgroup *start = NULL;
+ int old_nr_max = shrinker_nr_max;
if (!root_mem_cgroup)
goto out;
lockdep_assert_held(&shrinker_mutex);
+ if (!full && !root) {
+ start = root_mem_cgroup;
+ old_nr_max = shrinker_nonslab_nr_max;
+ }
+
new_size = shrinker_unit_size(new_nr_max);
- old_size = shrinker_unit_size(shrinker_nr_max);
+ old_size = shrinker_unit_size(old_nr_max);
+
+ memcg = mem_cgroup_iter(NULL, start, NULL);
+ if (!memcg)
+ goto out;
- memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
ret = expand_one_shrinker_info(memcg, new_size, old_size,
new_nr_max);
- if (ret) {
+ if (ret || (root && memcg == root_mem_cgroup)) {
mem_cgroup_iter_break(NULL, memcg);
goto out;
}
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
out:
- if (!ret)
- shrinker_nr_max = new_nr_max;
+ if (!ret) {
+ if (!full && !root)
+ shrinker_nonslab_nr_max = new_nr_max;
+ else
+ shrinker_nr_max = new_nr_max;
+ }
return ret;
}
@@ -212,25 +237,58 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
}
static DEFINE_IDR(shrinker_idr);
+static DEFINE_IDR(shrinker_nonslab_idr);
static int shrinker_memcg_alloc(struct shrinker *shrinker)
{
int id, ret = -ENOMEM;
+ bool kmem_online;
if (mem_cgroup_disabled())
return -ENOSYS;
+ kmem_online = memcg_kmem_online();
mutex_lock(&shrinker_mutex);
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
if (id >= shrinker_nr_max) {
- if (expand_shrinker_info(id)) {
+ /* If memcg_kmem_online() returns true, expand shrinker
+ * info for all memcgs, otherwise, expand shrinker info
+ * for root memcg only
+ */
+ if (expand_shrinker_info(id, kmem_online, !kmem_online)) {
+ idr_remove(&shrinker_idr, id);
+ goto unlock;
+ }
+ }
+
+ shrinker->nonslab_id = -1;
+ /*
+ * If cgroup_memory_nokmem is set, record shrinkers with SHRINKER_NONSLAB
+ * because memcg slab shrink only call non-slab shrinkers.
+ */
+ if (!kmem_online && shrinker->flags & SHRINKER_NONSLAB) {
+ int nonslab_id;
+
+ nonslab_id = idr_alloc(&shrinker_nonslab_idr, shrinker, 0, 0, GFP_KERNEL);
+ if (nonslab_id < 0) {
idr_remove(&shrinker_idr, id);
goto unlock;
}
+
+ if (nonslab_id >= shrinker_nonslab_nr_max) {
+ /* expand shrinker info for non-root memcgs */
+ if (expand_shrinker_info(nonslab_id, false, false)) {
+ idr_remove(&shrinker_idr, id);
+ idr_remove(&shrinker_nonslab_idr, nonslab_id);
+ goto unlock;
+ }
+ }
+ shrinker->nonslab_id = nonslab_id;
}
+
shrinker->id = id;
ret = 0;
unlock:
@@ -247,6 +305,12 @@ static void shrinker_memcg_remove(struct shrinker *shrinker)
lockdep_assert_held(&shrinker_mutex);
idr_remove(&shrinker_idr, id);
+
+ if (shrinker->flags & SHRINKER_NONSLAB) {
+ id = shrinker->nonslab_id;
+ if (id >= 0)
+ idr_remove(&shrinker_nonslab_idr, id);
+ }
}
static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
@@ -305,10 +369,33 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
parent_info = shrinker_info_protected(parent, nid);
for (index = 0; index < shrinker_id_to_index(child_info->map_nr_max); index++) {
child_unit = child_info->unit[index];
- parent_unit = parent_info->unit[index];
for (offset = 0; offset < SHRINKER_UNIT_BITS; offset++) {
nr = atomic_long_read(&child_unit->nr_deferred[offset]);
- atomic_long_add(nr, &parent_unit->nr_deferred[offset]);
+
+ /*
+ * If memcg_kmem_online() is false, the non-root memcgs use
+ * nonslab_id but root memory cgroup use id. When reparenting
+ * shrinker info to it, must convert the nonslab_id to id.
+ */
+ if (!memcg_kmem_online() && parent == root_mem_cgroup) {
+ int id, p_index, p_off;
+ struct shrinker *shrinker;
+
+ id = calc_shrinker_id(index, offset);
+ shrinker = idr_find(&shrinker_nonslab_idr, id);
+ if (shrinker) {
+ id = shrinker->id;
+ p_index = shrinker_id_to_index(id);
+ p_off = shrinker_id_to_offset(id);
+
+ parent_unit = parent_info->unit[p_index];
+ atomic_long_add(nr,
+ &parent_unit->nr_deferred[p_off]);
+ }
+ } else {
+ parent_unit = parent_info->unit[index];
+ atomic_long_add(nr, &parent_unit->nr_deferred[offset]);
+ }
}
}
}
@@ -538,7 +625,12 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
int shrinker_id = calc_shrinker_id(index, offset);
rcu_read_lock();
- shrinker = idr_find(&shrinker_idr, shrinker_id);
+
+ if (memcg_kmem_online())
+ shrinker = idr_find(&shrinker_idr, shrinker_id);
+ else
+ shrinker = idr_find(&shrinker_nonslab_idr, shrinker_id);
+
if (unlikely(!shrinker || !shrinker_try_get(shrinker))) {
clear_bit(offset, unit->map);
rcu_read_unlock();
--
2.43.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH V2 4/4] mm: shrinker: remove unnecessary check in shrink_slab_memcg()
2026-03-10 3:12 [PATCH V2 0/4] record non-slab shrinkers for non-root memcgs when kmem is disabled Haifeng Xu
` (2 preceding siblings ...)
2026-03-10 3:12 ` [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem Haifeng Xu
@ 2026-03-10 3:12 ` Haifeng Xu
3 siblings, 0 replies; 10+ messages in thread
From: Haifeng Xu @ 2026-03-10 3:12 UTC (permalink / raw)
To: akpm, david, roman.gushchin
Cc: zhengqi.arch, muchun.song, linux-mm, linux-kernel, Haifeng Xu
If memcg_kmem_online() is false, only non-slab shrinkers are recorded
in the map, so remove the check.
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
---
mm/shrinker.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 68ea2d49495c..436387b3ba24 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -638,11 +638,6 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
}
rcu_read_unlock();
- /* Call non-slab shrinkers even though kmem is disabled */
- if (!memcg_kmem_online() &&
- !(shrinker->flags & SHRINKER_NONSLAB))
- continue;
-
ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY) {
clear_bit(offset, unit->map);
--
2.43.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem
2026-03-10 3:12 ` [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem Haifeng Xu
@ 2026-03-10 11:05 ` Usama Arif
2026-03-11 2:21 ` Haifeng Xu
2026-03-11 22:14 ` Dave Chinner
1 sibling, 1 reply; 10+ messages in thread
From: Usama Arif @ 2026-03-10 11:05 UTC (permalink / raw)
To: Haifeng Xu
Cc: Usama Arif, akpm, david, roman.gushchin, zhengqi.arch,
muchun.song, linux-mm, linux-kernel
On Tue, 10 Mar 2026 11:12:49 +0800 Haifeng Xu <haifeng.xu@shopee.com> wrote:
> When kmem is disabled, memcg slab shrink only calls non-slab shrinkers,
> so it is enough to allocate shrinker info for non-slab shrinkers in
> non-root memcgs.
>
> Therefore, if memcg_kmem_online() is true, everything stays the same as
> before. Otherwise, the root memcg allocates ids from shrinker_idr to
> identify each shrinker, while non-root memcgs use nonslab_id to identify
> non-slab shrinkers. The shrinker_info of a non-root memcg can be very
> small because few shrinkers are marked SHRINKER_NONSLAB |
> SHRINKER_MEMCG_AWARE. The time spent in expand_shrinker_info() is also
> reduced a lot.
>
> When setting a shrinker bit or updating nr_deferred, use nonslab_id for
> a non-root memcg if the shrinker is marked SHRINKER_NONSLAB.
>
> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
> ---
> include/linux/memcontrol.h | 8 ++-
> include/linux/shrinker.h | 3 +
> mm/shrinker.c | 116 +++++++++++++++++++++++++++++++++----
> 3 files changed, 114 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ce7b5101bc02..3edd6211aed2 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1804,7 +1804,13 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg);
>
> static inline int shrinker_id(struct mem_cgroup *memcg, struct shrinker *shrinker)
> {
> - return shrinker->id;
> + int id = shrinker->id;
> +
> + if (!memcg_kmem_online() && (shrinker->flags & SHRINKER_NONSLAB) &&
> + memcg != root_mem_cgroup)
> + id = shrinker->nonslab_id;
> +
> + return id;
> }
> #else
> #define mem_cgroup_sockets_enabled 0
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 1a00be90d93a..df53008ed8b5 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -107,6 +107,9 @@ struct shrinker {
> #ifdef CONFIG_MEMCG
> /* ID in shrinker_idr */
> int id;
> +
> + /* ID in shrinker_nonslab_idr */
> + int nonslab_id;
> #endif
> #ifdef CONFIG_SHRINKER_DEBUG
> int debugfs_id;
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 61dbb6afae52..68ea2d49495c 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -12,6 +12,7 @@ DEFINE_MUTEX(shrinker_mutex);
>
> #ifdef CONFIG_MEMCG
> static int shrinker_nr_max;
> +static int shrinker_nonslab_nr_max;
>
> static inline int shrinker_unit_size(int nr_items)
> {
> @@ -78,15 +79,25 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> {
> int nid, ret = 0;
> int array_size = 0;
> + int alloc_nr_max;
> +
> + if (memcg_kmem_online()) {
> + alloc_nr_max = shrinker_nr_max;
> + } else {
> + if (memcg == root_mem_cgroup)
> + alloc_nr_max = shrinker_nr_max;
> + else
> + alloc_nr_max = shrinker_nonslab_nr_max;
> + }
>
> mutex_lock(&shrinker_mutex);
Hello!
The patch reads shrinker_nonslab_nr_max and shrinker_nr_max before
acquiring shrinker_mutex. The original code read shrinker_nr_max
UNDER the lock. Both variables are modified under shrinker_mutex by
concurrent shrinker registrations. A stale read could cause
alloc_shrinker_info() to allocate an undersized shrinker_info,
leading to out-of-bounds access in set_shrinker_bit() or the
nr_deferred functions.
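A minimal fix (just a sketch against this hunk, untested) would be to
pick alloc_nr_max only after taking the mutex:

```
	mutex_lock(&shrinker_mutex);
	if (memcg_kmem_online() || memcg == root_mem_cgroup)
		alloc_nr_max = shrinker_nr_max;
	else
		alloc_nr_max = shrinker_nonslab_nr_max;
	array_size = shrinker_unit_size(alloc_nr_max);
```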
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem
2026-03-10 11:05 ` Usama Arif
@ 2026-03-11 2:21 ` Haifeng Xu
0 siblings, 0 replies; 10+ messages in thread
From: Haifeng Xu @ 2026-03-11 2:21 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, david, roman.gushchin, zhengqi.arch, muchun.song, linux-mm,
linux-kernel
On 2026/3/10 19:05, Usama Arif wrote:
> On Tue, 10 Mar 2026 11:12:49 +0800 Haifeng Xu <haifeng.xu@shopee.com> wrote:
>
>> When kmem is disabled, memcg slab shrink only calls non-slab shrinkers,
>> so it is enough to allocate shrinker info for non-slab shrinkers in
>> non-root memcgs.
>>
>> Therefore, if memcg_kmem_online() is true, everything stays the same as
>> before. Otherwise, the root memcg allocates ids from shrinker_idr to
>> identify each shrinker, while non-root memcgs use nonslab_id to identify
>> non-slab shrinkers. The shrinker_info of a non-root memcg can be very
>> small because few shrinkers are marked SHRINKER_NONSLAB |
>> SHRINKER_MEMCG_AWARE. The time spent in expand_shrinker_info() is also
>> reduced a lot.
>>
>> When setting a shrinker bit or updating nr_deferred, use nonslab_id for
>> a non-root memcg if the shrinker is marked SHRINKER_NONSLAB.
>>
>> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
>> ---
>> include/linux/memcontrol.h | 8 ++-
>> include/linux/shrinker.h | 3 +
>> mm/shrinker.c | 116 +++++++++++++++++++++++++++++++++----
>> 3 files changed, 114 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index ce7b5101bc02..3edd6211aed2 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -1804,7 +1804,13 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg);
>>
>> static inline int shrinker_id(struct mem_cgroup *memcg, struct shrinker *shrinker)
>> {
>> - return shrinker->id;
>> + int id = shrinker->id;
>> +
>> + if (!memcg_kmem_online() && (shrinker->flags & SHRINKER_NONSLAB) &&
>> + memcg != root_mem_cgroup)
>> + id = shrinker->nonslab_id;
>> +
>> + return id;
>> }
>> #else
>> #define mem_cgroup_sockets_enabled 0
>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>> index 1a00be90d93a..df53008ed8b5 100644
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -107,6 +107,9 @@ struct shrinker {
>> #ifdef CONFIG_MEMCG
>> /* ID in shrinker_idr */
>> int id;
>> +
>> + /* ID in shrinker_nonslab_idr */
>> + int nonslab_id;
>> #endif
>> #ifdef CONFIG_SHRINKER_DEBUG
>> int debugfs_id;
>> diff --git a/mm/shrinker.c b/mm/shrinker.c
>> index 61dbb6afae52..68ea2d49495c 100644
>> --- a/mm/shrinker.c
>> +++ b/mm/shrinker.c
>> @@ -12,6 +12,7 @@ DEFINE_MUTEX(shrinker_mutex);
>>
>> #ifdef CONFIG_MEMCG
>> static int shrinker_nr_max;
>> +static int shrinker_nonslab_nr_max;
>>
>> static inline int shrinker_unit_size(int nr_items)
>> {
>> @@ -78,15 +79,25 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>> {
>> int nid, ret = 0;
>> int array_size = 0;
>> + int alloc_nr_max;
>> +
>> + if (memcg_kmem_online()) {
>> + alloc_nr_max = shrinker_nr_max;
>> + } else {
>> + if (memcg == root_mem_cgroup)
>> + alloc_nr_max = shrinker_nr_max;
>> + else
>> + alloc_nr_max = shrinker_nonslab_nr_max;
>> + }
>>
>> mutex_lock(&shrinker_mutex);
>
> Hello!
>
> The patch reads shrinker_nonslab_nr_max and shrinker_nr_max before
> acquiring shrinker_mutex. The original code read shrinker_nr_max
> UNDER the lock. Both variables are modified under shrinker_mutex by
> concurrent shrinker registrations. A stale read could cause
> alloc_shrinker_info() to allocate an undersized shrinker_info,
> leading to out-of-bounds access in set_shrinker_bit() or the
> nr_deferred functions.
>
Yes, thanks for pointing out the problem.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem
2026-03-10 3:12 ` [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem Haifeng Xu
2026-03-10 11:05 ` Usama Arif
@ 2026-03-11 22:14 ` Dave Chinner
[not found] ` <bc08d009-fa43-44d0-880f-a37cc200a3b9@shopee.com>
1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2026-03-11 22:14 UTC (permalink / raw)
To: Haifeng Xu
Cc: akpm, david, roman.gushchin, zhengqi.arch, muchun.song, linux-mm,
linux-kernel
On Tue, Mar 10, 2026 at 11:12:49AM +0800, Haifeng Xu wrote:
> When kmem is disabled, memcg slab shrink only calls non-slab shrinkers,
> so it is enough to allocate shrinker info for non-slab shrinkers in
> non-root memcgs.
>
> Therefore, if memcg_kmem_online() is true, everything stays the same as
> before. Otherwise, the root memcg allocates ids from shrinker_idr to
> identify each shrinker, while non-root memcgs use nonslab_id to identify
> non-slab shrinkers. The shrinker_info of a non-root memcg can be very
> small because few shrinkers are marked SHRINKER_NONSLAB |
> SHRINKER_MEMCG_AWARE. The time spent in expand_shrinker_info() is also
> reduced a lot.
>
> When setting a shrinker bit or updating nr_deferred, use nonslab_id for
> a non-root memcg if the shrinker is marked SHRINKER_NONSLAB.
>
> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
> ---
> include/linux/memcontrol.h | 8 ++-
> include/linux/shrinker.h | 3 +
> mm/shrinker.c | 116 +++++++++++++++++++++++++++++++++----
> 3 files changed, 114 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ce7b5101bc02..3edd6211aed2 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1804,7 +1804,13 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg);
>
> static inline int shrinker_id(struct mem_cgroup *memcg, struct shrinker *shrinker)
> {
> - return shrinker->id;
> + int id = shrinker->id;
> +
> + if (!memcg_kmem_online() && (shrinker->flags & SHRINKER_NONSLAB) &&
> + memcg != root_mem_cgroup)
> + id = shrinker->nonslab_id;
> +
> + return id;
> }
> #else
> #define mem_cgroup_sockets_enabled 0
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 1a00be90d93a..df53008ed8b5 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -107,6 +107,9 @@ struct shrinker {
> #ifdef CONFIG_MEMCG
> /* ID in shrinker_idr */
> int id;
> +
> + /* ID in shrinker_nonslab_idr */
> + int nonslab_id;
> #endif
> #ifdef CONFIG_SHRINKER_DEBUG
> int debugfs_id;
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 61dbb6afae52..68ea2d49495c 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -12,6 +12,7 @@ DEFINE_MUTEX(shrinker_mutex);
>
> #ifdef CONFIG_MEMCG
> static int shrinker_nr_max;
> +static int shrinker_nonslab_nr_max;
>
> static inline int shrinker_unit_size(int nr_items)
> {
> @@ -78,15 +79,25 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> {
> int nid, ret = 0;
> int array_size = 0;
> + int alloc_nr_max;
> +
> + if (memcg_kmem_online()) {
> + alloc_nr_max = shrinker_nr_max;
> + } else {
> + if (memcg == root_mem_cgroup)
> + alloc_nr_max = shrinker_nr_max;
> + else
> + alloc_nr_max = shrinker_nonslab_nr_max;
> + }
What does this do and why does it exist? Why do we need two
different indexes and tracking structures when memcg is disabled?
If I look at this code outside of this commit context, I have -zero-
idea of what all this ... complexity does or is needed for.
AFAICT, the code is trying to reduce memcg-aware shrinker
registration overhead, yes?
If so, please explain where all the overhead is in the first place -
if there's a time saving of hundreds of seconds in your workload,
then whatever is causing the overhead is going to show up in CPU
profiles. What, exactly, is causing all the registration overhead?
i.e. there are lots of workloads that create large numbers of
containers when memcg is actually enabled, so if registration is
costly then the right thing to do here is fix the registration
overhead problem.
Hacking custom logic into the code to avoid the overhead in your
specific special case so you can ignore the problem is not the way
we solve problems. We need to solve problems like this in a way that
benefits -everyone- regardless of whether they are using memcgs or
not.
So, please identify where all the overhead in memcg shrinker
registration is, and then we can take steps to improve the
registration code -for everyone-.
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem
[not found] ` <bc08d009-fa43-44d0-880f-a37cc200a3b9@shopee.com>
@ 2026-03-12 5:52 ` Dave Chinner
2026-03-13 3:04 ` Haifeng Xu
0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2026-03-12 5:52 UTC (permalink / raw)
To: Haifeng Xu
Cc: akpm, david, roman.gushchin, zhengqi.arch, muchun.song, linux-mm,
linux-kernel
On Thu, Mar 12, 2026 at 12:08:42PM +0800, Haifeng Xu wrote:
> On 2026/3/12 06:14, Dave Chinner wrote:
> > On Tue, Mar 10, 2026 at 11:12:49AM +0800, Haifeng Xu wrote:
> >> @@ -78,15 +79,25 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> >> {
> >> int nid, ret = 0;
> >> int array_size = 0;
> >> + int alloc_nr_max;
> >> +
> >> + if (memcg_kmem_online()) {
> >> + alloc_nr_max = shrinker_nr_max;
> >> + } else {
> >> + if (memcg == root_mem_cgroup)
> >> + alloc_nr_max = shrinker_nr_max;
> >> + else
> >> + alloc_nr_max = shrinker_nonslab_nr_max;
> >> + }
> >
> > What does this do and why does it exist? Why do we need two
> > different indexes and tracking structures when memcg is disabled?
> >
> > If I look at this code outside of this commit context, I have -zero-
> > idea of what all this ... complexity does or is needed for.
> >
> > AFAICT, the code is trying to reduce memcg-aware shrinker
> > registration overhead, yes?
> >
> > If so, please explain where all the overhead is in the first place -
> > if there's a time saving of hundreds of seconds in your workload,
> > then whatever is causing the overhead is going to show up in CPU
> > profiles. What, exactly, is causing all the registration overhead?
> >
> > i.e. there are lots of workloads that create large numbers of
> > containers when memcg is actually enabled, so if registration is
> > costly then the right thing to do here is fix the registration
> > overhead problem.
> >
> > Hacking custom logic into the code to avoid the overhead in your
> > specific special case so you can ignore the problem is not the way
> > we solve problems. We need to solve problems like this in a way that
> > benefits -everyone- regardless of whether they are using memcgs or
> > not.
> >
> > So, please identify where all the overhead in memcg shrinker
> > registration is, and then we can take steps to improve the
> > registration code -for everyone-.
> >
> > -Dave.
>
> When creating containers, we found many threads got stuck waiting on the
> shrinker lock on our machine with kmem disabled. We also found that the
> shrinker lock was held for a long time when expanding shrinker info. As
> the number of containers increases, the lock holding time grows from a
> few milliseconds to over one hundred milliseconds.
Ok, but...
> Call stack can be seen as below(based on stable kernel 6.6.102).
>
>
> PID: 4462 TASK: ffff8eff5ca0b500 CPU: 79 COMMAND: "runc:[2:INIT]"
> #0 [ffffc9005b213b10] __schedule at ffffffffa3ad84c0
> #1 [ffffc9005b213bb8] schedule at ffffffffa3ad8988
> #2 [ffffc9005b213bd8] schedule_preempt_disabled at ffffffffa3ad8bae
> #3 [ffffc9005b213be8] rwsem_down_write_slowpath at ffffffffa3adcc5e
> #4 [ffffc9005b213ca8] down_write at ffffffffa3adcf3c
> #5 [ffffc9005b213cc0] __prealloc_shrinker at ffffffffa2db3bf0
> #6 [ffffc9005b213d08] prealloc_shrinker at ffffffffa2db9e0e
> #7 [ffffc9005b213d18] alloc_super at ffffffffa2ebec49
> #8 [ffffc9005b213d48] sget_fc at ffffffffa2ebff48
.... this is exactly why you need to show your working, not just
present your solution.
That is, the prealloc_shrinker() API no longer exists in TOT. The way
shrinkers are registered and run was significantly changed in 2023 by
commit c42d50aefd17 ("mm: shrinker: add infrastructure for dynamically
allocating shrinker").
IOWs, the problem you are reporting may not even exist on TOT kernels.
Have you tested your problematic workload on a TOT kernel, and if so,
what were the results?
> We used the perf tool to record CPU consumption. I have posted it in the
> attachment. From the flame graph, we can see that clear_page_erms() and
> memcpy() in expand_one_shrinker_info() are the main sources of overhead.
expand_one_shrinker_info() was somewhat simplified by the above
series, so even if it is still an issue on TOT kernels, it still
likely costs less than on old LTS kernels.
> Therefore, the more shrinkers and memcgs exist, the longer expanding
> shrinker info takes. This is because when expanding shrinker info, we
> traverse all memcgs and record all shrinkers for them.
Only if the shrinker_info arrays need expanding. This is happening
frequently because the expansion code only expands the shrinker info
arrays by one ID at a time. Hence if you create a container that has
half a dozen new filesystems mounted in it, the array is being
expanded at least half a dozen times. Maybe multiples of this more,
if the filesystem registers multiple memcg-aware shrinkers per
mount...
IOWs, we need to make the expansion case the rare slow path, not the
common fast path here.
It should be trivial to batch the array expansion (e.g. expand in
multiples of 8/16/32 slots) and then shrinker instantiation overhead
should scale down linearly with increasing expansion batch sizes.
Such an improvement will reduce the memcg aware shrinker
registration overhead on -all- configurations, not just the specific
one you run....
> However, with kmem disabled, memcg slab shrink only calls non-slab
> shrinkers; that is to say, we only need to record non-slab shrinkers for
> non-root memcgs. For the root memcg, we still need to record all
> shrinkers because global shrink calls all of them.
Yes, I know what your patch does - it should be clear that it does
not address the root cause of the problem you are reporting, merely
works around it for your specific use case.
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem
2026-03-12 5:52 ` Dave Chinner
@ 2026-03-13 3:04 ` Haifeng Xu
0 siblings, 0 replies; 10+ messages in thread
From: Haifeng Xu @ 2026-03-13 3:04 UTC (permalink / raw)
To: Dave Chinner
Cc: akpm, david, roman.gushchin, zhengqi.arch, muchun.song, linux-mm,
linux-kernel
On 2026/3/12 13:52, Dave Chinner wrote:
> On Thu, Mar 12, 2026 at 12:08:42PM +0800, Haifeng Xu wrote:
>> On 2026/3/12 06:14, Dave Chinner wrote:
>>> On Tue, Mar 10, 2026 at 11:12:49AM +0800, Haifeng Xu wrote:
>>>> @@ -78,15 +79,25 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>>> {
>>>> int nid, ret = 0;
>>>> int array_size = 0;
>>>> + int alloc_nr_max;
>>>> +
>>>> + if (memcg_kmem_online()) {
>>>> + alloc_nr_max = shrinker_nr_max;
>>>> + } else {
>>>> + if (memcg == root_mem_cgroup)
>>>> + alloc_nr_max = shrinker_nr_max;
>>>> + else
>>>> + alloc_nr_max = shrinker_nonslab_nr_max;
>>>> + }
>>>
>>> What does this do and why does it exist? Why do we need two
>>> different indexes and tracking structures when memcg is disabled?
>>>
>>> If I look at this code outside of this commit context, I have -zero-
>>> idea of what all this ... complexity does or is needed for.
>>>
>>> AFAICT, the code is trying to reduce memcg-aware shrinker
>>> registration overhead, yes?
>>>
>>> If so, please explain where all the overhead is in the first place -
>>> if there's a time saving of hundreds of seconds in your workload,
>>> then whatever is causing the overhead is going to show up in CPU
>>> profiles. What, exactly, is causing all the registration overhead?
>>>
>>> i.e. there are lots of workloads that create large numbers of
>>> containers when memcg is actually enabled, so if registration is
>>> costly then the right thing to do here is fix the registration
>>> overhead problem.
>>>
>>> Hacking custom logic into the code to avoid the overhead in your
>>> specific special case so you can ignore the problem is not the way
>>> we solve problems. We need to solve problems like this in a way that
>>> benefits -everyone- regardless of whether they are using memcgs or
>>> not.
>>>
>>> So, please identify where all the overhead in memcg shrinker
>>> registration is, and then we can take steps to improve the
>>> registration code -for everyone-.
>>>
>>> -Dave.
>>
>> When creating containers, we found many threads got stuck waiting on
>> the shrinker lock on our machine with kmem disabled. We also found that
>> the shrinker lock was held for a long time when expanding shrinker
>> info. As the number of containers increases, the lock holding time
>> grows from a few milliseconds to over one hundred milliseconds.
>
> Ok, but...
>
>> Call stack can be seen as below(based on stable kernel 6.6.102).
>>
>>
>> PID: 4462 TASK: ffff8eff5ca0b500 CPU: 79 COMMAND: "runc:[2:INIT]"
>> #0 [ffffc9005b213b10] __schedule at ffffffffa3ad84c0
>> #1 [ffffc9005b213bb8] schedule at ffffffffa3ad8988
>> #2 [ffffc9005b213bd8] schedule_preempt_disabled at ffffffffa3ad8bae
>> #3 [ffffc9005b213be8] rwsem_down_write_slowpath at ffffffffa3adcc5e
>> #4 [ffffc9005b213ca8] down_write at ffffffffa3adcf3c
>> #5 [ffffc9005b213cc0] __prealloc_shrinker at ffffffffa2db3bf0
>> #6 [ffffc9005b213d08] prealloc_shrinker at ffffffffa2db9e0e
>> #7 [ffffc9005b213d18] alloc_super at ffffffffa2ebec49
>> #8 [ffffc9005b213d48] sget_fc at ffffffffa2ebff48
>
> .... this is exactly why you need to show your working, not just
> present your solution.
>
> That is, the prealloc_shrinker() API no longer exists in TOT. The way
> shrinkers are registered and run was significantly changed in 2023 by
> commit c42d50aefd17 ("mm: shrinker: add infrastructure for dynamically
> allocating shrinker").
>
> IOWs, the problem you are reporting may not even exist on TOT kernels.
> Have you tested your problematic workload on a TOT kernel, and if so,
> what were the results?
>
>> We used the perf tool to record CPU consumption. I have posted it in
>> the attachment. From the flame graph, we can see that clear_page_erms()
>> and memcpy() in expand_one_shrinker_info() are the main sources of
>> overhead.
>
> expand_one_shrinker_info() was somewhat simplified by the above
> series, so even if it is still an issue on TOT kernels, it still
> likely costs less than on old LTS kernels.
>
I ran tests based on stable kernel 6.12.76, which includes the above
patch series. The lock holding time of each expansion is less than ten
milliseconds.
Though we still traverse all memcgs, we only expand one unit of the map
and nr_deferred per expansion. The original code had to reallocate the
whole map and nr_deferred, so the size allocated in the old version is
much larger than in the new version.
Thanks!
>> Therefore, the more shrinkers and memcgs exist, the longer expanding
>> shrinker info takes. This is because when expanding shrinker info, we
>> traverse all memcgs and record all shrinkers for them.
>
> Only if the shrinker_info arrays need expanding. This is happening
> frequently because the expansion code only expands the shrinker info
> arrays by one ID at a time. Hence if you create a container that has
> half a dozen new filesystems mounted in it, the array is being
> expanded at least half a dozen times. Maybe multiples of this more,
> if the filesystem registers multiple memcg-aware shrinkers per
> mount...
>
> IOWs, we need to make the expansion case the rare slow path, not the
> common fast path here.
>
> It should be trivial to batch the array expansion (e.g. expand in
> multiples of 8/16/32 slots) and then shrinker instantiation overhead
> should scale down linearly with increasing expansion batch sizes.
>
> Such an improvement will reduce the memcg aware shrinker
> registration overhead on -all- configurations, not just the specific
> one you run....
>
>> However, with kmem disabled, memcg slab shrink only calls non-slab
>> shrinkers; that is to say, we only need to record non-slab shrinkers
>> for non-root memcgs. For the root memcg, we still need to record all
>> shrinkers because global shrink calls all of them.
>
> Yes, I know what your patch does - it should be clear that it does
> not address the root cause of the problem you are reporting, merely
> works around it for your specific use case.
>
> -Dave.
end of thread, other threads:[~2026-03-13 3:04 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-10 3:12 [PATCH V2 0/4] record non-slab shrinkers for non-root memcgs when kmem is disabled Haifeng Xu
2026-03-10 3:12 ` [PATCH V2 1/4] mm: shrinker: add one more parameter in shrinker_id() Haifeng Xu
2026-03-10 3:12 ` [PATCH V2 2/4] mm: shrinker: move shrinker_id() code block below memcg_kmem_online() Haifeng Xu
2026-03-10 3:12 ` [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem Haifeng Xu
2026-03-10 11:05 ` Usama Arif
2026-03-11 2:21 ` Haifeng Xu
2026-03-11 22:14 ` Dave Chinner
[not found] ` <bc08d009-fa43-44d0-880f-a37cc200a3b9@shopee.com>
2026-03-12 5:52 ` Dave Chinner
2026-03-13 3:04 ` Haifeng Xu
2026-03-10 3:12 ` [PATCH V2 4/4] mm: shrinker: remove unnecessary check in shrink_slab_memcg() Haifeng Xu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox