* Re: [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
[not found] ` <20230807110936.21819-46-zhengqi.arch@bytedance.com>
@ 2023-08-07 23:28 ` Dave Chinner via Virtualization
2023-08-08 2:24 ` Dave Chinner via Virtualization
1 sibling, 0 replies; 5+ messages in thread
From: Dave Chinner via Virtualization @ 2023-08-07 23:28 UTC (permalink / raw)
To: Qi Zheng
Cc: kvm, djwong, roman.gushchin, dri-devel, virtualization, linux-mm,
dm-devel, linux-mtd, cel, x86, steven.price, cluster-devel,
simon.horman, xen-devel, linux-ext4, paulmck, linux-arm-msm,
linux-nfs, rcu, linux-bcache, dlemoal, yujie.liu, vbabka,
linux-raid, brauner, tytso, gregkh, muchun.song, linux-kernel,
linux-f2fs-devel, linux-xfs, senozhatsky, netdev, linux-fsdevel,
akpm, linux-erofs, linux-btrfs, tkhai
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in the shrinker subsystem,
> which protects most operations such as slab shrinking and the registration
> and unregistration of shrinkers. This can easily cause problems in the
> following cases.
....
> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab shrink is
> handled in the subsequent patch.
....
> ---
> include/linux/shrinker.h | 17 ++++++++++
> mm/shrinker.c | 70 +++++++++++++++++++++++++++++-----------
> 2 files changed, 68 insertions(+), 19 deletions(-)
There's no documentation in the code explaining how the lockless
shrinker algorithm works. It's left to the reader to work out how
this all goes together....
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>
> #include <linux/atomic.h>
> #include <linux/types.h>
> +#include <linux/refcount.h>
> +#include <linux/completion.h>
>
> #define SHRINKER_UNIT_BITS BITS_PER_LONG
>
> @@ -87,6 +89,10 @@ struct shrinker {
> int seeks; /* seeks to recreate an obj */
> unsigned flags;
>
> + refcount_t refcount;
> + struct completion done;
> + struct rcu_head rcu;
What does the refcount protect, why do we need the completion, etc?
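For example, the kind of thing being asked for might look like this (my
wording, inferred from how these fields appear to be used elsewhere in
the patch, not copied from the series):

	/*
	 * refcount: pins the shrinker while do_shrink_slab() runs so the
	 *           instance cannot be torn down mid-invocation; taken via
	 *           shrinker_try_get(), dropped via shrinker_put().
	 * done:     completed when the refcount drops to zero, allowing
	 *           shrinker_free() to wait for all in-flight invocations
	 *           to drain before unlinking and freeing the shrinker.
	 * rcu:      presumably used to defer the final free past a grace
	 *           period so the lockless shrinker_list walk stays safe.
	 */
	refcount_t refcount;
	struct completion done;
	struct rcu_head rcu;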
> +
> void *private_data;
>
> /* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
> void shrinker_register(struct shrinker *shrinker);
> void shrinker_free(struct shrinker *shrinker);
>
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> + return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> + if (refcount_dec_and_test(&shrinker->refcount))
> + complete(&shrinker->done);
> +}
> +
> #ifdef CONFIG_SHRINKER_DEBUG
> extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
> const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
> #include <linux/memcontrol.h>
> #include <linux/rwsem.h>
> #include <linux/shrinker.h>
> +#include <linux/rculist.h>
> #include <trace/events/vmscan.h>
>
> #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
> - if (!down_read_trylock(&shrinker_rwsem))
> - goto out;
> -
> - list_for_each_entry(shrinker, &shrinker_list, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> struct shrink_control sc = {
> .gfp_mask = gfp_mask,
> .nid = nid,
> .memcg = memcg,
> };
>
> + if (!shrinker_try_get(shrinker))
> + continue;
> +
> + /*
> + * We can safely unlock the RCU lock here since we already
> + * hold the refcount of the shrinker.
> + */
> + rcu_read_unlock();
> +
> ret = do_shrink_slab(&sc, shrinker, priority);
> if (ret == SHRINK_EMPTY)
> ret = 0;
> freed += ret;
> +
> /*
> - * Bail out if someone want to register a new shrinker to
> - * prevent the registration from being stalled for long periods
> - * by parallel ongoing shrinking.
> + * This shrinker may be deleted from shrinker_list and freed
> + * after the shrinker_put() below, but this shrinker is still
> + * used for the next traversal. So it is necessary to hold the
> + * RCU lock first to prevent this shrinker from being freed,
> + * which also ensures that the next shrinker that is traversed
> + * will not be freed (even if it is deleted from shrinker_list
> + * at the same time).
> */
This comment really should be at the head of the function, describing
the algorithm used within the function itself, i.e. how reference
counts are used w.r.t. the rcu_read_lock() usage to guarantee the
existence of the shrinker and the validity of the list walk.
I'm not going to remember all these little details when I look at
this code in another six months' time, and having to work it out from
first principles every time I look at the code will waste a lot
of time...
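Something along these lines at the top of the list walk would go a
long way (a rough sketch only, from my reading of this patch):

	/*
	 * Lockless shrinker_list walk:
	 *
	 * - rcu_read_lock() guarantees the shrinker we are looking at (and
	 *   its ->list.next pointer) cannot be freed out from under us.
	 * - shrinker_try_get() then pins the shrinker, so the RCU lock can
	 *   be dropped across the potentially long do_shrink_slab() call.
	 * - rcu_read_lock() is re-taken before shrinker_put(), so even if
	 *   the put drops the last reference the shrinker cannot be freed
	 *   until we have advanced to the next list entry.
	 */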
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v4 44/48] mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}
[not found] ` <20230807110936.21819-45-zhengqi.arch@bytedance.com>
@ 2023-08-08 2:12 ` Dave Chinner via Virtualization
0 siblings, 0 replies; 5+ messages in thread
From: Dave Chinner via Virtualization @ 2023-08-08 2:12 UTC (permalink / raw)
To: Qi Zheng
Cc: kvm, djwong, roman.gushchin, dri-devel, virtualization, linux-mm,
dm-devel, linux-mtd, cel, x86, steven.price, cluster-devel,
simon.horman, xen-devel, linux-ext4, paulmck, linux-arm-msm,
linux-nfs, rcu, linux-bcache, dlemoal, Muchun Song, yujie.liu,
vbabka, linux-raid, brauner, tytso, gregkh, muchun.song,
linux-kernel, linux-f2fs-devel, linux-xfs, senozhatsky, netdev,
linux-fsdevel, akpm, linux-erofs, linux-btrfs, tkhai
On Mon, Aug 07, 2023 at 07:09:32PM +0800, Qi Zheng wrote:
> Currently, we maintain two linear arrays per node per memcg, which are
> shrinker_info::map and shrinker_info::nr_deferred. And we need to resize
> them when the shrinker_nr_max is exceeded, that is, allocate a new array,
> and then copy the old array to the new array, and finally free the old
> array by RCU.
>
> For shrinker_info::map, we do set_bit() under the RCU lock, so we may set
> the value into the old map which is about to be freed. This may cause the
> value set to be lost. The current solution is not to copy the old map when
> resizing, but to set all the corresponding bits in the new map to 1. This
> solves the data loss problem, but brings the overhead of more pointless
> loops while doing memcg slab shrinking.
>
> For shrinker_info::nr_deferred, we will only modify it under the read lock
> of shrinker_rwsem, so it will not run concurrently with the resizing. But
> after we make memcg slab shrink lockless, there will be the same data loss
> problem as shrinker_info::map, and we can't work around it like the map.
>
> For such resizable arrays, the most straightforward idea is to change it
> to xarray, like we did for list_lru [1]. We need to do xa_store() in the
> list_lru_add()-->set_shrinker_bit(), but this will cause memory
> allocation, and the list_lru_add() doesn't accept failure. A possible
> solution is to pre-allocate, but the location of pre-allocation is not
> well determined.
So you implemented a two-level array that preallocates leaf
nodes to work around it? It's remarkably complex for what it does;
I can't help but think a radix tree using a special holder for
nr_deferred values of zero would end up being simpler...
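For comparison, the lookup side of that would be roughly this (purely
illustrative - lookup_nr_deferred() and the nr_deferred_xa xarray are
my own invention, not anything in this series):

	/* an absent xarray entry simply means a deferred count of zero */
	static long lookup_nr_deferred(struct shrinker_info *info, int id)
	{
		atomic_long_t *nr = xa_load(&info->nr_deferred_xa, id);

		return nr ? atomic_long_read(nr) : 0;
	}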
> Therefore, this commit chooses to introduce a secondary array for
> shrinker_info::{map, nr_deferred}, so that we only need to copy this
> secondary array every time the size is resized. Then even if we get the
> old secondary array under the RCU lock, the found map and nr_deferred are
> also true, so no data is lost.
I don't understand what you are trying to describe here. If we get
the old array, then don't we either get a stale nr_deferred value,
or have the update we do get lost because the next shrinker lookup
will find the new array, so the deferred value stored to the old one
is never seen again?
>
> [1]. https://lore.kernel.org/all/20220228122126.37293-13-songmuchun@bytedance.com/
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> ---
.....
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index a27779ed3798..1911c06b8af5 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -12,15 +12,50 @@ DECLARE_RWSEM(shrinker_rwsem);
> #ifdef CONFIG_MEMCG
> static int shrinker_nr_max;
>
> -/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
> -static inline int shrinker_map_size(int nr_items)
> +static inline int shrinker_unit_size(int nr_items)
> {
> - return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> + return (DIV_ROUND_UP(nr_items, SHRINKER_UNIT_BITS) * sizeof(struct shrinker_info_unit *));
> }
>
> -static inline int shrinker_defer_size(int nr_items)
> +static inline void shrinker_unit_free(struct shrinker_info *info, int start)
> {
> - return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
> + struct shrinker_info_unit **unit;
> + int nr, i;
> +
> + if (!info)
> + return;
> +
> + unit = info->unit;
> + nr = DIV_ROUND_UP(info->map_nr_max, SHRINKER_UNIT_BITS);
> +
> + for (i = start; i < nr; i++) {
> + if (!unit[i])
> + break;
> +
> + kvfree(unit[i]);
> + unit[i] = NULL;
> + }
> +}
> +
> +static inline int shrinker_unit_alloc(struct shrinker_info *new,
> + struct shrinker_info *old, int nid)
> +{
> + struct shrinker_info_unit *unit;
> + int nr = DIV_ROUND_UP(new->map_nr_max, SHRINKER_UNIT_BITS);
> + int start = old ? DIV_ROUND_UP(old->map_nr_max, SHRINKER_UNIT_BITS) : 0;
> + int i;
> +
> + for (i = start; i < nr; i++) {
> + unit = kvzalloc_node(sizeof(*unit), GFP_KERNEL, nid);
A unit is 576 bytes. Why is this using kvzalloc_node()?
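An allocation of that size will never need the vmalloc fallback, so
presumably this just wants to be (my reading of the objection, not a
snippet from the series):

		unit = kzalloc_node(sizeof(*unit), GFP_KERNEL, nid);

with a matching kfree() in shrinker_unit_free().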
> + if (!unit) {
> + shrinker_unit_free(new, start);
> + return -ENOMEM;
> + }
> +
> + new->unit[i] = unit;
> + }
> +
> + return 0;
> }
>
> void free_shrinker_info(struct mem_cgroup *memcg)
> @@ -32,6 +67,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
> for_each_node(nid) {
> pn = memcg->nodeinfo[nid];
> info = rcu_dereference_protected(pn->shrinker_info, true);
> + shrinker_unit_free(info, 0);
> kvfree(info);
> rcu_assign_pointer(pn->shrinker_info, NULL);
> }
Why is this safe? The info and maps are looked up under RCU, so why
is it safe to free them without an RCU grace period expiring?
Yes, it was safe to do this when it was all under a semaphore, but
now the lookup and use are under RCU, so this freeing isn't
serialised against lookups anymore...
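If the lookup side is now RCU-only, the implied shape of the fix is
something like this (a sketch; it assumes shrinker_info embeds an
rcu_head, which isn't visible in this hunk):

		info = rcu_dereference_protected(pn->shrinker_info, true);
		rcu_assign_pointer(pn->shrinker_info, NULL);
		/*
		 * Defer the free until a grace period has elapsed so that
		 * concurrent RCU lookups can't trip over freed memory; the
		 * per-unit allocations would need the same treatment.
		 */
		kvfree_rcu(info, rcu);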
> @@ -40,28 +76,27 @@ void free_shrinker_info(struct mem_cgroup *memcg)
> int alloc_shrinker_info(struct mem_cgroup *memcg)
> {
> struct shrinker_info *info;
> - int nid, size, ret = 0;
> - int map_size, defer_size = 0;
> + int nid, ret = 0;
> + int array_size = 0;
>
> down_write(&shrinker_rwsem);
> - map_size = shrinker_map_size(shrinker_nr_max);
> - defer_size = shrinker_defer_size(shrinker_nr_max);
> - size = map_size + defer_size;
> + array_size = shrinker_unit_size(shrinker_nr_max);
> for_each_node(nid) {
> - info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
> - if (!info) {
> - free_shrinker_info(memcg);
> - ret = -ENOMEM;
> - break;
> - }
> - info->nr_deferred = (atomic_long_t *)(info + 1);
> - info->map = (void *)info->nr_deferred + defer_size;
> + info = kvzalloc_node(sizeof(*info) + array_size, GFP_KERNEL, nid);
> + if (!info)
> + goto err;
> info->map_nr_max = shrinker_nr_max;
> + if (shrinker_unit_alloc(info, NULL, nid))
> + goto err;
That's now going to do a lot of small memory allocations when we have
lots of shrinkers active....
> @@ -150,17 +175,34 @@ static int expand_shrinker_info(int new_id)
> return ret;
> }
>
> +static inline int shriner_id_to_index(int shrinker_id)
shrinker_id_to_index
> +{
> + return shrinker_id / SHRINKER_UNIT_BITS;
> +}
> +
> +static inline int shriner_id_to_offset(int shrinker_id)
shrinker_id_to_offset
> +{
> + return shrinker_id % SHRINKER_UNIT_BITS;
> +}
....
> @@ -209,26 +251,31 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
> struct mem_cgroup *memcg)
> {
> struct shrinker_info *info;
> + struct shrinker_info_unit *unit;
>
> info = shrinker_info_protected(memcg, nid);
> - return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
> + unit = info->unit[shriner_id_to_index(shrinker->id)];
> + return atomic_long_xchg(&unit->nr_deferred[shriner_id_to_offset(shrinker->id)], 0);
> }
>
> static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
> struct mem_cgroup *memcg)
> {
> struct shrinker_info *info;
> + struct shrinker_info_unit *unit;
>
> info = shrinker_info_protected(memcg, nid);
> - return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
> + unit = info->unit[shriner_id_to_index(shrinker->id)];
> + return atomic_long_add_return(nr, &unit->nr_deferred[shriner_id_to_offset(shrinker->id)]);
> }
>
> void reparent_shrinker_deferred(struct mem_cgroup *memcg)
> {
> - int i, nid;
> + int nid, index, offset;
> long nr;
> struct mem_cgroup *parent;
> struct shrinker_info *child_info, *parent_info;
> + struct shrinker_info_unit *child_unit, *parent_unit;
>
> parent = parent_mem_cgroup(memcg);
> if (!parent)
> @@ -239,9 +286,13 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
> for_each_node(nid) {
> child_info = shrinker_info_protected(memcg, nid);
> parent_info = shrinker_info_protected(parent, nid);
> - for (i = 0; i < child_info->map_nr_max; i++) {
> - nr = atomic_long_read(&child_info->nr_deferred[i]);
> - atomic_long_add(nr, &parent_info->nr_deferred[i]);
> + for (index = 0; index < shriner_id_to_index(child_info->map_nr_max); index++) {
> + child_unit = child_info->unit[index];
> + parent_unit = parent_info->unit[index];
> + for (offset = 0; offset < SHRINKER_UNIT_BITS; offset++) {
> + nr = atomic_long_read(&child_unit->nr_deferred[offset]);
> + atomic_long_add(nr, &parent_unit->nr_deferred[offset]);
> + }
> }
> }
> up_read(&shrinker_rwsem);
> @@ -407,7 +458,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> {
> struct shrinker_info *info;
> unsigned long ret, freed = 0;
> - int i;
> + int offset, index = 0;
>
> if (!mem_cgroup_online(memcg))
> return 0;
> @@ -419,56 +470,63 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> if (unlikely(!info))
> goto unlock;
>
> - for_each_set_bit(i, info->map, info->map_nr_max) {
> - struct shrink_control sc = {
> - .gfp_mask = gfp_mask,
> - .nid = nid,
> - .memcg = memcg,
> - };
> - struct shrinker *shrinker;
> + for (; index < shriner_id_to_index(info->map_nr_max); index++) {
> + struct shrinker_info_unit *unit;
This adds another layer of indentation to shrink_slab_memcg(). Please
factor it first so that the code ends up being readable. Doing that
first as a separate patch will also make the actual algorithm
changes in this patch much more obvious - this huge hunk of
diff is pretty much impossible to review...
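For example, pulling the per-unit inner loop out into something like
the following (the name and signature are mine, purely to illustrate
the shape of the refactoring):

	static unsigned long shrink_slab_memcg_unit(struct shrinker_info_unit *unit,
						    int index, int nid, int priority,
						    gfp_t gfp_mask,
						    struct mem_cgroup *memcg);

so that shrink_slab_memcg() itself only walks the top-level array.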
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
[not found] ` <20230807110936.21819-46-zhengqi.arch@bytedance.com>
2023-08-07 23:28 ` [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless Dave Chinner via Virtualization
@ 2023-08-08 2:24 ` Dave Chinner via Virtualization
1 sibling, 0 replies; 5+ messages in thread
From: Dave Chinner via Virtualization @ 2023-08-08 2:24 UTC (permalink / raw)
To: Qi Zheng
Cc: kvm, djwong, roman.gushchin, dri-devel, virtualization, linux-mm,
dm-devel, linux-mtd, cel, x86, steven.price, cluster-devel,
simon.horman, xen-devel, linux-ext4, paulmck, linux-arm-msm,
linux-nfs, rcu, linux-bcache, dlemoal, yujie.liu, vbabka,
linux-raid, brauner, tytso, gregkh, muchun.song, linux-kernel,
linux-f2fs-devel, linux-xfs, senozhatsky, netdev, linux-fsdevel,
akpm, linux-erofs, linux-btrfs, tkhai
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>
> #include <linux/atomic.h>
> #include <linux/types.h>
> +#include <linux/refcount.h>
> +#include <linux/completion.h>
>
> #define SHRINKER_UNIT_BITS BITS_PER_LONG
>
> @@ -87,6 +89,10 @@ struct shrinker {
> int seeks; /* seeks to recreate an obj */
> unsigned flags;
>
> + refcount_t refcount;
> + struct completion done;
> + struct rcu_head rcu;
Documentation, please. What does the refcount protect, what does the
completion provide, etc.
> +
> void *private_data;
>
> /* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
> void shrinker_register(struct shrinker *shrinker);
> void shrinker_free(struct shrinker *shrinker);
>
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> + return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> + if (refcount_dec_and_test(&shrinker->refcount))
> + complete(&shrinker->done);
> +}
> +
> #ifdef CONFIG_SHRINKER_DEBUG
> extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
> const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
> #include <linux/memcontrol.h>
> #include <linux/rwsem.h>
> #include <linux/shrinker.h>
> +#include <linux/rculist.h>
> #include <trace/events/vmscan.h>
>
> #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
> - if (!down_read_trylock(&shrinker_rwsem))
> - goto out;
> -
> - list_for_each_entry(shrinker, &shrinker_list, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> struct shrink_control sc = {
> .gfp_mask = gfp_mask,
> .nid = nid,
> .memcg = memcg,
> };
>
> + if (!shrinker_try_get(shrinker))
> + continue;
> +
> + /*
> + * We can safely unlock the RCU lock here since we already
> + * hold the refcount of the shrinker.
> + */
> + rcu_read_unlock();
> +
> ret = do_shrink_slab(&sc, shrinker, priority);
> if (ret == SHRINK_EMPTY)
> ret = 0;
> freed += ret;
> +
> /*
> - * Bail out if someone want to register a new shrinker to
> - * prevent the registration from being stalled for long periods
> - * by parallel ongoing shrinking.
> + * This shrinker may be deleted from shrinker_list and freed
> + * after the shrinker_put() below, but this shrinker is still
> + * used for the next traversal. So it is necessary to hold the
> + * RCU lock first to prevent this shrinker from being freed,
> + * which also ensures that the next shrinker that is traversed
> + * will not be freed (even if it is deleted from shrinker_list
> + * at the same time).
> */
This needs to be moved to the head of the function, where it should
document the whole list-walk, get, put and completion parts of the
algorithm that make it safe. There's more to this than "we hold a
reference count", especially the tricky "we might see the shrinker
before it is fully initialised" case....
.....
> void shrinker_free(struct shrinker *shrinker)
> {
> struct dentry *debugfs_entry = NULL;
> @@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker)
> if (!shrinker)
> return;
>
> + if (shrinker->flags & SHRINKER_REGISTERED) {
> + shrinker_put(shrinker);
> + wait_for_completion(&shrinker->done);
> + }
Needs a comment explaining why we need to wait here...
> +
> down_write(&shrinker_rwsem);
> if (shrinker->flags & SHRINKER_REGISTERED) {
> - list_del(&shrinker->list);
> + /*
> + * Lookups on the shrinker are over and will fail in the future,
> + * so we can now remove it from the lists and free it.
> + */
.... rather than here, after the wait has been done and has provided
the guarantee that no shrinker is running or will run again...
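For instance, something like this above the wait (my wording, based on
how I understand the lifecycle in this patch):

	if (shrinker->flags & SHRINKER_REGISTERED) {
		/*
		 * Drop the initial reference and wait for every in-flight
		 * do_shrink_slab() invocation (each holding a reference
		 * from shrinker_try_get()) to call shrinker_put().  Once
		 * ->done completes, no new lookup can succeed and the
		 * shrinker can be unlinked and freed safely.
		 */
		shrinker_put(shrinker);
		wait_for_completion(&shrinker->done);
	}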
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v4 46/48] mm: shrinker: make memcg slab shrink lockless
[not found] ` <20230807110936.21819-47-zhengqi.arch@bytedance.com>
@ 2023-08-08 2:44 ` Dave Chinner via Virtualization
0 siblings, 0 replies; 5+ messages in thread
From: Dave Chinner via Virtualization @ 2023-08-08 2:44 UTC (permalink / raw)
To: Qi Zheng
Cc: kvm, djwong, roman.gushchin, dri-devel, virtualization, linux-mm,
dm-devel, linux-mtd, cel, x86, steven.price, cluster-devel,
simon.horman, xen-devel, linux-ext4, paulmck, linux-arm-msm,
linux-nfs, rcu, linux-bcache, dlemoal, yujie.liu, vbabka,
linux-raid, brauner, tytso, gregkh, muchun.song, linux-kernel,
linux-f2fs-devel, linux-xfs, senozhatsky, netdev, linux-fsdevel,
akpm, linux-erofs, linux-btrfs, tkhai
On Mon, Aug 07, 2023 at 07:09:34PM +0800, Qi Zheng wrote:
> Like global slab shrink, this commit also uses refcount+RCU method to make
> memcg slab shrink lockless.
This patch does random code cleanups amongst the actual RCU changes.
Can you please move the cleanups to a separate patch to reduce the
noise in this one?
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index d318f5621862..fee6f62904fb 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -107,6 +107,12 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> lockdep_is_held(&shrinker_rwsem));
> }
>
> +static struct shrinker_info *shrinker_info_rcu(struct mem_cgroup *memcg,
> + int nid)
> +{
> + return rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> +}
This helper doesn't add value. It doesn't tell me that
rcu_read_lock() needs to be held when it is called, for one....
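At a minimum the constraint should be spelled out on the helper
itself, e.g. (sketch only):

	/* Caller must be inside an RCU read-side critical section. */
	static struct shrinker_info *shrinker_info_rcu(struct mem_cgroup *memcg,
						       int nid)
	{
		return rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
	}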
> static int expand_one_shrinker_info(struct mem_cgroup *memcg, int new_size,
> int old_size, int new_nr_max)
> {
> @@ -198,7 +204,7 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> struct shrinker_info_unit *unit;
>
> rcu_read_lock();
> - info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> + info = shrinker_info_rcu(memcg, nid);
... whilst the original code here was obviously correct.
> unit = info->unit[shriner_id_to_index(shrinker_id)];
> if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
> /* Pairs with smp mb in shrink_slab() */
> @@ -211,7 +217,7 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>
> static DEFINE_IDR(shrinker_idr);
>
> -static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> +static int shrinker_memcg_alloc(struct shrinker *shrinker)
Cleanups in a separate patch.
> @@ -253,10 +258,15 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
> {
> struct shrinker_info *info;
> struct shrinker_info_unit *unit;
> + long nr_deferred;
>
> - info = shrinker_info_protected(memcg, nid);
> + rcu_read_lock();
> + info = shrinker_info_rcu(memcg, nid);
> unit = info->unit[shriner_id_to_index(shrinker->id)];
> - return atomic_long_xchg(&unit->nr_deferred[shriner_id_to_offset(shrinker->id)], 0);
> + nr_deferred = atomic_long_xchg(&unit->nr_deferred[shriner_id_to_offset(shrinker->id)], 0);
> + rcu_read_unlock();
> +
> + return nr_deferred;
> }
This adds two rcu_read_lock() sections to every call to
do_shrink_slab(). It's not at all clear from any of the other code
that do_shrink_slab() now has internal rcu_read_lock() sections....
> @@ -464,18 +480,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> if (!mem_cgroup_online(memcg))
> return 0;
>
> - if (!down_read_trylock(&shrinker_rwsem))
> - return 0;
> -
> - info = shrinker_info_protected(memcg, nid);
> +again:
> + rcu_read_lock();
> + info = shrinker_info_rcu(memcg, nid);
> if (unlikely(!info))
> goto unlock;
>
> - for (; index < shriner_id_to_index(info->map_nr_max); index++) {
> + if (index < shriner_id_to_index(info->map_nr_max)) {
> struct shrinker_info_unit *unit;
>
> unit = info->unit[index];
>
> + /*
> + * The shrinker_info_unit will not be freed, so we can
> + * safely release the RCU lock here.
> + */
> + rcu_read_unlock();
Why - what guarantees that the shrinker_info_unit exists at this
point? We hold no reference to it, we hold no reference to any
shrinker, etc. What provides this existence guarantee?
> +
> for_each_set_bit(offset, unit->map, SHRINKER_UNIT_BITS) {
> struct shrink_control sc = {
> .gfp_mask = gfp_mask,
> @@ -485,12 +506,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> struct shrinker *shrinker;
> int shrinker_id = calc_shrinker_id(index, offset);
>
> + rcu_read_lock();
> shrinker = idr_find(&shrinker_idr, shrinker_id);
> - if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
> - if (!shrinker)
> - clear_bit(offset, unit->map);
> + if (unlikely(!shrinker || !shrinker_try_get(shrinker))) {
> + clear_bit(offset, unit->map);
> + rcu_read_unlock();
> continue;
> }
> + rcu_read_unlock();
>
> /* Call non-slab shrinkers even though kmem is disabled */
> if (!memcg_kmem_online() &&
> @@ -523,15 +546,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> set_shrinker_bit(memcg, nid, shrinker_id);
> }
> freed += ret;
> -
> - if (rwsem_is_contended(&shrinker_rwsem)) {
> - freed = freed ? : 1;
> - goto unlock;
> - }
> + shrinker_put(shrinker);
Ok, so why is this safe to call without holding the rcu read lock?
The global shrinker has to hold the rcu_read_lock() whilst calling
shrinker_put() to guarantee the validity of the list next pointer,
but we don't hold off RCU here, so what guarantees that a racing
global shrinker walk doesn't trip over this shrinker_put() call
dropping the refcount to zero, with the freeing occurring in a
different context...
> + /*
> + * We have already exited the read-side of rcu critical section
> + * before calling do_shrink_slab(), the shrinker_info may be
> + * released in expand_one_shrinker_info(), so reacquire the
> + * shrinker_info.
> + */
> + index++;
> + goto again;
With that, what makes the use of shrinker_info in
xchg_nr_deferred_memcg() in do_shrink_slab() coherent and valid?
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v4 43/48] drm/ttm: introduce pool_shrink_rwsem
[not found] ` <20230807110936.21819-44-zhengqi.arch@bytedance.com>
@ 2023-08-22 13:56 ` Daniel Vetter
0 siblings, 0 replies; 5+ messages in thread
From: Daniel Vetter @ 2023-08-22 13:56 UTC (permalink / raw)
To: Qi Zheng
Cc: kvm, djwong, roman.gushchin, david, dri-devel, virtualization,
linux-mm, dm-devel, linux-mtd, cel, x86, steven.price,
cluster-devel, simon.horman, xen-devel, linux-ext4, paulmck,
linux-arm-msm, linux-nfs, rcu, linux-bcache, dlemoal, Muchun Song,
yujie.liu, vbabka, linux-raid, brauner, tytso, gregkh,
muchun.song, linux-kernel, linux-f2fs-devel, linux-xfs,
senozhatsky, netdev, linux-fsdevel, akpm, linux-erofs,
linux-btrfs, tkhai
On Mon, Aug 07, 2023 at 07:09:31PM +0800, Qi Zheng wrote:
> Currently, the synchronize_shrinkers() is only used by TTM pool. It only
> requires that no shrinkers run in parallel.
>
> After we use RCU+refcount method to implement the lockless slab shrink,
> we can not use shrinker_rwsem or synchronize_rcu() to guarantee that all
> shrinker invocations have seen an update before freeing memory.
>
> So we introduce a new pool_shrink_rwsem to implement a private
> synchronize_shrinkers(), so as to achieve the same purpose.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
On the 5 drm patches (I counted 2 ttm and 3 drivers) for merging through
some other tree (since I'm assuming that's how this will land):
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
> drivers/gpu/drm/ttm/ttm_pool.c | 15 +++++++++++++++
> include/linux/shrinker.h | 2 --
> mm/shrinker.c | 15 ---------------
> 3 files changed, 15 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index c9c9618c0dce..38b4c280725c 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -74,6 +74,7 @@ static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1];
> static spinlock_t shrinker_lock;
> static struct list_head shrinker_list;
> static struct shrinker *mm_shrinker;
> +static DECLARE_RWSEM(pool_shrink_rwsem);
>
> /* Allocate pages of size 1 << order with the given gfp_flags */
> static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> @@ -317,6 +318,7 @@ static unsigned int ttm_pool_shrink(void)
> unsigned int num_pages;
> struct page *p;
>
> + down_read(&pool_shrink_rwsem);
> spin_lock(&shrinker_lock);
> pt = list_first_entry(&shrinker_list, typeof(*pt), shrinker_list);
> list_move_tail(&pt->shrinker_list, &shrinker_list);
> @@ -329,6 +331,7 @@ static unsigned int ttm_pool_shrink(void)
> } else {
> num_pages = 0;
> }
> + up_read(&pool_shrink_rwsem);
>
> return num_pages;
> }
> @@ -572,6 +575,18 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
> }
> EXPORT_SYMBOL(ttm_pool_init);
>
> +/**
> + * synchronize_shrinkers - Wait for all running shrinkers to complete.
> + *
> + * This is useful to guarantee that all shrinker invocations have seen an
> + * update, before freeing memory, similar to rcu.
> + */
> +static void synchronize_shrinkers(void)
> +{
> + down_write(&pool_shrink_rwsem);
> + up_write(&pool_shrink_rwsem);
> +}
> +
> /**
> * ttm_pool_fini - Cleanup a pool
> *
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index c55c07c3f0cb..025c8070dd86 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -103,8 +103,6 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
> void shrinker_register(struct shrinker *shrinker);
> void shrinker_free(struct shrinker *shrinker);
>
> -extern void synchronize_shrinkers(void);
> -
> #ifdef CONFIG_SHRINKER_DEBUG
> extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
> const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 3ab301ff122d..a27779ed3798 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -650,18 +650,3 @@ void shrinker_free(struct shrinker *shrinker)
> kfree(shrinker);
> }
> EXPORT_SYMBOL_GPL(shrinker_free);
> -
> -/**
> - * synchronize_shrinkers - Wait for all running shrinkers to complete.
> - *
> - * This is equivalent to calling unregister_shrink() and register_shrinker(),
> - * but atomically and with less overhead. This is useful to guarantee that all
> - * shrinker invocations have seen an update, before freeing memory, similar to
> - * rcu.
> - */
> -void synchronize_shrinkers(void)
> -{
> - down_write(&shrinker_rwsem);
> - up_write(&shrinker_rwsem);
> -}
> -EXPORT_SYMBOL(synchronize_shrinkers);
> --
> 2.30.2
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch