From: Dave Chinner <david@fromorbit.com>
To: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: akpm@linux-foundation.org, tkhai@ya.ru, vbabka@suse.cz,
	roman.gushchin@linux.dev, djwong@kernel.org, brauner@kernel.org,
	paulmck@kernel.org, tytso@mit.edu, steven.price@arm.com,
	cel@kernel.org, senozhatsky@chromium.org, yujie.liu@intel.com,
	gregkh@linuxfoundation.org, muchun.song@linux.dev,
	simon.horman@corigine.com, dlemoal@kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	kvm@vger.kernel.org, xen-devel@lists.xenproject.org,
	linux-erofs@lists.ozlabs.org,
	linux-f2fs-devel@lists.sourceforge.net, cluster-devel@redhat.com,
	linux-nfs@vger.kernel.org, linux-mtd@lists.infradead.org,
	rcu@vger.kernel.org, netdev@vger.kernel.org,
	dri-devel@lists.freedesktop.org, linux-arm-msm@vger.kernel.org,
	dm-devel@redhat.com, linux-raid@vger.kernel.org,
	linux-bcache@vger.kernel.org,
	virtualization@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-btrfs@vger.kernel.org,
	Muchun Song <songmuchun@bytedance.com>
Subject: Re: [PATCH v4 44/48] mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}
Date: Tue, 8 Aug 2023 12:12:02 +1000	[thread overview]
Message-ID: <ZNGkcp3Dh8hOiFpk@dread.disaster.area> (raw)
In-Reply-To: <20230807110936.21819-45-zhengqi.arch@bytedance.com>

On Mon, Aug 07, 2023 at 07:09:32PM +0800, Qi Zheng wrote:
> Currently, we maintain two linear arrays per node per memcg, which are
> shrinker_info::map and shrinker_info::nr_deferred. We need to resize them
> when shrinker_nr_max is exceeded, that is, allocate a new array, copy the
> old array into the new one, and finally free the old array via RCU.
> 
> For shrinker_info::map, we do set_bit() under the RCU lock, so we may set
> a bit in the old map which is about to be freed, and that update may then
> be lost. The current solution is not to copy the old map when resizing,
> but to set all the corresponding bits in the new map to 1. This solves the
> data loss problem, but brings the overhead of more pointless loops while
> doing memcg slab shrink.
> 
> For shrinker_info::nr_deferred, we only modify it under the read lock of
> shrinker_rwsem, so it does not run concurrently with the resizing. But
> after we make memcg slab shrink lockless, it will have the same data loss
> problem as shrinker_info::map, and we can't work around it the way we do
> for the map.
> 
> For such resizable arrays, the most straightforward idea is to convert
> them to an xarray, like we did for list_lru [1]. We would need to do
> xa_store() in the list_lru_add()-->set_shrinker_bit() path, but this can
> cause a memory allocation, and list_lru_add() doesn't accept failure. A
> possible solution is to pre-allocate, but it is not clear where that
> pre-allocation should be done.

So you implemented a two-level array that preallocates leaf
nodes to work around it? It's remarkably complex for what it does;
I can't help but think a radix tree using a special holder for
nr_deferred values of zero would end up being simpler...
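
To make that concrete - this is purely a sketch with a hypothetical
helper name, not code from this series - an xarray where an absent
entry stands in for a zero count would look something like:

#include <linux/xarray.h>

static long add_nr_deferred_xa(struct xarray *xa, int shrinker_id, long nr)
{
	void *entry;
	long cur;

	/* no entry means zero, so nothing is ever stored for idle shrinkers */
	xa_lock(xa);
	entry = xa_load(xa, shrinker_id);
	cur = (entry ? xa_to_value(entry) : 0) + nr;

	/*
	 * __xa_store() may still need to allocate a node here, which is
	 * exactly the allocation-in-set_shrinker_bit() problem the commit
	 * message above describes.
	 */
	if (cur)
		__xa_store(xa, shrinker_id, xa_mk_value(cur), GFP_NOWAIT);
	else
		__xa_erase(xa, shrinker_id);
	xa_unlock(xa);
	return cur;
}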

> Therefore, this commit chooses to introduce a secondary array for
> shrinker_info::{map, nr_deferred}, so that we only need to copy this
> secondary array every time the size is resized. Then even if we get the
> old secondary array under the RCU lock, the found map and nr_deferred are
> also true, so no data is lost.

I don't understand what you are trying to describe here. If we get
the old array, then don't we either get a stale nr_deferred value,
or lose the update we make, because the next shrinker lookup will
find the new array and so the deferred value stored in the old one
is never seen again?
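
For reference, the two-level layout being discussed looks roughly like
this. It is reconstructed from the hunks below - the struct definitions
themselves aren't in the quoted diff, so the exact fields are a guess:

/* sketch only; would live alongside the other definitions in linux/shrinker.h */
#define SHRINKER_UNIT_BITS	BITS_PER_LONG

struct shrinker_info_unit {
	atomic_long_t nr_deferred[SHRINKER_UNIT_BITS];
	DECLARE_BITMAP(map, SHRINKER_UNIT_BITS);
};

struct shrinker_info {
	struct rcu_head rcu;
	int map_nr_max;				/* number of shrinker IDs covered */
	struct shrinker_info_unit *unit[];	/* one unit per SHRINKER_UNIT_BITS IDs */
};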

> 
> [1]. https://lore.kernel.org/all/20220228122126.37293-13-songmuchun@bytedance.com/
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> ---
.....
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index a27779ed3798..1911c06b8af5 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -12,15 +12,50 @@ DECLARE_RWSEM(shrinker_rwsem);
>  #ifdef CONFIG_MEMCG
>  static int shrinker_nr_max;
>  
> -/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
> -static inline int shrinker_map_size(int nr_items)
> +static inline int shrinker_unit_size(int nr_items)
>  {
> -	return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> +	return (DIV_ROUND_UP(nr_items, SHRINKER_UNIT_BITS) * sizeof(struct shrinker_info_unit *));
>  }
>  
> -static inline int shrinker_defer_size(int nr_items)
> +static inline void shrinker_unit_free(struct shrinker_info *info, int start)
>  {
> -	return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
> +	struct shrinker_info_unit **unit;
> +	int nr, i;
> +
> +	if (!info)
> +		return;
> +
> +	unit = info->unit;
> +	nr = DIV_ROUND_UP(info->map_nr_max, SHRINKER_UNIT_BITS);
> +
> +	for (i = start; i < nr; i++) {
> +		if (!unit[i])
> +			break;
> +
> +		kvfree(unit[i]);
> +		unit[i] = NULL;
> +	}
> +}
> +
> +static inline int shrinker_unit_alloc(struct shrinker_info *new,
> +				       struct shrinker_info *old, int nid)
> +{
> +	struct shrinker_info_unit *unit;
> +	int nr = DIV_ROUND_UP(new->map_nr_max, SHRINKER_UNIT_BITS);
> +	int start = old ? DIV_ROUND_UP(old->map_nr_max, SHRINKER_UNIT_BITS) : 0;
> +	int i;
> +
> +	for (i = start; i < nr; i++) {
> +		unit = kvzalloc_node(sizeof(*unit), GFP_KERNEL, nid);

A unit is 576 bytes. Why is this using kvzalloc_node()?
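
i.e. a plain slab allocation would presumably be enough here (untested,
just to show what I mean):

		unit = kzalloc_node(sizeof(*unit), GFP_KERNEL, nid);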

> +		if (!unit) {
> +			shrinker_unit_free(new, start);
> +			return -ENOMEM;
> +		}
> +
> +		new->unit[i] = unit;
> +	}
> +
> +	return 0;
>  }
>  
>  void free_shrinker_info(struct mem_cgroup *memcg)
> @@ -32,6 +67,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
>  	for_each_node(nid) {
>  		pn = memcg->nodeinfo[nid];
>  		info = rcu_dereference_protected(pn->shrinker_info, true);
> +		shrinker_unit_free(info, 0);
>  		kvfree(info);
>  		rcu_assign_pointer(pn->shrinker_info, NULL);
>  	}

Why is this safe? The info and maps are looked up by RCU, so why is
freeing them without an RCU grace period expiring safe?

Yes, it was safe to do this when it was all under a semaphore, but
now the lookup and use are under RCU, so this freeing isn't
serialised against lookups anymore...
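
i.e. I'd expect the free to be deferred past a grace period, something
along these lines (a sketch only - the helper name is made up, and the
kvmalloc'd units would really want kvfree_rcu()/a workqueue rather than
being freed from the callback):

static void shrinker_info_free_rcu(struct rcu_head *head)
{
	struct shrinker_info *info = container_of(head, struct shrinker_info, rcu);

	/* NB: kvfree() of the units from softirq context is hand-waved here */
	shrinker_unit_free(info, 0);
	kvfree(info);
}

	...
	info = rcu_dereference_protected(pn->shrinker_info, true);
	rcu_assign_pointer(pn->shrinker_info, NULL);
	call_rcu(&info->rcu, shrinker_info_free_rcu);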


> @@ -40,28 +76,27 @@ void free_shrinker_info(struct mem_cgroup *memcg)
>  int alloc_shrinker_info(struct mem_cgroup *memcg)
>  {
>  	struct shrinker_info *info;
> -	int nid, size, ret = 0;
> -	int map_size, defer_size = 0;
> +	int nid, ret = 0;
> +	int array_size = 0;
>  
>  	down_write(&shrinker_rwsem);
> -	map_size = shrinker_map_size(shrinker_nr_max);
> -	defer_size = shrinker_defer_size(shrinker_nr_max);
> -	size = map_size + defer_size;
> +	array_size = shrinker_unit_size(shrinker_nr_max);
>  	for_each_node(nid) {
> -		info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
> -		if (!info) {
> -			free_shrinker_info(memcg);
> -			ret = -ENOMEM;
> -			break;
> -		}
> -		info->nr_deferred = (atomic_long_t *)(info + 1);
> -		info->map = (void *)info->nr_deferred + defer_size;
> +		info = kvzalloc_node(sizeof(*info) + array_size, GFP_KERNEL, nid);
> +		if (!info)
> +			goto err;
>  		info->map_nr_max = shrinker_nr_max;
> +		if (shrinker_unit_alloc(info, NULL, nid))
> +			goto err;

That's now going to do a lot of small memory allocations when we have
lots of shrinkers active....
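
(For example, assuming SHRINKER_UNIT_BITS is 64: with ~400 shrinker IDs
in use on a 4-node machine, alloc_shrinker_info() now does
4 * DIV_ROUND_UP(400, 64) = 28 unit allocations per memcg on top of the
4 info allocations, where previously it was just the 4 larger ones.)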

> @@ -150,17 +175,34 @@ static int expand_shrinker_info(int new_id)
>  	return ret;
>  }
>  
> +static inline int shriner_id_to_index(int shrinker_id)

shrinker_id_to_index

> +{
> +	return shrinker_id / SHRINKER_UNIT_BITS;
> +}
> +
> +static inline int shriner_id_to_offset(int shrinker_id)

shrinker_id_to_offset

> +{
> +	return shrinker_id % SHRINKER_UNIT_BITS;
> +}
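
(FWIW the mapping itself is clear enough - assuming SHRINKER_UNIT_BITS is
BITS_PER_LONG == 64, shrinker id 130 lands in unit[2] at offset 2.)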

....
> @@ -209,26 +251,31 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
>  				   struct mem_cgroup *memcg)
>  {
>  	struct shrinker_info *info;
> +	struct shrinker_info_unit *unit;
>  
>  	info = shrinker_info_protected(memcg, nid);
> -	return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
> +	unit = info->unit[shriner_id_to_index(shrinker->id)];
> +	return atomic_long_xchg(&unit->nr_deferred[shriner_id_to_offset(shrinker->id)], 0);
>  }
>  
>  static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
>  				  struct mem_cgroup *memcg)
>  {
>  	struct shrinker_info *info;
> +	struct shrinker_info_unit *unit;
>  
>  	info = shrinker_info_protected(memcg, nid);
> -	return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
> +	unit = info->unit[shriner_id_to_index(shrinker->id)];
> +	return atomic_long_add_return(nr, &unit->nr_deferred[shriner_id_to_offset(shrinker->id)]);
>  }
>  
>  void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  {
> -	int i, nid;
> +	int nid, index, offset;
>  	long nr;
>  	struct mem_cgroup *parent;
>  	struct shrinker_info *child_info, *parent_info;
> +	struct shrinker_info_unit *child_unit, *parent_unit;
>  
>  	parent = parent_mem_cgroup(memcg);
>  	if (!parent)
> @@ -239,9 +286,13 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  	for_each_node(nid) {
>  		child_info = shrinker_info_protected(memcg, nid);
>  		parent_info = shrinker_info_protected(parent, nid);
> -		for (i = 0; i < child_info->map_nr_max; i++) {
> -			nr = atomic_long_read(&child_info->nr_deferred[i]);
> -			atomic_long_add(nr, &parent_info->nr_deferred[i]);
> +		for (index = 0; index < shriner_id_to_index(child_info->map_nr_max); index++) {
> +			child_unit = child_info->unit[index];
> +			parent_unit = parent_info->unit[index];
> +			for (offset = 0; offset < SHRINKER_UNIT_BITS; offset++) {
> +				nr = atomic_long_read(&child_unit->nr_deferred[offset]);
> +				atomic_long_add(nr, &parent_unit->nr_deferred[offset]);
> +			}
>  		}
>  	}
>  	up_read(&shrinker_rwsem);
> @@ -407,7 +458,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  {
>  	struct shrinker_info *info;
>  	unsigned long ret, freed = 0;
> -	int i;
> +	int offset, index = 0;
>  
>  	if (!mem_cgroup_online(memcg))
>  		return 0;
> @@ -419,56 +470,63 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  	if (unlikely(!info))
>  		goto unlock;
>  
> -	for_each_set_bit(i, info->map, info->map_nr_max) {
> -		struct shrink_control sc = {
> -			.gfp_mask = gfp_mask,
> -			.nid = nid,
> -			.memcg = memcg,
> -		};
> -		struct shrinker *shrinker;
> +	for (; index < shriner_id_to_index(info->map_nr_max); index++) {
> +		struct shrinker_info_unit *unit;

This adds another layer of indentation to shrink_slab_memcg(). Please
factor it first so that the code ends up being readable. Doing that
first as a separate patch will also make the actual algorithm
changes in this patch much more obvious - this huge hunk of
diff is pretty much impossible to review...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 66+ messages
2023-08-07 11:08 [PATCH v4 00/48] use refcount+RCU method to implement lockless slab shrink Qi Zheng
2023-08-07 11:08 ` [PATCH v4 01/48] mm: move some shrinker-related function declarations to mm/internal.h Qi Zheng
2023-08-15  8:36   ` Muchun Song
2023-08-15  9:14     ` Qi Zheng
2023-08-07 11:08 ` [PATCH v4 02/48] mm: vmscan: move shrinker-related code into a separate file Qi Zheng
2023-08-15  8:44   ` Muchun Song
2023-08-07 11:08 ` [PATCH v4 03/48] mm: shrinker: remove redundant shrinker_rwsem in debugfs operations Qi Zheng
2023-08-07 11:08 ` [PATCH v4 04/48] mm: shrinker: add infrastructure for dynamically allocating shrinker Qi Zheng
2023-08-07 11:08 ` [PATCH v4 05/48] kvm: mmu: dynamically allocate the x86-mmu shrinker Qi Zheng
2023-08-07 11:08 ` [PATCH v4 06/48] binder: dynamically allocate the android-binder shrinker Qi Zheng
2023-08-08  2:30   ` Muchun Song
2023-08-07 11:08 ` [PATCH v4 07/48] drm/ttm: dynamically allocate the drm-ttm_pool shrinker Qi Zheng
2023-08-07 11:08 ` [PATCH v4 08/48] xenbus/backend: dynamically allocate the xen-backend shrinker Qi Zheng
2023-08-07 11:08 ` [PATCH v4 09/48] erofs: dynamically allocate the erofs-shrinker Qi Zheng
2023-08-07 11:08 ` [PATCH v4 10/48] f2fs: dynamically allocate the f2fs-shrinker Qi Zheng
2023-08-07 11:08 ` [PATCH v4 11/48] gfs2: dynamically allocate the gfs2-glock shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 12/48] gfs2: dynamically allocate the gfs2-qd shrinker Qi Zheng
2023-08-15  8:56   ` Muchun Song
2023-08-07 11:09 ` [PATCH v4 13/48] NFSv4.2: dynamically allocate the nfs-xattr shrinkers Qi Zheng
2023-08-07 11:09 ` [PATCH v4 14/48] nfs: dynamically allocate the nfs-acl shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 15/48] nfsd: dynamically allocate the nfsd-filecache shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 16/48] quota: dynamically allocate the dquota-cache shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 17/48] ubifs: dynamically allocate the ubifs-slab shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 18/48] rcu: dynamically allocate the rcu-lazy shrinker Qi Zheng
2023-08-07 14:16   ` Joel Fernandes
2023-08-07 11:09 ` [PATCH v4 19/48] rcu: dynamically allocate the rcu-kfree shrinker Qi Zheng
2023-08-07 14:14   ` Joel Fernandes
2023-08-08  2:28   ` Muchun Song
2023-08-07 11:09 ` [PATCH v4 20/48] mm: thp: dynamically allocate the thp-related shrinkers Qi Zheng
2023-08-07 11:09 ` [PATCH v4 21/48] sunrpc: dynamically allocate the sunrpc_cred shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 22/48] mm: workingset: dynamically allocate the mm-shadow shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 23/48] drm/i915: dynamically allocate the i915_gem_mm shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 24/48] drm/msm: dynamically allocate the drm-msm_gem shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 25/48] drm/panfrost: dynamically allocate the drm-panfrost shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 26/48] dm: dynamically allocate the dm-bufio shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 27/48] dm zoned: dynamically allocate the dm-zoned-meta shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 28/48] md/raid5: dynamically allocate the md-raid5 shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 29/48] bcache: dynamically allocate the md-bcache shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 30/48] vmw_balloon: dynamically allocate the vmw-balloon shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 31/48] virtio_balloon: dynamically allocate the virtio-balloon shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 32/48] mbcache: dynamically allocate the mbcache shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 33/48] ext4: dynamically allocate the ext4-es shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 34/48] jbd2,ext4: dynamically allocate the jbd2-journal shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 35/48] nfsd: dynamically allocate the nfsd-client shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 36/48] nfsd: dynamically allocate the nfsd-reply shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 37/48] xfs: dynamically allocate the xfs-buf shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 38/48] xfs: dynamically allocate the xfs-inodegc shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 39/48] xfs: dynamically allocate the xfs-qm shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 40/48] zsmalloc: dynamically allocate the mm-zspool shrinker Qi Zheng
2023-08-07 11:09 ` [PATCH v4 41/48] fs: super: dynamically allocate the s_shrink Qi Zheng
2023-08-07 11:09 ` [PATCH v4 42/48] mm: shrinker: remove old APIs Qi Zheng
2023-08-07 11:09 ` [PATCH v4 43/48] drm/ttm: introduce pool_shrink_rwsem Qi Zheng
2023-08-22 13:56   ` Daniel Vetter
2023-08-23  2:59     ` Qi Zheng
2023-08-07 11:09 ` [PATCH v4 44/48] mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred} Qi Zheng
2023-08-08  2:12   ` Dave Chinner [this message]
2023-08-08  6:32     ` Qi Zheng
2023-08-07 11:09 ` [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless Qi Zheng
2023-08-07 23:28   ` Dave Chinner
2023-08-08  2:24   ` Dave Chinner
2023-08-08  7:22     ` Qi Zheng
2023-08-07 11:09 ` [PATCH v4 46/48] mm: shrinker: make memcg " Qi Zheng
2023-08-08  2:44   ` Dave Chinner
2023-08-08  7:50     ` Qi Zheng
2023-08-07 11:09 ` [PATCH v4 47/48] mm: shrinker: hold write lock to reparent shrinker nr_deferred Qi Zheng
2023-08-07 11:09 ` [PATCH v4 48/48] mm: shrinker: convert shrinker_rwsem to mutex Qi Zheng
