linux-mm.kvack.org archive mirror
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Alan Huang <mmpgouride@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	akpm@linux-foundation.org, Dave Chinner <david@fromorbit.com>,
	tkhai@ya.ru, roman.gushchin@linux.dev,
	"Darrick J. Wong" <djwong@kernel.org>,
	brauner@kernel.org, paulmck@kernel.org, tytso@mit.edu,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-arm-msm@vger.kernel.org, dm-devel@redhat.com,
	linux-raid@vger.kernel.org, linux-bcache@vger.kernel.org,
	virtualization@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-nfs@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
Date: Fri, 23 Jun 2023 02:18:14 +0800
Message-ID: <a66f4fa5-e614-9dd6-b5fb-fb1189322840@bytedance.com>
In-Reply-To: <43CEA22D-3FF5-40CB-BF07-0FB9829EF778@gmail.com>



On 2023/6/23 01:41, Alan Huang wrote:
> 
>> 2023年6月23日 上午12:42,Qi Zheng <zhengqi.arch@bytedance.com> 写道:
>>
>>
>>
>> On 2023/6/22 23:12, Vlastimil Babka wrote:
>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>> The shrinker_rwsem is a global read-write lock in the
>>>> shrinker subsystem, which protects most operations
>>>> such as slab shrink, registration and unregistration
>>>> of shrinkers, etc. This can easily cause problems in
>>>> the following cases.
>>>>
>>>> 1) When the memory pressure is high and there are many
>>>>     filesystems mounted or unmounted at the same time,
>>>>     slab shrink will be affected (down_read_trylock()
>>>>     failed).
>>>>
>>>>     Such as the real workload mentioned by Kirill Tkhai:
>>>>
>>>>     ```
>>>>     One of the real workloads from my experience is start
>>>>     of an overcommitted node containing many starting
>>>>     containers after node crash (or many resuming containers
>>>>     after reboot for kernel update). In these cases memory
>>>>     pressure is huge, and the node goes round in long reclaim.
>>>>     ```
>>>>
>>>> 2) If a shrinker is blocked (such as the case mentioned
>>>>     in [1]) and a writer comes in (such as mounting a fs),
>>>>     then this writer will be blocked and cause all
>>>>     subsequent shrinker-related operations to be blocked.
>>>>
>>>> Even if there is no competitor when shrinking slab, there
>>>> may still be a problem. If we have a long shrinker list
>>>> and we do not reclaim enough memory with each shrinker,
>>>> then the down_read_trylock() may be called with high
>>>> frequency. Because of the poor multicore scalability of
>>>> atomic operations, this can lead to a significant drop
>>>> in IPC (instructions per cycle).
>>>>
>>>> We used to implement the lockless slab shrink with
>>>> SRCU [1], but then kernel test robot reported -88.8%
>>>> regression in stress-ng.ramfs.ops_per_sec test case [2],
>>>> so we reverted it [3].
>>>>
>>>> This commit uses the refcount+RCU method [4] proposed
>>>> by Dave Chinner to re-implement the lockless global slab
>>>> shrink. The memcg slab shrink is handled in the subsequent
>>>> patch.
>>>>
>>>> Currently, the shrinker instances can be divided into
>>>> the following three types:
>>>>
>>>> a) global shrinker instance statically defined in the kernel,
>>>> such as workingset_shadow_shrinker.
>>>>
>>>> b) global shrinker instance statically defined in the kernel
>>>> modules, such as mmu_shrinker in x86.
>>>>
>>>> c) shrinker instance embedded in other structures.
>>>>
>>>> For case a, the memory of shrinker instance is never freed.
>>>> For case b, the memory of shrinker instance will be freed
>>>> after the module is unloaded. But we will call synchronize_rcu()
>>>> in free_module() to wait for RCU read-side critical section to
>>>> exit. For case c, the memory of shrinker instance will be
>>>> dynamically freed by calling kfree_rcu(). So we can use
>>>> rcu_read_{lock,unlock}() to ensure that the shrinker instance
>>>> is valid.
>>>>
>>>> The shrinker::refcount mechanism ensures that the shrinker
>>>> instance will not be run again after unregistration. So the
>>>> structure that records the pointer of shrinker instance can be
>>>> safely freed without waiting for the RCU read-side critical
>>>> section.
>>>>
>>>> In this way, while we implement the lockless slab shrink, we
>>>> don't need to block in unregister_shrinker() waiting for the
>>>> RCU read-side critical section.
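
The try-get/put pairing described above can be modelled in userspace with C11 atomics. This is only a sketch under stated assumptions: `model_shrinker`, `model_try_get()` and `model_put()` are hypothetical names, the `completed` flag stands in for signalling `completion_wait`, and the saturation and warning behaviour of the kernel's `refcount_t` is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct shrinker. */
struct model_shrinker {
	atomic_int refcount;	/* set to 1 at registration time */
	bool completed;		/* stands in for signalling completion_wait */
};

/* Modelled on refcount_inc_not_zero(): take a reference unless the count is 0. */
static bool model_try_get(struct model_shrinker *s)
{
	int old = atomic_load(&s->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded, so the zero check is repeated. */
		if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Modelled on refcount_dec_and_test(): the final put signals the waiter. */
static void model_put(struct model_shrinker *s)
{
	int prev = atomic_fetch_sub(&s->refcount, 1);

	assert(prev > 0);	/* a put without a matching get is a bug */
	if (prev == 1)
		s->completed = true;
}
```

Once the count has reached zero, `model_try_get()` can never succeed again, which is the property the commit message relies on: a shrinker that is being unregistered is skipped rather than run. The branch body of shrinker_put() is not shown in the quoted hunk, so treating it as a completion signal is an assumption based on the wait_for_completion() call in unregister_shrinker().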
>>>>
>>>> The following are the test results:
>>>>
>>>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>>>
>>>> 1) Before applying this patchset:
>>>>
>>>>   setting to a 60 second run per stressor
>>>>   dispatching hogs: 9 ramfs
>>>>   stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>>>>                             (secs)    (secs)    (secs)   (real time) (usr+sys time)
>>>>   ramfs            880623     60.02      7.71    226.93     14671.45        3753.09
>>>>   ramfs:
>>>>            1 System Management Interrupt
>>>>   for a 60.03s run time:
>>>>      5762.40s available CPU time
>>>>         7.71s user time   (  0.13%)
>>>>       226.93s system time (  3.94%)
>>>>       234.64s total time  (  4.07%)
>>>>   load average: 8.54 3.06 2.11
>>>>   passed: 9: ramfs (9)
>>>>   failed: 0
>>>>   skipped: 0
>>>>   successful run completed in 60.03s (1 min, 0.03 secs)
>>>>
>>>> 2) After applying this patchset:
>>>>
>>>>   setting to a 60 second run per stressor
>>>>   dispatching hogs: 9 ramfs
>>>>   stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>>>>                             (secs)    (secs)    (secs)   (real time) (usr+sys time)
>>>>   ramfs            847562     60.02      7.44    230.22     14120.66        3566.23
>>>>   ramfs:
>>>>            4 System Management Interrupts
>>>>   for a 60.12s run time:
>>>>      5771.95s available CPU time
>>>>         7.44s user time   (  0.13%)
>>>>       230.22s system time (  3.99%)
>>>>       237.66s total time  (  4.12%)
>>>>   load average: 8.18 2.43 0.84
>>>>   passed: 9: ramfs (9)
>>>>   failed: 0
>>>>   skipped: 0
>>>>   successful run completed in 60.12s (1 min, 0.12 secs)
>>>>
>>>> We can see that the ops/s has hardly changed.
>>>>
>>>> [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
>>>> [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
>>>> [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
>>>> [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>>   include/linux/shrinker.h |  6 ++++++
>>>>   mm/vmscan.c              | 33 ++++++++++++++-------------------
>>>>   2 files changed, 20 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>>>> index 7bfeb2f25246..b0c6c2df9db8 100644
>>>> --- a/include/linux/shrinker.h
>>>> +++ b/include/linux/shrinker.h
>>>> @@ -74,6 +74,7 @@ struct shrinker {
>>>>
>>>>   	refcount_t refcount;
>>>>   	struct completion completion_wait;
>>>> +	struct rcu_head rcu;
>>>>
>>>>   	void *private_data;
>>>>
>>>> @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
>>>>   void shrinker_free(struct shrinker *shrinker);
>>>>   void unregister_and_free_shrinker(struct shrinker *shrinker);
>>>>
>>>> +static inline bool shrinker_try_get(struct shrinker *shrinker)
>>>> +{
>>>> +	return refcount_inc_not_zero(&shrinker->refcount);
>>>> +}
>>>> +
>>>>   static inline void shrinker_put(struct shrinker *shrinker)
>>>>   {
>>>>   	if (refcount_dec_and_test(&shrinker->refcount))
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 6f9c4750effa..767569698946 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -57,6 +57,7 @@
>>>>   #include <linux/khugepaged.h>
>>>>   #include <linux/rculist_nulls.h>
>>>>   #include <linux/random.h>
>>>> +#include <linux/rculist.h>
>>>>
>>>>   #include <asm/tlbflush.h>
>>>>   #include <asm/div64.h>
>>>> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>>>>   	down_write(&shrinker_rwsem);
>>>>   	refcount_set(&shrinker->refcount, 1);
>>>>   	init_completion(&shrinker->completion_wait);
>>>> -	list_add_tail(&shrinker->list, &shrinker_list);
>>>> +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>>   	shrinker->flags |= SHRINKER_REGISTERED;
>>>>   	shrinker_debugfs_add(shrinker);
>>>>   	up_write(&shrinker_rwsem);
>>>> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>   	wait_for_completion(&shrinker->completion_wait);
>>>>
>>>>   	down_write(&shrinker_rwsem);
>>>> -	list_del(&shrinker->list);
>>>> +	list_del_rcu(&shrinker->list);
>>>>   	shrinker->flags &= ~SHRINKER_REGISTERED;
>>>>   	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>>   		unregister_memcg_shrinker(shrinker);
>>>> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
>>>>   void unregister_and_free_shrinker(struct shrinker *shrinker)
>>>>   {
>>>>   	unregister_shrinker(shrinker);
>>>> -	kfree(shrinker);
>>>> +	kfree_rcu(shrinker, rcu);
>>>>   }
>>>>   EXPORT_SYMBOL(unregister_and_free_shrinker);
>>>>
>>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>   	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>   		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>
>>>> -	if (!down_read_trylock(&shrinker_rwsem))
>>>> -		goto out;
>>>> -
>>>> -	list_for_each_entry(shrinker, &shrinker_list, list) {
>>>> +	rcu_read_lock();
>>>> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>>   		struct shrink_control sc = {
>>>>   			.gfp_mask = gfp_mask,
>>>>   			.nid = nid,
>>>>   			.memcg = memcg,
>>>>   		};
>>>>
>>>> +		if (!shrinker_try_get(shrinker))
>>>> +			continue;
>>>> +		rcu_read_unlock();
>>> I don't think you can do this unlock?
>>>> +
>>>>   		ret = do_shrink_slab(&sc, shrinker, priority);
>>>>   		if (ret == SHRINK_EMPTY)
>>>>   			ret = 0;
>>>>   		freed += ret;
>>>> -		/*
>>>> -		 * Bail out if someone want to register a new shrinker to
>>>> -		 * prevent the registration from being stalled for long periods
>>>> -		 * by parallel ongoing shrinking.
>>>> -		 */
>>>> -		if (rwsem_is_contended(&shrinker_rwsem)) {
>>>> -			freed = freed ? : 1;
>>>> -			break;
>>>> -		}
>>>> -	}
>>>>
>>>> -	up_read(&shrinker_rwsem);
>>>> -out:
>>>> +		rcu_read_lock();
>>> That new rcu_read_lock() won't help AFAIK, the whole
>>> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
>>> safe.
>>
>> In the unregister_shrinker() path, we wait for the refcount to drop
>> to zero before deleting the shrinker from the linked list. Here, we
>> first take the RCU read lock and then decrement the refcount of this
>> shrinker.
>>
>>     shrink_slab                 unregister_shrinker
>>     ===========                 ===================
>> 				
>> 				/* wait for B */
>> 				wait_for_completion()
>>   rcu_read_lock()
>>
>>   shrinker_put() --> (B)
>> 				list_del_rcu()
>>                                 /* wait for rcu_read_unlock() */
>> 				kfree_rcu()
>>
>>   /*
>>    * so this shrinker will not be freed here,
>>    * and can be used to traverse the next node
>>    * normally?
>>    */
>>   list_for_each_entry()
>>
>>   shrinker_try_get()
>>   rcu_read_unlock()
>>
>> Did I miss something?
> 
> After calling rcu_read_unlock(), the next shrinker in the list can be freed,
> so in the next iteration, use after free might happen?
> 
> Is that right?

IIUC, are you talking about the following scenario?

      shrink_slab                 unregister_shrinker a
      ===========                 =====================
		

    rcu_read_unlock()	

    /* next *shrinker b* was
     * removed from shrinker_list.
     */
				/* wait for B */
				wait_for_completion()
    rcu_read_lock()

    shrinker_put() --> (B)
				list_del_rcu()
                                  /* wait for rcu_read_unlock() */
				kfree_rcu()

    list_for_each_entry()

    shrinker_try_get()
    rcu_read_unlock()

When the next *shrinker b* is deleted, *shrinker a* has not yet been
deleted from the shrinker_list, so list_del_rcu() makes a->next point
to b->next. Then in the next iteration, we will get b->next instead of b?
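
To make the pointer updates concrete, here is a minimal single-threaded sketch of the unlink step. It is an illustration under stated assumptions, not the kernel implementation: `struct node` and the `node_*()` helpers are hypothetical, and the WRITE_ONCE() stores, pointer poisoning and grace-period handling of the real list_del_rcu() are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal node, loosely modelled on struct list_head. */
struct node {
	struct node *next, *prev;
};

static void node_init(struct node *head)
{
	head->next = head->prev = head;
}

static void node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Modelled on list_del_rcu(): unlink the entry from its neighbours but
 * leave entry->next intact, so a reader standing on the entry can still
 * step forward.
 */
static void node_del_keep_next(struct node *n)
{
	assert(n->prev != NULL);	/* guard against deleting twice */
	n->next->prev = n->prev;
	n->prev->next = n->next;
	n->prev = NULL;			/* the kernel poisons prev instead */
}
```

With a list head -> a -> b -> c, deleting b leaves a->next pointing at c, and b->next still points at c. Whether the memory behind those pointers is still valid when the reader follows them is a separate question that the sketch does not model.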


> 
>>
>>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
>>>> +		shrinker_put(shrinker);
>>>> +	}
>>>> +	rcu_read_unlock();
>>>>   	cond_resched();
>>>>   	return freed;
>>>>   }
> 


