Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Qi Zheng <qi.zheng@linux.dev>
To: Kirill Tkhai <tkhai@ya.ru>
Cc: RCU <rcu@vger.kernel.org>,
	"paulmck@kernel.org" <paulmck@kernel.org>,
	"Yujie Liu" <yujie.liu@intel.com>,
	"oe-lkp@lists.linux.dev" <oe-lkp@lists.linux.dev>,
	"lkp@intel.com" <lkp@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Roman Gushchin" <roman.gushchin@linux.dev>,
	"Christian König" <christian.koenig@amd.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Davidlohr Bueso" <dave@stgolabs.net>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Muchun Song" <muchun.song@linux.dev>,
	"Shakeel Butt" <shakeelb@google.com>,
	"Yang Shi" <shy828301@gmail.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"ying.huang@intel.com" <ying.huang@intel.com>,
	"feng.tang@intel.com" <feng.tang@intel.com>,
	"fengwei.yin@intel.com" <fengwei.yin@intel.com>,
	"david@fromorbit.com" <david@fromorbit.com>,
	"Christian Brauner" <brauner@kernel.org>
Subject: Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
Date: Thu, 1 Jun 2023 18:44:37 +0800	[thread overview]
Message-ID: <8b2b273a-1df1-5834-9d3a-397ca31f7e87@linux.dev> (raw)
In-Reply-To: <932751685611907@mail.yandex.ru>

Hi Kirill,

On 2023/6/1 18:06, Kirill Tkhai wrote:
> 01.06.2023, 11:34, "Qi Zheng" <qi.zheng@linux.dev 
> <mailto:qi.zheng@linux.dev>>:
> 
> 
> 
>     On 2023/6/1 08:57, Kirill Tkhai wrote:
> 
>           Hi,
> 
>           On 30.05.2023 06:07, Qi Zheng wrote:
> 
>               Hi,
> 
>               On 2023/5/29 20:51, Paul E. McKenney wrote:
> 
>                   On Mon, May 29, 2023 at 10:39:21AM +0800, Qi Zheng wrote:
> 
> 
>               [...]
> 
> 
>                       Thanks for such a detailed explanation.
> 
>                       Now I think we can continue to try to complete the
>                     idea[1] from
>                       Kirill Tkhai. The patch moves heavy
>                     synchronize_srcu() to delayed
>                       work, so it doesn't affect on user-visible
>                     unregistration speed.
> 
>                       [1].
>                     https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/ <https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/>
> 
> 
>                   A blast from the past!  ;-)
> 
>                   But yes, moving the long-latency synchronize_srcu()
>                 off the user-visible
>                   critical code path can be even better.
> 
> 
>               Yeah, I applied these patches  ([PATCH RFC 04/10]~[PATCH
>             RFC 10/10],
>               with few conflicts), the ops/s does get back to the
>             previous levels.
> 
>               I'll continue updating this patchset after doing more testing.
> 
> 
>           You may also fix the issue using the below generic solution.
> 
>           In addition to this we need patch, which calls
>         unregister_shrinker_delayed_initiate()
>           instead of unregister_shrinker() in deactivate_locked_super(),
>         and calls
>           unregister_shrinker_delayed_finalize() from
>         destroy_super_work(). Compilation tested only.
> 
>           ---
>             include/linux/shrinker.h | 2 ++
>             mm/vmscan.c | 38 ++++++++++++++++++++++++++++++--------
>             2 files changed, 32 insertions(+), 8 deletions(-)
>           diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>           index 224293b2dd06..4ba2986716d3 100644
>           --- a/include/linux/shrinker.h
>           +++ b/include/linux/shrinker.h
>           @@ -4,6 +4,7 @@
> 
>             #include <linux/atomic.h>
>             #include <linux/types.h>
>           +#include <linux/rwsem.h>
> 
>             /*
>              * This struct is used to pass information from page reclaim
>         to the shrinkers.
>           @@ -83,6 +84,7 @@ struct shrinker {
>             #endif
>                     /* objs pending delete, per node */
>                     atomic_long_t *nr_deferred;
>           + struct rw_semaphore rwsem;
>             };
>             #define DEFAULT_SEEKS 2 /* A good number if you don't know
>         better. */
> 
>           diff --git a/mm/vmscan.c b/mm/vmscan.c
>           index eeca83e28c9b..19fc129771ce 100644
>           --- a/mm/vmscan.c
>           +++ b/mm/vmscan.c
>           @@ -706,6 +706,7 @@ static int __prealloc_shrinker(struct
>         shrinker *shrinker)
>                     if (!shrinker->nr_deferred)
>                             return -ENOMEM;
> 
>           + init_rwsem(&shrinker->rwsem);
>                     return 0;
>             }
> 
>           @@ -757,7 +758,9 @@ void register_shrinker_prepared(struct
>         shrinker *shrinker)
>             {
>                     mutex_lock(&shrinker_mutex);
>                     list_add_tail_rcu(&shrinker->list, &shrinker_list);
>           + down_write(&shrinker->rwsem);
>                     shrinker->flags |= SHRINKER_REGISTERED;
>           + up_write(&shrinker->rwsem);
>                     shrinker_debugfs_add(shrinker);
>                     mutex_unlock(&shrinker_mutex);
>             }
>           @@ -802,7 +805,7 @@ EXPORT_SYMBOL(register_shrinker);
>             /*
>              * Remove one
>              */
>           -void unregister_shrinker(struct shrinker *shrinker)
>           +void unregister_shrinker_delayed_initiate(struct shrinker
>         *shrinker)
>             {
>                     struct dentry *debugfs_entry;
>                     int debugfs_id;
>           @@ -812,20 +815,33 @@ void unregister_shrinker(struct shrinker
>         *shrinker)
> 
>                     mutex_lock(&shrinker_mutex);
>                     list_del_rcu(&shrinker->list);
>           + down_write(&shrinker->rwsem);
>                     shrinker->flags &= ~SHRINKER_REGISTERED;
>           + up_write(&shrinker->rwsem);
>                     if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>                             unregister_memcg_shrinker(shrinker);
>                     debugfs_entry = shrinker_debugfs_detach(shrinker,
>         &debugfs_id);
>                     mutex_unlock(&shrinker_mutex);
> 
>           + shrinker_debugfs_remove(debugfs_entry, debugfs_id); // This
>         is moved in your patch
>           +}
>           +EXPORT_SYMBOL(unregister_shrinker_delayed_initiate);
>           +
>           +void unregister_shrinker_delayed_finalize(struct shrinker
>         *shrinker)
>           +{
>                     atomic_inc(&shrinker_srcu_generation);
>                     synchronize_srcu(&shrinker_srcu);
> 
>           - shrinker_debugfs_remove(debugfs_entry, debugfs_id);
>           -
>                     kfree(shrinker->nr_deferred);
>                     shrinker->nr_deferred = NULL;
>             }
>           +EXPORT_SYMBOL(unregister_shrinker_delayed_finalize);
>           +
>           +void unregister_shrinker(struct shrinker *shrinker)
>           +{
>           + unregister_shrinker_delayed_initiate(shrinker);
>           + unregister_shrinker_delayed_finalize(shrinker);
>           +}
>             EXPORT_SYMBOL(unregister_shrinker);
> 
>             /**
>           @@ -856,9 +872,14 @@ static unsigned long
>         do_shrink_slab(struct shrink_control *shrinkctl,
>                                                       : SHRINK_BATCH;
>                     long scanned = 0, next_deferred;
> 
>           + down_read(&shrinker->rwsem);
>           + if (!(shrinker->flags & SHRINKER_REGISTERED))
>           + goto unlock;
>                     freeable = shrinker->count_objects(shrinker, shrinkctl);
>           - if (freeable == 0 || freeable == SHRINK_EMPTY)
>           - return freeable;
>           + if (freeable == 0 || freeable == SHRINK_EMPTY) {
>           + freed = freeable;
>           + goto unlock;
>           + }
> 
>                     /*
>                      * copy the current shrinker scan count into a local
>         variable
>           @@ -935,6 +956,8 @@ static unsigned long do_shrink_slab(struct
>         shrink_control *shrinkctl,
>                      * manner that handles concurrent updates.
>                      */
>                     new_nr = add_nr_deferred(next_deferred, shrinker,
>         shrinkctl);
>           +unlock:
>           + up_read(&shrinker->rwsem);
> 
> 
>     It should be moved after trace_mm_shrink_slab_end().
> 
> Could you explain the reason? I don't see the variable it will protect.

We jump to unlock label before trace_mm_shrink_slab_start(), so I
think we should not go to call trace_mm_shrink_slab_end().

> 
> 
>                     trace_mm_shrink_slab_end(shrinker, shrinkctl->nid,
>         freed, nr, new_nr, total_scan);
>                     return freed;
>           @@ -968,9 +991,8 @@ static unsigned long
>         shrink_slab_memcg(gfp_t gfp_mask, int nid,
>                             struct shrinker *shrinker;
> 
>                             shrinker = idr_find(&shrinker_idr, i);
>           - if (unlikely(!shrinker || !(shrinker->flags &
>         SHRINKER_REGISTERED))) {
>           - if (!shrinker)
>           - clear_bit(i, info->map);
>           + if (unlikely(!shrinker)) {
>           + clear_bit(i, info->map);
>                                     continue;
>                             }
> 
> 
>     Keep this as a fast path?
> 
> Probably, yes. And also we need 1)down_trylock() instead of down_read() 
>   and 2)rwsem_is_contended  in do_shrink_slab().

Agree, although the critical section of the writer of shrinker->rwsem is
very short, this prevents unnecessary sleeps.

> 
> 
>     After applying the above patch, the performance regression problem of
>     ops/s can be solved. And it can be guaranteed that the shrinker is not
>     running after unregister_shrinker_delayed_initiate(), so the previous
>     semantics are not broken.
> 
> Keeping old semantics or not is quite subjective, I think. It's possible 
> to provide strong arguments for both cases. First is faster, second is 
> easier to adopt for users. For me personally the faster approach looks 
> better.

Agree. I also like completely lock-less slab shrink.

>  >nce the lock granularity of down_read() has changed to the granularity
> 
>     of per shrinker, it seems that the down_read() perf hotspot will not be
>     very high. I'm not quite sure why.
> 
>     (The test script used is the script in
>     https://lore.kernel.org/all/20230313112819.38938-4-zhengqi.arch@bytedance.com/ <https://lore.kernel.org/all/20230313112819.38938-4-zhengqi.arch@bytedance.com/>)
> 
> Hmm, possible original shrinker_rwsem had a lot of atomic intersections 
> between cpus on down_read(), while with small locks it has not. Are 

I guess so.

> CONFIG_ for locks debug are same in original and this case?

Basically yes, I will do some more testing.

Thanks,
Qi

> 
> 
>         25.28% [kernel] [k] do_shrink_slab
>         21.91% [kernel] [k] pv_native_safe_halt
>         10.81% [kernel] [k] _find_next_bit
>         10.47% [kernel] [k] down_read
>          8.75% [kernel] [k] shrink_slab
>          4.03% [kernel] [k] up_read
>          3.29% [kernel] [k] shrink_lruvec
>          2.75% [kernel] [k] xa_load
>          2.73% [kernel] [k] mem_cgroup_iter
>          2.67% [kernel] [k] shrink_node
>          1.30% [kernel] [k] list_lru_count_one
> 
>     Thanks,
>     Qi
> 
> 

-- 
Thanks,
Qi

next prev parent reply	other threads:[~2023-06-01 10:44 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-23  1:35 [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression kernel test robot
2023-05-23  3:05 ` Qi Zheng
2023-05-24  7:31   ` Yujie Liu
2023-05-24  8:36     ` Qi Zheng
2023-05-24 11:08       ` Qi Zheng
2023-05-24 11:22         ` Qi Zheng
2023-05-24 11:56         ` Qi Zheng
2023-05-25  4:03           ` Qi Zheng
2023-05-27 11:14             ` Paul E. McKenney
2023-05-29  2:39               ` Qi Zheng
2023-05-29 12:51                 ` Paul E. McKenney
2023-05-30  3:07                   ` Qi Zheng
2023-06-01  0:57                     ` Kirill Tkhai
2023-06-01  8:34                       ` Qi Zheng
     [not found]                         ` <932751685611907@mail.yandex.ru>
2023-06-01 10:44                           ` Qi Zheng [this message]
2023-06-18 10:59 ` Linux regression tracking #adding (Thorsten Leemhuis)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8b2b273a-1df1-5834-9d3a-397ca31f7e87@linux.dev \
    --to=qi.zheng@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=christian.koenig@amd.com \
    --cc=dave@stgolabs.net \
    --cc=david@fromorbit.com \
    --cc=david@redhat.com \
    --cc=feng.tang@intel.com \
    --cc=fengwei.yin@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lkp@intel.com \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=oe-lkp@lists.linux.dev \
    --cc=paulmck@kernel.org \
    --cc=rcu@vger.kernel.org \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeelb@google.com \
    --cc=shy828301@gmail.com \
    --cc=tkhai@ya.ru \
    --cc=vbabka@suse.cz \
    --cc=ying.huang@intel.com \
    --cc=yujie.liu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.