Date: Mon, 11 Sep 2023 13:40:56 -0700
To: mm-commits@vger.kernel.org, xuanzhuo@linux.alibaba.com, viro@zeniv.linux.org.uk,
 vbabka@suse.cz, tytso@mit.edu, tvrtko.ursulin@linux.intel.com,
 trond.myklebust@hammerspace.com, tom@talpey.com, tomeu.vizoso@collabora.com,
 tkhai@ya.ru, tglx@linutronix.de, steven.price@arm.com, sstabellini@kernel.org,
 songmuchun@bytedance.com, song@kernel.org, snitzer@kernel.org,
 senozhatsky@chromium.org, sean@poorly.run, rpeterso@redhat.com,
 roman.gushchin@linux.dev, rodrigo.vivi@intel.com, robh@kernel.org,
 robdclark@gmail.com, richard@nod.at, ray.huang@amd.com,
 quic_abhinavk@quicinc.com, paulmck@kernel.org, oleksandr_tyshchenko@epam.com,
 neilb@suse.de, namit@vmware.com, muchun.song@linux.dev, mst@redhat.com,
 mingo@redhat.com, minchan@kernel.org, marijn.suijten@somainline.org,
 kolga@netapp.com, kent.overstreet@gmail.com, josef@toxicpanda.com,
 joonas.lahtinen@linux.intel.com, joel@joelfernandes.org, jlayton@kernel.org,
 jgross@suse.com, jefflexu@linux.alibaba.com, jasowang@redhat.com,
 jani.nikula@linux.intel.com, jaegeuk@kernel.org, jack@suse.cz,
 huyue2@coolpad.com, hsiangkao@linux.alibaba.com, gregkh@linuxfoundation.org,
 dsterba@suse.com, dmitry.baryshkov@linaro.org, djwong@kernel.org,
 david@redhat.com, david@fromorbit.com, dave.hansen@linux.intel.com,
 daniel.vetter@ffwll.ch, daniel@ffwll.ch, Dai.Ngo@oracle.com, colyli@suse.de,
 cmllamas@google.com, clm@fb.com, christian.koenig@amd.com, chao@kernel.org,
 chandan.babu@oracle.com, cel@kernel.org, brauner@kernel.org, bp@alien8.de,
 arnd@arndb.de, anna@kernel.org, alyssa.rosenzweig@collabora.com,
 airlied@gmail.com, agruenba@redhat.com, agk@redhat.com,
 adilger.kernel@dilger.ca, zhengqi.arch@bytedance.com, akpm@linux-foundation.org
From: Andrew Morton
Subject: + mm-shrinker-make-memcg-slab-shrink-lockless.patch added to mm-unstable branch
Message-Id: <20230911204057.0FFFBC433C8@smtp.kernel.org>
Reply-To: linux-kernel@vger.kernel.org
X-Mailing-List: mm-commits@vger.kernel.org


The patch titled
     Subject: mm: shrinker: make memcg slab shrink lockless
has been added to the -mm mm-unstable branch.
Its filename is
     mm-shrinker-make-memcg-slab-shrink-lockless.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-shrinker-make-memcg-slab-shrink-lockless.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Qi Zheng
Subject: mm: shrinker: make memcg slab shrink lockless
Date: Mon, 11 Sep 2023 17:44:42 +0800

Like global slab shrink, this commit also uses refcount+RCU method to make
memcg slab shrink lockless.

Use the following script to do slab shrink stress test:

```

DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the above script, then run test and touch commands.
Then we can use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  33.15%  [kernel]  [k] down_read_trylock
  25.38%  [kernel]  [k] shrink_slab
  21.75%  [kernel]  [k] up_read
   4.45%  [kernel]  [k] _find_next_bit
   2.27%  [kernel]  [k] do_shrink_slab
   1.80%  [kernel]  [k] intel_idle_irq
   1.79%  [kernel]  [k] shrink_lruvec
   0.67%  [kernel]  [k] xas_descend
   0.41%  [kernel]  [k] mem_cgroup_iter
   0.40%  [kernel]  [k] shrink_node
   0.38%  [kernel]  [k] list_lru_count_one

2) After applying this patchset:

  64.56%  [kernel]  [k] shrink_slab
  12.18%  [kernel]  [k] do_shrink_slab
   3.30%  [kernel]  [k] __rcu_read_unlock
   2.61%  [kernel]  [k] shrink_lruvec
   2.49%  [kernel]  [k] __rcu_read_lock
   1.93%  [kernel]  [k] intel_idle_irq
   0.89%  [kernel]  [k] shrink_node
   0.81%  [kernel]  [k] mem_cgroup_iter
   0.77%  [kernel]  [k] mem_cgroup_calculate_protection
   0.66%  [kernel]  [k] list_lru_count_one

We can see that the first perf hotspot becomes shrink_slab, which is what
we expect.

Link: https://lkml.kernel.org/r/20230911094444.68966-44-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng
Cc: Abhinav Kumar
Cc: Alasdair Kergon
Cc: Alexander Viro
Cc: Alyssa Rosenzweig
Cc: Andreas Dilger
Cc: Andreas Gruenbacher
Cc: Anna Schumaker
Cc: Arnd Bergmann
Cc: Bob Peterson
Cc: Borislav Petkov
Cc: Carlos Llamas
Cc: Chandan Babu R
Cc: Chao Yu
Cc: Chris Mason
Cc: Christian Brauner
Cc: Christian Koenig
Cc: Chuck Lever
Cc: Coly Li
Cc: Dai Ngo
Cc: Daniel Vetter
Cc: Daniel Vetter
Cc: "Darrick J. Wong"
Cc: Dave Chinner
Cc: Dave Hansen
Cc: David Airlie
Cc: David Hildenbrand
Cc: David Sterba
Cc: Dmitry Baryshkov
Cc: Gao Xiang
Cc: Greg Kroah-Hartman
Cc: Huang Rui
Cc: Ingo Molnar
Cc: Jaegeuk Kim
Cc: Jani Nikula
Cc: Jan Kara
Cc: Jason Wang
Cc: Jeff Layton
Cc: Jeffle Xu
Cc: Joel Fernandes (Google)
Cc: Joonas Lahtinen
Cc: Josef Bacik
Cc: Juergen Gross
Cc: Kent Overstreet
Cc: Kirill Tkhai
Cc: Marijn Suijten
Cc: "Michael S. Tsirkin"
Cc: Mike Snitzer
Cc: Minchan Kim
Cc: Muchun Song
Cc: Muchun Song
Cc: Nadav Amit
Cc: Neil Brown
Cc: Oleksandr Tyshchenko
Cc: Olga Kornievskaia
Cc: Paul E. McKenney
Cc: Richard Weinberger
Cc: Rob Clark
Cc: Rob Herring
Cc: Rodrigo Vivi
Cc: Roman Gushchin
Cc: Sean Paul
Cc: Sergey Senozhatsky
Cc: Song Liu
Cc: Stefano Stabellini
Cc: Steven Price
Cc: "Theodore Ts'o"
Cc: Thomas Gleixner
Cc: Tomeu Vizoso
Cc: Tom Talpey
Cc: Trond Myklebust
Cc: Tvrtko Ursulin
Cc: Vlastimil Babka
Cc: Xuan Zhuo
Cc: Yue Hu
Signed-off-by: Andrew Morton
---

 mm/shrinker.c |   85 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 66 insertions(+), 19 deletions(-)

--- a/mm/shrinker.c~mm-shrinker-make-memcg-slab-shrink-lockless
+++ a/mm/shrinker.c
@@ -218,7 +218,6 @@ static int shrinker_memcg_alloc(struct s
 		return -ENOSYS;
 
 	down_write(&shrinker_rwsem);
-	/* This may call shrinker, so it must use down_read_trylock() */
 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
 	if (id < 0)
 		goto unlock;
@@ -252,10 +251,15 @@ static long xchg_nr_deferred_memcg(int n
 {
 	struct shrinker_info *info;
 	struct shrinker_info_unit *unit;
+	long nr_deferred;
 
-	info = shrinker_info_protected(memcg, nid);
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	unit = info->unit[shrinker_id_to_index(shrinker->id)];
-	return atomic_long_xchg(&unit->nr_deferred[shrinker_id_to_offset(shrinker->id)], 0);
+	nr_deferred = atomic_long_xchg(&unit->nr_deferred[shrinker_id_to_offset(shrinker->id)], 0);
+	rcu_read_unlock();
+
+	return nr_deferred;
 }
 
 static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
@@ -263,10 +267,16 @@ static long add_nr_deferred_memcg(long n
 {
 	struct shrinker_info *info;
 	struct shrinker_info_unit *unit;
+	long nr_deferred;
 
-	info = shrinker_info_protected(memcg, nid);
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	unit = info->unit[shrinker_id_to_index(shrinker->id)];
-	return atomic_long_add_return(nr, &unit->nr_deferred[shrinker_id_to_offset(shrinker->id)]);
+	nr_deferred =
+		atomic_long_add_return(nr, &unit->nr_deferred[shrinker_id_to_offset(shrinker->id)]);
+	rcu_read_unlock();
+
+	return nr_deferred;
 }
 
 void reparent_shrinker_deferred(struct mem_cgroup *memcg)
@@ -463,18 +473,54 @@ static unsigned long shrink_slab_memcg(g
 	if (!mem_cgroup_online(memcg))
 		return 0;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 0;
-
-	info = shrinker_info_protected(memcg, nid);
+	/*
+	 * lockless algorithm of memcg shrink.
+	 *
+	 * The shrinker_info may be freed asynchronously via RCU in the
+	 * expand_one_shrinker_info(), so the rcu_read_lock() needs to be used
+	 * to ensure the existence of the shrinker_info.
+	 *
+	 * The shrinker_info_unit is never freed unless its corresponding memcg
+	 * is destroyed. Here we already hold the refcount of memcg, so the
+	 * memcg will not be destroyed, and of course shrinker_info_unit will
+	 * not be freed.
+	 *
+	 * So in the memcg shrink:
+	 *  step 1: use rcu_read_lock() to guarantee existence of the
+	 *          shrinker_info.
+	 *  step 2: after getting shrinker_info_unit we can safely release the
+	 *          RCU lock.
+	 *  step 3: traverse the bitmap and calculate shrinker_id
+	 *  step 4: use rcu_read_lock() to guarantee existence of the shrinker.
+	 *  step 5: use shrinker_id to find the shrinker, then use
+	 *          shrinker_try_get() to guarantee existence of the shrinker,
+	 *          then we can release the RCU lock to do do_shrink_slab() that
+	 *          may sleep.
+	 *  step 6: do shrinker_put() paired with step 5 to put the refcount,
+	 *          if the refcount reaches 0, then wake up the waiter in
+	 *          shrinker_free() by calling complete().
+	 *          Note: here is different from the global shrink, we don't
+	 *                need to acquire the RCU lock to guarantee existence of
+	 *                the shrinker, because we don't need to use this
+	 *                shrinker to traverse the next shrinker in the bitmap.
+	 *  step 7: we have already exited the read-side of rcu critical section
+	 *          before calling do_shrink_slab(), the shrinker_info may be
+	 *          released in expand_one_shrinker_info(), so go back to step 1
+	 *          to reacquire the shrinker_info.
+	 */
+again:
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	if (unlikely(!info))
 		goto unlock;
 
-	for (; index < shrinker_id_to_index(info->map_nr_max); index++) {
+	if (index < shrinker_id_to_index(info->map_nr_max)) {
 		struct shrinker_info_unit *unit;
 
 		unit = info->unit[index];
 
+		rcu_read_unlock();
+
 		for_each_set_bit(offset, unit->map, SHRINKER_UNIT_BITS) {
 			struct shrink_control sc = {
 				.gfp_mask = gfp_mask,
@@ -484,12 +530,14 @@ static unsigned long shrink_slab_memcg(g
 			struct shrinker *shrinker;
 			int shrinker_id = calc_shrinker_id(index, offset);
 
+			rcu_read_lock();
 			shrinker = idr_find(&shrinker_idr, shrinker_id);
-			if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
-				if (!shrinker)
-					clear_bit(offset, unit->map);
+			if (unlikely(!shrinker || !shrinker_try_get(shrinker))) {
+				clear_bit(offset, unit->map);
+				rcu_read_unlock();
 				continue;
 			}
+			rcu_read_unlock();
 
 			/* Call non-slab shrinkers even though kmem is disabled */
 			if (!memcg_kmem_online() &&
@@ -522,15 +570,14 @@ static unsigned long shrink_slab_memcg(g
 				set_shrinker_bit(memcg, nid, shrinker_id);
 			}
 			freed += ret;
-
-			if (rwsem_is_contended(&shrinker_rwsem)) {
-				freed = freed ? : 1;
-				goto unlock;
-			}
+			shrinker_put(shrinker);
 		}
+
+		index++;
+		goto again;
 	}
 unlock:
-	up_read(&shrinker_rwsem);
+	rcu_read_unlock();
 	return freed;
 }
 #else /* !CONFIG_MEMCG */
_

Patches currently in -mm which might be from zhengqi.arch@bytedance.com are

mm-move-some-shrinker-related-function-declarations-to-mm-internalh.patch
mm-vmscan-move-shrinker-related-code-into-a-separate-file.patch
mm-shrinker-remove-redundant-shrinker_rwsem-in-debugfs-operations.patch
drm-ttm-introduce-pool_shrink_rwsem.patch
mm-shrinker-add-infrastructure-for-dynamically-allocating-shrinker.patch
kvm-mmu-dynamically-allocate-the-x86-mmu-shrinker.patch
binder-dynamically-allocate-the-android-binder-shrinker.patch
drm-ttm-dynamically-allocate-the-drm-ttm_pool-shrinker.patch
xenbus-backend-dynamically-allocate-the-xen-backend-shrinker.patch
erofs-dynamically-allocate-the-erofs-shrinker.patch
f2fs-dynamically-allocate-the-f2fs-shrinker.patch
gfs2-dynamically-allocate-the-gfs2-glock-shrinker.patch
gfs2-dynamically-allocate-the-gfs2-qd-shrinker.patch
nfsv42-dynamically-allocate-the-nfs-xattr-shrinkers.patch
nfs-dynamically-allocate-the-nfs-acl-shrinker.patch
nfsd-dynamically-allocate-the-nfsd-filecache-shrinker.patch
quota-dynamically-allocate-the-dquota-cache-shrinker.patch
ubifs-dynamically-allocate-the-ubifs-slab-shrinker.patch
rcu-dynamically-allocate-the-rcu-lazy-shrinker.patch
rcu-dynamically-allocate-the-rcu-kfree-shrinker.patch
mm-thp-dynamically-allocate-the-thp-related-shrinkers.patch
sunrpc-dynamically-allocate-the-sunrpc_cred-shrinker.patch
mm-workingset-dynamically-allocate-the-mm-shadow-shrinker.patch
drm-i915-dynamically-allocate-the-i915_gem_mm-shrinker.patch
drm-msm-dynamically-allocate-the-drm-msm_gem-shrinker.patch
drm-panfrost-dynamically-allocate-the-drm-panfrost-shrinker.patch
dm-dynamically-allocate-the-dm-bufio-shrinker.patch
dm-zoned-dynamically-allocate-the-dm-zoned-meta-shrinker.patch
md-raid5-dynamically-allocate-the-md-raid5-shrinker.patch
bcache-dynamically-allocate-the-md-bcache-shrinker.patch
vmw_balloon-dynamically-allocate-the-vmw-balloon-shrinker.patch
virtio_balloon-dynamically-allocate-the-virtio-balloon-shrinker.patch
mbcache-dynamically-allocate-the-mbcache-shrinker.patch
ext4-dynamically-allocate-the-ext4-es-shrinker.patch
jbd2ext4-dynamically-allocate-the-jbd2-journal-shrinker.patch
nfsd-dynamically-allocate-the-nfsd-client-shrinker.patch
nfsd-dynamically-allocate-the-nfsd-reply-shrinker.patch
xfs-dynamically-allocate-the-xfs-buf-shrinker.patch
xfs-dynamically-allocate-the-xfs-inodegc-shrinker.patch
xfs-dynamically-allocate-the-xfs-qm-shrinker.patch
zsmalloc-dynamically-allocate-the-mm-zspool-shrinker.patch
fs-super-dynamically-allocate-the-s_shrink.patch
mm-shrinker-remove-old-apis.patch
mm-shrinker-add-a-secondary-array-for-shrinker_info-map-nr_deferred.patch
mm-shrinker-rename-preallocunregister_memcg_shrinker-to-shrinker_memcg_allocremove.patch
mm-shrinker-make-global-slab-shrink-lockless.patch
mm-shrinker-make-memcg-slab-shrink-lockless.patch
mm-shrinker-hold-write-lock-to-reparent-shrinker-nr_deferred.patch
mm-shrinker-convert-shrinker_rwsem-to-mutex.patch
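
For readers following the comment added to shrink_slab_memcg() in the
patch above, steps 4 to 6 (look up the shrinker, try to take a reference,
do the work, drop the reference) can be sketched in user space. This is
only an illustration under stated assumptions, not the kernel code:
struct obj, struct unit, UNIT_BITS, obj_init(), obj_try_get(), obj_put()
and scan_unit() are hypothetical names, C11 atomics stand in for the
kernel refcount, the bitmap update is a plain store rather than an atomic
clear_bit(), and RCU is not modelled at all; comments only mark where
rcu_read_lock()/rcu_read_unlock() sit in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define UNIT_BITS 8	/* hypothetical; stands in for SHRINKER_UNIT_BITS */

/* Hypothetical stand-in for a shrinker; only the refcount matters here. */
struct obj {
	atomic_int refcount;	/* 0 means teardown has already started */
	int scanned;		/* counts simulated do_shrink_slab() calls */
};

/* Stand-in for one shrinker_info_unit plus the IDR slots it indexes. */
struct unit {
	unsigned int map;		/* one bit per slot, like unit->map */
	struct obj *slot[UNIT_BITS];	/* lookup table, like idr_find() */
};

static void obj_init(struct obj *o, int refs)
{
	atomic_init(&o->refcount, refs);
	o->scanned = 0;
}

/* Like shrinker_try_get(): take a reference unless the count is already 0. */
static bool obj_try_get(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Like shrinker_put(): returns true when the last reference was dropped. */
static bool obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcount, 1);

	assert(old > 0);
	return old == 1;	/* the patch wakes the shrinker_free() waiter here */
}

/* Walk the set bits of one unit; returns how many objects were visited. */
static int scan_unit(struct unit *u)
{
	int visited = 0;

	for (int bit = 0; bit < UNIT_BITS; bit++) {
		struct obj *o;

		if (!(u->map & (1u << bit)))
			continue;

		/* In the patch, rcu_read_lock() is taken here (step 4). */
		o = u->slot[bit];
		if (!o || !obj_try_get(o)) {
			u->map &= ~(1u << bit);	/* plays the role of clear_bit() */
			/* rcu_read_unlock() */
			continue;
		}
		/* rcu_read_unlock(): the reference now keeps 'o' alive (step 5). */

		o->scanned++;		/* stands in for do_shrink_slab() */
		visited++;
		obj_put(o);		/* step 6 */
	}
	return visited;
}
```

A caller would initialise a few objects with obj_init(), place them in a
struct unit with the matching bits set in map, and call scan_unit(). An
object whose refcount is already 0 is skipped and its bit is cleared,
which mirrors how the patch treats a shrinker that fails
shrinker_try_get().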