Re: [RFC PATCH] zram: support asynchronous GC for lazy slot freeing

From: Kairui Song <ryncsn@gmail.com>
To: "Barry Song (Xiaomi)" <baohua@kernel.org>
Cc: minchan@kernel.org, senozhatsky@chromium.org,
	 akpm@linux-foundation.org, linux-mm@kvack.org, axboe@kernel.dk,
	linux-block@vger.kernel.org,  linux-kernel@vger.kernel.org,
	kasong@tencent.com, chrisl@kernel.org, justinjiang@vivo.com,
	 liulei.rjpt@vivo.com, Xueyuan Chen <xueyuan.chen21@gmail.com>
Subject: Re: [RFC PATCH] zram: support asynchronous GC for lazy slot freeing
Date: Sun, 12 Apr 2026 19:48:48 +0800
Message-ID: <adt3Q_SRToF6fb3W@KASONG-MC4>
In-Reply-To: <20260412060450.15813-1-baohua@kernel.org>

On Sun, Apr 12, 2026 at 02:04:50PM +0800, Barry Song (Xiaomi) wrote:
> Swap freeing can be expensive when unmapping a VMA containing
> many swap entries. This has been reported to significantly
> delay memory reclamation during Android’s low-memory killing,
> especially when multiple processes are terminated to free
> memory, with slot_free() accounting for more than 80% of
> the total cost of freeing swap entries.
> 
> Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> to asynchronously collect and free swap entries [1][2], but the
> design itself is fairly complex.
> 
> When anon folios and swap entries are mixed within a
> process, reclaiming anon folios from killed processes
> helps return memory to the system as quickly as possible,
> so that newly launched applications can satisfy their
> memory demands. It is not ideal for swap freeing to block
> anon folio freeing. On the other hand, swap freeing can
> still return memory to the system, although at a slower
> rate due to memory compression.
> 
> Therefore, in zram, we introduce a GC worker to allow anon
> folio freeing and slot_free to run in parallel, since
> slot_free is performed asynchronously, maximizing the rate at
> which memory is returned to the system.
> 
> Xueyuan’s test on RK3588 shows that unmapping a 256MB swap-filled
> VMA becomes 3.4× faster when pinning tasks to CPU2, reducing the
> execution time from 63,102,982 ns to 18,570,726 ns.
> 
> A positive side effect is that async GC also slightly improves
> do_swap_page() performance, as it no longer has to wait for
> slot_free() to complete.
> 
> Xueyuan’s test shows that swapping in 256MB of data (each page
> filled with repeating patterns such as “1024 one”, “1024 two”,
> “1024 three”, and “1024 four”) reduces execution time from
> 1,358,133,886 ns to 1,104,315,986 ns, achieving a 1.22× speedup.
> 
> [1] https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/
> [2] https://lore.kernel.org/all/20250909065349.574894-1-liulei.rjpt@vivo.com/
> 
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>

Hi Barry

This looks like an interesting idea to me.

> ---
>  drivers/block/zram/zram_drv.c | 56 ++++++++++++++++++++++++++++++++++-
>  drivers/block/zram/zram_drv.h |  3 ++
>  2 files changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index c2afd1c34f4a..f5c07eb997a8 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1958,6 +1958,23 @@ static ssize_t debug_stat_show(struct device *dev,
>  	return ret;
>  }
>  
> +static void gc_slots_free(struct zram *zram)
> +{
> +	size_t num_pages = zram->disksize >> PAGE_SHIFT;
> +	unsigned long index;
> +
> +	index = find_next_bit(zram->gc_map, num_pages, 0);
> +	while (index < num_pages) {
> +		if (slot_trylock(zram, index)) {
> +			if (test_bit(index, zram->gc_map))
> +				slot_free(zram, index);
> +			slot_unlock(zram, index);
> +			cond_resched();
> +		}
> +		index = find_next_bit(zram->gc_map, num_pages, index + 1);
> +	}
> +}
> +

The idea looks interesting, but the implementation doesn't look
optimal to me. find_next_bit does an O(n) lookup on every GC call,
which looks really expensive if the pending slot is at the tail.

Perhaps a percpu stack can be used, something like the folio batch?
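
To make the suggestion concrete, here is a minimal user-space sketch of
such a batch. Everything in it is hypothetical and not from the patch:
the names gc_batch, gc_batch_add, gc_batch_drain, sum_indices and the
capacity GC_BATCH_SIZE are made up for illustration. In the kernel there
would be one batch per CPU and the drain would still have to take the
slot lock; the only point shown is that draining costs O(pending slots)
rather than O(num_pages).

```c
#include <assert.h>
#include <stddef.h>

#define GC_BATCH_SIZE 32	/* hypothetical capacity, not from the patch */

/* Hypothetical stand-in for a per-CPU batch of slot indices awaiting free. */
struct gc_batch {
	unsigned int nr;
	unsigned long slots[GC_BATCH_SIZE];
};

/*
 * Queue a slot index for deferred freeing.
 * Returns 0 when the batch is full, so the caller can fall back to a
 * synchronous free (or drain first), and 1 on success.
 */
static int gc_batch_add(struct gc_batch *b, unsigned long index)
{
	if (b->nr >= GC_BATCH_SIZE)
		return 0;
	b->slots[b->nr++] = index;
	return 1;
}

/*
 * Free every queued slot through free_fn and reset the batch.
 * Only the queued entries are visited, regardless of where the
 * indices sit in the device. Returns the number of slots drained.
 */
static unsigned int gc_batch_drain(struct gc_batch *b,
				   void (*free_fn)(unsigned long index, void *ctx),
				   void *ctx)
{
	unsigned int i, n = b->nr;

	for (i = 0; i < n; i++)
		free_fn(b->slots[i], ctx);
	b->nr = 0;
	return n;
}

/* Example callback: accumulates the indices it is asked to free. */
static void sum_indices(unsigned long index, void *ctx)
{
	*(unsigned long *)ctx += index;
}
```

A slot at index 1000000 costs the same to drain as a slot at index 5,
which is the property the bitmap scan lacks.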

> -	slot_free(zram, index);
> +	if (!try_slot_lazy_free(zram, index))
> +		slot_free(zram, index);

What is making this slot_free so costly? zs_free?

>  	slot_unlock(zram, index);
>  }
>  
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index 08d1774c15db..1f3ffd79fcb1 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -88,6 +88,7 @@ struct zram_stats {
>  	atomic64_t pages_stored;	/* no. of pages currently stored */
>  	atomic_long_t max_used_pages;	/* no. of maximum pages stored */
>  	atomic64_t miss_free;		/* no. of missed free */
> +	atomic64_t gc_slots;		/* no. of queued for lazy free by gc */

Maybe we want to track the size of the content being delayed instead
of the number of slots? I saw there is a hard limit of 30000 for that.

Perhaps it would make more sense to have a "buffer size" (e.g. 64M),
which seems more intuitive to me: delayed frees in the ZRAM module can
occupy at most 64M of memory, so they won't cause significant global
pressure.
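
A rough sketch of what I mean by a byte budget, in plain user-space C.
The names gc_budget, gc_budget_charge and gc_budget_uncharge are
hypothetical, and the 64M figure is just the example above. Real code
would need atomics or a lock; this single-threaded version only shows
the accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accounting for deferred frees, measured in bytes. */
struct gc_budget {
	size_t pending_bytes;	/* compressed bytes queued but not yet freed */
	size_t limit_bytes;	/* upper bound, e.g. 64M */
};

/*
 * Try to defer the free of an object of obj_size bytes.
 * Returns false when the budget would be exceeded, in which case the
 * caller is expected to free the slot synchronously instead.
 * pending_bytes never exceeds limit_bytes, so the subtraction below
 * cannot wrap.
 */
static bool gc_budget_charge(struct gc_budget *g, size_t obj_size)
{
	if (obj_size > g->limit_bytes - g->pending_bytes)
		return false;
	g->pending_bytes += obj_size;
	return true;
}

/* Called once the deferred object has actually been freed. */
static void gc_budget_uncharge(struct gc_budget *g, size_t obj_size)
{
	assert(obj_size <= g->pending_bytes);
	g->pending_bytes -= obj_size;
}
```

With this, 30000 well-compressed slots and 30000 incompressible slots
are no longer treated the same: the limit follows the memory actually
held back.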

Also, I think this patch is batching the memory free operations, so the
workqueue or the design could be further optimized for batching. For
example, if zs_free is the expensive part, then maybe we could just
clear the handle of the slot being freed and leave the handle in a
percpu stack, then batch free these handles. zsmalloc might make
use of some batch optimization based on that too, something like
kmem_cache_free_bulk but for zsmalloc?
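
Roughly, the detach-then-bulk-free idea could look like the user-space
sketch below. All names here (fake_slot, handle_stash,
slot_detach_handle, stash_free_bulk, HANDLE_STASH_SIZE) are
hypothetical; zsmalloc has no bulk free interface that I know of, so
stash_free_bulk only marks where such a call would go.

```c
#include <assert.h>
#include <stddef.h>

#define HANDLE_STASH_SIZE 16	/* hypothetical capacity */

/* Stand-in for a zram table entry; only the handle matters here. */
struct fake_slot {
	unsigned long handle;
};

/* Hypothetical per-CPU stash of detached allocator handles. */
struct handle_stash {
	size_t nr;
	unsigned long handles[HANDLE_STASH_SIZE];
};

/*
 * Move the handle out of the slot so the slot is reusable right away,
 * leaving only the allocator free for later.
 * Returns 0 if the slot is empty or the stash is full (the caller
 * would then free synchronously), 1 on success.
 */
static int slot_detach_handle(struct fake_slot *slot, struct handle_stash *st)
{
	if (!slot->handle || st->nr == HANDLE_STASH_SIZE)
		return 0;
	st->handles[st->nr++] = slot->handle;
	slot->handle = 0;
	return 1;
}

/*
 * Placeholder for a bulk free. A real allocator would release every
 * stashed handle here, ideally taking its locks once per batch.
 * This sketch only reports how many handles were pending.
 */
static size_t stash_free_bulk(struct handle_stash *st)
{
	size_t freed = st->nr;

	st->nr = 0;
	return freed;
}
```

The slot itself never has to be revisited by the worker, so no bitmap
and no second slot lock round trip would be needed for the deferred
part.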

If zs_free is not the only expensive part: I took a look at slot_free,
and maybe a lot of the reads / writes of slot data can be merged.

This patch currently doesn't reduce the total amount of work, but
if the above idea works, a lot of redundant operations might be dropped,
resulting in better performance in every case.

Just my two cents and ideas, not sure if I got everything correct.
Looking forward to more discussion on this :)

Thread overview: 6+ messages
2026-04-12  6:04 [RFC PATCH] zram: support asynchronous GC for lazy slot freeing Barry Song (Xiaomi)
2026-04-12 11:48 ` Kairui Song [this message]
2026-04-14  5:49   ` Xueyuan Chen
2026-04-16  7:41     ` Sergey Senozhatsky
2026-04-16  8:09       ` Barry Song
2026-04-17 21:59   ` Barry Song
