Re: [PATCH 1/2] mm: zswap: optimize zswap pool size tracking

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Johannes Weiner <hannes@cmpxchg.org>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Nhat Pham <nphamcs@gmail.com>,
	Chengming Zhou <zhouchengming@bytedance.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] mm: zswap: optimize zswap pool size tracking
Date: Mon, 11 Mar 2024 22:34:11 -0400	[thread overview]
Message-ID: <20240312023411.GA22705@cmpxchg.org> (raw)
In-Reply-To: <Ze-BH5HDNUG5ohJS@google.com>

On Mon, Mar 11, 2024 at 10:09:35PM +0000, Yosry Ahmed wrote:
> On Mon, Mar 11, 2024 at 12:12:13PM -0400, Johannes Weiner wrote:
> > Profiling the munmap() of a zswapped memory region shows 50%(!) of the
> > total cycles currently going into updating the zswap_pool_total_size.
> 
> Yikes. I have always hated that size update scheme FWIW.
> 
> I have also wondered whether it makes sense to just maintain the number
> of pages in zswap as an atomic, like zswap_stored_pages. I guess your
> proposed scheme is even cheaper for the load/invalidate paths because we
> do nothing at all.  It could be an option if the aggregation in other
> paths ever becomes a problem, but we would need to make sure it
> doesn't regress the load/invalidate paths. Just sharing some thoughts.

Agree with you there. I actually tried doing it that way at first, but
noticed zram uses zs_get_total_pages() and actually wants a per-pool
count. I didn't want the backend to have to update two atomics, so I
settled for this version.

> > There are three consumers of this counter:
> > - store, to enforce the globally configured pool limit
> > - meminfo & debugfs, to report the size to the user
> > - shrink, to determine the batch size for each cycle
> > 
> > Instead of aggregating everytime an entry enters or exits the zswap
> > pool, aggregate the value from the zpools on-demand:
> > 
> > - Stores aggregate the counter anyway upon success. Aggregating to
> >   check the limit instead is the same amount of work.
> > 
> > - Meminfo & debugfs might benefit somewhat from a pre-aggregated
> >   counter, but aren't exactly hotpaths.
> > 
> > - Shrinking can aggregate once for every cycle instead of doing it for
> >   every freed entry. As the shrinker might work on tens or hundreds of
> >   objects per scan cycle, this is a large reduction in aggregations.
> > 
> > The paths that benefit dramatically are swapin, swapoff, and
> > unmaps. There could be millions of pages being processed until
> > somebody asks for the pool size again. This eliminates the pool size
> > updates from those paths entirely.
> 
> This looks like a big win, thanks! I wonder if you have any numbers of
> perf profiles to share. That would be nice to have, but I think the
> benefit is clear regardless.

I deleted the perf files already, but can re-run it tomorrow.

> I also like the implicit cleanup when we switch to maintaining the
> number of pages rather than bytes. The code looks much better with all
> the shifts and divisions gone :)
> 
> I have a couple of comments below. With them addressed, feel free to
> add:
> Acked-by: Yosry Ahmed <yosryahmed@google.com>

Thanks!

> > @@ -1385,6 +1365,10 @@ static void shrink_worker(struct work_struct *w)
> >  {
> >  	struct mem_cgroup *memcg;
> >  	int ret, failures = 0;
> > +	unsigned long thr;
> > +
> > +	/* Reclaim down to the accept threshold */
> > +	thr = zswap_max_pages() * zswap_accept_thr_percent / 100;
> 
> This calculation is repeated twice, so I'd rather keep a helper for it
> as an alternative to zswap_can_accept(). Perhaps zswap_threshold_page()
> or zswap_acceptance_pages()?

Sounds good. I went with zswap_accept_thr_pages().

> > @@ -1711,6 +1700,13 @@ void zswap_swapoff(int type)
> >  
> >  static struct dentry *zswap_debugfs_root;
> >  
> > +static int debugfs_get_total_size(void *data, u64 *val)
> > +{
> > +	*val = zswap_total_pages() * PAGE_SIZE;
> > +	return 0;
> > +}
> > +DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size, NULL, "%llu");
> 
> I think we are missing a newline here to maintain the current format
> (i.e "%llu\n").

Oops, good catch! I had verified the debugfs file (along with the
others) with 'grep . *', which hides that this is missing. Fixed up.

Thanks for taking a look. The incremental diff is below. I'll run the
tests and recapture the numbers tomorrow, then send v2.

diff --git a/mm/zswap.c b/mm/zswap.c
index 7c39327a7cc2..1a5cc7298306 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -504,6 +504,11 @@ static unsigned long zswap_max_pages(void)
 	return totalram_pages() * zswap_max_pool_percent / 100;
 }
 
+static unsigned long zswap_accept_thr_pages(void)
+{
+	return zswap_max_pages() * zswap_accept_thr_percent / 100;
+}
+
 unsigned long zswap_total_pages(void)
 {
 	struct zswap_pool *pool;
@@ -1368,7 +1373,7 @@ static void shrink_worker(struct work_struct *w)
 	unsigned long thr;
 
 	/* Reclaim down to the accept threshold */
-	thr = zswap_max_pages() * zswap_accept_thr_percent / 100;
+	thr = zswap_accept_thr_pages();
 
 	/* global reclaim will select cgroup in a round-robin fashion. */
 	do {
@@ -1493,9 +1498,7 @@ bool zswap_store(struct folio *folio)
 	}
 
 	if (zswap_pool_reached_full) {
-		unsigned long thr = max_pages * zswap_accept_thr_percent / 100;
-
-		if (cur_pages > thr)
+		if (cur_pages > zswap_accept_thr_pages())
 			goto shrink;
 		else
 			zswap_pool_reached_full = false;
@@ -1705,7 +1708,7 @@ static int debugfs_get_total_size(void *data, u64 *val)
 	*val = zswap_total_pages() * PAGE_SIZE;
 	return 0;
 }
-DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size, NULL, "%llu");
+DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size, NULL, "%llu\n");
 
 static int zswap_debugfs_init(void)
 {

next prev parent reply	other threads:[~2024-03-12  2:34 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-11 16:12 [PATCH 1/2] mm: zswap: optimize zswap pool size tracking Johannes Weiner
2024-03-11 16:12 ` [PATCH 2/2] mm: zpool: return pool size in pages Johannes Weiner
2024-03-11 22:12   ` Yosry Ahmed
2024-03-12  2:36     ` Johannes Weiner
2024-03-12  4:07       ` Yosry Ahmed
2024-03-12  4:56   ` Chengming Zhou
2024-03-12  9:15   ` Nhat Pham
2024-03-11 22:09 ` [PATCH 1/2] mm: zswap: optimize zswap pool size tracking Yosry Ahmed
2024-03-12  2:34   ` Johannes Weiner [this message]
2024-03-12  4:04     ` Yosry Ahmed
2024-03-12  4:55 ` Chengming Zhou
2024-03-12  9:12 ` Nhat Pham

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:7c39327a7cc dfblob:1a5cc729830 )
 OR (
bs:"Re: [PATCH 1/2] mm: zswap: optimize zswap pool size tracking" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240312023411.GA22705@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=yosryahmed@google.com \
    --cc=zhouchengming@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.