From: Kairui Song <ryncsn@gmail.com>
To: Nhat Pham <nphamcs@gmail.com>
Cc: YoungJun Park <youngjun.park@lge.com>,
linux-mm@kvack.org, akpm@linux-foundation.org,
hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev,
mhocko@kernel.org, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, muchun.song@linux.dev,
len.brown@intel.com, chengming.zhou@linux.dev,
chrisl@kernel.org, huang.ying.caritas@gmail.com,
ryan.roberts@arm.com, viro@zeniv.linux.org.uk,
baohua@kernel.org, osalvador@suse.de,
lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu,
pavel@kernel.org, kernel-team@meta.com,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-pm@vger.kernel.org, peterx@redhat.com, gunho.lee@lge.com,
taejoon.song@lge.com, iamjoonsoo.kim@lge.com
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space
Date: Tue, 3 Jun 2025 17:50:01 +0800
Message-ID: <CAMgjq7D4gcOih3235DRBEOv4EaaV3YEKc6w2Ab-wTCgb7=sA6w@mail.gmail.com>
In-Reply-To: <CAKEwX=P4Q6jNQAi+H3sMQ73z-F-rG5jz8jj1NeGgUi=Pem_ZTQ@mail.gmail.com>
On Tue, Jun 3, 2025 at 2:30 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Sun, Jun 1, 2025 at 9:15 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> >
> > Hi All,
>
> Thanks for sharing your setup, Kairui! I've always been curious about
> multi-tier compression swapping.
>
> >
> > I'd like to share some info from my side. Currently we have an
> > internal solution for multi-tier swap, implemented on top of ZRAM and
> > its writeback support: four compression levels and multiple block
> > device tiers. The ZRAM table serves a similar role to the swap table
> > in the "swap table series" or the virtual layer here.
> >
> > We hacked the BIO layer to let ZRAM be cgroup-aware, so it even
>
> Hmmm this part seems a bit hacky to me too :-?
Yeah, terribly hackish :P
One of the reasons why I'm trying to retire it.
>
> > supports per-cgroup priority and per-cgroup writeback control, and it
> > has worked perfectly fine in production.
> >
> > The interface looks something like this:
> > /sys/fs/cgroup/cg1/zram.prio: [1-4]
> > /sys/fs/cgroup/cg1/zram.writeback_prio: [1-4]
> > /sys/fs/cgroup/cg1/zram.writeback_size: [0 - 4K]
>
> How do you do aging with multiple tiers like this? Or do you just rely
> on time thresholds, and have userspace invoke writeback in a cron-job
> style?
ZRAM already has a time threshold, and I added another LRU for
swapped-out entries; aging is supposed to be done by userspace agents.
I didn't mention it here as these details are mostly irrelevant to an
upstream implementation.
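To make the cron-style part concrete, a single pass of a minimal agent
against the stock zram knobs would look roughly like this (just a
sketch; the sysfs paths and the "all"/"idle" keywords are the
documented upstream ones, while the per-cgroup files above are our own
and not upstream):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	int ret = -1;

	if (fd < 0)
		return -1;
	if (write(fd, val, strlen(val)) == (ssize_t)strlen(val))
		ret = 0;
	close(fd);
	return ret;
}

int main(void)
{
	/* Write back whatever is still marked idle from the previous
	 * pass, i.e. pages that were not touched since then... */
	write_str("/sys/block/zram0/writeback", "idle");
	/* ...then mark everything idle again, so the next pass only
	 * picks up pages that stay cold from now on. */
	write_str("/sys/block/zram0/idle", "all");
	return 0;
}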
> Tbh, I'm surprised that we see a performance win with recompression. I
> understand that different workloads might benefit the most from
> different points in the Pareto frontier of latency-memory saving:
> latency-sensitive workloads might like a fast compression algorithm,
> whereas other workloads might prefer a compression algorithm that
> saves more memory. So a per-cgroup compressor selection can make
> sense.
>
> However, would the overhead of moving a page from one tier to the
> other not eat up all the benefit from the (usually small) extra memory
> savings?
So far we are not recompressing anything, but per-cgroup compression /
writeback levels are indeed useful. Compressed memory gets written back
to the block device, and that's a large gain.
> > It's really nothing fancy or complex: the four priorities are simply
> > the four ZRAM compression streams that are already upstream, and you
> > can simply hardcode four *bdev in "struct zram" and reuse the bits,
> > then chain the write bio with a new underlying bio... Getting the
> > priority info of a cgroup is even simpler once ZRAM is cgroup aware.
> >
> > All interfaces can be adjusted dynamically at any time (e.g. by an
> > agent), and already swapped-out pages won't be touched. The block
> > devices are specified in ZRAM's sysfs files during swapon.
> >
> > It's easy to implement, but not a good idea for upstream at all:
> > redundant layers, and performance is bad (if not optimized):
> > - it breaks SYNCHRONOUS_IO, causing a huge slowdown, so we removed
> > SYNCHRONOUS_IO completely, which actually improved performance in
> > every aspect (I've been trying to upstream this for a while);
> > - ZRAM's backing block device allocator is just not good (it's only
> > a bitmap), so we want to use the swap allocator directly (which I'm
> > also trying to upstream with the swap table series);
> > - And many other bits and pieces like bio batching are kind of broken,
>
> Interesting, is zram doing writeback batching?
Nope, it even has a comment saying "XXX: A single page IO would be
inefficient for write". We managed to chain bios on the initial page
writeback, but it's still not an ideal design.
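The shape of it is roughly the following (a hand-written sketch against
the current bio API, not the actual internal patch, and the function
name is made up): instead of a synchronous submit per page, the
backing-device bio is chained to the incoming write bio, so completion
is propagated and several pages can be in flight at once:

static void zram_wb_one_page(struct zram *zram, struct page *page,
			     unsigned long blk_idx, struct bio *parent)
{
	struct bio *bio;

	bio = bio_alloc(zram->bdev, 1, REQ_OP_WRITE, GFP_NOIO);
	bio->bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> SECTOR_SHIFT);
	__bio_add_page(bio, page, PAGE_SIZE, 0);

	/* Tie completion of the backing write to the original bio. */
	bio_chain(bio, parent);
	submit_bio(bio);
}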
> > busy loop due to the ZRAM_WB bit, etc...
>
> Hmmm, this sounds like something the swap cache can help with. It's the
> approach zswap writeback is taking - concurrent accessors can get the
> page in the swap cache, and OTOH zswap writeback backs off if it
> detects swap cache contention (since the page is probably being
> swapped in, freed, or written back by another thread).
>
> But I'm not sure how zram writeback works...
Yeah, any bit-lock design suffers a similar problem (like
SWAP_HAS_CACHE). I think we should just use the folio lock or folio
writeback flag in the long term; they work extremely well as generic
infrastructure (which I'm trying to push upstream), need no extra
locking, and minimize memory / design overhead.
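Roughly the kind of pattern I have in mind (hand-wavy sketch, not taken
from an actual patch, and the helper name is made up):

static bool try_writeback_swapped_folio(struct folio *folio)
{
	/* Anyone else touching the entry holds the folio lock. */
	if (!folio_trylock(folio))
		return false;	/* contended: likely being swapped in */

	if (folio_test_writeback(folio)) {
		/* Already on its way to a lower tier, nothing to do. */
		folio_unlock(folio);
		return false;
	}

	folio_start_writeback(folio);
	folio_unlock(folio);
	/* ... submit the I/O; completion calls folio_end_writeback() ... */
	return true;
}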
> > - Lacking support for things like effective migration/compaction;
> > doable, but it looks horrible.
> >
> > So I definitely don't like this band-aid solution, but hey, it works.
> > I'm looking forward to replacing it with native upstream support.
> > That's one of the motivations behind the swap table series, which
> > I think would resolve these problems upstream in an elegant and
> > clean way. The initial tests do show it has much lower overhead
> > and cleans up SWAP.
> >
> > But maybe this is kind of similar to the "less optimized form" you
> > are talking about? As I mentioned, I'm already trying to upstream
> > some of the nicer parts of it, and hopefully it can finally be
> > replaced by an upstream solution.
> >
> > I can try to upstream other parts of it if people are really
> > interested, but I strongly recommend that we focus on the right
> > approach instead, rather than waste time on that and spam the
> > mailing list.
>
> I suppose a lot of this is specific to zram, but bits and pieces of it
> sound upstreamable to me :)
>
> We can wait for YoungJun's patches/RFC for further discussion, but perhaps:
>
> 1. A new cgroup interface to select swap backends for a cgroup.
>
> 2. Writeback/fallback order either designated by the above interface,
> or by the priority of the swap backends.
Fully agree, the final interface and features definitely need more
discussion and collaboration upstream...
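Just to make that concrete, a pure strawman (the name below is made up,
nothing like this exists today): a per-cgroup file along the lines of

	/sys/fs/cgroup/cg1/memory.swap.tiers	# e.g. "zswap zram ssd"

where writeback / fallback order simply follows the list, falling back
to the global swap priority when the file is empty. But that's exactly
the part that needs the wider discussion.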