linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: SeongJae Park <sj@kernel.org>
To: Nhat Pham <nphamcs@gmail.com>
Cc: SeongJae Park <sj@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Takero Funaki <flintglass@gmail.com>,
	Yosry Ahmed <yosry.ahmed@linux.dev>,
	kernel-team@meta.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [RFC PATCH] mm/zswap: store compression failed page as-is
Date: Thu, 31 Jul 2025 09:43:23 -0700	[thread overview]
Message-ID: <20250731164323.15107-1-sj@kernel.org> (raw)
In-Reply-To: <CAKEwX=NC65XCkmX1YzivEJtPc+sEJ3pLHUsYhF60QJnk_OtpVw@mail.gmail.com>

Hi Nhat,

On Wed, 30 Jul 2025 17:21:44 -0700 Nhat Pham <nphamcs@gmail.com> wrote:

> On Wed, Jul 30, 2025 at 4:41 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > When zswap writeback is enabled and it fails compressing a given page,
> > zswap lets the page be swapped out to the backing swap device.  This
> > behavior breaks the zswap's writeback LRU order, and hence users can
> > experience unexpected latency spikes.
> >
> > Keep the LRU order by storing the original content in zswap as-is.  The
> > original content is saved in a dynamically allocated page size buffer,
> > and the pointer to the buffer is kept in zswap_entry, on the space for
> > zswap_entry->pool.  Whether the space is used for the original content
> > or zpool is identified by 'zswap_entry->length == PAGE_SIZE'.
[...]
> > ---
> >  mm/zswap.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 65 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 7e02c760955f..e021865696c6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
[...]
> > +/*
> > + * If the compression is failed, try saving the content as is without
> > + * compression, to keep the LRU order.  This can increase memory overhead from
> > + * metadata, but in common zswap use cases where there are sufficient amount of
> > + * compressible pages, the overhead should be not ciritical, and can be
> > + * mitigated by the writeback.  Also, the decompression overhead is optimized.
> > + *
> > + * When the writeback is disabled, however, the additional overhead could be
> > + * problematic.  For the case, just return the failure.  swap_writeout() will
> > + * put the page back to the active LRU list in the case.
> > + */
> > +static int zswap_handle_compression_failure(int comp_ret, struct page *page,
> > +               struct zswap_entry *entry)
> > +{
> > +       if (!zswap_save_incompressible_pages)
> > +               return comp_ret;
> > +       if (!mem_cgroup_zswap_writeback_enabled(
> > +                               folio_memcg(page_folio(page))))
> > +               return comp_ret;
> > +
> > +       entry->orig_data = kmalloc_node(PAGE_SIZE, GFP_NOWAIT | __GFP_NORETRY |
> > +                       __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
> 
> Hmm, seems like this new buffer is not migratable (for compaction etc.?)
> 
> My understanding is that zsmalloc's allocated memory can be migrated
> (which is why zswap only works with a handle - it's a layer of
> indirection that gives zsmalloc the ability to move memory around).
> 
> Besides, why should we re-invent the wheel when zsmalloc already
> handles page-sized objects? :)

Makes sense, I will use zpool in the next version.

I actually saw both you and Takero did so in your versions, but I didn't
realize the migration benefit of the approach.  Thank you for enlightening me,
now I think this migration benefit is important, and I will make the next
version to provide the migratability reusing zpool.

> 
> > +       if (!entry->orig_data)
> > +               return -ENOMEM;
> > +       memcpy_from_page(entry->orig_data, page, 0, PAGE_SIZE);
> > +       entry->length = PAGE_SIZE;
> > +       atomic_long_inc(&zswap_stored_uncompressed_pages);
> > +       return 0;
> > +}
> > +
> >  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >                            struct zswap_pool *pool)
> >  {
> > @@ -976,8 +1023,11 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >          */
> >         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> >         dlen = acomp_ctx->req->dlen;
> > -       if (comp_ret)
> > +       if (comp_ret) {
> > +               comp_ret = zswap_handle_compression_failure(comp_ret, page,
> > +                               entry);
> >                 goto unlock;
> > +       }
> >
> >         zpool = pool->zpool;
> >         gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> > @@ -1009,6 +1059,11 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >         int decomp_ret, dlen;
> >         u8 *src, *obj;
> >
> > +       if (entry->length == PAGE_SIZE) {
> > +               memcpy_to_folio(folio, 0, entry->orig_data, entry->length);
> > +               return true;
> > +       }
> 
> This might not be safe.
> 
> It's conceivable that in zswap_compress(), some compression algorithm
> "successfully" compresses a page to the same size (comp_ret == 0). We
> hand that to zsmalloc, which happily stores the page.
> 
> When we "decompress" the page again, we will attempt to
> memcpy_to_folio from a bogus address (the handle from zsmalloc).

Makes sense, thank you for catching this.

> 
> So, in zswap_compress, you have to treat both comp_ret == 0 and dlen
> == PAGE_SIZE as "compression failure".

I saw your reply saying you were meaning both comp_ret != 0 and dlen ==
PAGE_SIZE, and yes, this makes sense.  I will do so in the next version.


Thanks,
SJ

[...]


  parent reply	other threads:[~2025-07-31 16:43 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-30 23:40 [RFC PATCH] mm/zswap: store compression failed page as-is SeongJae Park
2025-07-31  0:21 ` Nhat Pham
2025-07-31  0:22   ` Nhat Pham
2025-07-31 16:43     ` SeongJae Park
2025-07-31 16:43   ` SeongJae Park [this message]
2025-07-31  0:48 ` Nhat Pham
2025-07-31 16:56   ` SeongJae Park
2025-07-31 15:27 ` Johannes Weiner
2025-07-31 17:09   ` SeongJae Park
2025-07-31 18:16     ` Johannes Weiner
2025-07-31 17:20 ` Joshua Hahn
2025-08-01 19:57   ` SeongJae Park

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250731164323.15107-1-sj@kernel.org \
    --to=sj@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=chengming.zhou@linux.dev \
    --cc=flintglass@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=yosry.ahmed@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).