All of lore.kernel.org
 help / color / mirror / Atom feed
From: SeongJae Park <sj@kernel.org>
To: Nhat Pham <nphamcs@gmail.com>
Cc: SeongJae Park <sj@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Takero Funaki <flintglass@gmail.com>,
	Yosry Ahmed <yosry.ahmed@linux.dev>,
	kernel-team@meta.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [RFC PATCH] mm/zswap: store compression failed page as-is
Date: Thu, 31 Jul 2025 09:43:23 -0700	[thread overview]
Message-ID: <20250731164323.15107-1-sj@kernel.org> (raw)
In-Reply-To: <CAKEwX=NC65XCkmX1YzivEJtPc+sEJ3pLHUsYhF60QJnk_OtpVw@mail.gmail.com>

Hi Nhat,

On Wed, 30 Jul 2025 17:21:44 -0700 Nhat Pham <nphamcs@gmail.com> wrote:

> On Wed, Jul 30, 2025 at 4:41 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > When zswap writeback is enabled and it fails compressing a given page,
> > zswap lets the page be swapped out to the backing swap device.  This
> > behavior breaks the zswap's writeback LRU order, and hence users can
> > experience unexpected latency spikes.
> >
> > Keep the LRU order by storing the original content in zswap as-is.  The
> > original content is saved in a dynamically allocated page size buffer,
> > and the pointer to the buffer is kept in zswap_entry, on the space for
> > zswap_entry->pool.  Whether the space is used for the original content
> > or zpool is identified by 'zswap_entry->length == PAGE_SIZE'.
[...]
> > ---
> >  mm/zswap.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 65 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 7e02c760955f..e021865696c6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
[...]
> > +/*
> > + * If the compression is failed, try saving the content as is without
> > + * compression, to keep the LRU order.  This can increase memory overhead from
> > + * metadata, but in common zswap use cases where there are sufficient amount of
> > + * compressible pages, the overhead should be not ciritical, and can be
> > + * mitigated by the writeback.  Also, the decompression overhead is optimized.
> > + *
> > + * When the writeback is disabled, however, the additional overhead could be
> > + * problematic.  For the case, just return the failure.  swap_writeout() will
> > + * put the page back to the active LRU list in the case.
> > + */
> > +static int zswap_handle_compression_failure(int comp_ret, struct page *page,
> > +               struct zswap_entry *entry)
> > +{
> > +       if (!zswap_save_incompressible_pages)
> > +               return comp_ret;
> > +       if (!mem_cgroup_zswap_writeback_enabled(
> > +                               folio_memcg(page_folio(page))))
> > +               return comp_ret;
> > +
> > +       entry->orig_data = kmalloc_node(PAGE_SIZE, GFP_NOWAIT | __GFP_NORETRY |
> > +                       __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
> 
> Hmm, seems like this new buffer is not migratable (for compaction etc.?)
> 
> My understanding is that zsmalloc's allocated memory can be migrated
> (which is why zswap only works with a handle - it's a layer of
> indirection that gives zsmalloc the ability to move memory around).
> 
> Besides, why should we re-invent the wheel when zsmalloc already
> handles page-sized objects? :)

Makes sense, I will use zpool in the next version.

I actually saw both you and Takero did so in your versions, but I didn't
realize the migration benefit of the approach.  Thank you for enlightening me,
now I think this migration benefit is important, and I will make the next
version to provide the migratability reusing zpool.

> 
> > +       if (!entry->orig_data)
> > +               return -ENOMEM;
> > +       memcpy_from_page(entry->orig_data, page, 0, PAGE_SIZE);
> > +       entry->length = PAGE_SIZE;
> > +       atomic_long_inc(&zswap_stored_uncompressed_pages);
> > +       return 0;
> > +}
> > +
> >  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >                            struct zswap_pool *pool)
> >  {
> > @@ -976,8 +1023,11 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >          */
> >         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> >         dlen = acomp_ctx->req->dlen;
> > -       if (comp_ret)
> > +       if (comp_ret) {
> > +               comp_ret = zswap_handle_compression_failure(comp_ret, page,
> > +                               entry);
> >                 goto unlock;
> > +       }
> >
> >         zpool = pool->zpool;
> >         gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> > @@ -1009,6 +1059,11 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >         int decomp_ret, dlen;
> >         u8 *src, *obj;
> >
> > +       if (entry->length == PAGE_SIZE) {
> > +               memcpy_to_folio(folio, 0, entry->orig_data, entry->length);
> > +               return true;
> > +       }
> 
> This might not be safe.
> 
> It's conceivable that in zswap_compress(), some compression algorithm
> "successfully" compresses a page to the same size (comp_ret == 0). We
> hand that to zsmalloc, which happily stores the page.
> 
> When we "decompress" the page again, we will attempt to
> memcpy_to_folio from a bogus address (the handle from zsmalloc).

Makes sense, thank you for catching this.

> 
> So, in zswap_compress, you have to treat both comp_ret == 0 and dlen
> == PAGE_SIZE as "compression failure".

I saw your reply saying you were meaning both comp_ret != 0 and dlen ==
PAGE_SIZE, and yes, this makes sense.  I will do so in the next version.


Thanks,
SJ

[...]


  parent reply	other threads:[~2025-07-31 16:43 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-30 23:40 [RFC PATCH] mm/zswap: store compression failed page as-is SeongJae Park
2025-07-31  0:21 ` Nhat Pham
2025-07-31  0:22   ` Nhat Pham
2025-07-31 16:43     ` SeongJae Park
2025-07-31 16:43   ` SeongJae Park [this message]
2025-07-31  0:48 ` Nhat Pham
2025-07-31 16:56   ` SeongJae Park
2025-07-31 15:27 ` Johannes Weiner
2025-07-31 17:09   ` SeongJae Park
2025-07-31 18:16     ` Johannes Weiner
2025-07-31 17:20 ` Joshua Hahn
2025-08-01 19:57   ` SeongJae Park

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250731164323.15107-1-sj@kernel.org \
    --to=sj@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=chengming.zhou@linux.dev \
    --cc=flintglass@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=yosry.ahmed@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.