linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: SeongJae Park <sj@kernel.org>
To: Nhat Pham <nphamcs@gmail.com>
Cc: SeongJae Park <sj@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	David Hildenbrand <david@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosry.ahmed@linux.dev>,
	kernel-team@meta.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Takero Funaki <flintglass@gmail.com>
Subject: Re: [PATCH] mm/zswap: store <PAGE_SIZE compression failed page as-is
Date: Thu,  7 Aug 2025 16:54:27 -0700	[thread overview]
Message-ID: <20250807235427.4743-1-sj@kernel.org> (raw)
In-Reply-To: <CAKEwX=Nm08xZVR0Zk2fOvXBmeWe5wggvTXN6+hD84S7T9L=eBQ@mail.gmail.com>

On Thu, 7 Aug 2025 16:03:54 -0700 Nhat Pham <nphamcs@gmail.com> wrote:

> On Thu, Aug 7, 2025 at 11:16 AM SeongJae Park <sj@kernel.org> wrote:
> >
> > When zswap writeback is enabled and it fails compressing a given page,
> > the page is swapped out to the backing swap device.  This behavior
> > breaks the zswap's writeback LRU order, and hence users can experience
> > unexpected latency spikes.  If the page is compressed without failure,
> > but results in a size of PAGE_SIZE, the LRU order is kept, but the
> > decompression overhead for loading the page back on the later access is
> > unnecessary.
> 
> Maybe add the writeback disabled story - we added to get around this
> latency issue, but that solution was not satisfactory either: wasting
> cpu cycles retrying incompressible pages, especially when we're under
> reclaim/memory pressure, and we've reclaimed most if not all other
> sources.
> 
> This problem was pointed out by Yosry:
> 
> https://lore.kernel.org/all/CAJD7tkZXS-UJVAFfvxJ0nNgTzWBiqepPYA4hEozi01_qktkitg@mail.gmail.com/

Thank you for sharing the detailed context, I will add this history with the
link to the last paragraph of this section, or the 'Related Works' section.

> 
> >
> > Keep the LRU order and optimize unnecessary decompression overheads in
> > those cases, by storing the original content as-is in zpool.  The length
> > field of zswap_entry will be set appropriately, as PAGE_SIZE,  Hence
> > whether it is saved as-is or not (whether decompression is unnecessary)
> > is identified by 'zswap_entry->length = PAGE_SIZE'.
> >
> > Because the uncompressed data is saved in zpool, same to the compressed
> > ones, this introduces no change in terms of memory management including
> > movability and migratability of involved pages.
> >
> > This change is also not increasing per zswap entry metadata overhead.
> > But as the number of incompressible pages increases, total zswap
> > metadata overhead is proportionally increased.  The overhead should not
> > be problematic in usual cases, since the zswap metadata for single zswap
> > entry is much smaller than PAGE_SIZE, and in common zswap use cases
> > there should be a sufficient amount of compressible pages.  Also it can
> > be mitigated by the zswap writeback.
> >
> > When the writeback is disabled, the additional overhead could be
> > problematic.  For the case, keep the current behavior that just returns
> > the failure and let swap_writeout() put the page back to the active LRU
> > list in the case.  It is known to be suboptimal when the incompressible
> > pages are cold, since the incompressible pages will continuously be
> > tried to be zswapped out, and burn CPU cycles for compression attempts
> > that will anyway fails.  One imaginable solution for the problem is
> > reusing the swapped-out page and its struct page to store in the zswap
> > pool.  But that's out of the scope of this patch.
> >
> > Tests
> > -----
[...]
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -976,8 +976,25 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >          */
> >         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> >         dlen = acomp_ctx->req->dlen;
> > -       if (comp_ret)
> > -               goto unlock;
> > +
> > +       /*
> > +        * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > +        * save the content as is without a compression, to keep the LRU order
> > +        * of writebacks.  If writeback is disabled, reject the page since it
> > +        * only adds metadata overhead.  swap_writeout() will put the page back
> > +        * to the active LRU list in the case.
> > +        */
> > +       if (comp_ret || dlen >= PAGE_SIZE) {
> > +               if (mem_cgroup_zswap_writeback_enabled(
> > +                                       folio_memcg(page_folio(page)))) {
> > +                       comp_ret = 0;
> > +                       dlen = PAGE_SIZE;
> > +                       memcpy_from_page(dst, page, 0, dlen);
> 
> I wonder if we can save one memcpy here. Would it be safe to map the page:
> 
> dst = kmap_local_page(page);

Sounds good, I will do so in the next version.

> 
> then, after we're done with storing (i.e after zpool_obj_write()), do:
> 
> if (dlen = PAGE_SIZE)
>     kunmap(dst);
> 
> (don't forget to unmap the page in the failure paths too!)

Sure, I think the 'unlock' label would be appropriate part to add the unmap
code.  I also got your correction of s/kunmap/kunmap_local() in the reply.  I
will not miss that on the next version.

> 
> > +               } else {
> > +                       comp_ret = comp_ret ? : -EINVAL;
> 
> Does this keep the value of comp_ret if comp_ret != 0 lol.

Yes.

> Syntax
> looks weird to me.

I agree it could look weird.  I prefer keeping the code in a way more familiar
to ones who will keep maintain the file.  So I will modify this as below, in
the next version.

    comp_ret ? comp_ret : -EINVAL

> 
> > +                       goto unlock;
> > +               }
> > +       }
> 
> Also, can we fix the counter value?
> 
> I assume we want:
> 
> else if (comp_ret || dlen = PAGE_SIZE)
>      zswap_reject_compress_fail++;
> 
> or something like that.

I'm not very clearly getting your point.

I was thinking we should increase the counter if we "reject" the page (does not
save the content in the zpool) due to failing at compressing the page's content
into a size smaller than PAGE_SIZE.  This patch implements the behavior.

Am I missing a mis-implementation of the behavior in this patch, or the
behavior is not what you think it should be?  More elaboration of your point
would be helpful for me.

> 
> And what happened to the incompressible page stored counter? :)

I don't get what counter you are asking about.  Could you please elaborate?

> 
> 
> >
> >         zpool = pool->zpool;
> >         gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> > @@ -1012,6 +1029,14 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >         acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
> >         obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
> >
> > +       /* zswap entries of PAGE_SIZE are not compressed. */
> of length?

Thank you for this nice suggestion, I will change the comment so, in the next
version.

> 
> 
> > +       if (entry->length = PAGE_SIZE) {
> > +               memcpy_to_folio(folio, 0, obj, entry->length);
> > +               zpool_obj_read_end(zpool, entry->handle, obj);
> > +               acomp_ctx_put_unlock(acomp_ctx);
> > +               return true;
> > +       }
> > +
> >         /*
> >          * zpool_obj_read_begin() might return a kmap address of highmem when
> >          * acomp_ctx->buffer is not used.  However, sg_init_one() does not
> >
> > base-commit: 2ec534125ae474292175ae198483c545eac2161d
> > --
> > 2.39.5
> 

Thank you for your nice review and comments :)


Thanks,
SJ


  parent reply	other threads:[~2025-08-07 23:54 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-07 18:16 [PATCH] mm/zswap: store <PAGE_SIZE compression failed page as-is SeongJae Park
2025-08-07 18:32 ` SeongJae Park
2025-08-07 23:03 ` Nhat Pham
2025-08-07 23:07   ` Nhat Pham
2025-08-07 23:54   ` SeongJae Park [this message]
2025-08-08 23:37     ` Nhat Pham
2025-08-09  0:18       ` SeongJae Park

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250807235427.4743-1-sj@kernel.org \
    --to=sj@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=chengming.zhou@linux.dev \
    --cc=david@redhat.com \
    --cc=flintglass@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=yosry.ahmed@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).