From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>,
akpm@linux-foundation.org, chengming.zhou@linux.dev,
linux-mm@kvack.org, kernel-team@meta.com,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] zswap: do not crash the kernel on decompression failure
Date: Wed, 26 Feb 2025 15:33:43 +0000 [thread overview]
Message-ID: <Z780VzBOE3LKY0yi@google.com> (raw)
In-Reply-To: <20250226045727.GB1775487@cmpxchg.org>
On Tue, Feb 25, 2025 at 11:57:27PM -0500, Johannes Weiner wrote:
> On Wed, Feb 26, 2025 at 03:12:35AM +0000, Yosry Ahmed wrote:
> > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > > Currently, we crash the kernel when a decompression failure occurs in
> > > zswap (either because of memory corruption, or a bug in the compression
> > > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > > process asking for the zswap entry on zswap load, and skip the corrupted
> > > entry in zswap writeback.
> >
> > Some relevant observations/questions, but not really actionable for this
> > patch, perhaps some future work, or more likely some incoherent
> > illogical thoughts :
> >
> > (1) It seems like not making the folio uptodate will cause shmem faults
> > to mark the swap entry as hwpoisoned, but I don't see similar handling
> > for do_swap_page(). So it seems like even if we SIGBUS the process,
> > other processes mapping the same page could follow in the same
> > footsteps.
>
> It's analogous to what __end_swap_bio_read() does for block backends,
> so it's hitchhiking on the standard swap protocol for read failures.
Right, that's also how I got the idea when I did the same for large
folios handling.
>
> The page sticks around if there are other users. It can get reclaimed,
> but since it's not marked dirty, it won't get overwritten. Another
> access will either find it in the swapcache and die on !uptodate; if
> it was reclaimed, it will attempt another decompression. If all
> references have been killed, zswap_invalidate() will finally drop it.
>
> Swapoff actually poisons the page table as well (unuse_pte).
Right. My question was basically why don't we also poison the page table
in do_swap_page() in this case. It's like that we never swapoff.
This will cause subsequent fault attempts to return VM_FAULT_HWPOISON
quickly without doing through the swapcache or decompression. Probably
not a big deal, but shmem does it so maybe it'd be nice to do it for
consistency.
>
> > (2) A hwpoisoned swap entry results in VM_FAULT_SIGBUS in some cases
> > (e.g. shmem_fault() -> shmem_get_folio_gfp() -> shmem_swapin_folio()),
> > even though we have VM_FAULT_HWPOISON. This patch falls under this
> > bucket, but unfortunately we cannot tell for sure if it's a hwpoision or
> > a decompression bug.
>
> Are you sure? Actual memory failure should replace the ptes of a
> mapped shmem page with TTU_HWPOISON, which turns them into special
> swap entries that trigger VM_FAULT_HWPOISON in do_swap_page().
I was looking at the shmem_fault() path. It seems like for this path we
end up with VM_SIGBUS because shmem_swapin_folio() returns -EIO and not
-EHWPOISON. This seems like something that can be easily fixed though,
unless -EHWPOISON is not always correct for a diffrent reason.
>
> Anon swap distinguishes as long as the swapfile is there. Swapoff
> installs poison markers, which are then handled the same in future
> faults (VM_FAULT_HWPOISON):
>
> /*
> * "Poisoned" here is meant in the very general sense of "future accesses are
> * invalid", instead of referring very specifically to hardware memory errors.
> * This marker is meant to represent any of various different causes of this.
> *
> * Note that, when encountered by the faulting logic, PTEs with this marker will
> * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error
> * logic.
If that's the case, maybe it's better for zswap in the future if we stop
relying on not marking the folio uptodate, and instead propagate an
error through swap_read_folio() to the callers to make sure we always
return VM_FAULT_HWPOISON and install poison markers.
The handling is a bit quirky and inconsistent, but it ultimately results
in VM_SIGBUS or VM_FAULT_HWPOISON which I guess is fine for now.
> */
> #define PTE_MARKER_POISONED BIT(1)
next prev parent reply other threads:[~2025-02-26 15:33 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-25 21:32 [PATCH] zswap: do not crash the kernel on decompression failure Nhat Pham
2025-02-25 22:25 ` Andrew Morton
2025-02-25 22:28 ` Nhat Pham
2025-02-26 0:51 ` Johannes Weiner
2025-02-26 2:45 ` Yosry Ahmed
2025-02-26 3:57 ` Johannes Weiner
2025-02-26 15:42 ` Yosry Ahmed
2025-02-26 23:39 ` Nhat Pham
2025-02-26 2:40 ` Yosry Ahmed
2025-02-26 23:16 ` Nhat Pham
2025-02-27 0:03 ` Yosry Ahmed
2025-02-26 3:12 ` Yosry Ahmed
2025-02-26 4:57 ` Johannes Weiner
2025-02-26 15:33 ` Yosry Ahmed [this message]
2025-02-26 23:20 ` Nhat Pham
2025-02-27 0:00 ` Yosry Ahmed
2025-02-26 23:29 ` Nhat Pham
2025-02-26 23:58 ` Yosry Ahmed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z780VzBOE3LKY0yi@google.com \
--to=yosry.ahmed@linux.dev \
--cc=akpm@linux-foundation.org \
--cc=chengming.zhou@linux.dev \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nphamcs@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.