All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC PATCH 0/2] iomap/xfs: fix data corruption due to stale cached iomaps
Date: Thu, 29 Sep 2022 12:11:28 +1000	[thread overview]
Message-ID: <20220929021128.GF3600936@dread.disaster.area> (raw)
In-Reply-To: <YzPYuu7GtzoN4tB+@magnolia>

On Tue, Sep 27, 2022 at 10:16:42PM -0700, Darrick J. Wong wrote:
> On Fri, Sep 23, 2022 at 08:59:34AM +1000, Dave Chinner wrote:
> > On Wed, Sep 21, 2022 at 09:25:26PM -0700, Darrick J. Wong wrote:
> > > On Wed, Sep 21, 2022 at 06:29:57PM +1000, Dave Chinner wrote:
> > > > Hi folks,
> > > > 
> > > > THese patches address the data corruption first described here:
> > > > 
> > > > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
> > > > 
> > > > This data corruption has been seen in high profile production
> > > > systems so there is some urgency to fix it. The underlying flaw is
> > > > essentially a zero-day iomap bug, so whatever fix we come up with
> > > > needs to be back portable to all supported stable kernels (i.e.
> > > > ~4.18 onwards).
> > > > 
> > > > A combination of concurrent write()s, writeback IO completion, and
> > > > memory reclaim combine to expose the fact that the cached iomap that
> > > > is held across an iomap_begin/iomap_end iteration can become stale
> > > > without the iomap iterator actor being aware that the underlying
> > > > filesystem extent map has changed.
> > > > 
> > > > Hence actions based on the iomap state (e.g. is unwritten or newly
> > > > allocated) may actually be incorrect as writeback actions may have
> > > > changed the state (unwritten to written, delalloc to unwritten or
> > > > written, etc). This affects partial block/page operations, where we
> > > > may need to read from disk or zero cached pages depending on the
> > > > actual extent state. Memory reclaim plays it's part here in that it
> > > > removes pages containing partial state from the page cache, exposing
> > > > future partial page/block operations to incorrect behaviour.
> > > > 
> > > > Really, we should have known that this would be a problem - we have
> > > > exactly the same issue with cached iomaps for writeback, and the
> > > > ->map_blocks callback that occurs for every filesystem block we need
> > > > to write back is responsible for validating the cached iomap is
> > > > still valid. The data corruption on the write() side is a result of
> > > > not validating that the iomap is still valid before we initialise
> > > > new pages and prepare them for data to be copied in to them....
> > > > 
> > > > I'm not really happy with the solution I have for triggering
> > > > remapping of an iomap when the current one is considered stale.
> > > > Doing the right thing requires both iomap_iter() to handle stale
> > > > iomaps correctly (esp. the "map is invalid before the first actor
> > > > operation" case), and it requires the filesystem
> > > > iomap_begin/iomap_end operations to co-operate and be aware of stale
> > > > iomaps.
> > > > 
> > > > There are a bunch of *nasty* issues around handling failed writes in
> > > > XFS taht this has exposed - a failed write() that races with a
> > > > mmap() based write to the same delalloc page will result in the mmap
> > > > writes being silently lost if we punch out the delalloc range we
> > > > allocated but didn't write to. g/344 and g/346 expose this bug
> > > > directly if we punch out delalloc regions allocated by now stale
> > > > mappings.
> > > 
> > > Yuck.  I'm pretty sure that callers (xfs_buffered_write_iomap_end) is
> > > supposed to call truncate_pagecache_range with the invalidatelock (fka
> > > MMAPLOCK) held.
> > 
> > Yup, there's multiple problems with this code; apart from
> > recognising that it is obviously broken and definitely problematic,
> > I haven't dug into it further.
> 
> ...and I've been so buried in attending meetings and livedebug sessions
> related to a 4.14 corruption that now I'm starved of time to fully think
> through all the implications of this one. :(
> 
> > > > Then, because we can't punch out the delalloc we allocated region
> > > > safely when we have a stale iomap, we have to ensure when we remap
> > > > it the IOMAP_F_NEW flag is preserved so that the iomap code knows
> > > > that it is uninitialised space that is being written into so it will
> > > > zero sub page/sub block ranges correctly.
> > > 
> > > Hm.  IOMAP_F_NEW results in zeroing around, right?  So if the first
> > > ->iomap_begin got a delalloc mapping, but by the time we got the folio
> > > locked someone else managed to writeback and evict the page, we'd no
> > > longer want that zeroing ... right?
> > 
> > Yes, and that is one of the sources of the data corruption - zeroing
> > when we shouldn't.
> > 
> > There are multiple vectors to having a stale iomap here:
> > 
> > 1. we allocate the delalloc range, giving us IOMAP_DELALLOC and
> >    IOMAP_F_NEW. Writeback runs, allocating the range as unwritten.
> >    Even though the iomap is now stale, there is no data corruption
> >    in this case because the range is unwritten and so we still need
> >    zeroing.
> 
> ...and I guess this at least happens more often now that writeback does
> delalloc -> unwritten -> write -> unwritten conversion?

*nod*

> > 2. Same as above, but IO completion converts the range to written.
> >    Data corruption occurs in this case because IOMAP_F_NEW causes
> >    incorrect page cache zeroing to occur on partial page writes.
> > 
> > 3. We have an unwritten extent (prealloc, writeback in progress,
> >    etc) so we have IOMAP_UNWRITTEN. These require zeroing,
> >    regardless of whether IOMAP_F_NEW is set or not. Extent is
> >    written behind our backs, unwritten conversion occurs, and now we
> >    zero partial pages when we shouldn't.
> 
> Yikes.
> 
> > Other issues I've found:
> > 
> > 4. page faults can run the buffered write path concurrently with
> >    write() because they aren't serialised against each other. Hence
> >    we can have overlapping concurrent iomap_iter() operations with
> >    different zeroing requirements and it's anyone's guess as to
> >    which will win the race to the page lock and do the initial
> >    zeroing. This is a potential silent mmap() write data loss
> >    vector.
> 
> TBH I've long wondered why IOLOCK and MMAPLOCK both seemingly protected
> pagecache operations but the buffered io paths never seemed to take the
> MMAPLOCK, and if there was some subtle way things could go wrong.

We can't take MMAPLOCK in the buffered IO path because the user
buffer could be a mmap()d range of the same file and we need to be
able to fault in those pages during copyin/copyout. Hence we can't
hold the MMAPLOCK across iomap_iter(), nor across
.iomap_begin/.iomap_end context pairs.

taking the MMAPLOCK and dropping it again can be done in iomap_begin
or iomap_end, as long as those methods aren't called from the page
fault path....

> > 5. anything that can modify the extent layout without holding the
> >    i_rwsem exclusive can race with iomap iterating the extent list.
> >    Holding the i_rwsem shared and modifying the extent list (e.g.
> >    direct IO writes) can result in iomaps changing in the middle of,
> >    say, buffered reads (e.g. hole->unwritten->written).
> 
> Yep.  I wonder, can this result in other incorrect write behavior that
> you and I haven't thought of yet?

Entirely possible - this code is complex and there are lots of very
subtle interactions and we've already found several bonus broken
bits as a result. Hence I wouldn't be surprised if we've missed
other subtle issues and/or not fully grokked the implications of the
broken bits we've found...

[....]

> > > What happens if iomap_writepage_map errors out (say because ->map_blocks
> > > returns an error) without adding the folio to any ioend?
> > 
> > Without reading further:
> > 
> > 1. if we want to retry the write, we folio_redirty_for_writepage(),
> > unlock it and return with no error. Essentially we just skip over
> > it.
> 
> If the fs isn't shut down, I guess we could redirty the page, though I
> guess the problem is that the page is now stuck in dirty state until
> xfs_scrub fixes the problem.  If it fixes the problem.
> 
> I think for bufferhead users it's nastier because we might have a
> situation where pagedirty is unset but BH_Dirty is still set.  It
> certainly is a problem on 4.14.
> 
> > 2. If we want to fail the write, we should call set_mapping_error()
> > to record the failure for the next syscall to report and, maybe, set
> > the error flag/clear the uptodate flag on the folio depending on
> > whether we want the data to remain valid in memory or not.
> 
> <nod> That seems to be happening.  Sort of.
> 
> I think there's also a UAF in iomap_writepage_map -- if the folio is
> unlocked and we cleared (or never set) PageWriteback, isn't it possible
> that by the time we get to the mapping_set_error, the folio could have
> been torn out of the page cache and reused somewhere else?

We still have a reference to the folio at this point from the lookup
in write_cache_pages(). Hence the folio can't be freed while we are
running iomap_writepage_map().

However, we have unlocked the folio, and we don't hold either the IO
lock or the invalidate lock and so the folio could get punched out
of the page cache....

> In which case, we're at best walking off a NULL mapping and crashing the
> system, and at worst setting an IO error on the wrong mapping?

Yes, I think so - we could walk off a NULL mapping here,
but because write_cache_pages() still holds a page reference, the
page won't get freed from under us so we won't ever see the wrong
mapping being set here.

I think we could fix that simply by using inode->i_mapping instead
of folio->mapping...

> > > I think in
> > > that case we'll follow the (error && !count) case, in which we unlock
> > > the folio and exit without calling folio_redirty_for_writepage, right?
> > > The error will get recorded in the mapping for the next fsync, I think,
> > > but I also wonder if we *should* redirty because the mapping failed, not
> > > the attempt at persistence.
> > 
> > *nod*
> > 
> > I think the question that needs to be answered here is this: in what
> > case is an error being returned from ->map_blocks a recoverable
> > error that a redirty + future writeback retry will succeed?
> > 
> > AFAICT, all cases from XFS this is a fatal error (e.g. corruption of
> > the BMBT), so the failure will persist across all attempts to retry
> > the write?
> > 
> > Perhaps online repair will change this (i.e. in the background
> > repair fixes the BMBT corruption and so the next attempt to write
> > the data will succeed) so I can see that we *might* need to redirty
> > the page in this case, but....
> 
> ...but I don't know that we can practically wait for repairs to happen
> because the page is now stuck in dirty state indefinitely.

*nod*

So do we treat it as fatal for now, and revisit it later when online
repair might be able to do something better here? 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2022-09-29  2:11 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-21  8:29 [RFC PATCH 0/2] iomap/xfs: fix data corruption due to stale cached iomaps Dave Chinner
2022-09-21  8:29 ` [PATCH 1/2] iomap: write iomap validity checks Dave Chinner
2022-09-22  3:40   ` Darrick J. Wong
2022-09-22 23:16     ` Dave Chinner
2022-09-28  4:48       ` Darrick J. Wong
2022-09-21  8:29 ` [PATCH 2/2] xfs: use iomap_valid method to detect stale cached iomaps Dave Chinner
2022-09-22  3:44   ` Darrick J. Wong
2022-09-23  0:04     ` Dave Chinner
2022-09-28  4:54       ` Darrick J. Wong
2022-09-29  1:45         ` Dave Chinner
2022-10-04 23:34           ` Frank Sorenson
2022-10-05  1:34             ` Darrick J. Wong
2022-09-22  4:25 ` [RFC PATCH 0/2] iomap/xfs: fix data corruption due to " Darrick J. Wong
2022-09-22 22:59   ` Dave Chinner
2022-09-28  5:16     ` Darrick J. Wong
2022-09-29  2:11       ` Dave Chinner [this message]
2022-09-29  2:15         ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220929021128.GF3600936@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.