From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 1/2] iomap: write iomap validity checks
Date: Tue, 27 Sep 2022 21:48:19 -0700 [thread overview]
Message-ID: <YzPSE5pymHlPryz6@magnolia> (raw)
In-Reply-To: <20220922231655.GV3600936@dread.disaster.area>
On Fri, Sep 23, 2022 at 09:16:55AM +1000, Dave Chinner wrote:
> On Wed, Sep 21, 2022 at 08:40:29PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 21, 2022 at 06:29:58PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > A recent multithreaded write data corruption has been uncovered in
> > > the iomap write code. The core of the problem is partial folio
> > > writes can be flushed to disk while a new racing write can map it
> > > and fill the rest of the page:
> > >
> > > writeback new write
> > >
> > > allocate blocks
> > > blocks are unwritten
> > > submit IO
> > > .....
> > > map blocks
> > > iomap indicates UNWRITTEN range
> > > loop {
> > > lock folio
> > > copyin data
> > > .....
> > > IO completes
> > > runs unwritten extent conv
> > > blocks are marked written
> > > <iomap now stale>
> > > get next folio
> > > }
> > >
> > > Now add memory pressure such that memory reclaim evicts the
> > > partially written folio that has already been written to disk.
> > >
> > > When the new write finally gets to the last partial page of the new
> > > write, it does not find it in cache, so it instantiates a new page,
> > > sees the iomap is unwritten, and zeros the part of the page that
> > > it does not have data from. This overwrites the data on disk that
> > > was originally written.
> > >
> > > The full description of the corruption mechanism can be found here:
> > >
> > > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
> > >
> > > To solve this problem, we need to check whether the iomap is still
> > > valid after we lock each folio during the write. We have to do it
> > > after we lock the page so that we don't end up with state changes
> > > occurring while we wait for the folio to be locked.
> > >
> > > Hence we need a mechanism to be able to check that the cached iomap
> > > is still valid (similar to what we already do in buffered
> > > writeback), and we need a way for ->begin_write to back out and
> > > tell the high level iomap iterator that we need to remap the
> > > remaining write range.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > > fs/iomap/buffered-io.c | 53 +++++++++++++++++++++++++++++++++---------
> > > fs/iomap/iter.c | 33 ++++++++++++++++++++++++--
> > > include/linux/iomap.h | 17 ++++++++++++++
> > > 3 files changed, 90 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index ca5c62901541..44c806d46be4 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -584,8 +584,9 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
> > > return iomap_read_inline_data(iter, folio);
> > > }
> > >
> > > -static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> > > - size_t len, struct folio **foliop)
> > > +static int iomap_write_begin(struct iomap_iter *iter,
> > > + const struct iomap_ops *ops, loff_t pos, size_t len,
> > > + struct folio **foliop)
> > > {
> > > const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> > > const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > > @@ -618,6 +619,27 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> > > status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
> > > goto out_no_page;
> > > }
> > > +
> > > + /*
> > > + * Now we have a locked folio, before we do anything with it we need to
> > > + * check that the iomap we have cached is not stale. The inode extent
> > > + * mapping can change due to concurrent IO in flight (e.g.
> > > + * IOMAP_UNWRITTEN state can change and memory reclaim could have
> >
> > Earlier, we were talking this specific problem involves zeroing folio
> > areas around a folio-unaligned write to a !uptodate folio backed by an
> > unwritten extent. Is the scope of this corruption still limited to
> > *just* that case? Or does it affect any write to a new mapping or
> > beyond EOF? Can it affect a write to an extent that somehow gets
> > switched to unwritten or a hole after the mapping has been sampled?
>
> We only hold the IOLOCK while we are iterating the cached mapping.
> Hence anything that can modify the extent tree without holding the
> IOLOCK in a way that would exclude iomap iteration can trigger stale
> mappings.
>
> AFAICT, the ways this can happen right now are:
>
> 1. page fault path takes mapping->invalidate_lock, not IOLOCK
> 2. writeback allocating delalloc regions as unwritten
> 3. IO completion converting unwritten regions to written.
> 4. DIO writes doing extent allocation under IOLOCK_SHARED while
> other iomap iterators are holding IOLOCK_SHARED, too. (i.e. hole
> -> unwritten).
<nod> Ok, I had kinda suspected that this applied to any mapping, since
a lot can go on with IOLOCK_SHARED held.
> Note that this affects my recent suggestion about using
> IOLOCK_SHARED for buffered writes, too - that would be case #5,
> similar to #4.
<nod>
> Stuff that takes the IOLOCK_EXCL, such as truncate, fallocate, etc
> for direct extent manipulation cannot trigger stale cached iomaps in
> the IO path because they wait for all IO path operations to drain
> and for dirty data to flush across the range they are operating on.
>
> > > + * reclaimed a previously partially written page at this index after IO
> > > + * completion before this write reaches this file offset) and hence we
> > > + * could do the wrong thing here (zero a page range incorrectly or fail
> > > + * to zero) and corrupt data.
> > > + */
> > > + if (ops->iomap_valid) {
> > > + bool iomap_valid = ops->iomap_valid(iter->inode, &iter->iomap);
> >
> > At this point in the (xfs) buffered write cycle, we've taken the IOLOCK
> > and folio lock. Is it possible for ->iomap_valid to detect that the
> > mapping is no longer valid, trylock the ILOCK, and update the mapping so
> > that we don't have to drop the folio lock and go all the way back to
> > ->iomap_begin?
>
> Possibly. But we can fail to get the ILOCK, and that means we still
> have to fall back all to being outside the page lock before we can
> try again.
<nod>
> The other issue is that the iomap iter still currently points to the
> offset at the start of the write, so we aren't guaranteed that the
> refetched iomap even covers the range of the page we currently have
> locked. IOWs, we have to advance the iterator before we remap the
> range to guarantee that the new iomap covers the range of the file
> that we are currently operating on....
Aha! I hadn't caught on to that subtlety.
> > If we have to pay the cost of an indirect call, we could try to update
> > the mapping if one's available, right? Obviously, if we don't find a
> > mapping we're probably going to have to drop the folio lock and allocate
> > more space.
>
> I'd prefer that we get the slow, fall-back-all-the-way path correct
> first. We are always going to need this slow path, so let's focus on
> making it correct and robust rather than trying to prematurely
> optimise it....
Agreed.
> > > --- a/fs/iomap/iter.c
> > > +++ b/fs/iomap/iter.c
> > > @@ -7,12 +7,36 @@
> > > #include <linux/iomap.h>
> > > #include "trace.h"
> > >
> > > +/*
> > > + * Advance to the next range we need to map.
> > > + *
> > > + * If the iomap is marked IOMAP_F_STALE, it means the existing map was not fully
> > > + * processed - it was aborted because the extent the iomap spanned may have been
> > > + * changed during the operation. In this case, the iteration behaviour is to
> > > + * remap the unprocessed range of the iter, and that means we may need to remap
> > > + * even when we've made no progress (i.e. iter->processed = 0). Hence the
> > > + * "finished iterating" case needs to distinguish between
> > > + * (processed = 0) meaning we are done and (processed = 0 && stale) meaning we
> > > + * need to remap the entire remaining range.
> > > + *
> > > + * We also need to preserve IOMAP_F_STALE on the iomap so that the next call
> > > + * to iomap_begin() knows that it is reprocessing a stale map. Similarly, we
> > > + * need to preserve IOMAP_F_NEW as the filesystem may not realise that it
> > > + * is remapping a region that it allocated in the previous cycle and we still
> > > + * need to initialise partially filled pages within the remapped range.
> >
> > Must the ->iomap_begin implementation also take care to carry-over the
> > IOMAP_F_NEW to the mapping if IOMAP_F_STALE is set? Is it responsible
> > for dropping IOMAP_F_STALE, or will iomap_iter* do that?
>
> iomap_iter() carries STALE and NEW over to the the next
> ->iomap_begin() call, which is then responsible for clearing STALE
> and propagating NEW if required.
Ok. I suspect from the IRC traffic that you've moved on a bit from this
approach, so I'll wait for it to appear.
--D
>
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > index 238a03087e17..308931f0840a 100644
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > > @@ -62,8 +62,13 @@ struct vm_fault;
> > > *
> > > * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
> > > * has changed as the result of this write operation.
> > > + *
> > > + * IOMAP_F_STALE indicates that the iomap is not valid any longer and the file
> > > + * range it covers needs to be remapped by the high level before the operation
> > > + * can proceed.
> >
> > I think it might be worth mentioning that /any/ iomap_iter caller can
> > set this flag in the loop body and kick things back to ->iomap_begin.
>
> *nod*
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2022-09-28 4:48 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-21 8:29 [RFC PATCH 0/2] iomap/xfs: fix data corruption due to stale cached iomaps Dave Chinner
2022-09-21 8:29 ` [PATCH 1/2] iomap: write iomap validity checks Dave Chinner
2022-09-22 3:40 ` Darrick J. Wong
2022-09-22 23:16 ` Dave Chinner
2022-09-28 4:48 ` Darrick J. Wong [this message]
2022-09-21 8:29 ` [PATCH 2/2] xfs: use iomap_valid method to detect stale cached iomaps Dave Chinner
2022-09-22 3:44 ` Darrick J. Wong
2022-09-23 0:04 ` Dave Chinner
2022-09-28 4:54 ` Darrick J. Wong
2022-09-29 1:45 ` Dave Chinner
2022-10-04 23:34 ` Frank Sorenson
2022-10-05 1:34 ` Darrick J. Wong
2022-09-22 4:25 ` [RFC PATCH 0/2] iomap/xfs: fix data corruption due to " Darrick J. Wong
2022-09-22 22:59 ` Dave Chinner
2022-09-28 5:16 ` Darrick J. Wong
2022-09-29 2:11 ` Dave Chinner
2022-09-29 2:15 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YzPSE5pymHlPryz6@magnolia \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).