From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: xfs, iomap: fix data corruption due to stale cached iomaps
Date: Wed, 2 Nov 2022 10:23:43 -0700 [thread overview]
Message-ID: <Y2Knn6+m8kxCnMoQ@magnolia> (raw)
In-Reply-To: <20221101042150.GN3600936@dread.disaster.area>
On Tue, Nov 01, 2022 at 03:21:50PM +1100, Dave Chinner wrote:
> On Mon, Oct 31, 2022 at 08:39:11PM -0700, Darrick J. Wong wrote:
> > On Tue, Nov 01, 2022 at 11:34:05AM +1100, Dave Chinner wrote:
> > > Recently a customer workload encountered a data corruption in a
> > > specific multi-threaded write operation. The workload combined
> > > racing unaligned adjacent buffered writes with low memory conditions
> > > that caused both writeback and memory reclaim to race with the
> > > writes.
> > >
> > > The result of this was random partial blocks containing zeroes
> > > instead of the correct data. The underlying problem is that iomap
> > > caches the write iomap for the duration of the write() operation,
> > > but it fails to take into account that the extent underlying the
> > > iomap can change whilst the write is in progress.
> >
> > Wheeeee....
> >
> > > The short story is that an iomap can span multiple folios, and so
> > > under low memory, writeback can be cleaning folios the write()
> > > overlaps. Whilst the overlapping data is cached in memory, this
> > > isn't a problem, but because the folios are now clean they can be
> > > reclaimed. Once reclaimed, the write() does the wrong thing when
> > > re-instantiating partial folios because the iomap no longer reflects
> > > the underlying state of the extent. e.g. it thinks the extent is
> > > unwritten, so it zeroes the partial range, when in fact the
> > > underlying extent is now written and so it should have read the data
> >
> > DOH!!
> >
> > > from disk. This is how we get random zero ranges in the file
> > > instead of the correct data.
> > >
> > > The gory details of the race condition can be found here:
> > >
> > > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
> > >
> > > Fixing the problem has two aspects. The first is ensuring that
> > > iomap can detect a stale cached iomap during a
> > > write in a race-free manner. We already do this stale iomap
> > > detection in the writeback path, so we have a mechanism for
> > > detecting that the iomap backing the data range may have changed
> > > and needs to be remapped.
> > >
> > > In the case of the write() path, we have to ensure that the iomap is
> > > validated at a point in time when the page cache is stable and
> > > cannot be reclaimed from under us. We also need to validate the
> > > extent before we start performing any modifications to the folio
> > > state or contents. Combine these two requirements together, and the
> > > only "safe" place to validate the iomap is after we have looked up
> > > and locked the folio we are going to copy the data into, but before
> > > we've performed any initialisation operations on that folio.
> > >
> > > If the iomap fails validation, we then mark it stale, unlock the
> > > folio and end the write. This effectively means a stale iomap
> > > results in a short write.
> >
> > Does this short write echo all the way out to userspace? Or do we
> > merely go 'round iomap_iter() again?
>
> It triggers another iomap_iter() loop because we haven't consumed
> the data in the iov_iter yet. It essentially advances the iter to
> the end of the short write and runs another begin/write iter/end
> loop to continue the write with a new iomap. The stale flag is
> needed so that a "no bytes written so return to the caller" can be
> distinguished from "no bytes written because the iomap was
> immediately found to be stale so we need to loop again to remap it".
<nod> Now that I've read the patches with fresh eyes, I've seen that. :)
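To make sure I have the control flow straight in my head, the
write-begin path now looks roughly like this (my own sketch, not the
actual patch; iomap_revalidate() is a stand-in name for whatever hook
the series ends up with, though IOMAP_F_STALE matches the flag it
adds):

	static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
			size_t len, struct folio **foliop)
	{
		struct folio *folio;

		/* Look up and lock the folio we will copy data into. */
		folio = __filemap_get_folio(iter->inode->i_mapping,
				pos >> PAGE_SHIFT, FGP_LOCK | FGP_CREAT,
				mapping_gfp_mask(iter->inode->i_mapping));
		if (!folio)
			return -ENOMEM;

		/*
		 * The folio lock pins the page cache state, so this is
		 * the one safe point to check the cached iomap against
		 * the current extent state: after the lookup, before
		 * any folio initialisation.
		 */
		if (!iomap_revalidate(iter)) {
			iter->iomap.flags |= IOMAP_F_STALE;
			folio_unlock(folio);
			folio_put(folio);
			/*
			 * The caller sees the stale flag, ends the
			 * write short, and iomap_iter() remaps the
			 * range and loops.
			 */
			return 0;
		}

		/* ... partial folio zeroing/read-in happens here ... */
		*foliop = folio;
		return 0;
	}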
> > > write() in progress doesn't necessarily own the data in the page
> > > cache over the range of the delalloc extent it just allocated.
> > >
> > > As a result, we can't just truncate the page cache over the range
> > > the write() didn't reach and punch all the delalloc extent. We have
> > > to walk the page cache over the untouched range and skip over any
> > > dirty data region in the cache in that range. Which is ....
> > > non-trivial.
> > >
> > > That is, iterating the page cache has to handle partially populated
> > > folios (i.e. block size < page size) that contain data. The data
> > > might be discontiguous within a folio. Indeed, there might be
> > > *multiple* discontiguous data regions within a single folio. And to
> > > make matters more complex, multi-page folios mean we just don't know
> > > how many sub-folio regions we might have to iterate to find all
> > > these regions. All the corner cases between the conversions and
> > > rounding between filesystem block size, folio size and multi-page
> > > folio size combined with unaligned write offsets kept breaking my
> > > brain.
> > >
> > > Eventually, I realised that if the XFS code tracked the processed
> > > write regions by byte ranges instead of filesystem block or page
> > > cache index, we could simply use mapping_seek_hole_data() to find
> > > the start and end of each discrete data region within the range we
> > > needed to scan. SEEK_DATA finds the start of the cached data region,
> > > SEEK_HOLE finds the end of the region. These are byte-based
> > > interfaces that understand partially uptodate folio regions, and so
> > > can iterate discrete sub-folio data regions directly. This largely
> > > solved the problem of discovering the dirty regions we need to keep
> > > the delalloc extent over.
> >
> > Heh. Clever!
>
> It'd be clever if I didn't take a couple of weeks banging my head
> against continual off-by-one and rounding bugs before I realised
> there was a better way... :(
>
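FWIW the byte-range version is much easier to reason about. The scan
I'm picturing goes something like this (my sketch of the idea, not the
patch itself; punch_delalloc() is made up, but mapping_seek_hole_data()
and its -ENXIO "no data" return are real):

	/*
	 * Walk the page cache over [start_byte, end_byte) and punch
	 * the delalloc extent only where there is no cached data.
	 * mapping_seek_hole_data() understands partially uptodate
	 * folios, so each pass isolates one discrete sub-folio data
	 * region regardless of block size or folio size.
	 */
	static void punch_outside_data(struct inode *inode,
			loff_t start_byte, loff_t end_byte)
	{
		struct address_space *mapping = inode->i_mapping;
		loff_t punch_from = start_byte;

		while (punch_from < end_byte) {
			loff_t data_start, data_end;

			/* Start of the next cached data region. */
			data_start = mapping_seek_hole_data(mapping,
					punch_from, end_byte, SEEK_DATA);
			if (data_start == -ENXIO ||
			    data_start == end_byte)
				break;	/* no more cached data */

			/* End of that data region. */
			data_end = mapping_seek_hole_data(mapping,
					data_start, end_byte, SEEK_HOLE);

			/* Punch the gap in front, keep the data. */
			if (data_start > punch_from)
				punch_delalloc(inode, punch_from,
						data_start);
			punch_from = data_end;
		}

		/* Punch whatever trails the last data region. */
		if (punch_from < end_byte)
			punch_delalloc(inode, punch_from, end_byte);
	}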
> > > Of course, now xfs/196 fails. This is an error injection test that is
> > > supposed to exercise the delalloc extent recovery code that the above
> > > fixes just completely reworked. The error injection assumes that it
> > > can just truncate the page cache over the write and then punch out
> > > the delalloc extent completely. This is fundamentally broken, and
> > > has only been working by chance - the chance being that writes are page
> > > aligned and page-aligned writes don't install large folios in the
> > > page cache.
> >
> > Side question -- should we introduce a couple of debug knobs to slow
> > down page cache write and writeback so that we can make it easier to
> > (cough) experiment with races?
>
> Perhaps so. There was once the equivalent in Irix (dating back as
> far as 1996, IIRC) for this sort of testing. Search for
> XFSRACEDEBUG in the historic archive; there are:
>
> #ifdef XFSRACEDEBUG
> delay_for_intr();
> delay(100);
> #endif
>
> clauses added and removed regularly in the data and metadata IO
> paths through the early history of the code base....
<nod>
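If we do grow such a knob, I'd expect it to land alongside the existing
delay knobs under /sys/fs/xfs/debug/. Something like this sketch (the
helper and its wiring are hypothetical):

	#include <linux/delay.h>

	/*
	 * Hypothetical race-widening delay in the spirit of the Irix
	 * XFSRACEDEBUG clauses above: sprinkle calls into the write
	 * and writeback paths, tune the delay from userspace, and
	 * races that normally need memory pressure to hit become
	 * easy to open up.
	 */
	static inline void xfs_race_delay(unsigned int delay_ms)
	{
		if (IS_ENABLED(CONFIG_XFS_DEBUG) && delay_ms)
			msleep(delay_ms);
	}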
> > Now that we've been bitten hard by the writeback race, I wrote a
> > testcase to exercise it. The problem was solved long ago so the only
> > way you can tell it's working is to inject debugging printks and look for
> > them, but I feel like we ought to have test points for these two serious
> > corruption problems.
>
> Well, I just assume writeback races with everything and mostly
> everything is OK. What I didn't expect was racing writeback enabling
> racing reclaim to free pages dirtied by a write() before the write() has
> completed (i.e. while it is still dirtying pages and copying data
> into the page cache). I.e., it was the racing page reclaim that exposed
> the data corruption bugs, not writeback.
>
> As it is, I suspect that we should actually try to run fstests in
> massively constrained memcgs more often - all sorts of problems have
> fallen out of doing this (e.g. finding GFP_KERNEL allocations
> in explicit, known GFP_NOFS contexts)....
FWIW I mentioned on IRC yesterday that I tried running fstests with
mem=560M and watched the system fail spectacularly.
OTOH, now that fstests can run testcase processes in a scope, we can
turn on memcgs with low limits pretty easily.
> > > IOWs, with sub-folio block size, and not know what size folios are
> > > in the cache, we can't actually guarantee that we can remove the
> >
> > Is that supposed to read "...and not knowing what size folios..."?
>
> *nod*
>
> > > cached dirty folios from the cache via truncation, and hence the new
> > > code will not remove the delalloc extents under those dirty folios.
> > > As a result, the error injection results in writing zeroes to disk
> > > rather that removing the delalloc extents from memory. I can't make
> >
> > rather *than*?
>
> *nod*
>
> > > this error injection work the way it was intended, so I removed
> > > it. The code that it is supposed to exercise is now exercised every time we
> > > detect a stale iomap, so we have much better coverage of the failed
> > > write error handling than the error injection provides us with,
> > > anyway....
> > >
> > > So, this passes fstests on 1kb and 4kb block sizes and the data
> > > corruption reproducer does not detect data corruption, so this set
> > > of fixes is /finally/ something I'd consider ready for merge.
> > > Comments and testing welcome!
> >
> > Urrrk. Thank you for your work on this. :)
> >
> > /me functionally braindead on Halloween candy, will read the code
> > tomorrow. Well ok we'll see how long it takes curiosity to get the
> > better of me...
>
> Thanks!
(Well, I guess it took a day and a half.)
--D
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
Thread overview: 42+ messages
2022-11-01 0:34 xfs, iomap: fix data corruption due to stale cached iomaps Dave Chinner
2022-11-01 0:34 ` [PATCH 1/7] xfs: write page faults in iomap are not buffered writes Dave Chinner
2022-11-02 7:17 ` Christoph Hellwig
2022-11-02 16:12 ` Darrick J. Wong
2022-11-02 21:11 ` Dave Chinner
2022-11-01 0:34 ` [PATCH 2/7] xfs: punching delalloc extents on write failure is racy Dave Chinner
2022-11-02 7:18 ` Christoph Hellwig
2022-11-02 16:22 ` Darrick J. Wong
2022-11-01 0:34 ` [PATCH 3/7] xfs: use byte ranges for write cleanup ranges Dave Chinner
2022-11-02 7:20 ` Christoph Hellwig
2022-11-02 16:32 ` Darrick J. Wong
2022-11-04 5:40 ` Dave Chinner
2022-11-07 23:53 ` Darrick J. Wong
2022-11-01 0:34 ` [PATCH 4/7] xfs: buffered write failure should not truncate the page cache Dave Chinner
2022-11-01 11:57 ` kernel test robot
2022-11-02 7:24 ` Christoph Hellwig
2022-11-02 20:57 ` Dave Chinner
2022-11-02 16:41 ` Darrick J. Wong
2022-11-02 21:04 ` Dave Chinner
2022-11-02 22:26 ` Darrick J. Wong
2022-11-04 8:08 ` Christoph Hellwig
2022-11-04 23:10 ` Dave Chinner
2022-11-07 23:48 ` Darrick J. Wong
2022-11-01 0:34 ` [PATCH 5/7] iomap: write iomap validity checks Dave Chinner
2022-11-02 8:36 ` Christoph Hellwig
2022-11-02 16:43 ` Darrick J. Wong
2022-11-02 16:58 ` Darrick J. Wong
2022-11-03 0:35 ` Dave Chinner
2022-11-04 8:12 ` Christoph Hellwig
2022-11-02 16:57 ` Darrick J. Wong
2022-11-01 0:34 ` [PATCH 6/7] xfs: use iomap_valid method to detect stale cached iomaps Dave Chinner
2022-11-01 9:15 ` kernel test robot
2022-11-02 8:41 ` Christoph Hellwig
2022-11-02 21:39 ` Dave Chinner
2022-11-04 8:14 ` Christoph Hellwig
2022-11-02 17:19 ` Darrick J. Wong
2022-11-02 22:36 ` Dave Chinner
2022-11-08 0:00 ` Darrick J. Wong
2022-11-01 0:34 ` [PATCH 7/7] xfs: drop write error injection is unfixable, remove it Dave Chinner
2022-11-01 3:39 ` xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
2022-11-01 4:21 ` Dave Chinner
2022-11-02 17:23 ` Darrick J. Wong [this message]