From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 2/4] xfs: force writes to delalloc regions to unwritten
Date: Tue, 19 Mar 2019 11:02:49 -0700 [thread overview]
Message-ID: <20190319180249.GS4929@magnolia> (raw)
In-Reply-To: <20190318215014.GA21231@bfoster>
On Mon, Mar 18, 2019 at 05:50:14PM -0400, Brian Foster wrote:
> On Tue, Mar 19, 2019 at 07:35:06AM +1100, Dave Chinner wrote:
> > On Mon, Mar 18, 2019 at 08:40:39AM -0400, Brian Foster wrote:
> > > On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote:
> > > > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote:
> > > > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote:
> > > > > > > If the delalloc extent crosses EOF then I think it makes sense to map it
> > > > > > > in as unwritten, issue the write, and convert the extent to real during
> > > > > > > io completion, and not split it up just to avoid having unwritten
> > > > > > > extents past EOF.
> > > > > >
> > > > > > Yup, makes sense. For a sequential write, they should always be
> > > > > > beyond EOF. For anything else, we use unwritten.
> > > > > >
> > > > >
> > > > > I'm not quite sure I follow the suggested change in behavior here tbh so
> > > > > maybe this is not an issue, but another thing to consider is whether
> > > > > selective delalloc -> real conversion for post-eof extents translates to
> > > > > conditional stale data exposure vectors in certain file extension
> > > > > scenarios.
> > > >
> > > > No, it doesn't because the transaction that moves EOF is done after
> > > > the data write into the region it exposes is complete. hence the
> > > > device cache flush that occurs before the log write containing the
> > > > transaction that moves the EOF will always result in the data being
> > > > on stable storage *before* the extending szie transaction hits the
> > > > journal and exposes it.
> > > >
> > >
> > > Ok, but this ordering alone doesn't guarantee the data we've just
> > > written is on-disk before the transaction hits the log.
> >
> > Which transaction are you talking about? This ordering guarantees
> > that the data is on stable storage before the EOF size change
> > transactions that exposes the region is on disk (that's the whole
> > point of updates after data writes).
> >
> > If you are talking about the allocation transaction taht we are
> > going to write the data into, then you are right that it doesn't
> > guarantee that the data is on disk before that commits, but when the
> > allocation is beyond EOF it doesn't matter ebcause is won't be
> > exposed until after the data is written and the EOF moving
> > transaction is committed.
> >
>
> The extending transaction that you referred to above and with respect to
> the device cache flush. I'm simply saying that the transaction has to be
> ordered with respect to full writeback completion as well, which is also
> what you are saying further down. (We're in agreement... ;)).
>
> > As I've previously proposed, what I suspect we should be doing is
> > not commiting the allocation transaction until IO completion, which
> > closes this stale data exposure hole completely and removes the need
> > for allocating unwritten extentsi and then having to convert them.
> > Direct IO could also do this, and so it could stop using unwritten
> > extents, too....
> >
>
> That sounds like an interesting (and potentially more efficient)
> alternative to using unwritten extents across the board only to convert
> them as I/Os complete. I guess we'd need to make sure that the
> allocation transaction holds across potentially many ioends considering
> the max extents size of 8GB or so. I suppose the locking and handling of
> fragmented allocation cases could get a bit interesting there as well,
> but it's probably worth a prototype attempt at least to survey the
> concept... (I wonder if Darrick is still paying attention or found
> something more interesting to do..? :D)
/me wandered off into fixing the io completion mess; if either of you
want to work on a better prototype, I encourage you to go for it. :)
Though I wonder how bad would be the effects of holding the allocation
transaction open across a (potentially very slow) writeout. We'd still
be holding onto the dirty inode's ilock as well as the AGF header and
whatever bnobt/cntbt blocks are dirty. Wouldn't that stall any other
thread that was walking the AGs looking for free space?
> I also assume we'd still need to use unwritten extents for cases where
> the allocation completes before all of the extent is within EOF (i.e.,
> the extent zeroing idea you mentioned on irc..).
Yes, I think it would still be necessary. My guess is that
xfs_bmapi_convert_delalloc has to create unwritten extents for any
extent starting before the on-disk EOF but can create real extents for
anything past the on-disk isize (because we don't set the on-disk isize
to match the in-core isize until writes complete). I guess the "zero
pages between old isize and new isize" code can simply reset any real
extents it finds to be unwritten too, as we've been discussing.
> > > > > I think even the post-eof zeroing code may be susceptible to
> > > > > this problem if the log happens to push after a truncate transaction
> > > > > commits but before the associated dirty pages are written back..
> > > >
> > > > That's what this code in xfs_setattr_size() handles:
> > > >
> > > > /*
> > > > * We are going to log the inode size change in this transaction so
> > > > * any previous writes that are beyond the on disk EOF and the new
> > > > * EOF that have not been written out need to be written here. If we
> > > > * do not write the data out, we expose ourselves to the null files
> > > > * problem. Note that this includes any block zeroing we did above;
> > > > * otherwise those blocks may not be zeroed after a crash.
> > > > */
> > > > if (did_zeroing ||
> > > > (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
> > > > error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> > > > ip->i_d.di_size, newsize - 1);
> > > > if (error)
> > > > return error;
> > > > }
> > > >
> > >
> > > Ah, I forgot about this code. So this combined with the above provides
> > > correctness for this particular case with respect to stale data exposure
> > > after a crash. AFAICT this is still a performance tradeoff as the
> > > zeroing may not be necessary if the post-eof extents happened to be
> > > unwritten. This optimization is also achieved "for free" in this path
> > > because did_zeroing is only set if iomap had to physically zero some
> > > pages.
> > >
> > > This also only covers one extension scenario. It looks like the typical
> > > write extend scenario is intended to be handled similarly by submitting
> > > the di_size update transaction from data write completion. It looks to
> > > me that this should be fairly robust in most common cases (i.e.,
> > > background writeback and/or fsync()), but I'm not totally convinced this
> > > is always robust because write time extension doesn't perform the same
> > > kind of writeback ordering as the truncate example above.
> >
> > It depends on IO completion order as to whether writeback is truly
> > robust. Calling filemap_write_and_wait_range() doesn't change that,
> > because it doesn't strictly order IO completions and so we can still
> > expose a region under IO that hasn't been signalled as complete by
> > the hardware.
> >
>
> Indeed.
>
> > > Consider sync_file_range() for example.
> >
> > sync_file_range() is not a data integrity operation - it does not
> > ensure disk caches are flushed and so cannot provide any crash
> > guarantees. hence we can completely ignore when we talk about data
> > integrity operations - it's no different from normal writeback.
> >
>
> Of course. I am by no means claiming sync_file_range() is an integrity
> operation. Technically neither is background writeback, even though the
> latter has fairly predictable behavior. I'm simply using
> sync_file_range() as an example and means to simulate out of order
> writeback for test purposes.
>
> > > Suppose we have a large,
> > > physical post-eof prealloc extent and perform a sparse extended write to
> > > the last block of that extent followed by a sync_file_range() of the
> > > just written data. The file write should physically zero the pagecache
> > > for the preallocated region before the write, then the proper write
> > > completes and returns. Nothing has changed on disk so far. A
> > > sync_file_range() call sets the range start/end in the writeback_control
> > > and works its way through __filemap_fdatawrite_range() to
> > > write_cache_pages() and our writeback mechanism for the requested pages
> > > only. From there, the same logic applies as for background writeback: we
> > > issue writeback and I/O completion commits a file extension transaction.
> > > AFAICT, nothing has guaranteed writeback has even initiated on any of
> > > the zeroed pages over the non-unwritten extent that has just been pulled
> > > within EOF, which means we are susceptible to stale data exposure until
> > > that happens.
> >
> > Yup, yet another reason why sync_file_range() is useless as a data
> > integrity operation. I've had to explain this repeatedly to people
> > over the past 10-15 years, and eventually the man page was updated
> > with this:
> >
> > Warning
> > This system call is extremely dangerous and should not be
> > used in portable programs. None of these operations writes
> > out the file's meta¿ data. Therefore, unless the
> > application is strictly performing overwrites of
> > already-instantiated disk blocks, there are no guarantees
> > that the data will be available after a crash. There is no
> > user interface to know if a write is purely an overwrite.
> > On filesystems using copy- on-write semantics (e.g., btrfs)
> > an overwrite of existing allocated blocks is impossible.
> > When writing into preallocated space, many filesystems
> > also require calls into the block allocator, which this
> > system call does not sync out to disk. This system call
> > does not flush disk write caches and thus does not provide
> > any data integrity on systems with volatile disk write
> > caches.
> >
> > >
> ...
> >
> > Yup, but keep in mind this is exactly what is expected from using
> > sync_file_range(). Friends don't let friends use sync_file_range()
> > because it's a guaranteed vector for user data corruption and stale
> > data exposure.
> >
>
> Again, I'm using sync_file_range() as a dumb lever to control writeback
> completion ordering. It's a test tool for my purposes and not something
> I'd suggest for use otherwise or that should inherently provide
> integrity guarantees. Please don't get distracted reading more into it
> than that. We can induce the exact same writeback behavior and stale
> data exposure problem using something like hole punch instead, for
> example, and it's just as reliable a reproducer.
>
> The broader point is that by controlling writeback ordering, we can
> explicitly demonstrate stale data exposure. If fixed properly, we should
> never expect stale data exposure even in the event of out of order
> writeback completion, regardless of the cause.
Agree. Even if s_f_r is a giant trash fire, if we're exposing stale
disk contents it's game over anyway and we ought to fix it.
--D
> Brian
>
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
next prev parent reply other threads:[~2019-03-19 18:02 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-03-14 21:28 [PATCH 0/4] xfs: various rough fixes Darrick J. Wong
2019-03-14 21:29 ` [PATCH 1/4] xfs: only free posteof blocks on first close Darrick J. Wong
2019-03-15 3:42 ` Dave Chinner
2019-03-14 21:29 ` [PATCH 2/4] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
2019-03-14 23:08 ` Dave Chinner
2019-03-15 3:13 ` Darrick J. Wong
2019-03-15 3:40 ` Dave Chinner
2019-03-15 12:29 ` Brian Foster
2019-03-17 22:40 ` Dave Chinner
2019-03-18 12:40 ` Brian Foster
2019-03-18 20:35 ` Dave Chinner
2019-03-18 21:50 ` Brian Foster
2019-03-19 18:02 ` Darrick J. Wong [this message]
2019-03-19 20:25 ` Dave Chinner
2019-03-20 12:02 ` Brian Foster
2019-03-20 21:10 ` Dave Chinner
2019-03-21 12:27 ` Brian Foster
2019-03-22 1:52 ` Dave Chinner
2019-03-22 12:46 ` Brian Foster
2019-04-03 22:38 ` Darrick J. Wong
2019-04-04 12:50 ` Brian Foster
2019-04-04 15:22 ` Darrick J. Wong
2019-04-04 16:16 ` Brian Foster
2019-04-04 22:08 ` Dave Chinner
2019-04-05 11:24 ` Brian Foster
2019-04-05 15:12 ` Darrick J. Wong
2019-04-08 0:17 ` Dave Chinner
2019-04-08 12:02 ` Brian Foster
2019-03-14 21:29 ` [PATCH 3/4] xfs: trace transaction reservation logcount overflow Darrick J. Wong
2019-03-15 12:30 ` Brian Foster
2019-03-14 21:29 ` [PATCH 4/4] xfs: avoid reflink end cow deadlock Darrick J. Wong
2019-03-15 12:31 ` Brian Foster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190319180249.GS4929@magnolia \
--to=darrick.wong@oracle.com \
--cc=bfoster@redhat.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).