From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>, xfs <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH] xfs: split up the xfs_reflink_end_cow work into smaller transactions
Date: Tue, 4 Dec 2018 07:28:53 -0500 [thread overview]
Message-ID: <20181204122852.GA63209@bfoster> (raw)
In-Reply-To: <20181203231929.GE24487@magnolia>
On Mon, Dec 03, 2018 at 03:19:29PM -0800, Darrick J. Wong wrote:
> On Mon, Dec 03, 2018 at 10:13:12AM -0500, Brian Foster wrote:
> > On Fri, Nov 30, 2018 at 11:19:10AM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > >
> > > In xfs_reflink_end_cow, we allocate a single transaction for the entire
> > > end_cow operation and then loop the CoW fork mappings to move them to
> > > the data fork. This design fails on a heavily fragmented filesystem
> > > where an inode's data fork has exactly one more extent than would fit in
> > > an extents-format fork, because the unmap can collapse the data fork
> > > into extents format (freeing the bmbt block) but the remap can expand
> > > the data fork back into a (newly allocated) bmbt block. If the number
> > > of extents we end up remapping is large, we can overflow the block
> > > reservation because we reserved blocks assuming that we were adding
> > > mappings into an already-cleared area of the data fork.
> > >
> > > Let's say we have 8 extents in the data fork, 8 extents in the CoW fork,
> > > and the data fork can hold at most 7 extents before needing to convert
> > > to btree format; and that blocks A-P are discontiguous single-block
> > > extents:
> > >
> > > 0......7
> > > D: ABCDEFGH
> > > C: IJKLMNOP
> > >
> > > When a write to file blocks 0-7 completes, we must remap I-P into the
> > > data fork. We start by removing H from the btree-format data fork. Now
> > > we have 7 extents, so we convert the fork to extents format, freeing the
> > > bmbt block. We then move P into the data fork and it now has 8 extents
> > > again. We must convert the data fork back to btree format, requiring a
> > > block allocation. If we repeat this sequence for blocks 6-5-4-3-2-1-0,
> > > we'll need a total of 8 block allocations to remap all 8 blocks. We
> > > reserved only enough blocks to handle one btree split (5 blocks on a 4k
> > > block filesystem), which means we overflow the block reservation.
> > >
> > > To fix this issue, create a separate helper function to remap a single
> > > extent, and change _reflink_end_cow to call it in a tight loop over the
> > > entire range we're completing. As a side effect this also removes the
> > > size restrictions on how many extents we can end_cow at a time, though
> > > nobody ever hit that. It is not reasonable to reserve N blocks to remap
> > > N blocks.
> > >
> > > Note that this can be reproduced after ~320 million fsx ops while
> > > running generic/938 (long soak directio fsx exerciser):
> > >
> > > XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116
> > > <machine registers snipped>
> > > Call Trace:
> > > xfs_trans_dup+0x211/0x250 [xfs]
> > > xfs_trans_roll+0x6d/0x180 [xfs]
> > > xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
> > > xfs_defer_finish_noroll+0xdf/0x740 [xfs]
> > > xfs_defer_finish+0x13/0x70 [xfs]
> > > xfs_reflink_end_cow+0x2c6/0x680 [xfs]
> > > xfs_dio_write_end_io+0x115/0x220 [xfs]
> > > iomap_dio_complete+0x3f/0x130
> > > iomap_dio_rw+0x3c3/0x420
> > > xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
> > > xfs_file_write_iter+0x8b/0xc0 [xfs]
> > > __vfs_write+0x193/0x1f0
> > > vfs_write+0xba/0x1c0
> > > ksys_write+0x52/0xc0
> > > do_syscall_64+0x50/0x160
> > > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > >
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > > fs/xfs/xfs_reflink.c | 202 ++++++++++++++++++++++++++------------------------
> > > 1 file changed, 105 insertions(+), 97 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 322a852ce284..af0b2da76adf 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -623,53 +623,41 @@ xfs_reflink_cancel_cow_range(
> > > }
> > >
> > > /*
> > > - * Remap parts of a file's data fork after a successful CoW.
> > > + * Remap part of the CoW fork into the data fork.
> > > + *
> > > + * We aim to remap the range starting at @offset_fsb and ending at @end_fsb
> > > + * into the data fork; this function will remap what it can (at the end of the
> > > + * range) and update @end_fsb appropriately. Each remap gets its own
> > > + * transaction because we can end up merging and splitting bmbt blocks for
> > > + * every remap operation and we'd like to keep the block reservation
> > > + * requirements as low as possible.
> > > */
> > > -int
> > > -xfs_reflink_end_cow(
> > > - struct xfs_inode *ip,
> > > - xfs_off_t offset,
> > > - xfs_off_t count)
> > > +STATIC int
> > > +xfs_reflink_end_cow_extent(
> > > + struct xfs_inode *ip,
> > > + xfs_fileoff_t offset_fsb,
> > > + xfs_fileoff_t *end_fsb)
> > > {
> > > - struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > > - struct xfs_bmbt_irec got, del;
> > > - struct xfs_trans *tp;
> > > - xfs_fileoff_t offset_fsb;
> > > - xfs_fileoff_t end_fsb;
> > > - int error;
> > > - unsigned int resblks;
> > > - xfs_filblks_t rlen;
> > > - struct xfs_iext_cursor icur;
> > > -
> > > - trace_xfs_reflink_end_cow(ip, offset, count);
> > > + struct xfs_bmbt_irec got, del;
> > > + struct xfs_iext_cursor icur;
> > > + struct xfs_mount *mp = ip->i_mount;
> > > + struct xfs_trans *tp;
> > > + struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > > + xfs_filblks_t rlen;
> > > + unsigned int resblks;
> > > + int error;
> > >
> > > /* No COW extents? That's easy! */
> > > - if (ifp->if_bytes == 0)
> > > + if (ifp->if_bytes == 0) {
> > > + *end_fsb = offset_fsb;
> > > return 0;
> > > -
> > > - offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> > > - end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> > > -
> > > - /*
> > > - * Start a rolling transaction to switch the mappings. We're
> > > - * unlikely ever to have to remap 16T worth of single-block
> > > - * extents, so just cap the worst case extent count to 2^32-1.
> > > - * Stick a warning in just in case, and avoid 64-bit division.
> > > - */
> > > - BUILD_BUG_ON(MAX_RW_COUNT > UINT_MAX);
> > > - if (end_fsb - offset_fsb > UINT_MAX) {
> > > - error = -EFSCORRUPTED;
> > > - xfs_force_shutdown(ip->i_mount, SHUTDOWN_CORRUPT_INCORE);
> > > - ASSERT(0);
> > > - goto out;
> > > }
> > > - resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount,
> > > - (unsigned int)(end_fsb - offset_fsb),
> > > - XFS_DATA_FORK);
> > > - error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> > > - resblks, 0, XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp);
> > > +
> > > + resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> > > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> > > + XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp);
> > > if (error)
> > > - goto out;
> > > + goto out_err;
> > >
> > > xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > xfs_trans_ijoin(tp, ip, 0);
> > > @@ -679,80 +667,100 @@ xfs_reflink_end_cow(
> > > * left by the time I/O completes for the loser of the race. In that
> > > * case we are done.
> > > */
> > > - if (!xfs_iext_lookup_extent_before(ip, ifp, &end_fsb, &icur, &got))
> > > + if (!xfs_iext_lookup_extent_before(ip, ifp, end_fsb, &icur, &got) ||
> > > + got.br_startoff + got.br_blockcount <= offset_fsb) {
> > > + *end_fsb = offset_fsb;
> > > goto out_cancel;
> > > + }
> > >
> > > - /* Walk backwards until we're out of the I/O range... */
> > > - while (got.br_startoff + got.br_blockcount > offset_fsb) {
> > > - del = got;
> > > - xfs_trim_extent(&del, offset_fsb, end_fsb - offset_fsb);
> > > -
> > > - /* Extent delete may have bumped ext forward */
> > > - if (!del.br_blockcount)
> > > - goto prev_extent;
> > > + /*
> > > + * Structure copy @got into @del, then trim @del to the range that we
> > > + * were asked to remap. We preserve @got for the eventual CoW fork
> > > + * deletion; from now on @del represents the mapping that we're
> > > + * actually remapping.
> > > + */
> > > + del = got;
> > > + xfs_trim_extent(&del, offset_fsb, *end_fsb - offset_fsb);
> > >
> > > - /*
> > > - * Only remap real extent that contain data. With AIO
> > > - * speculatively preallocations can leak into the range we
> > > - * are called upon, and we need to skip them.
> > > - */
> > > - if (!xfs_bmap_is_real_extent(&got))
> > > - goto prev_extent;
> > > + ASSERT(del.br_blockcount > 0);
> > >
> > > - /* Unmap the old blocks in the data fork. */
> > > - ASSERT(tp->t_firstblock == NULLFSBLOCK);
> > > - rlen = del.br_blockcount;
> > > - error = __xfs_bunmapi(tp, ip, del.br_startoff, &rlen, 0, 1);
> > > - if (error)
> > > - goto out_cancel;
> > > + /*
> > > + * Only remap real extents that contain data. With AIO, speculative
> > > + * preallocations can leak into the range we are called upon, and we
> > > + * need to skip them.
> > > + */
> > > + if (!xfs_bmap_is_real_extent(&got)) {
> > > + *end_fsb = del.br_startoff;
> > > + goto out_cancel;
> > > + }
> > >
> > > - /* Trim the extent to whatever got unmapped. */
> > > - if (rlen) {
> > > - xfs_trim_extent(&del, del.br_startoff + rlen,
> > > - del.br_blockcount - rlen);
> > > - }
> > > - trace_xfs_reflink_cow_remap(ip, &del);
> > > + /* Unmap the old blocks in the data fork. */
> > > + rlen = del.br_blockcount;
> > > + error = __xfs_bunmapi(tp, ip, del.br_startoff, &rlen, 0, 1);
> > > + if (error)
> > > + goto out_cancel;
> > >
> > > - /* Free the CoW orphan record. */
> > > - error = xfs_refcount_free_cow_extent(tp, del.br_startblock,
> > > - del.br_blockcount);
> > > - if (error)
> > > - goto out_cancel;
> > > + /* Trim the extent to whatever got unmapped. */
> > > + xfs_trim_extent(&del, del.br_startoff + rlen, del.br_blockcount - rlen);
> > > + trace_xfs_reflink_cow_remap(ip, &del);
> > >
> > > - /* Map the new blocks into the data fork. */
> > > - error = xfs_bmap_map_extent(tp, ip, &del);
> > > - if (error)
> > > - goto out_cancel;
> > > + /* Free the CoW orphan record. */
> > > + error = xfs_refcount_free_cow_extent(tp, del.br_startblock,
> > > + del.br_blockcount);
> > > + if (error)
> > > + goto out_cancel;
> > >
> > > - /* Charge this new data fork mapping to the on-disk quota. */
> > > - xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_DELBCOUNT,
> > > - (long)del.br_blockcount);
> > > + /* Map the new blocks into the data fork. */
> > > + error = xfs_bmap_map_extent(tp, ip, &del);
> > > + if (error)
> > > + goto out_cancel;
> > >
> > > - /* Remove the mapping from the CoW fork. */
> > > - xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
> > > + /* Charge this new data fork mapping to the on-disk quota. */
> > > + xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_DELBCOUNT,
> > > + (long)del.br_blockcount);
> > >
> > > - error = xfs_defer_finish(&tp);
> > > - if (error)
> > > - goto out_cancel;
> > > - if (!xfs_iext_get_extent(ifp, &icur, &got))
> > > - break;
> > > - continue;
> > > -prev_extent:
> > > - if (!xfs_iext_prev_extent(ifp, &icur, &got))
> > > - break;
> > > - }
> > > + /* Remove the mapping from the CoW fork. */
> > > + xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
> > >
> > > error = xfs_trans_commit(tp);
> > > - xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > if (error)
> > > - goto out;
> > > - return 0;
> > > + goto out_unlock;
> > >
> > > + /* Update the caller about how much progress we made. */
> > > + *end_fsb = del.br_startoff;
> > > + goto out_unlock;
> > > out_cancel:
> > > xfs_trans_cancel(tp);
> > > +out_unlock:
> > > xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > -out:
> > > - trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> > > +out_err:
> > > + return error;
> > > +}
> > > +
> >
> > Nit, but the error handling/labels could be cleaned up a bit here. The
> > ilock can be transferred to the transaction to eliminate that label, the
> > trans alloc and commit sites can then just return directly to eliminate
> > out_err, and we're left with a single out_cancel label (and no longer
> > need a goto in the common return path).
>
> Ok, will do.
>
> > > +/*
> > > + * Remap parts of a file's data fork after a successful CoW.
> > > + */
> > > +int
> > > +xfs_reflink_end_cow(
> > > + struct xfs_inode *ip,
> > > + xfs_off_t offset,
> > > + xfs_off_t count)
> > > +{
> > > + xfs_fileoff_t offset_fsb;
> > > + xfs_fileoff_t end_fsb;
> > > + int error = 0;
> > > +
> > > + trace_xfs_reflink_end_cow(ip, offset, count);
> > > +
> > > + offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> > > + end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> > > +
> > > + /* Walk backwards until we're out of the I/O range... */
> > > + while (end_fsb > offset_fsb && !error)
> > > + error = xfs_reflink_end_cow_extent(ip, offset_fsb, &end_fsb);
> > > +
> >
> > The high level change seems reasonable provided we've thought through
> > whether it's Ok to cycle the ILOCK around this operation. For example,
> > maybe this is not possible, but I'm wondering if the end I/O processing
> > could technically race with something crazy across a lock cycle like a
> > truncate followed by another reflink -> COW writeback that lands in the
> > same target range of the file.
>
> I think the answer to this is that reflink always flushes both ranges
> and truncates the pagecache (with iolock/mmaplock held) before starting
> the remapping, and pagecache truncation waits for PageWriteback pages,
> which means that we shouldn't end up reflinking into a range that's
> undergoing writeback.
>
So that means even a straight truncate() will serialize against a
pending writeback and associated COW remap, before anything else has an
opportunity to muck around with the file offset range under completion
processing. Sounds good.
> As far as reflink + directio writes go, reflink also waits for all dio
> to finish so there shouldn't be a race there, and the dio write will
> flush dirty pages and invalidate the pagecache before it starts the dio,
> so I think we're (mostly) protected from races there, insofar as we
> don't 100% support buffered and dio writes to the same parts of a
> file....
>
Ok.
> > I guess we'd only remap COW blocks if they are "real," but that
> > conversion looks like it happens on I/O submission, so what if that I/O
> > happens to fail? Could we end up with racing success/fail COW I/O
> > completions somehow or another?
>
> ...so I think that leaves only the potential for a race between two
> directio writes to the same region where one of them succeeds and the
> other fails. I /thought/ that there could be a race there, but I
> noticed that on dio error we no longer _reflink_cancel_cow, we simply
> don't _reflink_remap_cow. So I think we're ok there too. If both both
> dio writes succeed then one or the other of them will do the remapping.
>
Ok, thanks.
Brian
> (I'm not sure how we define the expected outcome of simultaneous dio
> writes to the same part of a file, but that seems rather suspect to
> me...)
>
> --D
>
> > (Conversely, if the ilock approach is safe here then a quick note in the
> > commit log as to why would be useful.)
>
> > Brian
> >
> > > + if (error)
> > > + trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> > > return error;
> > > }
> > >
next prev parent reply other threads:[~2018-12-04 12:28 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-11-30 19:19 [PATCH] xfs: split up the xfs_reflink_end_cow work into smaller transactions Darrick J. Wong
2018-12-03 15:13 ` Brian Foster
2018-12-03 23:19 ` Darrick J. Wong
2018-12-04 4:58 ` Darrick J. Wong
2018-12-04 12:29 ` Brian Foster
2018-12-04 12:28 ` Brian Foster [this message]
2018-12-04 18:08 ` [PATCH v2] " Darrick J. Wong
2018-12-04 18:24 ` Brian Foster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20181204122852.GA63209@bfoster \
--to=bfoster@redhat.com \
--cc=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox