From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 4/6] xfs: don't screw up direct writes when freesp is fragmented
Date: Fri, 26 Jan 2018 11:49:24 -0800 [thread overview]
Message-ID: <20180126194924.GD9068@magnolia> (raw)
In-Reply-To: <20180126192457.GC26910@bfoster.bfoster>
On Fri, Jan 26, 2018 at 02:24:57PM -0500, Brian Foster wrote:
> On Thu, Jan 25, 2018 at 06:05:12PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > xfs_bmap_btalloc is given a range of file offset blocks that must be
> > allocated to some data/attr/cow fork. If the fork has an extent size
> > hint associated with it, the request will be enlarged on both ends to
> > try to satisfy the alignment hint. If free space is fragmentated,
> > sometimes we can allocate some blocks but not enough to fulfill any of
> > the requested range. Since bmapi_allocate always trims the new extent
> > mapping to match the originally requested range, this results in
> > bmapi_write returning zero and no mapping.
> >
> > The consequences of this vary -- buffered writes will simply re-call
> > bmapi_write until it can satisfy at least one block from the original
> > request. Direct IO overwrites notice nmaps == 0 and return -ENOSPC
> > through the dio mechanism out to userspace with the weird result that
> > writes fail even when we have enough space because the ENOSPC return
> > overrides any partial write status. For direct CoW writes the situation
> > is disastrous because nobody notices us returning an invalid zero-length
> > wrong-offset mapping to iomap and the write goes off into space.
> >
> > First of all, teach iomap and xfs_reflink_allocate_cow to check the
> > mappings being returned and bail out with ENOSPC if we can't get any
> > blocks. Second of all, if free space is so fragmented we can't satisfy
> > even a single block of the original allocation request but did get a few
> > blocks, we should break the alignment hint in order to guarantee at
> > least some forward progress for the direct write. If we return a short
> > allocation to iomap_apply it'll call back about the remaining blocks.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > fs/iomap.c | 2 +-
> > fs/xfs/libxfs/xfs_bmap.c | 15 +++++++++++++++
> > fs/xfs/xfs_reflink.c | 6 ++++++
> > 3 files changed, 22 insertions(+), 1 deletion(-)
> >
> >
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index e5de772..aec35a0 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -63,7 +63,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> > ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
> > if (ret)
> > return ret;
> > - if (WARN_ON(iomap.offset > pos))
> > + if (WARN_ON(iomap.offset > pos) || WARN_ON(iomap.length == 0))
> > return -EIO;
> >
> > /*
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 032befb..1fb885c 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -3386,6 +3386,8 @@ xfs_bmap_btalloc(
> > xfs_agnumber_t fb_agno; /* ag number of ap->firstblock */
> > xfs_agnumber_t ag;
> > xfs_alloc_arg_t args;
> > + xfs_fileoff_t orig_offset;
> > + xfs_extlen_t orig_length;
> > xfs_extlen_t blen;
> > xfs_extlen_t nextminlen = 0;
> > int nullfb; /* true if ap->firstblock isn't set */
> > @@ -3395,6 +3397,8 @@ xfs_bmap_btalloc(
> > int stripe_align;
> >
> > ASSERT(ap->length);
> > + orig_offset = ap->offset;
> > + orig_length = ap->length;
> >
> > mp = ap->ip->i_mount;
> >
> > @@ -3610,6 +3614,17 @@ xfs_bmap_btalloc(
> > *ap->firstblock = args.fsbno;
> > ASSERT(nullfb || fb_agno <= args.agno);
> > ap->length = args.len;
> > + /*
> > + * If we didn't get enough blocks to fill even one block of
> > + * the original request, break the alignment and return
> > + * whatever we got; it's the best we can do. Free space is
> > + * fragmented enough that we cannot honor the extent size
> > + * hints but we can still make some forward progress.
> > + */
> > + if (ap->length <= orig_length)
> > + ap->offset = orig_offset;
> > + else if (ap->offset + ap->length < orig_offset + orig_length)
> > + ap->offset = orig_offset + orig_length - ap->length;
>
> Ok, but do we really want to shift the extent placement for any
> allocation that is short? E.g., what if the allocation is short but
> still covers the target offset? ISTM there's a tradeoff here between
> filling the range asked for vs. following the heuristic in
> xfs_bmap_extsize_align(). Even if we do end up shifting to prioritize
> the target range, that doesn't seem to be what the comment describes by
> saying "... we didn't get enough blocks to fill even one block of the
> original request." :P
Yeah, Eric complained about the wording of the comment on IRC too.
How about:
/*
* If the extent size hint is active, we tried to round the caller's
* allocation request offset down to extsz and the length up to another
* extsz boundary. If we found a free extent we mapped it in starting
* at this new offset. If the newly mapped space isn't long enough to
* cover the entire range of offsets that was originally requested, move
* the mapping up so that we can fill as much of the caller's original
* request as possible. Free space is apparently very fragmented so
* we're unlikely to be able to satisfy the hints anyway.
*/
if (ap->length <= orig_length)
ap->offset = orig_offset;
else if (ap->offset + ap->length < orig_offset + orig_length)
ap->offset = orig_offset + orig_length - ap->length;
>
> > xfs_bmap_btalloc_accounting(ap, &args);
> > } else {
> > ap->blkno = NULLFSBLOCK;
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 9a6c545..9eca8aa 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -464,6 +464,12 @@ xfs_reflink_allocate_cow(
> > error = xfs_trans_commit(tp);
> > if (error)
> > return error;
> > + /*
> > + * Allocation succeeded but our request was not even partially
> > + * satisfied? Bail out.
> > + */
> > + if (nimaps == 0)
> > + return -ENOSPC;
>
> I think this should probably be a separate patch. This part is a
> straightforward bug fix for a potentially critical issue (i.e., a no
> brainer). While not that much code, the change above is an enhancement
> with a bit more complexity to consider.
Ok, will do.
--D
> Brian
>
> > convert:
> > return xfs_reflink_convert_cow_extent(ip, imap, offset_fsb, count_fsb,
> > &dfops);
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2018-01-26 19:59 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-01-26 2:04 [PATCH v2 0/6] xfs: reflink/scrub/quota fixes Darrick J. Wong
2018-01-26 2:04 ` [PATCH 1/6] xfs: refactor accounting updates out of xfs_bmap_btalloc Darrick J. Wong
2018-01-26 12:19 ` Christoph Hellwig
2018-01-26 19:06 ` Brian Foster
2018-01-26 2:05 ` [PATCH 2/6] xfs: CoW fork operations should only update quota reservations Darrick J. Wong
2018-01-26 19:06 ` Brian Foster
2018-01-26 19:20 ` Darrick J. Wong
2018-01-26 2:05 ` [PATCH 3/6] xfs: track CoW blocks separately in the inode Darrick J. Wong
2018-01-26 2:05 ` [PATCH 4/6] xfs: don't screw up direct writes when freesp is fragmented Darrick J. Wong
2018-01-26 19:24 ` Brian Foster
2018-01-26 19:49 ` Darrick J. Wong [this message]
2018-01-26 2:05 ` [PATCH 5/6] xfs: skip CoW writes past EOF when writeback races with truncate Darrick J. Wong
2018-01-26 12:21 ` Christoph Hellwig
2018-01-26 2:05 ` [PATCH 6/6] xfs: don't clobber inobt/finobt cursors when xref with rmap Darrick J. Wong
2018-01-26 12:18 ` [PATCH v2 0/6] xfs: reflink/scrub/quota fixes Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180126194924.GD9068@magnolia \
--to=darrick.wong@oracle.com \
--cc=bfoster@redhat.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox