Re: [RFC PATCH] xfs: skip discard of unwritten extents

linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [RFC PATCH] xfs: skip discard of unwritten extents
Date: Tue, 1 May 2018 13:38:15 -0400	[thread overview]
Message-ID: <20180501173815.GJ4525@bfoster.bfoster> (raw)
In-Reply-To: <20180430220020.GE13766@dastard>

On Tue, May 01, 2018 at 08:00:20AM +1000, Dave Chinner wrote:
> On Thu, Apr 26, 2018 at 02:06:24PM -0400, Brian Foster wrote:
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > Hi all,
> > 
> > What do folks think of something like this?
> 
> Definitely sounds like something we need to address.
> 
> > The motivation here is that
> > the VDO (dedup) devs had reported seeing online discards during
> > write-only workloads. These turn out to be related to trimming post-eof
> > preallocation blocks after large file copies. To my knowledge, this
> > isn't really a prevalent or serious issue, but I think that technically
> > these discards are unnecessary and so I was looking into how we could
> > avoid them.
> 
> We simply trucate post-eof extents, right? So we know in
> xfs_itruncate_extents() if the inode size is changing, not to
> mention we know if the extent is beyond EOF? e.g. all calls to
> xfs_itruncate_extents() other than xfs_free_eofblocks() change the
> inode size and so directly indicate they are removing written
> blocks. Anything where the inode size is not changing is doing a
> post-eof removal, and so we can assume no data has been written?
> 

Yes, xfs_free_eofblocks() is the only caller explicitly responsible for
trimming post-eof extents.

> So rather than converting everything to unwritten extents, the "skip
> discard flag" is simply triggered via extents being freed sitting
> beyond the current EOF (not the new EOF) and/or being unwritten?
> 

That was pretty much my initial thought, but note that the extent free
is ultimately deferred down in xfs_free_eofblocks() ->
xfs_itruncate_extents() -> xfs_bunmapi() -> xfs_bmap_del_extent_real()
-> xfs_bmap_add_free(). We can communicate this down to that point with
an itruncate_extents() parameter and XFS_BMAPI_NODISCARD flag or some
such, it just seemed a bit kludgy to pass that down through those layers
when the unwritten state is known in the bunmapi code (but I'll take
that approach if preferred).

> > This behavior is of course not directly related to unwritten extents,
> > but the immediate/obvious solution to bubble up a bmapi flag of some
> > kind to xfs_free_eofblocks() seemed rather crude. From there, I figured
> > that we technically don't need to discard any unwritten extents (within
> > or beyond EOF) because they haven't been written to since being
> > allocated. In fact, I'm not sure we have to even busy them, but it's
> > roughly equivalent logic either way and I'm trying to avoid getting too
> > clever.
> 
> I think we still need to busy them to avoid re-allocating them in
> the same checkpoint, as data extent free/realloc in the same
> checkpoint could result in a failed recovery (i.e. partial
> checkpoint replay) leaving the extent linked into two separate
> files.
> 

Ah, Ok.. I was only thinking about metadata/data reuse. Hm, isn't the
filesystem essentially corrupted on a failed/partial recovery anyways?
(Not that this matters much in this context, I wasn't planning to bypass
the busy sequence...).

> > I also recall that we've discussed using unwritten extents for delalloc
> > -> real conversion to avoid the small stale data exposure window that
> > exists in writeback. Without getting too deep into the reason we don't
> > currently do an initial unwritten allocation [1], I don't think there's
> > anything blocking us from converting any post-eof blocks that happen to
> > be part of the resulting normal allocation. As it is, the imap is
> > already trimmed to EOF by the writeback code for coherency reasons. If
> > we were to convert post-eof blocks (not part of this patch) along with
> > something like this patch, then we'd indirectly prevent discards for
> > eofblocks trims.
> 
> I think we should leave that as a separate problem, as writeback
> currently has issues with the way we manage bufferhead state.
> i.e. things don't work right if we put unwritten extents under
> delalloc buffers and vice versa. [ I have patches to address that
> I'm working on.] And there's also the issue that we need to change
> the delalloc reservations to take into account block allocations
> required by unwritten extent conversion needed by delalloc.
> 

Right.. I wasn't planning to try and solve the whole buffer head state
mismatch thing as a dependency to not discard eofblocks. I was only
going to convert blocks that happened to be post-eof after the
xfs_iomap_write_allocate() allocation because those blocks by definition
don't have buffers. So it's essentially just another xfs_bmapi_write()
call from xfs_iomap_write_allocate() to convert eofblocks if the just
allocated mapping crosses eof.

> Hence I think we should address that as a separate problem, not as
> the solution to avoiding discard of post-eof extents.
> 

Fair enough, I'll look into tacking on a separate patch to also skip
discards for unwritten extents (irrespective of eof).

> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 4bcc095fe44a..942c90ec6747 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -2954,13 +2954,15 @@ xfs_free_extent(
> >  	xfs_fsblock_t		bno,	/* starting block number of extent */
> >  	xfs_extlen_t		len,	/* length of extent */
> >  	struct xfs_owner_info	*oinfo,	/* extent owner */
> > -	enum xfs_ag_resv_type	type)	/* block reservation type */
> > +	enum xfs_ag_resv_type	type,	/* block reservation type */
> > +	bool			skip_discard)
> >  {
> >  	struct xfs_mount	*mp = tp->t_mountp;
> >  	struct xfs_buf		*agbp;
> >  	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
> >  	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, bno);
> >  	int			error;
> > +	unsigned int		busy_flags = 0;
> >  
> >  	ASSERT(len != 0);
> >  	ASSERT(type != XFS_AG_RESV_AGFL);
> > @@ -2984,7 +2986,9 @@ xfs_free_extent(
> >  	if (error)
> >  		goto err;
> >  
> > -	xfs_extent_busy_insert(tp, agno, agbno, len, 0);
> > +	if (skip_discard)
> > +		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
> > +	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
> >  	return 0;
> 
> Rather than changing xfs_free_extent(), how about adding a
> xfs_free_extent_nodiscard() wrapper, and only call it when
> processing an extent that doesn't need discard? This means none of
> the other code that frees extents needs to be changed. Similarly
> add a xfs_bmap_add_free_nodiscard() wrapper. That will cut down on
> the code churn caused by passing new parameters everywhere....
> 

Ok, I can clean that up. Thanks for the feedback.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2018-05-01 17:38 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-26 18:06 [RFC PATCH] xfs: skip discard of unwritten extents Brian Foster
2018-04-30  9:06 ` Carlos Maiolino
2018-04-30 12:17   ` Brian Foster
2018-04-30 13:26     ` Carlos Maiolino
2018-04-30 22:00 ` Dave Chinner
2018-05-01 17:38   ` Brian Foster [this message]
2018-05-01 22:39     ` Dave Chinner
2018-05-02 11:18       ` Brian Foster
2018-05-03  0:48         ` Dave Chinner
2018-05-03 12:07           ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180501173815.GJ4525@bfoster.bfoster \
    --to=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).