Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>, linux-xfs@vger.kernel.org
Subject: Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
Date: Mon, 11 Feb 2019 17:13:33 -0800	[thread overview]
Message-ID: <20190212011333.GB23989@magnolia> (raw)
In-Reply-To: <20190208123432.GB21317@bfoster>

On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > > Hi folks,
> > > > > > 
> > > > > > I've just finished analysing an IO trace from a application
> > > > > > generating an extreme filesystem fragmentation problem that started
> > > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > > looks to have previously been solved, I still wanted to understand
> > > > > > how the application had so comprehensively defeated extent size
> > > > > > hints as a method of avoiding file fragmentation.
> > ....
> > > > FWIW, I think the scope of the problem is quite widespread -
> > > > anything that does open/something/close repeatedly on a file that is
> > > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > > the post-eof extent size hint allocated space. That's why I suspect
> > > > we need to think about not trimming by default and trying to
> > > > enumerating only the cases that need to trim eof blocks.
> > > > 
> > > 
> > > To further this point.. I think the eofblocks scanning stuff came long
> > > after the speculative preallocation code and associated release time
> > > post-eof truncate.
> > 
> > Yes, I cribed a bit of the history of the xfs_release() behaviour
> > on #xfs yesterday afternoon:
> > 
> >  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
> >  <dchinner>	historic. People doing operations then complaining du didn't match ls
> >  <dchinner>	stuff like that
> >  <dchinner>	There used to be a open file cache in XFS - we'd know exactly when the last reference went away and trim it then
> >  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
> >  <dchinner>	(i.e. that's how we used to make nfs not suck)
> >  <dchinner>	that's when we started doing work in ->release
> >  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
> >  <dchinner>	Except for concurrent NFS writes into the same directory
> >  <dchinner>	and now there's another pathological application that triggers problems
> >  <dchinner>	The NFS exception was prior to having thebackground reaper
> >  <dchinner>	as these things goes the background reaper is relatively recent functionality
> >  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at al
> > 
> 
> Thanks.
> 
> > > I think the background scanning was initially an
> > > enhancement to deal with things like the dirty release optimization
> > > leaving these blocks around longer and being able to free up this
> > > accumulated space when we're at -ENOSPC conditions.
> > 
> > Yes, amongst other things like slow writes keeping the file open
> > forever.....
> > 
> > > Now that we have the
> > > scanning mechanism in place (and a 5 minute default background scan,
> > > which really isn't all that long), it might be reasonable to just drop
> > > the release time truncate completely and only trim post-eof blocks via
> > > the bg scan or reclaim paths.
> > 
> > Yeah, that's kinda the question I'm asking here. What's the likely
> > impact of not trimming EOF blocks at least on close apart from
> > people complaining about df/ls not matching du?
> > 
> 
> Ok. ISTM it's just a continuation of the same "might confuse some users"
> scenario that pops up occasionally. It also seems that kind of thing has
> died down as either most people don't really know or care about the
> transient state or are just more familiar with it at this point. IME,
> complex applications that depend on block ownership stats (userspace
> filesystems for example) already have to account for speculative
> preallocation with XFS, so tweaking the semantics of the optimization
> shouldn't really have much of an impact that I can tell so long as the
> broader/long-term behavior doesn't change[1].
> 
> I suppose there are all kinds of other applications that are technically
> affected by dropping the release time trim (simple file copies, archive
> extraction, etc.), but it's not clear to me that matters so long as we
> have effective bg and -ENOSPC scans. The only thing I can think of so
> far is whether we should consider changes to the bg scan heuristics to
> accommodate scenarios currently covered by the release time trim. For
> example, the release time scan doesn't consider whether the file is
> dirty or not while the bg scan always skips "active" files.

I wrote a quick and dirty fstest that writes 999 files between 128k and
256k in size, to simulate untarring onto a filesystem.  No fancy
preallocation, just buffered writes.  I patched my kernel to skip the
posteof block freeing in xfs_release, so the preallocations get freed by
inode inactivation.  Then the freespace histogram looks like:

+   from      to extents  blocks    pct
+      1       1      36      36   0.00
+      2       3      69     175   0.01
+      4       7     122     698   0.02
+      8      15     237    2691   0.08
+     16      31       1      16   0.00
+     32      63     500   27843   0.88
+ 524288  806272       4 3141225  99.01

Pretty gnarly. :)  By comparison, a stock upstream kernel:

+   from      to extents  blocks    pct
+ 524288  806272       4 3172579 100.00

That's 969 free extents vs. 4, on a fs with 999 new files... which is
pretty bad.  Dave also suggessted on IRC that maybe this should be a
little smarter -- possibly skipping the posteof removal only if the
filesystem has sunit/swidth set, or if the inode has extent size hints,
or whatever. :)

--D

> Brian
> 
> [1] Which isn't the case for a change to not bg trim files with extent
> size hints.
> 
> > I don't really care about that anymore because, well, reflink/dedupe
> > completely break any remaining assumption that du reported space
> > consumption is related to the file size (if sparse files wasn't
> > enough of a hint arlready)....
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

next prev parent reply	other threads:[~2019-02-12  4:38 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
2019-02-07  5:08 ` [PATCH 1/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
2019-02-07  5:08 ` [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set Dave Chinner
2019-02-07 15:51   ` Brian Foster
2019-02-07  5:08 ` [PATCH 3/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
2019-02-07  5:19   ` Dave Chinner
2019-02-07  5:21 ` [RFC PATCH 0/3]: Extreme fragmentation ahoy! Darrick J. Wong
2019-02-07  5:39   ` Dave Chinner
2019-02-07 15:52     ` Brian Foster
2019-02-08  2:47       ` Dave Chinner
2019-02-08 12:34         ` Brian Foster
2019-02-12  1:13           ` Darrick J. Wong [this message]
2019-02-12 11:46             ` Brian Foster
2019-02-12 20:21               ` Dave Chinner
2019-02-13 13:50                 ` Brian Foster
2019-02-13 22:27                   ` Dave Chinner
2019-02-14 13:00                     ` Brian Foster
2019-02-14 21:51                       ` Dave Chinner
2019-02-15  2:35                         ` Brian Foster
2019-02-15  7:23                           ` Dave Chinner
2019-02-15 20:33                             ` Brian Foster
2019-02-08 16:29         ` Darrick J. Wong
2019-02-18  2:26 ` [PATCH 4/3] xfs: EOF blocks are not busy extents Dave Chinner
2019-02-20 15:12   ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190212011333.GB23989@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox