public inbox for linux-xfs@vger.kernel.org
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: xfs@oss.sgi.com
Subject: Re: [RFCv4 00/76] xfs: add reverse-mapping, reflink, and dedupe support
Date: Tue, 5 Jan 2016 18:04:40 -0800
Message-ID: <20160106020440.GL28330@birch.djwong.org>
In-Reply-To: <20160105124226.GA38749@bfoster.bfoster>

On Tue, Jan 05, 2016 at 07:42:26AM -0500, Brian Foster wrote:
> On Mon, Jan 04, 2016 at 03:59:51PM -0800, Darrick J. Wong wrote:
> > On Sun, Dec 20, 2015 at 09:02:54AM -0500, Brian Foster wrote:
> > > On Sat, Dec 19, 2015 at 12:56:23AM -0800, Darrick J. Wong wrote:
> > > > Hi all,
> > > > 
> > > ...
> > > > Fixed since RFCv3:
> > > > 
> > > >  * The reflink and dedupe ioctls are being hoisted to the VFS, as
> > > >    provided in the first few patches.  Patch 81 connects to this
> > > >    functionality.
> > > > 
> > > >  * Copy on write has been rewritten for v4.  We now use the existing
> > > >    delayed allocation mechanism to coalesce writes together, deferring
> > > >    allocation until writeout time.  This enables CoW to make better
> > > >    block placement decisions and significantly reduces overhead.
> > > >    CoW is still pretty slow, but not as slow as before.
> > > > 
> > > >  * Direct IO CoW has been implemented using the same mechanism as
> > > >    above, but modified to perform the allocation and remapping right
> > > >    then and there.  Throughput is much higher than pushing data
> > > >    through the page cache CoW.  (It's the same mechanism, but we're
> > > >    playing with chunks bigger than a single memory page.)
> > > > 
> > > >  * CoW ENOSPC works correctly now, except in the pathological case
> > > >    that the AG fills up and the rmap btree cannot expand.  That will
> > > >    be addressed for v5.
> > > > 
> > > >  * fallocate will now unshare blocks to prevent future ENOSPC, as
> > > >    you'd expect.
> > > > 
> > > >  * refcount btree blocks are preallocated at mount time to prevent
> > > >    ENOSPC while trying to expand the tree.  This also has the effect
> > > >    of grouping the btree blocks together, which can speed up CoW
> > > >    remapping.
> > > > 
> > > 
> > > Can you elaborate on how these blocks are preallocated? E.g., is the
> > > tree "preconstructed" in some sense? However that is done, is this the
> > > anticipated solution or a temporary workaround..?
> > > 
> > > Also, shouldn't the enospc condition be handled by the agfl? I take it
> > > there is something going on here that renders that solution flawed, so
> > > I'm just curious what it is.
> > > 
> > > (Sorry if this is all explained elsewhere, but I haven't yet had a
> > > chance to take a close enough look at this feature..).
> > 
> > Reference count btree blocks aren't allocated from the AGFL; they're allocated
> > from the free space in the same manner as the inobt, per a review comment from
> > Dave a looong time ago. :) 
> > 
> 
> Ah, Ok.
> 
> > As such, we can get ourselves into the nasty situation where every block in the
> > AG has been allocated to file data.  If we then see a bunch of reference count
> > changes that are scattered around the AG, the reference count btree has to
> > expand to hold all the new records... but there isn't space, and the operation
> > fails.  Given that we know the maximum possible size of the refcount btree
> > (it's 0.3% of the AG size with 4k blocks), I figured it was easy enough to
> > avoid ENOSPC for reflink operations.
> > 
> 
> Sounds reasonable.
> 
> > I've temporarily fixed this by adding code that figures out how many blocks we
> > need if the reference count btree has to have a unique record for every block
> > in the AG and holding that many blocks until either they're allocated to the
> > refcount btree or freed at umount time.  Right now it's a temporary fix (if the
> > FS crashes, the reserved blocks are lost) but it wouldn't be difficult for the
> > FS to make a permanent reservation that's recorded on disk somehow.  But that
> > involves writing things to disk + making xfsprogs understand the reservation;
> > let's see what people say about the reserved pool idea at all.
> > 
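
For a rough sense of the numbers behind the "unique record for every block"
worst case, here is a back-of-the-envelope sketch in Python.  It is not the
kernel code.  The header and record sizes, the helper names, and the 1 TiB AG
are all assumptions made for illustration, and the tree is assumed to be
fully packed.

```python
# Rough sizing sketch; every constant below is an assumption, not taken
# from the patch series.
BTREE_HDR_BYTES = 56   # assumed short-form btree block header (with CRC)
REFC_REC_BYTES = 12    # assumed leaf record: start, length, refcount (3 x u32)
REFC_KEYPTR_BYTES = 8  # assumed node entry: key (u32) + child pointer (u32)


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def worst_case_refcount_blocks(ag_blocks, block_size=4096):
    """Blocks in a fully packed btree holding one record per AG block."""
    usable = block_size - BTREE_HDR_BYTES
    leaf_recs = usable // REFC_REC_BYTES
    node_recs = usable // REFC_KEYPTR_BYTES
    level = ceil_div(ag_blocks, leaf_recs)   # leaf level
    total = level
    while level > 1:                         # add interior levels up to root
        level = ceil_div(level, node_recs)
        total += level
    return total


ag_blocks = 2**28  # hypothetical 1 TiB AG with 4k blocks
need = worst_case_refcount_blocks(ag_blocks)
print(f"{need} blocks, {100 * need / ag_blocks:.2f}% of the AG")
```

Under those assumptions the result lands at roughly 0.3% of the AG, in line
with the figure quoted above.  Blocks that are less than full would push the
number up.
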
> > Does that make sense? :)
> > 
> 
> Yep, it sounds sort of like the reserve pool mechanism used to protect
> against ENOSPC when freeing blocks. Curious... why are the reserved
> blocks lost on fs crash? Wouldn't they be reserved again on the
> subsequent mount?

They will, but the pre-crash reservation isn't (yet) written down anywhere on
disk, so the blocks that were being held before the crash stay lost.
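
A toy model in Python may make the crash case easier to see.  It is only a
sketch: the class and method names are hypothetical, and it assumes that
reserving a block removes it from the persisted free-space count while the
pool itself lives only in memory.

```python
import errno


class ToyAG:
    """Toy model of one AG: a persisted free count plus an in-memory pool."""

    def __init__(self, free_blocks):
        self.ondisk_free = free_blocks  # survives a crash
        self.pool = 0                   # in-memory only, gone after a crash

    def mount(self, want):
        # Pull blocks out of free space and hold them for the btree.
        take = min(want, self.ondisk_free)
        self.ondisk_free -= take
        self.pool = take

    def expand_btree(self):
        # Btree growth is served from the pool, never from free space.
        if self.pool == 0:
            raise OSError(errno.ENOSPC, "reservation exhausted")
        self.pool -= 1

    def umount(self):
        # A clean unmount hands the unused blocks back.
        self.ondisk_free += self.pool
        self.pool = 0

    def crash(self):
        # Nothing on disk says which blocks were held, so they leak.
        leaked, self.pool = self.pool, 0
        return leaked


ag = ToyAG(free_blocks=1000)
ag.mount(want=30)
ag.expand_btree()
leaked = ag.crash()   # held blocks are neither free nor in the btree
ag.mount(want=30)     # the next mount reserves a fresh set
print(leaked, ag.ondisk_free, ag.pool)
```

In this model each crash shrinks the free space by whatever was still in the
pool, which is what a reservation recorded on disk would avoid.
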

Thank /you/ for having a look at the reflink code! :)

--D

> 
> Thanks for the explanation...
> 
> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > Issues: 
> > > > 
> > > >  * The extent swapping ioctl still allocates a bigger fixed-size
> > > >    transaction.  That's most likely a stupid thing to do, so getting a
> > > >    better grip on how the journalling code works and auditing all the
> > > >    new transaction users will have to happen.  Right now it mostly
> > > >    gets lucky.
> > > > 
> > > >  * EFI tracking for the allocated-but-not-yet-mapped blocks is
> > > >    nonexistent.  A crash will leak them.
> > > > 
> > > >  * ENOSPC while expanding the rmap btree can crash the FS.  For now we
> > > >    work around this problem by making the AGFL as big as possible,
> > > >    failing CoW attempts with ENOSPC if there aren't enough AGFL blocks
> > > >    available, and hoping that doesn't actually happen.
> > > > 
> > > > If you're going to start using this mess, you probably ought to just
> > > > pull from my github trees for kernel[1], xfsprogs[2], and xfstests[3].
> > > > There are also updates for xfs-docs[4] and man-pages[5].
> > > > 
> > > > The patches have been xfstested with x64, i386, and ppc64; while in
> > > > general the tests run to completion, there are still periodic bugs
> > > > that will be addressed by the next RFC.  There's a persistent crash on
> > > > arm64 and ppc64el that I haven't been able to triage.
> > > > 
> > > > This is an extraordinary way to eat your data.  Enjoy! 
> > > > Comments and questions are, as always, welcome.
> > > > 
> > > > --D
> > > > 
> > > > [1] https://github.com/djwong/linux/tree/for-dave
> > > > [2] https://github.com/djwong/xfsprogs/tree/for-dave
> > > > [3] https://github.com/djwong/xfstests/tree/for-dave
> > > > [4] https://github.com/djwong/xfs-documentation/tree/for-dave
> > > > [5] https://github.com/djwong/man-pages/commits/for-mtk
> > > > 
> > > > _______________________________________________
> > > > xfs mailing list
> > > > xfs@oss.sgi.com
> > > > http://oss.sgi.com/mailman/listinfo/xfs

Thread overview: 102+ messages
2015-12-19  8:56 [RFCv4 00/76] xfs: add reverse-mapping, reflink, and dedupe support Darrick J. Wong
2015-12-19  8:56 ` [PATCH 01/76] libxfs: make xfs_alloc_fix_freelist non-static Darrick J. Wong
2015-12-19  8:56 ` [PATCH 02/76] xfs: fix log ticket type printing Darrick J. Wong
2016-01-03 12:13   ` Christoph Hellwig
2016-01-03 21:29     ` Dave Chinner
2016-01-04 19:57       ` Darrick J. Wong
2015-12-19  8:56 ` [PATCH 03/76] libxfs: refactor the btree size calculator code Darrick J. Wong
2015-12-20 20:39   ` Dave Chinner
2016-01-04 22:06     ` Darrick J. Wong
2015-12-19  8:56 ` [PATCH 04/76] libxfs: use a convenience variable instead of open-coding the fork Darrick J. Wong
2015-12-19  8:56 ` [PATCH 05/76] libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct Darrick J. Wong
2016-01-03 12:15   ` Christoph Hellwig
2016-01-04 22:12     ` Darrick J. Wong
2016-01-04 23:23       ` Darrick J. Wong
2016-01-04 23:51       ` Dave Chinner
2015-12-19  8:57 ` [PATCH 06/76] xfs: introduce rmap btree definitions Darrick J. Wong
2015-12-19  8:57 ` [PATCH 07/76] xfs: add rmap btree stats infrastructure Darrick J. Wong
2015-12-19  8:57 ` [PATCH 08/76] xfs: rmap btree add more reserved blocks Darrick J. Wong
2015-12-19  8:57 ` [PATCH 09/76] xfs: add owner field to extent allocation and freeing Darrick J. Wong
2015-12-19  8:57 ` [PATCH 10/76] xfs: add extended " Darrick J. Wong
2015-12-19  8:57 ` [PATCH 11/76] xfs: introduce rmap extent operation stubs Darrick J. Wong
2015-12-19  8:57 ` [PATCH 12/76] xfs: extend rmap extent operation stubs to take full owner info Darrick J. Wong
2015-12-19  8:57 ` [PATCH 13/76] xfs: define the on-disk rmap btree format Darrick J. Wong
2015-12-19  8:57 ` [PATCH 14/76] xfs: enhance " Darrick J. Wong
2015-12-19  8:58 ` [PATCH 15/76] xfs: add rmap btree growfs support Darrick J. Wong
2015-12-19  8:58 ` [PATCH 16/76] xfs: enhance " Darrick J. Wong
2015-12-19  8:58 ` [PATCH 17/76] xfs: rmap btree transaction reservations Darrick J. Wong
2015-12-19  8:58 ` [PATCH 18/76] xfs: rmap btree requires more reserved free space Darrick J. Wong
2015-12-19  8:58 ` [PATCH 19/76] libxfs: fix min freelist length calculation Darrick J. Wong
2015-12-19  8:58 ` [PATCH 20/76] xfs: add rmap btree operations Darrick J. Wong
2015-12-19  8:58 ` [PATCH 21/76] xfs: enhance " Darrick J. Wong
2015-12-19  8:58 ` [PATCH 22/76] xfs: add an extent to the rmap btree Darrick J. Wong
2015-12-19  8:58 ` [PATCH 23/76] xfs: add tracepoints for the rmap-mirrors-bmbt functions Darrick J. Wong
2015-12-19  8:58 ` [PATCH 24/76] xfs: teach rmap_alloc how to deal with our larger rmap btree Darrick J. Wong
2015-12-19  8:59 ` [PATCH 25/76] xfs: remove an extent from the " Darrick J. Wong
2015-12-19  8:59 ` [PATCH 26/76] xfs: enhanced " Darrick J. Wong
2015-12-19  8:59 ` [PATCH 27/76] xfs: add rmap btree insert and delete helpers Darrick J. Wong
2015-12-19  8:59 ` [PATCH 28/76] xfs: piggyback rmapbt update intents in the bmap free structure Darrick J. Wong
2015-12-19  8:59 ` [PATCH 29/76] xfs: bmap btree changes should update rmap btree Darrick J. Wong
2015-12-19  8:59 ` [PATCH 30/76] xfs: add rmap btree geometry feature flag Darrick J. Wong
2015-12-19  8:59 ` [PATCH 31/76] xfs: add rmap btree block detection to log recovery Darrick J. Wong
2015-12-19  8:59 ` [PATCH 32/76] xfs: enable the rmap btree functionality Darrick J. Wong
2015-12-19  9:00 ` [PATCH 33/76] xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled Darrick J. Wong
2015-12-19  9:00 ` [PATCH 34/76] xfs: implement " Darrick J. Wong
2016-01-03 12:17   ` Christoph Hellwig
2016-01-04 23:40     ` Darrick J. Wong
2016-01-05  2:41       ` Dave Chinner
2016-01-07  0:09         ` Darrick J. Wong
2015-12-19  9:00 ` [PATCH 35/76] libxfs: refactor short btree block verification Darrick J. Wong
2016-01-03 12:18   ` Christoph Hellwig
2016-01-03 21:30     ` Dave Chinner
2015-12-19  9:00 ` [PATCH 36/76] xfs: don't update rmapbt when fixing agfl Darrick J. Wong
2015-12-19  9:00 ` [PATCH 37/76] xfs: define tracepoints for refcount btree activities Darrick J. Wong
2015-12-19  9:00 ` [PATCH 38/76] xfs: introduce refcount btree definitions Darrick J. Wong
2015-12-19  9:00 ` [PATCH 39/76] xfs: add refcount btree stats infrastructure Darrick J. Wong
2015-12-19  9:00 ` [PATCH 40/76] xfs: refcount btree add more reserved blocks Darrick J. Wong
2015-12-19  9:00 ` [PATCH 41/76] xfs: define the on-disk refcount btree format Darrick J. Wong
2015-12-19  9:00 ` [PATCH 42/76] xfs: add refcount btree support to growfs Darrick J. Wong
2015-12-19  9:01 ` [PATCH 43/76] xfs: add refcount btree operations Darrick J. Wong
2015-12-19  9:01 ` [PATCH 44/76] libxfs: adjust refcount of an extent of blocks in refcount btree Darrick J. Wong
2015-12-19  9:01 ` [PATCH 45/76] libxfs: adjust refcount when unmapping file blocks Darrick J. Wong
2015-12-19  9:01 ` [PATCH 46/76] xfs: add refcount btree block detection to log recovery Darrick J. Wong
2015-12-19  9:01 ` [PATCH 47/76] xfs: refcount btree requires more reserved space Darrick J. Wong
2015-12-19  9:01 ` [PATCH 48/76] xfs: introduce reflink utility functions Darrick J. Wong
2015-12-19  9:01 ` [PATCH 49/76] xfs: define tracepoints for reflink activities Darrick J. Wong
2015-12-19  9:01 ` [PATCH 50/76] xfs: map an inode's offset to an exact physical block Darrick J. Wong
2015-12-19  9:02 ` [PATCH 51/76] xfs: add reflink feature flag to geometry Darrick J. Wong
2015-12-19  9:02 ` [PATCH 52/76] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files Darrick J. Wong
2015-12-19  9:02 ` [PATCH 53/76] xfs: introduce the CoW fork Darrick J. Wong
2015-12-19  9:02 ` [PATCH 54/76] xfs: support bmapping delalloc extents in " Darrick J. Wong
2015-12-19  9:02 ` [PATCH 55/76] xfs: create delalloc extents in " Darrick J. Wong
2015-12-19  9:02 ` [PATCH 56/76] xfs: support allocating delayed " Darrick J. Wong
2015-12-19  9:02 ` [PATCH 57/76] xfs: allocate " Darrick J. Wong
2016-01-03 12:20   ` Christoph Hellwig
2016-01-05  1:13     ` Darrick J. Wong
2016-01-09  9:59   ` Darrick J. Wong
2015-12-19  9:02 ` [PATCH 58/76] xfs: support removing extents from " Darrick J. Wong
2015-12-19  9:03 ` [PATCH 59/76] xfs: move mappings from cow fork to data fork after copy-write Darrick J. Wong
2015-12-19  9:03 ` [PATCH 60/76] xfs: implement CoW for directio writes Darrick J. Wong
2016-01-08  9:34   ` Darrick J. Wong
2015-12-19  9:03 ` [PATCH 61/76] xfs: copy-on-write reflinked blocks when zeroing ranges of blocks Darrick J. Wong
2015-12-19  9:03 ` [PATCH 62/76] xfs: clear inode reflink flag when freeing blocks Darrick J. Wong
2015-12-19  9:03 ` [PATCH 63/76] xfs: cancel pending CoW reservations when destroying inodes Darrick J. Wong
2015-12-19  9:03 ` [PATCH 64/76] xfs: reflink extents from one file to another Darrick J. Wong
2015-12-19  9:03 ` [PATCH 65/76] xfs: add clone file and clone range ioctls Darrick J. Wong
2015-12-19  9:03 ` [PATCH 66/76] xfs: emulate the btrfs dedupe extent same ioctl Darrick J. Wong
2015-12-19  9:03 ` [PATCH 67/76] xfs: teach fiemap about reflink'd extents Darrick J. Wong
2015-12-19  9:03 ` [PATCH 68/76] xfs: swap inode reflink flags when swapping inode extents Darrick J. Wong
2015-12-19  9:04 ` [PATCH 69/76] xfs: unshare a range of blocks via fallocate Darrick J. Wong
2015-12-19  9:04 ` [PATCH 70/76] xfs: fork shared EOF block when truncating file Darrick J. Wong
2015-12-19  9:04 ` [PATCH 71/76] xfs: support XFS_XFLAG_REFLINK (and FS_NOCOW_FL) on reflink filesystems Darrick J. Wong
2015-12-19  9:04 ` [PATCH 72/76] xfs: recognize the reflink feature bit Darrick J. Wong
2015-12-19  9:04 ` [PATCH 73/76] xfs: use new vfs reflink and dedup function pointers Darrick J. Wong
2015-12-19  9:04 ` [PATCH 74/76] xfs: set up per-AG preallocated block pools Darrick J. Wong
2015-12-19  9:04 ` [PATCH 75/76] xfs: preallocate blocks for worst-case refcount btree expansion Darrick J. Wong
2015-12-19  9:04 ` [PATCH 76/76] xfs: try to prevent failed rmap btree expansion during cow Darrick J. Wong
2015-12-20 14:02 ` [RFCv4 00/76] xfs: add reverse-mapping, reflink, and dedupe support Brian Foster
2016-01-04 23:59   ` Darrick J. Wong
2016-01-05 12:42     ` Brian Foster
2016-01-06  2:04       ` Darrick J. Wong [this message]
2016-01-06  3:44         ` Dave Chinner
2016-02-02 23:06           ` Darrick J. Wong