From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: linux-xfs@vger.kernel.org, xfs@oss.sgi.com
Subject: Re: [PATCH 04/71] xfs: defer should allow ->finish_item to request a new transaction
Date: Tue, 6 Sep 2016 16:57:17 -0700 [thread overview]
Message-ID: <20160906235717.GB26927@birch.djwong.org> (raw)
In-Reply-To: <20160906063815.GA11411@infradead.org>
On Mon, Sep 05, 2016 at 11:38:15PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 25, 2016 at 04:32:21PM -0700, Darrick J. Wong wrote:
> > When xfs_defer_finish calls ->finish_item, it's possible that
> > (refcount) won't be able to finish all the work in a single
> > transaction. When this happens, the ->finish_item handler should
> > shorten the log done item's list count, update the work item to
> > reflect where work should continue, and return -EAGAIN so that
> > defer_finish knows to retain the pending item on the pending list,
> > roll the transaction, and restart processing where we left off.
> >
> > Plumb in the code and document how this mechanism is supposed to work.
>
> I've voiced this before, and I will again: I think the whole xfs_defer
> mechanism is a major, major mistake. All three users look somewhat
> similar, but actually are different underneath
I agree with you that the four users are different underneath, and that
shoehorning all four through the same xfs_defer has its risks. I'm not
sure specifically what's making you nervous, so I'll write a little
about how I arrived at the xfs_defer_ops design, and hopefully that'll
be enough of a jumping off point to discuss from there?
Reworking the existing xfs_bmap_free_t code was unsettling in the usual
sense of "am I really grokking this thing?" The refcount item code
requiring the ability to roll a transaction midway through finishing the
item is not something the other three items care about. There's an
additional question about the interaction between extent free and busy,
which I'll get to further down.
That said, prior to writing the xfs_defer code (that I submitted for
4.8), I *did* iterate through various designs, all of which eventually
failed. The problem I'm dealing with here is this -- in the new world
of rmap and refcount, a data block mapping transaction can cause an
intense storm of metadata updates. In the old days we only had to deal
with (in the free case):
1) update extent map,
2) alloc/free bmbt block,
3) free data extent
Now that process expands (explodes?) to:
1) update extent map,
2) bmbt block alloc/free,
3) bmbt rmap update,
4) rmapbt shape changes,
5) freeing(?) rmapbt blocks,
6) data block rmapbt update,
7) rmapbt shape change for bmbt update,
8) freeing(?) rmapbt blocks,
9) refcountbt update,
A) rmapbt shape change for refcountbt update,
B) freeing(?) rmapbt blocks,
C) data block freeing,
In the bad first implementation I simply tried to cram all of these
updates into a single big transaction, which lead to reservation
blowouts and deadlocks when we have to perform multiple allocations for
data blocks and bmbt blocks.
The next design left the old xfs_bmap_free code mostly intact and
provided separate handlers for the rmap/refcount/bmap updates as you
suggested during the rmap review, with just enough of a defer_ops to
point to the four types of deferred ops that could go with each
transaction. Each metadata update type was logged in order of type;
this proved insufficient because the various metadata updates are
interdependent -- for example, if adding an rmapbt record causes an
rmapbt block to be freed, we have to ensure that the rmap update and RUD
commit before we start working on the extent freeing. Were this order
not maintained, we could crash after the EFD gets logged but before the
rmap update commits, and replay would retry the rmap update and free the
block again. So, separately maintained lists of deferred operations
didn't work.
This resulted in the third (and current) deferred ops design. Central
coordination between deferred ops is stronger than it was before.
Deferred operations are tracked in the order that they're added to the
defer_ops, and the intent items are logged in that order so that
recovery happens in (more or less) the same way that they would during a
successful transaction commit. That required defer_ops to shoulder a
larger burden of state tracking, at which point it made more sense to
make extent freeing another defer_ops client.
However, there's also the question of how does this new defer_ops thing
interact with the extent busy code? AFAIK, the extent busy code
prevents XFS from reusing a freed block for data until the transaction
that frees it has committed so that log recover cannot replay updates to
a block that has already been rewritten with file data (i.e.
xlog_cil_committed() has run to completion). The deferred ops system
will roll the transaction repeatedly until all the ops are finished, but
there's nothing about it that changes that guarantee. Metadata blocks
being freed are still put in the busy-extent rbtree where they remain
until the CIL commits and clears them (or they get reused for other
metadata) which prevents premature reuse. In other words, extent
freeing should behave the same as before, even in the presence of the
new pieces.
> Especially the extent free case is a lot simpler than all others, and
> we now place the burden of a complex abstraction onto code that would
> otherwise be fairly easily understandable.
Right, it would be nice to be able to keep using the old bmap_free code
when we don't need the complex operations. I guess you could make the
public xfs_defer.c functions short-circuit to the bmap_free code if
!rmap && !reflink.
> Also it adds a memory allocation to the extent free code,
The second memory allocation could be eliminated by kmem_alloc'ing more
bytes in xfs_defer_add and changing xfs_defer_add to take a pointer to a
work item buffer that would be copied into the extra space. I'm not
sure how much we'd gain since there are already plenty of allocations
in that path anyway.
> and gets in the way of merging the freed extent and extent busy
> structures, something I prototyped a while ago.
I searched my email spool back to 9/2012 but didn't find any patches,
so I don't really know how to respond to this.
--D
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2016-09-06 23:57 UTC|newest]
Thread overview: 120+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-08-25 23:31 [PATCH v8 00/71] xfs: add reflink and dedupe support Darrick J. Wong
2016-08-25 23:32 ` [PATCH 01/71] xfs: remove xfs_btree_bigkey Darrick J. Wong
2016-09-05 15:04 ` Christoph Hellwig
2016-08-25 23:32 ` [PATCH 02/71] xfs: create a standard btree size calculator code Darrick J. Wong
2016-09-05 15:05 ` Christoph Hellwig
2016-08-25 23:32 ` [PATCH 03/71] xfs: count the blocks in a btree Darrick J. Wong
2016-09-05 15:05 ` Christoph Hellwig
2016-08-25 23:32 ` [PATCH 04/71] xfs: defer should allow ->finish_item to request a new transaction Darrick J. Wong
2016-09-06 6:38 ` Christoph Hellwig
2016-09-06 23:57 ` Darrick J. Wong [this message]
2016-08-25 23:32 ` [PATCH 05/71] xfs: introduce tracepoints for AG reservation code Darrick J. Wong
2016-09-06 6:38 ` Christoph Hellwig
2016-08-25 23:32 ` [PATCH 06/71] xfs: set up per-AG free space reservations Darrick J. Wong
2016-09-06 14:53 ` Christoph Hellwig
2016-09-06 17:31 ` Darrick J. Wong
2016-09-08 17:47 ` Darrick J. Wong
2016-08-25 23:32 ` [PATCH 07/71] xfs: define tracepoints for refcount btree activities Darrick J. Wong
2016-09-06 14:54 ` Christoph Hellwig
2016-09-08 18:20 ` Darrick J. Wong
2016-08-25 23:32 ` [PATCH 08/71] xfs: introduce refcount btree definitions Darrick J. Wong
2016-09-06 14:59 ` Christoph Hellwig
2016-09-06 17:13 ` Darrick J. Wong
2016-08-25 23:32 ` [PATCH 09/71] xfs: add refcount btree stats infrastructure Darrick J. Wong
2016-09-06 14:59 ` Christoph Hellwig
2016-08-25 23:33 ` [PATCH 10/71] xfs: refcount btree add more reserved blocks Darrick J. Wong
2016-09-06 15:00 ` Christoph Hellwig
2016-08-25 23:33 ` [PATCH 11/71] xfs: define the on-disk refcount btree format Darrick J. Wong
2016-09-06 15:06 ` Christoph Hellwig
2016-08-25 23:33 ` [PATCH 12/71] xfs: add refcount btree support to growfs Darrick J. Wong
2016-09-06 15:06 ` Christoph Hellwig
2016-08-25 23:33 ` [PATCH 13/71] xfs: account for the refcount btree in the alloc/free log reservation Darrick J. Wong
2016-08-25 23:33 ` [PATCH 14/71] xfs: add refcount btree operations Darrick J. Wong
2016-09-06 15:09 ` Christoph Hellwig
2016-08-25 23:33 ` [PATCH 15/71] xfs: create refcount update intent log items Darrick J. Wong
2016-09-06 15:16 ` Christoph Hellwig
2016-09-06 16:43 ` Darrick J. Wong
2016-09-06 17:03 ` Christoph Hellwig
2016-08-25 23:33 ` [PATCH 16/71] xfs: log refcount intent items Darrick J. Wong
2016-09-06 15:21 ` Christoph Hellwig
2016-09-08 19:14 ` Darrick J. Wong
2016-09-08 23:13 ` Dave Chinner
2016-09-08 23:16 ` Darrick J. Wong
2016-09-11 12:52 ` Christoph Hellwig
2016-09-12 18:40 ` Darrick J. Wong
2016-09-12 23:28 ` Dave Chinner
2016-08-25 23:33 ` [PATCH 17/71] xfs: adjust refcount of an extent of blocks in refcount btree Darrick J. Wong
2016-08-25 23:33 ` [PATCH 18/71] xfs: connect refcount adjust functions to upper layers Darrick J. Wong
2016-08-25 23:34 ` [PATCH 19/71] xfs: adjust refcount when unmapping file blocks Darrick J. Wong
2016-08-25 23:34 ` [PATCH 20/71] xfs: add refcount btree block detection to log recovery Darrick J. Wong
2016-08-25 23:34 ` [PATCH 21/71] xfs: refcount btree requires more reserved space Darrick J. Wong
2016-08-25 23:34 ` [PATCH 22/71] xfs: introduce reflink utility functions Darrick J. Wong
2016-08-25 23:34 ` [PATCH 23/71] xfs: create bmbt update intent log items Darrick J. Wong
2016-08-25 23:34 ` [PATCH 24/71] xfs: log bmap intent items Darrick J. Wong
2016-08-25 23:34 ` [PATCH 25/71] xfs: map an inode's offset to an exact physical block Darrick J. Wong
2016-08-25 23:34 ` [PATCH 26/71] xfs: pass bmapi flags through to bmap_del_extent Darrick J. Wong
2016-08-25 23:34 ` [PATCH 27/71] xfs: implement deferred bmbt map/unmap operations Darrick J. Wong
2016-08-25 23:35 ` [PATCH 28/71] xfs: when replaying bmap operations, don't let unlinked inodes get reaped Darrick J. Wong
2016-08-25 23:35 ` [PATCH 29/71] xfs: return work remaining at the end of a bunmapi operation Darrick J. Wong
2016-08-25 23:35 ` [PATCH 30/71] xfs: define tracepoints for reflink activities Darrick J. Wong
2016-08-25 23:35 ` [PATCH 31/71] xfs: add reflink feature flag to geometry Darrick J. Wong
2016-08-25 23:35 ` [PATCH 32/71] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files Darrick J. Wong
2016-08-25 23:35 ` [PATCH 33/71] xfs: introduce the CoW fork Darrick J. Wong
2016-08-25 23:35 ` [PATCH 34/71] xfs: support bmapping delalloc extents in " Darrick J. Wong
2016-09-06 15:25 ` Christoph Hellwig
2016-09-06 16:34 ` Darrick J. Wong
2016-09-11 12:59 ` Christoph Hellwig
2016-09-06 23:40 ` Dave Chinner
2016-09-11 12:57 ` Christoph Hellwig
2016-08-25 23:35 ` [PATCH 35/71] xfs: create delalloc extents in " Darrick J. Wong
2016-08-25 23:35 ` [PATCH 36/71] xfs: support allocating delayed " Darrick J. Wong
2016-08-25 23:35 ` [PATCH 37/71] xfs: allocate " Darrick J. Wong
2016-08-25 23:36 ` [PATCH 38/71] xfs: support removing extents from " Darrick J. Wong
2016-08-25 23:36 ` [PATCH 39/71] xfs: move mappings from cow fork to data fork after copy-write Darrick J. Wong
2016-08-25 23:36 ` [PATCH 40/71] xfs: report shared extents through the iomap interface Darrick J. Wong
2016-08-25 23:36 ` [PATCH 41/71] xfs: implement CoW for directio writes Darrick J. Wong
2016-08-25 23:36 ` [PATCH 42/71] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks Darrick J. Wong
2016-08-25 23:36 ` [PATCH 43/71] xfs: cancel pending CoW reservations when destroying inodes Darrick J. Wong
2016-08-25 23:36 ` [PATCH 44/71] xfs: store in-progress CoW allocations in the refcount btree Darrick J. Wong
2016-08-25 23:36 ` [PATCH 45/71] xfs: reflink extents from one file to another Darrick J. Wong
2016-08-25 23:36 ` [PATCH 46/71] xfs: add clone file and clone range vfs functions Darrick J. Wong
2016-08-25 23:37 ` [PATCH 47/71] xfs: add dedupe range vfs function Darrick J. Wong
2016-08-25 23:37 ` [PATCH 48/71] xfs: teach get_bmapx about shared extents and the CoW fork Darrick J. Wong
2016-08-25 23:37 ` [PATCH 49/71] xfs: swap inode reflink flags when swapping inode extents Darrick J. Wong
2016-08-25 23:37 ` [PATCH 50/71] xfs: unshare a range of blocks via fallocate Darrick J. Wong
2016-08-25 23:37 ` [PATCH 51/71] xfs: CoW shared EOF block when truncating file Darrick J. Wong
2016-08-25 23:37 ` [PATCH 52/71] xfs: support FS_XFLAG_REFLINK on reflink filesystems Darrick J. Wong
2016-08-25 23:37 ` [PATCH 53/71] xfs: create a separate cow extent size hint for the allocator Darrick J. Wong
2016-08-25 23:37 ` [PATCH 54/71] xfs: preallocate blocks for worst-case btree expansion Darrick J. Wong
2016-08-25 23:37 ` [PATCH 55/71] xfs: don't allow reflink when the AG is low on space Darrick J. Wong
2016-08-25 23:38 ` [PATCH 56/71] xfs: try other AGs to allocate a BMBT block Darrick J. Wong
2016-08-25 23:38 ` [PATCH 57/71] xfs: promote buffered writes to CoW when cowextsz is set Darrick J. Wong
2016-08-25 23:38 ` [PATCH 58/71] xfs: garbage collect old cowextsz reservations Darrick J. Wong
2016-09-24 19:42 ` Christoph Hellwig
2016-09-26 21:52 ` Darrick J. Wong
2016-09-27 18:50 ` Christoph Hellwig
2016-09-27 19:29 ` Darrick J. Wong
2016-09-27 20:15 ` Christoph Hellwig
2016-09-27 20:25 ` Darrick J. Wong
2016-08-25 23:38 ` [PATCH 59/71] xfs: provide switch to force filesystem to copy-on-write all the time Darrick J. Wong
2016-08-25 23:38 ` [PATCH 60/71] xfs: increase log reservations for reflink Darrick J. Wong
2016-08-25 23:38 ` [PATCH 61/71] xfs: add shared rmap map/unmap/convert log item types Darrick J. Wong
2016-08-25 23:38 ` [PATCH 62/71] xfs: use interval query for rmap alloc operations on shared files Darrick J. Wong
2016-08-25 23:38 ` [PATCH 63/71] xfs: convert unwritten status of reverse mappings for " Darrick J. Wong
2016-08-25 23:38 ` [PATCH 64/71] xfs: set a default CoW extent size of 32 blocks Darrick J. Wong
2016-08-25 23:38 ` [PATCH 65/71] xfs: check for invalid inode reflink flags Darrick J. Wong
2016-08-25 23:39 ` [PATCH 66/71] xfs: don't mix reflink and DAX mode for now Darrick J. Wong
2016-08-25 23:39 ` [PATCH 67/71] xfs: fail ->bmap for reflink inodes Darrick J. Wong
2016-09-06 15:29 ` Christoph Hellwig
2016-09-06 16:26 ` Darrick J. Wong
2016-09-06 17:02 ` Christoph Hellwig
2016-08-25 23:39 ` [PATCH 68/71] xfs: recognize the reflink feature bit Darrick J. Wong
2016-08-25 23:39 ` [PATCH 69/71] xfs: various swapext cleanups Darrick J. Wong
2016-08-25 23:39 ` [PATCH 70/71] xfs: refactor swapext code Darrick J. Wong
2016-08-25 23:39 ` [PATCH 71/71] xfs: implement swapext for rmap filesystems Darrick J. Wong
2016-08-26 12:56 ` [PATCH v8 00/71] xfs: add reflink and dedupe support Christoph Hellwig
2016-08-26 16:28 ` Darrick J. Wong
2016-08-26 18:42 ` Darrick J. Wong
2016-08-26 14:08 ` Brian Foster
2016-08-26 18:44 ` [PATCH 72/71] xfs: track log done items directly in the deferred pending work item Darrick J. Wong
-- strict thread matches above, loose matches on Subject: below --
2016-08-25 23:46 [PATCH v8 00/71] xfsprogs: add reflink and dedupe support Darrick J. Wong
2016-08-25 23:46 ` [PATCH 04/71] xfs: defer should allow ->finish_item to request a new transaction Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160906235717.GB26927@birch.djwong.org \
--to=darrick.wong@oracle.com \
--cc=hch@infradead.org \
--cc=linux-xfs@vger.kernel.org \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).