From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 04/11] xfs: CoW fork operations should only update quota reservations
Date: Thu, 25 Jan 2018 09:52:57 -0800 [thread overview]
Message-ID: <20180125175257.GG9068@magnolia> (raw)
In-Reply-To: <20180125130123.GA43198@bfoster.bfoster>
On Thu, Jan 25, 2018 at 08:01:24AM -0500, Brian Foster wrote:
> On Wed, Jan 24, 2018 at 11:14:25AM -0800, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2018 at 09:22:16AM -0500, Brian Foster wrote:
> > > On Tue, Jan 23, 2018 at 06:18:23PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > >
> > > > Since the CoW fork only exists in memory, it is incorrect to update the
> > > > on-disk quota block counts when we modify the CoW fork. Unlike the data
> > > > fork, even real extents in the CoW fork are only reservations (on-disk
> > > > they're owned by the refcountbt) so they must not be tracked in the on
> > > > disk quota info.
> > > >
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > > fs/xfs/libxfs/xfs_bmap.c | 203 ++++++++++++++++++++++++++++++++++++++++++++--
> > > > fs/xfs/xfs_reflink.c | 8 +-
> > > > 2 files changed, 196 insertions(+), 15 deletions(-)
> > > >
> > > >
> ...
> > > > + * - sb_fdblocks: Number of free blocks recorded in the superblock on disk.
> > > > + * - fdblocks: Number of free blocks recorded in the superblock minus any
> > > > + * in-core reservations made in anticipation of future writes.
> > > > + *
> > > > + * - t_blk_res: Number of blocks reserved out of fdblocks for a transaction.
> > > > + * When the transaction commits, t_blk_res - t_blk_res_used is given
> > > > + * back to fdblocks.
> > > > + * - t_blk_res_used: Number of blocks used by this transaction that were
> > > > + * reserved for this transaction.
> > > > + * - t_fdblocks_del: Number of blocks by which fdblocks and sb_fdblocks will
> > > > + * have to decrease at commit.
> > > > + * - t_res_fdblocks_delta: Number of blocks by which sb_fdblocks will have to
> > > > + * decrease at commit. We assume that fdblocks was decreased
> > > > + * prior to the transaction.
> > > > + *
> > > > + * Data fork block mappings have four logical states:
> > > > + *
> > > > + * +--------> UNWRITTEN <------+
> > > > + * | ^ |
> > > > + * | v v
> > > > + * DELALLOC <----> HOLE <------> REAL
> > > > + * | ^
> > > > + * | |
> > > > + * +---------------------------+
> > > > + *
> > >
> > > I'm not sure we need a graphic for the extent states. Non-hole
> > > conversions to delalloc is the only transition that doesn't make any
> > > sense.
> >
> > First of all, Dave keeps asking for ASCII art 'when appropriate'. :)
> >
>
> I'm not against ASCII art in general...
>
> > But on a more serious note, I thought the state diagram would be useful
> > for anyone who isn't so familiar with how blocks get mapped into files
> > in xfs, particularly to compare the data fork diagram against the
> > corresponding picture for the cow fork.
> >
>
> ... but TBH I didn't really notice they were different until you pointed
> it out. :/ It still seems like a rather verbose means to point out that
> COW fork apparently doesn't have the DELALLOC -> REAL transition.
It also doesn't have REAL -> UNWRITTEN transitions either. Clearly the
diagrams aren't helping, point made. ;)
Heh, it also has HOLE -> REAL transitions even though the diagram does
not so indicate, so ... yeah.
> ...
> >
> > > I'm also wondering how much of the whole picture really needs to be
> > > described here to cover quota accounting with respect to block mapping
> > > (sufficiently to differentiate data fork from cow fork, which I take is
> > > the purpose of this section). For example, do we really need the
> > > internal transaction details for how quota deltas are carried?
> > >
> > > Instead, I think it might be sufficient to explain that the quota system
> > > works in two "levels" (for lack of a better term :/), one for
> > > reservation and another for real block usage. The reservation is
> > > associated with transaction reservation and/or delayed block allocation
> > > (no tx). In either case, quota reservation is converted to real quota
> > > block usage when a transaction commits that maps real/physical blocks.
> > > If the transaction held extra reservation that went unused, that quota
> > > reservation is released. The primary difference is that transactions to
> > > convert delalloc -> real do not reserve quota blocks in the first place,
> > > since that has already occurred, so they just need to make sure to
> > > convert/persist the quota res for the blocks that were converted to
> > > real.
> >
> > I thought about cutting this whole comment down to a simple sentence
> > about how quota accounting is different between the data & cow forks (as
> > you figured out, we use delalloc quota reservations for everything in
> > the cow fork and only turn them into real ones when we go to remap) but
> > then worried that doing so would presuppose the reader knew anything
> > about how the extent lifecycles work... and that's how I end up with a
> > gigantic manual.
> >
>
> Which is probably fine for xfs-docs or something..
>
> The more I think about it, the more I think this whole thing is better
> off focused on explaining the unique quota management of COW fork
> blocks. We can add additional comments about extent lifecycles, how
> quota is tracked in general, etc., but perhaps it's best to make those a
> separate patch rather than attempt to document a ground-up "COW fork
> quotas for dummies" in a single comment. :P Anyways, I'll reserve
> further judgement for the new patch..
Yeah. I think I'll go update ch3 ("Sharing Data Blocks") in the design
document instead of dumping everything into a code comment that seems
somewhat misplaced.
Onto the next reply! :)
--D
> Brian
>
> > > > + * Copy on Write Fork Mapping Lifecycle
> > > > + *
> > > > + * The CoW fork handles things differently from the data fork because its
> > > > + * mappings only exist in memory-- the refcount btree is the on-disk owner of
> > > > + * the extents until they're remapped into the data fork. Therefore,
> > > > + * unwritten and real extents in the CoW fork are treated the same way as
> > > > + * delayed allocation extents. Quota and fdblock changes only exist in
> > > > + * memory, which requires some twists in the bmap functions.
> > > > + *
> > >
> > > Ok, but perhaps this should point out what happens when cow blocks are
> > > reserved with respect to quotas..? IIUC, a delalloc quota reservation
> > > occurs just the same as above, the blocks simply reside in another fork.
> >
> > Yes.
> >
> > > > + * The CoW fork extent state diagram looks like this:
> > > > + *
> > > > + * +--------> UNWRITTEN -------+
> > > > + * | ^ |
> > > > + * | v v
> > > > + * DELALLOC <----> HOLE <------- REAL
> > > > + *
> > > > + * Holes are still holes. Delayed allocation extents reserve blocks for
> > > > + * landing future writes, just like they do in the data fork. However, unlike
> > > > + * the data fork, unwritten extents signal an extent that has been allocated
> > > > + * but is not currently undergoing writeback. Real extents are undergoing
> > > > + * writeback, and when that writeback finishes the corresponding data fork
> > > > + * extent will be punched out and the CoW fork counterpart moved to the new
> > > > + * hole in the data fork.
> > > > + *
> > >
> > > Ok, so the difference is that for the COW fork, the _extent state_
> > > conversion is not the appropriate trigger event to convert quota
> > > reservation to real quota usage. Instead, the blocks being remapped from
> > > the COW fork to the data fork is when that should occur.
> >
> > Yes.
> >
> > > > + * The state transitions and required metadata updates are as follows:
> > > > + *
> > > > + * - HOLE to DELALLOC: Increase i_cow_blocks and q_res_bcount, and decrease
> > > > + * fdblocks.
> > > > + * - HOLE to UNWRITTEN: Same as above, but since we reserved quota via
> > > > + * qt_blk_res (which increased q_res_bcount) when we allocate the
> > > > + * extent we have to decrease qt_blk_res so that the commit doesn't
> > > > + * give the allocated CoW blocks back.
> > > > + *
> > >
> > > Hmm, this is a little confusing. Looking at the code change and comment
> > > below, I think I get what this is trying to do, which is essentially
> > > make a real block cow fork alloc behave like a delalloc reservation
> > > (with respect to quota). FWIW, I think what confuses me is the assertion
> > > that the blocks would be "given back" otherwise. The only reference I
> > > have to compare is data fork alloc behavior, which implies that used
> > > reservation would not be given back, but rather converted to real quota
> > > usage on block allocation (and excess reservation would still be given
> > > back, which afaict we still want to happen). So the trickery is required
> > > to prevent conversion of quota reservation for the allocated cow blocks,
> > > let that res sit around until the cow blocks are remapped, and release
> > > unused reservation from the tx as normal. Am I following that correctly?
> >
> > Yes.
> >
> > > > + * - DELALLOC to UNWRITTEN: No change.
> > > > + * - DELALLOC to HOLE: Decrease i_cow_blocks and q_res_bcount, and increase
> > > > + * fdblocks.
> > > > + *
> > > > + * - UNWRITTEN to HOLE: Same as DELALLOC to HOLE.
> > > > + * - UNWRITTEN to REAL: No change.
> > > > + *
> > > > + * - REAL to HOLE: This transition happens when we've finished a write
> > > > + * operation and need to move the mapping to the data fork. We
> > > > + * punch the correspond data fork mappings, which decreases
> > > > + * qt_bcount. Then we map the CoW fork mapping into the hole we
> > > > + * just cleared out of the data fork, which increases qt_bcount.
> > > > + * There's a subtlety here -- if we promoted a write over a hole to
> > > > + * CoW, there will be a net increase in qt_bcount, which is fine
> > > > + * because we already reserved the quota when we filled the CoW
> > > > + * fork. Finally, we punch the CoW fork mapping, which decreases
> > > > + * q_res_bcount.
> > > > + *
> > > > + * Notice how all CoW fork extents use transactionless quota reservations and
> > > > + * the in-core fdblocks to maintain state, and we avoid updating any on-disk
> > > > + * metadata. This is essential to maintain metadata correctness if the system
> > > > + * goes down.
> > > > + */
> > > >
> > > > kmem_zone_t *xfs_bmap_free_item_zone;
> > > >
> > > > @@ -3337,6 +3476,39 @@ xfs_bmap_btalloc_filestreams(
> > > > return 0;
> > > > }
> > > >
> > > > +/* Deal with CoW fork accounting when we allocate a block. */
> > > > +static void
> > > > +xfs_bmap_btalloc_cow(
> > > > + struct xfs_bmalloca *ap,
> > > > + struct xfs_alloc_arg *args)
> > > > +{
> > > > + /* Filling a previously reserved extent; nothing to do here. */
> > > > + if (ap->wasdel)
> > > > + return;
> > > > +
> > > > + /*
> > > > + * The CoW fork only exists in memory, so the on-disk quota accounting
> > > > + * must not incude any CoW fork extents. Therefore, CoW blocks are
> > > > + * only tracked in the in-core dquot block count (q_res_bcount).
> > > > + *
> > > > + * If we get here, we're filling a CoW hole with a real (non-delalloc)
> > > > + * CoW extent having reserved enough blocks from both q_res_bcount and
> > > > + * qt_blk_res to guarantee that we won't run out of space. The unused
> > > > + * qt_blk_res is given back to q_res_bcount when the transaction
> > > > + * commits.
> > > > + *
> > > > + * We don't want the quota accounting for our newly allocated blocks
> > > > + * to be given back, so we must decrease qt_blk_res without decreasing
> > > > + * q_res_bcount.
> > > > + *
> > > > + * Note: If we're allocating a delalloc extent, we already reserved
> > > > + * the q_res_bcount blocks, so no quota accounting update is needed
> > > > + * here.
> > > > + */
> > > > + xfs_trans_mod_dquot_byino(ap->tp, ap->ip, XFS_TRANS_DQ_RES_BLKS,
> > > > + -(long)args->len);
> > > > +}
> > >
> > > Factoring nit.. if we're going to refactor bits of xfs_bmap_btalloc()
> > > out, it might be cleaner to factor out all of the quota logic rather
> > > than just the cow bits (which is basically just a simple check and
> > > function call). E.g., refactor into an xfs_bmap_btalloc_quota() helper
> > > that does the right thing based on the fork, with comments as to why,
> > > etc. (and perhaps just leave the unrelated di_nblocks change behind).
> >
> > I thought about factoring the data/attr fork stuff into its own
> > xfs_bmap_btalloc_quota() function too, since this function is already
> > eyewateringly long. I think I'll do that as a separate refactor at the
> > end of the series, though...
> >
> > --D
> >
> > > Brian
> > >
> > > > +
> > > > STATIC int
> > > > xfs_bmap_btalloc(
> > > > struct xfs_bmalloca *ap) /* bmap alloc argument struct */
> > > > @@ -3571,19 +3743,22 @@ xfs_bmap_btalloc(
> > > > *ap->firstblock = args.fsbno;
> > > > ASSERT(nullfb || fb_agno <= args.agno);
> > > > ap->length = args.len;
> > > > - if (!(ap->flags & XFS_BMAPI_COWFORK))
> > > > - ap->ip->i_d.di_nblocks += args.len;
> > > > - xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > > > if (ap->wasdel)
> > > > ap->ip->i_delayed_blks -= args.len;
> > > > - /*
> > > > - * Adjust the disk quota also. This was reserved
> > > > - * earlier.
> > > > - */
> > > > - xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> > > > - ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> > > > - XFS_TRANS_DQ_BCOUNT,
> > > > - (long) args.len);
> > > > + if (ap->flags & XFS_BMAPI_COWFORK) {
> > > > + xfs_bmap_btalloc_cow(ap, &args);
> > > > + } else {
> > > > + ap->ip->i_d.di_nblocks += args.len;
> > > > + xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > > > + /*
> > > > + * Adjust the disk quota also. This was reserved
> > > > + * earlier.
> > > > + */
> > > > + xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> > > > + ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> > > > + XFS_TRANS_DQ_BCOUNT,
> > > > + (long) args.len);
> > > > + }
> > > > } else {
> > > > ap->blkno = NULLFSBLOCK;
> > > > ap->length = 0;
> > > > @@ -4776,6 +4951,7 @@ xfs_bmap_del_extent_cow(
> > > > struct xfs_bmbt_irec new;
> > > > xfs_fileoff_t del_endoff, got_endoff;
> > > > int state = BMAP_COWFORK;
> > > > + int error;
> > > >
> > > > XFS_STATS_INC(mp, xs_del_exlist);
> > > >
> > > > @@ -4832,6 +5008,11 @@ xfs_bmap_del_extent_cow(
> > > > xfs_iext_insert(ip, icur, &new, state);
> > > > break;
> > > > }
> > > > +
> > > > + /* Remove the quota reservation */
> > > > + error = xfs_trans_reserve_quota_nblks(NULL, ip,
> > > > + -(long)del->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > > + ASSERT(error == 0);
> > > > }
> > > >
> > > > /*
> > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > index 82abff6..e367351 100644
> > > > --- a/fs/xfs/xfs_reflink.c
> > > > +++ b/fs/xfs/xfs_reflink.c
> > > > @@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
> > > > del.br_startblock, del.br_blockcount,
> > > > NULL);
> > > >
> > > > - /* Update quota accounting */
> > > > - xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> > > > - -(long)del.br_blockcount);
> > > > -
> > > > /* Roll the transaction */
> > > > xfs_defer_ijoin(&dfops, ip);
> > > > error = xfs_defer_finish(tpp, &dfops);
> > > > @@ -795,6 +791,10 @@ xfs_reflink_end_cow(
> > > > if (error)
> > > > goto out_defer;
> > > >
> > > > + /* Charge this new data fork mapping to the on-disk quota. */
> > > > + xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > > > + (long)del.br_blockcount);
> > > > +
> > > > /* Remove the mapping from the CoW fork. */
> > > > xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
> > > >
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2018-01-25 17:53 UTC|newest]
Thread overview: 60+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-01-24 2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
2018-01-24 2:18 ` [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks Darrick J. Wong
2018-01-24 14:16 ` Brian Foster
2018-01-26 9:06 ` Christoph Hellwig
2018-01-26 18:26 ` Darrick J. Wong
2018-01-24 2:18 ` [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink Darrick J. Wong
2018-01-24 14:18 ` Brian Foster
2018-01-24 18:40 ` Darrick J. Wong
2018-01-26 12:07 ` Christoph Hellwig
2018-01-26 18:48 ` Darrick J. Wong
2018-01-27 3:32 ` Dave Chinner
2018-01-24 2:18 ` [PATCH 03/11] xfs: call xfs_qm_dqattach before performing reflink operations Darrick J. Wong
2018-01-24 14:18 ` Brian Foster
2018-01-26 9:07 ` Christoph Hellwig
2018-01-24 2:18 ` [PATCH 04/11] xfs: CoW fork operations should only update quota reservations Darrick J. Wong
2018-01-24 14:22 ` Brian Foster
2018-01-24 19:14 ` Darrick J. Wong
2018-01-25 13:01 ` Brian Foster
2018-01-25 17:52 ` Darrick J. Wong [this message]
2018-01-25 1:20 ` [PATCH v2 " Darrick J. Wong
2018-01-25 13:03 ` Brian Foster
2018-01-25 18:20 ` Darrick J. Wong
2018-01-26 13:02 ` Brian Foster
2018-01-26 18:40 ` Darrick J. Wong
2018-01-26 12:12 ` Christoph Hellwig
2018-01-24 2:18 ` [PATCH 05/11] xfs: track CoW blocks separately in the inode Darrick J. Wong
2018-01-25 13:06 ` Brian Foster
2018-01-25 19:21 ` Darrick J. Wong
2018-01-26 13:04 ` Brian Foster
2018-01-26 19:08 ` Darrick J. Wong
2018-01-26 12:15 ` Christoph Hellwig
2018-01-26 19:00 ` Darrick J. Wong
2018-01-26 23:51 ` Darrick J. Wong
2018-01-24 2:18 ` [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls Darrick J. Wong
2018-01-25 17:31 ` Brian Foster
2018-01-25 20:20 ` Darrick J. Wong
2018-01-26 13:06 ` Brian Foster
2018-01-26 19:12 ` Darrick J. Wong
2018-01-26 9:11 ` Christoph Hellwig
2018-01-24 2:18 ` [PATCH 07/11] xfs: always zero di_flags2 when we free the inode Darrick J. Wong
2018-01-25 17:31 ` Brian Foster
2018-01-25 18:36 ` Darrick J. Wong
2018-01-26 9:08 ` Christoph Hellwig
2018-01-24 2:18 ` [PATCH 08/11] xfs: fix tracepoint %p formats Darrick J. Wong
2018-01-25 17:31 ` Brian Foster
2018-01-25 18:47 ` Darrick J. Wong
2018-01-26 0:19 ` Darrick J. Wong
2018-01-26 9:09 ` Christoph Hellwig
2018-01-24 2:18 ` [PATCH 09/11] xfs: make tracepoint inode number format consistent Darrick J. Wong
2018-01-25 17:31 ` Brian Foster
2018-01-26 9:09 ` Christoph Hellwig
2018-01-24 2:19 ` [PATCH 10/11] xfs: refactor inode verifier corruption error printing Darrick J. Wong
2018-01-25 17:31 ` Brian Foster
2018-01-25 18:23 ` Darrick J. Wong
2018-01-26 9:10 ` Christoph Hellwig
2018-01-24 2:19 ` [PATCH 11/11] xfs: don't clobber inobt/finobt cursors when xref with rmap Darrick J. Wong
2018-01-26 9:10 ` Christoph Hellwig
2018-01-25 5:26 ` [PATCH 12/11] xfs: refactor quota code in xfs_bmap_btalloc Darrick J. Wong
2018-01-26 12:17 ` Christoph Hellwig
2018-01-26 21:46 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180125175257.GG9068@magnolia \
--to=darrick.wong@oracle.com \
--cc=bfoster@redhat.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).