public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: cheng.lin130@zte.com.cn, linux-xfs@vger.kernel.org,
	linux-kernel@vger.kernel.org, jiang.yong5@zte.com.cn,
	wang.liang82@zte.com.cn, liu.dong3@zte.com.cn
Subject: Re: [PATCH v3] xfs: introduce protection for drop nlink
Date: Tue, 3 Oct 2023 10:33:17 -0700	[thread overview]
Message-ID: <20231003173317.GI21298@frogsfrogsfrogs> (raw)
In-Reply-To: <ZQqI5KNgghI5iFrC@dread.disaster.area>

On Wed, Sep 20, 2023 at 03:53:40PM +1000, Dave Chinner wrote:
> On Mon, Sep 18, 2023 at 08:33:35PM -0700, Darrick J. Wong wrote:
> > On Mon, Sep 18, 2023 at 03:48:38PM +1000, Dave Chinner wrote:
> > > It is only when we are trying to modify something that corruption
> > > becomes a problem with fatal consequences. Once we've made a
> > > modification, the in-memory state is different to the on-disk state
> > > and whilst we are in that state any corruption we discover becomes
> > > fatal. That is because there is no way to reconcile the changes
> > > we've already made in memory with what is on-disk - we don't know
> > > that the in-memory changes are good because we tripped over
> > > corruption, and so we must not propagate bad in-memory state and
> > > metadata to disk over the top of what may be still be uncorrupted
> > > metadata on disk.
> > 
> > It'd be a massive effort, but wouldn't it be fun if one could attach
> > defer ops to a transaction that updated incore state on commit but
> > otherwise never appeared on disk?
> >
> > Let me cogitate on that during part 2 of vacation...
> 
> Sure, I'm interested to see what you might come up with.
> 
> My thoughts on rollback of dirty transactions come from a different
> perspective.
> 
> Conceptually being able to roll back individual transactions isn't
> that difficult. All it takes is a bit more memory and CPU - when we
> join the item to the transaction we take a copy of the item we are
> about to modify.
> 
> If we then cancel a dirty transaction, we then roll back all the
> dirty items to their original state before we unlock them.  This
> works fine for all the on-disk stuff we track in log items.
> 
> I have vague thoughts about how this could potentially be tied into
> the shadow buffers we already use for keeping a delta copy of all
> the committed in-memory changes in the CIL that we haven't yet
> committed to the journal - that's actually the entire delta between
> what is on disk and what we've changed prior to the current
> transaction we are cancelling.
> 
> Hence, in theory, a rollback for a dirty log item is simply "read it
> from disk again, copy the CIL shadow buffer delta into it".

<nod> That's more or less the same as what I was thinking.

> However, the complexity comes with trying to roll back associated
> in-memory state changes that we don't track as log items.  e.g.
> incore extent list changes, in memory inode flag state (e.g.
> XFS_ISTALE), etc. that's where all the hard problems to solve lie, I
> think.

Yeah.  I was thinking that each of those incore state changes could be
implemented as a defer_ops that have NOP ->create_intent and
->create_done functions.  The ->finish_item would actually update the
incore structure.  This would be a very large project, and I'm not sure
that it wouldn't be easier to snapshot the xfs_inode fields themselves,
similar to how inode log items snapshot xfs_dinode fields.

(Snapshotting probably doesn't work for the more complex incore
inode structures.)

Kent has been wrangling with this problem for a while in bcachefs and I
think he's actually gotten the rollbacks to work more or less correctly.
He told me that it was a significant restructuring of his codebase even
though *everything* is tracked in btrees and the cursor abstraction
there is more robust than xfs.

> Another problem is how do we rollback from the middle of an intent
> (defer ops) chain? We have to complete that chain for things to end
> up consistent on disk, so we can't just cancel the current
> transaction and say we are done and everything is clean.  Maybe
> that's what you are thinking of here - each chain has an "undo"
> intent chain that can roll back all the changes already made?

Yes.  Every time we call ->finish_item on a log intent item, we also log
a new intent item that undoes whatever that step did.  These items we'll
call "log undo intent" items, and put them on a separate list, e.g.
tp->t_undoops.  If the chain completes successfully then the last step
is to abort everything on t_undoops to release all that memory.

If the chain does not succeed, then we'd abort the intents on t_dfops,
splice t_undoops onto t_dfops, and call xfs_defer_finish to write the
log undo intent items to disk and finish them.  If /that/ fails then we
have to shutdown.

I think this also means that buffer updates that are logged from a
->finish_item function should not be cancelled per above, since the undo
intent item will take care of that.  That would be easy if btree updates
made by an efi/cui/rui items used ordered buffers instead of logging
them directly like we do now.

For bui items, I think we'd need ordered buffers for bmbt updates and
snapshotting inode items for the inode updates themselves.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

      reply	other threads:[~2023-10-03 17:33 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-13  9:44 [PATCH v3] xfs: introduce protection for drop nlink cheng.lin130
2023-09-13 23:42 ` Dave Chinner
2023-09-15  8:20   ` cheng.lin130
2023-09-15  9:50   ` cheng.lin130
2023-09-17 22:44     ` Dave Chinner
2023-09-18  3:44       ` cheng.lin130
2023-09-18  5:48         ` Dave Chinner
2023-09-18  6:34           ` cheng.lin130
2023-09-19  3:33           ` Darrick J. Wong
2023-09-20  5:53             ` Dave Chinner
2023-10-03 17:33               ` Darrick J. Wong [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231003173317.GI21298@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=cheng.lin130@zte.com.cn \
    --cc=david@fromorbit.com \
    --cc=jiang.yong5@zte.com.cn \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=liu.dong3@zte.com.cn \
    --cc=wang.liang82@zte.com.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox