public inbox for linux-xfs@vger.kernel.org
From: Long Li <leo.lilong@huawei.com>
To: Brian Foster <bfoster@redhat.com>
Cc: <djwong@kernel.org>, <chandanbabu@kernel.org>,
	<linux-xfs@vger.kernel.org>, <yi.zhang@huawei.com>,
	<houtao1@huawei.com>, <yangerkun@huawei.com>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH] xfs: ensure submit buffers on LSN boundaries in error handlers
Date: Wed, 10 Jan 2024 16:38:08 +0800	[thread overview]
Message-ID: <20240110083808.GA2075885@ceph-admin> (raw)
In-Reply-To: <ZZ1dtV1psURJnTOy@bfoster>

On Tue, Jan 09, 2024 at 09:52:37AM -0500, Brian Foster wrote:
> > > > > After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on
> > > > > LSN boundaries") was introduced, we submit buffers on LSN boundaries during
> > > > > log recovery.
> > > > 
> > > > Correct - we submit all the changes in a checkpoint for submission
> > > > before we start recovering the next checkpoint. That's because
> > > > checkpoints are supposed to be atomic units of change moving the
> > > > on-disk state from one change set to the next.
> > > 
> > > Submitting buffers on LSN boundaries does not mean submitting them on
> > > checkpoint boundaries during recovery. In my understanding, one
> > > transaction on disk corresponds to a checkpoint, but multiple on-disk
> > > transactions may share the same LSN, so in such cases we must submit
> > > the buffers for all of those transactions together. This rule was
> > > introduced by commit 12818d24db8a ("xfs: rework log recovery to submit
> > > buffers on LSN boundaries").
> > 
> > Well, yes, that's exactly the situation that commit 12818d24db8a was
> > intended to handle:
> > 
> >     "If independent transactions share an LSN and both modify the
> >     same buffer, log recovery can incorrectly skip updates and leave
> >     the filesystem in an inconsistent state."
> > 
> > Unfortunately, we didn't take into account the complexity of
> > multiple transactions sharing the same start LSN in commit
> > 12818d24db8a ("xfs: rework log recovery to submit buffers on LSN
> > boundaries") back in 2016.
> > 
> > Indeed, we didn't even know that there was a reliance on strict
> > start record LSN ordering in journal recovery until 2021:
> > 
> > commit 68a74dcae6737c27b524b680e070fe41f0cad43a
> > Author: Dave Chinner <dchinner@redhat.com>
> > Date:   Tue Aug 10 18:00:44 2021 -0700
> > 
> >     xfs: order CIL checkpoint start records
> >     
> >     Because log recovery depends on strictly ordered start records as
> >     well as strictly ordered commit records.
> >     
> >     This is a zero day bug in the way XFS writes pipelined transactions
> >     to the journal which is exposed by fixing the zero day bug that
> >     prevents the CIL from pipelining checkpoints. This re-introduces
> >     explicit concurrent commits back into the on-disk journal and hence
> >     out of order start records.
> >     
> >     The XFS journal commit code has never ordered start records and we
> >     have relied on strict commit record ordering for correct recovery
> >     ordering of concurrently written transactions. Unfortunately, root
> >     cause analysis uncovered the fact that log recovery uses the LSN of
> >     the start record for transaction commit processing. Hence, whilst
> >     the commits are processed in strict order by recovery, the LSNs
> >     associated with the commits can be out of order and so recovery may
> >     stamp incorrect LSNs into objects and/or misorder intents in the AIL
> >     for later processing. This can result in log recovery failures
> >     and/or on disk corruption, sometimes silent.
> >     
> >     Because this is a long standing log recovery issue, we can't just
> >     fix log recovery and call it good. This still leaves older kernels
> >     susceptible to recovery failures and corruption when replaying a log
> >     from a kernel that pipelines checkpoints. There is also the issue
> >     that in-memory ordering for AIL pushing and data integrity
> >     operations are based on checkpoint start LSNs, and if the start LSN
> >     is incorrect in the journal, it is also incorrect in memory.
> >     
> >     Hence there's really only one choice for fixing this zero-day bug:
> >     we need to strictly order checkpoint start records in ascending
> >     sequence order in the log, the same way we already strictly order
> >     commit records.
> >     
> >     Signed-off-by: Dave Chinner <dchinner@redhat.com>
> >     Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> >     Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > Essentially, the problem now is that even with strictly ordered
> > start records for checkpoints, checkpoints with the same start LSN
> > interfere with each other in recovery because recovery is not
> > aware of the fact that we can have multiple checkpoints that start
> > with the same LSN.
> > 
> > This is another zero-day issue with the journal and log recovery;
> > originally there was no "anti-recovery" logic in the journal like we
> > have now with LSNs to prevent recovery from taking metadata state
> > backwards.  Hence log recovery just always replayed every change
> > that was in the journal from start to finish and so there was never
> > a problem with having multiple start records in the same log record.
> > 
> > However, this was known to cause problems with inodes and data vs
> > metadata sequencing and non-transactional inode metadata updates
> > (e.g. inode size), so a "flush iteration" counter was added to
> > inodes in 2003:
> > 
> > commit 6ed3d868e47470a301b49f1e8626972791206f50
> > Author: Steve Lord <lord@sgi.com>
> > Date:   Wed Aug 6 21:17:05 2003 +0000
> > 
> >     Add versioning to the on disk inode which we increment on each
> >     flush call. This is used during recovery to avoid replaying an
> >     older copy of the inode from the log. We can do this without
> >     versioning the filesystem as the pad space we borrowed was
> >     always zero and will be ignored by old kernels.
> >     During recovery, do not replay an inode log record which is older
> >     than the on disk copy. Check for wrapping in the counter.
> > 
> > This was never fully reliable, and there were always issues with
> > this counter because inode changes weren't always journalled, nor
> > were cache flushes used to ensure unlogged inode metadata updates
> > reached stable storage.
> > 
> > The LSN sequencing was added to the v5 format to ensure metadata
> > never goes backwards in time on disk without fail. The issue you've
> > uncovered shows that we still have issues stemming from the
> > original journal recovery algorithm that was not designed with
> > anti-recovery protections in mind from the start.
> > 
> > The problem we need to solve is this: how do we preserve the
> > necessary anti-recovery behaviour when we can have multiple
> > checkpoints with the same LSN and objects are updated immediately
> > on recovery?
> > 
> > I suspect that we need to track that the checkpoint being recovered
> > has a duplicate start LSN (i.e. in the struct xlog_recover) and
> > modify the anti-recovery LSN check to take this into account. i.e.
> > we can really only skip recovery of the first checkpoint at any
> > given LSN because we cannot disambiguate an LSN updated by the first
> > checkpoint at that LSN and the metadata already being up to date on
> > disk in the second and subsequent checkpoints at the same start
> > LSN.
> > 
> > There are likely to be other solutions - anyone have a different
> > idea on how we might address this?
> > 
> 
> It's been a while since I've looked at any of this and I haven't waded
> through all of the details, so I could easily be missing something, but
> what exactly is wrong with the approach of the patch as posted?
> 
> Commit 12818d24db ("xfs: rework log recovery to submit buffers on LSN
> boundaries") basically created a new invariant for log recovery where
> buffers are allowed to be written only once per LSN. The risk otherwise
> is that a subsequent update with a matching LSN would not be correctly
> applied due to the v5 LSN ordering rules. Since log recovery processes
> transactions (using terminology/granularity as defined by the
> implementation of xlog_recover_commit_trans()), this required changes to
> accommodate any of the various possible runtime logging scenarios that
> could cause a buffer to have multiple entries in the log associated with
> a single LSN, the details of which were orthogonal to the fix.
> 
> The functional change, therefore, was that rather than processing and
> submitting "transactions" in sequence during recovery, the pending buffer
> list was lifted to a higher level in the code, a tracking field was
> added for the "current LSN" of log recovery, and only once we cross a
> current LSN boundary are we allowed to submit the set of buffers
> processed for the prior LSN. The reason for this logic is that seeing
> the next LSN is really the only way we know we're done processing items
> for a particular LSN.
> 
> If I understand the problem description correctly, the issue here is
> that if an error is encountered in the middle of processing items for
> some LSN A, we bail out of recovery and submit the pending buffers on
> the way out. If we haven't completed processing all items for LSN A
> before failing, however, then we've just possibly violated the "write
> once per LSN" invariant that protects from corrupting the fs. This is
> because the writeback permanently updates metadata LSNs (assuming that
> I/O doesn't fail), which means if recovery retries from the same point
> the next time around and progresses to find a second instance of an
> already written buffer in LSN A, it will exhibit the same general
> behavior from before the write once invariant was introduced. IOW,
> there's still a vector to the original problematic multi-write per LSN
> behavior through multiple recovery attempts (hence the simulated I/O
> error to reproduce).
> 
> Long Li, am I following the problem description correctly? I've not
> fully reviewed it, but if so, the proposed solution seems fairly sane
> and logical to me. (And nice work tracking this down, BTW, regardless of
> whether this is the final solution. ;).
> 

Hi Brian, your description is correct, and it is clear and easy to
understand. Thanks for your encouragement of my work.

Thanks,
Long Li


Thread overview: 17+ messages
2023-12-28 12:46 [PATCH] xfs: ensure submit buffers on LSN boundaries in error handlers Long Li
2024-01-04  2:01 ` Darrick J. Wong
2024-01-04 13:03   ` Long Li
2024-01-07 22:13 ` Dave Chinner
2024-01-08 12:28   ` Long Li
2024-01-08 23:40     ` Dave Chinner
2024-01-09 14:52       ` Brian Foster
2024-01-09 21:43         ` Dave Chinner
2024-01-10  7:03           ` Long Li
2024-01-10 23:08             ` Dave Chinner
2024-01-11 14:54               ` Brian Foster
2024-01-10  8:38         ` Long Li [this message]
2024-01-11  0:47 ` Dave Chinner
2024-01-12 12:55   ` Long Li
2024-01-12 18:42     ` Brian Foster
2024-01-15 13:31       ` Long Li
2024-01-15 14:14         ` Brian Foster
