public inbox for linux-xfs@vger.kernel.org
From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>,
	Chandan Babu R <chandan.babu@oracle.com>,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
Date: Mon, 14 Oct 2024 14:50:37 -0400	[thread overview]
Message-ID: <Zw1n_bkugSs6oEI6@bfoster> (raw)
In-Reply-To: <20241011231241.GD21853@frogsfrogsfrogs>

On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > Ok, so we don't want geometry change transactions in the same CIL
> > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > changes how that issue manifests.
> > > > > > 
> > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > AG into the same checkpoint that adds the AG?
> > > > > 
> > > > > This is based on the feedback by Dave on the previous version:
> > > > > 
> > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > 
> > > > 
> > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > not sure I'd go straight to this change given the situation...
> > > > 
> > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > into the same checkpoint.  I'll try your exercisers to see if they
> > > > > can hit that.
> > > > > 
> > > > 
> > > > Ok, that explains things a bit. My observation is that the first 5
> > > > patches or so address the mount failure problem, but from there I'm not
> > > > reproducing much difference with or without the final patch.
> > > 
> > > Does this change to flush the log after committing the new sb fix the
> > > recovery problems on older kernels?  I /think/ that's the point of this
> > > patch.
> > > 
> > 
> > I don't follow.. growfs always forced the log via the sync transaction,
> > right? Or do you mean something else by "change to flush the log?"
> 
> I guess I was typing a bit too fast this morning -- "change to flush the
> log to disk before anyone else can get their hands on the superblock".
> You're right that xfs_log_sb and data-device growfs already do that.
> 
> That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> that this patch fixes, right?
> 

Ah, Ok.. that makes sense. Sounds like it could be..

> > I thought the main functional change of this patch was to hold the
> > superblock buffer locked across the force so nothing else can relog the
> > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > the theory is that prevents recovery from seeing updates to different
> > buffers that depend on the geometry update before the actual sb geometry
> > update is recovered (because the latter might have been relogged).
> > 
> > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > Either way I think the patch could use a more detailed commit log...
> 
> <nod> The commit message should point out that we're fixing a real bug
> here, which is that growfsrt doesn't force the log to disk when it
> commits the new rt geometry.
> 

Maybe even make it a separate patch to pull apart some of these cleanups
from fixes. I was also wondering if the whole locking change is the
moral equivalent of locking the sb across the growfs trans (i.e.
trans_getsb() + trans_bhold()), at which point maybe that would be a
reasonable incremental patch too.

> > > >                                                              Either way,
> > > > I see aborts and splats all over the place, which implies at minimum
> > > > this isn't the only issue here.
> > > 
> > > Ugh.  I've recently noticed the long soak logrecovery test vm have seen
> > > a slight tick up in failure rates -- random blocks that have clearly had
> > > garbage written to them such that recovery tries to read the block to
> > > recover a buffer log item and kaboom.  At this point it's unclear if
> > > that's a problem with xfs or somewhere else. :(
> > > 
> > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > particular patch has no straightforward way to test that it fixes
> > > > something and at the same time doesn't break anything else, and 3. we do
> > > 
> > > I'm curious, what might break?  Was that merely a general comment, or do
> > > you have something specific in mind?  (iows: do you see more string to
> > > pull? :))
> > > 
> > 
> > Just a general comment..
> > 
> > Something related that isn't totally clear to me is what about the
> > inverse shrink situation where dblocks is reduced. I.e., is there some
> > similar scenario where for example instead of the sb buffer being
> > relogged past some other buffer update that depends on it, some other
> > change is relogged past a sb update that invalidates/removes blocks
> > referenced by the relogged buffer..? If so, does that imply a shrink
> > should flush the log before the shrink transaction commits to ensure it
> > lands in a new checkpoint (as opposed to ensuring followon updates land
> > in a new checkpoint)..?
> 
> I think so.  Might we want to do that before and after to be careful?
> 

Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
had a log preflush to flush out already committed changes before the
grow, I don't think anything really prevents another "problematic"
transaction from committing after that preflush.

I dunno.. on one hand it does seem like an unlikely thing due to the
nature of needing space to be free in order to shrink in the first
place. But OTOH, if you have something like grow that is rare, not
performance sensitive, has a history of being poorly tested, and has
subtle ordering requirements that can be broken indirectly by changes
to other transactions, ISTM it could be a wise engineering decision to
simplify to the degree possible and find the most basic model that
enforces predictable ordering.

So for a hacky thought/example, suppose we defined a transaction mode
that basically implemented an exclusive checkpoint requirement (i.e.
this transaction owns an entire checkpoint, nothing else is allowed in
the CIL concurrently). Presumably that would ensure everything before
the grow would flush out to disk in one checkpoint, everything
concurrent would block on synchronous commit of the grow trans (before
new geometry is exposed), and then after that point everything pending
would drain into another checkpoint.

It kind of sounds like overkill, but really if it could be implemented
simply enough then we wouldn't have to think too hard about auditing all
other relog scenarios. I'd probably want to see at least some reproducer
for this sort of problem to prove the theory though too, even if it
required debug instrumentation or something. Hm?

> > Anyways, my point is just that if it were me I wouldn't get too deep
> > into this until some of the reproducible growfs recovery issues are at
> > least characterized and testing is more sorted out.
> > 
> > The context for testing is here [1]. The TLDR is basically that
> > Christoph has a targeted test that reproduces the initial mount failure
> > and I hacked up a more general test that also reproduces it and
> > additional growfs recovery problems. This test does seem to confirm that
> > the previous patches address the mount failure issue, but this patch
> > doesn't seem to prevent any of the other problems produced by the
> > generic test. That might just mean the test doesn't reproduce what this
> > fixes, but it's kind of hard to at least regression test something like
> > this when basic growfs crash-recovery seems pretty much broken.
> 
> Hmm, if you make a variant of that test which formats with an rt device
> and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> similar blowups?  Let's see what happens if I do that...
> 

Heh, sounds like so from your followup. Fun times.

I guess that test should probably work its way upstream. I made some
tweaks locally since last posting it to try to make it a little more
aggressive, but it didn't reproduce anything new, so I'm not sure how
much difference it really makes. Do we want a separate version like
yours for the rt case, or would you expect to cover both cases in a
single test?

Brian

> --D
> 
> > Brian
> > 
> > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > 
> > > > have at least one fairly straightforward growfs/recovery test in the
> > > > works that reliably explodes, personally I'd suggest to split this work
> > > > off into separate series.
> > > > 
> > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > fully cleaned up, and then leave the next two as part of a followon
> > > > series pending further investigation into these other issues. As part of
> > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > regress it." But once again, just my .02.
> > > 
> > > Yeah, it's too bad there's no good way to test recovery with older
> > > kernels either. :(
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > 
> > > 
> > 
> > 
> 


