* bcachefs replica garbage collection
@ 2023-05-02 14:58 Brian Foster
2023-05-02 23:13 ` Kent Overstreet
0 siblings, 1 reply; 4+ messages in thread
From: Brian Foster @ 2023-05-02 14:58 UTC
To: linux-bcachefs; +Cc: Kent Overstreet
Hi Kent,
I'm tracking an issue that seems to be related to bch_replica garbage
collection, where fsck reports the following:
superblock not marked as containing replicas journal: 1/1 [1], not fixing
... and the fs doesn't want to mount (until repaired).
The short of it looks like there's a window where we write out the sb
replica info with no BCH_DATA_journal usage where a shutdown/recovery
doesn't otherwise expect it. I've not fully root caused the problem yet,
but when looking into the code it looks like we have a couple different
mechanisms for clearing out unused replicas...
The "old" _start/_end mechanism handles journal data only and is invoked
after pin flush. I presume the start/end calls wrap marking journal dev
replicas to ensure usage accounting is up to date, but I'm not totally
sure. Is that the case? If so, any reason the marking via
journal_write_done() isn't sufficient?
The "new" bch2_replicas_gc2() mechanism is invoked from various other
places and always propagates journal data replica entries. Is that
filter on journal data by design, or temporary because journal replica
gc is handled separately as above? If the latter, is there some explicit
reason the old mechanism was left around to handle journal data like
this?
As noted above, I still need to track down the root cause of the
unexpectedly absent journal data replica sb field. It may be agnostic to
either gc mechanism, but I'm trying to understand the higher level
situation here a bit better before getting too deep into the weeds...
thanks.
Brian
* Re: bcachefs replica garbage collection
2023-05-02 14:58 bcachefs replica garbage collection Brian Foster
@ 2023-05-02 23:13 ` Kent Overstreet
2023-05-03 14:05 ` Brian Foster
From: Kent Overstreet @ 2023-05-02 23:13 UTC
To: Brian Foster; +Cc: linux-bcachefs
On Tue, May 02, 2023 at 10:58:50AM -0400, Brian Foster wrote:
> Hi Kent,
>
> I'm tracking an issue that seems to be related to bch_replica garbage
> collection, where fsck reports the following:
>
> superblock not marked as containing replicas journal: 1/1 [1], not fixing
>
> ... and the fs doesn't want to mount (until repaired).
>
> The short of it looks like there's a window where we write out the sb
> replica info with no BCH_DATA_journal usage where a shutdown/recovery
> doesn't otherwise expect it. I've not fully root caused the problem yet,
> but when looking into the code it looks like we have a couple different
> mechanisms for clearing out unused replicas...
>
> The "old" _start/_end mechanism handles journal data only and is invoked
> after pin flush. I presume the start/end calls wrap marking journal dev
> replicas to ensure usage accounting is up to date, but I'm not totally
> sure. Is that the case?
It's not anything to do with usage accounting - we don't really do usage
accounting for the journal, not in a way that would work for replicas
tracking.
So the old style replicas gc mechanism that
bch2_journal_flush_device_pins() uses is your traditional mark and sweep
gc - clear everything, then walk and remark everything that's currently
in use.
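[Editor's note: a toy sketch of the mark-and-sweep pattern Kent describes, under assumed simplified structures. None of the names below (`replicas_table`, `gc_start`, `gc_mark`, `gc_end`, the `DATA_*` enum) are the actual bcachefs identifiers; the real `bch2_replicas_gc_(start|end)` operates on full replicas entries under locks, not a flat per-type array.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model: one presence/mark flag per data type. The gc clears the
 * marks for every type in the mask, live state re-marks what is still
 * in use, and "end" drops anything that was not re-marked. */
enum data_type { DATA_journal, DATA_btree, DATA_user, DATA_NR };

struct replicas_table {
	bool present[DATA_NR];	/* entry exists in the table */
	bool marked[DATA_NR];	/* re-marked during this gc pass */
};

/* "gc start": clear marks for the types being collected; types outside
 * the mask are carried over untouched */
static void gc_start(struct replicas_table *t, unsigned typemask)
{
	for (int i = 0; i < DATA_NR; i++)
		t->marked[i] = (typemask & (1U << i)) ? false : t->present[i];
}

/* "mark": called while walking live state (e.g. journal pins still held) */
static void gc_mark(struct replicas_table *t, enum data_type type)
{
	t->marked[type] = true;
}

/* "gc end": whatever was not re-marked is garbage and gets swept */
static void gc_end(struct replicas_table *t)
{
	memcpy(t->present, t->marked, sizeof(t->present));
}
```

The key property is that an entry survives only if something re-marked it between start and end, which is why the window between clearing and re-marking matters for crash consistency.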
> If so, any reason the marking via journal_write_done() isn't
> sufficient?
This might be the bug you're looking for - the journal write path should
be doing the marking before the write, not after.
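[Editor's note: a minimal sketch of the ordering bug being pointed at, with entirely hypothetical function names; it only models the event sequence, not any real journal I/O.]

```c
#include <assert.h>
#include <string.h>

/* Record the order of events so the two orderings can be compared. */
static char ev_log[64];

static void mark_sb_replicas(void)     { strcat(ev_log, "mark;"); }
static void submit_journal_write(void) { strcat(ev_log, "write;"); }

/* Buggy ordering: mark in the write-completion path. A crash between
 * the two steps leaves journal data on disk with no sb replicas entry,
 * which is the state fsck complains about. */
static void journal_write_then_mark(void)
{
	submit_journal_write();
	mark_sb_replicas();
}

/* Fixed ordering: the sb replicas entry is in place before any journal
 * data can land on the device. */
static void journal_mark_then_write(void)
{
	mark_sb_replicas();
	submit_journal_write();
}
```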
> The "new" bch2_replicas_gc2() mechanism is invoked from various other
> places and always propagates journal data replica entries. Is that
> filter on journal data by design, or temporary because journal replica
> gc is handled separately as above? If the latter, is there some explicit
> reason the old mechanism was left around to handle journal data like
> this?
>
> As noted above, I still need to track down the root cause of the
> unexpectedly absent journal data replica sb field. It may be agnostic to
> either gc mechanism, but I'm trying to understand the higher level
> situation here a bit better before getting too deep into the weeds...
> thanks.
Applying this...
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Tue, 2 May 2023 18:22:12 -0400
Subject: [PATCH] bcachefs: Improved comment for bch2_replicas_gc2()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
index 8935ff5899..8ae50dfd8c 100644
--- a/fs/bcachefs/replicas.c
+++ b/fs/bcachefs/replicas.c
@@ -550,8 +550,14 @@ int bch2_replicas_gc_start(struct bch_fs *c, unsigned typemask)
return 0;
}
-/* New much simpler mechanism for clearing out unneeded replicas entries: */
-
+/*
+ * New much simpler mechanism for clearing out unneeded replicas entries - drop
+ * replicas entries that have 0 sectors used.
+ *
+ * However, we don't track sector counts for journal usage, so this doesn't drop
+ * any BCH_DATA_journal entries; the old bch2_replicas_gc_(start|end) mechanism
+ * is retained for that.
+ */
int bch2_replicas_gc2(struct bch_fs *c)
{
struct bch_replicas_cpu new = { 0 };
* Re: bcachefs replica garbage collection
2023-05-02 23:13 ` Kent Overstreet
@ 2023-05-03 14:05 ` Brian Foster
2023-05-04 14:11 ` Brian Foster
From: Brian Foster @ 2023-05-03 14:05 UTC
To: Kent Overstreet; +Cc: linux-bcachefs
On Tue, May 02, 2023 at 07:13:54PM -0400, Kent Overstreet wrote:
> On Tue, May 02, 2023 at 10:58:50AM -0400, Brian Foster wrote:
> > Hi Kent,
> >
> > I'm tracking an issue that seems to be related to bch_replica garbage
> > collection, where fsck reports the following:
> >
> > superblock not marked as containing replicas journal: 1/1 [1], not fixing
> >
> > ... and the fs doesn't want to mount (until repaired).
> >
> > The short of it looks like there's a window where we write out the sb
> > replica info with no BCH_DATA_journal usage where a shutdown/recovery
> > doesn't otherwise expect it. I've not fully root caused the problem yet,
> > but when looking into the code it looks like we have a couple different
> > mechanisms for clearing out unused replicas...
> >
> > The "old" _start/_end mechanism handles journal data only and is invoked
> > after pin flush. I presume the start/end calls wrap marking journal dev
> > replicas to ensure usage accounting is up to date, but I'm not totally
> > sure. Is that the case?
>
> It's not anything to do with usage accounting - we don't really do usage
> accounting for the journal, not in a way that would work for replicas
> tracking.
>
> So the old style replicas gc mechanism that
> bch2_journal_flush_device_pins() uses is your traditional mark and sweep
> gc - clear everything, then walk and remark everything that's currently
> in use.
>
Yeah, I think I grokked the fundamentals from the implementation. It
looks like it basically clears out the journal replica entries from the
_gc container, repopulates based on c->usage->replicas, and then updates
the sb/primary tables based on the result.
> > If so, any reason the marking via journal_write_done() isn't
> > sufficient?
>
> This might be the bug you're looking for - the journal write path should
> be doing the marking before the write, not after.
>
Quite possible. That certainly could be a vector to this transient
state. I'll look into it..
I am still a bit unclear on the marks in the flush path
(bch2_journal_flush_device_pins()), though. Am I following correctly
that this code marks the devices still backing journal entries? If so
and given the submit time marking, is that basically gc algorithmic
boilerplate where in context we'd expect all of those devices to already
be marked, or is there some particular situation where you'd expect that
path to mark a device that isn't already marked with journal data?
> > The "new" bch2_replicas_gc2() mechanism is invoked from various other
> > places and always propagates journal data replica entries. Is that
> > filter on journal data by design, or temporary because journal replica
> > gc is handled separately as above? If the latter, is there some explicit
> > reason the old mechanism was left around to handle journal data like
> > this?
> >
> > As noted above, I still need to track down the root cause of the
> > unexpectedly absent journal data replica sb field. It may be agnostic to
> > either gc mechanism, but I'm trying to understand the higher level
> > situation here a bit better before getting too deep into the weeds...
> > thanks.
>
> Applying this...
>
ACK.. that's useful context. Thanks.
Brian
> From: Kent Overstreet <kent.overstreet@linux.dev>
> Date: Tue, 2 May 2023 18:22:12 -0400
> Subject: [PATCH] bcachefs: Improved comment for bch2_replicas_gc2()
>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
>
> diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
> index 8935ff5899..8ae50dfd8c 100644
> --- a/fs/bcachefs/replicas.c
> +++ b/fs/bcachefs/replicas.c
> @@ -550,8 +550,14 @@ int bch2_replicas_gc_start(struct bch_fs *c, unsigned typemask)
> return 0;
> }
>
> -/* New much simpler mechanism for clearing out unneeded replicas entries: */
> -
> +/*
> + * New much simpler mechanism for clearing out unneeded replicas entries - drop
> + * replicas entries that have 0 sectors used.
> + *
> + * However, we don't track sector counts for journal usage, so this doesn't drop
> + * any BCH_DATA_journal entries; the old bch2_replicas_gc_(start|end) mechanism
> + * is retained for that.
> + */
> int bch2_replicas_gc2(struct bch_fs *c)
> {
> struct bch_replicas_cpu new = { 0 };
>
* Re: bcachefs replica garbage collection
2023-05-03 14:05 ` Brian Foster
@ 2023-05-04 14:11 ` Brian Foster
From: Brian Foster @ 2023-05-04 14:11 UTC
To: Kent Overstreet; +Cc: linux-bcachefs
On Wed, May 03, 2023 at 10:05:15AM -0400, Brian Foster wrote:
> On Tue, May 02, 2023 at 07:13:54PM -0400, Kent Overstreet wrote:
> > On Tue, May 02, 2023 at 10:58:50AM -0400, Brian Foster wrote:
> > > Hi Kent,
> > >
> > > I'm tracking an issue that seems to be related to bch_replica garbage
> > > collection, where fsck reports the following:
> > >
> > > superblock not marked as containing replicas journal: 1/1 [1], not fixing
> > >
> > > ... and the fs doesn't want to mount (until repaired).
> > >
> > > The short of it looks like there's a window where we write out the sb
> > > replica info with no BCH_DATA_journal usage where a shutdown/recovery
> > > doesn't otherwise expect it. I've not fully root caused the problem yet,
> > > but when looking into the code it looks like we have a couple different
> > > mechanisms for clearing out unused replicas...
> > >
> > > The "old" _start/_end mechanism handles journal data only and is invoked
> > > after pin flush. I presume the start/end calls wrap marking journal dev
> > > replicas to ensure usage accounting is up to date, but I'm not totally
> > > sure. Is that the case?
> >
> > It's not anything to do with usage accounting - we don't really do usage
> > accounting for the journal, not in a way that would work for replicas
> > tracking.
> >
> > So the old style replicas gc mechanism that
> > bch2_journal_flush_device_pins() uses is your traditional mark and sweep
> > gc - clear everything, then walk and remark everything that's currently
> > in use.
> >
>
> Yeah, I think I grokked the fundamentals from the implementation. It
> looks like it basically clears out the journal replica entries from the
> _gc container, repopulates based on c->usage->replicas, and then updates
> the sb/primary tables based on the result.
>
> > > If so, any reason the marking via journal_write_done() isn't
> > > sufficient?
> >
> > This might be the bug you're looking for - the journal write path should
> > be doing the marking before the write, not after.
> >
>
> Quite possible. That certainly could be a vector to this transient
> state. I'll look into it..
>
Just a note on this bit.. I made a change to mark journal replicas before
the journal write, but still managed to hit this problem. It was
slightly harder to reproduce, so I suspect there might just be a couple
different instances of the problem.
AFAICT from looking at the mount time recovery code, it pretty much
expects to see at least one device marked for journal data so long as
the filesystem is not clean. It looks like clean state is removed at
mount time and restored during clean unmount, so that means we probably
don't ever want to have this "no devices marked with journal data" state
at runtime.
However, this is exactly what I see from the journal gc mechanism
invoked during device evacuate or remove. On an otherwise idle fs, the
gc sees no journal usage and clears every replica entry for journal data
on the fs. The next journal writes fairly quickly restore new entries
for unrelated replicas, so this appears to be a small window of
opportunity. But indeed, a quick hack to instrument this path and shut
down the fs during that window makes it trivial to reproduce the
subsequent mount failure.
Given all of the above context and that this only seems to be invoked
via data job paths, I think this variant can be fixed by simply
tightening up the filtering in bch2_replicas_gc_start() to also consider
the target device. IOW, if the device flush only sweeps out replica
entries for which the target device is a participant, we won't ever have
this transient (!clean && no journal replicas) state that leads to mount
failure in the event of a crash.
So unless that all sounds like a Bad Idea(tm), I'm planning to give
something like that some testing..
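[Editor's note: a toy sketch of the filtering Brian proposes, using made-up structures and names (`replicas_entry`, `gc_should_keep`); the real change would live in `bch2_replicas_gc_start()`'s entry walk.]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified replicas entry: a data type plus the devices it lives on. */
struct replicas_entry {
	unsigned	data_type;	/* index into a typemask */
	unsigned	nr_devs;
	unsigned	devs[4];
};

static bool entry_has_dev(const struct replicas_entry *e, unsigned dev)
{
	for (unsigned i = 0; i < e->nr_devs; i++)
		if (e->devs[i] == dev)
			return true;
	return false;
}

/* The proposed tightening: during a device evacuate/remove, an entry is
 * only a sweep candidate if it is in the gc typemask AND references the
 * target device. Journal entries on other devices survive the gc, so an
 * idle fs never transiently has zero journal replicas in the sb. */
static bool gc_should_keep(const struct replicas_entry *e,
			   unsigned typemask, unsigned target_dev)
{
	return !(typemask & (1U << e->data_type)) ||
	       !entry_has_dev(e, target_dev);
}
```

With this filter, the (!clean && no journal replicas) state can't be produced by the flush path, because only entries involving the device being flushed are ever cleared.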
Brian
> I am still a bit unclear on the marks in the flush path
> (bch2_journal_flush_device_pins()), though. Am I following correctly
> that this code marks the devices still backing journal entries? If so
> and given the submit time marking, is that basically gc algorithmic
> boilerplate where in context we'd expect all of those devices to already
> be marked, or is there some particular situation where you'd expect that
> path to mark a device that isn't already marked with journal data?
>
> > > The "new" bch2_replicas_gc2() mechanism is invoked from various other
> > > places and always propagates journal data replica entries. Is that
> > > filter on journal data by design, or temporary because journal replica
> > > gc is handled separately as above? If the latter, is there some explicit
> > > reason the old mechanism was left around to handle journal data like
> > > this?
> > >
> > > As noted above, I still need to track down the root cause of the
> > > unexpectedly absent journal data replica sb field. It may be agnostic to
> > > either gc mechanism, but I'm trying to understand the higher level
> > > situation here a bit better before getting too deep into the weeds...
> > > thanks.
> >
> > Applying this...
> >
>
> ACK.. that's useful context. Thanks.
>
> Brian
>
> > From: Kent Overstreet <kent.overstreet@linux.dev>
> > Date: Tue, 2 May 2023 18:22:12 -0400
> > Subject: [PATCH] bcachefs: Improved comment for bch2_replicas_gc2()
> >
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> >
> > diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
> > index 8935ff5899..8ae50dfd8c 100644
> > --- a/fs/bcachefs/replicas.c
> > +++ b/fs/bcachefs/replicas.c
> > @@ -550,8 +550,14 @@ int bch2_replicas_gc_start(struct bch_fs *c, unsigned typemask)
> > return 0;
> > }
> >
> > -/* New much simpler mechanism for clearing out unneeded replicas entries: */
> > -
> > +/*
> > + * New much simpler mechanism for clearing out unneeded replicas entries - drop
> > + * replicas entries that have 0 sectors used.
> > + *
> > + * However, we don't track sector counts for journal usage, so this doesn't drop
> > + * any BCH_DATA_journal entries; the old bch2_replicas_gc_(start|end) mechanism
> > + * is retained for that.
> > + */
> > int bch2_replicas_gc2(struct bch_fs *c)
> > {
> > struct bch_replicas_cpu new = { 0 };
> >