* bcachefs replica garbage collection
From: Brian Foster @ 2023-05-02 14:58 UTC
To: linux-bcachefs; +Cc: Kent Overstreet

Hi Kent,

I'm tracking an issue that seems to be related to bch_replica garbage
collection, where fsck reports the following:

  superblock not marked as containing replicas journal: 1/1 [1], not fixing

... and the fs doesn't want to mount (until repaired).

The short of it is that there looks to be a window where we write out the sb
replica info with no BCH_DATA_journal usage, in a state where a
shutdown/recovery doesn't otherwise expect it. I've not fully root caused
the problem yet, but when looking into the code it looks like we have a
couple of different mechanisms for clearing out unused replicas...

The "old" _start/_end mechanism handles journal data only and is invoked
after pin flush. I presume the start/end calls wrap marking journal dev
replicas to ensure usage accounting is up to date, but I'm not totally
sure. Is that the case? If so, any reason the marking via
journal_write_done() isn't sufficient?

The "new" bch2_replicas_gc2() mechanism is invoked from various other
places and always propagates journal data replica entries. Is that filter
on journal data by design, or temporary because journal replica gc is
handled separately as above? If the latter, is there some explicit reason
the old mechanism was left around to handle journal data like this?

As noted above, I still need to track down the root cause of the
unexpectedly absent journal data replica sb field. It may be agnostic to
either gc mechanism, but I'm trying to understand the higher level
situation here a bit better before getting too deep into the weeds...
thanks.

Brian
* Re: bcachefs replica garbage collection
From: Kent Overstreet @ 2023-05-02 23:13 UTC
To: Brian Foster; +Cc: linux-bcachefs

On Tue, May 02, 2023 at 10:58:50AM -0400, Brian Foster wrote:
> Hi Kent,
>
> I'm tracking an issue that seems to be related to bch_replica garbage
> collection, where fsck reports the following:
>
>   superblock not marked as containing replicas journal: 1/1 [1], not fixing
>
> ... and the fs doesn't want to mount (until repaired).
>
> The short of it looks like there's a window where we write out the sb
> replica info with no BCH_DATA_journal usage where a shutdown/recovery
> doesn't otherwise expect it. I've not fully root caused the problem yet,
> but when looking into the code it looks like we have a couple different
> mechanisms for clearing out unused replicas...
>
> The "old" _start/_end mechanism handles journal data only and is invoked
> after pin flush. I presume the start/end calls wrap marking journal dev
> replicas to ensure usage accounting is up to date, but I'm not totally
> sure. Is that the case?

It's not anything to do with usage accounting - we don't really do usage
accounting for the journal, not in a way that would work for replicas
tracking.

So the old style replicas gc mechanism that
bch2_journal_flush_device_pins() uses is your traditional mark and sweep
gc - clear everything, then walk and remark everything that's currently
in use.

> If so, any reason the marking via journal_write_done() isn't
> sufficient?

This might be the bug you're looking for - the journal write path should
be doing the marking before the write, not after.

> The "new" bch2_replicas_gc2() mechanism is invoked from various other
> places and always propagates journal data replica entries. Is that
> filter on journal data by design, or temporary because journal replica
> gc is handled separately as above? If the latter, is there some explicit
> reason the old mechanism was left around to handle journal data like
> this?
>
> As noted above, I still need to track down the root cause of the
> unexpectedly absent journal data replica sb field. It may be agnostic to
> either gc mechanism, but I'm trying to understand the higher level
> situation here a bit better before getting too deep into the weeds...
> thanks.

Applying this...

From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Tue, 2 May 2023 18:22:12 -0400
Subject: [PATCH] bcachefs: Improved comment for bch2_replicas_gc2()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
index 8935ff5899..8ae50dfd8c 100644
--- a/fs/bcachefs/replicas.c
+++ b/fs/bcachefs/replicas.c
@@ -550,8 +550,14 @@ int bch2_replicas_gc_start(struct bch_fs *c, unsigned typemask)
 	return 0;
 }
 
-/* New much simpler mechanism for clearing out unneeded replicas entries: */
-
+/*
+ * New much simpler mechanism for clearing out unneeded replicas entries - drop
+ * replicas entries that have 0 sectors used.
+ *
+ * However, we don't track sector counts for journal usage, so this doesn't drop
+ * any BCH_DATA_journal entries; the old bch2_replicas_gc_(start|end) mechanism
+ * is retained for that.
+ */
 int bch2_replicas_gc2(struct bch_fs *c)
 {
 	struct bch_replicas_cpu new = { 0 };
* Re: bcachefs replica garbage collection
From: Brian Foster @ 2023-05-03 14:05 UTC
To: Kent Overstreet; +Cc: linux-bcachefs

On Tue, May 02, 2023 at 07:13:54PM -0400, Kent Overstreet wrote:
> On Tue, May 02, 2023 at 10:58:50AM -0400, Brian Foster wrote:
> > Hi Kent,
> >
> > I'm tracking an issue that seems to be related to bch_replica garbage
> > collection, where fsck reports the following:
> >
> >   superblock not marked as containing replicas journal: 1/1 [1], not fixing
> >
> > ... and the fs doesn't want to mount (until repaired).
> >
> > The short of it looks like there's a window where we write out the sb
> > replica info with no BCH_DATA_journal usage where a shutdown/recovery
> > doesn't otherwise expect it. I've not fully root caused the problem yet,
> > but when looking into the code it looks like we have a couple different
> > mechanisms for clearing out unused replicas...
> >
> > The "old" _start/_end mechanism handles journal data only and is invoked
> > after pin flush. I presume the start/end calls wrap marking journal dev
> > replicas to ensure usage accounting is up to date, but I'm not totally
> > sure. Is that the case?
>
> It's not anything to do with usage accounting - we don't really do usage
> accounting for the journal, not in a way that would work for replicas
> tracking.
>
> So the old style replicas gc mechanism that
> bch2_journal_flush_device_pins() uses is your traditional mark and sweep
> gc - clear everything, then walk and remark everything that's currently
> in use.
>

Yeah, I think I grokked the fundamentals from the implementation. It looks
like it basically clears out the journal replica entries from the _gc
container, repopulates based on c->usage->replicas, and then updates the
sb/primary tables based on the result.
> > If so, any reason the marking via journal_write_done() isn't
> > sufficient?
>
> This might be the bug you're looking for - the journal write path should
> be doing the marking before the write, not after.
>

Quite possible. That certainly could be a vector to this transient state.
I'll look into it..

I am still a bit unclear on the marks in the flush path
(bch2_journal_flush_device_pins()), though. Am I following correctly that
this code marks the devices still backing journal entries? If so and given
the submit time marking, is that basically gc algorithmic boilerplate where
in context we'd expect all of those devices to already be marked, or is
there some particular situation where you'd expect that path to mark a
device that isn't already marked with journal data?

> > The "new" bch2_replicas_gc2() mechanism is invoked from various other
> > places and always propagates journal data replica entries. Is that
> > filter on journal data by design, or temporary because journal replica
> > gc is handled separately as above? If the latter, is there some explicit
> > reason the old mechanism was left around to handle journal data like
> > this?
> >
> > As noted above, I still need to track down the root cause of the
> > unexpectedly absent journal data replica sb field. It may be agnostic to
> > either gc mechanism, but I'm trying to understand the higher level
> > situation here a bit better before getting too deep into the weeds...
> > thanks.
>
> Applying this...
>

ACK.. that's useful context. Thanks.

Brian

> From: Kent Overstreet <kent.overstreet@linux.dev>
> Date: Tue, 2 May 2023 18:22:12 -0400
> Subject: [PATCH] bcachefs: Improved comment for bch2_replicas_gc2()
>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
>
> diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
> index 8935ff5899..8ae50dfd8c 100644
> --- a/fs/bcachefs/replicas.c
> +++ b/fs/bcachefs/replicas.c
> @@ -550,8 +550,14 @@ int bch2_replicas_gc_start(struct bch_fs *c, unsigned typemask)
>  	return 0;
>  }
>  
> -/* New much simpler mechanism for clearing out unneeded replicas entries: */
> -
> +/*
> + * New much simpler mechanism for clearing out unneeded replicas entries - drop
> + * replicas entries that have 0 sectors used.
> + *
> + * However, we don't track sector counts for journal usage, so this doesn't drop
> + * any BCH_DATA_journal entries; the old bch2_replicas_gc_(start|end) mechanism
> + * is retained for that.
> + */
>  int bch2_replicas_gc2(struct bch_fs *c)
>  {
>  	struct bch_replicas_cpu new = { 0 };
* Re: bcachefs replica garbage collection
From: Brian Foster @ 2023-05-04 14:11 UTC
To: Kent Overstreet; +Cc: linux-bcachefs

On Wed, May 03, 2023 at 10:05:15AM -0400, Brian Foster wrote:
> On Tue, May 02, 2023 at 07:13:54PM -0400, Kent Overstreet wrote:
> > On Tue, May 02, 2023 at 10:58:50AM -0400, Brian Foster wrote:
> > > Hi Kent,
> > >
> > > I'm tracking an issue that seems to be related to bch_replica garbage
> > > collection, where fsck reports the following:
> > >
> > >   superblock not marked as containing replicas journal: 1/1 [1], not fixing
> > >
> > > ... and the fs doesn't want to mount (until repaired).
> > >
> > > The short of it looks like there's a window where we write out the sb
> > > replica info with no BCH_DATA_journal usage where a shutdown/recovery
> > > doesn't otherwise expect it. I've not fully root caused the problem yet,
> > > but when looking into the code it looks like we have a couple different
> > > mechanisms for clearing out unused replicas...
> > >
> > > The "old" _start/_end mechanism handles journal data only and is invoked
> > > after pin flush. I presume the start/end calls wrap marking journal dev
> > > replicas to ensure usage accounting is up to date, but I'm not totally
> > > sure. Is that the case?
> >
> > It's not anything to do with usage accounting - we don't really do usage
> > accounting for the journal, not in a way that would work for replicas
> > tracking.
> >
> > So the old style replicas gc mechanism that
> > bch2_journal_flush_device_pins() uses is your traditional mark and sweep
> > gc - clear everything, then walk and remark everything that's currently
> > in use.
> >
>
> Yeah, I think I grokked the fundamentals from the implementation. It
> looks like it basically clears out the journal replica entries from the
> _gc container, repopulates based on c->usage->replicas, and then updates
> the sb/primary tables based on the result.
>
> > > If so, any reason the marking via journal_write_done() isn't
> > > sufficient?
> >
> > This might be the bug you're looking for - the journal write path should
> > be doing the marking before the write, not after.
> >
>
> Quite possible. That certainly could be a vector to this transient
> state. I'll look into it..
>

Just a note on this bit.. I made a change to mark for the journal before
the journal write, but still managed to hit this problem. It was slightly
harder to reproduce, so I suspect there might just be a couple of different
instances of the problem.

AFAICT from looking at the mount time recovery code, it pretty much expects
to see at least one device marked for journal data so long as the
filesystem is not clean. It looks like clean state is removed at mount time
and restored during clean unmount, so that means we probably don't ever
want to have this "no devices marked with journal data" state at runtime.

However, this is exactly what I see from the journal gc mechanism invoked
during device evacuate or remove. On an otherwise idle fs, the gc sees no
journal usage and clears every replica entry for journal data on the fs.
The next journal writes fairly quickly restore new entries for unrelated
replicas, so this appears to be a small window of opportunity. But indeed,
a quick hack to instrument this path and shut down the fs during that
window makes it trivial to reproduce the subsequent mount failure.

Given all of the above context, and that this only seems to be invoked via
data job paths, I think this variant can be fixed by simply tightening up
the filtering in bch2_replicas_gc_start() to also consider the target
device. IOW, if the device flush only sweeps out replica entries for which
the target device is a participant, we won't ever have this transient
(!clean && no journal replicas) state that leads to mount failure in the
event of a crash.

So unless that all sounds like a Bad Idea(tm), I'm planning to give
something like that some testing..

Brian

> I am still a bit unclear on the marks in the flush path
> (bch2_journal_flush_device_pins()), though. Am I following correctly
> that this code marks the devices still backing journal entries? If so
> and given the submit time marking, is that basically gc algorithmic
> boilerplate where in context we'd expect all of those devices to already
> be marked, or is there some particular situation where you'd expect that
> path to mark a device that isn't already marked with journal data?
>
> > > The "new" bch2_replicas_gc2() mechanism is invoked from various other
> > > places and always propagates journal data replica entries. Is that
> > > filter on journal data by design, or temporary because journal replica
> > > gc is handled separately as above? If the latter, is there some explicit
> > > reason the old mechanism was left around to handle journal data like
> > > this?
> > >
> > > As noted above, I still need to track down the root cause of the
> > > unexpectedly absent journal data replica sb field. It may be agnostic to
> > > either gc mechanism, but I'm trying to understand the higher level
> > > situation here a bit better before getting too deep into the weeds...
> > > thanks.
> >
> > Applying this...
> >
>
> ACK.. that's useful context. Thanks.
>
> Brian
>
> > From: Kent Overstreet <kent.overstreet@linux.dev>
> > Date: Tue, 2 May 2023 18:22:12 -0400
> > Subject: [PATCH] bcachefs: Improved comment for bch2_replicas_gc2()
> >
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> >
> > diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
> > index 8935ff5899..8ae50dfd8c 100644
> > --- a/fs/bcachefs/replicas.c
> > +++ b/fs/bcachefs/replicas.c
> > @@ -550,8 +550,14 @@ int bch2_replicas_gc_start(struct bch_fs *c, unsigned typemask)
> >  	return 0;
> >  }
> >  
> > -/* New much simpler mechanism for clearing out unneeded replicas entries: */
> > -
> > +/*
> > + * New much simpler mechanism for clearing out unneeded replicas entries - drop
> > + * replicas entries that have 0 sectors used.
> > + *
> > + * However, we don't track sector counts for journal usage, so this doesn't drop
> > + * any BCH_DATA_journal entries; the old bch2_replicas_gc_(start|end) mechanism
> > + * is retained for that.
> > + */
> >  int bch2_replicas_gc2(struct bch_fs *c)
> >  {
> >  	struct bch_replicas_cpu new = { 0 };