All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: fstests generic/441 -- occasional bcachefs failure
Date: Thu, 2 Feb 2023 10:50:23 -0500	[thread overview]
Message-ID: <Y9vbv7TFyu+stccD@bfoster> (raw)
In-Reply-To: <Y9p4ihwzFs0OP6to@moria.home.lan>

On Wed, Feb 01, 2023 at 09:34:50AM -0500, Kent Overstreet wrote:
> On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > > Something else that occurred to me while looking further at this is we
> > > > can also avoid the error in this case fairly easily by bailing out of
> > > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > > flush -> sync meta -> flush log sequence that returns the first error
> > > > anyways. That would prevent marking the inode with a new sequence number
> > > > when I/Os are obviously failing. The caveat is that the test still
> > > > fails, now with a "Read-only file system" error instead of EIO, because
> > > > the filesystem is shutdown by the time the vfs write inode path actually
> > > > runs.
> > > 
> > > If some pages did write successfully we don't want to skip the rest of
> > > the fsync, though.
> > > 
> > 
> > What does it matter if the fsync() has already failed? ISTM this is
> > pretty standard error handling behavior across major fs', but it's not
> > clear to me if there's some bcachefs specific quirk that warrants
> > different handling..
> > 
> > FWIW, I think I've been able to fix this test with a couple small
> > tweaks:
> > 
> > 1. Change bch2_fsync() to return on first error.
> > 2. Introduce a bcachefs specific delay in the working -> error table
> > switchover.
> > 
> > I've not run a full regression test on the bcachefs change yet, but if
> > you're willing to take a patch along those lines I can do that and then
> > if it survives, follow up with an upstream fstests patch. That would
> > allow this test to pass fairly reliably and also no longer
> > unconditionally shutdown bcachefs.
> 
> If you link me your git repo I'll add it to the CI so all the automated
> tests run.
> 

I don't have a public repo atm but I've posted the patch if you have
somewhere to land it for CI testing..? It survived my regression tests
so far, FWIW. (I also had posted that random cleanup patch a bit ago if
you hadn't noticed..).

Is there a reporting dashboard or something available for the test
infrastruture for bcachefs?

> > 
> > > The test failing because the filesystem went read-only is because a
> > > btree node write fails - I'm assuming without that change the test just
> > > fails before any btree node writes are attempted.
> > > 
> > 
> > Not sure I parse the above. Generally speaking, the test fails because
> > after the last successful fsync, it immediately switches over to the
> > error dm table. The fs still has some reclaim/journal work to do after
> > this point, however, so this essentially always results in emergency
> > shutdown. Whether the test passes or fails at this point depends on a
> > race between this shutdown sequence in the background and an fsync call
> > successfuly committing a transaction (calling bch2_mark_inode()).
> 
> *nod* That's more accurate.
> 
> > Indeed, that makes sense. I wonder if it would make sense to issue
> > retries at some point?
> 
> Possibly. AFAIK, on typical block devices we shouldn't see write errors
> unless the device is really not working anymore - both spinning rust and
> SSDs will remap on write error, and on transport error IMO the block
> layer is the more correct place to be issuing retries, if so desired.
> 
> But we do have aspirations of running on raw flash in the
> not-too-distant future (ZNS SSDs), and I would expect we'll need retry
> logic at that point.
> 
> > I'm not sure a simple retry once sequence would help with this test, but
> > an infinite retry option (with some delay) probably would.
> 
> I'd think we'd want a timeout, not a specific number of retries to
> attempt (or perhaps both).
> 
> > OTOH, I wonder if a more elegant solution is to explicitly quiesce the
> > fs before the working -> error table switchover. That might typically be
> > done with a freeze/unfreeze cycle so the test can serialize against
> > whatever background work needs to complete before putting the block
> > device into error mode. However, I notice freeze isn't currently
> > supported on bcachefs. Any thoughts on freeze support?
> 
> Freeze definitely needs to happen. It's been _ages_ since I was looking
> at it so I couldn't say offhand where we'd need to start, but if you're
> interested I'd be happy to look at what it'd take.
> 

Yeah, that would be interesting. Thanks.

> > Ah.. so does "for simplicity" here essentially mean "for reuse of the
> > transaction code?"
> 
> Yes - bch2_btree_key_cache_journal_flush() -> btree_key_cache_flush_pos()
> -> bch2_trans_commit().
> 
> > 
> > >  - flush_seq: given a journal sequence number we previously got a
> > >    journal reservation on, request that it be written out (with a
> > >    flush/fua write, not all journal entries are written flush/fua).
> > > 
> > 
> > Thanks for this. Something that initially confused me wrt to the journal
> > is that I was seeing journal writes after btree node writes. The docs
> > mention the journal is primarily an optimization and ordering is not
> > important for crash consistency.
> 
> Wherever you read that must date to well before bcachefs was a thing -
> that was true in bcache when the journalling code was still new. In
> bcachefs ordering of updates is important and the journal is indeed our
> global time ordering.
> 

FWIW, I was reading through the "Journaling" and "Sequential
consistency" sections of the architecture documentation [1]. It's
possible I just misinterpreted some wording.

[1] https://bcachefs.org/Architecture/#Journaling

> > With that in mind, does that mean a
> > journal "pin" refers to entries held in the journal until written back
> > to the btree (as opposed to the journal pinning keys in memory)? IOW, it
> > doesn't really matter in which order that journal/btree writes occur,
> > but only that entries remain pinned in the journal until btree writes
> > complete..?
> 
> Correct.
> 

Ok, I think I get the general idea then. I'll have to dig more into it
at some point.

> > (I suspect this confused me because XFS uses the same pinning
> > terminology, but in that context it refers to metadata items being
> > pinned by the log, until the log item is flushed and thus unpins the
> > metadata item for final writeback. So ordering is a critical factor in
> > this context.).
> 
> I see -  in XFS metadata items can't be written until the relevant log
> entry is written? Yeah we never did it that way in bcachefs, blocking
> writeback which already blocks memory reclaim sounded like a bad idea :)
> 

This is all a completely different design and architecture of course.
XFS has its own buffer cache for metadata, with aging heuristics, and
shrinker hooks and whatnot with feedback into log and metadata flushing
to manage all this and keep things moving along.

I was mainly just trying to wrap my head around the disk write ordering
rules for bcachefs journaling wrt to fs consistency, using "traditional
logging" (i.e. XFS) as a point of reference.

Brian

> Originally, before we started updating pointers to btree nodes after
> every write, we'd note in every btree node write the newest journal
> sequence number it had updates for - then in recovery we'd blacklist the
> next N journal sequence numbers and if we ever saw a btree node entry
> for a blacklisted journal sequence number we just ignored it.
> 
> > #bcache or #bachefs?
> 
> #bcache :)
> 


  reply	other threads:[~2023-02-02 15:51 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-25 15:45 fstests generic/441 -- occasional bcachefs failure Brian Foster
2023-01-26 15:08 ` Kent Overstreet
2023-01-27  7:21   ` Kent Overstreet
2023-01-27 14:50   ` Brian Foster
2023-01-30 17:06     ` Kent Overstreet
2023-01-31 16:04       ` Brian Foster
2023-02-01 14:34         ` Kent Overstreet
2023-02-02 15:50           ` Brian Foster [this message]
2023-02-02 17:09             ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet
2023-02-02 20:04               ` Brian Foster
2023-02-02 22:39                 ` Kent Overstreet
2023-02-03  0:51               ` Dave Chinner
2023-02-04  0:35                 ` Kent Overstreet
2023-02-07  0:03                   ` Dave Chinner
2023-02-16 20:04                     ` Eric Wheeler
2023-02-20 22:19                       ` Dave Chinner
2023-02-20 23:23                         ` Kent Overstreet
2023-02-02 22:56         ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet
2023-02-04 21:33           ` Brian Foster
2023-02-04 22:15             ` Kent Overstreet
2023-02-06 15:33               ` Brian Foster
2023-02-06 22:18                 ` Kent Overstreet
2023-02-09 12:57                   ` Brian Foster
2023-02-09 14:58                     ` Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y9vbv7TFyu+stccD@bfoster \
    --to=bfoster@redhat.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-bcachefs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.