From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8CDB0C61DA4 for ; Thu, 2 Feb 2023 15:51:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232758AbjBBPvF (ORCPT ); Thu, 2 Feb 2023 10:51:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50508 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232294AbjBBPuz (ORCPT ); Thu, 2 Feb 2023 10:50:55 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DE5DF521E2 for ; Thu, 2 Feb 2023 07:49:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1675352952; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=sADM0rqnC7LMsqA9/q+dbLH7MsuRWjVin7ubshK1Xr0=; b=VsHkOY4gLF3lQday1JGSx6h6oXWS8oqb/4J2fzQWZ0A1g/mcYgY3/1aZ4vQW+KN0DTl206 syaZMuKwcF6lRFIHZJ3sKPwIOs2bCUSjjQ/Jb8rHluDa7XSIwDk9BIBpkaEgGV6GjDcCyM xjjUwOoZiZAou+Zy48nEUrQT7OJGdcQ= Received: from mail-qt1-f198.google.com (mail-qt1-f198.google.com [209.85.160.198]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-133-ixW73DxLOqK9-1CR2nx95A-1; Thu, 02 Feb 2023 10:49:11 -0500 X-MC-Unique: ixW73DxLOqK9-1CR2nx95A-1 Received: by mail-qt1-f198.google.com with SMTP id f22-20020a05622a1a1600b003b8674f2302so1141445qtb.7 for ; Thu, 02 Feb 2023 07:49:11 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=sADM0rqnC7LMsqA9/q+dbLH7MsuRWjVin7ubshK1Xr0=; b=QI1Uhz4U8Y5ZTdzF+PH9Fi2o5qbBfE1y2fOR5Z+l3K3IqC9xnAoJJ+t+5lEv+pCcMy pO31nxqaH+/dMF/dx1uUmokaBhjo6XotzounsFx57y1pAG5OI5tNWib02G89denOH364 rTgd4utddFSKTEJKuZODdKvVYWTokGxVORdlCLsVxz+uVqmzC0fhBtKKrb+H7t4ZXtuY bHGf1sffozzmR3oPvN9RO+0xs9EWJ/mACE9sg3GN7n9cvkoLGHaEy64hgIrt/Wn0nGrq 7VSaInsrzLEhZ7FLQZGacuZWFJCoZTrf8TaCxVskxOYVm0NVeYfX0TaHDad1BeFkJ2iB bP6g== X-Gm-Message-State: AO0yUKWNJRinZMFUBiNqQcr9adNjXasSo0hJF8VoFEidZGNIzumqHM9e LYpFGKzcBI+fgAFPfe7tGsfexgHTbuPz6p4rDmNrPiRqwlN5LR8wVQVZbBy60zy8if4rHljGi9I DTh1HkzqsX4U3QpIvkD6ZcVBWGNs= X-Received: by 2002:ac8:5b84:0:b0:3b8:461b:4134 with SMTP id a4-20020ac85b84000000b003b8461b4134mr12899385qta.68.1675352950792; Thu, 02 Feb 2023 07:49:10 -0800 (PST) X-Google-Smtp-Source: AK7set/ayuTLHon9CpuZZt0PPp1K9Kj5K9NXvefu2ZLTUdS59MWXNOwQTxO1OKxkPrt4l8PyspP7Ag== X-Received: by 2002:ac8:5b84:0:b0:3b8:461b:4134 with SMTP id a4-20020ac85b84000000b003b8461b4134mr12899336qta.68.1675352950384; Thu, 02 Feb 2023 07:49:10 -0800 (PST) Received: from bfoster (c-24-61-119-116.hsd1.ma.comcast.net. [24.61.119.116]) by smtp.gmail.com with ESMTPSA id op50-20020a05620a537200b0072ddf70791fsm36703qkn.122.2023.02.02.07.49.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 02 Feb 2023 07:49:09 -0800 (PST) Date: Thu, 2 Feb 2023 10:50:23 -0500 From: Brian Foster To: Kent Overstreet Cc: linux-bcachefs@vger.kernel.org Subject: Re: fstests generic/441 -- occasional bcachefs failure Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-bcachefs@vger.kernel.org On Wed, Feb 01, 2023 at 09:34:50AM -0500, Kent Overstreet wrote: > On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote: > > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote: > > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote: > > > > Something else that occurred to me while looking further at this is we > > > > can also avoid the error in this case fairly easily by bailing out of > > > > bch2_fsync() if page writeback fails, as opposed to the unconditional > > > > flush -> sync meta -> flush log sequence that returns the first error > > > > anyways. That would prevent marking the inode with a new sequence number > > > > when I/Os are obviously failing. The caveat is that the test still > > > > fails, now with a "Read-only file system" error instead of EIO, because > > > > the filesystem is shutdown by the time the vfs write inode path actually > > > > runs. > > > > > > If some pages did write successfully we don't want to skip the rest of > > > the fsync, though. > > > > > > > What does it matter if the fsync() has already failed? ISTM this is > > pretty standard error handling behavior across major fs', but it's not > > clear to me if there's some bcachefs specific quirk that warrants > > different handling.. > > > > FWIW, I think I've been able to fix this test with a couple small > > tweaks: > > > > 1. Change bch2_fsync() to return on first error. > > 2. Introduce a bcachefs specific delay in the working -> error table > > switchover. > > > > I've not run a full regression test on the bcachefs change yet, but if > > you're willing to take a patch along those lines I can do that and then > > if it survives, follow up with an upstream fstests patch. That would > > allow this test to pass fairly reliably and also no longer > > unconditionally shutdown bcachefs. > > If you link me your git repo I'll add it to the CI so all the automated > tests run. > I don't have a public repo atm but I've posted the patch if you have somewhere to land it for CI testing..? It survived my regression tests so far, FWIW. (I also had posted that random cleanup patch a bit ago if you hadn't noticed..). Is there a reporting dashboard or something available for the test infrastruture for bcachefs? > > > > > The test failing because the filesystem went read-only is because a > > > btree node write fails - I'm assuming without that change the test just > > > fails before any btree node writes are attempted. > > > > > > > Not sure I parse the above. Generally speaking, the test fails because > > after the last successful fsync, it immediately switches over to the > > error dm table. The fs still has some reclaim/journal work to do after > > this point, however, so this essentially always results in emergency > > shutdown. Whether the test passes or fails at this point depends on a > > race between this shutdown sequence in the background and an fsync call > > successfuly committing a transaction (calling bch2_mark_inode()). > > *nod* That's more accurate. > > > Indeed, that makes sense. I wonder if it would make sense to issue > > retries at some point? > > Possibly. AFAIK, on typical block devices we shouldn't see write errors > unless the device is really not working anymore - both spinning rust and > SSDs will remap on write error, and on transport error IMO the block > layer is the more correct place to be issuing retries, if so desired. > > But we do have aspirations of running on raw flash in the > not-too-distant future (ZNS SSDs), and I would expect we'll need retry > logic at that point. > > > I'm not sure a simple retry once sequence would help with this test, but > > an infinite retry option (with some delay) probably would. > > I'd think we'd want a timeout, not a specific number of retries to > attempt (or perhaps both). > > > OTOH, I wonder if a more elegant solution is to explicitly quiesce the > > fs before the working -> error table switchover. That might typically be > > done with a freeze/unfreeze cycle so the test can serialize against > > whatever background work needs to complete before putting the block > > device into error mode. However, I notice freeze isn't currently > > supported on bcachefs. Any thoughts on freeze support? > > Freeze definitely needs to happen. It's been _ages_ since I was looking > at it so I couldn't say offhand where we'd need to start, but if you're > interested I'd be happy to look at what it'd take. > Yeah, that would be interesting. Thanks. > > Ah.. so does "for simplicity" here essentially mean "for reuse of the > > transaction code?" > > Yes - bch2_btree_key_cache_journal_flush() -> btree_key_cache_flush_pos() > -> bch2_trans_commit(). > > > > > > - flush_seq: given a journal sequence number we previously got a > > > journal reservation on, request that it be written out (with a > > > flush/fua write, not all journal entries are written flush/fua). > > > > > > > Thanks for this. Something that initially confused me wrt to the journal > > is that I was seeing journal writes after btree node writes. The docs > > mention the journal is primarily an optimization and ordering is not > > important for crash consistency. > > Wherever you read that must date to well before bcachefs was a thing - > that was true in bcache when the journalling code was still new. In > bcachefs ordering of updates is important and the journal is indeed our > global time ordering. > FWIW, I was reading through the "Journaling" and "Sequential consistency" sections of the architecture documentation [1]. It's possible I just misinterpreted some wording. [1] https://bcachefs.org/Architecture/#Journaling > > With that in mind, does that mean a > > journal "pin" refers to entries held in the journal until written back > > to the btree (as opposed to the journal pinning keys in memory)? IOW, it > > doesn't really matter in which order that journal/btree writes occur, > > but only that entries remain pinned in the journal until btree writes > > complete..? > > Correct. > Ok, I think I get the general idea then. I'll have to dig more into it at some point. > > (I suspect this confused me because XFS uses the same pinning > > terminology, but in that context it refers to metadata items being > > pinned by the log, until the log item is flushed and thus unpins the > > metadata item for final writeback. So ordering is a critical factor in > > this context.). > > I see - in XFS metadata items can't be written until the relevant log > entry is written? Yeah we never did it that way in bcachefs, blocking > writeback which already blocks memory reclaim sounded like a bad idea :) > This is all a completely different design and architecture of course. XFS has its own buffer cache for metadata, with aging heuristics, and shrinker hooks and whatnot with feedback into log and metadata flushing to manage all this and keep things moving along. I was mainly just trying to wrap my head around the disk write ordering rules for bcachefs journaling wrt to fs consistency, using "traditional logging" (i.e. XFS) as a point of reference. Brian > Originally, before we started updating pointers to btree nodes after > every write, we'd note in every btree node write the newest journal > sequence number it had updates for - then in recovery we'd blacklist the > next N journal sequence numbers and if we ever saw a btree node entry > for a blacklisted journal sequence number we just ignored it. > > > #bcache or #bachefs? > > #bcache :) >