From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 4 May 2023 10:11:37 -0400
From: Brian Foster
To: Kent Overstreet
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: bcachefs replica garbage collection

On Wed, May 03, 2023 at 10:05:15AM -0400, Brian Foster wrote:
> On Tue, May 02, 2023 at 07:13:54PM -0400, Kent Overstreet wrote:
> > On Tue, May 02, 2023 at 10:58:50AM -0400, Brian Foster wrote:
> > > Hi Kent,
> > >
> > > I'm tracking an issue that seems to be related to bch_replica garbage
> > > collection, where fsck reports the following:
> > >
> > > superblock not marked as containing replicas journal: 1/1 [1], not fixing
> > >
> > > ... and the fs doesn't want to mount (until repaired).
> > >
> > > The short of it looks like there's a window where we write out the sb
> > > replica info with no BCH_DATA_journal usage where a shutdown/recovery
> > > doesn't otherwise expect it. I've not fully root caused the problem yet,
> > > but when looking into the code it looks like we have a couple different
> > > mechanisms for clearing out unused replicas...
> > >
> > > The "old" _start/_end mechanism handles journal data only and is invoked
> > > after pin flush. I presume the start/end calls wrap marking journal dev
> > > replicas to ensure usage accounting is up to date, but I'm not totally
> > > sure. Is that the case?
> >
> > It's not anything to do with usage accounting - we don't really do usage
> > accounting for the journal, not in a way that would work for replicas
> > tracking.
> >
> > So the old style replicas gc mechanism that
> > bch2_journal_flush_device_pins() uses is your traditional mark and sweep
> > gc - clear everything, then walk and remark everything that's currently
> > in use.
> >
>
> Yeah, I think I grokked the fundamentals from the implementation. It
> looks like it basically clears out the journal replica entries from the
> _gc container, repopulates based on c->usage->replicas, and then updates
> the sb/primary tables based on the result.
>
> > > If so, any reason the marking via journal_write_done() isn't
> > > sufficient?
> >
> > This might be the bug you're looking for - the journal write path should
> > be doing the marking before the write, not after.
> >
>
> Quite possible. That certainly could be a vector to this transient
> state. I'll look into it..
>

Just a note on this bit.. I made a change to mark for the journal before
the journal write, but still managed to hit this problem. It was slightly
harder to reproduce, so I suspect there might just be a couple different
instances of the problem.

AFAICT from looking at the mount time recovery code, it pretty much
expects to see at least one device marked for journal data so long as the
filesystem is not clean. It looks like clean state is removed at mount
time and restored during clean unmount, so that means we probably don't
ever want to have this "no devices marked with journal data" state at
runtime.

However, this is exactly what I see from the journal gc mechanism invoked
during device evacuate or remove. On an otherwise idle fs, the gc sees no
journal usage and clears every replica entry for journal data on the fs.
The next journal writes fairly quickly restore new entries for unrelated
replicas, so this appears to be a small window of opportunity. But
indeed, a quick hack to instrument this path and shutdown the fs during
that window makes it trivial to reproduce the subsequent mount failure.
Given all of the above context and that this only seems to be invoked via
data job paths, I think this variant can be fixed by simply tightening up
the filtering in bch2_replicas_gc_start() to also consider the target
device. IOW, if the device flush only sweeps out replica entries for
which the target device is a participant, we won't ever have this
transient (!clean && no journal replicas) state that leads to mount
failure in the event of a crash. So unless that all sounds like a Bad
Idea(tm), I'm planning to give something like that some testing..

Brian

> I am still a bit unclear on the marks in the flush path
> (bch2_journal_flush_device_pins()), though. Am I following correctly
> that this code marks the devices still backing journal entries? If so
> and given the submit time marking, is that basically gc algorithmic
> boilerplate where in context we'd expect all of those devices to already
> be marked, or is there some particular situation where you'd expect that
> path to mark a device that isn't already marked with journal data?
>
> > > The "new" bch2_replicas_gc2() mechanism is invoked from various other
> > > places and always propagates journal data replica entries. Is that
> > > filter on journal data by design, or temporary because journal replica
> > > gc is handled separately as above? If the latter, is there some explicit
> > > reason the old mechanism was left around to handle journal data like
> > > this?
> > >
> > > As noted above, I still need to track down the root cause of the
> > > unexpectedly absent journal data replica sb field. It may be agnostic to
> > > either gc mechanism, but I'm trying to understand the higher level
> > > situation here a bit better before getting too deep into the weeds...
> > > thanks.
> >
> > Applying this...
> >
>
> ACK.. that's useful context. Thanks.
>
> Brian
>
> > From: Kent Overstreet
> > Date: Tue, 2 May 2023 18:22:12 -0400
> > Subject: [PATCH] bcachefs: Improved comment for bch2_replicas_gc2()
> >
> > Signed-off-by: Kent Overstreet
> >
> > diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
> > index 8935ff5899..8ae50dfd8c 100644
> > --- a/fs/bcachefs/replicas.c
> > +++ b/fs/bcachefs/replicas.c
> > @@ -550,8 +550,14 @@ int bch2_replicas_gc_start(struct bch_fs *c, unsigned typemask)
> >  	return 0;
> >  }
> >
> > -/* New much simpler mechanism for clearing out unneeded replicas entries: */
> > -
> > +/*
> > + * New much simpler mechanism for clearing out unneeded replicas entries - drop
> > + * replicas entries that have 0 sectors used.
> > + *
> > + * However, we don't track sector counts for journal usage, so this doesn't drop
> > + * any BCH_DATA_journal entries; the old bch2_replicas_gc_(start|end) mechanism
> > + * is retained for that.
> > + */
> >  int bch2_replicas_gc2(struct bch_fs *c)
> >  {
> >  	struct bch_replicas_cpu new = { 0 };
> >