Date: Sun, 24 Oct 2021 12:39:11 -0400
From: Kent Overstreet
To: Chris Webb
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: Metadata usage following device add
In-Reply-To: <20211024111547.GG11670@arachsys.com>

On Sun, Oct 24, 2021 at 12:15:47PM +0100, Chris Webb wrote:
> Kent Overstreet wrote:
> 
> > On Wed, Oct 13, 2021 at 05:52:40PM +0100, Chris Webb wrote:
> > > looks like sb and journal are still zero on the newly added device even
> > > following the data rereplicate operation.
> > 
> > That's a separate (known) bug - adding a new device doesn't account for
> > sb/journal on the new device correctly. I should revisit that one...
> 
> Hi Kent. Sorry for this slow follow-up, but I took a look at what's going on
> to see if it was easy to fix. After a bit of printk() abuse, I think I
> understand why this happens:
> 
> bch2_dev_add calls __bch2_mark_metadata_bucket via bch2_mark_dev_superblock
> before the device is attached to the fs, so it can't yet use
> bch2_dev_usage_update.
> 
> Later, after the device is attached, we call bch2_trans_mark_dev_sb which
> diligently goes through re-marking buckets and updating the usage. However,
> the old and new bucket types and dirty_sectors values match, so
> bch2_dev_usage_update ends up repeatedly subtracting and then re-adding the
> sizes of all the BCH_DATA_sb and BCH_DATA_journal buckets from the usage
> data, leaving the usage unchanged.
> 
> In a comment in bch2_dev_add, you suggest "we have to mark, allocate the
> journal, reset all the marks, then remark after we attach" but unless I'm
> misunderstanding, I don't think I see anything that resets the marks in
> between the two metadata-marking operations?
> 
> If it's hard to do that right, I did wonder if an alternative approach would
> be to somehow flag the bucket as 'not yet accounted for' when !c in
> __bch2_mark_metadata_bucket, so that bch2_dev_usage_update could spot this
> during the later bch2_trans_mark_dev_sb operation, and not subtract the
> unaccounted-for old.dirty_sectors.
> 
> But I thought I'd better ask first, and make sure I've understood what's
> going on correctly, before getting any further out of my depth in the
> innards of your filesystem! :)

Good catch! I completely forgot about that comment and didn't think to fix
it that way; the last time I looked at this I was looking at the device
accounting code in buckets.c. I've just pushed a fix - can you tell me if
you see anything I missed?

This is a good opportunity to talk about the disk space accounting code.
Basically, I've been trying to incrementally move away from the in-memory
bucket array and transition all the logic to going off of the alloc info in
the btree. Updating of the filesystem-wide, per-replica accounting works
that way now, but the per-device accounting does not - it's slaved to
updates of the in-memory bucket marks (see bch2_dev_usage_update() in
buckets.c), and updating the in-memory bucket marks is triggered by btree
updates to the alloc btree (see bch2_mark_alloc()).
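That delta-based update is also why the bug you hit is silent:
bch2_dev_usage_update() only applies the difference between the old and new
bucket state, so re-marking a bucket whose state didn't change is a net
no-op, and sectors that were never added while the device was detached stay
missing. Very roughly - this is a simplified illustration, not the actual
code, and the real types and fields differ:

        /* illustrative stand-ins for BCH_DATA_* and the real structs: */
        enum bucket_data_type {
                DATA_free,
                DATA_sb,
                DATA_journal,
                DATA_NR,
        };

        struct bucket_state {
                enum bucket_data_type   data_type;
                unsigned                dirty_sectors;
        };

        struct dev_usage {
                long                    sectors[DATA_NR];
        };

        /*
         * Usage is updated as a delta between old and new bucket state,
         * so old == new means subtract and re-add the same values:
         */
        static void dev_usage_update(struct dev_usage *u,
                                     struct bucket_state old,
                                     struct bucket_state new)
        {
                u->sectors[old.data_type] -= old.dirty_sectors;
                u->sectors[new.data_type] += new.dirty_sectors;
        }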
We need to eventually get rid of the in-memory bucket arrays - they take up
too much memory on (extremely large) filesystems; right now they're our
biggest scalability limitation.

Some background: we have memory triggers - triggers that update other
in-memory data structures on btree updates; they update things that have to
stay in sync with the btree, so they run with btree write locks held. And we
have transactional triggers, which look at the btree updates we're about to
do and trigger more btree updates.

Updating of the filesystem-wide accounting is another mechanism: the
transactional triggers build up a list of updates to be applied to the
filesystem-wide accounting (based only on looking at old and new btree keys,
which is what we want), and then bch2_trans_fs_usage_apply() commits those
updates while we've got the appropriate locks held.

The really big thing that needs to happen with disk space accounting is that
it needs to be moved to keys in a btree. Right now, disk space accounting
metadata is "special"; it lives in the journal and is never flushed to the
btree. The amount of disk space accounting we can have on filesystems with
many devices already makes this problematic, and we need to add
per-snapshot-ID disk space accounting too, as well as disk space accounting
broken out by compression type, with compressed and uncompressed size.

Complicating factor: the way disk space accounting is stored and managed now
is ridiculously complicated (but fast!). It's stored in percpu counters,
which are additionally sharded by outstanding journal commit (corresponding
to journal entries/buffers in flight). Btree updates update the counters
corresponding to the journal buffer they took a journal reservation on;
then, prior to the journal write, those counters are rolled up into the
main, non-percpu set of counters and written out.

So I'm envisioning moving all disk space accounting - per device, per
replica set, per snapshot ID - to keys in another btree. Indexing scheme to
be determined; using the str_hash code would be ideal, since replica sets
are essentially strings (a variable-length list of device IDs, and I don't
want to put a hard cap on the number of devices in a replica set) - but the
str_hash code (probably?) won't work with the btree key cache code, which
we'll need to use.

The locks for these accounting keys are going to become contention points -
unless we're able to come up with a way to do updates to them with just read
locks, which I think ought to be possible by carrying over the existing
scheme of applying deltas, sharding updates by journal buf, and rolling them
up prior to the journal write (rough sketch at the end of this mail). If we
can do it with just read locks, we're completely fine on lock/cacheline
contention - six locks have a fancy optional percpu read lock mode.

Anyways, that's my current braindump on disk space accounting...
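To make the read-locks idea a bit more concrete, here's very roughly the
shape I'm imagining. Nothing like this exists yet - all the names below are
made up, and the real thing would use percpu counters and be tied into the
journal reservation machinery:

        #include <stdint.h>

        #define JOURNAL_BUF_NR  4       /* journal writes in flight */

        /* an accounting key, as it'd live in the btree/btree key cache: */
        struct accounting_key {
                int64_t         sectors;
        };

        /* per-key deltas, one slot per in-flight journal buffer
         * (percpu in the real thing): */
        struct accounting_deltas {
                int64_t         d[JOURNAL_BUF_NR];
        };

        /*
         * Called at btree update commit time, holding only a read lock on
         * the accounting key: add this transaction's delta to the shard
         * for the journal buffer it took a reservation on.
         */
        static void accounting_key_add(struct accounting_deltas *deltas,
                                       unsigned journal_buf_idx,
                                       int64_t sectors)
        {
                deltas->d[journal_buf_idx] += sectors;
        }

        /*
         * Called once per journal buffer, just before the journal write,
         * from a single thread: fold that buffer's deltas into the key
         * itself - the only point where a write lock would be needed.
         */
        static void accounting_key_rollup(struct accounting_key *k,
                                          struct accounting_deltas *deltas,
                                          unsigned journal_buf_idx)
        {
                k->sectors += deltas->d[journal_buf_idx];
                deltas->d[journal_buf_idx] = 0;
        }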