From: Kent Overstreet <kent.overstreet@gmail.com>
To: Chris Webb <chris@arachsys.com>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: Metadata usage following device add
Date: Sun, 24 Oct 2021 12:39:11 -0400
Message-ID: <YXWML9QLQKCsRgdh@moria.home.lan>
In-Reply-To: <20211024111547.GG11670@arachsys.com>
On Sun, Oct 24, 2021 at 12:15:47PM +0100, Chris Webb wrote:
> Kent Overstreet <kent.overstreet@gmail.com> wrote:
>
> > On Wed, Oct 13, 2021 at 05:52:40PM +0100, Chris Webb wrote:
> > > looks like sb and journal are still zero on the newly added device even
> > > following the data rereplicate operation.
> >
> > That's a separate (known) bug - adding a new device doesn't account for
> > sb/journal on the new device correctly. I should revisit that one...
>
> Hi Kent. Sorry for this slow follow-up, but I took a look at what's going on
> to see if it was easy to fix. After a bit of printk() abuse, I think I
> understand why this happens:
>
> bch2_dev_add calls __bch2_mark_metadata_bucket via bch2_mark_dev_superblock
> before the device is attached to the fs, so it can't yet use
> bch2_dev_usage_update.
>
> Later, after the device is attached, we call bch2_trans_mark_dev_sb which
> diligently goes through re-marking buckets and updating the usage. However,
> the old and new bucket types and dirty_sectors values match, so
> bch2_dev_usage_update ends up repeatedly subtracting and then re-adding the
> sizes of all the BCH_DATA_sb and BCH_DATA_journal buckets from the usage
> data, leaving the usage unchanged.
>
> In a comment in bch2_dev_add, you suggest "we have to mark, allocate the
> journal, reset all the marks, then remark after we attach" but unless I'm
> misunderstanding, I don't think I see anything that resets the marks in
> between the two metadata-marking operations?
>
> If it's hard to do that right, I did wonder if an alternative approach would
> be to somehow flag the bucket as 'not yet accounted for' when !c in
> __bch2_mark_metadata_bucket, so that bch2_dev_usage_update could spot this
> during the later bch2_trans_mark_dev_sb operation, and not subtract the
> unaccounted-for old.dirty_sectors.
>
> But I thought I'd better ask first, and make sure I've understood what's
> going on correctly, before getting any further out of my depth in the
> innards of your filesystem! :)
Good catch! I completely forgot about that comment and didn't think to fix it
that way; the last time I was looking at this, I was looking at the device
accounting code in buckets.c. I've just pushed a fix - can you tell me if you
see anything I missed?
This is a good opportunity to talk about the disk space accounting code.
Basically, I've been trying to incrementally move things away from using the
in-memory bucket array and transition all the logic to going off of the alloc
info in the btree. Updating of the filesystem-wide, per-replica accounting works
that way now, but the per-device accounting does not - it's slaved to updates of
the in-memory bucket marks (see bch2_dev_usage_update() in buckets.c), and
updating the in-memory bucket marks is in turn triggered by btree updates to the
alloc btree (see bch2_mark_alloc()).
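To make the "slaved to bucket marks" part concrete, here's a toy sketch of the
shape of that update. All structs, fields, and names below are simplified
stand-ins for illustration, not the real definitions in buckets.c. It also shows
why re-marking a bucket with identical old and new marks - the case Chris
debugged - subtracts and re-adds the same values, leaving usage unchanged:

```c
/* Toy model of per-device usage tracking bucket state changes.
 * Everything here is a simplified stand-in, not real bcachefs code. */
#include <stdint.h>

enum data_type { DATA_free, DATA_sb, DATA_journal, DATA_user, DATA_NR };

struct bucket_mark {
	enum data_type	type;
	uint32_t	dirty_sectors;
};

struct dev_usage {
	uint64_t	buckets[DATA_NR];
	uint64_t	sectors[DATA_NR];
};

/* Mirrors the shape of bch2_dev_usage_update(): subtract the old mark,
 * add the new one. If old == new (as in the re-marking pass Chris
 * observed), the two halves cancel and usage is unchanged. */
static void dev_usage_update(struct dev_usage *u,
			     struct bucket_mark old,
			     struct bucket_mark new)
{
	u->buckets[old.type]--;
	u->sectors[old.type] -= old.dirty_sectors;

	u->buckets[new.type]++;
	u->sectors[new.type] += new.dirty_sectors;
}
```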
We need to eventually get rid of the in-memory bucket arrays - they take up too
much memory on (extremely large) filesystems, and right now they're our biggest
scalability limitation.
Some background: we have memory triggers - triggers that update other in-memory
data structures on btree updates. They update things that have to stay in sync
with the btree, so they run with btree write locks held. And we have
transactional triggers, which look at the btree updates we're about to do and
generate more btree updates.
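A minimal sketch of the two flavours, with invented signatures (the real
triggers operate on bkeys inside a btree transaction, which is elided here):

```c
/* Toy illustration of memory vs. transactional triggers; all names and
 * signatures are invented for this sketch, not the real bcachefs API. */
#include <stddef.h>

struct key { int id; int sectors; };

/* "Memory trigger": runs with the btree write lock held and updates an
 * in-memory structure that must stay in sync with the btree. */
static void mem_trigger(struct key old, struct key new, long *in_mem_sectors)
{
	*in_mem_sectors += new.sectors - old.sectors;
}

/* "Transactional trigger": looks only at the old and new keys and
 * queues further btree updates instead of touching in-memory state. */
static size_t trans_trigger(struct key old, struct key new,
			    struct key *queue, size_t nr)
{
	if (new.sectors != old.sectors) {
		queue[nr].id	  = new.id;
		queue[nr].sectors = new.sectors - old.sectors;
		nr++;
	}
	return nr;
}
```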
Updating of the filesystem-wide accounting is another mechanism: the
transactional triggers build up a list of updates to be applied to the
filesystem-wide accounting (based only on looking at old and new btree keys,
which is what we want), and then bch2_trans_fs_usage_apply() commits those
updates while we've got the appropriate locks held.
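The build-a-list-then-apply pattern, in toy form - the delta type and apply
function are invented for illustration, and locking is elided:

```c
/* Toy version of the "accumulate deltas, then apply under lock"
 * pattern of bch2_trans_fs_usage_apply(); names invented for the
 * sketch, locking omitted. */
#include <stddef.h>

struct fs_usage_delta {
	size_t	idx;		/* which accounting counter to adjust */
	long	sectors;	/* signed change in sectors */
};

/* In the real code this runs with the appropriate locks held, so the
 * whole batch of deltas lands atomically with respect to the journal
 * reservation it was built against. */
static void fs_usage_apply(long *usage, size_t nr_counters,
			   const struct fs_usage_delta *d, size_t nr_deltas)
{
	for (size_t i = 0; i < nr_deltas; i++)
		if (d[i].idx < nr_counters)
			usage[d[i].idx] += d[i].sectors;
}
```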
The really big thing that needs to happen with disk space accounting is that it
needs to be moved to keys in a btree. Right now, disk space accounting metadata
is "special": it lives in the journal and is never flushed to the btree. The
amount of disk space accounting we can have on filesystems with many devices
already makes this problematic, and we need to add per-snapshot-ID disk space
accounting too, as well as disk space accounting broken out by compression type,
with compressed and uncompressed sizes.
Complicating factors: the way disk space accounting is stored and managed now is
ridiculously complicated (but fast!). It's stored in percpu counters, which are
additionally sharded by outstanding journal commit (corresponding to journal
entries/buffers in flight). Btree updates update the counters corresponding to
the journal buffer they took a journal reservation on; then, prior to the
journal write, those counters are rolled up into the main, non-percpu set of
counters and written out.
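The sharding-plus-rollup scheme looks roughly like this toy model - the percpu
indirection is omitted, two shards stand in for the journal buffers in flight,
and all names are invented for the sketch:

```c
/* Toy model of counters sharded by journal buffer in flight, rolled up
 * into the master counters before the journal write. The percpu layer
 * is omitted; names are invented for this sketch. */
#include <stdint.h>

#define NR_JOURNAL_BUFS	2
#define NR_COUNTERS	4

struct accounting {
	int64_t	master[NR_COUNTERS];
	int64_t	shard[NR_JOURNAL_BUFS][NR_COUNTERS];
};

/* A btree update adds its delta to the shard for the journal buffer it
 * holds a reservation against. */
static void acct_update(struct accounting *a, int buf, int ctr, int64_t delta)
{
	a->shard[buf % NR_JOURNAL_BUFS][ctr] += delta;
}

/* Before writing journal buffer @buf, fold its shard into the master
 * counters, which are what actually get written out. */
static void acct_rollup(struct accounting *a, int buf)
{
	int i = buf % NR_JOURNAL_BUFS;

	for (int c = 0; c < NR_COUNTERS; c++) {
		a->master[c] += a->shard[i][c];
		a->shard[i][c] = 0;
	}
}
```

Updates touch only their own shard, which is what keeps the fast path cheap;
only the rollup at journal-write time touches the master counters.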
So I'm envisioning moving all disk space accounting - per device, per replica
set, per snapshot ID - to keys in another btree. The indexing scheme is to be
determined; using the str_hash code would be ideal, since replica sets are
essentially strings (variable-length lists of device IDs, and I don't want to
put a hard cap on the number of devices in a replica set) - but the str_hash
code (probably?) won't work with the btree key cache code, which we'll need to
use.
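To illustrate the "replica sets are essentially strings" point: a sorted,
variable-length list of device IDs compares like a string, so it can serve as a
btree key. This encoding is purely hypothetical - the fixed buffer exists only
to keep the sketch self-contained, whereas the actual goal is no hard cap:

```c
/* Hypothetical sketch of a replica set as a string-like key: a sorted,
 * variable-length list of device IDs compared bytewise. The fixed-size
 * buffer is only for the sketch; a real encoding would be unbounded. */
#include <string.h>

struct replicas_key {
	unsigned char	nr;		/* number of devices */
	unsigned char	devs[15];	/* sorted device IDs */
};

static int replicas_key_cmp(const struct replicas_key *a,
			    const struct replicas_key *b)
{
	unsigned n = a->nr < b->nr ? a->nr : b->nr;
	int c = memcmp(a->devs, b->devs, n);

	/* shared prefix: the shorter list sorts first, like strcmp() */
	return c ? c : (int)a->nr - (int)b->nr;
}
```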
The locks for these accounting keys are going to become contention points -
unless we're able to come up with a way to do updates to them with just read
locks, which I think ought to be possible by carrying over the existing scheme
of applying deltas and sharding updates by journal buf and rolling them up prior
to journal write. If we can do it with just read locks, we're completely fine on
lock/cacheline contention - six locks have a fancy optional percpu read lock
mode.
Anyways, that's my current braindump on disk space accounting...