* Metadata usage following device add
From: Chris Webb @ 2021-10-24 11:15 UTC
To: Kent Overstreet; +Cc: linux-bcachefs
Kent Overstreet <kent.overstreet@gmail.com> wrote:
> On Wed, Oct 13, 2021 at 05:52:40PM +0100, Chris Webb wrote:
> > looks like sb and journal are still zero on the newly added device even
> > following the data rereplicate operation.
>
> That's a separate (known) bug - adding a new device doesn't account for
> sb/journal on the new device correctly. I should revisit that one...
Hi Kent. Sorry for this slow follow-up, but I took a look at what's going on
to see if it was easy to fix. After a bit of printk() abuse, I think I
understand why this happens:
bch2_dev_add calls __bch2_mark_metadata_bucket via bch2_mark_dev_superblock
before the device is attached to the fs, so it can't yet use
bch2_dev_usage_update.
Later, after the device is attached, we call bch2_trans_mark_dev_sb which
diligently goes through re-marking buckets and updating the usage. However,
the old and new bucket types and dirty_sectors values match, so
bch2_dev_usage_update ends up repeatedly subtracting and then re-adding the
sizes of all the BCH_DATA_sb and BCH_DATA_journal buckets from the usage
data, leaving the usage unchanged.
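To check my understanding, I boiled the pattern down to a toy sketch (the
types and names here are made up and much simpler than the real buckets.c
structures):

    #include <stdio.h>

    struct mark { int data_type; unsigned dirty_sectors; };

    static unsigned usage[4];               /* per-data-type sector counts */

    static void usage_update(struct mark old, struct mark new)
    {
            usage[old.data_type] -= old.dirty_sectors;
            usage[new.data_type] += new.dirty_sectors;
    }

    int main(void)
    {
            struct mark sb = { 1, 512 };    /* stand-in for a BCH_DATA_sb bucket */
            usage_update(sb, sb);           /* old == new: the delta is zero */
            printf("%u\n", usage[1]);       /* prints 0 - never accounted */
            return 0;
    }

Because the pre-attach marking never added the sectors in the first place,
the cancelling subtract-then-add leaves them unaccounted for good.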
In a comment in bch2_dev_add, you suggest "we have to mark, allocate the
journal, reset all the marks, then remark after we attach" but unless I'm
misunderstanding, I don't think I see anything that resets the marks in
between the two metadata-marking operations?
If it's hard to do that right, I did wonder if an alternative approach would
be to somehow flag the bucket as 'not yet accounted for' when !c in
__bch2_mark_metadata_bucket, so that bch2_dev_usage_update could spot this
during the later bch2_trans_mark_dev_sb operation, and not subtract the
unaccounted-for old.dirty_sectors.
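Extending the toy sketch above, I was imagining something like this -
'unaccounted' is a completely hypothetical field, nothing like it exists in
the code today as far as I can tell:

    struct mark {
            int             data_type;
            unsigned        dirty_sectors;
            unsigned        unaccounted:1;  /* hypothetical: set when marked before attach */
    };

    static void usage_update(struct mark old, struct mark new)
    {
            if (!old.unaccounted)           /* these sectors were never added */
                    usage[old.data_type] -= old.dirty_sectors;
            usage[new.data_type] += new.dirty_sectors;
    }

__bch2_mark_metadata_bucket would set the flag when !c, and the remark during
bch2_trans_mark_dev_sb (which writes back new with the flag clear) would then
add the sectors exactly once.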
But I thought I'd better ask first, and make sure I've understood what's
going on correctly, before getting any further out of my depth in the
innards of your filesystem! :)
Best wishes,
Chris.
* Re: Metadata usage following device add
From: Kent Overstreet @ 2021-10-24 16:39 UTC
To: Chris Webb; +Cc: linux-bcachefs
On Sun, Oct 24, 2021 at 12:15:47PM +0100, Chris Webb wrote:
> Kent Overstreet <kent.overstreet@gmail.com> wrote:
>
> > On Wed, Oct 13, 2021 at 05:52:40PM +0100, Chris Webb wrote:
> > > looks like sb and journal are still zero on the newly added device even
> > > following the data rereplicate operation.
> >
> > That's a separate (known) bug - adding a new device doesn't account for
> > sb/journal on the new device correctly. I should revisit that one...
>
> Hi Kent. Sorry for this slow follow-up, but I took a look at what's going on
> to see if it was easy to fix. After a bit of printk() abuse, I think I
> understand why this happens:
>
> bch2_dev_add calls __bch2_mark_metadata_bucket via bch2_mark_dev_superblock
> before the device is attached to the fs, so it can't yet use
> bch2_dev_usage_update.
>
> Later, after the device is attached, we call bch2_trans_mark_dev_sb which
> diligently goes through re-marking buckets and updating the usage. However,
> the old and new bucket types and dirty_sectors values match, so
> bch2_dev_usage_update ends up repeatedly subtracting and then re-adding the
> sizes of all the BCH_DATA_sb and BCH_DATA_journal buckets from the usage
> data, leaving the usage unchanged.
>
> In a comment in bch2_dev_add, you suggest "we have to mark, allocate the
> journal, reset all the marks, then remark after we attach" but unless I'm
> misunderstanding, I don't think I see anything that resets the marks in
> between the two metadata-marking operations?
>
> If it's hard to do that right, I did wonder if an alternative approach would
> be to somehow flag the bucket as 'not yet accounted for' when !c in
> __bch2_mark_metadata_bucket, so that bch2_dev_usage_update could spot this
> during the later bch2_trans_mark_dev_sb operation, and not subtract the
> unaccounted-for old.dirty_sectors.
>
> But I thought I'd better ask first, and make sure I've understood what's
> going on correctly, before getting any further out of my depth in the
> innards of your filesystem! :)
Good catch! I completely forgot about that comment and didn't think to fix it
that way - the last time I looked at this I was digging through the device
accounting code in buckets.c. I've just pushed a fix; can you tell me if you
see anything I missed?
This is a good opportunity to talk about the disk space accounting code.
Basically, I've been trying to incrementally move things away from the
in-memory bucket array and transition all the logic to going off of the alloc
info in the btree. Updating of the filesystem-wide, per-replica accounting
works that way now, but the per-device accounting does not - it's slaved to
updates of the in-memory bucket marks (see bch2_dev_usage_update() in
buckets.c), and updating the in-memory bucket marks is triggered by btree
updates to the alloc btree (see bch2_mark_alloc()).
We eventually need to get rid of the in-memory bucket arrays - they take up
too much memory on (extremely large) filesystems, and right now they're our
biggest scalability limitation.
Some background: we have memory triggers - triggers that update other
in-memory data structures on btree updates. They update things that have to
stay in sync with the btree, so they run with btree write locks held. And we
have transactional triggers, which look at the btree updates we're about to
do and trigger more btree updates.
Updating of the filesystem-wide accounting is another mechanism: the
transactional triggers build up a list of updates to be applied to the
filesystem-wide accounting (based only on looking at old and new btree keys,
which is what we want), and then bch2_trans_fs_usage_apply() commits those
updates while we've got the appropriate locks held.
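Very roughly, that mechanism has this shape - all the names below are made up
for illustration, the real thing in buckets.c is hairier:

    struct usage_delta { unsigned replicas_idx; long sectors; };

    struct delta_list {
            unsigned                nr;
            struct usage_delta      d[32];
    };

    /* transactional trigger: diff the old and new keys, queue a delta */
    static void trigger_queue_delta(struct delta_list *l, unsigned replicas_idx,
                                    long old_sectors, long new_sectors)
    {
            l->d[l->nr++] = (struct usage_delta) {
                    .replicas_idx   = replicas_idx,
                    .sectors        = new_sectors - old_sectors,
            };
    }

    /* commit path: apply the whole list to the fs-wide counters while we
     * hold the appropriate locks */
    static void fs_usage_apply(long *fs_usage, struct delta_list *l)
    {
            for (unsigned i = 0; i < l->nr; i++)
                    fs_usage[l->d[i].replicas_idx] += l->d[i].sectors;
            l->nr = 0;
    }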
The really big thing that needs to happen with disk space accounting is that it needs
to be moved to keys in a btree. Right now, disk space accounting metadata is
"special"; it lives in the journal and is never flushed to the btree. The amount
of disk space accounting we can have on filesystems with many devices already
makes this problematic, and we need to add per-snapshot ID disk space accounting
too, as well as disk space accounting broken out by compression type, with
compressed and uncompressed size.
Complicating factors: the way disk space accounting is stored and managed now is
ridiculously complicated (but fast!). It's stored in percpu counters, which are
additionally sharded by outstanding journal commit (corresponding to journal
entries/buffers in flight). Btree updates update the counters corresponding to
the journal buffer they took a journal reservation on; then prior to journal
write, those counters are rolled up into the main, non-percpu set of counters
and written out.
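As a toy model of the sharding - sizes and names made up, with plain arrays
standing in for the percpu machinery:

    #define NR_JOURNAL_BUFS 4       /* journal entries that can be in flight */
    #define NR_CPUS         8

    struct accounting {
            unsigned long   base;                            /* what gets written out */
            unsigned long   shard[NR_JOURNAL_BUFS][NR_CPUS]; /* in-flight deltas */
    };

    /* btree update path: bump only this cpu's slot for the journal buf we
     * took a reservation on - no shared cachelines, no big lock */
    static void acct_add(struct accounting *a, unsigned buf, unsigned cpu, long d)
    {
            a->shard[buf][cpu] += d;
    }

    /* just before journal buf 'buf' is written: roll its shard up into the
     * main counter that actually gets serialized */
    static void acct_rollup(struct accounting *a, unsigned buf)
    {
            for (unsigned cpu = 0; cpu < NR_CPUS; cpu++) {
                    a->base += a->shard[buf][cpu];
                    a->shard[buf][cpu] = 0;
            }
    }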
So I'm envisioning moving all disk space accounting - per device, per replica
set, per snapshot id - to keys in another btree. Indexing scheme to be
determined; using the str_hash code would be ideal since replica sets are
essentially strings (variable length list of device IDs, and I don't want to put
a hard cap on the number of devices in a replica set) - but the str_hash code
(probably?) won't work with the btree key cache code, which we'll need to use.
The locks for these accounting keys are going to become contention points -
unless we're able to come up with a way to do updates to them with just read
locks, which I think ought to be possible by carrying over the existing scheme
of applying deltas and sharding updates by journal buf and rolling them up prior
to journal write. If we can do it with just read locks, we're completely fine on
lock/cacheline contention - SIX locks have a fancy optional percpu read lock
mode.
Anyways, that's my current braindump on disk space accounting...
* Re: Metadata usage following device add
From: Chris Webb @ 2021-10-24 21:36 UTC
To: Kent Overstreet; +Cc: linux-bcachefs
Kent Overstreet <kent.overstreet@gmail.com> writes:
> Good catch! I completely forgot about that comment and didn't think to
> fix it that way, the last time I was looking at that I was looking at the
> device accounting code in buckets.c. I've just pushed a fix, can you tell
> me if you see anything I missed?
Thanks! I've played with it and tried everything I can think of to break it,
but I didn't find a way to get it out of sync or inconsistent.
> This is a good opportunity to talk about the disk space accounting code.
That was a really interesting write-up of your planned transition to move
the accounting into btree keys, thanks! Makes a lot of sense as you describe
it. (I saw you mention hitting 1M snapshots on a filesystem the other day,
which is good motivation for a scalable place to put accounting info!)
I need to spend a bit of time reading the code and getting a better sense of
how the allocation system more generally works. It's still one of the more
mysterious bits of the filesystem to me.
Am I right that in the current implementation, buckets only get discarded
when the allocator thread is woken because we've run out of free buckets for
a new allocation and need to reclaim - so they're discarded and quickly
reused? I wonder if waking the allocator thread periodically (or from a
userspace ioctl?) to reclaim and discard buckets earlier might be better for
SSD health, kind of like a fancy version of a manual fstrim on ext4?
Best wishes,
Chris.