Re: [PATCH RFC] bcachefs: use inode as write point index instead of task

From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
Date: Thu, 22 Dec 2022 09:03:22 -0500
Message-ID: <Y6RjqsXgCAJ/M7c+@bfoster>
In-Reply-To: <Y6EJmGTbFmAgmjre@moria.home.lan>

On Mon, Dec 19, 2022 at 08:02:16PM -0500, Kent Overstreet wrote:
> On Mon, Dec 19, 2022 at 10:27:23AM -0500, Brian Foster wrote:
> > A couple of the more common optimizations XFS uses are speculative
> > preallocation and extent size hints. The former is designed to mitigate
> > fragmentation, particularly in the case of concurrent sustained writes.
> > Basically it will selectively increase the size of appending delalloc
> > reservations beyond current eof in anticipation of further writes. In
> > the meantime, writeback will attempt to allocate maximal sized physical
> > extents for contiguous delalloc ranges. Finally, the allocation itself
> > will start off with a simple hint based on the physical location of the
> > inode. This helps ensure extents are eventually maximally sized whenever
> > sufficient contiguous free extents are available and similarly ensures
> > that, as related inodes are removed, contiguous extents are freed together.
> > Excess/unused prealloc blocks are eventually reclaimed in the background
> > or as needed.
> > 
> > Extent size hints are more for random write/allocation scenarios and
> > must be set by the user. For example, consider a sparse vdisk image
> > seeing random small writes all over the place. If we allocate single
> > blocks at a time, fragmentation and the extent count can eventually
> > explode out of control. An extent size hint of 1MB or so ensures every
> > new allocation is sized/aligned as such and so helps mitigate that sort
> > of problem as more of the file is allocated.
> > 
> > Of course XFS is fundamentally different in that it's not a COW fs, so
> > might have different concerns. It supports reflinks, but that's a
> > relatively recent feature compared to the allocation heuristics and not
> > something they were designed around or significantly updated for (since
> > COW is not default behavior, although I believe an always_cow mode does
> > exist).
> 
> *nod* Yeah, I've been wondering how much this stuff makes sense in the context
> of a COW filesystem.
> 
> But we do have nocow mode, complete with unwritten extents. If we need to go the
> delalloc route, I think the existing allocator design should be able to support
> that (we can pin space purely as an in memory operation, but we do have a fixed
> number of those so we have to be careful about introducing deadlocks).
> 
> It sounds like the optimizations XFS is doing are trying to ensure that writes
> remain contiguous on disk even when buffered writeback isn't batching them up as
> much as we'd like? Is that something we still feel is important?
> Pagecache/system memory size keeps going up but seek times do not (and go down
> in the case of flash); it's not clear to me that this is still important today.
> 

That's a good question and I don't really know the answer. I suspect
there is more to it than the fundamental principles of hardware and
related improvements. In practice these sorts of things still improve fs
efficiency, performance, scalability, aging (perhaps under less than
ideal hardware/workload/resource conditions), etc. Given the relatively
low cost (in terms of complexity) of the implementation, they certainly
aren't things I see going away from filesystems like XFS any time soon. The
underlying concepts may just not be as generically relevant (i.e. useful
across different fs implementations) as perhaps they might have been in
the past.

When you think about it, it is kind of amusing to see the fs attempt to
create mappings that are as large and contiguous as possible, only for
writeback to subsequently have to explicitly break them up into smaller
I/O requests because otherwise the massive amount of in-core metadata
status updates that result (i.e. clearing per-page writeback state)
leads to excessive completion latency. ;) OTOH, if that eventually leads
to more use of things like large folios, then perhaps that's an overall
win.

Anyway, I just bring these things up here for reference and discussion
purposes.
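
For reference, the extent size hint behavior described in the quoted text
above can be sketched roughly as follows. This is only an illustration of
the rounding arithmetic, not XFS code: align_to_hint() is a hypothetical
helper, and the 1MB hint and the write offsets are made-up values.

```python
MIB = 1 << 20


def align_to_hint(offset, length, hint):
    """Expand the byte range [offset, offset + length) to hint-aligned bounds.

    Returns (start, size) of the range that would be allocated if every new
    allocation is sized and aligned to the hint.
    """
    start = (offset // hint) * hint               # round start down
    end = -(-(offset + length) // hint) * hint    # round end up
    return start, end - start


# A small 4 KiB write in the middle of a sparse file allocates the whole
# surrounding 1 MiB chunk, so later nearby writes land in the same extent.
start, size = align_to_hint(5 * MIB + 12 * 1024, 4096, MIB)
print(start // MIB, size // MIB)  # 5 1
```

The tradeoff is more allocated space per small write in exchange for far
fewer extents as the file fills in.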

> > Ok. Based on the above, it kind of sounds like a worst case scenario
> > might be something like N files allocated by the same task in such a way
> > that each bucket ends up split between the N files, and then some number
> > of files end up removed. Rinse and repeat that sort of thing across new
> > sets of files and then presumably we'd have an increasing amount of free
> > space in partially used buckets that cannot be allocated..?
> 
> Yep, that would do it.
> 

Ok.
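
To make that scenario concrete, here is a toy model of it. It is a sketch
under heavy assumptions and not the bcachefs allocator: buckets are plain
lists of file ids, BUCKET_BLOCKS is a made-up size, and the function names
are hypothetical. It compares files interleaved through one shared write
point against files that each get their own.

```python
BUCKET_BLOCKS = 8  # hypothetical bucket size, in blocks


def fill_shared_write_point(n_files, blocks_per_file):
    """All files append through one write point, one block at a time in turn."""
    buckets = [[]]
    for _ in range(blocks_per_file):
        for f in range(n_files):
            if len(buckets[-1]) == BUCKET_BLOCKS:
                buckets.append([])
            buckets[-1].append(f)
    return buckets


def fill_per_file_write_point(n_files, blocks_per_file):
    """Each file appends through its own write point, so it never shares a bucket."""
    buckets = []
    for f in range(n_files):
        for i in range(blocks_per_file):
            if i % BUCKET_BLOCKS == 0:
                buckets.append([])
            buckets[-1].append(f)
    return buckets


def live_blocks(buckets, removed):
    """Live block count per bucket once the given files are deleted."""
    return [sum(1 for f in b if f not in removed) for b in buckets]


shared = live_blocks(fill_shared_write_point(4, 8), removed={1, 2, 3})
per_file = live_blocks(fill_per_file_write_point(4, 8), removed={1, 2, 3})
print(shared)    # [2, 2, 2, 2]
print(per_file)  # [8, 0, 0, 0]
```

In both cases 24 of 32 blocks are free after the removals, but in the
shared case no bucket is completely empty, so none can be reused without
moving data first.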

> > 
> > Is copygc responsible for cleaning things up in such a case in order to
> > create more usable free space (hence the excessive copygc comment
> > below)?
> 
> Correct. Copygc finds buckets that are mostly but not completely empty and
> evacuates them - writes the data in them to new buckets.
> 
> Copygc doesn't do any file-level defragmentation, but now that we have
> backpointers it could.
> 

Cool.
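
As I understand the description above, a single evacuation pass could be
modeled roughly like this. It is a simplified sketch, not the actual copygc
code: copygc_pass() and its threshold parameter are hypothetical, and real
victim selection is surely more involved than a fixed cutoff.

```python
def copygc_pass(live_per_bucket, bucket_blocks, threshold=0.5):
    """Evacuate buckets that are partially used at or below the threshold.

    live_per_bucket: live block count for each bucket.
    Returns (live counts of buckets still in use, number of blocks copied).
    """
    cutoff = threshold * bucket_blocks
    victims = [n for n in live_per_bucket if 0 < n <= cutoff]
    survivors = [n for n in live_per_bucket if n > cutoff]

    # Live data from the victims is rewritten densely into new buckets;
    # the victims themselves become empty and can be reused.
    moved = sum(victims)
    full, rem = divmod(moved, bucket_blocks)
    rewritten = [bucket_blocks] * full + ([rem] if rem else [])
    return survivors + rewritten, moved


in_use, copied = copygc_pass([2, 2, 2, 2], bucket_blocks=8)
print(in_use, copied)  # [8] 8
```

Here four quarter-full buckets collapse into one full bucket at the cost
of copying 8 blocks, which is the extra write traffic that badly
interleaved allocation ends up paying for.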

> > Hmm.. Ok, that gives me another area to look into re: copygc. ;) Thanks
> > for all of the feedback and context..
> 
> Feel free to hit me up on IRC as you're looking at code. I'm also currently
> working on the copygc code - we have a persistent fragmentation index about to
> land, which will be a drastic improvement to copygc scalability. Not relevant to
> what you're looking at, but the code is at least fresh in my mind :)
> 

Thanks. Appreciate the feedback here and in the other subthread. I'm
currently mostly trying to grok core concepts and map them to areas of
code and such, but will undoubtedly have more questions once I get more
into the details.

Brian
