Linux bcachefs list
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
Date: Mon, 19 Dec 2022 10:27:23 -0500	[thread overview]
Message-ID: <Y6CC27y2Rf44DzFI@bfoster> (raw)
In-Reply-To: <Y5u3VlkA3AbhQKav@moria.home.lan>

On Thu, Dec 15, 2022 at 07:09:58PM -0500, Kent Overstreet wrote:
> On Wed, Dec 14, 2022 at 12:44:52PM -0500, Brian Foster wrote:
> > On Tue, Dec 13, 2022 at 01:38:33PM -0500, Kent Overstreet wrote:
> > > Fundamentally, what we're doing is trying to preserve locality, the idea is that
> > > data written by the same process is likely related and should be written out
> > > together.
> > > 
> > > Consider a cp -r of many small files, or any sort of process that's writing out
> > > to many small files. Using the pid for the writepoint preserves locality, using
> > > the inode number would not.
> > > 
> > 
> > Yeah, that makes sense. I'm more curious about the value and tradeoff of
> > inter-file vs. intra-file locality and the associated workloads that
> > might or might not benefit. I suppose I can see value of a single writer
> > copying in a bunch of small files and the allocations being streamlined
> > as such. OTOH, what's the cost if separate files don't happen to follow
> > the same write point like this (i.e. as if cp were run one file at a
> > time)? Is there some indirect/internal to bcachefs benefit to this
> > behavior, or are we just trying to batch writes?
> 
> There's two main factors.
> 
> Firstly, we've only got so many write points, so we're more likely to end up
> using the same write points as another application, and interspersing our data
> with theirs.
> 
> The big factor is bucket based allocation and copygc: preserving locality is
> more important in bcachefs than in other filesystems, because we have to
> evacuate a bucket before it can be reused. Keeping related data grouped together
> in the same bucket means that it's much more likely to all die (be deleted or
> overwritten) at the same time, and copygc won't have to do any work.
> 

OK, thanks. This is context I didn't have and gives me a bit more to dig
into.

> (Side note: bucket based allocation and copygc is what will make really good
> zoned device supoprt possible, which has been part of the design since forever
> and something I want to see finished - I've just been focusing on more core
> things).
> 
> > What about scenarios where the writing task is less relevant? For
> > example, perhaps something like a file server ingesting data for
> > multiple sets of large files via a thread pool or some such.. I'm not
> > sure I see as much value of a worker task doling out per-task hinted
> > allocations across a set of the N active files it might be processing..
> 
> There's been talk of plumbing a write point API up to userspace before - SSDs
> could make good use of it if they had it.
> 
> Or, perhaps we could detect when a task is writing to multiple files at the same
> time (not one after the other, as cp would) and switch to writepoint-per-file
> then.
> 

That's more along the lines of the direction I was expecting this to go
(if anywhere)...

> > The reason it stood out to me is that I'm not aware of fs' going through
> > major pains to preserve per-task locality across inodes in this sort of
> > way (XFS, for example, may deliberately spread files over the disk if
> > they happen to span directories, perhaps for different reasons). OTOH,
> > we do put effort into reducing per-file fragmentation in scenarios like
> > concurrent sustained buffered writes or random write/alloc workloads to
> > sparse files, and then usually hear from users one way or another when
> > those optimizations aren't effective enough. ;)
> 
> Maybe you could tell me more? I don't know nearly as much as I'd like to about
> allocator strategies in XFS; I've never gone as deep here as I'd like beacuse,
> quite honestly, people haven't complained :)
> 

Yeah.. I suspect typical buffered write/writeback behavior turns this
more into a potential refinement than a bug or anything.

A couple of the more common optimizations XFS uses are speculative
preallocation and extent size hints. The former is designed to mitigate
fragmentation, particularly in the case of concurrent sustained writes.
Basically it will selectively increase the size of appending delalloc
reservations beyond current eof in anticipation of further writes. In
the meantime, writeback will attempt to allocate maximal sized physical
extents for contiguous delalloc ranges. Finally, the allocation itself
will start off with a simple hint based on the physical location of the
inode. This helps ensure extents are eventually maximally sized whenever
sufficient contiguous free extents are available and similarly ensures
as related inodes are removed, contiguous extents are freed together.
Excess/unused prealloc blocks are eventually reclaimed in the background
or as needed.

Extent size hints are more for random write/allocation scenarios and
must be set by the user. For example, consider a sparse vdisk image
seeing random small writes all over the place. If we allocate single
blocks at a time, fragmentation and the extent count can eventually
explode out of control. An extent size hint of 1MB or so ensures every
new allocation is sized/aligned as such and so helps mitigate that sort
of problem as more of the file is allocated.

Of course XFS is fundamentally different in that it's not a COW fs, so
might have different concerns. It supports reflinks, but that's a
relatively recent feature compared to the allocation heuristics and not
something they were designed around or significantly updated for (since
COW is not default behavior, although I believe an always_cow mode does
exist).

> (And the company funding me cares very much about streaming bandwidth on
> rotating disk, it's something we've looked at. The bigger issue last time we
> looked was that readdir order in bcachefs doesn't match create order or sort
> order, because directories are implemented as simple open addressing hash
> tables - meaning reading back files in a directory incurred unnecessary seeks on
> file boundaries. Another todo list item :)
> 
> > Eh, I'd have to go back at play around with it to provide specifics.. I
> > suspect I was doing something like multiple, concurrent, appending
> > writes to a single file from different tasks, also concurrent with a set
> > of sync_file_range() workers forcing frequent writeback activity, and
> > then seeing a notably large number of extents for the relative size of
> > the file. That turned around completely with the tweak in this patch.
> > 
> > This test wasn't intended to reflect any sort of sane workload. Rather,
> > to just try and confirm assumptions made from reading the code. I've not
> > actually tested anything like the more practical example mentioned
> > above, and I wouldn't expect it to result in anything nearly as
> > pathological as this test.
> 
> *nod*
> 
> > Did you have specific improvements or target use cases in mind in this
> > area..?
> 
> I do think we could probably be doing something more than using the pid for the
> writepoint, I've just been waiting until we see specific workloads where the
> current behaviour falls over or have a specific complaint before designing
> something new.
> 

Ok. Based on the above, it kind of sounds like a worse case scenario
might be something like N files allocated by the same task in such a way
that each bucket ends up split between the N files, and then some number
of files end up removed. Rinse and repeat that sort of thing across new
sets of files and then presumably we'd have increasing amount of free
space in partially used buckets that cannot be allocated..?

Is copygc responsible for cleaning things up in such a case in order to
create more usable free space (hence the excessive copygc comment
below)?

> Target use cases - I want bcachefs to be able to handle any mainstream
> application XFS can (as well as brtfs/ext4/zfs) - i.e. be a truly general
> purpose filesystem. Anything needed to make that happen is on the table, long
> term - including potentially switching out the entire bucket-based allocator for
> something else, as an option - the codebase is flexible enough that we could do
> that.
> 

Good to know.

> I doubt it'll come to that though, especially now that we've got nocow mode - my
> expectation is that we'll be able to get it to do what we need to with perhaps
> minor tweaks to writepoint usage. Always contingent upon more data, of course.
> 
> More immediately, I'm trying to set things up so that we've got good visibility
> into what's going on can see what needs to be tweaked and tuned later.
> 
> One big thing we'll want to watch for is excessive copygc - that'll be an
> indication our data layout could've been better. We've got backpointers from
> buckets back to extents and inodes back to dirents, which means we could pretty
> easily have the copygc tracepoints start telling us the specific filenames it
> was moving data for.
> 

Hmm.. Ok, that gives me another area to look into re: copygc. ;) Thanks
for all of the feedback and context..

Brian


  parent reply	other threads:[~2022-12-19 15:28 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-12 19:06 [PATCH RFC] bcachefs: use inode as write point index instead of task Brian Foster
2022-12-13 18:38 ` Kent Overstreet
2022-12-14 17:44   ` Brian Foster
2022-12-16  0:09     ` Kent Overstreet
2022-12-16  7:04       ` Kent Overstreet
2022-12-19 15:42         ` Brian Foster
2022-12-20  1:56           ` Kent Overstreet
2022-12-28 22:24             ` Eric Wheeler
2022-12-29 20:59               ` Kent Overstreet
2022-12-29 22:26                 ` Eric Wheeler
2022-12-30  3:18                   ` Kent Overstreet
2022-12-19 15:27       ` Brian Foster [this message]
2022-12-20  1:02         ` Kent Overstreet
2022-12-22 14:03           ` Brian Foster
2022-12-23  4:36             ` Kent Overstreet
2022-12-23 11:49               ` Brian Foster
2022-12-23 18:02                 ` Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y6CC27y2Rf44DzFI@bfoster \
    --to=bfoster@redhat.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-bcachefs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox