Re: [PATCH RFC] bcachefs: use inode as write point index instead of task

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
Date: Wed, 14 Dec 2022 12:44:52 -0500	[thread overview]
Message-ID: <Y5oLlLcHmS2EWp8n@bfoster> (raw)
In-Reply-To: <20221213183743.3m6ntfnu7n3yebng@moria.home.lan>

On Tue, Dec 13, 2022 at 01:38:33PM -0500, Kent Overstreet wrote:
> On Mon, Dec 12, 2022 at 02:06:02PM -0500, Brian Foster wrote:
> > Use the pointer to the current in-core inode object as the hash of
> > the write point associated with block allocation for buffered
> > writes.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > Hi bcachefs folks,
> > 
> > I'm posting this RFC more as a discussion point than a direct proposal.
> > I've been poking around a bit at bcachefs just to grok some of the
> > basics and whatnot and ended up digging down into the block allocation
> > code a bit. From there I was trying to make sense of the bucket and
> > write point abstractions, and then noticed that the write point
> > structure used to track allocation/write contexts is hashed by task.
> > IIUC, that means a single task that might be writing to multiple files
> > will do so via a single write point, or conversely multiple tasks that
> > might be writing to a single file would do so with multiple write points
> > (one associated with each task). In turn, that means the allocation
> > layout of a particular file might be a characteristic of the set of
> > tasks that initially wrote the file as opposed to something more
> > traditional based on inode locality.
> > 
> > In thinking about this a bit more, it seems logical that this derives
> > from bcache because it seems like a perfectly good cache allocation
> > policy. I suppose it might make sense in certain filesystem workloads
> > where per-file contiguity might no longer necessarily be the most
> > important thing in the world, but I'm curious if that is intentional for
> > bcachefs (the approach is documented, after all) and if there are any
> > particular workloads or use cases in mind for that sort of approach?
> > 
> > Even if it makes sense in some cases, I wonder whether it's the ideal
> > approach across the board. For example, consider how the task is tracked
> > through the inode from buffered write to writeback contexts so the
> > latter can locate the write point for the last task that wrote to the
> > file. Even if multiple tasks perform buffered writes to a file, we only
> > really use the last task that wrote to the file before the current
> > writeback cycle. Absent frequent writebacks, wouldn't it make some sense
> > to just associate the write point with the current inode pointer and use
> > that to minimize per-inode fragmentation?
> 
> Fundamentally, what we're doing is trying to preserve locality, the idea is that
> data written by the same process is likely related and should be written out
> together.
> 
> Consider a cp -r of many small files, or any sort of process that's writing out
> to many small files. Using the pid for the writepoint preserves locality, using
> the inode number would not.
> 

Yeah, that makes sense. I'm more curious about the value and tradeoff of
inter-file vs. intra-file locality and the associated workloads that
might or might not benefit. I suppose I can see value of a single writer
copying in a bunch of small files and the allocations being streamlined
as such. OTOH, what's the cost if separate files don't happen to follow
the same write point like this (i.e. as if cp were run one file at a
time)? Is there some indirect/internal to bcachefs benefit to this
behavior, or are we just trying to batch writes?

What about scenarios where the writing task is less relevant? For
example, perhaps something like a file server ingesting data for
multiple sets of large files via a thread pool or some such.. I'm not
sure I see as much value of a worker task doling out per-task hinted
allocations across a set of the N active files it might be processing..

Note that I'm just probing here; trying to understand the origin of this
heuristic, where it might or might not apply between bcache and
bcachefs, etc. I'm not familiar with bcache and have only just scratched
the surface of bcachefs to this point and this just happened to be the
first higher level question that arose when reading through the
allocation code.

The reason it stood out to me is that I'm not aware of fs' going through
major pains to preserve per-task locality across inodes in this sort of
way (XFS, for example, may deliberately spread files over the disk if
they happen to span directories, perhaps for different reasons). OTOH,
we do put effort into reducing per-file fragmentation in scenarios like
concurrent sustained buffered writes or random write/alloc workloads to
sparse files, and then usually hear from users one way or another when
those optimizations aren't effective enough. ;)

> > FWIW, I was able to run some oddball workloads that produce excessive
> > fragmentation by forcing frequent writeback activity, but my tests so
> > far are ad hoc and moreso just playing around to help understand
> > behavior. To be fair, more traditional filesystems don't necessarily
> > handle the same sort of tests terribly well either. What I found more
> > interesting is that with the small tweak from this RFC, the allocation
> > layout from bcachefs went from notably worse to notably better in that
> > kind of pathological workload. Thoughts?
> 
> Could you describe more what you were seeing and what the changes were? I think
> there is room for improvement here, but we're going to have to carefully think
> about and describe what we're trying to acheive.
> 

Eh, I'd have to go back at play around with it to provide specifics.. I
suspect I was doing something like multiple, concurrent, appending
writes to a single file from different tasks, also concurrent with a set
of sync_file_range() workers forcing frequent writeback activity, and
then seeing a notably large number of extents for the relative size of
the file. That turned around completely with the tweak in this patch.

This test wasn't intended to reflect any sort of sane workload. Rather,
to just try and confirm assumptions made from reading the code. I've not
actually tested anything like the more practical example mentioned
above, and I wouldn't expect it to result in anything nearly as
pathological as this test.

Did you have specific improvements or target use cases in mind in this
area..?

> It's not clear to me how this would have an effect at all in the test you're
> describing; forcing frequent writebacks is going to hurt in a COW filesystem,
> since every write is an allocation - need more information here.
> 

Yeah, I was just trying to wrap my head around the initial allocation
case for the time being..

Brian

next prev parent reply	other threads:[~2022-12-14 17:45 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-12 19:06 [PATCH RFC] bcachefs: use inode as write point index instead of task Brian Foster
2022-12-13 18:38 ` Kent Overstreet
2022-12-14 17:44   ` Brian Foster [this message]
2022-12-16  0:09     ` Kent Overstreet
2022-12-16  7:04       ` Kent Overstreet
2022-12-19 15:42         ` Brian Foster
2022-12-20  1:56           ` Kent Overstreet
2022-12-28 22:24             ` Eric Wheeler
2022-12-29 20:59               ` Kent Overstreet
2022-12-29 22:26                 ` Eric Wheeler
2022-12-30  3:18                   ` Kent Overstreet
2022-12-19 15:27       ` Brian Foster
2022-12-20  1:02         ` Kent Overstreet
2022-12-22 14:03           ` Brian Foster
2022-12-23  4:36             ` Kent Overstreet
2022-12-23 11:49               ` Brian Foster
2022-12-23 18:02                 ` Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y5oLlLcHmS2EWp8n@bfoster \
    --to=bfoster@redhat.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-bcachefs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.