From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
Date: Fri, 23 Dec 2022 06:49:05 -0500
Message-ID: <Y6WVsZ6IXPngUw9R@bfoster>
In-Reply-To: <Y6UwQnjX7VyV2NGk@moria.home.lan>

On Thu, Dec 22, 2022 at 11:36:18PM -0500, Kent Overstreet wrote:
> On Thu, Dec 22, 2022 at 09:03:22AM -0500, Brian Foster wrote:
> > On Mon, Dec 19, 2022 at 08:02:16PM -0500, Kent Overstreet wrote:
> > > On Mon, Dec 19, 2022 at 10:27:23AM -0500, Brian Foster wrote:
> > > > A couple of the more common optimizations XFS uses are speculative
> > > > preallocation and extent size hints. The former is designed to mitigate
> > > > fragmentation, particularly in the case of concurrent sustained writes.
> > > > Basically it will selectively increase the size of appending delalloc
> > > > reservations beyond current eof in anticipation of further writes. In
> > > > the meantime, writeback will attempt to allocate maximal sized physical
> > > > extents for contiguous delalloc ranges. Finally, the allocation itself
> > > > will start off with a simple hint based on the physical location of the
> > > > inode. This helps ensure extents are eventually maximally sized whenever
> > > > sufficient contiguous free extents are available and similarly ensures
> > > > that, as related inodes are removed, contiguous extents are freed together.
> > > > Excess/unused prealloc blocks are eventually reclaimed in the background
> > > > or as needed.
> > > >
> > > > Extent size hints are more for random write/allocation scenarios and
> > > > must be set by the user. For example, consider a sparse vdisk image
> > > > seeing random small writes all over the place. If we allocate single
> > > > blocks at a time, fragmentation and the extent count can eventually
> > > > explode out of control. An extent size hint of 1MB or so ensures every
> > > > new allocation is sized/aligned as such and so helps mitigate that sort
> > > > of problem as more of the file is allocated.
> > > >
> > > > Of course XFS is fundamentally different in that it's not a COW fs, so
> > > > might have different concerns. It supports reflinks, but that's a
> > > > relatively recent feature compared to the allocation heuristics and not
> > > > something they were designed around or significantly updated for (since
> > > > COW is not default behavior, although I believe an always_cow mode does
> > > > exist).
> > >
> > > *nod* Yeah, I've been wondering how much this stuff makes sense in the context
> > > of a COW filesystem.
> > >
> > > But we do have nocow mode, complete with unwritten extents. If we need to go the
> > > delalloc route, I think the existing allocator design should be able to support
> > > that (we can pin space purely as an in memory operation, but we do have a fixed
> > > number of those so we have to be careful about introducing deadlocks).
> > >
> > > It sounds like the optimizations XFS is doing are trying to ensure that writes
> > > remain contiguous on disk even when buffered writeback isn't batching them up as
> > > much as we'd like? Is that something we still feel is important?
> > > Pagecache/system memory size keeps going up but seek times do not (and go down
> > > in the case of flash); it's not clear to me that this is still important today.
> > >
> >
> > That's a good question and I don't really know the answer. I suspect
> > there is more to it than the fundamental principles of hardware and
> > related improvements. In practice these sorts of things still improve fs
> > efficiency, performance, scalability, aging (perhaps under less than
> > ideal hardware/workload/resource conditions), etc. Given the relatively
> > low cost (in terms of complexity) of the implementation, they certainly
> > aren't things I see going away from fs' like XFS any time soon. The
> > underlying concepts may just not be as generically relevant (i.e. useful
> > across different fs implementations) as perhaps they might have been in
> > the past.
>
> Certainly no pressing reason to drop that code from XFS, but bcachefs is a clean
> slate (and COW introduces different challenges) so I have to think about things
> differently.
>
> >
> > When you think about it, it is kind of amusing to see things like the fs
> > attempt to create as large/contiguous mappings as possible, only for
> > writeback to subsequently have to explicitly break them up into smaller
> > I/O requests because otherwise the massive amount of in-core metadata
> > status updates that result (i.e. clearing per-page writeback state)
> > leads to excessive completion latency. ;) OTOH, if that eventually leads
> > to more use of things like large folios, then perhaps that's an overall
> > win.
>
> Hmm? That sounds like an odd way to describe things.
>
IIRC that's pretty much the current behavior with XFS and iomap. XFS
aggressively allocates large contiguous extents, so writeback ends up
constructing bio chains in the iomap ioend that are large enough for
completion processing to produce soft lockups just dealing with the
pages attached to the chain. The ioend was therefore capped to a max
number of chained bios. Of course writeback still carries on submitting
the same amount of overall I/O either way, just made up of more ioends
with smaller bio chains. But I suspect that as folio size is able to
increase, we'll be back to constructing larger I/Os based on fewer
folios and thus with less processing overhead per submission (in
scenarios where multipage bvecs don't already do so, at least).
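
To make the capping a bit more concrete, here is a toy model in Python
(not kernel code). It only illustrates the idea that one contiguous
writeback range gets split into several ioends with bounded bio chains.
The function name split_into_ioends, the parameter max_bios_per_ioend,
and the numbers used are all made up for illustration; they are
assumptions, not the actual iomap names or limits.

```python
def split_into_ioends(total_bios, max_bios_per_ioend):
    """Toy model: split one contiguous run of bios into ioends.

    Each returned entry is the bio chain length of one ioend. The cap
    bounds how much completion work a single ioend can carry.
    Both names and the cap value are hypothetical.
    """
    if total_bios < 0 or max_bios_per_ioend <= 0:
        raise ValueError("need total_bios >= 0 and max_bios_per_ioend > 0")

    ioends = []
    remaining = total_bios
    while remaining > 0:
        # Start a new ioend once the current chain would exceed the cap.
        chain = min(remaining, max_bios_per_ioend)
        ioends.append(chain)
        remaining -= chain
    return ioends


chains = split_into_ioends(total_bios=10, max_bios_per_ioend=4)
print(chains)       # [4, 4, 2]
print(sum(chains))  # 10
```

The total stays the same (10 bios either way); only the amount of
completion processing per ioend is bounded. With larger folios, each bio
would cover the same data with fewer folios, which is the per-submission
overhead reduction I'm speculating about above.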

Brian

> Folios are certainly long overdue and are going to be a big help, but more in
> the buffered IO paths than writeback I expect. Writeback can and does aggregate
> adjacent pages into the same IO; in bcachefs this is right now limited to 2 MB
> in practice because we represent the IO from the very start as a bio, but large
> folios + multipage bvecs should finally get us past that limit.
>
> IOW, once I do the large folio conversion writeback ought to be generating
> bucket sized extents - when data checksums are off, as they'll otherwise limit
> extent size (and that restriction will be lifted once we get extents with block
> granular/variable granularity checksums).
>