From: Brian Foster <bfoster@redhat.com>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
linux-mm@kvack.org, brauner@kernel.org, willy@infradead.org,
jack@suse.cz, hch@infradead.org, jlayton@kernel.org,
linux-fsdevel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH v2 12/12] iomap: add granular dirty and writeback accounting
Date: Fri, 5 Sep 2025 07:19:05 -0400 [thread overview]
Message-ID: <aLrG_eOwiROSi-XB@bfoster> (raw)
In-Reply-To: <CAJnrk1aD=n1NzyxgftoQfvC83OO73w2E7ChvGHAh5xfxKrM86Q@mail.gmail.com>
On Thu, Sep 04, 2025 at 05:14:21PM -0700, Joanne Koong wrote:
> On Thu, Sep 4, 2025 at 1:07 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, Sep 04, 2025 at 07:47:11AM -0400, Brian Foster wrote:
> > > On Wed, Sep 03, 2025 at 05:35:51PM -0700, Joanne Koong wrote:
> > > > On Wed, Sep 3, 2025 at 11:44 AM Brian Foster <bfoster@redhat.com> wrote:
> > > > >
> > > > > On Tue, Sep 02, 2025 at 04:46:04PM -0700, Darrick J. Wong wrote:
> > > > > > On Fri, Aug 29, 2025 at 04:39:42PM -0700, Joanne Koong wrote:
> > > > > > > Add granular dirty and writeback accounting for large folios. These
> > > > > > > stats are used by the mm layer for dirty balancing and throttling.
> > > > > > > Having granular dirty and writeback accounting helps prevent
> > > > > > > over-aggressive balancing and throttling.
> > > > > > >
> > > > > > > There are 4 places in iomap this commit affects:
> > > > > > > a) filemap dirtying, which now calls filemap_dirty_folio_pages()
> > > > > > > b) writeback_iter with setting the wbc->no_stats_accounting bit and
> > > > > > > calling clear_dirty_for_io_stats()
> > > > > > > c) starting writeback, which now calls __folio_start_writeback()
> > > > > > > d) ending writeback, which now calls folio_end_writeback_pages()
> > > > > > >
> > > > > > > This relies on using the ifs->state dirty bitmap to track dirty pages in
> > > > > > > the folio. As such, this can only be utilized on filesystems where the
> > > > > > > block size >= PAGE_SIZE.
> > > > > >
> > > > > > Er... is this statement correct? I thought that you wanted the granular
> > > > > > dirty page accounting when it's possible that individual sub-pages of a
> > > > > > folio could be dirty.
> > > > > >
> > > > > > If i_blocksize >= PAGE_SIZE, then we'll have set the min folio order and
> > > > > > there will be exactly one (large) folio for a single fsblock. Writeback
> > > >
> > > > Oh interesting, this is the part I'm confused about. With i_blocksize
> > > > >= PAGE_SIZE, isn't there still the situation where the folio itself
> > > > could be a lot larger, like 1MB? That's what I've been seeing on fuse
> > > > where "blocksize" == PAGE_SIZE == 4096. I see that xfs sets the min
> > > > folio order through mapping_set_folio_min_order() but I'm not seeing
> > > > how that ensures "there will be exactly one large folio for a single
> > > > fsblock"? My understanding is that that only ensures the folio is at
> > > > least the size of the fsblock but that the folio size can be larger
> > > > than that too. Am I understanding this incorrectly?
> > > >
> > > > > > must happen in units of fsblocks, so there's no point in doing the extra
> > > > > > accounting calculations if there's only one fsblock.
> > > > > >
> > > > > > Waitaminute, I think the logic to decide if you're going to use the
> > > > > > granular accounting is:
> > > > > >
> > > > > > (folio_size > PAGE_SIZE && folio_size > i_blocksize)
> > > > > >
> > > >
> > > > Yeah, you're right about this - I had used "ifs && i_blocksize >=
> > > > PAGE_SIZE" as the check, which translates to "i_blocks_per_folio > 1
> > > > && i_block_size >= PAGE_SIZE", which in effect does the same thing as
> > > > what you wrote but has the additional (and now I'm realizing,
> > > > unnecessary) stipulation that block_size can't be less than PAGE_SIZE.
> > > >
> > > > > > Hrm?
> > > > > >
> > > > >
> > > > > I'm also a little confused why this needs to be restricted to blocksize
> > > > > gte PAGE_SIZE. The lower level helpers all seem to be managing block
> > > > > ranges, and then apparently just want to be able to use that directly as
> > > > > a page count (for accounting purposes).
> > > > >
> > > > > Is there any reason the lower level functions couldn't return block
> > > > > units, then the higher level code can use a blocks_per_page or some such
> > > > > to translate that to a base page count..? As Darrick points out I assume
> > > > > you'd want to shortcut the folio_nr_pages() == 1 case to use a min page
> > > > > count of 1, but otherwise ISTM that would allow this to work with
> > > > > configs like 64k pagesize and 4k blocks as well. Am I missing something?
> > > > >
> > > >
> > > > No, I don't think you're missing anything, it should have been done
> > > > like this in the first place.
> > > >
> > >
> > > Ok. Something that came to mind after thinking about this some more is
> > > whether there is risk for the accounting to get wonky.. For example,
> > > consider 4k blocks, 64k pages, and then a large folio on top of that. If
> > > a couple or so blocks are dirtied at one time, you'd presumably want to
> > > account that as the minimum of 1 dirty page. Then if a couple more
> > > blocks are dirtied in the same large folio, how do you determine whether
> > > those blocks are a newly dirtied page or part of the already accounted
> > > dirty page? I wonder if perhaps this is the value of the no sub-page
> > > sized blocks restriction, because you can imply that newly dirtied
> > > blocks means newly dirtied pages..?
> > >
> > > I suppose if that is an issue it might still be manageable. Perhaps we'd
> > > have to scan the bitmap in blks per page windows and use that to
> > > determine how many base pages are accounted for at any time. So for
> > > example, 3 dirty 4k blocks all within the same 64k page size window
> > > still accounts as 1 dirty page, vs. dirty blocks in multiple page size
> > > windows might mean multiple dirty pages, etc. That way writeback
> > > accounting remains consistent with dirty accounting. Hm?
> >
> > Yes, I think that's correct -- one has to track which basepages /were/
> > dirty, and then which ones become dirty after updating the ifs dirty
> > bitmap.
> >
> > For example, if you have a 1k fsblock filesystem, 4k base pages, and a
> > 64k folio, you could write a single byte at offset 0, then come back and
> > write to a byte at offset 1024. The first write will result in a charge
> > of one basepage, but so will the second, I think. That results
> > incharges for two dirty pages, when you've really only dirtied a single
> > basepage.
>
> Does it matter though which blocks map to which pages? AFAIU, the
> "block size" is the granularity for disk io and is not really related
> to pages (eg for writing out to disk, only the block gets written, not
> the whole page). The stats (as i understand it) are used to throttle
> how much data gets written back to disk, and the primary thing it
> cares about is how many bytes that is, not how many pages, it's just
> that it's in PAGE_SIZE granularity because prior to iomap there was no
> dirty tracking of individual blocks within a page/folio; it seems like
> it suffices then to just keep track of total # of dirty blocks,
> multiply that by blocksize, and roundup divide that by PAGE_SIZE and
> pass that to the stats.
>
I suppose it may not matter in terms of the purpose of the mechanism
itself. In fact if the whole thing could just be converted to track
bytes, at least internally, then maybe that would eliminate some of the
confusion in dealing with different granularity of units..? I have no
idea how practical or appropriate that is, though. :)
The concern Darrick and I were discussing is more about maintaining
accounting consistency in the event that we do continue translating
blocks to pages and ultimately add support for the block size < page
size case.
In that case the implication is that we'd still need to account
something when we dirty a single block out of a page (i.e. use
Darrick's example where we dirty a 1k fs block out of a 4k page). If we
round that partial page case up to 1 dirty page and repeat as each 1k
block is dirtied, then we have to make sure accounting remains
consistent in the case where we dirty account each sub-block of a page
through separate writes, but then clear dirty accounting for the entire
folio once at writeback time.
But I suppose we are projecting the implementation a bit so it might not
be worth getting this far into the weeds until you determine what
direction you want to go with this and have more code to review. All in
all, I do agree with Jan's general concern that I'd rather not have to
deal with multiple variants of sub-page state tracking in iomap. It's
already a challenge to support multiple different filesystems. This does
seem like a useful enhancement to me however, so IMO it would be fine to
just try and make it more generic (short of something more generic on
the mm side or whatever) than it is currently.
> But, as Jan pointed out to me in his comment, the stats are also used
> for monitoring the health of reclaim, so maybe it does matter then how
> the blocks translate to pages.
>
Random thought, but would having an additional/optional stat to track
bytes (alongside the existing page granularity counts) help at all? For
example, if throttling could use optional byte granular dirty/writeback
counters when they are enabled instead of the traditional page granular,
would that solve your problem and be less disruptive to other things
that actually prefer the page count?
Brian
> I'll put this patchset on hold until there's more feedback from the mm
> side as to whether we should proceed or drop this.
>
>
> Thanks,
> Joanne
>
>
> >
> > Also, does (block_size >> PAGE_SHIFT) evaluate to ... zero?
> >
> > --D
> >
> > > Brian
>
next prev parent reply other threads:[~2025-09-05 11:15 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-29 23:39 [PATCH v2 00/12] mm/iomap: add granular dirty and writeback accounting Joanne Koong
2025-08-29 23:39 ` [PATCH v2 01/12] mm: pass number of pages to __folio_start_writeback() Joanne Koong
2025-09-03 11:48 ` David Hildenbrand
2025-09-03 20:02 ` Darrick J. Wong
2025-09-03 20:05 ` David Hildenbrand
2025-09-03 23:12 ` Joanne Koong
2025-08-29 23:39 ` [PATCH v2 02/12] mm: pass number of pages to __folio_end_writeback() Joanne Koong
2025-08-29 23:39 ` [PATCH v2 03/12] mm: add folio_end_writeback_pages() helper Joanne Koong
2025-08-29 23:39 ` [PATCH v2 04/12] mm: pass number of pages dirtied to __folio_mark_dirty() Joanne Koong
2025-08-29 23:39 ` [PATCH v2 05/12] mm: add filemap_dirty_folio_pages() helper Joanne Koong
2025-08-29 23:39 ` [PATCH v2 06/12] mm: add __folio_clear_dirty_for_io() helper Joanne Koong
2025-08-29 23:39 ` [PATCH v2 07/12] mm: add no_stats_accounting bitfield to wbc Joanne Koong
2025-08-29 23:39 ` [PATCH v2 08/12] mm: refactor clearing dirty stats into helper function Joanne Koong
2025-08-29 23:39 ` [PATCH v2 09/12] mm: add clear_dirty_for_io_stats() helper Joanne Koong
2025-08-29 23:39 ` [PATCH v2 10/12] iomap: refactor dirty bitmap iteration Joanne Koong
2025-09-03 18:53 ` Brian Foster
2025-09-03 19:59 ` Darrick J. Wong
2025-08-29 23:39 ` [PATCH v2 11/12] iomap: refactor uptodate " Joanne Koong
2025-08-29 23:39 ` [PATCH v2 12/12] iomap: add granular dirty and writeback accounting Joanne Koong
2025-09-02 23:46 ` Darrick J. Wong
2025-09-03 18:48 ` Brian Foster
2025-09-04 0:35 ` Joanne Koong
2025-09-04 2:52 ` Darrick J. Wong
2025-09-04 11:47 ` Brian Foster
2025-09-04 20:07 ` Darrick J. Wong
2025-09-05 0:14 ` Joanne Koong
2025-09-05 11:19 ` Brian Foster [this message]
2025-09-05 12:43 ` Jan Kara
2025-09-04 8:53 ` [PATCH v2 00/12] mm/iomap: " Jan Kara
2025-09-04 23:59 ` Joanne Koong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aLrG_eOwiROSi-XB@bfoster \
--to=bfoster@redhat.com \
--cc=brauner@kernel.org \
--cc=djwong@kernel.org \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=jlayton@kernel.org \
--cc=joannelkoong@gmail.com \
--cc=kernel-team@meta.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).