[PATCH v7 0/3] mm: improve write performance with RWF

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH v7 0/3] mm: improve write performance with RWF_DONTCACHE
@ 2026-05-11 11:58 Jeff Layton
  2026-05-11 11:58 ` [PATCH v7 1/3] mm: preserve PG_dropbehind flag during folio split Jeff Layton
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Jeff Layton @ 2026-05-11 11:58 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm, Jeff Layton

This patch series is intended to improve write performance with
RWF_DONTCACHE. This version fixes additional stat accounting issues
found during review: integer promotion on 32-bit, cgroup writeback
domain migration, folio split flag preservation, and a UAF that could
occur in filemap_dontcache_kick_writeback().

Because there are substantive changes in this set, I've dropped the
R-b's. Please resend them if you're OK with this version.

Christian, please consider these for v7.2.

Thanks,

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v7:
- Fix integer promotion bug on 32-bit: store folio_nr_pages() in signed
  long before negating, matching existing patterns in page-writeback.c
- Transfer WB_DONTCACHE_DIRTY in inode_do_switch_wbs() alongside
  WB_RECLAIMABLE so the stat migrates with the inode on cgroup wb switch
- Preserve PG_dropbehind in __split_folio_to_order() so tail folios
  retain the flag and are properly accounted when cleaned
- Take reference to wb around unlocked_inode_to_wb_end() to avoid UAF
- Link to v6: https://lore.kernel.org/r/20260505-dontcache-v6-0-66463805dd6a@kernel.org

Changes in v6:
- Use atomic folio_test_clear_dropbehind() in __filemap_get_folio_mpol()
  to prevent double-decrement of WB_DONTCACHE_DIRTY by concurrent readers
- Add mapping_can_writeback() guard before decrementing WB_DONTCACHE_DIRTY
  in __filemap_get_folio_mpol() to match the increment path
- Use wb_stat_sum() instead of wb_stat() in wb_check_start_dontcache() so
  small writes below the percpu batch threshold are visible to the flusher
- Use test_and_clear_bit for WB_start_dontcache before starting writeback
  to prevent lost wakeups from concurrent DONTCACHE writers
- Move wb_wakeup() outside the unlocked_inode_to_wb_begin/end section in
  filemap_dontcache_kick_writeback() to avoid spin_unlock_irq() re-enabling
  interrupts while the i_pages xa_lock is held during cgroup writeback switch
- Drop Reviewed-by tags due to substantive changes
- Link to v5: https://lore.kernel.org/r/20260504-dontcache-v5-0-4103e58bb377@kernel.org

Changes in v5:
- Flesh out comment over filemap_dontcache_kick_writeback()
- Drop testcases from posting
- Link to v4: https://lore.kernel.org/r/20260501-dontcache-v4-0-5d5e6dc71cb3@kernel.org

Changes in v4:
- Track DONTCACHE dirty pages per bdi_writeback
- New benchmark for competing buffered and dontcache writers
- New benchmark replicating Jens' original 32 concurrent writer test
- Link to v3: https://lore.kernel.org/r/20260426-dontcache-v3-0-79eb37da9547@kernel.org

Changes in v3:
- Track dirty DONTCACHE pages in the VM
- Have flusher write back a proportional number of pages after DONTCACHE write
- Link to v2: https://lore.kernel.org/r/20260408-dontcache-v2-0-948dec1e756b@kernel.org

Changes in v2:
- kick flusher thread instead of initiating writeback inline
- add mechanism to run 'perf lock' around the testcases
- Link to v1: https://lore.kernel.org/r/20260401-dontcache-v1-0-1f5746fab47a@kernel.org

---
Jeff Layton (3):
      mm: preserve PG_dropbehind flag during folio split
      mm: track DONTCACHE dirty pages per bdi_writeback
      mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking

 fs/fs-writeback.c                | 67 ++++++++++++++++++++++++++++++++++++++++
 include/linux/backing-dev-defs.h |  3 ++
 include/linux/fs.h               |  6 ++--
 include/trace/events/writeback.h |  3 +-
 mm/filemap.c                     | 15 +++++++--
 mm/huge_memory.c                 |  1 +
 mm/page-writeback.c              |  6 ++++
 7 files changed, 94 insertions(+), 7 deletions(-)
---
base-commit: 7e2326f4275c11652e1fdaae11de06159fef1d90
change-id: 20260401-dontcache-5811efd7eaf3

Best regards,
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v7 1/3] mm: preserve PG_dropbehind flag during folio split
  2026-05-11 11:58 [PATCH v7 0/3] mm: improve write performance with RWF_DONTCACHE Jeff Layton
@ 2026-05-11 11:58 ` Jeff Layton
  2026-05-11 12:38   ` David Hildenbrand (Arm)
  2026-05-12 13:58   ` Jan Kara
  2026-05-11 11:58 ` [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback Jeff Layton
  2026-05-11 11:58 ` [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking Jeff Layton
  2 siblings, 2 replies; 16+ messages in thread
From: Jeff Layton @ 2026-05-11 11:58 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm, Jeff Layton

__split_folio_to_order() copies page flags from the original folio to
newly created sub-folios using an explicit allowlist, but PG_dropbehind
is not included. When a large folio with PG_dropbehind set is split,
only the head sub-folio retains the flag; all tail sub-folios silently
lose it and will not be reclaimed eagerly after writeback completes.

Add PG_dropbehind to the flag copy mask so that the drop-behind hint
is preserved across folio splits.

Fixes: a323281cdfec ("mm: add PG_dropbehind folio flag")
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 mm/huge_memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..e01917b14d1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3642,6 +3642,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 				 (1L << PG_arch_3) |
 #endif
 				 (1L << PG_dirty) |
+				 (1L << PG_dropbehind) |
 				 LRU_GEN_MASK | LRU_REFS_MASK));
 
 		if (handle_hwpoison &&

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 1/3] mm: preserve PG_dropbehind flag during folio split
  2026-05-11 11:58 ` [PATCH v7 1/3] mm: preserve PG_dropbehind flag during folio split Jeff Layton
@ 2026-05-11 12:38   ` David Hildenbrand (Arm)
  2026-05-12 13:58   ` Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-11 12:38 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm

On 5/11/26 13:58, Jeff Layton wrote:
> __split_folio_to_order() copies page flags from the original folio to
> newly created sub-folios using an explicit allowlist, but PG_dropbehind
> is not included. When a large folio with PG_dropbehind set is split,
> only the head sub-folio retains the flag; all tail sub-folios silently
> lose it and will not be reclaimed eagerly after writeback completes.
> 
> Add PG_dropbehind to the flag copy mask so that the drop-behind hint
> is preserved across folio splits.
> 
> Fixes: a323281cdfec ("mm: add PG_dropbehind folio flag")
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 1/3] mm: preserve PG_dropbehind flag during folio split
  2026-05-11 11:58 ` [PATCH v7 1/3] mm: preserve PG_dropbehind flag during folio split Jeff Layton
  2026-05-11 12:38   ` David Hildenbrand (Arm)
@ 2026-05-12 13:58   ` Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2026-05-12 13:58 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Chuck Lever, linux-fsdevel, linux-kernel,
	linux-nfs, linux-mm

On Mon 11-05-26 07:58:27, Jeff Layton wrote:
> __split_folio_to_order() copies page flags from the original folio to
> newly created sub-folios using an explicit allowlist, but PG_dropbehind
> is not included. When a large folio with PG_dropbehind set is split,
> only the head sub-folio retains the flag; all tail sub-folios silently
> lose it and will not be reclaimed eagerly after writeback completes.
> 
> Add PG_dropbehind to the flag copy mask so that the drop-behind hint
> is preserved across folio splits.
> 
> Fixes: a323281cdfec ("mm: add PG_dropbehind folio flag")
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

I'm not really very knowledgeable in this code but the change looks good to
me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/huge_memory.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 970e077019b7..e01917b14d1a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3642,6 +3642,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>  				 (1L << PG_arch_3) |
>  #endif
>  				 (1L << PG_dirty) |
> +				 (1L << PG_dropbehind) |
>  				 LRU_GEN_MASK | LRU_REFS_MASK));
>  
>  		if (handle_hwpoison &&
> 
> -- 
> 2.54.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback
  2026-05-11 11:58 [PATCH v7 0/3] mm: improve write performance with RWF_DONTCACHE Jeff Layton
  2026-05-11 11:58 ` [PATCH v7 1/3] mm: preserve PG_dropbehind flag during folio split Jeff Layton
@ 2026-05-11 11:58 ` Jeff Layton
  2026-05-11 13:10   ` Christian Brauner
                     ` (2 more replies)
  2026-05-11 11:58 ` [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking Jeff Layton
  2 siblings, 3 replies; 16+ messages in thread
From: Jeff Layton @ 2026-05-11 11:58 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm, Jeff Layton

Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
writes).

Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
when the folio has the dropbehind flag set, and decrement it in
folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
when a non-DONTCACHE lookup atomically clears the dropbehind flag on a
dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind()
to prevent concurrent lookups from double-decrementing the counter, and
guarding the decrement with mapping_can_writeback() to match the increment
path.

Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so
that the stat is properly migrated when an inode switches cgroup writeback
domains.

The counter will be used by the writeback flusher to determine how many
pages to write back when expediting writeback for IOCB_DONTCACHE writes,
without flushing the entire BDI's dirty pages.

Suggested-by: Jan Kara <jack@suse.cz>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/fs-writeback.c                |  4 ++++
 include/linux/backing-dev-defs.h |  1 +
 mm/filemap.c                     | 15 +++++++++++++--
 mm/page-writeback.c              |  6 ++++++
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a65694cbfe68..32ecc745f5f7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -432,6 +432,10 @@ static bool inode_do_switch_wbs(struct inode *inode,
 			long nr = folio_nr_pages(folio);
 			wb_stat_mod(old_wb, WB_RECLAIMABLE, -nr);
 			wb_stat_mod(new_wb, WB_RECLAIMABLE, nr);
+			if (folio_test_dropbehind(folio)) {
+				wb_stat_mod(old_wb, WB_DONTCACHE_DIRTY, -nr);
+				wb_stat_mod(new_wb, WB_DONTCACHE_DIRTY, nr);
+			}
 		}
 	}
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index a06b93446d10..cb660dd37286 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -33,6 +33,7 @@ enum wb_stat_item {
 	WB_WRITEBACK,
 	WB_DIRTIED,
 	WB_WRITTEN,
+	WB_DONTCACHE_DIRTY,
 	NR_WB_STAT_ITEMS
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..179f2886f8c0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2052,8 +2052,19 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
 	if (!folio)
 		return ERR_PTR(-ENOENT);
 	/* not an uncached lookup, clear uncached if set */
-	if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE))
-		folio_clear_dropbehind(folio);
+	if (!(fgp_flags & FGP_DONTCACHE) && folio_test_clear_dropbehind(folio)) {
+		if (folio_test_dirty(folio) &&
+		    mapping_can_writeback(mapping)) {
+			struct inode *inode = mapping->host;
+			struct bdi_writeback *wb;
+			struct wb_lock_cookie cookie = {};
+			long nr = folio_nr_pages(folio);
+
+			wb = unlocked_inode_to_wb_begin(inode, &cookie);
+			wb_stat_mod(wb, WB_DONTCACHE_DIRTY, -nr);
+			unlocked_inode_to_wb_end(inode, &cookie);
+		}
+	}
 	return folio;
 }
 EXPORT_SYMBOL(__filemap_get_folio_mpol);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 88cd53d4ba09..8e520717d1f6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2630,6 +2630,8 @@ static void folio_account_dirtied(struct folio *folio,
 		wb = inode_to_wb(inode);
 
 		lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, nr);
+		if (folio_test_dropbehind(folio))
+			wb_stat_mod(wb, WB_DONTCACHE_DIRTY, nr);
 		__zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
 		__node_stat_mod_folio(folio, NR_DIRTIED, nr);
 		wb_stat_mod(wb, WB_RECLAIMABLE, nr);
@@ -2651,6 +2653,8 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
 	long nr = folio_nr_pages(folio);
 
 	lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
+	if (folio_test_dropbehind(folio))
+		wb_stat_mod(wb, WB_DONTCACHE_DIRTY, -nr);
 	zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
 	wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
 	task_io_account_cancelled_write(nr * PAGE_SIZE);
@@ -2920,6 +2924,8 @@ bool folio_clear_dirty_for_io(struct folio *folio)
 		if (folio_test_clear_dirty(folio)) {
 			long nr = folio_nr_pages(folio);
 			lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
+			if (folio_test_dropbehind(folio))
+				wb_stat_mod(wb, WB_DONTCACHE_DIRTY, -nr);
 			zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
 			wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
 			ret = true;

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback
  2026-05-11 11:58 ` [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback Jeff Layton
@ 2026-05-11 13:10   ` Christian Brauner
  2026-05-11 13:29     ` Jeff Layton
  2026-05-12 14:07   ` Jan Kara
  2026-05-13  2:07   ` Ritesh Harjani
  2 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2026-05-11 13:10 UTC (permalink / raw)
  To: Jeff Layton, Linus Torvalds, Christoph Hellwig
  Cc: Alexander Viro, Jan Kara, Matthew Wilcox (Oracle), Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Mike Snitzer, Jens Axboe, Ritesh Harjani, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm

On Mon, May 11, 2026 at 07:58:28AM -0400, Jeff Layton wrote:
> Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
> pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
> writes).
> 
> Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
> when the folio has the dropbehind flag set, and decrement it in
> folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
> when a non-DONTCACHE lookup atomically clears the dropbehind flag on a
> dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind()
> to prevent concurrent lookups from double-decrementing the counter, and
> guarding the decrement with mapping_can_writeback() to match the increment
> path.
> 
> Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so
> that the stat is properly migrated when an inode switches cgroup writeback
> domains.
> 
> The counter will be used by the writeback flusher to determine how many
> pages to write back when expediting writeback for IOCB_DONTCACHE writes,
> without flushing the entire BDI's dirty pages.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Assisted-by: Claude:claude-opus-4-6

Picking up on something we discussed at LSFMM in one of the sessions as
an aside rant: I find these AI Assisted-by tags so useless tbh and just
pure noise in the git log _especially_ for a core developer like Jeff
that I really don't see the point of them and I'm always tempted to just
remove the tags when I apply. I have dropped them before because I found
them so pointless.

Crediting Jan here is the right thing to do and it provides actual value
and also just makes sure that a real person who spent time helping out
gets visibility in the git history. Why we should extend the same
courtesy to automated tooling is really beyond me. Somehow we've become
all convinced that these tools require a special status but have spent
months arguing about the usefulness of other tags.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback
  2026-05-11 13:10   ` Christian Brauner
@ 2026-05-11 13:29     ` Jeff Layton
  2026-05-11 13:34       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2026-05-11 13:29 UTC (permalink / raw)
  To: Christian Brauner, Linus Torvalds, Christoph Hellwig
  Cc: Alexander Viro, Jan Kara, Matthew Wilcox (Oracle), Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Mike Snitzer, Jens Axboe, Ritesh Harjani, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm

On Mon, 2026-05-11 at 15:10 +0200, Christian Brauner wrote:
> On Mon, May 11, 2026 at 07:58:28AM -0400, Jeff Layton wrote:
> > Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
> > pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
> > writes).
> > 
> > Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
> > when the folio has the dropbehind flag set, and decrement it in
> > folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
> > when a non-DONTCACHE lookup atomically clears the dropbehind flag on a
> > dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind()
> > to prevent concurrent lookups from double-decrementing the counter, and
> > guarding the decrement with mapping_can_writeback() to match the increment
> > path.
> > 
> > Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so
> > that the stat is properly migrated when an inode switches cgroup writeback
> > domains.
> > 
> > The counter will be used by the writeback flusher to determine how many
> > pages to write back when expediting writeback for IOCB_DONTCACHE writes,
> > without flushing the entire BDI's dirty pages.
> > 
> > Suggested-by: Jan Kara <jack@suse.cz>
> > Assisted-by: Claude:claude-opus-4-6
> 
> Picking up on something we discussed at LSFMM in one of the sessions as
> an aside rant: I find these AI Assisted-by tags so useless tbh and just
> pure noise in the git log _especially_ for a core developer like Jeff
> that I really don't see the point of them and I'm always tempted to just
> remove the tags when I apply. I have dropped them before because I found
> them so pointless.
> 
> Crediting Jan here is the right thing to do and it provides actual value
> and also just makes sure that a real person who spent time helping out
> gets visibility in the git history. Why we should extend the same
> courtesy to automated tooling is really beyond me. Somehow we've become
> all convinced that these tools require a special status but have spent
> months arguing about the usefulness of other tags.

To be clear, Christoph and Ritesh also contributed a lot of review and
suggestions.

I was mainly trying to follow this new verbiage in
Documentation/process/submitting-patches.rst:

------------------8<-------------------
Using Assisted-by:
------------------
If you used any sort of advanced coding tool in the creation of your patch,
you need to acknowledge that use by adding an Assisted-by tag.  Failure to
do so may impede the acceptance of your work.  Please see
Documentation/process/coding-assistants.rst for details regarding the
acknowledgment of coding assistants.
------------------8<-------------------

If we're demanding this from anyone, then we should demand it from
everyone. I don't think we want one set of rules for core contributors
and another set for other folks.

As to whether we should add them at all -- I don't know. I think it
really comes down to what we intend to do with this info. I'll play
devil's advocate for the moment:

The cost of adding these tags is low. It's just a few extra bits in the
repo. Maybe this could eventually have historical value?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback
  2026-05-11 13:29     ` Jeff Layton
@ 2026-05-11 13:34       ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-11 13:34 UTC (permalink / raw)
  To: Jeff Layton, Christian Brauner, Linus Torvalds, Christoph Hellwig
  Cc: Alexander Viro, Jan Kara, Matthew Wilcox (Oracle), Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Chuck Lever, linux-fsdevel, linux-kernel,
	linux-nfs, linux-mm

On 5/11/26 15:29, Jeff Layton wrote:
> On Mon, 2026-05-11 at 15:10 +0200, Christian Brauner wrote:
>> On Mon, May 11, 2026 at 07:58:28AM -0400, Jeff Layton wrote:
>>> Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
>>> pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
>>> writes).
>>>
>>> Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
>>> when the folio has the dropbehind flag set, and decrement it in
>>> folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
>>> when a non-DONTCACHE lookup atomically clears the dropbehind flag on a
>>> dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind()
>>> to prevent concurrent lookups from double-decrementing the counter, and
>>> guarding the decrement with mapping_can_writeback() to match the increment
>>> path.
>>>
>>> Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so
>>> that the stat is properly migrated when an inode switches cgroup writeback
>>> domains.
>>>
>>> The counter will be used by the writeback flusher to determine how many
>>> pages to write back when expediting writeback for IOCB_DONTCACHE writes,
>>> without flushing the entire BDI's dirty pages.
>>>
>>> Suggested-by: Jan Kara <jack@suse.cz>
>>> Assisted-by: Claude:claude-opus-4-6
>>
>> Picking up on something we discussed at LSFMM in one of the sessions as
>> an aside rant: I find these AI Assisted-by tags so useless tbh and just
>> pure noise in the git log _especially_ for a core developer like Jeff
>> that I really don't see the point of them and I'm always tempted to just
>> remove the tags when I apply. I have dropped them before because I found
>> them so pointless.
>>
>> Crediting Jan here is the right thing to do and it provides actual value
>> and also just makes sure that a real person who spent time helping out
>> gets visibility in the git history. Why we should extend the same
>> courtesy to automated tooling is really beyond me. Somehow we've become
>> all convinced that these tools require a special status but have spent
>> months arguing about the usefulness of other tags.
> 
> To be clear, Christoph and Ritesh also contributed a lot of review and
> suggestions.
> 
> I was mainly trying to follow this new verbiage in
> Documentation/process/submitting-patches.rst:
> 
> ------------------8<-------------------
> Using Assisted-by:
> ------------------
> If you used any sort of advanced coding tool in the creation of your patch,
> you need to acknowledge that use by adding an Assisted-by tag.  Failure to
> do so may impede the acceptance of your work.  Please see
> Documentation/process/coding-assistants.rst for details regarding the
> acknowledgment of coding assistants.
> ------------------8<-------------------
> 
> If we're demanding this from anyone, then we should demand it from
> everyone. I don't think we want one set of rules for core contributors
> and another set for other folks.
> 
> As to whether we should add them at all -- I don't know. I think it
> really comes down to what we intend to do with this info. I'll play
> devil's advocate for the moment:
> 
> The cost of adding these tags is low. It's just a few extra bits in the
> repo. Maybe this could eventually have historical value?

As we don't really know "how" AI tooling helped. it's pretty useless without a
more detailed description I'm afraid.

OTOH, we want people (in particular not trusted community members) to indicate
that it might all just be unchecked AI output.

Hm.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback
  2026-05-11 11:58 ` [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback Jeff Layton
  2026-05-11 13:10   ` Christian Brauner
@ 2026-05-12 14:07   ` Jan Kara
  2026-05-13  2:07   ` Ritesh Harjani
  2 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2026-05-12 14:07 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Chuck Lever, linux-fsdevel, linux-kernel,
	linux-nfs, linux-mm

On Mon 11-05-26 07:58:28, Jeff Layton wrote:
> Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
> pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
> writes).
> 
> Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
> when the folio has the dropbehind flag set, and decrement it in
> folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
> when a non-DONTCACHE lookup atomically clears the dropbehind flag on a
> dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind()
> to prevent concurrent lookups from double-decrementing the counter, and
> guarding the decrement with mapping_can_writeback() to match the increment
> path.
> 
> Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so
> that the stat is properly migrated when an inode switches cgroup writeback
> domains.
> 
> The counter will be used by the writeback flusher to determine how many
> pages to write back when expediting writeback for IOCB_DONTCACHE writes,
> without flushing the entire BDI's dirty pages.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fs-writeback.c                |  4 ++++
>  include/linux/backing-dev-defs.h |  1 +
>  mm/filemap.c                     | 15 +++++++++++++--
>  mm/page-writeback.c              |  6 ++++++
>  4 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index a65694cbfe68..32ecc745f5f7 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -432,6 +432,10 @@ static bool inode_do_switch_wbs(struct inode *inode,
>  			long nr = folio_nr_pages(folio);
>  			wb_stat_mod(old_wb, WB_RECLAIMABLE, -nr);
>  			wb_stat_mod(new_wb, WB_RECLAIMABLE, nr);
> +			if (folio_test_dropbehind(folio)) {
> +				wb_stat_mod(old_wb, WB_DONTCACHE_DIRTY, -nr);
> +				wb_stat_mod(new_wb, WB_DONTCACHE_DIRTY, nr);
> +			}
>  		}
>  	}
>  
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index a06b93446d10..cb660dd37286 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -33,6 +33,7 @@ enum wb_stat_item {
>  	WB_WRITEBACK,
>  	WB_DIRTIED,
>  	WB_WRITTEN,
> +	WB_DONTCACHE_DIRTY,
>  	NR_WB_STAT_ITEMS
>  };
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4e636647100c..179f2886f8c0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2052,8 +2052,19 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
>  	if (!folio)
>  		return ERR_PTR(-ENOENT);
>  	/* not an uncached lookup, clear uncached if set */
> -	if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE))
> -		folio_clear_dropbehind(folio);
> +	if (!(fgp_flags & FGP_DONTCACHE) && folio_test_clear_dropbehind(folio)) {
> +		if (folio_test_dirty(folio) &&
> +		    mapping_can_writeback(mapping)) {
> +			struct inode *inode = mapping->host;
> +			struct bdi_writeback *wb;
> +			struct wb_lock_cookie cookie = {};
> +			long nr = folio_nr_pages(folio);
> +
> +			wb = unlocked_inode_to_wb_begin(inode, &cookie);
> +			wb_stat_mod(wb, WB_DONTCACHE_DIRTY, -nr);
> +			unlocked_inode_to_wb_end(inode, &cookie);
> +		}
> +	}
>  	return folio;
>  }
>  EXPORT_SYMBOL(__filemap_get_folio_mpol);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 88cd53d4ba09..8e520717d1f6 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2630,6 +2630,8 @@ static void folio_account_dirtied(struct folio *folio,
>  		wb = inode_to_wb(inode);
>  
>  		lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, nr);
> +		if (folio_test_dropbehind(folio))
> +			wb_stat_mod(wb, WB_DONTCACHE_DIRTY, nr);
>  		__zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
>  		__node_stat_mod_folio(folio, NR_DIRTIED, nr);
>  		wb_stat_mod(wb, WB_RECLAIMABLE, nr);
> @@ -2651,6 +2653,8 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
>  	long nr = folio_nr_pages(folio);
>  
>  	lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
> +	if (folio_test_dropbehind(folio))
> +		wb_stat_mod(wb, WB_DONTCACHE_DIRTY, -nr);
>  	zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
>  	wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
>  	task_io_account_cancelled_write(nr * PAGE_SIZE);
> @@ -2920,6 +2924,8 @@ bool folio_clear_dirty_for_io(struct folio *folio)
>  		if (folio_test_clear_dirty(folio)) {
>  			long nr = folio_nr_pages(folio);
>  			lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
> +			if (folio_test_dropbehind(folio))
> +				wb_stat_mod(wb, WB_DONTCACHE_DIRTY, -nr);
>  			zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
>  			wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
>  			ret = true;
> 
> -- 
> 2.54.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback
  2026-05-11 11:58 ` [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback Jeff Layton
  2026-05-11 13:10   ` Christian Brauner
  2026-05-12 14:07   ` Jan Kara
@ 2026-05-13  2:07   ` Ritesh Harjani
  2 siblings, 0 replies; 16+ messages in thread
From: Ritesh Harjani @ 2026-05-13  2:07 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm, Jeff Layton

Jeff Layton <jlayton@kernel.org> writes:

> Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
> pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
> writes).
>
> Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
> when the folio has the dropbehind flag set, and decrement it in
> folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
> when a non-DONTCACHE lookup atomically clears the dropbehind flag on a
> dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind()
> to prevent concurrent lookups from double-decrementing the counter, and
> guarding the decrement with mapping_can_writeback() to match the increment
> path.
>
> Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so
> that the stat is properly migrated when an inode switches cgroup writeback
> domains.
>
> The counter will be used by the writeback flusher to determine how many
> pages to write back when expediting writeback for IOCB_DONTCACHE writes,
> without flushing the entire BDI's dirty pages.
>

Using wb_stat infra was a clever thing to do for counting the number of
dontcache folios.
I see that we don't collect DONTCACHE stats in collect_wb_stats(), which
I guess, is mainly for debug purposes only. Either ways I am not sure
how useful that might be, since it only shows the approximate stats
since it doesn't do wb_stat_sum() for most of the other stats too.
If for any reason we need that in future, that could go in a separate patch.

For this patch - LGTM. Please feel free to add:
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
  2026-05-11 11:58 [PATCH v7 0/3] mm: improve write performance with RWF_DONTCACHE Jeff Layton
  2026-05-11 11:58 ` [PATCH v7 1/3] mm: preserve PG_dropbehind flag during folio split Jeff Layton
  2026-05-11 11:58 ` [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback Jeff Layton
@ 2026-05-11 11:58 ` Jeff Layton
  2026-05-11 13:24   ` Christian Brauner
                     ` (2 more replies)
  2 siblings, 3 replies; 16+ messages in thread
From: Jeff Layton @ 2026-05-11 11:58 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm, Jeff Layton

The IOCB_DONTCACHE writeback path in generic_write_sync() calls
filemap_flush_range() on every write, submitting writeback inline in
the writer's context.  Perf lock contention profiling shows the
performance problem is not lock contention but the writeback submission
work itself — walking the page tree and submitting I/O blocks the writer
for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
(dontcache).

Replace the inline filemap_flush_range() call with a flusher kick that
drains dirty pages in the background.  This moves writeback submission
completely off the writer's hot path.

To avoid flushing unrelated buffered dirty data, add a dedicated
WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
write back.  The flusher writes back that many pages from the oldest dirty
inodes (not restricted to dontcache-specific inodes). This helps
preserve I/O batching while limiting the scope of expedited writeback.

Like WB_start_all, the WB_start_dontcache bit coalesces multiple
DONTCACHE writes into a single flusher wakeup without per-write
allocations.  Use test_and_clear_bit to atomically consume the kick
request before reading the dirty counter and starting writeback, so that
concurrent DONTCACHE writes during writeback can re-set the bit and
schedule a follow-up flusher run.

Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
rather than wb_stat() (which reads only the global counter) to ensure
small writes below the percpu batch threshold are visible to the flusher.

In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
inside the unlocked_inode_to_wb_begin/end section for correct cgroup
writeback domain targeting, but defer the wb_wakeup() call until after
the section ends, since wb_wakeup() uses spin_unlock_irq() which would
unconditionally re-enable interrupts while the i_pages xa_lock may still
be held under irqsave during a cgroup writeback switch. Pin the wb with
wb_get() inside the RCU critical section before calling wb_wakeup()
outside it, since cgroup bdi_writeback structures are RCU-freed and the
wb pointer could become invalid after unlocked_inode_to_wb_end() drops
the RCU read lock.

Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
visibility.

dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
xfs on NVMe, fio io_uring):

Buffered and direct I/O paths are unaffected by this patchset. All
improvements are confined to the dontcache path:

Single-stream throughput (MB/s):
                        Before    After    Change
  seq-write/dontcache      298      897    +201%
  rand-write/dontcache     131      236     +80%

Tail latency improvements (seq-write/dontcache):
  p99:    135,266 us  ->  23,986 us   (-82%)
  p99.9: 8,925,479 us ->  28,443 us   (-99.7%)

Multi-writer (4 jobs, sequential write):
                                Before    After    Change
  dontcache aggregate (MB/s)     2,529    4,532     +79%
  dontcache p99 (us)             8,553    1,002     -88%
  dontcache p99.9 (us)         109,314    1,057     -99%

  Dontcache multi-writer throughput now matches buffered (4,532 vs
  4,616 MB/s).

32-file write (Axboe test):
                                Before    After    Change
  dontcache aggregate (MB/s)     1,548    3,499    +126%
  dontcache p99 (us)            10,170      602     -94%
  Peak dirty pages (MB)          1,837      213     -88%

  Dontcache now reaches 81% of buffered throughput (was 35%).

Competing writers (dontcache vs buffered, separate files):
                                Before    After
  buffered writer                  868      433 MB/s
  dontcache writer                 415      433 MB/s
  Aggregate                      1,284      866 MB/s

  Previously the buffered writer starved the dontcache writer 2:1.
  With per-bdi_writeback tracking, both writers now receive equal
  bandwidth. The aggregate matches the buffered-vs-buffered baseline
  (863 MB/s), indicating fair sharing regardless of I/O mode.

  The dontcache writer's p99.9 latency collapsed from 119 ms to
  33 ms (-73%), eliminating the severe periodic stalls seen in the
  baseline. Both writers now share identical latency profiles,
  matching the buffered-vs-buffered pattern.

The per-bdi_writeback dirty tracking dramatically reduces peak dirty
pages in dontcache workloads, with the 32-file test dropping from
1.8 GB to 213 MB. Dontcache sequential write throughput triples and
multi-writer throughput reaches parity with buffered I/O, with tail
latencies collapsing by 1-2 orders of magnitude.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/fs-writeback.c                | 63 ++++++++++++++++++++++++++++++++++++++++
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h               |  6 ++--
 include/trace/events/writeback.h |  3 +-
 4 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 32ecc745f5f7..77d53df97cc3 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2377,6 +2377,27 @@ static long wb_check_start_all(struct bdi_writeback *wb)
 	return nr_pages;
 }

+static long wb_check_start_dontcache(struct bdi_writeback *wb)
+{
+	long nr_pages;
+
+	if (!test_and_clear_bit(WB_start_dontcache, &wb->state))
+		return 0;
+
+	nr_pages = wb_stat_sum(wb, WB_DONTCACHE_DIRTY);
+	if (nr_pages) {
+		struct wb_writeback_work work = {
+			.nr_pages	= nr_pages,
+			.sync_mode	= WB_SYNC_NONE,
+			.range_cyclic	= 1,
+			.reason		= WB_REASON_DONTCACHE,
+		};
+
+		nr_pages = wb_writeback(wb, &work);
+	}
+
+	return nr_pages;
+}

 /*
  * Retrieve work items and do the writeback they describe
@@ -2398,6 +2419,11 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 	 */
 	wrote += wb_check_start_all(wb);

+	/*
+	 * Check for dontcache writeback request
+	 */
+	wrote += wb_check_start_dontcache(wb);
+
 	/*
 	 * Check for periodic writeback, kupdated() style
 	 */
@@ -2472,6 +2498,43 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
 	rcu_read_unlock();
 }

+/**
+ * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
+ * @mapping:	address_space that was just written to
+ *
+ * Kick the writeback flusher thread to expedite writeback of dontcache dirty
+ * pages. Queue writeback for the inode's wb for as many pages as there are
+ * dontcache pages, but don't restrict writeback to dontcache pages only.
+ *
+ * This significantly improves performance over either writing all wb's pages
+ * or writing only dontcache pages.  Although it doesn't guarantee quick
+ * writeback and reclaim of dontcache pages, it keeps the amount of dirty pages
+ * in check. Over longer term dontcache pages get written and reclaimed by
+ * background writeback even with this rough heuristic.
+ */
+void filemap_dontcache_kick_writeback(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct bdi_writeback *wb;
+	struct wb_lock_cookie cookie = {};
+	bool need_wakeup = false;
+
+	wb = unlocked_inode_to_wb_begin(inode, &cookie);
+	if (wb_has_dirty_io(wb) &&
+	    !test_bit(WB_start_dontcache, &wb->state) &&
+	    !test_and_set_bit(WB_start_dontcache, &wb->state)) {
+		wb_get(wb);
+		need_wakeup = true;
+	}
+	unlocked_inode_to_wb_end(inode, &cookie);
+
+	if (need_wakeup) {
+		wb_wakeup(wb);
+		wb_put(wb);
+	}
+}
+EXPORT_SYMBOL_GPL(filemap_dontcache_kick_writeback);
+
 /*
  * Wakeup the flusher threads to start writeback of all currently dirty pages
  */
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index cb660dd37286..4f1084937315 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -26,6 +26,7 @@ enum wb_state {
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 	WB_start_all,		/* nr_pages == 0 (all) work pending */
+	WB_start_dontcache,	/* dontcache writeback pending */
 };

 enum wb_stat_item {
@@ -56,6 +57,7 @@ enum wb_reason {
 	 */
 	WB_REASON_FORKER_THREAD,
 	WB_REASON_FOREIGN_FLUSH,
+	WB_REASON_DONTCACHE,

 	WB_REASON_MAX,
 };
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..df72b42a9e9b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2624,6 +2624,7 @@ extern int __must_check file_write_and_wait_range(struct file *file,
 						loff_t start, loff_t end);
 int filemap_flush_range(struct address_space *mapping, loff_t start,
 		loff_t end);
+void filemap_dontcache_kick_writeback(struct address_space *mapping);

 static inline int file_write_and_wait(struct file *file)
 {
@@ -2657,10 +2658,7 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
 		if (ret)
 			return ret;
 	} else if (iocb->ki_flags & IOCB_DONTCACHE) {
-		struct address_space *mapping = iocb->ki_filp->f_mapping;
-
-		filemap_flush_range(mapping, iocb->ki_pos - count,
-				iocb->ki_pos - 1);
+		filemap_dontcache_kick_writeback(iocb->ki_filp->f_mapping);
 	}

 	return count;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index bdac0d685a98..13ee076ccd16 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -44,7 +44,8 @@
 	EM( WB_REASON_PERIODIC,			"periodic")		\
 	EM( WB_REASON_FS_FREE_SPACE,		"fs_free_space")	\
 	EM( WB_REASON_FORKER_THREAD,		"forker_thread")	\
-	EMe(WB_REASON_FOREIGN_FLUSH,		"foreign_flush")
+	EM( WB_REASON_FOREIGN_FLUSH,		"foreign_flush")	\
+	EMe(WB_REASON_DONTCACHE,		"dontcache")

 WB_WORK_REASON

-- 
2.54.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
  2026-05-11 11:58 ` [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking Jeff Layton
@ 2026-05-11 13:24   ` Christian Brauner
  2026-05-11 13:53     ` Jeff Layton
  2026-05-12 14:17   ` Jan Kara
  2026-05-13  3:01   ` Ritesh Harjani
  2 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2026-05-11 13:24 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Jan Kara, Matthew Wilcox (Oracle), Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Mike Snitzer, Jens Axboe, Ritesh Harjani, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm

On Mon, May 11, 2026 at 07:58:29AM -0400, Jeff Layton wrote:
> The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> filemap_flush_range() on every write, submitting writeback inline in
> the writer's context.  Perf lock contention profiling shows the
> performance problem is not lock contention but the writeback submission
> work itself — walking the page tree and submitting I/O blocks the writer
> for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> (dontcache).
> 
> Replace the inline filemap_flush_range() call with a flusher kick that
> drains dirty pages in the background.  This moves writeback submission
> completely off the writer's hot path.
> 
> To avoid flushing unrelated buffered dirty data, add a dedicated
> WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
> write back.  The flusher writes back that many pages from the oldest dirty
> inodes (not restricted to dontcache-specific inodes). This helps
> preserve I/O batching while limiting the scope of expedited writeback.
> 
> Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> DONTCACHE writes into a single flusher wakeup without per-write
> allocations.  Use test_and_clear_bit to atomically consume the kick
> request before reading the dirty counter and starting writeback, so that
> concurrent DONTCACHE writes during writeback can re-set the bit and
> schedule a follow-up flusher run.
> 
> Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
> rather than wb_stat() (which reads only the global counter) to ensure
> small writes below the percpu batch threshold are visible to the flusher.
> 
> In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
> inside the unlocked_inode_to_wb_begin/end section for correct cgroup
> writeback domain targeting, but defer the wb_wakeup() call until after
> the section ends, since wb_wakeup() uses spin_unlock_irq() which would
> unconditionally re-enable interrupts while the i_pages xa_lock may still
> be held under irqsave during a cgroup writeback switch. Pin the wb with
> wb_get() inside the RCU critical section before calling wb_wakeup()
> outside it, since cgroup bdi_writeback structures are RCU-freed and the
> wb pointer could become invalid after unlocked_inode_to_wb_end() drops
> the RCU read lock.
> 
> Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> visibility.
> 
> dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
> xfs on NVMe, fio io_uring):
> 
> Buffered and direct I/O paths are unaffected by this patchset. All
> improvements are confined to the dontcache path:
> 
> Single-stream throughput (MB/s):
>                         Before    After    Change
>   seq-write/dontcache      298      897    +201%
>   rand-write/dontcache     131      236     +80%
> 
> Tail latency improvements (seq-write/dontcache):
>   p99:    135,266 us  ->  23,986 us   (-82%)
>   p99.9: 8,925,479 us ->  28,443 us   (-99.7%)
> 
> Multi-writer (4 jobs, sequential write):
>                                 Before    After    Change
>   dontcache aggregate (MB/s)     2,529    4,532     +79%
>   dontcache p99 (us)             8,553    1,002     -88%
>   dontcache p99.9 (us)         109,314    1,057     -99%
> 
>   Dontcache multi-writer throughput now matches buffered (4,532 vs
>   4,616 MB/s).
> 
> 32-file write (Axboe test):
>                                 Before    After    Change
>   dontcache aggregate (MB/s)     1,548    3,499    +126%
>   dontcache p99 (us)            10,170      602     -94%
>   Peak dirty pages (MB)          1,837      213     -88%
> 
>   Dontcache now reaches 81% of buffered throughput (was 35%).
> 
> Competing writers (dontcache vs buffered, separate files):
>                                 Before    After
>   buffered writer                  868      433 MB/s
>   dontcache writer                 415      433 MB/s
>   Aggregate                      1,284      866 MB/s
> 
>   Previously the buffered writer starved the dontcache writer 2:1.
>   With per-bdi_writeback tracking, both writers now receive equal
>   bandwidth. The aggregate matches the buffered-vs-buffered baseline
>   (863 MB/s), indicating fair sharing regardless of I/O mode.
> 
>   The dontcache writer's p99.9 latency collapsed from 119 ms to
>   33 ms (-73%), eliminating the severe periodic stalls seen in the
>   baseline. Both writers now share identical latency profiles,
>   matching the buffered-vs-buffered pattern.
> 
> The per-bdi_writeback dirty tracking dramatically reduces peak dirty
> pages in dontcache workloads, with the 32-file test dropping from
> 1.8 GB to 213 MB. Dontcache sequential write throughput triples and
> multi-writer throughput reaches parity with buffered I/O, with tail
> latencies collapsing by 1-2 orders of magnitude.
> 
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/fs-writeback.c                | 63 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/backing-dev-defs.h |  2 ++
>  include/linux/fs.h               |  6 ++--
>  include/trace/events/writeback.h |  3 +-
>  4 files changed, 69 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 32ecc745f5f7..77d53df97cc3 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2377,6 +2377,27 @@ static long wb_check_start_all(struct bdi_writeback *wb)
>  	return nr_pages;
>  }
>  
> +static long wb_check_start_dontcache(struct bdi_writeback *wb)
> +{
> +	long nr_pages;
> +
> +	if (!test_and_clear_bit(WB_start_dontcache, &wb->state))
> +		return 0;
> +
> +	nr_pages = wb_stat_sum(wb, WB_DONTCACHE_DIRTY);
> +	if (nr_pages) {
> +		struct wb_writeback_work work = {
> +			.nr_pages	= nr_pages,
> +			.sync_mode	= WB_SYNC_NONE,
> +			.range_cyclic	= 1,
> +			.reason		= WB_REASON_DONTCACHE,
> +		};
> +
> +		nr_pages = wb_writeback(wb, &work);
> +	}
> +
> +	return nr_pages;
> +}
>  
>  /*
>   * Retrieve work items and do the writeback they describe
> @@ -2398,6 +2419,11 @@ static long wb_do_writeback(struct bdi_writeback *wb)
>  	 */
>  	wrote += wb_check_start_all(wb);
>  
> +	/*
> +	 * Check for dontcache writeback request
> +	 */
> +	wrote += wb_check_start_dontcache(wb);
> +
>  	/*
>  	 * Check for periodic writeback, kupdated() style
>  	 */
> @@ -2472,6 +2498,43 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
>  	rcu_read_unlock();
>  }
>  
> +/**
> + * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
> + * @mapping:	address_space that was just written to
> + *
> + * Kick the writeback flusher thread to expedite writeback of dontcache dirty
> + * pages. Queue writeback for the inode's wb for as many pages as there are
> + * dontcache pages, but don't restrict writeback to dontcache pages only.
> + *
> + * This significantly improves performance over either writing all wb's pages
> + * or writing only dontcache pages.  Although it doesn't guarantee quick
> + * writeback and reclaim of dontcache pages, it keeps the amount of dirty pages
> + * in check. Over longer term dontcache pages get written and reclaimed by
> + * background writeback even with this rough heuristic.
> + */
> +void filemap_dontcache_kick_writeback(struct address_space *mapping)
> +{
> +	struct inode *inode = mapping->host;
> +	struct bdi_writeback *wb;
> +	struct wb_lock_cookie cookie = {};
> +	bool need_wakeup = false;
> +
> +	wb = unlocked_inode_to_wb_begin(inode, &cookie);
> +	if (wb_has_dirty_io(wb) &&
> +	    !test_bit(WB_start_dontcache, &wb->state) &&
> +	    !test_and_set_bit(WB_start_dontcache, &wb->state)) {

Doesn't test_and_set_bit() return the old value? IOW, if it sees that
WB_start_dontcache was already set it'll return true? So you can remove
the test_bit() call, right?

> +		wb_get(wb);
> +		need_wakeup = true;
> +	}

Actually, I think you can rewrite this function quite a bit:


> +	unlocked_inode_to_wb_end(inode, &cookie);
> +
> +	if (need_wakeup) {
> +		wb_wakeup(wb);
> +		wb_put(wb);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(filemap_dontcache_kick_writeback);

void filemap_dontcache_kick_writeback(struct address_space *mapping)
{
	struct inode *inode = mapping->host;
	struct bdi_writeback *wb;
	struct wb_lock_cookie cookie = {};

	wb = unlocked_inode_to_wb_begin(inode, &cookie);
	if (wb_has_dirty_io(wb) && !test_and_set_bit(WB_start_dontcache, &wb->state))
		wb_get(wb);
	else
		wb = NULL;
	unlocked_inode_to_wb_end(inode, &cookie);

	if (wb) {
		wb_wakeup(wb);
		wb_put(wb);
	}
}

No?

> +
>  /*
>   * Wakeup the flusher threads to start writeback of all currently dirty pages
>   */
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index cb660dd37286..4f1084937315 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -26,6 +26,7 @@ enum wb_state {
>  	WB_writeback_running,	/* Writeback is in progress */
>  	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
>  	WB_start_all,		/* nr_pages == 0 (all) work pending */
> +	WB_start_dontcache,	/* dontcache writeback pending */
>  };
>  
>  enum wb_stat_item {
> @@ -56,6 +57,7 @@ enum wb_reason {
>  	 */
>  	WB_REASON_FORKER_THREAD,
>  	WB_REASON_FOREIGN_FLUSH,
> +	WB_REASON_DONTCACHE,
>  
>  	WB_REASON_MAX,
>  };
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 11559c513dfb..df72b42a9e9b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2624,6 +2624,7 @@ extern int __must_check file_write_and_wait_range(struct file *file,
>  						loff_t start, loff_t end);
>  int filemap_flush_range(struct address_space *mapping, loff_t start,
>  		loff_t end);
> +void filemap_dontcache_kick_writeback(struct address_space *mapping);
>  
>  static inline int file_write_and_wait(struct file *file)
>  {
> @@ -2657,10 +2658,7 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
>  		if (ret)
>  			return ret;
>  	} else if (iocb->ki_flags & IOCB_DONTCACHE) {
> -		struct address_space *mapping = iocb->ki_filp->f_mapping;
> -
> -		filemap_flush_range(mapping, iocb->ki_pos - count,
> -				iocb->ki_pos - 1);
> +		filemap_dontcache_kick_writeback(iocb->ki_filp->f_mapping);
>  	}
>  
>  	return count;
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index bdac0d685a98..13ee076ccd16 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -44,7 +44,8 @@
>  	EM( WB_REASON_PERIODIC,			"periodic")		\
>  	EM( WB_REASON_FS_FREE_SPACE,		"fs_free_space")	\
>  	EM( WB_REASON_FORKER_THREAD,		"forker_thread")	\
> -	EMe(WB_REASON_FOREIGN_FLUSH,		"foreign_flush")
> +	EM( WB_REASON_FOREIGN_FLUSH,		"foreign_flush")	\
> +	EMe(WB_REASON_DONTCACHE,		"dontcache")
>  
>  WB_WORK_REASON
>  
> 
> -- 
> 2.54.0
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
  2026-05-11 13:24   ` Christian Brauner
@ 2026-05-11 13:53     ` Jeff Layton
  2026-05-11 14:06       ` Christian Brauner
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2026-05-11 13:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Matthew Wilcox (Oracle), Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Mike Snitzer, Jens Axboe, Ritesh Harjani, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm

On Mon, 2026-05-11 at 15:24 +0200, Christian Brauner wrote:
> On Mon, May 11, 2026 at 07:58:29AM -0400, Jeff Layton wrote:
> > The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> > filemap_flush_range() on every write, submitting writeback inline in
> > the writer's context.  Perf lock contention profiling shows the
> > performance problem is not lock contention but the writeback submission
> > work itself — walking the page tree and submitting I/O blocks the writer
> > for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> > (dontcache).
> > 
> > Replace the inline filemap_flush_range() call with a flusher kick that
> > drains dirty pages in the background.  This moves writeback submission
> > completely off the writer's hot path.
> > 
> > To avoid flushing unrelated buffered dirty data, add a dedicated
> > WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> > the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
> > write back.  The flusher writes back that many pages from the oldest dirty
> > inodes (not restricted to dontcache-specific inodes). This helps
> > preserve I/O batching while limiting the scope of expedited writeback.
> > 
> > Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> > DONTCACHE writes into a single flusher wakeup without per-write
> > allocations.  Use test_and_clear_bit to atomically consume the kick
> > request before reading the dirty counter and starting writeback, so that
> > concurrent DONTCACHE writes during writeback can re-set the bit and
> > schedule a follow-up flusher run.
> > 
> > Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
> > rather than wb_stat() (which reads only the global counter) to ensure
> > small writes below the percpu batch threshold are visible to the flusher.
> > 
> > In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
> > inside the unlocked_inode_to_wb_begin/end section for correct cgroup
> > writeback domain targeting, but defer the wb_wakeup() call until after
> > the section ends, since wb_wakeup() uses spin_unlock_irq() which would
> > unconditionally re-enable interrupts while the i_pages xa_lock may still
> > be held under irqsave during a cgroup writeback switch. Pin the wb with
> > wb_get() inside the RCU critical section before calling wb_wakeup()
> > outside it, since cgroup bdi_writeback structures are RCU-freed and the
> > wb pointer could become invalid after unlocked_inode_to_wb_end() drops
> > the RCU read lock.
> > 
> > Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> > visibility.
> > 
> > dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
> > xfs on NVMe, fio io_uring):
> > 
> > Buffered and direct I/O paths are unaffected by this patchset. All
> > improvements are confined to the dontcache path:
> > 
> > Single-stream throughput (MB/s):
> >                         Before    After    Change
> >   seq-write/dontcache      298      897    +201%
> >   rand-write/dontcache     131      236     +80%
> > 
> > Tail latency improvements (seq-write/dontcache):
> >   p99:    135,266 us  ->  23,986 us   (-82%)
> >   p99.9: 8,925,479 us ->  28,443 us   (-99.7%)
> > 
> > Multi-writer (4 jobs, sequential write):
> >                                 Before    After    Change
> >   dontcache aggregate (MB/s)     2,529    4,532     +79%
> >   dontcache p99 (us)             8,553    1,002     -88%
> >   dontcache p99.9 (us)         109,314    1,057     -99%
> > 
> >   Dontcache multi-writer throughput now matches buffered (4,532 vs
> >   4,616 MB/s).
> > 
> > 32-file write (Axboe test):
> >                                 Before    After    Change
> >   dontcache aggregate (MB/s)     1,548    3,499    +126%
> >   dontcache p99 (us)            10,170      602     -94%
> >   Peak dirty pages (MB)          1,837      213     -88%
> > 
> >   Dontcache now reaches 81% of buffered throughput (was 35%).
> > 
> > Competing writers (dontcache vs buffered, separate files):
> >                                 Before    After
> >   buffered writer                  868      433 MB/s
> >   dontcache writer                 415      433 MB/s
> >   Aggregate                      1,284      866 MB/s
> > 
> >   Previously the buffered writer starved the dontcache writer 2:1.
> >   With per-bdi_writeback tracking, both writers now receive equal
> >   bandwidth. The aggregate matches the buffered-vs-buffered baseline
> >   (863 MB/s), indicating fair sharing regardless of I/O mode.
> > 
> >   The dontcache writer's p99.9 latency collapsed from 119 ms to
> >   33 ms (-73%), eliminating the severe periodic stalls seen in the
> >   baseline. Both writers now share identical latency profiles,
> >   matching the buffered-vs-buffered pattern.
> > 
> > The per-bdi_writeback dirty tracking dramatically reduces peak dirty
> > pages in dontcache workloads, with the 32-file test dropping from
> > 1.8 GB to 213 MB. Dontcache sequential write throughput triples and
> > multi-writer throughput reaches parity with buffered I/O, with tail
> > latencies collapsing by 1-2 orders of magnitude.
> > 
> > Assisted-by: Claude:claude-opus-4-6
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/fs-writeback.c                | 63 ++++++++++++++++++++++++++++++++++++++++
> >  include/linux/backing-dev-defs.h |  2 ++
> >  include/linux/fs.h               |  6 ++--
> >  include/trace/events/writeback.h |  3 +-
> >  4 files changed, 69 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 32ecc745f5f7..77d53df97cc3 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -2377,6 +2377,27 @@ static long wb_check_start_all(struct bdi_writeback *wb)
> >  	return nr_pages;
> >  }
> >  
> > +static long wb_check_start_dontcache(struct bdi_writeback *wb)
> > +{
> > +	long nr_pages;
> > +
> > +	if (!test_and_clear_bit(WB_start_dontcache, &wb->state))
> > +		return 0;
> > +
> > +	nr_pages = wb_stat_sum(wb, WB_DONTCACHE_DIRTY);
> > +	if (nr_pages) {
> > +		struct wb_writeback_work work = {
> > +			.nr_pages	= nr_pages,
> > +			.sync_mode	= WB_SYNC_NONE,
> > +			.range_cyclic	= 1,
> > +			.reason		= WB_REASON_DONTCACHE,
> > +		};
> > +
> > +		nr_pages = wb_writeback(wb, &work);
> > +	}
> > +
> > +	return nr_pages;
> > +}
> >  
> >  /*
> >   * Retrieve work items and do the writeback they describe
> > @@ -2398,6 +2419,11 @@ static long wb_do_writeback(struct bdi_writeback *wb)
> >  	 */
> >  	wrote += wb_check_start_all(wb);
> >  
> > +	/*
> > +	 * Check for dontcache writeback request
> > +	 */
> > +	wrote += wb_check_start_dontcache(wb);
> > +
> >  	/*
> >  	 * Check for periodic writeback, kupdated() style
> >  	 */
> > @@ -2472,6 +2498,43 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
> >  	rcu_read_unlock();
> >  }
> >  
> > +/**
> > + * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
> > + * @mapping:	address_space that was just written to
> > + *
> > + * Kick the writeback flusher thread to expedite writeback of dontcache dirty
> > + * pages. Queue writeback for the inode's wb for as many pages as there are
> > + * dontcache pages, but don't restrict writeback to dontcache pages only.
> > + *
> > + * This significantly improves performance over either writing all wb's pages
> > + * or writing only dontcache pages.  Although it doesn't guarantee quick
> > + * writeback and reclaim of dontcache pages, it keeps the amount of dirty pages
> > + * in check. Over longer term dontcache pages get written and reclaimed by
> > + * background writeback even with this rough heuristic.
> > + */
> > +void filemap_dontcache_kick_writeback(struct address_space *mapping)
> > +{
> > +	struct inode *inode = mapping->host;
> > +	struct bdi_writeback *wb;
> > +	struct wb_lock_cookie cookie = {};
> > +	bool need_wakeup = false;
> > +
> > +	wb = unlocked_inode_to_wb_begin(inode, &cookie);
> > +	if (wb_has_dirty_io(wb) &&
> > +	    !test_bit(WB_start_dontcache, &wb->state) &&
> > +	    !test_and_set_bit(WB_start_dontcache, &wb->state)) {
> 
> Doesn't test_and_set_bit() return the old value? IOW, if it sees that
> WB_start_dontcache was already set it'll return true? So you can remove
> the test_bit() call, right?
> 

Yes.

> > +		wb_get(wb);
> > +		need_wakeup = true;
> > +	}
> 
> Actually, I think you can rewrite this function quite a bit:
> 
> 
> > +	unlocked_inode_to_wb_end(inode, &cookie);
> > +
> > +	if (need_wakeup) {
> > +		wb_wakeup(wb);
> > +		wb_put(wb);
> > +	}
> > +}
> > +EXPORT_SYMBOL_GPL(filemap_dontcache_kick_writeback);
> 
> void filemap_dontcache_kick_writeback(struct address_space *mapping)
> {
> 	struct inode *inode = mapping->host;
> 	struct bdi_writeback *wb;
> 	struct wb_lock_cookie cookie = {};
> 
> 	wb = unlocked_inode_to_wb_begin(inode, &cookie);
> 	if (wb_has_dirty_io(wb) && !test_and_set_bit(WB_start_dontcache, &wb->state))
> 		wb_get(wb);
> 	else
> 		wb = NULL;
> 	unlocked_inode_to_wb_end(inode, &cookie);
> 
> 	if (wb) {
> 		wb_wakeup(wb);
> 		wb_put(wb);
> 	}
> }
> 
> No?
> 

That does look much cleaner. Do you want to just make that change or
would you rather I resend?

Thanks!

> > +
> >  /*
> >   * Wakeup the flusher threads to start writeback of all currently dirty pages
> >   */
> > diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> > index cb660dd37286..4f1084937315 100644
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -26,6 +26,7 @@ enum wb_state {
> >  	WB_writeback_running,	/* Writeback is in progress */
> >  	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
> >  	WB_start_all,		/* nr_pages == 0 (all) work pending */
> > +	WB_start_dontcache,	/* dontcache writeback pending */
> >  };
> >  
> >  enum wb_stat_item {
> > @@ -56,6 +57,7 @@ enum wb_reason {
> >  	 */
> >  	WB_REASON_FORKER_THREAD,
> >  	WB_REASON_FOREIGN_FLUSH,
> > +	WB_REASON_DONTCACHE,
> >  
> >  	WB_REASON_MAX,
> >  };
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 11559c513dfb..df72b42a9e9b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2624,6 +2624,7 @@ extern int __must_check file_write_and_wait_range(struct file *file,
> >  						loff_t start, loff_t end);
> >  int filemap_flush_range(struct address_space *mapping, loff_t start,
> >  		loff_t end);
> > +void filemap_dontcache_kick_writeback(struct address_space *mapping);
> >  
> >  static inline int file_write_and_wait(struct file *file)
> >  {
> > @@ -2657,10 +2658,7 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
> >  		if (ret)
> >  			return ret;
> >  	} else if (iocb->ki_flags & IOCB_DONTCACHE) {
> > -		struct address_space *mapping = iocb->ki_filp->f_mapping;
> > -
> > -		filemap_flush_range(mapping, iocb->ki_pos - count,
> > -				iocb->ki_pos - 1);
> > +		filemap_dontcache_kick_writeback(iocb->ki_filp->f_mapping);
> >  	}
> >  
> >  	return count;
> > diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> > index bdac0d685a98..13ee076ccd16 100644
> > --- a/include/trace/events/writeback.h
> > +++ b/include/trace/events/writeback.h
> > @@ -44,7 +44,8 @@
> >  	EM( WB_REASON_PERIODIC,			"periodic")		\
> >  	EM( WB_REASON_FS_FREE_SPACE,		"fs_free_space")	\
> >  	EM( WB_REASON_FORKER_THREAD,		"forker_thread")	\
> > -	EMe(WB_REASON_FOREIGN_FLUSH,		"foreign_flush")
> > +	EM( WB_REASON_FOREIGN_FLUSH,		"foreign_flush")	\
> > +	EMe(WB_REASON_DONTCACHE,		"dontcache")
> >  
> >  WB_WORK_REASON
> >  
> > 
> > -- 
> > 2.54.0
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
  2026-05-11 13:53     ` Jeff Layton
@ 2026-05-11 14:06       ` Christian Brauner
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2026-05-11 14:06 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Jan Kara, Matthew Wilcox (Oracle), Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Mike Snitzer, Jens Axboe, Ritesh Harjani, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm

On Mon, May 11, 2026 at 09:53:21AM -0400, Jeff Layton wrote:
> On Mon, 2026-05-11 at 15:24 +0200, Christian Brauner wrote:
> > On Mon, May 11, 2026 at 07:58:29AM -0400, Jeff Layton wrote:
> > > The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> > > filemap_flush_range() on every write, submitting writeback inline in
> > > the writer's context.  Perf lock contention profiling shows the
> > > performance problem is not lock contention but the writeback submission
> > > work itself — walking the page tree and submitting I/O blocks the writer
> > > for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> > > (dontcache).
> > > 
> > > Replace the inline filemap_flush_range() call with a flusher kick that
> > > drains dirty pages in the background.  This moves writeback submission
> > > completely off the writer's hot path.
> > > 
> > > To avoid flushing unrelated buffered dirty data, add a dedicated
> > > WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> > > the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
> > > write back.  The flusher writes back that many pages from the oldest dirty
> > > inodes (not restricted to dontcache-specific inodes). This helps
> > > preserve I/O batching while limiting the scope of expedited writeback.
> > > 
> > > Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> > > DONTCACHE writes into a single flusher wakeup without per-write
> > > allocations.  Use test_and_clear_bit to atomically consume the kick
> > > request before reading the dirty counter and starting writeback, so that
> > > concurrent DONTCACHE writes during writeback can re-set the bit and
> > > schedule a follow-up flusher run.
> > > 
> > > Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
> > > rather than wb_stat() (which reads only the global counter) to ensure
> > > small writes below the percpu batch threshold are visible to the flusher.
> > > 
> > > In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
> > > inside the unlocked_inode_to_wb_begin/end section for correct cgroup
> > > writeback domain targeting, but defer the wb_wakeup() call until after
> > > the section ends, since wb_wakeup() uses spin_unlock_irq() which would
> > > unconditionally re-enable interrupts while the i_pages xa_lock may still
> > > be held under irqsave during a cgroup writeback switch. Pin the wb with
> > > wb_get() inside the RCU critical section before calling wb_wakeup()
> > > outside it, since cgroup bdi_writeback structures are RCU-freed and the
> > > wb pointer could become invalid after unlocked_inode_to_wb_end() drops
> > > the RCU read lock.
> > > 
> > > Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> > > visibility.
> > > 
> > > dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
> > > xfs on NVMe, fio io_uring):
> > > 
> > > Buffered and direct I/O paths are unaffected by this patchset. All
> > > improvements are confined to the dontcache path:
> > > 
> > > Single-stream throughput (MB/s):
> > >                         Before    After    Change
> > >   seq-write/dontcache      298      897    +201%
> > >   rand-write/dontcache     131      236     +80%
> > > 
> > > Tail latency improvements (seq-write/dontcache):
> > >   p99:    135,266 us  ->  23,986 us   (-82%)
> > >   p99.9: 8,925,479 us ->  28,443 us   (-99.7%)
> > > 
> > > Multi-writer (4 jobs, sequential write):
> > >                                 Before    After    Change
> > >   dontcache aggregate (MB/s)     2,529    4,532     +79%
> > >   dontcache p99 (us)             8,553    1,002     -88%
> > >   dontcache p99.9 (us)         109,314    1,057     -99%
> > > 
> > >   Dontcache multi-writer throughput now matches buffered (4,532 vs
> > >   4,616 MB/s).
> > > 
> > > 32-file write (Axboe test):
> > >                                 Before    After    Change
> > >   dontcache aggregate (MB/s)     1,548    3,499    +126%
> > >   dontcache p99 (us)            10,170      602     -94%
> > >   Peak dirty pages (MB)          1,837      213     -88%
> > > 
> > >   Dontcache now reaches 81% of buffered throughput (was 35%).
> > > 
> > > Competing writers (dontcache vs buffered, separate files):
> > >                                 Before    After
> > >   buffered writer                  868      433 MB/s
> > >   dontcache writer                 415      433 MB/s
> > >   Aggregate                      1,284      866 MB/s
> > > 
> > >   Previously the buffered writer starved the dontcache writer 2:1.
> > >   With per-bdi_writeback tracking, both writers now receive equal
> > >   bandwidth. The aggregate matches the buffered-vs-buffered baseline
> > >   (863 MB/s), indicating fair sharing regardless of I/O mode.
> > > 
> > >   The dontcache writer's p99.9 latency collapsed from 119 ms to
> > >   33 ms (-73%), eliminating the severe periodic stalls seen in the
> > >   baseline. Both writers now share identical latency profiles,
> > >   matching the buffered-vs-buffered pattern.
> > > 
> > > The per-bdi_writeback dirty tracking dramatically reduces peak dirty
> > > pages in dontcache workloads, with the 32-file test dropping from
> > > 1.8 GB to 213 MB. Dontcache sequential write throughput triples and
> > > multi-writer throughput reaches parity with buffered I/O, with tail
> > > latencies collapsing by 1-2 orders of magnitude.
> > > 
> > > Assisted-by: Claude:claude-opus-4-6
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/fs-writeback.c                | 63 ++++++++++++++++++++++++++++++++++++++++
> > >  include/linux/backing-dev-defs.h |  2 ++
> > >  include/linux/fs.h               |  6 ++--
> > >  include/trace/events/writeback.h |  3 +-
> > >  4 files changed, 69 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 32ecc745f5f7..77d53df97cc3 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -2377,6 +2377,27 @@ static long wb_check_start_all(struct bdi_writeback *wb)
> > >  	return nr_pages;
> > >  }
> > >  
> > > +static long wb_check_start_dontcache(struct bdi_writeback *wb)
> > > +{
> > > +	long nr_pages;
> > > +
> > > +	if (!test_and_clear_bit(WB_start_dontcache, &wb->state))
> > > +		return 0;
> > > +
> > > +	nr_pages = wb_stat_sum(wb, WB_DONTCACHE_DIRTY);
> > > +	if (nr_pages) {
> > > +		struct wb_writeback_work work = {
> > > +			.nr_pages	= nr_pages,
> > > +			.sync_mode	= WB_SYNC_NONE,
> > > +			.range_cyclic	= 1,
> > > +			.reason		= WB_REASON_DONTCACHE,
> > > +		};
> > > +
> > > +		nr_pages = wb_writeback(wb, &work);
> > > +	}
> > > +
> > > +	return nr_pages;
> > > +}
> > >  
> > >  /*
> > >   * Retrieve work items and do the writeback they describe
> > > @@ -2398,6 +2419,11 @@ static long wb_do_writeback(struct bdi_writeback *wb)
> > >  	 */
> > >  	wrote += wb_check_start_all(wb);
> > >  
> > > +	/*
> > > +	 * Check for dontcache writeback request
> > > +	 */
> > > +	wrote += wb_check_start_dontcache(wb);
> > > +
> > >  	/*
> > >  	 * Check for periodic writeback, kupdated() style
> > >  	 */
> > > @@ -2472,6 +2498,43 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
> > >  	rcu_read_unlock();
> > >  }
> > >  
> > > +/**
> > > + * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
> > > + * @mapping:	address_space that was just written to
> > > + *
> > > + * Kick the writeback flusher thread to expedite writeback of dontcache dirty
> > > + * pages. Queue writeback for the inode's wb for as many pages as there are
> > > + * dontcache pages, but don't restrict writeback to dontcache pages only.
> > > + *
> > > + * This significantly improves performance over either writing all wb's pages
> > > + * or writing only dontcache pages.  Although it doesn't guarantee quick
> > > + * writeback and reclaim of dontcache pages, it keeps the amount of dirty pages
> > > + * in check. Over longer term dontcache pages get written and reclaimed by
> > > + * background writeback even with this rough heuristic.
> > > + */
> > > +void filemap_dontcache_kick_writeback(struct address_space *mapping)
> > > +{
> > > +	struct inode *inode = mapping->host;
> > > +	struct bdi_writeback *wb;
> > > +	struct wb_lock_cookie cookie = {};
> > > +	bool need_wakeup = false;
> > > +
> > > +	wb = unlocked_inode_to_wb_begin(inode, &cookie);
> > > +	if (wb_has_dirty_io(wb) &&
> > > +	    !test_bit(WB_start_dontcache, &wb->state) &&
> > > +	    !test_and_set_bit(WB_start_dontcache, &wb->state)) {
> > 
> > Doesn't test_and_set_bit() return the old value? IOW, if it sees that
> > WB_start_dontcache was already set it'll return true? So you can remove
> > the test_bit() call, right?
> > 
> 
> Yes.
> 
> > > +		wb_get(wb);
> > > +		need_wakeup = true;
> > > +	}
> > 
> > Actually, I think you can rewrite this function quite a bit:
> > 
> > 
> > > +	unlocked_inode_to_wb_end(inode, &cookie);
> > > +
> > > +	if (need_wakeup) {
> > > +		wb_wakeup(wb);
> > > +		wb_put(wb);
> > > +	}
> > > +}
> > > +EXPORT_SYMBOL_GPL(filemap_dontcache_kick_writeback);
> > 
> > void filemap_dontcache_kick_writeback(struct address_space *mapping)
> > {
> > 	struct inode *inode = mapping->host;
> > 	struct bdi_writeback *wb;
> > 	struct wb_lock_cookie cookie = {};
> > 
> > 	wb = unlocked_inode_to_wb_begin(inode, &cookie);
> > 	if (wb_has_dirty_io(wb) && !test_and_set_bit(WB_start_dontcache, &wb->state))
> > 		wb_get(wb);
> > 	else
> > 		wb = NULL;
> > 	unlocked_inode_to_wb_end(inode, &cookie);
> > 
> > 	if (wb) {
> > 		wb_wakeup(wb);
> > 		wb_put(wb);
> > 	}
> > }
> > 
> > No?
> > 
> 
> That does look much cleaner. Do you want to just make that change or
> would you rather I resend?

I'll just fold it. I already have 1157 mails. I don't need more. :D

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
  2026-05-11 11:58 ` [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking Jeff Layton
  2026-05-11 13:24   ` Christian Brauner
@ 2026-05-12 14:17   ` Jan Kara
  2026-05-13  3:01   ` Ritesh Harjani
  2 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2026-05-12 14:17 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Chuck Lever, linux-fsdevel, linux-kernel,
	linux-nfs, linux-mm

On Mon 11-05-26 07:58:29, Jeff Layton wrote:
> The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> filemap_flush_range() on every write, submitting writeback inline in
> the writer's context.  Perf lock contention profiling shows the
> performance problem is not lock contention but the writeback submission
> work itself — walking the page tree and submitting I/O blocks the writer
> for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> (dontcache).
> 
> Replace the inline filemap_flush_range() call with a flusher kick that
> drains dirty pages in the background.  This moves writeback submission
> completely off the writer's hot path.
> 
> To avoid flushing unrelated buffered dirty data, add a dedicated
> WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
> write back.  The flusher writes back that many pages from the oldest dirty
> inodes (not restricted to dontcache-specific inodes). This helps
> preserve I/O batching while limiting the scope of expedited writeback.
> 
> Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> DONTCACHE writes into a single flusher wakeup without per-write
> allocations.  Use test_and_clear_bit to atomically consume the kick
> request before reading the dirty counter and starting writeback, so that
> concurrent DONTCACHE writes during writeback can re-set the bit and
> schedule a follow-up flusher run.
> 
> Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
> rather than wb_stat() (which reads only the global counter) to ensure
> small writes below the percpu batch threshold are visible to the flusher.
> 
> In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
> inside the unlocked_inode_to_wb_begin/end section for correct cgroup
> writeback domain targeting, but defer the wb_wakeup() call until after
> the section ends, since wb_wakeup() uses spin_unlock_irq() which would
> unconditionally re-enable interrupts while the i_pages xa_lock may still
> be held under irqsave during a cgroup writeback switch. Pin the wb with
> wb_get() inside the RCU critical section before calling wb_wakeup()
> outside it, since cgroup bdi_writeback structures are RCU-freed and the
> wb pointer could become invalid after unlocked_inode_to_wb_end() drops
> the RCU read lock.
> 
> Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> visibility.
> 
> dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
> xfs on NVMe, fio io_uring):
> 
> Buffered and direct I/O paths are unaffected by this patchset. All
> improvements are confined to the dontcache path:
> 
> Single-stream throughput (MB/s):
>                         Before    After    Change
>   seq-write/dontcache      298      897    +201%
>   rand-write/dontcache     131      236     +80%
> 
> Tail latency improvements (seq-write/dontcache):
>   p99:    135,266 us  ->  23,986 us   (-82%)
>   p99.9: 8,925,479 us ->  28,443 us   (-99.7%)
> 
> Multi-writer (4 jobs, sequential write):
>                                 Before    After    Change
>   dontcache aggregate (MB/s)     2,529    4,532     +79%
>   dontcache p99 (us)             8,553    1,002     -88%
>   dontcache p99.9 (us)         109,314    1,057     -99%
> 
>   Dontcache multi-writer throughput now matches buffered (4,532 vs
>   4,616 MB/s).
> 
> 32-file write (Axboe test):
>                                 Before    After    Change
>   dontcache aggregate (MB/s)     1,548    3,499    +126%
>   dontcache p99 (us)            10,170      602     -94%
>   Peak dirty pages (MB)          1,837      213     -88%
> 
>   Dontcache now reaches 81% of buffered throughput (was 35%).
> 
> Competing writers (dontcache vs buffered, separate files):
>                                 Before    After
>   buffered writer                  868      433 MB/s
>   dontcache writer                 415      433 MB/s
>   Aggregate                      1,284      866 MB/s
> 
>   Previously the buffered writer starved the dontcache writer 2:1.
>   With per-bdi_writeback tracking, both writers now receive equal
>   bandwidth. The aggregate matches the buffered-vs-buffered baseline
>   (863 MB/s), indicating fair sharing regardless of I/O mode.
> 
>   The dontcache writer's p99.9 latency collapsed from 119 ms to
>   33 ms (-73%), eliminating the severe periodic stalls seen in the
>   baseline. Both writers now share identical latency profiles,
>   matching the buffered-vs-buffered pattern.
> 
> The per-bdi_writeback dirty tracking dramatically reduces peak dirty
> pages in dontcache workloads, with the 32-file test dropping from
> 1.8 GB to 213 MB. Dontcache sequential write throughput triples and
> multi-writer throughput reaches parity with buffered I/O, with tail
> latencies collapsing by 1-2 orders of magnitude.
> 
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

This patch with Christian's simplification looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fs-writeback.c                | 63 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/backing-dev-defs.h |  2 ++
>  include/linux/fs.h               |  6 ++--
>  include/trace/events/writeback.h |  3 +-
>  4 files changed, 69 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 32ecc745f5f7..77d53df97cc3 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2377,6 +2377,27 @@ static long wb_check_start_all(struct bdi_writeback *wb)
>  	return nr_pages;
>  }
>  
> +static long wb_check_start_dontcache(struct bdi_writeback *wb)
> +{
> +	long nr_pages;
> +
> +	if (!test_and_clear_bit(WB_start_dontcache, &wb->state))
> +		return 0;
> +
> +	nr_pages = wb_stat_sum(wb, WB_DONTCACHE_DIRTY);
> +	if (nr_pages) {
> +		struct wb_writeback_work work = {
> +			.nr_pages	= nr_pages,
> +			.sync_mode	= WB_SYNC_NONE,
> +			.range_cyclic	= 1,
> +			.reason		= WB_REASON_DONTCACHE,
> +		};
> +
> +		nr_pages = wb_writeback(wb, &work);
> +	}
> +
> +	return nr_pages;
> +}
>  
>  /*
>   * Retrieve work items and do the writeback they describe
> @@ -2398,6 +2419,11 @@ static long wb_do_writeback(struct bdi_writeback *wb)
>  	 */
>  	wrote += wb_check_start_all(wb);
>  
> +	/*
> +	 * Check for dontcache writeback request
> +	 */
> +	wrote += wb_check_start_dontcache(wb);
> +
>  	/*
>  	 * Check for periodic writeback, kupdated() style
>  	 */
> @@ -2472,6 +2498,43 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
>  	rcu_read_unlock();
>  }
>  
> +/**
> + * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
> + * @mapping:	address_space that was just written to
> + *
> + * Kick the writeback flusher thread to expedite writeback of dontcache dirty
> + * pages. Queue writeback for the inode's wb for as many pages as there are
> + * dontcache pages, but don't restrict writeback to dontcache pages only.
> + *
> + * This significantly improves performance over either writing all wb's pages
> + * or writing only dontcache pages.  Although it doesn't guarantee quick
> + * writeback and reclaim of dontcache pages, it keeps the amount of dirty pages
> + * in check. Over longer term dontcache pages get written and reclaimed by
> + * background writeback even with this rough heuristic.
> + */
> +void filemap_dontcache_kick_writeback(struct address_space *mapping)
> +{
> +	struct inode *inode = mapping->host;
> +	struct bdi_writeback *wb;
> +	struct wb_lock_cookie cookie = {};
> +	bool need_wakeup = false;
> +
> +	wb = unlocked_inode_to_wb_begin(inode, &cookie);
> +	if (wb_has_dirty_io(wb) &&
> +	    !test_bit(WB_start_dontcache, &wb->state) &&
> +	    !test_and_set_bit(WB_start_dontcache, &wb->state)) {
> +		wb_get(wb);
> +		need_wakeup = true;
> +	}
> +	unlocked_inode_to_wb_end(inode, &cookie);
> +
> +	if (need_wakeup) {
> +		wb_wakeup(wb);
> +		wb_put(wb);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(filemap_dontcache_kick_writeback);
> +
>  /*
>   * Wakeup the flusher threads to start writeback of all currently dirty pages
>   */
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index cb660dd37286..4f1084937315 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -26,6 +26,7 @@ enum wb_state {
>  	WB_writeback_running,	/* Writeback is in progress */
>  	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
>  	WB_start_all,		/* nr_pages == 0 (all) work pending */
> +	WB_start_dontcache,	/* dontcache writeback pending */
>  };
>  
>  enum wb_stat_item {
> @@ -56,6 +57,7 @@ enum wb_reason {
>  	 */
>  	WB_REASON_FORKER_THREAD,
>  	WB_REASON_FOREIGN_FLUSH,
> +	WB_REASON_DONTCACHE,
>  
>  	WB_REASON_MAX,
>  };
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 11559c513dfb..df72b42a9e9b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2624,6 +2624,7 @@ extern int __must_check file_write_and_wait_range(struct file *file,
>  						loff_t start, loff_t end);
>  int filemap_flush_range(struct address_space *mapping, loff_t start,
>  		loff_t end);
> +void filemap_dontcache_kick_writeback(struct address_space *mapping);
>  
>  static inline int file_write_and_wait(struct file *file)
>  {
> @@ -2657,10 +2658,7 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
>  		if (ret)
>  			return ret;
>  	} else if (iocb->ki_flags & IOCB_DONTCACHE) {
> -		struct address_space *mapping = iocb->ki_filp->f_mapping;
> -
> -		filemap_flush_range(mapping, iocb->ki_pos - count,
> -				iocb->ki_pos - 1);
> +		filemap_dontcache_kick_writeback(iocb->ki_filp->f_mapping);
>  	}
>  
>  	return count;
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index bdac0d685a98..13ee076ccd16 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -44,7 +44,8 @@
>  	EM( WB_REASON_PERIODIC,			"periodic")		\
>  	EM( WB_REASON_FS_FREE_SPACE,		"fs_free_space")	\
>  	EM( WB_REASON_FORKER_THREAD,		"forker_thread")	\
> -	EMe(WB_REASON_FOREIGN_FLUSH,		"foreign_flush")
> +	EM( WB_REASON_FOREIGN_FLUSH,		"foreign_flush")	\
> +	EMe(WB_REASON_DONTCACHE,		"dontcache")
>  
>  WB_WORK_REASON
>  
> 
> -- 
> 2.54.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
  2026-05-11 11:58 ` [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking Jeff Layton
  2026-05-11 13:24   ` Christian Brauner
  2026-05-12 14:17   ` Jan Kara
@ 2026-05-13  3:01   ` Ritesh Harjani
  2 siblings, 0 replies; 16+ messages in thread
From: Ritesh Harjani @ 2026-05-13  3:01 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm, Jeff Layton

Jeff Layton <jlayton@kernel.org> writes:

> dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
> xfs on NVMe, fio io_uring):
>
> Buffered and direct I/O paths are unaffected by this patchset. All
> improvements are confined to the dontcache path:
>
> Single-stream throughput (MB/s):
>                         Before    After    Change
>   seq-write/dontcache      298      897    +201%
>   rand-write/dontcache     131      236     +80%
>
> Tail latency improvements (seq-write/dontcache):
>   p99:    135,266 us  ->  23,986 us   (-82%)
>   p99.9: 8,925,479 us ->  28,443 us   (-99.7%)
>
> Multi-writer (4 jobs, sequential write):
>                                 Before    After    Change
>   dontcache aggregate (MB/s)     2,529    4,532     +79%
>   dontcache p99 (us)             8,553    1,002     -88%
>   dontcache p99.9 (us)         109,314    1,057     -99%
>
>   Dontcache multi-writer throughput now matches buffered (4,532 vs
>   4,616 MB/s).
>
> 32-file write (Axboe test):
>                                 Before    After    Change
>   dontcache aggregate (MB/s)     1,548    3,499    +126%
>   dontcache p99 (us)            10,170      602     -94%
>   Peak dirty pages (MB)          1,837      213     -88%
>
>   Dontcache now reaches 81% of buffered throughput (was 35%).
>
> Competing writers (dontcache vs buffered, separate files):
>                                 Before    After
>   buffered writer                  868      433 MB/s
>   dontcache writer                 415      433 MB/s
>   Aggregate                      1,284      866 MB/s
>
>   Previously the buffered writer starved the dontcache writer 2:1.
>   With per-bdi_writeback tracking, both writers now receive equal
>   bandwidth. The aggregate matches the buffered-vs-buffered baseline
>   (863 MB/s), indicating fair sharing regardless of I/O mode.
>
>   The dontcache writer's p99.9 latency collapsed from 119 ms to
>   33 ms (-73%), eliminating the severe periodic stalls seen in the
>   baseline. Both writers now share identical latency profiles,
>   matching the buffered-vs-buffered pattern.
>
> The per-bdi_writeback dirty tracking dramatically reduces peak dirty
> pages in dontcache workloads, with the 32-file test dropping from
> 1.8 GB to 213 MB. Dontcache sequential write throughput triples and
> multi-writer throughput reaches parity with buffered I/O, with tail
> latencies collapsing by 1-2 orders of magnitude.
>
Thanks for also considering multiple request for performance numbers
into account, this indeed is a nice improvement overall to
RWF_DONTCACHE.

With Christian simplification - the patch looks good to me.
So, please feel free to add:

Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2026-05-13  3:07 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-11 11:58 [PATCH v7 0/3] mm: improve write performance with RWF_DONTCACHE Jeff Layton
2026-05-11 11:58 ` [PATCH v7 1/3] mm: preserve PG_dropbehind flag during folio split Jeff Layton
2026-05-11 12:38   ` David Hildenbrand (Arm)
2026-05-12 13:58   ` Jan Kara
2026-05-11 11:58 ` [PATCH v7 2/3] mm: track DONTCACHE dirty pages per bdi_writeback Jeff Layton
2026-05-11 13:10   ` Christian Brauner
2026-05-11 13:29     ` Jeff Layton
2026-05-11 13:34       ` David Hildenbrand (Arm)
2026-05-12 14:07   ` Jan Kara
2026-05-13  2:07   ` Ritesh Harjani
2026-05-11 11:58 ` [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking Jeff Layton
2026-05-11 13:24   ` Christian Brauner
2026-05-11 13:53     ` Jeff Layton
2026-05-11 14:06       ` Christian Brauner
2026-05-12 14:17   ` Jan Kara
2026-05-13  3:01   ` Ritesh Harjani

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox