From: Jeff Layton
Date: Mon, 11 May 2026 07:58:29 -0400
Subject: [PATCH v7 3/3] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
Message-Id: <20260511-dontcache-v7-3-2848ddce8090@kernel.org>
References: <20260511-dontcache-v7-0-2848ddce8090@kernel.org>
In-Reply-To: <20260511-dontcache-v7-0-2848ddce8090@kernel.org>
To: Alexander Viro, Christian Brauner, Jan Kara,
 "Matthew Wilcox (Oracle)", Andrew Morton, David Hildenbrand,
 Lorenzo Stoakes, "Liam R. Howlett", Vlastimil Babka, Mike Rapoport,
 Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
 Ritesh Harjani, Chuck Lever
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-nfs@vger.kernel.org, linux-mm@kvack.org, Jeff Layton

The IOCB_DONTCACHE writeback path in generic_write_sync() calls
filemap_flush_range() on every write, submitting writeback inline in the
writer's context. Perf lock contention profiling shows the performance
problem is not lock contention but the writeback submission work itself:
walking the page tree and submitting I/O blocks the writer for
milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
(dontcache).

Replace the inline filemap_flush_range() call with a flusher kick that
drains dirty pages in the background. This moves writeback submission
completely off the writer's hot path.

To avoid flushing unrelated buffered dirty data, add a dedicated
WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
write back. The flusher writes back that many pages from the oldest
dirty inodes (not restricted to dontcache-specific inodes). This helps
preserve I/O batching while limiting the scope of expedited writeback.

Like WB_start_all, the WB_start_dontcache bit coalesces multiple
DONTCACHE writes into a single flusher wakeup without per-write
allocations. Use test_and_clear_bit to atomically consume the kick
request before reading the dirty counter and starting writeback, so that
concurrent DONTCACHE writes during writeback can re-set the bit and
schedule a follow-up flusher run.

Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
rather than wb_stat() (which reads only the global counter) to ensure
small writes below the percpu batch threshold are visible to the
flusher.

In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
inside the unlocked_inode_to_wb_begin/end section for correct cgroup
writeback domain targeting, but defer the wb_wakeup() call until after
the section ends, since wb_wakeup() uses spin_unlock_irq() which would
unconditionally re-enable interrupts while the i_pages xa_lock may still
be held under irqsave during a cgroup writeback switch.

Pin the wb with wb_get() inside the RCU critical section before calling
wb_wakeup() outside it, since cgroup bdi_writeback structures are
RCU-freed and the wb pointer could become invalid after
unlocked_inode_to_wb_end() drops the RCU read lock.

Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
visibility.

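The coalescing scheme described above (writers set a pending bit, the
flusher consumes it before sampling the counter) can be sketched in
userspace with C11 atomics. This is only an illustrative model, not the
kernel code: struct fake_wb, fake_dirty(), fake_kick() and fake_flush()
are hypothetical names, and "writeback" is reduced to draining a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct fake_wb {
	atomic_bool kick_pending;	/* stands in for WB_start_dontcache */
	atomic_long dirty_pages;	/* stands in for WB_DONTCACHE_DIRTY */
	atomic_long wakeups;		/* how many flusher wakeups were issued */
};

static void fake_wb_init(struct fake_wb *wb)
{
	atomic_init(&wb->kick_pending, false);
	atomic_init(&wb->dirty_pages, 0);
	atomic_init(&wb->wakeups, 0);
}

/* Writer side: account newly dirtied pages. */
static void fake_dirty(struct fake_wb *wb, long pages)
{
	assert(pages > 0);
	atomic_fetch_add(&wb->dirty_pages, pages);
}

/*
 * Writer side: request a flush. Returns true only for the caller that
 * actually set the flag and therefore owes the wakeup.
 */
static bool fake_kick(struct fake_wb *wb)
{
	/* Cheap read first, so the already-pending case avoids a write. */
	if (atomic_load(&wb->kick_pending))
		return false;
	/* atomic_exchange returns the old value: true means we lost a race. */
	if (atomic_exchange(&wb->kick_pending, true))
		return false;
	atomic_fetch_add(&wb->wakeups, 1);
	return true;
}

/*
 * Flusher side: consume the request first, then sample the counter.
 * A writer that dirties pages after the exchange sets the flag again,
 * so its pages are handled by a follow-up run rather than being missed.
 * Returns the number of pages "written back" (here: simply drained).
 */
static long fake_flush(struct fake_wb *wb)
{
	if (!atomic_exchange(&wb->kick_pending, false))
		return 0;
	return atomic_exchange(&wb->dirty_pages, 0);
}
```

In this model, two back-to-back kicks produce one wakeup, and a kick
issued after fake_flush() has cleared the flag produces a second one,
which mirrors the follow-up run described above.
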
dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
xfs on NVMe, fio io_uring):

Buffered and direct I/O paths are unaffected by this patchset. All
improvements are confined to the dontcache path:

Single-stream throughput (MB/s):
                            Before    After    Change
  seq-write/dontcache          298      897     +201%
  rand-write/dontcache         131      236      +80%

Tail latency improvements (seq-write/dontcache):
  p99:       135,266 us ->  23,986 us  (-82%)
  p99.9:   8,925,479 us ->  28,443 us  (-99.7%)

Multi-writer (4 jobs, sequential write):
                                Before    After    Change
  dontcache aggregate (MB/s)     2,529    4,532      +79%
  dontcache p99 (us)             8,553    1,002      -88%
  dontcache p99.9 (us)         109,314    1,057      -99%

  Dontcache multi-writer throughput now matches buffered
  (4,532 vs 4,616 MB/s).

32-file write (Axboe test):
                                Before    After    Change
  dontcache aggregate (MB/s)     1,548    3,499     +126%
  dontcache p99 (us)            10,170      602      -94%
  Peak dirty pages (MB)          1,837      213      -88%

  Dontcache now reaches 81% of buffered throughput (was 35%).

Competing writers (dontcache vs buffered, separate files):
                        Before    After
  buffered writer          868      433 MB/s
  dontcache writer         415      433 MB/s
  Aggregate              1,284      866 MB/s

Previously the buffered writer starved the dontcache writer 2:1. With
per-bdi_writeback tracking, both writers now receive equal bandwidth.
The aggregate matches the buffered-vs-buffered baseline (863 MB/s),
indicating fair sharing regardless of I/O mode.

The dontcache writer's p99.9 latency collapsed from 119 ms to 33 ms
(-73%), eliminating the severe periodic stalls seen in the baseline.
Both writers now share identical latency profiles, matching the
buffered-vs-buffered pattern.

The per-bdi_writeback dirty tracking dramatically reduces peak dirty
pages in dontcache workloads, with the 32-file test dropping from 1.8 GB
to 213 MB. Dontcache sequential write throughput triples and
multi-writer throughput reaches parity with buffered I/O, with tail
latencies collapsing by 1-2 orders of magnitude.

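For readers who want to recheck the Change columns: they appear to be the
relative difference (after - before) / before, rounded to a whole
percent. That reading is an assumption, as is the pct_change() helper
below, which is only a sketch for recomputing the figures.

```c
#include <assert.h>

/*
 * Relative change from 'before' to 'after' in percent, rounded half
 * away from zero. Hypothetical helper, assuming the Change column is
 * (after - before) / before.
 */
static long pct_change(double before, double after)
{
	double pct;

	assert(before > 0.0);
	pct = (after - before) * 100.0 / before;
	/* Manual rounding avoids pulling in libm. */
	return (long)(pct >= 0.0 ? pct + 0.5 : pct - 0.5);
}
```

Under that assumption, pct_change(298, 897) gives 201 and
pct_change(8553, 1002) gives -88, in line with the seq-write throughput
and multi-writer p99 rows.
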
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton
---
 fs/fs-writeback.c                | 63 ++++++++++++++++++++++++++++++++++++++++
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h               |  6 ++--
 include/trace/events/writeback.h |  3 +-
 4 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 32ecc745f5f7..77d53df97cc3 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2377,6 +2377,27 @@ static long wb_check_start_all(struct bdi_writeback *wb)
 
 	return nr_pages;
 }
+static long wb_check_start_dontcache(struct bdi_writeback *wb)
+{
+	long nr_pages;
+
+	if (!test_and_clear_bit(WB_start_dontcache, &wb->state))
+		return 0;
+
+	nr_pages = wb_stat_sum(wb, WB_DONTCACHE_DIRTY);
+	if (nr_pages) {
+		struct wb_writeback_work work = {
+			.nr_pages	= nr_pages,
+			.sync_mode	= WB_SYNC_NONE,
+			.range_cyclic	= 1,
+			.reason		= WB_REASON_DONTCACHE,
+		};
+
+		nr_pages = wb_writeback(wb, &work);
+	}
+
+	return nr_pages;
+}
 
 /*
  * Retrieve work items and do the writeback they describe
@@ -2398,6 +2419,11 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 	 */
 	wrote += wb_check_start_all(wb);
 
+	/*
+	 * Check for dontcache writeback request
+	 */
+	wrote += wb_check_start_dontcache(wb);
+
 	/*
 	 * Check for periodic writeback, kupdated() style
 	 */
@@ -2472,6 +2498,43 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
 	rcu_read_unlock();
 }
 
+/**
+ * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
+ * @mapping:	address_space that was just written to
+ *
+ * Kick the writeback flusher thread to expedite writeback of dontcache dirty
+ * pages. Queue writeback for the inode's wb for as many pages as there are
+ * dontcache pages, but don't restrict writeback to dontcache pages only.
+ *
+ * This significantly improves performance over either writing all wb's pages
+ * or writing only dontcache pages. Although it doesn't guarantee quick
+ * writeback and reclaim of dontcache pages, it keeps the amount of dirty pages
+ * in check. Over longer term dontcache pages get written and reclaimed by
+ * background writeback even with this rough heuristic.
+ */
+void filemap_dontcache_kick_writeback(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct bdi_writeback *wb;
+	struct wb_lock_cookie cookie = {};
+	bool need_wakeup = false;
+
+	wb = unlocked_inode_to_wb_begin(inode, &cookie);
+	if (wb_has_dirty_io(wb) &&
+	    !test_bit(WB_start_dontcache, &wb->state) &&
+	    !test_and_set_bit(WB_start_dontcache, &wb->state)) {
+		wb_get(wb);
+		need_wakeup = true;
+	}
+	unlocked_inode_to_wb_end(inode, &cookie);
+
+	if (need_wakeup) {
+		wb_wakeup(wb);
+		wb_put(wb);
+	}
+}
+EXPORT_SYMBOL_GPL(filemap_dontcache_kick_writeback);
+
 /*
  * Wakeup the flusher threads to start writeback of all currently dirty pages
  */
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index cb660dd37286..4f1084937315 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -26,6 +26,7 @@ enum wb_state {
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 	WB_start_all,		/* nr_pages == 0 (all) work pending */
+	WB_start_dontcache,	/* dontcache writeback pending */
 };
 
 enum wb_stat_item {
@@ -56,6 +57,7 @@ enum wb_reason {
 	 */
 	WB_REASON_FORKER_THREAD,
 	WB_REASON_FOREIGN_FLUSH,
+	WB_REASON_DONTCACHE,
 
 	WB_REASON_MAX,
 };
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..df72b42a9e9b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2624,6 +2624,7 @@ extern int __must_check file_write_and_wait_range(struct file *file,
 						loff_t start, loff_t end);
 int filemap_flush_range(struct address_space *mapping, loff_t start,
 		loff_t end);
+void filemap_dontcache_kick_writeback(struct address_space *mapping);
 
 static inline int file_write_and_wait(struct file *file)
 {
@@ -2657,10 +2658,7 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
 		if (ret)
 			return ret;
 	} else if (iocb->ki_flags & IOCB_DONTCACHE) {
-		struct address_space *mapping = iocb->ki_filp->f_mapping;
-
-		filemap_flush_range(mapping, iocb->ki_pos - count,
-				iocb->ki_pos - 1);
+		filemap_dontcache_kick_writeback(iocb->ki_filp->f_mapping);
 	}
 
 	return count;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index bdac0d685a98..13ee076ccd16 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -44,7 +44,8 @@
 	EM( WB_REASON_PERIODIC,			"periodic")		\
 	EM( WB_REASON_FS_FREE_SPACE,		"fs_free_space")	\
 	EM( WB_REASON_FORKER_THREAD,		"forker_thread")	\
-	EMe(WB_REASON_FOREIGN_FLUSH,		"foreign_flush")
+	EM( WB_REASON_FOREIGN_FLUSH,		"foreign_flush")	\
+	EMe(WB_REASON_DONTCACHE,		"dontcache")
 
 WB_WORK_REASON
 
-- 
2.54.0