[PATCH v7 0/3] fuse: remove temp page copies in writeback

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v7 0/3] fuse: remove temp page copies in writeback
@ 2025-04-04 18:14 Joanne Koong
  2025-04-04 18:14 ` [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Joanne Koong @ 2025-04-04 18:14 UTC (permalink / raw)
  To: miklos, akpm, linux-fsdevel, linux-mm
  Cc: jefflexu, shakeel.butt, david, bernd.schubert, ziy, jlayton,
	kernel-team

The purpose of this patchset is to help make writeback in FUSE filesystems as
fast as possible.

In the current FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap"))), a temp page is allocated for every dirty
page to be written back, the contents of the dirty page are copied over to the
temp page, and the temp page gets handed to the server to write back. This is
done so that writeback may be immediately cleared on the dirty page, and this 
in turn is done in order to mitigate the following deadlock scenario that may
arise if reclaim waits on writeback on the dirty page to complete (more details
can be found in this thread [1]):
* single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
  direct reclaim

Allocating and copying dirty pages to temp pages is the biggest performance
bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
altogether (which will also allow us to get rid of the internal FUSE rb tree
that is needed to keep track of writeback status on the temp pages).
Benchmarks show approximately a 20% improvement in throughput for 4k
block-size writes and a 45% improvement for 1M block-size writes.

In the current reclaim code, there is one scenario where writeback is waited
on, which is the case where the system is running legacy cgroupv1 and reclaim
encounters a folio that already has the reclaim flag set and the caller did
not have __GFP_FS (or __GFP_IO if swap) set.

This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
filesystems may set on its inode mappings to indicate that writeback
operations may take an indeterminate amount of time to complete. FUSE will set
this flag on its mappings. Reclaim for the legacy cgroup v1 case described
above will skip reclaim of folios with that flag set.

With this change, writeback state is now only cleared on the dirty page after
the server has written it back to disk. If the server is deliberately
malicious or well-intentioned but buggy, this may stall sync(2) and page
migration, but for sync(2), a malicious server may already stall this by not
replying to the FUSE_SYNCFS request and for page migration, there are already
many easier ways to stall this by having FUSE permanently hold the folio lock.
A fuller discussion on this can be found in [2]. Long-term, there needs to be
a more comprehensive solution for addressing migration of FUSE pages that
handles all scenarios where FUSE may permanently hold the lock, but that is
outside the scope of this patchset and will be done as future work. Please
also note that this change also now ensures that when sync(2) returns, FUSE
filesystems will have persisted writeback changes.

[1] https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com/
[2] https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@gmail.com/

Changelog
---------
v6:
https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@gmail.com/
Changes from v6 -> v7:
* Drop migration and sync patches, as they are useless if a server is
  determined to be malicious

v5:
https://lore.kernel.org/linux-fsdevel/20241115224459.427610-1-joannelkoong@gmail.com/
Changes from v5 -> v6:
* Add Shakeel and Jingbo's reviewed-bys 
* Move folio_end_writeback() to fuse_writepage_finish() (Jingbo)
* Embed fuse_writepage_finish_stat() logic inline (Jingbo)
* Remove node_stat NR_WRITEBACK inc/sub (Jingbo)

v4:
https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@gmail.com/
Changes from v4 -> v5:
* AS_WRITEBACK_MAY_BLOCK -> AS_WRITEBACK_INDETERMINATE (Shakeel)
* Drop memory hotplug patch (David and Shakeel)
* Remove some more kunnecessary writeback waits in fuse code (Jingbo)
* Make commit message for reclaim patch more concise - drop part about
  deadlock and just focus on how it may stall waits

v3:
https://lore.kernel.org/linux-fsdevel/20241107191618.2011146-1-joannelkoong@gmail.com/
Changes from v3 -> v4:
* Use filemap_fdatawait_range() instead of filemap_range_has_writeback() in
  readahead

v2:
https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/
Changes from v2 -> v3:
* Account for sync and page migration cases as well (Miklos)
* Change AS_NO_WRITEBACK_RECLAIM to the more generic AS_WRITEBACK_MAY_BLOCK
* For fuse inodes, set mapping_writeback_may_block only if fc->writeback_cache
  is enabled

v1:
https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/T/#t
Changes from v1 -> v2:
* Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
* Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)

Joanne Koong (3):
  mm: add AS_WRITEBACK_INDETERMINATE mapping flag
  mm: skip reclaiming folios in legacy memcg writeback indeterminate
    contexts
  fuse: remove tmp folio for writebacks and internal rb tree

 fs/fuse/file.c          | 360 ++++------------------------------------
 fs/fuse/fuse_i.h        |   3 -
 include/linux/pagemap.h |  11 ++
 mm/vmscan.c             |  10 +-
 4 files changed, 46 insertions(+), 338 deletions(-)

-- 
2.47.1

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag
  2025-04-04 18:14 [PATCH v7 0/3] fuse: remove temp page copies in writeback Joanne Koong
@ 2025-04-04 18:14 ` Joanne Koong
  2025-04-04 19:13   ` David Hildenbrand
  2025-04-04 18:14 ` [PATCH v7 2/3] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Joanne Koong @ 2025-04-04 18:14 UTC (permalink / raw)
  To: miklos, akpm, linux-fsdevel, linux-mm
  Cc: jefflexu, shakeel.butt, david, bernd.schubert, ziy, jlayton,
	kernel-team, Miklos Szeredi

Add a new mapping flag AS_WRITEBACK_INDETERMINATE which filesystems may
set to indicate that writing back to disk may take an indeterminate
amount of time to complete. Extra caution should be taken when waiting
on writeback for folios belonging to mappings where this flag is set.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
---
 include/linux/pagemap.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 26baa78f1ca7..762575f1d195 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -210,6 +210,7 @@ enum mapping_flags {
 	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
 	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
+	AS_WRITEBACK_INDETERMINATE = 9, /* Use caution when waiting on writeback */
 	/* Bits 16-25 are used for FOLIO_ORDER */
 	AS_FOLIO_ORDER_BITS = 5,
 	AS_FOLIO_ORDER_MIN = 16,
@@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
 	return test_bit(AS_INACCESSIBLE, &mapping->flags);
 }
 
+static inline void mapping_set_writeback_indeterminate(struct address_space *mapping)
+{
+	set_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
+}
+
+static inline bool mapping_writeback_indeterminate(struct address_space *mapping)
+{
+	return test_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return mapping->gfp_mask;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v7 2/3] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts
  2025-04-04 18:14 [PATCH v7 0/3] fuse: remove temp page copies in writeback Joanne Koong
  2025-04-04 18:14 ` [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
@ 2025-04-04 18:14 ` Joanne Koong
  2025-04-04 18:14 ` [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
  2025-04-14 16:21 ` [PATCH v7 0/3] fuse: remove temp page copies in writeback Jeff Layton
  3 siblings, 0 replies; 23+ messages in thread
From: Joanne Koong @ 2025-04-04 18:14 UTC (permalink / raw)
  To: miklos, akpm, linux-fsdevel, linux-mm
  Cc: jefflexu, shakeel.butt, david, bernd.schubert, ziy, jlayton,
	kernel-team, Miklos Szeredi

Currently in shrink_folio_list(), reclaim for folios under writeback
falls into 3 different cases:
1) Reclaim is encountering an excessive number of folios under
   writeback and this folio has both the writeback and reclaim flags
   set
2) Dirty throttling is enabled (this happens if reclaim through cgroup
   is not enabled, if reclaim through cgroupv2 memcg is enabled, or
   if reclaim is on the root cgroup), or if the folio is not marked for
   immediate reclaim, or if the caller does not have __GFP_FS (or
   __GFP_IO if it's going to swap) set
3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
   set and the caller did not have __GFP_FS (or __GFP_IO if swap) set

In cases 1) and 2), we activate the folio and skip reclaiming it while
in case 3), we wait for writeback to finish on the folio and then try
to reclaim the folio again. In case 3, we wait on writeback because
cgroupv1 does not have dirty folio throttling, as such this is a
mitigation against the case where there are too many folios in writeback
with nothing else to reclaim.

For filesystems where writeback may take an indeterminate amount of time
to write to disk, this has the possibility of stalling reclaim.

In this commit, if legacy memcg encounters a folio with the reclaim flag
set (eg case 3) and the folio belongs to a mapping that has the
AS_WRITEBACK_INDETERMINATE flag set, the folio will be activated and skip
reclaim (eg default to behavior in case 2) instead.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
---
 mm/vmscan.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b620d74b0f66..d37843b2e621 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1187,8 +1187,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 * 2) Global or new memcg reclaim encounters a folio that is
 		 *    not marked for immediate reclaim, or the caller does not
 		 *    have __GFP_FS (or __GFP_IO if it's simply going to swap,
-		 *    not to fs). In this case mark the folio for immediate
-		 *    reclaim and continue scanning.
+		 *    not to fs), or the writeback may take an indeterminate
+		 *    amount of time to complete. In this case mark the folio
+		 *    for immediate reclaim and continue scanning.
 		 *
 		 *    Require may_enter_fs() because we would wait on fs, which
 		 *    may not have submitted I/O yet. And the loop driver might
@@ -1213,6 +1214,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 * takes to write them to disk.
 		 */
 		if (folio_test_writeback(folio)) {
+			mapping = folio_mapping(folio);
+
 			/* Case 1 above */
 			if (current_is_kswapd() &&
 			    folio_test_reclaim(folio) &&
@@ -1223,7 +1226,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			/* Case 2 above */
 			} else if (writeback_throttling_sane(sc) ||
 			    !folio_test_reclaim(folio) ||
-			    !may_enter_fs(folio, sc->gfp_mask)) {
+			    !may_enter_fs(folio, sc->gfp_mask) ||
+			    (mapping && mapping_writeback_indeterminate(mapping))) {
 				/*
 				 * This is slightly racy -
 				 * folio_end_writeback() might have
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-04 18:14 [PATCH v7 0/3] fuse: remove temp page copies in writeback Joanne Koong
  2025-04-04 18:14 ` [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
  2025-04-04 18:14 ` [PATCH v7 2/3] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
@ 2025-04-04 18:14 ` Joanne Koong
  2025-04-09  2:43   ` Jingbo Xu
  2025-04-14 16:21 ` [PATCH v7 0/3] fuse: remove temp page copies in writeback Jeff Layton
  3 siblings, 1 reply; 23+ messages in thread
From: Joanne Koong @ 2025-04-04 18:14 UTC (permalink / raw)
  To: miklos, akpm, linux-fsdevel, linux-mm
  Cc: jefflexu, shakeel.butt, david, bernd.schubert, ziy, jlayton,
	kernel-team, Miklos Szeredi

In the current FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap")), a temp page is allocated for every
dirty page to be written back, the contents of the dirty page are copied over
to the temp page, and the temp page gets handed to the server to write back.

This is done so that writeback may be immediately cleared on the dirty page,
and this in turn is done in order to mitigate the following deadlock scenario
that may arise if reclaim waits on writeback on the dirty page to complete:
* single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
  direct reclaim

With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
the situations described above, FUSE writeback does not need to use
temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.

This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
and removes the temporary pages + extra copying and the internal rb
tree.

fio benchmarks --
(using averages observed from 10 runs, throwing away outliers)

Setup:
sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
 ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount

fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
--numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount

        bs =  1k          4k            1M
Before  351 MiB/s     1818 MiB/s     1851 MiB/s
After   341 MiB/s     2246 MiB/s     2685 MiB/s
% diff        -3%          23%         45%

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fuse/file.c   | 360 ++++-------------------------------------------
 fs/fuse/fuse_i.h |   3 -
 2 files changed, 28 insertions(+), 335 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 754378dd9f71..91ada0208863 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
 
 struct fuse_writepage_args {
 	struct fuse_io_args ia;
-	struct rb_node writepages_entry;
 	struct list_head queue_entry;
-	struct fuse_writepage_args *next;
 	struct inode *inode;
 	struct fuse_sync_bucket *bucket;
 };
 
-static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
-					    pgoff_t idx_from, pgoff_t idx_to)
-{
-	struct rb_node *n;
-
-	n = fi->writepages.rb_node;
-
-	while (n) {
-		struct fuse_writepage_args *wpa;
-		pgoff_t curr_index;
-
-		wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry);
-		WARN_ON(get_fuse_inode(wpa->inode) != fi);
-		curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT;
-		if (idx_from >= curr_index + wpa->ia.ap.num_folios)
-			n = n->rb_right;
-		else if (idx_to < curr_index)
-			n = n->rb_left;
-		else
-			return wpa;
-	}
-	return NULL;
-}
-
-/*
- * Check if any page in a range is under writeback
- */
-static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
-				   pgoff_t idx_to)
-{
-	struct fuse_inode *fi = get_fuse_inode(inode);
-	bool found;
-
-	if (RB_EMPTY_ROOT(&fi->writepages))
-		return false;
-
-	spin_lock(&fi->lock);
-	found = fuse_find_writeback(fi, idx_from, idx_to);
-	spin_unlock(&fi->lock);
-
-	return found;
-}
-
-static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
-{
-	return fuse_range_is_writeback(inode, index, index);
-}
-
-/*
- * Wait for page writeback to be completed.
- *
- * Since fuse doesn't rely on the VM writeback tracking, this has to
- * use some other means.
- */
-static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
-{
-	struct fuse_inode *fi = get_fuse_inode(inode);
-
-	wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
-}
-
-static inline bool fuse_folio_is_writeback(struct inode *inode,
-					   struct folio *folio)
-{
-	pgoff_t last = folio_next_index(folio) - 1;
-	return fuse_range_is_writeback(inode, folio_index(folio), last);
-}
-
-static void fuse_wait_on_folio_writeback(struct inode *inode,
-					 struct folio *folio)
-{
-	struct fuse_inode *fi = get_fuse_inode(inode);
-
-	wait_event(fi->page_waitq, !fuse_folio_is_writeback(inode, folio));
-}
-
 /*
  * Wait for all pending writepages on the inode to finish.
  *
@@ -886,13 +808,6 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
 	ssize_t res;
 	u64 attr_ver;
 
-	/*
-	 * With the temporary pages that are used to complete writeback, we can
-	 * have writeback that extends beyond the lifetime of the folio.  So
-	 * make sure we read a properly synced folio.
-	 */
-	fuse_wait_on_folio_writeback(inode, folio);
-
 	attr_ver = fuse_get_attr_version(fm->fc);
 
 	/* Don't overflow end offset */
@@ -1005,17 +920,12 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
 static void fuse_readahead(struct readahead_control *rac)
 {
 	struct inode *inode = rac->mapping->host;
-	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	unsigned int max_pages, nr_pages;
-	pgoff_t first = readahead_index(rac);
-	pgoff_t last = first + readahead_count(rac) - 1;
 
 	if (fuse_is_bad(inode))
 		return;
 
-	wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
-
 	max_pages = min_t(unsigned int, fc->max_pages,
 			fc->max_read / PAGE_SIZE);
 
@@ -1181,7 +1091,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 	int err;
 
 	for (i = 0; i < ap->num_folios; i++)
-		fuse_wait_on_folio_writeback(inode, ap->folios[i]);
+		folio_wait_writeback(ap->folios[i]);
 
 	fuse_write_args_fill(ia, ff, pos, count);
 	ia->write.in.flags = fuse_write_flags(iocb);
@@ -1638,7 +1548,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 			return res;
 		}
 	}
-	if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
+	if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
 		if (!write)
 			inode_lock(inode);
 		fuse_sync_writes(inode);
@@ -1835,38 +1745,34 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
 static void fuse_writepage_free(struct fuse_writepage_args *wpa)
 {
 	struct fuse_args_pages *ap = &wpa->ia.ap;
-	int i;
 
 	if (wpa->bucket)
 		fuse_sync_bucket_dec(wpa->bucket);
 
-	for (i = 0; i < ap->num_folios; i++)
-		folio_put(ap->folios[i]);
-
 	fuse_file_put(wpa->ia.ff, false);
 
 	kfree(ap->folios);
 	kfree(wpa);
 }
 
-static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
-{
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
-
-	dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-	node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
-	wb_writeout_inc(&bdi->wb);
-}
-
 static void fuse_writepage_finish(struct fuse_writepage_args *wpa)
 {
 	struct fuse_args_pages *ap = &wpa->ia.ap;
 	struct inode *inode = wpa->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	int i;
 
-	for (i = 0; i < ap->num_folios; i++)
-		fuse_writepage_finish_stat(inode, ap->folios[i]);
+	for (i = 0; i < ap->num_folios; i++) {
+		/*
+		 * Benchmarks showed that ending writeback within the
+		 * scope of the fi->lock alleviates xarray lock
+		 * contention and noticeably improves performance.
+		 */
+		folio_end_writeback(ap->folios[i]);
+		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
+		wb_writeout_inc(&bdi->wb);
+	}
 
 	wake_up(&fi->page_waitq);
 }
@@ -1877,7 +1783,6 @@ static void fuse_send_writepage(struct fuse_mount *fm,
 __releases(fi->lock)
 __acquires(fi->lock)
 {
-	struct fuse_writepage_args *aux, *next;
 	struct fuse_inode *fi = get_fuse_inode(wpa->inode);
 	struct fuse_write_in *inarg = &wpa->ia.write.in;
 	struct fuse_args *args = &wpa->ia.ap.args;
@@ -1914,19 +1819,8 @@ __acquires(fi->lock)
 
  out_free:
 	fi->writectr--;
-	rb_erase(&wpa->writepages_entry, &fi->writepages);
 	fuse_writepage_finish(wpa);
 	spin_unlock(&fi->lock);
-
-	/* After rb_erase() aux request list is private */
-	for (aux = wpa->next; aux; aux = next) {
-		next = aux->next;
-		aux->next = NULL;
-		fuse_writepage_finish_stat(aux->inode,
-					   aux->ia.ap.folios[0]);
-		fuse_writepage_free(aux);
-	}
-
 	fuse_writepage_free(wpa);
 	spin_lock(&fi->lock);
 }
@@ -1954,43 +1848,6 @@ __acquires(fi->lock)
 	}
 }
 
-static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root,
-						struct fuse_writepage_args *wpa)
-{
-	pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT;
-	pgoff_t idx_to = idx_from + wpa->ia.ap.num_folios - 1;
-	struct rb_node **p = &root->rb_node;
-	struct rb_node  *parent = NULL;
-
-	WARN_ON(!wpa->ia.ap.num_folios);
-	while (*p) {
-		struct fuse_writepage_args *curr;
-		pgoff_t curr_index;
-
-		parent = *p;
-		curr = rb_entry(parent, struct fuse_writepage_args,
-				writepages_entry);
-		WARN_ON(curr->inode != wpa->inode);
-		curr_index = curr->ia.write.in.offset >> PAGE_SHIFT;
-
-		if (idx_from >= curr_index + curr->ia.ap.num_folios)
-			p = &(*p)->rb_right;
-		else if (idx_to < curr_index)
-			p = &(*p)->rb_left;
-		else
-			return curr;
-	}
-
-	rb_link_node(&wpa->writepages_entry, parent, p);
-	rb_insert_color(&wpa->writepages_entry, root);
-	return NULL;
-}
-
-static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa)
-{
-	WARN_ON(fuse_insert_writeback(root, wpa));
-}
-
 static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
 			       int error)
 {
@@ -2010,41 +1867,6 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
 	if (!fc->writeback_cache)
 		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
 	spin_lock(&fi->lock);
-	rb_erase(&wpa->writepages_entry, &fi->writepages);
-	while (wpa->next) {
-		struct fuse_mount *fm = get_fuse_mount(inode);
-		struct fuse_write_in *inarg = &wpa->ia.write.in;
-		struct fuse_writepage_args *next = wpa->next;
-
-		wpa->next = next->next;
-		next->next = NULL;
-		tree_insert(&fi->writepages, next);
-
-		/*
-		 * Skip fuse_flush_writepages() to make it easy to crop requests
-		 * based on primary request size.
-		 *
-		 * 1st case (trivial): there are no concurrent activities using
-		 * fuse_set/release_nowrite.  Then we're on safe side because
-		 * fuse_flush_writepages() would call fuse_send_writepage()
-		 * anyway.
-		 *
-		 * 2nd case: someone called fuse_set_nowrite and it is waiting
-		 * now for completion of all in-flight requests.  This happens
-		 * rarely and no more than once per page, so this should be
-		 * okay.
-		 *
-		 * 3rd case: someone (e.g. fuse_do_setattr()) is in the middle
-		 * of fuse_set_nowrite..fuse_release_nowrite section.  The fact
-		 * that fuse_set_nowrite returned implies that all in-flight
-		 * requests were completed along with all of their secondary
-		 * requests.  Further primary requests are blocked by negative
-		 * writectr.  Hence there cannot be any in-flight requests and
-		 * no invocations of fuse_writepage_end() while we're in
-		 * fuse_set_nowrite..fuse_release_nowrite section.
-		 */
-		fuse_send_writepage(fm, next, inarg->offset + inarg->size);
-	}
 	fi->writectr--;
 	fuse_writepage_finish(wpa);
 	spin_unlock(&fi->lock);
@@ -2131,19 +1953,16 @@ static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
 }
 
 static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
-					  struct folio *tmp_folio, uint32_t folio_index)
+					  uint32_t folio_index)
 {
 	struct inode *inode = folio->mapping->host;
 	struct fuse_args_pages *ap = &wpa->ia.ap;
 
-	folio_copy(tmp_folio, folio);
-
-	ap->folios[folio_index] = tmp_folio;
+	ap->folios[folio_index] = folio;
 	ap->descs[folio_index].offset = 0;
 	ap->descs[folio_index].length = PAGE_SIZE;
 
 	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
-	node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);
 }
 
 static struct fuse_writepage_args *fuse_writepage_args_setup(struct folio *folio,
@@ -2178,18 +1997,12 @@ static int fuse_writepage_locked(struct folio *folio)
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_writepage_args *wpa;
 	struct fuse_args_pages *ap;
-	struct folio *tmp_folio;
 	struct fuse_file *ff;
-	int error = -ENOMEM;
-
-	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
-	if (!tmp_folio)
-		goto err;
+	int error = -EIO;
 
-	error = -EIO;
 	ff = fuse_write_file_get(fi);
 	if (!ff)
-		goto err_nofile;
+		goto err;
 
 	wpa = fuse_writepage_args_setup(folio, ff);
 	error = -ENOMEM;
@@ -2200,22 +2013,17 @@ static int fuse_writepage_locked(struct folio *folio)
 	ap->num_folios = 1;
 
 	folio_start_writeback(folio);
-	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, 0);
+	fuse_writepage_args_page_fill(wpa, folio, 0);
 
 	spin_lock(&fi->lock);
-	tree_insert(&fi->writepages, wpa);
 	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
 	fuse_flush_writepages(inode);
 	spin_unlock(&fi->lock);
 
-	folio_end_writeback(folio);
-
 	return 0;
 
 err_writepage_args:
 	fuse_file_put(ff, false);
-err_nofile:
-	folio_put(tmp_folio);
 err:
 	mapping_set_error(folio->mapping, error);
 	return error;
@@ -2225,7 +2033,6 @@ struct fuse_fill_wb_data {
 	struct fuse_writepage_args *wpa;
 	struct fuse_file *ff;
 	struct inode *inode;
-	struct folio **orig_folios;
 	unsigned int max_folios;
 };
 
@@ -2260,69 +2067,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data)
 	struct fuse_writepage_args *wpa = data->wpa;
 	struct inode *inode = data->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
-	int num_folios = wpa->ia.ap.num_folios;
-	int i;
 
 	spin_lock(&fi->lock);
 	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
 	fuse_flush_writepages(inode);
 	spin_unlock(&fi->lock);
-
-	for (i = 0; i < num_folios; i++)
-		folio_end_writeback(data->orig_folios[i]);
-}
-
-/*
- * Check under fi->lock if the page is under writeback, and insert it onto the
- * rb_tree if not. Otherwise iterate auxiliary write requests, to see if there's
- * one already added for a page at this offset.  If there's none, then insert
- * this new request onto the auxiliary list, otherwise reuse the existing one by
- * swapping the new temp page with the old one.
- */
-static bool fuse_writepage_add(struct fuse_writepage_args *new_wpa,
-			       struct folio *folio)
-{
-	struct fuse_inode *fi = get_fuse_inode(new_wpa->inode);
-	struct fuse_writepage_args *tmp;
-	struct fuse_writepage_args *old_wpa;
-	struct fuse_args_pages *new_ap = &new_wpa->ia.ap;
-
-	WARN_ON(new_ap->num_folios != 0);
-	new_ap->num_folios = 1;
-
-	spin_lock(&fi->lock);
-	old_wpa = fuse_insert_writeback(&fi->writepages, new_wpa);
-	if (!old_wpa) {
-		spin_unlock(&fi->lock);
-		return true;
-	}
-
-	for (tmp = old_wpa->next; tmp; tmp = tmp->next) {
-		pgoff_t curr_index;
-
-		WARN_ON(tmp->inode != new_wpa->inode);
-		curr_index = tmp->ia.write.in.offset >> PAGE_SHIFT;
-		if (curr_index == folio->index) {
-			WARN_ON(tmp->ia.ap.num_folios != 1);
-			swap(tmp->ia.ap.folios[0], new_ap->folios[0]);
-			break;
-		}
-	}
-
-	if (!tmp) {
-		new_wpa->next = old_wpa->next;
-		old_wpa->next = new_wpa;
-	}
-
-	spin_unlock(&fi->lock);
-
-	if (tmp) {
-		fuse_writepage_finish_stat(new_wpa->inode,
-					   folio);
-		fuse_writepage_free(new_wpa);
-	}
-
-	return false;
 }
 
 static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
@@ -2331,15 +2080,6 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
 {
 	WARN_ON(!ap->num_folios);
 
-	/*
-	 * Being under writeback is unlikely but possible.  For example direct
-	 * read to an mmaped fuse file will set the page dirty twice; once when
-	 * the pages are faulted with get_user_pages(), and then after the read
-	 * completed.
-	 */
-	if (fuse_folio_is_writeback(data->inode, folio))
-		return true;
-
 	/* Reached max pages */
 	if (ap->num_folios == fc->max_pages)
 		return true;
@@ -2349,7 +2089,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
 		return true;
 
 	/* Discontinuity */
-	if (data->orig_folios[ap->num_folios - 1]->index + 1 != folio_index(folio))
+	if (ap->folios[ap->num_folios - 1]->index + 1 != folio_index(folio))
 		return true;
 
 	/* Need to grow the pages array?  If so, did the expansion fail? */
@@ -2368,7 +2108,6 @@ static int fuse_writepages_fill(struct folio *folio,
 	struct inode *inode = data->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	struct folio *tmp_folio;
 	int err;
 
 	if (!data->ff) {
@@ -2383,54 +2122,23 @@ static int fuse_writepages_fill(struct folio *folio,
 		data->wpa = NULL;
 	}
 
-	err = -ENOMEM;
-	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
-	if (!tmp_folio)
-		goto out_unlock;
-
-	/*
-	 * The page must not be redirtied until the writeout is completed
-	 * (i.e. userspace has sent a reply to the write request).  Otherwise
-	 * there could be more than one temporary page instance for each real
-	 * page.
-	 *
-	 * This is ensured by holding the page lock in page_mkwrite() while
-	 * checking fuse_page_is_writeback().  We already hold the page lock
-	 * since clear_page_dirty_for_io() and keep it held until we add the
-	 * request to the fi->writepages list and increment ap->num_folios.
-	 * After this fuse_page_is_writeback() will indicate that the page is
-	 * under writeback, so we can release the page lock.
-	 */
 	if (data->wpa == NULL) {
 		err = -ENOMEM;
 		wpa = fuse_writepage_args_setup(folio, data->ff);
-		if (!wpa) {
-			folio_put(tmp_folio);
+		if (!wpa)
 			goto out_unlock;
-		}
 		fuse_file_get(wpa->ia.ff);
 		data->max_folios = 1;
 		ap = &wpa->ia.ap;
 	}
 	folio_start_writeback(folio);
 
-	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, ap->num_folios);
-	data->orig_folios[ap->num_folios] = folio;
+	fuse_writepage_args_page_fill(wpa, folio, ap->num_folios);
 
 	err = 0;
-	if (data->wpa) {
-		/*
-		 * Protected by fi->lock against concurrent access by
-		 * fuse_page_is_writeback().
-		 */
-		spin_lock(&fi->lock);
-		ap->num_folios++;
-		spin_unlock(&fi->lock);
-	} else if (fuse_writepage_add(wpa, folio)) {
+	ap->num_folios++;
+	if (!data->wpa)
 		data->wpa = wpa;
-	} else {
-		folio_end_writeback(folio);
-	}
 out_unlock:
 	folio_unlock(folio);
 
@@ -2457,13 +2165,6 @@ static int fuse_writepages(struct address_space *mapping,
 	data.wpa = NULL;
 	data.ff = NULL;
 
-	err = -ENOMEM;
-	data.orig_folios = kcalloc(fc->max_pages,
-				   sizeof(struct folio *),
-				   GFP_NOFS);
-	if (!data.orig_folios)
-		goto out;
-
 	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
 	if (data.wpa) {
 		WARN_ON(!data.wpa->ia.ap.num_folios);
@@ -2472,7 +2173,6 @@ static int fuse_writepages(struct address_space *mapping,
 	if (data.ff)
 		fuse_file_put(data.ff, false);
 
-	kfree(data.orig_folios);
 out:
 	return err;
 }
@@ -2497,8 +2197,6 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
 	if (IS_ERR(folio))
 		goto error;
 
-	fuse_wait_on_page_writeback(mapping->host, folio->index);
-
 	if (folio_test_uptodate(folio) || len >= folio_size(folio))
 		goto success;
 	/*
@@ -2561,13 +2259,9 @@ static int fuse_launder_folio(struct folio *folio)
 {
 	int err = 0;
 	if (folio_clear_dirty_for_io(folio)) {
-		struct inode *inode = folio->mapping->host;
-
-		/* Serialize with pending writeback for the same page */
-		fuse_wait_on_page_writeback(inode, folio->index);
 		err = fuse_writepage_locked(folio);
 		if (!err)
-			fuse_wait_on_page_writeback(inode, folio->index);
+			folio_wait_writeback(folio);
 	}
 	return err;
 }
@@ -2611,7 +2305,7 @@ static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf)
 		return VM_FAULT_NOPAGE;
 	}
 
-	fuse_wait_on_folio_writeback(inode, folio);
+	folio_wait_writeback(folio);
 	return VM_FAULT_LOCKED;
 }
 
@@ -3429,9 +3123,12 @@ static const struct address_space_operations fuse_file_aops  = {
 void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	inode->i_fop = &fuse_file_operations;
 	inode->i_data.a_ops = &fuse_file_aops;
+	if (fc->writeback_cache)
+		mapping_set_writeback_indeterminate(&inode->i_data);
 
 	INIT_LIST_HEAD(&fi->write_files);
 	INIT_LIST_HEAD(&fi->queued_writes);
@@ -3439,7 +3136,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 	fi->iocachectr = 0;
 	init_waitqueue_head(&fi->page_waitq);
 	init_waitqueue_head(&fi->direct_io_waitq);
-	fi->writepages = RB_ROOT;
 
 	if (IS_ENABLED(CONFIG_FUSE_DAX))
 		fuse_dax_inode_init(inode, flags);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index fee96fe7887b..28a101a5c558 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -141,9 +141,6 @@ struct fuse_inode {
 
 			/* waitq for direct-io completion */
 			wait_queue_head_t direct_io_waitq;
-
-			/* List of writepage requestst (pending or sent) */
-			struct rb_root writepages;
 		};
 
 		/* readdir cache (directory only) */
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag
  2025-04-04 18:14 ` [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
@ 2025-04-04 19:13   ` David Hildenbrand
  2025-04-04 20:09     ` Joanne Koong
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-04-04 19:13 UTC (permalink / raw)
  To: Joanne Koong, miklos, akpm, linux-fsdevel, linux-mm
  Cc: jefflexu, shakeel.butt, bernd.schubert, ziy, jlayton, kernel-team,
	Miklos Szeredi

On 04.04.25 20:14, Joanne Koong wrote:
> Add a new mapping flag AS_WRITEBACK_INDETERMINATE which filesystems may
> set to indicate that writing back to disk may take an indeterminate
> amount of time to complete. Extra caution should be taken when waiting
> on writeback for folios belonging to mappings where this flag is set.
> 
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>   include/linux/pagemap.h | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 26baa78f1ca7..762575f1d195 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -210,6 +210,7 @@ enum mapping_flags {
>   	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
>   				   folio contents */
>   	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
> +	AS_WRITEBACK_INDETERMINATE = 9, /* Use caution when waiting on writeback */
>   	/* Bits 16-25 are used for FOLIO_ORDER */
>   	AS_FOLIO_ORDER_BITS = 5,
>   	AS_FOLIO_ORDER_MIN = 16,
> @@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
>   	return test_bit(AS_INACCESSIBLE, &mapping->flags);
>   }
>   
> +static inline void mapping_set_writeback_indeterminate(struct address_space *mapping)
> +{
> +	set_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
> +}
> +
> +static inline bool mapping_writeback_indeterminate(struct address_space *mapping)
> +{
> +	return test_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
> +}
> +
>   static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
>   {
>   	return mapping->gfp_mask;

Staring at this again reminds me of my comment in [1]

"
b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
      that very deadlock problem.
"

In the context here now, where we really only focus on the reclaim 
deadlock that can happen for trusted FUSE servers during reclaim, would 
it make sense to call it now something like that?

[1] 
https://lore.kernel.org/linux-mm/0ed5241e-10af-43ee-baaf-87a5b4dc9694@redhat.com/

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag
  2025-04-04 19:13   ` David Hildenbrand
@ 2025-04-04 20:09     ` Joanne Koong
  2025-04-04 20:13       ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Joanne Koong @ 2025-04-04 20:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: miklos, akpm, linux-fsdevel, linux-mm, jefflexu, shakeel.butt,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi

On Fri, Apr 4, 2025 at 12:13 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.04.25 20:14, Joanne Koong wrote:
> > Add a new mapping flag AS_WRITEBACK_INDETERMINATE which filesystems may
> > set to indicate that writing back to disk may take an indeterminate
> > amount of time to complete. Extra caution should be taken when waiting
> > on writeback for folios belonging to mappings where this flag is set.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> > ---
> >   include/linux/pagemap.h | 11 +++++++++++
> >   1 file changed, 11 insertions(+)
> >
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 26baa78f1ca7..762575f1d195 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -210,6 +210,7 @@ enum mapping_flags {
> >       AS_STABLE_WRITES = 7,   /* must wait for writeback before modifying
> >                                  folio contents */
> >       AS_INACCESSIBLE = 8,    /* Do not attempt direct R/W access to the mapping */
> > +     AS_WRITEBACK_INDETERMINATE = 9, /* Use caution when waiting on writeback */
> >       /* Bits 16-25 are used for FOLIO_ORDER */
> >       AS_FOLIO_ORDER_BITS = 5,
> >       AS_FOLIO_ORDER_MIN = 16,
> > @@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
> >       return test_bit(AS_INACCESSIBLE, &mapping->flags);
> >   }
> >
> > +static inline void mapping_set_writeback_indeterminate(struct address_space *mapping)
> > +{
> > +     set_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
> > +}
> > +
> > +static inline bool mapping_writeback_indeterminate(struct address_space *mapping)
> > +{
> > +     return test_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
> > +}
> > +
> >   static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
> >   {
> >       return mapping->gfp_mask;
>
> Staring at this again reminds me of my comment in [1]
>
> "
> b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
>       that very deadlock problem.
> "
>
> In the context here now, where we really only focus on the reclaim
> deadlock that can happen for trusted FUSE servers during reclaim, would
> it make sense to call it now something like that?

Happy to make this change. My thinking was that
'AS_WRITEBACK_INDETERMINATE' could be reused in the future for stuff
besides reclaim, but we can cross that bridge if that ends up being
the case. Will submit v8 with this changed to
AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM.

Thanks,
Joanne

>
> [1]
> https://lore.kernel.org/linux-mm/0ed5241e-10af-43ee-baaf-87a5b4dc9694@redhat.com/
>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag
  2025-04-04 20:09     ` Joanne Koong
@ 2025-04-04 20:13       ` David Hildenbrand
  2025-04-09 22:05         ` Shakeel Butt
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-04-04 20:13 UTC (permalink / raw)
  To: Joanne Koong
  Cc: miklos, akpm, linux-fsdevel, linux-mm, jefflexu, shakeel.butt,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi

On 04.04.25 22:09, Joanne Koong wrote:
> On Fri, Apr 4, 2025 at 12:13 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.04.25 20:14, Joanne Koong wrote:
>>> Add a new mapping flag AS_WRITEBACK_INDETERMINATE which filesystems may
>>> set to indicate that writing back to disk may take an indeterminate
>>> amount of time to complete. Extra caution should be taken when waiting
>>> on writeback for folios belonging to mappings where this flag is set.
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
>>> ---
>>>    include/linux/pagemap.h | 11 +++++++++++
>>>    1 file changed, 11 insertions(+)
>>>
>>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>>> index 26baa78f1ca7..762575f1d195 100644
>>> --- a/include/linux/pagemap.h
>>> +++ b/include/linux/pagemap.h
>>> @@ -210,6 +210,7 @@ enum mapping_flags {
>>>        AS_STABLE_WRITES = 7,   /* must wait for writeback before modifying
>>>                                   folio contents */
>>>        AS_INACCESSIBLE = 8,    /* Do not attempt direct R/W access to the mapping */
>>> +     AS_WRITEBACK_INDETERMINATE = 9, /* Use caution when waiting on writeback */
>>>        /* Bits 16-25 are used for FOLIO_ORDER */
>>>        AS_FOLIO_ORDER_BITS = 5,
>>>        AS_FOLIO_ORDER_MIN = 16,
>>> @@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
>>>        return test_bit(AS_INACCESSIBLE, &mapping->flags);
>>>    }
>>>
>>> +static inline void mapping_set_writeback_indeterminate(struct address_space *mapping)
>>> +{
>>> +     set_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
>>> +}
>>> +
>>> +static inline bool mapping_writeback_indeterminate(struct address_space *mapping)
>>> +{
>>> +     return test_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
>>> +}
>>> +
>>>    static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
>>>    {
>>>        return mapping->gfp_mask;
>>
>> Staring at this again reminds me of my comment in [1]
>>
>> "
>> b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
>>        that very deadlock problem.
>> "
>>
>> In the context here now, where we really only focus on the reclaim
>> deadlock that can happen for trusted FUSE servers during reclaim, would
>> it make sense to call it now something like that?
> 
> Happy to make this change. My thinking was that
> 'AS_WRITEBACK_INDETERMINATE' could be reused in the future for stuff
> besides reclaim, but we can cross that bridge if that ends up being
> the case. 

Yes, but I'm afraid one we start using it in other context we're 
reaching the point where we are trying to deal with untrusted user space 
and the page lock would already be a similar problem.

Happy to be wrong on this one.

Wait for other opinions first. Apart from that, no objection from my side.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-04 18:14 ` [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
@ 2025-04-09  2:43   ` Jingbo Xu
  2025-04-09 23:47     ` Joanne Koong
  0 siblings, 1 reply; 23+ messages in thread
From: Jingbo Xu @ 2025-04-09  2:43 UTC (permalink / raw)
  To: Joanne Koong, miklos, akpm, linux-fsdevel, linux-mm
  Cc: shakeel.butt, david, bernd.schubert, ziy, jlayton, kernel-team,
	Miklos Szeredi

Hi Joanne,

On 4/5/25 2:14 AM, Joanne Koong wrote:
> In the current FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), a temp page is allocated for every
> dirty page to be written back, the contents of the dirty page are copied over
> to the temp page, and the temp page gets handed to the server to write back.
> 
> This is done so that writeback may be immediately cleared on the dirty page,
> and this in turn is done in order to mitigate the following deadlock scenario
> that may arise if reclaim waits on writeback on the dirty page to complete:
> * single-threaded FUSE server is in the middle of handling a request
>   that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
>   direct reclaim
> 
> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> the situations described above, FUSE writeback does not need to use
> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> 
> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> and removes the temporary pages + extra copying and the internal rb
> tree.
> 
> fio benchmarks --
> (using averages observed from 10 runs, throwing away outliers)
> 
> Setup:
> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> 
> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> 
>         bs =  1k          4k            1M
> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> % diff        -3%          23%         45%
> 
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> Acked-by: Miklos Szeredi <mszeredi@redhat.com>

Overall this patch LGTM.

Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
also unneeded then, at least the DIRECT IO routine (i.e.
fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
because after removing the temp page, the DIRECT IO routine has already
been waiting for all inflight WRITE requests, see

# DIRECT read
generic_file_read_iter
  kiocb_write_and_wait
    filemap_write_and_wait_range

# DIRECT write
generic_file_write_iter
  generic_file_direct_write
    kiocb_invalidate_pages
      filemap_invalidate_pages
	filemap_write_and_wait_range

The DIRECT write routine will also invalidate the page cache in the
range that is written to, so that the following buffer write needs to
read the page cache back first. The writeback following the buffer write
is much likely after the DIRECT write, so that the writeback won't
conflict with the DIRECT write (i.e. there won't be duplicate WRITE
requests for the same page that are initiated from DIRECT write and
writeback at the same time), which is exactly why fi->writectr and
fi->queued_writes are introduced.  However it seems that the writeback
won't wait for previous inflight DIRECT WRITE requests, so I'm not much
sure about that.  Maybe other folks could offer more insights...

Also fuse_sync_writes() is not needed in fuse_flush() anymore, with
which I'm pretty sure.

The potential cleanup for fi->writectr and fi->queued_writes could be
offered as following separate patches (if any).


> ---
>  fs/fuse/file.c   | 360 ++++-------------------------------------------
>  fs/fuse/fuse_i.h |   3 -
>  2 files changed, 28 insertions(+), 335 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 754378dd9f71..91ada0208863 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
>  
>  struct fuse_writepage_args {
>  	struct fuse_io_args ia;
> -	struct rb_node writepages_entry;
>  	struct list_head queue_entry;
> -	struct fuse_writepage_args *next;
>  	struct inode *inode;
>  	struct fuse_sync_bucket *bucket;
>  };
>  
> -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
> -					    pgoff_t idx_from, pgoff_t idx_to)
> -{
> -	struct rb_node *n;
> -
> -	n = fi->writepages.rb_node;
> -
> -	while (n) {
> -		struct fuse_writepage_args *wpa;
> -		pgoff_t curr_index;
> -
> -		wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry);
> -		WARN_ON(get_fuse_inode(wpa->inode) != fi);
> -		curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT;
> -		if (idx_from >= curr_index + wpa->ia.ap.num_folios)
> -			n = n->rb_right;
> -		else if (idx_to < curr_index)
> -			n = n->rb_left;
> -		else
> -			return wpa;
> -	}
> -	return NULL;
> -}
> -
> -/*
> - * Check if any page in a range is under writeback
> - */
> -static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
> -				   pgoff_t idx_to)
> -{
> -	struct fuse_inode *fi = get_fuse_inode(inode);
> -	bool found;
> -
> -	if (RB_EMPTY_ROOT(&fi->writepages))
> -		return false;
> -
> -	spin_lock(&fi->lock);
> -	found = fuse_find_writeback(fi, idx_from, idx_to);
> -	spin_unlock(&fi->lock);
> -
> -	return found;
> -}
> -
> -static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
> -{
> -	return fuse_range_is_writeback(inode, index, index);
> -}
> -
> -/*
> - * Wait for page writeback to be completed.
> - *
> - * Since fuse doesn't rely on the VM writeback tracking, this has to
> - * use some other means.
> - */
> -static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
> -{
> -	struct fuse_inode *fi = get_fuse_inode(inode);
> -
> -	wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
> -}
> -
> -static inline bool fuse_folio_is_writeback(struct inode *inode,
> -					   struct folio *folio)
> -{
> -	pgoff_t last = folio_next_index(folio) - 1;
> -	return fuse_range_is_writeback(inode, folio_index(folio), last);
> -}
> -
> -static void fuse_wait_on_folio_writeback(struct inode *inode,
> -					 struct folio *folio)
> -{
> -	struct fuse_inode *fi = get_fuse_inode(inode);
> -
> -	wait_event(fi->page_waitq, !fuse_folio_is_writeback(inode, folio));
> -}
> -
>  /*
>   * Wait for all pending writepages on the inode to finish.
>   *
> @@ -886,13 +808,6 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
>  	ssize_t res;
>  	u64 attr_ver;
>  
> -	/*
> -	 * With the temporary pages that are used to complete writeback, we can
> -	 * have writeback that extends beyond the lifetime of the folio.  So
> -	 * make sure we read a properly synced folio.
> -	 */
> -	fuse_wait_on_folio_writeback(inode, folio);
> -
>  	attr_ver = fuse_get_attr_version(fm->fc);
>  
>  	/* Don't overflow end offset */
> @@ -1005,17 +920,12 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
>  static void fuse_readahead(struct readahead_control *rac)
>  {
>  	struct inode *inode = rac->mapping->host;
> -	struct fuse_inode *fi = get_fuse_inode(inode);
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  	unsigned int max_pages, nr_pages;
> -	pgoff_t first = readahead_index(rac);
> -	pgoff_t last = first + readahead_count(rac) - 1;
>  
>  	if (fuse_is_bad(inode))
>  		return;
>  
> -	wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
> -
>  	max_pages = min_t(unsigned int, fc->max_pages,
>  			fc->max_read / PAGE_SIZE);
>  
> @@ -1181,7 +1091,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>  	int err;
>  
>  	for (i = 0; i < ap->num_folios; i++)
> -		fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> +		folio_wait_writeback(ap->folios[i]);
>  
>  	fuse_write_args_fill(ia, ff, pos, count);
>  	ia->write.in.flags = fuse_write_flags(iocb);
> @@ -1638,7 +1548,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>  			return res;
>  		}
>  	}
> -	if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
> +	if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
>  		if (!write)
>  			inode_lock(inode);
>  		fuse_sync_writes(inode);
> @@ -1835,38 +1745,34 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
>  static void fuse_writepage_free(struct fuse_writepage_args *wpa)
>  {
>  	struct fuse_args_pages *ap = &wpa->ia.ap;
> -	int i;
>  
>  	if (wpa->bucket)
>  		fuse_sync_bucket_dec(wpa->bucket);
>  
> -	for (i = 0; i < ap->num_folios; i++)
> -		folio_put(ap->folios[i]);
> -
>  	fuse_file_put(wpa->ia.ff, false);
>  
>  	kfree(ap->folios);
>  	kfree(wpa);
>  }
>  
> -static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
> -{
> -	struct backing_dev_info *bdi = inode_to_bdi(inode);
> -
> -	dec_wb_stat(&bdi->wb, WB_WRITEBACK);
> -	node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
> -	wb_writeout_inc(&bdi->wb);
> -}
> -
>  static void fuse_writepage_finish(struct fuse_writepage_args *wpa)
>  {
>  	struct fuse_args_pages *ap = &wpa->ia.ap;
>  	struct inode *inode = wpa->inode;
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
>  	int i;
>  
> -	for (i = 0; i < ap->num_folios; i++)
> -		fuse_writepage_finish_stat(inode, ap->folios[i]);
> +	for (i = 0; i < ap->num_folios; i++) {
> +		/*
> +		 * Benchmarks showed that ending writeback within the
> +		 * scope of the fi->lock alleviates xarray lock
> +		 * contention and noticeably improves performance.
> +		 */
> +		folio_end_writeback(ap->folios[i]);
> +		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
> +		wb_writeout_inc(&bdi->wb);
> +	}
>  
>  	wake_up(&fi->page_waitq);
>  }
> @@ -1877,7 +1783,6 @@ static void fuse_send_writepage(struct fuse_mount *fm,
>  __releases(fi->lock)
>  __acquires(fi->lock)
>  {
> -	struct fuse_writepage_args *aux, *next;
>  	struct fuse_inode *fi = get_fuse_inode(wpa->inode);
>  	struct fuse_write_in *inarg = &wpa->ia.write.in;
>  	struct fuse_args *args = &wpa->ia.ap.args;
> @@ -1914,19 +1819,8 @@ __acquires(fi->lock)
>  
>   out_free:
>  	fi->writectr--;
> -	rb_erase(&wpa->writepages_entry, &fi->writepages);
>  	fuse_writepage_finish(wpa);
>  	spin_unlock(&fi->lock);
> -
> -	/* After rb_erase() aux request list is private */
> -	for (aux = wpa->next; aux; aux = next) {
> -		next = aux->next;
> -		aux->next = NULL;
> -		fuse_writepage_finish_stat(aux->inode,
> -					   aux->ia.ap.folios[0]);
> -		fuse_writepage_free(aux);
> -	}
> -
>  	fuse_writepage_free(wpa);
>  	spin_lock(&fi->lock);
>  }
> @@ -1954,43 +1848,6 @@ __acquires(fi->lock)
>  	}
>  }
>  
> -static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root,
> -						struct fuse_writepage_args *wpa)
> -{
> -	pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT;
> -	pgoff_t idx_to = idx_from + wpa->ia.ap.num_folios - 1;
> -	struct rb_node **p = &root->rb_node;
> -	struct rb_node  *parent = NULL;
> -
> -	WARN_ON(!wpa->ia.ap.num_folios);
> -	while (*p) {
> -		struct fuse_writepage_args *curr;
> -		pgoff_t curr_index;
> -
> -		parent = *p;
> -		curr = rb_entry(parent, struct fuse_writepage_args,
> -				writepages_entry);
> -		WARN_ON(curr->inode != wpa->inode);
> -		curr_index = curr->ia.write.in.offset >> PAGE_SHIFT;
> -
> -		if (idx_from >= curr_index + curr->ia.ap.num_folios)
> -			p = &(*p)->rb_right;
> -		else if (idx_to < curr_index)
> -			p = &(*p)->rb_left;
> -		else
> -			return curr;
> -	}
> -
> -	rb_link_node(&wpa->writepages_entry, parent, p);
> -	rb_insert_color(&wpa->writepages_entry, root);
> -	return NULL;
> -}
> -
> -static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa)
> -{
> -	WARN_ON(fuse_insert_writeback(root, wpa));
> -}
> -
>  static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
>  			       int error)
>  {
> @@ -2010,41 +1867,6 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
>  	if (!fc->writeback_cache)
>  		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
>  	spin_lock(&fi->lock);
> -	rb_erase(&wpa->writepages_entry, &fi->writepages);
> -	while (wpa->next) {
> -		struct fuse_mount *fm = get_fuse_mount(inode);
> -		struct fuse_write_in *inarg = &wpa->ia.write.in;
> -		struct fuse_writepage_args *next = wpa->next;
> -
> -		wpa->next = next->next;
> -		next->next = NULL;
> -		tree_insert(&fi->writepages, next);
> -
> -		/*
> -		 * Skip fuse_flush_writepages() to make it easy to crop requests
> -		 * based on primary request size.
> -		 *
> -		 * 1st case (trivial): there are no concurrent activities using
> -		 * fuse_set/release_nowrite.  Then we're on safe side because
> -		 * fuse_flush_writepages() would call fuse_send_writepage()
> -		 * anyway.
> -		 *
> -		 * 2nd case: someone called fuse_set_nowrite and it is waiting
> -		 * now for completion of all in-flight requests.  This happens
> -		 * rarely and no more than once per page, so this should be
> -		 * okay.
> -		 *
> -		 * 3rd case: someone (e.g. fuse_do_setattr()) is in the middle
> -		 * of fuse_set_nowrite..fuse_release_nowrite section.  The fact
> -		 * that fuse_set_nowrite returned implies that all in-flight
> -		 * requests were completed along with all of their secondary
> -		 * requests.  Further primary requests are blocked by negative
> -		 * writectr.  Hence there cannot be any in-flight requests and
> -		 * no invocations of fuse_writepage_end() while we're in
> -		 * fuse_set_nowrite..fuse_release_nowrite section.
> -		 */
> -		fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> -	}
>  	fi->writectr--;
>  	fuse_writepage_finish(wpa);
>  	spin_unlock(&fi->lock);
> @@ -2131,19 +1953,16 @@ static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
>  }
>  
>  static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> -					  struct folio *tmp_folio, uint32_t folio_index)
> +					  uint32_t folio_index)
>  {
>  	struct inode *inode = folio->mapping->host;
>  	struct fuse_args_pages *ap = &wpa->ia.ap;
>  
> -	folio_copy(tmp_folio, folio);
> -
> -	ap->folios[folio_index] = tmp_folio;
> +	ap->folios[folio_index] = folio;
>  	ap->descs[folio_index].offset = 0;
>  	ap->descs[folio_index].length = PAGE_SIZE;
>  
>  	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
> -	node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);
>  }
>  
>  static struct fuse_writepage_args *fuse_writepage_args_setup(struct folio *folio,
> @@ -2178,18 +1997,12 @@ static int fuse_writepage_locked(struct folio *folio)
>  	struct fuse_inode *fi = get_fuse_inode(inode);
>  	struct fuse_writepage_args *wpa;
>  	struct fuse_args_pages *ap;
> -	struct folio *tmp_folio;
>  	struct fuse_file *ff;
> -	int error = -ENOMEM;
> -
> -	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
> -	if (!tmp_folio)
> -		goto err;
> +	int error = -EIO;
>  
> -	error = -EIO;
>  	ff = fuse_write_file_get(fi);
>  	if (!ff)
> -		goto err_nofile;
> +		goto err;
>  
>  	wpa = fuse_writepage_args_setup(folio, ff);
>  	error = -ENOMEM;
> @@ -2200,22 +2013,17 @@ static int fuse_writepage_locked(struct folio *folio)
>  	ap->num_folios = 1;
>  
>  	folio_start_writeback(folio);
> -	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, 0);
> +	fuse_writepage_args_page_fill(wpa, folio, 0);
>  
>  	spin_lock(&fi->lock);
> -	tree_insert(&fi->writepages, wpa);
>  	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
>  	fuse_flush_writepages(inode);
>  	spin_unlock(&fi->lock);
>  
> -	folio_end_writeback(folio);
> -
>  	return 0;
>  
>  err_writepage_args:
>  	fuse_file_put(ff, false);
> -err_nofile:
> -	folio_put(tmp_folio);
>  err:
>  	mapping_set_error(folio->mapping, error);
>  	return error;
> @@ -2225,7 +2033,6 @@ struct fuse_fill_wb_data {
>  	struct fuse_writepage_args *wpa;
>  	struct fuse_file *ff;
>  	struct inode *inode;
> -	struct folio **orig_folios;
>  	unsigned int max_folios;
>  };
>  
> @@ -2260,69 +2067,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data)
>  	struct fuse_writepage_args *wpa = data->wpa;
>  	struct inode *inode = data->inode;
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> -	int num_folios = wpa->ia.ap.num_folios;
> -	int i;
>  
>  	spin_lock(&fi->lock);
>  	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
>  	fuse_flush_writepages(inode);
>  	spin_unlock(&fi->lock);
> -
> -	for (i = 0; i < num_folios; i++)
> -		folio_end_writeback(data->orig_folios[i]);
> -}
> -
> -/*
> - * Check under fi->lock if the page is under writeback, and insert it onto the
> - * rb_tree if not. Otherwise iterate auxiliary write requests, to see if there's
> - * one already added for a page at this offset.  If there's none, then insert
> - * this new request onto the auxiliary list, otherwise reuse the existing one by
> - * swapping the new temp page with the old one.
> - */
> -static bool fuse_writepage_add(struct fuse_writepage_args *new_wpa,
> -			       struct folio *folio)
> -{
> -	struct fuse_inode *fi = get_fuse_inode(new_wpa->inode);
> -	struct fuse_writepage_args *tmp;
> -	struct fuse_writepage_args *old_wpa;
> -	struct fuse_args_pages *new_ap = &new_wpa->ia.ap;
> -
> -	WARN_ON(new_ap->num_folios != 0);
> -	new_ap->num_folios = 1;
> -
> -	spin_lock(&fi->lock);
> -	old_wpa = fuse_insert_writeback(&fi->writepages, new_wpa);
> -	if (!old_wpa) {
> -		spin_unlock(&fi->lock);
> -		return true;
> -	}
> -
> -	for (tmp = old_wpa->next; tmp; tmp = tmp->next) {
> -		pgoff_t curr_index;
> -
> -		WARN_ON(tmp->inode != new_wpa->inode);
> -		curr_index = tmp->ia.write.in.offset >> PAGE_SHIFT;
> -		if (curr_index == folio->index) {
> -			WARN_ON(tmp->ia.ap.num_folios != 1);
> -			swap(tmp->ia.ap.folios[0], new_ap->folios[0]);
> -			break;
> -		}
> -	}
> -
> -	if (!tmp) {
> -		new_wpa->next = old_wpa->next;
> -		old_wpa->next = new_wpa;
> -	}
> -
> -	spin_unlock(&fi->lock);
> -
> -	if (tmp) {
> -		fuse_writepage_finish_stat(new_wpa->inode,
> -					   folio);
> -		fuse_writepage_free(new_wpa);
> -	}
> -
> -	return false;
>  }
>  
>  static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
> @@ -2331,15 +2080,6 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
>  {
>  	WARN_ON(!ap->num_folios);
>  
> -	/*
> -	 * Being under writeback is unlikely but possible.  For example direct
> -	 * read to an mmaped fuse file will set the page dirty twice; once when
> -	 * the pages are faulted with get_user_pages(), and then after the read
> -	 * completed.
> -	 */
> -	if (fuse_folio_is_writeback(data->inode, folio))
> -		return true;
> -
>  	/* Reached max pages */
>  	if (ap->num_folios == fc->max_pages)
>  		return true;
> @@ -2349,7 +2089,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
>  		return true;
>  
>  	/* Discontinuity */
> -	if (data->orig_folios[ap->num_folios - 1]->index + 1 != folio_index(folio))
> +	if (ap->folios[ap->num_folios - 1]->index + 1 != folio_index(folio))
>  		return true;
>  
>  	/* Need to grow the pages array?  If so, did the expansion fail? */
> @@ -2368,7 +2108,6 @@ static int fuse_writepages_fill(struct folio *folio,
>  	struct inode *inode = data->inode;
>  	struct fuse_inode *fi = get_fuse_inode(inode);
>  	struct fuse_conn *fc = get_fuse_conn(inode);
> -	struct folio *tmp_folio;
>  	int err;
>  
>  	if (!data->ff) {
> @@ -2383,54 +2122,23 @@ static int fuse_writepages_fill(struct folio *folio,
>  		data->wpa = NULL;
>  	}
>  
> -	err = -ENOMEM;
> -	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
> -	if (!tmp_folio)
> -		goto out_unlock;
> -
> -	/*
> -	 * The page must not be redirtied until the writeout is completed
> -	 * (i.e. userspace has sent a reply to the write request).  Otherwise
> -	 * there could be more than one temporary page instance for each real
> -	 * page.
> -	 *
> -	 * This is ensured by holding the page lock in page_mkwrite() while
> -	 * checking fuse_page_is_writeback().  We already hold the page lock
> -	 * since clear_page_dirty_for_io() and keep it held until we add the
> -	 * request to the fi->writepages list and increment ap->num_folios.
> -	 * After this fuse_page_is_writeback() will indicate that the page is
> -	 * under writeback, so we can release the page lock.
> -	 */
>  	if (data->wpa == NULL) {
>  		err = -ENOMEM;
>  		wpa = fuse_writepage_args_setup(folio, data->ff);
> -		if (!wpa) {
> -			folio_put(tmp_folio);
> +		if (!wpa)
>  			goto out_unlock;
> -		}
>  		fuse_file_get(wpa->ia.ff);
>  		data->max_folios = 1;
>  		ap = &wpa->ia.ap;
>  	}
>  	folio_start_writeback(folio);
>  
> -	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, ap->num_folios);
> -	data->orig_folios[ap->num_folios] = folio;
> +	fuse_writepage_args_page_fill(wpa, folio, ap->num_folios);
>  
>  	err = 0;
> -	if (data->wpa) {
> -		/*
> -		 * Protected by fi->lock against concurrent access by
> -		 * fuse_page_is_writeback().
> -		 */
> -		spin_lock(&fi->lock);
> -		ap->num_folios++;
> -		spin_unlock(&fi->lock);
> -	} else if (fuse_writepage_add(wpa, folio)) {
> +	ap->num_folios++;
> +	if (!data->wpa)
>  		data->wpa = wpa;
> -	} else {
> -		folio_end_writeback(folio);
> -	}
>  out_unlock:
>  	folio_unlock(folio);
>  
> @@ -2457,13 +2165,6 @@ static int fuse_writepages(struct address_space *mapping,
>  	data.wpa = NULL;
>  	data.ff = NULL;
>  
> -	err = -ENOMEM;
> -	data.orig_folios = kcalloc(fc->max_pages,
> -				   sizeof(struct folio *),
> -				   GFP_NOFS);
> -	if (!data.orig_folios)
> -		goto out;
> -
>  	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
>  	if (data.wpa) {
>  		WARN_ON(!data.wpa->ia.ap.num_folios);
> @@ -2472,7 +2173,6 @@ static int fuse_writepages(struct address_space *mapping,
>  	if (data.ff)
>  		fuse_file_put(data.ff, false);
>  
> -	kfree(data.orig_folios);
>  out:
>  	return err;
>  }
> @@ -2497,8 +2197,6 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
>  	if (IS_ERR(folio))
>  		goto error;
>  
> -	fuse_wait_on_page_writeback(mapping->host, folio->index);
> -
>  	if (folio_test_uptodate(folio) || len >= folio_size(folio))
>  		goto success;
>  	/*
> @@ -2561,13 +2259,9 @@ static int fuse_launder_folio(struct folio *folio)
>  {
>  	int err = 0;
>  	if (folio_clear_dirty_for_io(folio)) {
> -		struct inode *inode = folio->mapping->host;
> -
> -		/* Serialize with pending writeback for the same page */
> -		fuse_wait_on_page_writeback(inode, folio->index);
>  		err = fuse_writepage_locked(folio);
>  		if (!err)
> -			fuse_wait_on_page_writeback(inode, folio->index);
> +			folio_wait_writeback(folio);
>  	}
>  	return err;
>  }
> @@ -2611,7 +2305,7 @@ static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf)
>  		return VM_FAULT_NOPAGE;
>  	}
>  
> -	fuse_wait_on_folio_writeback(inode, folio);
> +	folio_wait_writeback(folio);
>  	return VM_FAULT_LOCKED;
>  }
>  
> @@ -3429,9 +3123,12 @@ static const struct address_space_operations fuse_file_aops  = {
>  void fuse_init_file_inode(struct inode *inode, unsigned int flags)
>  {
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_conn *fc = get_fuse_conn(inode);
>  
>  	inode->i_fop = &fuse_file_operations;
>  	inode->i_data.a_ops = &fuse_file_aops;
> +	if (fc->writeback_cache)
> +		mapping_set_writeback_indeterminate(&inode->i_data);
>  
>  	INIT_LIST_HEAD(&fi->write_files);
>  	INIT_LIST_HEAD(&fi->queued_writes);
> @@ -3439,7 +3136,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
>  	fi->iocachectr = 0;
>  	init_waitqueue_head(&fi->page_waitq);
>  	init_waitqueue_head(&fi->direct_io_waitq);
> -	fi->writepages = RB_ROOT;
>  
>  	if (IS_ENABLED(CONFIG_FUSE_DAX))
>  		fuse_dax_inode_init(inode, flags);
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index fee96fe7887b..28a101a5c558 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -141,9 +141,6 @@ struct fuse_inode {
>  
>  			/* waitq for direct-io completion */
>  			wait_queue_head_t direct_io_waitq;
> -
> -			/* List of writepage requestst (pending or sent) */
> -			struct rb_root writepages;
>  		};
>  
>  		/* readdir cache (directory only) */

-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag
  2025-04-04 20:13       ` David Hildenbrand
@ 2025-04-09 22:05         ` Shakeel Butt
  2025-04-09 23:48           ` Joanne Koong
  0 siblings, 1 reply; 23+ messages in thread
From: Shakeel Butt @ 2025-04-09 22:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Joanne Koong, miklos, akpm, linux-fsdevel, linux-mm, jefflexu,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi

On Fri, Apr 04, 2025 at 10:13:55PM +0200, David Hildenbrand wrote:
> On 04.04.25 22:09, Joanne Koong wrote:
> > On Fri, Apr 4, 2025 at 12:13 PM David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > On 04.04.25 20:14, Joanne Koong wrote:
> > > > Add a new mapping flag AS_WRITEBACK_INDETERMINATE which filesystems may
> > > > set to indicate that writing back to disk may take an indeterminate
> > > > amount of time to complete. Extra caution should be taken when waiting
> > > > on writeback for folios belonging to mappings where this flag is set.
> > > > 
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > > Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> > > > ---
> > > >    include/linux/pagemap.h | 11 +++++++++++
> > > >    1 file changed, 11 insertions(+)
> > > > 
> > > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > > index 26baa78f1ca7..762575f1d195 100644
> > > > --- a/include/linux/pagemap.h
> > > > +++ b/include/linux/pagemap.h
> > > > @@ -210,6 +210,7 @@ enum mapping_flags {
> > > >        AS_STABLE_WRITES = 7,   /* must wait for writeback before modifying
> > > >                                   folio contents */
> > > >        AS_INACCESSIBLE = 8,    /* Do not attempt direct R/W access to the mapping */
> > > > +     AS_WRITEBACK_INDETERMINATE = 9, /* Use caution when waiting on writeback */
> > > >        /* Bits 16-25 are used for FOLIO_ORDER */
> > > >        AS_FOLIO_ORDER_BITS = 5,
> > > >        AS_FOLIO_ORDER_MIN = 16,
> > > > @@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
> > > >        return test_bit(AS_INACCESSIBLE, &mapping->flags);
> > > >    }
> > > > 
> > > > +static inline void mapping_set_writeback_indeterminate(struct address_space *mapping)
> > > > +{
> > > > +     set_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
> > > > +}
> > > > +
> > > > +static inline bool mapping_writeback_indeterminate(struct address_space *mapping)
> > > > +{
> > > > +     return test_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
> > > > +}
> > > > +
> > > >    static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
> > > >    {
> > > >        return mapping->gfp_mask;
> > > 
> > > Staring at this again reminds me of my comment in [1]
> > > 
> > > "
> > > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> > >        that very deadlock problem.
> > > "
> > > 
> > > In the context here now, where we really only focus on the reclaim
> > > deadlock that can happen for trusted FUSE servers during reclaim, would
> > > it make sense to call it now something like that?
> > 
> > Happy to make this change. My thinking was that
> > 'AS_WRITEBACK_INDETERMINATE' could be reused in the future for stuff
> > besides reclaim, but we can cross that bridge if that ends up being
> > the case.
> 
> Yes, but I'm afraid one we start using it in other context we're reaching
> the point where we are trying to deal with untrusted user space and the page
> lock would already be a similar problem.
> 
> Happy to be wrong on this one.
> 
> Wait for other opinions first. Apart from that, no objection from my side.

I am on-board with keeping it specific to reclaim deadlock avoidance and
naming it such.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-09  2:43   ` Jingbo Xu
@ 2025-04-09 23:47     ` Joanne Koong
  2025-04-10  2:12       ` Jingbo Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Joanne Koong @ 2025-04-09 23:47 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: miklos, akpm, linux-fsdevel, linux-mm, shakeel.butt, david,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi

  On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi Joanne,
>
> On 4/5/25 2:14 AM, Joanne Koong wrote:
> > In the current FUSE writeback design (see commit 3be5a52b30aa
> > ("fuse: support writable mmap")), a temp page is allocated for every
> > dirty page to be written back, the contents of the dirty page are copied over
> > to the temp page, and the temp page gets handed to the server to write back.
> >
> > This is done so that writeback may be immediately cleared on the dirty page,
> > and this in turn is done in order to mitigate the following deadlock scenario
> > that may arise if reclaim waits on writeback on the dirty page to complete:
> > * single-threaded FUSE server is in the middle of handling a request
> >   that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in
> >   direct reclaim
> >
> > With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> > the situations described above, FUSE writeback does not need to use
> > temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> >
> > This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> > and removes the temporary pages + extra copying and the internal rb
> > tree.
> >
> > fio benchmarks --
> > (using averages observed from 10 runs, throwing away outliers)
> >
> > Setup:
> > sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >
> > fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> > --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >
> >         bs =  1k          4k            1M
> > Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> > After   341 MiB/s     2246 MiB/s     2685 MiB/s
> > % diff        -3%          23%         45%
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> > Acked-by: Miklos Szeredi <mszeredi@redhat.com>
>

Hi Jingbo,

Thanks for sharing your analysis for this.

> Overall this patch LGTM.
>
> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
> also unneeded then, at least the DIRECT IO routine (i.e.

I took a look at fi->writectr and fi->queued_writes and my
understanding is that we do still need this. For example, for
truncates (I'm looking at fuse_do_setattr()), I think we still need to
prevent concurrent writeback or else the setattr request and the
writeback request could race which would result in a mismatch between
the file's reported size and the actual data written to disk.

> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
> because after removing the temp page, the DIRECT IO routine has already
> been waiting for all inflight WRITE requests, see
>
> # DIRECT read
> generic_file_read_iter
>   kiocb_write_and_wait
>     filemap_write_and_wait_range

Where do you see generic_file_read_iter() getting called for direct io reads?

For direct io reads, I'm only seeing

fuse_direct_IO()
  __fuse_direct_read()
    fuse_direct_io()

and

fuse_file_read_iter()
    fuse_direct_read_iter()
        fuse_direct_IO() / __fuse_direct_read()

>
> # DIRECT write
> generic_file_write_iter
>   generic_file_direct_write
>     kiocb_invalidate_pages
>       filemap_invalidate_pages
>         filemap_write_and_wait_range

Similarly, where do you see generic_file_write_iter() getting called
for direct io writes?
My understanding is that it'd either go through fuse_file_write_iter()
-> fuse_direct_write_iter() or through the fuse_direct_IO() callback.

>
> The DIRECT write routine will also invalidate the page cache in the
> range that is written to, so that the following buffer write needs to
> read the page cache back first. The writeback following the buffer write
> is much likely after the DIRECT write, so that the writeback won't
> conflict with the DIRECT write (i.e. there won't be duplicate WRITE
> requests for the same page that are initiated from DIRECT write and
> writeback at the same time), which is exactly why fi->writectr and
> fi->queued_writes are introduced.

Where do you see fi->writectr / fi->queued-writes preventing this
race? It looks to me like in the existing code, this race condition
you described of direct write invalidating the page cache, then
another buffer write reads the page cache and dirties it, then
writeback is called on that, and the 2 write requests racing, could
still happen?


> However it seems that the writeback
> won't wait for previous inflight DIRECT WRITE requests, so I'm not much
> sure about that.  Maybe other folks could offer more insights...

My understanding is that these lines

if (!cuse && filemap_range_has_writeback(...)) {
   ...
   fuse_sync_writes(inode);
   ...
}

in fuse_direct_io() is what waits on previous inflight direct write
requests to complete before the direct io happens.


>
> Also fuse_sync_writes() is not needed in fuse_flush() anymore, with
> which I'm pretty sure.

Why don't we still need this for fuse_flush()?

If a caller calls close(), this will call

filp_close()
  filp_flush()
      filp->f_op->flush()
          fuse_flush()

it seems like we should still be waiting for all writebacks to finish
before sending the fuse server the fuse_flush request, no?

>
> The potential cleanup for fi->writectr and fi->queued_writes could be
> offered as following separate patches (if any).
>

Thanks,
Joanne
>
> > ---
> >  fs/fuse/file.c   | 360 ++++-------------------------------------------
> >  fs/fuse/fuse_i.h |   3 -
> >  2 files changed, 28 insertions(+), 335 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 754378dd9f71..91ada0208863 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
> >
> >  struct fuse_writepage_args {
> >       struct fuse_io_args ia;
> > -     struct rb_node writepages_entry;
> >       struct list_head queue_entry;
> > -     struct fuse_writepage_args *next;
> >       struct inode *inode;
> >       struct fuse_sync_bucket *bucket;
> >  };
> >
> > -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
> > -                                         pgoff_t idx_from, pgoff_t idx_to)
> > -{
> > -     struct rb_node *n;
> > -
> > -     n = fi->writepages.rb_node;
> > -
> > -     while (n) {
> > -             struct fuse_writepage_args *wpa;
> > -             pgoff_t curr_index;
> > -
>
> --
> Thanks,
> Jingbo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag
  2025-04-09 22:05         ` Shakeel Butt
@ 2025-04-09 23:48           ` Joanne Koong
  0 siblings, 0 replies; 23+ messages in thread
From: Joanne Koong @ 2025-04-09 23:48 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David Hildenbrand, miklos, akpm, linux-fsdevel, linux-mm,
	jefflexu, bernd.schubert, ziy, jlayton, kernel-team,
	Miklos Szeredi

On Wed, Apr 9, 2025 at 3:05 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Fri, Apr 04, 2025 at 10:13:55PM +0200, David Hildenbrand wrote:
> > On 04.04.25 22:09, Joanne Koong wrote:
> > > On Fri, Apr 4, 2025 at 12:13 PM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > On 04.04.25 20:14, Joanne Koong wrote:
> > > > > Add a new mapping flag AS_WRITEBACK_INDETERMINATE which filesystems may
> > > > > set to indicate that writing back to disk may take an indeterminate
> > > > > amount of time to complete. Extra caution should be taken when waiting
> > > > > on writeback for folios belonging to mappings where this flag is set.
> > > > >
> > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > > > Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> > > > > ---
> > > > >    include/linux/pagemap.h | 11 +++++++++++
> > > > >    1 file changed, 11 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > > > index 26baa78f1ca7..762575f1d195 100644
> > > > > --- a/include/linux/pagemap.h
> > > > > +++ b/include/linux/pagemap.h
> > > > > @@ -210,6 +210,7 @@ enum mapping_flags {
> > > > >        AS_STABLE_WRITES = 7,   /* must wait for writeback before modifying
> > > > >                                   folio contents */
> > > > >        AS_INACCESSIBLE = 8,    /* Do not attempt direct R/W access to the mapping */
> > > > > +     AS_WRITEBACK_INDETERMINATE = 9, /* Use caution when waiting on writeback */
> > > > >        /* Bits 16-25 are used for FOLIO_ORDER */
> > > > >        AS_FOLIO_ORDER_BITS = 5,
> > > > >        AS_FOLIO_ORDER_MIN = 16,
> > > > > @@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
> > > > >        return test_bit(AS_INACCESSIBLE, &mapping->flags);
> > > > >    }
> > > > >
> > > > > +static inline void mapping_set_writeback_indeterminate(struct address_space *mapping)
> > > > > +{
> > > > > +     set_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
> > > > > +}
> > > > > +
> > > > > +static inline bool mapping_writeback_indeterminate(struct address_space *mapping)
> > > > > +{
> > > > > +     return test_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
> > > > > +}
> > > > > +
> > > > >    static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
> > > > >    {
> > > > >        return mapping->gfp_mask;
> > > >
> > > > Staring at this again reminds me of my comment in [1]
> > > >
> > > > "
> > > > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> > > >        that very deadlock problem.
> > > > "
> > > >
> > > > In the context here now, where we really only focus on the reclaim
> > > > deadlock that can happen for trusted FUSE servers during reclaim, would
> > > > it make sense to call it now something like that?
> > >
> > > Happy to make this change. My thinking was that
> > > 'AS_WRITEBACK_INDETERMINATE' could be reused in the future for stuff
> > > besides reclaim, but we can cross that bridge if that ends up being
> > > the case.
> >
> > Yes, but I'm afraid one we start using it in other context we're reaching
> > the point where we are trying to deal with untrusted user space and the page
> > lock would already be a similar problem.
> >
> > Happy to be wrong on this one.
> >
> > Wait for other opinions first. Apart from that, no objection from my side.
>
> I am on-board with keeping it specific to reclaim deadlock avoidance and
> naming it such.

Sounds good, I will submit v8 with this renamed to
AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM.

Thanks,
Joanne


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-09 23:47     ` Joanne Koong
@ 2025-04-10  2:12       ` Jingbo Xu
  2025-04-10 15:07         ` Joanne Koong
  0 siblings, 1 reply; 23+ messages in thread
From: Jingbo Xu @ 2025-04-10  2:12 UTC (permalink / raw)
  To: Joanne Koong
  Cc: miklos, akpm, linux-fsdevel, linux-mm, shakeel.butt, david,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi



On 4/10/25 7:47 AM, Joanne Koong wrote:
>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi Joanne,
>>
>> On 4/5/25 2:14 AM, Joanne Koong wrote:
>>> In the current FUSE writeback design (see commit 3be5a52b30aa
>>> ("fuse: support writable mmap")), a temp page is allocated for every
>>> dirty page to be written back, the contents of the dirty page are copied over
>>> to the temp page, and the temp page gets handed to the server to write back.
>>>
>>> This is done so that writeback may be immediately cleared on the dirty page,
>>> and this in turn is done in order to mitigate the following deadlock scenario
>>> that may arise if reclaim waits on writeback on the dirty page to complete:
>>> * single-threaded FUSE server is in the middle of handling a request
>>>   that needs a memory allocation
>>> * memory allocation triggers direct reclaim
>>> * direct reclaim waits on a folio under writeback
>>> * the FUSE server can't write back the folio since it's stuck in
>>>   direct reclaim
>>>
>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
>>> the situations described above, FUSE writeback does not need to use
>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
>>>
>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
>>> and removes the temporary pages + extra copying and the internal rb
>>> tree.
>>>
>>> fio benchmarks --
>>> (using averages observed from 10 runs, throwing away outliers)
>>>
>>> Setup:
>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>
>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>
>>>         bs =  1k          4k            1M
>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
>>> % diff        -3%          23%         45%
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
>>
> 
> Hi Jingbo,
> 
> Thanks for sharing your analysis for this.
> 
>> Overall this patch LGTM.
>>
>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
>> also unneeded then, at least the DIRECT IO routine (i.e.
> 
> I took a look at fi->writectr and fi->queued_writes and my
> understanding is that we do still need this. For example, for
> truncates (I'm looking at fuse_do_setattr()), I think we still need to
> prevent concurrent writeback or else the setattr request and the
> writeback request could race which would result in a mismatch between
> the file's reported size and the actual data written to disk.

I haven't looked into the truncate routine yet.  I will see it later.

> 
>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
>> because after removing the temp page, the DIRECT IO routine has already
>> been waiting for all inflight WRITE requests, see
>>
>> # DIRECT read
>> generic_file_read_iter
>>   kiocb_write_and_wait
>>     filemap_write_and_wait_range
> 
> Where do you see generic_file_read_iter() getting called for direct io reads?

# DIRECT read
fuse_file_read_iter
  fuse_cache_read_iter
    generic_file_read_iter
      kiocb_write_and_wait
       filemap_write_and_wait_range
      a_ops->direct_IO(),i.e. fuse_direct_IO()


> Similarly, where do you see generic_file_write_iter() getting called
> for direct io writes?

# DIRECT read
fuse_file_write_iter
  fuse_cache_write_iter
    generic_file_write_iter
      generic_file_direct_write
        kiocb_invalidate_pages
         filemap_invalidate_pages
           filemap_write_and_wait_range
      a_ops->direct_IO(),i.e. fuse_direct_IO()


> Where do you see fi->writectr / fi->queued-writes preventing this
> race?

IMO overall fi->writectr / fi->queued-writes are introduced to prevent
DIRECT IO and writeback from sending duplicate (inflight) WRITE requests
for the same page.

For the DIRECT write routine:

# non-FOPEN_DIRECT_IO DIRECT write
fuse_cache_write_iter
  fuse_direct_IO
    fuse_direct_io
      fuse_sync_writes


# FOPEN_DIRECT_IO DIRECT write
fuse_direct_write_iter
  fuse_direct_IO
    fuse_direct_io
      fuse_sync_writes


For the writeback routine:
fuse_writepages()
  fuse_writepages_fill
    fuse_writepages_send
      # buffer the WRITE request in queued_writes list
      fuse_flush_writepages
	# flush WRITE only when fi->writectr >= 0
	


> It looks to me like in the existing code, this race condition
> you described of direct write invalidating the page cache, then
> another buffer write reads the page cache and dirties it, then
> writeback is called on that, and the 2 write requests racing, could
> still happen?
> 
> 
>> However it seems that the writeback
>> won't wait for previous inflight DIRECT WRITE requests, so I'm not much
>> sure about that.  Maybe other folks could offer more insights...
> 
> My understanding is that these lines
> 
> if (!cuse && filemap_range_has_writeback(...)) {
>    ...
>    fuse_sync_writes(inode);
>    ...
> }
> 
> in fuse_direct_io() is what waits on previous inflight direct write
> requests to complete before the direct io happens.

Right.

> 
> 
>>
>> Also fuse_sync_writes() is not needed in fuse_flush() anymore, with
>> which I'm pretty sure.
> 
> Why don't we still need this for fuse_flush()?
> 
> If a caller calls close(), this will call
> 
> filp_close()
>   filp_flush()
>       filp->f_op->flush()
>           fuse_flush()
> 
> it seems like we should still be waiting for all writebacks to finish
> before sending the fuse server the fuse_flush request, no?
> 

filp_close()
   filp_flush()
       filp->f_op->flush()
           fuse_flush()
 	     write_inode_now
		writeback_single_inode(WB_SYNC_ALL)
		  do_writepages
		    # flush dirty page
		  filemap_fdatawait
		    # wait for WRITE completion

>>
>>> ---
>>>  fs/fuse/file.c   | 360 ++++-------------------------------------------
>>>  fs/fuse/fuse_i.h |   3 -
>>>  2 files changed, 28 insertions(+), 335 deletions(-)
>>>
>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>> index 754378dd9f71..91ada0208863 100644
>>> --- a/fs/fuse/file.c
>>> +++ b/fs/fuse/file.c
>>> @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
>>>
>>>  struct fuse_writepage_args {
>>>       struct fuse_io_args ia;
>>> -     struct rb_node writepages_entry;
>>>       struct list_head queue_entry;
>>> -     struct fuse_writepage_args *next;
>>>       struct inode *inode;
>>>       struct fuse_sync_bucket *bucket;
>>>  };
>>>
>>> -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
>>> -                                         pgoff_t idx_from, pgoff_t idx_to)
>>> -{
>>> -     struct rb_node *n;
>>> -
>>> -     n = fi->writepages.rb_node;
>>> -
>>> -     while (n) {
>>> -             struct fuse_writepage_args *wpa;
>>> -             pgoff_t curr_index;
>>> -
>>
>> --
>> Thanks,
>> Jingbo

-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-10  2:12       ` Jingbo Xu
@ 2025-04-10 15:07         ` Joanne Koong
  2025-04-10 15:11           ` Jingbo Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Joanne Koong @ 2025-04-10 15:07 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: miklos, akpm, linux-fsdevel, linux-mm, shakeel.butt, david,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi

On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 4/10/25 7:47 AM, Joanne Koong wrote:
> >   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >> Hi Joanne,
> >>
> >> On 4/5/25 2:14 AM, Joanne Koong wrote:
> >>> In the current FUSE writeback design (see commit 3be5a52b30aa
> >>> ("fuse: support writable mmap")), a temp page is allocated for every
> >>> dirty page to be written back, the contents of the dirty page are copied over
> >>> to the temp page, and the temp page gets handed to the server to write back.
> >>>
> >>> This is done so that writeback may be immediately cleared on the dirty page,
> >>> and this in turn is done in order to mitigate the following deadlock scenario
> >>> that may arise if reclaim waits on writeback on the dirty page to complete:
> >>> * single-threaded FUSE server is in the middle of handling a request
> >>>   that needs a memory allocation
> >>> * memory allocation triggers direct reclaim
> >>> * direct reclaim waits on a folio under writeback
> >>> * the FUSE server can't write back the folio since it's stuck in
> >>>   direct reclaim
> >>>
> >>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> >>> the situations described above, FUSE writeback does not need to use
> >>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> >>>
> >>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> >>> and removes the temporary pages + extra copying and the internal rb
> >>> tree.
> >>>
> >>> fio benchmarks --
> >>> (using averages observed from 10 runs, throwing away outliers)
> >>>
> >>> Setup:
> >>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>
> >>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>
> >>>         bs =  1k          4k            1M
> >>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> >>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> >>> % diff        -3%          23%         45%
> >>>
> >>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> >>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> >>
> >
> > Hi Jingbo,
> >
> > Thanks for sharing your analysis for this.
> >
> >> Overall this patch LGTM.
> >>
> >> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
> >> also unneeded then, at least the DIRECT IO routine (i.e.
> >
> > I took a look at fi->writectr and fi->queued_writes and my
> > understanding is that we do still need this. For example, for
> > truncates (I'm looking at fuse_do_setattr()), I think we still need to
> > prevent concurrent writeback or else the setattr request and the
> > writeback request could race which would result in a mismatch between
> > the file's reported size and the actual data written to disk.
>
> I haven't looked into the truncate routine yet.  I will see it later.
>
> >
> >> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
> >> because after removing the temp page, the DIRECT IO routine has already
> >> been waiting for all inflight WRITE requests, see
> >>
> >> # DIRECT read
> >> generic_file_read_iter
> >>   kiocb_write_and_wait
> >>     filemap_write_and_wait_range
> >
> > Where do you see generic_file_read_iter() getting called for direct io reads?
>
> # DIRECT read
> fuse_file_read_iter
>   fuse_cache_read_iter
>     generic_file_read_iter
>       kiocb_write_and_wait
>        filemap_write_and_wait_range
>       a_ops->direct_IO(),i.e. fuse_direct_IO()
>

Oh I see, I thought files opened with O_DIRECT automatically call the
.direct_IO handler for reads/writes but you're right, it first goes
through .read_iter / .write_iter handlers, and the .direct_IO handler
only gets invoked through generic_file_read_iter() /
generic_file_direct_write() in mm/filemap.c

There's two paths for direct io in FUSE:
a) fuse server sets fi->direct_io = true when a file is opened, which
will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
b) fuse server doesn't set fi->direct_io = true, but the client opens
the file with O_DIRECT

We only go through the stack trace you listed above for the b) case.
For the a) case, we'll hit

        if (ff->open_flags & FOPEN_DIRECT_IO)
                return fuse_direct_read_iter(iocb, to);

and

        if (ff->open_flags & FOPEN_DIRECT_IO)
                return fuse_direct_write_iter(iocb, from);

which will invoke fuse_direct_IO() / fuse_direct_io() without going
through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
above.

So for the a) case I think we'd still need the fuse_sync_writes() in
case there's still pending writeback.

Do you agree with this analysis or am I missing something here?

>
> > Similarly, where do you see generic_file_write_iter() getting called
> > for direct io writes?
>
> # DIRECT read
> fuse_file_write_iter
>   fuse_cache_write_iter
>     generic_file_write_iter
>       generic_file_direct_write
>         kiocb_invalidate_pages
>          filemap_invalidate_pages
>            filemap_write_and_wait_range
>       a_ops->direct_IO(),i.e. fuse_direct_IO()
>
>
> > Where do you see fi->writectr / fi->queued-writes preventing this
> > race?
>
> IMO overall fi->writectr / fi->queued-writes are introduced to prevent
> DIRECT IO and writeback from sending duplicate (inflight) WRITE requests
> for the same page.
>
> For the DIRECT write routine:
>
> # non-FOPEN_DIRECT_IO DIRECT write
> fuse_cache_write_iter
>   fuse_direct_IO
>     fuse_direct_io
>       fuse_sync_writes
>
>
> # FOPEN_DIRECT_IO DIRECT write
> fuse_direct_write_iter
>   fuse_direct_IO
>     fuse_direct_io
>       fuse_sync_writes
>
>
> For the writeback routine:
> fuse_writepages()
>   fuse_writepages_fill
>     fuse_writepages_send
>       # buffer the WRITE request in queued_writes list
>       fuse_flush_writepages
>         # flush WRITE only when fi->writectr >= 0
>
>
>
> > It looks to me like in the existing code, this race condition
> > you described of direct write invalidating the page cache, then
> > another buffer write reads the page cache and dirties it, then
> > writeback is called on that, and the 2 write requests racing, could
> > still happen?
> >
> >
> >> However it seems that the writeback
> >> won't wait for previous inflight DIRECT WRITE requests, so I'm not much
> >> sure about that.  Maybe other folks could offer more insights...
> >
> > My understanding is that these lines
> >
> > if (!cuse && filemap_range_has_writeback(...)) {
> >    ...
> >    fuse_sync_writes(inode);
> >    ...
> > }
> >
> > in fuse_direct_io() is what waits on previous inflight direct write
> > requests to complete before the direct io happens.
>
> Right.
>
> >
> >
> >>
> >> Also fuse_sync_writes() is not needed in fuse_flush() anymore, with
> >> which I'm pretty sure.
> >
> > Why don't we still need this for fuse_flush()?
> >
> > If a caller calls close(), this will call
> >
> > filp_close()
> >   filp_flush()
> >       filp->f_op->flush()
> >           fuse_flush()
> >
> > it seems like we should still be waiting for all writebacks to finish
> > before sending the fuse server the fuse_flush request, no?
> >
>
> filp_close()
>    filp_flush()
>        filp->f_op->flush()
>            fuse_flush()
>              write_inode_now
>                 writeback_single_inode(WB_SYNC_ALL)
>                   do_writepages
>                     # flush dirty page
>                   filemap_fdatawait
>                     # wait for WRITE completion

Nice. I missed that write_inode_now() will invoke filemap_fdatawait().
This seems pretty straightforward. I'll remove the fuse_sync_writes()
call in fuse_flush() when I send out v8.

The direct io one above is less straight-forward. I won't add that to
v8 but that can be done in a separate future patch when we figure that
out.

Thanks,
Joanne

>
> >>
> >>> ---
> >>>  fs/fuse/file.c   | 360 ++++-------------------------------------------
> >>>  fs/fuse/fuse_i.h |   3 -
> >>>  2 files changed, 28 insertions(+), 335 deletions(-)
> >>>
> >>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> >>> index 754378dd9f71..91ada0208863 100644
> >>> --- a/fs/fuse/file.c
> >>> +++ b/fs/fuse/file.c
> >>> @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
> >>>
> >>>  struct fuse_writepage_args {
> >>>       struct fuse_io_args ia;
> >>> -     struct rb_node writepages_entry;
> >>>       struct list_head queue_entry;
> >>> -     struct fuse_writepage_args *next;
> >>>       struct inode *inode;
> >>>       struct fuse_sync_bucket *bucket;
> >>>  };
> >>>
> >>> -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
> >>> -                                         pgoff_t idx_from, pgoff_t idx_to)
> >>> -{
> >>> -     struct rb_node *n;
> >>> -
> >>> -     n = fi->writepages.rb_node;
> >>> -
> >>> -     while (n) {
> >>> -             struct fuse_writepage_args *wpa;
> >>> -             pgoff_t curr_index;
> >>> -
> >>
> >> --
> >> Thanks,
> >> Jingbo
>
> --
> Thanks,
> Jingbo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-10 15:07         ` Joanne Koong
@ 2025-04-10 15:11           ` Jingbo Xu
  2025-04-10 16:11             ` Joanne Koong
  0 siblings, 1 reply; 23+ messages in thread
From: Jingbo Xu @ 2025-04-10 15:11 UTC (permalink / raw)
  To: Joanne Koong
  Cc: miklos, akpm, linux-fsdevel, linux-mm, shakeel.butt, david,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi



On 4/10/25 11:07 PM, Joanne Koong wrote:
> On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>>
>>
>> On 4/10/25 7:47 AM, Joanne Koong wrote:
>>>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>> Hi Joanne,
>>>>
>>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
>>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
>>>>> ("fuse: support writable mmap")), a temp page is allocated for every
>>>>> dirty page to be written back, the contents of the dirty page are copied over
>>>>> to the temp page, and the temp page gets handed to the server to write back.
>>>>>
>>>>> This is done so that writeback may be immediately cleared on the dirty page,
>>>>> and this in turn is done in order to mitigate the following deadlock scenario
>>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
>>>>> * single-threaded FUSE server is in the middle of handling a request
>>>>>   that needs a memory allocation
>>>>> * memory allocation triggers direct reclaim
>>>>> * direct reclaim waits on a folio under writeback
>>>>> * the FUSE server can't write back the folio since it's stuck in
>>>>>   direct reclaim
>>>>>
>>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
>>>>> the situations described above, FUSE writeback does not need to use
>>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
>>>>>
>>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
>>>>> and removes the temporary pages + extra copying and the internal rb
>>>>> tree.
>>>>>
>>>>> fio benchmarks --
>>>>> (using averages observed from 10 runs, throwing away outliers)
>>>>>
>>>>> Setup:
>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>>>
>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>>>
>>>>>         bs =  1k          4k            1M
>>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
>>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
>>>>> % diff        -3%          23%         45%
>>>>>
>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
>>>>
>>>
>>> Hi Jingbo,
>>>
>>> Thanks for sharing your analysis for this.
>>>
>>>> Overall this patch LGTM.
>>>>
>>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
>>>> also unneeded then, at least the DIRECT IO routine (i.e.
>>>
>>> I took a look at fi->writectr and fi->queued_writes and my
>>> understanding is that we do still need this. For example, for
>>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
>>> prevent concurrent writeback or else the setattr request and the
>>> writeback request could race which would result in a mismatch between
>>> the file's reported size and the actual data written to disk.
>>
>> I haven't looked into the truncate routine yet.  I will see it later.
>>
>>>
>>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
>>>> because after removing the temp page, the DIRECT IO routine has already
>>>> been waiting for all inflight WRITE requests, see
>>>>
>>>> # DIRECT read
>>>> generic_file_read_iter
>>>>   kiocb_write_and_wait
>>>>     filemap_write_and_wait_range
>>>
>>> Where do you see generic_file_read_iter() getting called for direct io reads?
>>
>> # DIRECT read
>> fuse_file_read_iter
>>   fuse_cache_read_iter
>>     generic_file_read_iter
>>       kiocb_write_and_wait
>>        filemap_write_and_wait_range
>>       a_ops->direct_IO(),i.e. fuse_direct_IO()
>>
> 
> Oh I see, I thought files opened with O_DIRECT automatically call the
> .direct_IO handler for reads/writes but you're right, it first goes
> through .read_iter / .write_iter handlers, and the .direct_IO handler
> only gets invoked through generic_file_read_iter() /
> generic_file_direct_write() in mm/filemap.c
> 
> There's two paths for direct io in FUSE:
> a) fuse server sets fi->direct_io = true when a file is opened, which
> will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
> b) fuse server doesn't set fi->direct_io = true, but the client opens
> the file with O_DIRECT
> 
> We only go through the stack trace you listed above for the b) case.
> For the a) case, we'll hit
> 
>         if (ff->open_flags & FOPEN_DIRECT_IO)
>                 return fuse_direct_read_iter(iocb, to);
> 
> and
> 
>         if (ff->open_flags & FOPEN_DIRECT_IO)
>                 return fuse_direct_write_iter(iocb, from);
> 
> which will invoke fuse_direct_IO() / fuse_direct_io() without going
> through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
> kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
> above.
> 
> So for the a) case I think we'd still need the fuse_sync_writes() in
> case there's still pending writeback.
> 
> Do you agree with this analysis or am I missing something here?

Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
call filemap_wait_range() or something similar here.



>> filp_close()
>>    filp_flush()
>>        filp->f_op->flush()
>>            fuse_flush()
>>              write_inode_now
>>                 writeback_single_inode(WB_SYNC_ALL)
>>                   do_writepages
>>                     # flush dirty page
>>                   filemap_fdatawait
>>                     # wait for WRITE completion
> 
> Nice. I missed that write_inode_now() will invoke filemap_fdatawait().
> This seems pretty straightforward. I'll remove the fuse_sync_writes()
> call in fuse_flush() when I send out v8.
> 
> The direct io one above is less straight-forward. I won't add that to
> v8 but that can be done in a separate future patch when we figure that
> out.

Thanks for keep working on this. Appreciated.

-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-10 15:11           ` Jingbo Xu
@ 2025-04-10 16:11             ` Joanne Koong
  2025-04-14 20:24               ` Joanne Koong
  2025-04-15  7:49               ` Jingbo Xu
  0 siblings, 2 replies; 23+ messages in thread
From: Joanne Koong @ 2025-04-10 16:11 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: miklos, akpm, linux-fsdevel, linux-mm, shakeel.butt, david,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi

On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> On 4/10/25 11:07 PM, Joanne Koong wrote:
> > On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 4/10/25 7:47 AM, Joanne Koong wrote:
> >>>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>
> >>>> Hi Joanne,
> >>>>
> >>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
> >>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
> >>>>> ("fuse: support writable mmap")), a temp page is allocated for every
> >>>>> dirty page to be written back, the contents of the dirty page are copied over
> >>>>> to the temp page, and the temp page gets handed to the server to write back.
> >>>>>
> >>>>> This is done so that writeback may be immediately cleared on the dirty page,
> >>>>> and this in turn is done in order to mitigate the following deadlock scenario
> >>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
> >>>>> * single-threaded FUSE server is in the middle of handling a request
> >>>>>   that needs a memory allocation
> >>>>> * memory allocation triggers direct reclaim
> >>>>> * direct reclaim waits on a folio under writeback
> >>>>> * the FUSE server can't write back the folio since it's stuck in
> >>>>>   direct reclaim
> >>>>>
> >>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> >>>>> the situations described above, FUSE writeback does not need to use
> >>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> >>>>>
> >>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> >>>>> and removes the temporary pages + extra copying and the internal rb
> >>>>> tree.
> >>>>>
> >>>>> fio benchmarks --
> >>>>> (using averages observed from 10 runs, throwing away outliers)
> >>>>>
> >>>>> Setup:
> >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>>>
> >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>>>
> >>>>>         bs =  1k          4k            1M
> >>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> >>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> >>>>> % diff        -3%          23%         45%
> >>>>>
> >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> >>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> >>>>
> >>>
> >>> Hi Jingbo,
> >>>
> >>> Thanks for sharing your analysis for this.
> >>>
> >>>> Overall this patch LGTM.
> >>>>
> >>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
> >>>> also unneeded then, at least the DIRECT IO routine (i.e.
> >>>
> >>> I took a look at fi->writectr and fi->queued_writes and my
> >>> understanding is that we do still need this. For example, for
> >>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
> >>> prevent concurrent writeback or else the setattr request and the
> >>> writeback request could race which would result in a mismatch between
> >>> the file's reported size and the actual data written to disk.
> >>
> >> I haven't looked into the truncate routine yet.  I will see it later.
> >>
> >>>
> >>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
> >>>> because after removing the temp page, the DIRECT IO routine has already
> >>>> been waiting for all inflight WRITE requests, see
> >>>>
> >>>> # DIRECT read
> >>>> generic_file_read_iter
> >>>>   kiocb_write_and_wait
> >>>>     filemap_write_and_wait_range
> >>>
> >>> Where do you see generic_file_read_iter() getting called for direct io reads?
> >>
> >> # DIRECT read
> >> fuse_file_read_iter
> >>   fuse_cache_read_iter
> >>     generic_file_read_iter
> >>       kiocb_write_and_wait
> >>        filemap_write_and_wait_range
> >>       a_ops->direct_IO(),i.e. fuse_direct_IO()
> >>
> >
> > Oh I see, I thought files opened with O_DIRECT automatically call the
> > .direct_IO handler for reads/writes but you're right, it first goes
> > through .read_iter / .write_iter handlers, and the .direct_IO handler
> > only gets invoked through generic_file_read_iter() /
> > generic_file_direct_write() in mm/filemap.c
> >
> > There's two paths for direct io in FUSE:
> > a) fuse server sets fi->direct_io = true when a file is opened, which
> > will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
> > b) fuse server doesn't set fi->direct_io = true, but the client opens
> > the file with O_DIRECT
> >
> > We only go through the stack trace you listed above for the b) case.
> > For the a) case, we'll hit
> >
> >         if (ff->open_flags & FOPEN_DIRECT_IO)
> >                 return fuse_direct_read_iter(iocb, to);
> >
> > and
> >
> >         if (ff->open_flags & FOPEN_DIRECT_IO)
> >                 return fuse_direct_write_iter(iocb, from);
> >
> > which will invoke fuse_direct_IO() / fuse_direct_io() without going
> > through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
> > kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
> > above.
> >
> > So for the a) case I think we'd still need the fuse_sync_writes() in
> > case there's still pending writeback.
> >
> > Do you agree with this analysis or am I missing something here?
>
> Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
> call filemap_wait_range() or something similar here.
>

Agreed. Actually, the more I look at this, the more I think we can
replace all fuse_sync_writes() and get rid of it entirely. Right now,
fuse_sync_writes() is called in:

fuse_fsync():
        /*
         * Start writeback against all dirty pages of the inode, then
         * wait for all outstanding writes, before sending the FSYNC
         * request.
         */
        err = file_write_and_wait_range(file, start, end);
        if (err)
                goto out;

        fuse_sync_writes(inode);

        /*
         * Due to implementation of fuse writeback
         * file_write_and_wait_range() does not catch errors.
         * We have to do this directly after fuse_sync_writes()
         */
        err = file_check_and_advance_wb_err(file);
        if (err)
                goto out;


      We can get rid of the fuse_sync_writes() and
file_check_and_advance_wb_err() entirely since now without temp pages,
the file_write_and_wait_range() call actually ensures that writeback
is completed



fuse_writeback_range():
        static int fuse_writeback_range(struct inode *inode, loff_t
start, loff_t end)
        {
                int err =
filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);

                if (!err)
                        fuse_sync_writes(inode);

                return err;
        }


      We can replace fuse_writeback_range() entirely with
filemap_write_and_wait_range().



fuse_direct_io():
        if (fopen_direct_io && fc->direct_io_allow_mmap) {
                res = filemap_write_and_wait_range(mapping, pos, pos +
count - 1);
                if (res) {
                        fuse_io_free(ia);
                        return res;
                }
        }
        if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
count - 1))) {
                if (!write)
                        inode_lock(inode);
                fuse_sync_writes(inode);
                if (!write)
                        inode_unlock(inode);
        }


       I think this can just replaced with
                if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
                        res = filemap_write_and_wait_range(mapping,
pos, pos + count - 1);
                        if (res) {
                                fuse_io_free(ia);
                                return res;
                        }
                }
       since for the !fopen_direct_io case, it will already go through
filemap_write_and_wait_range(), as you mentioned in your previous
message. I think this also fixes a bug (?) in the original code - in
the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
still need to write out dirty pages first, which we don't currently
do.


What do you think?

Thanks for all your careful review on this patchset throughout all of
its iterations.

>
>
> >> filp_close()
> >>    filp_flush()
> >>        filp->f_op->flush()
> >>            fuse_flush()
> >>              write_inode_now
> >>                 writeback_single_inode(WB_SYNC_ALL)
> >>                   do_writepages
> >>                     # flush dirty page
> >>                   filemap_fdatawait
> >>                     # wait for WRITE completion
> >
> > Nice. I missed that write_inode_now() will invoke filemap_fdatawait().
> > This seems pretty straightforward. I'll remove the fuse_sync_writes()
> > call in fuse_flush() when I send out v8.
> >
> > The direct io one above is less straight-forward. I won't add that to
> > v8 but that can be done in a separate future patch when we figure that
> > out.
>
> Thanks for keep working on this. Appreciated.
>
> --
> Thanks,
> Jingbo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 0/3] fuse: remove temp page copies in writeback
  2025-04-04 18:14 [PATCH v7 0/3] fuse: remove temp page copies in writeback Joanne Koong
                   ` (2 preceding siblings ...)
  2025-04-04 18:14 ` [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
@ 2025-04-14 16:21 ` Jeff Layton
  2025-04-14 20:28   ` Joanne Koong
  3 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2025-04-14 16:21 UTC (permalink / raw)
  To: Joanne Koong, miklos, akpm, linux-fsdevel, linux-mm
  Cc: jefflexu, shakeel.butt, david, bernd.schubert, ziy, kernel-team

On Fri, 2025-04-04 at 11:14 -0700, Joanne Koong wrote:
> The purpose of this patchset is to help make writeback in FUSE filesystems as
> fast as possible.
> 
> In the current FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap"))), a temp page is allocated for every dirty
> page to be written back, the contents of the dirty page are copied over to the
> temp page, and the temp page gets handed to the server to write back. This is
> done so that writeback may be immediately cleared on the dirty page, and this 
> in turn is done in order to mitigate the following deadlock scenario that may
> arise if reclaim waits on writeback on the dirty page to complete (more details
> can be found in this thread [1]):
> * single-threaded FUSE server is in the middle of handling a request
>   that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
>   direct reclaim
> 
> Allocating and copying dirty pages to temp pages is the biggest performance
> bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> altogether (which will also allow us to get rid of the internal FUSE rb tree
> that is needed to keep track of writeback status on the temp pages).
> Benchmarks show approximately a 20% improvement in throughput for 4k
> block-size writes and a 45% improvement for 1M block-size writes.
> 
> In the current reclaim code, there is one scenario where writeback is waited
> on, which is the case where the system is running legacy cgroupv1 and reclaim
> encounters a folio that already has the reclaim flag set and the caller did
> not have __GFP_FS (or __GFP_IO if swap) set.
> 
> This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> filesystems may set on its inode mappings to indicate that writeback
> operations may take an indeterminate amount of time to complete. FUSE will set
> this flag on its mappings. Reclaim for the legacy cgroup v1 case described
> above will skip reclaim of folios with that flag set.
> 
> With this change, writeback state is now only cleared on the dirty page after
> the server has written it back to disk. If the server is deliberately
> malicious or well-intentioned but buggy, this may stall sync(2) and page
> migration, but for sync(2), a malicious server may already stall this by not
> replying to the FUSE_SYNCFS request and for page migration, there are already
> many easier ways to stall this by having FUSE permanently hold the folio lock.
> A fuller discussion on this can be found in [2]. Long-term, there needs to be
> a more comprehensive solution for addressing migration of FUSE pages that
> handles all scenarios where FUSE may permanently hold the lock, but that is
> outside the scope of this patchset and will be done as future work. Please
> also note that this change also now ensures that when sync(2) returns, FUSE
> filesystems will have persisted writeback changes.
> 
> [1] https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com/
> [2] https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@gmail.com/
> 
> Changelog
> ---------
> v6:
> https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@gmail.com/
> Changes from v6 -> v7:
> * Drop migration and sync patches, as they are useless if a server is
>   determined to be malicious
> 
> v5:
> https://lore.kernel.org/linux-fsdevel/20241115224459.427610-1-joannelkoong@gmail.com/
> Changes from v5 -> v6:
> * Add Shakeel and Jingbo's reviewed-bys 
> * Move folio_end_writeback() to fuse_writepage_finish() (Jingbo)
> * Embed fuse_writepage_finish_stat() logic inline (Jingbo)
> * Remove node_stat NR_WRITEBACK inc/sub (Jingbo)
> 
> v4:
> https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@gmail.com/
> Changes from v4 -> v5:
> * AS_WRITEBACK_MAY_BLOCK -> AS_WRITEBACK_INDETERMINATE (Shakeel)
> * Drop memory hotplug patch (David and Shakeel)
> * Remove some more kunnecessary writeback waits in fuse code (Jingbo)
> * Make commit message for reclaim patch more concise - drop part about
>   deadlock and just focus on how it may stall waits
> 
> v3:
> https://lore.kernel.org/linux-fsdevel/20241107191618.2011146-1-joannelkoong@gmail.com/
> Changes from v3 -> v4:
> * Use filemap_fdatawait_range() instead of filemap_range_has_writeback() in
>   readahead
> 
> v2:
> https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/
> Changes from v2 -> v3:
> * Account for sync and page migration cases as well (Miklos)
> * Change AS_NO_WRITEBACK_RECLAIM to the more generic AS_WRITEBACK_MAY_BLOCK
> * For fuse inodes, set mapping_writeback_may_block only if fc->writeback_cache
>   is enabled
> 
> v1:
> https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/T/#t
> Changes from v1 -> v2:
> * Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
> * Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)
> 
> Joanne Koong (3):
>   mm: add AS_WRITEBACK_INDETERMINATE mapping flag
>   mm: skip reclaiming folios in legacy memcg writeback indeterminate
>     contexts
>   fuse: remove tmp folio for writebacks and internal rb tree
> 
>  fs/fuse/file.c          | 360 ++++------------------------------------
>  fs/fuse/fuse_i.h        |   3 -
>  include/linux/pagemap.h |  11 ++
>  mm/vmscan.c             |  10 +-
>  4 files changed, 46 insertions(+), 338 deletions(-)
> 

This looks sane, and I love that diffstat.

I also agree with David about changing the flag name to something more
specific. As a kernel engineer, anything with "INDETERMINATE" in the
name gives me the ick.

Assuming that the only real change in v8 will be the flag name change,
you can add:

Reviewed-by: Jeff Layton <jlayton@kernel.org>

Assuming others are ok with this, how do you see this going in? Maybe
Andrew could pick up the mm bits and Miklos could take the FUSE patch?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-10 16:11             ` Joanne Koong
@ 2025-04-14 20:24               ` Joanne Koong
  2025-04-15  7:49               ` Jingbo Xu
  1 sibling, 0 replies; 23+ messages in thread
From: Joanne Koong @ 2025-04-14 20:24 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: miklos, akpm, linux-fsdevel, linux-mm, shakeel.butt, david,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi

On Thu, Apr 10, 2025 at 9:11 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >
> > On 4/10/25 11:07 PM, Joanne Koong wrote:
> > > On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > >>
> > >>
> > >>
> > >> On 4/10/25 7:47 AM, Joanne Koong wrote:
> > >>>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > >>>>
> > >>>> Hi Joanne,
> > >>>>
> > >>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
> > >>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
> > >>>>> ("fuse: support writable mmap")), a temp page is allocated for every
> > >>>>> dirty page to be written back, the contents of the dirty page are copied over
> > >>>>> to the temp page, and the temp page gets handed to the server to write back.
> > >>>>>
> > >>>>> This is done so that writeback may be immediately cleared on the dirty page,
> > >>>>> and this in turn is done in order to mitigate the following deadlock scenario
> > >>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
> > >>>>> * single-threaded FUSE server is in the middle of handling a request
> > >>>>>   that needs a memory allocation
> > >>>>> * memory allocation triggers direct reclaim
> > >>>>> * direct reclaim waits on a folio under writeback
> > >>>>> * the FUSE server can't write back the folio since it's stuck in
> > >>>>>   direct reclaim
> > >>>>>
> > >>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> > >>>>> the situations described above, FUSE writeback does not need to use
> > >>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> > >>>>>
> > >>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> > >>>>> and removes the temporary pages + extra copying and the internal rb
> > >>>>> tree.
> > >>>>>
> > >>>>> fio benchmarks --
> > >>>>> (using averages observed from 10 runs, throwing away outliers)
> > >>>>>
> > >>>>> Setup:
> > >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> > >>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> > >>>>>
> > >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> > >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> > >>>>>
> > >>>>>         bs =  1k          4k            1M
> > >>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> > >>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> > >>>>> % diff        -3%          23%         45%
> > >>>>>
> > >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > >>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> > >>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> > >>>>
> > >>>
> > >>> Hi Jingbo,
> > >>>
> > >>> Thanks for sharing your analysis for this.
> > >>>
> > >>>> Overall this patch LGTM.
> > >>>>
> > >>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
> > >>>> also unneeded then, at least the DIRECT IO routine (i.e.
> > >>>
> > >>> I took a look at fi->writectr and fi->queued_writes and my
> > >>> understanding is that we do still need this. For example, for
> > >>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
> > >>> prevent concurrent writeback or else the setattr request and the
> > >>> writeback request could race which would result in a mismatch between
> > >>> the file's reported size and the actual data written to disk.
> > >>
> > >> I haven't looked into the truncate routine yet.  I will see it later.
> > >>
> > >>>
> > >>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
> > >>>> because after removing the temp page, the DIRECT IO routine has already
> > >>>> been waiting for all inflight WRITE requests, see
> > >>>>
> > >>>> # DIRECT read
> > >>>> generic_file_read_iter
> > >>>>   kiocb_write_and_wait
> > >>>>     filemap_write_and_wait_range
> > >>>
> > >>> Where do you see generic_file_read_iter() getting called for direct io reads?
> > >>
> > >> # DIRECT read
> > >> fuse_file_read_iter
> > >>   fuse_cache_read_iter
> > >>     generic_file_read_iter
> > >>       kiocb_write_and_wait
> > >>        filemap_write_and_wait_range
> > >>       a_ops->direct_IO(),i.e. fuse_direct_IO()
> > >>
> > >
> > > Oh I see, I thought files opened with O_DIRECT automatically call the
> > > .direct_IO handler for reads/writes but you're right, it first goes
> > > through .read_iter / .write_iter handlers, and the .direct_IO handler
> > > only gets invoked through generic_file_read_iter() /
> > > generic_file_direct_write() in mm/filemap.c
> > >
> > > There's two paths for direct io in FUSE:
> > > a) fuse server sets fi->direct_io = true when a file is opened, which
> > > will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
> > > b) fuse server doesn't set fi->direct_io = true, but the client opens
> > > the file with O_DIRECT
> > >
> > > We only go through the stack trace you listed above for the b) case.
> > > For the a) case, we'll hit
> > >
> > >         if (ff->open_flags & FOPEN_DIRECT_IO)
> > >                 return fuse_direct_read_iter(iocb, to);
> > >
> > > and
> > >
> > >         if (ff->open_flags & FOPEN_DIRECT_IO)
> > >                 return fuse_direct_write_iter(iocb, from);
> > >
> > > which will invoke fuse_direct_IO() / fuse_direct_io() without going
> > > through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
> > > kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
> > > above.
> > >
> > > So for the a) case I think we'd still need the fuse_sync_writes() in
> > > case there's still pending writeback.
> > >
> > > Do you agree with this analysis or am I missing something here?
> >
> > Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
> > call filemap_wait_range() or something similar here.
> >
>
> Agreed. Actually, the more I look at this, the more I think we can
> replace all fuse_sync_writes() and get rid of it entirely. Right now,
> fuse_sync_writes() is called in:

This is nontrivial so I'll leave this optimization to be done as a
separate future patchset so as to not slow this one down.

Thanks,
Joanne

>
> fuse_fsync():
>         /*
>          * Start writeback against all dirty pages of the inode, then
>          * wait for all outstanding writes, before sending the FSYNC
>          * request.
>          */
>         err = file_write_and_wait_range(file, start, end);
>         if (err)
>                 goto out;
>
>         fuse_sync_writes(inode);
>
>         /*
>          * Due to implementation of fuse writeback
>          * file_write_and_wait_range() does not catch errors.
>          * We have to do this directly after fuse_sync_writes()
>          */
>         err = file_check_and_advance_wb_err(file);
>         if (err)
>                 goto out;
>
>
>       We can get rid of the fuse_sync_writes() and
> file_check_and_advance_wb_err() entirely since now without temp pages,
> the file_write_and_wait_range() call actually ensures that writeback
> is completed
>
>
>
> fuse_writeback_range():
>         static int fuse_writeback_range(struct inode *inode, loff_t
> start, loff_t end)
>         {
>                 int err =
> filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
>
>                 if (!err)
>                         fuse_sync_writes(inode);
>
>                 return err;
>         }
>
>
>       We can replace fuse_writeback_range() entirely with
> filemap_write_and_wait_range().
>
>
>
> fuse_direct_io():
>         if (fopen_direct_io && fc->direct_io_allow_mmap) {
>                 res = filemap_write_and_wait_range(mapping, pos, pos +
> count - 1);
>                 if (res) {
>                         fuse_io_free(ia);
>                         return res;
>                 }
>         }
>         if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
> count - 1))) {
>                 if (!write)
>                         inode_lock(inode);
>                 fuse_sync_writes(inode);
>                 if (!write)
>                         inode_unlock(inode);
>         }
>
>
>        I think this can just replaced with
>                 if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
>                         res = filemap_write_and_wait_range(mapping,
> pos, pos + count - 1);
>                         if (res) {
>                                 fuse_io_free(ia);
>                                 return res;
>                         }
>                 }
>        since for the !fopen_direct_io case, it will already go through
> filemap_write_and_wait_range(), as you mentioned in your previous
> message. I think this also fixes a bug (?) in the original code - in
> the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
> still need to write out dirty pages first, which we don't currently
> do.
>
>
> What do you think?
>
> Thanks for all your careful review on this patchset throughout all of
> its iterations.
>
> >
> >
> > >> filp_close()
> > >>    filp_flush()
> > >>        filp->f_op->flush()
> > >>            fuse_flush()
> > >>              write_inode_now
> > >>                 writeback_single_inode(WB_SYNC_ALL)
> > >>                   do_writepages
> > >>                     # flush dirty page
> > >>                   filemap_fdatawait
> > >>                     # wait for WRITE completion
> > >
> > > Nice. I missed that write_inode_now() will invoke filemap_fdatawait().
> > > This seems pretty straightforward. I'll remove the fuse_sync_writes()
> > > call in fuse_flush() when I send out v8.
> > >
> > > The direct io one above is less straight-forward. I won't add that to
> > > v8 but that can be done in a separate future patch when we figure that
> > > out.
> >
> > Thanks for keep working on this. Appreciated.
> >
> > --
> > Thanks,
> > Jingbo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 0/3] fuse: remove temp page copies in writeback
  2025-04-14 16:21 ` [PATCH v7 0/3] fuse: remove temp page copies in writeback Jeff Layton
@ 2025-04-14 20:28   ` Joanne Koong
  0 siblings, 0 replies; 23+ messages in thread
From: Joanne Koong @ 2025-04-14 20:28 UTC (permalink / raw)
  To: Jeff Layton
  Cc: miklos, akpm, linux-fsdevel, linux-mm, jefflexu, shakeel.butt,
	david, bernd.schubert, ziy, kernel-team

On Mon, Apr 14, 2025 at 9:21 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Fri, 2025-04-04 at 11:14 -0700, Joanne Koong wrote:
> > The purpose of this patchset is to help make writeback in FUSE filesystems as
> > fast as possible.
> >
> > In the current FUSE writeback design (see commit 3be5a52b30aa
> > ("fuse: support writable mmap"))), a temp page is allocated for every dirty
> > page to be written back, the contents of the dirty page are copied over to the
> > temp page, and the temp page gets handed to the server to write back. This is
> > done so that writeback may be immediately cleared on the dirty page, and this
> > in turn is done in order to mitigate the following deadlock scenario that may
> > arise if reclaim waits on writeback on the dirty page to complete (more details
> > can be found in this thread [1]):
> > * single-threaded FUSE server is in the middle of handling a request
> >   that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in
> >   direct reclaim
> >
> > Allocating and copying dirty pages to temp pages is the biggest performance
> > bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> > altogether (which will also allow us to get rid of the internal FUSE rb tree
> > that is needed to keep track of writeback status on the temp pages).
> > Benchmarks show approximately a 20% improvement in throughput for 4k
> > block-size writes and a 45% improvement for 1M block-size writes.
> >
> > In the current reclaim code, there is one scenario where writeback is waited
> > on, which is the case where the system is running legacy cgroupv1 and reclaim
> > encounters a folio that already has the reclaim flag set and the caller did
> > not have __GFP_FS (or __GFP_IO if swap) set.
> >
> > This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> > filesystems may set on its inode mappings to indicate that writeback
> > operations may take an indeterminate amount of time to complete. FUSE will set
> > this flag on its mappings. Reclaim for the legacy cgroup v1 case described
> > above will skip reclaim of folios with that flag set.
> >
> > With this change, writeback state is now only cleared on the dirty page after
> > the server has written it back to disk. If the server is deliberately
> > malicious or well-intentioned but buggy, this may stall sync(2) and page
> > migration, but for sync(2), a malicious server may already stall this by not
> > replying to the FUSE_SYNCFS request and for page migration, there are already
> > many easier ways to stall this by having FUSE permanently hold the folio lock.
> > A fuller discussion on this can be found in [2]. Long-term, there needs to be
> > a more comprehensive solution for addressing migration of FUSE pages that
> > handles all scenarios where FUSE may permanently hold the lock, but that is
> > outside the scope of this patchset and will be done as future work. Please
> > also note that this change also now ensures that when sync(2) returns, FUSE
> > filesystems will have persisted writeback changes.
> >
> > [1] https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com/
> > [2] https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@gmail.com/
> >
> > Changelog
> > ---------
> > v6:
> > https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@gmail.com/
> > Changes from v6 -> v7:
> > * Drop migration and sync patches, as they are useless if a server is
> >   determined to be malicious
> >
> > v5:
> > https://lore.kernel.org/linux-fsdevel/20241115224459.427610-1-joannelkoong@gmail.com/
> > Changes from v5 -> v6:
> > * Add Shakeel and Jingbo's reviewed-bys
> > * Move folio_end_writeback() to fuse_writepage_finish() (Jingbo)
> > * Embed fuse_writepage_finish_stat() logic inline (Jingbo)
> > * Remove node_stat NR_WRITEBACK inc/sub (Jingbo)
> >
> > v4:
> > https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@gmail.com/
> > Changes from v4 -> v5:
> > * AS_WRITEBACK_MAY_BLOCK -> AS_WRITEBACK_INDETERMINATE (Shakeel)
> > * Drop memory hotplug patch (David and Shakeel)
> > * Remove some more kunnecessary writeback waits in fuse code (Jingbo)
> > * Make commit message for reclaim patch more concise - drop part about
> >   deadlock and just focus on how it may stall waits
> >
> > v3:
> > https://lore.kernel.org/linux-fsdevel/20241107191618.2011146-1-joannelkoong@gmail.com/
> > Changes from v3 -> v4:
> > * Use filemap_fdatawait_range() instead of filemap_range_has_writeback() in
> >   readahead
> >
> > v2:
> > https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/
> > Changes from v2 -> v3:
> > * Account for sync and page migration cases as well (Miklos)
> > * Change AS_NO_WRITEBACK_RECLAIM to the more generic AS_WRITEBACK_MAY_BLOCK
> > * For fuse inodes, set mapping_writeback_may_block only if fc->writeback_cache
> >   is enabled
> >
> > v1:
> > https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/T/#t
> > Changes from v1 -> v2:
> > * Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
> > * Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)
> >
> > Joanne Koong (3):
> >   mm: add AS_WRITEBACK_INDETERMINATE mapping flag
> >   mm: skip reclaiming folios in legacy memcg writeback indeterminate
> >     contexts
> >   fuse: remove tmp folio for writebacks and internal rb tree
> >
> >  fs/fuse/file.c          | 360 ++++------------------------------------
> >  fs/fuse/fuse_i.h        |   3 -
> >  include/linux/pagemap.h |  11 ++
> >  mm/vmscan.c             |  10 +-
> >  4 files changed, 46 insertions(+), 338 deletions(-)
> >
>
> This looks sane, and I love that diffstat.
>
> I also agree with David about changing the flag name to something more
> specific. As a kernel engineer, anything with "INDETERMINATE" in the
> name gives me the ick.
>
> Assuming that the only real change in v8 will be the flag name change,
> you can add:
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
>
> Assuming others are ok with this, how do you see this going in? Maybe
> Andrew could pick up the mm bits and Miklos could take the FUSE patch?


Thanks for the review. The only thing I plan to change for v8 is the
flag name and removing the unneeded fuse_sync_writes() call in
fuse_flush() that Jingbo pointed out.

With v8, I'm hoping the mm bits (first 2 patches) could be picked up
by Andrew and that the 3rd patch (the one with FUSE changes) could be
taken by Miklos, as the FUSE large folios patchset [1] I will be
resending will depend on patch 3.

Thanks,
Joanne

 [1] https://lore.kernel.org/linux-fsdevel/20241213221818.322371-1-joannelkoong@gmail.com/


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-10 16:11             ` Joanne Koong
  2025-04-14 20:24               ` Joanne Koong
@ 2025-04-15  7:49               ` Jingbo Xu
  2025-04-15 15:59                 ` Joanne Koong
  1 sibling, 1 reply; 23+ messages in thread
From: Jingbo Xu @ 2025-04-15  7:49 UTC (permalink / raw)
  To: Joanne Koong
  Cc: miklos, akpm, linux-fsdevel, linux-mm, shakeel.butt, david,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi

Hi Joanne,

Sorry for the late reply...


On 4/11/25 12:11 AM, Joanne Koong wrote:
> On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> On 4/10/25 11:07 PM, Joanne Koong wrote:
>>> On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> On 4/10/25 7:47 AM, Joanne Koong wrote:
>>>>>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>
>>>>>> Hi Joanne,
>>>>>>
>>>>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
>>>>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
>>>>>>> ("fuse: support writable mmap")), a temp page is allocated for every
>>>>>>> dirty page to be written back, the contents of the dirty page are copied over
>>>>>>> to the temp page, and the temp page gets handed to the server to write back.
>>>>>>>
>>>>>>> This is done so that writeback may be immediately cleared on the dirty page,
>>>>>>> and this in turn is done in order to mitigate the following deadlock scenario
>>>>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
>>>>>>> * single-threaded FUSE server is in the middle of handling a request
>>>>>>>   that needs a memory allocation
>>>>>>> * memory allocation triggers direct reclaim
>>>>>>> * direct reclaim waits on a folio under writeback
>>>>>>> * the FUSE server can't write back the folio since it's stuck in
>>>>>>>   direct reclaim
>>>>>>>
>>>>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
>>>>>>> the situations described above, FUSE writeback does not need to use
>>>>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
>>>>>>>
>>>>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
>>>>>>> and removes the temporary pages + extra copying and the internal rb
>>>>>>> tree.
>>>>>>>
>>>>>>> fio benchmarks --
>>>>>>> (using averages observed from 10 runs, throwing away outliers)
>>>>>>>
>>>>>>> Setup:
>>>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>>>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>>>>>
>>>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>>>>>
>>>>>>>         bs =  1k          4k            1M
>>>>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
>>>>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
>>>>>>> % diff        -3%          23%         45%
>>>>>>>
>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
>>>>>>
>>>>>
>>>>> Hi Jingbo,
>>>>>
>>>>> Thanks for sharing your analysis for this.
>>>>>
>>>>>> Overall this patch LGTM.
>>>>>>
>>>>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
>>>>>> also unneeded then, at least the DIRECT IO routine (i.e.
>>>>>
>>>>> I took a look at fi->writectr and fi->queued_writes and my
>>>>> understanding is that we do still need this. For example, for
>>>>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
>>>>> prevent concurrent writeback or else the setattr request and the
>>>>> writeback request could race which would result in a mismatch between
>>>>> the file's reported size and the actual data written to disk.
>>>>
>>>> I haven't looked into the truncate routine yet.  I will see it later.
>>>>
>>>>>
>>>>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
>>>>>> because after removing the temp page, the DIRECT IO routine has already
>>>>>> been waiting for all inflight WRITE requests, see
>>>>>>
>>>>>> # DIRECT read
>>>>>> generic_file_read_iter
>>>>>>   kiocb_write_and_wait
>>>>>>     filemap_write_and_wait_range
>>>>>
>>>>> Where do you see generic_file_read_iter() getting called for direct io reads?
>>>>
>>>> # DIRECT read
>>>> fuse_file_read_iter
>>>>   fuse_cache_read_iter
>>>>     generic_file_read_iter
>>>>       kiocb_write_and_wait
>>>>        filemap_write_and_wait_range
>>>>       a_ops->direct_IO(),i.e. fuse_direct_IO()
>>>>
>>>
>>> Oh I see, I thought files opened with O_DIRECT automatically call the
>>> .direct_IO handler for reads/writes but you're right, it first goes
>>> through .read_iter / .write_iter handlers, and the .direct_IO handler
>>> only gets invoked through generic_file_read_iter() /
>>> generic_file_direct_write() in mm/filemap.c
>>>
>>> There's two paths for direct io in FUSE:
>>> a) fuse server sets fi->direct_io = true when a file is opened, which
>>> will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
>>> b) fuse server doesn't set fi->direct_io = true, but the client opens
>>> the file with O_DIRECT
>>>
>>> We only go through the stack trace you listed above for the b) case.
>>> For the a) case, we'll hit
>>>
>>>         if (ff->open_flags & FOPEN_DIRECT_IO)
>>>                 return fuse_direct_read_iter(iocb, to);
>>>
>>> and
>>>
>>>         if (ff->open_flags & FOPEN_DIRECT_IO)
>>>                 return fuse_direct_write_iter(iocb, from);
>>>
>>> which will invoke fuse_direct_IO() / fuse_direct_io() without going
>>> through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
>>> kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
>>> above.
>>>
>>> So for the a) case I think we'd still need the fuse_sync_writes() in
>>> case there's still pending writeback.
>>>
>>> Do you agree with this analysis or am I missing something here?
>>
>> Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
>> call filemap_wait_range() or something similar here.
>>
> 
> Agreed. Actually, the more I look at this, the more I think we can
> replace all fuse_sync_writes() and get rid of it entirely. 


I have seen your latest reply that this cleaning up won't be included in
this series, which is okay.


> fuse_sync_writes() is called in:
> 
> fuse_fsync():
>         /*
>          * Start writeback against all dirty pages of the inode, then
>          * wait for all outstanding writes, before sending the FSYNC
>          * request.
>          */
>         err = file_write_and_wait_range(file, start, end);
>         if (err)
>                 goto out;
> 
>         fuse_sync_writes(inode);
> 
>         /*
>          * Due to implementation of fuse writeback
>          * file_write_and_wait_range() does not catch errors.
>          * We have to do this directly after fuse_sync_writes()
>          */
>         err = file_check_and_advance_wb_err(file);
>         if (err)
>                 goto out;
> 
> 
>       We can get rid of the fuse_sync_writes() and
> file_check_and_advance_wb_err() entirely since now without temp pages,
> the file_write_and_wait_range() call actually ensures that writeback
> is completed
> 
> 
> 
> fuse_writeback_range():
>         static int fuse_writeback_range(struct inode *inode, loff_t
> start, loff_t end)
>         {
>                 int err =
> filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
> 
>                 if (!err)
>                         fuse_sync_writes(inode);
> 
>                 return err;
>         }
> 
> 
>       We can replace fuse_writeback_range() entirely with
> filemap_write_and_wait_range().
> 
> 
> 
> fuse_direct_io():
>         if (fopen_direct_io && fc->direct_io_allow_mmap) {
>                 res = filemap_write_and_wait_range(mapping, pos, pos +
> count - 1);
>                 if (res) {
>                         fuse_io_free(ia);
>                         return res;
>                 }
>         }
>         if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
> count - 1))) {
>                 if (!write)
>                         inode_lock(inode);
>                 fuse_sync_writes(inode);
>                 if (!write)
>                         inode_unlock(inode);
>         }
> 
> 
>        I think this can just replaced with
>                 if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
>                         res = filemap_write_and_wait_range(mapping,
> pos, pos + count - 1);
>                         if (res) {
>                                 fuse_io_free(ia);
>                                 return res;
>                         }
>                 }

Alright. But I would prefer doing this filemap_write_and_wait_range() in
fuse_direct_write_iter() rather than fuse_direct_io() if possible.

>        since for the !fopen_direct_io case, it will already go through
> filemap_write_and_wait_range(), as you mentioned in your previous
> message. I think this also fixes a bug (?) in the original code - in
> the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
> still need to write out dirty pages first, which we don't currently
> do.

Nope.  In case of fopen_direct_io && !fc->direct_io_allow_mmap, there
won't be any page cache at all, right?



-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-15  7:49               ` Jingbo Xu
@ 2025-04-15 15:59                 ` Joanne Koong
  2025-04-16  1:40                   ` Jingbo Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Joanne Koong @ 2025-04-15 15:59 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: miklos, akpm, linux-fsdevel, linux-mm, shakeel.butt, david,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi

On Tue, Apr 15, 2025 at 12:49 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi Joanne,
>
> Sorry for the late reply...

Hi Jingbo,

No worries at all.
>
>
> On 4/11/25 12:11 AM, Joanne Koong wrote:
> > On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >> On 4/10/25 11:07 PM, Joanne Koong wrote:
> >>> On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 4/10/25 7:47 AM, Joanne Koong wrote:
> >>>>>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>>>
> >>>>>> Hi Joanne,
> >>>>>>
> >>>>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
> >>>>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
> >>>>>>> ("fuse: support writable mmap")), a temp page is allocated for every
> >>>>>>> dirty page to be written back, the contents of the dirty page are copied over
> >>>>>>> to the temp page, and the temp page gets handed to the server to write back.
> >>>>>>>
> >>>>>>> This is done so that writeback may be immediately cleared on the dirty page,
> >>>>>>> and this in turn is done in order to mitigate the following deadlock scenario
> >>>>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
> >>>>>>> * single-threaded FUSE server is in the middle of handling a request
> >>>>>>>   that needs a memory allocation
> >>>>>>> * memory allocation triggers direct reclaim
> >>>>>>> * direct reclaim waits on a folio under writeback
> >>>>>>> * the FUSE server can't write back the folio since it's stuck in
> >>>>>>>   direct reclaim
> >>>>>>>
> >>>>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> >>>>>>> the situations described above, FUSE writeback does not need to use
> >>>>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> >>>>>>>
> >>>>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> >>>>>>> and removes the temporary pages + extra copying and the internal rb
> >>>>>>> tree.
> >>>>>>>
> >>>>>>> fio benchmarks --
> >>>>>>> (using averages observed from 10 runs, throwing away outliers)
> >>>>>>>
> >>>>>>> Setup:
> >>>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>>>>>
> >>>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>>>>>
> >>>>>>>         bs =  1k          4k            1M
> >>>>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> >>>>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> >>>>>>> % diff        -3%          23%         45%
> >>>>>>>
> >>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> >>>>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> >>>>>>
> >>>>>
> >>>>> Hi Jingbo,
> >>>>>
> >>>>> Thanks for sharing your analysis for this.
> >>>>>
> >>>>>> Overall this patch LGTM.
> >>>>>>
> >>>>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
> >>>>>> also unneeded then, at least the DIRECT IO routine (i.e.
> >>>>>
> >>>>> I took a look at fi->writectr and fi->queued_writes and my
> >>>>> understanding is that we do still need this. For example, for
> >>>>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
> >>>>> prevent concurrent writeback or else the setattr request and the
> >>>>> writeback request could race which would result in a mismatch between
> >>>>> the file's reported size and the actual data written to disk.
> >>>>
> >>>> I haven't looked into the truncate routine yet.  I will see it later.
> >>>>
> >>>>>
> >>>>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
> >>>>>> because after removing the temp page, the DIRECT IO routine has already
> >>>>>> been waiting for all inflight WRITE requests, see
> >>>>>>
> >>>>>> # DIRECT read
> >>>>>> generic_file_read_iter
> >>>>>>   kiocb_write_and_wait
> >>>>>>     filemap_write_and_wait_range
> >>>>>
> >>>>> Where do you see generic_file_read_iter() getting called for direct io reads?
> >>>>
> >>>> # DIRECT read
> >>>> fuse_file_read_iter
> >>>>   fuse_cache_read_iter
> >>>>     generic_file_read_iter
> >>>>       kiocb_write_and_wait
> >>>>        filemap_write_and_wait_range
> >>>>       a_ops->direct_IO(),i.e. fuse_direct_IO()
> >>>>
> >>>
> >>> Oh I see, I thought files opened with O_DIRECT automatically call the
> >>> .direct_IO handler for reads/writes but you're right, it first goes
> >>> through .read_iter / .write_iter handlers, and the .direct_IO handler
> >>> only gets invoked through generic_file_read_iter() /
> >>> generic_file_direct_write() in mm/filemap.c
> >>>
> >>> There's two paths for direct io in FUSE:
> >>> a) fuse server sets fi->direct_io = true when a file is opened, which
> >>> will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
> >>> b) fuse server doesn't set fi->direct_io = true, but the client opens
> >>> the file with O_DIRECT
> >>>
> >>> We only go through the stack trace you listed above for the b) case.
> >>> For the a) case, we'll hit
> >>>
> >>>         if (ff->open_flags & FOPEN_DIRECT_IO)
> >>>                 return fuse_direct_read_iter(iocb, to);
> >>>
> >>> and
> >>>
> >>>         if (ff->open_flags & FOPEN_DIRECT_IO)
> >>>                 return fuse_direct_write_iter(iocb, from);
> >>>
> >>> which will invoke fuse_direct_IO() / fuse_direct_io() without going
> >>> through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
> >>> kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
> >>> above.
> >>>
> >>> So for the a) case I think we'd still need the fuse_sync_writes() in
> >>> case there's still pending writeback.
> >>>
> >>> Do you agree with this analysis or am I missing something here?
> >>
> >> Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
> >> call filemap_wait_range() or something similar here.
> >>
> >
> > Agreed. Actually, the more I look at this, the more I think we can
> > replace all fuse_sync_writes() and get rid of it entirely.
>
>
> I have seen your latest reply that this cleaning up won't be included in
> this series, which is okay.
>
>
> > fuse_sync_writes() is called in:
> >
> > fuse_fsync():
> >         /*
> >          * Start writeback against all dirty pages of the inode, then
> >          * wait for all outstanding writes, before sending the FSYNC
> >          * request.
> >          */
> >         err = file_write_and_wait_range(file, start, end);
> >         if (err)
> >                 goto out;
> >
> >         fuse_sync_writes(inode);
> >
> >         /*
> >          * Due to implementation of fuse writeback
> >          * file_write_and_wait_range() does not catch errors.
> >          * We have to do this directly after fuse_sync_writes()
> >          */
> >         err = file_check_and_advance_wb_err(file);
> >         if (err)
> >                 goto out;
> >
> >
> >       We can get rid of the fuse_sync_writes() and
> > file_check_and_advance_wb_err() entirely since now without temp pages,
> > the file_write_and_wait_range() call actually ensures that writeback
> > is completed
> >
> >
> >
> > fuse_writeback_range():
> >         static int fuse_writeback_range(struct inode *inode, loff_t
> > start, loff_t end)
> >         {
> >                 int err =
> > filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
> >
> >                 if (!err)
> >                         fuse_sync_writes(inode);
> >
> >                 return err;
> >         }
> >
> >
> >       We can replace fuse_writeback_range() entirely with
> > filemap_write_and_wait_range().
> >
> >
> >
> > fuse_direct_io():
> >         if (fopen_direct_io && fc->direct_io_allow_mmap) {
> >                 res = filemap_write_and_wait_range(mapping, pos, pos +
> > count - 1);
> >                 if (res) {
> >                         fuse_io_free(ia);
> >                         return res;
> >                 }
> >         }
> >         if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
> > count - 1))) {
> >                 if (!write)
> >                         inode_lock(inode);
> >                 fuse_sync_writes(inode);
> >                 if (!write)
> >                         inode_unlock(inode);
> >         }
> >
> >
> >        I think this can just replaced with
> >                 if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
> >                         res = filemap_write_and_wait_range(mapping,
> > pos, pos + count - 1);
> >                         if (res) {
> >                                 fuse_io_free(ia);
> >                                 return res;
> >                         }
> >                 }
>
> Alright. But I would prefer doing this filemap_write_and_wait_range() in
> fuse_direct_write_iter() rather than fuse_direct_io() if possible.
>
> >        since for the !fopen_direct_io case, it will already go through
> > filemap_write_and_wait_range(), as you mentioned in your previous
> > message. I think this also fixes a bug (?) in the original code - in
> > the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
> > still need to write out dirty pages first, which we don't currently
> > do.
>
> Nope.  In case of fopen_direct_io && !fc->direct_io_allow_mmap, there
> won't be any page cache at all, right?
>

Isn't there still a page cache if the file was previously opened
without direct io and then the client opens another handle to that
file with direct io? In that case, the pages could still be dirty in
the page cache and would need to be written back first, no?


Thanks,
Joanne
>
>
> --
> Thanks,
> Jingbo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-15 15:59                 ` Joanne Koong
@ 2025-04-16  1:40                   ` Jingbo Xu
  2025-04-16 16:43                     ` Joanne Koong
  0 siblings, 1 reply; 23+ messages in thread
From: Jingbo Xu @ 2025-04-16  1:40 UTC (permalink / raw)
  To: Joanne Koong
  Cc: miklos, akpm, linux-fsdevel, linux-mm, shakeel.butt, david,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi



On 4/15/25 11:59 PM, Joanne Koong wrote:
> On Tue, Apr 15, 2025 at 12:49 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi Joanne,
>>
>> Sorry for the late reply...
> 
> Hi Jingbo,
> 
> No worries at all.
>>
>>
>> On 4/11/25 12:11 AM, Joanne Koong wrote:
>>> On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>> On 4/10/25 11:07 PM, Joanne Koong wrote:
>>>>> On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 4/10/25 7:47 AM, Joanne Koong wrote:
>>>>>>>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>>
>>>>>>>> Hi Joanne,
>>>>>>>>
>>>>>>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
>>>>>>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
>>>>>>>>> ("fuse: support writable mmap")), a temp page is allocated for every
>>>>>>>>> dirty page to be written back, the contents of the dirty page are copied over
>>>>>>>>> to the temp page, and the temp page gets handed to the server to write back.
>>>>>>>>>
>>>>>>>>> This is done so that writeback may be immediately cleared on the dirty page,
>>>>>>>>> and this in turn is done in order to mitigate the following deadlock scenario
>>>>>>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
>>>>>>>>> * single-threaded FUSE server is in the middle of handling a request
>>>>>>>>>   that needs a memory allocation
>>>>>>>>> * memory allocation triggers direct reclaim
>>>>>>>>> * direct reclaim waits on a folio under writeback
>>>>>>>>> * the FUSE server can't write back the folio since it's stuck in
>>>>>>>>>   direct reclaim
>>>>>>>>>
>>>>>>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
>>>>>>>>> the situations described above, FUSE writeback does not need to use
>>>>>>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
>>>>>>>>>
>>>>>>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
>>>>>>>>> and removes the temporary pages + extra copying and the internal rb
>>>>>>>>> tree.
>>>>>>>>>
>>>>>>>>> fio benchmarks --
>>>>>>>>> (using averages observed from 10 runs, throwing away outliers)
>>>>>>>>>
>>>>>>>>> Setup:
>>>>>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>>>>>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>>>>>>>
>>>>>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>>>>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>>>>>>>
>>>>>>>>>         bs =  1k          4k            1M
>>>>>>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
>>>>>>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
>>>>>>>>> % diff        -3%          23%         45%
>>>>>>>>>
>>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
>>>>>>>>
>>>>>>>
>>>>>>> Hi Jingbo,
>>>>>>>
>>>>>>> Thanks for sharing your analysis for this.
>>>>>>>
>>>>>>>> Overall this patch LGTM.
>>>>>>>>
>>>>>>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
>>>>>>>> also unneeded then, at least the DIRECT IO routine (i.e.
>>>>>>>
>>>>>>> I took a look at fi->writectr and fi->queued_writes and my
>>>>>>> understanding is that we do still need this. For example, for
>>>>>>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
>>>>>>> prevent concurrent writeback or else the setattr request and the
>>>>>>> writeback request could race which would result in a mismatch between
>>>>>>> the file's reported size and the actual data written to disk.
>>>>>>
>>>>>> I haven't looked into the truncate routine yet.  I will see it later.
>>>>>>
>>>>>>>
>>>>>>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
>>>>>>>> because after removing the temp page, the DIRECT IO routine has already
>>>>>>>> been waiting for all inflight WRITE requests, see
>>>>>>>>
>>>>>>>> # DIRECT read
>>>>>>>> generic_file_read_iter
>>>>>>>>   kiocb_write_and_wait
>>>>>>>>     filemap_write_and_wait_range
>>>>>>>
>>>>>>> Where do you see generic_file_read_iter() getting called for direct io reads?
>>>>>>
>>>>>> # DIRECT read
>>>>>> fuse_file_read_iter
>>>>>>   fuse_cache_read_iter
>>>>>>     generic_file_read_iter
>>>>>>       kiocb_write_and_wait
>>>>>>        filemap_write_and_wait_range
>>>>>>       a_ops->direct_IO(),i.e. fuse_direct_IO()
>>>>>>
>>>>>
>>>>> Oh I see, I thought files opened with O_DIRECT automatically call the
>>>>> .direct_IO handler for reads/writes but you're right, it first goes
>>>>> through .read_iter / .write_iter handlers, and the .direct_IO handler
>>>>> only gets invoked through generic_file_read_iter() /
>>>>> generic_file_direct_write() in mm/filemap.c
>>>>>
>>>>> There's two paths for direct io in FUSE:
>>>>> a) fuse server sets fi->direct_io = true when a file is opened, which
>>>>> will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
>>>>> b) fuse server doesn't set fi->direct_io = true, but the client opens
>>>>> the file with O_DIRECT
>>>>>
>>>>> We only go through the stack trace you listed above for the b) case.
>>>>> For the a) case, we'll hit
>>>>>
>>>>>         if (ff->open_flags & FOPEN_DIRECT_IO)
>>>>>                 return fuse_direct_read_iter(iocb, to);
>>>>>
>>>>> and
>>>>>
>>>>>         if (ff->open_flags & FOPEN_DIRECT_IO)
>>>>>                 return fuse_direct_write_iter(iocb, from);
>>>>>
>>>>> which will invoke fuse_direct_IO() / fuse_direct_io() without going
>>>>> through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
>>>>> kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
>>>>> above.
>>>>>
>>>>> So for the a) case I think we'd still need the fuse_sync_writes() in
>>>>> case there's still pending writeback.
>>>>>
>>>>> Do you agree with this analysis or am I missing something here?
>>>>
>>>> Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
>>>> call filemap_wait_range() or something similar here.
>>>>
>>>
>>> Agreed. Actually, the more I look at this, the more I think we can
>>> replace all fuse_sync_writes() and get rid of it entirely.
>>
>>
>> I have seen your latest reply that this cleaning up won't be included in
>> this series, which is okay.
>>
>>
>>> fuse_sync_writes() is called in:
>>>
>>> fuse_fsync():
>>>         /*
>>>          * Start writeback against all dirty pages of the inode, then
>>>          * wait for all outstanding writes, before sending the FSYNC
>>>          * request.
>>>          */
>>>         err = file_write_and_wait_range(file, start, end);
>>>         if (err)
>>>                 goto out;
>>>
>>>         fuse_sync_writes(inode);
>>>
>>>         /*
>>>          * Due to implementation of fuse writeback
>>>          * file_write_and_wait_range() does not catch errors.
>>>          * We have to do this directly after fuse_sync_writes()
>>>          */
>>>         err = file_check_and_advance_wb_err(file);
>>>         if (err)
>>>                 goto out;
>>>
>>>
>>>       We can get rid of the fuse_sync_writes() and
>>> file_check_and_advance_wb_err() entirely since now without temp pages,
>>> the file_write_and_wait_range() call actually ensures that writeback
>>> is completed
>>>
>>>
>>>
>>> fuse_writeback_range():
>>>         static int fuse_writeback_range(struct inode *inode, loff_t
>>> start, loff_t end)
>>>         {
>>>                 int err =
>>> filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
>>>
>>>                 if (!err)
>>>                         fuse_sync_writes(inode);
>>>
>>>                 return err;
>>>         }
>>>
>>>
>>>       We can replace fuse_writeback_range() entirely with
>>> filemap_write_and_wait_range().
>>>
>>>
>>>
>>> fuse_direct_io():
>>>         if (fopen_direct_io && fc->direct_io_allow_mmap) {
>>>                 res = filemap_write_and_wait_range(mapping, pos, pos +
>>> count - 1);
>>>                 if (res) {
>>>                         fuse_io_free(ia);
>>>                         return res;
>>>                 }
>>>         }
>>>         if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
>>> count - 1))) {
>>>                 if (!write)
>>>                         inode_lock(inode);
>>>                 fuse_sync_writes(inode);
>>>                 if (!write)
>>>                         inode_unlock(inode);
>>>         }
>>>
>>>
>>>        I think this can just replaced with
>>>                 if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
>>>                         res = filemap_write_and_wait_range(mapping,
>>> pos, pos + count - 1);
>>>                         if (res) {
>>>                                 fuse_io_free(ia);
>>>                                 return res;
>>>                         }
>>>                 }
>>
>> Alright. But I would prefer doing this filemap_write_and_wait_range() in
>> fuse_direct_write_iter() rather than fuse_direct_io() if possible.
>>
>>>        since for the !fopen_direct_io case, it will already go through
>>> filemap_write_and_wait_range(), as you mentioned in your previous
>>> message. I think this also fixes a bug (?) in the original code - in
>>> the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
>>> still need to write out dirty pages first, which we don't currently
>>> do.
>>
>> Nope.  In case of fopen_direct_io && !fc->direct_io_allow_mmap, there
>> won't be any page cache at all, right?
>>
> 
> Isn't there still a page cache if the file was previously opened
> without direct io and then the client opens another handle to that
> file with direct io? In that case, the pages could still be dirty in
> the page cache and would need to be written back first, no?

Do you mean that when the inode is firstly opened, FOPEN_DIRECT_IO is
not set by the FUSE server, while it is secondly opened, the flag is set?

Though the behavior of the FUSE daemon is quite confusing in this case,
it is completely possible in real life.  So yes we'd better add
filemap_write_and_wait_range() unconditionally in fopen_direct_io case.


-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-16  1:40                   ` Jingbo Xu
@ 2025-04-16 16:43                     ` Joanne Koong
  2025-04-16 18:05                       ` Bernd Schubert
  0 siblings, 1 reply; 23+ messages in thread
From: Joanne Koong @ 2025-04-16 16:43 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: miklos, akpm, linux-fsdevel, linux-mm, shakeel.butt, david,
	bernd.schubert, ziy, jlayton, kernel-team, Miklos Szeredi

On Tue, Apr 15, 2025 at 6:40 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> On 4/15/25 11:59 PM, Joanne Koong wrote:
> > On Tue, Apr 15, 2025 at 12:49 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >> Hi Joanne,
> >>
> >> Sorry for the late reply...
> >
> > Hi Jingbo,
> >
> > No worries at all.
> >>
> >>
> >> On 4/11/25 12:11 AM, Joanne Koong wrote:
> >>> On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>
> >>>> On 4/10/25 11:07 PM, Joanne Koong wrote:
> >>>>> On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 4/10/25 7:47 AM, Joanne Koong wrote:
> >>>>>>>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Joanne,
> >>>>>>>>
> >>>>>>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
> >>>>>>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
> >>>>>>>>> ("fuse: support writable mmap")), a temp page is allocated for every
> >>>>>>>>> dirty page to be written back, the contents of the dirty page are copied over
> >>>>>>>>> to the temp page, and the temp page gets handed to the server to write back.
> >>>>>>>>>
> >>>>>>>>> This is done so that writeback may be immediately cleared on the dirty page,
> >>>>>>>>> and this in turn is done in order to mitigate the following deadlock scenario
> >>>>>>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
> >>>>>>>>> * single-threaded FUSE server is in the middle of handling a request
> >>>>>>>>>   that needs a memory allocation
> >>>>>>>>> * memory allocation triggers direct reclaim
> >>>>>>>>> * direct reclaim waits on a folio under writeback
> >>>>>>>>> * the FUSE server can't write back the folio since it's stuck in
> >>>>>>>>>   direct reclaim
> >>>>>>>>>
> >>>>>>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> >>>>>>>>> the situations described above, FUSE writeback does not need to use
> >>>>>>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> >>>>>>>>>
> >>>>>>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> >>>>>>>>> and removes the temporary pages + extra copying and the internal rb
> >>>>>>>>> tree.
> >>>>>>>>>
> >>>>>>>>> fio benchmarks --
> >>>>>>>>> (using averages observed from 10 runs, throwing away outliers)
> >>>>>>>>>
> >>>>>>>>> Setup:
> >>>>>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>>>>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>>>>>>>
> >>>>>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>>>>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>>>>>>>
> >>>>>>>>>         bs =  1k          4k            1M
> >>>>>>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> >>>>>>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> >>>>>>>>> % diff        -3%          23%         45%
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>>>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> >>>>>>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hi Jingbo,
> >>>>>>>
> >>>>>>> Thanks for sharing your analysis for this.
> >>>>>>>
> >>>>>>>> Overall this patch LGTM.
> >>>>>>>>
> >>>>>>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
> >>>>>>>> also unneeded then, at least the DIRECT IO routine (i.e.
> >>>>>>>
> >>>>>>> I took a look at fi->writectr and fi->queued_writes and my
> >>>>>>> understanding is that we do still need this. For example, for
> >>>>>>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
> >>>>>>> prevent concurrent writeback or else the setattr request and the
> >>>>>>> writeback request could race which would result in a mismatch between
> >>>>>>> the file's reported size and the actual data written to disk.
> >>>>>>
> >>>>>> I haven't looked into the truncate routine yet.  I will see it later.
> >>>>>>
> >>>>>>>
> >>>>>>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
> >>>>>>>> because after removing the temp page, the DIRECT IO routine has already
> >>>>>>>> been waiting for all inflight WRITE requests, see
> >>>>>>>>
> >>>>>>>> # DIRECT read
> >>>>>>>> generic_file_read_iter
> >>>>>>>>   kiocb_write_and_wait
> >>>>>>>>     filemap_write_and_wait_range
> >>>>>>>
> >>>>>>> Where do you see generic_file_read_iter() getting called for direct io reads?
> >>>>>>
> >>>>>> # DIRECT read
> >>>>>> fuse_file_read_iter
> >>>>>>   fuse_cache_read_iter
> >>>>>>     generic_file_read_iter
> >>>>>>       kiocb_write_and_wait
> >>>>>>        filemap_write_and_wait_range
> >>>>>>       a_ops->direct_IO(),i.e. fuse_direct_IO()
> >>>>>>
> >>>>>
> >>>>> Oh I see, I thought files opened with O_DIRECT automatically call the
> >>>>> .direct_IO handler for reads/writes but you're right, it first goes
> >>>>> through .read_iter / .write_iter handlers, and the .direct_IO handler
> >>>>> only gets invoked through generic_file_read_iter() /
> >>>>> generic_file_direct_write() in mm/filemap.c
> >>>>>
> >>>>> There's two paths for direct io in FUSE:
> >>>>> a) fuse server sets fi->direct_io = true when a file is opened, which
> >>>>> will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
> >>>>> b) fuse server doesn't set fi->direct_io = true, but the client opens
> >>>>> the file with O_DIRECT
> >>>>>
> >>>>> We only go through the stack trace you listed above for the b) case.
> >>>>> For the a) case, we'll hit
> >>>>>
> >>>>>         if (ff->open_flags & FOPEN_DIRECT_IO)
> >>>>>                 return fuse_direct_read_iter(iocb, to);
> >>>>>
> >>>>> and
> >>>>>
> >>>>>         if (ff->open_flags & FOPEN_DIRECT_IO)
> >>>>>                 return fuse_direct_write_iter(iocb, from);
> >>>>>
> >>>>> which will invoke fuse_direct_IO() / fuse_direct_io() without going
> >>>>> through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
> >>>>> kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
> >>>>> above.
> >>>>>
> >>>>> So for the a) case I think we'd still need the fuse_sync_writes() in
> >>>>> case there's still pending writeback.
> >>>>>
> >>>>> Do you agree with this analysis or am I missing something here?
> >>>>
> >>>> Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
> >>>> call filemap_wait_range() or something similar here.
> >>>>
> >>>
> >>> Agreed. Actually, the more I look at this, the more I think we can
> >>> replace all fuse_sync_writes() and get rid of it entirely.
> >>
> >>
> >> I have seen your latest reply that this cleaning up won't be included in
> >> this series, which is okay.
> >>
> >>
> >>> fuse_sync_writes() is called in:
> >>>
> >>> fuse_fsync():
> >>>         /*
> >>>          * Start writeback against all dirty pages of the inode, then
> >>>          * wait for all outstanding writes, before sending the FSYNC
> >>>          * request.
> >>>          */
> >>>         err = file_write_and_wait_range(file, start, end);
> >>>         if (err)
> >>>                 goto out;
> >>>
> >>>         fuse_sync_writes(inode);
> >>>
> >>>         /*
> >>>          * Due to implementation of fuse writeback
> >>>          * file_write_and_wait_range() does not catch errors.
> >>>          * We have to do this directly after fuse_sync_writes()
> >>>          */
> >>>         err = file_check_and_advance_wb_err(file);
> >>>         if (err)
> >>>                 goto out;
> >>>
> >>>
> >>>       We can get rid of the fuse_sync_writes() and
> >>> file_check_and_advance_wb_err() entirely since now without temp pages,
> >>> the file_write_and_wait_range() call actually ensures that writeback
> >>> is completed
> >>>
> >>>
> >>>
> >>> fuse_writeback_range():
> >>>         static int fuse_writeback_range(struct inode *inode, loff_t
> >>> start, loff_t end)
> >>>         {
> >>>                 int err =
> >>> filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
> >>>
> >>>                 if (!err)
> >>>                         fuse_sync_writes(inode);
> >>>
> >>>                 return err;
> >>>         }
> >>>
> >>>
> >>>       We can replace fuse_writeback_range() entirely with
> >>> filemap_write_and_wait_range().
> >>>
> >>>
> >>>
> >>> fuse_direct_io():
> >>>         if (fopen_direct_io && fc->direct_io_allow_mmap) {
> >>>                 res = filemap_write_and_wait_range(mapping, pos, pos +
> >>> count - 1);
> >>>                 if (res) {
> >>>                         fuse_io_free(ia);
> >>>                         return res;
> >>>                 }
> >>>         }
> >>>         if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
> >>> count - 1))) {
> >>>                 if (!write)
> >>>                         inode_lock(inode);
> >>>                 fuse_sync_writes(inode);
> >>>                 if (!write)
> >>>                         inode_unlock(inode);
> >>>         }
> >>>
> >>>
> >>>        I think this can just replaced with
> >>>                 if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
> >>>                         res = filemap_write_and_wait_range(mapping,
> >>> pos, pos + count - 1);
> >>>                         if (res) {
> >>>                                 fuse_io_free(ia);
> >>>                                 return res;
> >>>                         }
> >>>                 }
> >>
> >> Alright. But I would prefer doing this filemap_write_and_wait_range() in
> >> fuse_direct_write_iter() rather than fuse_direct_io() if possible.
> >>
> >>>        since for the !fopen_direct_io case, it will already go through
> >>> filemap_write_and_wait_range(), as you mentioned in your previous
> >>> message. I think this also fixes a bug (?) in the original code - in
> >>> the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
> >>> still need to write out dirty pages first, which we don't currently
> >>> do.
> >>
> >> Nope.  In case of fopen_direct_io && !fc->direct_io_allow_mmap, there
> >> won't be any page cache at all, right?
> >>
> >
> > Isn't there still a page cache if the file was previously opened
> > without direct io and then the client opens another handle to that
> > file with direct io? In that case, the pages could still be dirty in
> > the page cache and would need to be written back first, no?
>
> Do you mean that when the inode is firstly opened, FOPEN_DIRECT_IO is
> not set by the FUSE server, while it is secondly opened, the flag is set?
>
> Though the behavior of the FUSE daemon is quite confusing in this case,
> it is completely possible in real life.  So yes we'd better add
> filemap_write_and_wait_range() unconditionally in fopen_direct_io case.
>

I think this behavior on the server side is pretty common. From what
I've seen on most servers, the server when handling the open sets
fi->direct_io depending on if the client opens with O_DIRECT, eg

        if (fi->flags & O_DIRECT)
                fi->direct_io = 1;

If a client opens a file without O_DIRECT and then opens the same file
with O_DIRECT, then we run into this case. Though I'm not sure how
common it generally is for clients to do this.

Thanks,
Joanne
>
> --
> Thanks,
> Jingbo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
  2025-04-16 16:43                     ` Joanne Koong
@ 2025-04-16 18:05                       ` Bernd Schubert
  0 siblings, 0 replies; 23+ messages in thread
From: Bernd Schubert @ 2025-04-16 18:05 UTC (permalink / raw)
  To: Joanne Koong, Jingbo Xu
  Cc: miklos, akpm, linux-fsdevel, linux-mm, shakeel.butt, david, ziy,
	jlayton, kernel-team, Miklos Szeredi



On 4/16/25 18:43, Joanne Koong wrote:
> On Tue, Apr 15, 2025 at 6:40 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> On 4/15/25 11:59 PM, Joanne Koong wrote:
>>> On Tue, Apr 15, 2025 at 12:49 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>> Hi Joanne,
>>>>
>>>> Sorry for the late reply...
>>>
>>> Hi Jingbo,
>>>
>>> No worries at all.
>>>>
>>>>
>>>> On 4/11/25 12:11 AM, Joanne Koong wrote:
>>>>> On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>
>>>>>> On 4/10/25 11:07 PM, Joanne Koong wrote:
>>>>>>> On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 4/10/25 7:47 AM, Joanne Koong wrote:
>>>>>>>>>    On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Joanne,
>>>>>>>>>>
>>>>>>>>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
>>>>>>>>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
>>>>>>>>>>> ("fuse: support writable mmap")), a temp page is allocated for every
>>>>>>>>>>> dirty page to be written back, the contents of the dirty page are copied over
>>>>>>>>>>> to the temp page, and the temp page gets handed to the server to write back.
>>>>>>>>>>>
>>>>>>>>>>> This is done so that writeback may be immediately cleared on the dirty page,
>>>>>>>>>>> and this in turn is done in order to mitigate the following deadlock scenario
>>>>>>>>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
>>>>>>>>>>> * single-threaded FUSE server is in the middle of handling a request
>>>>>>>>>>>    that needs a memory allocation
>>>>>>>>>>> * memory allocation triggers direct reclaim
>>>>>>>>>>> * direct reclaim waits on a folio under writeback
>>>>>>>>>>> * the FUSE server can't write back the folio since it's stuck in
>>>>>>>>>>>    direct reclaim
>>>>>>>>>>>
>>>>>>>>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
>>>>>>>>>>> the situations described above, FUSE writeback does not need to use
>>>>>>>>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
>>>>>>>>>>>
>>>>>>>>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
>>>>>>>>>>> and removes the temporary pages + extra copying and the internal rb
>>>>>>>>>>> tree.
>>>>>>>>>>>
>>>>>>>>>>> fio benchmarks --
>>>>>>>>>>> (using averages observed from 10 runs, throwing away outliers)
>>>>>>>>>>>
>>>>>>>>>>> Setup:
>>>>>>>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>>>>>>>>>>   ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>>>>>>>>>
>>>>>>>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>>>>>>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>>>>>>>>>
>>>>>>>>>>>          bs =  1k          4k            1M
>>>>>>>>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
>>>>>>>>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
>>>>>>>>>>> % diff        -3%          23%         45%
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>>>>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>>>>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Jingbo,
>>>>>>>>>
>>>>>>>>> Thanks for sharing your analysis for this.
>>>>>>>>>
>>>>>>>>>> Overall this patch LGTM.
>>>>>>>>>>
>>>>>>>>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
>>>>>>>>>> also unneeded then, at least the DIRECT IO routine (i.e.
>>>>>>>>>
>>>>>>>>> I took a look at fi->writectr and fi->queued_writes and my
>>>>>>>>> understanding is that we do still need this. For example, for
>>>>>>>>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
>>>>>>>>> prevent concurrent writeback or else the setattr request and the
>>>>>>>>> writeback request could race which would result in a mismatch between
>>>>>>>>> the file's reported size and the actual data written to disk.
>>>>>>>>
>>>>>>>> I haven't looked into the truncate routine yet.  I will see it later.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
>>>>>>>>>> because after removing the temp page, the DIRECT IO routine has already
>>>>>>>>>> been waiting for all inflight WRITE requests, see
>>>>>>>>>>
>>>>>>>>>> # DIRECT read
>>>>>>>>>> generic_file_read_iter
>>>>>>>>>>    kiocb_write_and_wait
>>>>>>>>>>      filemap_write_and_wait_range
>>>>>>>>>
>>>>>>>>> Where do you see generic_file_read_iter() getting called for direct io reads?
>>>>>>>>
>>>>>>>> # DIRECT read
>>>>>>>> fuse_file_read_iter
>>>>>>>>    fuse_cache_read_iter
>>>>>>>>      generic_file_read_iter
>>>>>>>>        kiocb_write_and_wait
>>>>>>>>         filemap_write_and_wait_range
>>>>>>>>        a_ops->direct_IO(),i.e. fuse_direct_IO()
>>>>>>>>
>>>>>>>
>>>>>>> Oh I see, I thought files opened with O_DIRECT automatically call the
>>>>>>> .direct_IO handler for reads/writes but you're right, it first goes
>>>>>>> through .read_iter / .write_iter handlers, and the .direct_IO handler
>>>>>>> only gets invoked through generic_file_read_iter() /
>>>>>>> generic_file_direct_write() in mm/filemap.c
>>>>>>>
>>>>>>> There's two paths for direct io in FUSE:
>>>>>>> a) fuse server sets fi->direct_io = true when a file is opened, which
>>>>>>> will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
>>>>>>> b) fuse server doesn't set fi->direct_io = true, but the client opens
>>>>>>> the file with O_DIRECT
>>>>>>>
>>>>>>> We only go through the stack trace you listed above for the b) case.
>>>>>>> For the a) case, we'll hit
>>>>>>>
>>>>>>>          if (ff->open_flags & FOPEN_DIRECT_IO)
>>>>>>>                  return fuse_direct_read_iter(iocb, to);
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>>          if (ff->open_flags & FOPEN_DIRECT_IO)
>>>>>>>                  return fuse_direct_write_iter(iocb, from);
>>>>>>>
>>>>>>> which will invoke fuse_direct_IO() / fuse_direct_io() without going
>>>>>>> through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
>>>>>>> kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
>>>>>>> above.
>>>>>>>
>>>>>>> So for the a) case I think we'd still need the fuse_sync_writes() in
>>>>>>> case there's still pending writeback.
>>>>>>>
>>>>>>> Do you agree with this analysis or am I missing something here?
>>>>>>
>>>>>> Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
>>>>>> call filemap_wait_range() or something similar here.
>>>>>>
>>>>>
>>>>> Agreed. Actually, the more I look at this, the more I think we can
>>>>> replace all fuse_sync_writes() and get rid of it entirely.
>>>>
>>>>
>>>> I have seen your latest reply that this cleaning up won't be included in
>>>> this series, which is okay.
>>>>
>>>>
>>>>> fuse_sync_writes() is called in:
>>>>>
>>>>> fuse_fsync():
>>>>>          /*
>>>>>           * Start writeback against all dirty pages of the inode, then
>>>>>           * wait for all outstanding writes, before sending the FSYNC
>>>>>           * request.
>>>>>           */
>>>>>          err = file_write_and_wait_range(file, start, end);
>>>>>          if (err)
>>>>>                  goto out;
>>>>>
>>>>>          fuse_sync_writes(inode);
>>>>>
>>>>>          /*
>>>>>           * Due to implementation of fuse writeback
>>>>>           * file_write_and_wait_range() does not catch errors.
>>>>>           * We have to do this directly after fuse_sync_writes()
>>>>>           */
>>>>>          err = file_check_and_advance_wb_err(file);
>>>>>          if (err)
>>>>>                  goto out;
>>>>>
>>>>>
>>>>>        We can get rid of the fuse_sync_writes() and
>>>>> file_check_and_advance_wb_err() entirely since now without temp pages,
>>>>> the file_write_and_wait_range() call actually ensures that writeback
>>>>> is completed
>>>>>
>>>>>
>>>>>
>>>>> fuse_writeback_range():
>>>>>          static int fuse_writeback_range(struct inode *inode, loff_t
>>>>> start, loff_t end)
>>>>>          {
>>>>>                  int err =
>>>>> filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
>>>>>
>>>>>                  if (!err)
>>>>>                          fuse_sync_writes(inode);
>>>>>
>>>>>                  return err;
>>>>>          }
>>>>>
>>>>>
>>>>>        We can replace fuse_writeback_range() entirely with
>>>>> filemap_write_and_wait_range().
>>>>>
>>>>>
>>>>>
>>>>> fuse_direct_io():
>>>>>          if (fopen_direct_io && fc->direct_io_allow_mmap) {
>>>>>                  res = filemap_write_and_wait_range(mapping, pos, pos +
>>>>> count - 1);
>>>>>                  if (res) {
>>>>>                          fuse_io_free(ia);
>>>>>                          return res;
>>>>>                  }
>>>>>          }
>>>>>          if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
>>>>> count - 1))) {
>>>>>                  if (!write)
>>>>>                          inode_lock(inode);
>>>>>                  fuse_sync_writes(inode);
>>>>>                  if (!write)
>>>>>                          inode_unlock(inode);
>>>>>          }
>>>>>
>>>>>
>>>>>         I think this can just replaced with
>>>>>                  if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
>>>>>                          res = filemap_write_and_wait_range(mapping,
>>>>> pos, pos + count - 1);
>>>>>                          if (res) {
>>>>>                                  fuse_io_free(ia);
>>>>>                                  return res;
>>>>>                          }
>>>>>                  }
>>>>
>>>> Alright. But I would prefer doing this filemap_write_and_wait_range() in
>>>> fuse_direct_write_iter() rather than fuse_direct_io() if possible.
>>>>
>>>>>         since for the !fopen_direct_io case, it will already go through
>>>>> filemap_write_and_wait_range(), as you mentioned in your previous
>>>>> message. I think this also fixes a bug (?) in the original code - in
>>>>> the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
>>>>> still need to write out dirty pages first, which we don't currently
>>>>> do.
>>>>
>>>> Nope.  In case of fopen_direct_io && !fc->direct_io_allow_mmap, there
>>>> won't be any page cache at all, right?
>>>>
>>>
>>> Isn't there still a page cache if the file was previously opened
>>> without direct io and then the client opens another handle to that
>>> file with direct io? In that case, the pages could still be dirty in
>>> the page cache and would need to be written back first, no?
>>
>> Do you mean that when the inode is firstly opened, FOPEN_DIRECT_IO is
>> not set by the FUSE server, while it is secondly opened, the flag is set?
>>
>> Though the behavior of the FUSE daemon is quite confusing in this case,
>> it is completely possible in real life.  So yes we'd better add
>> filemap_write_and_wait_range() unconditionally in fopen_direct_io case.
>>
> 
> I think this behavior on the server side is pretty common. From what
> I've seen on most servers, the server when handling the open sets
> fi->direct_io depending on if the client opens with O_DIRECT, eg
> 
>          if (fi->flags & O_DIRECT)
>                  fi->direct_io = 1;

One should do that, actually, to get a shared lock on the inode.
With the additional

fi.parallel_direct_writes = 1;

> 
> If a client opens a file without O_DIRECT and then opens the same file
> with O_DIRECT, then we run into this case. Though I'm not sure how
> common it generally is for clients to do this.
> 


I guess the common case is

application1 does mmap
application2 does normal read/write

fuse-server might set  fi->direct_io = 1 on all opens, with the
additional

fuse_set_feature_flag(connp, FUSE_CAP_DIRECT_IO_ALLOW_MMAP);

Probably will only come to it tomorrow, but will review,
especially to check about cached/uncached io-modes.



Thanks,
Bernd


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-04-16 18:05 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-04 18:14 [PATCH v7 0/3] fuse: remove temp page copies in writeback Joanne Koong
2025-04-04 18:14 ` [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
2025-04-04 19:13   ` David Hildenbrand
2025-04-04 20:09     ` Joanne Koong
2025-04-04 20:13       ` David Hildenbrand
2025-04-09 22:05         ` Shakeel Butt
2025-04-09 23:48           ` Joanne Koong
2025-04-04 18:14 ` [PATCH v7 2/3] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
2025-04-04 18:14 ` [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2025-04-09  2:43   ` Jingbo Xu
2025-04-09 23:47     ` Joanne Koong
2025-04-10  2:12       ` Jingbo Xu
2025-04-10 15:07         ` Joanne Koong
2025-04-10 15:11           ` Jingbo Xu
2025-04-10 16:11             ` Joanne Koong
2025-04-14 20:24               ` Joanne Koong
2025-04-15  7:49               ` Jingbo Xu
2025-04-15 15:59                 ` Joanne Koong
2025-04-16  1:40                   ` Jingbo Xu
2025-04-16 16:43                     ` Joanne Koong
2025-04-16 18:05                       ` Bernd Schubert
2025-04-14 16:21 ` [PATCH v7 0/3] fuse: remove temp page copies in writeback Jeff Layton
2025-04-14 20:28   ` Joanne Koong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).