linux-mm.kvack.org archive mirror
* [PATCH v6 0/5] fuse: remove temp page copies in writeback
@ 2024-11-22 23:23 Joanne Koong
  2024-11-22 23:23 ` [PATCH v6 1/5] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
                   ` (6 more replies)
  0 siblings, 7 replies; 124+ messages in thread
From: Joanne Koong @ 2024-11-22 23:23 UTC (permalink / raw)
  To: miklos, linux-fsdevel
  Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm,
	kernel-team

The purpose of this patchset is to help make writeback-cache write
performance in FUSE filesystems as fast as possible.

In the current FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap")), a temp page is allocated for every dirty
page to be written back, the contents of the dirty page are copied over to the
temp page, and the temp page gets handed to the server to write back. This is
done so that writeback may be immediately cleared on the dirty page, and this 
in turn is done for two reasons:
a) in order to mitigate the following deadlock scenario that may arise if
reclaim waits on writeback on the dirty page to complete (more details can be
found in this thread [1]):
* single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
  direct reclaim
b) in order to unblock internal (eg sync, page compaction) waits on writeback
without needing the server to complete writing back to disk, which may take
an indeterminate amount of time.

Allocating and copying dirty pages to temp pages is the biggest performance
bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
altogether (which will also allow us to get rid of the internal FUSE rb tree
that is needed to keep track of writeback status on the temp pages).
Benchmarks show approximately a 20% improvement in throughput for 4k
block-size writes and a 45% improvement for 1M block-size writes.

With removing the temp page, writeback state is now only cleared on the dirty
page after the server has written it back to disk. This may take an
indeterminate amount of time. There is also the possibility of malicious or
well-intentioned but buggy servers where writeback may, in the worst case,
never complete. This means that any
folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to
be carefully audited.

In particular, these are the cases that need to be accounted for:
* potentially deadlocking in reclaim, as mentioned above
* potentially stalling sync(2)
* potentially stalling page migration / compaction

This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
filesystems may set on their inode mappings to indicate that writeback
operations may take an indeterminate amount of time to complete. FUSE will set
this flag on its mappings. This patchset adds checks to the critical parts of
reclaim, sync, and page migration logic where writeback may be waited on.

Please note the following:
* For sync(2), waiting on writeback will be skipped for FUSE, but this has no
  effect on existing behavior. Dirty FUSE pages are already not guaranteed to
  be written to disk by the time sync(2) returns (eg writeback is cleared on
  the dirty page but the server may not have written out the temp page to disk
  yet). If the caller wishes to ensure the data has actually been synced to
  disk, they should use fsync(2)/fdatasync(2) instead.
* AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
  waited on when in writeback. There are some cases where the wait is
  desirable. For example, for the sync_file_range() syscall, it is fine to
  wait on the writeback since the caller passes in a fd for the operation.

[1]
https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com/

Changelog
---------
v5:
https://lore.kernel.org/linux-fsdevel/20241115224459.427610-1-joannelkoong@gmail.com/
Changes from v5 -> v6:
* Add Shakeel and Jingbo's reviewed-bys 
* Move folio_end_writeback() to fuse_writepage_finish() (Jingbo)
* Embed fuse_writepage_finish_stat() logic inline (Jingbo)
* Remove node_stat NR_WRITEBACK inc/sub (Jingbo)

v4:
https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@gmail.com/
Changes from v4 -> v5:
* AS_WRITEBACK_MAY_BLOCK -> AS_WRITEBACK_INDETERMINATE (Shakeel)
* Drop memory hotplug patch (David and Shakeel)
* Remove some more unnecessary writeback waits in fuse code (Jingbo)
* Make commit message for reclaim patch more concise - drop part about
  deadlock and just focus on how it may stall waits

v3:
https://lore.kernel.org/linux-fsdevel/20241107191618.2011146-1-joannelkoong@gmail.com/
Changes from v3 -> v4:
* Use filemap_fdatawait_range() instead of filemap_range_has_writeback() in
  readahead

v2:
https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/
Changes from v2 -> v3:
* Account for sync and page migration cases as well (Miklos)
* Change AS_NO_WRITEBACK_RECLAIM to the more generic AS_WRITEBACK_MAY_BLOCK
* For fuse inodes, set mapping_writeback_may_block only if fc->writeback_cache
  is enabled

v1:
https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/T/#t
Changes from v1 -> v2:
* Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
* Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)

Joanne Koong (5):
  mm: add AS_WRITEBACK_INDETERMINATE mapping flag
  mm: skip reclaiming folios in legacy memcg writeback indeterminate
    contexts
  fs/writeback: in wait_sb_inodes(), skip wait for
    AS_WRITEBACK_INDETERMINATE mappings
  mm/migrate: skip migrating folios under writeback with
    AS_WRITEBACK_INDETERMINATE mappings
  fuse: remove tmp folio for writebacks and internal rb tree

 fs/fs-writeback.c       |   3 +
 fs/fuse/file.c          | 360 ++++------------------------------------
 fs/fuse/fuse_i.h        |   3 -
 include/linux/pagemap.h |  11 ++
 mm/migrate.c            |   5 +-
 mm/vmscan.c             |  10 +-
 6 files changed, 53 insertions(+), 339 deletions(-)

-- 
2.43.5




* [PATCH v6 1/5] mm: add AS_WRITEBACK_INDETERMINATE mapping flag
  2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
@ 2024-11-22 23:23 ` Joanne Koong
  2024-11-22 23:23 ` [PATCH v6 2/5] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 124+ messages in thread
From: Joanne Koong @ 2024-11-22 23:23 UTC (permalink / raw)
  To: miklos, linux-fsdevel
  Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm,
	kernel-team

Add a new mapping flag AS_WRITEBACK_INDETERMINATE which filesystems may
set to indicate that writing back to disk may take an indeterminate
amount of time to complete. Extra caution should be taken when waiting
on writeback for folios belonging to mappings where this flag is set.
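
As a rough userspace illustration of the set/test helper pair added below
(not kernel code): struct mapping_model and the model_* names are made up
for this sketch, and a plain OR stands in for the kernel's atomic set_bit().
Only the bit number is taken from the patch.

```c
#include <stdbool.h>

/* Bit number copied from enum mapping_flags in this patch. */
#define MODEL_AS_WRITEBACK_INDETERMINATE 9

/* Hypothetical stand-in for struct address_space: only the flags word. */
struct mapping_model {
	unsigned long flags;
};

/* Models mapping_set_writeback_indeterminate(); non-atomic on purpose. */
static inline void model_set_writeback_indeterminate(struct mapping_model *m)
{
	m->flags |= 1UL << MODEL_AS_WRITEBACK_INDETERMINATE;
}

/* Models mapping_writeback_indeterminate(). */
static inline bool model_writeback_indeterminate(const struct mapping_model *m)
{
	return (m->flags >> MODEL_AS_WRITEBACK_INDETERMINATE) & 1UL;
}
```

A filesystem would set the flag once when it sets up the inode mapping;
callers that are about to wait on writeback test it first.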

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/pagemap.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301..fcf7d4dd7e2b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -210,6 +210,7 @@ enum mapping_flags {
 	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
 	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
+	AS_WRITEBACK_INDETERMINATE = 9, /* Use caution when waiting on writeback */
 	/* Bits 16-25 are used for FOLIO_ORDER */
 	AS_FOLIO_ORDER_BITS = 5,
 	AS_FOLIO_ORDER_MIN = 16,
@@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
 	return test_bit(AS_INACCESSIBLE, &mapping->flags);
 }
 
+static inline void mapping_set_writeback_indeterminate(struct address_space *mapping)
+{
+	set_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
+}
+
+static inline bool mapping_writeback_indeterminate(struct address_space *mapping)
+{
+	return test_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return mapping->gfp_mask;
-- 
2.43.5




* [PATCH v6 2/5] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts
  2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
  2024-11-22 23:23 ` [PATCH v6 1/5] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
@ 2024-11-22 23:23 ` Joanne Koong
  2024-11-22 23:23 ` [PATCH v6 3/5] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_INDETERMINATE mappings Joanne Koong
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 124+ messages in thread
From: Joanne Koong @ 2024-11-22 23:23 UTC (permalink / raw)
  To: miklos, linux-fsdevel
  Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm,
	kernel-team

Currently in shrink_folio_list(), reclaim for folios under writeback
falls into 3 different cases:
1) Reclaim is encountering an excessive number of folios under
   writeback and this folio has both the writeback and reclaim flags
   set
2) Dirty throttling is enabled (this happens if reclaim through cgroup
   is not enabled, if reclaim through cgroupv2 memcg is enabled, or
   if reclaim is on the root cgroup), or if the folio is not marked for
   immediate reclaim, or if the caller does not have __GFP_FS (or
   __GFP_IO if it's going to swap) set
3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
   set and the caller has __GFP_FS (or __GFP_IO if swap) set

In cases 1) and 2), we activate the folio and skip reclaiming it while
in case 3), we wait for writeback to finish on the folio and then try
to reclaim the folio again. In case 3), we wait on writeback because
cgroupv1 does not have dirty folio throttling; the wait is therefore a
mitigation against the case where there are too many folios in writeback
with nothing else to reclaim.

For filesystems where writeback may take an indeterminate amount of time
to write to disk, this has the possibility of stalling reclaim.

In this commit, if legacy memcg encounters a folio with the reclaim flag
set (i.e. case 3) and the folio belongs to a mapping that has the
AS_WRITEBACK_INDETERMINATE flag set, the folio will be activated and
reclaim of it skipped (i.e. it gets the behavior of case 2) instead.
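
The resulting decision for a folio under writeback can be sketched as a pure
function. This is a userspace model under stated assumptions, not the code
in shrink_folio_list(): the struct, its field names and the enum are
hypothetical, and "excessive_writeback" stands in for whatever condition
reclaim uses to detect too many folios under writeback.

```c
#include <stdbool.h>

enum reclaim_action {
	RECLAIM_ACTIVATE_IMMEDIATE,	/* case 1 */
	RECLAIM_ACTIVATE_AND_SKIP,	/* case 2 */
	RECLAIM_WAIT_ON_WRITEBACK,	/* case 3 */
};

/* Hypothetical snapshot of the inputs the three cases look at. */
struct wb_folio_model {
	bool is_kswapd;			/* current_is_kswapd() */
	bool excessive_writeback;	/* assumed "too much writeback" signal */
	bool reclaim_flag;		/* folio_test_reclaim() */
	bool throttling_sane;		/* writeback_throttling_sane() */
	bool may_enter_fs;		/* may_enter_fs() */
	bool mapping_indeterminate;	/* mapping has AS_WRITEBACK_INDETERMINATE */
};

static enum reclaim_action classify_writeback_folio(const struct wb_folio_model *f)
{
	if (f->is_kswapd && f->reclaim_flag && f->excessive_writeback)
		return RECLAIM_ACTIVATE_IMMEDIATE;

	/* The last operand is the check this patch adds to case 2. */
	if (f->throttling_sane || !f->reclaim_flag || !f->may_enter_fs ||
	    f->mapping_indeterminate)
		return RECLAIM_ACTIVATE_AND_SKIP;

	return RECLAIM_WAIT_ON_WRITEBACK;
}
```

In this model a legacy memcg folio with the reclaim flag set lands in case 3
only when its mapping is not marked indeterminate.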

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/vmscan.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749cdc110c74..37ce6b6dac06 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,8 +1129,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 * 2) Global or new memcg reclaim encounters a folio that is
 		 *    not marked for immediate reclaim, or the caller does not
 		 *    have __GFP_FS (or __GFP_IO if it's simply going to swap,
-		 *    not to fs). In this case mark the folio for immediate
-		 *    reclaim and continue scanning.
+		 *    not to fs), or the writeback may take an indeterminate
+		 *    amount of time to complete. In this case mark the folio
+		 *    for immediate reclaim and continue scanning.
 		 *
 		 *    Require may_enter_fs() because we would wait on fs, which
 		 *    may not have submitted I/O yet. And the loop driver might
@@ -1155,6 +1156,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 * takes to write them to disk.
 		 */
 		if (folio_test_writeback(folio)) {
+			mapping = folio_mapping(folio);
+
 			/* Case 1 above */
 			if (current_is_kswapd() &&
 			    folio_test_reclaim(folio) &&
@@ -1165,7 +1168,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			/* Case 2 above */
 			} else if (writeback_throttling_sane(sc) ||
 			    !folio_test_reclaim(folio) ||
-			    !may_enter_fs(folio, sc->gfp_mask)) {
+			    !may_enter_fs(folio, sc->gfp_mask) ||
+			    (mapping && mapping_writeback_indeterminate(mapping))) {
 				/*
 				 * This is slightly racy -
 				 * folio_end_writeback() might have
-- 
2.43.5




* [PATCH v6 3/5] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_INDETERMINATE mappings
  2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
  2024-11-22 23:23 ` [PATCH v6 1/5] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
  2024-11-22 23:23 ` [PATCH v6 2/5] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
@ 2024-11-22 23:23 ` Joanne Koong
  2024-11-22 23:23 ` [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with " Joanne Koong
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 124+ messages in thread
From: Joanne Koong @ 2024-11-22 23:23 UTC (permalink / raw)
  To: miklos, linux-fsdevel
  Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm,
	kernel-team

For filesystems with the AS_WRITEBACK_INDETERMINATE flag set, writeback
operations may take an indeterminate time to complete. For example, writing
data back to disk in FUSE filesystems depends on the userspace server
successfully completing writeback.

In this commit, wait_sb_inodes() skips waiting on writeback if the
inode's mapping has AS_WRITEBACK_INDETERMINATE set, else sync(2) may take an
indeterminate amount of time to complete.

If the caller wishes to ensure the data for a mapping with the
AS_WRITEBACK_INDETERMINATE flag set has actually been written back to disk,
they should use fsync(2)/fdatasync(2) instead.
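
A minimal userspace sketch of the filtering this adds to the wait loop,
assuming a simplified inode record. struct inode_model and
count_waited_inodes() are hypothetical names for this illustration; the real
loop also handles locking and inode references, which are left out here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-inode state relevant to the wait decision. */
struct inode_model {
	bool tagged_writeback;		/* mapping has pages tagged for writeback */
	bool writeback_indeterminate;	/* AS_WRITEBACK_INDETERMINATE is set */
};

/* Count the inodes a sync(2)-style pass would actually wait on. */
static size_t count_waited_inodes(const struct inode_model *inodes, size_t n)
{
	size_t waited = 0;

	for (size_t i = 0; i < n; i++) {
		if (!inodes[i].tagged_writeback)
			continue;
		/* New in this patch: do not wait on indeterminate mappings. */
		if (inodes[i].writeback_indeterminate)
			continue;
		waited++;
	}
	return waited;
}
```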

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/fs-writeback.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d8bec3c1bb1f..ad192db17ce4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2659,6 +2659,9 @@ static void wait_sb_inodes(struct super_block *sb)
 		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
 			continue;
 
+		if (mapping_writeback_indeterminate(mapping))
+			continue;
+
 		spin_unlock_irq(&sb->s_inode_wblist_lock);
 
 		spin_lock(&inode->i_lock);
-- 
2.43.5




* [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
                   ` (2 preceding siblings ...)
  2024-11-22 23:23 ` [PATCH v6 3/5] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_INDETERMINATE mappings Joanne Koong
@ 2024-11-22 23:23 ` Joanne Koong
  2024-12-19 13:05   ` David Hildenbrand
  2024-11-22 23:23 ` [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2024-11-22 23:23 UTC (permalink / raw)
  To: miklos, linux-fsdevel
  Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm,
	kernel-team

For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
writeback may take an indeterminate amount of time to complete, and
waits may get stuck.
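
The mode check for a folio under writeback can be modelled in userspace
roughly as follows. The enum and function names are hypothetical; only the
shape of the switch mirrors the hunk below, where a return value of 0 means
"may go on to wait for writeback" and -EBUSY means "skip this folio".

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's migration modes. */
enum migrate_mode_model {
	MODEL_MIGRATE_ASYNC,
	MODEL_MIGRATE_SYNC_LIGHT,
	MODEL_MIGRATE_SYNC,
};

/* Decide what to do with a source folio that is under writeback. */
static int writeback_migrate_decision(enum migrate_mode_model mode,
				      bool has_mapping, bool indeterminate)
{
	switch (mode) {
	case MODEL_MIGRATE_SYNC:
		if (!has_mapping || !indeterminate)
			return 0;
		/* fall through */
	default:
		return -EBUSY;
	}
}
```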

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/migrate.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index df91248755e4..fe73284e5246 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
 		 */
 		switch (mode) {
 		case MIGRATE_SYNC:
-			break;
+			if (!src->mapping ||
+			    !mapping_writeback_indeterminate(src->mapping))
+				break;
+			fallthrough;
 		default:
 			rc = -EBUSY;
 			goto out;
-- 
2.43.5




* [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree
  2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
                   ` (3 preceding siblings ...)
  2024-11-22 23:23 ` [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with " Joanne Koong
@ 2024-11-22 23:23 ` Joanne Koong
  2024-11-25  9:46   ` Jingbo Xu
  2024-12-12 21:55 ` [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
  2024-12-13 11:52 ` Miklos Szeredi
  6 siblings, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2024-11-22 23:23 UTC (permalink / raw)
  To: miklos, linux-fsdevel
  Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm,
	kernel-team

In the current FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap")), a temp page is allocated for every
dirty page to be written back, the contents of the dirty page are copied over
to the temp page, and the temp page gets handed to the server to write back.

This is done so that writeback may be immediately cleared on the dirty page,
and this in turn is done for two reasons:
a) in order to mitigate the following deadlock scenario that may arise
if reclaim waits on writeback on the dirty page to complete:
* single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
  direct reclaim
b) in order to unblock internal (eg sync, page compaction) waits on
writeback without needing the server to complete writing back to disk,
which may take an indeterminate amount of time.

With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
the situations described above, FUSE writeback does not need to use
temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.

This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
and removes the temporary pages + extra copying and the internal rb
tree.

fio benchmarks --
(using averages observed from 10 runs, throwing away outliers)

Setup:
sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
 ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount

fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
--numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount

bs           1k             4k             1M
Before    351 MiB/s     1818 MiB/s     1851 MiB/s
After     341 MiB/s     2246 MiB/s     2685 MiB/s
% diff       -3%           23%            45%
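
For reference, the "% diff" row is the relative change
(After - Before) / Before. A small helper, not part of the patch, that
recomputes it from the averages above; percent_change() is a name made up
for this illustration. The recomputed values agree with the row above to
within about a percentage point.

```c
/* Relative change of `after` versus `before`, in percent. */
static double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, percent_change(1851.0, 2685.0) gives the 1M column.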

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 fs/fuse/file.c   | 360 ++++-------------------------------------------
 fs/fuse/fuse_i.h |   3 -
 2 files changed, 28 insertions(+), 335 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 88d0946b5bc9..1970d1a699a6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
 
 struct fuse_writepage_args {
 	struct fuse_io_args ia;
-	struct rb_node writepages_entry;
 	struct list_head queue_entry;
-	struct fuse_writepage_args *next;
 	struct inode *inode;
 	struct fuse_sync_bucket *bucket;
 };
 
-static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
-					    pgoff_t idx_from, pgoff_t idx_to)
-{
-	struct rb_node *n;
-
-	n = fi->writepages.rb_node;
-
-	while (n) {
-		struct fuse_writepage_args *wpa;
-		pgoff_t curr_index;
-
-		wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry);
-		WARN_ON(get_fuse_inode(wpa->inode) != fi);
-		curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT;
-		if (idx_from >= curr_index + wpa->ia.ap.num_folios)
-			n = n->rb_right;
-		else if (idx_to < curr_index)
-			n = n->rb_left;
-		else
-			return wpa;
-	}
-	return NULL;
-}
-
-/*
- * Check if any page in a range is under writeback
- */
-static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
-				   pgoff_t idx_to)
-{
-	struct fuse_inode *fi = get_fuse_inode(inode);
-	bool found;
-
-	if (RB_EMPTY_ROOT(&fi->writepages))
-		return false;
-
-	spin_lock(&fi->lock);
-	found = fuse_find_writeback(fi, idx_from, idx_to);
-	spin_unlock(&fi->lock);
-
-	return found;
-}
-
-static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
-{
-	return fuse_range_is_writeback(inode, index, index);
-}
-
-/*
- * Wait for page writeback to be completed.
- *
- * Since fuse doesn't rely on the VM writeback tracking, this has to
- * use some other means.
- */
-static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
-{
-	struct fuse_inode *fi = get_fuse_inode(inode);
-
-	wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
-}
-
-static inline bool fuse_folio_is_writeback(struct inode *inode,
-					   struct folio *folio)
-{
-	pgoff_t last = folio_next_index(folio) - 1;
-	return fuse_range_is_writeback(inode, folio_index(folio), last);
-}
-
-static void fuse_wait_on_folio_writeback(struct inode *inode,
-					 struct folio *folio)
-{
-	struct fuse_inode *fi = get_fuse_inode(inode);
-
-	wait_event(fi->page_waitq, !fuse_folio_is_writeback(inode, folio));
-}
-
 /*
  * Wait for all pending writepages on the inode to finish.
  *
@@ -886,13 +808,6 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
 	ssize_t res;
 	u64 attr_ver;
 
-	/*
-	 * With the temporary pages that are used to complete writeback, we can
-	 * have writeback that extends beyond the lifetime of the folio.  So
-	 * make sure we read a properly synced folio.
-	 */
-	fuse_wait_on_folio_writeback(inode, folio);
-
 	attr_ver = fuse_get_attr_version(fm->fc);
 
 	/* Don't overflow end offset */
@@ -1003,17 +918,12 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
 static void fuse_readahead(struct readahead_control *rac)
 {
 	struct inode *inode = rac->mapping->host;
-	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	unsigned int max_pages, nr_pages;
-	pgoff_t first = readahead_index(rac);
-	pgoff_t last = first + readahead_count(rac) - 1;
 
 	if (fuse_is_bad(inode))
 		return;
 
-	wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
-
 	max_pages = min_t(unsigned int, fc->max_pages,
 			fc->max_read / PAGE_SIZE);
 
@@ -1172,7 +1082,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 	int err;
 
 	for (i = 0; i < ap->num_folios; i++)
-		fuse_wait_on_folio_writeback(inode, ap->folios[i]);
+		folio_wait_writeback(ap->folios[i]);
 
 	fuse_write_args_fill(ia, ff, pos, count);
 	ia->write.in.flags = fuse_write_flags(iocb);
@@ -1622,7 +1532,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 			return res;
 		}
 	}
-	if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
+	if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
 		if (!write)
 			inode_lock(inode);
 		fuse_sync_writes(inode);
@@ -1819,38 +1729,34 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
 static void fuse_writepage_free(struct fuse_writepage_args *wpa)
 {
 	struct fuse_args_pages *ap = &wpa->ia.ap;
-	int i;
 
 	if (wpa->bucket)
 		fuse_sync_bucket_dec(wpa->bucket);
 
-	for (i = 0; i < ap->num_folios; i++)
-		folio_put(ap->folios[i]);
-
 	fuse_file_put(wpa->ia.ff, false);
 
 	kfree(ap->folios);
 	kfree(wpa);
 }
 
-static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
-{
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
-
-	dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-	node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
-	wb_writeout_inc(&bdi->wb);
-}
-
 static void fuse_writepage_finish(struct fuse_writepage_args *wpa)
 {
 	struct fuse_args_pages *ap = &wpa->ia.ap;
 	struct inode *inode = wpa->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	int i;
 
-	for (i = 0; i < ap->num_folios; i++)
-		fuse_writepage_finish_stat(inode, ap->folios[i]);
+	for (i = 0; i < ap->num_folios; i++) {
+		/*
+		 * Benchmarks showed that ending writeback within the
+		 * scope of the fi->lock alleviates xarray lock
+		 * contention and noticeably improves performance.
+		 */
+		folio_end_writeback(ap->folios[i]);
+		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
+		wb_writeout_inc(&bdi->wb);
+	}
 
 	wake_up(&fi->page_waitq);
 }
@@ -1861,7 +1767,6 @@ static void fuse_send_writepage(struct fuse_mount *fm,
 __releases(fi->lock)
 __acquires(fi->lock)
 {
-	struct fuse_writepage_args *aux, *next;
 	struct fuse_inode *fi = get_fuse_inode(wpa->inode);
 	struct fuse_write_in *inarg = &wpa->ia.write.in;
 	struct fuse_args *args = &wpa->ia.ap.args;
@@ -1898,19 +1803,8 @@ __acquires(fi->lock)
 
  out_free:
 	fi->writectr--;
-	rb_erase(&wpa->writepages_entry, &fi->writepages);
 	fuse_writepage_finish(wpa);
 	spin_unlock(&fi->lock);
-
-	/* After rb_erase() aux request list is private */
-	for (aux = wpa->next; aux; aux = next) {
-		next = aux->next;
-		aux->next = NULL;
-		fuse_writepage_finish_stat(aux->inode,
-					   aux->ia.ap.folios[0]);
-		fuse_writepage_free(aux);
-	}
-
 	fuse_writepage_free(wpa);
 	spin_lock(&fi->lock);
 }
@@ -1938,43 +1832,6 @@ __acquires(fi->lock)
 	}
 }
 
-static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root,
-						struct fuse_writepage_args *wpa)
-{
-	pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT;
-	pgoff_t idx_to = idx_from + wpa->ia.ap.num_folios - 1;
-	struct rb_node **p = &root->rb_node;
-	struct rb_node  *parent = NULL;
-
-	WARN_ON(!wpa->ia.ap.num_folios);
-	while (*p) {
-		struct fuse_writepage_args *curr;
-		pgoff_t curr_index;
-
-		parent = *p;
-		curr = rb_entry(parent, struct fuse_writepage_args,
-				writepages_entry);
-		WARN_ON(curr->inode != wpa->inode);
-		curr_index = curr->ia.write.in.offset >> PAGE_SHIFT;
-
-		if (idx_from >= curr_index + curr->ia.ap.num_folios)
-			p = &(*p)->rb_right;
-		else if (idx_to < curr_index)
-			p = &(*p)->rb_left;
-		else
-			return curr;
-	}
-
-	rb_link_node(&wpa->writepages_entry, parent, p);
-	rb_insert_color(&wpa->writepages_entry, root);
-	return NULL;
-}
-
-static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa)
-{
-	WARN_ON(fuse_insert_writeback(root, wpa));
-}
-
 static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
 			       int error)
 {
@@ -1994,41 +1851,6 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
 	if (!fc->writeback_cache)
 		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
 	spin_lock(&fi->lock);
-	rb_erase(&wpa->writepages_entry, &fi->writepages);
-	while (wpa->next) {
-		struct fuse_mount *fm = get_fuse_mount(inode);
-		struct fuse_write_in *inarg = &wpa->ia.write.in;
-		struct fuse_writepage_args *next = wpa->next;
-
-		wpa->next = next->next;
-		next->next = NULL;
-		tree_insert(&fi->writepages, next);
-
-		/*
-		 * Skip fuse_flush_writepages() to make it easy to crop requests
-		 * based on primary request size.
-		 *
-		 * 1st case (trivial): there are no concurrent activities using
-		 * fuse_set/release_nowrite.  Then we're on safe side because
-		 * fuse_flush_writepages() would call fuse_send_writepage()
-		 * anyway.
-		 *
-		 * 2nd case: someone called fuse_set_nowrite and it is waiting
-		 * now for completion of all in-flight requests.  This happens
-		 * rarely and no more than once per page, so this should be
-		 * okay.
-		 *
-		 * 3rd case: someone (e.g. fuse_do_setattr()) is in the middle
-		 * of fuse_set_nowrite..fuse_release_nowrite section.  The fact
-		 * that fuse_set_nowrite returned implies that all in-flight
-		 * requests were completed along with all of their secondary
-		 * requests.  Further primary requests are blocked by negative
-		 * writectr.  Hence there cannot be any in-flight requests and
-		 * no invocations of fuse_writepage_end() while we're in
-		 * fuse_set_nowrite..fuse_release_nowrite section.
-		 */
-		fuse_send_writepage(fm, next, inarg->offset + inarg->size);
-	}
 	fi->writectr--;
 	fuse_writepage_finish(wpa);
 	spin_unlock(&fi->lock);
@@ -2115,19 +1937,16 @@ static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
 }
 
 static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
-					  struct folio *tmp_folio, uint32_t folio_index)
+					  uint32_t folio_index)
 {
 	struct inode *inode = folio->mapping->host;
 	struct fuse_args_pages *ap = &wpa->ia.ap;
 
-	folio_copy(tmp_folio, folio);
-
-	ap->folios[folio_index] = tmp_folio;
+	ap->folios[folio_index] = folio;
 	ap->descs[folio_index].offset = 0;
 	ap->descs[folio_index].length = PAGE_SIZE;
 
 	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
-	node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);
 }
 
 static struct fuse_writepage_args *fuse_writepage_args_setup(struct folio *folio,
@@ -2162,18 +1981,12 @@ static int fuse_writepage_locked(struct folio *folio)
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_writepage_args *wpa;
 	struct fuse_args_pages *ap;
-	struct folio *tmp_folio;
 	struct fuse_file *ff;
-	int error = -ENOMEM;
+	int error = -EIO;
 
-	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
-	if (!tmp_folio)
-		goto err;
-
-	error = -EIO;
 	ff = fuse_write_file_get(fi);
 	if (!ff)
-		goto err_nofile;
+		goto err;
 
 	wpa = fuse_writepage_args_setup(folio, ff);
 	error = -ENOMEM;
@@ -2184,22 +1997,17 @@ static int fuse_writepage_locked(struct folio *folio)
 	ap->num_folios = 1;
 
 	folio_start_writeback(folio);
-	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, 0);
+	fuse_writepage_args_page_fill(wpa, folio, 0);
 
 	spin_lock(&fi->lock);
-	tree_insert(&fi->writepages, wpa);
 	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
 	fuse_flush_writepages(inode);
 	spin_unlock(&fi->lock);
 
-	folio_end_writeback(folio);
-
 	return 0;
 
 err_writepage_args:
 	fuse_file_put(ff, false);
-err_nofile:
-	folio_put(tmp_folio);
 err:
 	mapping_set_error(folio->mapping, error);
 	return error;
@@ -2209,7 +2017,6 @@ struct fuse_fill_wb_data {
 	struct fuse_writepage_args *wpa;
 	struct fuse_file *ff;
 	struct inode *inode;
-	struct folio **orig_folios;
 	unsigned int max_folios;
 };
 
@@ -2244,69 +2051,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data)
 	struct fuse_writepage_args *wpa = data->wpa;
 	struct inode *inode = data->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
-	int num_folios = wpa->ia.ap.num_folios;
-	int i;
 
 	spin_lock(&fi->lock);
 	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
 	fuse_flush_writepages(inode);
 	spin_unlock(&fi->lock);
-
-	for (i = 0; i < num_folios; i++)
-		folio_end_writeback(data->orig_folios[i]);
-}
-
-/*
- * Check under fi->lock if the page is under writeback, and insert it onto the
- * rb_tree if not. Otherwise iterate auxiliary write requests, to see if there's
- * one already added for a page at this offset.  If there's none, then insert
- * this new request onto the auxiliary list, otherwise reuse the existing one by
- * swapping the new temp page with the old one.
- */
-static bool fuse_writepage_add(struct fuse_writepage_args *new_wpa,
-			       struct folio *folio)
-{
-	struct fuse_inode *fi = get_fuse_inode(new_wpa->inode);
-	struct fuse_writepage_args *tmp;
-	struct fuse_writepage_args *old_wpa;
-	struct fuse_args_pages *new_ap = &new_wpa->ia.ap;
-
-	WARN_ON(new_ap->num_folios != 0);
-	new_ap->num_folios = 1;
-
-	spin_lock(&fi->lock);
-	old_wpa = fuse_insert_writeback(&fi->writepages, new_wpa);
-	if (!old_wpa) {
-		spin_unlock(&fi->lock);
-		return true;
-	}
-
-	for (tmp = old_wpa->next; tmp; tmp = tmp->next) {
-		pgoff_t curr_index;
-
-		WARN_ON(tmp->inode != new_wpa->inode);
-		curr_index = tmp->ia.write.in.offset >> PAGE_SHIFT;
-		if (curr_index == folio->index) {
-			WARN_ON(tmp->ia.ap.num_folios != 1);
-			swap(tmp->ia.ap.folios[0], new_ap->folios[0]);
-			break;
-		}
-	}
-
-	if (!tmp) {
-		new_wpa->next = old_wpa->next;
-		old_wpa->next = new_wpa;
-	}
-
-	spin_unlock(&fi->lock);
-
-	if (tmp) {
-		fuse_writepage_finish_stat(new_wpa->inode,
-					   folio);
-		fuse_writepage_free(new_wpa);
-	}
-
-	return false;
 }
 
 static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
@@ -2315,15 +2064,6 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
 {
 	WARN_ON(!ap->num_folios);
 
-	/*
-	 * Being under writeback is unlikely but possible.  For example direct
-	 * read to an mmaped fuse file will set the page dirty twice; once when
-	 * the pages are faulted with get_user_pages(), and then after the read
-	 * completed.
-	 */
-	if (fuse_folio_is_writeback(data->inode, folio))
-		return true;
-
 	/* Reached max pages */
 	if (ap->num_folios == fc->max_pages)
 		return true;
@@ -2333,7 +2073,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
 		return true;
 
 	/* Discontinuity */
-	if (data->orig_folios[ap->num_folios - 1]->index + 1 != folio_index(folio))
+	if (ap->folios[ap->num_folios - 1]->index + 1 != folio_index(folio))
 		return true;
 
 	/* Need to grow the pages array?  If so, did the expansion fail? */
@@ -2352,7 +2092,6 @@ static int fuse_writepages_fill(struct folio *folio,
 	struct inode *inode = data->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	struct folio *tmp_folio;
 	int err;
 
 	if (!data->ff) {
@@ -2367,54 +2106,23 @@ static int fuse_writepages_fill(struct folio *folio,
 		data->wpa = NULL;
 	}
 
-	err = -ENOMEM;
-	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
-	if (!tmp_folio)
-		goto out_unlock;
-
-	/*
-	 * The page must not be redirtied until the writeout is completed
-	 * (i.e. userspace has sent a reply to the write request).  Otherwise
-	 * there could be more than one temporary page instance for each real
-	 * page.
-	 *
-	 * This is ensured by holding the page lock in page_mkwrite() while
-	 * checking fuse_page_is_writeback().  We already hold the page lock
-	 * since clear_page_dirty_for_io() and keep it held until we add the
-	 * request to the fi->writepages list and increment ap->num_folios.
-	 * After this fuse_page_is_writeback() will indicate that the page is
-	 * under writeback, so we can release the page lock.
-	 */
 	if (data->wpa == NULL) {
 		err = -ENOMEM;
 		wpa = fuse_writepage_args_setup(folio, data->ff);
-		if (!wpa) {
-			folio_put(tmp_folio);
+		if (!wpa)
 			goto out_unlock;
-		}
 		fuse_file_get(wpa->ia.ff);
 		data->max_folios = 1;
 		ap = &wpa->ia.ap;
 	}
 	folio_start_writeback(folio);
 
-	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, ap->num_folios);
-	data->orig_folios[ap->num_folios] = folio;
+	fuse_writepage_args_page_fill(wpa, folio, ap->num_folios);
 
 	err = 0;
-	if (data->wpa) {
-		/*
-		 * Protected by fi->lock against concurrent access by
-		 * fuse_page_is_writeback().
-		 */
-		spin_lock(&fi->lock);
-		ap->num_folios++;
-		spin_unlock(&fi->lock);
-	} else if (fuse_writepage_add(wpa, folio)) {
+	ap->num_folios++;
+	if (!data->wpa)
 		data->wpa = wpa;
-	} else {
-		folio_end_writeback(folio);
-	}
 out_unlock:
 	folio_unlock(folio);
 
@@ -2441,13 +2149,6 @@ static int fuse_writepages(struct address_space *mapping,
 	data.wpa = NULL;
 	data.ff = NULL;
 
-	err = -ENOMEM;
-	data.orig_folios = kcalloc(fc->max_pages,
-				   sizeof(struct folio *),
-				   GFP_NOFS);
-	if (!data.orig_folios)
-		goto out;
-
 	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
 	if (data.wpa) {
 		WARN_ON(!data.wpa->ia.ap.num_folios);
@@ -2456,7 +2157,6 @@ static int fuse_writepages(struct address_space *mapping,
 	if (data.ff)
 		fuse_file_put(data.ff, false);
 
-	kfree(data.orig_folios);
 out:
 	return err;
 }
@@ -2481,8 +2181,6 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
 	if (IS_ERR(folio))
 		goto error;
 
-	fuse_wait_on_page_writeback(mapping->host, folio->index);
-
 	if (folio_test_uptodate(folio) || len >= folio_size(folio))
 		goto success;
 	/*
@@ -2545,13 +2243,9 @@ static int fuse_launder_folio(struct folio *folio)
 {
 	int err = 0;
 	if (folio_clear_dirty_for_io(folio)) {
-		struct inode *inode = folio->mapping->host;
-
-		/* Serialize with pending writeback for the same page */
-		fuse_wait_on_page_writeback(inode, folio->index);
 		err = fuse_writepage_locked(folio);
 		if (!err)
-			fuse_wait_on_page_writeback(inode, folio->index);
+			folio_wait_writeback(folio);
 	}
 	return err;
 }
@@ -2595,7 +2289,7 @@ static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf)
 		return VM_FAULT_NOPAGE;
 	}
 
-	fuse_wait_on_folio_writeback(inode, folio);
+	folio_wait_writeback(folio);
 	return VM_FAULT_LOCKED;
 }
 
@@ -3413,9 +3107,12 @@ static const struct address_space_operations fuse_file_aops  = {
 void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	inode->i_fop = &fuse_file_operations;
 	inode->i_data.a_ops = &fuse_file_aops;
+	if (fc->writeback_cache)
+		mapping_set_writeback_indeterminate(&inode->i_data);
 
 	INIT_LIST_HEAD(&fi->write_files);
 	INIT_LIST_HEAD(&fi->queued_writes);
@@ -3423,7 +3120,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 	fi->iocachectr = 0;
 	init_waitqueue_head(&fi->page_waitq);
 	init_waitqueue_head(&fi->direct_io_waitq);
-	fi->writepages = RB_ROOT;
 
 	if (IS_ENABLED(CONFIG_FUSE_DAX))
 		fuse_dax_inode_init(inode, flags);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 74744c6f2860..23736c5c64c1 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -141,9 +141,6 @@ struct fuse_inode {
 
 			/* waitq for direct-io completion */
 			wait_queue_head_t direct_io_waitq;
-
-			/* List of writepage requestst (pending or sent) */
-			struct rb_root writepages;
 		};
 
 		/* readdir cache (directory only) */
-- 
2.43.5




* Re: [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree
  2024-11-22 23:23 ` [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
@ 2024-11-25  9:46   ` Jingbo Xu
  0 siblings, 0 replies; 124+ messages in thread
From: Jingbo Xu @ 2024-11-25  9:46 UTC (permalink / raw)
  To: Joanne Koong, miklos, linux-fsdevel
  Cc: shakeel.butt, josef, bernd.schubert, linux-mm, kernel-team



On 11/23/24 7:23 AM, Joanne Koong wrote:
> In the current FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), a temp page is allocated for every
> dirty page to be written back, the contents of the dirty page are copied over
> to the temp page, and the temp page gets handed to the server to write back.
> 
> This is done so that writeback may be immediately cleared on the dirty page,
> and this in turn is done for two reasons:
> a) in order to mitigate the following deadlock scenario that may arise
> if reclaim waits on writeback on the dirty page to complete:
> * single-threaded FUSE server is in the middle of handling a request
>   that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
>   direct reclaim
> b) in order to unblock internal (e.g. sync, page compaction) waits on
> writeback without needing the server to complete writing back to disk,
> which may take an indeterminate amount of time.
> 
> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> the situations described above, FUSE writeback does not need to use
> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> 
> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> and removes the temporary pages + extra copying and the internal rb
> tree.
> 
> fio benchmarks --
> (using averages observed from 10 runs, throwing away outliers)
> 
> Setup:
> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> 
> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> 
>            bs = 1k           4k             1M
> Before   351 MiB/s    1818 MiB/s    1851 MiB/s
> After    341 MiB/s    2246 MiB/s    2685 MiB/s
> % diff         -3%           23%           45%
> 
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
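
The "% diff" row in the quoted table can be rechecked from the Before/After
throughput figures. The sketch below covers that arithmetic only. The helper
names pct_change() and print_fio_deltas() are made up for illustration and are
not part of the patch or of fio.

```c
#include <stdio.h>

/* Relative change from `before` to `after`, in percent. */
double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Print the relative change for each block size quoted in the table. */
void print_fio_deltas(void)
{
	/* Throughput in MiB/s, copied from the quoted commit message. */
	static const struct {
		const char *bs;
		double before;
		double after;
	} rows[] = {
		{ "1k",  351.0,  341.0 },
		{ "4k", 1818.0, 2246.0 },
		{ "1M", 1851.0, 2685.0 },
	};

	for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("bs=%-3s %+.1f%%\n", rows[i].bs,
		       pct_change(rows[i].before, rows[i].after));
}
```

Calling print_fio_deltas() from a main() gives changes of about -2.8%, +23.5%
and +45.1%, which is in line with the rounded figures in the table.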

LGTM.

Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>


> ---
>  fs/fuse/file.c   | 360 ++++-------------------------------------------
>  fs/fuse/fuse_i.h |   3 -
>  2 files changed, 28 insertions(+), 335 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 88d0946b5bc9..1970d1a699a6 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
>  
>  struct fuse_writepage_args {
>  	struct fuse_io_args ia;
> -	struct rb_node writepages_entry;
>  	struct list_head queue_entry;
> -	struct fuse_writepage_args *next;
>  	struct inode *inode;
>  	struct fuse_sync_bucket *bucket;
>  };
>  
> -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
> -					    pgoff_t idx_from, pgoff_t idx_to)
> -{
> -	struct rb_node *n;
> -
> -	n = fi->writepages.rb_node;
> -
> -	while (n) {
> -		struct fuse_writepage_args *wpa;
> -		pgoff_t curr_index;
> -
> -		wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry);
> -		WARN_ON(get_fuse_inode(wpa->inode) != fi);
> -		curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT;
> -		if (idx_from >= curr_index + wpa->ia.ap.num_folios)
> -			n = n->rb_right;
> -		else if (idx_to < curr_index)
> -			n = n->rb_left;
> -		else
> -			return wpa;
> -	}
> -	return NULL;
> -}
> -
> -/*
> - * Check if any page in a range is under writeback
> - */
> -static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
> -				   pgoff_t idx_to)
> -{
> -	struct fuse_inode *fi = get_fuse_inode(inode);
> -	bool found;
> -
> -	if (RB_EMPTY_ROOT(&fi->writepages))
> -		return false;
> -
> -	spin_lock(&fi->lock);
> -	found = fuse_find_writeback(fi, idx_from, idx_to);
> -	spin_unlock(&fi->lock);
> -
> -	return found;
> -}
> -
> -static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
> -{
> -	return fuse_range_is_writeback(inode, index, index);
> -}
> -
> -/*
> - * Wait for page writeback to be completed.
> - *
> - * Since fuse doesn't rely on the VM writeback tracking, this has to
> - * use some other means.
> - */
> -static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
> -{
> -	struct fuse_inode *fi = get_fuse_inode(inode);
> -
> -	wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
> -}
> -
> -static inline bool fuse_folio_is_writeback(struct inode *inode,
> -					   struct folio *folio)
> -{
> -	pgoff_t last = folio_next_index(folio) - 1;
> -	return fuse_range_is_writeback(inode, folio_index(folio), last);
> -}
> -
> -static void fuse_wait_on_folio_writeback(struct inode *inode,
> -					 struct folio *folio)
> -{
> -	struct fuse_inode *fi = get_fuse_inode(inode);
> -
> -	wait_event(fi->page_waitq, !fuse_folio_is_writeback(inode, folio));
> -}
> -
>  /*
>   * Wait for all pending writepages on the inode to finish.
>   *
> @@ -886,13 +808,6 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
>  	ssize_t res;
>  	u64 attr_ver;
>  
> -	/*
> -	 * With the temporary pages that are used to complete writeback, we can
> -	 * have writeback that extends beyond the lifetime of the folio.  So
> -	 * make sure we read a properly synced folio.
> -	 */
> -	fuse_wait_on_folio_writeback(inode, folio);
> -
>  	attr_ver = fuse_get_attr_version(fm->fc);
>  
>  	/* Don't overflow end offset */
> @@ -1003,17 +918,12 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
>  static void fuse_readahead(struct readahead_control *rac)
>  {
>  	struct inode *inode = rac->mapping->host;
> -	struct fuse_inode *fi = get_fuse_inode(inode);
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  	unsigned int max_pages, nr_pages;
> -	pgoff_t first = readahead_index(rac);
> -	pgoff_t last = first + readahead_count(rac) - 1;
>  
>  	if (fuse_is_bad(inode))
>  		return;
>  
> -	wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
> -
>  	max_pages = min_t(unsigned int, fc->max_pages,
>  			fc->max_read / PAGE_SIZE);
>  
> @@ -1172,7 +1082,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>  	int err;
>  
>  	for (i = 0; i < ap->num_folios; i++)
> -		fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> +		folio_wait_writeback(ap->folios[i]);
>  
>  	fuse_write_args_fill(ia, ff, pos, count);
>  	ia->write.in.flags = fuse_write_flags(iocb);
> @@ -1622,7 +1532,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>  			return res;
>  		}
>  	}
> -	if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
> +	if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
>  		if (!write)
>  			inode_lock(inode);
>  		fuse_sync_writes(inode);
> @@ -1819,38 +1729,34 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
>  static void fuse_writepage_free(struct fuse_writepage_args *wpa)
>  {
>  	struct fuse_args_pages *ap = &wpa->ia.ap;
> -	int i;
>  
>  	if (wpa->bucket)
>  		fuse_sync_bucket_dec(wpa->bucket);
>  
> -	for (i = 0; i < ap->num_folios; i++)
> -		folio_put(ap->folios[i]);
> -
>  	fuse_file_put(wpa->ia.ff, false);
>  
>  	kfree(ap->folios);
>  	kfree(wpa);
>  }
>  
> -static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
> -{
> -	struct backing_dev_info *bdi = inode_to_bdi(inode);
> -
> -	dec_wb_stat(&bdi->wb, WB_WRITEBACK);
> -	node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
> -	wb_writeout_inc(&bdi->wb);
> -}
> -
>  static void fuse_writepage_finish(struct fuse_writepage_args *wpa)
>  {
>  	struct fuse_args_pages *ap = &wpa->ia.ap;
>  	struct inode *inode = wpa->inode;
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
>  	int i;
>  
> -	for (i = 0; i < ap->num_folios; i++)
> -		fuse_writepage_finish_stat(inode, ap->folios[i]);
> +	for (i = 0; i < ap->num_folios; i++) {
> +		/*
> +		 * Benchmarks showed that ending writeback within the
> +		 * scope of the fi->lock alleviates xarray lock
> +		 * contention and noticeably improves performance.
> +		 */
> +		folio_end_writeback(ap->folios[i]);
> +		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
> +		wb_writeout_inc(&bdi->wb);
> +	}
>  
>  	wake_up(&fi->page_waitq);
>  }
> @@ -1861,7 +1767,6 @@ static void fuse_send_writepage(struct fuse_mount *fm,
>  __releases(fi->lock)
>  __acquires(fi->lock)
>  {
> -	struct fuse_writepage_args *aux, *next;
>  	struct fuse_inode *fi = get_fuse_inode(wpa->inode);
>  	struct fuse_write_in *inarg = &wpa->ia.write.in;
>  	struct fuse_args *args = &wpa->ia.ap.args;
> @@ -1898,19 +1803,8 @@ __acquires(fi->lock)
>  
>   out_free:
>  	fi->writectr--;
> -	rb_erase(&wpa->writepages_entry, &fi->writepages);
>  	fuse_writepage_finish(wpa);
>  	spin_unlock(&fi->lock);
> -
> -	/* After rb_erase() aux request list is private */
> -	for (aux = wpa->next; aux; aux = next) {
> -		next = aux->next;
> -		aux->next = NULL;
> -		fuse_writepage_finish_stat(aux->inode,
> -					   aux->ia.ap.folios[0]);
> -		fuse_writepage_free(aux);
> -	}
> -
>  	fuse_writepage_free(wpa);
>  	spin_lock(&fi->lock);
>  }
> @@ -1938,43 +1832,6 @@ __acquires(fi->lock)
>  	}
>  }
>  
> -static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root,
> -						struct fuse_writepage_args *wpa)
> -{
> -	pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT;
> -	pgoff_t idx_to = idx_from + wpa->ia.ap.num_folios - 1;
> -	struct rb_node **p = &root->rb_node;
> -	struct rb_node  *parent = NULL;
> -
> -	WARN_ON(!wpa->ia.ap.num_folios);
> -	while (*p) {
> -		struct fuse_writepage_args *curr;
> -		pgoff_t curr_index;
> -
> -		parent = *p;
> -		curr = rb_entry(parent, struct fuse_writepage_args,
> -				writepages_entry);
> -		WARN_ON(curr->inode != wpa->inode);
> -		curr_index = curr->ia.write.in.offset >> PAGE_SHIFT;
> -
> -		if (idx_from >= curr_index + curr->ia.ap.num_folios)
> -			p = &(*p)->rb_right;
> -		else if (idx_to < curr_index)
> -			p = &(*p)->rb_left;
> -		else
> -			return curr;
> -	}
> -
> -	rb_link_node(&wpa->writepages_entry, parent, p);
> -	rb_insert_color(&wpa->writepages_entry, root);
> -	return NULL;
> -}
> -
> -static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa)
> -{
> -	WARN_ON(fuse_insert_writeback(root, wpa));
> -}
> -
>  static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
>  			       int error)
>  {
> @@ -1994,41 +1851,6 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
>  	if (!fc->writeback_cache)
>  		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
>  	spin_lock(&fi->lock);
> -	rb_erase(&wpa->writepages_entry, &fi->writepages);
> -	while (wpa->next) {
> -		struct fuse_mount *fm = get_fuse_mount(inode);
> -		struct fuse_write_in *inarg = &wpa->ia.write.in;
> -		struct fuse_writepage_args *next = wpa->next;
> -
> -		wpa->next = next->next;
> -		next->next = NULL;
> -		tree_insert(&fi->writepages, next);
> -
> -		/*
> -		 * Skip fuse_flush_writepages() to make it easy to crop requests
> -		 * based on primary request size.
> -		 *
> -		 * 1st case (trivial): there are no concurrent activities using
> -		 * fuse_set/release_nowrite.  Then we're on safe side because
> -		 * fuse_flush_writepages() would call fuse_send_writepage()
> -		 * anyway.
> -		 *
> -		 * 2nd case: someone called fuse_set_nowrite and it is waiting
> -		 * now for completion of all in-flight requests.  This happens
> -		 * rarely and no more than once per page, so this should be
> -		 * okay.
> -		 *
> -		 * 3rd case: someone (e.g. fuse_do_setattr()) is in the middle
> -		 * of fuse_set_nowrite..fuse_release_nowrite section.  The fact
> -		 * that fuse_set_nowrite returned implies that all in-flight
> -		 * requests were completed along with all of their secondary
> -		 * requests.  Further primary requests are blocked by negative
> -		 * writectr.  Hence there cannot be any in-flight requests and
> -		 * no invocations of fuse_writepage_end() while we're in
> -		 * fuse_set_nowrite..fuse_release_nowrite section.
> -		 */
> -		fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> -	}
>  	fi->writectr--;
>  	fuse_writepage_finish(wpa);
>  	spin_unlock(&fi->lock);
> @@ -2115,19 +1937,16 @@ static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
>  }
>  
>  static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> -					  struct folio *tmp_folio, uint32_t folio_index)
> +					  uint32_t folio_index)
>  {
>  	struct inode *inode = folio->mapping->host;
>  	struct fuse_args_pages *ap = &wpa->ia.ap;
>  
> -	folio_copy(tmp_folio, folio);
> -
> -	ap->folios[folio_index] = tmp_folio;
> +	ap->folios[folio_index] = folio;
>  	ap->descs[folio_index].offset = 0;
>  	ap->descs[folio_index].length = PAGE_SIZE;
>  
>  	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
> -	node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);
>  }
>  
>  static struct fuse_writepage_args *fuse_writepage_args_setup(struct folio *folio,
> @@ -2162,18 +1981,12 @@ static int fuse_writepage_locked(struct folio *folio)
>  	struct fuse_inode *fi = get_fuse_inode(inode);
>  	struct fuse_writepage_args *wpa;
>  	struct fuse_args_pages *ap;
> -	struct folio *tmp_folio;
>  	struct fuse_file *ff;
> -	int error = -ENOMEM;
> +	int error = -EIO;
>  
> -	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
> -	if (!tmp_folio)
> -		goto err;
> -
> -	error = -EIO;
>  	ff = fuse_write_file_get(fi);
>  	if (!ff)
> -		goto err_nofile;
> +		goto err;
>  
>  	wpa = fuse_writepage_args_setup(folio, ff);
>  	error = -ENOMEM;
> @@ -2184,22 +1997,17 @@ static int fuse_writepage_locked(struct folio *folio)
>  	ap->num_folios = 1;
>  
>  	folio_start_writeback(folio);
> -	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, 0);
> +	fuse_writepage_args_page_fill(wpa, folio, 0);
>  
>  	spin_lock(&fi->lock);
> -	tree_insert(&fi->writepages, wpa);
>  	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
>  	fuse_flush_writepages(inode);
>  	spin_unlock(&fi->lock);
>  
> -	folio_end_writeback(folio);
> -
>  	return 0;
>  
>  err_writepage_args:
>  	fuse_file_put(ff, false);
> -err_nofile:
> -	folio_put(tmp_folio);
>  err:
>  	mapping_set_error(folio->mapping, error);
>  	return error;
> @@ -2209,7 +2017,6 @@ struct fuse_fill_wb_data {
>  	struct fuse_writepage_args *wpa;
>  	struct fuse_file *ff;
>  	struct inode *inode;
> -	struct folio **orig_folios;
>  	unsigned int max_folios;
>  };
>  
> @@ -2244,69 +2051,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data)
>  	struct fuse_writepage_args *wpa = data->wpa;
>  	struct inode *inode = data->inode;
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> -	int num_folios = wpa->ia.ap.num_folios;
> -	int i;
>  
>  	spin_lock(&fi->lock);
>  	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
>  	fuse_flush_writepages(inode);
>  	spin_unlock(&fi->lock);
> -
> -	for (i = 0; i < num_folios; i++)
> -		folio_end_writeback(data->orig_folios[i]);
> -}
> -
> -/*
> - * Check under fi->lock if the page is under writeback, and insert it onto the
> - * rb_tree if not. Otherwise iterate auxiliary write requests, to see if there's
> - * one already added for a page at this offset.  If there's none, then insert
> - * this new request onto the auxiliary list, otherwise reuse the existing one by
> - * swapping the new temp page with the old one.
> - */
> -static bool fuse_writepage_add(struct fuse_writepage_args *new_wpa,
> -			       struct folio *folio)
> -{
> -	struct fuse_inode *fi = get_fuse_inode(new_wpa->inode);
> -	struct fuse_writepage_args *tmp;
> -	struct fuse_writepage_args *old_wpa;
> -	struct fuse_args_pages *new_ap = &new_wpa->ia.ap;
> -
> -	WARN_ON(new_ap->num_folios != 0);
> -	new_ap->num_folios = 1;
> -
> -	spin_lock(&fi->lock);
> -	old_wpa = fuse_insert_writeback(&fi->writepages, new_wpa);
> -	if (!old_wpa) {
> -		spin_unlock(&fi->lock);
> -		return true;
> -	}
> -
> -	for (tmp = old_wpa->next; tmp; tmp = tmp->next) {
> -		pgoff_t curr_index;
> -
> -		WARN_ON(tmp->inode != new_wpa->inode);
> -		curr_index = tmp->ia.write.in.offset >> PAGE_SHIFT;
> -		if (curr_index == folio->index) {
> -			WARN_ON(tmp->ia.ap.num_folios != 1);
> -			swap(tmp->ia.ap.folios[0], new_ap->folios[0]);
> -			break;
> -		}
> -	}
> -
> -	if (!tmp) {
> -		new_wpa->next = old_wpa->next;
> -		old_wpa->next = new_wpa;
> -	}
> -
> -	spin_unlock(&fi->lock);
> -
> -	if (tmp) {
> -		fuse_writepage_finish_stat(new_wpa->inode,
> -					   folio);
> -		fuse_writepage_free(new_wpa);
> -	}
> -
> -	return false;
>  }
>  
>  static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
> @@ -2315,15 +2064,6 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
>  {
>  	WARN_ON(!ap->num_folios);
>  
> -	/*
> -	 * Being under writeback is unlikely but possible.  For example direct
> -	 * read to an mmaped fuse file will set the page dirty twice; once when
> -	 * the pages are faulted with get_user_pages(), and then after the read
> -	 * completed.
> -	 */
> -	if (fuse_folio_is_writeback(data->inode, folio))
> -		return true;
> -
>  	/* Reached max pages */
>  	if (ap->num_folios == fc->max_pages)
>  		return true;
> @@ -2333,7 +2073,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
>  		return true;
>  
>  	/* Discontinuity */
> -	if (data->orig_folios[ap->num_folios - 1]->index + 1 != folio_index(folio))
> +	if (ap->folios[ap->num_folios - 1]->index + 1 != folio_index(folio))
>  		return true;
>  
>  	/* Need to grow the pages array?  If so, did the expansion fail? */
> @@ -2352,7 +2092,6 @@ static int fuse_writepages_fill(struct folio *folio,
>  	struct inode *inode = data->inode;
>  	struct fuse_inode *fi = get_fuse_inode(inode);
>  	struct fuse_conn *fc = get_fuse_conn(inode);
> -	struct folio *tmp_folio;
>  	int err;
>  
>  	if (!data->ff) {
> @@ -2367,54 +2106,23 @@ static int fuse_writepages_fill(struct folio *folio,
>  		data->wpa = NULL;
>  	}
>  
> -	err = -ENOMEM;
> -	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
> -	if (!tmp_folio)
> -		goto out_unlock;
> -
> -	/*
> -	 * The page must not be redirtied until the writeout is completed
> -	 * (i.e. userspace has sent a reply to the write request).  Otherwise
> -	 * there could be more than one temporary page instance for each real
> -	 * page.
> -	 *
> -	 * This is ensured by holding the page lock in page_mkwrite() while
> -	 * checking fuse_page_is_writeback().  We already hold the page lock
> -	 * since clear_page_dirty_for_io() and keep it held until we add the
> -	 * request to the fi->writepages list and increment ap->num_folios.
> -	 * After this fuse_page_is_writeback() will indicate that the page is
> -	 * under writeback, so we can release the page lock.
> -	 */
>  	if (data->wpa == NULL) {
>  		err = -ENOMEM;
>  		wpa = fuse_writepage_args_setup(folio, data->ff);
> -		if (!wpa) {
> -			folio_put(tmp_folio);
> +		if (!wpa)
>  			goto out_unlock;
> -		}
>  		fuse_file_get(wpa->ia.ff);
>  		data->max_folios = 1;
>  		ap = &wpa->ia.ap;
>  	}
>  	folio_start_writeback(folio);
>  
> -	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, ap->num_folios);
> -	data->orig_folios[ap->num_folios] = folio;
> +	fuse_writepage_args_page_fill(wpa, folio, ap->num_folios);
>  
>  	err = 0;
> -	if (data->wpa) {
> -		/*
> -		 * Protected by fi->lock against concurrent access by
> -		 * fuse_page_is_writeback().
> -		 */
> -		spin_lock(&fi->lock);
> -		ap->num_folios++;
> -		spin_unlock(&fi->lock);
> -	} else if (fuse_writepage_add(wpa, folio)) {
> +	ap->num_folios++;
> +	if (!data->wpa)
>  		data->wpa = wpa;
> -	} else {
> -		folio_end_writeback(folio);
> -	}
>  out_unlock:
>  	folio_unlock(folio);
>  
> @@ -2441,13 +2149,6 @@ static int fuse_writepages(struct address_space *mapping,
>  	data.wpa = NULL;
>  	data.ff = NULL;
>  
> -	err = -ENOMEM;
> -	data.orig_folios = kcalloc(fc->max_pages,
> -				   sizeof(struct folio *),
> -				   GFP_NOFS);
> -	if (!data.orig_folios)
> -		goto out;
> -
>  	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
>  	if (data.wpa) {
>  		WARN_ON(!data.wpa->ia.ap.num_folios);
> @@ -2456,7 +2157,6 @@ static int fuse_writepages(struct address_space *mapping,
>  	if (data.ff)
>  		fuse_file_put(data.ff, false);
>  
> -	kfree(data.orig_folios);
>  out:
>  	return err;
>  }
> @@ -2481,8 +2181,6 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
>  	if (IS_ERR(folio))
>  		goto error;
>  
> -	fuse_wait_on_page_writeback(mapping->host, folio->index);
> -
>  	if (folio_test_uptodate(folio) || len >= folio_size(folio))
>  		goto success;
>  	/*
> @@ -2545,13 +2243,9 @@ static int fuse_launder_folio(struct folio *folio)
>  {
>  	int err = 0;
>  	if (folio_clear_dirty_for_io(folio)) {
> -		struct inode *inode = folio->mapping->host;
> -
> -		/* Serialize with pending writeback for the same page */
> -		fuse_wait_on_page_writeback(inode, folio->index);
>  		err = fuse_writepage_locked(folio);
>  		if (!err)
> -			fuse_wait_on_page_writeback(inode, folio->index);
> +			folio_wait_writeback(folio);
>  	}
>  	return err;
>  }
> @@ -2595,7 +2289,7 @@ static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf)
>  		return VM_FAULT_NOPAGE;
>  	}
>  
> -	fuse_wait_on_folio_writeback(inode, folio);
> +	folio_wait_writeback(folio);
>  	return VM_FAULT_LOCKED;
>  }
>  
> @@ -3413,9 +3107,12 @@ static const struct address_space_operations fuse_file_aops  = {
>  void fuse_init_file_inode(struct inode *inode, unsigned int flags)
>  {
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_conn *fc = get_fuse_conn(inode);
>  
>  	inode->i_fop = &fuse_file_operations;
>  	inode->i_data.a_ops = &fuse_file_aops;
> +	if (fc->writeback_cache)
> +		mapping_set_writeback_indeterminate(&inode->i_data);
>  
>  	INIT_LIST_HEAD(&fi->write_files);
>  	INIT_LIST_HEAD(&fi->queued_writes);
> @@ -3423,7 +3120,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
>  	fi->iocachectr = 0;
>  	init_waitqueue_head(&fi->page_waitq);
>  	init_waitqueue_head(&fi->direct_io_waitq);
> -	fi->writepages = RB_ROOT;
>  
>  	if (IS_ENABLED(CONFIG_FUSE_DAX))
>  		fuse_dax_inode_init(inode, flags);
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 74744c6f2860..23736c5c64c1 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -141,9 +141,6 @@ struct fuse_inode {
>  
>  			/* waitq for direct-io completion */
>  			wait_queue_head_t direct_io_waitq;
> -
> -			/* List of writepage requestst (pending or sent) */
> -			struct rb_root writepages;
>  		};
>  
>  		/* readdir cache (directory only) */

-- 
Thanks,
Jingbo



* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback
  2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
                   ` (4 preceding siblings ...)
  2024-11-22 23:23 ` [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
@ 2024-12-12 21:55 ` Joanne Koong
  2024-12-13 11:52 ` Miklos Szeredi
  6 siblings, 0 replies; 124+ messages in thread
From: Joanne Koong @ 2024-12-12 21:55 UTC (permalink / raw)
  To: miklos, linux-fsdevel
  Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm,
	kernel-team

On Fri, Nov 22, 2024 at 3:24 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> The purpose of this patchset is to help make writeback-cache write
> performance in FUSE filesystems as fast as possible.
>
> In the current FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), a temp page is allocated for every dirty
> page to be written back, the contents of the dirty page are copied over to the
> temp page, and the temp page gets handed to the server to write back. This is
> done so that writeback may be immediately cleared on the dirty page, and this
> in turn is done for two reasons:
> a) in order to mitigate the following deadlock scenario that may arise if
> reclaim waits on writeback on the dirty page to complete (more details can be
> found in this thread [1]):
> * single-threaded FUSE server is in the middle of handling a request
>   that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
>   direct reclaim
> b) in order to unblock internal (e.g. sync, page compaction) waits on writeback
> without needing the server to complete writing back to disk, which may take
> an indeterminate amount of time.
>
> Allocating and copying dirty pages to temp pages is the biggest performance
> bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> altogether (which will also allow us to get rid of the internal FUSE rb tree
> that is needed to keep track of writeback status on the temp pages).
> Benchmarks show approximately a 20% improvement in throughput for 4k
> block-size writes and a 45% improvement for 1M block-size writes.
>
> With removing the temp page, writeback state is now only cleared on the dirty
> page after the server has written it back to disk. This may take an
> indeterminate amount of time. As well, there is also the possibility of
> malicious or well-intentioned but buggy servers where writeback may in the
> worst case scenario, never complete. This means that any
> folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to
> be carefully audited.
>
> In particular, these are the cases that need to be accounted for:
> * potentially deadlocking in reclaim, as mentioned above
> * potentially stalling sync(2)
> * potentially stalling page migration / compaction
>
> This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> filesystems may set on its inode mappings to indicate that writeback
> operations may take an indeterminate amount of time to complete. FUSE will set
> this flag on its mappings. This patchset adds checks to the critical parts of
> reclaim, sync, and page migration logic where writeback may be waited on.
>
> Please note the following:
> * For sync(2), waiting on writeback will be skipped for FUSE, but this has no
>   effect on existing behavior. Dirty FUSE pages are already not guaranteed to
>   be written to disk by the time sync(2) returns (eg writeback is cleared on
>   the dirty page but the server may not have written out the temp page to disk
>   yet). If the caller wishes to ensure the data has actually been synced to
>   disk, they should use fsync(2)/fdatasync(2) instead.
> * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
>   waited on when in writeback. There are some cases where the wait is
>   desirable. For example, for the sync_file_range() syscall, it is fine to
>   wait on the writeback since the caller passes in a fd for the operation.
>
> [1]
> https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com/
>
> Changelog
> ---------
> v5:
> https://lore.kernel.org/linux-fsdevel/20241115224459.427610-1-joannelkoong@gmail.com/
> Changes from v5 -> v6:
> * Add Shakeel and Jingbo's reviewed-bys
> * Move folio_end_writeback() to fuse_writepage_finish() (Jingbo)
> * Embed fuse_writepage_finish_stat() logic inline (Jingbo)
> * Remove node_stat NR_WRITEBACK inc/sub (Jingbo)
>
> v4:
> https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@gmail.com/
> Changes from v4 -> v5:
> * AS_WRITEBACK_MAY_BLOCK -> AS_WRITEBACK_INDETERMINATE (Shakeel)
> * Drop memory hotplug patch (David and Shakeel)
> * Remove some more unnecessary writeback waits in fuse code (Jingbo)
> * Make commit message for reclaim patch more concise - drop part about
>   deadlock and just focus on how it may stall waits
>
> v3:
> https://lore.kernel.org/linux-fsdevel/20241107191618.2011146-1-joannelkoong@gmail.com/
> Changes from v3 -> v4:
> * Use filemap_fdatawait_range() instead of filemap_range_has_writeback() in
>   readahead
>
> v2:
> https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/
> Changes from v2 -> v3:
> * Account for sync and page migration cases as well (Miklos)
> * Change AS_NO_WRITEBACK_RECLAIM to the more generic AS_WRITEBACK_MAY_BLOCK
> * For fuse inodes, set mapping_writeback_may_block only if fc->writeback_cache
>   is enabled
>
> v1:
> https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/T/#t
> Changes from v1 -> v2:
> * Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
> * Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)
>
> Joanne Koong (5):
>   mm: add AS_WRITEBACK_INDETERMINATE mapping flag
>   mm: skip reclaiming folios in legacy memcg writeback indeterminate
>     contexts
>   fs/writeback: in wait_sb_inodes(), skip wait for
>     AS_WRITEBACK_INDETERMINATE mappings
>   mm/migrate: skip migrating folios under writeback with
>     AS_WRITEBACK_INDETERMINATE mappings
>   fuse: remove tmp folio for writebacks and internal rb tree
>
>  fs/fs-writeback.c       |   3 +
>  fs/fuse/file.c          | 360 ++++------------------------------------
>  fs/fuse/fuse_i.h        |   3 -
>  include/linux/pagemap.h |  11 ++
>  mm/migrate.c            |   5 +-
>  mm/vmscan.c             |  10 +-
>  6 files changed, 53 insertions(+), 339 deletions(-)
>

Miklos, may I get your thoughts on this patchset?


Thanks,
Joanne

> --
> 2.43.5
>
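
The cover letter's advice that callers who need durability should use
fsync(2)/fdatasync(2) rather than sync(2) can be sketched from userspace as
below. This is a minimal illustration, not code from the patchset:
write_durably() is a hypothetical helper name, and how strong the guarantee
is on a FUSE mount still depends on how the server handles the fsync request.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patchset): write len bytes to path
 * and return 0 only after fsync(2) has reported success, -1 otherwise.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd;

	assert(path != NULL && buf != NULL);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* write(2) may be short, so loop until all bytes are accepted. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n <= 0) {
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	/* Unlike sync(2), fsync(2) targets this fd and reports errors. */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

A caller would check the return value, for example
write_durably("/mnt/fuse/file", data, size) == 0, before treating the data
as persisted; the path here is only an example.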



* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback
  2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
                   ` (5 preceding siblings ...)
  2024-12-12 21:55 ` [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
@ 2024-12-13 11:52 ` Miklos Szeredi
  2024-12-13 16:47   ` Shakeel Butt
  6 siblings, 1 reply; 124+ messages in thread
From: Miklos Szeredi @ 2024-12-13 11:52 UTC (permalink / raw)
  To: Joanne Koong
  Cc: linux-fsdevel, shakeel.butt, jefflexu, josef, bernd.schubert,
	linux-mm, kernel-team

On Sat, 23 Nov 2024 at 00:24, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> The purpose of this patchset is to help make writeback-cache write
> performance in FUSE filesystems as fast as possible.
>
> In the current FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), a temp page is allocated for every dirty
> page to be written back, the contents of the dirty page are copied over to the
> temp page, and the temp page gets handed to the server to write back. This is
> done so that writeback may be immediately cleared on the dirty page, and this
> in turn is done for two reasons:
> a) in order to mitigate the following deadlock scenario that may arise if
> reclaim waits on writeback on the dirty page to complete (more details can be
> found in this thread [1]):
> * single-threaded FUSE server is in the middle of handling a request
>   that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
>   direct reclaim
> b) in order to unblock internal (eg sync, page compaction) waits on writeback
> without needing the server to complete writing back to disk, which may take
> an indeterminate amount of time.
>
> Allocating and copying dirty pages to temp pages is the biggest performance
> bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> altogether (which will also allow us to get rid of the internal FUSE rb tree
> that is needed to keep track of writeback status on the temp pages).
> Benchmarks show approximately a 20% improvement in throughput for 4k
> block-size writes and a 45% improvement for 1M block-size writes.
>
> With removing the temp page, writeback state is now only cleared on the dirty
> page after the server has written it back to disk. This may take an
> indeterminate amount of time. As well, there is also the possibility of
> malicious or well-intentioned but buggy servers where writeback may in the
> worst case scenario, never complete. This means that any
> folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to
> be carefully audited.
>
> In particular, these are the cases that need to be accounted for:
> * potentially deadlocking in reclaim, as mentioned above
> * potentially stalling sync(2)
> * potentially stalling page migration / compaction
>
> This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> filesystems may set on its inode mappings to indicate that writeback
> operations may take an indeterminate amount of time to complete. FUSE will set
> this flag on its mappings. This patchset adds checks to the critical parts of
> reclaim, sync, and page migration logic where writeback may be waited on.
>
> Please note the following:
> * For sync(2), waiting on writeback will be skipped for FUSE, but this has no
>   effect on existing behavior. Dirty FUSE pages are already not guaranteed to
>   be written to disk by the time sync(2) returns (eg writeback is cleared on
>   the dirty page but the server may not have written out the temp page to disk
>   yet). If the caller wishes to ensure the data has actually been synced to
>   disk, they should use fsync(2)/fdatasync(2) instead.
> * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
>   waited on when in writeback. There are some cases where the wait is
>   desirable. For example, for the sync_file_range() syscall, it is fine to
>   wait on the writeback since the caller passes in a fd for the operation.

Looks good, thanks.

Acked-by: Miklos Szeredi <mszeredi@redhat.com>

I think this should go via the mm tree.

Thanks,
Miklos



* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback
  2024-12-13 11:52 ` Miklos Szeredi
@ 2024-12-13 16:47   ` Shakeel Butt
  2024-12-18 17:37     ` Joanne Koong
  0 siblings, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-13 16:47 UTC (permalink / raw)
  To: Miklos Szeredi, Andrew Morton
  Cc: Joanne Koong, linux-fsdevel, jefflexu, josef, bernd.schubert,
	linux-mm, kernel-team

+Andrew

On Fri, Dec 13, 2024 at 12:52:44PM +0100, Miklos Szeredi wrote:
> On Sat, 23 Nov 2024 at 00:24, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > The purpose of this patchset is to help make writeback-cache write
> > performance in FUSE filesystems as fast as possible.
> >
> > In the current FUSE writeback design (see commit 3be5a52b30aa
> > ("fuse: support writable mmap")), a temp page is allocated for every dirty
> > page to be written back, the contents of the dirty page are copied over to the
> > temp page, and the temp page gets handed to the server to write back. This is
> > done so that writeback may be immediately cleared on the dirty page, and this
> > in turn is done for two reasons:
> > a) in order to mitigate the following deadlock scenario that may arise if
> > reclaim waits on writeback on the dirty page to complete (more details can be
> > found in this thread [1]):
> > * single-threaded FUSE server is in the middle of handling a request
> >   that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in
> >   direct reclaim
> > b) in order to unblock internal (eg sync, page compaction) waits on writeback
> > without needing the server to complete writing back to disk, which may take
> > an indeterminate amount of time.
> >
> > Allocating and copying dirty pages to temp pages is the biggest performance
> > bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> > altogether (which will also allow us to get rid of the internal FUSE rb tree
> > that is needed to keep track of writeback status on the temp pages).
> > Benchmarks show approximately a 20% improvement in throughput for 4k
> > block-size writes and a 45% improvement for 1M block-size writes.
> >
> > With removing the temp page, writeback state is now only cleared on the dirty
> > page after the server has written it back to disk. This may take an
> > indeterminate amount of time. As well, there is also the possibility of
> > malicious or well-intentioned but buggy servers where writeback may in the
> > worst case scenario, never complete. This means that any
> > folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to
> > be carefully audited.
> >
> > In particular, these are the cases that need to be accounted for:
> > * potentially deadlocking in reclaim, as mentioned above
> > * potentially stalling sync(2)
> > * potentially stalling page migration / compaction
> >
> > This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> > filesystems may set on its inode mappings to indicate that writeback
> > operations may take an indeterminate amount of time to complete. FUSE will set
> > this flag on its mappings. This patchset adds checks to the critical parts of
> > reclaim, sync, and page migration logic where writeback may be waited on.
> >
> > Please note the following:
> > * For sync(2), waiting on writeback will be skipped for FUSE, but this has no
> >   effect on existing behavior. Dirty FUSE pages are already not guaranteed to
> >   be written to disk by the time sync(2) returns (eg writeback is cleared on
> >   the dirty page but the server may not have written out the temp page to disk
> >   yet). If the caller wishes to ensure the data has actually been synced to
> >   disk, they should use fsync(2)/fdatasync(2) instead.
> > * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
> >   waited on when in writeback. There are some cases where the wait is
> >   desirable. For example, for the sync_file_range() syscall, it is fine to
> >   wait on the writeback since the caller passes in a fd for the operation.
> 
> Looks good, thanks.
> 
> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> 
> I think this should go via the mm tree.

Andrew, could you please pick this series up, or should Joanne send an
updated version with all Acked-by/Reviewed-by tags collected? Let us know
what you prefer.

Thanks,
Shakeel



* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback
  2024-12-13 16:47   ` Shakeel Butt
@ 2024-12-18 17:37     ` Joanne Koong
  2024-12-18 17:44       ` Shakeel Butt
  0 siblings, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2024-12-18 17:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Miklos Szeredi, Andrew Morton, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team

On Fri, Dec 13, 2024 at 8:47 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> +Andrew
>
> On Fri, Dec 13, 2024 at 12:52:44PM +0100, Miklos Szeredi wrote:
> > On Sat, 23 Nov 2024 at 00:24, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > The purpose of this patchset is to help make writeback-cache write
> > > performance in FUSE filesystems as fast as possible.
> > >
> > > In the current FUSE writeback design (see commit 3be5a52b30aa
> > > ("fuse: support writable mmap")), a temp page is allocated for every dirty
> > > page to be written back, the contents of the dirty page are copied over to the
> > > temp page, and the temp page gets handed to the server to write back. This is
> > > done so that writeback may be immediately cleared on the dirty page, and this
> > > in turn is done for two reasons:
> > > a) in order to mitigate the following deadlock scenario that may arise if
> > > reclaim waits on writeback on the dirty page to complete (more details can be
> > > found in this thread [1]):
> > > * single-threaded FUSE server is in the middle of handling a request
> > >   that needs a memory allocation
> > > * memory allocation triggers direct reclaim
> > > * direct reclaim waits on a folio under writeback
> > > * the FUSE server can't write back the folio since it's stuck in
> > >   direct reclaim
> > > b) in order to unblock internal (eg sync, page compaction) waits on writeback
> > > without needing the server to complete writing back to disk, which may take
> > > an indeterminate amount of time.
> > >
> > > Allocating and copying dirty pages to temp pages is the biggest performance
> > > bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> > > altogether (which will also allow us to get rid of the internal FUSE rb tree
> > > that is needed to keep track of writeback status on the temp pages).
> > > Benchmarks show approximately a 20% improvement in throughput for 4k
> > > block-size writes and a 45% improvement for 1M block-size writes.
> > >
> > > With removing the temp page, writeback state is now only cleared on the dirty
> > > page after the server has written it back to disk. This may take an
> > > indeterminate amount of time. As well, there is also the possibility of
> > > malicious or well-intentioned but buggy servers where writeback may in the
> > > worst case scenario, never complete. This means that any
> > > folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to
> > > be carefully audited.
> > >
> > > In particular, these are the cases that need to be accounted for:
> > > * potentially deadlocking in reclaim, as mentioned above
> > > * potentially stalling sync(2)
> > > * potentially stalling page migration / compaction
> > >
> > > This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> > > filesystems may set on its inode mappings to indicate that writeback
> > > operations may take an indeterminate amount of time to complete. FUSE will set
> > > this flag on its mappings. This patchset adds checks to the critical parts of
> > > reclaim, sync, and page migration logic where writeback may be waited on.
> > >
> > > Please note the following:
> > > * For sync(2), waiting on writeback will be skipped for FUSE, but this has no
> > >   effect on existing behavior. Dirty FUSE pages are already not guaranteed to
> > >   be written to disk by the time sync(2) returns (eg writeback is cleared on
> > >   the dirty page but the server may not have written out the temp page to disk
> > >   yet). If the caller wishes to ensure the data has actually been synced to
> > >   disk, they should use fsync(2)/fdatasync(2) instead.
> > > * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
> > >   waited on when in writeback. There are some cases where the wait is
> > >   desirable. For example, for the sync_file_range() syscall, it is fine to
> > >   wait on the writeback since the caller passes in a fd for the operation.
> >
> > Looks good, thanks.
> >
> > Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> >
> > I think this should go via the mm tree.
>
> Andrew, can you please pick this series up or Joanne can send an updated
> version with all Acks/Review tag collected? Let us know what you prefer.
>

Hi Andrew,

Could you let us know your preference or if there's anything else you
need from us to proceed?


Thanks,
Joanne

> Thanks,
> Shakeel



* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback
  2024-12-18 17:37     ` Joanne Koong
@ 2024-12-18 17:44       ` Shakeel Butt
  2024-12-18 17:53         ` Joanne Koong
  0 siblings, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-18 17:44 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, Andrew Morton, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team

On Wed, Dec 18, 2024 at 09:37:37AM -0800, Joanne Koong wrote:
[...]
> 
> Hi Andrew,
> 
> Could you let us know your preference or if there's anything else you
> need from us to proceed?
> 

Andrew has already picked the series into mm-tree (mm-unstable).




* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback
  2024-12-18 17:44       ` Shakeel Butt
@ 2024-12-18 17:53         ` Joanne Koong
  0 siblings, 0 replies; 124+ messages in thread
From: Joanne Koong @ 2024-12-18 17:53 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Miklos Szeredi, Andrew Morton, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team

On Wed, Dec 18, 2024 at 9:44 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Dec 18, 2024 at 09:37:37AM -0800, Joanne Koong wrote:
> [...]
> >
> > Hi Andrew,
> >
> > Could you let us know your preference or if there's anything else you
> > need from us to proceed?
> >
>
> Andrew has already picked the series into mm-tree (mm-unstable).
>

Great, thanks.



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-11-22 23:23 ` [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with " Joanne Koong
@ 2024-12-19 13:05   ` David Hildenbrand
  2024-12-19 14:19     ` Zi Yan
                       ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: David Hildenbrand @ 2024-12-19 13:05 UTC (permalink / raw)
  To: Joanne Koong, miklos, linux-fsdevel
  Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm,
	kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko

On 23.11.24 00:23, Joanne Koong wrote:
> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> writeback may take an indeterminate amount of time to complete, and
> waits may get stuck.
> 
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>   mm/migrate.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index df91248755e4..fe73284e5246 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>   		 */
>   		switch (mode) {
>   		case MIGRATE_SYNC:
> -			break;
> +			if (!src->mapping ||
> +			    !mapping_writeback_indeterminate(src->mapping))
> +				break;
> +			fallthrough;
>   		default:
>   			rc = -EBUSY;
>   			goto out;

Ehm, doesn't this mean that any fuse user can essentially completely 
block CMA allocations, memory compaction, memory hotunplug, memory 
poisoning... ?!

That sounds very bad.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 13:05   ` David Hildenbrand
@ 2024-12-19 14:19     ` Zi Yan
  2024-12-19 15:08       ` Zi Yan
  2024-12-19 15:43     ` Shakeel Butt
  2025-04-02 21:34     ` Joanne Koong
  2 siblings, 1 reply; 124+ messages in thread
From: Zi Yan @ 2024-12-19 14:19 UTC (permalink / raw)
  To: David Hildenbrand, Joanne Koong
  Cc: miklos, linux-fsdevel, shakeel.butt, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 19 Dec 2024, at 8:05, David Hildenbrand wrote:

> On 23.11.24 00:23, Joanne Koong wrote:
>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>> writeback may take an indeterminate amount of time to complete, and
>> waits may get stuck.
>>
>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>> ---
>>   mm/migrate.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index df91248755e4..fe73284e5246 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>   		 */
>>   		switch (mode) {
>>   		case MIGRATE_SYNC:
>> -			break;
>> +			if (!src->mapping ||
>> +			    !mapping_writeback_indeterminate(src->mapping))
>> +				break;
>> +			fallthrough;
>>   		default:
>>   			rc = -EBUSY;
>>   			goto out;
>
> Ehm, doesn't this mean that any fuse user can essentially completely block CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>
> That sounds very bad.

Yeah, these writeback folios become unmovable. It makes memory fragmentation
unrecoverable. I do not know why AS_WRITEBACK_INDETERMINATE is allowed, since
it is essentially a forever pin to writeback folios. Why not introduce a
retry and timeout mechanism instead of waiting for the writeback forever?

--
Best Regards,
Yan, Zi



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 14:19     ` Zi Yan
@ 2024-12-19 15:08       ` Zi Yan
  2024-12-19 15:39         ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Zi Yan @ 2024-12-19 15:08 UTC (permalink / raw)
  To: David Hildenbrand, Joanne Koong
  Cc: miklos, linux-fsdevel, shakeel.butt, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 19 Dec 2024, at 9:19, Zi Yan wrote:

> On 19 Dec 2024, at 8:05, David Hildenbrand wrote:
>
>> On 23.11.24 00:23, Joanne Koong wrote:
>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>> writeback may take an indeterminate amount of time to complete, and
>>> waits may get stuck.
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>> ---
>>>   mm/migrate.c | 5 ++++-
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index df91248755e4..fe73284e5246 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>   		 */
>>>   		switch (mode) {
>>>   		case MIGRATE_SYNC:
>>> -			break;
>>> +			if (!src->mapping ||
>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>> +				break;
>>> +			fallthrough;
>>>   		default:
>>>   			rc = -EBUSY;
>>>   			goto out;
>>
>> Ehm, doesn't this mean that any fuse user can essentially completely block CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>
>> That sounds very bad.
>
> Yeah, these writeback folios become unmovable. It makes memory fragmentation
> unrecoverable. I do not know why AS_WRITEBACK_INDETERMINATE is allowed, since
> it is essentially a forever pin to writeback folios. Why not introduce a
> retry and timeout mechanism instead of waiting for the writeback forever?

If there is no way around such indeterminate writebacks, then to avoid
fragmenting memory, these to-be-written-back folios should be migrated to a
physically contiguous region: either a preallocated region or free pages
taken from MIGRATE_UNMOVABLE.

--
Best Regards,
Yan, Zi



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 15:08       ` Zi Yan
@ 2024-12-19 15:39         ` David Hildenbrand
  2024-12-19 15:47           ` Zi Yan
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2024-12-19 15:39 UTC (permalink / raw)
  To: Zi Yan, Joanne Koong
  Cc: miklos, linux-fsdevel, shakeel.butt, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 19.12.24 16:08, Zi Yan wrote:
> On 19 Dec 2024, at 9:19, Zi Yan wrote:
> 
>> On 19 Dec 2024, at 8:05, David Hildenbrand wrote:
>>
>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>> writeback may take an indeterminate amount of time to complete, and
>>>> waits may get stuck.
>>>>
>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>> ---
>>>>    mm/migrate.c | 5 ++++-
>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index df91248755e4..fe73284e5246 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>    		 */
>>>>    		switch (mode) {
>>>>    		case MIGRATE_SYNC:
>>>> -			break;
>>>> +			if (!src->mapping ||
>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>> +				break;
>>>> +			fallthrough;
>>>>    		default:
>>>>    			rc = -EBUSY;
>>>>    			goto out;
>>>
>>> Ehm, doesn't this mean that any fuse user can essentially completely block CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>
>>> That sounds very bad.
>>
>> Yeah, these writeback folios become unmovable. It makes memory fragmentation
>> unrecoverable. I do not know why AS_WRITEBACK_INDETERMINATE is allowed, since
>> it is essentially a forever pin to writeback folios. Why not introduce a
>> retry and timeout mechanism instead of waiting for the writeback forever?
> 
> If there is no way around such indeterminate writebacks, then to avoid
> fragmenting memory, these to-be-written-back folios should be migrated to a
> physically contiguous region: either a preallocated region or free pages
> taken from MIGRATE_UNMOVABLE.

But at what point?

We surely don't want to make fuse consume only effectively-unmovable memory.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 13:05   ` David Hildenbrand
  2024-12-19 14:19     ` Zi Yan
@ 2024-12-19 15:43     ` Shakeel Butt
  2024-12-19 15:47       ` David Hildenbrand
  2025-04-02 21:34     ` Joanne Koong
  2 siblings, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-19 15:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Joanne Koong, miklos, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
> On 23.11.24 00:23, Joanne Koong wrote:
> > For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> > it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> > mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> > writeback may take an indeterminate amount of time to complete, and
> > waits may get stuck.
> > 
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> > ---
> >   mm/migrate.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index df91248755e4..fe73284e5246 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
> >   		 */
> >   		switch (mode) {
> >   		case MIGRATE_SYNC:
> > -			break;
> > +			if (!src->mapping ||
> > +			    !mapping_writeback_indeterminate(src->mapping))
> > +				break;
> > +			fallthrough;
> >   		default:
> >   			rc = -EBUSY;
> >   			goto out;
> 
> Ehm, doesn't this mean that any fuse user can essentially completely block
> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
> 
> That sounds very bad.

Pages under writeback are already unmovable for as long as they are under
writeback. This patch only stops potentially unrelated tasks from
synchronously waiting on writeback completion for such pages, a wait that in
the worst case can be indefinite. This actually solves an isolation issue on
a multi-tenant machine.
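
The decision the quoted hunk makes for a folio under writeback can be modelled
in userspace as below. This is only a sketch under stated assumptions: every
model_* and MODEL_* name is hypothetical and stands in for the kernel types,
the flag is reduced to a single bit, and the kernel's separate "force" check
before the wait is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's migrate modes. */
enum model_migrate_mode {
	MODEL_MIGRATE_ASYNC,
	MODEL_MIGRATE_SYNC_LIGHT,
	MODEL_MIGRATE_SYNC,
};

/* What the migration path does with the source folio. */
enum model_wb_action {
	MODEL_WB_PROCEED,	/* not under writeback: nothing to wait for */
	MODEL_WB_WAIT,		/* wait for writeback, then migrate */
	MODEL_WB_SKIP_EBUSY,	/* give up on this folio with -EBUSY */
};

/* Hypothetical one-bit model of the AS_WRITEBACK_INDETERMINATE flag. */
#define MODEL_AS_WRITEBACK_INDETERMINATE (1UL << 0)

struct model_mapping {
	unsigned long flags;
};

struct model_folio {
	struct model_mapping *mapping;	/* may be NULL, as in the hunk */
	bool writeback;
};

enum model_wb_action model_writeback_action(const struct model_folio *src,
					    enum model_migrate_mode mode)
{
	assert(src != NULL);

	if (!src->writeback)
		return MODEL_WB_PROCEED;

	/* Only MIGRATE_SYNC may wait, and only on determinate mappings. */
	if (mode == MODEL_MIGRATE_SYNC &&
	    (!src->mapping ||
	     !(src->mapping->flags & MODEL_AS_WRITEBACK_INDETERMINATE)))
		return MODEL_WB_WAIT;

	/* Every other mode, and flagged mappings, skip the folio. */
	return MODEL_WB_SKIP_EBUSY;
}
```

In this model the only behavioural change from the patch is the flagged
mapping in MIGRATE_SYNC mode, which moves from "wait" to "skip with -EBUSY";
the other modes already skipped folios under writeback.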



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 15:39         ` David Hildenbrand
@ 2024-12-19 15:47           ` Zi Yan
  2024-12-19 15:50             ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Zi Yan @ 2024-12-19 15:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Joanne Koong, miklos, linux-fsdevel, shakeel.butt, jefflexu,
	josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 19 Dec 2024, at 10:39, David Hildenbrand wrote:

> On 19.12.24 16:08, Zi Yan wrote:
>> On 19 Dec 2024, at 9:19, Zi Yan wrote:
>>
>>> On 19 Dec 2024, at 8:05, David Hildenbrand wrote:
>>>
>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>> waits may get stuck.
>>>>>
>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>> ---
>>>>>    mm/migrate.c | 5 ++++-
>>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index df91248755e4..fe73284e5246 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>    		 */
>>>>>    		switch (mode) {
>>>>>    		case MIGRATE_SYNC:
>>>>> -			break;
>>>>> +			if (!src->mapping ||
>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>> +				break;
>>>>> +			fallthrough;
>>>>>    		default:
>>>>>    			rc = -EBUSY;
>>>>>    			goto out;
>>>>
>>>> Ehm, doesn't this mean that any fuse user can essentially completely block CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>
>>>> That sounds very bad.
>>>
>>> Yeah, these writeback folios become unmovable. It makes memory fragmentation
>>> unrecoverable. I do not know why AS_WRITEBACK_INDETERMINATE is allowed, since
>>> it is essentially a forever pin to writeback folios. Why not introduce a
>>> retry and timeout mechanism instead of waiting for the writeback forever?
>>
>> If there is no way around such indeterminate writebacks, to avoid fragment memory,
>> these to-be-written-back folios should be migrated to a physically contiguous region. Either you have a preallocated region or get free pages from MIGRATE_UNMOVABLE.
>
> But at what point?

Before each writeback. And there should be a limit on the amount of unmovable
pages they can allocate.

>
> We surely don't want to make fuse consume only effectively-unmovable memory.

Yes, that is undesirable, but a folio under writeback cannot be migrated,
since migration needs to wait until the writeback finishes. Of course, the
right way is to make writeback interruptible, so that migration can continue,
but that might take a lot of effort, I suppose. I admit my proposal is more
like a band-aid to minimize the memory fragmentation issue.

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 15:43     ` Shakeel Butt
@ 2024-12-19 15:47       ` David Hildenbrand
  2024-12-19 15:53         ` Shakeel Butt
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2024-12-19 15:47 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Joanne Koong, miklos, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan,
	Oscar Salvador, Michal Hocko

On 19.12.24 16:43, Shakeel Butt wrote:
> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
>> On 23.11.24 00:23, Joanne Koong wrote:
>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>> writeback may take an indeterminate amount of time to complete, and
>>> waits may get stuck.
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>> ---
>>>    mm/migrate.c | 5 ++++-
>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index df91248755e4..fe73284e5246 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>    		 */
>>>    		switch (mode) {
>>>    		case MIGRATE_SYNC:
>>> -			break;
>>> +			if (!src->mapping ||
>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>> +				break;
>>> +			fallthrough;
>>>    		default:
>>>    			rc = -EBUSY;
>>>    			goto out;
>>
>> Ehm, doesn't this mean that any fuse user can essentially completely block
>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>
>> That sounds very bad.
> 
> The page under writeback are already unmovable while they are under
> writeback. This patch is only making potentially unrelated tasks to
> synchronously wait on writeback completion for such pages which in worst
> case can be indefinite. This actually is solving an isolation issue on a
> multi-tenant machine.
> 
Are you sure? I read in the cover letter:

"In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: 
support writable mmap"))), a temp page is allocated for every dirty
page to be written back, the contents of the dirty page are copied over 
to the temp page, and the temp page gets handed to the server to write 
back. This is done so that writeback may be immediately cleared on the 
dirty page,"

Which to me means that they are immediately movable again?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 15:47           ` Zi Yan
@ 2024-12-19 15:50             ` David Hildenbrand
  0 siblings, 0 replies; 124+ messages in thread
From: David Hildenbrand @ 2024-12-19 15:50 UTC (permalink / raw)
  To: Zi Yan
  Cc: Joanne Koong, miklos, linux-fsdevel, shakeel.butt, jefflexu,
	josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 19.12.24 16:47, Zi Yan wrote:
> On 19 Dec 2024, at 10:39, David Hildenbrand wrote:
> 
>> On 19.12.24 16:08, Zi Yan wrote:
>>> On 19 Dec 2024, at 9:19, Zi Yan wrote:
>>>
>>>> On 19 Dec 2024, at 8:05, David Hildenbrand wrote:
>>>>
>>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>>> waits may get stuck.
>>>>>>
>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>> ---
>>>>>>     mm/migrate.c | 5 ++++-
>>>>>>     1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>> index df91248755e4..fe73284e5246 100644
>>>>>> --- a/mm/migrate.c
>>>>>> +++ b/mm/migrate.c
>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>>     		 */
>>>>>>     		switch (mode) {
>>>>>>     		case MIGRATE_SYNC:
>>>>>> -			break;
>>>>>> +			if (!src->mapping ||
>>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>>> +				break;
>>>>>> +			fallthrough;
>>>>>>     		default:
>>>>>>     			rc = -EBUSY;
>>>>>>     			goto out;
>>>>>
>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>>
>>>>> That sounds very bad.
>>>>
>>>> Yeah, these writeback folios become unmovable. It makes memory fragmentation
>>>> unrecoverable. I do not know why AS_WRITEBACK_INDETERMINATE is allowed, since
>>>> it is essentially a forever pin to writeback folios. Why not introduce a
>>>> retry and timeout mechanism instead of waiting for the writeback forever?
>>>
>>> If there is no way around such indeterminate writebacks, to avoid fragment memory,
>>> these to-be-written-back folios should be migrated to a physically contiguous region. Either you have a preallocated region or get free pages from MIGRATE_UNMOVABLE.
>>
>> But at what point?
> 
> Before each writeback. And there should be a limit on the amount of unmovable
> pages they can allocate.

The question is whether that is then still a performance win :) But yes, we
can avoid another migration if we are already on allows-movable memory.

> 
>>
>> We surely don't want to make fuse consume only effectively-unmovable memory.
> 
> Yes, that is undesirable, but the folio under writeback cannot be migrated,
> since migration needs to wait until its finish.

Right, and currently that works by immediately marking the folio clean
again (IIUC after reading the cover letter).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 15:47       ` David Hildenbrand
@ 2024-12-19 15:53         ` Shakeel Butt
  2024-12-19 15:55           ` Zi Yan
  0 siblings, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-19 15:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Joanne Koong, miklos, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote:
> On 19.12.24 16:43, Shakeel Butt wrote:
> > On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
> > > On 23.11.24 00:23, Joanne Koong wrote:
> > > > For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> > > > it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> > > > mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> > > > writeback may take an indeterminate amount of time to complete, and
> > > > waits may get stuck.
> > > > 
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > > ---
> > > >    mm/migrate.c | 5 ++++-
> > > >    1 file changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > index df91248755e4..fe73284e5246 100644
> > > > --- a/mm/migrate.c
> > > > +++ b/mm/migrate.c
> > > > @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
> > > >    		 */
> > > >    		switch (mode) {
> > > >    		case MIGRATE_SYNC:
> > > > -			break;
> > > > +			if (!src->mapping ||
> > > > +			    !mapping_writeback_indeterminate(src->mapping))
> > > > +				break;
> > > > +			fallthrough;
> > > >    		default:
> > > >    			rc = -EBUSY;
> > > >    			goto out;
> > > 
> > > Ehm, doesn't this mean that any fuse user can essentially completely block
> > > CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
> > > 
> > > That sounds very bad.
> > 
> > The page under writeback are already unmovable while they are under
> > writeback. This patch is only making potentially unrelated tasks to
> > synchronously wait on writeback completion for such pages which in worst
> > case can be indefinite. This actually is solving an isolation issue on a
> > multi-tenant machine.
> > 
> Are you sure, because I read in the cover letter:
> 
> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse:
> support writable mmap"))), a temp page is allocated for every dirty
> page to be written back, the contents of the dirty page are copied over to
> the temp page, and the temp page gets handed to the server to write back.
> This is done so that writeback may be immediately cleared on the dirty
> page,"
> 
> Which to me means that they are immediately movable again?

Oh sorry, my mistake. Yes, this will become an isolation issue once the
temp page in between is removed, which is what this series does. I think
the tradeoff is between extra memory plus slow write performance versus
temporarily unmovable memory.

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
> 


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 15:53         ` Shakeel Butt
@ 2024-12-19 15:55           ` Zi Yan
  2024-12-19 15:56             ` Bernd Schubert
  2024-12-19 16:22             ` Shakeel Butt
  0 siblings, 2 replies; 124+ messages in thread
From: Zi Yan @ 2024-12-19 15:55 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David Hildenbrand, Joanne Koong, miklos, linux-fsdevel, jefflexu,
	josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 19 Dec 2024, at 10:53, Shakeel Butt wrote:

> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote:
>> On 19.12.24 16:43, Shakeel Butt wrote:
>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>> waits may get stuck.
>>>>>
>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>> ---
>>>>>    mm/migrate.c | 5 ++++-
>>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index df91248755e4..fe73284e5246 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>    		 */
>>>>>    		switch (mode) {
>>>>>    		case MIGRATE_SYNC:
>>>>> -			break;
>>>>> +			if (!src->mapping ||
>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>> +				break;
>>>>> +			fallthrough;
>>>>>    		default:
>>>>>    			rc = -EBUSY;
>>>>>    			goto out;
>>>>
>>>> Ehm, doesn't this mean that any fuse user can essentially completely block
>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>
>>>> That sounds very bad.
>>>
>>> The page under writeback are already unmovable while they are under
>>> writeback. This patch is only making potentially unrelated tasks to
>>> synchronously wait on writeback completion for such pages which in worst
>>> case can be indefinite. This actually is solving an isolation issue on a
>>> multi-tenant machine.
>>>
>> Are you sure, because I read in the cover letter:
>>
>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse:
>> support writable mmap"))), a temp page is allocated for every dirty
>> page to be written back, the contents of the dirty page are copied over to
>> the temp page, and the temp page gets handed to the server to write back.
>> This is done so that writeback may be immediately cleared on the dirty
>> page,"
>>
>> Which to me means that they are immediately movable again?
>
> Oh sorry, my mistake, yes this will become an isolation issue with the
> removal of the temp page in-between which this series is doing. I think
> the tradeoff is between extra memory plus slow write performance versus
> temporary unmovable memory.

No, the tradeoff is slow FUSE performance vs whole system slowdown due to
memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not
temporary.

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 15:55           ` Zi Yan
@ 2024-12-19 15:56             ` Bernd Schubert
  2024-12-19 16:00               ` Zi Yan
  2024-12-19 16:22             ` Shakeel Butt
  1 sibling, 1 reply; 124+ messages in thread
From: Bernd Schubert @ 2024-12-19 15:56 UTC (permalink / raw)
  To: Zi Yan, Shakeel Butt
  Cc: David Hildenbrand, Joanne Koong, miklos, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko



On 12/19/24 16:55, Zi Yan wrote:
> On 19 Dec 2024, at 10:53, Shakeel Butt wrote:
> 
>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote:
>>> On 19.12.24 16:43, Shakeel Butt wrote:
>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
>>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>>> waits may get stuck.
>>>>>>
>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>> ---
>>>>>>    mm/migrate.c | 5 ++++-
>>>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>> index df91248755e4..fe73284e5246 100644
>>>>>> --- a/mm/migrate.c
>>>>>> +++ b/mm/migrate.c
>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>>    		 */
>>>>>>    		switch (mode) {
>>>>>>    		case MIGRATE_SYNC:
>>>>>> -			break;
>>>>>> +			if (!src->mapping ||
>>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>>> +				break;
>>>>>> +			fallthrough;
>>>>>>    		default:
>>>>>>    			rc = -EBUSY;
>>>>>>    			goto out;
>>>>>
>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block
>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>>
>>>>> That sounds very bad.
>>>>
>>>> The page under writeback are already unmovable while they are under
>>>> writeback. This patch is only making potentially unrelated tasks to
>>>> synchronously wait on writeback completion for such pages which in worst
>>>> case can be indefinite. This actually is solving an isolation issue on a
>>>> multi-tenant machine.
>>>>
>>> Are you sure, because I read in the cover letter:
>>>
>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse:
>>> support writable mmap"))), a temp page is allocated for every dirty
>>> page to be written back, the contents of the dirty page are copied over to
>>> the temp page, and the temp page gets handed to the server to write back.
>>> This is done so that writeback may be immediately cleared on the dirty
>>> page,"
>>>
>>> Which to me means that they are immediately movable again?
>>
>> Oh sorry, my mistake, yes this will become an isolation issue with the
>> removal of the temp page in-between which this series is doing. I think
>> the tradeoff is between extra memory plus slow write performance versus
>> temporary unmovable memory.
> 
> No, the tradeoff is slow FUSE performance vs whole system slowdown due to
> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not
> temporary.

Is there a difference between a FUSE TMP page being unmovable and
AS_WRITEBACK_INDETERMINATE folios/pages being unmovable?


Thanks,
Bernd


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 15:56             ` Bernd Schubert
@ 2024-12-19 16:00               ` Zi Yan
  2024-12-19 16:02                 ` Zi Yan
  0 siblings, 1 reply; 124+ messages in thread
From: Zi Yan @ 2024-12-19 16:00 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Shakeel Butt, David Hildenbrand, Joanne Koong, miklos,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko



--
Best Regards,
Yan, Zi

On 19 Dec 2024, at 10:56, Bernd Schubert wrote:

> On 12/19/24 16:55, Zi Yan wrote:
>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote:
>>
>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote:
>>>> On 19.12.24 16:43, Shakeel Butt wrote:
>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
>>>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>>>> waits may get stuck.
>>>>>>>
>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>>> ---
>>>>>>>    mm/migrate.c | 5 ++++-
>>>>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>>> index df91248755e4..fe73284e5246 100644
>>>>>>> --- a/mm/migrate.c
>>>>>>> +++ b/mm/migrate.c
>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>>>    		 */
>>>>>>>    		switch (mode) {
>>>>>>>    		case MIGRATE_SYNC:
>>>>>>> -			break;
>>>>>>> +			if (!src->mapping ||
>>>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>>>> +				break;
>>>>>>> +			fallthrough;
>>>>>>>    		default:
>>>>>>>    			rc = -EBUSY;
>>>>>>>    			goto out;
>>>>>>
>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block
>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>>>
>>>>>> That sounds very bad.
>>>>>
>>>>> The page under writeback are already unmovable while they are under
>>>>> writeback. This patch is only making potentially unrelated tasks to
>>>>> synchronously wait on writeback completion for such pages which in worst
>>>>> case can be indefinite. This actually is solving an isolation issue on a
>>>>> multi-tenant machine.
>>>>>
>>>> Are you sure, because I read in the cover letter:
>>>>
>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse:
>>>> support writable mmap"))), a temp page is allocated for every dirty
>>>> page to be written back, the contents of the dirty page are copied over to
>>>> the temp page, and the temp page gets handed to the server to write back.
>>>> This is done so that writeback may be immediately cleared on the dirty
>>>> page,"
>>>>
>>>> Which to me means that they are immediately movable again?
>>>
>>> Oh sorry, my mistake, yes this will become an isolation issue with the
>>> removal of the temp page in-between which this series is doing. I think
>>> the tradeoff is between extra memory plus slow write performance versus
>>> temporary unmovable memory.
>>
>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to
>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not
>> temporary.
>
> Is there is a difference between FUSE TMP page being unmovable and
> AS_WRITEBACK_INDETERMINATE folios/pages being unmovable?

Both are unmovable, but you can control where FUSE TMP pages
come from, to avoid spreading them across the entire memory space. For
example, allocate a contiguous region as a TMP page pool.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 16:00               ` Zi Yan
@ 2024-12-19 16:02                 ` Zi Yan
  2024-12-19 16:09                   ` Bernd Schubert
  0 siblings, 1 reply; 124+ messages in thread
From: Zi Yan @ 2024-12-19 16:02 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Shakeel Butt, David Hildenbrand, Joanne Koong, miklos,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On 19 Dec 2024, at 11:00, Zi Yan wrote:
> On 19 Dec 2024, at 10:56, Bernd Schubert wrote:
>
>> On 12/19/24 16:55, Zi Yan wrote:
>>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote:
>>>
>>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote:
>>>>> On 19.12.24 16:43, Shakeel Butt wrote:
>>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
>>>>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>>>>> waits may get stuck.
>>>>>>>>
>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>>>> ---
>>>>>>>>    mm/migrate.c | 5 ++++-
>>>>>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>>>> index df91248755e4..fe73284e5246 100644
>>>>>>>> --- a/mm/migrate.c
>>>>>>>> +++ b/mm/migrate.c
>>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>>>>    		 */
>>>>>>>>    		switch (mode) {
>>>>>>>>    		case MIGRATE_SYNC:
>>>>>>>> -			break;
>>>>>>>> +			if (!src->mapping ||
>>>>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>>>>> +				break;
>>>>>>>> +			fallthrough;
>>>>>>>>    		default:
>>>>>>>>    			rc = -EBUSY;
>>>>>>>>    			goto out;
>>>>>>>
>>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block
>>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>>>>
>>>>>>> That sounds very bad.
>>>>>>
>>>>>> The page under writeback are already unmovable while they are under
>>>>>> writeback. This patch is only making potentially unrelated tasks to
>>>>>> synchronously wait on writeback completion for such pages which in worst
>>>>>> case can be indefinite. This actually is solving an isolation issue on a
>>>>>> multi-tenant machine.
>>>>>>
>>>>> Are you sure, because I read in the cover letter:
>>>>>
>>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse:
>>>>> support writable mmap"))), a temp page is allocated for every dirty
>>>>> page to be written back, the contents of the dirty page are copied over to
>>>>> the temp page, and the temp page gets handed to the server to write back.
>>>>> This is done so that writeback may be immediately cleared on the dirty
>>>>> page,"
>>>>>
>>>>> Which to me means that they are immediately movable again?
>>>>
>>>> Oh sorry, my mistake, yes this will become an isolation issue with the
>>>> removal of the temp page in-between which this series is doing. I think
>>>> the tradeoff is between extra memory plus slow write performance versus
>>>> temporary unmovable memory.
>>>
>>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to
>>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not
>>> temporary.
>>
>> Is there is a difference between FUSE TMP page being unmovable and
>> AS_WRITEBACK_INDETERMINATE folios/pages being unmovable?

(Fix my response location)

Both are unmovable, but you can control where FUSE TMP pages
come from, to avoid spreading them across the entire memory space. For
example, allocate a contiguous region as a TMP page pool.
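
As a rough illustration of the pool idea, here is a userspace sketch of a
fixed-size pool carved out of one contiguous region, with a hard cap on how
many pages can be handed out. It is not FUSE or mm code: tmp_page_pool,
pool_get(), pool_put(), POOL_PAGES and POOL_PAGE_SIZE are all hypothetical
names chosen for the example, and a real implementation would need locking
and a way for callers to wait when the pool is empty.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_PAGES	8	/* hypothetical cap on pinned pages */
#define POOL_PAGE_SIZE	4096

struct tmp_page_pool {
	/* One contiguous region, so pinned pages never spread out. */
	unsigned char mem[POOL_PAGES][POOL_PAGE_SIZE];
	bool used[POOL_PAGES];
};

/* Hand out a free page, or NULL once the cap is reached. */
static void *pool_get(struct tmp_page_pool *p)
{
	for (size_t i = 0; i < POOL_PAGES; i++) {
		if (!p->used[i]) {
			p->used[i] = true;
			return p->mem[i];
		}
	}
	return NULL;	/* caller has to throttle or retry later */
}

/* Return a page to the pool; false if it is not a live pool page. */
static bool pool_put(struct tmp_page_pool *p, void *page)
{
	uintptr_t base = (uintptr_t)&p->mem[0][0];
	uintptr_t addr = (uintptr_t)page;
	size_t idx;

	if (addr < base || addr >= base + sizeof(p->mem))
		return false;
	if ((addr - base) % POOL_PAGE_SIZE)
		return false;

	idx = (addr - base) / POOL_PAGE_SIZE;
	if (!p->used[idx])
		return false;

	p->used[idx] = false;
	return true;
}
```

The cap is what bounds the amount of effectively unmovable memory; whether
such a pool would still be a performance win over the current temp-page copy
is the open question raised earlier in the thread.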

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 16:02                 ` Zi Yan
@ 2024-12-19 16:09                   ` Bernd Schubert
  2024-12-19 16:14                     ` Zi Yan
  0 siblings, 1 reply; 124+ messages in thread
From: Bernd Schubert @ 2024-12-19 16:09 UTC (permalink / raw)
  To: Zi Yan
  Cc: Shakeel Butt, David Hildenbrand, Joanne Koong, miklos,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko



On 12/19/24 17:02, Zi Yan wrote:
> On 19 Dec 2024, at 11:00, Zi Yan wrote:
>> On 19 Dec 2024, at 10:56, Bernd Schubert wrote:
>>
>>> On 12/19/24 16:55, Zi Yan wrote:
>>>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote:
>>>>
>>>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote:
>>>>>> On 19.12.24 16:43, Shakeel Butt wrote:
>>>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
>>>>>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>>>>>> waits may get stuck.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>>>>> ---
>>>>>>>>>    mm/migrate.c | 5 ++++-
>>>>>>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>>>>> index df91248755e4..fe73284e5246 100644
>>>>>>>>> --- a/mm/migrate.c
>>>>>>>>> +++ b/mm/migrate.c
>>>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>>>>>    		 */
>>>>>>>>>    		switch (mode) {
>>>>>>>>>    		case MIGRATE_SYNC:
>>>>>>>>> -			break;
>>>>>>>>> +			if (!src->mapping ||
>>>>>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>>>>>> +				break;
>>>>>>>>> +			fallthrough;
>>>>>>>>>    		default:
>>>>>>>>>    			rc = -EBUSY;
>>>>>>>>>    			goto out;
>>>>>>>>
>>>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block
>>>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>>>>>
>>>>>>>> That sounds very bad.
>>>>>>>
>>>>>>> The page under writeback are already unmovable while they are under
>>>>>>> writeback. This patch is only making potentially unrelated tasks to
>>>>>>> synchronously wait on writeback completion for such pages which in worst
>>>>>>> case can be indefinite. This actually is solving an isolation issue on a
>>>>>>> multi-tenant machine.
>>>>>>>
>>>>>> Are you sure, because I read in the cover letter:
>>>>>>
>>>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse:
>>>>>> support writable mmap"))), a temp page is allocated for every dirty
>>>>>> page to be written back, the contents of the dirty page are copied over to
>>>>>> the temp page, and the temp page gets handed to the server to write back.
>>>>>> This is done so that writeback may be immediately cleared on the dirty
>>>>>> page,"
>>>>>>
>>>>>> Which to me means that they are immediately movable again?
>>>>>
>>>>> Oh sorry, my mistake, yes this will become an isolation issue with the
>>>>> removal of the temp page in-between which this series is doing. I think
>>>>> the tradeoff is between extra memory plus slow write performance versus
>>>>> temporary unmovable memory.
>>>>
>>>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to
>>>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not
>>>> temporary.
>>>
>>> Is there is a difference between FUSE TMP page being unmovable and
>>> AS_WRITEBACK_INDETERMINATE folios/pages being unmovable?
> 
> (Fix my response location)
> 
> Both are unmovable, but you can control where FUSE TMP page
> can come from to avoid spread across the entire memory space. For example,
> allocate a contiguous region as a TMP page pool.

Wouldn't it make sense to have that for fuse writeback pages as well?
Fuse tries to limit dirty pages anyway.


Thanks,
Bernd


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 16:09                   ` Bernd Schubert
@ 2024-12-19 16:14                     ` Zi Yan
  2024-12-19 16:26                       ` Shakeel Butt
  0 siblings, 1 reply; 124+ messages in thread
From: Zi Yan @ 2024-12-19 16:14 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Shakeel Butt, David Hildenbrand, Joanne Koong, miklos,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On 19 Dec 2024, at 11:09, Bernd Schubert wrote:

> On 12/19/24 17:02, Zi Yan wrote:
>> On 19 Dec 2024, at 11:00, Zi Yan wrote:
>>> On 19 Dec 2024, at 10:56, Bernd Schubert wrote:
>>>
>>>> On 12/19/24 16:55, Zi Yan wrote:
>>>>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote:
>>>>>
>>>>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote:
>>>>>>> On 19.12.24 16:43, Shakeel Butt wrote:
>>>>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
>>>>>>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>>>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>>>>>>> waits may get stuck.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>>>>>> ---
>>>>>>>>>>    mm/migrate.c | 5 ++++-
>>>>>>>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>>>>>> index df91248755e4..fe73284e5246 100644
>>>>>>>>>> --- a/mm/migrate.c
>>>>>>>>>> +++ b/mm/migrate.c
>>>>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>>>>>>    		 */
>>>>>>>>>>    		switch (mode) {
>>>>>>>>>>    		case MIGRATE_SYNC:
>>>>>>>>>> -			break;
>>>>>>>>>> +			if (!src->mapping ||
>>>>>>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>>>>>>> +				break;
>>>>>>>>>> +			fallthrough;
>>>>>>>>>>    		default:
>>>>>>>>>>    			rc = -EBUSY;
>>>>>>>>>>    			goto out;
>>>>>>>>>
>>>>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block
>>>>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>>>>>>
>>>>>>>>> That sounds very bad.
>>>>>>>>
>>>>>>>> The page under writeback are already unmovable while they are under
>>>>>>>> writeback. This patch is only making potentially unrelated tasks to
>>>>>>>> synchronously wait on writeback completion for such pages which in worst
>>>>>>>> case can be indefinite. This actually is solving an isolation issue on a
>>>>>>>> multi-tenant machine.
>>>>>>>>
>>>>>>> Are you sure, because I read in the cover letter:
>>>>>>>
>>>>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse:
>>>>>>> support writable mmap"))), a temp page is allocated for every dirty
>>>>>>> page to be written back, the contents of the dirty page are copied over to
>>>>>>> the temp page, and the temp page gets handed to the server to write back.
>>>>>>> This is done so that writeback may be immediately cleared on the dirty
>>>>>>> page,"
>>>>>>>
>>>>>>> Which to me means that they are immediately movable again?
>>>>>>
>>>>>> Oh sorry, my mistake, yes this will become an isolation issue with the
>>>>>> removal of the temp page in-between which this series is doing. I think
>>>>>> the tradeoff is between extra memory plus slow write performance versus
>>>>>> temporary unmovable memory.
>>>>>
>>>>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to
>>>>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not
>>>>> temporary.
>>>>
>>>> Is there a difference between FUSE TMP page being unmovable and
>>>> AS_WRITEBACK_INDETERMINATE folios/pages being unmovable?
>>
>> (Fix my response location)
>>
>> Both are unmovable, but you can control where FUSE TMP page
>> can come from to avoid spread across the entire memory space. For example,
>> allocate a contiguous region as a TMP page pool.
>
> Wouldn't it make sense to have that for fuse writeback pages as well?
> Fuse tries to limit dirty pages anyway.

Can fuse constrain the location of writeback pages? Something like what
I proposed[1], migrating pages to a location before their writeback? Will
that be a performance concern?

In terms of the number of dirty pages, you only need one page out of 512
pages to prevent a 2MB THP from being allocated. For CMA allocation, one unmovable
page can kill one contiguous range. What is the limit of fuse dirty pages?
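
The arithmetic behind the "one page out of 512" figure can be sketched in
user space. This is only an illustration: it assumes 4 KiB base pages and
2 MiB huge pages, and the helper names pages_per_huge_page() and
worst_case_blocked_blocks() are hypothetical, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Number of base pages covered by one huge page (2 MiB / 4 KiB = 512). */
static size_t pages_per_huge_page(size_t huge_page_bytes, size_t base_page_bytes)
{
	assert(base_page_bytes > 0 && huge_page_bytes % base_page_bytes == 0);
	return huge_page_bytes / base_page_bytes;
}

/*
 * Worst case: every unmovable page lands in a different huge-page-sized
 * block, so the number of blocked blocks is min(unmovable_pages, total_blocks).
 */
static size_t worst_case_blocked_blocks(size_t unmovable_pages, size_t total_blocks)
{
	return unmovable_pages < total_blocks ? unmovable_pages : total_blocks;
}
```

Under these assumptions, 1 GiB of memory holds 512 blocks of 2 MiB, so 512
badly placed unmovable pages (2 MiB in total, roughly 0.2% of the range)
could in the worst case block THP allocation across the whole range.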

[1] https://lore.kernel.org/linux-mm/90C41581-179F-40B6-9801-9C9DBBEB1AF4@nvidia.com/

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 15:55           ` Zi Yan
  2024-12-19 15:56             ` Bernd Schubert
@ 2024-12-19 16:22             ` Shakeel Butt
  2024-12-19 16:29               ` David Hildenbrand
  1 sibling, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-19 16:22 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Joanne Koong, miklos, linux-fsdevel, jefflexu,
	josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 10:55:10AM -0500, Zi Yan wrote:
> On 19 Dec 2024, at 10:53, Shakeel Butt wrote:
> 
> > On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote:
> >> On 19.12.24 16:43, Shakeel Butt wrote:
> >>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
> >>>> On 23.11.24 00:23, Joanne Koong wrote:
> >>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> >>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> >>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> >>>>> writeback may take an indeterminate amount of time to complete, and
> >>>>> waits may get stuck.
> >>>>>
> >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> >>>>> ---
> >>>>>    mm/migrate.c | 5 ++++-
> >>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>>>> index df91248755e4..fe73284e5246 100644
> >>>>> --- a/mm/migrate.c
> >>>>> +++ b/mm/migrate.c
> >>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
> >>>>>    		 */
> >>>>>    		switch (mode) {
> >>>>>    		case MIGRATE_SYNC:
> >>>>> -			break;
> >>>>> +			if (!src->mapping ||
> >>>>> +			    !mapping_writeback_indeterminate(src->mapping))
> >>>>> +				break;
> >>>>> +			fallthrough;
> >>>>>    		default:
> >>>>>    			rc = -EBUSY;
> >>>>>    			goto out;
> >>>>
> >>>> Ehm, doesn't this mean that any fuse user can essentially completely block
> >>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
> >>>>
> >>>> That sounds very bad.
> >>>
> >>> The page under writeback are already unmovable while they are under
> >>> writeback. This patch is only making potentially unrelated tasks to
> >>> synchronously wait on writeback completion for such pages which in worst
> >>> case can be indefinite. This actually is solving an isolation issue on a
> >>> multi-tenant machine.
> >>>
> >> Are you sure, because I read in the cover letter:
> >>
> >> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse:
> >> support writable mmap"))), a temp page is allocated for every dirty
> >> page to be written back, the contents of the dirty page are copied over to
> >> the temp page, and the temp page gets handed to the server to write back.
> >> This is done so that writeback may be immediately cleared on the dirty
> >> page,"
> >>
> >> Which to me means that they are immediately movable again?
> >
> > Oh sorry, my mistake, yes this will become an isolation issue with the
> > removal of the temp page in-between which this series is doing. I think
> > the tradeoff is between extra memory plus slow write performance versus
> > temporary unmovable memory.
> 
> No, the tradeoff is slow FUSE performance vs whole system slowdown due to
> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not
> temporary.

If you check the code just above this patch, this
mapping_writeback_indeterminate() check only happens for pages under
writeback, which is a temp state. Anyway, fuse folios should not be
unmovable for their lifetime but only while under writeback, which is
the same for all fs.
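
The decision being discussed can be modeled in user space. The sketch below
is not kernel code: struct folio_model, writeback_check() and the reduced
enum migrate_mode are hypothetical stand-ins, and it assumes (as described
above) that the switch is only reached for a folio under writeback.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's migration modes. */
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

/* Hypothetical model of a folio: only the facts the check consults. */
struct folio_model {
	bool under_writeback;
	bool has_mapping;
	bool mapping_indeterminate; /* models AS_WRITEBACK_INDETERMINATE */
};

/*
 * Returns 0 if migration may proceed (in this model, the caller would go
 * on to wait for writeback), or -EBUSY if the folio is skipped.
 */
static int writeback_check(const struct folio_model *f, enum migrate_mode mode)
{
	assert(f != NULL);

	/* The mode switch only matters while the folio is under writeback. */
	if (!f->under_writeback)
		return 0;

	switch (mode) {
	case MIGRATE_SYNC:
		/* Ordinary mappings: sync migration may wait for writeback. */
		if (!f->has_mapping || !f->mapping_indeterminate)
			break;
		/* fall through */
	default:
		/* Async modes, or an indeterminate mapping: skip the folio. */
		return -EBUSY;
	}
	return 0;
}
```

In this model a folio under writeback on an indeterminate mapping is skipped
with -EBUSY even in MIGRATE_SYNC mode, while a folio that is not under
writeback is never affected by the flag.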



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 16:14                     ` Zi Yan
@ 2024-12-19 16:26                       ` Shakeel Butt
  2024-12-19 16:31                         ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-19 16:26 UTC (permalink / raw)
  To: Zi Yan
  Cc: Bernd Schubert, David Hildenbrand, Joanne Koong, miklos,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 11:14:49AM -0500, Zi Yan wrote:
> On 19 Dec 2024, at 11:09, Bernd Schubert wrote:
> 
> > On 12/19/24 17:02, Zi Yan wrote:
> >> On 19 Dec 2024, at 11:00, Zi Yan wrote:
> >>> On 19 Dec 2024, at 10:56, Bernd Schubert wrote:
> >>>
> >>>> On 12/19/24 16:55, Zi Yan wrote:
> >>>>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote:
> >>>>>
> >>>>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote:
> >>>>>>> On 19.12.24 16:43, Shakeel Butt wrote:
> >>>>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
> >>>>>>>>> On 23.11.24 00:23, Joanne Koong wrote:
> >>>>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> >>>>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> >>>>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> >>>>>>>>>> writeback may take an indeterminate amount of time to complete, and
> >>>>>>>>>> waits may get stuck.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> >>>>>>>>>> ---
> >>>>>>>>>>    mm/migrate.c | 5 ++++-
> >>>>>>>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>>>>>>>>> index df91248755e4..fe73284e5246 100644
> >>>>>>>>>> --- a/mm/migrate.c
> >>>>>>>>>> +++ b/mm/migrate.c
> >>>>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
> >>>>>>>>>>    		 */
> >>>>>>>>>>    		switch (mode) {
> >>>>>>>>>>    		case MIGRATE_SYNC:
> >>>>>>>>>> -			break;
> >>>>>>>>>> +			if (!src->mapping ||
> >>>>>>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
> >>>>>>>>>> +				break;
> >>>>>>>>>> +			fallthrough;
> >>>>>>>>>>    		default:
> >>>>>>>>>>    			rc = -EBUSY;
> >>>>>>>>>>    			goto out;
> >>>>>>>>>
> >>>>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block
> >>>>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
> >>>>>>>>>
> >>>>>>>>> That sounds very bad.
> >>>>>>>>
> >>>>>>>> The page under writeback are already unmovable while they are under
> >>>>>>>> writeback. This patch is only making potentially unrelated tasks to
> >>>>>>>> synchronously wait on writeback completion for such pages which in worst
> >>>>>>>> case can be indefinite. This actually is solving an isolation issue on a
> >>>>>>>> multi-tenant machine.
> >>>>>>>>
> >>>>>>> Are you sure, because I read in the cover letter:
> >>>>>>>
> >>>>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse:
> >>>>>>> support writable mmap"))), a temp page is allocated for every dirty
> >>>>>>> page to be written back, the contents of the dirty page are copied over to
> >>>>>>> the temp page, and the temp page gets handed to the server to write back.
> >>>>>>> This is done so that writeback may be immediately cleared on the dirty
> >>>>>>> page,"
> >>>>>>>
> >>>>>>> Which to me means that they are immediately movable again?
> >>>>>>
> >>>>>> Oh sorry, my mistake, yes this will become an isolation issue with the
> >>>>>> removal of the temp page in-between which this series is doing. I think
> >>>>>> the tradeoff is between extra memory plus slow write performance versus
> >>>>>> temporary unmovable memory.
> >>>>>
> >>>>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to
> >>>>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not
> >>>>> temporary.
> >>>>
> >>>> Is there a difference between FUSE TMP page being unmovable and
> >>>> AS_WRITEBACK_INDETERMINATE folios/pages being unmovable?
> >>
> >> (Fix my response location)
> >>
> >> Both are unmovable, but you can control where FUSE TMP page
> >> can come from to avoid spread across the entire memory space. For example,
> >> allocate a contiguous region as a TMP page pool.
> >
> > Wouldn't it make sense to have that for fuse writeback pages as well?
> > Fuse tries to limit dirty pages anyway.
> 
> Can fuse constrain the location of writeback pages? Something like what
> I proposed[1], migrating pages to a location before their writeback? Will
> that be a performance concern?
> 
> In terms of the number of dirty pages, you only need one page out of 512
> pages to prevent 2MB THP from allocation. For CMA allocation, one unmovable
> page can kill one contiguous range. What is the limit of fuse dirty pages?
> 
> [1] https://lore.kernel.org/linux-mm/90C41581-179F-40B6-9801-9C9DBBEB1AF4@nvidia.com/

I think this whole concern of fuse making system memory unmovable
forever is overblown. Fuse already uses a temp (unmovable) page for
writeback, which is slow and is being removed in this series.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 16:22             ` Shakeel Butt
@ 2024-12-19 16:29               ` David Hildenbrand
  2024-12-19 16:40                 ` Shakeel Butt
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2024-12-19 16:29 UTC (permalink / raw)
  To: Shakeel Butt, Zi Yan
  Cc: Joanne Koong, miklos, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 19.12.24 17:22, Shakeel Butt wrote:
> On Thu, Dec 19, 2024 at 10:55:10AM -0500, Zi Yan wrote:
>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote:
>>
>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote:
>>>> On 19.12.24 16:43, Shakeel Butt wrote:
>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
>>>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>>>> waits may get stuck.
>>>>>>>
>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>>> ---
>>>>>>>     mm/migrate.c | 5 ++++-
>>>>>>>     1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>>> index df91248755e4..fe73284e5246 100644
>>>>>>> --- a/mm/migrate.c
>>>>>>> +++ b/mm/migrate.c
>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>>>     		 */
>>>>>>>     		switch (mode) {
>>>>>>>     		case MIGRATE_SYNC:
>>>>>>> -			break;
>>>>>>> +			if (!src->mapping ||
>>>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>>>> +				break;
>>>>>>> +			fallthrough;
>>>>>>>     		default:
>>>>>>>     			rc = -EBUSY;
>>>>>>>     			goto out;
>>>>>>
>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block
>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>>>
>>>>>> That sounds very bad.
>>>>>
>>>>> The page under writeback are already unmovable while they are under
>>>>> writeback. This patch is only making potentially unrelated tasks to
>>>>> synchronously wait on writeback completion for such pages which in worst
>>>>> case can be indefinite. This actually is solving an isolation issue on a
>>>>> multi-tenant machine.
>>>>>
>>>> Are you sure, because I read in the cover letter:
>>>>
>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse:
>>>> support writable mmap"))), a temp page is allocated for every dirty
>>>> page to be written back, the contents of the dirty page are copied over to
>>>> the temp page, and the temp page gets handed to the server to write back.
>>>> This is done so that writeback may be immediately cleared on the dirty
>>>> page,"
>>>>
>>>> Which to me means that they are immediately movable again?
>>>
>>> Oh sorry, my mistake, yes this will become an isolation issue with the
>>> removal of the temp page in-between which this series is doing. I think
>>> the tradeoff is between extra memory plus slow write performance versus
>>> temporary unmovable memory.
>>
>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to
>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not
>> temporary.
> 
> If you check the code just above this patch, this
> mapping_writeback_indeterminate() check only happens for pages under
> writeback which is a temp state. Anyways, fuse folios should not be
> unmovable for their lifetime but only while under writeback which is
> same for all fs.

But there, writeback is expected to be a temporary thing, not possibly
"AS_WRITEBACK_INDETERMINATE"; that is a BIG difference.

I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA 
guarantees, and unfortunately, it sounds like this is the case here, 
unless I am missing something important.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 16:26                       ` Shakeel Butt
@ 2024-12-19 16:31                         ` David Hildenbrand
  2024-12-19 16:53                           ` Shakeel Butt
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2024-12-19 16:31 UTC (permalink / raw)
  To: Shakeel Butt, Zi Yan
  Cc: Bernd Schubert, Joanne Koong, miklos, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On 19.12.24 17:26, Shakeel Butt wrote:
> On Thu, Dec 19, 2024 at 11:14:49AM -0500, Zi Yan wrote:
>> On 19 Dec 2024, at 11:09, Bernd Schubert wrote:
>>
>>> On 12/19/24 17:02, Zi Yan wrote:
>>>> On 19 Dec 2024, at 11:00, Zi Yan wrote:
>>>>> On 19 Dec 2024, at 10:56, Bernd Schubert wrote:
>>>>>
>>>>>> On 12/19/24 16:55, Zi Yan wrote:
>>>>>>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote:
>>>>>>>
>>>>>>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote:
>>>>>>>>> On 19.12.24 16:43, Shakeel Butt wrote:
>>>>>>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote:
>>>>>>>>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>>>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>>>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>>>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>>>>>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>>>>>>>>> waits may get stuck.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>>>>>>>> ---
>>>>>>>>>>>>     mm/migrate.c | 5 ++++-
>>>>>>>>>>>>     1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>>>>>>>> index df91248755e4..fe73284e5246 100644
>>>>>>>>>>>> --- a/mm/migrate.c
>>>>>>>>>>>> +++ b/mm/migrate.c
>>>>>>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>>>>>>>>     		 */
>>>>>>>>>>>>     		switch (mode) {
>>>>>>>>>>>>     		case MIGRATE_SYNC:
>>>>>>>>>>>> -			break;
>>>>>>>>>>>> +			if (!src->mapping ||
>>>>>>>>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>>>>>>>>> +				break;
>>>>>>>>>>>> +			fallthrough;
>>>>>>>>>>>>     		default:
>>>>>>>>>>>>     			rc = -EBUSY;
>>>>>>>>>>>>     			goto out;
>>>>>>>>>>>
>>>>>>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block
>>>>>>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>>>>>>>>
>>>>>>>>>>> That sounds very bad.
>>>>>>>>>>
>>>>>>>>>> The page under writeback are already unmovable while they are under
>>>>>>>>>> writeback. This patch is only making potentially unrelated tasks to
>>>>>>>>>> synchronously wait on writeback completion for such pages which in worst
>>>>>>>>>> case can be indefinite. This actually is solving an isolation issue on a
>>>>>>>>>> multi-tenant machine.
>>>>>>>>>>
>>>>>>>>> Are you sure, because I read in the cover letter:
>>>>>>>>>
>>>>>>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse:
>>>>>>>>> support writable mmap"))), a temp page is allocated for every dirty
>>>>>>>>> page to be written back, the contents of the dirty page are copied over to
>>>>>>>>> the temp page, and the temp page gets handed to the server to write back.
>>>>>>>>> This is done so that writeback may be immediately cleared on the dirty
>>>>>>>>> page,"
>>>>>>>>>
>>>>>>>>> Which to me means that they are immediately movable again?
>>>>>>>>
>>>>>>>> Oh sorry, my mistake, yes this will become an isolation issue with the
>>>>>>>> removal of the temp page in-between which this series is doing. I think
>>>>>>>> the tradeoff is between extra memory plus slow write performance versus
>>>>>>>> temporary unmovable memory.
>>>>>>>
>>>>>>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to
>>>>>>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not
>>>>>>> temporary.
>>>>>>
>>>>>> Is there a difference between FUSE TMP page being unmovable and
>>>>>> AS_WRITEBACK_INDETERMINATE folios/pages being unmovable?
>>>>
>>>> (Fix my response location)
>>>>
>>>> Both are unmovable, but you can control where FUSE TMP page
>>>> can come from to avoid spread across the entire memory space. For example,
>>>> allocate a contiguous region as a TMP page pool.
>>>
>>> Wouldn't it make sense to have that for fuse writeback pages as well?
>>> Fuse tries to limit dirty pages anyway.
>>
>> Can fuse constrain the location of writeback pages? Something like what
>> I proposed[1], migrating pages to a location before their writeback? Will
>> that be a performance concern?
>>
>> In terms of the number of dirty pages, you only need one page out of 512
>> pages to prevent 2MB THP from allocation. For CMA allocation, one unmovable
>> page can kill one contiguous range. What is the limit of fuse dirty pages?
>>
>> [1] https://lore.kernel.org/linux-mm/90C41581-179F-40B6-9801-9C9DBBEB1AF4@nvidia.com/
> 
> I think this whole concern of fuse making system memory unmovable
> forever is overblown. Fuse is already using a temp (unmovable) page 

Right, and we allocate it in a way that we expect it to not be movable
(e.g., not on ZONE_MOVABLE, usually in an UNMOVABLE pageblock etc).
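
The difference between grouping unmovable pages and letting them spread can
be illustrated with a small user-space model. This is only a sketch under
stated assumptions: 2 MiB pageblocks of 4 KiB pages, and hypothetical
helpers (count_pinned_blocks(), place_in_pool(), place_scattered()) that
are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_BLOCK 512	/* assumes 2 MiB pageblocks of 4 KiB pages */
#define MAX_BLOCKS 1024		/* arbitrary cap for this model */

/*
 * Count how many pageblocks contain at least one of the given unmovable
 * page frame numbers. Each such block cannot be freed up as a whole.
 */
static size_t count_pinned_blocks(const size_t *pfns, size_t n, size_t total_blocks)
{
	bool pinned[MAX_BLOCKS] = { false };
	size_t count = 0;

	assert(total_blocks <= MAX_BLOCKS);
	for (size_t i = 0; i < n; i++) {
		size_t block = pfns[i] / PAGES_PER_BLOCK;

		assert(block < total_blocks);
		if (!pinned[block]) {
			pinned[block] = true;
			count++;
		}
	}
	return count;
}

/* Pool placement: n consecutive frames starting at pool_start. */
static void place_in_pool(size_t *pfns, size_t n, size_t pool_start)
{
	for (size_t i = 0; i < n; i++)
		pfns[i] = pool_start + i;
}

/* Worst-case scattered placement: one frame in each successive pageblock. */
static void place_scattered(size_t *pfns, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pfns[i] = i * PAGES_PER_BLOCK;
}
```

In this model, 64 unmovable pages placed consecutively from frame 0 pin a
single pageblock, while the same 64 pages placed one per pageblock pin 64 of
them.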

As another question, what effect does this change have on
folio_wait_writeback() users like arch/s390/kernel/uv.c or
shrink_folio_list()?


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 16:29               ` David Hildenbrand
@ 2024-12-19 16:40                 ` Shakeel Butt
  2024-12-19 16:41                   ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-19 16:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
[...]
> > 
> > If you check the code just above this patch, this
> > mapping_writeback_indeterminate() check only happens for pages under
> > writeback which is a temp state. Anyways, fuse folios should not be
> > unmovable for their lifetime but only while under writeback which is
> > same for all fs.
> 
> But there, writeback is expected to be a temporary thing, not possibly:
> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
> 
> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
> guarantees, and unfortunately, it sounds like this is the case here, unless
> I am missing something important.
> 

It might just be that the name "AS_WRITEBACK_INDETERMINATE" is causing
the confusion. The writeback state is not indefinite. A proper fuse fs,
like any other fs, should handle writeback pages appropriately. These
additional checks and skips are for (I think) untrusted fuse servers.
Personally I think waiting indefinitely on writeback, particularly for
sync compaction, should be fine, but fuse maintainers want to avoid
scenarios where an untrusted fuse server can force such stalls in other
jobs. Yes, this will not solve the issue of an untrusted fuse server
causing fragmentation, but that is the risk of running an untrusted fuse
server, IMHO.



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 16:40                 ` Shakeel Butt
@ 2024-12-19 16:41                   ` David Hildenbrand
  2024-12-19 17:14                     ` Shakeel Butt
  2024-12-20  7:55                     ` Jingbo Xu
  0 siblings, 2 replies; 124+ messages in thread
From: David Hildenbrand @ 2024-12-19 16:41 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 19.12.24 17:40, Shakeel Butt wrote:
> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
> [...]
>>>
>>> If you check the code just above this patch, this
>>> mapping_writeback_indeterminate() check only happens for pages under
>>> writeback which is a temp state. Anyways, fuse folios should not be
>>> unmovable for their lifetime but only while under writeback which is
>>> same for all fs.
>>
>> But there, writeback is expected to be a temporary thing, not possibly:
>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
>>
>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
>> guarantees, and unfortunately, it sounds like this is the case here, unless
>> I am missing something important.
>>
> 
> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
> the confusion. The writeback state is not indefinite. A proper fuse fs,
> like any other fs, should handle writeback pages appropriately. These
> additional checks and skips are for (I think) untrusted fuse servers.

Can unprivileged user space provoke this case?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 16:31                         ` David Hildenbrand
@ 2024-12-19 16:53                           ` Shakeel Butt
  0 siblings, 0 replies; 124+ messages in thread
From: Shakeel Butt @ 2024-12-19 16:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, Bernd Schubert, Joanne Koong, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 05:31:14PM +0100, David Hildenbrand wrote:
[...]
> > I think this whole concern of fuse making system memory unmovable
> > forever is overblown. Fuse is already using a temp (unmovable) page
> 
> Right, and we allocated in a way that we expect it to not be movable (e.g.,
> not on ZONE_MOVABLE, usually in a UNMOVABLE pageblock etc).
> 
> As another question, which effect does this change here have on
> folio_wait_writeback() users like arch/s390/kernel/uv.c or
> shrink_folio_list()?
> 

shrink_folio_list() is handled in the second patch [1] of this series. To
summarize, only memcg-v1, which does not have sane dirty throttling, can be
impacted and needs a change. For arch/s390/kernel/uv.c, I don't think this
series is doing anything. For sane fuse folios, things should be fine.


[1] https://lore.kernel.org/linux-mm/CAJnrk1bXDkwExR=ztnidX4DAvVD5wZZemEVNt9bg=tkwWAg6fw@mail.gmail.com/T/#m02461fb4fb73849900e811d695deee0706c370f9



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 16:41                   ` David Hildenbrand
@ 2024-12-19 17:14                     ` Shakeel Butt
  2024-12-19 17:26                       ` David Hildenbrand
  2024-12-20  7:55                     ` Jingbo Xu
  1 sibling, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-19 17:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote:
> On 19.12.24 17:40, Shakeel Butt wrote:
> > On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
> > [...]
> > > > 
> > > > If you check the code just above this patch, this
> > > > mapping_writeback_indeterminate() check only happens for pages under
> > > > writeback which is a temp state. Anyways, fuse folios should not be
> > > > unmovable for their lifetime but only while under writeback which is
> > > > same for all fs.
> > > 
> > > But there, writeback is expected to be a temporary thing, not possibly:
> > > "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
> > > 
> > > I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
> > > guarantees, and unfortunately, it sounds like this is the case here, unless
> > > I am missing something important.
> > > 
> > 
> > It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
> > the confusion. The writeback state is not indefinite. A proper fuse fs,
> > like any other fs, should handle writeback pages appropriately. These
> > additional checks and skips are for (I think) untrusted fuse servers.
> 
> Can unprivileged user space provoke this case?

Let's ask Joanne and other fuse folks about the above question.

Let's say unprivileged user space can start an untrusted fuse server,
mount fuse, allocate and dirty a lot of fuse folios (within its dirty
and memcg limits) and trigger the writeback. To cause pain (through
fragmentation), it never clears the writeback state. Is this the
scenario you are envisioning?


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 17:14                     ` Shakeel Butt
@ 2024-12-19 17:26                       ` David Hildenbrand
  2024-12-19 17:30                         ` Bernd Schubert
  2024-12-19 17:55                         ` Joanne Koong
  0 siblings, 2 replies; 124+ messages in thread
From: David Hildenbrand @ 2024-12-19 17:26 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 19.12.24 18:14, Shakeel Butt wrote:
> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote:
>> On 19.12.24 17:40, Shakeel Butt wrote:
>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
>>> [...]
>>>>>
>>>>> If you check the code just above this patch, this
>>>>> mapping_writeback_indeterminate() check only happen for pages under
>>>>> writeback which is a temp state. Anyways, fuse folios should not be
>>>>> unmovable for their lifetime but only while under writeback which is
>>>>> same for all fs.
>>>>
>>>> But there, writeback is expected to be a temporary thing, not possibly:
>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
>>>>
>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
>>>> guarantees, and unfortunately, it sounds like this is the case here, unless
>>>> I am missing something important.
>>>>
>>>
>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
>>> the confusion. The writeback state is not indefinite. A proper fuse fs,
>>> like anyother fs, should handle writeback pages appropriately. These
>>> additional checks and skips are for (I think) untrusted fuse servers.
>>
>> Can unprivileged user space provoke this case?
> 
> Let's ask Joanne and other fuse folks about the above question.
> 
> Let's say unprivileged user space can start a untrusted fuse server,
> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
> and memcg limits) and trigger the writeback. To cause pain (through
> fragmentation), it is not clearing the writeback state. Is this the
> scenario you are envisioning?

Yes, for example causing harm on a shared host (containers, ...).

If it cannot happen, we should make it very clear in documentation and 
patch descriptions that it can only cause harm with privileged user 
space, and that this harm can make things like CMA allocations, memory 
onplug, ... fail, which is rather bad and against concepts like 
ZONE_MOVABLE/MIGRATE_CMA.

Although I wonder what would happen if the privileged user space daemon
crashes (e.g., OOM killer?) and simply no longer replies to any messages.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 17:26                       ` David Hildenbrand
@ 2024-12-19 17:30                         ` Bernd Schubert
  2024-12-19 17:37                           ` Shakeel Butt
  2024-12-19 17:55                         ` Joanne Koong
  1 sibling, 1 reply; 124+ messages in thread
From: Bernd Schubert @ 2024-12-19 17:30 UTC (permalink / raw)
  To: David Hildenbrand, Shakeel Butt
  Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef,
	linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko



On 12/19/24 18:26, David Hildenbrand wrote:
> On 19.12.24 18:14, Shakeel Butt wrote:
>> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote:
>>> On 19.12.24 17:40, Shakeel Butt wrote:
>>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
>>>> [...]
>>>>>>
>>>>>> If you check the code just above this patch, this
>>>>>> mapping_writeback_indeterminate() check only happen for pages under
>>>>>> writeback which is a temp state. Anyways, fuse folios should not be
>>>>>> unmovable for their lifetime but only while under writeback which is
>>>>>> same for all fs.
>>>>>
>>>>> But there, writeback is expected to be a temporary thing, not
>>>>> possibly:
>>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
>>>>>
>>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
>>>>> guarantees, and unfortunately, it sounds like this is the case
>>>>> here, unless
>>>>> I am missing something important.
>>>>>
>>>>
>>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
>>>> the confusion. The writeback state is not indefinite. A proper fuse fs,
>>>> like anyother fs, should handle writeback pages appropriately. These
>>>> additional checks and skips are for (I think) untrusted fuse servers.
>>>
>>> Can unprivileged user space provoke this case?
>>
>> Let's ask Joanne and other fuse folks about the above question.
>>
>> Let's say unprivileged user space can start a untrusted fuse server,
>> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
>> and memcg limits) and trigger the writeback. To cause pain (through
>> fragmentation), it is not clearing the writeback state. Is this the
>> scenario you are envisioning?
> 
> Yes, for example causing harm on a shared host (containers, ...).
> 
> If it cannot happen, we should make it very clear in documentation and
> patch descriptions that it can only cause harm with privileged user
> space, and that this harm can make things like CMA allocations, memory
> onplug, ... fail, which is rather bad and against concepts like
> ZONE_MOVABLE/MIGRATE_CMA.
> 
> Although I wonder what would happen if the privileged user space daemon
> crashes  (e.g., OOM killer?) and simply no longer replies to any messages.
> 

The request is canceled then, which should clear the page/folio state.


I am starting to wonder whether we should introduce really short fuse
request timeouts and just repeat requests when things have cleared up,
at least for write-back requests (in the sense that fuse-over-network
might be slow or interrupted for some time).


Thanks,
Bernd



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 17:30                         ` Bernd Schubert
@ 2024-12-19 17:37                           ` Shakeel Butt
  2024-12-19 17:40                             ` Bernd Schubert
  2024-12-19 17:44                             ` Joanne Koong
  0 siblings, 2 replies; 124+ messages in thread
From: Shakeel Butt @ 2024-12-19 17:37 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: David Hildenbrand, Zi Yan, Joanne Koong, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 06:30:34PM +0100, Bernd Schubert wrote:
> 
> 
> On 12/19/24 18:26, David Hildenbrand wrote:
> > On 19.12.24 18:14, Shakeel Butt wrote:
> >> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote:
> >>> On 19.12.24 17:40, Shakeel Butt wrote:
> >>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
> >>>> [...]
> >>>>>>
> >>>>>> If you check the code just above this patch, this
> >>>>>> mapping_writeback_indeterminate() check only happen for pages under
> >>>>>> writeback which is a temp state. Anyways, fuse folios should not be
> >>>>>> unmovable for their lifetime but only while under writeback which is
> >>>>>> same for all fs.
> >>>>>
> >>>>> But there, writeback is expected to be a temporary thing, not
> >>>>> possibly:
> >>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
> >>>>>
> >>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
> >>>>> guarantees, and unfortunately, it sounds like this is the case
> >>>>> here, unless
> >>>>> I am missing something important.
> >>>>>
> >>>>
> >>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
> >>>> the confusion. The writeback state is not indefinite. A proper fuse fs,
> >>>> like anyother fs, should handle writeback pages appropriately. These
> >>>> additional checks and skips are for (I think) untrusted fuse servers.
> >>>
> >>> Can unprivileged user space provoke this case?
> >>
> >> Let's ask Joanne and other fuse folks about the above question.
> >>
> >> Let's say unprivileged user space can start a untrusted fuse server,
> >> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
> >> and memcg limits) and trigger the writeback. To cause pain (through
> >> fragmentation), it is not clearing the writeback state. Is this the
> >> scenario you are envisioning?
> > 
> > Yes, for example causing harm on a shared host (containers, ...).
> > 
> > If it cannot happen, we should make it very clear in documentation and
> > patch descriptions that it can only cause harm with privileged user
> > space, and that this harm can make things like CMA allocations, memory
> > onplug, ... fail, which is rather bad and against concepts like
> > ZONE_MOVABLE/MIGRATE_CMA.
> > 
> > Although I wonder what would happen if the privileged user space daemon
> > crashes  (e.g., OOM killer?) and simply no longer replies to any messages.
> > 
> 
> The request is canceled then - that should clear the page/folio state
> 
> 
> I start to wonder if we should introduce really short fuse request
> timeouts and just repeat requests when things have cleared up. At least
> for write-back requests (in the sense that fuse-over-network might
> be slow or interrupted for some time).
> 
> 

Thanks, Bernd, for the response. Can you say a bit more about the request
timeouts? Basically, do they clear the page/folio state as well?


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 17:37                           ` Shakeel Butt
@ 2024-12-19 17:40                             ` Bernd Schubert
  2024-12-19 17:44                             ` Joanne Koong
  1 sibling, 0 replies; 124+ messages in thread
From: Bernd Schubert @ 2024-12-19 17:40 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David Hildenbrand, Zi Yan, Joanne Koong, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko



On 12/19/24 18:37, Shakeel Butt wrote:
> On Thu, Dec 19, 2024 at 06:30:34PM +0100, Bernd Schubert wrote:
>>
>>
>> On 12/19/24 18:26, David Hildenbrand wrote:
>>> On 19.12.24 18:14, Shakeel Butt wrote:
>>>> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote:
>>>>> On 19.12.24 17:40, Shakeel Butt wrote:
>>>>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
>>>>>> [...]
>>>>>>>>
>>>>>>>> If you check the code just above this patch, this
>>>>>>>> mapping_writeback_indeterminate() check only happen for pages under
>>>>>>>> writeback which is a temp state. Anyways, fuse folios should not be
>>>>>>>> unmovable for their lifetime but only while under writeback which is
>>>>>>>> same for all fs.
>>>>>>>
>>>>>>> But there, writeback is expected to be a temporary thing, not
>>>>>>> possibly:
>>>>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
>>>>>>>
>>>>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
>>>>>>> guarantees, and unfortunately, it sounds like this is the case
>>>>>>> here, unless
>>>>>>> I am missing something important.
>>>>>>>
>>>>>>
>>>>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
>>>>>> the confusion. The writeback state is not indefinite. A proper fuse fs,
>>>>>> like anyother fs, should handle writeback pages appropriately. These
>>>>>> additional checks and skips are for (I think) untrusted fuse servers.
>>>>>
>>>>> Can unprivileged user space provoke this case?
>>>>
>>>> Let's ask Joanne and other fuse folks about the above question.
>>>>
>>>> Let's say unprivileged user space can start a untrusted fuse server,
>>>> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
>>>> and memcg limits) and trigger the writeback. To cause pain (through
>>>> fragmentation), it is not clearing the writeback state. Is this the
>>>> scenario you are envisioning?
>>>
>>> Yes, for example causing harm on a shared host (containers, ...).
>>>
>>> If it cannot happen, we should make it very clear in documentation and
>>> patch descriptions that it can only cause harm with privileged user
>>> space, and that this harm can make things like CMA allocations, memory
>>> onplug, ... fail, which is rather bad and against concepts like
>>> ZONE_MOVABLE/MIGRATE_CMA.
>>>
>>> Although I wonder what would happen if the privileged user space daemon
>>> crashes  (e.g., OOM killer?) and simply no longer replies to any messages.
>>>
>>
>> The request is canceled then - that should clear the page/folio state
>>
>>
>> I start to wonder if we should introduce really short fuse request
>> timeouts and just repeat requests when things have cleared up. At least
>> for write-back requests (in the sense that fuse-over-network might
>> be slow or interrupted for some time).
>>
>>
> 
> Thanks Bernd for the response. Can you tell a bit more about the request
> timeouts? Basically does it impact/clear the page/folio state as well?

That is just an idea and needs more discussion first. I just sent an
off-list message.




^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 17:37                           ` Shakeel Butt
  2024-12-19 17:40                             ` Bernd Schubert
@ 2024-12-19 17:44                             ` Joanne Koong
  2024-12-19 17:54                               ` Shakeel Butt
  1 sibling, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2024-12-19 17:44 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Bernd Schubert, David Hildenbrand, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 9:37 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Dec 19, 2024 at 06:30:34PM +0100, Bernd Schubert wrote:
> >
> >
> > On 12/19/24 18:26, David Hildenbrand wrote:
> > > On 19.12.24 18:14, Shakeel Butt wrote:
> > >> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote:
> > >>> On 19.12.24 17:40, Shakeel Butt wrote:
> > >>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
> > >>>> [...]
> > >>>>>>
> > >>>>>> If you check the code just above this patch, this
> > >>>>>> mapping_writeback_indeterminate() check only happen for pages under
> > >>>>>> writeback which is a temp state. Anyways, fuse folios should not be
> > >>>>>> unmovable for their lifetime but only while under writeback which is
> > >>>>>> same for all fs.
> > >>>>>
> > >>>>> But there, writeback is expected to be a temporary thing, not
> > >>>>> possibly:
> > >>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
> > >>>>>
> > >>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
> > >>>>> guarantees, and unfortunately, it sounds like this is the case
> > >>>>> here, unless
> > >>>>> I am missing something important.
> > >>>>>
> > >>>>
> > >>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
> > >>>> the confusion. The writeback state is not indefinite. A proper fuse fs,
> > >>>> like anyother fs, should handle writeback pages appropriately. These
> > >>>> additional checks and skips are for (I think) untrusted fuse servers.
> > >>>
> > >>> Can unprivileged user space provoke this case?
> > >>
> > >> Let's ask Joanne and other fuse folks about the above question.
> > >>
> > >> Let's say unprivileged user space can start a untrusted fuse server,
> > >> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
> > >> and memcg limits) and trigger the writeback. To cause pain (through
> > >> fragmentation), it is not clearing the writeback state. Is this the
> > >> scenario you are envisioning?
> > >
> > > Yes, for example causing harm on a shared host (containers, ...).
> > >
> > > If it cannot happen, we should make it very clear in documentation and
> > > patch descriptions that it can only cause harm with privileged user
> > > space, and that this harm can make things like CMA allocations, memory
> > > onplug, ... fail, which is rather bad and against concepts like
> > > ZONE_MOVABLE/MIGRATE_CMA.
> > >
> > > Although I wonder what would happen if the privileged user space daemon
> > > crashes  (e.g., OOM killer?) and simply no longer replies to any messages.
> > >
> >
> > The request is canceled then - that should clear the page/folio state
> >
> >
> > I start to wonder if we should introduce really short fuse request
> > timeouts and just repeat requests when things have cleared up. At least
> > for write-back requests (in the sense that fuse-over-network might
> > be slow or interrupted for some time).
> >
> >
>
> Thanks Bernd for the response. Can you tell a bit more about the request
> timeouts? Basically does it impact/clear the page/folio state as well?

Request timeouts can be set by admins system-wide to protect against
malicious/buggy fuse servers that do not reply to requests within a
certain amount of time. If a request times out, then the whole
connection will be aborted, and pages/folios will be cleaned up
accordingly. The corresponding patchset is here [1]. This helps
mitigate the possibility of unprivileged buggy servers tying up
writeback state by not replying.
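
To make the timeout behaviour described above concrete, here is a toy
userspace model in Python. It is only a sketch and not kernel code: the
names Folio, Connection, send_write and check_timeouts are made up for
illustration, and the real patchset may differ in every detail. The one
point it shows is that a single timed-out request aborts the whole
connection and clears writeback on every in-flight folio.

```python
from dataclasses import dataclass, field


@dataclass
class Folio:
    index: int
    writeback: bool = False


@dataclass
class Request:
    folio: Folio
    sent_at: float


@dataclass
class Connection:
    timeout: float  # hypothetical admin-set limit, in seconds
    aborted: bool = False
    inflight: list = field(default_factory=list)

    def send_write(self, folio, now):
        """Hand a folio to the server; it stays under writeback until a reply."""
        if self.aborted:
            raise ConnectionError("connection aborted")
        folio.writeback = True
        self.inflight.append(Request(folio, now))

    def reply(self, folio):
        """The server completed the write for this folio."""
        self.inflight = [r for r in self.inflight if r.folio is not folio]
        folio.writeback = False

    def check_timeouts(self, now):
        """Abort the whole connection if any request is older than the limit."""
        if any(now - r.sent_at > self.timeout for r in self.inflight):
            self.abort()

    def abort(self):
        """Fail every in-flight request and clear its writeback state."""
        self.aborted = True
        for r in self.inflight:
            r.folio.writeback = False
        self.inflight.clear()
```

In this model a server that never calls reply() can hold folios under
writeback only until the next check_timeouts() after the limit expires.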


Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20241218222630.99920-1-joannelkoong@gmail.com/T/#t


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 17:44                             ` Joanne Koong
@ 2024-12-19 17:54                               ` Shakeel Butt
  2024-12-20 11:44                                 ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-19 17:54 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Bernd Schubert, David Hildenbrand, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 09:44:42AM -0800, Joanne Koong wrote:
> On Thu, Dec 19, 2024 at 9:37 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
[...]
> > >
> > > The request is canceled then - that should clear the page/folio state
> > >
> > >
> > > I start to wonder if we should introduce really short fuse request
> > > timeouts and just repeat requests when things have cleared up. At least
> > > for write-back requests (in the sense that fuse-over-network might
> > > be slow or interrupted for some time).
> > >
> > >
> >
> > Thanks Bernd for the response. Can you tell a bit more about the request
> > timeouts? Basically does it impact/clear the page/folio state as well?
> 
> Request timeouts can be set by admins system-wide to protect against
> malicious/buggy fuse servers that do not reply to requests by a
> certain amount of time. If the request times out, then the whole
> connection will be aborted, and pages/folios will be cleaned up
> accordingly. The corresponding patchset is here [1]. This helps
> mitigate the possibility of unprivileged buggy servers tieing up
> writeback state by not replying.
> 

Thanks a lot Joanne and Bernd.

David, do these timeouts resolve your concerns?


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 17:26                       ` David Hildenbrand
  2024-12-19 17:30                         ` Bernd Schubert
@ 2024-12-19 17:55                         ` Joanne Koong
  2024-12-19 18:04                           ` Bernd Schubert
  1 sibling, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2024-12-19 17:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Shakeel Butt, Zi Yan, miklos, linux-fsdevel, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 9:26 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.12.24 18:14, Shakeel Butt wrote:
> > On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote:
> >> On 19.12.24 17:40, Shakeel Butt wrote:
> >>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
> >>> [...]
> >>>>>
> >>>>> If you check the code just above this patch, this
> >>>>> mapping_writeback_indeterminate() check only happen for pages under
> >>>>> writeback which is a temp state. Anyways, fuse folios should not be
> >>>>> unmovable for their lifetime but only while under writeback which is
> >>>>> same for all fs.
> >>>>
> >>>> But there, writeback is expected to be a temporary thing, not possibly:
> >>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
> >>>>
> >>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
> >>>> guarantees, and unfortunately, it sounds like this is the case here, unless
> >>>> I am missing something important.
> >>>>
> >>>
> >>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
> >>> the confusion. The writeback state is not indefinite. A proper fuse fs,
> >>> like anyother fs, should handle writeback pages appropriately. These
> >>> additional checks and skips are for (I think) untrusted fuse servers.
> >>
> >> Can unprivileged user space provoke this case?
> >
> > Let's ask Joanne and other fuse folks about the above question.
> >
> > Let's say unprivileged user space can start a untrusted fuse server,
> > mount fuse, allocate and dirty a lot of fuse folios (within its dirty
> > and memcg limits) and trigger the writeback. To cause pain (through
> > fragmentation), it is not clearing the writeback state. Is this the
> > scenario you are envisioning?
>

This scenario can already happen with temp pages. An untrusted
malicious fuse server may allocate and dirty a lot of fuse folios
within its dirty/memcg limits, never clear writeback on any of them,
and so tie up system resources. This certainly isn't the common case,
but it is a possibility. However, request timeouts can be set by the
system admin [1] to protect against malicious/buggy fuse servers that
try to do this. If a request isn't replied to within a certain amount
of time, then the connection will be aborted and writeback state and
other resources will be cleared/freed.


Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20241218222630.99920-1-joannelkoong@gmail.com/T/#t

> Yes, for example causing harm on a shared host (containers, ...).
>
> If it cannot happen, we should make it very clear in documentation and
> patch descriptions that it can only cause harm with privileged user
> space, and that this harm can make things like CMA allocations, memory
> onplug, ... fail, which is rather bad and against concepts like
> ZONE_MOVABLE/MIGRATE_CMA.
>
> Although I wonder what would happen if the privileged user space daemon
> crashes  (e.g., OOM killer?) and simply no longer replies to any messages.
>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 17:55                         ` Joanne Koong
@ 2024-12-19 18:04                           ` Bernd Schubert
  2024-12-19 18:11                             ` Shakeel Butt
  0 siblings, 1 reply; 124+ messages in thread
From: Bernd Schubert @ 2024-12-19 18:04 UTC (permalink / raw)
  To: Joanne Koong, David Hildenbrand
  Cc: Shakeel Butt, Zi Yan, miklos, linux-fsdevel, jefflexu, josef,
	linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko



On 12/19/24 18:55, Joanne Koong wrote:
> On Thu, Dec 19, 2024 at 9:26 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.12.24 18:14, Shakeel Butt wrote:
>>> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote:
>>>> On 19.12.24 17:40, Shakeel Butt wrote:
>>>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
>>>>> [...]
>>>>>>>
>>>>>>> If you check the code just above this patch, this
>>>>>>> mapping_writeback_indeterminate() check only happen for pages under
>>>>>>> writeback which is a temp state. Anyways, fuse folios should not be
>>>>>>> unmovable for their lifetime but only while under writeback which is
>>>>>>> same for all fs.
>>>>>>
>>>>>> But there, writeback is expected to be a temporary thing, not possibly:
>>>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
>>>>>>
>>>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
>>>>>> guarantees, and unfortunately, it sounds like this is the case here, unless
>>>>>> I am missing something important.
>>>>>>
>>>>>
>>>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
>>>>> the confusion. The writeback state is not indefinite. A proper fuse fs,
>>>>> like anyother fs, should handle writeback pages appropriately. These
>>>>> additional checks and skips are for (I think) untrusted fuse servers.
>>>>
>>>> Can unprivileged user space provoke this case?
>>>
>>> Let's ask Joanne and other fuse folks about the above question.
>>>
>>> Let's say unprivileged user space can start a untrusted fuse server,
>>> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
>>> and memcg limits) and trigger the writeback. To cause pain (through
>>> fragmentation), it is not clearing the writeback state. Is this the
>>> scenario you are envisioning?
>>
> 
> This scenario can already happen with temp pages. An untrusted
> malicious fuse server may allocate and dirty a lot of fuse folios
> within its dirty/memcg limits and never clear writeback on any of them
> and tie up system resources. This certainly isn't the common case, but
> it is a possibility. However, request timeouts can be set by the
> system admin [1] to protect against malicious/buggy fuse servers that
> try to do this. If the request isn't replied to by a certain amount of
> time, then the connection will be aborted and writeback state and
> other resources will be cleared/freed.
> 

I think what Zi points out is that this is a current implementation
issue and that these temp pages should be in a contiguous range.
Obviously it is better to avoid a tmp copy altogether.


Thanks,
Bernd




^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 18:04                           ` Bernd Schubert
@ 2024-12-19 18:11                             ` Shakeel Butt
  0 siblings, 0 replies; 124+ messages in thread
From: Shakeel Butt @ 2024-12-19 18:11 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Joanne Koong, David Hildenbrand, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 07:04:40PM +0100, Bernd Schubert wrote:
> 
> 
> On 12/19/24 18:55, Joanne Koong wrote:
> > On Thu, Dec 19, 2024 at 9:26 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 19.12.24 18:14, Shakeel Butt wrote:
> >>> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote:
> >>>> On 19.12.24 17:40, Shakeel Butt wrote:
> >>>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
> >>>>> [...]
> >>>>>>>
> >>>>>>> If you check the code just above this patch, this
> >>>>>>> mapping_writeback_indeterminate() check only happen for pages under
> >>>>>>> writeback which is a temp state. Anyways, fuse folios should not be
> >>>>>>> unmovable for their lifetime but only while under writeback which is
> >>>>>>> same for all fs.
> >>>>>>
> >>>>>> But there, writeback is expected to be a temporary thing, not possibly:
> >>>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
> >>>>>>
> >>>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
> >>>>>> guarantees, and unfortunately, it sounds like this is the case here, unless
> >>>>>> I am missing something important.
> >>>>>>
> >>>>>
> >>>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
> >>>>> the confusion. The writeback state is not indefinite. A proper fuse fs,
> >>>>> like anyother fs, should handle writeback pages appropriately. These
> >>>>> additional checks and skips are for (I think) untrusted fuse servers.
> >>>>
> >>>> Can unprivileged user space provoke this case?
> >>>
> >>> Let's ask Joanne and other fuse folks about the above question.
> >>>
> >>> Let's say unprivileged user space can start a untrusted fuse server,
> >>> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
> >>> and memcg limits) and trigger the writeback. To cause pain (through
> >>> fragmentation), it is not clearing the writeback state. Is this the
> >>> scenario you are envisioning?
> >>
> > 
> > This scenario can already happen with temp pages. An untrusted
> > malicious fuse server may allocate and dirty a lot of fuse folios
> > within its dirty/memcg limits and never clear writeback on any of them
> > and tie up system resources. This certainly isn't the common case, but
> > it is a possibility. However, request timeouts can be set by the
> > system admin [1] to protect against malicious/buggy fuse servers that
> > try to do this. If the request isn't replied to by a certain amount of
> > time, then the connection will be aborted and writeback state and
> > other resources will be cleared/freed.
> > 
> 
> I think what Zi points out that that is a current implementation issue
> and these temp pages should be in a continues range. 
> Obviously better to avoid a tmp copy at all.

The current tmp pages are allocated from MIGRATE_UNMOVABLE. I don't see
any additional benefit in reserving contiguous unmovable memory
regions for tmp pages. It would just add complexity without any clear
benefit.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 16:41                   ` David Hildenbrand
  2024-12-19 17:14                     ` Shakeel Butt
@ 2024-12-20  7:55                     ` Jingbo Xu
  1 sibling, 0 replies; 124+ messages in thread
From: Jingbo Xu @ 2024-12-20  7:55 UTC (permalink / raw)
  To: David Hildenbrand, Shakeel Butt
  Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

Hi,

On 12/20/24 12:41 AM, David Hildenbrand wrote:
> On 19.12.24 17:40, Shakeel Butt wrote:
>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
>> [...]
>>>>
>>>> If you check the code just above this patch, this
>>>> mapping_writeback_indeterminate() check only happen for pages under
>>>> writeback which is a temp state. Anyways, fuse folios should not be
>>>> unmovable for their lifetime but only while under writeback which is
>>>> same for all fs.
>>>
>>> But there, writeback is expected to be a temporary thing, not possibly:
>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
>>>
>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
>>> guarantees, and unfortunately, it sounds like this is the case here,
>>> unless
>>> I am missing something important.
>>>
>>
>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
>> the confusion. The writeback state is not indefinite. A proper fuse fs,
>> like anyother fs, should handle writeback pages appropriately. These
>> additional checks and skips are for (I think) untrusted fuse servers.
> 
> Can unprivileged user space provoke this case?
> 

There are some details on the initial problem that the FUSE community
wants to fix [1].

In summary, a non-malicious fuse daemon may need to allocate some memory
when processing a FUSE_WRITE request (initiated from the writeback
routine). The allocation can trigger memory reclaim and compaction,
which in turn wait on the writeback of **FUSE** dirty pages, which
itself waits for the fuse daemon to handle the request. The result is
a deadlock.

The current FUSE implementation fixes this by introducing a "temp page"
in the writeback routine for FUSE.  In short, a temporary page
(allocated from ZONE_UNMOVABLE) is allocated for each dirty page cache
page that needs to be written back.  The content is copied from the
original page cache page to the temporary page.  The original page
(under writeback, allocated from ZONE_MOVABLE) then clears its
PG_writeback bit immediately, so that the fuse daemon can't get stuck
in a deadlock waiting for the writeback of FUSE page cache.  The actual
writeback work is done on the cloned temporary page instead.

Thus there are actually two pages for each FUSE page cache, one is the
original FUSE page cache (in ZONE_MOVABLE) and the other is the
temporary page (in ZONE_UNMOVABLE).

- For the original page cache, it will clear PG_writeback bit very
quickly in the writeback routine and won't block the memory direct
reclaim and compaction at all
- As for the temporary page, in the normal case, the fuse server will
complete FUSE_WRITE request as expected, and thus the temporary page
will get freed soon.
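
As a rough illustration of the temp page scheme described above, here is
a minimal userspace model in C.  It is only a sketch: struct model_page,
struct model_write_req and both functions are hypothetical names made up
for this example, and none of it is the actual fs/fuse code.  It shows
the one property that matters here, namely that the original page leaves
the writeback state before the daemon has replied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-in for a page cache page; not the kernel's struct page. */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool dirty;
	bool writeback;
};

/* Hypothetical stand-in for an in-flight FUSE_WRITE request. */
struct model_write_req {
	unsigned char *tmp;	/* private copy handed to the daemon */
};

/*
 * Start writeback the "temp page" way: copy the dirty page into a freshly
 * allocated buffer, then end writeback on the original right away.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated.
 */
int model_start_writeback_tmp(struct model_page *page,
			      struct model_write_req *req)
{
	assert(page && req);

	req->tmp = malloc(MODEL_PAGE_SIZE);
	if (!req->tmp)
		return -1;

	page->dirty = false;
	page->writeback = true;
	memcpy(req->tmp, page->data, MODEL_PAGE_SIZE);
	/* The original is released before the daemon has replied. */
	page->writeback = false;
	return 0;
}

/* The daemon replied (or the connection was aborted): drop the copy. */
void model_end_writeback_tmp(struct model_write_req *req)
{
	assert(req);
	free(req->tmp);
	req->tmp = NULL;
}
```

In this model, the cost being discussed is the malloc() plus memcpy()
per dirty page, and the buffer stays allocated for as long as the daemon
does not reply.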

However FUSE supports unprivileged mount, in which case the fuse daemon
is run and mounted by an unprivileged user.  Thus the backend fuse
daemon may be malicious (started by an unprivileged user) and refuse to
process any FUSE requests.  In the worst case, these temporary
pages will never complete writeback and get pinned in ZONE_UNMOVABLE
forever. (One thing worth noting is that once the fuse daemon gets
killed, the whole FUSE filesystem is aborted, all inflight FUSE
requests are flushed, and all the temporary pages are freed.)


What this patchset does is to drop the temporary page design in the FUSE
writeback routine, while this patch is introduced to avoid the above
mentioned deadlock for a *sane* FUSE daemon in memory compaction after
dropping the temp page design.

Currently the FUSE writeback pages (i.e. FUSE page cache) are allocated
with GFP_HIGHUSER_MOVABLE, which is consistent with other filesystems.

In the normal case (the FUSE filesystem is backed by a well-behaved FUSE
daemon), writeback of the page cache will complete in a reasonable time
and it won't affect the usability of ZONE_MOVABLE.

In the worst case (a malicious FUSE daemon run by an unprivileged
user), however, these page cache pages in ZONE_MOVABLE can be pinned
there indefinitely.

We can argue that in the current implementation (without this patch
series), ZONE_UNMOVABLE can also grow larger and larger, and pin quite
a lot of memory (correct me if I'm wrong) in the worst case.  To this
degree this patch doesn't make things any worse.  Besides, FUSE enables
the strictlimit feature by default, in which each FUSE filesystem can
consume at most 1% of global vm.dirty_background_thresh before write
throttling is triggered.


[1]
https://lore.kernel.org/all/8eec0912-7a6c-4387-b9be-6718f438a111@linux.alibaba.com/


-- 
Thanks,
Jingbo



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 17:54                               ` Shakeel Butt
@ 2024-12-20 11:44                                 ` David Hildenbrand
  2024-12-20 12:15                                   ` Bernd Schubert
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2024-12-20 11:44 UTC (permalink / raw)
  To: Shakeel Butt, Joanne Koong
  Cc: Bernd Schubert, Zi Yan, miklos, linux-fsdevel, jefflexu, josef,
	linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On 19.12.24 18:54, Shakeel Butt wrote:
> On Thu, Dec 19, 2024 at 09:44:42AM -0800, Joanne Koong wrote:
>> On Thu, Dec 19, 2024 at 9:37 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> [...]
>>>>
>>>> The request is canceled then - that should clear the page/folio state
>>>>
>>>>
>>>> I start to wonder if we should introduce really short fuse request
>>>> timeouts and just repeat requests when things have cleared up. At least
>>>> for write-back requests (in the sense that fuse-over-network might
>>>> be slow or interrupted for some time).
>>>>
>>>>
>>>
>>> Thanks Bernd for the response. Can you tell a bit more about the request
>>> timeouts? Basically does it impact/clear the page/folio state as well?
>>
>> Request timeouts can be set by admins system-wide to protect against
>> malicious/buggy fuse servers that do not reply to requests by a
>> certain amount of time. If the request times out, then the whole
>> connection will be aborted, and pages/folios will be cleaned up
>> accordingly. The corresponding patchset is here [1]. This helps
>> mitigate the possibility of unprivileged buggy servers tieing up
>> writeback state by not replying.
>>
> 
> Thanks a lot Joanne and Bernd.
> 
> David, does these timeouts resolve your concerns?

Thanks for that information. Yes and no. :)

Bernd wrote: "I start to wonder if we should introduce really short fuse
request timeouts and just repeat requests when things have cleared up.
At least for write-back requests (in the sense that fuse-over-network
might be slow or interrupted for some time)."

This indicates to me that while timeouts might be supported soon (will
there be a sane default?), even trusted implementations can run into
this (network example above), where timeouts might actually be harmful
I suppose?

I'm wondering if there would be a way to just "cancel" the writeback and 
mark the folio dirty again. That way it could be migrated, but not 
reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE 
thing.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-20 11:44                                 ` David Hildenbrand
@ 2024-12-20 12:15                                   ` Bernd Schubert
  2024-12-20 14:49                                     ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Bernd Schubert @ 2024-12-20 12:15 UTC (permalink / raw)
  To: David Hildenbrand, Shakeel Butt, Joanne Koong
  Cc: Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm,
	kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko



On 12/20/24 12:44, David Hildenbrand wrote:
> On 19.12.24 18:54, Shakeel Butt wrote:
>> On Thu, Dec 19, 2024 at 09:44:42AM -0800, Joanne Koong wrote:
>>> On Thu, Dec 19, 2024 at 9:37 AM Shakeel Butt <shakeel.butt@linux.dev>
>>> wrote:
>> [...]
>>>>>
>>>>> The request is canceled then - that should clear the page/folio state
>>>>>
>>>>>
>>>>> I start to wonder if we should introduce really short fuse request
>>>>> timeouts and just repeat requests when things have cleared up. At
>>>>> least
>>>>> for write-back requests (in the sense that fuse-over-network might
>>>>> be slow or interrupted for some time).
>>>>>
>>>>>
>>>>
>>>> Thanks Bernd for the response. Can you tell a bit more about the
>>>> request
>>>> timeouts? Basically does it impact/clear the page/folio state as well?
>>>
>>> Request timeouts can be set by admins system-wide to protect against
>>> malicious/buggy fuse servers that do not reply to requests by a
>>> certain amount of time. If the request times out, then the whole
>>> connection will be aborted, and pages/folios will be cleaned up
>>> accordingly. The corresponding patchset is here [1]. This helps
>>> mitigate the possibility of unprivileged buggy servers tieing up
>>> writeback state by not replying.
>>>
>>
>> Thanks a lot Joanne and Bernd.
>>
>> David, does these timeouts resolve your concerns?
> 
> Thanks for that information. Yes and no. :)
> 
> Bernd wrote: "I start to wonder if we should introduce really short fuse
> request timeouts and just repeat requests when things have cleared up.
> At least for write-back requests (in the sense that fuse-over-network
> might be slow or interrupted for some time).
> 
> Indicating to me that while timeouts might be supported soon (will there
> be a sane default?) even trusted implementations can run into this
> (network example above) where timeouts might actually be harmful I suppose?

Yeah, and that makes it hard to provide a default. In Joanne's timeout
patches the admin can set a system default.

https://lore.kernel.org/all/20241218222630.99920-3-joannelkoong@gmail.com/

> 
> I'm wondering if there would be a way to just "cancel" the writeback and
> mark the folio dirty again. That way it could be migrated, but not
> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE
> thing.
> 

That is basically what I meant with short timeouts. Obviously it is not
that simple to cancel the request and to retry - it would add quite
some complexity, if all the issues that arise can be solved at all.


Thanks,
Bernd



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-20 12:15                                   ` Bernd Schubert
@ 2024-12-20 14:49                                     ` David Hildenbrand
  2024-12-20 15:26                                       ` Bernd Schubert
                                                         ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: David Hildenbrand @ 2024-12-20 14:49 UTC (permalink / raw)
  To: Bernd Schubert, Shakeel Butt, Joanne Koong
  Cc: Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm,
	kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko

>> I'm wondering if there would be a way to just "cancel" the writeback and
>> mark the folio dirty again. That way it could be migrated, but not
>> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE
>> thing.
>>
> 
> That is what I basically meant with short timeouts. Obviously it is not
> that simple to cancel the request and to retry - it would add in quite
> some complexity, if all the issues that arise can be solved at all.

At least it would keep that out of core-mm.

AS_WRITEBACK_INDETERMINATE really has a weird smell to it ... we should
try to improve such scenarios, not acknowledge and integrate them, then
work around them using timeouts that must be manually configured and can
likely not be enabled by default because they could hurt reasonable use
cases :(

Right now we clear the writeback flag immediately, indicating that data 
was written back, when in fact it was not written back at all. I suspect 
fsync() currently handles that manually already, to wait for any of the 
allocated pages to actually get written back by user space, so we have 
control over when something was *actually* written back.


Similar to your proposal, I wonder if there could be a way to request 
fuse to "abort" a writeback request (instead of using fixed timeouts per 
request). Meaning, when we stumble over a folio that is under writeback 
on some paths, we would tell fuse to "end writeback now", or "end 
writeback now if it takes longer than X". Essentially hidden inside 
folio_wait_writeback().

When aborting a request, as I said, we would essentially "end writeback" 
and mark the folio as dirty again. The interesting thing is likely how 
to handle user space that wants to process this request right now (stuck 
in fuse_send_writepage() I assume?), correct?
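
To make the "end writeback now if it takes longer than X" idea a bit
more concrete, here is a minimal userspace model in C.  It is purely
hypothetical: no such hook exists in the kernel or in fuse today, and
struct model_wb_folio and model_try_abort_writeback() are names invented
for this sketch.  It models only the folio state transition (writeback
cleared, dirty set again) and deliberately ignores what happens to a
request that has already been handed to user space.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-folio state; not the kernel's struct folio. */
struct model_wb_folio {
	bool dirty;
	bool writeback;
	unsigned long wb_start;	/* time at which writeback began */
};

/*
 * Abort writeback if it has been pending for at least max_wait time
 * units.  "now" is assumed to come from a monotonic clock, so
 * now >= wb_start.  On abort the folio goes back to dirty, which in
 * this model means it could be migrated but must not be reclaimed.
 * Returns true if writeback was aborted.
 */
bool model_try_abort_writeback(struct model_wb_folio *f,
			       unsigned long now, unsigned long max_wait)
{
	assert(f);

	if (!f->writeback)
		return false;
	if (now - f->wb_start < max_wait)
		return false;

	f->writeback = false;
	f->dirty = true;
	return true;
}
```

Passing max_wait == 0 models the unconditional "end writeback now"
variant.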

Just throwing it out there ... no expert at all on fuse ...

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-20 14:49                                     ` David Hildenbrand
@ 2024-12-20 15:26                                       ` Bernd Schubert
  2024-12-20 18:01                                       ` Shakeel Butt
  2024-12-20 21:01                                       ` Joanne Koong
  2 siblings, 0 replies; 124+ messages in thread
From: Bernd Schubert @ 2024-12-20 15:26 UTC (permalink / raw)
  To: David Hildenbrand, Shakeel Butt, Joanne Koong
  Cc: Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm,
	kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko



On 12/20/24 15:49, David Hildenbrand wrote:
>>> I'm wondering if there would be a way to just "cancel" the writeback and
>>> mark the folio dirty again. That way it could be migrated, but not
>>> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE
>>> thing.
>>>
>>
>> That is what I basically meant with short timeouts. Obviously it is not
>> that simple to cancel the request and to retry - it would add in quite
>> some complexity, if all the issues that arise can be solved at all.
> 
> At least it would keep that out of core-mm.
> 
> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should
> try to improve such scenarios, not acknowledge and integrate them, then
> work around using timeouts that must be manually configured, and ca
> likely no be default enabled because it could hurt reasonable use cases :(
> 
> Right now we clear the writeback flag immediately, indicating that data
> was written back, when in fact it was not written back at all. I suspect
> fsync() currently handles that manually already, to wait for any of the
> allocated pages to actually get written back by user space, so we have
> control over when something was *actually* written back.

Yeah, fuse_writepage_end() decreases fi->writectr, which gets checked
by fsync.

Knowing when something has been written back is not the issue, but
keeping order, handling splice, possible double writes to the same range
(they should be mostly idempotent, but is that guaranteed by all
servers?), etc.


> 
> 
> Similar to your proposal, I wonder if there could be a way to request
> fuse to "abort" a writeback request (instead of using fixed timeouts per
> request). Meaning, when we stumble over a folio that is under writeback
> on some paths, we would tell fuse to "end writeback now", or "end
> writeback now if it takes longer than X". Essentially hidden inside
> folio_wait_writeback().

Yeah, that would be a minor improvement to the overall issue ;) Re-queue
issue.

> 
> When aborting a request, as I said, we would essentially "end writeback"
> and mark the folio as dirty again. The interesting thing is likely how
> to handle user space that wants to process this request right now (stuck
> in fuse_send_writepage() I assume?), correct?

That sends background requests - does not get stuck. Completion happens
in fuse_writepage_end(), when the request reply is received.



Thanks,
Bernd



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-20 14:49                                     ` David Hildenbrand
  2024-12-20 15:26                                       ` Bernd Schubert
@ 2024-12-20 18:01                                       ` Shakeel Butt
  2024-12-21  2:28                                         ` Jingbo Xu
  2024-12-21 16:18                                         ` David Hildenbrand
  2024-12-20 21:01                                       ` Joanne Koong
  2 siblings, 2 replies; 124+ messages in thread
From: Shakeel Butt @ 2024-12-20 18:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote:
> > > I'm wondering if there would be a way to just "cancel" the writeback and
> > > mark the folio dirty again. That way it could be migrated, but not
> > > reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE
> > > thing.
> > > 
> > 
> > That is what I basically meant with short timeouts. Obviously it is not
> > that simple to cancel the request and to retry - it would add in quite
> > some complexity, if all the issues that arise can be solved at all.
> 
> At least it would keep that out of core-mm.
> 
> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should try to
> improve such scenarios, not acknowledge and integrate them, then work around
> using timeouts that must be manually configured, and ca likely no be default
> enabled because it could hurt reasonable use cases :(

Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm
parts. First is reclaim and second is compaction/migration. For reclaim,
it is a must have as explained by Jingbo in [1] i.e. due to potential
self deadlock by fuse server. If I understand you correctly, the main
concern you have is its usage in the second case.

The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was
to avoid untrusted fuse server causing pain to unrelated jobs on the
machine (fuse folks please correct me if I am wrong here). Now we are
discussing how to better handle that scenario.
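
The two uses described above can be sketched as a small decision model
in C.  This is a simplification under stated assumptions: struct
model_folio, enum model_action and both functions are hypothetical names
for this example, and real reclaim only waits on writeback in some
configurations, which the model collapses into a single MODEL_WAIT case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio plus the one mapping flag of interest. */
struct model_folio {
	bool under_writeback;
	bool mapping_wb_indeterminate;	/* models AS_WRITEBACK_INDETERMINATE */
};

enum model_action {
	MODEL_PROCEED,	/* not under writeback: handle normally */
	MODEL_WAIT,	/* block until writeback completes */
	MODEL_SKIP,	/* leave the folio alone and move on */
};

/* Reclaim: never block on a folio whose writeback may depend on us. */
enum model_action model_reclaim_action(const struct model_folio *f)
{
	assert(f);
	if (!f->under_writeback)
		return MODEL_PROCEED;
	return f->mapping_wb_indeterminate ? MODEL_SKIP : MODEL_WAIT;
}

/* Migration: async modes skip anyway; sync modes skip only if flagged. */
enum model_action model_migrate_action(const struct model_folio *f,
				       bool sync_mode)
{
	assert(f);
	if (!f->under_writeback)
		return MODEL_PROCEED;
	if (!sync_mode || f->mapping_wb_indeterminate)
		return MODEL_SKIP;
	return MODEL_WAIT;
}
```

The MODEL_SKIP result in the migration path is the case under
discussion: while the flag is set and writeback is pending, the folio
stays where it is.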

I just wanted to point out that irrespective of that discussion,
reclaim will have to handle the potential recursive deadlock and thus
will be using AS_WRITEBACK_INDETERMINATE or something similar.

[1] https://lore.kernel.org/all/d48ae58e-500f-4ef1-bc6f-a41a8f5b94bf@linux.alibaba.com/



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-20 14:49                                     ` David Hildenbrand
  2024-12-20 15:26                                       ` Bernd Schubert
  2024-12-20 18:01                                       ` Shakeel Butt
@ 2024-12-20 21:01                                       ` Joanne Koong
  2024-12-21 16:25                                         ` David Hildenbrand
  2 siblings, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2024-12-20 21:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Bernd Schubert, Shakeel Butt, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> I'm wondering if there would be a way to just "cancel" the writeback and
> >> mark the folio dirty again. That way it could be migrated, but not
> >> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE
> >> thing.
> >>
> >
> > That is what I basically meant with short timeouts. Obviously it is not
> > that simple to cancel the request and to retry - it would add in quite
> > some complexity, if all the issues that arise can be solved at all.
>
> At least it would keep that out of core-mm.
>
> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should
> try to improve such scenarios, not acknowledge and integrate them, then
> work around using timeouts that must be manually configured, and ca
> likely no be default enabled because it could hurt reasonable use cases :(
>
> Right now we clear the writeback flag immediately, indicating that data
> was written back, when in fact it was not written back at all. I suspect
> fsync() currently handles that manually already, to wait for any of the
> allocated pages to actually get written back by user space, so we have
> control over when something was *actually* written back.
>
>
> Similar to your proposal, I wonder if there could be a way to request
> fuse to "abort" a writeback request (instead of using fixed timeouts per
> request). Meaning, when we stumble over a folio that is under writeback
> on some paths, we would tell fuse to "end writeback now", or "end
> writeback now if it takes longer than X". Essentially hidden inside
> folio_wait_writeback().
>
> When aborting a request, as I said, we would essentially "end writeback"
> and mark the folio as dirty again. The interesting thing is likely how
> to handle user space that wants to process this request right now (stuck
> in fuse_send_writepage() I assume?), correct?

This would be fine if the writeback request hasn't been sent yet to
userspace but if it has and the pages are spliced, then ending
writeback could lead to memory crashes if the pipebuf buf->page is
accessed as it's being migrated. When a page/folio is being migrated,
is there some state set on the page to indicate that it's currently
under migration? The only workaround I can see for the splice case
that doesn't resort to bringing back extra copies is to have splice
somehow ensure that the page isn't being migrated when it's accessing
it.


Thanks,
Joanne

>
> Just throwing it out there ... no expert at all on fuse ...
>
> --
> Cheers,
>
> David / dhildenb
>



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-20 18:01                                       ` Shakeel Butt
@ 2024-12-21  2:28                                         ` Jingbo Xu
  2024-12-21 16:23                                           ` David Hildenbrand
  2024-12-21 16:18                                         ` David Hildenbrand
  1 sibling, 1 reply; 124+ messages in thread
From: Jingbo Xu @ 2024-12-21  2:28 UTC (permalink / raw)
  To: Shakeel Butt, David Hildenbrand
  Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko



On 12/21/24 2:01 AM, Shakeel Butt wrote:
> On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote:
>>>> I'm wondering if there would be a way to just "cancel" the writeback and
>>>> mark the folio dirty again. That way it could be migrated, but not
>>>> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE
>>>> thing.
>>>>
>>>
>>> That is what I basically meant with short timeouts. Obviously it is not
>>> that simple to cancel the request and to retry - it would add in quite
>>> some complexity, if all the issues that arise can be solved at all.
>>
>> At least it would keep that out of core-mm.
>>
>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should try to
>> improve such scenarios, not acknowledge and integrate them, then work around
>> using timeouts that must be manually configured, and ca likely no be default
>> enabled because it could hurt reasonable use cases :(
> 
> Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm
> parts. First is reclaim and second is compaction/migration. For reclaim,
> it is a must have as explained by Jingbo in [1] i.e. due to potential
> self deadlock by fuse server. If I understand you correctly, the main
> concern you have is its usage in the second case.
> 
> The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was
> to avoid untrusted fuse server causing pain to unrelated jobs on the
> machine (fuse folks please correct me if I am wrong here).

Right, IIUC direct MIGRATE_SYNC migration won't be triggered on the
memory allocation path, i.e. the fuse server itself won't stumble into
MIGRATE_SYNC migration.

-- 
Thanks,
Jingbo



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-20 18:01                                       ` Shakeel Butt
  2024-12-21  2:28                                         ` Jingbo Xu
@ 2024-12-21 16:18                                         ` David Hildenbrand
  2024-12-23 22:14                                           ` Shakeel Butt
  1 sibling, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2024-12-21 16:18 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 20.12.24 19:01, Shakeel Butt wrote:
> On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote:
>>>> I'm wondering if there would be a way to just "cancel" the writeback and
>>>> mark the folio dirty again. That way it could be migrated, but not
>>>> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE
>>>> thing.
>>>>
>>>
>>> That is what I basically meant with short timeouts. Obviously it is not
>>> that simple to cancel the request and to retry - it would add in quite
>>> some complexity, if all the issues that arise can be solved at all.
>>
>> At least it would keep that out of core-mm.
>>
>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should try to
>> improve such scenarios, not acknowledge and integrate them, then work around
>> using timeouts that must be manually configured, and ca likely no be default
>> enabled because it could hurt reasonable use cases :(
> 
> Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm
> parts. First is reclaim and second is compaction/migration. For reclaim,
> it is a must have as explained by Jingbo in [1] i.e. due to potential
> self deadlock by fuse server. If I understand you correctly, the main
> concern you have is its usage in the second case.

Yes, so I can see fuse

(1) Breaking memory reclaim (memory cannot get freed up)

(2) Breaking page migration (memory cannot be migrated)

Due to (1) we might experience bigger memory pressure in the system I 
guess. A handful of these pages don't really hurt, I have no idea how 
bad having many of these pages can be. But yes, inherently we cannot 
throw away the data as long as it is dirty without causing harm. (maybe 
we could move it to some other cache, like swap/zswap; but that smells 
like a big and complicated project)

Due to (2) we turn pages that are supposed to be movable possibly for a 
long time unmovable. Even a *single* such page will mean that CMA 
allocations / memory unplug can start failing.

We have similar situations with page pinning. With things like O_DIRECT,
our assumption/experience so far is that it will only take a couple of
seconds max, and retry loops are sufficient to handle it. That's why
only long-term pinning ("indeterminate", e.g., vfio) migrates these pages
out of ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.


The biggest concern I have is that timeouts, while likely reasonable in
many scenarios, might not be desirable even for some sane workloads, and
the default in all systems will be "no timeout", leaving the clueless
admin of each and every system out there that might support fuse to make
a decision.

I might have misunderstood something, in which case I am very sorry, but 
we also don't want CMA allocations to start failing simply because a 
network connection is down for a couple of minutes such that a fuse 
daemon cannot make progress.


> 
> The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was
> to avoid untrusted fuse server causing pain to unrelated jobs on the
> machine (fuse folks please correct me if I am wrong here). Now we are
> discussing how to better handle that scenario.
> 
> I just wanted to point out that irrespective of that discussion, the
> reclaim will have handle the potential recursive deadlock and thus will
> be using AS_WRITEBACK_INDETERMINATE or something similar.

Yes, I see no way to throw away dirty data without causing harm.

Migration was kept working for now, although in a hacky fashion I admit. 
I do enjoy that "writeback" on the folio actually matches the reality now.

I guess an alternative to "aborting writeback" would be to make fuse 
allow for migrating folios that are under writeback. I would assume that 
with fuse we have very good control over who is currently 
reading/writing that folio, and we could swap it out? Again, just an 
idea ...


-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-21  2:28                                         ` Jingbo Xu
@ 2024-12-21 16:23                                           ` David Hildenbrand
  2024-12-22  2:47                                             ` Jingbo Xu
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2024-12-21 16:23 UTC (permalink / raw)
  To: Jingbo Xu, Shakeel Butt
  Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On 21.12.24 03:28, Jingbo Xu wrote:
> 
> 
> On 12/21/24 2:01 AM, Shakeel Butt wrote:
>> On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote:
>>>>> I'm wondering if there would be a way to just "cancel" the writeback and
>>>>> mark the folio dirty again. That way it could be migrated, but not
>>>>> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE
>>>>> thing.
>>>>>
>>>>
>>>> That is what I basically meant with short timeouts. Obviously it is not
>>>> that simple to cancel the request and to retry - it would add in quite
>>>> some complexity, if all the issues that arise can be solved at all.
>>>
>>> At least it would keep that out of core-mm.
>>>
>>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should try to
>>> improve such scenarios, not acknowledge and integrate them, then work around
>>> using timeouts that must be manually configured, and ca likely no be default
>>> enabled because it could hurt reasonable use cases :(
>>
>> Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm
>> parts. First is reclaim and second is compaction/migration. For reclaim,
>> it is a must have as explained by Jingbo in [1] i.e. due to potential
>> self deadlock by fuse server. If I understand you correctly, the main
>> concern you have is its usage in the second case.
>>
>> The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was
>> to avoid untrusted fuse server causing pain to unrelated jobs on the
>> machine (fuse folks please correct me if I am wrong here).
> 
> Right, IIUC direct MIGRATE_SYNC migration won't be triggered on the
> memory allocation path, i.e. the fuse server itself won't stumble into
> MIGRATE_SYNC migration.
> 

Maybe memory compaction (on higher-order allocations only) could trigger it?

gfp_compaction_allowed() checks __GFP_IO. GFP_KERNEL includes that.
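
A small userspace model of that check, as a sketch only: the MODEL_*
bit values are made up for illustration (the real ones are defined in
the kernel's gfp headers), and the model leaves out the
CONFIG_COMPACTION test that the real helper also performs.  The
composition of the three masks mirrors how GFP_KERNEL, GFP_NOFS and
GFP_NOIO are built from __GFP_RECLAIM, __GFP_IO and __GFP_FS.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; not the kernel's actual encoding. */
#define MODEL_GFP_IO			0x1u
#define MODEL_GFP_FS			0x2u
#define MODEL_GFP_DIRECT_RECLAIM	0x4u
#define MODEL_GFP_KSWAPD_RECLAIM	0x8u

#define MODEL_GFP_RECLAIM \
	(MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM)
#define MODEL_GFP_KERNEL \
	(MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS	(MODEL_GFP_RECLAIM | MODEL_GFP_IO)
#define MODEL_GFP_NOIO	(MODEL_GFP_RECLAIM)

/* Models the __GFP_IO test: compaction is allowed only if IO is allowed. */
bool model_compaction_allowed(unsigned int gfp_mask)
{
	return (gfp_mask & MODEL_GFP_IO) != 0;
}
```

So in this model an allocation made with the GFP_KERNEL equivalent
passes the check, while one made with the GFP_NOIO equivalent does not.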

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-20 21:01                                       ` Joanne Koong
@ 2024-12-21 16:25                                         ` David Hildenbrand
  2024-12-21 21:59                                           ` Bernd Schubert
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2024-12-21 16:25 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Bernd Schubert, Shakeel Butt, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 20.12.24 22:01, Joanne Koong wrote:
> On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> I'm wondering if there would be a way to just "cancel" the writeback and
>>>> mark the folio dirty again. That way it could be migrated, but not
>>>> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE
>>>> thing.
>>>>
>>>
>>> That is what I basically meant with short timeouts. Obviously it is not
>>> that simple to cancel the request and to retry - it would add in quite
>>> some complexity, if all the issues that arise can be solved at all.
>>
>> At least it would keep that out of core-mm.
>>
>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should
>> try to improve such scenarios, not acknowledge and integrate them, then
>> work around using timeouts that must be manually configured, and can
>> likely not be default enabled because it could hurt reasonable use cases :(
>>
>> Right now we clear the writeback flag immediately, indicating that data
>> was written back, when in fact it was not written back at all. I suspect
>> fsync() currently handles that manually already, to wait for any of the
>> allocated pages to actually get written back by user space, so we have
>> control over when something was *actually* written back.
>>
>>
>> Similar to your proposal, I wonder if there could be a way to request
>> fuse to "abort" a writeback request (instead of using fixed timeouts per
>> request). Meaning, when we stumble over a folio that is under writeback
>> on some paths, we would tell fuse to "end writeback now", or "end
>> writeback now if it takes longer than X". Essentially hidden inside
>> folio_wait_writeback().
>>
>> When aborting a request, as I said, we would essentially "end writeback"
>> and mark the folio as dirty again. The interesting thing is likely how
>> to handle user space that wants to process this request right now (stuck
>> in fuse_send_writepage() I assume?), correct?
> 
> This would be fine if the writeback request hasn't been sent yet to
> userspace but if it has and the pages are spliced

Can you point me at the code where that splicing happens?

> , then ending
> writeback could lead to memory crashes if the pipebuf buf->page is
> accessed as it's being migrated. When a page/folio is being migrated,
> is there some state set on the page to indicate that it's currently
> under migration?

Unfortunately not really. It should be isolated and locked. So it would 
be a !LRU but locked folio.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-21 16:25                                         ` David Hildenbrand
@ 2024-12-21 21:59                                           ` Bernd Schubert
  2024-12-23 19:00                                             ` Joanne Koong
  0 siblings, 1 reply; 124+ messages in thread
From: Bernd Schubert @ 2024-12-21 21:59 UTC (permalink / raw)
  To: David Hildenbrand, Joanne Koong
  Cc: Shakeel Butt, Zi Yan, miklos, linux-fsdevel, jefflexu, josef,
	linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko



On 12/21/24 17:25, David Hildenbrand wrote:
> On 20.12.24 22:01, Joanne Koong wrote:
>> On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com>
>> wrote:
>>>
>>>>> I'm wondering if there would be a way to just "cancel" the
>>>>> writeback and
>>>>> mark the folio dirty again. That way it could be migrated, but not
>>>>> reclaimed. At least we could avoid the whole
>>>>> AS_WRITEBACK_INDETERMINATE
>>>>> thing.
>>>>>
>>>>
>>>> That is what I basically meant with short timeouts. Obviously it is not
>>>> that simple to cancel the request and to retry - it would add in quite
>>>> some complexity, if all the issues that arise can be solved at all.
>>>
>>> At least it would keep that out of core-mm.
>>>
>>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should
>>> try to improve such scenarios, not acknowledge and integrate them, then
>>> work around using timeouts that must be manually configured, and can
>>> likely not be default enabled because it could hurt reasonable use
>>> cases :(
>>>
>>> Right now we clear the writeback flag immediately, indicating that data
>>> was written back, when in fact it was not written back at all. I suspect
>>> fsync() currently handles that manually already, to wait for any of the
>>> allocated pages to actually get written back by user space, so we have
>>> control over when something was *actually* written back.
>>>
>>>
>>> Similar to your proposal, I wonder if there could be a way to request
>>> fuse to "abort" a writeback request (instead of using fixed timeouts per
>>> request). Meaning, when we stumble over a folio that is under writeback
>>> on some paths, we would tell fuse to "end writeback now", or "end
>>> writeback now if it takes longer than X". Essentially hidden inside
>>> folio_wait_writeback().
>>>
>>> When aborting a request, as I said, we would essentially "end writeback"
>>> and mark the folio as dirty again. The interesting thing is likely how
>>> to handle user space that wants to process this request right now (stuck
>>> in fuse_send_writepage() I assume?), correct?
>>
>> This would be fine if the writeback request hasn't been sent yet to
>> userspace but if it has and the pages are spliced
> 
> Can you point me at the code where that splicing happens?

fuse_dev_splice_read()
  fuse_dev_do_read()
    fuse_copy_args()
      fuse_copy_page


Btw, for the non-splice case, disabling migration should only be
needed while it is copying to the userspace buffer?



Thanks,
Bernd


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-21 16:23                                           ` David Hildenbrand
@ 2024-12-22  2:47                                             ` Jingbo Xu
  2024-12-24 11:32                                               ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Jingbo Xu @ 2024-12-22  2:47 UTC (permalink / raw)
  To: David Hildenbrand, Shakeel Butt
  Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko



On 12/22/24 12:23 AM, David Hildenbrand wrote:
> On 21.12.24 03:28, Jingbo Xu wrote:
>>
>>
>> On 12/21/24 2:01 AM, Shakeel Butt wrote:
>>> On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote:
>>>>>> I'm wondering if there would be a way to just "cancel" the
>>>>>> writeback and
>>>>>> mark the folio dirty again. That way it could be migrated, but not
>>>>>> reclaimed. At least we could avoid the whole
>>>>>> AS_WRITEBACK_INDETERMINATE
>>>>>> thing.
>>>>>>
>>>>>
>>>>> That is what I basically meant with short timeouts. Obviously it is
>>>>> not
>>>>> that simple to cancel the request and to retry - it would add in quite
>>>>> some complexity, if all the issues that arise can be solved at all.
>>>>
>>>> At least it would keep that out of core-mm.
>>>>
>>>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we
>>>> should try to
>>>> improve such scenarios, not acknowledge and integrate them, then
>>>> work around
>>>> using timeouts that must be manually configured, and can likely not be
>>>> default
>>>> enabled because it could hurt reasonable use cases :(
>>>
>>> Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm
>>> parts. First is reclaim and second is compaction/migration. For reclaim,
>>> it is a must have as explained by Jingbo in [1] i.e. due to potential
>>> self deadlock by fuse server. If I understand you correctly, the main
>>> concern you have is its usage in the second case.
>>>
>>> The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was
>>> to avoid untrusted fuse server causing pain to unrelated jobs on the
>>> machine (fuse folks please correct me if I am wrong here).
>>
>> Right, IIUC direct MIGRATE_SYNC migration won't be triggered on the
>> memory allocation path, i.e. the fuse server itself won't stumble into
>> MIGRATE_SYNC migration.
>>
> 
> Maybe memory compaction (on higher-order allocations only) could trigger
> it?
> 
> gfp_compaction_allowed() checks __GFP_IO. GFP_KERNEL includes that.
> 

But that (memory compaction on memory allocation, which can be triggered
in the fuse server process context) only triggers MIGRATE_SYNC_LIGHT,
which won't wait for writeback.

AFAICS, MIGRATE_SYNC can be triggered during cma allocation, memory
offline, or node compaction manually through sysctl.

-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-21 21:59                                           ` Bernd Schubert
@ 2024-12-23 19:00                                             ` Joanne Koong
  2024-12-26 22:44                                               ` Bernd Schubert
  0 siblings, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2024-12-23 19:00 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: David Hildenbrand, Shakeel Butt, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Sat, Dec 21, 2024 at 1:59 PM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
>
>
> On 12/21/24 17:25, David Hildenbrand wrote:
> > On 20.12.24 22:01, Joanne Koong wrote:
> >> On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com>
> >> wrote:
> >>>
> >>>>> I'm wondering if there would be a way to just "cancel" the
> >>>>> writeback and
> >>>>> mark the folio dirty again. That way it could be migrated, but not
> >>>>> reclaimed. At least we could avoid the whole
> >>>>> AS_WRITEBACK_INDETERMINATE
> >>>>> thing.
> >>>>>
> >>>>
> >>>> That is what I basically meant with short timeouts. Obviously it is not
> >>>> that simple to cancel the request and to retry - it would add in quite
> >>>> some complexity, if all the issues that arise can be solved at all.
> >>>
> >>> At least it would keep that out of core-mm.
> >>>
> >>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should
> >>> try to improve such scenarios, not acknowledge and integrate them, then
> >>> work around using timeouts that must be manually configured, and can
> >>> likely not be default enabled because it could hurt reasonable use
> >>> cases :(
> >>>
> >>> Right now we clear the writeback flag immediately, indicating that data
> >>> was written back, when in fact it was not written back at all. I suspect
> >>> fsync() currently handles that manually already, to wait for any of the
> >>> allocated pages to actually get written back by user space, so we have
> >>> control over when something was *actually* written back.
> >>>
> >>>
> >>> Similar to your proposal, I wonder if there could be a way to request
> >>> fuse to "abort" a writeback request (instead of using fixed timeouts per
> >>> request). Meaning, when we stumble over a folio that is under writeback
> >>> on some paths, we would tell fuse to "end writeback now", or "end
> >>> writeback now if it takes longer than X". Essentially hidden inside
> >>> folio_wait_writeback().
> >>>
> >>> When aborting a request, as I said, we would essentially "end writeback"
> >>> and mark the folio as dirty again. The interesting thing is likely how
> >>> to handle user space that wants to process this request right now (stuck
> >>> in fuse_send_writepage() I assume?), correct?
> >>
> >> This would be fine if the writeback request hasn't been sent yet to
> >> userspace but if it has and the pages are spliced
> >
> > Can you point me at the code where that splicing happens?
>
> fuse_dev_splice_read()
>   fuse_dev_do_read()
>     fuse_copy_args()
>       fuse_copy_page
>
>
> Btw, for the non splice case, disabling migration should be
> only needed while it is copying to the userspace buffer?

I don't think so. We don't currently disable migration when copying
to/from the userspace buffer for reads.


Thanks,
Joanne
>
>
>
> Thanks,
> Bernd


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-21 16:18                                         ` David Hildenbrand
@ 2024-12-23 22:14                                           ` Shakeel Butt
  2024-12-24 12:37                                             ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-23 22:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
[...]
> 
> Yes, so I can see fuse
> 
> (1) Breaking memory reclaim (memory cannot get freed up)
> 
> (2) Breaking page migration (memory cannot be migrated)
> 
> Due to (1) we might experience bigger memory pressure in the system I guess.
> A handful of these pages don't really hurt, I have no idea how bad having
> many of these pages can be. But yes, inherently we cannot throw away the
> data as long as it is dirty without causing harm. (maybe we could move it to
> some other cache, like swap/zswap; but that smells like a big and
> complicated project)
> 
> Due to (2) we turn pages that are supposed to be movable possibly for a long
> time unmovable. Even a *single* such page will mean that CMA allocations /
> memory unplug can start failing.
> 
> We have similar situations with page pinning. With things like O_DIRECT, our
> assumption/experience so far is that it will only take a couple of seconds
> max, and retry loops are sufficient to handle it. That's why only long-term
> pinning ("indeterminate", e.g., vfio) migrate these pages out of
> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
> 
> 
> The biggest concern I have is that timeouts, while likely reasonable in many
> scenarios, might not be desirable even for some sane workloads, and the
> default in all system will be "no timeout", letting the clueless admin of
> each and every system out there that might support fuse to make a decision.
> 
> I might have misunderstood something, in which case I am very sorry, but we
> also don't want CMA allocations to start failing simply because a network
> connection is down for a couple of minutes such that a fuse daemon cannot
> make progress.
> 

I think you have valid concerns but these are not new and not unique to
fuse. Any filesystem with a potential arbitrary stall can have similar
issues. The arbitrary stall can be caused due to network issues or some
faulty local storage.

Regarding the reclaim, I wouldn't say fuse or similar filesystems are
breaking memory reclaim, as the kernel has a mechanism to throttle the
threads dirtying the file memory to reduce the chance of situations
where most of memory becomes unreclaimable due to being dirty.

Please note that such filesystems are mostly used in environments like
data centers or hyperscalers and usually have more advanced mechanisms to
handle and avoid situations like long delays. For such environments,
network unavailability is a larger issue than some CMA allocation
failure. My point is: let's not assume the disastrous situation is normal
and overcomplicate the solution.



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-22  2:47                                             ` Jingbo Xu
@ 2024-12-24 11:32                                               ` David Hildenbrand
  0 siblings, 0 replies; 124+ messages in thread
From: David Hildenbrand @ 2024-12-24 11:32 UTC (permalink / raw)
  To: Jingbo Xu, Shakeel Butt
  Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On 22.12.24 03:47, Jingbo Xu wrote:
> 
> 
> On 12/22/24 12:23 AM, David Hildenbrand wrote:
>> On 21.12.24 03:28, Jingbo Xu wrote:
>>>
>>>
>>> On 12/21/24 2:01 AM, Shakeel Butt wrote:
>>>> On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote:
>>>>>>> I'm wondering if there would be a way to just "cancel" the
>>>>>>> writeback and
>>>>>>> mark the folio dirty again. That way it could be migrated, but not
>>>>>>> reclaimed. At least we could avoid the whole
>>>>>>> AS_WRITEBACK_INDETERMINATE
>>>>>>> thing.
>>>>>>>
>>>>>>
>>>>>> That is what I basically meant with short timeouts. Obviously it is
>>>>>> not
>>>>>> that simple to cancel the request and to retry - it would add in quite
>>>>>> some complexity, if all the issues that arise can be solved at all.
>>>>>
>>>>> At least it would keep that out of core-mm.
>>>>>
>>>>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we
>>>>> should try to
>>>>> improve such scenarios, not acknowledge and integrate them, then
>>>>> work around
>>>>> using timeouts that must be manually configured, and can likely not be
>>>>> default
>>>>> enabled because it could hurt reasonable use cases :(
>>>>
>>>> Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm
>>>> parts. First is reclaim and second is compaction/migration. For reclaim,
>>>> it is a must have as explained by Jingbo in [1] i.e. due to potential
>>>> self deadlock by fuse server. If I understand you correctly, the main
>>>> concern you have is its usage in the second case.
>>>>
>>>> The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was
>>>> to avoid untrusted fuse server causing pain to unrelated jobs on the
>>>> machine (fuse folks please correct me if I am wrong here).
>>>
>>> Right, IIUC direct MIGRATE_SYNC migration won't be triggered on the
>>> memory allocation path, i.e. the fuse server itself won't stumble into
>>> MIGRATE_SYNC migration.
>>>
>>
>> Maybe memory compaction (on higher-order allocations only) could trigger
>> it?
>>
>> gfp_compaction_allowed() checks __GFP_IO. GFP_KERNEL includes that.
>>
> 
> But that (memory compaction on memory allocation, which can be triggered
> in the fuse server process context) only triggers MIGRATE_SYNC_LIGHT,
> which won't wait for writeback.
> 

Ah, that makes sense.

> AFAICS, MIGRATE_SYNC can be triggered during cma allocation, memory
> offline, or node compaction manually through sysctl.

Right, non-proactive compaction always uses MIGRATE_SYNC_LIGHT, which
won't wait.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-23 22:14                                           ` Shakeel Butt
@ 2024-12-24 12:37                                             ` David Hildenbrand
  2024-12-26 15:11                                               ` Zi Yan
  2024-12-26 20:13                                               ` Shakeel Butt
  0 siblings, 2 replies; 124+ messages in thread
From: David Hildenbrand @ 2024-12-24 12:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 23.12.24 23:14, Shakeel Butt wrote:
> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
> [...]
>>
>> Yes, so I can see fuse
>>
>> (1) Breaking memory reclaim (memory cannot get freed up)
>>
>> (2) Breaking page migration (memory cannot be migrated)
>>
>> Due to (1) we might experience bigger memory pressure in the system I guess.
>> A handful of these pages don't really hurt, I have no idea how bad having
>> many of these pages can be. But yes, inherently we cannot throw away the
>> data as long as it is dirty without causing harm. (maybe we could move it to
>> some other cache, like swap/zswap; but that smells like a big and
>> complicated project)
>>
>> Due to (2) we turn pages that are supposed to be movable possibly for a long
>> time unmovable. Even a *single* such page will mean that CMA allocations /
>> memory unplug can start failing.
>>
>> We have similar situations with page pinning. With things like O_DIRECT, our
>> assumption/experience so far is that it will only take a couple of seconds
>> max, and retry loops are sufficient to handle it. That's why only long-term
>> pinning ("indeterminate", e.g., vfio) migrate these pages out of
>> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
>>
>>
>> The biggest concern I have is that timeouts, while likely reasonable in many
>> scenarios, might not be desirable even for some sane workloads, and the
>> default in all system will be "no timeout", letting the clueless admin of
>> each and every system out there that might support fuse to make a decision.
>>
>> I might have misunderstood something, in which case I am very sorry, but we
>> also don't want CMA allocations to start failing simply because a network
>> connection is down for a couple of minutes such that a fuse daemon cannot
>> make progress.
>>
> 
> I think you have valid concerns but these are not new and not unique to
> fuse. Any filesystem with a potential arbitrary stall can have similar
> issues. The arbitrary stall can be caused due to network issues or some
> faulty local storage.

What concerns me more is that this can be triggered even by
unprivileged user space, and that there is no default protection as far
as I understood, because timeouts cannot be set universally to sane
defaults.

Again, please correct me if I got that wrong.


BTW, I just looked at NFS out of interest, in particular 
nfs_page_async_flush(), and I spot some logic about re-dirtying pages + 
canceling writeback. IIUC, there are default timeouts for UDP and TCP, 
whereby the TCP default one seems to be around 60s (* retrans?), and the 
privileged user that mounts it can set higher ones. I guess one could 
run into similar writeback issues?

So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? 
Not sure if I grasped all details about NFS and writeback and when it 
would redirty+end writeback, and if there is some other handling in there.

> 
> Regarding the reclaim, I wouldn't say fuse or similar filesystem are
> breaking memory reclaim as the kernel has mechanism to throttle the
> threads dirtying the file memory to reduce the chance of situations
> where most of memory becomes unreclaimable due to being dirty.

Yes, likely even cgroups can easily limit the amount.

> 
> Please note that such filesystems are mostly used in environments like
> data center or hyperscaler and usually have more advanced mechanisms to
> handle and avoid situations like long delays. For such environment
> network unavailability is a larger issue than some cma allocation
> failure. My point is: let's not assume the disastrous situation is normal
> and overcomplicate the solution.

Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be 
used for movable allocations.

Mechanisms that possibly turn these folios unmovable for a
long/indeterminate time must either fail or migrate these folios out of
these regions, otherwise we start violating the very semantics for which
ZONE_MOVABLE/MIGRATE_CMA was added in the first place.

Yes, there are corner cases where we cannot guarantee movability (e.g., 
OOM when allocating a migration destination), but these are not cases 
that can be triggered by (unprivileged) user space easily.

That's why FOLL_LONGTERM pinning does exactly that: even if user space 
would promise that this is really only "short-term", we will treat it as 
"possibly forever", because it's under user-space control.


Instead of having more subsystems violate these semantics because 
"performance" ... I would hope we would do better. Maybe it's an issue 
for NFS as well ("at least" only for privileged user space)? In which 
case, again, I would hope we would do better.


Anyhow, I'm hoping there will be more feedback from other MM folks, but 
likely right now a lot of people are out (just like I should ;) ).

If I end up being the only one with these concerns, then likely people 
can feel free to ignore them. ;)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-24 12:37                                             ` David Hildenbrand
@ 2024-12-26 15:11                                               ` Zi Yan
  2024-12-26 20:13                                               ` Shakeel Butt
  1 sibling, 0 replies; 124+ messages in thread
From: Zi Yan @ 2024-12-26 15:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Shakeel Butt, Bernd Schubert, Joanne Koong, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 24 Dec 2024, at 7:37, David Hildenbrand wrote:

> On 23.12.24 23:14, Shakeel Butt wrote:
>> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
>> [...]
>>>
>>> Yes, so I can see fuse
>>>
>>> (1) Breaking memory reclaim (memory cannot get freed up)
>>>
>>> (2) Breaking page migration (memory cannot be migrated)
>>>
>>> Due to (1) we might experience bigger memory pressure in the system I guess.
>>> A handful of these pages don't really hurt, I have no idea how bad having
>>> many of these pages can be. But yes, inherently we cannot throw away the
>>> data as long as it is dirty without causing harm. (maybe we could move it to
>>> some other cache, like swap/zswap; but that smells like a big and
>>> complicated project)
>>>
>>> Due to (2) we turn pages that are supposed to be movable possibly for a long
>>> time unmovable. Even a *single* such page will mean that CMA allocations /
>>> memory unplug can start failing.
>>>
>>> We have similar situations with page pinning. With things like O_DIRECT, our
>>> assumption/experience so far is that it will only take a couple of seconds
>>> max, and retry loops are sufficient to handle it. That's why only long-term
>>> pinning ("indeterminate", e.g., vfio) migrate these pages out of
>>> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
>>>
>>>
>>> The biggest concern I have is that timeouts, while likely reasonable in many
>>> scenarios, might not be desirable even for some sane workloads, and the
>>> default in all system will be "no timeout", letting the clueless admin of
>>> each and every system out there that might support fuse to make a decision.
>>>
>>> I might have misunderstood something, in which case I am very sorry, but we
>>> also don't want CMA allocations to start failing simply because a network
>>> connection is down for a couple of minutes such that a fuse daemon cannot
>>> make progress.
>>>
>>
>> I think you have valid concerns but these are not new and not unique to
>> fuse. Any filesystem with a potential arbitrary stall can have similar
>> issues. The arbitrary stall can be caused due to network issues or some
>> faulty local storage.
>
> What concerns me more is that this can be triggered by even unprivileged user space, and that there is no default protection as far as I understood, because timeouts cannot be set universally to sane defaults.
>
> Again, please correct me if I got that wrong.
>
>
> BTW, I just looked at NFS out of interest, in particular nfs_page_async_flush(), and I spot some logic about re-dirtying pages + canceling writeback. IIUC, there are default timeouts for UDP and TCP, whereby the TCP default one seems to be around 60s (* retrans?), and the privileged user that mounts it can set higher ones. I guess one could run into similar writeback issues?
>
> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? Not sure if I grasped all details about NFS and writeback and when it would redirty+end writeback, and if there is some other handling in there.
>
>>
>> Regarding the reclaim, I wouldn't say fuse or similar filesystem are
>> breaking memory reclaim as the kernel has mechanism to throttle the
>> threads dirtying the file memory to reduce the chance of situations
>> where most of memory becomes unreclaimable due to being dirty.
>
> Yes, likely even cgroups can easily limit the amount.
>
>>
>> Please note that such filesystems are mostly used in environments like
>> data center or hyperscaler and usually have more advanced mechanisms to
>> handle and avoid situations like long delays. For such environment
>> network unavailability is a larger issue than some cma allocation
>> failure. My point is: let's not assume the disastrous situation is normal
>> and overcomplicate the solution.
>
> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used for movable allocations.

Exactly this.

>
> Mechanisms that possibly turn these folios unmovable for a long/indeterminate time must either fail or migrate these folios out of these regions, otherwise we start violating the very semantics why ZONE_MOVABLE/MIGRATE_CMA was added in the first place.

Totally agree.

>
> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM when allocating a migration destination), but these are not cases that can be triggered by (unprivileged) user space easily.
>
> That's why FOLL_LONGTERM pinning does exactly that: even if user space would promise that this is really only "short-term", we will treat it as "possibly forever", because it's under user-space control.
>
>
> Instead of having more subsystems violate these semantics because "performance" ... I would hope we would do better. Maybe it's an issue for NFS as well ("at least" only for privileged user space)? In which case, again, I would hope we would do better.

Another issue with the proposed AS_WRITEBACK_INDETERMINATE approach is that FUSE
used to use temp pages from MIGRATE_UNMOVABLE to write back dirty pages, which
confined these unmovable pages to certain pageblocks. Now any dirty page
can become unmovable due to AS_WRITEBACK_INDETERMINATE, and such pages can be
spread across the entire physical address space. This means memory can become
fragmented much more easily: with the same 512 dirty pages, previously all could
be confined to 1 pageblock, but now in the worst case they can appear in 512
pageblocks.

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-24 12:37                                             ` David Hildenbrand
  2024-12-26 15:11                                               ` Zi Yan
@ 2024-12-26 20:13                                               ` Shakeel Butt
  2024-12-26 22:02                                                 ` Bernd Schubert
                                                                   ` (2 more replies)
  1 sibling, 3 replies; 124+ messages in thread
From: Shakeel Butt @ 2024-12-26 20:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote:
> On 23.12.24 23:14, Shakeel Butt wrote:
> > On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
> > [...]
> > > 
> > > Yes, so I can see fuse
> > > 
> > > (1) Breaking memory reclaim (memory cannot get freed up)
> > > 
> > > (2) Breaking page migration (memory cannot be migrated)
> > > 
> > > Due to (1) we might experience bigger memory pressure in the system I guess.
> > > A handful of these pages don't really hurt, I have no idea how bad having
> > > many of these pages can be. But yes, inherently we cannot throw away the
> > > data as long as it is dirty without causing harm. (maybe we could move it to
> > > some other cache, like swap/zswap; but that smells like a big and
> > > complicated project)
> > > 
> > > Due to (2) we turn pages that are supposed to be movable possibly for a long
> > > time unmovable. Even a *single* such page will mean that CMA allocations /
> > > memory unplug can start failing.
> > > 
> > > We have similar situations with page pinning. With things like O_DIRECT, our
> > > assumption/experience so far is that it will only take a couple of seconds
> > > max, and retry loops are sufficient to handle it. That's why only long-term
> > > pinning ("indeterminate", e.g., vfio) migrate these pages out of
> > > ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
> > > 
> > > 
> > > > The biggest concern I have is that timeouts, while likely reasonable in many
> > > > scenarios, might not be desirable even for some sane workloads, and the
> > > > default in all systems will be "no timeout", leaving it to the clueless admin of
> > > > each and every system out there that might support fuse to make a decision.
> > > 
> > > I might have misunderstood something, in which case I am very sorry, but we
> > > also don't want CMA allocations to start failing simply because a network
> > > connection is down for a couple of minutes such that a fuse daemon cannot
> > > make progress.
> > > 
> > 
> > I think you have valid concerns but these are not new and not unique to
> > fuse. Any filesystem with a potential arbitrary stall can have similar
> > issues. The arbitrary stall can be caused due to network issues or some
> > > faulty local storage.
> 
> > What concerns me more is that this can be triggered by even unprivileged
> > user space, and that there is no default protection as far as I understood,
> > because timeouts cannot be set universally to sane defaults.
> 
> Again, please correct me if I got that wrong.
> 

Let's route this question to the FUSE folks. More specifically: can an
unprivileged process create a mount point backed by itself, create a
lot of dirty (bounded by cgroup) and writeback pages on it, and leave the
writeback pages in that state forever?

> 
> BTW, I just looked at NFS out of interest, in particular
> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> whereby the TCP default one seems to be around 60s (* retrans?), and the
> privileged user that mounts it can set higher ones. I guess one could run
> into similar writeback issues?

Yes, I think so.

> 
> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?

I feel like INDETERMINATE in the name is the main cause of confusion.
So, let me explain why it is required (but later I will tell you how it
can be avoided). The FUSE thread which is actively handling writeback of
a given folio can cause memory allocation either through syscall or page
fault. That memory allocation can trigger global reclaim synchronously
and in cgroup-v1, that FUSE thread can wait on the writeback on the same
folio whose writeback it is supposed to end, causing a deadlock. So,
AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.

In-kernel filesystems avoid this situation through the use of GFP_NOFS
allocations. A userspace fs can use a similar approach, which is
prctl(PR_SET_IO_FLUSHER, 1), to avoid this situation. However, I have been
told that it is hard to use, as it is a per-thread flag and has to be set
for all the threads handling writeback, which can be error-prone if the
threadpool is dynamic. Second, it is very coarse: all the
allocations from those threads (e.g. page faults) become NOFS, which
makes userspace very unreliable on a highly utilized machine, as NOFS
cannot reclaim potentially a lot of memory and cannot trigger oom-kill.
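
For reference, a minimal sketch of what opting a thread into this looks
like from userspace. It is illustrative only and not taken from any FUSE
server: the helper names are made up, and the fallback #define values are
the ones I believe linux/prctl.h uses. As far as I know the call needs
CAP_SYS_RESOURCE and a 5.6+ kernel, so expect EPERM or EINVAL otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Older userspace headers may lack these; values as in linux/prctl.h. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Hypothetical helper: set (1) or clear (0) the IO_FLUSHER state of the
 * *calling thread only*. Every thread that handles writeback has to call
 * this itself, which is the per-thread burden described above.
 * Returns 0 on success or a negative errno value.
 */
static int set_io_flusher(int enable)
{
	assert(enable == 0 || enable == 1);
	if (prctl(PR_SET_IO_FLUSHER, (unsigned long)enable, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}

/* Returns 1 or 0 for the calling thread's state, or a negative errno value. */
static int get_io_flusher(void)
{
	int ret = prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);

	return ret == -1 ? -errno : ret;
}
```

With a dynamic threadpool, set_io_flusher(1) would have to run at the start
of every newly spawned worker, and a missed call is silent.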

> Not
> sure if I grasped all details about NFS and writeback and when it would
> redirty+end writeback, and if there is some other handling in there.
> 
[...]
> > 
> > Please note that such filesystems are mostly used in environments like
> > data centers or hyperscalers and usually have more advanced mechanisms to
> > handle and avoid situations like long delays. For such environments
> > network unavailability is a larger issue than some cma allocation
> > failure. My point is: let's not assume the disastrous situation is normal
> > and overcomplicate the solution.
> 
> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used
> for movable allocations.
> 
> Mechanisms that possibly turn these folios unmovable for a
> long/indeterminate time must either fail or migrate these folios out of
> these regions, otherwise we start violating the very semantics why
> ZONE_MOVABLE/MIGRATE_CMA was added in the first place.
> 
> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM
> when allocating a migration destination), but these are not cases that can
> be triggered by (unprivileged) user space easily.
> 
> That's why FOLL_LONGTERM pinning does exactly that: even if user space would
> promise that this is really only "short-term", we will treat it as "possibly
> forever", because it's under user-space control.
> 
> 
> Instead of having more subsystems violate these semantics because
> "performance" ... I would hope we would do better. Maybe it's an issue for
> NFS as well ("at least" only for privileged user space)? In which case,
> again, I would hope we would do better.
> 
> 
> Anyhow, I'm hoping there will be more feedback from other MM folks, but
> likely right now a lot of people are out (just like I should ;) ).
> 
> If I end up being the only one with these concerns, then likely people can
> feel free to ignore them. ;)

I agree we should do better but IMHO it should be an iterative process.
I think your concerns are valid, so let's push the discussion towards
resolving those concerns. I think the concerns can be resolved by better
handling of the lifetime of folios under writeback. The amount of such
folios is already handled through the existing dirty throttling mechanism.

We should start with a baseline, i.e. the distribution of the lifetime of
folios under writeback for traditional storage devices (spinning disks and
SSDs), as we don't want an unrealistic goal for ourselves. I think this data
will drive the appropriate timeout values (if we decide a timeout-based
approach is the right one).

At the moment we have a timeout-based approach to limit the lifetime of
folios under writeback. Any other ideas?
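
Once such a baseline exists, picking a timeout from it is mostly a
percentile lookup. A minimal sketch, assuming nearest-rank percentiles over
writeback lifetimes that have already been collected somehow (how to
collect them is out of scope here); the function name and the unit are made
up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparator for uint64_t values, ascending. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Hypothetical helper: nearest-rank percentile of n writeback lifetimes
 * (any unit, e.g. milliseconds). Sorts the array in place.
 * pct must be in 1..100 and n must be non-zero.
 */
static uint64_t writeback_lifetime_pct(uint64_t *samples, size_t n,
				       unsigned int pct)
{
	size_t rank;

	assert(samples != NULL && n > 0);
	assert(pct >= 1 && pct <= 100);

	qsort(samples, n, sizeof(*samples), cmp_u64);

	/* Nearest rank: ceil(pct / 100 * n), computed in integers. */
	rank = ((size_t)pct * n + 99) / 100;
	return samples[rank - 1];
}
```

A timeout could then be derived as some multiple of, say, the 99th
percentile; which multiple is reasonable is exactly the open question.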


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-26 20:13                                               ` Shakeel Butt
@ 2024-12-26 22:02                                                 ` Bernd Schubert
  2024-12-27 20:08                                                 ` Joanne Koong
  2024-12-30 10:16                                                 ` David Hildenbrand
  2 siblings, 0 replies; 124+ messages in thread
From: Bernd Schubert @ 2024-12-26 22:02 UTC (permalink / raw)
  To: Shakeel Butt, David Hildenbrand
  Cc: Joanne Koong, Zi Yan, miklos, linux-fsdevel, jefflexu, josef,
	linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko



On 12/26/24 21:13, Shakeel Butt wrote:
> On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote:
>> On 23.12.24 23:14, Shakeel Butt wrote:
>>> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
>>>>
>>>
>>> I think you have valid concerns but these are not new and not unique to
>>> fuse. Any filesystem with a potential arbitrary stall can have similar
>>> issues. The arbitrary stall can be caused due to network issues or some
>>> faulty local storage.
>>
>> What concerns me more is that this can be triggered by even unprivileged
>> user space, and that there is no default protection as far as I understood,
>> because timeouts cannot be set universally to sane defaults.
>>
>> Again, please correct me if I got that wrong.
>>
> 
> Let's route this question to the FUSE folks. More specifically: can an
> unprivileged process create a mount point backed by itself, create a
> lot of dirty (bounded by cgroup) and writeback pages on it, and leave the
> writeback pages in that state forever?

libfuse provides 'fusermount', which has the s-bit set. I think most
distributions carry that over into their libfuse packages.
The fuse-server process then continues to run as an arbitrary user.






^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-23 19:00                                             ` Joanne Koong
@ 2024-12-26 22:44                                               ` Bernd Schubert
  2024-12-27 18:25                                                 ` Joanne Koong
  0 siblings, 1 reply; 124+ messages in thread
From: Bernd Schubert @ 2024-12-26 22:44 UTC (permalink / raw)
  To: Joanne Koong
  Cc: David Hildenbrand, Shakeel Butt, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko



On 12/23/24 20:00, Joanne Koong wrote:
> On Sat, Dec 21, 2024 at 1:59 PM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
>>
>>
>>
>> On 12/21/24 17:25, David Hildenbrand wrote:
>>> On 20.12.24 22:01, Joanne Koong wrote:
>>>> On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com>
>>>> wrote:
>>>>>
>>>>>>> I'm wondering if there would be a way to just "cancel" the
>>>>>>> writeback and
>>>>>>> mark the folio dirty again. That way it could be migrated, but not
>>>>>>> reclaimed. At least we could avoid the whole
>>>>>>> AS_WRITEBACK_INDETERMINATE
>>>>>>> thing.
>>>>>>>
>>>>>>
>>>>>> That is what I basically meant with short timeouts. Obviously it is not
>>>>>> that simple to cancel the request and to retry - it would add in quite
>>>>>> some complexity, if all the issues that arise can be solved at all.
>>>>>
>>>>> At least it would keep that out of core-mm.
>>>>>
>>>>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should
>>>>> try to improve such scenarios, not acknowledge and integrate them, then
>>>>> work around using timeouts that must be manually configured, and can
>>>>> likely not be default enabled because it could hurt reasonable use
>>>>> cases :(
>>>>>
>>>>> Right now we clear the writeback flag immediately, indicating that data
>>>>> was written back, when in fact it was not written back at all. I suspect
>>>>> fsync() currently handles that manually already, to wait for any of the
>>>>> allocated pages to actually get written back by user space, so we have
>>>>> control over when something was *actually* written back.
>>>>>
>>>>>
>>>>> Similar to your proposal, I wonder if there could be a way to request
>>>>> fuse to "abort" a writeback request (instead of using fixed timeouts per
>>>>> request). Meaning, when we stumble over a folio that is under writeback
>>>>> on some paths, we would tell fuse to "end writeback now", or "end
>>>>> writeback now if it takes longer than X". Essentially hidden inside
>>>>> folio_wait_writeback().
>>>>>
>>>>> When aborting a request, as I said, we would essentially "end writeback"
>>>>> and mark the folio as dirty again. The interesting thing is likely how
>>>>> to handle user space that wants to process this request right now (stuck
>>>>> in fuse_send_writepage() I assume?), correct?
>>>>
>>>> This would be fine if the writeback request hasn't been sent yet to
>>>> userspace but if it has and the pages are spliced
>>>
>>> Can you point me at the code where that splicing happens?
>>
>> fuse_dev_splice_read()
>>   fuse_dev_do_read()
>>     fuse_copy_args()
>>       fuse_copy_page
>>
>>
>> Btw, for the non splice case, disabling migration should be
>> only needed while it is copying to the userspace buffer?
> 
> I don't think so. We don't currently disable migration when copying
> to/from the userspace buffer for reads.


Sorry for my late reply. I'm confused about "reads". This discussion
is about writeback?
Without your patches we have tmp-pages - migration disabled on these.
With your patches we have AS_WRITEBACK_INDETERMINATE - migration
also disabled?

I think we have two code paths

a) fuse_dev_read - does a full buffer copy. Why do we need tmp-pages
for these at all? Is the only time migration must not run on these pages
while it is copying to the userspace buffer?

b) fuse_dev_splice_read - isn't this our real problem, as we don't
know when pages in the pipe are getting consumed?


Thanks,
Bernd



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-26 22:44                                               ` Bernd Schubert
@ 2024-12-27 18:25                                                 ` Joanne Koong
  0 siblings, 0 replies; 124+ messages in thread
From: Joanne Koong @ 2024-12-27 18:25 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: David Hildenbrand, Shakeel Butt, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu, Dec 26, 2024 at 2:44 PM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
> On 12/23/24 20:00, Joanne Koong wrote:
> > On Sat, Dec 21, 2024 at 1:59 PM Bernd Schubert
> > <bernd.schubert@fastmail.fm> wrote:
> >>
> >>
> >>
> >> On 12/21/24 17:25, David Hildenbrand wrote:
> >>> On 20.12.24 22:01, Joanne Koong wrote:
> >>>> On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com>
> >>>> wrote:
> >>>>>
> >>>>>>> I'm wondering if there would be a way to just "cancel" the
> >>>>>>> writeback and
> >>>>>>> mark the folio dirty again. That way it could be migrated, but not
> >>>>>>> reclaimed. At least we could avoid the whole
> >>>>>>> AS_WRITEBACK_INDETERMINATE
> >>>>>>> thing.
> >>>>>>>
> >>>>>>
> >>>>>> That is what I basically meant with short timeouts. Obviously it is not
> >>>>>> that simple to cancel the request and to retry - it would add in quite
> >>>>>> some complexity, if all the issues that arise can be solved at all.
> >>>>>
> >>>>> At least it would keep that out of core-mm.
> >>>>>
> >>>>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should
> >>>>> try to improve such scenarios, not acknowledge and integrate them, then
> >>>>> work around using timeouts that must be manually configured, and can
> >>>>> likely not be default enabled because it could hurt reasonable use
> >>>>> cases :(
> >>>>>
> >>>>> Right now we clear the writeback flag immediately, indicating that data
> >>>>> was written back, when in fact it was not written back at all. I suspect
> >>>>> fsync() currently handles that manually already, to wait for any of the
> >>>>> allocated pages to actually get written back by user space, so we have
> >>>>> control over when something was *actually* written back.
> >>>>>
> >>>>>
> >>>>> Similar to your proposal, I wonder if there could be a way to request
> >>>>> fuse to "abort" a writeback request (instead of using fixed timeouts per
> >>>>> request). Meaning, when we stumble over a folio that is under writeback
> >>>>> on some paths, we would tell fuse to "end writeback now", or "end
> >>>>> writeback now if it takes longer than X". Essentially hidden inside
> >>>>> folio_wait_writeback().
> >>>>>
> >>>>> When aborting a request, as I said, we would essentially "end writeback"
> >>>>> and mark the folio as dirty again. The interesting thing is likely how
> >>>>> to handle user space that wants to process this request right now (stuck
> >>>>> in fuse_send_writepage() I assume?), correct?
> >>>>
> >>>> This would be fine if the writeback request hasn't been sent yet to
> >>>> userspace but if it has and the pages are spliced
> >>>
> >>> Can you point me at the code where that splicing happens?
> >>
> >> fuse_dev_splice_read()
> >>   fuse_dev_do_read()
> >>     fuse_copy_args()
> >>       fuse_copy_page
> >>
> >>
> >> Btw, for the non splice case, disabling migration should be
> >> only needed while it is copying to the userspace buffer?
> >
> > I don't think so. We don't currently disable migration when copying
> > to/from the userspace buffer for reads.
>
>
> Sorry for my late reply. I'm confused about "reads". This discussion
> is about writeback?

Whether we need to disable migration for copying to/from the userspace
buffers for non-tmp pages should be the same between handling reads or
writes, no? That's why I brought up reads, but looking more at how
fuse handles readahead and read_folio(), it looks like the folio's
lock is held while it's being copied out, and IIUC that's enough to
disable migration since migration will wait on the lock. So if we end
writeback on the non-tmp page, it seems like we'd probably need to do
something similar first.

> Without your patches we have tmp-pages - migration disabled on these.
> With your patches we have AS_WRITEBACK_INDETERMINATE - migration
> also disabled?
>
> I think we have two code paths
>
> a) fuse_dev_read - does a full buffer copy. Why do we need tmp-pages
> for these at all? Is the only time migration must not run on these pages
> while it is copying to the userspace buffer?

The tmp pages were originally introduced for avoiding deadlock on
reclaim and avoiding hanging sync()s as well [1].

[1] https://lore.kernel.org/linux-kernel/bd49fcba-3eb6-4e84-a0f0-e73bce31ddb2@linux.alibaba.com/

>
> b) fuse_dev_splice_read - isn't this our real problem, as we don't
> know when pages in the pipe are getting consumed?

Yes, the splice case nixes the idea unfortunately. Everything else we
could find a workaround for, but there's no way I can see to avoid
this for splice.
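
To make the splice problem concrete, here is a loose userspace analogy on
an ordinary temp file and a pipe, not on /dev/fuse itself. It is only a
sketch: the helper name and the temp path are made up. The point is the gap
between the two steps, where the data sits in the pipe and only the reader
decides when it gets consumed.

```c
#define _GNU_SOURCE	/* for splice() */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write data to an unlinked temp file, splice it into
 * a pipe, then read it back out of the pipe into out.
 * Returns the number of bytes read from the pipe, or -1 on error.
 */
static ssize_t splice_roundtrip(const char *data, char *out, size_t out_len)
{
	char path[] = "/tmp/splice-demo-XXXXXX";
	size_t len = strlen(data);
	ssize_t got = -1;
	int pipefd[2];
	int fd;

	assert(len > 0 && out_len >= len);

	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	unlink(path);

	if (pipe(pipefd) == -1) {
		close(fd);
		return -1;
	}

	if (write(fd, data, len) == (ssize_t)len &&
	    lseek(fd, 0, SEEK_SET) == 0) {
		/* Step 1: move file data into the pipe, no user buffer involved. */
		ssize_t moved = splice(fd, NULL, pipefd[1], NULL, len, 0);

		/*
		 * From here until step 2 the data just sits in the pipe. The
		 * side that did the splice has no say in when it is consumed.
		 */

		/* Step 2: the consumer drains the pipe whenever it chooses to. */
		if (moved > 0)
			got = read(pipefd[0], out, (size_t)moved);
	}

	close(pipefd[0]);
	close(pipefd[1]);
	close(fd);
	return got;
}
```

In the FUSE case the consumer is the server, so the length of that gap is
under userspace control.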


Thanks,
Joanne
>
>
> Thanks,
> Bernd
>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-26 20:13                                               ` Shakeel Butt
  2024-12-26 22:02                                                 ` Bernd Schubert
@ 2024-12-27 20:08                                                 ` Joanne Koong
  2024-12-27 20:32                                                   ` Bernd Schubert
  2024-12-30 10:16                                                 ` David Hildenbrand
  2 siblings, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2024-12-27 20:08 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David Hildenbrand, Bernd Schubert, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu, Dec 26, 2024 at 12:13 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote:
> > On 23.12.24 23:14, Shakeel Butt wrote:
> > > On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
> > > [...]
> > > >
> > > > Yes, so I can see fuse
> > > >
> > > > (1) Breaking memory reclaim (memory cannot get freed up)
> > > >
> > > > (2) Breaking page migration (memory cannot be migrated)
> > > >
> > > > Due to (1) we might experience bigger memory pressure in the system I guess.
> > > > A handful of these pages don't really hurt, I have no idea how bad having
> > > > many of these pages can be. But yes, inherently we cannot throw away the
> > > > data as long as it is dirty without causing harm. (maybe we could move it to
> > > > some other cache, like swap/zswap; but that smells like a big and
> > > > complicated project)
> > > >
> > > > Due to (2) we turn pages that are supposed to be movable possibly for a long
> > > > time unmovable. Even a *single* such page will mean that CMA allocations /
> > > > memory unplug can start failing.
> > > >
> > > > We have similar situations with page pinning. With things like O_DIRECT, our
> > > > assumption/experience so far is that it will only take a couple of seconds
> > > > max, and retry loops are sufficient to handle it. That's why only long-term
> > > > pinning ("indeterminate", e.g., vfio) migrate these pages out of
> > > > ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
> > > >
> > > >
> > > > > The biggest concern I have is that timeouts, while likely reasonable in many
> > > > > scenarios, might not be desirable even for some sane workloads, and the
> > > > > default in all systems will be "no timeout", leaving it to the clueless admin of
> > > > > each and every system out there that might support fuse to make a decision.
> > > >
> > > > I might have misunderstood something, in which case I am very sorry, but we
> > > > also don't want CMA allocations to start failing simply because a network
> > > > connection is down for a couple of minutes such that a fuse daemon cannot
> > > > make progress.
> > > >
> > >
> > > I think you have valid concerns but these are not new and not unique to
> > > fuse. Any filesystem with a potential arbitrary stall can have similar
> > > issues. The arbitrary stall can be caused due to network issues or some
> > > > faulty local storage.
> >
> > > What concerns me more is that this can be triggered by even unprivileged
> > > user space, and that there is no default protection as far as I understood,
> > > because timeouts cannot be set universally to sane defaults.
> >
> > Again, please correct me if I got that wrong.
> >
>
> Let's route this question to the FUSE folks. More specifically: can an
> unprivileged process create a mount point backed by itself, create a
> lot of dirty (bounded by cgroup) and writeback pages on it, and leave the
> writeback pages in that state forever?
>
> >
> > BTW, I just looked at NFS out of interest, in particular
> > nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> > canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> > whereby the TCP default one seems to be around 60s (* retrans?), and the
> > privileged user that mounts it can set higher ones. I guess one could run
> > into similar writeback issues?
>
> Yes, I think so.
>
> >
> > So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
>
> I feel like INDETERMINATE in the name is the main cause of confusion.
> So, let me explain why it is required (but later I will tell you how it
> can be avoided). The FUSE thread which is actively handling writeback of
> a given folio can cause memory allocation either through syscall or page
> fault. That memory allocation can trigger global reclaim synchronously
> and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> folio whose writeback it is supposed to end, causing a deadlock. So,
> AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
>
> In-kernel filesystems avoid this situation through the use of GFP_NOFS
> allocations. A userspace fs can use a similar approach, which is
> prctl(PR_SET_IO_FLUSHER, 1), to avoid this situation. However, I have been
> told that it is hard to use, as it is a per-thread flag and has to be set
> for all the threads handling writeback, which can be error-prone if the
> threadpool is dynamic. Second, it is very coarse: all the
> allocations from those threads (e.g. page faults) become NOFS, which
> makes userspace very unreliable on a highly utilized machine, as NOFS
> cannot reclaim potentially a lot of memory and cannot trigger oom-kill.
>
> > Not
> > sure if I grasped all details about NFS and writeback and when it would
> > redirty+end writeback, and if there is some other handling in there.
> >
> [...]
> > >
> > > Please note that such filesystems are mostly used in environments like
> > > data centers or hyperscalers and usually have more advanced mechanisms to
> > > handle and avoid situations like long delays. For such environments
> > > network unavailability is a larger issue than some cma allocation
> > > failure. My point is: let's not assume the disastrous situation is normal
> > > and overcomplicate the solution.
> >
> > Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used
> > for movable allocations.
> >
> > Mechanisms that possibly turn these folios unmovable for a
> > long/indeterminate time must either fail or migrate these folios out of
> > these regions, otherwise we start violating the very semantics why
> > ZONE_MOVABLE/MIGRATE_CMA was added in the first place.
> >
> > Yes, there are corner cases where we cannot guarantee movability (e.g., OOM
> > when allocating a migration destination), but these are not cases that can
> > be triggered by (unprivileged) user space easily.
> >
> > That's why FOLL_LONGTERM pinning does exactly that: even if user space would
> > promise that this is really only "short-term", we will treat it as "possibly
> > forever", because it's under user-space control.
> >
> >
> > Instead of having more subsystems violate these semantics because
> > "performance" ... I would hope we would do better. Maybe it's an issue for
> > NFS as well ("at least" only for privileged user space)? In which case,
> > again, I would hope we would do better.
> >
> >
> > Anyhow, I'm hoping there will be more feedback from other MM folks, but
> > likely right now a lot of people are out (just like I should ;) ).
> >
> > If I end up being the only one with these concerns, then likely people can
> > feel free to ignore them. ;)
>
> I agree we should do better but IMHO it should be an iterative process.
> I think your concerns are valid, so let's push the discussion towards
> resolving those concerns. I think the concerns can be resolved by better
> handling of the lifetime of folios under writeback. The amount of such
> folios is already handled through the existing dirty throttling mechanism.
>
> We should start with a baseline, i.e. the distribution of the lifetime of
> folios under writeback for traditional storage devices (spinning disks and
> SSDs), as we don't want an unrealistic goal for ourselves. I think this data
> will drive the appropriate timeout values (if we decide a timeout-based
> approach is the right one).
>
> At the moment we have a timeout-based approach to limit the lifetime of
> folios under writeback. Any other ideas?

I don't see any other approach that would handle splice, other than
modifying the splice code to prevent the underlying buf->page from
being migrated while it's being copied out, which seems non-viable to
consider. The other alternatives I see are to either a) do the extra
temp page copying for splice and "abort" the writeback if migration is
triggered or b) gate this to only apply to servers running as
privileged. I assume the majority of use cases do use splice, in which
case a) would be pointless and would make the internal logic more
complicated (e.g. we would still need the rb tree and would now need to
check writeback against the folio writeback state or the rb tree,
etc). I'm not sure how useful this would be either if this is just
gated to privileged servers.


Thanks,
Joanne


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-27 20:08                                                 ` Joanne Koong
@ 2024-12-27 20:32                                                   ` Bernd Schubert
  2024-12-30 17:52                                                     ` Joanne Koong
  0 siblings, 1 reply; 124+ messages in thread
From: Bernd Schubert @ 2024-12-27 20:32 UTC (permalink / raw)
  To: Joanne Koong, Shakeel Butt
  Cc: David Hildenbrand, Zi Yan, miklos, linux-fsdevel, jefflexu, josef,
	linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko



On 12/27/24 21:08, Joanne Koong wrote:
> On Thu, Dec 26, 2024 at 12:13 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote:
>>> On 23.12.24 23:14, Shakeel Butt wrote:
>>>> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
>>>> [...]
>>>>>
>>>>> Yes, so I can see fuse
>>>>>
>>>>> (1) Breaking memory reclaim (memory cannot get freed up)
>>>>>
>>>>> (2) Breaking page migration (memory cannot be migrated)
>>>>>
>>>>> Due to (1) we might experience bigger memory pressure in the system I guess.
>>>>> A handful of these pages don't really hurt, I have no idea how bad having
>>>>> many of these pages can be. But yes, inherently we cannot throw away the
>>>>> data as long as it is dirty without causing harm. (maybe we could move it to
>>>>> some other cache, like swap/zswap; but that smells like a big and
>>>>> complicated project)
>>>>>
>>>>> Due to (2) we turn pages that are supposed to be movable possibly for a long
>>>>> time unmovable. Even a *single* such page will mean that CMA allocations /
>>>>> memory unplug can start failing.
>>>>>
>>>>> We have similar situations with page pinning. With things like O_DIRECT, our
>>>>> assumption/experience so far is that it will only take a couple of seconds
>>>>> max, and retry loops are sufficient to handle it. That's why only long-term
>>>>> pinning ("indeterminate", e.g., vfio) migrate these pages out of
>>>>> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
>>>>>
>>>>>
>>>>> The biggest concern I have is that timeouts, while likely reasonable in many
>>>>> scenarios, might not be desirable even for some sane workloads, and the
>>>>> default in all systems will be "no timeout", leaving it to the clueless admin of
>>>>> each and every system out there that might support fuse to make a decision.
>>>>>
>>>>> I might have misunderstood something, in which case I am very sorry, but we
>>>>> also don't want CMA allocations to start failing simply because a network
>>>>> connection is down for a couple of minutes such that a fuse daemon cannot
>>>>> make progress.
>>>>>
>>>>
>>>> I think you have valid concerns but these are not new and not unique to
>>>> fuse. Any filesystem with a potential arbitrary stall can have similar
>>>> issues. The arbitrary stall can be caused due to network issues or some
>>>> faulty local storage.
>>>
>>> What concerns me more is that this can be triggered by even unprivileged
>>> user space, and that there is no default protection as far as I understood,
>>> because timeouts cannot be set universally to sane defaults.
>>>
>>> Again, please correct me if I got that wrong.
>>>
>>
>> Let's route this question to the FUSE folks. More specifically: can an
>> unprivileged process create a mount point backed by itself, create a
>> lot of dirty (bounded by cgroup) and writeback pages on it, and leave the
>> writeback pages in that state forever?
>>
>>>
>>> BTW, I just looked at NFS out of interest, in particular
>>> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
>>> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
>>> whereby the TCP default one seems to be around 60s (* retrans?), and the
>>> privileged user that mounts it can set higher ones. I guess one could run
>>> into similar writeback issues?
>>
>> Yes, I think so.
>>
>>>
>>> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
>>
>> I feel like INDETERMINATE in the name is the main cause of confusion.
>> So, let me explain why it is required (but later I will tell you how it
>> can be avoided). The FUSE thread which is actively handling writeback of
>> a given folio can cause memory allocation either through syscall or page
>> fault. That memory allocation can trigger global reclaim synchronously
>> and in cgroup-v1, that FUSE thread can wait on the writeback on the same
>> folio whose writeback it is supposed to end, causing a deadlock. So,
>> AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
>>
>> In-kernel filesystems avoid this situation through the use of GFP_NOFS
>> allocations. A userspace fs can use a similar approach, which is
>> prctl(PR_SET_IO_FLUSHER, 1), to avoid this situation. However, I have been
>> told that it is hard to use, as it is a per-thread flag and has to be set
>> for all the threads handling writeback, which can be error-prone if the
>> threadpool is dynamic. Second, it is very coarse: all the
>> allocations from those threads (e.g. page faults) become NOFS, which
>> makes userspace very unreliable on a highly utilized machine, as NOFS
>> cannot reclaim potentially a lot of memory and cannot trigger oom-kill.
>>
>>> Not
>>> sure if I grasped all details about NFS and writeback and when it would
>>> redirty+end writeback, and if there is some other handling in there.
>>>
>> [...]
>>>>
>>>> Please note that such filesystems are mostly used in environments like
>>>> data center or hyperscaler and usually have more advanced mechanisms to
>>>> handle and avoid situations like long delays. For such environments
>>>> network unavailability is a larger issue than some cma allocation
>>>> failure. My point is: let's not assume the disastrous situation is normal
>>>> and overcomplicate the solution.
>>>
>>> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used
>>> for movable allocations.
>>>
>>> Mechanisms that possibly turn these folios unmovable for a
>>> long/indeterminate time must either fail or migrate these folios out of
>>> these regions, otherwise we start violating the very semantics why
>>> ZONE_MOVABLE/MIGRATE_CMA was added in the first place.
>>>
>>> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM
>>> when allocating a migration destination), but these are not cases that can
>>> be triggered by (unprivileged) user space easily.
>>>
>>> That's why FOLL_LONGTERM pinning does exactly that: even if user space would
>>> promise that this is really only "short-term", we will treat it as "possibly
>>> forever", because it's under user-space control.
>>>
>>>
>>> Instead of having more subsystems violate these semantics because
>>> "performance" ... I would hope we would do better. Maybe it's an issue for
>>> NFS as well ("at least" only for privileged user space)? In which case,
>>> again, I would hope we would do better.
>>>
>>>
>>> Anyhow, I'm hoping there will be more feedback from other MM folks, but
>>> likely right now a lot of people are out (just like I should ;) ).
>>>
>>> If I end up being the only one with these concerns, then likely people can
>>> feel free to ignore them. ;)
>>
>> I agree we should do better but IMHO it should be an iterative process.
>> I think your concerns are valid, so let's push the discussion towards
>> resolving those concerns. I think the concerns can be resolved by better
>> handling of lifetime of folios under writeback. The amount of such
>> folios is already handled through existing dirty throttling mechanism.
>>
>> We should start with a baseline i.e. distribution of lifetime of folios
>> under writeback for traditional storage devices (spinning disk and SSDs)
>> as we don't want an unrealistic goal for ourself. I think this data will
>> drive the appropriate timeout values (if we decide timeout based
>> approach is the right one).
>>
>> At the moment we have timeout based approach to limit the lifetime of
>> folios under writeback. Any other ideas?
> 
> I don't see any other approach that would handle splice, other than
> modifying the splice code to prevent the underlying buf->page from
> being migrated while it's being copied out, which seems non-viable to
> consider. The other alternatives I see are to either a) do the extra
> temp page copying for splice and "abort" the writeback if migration is
> triggered or b) gate this to only apply to servers running as
> privileged. I assume the majority of use cases do use splice, in which
> case a) would be pointless and would make the internal logic more
> complicated (eg we would still need the rb tree and would now need to
> check writeback against the folio writeback state or the rb tree,
> etc). I'm not sure how useful this would be either if this is just
> gated to privileged servers.


I'm not so sure that the majority of unprivileged servers really use
splice. Try this patch and then run an unprivileged process.

diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index ee0b3b1d0470..adebfbc03d4c 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -3588,6 +3588,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
                        res = fcntl(llp->pipe[0], F_SETPIPE_SZ, bufsize);
                        if (res == -1) {
                                llp->can_grow = 0;
+                               fuse_log(FUSE_LOG_ERR, "cannot grow pipe\n");
                                res = grow_pipe_to_max(llp->pipe[0]);
                                if (res > 0)
                                        llp->size = res;
@@ -3678,6 +3679,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
 
        } else {
                /* Don't overwrite buf->mem, as that would cause a leak */
+               fuse_log(FUSE_LOG_WARNING, "Using splice\n");
                buf->fd = tmpbuf.fd;
                buf->flags = tmpbuf.flags;
        }
@@ -3687,6 +3689,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
 
 fallback:
 #endif
+       fuse_log(FUSE_LOG_WARNING, "Splice fallback\n");
        if (!buf->mem) {
                buf->mem = buf_alloc(se->bufsize, internal);
                if (!buf->mem) {


And then run this again after 
sudo sysctl -w fs.pipe-max-size=1052672

(Please don't change '/proc/sys/fs/fuse/max_pages_limit'
from default).

And now we would need to know how many users either limit
max-pages + header to fit default pipe-max-size (1MB) or
increase max_pages_limit. Given there is no warning in
libfuse about the fallback from splice to buf copy, I doubt
many people know about that - who would change system
defaults without the knowledge?
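
The size arithmetic behind that fallback can be checked from user space.
What follows is a minimal sketch, not libfuse code. It assumes 4 KiB
pages, a max_pages of 256 and a 4 KiB request header (which is how
1052672 = 256 * 4096 + 4096 would come about), and the helper names are
made up for illustration. It only relies on fcntl() with F_SETPIPE_SZ
and F_GETPIPE_SZ, the same call the patch above instruments.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Assumed values, for illustration only (not read from libfuse). */
#define ASSUMED_MAX_PAGES   256L
#define ASSUMED_PAGE_SIZE   4096L
#define ASSUMED_HEADER_SIZE 4096L

/* Bytes one request buffer would need under the assumptions above. */
static long assumed_bufsize(void)
{
	return ASSUMED_MAX_PAGES * ASSUMED_PAGE_SIZE + ASSUMED_HEADER_SIZE;
}

/*
 * Try to resize a fresh pipe to `want` bytes.
 * Returns 1 if the pipe can hold `want` bytes afterwards, 0 if it
 * cannot (the case where a splice-based reader would have to fall
 * back to a buffer copy), and -1 if no pipe could be created.
 */
static int pipe_can_hold(long want)
{
	int fds[2];
	long cap;

	assert(want > 0);
	if (pipe(fds) == -1)
		return -1;

	/* May fail (EPERM) for unprivileged callers above fs.pipe-max-size;
	 * the result is ignored and the real capacity is queried instead. */
	(void)fcntl(fds[0], F_SETPIPE_SZ, (int)want);
	cap = fcntl(fds[0], F_GETPIPE_SZ);

	close(fds[0]);
	close(fds[1]);
	return cap >= want;
}
```

Calling pipe_can_hold(assumed_bufsize()) as an unprivileged user before
and after raising fs.pipe-max-size should show whether the request
buffer fits into the pipe on a given system.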


I also still doubt that copy-to-tmp-page-and-splice
is any faster than no-tmp-page-copy-but-copy-to-lib-fuse-buffer,
especially as the tmp page copy is single threaded, I think.
But that needs to be benchmarked.


Thanks,
Bernd






* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-26 20:13                                               ` Shakeel Butt
  2024-12-26 22:02                                                 ` Bernd Schubert
  2024-12-27 20:08                                                 ` Joanne Koong
@ 2024-12-30 10:16                                                 ` David Hildenbrand
  2024-12-30 18:38                                                   ` Joanne Koong
  2 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2024-12-30 10:16 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

>> BTW, I just looked at NFS out of interest, in particular
>> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
>> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
>> whereby the TCP default one seems to be around 60s (* retrans?), and the
>> privileged user that mounts it can set higher ones. I guess one could run
>> into similar writeback issues?
> 

Hi,

sorry for the late reply.

> Yes, I think so.
> 
>>
>> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> 
> I feel like INDETERMINATE in the name is the main cause of confusion.

We are adding logic that says "unconditionally, never wait on writeback 
for these folios, not even any sync migration". That's the main problem 
I have.

Your explanation below is helpful. Because ...

> So, let me explain why it is required (but later I will tell you how it
> can be avoided). The FUSE thread which is actively handling writeback of
> a given folio can cause memory allocation either through syscall or page
> fault. That memory allocation can trigger global reclaim synchronously
> and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> folio whose writeback it is supposed to end, causing a deadlock. So,
> AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
>
> The in-kernel fs avoid this situation through the use of GFP_NOFS
> allocations. The userspace fs can also use a similar approach which is
> prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
> told that it is hard to use as it is per-thread flag and has to be set
> for all the threads handling writeback which can be error prone if the
> threadpool is dynamic. Second it is very coarse such that all the
> allocations from those threads (e.g. page faults) become NOFS which
> makes userspace very unreliable on highly utilized machine as NOFS can
> not reclaim potentially a lot of memory and can not trigger oom-kill.
> 

... now I understand that we want to prevent a deadlock in one specific 
scenario only?
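
For reference, the per-thread nature of prctl(PR_SET_IO_FLUSHER, 1)
mentioned in the quoted text can be sketched as below. This is a minimal
sketch, assuming a Linux 5.6+ kernel and a caller with CAP_SYS_RESOURCE;
the helper name set_io_flusher() is made up, and the fallback #define is
only there for older headers. Every worker thread that may handle
writeback would have to make this call itself, which is the part that
gets error prone with a dynamic thread pool.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57	/* value from linux/prctl.h, for older headers */
#endif

/*
 * Mark (on = 1) or unmark (on = 0) the calling thread as an IO flusher.
 * Returns 0 on success or a negative errno value, e.g. -EPERM when the
 * caller lacks the required capability.
 */
static int set_io_flusher(int on)
{
	assert(on == 0 || on == 1);
	if (prctl(PR_SET_IO_FLUSHER, (unsigned long)on, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}
```

A server would call set_io_flusher(1) at the start of each worker
thread's main function and treat a negative return value as "not
protected".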

What sounds plausible for me is:

a) Make this only affect the actual deadlock path: sync migration
    during compaction. Communicate it either using some "context"
    information or with a new MIGRATE_SYNC_COMPACTION.
b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
     that very deadlock problem.
c) Leave all other sync migration users alone for now
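
Spelled out as a decision function, points a) to c) would amount to
roughly the following. This is a user-space toy model, not kernel code:
the enum values, on_writeback_folio() and the might_deadlock parameter
are all invented for illustration, and it assumes that async migration
never waits on writeback.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; only MODE_SYNC_COMPACTION is new, per proposal a). */
enum toy_migrate_mode {
	MODE_ASYNC,
	MODE_SYNC,
	MODE_SYNC_COMPACTION,	/* hypothetical: sync migration from compaction */
};

enum toy_action {
	SKIP_FOLIO,		/* give up on this folio, do not block */
	WAIT_FOR_WRITEBACK,	/* block until writeback has finished */
};

/*
 * Decide what to do with a folio that is under writeback.
 * `might_deadlock` models the renamed mapping flag from proposal b).
 */
static enum toy_action on_writeback_folio(enum toy_migrate_mode mode,
					  bool might_deadlock)
{
	assert(mode <= MODE_SYNC_COMPACTION);

	if (mode == MODE_ASYNC)
		return SKIP_FOLIO;		/* assumed: async never waits */
	if (mode == MODE_SYNC_COMPACTION && might_deadlock)
		return SKIP_FOLIO;		/* a): the only changed path */
	return WAIT_FOR_WRITEBACK;		/* c): other sync users unchanged */
}
```

In this model a flagged folio is skipped only for the compaction case,
while callers such as memory offlining or CMA allocation (plain
MODE_SYNC here) keep waiting for writeback.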

Would that prevent the deadlock? Even *better* would be to be able to
ask the fs if starting writeback on a specific folio could deadlock.
Because in most cases, as I understand, we'll not actually run into the
deadlock and would just want to wait for writeback to complete
(esp. compaction).

(I still think having folios under writeback for a long time might be a 
problem, but that's indeed something to sort out separately in the 
future, because I suspect NFS has similar issues. We'd want to "wait 
with timeout" and e.g., cancel writeback during memory 
offlining/alloc_cma ...)

>> Not
>> sure if I grasped all details about NFS and writeback and when it would
>> redirty+end writeback, and if there is some other handling in there.
>>
> [...]
>>>
>>> Please note that such filesystems are mostly used in environments like
>>> data center or hyperscaler and usually have more advanced mechanisms to
>>> handle and avoid situations like long delays. For such environments
>>> network unavailability is a larger issue than some cma allocation
>>> failure. My point is: let's not assume the disastrous situation is normal
>>> and overcomplicate the solution.
>>
>> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used
>> for movable allocations.
>>
>> Mechanisms that possibly turn these folios unmovable for a
>> long/indeterminate time must either fail or migrate these folios out of
>> these regions, otherwise we start violating the very semantics why
>> ZONE_MOVABLE/MIGRATE_CMA was added in the first place.
>>
>> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM
>> when allocating a migration destination), but these are not cases that can
>> be triggered by (unprivileged) user space easily.
>>
>> That's why FOLL_LONGTERM pinning does exactly that: even if user space would
>> promise that this is really only "short-term", we will treat it as "possibly
>> forever", because it's under user-space control.
>>
>>
>> Instead of having more subsystems violate these semantics because
>> "performance" ... I would hope we would do better. Maybe it's an issue for
>> NFS as well ("at least" only for privileged user space)? In which case,
>> again, I would hope we would do better.
>>
>>
>> Anyhow, I'm hoping there will be more feedback from other MM folks, but
>> likely right now a lot of people are out (just like I should ;) ).
>>
>> If I end up being the only one with these concerns, then likely people can
>> feel free to ignore them. ;)
> 
> I agree we should do better but IMHO it should be an iterative process.
> I think your concerns are valid, so let's push the discussion towards
> resolving those concerns. I think the concerns can be resolved by better
> handling of lifetime of folios under writeback. The amount of such
> folios is already handled through existing dirty throttling mechanism.
> 
> We should start with a baseline i.e. distribution of lifetime of folios
> under writeback for traditional storage devices (spinning disk and SSDs)
> as we don't want an unrealistic goal for ourself. I think this data will
> drive the appropriate timeout values (if we decide timeout based
> approach is the right one).
> 
> At the moment we have timeout based approach to limit the lifetime of
> folios under writeback. Any other ideas?

See above, maybe we could limit the deadlock avoidance to the actual 
deadlock path and sort out the "infinite writeback in some corner cases" 
problem separately.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-27 20:32                                                   ` Bernd Schubert
@ 2024-12-30 17:52                                                     ` Joanne Koong
  0 siblings, 0 replies; 124+ messages in thread
From: Joanne Koong @ 2024-12-30 17:52 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Shakeel Butt, David Hildenbrand, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Fri, Dec 27, 2024 at 12:32 PM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
> On 12/27/24 21:08, Joanne Koong wrote:
> > On Thu, Dec 26, 2024 at 12:13 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >>
> >> On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote:
> >>> On 23.12.24 23:14, Shakeel Butt wrote:
> >>>> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
> >>>> [...]
> >>>>>
> >>>>> Yes, so I can see fuse
> >>>>>
> >>>>> (1) Breaking memory reclaim (memory cannot get freed up)
> >>>>>
> >>>>> (2) Breaking page migration (memory cannot be migrated)
> >>>>>
> >>>>> Due to (1) we might experience bigger memory pressure in the system I guess.
> >>>>> A handful of these pages don't really hurt, I have no idea how bad having
> >>>>> many of these pages can be. But yes, inherently we cannot throw away the
> >>>>> data as long as it is dirty without causing harm. (maybe we could move it to
> >>>>> some other cache, like swap/zswap; but that smells like a big and
> >>>>> complicated project)
> >>>>>
> >>>>> Due to (2) we turn pages that are supposed to be movable possibly for a long
> >>>>> time unmovable. Even a *single* such page will mean that CMA allocations /
> >>>>> memory unplug can start failing.
> >>>>>
> >>>>> We have similar situations with page pinning. With things like O_DIRECT, our
> >>>>> assumption/experience so far is that it will only take a couple of seconds
> >>>>> max, and retry loops are sufficient to handle it. That's why only long-term
> >>>>> pinning ("indeterminate", e.g., vfio) migrate these pages out of
> >>>>> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
> >>>>>
> >>>>>
> >>>>> The biggest concern I have is that timeouts, while likely reasonable in many
> >>>>> scenarios, might not be desirable even for some sane workloads, and the
> >>>>> default in all systems will be "no timeout", letting the clueless admin of
> >>>>> each and every system out there that might support fuse to make a decision.
> >>>>>
> >>>>> I might have misunderstood something, in which case I am very sorry, but we
> >>>>> also don't want CMA allocations to start failing simply because a network
> >>>>> connection is down for a couple of minutes such that a fuse daemon cannot
> >>>>> make progress.
> >>>>>
> >>>>
> >>>> I think you have valid concerns but these are not new and not unique to
> >>>> fuse. Any filesystem with a potential arbitrary stall can have similar
> >>>> issues. The arbitrary stall can be caused due to network issues or some
> >>>> faulty local storage.
> >>>
> >>> What concerns me more is that this can be triggered by even unprivileged
> >>> user space, and that there is no default protection as far as I understood,
> >>> because timeouts cannot be set universally to a sane defaults.
> >>>
> >>> Again, please correct me if I got that wrong.
> >>>
> >>
> >> Let's route this question to FUSE folks. More specifically: can an
> >> unprivileged process create a mount point backed by itself, create a
> >> lot of dirty (bound by cgroup) and writeback pages on it and let the
> >> writeback pages in that state forever?
> >>
> >>>
> >>> BTW, I just looked at NFS out of interest, in particular
> >>> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> >>> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> >>> whereby the TCP default one seems to be around 60s (* retrans?), and the
> >>> privileged user that mounts it can set higher ones. I guess one could run
> >>> into similar writeback issues?
> >>
> >> Yes, I think so.
> >>
> >>>
> >>> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> >>
> >> I feel like INDETERMINATE in the name is the main cause of confusion.
> >> So, let me explain why it is required (but later I will tell you how it
> >> can be avoided). The FUSE thread which is actively handling writeback of
> >> a given folio can cause memory allocation either through syscall or page
> >> fault. That memory allocation can trigger global reclaim synchronously
> >> and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> >> folio whose writeback it is supposed to end, causing a deadlock. So,
> >> AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
> >>
> >> The in-kernel fs avoid this situation through the use of GFP_NOFS
> >> allocations. The userspace fs can also use a similar approach which is
> >> prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
> >> told that it is hard to use as it is per-thread flag and has to be set
> >> for all the threads handling writeback which can be error prone if the
> >> threadpool is dynamic. Second it is very coarse such that all the
> >> allocations from those threads (e.g. page faults) become NOFS which
> >> makes userspace very unreliable on highly utilized machine as NOFS can
> >> not reclaim potentially a lot of memory and can not trigger oom-kill.
> >>
> >>> Not
> >>> sure if I grasped all details about NFS and writeback and when it would
> >>> redirty+end writeback, and if there is some other handling in there.
> >>>
> >> [...]
> >>>>
> >>>> Please note that such filesystems are mostly used in environments like
> >>>> data center or hyperscaler and usually have more advanced mechanisms to
> >>>> handle and avoid situations like long delays. For such environments
> >>>> network unavailability is a larger issue than some cma allocation
> >>>> failure. My point is: let's not assume the disastrous situation is normal
> >>>> and overcomplicate the solution.
> >>>
> >>> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used
> >>> for movable allocations.
> >>>
> >>> Mechanisms that possibly turn these folios unmovable for a
> >>> long/indeterminate time must either fail or migrate these folios out of
> >>> these regions, otherwise we start violating the very semantics why
> >>> ZONE_MOVABLE/MIGRATE_CMA was added in the first place.
> >>>
> >>> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM
> >>> when allocating a migration destination), but these are not cases that can
> >>> be triggered by (unprivileged) user space easily.
> >>>
> >>> That's why FOLL_LONGTERM pinning does exactly that: even if user space would
> >>> promise that this is really only "short-term", we will treat it as "possibly
> >>> forever", because it's under user-space control.
> >>>
> >>>
> >>> Instead of having more subsystems violate these semantics because
> >>> "performance" ... I would hope we would do better. Maybe it's an issue for
> >>> NFS as well ("at least" only for privileged user space)? In which case,
> >>> again, I would hope we would do better.
> >>>
> >>>
> >>> Anyhow, I'm hoping there will be more feedback from other MM folks, but
> >>> likely right now a lot of people are out (just like I should ;) ).
> >>>
> >>> If I end up being the only one with these concerns, then likely people can
> >>> feel free to ignore them. ;)
> >>
> >> I agree we should do better but IMHO it should be an iterative process.
> >> I think your concerns are valid, so let's push the discussion towards
> >> resolving those concerns. I think the concerns can be resolved by better
> >> handling of lifetime of folios under writeback. The amount of such
> >> folios is already handled through existing dirty throttling mechanism.
> >>
> >> We should start with a baseline i.e. distribution of lifetime of folios
> >> under writeback for traditional storage devices (spinning disk and SSDs)
> >> as we don't want an unrealistic goal for ourself. I think this data will
> >> drive the appropriate timeout values (if we decide timeout based
> >> approach is the right one).
> >>
> >> At the moment we have timeout based approach to limit the lifetime of
> >> folios under writeback. Any other ideas?
> >
> > I don't see any other approach that would handle splice, other than
> > modifying the splice code to prevent the underlying buf->page from
> > being migrated while it's being copied out, which seems non-viable to
> > consider. The other alternatives I see are to either a) do the extra
> > temp page copying for splice and "abort" the writeback if migration is
> > triggered or b) gate this to only apply to servers running as
> > privileged. I assume the majority of use cases do use splice, in which
> > case a) would be pointless and would make the internal logic more
> > complicated (eg we would still need the rb tree and would now need to
> > check writeback against the folio writeback state or the rb tree,
> > etc). I'm not sure how useful this would be either if this is just
> > gated to privileged servers.
>
>
> I'm not so sure about that majority of unprivileged servers.
> Try this patch and then run an unprivileged process.
>
> diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
> index ee0b3b1d0470..adebfbc03d4c 100644
> --- a/lib/fuse_lowlevel.c
> +++ b/lib/fuse_lowlevel.c
> @@ -3588,6 +3588,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
>                         res = fcntl(llp->pipe[0], F_SETPIPE_SZ, bufsize);
>                         if (res == -1) {
>                                 llp->can_grow = 0;
> +                               fuse_log(FUSE_LOG_ERR, "cannot grow pipe\n");
>                                 res = grow_pipe_to_max(llp->pipe[0]);
>                                 if (res > 0)
>                                         llp->size = res;
> @@ -3678,6 +3679,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
>
>         } else {
>                 /* Don't overwrite buf->mem, as that would cause a leak */
> +               fuse_log(FUSE_LOG_WARNING, "Using splice\n");
>                 buf->fd = tmpbuf.fd;
>                 buf->flags = tmpbuf.flags;
>         }
> @@ -3687,6 +3689,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
>
>  fallback:
>  #endif
> +       fuse_log(FUSE_LOG_WARNING, "Splice fallback\n");
>         if (!buf->mem) {
>                 buf->mem = buf_alloc(se->bufsize, internal);
>                 if (!buf->mem) {
>
>
> And then run this again after
> sudo sysctl -w fs.pipe-max-size=1052672
>
> (Please don't change '/proc/sys/fs/fuse/max_pages_limit'
> from default).
>
> And now we would need to know how many users either limit
> max-pages + header to fit default pipe-max-size (1MB) or
> increase max_pages_limit. Given there is no warning in
> libfuse about the fallback from splice to buf copy, I doubt
> many people know about that - who would change system
> defaults without the knowledge?
>

My concern is that this would break backwards compatibility for the
rare subset of users who use their own custom library instead of
libfuse, who expect splice to work as-is and might not have this
in-built fallback to buffer copies.


Thanks,
Joanne

>
> And then, I still doubt that copy-to-tmp-page-and-splice
> is any faster than no-tmp-page-copy-but-copy-to-lib-fuse-buffer.
> Especially as the tmp page copy is single threaded, I think.
> But needs to be benchmarked.
>
>
> Thanks,
> Bernd
>
>
>



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-30 10:16                                                 ` David Hildenbrand
@ 2024-12-30 18:38                                                   ` Joanne Koong
  2024-12-30 19:52                                                     ` David Hildenbrand
  2024-12-30 20:04                                                     ` Shakeel Butt
  0 siblings, 2 replies; 124+ messages in thread
From: Joanne Koong @ 2024-12-30 18:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Shakeel Butt, Bernd Schubert, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> BTW, I just looked at NFS out of interest, in particular
> >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> >> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> >> privileged user that mounts it can set higher ones. I guess one could run
> >> into similar writeback issues?
> >
>
> Hi,
>
> sorry for the late reply.
>
> > Yes, I think so.
> >
> >>
> >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> >
> > I feel like INDETERMINATE in the name is the main cause of confusion.
>
> We are adding logic that says "unconditionally, never wait on writeback
> for these folios, not even any sync migration". That's the main problem
> I have.
>
> Your explanation below is helpful. Because ...
>
> > So, let me explain why it is required (but later I will tell you how it
> > can be avoided). The FUSE thread which is actively handling writeback of
> > a given folio can cause memory allocation either through syscall or page
> > fault. That memory allocation can trigger global reclaim synchronously
> > and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> > folio whose writeback it is supposed to end, causing a deadlock. So,
> > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
> >
> > The in-kernel fs avoid this situation through the use of GFP_NOFS
> > allocations. The userspace fs can also use a similar approach which is
> > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
> > told that it is hard to use as it is per-thread flag and has to be set
> > for all the threads handling writeback which can be error prone if the
> > threadpool is dynamic. Second it is very coarse such that all the
> > allocations from those threads (e.g. page faults) become NOFS which
> > makes userspace very unreliable on highly utilized machine as NOFS can
> > not reclaim potentially a lot of memory and can not trigger oom-kill.
> >
>
> ... now I understand that we want to prevent a deadlock in one specific
> scenario only?
>
> What sounds plausible for me is:
>
> a) Make this only affect the actual deadlock path: sync migration
>     during compaction. Communicate it either using some "context"
>     information or with a new MIGRATE_SYNC_COMPACTION.
> b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
>      that very deadlock problem.
> c) Leave all other sync migration users alone for now

The deadlock path is separate from sync migration. The deadlock arises
from a corner case where cgroupv1 reclaim waits on a folio under
writeback where that writeback itself is blocked on reclaim.

>
> Would that prevent the deadlock? Even *better* would be to be able to
> ask the fs if starting writeback on a specific folio could deadlock.
> Because in most cases, as I understand, we'll not actually run into the
> deadlock and would just want to wait for writeback to complete
> (esp. compaction).
>
> (I still think having folios under writeback for a long time might be a
> problem, but that's indeed something to sort out separately in the
> future, because I suspect NFS has similar issues. We'd want to "wait
> with timeout" and e.g., cancel writeback during memory
> offlining/alloc_cma ...)

I'm looking back at some of the discussions in v2 [1] and I'm still
not clear on how memory fragmentation for non-movable pages differs
from memory fragmentation from movable pages and whether one is worse
than the other. Currently fuse uses movable temp pages (allocated with
gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
issue where a buggy/malicious server may never complete writeback.
This has the same effect of fragmenting memory and has a worse memory
cost to the system in terms of memory used. Without temp pages, though,
pages allocated in a movable page block can't be compacted in this
scenario, and that memory is fragmented. My (basic and maybe
incorrect) understanding is that memory gets allocated through a buddy
allocator and movable vs nonmovable pages get allocated to
corresponding blocks that match their type, but there's no other
difference otherwise. Is this understanding correct? Or is there some
substantial difference between fragmentation for movable vs nonmovable
blocks?


Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/T/#m7637e26a559db86348461ebc1104352083085d6d

>
> >> Not
> >> sure if I grasped all details about NFS and writeback and when it would
> >> redirty+end writeback, and if there is some other handling in there.
> >>
> > [...]
> >>>
> >>> Please note that such filesystems are mostly used in environments like
> >>> data center or hyperscaler and usually have more advanced mechanisms to
> >>> handle and avoid situations like long delays. For such environments
> >>> network unavailability is a larger issue than some cma allocation
> >>> failure. My point is: let's not assume the disastrous situation is normal
> >>> and overcomplicate the solution.
> >>
> >> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used
> >> for movable allocations.
> >>
> >> Mechanisms that possibly turn these folios unmovable for a
> >> long/indeterminate time must either fail or migrate these folios out of
> >> these regions, otherwise we start violating the very semantics why
> >> ZONE_MOVABLE/MIGRATE_CMA was added in the first place.
> >>
> >> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM
> >> when allocating a migration destination), but these are not cases that can
> >> be triggered by (unprivileged) user space easily.
> >>
> >> That's why FOLL_LONGTERM pinning does exactly that: even if user space would
> >> promise that this is really only "short-term", we will treat it as "possibly
> >> forever", because it's under user-space control.
> >>
> >>
> >> Instead of having more subsystems violate these semantics because
> >> "performance" ... I would hope we would do better. Maybe it's an issue for
> >> NFS as well ("at least" only for privileged user space)? In which case,
> >> again, I would hope we would do better.
> >>
> >>
> >> Anyhow, I'm hoping there will be more feedback from other MM folks, but
> >> likely right now a lot of people are out (just like I should ;) ).
> >>
> >> If I end up being the only one with these concerns, then likely people can
> >> feel free to ignore them. ;)
> >
> > I agree we should do better but IMHO it should be an iterative process.
> > I think your concerns are valid, so let's push the discussion towards
> > resolving those concerns. I think the concerns can be resolved by better
> > handling of lifetime of folios under writeback. The amount of such
> > folios is already handled through existing dirty throttling mechanism.
> >
> > We should start with a baseline i.e. distribution of lifetime of folios
> > under writeback for traditional storage devices (spinning disk and SSDs)
> > as we don't want an unrealistic goal for ourselves. I think this data will
> > drive the appropriate timeout values (if we decide timeout based
> > approach is the right one).
> >
> > At the moment we have timeout based approach to limit the lifetime of
> > folios under writeback. Any other ideas?
>
> See above, maybe we could limit the deadlock avoidance to the actual
> deadlock path and sort out the "infinite writeback in some corner cases"
> problem separately.
>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-30 18:38                                                   ` Joanne Koong
@ 2024-12-30 19:52                                                     ` David Hildenbrand
  2024-12-30 20:11                                                       ` Shakeel Butt
  2024-12-30 20:04                                                     ` Shakeel Butt
  1 sibling, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2024-12-30 19:52 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Shakeel Butt, Bernd Schubert, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko


>>
>> What sounds plausible for me is:
>>
>> a) Make this only affect the actual deadlock path: sync migration
>>      during compaction. Communicate it either using some "context"
>>      information or with a new MIGRATE_SYNC_COMPACTION.
>> b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
>>       that very deadlock problem.
>> c) Leave all other sync migration users alone for now
> 
> The deadlock path is separate from sync migration. The deadlock arises
> from a corner case where cgroupv1 reclaim waits on a folio under
> writeback where that writeback itself is blocked on reclaim.

Okay, so compaction (IOW this patch) is not relevant at all to resolve 
the deadlock in any way, correct?

For a second I thought I understood how this patch here relates to the 
deadlock :)

> 
>>
>> Would that prevent the deadlock? Even *better* would be to be able to
>> ask the fs if starting writeback on a specific folio could deadlock.
>> Because in most cases, as I understand, we'll not actually run into the
>> deadlock and would just want to wait for writeback to just complete
>> (esp. compaction).
>>
>> (I still think having folios under writeback for a long time might be a
>> problem, but that's indeed something to sort out separately in the
>> future, because I suspect NFS has similar issues. We'd want to "wait
>> with timeout" and e.g., cancel writeback during memory
>> offlining/alloc_cma ...)
> 
> I'm looking back at some of the discussions in v2 [1] and I'm still
> not clear on how memory fragmentation for non-movable pages differs
> from memory fragmentation from movable pages and whether one is worse
> than the other. Currently fuse uses movable temp pages (allocated with
> gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same

Why are they movable? Do you also specify __GFP_MOVABLE?

If not, they are unmovable and are never allocated from 
ZONE_MOVABLE/MIGRATE_CMA -- and usually only from MIGRATE_UNMOVABLE, to
group these unmovable pages.

> issue where a buggy/malicious server may never complete writeback.

If the temp pages are not allocated using __GFP_MOVABLE, they are just 
like any other kernel allocation -- unmovable. Nobody would even try 
migrating them, ever. And they are allocated from memory regions where 
that is expected.
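
As an illustration of the point above, here is a minimal user-space sketch
of how allocation flags could decide the mobility grouping. The flag bits
and the names toy_gfp_migratetype(), TOY_GFP_* and TOY_MIGRATE_* are
hypothetical stand-ins, not the kernel's definitions. The sketch only
assumes that a request without a movable hint ends up in the unmovable
group.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration; not the kernel's GFP values. */
#define TOY_GFP_NOFS        (1u << 0)
#define TOY_GFP_HIGHMEM     (1u << 1)
#define TOY_GFP_MOVABLE     (1u << 2)
#define TOY_GFP_RECLAIMABLE (1u << 3)

enum toy_migratetype {
	TOY_MIGRATE_UNMOVABLE = 0,	/* default when no mobility hint is given */
	TOY_MIGRATE_MOVABLE,
	TOY_MIGRATE_RECLAIMABLE,
};

/* Map toy allocation flags to the mobility group the page would land in. */
enum toy_migratetype toy_gfp_migratetype(unsigned int gfp)
{
	/* This toy model treats passing both mobility hints as a caller bug. */
	assert(!((gfp & TOY_GFP_MOVABLE) && (gfp & TOY_GFP_RECLAIMABLE)));

	if (gfp & TOY_GFP_MOVABLE)
		return TOY_MIGRATE_MOVABLE;
	if (gfp & TOY_GFP_RECLAIMABLE)
		return TOY_MIGRATE_RECLAIMABLE;
	return TOY_MIGRATE_UNMOVABLE;
}
```

In this model, a temp page requested with TOY_GFP_NOFS | TOY_GFP_HIGHMEM is
unmovable, while a pagecache-style request that adds TOY_GFP_MOVABLE is
grouped with movable pages.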


> This has the same effect of fragmenting memory and has a worse memory
> cost to the system in terms of memory used. With not having temp pages
> though, now in this scenario, pages allocated in a movable page block
> can't be compacted and that memory is fragmented. 

Yes. With temp pages, they are simply grouped naturally "where they belong".

After all, pagecache pages are allocated using __GFP_MOVABLE, which 
implies "this thing is movable" -- so the buddy can place them in 
physical memory regions that allow only for movable allocations or 
minimize fragmentation.

> My (basic and maybe
> incorrect) understanding is that memory gets allocated through a buddy
> allocator and moveable vs nonmovable pages get allocated to
> corresponding blocks that match their type, but there's no other
> difference otherwise. Is this understanding correct? Or is there some
> substantial difference between fragmentation for movable vs nonmovable
> blocks?

I assume not regarding fragmentation.


In general, I see two main issues:

A) We are no longer waiting on writeback, even though we expect in sane
environments that writeback will happen and it might be worthwhile to
just wait for writeback so we can migrate these folios.

B) We allow turning movable pages unmovable, possibly forever or for a
long time, and there is no way to make them movable again (e.g., cancel
writeback).


I'm wondering if A) is actually a new issue introduced by this change. 
Can folios with busy temp pages (writeback cleared on folio, but temp 
pages are still around) be migrated? I will look into some details once 
I'm back from vacation.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-30 18:38                                                   ` Joanne Koong
  2024-12-30 19:52                                                     ` David Hildenbrand
@ 2024-12-30 20:04                                                     ` Shakeel Butt
  2025-01-02 19:59                                                       ` Joanne Koong
  1 sibling, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-30 20:04 UTC (permalink / raw)
  To: Joanne Koong
  Cc: David Hildenbrand, Bernd Schubert, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote:
> On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@redhat.com> wrote:

Thanks David for the response.

> >
> > >> BTW, I just looked at NFS out of interest, in particular
> > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> > >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> > >> privileged user that mounts it can set higher ones. I guess one could run
> > >> into similar writeback issues?
> > >
> >
> > Hi,
> >
> > sorry for the late reply.
> >
> > > Yes, I think so.
> > >
> > >>
> > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> > >
> > > I feel like INDETERMINATE in the name is the main cause of confusion.
> >
> > We are adding logic that says "unconditionally, never wait on writeback
> > for these folios, not even any sync migration". That's the main problem
> > I have.
> >
> > Your explanation below is helpful. Because ...
> >
> > > So, let me explain why it is required (but later I will tell you how it
> > > can be avoided). The FUSE thread which is actively handling writeback of
> > > a given folio can cause memory allocation either through syscall or page
> > > fault. That memory allocation can trigger global reclaim synchronously
> > > and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> > > folio whose writeback it is supposed to end, causing a deadlock. So,
> > > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
> > >
> > > The in-kernel fs avoid this situation through the use of GFP_NOFS
> > > allocations. The userspace fs can also use a similar approach which is
> > > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
> > > told that it is hard to use as it is per-thread flag and has to be set
> > > for all the threads handling writeback which can be error prone if the
> > > threadpool is dynamic. Second it is very coarse such that all the
> > > allocations from those threads (e.g. page faults) become NOFS which
> > > makes userspace very unreliable on highly utilized machine as NOFS can
> > > not reclaim potentially a lot of memory and can not trigger oom-kill.
> > >
> >
> > ... now I understand that we want to prevent a deadlock in one specific
> > scenario only?
> >
> > What sounds plausible for me is:
> >
> > a) Make this only affect the actual deadlock path: sync migration
> >     during compaction. Communicate it either using some "context"
> >     information or with a new MIGRATE_SYNC_COMPACTION.
> > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> >      that very deadlock problem.
> > c) Leave all other sync migration users alone for now
> 
> The deadlock path is separate from sync migration. The deadlock arises
> from a corner case where cgroupv1 reclaim waits on a folio under
> writeback where that writeback itself is blocked on reclaim.
> 

Joanne, let's drop the patch to migrate.c completely, rename the flag
to something like what David is suggesting, and only handle it in
the reclaim path.
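
A rough sketch of what handling this only in the reclaim path could look
like, written as a user-space toy model. All names here (toy_folio,
toy_reclaim_decide(), the may_deadlock field) are hypothetical, and the rule
that non-legacy reclaim skips a folio under writeback instead of waiting is
an assumption of this sketch, not something taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_reclaim_action {
	TOY_RECLAIM_FOLIO,	/* not under writeback: reclaim as usual */
	TOY_WAIT_ON_WRITEBACK,	/* block until writeback finishes */
	TOY_SKIP_FOLIO,		/* leave the folio alone for now */
};

struct toy_folio {
	bool under_writeback;
	bool may_deadlock;	/* stand-in for the renamed mapping flag */
};

/* Decide what reclaim does with one folio in this simplified model. */
enum toy_reclaim_action toy_reclaim_decide(const struct toy_folio *folio,
					   bool legacy_memcg_reclaim)
{
	assert(folio != NULL);

	if (!folio->under_writeback)
		return TOY_RECLAIM_FOLIO;
	/* Assumption: only legacy (cgroup v1) memcg reclaim ever waits. */
	if (!legacy_memcg_reclaim)
		return TOY_SKIP_FOLIO;
	/* Waiting could block the very thread that must end writeback. */
	if (folio->may_deadlock)
		return TOY_SKIP_FOLIO;
	return TOY_WAIT_ON_WRITEBACK;
}
```

The only behavioural change in this model is the may_deadlock check: flagged
folios are skipped in the one context that would otherwise wait, and no
migration code is touched.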

> >
> > Would that prevent the deadlock? Even *better* would be to be able to
> > ask the fs if starting writeback on a specific folio could deadlock.
> > Because in most cases, as I understand, we'll not actually run into the
> > deadlock and would just want to wait for writeback to just complete
> > (esp. compaction).
> >
> > (I still think having folios under writeback for a long time might be a
> > problem, but that's indeed something to sort out separately in the
> > future, because I suspect NFS has similar issues. We'd want to "wait
> > with timeout" and e.g., cancel writeback during memory
> > offlining/alloc_cma ...)

Thanks David and yes let's handle the folios under writeback issue
separately.

> 
> I'm looking back at some of the discussions in v2 [1] and I'm still
> not clear on how memory fragmentation for non-movable pages differs
> from memory fragmentation from movable pages and whether one is worse
> than the other.

I think the fragmentation due to movable pages becoming unmovable is
worse, as that situation is unexpected and the kernel can waste a lot of
CPU trying to defrag the blocks containing those folios. For non-movable
blocks, the kernel will not even try to defrag. Now, we can have a
situation where almost all memory is backed by non-movable blocks and
higher order allocations start failing even when there is enough free
memory. In such situations either the system needs to be restarted (or
the workloads restarted if they are the cause of the high non-movable
memory usage) or the admin needs to set up ZONE_MOVABLE, where
non-movable allocations don't go.
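
To make "higher order allocations start failing even when there is enough
free memory" concrete, here is a toy user-space model. The function
toy_can_alloc_order() and the boolean page map are hypothetical. The only
assumption is buddy-style allocation, where an order-n request needs 2^n
contiguous free pages aligned to 2^n.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count pages that are not in use. */
size_t toy_count_free(const bool *page_used, size_t nr_pages)
{
	size_t nr_free = 0;

	assert(page_used != NULL);
	for (size_t i = 0; i < nr_pages; i++)
		if (!page_used[i])
			nr_free++;
	return nr_free;
}

/* True if some naturally aligned run of (1 << order) pages is entirely free. */
bool toy_can_alloc_order(const bool *page_used, size_t nr_pages,
			 unsigned int order)
{
	size_t run = (size_t)1 << order;

	assert(page_used != NULL);
	for (size_t start = 0; start + run <= nr_pages; start += run) {
		bool all_free = true;

		for (size_t i = start; i < start + run; i++) {
			if (page_used[i]) {
				all_free = false;
				break;
			}
		}
		if (all_free)
			return true;
	}
	return false;
}
```

With 16 pages of which only pages 3 and 11 are pinned, 14 pages are free,
yet an order-3 request fails because both aligned 8-page runs contain a
pinned page. An order-2 request still succeeds (pages 4-7).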

> Currently fuse uses movable temp pages (allocated with
> gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> issue where a buggy/malicious server may never complete writeback.

So, these temp pages are not an issue for fragmenting the movable blocks,
but if there is no limit on temp pages, the whole system can become
non-movable (there is a case where movable blocks outside ZONE_MOVABLE
can be converted into non-movable blocks under low memory). ZONE_MOVABLE
will avoid such a scenario, but tuning the right size of ZONE_MOVABLE is
not easy.

> This has the same effect of fragmenting memory and has a worse memory
> cost to the system in terms of memory used. With not having temp pages
> though, now in this scenario, pages allocated in a movable page block
> can't be compacted and that memory is fragmented. My (basic and maybe
> incorrect) understanding is that memory gets allocated through a buddy
> allocator and moveable vs nonmovable pages get allocated to
> corresponding blocks that match their type, but there's no other
> difference otherwise. Is this understanding correct? Or is there some
> substantial difference between fragmentation for movable vs nonmovable
> blocks?

The main difference is the fallback of high order allocation which can
trigger compaction or background compaction through kcompactd. The
kernel will only try to defrag the movable blocks.



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-30 19:52                                                     ` David Hildenbrand
@ 2024-12-30 20:11                                                       ` Shakeel Butt
  2025-01-02 18:54                                                         ` Joanne Koong
  0 siblings, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-30 20:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Joanne Koong, Bernd Schubert, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Mon, Dec 30, 2024 at 08:52:04PM +0100, David Hildenbrand wrote:
> 
[...]
> > I'm looking back at some of the discussions in v2 [1] and I'm still
> > not clear on how memory fragmentation for non-movable pages differs
> > from memory fragmentation from movable pages and whether one is worse
> > than the other. Currently fuse uses movable temp pages (allocated with
> > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> 
> Why are they movable? Do you also specify __GFP_MOVABLE?
> 
> If not, they are unmovable and are never allocated from
> ZONE_MOVABLE/MIGRATE_CMA -- and usually only from MIGRATE_UNMOVABLE, to
> group these unmovable pages.
> 

Yes, these temp pages are non-movable. (Must be a typo in Joanne's
email).

[...]
> 
> I assume not regarding fragmentation.
> 
> 
> In general, I see two main issues:
> 
> A) We are no longer waiting on writeback, even though we expect in sane
> environments that writeback will happen and it might be worthwhile to
> just wait for writeback so we can migrate these folios.
> 
> B) We allow turning movable pages to be unmovable, possibly forever/long
> time, and there is no way to make them movable again (e.g., cancel
> writeback).
> 
> 
> I'm wondering if A) is actually a new issue introduced by this change. Can
> folios with busy temp pages (writeback cleared on folio, but temp pages are
> still around) be migrated? I will look into some details once I'm back from
> vacation.
> 

My suggestion is to just drop the patch related to A as it is not
required for deadlock avoidance. For B, I think we need a long term
solution which is usable by other filesystems as well.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-30 20:11                                                       ` Shakeel Butt
@ 2025-01-02 18:54                                                         ` Joanne Koong
  2025-01-03 20:31                                                           ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2025-01-02 18:54 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David Hildenbrand, Bernd Schubert, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Mon, Dec 30, 2024 at 12:11 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Dec 30, 2024 at 08:52:04PM +0100, David Hildenbrand wrote:
> >
> [...]
> > > I'm looking back at some of the discussions in v2 [1] and I'm still
> > > not clear on how memory fragmentation for non-movable pages differs
> > > from memory fragmentation from movable pages and whether one is worse
> > > than the other. Currently fuse uses movable temp pages (allocated with
> > > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> >
> > Why are they movable? Do you also specify __GFP_MOVABLE?
> >
> > If not, they are unmovable and are never allocated from
> > ZONE_MOVABLE/MIGRATE_CMA -- and usually only from MIGRATE_UNMOVABLE, to
> > group these unmovable pages.
> >
>
> Yes, these temp pages are non-movable. (Must be a typo in Joanne's
> email).

Sorry for the confusion, that should have been "non-movable temp pages".

>
> [...]
> >
> > I assume not regarding fragmentation.
> >
> >
> > In general, I see two main issues:
> >
> > A) We are no longer waiting on writeback, even though we expect in sane
> > environments that writeback will happen and it might be worthwhile to
> > just wait for writeback so we can migrate these folios.
> >
> > B) We allow turning movable pages to be unmovable, possibly forever/long
> > time, and there is no way to make them movable again (e.g., cancel
> > writeback).
> >
> >
> > I'm wondering if A) is actually a new issue introduced by this change. Can
> > folios with busy temp pages (writeback cleared on folio, but temp pages are
> > still around) be migrated? I will look into some details once I'm back from
> > vacation.
> >

Folios with busy temp pages can be migrated since fuse will clear
writeback on the folio immediately once it's copied to the temp page.

To me, these two issues seem like one and the same. No longer waiting
on writeback renders the folio unmovable, which prevents
compaction/migration.

>
> My suggestion is to just drop the patch related to A as it is not
> required for deadlock avoidance. For B, I think we need a long term
> solution which is usable by other filesystems as well.

Sounds good. With that, we need to take this patchset out of
mm-unstable or this could lead to migration infinitely waiting on
folio writeback without the migrate patch there.


Thanks,
Joanne


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-30 20:04                                                     ` Shakeel Butt
@ 2025-01-02 19:59                                                       ` Joanne Koong
  2025-01-02 20:26                                                         ` Zi Yan
  0 siblings, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2025-01-02 19:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David Hildenbrand, Bernd Schubert, Zi Yan, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Mon, Dec 30, 2024 at 12:04 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote:
> > On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@redhat.com> wrote:
>
> Thanks David for the response.
>
> > >
> > > >> BTW, I just looked at NFS out of interest, in particular
> > > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> > > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> > > >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> > > >> privileged user that mounts it can set higher ones. I guess one could run
> > > >> into similar writeback issues?
> > > >
> > >
> > > Hi,
> > >
> > > sorry for the late reply.
> > >
> > > > Yes, I think so.
> > > >
> > > >>
> > > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> > > >
> > > > I feel like INDETERMINATE in the name is the main cause of confusion.
> > >
> > > We are adding logic that says "unconditionally, never wait on writeback
> > > for these folios, not even any sync migration". That's the main problem
> > > I have.
> > >
> > > Your explanation below is helpful. Because ...
> > >
> > > > So, let me explain why it is required (but later I will tell you how it
> > > > can be avoided). The FUSE thread which is actively handling writeback of
> > > > a given folio can cause memory allocation either through syscall or page
> > > > fault. That memory allocation can trigger global reclaim synchronously
> > > > and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> > > > folio whose writeback it is supposed to end, causing a deadlock. So,
> > > > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
> > > >
> > > > The in-kernel fs avoid this situation through the use of GFP_NOFS
> > > > allocations. The userspace fs can also use a similar approach which is
> > > > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
> > > > told that it is hard to use as it is per-thread flag and has to be set
> > > > for all the threads handling writeback which can be error prone if the
> > > > threadpool is dynamic. Second it is very coarse such that all the
> > > > allocations from those threads (e.g. page faults) become NOFS which
> > > > makes userspace very unreliable on highly utilized machine as NOFS can
> > > > not reclaim potentially a lot of memory and can not trigger oom-kill.
> > > >
> > >
> > > ... now I understand that we want to prevent a deadlock in one specific
> > > scenario only?
> > >
> > > What sounds plausible for me is:
> > >
> > > a) Make this only affect the actual deadlock path: sync migration
> > >     during compaction. Communicate it either using some "context"
> > >     information or with a new MIGRATE_SYNC_COMPACTION.
> > > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> > >      that very deadlock problem.
> > > c) Leave all other sync migration users alone for now
> >
> > The deadlock path is separate from sync migration. The deadlock arises
> > from a corner case where cgroupv1 reclaim waits on a folio under
> > writeback where that writeback itself is blocked on reclaim.
> >
>
> Joanne, let's drop the patch to migrate.c completely and let's rename
> the flag to something like what David is suggesting and only handle in
> the reclaim path.
>
> > >
> > > Would that prevent the deadlock? Even *better* would be to be able to
> > > ask the fs if starting writeback on a specific folio could deadlock.
> > > Because in most cases, as I understand, we'll not actually run into the
> > > deadlock and would just want to wait for writeback to just complete
> > > (esp. compaction).
> > >
> > > (I still think having folios under writeback for a long time might be a
> > > problem, but that's indeed something to sort out separately in the
> > > future, because I suspect NFS has similar issues. We'd want to "wait
> > > with timeout" and e.g., cancel writeback during memory
> > > offlining/alloc_cma ...)
>
> Thanks David and yes let's handle the folios under writeback issue
> separately.
>
> >
> > I'm looking back at some of the discussions in v2 [1] and I'm still
> > not clear on how memory fragmentation for non-movable pages differs
> > from memory fragmentation from movable pages and whether one is worse
> > than the other.
>
> I think the fragmentation due to movable pages becoming unmovable is
> worse as that situation is unexpected and the kernel can waste a lot of
> CPU to defrag the block containing those folios. For non-movable blocks,
> the kernel will not even try to defrag. Now we can have a situation
> where almost all memory is backed by non-movable blocks and higher order
> allocations start failing even when there is enough free memory. For
> such situations either system needs to be restarted (or workloads
> restarted if they are cause of high non-movable memory) or the admin
> needs to setup ZONE_MOVABLE where non-movable allocations don't go.

Thanks for the explanations.

The reason I ask is that I'm trying to figure out whether a timed wait
or retry mechanism, instead of skipping migration outright, would be a
viable solution. That is, when attempting to migrate a folio that is
under writeback and whose mapping has AS_WRITEBACK_INDETERMINATE set,
migration would wait on folio writeback for a bounded amount of time and
then skip the folio if it is still under writeback.
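
The bounded-wait idea above can be sketched as a small user-space
simulation. Everything here is hypothetical (toy_wb_folio,
toy_migrate_with_bounded_wait(), the poll-count notion of time); it only
models "wait up to a budget, then give up on this folio" and is not how
migration is implemented.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Marker for a server that never completes writeback (buggy or malicious). */
#define TOY_WRITEBACK_NEVER_ENDS UINT_MAX

struct toy_wb_folio {
	/* Poll intervals left until writeback completes; 0 means done. */
	unsigned int polls_left;
};

/*
 * Wait for writeback for at most max_polls intervals.
 * Returns true if the folio can be migrated, false if it is skipped.
 */
bool toy_migrate_with_bounded_wait(struct toy_wb_folio *folio,
				   unsigned int max_polls)
{
	unsigned int waited = 0;

	assert(folio != NULL);
	while (folio->polls_left != 0) {
		if (waited == max_polls)
			return false;	/* budget exhausted: skip this folio */
		if (folio->polls_left != TOY_WRITEBACK_NEVER_ENDS)
			folio->polls_left--;	/* one interval elapses */
		waited++;
	}
	return true;	/* writeback finished within the budget */
}
```

In this model a well-behaved server costs migration at most a short delay,
while a server that never completes writeback costs exactly max_polls
intervals per attempt before the folio is skipped.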

There are two cases for fuse folios under writeback (for folios not
under writeback, migration will work as is):
a) normal case: the server is not malicious or buggy, and writeback is
completed in a timely manner.
For this case, migration would be successful and there'd be no
difference between having no temp pages vs temp pages.


b) the server is malicious or buggy:
e.g., the server never completes writeback

With no temp pages:
The folio under writeback prevents a memory block (not sure how big
this usually is?) from being compacted, leading to memory
fragmentation

With temp pages:
fuse allocates a non-movable page for every page it needs to write
back, which worsens memory usage; these pages will never get freed
since the server never finishes writeback on them. The non-movable
pages could also fragment memory blocks like in the scenario with no
temp pages.


Is the b) case with no temp pages worse for memory health than the
scenario with temp pages? For the CPU usage issue (e.g., the kernel keeps
trying to defrag blocks containing these problematic folios), it seems
like this could potentially be mitigated by marking these blocks as
uncompactable?


Thanks,
Joanne

>
> > Currently fuse uses movable temp pages (allocated with
> > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> > issue where a buggy/malicious server may never complete writeback.
>
> So, these temp pages are not an issue for fragmenting the movable blocks
> but if there is no limit on temp pages, the whole system can become
> non-movable (there is a case where movable blocks on non-ZONE_MOVABLE
> can be converted into non-movable blocks under low memory). ZONE_MOVABLE
> will avoid such scenario but tuning the right size of ZONE_MOVABLE is
> not easy.
>
> > This has the same effect of fragmenting memory and has a worse memory
> > cost to the system in terms of memory used. With not having temp pages
> > though, now in this scenario, pages allocated in a movable page block
> > can't be compacted and that memory is fragmented. My (basic and maybe
> > incorrect) understanding is that memory gets allocated through a buddy
> > allocator and moveable vs nonmovable pages get allocated to
> > corresponding blocks that match their type, but there's no other
> > difference otherwise. Is this understanding correct? Or is there some
> > substantial difference between fragmentation for movable vs nonmovable
> > blocks?
>
> The main difference is the fallback of high order allocation which can
> trigger compaction or background compaction through kcompactd. The
> kernel will only try to defrag the movable blocks.
>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-02 19:59                                                       ` Joanne Koong
@ 2025-01-02 20:26                                                         ` Zi Yan
  0 siblings, 0 replies; 124+ messages in thread
From: Zi Yan @ 2025-01-02 20:26 UTC (permalink / raw)
  To: Joanne Koong, Shakeel Butt
  Cc: David Hildenbrand, Bernd Schubert, miklos, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Thu Jan 2, 2025 at 2:59 PM EST, Joanne Koong wrote:
> On Mon, Dec 30, 2024 at 12:04 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote:
> > > On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > Thanks David for the response.
> >
> > > >
> > > > >> BTW, I just looked at NFS out of interest, in particular
> > > > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> > > > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> > > > >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> > > > >> privileged user that mounts it can set higher ones. I guess one could run
> > > > >> into similar writeback issues?
> > > > >
> > > >
> > > > Hi,
> > > >
> > > > sorry for the late reply.
> > > >
> > > > > Yes, I think so.
> > > > >
> > > > >>
> > > > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> > > > >
> > > > > I feel like INDETERMINATE in the name is the main cause of confusion.
> > > >
> > > > We are adding logic that says "unconditionally, never wait on writeback
> > > > for these folios, not even any sync migration". That's the main problem
> > > > I have.
> > > >
> > > > Your explanation below is helpful. Because ...
> > > >
> > > > > So, let me explain why it is required (but later I will tell you how it
> > > > > can be avoided). The FUSE thread which is actively handling writeback of
> > > > > a given folio can cause memory allocation either through syscall or page
> > > > > fault. That memory allocation can trigger global reclaim synchronously
> > > > > and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> > > > > folio whose writeback it is supposed to end, causing a deadlock. So,
> > > > > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
> > > > >
> > > > > The in-kernel fs avoid this situation through the use of GFP_NOFS
> > > > > allocations. The userspace fs can also use a similar approach which is
> > > > > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
> > > > > told that it is hard to use as it is per-thread flag and has to be set
> > > > > for all the threads handling writeback which can be error prone if the
> > > > > threadpool is dynamic. Second it is very coarse such that all the
> > > > > allocations from those threads (e.g. page faults) become NOFS which
> > > > > makes userspace very unreliable on highly utilized machine as NOFS can
> > > > > not reclaim potentially a lot of memory and can not trigger oom-kill.
> > > > >
> > > >
> > > > ... now I understand that we want to prevent a deadlock in one specific
> > > > scenario only?
> > > >
> > > > What sounds plausible for me is:
> > > >
> > > > a) Make this only affect the actual deadlock path: sync migration
> > > >     during compaction. Communicate it either using some "context"
> > > >     information or with a new MIGRATE_SYNC_COMPACTION.
> > > > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> > > >      that very deadlock problem.
> > > > c) Leave all other sync migration users alone for now
> > >
> > > The deadlock path is separate from sync migration. The deadlock arises
> > > from a corner case where cgroupv1 reclaim waits on a folio under
> > > writeback where that writeback itself is blocked on reclaim.
> > >
> >
> > Joanne, let's drop the patch to migrate.c completely and let's rename
> > the flag to something like what David is suggesting and only handle in
> > the reclaim path.
> >
> > > >
> > > > Would that prevent the deadlock? Even *better* would be to be able to
> > > > ask the fs if starting writeback on a specific folio could deadlock.
> > > > Because in most cases, as I understand, we'll not actually run into the
> > > > deadlock and would just want to wait for writeback to just complete
> > > > (esp. compaction).
> > > >
> > > > (I still think having folios under writeback for a long time might be a
> > > > problem, but that's indeed something to sort out separately in the
> > > > future, because I suspect NFS has similar issues. We'd want to "wait
> > > > with timeout" and e.g., cancel writeback during memory
> > > > offlining/alloc_cma ...)
> >
> > Thanks David and yes let's handle the folios under writeback issue
> > separately.
> >
> > >
> > > I'm looking back at some of the discussions in v2 [1] and I'm still
> > > not clear on how memory fragmentation for non-movable pages differs
> > > from memory fragmentation from movable pages and whether one is worse
> > > than the other.
> >
> > I think the fragmentation due to movable pages becoming unmovable is
> > worse as that situation is unexpected and the kernel can waste a lot of
> > CPU to defrag the block containing those folios. For non-movable blocks,
> > the kernel will not even try to defrag. Now we can have a situation
> > where almost all memory is backed by non-movable blocks and higher order
> > allocations start failing even when there is enough free memory. For
> > such situations either system needs to be restarted (or workloads
> > restarted if they are cause of high non-movable memory) or the admin
> > needs to setup ZONE_MOVABLE where non-movable allocations don't go.
>
> Thanks for the explanations.
>
> The reason I ask is because I'm trying to figure out if having a time
> interval wait or retry mechanism instead of skipping migration would
> be a viable solution. Where when attempting the migration for folios
> with the as_writeback_indeterminate flag that are under writeback,
> it'll wait on folio writeback for a certain amount of time and then
> skip the migration if no progress has been made and the folio is still
> under writeback.
>
> there are two cases for fuse folios under writeback (for folios not
> under writeback, migration will work as is):
> a) normal case: server is not malicious or buggy, writeback is
> completed in a timely manner.
> For this case, migration would be successful and there'd be no
> difference for this between having no temp pages vs temp pages
>
>
> b) server is malicious or buggy:
> eg the server never completes writeback
>
> With no temp pages:
> The folio under writeback prevents a memory block (not sure how big
> this usually is?) from being compacted, leading to memory
> fragmentation

It is called a pageblock. Its size is usually the same as a PMD THP
(e.g., 2MB on x86_64).

With no temp pages, folios can spread across multiple pageblocks,
fragmenting all of them.

>
> With temp pages:
> fuse allocates a non-movable page for every page it needs to write
> back, which worsens memory usage, these pages will never get freed
> since the server never finishes writeback on them. The non-movable
> pages could also fragment memory blocks like in the scenario with no
> temp pages.

Since the temp pages all come from MIGRATE_UNMOVABLE pageblocks,
which are far fewer, the fragmentation is much more limited.

>
>
> Is the b) case with no temp pages worse for memory health than the
> scenario with temp pages? For the cpu usage issue (eg kernel keeps
> trying to defrag blocks containing these problematic folios), it seems
> like this could be potentially mitigated by marking these blocks as
> uncompactable?

Without temp pages, folios under writeback can potentially fragment more
pageblocks, if not all of them, than with temp pages. With temp pages,
the stuck pages come from MIGRATE_UNMOVABLE pageblocks, which are used
for unmovable page allocations such as kernel data and are supposed to
be far fewer than MIGRATE_MOVABLE pageblocks in the system.
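
As a rough illustration of the pageblock argument (a toy userspace model,
not kernel code), the sketch below counts how many pageblocks end up
containing at least one stuck page when the same number of pages is either
scattered across all of memory or packed into a few dedicated pageblocks.
The figure of 512 pages per pageblock assumes 4 KB pages and 2 MB
pageblocks as on x86_64; the function names and the random placement are
assumptions made for the example.

```python
import random

# Assumption: 2 MB pageblocks made of 4 KB pages (x86_64 defaults).
PAGES_PER_PAGEBLOCK = 512


def polluted_pageblocks(page_frames):
    """Count distinct pageblocks holding at least one of the given page frames."""
    return len({pfn // PAGES_PER_PAGEBLOCK for pfn in page_frames})


def scattered(num_stuck, total_pageblocks, seed=0):
    """Model without temp pages: stuck folios may sit anywhere in memory."""
    rng = random.Random(seed)
    total_pages = total_pageblocks * PAGES_PER_PAGEBLOCK
    return rng.sample(range(total_pages), num_stuck)


def grouped(num_stuck):
    """Model with temp pages: stuck pages are packed into dedicated pageblocks."""
    return list(range(num_stuck))


if __name__ == "__main__":
    stuck = 1024
    print("grouped:  ", polluted_pageblocks(grouped(stuck)))
    print("scattered:", polluted_pageblocks(scattered(stuck, 4096)))
```

In this model the grouped case touches only as many pageblocks as are
needed to hold the pages, while the scattered case can touch up to one
pageblock per stuck page.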

>
>
> Thanks,
> Joanne
>
> >
> > > Currently fuse uses movable temp pages (allocated with
> > > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> > > issue where a buggy/malicious server may never complete writeback.
> >
> > So, these temp pages are not an issue for fragmenting the movable blocks
> > but if there is no limit on temp pages, the whole system can become
> > non-movable (there is a case where movable blocks on non-ZONE_MOVABLE
> > can be converted into non-movable blocks under low memory). ZONE_MOVABLE
> > will avoid such scenario but tuning the right size of ZONE_MOVABLE is
> > not easy.
> >
> > > This has the same effect of fragmenting memory and has a worse memory
> > > cost to the system in terms of memory used. With not having temp pages
> > > though, now in this scenario, pages allocated in a movable page block
> > > can't be compacted and that memory is fragmented. My (basic and maybe
> > > incorrect) understanding is that memory gets allocated through a buddy
> > > allocator and moveable vs nonmovable pages get allocated to
> > > corresponding blocks that match their type, but there's no other
> > > difference otherwise. Is this understanding correct? Or is there some
> > > substantial difference between fragmentation for movable vs nonmovable
> > > blocks?
> >
> > The main difference is the fallback of high order allocation which can
> > trigger compaction or background compaction through kcompactd. The
> > kernel will only try to defrag the movable blocks.
> >




-- 
Best Regards,
Yan, Zi




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-02 18:54                                                         ` Joanne Koong
@ 2025-01-03 20:31                                                           ` David Hildenbrand
  2025-01-06 10:19                                                             ` Miklos Szeredi
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2025-01-03 20:31 UTC (permalink / raw)
  To: Joanne Koong, Shakeel Butt
  Cc: Bernd Schubert, Zi Yan, miklos, linux-fsdevel, jefflexu, josef,
	linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On 02.01.25 19:54, Joanne Koong wrote:
> On Mon, Dec 30, 2024 at 12:11 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Mon, Dec 30, 2024 at 08:52:04PM +0100, David Hildenbrand wrote:
>>>
>> [...]
>>>> I'm looking back at some of the discussions in v2 [1] and I'm still
>>>> not clear on how memory fragmentation for non-movable pages differs
>>>> from memory fragmentation from movable pages and whether one is worse
>>>> than the other. Currently fuse uses movable temp pages (allocated with
>>>> gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
>>>
>>> Why are they movable? Do you also specify __GFP_MOVABLE?
>>>
>>> If not, they are unmovable and are never allocated from
>>> ZONE_MOVABLE/MIGRATE_CMA -- and usually only from MIGRATE_UNMOVABLE, to
>>> group these unmovable pages.
>>>
>>
>> Yes, these temp pages are non-movable. (Must be a typo in Joanne's
>> email).
> 
> Sorry for the confusion, that should have been "non-movable temp pages".
> 
>>
>> [...]
>>>
>>> I assume not regarding fragmentation.
>>>
>>>
>>> In general, I see two main issues:
>>>
>>> A) We are no longer waiting on writeback, even though we expect in sane
>>> environments that writeback will happen and it might be worthwhile to
>>> just wait for writeback so we can migrate these folios.
>>>
>>> B) We allow turning movable pages to be unmovable, possibly forever/long
>>> time, and there is no way to make them movable again (e.g., cancel
>>> writeback).
>>>
>>>
>>> I'm wondering if A) is actually a new issue introduced by this change. Can
>>> folios with busy temp pages (writeback cleared on folio, but temp pages are
>>> still around) be migrated? I will look into some details once I'm back from
>>> vacation.
>>>
> 
> Folios with busy temp pages can be migrated since fuse will clear
> writeback on the folio immediately once it's copied to the temp page.

I was rather wondering if there is something else that prevents 
migrating these folios: for example, if there is a raised refcount on 
the folio while the temp pages exist. If that is not the case, then it 
should indeed just work.

> 
> To me, these two issues seem like one and the same. No longer waiting
> on writeback renders it unmovable, which prevents
> compaction/migration.
> 
>>
>> My suggestion is to just drop the patch related to A as it is not
>> required for deadlock avoidance. For B, I think we need a long term
>> solution which is usable by other filesystems as well.
> 
> Sounds good. With that, we need to take this patchset out of
> mm-unstable or this could lead to migration infinitely waiting on
> folio writeback without the migrate patch there.

I want to try triggering it with NFS next week when I am back from PTO, 
to see if it is indeed a problem there as well on connection loss.

In any case, having movable pages be turned unmovable due to persistent 
writeback is something that must be fixed, not worked around. Likely a
good topic for LSF/MM.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-03 20:31                                                           ` David Hildenbrand
@ 2025-01-06 10:19                                                             ` Miklos Szeredi
  2025-01-06 18:17                                                               ` Shakeel Butt
  0 siblings, 1 reply; 124+ messages in thread
From: Miklos Szeredi @ 2025-01-06 10:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Joanne Koong, Shakeel Butt, Bernd Schubert, Zi Yan, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> In any case, having movable pages be turned unmovable due to persistent
> writeback is something that must be fixed, not worked around. Likely a
> good topic for LSF/MM.

Yes, this seems a good cross fs-mm topic.

So the issue discussed here is that movable pages used for fuse
page-cache cause problems when memory needs to be compacted. The
problem is either that

 - the page is skipped, leaving the physical memory block unmovable

 - the compaction is blocked for an unbounded time

While the new AS_WRITEBACK_INDETERMINATE could potentially make things
worse, the same thing happens on readahead, since the new page can be
locked for an indeterminate amount of time, which can also block
compaction, right?

What about explicitly opting fuse cache pages out of compaction by
allocating them from ZONE_UNMOVABLE?

Thanks,
Miklos



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-06 10:19                                                             ` Miklos Szeredi
@ 2025-01-06 18:17                                                               ` Shakeel Butt
  2025-01-07  8:34                                                                 ` David Hildenbrand
  2025-01-07 16:15                                                                 ` Miklos Szeredi
  0 siblings, 2 replies; 124+ messages in thread
From: Shakeel Butt @ 2025-01-06 18:17 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Hildenbrand, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> > In any case, having movable pages be turned unmovable due to persistent
> > writeback is something that must be fixed, not worked around. Likely a
> > good topic for LSF/MM.
> 
> Yes, this seems a good cross fs-mm topic.
> 
> So the issue discussed here is that movable pages used for fuse
> page-cache cause problems when memory needs to be compacted. The
> problem is either that
> 
>  - the page is skipped, leaving the physical memory block unmovable
> 
>  - the compaction is blocked for an unbounded time
> 
> While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> worse, the same thing happens on readahead, since the new page can be
> locked for an indeterminate amount of time, which can also block
> compaction, right?

Yes, locked pages are unmovable. How many of these locked pages/folios
can be caused by an untrusted fuse server?

> 
> What about explicitly opting fuse cache pages out of compaction by
> allocating them from ZONE_UNMOVABLE?

This can be done but it will change the memory condition of the
users/workloads/systems where page cache is the majority of the memory
(i.e. majority of memory will be unmovable) and when such systems are
overcommitted, weird corner cases will arise (failing high order
allocations, long term fragmentation etc). In addition the memory
behind CXL will become unusable for fuse folios.

IMHO the transient unmovable state of fuse folios due to writeback is
not an issue if we can show that an untrusted fuse server cannot keep
an unlimited number of folios under writeback for an arbitrarily long
time.



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-06 18:17                                                               ` Shakeel Butt
@ 2025-01-07  8:34                                                                 ` David Hildenbrand
  2025-01-07 18:07                                                                   ` Shakeel Butt
  2025-01-10 20:16                                                                   ` Jeff Layton
  2025-01-07 16:15                                                                 ` Miklos Szeredi
  1 sibling, 2 replies; 124+ messages in thread
From: David Hildenbrand @ 2025-01-07  8:34 UTC (permalink / raw)
  To: Shakeel Butt, Miklos Szeredi
  Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On 06.01.25 19:17, Shakeel Butt wrote:
> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
>>> In any case, having movable pages be turned unmovable due to persistent
>>> writeback is something that must be fixed, not worked around. Likely a
>>> good topic for LSF/MM.
>>
>> Yes, this seems a good cross fs-mm topic.
>>
>> So the issue discussed here is that movable pages used for fuse
>> page-cache cause problems when memory needs to be compacted. The
>> problem is either that
>>
>>   - the page is skipped, leaving the physical memory block unmovable
>>
>>   - the compaction is blocked for an unbounded time
>>
>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things
>> worse, the same thing happens on readahead, since the new page can be
>> locked for an indeterminate amount of time, which can also block
>> compaction, right?

Yes, as memory hotplug + virtio-mem maintainer my bigger concern is 
these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there 
*must not be unmovable pages ever*. Not triggered by an untrusted 
source, not triggered by a trusted source.

It's a violation of core-mm principles.

Even if we have a timeout of 60s, making things like alloc_contig_page() 
wait for that long on writeback is broken and needs to be fixed.

And the fix is not to skip these pages, that's a workaround.

I'm hoping I can find an easy way to trigger this also with NFS.

> 
> Yes locked pages are unmovable. How much of these locked pages/folios
> can be caused by untrusted fuse server?
>>
>> What about explicitly opting fuse cache pages out of compaction by
>> allocating them from ZONE_UNMOVABLE?
> 
> This can be done but it will change the memory condition of the
> users/workloads/systems where page cache is the majority of the memory
> (i.e. majority of memory will be unmovable) and when such systems are
> overcommitted, weird corner cases will arise (failing high order
> allocations, long term fragmentation etc). In addition the memory
> behind CXL will become unusable for fuse folios.

Yes.

> 
> IMHO the transient unmovable state of fuse folios due to writeback is
> not an issue if we can show that untrusted fuse server can not cause
> unlimited folios under writeback for arbitrary long time.

See above, I disagree.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-06 18:17                                                               ` Shakeel Butt
  2025-01-07  8:34                                                                 ` David Hildenbrand
@ 2025-01-07 16:15                                                                 ` Miklos Szeredi
  2025-01-08  1:40                                                                   ` Jingbo Xu
  1 sibling, 1 reply; 124+ messages in thread
From: Miklos Szeredi @ 2025-01-07 16:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David Hildenbrand, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Mon, 6 Jan 2025 at 19:17, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> > > In any case, having movable pages be turned unmovable due to persistent
> > > writeback is something that must be fixed, not worked around. Likely a
> > > good topic for LSF/MM.
> >
> > Yes, this seems a good cross fs-mm topic.
> >
> > So the issue discussed here is that movable pages used for fuse
> > page-cache cause problems when memory needs to be compacted. The
> > problem is either that
> >
> >  - the page is skipped, leaving the physical memory block unmovable
> >
> >  - the compaction is blocked for an unbounded time
> >
> > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > worse, the same thing happens on readahead, since the new page can be
> > locked for an indeterminate amount of time, which can also block
> > compaction, right?
>
> Yes locked pages are unmovable. How much of these locked pages/folios
> can be caused by untrusted fuse server?

A stuck server would quickly reach the background threshold, at which
point everything stops. So my guess is that when this happens
accidentally it won't do much harm.

When it is done deliberately (tuning max_background, starting multiple
servers), the number of pages that are permanently locked could be
basically unlimited.

Thanks,
Miklos



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-07  8:34                                                                 ` David Hildenbrand
@ 2025-01-07 18:07                                                                   ` Shakeel Butt
  2025-01-09 11:22                                                                     ` David Hildenbrand
  2025-01-10 20:16                                                                   ` Jeff Layton
  1 sibling, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2025-01-07 18:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
> On 06.01.25 19:17, Shakeel Butt wrote:
> > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> > > > In any case, having movable pages be turned unmovable due to persistent
> > > > writeback is something that must be fixed, not worked around. Likely a
> > > > good topic for LSF/MM.
> > > 
> > > Yes, this seems a good cross fs-mm topic.
> > > 
> > > So the issue discussed here is that movable pages used for fuse
> > > page-cache cause problems when memory needs to be compacted. The
> > > problem is either that
> > > 
> > >   - the page is skipped, leaving the physical memory block unmovable
> > > 
> > >   - the compaction is blocked for an unbounded time
> > > 
> > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > worse, the same thing happens on readahead, since the new page can be
> > > locked for an indeterminate amount of time, which can also block
> > > compaction, right?
> 
> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
> pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
> unmovable pages ever*. Not triggered by an untrusted source, not triggered
> by a trusted source.
> 
> It's a violation of core-mm principles.

The "must not be unmovable pages ever" is a very strong statement and we
are violating it today and will keep violating it in the future. Any
page/folio that is under lock or writeback, has a reference taken, or has
been isolated from its LRU is unmovable (most of the time for a small
period of time). These operations are being done all over the place in
the kernel. Miklos gave an example of readahead. The per-CPU LRU caches
are another case where folios can get stuck for a long period of time.
Reclaim and compaction can isolate so many folios that they need to have
too_many_isolated() checks. So, "must not be unmovable pages ever" is
impractical.

The point is that, yes, we should aim to improve things, but in
iterations; "must not be unmovable pages ever" is not something we can
achieve in one step. Though I doubt that state is practically achievable,
and to me something like a bound (time or amount) on the transient
unmovable folios is more practical.



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-07 16:15                                                                 ` Miklos Szeredi
@ 2025-01-08  1:40                                                                   ` Jingbo Xu
  0 siblings, 0 replies; 124+ messages in thread
From: Jingbo Xu @ 2025-01-08  1:40 UTC (permalink / raw)
  To: Miklos Szeredi, Shakeel Butt
  Cc: David Hildenbrand, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko



On 1/8/25 12:15 AM, Miklos Szeredi wrote:
> On Mon, 6 Jan 2025 at 19:17, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
>>>> In any case, having movable pages be turned unmovable due to persistent
>>>> writeback is something that must be fixed, not worked around. Likely a
>>>> good topic for LSF/MM.
>>>
>>> Yes, this seems a good cross fs-mm topic.
>>>
>>> So the issue discussed here is that movable pages used for fuse
>>> page-cache cause problems when memory needs to be compacted. The
>>> problem is either that
>>>
>>>  - the page is skipped, leaving the physical memory block unmovable
>>>
>>>  - the compaction is blocked for an unbounded time
>>>
>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things
>>> worse, the same thing happens on readahead, since the new page can be
>>> locked for an indeterminate amount of time, which can also block
>>> compaction, right?
>>
>> Yes locked pages are unmovable. How much of these locked pages/folios
>> can be caused by untrusted fuse server?
> 
> A stuck server would quickly reach the background threshold at which
> point everything stops.   So my guess is that accidentally this won't
> do much harm.
> 
> Doing it deliberately (tuning max_background, starting multiple
> servers) the number of pages that are permanently locked could be
> basically unlimited.

If "limiting the number of actually unmovable pages to a reasonable
bound" is acceptable, maybe we could limit the maximum number of
background requests that all unprivileged FUSE servers together can
have in flight.

BTW, currently the writeback requests are not limited by max_background,
as the writeback routine allocates requests with "force == true". We
once noticed that a heavy writeback workload could starve other
background requests (e.g. readahead): the readahead routine kept
waiting in fuse_get_req() until the writeback workload finished.
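
To picture the difference between a per-server limit and a limit shared
by all unprivileged servers, here is a toy model (not FUSE code). The
class and function names are hypothetical, and the per-server limit used
in the usage note is only an illustrative number. Without a shared budget
the number of pinned requests grows with the number of stuck servers;
with one it stops at the global limit.

```python
class GlobalBackgroundBudget:
    """Hypothetical system-wide cap on in-flight background requests."""

    def __init__(self, global_limit):
        self.global_limit = global_limit
        self.in_flight = 0

    def try_acquire(self):
        """Reserve one slot; return False when the global limit is reached."""
        if self.in_flight >= self.global_limit:
            return False
        self.in_flight += 1
        return True

    def release(self):
        """Give one slot back when a request completes."""
        if self.in_flight == 0:
            raise RuntimeError("release() without a matching try_acquire()")
        self.in_flight -= 1


def stuck_requests(num_servers, per_server_limit, budget=None):
    """Count requests pinned by servers that never complete anything."""
    pinned = 0
    for _ in range(num_servers):
        for _ in range(per_server_limit):
            if budget is not None and not budget.try_acquire():
                return pinned  # global cap reached, nothing more is admitted
            pinned += 1
    return pinned
```

For example, stuck_requests(100, 12) counts every request of every
server, while stuck_requests(100, 12, GlobalBackgroundBudget(50)) stops
at the shared limit.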

-- 
Thanks,
Jingbo



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-07 18:07                                                                   ` Shakeel Butt
@ 2025-01-09 11:22                                                                     ` David Hildenbrand
  2025-01-10 20:28                                                                       ` Jeff Layton
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2025-01-09 11:22 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On 07.01.25 19:07, Shakeel Butt wrote:
> On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
>> On 06.01.25 19:17, Shakeel Butt wrote:
>>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
>>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
>>>>> In any case, having movable pages be turned unmovable due to persistent
>>>>> writeback is something that must be fixed, not worked around. Likely a
>>>>> good topic for LSF/MM.
>>>>
>>>> Yes, this seems a good cross fs-mm topic.
>>>>
>>>> So the issue discussed here is that movable pages used for fuse
>>>> page-cache cause problems when memory needs to be compacted. The
>>>> problem is either that
>>>>
>>>>    - the page is skipped, leaving the physical memory block unmovable
>>>>
>>>>    - the compaction is blocked for an unbounded time
>>>>
>>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things
>>>> worse, the same thing happens on readahead, since the new page can be
>>>> locked for an indeterminate amount of time, which can also block
>>>> compaction, right?
>>
>> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
>> pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
>> unmovable pages ever*. Not triggered by an untrusted source, not triggered
>> by a trusted source.
>>
>> It's a violation of core-mm principles.
> 
> The "must not be unmovable pages ever" is a very strong statement and we
> are violating it today and will keep violating it in future. Any
> page/folio under lock or writeback or have reference taken or have been
> isolated from their LRU is unmovable (most of the time for small period
> of time).

^ this: "small period of time" is what I meant.

Most of these things are known to not be problematic: retrying a couple 
of times makes it work, that's why migration keeps retrying.
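
The retry behaviour can be pictured with a small toy model (an
illustration only, not the logic in mm/migrate.c): a folio that is busy
for a few attempts eventually migrates, while a folio that stays busy
forever exhausts the retry budget. The attempt limit and the function
name are assumptions made for the example.

```python
def try_migrate(busy_attempts, max_attempts=10):
    """Return the attempt number on which migration succeeds, or None.

    busy_attempts: how many attempts find the folio busy (locked, under
    writeback, extra reference, ...) before it becomes movable again.
    max_attempts: hypothetical retry budget for this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt > busy_attempts:
            return attempt  # folio was free on this attempt
    return None  # gave up: the folio stayed unmovable for too long


if __name__ == "__main__":
    print(try_migrate(2))             # short-term busy: succeeds after retrying
    print(try_migrate(float("inf")))  # never released: retries are exhausted
```

Short-term conditions fall into the first case; a folio stuck under
writeback indefinitely falls into the second, no matter how large the
retry budget is.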

Again, as an example, we allow short-term O_DIRECT but disallow 
long-term page pinning. I think there were concerns at some point if 
O_DIRECT might also be problematic (I/O might take a while), but so far 
it was not a problem in practice that would make CMA allocations easily 
fail.

vmsplice() is a known problem, because it behaves like O_DIRECT but 
actually triggers long-term pinning; IIRC David Howells has this on his 
todo list to fix. [I recall that seccomp disallows vmsplice by default 
right now]

> These operations are being done all over the place in kernel.
> Miklos gave an example of readahead. 

I assume you mean "unmovable for a short time", correct? Or can you
point me at that specific example? I think I missed that.

> The per-CPU LRU caches are another
> case where folios can get stuck for long period of time.

Which is why memory offlining disables the lru cache. See 
lru_cache_disable(). Other users that care about that drain the LRU on 
all cpus.

> Reclaim and
> compaction can isolate a lot of folios that they need to have
> too_many_isolated() checks. So, "must not be unmovable pages ever" is
> impractical.

"must only be short-term unmovable", better?

> 
> The point is that, yes we should aim to improve things but in iterations
> and "must not be unmovable pages ever" is not something we can achieve
> in one step.

I agree with the "improve things in iterations", but as
AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we 
are making things worse.

And as this discussion has been going on for too long, to summarize my 
point: there exist conditions where pages are short-term unmovable, and 
possibly some to be fixed that turn pages long-term unmovable (e.g., 
vmsplice); that does not mean that we can freely add new conditions that 
turn movable pages unmovable long-term or even forever.

Again, this might be a good LSF/MM topic. If I had the capacity, I
would suggest a topic around which things are known to cause pages to be
short-term or long-term unmovable/unsplittable, and which of those can
be handled and which cannot. Maybe I'll find the time to propose that as
a topic.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-07  8:34                                                                 ` David Hildenbrand
  2025-01-07 18:07                                                                   ` Shakeel Butt
@ 2025-01-10 20:16                                                                   ` Jeff Layton
  2025-01-10 20:20                                                                     ` David Hildenbrand
  1 sibling, 1 reply; 124+ messages in thread
From: Jeff Layton @ 2025-01-10 20:16 UTC (permalink / raw)
  To: David Hildenbrand, Shakeel Butt, Miklos Szeredi
  Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote:
> On 06.01.25 19:17, Shakeel Butt wrote:
> > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> > > > In any case, having movable pages be turned unmovable due to persistent
> > > > writeback is something that must be fixed, not worked around. Likely a
> > > > good topic for LSF/MM.
> > > 
> > > Yes, this seems a good cross fs-mm topic.
> > > 
> > > So the issue discussed here is that movable pages used for fuse
> > > page-cache cause problems when memory needs to be compacted. The
> > > problem is either that
> > > 
> > >   - the page is skipped, leaving the physical memory block unmovable
> > > 
> > >   - the compaction is blocked for an unbounded time
> > > 
> > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > worse, the same thing happens on readahead, since the new page can be
> > > locked for an indeterminate amount of time, which can also block
> > > compaction, right?
> 
> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is 
> these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there 
> *must not be unmovable pages ever*. Not triggered by an untrusted 
> source, not triggered by a trusted source.
> 
> It's a violation of core-mm principles.
> 
> Even if we have a timeout of 60s, making things like alloc_contig_page() 
> wait for that long on writeback is broken and needs to be fixed.
> 
> And the fix is not to skip these pages, that's a workaround.
> 
> I'm hoping I can find an easy way to trigger this also with NFS.
> 

I imagine that you can just open a file and start writing to it, pull
the plug on the NFS server, and then issue a fsync or something to
ensure some writeback occurs.

Any dirty pagecache folios should be stuck in writeback at that point.
The NFS client is also very patient about waiting for the server to
come back, so it should stay that way indefinitely.
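
A minimal sketch of the write-then-fsync part of this recipe, assuming
Python on the client. The path is whatever file on the NFS mount you
pick, the before_sync hook is a hypothetical convenience for this
example, and the server still has to be disconnected by hand. Against a
healthy filesystem the function simply returns.

```python
import os


def dirty_then_fsync(path, size=1 << 20, before_sync=None):
    """Write `size` zero bytes to `path`, then force writeback with fsync.

    before_sync: optional callable run after the buffered writes and
    before fsync, e.g. to pause while the server is disconnected.
    """
    chunk = b"\0" * 65536
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < size:
            # os.write may write fewer bytes than requested, so loop.
            written += os.write(fd, chunk[: size - written])
        if before_sync is not None:
            before_sync()
        # fsync waits for writeback to finish; with an unreachable server
        # this call is expected to block for as long as the client waits.
        os.fsync(fd)
        return written
    finally:
        os.close(fd)
```

A possible invocation, with a made-up mount point:
dirty_then_fsync("/mnt/nfs/testfile", before_sync=lambda: input("disconnect the server, then press Enter")).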

> > 
> > Yes locked pages are unmovable. How much of these locked pages/folios
> > can be caused by untrusted fuse server?
> > >
> > > What about explicitly opting fuse cache pages out of compaction by
> > > allocating them from ZONE_UNMOVABLE?
> > 
> > This can be done but it will change the memory condition of the
> > users/workloads/systems where page cache is the majority of the memory
> > (i.e. majority of memory will be unmovable) and when such systems are
> > overcommitted, weird corner cases will arise (failing high order
> > allocations, long term fragmentation etc). In addition the memory
> > behind CXL will become unusable for fuse folios.
> 
> Yes.
> 
> > 
> > IMHO the transient unmovable state of fuse folios due to writeback is
> > not an issue if we can show that untrusted fuse server can not cause
> > unlimited folios under writeback for arbitrary long time.
> 
> See above, I disagree.
> 

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-10 20:16                                                                   ` Jeff Layton
@ 2025-01-10 20:20                                                                     ` David Hildenbrand
  2025-01-10 20:43                                                                       ` Jeff Layton
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2025-01-10 20:20 UTC (permalink / raw)
  To: Jeff Layton, Shakeel Butt, Miklos Szeredi
  Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On 10.01.25 21:16, Jeff Layton wrote:
> On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote:
>> On 06.01.25 19:17, Shakeel Butt wrote:
>>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
>>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
>>>>> In any case, having movable pages be turned unmovable due to persistent
>>>>> writeback is something that must be fixed, not worked around. Likely a
>>>>> good topic for LSF/MM.
>>>>
>>>> Yes, this seems a good cross fs-mm topic.
>>>>
>>>> So the issue discussed here is that movable pages used for fuse
>>>> page-cache cause problems when memory needs to be compacted. The
>>>> problem is either that
>>>>
>>>>    - the page is skipped, leaving the physical memory block unmovable
>>>>
>>>>    - the compaction is blocked for an unbounded time
>>>>
>>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things
>>>> worse, the same thing happens on readahead, since the new page can be
>>>> locked for an indeterminate amount of time, which can also block
>>>> compaction, right?
>>
>> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is
>> these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there
>> *must not be unmovable pages ever*. Not triggered by an untrusted
>> source, not triggered by a trusted source.
>>
>> It's a violation of core-mm principles.
>>
>> Even if we have a timeout of 60s, making things like alloc_contig_page()
>> wait for that long on writeback is broken and needs to be fixed.
>>
>> And the fix is not to skip these pages, that's a workaround.
>>
>> I'm hoping I can find an easy way to trigger this also with NFS.
>>
> 
> I imagine that you can just open a file and start writing to it, pull
> the plug on the NFS server, and then issue a fsync or something to
> ensure some writeback occurs.

Yes, that's the plan, thanks!

> 
> Any dirty pagecache folios should be stuck in writeback at that point.
> The NFS client is also very patient about waiting for the server to
> come back, so it should stay that way indefinitely.

Yes, however the default timeout for UDP is fairly small (for TCP 
certainly much longer). So one thing I'd like to understand is what that 
"cancel writeback -> redirty folio" on timeout does, and when it 
actually triggers with TCP vs UDP timeouts.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-09 11:22                                                                     ` David Hildenbrand
@ 2025-01-10 20:28                                                                       ` Jeff Layton
  2025-01-10 21:13                                                                         ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Jeff Layton @ 2025-01-10 20:28 UTC (permalink / raw)
  To: David Hildenbrand, Shakeel Butt
  Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
> On 07.01.25 19:07, Shakeel Butt wrote:
> > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
> > > On 06.01.25 19:17, Shakeel Butt wrote:
> > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> > > > > > In any case, having movable pages be turned unmovable due to persistent
> > > > > > writeback is something that must be fixed, not worked around. Likely a
> > > > > > good topic for LSF/MM.
> > > > > 
> > > > > Yes, this seems a good cross fs-mm topic.
> > > > > 
> > > > > So the issue discussed here is that movable pages used for fuse
> > > > > page-cache cause problems when memory needs to be compacted. The
> > > > > problem is either that
> > > > > 
> > > > >    - the page is skipped, leaving the physical memory block unmovable
> > > > > 
> > > > >    - the compaction is blocked for an unbounded time
> > > > > 
> > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > > > worse, the same thing happens on readahead, since the new page can be
> > > > > locked for an indeterminate amount of time, which can also block
> > > > > compaction, right?
> > > 
> > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
> > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
> > > unmovable pages ever*. Not triggered by an untrusted source, not triggered
> > > by a trusted source.
> > > 
> > > It's a violation of core-mm principles.
> > 
> > The "must not be unmovable pages ever" is a very strong statement and we
> > are violating it today and will keep violating it in future. Any
> > page/folio that is under lock or writeback, has a reference taken, or has
> > been isolated from its LRU is unmovable (most of the time for a small
> > period of time).
> 
> ^ this: "small period of time" is what I meant.
> 
> Most of these things are known to not be problematic: retrying a couple 
> of times makes it work, that's why migration keeps retrying.
> 
> Again, as an example, we allow short-term O_DIRECT but disallow 
> long-term page pinning. I think there were concerns at some point if 
> O_DIRECT might also be problematic (I/O might take a while), but so far 
> it was not a problem in practice that would make CMA allocations easily 
> fail.
> 
> vmsplice() is a known problem, because it behaves like O_DIRECT but 
> actually triggers long-term pinning; IIRC David Howells has this on his 
> todo list to fix. [I recall that seccomp disallows vmsplice by default 
> right now]
> 
> These operations are being done all over the place in kernel.
> > Miklos gave an example of readahead. 
> 
> I assume you mean "unmovable for a short time", correct, or can you 
> point me at that specific example; I think I missed that.
> 
> > The per-CPU LRU caches are another
> > case where folios can get stuck for a long period of time.
> 
> Which is why memory offlining disables the lru cache. See 
> lru_cache_disable(). Other users that care about that drain the LRU on 
> all cpus.
> 
> > Reclaim and
> > compaction can isolate so many folios that they need to have
> > too_many_isolated() checks. So, "must not be unmovable pages ever" is
> > impractical.
> 
> "must only be short-term unmovable", better?
> 

Still a little ambiguous.

How short is "short-term"? Are we talking milliseconds or minutes?

Imposing a hard timeout on writeback requests to unprivileged FUSE
servers might give us a better guarantee of forward-progress, but it
would probably have to be on the order of at least a minute or so to be
workable.

> > 
> > The point is that, yes we should aim to improve things but in iterations
> > and "must not be unmovable pages ever" is not something we can achieve
> > in one step.
> 
> I agree with the "improve things in iterations", but as
> AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we 
> are making things worse.
> 
> And as this discussion has been going on for too long, to summarize my 
> point: there exist conditions where pages are short-term unmovable, and 
> possibly some to be fixed that turn pages long-term unmovable (e.g., 
> vmsplice); that does not mean that we can freely add new conditions that 
> turn movable pages unmovable long-term or even forever.
> 
> Again, this might be a good LSF/MM topic. If I would have the capacity I 
> would suggest a topic around which things are known to cause pages to be 
> short-term or long-term unmovable/unsplittable, and which can be 
> handled, which not. Maybe I'll find the time to propose that as a topic.
> 


This does sound like great LSF/MM fodder! I predict that this session
will run long! ;)
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-10 20:20                                                                     ` David Hildenbrand
@ 2025-01-10 20:43                                                                       ` Jeff Layton
  2025-01-10 21:00                                                                         ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Jeff Layton @ 2025-01-10 20:43 UTC (permalink / raw)
  To: David Hildenbrand, Shakeel Butt, Miklos Szeredi
  Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On Fri, 2025-01-10 at 21:20 +0100, David Hildenbrand wrote:
> On 10.01.25 21:16, Jeff Layton wrote:
> > On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote:
> > > On 06.01.25 19:17, Shakeel Butt wrote:
> > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> > > > > > In any case, having movable pages be turned unmovable due to persistent
> > > > > > writeback is something that must be fixed, not worked around. Likely a
> > > > > > good topic for LSF/MM.
> > > > > 
> > > > > Yes, this seems a good cross fs-mm topic.
> > > > > 
> > > > > So the issue discussed here is that movable pages used for fuse
> > > > > page-cache cause problems when memory needs to be compacted. The
> > > > > problem is either that
> > > > > 
> > > > >    - the page is skipped, leaving the physical memory block unmovable
> > > > > 
> > > > >    - the compaction is blocked for an unbounded time
> > > > > 
> > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > > > worse, the same thing happens on readahead, since the new page can be
> > > > > locked for an indeterminate amount of time, which can also block
> > > > > compaction, right?
> > > 
> > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is
> > > these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there
> > > *must not be unmovable pages ever*. Not triggered by an untrusted
> > > source, not triggered by a trusted source.
> > > 
> > > It's a violation of core-mm principles.
> > > 
> > > Even if we have a timeout of 60s, making things like alloc_contig_page()
> > > wait for that long on writeback is broken and needs to be fixed.
> > > 
> > > And the fix is not to skip these pages, that's a workaround.
> > > 
> > > I'm hoping I can find an easy way to trigger this also with NFS.
> > > 
> > 
> > I imagine that you can just open a file and start writing to it, pull
> > the plug on the NFS server, and then issue a fsync or something to
> > ensure some writeback occurs.
> 
> Yes, that's the plan, thanks!
> 
> > 
> > Any dirty pagecache folios should be stuck in writeback at that point.
> > The NFS client is also very patient about waiting for the server to
> > come back, so it should stay that way indefinitely.
> 
> Yes, however the default timeout for UDP is fairly small (for TCP 
> certainly much longer). So one thing I'd like to understand is what that 
> "cancel writeback -> redirty folio" on timeout does, and when it 
> actually triggers with TCP vs UDP timeouts.
> 


The lifetime of the pagecache pages is not at all related to the socket
lifetimes. IOW, the client can completely lose the connection to the
server and the page will just stay dirty until the connection can be
reestablished and the server responds.

The exception here is if you mount with "-o soft" in which case, an RPC
request will time out with an error after a major RPC timeout (usually
after a minute or so). See nfs(5) for the gory details of timeouts and
retransmission. The default is "-o hard" since that's necessary for
data-integrity in the face of spotty network connections. 

Once a soft mount has a writeback RPC time out, the folio is marked
clean and a writeback error is set on the mapping, so that fsync() will
return an error.
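From userspace, that soft-mount behavior would show up as a failing fsync(). A minimal sketch of checking for it follows; the helper name `fsync_status` is hypothetical, and which errno a timed-out writeback actually reports is not stated in this thread, so the code only returns whatever the kernel hands back.

```python
import errno
import os
from typing import Optional


def fsync_status(fd: int) -> Optional[str]:
    """Call fsync() on `fd`; return None on success, else the errno name.

    A writeback error recorded on the mapping is expected to surface
    here as an OSError rather than being silently dropped.
    """
    try:
        os.fsync(fd)
    except OSError as e:
        # Map the numeric errno to its symbolic name when one is known.
        return errno.errorcode.get(e.errno, str(e.errno))
    return None
```

An application that cares about data integrity on a soft mount would treat any non-None result as "the data may not have reached the server".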
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-10 20:43                                                                       ` Jeff Layton
@ 2025-01-10 21:00                                                                         ` David Hildenbrand
  2025-01-10 21:07                                                                           ` Jeff Layton
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2025-01-10 21:00 UTC (permalink / raw)
  To: Jeff Layton, Shakeel Butt, Miklos Szeredi
  Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On 10.01.25 21:43, Jeff Layton wrote:
> On Fri, 2025-01-10 at 21:20 +0100, David Hildenbrand wrote:
>> On 10.01.25 21:16, Jeff Layton wrote:
>>> On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote:
>>>> On 06.01.25 19:17, Shakeel Butt wrote:
>>>>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
>>>>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
>>>>>>> In any case, having movable pages be turned unmovable due to persistent
>>>>>>> writeback is something that must be fixed, not worked around. Likely a
>>>>>>> good topic for LSF/MM.
>>>>>>
>>>>>> Yes, this seems a good cross fs-mm topic.
>>>>>>
>>>>>> So the issue discussed here is that movable pages used for fuse
>>>>>> page-cache cause problems when memory needs to be compacted. The
>>>>>> problem is either that
>>>>>>
>>>>>>     - the page is skipped, leaving the physical memory block unmovable
>>>>>>
>>>>>>     - the compaction is blocked for an unbounded time
>>>>>>
>>>>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things
>>>>>> worse, the same thing happens on readahead, since the new page can be
>>>>>> locked for an indeterminate amount of time, which can also block
>>>>>> compaction, right?
>>>>
>>>> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is
>>>> these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there
>>>> *must not be unmovable pages ever*. Not triggered by an untrusted
>>>> source, not triggered by a trusted source.
>>>>
>>>> It's a violation of core-mm principles.
>>>>
>>>> Even if we have a timeout of 60s, making things like alloc_contig_page()
>>>> wait for that long on writeback is broken and needs to be fixed.
>>>>
>>>> And the fix is not to skip these pages, that's a workaround.
>>>>
>>>> I'm hoping I can find an easy way to trigger this also with NFS.
>>>>
>>>
>>> I imagine that you can just open a file and start writing to it, pull
>>> the plug on the NFS server, and then issue a fsync or something to
>>> ensure some writeback occurs.
>>
>> Yes, that's the plan, thanks!
>>
>>>
>>> Any dirty pagecache folios should be stuck in writeback at that point.
>>> The NFS client is also very patient about waiting for the server to
>>> come back, so it should stay that way indefinitely.
>>
>> Yes, however the default timeout for UDP is fairly small (for TCP
>> certainly much longer). So one thing I'd like to understand is what that
>> "cancel writeback -> redirty folio" on timeout does, and when it
>> actually triggers with TCP vs UDP timeouts.
>>
> 
> 
> The lifetime of the pagecache pages is not at all related to the socket
> lifetimes. IOW, the client can completely lose the connection to the
> server and the page will just stay dirty until the connection can be
> reestablished and the server responds.

Right. It cannot get reclaimed while that is the case.

> 
> The exception here is if you mount with "-o soft" in which case, an RPC
> request will time out with an error after a major RPC timeout (usually
> after a minute or so). See nfs(5) for the gory details of timeouts and
> retransmission. The default is "-o hard" since that's necessary for
> data-integrity in the face of spotty network connections.
> 
> Once a soft mount has a writeback RPC time out, the folio is marked
> clean and a writeback error is set on the mapping, so that fsync() will
> return an error.

I assume that's the code I stumbled over in nfs_page_async_flush(), 
where we end up calling folio_redirty_for_writepage() + 
nfs_redirty_request(), unless we run into a fatal error; in that case, 
we end up in nfs_write_error() where we set the mapping error and stop 
writeback using nfs_page_end_writeback().

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-10 21:00                                                                         ` David Hildenbrand
@ 2025-01-10 21:07                                                                           ` Jeff Layton
  2025-01-10 21:21                                                                             ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Jeff Layton @ 2025-01-10 21:07 UTC (permalink / raw)
  To: David Hildenbrand, Shakeel Butt, Miklos Szeredi
  Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On Fri, 2025-01-10 at 22:00 +0100, David Hildenbrand wrote:
> On 10.01.25 21:43, Jeff Layton wrote:
> > On Fri, 2025-01-10 at 21:20 +0100, David Hildenbrand wrote:
> > > On 10.01.25 21:16, Jeff Layton wrote:
> > > > On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote:
> > > > > On 06.01.25 19:17, Shakeel Butt wrote:
> > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> > > > > > > > In any case, having movable pages be turned unmovable due to persistent
> > > > > > > > writeback is something that must be fixed, not worked around. Likely a
> > > > > > > > good topic for LSF/MM.
> > > > > > > 
> > > > > > > Yes, this seems a good cross fs-mm topic.
> > > > > > > 
> > > > > > > So the issue discussed here is that movable pages used for fuse
> > > > > > > page-cache cause problems when memory needs to be compacted. The
> > > > > > > problem is either that
> > > > > > > 
> > > > > > >     - the page is skipped, leaving the physical memory block unmovable
> > > > > > > 
> > > > > > >     - the compaction is blocked for an unbounded time
> > > > > > > 
> > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > > > > > worse, the same thing happens on readahead, since the new page can be
> > > > > > > locked for an indeterminate amount of time, which can also block
> > > > > > > compaction, right?
> > > > > 
> > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is
> > > > > these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there
> > > > > *must not be unmovable pages ever*. Not triggered by an untrusted
> > > > > source, not triggered by a trusted source.
> > > > > 
> > > > > It's a violation of core-mm principles.
> > > > > 
> > > > > Even if we have a timeout of 60s, making things like alloc_contig_page()
> > > > > wait for that long on writeback is broken and needs to be fixed.
> > > > > 
> > > > > And the fix is not to skip these pages, that's a workaround.
> > > > > 
> > > > > I'm hoping I can find an easy way to trigger this also with NFS.
> > > > > 
> > > > 
> > > > I imagine that you can just open a file and start writing to it, pull
> > > > the plug on the NFS server, and then issue a fsync or something to
> > > > ensure some writeback occurs.
> > > 
> > > Yes, that's the plan, thanks!
> > > 
> > > > 
> > > > Any dirty pagecache folios should be stuck in writeback at that point.
> > > > The NFS client is also very patient about waiting for the server to
> > > > come back, so it should stay that way indefinitely.
> > > 
> > > Yes, however the default timeout for UDP is fairly small (for TCP
> > > certainly much longer). So one thing I'd like to understand is what that
> > > "cancel writeback -> redirty folio" on timeout does, and when it
> > > actually triggers with TCP vs UDP timeouts.
> > > 
> > 
> > 
> > The lifetime of the pagecache pages is not at all related to the socket
> > lifetimes. IOW, the client can completely lose the connection to the
> > server and the page will just stay dirty until the connection can be
> > reestablished and the server responds.
> 
> Right. It cannot get reclaimed while that is the case.
> 
> > 
> > The exception here is if you mount with "-o soft" in which case, an RPC
> > request will time out with an error after a major RPC timeout (usually
> > after a minute or so). See nfs(5) for the gory details of timeouts and
> > retransmission. The default is "-o hard" since that's necessary for
> > data-integrity in the face of spotty network connections.
> > 
> > Once a soft mount has a writeback RPC time out, the folio is marked
> > clean and a writeback error is set on the mapping, so that fsync() will
> > return an error.
> 
> I assume that's the code I stumbled over in nfs_page_async_flush(), 
> where we end up calling folio_redirty_for_writepage() + 
> nfs_redirty_request(), unless we run into a fatal error; in that case, 
> we end up in nfs_write_error() where we set the mapping error and stop 
> writeback using nfs_page_end_writeback().
> 

Exactly.

The upshot is that you can dirty NFS pages that will sit in the
pagecache indefinitely, if you can disrupt the connection to the server
indefinitely. This is substantially the same in other netfs's too --
CIFS, Ceph, etc.

The big difference vs FUSE is that they don't allow unprivileged users
to mount arbitrary filesystems, so it's harder for an attacker to do
this with only a local unprivileged account to work with.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-10 20:28                                                                       ` Jeff Layton
@ 2025-01-10 21:13                                                                         ` David Hildenbrand
  2025-01-10 22:00                                                                           ` Shakeel Butt
  2025-01-10 23:11                                                                           ` Jeff Layton
  0 siblings, 2 replies; 124+ messages in thread
From: David Hildenbrand @ 2025-01-10 21:13 UTC (permalink / raw)
  To: Jeff Layton, Shakeel Butt
  Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On 10.01.25 21:28, Jeff Layton wrote:
> On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
>> On 07.01.25 19:07, Shakeel Butt wrote:
>>> On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
>>>> On 06.01.25 19:17, Shakeel Butt wrote:
>>>>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
>>>>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
>>>>>>> In any case, having movable pages be turned unmovable due to persistent
>>>>>>> writeback is something that must be fixed, not worked around. Likely a
>>>>>>> good topic for LSF/MM.
>>>>>>
>>>>>> Yes, this seems a good cross fs-mm topic.
>>>>>>
>>>>>> So the issue discussed here is that movable pages used for fuse
>>>>>> page-cache cause problems when memory needs to be compacted. The
>>>>>> problem is either that
>>>>>>
>>>>>>     - the page is skipped, leaving the physical memory block unmovable
>>>>>>
>>>>>>     - the compaction is blocked for an unbounded time
>>>>>>
>>>>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things
>>>>>> worse, the same thing happens on readahead, since the new page can be
>>>>>> locked for an indeterminate amount of time, which can also block
>>>>>> compaction, right?
>>>>
>>>> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
>>>> pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
>>>> unmovable pages ever*. Not triggered by an untrusted source, not triggered
>>>> by a trusted source.
>>>>
>>>> It's a violation of core-mm principles.
>>>
>>> The "must not be unmovable pages ever" is a very strong statement and we
>>> are violating it today and will keep violating it in future. Any
>>> page/folio that is under lock or writeback, has a reference taken, or has
>>> been isolated from its LRU is unmovable (most of the time for a small
>>> period of time).
>>
>> ^ this: "small period of time" is what I meant.
>>
>> Most of these things are known to not be problematic: retrying a couple
>> of times makes it work, that's why migration keeps retrying.
>>
>> Again, as an example, we allow short-term O_DIRECT but disallow
>> long-term page pinning. I think there were concerns at some point if
>> O_DIRECT might also be problematic (I/O might take a while), but so far
>> it was not a problem in practice that would make CMA allocations easily
>> fail.
>>
>> vmsplice() is a known problem, because it behaves like O_DIRECT but
>> actually triggers long-term pinning; IIRC David Howells has this on his
>> todo list to fix. [I recall that seccomp disallows vmsplice by default
>> right now]
>>
>> These operations are being done all over the place in kernel.
>>> Miklos gave an example of readahead.
>>
>> I assume you mean "unmovable for a short time", correct, or can you
>> point me at that specific example; I think I missed that.
>>
>>> The per-CPU LRU caches are another
>>> case where folios can get stuck for a long period of time.
>>
>> Which is why memory offlining disables the lru cache. See
>> lru_cache_disable(). Other users that care about that drain the LRU on
>> all cpus.
>>
>>> Reclaim and
>>> compaction can isolate so many folios that they need to have
>>> too_many_isolated() checks. So, "must not be unmovable pages ever" is
>>> impractical.
>>
>> "must only be short-term unmovable", better?
>>
> 
> Still a little ambiguous.
> 
> How short is "short-term"? Are we talking milliseconds or minutes?

Usually a couple of seconds, max. For memory offlining, slightly longer 
times are acceptable; other things (in particular compaction or CMA 
allocations) will give up much faster.
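To make the "different callers, different patience" point concrete, here is a toy userspace model of a bounded retry loop. It is not kernel code: `retry_until_deadline`, `try_once`, and the budget values are all illustrative assumptions, standing in for a migration attempt and for how long a given caller is willing to keep retrying.

```python
import time
from typing import Callable


def retry_until_deadline(try_once: Callable[[], bool],
                         budget_s: float,
                         delay_s: float = 0.01) -> bool:
    """Retry `try_once` until it succeeds or `budget_s` seconds elapse.

    Returns True on success, False if the time budget ran out first.
    At least one attempt is always made, even with a zero budget.
    """
    deadline = time.monotonic() + budget_s
    while True:
        if try_once():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(delay_s)  # back off briefly before the next attempt
```

In this model, a folio that is only briefly locked succeeds after a few attempts regardless of the budget, while a folio stuck under writeback for minutes fails every caller whose budget is a couple of seconds or less.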

> 
> Imposing a hard timeout on writeback requests to unprivileged FUSE
> servers might give us a better guarantee of forward-progress, but it
> would probably have to be on the order of at least a minute or so to be
> workable.

Yes, and that might already be a bit too much, especially if stuck on 
waiting for folio writeback ... so ideally we could find a way to 
migrate these folios while they are under writeback and the backend is 
not your ordinary disk driver that responds rather quickly.

Right now we do it via these temp pages, and I can see how that's 
undesirable.

For NFS etc. we probably never ran into this, because it's all used in 
fairly well managed environments and, well, I assume NFS easily predates 
CMA and ZONE_MOVABLE :)

>>>
>>> The point is that, yes we should aim to improve things but in iterations
>>> and "must not be unmovable pages ever" is not something we can achieve
>>> in one step.
>>
>> I agree with the "improve things in iterations", but as
>> AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we
>> are making things worse.
>>
>> And as this discussion has been going on for too long, to summarize my
>> point: there exist conditions where pages are short-term unmovable, and
>> possibly some to be fixed that turn pages long-term unmovable (e.g.,
>> vmsplice); that does not mean that we can freely add new conditions that
>> turn movable pages unmovable long-term or even forever.
>>
>> Again, this might be a good LSF/MM topic. If I would have the capacity I
>> would suggest a topic around which things are known to cause pages to be
>> short-term or long-term unmovable/unsplittable, and which can be
>> handled, which not. Maybe I'll find the time to propose that as a topic.
>>
> 
> 
> This does sound like great LSF/MM fodder! I predict that this session
> will run long! ;)

Heh, fully agreed! :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-10 21:07                                                                           ` Jeff Layton
@ 2025-01-10 21:21                                                                             ` David Hildenbrand
  0 siblings, 0 replies; 124+ messages in thread
From: David Hildenbrand @ 2025-01-10 21:21 UTC (permalink / raw)
  To: Jeff Layton, Shakeel Butt, Miklos Szeredi
  Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko

On 10.01.25 22:07, Jeff Layton wrote:
> On Fri, 2025-01-10 at 22:00 +0100, David Hildenbrand wrote:
>> On 10.01.25 21:43, Jeff Layton wrote:
>>> On Fri, 2025-01-10 at 21:20 +0100, David Hildenbrand wrote:
>>>> On 10.01.25 21:16, Jeff Layton wrote:
>>>>> On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote:
>>>>>> On 06.01.25 19:17, Shakeel Butt wrote:
>>>>>>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
>>>>>>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>> In any case, having movable pages be turned unmovable due to persistent
>>>>>>>>> writeback is something that must be fixed, not worked around. Likely a
>>>>>>>>> good topic for LSF/MM.
>>>>>>>>
>>>>>>>> Yes, this seems a good cross fs-mm topic.
>>>>>>>>
>>>>>>>> So the issue discussed here is that movable pages used for fuse
>>>>>>>> page-cache cause problems when memory needs to be compacted. The
>>>>>>>> problem is either that
>>>>>>>>
>>>>>>>>      - the page is skipped, leaving the physical memory block unmovable
>>>>>>>>
>>>>>>>>      - the compaction is blocked for an unbounded time
>>>>>>>>
>>>>>>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things
>>>>>>>> worse, the same thing happens on readahead, since the new page can be
>>>>>>>> locked for an indeterminate amount of time, which can also block
>>>>>>>> compaction, right?
>>>>>>
>>>>>> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is
>>>>>> these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there
>>>>>> *must not be unmovable pages ever*. Not triggered by an untrusted
>>>>>> source, not triggered by a trusted source.
>>>>>>
>>>>>> It's a violation of core-mm principles.
>>>>>>
>>>>>> Even if we have a timeout of 60s, making things like alloc_contig_page()
>>>>>> wait for that long on writeback is broken and needs to be fixed.
>>>>>>
>>>>>> And the fix is not to skip these pages, that's a workaround.
>>>>>>
>>>>>> I'm hoping I can find an easy way to trigger this also with NFS.
>>>>>>
>>>>>
>>>>> I imagine that you can just open a file and start writing to it, pull
>>>>> the plug on the NFS server, and then issue a fsync or something to
>>>>> ensure some writeback occurs.
>>>>
>>>> Yes, that's the plan, thanks!
>>>>
>>>>>
>>>>> Any dirty pagecache folios should be stuck in writeback at that point.
>>>>> The NFS client is also very patient about waiting for the server to
>>>>> come back, so it should stay that way indefinitely.
>>>>
>>>> Yes, however the default timeout for UDP is fairly small (for TCP
>>>> certainly much longer). So one thing I'd like to understand is what that
>>>> "cancel writeback -> redirty folio" on timeout does, and when it
>>>> actually triggers with TCP vs UDP timeouts.
>>>>
>>>
>>>
>>> The lifetime of the pagecache pages is not at all related to the socket
>>> lifetimes. IOW, the client can completely lose the connection to the
>>> server and the page will just stay dirty until the connection can be
>>> reestablished and the server responds.
>>
>> Right. It cannot get reclaimed while that is the case.
>>
>>>
>>> The exception here is if you mount with "-o soft" in which case, an RPC
>>> request will time out with an error after a major RPC timeout (usually
>>> after a minute or so). See nfs(5) for the gory details of timeouts and
>>> retransmission. The default is "-o hard" since that's necessary for
>>> data-integrity in the face of spotty network connections.
>>>
>>> Once a soft mount has a writeback RPC time out, the folio is marked
>>> clean and a writeback error is set on the mapping, so that fsync() will
>>> return an error.
>>
>> I assume that's the code I stumbled over in nfs_page_async_flush(),
>> where we end up calling folio_redirty_for_writepage() +
>> nfs_redirty_request(), unless we run into a fatal error; in that case,
>> we end up in nfs_write_error() where we set the mapping error and stop
>> writeback using nfs_page_end_writeback().
>>
> 
> Exactly.
> 
> The upshot is that you can dirty NFS pages that will sit in the
> pagecache indefinitely, if you can disrupt the connection to the server
> indefinitely. This is substantially the same in other netfs's too --
> CIFS, Ceph, etc.
> 
> The big difference vs FUSE is that they don't allow unprivileged users
> to mount arbitrary filesystems, so it's harder for an attacker to do
> this with only a local unprivileged account to work with.

Exactly my point/concern. With most netfs's I would assume that reliable
connections are mandatory, otherwise you might be in bigger trouble;
maybe that is one of the reasons why being stuck forever waiting for
writeback on folios has not been identified as a problem so far. Maybe :)
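
For anyone wanting to try the reproduction recipe quoted above (write to
a file, take the server away, then force writeback), a minimal userspace
sketch could look like the following. The helper name write_and_sync()
and the 4 KiB chunk size are made up for illustration; only open(),
write(), fsync() and close() are real POSIX calls. Going by the
description above, the fsync() would be expected to block on a hard mount
with an unreachable server, and to eventually return an error on a soft
mount; this has not been verified here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: dirty `len` bytes of page cache for `path`, then
 * force writeback with fsync(). Returns 0 on success or a negative errno.
 */
int write_and_sync(const char *path, size_t len)
{
	char buf[4096];
	int fd, ret = 0;

	assert(path != NULL);
	memset(buf, 0xab, sizeof(buf));

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -errno;

	/* Buffered writes: these only dirty folios in the page cache. */
	while (len > 0) {
		size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t n = write(fd, buf, chunk);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			ret = -errno;
			goto out;
		}
		if (n == 0) {
			ret = -EIO;
			goto out;
		}
		len -= (size_t)n;
	}

	/*
	 * In a real experiment the server would be made unreachable
	 * before this point, so that writeback cannot complete.
	 */
	if (fsync(fd) < 0)
		ret = -errno;
out:
	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

Calling write_and_sync() on a path below the NFS mount point, with the
server disconnected before the fsync(), should leave the dirtied folios
under writeback for as long as the client keeps retrying.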

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-10 21:13                                                                         ` David Hildenbrand
@ 2025-01-10 22:00                                                                           ` Shakeel Butt
  2025-01-13 15:27                                                                             ` David Hildenbrand
  2025-01-10 23:11                                                                           ` Jeff Layton
  1 sibling, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2025-01-10 22:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jeff Layton, Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote:
> On 10.01.25 21:28, Jeff Layton wrote:
> > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
> > > On 07.01.25 19:07, Shakeel Butt wrote:
> > > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
> > > > > On 06.01.25 19:17, Shakeel Butt wrote:
> > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> > > > > > > > In any case, having movable pages be turned unmovable due to persistent
> > > > > > > > writeback is something that must be fixed, not worked around. Likely a
> > > > > > > > good topic for LSF/MM.
> > > > > > > 
> > > > > > > Yes, this seems a good cross fs-mm topic.
> > > > > > > 
> > > > > > > So the issue discussed here is that movable pages used for fuse
> > > > > > > page-cache cause problems when memory needs to be compacted. The
> > > > > > > problem is either that
> > > > > > > 
> > > > > > >     - the page is skipped, leaving the physical memory block unmovable
> > > > > > > 
> > > > > > >     - the compaction is blocked for an unbounded time
> > > > > > > 
> > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > > > > > worse, the same thing happens on readahead, since the new page can be
> > > > > > > locked for an indeterminate amount of time, which can also block
> > > > > > > compaction, right?
> > > > > 
> > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
> > > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
> > > > > unmovable pages ever*. Not triggered by an untrusted source, not triggered
> > > > > by a trusted source.
> > > > > 
> > > > > It's a violation of core-mm principles.
> > > > 
> > > > The "must not be unmovable pages ever" is a very strong statement and we
> > > > are violating it today and will keep violating it in future. Any
> > > > page/folio under lock or writeback or have reference taken or have been
> > > > isolated from their LRU is unmovable (most of the time for small period
> > > > of time).
> > > 
> > > ^ this: "small period of time" is what I meant.
> > > 
> > > Most of these things are known to not be problematic: retrying a couple
> > > of times makes it work, that's why migration keeps retrying.
> > > 
> > > Again, as an example, we allow short-term O_DIRECT but disallow
> > > long-term page pinning. I think there were concerns at some point if
> > > O_DIRECT might also be problematic (I/O might take a while), but so far
> > > it was not a problem in practice that would make CMA allocations easily
> > > fail.
> > > 
> > > vmsplice() is a known problem, because it behaves like O_DIRECT but
> > > actually triggers long-term pinning; IIRC David Howells has this on his
> > > todo list to fix. [I recall that seccomp disallows vmsplice by default
> > > right now]
> > > 
> > > These operations are being done all over the place in kernel.
> > > > Miklos gave an example of readahead.
> > > 
> > > I assume you mean "unmovable for a short time", correct, or can you
> > > point me at that specific example; I think I missed that.

Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@mail.gmail.com/

> > > 
> > > > The per-CPU LRU caches are another
> > > > case where folios can get stuck for long period of time.
> > > 
> > > Which is why memory offlining disables the lru cache. See
> > > lru_cache_disable(). Other users that care about that drain the LRU on
> > > all cpus.
> > > 
> > > > Reclaim and
> > > > compaction can isolate a lot of folios that they need to have
> > > > too_many_isolated() checks. So, "must not be unmovable pages ever" is
> > > > impractical.
> > > 
> > > "must only be short-term unmovable", better?

Yes, and you have clarified the actual amount further below.

> > > 
> > 
> > Still a little ambiguous.
> > 
> > How short is "short-term"? Are we talking milliseconds or minutes?
> 
> Usually a couple of seconds, max. For memory offlining, slightly longer
> times are acceptable; other things (in particular compaction or CMA
> allocations) will give up much faster.
> 
> > 
> > Imposing a hard timeout on writeback requests to unprivileged FUSE
> > servers might give us a better guarantee of forward-progress, but it
> > would probably have to be on the order of at least a minute or so to be
> > workable.
> 
> Yes, and that might already be a bit too much, especially if stuck on
> waiting for folio writeback ... so ideally we could find a way to migrate
> these folios that are under writeback and it's not your ordinary disk driver
> that responds rather quickly.
> 
> Right now we do it via these temp pages, and I can see how that's
> undesirable.
> 
> For NFS etc. we probably never ran into this, because it's all used in
> fairly well managed environments and, well, I assume NFS easily outdates CMA
> and ZONE_MOVABLE :)
> 
> > >>>
> > > > The point is that, yes we should aim to improve things but in iterations
> > > > and "must not be unmovable pages ever" is not something we can achieve
> > > > in one step.
> > > 
> > > I agree with the "improve things in iterations", but as
> > > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we
> > > are making things worse.

AS_WRITEBACK_INDETERMINATE is really a bad name that we picked, as it is
still causing confusion. It is a simple flag to avoid a deadlock in the
reclaim code path and does not say anything about movability.

> > > 
> > > And as this discussion has been going on for too long, to summarize my
> > > point: there exist conditions where pages are short-term unmovable, and
> > > possibly some to be fixed that turn pages long-term unmovable (e.g.,
> > > vmsplice); that does not mean that we can freely add new conditions that
> > > turn movable pages unmovable long-term or even forever.
> > > 
> > > Again, this might be a good LSF/MM topic. If I would have the capacity I
> > > would suggest a topic around which things are known to cause pages to be
> > > short-term or long-term unmovable/unsplittable, and which can be
> > > handled, which not. Maybe I'll find the time to propose that as a topic.
> > > 
> > 
> > 
> > This does sound like great LSF/MM fodder! I predict that this session
> > will run long! ;)
> 
> Heh, fully agreed! :)

I would like a more targeted topic, and for that I want us to at least
agree on where we are disagreeing. Let me write down two statements and
please tell me where you disagree:

1. For a normally running FUSE server (without tmp pages), the lifetime
of the writeback state of fuse folios falls under the "short-term
unmovable" bucket, as it does not differ in any way from any other
filesystem handling writeback folios.

2. For a buggy or untrusted FUSE server (without tmp pages), the
lifetime of writeback state of fuse folios can be arbitrarily long and
we need some mechanism to limit it.



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-10 21:13                                                                         ` David Hildenbrand
  2025-01-10 22:00                                                                           ` Shakeel Butt
@ 2025-01-10 23:11                                                                           ` Jeff Layton
  1 sibling, 0 replies; 124+ messages in thread
From: Jeff Layton @ 2025-01-10 23:11 UTC (permalink / raw)
  To: David Hildenbrand, Shakeel Butt
  Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Fri, 2025-01-10 at 22:13 +0100, David Hildenbrand wrote:
> On 10.01.25 21:28, Jeff Layton wrote:
> > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
> > > On 07.01.25 19:07, Shakeel Butt wrote:
> > > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
> > > > > On 06.01.25 19:17, Shakeel Butt wrote:
> > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> > > > > > > > In any case, having movable pages be turned unmovable due to persistent
> > > > > > > > writeback is something that must be fixed, not worked around. Likely a
> > > > > > > > good topic for LSF/MM.
> > > > > > > 
> > > > > > > Yes, this seems a good cross fs-mm topic.
> > > > > > > 
> > > > > > > So the issue discussed here is that movable pages used for fuse
> > > > > > > page-cache cause problems when memory needs to be compacted. The
> > > > > > > problem is either that
> > > > > > > 
> > > > > > >     - the page is skipped, leaving the physical memory block unmovable
> > > > > > > 
> > > > > > >     - the compaction is blocked for an unbounded time
> > > > > > > 
> > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > > > > > worse, the same thing happens on readahead, since the new page can be
> > > > > > > locked for an indeterminate amount of time, which can also block
> > > > > > > compaction, right?
> > > > > 
> > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
> > > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
> > > > > unmovable pages ever*. Not triggered by an untrusted source, not triggered
> > > > > by a trusted source.
> > > > > 
> > > > > It's a violation of core-mm principles.
> > > > 
> > > > The "must not be unmovable pages ever" is a very strong statement and we
> > > > are violating it today and will keep violating it in future. Any
> > > > page/folio under lock or writeback or have reference taken or have been
> > > > isolated from their LRU is unmovable (most of the time for small period
> > > > of time).
> > > 
> > > ^ this: "small period of time" is what I meant.
> > > 
> > > Most of these things are known to not be problematic: retrying a couple
> > > of times makes it work, that's why migration keeps retrying.
> > > 
> > > Again, as an example, we allow short-term O_DIRECT but disallow
> > > long-term page pinning. I think there were concerns at some point if
> > > O_DIRECT might also be problematic (I/O might take a while), but so far
> > > it was not a problem in practice that would make CMA allocations easily
> > > fail.
> > > 
> > > vmsplice() is a known problem, because it behaves like O_DIRECT but
> > > actually triggers long-term pinning; IIRC David Howells has this on his
> > > todo list to fix. [I recall that seccomp disallows vmsplice by default
> > > right now]
> > > 
> > > These operations are being done all over the place in kernel.
> > > > Miklos gave an example of readahead.
> > > 
> > > I assume you mean "unmovable for a short time", correct, or can you
> > > point me at that specific example; I think I missed that.
> > > 
> > > > The per-CPU LRU caches are another
> > > > case where folios can get stuck for long period of time.
> > > 
> > > Which is why memory offlining disables the lru cache. See
> > > lru_cache_disable(). Other users that care about that drain the LRU on
> > > all cpus.
> > > 
> > > > Reclaim and
> > > > compaction can isolate a lot of folios that they need to have
> > > > too_many_isolated() checks. So, "must not be unmovable pages ever" is
> > > > impractical.
> > > 
> > > "must only be short-term unmovable", better?
> > > 
> > 
> > Still a little ambiguous.
> > 
> > How short is "short-term"? Are we talking milliseconds or minutes?
> 
> Usually a couple of seconds, max. For memory offlining, slightly longer 
> times are acceptable; other things (in particular compaction or CMA 
> allocations) will give up much faster.
> 
> > 
> > Imposing a hard timeout on writeback requests to unprivileged FUSE
> > servers might give us a better guarantee of forward-progress, but it
> > would probably have to be on the order of at least a minute or so to be
> > workable.
> 
> Yes, and that might already be a bit too much, especially if stuck on 
> waiting for folio writeback ... so ideally we could find a way to 
> migrate these folios that are under writeback and it's not your ordinary 
> disk driver that responds rather quickly.
> 

That would be ideal I think. One thought:

In practice, a lot of these writeback handlers use the folio up front
and then don't need to touch it again afterward until the reply comes
in and they clear the writeback bit.

Maybe we could add a mechanism where the writeback handlers could mark
the folio as being movable after the first phase was done? When the
reply comes in, they would clear that mark and check whether it's been
moved in the interim, and fix up the appropriate pointers if so?

Implementing that sounds a bit complex though since it's effectively a
new locking scheme.
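
To make the idea a bit more concrete, here is a purely illustrative
userspace toy model, not kernel code. Every name in it (toy_folio,
toy_mapping, wb_movable, toy_migrate() and so on) is hypothetical and
does not correspond to an existing kernel API. It models the page cache
as a small slot array and deliberately ignores locking, which is exactly
the hard part mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 64
#define TOY_NR_SLOTS  4

struct toy_folio {
	unsigned char data[TOY_PAGE_SIZE];
	bool writeback;   /* models the writeback bit */
	bool wb_movable;  /* hypothetical mark: payload sent, folio may move */
};

/* Toy "page cache": index -> folio currently backing that index. */
struct toy_mapping {
	struct toy_folio *slots[TOY_NR_SLOTS];
};

struct toy_wb_request {
	struct toy_mapping *mapping;
	size_t index;
	struct toy_folio *folio;  /* cached pointer; may go stale after a move */
};

/* Start writeback: the folio is pinned in place until the payload is sent. */
void toy_wb_begin(struct toy_wb_request *req, struct toy_mapping *m, size_t index)
{
	assert(index < TOY_NR_SLOTS && m->slots[index] != NULL);
	req->mapping = m;
	req->index = index;
	req->folio = m->slots[index];
	req->folio->writeback = true;
	req->folio->wb_movable = false;
}

/* First phase done: the data was consumed, so the folio may be moved. */
void toy_wb_payload_sent(struct toy_wb_request *req, unsigned char *wire)
{
	memcpy(wire, req->folio->data, TOY_PAGE_SIZE);
	req->folio->wb_movable = true;
}

/* Migration refuses folios under writeback unless they carry the mark. */
bool toy_migrate(struct toy_mapping *m, size_t index, struct toy_folio *dst)
{
	struct toy_folio *src = m->slots[index];

	if (src->writeback && !src->wb_movable)
		return false;
	*dst = *src;                   /* copies data and both flags */
	m->slots[index] = dst;
	memset(src, 0, sizeof(*src));  /* old folio is free for reuse */
	return true;
}

/* Reply arrived: clear the mark, fix up a stale pointer, end writeback. */
bool toy_wb_end(struct toy_wb_request *req)
{
	struct toy_folio *cur = req->mapping->slots[req->index];
	bool moved = (cur != req->folio);

	if (moved)
		req->folio = cur;
	cur->wb_movable = false;
	cur->writeback = false;
	return moved;  /* true if the folio was migrated in the interim */
}
```

In this model a migration attempt between toy_wb_begin() and
toy_wb_payload_sent() fails, while one made after the payload was sent
succeeds, and toy_wb_end() then finds and clears the writeback state on
the new folio. A real implementation would have to serialize the mark,
the migration and the reply against each other.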


> Right now we do it via these temp pages, and I can see how that's 
> undesirable.
> 
> For NFS etc. we probably never ran into this, because it's all used in 
> fairly well managed environments and, well, I assume NFS easily outdates 
> CMA and ZONE_MOVABLE :)
>
>  > >>>
> > > > The point is that, yes we should aim to improve things but in iterations
> > > > and "must not be unmovable pages ever" is not something we can achieve
> > > > in one step.
> > > 
> > > I agree with the "improve things in iterations", but as
> > > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we
> > > are making things worse.
> > > 
> > > And as this discussion has been going on for too long, to summarize my
> > > point: there exist conditions where pages are short-term unmovable, and
> > > possibly some to be fixed that turn pages long-term unmovable (e.g.,
> > > vmsplice); that does not mean that we can freely add new conditions that
> > > turn movable pages unmovable long-term or even forever.
> > > 
> > > Again, this might be a good LSF/MM topic. If I would have the capacity I
> > > would suggest a topic around which things are known to cause pages to be
> > > short-term or long-term unmovable/unsplittable, and which can be
> > > handled, which not. Maybe I'll find the time to propose that as a topic.
> > > 
> > 
> > 
> > This does sound like great LSF/MM fodder! I predict that this session
> > will run long! ;)
> 
> Heh, fully agreed! :)
> 

-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-10 22:00                                                                           ` Shakeel Butt
@ 2025-01-13 15:27                                                                             ` David Hildenbrand
  2025-01-13 21:44                                                                               ` Jeff Layton
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2025-01-13 15:27 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jeff Layton, Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On 10.01.25 23:00, Shakeel Butt wrote:
> On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote:
>> On 10.01.25 21:28, Jeff Layton wrote:
>>> On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
>>>> On 07.01.25 19:07, Shakeel Butt wrote:
>>>>> On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
>>>>>> On 06.01.25 19:17, Shakeel Butt wrote:
>>>>>>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
>>>>>>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>> In any case, having movable pages be turned unmovable due to persistent
>>>>>>>>> writeback is something that must be fixed, not worked around. Likely a
>>>>>>>>> good topic for LSF/MM.
>>>>>>>>
>>>>>>>> Yes, this seems a good cross fs-mm topic.
>>>>>>>>
>>>>>>>> So the issue discussed here is that movable pages used for fuse
>>>>>>>> page-cache cause problems when memory needs to be compacted. The
>>>>>>>> problem is either that
>>>>>>>>
>>>>>>>>      - the page is skipped, leaving the physical memory block unmovable
>>>>>>>>
>>>>>>>>      - the compaction is blocked for an unbounded time
>>>>>>>>
>>>>>>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things
>>>>>>>> worse, the same thing happens on readahead, since the new page can be
>>>>>>>> locked for an indeterminate amount of time, which can also block
>>>>>>>> compaction, right?
>>>>>>
>>>>>> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
>>>>>> pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
>>>>>> unmovable pages ever*. Not triggered by an untrusted source, not triggered
>>>>>> by a trusted source.
>>>>>>
>>>>>> It's a violation of core-mm principles.
>>>>>
>>>>> The "must not be unmovable pages ever" is a very strong statement and we
>>>>> are violating it today and will keep violating it in future. Any
>>>>> page/folio under lock or writeback or have reference taken or have been
>>>>> isolated from their LRU is unmovable (most of the time for small period
>>>>> of time).
>>>>
>>>> ^ this: "small period of time" is what I meant.
>>>>
>>>> Most of these things are known to not be problematic: retrying a couple
>>>> of times makes it work, that's why migration keeps retrying.
>>>>
>>>> Again, as an example, we allow short-term O_DIRECT but disallow
>>>> long-term page pinning. I think there were concerns at some point if
>>>> O_DIRECT might also be problematic (I/O might take a while), but so far
>>>> it was not a problem in practice that would make CMA allocations easily
>>>> fail.
>>>>
>>>> vmsplice() is a known problem, because it behaves like O_DIRECT but
>>>> actually triggers long-term pinning; IIRC David Howells has this on his
>>>> todo list to fix. [I recall that seccomp disallows vmsplice by default
>>>> right now]
>>>>
>>>> These operations are being done all over the place in kernel.
>>>>> Miklos gave an example of readahead.
>>>>
>>>> I assume you mean "unmovable for a short time", correct, or can you
>>>> point me at that specific example; I think I missed that.
> 
> Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@mail.gmail.com/
> 
>>>>
>>>>> The per-CPU LRU caches are another
>>>>> case where folios can get stuck for long period of time.
>>>>
>>>> Which is why memory offlining disables the lru cache. See
>>>> lru_cache_disable(). Other users that care about that drain the LRU on
>>>> all cpus.
>>>>
>>>>> Reclaim and
>>>>> compaction can isolate a lot of folios that they need to have
>>>>> too_many_isolated() checks. So, "must not be unmovable pages ever" is
>>>>> impractical.
>>>>
>>>> "must only be short-term unmovable", better?
> 
> Yes and you have clarified further below of the actual amount.
> 
>>>>
>>>
>>> Still a little ambiguous.
>>>
>>> How short is "short-term"? Are we talking milliseconds or minutes?
>>
>> Usually a couple of seconds, max. For memory offlining, slightly longer
>> times are acceptable; other things (in particular compaction or CMA
>> allocations) will give up much faster.
>>
>>>
>>> Imposing a hard timeout on writeback requests to unprivileged FUSE
>>> servers might give us a better guarantee of forward-progress, but it
>>> would probably have to be on the order of at least a minute or so to be
>>> workable.
>>
>> Yes, and that might already be a bit too much, especially if stuck on
>> waiting for folio writeback ... so ideally we could find a way to migrate
>> these folios that are under writeback and it's not your ordinary disk driver
>> that responds rather quickly.
>>
>> Right now we do it via these temp pages, and I can see how that's
>> undesirable.
>>
>> For NFS etc. we probably never ran into this, because it's all used in
>> fairly well managed environments and, well, I assume NFS easily outdates CMA
>> and ZONE_MOVABLE :)
>>
>>>>>>
>>>>> The point is that, yes we should aim to improve things but in iterations
>>>>> and "must not be unmovable pages ever" is not something we can achieve
>>>>> in one step.
>>>>
>>>> I agree with the "improve things in iterations", but as
>>>> AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we
>>>> are making things worse.
> 
> AS_WRITEBACK_INDETERMINATE is really a bad name we picked as it is still
> causing confusion. It is a simple flag to avoid deadlock in the reclaim
> code path and does not say anything about movability.
> 
>>>>
>>>> And as this discussion has been going on for too long, to summarize my
>>>> point: there exist conditions where pages are short-term unmovable, and
>>>> possibly some to be fixed that turn pages long-term unmovable (e.g.,
>>>> vmsplice); that does not mean that we can freely add new conditions that
>>>> turn movable pages unmovable long-term or even forever.
>>>>
>>>> Again, this might be a good LSF/MM topic. If I would have the capacity I
>>>> would suggest a topic around which things are known to cause pages to be
>>>> short-term or long-term unmovable/unsplittable, and which can be
>>>> handled, which not. Maybe I'll find the time to propose that as a topic.
>>>>
>>>
>>>
>>> This does sound like great LSF/MM fodder! I predict that this session
>>> will run long! ;)
>>
>> Heh, fully agreed! :)
> 
> I would like a more targeted topic, and for that I want us to at least
> agree on where we are disagreeing. Let me write down two statements and
> please tell me where you disagree:

I think we're mostly in agreement!

> 
> 1. For a normally running FUSE server (without tmp pages), the lifetime
> of the writeback state of fuse folios falls under the "short-term
> unmovable" bucket, as it does not differ in any way from any other
> filesystem handling writeback folios.

That's the expectation, yes. As long as the FUSE server is able to make
progress, the expectation is that it's just like NFS etc. If it isn't
able to make progress (i.e., it crashed), the expectation is that
everything will get cleaned up either way.

I wonder if there could be a valid scenario where the FUSE server is no
longer able to make progress (ignoring network outages), or where progress
becomes so slow that it turns into a problem. In contrast to in-kernel
FSs, one can do some fancy stuff with fuse where writing a page could
possibly consume a lot of memory in user space. Likely, in this case we
might just blame it on the admin who agreed to run this (trusted) fuse
server.

> 
> 2. For a buggy or untrusted FUSE server (without tmp pages), the
> lifetime of writeback state of fuse folios can be arbitrarily long and
> we need some mechanism to limit it.

Yes.


Especially in 1), we really want to wait for writeback to finish, just
like for any other filesystem. For 2), we want a way to keep writeback
from getting stuck for a long time, so that we can make progress and
migrate these pages.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-13 15:27                                                                             ` David Hildenbrand
@ 2025-01-13 21:44                                                                               ` Jeff Layton
  2025-01-14  8:38                                                                                 ` Miklos Szeredi
  0 siblings, 1 reply; 124+ messages in thread
From: Jeff Layton @ 2025-01-13 21:44 UTC (permalink / raw)
  To: David Hildenbrand, Shakeel Butt
  Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Mon, 2025-01-13 at 16:27 +0100, David Hildenbrand wrote:
> On 10.01.25 23:00, Shakeel Butt wrote:
> > On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote:
> > > On 10.01.25 21:28, Jeff Layton wrote:
> > > > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
> > > > > On 07.01.25 19:07, Shakeel Butt wrote:
> > > > > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
> > > > > > > On 06.01.25 19:17, Shakeel Butt wrote:
> > > > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> > > > > > > > > > In any case, having movable pages be turned unmovable due to persistent
> > > > > > > > > > writeback is something that must be fixed, not worked around. Likely a
> > > > > > > > > > good topic for LSF/MM.
> > > > > > > > > 
> > > > > > > > > Yes, this seems a good cross fs-mm topic.
> > > > > > > > > 
> > > > > > > > > So the issue discussed here is that movable pages used for fuse
> > > > > > > > > page-cache cause problems when memory needs to be compacted. The
> > > > > > > > > problem is either that
> > > > > > > > > 
> > > > > > > > >      - the page is skipped, leaving the physical memory block unmovable
> > > > > > > > > 
> > > > > > > > >      - the compaction is blocked for an unbounded time
> > > > > > > > > 
> > > > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > > > > > > > worse, the same thing happens on readahead, since the new page can be
> > > > > > > > > locked for an indeterminate amount of time, which can also block
> > > > > > > > > compaction, right?
> > > > > > > 
> > > > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
> > > > > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
> > > > > > > unmovable pages ever*. Not triggered by an untrusted source, not triggered
> > > > > > > by a trusted source.
> > > > > > > 
> > > > > > > It's a violation of core-mm principles.
> > > > > > 
> > > > > > The "must not be unmovable pages ever" is a very strong statement and we
> > > > > > are violating it today and will keep violating it in future. Any
> > > > > > page/folio under lock or writeback or have reference taken or have been
> > > > > > isolated from their LRU is unmovable (most of the time for small period
> > > > > > of time).
> > > > > 
> > > > > ^ this: "small period of time" is what I meant.
> > > > > 
> > > > > Most of these things are known to not be problematic: retrying a couple
> > > > > of times makes it work, that's why migration keeps retrying.
> > > > > 
> > > > > Again, as an example, we allow short-term O_DIRECT but disallow
> > > > > long-term page pinning. I think there were concerns at some point if
> > > > > O_DIRECT might also be problematic (I/O might take a while), but so far
> > > > > it was not a problem in practice that would make CMA allocations easily
> > > > > fail.
> > > > > 
> > > > > vmsplice() is a known problem, because it behaves like O_DIRECT but
> > > > > actually triggers long-term pinning; IIRC David Howells has this on his
> > > > > todo list to fix. [I recall that seccomp disallows vmsplice by default
> > > > > right now]
> > > > > 
> > > > > These operations are being done all over the place in kernel.
> > > > > > Miklos gave an example of readahead.
> > > > > 
> > > > > I assume you mean "unmovable for a short time", correct, or can you
> > > > > point me at that specific example; I think I missed that.
> > 
> > Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@mail.gmail.com/
> > 
> > > > > 
> > > > > > The per-CPU LRU caches are another
> > > > > > > case where folios can get stuck for a long period of time.
> > > > > 
> > > > > Which is why memory offlining disables the lru cache. See
> > > > > lru_cache_disable(). Other users that care about that drain the LRU on
> > > > > all cpus.
> > > > > 
> > > > > > Reclaim and
> > > > > > > compaction can isolate so many folios that they need
> > > > > > > too_many_isolated() checks. So, "must not be unmovable pages ever" is
> > > > > > impractical.
> > > > > 
> > > > > "must only be short-term unmovable", better?
> > 
> > Yes, and you have clarified the actual amount further below.
> > 
> > > > > 
> > > > 
> > > > Still a little ambiguous.
> > > > 
> > > > How short is "short-term"? Are we talking milliseconds or minutes?
> > > 
> > > Usually a couple of seconds, max. For memory offlining, slightly longer
> > > times are acceptable; other things (in particular compaction or CMA
> > > allocations) will give up much faster.
> > > 
> > > > 
> > > > Imposing a hard timeout on writeback requests to unprivileged FUSE
> > > > servers might give us a better guarantee of forward-progress, but it
> > > > would probably have to be on the order of at least a minute or so to be
> > > > workable.
> > > 
> > > Yes, and that might already be a bit too much, especially if stuck on
> > > waiting for folio writeback ... so ideally we could find a way to migrate
> > > these folios while they are under writeback when the backend is not your
> > > ordinary disk driver that responds rather quickly.
> > > 
> > > Right now we do it via these temp pages, and I can see how that's
> > > undesirable.
> > > 
> > > For NFS etc. we probably never ran into this, because it's all used in
> > > fairly well managed environments and, well, I assume NFS easily outdates CMA
> > > and ZONE_MOVABLE :)
> > > 
> > > > > > > 
> > > > > > The point is that, yes we should aim to improve things but in iterations
> > > > > > and "must not be unmovable pages ever" is not something we can achieve
> > > > > > in one step.
> > > > > 
> > > > > I agree with the "improve things in iterations", but as
> > > > > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we
> > > > > are making things worse.
> > 
> > AS_WRITEBACK_INDETERMINATE is really a bad name we picked as it is still
> > causing confusion. It is a simple flag to avoid deadlock in the reclaim
> > code path and does not say anything about movability.
> > 
> > > > > 
> > > > > And as this discussion has been going on for too long, to summarize my
> > > > > point: there exist conditions where pages are short-term unmovable, and
> > > > > possibly some to be fixed that turn pages long-term unmovable (e.g.,
> > > > > vmsplice); that does not mean that we can freely add new conditions that
> > > > > turn movable pages unmovable long-term or even forever.
> > > > > 
> > > > > Again, this might be a good LSF/MM topic. If I would have the capacity I
> > > > > > would suggest a topic around which things are known to cause pages to be
> > > > > short-term or long-term unmovable/unsplittable, and which can be
> > > > > handled, which not. Maybe I'll find the time to propose that as a topic.
> > > > > 
> > > > 
> > > > 
> > > > This does sound like great LSF/MM fodder! I predict that this session
> > > > will run long! ;)
> > > 
> > > Heh, fully agreed! :)
> > 
> > I would like a more targeted topic, and for that I want us to at least
> > agree on where we are disagreeing. Let me write down two statements and
> > please tell me where you disagree:
> 
> I think we're mostly in agreement!
> 
> > 
> > 1. For a normally running FUSE server (without tmp pages), the lifetime of
> > the writeback state of fuse folios falls under the "short-term unmovable"
> > bucket, as it does not differ in any way from any other filesystem
> > handling writeback folios.
> 
> That's the expectation, yes. As long as the FUSE server is able to make
> progress, the expectation is that it's just like NFS etc. If it isn't
> able to make progress (e.g., after a crash), the expectation is that
> everything will get cleaned up either way.
> 
> I wonder if there could be a valid scenario where the FUSE server is no
> longer able to make progress (ignoring network outages), or the progress
> might start being extremely slow such that it becomes a problem. In
> contrast to in-kernel FSs, one can do some fancy stuff with fuse where
> writing a page could possibly consume a lot of memory in user-space.
> Likely, in this case we might just blame it on the admin that agreed to
> running this (trusted) fuse server.
> 
> > 
> > 2. For a buggy or untrusted FUSE server (without tmp pages), the
> > lifetime of writeback state of fuse folios can be arbitrarily long and
> > we need some mechanism to limit it.
> 
> Yes.
> 
> 
> Especially in 1), we really want to wait for writeback to finish, just
> like for any other filesystem. For 2), we want a way to keep writeback
> from getting stuck for a long time, so that we are able to make progress
> and migrate these pages.
> 

What if we were to allow the kernel to kill off an unprivileged FUSE
server that was "misbehaving" [1], clean any dirty pagecache pages that
it has, and set writeback errors on the corresponding FUSE inodes [2]?
We'd still need a rather long timeout (on the order of at least a
minute or so, by default).

Would that be enough to assuage concerns about unprivileged servers
pinning pages indefinitely? Buggy servers are still a problem, but
there's not much we can do about that.

There are a lot of details we'd have to sort out, so I'm also
interested in whether anyone (Miklos? Bernd?) would find this basic
approach objectionable.

[1]: for some definition of misbehavior. Probably a writeback
timeout of some sort but maybe there would be other criteria too.

[2]: or maybe just make them eligible to be cleaned without talking to
the server, should the VM wish it.
-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-13 21:44                                                                               ` Jeff Layton
@ 2025-01-14  8:38                                                                                 ` Miklos Szeredi
  2025-01-14  9:40                                                                                   ` Miklos Szeredi
  2025-01-14 15:44                                                                                   ` Jeff Layton
  0 siblings, 2 replies; 124+ messages in thread
From: Miklos Szeredi @ 2025-01-14  8:38 UTC (permalink / raw)
  To: Jeff Layton
  Cc: David Hildenbrand, Shakeel Butt, Joanne Koong, Bernd Schubert,
	Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Mon, 13 Jan 2025 at 22:44, Jeff Layton <jlayton@kernel.org> wrote:

> What if we were to allow the kernel to kill off an unprivileged FUSE
> server that was "misbehaving" [1], clean any dirty pagecache pages that
> it has, and set writeback errors on the corresponding FUSE inodes [2]?
> We'd still need a rather long timeout (on the order of at least a
> minute or so, by default).

How would this be different from Joanne's current request timeout patch?

I think it makes sense, but it *has* to be opt in, for the same reason
that NFS soft timeout is opt in, so it can't really solve the page
migration issue generally.

Also page reading has exactly the same issues, so fixing writeback is
not enough.

Maybe an explicit callback from the migration code to the filesystem
would work. I.e. move the complexity of dealing with migration for
problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
not sure how this would actually look, as I'm unfamiliar with the
details of page migration, but I guess it shouldn't be too difficult
to implement for fuse at least.

Thanks,
Miklos

>
> Would that be enough to assuage concerns about unprivileged servers
> pinning pages indefinitely? Buggy servers are still a problem, but
> there's not much we can do about that.
>
> There are a lot of details we'd have to sort out, so I'm also
> interested in whether anyone (Miklos? Bernd?) would find this basic
> approach objectionable.
>
> [1]: for some definition of misbehavior. Probably a writeback
> timeout of some sort but maybe there would be other criteria too.
>
> [2]: or maybe just make them eligible to be cleaned without talking to
> the server, should the VM wish it.
> --
> Jeff Layton <jlayton@kernel.org>



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14  8:38                                                                                 ` Miklos Szeredi
@ 2025-01-14  9:40                                                                                   ` Miklos Szeredi
  2025-01-14  9:55                                                                                     ` Bernd Schubert
  2025-01-14 15:49                                                                                     ` Jeff Layton
  2025-01-14 15:44                                                                                   ` Jeff Layton
  1 sibling, 2 replies; 124+ messages in thread
From: Miklos Szeredi @ 2025-01-14  9:40 UTC (permalink / raw)
  To: Jeff Layton
  Cc: David Hildenbrand, Shakeel Butt, Joanne Koong, Bernd Schubert,
	Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote:

> Maybe an explicit callback from the migration code to the filesystem
> would work. I.e. move the complexity of dealing with migration for
> problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
> not sure how this would actually look, as I'm unfamiliar with the
> details of page migration, but I guess it shouldn't be too difficult
> to implement for fuse at least.

Thinking a bit...

1) reading pages

Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to
->readpages(), which may make the pages uptodate asynchronously.  If a
page is unlocked but not set uptodate, then the caller is supposed to
retry the reading, at least that's how I interpret
filemap_get_pages().  This means that it's fine to migrate the page
before it's actually filled with data, since the caller will retry.
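
The retry rule described above can be sketched as a small user-space toy
model. This is only an illustration under stated assumptions: struct
toy_page, toy_issue_read() and toy_read_page() are hypothetical names, not
kernel API, and the "read" completes synchronously for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a page cache page. */
struct toy_page {
	bool locked;    /* models PG_locked */
	bool uptodate;  /* models PG_uptodate */
	int  fills;     /* number of reads issued so far */
};

/*
 * Hypothetical read hook: lock the page, "complete" the read, unlock.
 * With fail_first set, the first read unlocks the page without marking
 * it uptodate, which models a page that went away (e.g. was migrated)
 * before it was filled with data.
 */
void toy_issue_read(struct toy_page *p, bool fail_first)
{
	p->locked = true;
	p->fills++;
	p->uptodate = !(fail_first && p->fills == 1);
	p->locked = false;
}

/* Returns the number of attempts needed, or -1 after max_tries failures. */
int toy_read_page(struct toy_page *p, bool fail_first, int max_tries)
{
	assert(max_tries > 0);

	for (int attempt = 1; attempt <= max_tries; attempt++) {
		toy_issue_read(p, fail_first);
		if (!p->locked && p->uptodate)
			return attempt;
		/* Unlocked but not uptodate: the caller retries the read. */
	}
	return -1;
}
```

Calling toy_read_page() with fail_first set on a zeroed page succeeds on
the second attempt, which is the property that would make early migration
harmless for readers in this model.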

It also means that it would be sufficient to allocate the page itself
just before filling it in, if there was a mechanism to keep track of
these "not yet filled" pages.  But that's probably off topic.

2) writing pages

When the page isn't actually being copied, the writeback could be
cancelled and the page redirtied.  At which point it's fine to migrate
it.  The problem is with pages that are spliced from /dev/fuse, where
control over when they are being accessed is lost.  Note: this is not
actually done right now on cached pages, since writeback always copies
to temp pages.  So we can continue to do that when doing a splice and
not risk any performance regressions.
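
A minimal sketch of the "cancel writeback and redirty" idea, again as a
user-space toy model. Everything here is an assumption made for
illustration: struct toy_folio, its being_copied and spliced fields, and
toy_prepare_migrate() do not exist in the kernel, and real code would need
locking and a way to tell the server that the request was cancelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified folio state. */
struct toy_folio {
	bool dirty;
	bool writeback;     /* models the folio writeback flag */
	bool being_copied;  /* data is currently being copied to the server */
	bool spliced;       /* folio was handed out via splice */
};

/*
 * Returns 0 if the folio may be migrated now, -EBUSY otherwise.
 * A folio under writeback is made migratable by cancelling the
 * writeback and redirtying it, unless the data is being copied right
 * now or the folio was spliced (control over access is lost).
 */
int toy_prepare_migrate(struct toy_folio *f)
{
	assert(f);

	if (!f->writeback)
		return 0;
	if (f->being_copied || f->spliced)
		return -EBUSY;

	f->writeback = false;  /* cancel the in-flight writeback */
	f->dirty = true;       /* redirty so it gets written back later */
	return 0;
}
```

In this model a spliced folio stays unmovable, which is why the splice
path would keep using temp pages.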

Am I missing something?

Thanks,
Miklos



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14  9:40                                                                                   ` Miklos Szeredi
@ 2025-01-14  9:55                                                                                     ` Bernd Schubert
  2025-01-14 10:07                                                                                       ` Miklos Szeredi
  2025-01-14 15:49                                                                                     ` Jeff Layton
  1 sibling, 1 reply; 124+ messages in thread
From: Bernd Schubert @ 2025-01-14  9:55 UTC (permalink / raw)
  To: Miklos Szeredi, Jeff Layton
  Cc: David Hildenbrand, Shakeel Butt, Joanne Koong, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko



On 1/14/25 10:40, Miklos Szeredi wrote:
> On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
>> Maybe an explicit callback from the migration code to the filesystem
>> would work. I.e. move the complexity of dealing with migration for
>> problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
>> not sure how this would actually look, as I'm unfamiliar with the
>> details of page migration, but I guess it shouldn't be too difficult
>> to implement for fuse at least.
> 
> Thinking a bit...
> 
> 1) reading pages
> 
> Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to
> ->readpages(), which may make the pages uptodate asynchronously.  If a
> page is unlocked but not set uptodate, then caller is supposed to
> retry the reading, at least that's how I interpret
> filemap_get_pages().   This means that it's fine to migrate the page
> before it's actually filled with data, since the caller will retry.
> 
> It also means that it would be sufficient to allocate the page itself
> just before filling it in, if there was a mechanism to keep track of
> these "not yet filled" pages.  But that's probably off topic.

With /dev/fuse buffer copies, this should be easy: just allocate the page
on buffer copy, since control is in libfuse. With splice you really need
a page state.

> 
> 2) writing pages
> 
> When the page isn't actually being copied, the writeback could be
> cancelled and the page redirtied.  At which point it's fine to migrate
> it.  The problem is with pages that are spliced from /dev/fuse, where
> control over when they are being accessed is lost.  Note: this is not
> actually done right now on cached pages, since writeback always copies
> to temp pages.  So we can continue to do that when doing a splice and
> not risk any performance regressions.
> 

I wrote this before already - what is the advantage of a tmp page copy
over /dev/fuse buffer copy? I.e. I wonder if we need splice at all here.



Thanks,
Bernd





* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14  9:55                                                                                     ` Bernd Schubert
@ 2025-01-14 10:07                                                                                       ` Miklos Szeredi
  2025-01-14 18:07                                                                                         ` Joanne Koong
  2025-01-14 20:51                                                                                         ` Joanne Koong
  0 siblings, 2 replies; 124+ messages in thread
From: Miklos Szeredi @ 2025-01-14 10:07 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jeff Layton, David Hildenbrand, Shakeel Butt, Joanne Koong,
	Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, 14 Jan 2025 at 10:55, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>
>
>
> On 1/14/25 10:40, Miklos Szeredi wrote:
> > On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> >> Maybe an explicit callback from the migration code to the filesystem
> >> would work. I.e. move the complexity of dealing with migration for
> >> problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
> >> not sure how this would actually look, as I'm unfamiliar with the
> >> details of page migration, but I guess it shouldn't be too difficult
> >> to implement for fuse at least.
> >
> > Thinking a bit...
> >
> > 1) reading pages
> >
> > Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to
> > ->readpages(), which may make the pages uptodate asynchronously.  If a
> > page is unlocked but not set uptodate, then caller is supposed to
> > retry the reading, at least that's how I interpret
> > filemap_get_pages().   This means that it's fine to migrate the page
> > before it's actually filled with data, since the caller will retry.
> >
> > It also means that it would be sufficient to allocate the page itself
> > just before filling it in, if there was a mechanism to keep track of
> > > these "not yet filled" pages.  But that's probably off topic.
>
> With /dev/fuse buffer copies should be easy - just allocate the page
> on buffer copy, control is in libfuse.

I think the issue is with generic page cache code, which currently
relies on the PG_locked flag on the allocated but not yet filled page.
If the generic code were able to keep track of "under construction"
ranges without relying on an allocated page, then the filesystem could
allocate the page just before copying the data, insert the page into
the cache, and mark the relevant portion of the file uptodate.

> With splice you really need
> a page state.

It's not possible to splice a not-uptodate page.

> I wrote this before already - what is the advantage of a tmp page copy
> over /dev/fuse buffer copy? I.e. I wonder if we need splice at all here.

Splice seems a dead end, but we probably need to continue supporting
it for a while for backward compatibility.

Thanks,
Miklos



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14  8:38                                                                                 ` Miklos Szeredi
  2025-01-14  9:40                                                                                   ` Miklos Szeredi
@ 2025-01-14 15:44                                                                                   ` Jeff Layton
  2025-01-14 18:58                                                                                     ` Joanne Koong
  1 sibling, 1 reply; 124+ messages in thread
From: Jeff Layton @ 2025-01-14 15:44 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Hildenbrand, Shakeel Butt, Joanne Koong, Bernd Schubert,
	Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, 2025-01-14 at 09:38 +0100, Miklos Szeredi wrote:
> On Mon, 13 Jan 2025 at 22:44, Jeff Layton <jlayton@kernel.org> wrote:
> 
> > What if we were to allow the kernel to kill off an unprivileged FUSE
> > server that was "misbehaving" [1], clean any dirty pagecache pages that
> > it has, and set writeback errors on the corresponding FUSE inodes [2]?
> > We'd still need a rather long timeout (on the order of at least a
> > minute or so, by default).
> 
> How would this be different from Joanne's current request timeout patch?
> 

When the timeout pops with Joanne's set, the pages still remain dirty
(IIUC). The idea here would be that after a call times out and we've
decided the server is "misbehaving", we'd want to clean the pages and
mark the inode with a writeback error. That frees up the page to be
migrated, but a later msync or fsync should return an error. This is
the standard behavior for writeback errors on filesystems.

> I think it makes sense, but it *has* to be opt in, for the same reason
> that NFS soft timeout is opt in, so it can't really solve the page
> migration issue generally.
> 

Does it really need to be though? We're talking unprivileged mounts
here. Imposing a hard timeout on reads or writes as a mechanism to
limit resource consumption by an unprivileged user seems like a
reasonable thing to do. Writeback errors suck, but what other recourse
do we have in this situation?

We could also consider only enforcing this when memory gets low, or a
migration has failed.

> Also page reading has exactly the same issues, so fixing writeback is
> not enough.
> 

Reads are synchronous, so we could just return an error directly on
those.

> Maybe an explicit callback from the migration code to the filesystem
> would work. I.e. move the complexity of dealing with migration for
> problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
> not sure how this would actually look, as I'm unfamiliar with the
> details of page migration, but I guess it shouldn't be too difficult
> to implement for fuse at least.
> 

We already have a ->migrate_folio operation. Maybe we could consider
pushing down the PG_writeback check into the ->migrate_folio ops? As an
initial step, we could just make them all return -EBUSY, and then allow
some (like FUSE) to handle the situation properly.
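
As a rough sketch of what pushing the check down could look like, here is
a user-space toy model. All names (struct toy_aops, toy_default_migrate(),
toy_fuse_migrate(), toy_migrate()) are hypothetical; the real
->migrate_folio operation takes more arguments, and what "handling the
situation properly" would mean for FUSE is an open question. The toy FUSE
op simply assumes writeback can be cancelled and the folio redirtied.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_folio;

/* Hypothetical ops table; only models the migrate hook. */
struct toy_aops {
	int (*migrate_folio)(struct toy_folio *src);
};

struct toy_folio {
	bool writeback;
	bool dirty;
	const struct toy_aops *aops;
};

/* Initial step: every filesystem refuses folios under writeback. */
int toy_default_migrate(struct toy_folio *src)
{
	return src->writeback ? -EBUSY : 0;
}

/* Assumed FUSE-like behaviour: cancel writeback, redirty, then allow. */
int toy_fuse_migrate(struct toy_folio *src)
{
	if (src->writeback) {
		src->writeback = false;
		src->dirty = true;
	}
	return 0;
}

const struct toy_aops toy_default_aops = { .migrate_folio = toy_default_migrate };
const struct toy_aops toy_fuse_aops = { .migrate_folio = toy_fuse_migrate };

/* Generic code no longer inspects writeback; it asks the filesystem. */
int toy_migrate(struct toy_folio *src)
{
	assert(src);

	if (src->aops == NULL || src->aops->migrate_folio == NULL)
		return -EINVAL;  /* toy choice, not what the kernel does */
	return src->aops->migrate_folio(src);
}
```

With this split, filesystems that never opt in keep the current behaviour
(-EBUSY under writeback), while a single filesystem can change its own
policy without touching the generic path.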
-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14  9:40                                                                                   ` Miklos Szeredi
  2025-01-14  9:55                                                                                     ` Bernd Schubert
@ 2025-01-14 15:49                                                                                     ` Jeff Layton
  2025-01-24 12:29                                                                                       ` David Hildenbrand
  1 sibling, 1 reply; 124+ messages in thread
From: Jeff Layton @ 2025-01-14 15:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Hildenbrand, Shakeel Butt, Joanne Koong, Bernd Schubert,
	Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, 2025-01-14 at 10:40 +0100, Miklos Szeredi wrote:
> On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > Maybe an explicit callback from the migration code to the filesystem
> > would work. I.e. move the complexity of dealing with migration for
> > problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
> > not sure how this would actually look, as I'm unfamiliar with the
> > details of page migration, but I guess it shouldn't be too difficult
> > to implement for fuse at least.
> 
> Thinking a bit...
> 
> 1) reading pages
> 
> Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to
> ->readpages(), which may make the pages uptodate asynchronously.  If a
> page is unlocked but not set uptodate, then caller is supposed to
> retry the reading, at least that's how I interpret
> filemap_get_pages().   This means that it's fine to migrate the page
> before it's actually filled with data, since the caller will retry.
> 
> It also means that it would be sufficient to allocate the page itself
> just before filling it in, if there was a mechanism to keep track of
> these "not yet filled" pages.  But that's probably off topic.
> 

Sounds plausible.

> 2) writing pages
> 
> When the page isn't actually being copied, the writeback could be
> cancelled and the page redirtied.  At which point it's fine to migrate
> it.  The problem is with pages that are spliced from /dev/fuse, where
> control over when they are being accessed is lost.  Note: this is not
> actually done right now on cached pages, since writeback always copies
> to temp pages.  So we can continue to do that when doing a splice and
> not risk any performance regressions.
> 

Can we just cancel and redirty the page like that when doing a
WB_SYNC_ALL flush? I think we'd need to ensure that it gets a new
writeback attempt as soon as the migration is done if that's in
progress, no?

-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 10:07                                                                                       ` Miklos Szeredi
@ 2025-01-14 18:07                                                                                         ` Joanne Koong
  2025-01-14 18:58                                                                                           ` Miklos Szeredi
  2025-01-14 20:51                                                                                         ` Joanne Koong
  1 sibling, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2025-01-14 18:07 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, Jeff Layton, David Hildenbrand, Shakeel Butt,
	Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, Jan 14, 2025 at 2:07 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, 14 Jan 2025 at 10:55, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> >
> >
> >
> > On 1/14/25 10:40, Miklos Szeredi wrote:
> > > On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > >> Maybe an explicit callback from the migration code to the filesystem
> > >> would work. I.e. move the complexity of dealing with migration for
> > >> problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
> > >> not sure how this would actually look, as I'm unfamiliar with the
> > >> details of page migration, but I guess it shouldn't be too difficult
> > >> to implement for fuse at least.
> > >
> > > Thinking a bit...
> > >
> > > 1) reading pages
> > >
> > > Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to
> > > ->readpages(), which may make the pages uptodate asynchronously.  If a
> > > page is unlocked but not set uptodate, then caller is supposed to
> > > retry the reading, at least that's how I interpret
> > > filemap_get_pages().   This means that it's fine to migrate the page
> > > before it's actually filled with data, since the caller will retry.
> > >
> > > It also means that it would be sufficient to allocate the page itself
> > > just before filling it in, if there was a mechanism to keep track of
> > > > these "not yet filled" pages.  But that's probably off topic.
> >
> > With /dev/fuse buffer copies should be easy - just allocate the page
> > on buffer copy, control is in libfuse.
>
> I think the issue is with generic page cache code, which currently
> relies on the PG_locked flag on the allocated but not yet filled page.
>   If the generic code would be able to keep track of "under
> construction" ranges without relying on an allocated page, then the
> filesystem could allocate the page just before copying the data,
> insert the page into the cache, and mark the relevant portion of the file
> uptodate.
>
> > With splice you really need
> > a page state.
>
> It's not possible to splice a not-uptodate page.
>
> > I wrote this before already - what is the advantage of a tmp page copy
> > over /dev/fuse buffer copy? I.e. I wonder if we need splice at all here.
>
> Splice seems a dead end, but we probably need to continue supporting
> it for a while for backward compatibility.
>

There was a previous discussion about splice and tmp pages here [1]. I
see the following issues with having splice default to using tmp pages
as a workaround:

- my understanding is that the majority of use cases do use splice (eg
iirc, libfuse does as well), in which case there's no point to this
patchset then
- codewise, imo this gets messy (eg we would still need the rb tree
and would now need to check writeback against folio writeback state
and against the rb tree)
- for the large folios work in [2], the implementation imo is pretty
clean because it's rebased on top of this patchset that removes the
tmp pages and rb tree. If we still have tmp pages, then this gets very
gnarly. There's not a good way I see to handle large folios in the rb
tree given this scenario:
a) writeback on a large folio is issued
b) we copy it to a tmp folio and clear writeback on it since it's
being spliced, we add this writeback request to the rb tree
c) the folio in the pagecache is evicted
d) another write occurs on a larger range that encompasses the range
in the writeback in a) or on a subset of it
Maybe this is doable with some other data structure instead of the rb
tree (eg an xarray with refcounts maybe?), but it'd be ideal if we
could find a solution (my guess is this would have to come from
the mm layer?) that obviates tmp pages altogether.
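
To make step d) concrete, here is a toy sketch of the bookkeeping any
replacement data structure would need: given the page ranges of in-flight
tmp writebacks, find which ones a new write touches. struct toy_range and
both helpers are hypothetical and use a plain array instead of an rb tree
or xarray; the point is only that a new write may be a superset, a subset,
or disjoint, and each overlapping entry has to be waited on or otherwise
accounted for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Inclusive range of page indices covered by one request. */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

/* Two inclusive ranges overlap if neither ends before the other starts. */
bool toy_ranges_overlap(struct toy_range a, struct toy_range b)
{
	return a.start <= b.end && b.start <= a.end;
}

/* Count in-flight writeback ranges that a new write range conflicts with. */
size_t toy_count_conflicts(const struct toy_range *inflight, size_t n,
			   struct toy_range w)
{
	size_t conflicts = 0;

	assert(w.start <= w.end);

	for (size_t i = 0; i < n; i++) {
		if (toy_ranges_overlap(inflight[i], w))
			conflicts++;
	}
	return conflicts;
}
```

A linear scan is only for illustration; the hard part in the scenario
above is that the pagecache folio is already gone, so the tracking
structure alone has to answer these range queries.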


Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1YwNw7C=EMfKQzN88Zq_2Qih5Te_bfkeaOf=tG+L3u9eA@mail.gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20241213221818.322371-1-joannelkoong@gmail.com/

> Thanks,
> Miklos



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 15:44                                                                                   ` Jeff Layton
@ 2025-01-14 18:58                                                                                     ` Joanne Koong
  0 siblings, 0 replies; 124+ messages in thread
From: Joanne Koong @ 2025-01-14 18:58 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Miklos Szeredi, David Hildenbrand, Shakeel Butt, Bernd Schubert,
	Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, Jan 14, 2025 at 7:44 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Tue, 2025-01-14 at 09:38 +0100, Miklos Szeredi wrote:
> > On Mon, 13 Jan 2025 at 22:44, Jeff Layton <jlayton@kernel.org> wrote:
> >
> > > What if we were to allow the kernel to kill off an unprivileged FUSE
> > > server that was "misbehaving" [1], clean any dirty pagecache pages that
> > > it has, and set writeback errors on the corresponding FUSE inodes [2]?
> > > We'd still need a rather long timeout (on the order of at least a
> > > minute or so, by default).
> >
> > How would this be different from Joanne's current request timeout patch?
> >
>
> When the timeout pops with Joanne's set, the pages still remain dirty
> (IIUC). The idea here would be that after a call times out and we've
> decided the server is "misbehaving", we'd want to clean the pages and
> mark the inode with a writeback error. That frees up the page to be
> migrated, but a later msync or fsync should return an error. This is
> the standard behavior for writeback errors on filesystems.

I think the pages already get cleaned and the inode marked with an
error in the case of a timeout. The timeout calls into the abort path,
so the abort path should already be doing this. When the connection is
aborted, fuse_request_end() will get invoked, which will call the
req->args->end() callback which for writebacks will be
fuse_writepage_end(). In fuse_writepage_end(), the error code gets
set on inode->i_mapping and the writeback state will be cleared on
the folio as well (in fuse_writepage_finish()).
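
The call chain described above can be modelled in user space roughly as
follows. This is a sketch under assumptions, not the FUSE code: the
toy_* names are hypothetical, -ECONNABORTED is assumed as the abort error,
and toy_fsync() uses a simplified report-once rule instead of the kernel's
real writeback error tracking.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mapping { int wb_err; };     /* models the mapping's error state */
struct toy_folio   { bool writeback; };

struct toy_req {
	struct toy_mapping *mapping;
	struct toy_folio *folio;
	void (*end)(struct toy_req *req, int error);  /* models args->end() */
};

/* Models the writeback end callback: record error, clear writeback. */
void toy_writepage_end(struct toy_req *req, int error)
{
	assert(req->mapping && req->folio);

	if (error && req->mapping->wb_err == 0)
		req->mapping->wb_err = error;
	req->folio->writeback = false;
}

/* Models request completion: invoke the per-request end callback. */
void toy_request_end(struct toy_req *req, int error)
{
	if (req->end)
		req->end(req, error);
}

/* Models the abort path taken on timeout: end every pending request. */
void toy_abort_conn(struct toy_req **reqs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		toy_request_end(reqs[i], -ECONNABORTED);
}

/* A later fsync reports the recorded error once, then clears it. */
int toy_fsync(struct toy_mapping *m)
{
	int err = m->wb_err;

	m->wb_err = 0;
	return err;
}
```

After toy_abort_conn() the folio is no longer under writeback, so in this
model it is free to be migrated, and the first toy_fsync() afterwards
returns the error.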

>
> > I think it makes sense, but it *has* to be opt in, for the same reason
> > that NFS soft timeout is opt in, so it can't really solve the page
> > migration issue generally.
> >
>
> Does it really need to be though? We're talking unprivileged mounts
> here. Imposing a hard timeout on reads or writes as a mechanism to
> limit resource consumption by an unprivileged user seems like a
> reasonable thing to do. Writeback errors suck, but what other recourse
> do we have in this situation?
>
> We could also consider only enforcing this when memory gets low, or a
> migration has failed.
>

I think there's a case to be made here that this "resource checking"
of unprivileged mounts should already be the default behavior (eg
automatically enforcing timeouts instead of only by opt-in). The only
issue I see with this is that it might break
backwards-compatibility, but I think it could be argued that
protecting memory resources outweighs that. Though the timeout would
have to be somewhat large, and I don't know if that would be
acceptable for migration.


Thanks,
Joanne

> > Also page reading has exactly the same issues, so fixing writeback is
> > not enough.
> >
>
> Reads are synchronous, so we could just return an error directly on
> those.
>
> > Maybe an explicit callback from the migration code to the filesystem
> > would work. I.e. move the complexity of dealing with migration for
> > problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
> > not sure how this would actually look, as I'm unfamiliar with the
> > details of page migration, but I guess it shouldn't be too difficult
> > to implement for fuse at least.
> >
>
> We already have a ->migrate_folio operation. Maybe we could consider
> pushing down the PG_writeback check into the ->migrate_folio ops? As an
> initial step, we could just make them all return -EBUSY, and then allow
> some (like FUSE) to handle the situation properly.
> --
> Jeff Layton <jlayton@kernel.org>



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 18:07                                                                                         ` Joanne Koong
@ 2025-01-14 18:58                                                                                           ` Miklos Szeredi
  2025-01-14 19:12                                                                                             ` Joanne Koong
  0 siblings, 1 reply; 124+ messages in thread
From: Miklos Szeredi @ 2025-01-14 18:58 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Bernd Schubert, Jeff Layton, David Hildenbrand, Shakeel Butt,
	Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, 14 Jan 2025 at 19:08, Joanne Koong <joannelkoong@gmail.com> wrote:

> - my understanding is that the majority of use cases do use splice (eg
> iirc, libfuse does as well), in which case there's no point to this
> patchset then

If it turns out that non-splice writes are more performant, then
libfuse can be fixed to use non-splice by default.   It's not as clear
cut though, since write through (which is also the default in libfuse,
AFAIK) should not be affected by all this, since that never used tmp
pages.

> - codewise, imo this gets messy (eg we would still need the rb tree
> and would now need to check writeback against folio writeback state
> and against the rb tree)

I'm thinking of something slightly different: remove the current tmp
page mess, but instead of duplicating a page ref on splice, fall back
to copying the cache page (see the user_pages case in
fuse_copy_page()).  This should have very similar performance to what
we have today, but allows us to deal with page accesses the same way
for both regular and splice I/O on /dev/fuse.

Thanks,
Miklos



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 18:58                                                                                           ` Miklos Szeredi
@ 2025-01-14 19:12                                                                                             ` Joanne Koong
  2025-01-14 20:00                                                                                               ` Miklos Szeredi
  2025-01-14 20:29                                                                                               ` Jeff Layton
  0 siblings, 2 replies; 124+ messages in thread
From: Joanne Koong @ 2025-01-14 19:12 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, Jeff Layton, David Hildenbrand, Shakeel Butt,
	Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, Jan 14, 2025 at 10:58 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, 14 Jan 2025 at 19:08, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > - my understanding is that the majority of use cases do use splice (eg
> > iirc, libfuse does as well), in which case there's no point to this
> > patchset then
>
> If it turns out that non-splice writes are more performant, then
> libfuse can be fixed to use non-splice by default.   It's not as clear
> cut though, since write through (which is also the default in libfuse,
> AFAIK) should not be affected by all this, since that never used tmp
> pages.

My thinking was that spliced writes without tmp pages would be
fastest, and that non-splice writes w/out tmp pages and spliced writes
w/ tmp pages would be roughly the same. But I'd need to benchmark and
verify this assumption.

>
> > - codewise, imo this gets messy (eg we would still need the rb tree
> > and would now need to check writeback against folio writeback state
> > and against the rb tree)
>
> I'm thinking of something slightly different: remove the current tmp
> page mess, but instead of duplicating a page ref on splice, fall back
> to copying the cache page (see the user_pages case in
> fuse_copy_page()).  This should have very similar performance to what
> we have today, but allows us to deal with page accesses the same way
> for both regular and splice I/O on /dev/fuse.

If we copy the cache page, do we not have the same issue with needing
an rb tree to track writeback state since writeback on the original
folio would be immediately cleared?


Thanks,
Joanne

>
> Thanks,
> Miklos



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 19:12                                                                                             ` Joanne Koong
@ 2025-01-14 20:00                                                                                               ` Miklos Szeredi
  2025-01-14 20:29                                                                                               ` Jeff Layton
  1 sibling, 0 replies; 124+ messages in thread
From: Miklos Szeredi @ 2025-01-14 20:00 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Bernd Schubert, Jeff Layton, David Hildenbrand, Shakeel Butt,
	Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, 14 Jan 2025 at 20:12, Joanne Koong <joannelkoong@gmail.com> wrote:

> If we copy the cache page, do we not have the same issue with needing
> an rb tree to track writeback state since writeback on the original
> folio would be immediately cleared?

Writeback would not be cleared in that case.   The copy would be to
guarantee that the page can be migrated.  Starting migration for an
under-writeback page would need some new mechanism, because currently
that's not possible.

But I realize now that even though write-through does not involve
PG_writeback, doing splice will result in those cache pages being
referenced for an indefinite amount of time, which can deny migration.
Ugh.   Same as page reading, this exists today.

Thanks,
Miklos



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 19:12                                                                                             ` Joanne Koong
  2025-01-14 20:00                                                                                               ` Miklos Szeredi
@ 2025-01-14 20:29                                                                                               ` Jeff Layton
  2025-01-14 21:40                                                                                                 ` Bernd Schubert
  1 sibling, 1 reply; 124+ messages in thread
From: Jeff Layton @ 2025-01-14 20:29 UTC (permalink / raw)
  To: Joanne Koong, Miklos Szeredi
  Cc: Bernd Schubert, David Hildenbrand, Shakeel Butt, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, 2025-01-14 at 11:12 -0800, Joanne Koong wrote:
> On Tue, Jan 14, 2025 at 10:58 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > On Tue, 14 Jan 2025 at 19:08, Joanne Koong <joannelkoong@gmail.com> wrote:
> > 
> > > - my understanding is that the majority of use cases do use splice (eg
> > > iirc, libfuse does as well), in which case there's no point to this
> > > patchset then
> > 
> > If it turns out that non-splice writes are more performant, then
> > libfuse can be fixed to use non-splice by default.   It's not as clear
> > cut though, since write through (which is also the default in libfuse,
> > AFAIK) should not be affected by all this, since that never used tmp
> > pages.
> 
> My thinking was that spliced writes without tmp pages would be
> fastest, then non-splice writes w/out tmp pages and spliced writes w/
> would be roughly the same. But i'd need to benchmark and verify this
> assumption.
> 

A somewhat related question: is Bernd's io_uring patchset susceptible
to the same problem as splice() in this situation? IOW, does the kernel
inline pagecache pages into the io_uring buffers?

If it doesn't have the same issue, then maybe we should think about
using that to make a clean behavior break. Gate large folios and not
using bounce pages behind io_uring.

That would mean dealing with multiple IO paths, but that might still be
simpler than trying to deal with multiple folio sizes in the writeback
rbtree tracking.

> > 
> > > - codewise, imo this gets messy (eg we would still need the rb tree
> > > and would now need to check writeback against folio writeback state
> > > and against the rb tree)
> > 
> > I'm thinking of something slightly different: remove the current tmp
> > page mess, but instead of duplicating a page ref on splice, fall back
> > to copying the cache page (see the user_pages case in
> > fuse_copy_page()).  This should have very similar performance to what
> > we have today, but allows us to deal with page accesses the same way
> > for both regular and splice I/O on /dev/fuse.
> 
> If we copy the cache page, do we not have the same issue with needing
> an rb tree to track writeback state since writeback on the original
> folio would be immediately cleared?
> 



-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 10:07                                                                                       ` Miklos Szeredi
  2025-01-14 18:07                                                                                         ` Joanne Koong
@ 2025-01-14 20:51                                                                                         ` Joanne Koong
  2025-01-24 12:25                                                                                           ` David Hildenbrand
  1 sibling, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2025-01-14 20:51 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, Jeff Layton, David Hildenbrand, Shakeel Butt,
	Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, Jan 14, 2025 at 2:07 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, 14 Jan 2025 at 10:55, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> >
> >
> >
> > On 1/14/25 10:40, Miklos Szeredi wrote:
> > > On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > >> Maybe an explicit callback from the migration code to the filesystem
> > >> would work. I.e. move the complexity of dealing with migration for
> > >> problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
> > >> not sure how this would actually look, as I'm unfamiliar with the
> > >> details of page migration, but I guess it shouldn't be too difficult
> > >> to implement for fuse at least.
> > >
> > > Thinking a bit...
> > >
> > > 1) reading pages
> > >
> > > Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to
> > > ->readpages(), which may make the pages uptodate asynchronously.  If a
> > > page is unlocked but not set uptodate, then caller is supposed to
> > > retry the reading, at least that's how I interpret
> > > filemap_get_pages().   This means that it's fine to migrate the page
> > > before it's actually filled with data, since the caller will retry.
> > >
> > > It also means that it would be sufficient to allocate the page itself
> > > just before filling it in, if there was a mechanism to keep track of
> > > these "not yet filled" pages.  But that probably off topic.
> >
> > With /dev/fuse buffer copies should be easy - just allocate the page
> > on buffer copy, control is in libfuse.
>
> I think the issue is with generic page cache code, which currently
> relies on the PG_locked flag on the allocated but not yet filled page.
>   If the generic code would be able to keep track of "under
> construction" ranges without relying on an allocated page, then the
> filesystem could allocate the page just before copying the data,
> insert the page into the cache mark the relevant portion of the file
> uptodate.
>
> > With splice you really need
> > a page state.
>
> It's not possible to splice a not-uptodate page.
>
> > I wrote this before already - what is the advantage of a tmp page copy
> > over /dev/fuse buffer copy? I.e. I wonder if we need splice at all here.
>
> Splice seems a dead end, but we probably need to continue supporting
> it for a while for backward compatibility.

For the splice case, could we do something like this or is this too invasive?:
* in mm, add a flag that marks a page as either being in migration or
temporarily blocking migration
* in splice, when we have to access the page in the pipe buffer, check
if that flag is set and wait for the migration to complete before
proceeding
* in splice, set that flag while it's accessing the page, which will
only temporarily block migration (eg for the duration of the memcpy)

I guess this is basically what the page lock is for, but with less overhead?

I need to look more at the splice code to see how it works, but
something like this would allow us to cancel writeback on spliced
pages that have already been sent to userspace if the request is
taking too long, and migration would never get stalled. Though I guess
the flag would be pretty specific only to the migration use case,
which might be a waste of a bit.


Thanks,
Joanne

>
> Thanks,
> Miklos



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 20:29                                                                                               ` Jeff Layton
@ 2025-01-14 21:40                                                                                                 ` Bernd Schubert
  2025-01-23 16:06                                                                                                   ` Pavel Begunkov
  0 siblings, 1 reply; 124+ messages in thread
From: Bernd Schubert @ 2025-01-14 21:40 UTC (permalink / raw)
  To: Jeff Layton, Joanne Koong, Miklos Szeredi
  Cc: David Hildenbrand, Shakeel Butt, Zi Yan, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko, David Wei, Ming Lei, Pavel Begunkov, Jens Axboe



On 1/14/25 21:29, Jeff Layton wrote:
> On Tue, 2025-01-14 at 11:12 -0800, Joanne Koong wrote:
>> On Tue, Jan 14, 2025 at 10:58 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>
>>> On Tue, 14 Jan 2025 at 19:08, Joanne Koong <joannelkoong@gmail.com> wrote:
>>>
>>>> - my understanding is that the majority of use cases do use splice (eg
>>>> iirc, libfuse does as well), in which case there's no point to this
>>>> patchset then
>>>
>>> If it turns out that non-splice writes are more performant, then
>>> libfuse can be fixed to use non-splice by default.   It's not as clear
>>> cut though, since write through (which is also the default in libfuse,
>>> AFAIK) should not be affected by all this, since that never used tmp
>>> pages.
>>
>> My thinking was that spliced writes without tmp pages would be
>> fastest, then non-splice writes w/out tmp pages and spliced writes w/
>> would be roughly the same. But i'd need to benchmark and verify this
>> assumption.
>>
> 
> A somewhat related question: is Bernd's io_uring patchset susceptible
> to the same problem as splice() in this situation? IOW, does the kernel
> inline pagecache pages into the io_uring buffers?

Right now it does a full copy, similar to non-splice /dev/fuse
read/write. I.e. it doesn't have zero copy yet either.

> 
> If it doesn't have the same issue, then maybe we should think about
> using that to make a clean behavior break. Gate large folios and not
> using bounce pages behind io_uring.
> 
> That would mean dealing with multiple IO paths, but that might still be
> simpler than trying to deal with multiple folio sizes in the writeback
> rbtree tracking.


My personal thinking regarding ZC was to hook into Ming's work. I
didn't go into deep details, but from an interface point of view it
sounded nice, like

- Application write
- fuse-client/kernel request/CQEs with write attempts
- fuse server prepares group SQE, group leader prepares
  the write buffer, other group members are consumers
  using their buffer part for the final destination
- release of leader buffer when other group members
  are done


Though, Pavel and Jens have concerns and have a different suggestion,
and at least the example Pavel gave looks like splice:

https://lore.kernel.org/all/f3a83b6a-c4b9-4933-998d-ebd1d09e3405@gmail.com/


I think David is looking into a different ZC solution, but I
don't have details on that.
Maybe fuse-io-uring and ublk splice approach should be another LSFMM
topic.


Thanks,
Bernd






* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 21:40                                                                                                 ` Bernd Schubert
@ 2025-01-23 16:06                                                                                                   ` Pavel Begunkov
  0 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2025-01-23 16:06 UTC (permalink / raw)
  To: Bernd Schubert, Jeff Layton, Joanne Koong, Miklos Szeredi
  Cc: David Hildenbrand, Shakeel Butt, Zi Yan, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko, David Wei, Ming Lei, Jens Axboe

On 1/14/25 21:40, Bernd Schubert wrote:
...
> My personal thinking regarding ZC was to hook into Mings work, I
> didn't into deep details but from interface point of view it sounded
> nice, like
> 
> - Application write
> - fuse-client/kernel request/CQEs with write attempts
> - fuse server prepares group SQE, group leader prepares
>    the write buffer, other group members are consumers
>    using their buffer part for the final destination
> - release of leader buffer when other group members
>    are done
> 
> 
> Though, Pavel and Jens have concerns and have a different suggestion
> and at least the example Pavel gave looks like splice

That's the same approach but with adjusted api, i.e. instead of caging
into groups it uses an io_uring private table, but in both cases one
request provides a buffer, subsequent requests do IO with that buffer.
And fwiw, it has nothing to do with pipes.

> https://lore.kernel.org/all/f3a83b6a-c4b9-4933-998d-ebd1d09e3405@gmail.com/

That one is simple and easy to maintain, we can trivially pick it up
if needed.

> I think David is looking into a different ZC solution, but I
> don't have details on that.
> Maybe fuse-io-uring and ublk splice approach should be another LSFMM
> topic.

Unfortunately, I won't make it, but maybe Jens is planning to go.

-- 
Pavel Begunkov




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 20:51                                                                                         ` Joanne Koong
@ 2025-01-24 12:25                                                                                           ` David Hildenbrand
  0 siblings, 0 replies; 124+ messages in thread
From: David Hildenbrand @ 2025-01-24 12:25 UTC (permalink / raw)
  To: Joanne Koong, Miklos Szeredi
  Cc: Bernd Schubert, Jeff Layton, Shakeel Butt, Zi Yan, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 14.01.25 21:51, Joanne Koong wrote:
> On Tue, Jan 14, 2025 at 2:07 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>
>> On Tue, 14 Jan 2025 at 10:55, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>>
>>>
>>>
>>> On 1/14/25 10:40, Miklos Szeredi wrote:
>>>> On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>>
>>>>> Maybe an explicit callback from the migration code to the filesystem
>>>>> would work. I.e. move the complexity of dealing with migration for
>>>>> problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
>>>>> not sure how this would actually look, as I'm unfamiliar with the
>>>>> details of page migration, but I guess it shouldn't be too difficult
>>>>> to implement for fuse at least.
>>>>
>>>> Thinking a bit...
>>>>
>>>> 1) reading pages
>>>>
>>>> Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to
>>>> ->readpages(), which may make the pages uptodate asynchronously.  If a
>>>> page is unlocked but not set uptodate, then caller is supposed to
>>>> retry the reading, at least that's how I interpret
>>>> filemap_get_pages().   This means that it's fine to migrate the page
>>>> before it's actually filled with data, since the caller will retry.
>>>>
>>>> It also means that it would be sufficient to allocate the page itself
>>>> just before filling it in, if there was a mechanism to keep track of
>>>> these "not yet filled" pages.  But that probably off topic.
>>>
>>> With /dev/fuse buffer copies should be easy - just allocate the page
>>> on buffer copy, control is in libfuse.
>>
>> I think the issue is with generic page cache code, which currently
>> relies on the PG_locked flag on the allocated but not yet filled page.
>>    If the generic code would be able to keep track of "under
>> construction" ranges without relying on an allocated page, then the
>> filesystem could allocate the page just before copying the data,
>> insert the page into the cache mark the relevant portion of the file
>> uptodate.
>>
>>> With splice you really need
>>> a page state.
>>
>> It's not possible to splice a not-uptodate page.
>>
>>> I wrote this before already - what is the advantage of a tmp page copy
>>> over /dev/fuse buffer copy? I.e. I wonder if we need splice at all here.
>>
>> Splice seems a dead end, but we probably need to continue supporting
>> it for a while for backward compatibility.
> 
> For the splice case, could we do something like this or is this too invasive?:
> * in mm, add a flag that marks a page as either being in migration or
> temporarily blocking migration
> * in splice, when we have to access the page in the pipe buffer, check
> if that flag is set and wait for the migration to complete before
> proceeding
> * in splice, set that flag while it's accessing the page, which will
> only temporarily block migration (eg for the duration of the memcpy)
>
> I guess this is basically what the page lock is for, but with less
> overhead?

Yes, the folio lock kind-of behaves that way.

One problem might be that while the page is spliced, there is a
raised refcount on the page: migration cannot make progress if there are
unknown references.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 15:49                                                                                     ` Jeff Layton
@ 2025-01-24 12:29                                                                                       ` David Hildenbrand
  2025-01-28 10:16                                                                                         ` Miklos Szeredi
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2025-01-24 12:29 UTC (permalink / raw)
  To: Jeff Layton, Miklos Szeredi
  Cc: Shakeel Butt, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel,
	jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox,
	Oscar Salvador, Michal Hocko

On 14.01.25 16:49, Jeff Layton wrote:
> On Tue, 2025-01-14 at 10:40 +0100, Miklos Szeredi wrote:
>> On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>
>>> Maybe an explicit callback from the migration code to the filesystem
>>> would work. I.e. move the complexity of dealing with migration for
>>> problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
>>> not sure how this would actually look, as I'm unfamiliar with the
>>> details of page migration, but I guess it shouldn't be too difficult
>>> to implement for fuse at least.
>>
>> Thinking a bit...
>>
>> 1) reading pages
>>
>> Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to
>> ->readpages(), which may make the pages uptodate asynchronously.  If a
>> page is unlocked but not set uptodate, then caller is supposed to
>> retry the reading, at least that's how I interpret
>> filemap_get_pages().   This means that it's fine to migrate the page
>> before it's actually filled with data, since the caller will retry.
>>
>> It also means that it would be sufficient to allocate the page itself
>> just before filling it in, if there was a mechanism to keep track of
>> these "not yet filled" pages.  But that probably off topic.
>>
> 
> Sounds plausible.
> 
>> 2) writing pages
>>
>> When the page isn't actually being copied, the writeback could be
>> cancelled and the page redirtied.  At which point it's fine to migrate
>> it.  The problem is with pages that are spliced from /dev/fuse and
>> control over when it's being accessed is lost.  Note: this is not
>> actually done right now on cached pages, since writeback always copies
>> to temp pages.  So we can continue to do that when doing a splice and
>> not risk any performance regressions.
>>
> 
> Can we just cancel and redirty the page like that when doing a
> WB_SYNC_ALL flush? I think we'd need to ensure that it gets a new
> writeback attempt as soon as the migration is done if that's in
> progress, no?

Yeah, that was one of my initial questions as well: could one 
"transparently" (to user space) handle canceling writeback and simply 
re-dirty the page.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-24 12:29                                                                                       ` David Hildenbrand
@ 2025-01-28 10:16                                                                                         ` Miklos Szeredi
  0 siblings, 0 replies; 124+ messages in thread
From: Miklos Szeredi @ 2025-01-28 10:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jeff Layton, Shakeel Butt, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Fri, 24 Jan 2025 at 13:29, David Hildenbrand <david@redhat.com> wrote:

> Yeah, that was one of my initial questions as well: could one
> "transparently" (to user space) handle canceling writeback and simply
> re-dirty the page.

1) WRITE request is not yet dequeued by userspace: the writeback can
be cancelled

2/a) WRITE request is dequeued (copied) to userspace: the page can be
reused, but the writeback isn't yet complete.  Calling
folio_end_writeback() is lying in the same sense that it's lying with
temp pages.

2/b) WRITE request is dequeued (spliced) to userspace:  the page is
referenced indefinitely (could even be after the writeback completes).
Temp page could be allocated at splice time, which means performance
will be no better than with current temp page writeback, but at least
it will be less complex.

3) WRITE request is currently being copied to userspace: this should
normally be short, but userspace can be nasty and have the buffer be
an mmap of another fuse file, and make the copy hang in the middle by
triggering a page fault.  The request cannot be cancelled at this
point.  In such a case the "echo 1 >
/sys/fs/fuse/connections/##/abort" mechanism or the upcoming server
timeout can be used to shut down the filesystem.
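
The cases above can be summarized as a small decision table. The sketch
below is a userspace model only: the enum and function names are
hypothetical and do not exist in the FUSE code. It simply maps each
WRITE request state listed above to what could be done with the page
under writeback.

```c
#include <assert.h>

/* Hypothetical labels for the WRITE request states listed above. */
enum model_req_state {
	MODEL_REQ_QUEUED,   /* 1)   not yet dequeued by userspace */
	MODEL_REQ_COPIED,   /* 2/a) dequeued, data copied to userspace */
	MODEL_REQ_SPLICED,  /* 2/b) dequeued, page spliced to userspace */
	MODEL_REQ_COPYING,  /* 3)   copy to userspace currently in progress */
};

/* Hypothetical labels for what can be done with the page in each state. */
enum model_action {
	MODEL_CANCEL_AND_REDIRTY,  /* writeback can be cancelled */
	MODEL_REUSE_WB_PENDING,    /* page reusable, but writeback is not complete */
	MODEL_NEEDS_TEMP_PAGE,     /* page referenced indefinitely unless copied at splice time */
	MODEL_WAIT_OR_ABORT,       /* cannot cancel; only abort or server timeout helps */
};

static enum model_action model_action_for(enum model_req_state state)
{
	switch (state) {
	case MODEL_REQ_QUEUED:
		return MODEL_CANCEL_AND_REDIRTY;
	case MODEL_REQ_COPIED:
		return MODEL_REUSE_WB_PENDING;
	case MODEL_REQ_SPLICED:
		return MODEL_NEEDS_TEMP_PAGE;
	case MODEL_REQ_COPYING:
		return MODEL_WAIT_OR_ABORT;
	}
	/* Not reached for valid states; be conservative otherwise. */
	return MODEL_WAIT_OR_ABORT;
}
```

Only the first state allows a transparent cancel; every other state
needs its own handling, which is where the complexity comes from.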

So this is definitely more complicated than I'd like.

Thanks,
Miklos



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2024-12-19 13:05   ` David Hildenbrand
  2024-12-19 14:19     ` Zi Yan
  2024-12-19 15:43     ` Shakeel Butt
@ 2025-04-02 21:34     ` Joanne Koong
  2025-04-03  3:31       ` Jingbo Xu
  2 siblings, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2025-04-02 21:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: miklos, linux-fsdevel, shakeel.butt, jefflexu, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan,
	Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 23.11.24 00:23, Joanne Koong wrote:
> > For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> > it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> > mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> > writeback may take an indeterminate amount of time to complete, and
> > waits may get stuck.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> > ---
> >   mm/migrate.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index df91248755e4..fe73284e5246 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
> >                */
> >               switch (mode) {
> >               case MIGRATE_SYNC:
> > -                     break;
> > +                     if (!src->mapping ||
> > +                         !mapping_writeback_indeterminate(src->mapping))
> > +                             break;
> > +                     fallthrough;
> >               default:
> >                       rc = -EBUSY;
> >                       goto out;
>
> Ehm, doesn't this mean that any fuse user can essentially completely
> block CMA allocations, memory compaction, memory hotunplug, memory
> poisoning... ?!
>
> That sounds very bad.

I took a closer look at the migration code and the FUSE code. In
migrate_folio_unmap(), I see that in MIGRATE_SYNC mode, any held folio
lock will block migration until that folio is unlocked. This is the
snippet in migrate_folio_unmap() I'm looking at:

        if (!folio_trylock(src)) {
                if (mode == MIGRATE_ASYNC)
                        goto out;

                if (current->flags & PF_MEMALLOC)
                        goto out;

                if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
                        goto out;

                folio_lock(src);
        }

If this is all that is needed for a malicious FUSE server to block
migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
mappings are skipped in migration. A malicious server has easier and
more powerful ways of blocking migration in FUSE than trying to do it
through writeback. For a malicious fuse server, we in fact wouldn't
even get far enough to hit writeback - a write triggers
aops->write_begin() and a malicious server would deliberately hang
forever while the folio is locked in write_begin().
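
To make the cases in the snippet above easier to see, here is a rough
userspace model of that trylock failure path. It is only a sketch and
not kernel code: the enum values and the helper name are made up for
illustration, and the two bool parameters stand in for the
PF_MEMALLOC and folio_test_uptodate() checks.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's migrate modes. */
enum model_mode { MODEL_ASYNC, MODEL_SYNC_LIGHT, MODEL_SYNC };

/*
 * Models what happens after folio_trylock() has failed, i.e. someone
 * else already holds the folio lock. Returns true if the migrating
 * task would go on to sleep in folio_lock(), false if it gives up on
 * this folio (the "goto out" cases).
 */
static bool blocks_on_locked_folio(enum model_mode mode, bool pf_memalloc,
                                   bool uptodate)
{
        if (mode == MODEL_ASYNC)
                return false;
        if (pf_memalloc)
                return false;
        if (mode == MODEL_SYNC_LIGHT && !uptodate)
                return false;
        /* folio_lock(src): sleeps until the lock holder unlocks. */
        return true;
}
```

In this model, MODEL_SYNC without PF_MEMALLOC always ends up in
folio_lock(), whatever the state of the folio, which is the case a
server that never replies can exploit.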

I looked into whether we could eradicate all the places in FUSE where
we may hold the folio lock for an indeterminate amount of time,
because if that is possible, then we should not add this writeback way
for a malicious fuse server to affect migration. But I don't think we
can. To take one case, the folio lock needs to be held as we read in
the folio from the server when servicing page faults; otherwise the
page cache would contain stale data if there was a concurrent write
that happened just before, which would lead to data corruption in the
filesystem. Imo, we need a more encompassing solution for all these
cases if we're serious about preventing FUSE from blocking migration.
That probably looks like a globally enforced default timeout of some
sort, or an mm solution for limiting the blast radius of how much
memory can be blocked from migration, but that is outside the scope of
this patchset and is its own standalone topic.

I don't see how this patch gives a malicious server any way to hurt
memory migration that it doesn't already have (and can use more
easily). If anything, this patchset helps memory, given that malicious
servers can no longer trigger page allocations for temp pages that
would never get freed.


Thanks,
Joanne

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-04-02 21:34     ` Joanne Koong
@ 2025-04-03  3:31       ` Jingbo Xu
  2025-04-03  9:18         ` David Hildenbrand
  0 siblings, 1 reply; 124+ messages in thread
From: Jingbo Xu @ 2025-04-03  3:31 UTC (permalink / raw)
  To: Joanne Koong, David Hildenbrand
  Cc: miklos, linux-fsdevel, shakeel.butt, josef, bernd.schubert,
	linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador,
	Michal Hocko



On 4/3/25 5:34 AM, Joanne Koong wrote:
> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 23.11.24 00:23, Joanne Koong wrote:
>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>> writeback may take an indeterminate amount of time to complete, and
>>> waits may get stuck.
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>> ---
>>>   mm/migrate.c | 5 ++++-
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index df91248755e4..fe73284e5246 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>                */
>>>               switch (mode) {
>>>               case MIGRATE_SYNC:
>>> -                     break;
>>> +                     if (!src->mapping ||
>>> +                         !mapping_writeback_indeterminate(src->mapping))
>>> +                             break;
>>> +                     fallthrough;
>>>               default:
>>>                       rc = -EBUSY;
>>>                       goto out;
>>
>> Ehm, doesn't this mean that any fuse user can essentially completely
>> block CMA allocations, memory compaction, memory hotunplug, memory
>> poisoning... ?!
>>
>> That sounds very bad.
> 
> I took a closer look at the migration code and the FUSE code. In the
> migration code in migrate_folio_unmap(), I see that any MIGRATE_SYNC
> mode folio lock holds will block migration until that folio is
> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
> 
>         if (!folio_trylock(src)) {
>                 if (mode == MIGRATE_ASYNC)
>                         goto out;
> 
>                 if (current->flags & PF_MEMALLOC)
>                         goto out;
> 
>                 if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
>                         goto out;
> 
>                 folio_lock(src);
>         }
> 
> If this is all that is needed for a malicious FUSE server to block
> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
> mappings are skipped in migration. A malicious server has easier and
> more powerful ways of blocking migration in FUSE than trying to do it
> through writeback. For a malicious fuse server, we in fact wouldn't
> even get far enough to hit writeback - a write triggers
> aops->write_begin() and a malicious server would deliberately hang
> forever while the folio is locked in write_begin().

Indeed it seems possible.  A malicious FUSE server may already be
capable of blocking the synchronous migration in this way.


> 
> I looked into whether we could eradicate all the places in FUSE where
> we may hold the folio lock for an indeterminate amount of time,
> because if that is possible, then we should not add this writeback way
> for a malicious fuse server to affect migration. But I don't think we
> can, for example taking one case, the folio lock needs to be held as
> we read in the folio from the server when servicing page faults, else
> the page cache would contain stale data if there was a concurrent
> write that happened just before, which would lead to data corruption
> in the filesystem. Imo, we need a more encompassing solution for all
> these cases if we're serious about preventing FUSE from blocking
> migration, which probably looks like a globally enforced default
> timeout of some sort or an mm solution for mitigating the blast radius
> of how much memory can be blocked from migration, but that is outside
> the scope of this patchset and is its own standalone topic.
> 
> I don't see how this patch has any additional negative impact on
> memory migration for the case of malicious servers that the server
> can't already (and more easily) do. In fact, this patchset if anything
> helps memory given that malicious servers now can't also trigger page
> allocations for temp pages that would never get freed.
> 

If that's true, maybe we could drop this patch from this patchset? That
way, both before and after this patchset, synchronous migration could be
blocked by a malicious FUSE server, while the usability of contiguous
memory (CMA) won't be affected.

-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-04-03  3:31       ` Jingbo Xu
@ 2025-04-03  9:18         ` David Hildenbrand
  2025-04-03  9:25           ` Bernd Schubert
  2025-04-03 19:09           ` Joanne Koong
  0 siblings, 2 replies; 124+ messages in thread
From: David Hildenbrand @ 2025-04-03  9:18 UTC (permalink / raw)
  To: Jingbo Xu, Joanne Koong
  Cc: miklos, linux-fsdevel, shakeel.butt, josef, bernd.schubert,
	linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador,
	Michal Hocko

On 03.04.25 05:31, Jingbo Xu wrote:
> 
> 
> On 4/3/25 5:34 AM, Joanne Koong wrote:
>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>> writeback may take an indeterminate amount of time to complete, and
>>>> waits may get stuck.
>>>>
>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>> ---
>>>>    mm/migrate.c | 5 ++++-
>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index df91248755e4..fe73284e5246 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>                 */
>>>>                switch (mode) {
>>>>                case MIGRATE_SYNC:
>>>> -                     break;
>>>> +                     if (!src->mapping ||
>>>> +                         !mapping_writeback_indeterminate(src->mapping))
>>>> +                             break;
>>>> +                     fallthrough;
>>>>                default:
>>>>                        rc = -EBUSY;
>>>>                        goto out;
>>>
>>> Ehm, doesn't this mean that any fuse user can essentially completely
>>> block CMA allocations, memory compaction, memory hotunplug, memory
>>> poisoning... ?!
>>>
>>> That sounds very bad.
>>
>> I took a closer look at the migration code and the FUSE code. In the
>> migration code in migrate_folio_unmap(), I see that any MIGRATE_SYNC
>> mode folio lock holds will block migration until that folio is
>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
>>
>>          if (!folio_trylock(src)) {
>>                  if (mode == MIGRATE_ASYNC)
>>                          goto out;
>>
>>                  if (current->flags & PF_MEMALLOC)
>>                          goto out;
>>
>>                  if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
>>                          goto out;
>>
>>                  folio_lock(src);
>>          }
>>

Right, I raised that also in my LSF/MM talk: waiting for readahead 
currently implies waiting for the folio lock (there is no separate 
readahead flag like there would be for writeback).

The more I look into this and fuse, the more I realize that what fuse 
does is just completely broken right now.

>> If this is all that is needed for a malicious FUSE server to block
>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
>> mappings are skipped in migration. A malicious server has easier and
>> more powerful ways of blocking migration in FUSE than trying to do it
>> through writeback. For a malicious fuse server, we in fact wouldn't
>> even get far enough to hit writeback - a write triggers
>> aops->write_begin() and a malicious server would deliberately hang
>> forever while the folio is locked in write_begin().
> 
> Indeed it seems possible.  A malicious FUSE server may already be
> capable of blocking the synchronous migration in this way.

Yes, I think the conclusion is that we should advise people against
using unprivileged FUSE if they care about any features that rely on
page migration or page reclaim.

> 
> 
>>
>> I looked into whether we could eradicate all the places in FUSE where
>> we may hold the folio lock for an indeterminate amount of time,
>> because if that is possible, then we should not add this writeback way
>> for a malicious fuse server to affect migration. But I don't think we
>> can, for example taking one case, the folio lock needs to be held as
>> we read in the folio from the server when servicing page faults, else
>> the page cache would contain stale data if there was a concurrent
>> write that happened just before, which would lead to data corruption
>> in the filesystem. Imo, we need a more encompassing solution for all
>> these cases if we're serious about preventing FUSE from blocking
>> migration, which probably looks like a globally enforced default
>> timeout of some sort or an mm solution for mitigating the blast radius
>> of how much memory can be blocked from migration, but that is outside
>> the scope of this patchset and is its own standalone topic.

I'm still skeptical about timeouts: we can only get it wrong.

I think a proper solution is making these pages movable, which does seem 
feasible if (a) splice is not involved and (b) we can find a way to not 
hold the folio lock forever e.g., in the readahead case.

Maybe readahead would have to be handled more similarly to writeback
(e.g., having a separate flag, or using a combination of e.g.,
writeback+uptodate flag, not sure).

In both cases (readahead+writeback), we'd want to call into the FS to
migrate a folio that is under readahead/writeback. In the case of fuse
without splice, a migration might be doable, and as discussed, splice
might just be avoided.

>>
>> I don't see how this patch has any additional negative impact on
>> memory migration for the case of malicious servers that the server
>> can't already (and more easily) do. In fact, this patchset if anything
>> helps memory given that malicious servers now can't also trigger page
>> allocations for temp pages that would never get freed.
>>
> 
> If that's true, maybe we could drop this patch out of this patchset? So
> that both before and after this patchset, synchronous migration could be
> blocked by a malicious FUSE server, while the usability of continuous
> memory (CMA) won't be affected.

I had exactly the same thought: if we can block forever on the folio 
lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all 
completely broken.
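
As a rough illustration of that argument, here is a small userspace
model (a sketch only, not kernel code). It assumes, from my reading of
mm/migrate.c rather than from anything stated in this thread, that
migrate_folio_unmap() takes the folio lock before it looks at the
writeback flag. All names are hypothetical, and the PF_MEMALLOC and
"force" handling is left out.

```c
#include <stdbool.h>

/* Hypothetical outcomes of one MIGRATE_SYNC attempt on one folio. */
enum model_outcome { MODEL_PROCEEDS, MODEL_EBUSY, MODEL_WAITS };

struct model_folio {
        bool locked;                /* folio lock held by someone else */
        bool under_writeback;       /* writeback flag set on the folio */
        bool mapping_indeterminate; /* AS_WRITEBACK_INDETERMINATE set  */
};

/*
 * MIGRATE_SYNC only. with_patch selects whether the check added by
 * this patch (skip folios under indeterminate writeback) is applied.
 */
static enum model_outcome model_sync_migrate(const struct model_folio *f,
                                             bool with_patch)
{
        /* Stage 1: folio_lock() sleeps until the holder unlocks. */
        if (f->locked)
                return MODEL_WAITS;

        /* Stage 2: writeback handling, only reached with the lock held. */
        if (f->under_writeback) {
                if (with_patch && f->mapping_indeterminate)
                        return MODEL_EBUSY;
                return MODEL_WAITS; /* wait for writeback to finish */
        }

        return MODEL_PROCEEDS;
}
```

In this model a folio that is kept locked yields MODEL_WAITS with or
without the patch, so the stage 2 check never gets a chance to help
against a server that hangs while the folio lock is held.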

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-04-03  9:18         ` David Hildenbrand
@ 2025-04-03  9:25           ` Bernd Schubert
  2025-04-03  9:35             ` Christian Brauner
  2025-04-03 19:09           ` Joanne Koong
  1 sibling, 1 reply; 124+ messages in thread
From: Bernd Schubert @ 2025-04-03  9:25 UTC (permalink / raw)
  To: David Hildenbrand, Jingbo Xu, Joanne Koong
  Cc: miklos, linux-fsdevel, shakeel.butt, josef, linux-mm, kernel-team,
	Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko, Keith Busch



On 4/3/25 11:18, David Hildenbrand wrote:
> On 03.04.25 05:31, Jingbo Xu wrote:
>>
>>
>> On 4/3/25 5:34 AM, Joanne Koong wrote:
>>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com>
>>> wrote:
>>>>
>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the
>>>>> folio if
>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag
>>>>> set on its
>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the
>>>>> mapping, the
>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>> waits may get stuck.
>>>>>
>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>> ---
>>>>>    mm/migrate.c | 5 ++++-
>>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index df91248755e4..fe73284e5246 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>                 */
>>>>>                switch (mode) {
>>>>>                case MIGRATE_SYNC:
>>>>> -                     break;
>>>>> +                     if (!src->mapping ||
>>>>> +                         !mapping_writeback_indeterminate(src->mapping))
>>>>> +                             break;
>>>>> +                     fallthrough;
>>>>>                default:
>>>>>                        rc = -EBUSY;
>>>>>                        goto out;
>>>>
>>>> Ehm, doesn't this mean that any fuse user can essentially completely
>>>> block CMA allocations, memory compaction, memory hotunplug, memory
>>>> poisoning... ?!
>>>>
>>>> That sounds very bad.
>>>
>>> I took a closer look at the migration code and the FUSE code. In the
>>> migration code in migrate_folio_unmap(), I see that any MIGRATE_SYNC
>>> mode folio lock holds will block migration until that folio is
>>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
>>>
>>>          if (!folio_trylock(src)) {
>>>                  if (mode == MIGRATE_ASYNC)
>>>                          goto out;
>>>
>>>                  if (current->flags & PF_MEMALLOC)
>>>                          goto out;
>>>
>>>                  if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
>>>                          goto out;
>>>
>>>                  folio_lock(src);
>>>          }
>>>
> 
> Right, I raised that also in my LSF/MM talk: waiting for readahead
> currently implies waiting for the folio lock (there is no separate
> readahead flag like there would be for writeback).
> 
> The more I look into this and fuse, the more I realize that what fuse
> does is just completely broken right now.
> 
>>> If this is all that is needed for a malicious FUSE server to block
>>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
>>> mappings are skipped in migration. A malicious server has easier and
>>> more powerful ways of blocking migration in FUSE than trying to do it
>>> through writeback. For a malicious fuse server, we in fact wouldn't
>>> even get far enough to hit writeback - a write triggers
>>> aops->write_begin() and a malicious server would deliberately hang
>>> forever while the folio is locked in write_begin().
>>
>> Indeed it seems possible.  A malicious FUSE server may already be
>> capable of blocking the synchronous migration in this way.
> 
> Yes, I think the conclusion is that we should advise people from not
> using unprivileged FUSE if they care about any features that rely on
> page migration or page reclaim.
> 
>>
>>
>>>
>>> I looked into whether we could eradicate all the places in FUSE where
>>> we may hold the folio lock for an indeterminate amount of time,
>>> because if that is possible, then we should not add this writeback way
>>> for a malicious fuse server to affect migration. But I don't think we
>>> can, for example taking one case, the folio lock needs to be held as
>>> we read in the folio from the server when servicing page faults, else
>>> the page cache would contain stale data if there was a concurrent
>>> write that happened just before, which would lead to data corruption
>>> in the filesystem. Imo, we need a more encompassing solution for all
>>> these cases if we're serious about preventing FUSE from blocking
>>> migration, which probably looks like a globally enforced default
>>> timeout of some sort or an mm solution for mitigating the blast radius
>>> of how much memory can be blocked from migration, but that is outside
>>> the scope of this patchset and is its own standalone topic.
> 
> I'm still skeptical about timeouts: we can only get it wrong.
> 
> I think a proper solution is making these pages movable, which does seem
> feasible if (a) splice is not involved and (b) we can find a way to not
> hold the folio lock forever e.g., in the readahead case.
> 
> Maybe readahead would have to be handled more similar to writeback
> (e.g., having a separate flag, or using a combination of e.g.,
> writeback+uptodate flag, not sure)
> 
> In both cases (readahead+writeback), we'd want to call into the FS to
> migrate a folio that is under readahread/writeback. In case of fuse
> without splice, a migration might be doable, and as discussed, splice
> might just be avoided.

My personal take here is that we should move away from splice.
Keith (or a colleague) is working on zero-copy (ZC) with io-uring
anyway, so this may be good timing. We should just ensure that the
new approach doesn't have the same issue.

Thanks,
Bernd


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-04-03  9:25           ` Bernd Schubert
@ 2025-04-03  9:35             ` Christian Brauner
  0 siblings, 0 replies; 124+ messages in thread
From: Christian Brauner @ 2025-04-03  9:35 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: David Hildenbrand, Jingbo Xu, Joanne Koong, miklos, linux-fsdevel,
	shakeel.butt, josef, linux-mm, kernel-team, Matthew Wilcox,
	Zi Yan, Oscar Salvador, Michal Hocko, Keith Busch

On Thu, Apr 03, 2025 at 11:25:17AM +0200, Bernd Schubert wrote:
> 
> 
> On 4/3/25 11:18, David Hildenbrand wrote:
> > On 03.04.25 05:31, Jingbo Xu wrote:
> >>
> >>
> >> On 4/3/25 5:34 AM, Joanne Koong wrote:
> >>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com>
> >>> wrote:
> >>>>
> >>>> On 23.11.24 00:23, Joanne Koong wrote:
> >>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the
> >>>>> folio if
> >>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag
> >>>>> set on its
> >>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the
> >>>>> mapping, the
> >>>>> writeback may take an indeterminate amount of time to complete, and
> >>>>> waits may get stuck.
> >>>>>
> >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> >>>>> ---
> >>>>>    mm/migrate.c | 5 ++++-
> >>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>>>> index df91248755e4..fe73284e5246 100644
> >>>>> --- a/mm/migrate.c
> >>>>> +++ b/mm/migrate.c
> >>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
> >>>>>                 */
> >>>>>                switch (mode) {
> >>>>>                case MIGRATE_SYNC:
> >>>>> -                     break;
> >>>>> +                     if (!src->mapping ||
> >>>>> +                         !mapping_writeback_indeterminate(src->mapping))
> >>>>> +                             break;
> >>>>> +                     fallthrough;
> >>>>>                default:
> >>>>>                        rc = -EBUSY;
> >>>>>                        goto out;
> >>>>
> >>>> Ehm, doesn't this mean that any fuse user can essentially completely
> >>>> block CMA allocations, memory compaction, memory hotunplug, memory
> >>>> poisoning... ?!
> >>>>
> >>>> That sounds very bad.
> >>>
> >>> I took a closer look at the migration code and the FUSE code. In the
> >>> migration code in migrate_folio_unmap(), I see that any MIGRATE_SYNC
> >>> mode folio lock holds will block migration until that folio is
> >>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
> >>>
> >>>          if (!folio_trylock(src)) {
> >>>                  if (mode == MIGRATE_ASYNC)
> >>>                          goto out;
> >>>
> >>>                  if (current->flags & PF_MEMALLOC)
> >>>                          goto out;
> >>>
> >>>                  if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
> >>>                          goto out;
> >>>
> >>>                  folio_lock(src);
> >>>          }
> >>>
> > 
> > Right, I raised that also in my LSF/MM talk: waiting for readahead
> > currently implies waiting for the folio lock (there is no separate
> > readahead flag like there would be for writeback).
> > 
> > The more I look into this and fuse, the more I realize that what fuse
> > does is just completely broken right now.
> > 
> >>> If this is all that is needed for a malicious FUSE server to block
> >>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
> >>> mappings are skipped in migration. A malicious server has easier and
> >>> more powerful ways of blocking migration in FUSE than trying to do it
> >>> through writeback. For a malicious fuse server, we in fact wouldn't
> >>> even get far enough to hit writeback - a write triggers
> >>> aops->write_begin() and a malicious server would deliberately hang
> >>> forever while the folio is locked in write_begin().
> >>
> >> Indeed it seems possible.  A malicious FUSE server may already be
> >> capable of blocking the synchronous migration in this way.
> > 
> > Yes, I think the conclusion is that we should advise people from not
> > using unprivileged FUSE if they care about any features that rely on
> > page migration or page reclaim.
> > 
> >>
> >>
> >>>
> >>> I looked into whether we could eradicate all the places in FUSE where
> >>> we may hold the folio lock for an indeterminate amount of time,
> >>> because if that is possible, then we should not add this writeback way
> >>> for a malicious fuse server to affect migration. But I don't think we
> >>> can, for example taking one case, the folio lock needs to be held as
> >>> we read in the folio from the server when servicing page faults, else
> >>> the page cache would contain stale data if there was a concurrent
> >>> write that happened just before, which would lead to data corruption
> >>> in the filesystem. Imo, we need a more encompassing solution for all
> >>> these cases if we're serious about preventing FUSE from blocking
> >>> migration, which probably looks like a globally enforced default
> >>> timeout of some sort or an mm solution for mitigating the blast radius
> >>> of how much memory can be blocked from migration, but that is outside
> >>> the scope of this patchset and is its own standalone topic.
> > 
> > I'm still skeptical about timeouts: we can only get it wrong.
> > 
> > I think a proper solution is making these pages movable, which does seem
> > feasible if (a) splice is not involved and (b) we can find a way to not
> > hold the folio lock forever e.g., in the readahead case.
> > 
> > Maybe readahead would have to be handled more similar to writeback
> > (e.g., having a separate flag, or using a combination of e.g.,
> > writeback+uptodate flag, not sure)
> > 
> > In both cases (readahead+writeback), we'd want to call into the FS to
> > migrate a folio that is under readahread/writeback. In case of fuse
> > without splice, a migration might be doable, and as discussed, splice
> > might just be avoided.
> 
> My personal take is here that we should move away from splice.
> Keith (or colleague) is working on ZC with io-uring anyway, so
> maybe a good timing. We should just ensure that the new approach
> doesn't have the same issue.

splice is problematic in a lot of other ways too. It's easy to abuse it
for weird userspace hangs since it clings onto the pipe_lock() and no
one wants to do the invasive surgery to wean it off of that. So +1 on
avoiding splice.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-04-03  9:18         ` David Hildenbrand
  2025-04-03  9:25           ` Bernd Schubert
@ 2025-04-03 19:09           ` Joanne Koong
  2025-04-03 20:44             ` David Hildenbrand
  1 sibling, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2025-04-03 19:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jingbo Xu, miklos, linux-fsdevel, shakeel.butt, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan,
	Oscar Salvador, Michal Hocko

On Thu, Apr 3, 2025 at 2:18 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.04.25 05:31, Jingbo Xu wrote:
> >
> >
> > On 4/3/25 5:34 AM, Joanne Koong wrote:
> >> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 23.11.24 00:23, Joanne Koong wrote:
> >>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> >>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> >>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> >>>> writeback may take an indeterminate amount of time to complete, and
> >>>> waits may get stuck.
> >>>>
> >>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> >>>> ---
> >>>>    mm/migrate.c | 5 ++++-
> >>>>    1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>>> index df91248755e4..fe73284e5246 100644
> >>>> --- a/mm/migrate.c
> >>>> +++ b/mm/migrate.c
> >>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
> >>>>                 */
> >>>>                switch (mode) {
> >>>>                case MIGRATE_SYNC:
> >>>> -                     break;
> >>>> +                     if (!src->mapping ||
> >>>> +                         !mapping_writeback_indeterminate(src->mapping))
> >>>> +                             break;
> >>>> +                     fallthrough;
> >>>>                default:
> >>>>                        rc = -EBUSY;
> >>>>                        goto out;
> >>>
> >>> Ehm, doesn't this mean that any fuse user can essentially completely
> >>> block CMA allocations, memory compaction, memory hotunplug, memory
> >>> poisoning... ?!
> >>>
> >>> That sounds very bad.
> >>
> >> I took a closer look at the migration code and the FUSE code. In the
> >> migration code in migrate_folio_unmap(), I see that any MIGRATE_SYNC
> >> mode folio lock holds will block migration until that folio is
> >> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
> >>
> >>          if (!folio_trylock(src)) {
> >>                  if (mode == MIGRATE_ASYNC)
> >>                          goto out;
> >>
> >>                  if (current->flags & PF_MEMALLOC)
> >>                          goto out;
> >>
> >>                  if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
> >>                          goto out;
> >>
> >>                  folio_lock(src);
> >>          }
> >>
>
> Right, I raised that also in my LSF/MM talk: waiting for readahead
> currently implies waiting for the folio lock (there is no separate
> readahead flag like there would be for writeback).
>
> The more I look into this and fuse, the more I realize that what fuse
> does is just completely broken right now.
>
> >> If this is all that is needed for a malicious FUSE server to block
> >> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
> >> mappings are skipped in migration. A malicious server has easier and
> >> more powerful ways of blocking migration in FUSE than trying to do it
> >> through writeback. For a malicious fuse server, we in fact wouldn't
> >> even get far enough to hit writeback - a write triggers
> >> aops->write_begin() and a malicious server would deliberately hang
> >> forever while the folio is locked in write_begin().
> >
> > Indeed it seems possible.  A malicious FUSE server may already be
> > capable of blocking the synchronous migration in this way.
>
> Yes, I think the conclusion is that we should advise people from not
> using unprivileged FUSE if they care about any features that rely on
> page migration or page reclaim.
>
> >
> >
> >>
> >> I looked into whether we could eradicate all the places in FUSE where
> >> we may hold the folio lock for an indeterminate amount of time,
> >> because if that is possible, then we should not add this writeback way
> >> for a malicious fuse server to affect migration. But I don't think we
> >> can, for example taking one case, the folio lock needs to be held as
> >> we read in the folio from the server when servicing page faults, else
> >> the page cache would contain stale data if there was a concurrent
> >> write that happened just before, which would lead to data corruption
> >> in the filesystem. Imo, we need a more encompassing solution for all
> >> these cases if we're serious about preventing FUSE from blocking
> >> migration, which probably looks like a globally enforced default
> >> timeout of some sort or an mm solution for mitigating the blast radius
> >> of how much memory can be blocked from migration, but that is outside
> >> the scope of this patchset and is its own standalone topic.
>
> I'm still skeptical about timeouts: we can only get it wrong.
>
> I think a proper solution is making these pages movable, which does seem
> feasible if (a) splice is not involved and (b) we can find a way to not
> hold the folio lock forever e.g., in the readahead case.
>
> Maybe readahead would have to be handled more similar to writeback
> (e.g., having a separate flag, or using a combination of e.g.,
> writeback+uptodate flag, not sure)
>
> In both cases (readahead+writeback), we'd want to call into the FS to
> migrate a folio that is under readahead/writeback. In case of fuse
> without splice, a migration might be doable, and as discussed, splice
> might just be avoided.
>
> >>
> >> I don't see how this patch has any additional negative impact on
> >> memory migration for the case of malicious servers that the server
> >> can't already (and more easily) do. In fact, this patchset if anything
> >> helps memory given that malicious servers now can't also trigger page
> >> allocations for temp pages that would never get freed.
> >>
> >
> > If that's true, maybe we could drop this patch out of this patchset? So
> > that both before and after this patchset, synchronous migration could be
> > blocked by a malicious FUSE server, while the usability of contiguous
> > memory (CMA) won't be affected.
>
> I had exactly the same thought: if we can block forever on the folio
> lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all
> completely broken.

I will resubmit this patchset and drop this patch.

I think we still need AS_WRITEBACK_INDETERMINATE for sync and legacy
cgroupv1 reclaim scenarios:
a) sync: sync waits on writeback, so if we don't skip waiting on
writeback for AS_WRITEBACK_INDETERMINATE mappings, then malicious fuse
servers could make syncs hang. (Skipping the wait doesn't actually
change sync behavior compared to temp pages: with temp pages, sync
already returns even though the data hasn't actually been synced to
disk by the server yet.)

b) cgroupv1 reclaim: a correctly written fuse server can fall into
this deadlock in one very specific scenario (e.g. if it's using legacy
cgroupv1 and reclaim encounters a folio that already has the reclaim
flag set and the caller didn't have __GFP_FS (or __GFP_IO if swap)
set), where the deadlock is triggered by:
* single-threaded FUSE server is in the middle of handling a request
that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in direct reclaim
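
To make the intent of (a) and (b) concrete, here is a minimal userspace
sketch of the "skip the wait" decision. It is only a model: the struct
and function names below are hypothetical stand-ins, not the kernel's
address_space/folio types or the code in the patches, and it assumes the
only inputs that matter are whether the folio is under writeback and
whether its mapping is flagged as indeterminate.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel structs. */
struct mapping_model {
	bool writeback_indeterminate;	/* models AS_WRITEBACK_INDETERMINATE */
};

struct folio_model {
	const struct mapping_model *mapping;	/* may be NULL */
	bool under_writeback;
};

enum wb_wait_decision {
	WB_NOTHING_TO_WAIT_FOR,	/* folio is not under writeback */
	WB_WAIT,		/* writeback is expected to finish; wait for it */
	WB_SKIP_WAIT,		/* writeback may never finish; do not wait */
};

/*
 * Decide what a waiter (sync, or reclaim in the legacy cgroupv1 case)
 * should do with a folio. Waiting on a folio whose writeback depends on
 * a userspace server is what allows the hang or deadlock, so that case
 * is skipped rather than waited on.
 */
static enum wb_wait_decision decide_wb_wait(const struct folio_model *folio)
{
	if (!folio->under_writeback)
		return WB_NOTHING_TO_WAIT_FOR;
	if (folio->mapping && folio->mapping->writeback_indeterminate)
		return WB_SKIP_WAIT;
	return WB_WAIT;
}
```

In this model a folio under writeback on a flagged mapping yields
WB_SKIP_WAIT, while the same folio on an unflagged mapping yields
WB_WAIT.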

Thanks for the feedback and discussion, everyone.
>
> --
> Cheers,
>
> David / dhildenb
>



* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-04-03 19:09           ` Joanne Koong
@ 2025-04-03 20:44             ` David Hildenbrand
  2025-04-03 22:04               ` Joanne Koong
  0 siblings, 1 reply; 124+ messages in thread
From: David Hildenbrand @ 2025-04-03 20:44 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Jingbo Xu, miklos, linux-fsdevel, shakeel.butt, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan,
	Oscar Salvador, Michal Hocko

On 03.04.25 21:09, Joanne Koong wrote:
> On Thu, Apr 3, 2025 at 2:18 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 03.04.25 05:31, Jingbo Xu wrote:
>>>
>>>
>>> On 4/3/25 5:34 AM, Joanne Koong wrote:
>>>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>>> waits may get stuck.
>>>>>>
>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>> ---
>>>>>>     mm/migrate.c | 5 ++++-
>>>>>>     1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>> index df91248755e4..fe73284e5246 100644
>>>>>> --- a/mm/migrate.c
>>>>>> +++ b/mm/migrate.c
>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>>                  */
>>>>>>                 switch (mode) {
>>>>>>                 case MIGRATE_SYNC:
>>>>>> -                     break;
>>>>>> +                     if (!src->mapping ||
>>>>>> +                         !mapping_writeback_indeterminate(src->mapping))
>>>>>> +                             break;
>>>>>> +                     fallthrough;
>>>>>>                 default:
>>>>>>                         rc = -EBUSY;
>>>>>>                         goto out;
>>>>>
>>>>> Ehm, doesn't this mean that any fuse user can essentially completely
>>>>> block CMA allocations, memory compaction, memory hotunplug, memory
>>>>> poisoning... ?!
>>>>>
>>>>> That sounds very bad.
>>>>
>>>> I took a closer look at the migration code and the FUSE code. In the
>>>> migration code in migrate_folio_unmap(), I see that any MIGRATE_SYNC
>>>> mode folio lock holds will block migration until that folio is
>>>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
>>>>
>>>>           if (!folio_trylock(src)) {
>>>>                   if (mode == MIGRATE_ASYNC)
>>>>                           goto out;
>>>>
>>>>                   if (current->flags & PF_MEMALLOC)
>>>>                           goto out;
>>>>
>>>>                   if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
>>>>                           goto out;
>>>>
>>>>                   folio_lock(src);
>>>>           }
>>>>
>>
>> Right, I raised that also in my LSF/MM talk: waiting for readahead
>> currently implies waiting for the folio lock (there is no separate
>> readahead flag like there would be for writeback).
>>
>> The more I look into this and fuse, the more I realize that what fuse
>> does is just completely broken right now.
>>
>>>> If this is all that is needed for a malicious FUSE server to block
>>>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
>>>> mappings are skipped in migration. A malicious server has easier and
>>>> more powerful ways of blocking migration in FUSE than trying to do it
>>>> through writeback. For a malicious fuse server, we in fact wouldn't
>>>> even get far enough to hit writeback - a write triggers
>>>> aops->write_begin() and a malicious server would deliberately hang
>>>> forever while the folio is locked in write_begin().
>>>
>>> Indeed it seems possible.  A malicious FUSE server may already be
>>> capable of blocking the synchronous migration in this way.
>>
>> Yes, I think the conclusion is that we should advise people against
>> using unprivileged FUSE if they care about any features that rely on
>> page migration or page reclaim.
>>
>>>
>>>
>>>>
>>>> I looked into whether we could eradicate all the places in FUSE where
>>>> we may hold the folio lock for an indeterminate amount of time,
>>>> because if that is possible, then we should not add this writeback way
>>>> for a malicious fuse server to affect migration. But I don't think we
>>>> can, for example taking one case, the folio lock needs to be held as
>>>> we read in the folio from the server when servicing page faults, else
>>>> the page cache would contain stale data if there was a concurrent
>>>> write that happened just before, which would lead to data corruption
>>>> in the filesystem. Imo, we need a more encompassing solution for all
>>>> these cases if we're serious about preventing FUSE from blocking
>>>> migration, which probably looks like a globally enforced default
>>>> timeout of some sort or an mm solution for mitigating the blast radius
>>>> of how much memory can be blocked from migration, but that is outside
>>>> the scope of this patchset and is its own standalone topic.
>>
>> I'm still skeptical about timeouts: we can only get it wrong.
>>
>> I think a proper solution is making these pages movable, which does seem
>> feasible if (a) splice is not involved and (b) we can find a way to not
>> hold the folio lock forever e.g., in the readahead case.
>>
>> Maybe readahead would have to be handled more similar to writeback
>> (e.g., having a separate flag, or using a combination of e.g.,
>> writeback+uptodate flag, not sure)
>>
>> In both cases (readahead+writeback), we'd want to call into the FS to
>> migrate a folio that is under readahead/writeback. In case of fuse
>> without splice, a migration might be doable, and as discussed, splice
>> might just be avoided.
>>
>>>>
>>>> I don't see how this patch has any additional negative impact on
>>>> memory migration for the case of malicious servers that the server
>>>> can't already (and more easily) do. In fact, this patchset if anything
>>>> helps memory given that malicious servers now can't also trigger page
>>>> allocations for temp pages that would never get freed.
>>>>
>>>
>>> If that's true, maybe we could drop this patch out of this patchset? So
>>> that both before and after this patchset, synchronous migration could be
>>> blocked by a malicious FUSE server, while the usability of contiguous
>>> memory (CMA) won't be affected.
>>
>> I had exactly the same thought: if we can block forever on the folio
>> lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all
>> completely broken.
> 
> I will resubmit this patchset and drop this patch.
> 
> I think we still need AS_WRITEBACK_INDETERMINATE for sync and legacy
> cgroupv1 reclaim scenarios:
> a) sync: sync waits on writeback so if we don't skip waiting on
> writeback for AS_WRITEBACK_INDETERMINATE mappings, then malicious fuse
> servers could make syncs hang. (There's no actual effect on sync
> behavior though with temp pages because even without temp pages, we
> return even though the data hasn't actually been synced to disk by the
> server yet)

Just curious: Are we sure there are no other cases where a malicious 
userspace could make some other folio_lock() hang forever either way?

IOW, just like for migration, isn't this just solving one part of the 
whole problem we are facing?
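
For reference, a minimal userspace sketch of the lock decision in the
migrate_folio_unmap() snippet quoted earlier in this thread. It only
models that one snippet: the enum and function names are hypothetical,
and the three boolean parameters stand in for folio_trylock() failing,
PF_MEMALLOC being set on the current task, and folio_test_uptodate().

```c
#include <stdbool.h>

/* Hypothetical model of the migration modes named in the snippet. */
enum migrate_mode_model {
	MODEL_MIGRATE_ASYNC,
	MODEL_MIGRATE_SYNC_LIGHT,
	MODEL_MIGRATE_SYNC,
};

/*
 * Returns true when the modelled path would fall through to the
 * sleeping folio_lock() call, i.e. when migration waits for whoever
 * currently holds the folio lock.
 */
static bool migration_sleeps_on_folio_lock(enum migrate_mode_model mode,
					   bool folio_is_locked,
					   bool task_has_pf_memalloc,
					   bool folio_is_uptodate)
{
	if (!folio_is_locked)
		return false;	/* trylock succeeds, nothing to wait for */
	if (mode == MODEL_MIGRATE_ASYNC)
		return false;	/* async migration gives up */
	if (task_has_pf_memalloc)
		return false;	/* allocation context gives up */
	if (mode == MODEL_MIGRATE_SYNC_LIGHT && !folio_is_uptodate)
		return false;	/* folio is likely still being read in */
	return true;		/* sleeps until the lock holder unlocks */
}
```

In this model, MIGRATE_SYNC on a locked folio always sleeps unless the
task has PF_MEMALLOC set, which is the case where a lock held across a
request to an unresponsive server blocks migration.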

> 
> b) cgroupv1 reclaim: a correctly written fuse server can fall into
> this deadlock in one very specific scenario (eg  if it's using legacy
> cgroupv1 and reclaim encounters a folio that already has the reclaim
> flag set and the caller didn't have __GFP_FS (or __GFP_IO if swap)
> set), where the deadlock is triggered by:
> * single-threaded FUSE server is in the middle of handling a request
> that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in direct reclaim

Yes, that sounds reasonable.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-04-03 20:44             ` David Hildenbrand
@ 2025-04-03 22:04               ` Joanne Koong
  0 siblings, 0 replies; 124+ messages in thread
From: Joanne Koong @ 2025-04-03 22:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jingbo Xu, miklos, linux-fsdevel, shakeel.butt, josef,
	bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan,
	Oscar Salvador, Michal Hocko

On Thu, Apr 3, 2025 at 1:44 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.04.25 21:09, Joanne Koong wrote:
> > On Thu, Apr 3, 2025 at 2:18 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 03.04.25 05:31, Jingbo Xu wrote:
> >>>
> >>>
> >>> On 4/3/25 5:34 AM, Joanne Koong wrote:
> >>>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>> On 23.11.24 00:23, Joanne Koong wrote:
> >>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> >>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> >>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> >>>>>> writeback may take an indeterminate amount of time to complete, and
> >>>>>> waits may get stuck.
> >>>>>>
> >>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> >>>>>> ---
> >>>>>>     mm/migrate.c | 5 ++++-
> >>>>>>     1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>>

> >>>>> Ehm, doesn't this mean that any fuse user can essentially completely
> >>>>> block CMA allocations, memory compaction, memory hotunplug, memory
> >>>>> poisoning... ?!
> >>>>>
> >>>>> That sounds very bad.
> >>>>
> >>>> I took a closer look at the migration code and the FUSE code. In the
> >>>> migration code in migrate_folio_unmap(), I see that any MIGRATE_SYNC
> >>>> mode folio lock holds will block migration until that folio is
> >>>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
> >>>>
> >>>>           if (!folio_trylock(src)) {
> >>>>                   if (mode == MIGRATE_ASYNC)
> >>>>                           goto out;
> >>>>
> >>>>                   if (current->flags & PF_MEMALLOC)
> >>>>                           goto out;
> >>>>
> >>>>                   if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
> >>>>                           goto out;
> >>>>
> >>>>                   folio_lock(src);
> >>>>           }
> >>>>
> >>
> >> Right, I raised that also in my LSF/MM talk: waiting for readahead
> >> currently implies waiting for the folio lock (there is no separate
> >> readahead flag like there would be for writeback).
> >>
> >> The more I look into this and fuse, the more I realize that what fuse
> >> does is just completely broken right now.
> >>
> >>>> If this is all that is needed for a malicious FUSE server to block
> >>>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
> >>>> mappings are skipped in migration. A malicious server has easier and
> >>>> more powerful ways of blocking migration in FUSE than trying to do it
> >>>> through writeback. For a malicious fuse server, we in fact wouldn't
> >>>> even get far enough to hit writeback - a write triggers
> >>>> aops->write_begin() and a malicious server would deliberately hang
> >>>> forever while the folio is locked in write_begin().
> >>>
> >>> Indeed it seems possible.  A malicious FUSE server may already be
> >>> capable of blocking the synchronous migration in this way.
> >>
> >> Yes, I think the conclusion is that we should advise people against
> >> using unprivileged FUSE if they care about any features that rely on
> >> page migration or page reclaim.
> >>
> >>>
> >>>
> >>>>
> >>>> I looked into whether we could eradicate all the places in FUSE where
> >>>> we may hold the folio lock for an indeterminate amount of time,
> >>>> because if that is possible, then we should not add this writeback way
> >>>> for a malicious fuse server to affect migration. But I don't think we
> >>>> can, for example taking one case, the folio lock needs to be held as
> >>>> we read in the folio from the server when servicing page faults, else
> >>>> the page cache would contain stale data if there was a concurrent
> >>>> write that happened just before, which would lead to data corruption
> >>>> in the filesystem. Imo, we need a more encompassing solution for all
> >>>> these cases if we're serious about preventing FUSE from blocking
> >>>> migration, which probably looks like a globally enforced default
> >>>> timeout of some sort or an mm solution for mitigating the blast radius
> >>>> of how much memory can be blocked from migration, but that is outside
> >>>> the scope of this patchset and is its own standalone topic.
> >>
> >> I'm still skeptical about timeouts: we can only get it wrong.
> >>
> >> I think a proper solution is making these pages movable, which does seem
> >> feasible if (a) splice is not involved and (b) we can find a way to not
> >> hold the folio lock forever e.g., in the readahead case.
> >>
> >> Maybe readahead would have to be handled more similar to writeback
> >> (e.g., having a separate flag, or using a combination of e.g.,
> >> writeback+uptodate flag, not sure)
> >>
> >> In both cases (readahead+writeback), we'd want to call into the FS to
> >> migrate a folio that is under readahead/writeback. In case of fuse
> >> without splice, a migration might be doable, and as discussed, splice
> >> might just be avoided.
> >>
> >>>>
> >>>> I don't see how this patch has any additional negative impact on
> >>>> memory migration for the case of malicious servers that the server
> >>>> can't already (and more easily) do. In fact, this patchset if anything
> >>>> helps memory given that malicious servers now can't also trigger page
> >>>> allocations for temp pages that would never get freed.
> >>>>
> >>>
> >>> If that's true, maybe we could drop this patch out of this patchset? So
> >>> that both before and after this patchset, synchronous migration could be
> >>> blocked by a malicious FUSE server, while the usability of contiguous
> >>> memory (CMA) won't be affected.
> >>
> >> I had exactly the same thought: if we can block forever on the folio
> >> lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all
> >> completely broken.
> >
> > I will resubmit this patchset and drop this patch.
> >
> > I think we still need AS_WRITEBACK_INDETERMINATE for sync and legacy
> > cgroupv1 reclaim scenarios:
> > a) sync: sync waits on writeback so if we don't skip waiting on
> > writeback for AS_WRITEBACK_INDETERMINATE mappings, then malicious fuse
> > servers could make syncs hang. (There's no actual effect on sync
> > behavior though with temp pages because even without temp pages, we
> > return even though the data hasn't actually been synced to disk by the
> > server yet)
>
> Just curious: Are we sure there are no other cases where a malicious
> userspace could make some other folio_lock() hang forever either way?
>

Unfortunately, there's an awful case where kswapd may get blocked
waiting for the folio lock. We encountered this in prod last week from
a well-intentioned but incorrectly written FUSE server that got stuck.
The stack trace was:

  366 kswapd0 D
  folio_wait_bit_common.llvm.15141953522965195141
  truncate_inode_pages_range
  fuse_evict_inode
  evict
  _dentry_kill
  shrink_dentry_list
  prune_dcache_sb
  super_cache_scan
  do_shrink_slab
  shrink_slab
  kswapd
  kthread
  ret_from_fork
  ret_from_fork_asm

which was narrowed down to the __filemap_get_folio(..., FGP_LOCK, ...)
call in truncate_inode_pages_range().

I'm working on a fix for this for kswapd and planning to also do a
broader audit for other places where we might get tripped up by fuse
holding a folio lock forever. I'm going to look more into the
long-term fuse fix too - the first step will be documenting all the
places where a lock may currently be held forever.

> IOW, just like for migration, isn't this just solving one part of the
> whole problem we are facing?

For sync, I didn't see any folio lock acquires anywhere, but I just
noticed that fuse's .sync_fs() implementation will block until a
server replies, so yes, a malicious server could still hold up sync
with or without temp pages. I'll drop the sync patch too in v7.

Thanks,
Joanne

>
> >
> > b) cgroupv1 reclaim: a correctly written fuse server can fall into
> > this deadlock in one very specific scenario (eg  if it's using legacy
> > cgroupv1 and reclaim encounters a folio that already has the reclaim
> > flag set and the caller didn't have __GFP_FS (or __GFP_IO if swap)
> > set), where the deadlock is triggered by:
> > * single-threaded FUSE server is in the middle of handling a request
> > that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in direct reclaim
>
> Yes, that sounds reasonable.
>
> --
> Cheers,
>
> David / dhildenb
>



end of thread, other threads:[~2025-04-03 22:04 UTC | newest]

Thread overview: 124+ messages
2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
2024-11-22 23:23 ` [PATCH v6 1/5] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
2024-11-22 23:23 ` [PATCH v6 2/5] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
2024-11-22 23:23 ` [PATCH v6 3/5] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_INDETERMINATE mappings Joanne Koong
2024-11-22 23:23 ` [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with " Joanne Koong
2024-12-19 13:05   ` David Hildenbrand
2024-12-19 14:19     ` Zi Yan
2024-12-19 15:08       ` Zi Yan
2024-12-19 15:39         ` David Hildenbrand
2024-12-19 15:47           ` Zi Yan
2024-12-19 15:50             ` David Hildenbrand
2024-12-19 15:43     ` Shakeel Butt
2024-12-19 15:47       ` David Hildenbrand
2024-12-19 15:53         ` Shakeel Butt
2024-12-19 15:55           ` Zi Yan
2024-12-19 15:56             ` Bernd Schubert
2024-12-19 16:00               ` Zi Yan
2024-12-19 16:02                 ` Zi Yan
2024-12-19 16:09                   ` Bernd Schubert
2024-12-19 16:14                     ` Zi Yan
2024-12-19 16:26                       ` Shakeel Butt
2024-12-19 16:31                         ` David Hildenbrand
2024-12-19 16:53                           ` Shakeel Butt
2024-12-19 16:22             ` Shakeel Butt
2024-12-19 16:29               ` David Hildenbrand
2024-12-19 16:40                 ` Shakeel Butt
2024-12-19 16:41                   ` David Hildenbrand
2024-12-19 17:14                     ` Shakeel Butt
2024-12-19 17:26                       ` David Hildenbrand
2024-12-19 17:30                         ` Bernd Schubert
2024-12-19 17:37                           ` Shakeel Butt
2024-12-19 17:40                             ` Bernd Schubert
2024-12-19 17:44                             ` Joanne Koong
2024-12-19 17:54                               ` Shakeel Butt
2024-12-20 11:44                                 ` David Hildenbrand
2024-12-20 12:15                                   ` Bernd Schubert
2024-12-20 14:49                                     ` David Hildenbrand
2024-12-20 15:26                                       ` Bernd Schubert
2024-12-20 18:01                                       ` Shakeel Butt
2024-12-21  2:28                                         ` Jingbo Xu
2024-12-21 16:23                                           ` David Hildenbrand
2024-12-22  2:47                                             ` Jingbo Xu
2024-12-24 11:32                                               ` David Hildenbrand
2024-12-21 16:18                                         ` David Hildenbrand
2024-12-23 22:14                                           ` Shakeel Butt
2024-12-24 12:37                                             ` David Hildenbrand
2024-12-26 15:11                                               ` Zi Yan
2024-12-26 20:13                                               ` Shakeel Butt
2024-12-26 22:02                                                 ` Bernd Schubert
2024-12-27 20:08                                                 ` Joanne Koong
2024-12-27 20:32                                                   ` Bernd Schubert
2024-12-30 17:52                                                     ` Joanne Koong
2024-12-30 10:16                                                 ` David Hildenbrand
2024-12-30 18:38                                                   ` Joanne Koong
2024-12-30 19:52                                                     ` David Hildenbrand
2024-12-30 20:11                                                       ` Shakeel Butt
2025-01-02 18:54                                                         ` Joanne Koong
2025-01-03 20:31                                                           ` David Hildenbrand
2025-01-06 10:19                                                             ` Miklos Szeredi
2025-01-06 18:17                                                               ` Shakeel Butt
2025-01-07  8:34                                                                 ` David Hildenbrand
2025-01-07 18:07                                                                   ` Shakeel Butt
2025-01-09 11:22                                                                     ` David Hildenbrand
2025-01-10 20:28                                                                       ` Jeff Layton
2025-01-10 21:13                                                                         ` David Hildenbrand
2025-01-10 22:00                                                                           ` Shakeel Butt
2025-01-13 15:27                                                                             ` David Hildenbrand
2025-01-13 21:44                                                                               ` Jeff Layton
2025-01-14  8:38                                                                                 ` Miklos Szeredi
2025-01-14  9:40                                                                                   ` Miklos Szeredi
2025-01-14  9:55                                                                                     ` Bernd Schubert
2025-01-14 10:07                                                                                       ` Miklos Szeredi
2025-01-14 18:07                                                                                         ` Joanne Koong
2025-01-14 18:58                                                                                           ` Miklos Szeredi
2025-01-14 19:12                                                                                             ` Joanne Koong
2025-01-14 20:00                                                                                               ` Miklos Szeredi
2025-01-14 20:29                                                                                               ` Jeff Layton
2025-01-14 21:40                                                                                                 ` Bernd Schubert
2025-01-23 16:06                                                                                                   ` Pavel Begunkov
2025-01-14 20:51                                                                                         ` Joanne Koong
2025-01-24 12:25                                                                                           ` David Hildenbrand
2025-01-14 15:49                                                                                     ` Jeff Layton
2025-01-24 12:29                                                                                       ` David Hildenbrand
2025-01-28 10:16                                                                                         ` Miklos Szeredi
2025-01-14 15:44                                                                                   ` Jeff Layton
2025-01-14 18:58                                                                                     ` Joanne Koong
2025-01-10 23:11                                                                           ` Jeff Layton
2025-01-10 20:16                                                                   ` Jeff Layton
2025-01-10 20:20                                                                     ` David Hildenbrand
2025-01-10 20:43                                                                       ` Jeff Layton
2025-01-10 21:00                                                                         ` David Hildenbrand
2025-01-10 21:07                                                                           ` Jeff Layton
2025-01-10 21:21                                                                             ` David Hildenbrand
2025-01-07 16:15                                                                 ` Miklos Szeredi
2025-01-08  1:40                                                                   ` Jingbo Xu
2024-12-30 20:04                                                     ` Shakeel Butt
2025-01-02 19:59                                                       ` Joanne Koong
2025-01-02 20:26                                                         ` Zi Yan
2024-12-20 21:01                                       ` Joanne Koong
2024-12-21 16:25                                         ` David Hildenbrand
2024-12-21 21:59                                           ` Bernd Schubert
2024-12-23 19:00                                             ` Joanne Koong
2024-12-26 22:44                                               ` Bernd Schubert
2024-12-27 18:25                                                 ` Joanne Koong
2024-12-19 17:55                         ` Joanne Koong
2024-12-19 18:04                           ` Bernd Schubert
2024-12-19 18:11                             ` Shakeel Butt
2024-12-20  7:55                     ` Jingbo Xu
2025-04-02 21:34     ` Joanne Koong
2025-04-03  3:31       ` Jingbo Xu
2025-04-03  9:18         ` David Hildenbrand
2025-04-03  9:25           ` Bernd Schubert
2025-04-03  9:35             ` Christian Brauner
2025-04-03 19:09           ` Joanne Koong
2025-04-03 20:44             ` David Hildenbrand
2025-04-03 22:04               ` Joanne Koong
2024-11-22 23:23 ` [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2024-11-25  9:46   ` Jingbo Xu
2024-12-12 21:55 ` [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
2024-12-13 11:52 ` Miklos Szeredi
2024-12-13 16:47   ` Shakeel Butt
2024-12-18 17:37     ` Joanne Koong
2024-12-18 17:44       ` Shakeel Butt
2024-12-18 17:53         ` Joanne Koong
