[PATCH v2 0/2] fuse: remove extra page copies in writeback

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] fuse: remove extra page copies in writeback
@ 2024-10-14 18:22 Joanne Koong
  2024-10-14 18:22 ` [PATCH v2 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock Joanne Koong
  2024-10-14 18:22 ` [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
  0 siblings, 2 replies; 63+ messages in thread
From: Joanne Koong @ 2024-10-14 18:22 UTC (permalink / raw)
  To: miklos, linux-fsdevel
  Cc: josef, bernd.schubert, jefflexu, hannes, shakeel.butt, linux-mm,
	kernel-team

Currently when we handle writeback in fuse, we allocate and copy data
to a temporary folio in order to mitigate the following deadlock scenario
that may arise if reclaim waits on writeback to complete:
* single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
  direct reclaim
(more details can be found here [1])

The first patch adds a AS_NO_WRITEBACK_RECLAIM flag to "enum mapping_flags"
to indicate that reclaim when in writeback should be skipped for filesystems
with this flag set. More details can be found in the commit message of the
first patch, but this only actually affects the case of legacy cgroupv1 when
it encounters a folio that already has the reclaim flag set (in the other
cases, we already skip reclaim when in writeback). Doing so allows us to
get rid of needing to allocate and copy over pages to a temporary folio when
handling writeback in fuse (2nd patch). Benchmarks can be found in commit
message of the 2nd patch.

Thanks,
Joanne

[1] https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com/

v1: https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/T/#t
Changes from v1 -> v2:
* Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
* Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)

Joanne Koong (2):
  mm: skip reclaiming folios in writeback contexts that may trigger
    deadlock
  fuse: remove tmp folio for writebacks and internal rb tree

 fs/fuse/file.c          | 322 ++++------------------------------------
 include/linux/pagemap.h |  11 ++
 mm/vmscan.c             |   6 +-
 3 files changed, 43 insertions(+), 296 deletions(-)

-- 
2.43.5



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock
  2024-10-14 18:22 [PATCH v2 0/2] fuse: remove extra page copies in writeback Joanne Koong
@ 2024-10-14 18:22 ` Joanne Koong
  2024-10-14 18:38   ` Shakeel Butt
  2024-10-14 18:22 ` [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
  1 sibling, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-14 18:22 UTC (permalink / raw)
  To: miklos, linux-fsdevel
  Cc: josef, bernd.schubert, jefflexu, hannes, shakeel.butt, linux-mm,
	kernel-team

Currently in shrink_folio_list(), reclaim for folios under writeback
falls into 3 different cases:
1) Reclaim is encountering an excessive number of folios under
   writeback and this folio has both the writeback and reclaim flags
   set
2) Dirty throttling is enabled (this happens if reclaim through cgroup
   is not enabled, if reclaim through cgroupv2 memcg is enabled, or
   if reclaim is on the root cgroup), or if the folio is not marked for
   immediate reclaim, or if the caller does not have __GFP_FS (or
   __GFP_IO if it's going to swap) set
3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
   set and the caller did not have __GFP_FS (or __GFP_IO if swap) set

In cases 1) and 2), we activate the folio and skip reclaiming it while
in case 3), we wait for writeback to finish on the folio and then try
to reclaim the folio again. In case 3, we wait on writeback because
cgroupv1 does not have dirty folio throttling, as such this is a
mitigation against the case where there are too many folios in writeback
with nothing else to reclaim.

The issue is that for filesystems where writeback may block, sub-optimal
workarounds need to be put in place to avoid potential deadlocks that may
arise from the case where reclaim waits on writeback. (Even though case
3 above is rare given that legacy cgroupv1 is on its way to being
deprecated, this case still needs to be accounted for)

For example, for FUSE filesystems, when a writeback is triggered on a
folio, a temporary folio is allocated and the pages are copied over to
this temporary folio so that writeback can be immediately cleared on the
original folio. This additionally requires an internal rb tree to keep
track of writeback state on the temporary folios. Benchmarks show
roughly a ~20% decrease in throughput from the overhead incurred with 4k
block size writes. The temporary folio is needed here in order to avoid
the following deadlock if reclaim waits on writeback:
* single-threaded FUSE server is in the middle of handling a request that
  needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback (eg falls into case 3
  above) that needs to be written back to the fuse server
* the FUSE server can't write back the folio since it's stuck in direct
  reclaim

This commit adds a new flag, AS_NO_WRITEBACK_RECLAIM, to "enum
mapping_flags" which filesystems can set to signify that reclaim
should not happen when the folio is already in writeback. This only has
effects on the case where cgroupv1 memcg encounters a folio under
writeback that already has the reclaim flag set (eg case 3 above), and
allows for the suboptimal workarounds added to address the "reclaim wait
on writeback" deadlock scenario to be removed.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 include/linux/pagemap.h | 11 +++++++++++
 mm/vmscan.c             |  6 ++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301..513a72b8451b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -210,6 +210,7 @@ enum mapping_flags {
 	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
 	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
+	AS_NO_WRITEBACK_RECLAIM = 9, /* Do not reclaim folios under writeback */
 	/* Bits 16-25 are used for FOLIO_ORDER */
 	AS_FOLIO_ORDER_BITS = 5,
 	AS_FOLIO_ORDER_MIN = 16,
@@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
 	return test_bit(AS_INACCESSIBLE, &mapping->flags);
 }

+static inline void mapping_set_no_writeback_reclaim(struct address_space *mapping)
+{
+	set_bit(AS_NO_WRITEBACK_RECLAIM, &mapping->flags);
+}
+
+static inline int mapping_no_writeback_reclaim(struct address_space *mapping)
+{
+	return test_bit(AS_NO_WRITEBACK_RECLAIM, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return mapping->gfp_mask;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749cdc110c74..885d496ae652 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (writeback && folio_test_reclaim(folio))
 			stat->nr_congested += nr_pages;

+		mapping = folio_mapping(folio);
+
 		/*
 		 * If a folio at the tail of the LRU is under writeback, there
 		 * are three cases to consider.
@@ -1165,7 +1167,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			/* Case 2 above */
 			} else if (writeback_throttling_sane(sc) ||
 			    !folio_test_reclaim(folio) ||
-			    !may_enter_fs(folio, sc->gfp_mask)) {
+			    !may_enter_fs(folio, sc->gfp_mask) ||
+			    (mapping && mapping_no_writeback_reclaim(mapping))) {
 				/*
 				 * This is slightly racy -
 				 * folio_end_writeback() might have
@@ -1320,7 +1323,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (folio_maybe_dma_pinned(folio))
 			goto activate_locked;

-		mapping = folio_mapping(folio);
 		if (folio_test_dirty(folio)) {
 			/*
 			 * Only kswapd can writeback filesystem folios
-- 
2.43.5

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-14 18:22 [PATCH v2 0/2] fuse: remove extra page copies in writeback Joanne Koong
  2024-10-14 18:22 ` [PATCH v2 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock Joanne Koong
@ 2024-10-14 18:22 ` Joanne Koong
  2024-10-15 10:01   ` Miklos Szeredi
  2024-10-18  9:24   ` Jingbo Xu
  1 sibling, 2 replies; 63+ messages in thread
From: Joanne Koong @ 2024-10-14 18:22 UTC (permalink / raw)
  To: miklos, linux-fsdevel
  Cc: josef, bernd.schubert, jefflexu, hannes, shakeel.butt, linux-mm,
	kernel-team

Currently, we allocate and copy data to a temporary folio when
handling writeback in order to mitigate the following deadlock scenario
that may arise if reclaim waits on writeback to complete:
* single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
  direct reclaim

To work around this, we allocate a temporary folio and copy over the
original folio to the temporary folio so that writeback can be
immediately cleared on the original folio. This additionally requires us
to maintain an internal rb tree to keep track of writeback state on the
temporary folios.

This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
FUSE folios are not reclaimed and waited on while in writeback, and
removes the temporary folio + extra copying and the internal rb tree.

fio benchmarks --
(using averages observed from 10 runs, throwing away outliers)

Setup:
sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
 ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount

fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
--numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount

        bs =  1k          4k            1M
Before  351 MiB/s     1818 MiB/s     1851 MiB/s
After   341 MiB/s     2246 MiB/s     2685 MiB/s
% diff        -3%          23%         45%

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 fs/fuse/file.c | 322 +++++--------------------------------------------
 1 file changed, 28 insertions(+), 294 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 4304e44f32e6..c3c5ddb4f66b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -415,74 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
 
 struct fuse_writepage_args {
 	struct fuse_io_args ia;
-	struct rb_node writepages_entry;
 	struct list_head queue_entry;
-	struct fuse_writepage_args *next;
 	struct inode *inode;
 	struct fuse_sync_bucket *bucket;
 };
 
-static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
-					    pgoff_t idx_from, pgoff_t idx_to)
-{
-	struct rb_node *n;
-
-	n = fi->writepages.rb_node;
-
-	while (n) {
-		struct fuse_writepage_args *wpa;
-		pgoff_t curr_index;
-
-		wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry);
-		WARN_ON(get_fuse_inode(wpa->inode) != fi);
-		curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT;
-		if (idx_from >= curr_index + wpa->ia.ap.num_pages)
-			n = n->rb_right;
-		else if (idx_to < curr_index)
-			n = n->rb_left;
-		else
-			return wpa;
-	}
-	return NULL;
-}
-
-/*
- * Check if any page in a range is under writeback
- */
-static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
-				   pgoff_t idx_to)
-{
-	struct fuse_inode *fi = get_fuse_inode(inode);
-	bool found;
-
-	if (RB_EMPTY_ROOT(&fi->writepages))
-		return false;
-
-	spin_lock(&fi->lock);
-	found = fuse_find_writeback(fi, idx_from, idx_to);
-	spin_unlock(&fi->lock);
-
-	return found;
-}
-
-static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
-{
-	return fuse_range_is_writeback(inode, index, index);
-}
-
-/*
- * Wait for page writeback to be completed.
- *
- * Since fuse doesn't rely on the VM writeback tracking, this has to
- * use some other means.
- */
-static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
-{
-	struct fuse_inode *fi = get_fuse_inode(inode);
-
-	wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
-}
-
 /*
  * Wait for all pending writepages on the inode to finish.
  *
@@ -876,7 +813,7 @@ static int fuse_do_readpage(struct file *file, struct page *page)
 	 * page-cache page, so make sure we read a properly synced
 	 * page.
 	 */
-	fuse_wait_on_page_writeback(inode, page->index);
+	folio_wait_writeback(page_folio(page));
 
 	attr_ver = fuse_get_attr_version(fm->fc);
 
@@ -1024,8 +961,7 @@ static void fuse_readahead(struct readahead_control *rac)
 		ap = &ia->ap;
 		nr_pages = __readahead_batch(rac, ap->pages, nr_pages);
 		for (i = 0; i < nr_pages; i++) {
-			fuse_wait_on_page_writeback(inode,
-						    readahead_index(rac) + i);
+			folio_wait_writeback(page_folio(ap->pages[i]));
 			ap->descs[i].length = PAGE_SIZE;
 		}
 		ap->num_pages = nr_pages;
@@ -1147,7 +1083,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 	int err;
 
 	for (i = 0; i < ap->num_pages; i++)
-		fuse_wait_on_page_writeback(inode, ap->pages[i]->index);
+		folio_wait_writeback(page_folio(ap->pages[i]));
 
 	fuse_write_args_fill(ia, ff, pos, count);
 	ia->write.in.flags = fuse_write_flags(iocb);
@@ -1583,7 +1519,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 			return res;
 		}
 	}
-	if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
+	if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
 		if (!write)
 			inode_lock(inode);
 		fuse_sync_writes(inode);
@@ -1780,13 +1716,17 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
 static void fuse_writepage_free(struct fuse_writepage_args *wpa)
 {
 	struct fuse_args_pages *ap = &wpa->ia.ap;
+	struct folio *folio;
 	int i;
 
 	if (wpa->bucket)
 		fuse_sync_bucket_dec(wpa->bucket);
 
-	for (i = 0; i < ap->num_pages; i++)
-		__free_page(ap->pages[i]);
+	for (i = 0; i < ap->num_pages; i++) {
+		folio = page_folio(ap->pages[i]);
+		folio_end_writeback(folio);
+		folio_put(folio);
+	}
 
 	fuse_file_put(wpa->ia.ff, false);
 
@@ -1799,7 +1739,7 @@ static void fuse_writepage_finish_stat(struct inode *inode, struct page *page)
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 
 	dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-	dec_node_page_state(page, NR_WRITEBACK_TEMP);
+	dec_node_page_state(page, NR_WRITEBACK);
 	wb_writeout_inc(&bdi->wb);
 }
 
@@ -1822,7 +1762,6 @@ static void fuse_send_writepage(struct fuse_mount *fm,
 __releases(fi->lock)
 __acquires(fi->lock)
 {
-	struct fuse_writepage_args *aux, *next;
 	struct fuse_inode *fi = get_fuse_inode(wpa->inode);
 	struct fuse_write_in *inarg = &wpa->ia.write.in;
 	struct fuse_args *args = &wpa->ia.ap.args;
@@ -1858,18 +1797,8 @@ __acquires(fi->lock)
 
  out_free:
 	fi->writectr--;
-	rb_erase(&wpa->writepages_entry, &fi->writepages);
 	fuse_writepage_finish(wpa);
 	spin_unlock(&fi->lock);
-
-	/* After rb_erase() aux request list is private */
-	for (aux = wpa->next; aux; aux = next) {
-		next = aux->next;
-		aux->next = NULL;
-		fuse_writepage_finish_stat(aux->inode, aux->ia.ap.pages[0]);
-		fuse_writepage_free(aux);
-	}
-
 	fuse_writepage_free(wpa);
 	spin_lock(&fi->lock);
 }
@@ -1897,43 +1826,6 @@ __acquires(fi->lock)
 	}
 }
 
-static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root,
-						struct fuse_writepage_args *wpa)
-{
-	pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT;
-	pgoff_t idx_to = idx_from + wpa->ia.ap.num_pages - 1;
-	struct rb_node **p = &root->rb_node;
-	struct rb_node  *parent = NULL;
-
-	WARN_ON(!wpa->ia.ap.num_pages);
-	while (*p) {
-		struct fuse_writepage_args *curr;
-		pgoff_t curr_index;
-
-		parent = *p;
-		curr = rb_entry(parent, struct fuse_writepage_args,
-				writepages_entry);
-		WARN_ON(curr->inode != wpa->inode);
-		curr_index = curr->ia.write.in.offset >> PAGE_SHIFT;
-
-		if (idx_from >= curr_index + curr->ia.ap.num_pages)
-			p = &(*p)->rb_right;
-		else if (idx_to < curr_index)
-			p = &(*p)->rb_left;
-		else
-			return curr;
-	}
-
-	rb_link_node(&wpa->writepages_entry, parent, p);
-	rb_insert_color(&wpa->writepages_entry, root);
-	return NULL;
-}
-
-static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa)
-{
-	WARN_ON(fuse_insert_writeback(root, wpa));
-}
-
 static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
 			       int error)
 {
@@ -1953,41 +1845,6 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
 	if (!fc->writeback_cache)
 		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
 	spin_lock(&fi->lock);
-	rb_erase(&wpa->writepages_entry, &fi->writepages);
-	while (wpa->next) {
-		struct fuse_mount *fm = get_fuse_mount(inode);
-		struct fuse_write_in *inarg = &wpa->ia.write.in;
-		struct fuse_writepage_args *next = wpa->next;
-
-		wpa->next = next->next;
-		next->next = NULL;
-		tree_insert(&fi->writepages, next);
-
-		/*
-		 * Skip fuse_flush_writepages() to make it easy to crop requests
-		 * based on primary request size.
-		 *
-		 * 1st case (trivial): there are no concurrent activities using
-		 * fuse_set/release_nowrite.  Then we're on safe side because
-		 * fuse_flush_writepages() would call fuse_send_writepage()
-		 * anyway.
-		 *
-		 * 2nd case: someone called fuse_set_nowrite and it is waiting
-		 * now for completion of all in-flight requests.  This happens
-		 * rarely and no more than once per page, so this should be
-		 * okay.
-		 *
-		 * 3rd case: someone (e.g. fuse_do_setattr()) is in the middle
-		 * of fuse_set_nowrite..fuse_release_nowrite section.  The fact
-		 * that fuse_set_nowrite returned implies that all in-flight
-		 * requests were completed along with all of their secondary
-		 * requests.  Further primary requests are blocked by negative
-		 * writectr.  Hence there cannot be any in-flight requests and
-		 * no invocations of fuse_writepage_end() while we're in
-		 * fuse_set_nowrite..fuse_release_nowrite section.
-		 */
-		fuse_send_writepage(fm, next, inarg->offset + inarg->size);
-	}
 	fi->writectr--;
 	fuse_writepage_finish(wpa);
 	spin_unlock(&fi->lock);
@@ -2074,19 +1931,18 @@ static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
 }
 
 static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
-					  struct folio *tmp_folio, uint32_t page_index)
+					  uint32_t page_index)
 {
 	struct inode *inode = folio->mapping->host;
 	struct fuse_args_pages *ap = &wpa->ia.ap;
 
-	folio_copy(tmp_folio, folio);
-
-	ap->pages[page_index] = &tmp_folio->page;
+	folio_get(folio);
+	ap->pages[page_index] = &folio->page;
 	ap->descs[page_index].offset = 0;
 	ap->descs[page_index].length = PAGE_SIZE;
 
 	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
-	inc_node_page_state(&tmp_folio->page, NR_WRITEBACK_TEMP);
+	inc_node_page_state(&folio->page, NR_WRITEBACK);
 }
 
 static struct fuse_writepage_args *fuse_writepage_args_setup(struct folio *folio,
@@ -2121,18 +1977,12 @@ static int fuse_writepage_locked(struct folio *folio)
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_writepage_args *wpa;
 	struct fuse_args_pages *ap;
-	struct folio *tmp_folio;
 	struct fuse_file *ff;
-	int error = -ENOMEM;
-
-	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
-	if (!tmp_folio)
-		goto err;
+	int error = -EIO;
 
-	error = -EIO;
 	ff = fuse_write_file_get(fi);
 	if (!ff)
-		goto err_nofile;
+		goto err;
 
 	wpa = fuse_writepage_args_setup(folio, ff);
 	error = -ENOMEM;
@@ -2143,22 +1993,17 @@ static int fuse_writepage_locked(struct folio *folio)
 	ap->num_pages = 1;
 
 	folio_start_writeback(folio);
-	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, 0);
+	fuse_writepage_args_page_fill(wpa, folio, 0);
 
 	spin_lock(&fi->lock);
-	tree_insert(&fi->writepages, wpa);
 	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
 	fuse_flush_writepages(inode);
 	spin_unlock(&fi->lock);
 
-	folio_end_writeback(folio);
-
 	return 0;
 
 err_writepage_args:
 	fuse_file_put(ff, false);
-err_nofile:
-	folio_put(tmp_folio);
 err:
 	mapping_set_error(folio->mapping, error);
 	return error;
@@ -2168,7 +2013,6 @@ struct fuse_fill_wb_data {
 	struct fuse_writepage_args *wpa;
 	struct fuse_file *ff;
 	struct inode *inode;
-	struct page **orig_pages;
 	unsigned int max_pages;
 };
 
@@ -2203,68 +2047,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data)
 	struct fuse_writepage_args *wpa = data->wpa;
 	struct inode *inode = data->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
-	int num_pages = wpa->ia.ap.num_pages;
-	int i;
 
 	spin_lock(&fi->lock);
 	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
 	fuse_flush_writepages(inode);
 	spin_unlock(&fi->lock);
-
-	for (i = 0; i < num_pages; i++)
-		end_page_writeback(data->orig_pages[i]);
-}
-
-/*
- * Check under fi->lock if the page is under writeback, and insert it onto the
- * rb_tree if not. Otherwise iterate auxiliary write requests, to see if there's
- * one already added for a page at this offset.  If there's none, then insert
- * this new request onto the auxiliary list, otherwise reuse the existing one by
- * swapping the new temp page with the old one.
- */
-static bool fuse_writepage_add(struct fuse_writepage_args *new_wpa,
-			       struct page *page)
-{
-	struct fuse_inode *fi = get_fuse_inode(new_wpa->inode);
-	struct fuse_writepage_args *tmp;
-	struct fuse_writepage_args *old_wpa;
-	struct fuse_args_pages *new_ap = &new_wpa->ia.ap;
-
-	WARN_ON(new_ap->num_pages != 0);
-	new_ap->num_pages = 1;
-
-	spin_lock(&fi->lock);
-	old_wpa = fuse_insert_writeback(&fi->writepages, new_wpa);
-	if (!old_wpa) {
-		spin_unlock(&fi->lock);
-		return true;
-	}
-
-	for (tmp = old_wpa->next; tmp; tmp = tmp->next) {
-		pgoff_t curr_index;
-
-		WARN_ON(tmp->inode != new_wpa->inode);
-		curr_index = tmp->ia.write.in.offset >> PAGE_SHIFT;
-		if (curr_index == page->index) {
-			WARN_ON(tmp->ia.ap.num_pages != 1);
-			swap(tmp->ia.ap.pages[0], new_ap->pages[0]);
-			break;
-		}
-	}
-
-	if (!tmp) {
-		new_wpa->next = old_wpa->next;
-		old_wpa->next = new_wpa;
-	}
-
-	spin_unlock(&fi->lock);
-
-	if (tmp) {
-		fuse_writepage_finish_stat(new_wpa->inode, new_ap->pages[0]);
-		fuse_writepage_free(new_wpa);
-	}
-
-	return false;
 }
 
 static bool fuse_writepage_need_send(struct fuse_conn *fc, struct page *page,
@@ -2273,15 +2060,6 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct page *page,
 {
 	WARN_ON(!ap->num_pages);
 
-	/*
-	 * Being under writeback is unlikely but possible.  For example direct
-	 * read to an mmaped fuse file will set the page dirty twice; once when
-	 * the pages are faulted with get_user_pages(), and then after the read
-	 * completed.
-	 */
-	if (fuse_page_is_writeback(data->inode, page->index))
-		return true;
-
 	/* Reached max pages */
 	if (ap->num_pages == fc->max_pages)
 		return true;
@@ -2291,7 +2069,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct page *page,
 		return true;
 
 	/* Discontinuity */
-	if (data->orig_pages[ap->num_pages - 1]->index + 1 != page->index)
+	if (ap->pages[ap->num_pages - 1]->index + 1 != page->index)
 		return true;
 
 	/* Need to grow the pages array?  If so, did the expansion fail? */
@@ -2308,9 +2086,7 @@ static int fuse_writepages_fill(struct folio *folio,
 	struct fuse_writepage_args *wpa = data->wpa;
 	struct fuse_args_pages *ap = &wpa->ia.ap;
 	struct inode *inode = data->inode;
-	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	struct folio *tmp_folio;
 	int err;
 
 	if (wpa && fuse_writepage_need_send(fc, &folio->page, ap, data)) {
@@ -2318,54 +2094,23 @@ static int fuse_writepages_fill(struct folio *folio,
 		data->wpa = NULL;
 	}
 
-	err = -ENOMEM;
-	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
-	if (!tmp_folio)
-		goto out_unlock;
-
-	/*
-	 * The page must not be redirtied until the writeout is completed
-	 * (i.e. userspace has sent a reply to the write request).  Otherwise
-	 * there could be more than one temporary page instance for each real
-	 * page.
-	 *
-	 * This is ensured by holding the page lock in page_mkwrite() while
-	 * checking fuse_page_is_writeback().  We already hold the page lock
-	 * since clear_page_dirty_for_io() and keep it held until we add the
-	 * request to the fi->writepages list and increment ap->num_pages.
-	 * After this fuse_page_is_writeback() will indicate that the page is
-	 * under writeback, so we can release the page lock.
-	 */
 	if (data->wpa == NULL) {
 		err = -ENOMEM;
 		wpa = fuse_writepage_args_setup(folio, data->ff);
-		if (!wpa) {
-			folio_put(tmp_folio);
+		if (!wpa)
 			goto out_unlock;
-		}
 		fuse_file_get(wpa->ia.ff);
 		data->max_pages = 1;
 		ap = &wpa->ia.ap;
 	}
 	folio_start_writeback(folio);
 
-	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, ap->num_pages);
-	data->orig_pages[ap->num_pages] = &folio->page;
+	fuse_writepage_args_page_fill(wpa, folio, ap->num_pages);
 
 	err = 0;
-	if (data->wpa) {
-		/*
-		 * Protected by fi->lock against concurrent access by
-		 * fuse_page_is_writeback().
-		 */
-		spin_lock(&fi->lock);
-		ap->num_pages++;
-		spin_unlock(&fi->lock);
-	} else if (fuse_writepage_add(wpa, &folio->page)) {
+	ap->num_pages++;
+	if (!data->wpa)
 		data->wpa = wpa;
-	} else {
-		folio_end_writeback(folio);
-	}
 out_unlock:
 	folio_unlock(folio);
 
@@ -2394,21 +2139,12 @@ static int fuse_writepages(struct address_space *mapping,
 	if (!data.ff)
 		return -EIO;
 
-	err = -ENOMEM;
-	data.orig_pages = kcalloc(fc->max_pages,
-				  sizeof(struct page *),
-				  GFP_NOFS);
-	if (!data.orig_pages)
-		goto out;
-
 	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
 	if (data.wpa) {
 		WARN_ON(!data.wpa->ia.ap.num_pages);
 		fuse_writepages_send(&data);
 	}
 
-	kfree(data.orig_pages);
-out:
 	fuse_file_put(data.ff, false);
 	return err;
 }
@@ -2433,7 +2169,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
 	if (IS_ERR(folio))
 		goto error;
 
-	fuse_wait_on_page_writeback(mapping->host, folio->index);
+	folio_wait_writeback(folio);
 
 	if (folio_test_uptodate(folio) || len >= folio_size(folio))
 		goto success;
@@ -2497,13 +2233,11 @@ static int fuse_launder_folio(struct folio *folio)
 {
 	int err = 0;
 	if (folio_clear_dirty_for_io(folio)) {
-		struct inode *inode = folio->mapping->host;
-
 		/* Serialize with pending writeback for the same page */
-		fuse_wait_on_page_writeback(inode, folio->index);
+		folio_wait_writeback(folio);
 		err = fuse_writepage_locked(folio);
 		if (!err)
-			fuse_wait_on_page_writeback(inode, folio->index);
+			folio_wait_writeback(folio);
 	}
 	return err;
 }
@@ -2547,7 +2281,7 @@ static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf)
 		return VM_FAULT_NOPAGE;
 	}
 
-	fuse_wait_on_page_writeback(inode, page->index);
+	folio_wait_writeback(page_folio(page));
 	return VM_FAULT_LOCKED;
 }
 
@@ -3368,6 +3102,7 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 
 	inode->i_fop = &fuse_file_operations;
 	inode->i_data.a_ops = &fuse_file_aops;
+	mapping_set_no_writeback_reclaim(&inode->i_data);
 
 	INIT_LIST_HEAD(&fi->write_files);
 	INIT_LIST_HEAD(&fi->queued_writes);
@@ -3375,7 +3110,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 	fi->iocachectr = 0;
 	init_waitqueue_head(&fi->page_waitq);
 	init_waitqueue_head(&fi->direct_io_waitq);
-	fi->writepages = RB_ROOT;
 
 	if (IS_ENABLED(CONFIG_FUSE_DAX))
 		fuse_dax_inode_init(inode, flags);
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock
  2024-10-14 18:22 ` [PATCH v2 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock Joanne Koong
@ 2024-10-14 18:38   ` Shakeel Butt
  2024-10-14 21:04     ` Joanne Koong
  0 siblings, 1 reply; 63+ messages in thread
From: Shakeel Butt @ 2024-10-14 18:38 UTC (permalink / raw)
  To: Joanne Koong
  Cc: miklos, linux-fsdevel, josef, bernd.schubert, jefflexu, hannes,
	linux-mm, kernel-team

On Mon, Oct 14, 2024 at 11:22:27AM GMT, Joanne Koong wrote:
> Currently in shrink_folio_list(), reclaim for folios under writeback
> falls into 3 different cases:
> 1) Reclaim is encountering an excessive number of folios under
>    writeback and this folio has both the writeback and reclaim flags
>    set
> 2) Dirty throttling is enabled (this happens if reclaim through cgroup
>    is not enabled, if reclaim through cgroupv2 memcg is enabled, or
>    if reclaim is on the root cgroup), or if the folio is not marked for
>    immediate reclaim, or if the caller does not have __GFP_FS (or
>    __GFP_IO if it's going to swap) set
> 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
>    set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
> 
> In cases 1) and 2), we activate the folio and skip reclaiming it while
> in case 3), we wait for writeback to finish on the folio and then try
> to reclaim the folio again. In case 3, we wait on writeback because
> cgroupv1 does not have dirty folio throttling, as such this is a
> mitigation against the case where there are too many folios in writeback
> with nothing else to reclaim.
> 
> The issue is that for filesystems where writeback may block, sub-optimal
> workarounds need to be put in place to avoid potential deadlocks that may
> arise from the case where reclaim waits on writeback. (Even though case
> 3 above is rare given that legacy cgroupv1 is on its way to being
> deprecated, this case still needs to be accounted for)
> 
> For example, for FUSE filesystems, when a writeback is triggered on a
> folio, a temporary folio is allocated and the pages are copied over to
> this temporary folio so that writeback can be immediately cleared on the
> original folio. This additionally requires an internal rb tree to keep
> track of writeback state on the temporary folios. Benchmarks show
> roughly a ~20% decrease in throughput from the overhead incurred with 4k
> block size writes. The temporary folio is needed here in order to avoid
> the following deadlock if reclaim waits on writeback:
> * single-threaded FUSE server is in the middle of handling a request that
>   needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback (eg falls into case 3
>   above) that needs to be written back to the fuse server
> * the FUSE server can't write back the folio since it's stuck in direct
>   reclaim
> 
> This commit adds a new flag, AS_NO_WRITEBACK_RECLAIM, to "enum
> mapping_flags" which filesystems can set to signify that reclaim
> should not happen when the folio is already in writeback. This only has
> effects on the case where cgroupv1 memcg encounters a folio under
> writeback that already has the reclaim flag set (eg case 3 above), and
> allows for the suboptimal workarounds added to address the "reclaim wait
> on writeback" deadlock scenario to be removed.
> 
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>  include/linux/pagemap.h | 11 +++++++++++
>  mm/vmscan.c             |  6 ++++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 68a5f1ff3301..513a72b8451b 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -210,6 +210,7 @@ enum mapping_flags {
>  	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
>  				   folio contents */
>  	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
> +	AS_NO_WRITEBACK_RECLAIM = 9, /* Do not reclaim folios under writeback */

Isn't it "Do not wait for writeback completion for folios of this
mapping during reclaim"?

>  	/* Bits 16-25 are used for FOLIO_ORDER */
>  	AS_FOLIO_ORDER_BITS = 5,
>  	AS_FOLIO_ORDER_MIN = 16,
> @@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
>  	return test_bit(AS_INACCESSIBLE, &mapping->flags);
>  }
>  
> +static inline void mapping_set_no_writeback_reclaim(struct address_space *mapping)
> +{
> +	set_bit(AS_NO_WRITEBACK_RECLAIM, &mapping->flags);
> +}
> +
> +static inline int mapping_no_writeback_reclaim(struct address_space *mapping)
> +{
> +	return test_bit(AS_NO_WRITEBACK_RECLAIM, &mapping->flags);
> +}
> +
>  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
>  {
>  	return mapping->gfp_mask;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..885d496ae652 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		if (writeback && folio_test_reclaim(folio))
>  			stat->nr_congested += nr_pages;
>  
> +		mapping = folio_mapping(folio);
> +
>  		/*
>  		 * If a folio at the tail of the LRU is under writeback, there
>  		 * are three cases to consider.
> @@ -1165,7 +1167,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  			/* Case 2 above */
>  			} else if (writeback_throttling_sane(sc) ||
>  			    !folio_test_reclaim(folio) ||
> -			    !may_enter_fs(folio, sc->gfp_mask)) {
> +			    !may_enter_fs(folio, sc->gfp_mask) ||
> +			    (mapping && mapping_no_writeback_reclaim(mapping))) {
>  				/*
>  				 * This is slightly racy -
>  				 * folio_end_writeback() might have
> @@ -1320,7 +1323,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		if (folio_maybe_dma_pinned(folio))
>  			goto activate_locked;
>  
> -		mapping = folio_mapping(folio);

We should not remove the above line as it will make anon pages added to
the swap in this code path skip reclaim. Basically the mapping of anon
pages which are not yet in swap cache, will be null at the newly added
mapping = folio_mapping(folio) but will be swapcache mapping at this
location (there is add_to_swap() in between), so if we remove this line,
the kernel will skip the reclaim of that folio in this iteration. This
will increase memory pressure.

>  		if (folio_test_dirty(folio)) {
>  			/*
>  			 * Only kswapd can writeback filesystem folios
> -- 
> 2.43.5
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock
  2024-10-14 18:38   ` Shakeel Butt
@ 2024-10-14 21:04     ` Joanne Koong
  2024-10-14 23:57       ` Shakeel Butt
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-14 21:04 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: miklos, linux-fsdevel, josef, bernd.schubert, jefflexu, hannes,
	linux-mm, kernel-team

On Mon, Oct 14, 2024 at 11:38 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Oct 14, 2024 at 11:22:27AM GMT, Joanne Koong wrote:
> > Currently in shrink_folio_list(), reclaim for folios under writeback
> > falls into 3 different cases:
> > 1) Reclaim is encountering an excessive number of folios under
> >    writeback and this folio has both the writeback and reclaim flags
> >    set
> > 2) Dirty throttling is enabled (this happens if reclaim through cgroup
> >    is not enabled, if reclaim through cgroupv2 memcg is enabled, or
> >    if reclaim is on the root cgroup), or if the folio is not marked for
> >    immediate reclaim, or if the caller does not have __GFP_FS (or
> >    __GFP_IO if it's going to swap) set
> > 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
> >    set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
> >
> > In cases 1) and 2), we activate the folio and skip reclaiming it while
> > in case 3), we wait for writeback to finish on the folio and then try
> > to reclaim the folio again. In case 3, we wait on writeback because
> > cgroupv1 does not have dirty folio throttling, as such this is a
> > mitigation against the case where there are too many folios in writeback
> > with nothing else to reclaim.
> >
> > The issue is that for filesystems where writeback may block, sub-optimal
> > workarounds need to be put in place to avoid potential deadlocks that may
> > arise from the case where reclaim waits on writeback. (Even though case
> > 3 above is rare given that legacy cgroupv1 is on its way to being
> > deprecated, this case still needs to be accounted for)
> >
> > For example, for FUSE filesystems, when a writeback is triggered on a
> > folio, a temporary folio is allocated and the pages are copied over to
> > this temporary folio so that writeback can be immediately cleared on the
> > original folio. This additionally requires an internal rb tree to keep
> > track of writeback state on the temporary folios. Benchmarks show
> > roughly a ~20% decrease in throughput from the overhead incurred with 4k
> > block size writes. The temporary folio is needed here in order to avoid
> > the following deadlock if reclaim waits on writeback:
> > * single-threaded FUSE server is in the middle of handling a request that
> >   needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback (eg falls into case 3
> >   above) that needs to be written back to the fuse server
> > * the FUSE server can't write back the folio since it's stuck in direct
> >   reclaim
> >
> > This commit adds a new flag, AS_NO_WRITEBACK_RECLAIM, to "enum
> > mapping_flags" which filesystems can set to signify that reclaim
> > should not happen when the folio is already in writeback. This only has
> > effects on the case where cgroupv1 memcg encounters a folio under
> > writeback that already has the reclaim flag set (eg case 3 above), and
> > allows for the suboptimal workarounds added to address the "reclaim wait
> > on writeback" deadlock scenario to be removed.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> >  include/linux/pagemap.h | 11 +++++++++++
> >  mm/vmscan.c             |  6 ++++--
> >  2 files changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 68a5f1ff3301..513a72b8451b 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -210,6 +210,7 @@ enum mapping_flags {
> >       AS_STABLE_WRITES = 7,   /* must wait for writeback before modifying
> >                                  folio contents */
> >       AS_INACCESSIBLE = 8,    /* Do not attempt direct R/W access to the mapping */
> > +     AS_NO_WRITEBACK_RECLAIM = 9, /* Do not reclaim folios under writeback */
>
> Isn't it "Do not wait for writeback completion for folios of this
> mapping during reclaim"?

I think if we make this "don't wait for writeback completion for
folios of this mapping during reclaim", then the
mapping_no_writeback_reclaim check in shrink_folio_list() below would
need to be something like this instead:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 885d496ae652..37108d633d21 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1190,7 +1190,8 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,
                        /* Case 3 above */
                        } else {
                                folio_unlock(folio);
-                               folio_wait_writeback(folio);
+                               if (mapping &&
!mapping_no_writeback_reclaim(mapping))
+                                       folio_wait_writeback(folio);
                                /* then go back and try same folio again */
                                list_add_tail(&folio->lru, folio_list);
                                continue;

which I'm not sure if that would be the correct logic here or not.
I'm not too familiar with vmscan, but it seems like if we are going to
reclaim the folio then we should wait on it or else we would just keep
trying the same folio again and again and wasting cpu cycles. In this
current patch (if I'm understanding this mm code correctly), we skip
reclaiming the folio altogether if it's under writeback.

Either one (don't wait for writeback during reclaim or don't reclaim
under writeback) works for mitigating the potential fuse deadlock,
but I was thinking "don't reclaim under writeback" might also be more
generalizable to other filesystems.

I'm happy to go with whichever you think would be best.

And thanks again for taking a look at this patch, Shakeel!

>
> >       /* Bits 16-25 are used for FOLIO_ORDER */
> >       AS_FOLIO_ORDER_BITS = 5,
> >       AS_FOLIO_ORDER_MIN = 16,
> > @@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
> >       return test_bit(AS_INACCESSIBLE, &mapping->flags);
> >  }
> >
> > +static inline void mapping_set_no_writeback_reclaim(struct address_space *mapping)
> > +{
> > +     set_bit(AS_NO_WRITEBACK_RECLAIM, &mapping->flags);
> > +}
> > +
> > +static inline int mapping_no_writeback_reclaim(struct address_space *mapping)
> > +{
> > +     return test_bit(AS_NO_WRITEBACK_RECLAIM, &mapping->flags);
> > +}
> > +
> >  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
> >  {
> >       return mapping->gfp_mask;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 749cdc110c74..885d496ae652 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >               if (writeback && folio_test_reclaim(folio))
> >                       stat->nr_congested += nr_pages;
> >
> > +             mapping = folio_mapping(folio);
> > +
> >               /*
> >                * If a folio at the tail of the LRU is under writeback, there
> >                * are three cases to consider.
> > @@ -1165,7 +1167,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                       /* Case 2 above */
> >                       } else if (writeback_throttling_sane(sc) ||
> >                           !folio_test_reclaim(folio) ||
> > -                         !may_enter_fs(folio, sc->gfp_mask)) {
> > +                         !may_enter_fs(folio, sc->gfp_mask) ||
> > +                         (mapping && mapping_no_writeback_reclaim(mapping))) {
> >                               /*
> >                                * This is slightly racy -
> >                                * folio_end_writeback() might have
> > @@ -1320,7 +1323,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >               if (folio_maybe_dma_pinned(folio))
> >                       goto activate_locked;
> >
> > -             mapping = folio_mapping(folio);
>
> We should not remove the above line as it will make anon pages added to
> the swap in this code path skip reclaim. Basically the mapping of anon
> pages which are not yet in swap cache, will be null at the newly added
> mapping = folio_mapping(folio) but will be swapcache mapping at this
> location (there is add_to_swap() in between), so if we remove this line,
> the kernel will skip the reclaim of that folio in this iteration. This
> will increase memory pressure.

Thanks for the explanation! I will add this line back in for v3.

>
> >               if (folio_test_dirty(folio)) {
> >                       /*
> >                        * Only kswapd can writeback filesystem folios
> > --
> > 2.43.5
> >


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock
  2024-10-14 21:04     ` Joanne Koong
@ 2024-10-14 23:57       ` Shakeel Butt
  2024-10-15 16:59         ` Joanne Koong
  0 siblings, 1 reply; 63+ messages in thread
From: Shakeel Butt @ 2024-10-14 23:57 UTC (permalink / raw)
  To: Joanne Koong
  Cc: miklos, linux-fsdevel, josef, bernd.schubert, jefflexu, hannes,
	linux-mm, kernel-team

On Mon, Oct 14, 2024 at 02:04:07PM GMT, Joanne Koong wrote:
> On Mon, Oct 14, 2024 at 11:38 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Mon, Oct 14, 2024 at 11:22:27AM GMT, Joanne Koong wrote:
> > > Currently in shrink_folio_list(), reclaim for folios under writeback
> > > falls into 3 different cases:
> > > 1) Reclaim is encountering an excessive number of folios under
> > >    writeback and this folio has both the writeback and reclaim flags
> > >    set
> > > 2) Dirty throttling is enabled (this happens if reclaim through cgroup
> > >    is not enabled, if reclaim through cgroupv2 memcg is enabled, or
> > >    if reclaim is on the root cgroup), or if the folio is not marked for
> > >    immediate reclaim, or if the caller does not have __GFP_FS (or
> > >    __GFP_IO if it's going to swap) set
> > > 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
> > >    set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
> > >
> > > In cases 1) and 2), we activate the folio and skip reclaiming it while
> > > in case 3), we wait for writeback to finish on the folio and then try
> > > to reclaim the folio again. In case 3, we wait on writeback because
> > > cgroupv1 does not have dirty folio throttling, as such this is a
> > > mitigation against the case where there are too many folios in writeback
> > > with nothing else to reclaim.
> > >
> > > The issue is that for filesystems where writeback may block, sub-optimal
> > > workarounds need to be put in place to avoid potential deadlocks that may
> > > arise from the case where reclaim waits on writeback. (Even though case
> > > 3 above is rare given that legacy cgroupv1 is on its way to being
> > > deprecated, this case still needs to be accounted for)
> > >
> > > For example, for FUSE filesystems, when a writeback is triggered on a
> > > folio, a temporary folio is allocated and the pages are copied over to
> > > this temporary folio so that writeback can be immediately cleared on the
> > > original folio. This additionally requires an internal rb tree to keep
> > > track of writeback state on the temporary folios. Benchmarks show
> > > roughly a ~20% decrease in throughput from the overhead incurred with 4k
> > > block size writes. The temporary folio is needed here in order to avoid
> > > the following deadlock if reclaim waits on writeback:
> > > * single-threaded FUSE server is in the middle of handling a request that
> > >   needs a memory allocation
> > > * memory allocation triggers direct reclaim
> > > * direct reclaim waits on a folio under writeback (eg falls into case 3
> > >   above) that needs to be written back to the fuse server
> > > * the FUSE server can't write back the folio since it's stuck in direct
> > >   reclaim
> > >
> > > This commit adds a new flag, AS_NO_WRITEBACK_RECLAIM, to "enum
> > > mapping_flags" which filesystems can set to signify that reclaim
> > > should not happen when the folio is already in writeback. This only has
> > > effects on the case where cgroupv1 memcg encounters a folio under
> > > writeback that already has the reclaim flag set (eg case 3 above), and
> > > allows for the suboptimal workarounds added to address the "reclaim wait
> > > on writeback" deadlock scenario to be removed.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > ---
> > >  include/linux/pagemap.h | 11 +++++++++++
> > >  mm/vmscan.c             |  6 ++++--
> > >  2 files changed, 15 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > index 68a5f1ff3301..513a72b8451b 100644
> > > --- a/include/linux/pagemap.h
> > > +++ b/include/linux/pagemap.h
> > > @@ -210,6 +210,7 @@ enum mapping_flags {
> > >       AS_STABLE_WRITES = 7,   /* must wait for writeback before modifying
> > >                                  folio contents */
> > >       AS_INACCESSIBLE = 8,    /* Do not attempt direct R/W access to the mapping */
> > > +     AS_NO_WRITEBACK_RECLAIM = 9, /* Do not reclaim folios under writeback */
> >
> > Isn't it "Do not wait for writeback completion for folios of this
> > mapping during reclaim"?
> 
> I think if we make this "don't wait for writeback completion for
> folios of this mapping during reclaim", then the
> mapping_no_writeback_reclaim check in shrink_folio_list() below would
> need to be something like this instead:
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 885d496ae652..37108d633d21 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1190,7 +1190,8 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
>                         /* Case 3 above */
>                         } else {
>                                 folio_unlock(folio);
> -                               folio_wait_writeback(folio);
> +                               if (mapping &&
> !mapping_no_writeback_reclaim(mapping))
> +                                       folio_wait_writeback(folio);
>                                 /* then go back and try same folio again */
>                                 list_add_tail(&folio->lru, folio_list);
>                                 continue;

The difference between the outcome for Case 2 and Case 3 is that in Case
2 the kernel is putting the folio in an active list and thus the kernel
will not try to reclaim it in near future but in Case 3, the kernel is
putting back in the list from which it is currently reclaiming meaning
the next iteration will try to reclaim the same folio.

We definitely don't want it in Case 3.

> 
> which I'm not sure if that would be the correct logic here or not.
> I'm not too familiar with vmscan, but it seems like if we are going to
> reclaim the folio then we should wait on it or else we would just keep
> trying the same folio again and again and wasting cpu cycles. In this
> current patch (if I'm understanding this mm code correctly), we skip
> reclaiming the folio altogether if it's under writeback.
> 
> Either one (don't wait for writeback during reclaim or don't reclaim
> under writeback) works for mitigating the potential fuse deadlock,
> but I was thinking "don't reclaim under writeback" might also be more
> generalizable to other filesystems.
> 
> I'm happy to go with whichever you think would be best.

Just to be clear that we are on the same page that this scenario should
be handled in Case 2. Our difference is on how to describe the scenario.
To me the reason we are taking the path of Case 2 is because we don't
want what Case 3 is doing and thus wrote that. Anyways I don't think it
is that importatnt, use whatever working seems reasonable to you.

BTW you will need to update the comment for Case 2 which is above code
block.

thanks,
Shakeel


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-14 18:22 ` [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
@ 2024-10-15 10:01   ` Miklos Szeredi
  2024-10-15 17:06     ` Joanne Koong
                       ` (2 more replies)
  2024-10-18  9:24   ` Jingbo Xu
  1 sibling, 3 replies; 63+ messages in thread
From: Miklos Szeredi @ 2024-10-15 10:01 UTC (permalink / raw)
  To: Joanne Koong
  Cc: linux-fsdevel, josef, bernd.schubert, jefflexu, hannes,
	shakeel.butt, linux-mm, kernel-team

On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:

> This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> FUSE folios are not reclaimed and waited on while in writeback, and
> removes the temporary folio + extra copying and the internal rb tree.

What about sync(2)?   And page migration?

Hopefully there are no other cases, but I think a careful review of
places where generic code waits for writeback is needed before we can
say for sure.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock
  2024-10-14 23:57       ` Shakeel Butt
@ 2024-10-15 16:59         ` Joanne Koong
  0 siblings, 0 replies; 63+ messages in thread
From: Joanne Koong @ 2024-10-15 16:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: miklos, linux-fsdevel, josef, bernd.schubert, jefflexu, hannes,
	linux-mm, kernel-team

On Mon, Oct 14, 2024 at 4:57 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Oct 14, 2024 at 02:04:07PM GMT, Joanne Koong wrote:
> > On Mon, Oct 14, 2024 at 11:38 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Mon, Oct 14, 2024 at 11:22:27AM GMT, Joanne Koong wrote:
> > > > Currently in shrink_folio_list(), reclaim for folios under writeback
> > > > falls into 3 different cases:
> > > > 1) Reclaim is encountering an excessive number of folios under
> > > >    writeback and this folio has both the writeback and reclaim flags
> > > >    set
> > > > 2) Dirty throttling is enabled (this happens if reclaim through cgroup
> > > >    is not enabled, if reclaim through cgroupv2 memcg is enabled, or
> > > >    if reclaim is on the root cgroup), or if the folio is not marked for
> > > >    immediate reclaim, or if the caller does not have __GFP_FS (or
> > > >    __GFP_IO if it's going to swap) set
> > > > 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
> > > >    set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
> > > >
> > > > In cases 1) and 2), we activate the folio and skip reclaiming it while
> > > > in case 3), we wait for writeback to finish on the folio and then try
> > > > to reclaim the folio again. In case 3, we wait on writeback because
> > > > cgroupv1 does not have dirty folio throttling, as such this is a
> > > > mitigation against the case where there are too many folios in writeback
> > > > with nothing else to reclaim.
> > > >
> > > > The issue is that for filesystems where writeback may block, sub-optimal
> > > > workarounds need to be put in place to avoid potential deadlocks that may
> > > > arise from the case where reclaim waits on writeback. (Even though case
> > > > 3 above is rare given that legacy cgroupv1 is on its way to being
> > > > deprecated, this case still needs to be accounted for)
> > > >
> > > > For example, for FUSE filesystems, when a writeback is triggered on a
> > > > folio, a temporary folio is allocated and the pages are copied over to
> > > > this temporary folio so that writeback can be immediately cleared on the
> > > > original folio. This additionally requires an internal rb tree to keep
> > > > track of writeback state on the temporary folios. Benchmarks show
> > > > roughly a ~20% decrease in throughput from the overhead incurred with 4k
> > > > block size writes. The temporary folio is needed here in order to avoid
> > > > the following deadlock if reclaim waits on writeback:
> > > > * single-threaded FUSE server is in the middle of handling a request that
> > > >   needs a memory allocation
> > > > * memory allocation triggers direct reclaim
> > > > * direct reclaim waits on a folio under writeback (eg falls into case 3
> > > >   above) that needs to be written back to the fuse server
> > > > * the FUSE server can't write back the folio since it's stuck in direct
> > > >   reclaim
> > > >
> > > > This commit adds a new flag, AS_NO_WRITEBACK_RECLAIM, to "enum
> > > > mapping_flags" which filesystems can set to signify that reclaim
> > > > should not happen when the folio is already in writeback. This only has
> > > > effects on the case where cgroupv1 memcg encounters a folio under
> > > > writeback that already has the reclaim flag set (eg case 3 above), and
> > > > allows for the suboptimal workarounds added to address the "reclaim wait
> > > > on writeback" deadlock scenario to be removed.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > ---
> > > >  include/linux/pagemap.h | 11 +++++++++++
> > > >  mm/vmscan.c             |  6 ++++--
> > > >  2 files changed, 15 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > > index 68a5f1ff3301..513a72b8451b 100644
> > > > --- a/include/linux/pagemap.h
> > > > +++ b/include/linux/pagemap.h
> > > > @@ -210,6 +210,7 @@ enum mapping_flags {
> > > >       AS_STABLE_WRITES = 7,   /* must wait for writeback before modifying
> > > >                                  folio contents */
> > > >       AS_INACCESSIBLE = 8,    /* Do not attempt direct R/W access to the mapping */
> > > > +     AS_NO_WRITEBACK_RECLAIM = 9, /* Do not reclaim folios under writeback */
> > >
> > > Isn't it "Do not wait for writeback completion for folios of this
> > > mapping during reclaim"?
> >
> > I think if we make this "don't wait for writeback completion for
> > folios of this mapping during reclaim", then the
> > mapping_no_writeback_reclaim check in shrink_folio_list() below would
> > need to be something like this instead:
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 885d496ae652..37108d633d21 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1190,7 +1190,8 @@ static unsigned int shrink_folio_list(struct
> > list_head *folio_list,
> >                         /* Case 3 above */
> >                         } else {
> >                                 folio_unlock(folio);
> > -                               folio_wait_writeback(folio);
> > +                               if (mapping &&
> > !mapping_no_writeback_reclaim(mapping))
> > +                                       folio_wait_writeback(folio);
> >                                 /* then go back and try same folio again */
> >                                 list_add_tail(&folio->lru, folio_list);
> >                                 continue;
>
> The difference between the outcome for Case 2 and Case 3 is that in Case
> 2 the kernel is putting the folio in an active list and thus the kernel
> will not try to reclaim it in near future but in Case 3, the kernel is
> putting back in the list from which it is currently reclaiming meaning
> the next iteration will try to reclaim the same folio.
>
> We definitely don't want it in Case 3.
>
> >
> > which I'm not sure if that would be the correct logic here or not.
> > I'm not too familiar with vmscan, but it seems like if we are going to
> > reclaim the folio then we should wait on it or else we would just keep
> > trying the same folio again and again and wasting cpu cycles. In this
> > current patch (if I'm understanding this mm code correctly), we skip
> > reclaiming the folio altogether if it's under writeback.
> >
> > Either one (don't wait for writeback during reclaim or don't reclaim
> > under writeback) works for mitigating the potential fuse deadlock,
> > but I was thinking "don't reclaim under writeback" might also be more
> > generalizable to other filesystems.
> >
> > I'm happy to go with whichever you think would be best.
>
> Just to be clear that we are on the same page that this scenario should
> be handled in Case 2. Our difference is on how to describe the scenario.
> To me the reason we are taking the path of Case 2 is because we don't
> want what Case 3 is doing and thus wrote that. Anyways I don't think it
> is that importatnt, use whatever working seems reasonable to you.

Gotcha, thanks for clarifying. Your point makes sense to me - if we go
this route we should also probably change the name to
AS_NO_RECLAIM_WAIT_WRITEBACK or something like that to make it more
congruent.

For now, I'll keep it as AS_NO_WRITEBACK_RECLAIM because I think that
might be more generalizable of a use case for other filesystems too.

>
> BTW you will need to update the comment for Case 2 which is above code
> block.

Great point, I will do this in v3.


Thanks,
Joanne
>
> thanks,
> Shakeel


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-15 10:01   ` Miklos Szeredi
@ 2024-10-15 17:06     ` Joanne Koong
  2024-10-15 19:17       ` Shakeel Butt
  2024-10-16  9:56     ` Jingbo Xu
  2024-10-18  1:30     ` Joanne Koong
  2 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-15 17:06 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, josef, bernd.schubert, jefflexu, hannes,
	shakeel.butt, linux-mm, kernel-team

On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > FUSE folios are not reclaimed and waited on while in writeback, and
> > removes the temporary folio + extra copying and the internal rb tree.
>
> What about sync(2)?   And page migration?
>
> Hopefully there are no other cases, but I think a careful review of
> places where generic code waits for writeback is needed before we can
> say for sure.

Sounds good, that's a great point. I'll audit the other places in the
mm code where we might call folio_wait_writeback() and report back
what I find.


Thanks,
Joanne

>
> Thanks,
> Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-15 17:06     ` Joanne Koong
@ 2024-10-15 19:17       ` Shakeel Butt
  2024-10-16  9:44         ` Jingbo Xu
  2024-10-16  9:51         ` Miklos Szeredi
  0 siblings, 2 replies; 63+ messages in thread
From: Shakeel Butt @ 2024-10-15 19:17 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Tue, Oct 15, 2024 at 10:06:52AM GMT, Joanne Koong wrote:
> On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > > FUSE folios are not reclaimed and waited on while in writeback, and
> > > removes the temporary folio + extra copying and the internal rb tree.
> >
> > What about sync(2)?   And page migration?
> >
> > Hopefully there are no other cases, but I think a careful review of
> > places where generic code waits for writeback is needed before we can
> > say for sure.
> 
> Sounds good, that's a great point. I'll audit the other places in the
> mm code where we might call folio_wait_writeback() and report back
> what I find.
> 
> 

So, any operation that the fuse server can do which can cause wait on
writeback on the folios backed by the fuse is problematic. We know about
one scenario i.e. memory allocation causing reclaim which may do the
wait on unrelated folios which may be backed by the fuse server.

Now there are ways fuse server can shoot itself on the foot. Like sync()
syscall or accessing the folios backed by itself. The quesion is should
we really need to protect fuse from such cases?



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-15 19:17       ` Shakeel Butt
@ 2024-10-16  9:44         ` Jingbo Xu
  2024-10-16  9:57           ` Miklos Szeredi
  2024-10-16  9:51         ` Miklos Szeredi
  1 sibling, 1 reply; 63+ messages in thread
From: Jingbo Xu @ 2024-10-16  9:44 UTC (permalink / raw)
  To: Shakeel Butt, Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel, josef, bernd.schubert, hannes,
	linux-mm, kernel-team

On 10/16/24 3:17 AM, Shakeel Butt wrote:
> On Tue, Oct 15, 2024 at 10:06:52AM GMT, Joanne Koong wrote:
>> On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>
>>> On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
>>>
>>>> This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
>>>> FUSE folios are not reclaimed and waited on while in writeback, and
>>>> removes the temporary folio + extra copying and the internal rb tree.
>>>
>>> What about sync(2)?   And page migration?
>>>
>>> Hopefully there are no other cases, but I think a careful review of
>>> places where generic code waits for writeback is needed before we can
>>> say for sure.
>>
>> Sounds good, that's a great point. I'll audit the other places in the
>> mm code where we might call folio_wait_writeback() and report back
>> what I find.
>>
>>
> 
> So, any operation that the fuse server can do which can cause wait on
> writeback on the folios backed by the fuse is problematic. We know about
> one scenario i.e. memory allocation causing reclaim which may do the
> wait on unrelated folios which may be backed by the fuse server.
> 
> Now there are ways fuse server can shoot itself on the foot. Like sync()
> syscall or accessing the folios backed by itself. The quesion is should
> we really need to protect fuse from such cases?

I think there are several scenarios shall be evaluated case by case:

1) a non-malicious fuse daemon wants to allocate some memory when
processing a fuse request, which in turn leads to memory reclaim and
thus waiting on the writeback of fuse dirty pages - a deadlock here.
This is reasonable and also the target scenario that this series wants
to fix.

2) a malicious fuse daemon refuses to process any request; or a buggy or
not well-written fuse daemon as you described, e.g. may call sync(2)
itself or access page cache backed by itself, then
  2.1) any unrelated user process attempting to initiate a sync(2)
itself, will hang there.  This scenario is also unexpected and shall be
fixed.
  2.2) any direct user of fuse filesystem (e.g. access files backed by
fuse filesystem) will hang in this case.  IMHO this is expected, and the
impact is affordable as it is controlled within certain range (the
direct user of the fs).

-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-15 19:17       ` Shakeel Butt
  2024-10-16  9:44         ` Jingbo Xu
@ 2024-10-16  9:51         ` Miklos Szeredi
  2024-10-16 17:52           ` Shakeel Butt
  1 sibling, 1 reply; 63+ messages in thread
From: Miklos Szeredi @ 2024-10-16  9:51 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Joanne Koong, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Tue, 15 Oct 2024 at 21:17, Shakeel Butt <shakeel.butt@linux.dev> wrote:

> So, any operation that the fuse server can do which can cause wait on
> writeback on the folios backed by the fuse is problematic. We know about
> one scenario i.e. memory allocation causing reclaim which may do the
> wait on unrelated folios which may be backed by the fuse server.
>
> Now there are ways fuse server can shoot itself on the foot. Like sync()
> syscall or accessing the folios backed by itself. The quesion is should
> we really need to protect fuse from such cases?

That's not the issue with sync(2).  The issue is that a fuse server
can deny service to an unrelated and possibly higher privilege task by
blocking writeback.  We really don't want that to happen.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-15 10:01   ` Miklos Szeredi
  2024-10-15 17:06     ` Joanne Koong
@ 2024-10-16  9:56     ` Jingbo Xu
  2024-10-16 10:00       ` Miklos Szeredi
  2024-10-18  1:30     ` Joanne Koong
  2 siblings, 1 reply; 63+ messages in thread
From: Jingbo Xu @ 2024-10-16  9:56 UTC (permalink / raw)
  To: Miklos Szeredi, Joanne Koong
  Cc: linux-fsdevel, josef, bernd.schubert, hannes, shakeel.butt,
	linux-mm, kernel-team



On 10/15/24 6:01 PM, Miklos Szeredi wrote:
> On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
> 
>> This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
>> FUSE folios are not reclaimed and waited on while in writeback, and
>> removes the temporary folio + extra copying and the internal rb tree.
> 
> What about sync(2)? 

FYI The posix sync(2) says, "The writing, although scheduled, is not
necessarily complete upon return from sync()." [1]

Thus hopefully it won't break the posix semantics of sync(2) down if we
skip the waiting on the writeback of fuse pages.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html

-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-16  9:44         ` Jingbo Xu
@ 2024-10-16  9:57           ` Miklos Szeredi
  0 siblings, 0 replies; 63+ messages in thread
From: Miklos Szeredi @ 2024-10-16  9:57 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Shakeel Butt, Joanne Koong, linux-fsdevel, josef, bernd.schubert,
	hannes, linux-mm, kernel-team

On Wed, 16 Oct 2024 at 11:44, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> 1) a non-malicious fuse daemon wants to allocate some memory when
> processing a fuse request, which in turn leads to memory reclaim and
> thus waiting on the writeback of fuse dirty pages - a deadlock here.
> This is reasonable and also the target scenario that this series wants
> to fix.
>
> 2) a malicious fuse daemon refuses to process any request; or a buggy or
> not well-written fuse daemon as you described, e.g. may call sync(2)
> itself or access page cache backed by itself, then
>   2.1) any unrelated user process attempting to initiate a sync(2)
> itself, will hang there.  This scenario is also unexpected and shall be
> fixed.

Exactly.

We only care about

 - properly written server not deadlocking

 - buggy or malicious server not denying service to unrelated tasks,
where unrelated means it would not otherwise be able to access the
fuse mount. Typically this separation is done with a user namespace or
-oallow_other.

Thanks,
Miklos


>   2.2) any direct user of fuse filesystem (e.g. access files backed by
> fuse filesystem) will hang in this case.  IMHO this is expected, and the
> impact is affordable as it is controlled within certain range (the
> direct user of the fs).
>
> --
> Thanks,
> Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-16  9:56     ` Jingbo Xu
@ 2024-10-16 10:00       ` Miklos Szeredi
  0 siblings, 0 replies; 63+ messages in thread
From: Miklos Szeredi @ 2024-10-16 10:00 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Joanne Koong, linux-fsdevel, josef, bernd.schubert, hannes,
	shakeel.butt, linux-mm, kernel-team

On Wed, 16 Oct 2024 at 11:56, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> FYI The posix sync(2) says, "The writing, although scheduled, is not
> necessarily complete upon return from sync()." [1]
>
> Thus hopefully it won't break the posix semantics of sync(2) down if we
> skip the waiting on the writeback of fuse pages.

Since this has been the case for fuse since day one and no one
complained, this is not a worry.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-16  9:51         ` Miklos Szeredi
@ 2024-10-16 17:52           ` Shakeel Butt
  2024-10-16 18:37             ` Miklos Szeredi
  0 siblings, 1 reply; 63+ messages in thread
From: Shakeel Butt @ 2024-10-16 17:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Wed, Oct 16, 2024 at 11:51:39AM GMT, Miklos Szeredi wrote:
> On Tue, 15 Oct 2024 at 21:17, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > So, any operation that the fuse server can do which can cause wait on
> > writeback on the folios backed by the fuse is problematic. We know about
> > one scenario i.e. memory allocation causing reclaim which may do the
> > wait on unrelated folios which may be backed by the fuse server.
> >
> > Now there are ways fuse server can shoot itself on the foot. Like sync()
> > syscall or accessing the folios backed by itself. The quesion is should
> > we really need to protect fuse from such cases?
> 
> That's not the issue with sync(2).  The issue is that a fuse server
> can deny service to an unrelated and possibly higher privilege task by
> blocking writeback.  We really don't want that to happen.

If I understand you correctly, you are saying fuse server doing wrong
things like accessing the files it is serving is not something we need
to care about. More specifically all the operations which directly
manipulates the folios it is serving (like migration) should be ignored.
Is this correct?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-16 17:52           ` Shakeel Butt
@ 2024-10-16 18:37             ` Miklos Szeredi
  2024-10-16 21:27               ` Shakeel Butt
  0 siblings, 1 reply; 63+ messages in thread
From: Miklos Szeredi @ 2024-10-16 18:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Joanne Koong, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Wed, 16 Oct 2024 at 19:52, Shakeel Butt <shakeel.butt@linux.dev> wrote:

> If I understand you correctly, you are saying fuse server doing wrong
> things like accessing the files it is serving is not something we need
> to care about.

I don't think detecting such recursion is feasible or even generally possible.

> More specifically all the operations which directly
> manipulates the folios it is serving (like migration) should be ignored.
> Is this correct?

Um, not sure I understand.  If migration can be triggered on fuse
folios that results in the task that triggered the migration to wait
on the fuse folio, then that's bad.  Ignoring fuse folios is the only
solution that I can see, and that's basically what the current temp
page copy does.

Sprinkling mm code with fuse specific conditionals seems the only
solution if we want to get rid of the temp page thing.  Hopefully
there aren't too many of those.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-16 18:37             ` Miklos Szeredi
@ 2024-10-16 21:27               ` Shakeel Butt
  2024-10-17 13:31                 ` Miklos Szeredi
  0 siblings, 1 reply; 63+ messages in thread
From: Shakeel Butt @ 2024-10-16 21:27 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Wed, Oct 16, 2024 at 08:37:12PM GMT, Miklos Szeredi wrote:
> On Wed, 16 Oct 2024 at 19:52, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > If I understand you correctly, you are saying fuse server doing wrong
> > things like accessing the files it is serving is not something we need
> > to care about.
> 
> I don't think detecting such recursion is feasible or even generally possible.
> 
> > More specifically all the operations which directly
> > manipulates the folios it is serving (like migration) should be ignored.
> > Is this correct?
> 
> Um, not sure I understand.  If migration can be triggered on fuse
> folios that results in the task that triggered the migration to wait
> on the fuse folio, then that's bad.

Why is it bad? I can understand fuse server getting blocked on fuse
folios is bad but why it is bad for other applications/tasks? I am
wondering network filesystems have to handle similar situation then why
is it bad just for fuse?

> Ignoring fuse folios is the only
> solution that I can see, and that's basically what the current temp
> page copy does.
> 
> Sprinkling mm code with fuse specific conditionals seems the only
> solution if we want to get rid of the temp page thing.  Hopefully
> there aren't too many of those.

It might be a bit more than sprinkling. The reclaim code has to activate
the folio to avoid reclaiming the folio in near future. I am not sure
what we will need to do for move_pages() syscall.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-16 21:27               ` Shakeel Butt
@ 2024-10-17 13:31                 ` Miklos Szeredi
  2024-10-18  5:31                   ` Shakeel Butt
  2024-10-29 22:04                   ` Bernd Schubert
  0 siblings, 2 replies; 63+ messages in thread
From: Miklos Szeredi @ 2024-10-17 13:31 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Joanne Koong, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Wed, 16 Oct 2024 at 23:27, Shakeel Butt <shakeel.butt@linux.dev> wrote:

> Why is it bad? I can understand fuse server getting blocked on fuse
> folios is bad but why it is bad for other applications/tasks? I am
> wondering network filesystems have to handle similar situation then why
> is it bad just for fuse?

You need privileges (physical access) to unplug the network cable.
You don't need privileges (in most setups) to run a fuse server.

> It might be a bit more than sprinkling. The reclaim code has to activate
> the folio to avoid reclaiming the folio in near future. I am not sure
> what we will need to do for move_pages() syscall.

Maybe move_pages() is okay, because it is explicitly targeting a fuse
mmap.   Is this the only way to trigger MIGRATE_SYNC?

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-15 10:01   ` Miklos Szeredi
  2024-10-15 17:06     ` Joanne Koong
  2024-10-16  9:56     ` Jingbo Xu
@ 2024-10-18  1:30     ` Joanne Koong
  2024-10-18  5:57       ` Shakeel Butt
  2024-10-21  9:32       ` Miklos Szeredi
  2 siblings, 2 replies; 63+ messages in thread
From: Joanne Koong @ 2024-10-18  1:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, josef, bernd.schubert, jefflexu, hannes,
	shakeel.butt, linux-mm, kernel-team

On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > FUSE folios are not reclaimed and waited on while in writeback, and
> > removes the temporary folio + extra copying and the internal rb tree.
>
> What about sync(2)?   And page migration?
>
> Hopefully there are no other cases, but I think a careful review of
> places where generic code waits for writeback is needed before we can
> say for sure.

The places where I see this potential deadlock still being possible are:
* page migration when handling a page fault:
     In particular, this path: handle_mm_fault() ->
__handle_mm_fault() -> handle_pte_fault() -> do_numa_page() ->
migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync()
-> migrate_pages_batch() -> migrate_folio_unmap() ->
folio_wait_writeback()
* syscalls that trigger waits on writeback, which will lead to
deadlock if a single-threaded fuse server calls this when servicing
requests:
    - sync(), sync_file_range(), fsync(), fdatasync()
    - swapoff()
    - move_pages()

I need to analyze the page fault path more to get a clearer picture of
what is happening, but so far this looks like a valid case for a
correctly written fuse server to run into.
For the syscalls however, is it valid/safe in general (disregarding
the writeback deadlock scenario for a minute) for fuse servers to be
invoking these syscalls in their handlers anyways?

The other places where I see a generic wait on writeback seem safe:
* splice, page_cache_pipe_buf_try_steal() (fs/splice.c):
   We hit this in fuse when we try to move a page from the pipe buffer
into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case.
This wait seems fine, since the folio that's being waited on is the
folio in the pipe buffer which is not a fuse folio.
* memory failure (mm/memory_failure.c):
   Soft offlining a page and handling page memory failure - these can
be triggered asynchronously (memory_failure_work_func()), but this
should be fine for the fuse use case since the server isn't blocked on
servicing any writeback requests while memory failure handling is
waiting on writeback
* page truncation (mm/truncate.c):
   Same here. These cases seem fine since the server isn't blocked on
servicing writeback requests while truncation waits on writeback

Thanks,
Joanne

>
> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-17 13:31                 ` Miklos Szeredi
@ 2024-10-18  5:31                   ` Shakeel Butt
  2024-10-21 10:15                     ` Miklos Szeredi
  2024-10-29 22:04                   ` Bernd Schubert
  1 sibling, 1 reply; 63+ messages in thread
From: Shakeel Butt @ 2024-10-18  5:31 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Thu, Oct 17, 2024 at 03:31:48PM GMT, Miklos Szeredi wrote:
> On Wed, 16 Oct 2024 at 23:27, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > Why is it bad? I can understand fuse server getting blocked on fuse
> > folios is bad but why it is bad for other applications/tasks? I am
> > wondering network filesystems have to handle similar situation then why
> > is it bad just for fuse?
> 
> You need privileges (physical access) to unplug the network cable.
> You don't need privileges (in most setups) to run a fuse server.

I feel like this is too much restrictive and I am still not sure why
blocking on fuse folios served by non-privileges fuse server is worse
than blocking on folios served from the network.

> 
> > It might be a bit more than sprinkling. The reclaim code has to activate
> > the folio to avoid reclaiming the folio in near future. I am not sure
> > what we will need to do for move_pages() syscall.
> 
> Maybe move_pages() is okay, because it is explicitly targeting a fuse
> mmap.   Is this the only way to trigger MIGRATE_SYNC?

It can happen through migrate_pages(), mbind(), compaction which can
caused by hugepage allcations.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-18  1:30     ` Joanne Koong
@ 2024-10-18  5:57       ` Shakeel Butt
  2024-10-18 19:57         ` Joanne Koong
  2024-10-21  9:32       ` Miklos Szeredi
  1 sibling, 1 reply; 63+ messages in thread
From: Shakeel Butt @ 2024-10-18  5:57 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Thu, Oct 17, 2024 at 06:30:08PM GMT, Joanne Koong wrote:
> On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > > FUSE folios are not reclaimed and waited on while in writeback, and
> > > removes the temporary folio + extra copying and the internal rb tree.
> >
> > What about sync(2)?   And page migration?
> >
> > Hopefully there are no other cases, but I think a careful review of
> > places where generic code waits for writeback is needed before we can
> > say for sure.
> 
> The places where I see this potential deadlock still being possible are:
> * page migration when handling a page fault:
>      In particular, this path: handle_mm_fault() ->
> __handle_mm_fault() -> handle_pte_fault() -> do_numa_page() ->
> migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync()
> -> migrate_pages_batch() -> migrate_folio_unmap() ->
> folio_wait_writeback()

So, this is numa fault and if fuse server is not mapping the fuse folios
which it is serving, in its address space then this is not an issue.
However hugepage allocation on page fault can cause compaction which
might migrate unrelated fuse folios. So, fuse server doing compaction
is an issue and we need to resolve similar to reclaim codepath. (Though
I think for THP it is not doing MIGRATE_SYNC but doing for gigantic
hugetlb pages).

> * syscalls that trigger waits on writeback, which will lead to
> deadlock if a single-threaded fuse server calls this when servicing
> requests:
>     - sync(), sync_file_range(), fsync(), fdatasync()
>     - swapoff()
>     - move_pages()
> 
> I need to analyze the page fault path more to get a clearer picture of
> what is happening, but so far this looks like a valid case for a
> correctly written fuse server to run into.
> For the syscalls however, is it valid/safe in general (disregarding
> the writeback deadlock scenario for a minute) for fuse servers to be
> invoking these syscalls in their handlers anyways?
> 
> The other places where I see a generic wait on writeback seem safe:
> * splice, page_cache_pipe_buf_try_steal() (fs/splice.c):
>    We hit this in fuse when we try to move a page from the pipe buffer
> into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case.
> This wait seems fine, since the folio that's being waited on is the
> folio in the pipe buffer which is not a fuse folio.
> * memory failure (mm/memory_failure.c):
>    Soft offlining a page and handling page memory failure - these can
> be triggered asynchronously (memory_failure_work_func()), but this
> should be fine for the fuse use case since the server isn't blocked on
> servicing any writeback requests while memory failure handling is
> waiting on writeback
> * page truncation (mm/truncate.c):
>    Same here. These cases seem fine since the server isn't blocked on
> servicing writeback requests while truncation waits on writeback
> 
> 
> Thanks,
> Joanne
> 
> >
> > Thanks,
> > Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-14 18:22 ` [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
  2024-10-15 10:01   ` Miklos Szeredi
@ 2024-10-18  9:24   ` Jingbo Xu
  2024-10-18 20:29     ` Joanne Koong
  1 sibling, 1 reply; 63+ messages in thread
From: Jingbo Xu @ 2024-10-18  9:24 UTC (permalink / raw)
  To: Joanne Koong, miklos, linux-fsdevel
  Cc: josef, bernd.schubert, hannes, shakeel.butt, linux-mm,
	kernel-team



On 10/15/24 2:22 AM, Joanne Koong wrote:
>  static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> -					  struct folio *tmp_folio, uint32_t page_index)
> +					  uint32_t page_index)
>  {
>  	struct inode *inode = folio->mapping->host;
>  	struct fuse_args_pages *ap = &wpa->ia.ap;
>  
> -	folio_copy(tmp_folio, folio);
> -
> -	ap->pages[page_index] = &tmp_folio->page;
> +	folio_get(folio);

Why folio_get() is needed?


>  
> @@ -2203,68 +2047,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data)
>  	struct fuse_writepage_args *wpa = data->wpa;
>  	struct inode *inode = data->inode;
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> -	int num_pages = wpa->ia.ap.num_pages;
> -	int i;
>  
>  	spin_lock(&fi->lock);
>  	list_add_tail(&wpa->queue_entry, &fi->queued_writes);

Could we also drop fi->queued_writes list and writectr counter after we
dropping the temp folio?  I'm not sure.  When I grep all callsites of
fuse_set_nowrite(), it seems that apart from the direct IO
path(fuse_direct_io -> fuse_sync_writes), the truncation related code
also relies on this writectr mechanism.



-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-18  5:57       ` Shakeel Butt
@ 2024-10-18 19:57         ` Joanne Koong
  2024-10-18 20:46           ` Shakeel Butt
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-18 19:57 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Miklos Szeredi, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Thu, Oct 17, 2024 at 10:57 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Oct 17, 2024 at 06:30:08PM GMT, Joanne Koong wrote:
> > On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > > > FUSE folios are not reclaimed and waited on while in writeback, and
> > > > removes the temporary folio + extra copying and the internal rb tree.
> > >
> > > What about sync(2)?   And page migration?
> > >
> > > Hopefully there are no other cases, but I think a careful review of
> > > places where generic code waits for writeback is needed before we can
> > > say for sure.
> >
> > The places where I see this potential deadlock still being possible are:
> > * page migration when handling a page fault:
> >      In particular, this path: handle_mm_fault() ->
> > __handle_mm_fault() -> handle_pte_fault() -> do_numa_page() ->
> > migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync()
> > -> migrate_pages_batch() -> migrate_folio_unmap() ->
> > folio_wait_writeback()
>
> So, this is numa fault and if fuse server is not mapping the fuse folios
> which it is serving, in its address space then this is not an issue.
> However hugepage allocation on page fault can cause compaction which
> might migrate unrelated fuse folios. So, fuse server doing compaction
> is an issue and we need to resolve similar to reclaim codepath. (Though
> I think for THP it is not doing MIGRATE_SYNC but doing for gigantic
> hugetlb pages).

Thanks for the explanation. Would you mind pointing me to the
compaction function where this triggers the migrate? Is this in
compact_zone() where it calls migrate_pages() on the cc->migratepages
list?

>
> > * syscalls that trigger waits on writeback, which will lead to
> > deadlock if a single-threaded fuse server calls this when servicing
> > requests:
> >     - sync(), sync_file_range(), fsync(), fdatasync()
> >     - swapoff()
> >     - move_pages()
> >
> > I need to analyze the page fault path more to get a clearer picture of
> > what is happening, but so far this looks like a valid case for a
> > correctly written fuse server to run into.
> > For the syscalls however, is it valid/safe in general (disregarding
> > the writeback deadlock scenario for a minute) for fuse servers to be
> > invoking these syscalls in their handlers anyways?
> >
> > The other places where I see a generic wait on writeback seem safe:
> > * splice, page_cache_pipe_buf_try_steal() (fs/splice.c):
> >    We hit this in fuse when we try to move a page from the pipe buffer
> > into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case.
> > This wait seems fine, since the folio that's being waited on is the
> > folio in the pipe buffer which is not a fuse folio.
> > * memory failure (mm/memory_failure.c):
> >    Soft offlining a page and handling page memory failure - these can
> > be triggered asynchronously (memory_failure_work_func()), but this
> > should be fine for the fuse use case since the server isn't blocked on
> > servicing any writeback requests while memory failure handling is
> > waiting on writeback
> > * page truncation (mm/truncate.c):
> >    Same here. These cases seem fine since the server isn't blocked on
> > servicing writeback requests while truncation waits on writeback
> >
> >
> > Thanks,
> > Joanne
> >
> > >
> > > Thanks,
> > > Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-18  9:24   ` Jingbo Xu
@ 2024-10-18 20:29     ` Joanne Koong
  0 siblings, 0 replies; 63+ messages in thread
From: Joanne Koong @ 2024-10-18 20:29 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: miklos, linux-fsdevel, josef, bernd.schubert, hannes,
	shakeel.butt, linux-mm, kernel-team

On Fri, Oct 18, 2024 at 2:24 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
> On 10/15/24 2:22 AM, Joanne Koong wrote:
> >  static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> > -                                       struct folio *tmp_folio, uint32_t page_index)
> > +                                       uint32_t page_index)
> >  {
> >       struct inode *inode = folio->mapping->host;
> >       struct fuse_args_pages *ap = &wpa->ia.ap;
> >
> > -     folio_copy(tmp_folio, folio);
> > -
> > -     ap->pages[page_index] = &tmp_folio->page;
> > +     folio_get(folio);
>
> Why folio_get() is needed?
>

It's not strictly needed but I added this to ensure that the folio
will always be valid when we later on access it. My understanding is
that it's an implementation detail rather than a guarantee that the
folio can't be removed from the mapping until writeback finishes.

>
> >
> > @@ -2203,68 +2047,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data)
> >       struct fuse_writepage_args *wpa = data->wpa;
> >       struct inode *inode = data->inode;
> >       struct fuse_inode *fi = get_fuse_inode(inode);
> > -     int num_pages = wpa->ia.ap.num_pages;
> > -     int i;
> >
> >       spin_lock(&fi->lock);
> >       list_add_tail(&wpa->queue_entry, &fi->queued_writes);
>
> Could we also drop fi->queued_writes list and writectr counter after we
> dropping the temp folio?  I'm not sure.  When I grep all callsites of
> fuse_set_nowrite(), it seems that apart from the direct IO
> path(fuse_direct_io -> fuse_sync_writes), the truncation related code
> also relies on this writectr mechanism.

I'm not sure either, but I think we still need it. My understanding is
that the main purpose of this writectr is to detect when the inode has
pending writebacks. I think even without the temporary pages, we still
would need a way overall to track if the inode is writing back any
folio in its mapping.


Thanks,
Joanne

>
>
>
> --
> Thanks,
> Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-18 19:57         ` Joanne Koong
@ 2024-10-18 20:46           ` Shakeel Butt
  0 siblings, 0 replies; 63+ messages in thread
From: Shakeel Butt @ 2024-10-18 20:46 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Fri, Oct 18, 2024 at 12:57:08PM GMT, Joanne Koong wrote:
> On Thu, Oct 17, 2024 at 10:57 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Oct 17, 2024 at 06:30:08PM GMT, Joanne Koong wrote:
> > > On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > > > > FUSE folios are not reclaimed and waited on while in writeback, and
> > > > > removes the temporary folio + extra copying and the internal rb tree.
> > > >
> > > > What about sync(2)?   And page migration?
> > > >
> > > > Hopefully there are no other cases, but I think a careful review of
> > > > places where generic code waits for writeback is needed before we can
> > > > say for sure.
> > >
> > > The places where I see this potential deadlock still being possible are:
> > > * page migration when handling a page fault:
> > >      In particular, this path: handle_mm_fault() ->
> > > __handle_mm_fault() -> handle_pte_fault() -> do_numa_page() ->
> > > migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync()
> > > -> migrate_pages_batch() -> migrate_folio_unmap() ->
> > > folio_wait_writeback()
> >
> > So, this is numa fault and if fuse server is not mapping the fuse folios
> > which it is serving, in its address space then this is not an issue.
> > However hugepage allocation on page fault can cause compaction which
> > might migrate unrelated fuse folios. So, fuse server doing compaction
> > is an issue and we need to resolve similar to reclaim codepath. (Though
> > I think for THP it is not doing MIGRATE_SYNC but doing for gigantic
> > hugetlb pages).
> 
> Thanks for the explanation. Would you mind pointing me to the
> compaction function where this triggers the migrate? Is this in
> compact_zone() where it calls migrate_pages() on the cc->migratepages
> list?
> 

Something like the following:

__alloc_pages_direct_compact() ->
try_to_compact_pages() ->
compact_zone_order() -> /* MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT */
compact_zone() ->
migrate_pages() ->
migrate_pages_sync() ->
migrate_pages_batch() ->
migrate_folio_unmap() ->
folio_wait_writeback()


The following is one code path from hugetlb:

alloc_contig_range_noprof() -> /* MIGRATE_SYNC */
__alloc_contig_migrate_range() ->
migrate_pages() ->
migrate_pages_sync() ->
migrate_pages_batch() ->
migrate_folio_unmap() ->
folio_wait_writeback()



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-18  1:30     ` Joanne Koong
  2024-10-18  5:57       ` Shakeel Butt
@ 2024-10-21  9:32       ` Miklos Szeredi
  1 sibling, 0 replies; 63+ messages in thread
From: Miklos Szeredi @ 2024-10-21  9:32 UTC (permalink / raw)
  To: Joanne Koong
  Cc: linux-fsdevel, josef, bernd.schubert, jefflexu, hannes,
	shakeel.butt, linux-mm, kernel-team

On Fri, 18 Oct 2024 at 03:30, Joanne Koong <joannelkoong@gmail.com> wrote:

> I need to analyze the page fault path more to get a clearer picture of
> what is happening, but so far this looks like a valid case for a
> correctly written fuse server to run into.

Yes.

> For the syscalls however, is it valid/safe in general (disregarding
> the writeback deadlock scenario for a minute) for fuse servers to be
> invoking these syscalls in their handlers anyways?

Generally no.  Any kind of recursion in fuse is a landmine.

E.g. CVE-2019-20794 was created to track an issue with a fuse server
going unkillable on namespace shutdown.  It didn't occur to the
reporter that this is just a special case of a plain old recursive
deadlock, because it happens to be triggered by kill.  But recursion
is clearly there: there's a file descriptor referring to the same fuse
mount that is being served.  When this fd is closed at process exit
the recursion is triggered and the thing deadlocks.  The fix: move the
recursive part of the code to a different process.  But people seem to
believe that recursion is okay and the kernel should deal with that
:-/

> The other places where I see a generic wait on writeback seem safe:
> * splice, page_cache_pipe_buf_try_steal() (fs/splice.c):
>    We hit this in fuse when we try to move a page from the pipe buffer
> into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case.
> This wait seems fine, since the folio that's being waited on is the
> folio in the pipe buffer which is not a fuse folio.
> * memory failure (mm/memory_failure.c):
>    Soft offlining a page and handling page memory failure - these can
> be triggered asynchronously (memory_failure_work_func()), but this
> should be fine for the fuse use case since the server isn't blocked on
> servicing any writeback requests while memory failure handling is
> waiting on writeback
> * page truncation (mm/truncate.c):
>    Same here. These cases seem fine since the server isn't blocked on
> servicing writeback requests while truncation waits on writeback

Right.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-18  5:31                   ` Shakeel Butt
@ 2024-10-21 10:15                     ` Miklos Szeredi
  2024-10-21 17:01                       ` Shakeel Butt
  2024-10-21 21:05                       ` Joanne Koong
  0 siblings, 2 replies; 63+ messages in thread
From: Miklos Szeredi @ 2024-10-21 10:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Joanne Koong, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:

> I feel like this is too much restrictive and I am still not sure why
> blocking on fuse folios served by non-privileges fuse server is worse
> than blocking on folios served from the network.

Might be.  But historically fuse had this behavior and I'd be very
reluctant to change that unconditionally.

With a systemwide maximal timeout for fuse requests it might make
sense to allow sync(2), etc. to wait for fuse writeback.

Without a timeout allowing fuse servers to block sync(2) indefinitely
seems rather risky.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-21 10:15                     ` Miklos Szeredi
@ 2024-10-21 17:01                       ` Shakeel Butt
  2024-10-22 15:03                         ` Miklos Szeredi
  2024-10-21 21:05                       ` Joanne Koong
  1 sibling, 1 reply; 63+ messages in thread
From: Shakeel Butt @ 2024-10-21 17:01 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Mon, Oct 21, 2024 at 12:15:36PM GMT, Miklos Szeredi wrote:
> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > I feel like this is too much restrictive and I am still not sure why
> > blocking on fuse folios served by non-privileges fuse server is worse
> > than blocking on folios served from the network.
> 
> Might be.  But historically fuse had this behavior and I'd be very
> reluctant to change that unconditionally.
> 
> With a systemwide maximal timeout for fuse requests it might make
> sense to allow sync(2), etc. to wait for fuse writeback.
> 
> Without a timeout allowing fuse servers to block sync(2) indefinitely
> seems rather risky.
> 

Thanks Miklos for the response. Just to be clear on where we disagree, let
me point out what I think is right and please tell me where you
disagree:

1. Fuse server should never access fuse folios (and files, directories,
   mounts, etc) directly it is providing.

2. Fuse server should not get blocked indirectly on the fuse folios (and
   related objects). This series is removing one such scenario caused
   due to reclaim.

3. Non fuse server processes can be blocked on fuse folios (and related
   objects) directly and indirectly.

Am I understanding correctly that we disagree on (3)?

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-21 10:15                     ` Miklos Szeredi
  2024-10-21 17:01                       ` Shakeel Butt
@ 2024-10-21 21:05                       ` Joanne Koong
  2024-10-24 16:54                         ` Joanne Koong
  1 sibling, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-21 21:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Shakeel Butt, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> > I feel like this is too much restrictive and I am still not sure why
> > blocking on fuse folios served by non-privileges fuse server is worse
> > than blocking on folios served from the network.
>
> Might be.  But historically fuse had this behavior and I'd be very
> reluctant to change that unconditionally.
>
> With a systemwide maximal timeout for fuse requests it might make
> sense to allow sync(2), etc. to wait for fuse writeback.
>
> Without a timeout allowing fuse servers to block sync(2) indefinitely
> seems rather risky.

Could we skip waiting on writeback in sync(2) if it's a fuse folio?
That seems in line with the sync(2) documentation Jingbo referenced
earlier where it states "The writing, although scheduled, is not
necessarily complete upon return from sync()."
https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html


Thanks,
Joanne
>
> Thanks,
> Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-21 17:01                       ` Shakeel Butt
@ 2024-10-22 15:03                         ` Miklos Szeredi
  0 siblings, 0 replies; 63+ messages in thread
From: Miklos Szeredi @ 2024-10-22 15:03 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Joanne Koong, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Mon, 21 Oct 2024 at 19:01, Shakeel Butt <shakeel.butt@linux.dev> wrote:

> Thanks Miklos for the response. Just to be clear on where we disagree, let
> me point out what I think is right and please tell me where you
> disagree:
>
> 1. Fuse server should never access fuse folios (and files, directories,
>    mounts, etc) directly it is providing.

Correct.

> 2. Fuse server should not get blocked indirectly on the fuse folios (and
>    related objects). This series is removing one such scenario caused
>    due to reclaim.

Correct.

>
> 3. Non fuse server processes can be blocked on fuse folios (and related
>    objects) directly and indirectly.

Agree on the direct access part, but disagree about allowing to block
unrelated tasks.

Accessing a fuse backed object is basically agreeing to give the fuse
server extra privileges.  Documentation/filesystems/fuse.rst explains
this below "How do non-privileged mounts work?".

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-21 21:05                       ` Joanne Koong
@ 2024-10-24 16:54                         ` Joanne Koong
  2024-10-25  1:38                           ` Jingbo Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-24 16:54 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Shakeel Butt, linux-fsdevel, josef, bernd.schubert, jefflexu,
	hannes, linux-mm, kernel-team

On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > > I feel like this is too much restrictive and I am still not sure why
> > > blocking on fuse folios served by non-privileges fuse server is worse
> > > than blocking on folios served from the network.
> >
> > Might be.  But historically fuse had this behavior and I'd be very
> > reluctant to change that unconditionally.
> >
> > With a systemwide maximal timeout for fuse requests it might make
> > sense to allow sync(2), etc. to wait for fuse writeback.
> >
> > Without a timeout allowing fuse servers to block sync(2) indefinitely
> > seems rather risky.
>
> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> That seems in line with the sync(2) documentation Jingbo referenced
> earlier where it states "The writing, although scheduled, is not
> necessarily complete upon return from sync()."
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
>

So I think the answer to this is "no" for Linux. What the Linux man
page for sync(2) says:

"According to the standard specification (e.g., POSIX.1-2001), sync()
schedules the writes, but may return before the actual writing is
done.  However Linux waits for I/O completions, and thus sync() or
syncfs() provide the same guarantees as fsync() called on every file
in the system or filesystem respectively." [1]

Regardless of the compaction / page migration issue then, this
blocking sync(2) is a dealbreaker.

I see two workarounds around this:

1) Optimistically skip the tmp page but add a timeout where if the
server doesn't reply to the writeback in X seconds, then allocate the
tmp folio and clear writeback immediately on the original folio).
This would address any page migration deadlocks as well. And probably
we don't need the reclaim patch either then, since that could also be
handled by the timeout.
This would make 99% of writebacks fast but in the case of a malicious
fuse server, could make sync() and page migration wait an extra X
seconds.

2) Only skip the tmp folio for privileged fuse servers (we'd still
need to address the page migration path)

Would love to hear thoughts on this. Should we go ahead with option 1?

[1] https://man7.org/linux/man-pages/man2/sync.2.html

>
> Thanks,
> Joanne
> >
> > Thanks,
> > Miklos

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-24 16:54                         ` Joanne Koong
@ 2024-10-25  1:38                           ` Jingbo Xu
  2024-10-25 15:32                             ` Miklos Szeredi
  2024-10-25 17:36                             ` Joanne Koong
  0 siblings, 2 replies; 63+ messages in thread
From: Jingbo Xu @ 2024-10-25  1:38 UTC (permalink / raw)
  To: Joanne Koong, Miklos Szeredi
  Cc: Shakeel Butt, linux-fsdevel, josef, bernd.schubert, hannes,
	linux-mm, kernel-team



On 10/25/24 12:54 AM, Joanne Koong wrote:
> On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>
>>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>
>>>> I feel like this is too much restrictive and I am still not sure why
>>>> blocking on fuse folios served by non-privileges fuse server is worse
>>>> than blocking on folios served from the network.
>>>
>>> Might be.  But historically fuse had this behavior and I'd be very
>>> reluctant to change that unconditionally.
>>>
>>> With a systemwide maximal timeout for fuse requests it might make
>>> sense to allow sync(2), etc. to wait for fuse writeback.
>>>
>>> Without a timeout allowing fuse servers to block sync(2) indefinitely
>>> seems rather risky.
>>
>> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
>> That seems in line with the sync(2) documentation Jingbo referenced
>> earlier where it states "The writing, although scheduled, is not
>> necessarily complete upon return from sync()."
>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
>>
> 
> So I think the answer to this is "no" for Linux. What the Linux man
> page for sync(2) says:
> 
> "According to the standard specification (e.g., POSIX.1-2001), sync()
> schedules the writes, but may return before the actual writing is
> done.  However Linux waits for I/O completions, and thus sync() or
> syncfs() provide the same guarantees as fsync() called on every file
> in the system or filesystem respectively." [1]

Actually as for FUSE, IIUC the writeback is not guaranteed to be
completed when sync(2) returns since the temp page mechanism.  When
sync(2) returns, PG_writeback is indeed cleared for all original pages
(in the address_space), while the real writeback work (initiated from
temp page) may be still in progress.

I think this is also what Miklos means in:
https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/

Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages
in sync(2) codepath similar to what we have done for the direct reclaim
in patch 1.


> 
> Regardless of the compaction / page migration issue then, this
> blocking sync(2) is a dealbreaker.

I really should have figureg out the compaction / page migration
mechanism and the potential impact to FUSE when we dropping the temp
page.  Just too busy to take some time on this though.....


-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-25  1:38                           ` Jingbo Xu
@ 2024-10-25 15:32                             ` Miklos Szeredi
  2024-10-25 17:36                             ` Joanne Koong
  1 sibling, 0 replies; 63+ messages in thread
From: Miklos Szeredi @ 2024-10-25 15:32 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Joanne Koong, Shakeel Butt, linux-fsdevel, josef, bernd.schubert,
	hannes, linux-mm, kernel-team

On Fri, 25 Oct 2024 at 03:38, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> Actually as for FUSE, IIUC the writeback is not guaranteed to be
> completed when sync(2) returns since the temp page mechanism.  When
> sync(2) returns, PG_writeback is indeed cleared for all original pages
> (in the address_space), while the real writeback work (initiated from
> temp page) may be still in progress.

Correct, this is the current behavior of fuse.

I'm not against changing this for the privileged server case.  I.e. a
server running with certain privileges (e.g. CAP_SYS_ADMIN in the
global namespace) then it should be able to opt in to waiting sync(2)
behavior.

> I think this is also what Miklos means in:
> https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/
>
> Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages
> in sync(2) codepath similar to what we have done for the direct reclaim
> in patch 1.

I'd love to get rid of the tmp page thing completely if the same
guarantees can be implemented in the mm.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-25  1:38                           ` Jingbo Xu
  2024-10-25 15:32                             ` Miklos Szeredi
@ 2024-10-25 17:36                             ` Joanne Koong
  2024-10-25 18:02                               ` Miklos Szeredi
                                                 ` (2 more replies)
  1 sibling, 3 replies; 63+ messages in thread
From: Joanne Koong @ 2024-10-25 17:36 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef,
	bernd.schubert, hannes, linux-mm, kernel-team

On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 10/25/24 12:54 AM, Joanne Koong wrote:
> > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>
> >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >>>
> >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >>>
> >>>> I feel like this is too much restrictive and I am still not sure why
> >>>> blocking on fuse folios served by non-privileges fuse server is worse
> >>>> than blocking on folios served from the network.
> >>>
> >>> Might be.  But historically fuse had this behavior and I'd be very
> >>> reluctant to change that unconditionally.
> >>>
> >>> With a systemwide maximal timeout for fuse requests it might make
> >>> sense to allow sync(2), etc. to wait for fuse writeback.
> >>>
> >>> Without a timeout allowing fuse servers to block sync(2) indefinitely
> >>> seems rather risky.
> >>
> >> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> >> That seems in line with the sync(2) documentation Jingbo referenced
> >> earlier where it states "The writing, although scheduled, is not
> >> necessarily complete upon return from sync()."
> >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
> >>
> >
> > So I think the answer to this is "no" for Linux. What the Linux man
> > page for sync(2) says:
> >
> > "According to the standard specification (e.g., POSIX.1-2001), sync()
> > schedules the writes, but may return before the actual writing is
> > done.  However Linux waits for I/O completions, and thus sync() or
> > syncfs() provide the same guarantees as fsync() called on every file
> > in the system or filesystem respectively." [1]
>
> Actually as for FUSE, IIUC the writeback is not guaranteed to be
> completed when sync(2) returns since the temp page mechanism.  When
> sync(2) returns, PG_writeback is indeed cleared for all original pages
> (in the address_space), while the real writeback work (initiated from
> temp page) may be still in progress.
>

That's a great point. It seems like we can just skip waiting on
writeback to finish for fuse folios in sync(2) altogether then. I'll
look into what's the best way to do this.

> I think this is also what Miklos means in:
> https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/
>
> Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages
> in sync(2) codepath similar to what we have done for the direct reclaim
> in patch 1.
>
>
> >
> > Regardless of the compaction / page migration issue then, this
> > blocking sync(2) is a dealbreaker.
>
> I really should have figureg out the compaction / page migration
> mechanism and the potential impact to FUSE when we dropping the temp
> page.  Just too busy to take some time on this though.....

Same here, I need to look some more into the compaction / page
migration paths. I'm planning to do this early next week and will
report back with what I find.


Thanks,
Joanne
>
>
> --
> Thanks,
> Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-25 17:36                             ` Joanne Koong
@ 2024-10-25 18:02                               ` Miklos Szeredi
  2024-10-25 18:19                                 ` Joanne Koong
  2024-10-25 18:47                               ` Joanne Koong
  2024-10-25 22:40                               ` Joanne Koong
  2 siblings, 1 reply; 63+ messages in thread
From: Miklos Szeredi @ 2024-10-25 18:02 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Jingbo Xu, Shakeel Butt, linux-fsdevel, josef, bernd.schubert,
	hannes, linux-mm, kernel-team

On Fri, 25 Oct 2024 at 19:36, Joanne Koong <joannelkoong@gmail.com> wrote:

> That's a great point. It seems like we can just skip waiting on
> writeback to finish for fuse folios in sync(2) altogether then. I'll
> look into what's the best way to do this.

I just tested this, and it turns out this doesn't quite work the way
I'd expected.  I can trigger sync(2) being blocked by a suspended fuse
server:

task:kworker/u16:3   state:D stack:0     pid:172   tgid:172   ppid:2
   flags:0x00004000
Workqueue: writeback wb_workfn (flush-0:30)
Call Trace:
 __schedule+0x40b/0xad0
 schedule+0x36/0x120
 inode_sleep_on_writeback+0x9d/0xb0
 wb_writeback+0x104/0x3d0
 wb_workfn+0x325/0x490
 process_one_work+0x1d8/0x520
 worker_thread+0x1af/0x390
 kthread+0xcc/0x100
 ret_from_fork+0x2d/0x50
 ret_from_fork_asm+0x1a/0x30

task:dd              state:S stack:0     pid:1364  tgid:1364
ppid:1336   flags:0x00000002
Call Trace:
 __schedule+0x40b/0xad0
 schedule+0x36/0x120
 request_wait_answer+0x16b/0x200
 __fuse_simple_request+0xd6/0x290
 fuse_flush_times+0x119/0x140
 fuse_write_inode+0x6d/0xc0
 __writeback_single_inode+0x36d/0x480
 writeback_single_inode+0xa8/0x170
 write_inode_now+0x75/0xa0
 fuse_flush+0x85/0x1c0
 filp_flush+0x2c/0x70
 __x64_sys_close+0x2e/0x80
 do_syscall_64+0x64/0x140
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

task:sync            state:D stack:0     pid:1365  tgid:1365
ppid:1336   flags:0x00004002
Call Trace:
 __schedule+0x40b/0xad0
 schedule+0x36/0x120
 wb_wait_for_completion+0x56/0x80
 sync_inodes_sb+0xc5/0x450
 iterate_supers+0x69/0xd0
 ksys_sync+0x40/0xa0
 __do_sys_sync+0xa/0x20
 do_syscall_64+0x64/0x140
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Maybe I'm too paranoid about this, and in practice we can just let
sync(2) block in any case.  But I want to understand this better
first.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-25 18:02                               ` Miklos Szeredi
@ 2024-10-25 18:19                                 ` Joanne Koong
  2024-10-28  2:02                                   ` Jingbo Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-25 18:19 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jingbo Xu, Shakeel Butt, linux-fsdevel, josef, bernd.schubert,
	hannes, linux-mm, kernel-team

On Fri, Oct 25, 2024 at 11:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, 25 Oct 2024 at 19:36, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > That's a great point. It seems like we can just skip waiting on
> > writeback to finish for fuse folios in sync(2) altogether then. I'll
> > look into what's the best way to do this.
>
> I just tested this, and it turns out this doesn't quite work the way
> I'd expected.  I can trigger sync(2) being blocked by a suspended fuse
> server:
>
> task:kworker/u16:3   state:D stack:0     pid:172   tgid:172   ppid:2
>    flags:0x00004000
> Workqueue: writeback wb_workfn (flush-0:30)
> Call Trace:
>  __schedule+0x40b/0xad0
>  schedule+0x36/0x120
>  inode_sleep_on_writeback+0x9d/0xb0
>  wb_writeback+0x104/0x3d0
>  wb_workfn+0x325/0x490
>  process_one_work+0x1d8/0x520
>  worker_thread+0x1af/0x390
>  kthread+0xcc/0x100
>  ret_from_fork+0x2d/0x50
>  ret_from_fork_asm+0x1a/0x30
>
> task:dd              state:S stack:0     pid:1364  tgid:1364
> ppid:1336   flags:0x00000002
> Call Trace:
>  __schedule+0x40b/0xad0
>  schedule+0x36/0x120
>  request_wait_answer+0x16b/0x200
>  __fuse_simple_request+0xd6/0x290
>  fuse_flush_times+0x119/0x140
>  fuse_write_inode+0x6d/0xc0
>  __writeback_single_inode+0x36d/0x480
>  writeback_single_inode+0xa8/0x170
>  write_inode_now+0x75/0xa0
>  fuse_flush+0x85/0x1c0
>  filp_flush+0x2c/0x70
>  __x64_sys_close+0x2e/0x80
>  do_syscall_64+0x64/0x140
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> task:sync            state:D stack:0     pid:1365  tgid:1365
> ppid:1336   flags:0x00004002
> Call Trace:
>  __schedule+0x40b/0xad0
>  schedule+0x36/0x120
>  wb_wait_for_completion+0x56/0x80
>  sync_inodes_sb+0xc5/0x450
>  iterate_supers+0x69/0xd0
>  ksys_sync+0x40/0xa0
>  __do_sys_sync+0xa/0x20
>  do_syscall_64+0x64/0x140
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>

Thanks for the trace. If i'm understanding it correctly, this only
blocks temporarily until the writeback wb_workfn is rescheduled?

> Maybe I'm too paranoid about this, and in practice we can just let
> sync(2) block in any case.  But I want to understand this better

in the case of a malicious fuse server, it could block sync(2) forever
so I think this means we have to skip the wait for fuse folios
altogether.


Thanks,
Joanne
> first.
>
> Thanks,
> Miklos


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-25 17:36                             ` Joanne Koong
  2024-10-25 18:02                               ` Miklos Szeredi
@ 2024-10-25 18:47                               ` Joanne Koong
  2024-10-28  2:28                                 ` Jingbo Xu
  2024-10-25 22:40                               ` Joanne Koong
  2 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-25 18:47 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef,
	bernd.schubert, hannes, linux-mm, kernel-team

On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >
> >
> >
> > On 10/25/24 12:54 AM, Joanne Koong wrote:
> > > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >>
> > >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >>>
> > >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >>>
> > >>>> I feel like this is too much restrictive and I am still not sure why
> > >>>> blocking on fuse folios served by non-privileges fuse server is worse
> > >>>> than blocking on folios served from the network.
> > >>>
> > >>> Might be.  But historically fuse had this behavior and I'd be very
> > >>> reluctant to change that unconditionally.
> > >>>
> > >>> With a systemwide maximal timeout for fuse requests it might make
> > >>> sense to allow sync(2), etc. to wait for fuse writeback.
> > >>>
> > >>> Without a timeout allowing fuse servers to block sync(2) indefinitely
> > >>> seems rather risky.
> > >>
> > >> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> > >> That seems in line with the sync(2) documentation Jingbo referenced
> > >> earlier where it states "The writing, although scheduled, is not
> > >> necessarily complete upon return from sync()."
> > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
> > >>
> > >
> > > So I think the answer to this is "no" for Linux. What the Linux man
> > > page for sync(2) says:
> > >
> > > "According to the standard specification (e.g., POSIX.1-2001), sync()
> > > schedules the writes, but may return before the actual writing is
> > > done.  However Linux waits for I/O completions, and thus sync() or
> > > syncfs() provide the same guarantees as fsync() called on every file
> > > in the system or filesystem respectively." [1]
> >
> > Actually as for FUSE, IIUC the writeback is not guaranteed to be
> > completed when sync(2) returns since the temp page mechanism.  When
> > sync(2) returns, PG_writeback is indeed cleared for all original pages
> > (in the address_space), while the real writeback work (initiated from
> > temp page) may be still in progress.
> >
>
> That's a great point. It seems like we can just skip waiting on
> writeback to finish for fuse folios in sync(2) altogether then. I'll
> look into what's the best way to do this.

I think the most straightforward way to do this for sync(2) is to add
the mapping check inside sync_bdevs(). With something like:

diff --git a/block/bdev.c b/block/bdev.c
index 738e3c8457e7..bcb2b6d3db94 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1247,7 +1247,7 @@ void sync_bdevs(bool wait)
                mutex_lock(&bdev->bd_disk->open_mutex);
                if (!atomic_read(&bdev->bd_openers)) {
                        ; /* skip */
-               } else if (wait) {
+               } else if (wait &&
!mapping_no_writeback_wait(inode->i_mapping)) {
                        /*
                         * We keep the error status of individual mapping so
                         * that applications can catch the writeback error using


and changing AS_NO_WRITEBACK_RECLAIM to AS_NO_WRITEBACK_WAIT.


Thanks,
Joanne
>


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-25 17:36                             ` Joanne Koong
  2024-10-25 18:02                               ` Miklos Szeredi
  2024-10-25 18:47                               ` Joanne Koong
@ 2024-10-25 22:40                               ` Joanne Koong
  2024-10-28 21:58                                 ` Joanne Koong
  2 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-25 22:40 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef,
	bernd.schubert, hannes, linux-mm, kernel-team

On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >
> >
> >
> > On 10/25/24 12:54 AM, Joanne Koong wrote:
> > > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >>
> > >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >>>
> > >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >>>
> > >>>> I feel like this is too much restrictive and I am still not sure why
> > >>>> blocking on fuse folios served by non-privileges fuse server is worse
> > >>>> than blocking on folios served from the network.
> > >>>
> > >>> Might be.  But historically fuse had this behavior and I'd be very
> > >>> reluctant to change that unconditionally.
> > >>>
> > >>> With a systemwide maximal timeout for fuse requests it might make
> > >>> sense to allow sync(2), etc. to wait for fuse writeback.
> > >>>
> > >>> Without a timeout allowing fuse servers to block sync(2) indefinitely
> > >>> seems rather risky.
> > >>
> > >> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> > >> That seems in line with the sync(2) documentation Jingbo referenced
> > >> earlier where it states "The writing, although scheduled, is not
> > >> necessarily complete upon return from sync()."
> > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
> > >>
> > >
> > > So I think the answer to this is "no" for Linux. What the Linux man
> > > page for sync(2) says:
> > >
> > > "According to the standard specification (e.g., POSIX.1-2001), sync()
> > > schedules the writes, but may return before the actual writing is
> > > done.  However Linux waits for I/O completions, and thus sync() or
> > > syncfs() provide the same guarantees as fsync() called on every file
> > > in the system or filesystem respectively." [1]
> >
> > Actually as for FUSE, IIUC the writeback is not guaranteed to be
> > completed when sync(2) returns since the temp page mechanism.  When
> > sync(2) returns, PG_writeback is indeed cleared for all original pages
> > (in the address_space), while the real writeback work (initiated from
> > temp page) may be still in progress.
> >
>
> That's a great point. It seems like we can just skip waiting on
> writeback to finish for fuse folios in sync(2) altogether then. I'll
> look into what's the best way to do this.
>
> > I think this is also what Miklos means in:
> > https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/
> >
> > Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages
> > in sync(2) codepath similar to what we have done for the direct reclaim
> > in patch 1.
> >
> >
> > >
> > > Regardless of the compaction / page migration issue then, this
> > > blocking sync(2) is a dealbreaker.
> >
> > I really should have figureg out the compaction / page migration
> > mechanism and the potential impact to FUSE when we dropping the temp
> > page.  Just too busy to take some time on this though.....
>
> Same here, I need to look some more into the compaction / page
> migration paths. I'm planning to do this early next week and will
> report back with what I find.
>

These are my notes so far:

* We hit the folio_wait_writeback() path when callers call
migrate_pages() with mode MIGRATE_SYNC
   ... -> migrate_pages() -> migrate_pages_sync() ->
migrate_pages_batch() -> migrate_folio_unmap() ->
folio_wait_writeback()

* These are the places where we call migrate_pages():
1) demote_folio_list()
Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode

2) __damon_pa_migrate_folio_list()
Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode

3) migrate_misplaced_folio()
Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode

4) do_move_pages_to_node()
Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
this path is only invoked by the move_pages() syscall. It's fine to
wait on writeback for the move_pages() syscall since the user would
have to deliberately invoke this on the fuse server for this to apply
to the server's fuse folios

5)  migrate_to_node()
Can ignore this for the same reason as in 4. This path is only invoked
by the migrate_pages() syscall.

6) do_mbind()
Can ignore this for the same reason as 4 and 5. This path is only
invoked by the mbind() syscall.

7) soft_offline_in_use_page()
Can skip soft offlining fuse folios (eg folios with the
AS_NO_WRITEBACK_WAIT mapping flag set).
The path for this is soft_offline_page() -> soft_offline_in_use_page()
-> migrate_pages(). soft_offline_page() only invokes this for in-use
pages in a well-defined state (see ret value of get_hwpoison_page()).
My understanding of soft offlining pages is that it's a mitigation
strategy for handling pages that are experiencing errors but are not
yet completely unusable, and its main purpose is to prevent future
issues. It seems fine to skip this for fuse folios.

8) do_migrate_range()
9) compact_zone()
10) migrate_longterm_unpinnable_folios()
11) __alloc_contig_migrate_range()

8 to 11 needs more investigation / thinking about. I don't see a good
way around these tbh. I think we have to operate under the assumption
that the fuse server running is malicious or benevolently but
incorrectly written and could possibly never complete writeback. So we
definitely can't wait on these but it also doesn't seem like we can
skip waiting on these, especially for the case where the server uses
spliced pages, nor does it seem like we can just fail these with
-EBUSY or something.

Will continue looking more into this early next week.

Thanks,
Joanne
>
> Thanks,
> Joanne
> >
> >
> > --
> > Thanks,
> > Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-25 18:19                                 ` Joanne Koong
@ 2024-10-28  2:02                                   ` Jingbo Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Jingbo Xu @ 2024-10-28  2:02 UTC (permalink / raw)
  To: Joanne Koong, Miklos Szeredi
  Cc: Shakeel Butt, linux-fsdevel, josef, bernd.schubert, hannes,
	linux-mm, kernel-team



On 10/26/24 2:19 AM, Joanne Koong wrote:
> On Fri, Oct 25, 2024 at 11:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>
>> On Fri, 25 Oct 2024 at 19:36, Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>>> That's a great point. It seems like we can just skip waiting on
>>> writeback to finish for fuse folios in sync(2) altogether then. I'll
>>> look into what's the best way to do this.
>>
>> I just tested this, and it turns out this doesn't quite work the way
>> I'd expected.  I can trigger sync(2) being blocked by a suspended fuse
>> server:
>>
>> task:kworker/u16:3   state:D stack:0     pid:172   tgid:172   ppid:2
>>    flags:0x00004000
>> Workqueue: writeback wb_workfn (flush-0:30)
>> Call Trace:
>>  __schedule+0x40b/0xad0
>>  schedule+0x36/0x120
>>  inode_sleep_on_writeback+0x9d/0xb0
>>  wb_writeback+0x104/0x3d0
>>  wb_workfn+0x325/0x490
>>  process_one_work+0x1d8/0x520
>>  worker_thread+0x1af/0x390
>>  kthread+0xcc/0x100
>>  ret_from_fork+0x2d/0x50
>>  ret_from_fork_asm+0x1a/0x30
>>
>> task:dd              state:S stack:0     pid:1364  tgid:1364
>> ppid:1336   flags:0x00000002
>> Call Trace:
>>  __schedule+0x40b/0xad0
>>  schedule+0x36/0x120
>>  request_wait_answer+0x16b/0x200
>>  __fuse_simple_request+0xd6/0x290
>>  fuse_flush_times+0x119/0x140
>>  fuse_write_inode+0x6d/0xc0
>>  __writeback_single_inode+0x36d/0x480
>>  writeback_single_inode+0xa8/0x170
>>  write_inode_now+0x75/0xa0
>>  fuse_flush+0x85/0x1c0
>>  filp_flush+0x2c/0x70
>>  __x64_sys_close+0x2e/0x80
>>  do_syscall_64+0x64/0x140
>>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> task:sync            state:D stack:0     pid:1365  tgid:1365
>> ppid:1336   flags:0x00004002
>> Call Trace:
>>  __schedule+0x40b/0xad0
>>  schedule+0x36/0x120
>>  wb_wait_for_completion+0x56/0x80
>>  sync_inodes_sb+0xc5/0x450
>>  iterate_supers+0x69/0xd0
>>  ksys_sync+0x40/0xa0
>>  __do_sys_sync+0xa/0x20
>>  do_syscall_64+0x64/0x140
>>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
> 
> Thanks for the trace. If i'm understanding it correctly, this only
> blocks temporarily until the writeback wb_workfn is rescheduled?
> 
>> Maybe I'm too paranoid about this, and in practice we can just let
>> sync(2) block in any case.  But I want to understand this better
> 
> in the case of a malicious fuse server, it could block sync(2) forever
> so I think this means we have to skip the wait for fuse folios
> altogether.

Right. In summary we need to skip waiting on the completion for the
writeback in case of a malicious fuse server could block sync(2)
forever, and meanwhile the change won't break the backward compatibility
as the current fuse implementation also doesn't wait for the data
actually being written back and persistent on the backstore in sync(2).


-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-25 18:47                               ` Joanne Koong
@ 2024-10-28  2:28                                 ` Jingbo Xu
  2024-10-28 21:57                                   ` Joanne Koong
  0 siblings, 1 reply; 63+ messages in thread
From: Jingbo Xu @ 2024-10-28  2:28 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef,
	bernd.schubert, hannes, linux-mm, kernel-team



On 10/26/24 2:47 AM, Joanne Koong wrote:
> On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>> On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>
>>>
>>> On 10/25/24 12:54 AM, Joanne Koong wrote:
>>>> On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>>>
>>>>> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>>>>
>>>>>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>>>>
>>>>>>> I feel like this is too much restrictive and I am still not sure why
>>>>>>> blocking on fuse folios served by non-privileges fuse server is worse
>>>>>>> than blocking on folios served from the network.
>>>>>>
>>>>>> Might be.  But historically fuse had this behavior and I'd be very
>>>>>> reluctant to change that unconditionally.
>>>>>>
>>>>>> With a systemwide maximal timeout for fuse requests it might make
>>>>>> sense to allow sync(2), etc. to wait for fuse writeback.
>>>>>>
>>>>>> Without a timeout allowing fuse servers to block sync(2) indefinitely
>>>>>> seems rather risky.
>>>>>
>>>>> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
>>>>> That seems in line with the sync(2) documentation Jingbo referenced
>>>>> earlier where it states "The writing, although scheduled, is not
>>>>> necessarily complete upon return from sync()."
>>>>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
>>>>>
>>>>
>>>> So I think the answer to this is "no" for Linux. What the Linux man
>>>> page for sync(2) says:
>>>>
>>>> "According to the standard specification (e.g., POSIX.1-2001), sync()
>>>> schedules the writes, but may return before the actual writing is
>>>> done.  However Linux waits for I/O completions, and thus sync() or
>>>> syncfs() provide the same guarantees as fsync() called on every file
>>>> in the system or filesystem respectively." [1]
>>>
>>> Actually as for FUSE, IIUC the writeback is not guaranteed to be
>>> completed when sync(2) returns since the temp page mechanism.  When
>>> sync(2) returns, PG_writeback is indeed cleared for all original pages
>>> (in the address_space), while the real writeback work (initiated from
>>> temp page) may be still in progress.
>>>
>>
>> That's a great point. It seems like we can just skip waiting on
>> writeback to finish for fuse folios in sync(2) altogether then. I'll
>> look into what's the best way to do this.
> 
> I think the most straightforward way to do this for sync(2) is to add
> the mapping check inside sync_bdevs(). With something like:
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 738e3c8457e7..bcb2b6d3db94 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -1247,7 +1247,7 @@ void sync_bdevs(bool wait)
>                 mutex_lock(&bdev->bd_disk->open_mutex);
>                 if (!atomic_read(&bdev->bd_openers)) {
>                         ; /* skip */
> -               } else if (wait) {
> +               } else if (wait &&
> !mapping_no_writeback_wait(inode->i_mapping)) {
>                         /*
>                          * We keep the error status of individual mapping so
>                          * that applications can catch the writeback error using
> 
> 

I'm afraid we are waiting in wait_sb_inodes (ksys_sync -> sync_inodes_sb
-> wait_sb_inodes) rather than sync_bdevs.  sync_bdevs() is used to
writeback and sync the metadata residing on the block device directly
such as the superblock.  It is sync_inodes_one_sb() that actually
writeback inodes.


-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-28  2:28                                 ` Jingbo Xu
@ 2024-10-28 21:57                                   ` Joanne Koong
  0 siblings, 0 replies; 63+ messages in thread
From: Joanne Koong @ 2024-10-28 21:57 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef,
	bernd.schubert, hannes, linux-mm, kernel-team

On Sun, Oct 27, 2024 at 7:28 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 10/26/24 2:47 AM, Joanne Koong wrote:
> > On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>
> >> On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>
> >>>
> >>>
> >>> On 10/25/24 12:54 AM, Joanne Koong wrote:
> >>>> On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>>>>
> >>>>> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >>>>>>
> >>>>>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >>>>>>
> >>>>>>> I feel like this is too much restrictive and I am still not sure why
> >>>>>>> blocking on fuse folios served by non-privileges fuse server is worse
> >>>>>>> than blocking on folios served from the network.
> >>>>>>
> >>>>>> Might be.  But historically fuse had this behavior and I'd be very
> >>>>>> reluctant to change that unconditionally.
> >>>>>>
> >>>>>> With a systemwide maximal timeout for fuse requests it might make
> >>>>>> sense to allow sync(2), etc. to wait for fuse writeback.
> >>>>>>
> >>>>>> Without a timeout allowing fuse servers to block sync(2) indefinitely
> >>>>>> seems rather risky.
> >>>>>
> >>>>> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> >>>>> That seems in line with the sync(2) documentation Jingbo referenced
> >>>>> earlier where it states "The writing, although scheduled, is not
> >>>>> necessarily complete upon return from sync()."
> >>>>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
> >>>>>
> >>>>
> >>>> So I think the answer to this is "no" for Linux. What the Linux man
> >>>> page for sync(2) says:
> >>>>
> >>>> "According to the standard specification (e.g., POSIX.1-2001), sync()
> >>>> schedules the writes, but may return before the actual writing is
> >>>> done.  However Linux waits for I/O completions, and thus sync() or
> >>>> syncfs() provide the same guarantees as fsync() called on every file
> >>>> in the system or filesystem respectively." [1]
> >>>
> >>> Actually as for FUSE, IIUC the writeback is not guaranteed to be
> >>> completed when sync(2) returns since the temp page mechanism.  When
> >>> sync(2) returns, PG_writeback is indeed cleared for all original pages
> >>> (in the address_space), while the real writeback work (initiated from
> >>> temp page) may be still in progress.
> >>>
> >>
> >> That's a great point. It seems like we can just skip waiting on
> >> writeback to finish for fuse folios in sync(2) altogether then. I'll
> >> look into what's the best way to do this.
> >
> > I think the most straightforward way to do this for sync(2) is to add
> > the mapping check inside sync_bdevs(). With something like:
> >
> > diff --git a/block/bdev.c b/block/bdev.c
> > index 738e3c8457e7..bcb2b6d3db94 100644
> > --- a/block/bdev.c
> > +++ b/block/bdev.c
> > @@ -1247,7 +1247,7 @@ void sync_bdevs(bool wait)
> >                 mutex_lock(&bdev->bd_disk->open_mutex);
> >                 if (!atomic_read(&bdev->bd_openers)) {
> >                         ; /* skip */
> > -               } else if (wait) {
> > +               } else if (wait &&
> > !mapping_no_writeback_wait(inode->i_mapping)) {
> >                         /*
> >                          * We keep the error status of individual mapping so
> >                          * that applications can catch the writeback error using
> >
> >
>
> I'm afraid we are waiting in wait_sb_inodes (ksys_sync -> sync_inodes_sb
> -> wait_sb_inodes) rather than sync_bdevs.  sync_bdevs() is used to
> writeback and sync the metadata residing on the block device directly
> such as the superblock.  It is sync_inodes_one_sb() that actually
> writeback inodes.
>

Great point, thanks for the info!

>
> --
> Thanks,
> Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-25 22:40                               ` Joanne Koong
@ 2024-10-28 21:58                                 ` Joanne Koong
  2024-10-30  9:32                                   ` Bernd Schubert
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-28 21:58 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef,
	bernd.schubert, hannes, linux-mm, kernel-team

On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > >
> > >
> > >
> > > On 10/25/24 12:54 AM, Joanne Koong wrote:
> > > > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >>
> > > >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >>>
> > > >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >>>
> > > >>>> I feel like this is too much restrictive and I am still not sure why
> > > >>>> blocking on fuse folios served by non-privileges fuse server is worse
> > > >>>> than blocking on folios served from the network.
> > > >>>
> > > >>> Might be.  But historically fuse had this behavior and I'd be very
> > > >>> reluctant to change that unconditionally.
> > > >>>
> > > >>> With a systemwide maximal timeout for fuse requests it might make
> > > >>> sense to allow sync(2), etc. to wait for fuse writeback.
> > > >>>
> > > >>> Without a timeout allowing fuse servers to block sync(2) indefinitely
> > > >>> seems rather risky.
> > > >>
> > > >> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> > > >> That seems in line with the sync(2) documentation Jingbo referenced
> > > >> earlier where it states "The writing, although scheduled, is not
> > > >> necessarily complete upon return from sync()."
> > > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
> > > >>
> > > >
> > > > So I think the answer to this is "no" for Linux. What the Linux man
> > > > page for sync(2) says:
> > > >
> > > > "According to the standard specification (e.g., POSIX.1-2001), sync()
> > > > schedules the writes, but may return before the actual writing is
> > > > done.  However Linux waits for I/O completions, and thus sync() or
> > > > syncfs() provide the same guarantees as fsync() called on every file
> > > > in the system or filesystem respectively." [1]
> > >
> > > Actually as for FUSE, IIUC the writeback is not guaranteed to be
> > > completed when sync(2) returns since the temp page mechanism.  When
> > > sync(2) returns, PG_writeback is indeed cleared for all original pages
> > > (in the address_space), while the real writeback work (initiated from
> > > temp page) may be still in progress.
> > >
> >
> > That's a great point. It seems like we can just skip waiting on
> > writeback to finish for fuse folios in sync(2) altogether then. I'll
> > look into what's the best way to do this.
> >
> > > I think this is also what Miklos means in:
> > > https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/
> > >
> > > Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages
> > > in sync(2) codepath similar to what we have done for the direct reclaim
> > > in patch 1.
> > >
> > >
> > > >
> > > > Regardless of the compaction / page migration issue then, this
> > > > blocking sync(2) is a dealbreaker.
> > >
> > > I really should have figureg out the compaction / page migration
> > > mechanism and the potential impact to FUSE when we dropping the temp
> > > page.  Just too busy to take some time on this though.....
> >
> > Same here, I need to look some more into the compaction / page
> > migration paths. I'm planning to do this early next week and will
> > report back with what I find.
> >
>
> These are my notes so far:
>
> * We hit the folio_wait_writeback() path when callers call
> migrate_pages() with mode MIGRATE_SYNC
>    ... -> migrate_pages() -> migrate_pages_sync() ->
> migrate_pages_batch() -> migrate_folio_unmap() ->
> folio_wait_writeback()
>
> * These are the places where we call migrate_pages():
> 1) demote_folio_list()
> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>
> 2) __damon_pa_migrate_folio_list()
> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>
> 3) migrate_misplaced_folio()
> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>
> 4) do_move_pages_to_node()
> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> this path is only invoked by the move_pages() syscall. It's fine to
> wait on writeback for the move_pages() syscall since the user would
> have to deliberately invoke this on the fuse server for this to apply
> to the server's fuse folios
>
> 5)  migrate_to_node()
> Can ignore this for the same reason as in 4. This path is only invoked
> by the migrate_pages() syscall.
>
> 6) do_mbind()
> Can ignore this for the same reason as 4 and 5. This path is only
> invoked by the mbind() syscall.
>
> 7) soft_offline_in_use_page()
> Can skip soft offlining fuse folios (eg folios with the
> AS_NO_WRITEBACK_WAIT mapping flag set).
> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> -> migrate_pages(). soft_offline_page() only invokes this for in-use
> pages in a well-defined state (see ret value of get_hwpoison_page()).
> My understanding of soft offlining pages is that it's a mitigation
> strategy for handling pages that are experiencing errors but are not
> yet completely unusable, and its main purpose is to prevent future
> issues. It seems fine to skip this for fuse folios.
>
> 8) do_migrate_range()
> 9) compact_zone()
> 10) migrate_longterm_unpinnable_folios()
> 11) __alloc_contig_migrate_range()
>
> 8 to 11 needs more investigation / thinking about. I don't see a good
> way around these tbh. I think we have to operate under the assumption
> that the fuse server running is malicious or benevolently but
> incorrectly written and could possibly never complete writeback. So we
> definitely can't wait on these but it also doesn't seem like we can
> skip waiting on these, especially for the case where the server uses
> spliced pages, nor does it seem like we can just fail these with
> -EBUSY or something.
>

I'm still not seeing a good way around this.

What about this then? We add a new fuse sysctl called something like
"/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
admin sets this, then it opts into optimizing writeback to be as fast
as possible (eg skipping the page copies) and if the server doesn't
fulfill the writeback by the set timeout value, then the connection is
aborted.

Alternatively, we could also repurpose
/proc/sys/fs/fuse/max_request_timeout from the request timeout
patchset [1] but I like the additional flexibility and explicitness
having the "writeback_optimization_timeout" sysctl gives.

Any thoughts on this?

[1] https://lore.kernel.org/linux-fsdevel/20241011191320.91592-4-joannelkoong@gmail.com/

Thanks,
Joanne

> Will continue looking more into this early next week.
>
> Thanks,
> Joanne
> >
> > Thanks,
> > Joanne
> > >
> > >
> > > --
> > > Thanks,
> > > Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-17 13:31                 ` Miklos Szeredi
  2024-10-18  5:31                   ` Shakeel Butt
@ 2024-10-29 22:04                   ` Bernd Schubert
  1 sibling, 0 replies; 63+ messages in thread
From: Bernd Schubert @ 2024-10-29 22:04 UTC (permalink / raw)
  To: Miklos Szeredi, Shakeel Butt
  Cc: Joanne Koong, linux-fsdevel, josef, jefflexu, hannes, linux-mm,
	kernel-team



On 10/17/24 15:31, Miklos Szeredi wrote:
> On Wed, 16 Oct 2024 at 23:27, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
>> Why is it bad? I can understand fuse server getting blocked on fuse
>> folios is bad but why it is bad for other applications/tasks? I am
>> wondering network filesystems have to handle similar situation then why
>> is it bad just for fuse?
> 
> You need privileges (physical access) to unplug the network cable.
> You don't need privileges (in most setups) to run a fuse server.

Not sure if that is the perfect example, you can also run a local user
space nfs server. I think the real difference is 'fusermount' which
typically has the s-bit set and allows mounts from unprivileged users.
I.e. I really think we should differentiate between 
privileged/non-privileged fuse server tasks.

Thanks,
Bernd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-28 21:58                                 ` Joanne Koong
@ 2024-10-30  9:32                                   ` Bernd Schubert
  2024-10-30 16:04                                     ` Joanne Koong
  0 siblings, 1 reply; 63+ messages in thread
From: Bernd Schubert @ 2024-10-30  9:32 UTC (permalink / raw)
  To: Joanne Koong, Jingbo Xu
  Cc: Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef, hannes,
	linux-mm, kernel-team



On 10/28/24 22:58, Joanne Koong wrote:
> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>>> Same here, I need to look some more into the compaction / page
>>> migration paths. I'm planning to do this early next week and will
>>> report back with what I find.
>>>
>>
>> These are my notes so far:
>>
>> * We hit the folio_wait_writeback() path when callers call
>> migrate_pages() with mode MIGRATE_SYNC
>>    ... -> migrate_pages() -> migrate_pages_sync() ->
>> migrate_pages_batch() -> migrate_folio_unmap() ->
>> folio_wait_writeback()
>>
>> * These are the places where we call migrate_pages():
>> 1) demote_folio_list()
>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>
>> 2) __damon_pa_migrate_folio_list()
>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>
>> 3) migrate_misplaced_folio()
>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>
>> 4) do_move_pages_to_node()
>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
>> this path is only invoked by the move_pages() syscall. It's fine to
>> wait on writeback for the move_pages() syscall since the user would
>> have to deliberately invoke this on the fuse server for this to apply
>> to the server's fuse folios
>>
>> 5)  migrate_to_node()
>> Can ignore this for the same reason as in 4. This path is only invoked
>> by the migrate_pages() syscall.
>>
>> 6) do_mbind()
>> Can ignore this for the same reason as 4 and 5. This path is only
>> invoked by the mbind() syscall.
>>
>> 7) soft_offline_in_use_page()
>> Can skip soft offlining fuse folios (eg folios with the
>> AS_NO_WRITEBACK_WAIT mapping flag set).
>> The path for this is soft_offline_page() -> soft_offline_in_use_page()
>> -> migrate_pages(). soft_offline_page() only invokes this for in-use
>> pages in a well-defined state (see ret value of get_hwpoison_page()).
>> My understanding of soft offlining pages is that it's a mitigation
>> strategy for handling pages that are experiencing errors but are not
>> yet completely unusable, and its main purpose is to prevent future
>> issues. It seems fine to skip this for fuse folios.
>>
>> 8) do_migrate_range()
>> 9) compact_zone()
>> 10) migrate_longterm_unpinnable_folios()
>> 11) __alloc_contig_migrate_range()
>>
>> 8 to 11 needs more investigation / thinking about. I don't see a good
>> way around these tbh. I think we have to operate under the assumption
>> that the fuse server running is malicious or benevolently but
>> incorrectly written and could possibly never complete writeback. So we
>> definitely can't wait on these but it also doesn't seem like we can
>> skip waiting on these, especially for the case where the server uses
>> spliced pages, nor does it seem like we can just fail these with
>> -EBUSY or something.

I see some code paths with -EAGAIN in migration. Could you explain why
we can't just fail migration for fuse write-back pages?

>>
> 
> I'm still not seeing a good way around this.
> 
> What about this then? We add a new fuse sysctl called something like
> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
> admin sets this, then it opts into optimizing writeback to be as fast
> as possible (eg skipping the page copies) and if the server doesn't
> fulfill the writeback by the set timeout value, then the connection is
> aborted.
> 
> Alternatively, we could also repurpose
> /proc/sys/fs/fuse/max_request_timeout from the request timeout
> patchset [1] but I like the additional flexibility and explicitness
> having the "writeback_optimization_timeout" sysctl gives.
> 
> Any thoughts on this?


I'm a bit worried that we might lock up the system until time out is
reached - not ideal. Especially as timeouts are in minutes now. But
even a slightly stuttering video system not be great. I think we
should give users/admin the choice then, if they prefer slow page
copies or fast, but possibly shortly unresponsive system.


Thank,
Bernd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-30  9:32                                   ` Bernd Schubert
@ 2024-10-30 16:04                                     ` Joanne Koong
  2024-10-30 16:21                                       ` Bernd Schubert
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-30 16:04 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jingbo Xu, Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team

On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
> On 10/28/24 22:58, Joanne Koong wrote:
> > On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>
> >>> Same here, I need to look some more into the compaction / page
> >>> migration paths. I'm planning to do this early next week and will
> >>> report back with what I find.
> >>>
> >>
> >> These are my notes so far:
> >>
> >> * We hit the folio_wait_writeback() path when callers call
> >> migrate_pages() with mode MIGRATE_SYNC
> >>    ... -> migrate_pages() -> migrate_pages_sync() ->
> >> migrate_pages_batch() -> migrate_folio_unmap() ->
> >> folio_wait_writeback()
> >>
> >> * These are the places where we call migrate_pages():
> >> 1) demote_folio_list()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 2) __damon_pa_migrate_folio_list()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 3) migrate_misplaced_folio()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 4) do_move_pages_to_node()
> >> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> >> this path is only invoked by the move_pages() syscall. It's fine to
> >> wait on writeback for the move_pages() syscall since the user would
> >> have to deliberately invoke this on the fuse server for this to apply
> >> to the server's fuse folios
> >>
> >> 5)  migrate_to_node()
> >> Can ignore this for the same reason as in 4. This path is only invoked
> >> by the migrate_pages() syscall.
> >>
> >> 6) do_mbind()
> >> Can ignore this for the same reason as 4 and 5. This path is only
> >> invoked by the mbind() syscall.
> >>
> >> 7) soft_offline_in_use_page()
> >> Can skip soft offlining fuse folios (eg folios with the
> >> AS_NO_WRITEBACK_WAIT mapping flag set).
> >> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> >> -> migrate_pages(). soft_offline_page() only invokes this for in-use
> >> pages in a well-defined state (see ret value of get_hwpoison_page()).
> >> My understanding of soft offlining pages is that it's a mitigation
> >> strategy for handling pages that are experiencing errors but are not
> >> yet completely unusable, and its main purpose is to prevent future
> >> issues. It seems fine to skip this for fuse folios.
> >>
> >> 8) do_migrate_range()
> >> 9) compact_zone()
> >> 10) migrate_longterm_unpinnable_folios()
> >> 11) __alloc_contig_migrate_range()
> >>
> >> 8 to 11 needs more investigation / thinking about. I don't see a good
> >> way around these tbh. I think we have to operate under the assumption
> >> that the fuse server running is malicious or benevolently but
> >> incorrectly written and could possibly never complete writeback. So we
> >> definitely can't wait on these but it also doesn't seem like we can
> >> skip waiting on these, especially for the case where the server uses
> >> spliced pages, nor does it seem like we can just fail these with
> >> -EBUSY or something.
>
> I see some code paths with -EAGAIN in migration. Could you explain why
> we can't just fail migration for fuse write-back pages?
>

My understanding (and please correct me here Shakeel if I'm wrong) is
that this could block system optimizations, especially since if an
unprivileged malicious fuse server never replies to the writeback
request, then this completely stalls progress. In the best case
scenario, -EAGAIN could be used because the server might just be slow
in serving the writeback, but I think we need to also account for
servers that never complete the writeback. For
__alloc_contig_migrate_range() for example, my understanding is that
this is used to migrate pages so that there are more physically
contiguous ranges of memory freed up. If fuse writeback blocks that,
then that hurts system health overall.

> >>
> >
> > I'm still not seeing a good way around this.
> >
> > What about this then? We add a new fuse sysctl called something like
> > "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
> > admin sets this, then it opts into optimizing writeback to be as fast
> > as possible (eg skipping the page copies) and if the server doesn't
> > fulfill the writeback by the set timeout value, then the connection is
> > aborted.
> >
> > Alternatively, we could also repurpose
> > /proc/sys/fs/fuse/max_request_timeout from the request timeout
> > patchset [1] but I like the additional flexibility and explicitness
> > having the "writeback_optimization_timeout" sysctl gives.
> >
> > Any thoughts on this?
>
>
> I'm a bit worried that we might lock up the system until time out is
> reached - not ideal. Especially as timeouts are in minutes now. But
> even a slightly stuttering video system not be great. I think we
> should give users/admin the choice then, if they prefer slow page
> copies or fast, but possibly shortly unresponsive system.
>
I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
would be in seconds, where the sys admin would probably set something
more reasonable like 5 seconds or so.
If this syctl value is set, then servers who want writebacks to be
fast can opt into it at mount time (and by doing so agree that they
will service writeback requests by the timeout or their connection
will be aborted).


Thanks,
Joanne
>
> Thank,
> Bernd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-30 16:04                                     ` Joanne Koong
@ 2024-10-30 16:21                                       ` Bernd Schubert
  2024-10-30 17:02                                         ` Joanne Koong
  0 siblings, 1 reply; 63+ messages in thread
From: Bernd Schubert @ 2024-10-30 16:21 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Jingbo Xu, Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team



On 10/30/24 17:04, Joanne Koong wrote:
> On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
>>
>> On 10/28/24 22:58, Joanne Koong wrote:
>>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>>
>>>>> Same here, I need to look some more into the compaction / page
>>>>> migration paths. I'm planning to do this early next week and will
>>>>> report back with what I find.
>>>>>
>>>>
>>>> These are my notes so far:
>>>>
>>>> * We hit the folio_wait_writeback() path when callers call
>>>> migrate_pages() with mode MIGRATE_SYNC
>>>>    ... -> migrate_pages() -> migrate_pages_sync() ->
>>>> migrate_pages_batch() -> migrate_folio_unmap() ->
>>>> folio_wait_writeback()
>>>>
>>>> * These are the places where we call migrate_pages():
>>>> 1) demote_folio_list()
>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>
>>>> 2) __damon_pa_migrate_folio_list()
>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>
>>>> 3) migrate_misplaced_folio()
>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>
>>>> 4) do_move_pages_to_node()
>>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
>>>> this path is only invoked by the move_pages() syscall. It's fine to
>>>> wait on writeback for the move_pages() syscall since the user would
>>>> have to deliberately invoke this on the fuse server for this to apply
>>>> to the server's fuse folios
>>>>
>>>> 5)  migrate_to_node()
>>>> Can ignore this for the same reason as in 4. This path is only invoked
>>>> by the migrate_pages() syscall.
>>>>
>>>> 6) do_mbind()
>>>> Can ignore this for the same reason as 4 and 5. This path is only
>>>> invoked by the mbind() syscall.
>>>>
>>>> 7) soft_offline_in_use_page()
>>>> Can skip soft offlining fuse folios (eg folios with the
>>>> AS_NO_WRITEBACK_WAIT mapping flag set).
>>>> The path for this is soft_offline_page() -> soft_offline_in_use_page()
>>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use
>>>> pages in a well-defined state (see ret value of get_hwpoison_page()).
>>>> My understanding of soft offlining pages is that it's a mitigation
>>>> strategy for handling pages that are experiencing errors but are not
>>>> yet completely unusable, and its main purpose is to prevent future
>>>> issues. It seems fine to skip this for fuse folios.
>>>>
>>>> 8) do_migrate_range()
>>>> 9) compact_zone()
>>>> 10) migrate_longterm_unpinnable_folios()
>>>> 11) __alloc_contig_migrate_range()
>>>>
>>>> 8 to 11 needs more investigation / thinking about. I don't see a good
>>>> way around these tbh. I think we have to operate under the assumption
>>>> that the fuse server running is malicious or benevolently but
>>>> incorrectly written and could possibly never complete writeback. So we
>>>> definitely can't wait on these but it also doesn't seem like we can
>>>> skip waiting on these, especially for the case where the server uses
>>>> spliced pages, nor does it seem like we can just fail these with
>>>> -EBUSY or something.
>>
>> I see some code paths with -EAGAIN in migration. Could you explain why
>> we can't just fail migration for fuse write-back pages?
>>

Hi Joanne,

thanks a lot for your quick reply (especially as my reviews come in very 
late).

> 
> My understanding (and please correct me here Shakeel if I'm wrong) is
> that this could block system optimizations, especially since if an
> unprivileged malicious fuse server never replies to the writeback
> request, then this completely stalls progress. In the best case
> scenario, -EAGAIN could be used because the server might just be slow
> in serving the writeback, but I think we need to also account for
> servers that never complete the writeback. For
> __alloc_contig_migrate_range() for example, my understanding is that
> this is used to migrate pages so that there are more physically
> contiguous ranges of memory freed up. If fuse writeback blocks that,
> then that hurts system health overall.

Hmm, I wonder what is worse - tmp page copies or missing compaction.
Especially if we expect a low range of in-writeback pages/folios. 
One could argue that an evil user might spawn many fuse server
processes to work around the default low fuse write-back limits, but
does that make any difference with tmp pages? And these cannot be
compacted either?

And with timeouts that would be so far totally uncritical, I
think.


You also mentioned 

> especially for the case where the server uses spliced pages

could you provide more details for that? 



> 
>>>>
>>>
>>> I'm still not seeing a good way around this.
>>>
>>> What about this then? We add a new fuse sysctl called something like
>>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
>>> admin sets this, then it opts into optimizing writeback to be as fast
>>> as possible (eg skipping the page copies) and if the server doesn't
>>> fulfill the writeback by the set timeout value, then the connection is
>>> aborted.
>>>
>>> Alternatively, we could also repurpose
>>> /proc/sys/fs/fuse/max_request_timeout from the request timeout
>>> patchset [1] but I like the additional flexibility and explicitness
>>> having the "writeback_optimization_timeout" sysctl gives.
>>>
>>> Any thoughts on this?
>>
>>
>> I'm a bit worried that we might lock up the system until time out is
>> reached - not ideal. Especially as timeouts are in minutes now. But
>> even a slightly stuttering video system not be great. I think we
>> should give users/admin the choice then, if they prefer slow page
>> copies or fast, but possibly shortly unresponsive system.
>>
> I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
> would be in seconds, where the sys admin would probably set something
> more reasonable like 5 seconds or so.
> If this syctl value is set, then servers who want writebacks to be
> fast can opt into it at mount time (and by doing so agree that they
> will service writeback requests by the timeout or their connection
> will be aborted).


I think your current patch set has it in minutes? (Should be easy
enough to change that.) Though I'm more worried about the impact
of _frequent_ timeout scanning through the different fuse lists 
on performance, than about missing compaction for folios that are
currently in write-back.


Thanks,
Bernd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-30 16:21                                       ` Bernd Schubert
@ 2024-10-30 17:02                                         ` Joanne Koong
  2024-10-30 17:27                                           ` Bernd Schubert
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-30 17:02 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jingbo Xu, Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team

On Wed, Oct 30, 2024 at 9:21 AM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
> On 10/30/24 17:04, Joanne Koong wrote:
> > On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert
> > <bernd.schubert@fastmail.fm> wrote:
> >>
> >> On 10/28/24 22:58, Joanne Koong wrote:
> >>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>>>
> >>>>> Same here, I need to look some more into the compaction / page
> >>>>> migration paths. I'm planning to do this early next week and will
> >>>>> report back with what I find.
> >>>>>
> >>>>
> >>>> These are my notes so far:
> >>>>
> >>>> * We hit the folio_wait_writeback() path when callers call
> >>>> migrate_pages() with mode MIGRATE_SYNC
> >>>>    ... -> migrate_pages() -> migrate_pages_sync() ->
> >>>> migrate_pages_batch() -> migrate_folio_unmap() ->
> >>>> folio_wait_writeback()
> >>>>
> >>>> * These are the places where we call migrate_pages():
> >>>> 1) demote_folio_list()
> >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>
> >>>> 2) __damon_pa_migrate_folio_list()
> >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>
> >>>> 3) migrate_misplaced_folio()
> >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>
> >>>> 4) do_move_pages_to_node()
> >>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> >>>> this path is only invoked by the move_pages() syscall. It's fine to
> >>>> wait on writeback for the move_pages() syscall since the user would
> >>>> have to deliberately invoke this on the fuse server for this to apply
> >>>> to the server's fuse folios
> >>>>
> >>>> 5)  migrate_to_node()
> >>>> Can ignore this for the same reason as in 4. This path is only invoked
> >>>> by the migrate_pages() syscall.
> >>>>
> >>>> 6) do_mbind()
> >>>> Can ignore this for the same reason as 4 and 5. This path is only
> >>>> invoked by the mbind() syscall.
> >>>>
> >>>> 7) soft_offline_in_use_page()
> >>>> Can skip soft offlining fuse folios (eg folios with the
> >>>> AS_NO_WRITEBACK_WAIT mapping flag set).
> >>>> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> >>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use
> >>>> pages in a well-defined state (see ret value of get_hwpoison_page()).
> >>>> My understanding of soft offlining pages is that it's a mitigation
> >>>> strategy for handling pages that are experiencing errors but are not
> >>>> yet completely unusable, and its main purpose is to prevent future
> >>>> issues. It seems fine to skip this for fuse folios.
> >>>>
> >>>> 8) do_migrate_range()
> >>>> 9) compact_zone()
> >>>> 10) migrate_longterm_unpinnable_folios()
> >>>> 11) __alloc_contig_migrate_range()
> >>>>
> >>>> 8 to 11 needs more investigation / thinking about. I don't see a good
> >>>> way around these tbh. I think we have to operate under the assumption
> >>>> that the fuse server running is malicious or benevolently but
> >>>> incorrectly written and could possibly never complete writeback. So we
> >>>> definitely can't wait on these but it also doesn't seem like we can
> >>>> skip waiting on these, especially for the case where the server uses
> >>>> spliced pages, nor does it seem like we can just fail these with
> >>>> -EBUSY or something.
> >>
> >> I see some code paths with -EAGAIN in migration. Could you explain why
> >> we can't just fail migration for fuse write-back pages?
> >>
>
> Hi Joanne,
>
> thanks a lot for your quick reply (especially as my reviews come in very
> late).
>

Thanks for your comments/reviews, Bernd! I always appreciate them.

> >
> > My understanding (and please correct me here Shakeel if I'm wrong) is
> > that this could block system optimizations, especially since if an
> > unprivileged malicious fuse server never replies to the writeback
> > request, then this completely stalls progress. In the best case
> > scenario, -EAGAIN could be used because the server might just be slow
> > in serving the writeback, but I think we need to also account for
> > servers that never complete the writeback. For
> > __alloc_contig_migrate_range() for example, my understanding is that
> > this is used to migrate pages so that there are more physically
> > contiguous ranges of memory freed up. If fuse writeback blocks that,
> > then that hurts system health overall.
>
> Hmm, I wonder what is worse - tmp page copies or missing compaction.
> Especially if we expect a low range of in-writeback pages/folios.
> One could argue that an evil user might spawn many fuse server
> processes to work around the default low fuse write-back limits, but
> does that make any difference with tmp pages? And these cannot be
> compacted either?

My understanding (and Shakeel please jump in here if this isn't right)
is that tmp pages can be migrated/compacted. I think it's only pages
marked as under writeback that are considered to be non-movable.

>
> And with timeouts that would be so far totally uncritical, I
> think.
>
>
> You also mentioned
>
> > especially for the case where the server uses spliced pages
>
> could you provide more details for that?
>

For the page migration / compaction paths, I don't think we can do the
workaround we could do for sync where we skip waiting on writeback for
fuse folios and continue on with the operation, because the migration
/ compaction paths operate on the pages. For the splice case, we
assign the page to the pipebuffer (fuse_ref_page()), so if the
migration/compaction happens on the page before the server has read
this page from the pipebuffer, it'll be incorrect data or maybe crash
the kernel.

>
>
> >
> >>>>
> >>>
> >>> I'm still not seeing a good way around this.
> >>>
> >>> What about this then? We add a new fuse sysctl called something like
> >>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
> >>> admin sets this, then it opts into optimizing writeback to be as fast
> >>> as possible (eg skipping the page copies) and if the server doesn't
> >>> fulfill the writeback by the set timeout value, then the connection is
> >>> aborted.
> >>>
> >>> Alternatively, we could also repurpose
> >>> /proc/sys/fs/fuse/max_request_timeout from the request timeout
> >>> patchset [1] but I like the additional flexibility and explicitness
> >>> having the "writeback_optimization_timeout" sysctl gives.
> >>>
> >>> Any thoughts on this?
> >>
> >>
> >> I'm a bit worried that we might lock up the system until time out is
> >> reached - not ideal. Especially as timeouts are in minutes now. But
> >> even a slightly stuttering video system not be great. I think we
> >> should give users/admin the choice then, if they prefer slow page
> >> copies or fast, but possibly shortly unresponsive system.
> >>
> > I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
> > would be in seconds, where the sys admin would probably set something
> > more reasonable like 5 seconds or so.
> > If this syctl value is set, then servers who want writebacks to be
> > fast can opt into it at mount time (and by doing so agree that they
> > will service writeback requests by the timeout or their connection
> > will be aborted).
>
>
> I think your current patch set has it in minutes? (Should be easy
> enough to change that.) Though I'm more worried about the impact
> of _frequent_ timeout scanning through the different fuse lists
> on performance, than about missing compaction for folios that are
> currently in write-back.
>

Ah, for this the " /proc/sys/fs/fuse/writeback_optimization_timeout"
would be a separate thing from the
"/proc/sys/fs/fuse/max_request_timeout". The
"/proc/sys/fs/fuse/writeback_optimization_timeout" would only apply
for writeback requests. I was thinking implementation-wise, for
writebacks we could just have a timer associated with each request
(instead of having to grab locks with the fuse lists), since they
won't be super common.


Thanks,
Joanne
>
> Thanks,
> Bernd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-30 17:02                                         ` Joanne Koong
@ 2024-10-30 17:27                                           ` Bernd Schubert
  2024-10-30 17:35                                             ` Joanne Koong
  0 siblings, 1 reply; 63+ messages in thread
From: Bernd Schubert @ 2024-10-30 17:27 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Jingbo Xu, Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team



On 10/30/24 18:02, Joanne Koong wrote:
> On Wed, Oct 30, 2024 at 9:21 AM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
>>
>> On 10/30/24 17:04, Joanne Koong wrote:
>>> On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert
>>> <bernd.schubert@fastmail.fm> wrote:
>>>>
>>>> On 10/28/24 22:58, Joanne Koong wrote:
>>>>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>>>>
>>>>>>> Same here, I need to look some more into the compaction / page
>>>>>>> migration paths. I'm planning to do this early next week and will
>>>>>>> report back with what I find.
>>>>>>>
>>>>>>
>>>>>> These are my notes so far:
>>>>>>
>>>>>> * We hit the folio_wait_writeback() path when callers call
>>>>>> migrate_pages() with mode MIGRATE_SYNC
>>>>>>    ... -> migrate_pages() -> migrate_pages_sync() ->
>>>>>> migrate_pages_batch() -> migrate_folio_unmap() ->
>>>>>> folio_wait_writeback()
>>>>>>
>>>>>> * These are the places where we call migrate_pages():
>>>>>> 1) demote_folio_list()
>>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>>>
>>>>>> 2) __damon_pa_migrate_folio_list()
>>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>>>
>>>>>> 3) migrate_misplaced_folio()
>>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>>>
>>>>>> 4) do_move_pages_to_node()
>>>>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
>>>>>> this path is only invoked by the move_pages() syscall. It's fine to
>>>>>> wait on writeback for the move_pages() syscall since the user would
>>>>>> have to deliberately invoke this on the fuse server for this to apply
>>>>>> to the server's fuse folios
>>>>>>
>>>>>> 5)  migrate_to_node()
>>>>>> Can ignore this for the same reason as in 4. This path is only invoked
>>>>>> by the migrate_pages() syscall.
>>>>>>
>>>>>> 6) do_mbind()
>>>>>> Can ignore this for the same reason as 4 and 5. This path is only
>>>>>> invoked by the mbind() syscall.
>>>>>>
>>>>>> 7) soft_offline_in_use_page()
>>>>>> Can skip soft offlining fuse folios (eg folios with the
>>>>>> AS_NO_WRITEBACK_WAIT mapping flag set).
>>>>>> The path for this is soft_offline_page() -> soft_offline_in_use_page()
>>>>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use
>>>>>> pages in a well-defined state (see ret value of get_hwpoison_page()).
>>>>>> My understanding of soft offlining pages is that it's a mitigation
>>>>>> strategy for handling pages that are experiencing errors but are not
>>>>>> yet completely unusable, and its main purpose is to prevent future
>>>>>> issues. It seems fine to skip this for fuse folios.
>>>>>>
>>>>>> 8) do_migrate_range()
>>>>>> 9) compact_zone()
>>>>>> 10) migrate_longterm_unpinnable_folios()
>>>>>> 11) __alloc_contig_migrate_range()
>>>>>>
>>>>>> 8 to 11 needs more investigation / thinking about. I don't see a good
>>>>>> way around these tbh. I think we have to operate under the assumption
>>>>>> that the fuse server running is malicious or benevolently but
>>>>>> incorrectly written and could possibly never complete writeback. So we
>>>>>> definitely can't wait on these but it also doesn't seem like we can
>>>>>> skip waiting on these, especially for the case where the server uses
>>>>>> spliced pages, nor does it seem like we can just fail these with
>>>>>> -EBUSY or something.
>>>>
>>>> I see some code paths with -EAGAIN in migration. Could you explain why
>>>> we can't just fail migration for fuse write-back pages?
>>>>
>>
>> Hi Joanne,
>>
>> thanks a lot for your quick reply (especially as my reviews come in very
>> late).
>>
> 
> Thanks for your comments/reviews, Bernd! I always appreciate them.
> 
>>>
>>> My understanding (and please correct me here Shakeel if I'm wrong) is
>>> that this could block system optimizations, especially since if an
>>> unprivileged malicious fuse server never replies to the writeback
>>> request, then this completely stalls progress. In the best case
>>> scenario, -EAGAIN could be used because the server might just be slow
>>> in serving the writeback, but I think we need to also account for
>>> servers that never complete the writeback. For
>>> __alloc_contig_migrate_range() for example, my understanding is that
>>> this is used to migrate pages so that there are more physically
>>> contiguous ranges of memory freed up. If fuse writeback blocks that,
>>> then that hurts system health overall.
>>
>> Hmm, I wonder what is worse - tmp page copies or missing compaction.
>> Especially if we expect a low range of in-writeback pages/folios.
>> One could argue that an evil user might spawn many fuse server
>> processes to work around the default low fuse write-back limits, but
>> does that make any difference with tmp pages? And these cannot be
>> compacted either?
> 
> My understanding (and Shakeel please jump in here if this isn't right)
> is that tmp pages can be migrated/compacted. I think it's only pages
> marked as under writeback that are considered to be non-movable.
> 
>>
>> And with timeouts that would be so far totally uncritical, I
>> think.
>>
>>
>> You also mentioned
>>
>>> especially for the case where the server uses spliced pages
>>
>> could you provide more details for that?
>>
7> 
> For the page migration / compaction paths, I don't think we can do the
> workaround we could do for sync where we skip waiting on writeback for
> fuse folios and continue on with the operation, because the migration
> / compaction paths operate on the pages. For the splice case, we
> assign the page to the pipebuffer (fuse_ref_page()), so if the
> migration/compaction happens on the page before the server has read
> this page from the pipebuffer, it'll be incorrect data or maybe crash
> the kernel.
> 
>>
>>
>>>
>>>>>>
>>>>>
>>>>> I'm still not seeing a good way around this.
>>>>>
>>>>> What about this then? We add a new fuse sysctl called something like
>>>>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
>>>>> admin sets this, then it opts into optimizing writeback to be as fast
>>>>> as possible (eg skipping the page copies) and if the server doesn't
>>>>> fulfill the writeback by the set timeout value, then the connection is
>>>>> aborted.
>>>>>
>>>>> Alternatively, we could also repurpose
>>>>> /proc/sys/fs/fuse/max_request_timeout from the request timeout
>>>>> patchset [1] but I like the additional flexibility and explicitness
>>>>> having the "writeback_optimization_timeout" sysctl gives.
>>>>>
>>>>> Any thoughts on this?
>>>>
>>>>
>>>> I'm a bit worried that we might lock up the system until time out is
>>>> reached - not ideal. Especially as timeouts are in minutes now. But
>>>> even a slightly stuttering video system not be great. I think we
>>>> should give users/admin the choice then, if they prefer slow page
>>>> copies or fast, but possibly shortly unresponsive system.
>>>>
>>> I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
>>> would be in seconds, where the sys admin would probably set something
>>> more reasonable like 5 seconds or so.
>>> If this syctl value is set, then servers who want writebacks to be
>>> fast can opt into it at mount time (and by doing so agree that they
>>> will service writeback requests by the timeout or their connection
>>> will be aborted).
>>
>>
>> I think your current patch set has it in minutes? (Should be easy
>> enough to change that.) Though I'm more worried about the impact
>> of _frequent_ timeout scanning through the different fuse lists
>> on performance, than about missing compaction for folios that are
>> currently in write-back.

Hmm, if tmp pages can be compacted, isn't that a problem for splice?
I.e. I don't understand what the difference between tmp page and
write-back page for migration.


>>
> 
> Ah, for this the " /proc/sys/fs/fuse/writeback_optimization_timeout"
> would be a separate thing from the
> "/proc/sys/fs/fuse/max_request_timeout". The
> "/proc/sys/fs/fuse/writeback_optimization_timeout" would only apply
> for writeback requests. I was thinking implementation-wise, for
> writebacks we could just have a timer associated with each request
> (instead of having to grab locks with the fuse lists), since they
> won't be super common.

Ah, thank you! I had missed that this is another variable. Issue
with too short timeouts would probably be network hick-up that
would immediately kill fuse-server. I.e. if it just the missing
page compaction/migration, maybe larger time outs would be
acceptable.


Thanks,
Bernd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-30 17:27                                           ` Bernd Schubert
@ 2024-10-30 17:35                                             ` Joanne Koong
  2024-10-30 21:56                                               ` Shakeel Butt
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-30 17:35 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jingbo Xu, Miklos Szeredi, Shakeel Butt, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team

On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
>
>
> On 10/30/24 18:02, Joanne Koong wrote:
> > On Wed, Oct 30, 2024 at 9:21 AM Bernd Schubert
> > <bernd.schubert@fastmail.fm> wrote:
> >>
> >> On 10/30/24 17:04, Joanne Koong wrote:
> >>> On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert
> >>> <bernd.schubert@fastmail.fm> wrote:
> >>>>
> >>>> On 10/28/24 22:58, Joanne Koong wrote:
> >>>>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>>>>>
> >>>>>>> Same here, I need to look some more into the compaction / page
> >>>>>>> migration paths. I'm planning to do this early next week and will
> >>>>>>> report back with what I find.
> >>>>>>>
> >>>>>>
> >>>>>> These are my notes so far:
> >>>>>>
> >>>>>> * We hit the folio_wait_writeback() path when callers call
> >>>>>> migrate_pages() with mode MIGRATE_SYNC
> >>>>>>    ... -> migrate_pages() -> migrate_pages_sync() ->
> >>>>>> migrate_pages_batch() -> migrate_folio_unmap() ->
> >>>>>> folio_wait_writeback()
> >>>>>>
> >>>>>> * These are the places where we call migrate_pages():
> >>>>>> 1) demote_folio_list()
> >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>>>
> >>>>>> 2) __damon_pa_migrate_folio_list()
> >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>>>
> >>>>>> 3) migrate_misplaced_folio()
> >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>>>
> >>>>>> 4) do_move_pages_to_node()
> >>>>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> >>>>>> this path is only invoked by the move_pages() syscall. It's fine to
> >>>>>> wait on writeback for the move_pages() syscall since the user would
> >>>>>> have to deliberately invoke this on the fuse server for this to apply
> >>>>>> to the server's fuse folios
> >>>>>>
> >>>>>> 5)  migrate_to_node()
> >>>>>> Can ignore this for the same reason as in 4. This path is only invoked
> >>>>>> by the migrate_pages() syscall.
> >>>>>>
> >>>>>> 6) do_mbind()
> >>>>>> Can ignore this for the same reason as 4 and 5. This path is only
> >>>>>> invoked by the mbind() syscall.
> >>>>>>
> >>>>>> 7) soft_offline_in_use_page()
> >>>>>> Can skip soft offlining fuse folios (eg folios with the
> >>>>>> AS_NO_WRITEBACK_WAIT mapping flag set).
> >>>>>> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> >>>>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use
> >>>>>> pages in a well-defined state (see ret value of get_hwpoison_page()).
> >>>>>> My understanding of soft offlining pages is that it's a mitigation
> >>>>>> strategy for handling pages that are experiencing errors but are not
> >>>>>> yet completely unusable, and its main purpose is to prevent future
> >>>>>> issues. It seems fine to skip this for fuse folios.
> >>>>>>
> >>>>>> 8) do_migrate_range()
> >>>>>> 9) compact_zone()
> >>>>>> 10) migrate_longterm_unpinnable_folios()
> >>>>>> 11) __alloc_contig_migrate_range()
> >>>>>>
> >>>>>> 8 to 11 needs more investigation / thinking about. I don't see a good
> >>>>>> way around these tbh. I think we have to operate under the assumption
> >>>>>> that the fuse server running is malicious or benevolently but
> >>>>>> incorrectly written and could possibly never complete writeback. So we
> >>>>>> definitely can't wait on these but it also doesn't seem like we can
> >>>>>> skip waiting on these, especially for the case where the server uses
> >>>>>> spliced pages, nor does it seem like we can just fail these with
> >>>>>> -EBUSY or something.
> >>>>
> >>>> I see some code paths with -EAGAIN in migration. Could you explain why
> >>>> we can't just fail migration for fuse write-back pages?
> >>>>
> >>
> >> Hi Joanne,
> >>
> >> thanks a lot for your quick reply (especially as my reviews come in very
> >> late).
> >>
> >
> > Thanks for your comments/reviews, Bernd! I always appreciate them.
> >
> >>>
> >>> My understanding (and please correct me here Shakeel if I'm wrong) is
> >>> that this could block system optimizations, especially since if an
> >>> unprivileged malicious fuse server never replies to the writeback
> >>> request, then this completely stalls progress. In the best case
> >>> scenario, -EAGAIN could be used because the server might just be slow
> >>> in serving the writeback, but I think we need to also account for
> >>> servers that never complete the writeback. For
> >>> __alloc_contig_migrate_range() for example, my understanding is that
> >>> this is used to migrate pages so that there are more physically
> >>> contiguous ranges of memory freed up. If fuse writeback blocks that,
> >>> then that hurts system health overall.
> >>
> >> Hmm, I wonder what is worse - tmp page copies or missing compaction.
> >> Especially if we expect a low range of in-writeback pages/folios.
> >> One could argue that an evil user might spawn many fuse server
> >> processes to work around the default low fuse write-back limits, but
> >> does that make any difference with tmp pages? And these cannot be
> >> compacted either?
> >
> > My understanding (and Shakeel please jump in here if this isn't right)
> > is that tmp pages can be migrated/compacted. I think it's only pages
> > marked as under writeback that are considered to be non-movable.
> >
> >>
> >> And with timeouts that would be so far totally uncritical, I
> >> think.
> >>
> >>
> >> You also mentioned
> >>
> >>> especially for the case where the server uses spliced pages
> >>
> >> could you provide more details for that?
> >>
> 7>
> > For the page migration / compaction paths, I don't think we can do the
> > workaround we could do for sync where we skip waiting on writeback for
> > fuse folios and continue on with the operation, because the migration
> > / compaction paths operate on the pages. For the splice case, we
> > assign the page to the pipebuffer (fuse_ref_page()), so if the
> > migration/compaction happens on the page before the server has read
> > this page from the pipebuffer, it'll be incorrect data or maybe crash
> > the kernel.
> >
> >>
> >>
> >>>
> >>>>>>
> >>>>>
> >>>>> I'm still not seeing a good way around this.
> >>>>>
> >>>>> What about this then? We add a new fuse sysctl called something like
> >>>>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
> >>>>> admin sets this, then it opts into optimizing writeback to be as fast
> >>>>> as possible (eg skipping the page copies) and if the server doesn't
> >>>>> fulfill the writeback by the set timeout value, then the connection is
> >>>>> aborted.
> >>>>>
> >>>>> Alternatively, we could also repurpose
> >>>>> /proc/sys/fs/fuse/max_request_timeout from the request timeout
> >>>>> patchset [1] but I like the additional flexibility and explicitness
> >>>>> having the "writeback_optimization_timeout" sysctl gives.
> >>>>>
> >>>>> Any thoughts on this?
> >>>>
> >>>>
> >>>> I'm a bit worried that we might lock up the system until time out is
> >>>> reached - not ideal. Especially as timeouts are in minutes now. But
> >>>> even a slightly stuttering video system not be great. I think we
> >>>> should give users/admin the choice then, if they prefer slow page
> >>>> copies or fast, but possibly shortly unresponsive system.
> >>>>
> >>> I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
> >>> would be in seconds, where the sys admin would probably set something
> >>> more reasonable like 5 seconds or so.
> >>> If this syctl value is set, then servers who want writebacks to be
> >>> fast can opt into it at mount time (and by doing so agree that they
> >>> will service writeback requests by the timeout or their connection
> >>> will be aborted).
> >>
> >>
> >> I think your current patch set has it in minutes? (Should be easy
> >> enough to change that.) Though I'm more worried about the impact
> >> of _frequent_ timeout scanning through the different fuse lists
> >> on performance, than about missing compaction for folios that are
> >> currently in write-back.
>
> Hmm, if tmp pages can be compacted, isn't that a problem for splice?
> I.e. I don't understand what the difference between tmp page and
> write-back page for migration.
>

That's a great question! I have no idea how compaction works for pages
being used in splice. Shakeel, do you know the answer to this?


Thanks,
Joanne

>
> >>
> >
> > Ah, for this the " /proc/sys/fs/fuse/writeback_optimization_timeout"
> > would be a separate thing from the
> > "/proc/sys/fs/fuse/max_request_timeout". The
> > "/proc/sys/fs/fuse/writeback_optimization_timeout" would only apply
> > for writeback requests. I was thinking implementation-wise, for
> > writebacks we could just have a timer associated with each request
> > (instead of having to grab locks with the fuse lists), since they
> > won't be super common.
>
> Ah, thank you! I had missed that this is another variable. Issue
> with too short timeouts would probably be network hick-up that
> would immediately kill fuse-server. I.e. if it just the missing
> page compaction/migration, maybe larger time outs would be
> acceptable.
>
>
> Thanks,
> Bernd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-30 17:35                                             ` Joanne Koong
@ 2024-10-30 21:56                                               ` Shakeel Butt
  2024-10-30 22:17                                                 ` Bernd Schubert
  0 siblings, 1 reply; 63+ messages in thread
From: Shakeel Butt @ 2024-10-30 21:56 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Bernd Schubert, Jingbo Xu, Miklos Szeredi, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team

On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote:
> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
> >
> >
> > Hmm, if tmp pages can be compacted, isn't that a problem for splice?
> > I.e. I don't understand what the difference between tmp page and
> > write-back page for migration.
> >
> 
> That's a great question! I have no idea how compaction works for pages
> being used in splice. Shakeel, do you know the answer to this?
> 

Sorry for the late response. I still have to go through other unanswered
questions but let me answer this one quickly. From the way the tmp pages
are allocated, it does not seem like they are movable and thus are not
target for migration/compaction.

The page with the writeback bit set is actually just a user memory page
cache which is moveable but due to, at the moment, under writeback it
temporarily becomes unmovable to not cause corruption.

Shakeel


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-30 21:56                                               ` Shakeel Butt
@ 2024-10-30 22:17                                                 ` Bernd Schubert
  2024-10-30 22:51                                                   ` Joanne Koong
  0 siblings, 1 reply; 63+ messages in thread
From: Bernd Schubert @ 2024-10-30 22:17 UTC (permalink / raw)
  To: Shakeel Butt, Joanne Koong
  Cc: Jingbo Xu, Miklos Szeredi, linux-fsdevel, josef, hannes, linux-mm,
	kernel-team



On 10/30/24 22:56, Shakeel Butt wrote:
> On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote:
>> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
>> <bernd.schubert@fastmail.fm> wrote:
>>>
>>>
>>> Hmm, if tmp pages can be compacted, isn't that a problem for splice?
>>> I.e. I don't understand what the difference between tmp page and
>>> write-back page for migration.
>>>
>>
>> That's a great question! I have no idea how compaction works for pages
>> being used in splice. Shakeel, do you know the answer to this?
>>
> 
> Sorry for the late response. I still have to go through other unanswered
> questions but let me answer this one quickly. From the way the tmp pages
> are allocated, it does not seem like they are movable and thus are not
> target for migration/compaction.
> 
> The page with the writeback bit set is actually just a user memory page
> cache which is moveable but due to, at the moment, under writeback it
> temporarily becomes unmovable to not cause corruption.

Thanks a lot for your quick reply Shakeel! (Actually very fast!).

With that, it confirms what I wrote earlier - removing tmp and ignoring
fuse writeback pages in migration should not make any difference 
regarding overall system performance. Unless I miss something,
more on the contrary as additional memory pressure expensive page
copying is being removed.


Thanks,
Bernd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-30 22:17                                                 ` Bernd Schubert
@ 2024-10-30 22:51                                                   ` Joanne Koong
  2024-10-31  0:30                                                     ` Shakeel Butt
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-30 22:51 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Shakeel Butt, Jingbo Xu, Miklos Szeredi, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team

On Wed, Oct 30, 2024 at 3:17 PM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
>
>
> On 10/30/24 22:56, Shakeel Butt wrote:
> > On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote:
> >> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
> >> <bernd.schubert@fastmail.fm> wrote:
> >>>
> >>>
> >>> Hmm, if tmp pages can be compacted, isn't that a problem for splice?
> >>> I.e. I don't understand what the difference between tmp page and
> >>> write-back page for migration.
> >>>
> >>
> >> That's a great question! I have no idea how compaction works for pages
> >> being used in splice. Shakeel, do you know the answer to this?
> >>
> >
> > Sorry for the late response. I still have to go through other unanswered
> > questions but let me answer this one quickly. From the way the tmp pages
> > are allocated, it does not seem like they are movable and thus are not
> > target for migration/compaction.
> >
> > The page with the writeback bit set is actually just a user memory page
> > cache which is moveable but due to, at the moment, under writeback it
> > temporarily becomes unmovable to not cause corruption.
>
> Thanks a lot for your quick reply Shakeel! (Actually very fast!).
>
> With that, it confirms what I wrote earlier - removing tmp and ignoring
> fuse writeback pages in migration should not make any difference
> regarding overall system performance. Unless I miss something,
> more on the contrary as additional memory pressure expensive page
> copying is being removed.
>

Thanks for the information Shakeel, and thanks Bernd for bringing up
this point of discussion.

Before I celebrate too prematurely, a few additional questions:

Are tmp pages (eg from folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0)) and
page cache pages allocated from the same memory pool? Or are tmp pages
allocated from a special memory pool that isn't meant to be
compacted/optimized?

If they are allocated from the same memory pool, then it seems like
there's no difference between tmp pages blocking a memory range from
being compacted vs. a page cache page blocking a memory range from
being compacted (by not clearing writeback). But if they are not
allocated from the same pool, then it seems like the page cache page
blocking migration could adversely affect general system performance
in a way that the tmp page doesn't?


Thanks,
Joanne
>
> Thanks,
> Bernd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-30 22:51                                                   ` Joanne Koong
@ 2024-10-31  0:30                                                     ` Shakeel Butt
  2024-10-31 19:06                                                       ` Joanne Koong
  0 siblings, 1 reply; 63+ messages in thread
From: Shakeel Butt @ 2024-10-31  0:30 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Bernd Schubert, Jingbo Xu, Miklos Szeredi, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team, Vlastimil Babka

On Wed, Oct 30, 2024 at 03:51:08PM GMT, Joanne Koong wrote:
> On Wed, Oct 30, 2024 at 3:17 PM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
> >
> >
> >
> > On 10/30/24 22:56, Shakeel Butt wrote:
> > > On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote:
> > >> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
> > >> <bernd.schubert@fastmail.fm> wrote:
> > >>>
> > >>>
> > >>> Hmm, if tmp pages can be compacted, isn't that a problem for splice?
> > >>> I.e. I don't understand what the difference between tmp page and
> > >>> write-back page for migration.
> > >>>
> > >>
> > >> That's a great question! I have no idea how compaction works for pages
> > >> being used in splice. Shakeel, do you know the answer to this?
> > >>
> > >
> > > Sorry for the late response. I still have to go through other unanswered
> > > questions but let me answer this one quickly. From the way the tmp pages
> > > are allocated, it does not seem like they are movable and thus are not
> > > target for migration/compaction.
> > >
> > > The page with the writeback bit set is actually just a user memory page
> > > cache which is moveable but due to, at the moment, under writeback it
> > > temporarily becomes unmovable to not cause corruption.
> >
> > Thanks a lot for your quick reply Shakeel! (Actually very fast!).
> >
> > With that, it confirms what I wrote earlier - removing tmp and ignoring
> > fuse writeback pages in migration should not make any difference
> > regarding overall system performance. Unless I miss something,
> > more on the contrary as additional memory pressure expensive page
> > copying is being removed.
> >
> 
> Thanks for the information Shakeel, and thanks Bernd for bringing up
> this point of discussion.
> 
> Before I celebrate too prematurely, a few additional questions:

You are asking hard questions, so CCed couple more folks to correct me
if I am wrong.

> 
> Are tmp pages (eg from folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0)) and
> page cache pages allocated from the same memory pool? Or are tmp pages
> allocated from a special memory pool that isn't meant to be
> compacted/optimized?

Memory pool is a bit confusing term here. Most probably you are asking
about the migrate type of the page block from which tmp page is
allocated from. In a normal system, tmp page would be allocated from page
block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
depends on what gfp flag was used for its allocation. What does fuse fs
use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
allocations can get mixed up with different migrate types.

> 
> If they are allocated from the same memory pool, then it seems like
> there's no difference between tmp pages blocking a memory range from
> being compacted vs. a page cache page blocking a memory range from
> being compacted (by not clearing writeback). But if they are not
> allocated from the same pool, then it seems like the page cache page
> blocking migration could adversely affect general system performance
> in a way that the tmp page doesn't?

I think irrespective of where the page is coming from, the page under
writeback is non-movable and can fragment the memory. The question that
is that worse than a tmp page fragmenting the memory, I am not sure.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-31  0:30                                                     ` Shakeel Butt
@ 2024-10-31 19:06                                                       ` Joanne Koong
  2024-10-31 20:06                                                         ` Shakeel Butt
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-10-31 19:06 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Bernd Schubert, Jingbo Xu, Miklos Szeredi, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team, Vlastimil Babka

On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Oct 30, 2024 at 03:51:08PM GMT, Joanne Koong wrote:
> > On Wed, Oct 30, 2024 at 3:17 PM Bernd Schubert
> > <bernd.schubert@fastmail.fm> wrote:
> > >
> > >
> > >
> > > On 10/30/24 22:56, Shakeel Butt wrote:
> > > > On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote:
> > > >> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
> > > >> <bernd.schubert@fastmail.fm> wrote:
> > > >>>
> > > >>>
> > > >>> Hmm, if tmp pages can be compacted, isn't that a problem for splice?
> > > >>> I.e. I don't understand what the difference between tmp page and
> > > >>> write-back page for migration.
> > > >>>
> > > >>
> > > >> That's a great question! I have no idea how compaction works for pages
> > > >> being used in splice. Shakeel, do you know the answer to this?
> > > >>
> > > >
> > > > Sorry for the late response. I still have to go through other unanswered
> > > > questions but let me answer this one quickly. From the way the tmp pages
> > > > are allocated, it does not seem like they are movable and thus are not
> > > > target for migration/compaction.
> > > >
> > > > The page with the writeback bit set is actually just a user memory page
> > > > cache which is moveable but due to, at the moment, under writeback it
> > > > temporarily becomes unmovable to not cause corruption.
> > >
> > > Thanks a lot for your quick reply Shakeel! (Actually very fast!).
> > >
> > > With that, it confirms what I wrote earlier - removing tmp and ignoring
> > > fuse writeback pages in migration should not make any difference
> > > regarding overall system performance. Unless I miss something,
> > > more on the contrary as additional memory pressure expensive page
> > > copying is being removed.
> > >
> >
> > Thanks for the information Shakeel, and thanks Bernd for bringing up
> > this point of discussion.
> >
> > Before I celebrate too prematurely, a few additional questions:
>
> You are asking hard questions, so CCed couple more folks to correct me
> if I am wrong.
>
> >
> > Are tmp pages (eg from folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0)) and
> > page cache pages allocated from the same memory pool? Or are tmp pages
> > allocated from a special memory pool that isn't meant to be
> > compacted/optimized?
>
> Memory pool is a bit confusing term here. Most probably you are asking
> about the migrate type of the page block from which tmp page is
> allocated from. In a normal system, tmp page would be allocated from page
> block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> depends on what gfp flag was used for its allocation. What does fuse fs
> use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> allocations can get mixed up with different migrate types.
>

I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
fuse doesn't set any additional gfp masks on the inode mapping.

Could we just allocate the fuse writeback pages with GFP_HIGHUSER
instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
where we pass in the gfp mask to __filemap_get_folio(). I think this
would give us the same behavior memory-wise as what the tmp pages
currently do, and would solve all our headaches regarding writeback
potentially blocking page migration/compaction.



Thanks,
Joanne

> >
> > If they are allocated from the same memory pool, then it seems like
> > there's no difference between tmp pages blocking a memory range from
> > being compacted vs. a page cache page blocking a memory range from
> > being compacted (by not clearing writeback). But if they are not
> > allocated from the same pool, then it seems like the page cache page
> > blocking migration could adversely affect general system performance
> > in a way that the tmp page doesn't?
>
> I think irrespective of where the page is coming from, the page under
> writeback is non-movable and can fragment the memory. The question that
> is that worse than a tmp page fragmenting the memory, I am not sure.
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-31 19:06                                                       ` Joanne Koong
@ 2024-10-31 20:06                                                         ` Shakeel Butt
  2024-10-31 21:52                                                           ` Joanne Koong
  0 siblings, 1 reply; 63+ messages in thread
From: Shakeel Butt @ 2024-10-31 20:06 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Bernd Schubert, Jingbo Xu, Miklos Szeredi, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team, Vlastimil Babka

On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
[...]
> >
> > Memory pool is a bit confusing term here. Most probably you are asking
> > about the migrate type of the page block from which tmp page is
> > allocated from. In a normal system, tmp page would be allocated from page
> > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> > depends on what gfp flag was used for its allocation. What does fuse fs
> > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> > allocations can get mixed up with different migrate types.
> >
> 
> I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> fuse doesn't set any additional gfp masks on the inode mapping.
> 
> Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> where we pass in the gfp mask to __filemap_get_folio(). I think this
> would give us the same behavior memory-wise as what the tmp pages
> currently do,

I don't think it would be the same behavior. From what I understand the
liftime of the tmp page is from the start of the writeback till the ack
from the fuse server that writeback is done. While the lifetime of the
page of the page cache can be arbitrarily large. We should just make it
unmovable for its lifetime. I think it is fine to make the page
unmovable during the writeback. We should not try to optimize for the
bad or buggy behavior of fuse server.

Regarding the avoidance of wait on writeback for fuse folios, I think we
can handle the migration similar to how you are handling reclaim and in
addition we can add a WARN() in folio_wait_writeback() if the kernel ever
sees a fuse folio in that function.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-31 20:06                                                         ` Shakeel Butt
@ 2024-10-31 21:52                                                           ` Joanne Koong
  2024-10-31 22:38                                                             ` Shakeel Butt
  2024-11-01 11:44                                                             ` Jingbo Xu
  0 siblings, 2 replies; 63+ messages in thread
From: Joanne Koong @ 2024-10-31 21:52 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Bernd Schubert, Jingbo Xu, Miklos Szeredi, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team, Vlastimil Babka

On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> [...]
> > >
> > > Memory pool is a bit confusing term here. Most probably you are asking
> > > about the migrate type of the page block from which tmp page is
> > > allocated from. In a normal system, tmp page would be allocated from page
> > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> > > depends on what gfp flag was used for its allocation. What does fuse fs
> > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> > > allocations can get mixed up with different migrate types.
> > >
> >
> > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> > fuse doesn't set any additional gfp masks on the inode mapping.
> >
> > Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> > where we pass in the gfp mask to __filemap_get_folio(). I think this
> > would give us the same behavior memory-wise as what the tmp pages
> > currently do,
>
> I don't think it would be the same behavior. From what I understand the
> liftime of the tmp page is from the start of the writeback till the ack
> from the fuse server that writeback is done. While the lifetime of the
> page of the page cache can be arbitrarily large. We should just make it
> unmovable for its lifetime. I think it is fine to make the page
> unmovable during the writeback. We should not try to optimize for the
> bad or buggy behavior of fuse server.
>
> Regarding the avoidance of wait on writeback for fuse folios, I think we
> can handle the migration similar to how you are handling reclaim and in
> addition we can add a WARN() in folio_wait_writeback() if the kernel ever
> sees a fuse folio in that function.

Awesome, this is what I'm planning to do in v3 to address migration then:

1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)

2) in the fuse filesystem's implementation of the
mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
under writeback.

Does this sound good?


Thanks,
Joanne


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-31 21:52                                                           ` Joanne Koong
@ 2024-10-31 22:38                                                             ` Shakeel Butt
  2024-11-06 23:37                                                               ` Joanne Koong
  2024-11-01 11:44                                                             ` Jingbo Xu
  1 sibling, 1 reply; 63+ messages in thread
From: Shakeel Butt @ 2024-10-31 22:38 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Bernd Schubert, Jingbo Xu, Miklos Szeredi, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team, Vlastimil Babka

On Thu, Oct 31, 2024 at 02:52:57PM GMT, Joanne Koong wrote:
> On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> > > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > [...]
> > > >
> > > > Memory pool is a bit confusing term here. Most probably you are asking
> > > > about the migrate type of the page block from which tmp page is
> > > > allocated from. In a normal system, tmp page would be allocated from page
> > > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> > > > depends on what gfp flag was used for its allocation. What does fuse fs
> > > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> > > > allocations can get mixed up with different migrate types.
> > > >
> > >
> > > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> > > fuse doesn't set any additional gfp masks on the inode mapping.
> > >
> > > Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> > > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> > > where we pass in the gfp mask to __filemap_get_folio(). I think this
> > > would give us the same behavior memory-wise as what the tmp pages
> > > currently do,
> >
> > I don't think it would be the same behavior. From what I understand the
> > liftime of the tmp page is from the start of the writeback till the ack
> > from the fuse server that writeback is done. While the lifetime of the
> > page of the page cache can be arbitrarily large. We should just make it
> > unmovable for its lifetime. I think it is fine to make the page
> > unmovable during the writeback. We should not try to optimize for the
> > bad or buggy behavior of fuse server.
> >
> > Regarding the avoidance of wait on writeback for fuse folios, I think we
> > can handle the migration similar to how you are handling reclaim and in
> > addition we can add a WARN() in folio_wait_writeback() if the kernel ever
> > sees a fuse folio in that function.
> 
> Awesome, this is what I'm planning to do in v3 to address migration then:
> 
> 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
> src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
> fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)
> 
> 2) in the fuse filesystem's implementation of the
> mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
> under writeback.

3) Add WARN_ONCE() in folio_wait_writeback() if folio->mapping has
AS_NO_WRITEBACK_WAIT set and return without waiting.

> 
> Does this sound good?

Yes.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-31 21:52                                                           ` Joanne Koong
  2024-10-31 22:38                                                             ` Shakeel Butt
@ 2024-11-01 11:44                                                             ` Jingbo Xu
  2024-11-01 20:54                                                               ` Joanne Koong
  1 sibling, 1 reply; 63+ messages in thread
From: Jingbo Xu @ 2024-11-01 11:44 UTC (permalink / raw)
  To: Joanne Koong, Shakeel Butt
  Cc: Bernd Schubert, Miklos Szeredi, linux-fsdevel, josef, hannes,
	linux-mm, kernel-team, Vlastimil Babka

Hi Joanne,

Thanks for keeping pushing this forward.

On 11/1/24 5:52 AM, Joanne Koong wrote:
> On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
>>> On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>> [...]
>>>>
>>>> Memory pool is a bit confusing term here. Most probably you are asking
>>>> about the migrate type of the page block from which tmp page is
>>>> allocated from. In a normal system, tmp page would be allocated from page
>>>> block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
>>>> depends on what gfp flag was used for its allocation. What does fuse fs
>>>> use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
>>>> allocations can get mixed up with different migrate types.
>>>>
>>>
>>> I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
>>> fuse doesn't set any additional gfp masks on the inode mapping.
>>>
>>> Could we just allocate the fuse writeback pages with GFP_HIGHUSER
>>> instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
>>> where we pass in the gfp mask to __filemap_get_folio(). I think this
>>> would give us the same behavior memory-wise as what the tmp pages
>>> currently do,
>>
>> I don't think it would be the same behavior. From what I understand the
>> liftime of the tmp page is from the start of the writeback till the ack
>> from the fuse server that writeback is done. While the lifetime of the
>> page of the page cache can be arbitrarily large. We should just make it
>> unmovable for its lifetime. I think it is fine to make the page
>> unmovable during the writeback. We should not try to optimize for the
>> bad or buggy behavior of fuse server.
>>
>> Regarding the avoidance of wait on writeback for fuse folios, I think we
>> can handle the migration similar to how you are handling reclaim and in
>> addition we can add a WARN() in folio_wait_writeback() if the kernel ever
>> sees a fuse folio in that function.
> 
> Awesome, this is what I'm planning to do in v3 to address migration then:
> 
> 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
> src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
> fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)

I think it's generally okay to skip FUSE pages under writeback when the
sync migrate_pages() is called in low memory context, which only tries
to migrate as many pages as possible (i.e. best effort).

While more caution may be needed when the sync migrate_pages() is called
with an implicit hint that the migration can not fail.  For example,

```
offline_pages
	while {
		scan_movable_pages
		do_migrate_range
	}
```

If the malicious server never completes the writeback IO, no progress
will be made in the above while loop, and I'm afraid it will be a dead
loop then.


> 
> 2) in the fuse filesystem's implementation of the
> mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
> under writeback.

Is there any possibility that a_ops->migrate_folio() may be called with
the folio under writeback?

- for most pages without AS_NO_WRITEBACK_WAIT, a_ops->migrate_folio()
will be called only when Page_writeback is cleared;
- for AS_NO_WRITEBACK_WAIT pages, they are skipped if they are under
writeback

-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-11-01 11:44                                                             ` Jingbo Xu
@ 2024-11-01 20:54                                                               ` Joanne Koong
  2024-11-04  8:09                                                                 ` Jingbo Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-11-01 20:54 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Shakeel Butt, Bernd Schubert, Miklos Szeredi, linux-fsdevel,
	josef, hannes, linux-mm, kernel-team, Vlastimil Babka

On Fri, Nov 1, 2024 at 4:44 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi Joanne,
>
> Thanks for keeping pushing this forward.
>
> On 11/1/24 5:52 AM, Joanne Koong wrote:
> > On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >>
> >> On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> >>> On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >> [...]
> >>>>
> >>>> Memory pool is a bit confusing term here. Most probably you are asking
> >>>> about the migrate type of the page block from which tmp page is
> >>>> allocated from. In a normal system, tmp page would be allocated from page
> >>>> block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> >>>> depends on what gfp flag was used for its allocation. What does fuse fs
> >>>> use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> >>>> allocations can get mixed up with different migrate types.
> >>>>
> >>>
> >>> I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> >>> fuse doesn't set any additional gfp masks on the inode mapping.
> >>>
> >>> Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> >>> instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> >>> where we pass in the gfp mask to __filemap_get_folio(). I think this
> >>> would give us the same behavior memory-wise as what the tmp pages
> >>> currently do,
> >>
> >> I don't think it would be the same behavior. From what I understand the
> >> liftime of the tmp page is from the start of the writeback till the ack
> >> from the fuse server that writeback is done. While the lifetime of the
> >> page of the page cache can be arbitrarily large. We should just make it
> >> unmovable for its lifetime. I think it is fine to make the page
> >> unmovable during the writeback. We should not try to optimize for the
> >> bad or buggy behavior of fuse server.
> >>
> >> Regarding the avoidance of wait on writeback for fuse folios, I think we
> >> can handle the migration similar to how you are handling reclaim and in
> >> addition we can add a WARN() in folio_wait_writeback() if the kernel ever
> >> sees a fuse folio in that function.
> >
> > Awesome, this is what I'm planning to do in v3 to address migration then:
> >
> > 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
> > src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
> > fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)
>
> I think it's generally okay to skip FUSE pages under writeback when the
> sync migrate_pages() is called in low memory context, which only tries
> to migrate as many pages as possible (i.e. best effort).
>
> While more caution may be needed when the sync migrate_pages() is called
> with an implicit hint that the migration can not fail.  For example,
>
> ```
> offline_pages
>         while {
>                 scan_movable_pages
>                 do_migrate_range
>         }
> ```
>
> If the malicious server never completes the writeback IO, no progress
> will be made in the above while loop, and I'm afraid it will be a dead
> loop then.
>

Thanks for taking a look and sharing your thoughts.
I agree. I think for this offline_pages() path, we need to handle this
"TODO: fatal migration failures should bail out". For v3 I'm thinking
of handling this by having some number of retries where we try
do_migrate_range() but if it still doesn't succeed, to skip those
pages and move onto the next.

>
> >
> > 2) in the fuse filesystem's implementation of the
> > mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
> > under writeback.
>
> Is there any possibility that a_ops->migrate_folio() may be called with
> the folio under writeback?
>
> - for most pages without AS_NO_WRITEBACK_WAIT, a_ops->migrate_folio()
> will be called only when Page_writeback is cleared;
> - for AS_NO_WRITEBACK_WAIT pages, they are skipped if they are under
> writeback
>

For AS_NO_WRITEBACK_WAIT_PAGES, if we skip waiting on them if they are
under writeback, I think the a_ops->migrate_folio() will still get
called (by migrate_pages_batch() -> migrate_folio_move() ->
move_to_new_folio()).

Looking at migrate_folio_unmap() some more though,  I don't think we
can just skip the wait call like we can for the sync(2) case. I think
we need to error out here instead since after the wait call,
migrate_folio_unmap() will replace the folio's page table mappings
(try_to_migrate()). If we error out here, then there's no hitting
a_ops->migrate_folio() when the folio is under writeback.


Thanks,
Joanne
> --
> Thanks,
> Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-11-01 20:54                                                               ` Joanne Koong
@ 2024-11-04  8:09                                                                 ` Jingbo Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Jingbo Xu @ 2024-11-04  8:09 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Shakeel Butt, Bernd Schubert, Miklos Szeredi, linux-fsdevel,
	josef, hannes, linux-mm, kernel-team, Vlastimil Babka



On 11/2/24 4:54 AM, Joanne Koong wrote:
> On Fri, Nov 1, 2024 at 4:44 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi Joanne,
>>
>> Thanks for keeping pushing this forward.
>>
>> On 11/1/24 5:52 AM, Joanne Koong wrote:
>>> On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>>
>>>> On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
>>>>> On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>> [...]
>>>>>>
>>>>>> Memory pool is a bit confusing term here. Most probably you are asking
>>>>>> about the migrate type of the page block from which tmp page is
>>>>>> allocated from. In a normal system, tmp page would be allocated from page
>>>>>> block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
>>>>>> depends on what gfp flag was used for its allocation. What does fuse fs
>>>>>> use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
>>>>>> allocations can get mixed up with different migrate types.
>>>>>>
>>>>>
>>>>> I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
>>>>> fuse doesn't set any additional gfp masks on the inode mapping.
>>>>>
>>>>> Could we just allocate the fuse writeback pages with GFP_HIGHUSER
>>>>> instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
>>>>> where we pass in the gfp mask to __filemap_get_folio(). I think this
>>>>> would give us the same behavior memory-wise as what the tmp pages
>>>>> currently do,
>>>>
>>>> I don't think it would be the same behavior. From what I understand the
>>>> liftime of the tmp page is from the start of the writeback till the ack
>>>> from the fuse server that writeback is done. While the lifetime of the
>>>> page of the page cache can be arbitrarily large. We should just make it
>>>> unmovable for its lifetime. I think it is fine to make the page
>>>> unmovable during the writeback. We should not try to optimize for the
>>>> bad or buggy behavior of fuse server.
>>>>
>>>> Regarding the avoidance of wait on writeback for fuse folios, I think we
>>>> can handle the migration similar to how you are handling reclaim and in
>>>> addition we can add a WARN() in folio_wait_writeback() if the kernel ever
>>>> sees a fuse folio in that function.
>>>
>>> Awesome, this is what I'm planning to do in v3 to address migration then:
>>>
>>> 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
>>> src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
>>> fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)
>>
>> I think it's generally okay to skip FUSE pages under writeback when the
>> sync migrate_pages() is called in low memory context, which only tries
>> to migrate as many pages as possible (i.e. best effort).
>>
>> While more caution may be needed when the sync migrate_pages() is called
>> with an implicit hint that the migration can not fail.  For example,
>>
>> ```
>> offline_pages
>>         while {
>>                 scan_movable_pages
>>                 do_migrate_range
>>         }
>> ```
>>
>> If the malicious server never completes the writeback IO, no progress
>> will be made in the above while loop, and I'm afraid it will be a dead
>> loop then.
>>
> 
> Thanks for taking a look and sharing your thoughts.
> I agree. I think for this offline_pages() path, we need to handle this
> "TODO: fatal migration failures should bail out". For v3 I'm thinking
> of handling this by having some number of retries where we try
> do_migrate_range() but if it still doesn't succeed, to skip those
> pages and move onto the next.
> 
>>
>>>
>>> 2) in the fuse filesystem's implementation of the
>>> mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
>>> under writeback.
>>
>> Is there any possibility that a_ops->migrate_folio() may be called with
>> the folio under writeback?
>>
>> - for most pages without AS_NO_WRITEBACK_WAIT, a_ops->migrate_folio()
>> will be called only when Page_writeback is cleared;
>> - for AS_NO_WRITEBACK_WAIT pages, they are skipped if they are under
>> writeback
>>
> 
> For AS_NO_WRITEBACK_WAIT_PAGES, if we skip waiting on them if they are
> under writeback, I think the a_ops->migrate_folio() will still get
> called (by migrate_pages_batch() -> migrate_folio_move() ->
> move_to_new_folio()).
> 
> Looking at migrate_folio_unmap() some more though,  I don't think we
> can just skip the wait call like we can for the sync(2) case. I think
> we need to error out here instead since after the wait call,
> migrate_folio_unmap() will replace the folio's page table mappings
> (try_to_migrate()). If we error out here, then there's no hitting
> a_ops->migrate_folio() when the folio is under writeback.
> 

Right, we need to bail out to skip this page (under writeback), just
like how MIGRATE_SYNC does.

-- 
Thanks,
Jingbo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-10-31 22:38                                                             ` Shakeel Butt
@ 2024-11-06 23:37                                                               ` Joanne Koong
  2024-11-06 23:56                                                                 ` Shakeel Butt
  0 siblings, 1 reply; 63+ messages in thread
From: Joanne Koong @ 2024-11-06 23:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Bernd Schubert, Jingbo Xu, Miklos Szeredi, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team, Vlastimil Babka

On Thu, Oct 31, 2024 at 3:38 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Oct 31, 2024 at 02:52:57PM GMT, Joanne Koong wrote:
> > On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> > > > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > [...]
> > > > >
> > > > > Memory pool is a bit confusing term here. Most probably you are asking
> > > > > about the migrate type of the page block from which tmp page is
> > > > > allocated from. In a normal system, tmp page would be allocated from page
> > > > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> > > > > depends on what gfp flag was used for its allocation. What does fuse fs
> > > > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> > > > > allocations can get mixed up with different migrate types.
> > > > >
> > > >
> > > > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> > > > fuse doesn't set any additional gfp masks on the inode mapping.
> > > >
> > > > Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> > > > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> > > > where we pass in the gfp mask to __filemap_get_folio(). I think this
> > > > would give us the same behavior memory-wise as what the tmp pages
> > > > currently do,
> > >
> > > I don't think it would be the same behavior. From what I understand the
> > > liftime of the tmp page is from the start of the writeback till the ack
> > > from the fuse server that writeback is done. While the lifetime of the
> > > page of the page cache can be arbitrarily large. We should just make it
> > > unmovable for its lifetime. I think it is fine to make the page
> > > unmovable during the writeback. We should not try to optimize for the
> > > bad or buggy behavior of fuse server.
> > >
> > > Regarding the avoidance of wait on writeback for fuse folios, I think we
> > > can handle the migration similar to how you are handling reclaim and in
> > > addition we can add a WARN() in folio_wait_writeback() if the kernel ever
> > > sees a fuse folio in that function.
> >
> > Awesome, this is what I'm planning to do in v3 to address migration then:
> >
> > 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
> > src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
> > fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)
> >
> > 2) in the fuse filesystem's implementation of the
> > mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
> > under writeback.
>
> 3) Add WARN_ONCE() in folio_wait_writeback() if folio->mapping has
> AS_NO_WRITEBACK_WAIT set and return without waiting.

For v3, I'm going to change AS_NO_WRITEBACK_RECLAIM to
AS_WRITEBACK_MAY_BLOCK and skip 3) because 3) may be too restrictive.
For example, for the sync_file_range() syscall, we do want to wait on
writeback - it's ok in this case to call folio_wait_writeback() on a
fuse folio since the caller would have intentionally passed in a fuse
fd to sync_file_range().


Thanks,
Joanne

>
> >
> > Does this sound good?
>
> Yes.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
  2024-11-06 23:37                                                               ` Joanne Koong
@ 2024-11-06 23:56                                                                 ` Shakeel Butt
  0 siblings, 0 replies; 63+ messages in thread
From: Shakeel Butt @ 2024-11-06 23:56 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Bernd Schubert, Jingbo Xu, Miklos Szeredi, linux-fsdevel, josef,
	hannes, linux-mm, kernel-team, Vlastimil Babka

On Wed, Nov 06, 2024 at 03:37:11PM -0800, Joanne Koong wrote:
> On Thu, Oct 31, 2024 at 3:38 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Oct 31, 2024 at 02:52:57PM GMT, Joanne Koong wrote:
> > > On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> > > > > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > > [...]
> > > > > >
> > > > > > Memory pool is a bit confusing term here. Most probably you are asking
> > > > > > about the migrate type of the page block from which tmp page is
> > > > > > allocated from. In a normal system, tmp page would be allocated from page
> > > > > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> > > > > > depends on what gfp flag was used for its allocation. What does fuse fs
> > > > > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> > > > > > allocations can get mixed up with different migrate types.
> > > > > >
> > > > >
> > > > > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> > > > > fuse doesn't set any additional gfp masks on the inode mapping.
> > > > >
> > > > > Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> > > > > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> > > > > where we pass in the gfp mask to __filemap_get_folio(). I think this
> > > > > would give us the same behavior memory-wise as what the tmp pages
> > > > > currently do,
> > > >
> > > > I don't think it would be the same behavior. From what I understand the
> > > > liftime of the tmp page is from the start of the writeback till the ack
> > > > from the fuse server that writeback is done. While the lifetime of the
> > > > page of the page cache can be arbitrarily large. We should just make it
> > > > unmovable for its lifetime. I think it is fine to make the page
> > > > unmovable during the writeback. We should not try to optimize for the
> > > > bad or buggy behavior of fuse server.
> > > >
> > > > Regarding the avoidance of wait on writeback for fuse folios, I think we
> > > > can handle the migration similar to how you are handling reclaim and in
> > > > addition we can add a WARN() in folio_wait_writeback() if the kernel ever
> > > > sees a fuse folio in that function.
> > >
> > > Awesome, this is what I'm planning to do in v3 to address migration then:
> > >
> > > 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
> > > src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
> > > fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)
> > >
> > > 2) in the fuse filesystem's implementation of the
> > > mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
> > > under writeback.
> >
> > 3) Add WARN_ONCE() in folio_wait_writeback() if folio->mapping has
> > AS_NO_WRITEBACK_WAIT set and return without waiting.
> 
> For v3, I'm going to change AS_NO_WRITEBACK_RECLAIM to
> AS_WRITEBACK_MAY_BLOCK and skip 3) because 3) may be too restrictive.
> For example, for the sync_file_range() syscall, we do want to wait on
> writeback - it's ok in this case to call folio_wait_writeback() on a
> fuse folio since the caller would have intentionally passed in a fuse
> fd to sync_file_range().
> 

Sounds good.


^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2024-11-06 23:56 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-14 18:22 [PATCH v2 0/2] fuse: remove extra page copies in writeback Joanne Koong
2024-10-14 18:22 ` [PATCH v2 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock Joanne Koong
2024-10-14 18:38   ` Shakeel Butt
2024-10-14 21:04     ` Joanne Koong
2024-10-14 23:57       ` Shakeel Butt
2024-10-15 16:59         ` Joanne Koong
2024-10-14 18:22 ` [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2024-10-15 10:01   ` Miklos Szeredi
2024-10-15 17:06     ` Joanne Koong
2024-10-15 19:17       ` Shakeel Butt
2024-10-16  9:44         ` Jingbo Xu
2024-10-16  9:57           ` Miklos Szeredi
2024-10-16  9:51         ` Miklos Szeredi
2024-10-16 17:52           ` Shakeel Butt
2024-10-16 18:37             ` Miklos Szeredi
2024-10-16 21:27               ` Shakeel Butt
2024-10-17 13:31                 ` Miklos Szeredi
2024-10-18  5:31                   ` Shakeel Butt
2024-10-21 10:15                     ` Miklos Szeredi
2024-10-21 17:01                       ` Shakeel Butt
2024-10-22 15:03                         ` Miklos Szeredi
2024-10-21 21:05                       ` Joanne Koong
2024-10-24 16:54                         ` Joanne Koong
2024-10-25  1:38                           ` Jingbo Xu
2024-10-25 15:32                             ` Miklos Szeredi
2024-10-25 17:36                             ` Joanne Koong
2024-10-25 18:02                               ` Miklos Szeredi
2024-10-25 18:19                                 ` Joanne Koong
2024-10-28  2:02                                   ` Jingbo Xu
2024-10-25 18:47                               ` Joanne Koong
2024-10-28  2:28                                 ` Jingbo Xu
2024-10-28 21:57                                   ` Joanne Koong
2024-10-25 22:40                               ` Joanne Koong
2024-10-28 21:58                                 ` Joanne Koong
2024-10-30  9:32                                   ` Bernd Schubert
2024-10-30 16:04                                     ` Joanne Koong
2024-10-30 16:21                                       ` Bernd Schubert
2024-10-30 17:02                                         ` Joanne Koong
2024-10-30 17:27                                           ` Bernd Schubert
2024-10-30 17:35                                             ` Joanne Koong
2024-10-30 21:56                                               ` Shakeel Butt
2024-10-30 22:17                                                 ` Bernd Schubert
2024-10-30 22:51                                                   ` Joanne Koong
2024-10-31  0:30                                                     ` Shakeel Butt
2024-10-31 19:06                                                       ` Joanne Koong
2024-10-31 20:06                                                         ` Shakeel Butt
2024-10-31 21:52                                                           ` Joanne Koong
2024-10-31 22:38                                                             ` Shakeel Butt
2024-11-06 23:37                                                               ` Joanne Koong
2024-11-06 23:56                                                                 ` Shakeel Butt
2024-11-01 11:44                                                             ` Jingbo Xu
2024-11-01 20:54                                                               ` Joanne Koong
2024-11-04  8:09                                                                 ` Jingbo Xu
2024-10-29 22:04                   ` Bernd Schubert
2024-10-16  9:56     ` Jingbo Xu
2024-10-16 10:00       ` Miklos Szeredi
2024-10-18  1:30     ` Joanne Koong
2024-10-18  5:57       ` Shakeel Butt
2024-10-18 19:57         ` Joanne Koong
2024-10-18 20:46           ` Shakeel Butt
2024-10-21  9:32       ` Miklos Szeredi
2024-10-18  9:24   ` Jingbo Xu
2024-10-18 20:29     ` Joanne Koong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).